feat(parser): Filter out URLs before sending to pelias/model #225
We have had numerous reports from Pelias users about concerning error messages during builds, related to the URL regex filter from pelias/model#115.
While this filter is good, the resulting error message is alarming. Looking today at the output of a planet build, it appears that many of these errors come from the polylines file created by Valhalla out of the OSM street network.
Looking at the contents of the polylines file and the corresponding record on OSM, it seems that Valhalla puts the contents of the `ref` tag in the polylines file as an alternate name. The `ref` tag will often contain a URL.
This means that not only will the error happen frequently, but many records that are actually valid will be filtered out.
An example of this is the Iowa Women of Achievement bridge, which is completely valid in terms of name, geometry, and tagging but contains a URL in the `ref` field. However, the resulting line in the `polylines` file contains a URL as one of its names.
The polylines importer currently selects a single name value from the list of names in the polylines file by choosing the longest, which will often be a URL.
This PR adds an additional filter that first removes any URL-like values from consideration. This should completely eliminate the otherwise concerning errors while ensuring all valid records make it into Elasticsearch.
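As a rough illustration of the idea (a minimal sketch, not the code in this PR): the function names `isUrlLike`, `selectLongest`, and `selectName` are hypothetical, the `URL_LIKE` pattern is a simplified stand-in rather than the regex used by pelias/model, and the sample names are made up.

```javascript
// Simplified, assumed pattern for "URL-like" values; the real filter in
// pelias/model may match differently.
const URL_LIKE = /^(https?:\/\/|www\.)/i;

// Hypothetical helper: true when a name looks like a URL.
function isUrlLike(name) {
  return URL_LIKE.test(name.trim());
}

// Behaviour described above: choose the longest name from the list.
function selectLongest(names) {
  return names.reduce(
    (longest, name) => (name.length > longest.length ? name : longest),
    ''
  );
}

// Sketch of the change: drop URL-like values first, then choose the longest.
// Returns an empty string when no usable name remains.
function selectName(names) {
  return selectLongest(names.filter((name) => !isUrlLike(name)));
}

// Made-up sample data standing in for the names on one polylines row.
const sampleNames = ['Example Bridge', 'https://example.com/some/long/path'];

console.log(selectLongest(sampleNames)); // the URL wins on length alone
console.log(selectName(sampleNames));    // the real street name is kept
```

Filtering before the length comparison means a record is only left without a name when every candidate is URL-like, instead of being rejected later by the model.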
Fixes pelias/whosonfirst#456
Fixes #216
Fixes pelias/docker#89
Connects pelias/model#116