northamerica build and planet build result in different document schema for the same source_id #452
Comments
That's interesting. The schema hasn't changed any time recently, and the population of fields shouldn't change with the volume of data. What I suspect is that you're importing different versions of WOF data between the builds and these differences account for the changes. It's also possible that the code has changed between builds, but I looked and couldn't see anything which seemed related (we're working on something right now but it isn't merged yet). Finally, it could be that your configurations are different: maybe you're running different versions of the docker containers or using different settings in pelias.json?
When you say performance, are you referring to latency (CPU performance) or result quality?
In the future, can you please paste your JSON blobs as pretty-printed JSON? We're volunteering our time and it's very difficult to read a massive blob of text.
Update: Well, I learned something new about GitHub's markup :D All the JSON blobs are pretty-printed now. Sorry again for all the trouble you went through reading those scary lines of JSON. Never gonna happen again!
@missinglink Thanks to your helpful reply, I investigated this issue further and I believe I might have found a bug. I will describe my test plan in detail so that we can figure this out:

1- Created two Amazon Elasticsearch (AES) instances and modified the pelias.json file inside planet and north-america to point to these two instances (one for each). Let's call them AES-planet and AES-na.

"whosonfirst": {
"datapath": "/data/whosonfirst",
"importVenues": false,
"importPostalcodes": true
}

and the same configuration for north america looks like this:

"whosonfirst": {
"datapath": "/data/whosonfirst",
"importPostalcodes": true,
"importPlace": "102191575"
}

So far this makes sense: for north america we only download a portion of the whole WOF data, hence the importPlace setting. I also checked the codebase, wondering about importVenues, which is false for planet but not specified for north america, and found that if it is not specified, the default value is false.

Now, things start to get interesting.

4- Did the same steps for north america and got 1,371,851 documents indexed in AES-na. This time, I accessed the Kibana interface for AES-na, added the same filters, and searched for Fairfax, and boom, about 500 documents were returned as the result of the search.

Some differences I noticed in the downloaded files for wof using planet vs. north america.
while in the logs for the planet's attempt at downloading wof, I saw no reference to whosonfirst-sqlite. Here's the first few lines of the logs:
I hope these details help to figure out what is going on. It could simply be me forgetting a step for the planet build, or it could be an existing bug. I also want to add that the performance (accuracy in geocoding addresses) of our north america build for the same 200 addresses is around 90%, which is amazing, but due to the problem mentioned here the performance of our planet build is around 60%.
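For anyone wanting to repeat the configuration check, the two whosonfirst blocks quoted in this comment can be compared mechanically. This is only a minimal sketch in plain Python: the `diff_settings` helper and the `UNSET` marker are made up for illustration and are not part of Pelias. It also does not apply any importer defaults, so `importVenues` shows up as a difference even though, as noted above, an unset value appears to default to false.

```python
# Hypothetical helper: report keys whose values differ between two config blocks.
UNSET = "<unset>"  # illustrative marker for a key that is absent from one block


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    differences = {}
    for key in sorted(set(a) | set(b)):
        left = a.get(key, UNSET)
        right = b.get(key, UNSET)
        if left != right:
            differences[key] = (left, right)
    return differences


# The two "whosonfirst" blocks exactly as quoted in this thread.
planet_wof = {
    "datapath": "/data/whosonfirst",
    "importVenues": False,
    "importPostalcodes": True,
}
na_wof = {
    "datapath": "/data/whosonfirst",
    "importPostalcodes": True,
    "importPlace": "102191575",
}

config_diff = diff_settings(planet_wof, na_wof)
for key, (planet_value, na_value) in config_diff.items():
    print(f"{key}: planet={planet_value!r} north-america={na_value!r}")
```

Only `importPlace` and `importVenues` are reported, which matches the manual comparison described above.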
Hi folks,
Hey team!
For testing purposes, we decided to build a north america version of Pelias to be able to geocode US addresses only and we got rrrreally rrrrreally good performance. But then we have a planet build as well, and we tried to run the same addresses through our planet build and this time the performance was not good at all. Not even close to what we got from the north america build.
We were curious about what could cause this degradation in the performance of our planet build, so we decided to query the Elasticsearch index directly. We queried the same source_ids through Kibana on both Elasticsearch instances and noticed that the north america document has more fields than the corresponding document in the planet build. The fields missing from the planet build are:
parent.county, parent.county_a, parent.county_id, parent.locality, parent.locality_a, and parent.locality_id
Because these fields are missing from the planet index, an address that can be geocoded in our north america build returns a less accurate result in our planet build (accurate only up to the city level).
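A comparison like this can be scripted once both documents have been exported from Kibana as JSON. The following is a rough sketch in plain Python, not anything Pelias ships: `flatten_fields` and `missing_fields` are hypothetical helper names, and the two documents are hand-written placeholders that only mimic the shape of the field names listed above, not real index contents.

```python
def flatten_fields(doc, prefix=""):
    """Collect dotted field paths (e.g. 'parent.county') from a nested dict."""
    fields = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields |= flatten_fields(value, path)
        else:
            fields.add(path)
    return fields


def missing_fields(reference, candidate):
    """Field paths present in `reference` but absent from `candidate`."""
    return sorted(flatten_fields(reference) - flatten_fields(candidate))


# Placeholder documents: the values are dummies, only the field names matter.
na_doc = {
    "source_id": "placeholder",
    "parent": {
        "region": ["placeholder"],
        "county": ["placeholder"],
        "county_a": ["placeholder"],
        "county_id": ["placeholder"],
        "locality": ["placeholder"],
        "locality_a": ["placeholder"],
        "locality_id": ["placeholder"],
    },
}
planet_doc = {
    "source_id": "placeholder",
    "parent": {"region": ["placeholder"]},
}

for field in missing_fields(na_doc, planet_doc):
    print(field)
```

Running it over the `_source` of the same source_id from each index would list exactly which parent fields one build populated and the other did not.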
I am wondering why the same build process would cause the schemas of the two builds to differ significantly. Another thing we tried was to query the exact address against your API provided through geocode.earth, and, quite interestingly, it returned the exact same response that we got from our own planet build, not an exact match.
For more clarity, I'm going to add example addresses along with the json responses that I get from our north america build and the planet build:
Address: OF WASHINGTON DC 11901 BRADDOCK RD FAIRFAX,VA 22030
north america build's response:
planet build's response:
but for this address: 4000 MERIDIAN BLVD STE 750, FRANKLIN TN 37067
our planet build has all those parent fields that were missing from the previous response. Here's the response for this address from the planet's build: