Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not import both whosonfirst-data-latest.db and whosonfirst-data-admin-$$-latest.db #469

Closed
slvlirnoff opened this issue Sep 18, 2019 · 10 comments · Fixed by #487
Closed

Comments

@slvlirnoff
Copy link

From this page it seems that https://dist.whosonfirst.org/sqlite/ whosonfirst-data-latest.db hasn't been updated since May while many whosonfirst-data-admin-$$-latest.db have been updated several times.

Is there some information within whosonfirst-data-latest.db that is not within all of whosonfirst-data-admin-$$-latest.db?

From https://github.com/whosonfirst-data/whosonfirst-data repository README:

Disclaimer
As of May 2019, the whosonfirst-data repository has split into per-country repositories. You can read more about that change here. While we still track all issues in this repository, the data itself will live in the per-country repositories for the foreseeable future.

Per-country repositories have the following repository naming convention:

whosonfirst-data-admin-{2-char country code}

With pelias/docker and for the whole planet using sqlite: true the scripts imports both data-latest and the various data-admin per country (using inventory.json).

This seems to cause an issue for Switzerland for instance: many layers have been fixed recently (localities/localadmin whosonfirst-data/whosonfirst-data-admin-ch#4), since the importation order of the file is:

  • whosonfirst-data-admin-$$-latest.db (includes the fix)
  • whosonfirst-data-latest.db. (older records, don't include the fix)

The old records are replacing the updated ones since they mostly share the same whosonfirst ids.

Related: #437
Also probably related: pelias/pelias#811

Best,
Cyprien

@orangejulius
Copy link
Member

orangejulius commented Sep 18, 2019

Hi @slvlirnoff,
Thanks for diving in and investigating. There's definitely a bug in the Pelias code here. The Pelias whosonfirst importer code predates the existence of the country-specific WOF sqlite databases, and the intent of the code was that only the global database would be downloaded.

So one fix would be to fix the bug and ensure only the global sqlite db is ever downloaded. I know the WOF team is working on a process to produce up to date global sqlite db, so it will be up to date again soon. This approach would probably be the easiest and require changing the least amount of Pelias code.

The other approach would be to start using the country sqlite dbs, since they are more up to date and many people have requested a way to import data for only specific countries without having to download the whole planet.

We should probably make the easy fix i mentioned first, to get things working as expected, before embarking on the longer term job of incorporating the country specific dbs.

@NickStallman
Copy link

My 2c would be for option 2 since it is cleaner and more versatile and doesn't rely on WOF to complete their planet db files again.

Looks like bundleList.js:getDBList simply needs to filter out the "latest" file, and an additional filter could be added there to specify specific country codes from the config file.
Then the filters can simply be added to the download script as well.

@missinglink
Copy link
Member

👍 I would also love it if we could support the new country-specific sqlite database (and also the planet database) indifferently.

Julian and I have a full calendar of work up till the end of the year but I'd be happy to support the work (through review and discussion) if someone else were to author it.

@orangejulius
Copy link
Member

Actually yes, I take it back. Moving to support the country DBs might be better. I didn't realize data from them is actually imported.

We would have to do some work for the PIP service, since it's currently hardcoded to use the global sqlite DB. On the other hand, the spatial service is on the horizon.

@orangejulius
Copy link
Member

Okay, I've been taking a look at SQLite based builds so that we can ensure Pelias works out of the box with up to date WOF data, avoiding issues such as those in pelias/docker#141.

At this point I'm pretty convinced we need to do both of the following things:

  • Not only default imports.whosonfirst.sqlite to true but completely drop support for the tar file WOF bundles, since they are not going to be updated in the future, and the current latest versions are corrupted
  • Begin supporting the country-specific DBs instead of the global SQLite DB. This global DB also doesn't appear to be updated any time recently. That may change soon, but there are advantages to country-specific DBs such as the ability to drastically limit download size for users interested in building only some countries (a common use case).

If anyone has any thoughts on this, let me know. Looking especially to hear from @missinglink and @Joxit

@missinglink
Copy link
Member

missinglink commented Dec 4, 2019

It's no secret that I'm a big SQLite fan 😄 so I would like to:

  • drop support for previous WOF install methods (git clone, bundles, meta files, WOF API) etc.
  • and exclusively support sqlite going forward.

This will clean up a lot of the code which has been in place since the inception of the WOF project and during its early development.
In fact, this is nothing new, we've been actively working towards this goal for over a year now!

Great let's do it!

warning: this is one of my ranty comments, so maybe fetch a ☕️ if you'd like to read on?

What's the backstory?

The original intention was to continue using the full planet file (the only one available at the time) and switch over to imports.whosonfirst.sqlite=true as a fairly painless transition for everyone.

During this time the WOF team introduced country-specific SQLite distributions in order to address emails from Github which forced them to reduce the size of the main whosonfirst-data repo.

At this time there were large changes required by the WOF team to split the data into multiple repos, a bunch of changes were made to the bundling code which unfortunately resulted in errors and omissions still being investigated and resolved today.

What's the current state of play?

We've been chatting with the WOF team throughout and I've highlighted some issues with the per-country bundles.

The planet-wide bundle also hasn't been updated since 2019-05-06 and so doesn't reflect the current state of the GitHub sources.

There are two different full-planet SQLite databases kicking around for testing, one produced by me by merging all the per-country repos together and one by Stephen Epps which is produced directly from the filesystem without any merging.

My one isn't usable due to the 'per-country' repo bug linked above and @orangejulius has noticed some issues with the one Stephen produced (Julian could you please add more info so I can confirm these?).

Either way, there is currently no planet-wide SQLite database available from https://dist.whosonfirst.org which contains data since May 2019. That one also contains some errors (particularly AU state abbreviation issues) which have since been resolved.

So what can we do about it?

Honestly, not a lot right now.

For Geocode Earth we are running older, trusted data and applying patches where necessary.

I imagine others are doing this too and it's really not anyone's first choice as we'd like to be pulling in all the changes from the last 6 months rather than having to manually curate our own 'fork' of the canonical WOF data.

If the SQLite distributions were all working again what could we do?

If those issues are resolved we could switch over to using imports.whosonfirst.sqlite=true and add support for per-country distribution files.

As Julian noted we have not done any work to support country-specific distributions AFAIK, so this would need to happen.

Part of that work will require some understanding of how the country-specific distro files work, specifically the whosonfirst-data-admin-xx and whosonfirst-data-admin-xy repos will be required for installations requiring data above the specified country-code (such as continent, ocean, empire etc.)

One potential issue to consider is that the default compile-time options for SQLite limits the total amount of attached databases (SQLITE_MAX_ATTACHED) to 10, this can be raised to 125 but I'm not sure how easy that would be to achieve in practice as it may require a lengthly compile step on each npm install (including packages depending on the pelias/whosonfirst module).

This may or may not be an issue, it depends on how we implement it, if we're doing a streaming import then it can be avoided by only having one database open at a time, but if we wanted to do a complex join operation or resolve an inter-country hierarchy traversal query then this would be a constraint.

I think that's the glorious future of WOF support in Pelias, but requires a few data fixes and a few code updates to get us there, if we can achieve that then it will come with the benefit that building smaller 'region-specific' Pelias builds (such as the pelias/docker projects) will not require downloading the massive planet file, making it much quicker and easier to get started.

@missinglink
Copy link
Member

I'm going to do some work to move this forward:

  • Catch up with the WOF team and see where we are at regarding known bugs and getting newer data published
  • Write a program which can be used to validate the contents of a SQLite distribution file, so we know if it can be trusted
  • Focus on getting a full-planet SQLite database first, before looking at the per-country ones.

@Joxit
Copy link
Member

Joxit commented Dec 5, 2019

I imagine others are doing this too and it's really not anyone's first choice as we'd like to be pulling in all the changes from the last 6 months rather than having to manually curate our own 'fork' of the canonical WOF data.

For Jawg Maps, I'm building the SQLite from GitHub repositories since December 2018 (because of whosonfirst/go-whosonfirst-dist#9). I thought using their bundles was too risky, so I chose to make mine from their sources. Now I'm updating my SQLite every 2-3 months.

This may or may not be an issue, it depends on how we implement it, if we're doing a streaming import then it can be avoided by only having one database open at a time, but if we wanted to do a complex join operation or resolve an inter-country hierarchy traversal query then this would be a constraint.

Definitely agree, doing a full planet import with the per country SQLite can be complex. Hum... Maybe not as complex as I thought, #471 highlights the fact that the placetype order is important. Only whosonfirst-data-admin-xx and whosonfirst-data-admin-xy databases are at the top of the other countries. So we could import these two databases and then all other countries. Doing this should avoid hierarchy issues.

The per country database will not be an issue for pelias/wof-admin-lookup.

orangejulius added a commit that referenced this issue Jan 9, 2020
The country-specific SQLite DBs currently available for download from
https://dist.whosonfirst.org have a few integrity issues. This causes
problems in Pelias such as addresses in NYC or Berlin not having proper
admin information.

The planet-wide SQLite DB, while a little out of date, is overall more
valid.

While in the long term we probably want to utilize the country-specific
DBs, for now it makes sense to filter them out.

Fixes #469
Connects #460
@orangejulius
Copy link
Member

I just opened #477 which will filter out the country-specific SQLite DBs (super easy to do thanks to @Joxit's excellent code from #417).

While it's probably not what we want long term, it does help resolve the issue of downloading essentially 2 full copies of WOF data.

orangejulius added a commit that referenced this issue Jan 9, 2020
The country-specific SQLite DBs currently available for download from
https://dist.whosonfirst.org have a few integrity issues. This causes
problems in Pelias such as addresses in NYC or Berlin not having proper
admin information.

The planet-wide SQLite DB, while a little out of date, is overall more
valid.

While in the long term we probably want to utilize the country-specific
DBs, for now it makes sense to filter them out.

Fixes #469
Connects #460
Fixes #437
@missinglink
Copy link
Member

Geocode Earth are now providing distribution files at https://geocode.earth/data/whosonfirst

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants