Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download sqlite default endpoint to geocode.earth and remove bundle support #487

Merged
merged 5 commits into from
Apr 23, 2020

Conversation

Joxit
Copy link
Member

@Joxit Joxit commented Mar 9, 2020

BREAKING CHANGE: drop support for wof bundles
BREAKING CHANGE: imports.whosonfirst.sqlite default to true

Background

WOF database is hosted on geocode.earth, both per country and global admin are available, I'm updating the default endpoint in favor of geocode.earth.

What's new ?

A new config is available config.imports.whosonfirst.countries. This configuration take care advantage of the per country databases, it will allow you to download only the wanted countries. If the config is not present, it will download the global databases.

Minor fix introduced by #483 bunzip2 and not bunzip

Discussion

  1. In this PR, I did only the SQLite part, the bundle download is not compatible with geocode.earth. The old way was per placetype, now it's admin/postalcode/constituency. That's means old users with a custom download URL will have some issues. Breaking change
  2. importPlaces are very useful for small imports, now the per country is available, should we deprecate it ? Or use both countries and importPlaces ?
  3. importPlaces extract data in bundle format, should we deprecate this part to ? Deprecated
    • Pros:
      • Code less complicated
      • Easier to maintain (SQLite everywhere)
    • Cons:
      • Bundle format can be faster than SQLite at runtime...
  4. Should we remove the support of venue ?

related: #460 #469 #477 pelias/wof-admin-lookup#289

Fixes #469
Closes #460
Fixes #437

@missinglink
Copy link
Member

FYI we don't currently build venue data, I could add that but I assumed it was old, unmaintained and noone was using it?

@Joxit
Copy link
Member Author

Joxit commented Mar 9, 2020

I agree with you. I also think that nobody uses venue 🤷‍♂️.
I added venue in the white-list just in case.

Edit discussion: Should we remove venue ?

@missinglink
Copy link
Member

missinglink commented Mar 9, 2020

Actually Julian and I discussed it on a call today.

I think WOF venues are 'soft deprecated', our experience with them is that the data is old and hasn't had new contributions for a long time.

As such we (Geocode Earth) will not be publishing dist files for WOF venues, the official dist.whosonfirst.org downloads are probably as up-to-date as they will ever be.

Regarding Pelias, it seems a pity to delete code which supports WOF venues since other developers may benefit from it, but if it becomes a maintenance burden then it may be ok to do so.

@missinglink
Copy link
Member

So... Maybe we just have a generic warning message that the file you requested isn't available on the host you specified and leave it at that for now?

@missinglink
Copy link
Member

missinglink commented Mar 9, 2020

If we want to be very thorough we could grab inventory files from both hosts and compare the last_modified time to find the best download.

Maybe this is overkill? But it would allow 'fallback' to dist.whosonfirst.org for files not published by GE

@Joxit
Copy link
Member Author

Joxit commented Mar 9, 2020

Regarding Pelias, it seems a pity to delete code which supports WOF venues since other developers may benefit from it, but if it becomes a maintenance burden then it may be ok to do so.

You are right... So, if and only if venue are activated, we can add download it from WOF ? Maybe I have an elegant way to do this 🤔 I will try it tomorrow :)

Maybe this is overkill? But it would allow 'fallback' to dist.whosonfirst.org for files not published by GE

Yes this is overkill 😅

Since we want to promote SQLite over Bundles, we can migrate to data GE only SQLite download part ? With some deprecation warnings ?

@Joxit Joxit force-pushed the joxit/data-geocode-earth branch from 3a86846 to 8f885ed Compare March 10, 2020 15:09
@Joxit
Copy link
Member Author

Joxit commented Mar 10, 2020

Updated, I added the backward compatibility for venues via WOF

@missinglink
Copy link
Member

Nice, I just checked on the build and it's finished, just syncing the files to the bucket, so it should be available shortly.

@orangejulius
Copy link
Member

Cool, I'm currently testing this out (actually, on top of the minor fix in #484). It seems to work well so far but unfortunately we don't have a lot of automated testing around the download functionality :( :(

I think if we are going to merge this and default to Geocode Earth's WOF data (which I fully support), then we need to do one of two things:

  • Change the code in this repo to download whosonfirst-data-admin-latest.db by default instead of whosonfirst-data-latest.db
  • Ensure our Geocode Earth data builds copy whosonfirst-data-admin-latest.db to whosonfirst-data-latest.db

The first of these would mean that we lose compatibility with the data on dist.whosonfirst.org, but it means we wouldn't have to add yet another backwards-compatibility complexity to the WOF build scripts. I understand they've already gotten quite complicated and it's actually really very hard to reliably and automatically build all the WOF data we need on a regular basis.

@orangejulius
Copy link
Member

Also, I extracted the bugfix for the bunzip2 command into #488, so eventually we can rebase this against master and keep the changes in this PR limited to just what's needed :)

@Joxit
Copy link
Member Author

Joxit commented Mar 10, 2020

With this PR, by default it will download whosonfirst-data-admin-latest.db from Geocode Earth. But this will fail if the source is something like WOF... They do not provide whosonfirst-data-admin-latest.db... Only per country extracts so this PR will break old builds 🙁

What can we do ? Download all whosonfirst-data-(admin|postalcode|venue)-{country}-latest or whosonfirst-data-latest.db when a custom source is used ?

I have already sync the PR 😉

@Joxit
Copy link
Member Author

Joxit commented Mar 11, 2020

New commit to be compatible with WOF for full planet download

@orangejulius
Copy link
Member

Hi! Big news today!

Who's on First has turned off their downloads at dist.whosonfirst.org.
Screenshot_2020-04-17_10-55-54

I haven't been following recent changes to the WOF importer super closely, but that really simplifies things for this PR, doesn't it? There's now no need at all to worry about backwards compatibility with dist.whosonfirst.org.

@Joxit
Copy link
Member Author

Joxit commented Apr 19, 2020

Oh! That's interesting !

Yes, this simplify a little bit the PR, but there are still users using custom URLs 😞.
One question remains, what is the future of venues ?

@missinglink
Copy link
Member

missinglink commented Apr 20, 2020

We have no plans to publish venues distributions at this time.

I call a vote to remove support for WOF venues in order to simplify the code.

👍 for yes 👎 for needs further discussion

@pixeldublu
Copy link

This should be pushed forward. Wof just disabled their downloads. But the changes from the pr are not enough. The structure from https://geocode.earth/ is different and lots of things are not working anymore...

@Joxit
Copy link
Member Author

Joxit commented Apr 21, 2020

Hi @pixeldublu I started this PR in March and the WOF change was made last Friday.

This PR was originally made for SQLite download only. Are you using SQLite or bundle download ? Why this PR is not working for you ?

@pixeldublu
Copy link

For example venues are not available anymore... and i the bundles also not available anymore...

@Joxit
Copy link
Member Author

Joxit commented Apr 21, 2020

Yes, check out our discussion here #487 (comment)
We are planning to remove the support of venues... Venues are not really updated by WOF team and seems to be outdated... That's why we want to remove this support.

If you need further discussion, fell free 🙂

The bundle was out of the scope of this PR. I will add it later.

@pixeldublu
Copy link

I was trying to make individual db downloads last week and then this wof download disable happened :)

So i made some changes to my PR to use geocode earth and works but its not mergeable yet. Looking for improvement ideas.

@Joxit Joxit force-pushed the joxit/data-geocode-earth branch from 3d50ef6 to d7f5b51 Compare April 21, 2020 22:05
@Joxit Joxit changed the title Download sqlite default endpoint to geocode.earth Download sqlite default endpoint to geocode.earth and remove bundle support Apr 21, 2020
@Joxit
Copy link
Member Author

Joxit commented Apr 21, 2020

Big commit.... BREAKING CHANGE is coming !

I dropped the support for bundle, the importPlaces download case is no longer useful thanks to per country download, so I removed this code too.
I removed unused dependencies.
schema update: sqlite default to true !

@orangejulius
Copy link
Member

Oh, importPlace is still quite useful, if you can keep it. It allows areas smaller than a country to be imported, reducing the number of records that aren't relevant for a specific use case.

@Joxit
Copy link
Member Author

Joxit commented Apr 21, 2020

Only the download part has be deleted (download of the 40Go SQLites + generating bundle like architecture).

The importPlace feature is always be present for import and pip-service. Now, instead of generating bundle like architecture, we will load only interesting documents with this piece of code from #417:

function findGeoJSONByPlacetypeAndWOFId(placetypes, wofids) {
if(!Array.isArray(wofids)) {
wofids = [ wofids ];
}
wofids = wofids.map(id => parseInt(id, 10)).join(',');
return findGeoJSONByPlacetype(placetypes) + `
AND
spr.id IN (
SELECT DISTINCT id
FROM ancestors
WHERE id IN (${wofids})
UNION SELECT DISTINCT id
FROM ancestors
WHERE ancestor_id IN (${wofids})
UNION SELECT DISTINCT ancestor_id
FROM ancestors
WHERE id IN (${wofids})
)`;
}

That means it will work like a charm even for small areas 🙂

For example, I want only Ile-de-France macroregion, the config will be:

{
  "datapath": "/opt/whosonfirst/",
  "importPostalcodes": true,
  "countries": "fr",
  "importPlace": 404227465,
  "sqlite": true
}

Inport time is 2 secondes and the last log is:

2020-04-21T22:42:31.990Z - info: [dbclient-whosonfirst]  paused=false, transient=0, current_length=0, indexed=5009, batch_ok=11, batch_retries=0, failed_records=0, localadmin=1277, country=1, macroregion=1, region=8, macrocounty=24, county=249, locality=1982, borough=20, macrohood=4, neighbourhood=914, postalcode=529, persec=500.9

Full France import time is 1min16s and the last log is:

2020-04-21T20:24:16.887Z - info: [dbclient-whosonfirst]  paused=false, transient=0, current_length=0, indexed=109398, batch_ok=219, batch_retries=0, failed_records=0, country=1, dependency=4, disputed=2, macroregion=13, region=101, macrocounty=328, county=3232, localadmin=35291, persec=889.8, locality=58987, borough=45, macrohood=4, neighbourhood=5336, postalcode=6054

orangejulius added a commit that referenced this pull request Apr 23, 2020
This is a small PR into #487
that allows country codes to be specified in upper case or lower case.

Before this, specifying country codes in anything but lowercase would
cause the downloader to skip all files.
Joxit pushed a commit that referenced this pull request Apr 23, 2020
This is a small PR into #487
that allows country codes to be specified in upper case or lower case.

Before this, specifying country codes in anything but lowercase would
cause the downloader to skip all files.
@Joxit Joxit force-pushed the joxit/data-geocode-earth branch from 13b7ebd to 17f9e8f Compare April 23, 2020 18:35
Joxit pushed a commit that referenced this pull request Apr 23, 2020
This is a small PR into #487
that allows country codes to be specified in upper case or lower case.

Before this, specifying country codes in anything but lowercase would
cause the downloader to skip all files.
Joxit pushed a commit that referenced this pull request Apr 23, 2020
This is a small PR into #487
that allows country codes to be specified in upper case or lower case.

Before this, specifying country codes in anything but lowercase would
cause the downloader to skip all files.
@Joxit Joxit force-pushed the joxit/data-geocode-earth branch from eb1c75b to c737102 Compare April 23, 2020 18:38
Joxit and others added 4 commits April 23, 2020 12:57
Bundles hosted by the Who's on First team at
https://dist.whosonfirst.org have been removed.

Data can now be downloaded based on a sponsorship from Geocode Earth.
More info at https://geocode.earth/data
BREAKING CHANGE: Support for the bundle downloads (files ending in
`tar.bz2`) has been removed. Only SQLite downloads are supported and
the `whosonfirst` importer will now behave as if
`imports.whosonfirst.sqlite` is set to true.

fixes #496
fixes #226
closes #460
This helps ensure that in the future we don't break anything that
depends on this module by exporting something that doesn't exist.
@orangejulius orangejulius force-pushed the joxit/data-geocode-earth branch from a18f477 to 6ba6c1e Compare April 23, 2020 20:00
@orangejulius orangejulius merged commit 7f06092 into master Apr 23, 2020
@missinglink
Copy link
Member

🎉

@orangejulius orangejulius deleted the joxit/data-geocode-earth branch April 23, 2020 20:44
orangejulius added a commit to pelias/docker that referenced this pull request Apr 23, 2020
Now that the Who's on First downloader defaults to using SQLite
downloads and supports country-specific bundles thanks to
pelias/whosonfirst#487, we can start using that
new functionality.

This PR sets the `imports.whosonfirst.countryCode` property
appropriately across all our current projects.

This should not only help make all imports quite a bit faster, but it
will require much less disk space for everyone.

Since we're now sponsoring the Who's on First downloads here at Geocode
Earth, the bandwidth savings also make us happy :)

Along the way I made a few minor changes to remove some deprecated
config options like `importVenues` and `sqlite`.
@Joxit
Copy link
Member Author

Joxit commented Apr 23, 2020

Well done everyone 😄 🚀

@orangejulius
Copy link
Member

Well done mostly to you @Joxit for doing most of the work to start this PR and the one in Placeholder :)

I just had some fun adding countryCode to all our Docker projects and the fast, small downloads are so nice! 🎉

If there are any bugs they probably came in from all my rebasing, let us know :)

calpb pushed a commit to sorelle/docker that referenced this pull request Mar 29, 2021
Now that the Who's on First downloader defaults to using SQLite
downloads and supports country-specific bundles thanks to
pelias/whosonfirst#487, we can start using that
new functionality.

This PR sets the `imports.whosonfirst.countryCode` property
appropriately across all our current projects.

This should not only help make all imports quite a bit faster, but it
will require much less disk space for everyone.

Since we're now sponsoring the Who's on First downloads here at Geocode
Earth, the bandwidth savings also make us happy :)

Along the way I made a few minor changes to remove some deprecated
config options like `importVenues` and `sqlite`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants