Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract data via WOF SQLite database #157

Merged
merged 2 commits into from
Apr 23, 2020
Merged

Conversation

Joxit
Copy link
Member

@Joxit Joxit commented Apr 20, 2019

This feature is to avoid any use of WOF bundles.
It should also speed up placeholder build.

@missinglink
Copy link
Member

Hey @Joxit sorry I took a while to get to this.
I really like the concept of this PR but I'd like to discuss changing how it works.

It's intended to be a replacement for https://github.com/pelias/placeholder/blob/master/cmd/wof_extract.sh but the two scripts have different functionality.

The existing shell script wof_extract.sh scans the filesystem tree, filters the records by placetype and then writes them to stdout.

The nice thing about this is that it can be broken up into two steps, one to create the 'extract' and one to import the data into placeholder.
This has a big advantage that the data updates and code updates can be managed independently and that rebuilding placeholder is fast when running off a pre-processed data/wof.extract file.

The node script in this PR performs both of those actions in one step.
Maybe we could modify it to only perform the data processing and then the rest of the tooling will 'just work' including instructions such as https://github.com/pelias/placeholder#steps in the readme.

One other thing I was wondering, is if it would be possible to combine https://github.com/pelias/placeholder/blob/master/cmd/placetype.filter and https://github.com/pelias/placeholder/pull/157/files#diff-2a0ae160a26801a4054b1d47da4defddR9 into a single file so it gets edited in one place.

@Joxit
Copy link
Member Author

Joxit commented Apr 29, 2019

Hum, yes, that make sense and this will save the current workflow.

So the script will output records on stdout and will use the placetype.filter file instead of the JS list.

@missinglink
Copy link
Member

missinglink commented Apr 29, 2019

It made a big difference when working with the filesystem because the 'extract' file took about an hour to generate.

Any idea how long it takes to extract that data from the sqlite db?

@Joxit
Copy link
Member Author

Joxit commented Apr 29, 2019

I just tried a world extraction, it ended with an error (heap out of memory need --max_old_space_size) after 35 minutes.
It seems that it will take as much time as bundle extraction.

@Joxit
Copy link
Member Author

Joxit commented Apr 29, 2019

I added one commit.
For the sqlite extract, we should use the script cmd/extract.sh with the parameter sqlite.
Now cmd/wof_extract_sqlite.js uses placetype.filter to get all layers and jq.filter in order to delete useless properties.
cmd/wof_extract_sqlite.js also take a parameter build to do an extract with build.

@Joxit
Copy link
Member Author

Joxit commented May 11, 2019

I tried a world extraction + placeholder sqlite build with max_old_space_size and it took 39min13s.
The size of store.sqlite3 is 4.3G

@orangejulius
Copy link
Member

I tested this out the other day, and it worked great! I think it took around 25 minutes to build the extract and sqlite DB. That's pretty fast!

Is there anything that needs to be done before we can merge this? The PR has been sitting for a while, so I imagine there's at least a little cleanup.

@Joxit
Copy link
Member Author

Joxit commented Aug 15, 2019

Yeah! That's cool, I think there is only max_old_space_size to increase like other services ?

@orangejulius
Copy link
Member

Right. I just set it to 8000 on all the other services, based on the latest SQLite file from WOF, so that would be a good value.

@Joxit Joxit force-pushed the joxit/wof_extract_sqlite branch from dc47406 to 642d851 Compare August 15, 2019 16:07
@Joxit
Copy link
Member Author

Joxit commented Aug 15, 2019

okay, memory limit updated and rebased to master :)

@missinglink
Copy link
Member

🚀

@orangejulius
Copy link
Member

Hi @Joxit,
I spent a little time testing this PR recently. I was trying to make it work with imports.whosonfirst.sqlite set to true in pelias.json but it looks like that's not supported. I think we'll need to respect that configuration variable in order to make switching to sqlite easy (and to resolve pelias/whosonfirst#460). What do you think is the best way to approach that?

@Joxit
Copy link
Member Author

Joxit commented Nov 25, 2019

Hi @orangejulius

The package pelias-config is not present in the package.json, that's why there are no supports of the imports.whosonfirst.sqlite.
Should I add the dependency and support the configuration ? All the configuration here is done by environment variables PLACEHOLDER_DATA...

@orangejulius
Copy link
Member

Yeah, if you have the time, that would be great. pelias-config is slowly finding its way into every repo, and this is one of the last holdouts :)

@missinglink
Copy link
Member

back story...

originally I wanted to keep repos like pelias/placeholder and pelias/interpolation as 'standalone' so that they could be developed into other projects and so would gain more contributors.

turns out that didn't really work and makes them inconsistent with the other repos and harder to deploy 🤷‍♂.

@Joxit
Copy link
Member Author

Joxit commented Nov 25, 2019

Okay, but I think the deprecation of PLACEHOLDER_DATA should be in another PR though.

@Joxit
Copy link
Member Author

Joxit commented Nov 25, 2019

Quick update with 54f99fe, this adds the supports for imports.whosonfirst.sqlite in the extract cmd.

For a full pelias-config supports, we will create another PR with an update in pelias-config repository.

@orangejulius
Copy link
Member

I've tested this out pretty extensively, and it appears to create very nice extracts based on the SQLite DB.

This does create a requirement to set an extra ENV var (WOF_DIR) since the default directory (/data/whosonfirst-data/sqlite) generally won't be correct. I believe we'll need to set that from pelias.json as well. It would be great to add this to the current PR, let me know if it's ok for me to add it.

PLACEHOLDER_DATA already works out of the box with our docker setup only because the default values align. Not exactly ideal but it seems to be ok for now. We could also do something similar to that for WOF_DIR. The default value should be /data/whosonfirst/sqlite in that case.

@Joxit
Copy link
Member Author

Joxit commented Dec 2, 2019

In fact, I used the same variables as the current wof_extract.sh to be fully backward compatible.

# location of whosonfirst data dir
# note: set WOF_DIR env var to override
WOF_DIR=${WOF_DIR:-'/data/whosonfirst-data/data'};

Yes, of course you can add some extra commit 😉

@orangejulius
Copy link
Member

Oh, weird. I wonder how it worked before then...

@Joxit
Copy link
Member Author

Joxit commented Dec 2, 2019

Hum... You're right, this is an excellent question, with the current pelias/docker configuration, the extract should not work 😅

@orangejulius
Copy link
Member

orangejulius commented Dec 2, 2019

Okay, I think it comes from these lines in the Dockerfile:

placeholder/Dockerfile

Lines 19 to 20 in 8e9d311

ENV WOF_DIR '/data/whosonfirst/data'
ENV PLACEHOLDER_DATA '/data/placeholder'

Additionally, there must be something about the way we are calling into node that causes those values to be lost. Otherwise I wouldn't have needed to set it for this PR to work.

@Joxit
Copy link
Member Author

Joxit commented Dec 7, 2019

Nice catch.
You have an issue with docker and this PR ?

@orangejulius
Copy link
Member

So I was hoping to merge this today but discovered a snag: if we merge this PR, then non-full-planet projects with sqlite: true set will always build a full planet Placeholder extract.

This is because the current flow for a non-full-planet build is as follows:

  1. The whosonfirst importer downloads WOF SQLite DB
  2. The whosonfirst importer uses the sqlite DB to create an "immitation" bundle/filesystem based distribution for WOF containing only the data specified by imports.whosonfirst.importPlace
  3. Placeholder consumes the bundle/filesystem WOF data only, ignoring the full planet SQLite DB

We've discussed a shared place to put WOF utilities in the past, and this seems like another case that needs to utilize that. The whosonfirst importer is already an ad-hoc start to this, but we'd really like to avoid cementing more dependencies on the importer, and rather extract shared stuff to a library. @missinglink, is github.com/pelias/wof the start of this? @Joxit do you have other ideas on how to preserve all the functionality we want?

@Joxit
Copy link
Member Author

Joxit commented Jan 28, 2020

Hum... There is the function findGeoJSONByPlacetypeAndWOFId where we can provide ids with placetypes for the world SQLite... That means we do not need the bundle/filesystem WOF data.

https://github.com/pelias/whosonfirst/blob/31d55e8695e05e96d1150697a7f561a13ff9a152/src/components/sqliteStream.js#L73-L91

If we want a compatibility for "by country repositories", we should ensure that each id is at least in one of the targeted repositories. For example, France is both an Empire (136253037) in repo XY and a Country (85633147) in repo FR. So the imports.whosonfirst.importPlace must include 136253037 and 85633147.
If this is ok, we can add a combineStream here and iterate over *.db

new SQLiteStream(
path.join(WOF_DIR, 'whosonfirst-data-latest.db'),
SQLiteStream.findGeoJSONByPlacetype(layers)
)

@missinglink
Copy link
Member

We need to resolve the issue that all per-country downloads must also pull the xy repo or make it very clear to developers why this is important.

@missinglink
Copy link
Member

I guess what I'm saying is that we need to resolve the download issues before we can really tackle the ingestion ones.

@orangejulius
Copy link
Member

Okay, upon further thought, I think we actually are okay to merge this!

The only catch is we can't yet default imports.whosonfirst.sqlite to true, due to the reasons for my comments above: non-full-planet builds would either fail or end up using full planet placeholder data.

I think it's ok to merge this code now, then bring in shared code from some whosonfirst tooling to allow Placeholder to read from and filter a SQLite DB directly.

Finally, after that, we can worry about combining multiple country-specific SQLite DBs. My guess is there will be a bit of work there, and in the meantime the priority is that people can use the SQLite downloads at all.

Let me know if that seems to make sense for everyone else or if I'm missing something.

@missinglink
Copy link
Member

Heya, I had a need to for this code today, it sounds like you two @Joxit and @orangejulius are both ok to merge it, so can we :shipit: ?

@missinglink
Copy link
Member

Erp, one comment, the string 'whosonfirst-data-latest.db' is hard-coded, I would like it if that were variable via env var?

@Joxit
Copy link
Member Author

Joxit commented Mar 18, 2020

Hey, I can do an update to be more like pelias/wof-admin-lookup#289

@missinglink
Copy link
Member

You know what? since this PR first landed a lot has changed regarding WOF, and now the SQLite bundles are up-to-date and reliable there seems less of a need to provide backwards support for bundles. I'd love to delete the old code which reads from the filesystem but I guess we can't do that yet because it'd break pelias/docker 🤷‍♂

For my purposes I found that it's much easy to accomplish this task now using the wof command as such:

#!/bin/bash
set -euxo pipefail

# ensure dependencies are installed
export DEBIAN_FRONTEND=noninteractive
apt-get update -y
apt-get install -y jq pigz
jq --version

# clone & compile wof tools
cd /code
git clone https://github.com/pelias/placeholder.git

# filter records by placetype
# note: the placetype list was lifted from cmd/wof_extract.sh within pelias/placeholder
read -r -d '' SQL <<'EOF'
SELECT geojson.body
FROM geojson
WHERE geojson.is_alt = 0
AND geojson.id IN (
  SELECT DISTINCT spr.id
  FROM spr
  WHERE (
    spr.is_deprecated = 0 AND
    spr.is_superseded = 0
  )
  AND spr.placetype IN (
    'ocean',
    'continent',
    'marinearea',
    'empire',
    'country',
    'dependency',
    'disputed',
    'macroregion',
    'region',
    'macrocounty',
    'county',
    'localadmin',
    'locality',
    'borough',
    'macrohood',
    'neighbourhood'
  )
);
EOF

# generate placeholder extract
wof sqlite export --sql="$SQL" "${sqlite_path}" |\
  jq -c -M -f '/code/placeholder/cmd/jq.filter' |\
  pigz > ${placeholder_extract_path}

@missinglink
Copy link
Member

This is an example of terrible community management on our part, where the github comments are more lines than the code 🤣 sorry about that!

Have a look at the exports from @whosonfirst/wof, you may be able to do a lot of what you want with little effort from either within js or on the cli.

@Joxit Joxit force-pushed the joxit/wof_extract_sqlite branch from c1eb859 to 9495956 Compare March 18, 2020 15:54
@Joxit
Copy link
Member Author

Joxit commented Mar 18, 2020

Oh, cool. I did the change in placeholder anyway 😅

cmd/extract.sh Outdated Show resolved Hide resolved
@orangejulius
Copy link
Member

Okay, after pelias/whosonfirst#487 we will definitely will need this. Looks like you integrated my changes from #169, so this works great as is.

If you feel compelled to do a bunch of cleanup (we don't even need to check imports.whosonfirst.sqlite now), go for it. Otherwise I think we can and should merge this to move to our glorious sqlite future :)

@Joxit Joxit force-pushed the joxit/wof_extract_sqlite branch from a200fea to 9e453ab Compare April 22, 2020 08:31
@Joxit
Copy link
Member Author

Joxit commented Apr 22, 2020

Now it's ok, I dropped bundle support and sync the branch 👍

@missinglink
Copy link
Member

missinglink commented Apr 23, 2020

I tried this out today with pelias/docker and I got an error..

➜  portland-metro git:(disable_tty) ✗ pelias prepare placeholder
Creating extract at /data/placeholder/wof.extract
internal/fs/utils.js:220
    throw err;
    ^

Error: ENOENT: no such file or directory, scandir '/data/whosonfirst/data'
    at Object.readdirSync (fs.js:854:3)
    at Object.<anonymous> (/code/pelias/placeholder/cmd/wof_extract_sqlite.js:46:4)
    at Module._compile (internal/modules/cjs/loader.js:955:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:991:10)
    at Module.load (internal/modules/cjs/loader.js:811:32)
    at Function.Module._load (internal/modules/cjs/loader.js:723:14)
    at Function.Module.runMain (internal/modules/cjs/loader.js:1043:10)
    at internal/main/run_main_module.js:17:11 {
  errno: -2,
  syscall: 'scandir',
  code: 'ENOENT',
  path: '/data/whosonfirst/data'
}
➜  portland-metro git:(disable_tty) ✗ grep placeholder docker-compose.yml
  placeholder:
    image: pelias/placeholder:joxit-wof_extract_sqlite
    container_name: pelias_placeholder
➜  portland-metro git:(disable_tty) ✗ grep DATA_DIR .env
DATA_DIR=/data/pelias/docker/projects/portland-metro
➜  portland-metro git:(disable_tty) ✗ ls -lah /data/pelias/docker/projects/portland-metro/whosonfirst/sqlite
total 9210312
drwxr-xr-x  4 peter  staff   128B Apr 23 18:00 .
drwxr-xr-x  3 peter  staff    96B Apr 23 17:52 ..
-rw-r--r--  1 peter  staff   3.2G Apr 23 18:03 whosonfirst-data-admin-us-latest.db
-rw-r--r--  1 peter  staff   1.2G Apr 23 18:02 whosonfirst-data-postalcode-us-latest.db
➜  portland-metro git:(disable_tty) ✗ cat pelias.json | jq .imports.whosonfirst
{
  "datapath": "/data/whosonfirst",
  "sqlite": true,
  "countryCode": [
    "us"
  ],
  "importPostalcodes": true,
  "importPlace": [
    "85688513",
    "85688623"
  ]
}

@Joxit
Copy link
Member Author

Joxit commented Apr 23, 2020

Oh yes you're right, since bundle are no longer supported, the default value here can be /data/whosonfirst/sqlite now #157 (comment)

placeholder/Dockerfile

Lines 17 to 20 in 0250dae

ADD . ${WORKDIR}
ENV WOF_DIR '/data/whosonfirst/data'
ENV PLACEHOLDER_DATA '/data/placeholder'

Or should we add pelias-config as a dependency instead of using unknown environment variables ?

(You can pull the new docker image)

BREAKING CHANGE: this drops support for Who's on First bundles. Going
forward, only SQLite distributions are supported, and Placeholder will
now behave as if `imports.whosonfirst.sqlite` is set to true, regardless
of its actual value.
@orangejulius orangejulius force-pushed the joxit/wof_extract_sqlite branch 2 times, most recently from 502b722 to 269d6bc Compare April 23, 2020 20:31
This upates `wof_extract_sqlite` to support the
`imports.whosonfirst.importPlace` property, using similar logic to
`wof-admin-lookup`.

When `importPlace` specifies a small area, like a city, this speeds up
the import process considerably!
@orangejulius orangejulius force-pushed the joxit/wof_extract_sqlite branch from 269d6bc to 2fcd7df Compare April 23, 2020 20:32
@orangejulius
Copy link
Member

Okay, I've also taken over this PR, rebased it a bunch and edited the commit messages to that the release notes will be nice and clean :)

I also added support for importPlace so small imports are fast and efficient with no extra data. I think this concludes the transition to SQLite saga and there's now no Pelias code that uses WOF bundles! 🎉 🪐

@orangejulius orangejulius merged commit adedc3c into master Apr 23, 2020
@orangejulius orangejulius deleted the joxit/wof_extract_sqlite branch April 23, 2020 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants