Commit

feat(sqlite): Drop support for bundle downloads
BREAKING CHANGE: Support for the bundle downloads (files ending in
`tar.bz2`) has been removed. Only SQLite downloads are supported and
the `whosonfirst` importer will now behave as if
`imports.whosonfirst.sqlite` is set to true.

fixes #496
fixes #226
closes #460
Joxit authored and orangejulius committed Apr 23, 2020
1 parent a0eb740 commit d851218
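
The breaking change in the commit message affects existing `imports.whosonfirst` configuration blocks. The following is a minimal sketch, not code from this commit, of how a config could be checked for options tied to bundle downloads. The helper name `findLegacyOptions` and the sample paths are hypothetical; the two options it flags (`sqlite` set to `false` and `missingFilesAreFatal`) are the ones this diff changes or removes.

```javascript
// Hypothetical helper, not part of pelias/whosonfirst.
// It inspects the `imports.whosonfirst` section of a Pelias config and
// reports options that only made sense for bundle (tar.bz2) downloads.
function findLegacyOptions(wofConfig) {
  const problems = [];

  // After this commit only SQLite downloads remain, so an explicit
  // `sqlite: false` asks for a code path that no longer exists.
  if (wofConfig.sqlite === false) {
    problems.push('sqlite: false is no longer supported');
  }

  // `missingFilesAreFatal` was documented in terms of bundle files and
  // is dropped from the README and the schema comment in this commit.
  if (Object.prototype.hasOwnProperty.call(wofConfig, 'missingFilesAreFatal')) {
    problems.push('missingFilesAreFatal was removed together with bundle support');
  }

  return problems;
}

// Illustrative configs (paths and values are assumptions).
const legacyConfig = { datapath: '/data/whosonfirst', sqlite: false, missingFilesAreFatal: true };
const currentConfig = { datapath: '/data/whosonfirst', countryCode: 'FR' };

console.log(findLegacyOptions(legacyConfig));
console.log(findLegacyOptions(currentConfig));
```

A check like this could run before an upgrade so that leftover bundle options are noticed early rather than at import time.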
Showing 27 changed files with 115 additions and 1,428 deletions.
41 changes: 12 additions & 29 deletions README.md
@@ -39,26 +39,25 @@ The following configuration options are supported by this importer.

Full path to where Who's on First data is located (note: the included [downloader script](#downloading-the-data) will automatically place the WOF data here, and is the recommended way to obtain WOF data)

### `imports.whosonfirst.importPlace`
### `imports.whosonfirst.countryCode`

* Required: no
* Default: ``

Set to a WOF ID or array of IDs to import data only for descendants of those records, rather than the entire planet.
Set sqlite country codes to download. Geocode Earth provides two types of SQLite extracts:
- [combined](https://geocode.earth/data/whosonfirst/combined): databases of the whole planet for `Administrative Boundaries`, `Postal Code` and `Constituencies`
- [single country](https://geocode.earth/data/whosonfirst): per country databases for `Administrative Boundaries`, `Postal Code` and `Constituencies`

You can use the [Who's on First Spelunker](https://spelunker.whosonfirst.org) or the `source_id` field from any WOF result of a Pelias query to determine these values.

Specifying a value for `importPlace` will download the full planet SQLite database (27GB). Support for individual country downloads [may be added in the future](https://github.com/pelias/whosonfirst/issues/459)

### `imports.whosonfirst.importVenues`
### `imports.whosonfirst.importPlace`

* Required: no
* Default: `false`
* Default: ``

Set to true to enable importing venue records. There are over 15 million venues so this option will add substantial download and disk usage requirements.
Set to a WOF ID or array of IDs to import data only for descendants of those records, rather than the entire planet.

It is currently [not recommended to import venues](https://github.com/pelias/whosonfirst/issues/94).
You can use the [Who's on First Spelunker](https://spelunker.whosonfirst.org) or the `source_id` field from any WOF result of a Pelias query to determine these values.

Specifying a value for `importPlace` will download the full planet SQLite database (27GB). Support for individual country downloads [may be added in the future](https://github.com/pelias/whosonfirst/issues/459)

### `imports.whosonfirst.importPostalcodes`

@@ -67,15 +66,6 @@ It is currently [not recommended to import venues](https://github.com/pelias/who

Set to true to enable importing postalcode records. There are over 3 million postal code records.

### `imports.whosonfirst.missingFilesAreFatal`

* Required: no
* Default: `false`

Set to `true` for missing files from [Who's on First bundles](https://dist.whosonfirst.org/bundles/) to stop the import process.

This flag is useful if you consider it vital that all Who's on First data is successfully imported, and can be helpful to guard against incomplete downloads or other types of failure.

### `imports.whosonfirst.maxDownloads`

* Required: no
@@ -86,25 +76,21 @@ The maximum number of files to download simultaneously. Higher values can be fas
### `imports.whosonfirst.dataHost`

* Required: no
* Default: `https://dist.whosonfirst.org/`
* Default: `https://data.geocode.earth/wof/dist`

The location to download Who's on First data from. Changing this can be useful to use custom data, pin data to a specific date, etc.

### `imports.whosonfirst.sqlite`

* Required: no
* Default: `false`
* Default: `true`

Set to `true` to use Who's on First SQLite databases instead of GeoJSON bundles.

SQLite databases take up less space on disk and can be much more efficient to
download and extract.

This option may [become the default in the near future](https://github.com/pelias/whosonfirst/issues/460).

However, both the Who's on First processes to generate
these files and the Pelias code to use them is new and not yet considered
production ready.
This option [is the default](https://github.com/pelias/whosonfirst/issues/460).

## Downloading the Data

@@ -169,9 +155,6 @@ Other types may be included in the future.

This project exposes a number of node streams for dealing with Who's on First data and metadata files:

- `metadataStream`: streams rows from a Who's on First metadata file
- `parseMetaFiles`: CSV parse stream configured for metadata file contents
- `loadJSON`: parallel stream that asynchronously loads GeoJSON files
- `recordHasIdAndProperties`: rejects Who's on First records missing id or properties
- `isActiveRecord`: rejects records that are superseded, deprecated, or otherwise inactive
- `isNotNullIslandRelated`: rejects [Null Island](https://spelunker.whosonfirst.org/id/1) and other records that intersect it (currently just postal codes at 0/0)
2 changes: 1 addition & 1 deletion bin/download
@@ -1,3 +1,3 @@
#!/bin/bash

exec node ./utils/download_data.js
exec node ./utils/download_sqlite_all.js
3 changes: 0 additions & 3 deletions index.js
@@ -1,9 +1,6 @@
module.exports = {
metadataStream: require('./src/components/metadataStream'),
isActiveRecord: require('./src/components/isActiveRecord').create,
isNotNullIslandRelated: require('./src/components/isNotNullIslandRelated').create,
loadJSON: require('./src/components/loadJSON').create,
parseMetaFiles: require('./src/components/parseMetaFiles').create,
recordHasIdAndProperties: require('./src/components/recordHasIdAndProperties').create,
recordHasName: require('./src/components/recordHasName').create,
conformsTo: require('./src/components/conformsTo').create,
7 changes: 2 additions & 5 deletions package.json
@@ -29,16 +29,13 @@
"async": "^3.0.1",
"better-sqlite3": "^6.0.0",
"combined-stream": "^1.0.5",
"command-exists": "^1.2.8",
"csv-stream": "^0.2.0",
"command-exists": "^1.2.9",
"download-file-sync": "^1.0.4",
"fs-extra": "^8.0.0",
"iso3166-1": "^0.5.0",
"klaw-sync": "^6.0.0",
"lodash": "^4.5.1",
"parallel-transform": "^1.1.0",
"pelias-blacklist-stream": "^1.0.0",
"pelias-config": "^4.9.0",
"pelias-config": "^4.9.1",
"pelias-dbclient": "^2.13.0",
"pelias-logger": "^1.2.1",
"pelias-model": "^7.1.0",
10 changes: 4 additions & 6 deletions schema.js
@@ -5,17 +5,17 @@ const Joi = require('@hapi/joi');
// * imports.whosonfirst.datapath (string)
//
// optional:
// * imports.whosonfirst.importVenues (boolean) (default: false)
// * imports.whosonfirst.countryCode (string OR array[string]) (default: [])
// * imports.whosonfirst.importPostalcodes (boolean) (default: false)
// * imports.whosonfirst.importConstituencies (boolean) (default: false)
// * imports.whosonfirst.importIntersections (boolean) (default: false)
// * imports.whosonfirst.importPlace (integer OR array[integer]) (default: none)
// * imports.whosonfirst.missingFilesAreFatal (boolean) (default: false)
// * imports.whosonfirst.sqlite (boolean) (default: true)

module.exports = Joi.object().keys({
imports: Joi.object().required().keys({
whosonfirst: Joi.object().required().keys({
countries: Joi.alternatives().try(
countryCode: Joi.alternatives().try(
Joi.string(),
Joi.array().items(Joi.string()).default([])
).default([]),
@@ -28,10 +28,8 @@ module.exports = Joi.object().keys({
importVenues: Joi.boolean().default(false).truthy('yes').falsy('no'),
importPostalcodes: Joi.boolean().default(false).truthy('yes').falsy('no'),
importConstituencies: Joi.boolean().default(false).truthy('yes').falsy('no'),
importIntersections: Joi.boolean().default(false).truthy('yes').falsy('no'),
missingFilesAreFatal: Joi.boolean().default(false).truthy('yes').falsy('no'),
maxDownloads: Joi.number().integer(),
sqlite: Joi.boolean().default(false).truthy('yes').falsy('no')
sqlite: Joi.boolean().default(true).truthy('yes').falsy('no')
}).unknown(false)
}).unknown(true)
}).unknown(true);
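
The schema above accepts `countryCode` as either a single string or an array of strings, defaulting to an empty array. The sketch below shows one way calling code could normalize such a value into an array before use. It is an illustration only: `normalizeCountryCodes` is a hypothetical name, and the diff does not show how the importer itself consumes this option.

```javascript
// Hypothetical helper illustrating the "string OR array[string]" shape
// that the schema allows for imports.whosonfirst.countryCode.
function normalizeCountryCodes(value) {
  // Treat a missing value like the schema default (an empty array).
  if (value === undefined || value === null) {
    return [];
  }
  // Copy arrays so callers cannot mutate the original config value;
  // wrap a single string in a one-element array.
  return Array.isArray(value) ? value.slice() : [value];
}

// Illustrative inputs (the country codes are example values).
console.log(normalizeCountryCodes('FR'));
console.log(normalizeCountryCodes(['FR', 'DE']));
console.log(normalizeCountryCodes(undefined));
```

Normalizing once at the edge keeps later code free of repeated `Array.isArray` checks.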
107 changes: 1 addition & 106 deletions src/bundleList.js
@@ -1,9 +1,6 @@
const readline = require('readline');
const fs = require('fs-extra');
const path = require('path');
const downloadFileSync = require('download-file-sync');
const _ = require('lodash');
const klawSync = require('klaw-sync');

const peliasConfig = require( 'pelias-config' ).generate(require('../schema'));

@@ -37,91 +34,18 @@ const postalcodeRoles = [
'postalcode'
];

const venueRoles = [
'venue'
];

const SQLITE_REGEX = /whosonfirst-data-[a-z0-9-]+\.db$/;

function getPlacetypes() {
let roles = hierarchyRoles;

// admin-only env var should override the config setting since the hierarchy bundles are useful
// on their own to allow other importers to start when using admin lookup
if (peliasConfig.imports.whosonfirst.importVenues && process.argv[2] !== '--admin-only') {
roles = roles.concat(venueRoles);
}

if (peliasConfig.imports.whosonfirst.importPostalcodes && process.argv[2] !== '--admin-only') {
roles = roles.concat(postalcodeRoles);
}

return roles;
}

function ensureBundleIndexExists(metaDataPath) {
const wofDataHost = peliasConfig.get('imports.whosonfirst.dataHost') || 'https://dist.whosonfirst.org';
const bundleIndexFile = path.join(metaDataPath, 'whosonfirst_bundle_index.txt');
const bundleIndexUrl = `${wofDataHost}/bundles/index.txt`;

//ensure required directory structure exists
fs.ensureDirSync(metaDataPath);

if (!fs.existsSync(bundleIndexFile)) {

const klawOptions = {
nodir: true,
filter: (f) => (f.path.indexOf('-latest.csv') !== -1)
};
const metaFiles = _.map(klawSync(metaDataPath, klawOptions),
(f) => (path.basename(f.path)));

// if there are no existing meta files and the bundle index file is not found,
// download bundle index
if (_.isEmpty(metaFiles)) {
fs.writeFileSync(bundleIndexFile, downloadFileSync(bundleIndexUrl));
}
else {
fs.writeFileSync(bundleIndexFile, metaFiles.join('\n'));
}
}
}

function getBundleList(callback) {
const metaDataPath = path.join(peliasConfig.imports.whosonfirst.datapath, 'meta');
const bundleIndexFile = path.join(metaDataPath, 'whosonfirst_bundle_index.txt');

ensureBundleIndexExists(metaDataPath);

const roles = getPlacetypes();

// the order in which the bundles are listed is critical to the correct execution
// of the admin hierarchy lookup code in whosonfirst importer,
// so in order to preserve the order specified by the roles list
// we must collect the bundles from the index files by buckets
// and then at the end merge all the buckets into a single ordered array
const bundleBuckets = initBundleBuckets(roles);

const rl = readline.createInterface({
input: fs.createReadStream(bundleIndexFile)
});

rl.on('line', (line) => {

const parts = line.split(' ');
const record = parts[parts.length - 1];

sortBundleByBuckets(roles, record, bundleBuckets);

}).on('close', () => {

const bundles = _.sortedUniq(combineBundleBuckets(roles, bundleBuckets));

callback(null, bundles);

});
}

function getDBList(callback) {
const databasesPath = path.join(peliasConfig.imports.whosonfirst.datapath, 'sqlite');
//ensure required directory structure exists
@@ -138,36 +62,7 @@ function getList(callback) {
if (peliasConfig.imports.whosonfirst.sqlite) {
return getDBList(callback);
}
getBundleList(callback);
}

function initBundleBuckets(roles) {
const bundleBuckets = {};
roles.forEach( (role) => {
bundleBuckets[role] = [];
});
return bundleBuckets;
}

function sortBundleByBuckets(roles, bundle, bundleBuckets) {
roles.forEach((role) => {
// search for the occurrence of role-latest-bundle, like region-latest-bundle
// timestamped bundles should be skipped as they are of the format role-timestamp-bundle
const validBundleRegex = new RegExp(`${role}-[\\w-]*latest`);
if (validBundleRegex.test( bundle ) ) {
bundleBuckets[role].push(bundle);
}
});
}

function combineBundleBuckets(roles, bundleBuckets) {
let bundles = [];

roles.forEach( (role) => {
bundles = _.concat(bundles, _.get(bundleBuckets, role, []));
});

return bundles;
callback('Bundles are no longer supported!');
}

module.exports.getPlacetypes = getPlacetypes;
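
In the `src/bundleList.js` hunk above, `getList` now passes the string `'Bundles are no longer supported!'` to its callback when `imports.whosonfirst.sqlite` is falsy. The sketch below mirrors that control flow with injected stand-ins so it can run without `pelias-config`. `makeGetList`, `fakeGetDBList`, and the sample database filename are assumptions made for illustration; only the branch structure and the error string come from the diff.

```javascript
// Stand-in that mirrors the branch structure of getList in the diff.
// The config object and the getDBList function are injected here, whereas
// the real module reads pelias-config and scans the sqlite directory.
function makeGetList(wofConfig, getDBList) {
  return function getList(callback) {
    if (wofConfig.sqlite) {
      return getDBList(callback);
    }
    // Error-first callback: the first argument carries the failure.
    callback('Bundles are no longer supported!');
  };
}

// Fake database lister returning one illustrative filename.
const fakeGetDBList = (callback) => callback(null, ['whosonfirst-data-admin-fr-latest.db']);

// Callers should check the error argument before using the list.
function report(err, list) {
  if (err) {
    console.error(`cannot list Who's on First data: ${err}`);
    return;
  }
  console.log(`found ${list.length} database(s)`);
}

makeGetList({ sqlite: true }, fakeGetDBList)(report);
makeGetList({ sqlite: false }, fakeGetDBList)(report);
```

Because the error is a plain string rather than an `Error` object, callers that rely on `err.message` or a stack trace would need to handle that case.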
47 changes: 0 additions & 47 deletions src/components/loadJSON.js

This file was deleted.

10 changes: 0 additions & 10 deletions src/components/metadataStream.js

This file was deleted.

16 changes: 0 additions & 16 deletions src/components/parseMetaFiles.js

This file was deleted.
