Imports Wikidata JSON dumps into Neo4j in a meaningful way.
The importer takes JSON dumps from Wikidata, imports entities and properties, then generates the relations between them and their claims.
- Neo4j 3.0 +
- NodeJS 6.0 +
- Wikidata JSON file
We recommend having the following schema before starting the importer
CREATE CONSTRAINT ON (item:Item) ASSERT item.id IS UNIQUE;
CREATE CONSTRAINT ON (property:Property) ASSERT property.id IS UNIQUE;
CREATE CONSTRAINT ON (entity:Entity) ASSERT entity.id IS UNIQUE;
CREATE CONSTRAINT ON (claim:Claim) ASSERT claim.id IS UNIQUE;
CREATE INDEX ON :Entity(label);
We recommend having:
- 4+ CPU cores
- 16+ GB RAM
- ~200GB of storage
  - ~7GB for the Wikidata gzip file
  - ~120GB for the Wikidata JSON file
  - ~90GB for the Neo4j database
  - ~6GB for each compressed (tar.gz) backup file
You can configure Neo4j to not keep transaction logs, which will decrease the database size from ~90GB to ~25GB. Alternatively, after importing, shut down Neo4j, navigate to the storage files (graph.db/), delete all the neostore.transaction.db.* files, then start the database back up.
Tested on a 4-core, 16GB RAM VPS with SSDs from Hetzner; the full import took around 16 hours.
config.json
- Extra details for each property are given below
{
/* the Wikidata JSON file */
"file": "./wikidata-dump.json",
/* neo4j connection details */
"neo4j": {
/* bolt protocol URI */
"bolt": "bolt://localhost",
"auth": {
"user": "neo4j",
"pass": "password"
}
},
/* Stages */
"do": {
/* database cleanup */
"0": true,
/* importing items and properties */
"1": true,
/* linking entities and generating claims */
"2": true
},
/* extra console output on stage 2 */
"verbose": false,
/* how many commands will be run by the DB at a given time */
"concurrency": 4,
/* skip lines */
"skip": 0,
/* count of lines */
"lines": 21225524,
/* bucket size of entities sent to DB to process */
"bucket": 1000
}
Runs a simple command to clean the database
MATCH (n) DETACH DELETE n
Recommended: false
Reads the JSON file 16MB at a time and extracts bucket lines, then processes label, descriptions, aliases, id and type (in English, if available).
Recommended:
- true if the database is empty
- false if the database is already populated with all the entities
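To illustrate what this stage pulls out of each dump line, here is a hypothetical summarize() function over a parsed Wikidata entity. The field paths follow the Wikidata entity JSON format; the function itself is a sketch, not the importer's code:

```javascript
// Extract id, type, and the English label/description/aliases from one
// Wikidata entity object (one line of the dump, already JSON.parse'd).
function summarize(entity) {
  const en = (map) => (map && map.en ? map.en.value : null);
  return {
    id: entity.id,       // e.g. "Q42" for an item, "P31" for a property
    type: entity.type,   // "item" or "property"
    label: en(entity.labels),
    description: en(entity.descriptions),
    aliases: ((entity.aliases && entity.aliases.en) || []).map((a) => a.value),
  };
}
```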
Reads the JSON file 16MB at a time and extracts bucket lines, then processes links between entities proxied by properties.
Runs a couple of extra queries to link :Quantity -[UNIT_TYPE]-> :Item and :GlobeCoordinate -[GLOBE_TYPE]-> :Item.
The number of queries running on the DB at a given time. Note that the queries have different complexities and running times.
Recommended: the number of CPUs on the database system
The number of lines skipped from the start of the file. If something goes wrong, instead of rerunning the stage up to that point in the file, the importer just skips that number of lines.
Since the importer is idempotent, keeping it at 0 has no effect other than the runtime.
Recommended: 0
The number of lines the JSON file has.
If not provided, the importer will run wc -l <filename> to count the number of lines at runtime.
Recommended: run wc -l <filename> or another command yourself to count the lines
The number of lines that will be processed per query transaction. The lower the number, the more time is wasted in I/O; the higher the number, the more RAM will be used by both node and neo4j.
Recommended: 1000
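The bucketing described above can be sketched as a small helper that groups dump lines into arrays of bucket size, each of which would become one transaction. This is an illustrative sketch, not the importer's actual code:

```javascript
// Group an array of dump lines into buckets of at most `bucketSize` lines;
// the last bucket may be smaller.
function toBuckets(lines, bucketSize) {
  const buckets = [];
  for (let i = 0; i < lines.length; i += bucketSize) {
    buckets.push(lines.slice(i, i + bucketSize));
  }
  return buckets;
}
```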
Once finished importing, the database has 2 main node labels:
- Entity
- Claim
Entity nodes split into 2 types:
- Item: items like boat or car
- Property: properties like start time or location
Claims split into multiple types:
- String
- Commonsmedia
- ExternalId
- GlobeCoordinate
- Math
- Monolingualtext
- Quantity
- Time
A list of properties can be found here
All relations have the property by in common, which is the id of the property that binds the 2 nodes together. Some will also have an id to keep the idempotence property.
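For example, a query over the imported graph can follow relations bound by a known property id via by (P31, Wikidata's "instance of", is used here purely as an example):

```cypher
// Find pairs of items linked by relations whose binding property is P31.
MATCH (a:Item)-[r {by: "P31"}]->(b:Item)
RETURN a.id, b.id
LIMIT 10
```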
License can be found here