In this repository you can find the data files and queries used in the benchmarking section for the publication MillenniumDB: A Persistent, Open-Source, Graph Database.
The data used in this benchmark is based on the Wikidata Truthy dump from 2021-06-23. We cleaned the data by removing all triples whose predicate is not a direct property (i.e., http://www.wikidata.org/prop/direct/P*). The data is available to download from Google Drive.
The script to generate these data from the original data is in our source folder.
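The filtering rule described above can be illustrated with a minimal Python sketch. This is not the script from our source folder; the function names are hypothetical, and it assumes well-formed N-Triples input with one triple per line and full IRIs.

```python
# Prefix of a direct-property predicate as it appears in an N-Triples line.
DIRECT_PREFIX = "<http://www.wikidata.org/prop/direct/P"


def is_direct_property_triple(line: str) -> bool:
    """Return True if the predicate of an N-Triples line is a direct property."""
    # Subject and predicate never contain whitespace in N-Triples, so two
    # splits are enough; the remainder holds the object and the final '.'.
    parts = line.split(None, 2)
    return len(parts) == 3 and parts[1].startswith(DIRECT_PREFIX)


def filter_direct(lines):
    """Yield only the lines whose predicate is a direct property."""
    for line in lines:
        if is_direct_property_triple(line):
            yield line
```

Applied to an open .nt file, `filter_direct(f)` streams the file line by line, so the whole dump never has to fit in memory.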
Clone the MillenniumDB git repository and follow the instructions in the README.md to compile the project.
Use this script to transform the Wikidata .nt file into the MillenniumDB format.
The database creation is the same as in the README.md instructions, but we strongly recommend using a big buffer. In our case we used the parameter 9000000 (9000000 * 4096 bytes ≈ 37 GB):
build/Release/bin/create_db [path/to/import_file] [path/to/new_database_folder] -b 9000000
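To pick a buffer value for a machine with a different amount of RAM, the arithmetic above can be reproduced with a small Python sketch. The 4096-byte page size comes from the text; the helper names are hypothetical.

```python
PAGE_SIZE_BYTES = 4096  # page size stated in the text above


def buffer_size_bytes(num_pages: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Memory used by a buffer of `num_pages` pages."""
    return num_pages * page_size


def pages_for_memory(memory_bytes: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Largest number of pages that fits in `memory_bytes`."""
    return memory_bytes // page_size


if __name__ == "__main__":
    size = buffer_size_bytes(9_000_000)
    print(f"{size} bytes = {size / 10**9:.1f} GB")  # 36864000000 bytes = 36.9 GB
```

For example, `pages_for_memory(16 * 10**9)` gives the value to pass to -b if you want the buffer to use about 16 GB.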
Apache Jena requires a Java JDK (we used OpenJDK 11; other versions might work as well).
The installation may be different depending on your Linux distribution. For Debian/Ubuntu based distributions:
sudo apt update
sudo apt install openjdk-11-jdk
You can download Apache Jena from their website. The file you need to download will look like apache-jena-4.X.Y.tar.gz. In our case we used version 4.1.0, but this should also work for newer versions.
tar -xf apache-jena-4.*.*.tar.gz
cd apache-jena-4.*.*/
bin/tdbloader2 --loc=[path_of_new_db_folder] [path_of_nt_file]
This step is necessary only if you want to use the Leapfrog Jena implementation; you can skip it otherwise.
Edit the text file bin/tdbloader2index and search for the lines:
generate_index "$K1 $K2 $K3" "$DATA_TRIPLES" SPO
generate_index "$K2 $K3 $K1" "$DATA_TRIPLES" POS
generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
After those lines add:
generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
Now you can execute the bulk import in the same way we did it before:
bin/tdbloader2 --loc=[path_of_new_db_folder] [path_of_nt_file]
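The three added lines are the permutations of subject, predicate and object that the default loader does not index. The Python sketch below derives them; the mapping of $K1, $K2, $K3 to S, P, O is inferred from the existing SPO line, and the helper names are hypothetical.

```python
from itertools import permutations

# Orders already generated by the stock bin/tdbloader2index.
DEFAULT_ORDERS = ["SPO", "POS", "OSP"]

# Inferred from the existing line: "$K1 $K2 $K3" is labelled SPO.
KEY_FOR = {"S": "$K1", "P": "$K2", "O": "$K3"}


def missing_orders(existing):
    """Return the permutations of S, P, O that are not in `existing`."""
    all_orders = ["".join(p) for p in permutations("SPO")]
    return [order for order in all_orders if order not in existing]


def generate_index_line(order: str) -> str:
    """Build the generate_index line for one index order."""
    columns = " ".join(KEY_FOR[c] for c in order)
    return f'generate_index "{columns}" "$DATA_TRIPLES" {order}'


if __name__ == "__main__":
    for order in missing_orders(DEFAULT_ORDERS):
        print(generate_index_line(order))
```

Running it prints the same three lines shown above (SOP, PSO, OPS).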
To run the benchmark for Leapfrog Jena you also need to use a custom fuseki-server.jar:
- Install openjdk-8 and mvn, and set JAVA_HOME to use Java 8.
- Build the jena-leapfrog repository:
git clone https://github.com/cirojas/jena-leapfrog
cd jena-leapfrog
mvn clean install -Drat.numUnapprovedLicenses=100 -Darguments="-Dmaven.javadoc.skip=true" -DskipTests
- Use jena-fuseki2/apache-jena-fuseki/target/apache-jena-fuseki-3.9.0.tar.gz instead of the one you download normally.
Virtuoso has a problem with geo-datatypes so we generated a new .nt file to prevent them from being parsed as a geo-datatype.
sed 's/#wktLiteral/#wktliteral/g' [path_of_nt_file] > [virtuoso_nt_file]
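If sed is not available, the same substitution can be done with a few lines of Python. This is only a sketch of an equivalent step, with hypothetical function names; it assumes the .nt file is UTF-8 encoded.

```python
def lowercase_wkt_datatype(line: str) -> str:
    """Rename the wktLiteral datatype so Virtuoso does not parse it as geo data."""
    return line.replace("#wktLiteral", "#wktliteral")


def rewrite_nt(src_path: str, dst_path: str) -> None:
    """Stream src_path into dst_path, applying the rename to every line."""
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(lowercase_wkt_datatype(line))
```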
You can download Virtuoso from their GitHub repository. We used Virtuoso Open Source Edition, version 7.2.6.
- Download:
wget https://github.com/openlink/virtuoso-opensource/releases/download/v7.2.6.1/virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Extract:
tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Enter the folder:
cd virtuoso-opensource
- We start from their example configuration file:
cp database/virtuoso.ini.sample wikidata.ini
- Edit wikidata.ini with a text editor. When you edit a path, we recommend using the absolute path:
  - Replace every ../database/ with the path of the database folder you want to create.
  - Add the path of the folder where you have [virtuoso_nt_file] and the path of the database folder you want to create to DirsAllowed.
  - Change VADInstallDir to the path of virtuoso-opensource/vad.
  - Set NumberOfBuffers. For loading the data we used 7200000; to run experiments we used 5450000.
  - Set MaxDirtyBuffers. For loading the data we used 5400000; to run experiments we used 4000000.
  - Revise ResultSetMaxRows; our experiments set this to 1000000.
  - Revise MaxQueryCostEstimationTime; our experiments commented this out with ';' before the line, removing the limit.
  - Revise MaxQueryExecutionTime; our experiments used 600 for 10-minute timeouts.
  - Add these lines at the end of the file:
[Flags]
tn_max_memory = 2755359740
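Because the buffer settings differ between loading and running the experiments, it can be convenient to apply them programmatically. The Python sketch below rewrites `key = value` lines in the text of wikidata.ini; it is an illustration with hypothetical helper names, not part of our scripts, and it leaves lines that are already commented out with ';' untouched.

```python
import re

# Values taken from the list of edits above.
LOAD_SETTINGS = {"NumberOfBuffers": 7200000, "MaxDirtyBuffers": 5400000}
RUN_SETTINGS = {
    "NumberOfBuffers": 5450000,
    "MaxDirtyBuffers": 4000000,
    "ResultSetMaxRows": 1000000,
    "MaxQueryExecutionTime": 600,
}


def set_ini_value(text: str, key: str, value) -> str:
    """Replace the value on every uncommented `key = ...` line."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    return pattern.sub(lambda m: f"{m.group(1)}{value}", text)


def comment_out(text: str, key: str) -> str:
    """Prefix every uncommented `key = ...` line with ';'."""
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*=)", re.MULTILINE)
    return pattern.sub(lambda m: ";" + m.group(1), text)


def apply_settings(text: str, settings: dict) -> str:
    """Apply every key/value pair in `settings` to the INI text."""
    for key, value in settings.items():
        text = set_ini_value(text, key, value)
    return text
```

A typical use would be to read wikidata.ini, call `apply_settings(text, LOAD_SETTINGS)` before the import and `apply_settings(text, RUN_SETTINGS)` followed by `comment_out(text, "MaxQueryCostEstimationTime")` before the experiments, and write the result back.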
- Start the server:
bin/virtuoso-t -c wikidata.ini +foreground
  - This process won't end until you interrupt it (Ctrl+C). Let it run until the import ends. Run the next command in another terminal.
- bin/isql localhost:1111
  And inside the isql console run:
ld_dir('[path_to_virtuoso_folder]', '[virtuoso_nt_file]', 'http://wikidata.org/');
rdf_loader_run();
You'll need the following prerequisites installed:
- Java JDK (with $JAVA_HOME defined and $JAVA_HOME/bin on $PATH)
- Maven
- Git
The installation may be different depending on your Linux distribution. For Debian/Ubuntu based distributions:
sudo apt update
sudo apt install openjdk-11-jdk maven git
Blazegraph can't load big files in a reasonable time, so we need to split the .nt file into smaller files (1M lines each):
mkdir splitted_nt
cd splitted_nt
split -l 1000000 -a 4 -d --additional-suffix=.nt [path_to_nt]
cd ..
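For systems without GNU split, the chunking step can be approximated in Python. The sketch below mirrors the options used above (1000000 lines per file, 4-digit numeric suffix, .nt extension, and the default "x" prefix of split); the function name is hypothetical.

```python
from pathlib import Path


def split_lines(src, out_dir, lines_per_file=1_000_000, suffix_len=4):
    """Split `src` into files of at most `lines_per_file` lines each.

    Output files are named x0000.nt, x0001.nt, ... inside `out_dir`.
    Returns the number of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    out = None
    # Binary mode copies the bytes unchanged and avoids encoding problems.
    with open(src, "rb") as f:
        for i, line in enumerate(f):
            if i % lines_per_file == 0:
                if out is not None:
                    out.close()
                out = open(out_dir / f"x{count:0{suffix_len}d}.nt", "wb")
                count += 1
            out.write(line)
    if out is not None:
        out.close()
    return count
```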
git clone --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
cd wikidata-query-rdf
mvn package
cd dist/target
tar xvzf service-*-dist.tar.gz
cd service-*/
mkdir logs
- Edit the script file runBlazegraph.sh with any text editor.
  - Configure main memory here: HEAP_SIZE=${HEAP_SIZE:-"64g"} (you may use another value depending on how much RAM your machine has).
  - Set the log folder LOG_DIR=${LOG_DIR:-"/path/to/logs"}, replacing /path/to/logs with the absolute path of the logs dir you created in the previous step.
  - Add -Dorg.wikidata.query.rdf.tool.rdf.RdfRepository.timeout=600 to the exec java command to specify the timeout (value is in seconds).
  - Also change -Dcom.bigdata.rdf.sparql.ast.QueryHints.analyticMaxMemoryPerQuery=0, which removes per-query memory limits.
- Start the server:
./runBlazegraph.sh
- This process won't end until you interrupt it (Ctrl+C). Let this execute until the import ends. Run the next command in another terminal.
- Start the import:
./loadRestAPI.sh -n wdq -d [path_of_splitted_nt_folder]
- Download Neo4J community edition from their website https://neo4j.com/download-center/#community . We used version 4.3.5, but these instructions might work for newer versions.
- Extract the downloaded file
tar -xf neo4j-community-4.*.*-unix.tar.gz
- Enter the folder:
cd neo4j-community-4.*.*/
- Set the variable $NEO4J_HOME pointing to the Neo4J folder (using export and adding it to .bashrc/.zshrc)
Edit the text file conf/neo4j.conf
- Set
dbms.default_database=wikidata
- Uncomment the line
dbms.security.auth_enabled=false
- Add the line
dbms.transaction.timeout=10m
Use the script nt_to_neo4j.py to generate the .csv files entities.csv, literals.csv and edges.csv.
Execute the data import
bin/neo4j-admin import --database wikidata \
--nodes=Entity=wikidata_csv/entities.csv \
--nodes wikidata_csv/literals.csv \
--relationships wikidata_csv/edges.csv \
--delimiter "," --array-delimiter ";" --skip-bad-relationships true
You should have the .csv files in the wikidata_csv folder.
Now we have to create the index for entities:
- Start the server:
bin/neo4j console
  - This process won't end until you interrupt it (Ctrl+C). Let it run until the index creation is finished. Run the next command in another terminal.
- Open the cypher console:
bin/cypher-shell
  and inside the console run the command:
CREATE INDEX ON :Entity(id);
  - Even though the above command returns immediately, you have to wait until it is finished before interrupting the server. You can see the status of the index with the command
SHOW INDEXES;
In this benchmark we have 4 sets of queries:
- Basic Graph Patterns (BGPs):
  - Single BGPs: 399 queries
  - Multiple BGPs: 436 queries
  - Synthetic BGPs: 850 queries
- Property Paths: 1683 queries
We provide the SPARQL queries in our queries folder. We also provide the equivalent Cypher property paths (this set has fewer queries because some property paths cannot be expressed in Cypher).
Single BGPs, Multiple BGPs and Property Paths are based on real queries extracted from the Wikidata SPARQL query log. This log contains
millions of queries, but many of them are trivial to evaluate.
We thus decided to generate our benchmark from more challenging
cases, i.e., a smaller log of queries that timed out on the Wikidata
public endpoint. From these queries we extracted their BGPs and property paths removing duplicates (modulo isomorphism on query variables).
Then we filtered with the same criteria that we applied to the data, removing all queries having predicates that are not a direct property (http://www.wikidata.org/prop/direct/P*). Next, for property paths we removed queries that have both subject and object as variables, and for BGPs we removed queries having a triple in which subject, predicate and object are variables.
Finally, we distinguish BGP queries consisting of a single triple pattern (Single BGPs) from those containing more than one triple pattern (Multiple BGPs).
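The filtering and classification rules above can be summarised in a short Python sketch. It is an illustration only, not the code we used: it assumes each BGP is already parsed into (subject, predicate, object) tuples of strings, that variables start with "?", that predicates are written as full IRIs, and that variable predicates are allowed as long as the triple is not all variables. All names are hypothetical.

```python
DIRECT_PREFIX = "http://www.wikidata.org/prop/direct/P"


def is_var(term: str) -> bool:
    """SPARQL variables are assumed to start with '?'."""
    return term.startswith("?")


def keep_bgp(triples) -> bool:
    """Apply the BGP filters: direct-property predicates, no all-variable triple."""
    for s, p, o in triples:
        if is_var(s) and is_var(p) and is_var(o):
            return False
        # Assumption: only constant predicates are checked against the prefix.
        if not is_var(p) and not p.strip("<>").startswith(DIRECT_PREFIX):
            return False
    return True


def keep_property_path(subject: str, obj: str) -> bool:
    """Drop property paths whose subject and object are both variables."""
    return not (is_var(subject) and is_var(obj))


def classify_bgp(triples) -> str:
    """Single BGPs have exactly one triple pattern; the rest are Multiple BGPs."""
    return "single" if len(triples) == 1 else "multiple"
```

Duplicate removal modulo variable renaming is not covered by this sketch.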
Synthetic BGPs are queries based on a benchmark proposed in previous work. We had to generate new queries because the datasets are different. We used their jars and the corresponding properties.txt, which has all the properties of our dataset.
Here we provide a description of the scripts we used for the execution of the queries.
Our scripts will execute a list of queries for a certain engine, one at a time, and register the time and number of results for each query in a csv file.
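The measurement loop described above has roughly the following shape. This is a minimal sketch, not the code of our benchmark scripts: `execute` is a hypothetical callable that sends one query to an engine and returns the number of results, and the CSV column names are illustrative.

```python
import csv
import time


def run_queries(queries, execute, csv_path):
    """Run queries one at a time and record result count and elapsed time.

    `execute` is a placeholder for engine-specific code; it must take a
    query string and return the number of results.
    """
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["query_number", "results", "seconds"])
        for number, query in enumerate(queries, start=1):
            start = time.perf_counter()
            results = execute(query)
            elapsed = time.perf_counter() - start
            writer.writerow([number, results, f"{elapsed:.6f}"])
```

The list of queries can be read from a query file with one query per line, which matches the input format expected by the scripts.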
Every time you want to run a benchmark script you must first clear the system cache. To do this, run as root:
sync; echo 3 > /proc/sys/vm/drop_caches
Each script has a parameters section near the beginning of the file (e.g., database paths, output folder); make sure to edit the script to set them properly.
The scripts benchmark_bgps.py and benchmark_property_paths.py run the evaluation for BGPs and property paths, respectively. The same scripts work for SPARQL engines and MillenniumDB (they convert SPARQL into our query language). These scripts will automatically start and stop the servers. In both cases the input parameters are:
- The engine that will execute the queries (MILLENNIUM, JENA, JENALF, BLAZEGRAPH or VIRTUOSO)
- A single file containing all queries, one in each line
- The result set size limit
Example execution:
python src/benchmarking/benchmark_bgps.py MILLENNIUM queries/sparql_synthetic_bgps.txt 100000
python src/benchmarking/benchmark_property_paths.py MILLENNIUM queries/sparql_property_paths.txt 100000
For Neo4J you need to use other scripts: neo4j_benchmark_bgps.py and neo4j_benchmark_property_paths.py. Unlike the previous scripts, you need to manually start and stop the neo4j server. In both scripts the input parameters are:
- A single file containing all queries, one in each line. For BGPs the expected input is in SPARQL, but for property paths the expected input is in Cypher.
- The result set size limit
Example execution:
python src/benchmarking/neo4j_benchmark_bgps.py queries/sparql_synthetic_bgps.txt 100000
python src/benchmarking/neo4j_benchmark_property_paths.py queries/cypher_property_paths.txt 100000