Import Discogs' data dumps into PostgreSQL efficiently.
Uses Hexpat for parsing the XML and PostgreSQL's COPY to store the result.
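In spirit, the import loop looks something like the minimal sketch below. The label table, its name column, and the file name are hypothetical stand-ins (the real schema lives in sql/tables.sql), and real code must also escape COPY's special characters:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as L
import Database.PostgreSQL.Simple (close, connectPostgreSQL)
import Database.PostgreSQL.Simple.Copy (copy_, putCopyData, putCopyEnd)
import Text.XML.Expat.Proc (findChild, findChildren)
import Text.XML.Expat.Tree

main :: IO ()
main = do
  -- hexpat builds the tree lazily, so the whole dump never sits in memory
  raw <- L.readFile "discogs_20150807_labels.xml"
  let (root, _err) = parse defaultParseOptions raw
                       :: (UNode B.ByteString, Maybe XMLParseError)
  conn <- connectPostgreSQL "dbname=discogs"
  copy_ conn "COPY label (name) FROM STDIN"   -- text-format COPY
  mapM_ (putCopyData conn . row) (findChildren "label" root)
  _rows <- putCopyEnd conn
  close conn
  where
    -- one COPY line per <label>; \N is COPY's NULL marker. Production code
    -- must also escape backslash, tab and newline inside the values.
    row lbl = maybe "\\N\n" (\n -> textContent n <> "\n") (findChild "name" lbl)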
Inspired by discogs-xml2db.
I wanted Discogs' data in a PostgreSQL database, and I had some issues with the existing solution:
- it was slow: I let it run for 5 hours and it didn't finish
- parsing nested data from SAX events is too difficult to maintain, IMHO
- using INSERT is not efficient, and I didn't want UNLOGGED tables
- I wanted some data it didn't import, and didn't want some that it did
- performance: I wanted it to be fast on commodity hardware
The project uses stack, so installing should be as easy as:
$ stack install
and adding stack's binary install path to your PATH.
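On Linux, stack installs binaries to ~/.local/bin by default, so, for example:
$ export PATH="$HOME/.local/bin:$PATH"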
For a default everything-on-Ubuntu setup you would run, for example:
$ createdb discogs
$ psql discogs < source_dir/sql/tables.sql
$ discogs2pg -d 20150807 -c dbname=discogs
$ # wait an hour or two ...
$ psql discogs < source_dir/sql/indexes.sql
discogs2pg has two modes of operation: by file and by date.
You can import a single file:
$ discogs2pg -c ... some_releases.xml
or import the releases, artists, labels and masters dumps for a date DATE in the current directory:
$ discogs2pg -c ... -d DATE
If you want it to be even faster, and your computer can handle it, you can pass the optional --aggressive flag
to make it open all files in parallel.
Keep in mind that, since we are using COPY, this opens a connection per table: that's 15 connections at once.
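For example, to run the full import from above with every file in parallel:
$ discogs2pg -d 20150807 -c dbname=discogs --aggressive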
I don't have a Windows machine to build and test the binary on, so I'd love it if someone could look into that.
Also, I can think of two areas where help would be welcome:
- someone with experience in DBs could improve the data structure and indexes/primary keys
- someone with experience in structuring Haskell projects could help with the types/typeclasses and modules
I'd be glad to hear suggestions and take pull requests, of course.