From e24a50f784a95d0a0d462a92e2094c4e833bcab7 Mon Sep 17 00:00:00 2001
From: Alastair Porter
Date: Wed, 6 Jul 2022 17:30:52 +0200
Subject: [PATCH] Update documentation about dumps

---
 webserver/templates/index/downloads.html | 61 ++++++++++++++++++++----
 1 file changed, 53 insertions(+), 8 deletions(-)

diff --git a/webserver/templates/index/downloads.html b/webserver/templates/index/downloads.html
index 4e7a4b7e..4f72a785 100644
--- a/webserver/templates/index/downloads.html
+++ b/webserver/templates/index/downloads.html
@@ -11,22 +11,67 @@

Client Downloads

We no longer provide client tools to submit data.

If you are interested in computing acoustic features on your own music, you can still download the command-line essentia extractor and run it yourself:

-

SHA1 sums

+

SHA1 sums

Newer versions of the essentia extractor are also available on the essentia website.

Data Downloads

-

2022-06-20: We are in the process of finalising data dumps of the complete AcousticBrainz database.
- These dumps will be announced here and
+

2022-07-06: We provide downloadable archives of all submissions made to AcousticBrainz (29,460,584 submissions).

+

Low-level and High-level json dumps

+

High-level downloads
+ Low-level downloads +

+

+ Dumps are split into 30 archives, each with 1 million data files. Archives are compressed with
+ zstandard compression. File paths inside the archives are structured
+ so that all archives extract into the same directory tree.
+

+

+ Files in each archive are named according to the following structure: +

type/mb/i/mbid-n.json
+ Where type is one of lowlevel or highlevel,
+ mbid is a MusicBrainz Recording Identifier (a UUID), m,
+ b, i and d are the first, second,
+ third and fourth characters of the MusicBrainz
+ Identifier, and n is the zero-based submission offset that distinguishes multiple
+ data files submitted for the same MusicBrainz Identifier. There will always
+ be a file with submission offset 0.
+
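The naming scheme can be illustrated with a short Python sketch. It is not part of the patch or of the AcousticBrainz code: the helper names `dump_path` and `parse_dump_path` are hypothetical, the UUID in the usage note is arbitrary, and the sketch assumes that the two directory levels are the first two characters of the MBID followed by its third character, which is how the structure above reads.

```python
import re
import uuid

# Assumed layout: <type>/<first two MBID chars>/<third MBID char>/<mbid>-<n>.json
_DUMP_NAME = re.compile(
    r"^(lowlevel|highlevel)/([0-9a-f]{2})/([0-9a-f])/"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"-(\d+)\.json$"
)


def dump_path(kind, mbid, offset=0):
    """Build the archive-relative path for one submission (hypothetical helper)."""
    if kind not in ("lowlevel", "highlevel"):
        raise ValueError(f"unknown type: {kind!r}")
    if offset < 0:
        raise ValueError("submission offset must be >= 0")
    # uuid.UUID validates the identifier; str() gives lowercase hyphenated form.
    mbid = str(uuid.UUID(mbid))
    return f"{kind}/{mbid[:2]}/{mbid[2]}/{mbid}-{offset}.json"


def parse_dump_path(path):
    """Split a path back into (type, mbid, offset); raise ValueError if malformed."""
    match = _DUMP_NAME.match(path)
    if match is None:
        raise ValueError(f"not a dump path: {path!r}")
    kind, first_two, third, mbid, offset = match.groups()
    # The directory components must agree with the MBID they contain.
    if mbid[:2] != first_two or mbid[2] != third:
        raise ValueError(f"directory does not match MBID: {path!r}")
    return kind, mbid, int(offset)
```

For example, `dump_path("highlevel", "96685213-a25c-4678-9a13-abd9ec81cf35")` gives `highlevel/96/6/96685213-a25c-4678-9a13-abd9ec81cf35-0.json`, and `parse_dump_path` recovers the three parts from that string.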

+

+ The format of the JSON files in each archive is described on the data page.
+

+

Sample json dumps

+

Sample downloads

+

The same as the full dumps above, but containing only 100,000 items, for small-scale testing.

+

Low-level feature dumps

+

Feature downloads

+

Smaller CSV files containing some basic features that may be useful for some tasks, split into three files by feature type.
+ Each file contains 29,460,584 rows of data.
+

+ See the essentia documentation for streaming_extractor_music for
+ a description of each of these features.
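Because each feature file is stated to contain 29,460,584 rows, a row count is a cheap sanity check after download. The following is a minimal sketch using only the Python standard library. It assumes the file has already been decompressed to plain UTF-8 CSV. Whether the files carry a header row is not stated here, so that is left as a parameter. The function name `count_rows` is hypothetical.

```python
import csv


def count_rows(csv_path, has_header=True):
    """Count data rows in a CSV file without loading it into memory."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            # Assumption: the first row holds column names; skip it.
            next(reader, None)
        # The reader yields one row at a time, so memory use stays constant.
        return sum(1 for _ in reader)
```

Comparing the result with 29,460,584 would then indicate whether a download is complete, under the assumptions above.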

+

Pending: Data files for acoustic similarity

+

2022-07-06: We will provide a downloadable archive of the data files used in the
+ recording similarity API.

+

Pending: Low-level and High-level dump of deduplicated items

+

2022-07-06: We will provide new JSON and feature dumps of the database after de-duplicating it to a single instance of each recording MBID
+ (approximately 7 million items).

+

Pending dumps will be announced here and on the AcousticBrainz forum in the coming weeks. -

+