Update README.md

bioscan-ml · Jul 9, 2024 · cfecacb
1 parent f784621
commit cfecacb
Showing 1 changed file with 22 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -94,6 +94,28 @@ TODO: add the command for downloading the images and generating the hdf5 file.
 
 You can check [BIOSCAN-1M](https://github.com/zahrag/BIOSCAN-1M) and [BIOSCAN-5M](https://github.com/zahrag/BIOSCAN-5M) to download tsv files.
 
+## Data Structure in HDF5 Format
+
+The data is stored in HDF5 format with the following structure. Each HDF5 file contains multiple groups, and each group represents a different split of the data.
+
+### Group Structure
+
+Each group represents a specific data split and contains several datasets:
+
+- `image`: Stores the image data as byte arrays.
+- `image_mask`: Stores the length of each image byte array.
+- `barcode`: Stores DNA barcode sequences.
+- `family`: Stores the family classification of each sample.
+- `genus`: Stores the genus classification of each sample.
+- `order`: Stores the order classification of each sample.
+- `sampleid`: Stores the sample IDs.
+- `species`: Stores the species classification of each sample.
+- `processid`: Stores the process IDs for each sample.
+- `language_tokens_attention_mask`: Stores the attention masks for language tokens.
+- `language_tokens_input_ids`: Stores the input IDs for language tokens.
+- `language_tokens_token_type_ids`: Stores the token type IDs for language tokens.
+- `image_file`: Stores the filenames of the images.
+
 
 # Running experiments
 We recommend the use of [weights and biases](https://wandb.ai/site) to track and log experiments
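
A note on the "Data Structure in HDF5 Format" section added in this commit: the group layout it describes could be read with `h5py` roughly as sketched below. This is only an illustration under stated assumptions, not the repository's own loader. It assumes that `image` is a 2-D `uint8` array padded to a common width, that `image_mask` gives the number of valid bytes per row, and that the text fields are stored as byte strings. The group name `demo_split` and the helpers `write_toy_file` and `read_sample` are hypothetical and exist only so the example can run without the real data.

```python
import os
import tempfile

import h5py
import numpy as np

# Text fields named in the README section; any that are absent are skipped.
TEXT_FIELDS = ("barcode", "order", "family", "genus", "species",
               "sampleid", "processid", "image_file")


def write_toy_file(path):
    """Write a tiny file that mimics the ASSUMED layout (not real data)."""
    payloads = [b"\x01\x02\x03", b"\x04\x05"]  # stand-ins for encoded images
    max_len = max(len(p) for p in payloads)
    images = np.zeros((len(payloads), max_len), dtype=np.uint8)
    for i, payload in enumerate(payloads):
        images[i, :len(payload)] = np.frombuffer(payload, dtype=np.uint8)
    with h5py.File(path, "w") as f:
        group = f.create_group("demo_split")  # hypothetical split name
        group.create_dataset("image", data=images)
        group.create_dataset("image_mask",
                             data=np.array([len(p) for p in payloads]))
        group.create_dataset("species",
                             data=np.array([b"Species a", b"Species b"]))
        group.create_dataset("image_file",
                             data=np.array([b"a.jpg", b"b.jpg"]))


def read_sample(group, index):
    """Return the unpadded image bytes and available text fields for one sample."""
    length = int(group["image_mask"][index])  # number of valid bytes in the row
    record = {"image_bytes": group["image"][index][:length].tobytes()}
    for name in TEXT_FIELDS:
        if name in group:
            value = group[name][index]
            record[name] = value.decode("utf-8") if isinstance(value, bytes) else value
    return record


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        toy_path = os.path.join(tmp, "toy.hdf5")
        write_toy_file(toy_path)
        with h5py.File(toy_path, "r") as f:
            print("groups:", list(f.keys()))
            print(read_sample(f["demo_split"], 0))
```

With a real file, the image bytes returned by `read_sample` would still need to be decoded by an image library; that step is omitted because the README does not state the image encoding.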