Add DIOR dataset #2572

nilsleh · 2025-02-10T17:41:25Z

This PR adds the DIOR dataset. License found here. Dataset rehosted on Hugging Face based on this Google Drive link.

Dataset features:

* 20 classes
* 192,472 manually annotated bounding box instances

Dataset format:

* Images are three-channel .jpg files.
* Annotations are in XML format.
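
To illustrate the annotation format, here is a minimal sketch of reading bounding boxes from one XML file. It assumes a Pascal VOC-style layout (`object`, `name`, `bndbox` with `xmin`/`ymin`/`xmax`/`ymax`); the sample annotation and the `parse_boxes` helper are hypothetical and are not taken from this PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in a Pascal VOC-style layout (an assumption);
# the real DIOR files may use different tags or extra fields.
SAMPLE_XML = """<annotation>
  <filename>00001.jpg</filename>
  <size><width>800</width><height>800</height><depth>3</depth></size>
  <object>
    <name>airplane</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>ship</name>
    <bndbox><xmin>300</xmin><ymin>400</ymin><xmax>350</xmax><ymax>480</ymax></bndbox>
  </object>
</annotation>"""


def parse_boxes(xml_text: str) -> list[tuple[str, tuple[int, ...]]]:
    """Return (class name, (xmin, ymin, xmax, ymax)) for each object element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Read the four corner coordinates in a fixed order.
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((name, coords))
    return boxes


boxes = parse_boxes(SAMPLE_XML)
```

Only the standard library is used here, so a reader along these lines would not add any dependency.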

adamjstewart · 2025-02-12T11:19:38Z

docs/api/datasets/non_geo_datasets.csv

@@ -13,6 +13,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
 `Kenya Crop Type`_,S,Sentinel-2,"CC-BY-SA-4.0","4,688",7,"3,035x2,016",10,MSI
 `DeepGlobe Land Cover`_,S,DigitalGlobe +Vivid,-,803,7,"2,448x2,448",0.5,RGB
 `DFC2022`_,S,Aerial,"CC-BY-4.0","3,981",15,"2,000x2,000",0.5,RGB
+`DIOR`_,OD,Aerial,"CC-BY-SA","23,463",20,"800x800",0.5,RGB


Wish we knew which CC-BY-SA version this is; without a version number, it isn't a valid SPDX identifier.

@gcheng-nwpu do you know?

P.S. We are adding TorchGeo data loaders for your excellent DIOR and SODA-A datasets. Hopefully this will make it even easier for people to use your datasets and cite your papers!

@jbwang1997 may also know
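
For illustration, the SPDX concern above can be checked mechanically. This is a minimal sketch: `KNOWN_CC_BY_SA` is a hand-copied subset of the CC-BY-SA identifiers on the SPDX license list (an assumption, not the full list), and `is_known_cc_by_sa` is a hypothetical helper.

```python
# Hand-copied subset of CC-BY-SA identifiers from the SPDX license list.
# This is an assumption for illustration; the full list has more variants.
KNOWN_CC_BY_SA = {
    "CC-BY-SA-1.0",
    "CC-BY-SA-2.0",
    "CC-BY-SA-2.5",
    "CC-BY-SA-3.0",
    "CC-BY-SA-4.0",
}


def is_known_cc_by_sa(license_id: str) -> bool:
    """Return True only for a versioned identifier in the subset above."""
    return license_id in KNOWN_CC_BY_SA


bare_ok = is_known_cc_by_sa("CC-BY-SA")  # the value in the new CSV row
versioned_ok = is_known_cc_by_sa("CC-BY-SA-4.0")  # the Kenya Crop Type row
```

Under this check the bare `CC-BY-SA` string is rejected, while the versioned form used by the neighboring rows is accepted.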

adamjstewart · 2025-02-12T11:25:12Z

torchgeo/datasets/dior.py

+    If you use this dataset in your research, please cite the following paper:
+
+    * https://arxiv.org/abs/1909.00133
+


Should we mention that pyarrow is required? I'm noticing this dependency in a lot of your PRs. Is this file coming from the dataset authors or from you? I would like to avoid additional dependencies unless absolutely necessary.

This file is coming from me. Part of #2448

I haven't yet managed to reproduce that issue, so I'm not sure if we should force people to install extra dependencies just to avoid it.

Alright, I'll change it to CSV. Parquet is more performant and is today's standard, but switching will remove the dependency. Also, "forcing" sounds like a strong word; I don't think it prevents anyone from using this dataset, and a lot of the new datasets that are generally well done do come with this, like MMEarth, BigEarthNetV2, etc.
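
As a rough sketch of the CSV route, a small metadata file can be read with the standard library alone, with no pyarrow involved. The column names (`image`, `split`) and the `load_split` helper are hypothetical; the actual file layout in this PR is not shown in the thread.

```python
import csv
import io

# Hypothetical metadata contents; the real columns are an assumption.
SAMPLE_CSV = "image,split\n00001.jpg,train\n00002.jpg,val\n00003.jpg,train\n"


def load_split(csv_text: str, split: str) -> list[str]:
    """Return the image names whose 'split' column matches the given split."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["image"] for row in reader if row["split"] == split]


train_files = load_split(SAMPLE_CSV, "train")
```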

As it becomes more standard, we may start converting or supporting parquet more directly. I have similar feelings about WebDataset. I am really trying to minimize how many dependencies need to be installed to use TorchGeo, but we also want the datasets to be fast enough.

Parquet and GeoParquet have significantly better compression than GeoJSON for large vector datasets, so I think including it as a dependency is ideal.

Yes, but WebDataset is also better for I/O on HPC systems, mmrotate is better for object detection, etc. We need to decide a core set of dependencies we can't live without (required), then add optional dependencies sparingly. If someone wants to subclass and extend these datasets to be faster, they can (EarthNets is doing just that).

Obviously, this isn't a fair comparison. We avoid dependencies on certain libraries due to licensing, security, or difficulty of installation. My concern is less which dependencies we add and more how many dependencies we add. My priorities are roughly:

* Ease of installation: rules out things like mmrotate
* Licensing complexity: rules out things like YOLO
* Performance: in favor of things like WebDataset, parquet
* Security: rules out things like YOLO

Basically, performance only matters to me when the difference is big enough that someone would no longer want to use our implementation. If that's not the case, let's keep it simple.
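
One common way to keep a dependency optional, in the spirit of the required-versus-optional split discussed above, is to import it only at the point of use and fail with a clear message. The sketch below uses only `importlib`; `require_optional` is a hypothetical helper, not an existing TorchGeo function.

```python
import importlib
from types import ModuleType


def require_optional(name: str, hint: str = "") -> ModuleType:
    """Import an optional dependency, or raise ImportError with a clear message."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        msg = f"'{name}' is an optional dependency and is not installed."
        if hint:
            msg += f" {hint}"
        raise ImportError(msg) from err


# The standard-library json module stands in for an installed dependency.
json_module = require_optional("json")
```

A dataset that needs parquet support could call `require_optional("pyarrow")` inside the method that reads the file, so users who never touch that dataset are unaffected.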


nilsleh added 3 commits February 10, 2025 17:35

dataset

fcf5b97

init

08cd839

naming convention

4409f16

nilsleh added this to the 0.7.0 milestone Feb 10, 2025

nilsleh marked this pull request as draft February 10, 2025 17:41

github-actions bot added the documentation, datasets, and testing labels Feb 10, 2025

nilsleh and others added 6 commits February 10, 2025 18:41

Merge branch 'main' into dior

5165643

tests

d80136d

tests

3863f01

import skip

7d1f3c3

myp

cb5e807

Merge branch 'main' into dior

49fd69d

nilsleh marked this pull request as ready for review February 12, 2025 07:43

adamjstewart requested changes Feb 12, 2025


requests

391e747

adamjstewart mentioned this pull request Feb 25, 2025

Add SODA-A dataset #2575

Open


adamjstewart Feb 12, 2025

adamjstewart Feb 12, 2025

adamjstewart Feb 12, 2025

adamjstewart Feb 12, 2025

nilsleh Feb 17, 2025

adamjstewart Feb 18, 2025

nilsleh Feb 18, 2025

adamjstewart Feb 18, 2025

isaaccorley Feb 18, 2025 (edited)

adamjstewart Feb 18, 2025
