Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DIOR dataset #2572

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Add DIOR dataset #2572

wants to merge 10 commits into from

Conversation

nilsleh
Copy link
Collaborator

@nilsleh nilsleh commented Feb 10, 2025

This PR adds the DIOR dataset. License found here. Dataset rehosted on Huggingface based on this google drive link.

Dataset features:

* 20 classes
* 192,472 manually annotated bounding box instances

Dataset format:

* Images are three channel .jpg files.
* Annotations are in xml format

dior_0

@nilsleh nilsleh added this to the 0.7.0 milestone Feb 10, 2025
@nilsleh nilsleh marked this pull request as draft February 10, 2025 17:41
@github-actions github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets testing Continuous integration testing labels Feb 10, 2025
@nilsleh nilsleh marked this pull request as ready for review February 12, 2025 07:43
@@ -13,6 +13,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
`Kenya Crop Type`_,S,Sentinel-2,"CC-BY-SA-4.0","4,688",7,"3,035x2,016",10,MSI
`DeepGlobe Land Cover`_,S,DigitalGlobe +Vivid,-,803,7,"2,448x2,448",0.5,RGB
`DFC2022`_,S,Aerial,"CC-BY-4.0","3,981",15,"2,000x2,000",0.5,RGB
`DIOR`_,OD,Aerial,"CC-BY-SA","23,463",20,"800x800",0.5,RGB
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wish we knew which CC-BY-SA, without a version number it isn't a valid SPDX identifier.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gcheng-nwpu do you know?

P.S. We are adding TorchGeo data loaders for your excellent DIOR and SODA-A datasets. Hopefully this will make it even easier for people to use your datasets and cite your papers!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbwang1997 may also know

If you use this dataset in your research, please cite the following paper:

* https://arxiv.org/abs/1909.00133

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that pyarrow is required? I'm noticing this dependency in a lot of your PRs. Is this file coming from the dataset authors or from you? I would like to avoid additional dependencies unless absolutely necessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is coming from me. Part of #2448

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't yet managed to reproduce that issue, so I'm not sure if we should force people to install extra dependencies just to avoid it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, Ill change it to csv, parquet is more performant so today's standard, but that will remove the dependency. Also forcing sounds like a strong word, I don't think that's preventing anyone from using this dataset, and a lot of the new datasets that are generally well done do come with this, like MMEarth, BigEarthNetV2, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it becomes more standard, we may start converting or supporting parquet more directly. I have similar feelings about WebDataset. I am really trying to minimize how many dependencies need to be installed to use TorchGeo, but we also want the datasets to be fast enough.

Copy link
Collaborator

@isaaccorley isaaccorley Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet and GeoParquet has significantly better compression over for large vector datasets over geojson so I think including it as a dependency is ideal.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but WebDataset is also better for I/O on HPC systems, mmrotate is better for object detection, etc. We need to decide a core set of dependencies we can't live without (required), then add optional dependencies sparingly. If someone wants to subclass and extend these datasets to be faster, they can (EarthNets is doing just that).

Obviously, this isn't a fair comparison. We avoid dependencies on certain libraries due to licensing, security, or difficulty of installation. My concern is less which dependencies we add and more how many dependencies we add. My priorities are roughly:

  1. Ease of installation: rules out things like mmrotate
  2. Licensing complexity: rules out things like YOLO
  3. Performance: in favor of things like WebDataset, parquet
  4. Security: rules out things like YOLO

Basically, performance only matters to me when the difference is big enough that someone would no longer want to use our implementation. If that's not the case, let's keep it simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants