-
Notifications
You must be signed in to change notification settings - Fork 408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DIOR dataset #2572
base: main
Are you sure you want to change the base?
Add DIOR dataset #2572
Conversation
@@ -13,6 +13,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands | |||
`Kenya Crop Type`_,S,Sentinel-2,"CC-BY-SA-4.0","4,688",7,"3,035x2,016",10,MSI | |||
`DeepGlobe Land Cover`_,S,DigitalGlobe +Vivid,-,803,7,"2,448x2,448",0.5,RGB | |||
`DFC2022`_,S,Aerial,"CC-BY-4.0","3,981",15,"2,000x2,000",0.5,RGB | |||
`DIOR`_,OD,Aerial,"CC-BY-SA","23,463",20,"800x800",0.5,RGB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wish we knew which CC-BY-SA, without a version number it isn't a valid SPDX identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gcheng-nwpu do you know?
P.S. We are adding TorchGeo data loaders for your excellent DIOR and SODA-A datasets. Hopefully this will make it even easier for people to use your datasets and cite your papers!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbwang1997 may also know
If you use this dataset in your research, please cite the following paper: | ||
|
||
* https://arxiv.org/abs/1909.00133 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention that pyarrow is required? I'm noticing this dependency in a lot of your PRs. Is this file coming from the dataset authors or from you? I would like to avoid additional dependencies unless absolutely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is coming from me. Part of #2448
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't yet managed to reproduce that issue, so I'm not sure if we should force people to install extra dependencies just to avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, Ill change it to csv, parquet is more performant so today's standard, but that will remove the dependency. Also forcing sounds like a strong word, I don't think that's preventing anyone from using this dataset, and a lot of the new datasets that are generally well done do come with this, like MMEarth, BigEarthNetV2, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it becomes more standard, we may start converting or supporting parquet more directly. I have similar feelings about WebDataset. I am really trying to minimize how many dependencies need to be installed to use TorchGeo, but we also want the datasets to be fast enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet and GeoParquet has significantly better compression over for large vector datasets over geojson so I think including it as a dependency is ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but WebDataset is also better for I/O on HPC systems, mmrotate is better for object detection, etc. We need to decide a core set of dependencies we can't live without (required), then add optional dependencies sparingly. If someone wants to subclass and extend these datasets to be faster, they can (EarthNets is doing just that).
Obviously, this isn't a fair comparison. We avoid dependencies on certain libraries due to licensing, security, or difficulty of installation. My concern is less which dependencies we add and more how many dependencies we add. My priorities are roughly:
- Ease of installation: rules out things like mmrotate
- Licensing complexity: rules out things like YOLO
- Performance: in favor of things like WebDataset, parquet
- Security: rules out things like YOLO
Basically, performance only matters to me when the difference is big enough that someone would no longer want to use our implementation. If that's not the case, let's keep it simple.
This PR adds the DIOR dataset. License found here. Dataset rehosted on Huggingface based on this google drive link.
Dataset features:
Dataset format: