
Feat: Add compute_statistics subcommand #336

Open · wants to merge 6 commits into base: main
Conversation

fmartiescofet (Contributor)

This PR adds the compute_statistics subcommand to the terratorch CLI to compute the mean and std of a dataset.
It can be called with the same config file as any other subcommand: terratorch compute_statistics --config <file>.yaml

The **kwargs in the method are required because the Lightning CLI always passes the model parameter, which we need to consume, see here.
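An illustrative sketch (not the actual PR code; the function body and names are assumptions) of how a subcommand handler can swallow the model argument that Lightning CLI injects, via **kwargs:

```python
# Hypothetical sketch, not terratorch's real handler. Lightning CLI passes
# `model` to every subcommand, so the handler accepts **kwargs and discards
# the parameters it does not need.
def compute_statistics(datamodule, **kwargs):
    kwargs.pop("model", None)  # consume the unused `model` argument
    return f"computing statistics for {datamodule}"

result = compute_statistics("my_datamodule", model="unused_model")
```

The point is only that the extra parameter is accepted and ignored rather than raising a TypeError.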

Signed-off-by: Francesc Marti Escofet <[email protected]>
blumenstiel (Collaborator) left a comment


I like how you integrated it into the CLI tools. In general, I think it would be good to have this functionality in terratorch.utils so that it can also be called from Python code. Maybe you could just add a new statistics.py and have the Trainer call this function with the train dataloader?

I am not sure how we could handle custom datasets that do not fit the expected pattern. Maybe there could be a second option to pass only a folder instead of a config?
Not sure if this would generalise better. What do you think could work well?
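The kind of utility being proposed could look roughly like this sketch (function name, batch layout, and the numpy-based accumulation are assumptions, not terratorch's actual API):

```python
import numpy as np

def compute_statistics(dataloader):
    """Per-channel mean/std over batches shaped (B, C, H, W).

    Hypothetical helper in the spirit of the proposed terratorch.utils
    statistics module; accumulates sums and squared sums so the whole
    dataset never has to fit in memory at once.
    """
    n_pixels = 0
    channel_sum = None
    channel_sq_sum = None
    for batch in dataloader:
        x = np.asarray(batch, dtype=np.float64)
        c = x.shape[1]
        flat = x.transpose(1, 0, 2, 3).reshape(c, -1)  # (C, B*H*W)
        if channel_sum is None:
            channel_sum = np.zeros(c)
            channel_sq_sum = np.zeros(c)
        channel_sum += flat.sum(axis=1)
        channel_sq_sum += (flat ** 2).sum(axis=1)
        n_pixels += flat.shape[1]
    mean = channel_sum / n_pixels
    std = np.sqrt(channel_sq_sum / n_pixels - mean ** 2)
    return mean, std

# Toy "dataloader": two (B=2, C=3, H=4, W=4) batches of constant values.
batches = [np.ones((2, 3, 4, 4)), 3 * np.ones((2, 3, 4, 4))]
mean, std = compute_statistics(batches)
```

With the toy batches above, every channel sees half 1s and half 3s, so the per-channel mean is 2 and the population std is 1.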

terratorch/cli_tools.py (resolved)
terratorch/cli_tools.py (outdated, resolved)
fmartiescofet (Contributor, Author) commented Dec 23, 2024

The main problem I had while integrating it into the CLI is that a subcommand requires some parameters such as model... See here. My initial idea was to make the user pass only the dataset and create the dataloader manually inside the method, but this issue didn't allow it in an easy way (the user would always need to pass the model, making it odd to have a config with a dataset and an unused model). I therefore thought the best approach was to let the user reuse the same config they would use to call fit, leaving it to the user to make sure there is no randomness such as random cropping.

In the end, if we also allowed passing a folder, we would need to build the dataset inside the method and require more parameters to construct the generic dataset. This way, users can use their custom datasets (the only requirements are that the datamodule returns a dataloader from train_dataloader and that there is no randomness). Maybe I can document this clearly so users know how to use the subcommand. What do you think?

I agree with saving the output in YAML and separating it into a new file; I'll do it.

blumenstiel (Collaborator)

Thanks for the changes! You are right, we can expect some checks by the user to make sure it's working correctly.
But I think we can just add one check before running .setup('fit') to handle at least the generic datasets: we just need to replace train_transform with None.

if hasattr(datamodule, "train_transform"):
    datamodule.train_transform = None

What do you think?

About passing a dataset folder, I agree with you that it probably makes it more complicated. Users have to create a config anyway at some point, so they can just do it before computing the statistics.

Signed-off-by: Francesc Marti Escofet <[email protected]>
fmartiescofet (Contributor, Author)

@blumenstiel The issue with changing it in the datamodule is that it does not get changed in the dataset, as the dataset is already instantiated. I thought of overwriting the attribute in the dataset, but then we would overwrite user-defined transforms that they may want when computing the statistics. Also, we would need to keep the ToTensor transformation, since datasets usually return numpy arrays, and different datasets may use different methods to convert to tensors, which makes this hard.
I added some documentation about it, let me know what you think.

blumenstiel (Collaborator)

@fmartiescofet The datasets are built in .setup('fit'). If you overwrite it before, it should work.
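The ordering can be illustrated with a toy stand-in datamodule (purely hypothetical, not terratorch's or Lightning's real classes): because the dataset only captures the transform when .setup('fit') runs, overwriting train_transform beforehand does take effect.

```python
# Toy stand-in datamodule; illustrates the ordering only, not a real API.
class DummyDataModule:
    def __init__(self, train_transform):
        self.train_transform = train_transform
        self.train_dataset = None

    def setup(self, stage):
        if stage == "fit":
            # The dataset is built here, capturing the *current* transform.
            self.train_dataset = {"transform": self.train_transform}

dm = DummyDataModule(train_transform="random_crop")
if hasattr(dm, "train_transform"):
    dm.train_transform = None  # disable augmentations before datasets exist
dm.setup("fit")
# dm.train_dataset["transform"] is now None: the overwrite propagated
```

Had dm.setup("fit") already run, the dataset would have captured "random_crop" and the overwrite would have no effect, which is the concern raised above.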

blumenstiel (Collaborator) left a comment


Looks good, thanks!

blumenstiel (Collaborator)

@Joao-L-S-Almeida @romeokienzler I think we can merge this one.
