Support Time Series Datasets and sampling for GeoDatasets #877

Open
wants to merge 26 commits into main
Conversation

@nilsleh (Collaborator) commented Oct 30, 2022

This PR aims to support time-series datasets and sampling, following the discussion in #640. I will try to explain the thought process below, given my understanding of the existing torchgeo framework. The current sampling strategy I am considering is the following:

ForecastingGeoSampler

  • specifies a time range, and a duration for training and testing (e.g., train on one month of data, predict next week), then return pairs of inputs in an ordered sequential fashion

Limitations of the current RasterDataset:

  • limitation: currently a RasterDataset accepts a bounding box that includes the time dimension, but then calls the _merge_files function, where the time dimension is dropped in order to call rasterio.merge.merge and merge overlapping tiles for a given geographical region. As a consequence, if there are multiple scenes across time, this function simply disregards the time dimension
  • approach: I created a new TimeSeriesRasterDataset which inherits from RasterDataset and overrides the _merge_files method. As before, scenes are merged on their geographical location, but only after all scenes retrieved by a bounding box hit are grouped by their date, so that geographical areas are merged per single time step. Afterwards, they are concatenated along a new dimension to form a 4D tensor that includes the time dimension, and this 4D tensor is returned from the __getitem__ method
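The grouping step could look roughly like this (a minimal pure-Python sketch; group_and_stack is a hypothetical helper, not torchgeo API, and the real implementation would merge rasters with rasterio and stack arrays along a new time axis):

```python
from collections import defaultdict

def group_and_stack(hits, merge_fn):
    """Group bounding-box hits by date, merge each date spatially, and
    order the merged results by time (hypothetical helper, not torchgeo API).

    hits: iterable of (date, filepath) pairs returned by an index query.
    merge_fn: merges all files of one date into a single array/tensor.
    """
    by_date = defaultdict(list)
    for date, filepath in hits:
        by_date[date].append(filepath)
    # One merged scene per time step, sorted chronologically; stacking these
    # along a new leading axis yields the 4D [time, band, H, W] tensor.
    return [merge_fn(by_date[d]) for d in sorted(by_date)]
```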

Sampling/DataLoader:

  • limitation: to my understanding, when using a PyTorch DataLoader, one can either pass a sampler which gives instructions on how to retrieve a single sample (by returning a single bounding box), or a batch_sampler which samples one whole batch (by returning a list of bounding boxes). However, for time-series tasks we need "something in between": following the single-sample sampler, we need to return a tuple of bounding boxes that share the same geographic location but cover sequential time durations (one for the input sequence, and one for the target sequence that follows the input sequence in time)
  • idea: define a sampler which returns a tuple of bounding boxes (one for the input and one for the target sequence). To my understanding, GeoDatasets will mostly be used in conjunction with IntersectionDatasets, because they are not curated and hence only contain either inputs or targets (however one might define that for their use case). If the sampler returns a tuple, then the __getitem__ method of the IntersectionDataset accepts Union[Bbox, Tuple[Bbox, Bbox]] and retrieves the first sequence from the "image" dataset and the second sequence from the "mask" dataset
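A minimal sketch of that dispatch, assuming an illustrative BBox stand-in for torchgeo's BoundingBox (class and method names here are hypothetical, not the actual torchgeo API):

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple, Union

@dataclass(frozen=True)
class BBox:
    """Stand-in for torchgeo's BoundingBox (spatial extent + time range)."""
    minx: float
    maxx: float
    mint: float
    maxt: float

class PairIntersection:
    """Dispatches a single box to both datasets, or an (input, target)
    pair to the image and mask datasets respectively (sketch only)."""

    def __init__(self, image_ds, mask_ds):
        self.image_ds = image_ds
        self.mask_ds = mask_ds

    def __getitem__(self, query: Union[BBox, Tuple[BBox, BBox]]) -> Dict[str, Any]:
        if isinstance(query, BBox):
            # Classic single-query behaviour: same box for both datasets
            return {"image": self.image_ds(query), "mask": self.mask_ds(query)}
        # Time-series case: input sequence from "image", target from "mask"
        input_box, target_box = query
        return {"image": self.image_ds(input_box), "mask": self.mask_ds(target_box)}
```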

Current Issues:

  • The retrieval of possible geo bounding boxes is done through the rtree self.index.intersection. However, when the index is populated with scenes, each scene has a single time stamp, so when calling self.index.intersection we need to ensure that, for a given geo location, the hit contains all available time stamps that respect the input-sequence and target-sequence lengths we are interested in
  • Following the idea of specifying a "continuous" time range for the ForecastingGeoSampler (days, months, ...), how do you ensure that there are actually scenes that fall within a randomly chosen input or target time range, given that the scenes themselves carry discrete time stamps? The approach I saw in the PyTorchForecasting package is to define an increasing time_idx variable for all steps within a time series and then define the input-sequence and target-sequence lengths as the number of time indices to select from time_idx.

This current draft is just one idea and by no means exhaustive. I would still need to think about all kinds of edge cases and testing procedures. I am sure I missed or misunderstood several things, but hopefully it helps in thinking about a suitable implementation to support time-series tasks for GeoDatasets.

@nilsleh nilsleh marked this pull request as draft October 30, 2022 10:23
@github-actions github-actions bot added datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets labels Oct 30, 2022
@adamjstewart (Collaborator) commented:

Sorry it took so long to review this. Here are my thoughts on your approach.

Sampler

The trickiest aspect of the sampler is that there are so many different ways someone might want to sample a dataset. Consider the following corner cases/situations:

  • The user might want to specify a range of time to sample from, similar to spatial ROI
  • The user might want to specify a train/val/test split across the time axis, by percent or by time range
  • The user may want to do rolling window predictions (train on 3 days, predict next 24 hrs, shift 24 hrs)
  • The user may want to work with cyclic data, such as only using data during the agricultural growing season, but for several years
  • The user may require regular sequential data (every 16 days) or may allow irregular sequential data (as often as we have data)
  • In the latter case, this means the dimensions may not match from one batch to the next, or within a single batch
  • Depending on overlap of WRS tiles, some areas will have 4X the temporal coverage as others
  • When using IntersectionDataset, some datasets may have different temporal resolution than others

It feels difficult if not impossible to support all of these use cases with a single sampler. We'll probably need multiple samplers, or a lot of configuration options. It would be good to see what other time-series libraries call these samplers and what options they have.

Dataset

The problem with your approach is that we would need a TimeSeriesRasterDataset variant of every single RasterDataset subclass (Landsat-9, Sentinel-2, etc.). I think we can pretty easily avoid this limitation. Here are a couple of alternative ideas:

  1. Change RasterDataset.__getitem__ to accept multiple bounding boxes.

Basically, this would mean changing:

def __getitem__(self, query: BoundingBox) -> Dict[str, Any]:

to:

def __getitem__(self, *queries: BoundingBox) -> Union[Dict[str, Any], List[Dict[str, Any]]]:

If the sampler passes a single bounding box to the dataset, it behaves like it used to. If the sampler passes multiple bounding boxes, the dataset will return multiple separate sample dicts.

Added benefit: this would better support alternative sampling strategies like Tile2Vec or SeCo where we need to be able to return a (query, neighbor, distant) tuple per data point.
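One subtlety worth noting about the variadic signature: Python passes ds[a, b] to __getitem__ as a single tuple argument, so a sketch that supports both call styles has to branch on the query type (illustrative class, not torchgeo code):

```python
from typing import Any, Dict, List, Union

class MultiQueryRaster:
    """Sketch: one query returns a single sample dict as before; a tuple of
    queries returns one dict per bounding box, which also covers the
    (query, neighbor, distant) triplets used by Tile2Vec/SeCo-style sampling."""

    def _load(self, query) -> Dict[str, Any]:
        # Placeholder for the usual file lookup + merge for one box
        return {"bbox": query, "image": f"data@{query}"}

    def __getitem__(self, queries) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
        if isinstance(queries, tuple):
            # ds[box1, box2, box3] arrives here as one tuple of boxes
            return [self._load(q) for q in queries]
        return self._load(queries)
```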

  2. Change RasterDataset.__init__ to add a split_time (name TBD) parameter, and modify _merge_files to not merge across time depending on this flag.

In this case, the sampler would still return a single BoundingBox covering the entire time span, but the dataset wouldn't merge any files that are from different times. The dataset would return separate sample dicts, or a single sample dict with an extra dimension for time. This approach is limited since you can't pass multiple bounding boxes for different time splits; you just get the same data as before but without a time merge.

I have more thoughts on this (in addition to a time axis, we may want to optionally support a Z axis for 3D weather/climate data), but don't want to bog this PR down by thinking too broadly.

@nilsleh (Collaborator, Author) commented Nov 21, 2022

Thank you for your feedback. Your proposed integration/change within the RasterDataset is a better solution, so I am working on that at the moment:

Dataset

  • I am currently working on changing _merge_files
  • when changing the __getitem__ method, I thought about how this would be used with the IntersectionDataset, because for GeoDatasets one could have the case where you specify an image and a mask/target dataset and take the intersection over them to sample queries. Thus, I think we also have to think about changing the __getitem__ method of the IntersectionDataset so that, in the case of a time-series prediction, the first bounding box queries the image dataset and the second, time-sequential bounding box samples only the mask/target dataset. Thoughts on this?

Sampler

I agree that for all these specific cases different samplers could be a viable solution. I looked around at more classical time-series libraries and how they handle datasets. This is a little summary of some libraries I found (not exhaustive; I am not an experienced user, so I might not do them justice):

  • PyTorch Forecasting has a TimeSeriesDataSet class. This is all tabular data and, at a minimum, the class expects the dataframe to have a time_idx column that specifies time steps and a group_idx that indicates which time series a data point belongs to. Additionally, one specifies which columns hold static or time-varying and categorical/continuous variables.
value group time_idx
0.201798 0 0
0.389338 0 1
-0.285848 0 2
-0.044911 1 0
-0.127036 1 1
-0.495227 1 2
-0.343048 1 4

Missing time steps like in group 1 above can be filled on the fly if a flag is specified.

Additionally, one has to specify an encoder_length that represents the length of the input sequence to a model and a decoder_length that specifies the prediction target sequence length. Such time sequences are then sampled at random to generate input samples.
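The windowing principle is simple to state in code (a simplified sketch, not the pytorch-forecasting API): a training sample is only valid at a start index where encoder_length + decoder_length consecutive time steps exist.

```python
def valid_window_starts(time_idx, encoder_length, decoder_length):
    """Return the start indices of one series' time_idx column from which a
    full consecutive encoder + decoder window can be drawn (sketch only)."""
    total = encoder_length + decoder_length
    present = set(time_idx)
    return [t for t in sorted(present)
            if all(t + k in present for k in range(total))]
```

For group 1 in the table above (time_idx 0, 1, 2, 4) with encoder_length=2 and decoder_length=1, only start index 0 yields a complete window; the gap at time_idx 3 rules out the others.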

  • darts has a TimeSeries object which is also based on dataframes and follows a similar principle as described above. Here there are input_chunk_length and output_chunk_length parameters that decide what sequences to sample. Their dataset class also constructs a sequentially increasing time_idx from which the starting point is randomly sampled to retrieve an input_chunk_length + output_chunk_length long sequence for training.

  • GluonTS also relies on a pandas dataframe. For their dataset class they rely on a provided monotonically increasing timestep column. As far as I can tell, the sampling here also works as it does in the libraries above, where randomly drawn sequences from the time series provided in the dataframe are used to train the model

From what I can tell, if you wanted to do time-series prediction for the different tasks you mentioned in #640, such as cycles, seasonal data, or any other data-selection process that determines the input/output sequence, these libraries rely on you formatting your data yourself so that it contains the data at the desired frequency and length.

I guess what we are trying to do, in contrast, is to have a fixed dataset instance on which all kinds of different tasks are handled by sampling strategies that decide the time_idx column, frequency and length. For remote sensing data the latter approach also seems more necessary, because it would be much more of a burden to ask users to carefully construct their datasets from a collection of .tif files, which are more difficult to handle than a single dataframe that can be manipulated with pandas.

Edit

Looking into this again: time-series prediction can, in a sense, also be framed as video prediction. PyTorchVideo defines some clip samplers that could serve as inspiration.

@github-actions github-actions bot added the testing Continuous integration testing label Nov 22, 2022
@nilsleh (Collaborator, Author) commented Nov 22, 2022

This last commit contains a proposal for changing RasterDataset according to your two proposed changes. I also added some pytest tests for the dataset, but not yet for the samplers.

@nilsleh (Collaborator, Author) commented Jan 21, 2023

Given the complexity of supporting all the different time-series scenarios, my idea is to first focus on a single one that should already be quite powerful for standard tasks: a "RandomSequenceSampler" (name TBD) that randomly samples consecutive input and target sequences for a given input and target length during training.

I assume that a video by default has all possible time steps (frames) in each video sequence, i.e., no blanks in between. That makes sampling sequences quite straightforward in the video framework. The time-series libraries assume that the user provides a dataframe which includes a time_idx column specifying the consecutive data points within a time series. This also makes sampling index-based sequences easier, because one does not need to worry about whether the indices specify hours, days, months, etc.

  1. However, for geospatial data, I don't think we can expect the user to provide information that would give us indices to timestamped files, so it needs to be handled by the GeoDataset and sampler. I am not an expert with rtree, but maybe it would be reasonable to have a sampler argument time_unit: str (one of ["hour", "day", "month", "year"]) which gives the time unit of a single index step, and then two more arguments, input_window: int and target_window: int, which specify the lengths of the sequences. So for (time_unit="month", input_window=10, target_window=2) one would get random sequences that let the model train on 10 months and predict the following 2 months. These would have to be converted to an rtree-indexable time format given the time unit.

  2. Additionally, I am not sure we can assume that, given (time_unit="month", input_window=10, target_window=2), the available tiff images in the dataset actually respect the indices. For example, if I download Sentinel-2 imagery from Copernicus for two years, I don't think I can assume that any given 10-month subsequence actually contains 10 images. If there are fewer, the standard time-series libraries have a flag that does some NaN interpolation and fills the gap with some scheme (would we do something similar for raster data?). We might also have more images for a given time period and would have to somehow condense those into a representation of [10, 3, 256, 256] given the above example, since many model architectures expect constant sequence lengths.

  3. Another question is whether we can assume that for a given dataset all geospatial locations have all time sequences of interest available. Or if we don't, how we properly check for the scenarios where this is not the case.
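To illustrate point 1, here is a sketch (hypothetical helper, with crude month/year lengths) of converting (time_unit, input_window, target_window) into the epoch-second ranges an rtree index could be queried with:

```python
from datetime import datetime

# Crude per-unit lengths in seconds; months/years are approximations and
# would need calendar-aware handling in a real implementation
UNIT_SECONDS = {"hour": 3600, "day": 86400, "month": 30 * 86400, "year": 365 * 86400}

def window_time_ranges(start: datetime, time_unit: str,
                       input_window: int, target_window: int):
    """Return ((input_mint, input_maxt), (target_mint, target_maxt)) in
    epoch seconds, with the target range directly following the input range."""
    step = UNIT_SECONDS[time_unit]
    t0 = start.timestamp()
    input_range = (t0, t0 + input_window * step)
    target_range = (input_range[1], input_range[1] + target_window * step)
    return input_range, target_range
```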

So I am not sure on the preferred way of handling this, but could also be overthinking it. Either way, interested in opinions!

@nilsleh (Collaborator, Author) commented Jan 23, 2023

The above commits contain a new design proposal that samples consecutive sequences for time-series forecasting tasks.

It contains the following new pieces:

  • ForecastDataset: inherits from IntersectionDataset, with the idea that there are arguments for an input_dataset from which we sample input sequences and a target_dataset from which we sample the corresponding target sequences. The RasterDataset is changed back to only accept a single bounding box, but the __getitem__ method of ForecastDataset expects two bounding box queries (input and target) that are generated by the sampler
  • SequentialGeoSampler: specifies an input sequence length and a target sequence length, can restrict the time range to sample from, and allows time units that are meant to make it easier to specify input and target lengths

The idea behind the two is then to use it as follows:

class TimeSeriesInputRaster(RasterDataset):
    filename_glob = "test_*.tif"
    filename_regex = r"test_(?P<date>\d{8})_(?P<band>B0[234])"
    date_format = "%Y%m%d"
    is_image = True
    separate_files = True
    all_bands = ["B02", "B03", "B04"]

class TimeSeriesTargetRaster(RasterDataset):
    filename_glob = "test_*.tif"
    filename_regex = r"test_(?P<date>\d{8})_(?P<band>target)"
    date_format = "%Y%m%d"
    is_image = True
    separate_files = True
    all_bands = ["target"]

ds = ForecastDataset(
    input_dataset=TimeSeriesInputRaster(root, as_time_series=True),
    target_dataset=TimeSeriesTargetRaster(root, as_time_series=True)
)

sampler = SequentialGeoSampler(
    ds,
    sample_window=10,
    target_window=2,
    time_unit="months",
)

sample = ds[next(iter(sampler))]
input = sample["image"] # [sample_window, num_bands, height, width]
target = sample["target"] # [target_window, num_targets, height, width]

This would be used to learn a sequence of 10 months and predict the next two months.

A design choice I have not implemented yet is how to generate sampling hits with self.index.intersection(roi). Given both the spatial and time dimensions, the number of samples could be huge for a given patch_size. Questions about this and beyond:

  1. Would you first generate all possible time-sequence hits in a rolling-window fashion and then draw random geo-location samples from each time sequence? Or the other way around? And are there recommended approaches with rtree?
  2. Either way, it would probably be useful to specify a max_number_of_samples parameter that gives some control over the number of samples, like darts' num_samples_per_ts parameter?
  3. Assumptions about the available time steps of data: do we handle gaps and/or too many available time steps that could potentially be gathered by a bbox sample?
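For question 1, a two-level sketch of the time-first approach (plain Python, no rtree; all names illustrative): enumerate time windows in rolling fashion, then draw a few random spatial patches per window.

```python
import random

def sample_space_time(extent, time_steps, seq_len, patches_per_window,
                      patch_size, seed=0):
    """extent: (minx, maxx, miny, maxy); time_steps: sorted timestamps.
    Returns (minx, maxx, miny, maxy, mint, maxt) tuples (sketch only)."""
    rng = random.Random(seed)
    minx, maxx, miny, maxy = extent
    samples = []
    # Level 1: rolling window over the time axis with step size 1
    for i in range(len(time_steps) - seq_len + 1):
        mint, maxt = time_steps[i], time_steps[i + seq_len - 1]
        # Level 2: random geo locations within this time window
        for _ in range(patches_per_window):
            x = rng.uniform(minx, maxx - patch_size)
            y = rng.uniform(miny, maxy - patch_size)
            samples.append((x, x + patch_size, y, y + patch_size, mint, maxt))
    return samples
```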

I would appreciate feedback on this and also want to mention the following questions for later on:

  1. Dataset splitting into training, val, test. I think it would be amazing if we could provide some utilities for that as well, because it is a pain to deal with given the added time dimension
  2. ForecastingTask Class with basic models like Conv-LSTM, UNET-LSTM, some TimeSpatialTransformer

@nilsleh (Collaborator, Author) commented Jan 26, 2023

Maybe it is useful to determine some terminology that can be used for models, datasets, samplers and tasks for the context of time-series prediction. These are just a few terms encountered so far.

  • input sequence length: input_chunk_length in darts, encoder_length in pytorch-forecasting, a form of sequence_length in gluon-ts; 'window' is also sometimes encountered
  • prediction sequence length: output_chunk_length in darts, prediction_length in pytorch-forecasting, prediction_length in gluon-ts
  • what to call the keys in the data sample returned by the dataloader:
    • input for the input sequence images, and target for the target sequence images?
    • prepend the bbox and crs data with the above respectively?

@nilsleh (Collaborator, Author) commented Jan 26, 2023

I am not an expert with rtree but I think there might be a couple of things that could be changed.

  1. At the moment, the filename_regex is only supposed to pick up a single band with separate_files=True, otherwise the length of the dataset is confusing. However, we now need some control over extracting the date of every individual filename, but there won't be a match if the regex does not cover all band_names in __getitem__
  2. I don't quite understand how the for hit in self.index.intersection(tuple(self.roi), objects=True): line in the samplers works, because the number of samples is also dependent on the number of items in the index? And I think there is no way to have self.index.intersection sample only sequences of a given time duration, so extra steps are required
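On point 2: rtree's intersection returns every item whose box overlaps the space-time query, so restricting hits to a given time duration has to happen as a post-processing step, roughly like the following (illustrative only, with hits as plain (mint, maxt, payload) tuples rather than rtree items):

```python
def hits_in_window(hits, window_start, window_end):
    """Keep only hits whose time interval lies fully inside the window;
    rtree itself cannot express this per-sequence constraint (sketch)."""
    return [h for h in hits
            if window_start <= h[0] and h[1] <= window_end]
```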

@nilsleh nilsleh marked this pull request as ready for review July 28, 2023 09:16
@Spiruel commented Sep 4, 2023

@nilsleh Please may I know how to use these changes for a time series segmentation task?

I have a prechipped raster dataset with observations throughout the year, and a vector dataset for the segmentation labels. I'm trying to use the new RasterDataset logic alongside the SequentialGeoSampler, but I keep getting ValueError: 'a' cannot be empty unless no samples are taken.

Do you have a brief worked example to share?

Thanks

@nilsleh (Collaborator, Author) commented Sep 4, 2023

Hi @Spiruel, this PR is still a work in progress: it is not rigorously tested yet and has some unanswered design choices, so it is very likely to change, maybe even substantially. However, it's exciting to hear that you are looking to use it for your task, and I will try to help out. I would also be grateful for some feedback, because I have only tested it on a personal project (that I cannot share), while there are of course many more use cases out there.

From your error message, I am guessing that it occurs in this line, but to make sure, could you post a longer error trace? The error here would mean that during the __init__ call of the sampler (around here) no regions and/or timestamps were found that could serve as samples. This could have multiple reasons, but that would be my starting point.

In my setup the dataloading and sampling looks like in the comment above.

Maybe you can check that you can retrieve samples from your RasterDataset class without the SequentialGeoSampler, and also from the ForecastDataset? Common issues there are the filename_glob and the regex expression needed for the time. I hope this gives some pointers, but let me know.

@Spiruel commented Sep 4, 2023

Thanks - I'd be happy to offer feedback on this as I test it out on my use case and get my head around it. I've got a number of crop classification use cases that I should be able to try out once I get this working.

Yes, my error does occur on this line and I'll attempt to debug this further. There should be plenty of regions and timestamps to serve as samples.

I followed your setup above and adapted it to my time series segmentation task, based on my understanding. I am able to retrieve samples from my input_ds as normal using samplers like RandomGeoSampler and I get the correct spatial and temporal information (I checked the bbox, mint, maxt attributes etc and all looks good). However when I try and use SequentialGeoSampler to get time series, I get the error mentioned above.

# prechipped harmonised Landsat-Sentinel rasters across a calendar year
class TimeSeriesInputRaster(RasterDataset):
    filename_glob = "*.tif"
    filename_regex = r"^\d+_\d+_(?P<date>\d{8})T?.tif$"
    date_format = "%Y%m%d"
    is_image = True
    separate_files = False
    all_bands = ["B02", "B03", "B04"]

#yearly cover crop masks for oct-apr
class TimeSeriesTargetRaster(RasterDataset):
    filename_glob = "*_cc2class.tif"
    filename_regex = r"""
        ^(?P<date>\d+)
        _cc2class.*$
    """
    date_format = "%Y"
    is_image = True
    separate_files = False
    all_bands = ["target"]

input_root = "S30/4"
label_root = "data"

input_ds = TimeSeriesInputRaster(input_root, as_time_series=True)
label_ds = TimeSeriesTargetRaster(label_root, as_time_series=True)

print(len(input_ds))

# don't need a forecast dataset in this example
ds = ForecastDataset(
    input_dataset=input_ds,
    target_dataset=label_ds
)

sampler = SequentialGeoSampler(
    input_ds,
    encoder_length=3,
    prediction_length=1,
    time_unit="weeks",
    size=32,
    length=10
)

sample = ds[next(iter(sampler))]
input = sample["image"] # [sample_window, num_bands, height, width]
target = sample["target"] # [target_window, num_targets, height, width]

#expect input features of size [3, 14, 32, 32] and corresponding label of size [1, 2, 32, 32]

Let me know if this discussion is best had on this thread, or whether I should move it elsewhere! Thanks

Edit: Continued in #1544

mint, maxt = disambiguate_timestamp(date, self.date_format)

coords = (minx, maxx, miny, maxy, mint, maxt)
self.index.insert(i, coords, filepath)
i += 1

if self.as_time_series:
@nilsleh (Collaborator, Author) commented:
I kept this for development/debugging purposes but also might be nice to have access to the available dates of files when working with time series data. Maybe as a dict so there is no dependency on pandas?

"""
if self.cache:
vrt_fhs = [self._cached_load_warp_file(fp) for fp in filepaths]
else:
vrt_fhs = [self._load_warp_file(fp) for fp in filepaths]

bounds = (query.minx, query.miny, query.maxx, query.maxy)
dest, _ = rasterio.merge.merge(vrt_fhs, bounds, self.res, indexes=band_indexes)
if self.as_time_series:
@nilsleh (Collaborator, Author) commented:
I think we briefly discussed this, @adamjstewart; in my thinking this extra step is necessary, but maybe you disagree.

@@ -946,6 +1054,106 @@ def res(self, new_res: float) -> None:
self.datasets[1].res = new_res


class MultiQueryDataset(IntersectionDataset):
@nilsleh (Collaborator, Author) commented:
Not sure on the naming scheme here yet.

A collaborator replied:
I think I like this better than TupleDataset. Curious if we should override + or * to automatically create this from a GeoDataset like we do for & and |.

@nilsleh (Collaborator, Author) replied:
The current name is MultiQueryDataset, however, in our discussion, this is what I was referring to with TimeSeriesDataset

A collaborator replied:
@ might be the best operator to use if we want to talk about "time". Maybe + if it's more generic and supports more than just time series. But I don't know what other applications it might be used for.

input_samples = self.transforms(input_samples)
target_samples = self.transforms(target_samples)

samples = {
@nilsleh (Collaborator, Author) commented:
Currently, this is intended for time-series data, as we are also returning the dates, for example. I am not sure what other cases a more general TupleDataset would require, and also how to handle possibly overlapping dictionary keys from the individual samples that are retrieved; would one enumerate them, for example? Then again, I think it is nice to have the clarity between "input" and "target" for time-series tasks.

@@ -628,6 +627,18 @@ def unbind_samples(sample: dict[Any, Sequence[Any]]) -> list[dict[Any, Any]]:
return _dict_list_to_list_dict(sample)


def interpolate_samples(
@nilsleh (Collaborator, Author) commented:
This is intended to be a transform that fills gaps of irregularly sampled time series. To be determined. Other transform functions are possible if we apply them at the sample level; otherwise we also have to provide a collate_fn, since samples from the TimeWindowGeoDataset are not guaranteed to have the same time dimension.
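A minimal sketch of one possible gap-filling scheme, forward fill by time index (purely illustrative; real raster samples would be tensors rather than the scalars used here, and this helper is not part of the PR):

```python
def forward_fill(present, values, full_length):
    """Pad an irregular sequence to a fixed length by repeating the most
    recent available time step. present: sorted indices that have data;
    values: the data at those indices. Leading gaps stay None."""
    lookup = dict(zip(present, values))
    filled, last = [], None
    for t in range(full_length):
        if t in lookup:
            last = lookup[t]
        filled.append(last)
    return filled
```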

@@ -156,6 +161,315 @@ def __len__(self) -> int:
return self.length


class TimeWindowGeoSampler(GeoSampler):
@nilsleh (Collaborator, Author) commented:
Not sure what people think of the name, but I thought it's a bit more general. I have slightly extended the sampler to have the following functionalities:

  1. consecutive argument: if true, the target sequence follows the input sequence without any overlap in the time dimension. If false, the target sequence will be the last prediction_length time steps of the input sequence. This is intended to cover the use case where the prediction is supposed to be the last input-sequence step rather than a forecast into the future.
  2. time_delta argument: this specifies a time delta or gap between the input and the target sequence. I think this is sometimes called lead time, where you want a time gap before the first prediction time step

So I think this sampler could support the following three use cases:

  1. Time Series "Nowcasting": so for a time-series of images predict the latest target
  2. Time Series "Forecasting": based on a historical sequence of images predict one or more future time steps
  3. Time Series "Forecasting" with lead time: have a time gap between input and target sequence

I think it would not be difficult to support another use case that gives more flexibility to the rolling-window sampling we do over the time dimension. At the moment we are implicitly assuming a step size of 1 for the next sequence, but there could also be a shift argument that says how many time steps to move forward before generating the next input time sequence, enabling rolling-window prediction (train on 3 days, predict next 24 hrs, shift 24 hrs).
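The described options can be sketched as an index-based window generator (the real sampler works on timestamps; consecutive, time_delta and shift are the arguments discussed above, everything else is illustrative):

```python
def time_windows(n_steps, input_len, target_len,
                 consecutive=True, time_delta=0, shift=1):
    """Return (input_indices, target_indices) pairs over n_steps time steps."""
    pairs = []
    start = 0
    while start + input_len <= n_steps:
        inp = list(range(start, start + input_len))
        if consecutive:
            # Forecasting: targets follow the inputs, after an optional gap
            t0 = inp[-1] + 1 + time_delta
        else:
            # Nowcasting: targets are the last input steps
            t0 = inp[-1] - target_len + 1
        tgt = list(range(t0, t0 + target_len))
        if tgt[-1] >= n_steps:
            break
        pairs.append((inp, tgt))
        start += shift
    return pairs
```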
