Support Time Series Datasets and sampling for GeoDatasets #877
Conversation
Sorry it took so long to review this. Here are my thoughts on your approach.

Sampler

The trickiest aspect of the sampler is that there are so many different ways someone might want to sample a dataset. Consider the following corner cases/situations:
It feels difficult, if not impossible, to support all of these use cases with a single sampler. We'll probably need multiple samplers, or a lot of configuration options. It would be good to see what other time-series libraries call these samplers and what options they have.

Dataset

The problem with your approach is that we would need a
Basically, this would mean changing `def __getitem__(self, query: BoundingBox) -> Dict[str, Any]:` to `def __getitem__(self, *queries: BoundingBox) -> Union[Dict[str, Any], List[Dict[str, Any]]]:`. If the sampler passes a single bounding box to the dataset, it behaves like it used to. If the sampler passes multiple bounding boxes, the dataset will return multiple separate sample dicts. Added benefit: this would better support alternative sampling strategies like Tile2Vec or SeCo, where we need to be able to return a (query, neighbor, distant) tuple per data point.
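The proposed signature change can be sketched with a toy stand-in class (all names here are illustrative, not the actual torchgeo API). One subtlety worth noting: `ds[a, b]` reaches `__getitem__` as the single tuple `(a, b)`, so dispatching on the argument's type is more robust than a `*queries` parameter:

```python
from typing import Any, NamedTuple


class BoundingBox(NamedTuple):
    """Minimal stand-in for torchgeo's BoundingBox (illustrative only)."""
    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float


class ToyGeoDataset:
    """Toy dataset showing one way the proposed signature could work."""

    def _get_one(self, q: BoundingBox) -> dict[str, Any]:
        # Stand-in for real raster loading keyed by a bounding box.
        return {"bounds": q, "image": f"patch t=[{q.mint}, {q.maxt}]"}

    def __getitem__(self, query):
        # ds[bbox] -> one dict; ds[bbox_a, bbox_b] -> list of dicts.
        if isinstance(query, BoundingBox):
            return self._get_one(query)           # old single-query behavior
        return [self._get_one(q) for q in query]  # multiple separate samples


ds = ToyGeoDataset()
single = ds[BoundingBox(0, 1, 0, 1, 0.0, 1.0)]
pair = ds[BoundingBox(0, 1, 0, 1, 0.0, 1.0), BoundingBox(0, 1, 0, 1, 1.0, 2.0)]
```

Returning a list of per-query dicts (rather than one merged dict) is what makes the (query, neighbor, distant) tuple case fall out for free.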
In this case, the sampler would still return a single BoundingBox covering the entire time span, but wouldn't merge any files if they are from different times. The dataset would return separate sample dicts, or a single sample dict with an extra dimension for time. This approach is limited since you can't pass multiple bounding boxes for different time splits; you just get the same data as before but without a time merge. I have more thoughts on this (in addition to a time axis, we may want to optionally support a Z axis for 3D weather/climate data), but don't want to bog this PR down by thinking too broadly.
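The group-by-date-then-stack idea behind the extra time dimension can be sketched minimally, using `np.maximum.reduce` as a stand-in for the spatial mosaicking that `rasterio.merge.merge` would actually perform:

```python
from collections import defaultdict

import numpy as np


def merge_per_timestep(tiles: list[tuple[float, np.ndarray]]) -> np.ndarray:
    """Group tiles by timestamp, merge each group spatially, then stack.

    np.maximum.reduce stands in for rasterio.merge.merge; real code
    would mosaic by geographic bounds. Returns a (T, H, W) array with
    a new leading time dimension.
    """
    groups: dict[float, list[np.ndarray]] = defaultdict(list)
    for t, tile in tiles:
        groups[t].append(tile)
    merged = [np.maximum.reduce(groups[t]) for t in sorted(groups)]
    return np.stack(merged, axis=0)


tiles = [
    (1.0, np.ones((2, 2))),
    (1.0, np.zeros((2, 2))),      # second tile at the same timestamp
    (2.0, np.full((2, 2), 5.0)),
]
video = merge_per_timestep(tiles)  # shape (2, 2, 2): two timesteps, 2x2
```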
Thank you for your feedback. Your proposed integration/change within the Dataset
Sampler

I agree that for all these specific cases different samplers could be a viable solution. I did some looking around at more classical time-series libraries and how they handle datasets. This is a little summary of some libraries I found (not exhaustive, and I am not an experienced user, so I might not do them justice):
Missing time steps like in group 1 above can be filled on the fly if a flag is specified. Additionally, one has to specify a
From what I can tell, if you wanted to do a time-series prediction for the different tasks you mentioned in #640, such as cycles, seasonal data, or any other given data selection process that determines the input/output sequence, these libraries rely on you formatting your data on your own in such a way that it includes the data in your desired frequency and length. I guess what we are trying to do, in contrast, is to have a fixed dataset instance on which all kinds of different tasks will be handled by sampling strategies that decide the

Edit: Looking into this again, time-series prediction can in a sense also be framed as video prediction. Pytorchvideo defines some clip samplers that could serve as inspiration.
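The dataframe-with-`time_idx` convention these libraries use can be illustrated by a small window generator, assuming a sorted, gap-free `time_idx` (names here are illustrative):

```python
def sliding_windows(
    time_idx: list[int], input_len: int, target_len: int
) -> list[tuple[list[int], list[int]]]:
    """Return (input_indices, target_indices) pairs over a regular time_idx.

    Assumes time_idx is sorted and gap-free, as the dataframe-based
    libraries require; the window advances one step at a time.
    """
    total = input_len + target_len
    return [
        (time_idx[s : s + input_len], time_idx[s + input_len : s + total])
        for s in range(len(time_idx) - total + 1)
    ]


windows = sliding_windows(list(range(6)), input_len=3, target_len=2)
# first pair: ([0, 1, 2], [3, 4])
```

The fixed-frequency assumption is exactly what a GeoDataset cannot make, which is where the two approaches diverge.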
This last commit contains a proposal for changing
Given the complexity of supporting all the different time-series scenario tasks, my idea is to first begin by focusing on a single one that should already be quite powerful for standard tasks. This is to begin with a "RandomSequenceSampler" (name tbd) that randomly samples consecutive input and target sequences for a given input and target length during training, like
I assume that a video by default has all possible time steps (frames) in each video sequence, so not a video with blanks in between. That makes sampling sequences quite straightforward in the video framework. For the time-series libraries, the assumption is that the user provides a dataframe which includes a time-idx column specifying the consecutive data points within a time series. This also makes sampling index-based sequences easier, because one does not need to worry about whether the indices specify hours, days, months, etc.
So I am not sure of the preferred way of handling this, but I could also be overthinking it. Either way, interested in opinions!
The above commits contain a new design proposal that samples consecutive sequences for time-series forecasting tasks. It contains the following new pieces:
The idea behind the two is then to use it as follows:
This would be used to learn a sequence of 10 months and predict the next two months. A design choice I have not implemented yet is how to do sampling hits with
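A hypothetical sketch of the sampler/dataset contract being described: the sampler yields `(input, target)` bounding-box pairs that share a spatial extent, with the target window immediately following the input window in time (all names below are placeholders, not this PR's actual classes):

```python
from typing import Iterator, NamedTuple


class BoundingBox(NamedTuple):
    """Illustrative stand-in for torchgeo's BoundingBox."""
    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float


def paired_time_windows(
    area: tuple[float, float, float, float],
    months: list[float],
    input_length: int = 10,
    prediction_length: int = 2,
) -> Iterator[tuple[BoundingBox, BoundingBox]]:
    """Yield (input_bbox, target_bbox) pairs over a monthly time axis.

    Both boxes share the same spatial extent; the target window starts
    where the input window ends (no gap, no overlap).
    """
    minx, maxx, miny, maxy = area
    total = input_length + prediction_length
    for i in range(len(months) - total):
        split, end = months[i + input_length], months[i + total]
        yield (
            BoundingBox(minx, maxx, miny, maxy, months[i], split),
            BoundingBox(minx, maxx, miny, maxy, split, end),
        )


# 13 month boundaries -> one pair: learn 10 months, predict the next 2
pairs = list(paired_time_windows((0, 1, 0, 1), [float(m) for m in range(13)]))
```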
I would appreciate feedback on this and also want to mention the following questions for later on:
Maybe it is useful to settle on some terminology that can be used for models, datasets, samplers, and tasks in the context of time-series prediction. These are just a few terms encountered so far.
I am not an expert with rtree but I think there might be a couple of things that could be changed.
@nilsleh Could you explain how to use these changes for a time series segmentation task? I have a pre-chipped raster dataset with observations throughout the year, and a vector dataset for the segmentation labels. I'm trying to use the new classes. Do you have a brief worked example to share? Thanks
Hi @Spiruel, this PR is still ongoing work: it is not rigorously tested yet and has some unanswered design choices, so it is very likely to change, maybe even substantially. However, it's exciting to hear that you are looking to use it for your task at hand, and I will try to help out. I would also be grateful for some feedback, because I have only tested it on a personal project (that I cannot share), while there are of course many more different use cases out there. From your error message, I am guessing that it occurs in this line, but to make sure, could you post a longer error trace? The error here would mean that during the

In my setup, the dataloading and sampling look like in the comment above. Maybe you can check that you can retrieve samples from your
Thanks - I'd be happy to offer feedback on this as I test it out on my use case and get my head around it. I've got a number of crop classification use cases that I should be able to try out once I get this working. Yes, my error does occur on this line and I'll attempt to debug this further. There should be plenty of regions and timestamps to serve as samples. I followed your setup above and adapted it to my time series segmentation task, based on my understanding. I am able to retrieve samples from my
Let me know if this discussion is best had on this thread, or whether I should move it elsewhere! Thanks

Edit: Continued in #1544
mint, maxt = disambiguate_timestamp(date, self.date_format)

coords = (minx, maxx, miny, maxy, mint, maxt)
self.index.insert(i, coords, filepath)
i += 1

if self.as_time_series:
I kept this for development/debugging purposes but also might be nice to have access to the available dates of files when working with time series data. Maybe as a dict so there is no dependency on pandas?
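A plain-dict version of such a dates index might look like this (a sketch with illustrative names, avoiding the pandas dependency):

```python
from collections import defaultdict


def dates_index(files: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Plain-dict alternative to a pandas DataFrame: map each date to
    the filepaths available at that date (names are illustrative)."""
    index: dict[str, list[str]] = defaultdict(list)
    for filepath, date in files:
        index[date].append(filepath)
    return dict(index)


idx = dates_index([("a.tif", "2020-01"), ("b.tif", "2020-01"), ("c.tif", "2020-02")])
```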
""" | ||
if self.cache: | ||
vrt_fhs = [self._cached_load_warp_file(fp) for fp in filepaths] | ||
else: | ||
vrt_fhs = [self._load_warp_file(fp) for fp in filepaths] | ||
|
||
bounds = (query.minx, query.miny, query.maxx, query.maxy) | ||
dest, _ = rasterio.merge.merge(vrt_fhs, bounds, self.res, indexes=band_indexes) | ||
if self.as_time_series: |
I think we briefly discussed this, @adamjstewart; in my thinking this extra step is necessary, but maybe you disagree.
@@ -946,6 +1054,106 @@ def res(self, new_res: float) -> None:
    self.datasets[1].res = new_res


class MultiQueryDataset(IntersectionDataset):
Not sure on the naming scheme here yet.
I think I like this better than TupleDataset. Curious if we should override `+` or `*` to automatically create this from a GeoDataset like we do for `&` and `|`.
The current name is `MultiQueryDataset`; however, in our discussion, this is what I was referring to with `TimeSeriesDataset`.
`@` might be the best operator to use if we want to talk about "time". Maybe `+` if it's more generic and supports more than just time series. But I don't know what other applications it might be used for.
torchgeo/datasets/geo.py (Outdated)
input_samples = self.transforms(input_samples)
target_samples = self.transforms(target_samples)

samples = {
Currently, this is intended for time series data, as we are also returning the dates, for example. I am not sure what other cases a more general `TupleDataset` would need to cover, nor how to handle possibly overlapping dictionary keys from the individual samples that are retrieved; would one enumerate them, for example? Then again, I think it is nice to have the clarity between "input" and "target" for time series tasks.
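One option for the overlapping-keys question is to prefix rather than nest, so each retrieval's keys stay distinguishable (sketch only; `combine_samples` is not part of this PR):

```python
from typing import Any


def combine_samples(
    input_sample: dict[str, Any], target_sample: dict[str, Any]
) -> dict[str, Any]:
    """Resolve overlapping keys by prefixing rather than nesting, so
    e.g. 'image' from each retrieval stays distinguishable (sketch only)."""
    combined = {f"input_{k}": v for k, v in input_sample.items()}
    combined.update({f"target_{k}": v for k, v in target_sample.items()})
    return combined


s = combine_samples({"image": 1, "dates": [1]}, {"image": 2, "dates": [2]})
```

Prefixing keeps the sample a flat dict, which plays more nicely with the default collate function than nested dicts of tensors would.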
@@ -628,6 +627,18 @@ def unbind_samples(sample: dict[Any, Sequence[Any]]) -> list[dict[Any, Any]]:
    return _dict_list_to_list_dict(sample)


def interpolate_samples(
This is intended to be a transform that fills gaps in irregularly sampled time series. To be determined. Other transform functions are possible if we apply them at the sample level; otherwise, we also have to provide a `collate_fn`, since samples from the `TimeWindowGeoDataset` are not guaranteed to have the same time dimension.
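As a sketch of what such a gap-filling transform could do, here is linear interpolation of an irregular series onto a regular time grid (scalar values stand in for per-step images; the function name is illustrative, not the PR's API):

```python
import numpy as np


def interpolate_to_regular(
    times: np.ndarray, values: np.ndarray, step: float
) -> tuple[np.ndarray, np.ndarray]:
    """Fill gaps in an irregularly sampled series by linear interpolation
    onto a regular time grid. Scalar values stand in for per-step images;
    the function name is illustrative."""
    grid = np.arange(times[0], times[-1] + step, step)
    return grid, np.interp(grid, times, values)


# time step 2.0 is missing and gets linearly interpolated
grid, filled = interpolate_to_regular(
    np.array([0.0, 1.0, 3.0]), np.array([0.0, 2.0, 6.0]), step=1.0
)
```

After such a transform, every sample shares the same time dimension, so the default collate function works without a custom `collate_fn`.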
@@ -156,6 +161,315 @@ def __len__(self) -> int:
    return self.length


class TimeWindowGeoSampler(GeoSampler):
Not sure what people think of the name, but I thought it's a bit more general. I have slightly extended the sampler to have the following functionalities:

- `consecutive` argument: if true, the target sequence follows the input sequence without any overlap in the time dimension. If false, the target sequence will be the last `prediction_length` time steps of the input sequence, ending at its end point. This is intended to cover the use case where the prediction is supposed to be the last input sequence step, rather than forecasting into the future.
- `time_delta` argument: this specifies a time delta, or gap, between the input and the target sequence. I think it is sometimes also called lead time, where you want a time gap before the first prediction time step.
So I think this sampler could support the following three use cases:
- Time Series "Nowcasting": so for a time-series of images predict the latest target
- Time Series "Forecasting": based on a historical sequence of images predict one or more future time steps
- Time Series "Forecasting" with lead time: have a time gap between input and target sequence
I think it would not be so difficult to support another use case that gives more flexibility to the rolling window sampling we do over the time dimension. At the moment, we are implicitly assuming a step size of 1 for the next sequence, but there could also be a shift argument that says how many time steps to move forward before generating the next input sequence, e.g. for rolling window prediction (train on 3 days, predict the next 24 hrs, shift by 24 hrs).
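The proposed shift argument could work roughly like this index-level sketch (illustrative only, not the PR's implementation):

```python
def rolling_windows(
    n_steps: int, input_len: int, target_len: int, shift: int = 1
) -> list[tuple[range, range]]:
    """Sketch of the proposed shift argument: advance `shift` time steps
    between consecutive (input, target) window pairs instead of 1."""
    windows = []
    start = 0
    while start + input_len + target_len <= n_steps:
        windows.append(
            (
                range(start, start + input_len),
                range(start + input_len, start + input_len + target_len),
            )
        )
        start += shift
    return windows


# e.g. hourly data: train on 72 h, predict the next 24 h, shift by 24 h
w = rolling_windows(n_steps=120, input_len=72, target_len=24, shift=24)
```

With `shift=1` this reduces to the current implicit behavior, so the argument would be backwards compatible.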
This PR aims to support time-series datasets and sampling, following the discussion in #640. I will try to explain the thought process, given my understanding of the existing torchgeo framework. The current sampling strategy I am considering is a `ForecastingGeoSampler`.

Limitation of the current RasterDataset:

- `RasterDataset` accepts a bounding box that includes the time dimension, but then calls the `_merge_files` function, where the time dimension is dropped in order to call `rasterio.merge.merge` to merge overlapping tiles for a given geographical region. As a consequence, if there were multiple scenes across time, this function just disregards the time dimension.
- `TimeSeriesRasterDataset` inherits from `RasterDataset` and overwrites the `_merge_files` method. Here, like before, scenes are merged on their geographical location, but only after all scenes retrieved from a bounding box hit are grouped by their date, so that only geographical areas are merged per single time step. Afterwards, they are concatenated along a new dimension to form a 4D tensor that includes the time dimension, and this 4D tensor is returned from the `__getitem__` method.

Sampling/DataLoader:

- With a `DataLoader`, one can either pass a `sampler`, which gives instructions on how to retrieve a single sample (by returning a single bounding box), or a `batch_sampler`, which samples one whole batch (by returning a list of bounding boxes). However, for time-series tasks we need "something in between", because following the single sampler, we need to return a tuple of bounding boxes that have the same geographic location but a sequential time duration (one for the input sequence, and one for the target sequence that is sequential in time to the input sequence).
- The `__getitem__` method of the intersection dataset accepts `Union[Bbox, Tuple[Bbox, Bbox]]` and retrieves the first input sequence from the "image" dataset and the second input sequence from the "mask" dataset.

Current issues:

- Sampling relies on `self.index.intersection`; however, when the index is populated with scenes, each has a single time stamp. When calling `self.index.intersection`, we would need to ensure that for a given geo location the hit contains all available time stamps that respect the input sequence and target sequence lengths that we are interested in.
- Classical time-series libraries assume a `time_idx` variable for all steps within a time series and define the input sequence and target sequence lengths as the number of time indices to select from `time_idx`.

This current draft is just one idea and by no means exhaustive. I would still need to think about all kinds of edge cases and testing procedures. I am sure I missed or misunderstood several things, but hopefully it helps in thinking about a suitable implementation to support time-series tasks for GeoDatasets.