Segmentation mask data #108
@sfmig @niksirbi I'm trying to think about an intuitive way to represent segmentation masks that lets us reuse as much of the functionality that is already in place, and could use your input when you have time. :-) I still have no data, it's all hypothetical, so there's absolutely no rush!!! I'm thinking that maybe a version that's pretty much exactly like RLE could work - here a row-wise RLE:
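A rough sketch of what such a row-wise table might look like (the column names are hypothetical, chosen only so the filtering examples below read naturally - one row per occupied image row, with the run stored as its first and last x pixel):

```r
# Hypothetical row-wise RLE layout: one row per (time, individual, y) run,
# storing the run as the first and last occupied x pixel on that image row.
segmented_data <- data.frame(
  time       = c(1, 1, 1, 1),
  individual = c("a", "a", "b", "b"),
  y          = c(5, 6, 5, 6),
  x_min      = c(20, 21, 40, 42),
  x_max      = c(50, 46, 60, 58)
)
```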
This would make it quite easy to do filtering; without changing the data, one could then do:

```r
segmented_data |>
  group_by(time, y) |>
  mutate(x_min = filter_sgolay(x_min),
         x_max = filter_sgolay(x_max))
```

Alternatively, which might aid in plotting the shape:
```r
segmented_data |>
  group_by(time, y, x_boundary) |>
  mutate(x = filter_sgolay(x))
```

Here's an example which includes plotting (it might make sense to make a modified version of ...):

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

data <- data.frame(
  time = rep(1, 8),
  individual = c(rep("a", 4), rep("b", 4)),
  y = c(5, 5, 6, 6, 5.5, 5.5, 8, 8),
  x = c(20, 50, 21, 46, 40, 60, 50, 60),
  x_boundary = c("min", "max", "min", "max", "min", "max", "min", "max")
)

glimpse(data)
#> Rows: 8
#> Columns: 5
#> $ time       <dbl> 1, 1, 1, 1, 1, 1, 1, 1
#> $ individual <chr> "a", "a", "a", "a", "b", "b", "b", "b"
#> $ y          <dbl> 5.0, 5.0, 6.0, 6.0, 5.5, 5.5, 8.0, 8.0
#> $ x          <dbl> 20, 50, 21, 46, 40, 60, 50, 60
#> $ x_boundary <chr> "min", "max", "min", "max", "min", "max", "min", "max"

data |>
  group_by(individual, time) |>
  arrange(if_else(x_boundary == "min", y, -y)) |>
  ggplot(aes(x, y, fill = individual, colour = individual)) +
  geom_polygon(alpha = 0.5)
```

Created on 2025-02-04 with reprex v2.1.1
Thanks for tagging me Mikkel! I would have to take the time to think about your RLE encoding idea before I can give a full answer.
@SkepticRaven Sorry for pinging you! I just wanted to hear, given you've worked with segmentation data, whether you think the above makes sense - or whether you would mind sharing some details about how you store it? As far as I could see in the shared data there's only pose estimation and no segmentation - is that correct, or am I missing something?
Actually, the last example will break down if the mask crosses the row twice, e.g. two separate legs - then we won't know which min/max values belong together and will need an extra variable to keep track of them (see the sketch just below). So maybe the x_min/x_max version is better... will have to test further. Okay, it's actually a bit of a headache. If we add a variable (...), this will mess with filtering especially, but plotting will also need some thought...
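A rough sketch of what that extra variable could look like (entirely hypothetical - a per-row run index so that two separate runs on the same image row stay paired):

```r
# Hypothetical: two runs on the same y-row (e.g. two legs of individual "a"),
# disambiguated by a run index so each x_min/x_max pair stays together.
segmented_data <- data.frame(
  time       = 1,
  individual = "a",
  y          = c(5, 5),
  run        = c(1, 2),      # first and second crossing of row y = 5
  x_min      = c(20, 35),
  x_max      = c(28, 50)
)
```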
I'm always happy to help with segmentation. The link in the other thread does share a big dataset of ours. The segmentation data is present, albeit pretty hidden. To be more specific than the other thread, if you want to get your hands on it...
Design considerations that we had in our group while working on it:
Why I chose to use a padded contour matrix is largely because I could then rely on OpenCV's contour functions.
-- As for using RLE, if possible I would recommend using coco's wrappers, licensing permitting (it uses simplified BSD). If you still want to code it from scratch... Traditionally RLE requires a list of 3 values to be stored: (start, length, value). Typically, the image shape is stored somewhere such that the flattened run positions can be mapped back to 2D pixel coordinates.
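If coding it from scratch in R, a minimal sketch of such a (start, length, value) encoding might look like the following (function names are hypothetical; base R's rle() does the heavy lifting, and the stored image shape is what lets you map flat indices back to 2D):

```r
# Hypothetical from-scratch RLE on a flattened (column-major) binary mask.
rle_encode_mask <- function(mask) {
  r <- rle(as.integer(mask))                      # base R run-length encoding
  starts <- cumsum(c(1L, head(r$lengths, -1L)))   # 1-based start index of each run
  data.frame(start = starts, length = r$lengths, value = r$values)
}

# Decoding needs the image shape, which is why it has to be stored alongside the runs.
rle_decode_mask <- function(runs, nrow, ncol) {
  flat <- rep(runs$value, runs$length)
  matrix(as.logical(flat), nrow = nrow, ncol = ncol)
}

mask <- matrix(FALSE, 5, 5)
mask[2:4, 2:3] <- TRUE
runs <- rle_encode_mask(mask)
identical(rle_decode_mask(runs, 5, 5), mask)  # TRUE
```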
@SkepticRaven Wow, thanks a ton, that's super comprehensive! Will have to go through that at pace. 😄 Ah, that's why I couldn't find them - I just assumed that the ...

Here I'm working in R, not Python, so I may not have the luxury of depending on OpenCV unfortunately - but they are really great to know about. I'm trying to figure out whether there's a tabular format I can use; R doesn't handle n-dimensional arrays/matrices as gracefully as Python (the infrastructure is not really there for it).

Do you yourself consider filtering/smoothing of the masks? If so, how? My best idea is currently to treat the outline as a ...

Regarding RLE, I get that they typically are 1-dimensional; do you have any feeling for whether a row-wise RLE (or column-wise) increases data size significantly? My issue with the one-dimensionality is that it's much harder for users to look at the data and understand where in space they are, and that it makes plotting a great deal more difficult; but again, maybe a simple on-the-fly conversion could do the trick.

I think your design considerations align quite closely with what I envision; my main potential bonus needs are: ...

Thanks again!
Again, just thinking out loud here. So for the smoothing, maybe we'd need to: ...
It occurs to me that I'm of course thinking about temporal smoothing; one could also simply do spatial smoothing of e.g. the outline in each frame (I came across a blog post that does this on iOS), but I'm thinking of smoothing across frames (while maybe also doing spatial smoothing - would temporal filtering achieve that too?).
Does all the data need to be human readable? A potential compromise could be storing human-readable summaries (e.g. centroid, bounding box) next to the raw data (non-readable RLE/contours).

RLE 2D vs 1D: I don't know the numbers on compression differences, but roughly you would be going from 3 values to 4 (+33%). There could be special cases in which it's slightly more than that, but those should be unlikely with animal tracking (e.g. run length > image width should never occur, because the animal generally doesn't span the entire frame). Making it a constant-length tabular format rather than variable-length unfortunately works against compression. As ugly as the padded matrices that we use are, none of the other approaches I tested came close. Here are a couple of older tests (on some 800x800 frame data with 3 animals) that I had written down when selecting the padded tables: ...
The segmentation data alone, at the best compression, accounts for roughly 50% of our "pose" file's total footprint. I unfortunately didn't include a comparison with RLE, mostly because I wasn't aware of coco's functions at the time. It's likely in the same ballpark as contour compression.

Smoothing/Filtering/Interpolation: The only filtering we do is by contour area (i.e. areas < 50px are likely noisy predictions, not the animal). I've thought about other filters quite a bit and haven't really been happy with them, particularly since they impose temporal statistics (e.g. linear interpolation for keypoints makes the variance of velocity = 0) that we happen to look at when trying to predict behavior. We also don't do any of these for our pose data.

For spatial smoothing, I would recommend a mix of erosion and dilation morphological filters. You'd need to check, but you may be able to get them in R using a ...

Temporal smoothing, I would argue, is a lot more difficult. You can use a gradient method, but I personally don't think you should. It doesn't understand important concepts relevant to animal tracking: translation and rotation. I'd recommend leaving smoothing to the downstream features (e.g. mask centroid interpolation makes more sense).
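For illustration, a minimal base-R sketch of what erosion/dilation on a binary mask matrix could look like, written from scratch so it doesn't assume any particular imaging package (the helper names are made up, and a dedicated image-processing library would be faster):

```r
# Shift a logical mask by (di, dj), padding with FALSE at the borders.
shift_mask <- function(mask, di, dj) {
  out <- matrix(FALSE, nrow(mask), ncol(mask))
  i <- seq_len(nrow(mask)); j <- seq_len(ncol(mask))
  ok_i <- i[i - di >= 1 & i - di <= nrow(mask)]
  ok_j <- j[j - dj >= 1 & j - dj <= ncol(mask)]
  out[ok_i, ok_j] <- mask[ok_i - di, ok_j - dj]
  out
}

# Erosion: a pixel survives only if its whole (2r+1) x (2r+1) neighbourhood is TRUE.
erode_mask <- function(mask, r = 1) {
  out <- mask
  for (di in -r:r) for (dj in -r:r) out <- out & shift_mask(mask, di, dj)
  out
}

# Dilation: a pixel turns on if any pixel in its neighbourhood is TRUE.
dilate_mask <- function(mask, r = 1) {
  out <- mask
  for (di in -r:r) for (dj in -r:r) out <- out | shift_mask(mask, di, dj)
  out
}

# Morphological "opening" (erode then dilate) removes small speckle noise;
# "closing" (dilate then erode) fills small holes.
open_mask  <- function(mask, r = 1) dilate_mask(erode_mask(mask, r), r)
close_mask <- function(mask, r = 1) erode_mask(dilate_mask(mask, r), r)
```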
Man, am I learning SO much - thanks a ton for playing ball! Just to understand the contour format: is it something like ... ?

Regarding the smoothing, maybe you're right that it makes most sense to do it on the features of interest; I'm just always wary, as movement ecologists often highlight that smoothing should happen on the raw movement, not on its derivatives. But maybe it'll be simpler to just do it on e.g. the centroid, heading, area, etc.

For myself, maybe it could make sense to write functions to convert between RLE, contours, and full binary frame stacks. There are quite a few R packages that would allow working with the contours, most notably sf and others that implement GEOS, which I'll likely end up using regardless.
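As a rough illustration of the sf route (a sketch only - the contour coordinates are invented, and it assumes one closed outer ring per individual per frame), a contour can be wrapped in an sf polygon, after which centroid, area and bounding box come essentially for free via GEOS:

```r
library(sf)

# Hypothetical contour for one individual in one frame: an n x 2 matrix of
# (x, y) boundary points; sf expects the ring to be closed (first == last point).
contour <- rbind(
  c(20, 5), c(50, 5), c(46, 6), c(21, 6), c(20, 5)
)

poly <- st_polygon(list(contour))

st_centroid(poly)   # mask centroid
st_area(poly)       # mask area (in squared pixels, since the coordinates are pixels)
st_bbox(poly)       # bounding box (xmin, ymin, xmax, ymax)
```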
Just chiming in to say this discussion is pure gold! Thanks to both of you, I'm learning a lot.
Great conversation, thanks for tagging me @roaldarbol! Always sparking great discussions 🌟 And thanks @SkepticRaven for sharing your insight from actual experience fighting the data.

I am very interested in understanding these issues better, as in both movement and ethology we would like to make it very easy to transform between bboxes, keypoints and segmentation masks. Below are some comments on the above discussion, and some questions that it would be great if you could clarify when you get some time :)

@SkepticRaven when you say you represent segmentation masks as "a padded contour matrix", are you roughly doing the following? ...
From your comment above I understand that representing masks in this way facilitates using opencv contour methods (which in turn facilitates later computation of invariant features), but that reading the data in this way is not significantly faster than an RLE representation (or a binary mask). Is that right?

Re temporal smoothing, I agree with @SkepticRaven that it seems particularly tricky: if I understand correctly how the contour points are computed, we don't know which contour point in frame t corresponds to which contour point in frame t+1.

If the mask is noisy in time (e.g. the area changes a lot from one frame to the next, because a contour point is wrongly detected far from the rest in a certain frame), I guess a reasonable thing would be to try and smooth the contour per frame. Maybe smoothing splines applied to the contour points are helpful here? But that would fall into the category of spatial smoothing. Morphological filters seem like a nice idea - or, if applied to the binary mask, maybe Gaussian or median convolutional filters could smooth out the edges.
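For per-frame spatial smoothing of the outline, here's a deliberately simple sketch that uses a circular moving average over the contour points rather than a spline - just to show the idea of smoothing a closed outline; the function name and window size are arbitrary:

```r
# Hypothetical: smooth a closed contour (n x 2 matrix of x/y points) with a
# circular moving average; k is the half-width of the averaging window.
smooth_contour <- function(contour, k = 2) {
  n <- nrow(contour)
  wrap <- function(i) ((i - 1) %% n) + 1   # indices wrap around the closed outline
  out <- contour
  for (i in seq_len(n)) {
    window <- wrap((i - k):(i + k))
    out[i, ] <- colMeans(contour[window, , drop = FALSE])
  }
  out
}
```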
You just about nailed it, @sfmig. To explicitly expand on the way my group currently stores the data...
For example, if you selected data ...

When generating the data, we use ...

For reading speeds, mostly correct. I would state that there is an overall read+compute speed advantage because (a) it costs less to read direct data vs read + convert, and (b) most of the feature functions run faster on the compressed contour data (similarly, calculating centroid/bbox using the coco helper functions on RLE'd data would be faster than doing the same on frame masks).

--

Smoothing splines and blurs (Gaussian or median) could also work pretty well for spatial smoothing. I don't have any hard evidence, but one area where they are likely to produce the most different results is long + thin diagonals in masks. Splines will typically try to preserve the structure, while blurs + morphological filtering will typically remove it. These are pretty easy to play around with in tools like GIMP and Inkscape.

I mostly chose morphological filtering over blurs because we can actually run them on GPUs embedded in the network (old code example). Of course blurs could also be run on-GPU, but I have a vague recollection that these were slightly more efficient at the time. That most likely has changed, since that experiment was on ~CUDA 7.0 and we're now on CUDA 12.8 with GPUs that actually have new core types.
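To make the padded-contour idea concrete for anyone reading along, here is a rough, hypothetical sketch in R - not necessarily the exact layout described above - where every contour is padded with NA to a common length so that all individuals in a frame fit in one fixed-shape array:

```r
# Hypothetical padded contour storage for a single frame: an array of shape
# (max_points, 2, n_individuals), NA-padded so every contour has the same length.
contour_a <- rbind(c(20, 5), c(50, 5), c(46, 6), c(21, 6))                 # 4 points
contour_b <- rbind(c(40, 5.5), c(60, 5.5), c(60, 8), c(50, 8), c(45, 7))   # 5 points

max_points <- max(nrow(contour_a), nrow(contour_b))
pad <- function(m, n) rbind(m, matrix(NA_real_, n - nrow(m), 2))

contours <- array(
  c(pad(contour_a, max_points), pad(contour_b, max_points)),
  dim = c(max_points, 2, 2),
  dimnames = list(NULL, c("x", "y"), c("a", "b"))
)

contours[, , "a"]   # individual "a": real points followed by NA padding rows
```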
Quick Q @SkepticRaven: do you really need the ...? Maybe an example, to also see whether I've understood it correctly: so if this is the contour, where ...
But ah, I see it myself - e.g. the right-most point, there it's not given - is it mostly for this type of case that it's needed?

I'm really curious about whether Parquet's compression also improves with padding. It's tabular, but grouping kinda acts as an extra dimension (it seems that Parquet actually does the padding automatically when grouping - see their compression info - which would be great news for storage without having to introduce padding in the user-facing presentation). So maybe it'll be possible to read the files lazily when converting into e.g. a bounding box or centroid... have to check.

@sfmig For temporal smoothing, completely agree. I don't see any robust way of doing that which doesn't involve expanding to the complete matrix (as above) and doing some sort of logistic/binary smoothing, though I don't know how such methods might work.
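On the lazy-reading point, a sketch (untested on real mask data; the file name and column names are hypothetical and match the row-wise table sketched earlier) of how the arrow package could write such a table and then derive bounding boxes without materialising the whole file:

```r
library(arrow)
library(dplyr, warn.conflicts = FALSE)

# write_parquet(segmented_data, "masks.parquet")

# Lazy query: the grouping and aggregation are evaluated by arrow, and only
# the final summary is pulled into memory by collect().
open_dataset("masks.parquet", format = "parquet") |>
  group_by(time, individual) |>
  summarise(
    x_min = min(x_min), x_max = max(x_max),
    y_min = min(y),     y_max = max(y)
  ) |>
  collect()
```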
I'd love to also support data from segmentation masks. It should be encoded with run-length encoding (RLE). Given that segmentation models are often able to detect multiple different objects/species simultaneously, maybe we also need an extra column for target (surely there's a better term?). A potential layout that is somewhat tidy is the following:
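A hypothetical sketch of what such a layout could look like (the column names are placeholders, not a settled spec); each row would be one run of mask pixels for a given frame, individual and target:

```r
# Hypothetical tidy RLE layout with a `target` column for the detected class.
segmentation_data <- data.frame(
  time       = c(1, 1, 1, 1),
  individual = c("a", "a", "b", "b"),
  target     = c("mouse", "mouse", "cricket", "cricket"),
  start      = c(3205, 4010, 5120, 5921),  # start index of the run in the flattened frame
  length     = c(30, 26, 12, 15),          # run length in pixels
  value      = c(1, 1, 1, 1)               # 1 = mask pixel
)
```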
These data sets are going to be long (much longer than pose estimation too), but that's fine. We'll make it easy!
It should be noted that quite a lot of things need special implementation for this. Some potential ideas: the sf package (#44), e.g. mask_to_bbox(), pose_to_bbox().