
Segmentation mask data #108

Open
roaldarbol opened this issue Feb 3, 2025 · 13 comments
Labels: enhancement (New feature or request)

@roaldarbol
Owner

roaldarbol commented Feb 3, 2025

I'd love to also support data from segmentation masks. It should be encoded with run-length encoding (RLE). Given that segmentation models can often detect multiple different objects/species simultaneously, maybe we also need an extra column for target (surely there's a better term?). A potential layout that is somewhat tidy is the following:

| time | class | individual | row (or column) | start_pixel | length |
|---|---|---|---|---|---|
| 1 | bumblebee | individual_1 | 5 | 20 | 30 |
| 1 | bumblebee | individual_1 | 6 | 21 | 25 |

These data sets are going to be long (much longer than pose estimation too), but that's fine. We'll make it easy!
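A hypothetical sketch of that layout in R, plus how a single run expands back into pixels (the column names and the `expand_run()` helper are assumptions based on the table above, not an existing API):

```r
# Assumed run-length layout: one row per (row, start_pixel, length) run
rle_data <- data.frame(
  time        = c(1, 1),
  class       = "bumblebee",
  individual  = "individual_1",
  row         = c(5, 6),
  start_pixel = c(20, 21),
  length      = c(30, 25)
)

# Expand one run back into individual pixel coordinates
expand_run <- function(run) {
  data.frame(
    row = run$row,
    col = seq(run$start_pixel, length.out = run$length)
  )
}

pixels <- expand_run(rle_data[1, ])
nrow(pixels)  # 30 pixels on row 5, columns 20 to 49
```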

It should be noted that quite a lot of things need special implementation for this. Some potential ideas:

  • Convert to pose/points
    • Centroid
    • Front, back, maybe some other heuristics that can be implemented?
    • Midline with x points spaced equally from back to front?
    • Bounding box (may link to Navigational features and the sf package #44, e.g. mask_to_bbox(), pose_to_bbox())
  • E.g. speed would use a calculated centroid or point
  • Area could then be calculated (as it could for pose estimates, by connecting all points or taking the size of the bounding box?)
@roaldarbol roaldarbol added the enhancement New feature or request label Feb 3, 2025
@roaldarbol roaldarbol added this to the Future milestone Feb 3, 2025
@roaldarbol
Owner Author

roaldarbol commented Feb 4, 2025

@sfmig @niksirbi I'm trying to think about an intuitive way to represent segmentation masks that lets us reuse as much of the functionality that is already in place, and could use your input when you have time. :-) I still have no data, it's all hypothetical, so there's absolutely no rush!!!

I'm thinking of a version that's pretty much exactly like RLE; here a row-wise RLE:

| time | class | individual | y | x_min | x_max |
|---|---|---|---|---|---|
| 1 | bumblebee | individual_1 | 5 | 20 | 50 |
| 1 | bumblebee | individual_1 | 6 | 21 | 46 |

This would make it quite easy to do filtering; without changing the data, one could then do:

```r
segmented_data |>
  group_by(time, y) |>
  mutate(x_min = filter_sgolay(x_min),
         x_max = filter_sgolay(x_max))
```

Alternatively, which might aid in plotting the shape:

| time | class | individual | y | x | x_boundary |
|---|---|---|---|---|---|
| 1 | bumblebee | individual_1 | 5 | 20 | min |
| 1 | bumblebee | individual_1 | 5 | 50 | max |
| 1 | bumblebee | individual_1 | 6 | 21 | min |
| 1 | bumblebee | individual_1 | 6 | 46 | max |
```r
segmented_data |>
  group_by(time, y, x_boundary) |>
  mutate(x = filter_sgolay(x))
```

Here's an example which includes plotting (it might make sense to make a modified version of geom_polygon so the arrange(if_else(x_boundary == "min", y, -y)) won't be needed):

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

data <- data.frame(
  time = rep(1, 8),
  individual = c(rep("a", 4), rep("b", 4)),
  y = c(5, 5, 6, 6, 5.5, 5.5, 8, 8),
  x = c(20, 50, 21, 46, 40, 60, 50, 60),
  x_boundary = c("min", "max", "min", "max", "min", "max", "min", "max")
)

glimpse(data)
#> Rows: 8
#> Columns: 5
#> $ time       <dbl> 1, 1, 1, 1, 1, 1, 1, 1
#> $ individual <chr> "a", "a", "a", "a", "b", "b", "b", "b"
#> $ y          <dbl> 5.0, 5.0, 6.0, 6.0, 5.5, 5.5, 8.0, 8.0
#> $ x          <dbl> 20, 50, 21, 46, 40, 60, 50, 60
#> $ x_boundary <chr> "min", "max", "min", "max", "min", "max", "min", "max"

data |>
  group_by(individual, time) |>
  arrange(if_else(x_boundary == "min", y, -y)) |>
  ggplot(aes(x, y, fill = individual, colour = individual)) +
  geom_polygon(alpha = 0.5)
```

Created on 2025-02-04 with reprex v2.1.1

@niksirbi

niksirbi commented Feb 4, 2025

Thanks for tagging me Mikkel!
I'm sure you've also seen our discussions on this issue, SkepticRaven's comment seems relevant if you want to give your ideas a try on some real data.

I would have to take the time to think about your RLE encoding idea before I can give a full answer.
But I do have a quick comment: I think the word you are looking for is class instead of target. That would be the ML way of naming that concept, afaik.

@roaldarbol
Owner Author

roaldarbol commented Feb 4, 2025

> I think the word you are looking for is class instead of target

Yes, that's perfect, thanks Niko!

@SkepticRaven Sorry for pinging you! I just wanted to hear, given you've worked with segmentation data, whether you think the above makes sense - or whether you'd mind sharing some details about how you store it?

As far as I could see in the shared data, there's only pose estimation and no segmentation - is that correct, or am I missing something?

@roaldarbol
Owner Author

roaldarbol commented Feb 4, 2025

Actually, the last example will break down if the mask crosses the row twice, e.g. two separate legs; then we won't know which min/max values belong together and will need an extra variable to keep track of them. So maybe the x_min/x_max version is better... will have to test further.

Okay, it's actually a bit of a headache. If we add a variable (i) for the start number (e.g. if it's the second time the mask starts on that row, like the second leg), then once the first leg leaves the row, the numbering will change, messing things up...

This will mess with filtering especially, but also plotting will need some thought...

@roaldarbol roaldarbol moved this to 🔬 Triage in animovement progress Feb 4, 2025
@SkepticRaven

SkepticRaven commented Feb 4, 2025

I'm always happy to help with segmentation.

The link in the other thread does share a big dataset of ours. The segmentation data is present, albeit pretty hidden. To be more specific than the other thread, if you want to get your hands on it...

  1. Our data consists of paired files: video (.mp4) + our in-house "pose" file (_pose_est_v6.h5). If you only want segmentation data, all you need is the pose file.
  2. There's a lot of data in the pose file. All the data is described in README.md associated with the dataset. If you want to play around with actual data, you likely want to look at a frame where the mouse is present (typically between frames 200 and 90k -- notably almost never the first frame). The 2 fields of interest are:
  • poseest/seg_data: compressed segmentation data. Data is stored as a padded matrix of contours. Pad value is -1 (invalid value for contours). Shape is [frame, animal, contour, contour_length, position]. Segmentation position is sorted [x, y].
  • poseest/seg_external_flag: describes whether a contour is an external (1) or internal (0) contour. Shape is [frame, animal, contour]. (Extra note: opencv doesn't need this flag for rendering, but we use it for determining things like positive or negative areas when calculating features)

Design considerations that we had in our group while working on it:

  • Needs to compress fairly well (we store masks for 1-hour videos)
  • Needs to be fairly efficient for access
  • Needs to allow efficient feature calculation (we select translation+rotation invariant image moments along with other shape descriptors, which is more than coco provides below)

I chose to use a padded contour matrix largely because it let me rely on opencv functions for rendering/feature calculation. If you're not familiar with opencv, they have some fairly robust contour-based functions. Most notable are:

  • cv2.findContours -- transforms mask to list of contours (itself a list of points)
  • cv2.drawContours -- transforms a list of contours to a rendered frame
  • cv2.moments -- transforms a contour into image moment features
  • cv2.HuMoments -- transforms image moments into Hu moments
  • cv2.pointPolygonTest -- determines if a point is contained in the contour

--

As for using RLE, if possible I would recommend using coco's wrappers, licensing permitting (it uses simplified BSD).
It's hard to read because they use 1-character variable names, but they provide encoding, decoding, and a handful of functions that can operate directly on the RLE'd data. Note that their python api has a block comment to describe some of how it works.

If you still want to code it from scratch... Traditionally RLE requires a list of 3 values to be stored: (start, length, value). Typically, image shape is stored somewhere such that start can treat the image as 1-dimensional (not requiring row/column index). If pixels are considered mutually exclusive (e.g. a pixel can only be assigned to 1 individual, another assumption with coco), value is equivalent information to individual, so only 1 needs to be stored.
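Coding it from scratch in R, the (start, length, value) scheme described above can be sketched with base R's built-in `rle()`; the function names below are made up, and the row-major flattening order is an assumption (column-major would work just as well as long as encode and decode agree):

```r
# From-scratch 1-D RLE over a flattened label image: store (start, length,
# value) triplets for foreground runs; the image shape is kept separately.
rle_encode <- function(mat) {
  v <- as.vector(t(mat))        # flatten row-major, like image scan order
  r <- rle(v)                   # base R run-length encoding
  ends   <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  keep   <- r$values != 0       # only store foreground runs
  data.frame(start = starts[keep], length = r$lengths[keep], value = r$values[keep])
}

rle_decode <- function(runs, dims) {
  v <- numeric(prod(dims))
  for (i in seq_len(nrow(runs))) {
    idx <- runs$start[i] + seq_len(runs$length[i]) - 1
    v[idx] <- runs$value[i]
  }
  matrix(v, nrow = dims[1], byrow = TRUE)
}

m <- matrix(0, 4, 6)
m[2, 2:4] <- 1                          # individual 1
m[3, 3:6] <- 2                          # individual 2
runs <- rle_encode(m)
identical(rle_decode(runs, dim(m)), m)  # TRUE: round-trips
```

Because pixels here are mutually exclusive between individuals, `value` doubles as the individual ID, exactly as described above.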

@roaldarbol
Owner Author

roaldarbol commented Feb 5, 2025

@SkepticRaven Wow, thanks a ton, that's super comprehensive! Will have to go through that at pace. 😄

Ah, that's why I couldn't find them - I just assumed that the pose_est files only contained pose estimation data; I'll make sure to check them out.

Here I'm working in R, not Python, so I may not have the luxury of depending on OpenCV, unfortunately - but those functions are really great to know about. I'm trying to figure out whether there's a tabular format I can use; R doesn't handle n-dimensional arrays/matrices as gracefully as Python (the infrastructure is not really there for it).

Do you yourself consider filtering/smoothing of the masks? If so, how? My best idea is currently to treat the outline as n numbered keypoints that start e.g. in the bottom left, so one can filter the values of e.g. keypoint_1, keypoint_2, ..., keypoint_n. I haven't tested the idea at all, just thinking out loud here. Or maybe this is one place where we need to expand to a matrix and smooth each pixel separately...?

Regarding RLE, I get that they typically are 1-dimensional; do you have any feeling for whether a row-wise (or column-wise) RLE increases data size significantly? My issue with the one-dimensionality is that it's much harder for users to look at the data and understand where in space they are, and it makes plotting a great deal more difficult; but again, maybe a simple on-the-fly conversion could do the trick.

I think your design considerations align quite closely with what I envision, my main bonus potential needs are:

  • Needs to be human-readable. Or at least have a format that's readable - that's why the row-wise format is attractive to me; it resembles the x+y formatting which we use for any other positional data (e.g. pose estimation)
  • Needs to be tabular. Somehow, in the end, there's no way around it for ease of plotting
  • Maybe more, I'll need to think...

Thanks again!

@roaldarbol
Owner Author

roaldarbol commented Feb 5, 2025

Again, just thinking out loud here. So for the smoothing, maybe we'd need to:

  1. Expand to a matrix
  2. Convert to long tidy format (time, x, y, value) for every single pixel
  3. Do smoothing for each pixel: `group_by(x, y) |> mutate(value = filter_...(value))` (grouping by pixel, not by time, so the filter runs across frames), and finally
  4. Transform everything back into the original format.
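A toy sketch of those four steps, assuming binary masks expanded to a [y, x, time] array, and using a simple temporal majority vote in place of the unspecified `filter_...()`:

```r
# Step 1: binary array [y, x, time]
frames <- array(0, dim = c(4, 4, 3))
frames[2:3, 2:3, ] <- 1          # stable square present in all frames
frames[1, 1, 2] <- 1             # one-frame speckle to be smoothed away

# Steps 2-3: long format, then smooth each pixel's time series
long <- expand.grid(y = 1:4, x = 1:4, time = 1:3)
long$value <- frames[cbind(long$y, long$x, long$time)]

smoothed <- long
for (t in 1:3) {
  window <- max(1, t - 1):min(3, t + 1)  # frames in the temporal window
  for (i in which(smoothed$time == t)) {
    series <- frames[smoothed$y[i], smoothed$x[i], window]
    smoothed$value[i] <- as.numeric(mean(series) > 0.5)  # majority vote
  }
}

# Step 4: back to the array format
out <- array(0, dim = dim(frames))
out[cbind(smoothed$y, smoothed$x, smoothed$time)] <- smoothed$value
out[1, 1, 2]  # 0: the speckle is gone, the square survives
```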

It occurs to me that I'm of course thinking about temporal smoothing; one could also simply do spatial smoothing of e.g. the outline in each frame (came across a blog post that does this on iOS), but I'm thinking of smoothing across frames (while maybe also doing spatial smoothing - would a temporal filtering achieve that too?).

@SkepticRaven

Does all the data need to be human-readable? A potential compromise could be storing human-readable summaries (e.g. centroid, bounding box) next to the raw data (non-readable RLE/contours).

RLE 2D vs 1D: I don't know the numbers on compression differences, but roughly you would be going from 3 values to 4 (+33%). There could be special cases in which it's slightly more than that, but those should be unlikely with animal tracking (e.g. run length > image width should never occur because the animal generally doesn't span the entire frame).

Making it a constant-length tabular format rather than variable length unfortunately opposes compression. As ugly as the padded matrices that we use are, all other approaches I tested didn't get close. Here's a couple of older tests (on some 800x800-frame data with 3 animals) that I had written down when selecting the padded tables:

  • Baseline: Uncompressed binary frame stack ~8,000KB/frame
  • GZip compressed binary frame stack: ~14KB/frame
  • GZip compressed int (per-animal) frame: ~7KB/frame
  • Uncompressed full padded contours: ~10KB/frame
  • GZip compressed full padded contours: ~1KB/frame
  • GZip compressed simplified contours: ~0.8KB/frame (selected, "simplified" is a parameter in opencv to only store corner points of contour rather than all pixels)

The segmentation data alone at the best compression accounts for roughly 50% of our "pose" file's total footprint. I unfortunately didn't include a comparison with RLE, mostly because I wasn't aware of coco's functions at the time. It's likely in the same ballpark as contour compression.

Smoothing/Filtering/Interpolation: The only filtering we do is by contour area (i.e. areas < 50px are likely noisy predictions, not the animal). I've thought about other filters quite a bit and haven't really been happy with them, particularly since they impose temporal statistics (e.g. linear interpolation for keypoints makes variance of velocity = 0) that we happen to look at when trying to predict behavior. We also don't do any of these for our pose data.

For spatial smoothing, I would recommend a mix of erosion and dilation morphological filters. You'd need to check, but you may be able to get them in R using a convolve function.
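A base-R sketch of erosion and dilation via neighbour counting, for readers unfamiliar with morphological filters (the function names are made up; packages such as imager or EBImage provide real implementations):

```r
# Count, for each pixel, how many cells of its 3x3 neighbourhood are set,
# using zero padding at the frame edges.
neighbour_count <- function(mask) {
  p <- matrix(0, nrow(mask) + 2, ncol(mask) + 2)  # zero-padded copy
  p[2:(nrow(mask) + 1), 2:(ncol(mask) + 1)] <- mask
  out <- matrix(0, nrow(mask), ncol(mask))
  for (dy in -1:1) for (dx in -1:1)
    out <- out + p[2:(nrow(mask) + 1) + dy, 2:(ncol(mask) + 1) + dx]
  out
}

erode  <- function(mask) (neighbour_count(mask) == 9) * 1  # all 9 cells set
dilate <- function(mask) (neighbour_count(mask) >= 1) * 1  # any cell set

m <- matrix(0, 6, 6)
m[2:5, 2:5] <- 1
sum(erode(m))          # 4: the 4x4 square shrinks to its 2x2 core
sum(dilate(erode(m)))  # 16: erosion then dilation ("opening") restores the bulk
```

Opening (erode then dilate) removes thin protrusions and speckle while roughly preserving larger blobs, which is the spatial-smoothing behaviour discussed above.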

Temporal smoothing, I would argue, is a lot more difficult. You can use a gradient method, but I personally don't think you should. It doesn't understand important concepts relevant to animal tracking: translation and rotation. I'd recommend leaving smoothing to the downstream features (e.g. mask centroid interpolation makes more sense).

@roaldarbol
Owner Author

Man, I am learning SO much - thanks a ton for playing ball!

Just to understand the contour format: is it something like time, value/individual, x, y, where only the contours are present? And then you can e.g. calculate the area from those points?

Regarding the smoothing, maybe you're right that it makes most sense to do it on the features of interest; I'm just always wary, as movement ecologists often highlight that smoothing should happen on raw movement, not its derivatives. But maybe it'll be simpler to just do it on e.g. the centroid, heading, area, etc.
Spatial filtering with erosion and dilation was new to me - thanks for bringing them up. I guess that has to be done on full binary stacks, so expansion into that format may be needed for some applications.

For myself, maybe it could make sense to write functions to convert between RLE, contours, and full binary frame stacks. There are quite a few R packages that would allow working with the contours, most notably sf and others that implement GEOS, which I'll likely end up using regardless.
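As a sketch of one such conversion (binary mask → contour pixels) in base R: a foreground pixel lies on the contour if any 4-connected neighbour is background or the frame edge. `mask_to_contour()` is a hypothetical name, and unlike opencv's findContours the points come back unordered:

```r
mask_to_contour <- function(mask) {
  # Zero-pad so frame-edge pixels see a background neighbour
  p <- matrix(0, nrow(mask) + 2, ncol(mask) + 2)
  p[2:(nrow(mask) + 1), 2:(ncol(mask) + 1)] <- mask
  boundary <- which(
    mask == 1 &
      (p[1:nrow(mask),       2:(ncol(mask) + 1)] == 0 |  # neighbour above
       p[3:(nrow(mask) + 2), 2:(ncol(mask) + 1)] == 0 |  # below
       p[2:(nrow(mask) + 1), 1:ncol(mask)]       == 0 |  # left
       p[2:(nrow(mask) + 1), 3:(ncol(mask) + 2)] == 0),  # right
    arr.ind = TRUE
  )
  data.frame(y = boundary[, 1], x = boundary[, 2])  # unordered boundary pixels
}

m <- matrix(0, 6, 6)
m[2:5, 2:5] <- 1
contour <- mask_to_contour(m)
nrow(contour)  # 12 boundary pixels around the 4x4 square
```

A real contour (for polygon plotting or sf) would additionally need the points ordered along the outline, e.g. by angle around the centroid for convex shapes.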

@niksirbi

niksirbi commented Feb 5, 2025

Just chiming in to say this discussion is pure gold! Thanks to both of you I'm learning a lot.

@sfmig

sfmig commented Feb 6, 2025

Great conversation, thanks for tagging me @roaldarbol! Always sparking great discussions 🌟

And thanks @SkepticRaven for sharing your insight from actual experience fighting the data.

I am very interested in understanding these issues better, as in both movement and ethology we would like to make it very easy to transform between bboxes, keypoints and segmentation masks.

Below are some comments on the above discussion, and some questions that it would be great if you could clarify when you get some time :)

@SkepticRaven when you say you represent segmentation masks as "a padded contour matrix", are you roughly doing the following?

  1. From the binary segmentation frame you extract contours expressed as vectors of points (as returned by findContours()). You pad these vectors with -1s to make them a fixed length.
  2. You compute the lengths of the contours.
  3. You do step 1 and 2 for all frames and individuals, and stack the results in a matrix with dimensions (frame, individual_ID, contour_ID, contour_length, padded_contour_vector).
    Is that correct? I was a bit confused about the dimensions you call contour and position. Are these, respectively, the contour_ID (per frame?) and the position of the points in the extracted contour? Or is contour the semantic class (e.g. mouse), and does your animal dimension refer to the specific individual?

From your comment above, I understand that representing masks in this way facilitates using opencv contour methods (which facilitates later computation of invariant features), but that it is not significantly faster to read the data this way vs an RLE representation (or vs a binary mask). Is that right?

Re temporal smoothing, I agree with @SkepticRaven that it seems particularly tricky: if I understand correctly how the contour points are computed, we don't know which contour point in frame f corresponds to which contour point in frame f+1. In a way this means that a contour point across frames does not represent the trajectory of a physical point in time (like a keypoint maybe does, since the pose estimation model does try to match a consistent feature). So I am not sure we can assume they move smoothly from one frame to the next.

If the mask is noisy in time (e.g. the area changes a lot from one frame to the next because a contour point is wrongly detected far from the rest in a certain frame), I guess a reasonable thing would be to try and smooth the contour per frame. Maybe smoothing splines applied to the contour points are helpful here? But that would fall into the category of spatial smoothing. Morphological filters seem like a nice idea - or, if applied to the binary mask, maybe Gaussian or median convolutional filters could smooth out the edges.

@SkepticRaven

You just about nailed it, @sfmig .

To explicitly expand on the way my group currently stores the data...
It's one big 5-dimensional table of shape [frame, animal, contour, contour_length, position]

  • frame: Frame index from the video that the data corresponds to
  • animal: Animal ID the segmentation describes
  • contour: Since sometimes we need more than 1 contour to describe the animal, this is that list of contours
  • contour_length: Since we need many points to describe the contour, this is that list of points
  • position: x and y values

For example, if you selected data [200, 1, 0, 5, :], (reading indices backwards) you would get the [x, y] location of the 6th keypoint in the 1st contour for animal 2 on frame 201.

When generating the data, we use findContours() per animal to fill the last 3 dimensions of that growing matrix. The only custom stuff I did was the transformation of the return value (a list of n [l_i, 2] contours) into the padded [n, l, 2] matrix.

For reading speeds, mostly correct. I would say that there is an overall read+compute speed advantage because (a) it costs less to read direct data vs read + convert, and (b) most of the feature functions run faster on the compressed contour data (similarly, calculating centroid/bbox using the coco helper functions on RLE'd data would be faster than doing the same on frame masks).
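Translated to R's 1-indexed arrays, the padded layout and the indexing example above would look roughly like this (the toy dimensions and values are assumptions for illustration):

```r
# Padded layout: [frame, animal, contour, contour_length, position], pad = -1
seg <- array(-1L, dim = c(2, 2, 1, 8, 2))

# One 4-point contour for animal 2 on frame 1
seg[1, 2, 1, 1:4, ] <- cbind(x = c(10, 20, 20, 10),
                             y = c(5, 5, 15, 15))

# Equivalent of the Python seg[200, 1, 0, 5, :] pattern (mind 0- vs 1-indexing):
point <- seg[1, 2, 1, 4, ]  # [x, y] of the 4th point, 1st contour, animal 2, frame 1

# Recover the true contour length by dropping the -1 padding
valid <- seg[1, 2, 1, , 1] != -1
sum(valid)  # 4 points
```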

--

Smoothing splines and blurs (gaussian or median) could also work pretty well for spatial smoothing. I don't have any hard evidence, but one area where they are likely to produce the most different results is long + thin diagonals in masks. Splines will typically try to preserve the structure, while blurs + morphological filtering will typically remove it. These are pretty easy to play around with in tools like gimp and inkscape. I mostly chose morphological filtering over blurs because we can actually run them on GPUs embedded in the network (old code example). Of course blurs could also be run on-GPU, but I have a vague recollection that these were slightly more efficient at the time. That most likely has changed, since that experiment was on ~CUDA 7.0 and we're now on CUDA 12.8 with GPUs that actually have new core types.

@roaldarbol
Owner Author

roaldarbol commented Feb 6, 2025

Quick Q @SkepticRaven: do you really need the contour_length? Isn't that more or less implicit in the difference in x/y positions of two consecutive points?

Maybe an example, to also see whether I've understood it correctly: so if this is the contour, where # indicates a contour point (which has x and y values associated with it), then the length is given by where the next point is:

```
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . # = = = = = = = . . . . . .
. . . . . # . . . . . . . . # . . . . .
. . . . # . . . . . . . . . . # . . . .
. . . . # . . . . . . . . . . # . . . .
. . . . # . . . . . . . . . . # = . . .
. . . . # . . . . . . . . . . # . . . .
. . . . # = . . . . . . . . # . . . . .
. . . . . . # = = = = = = = . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
```

But ah, I see it myself - e.g. the right-most point, there it's not given - is it mostly for this type of case that it's needed?

I'm really curious about whether Parquet's compression also improves with padding. It's tabular, but grouping kinda acts as an extra dimension (it seems that Parquet actually does the padding automatically when grouping - see their compression info - which would be great news for storage without having to introduce padding in the user-facing presentation).

So maybe it'll be possible to read the files lazily when converting into e.g. bounding box or centroid... have to check.

@sfmig For temporal smoothing, completely agree. I don't think I see any robust way of doing it other than expanding to the complete matrix (as above) and doing some sort of logistic/binary smoothing, though I don't know how such methods might work.
