
Segmentation mask data #108

Open
roaldarbol opened this issue Feb 3, 2025 · 13 comments
Labels: enhancement (New feature or request)

@roaldarbol
Owner

roaldarbol commented Feb 3, 2025

I'd love to also support data from segmentation masks. It should be encoded with run-length encoding (RLE). Given that segmentation models can often detect multiple different objects/species simultaneously, maybe we also need an extra column for target (surely there's a better term?). A potential layout that is somewhat tidy is the following:

| time | class | individual | row (or column) | start_pixel | length |
|---|---|---|---|---|---|
| 1 | bumblebee | individual_1 | 5 | 20 | 30 |
| 1 | bumblebee | individual_1 | 6 | 21 | 25 |

These data sets are going to be long (much longer than pose estimation too), but that's fine. We'll make it easy!
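A hypothetical sketch of that layout in R, plus how a single run expands back into pixels (the column names and the `expand_run()` helper are assumptions based on the table above, not an existing API):

```r
# Assumed run-length layout: one row per (row, start_pixel, length) run
rle_data <- data.frame(
  time        = c(1, 1),
  class       = "bumblebee",
  individual  = "individual_1",
  row         = c(5, 6),
  start_pixel = c(20, 21),
  length      = c(30, 25)
)

# Expand one run back into individual pixel coordinates
expand_run <- function(run) {
  data.frame(
    row = run$row,
    col = seq(run$start_pixel, length.out = run$length)
  )
}

pixels <- expand_run(rle_data[1, ])
nrow(pixels)  # 30 pixels on row 5, columns 20 to 49
```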

It should be noted that quite a lot of things need special implementation for this. Some potential ideas:

  • Convert to pose/points
    • Centroid
    • Front, back, maybe some other heuristics that can be implemented?
    • Midline with x points spaced equally from back to front?
    • Bounding box (may link to Navigational features and the sf package #44, e.g. mask_to_bbox(), pose_to_bbox())
  • E.g. speed would use a calculated centroid or point
  • Area could then be calculated (as it could for pose estimates, by connecting all points or taking the size of the bounding box?)
@roaldarbol roaldarbol added the enhancement New feature or request label Feb 3, 2025
@roaldarbol roaldarbol added this to the Future milestone Feb 3, 2025
@roaldarbol
Owner Author

roaldarbol commented Feb 4, 2025

@sfmig @niksirbi I'm trying to think about an intuitive way to represent segmentation masks that lets us reuse as much of the functionality that is already in place, and could use your input when you have time. :-) I still have no data, it's all hypothetical, so there's absolutely no rush!!!

I'm thinking of a version that's pretty much exactly like RLE; here a row-wise RLE:

| time | class | individual | y | x_min | x_max |
|---|---|---|---|---|---|
| 1 | bumblebee | individual_1 | 5 | 20 | 50 |
| 1 | bumblebee | individual_1 | 6 | 21 | 46 |

This would make it quite easy to do filtering; without changing the data, one could then do:

```r
segmented_data |>
  group_by(time, y) |>
  mutate(x_min = filter_sgolay(x_min),
         x_max = filter_sgolay(x_max))
```

Alternatively, which might aid in plotting the shape:

| time | class | individual | y | x | x_boundary |
|---|---|---|---|---|---|
| 1 | bumblebee | individual_1 | 5 | 20 | min |
| 1 | bumblebee | individual_1 | 5 | 50 | max |
| 1 | bumblebee | individual_1 | 6 | 21 | min |
| 1 | bumblebee | individual_1 | 6 | 46 | max |
```r
segmented_data |>
  group_by(time, y, x_boundary) |>
  mutate(x = filter_sgolay(x))
```

Here's an example which includes plotting (it might make sense to make a modified version of geom_polygon so the arrange(if_else(x_boundary == "min", y, -y)) won't be needed):

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

data <- data.frame(
  time = rep(1, 8),
  individual = c(rep("a", 4), rep("b", 4)),
  y = c(5, 5, 6, 6, 5.5, 5.5, 8, 8),
  x = c(20, 50, 21, 46, 40, 60, 50, 60),
  x_boundary = c("min", "max", "min", "max", "min", "max", "min", "max")
)

glimpse(data)
#> Rows: 8
#> Columns: 5
#> $ time       <dbl> 1, 1, 1, 1, 1, 1, 1, 1
#> $ individual <chr> "a", "a", "a", "a", "b", "b", "b", "b"
#> $ y          <dbl> 5.0, 5.0, 6.0, 6.0, 5.5, 5.5, 8.0, 8.0
#> $ x          <dbl> 20, 50, 21, 46, 40, 60, 50, 60
#> $ x_boundary <chr> "min", "max", "min", "max", "min", "max", "min", "max"

data |>
  group_by(individual, time) |>
  arrange(if_else(x_boundary == "min", y, -y)) |>
  ggplot(aes(x, y, fill = individual, colour = individual)) +
  geom_polygon(alpha = 0.5)
```

Created on 2025-02-04 with reprex v2.1.1

@niksirbi

niksirbi commented Feb 4, 2025

Thanks for tagging me Mikkel!
I'm sure you've also seen our discussions on this issue, SkepticRaven's comment seems relevant if you want to give your ideas a try on some real data.

I would have to take the time to think about your RLE encoding idea before I can give a full answer.
But I do have a quick comment: I think the word you are looking for is class instead of target. That would be the ML way of naming that concept, afaik.

@roaldarbol
Owner Author

roaldarbol commented Feb 4, 2025

> I think the word you are looking for is class instead of target

Yes, that's perfect, thanks Niko!

@SkepticRaven Sorry for pinging you! I just wanted to hear, given you've worked with segmentation data, whether you think the above makes sense - or whether you'd mind sharing some details about how you store it?

As far as I could see in the shared data, there's only pose estimation and no segmentation - is that correct, or am I missing something?

@roaldarbol
Owner Author

roaldarbol commented Feb 4, 2025

Actually, the last example will break down if the mask crosses the row twice, e.g. two separate legs; then we won't know which min/max values belong together and will need an extra variable to keep track of them. So maybe the x_min/x_max version is better... will have to test further.

Okay, it's actually a bit of a headache. If we add a variable (i) for the start number (e.g. if it's the second time the mask starts on that row, like the second leg), then once the first leg leaves the row, the numbering will change, messing things up...

This will mess with filtering especially, but also plotting will need some thought...

@roaldarbol roaldarbol moved this to 🔬 Triage in animovement progress Feb 4, 2025
@SkepticRaven

SkepticRaven commented Feb 4, 2025

I'm always happy to help with segmentation.

The link in the other thread does share a big dataset of ours. The segmentation data is present, albeit pretty hidden. To be more specific than the other thread, if you want to get your hands on it...

  1. Our data consists of paired files: video (.mp4) + our in-house "pose" file (_pose_est_v6.h5). If you only want segmentation data, all you need is the pose file.
  2. There's a lot of data in the pose file. All the data is described in README.md associated with the dataset. If you want to play around with actual data, you likely want to look at a frame where the mouse is present (typically between frames 200 and 90k -- notably almost never the first frame). The 2 fields of interest are:
  • poseest/seg_data: compressed segmentation data. Data is stored as a padded matrix of contours. Pad value is -1 (invalid value for contours). Shape is [frame, animal, contour, contour_length, position]. Segmentation position is sorted [x, y].
  • poseest/seg_external_flag: describes whether a contour is an external (1) or internal (0) contour. Shape is [frame, animal, contour]. (Extra note: opencv doesn't need this flag for rendering, but we use it for determining things like positive or negative areas when calculating features)

Design considerations that we had in our group while working on it:

  • Needs to compress fairly well (we store masks for 1-hour videos)
  • Needs to be fairly efficient for access
  • Needs to allow efficient feature calculation (we select translation+rotation invariant image moments along with other shape descriptors, which is more than coco provides below)

I chose to use a padded contour matrix largely because it let me rely on opencv functions for rendering/feature calculation. If you're not familiar with opencv, they have some fairly robust contour-based functions. Most notable are:

  • cv2.findContours -- transforms mask to list of contours (itself a list of points)
  • cv2.drawContours -- transforms a list of contours to a rendered frame
  • cv2.moments -- transforms a contour into image moment features
  • cv2.HuMoments -- transforms image moments into Hu moments
  • cv2.pointPolygonTest -- determines if a point is contained in the contour

--

As for using RLE, if possible I would recommend using coco's wrappers, licensing permitting (it uses simplified BSD).
It's hard to read because they use 1-character variable names, but they provide encoding, decoding, and a handful of functions that can operate directly on the RLE'd data. Note that their python api has a block comment to describe some of how it works.

If you still want to code it from scratch... Traditionally RLE requires a list of 3 values to be stored: (start, length, value). Typically, image shape is stored somewhere such that start can treat the image as 1-dimensional (not requiring row/column index). If pixels are considered mutually exclusive (e.g. a pixel can only be assigned to 1 individual, another assumption with coco), value is equivalent information to individual, so only 1 needs to be stored.
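Coding it from scratch in R, the (start, length, value) scheme described above can be sketched with base R's built-in `rle()`; the function names below are made up, and the row-major flattening order is an assumption (column-major would work just as well as long as encode and decode agree):

```r
# From-scratch 1-D RLE over a flattened label image: store (start, length,
# value) triplets for foreground runs; the image shape is kept separately.
rle_encode <- function(mat) {
  v <- as.vector(t(mat))        # flatten row-major, like image scan order
  r <- rle(v)                   # base R run-length encoding
  ends   <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  keep   <- r$values != 0       # only store foreground runs
  data.frame(start = starts[keep], length = r$lengths[keep], value = r$values[keep])
}

rle_decode <- function(runs, dims) {
  v <- numeric(prod(dims))
  for (i in seq_len(nrow(runs))) {
    idx <- runs$start[i] + seq_len(runs$length[i]) - 1
    v[idx] <- runs$value[i]
  }
  matrix(v, nrow = dims[1], byrow = TRUE)
}

m <- matrix(0, 4, 6)
m[2, 2:4] <- 1                          # individual 1
m[3, 3:6] <- 2                          # individual 2
runs <- rle_encode(m)
identical(rle_decode(runs, dim(m)), m)  # TRUE: round-trips
```

Because pixels here are mutually exclusive between individuals, `value` doubles as the individual ID, exactly as described above.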

@roaldarbol
Owner Author

roaldarbol commented Feb 5, 2025

@SkepticRaven Wow, thanks a ton, that's super comprehensive! Will have to go through that at pace. 😄

Ah, that's why I couldn't find them - I just assumed that the pose_est files only contained pose estimation data; I'll make sure to check them out.

Here I'm working in R, not Python, so I may not have the luxury of depending on OpenCV, unfortunately - but those functions are really great to know about. I'm trying to figure out whether there's a tabular format I can use; R doesn't handle n-dimensional arrays/matrices as gracefully as Python (the infrastructure is not really there for it).

Do you yourself consider filtering/smoothing of the masks? If so, how? My best idea is currently to treat the outline as n numbered keypoints that start e.g. in the bottom left, so one can filter the values of e.g. keypoint_1, keypoint_2, ..., keypoint_n. I haven't tested the idea at all, just thinking out loud here. Or maybe this is one place where we need to expand to a matrix and smooth each pixel separately...?

Regarding RLE, I get that they typically are 1-dimensional; do you have any feeling for whether a row-wise (or column-wise) RLE increases data size significantly? My issue with the one-dimensionality is that it's much harder for users to look at the data and understand where in space they are, and it makes plotting a great deal more difficult; but again, maybe a simple on-the-fly conversion could do the trick.

I think your design considerations align quite closely with what I envision, my main bonus potential needs are:

  • Needs to be human-readable. Or at least have a format that's readable - that's why the row-wise format is attractive to me; it resembles the x+y formatting which we use for any other positional data (e.g. pose estimation)
  • Needs to be tabular. Somehow, in the end, there's no way around it for ease of plotting
  • Maybe more, I'll need to think...

Thanks again!

@roaldarbol
Owner Author

roaldarbol commented Feb 5, 2025

Again, just thinking out loud here. So for the smoothing, maybe we'd need to:

  1. Expand to a matrix
  2. Convert to long tidy format (time, x, y, value) for every single pixel
  3. Do smoothing for each pixel: `group_by(x, y) |> mutate(value = filter_...(value))` (grouping by pixel, not by time, so the filter runs across frames), and finally
  4. Transform everything back into the original format.
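A toy sketch of those four steps, assuming binary masks expanded to a [y, x, time] array, and using a simple temporal majority vote in place of the unspecified `filter_...()`:

```r
# Step 1: binary array [y, x, time]
frames <- array(0, dim = c(4, 4, 3))
frames[2:3, 2:3, ] <- 1          # stable square present in all frames
frames[1, 1, 2] <- 1             # one-frame speckle to be smoothed away

# Steps 2-3: long format, then smooth each pixel's time series
long <- expand.grid(y = 1:4, x = 1:4, time = 1:3)
long$value <- frames[cbind(long$y, long$x, long$time)]

smoothed <- long
for (t in 1:3) {
  window <- max(1, t - 1):min(3, t + 1)  # frames in the temporal window
  for (i in which(smoothed$time == t)) {
    series <- frames[smoothed$y[i], smoothed$x[i], window]
    smoothed$value[i] <- as.numeric(mean(series) > 0.5)  # majority vote
  }
}

# Step 4: back to the array format
out <- array(0, dim = dim(frames))
out[cbind(smoothed$y, smoothed$x, smoothed$time)] <- smoothed$value
out[1, 1, 2]  # 0: the speckle is gone, the square survives
```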

It occurs to me that I'm of course thinking about temporal smoothing; one could also simply do spatial smoothing of e.g. the outline in each frame (came across a blog post that does this on iOS), but I'm thinking of smoothing across frames (while maybe also doing spatial smoothing - would a temporal filtering achieve that too?).

@SkepticRaven

Does all the data need to be human-readable? A potential compromise could be storing human-readable summaries (e.g. centroid, bounding box) next to the raw data (non-readable RLE/contours).

RLE 2D vs 1D: I don't know the numbers on compression differences, but roughly you would be going from 3 values to 4 (+33%). There could be special cases in which it's slightly more than that, but those should be unlikely with animal tracking (e.g. run length > image width should never occur because the animal generally doesn't span the entire frame).

Making it a constant-length tabular format rather than variable length unfortunately opposes compression. As ugly as the padded matrices that we use are, all other approaches I tested didn't get close. Here's a couple of older tests (on some 800x800-frame data with 3 animals) that I had written down when selecting the padded tables:

  • Baseline: Uncompressed binary frame stack ~8,000KB/frame
  • GZip compressed binary frame stack: ~14KB/frame
  • GZip compressed int (per-animal) frame: ~7KB/frame
  • Uncompressed full padded contours: ~10KB/frame
  • GZip compressed full padded contours: ~1KB/frame
  • GZip compressed simplified contours: ~0.8KB/frame (selected, "simplified" is a parameter in opencv to only store corner points of contour rather than all pixels)

The segmentation data alone at the best compression accounts for roughly 50% of our "pose" file's total footprint. I unfortunately didn't include a comparison with RLE, mostly because I wasn't aware of coco's functions at the time. It's likely in the same ballpark as contour compression.

Smoothing/Filtering/Interpolation: The only filtering we do is by contour area (i.e. areas < 50px are likely noisy predictions, not the animal). I've thought about other filters quite a bit and haven't really been happy with them, particularly since they impose temporal statistics (e.g. linear interpolation for keypoints makes variance of velocity = 0) that we happen to look at when trying to predict behavior. We also don't do any of these for our pose data.

For spatial smoothing, I would recommend a mix of erosion and dilation morphological filters. You'd need to check, but you may be able to get them in R using a convolve function.
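A base-R sketch of erosion and dilation via neighbour counting, for readers unfamiliar with morphological filters (the function names are made up; packages such as imager or EBImage provide real implementations):

```r
# Count, for each pixel, how many cells of its 3x3 neighbourhood are set,
# using zero padding at the frame edges.
neighbour_count <- function(mask) {
  p <- matrix(0, nrow(mask) + 2, ncol(mask) + 2)  # zero-padded copy
  p[2:(nrow(mask) + 1), 2:(ncol(mask) + 1)] <- mask
  out <- matrix(0, nrow(mask), ncol(mask))
  for (dy in -1:1) for (dx in -1:1)
    out <- out + p[2:(nrow(mask) + 1) + dy, 2:(ncol(mask) + 1) + dx]
  out
}

erode  <- function(mask) (neighbour_count(mask) == 9) * 1  # all 9 cells set
dilate <- function(mask) (neighbour_count(mask) >= 1) * 1  # any cell set

m <- matrix(0, 6, 6)
m[2:5, 2:5] <- 1
sum(erode(m))          # 4: the 4x4 square shrinks to its 2x2 core
sum(dilate(erode(m)))  # 16: erosion then dilation ("opening") restores the bulk
```

Opening (erode then dilate) removes thin protrusions and speckle while roughly preserving larger blobs, which is the spatial-smoothing behaviour discussed above.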

Temporal smoothing, I would argue, is a lot more difficult. You can use a gradient method, but I personally don't think you should. It doesn't understand important concepts relevant to animal tracking: translation and rotation. I'd recommend leaving smoothing to the downstream features (e.g. mask centroid interpolation makes more sense).

@roaldarbol
Owner Author

Man, I am learning SO much - thanks a ton for playing ball!

Just to understand the contour format: is it something like time, value/individual, x, y, where only the contours are present? And then you can e.g. calculate the area from those points?

Regarding the smoothing, maybe you're right that it makes most sense to do it on the features of interest; I'm just always wary, as movement ecologists often highlight that smoothing should happen on raw movement, not its derivatives. But maybe it'll be simpler to just do it on e.g. the centroid, heading, area, etc.
Spatial filtering with erosion and dilation was new to me - thanks for bringing them up. I guess that has to be done on full binary stacks, so expansion into that format may be needed for some applications.

For myself, maybe it could make sense to write functions to convert between RLE, contours, and full binary frame stacks. There are quite a few R packages that would allow working with the contours, most notably sf and others that implement GEOS, which I'll likely end up using regardless.
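As a sketch of one such conversion (binary mask → contour pixels) in base R: a foreground pixel lies on the contour if any 4-connected neighbour is background or the frame edge. `mask_to_contour()` is a hypothetical name, and unlike opencv's findContours the points come back unordered:

```r
mask_to_contour <- function(mask) {
  # Zero-pad so frame-edge pixels see a background neighbour
  p <- matrix(0, nrow(mask) + 2, ncol(mask) + 2)
  p[2:(nrow(mask) + 1), 2:(ncol(mask) + 1)] <- mask
  boundary <- which(
    mask == 1 &
      (p[1:nrow(mask),       2:(ncol(mask) + 1)] == 0 |  # neighbour above
       p[3:(nrow(mask) + 2), 2:(ncol(mask) + 1)] == 0 |  # below
       p[2:(nrow(mask) + 1), 1:ncol(mask)]       == 0 |  # left
       p[2:(nrow(mask) + 1), 3:(ncol(mask) + 2)] == 0),  # right
    arr.ind = TRUE
  )
  data.frame(y = boundary[, 1], x = boundary[, 2])  # unordered boundary pixels
}

m <- matrix(0, 6, 6)
m[2:5, 2:5] <- 1
contour <- mask_to_contour(m)
nrow(contour)  # 12 boundary pixels around the 4x4 square
```

A real contour (for polygon plotting or sf) would additionally need the points ordered along the outline, e.g. by angle around the centroid for convex shapes.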

@niksirbi

niksirbi commented Feb 5, 2025

Just chiming in to say this discussion is pure gold! Thanks to both of you I'm learning a lot.

@sfmig

sfmig commented Feb 6, 2025

Great conversation, thanks for tagging me @roaldarbol! Always sparking great discussions 🌟

And thanks @SkepticRaven for sharing your insight from actual experience fighting the data.

I am very interested in understanding these issues better, as in both movement and ethology we would like to make it very easy to transform between bboxes, keypoints and segmentation masks.

Below are some comments on the above discussion, and some questions that it would be great if you could clarify when you get some time :)

@SkepticRaven when you say you represent segmentation masks as "a padded contour matrix", are you roughly doing the following?

  1. From the binary segmentation frame you extract contours expressed as vectors of points (as returned by findContours()). You pad these vectors with -1s to make them a fixed length.
  2. You compute the lengths of the contours.
  3. You do step 1 and 2 for all frames and individuals, and stack the results in a matrix with dimensions (frame, individual_ID, contour_ID, contour_length, padded_contour_vector).
    Is that correct? I was a bit confused about the dimensions you call contour and position. Are these, respectively, the contour_ID (per frame?) and the position of the points in the extracted contour? Or is contour the semantic class (e.g. mouse), and does your animal dimension refer to the specific individual?

From your comment above, I understand that representing masks in this way facilitates using opencv contour methods (which facilitates later computation of invariant features), but that it is not significantly faster to read the data this way vs an RLE representation (or vs a binary mask). Is that right?

Re temporal smoothing, I agree with @SkepticRaven that it seems particularly tricky: if I understand correctly how the contour points are computed, we don't know which contour point in frame f corresponds to which contour point in frame f+1. In a way this means that a contour point across frames does not represent the trajectory of a physical point in time (like a keypoint maybe does, since the pose estimation model does try to match a consistent feature). So I am not sure we can assume they move smoothly from one frame to the next.

If the mask is noisy in time (e.g. the area changes a lot from one frame to the next because a contour point is wrongly detected far from the rest in a certain frame), I guess a reasonable thing would be to try and smooth the contour per frame. Maybe smoothing splines applied to the contour points are helpful here? But that would fall into the category of spatial smoothing. Morphological filters seem like a nice idea - or, if applied to the binary mask, maybe Gaussian or median convolutional filters could smooth out the edges.

@SkepticRaven

You just about nailed it, @sfmig .

To explicitly expand on the way my group currently stores the data...
It's one big 5-dimensional table of shape [frame, animal, contour, contour_length, position]

  • frame: Frame index from the video that the data corresponds to
  • animal: Animal ID the segmentation describes
  • contour: Since sometimes we need more than 1 contour to describe the animal, this is that list of contours
  • contour_length: Since we need many points to describe the contour, this is that list of points
  • position: x and y values

For example, if you selected data [200, 1, 0, 5, :], (reading indices backwards) you would get the [x, y] location of the 6th keypoint in the 1st contour for animal 2 on frame 201.

When generating the data, we use findContours() per animal to fill the last 3 dimensions of that growing matrix. The only custom stuff I did was the transformation of the return value (a list of n [l_i, 2] contours) into the padded [n, l, 2] matrix.

For reading speeds, mostly correct. I would say that there is an overall read+compute speed advantage because (a) it costs less to read direct data vs read + convert, and (b) most of the feature functions run faster on the compressed contour data (similarly, calculating centroid/bbox using the coco helper functions on RLE'd data would be faster than doing the same on frame masks).
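Translated to R's 1-indexed arrays, the padded layout and the indexing example above would look roughly like this (the toy dimensions and values are assumptions for illustration):

```r
# Padded layout: [frame, animal, contour, contour_length, position], pad = -1
seg <- array(-1L, dim = c(2, 2, 1, 8, 2))

# One 4-point contour for animal 2 on frame 1
seg[1, 2, 1, 1:4, ] <- cbind(x = c(10, 20, 20, 10),
                             y = c(5, 5, 15, 15))

# Equivalent of the Python seg[200, 1, 0, 5, :] pattern (mind 0- vs 1-indexing):
point <- seg[1, 2, 1, 4, ]  # [x, y] of the 4th point, 1st contour, animal 2, frame 1

# Recover the true contour length by dropping the -1 padding
valid <- seg[1, 2, 1, , 1] != -1
sum(valid)  # 4 points
```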

--

Smoothing splines and blurs (gaussian or median) could also work pretty well for spatial smoothing. I don't have any hard evidence, but one area where they are likely to produce the most different results is long + thin diagonals in masks. Splines will typically try to preserve the structure, while blurs + morphological filtering will typically remove it. These are pretty easy to play around with in tools like gimp and inkscape. I mostly chose morphological filtering over blurs because we can actually run them on GPUs embedded in the network (old code example). Of course blurs could also be run on-GPU, but I have a vague recollection that these were slightly more efficient at the time. That most likely has changed, since that experiment was on ~CUDA 7.0 and we're now on CUDA 12.8 with GPUs that actually have new core types.

@roaldarbol
Owner Author

roaldarbol commented Feb 6, 2025

Quick Q @SkepticRaven: do you really need the contour_length? Isn't that more or less implicit in the difference in x/y positions of two consecutive points?

Maybe an example, to also see whether I've understood it correctly: so if this is the contour, where # indicates a contour point (which has x and y values associated with it), then the length is given by where the next point is:

```
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . # = = = = = = = . . . . . .
. . . . . # . . . . . . . . # . . . . .
. . . . # . . . . . . . . . . # . . . .
. . . . # . . . . . . . . . . # . . . .
. . . . # . . . . . . . . . . # = . . .
. . . . # . . . . . . . . . . # . . . .
. . . . # = . . . . . . . . # . . . . .
. . . . . . # = = = = = = = . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
```

But ah, I see it myself - e.g. the right-most point, there it's not given - is it mostly for this type of case that it's needed?

I'm really curious about whether Parquet's compression also improves with padding. It's tabular, but grouping kinda acts as an extra dimension (it seems that Parquet actually does the padding automatically when grouping - see their compression info - which would be great news for storage without having to introduce padding in the user-facing presentation).

So maybe it'll be possible to read the files lazily when converting into e.g. bounding box or centroid... have to check.

@sfmig For temporal smoothing, completely agree. I don't think I see any robust way of doing it other than expanding to the complete matrix (as above) and doing some sort of logistic/binary smoothing, though I don't know how such methods might work.
