+Narrated Qualitative Results for Discovered Concepts via Video Transformer Concept Discovery
+This page is part of the supplemental materials of the anonymous CVPR 2024 paper submission titled "Understanding Video Transformers via Universal Concept Discovery".
+Table of contents
+Figure 1 Visualization
+In this visualization, we show the video version of Figure 1 from the main paper. The prediction heatmap of the TCOW model is shown on the left. Earlier layers capture positional information, while deeper layers capture events, objects, containers or track the target object through occlusions.
+Most important concepts
+We visualize the most important concepts for each of the four models: (1) TCOW, (2) supervised VideoMAE, (3) SSL VideoMAE, and (4) InternVideo. Only the top-1 concept is shown by default; click "Show more" to see the full list.
+Most important concepts - TCOW
+TCOW Concept 1 - Layer 5 Head 8
+This concept highlights objects with similar appearance, suggesting that the model solves the disambiguation problem by first identifying possible distractors in mid-layers.
+TCOW Concept 2 - Layer 9 Head 12
+This concept tracks the target object throughout the video.
+TCOW Concept 3 - Layer 3 Head 11
+This concept captures temporally invariant spatial position in the top-left region of the video.
+TCOW Concept 4 - Layer 3 Head 9
+This concept captures vertical spatial position and highlights a temporally invariant horizontal slice across the video.
+TCOW Concept 5 - Layer 10 Head 10
+This concept again captures the target object across the entire video.
+TCOW Concept 6 - Layer 2 Head 4
+This concept captures spatial information in the top left region of the video.
+TCOW Concept 7 - Layer 3 Head 3
+This concept also captures spatial information in the top left region of the video.
+TCOW Concept 8 - Layer 3 Head 4
+This concept also captures spatial information in the top of the video. Interestingly, all concepts encoding a spatiotemporal basis highlight regions in the middle of the video or higher. These regions may be particularly important for tracking objects in Kubric because objects are randomly spawned above containers in each video. Thus, a precise understanding of the top regions of the video is required for tracking the target object.
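One way to make "temporally invariant" concrete is to compare the pixels a concept highlights in every frame with the pixels it highlights in at least one frame. The sketch below is an illustration only, not part of the method described on this page. It assumes a concept's support is available as a boolean array of shape (T, H, W), and the function name `temporal_invariance` is hypothetical.

```python
import numpy as np


def temporal_invariance(mask):
    """Ratio of always-highlighted to ever-highlighted pixels, in [0, 1].

    `mask` is an assumed boolean array of shape (T, H, W) marking which
    pixels a concept highlights in each frame.
    """
    mask = np.asarray(mask, dtype=bool)
    always_on = mask.all(axis=0)  # highlighted in every frame
    ever_on = mask.any(axis=0)    # highlighted in at least one frame
    if not ever_on.any():
        return 0.0  # empty concept: define the score as 0
    return float(always_on.sum() / ever_on.sum())


# A static top-left block, as in a purely positional concept.
static = np.zeros((4, 8, 8), dtype=bool)
static[:, :3, :3] = True

# A single pixel moving one column per frame, as in a tracking concept.
moving = np.zeros((4, 8, 8), dtype=bool)
for t in range(4):
    moving[t, 4, t] = True

static_score = temporal_invariance(static)
moving_score = temporal_invariance(moving)
```

Under this toy measure, a purely positional concept scores close to 1, while a concept that follows a moving object scores close to 0.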
+Most important concepts - Supervised VideoMAE (dropping something into something)
+Supervised VideoMAE Concept 1 - Layer 11 Head 9
+Interestingly, the most important concept highlights the object being dropped until the dropping event, at which point both the object and container are highlighted.
+Supervised VideoMAE Concept 2 - Layer 8 Head 1
+This concept captures the container being dropped into. Notably, it does not capture the object itself, so the highlighted region forms a ring-like shape.
+Supervised VideoMAE Concept 3 - Layer 4 Head 3
+As in the TCOW model, VideoMAE also contains concepts that capture spatial information; this one highlights the center of the video.
+Supervised VideoMAE Concept 4 - Layer 12 Head 2
+This concept is in the last layer of the model and captures the container being dropped into. Notably, the video on the left shows an unusual container, an almost full drawer, which the model is still able to highlight until the bag is dropped into it.
+Supervised VideoMAE Concept 5 - Layer 6 Head 3
+This is a positional concept highlighting the top and center region of the video.
+Supervised VideoMAE Concept 6 - Layer 12 Head 3
+Interestingly, this concept, also in the final layer, highlights nothing in the video until the dropping event occurs, at which point the container and the object are highlighted.
+Supervised VideoMAE Concept 7 - Layer 4 Head 3
+This is another positional concept highlighting the top and center region of the video.
+Supervised VideoMAE Concept 8 - Layer 4 Head 3
+This is a positional concept highlighting the bottom and center region of the video.
+Most important concepts - SSL VideoMAE (dropping something into something)
+SSL VideoMAE Concept 1 - Layer 4 Head 11
+The most important concept captures the container being dropped into.
+SSL VideoMAE Concept 2 - Layer 12 Head 10
+The second most important concept also captures the container being dropped into.
+SSL VideoMAE Concept 3 - Layer 7 Head 7
+Interestingly, we observe that the third most important concept is a spatially invariant temporal basis that captures the beginning of the video. At the beginning of the video, everything is highlighted, and then after a few frames, nothing is highlighted.
+SSL VideoMAE Concept 4 - Layer 3 Head 10
+This concept is a spatial position concept capturing the top center region of the video.
+SSL VideoMAE Concept 5 - Layer 5 Head 8
+This concept is a spatial position concept capturing the right region of the video.
+SSL VideoMAE Concept 6 - Layer 9 Head 12
+This is another spatial position concept capturing the top center region of the video.
+SSL VideoMAE Concept 7 - Layer 4 Head 4
+This is an interesting spatiotemporal basis that highlights the right part of the video during the middle temporal segment of the video.
+SSL VideoMAE Concept 8 - Layer 12 Head 9
+This is another spatial position concept capturing the bottom left region of the video.
+Most important concepts - InternVideo (dropping something into something)
+InternVideo Concept 1 - Layer 11 Head 2
+Interestingly, the most important concept for InternVideo captures hands dropping the object.
+InternVideo Concept 2 - Layer 6 Head 8
+This concept captures textured patterns in the image. Notably, it highlights background and foreground regions that contain textured patterns and tracks these regions throughout the video.
+InternVideo Concept 3 - Layer 3 Head 11
+This is another spatial position concept capturing the bottom right region of the video.
+InternVideo Concept 4 - Layer 10 Head 12
+This is another spatial position concept capturing the bottom right region of the video. However, unlike Concept 3, it is not completely temporally invariant: the boundary of the concept support changes non-trivially over the video.
+InternVideo Concept 5 - Layer 4 Head 1
+This is another spatial position concept capturing the bottom left region of the video.
+InternVideo Concept 6 - Layer 7 Head 1
+This is another spatial position concept capturing the right region of the video.
+InternVideo Concept 7 - Layer 4 Head 3
+This is another spatial position concept capturing the top right region of the video.
+InternVideo Concept 8 - Layer 1 Head 3
+This concept, which occurs at the first layer, captures orange-brown color.
+Rosetta concepts - SSv2: rolling something on a flat surface
+Here we visualize representative Rosetta concepts that are shared between all four models analyzed in our experiments: (1) TCOW, (2) supervised VideoMAE, (3) SSL VideoMAE, and (4) InternVideo. Only one Rosetta concept is shown by default; click "Show more" to see the full list.
+Rosetta concept 1
+In this visualization, we show the Rosetta concept with the highest score of 22% mIoU when filtering by the most important 7.5% of concepts. This Rosetta concept captures spatial position information and is contained in the early layers of all models.
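The Rosetta scores on this page are reported as mIoU. As a rough illustration only, the sketch below assumes the score is the mean pairwise IoU between binarized concept masks taken from different models; the exact definition is given in the main paper and may differ. The function names `iou` and `mean_pairwise_iou` and the toy masks are hypothetical.

```python
from itertools import combinations

import numpy as np


def iou(a, b):
    """Intersection over union of two boolean masks of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # two empty masks: define IoU as 0
    return float(np.logical_and(a, b).sum() / union)


def mean_pairwise_iou(masks):
    """Average IoU over every unordered pair of masks (assumed score)."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two masks")
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


# Toy masks of shape (T, H, W); stand-ins for one concept per model.
model_a = np.zeros((2, 4, 4), dtype=bool)
model_a[:, :2, :2] = True
model_b = np.zeros((2, 4, 4), dtype=bool)
model_b[:, :2, 1:3] = True  # overlaps model_a in one column
model_c = model_a.copy()    # identical to model_a

score = mean_pairwise_iou([model_a, model_b, model_c])
```

Averaging over all pairs means that a single model whose concept is slightly offset lowers the score for the whole group, which is one reason scores well below 100% can still correspond to visually similar concepts.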
+Rosetta concept 2
+This visualization shows a Rosetta concept with a score of 18% mIoU. Interestingly, we observe that all models learn to localize and track individual objects over space and time. This is particularly interesting for self-supervised models like SSL VideoMAE and InternVideo, which do not have access to any labels.
+Rosetta concept 3
+In this visualization, we show a Rosetta concept with a score of 15% mIoU. We again observe an object-centric concept in all models, capturing the notion of a hand.
+Rosetta concept 4
+In this visualization, we observe a Rosetta concept with a score of 18% mIoU that highlights the region the object is rolling into. This suggests all models encode a notion of where an object will move to in the future.
+Rosetta concept 5
+In contrast to concept 4, which captured where an object will roll to, this visualization shows a Rosetta concept (16% mIoU) that captures the region that the rolling object has rolled from.
+Query, Key, and Value Comparison
+Finally, we demonstrate that VTCD produces interpretable concepts for units of interest other than Keys, which are studied in the main paper. Here, we visualize the most important concepts when discovering concepts in the queries, keys, and values of the TCOW model. We note some similarities between the most important concepts discovered in each unit: (i) queries and keys produce concepts that closely track the target object; (ii) all units produce positional concepts. We also note some differences between the three units: (i) Queries produce multiple concepts that track the target object during the beginning of the video but switch focus midway through; (ii) Values produce the most positional concepts. Overall, Keys result in the most diverse and interpretable concepts, validating our design choice.
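To make the three "units" concrete, the sketch below shows how per-head queries, keys, and values are commonly obtained in a transformer attention layer. It assumes a fused QKV projection without bias, as found in many ViT-style implementations; whether TCOW is laid out this way is not stated on this page. The function name `split_qkv_per_head` and all shapes are hypothetical.

```python
import numpy as np


def split_qkv_per_head(tokens, w_qkv, num_heads):
    """Project tokens and split the result into per-head Q, K, V features.

    tokens: (N, D) token embeddings for one video (assumed layout).
    w_qkv:  (D, 3 * D) fused projection weights (assumed, no bias).
    Returns a dict of arrays, each of shape (num_heads, N, D // num_heads).
    """
    n, d = tokens.shape
    if w_qkv.shape != (d, 3 * d) or d % num_heads != 0:
        raise ValueError("incompatible shapes")
    head_dim = d // num_heads
    q, k, v = np.split(tokens @ w_qkv, 3, axis=1)  # each (N, D)

    def to_heads(x):
        # (N, D) -> (num_heads, N, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    return {"queries": to_heads(q), "keys": to_heads(k), "values": to_heads(v)}


rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))   # 6 tokens, embedding dimension 8
w_qkv = rng.standard_normal((8, 24))   # fused Q, K, V weights
units = split_qkv_per_head(tokens, w_qkv, num_heads=4)
```

Each entry, for example `units["keys"][1]`, holds the features of one head; a label such as "Layer 5 Head 8" in the sections below refers to one such head in one layer.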
+Most important concepts - TCOW Keys
+Keys Concept 1 - Layer 5 Head 8
+This concept highlights objects with similar appearance, suggesting that the model solves the disambiguation problem by first identifying possible distractors in mid-layers.
+Keys Concept 2 - Layer 9 Head 12
+This concept tracks the target object throughout the video.
+Keys Concept 3 - Layer 3 Head 11
+This concept captures temporally invariant spatial position in the top-left region of the video.
+Keys Concept 4 - Layer 3 Head 9
+This concept captures vertical spatial position and highlights a temporally invariant horizontal slice across the video.
+Keys Concept 5 - Layer 10 Head 10
+This concept again captures the target object across the entire video.
+Keys Concept 6 - Layer 2 Head 4
+This concept captures spatial information in the top left region of the video.
+Keys Concept 7 - Layer 3 Head 3
+This concept also captures spatial information in the top left region of the video.
+Keys Concept 8 - Layer 3 Head 4
+This concept also captures spatial information in the top of the video. Interestingly, all concepts encoding a spatiotemporal basis highlight regions in the middle of the video or higher. These regions may be particularly important for tracking objects in TCOW Kubric because objects are randomly spawned above containers in each video. Thus, a precise understanding of the top regions of the video is required for tracking the target object.
+Most important concepts - TCOW Queries
+Queries Concept 1 - Layer 10 Head 10
+Interestingly, the most important concept tracks the target object through occlusions.
+Queries Concept 2 - Layer 8 Head 6
+The second most important concept highlights the region around the object during the beginning and middle of the video, but then remains in the same position afterwards.
+Queries Concept 3 - Layer 8 Head 11
+Similar to concept 2, this concept tracks the target object until it collides with something, and then ceases to highlight the target object and highlights the same region in space for the rest of the video.
+Queries Concept 4 - Layer 9 Head 8
+This concept closely tracks the target object.
+Queries Concept 5 - Layer 9 Head 12
+This concept highlights the target object falling in the top of the video, but then stops tracking the object and continues to highlight the top center region of the video.
+Queries Concept 6 - Layer 7 Head 9
+Once again, this concept highlights the target object in the first frame, but then captures spatial position in the center of the video.
+Queries Concept 7 - Layer 8 Head 8
+Interestingly, this concept seems to track the region that the target object is moving into, potentially suggesting the model is anticipating where the target object will move to next.
+Queries Concept 8 - Layer 6 Head 1
+This concept captures a single container in the video.
+Most important concepts - TCOW Values
+Values Concept 1 - Layer 5 Head 9
+Interestingly, the most important concept for the Values captures the background in the top left region of the video. Notably, it does not highlight any objects, forming a ring-like shape around any object that travels through the top left region.
+Values Concept 2 - Layer 4 Head 9
+This is a temporally invariant spatial position concept highlighting the top left region of the video.
+Values Concept 3 - Layer 2 Head 11
+This concept captures positional information in the middle left of the video.
+Values Concept 4 - Layer 4 Head 10
+This concept highlights large objects in the video. This could be the model identifying possible occluders or containers in the middle layers for later processing.
+Values Concept 5 - Layer 9 Head 12
+Interestingly, this concept captures nothing until several frames into the video, at which point it captures large objects in the left part of the image, again suggesting the model may be identifying possible occluders.
+Values Concept 6 - Layer 8 Head 11
+This concept highlights many objects surrounding the target object, but not the target object itself.
+Values Concept 7 - Layer 2 Head 3
+This concept captures spatial information, highlighting the top portion of the video, but also approximately follows some object boundaries, so it is not fully temporally invariant.
+Values Concept 8 - Layer 5 Head 8
+This is a spatial position concept highlighting the top center region of the video.
+