# Monocular Depth Estimation Rankings and 2D to 3D Video Conversion Rankings
Researchers! On 19 December 2024, a preprint was published that focuses on "evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation." The 4DS-j model presented there achieves significantly better monocular depth estimation results than DINOv2 ViT-g, making it a better backbone than DINOv2 for specialised video depth estimation models, which in turn can be the basis for better 2D to 3D video conversion. Please try to implement the 4DS-j backbone instead of DINOv2 ViT-g in your future breakthrough video depth estimation models! A rough sketch of such a backbone swap follows, and below it is a special ranking showing the capabilities of 4DS-j:
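As an illustration only (not the actual 4DS API: `backbone` and `feat_dim` are placeholders for whatever the released model exposes), here is a minimal PyTorch sketch of the frozen-backbone probing setup that the paper's first AbsRel column corresponds to, with only a small depth head being trained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackboneDepthProbe(nn.Module):
    """Frozen video backbone (e.g. 4DS-j in place of DINOv2 ViT-g) + trainable depth head."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen-backbone setting
        self.head = nn.Sequential(   # minimal dense prediction head
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, C, H, W). Assumption: the backbone returns a dense
        # (B, feat_dim, h, w) feature map; real 4DS-j features may need reshaping.
        feats = self.backbone(frames)
        depth = self.head(feats)
        return F.interpolate(depth, size=frames.shape[-2:],
                             mode="bilinear", align_corners=False)
```

For the finetuning columns, the same wiring applies with the `requires_grad` freeze removed and, typically, a lower learning rate on the backbone than on the head.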
## ScanNet: AbsRel (TOP2 best backbones for monocular depth estimation)
| RK | Model (Links: Venue, Repository) | AbsRel ↓ (frozen backbone) 4DS | AbsRel ↓ (short finetuning) 4DS | AbsRel ↓ (medium finetuning) 4DS | AbsRel ↓ (long finetuning) 4DS |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 4DS-j | 0.85 | 0.63 | 0.59 | 0.57 |
| 2 | DINOv2-g | 0.92 | 0.76 | 0.69 | 0.66 |
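For reference, AbsRel throughout these rankings is the standard absolute relative error, mean(|pred − gt| / gt) over valid pixels (lower is better). A minimal NumPy version, assuming a simple validity threshold for missing ground truth:

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray, min_depth: float = 1e-3) -> float:
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    valid = gt > min_depth  # ignore pixels with missing/zero ground truth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))
```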
Due to the recent influx of new models that I am unable to add to the rankings immediately, I have added a waiting list of new models:
| Method | Paper | Venue | Official repository |
|:-:|:-:|:-:|:-:|
| Align3R | Align3R: Aligned Monocular Depth Estimation for Dynamic Videos | | |
| FiffDepth | FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation | | - |
| ImmersePro | ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning | | |
| MegaSaM | MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos | | |
| RollingDepth | Video Depth without Video Models | | |
| SpatialMe | SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting | | |
## Qualitative comparison of four 2D to 3D video conversion methods: Rank (human perceptual judgment)
📝 Note: There are no quantitative comparison results for StereoCrafter yet, so this ranking is based on my own perceptual judgment of the qualitative comparison results shown in Figure 7. One output frame (right view) is compared with one input frame (left view) from the video clip 22_dogskateboarder, and one output frame (right view) is compared with one input frame (left view) from the video clip scooter-black.
| RK | Model (Links: Venue, Repository) | Rank ↓ (human perceptual judgment) |
|:-:|:-:|:-:|
| 1 | StereoCrafter | 1 |
| 2-3 | Immersity AI | 2-3 |
| 2-3 | Owl3D | 2-3 |
| 4 | Deep3D | 4 |
## ScanNet (170 frames): TAE<=2.2
| RK | Model (Links: Venue, Repository) | TAE ↓ {Input fr.} VDA |
|:-:|:-:|:-:|
| 1 | VDA-L | 0.570 {MF} |
| 2 | DepthCrafter | 0.639 {MF} |
| 3 | Depth Any Video | 0.967 {MF} |
| 4 | ChronoDepth | 1.022 {MF} |
| 5 | Depth Anything V2 Large | 1.140 {1} |
| 6 | NVDS | 2.176 {4} |
## Bonn RGB-D Dynamic (5 video clips with 110 frames each): OPW<=0.1
| RK | Model (Links: Venue, Repository) | OPW ↓ {Input fr.} BA |
|:-:|:-:|:-:|
| 1 | Buffer Anytime (DA V2) | 0.028 {MF} |
| 2 | DepthCrafter | 0.029 {MF} |
| 3 | ChronoDepth | 0.035 {MF} |
| 4 | Marigold + E2E FT | 0.053 {1} |
| 5 | Depth Anything V2 Large | 0.059 {1} |
| 6 | NVDS | 0.068 {4} |
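OPW is the optical-flow-based warping error used by NVDS to measure temporal consistency: the next frame's depth is warped back to the current frame along optical flow, and the masked disagreement is averaged over consecutive frame pairs. A simplified single-pair sketch, assuming precomputed flow and an occlusion mask (the exact masking and normalization follow the NVDS paper):

```python
import numpy as np

def opw_pair(d_t: np.ndarray, d_next: np.ndarray,
             flow: np.ndarray, valid: np.ndarray) -> float:
    """One term of an OPW-style warping error (lower = temporally smoother).

    d_t, d_next: (H, W) depth maps for frames t and t+1.
    flow: (H, W, 2) forward optical flow from frame t to t+1 (x, y order).
    valid: (H, W) bool mask, True where the flow is reliable (non-occluded).
    """
    h, w = d_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour warp of d_next onto frame t's pixel grid.
    xw = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    yw = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    warped = d_next[yw, xw]
    return float(np.mean(np.abs(d_t - warped)[valid]))
```

The per-sequence OPW is then the average of `opw_pair` over all consecutive frame pairs.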
## ScanNet++ (98 video clips with 32 frames each): TAE
| RK | Model (Links: Venue, Repository) | TAE ↓ {Input fr.} DAV |
|:-:|:-:|:-:|
| 1 | Depth Any Video | 2.1 {MF} |
| 2 | DepthCrafter | 2.2 {MF} |
| 3 | ChronoDepth | 2.3 {MF} |
| 4 | NVDS | 3.7 {4} |
## NYU-Depth V2: OPW<=0.37
| RK | Model (Links: Venue, Repository) | OPW ↓ {Input fr.} FD | OPW ↓ {Input fr.} NVDS+ | OPW ↓ {Input fr.} NVDS |
|:-:|:-:|:-:|:-:|:-:|
| 1 | FutureDepth | 0.303 {4} | - | - |
| 2 | NVDS+ | - | 0.339 {4} | - |
| 3 | NVDS | 0.364 {4} | - | 0.364 {4} |
## Direct pairwise comparison of 9 metric depth models on 5 datasets: F-score
📝 Note: This ranking is based on data from Table 4. The example record 3:0:2 (first on the left in the first row) means that Depth Pro has a better F-score than UniDepth-V on 3 datasets, the same F-score on no dataset, and a worse F-score on 2 datasets.
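To make the record format mechanical, here is a small sketch of how such win:tie:loss records can be derived from per-dataset F-scores (the numbers below are made up for illustration, not values from Table 4):

```python
def pairwise_record(f_a: list[float], f_b: list[float]) -> str:
    """Win:tie:loss record for model A vs model B over per-dataset F-scores (higher is better)."""
    wins = sum(a > b for a, b in zip(f_a, f_b))
    ties = sum(a == b for a, b in zip(f_a, f_b))
    return f"{wins}:{ties}:{len(f_a) - wins - ties}"

# Illustrative only: five datasets, model A beats model B on four of them.
print(pairwise_record([0.90, 0.80, 0.70, 0.60, 0.50],
                      [0.80, 0.90, 0.60, 0.50, 0.40]))  # -> 4:0:1
```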
| RK | Model (Links: Venue, Repository) | DP | UD | M3D v2 | DA V2 | DA | ZoeD | M3D | PF | ZD |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | Depth Pro | - | 3:0:2 | 3:1:1 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 3:0:0 |
| 2 | UniDepth-V | 2:0:3 | - | 4:0:1 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 3:0:0 |
| 3 | Metric3D v2 ViT-giant | 1:1:3 | 1:0:4 | - | 4:1:0 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 3:0:0 |
| 4 | Depth Anything V2 | 0:0:5 | 0:0:5 | 0:1:4 | - | 4:1:0 | 4:0:1 | 5:0:0 | 4:0:1 | 3:0:0 |
| 5 | Depth Anything | 0:0:5 | 0:0:5 | 0:0:5 | 0:1:4 | - | 3:0:2 | 3:1:1 | 3:0:2 | 2:1:0 |
| 6 | ZoeD-M12-NK | 0:0:5 | 0:0:5 | 0:0:5 | 1:0:4 | 2:0:3 | - | 3:0:2 | 3:1:1 | 2:0:1 |
| 7 | Metric3D | 0:0:5 | 0:0:5 | 0:0:5 | 0:0:5 | 1:1:3 | 2:0:3 | - | 3:0:2 | 2:1:0 |
| 8 | PatchFusion | 0:0:5 | 0:0:5 | 0:0:5 | 1:0:4 | 2:0:3 | 1:1:3 | 2:0:3 | - | 2:0:1 |
## Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.078
📝 Note: This ranking will temporarily not be updated; see Figure 4.
| RK | Model (Links: Venue, Repository) | AbsRel ↓ {Input fr.} MonST3R | AbsRel ↓ {Input fr.} DC |
|:-:|:-:|:-:|:-:|
| 1 | MonST3R | 0.063 {MF} | - |
| 2 | DepthCrafter | 0.075 {MF} | 0.075 {MF} |
| 3 | Depth Anything | - | 0.078 {1} |
## NYU-Depth V2: AbsRel<=0.045 (relative depth)
| RK | Model (Links: Venue, Repository) | AbsRel ↓ {Input fr.} MoGe | AbsRel ↓ {Input fr.} BD | AbsRel ↓ {Input fr.} M3D v2 | AbsRel ↓ {Input fr.} DA | AbsRel ↓ {Input fr.} DA V2 |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | MoGe | 0.0341 {1} | - | - | - | - |
| 2 | UniDepth | 0.0380 {1} | - | - | - | - |
| 3-4 | BetterDepth | - | 0.042 {1} | - | - | - |
| 3-4 | Metric3D v2 ViT-Large | 0.134 {1} | - | 0.042 {1} | - | - |
| 5 | Depth Anything Large | 0.0424 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} |
| 6 | Depth Anything V2 Large | 0.0420 {1} | - | - | - | 0.045 {1} |
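The "(relative depth)" qualifier matters: these predictions are defined only up to scale (and sometimes shift), so they are aligned to ground truth before AbsRel is computed. A common median-scaling variant, sketched under the assumption of a valid-pixel mask:

```python
import numpy as np

def median_align(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Scale a relative-depth prediction to metric ground truth via median ratio."""
    scale = np.median(gt[valid]) / np.median(pred[valid])
    return pred * scale
```

Least-squares scale-and-shift alignment, as popularized by MiDaS, is the other common choice; metric depth models in the next ranking are evaluated without any such alignment.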
## NYU-Depth V2: AbsRel<=0.051 (metric depth)
| RK | Model (Links: Venue, Repository) | AbsRel ↓ {Input fr.} M3D v2 | AbsRel ↓ {Input fr.} GRIN |
|:-:|:-:|:-:|:-:|
| 1 | Metric3D v2 ViT-giant | 0.045 {1} | - |
| 2 | GRIN_FT_NI | - | 0.051 {1} |
## NYU-Depth V2 (640×480): AbsRel<=0.058 (old layout - currently no longer up to date)
| RK | Model | AbsRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1-2 | BetterDepth (Backbone: Depth Anything & Marigold) | 0.042 {1} | Hypersim & Virtual KITTI | - | - | - |
| 1-2 | Metric3D v2 CSTM_label ENH (Backbone: DINOv2 with registers, ViT-L/14) | | | | | |