# Monocular Depth Estimation Rankings and 2D to 3D Video Conversion Rankings
Researchers! On 19 December 2024, a preprint was published that focuses on "evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation." The 4DS-j model presented there achieves significantly better monocular depth estimation results than DINOv2 ViT-g, making it a better backbone than DINOv2 for specialised video depth estimation models, which in turn can be the basis for better 2D to 3D video conversion. Please try to implement the 4DS-j backbone instead of DINOv2 ViT-g in your future breakthrough video depth estimation models! A rough sketch of such a backbone swap follows, and below it is a special ranking showing the capabilities of 4DS-j:
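As an illustration only (not the actual 4DS API: `backbone` and `feat_dim` are placeholders for whatever the released model exposes), here is a minimal PyTorch sketch of the frozen-backbone probing setup that the paper's first AbsRel column corresponds to, with only a small depth head being trained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackboneDepthProbe(nn.Module):
    """Frozen video backbone (e.g. 4DS-j in place of DINOv2 ViT-g) + trainable depth head."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen-backbone setting
        self.head = nn.Sequential(   # minimal dense prediction head
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, C, H, W). Assumption: the backbone returns a dense
        # (B, feat_dim, h, w) feature map; real 4DS-j features may need reshaping.
        feats = self.backbone(frames)
        depth = self.head(feats)
        return F.interpolate(depth, size=frames.shape[-2:],
                             mode="bilinear", align_corners=False)
```

For the finetuning columns, the same wiring applies with the `requires_grad` freeze removed and, typically, a lower learning rate on the backbone than on the head.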
## ScanNet: AbsRel (TOP2 best backbones for monocular depth estimation)
| RK | Model (Links: Venue, Repository) | AbsRel ↓ (frozen backbone) 4DS | AbsRel ↓ (short finetuning) 4DS | AbsRel ↓ (medium finetuning) 4DS | AbsRel ↓ (long finetuning) 4DS |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 4DS-j | 0.85 | 0.63 | 0.59 | 0.57 |
| 2 | DINOv2-g | 0.92 | 0.76 | 0.69 | 0.66 |
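For reference, AbsRel throughout these rankings is the standard absolute relative error, mean(|pred − gt| / gt) over valid pixels (lower is better). A minimal NumPy version, assuming a simple validity threshold for missing ground truth:

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray, min_depth: float = 1e-3) -> float:
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    valid = gt > min_depth  # ignore pixels with missing/zero ground truth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))
```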
Due to the recent influx of new models that I am unable to add to the rankings immediately, I have added a waiting list of new models:
| Method | Paper | Venue | Official repository |
|:-:|:-:|:-:|:-:|
| Align3R | Align3R: Aligned Monocular Depth Estimation for Dynamic Videos | | |
| FiffDepth | FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation | | - |
| ImmersePro | ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning | | |
| MegaSaM | MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos | | |
| RollingDepth | Video Depth without Video Models | | |
| SpatialMe | SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting | | |
## Qualitative comparison of four 2D to 3D video conversion methods: Rank (human perceptual judgment)
📝 Note: There are no quantitative comparison results for StereoCrafter yet, so this ranking is based on my own perceptual judgment of the qualitative comparison results shown in Figure 7. One output frame (right view) is compared with one input frame (left view) from the video clip 22_dogskateboarder, and one output frame (right view) is compared with one input frame (left view) from the video clip scooter-black.
| RK | Model (Links: Venue, Repository) | Rank ↓ (human perceptual judgment) |
|:-:|:-:|:-:|
| 1 | StereoCrafter | 1 |
| 2-3 | Immersity AI | 2-3 |
| 2-3 | Owl3D | 2-3 |
| 4 | Deep3D | 4 |
## ScanNet (170 frames): TAE<=2.2
| RK | Model (Links: Venue, Repository) | TAE ↓ {Input fr.} VDA |
|:-:|:-:|:-:|
| 1 | VDA-L | 0.570 {MF} |
| 2 | DepthCrafter | 0.639 {MF} |
| 3 | Depth Any Video | 0.967 {MF} |
| 4 | ChronoDepth | 1.022 {MF} |
| 5 | Depth Anything V2 Large | 1.140 {1} |
| 6 | NVDS | 2.176 {4} |
## Bonn RGB-D Dynamic (5 video clips with 110 frames each): OPW<=0.1
| RK | Model (Links: Venue, Repository) | OPW ↓ {Input fr.} BA |
|:-:|:-:|:-:|
| 1 | Buffer Anytime (DA V2) | 0.028 {MF} |
| 2 | DepthCrafter | 0.029 {MF} |
| 3 | ChronoDepth | 0.035 {MF} |
| 4 | Marigold + E2E FT | 0.053 {1} |
| 5 | Depth Anything V2 Large | 0.059 {1} |
| 6 | NVDS | 0.068 {4} |
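OPW is the optical-flow-based warping error used by NVDS to measure temporal consistency: the next frame's depth is warped back to the current frame along optical flow, and the masked disagreement is averaged over consecutive frame pairs. A simplified single-pair sketch, assuming precomputed flow and an occlusion mask (the exact masking and normalization follow the NVDS paper):

```python
import numpy as np

def opw_pair(d_t: np.ndarray, d_next: np.ndarray,
             flow: np.ndarray, valid: np.ndarray) -> float:
    """One term of an OPW-style warping error (lower = temporally smoother).

    d_t, d_next: (H, W) depth maps for frames t and t+1.
    flow: (H, W, 2) forward optical flow from frame t to t+1 (x, y order).
    valid: (H, W) bool mask, True where the flow is reliable (non-occluded).
    """
    h, w = d_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour warp of d_next onto frame t's pixel grid.
    xw = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    yw = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    warped = d_next[yw, xw]
    return float(np.mean(np.abs(d_t - warped)[valid]))
```

The per-sequence OPW is then the average of `opw_pair` over all consecutive frame pairs.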
## ScanNet++ (98 video clips with 32 frames each): TAE
| RK | Model (Links: Venue, Repository) | TAE ↓ {Input fr.} DAV |
|:-:|:-:|:-:|
| 1 | Depth Any Video | 2.1 {MF} |
| 2 | DepthCrafter | 2.2 {MF} |
| 3 | ChronoDepth | 2.3 {MF} |
| 4 | NVDS | 3.7 {4} |
## NYU-Depth V2: OPW<=0.37
| RK | Model (Links: Venue, Repository) | OPW ↓ {Input fr.} FD | OPW ↓ {Input fr.} NVDS+ | OPW ↓ {Input fr.} NVDS |
|:-:|:-:|:-:|:-:|:-:|
| 1 | FutureDepth | 0.303 {4} | - | - |
| 2 | NVDS+ | - | 0.339 {4} | - |
| 3 | NVDS | 0.364 {4} | - | 0.364 {4} |
## Direct pairwise comparison of 9 metric depth models on 5 datasets: F-score
📝 Note: This ranking is based on data from Table 4. The example record 3:0:2 (first on the left in the first row) means that Depth Pro has a better F-score than UniDepth-V on 3 datasets, the same F-score on no dataset, and a worse F-score on 2 datasets.
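To make the record format mechanical, here is a small sketch of how such win:tie:loss records can be derived from per-dataset F-scores (the numbers below are made up for illustration, not values from Table 4):

```python
def pairwise_record(f_a: list[float], f_b: list[float]) -> str:
    """Win:tie:loss record for model A vs model B over per-dataset F-scores (higher is better)."""
    wins = sum(a > b for a, b in zip(f_a, f_b))
    ties = sum(a == b for a, b in zip(f_a, f_b))
    return f"{wins}:{ties}:{len(f_a) - wins - ties}"

# Illustrative only: five datasets, model A beats model B on four of them.
print(pairwise_record([0.90, 0.80, 0.70, 0.60, 0.50],
                      [0.80, 0.90, 0.60, 0.50, 0.40]))  # -> 4:0:1
```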
| RK | Model (Links: Venue, Repository) | DP | UD | M3D v2 | DA V2 | DA | ZoeD | M3D | PF | ZD |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | Depth Pro | - | 3:0:2 | 3:1:1 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 3:0:0 |
| 2 | UniDepth-V | 2:0:3 | - | 4:0:1 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 3:0:0 |
| 3 | Metric3D v2 ViT-giant | 1:1:3 | 1:0:4 | - | 4:1:0 | 5:0:0 | 5:0:0 | 5:0:0 | 5:0:0 | 3:0:0 |
| 4 | Depth Anything V2 | 0:0:5 | 0:0:5 | 0:1:4 | - | 4:1:0 | 4:0:1 | 5:0:0 | 4:0:1 | 3:0:0 |
| 5 | Depth Anything | 0:0:5 | 0:0:5 | 0:0:5 | 0:1:4 | - | 3:0:2 | 3:1:1 | 3:0:2 | 2:1:0 |
| 6 | ZoeD-M12-NK | 0:0:5 | 0:0:5 | 0:0:5 | 1:0:4 | 2:0:3 | - | 3:0:2 | 3:1:1 | 2:0:1 |
| 7 | Metric3D | 0:0:5 | 0:0:5 | 0:0:5 | 0:0:5 | 1:1:3 | 2:0:3 | - | 3:0:2 | 2:1:0 |
| 8 | PatchFusion | 0:0:5 | 0:0:5 | 0:0:5 | 1:0:4 | 2:0:3 | 1:1:3 | 2:0:3 | - | 2:0:1 |
## Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.078
📝 Note: This ranking will temporarily not be updated; see Figure 4.
| RK | Model (Links: Venue, Repository) | AbsRel ↓ {Input fr.} MonST3R | AbsRel ↓ {Input fr.} DC |
|:-:|:-:|:-:|:-:|
| 1 | MonST3R | 0.063 {MF} | - |
| 2 | DepthCrafter | 0.075 {MF} | 0.075 {MF} |
| 3 | Depth Anything | - | 0.078 {1} |
## NYU-Depth V2: AbsRel<=0.045 (relative depth)
| RK | Model (Links: Venue, Repository) | AbsRel ↓ {Input fr.} MoGe | AbsRel ↓ {Input fr.} BD | AbsRel ↓ {Input fr.} M3D v2 | AbsRel ↓ {Input fr.} DA | AbsRel ↓ {Input fr.} DA V2 |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | MoGe | 0.0341 {1} | - | - | - | - |
| 2 | UniDepth | 0.0380 {1} | - | - | - | - |
| 3-4 | BetterDepth | - | 0.042 {1} | - | - | - |
| 3-4 | Metric3D v2 ViT-Large | 0.134 {1} | - | 0.042 {1} | - | - |
| 5 | Depth Anything Large | 0.0424 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} |
| 6 | Depth Anything V2 Large | 0.0420 {1} | - | - | - | 0.045 {1} |
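The "(relative depth)" qualifier matters: these predictions are defined only up to scale (and sometimes shift), so they are aligned to ground truth before AbsRel is computed. A common median-scaling variant, sketched under the assumption of a valid-pixel mask:

```python
import numpy as np

def median_align(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Scale a relative-depth prediction to metric ground truth via median ratio."""
    scale = np.median(gt[valid]) / np.median(pred[valid])
    return pred * scale
```

Least-squares scale-and-shift alignment, as popularized by MiDaS, is the other common choice; metric depth models in the next ranking are evaluated without any such alignment.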
## NYU-Depth V2: AbsRel<=0.051 (metric depth)
| RK | Model (Links: Venue, Repository) | AbsRel ↓ {Input fr.} M3D v2 | AbsRel ↓ {Input fr.} GRIN |
|:-:|:-:|:-:|:-:|
| 1 | Metric3D v2 ViT-giant | 0.045 {1} | - |
| 2 | GRIN_FT_NI | - | 0.051 {1} |
## NYU-Depth V2 (640×480): AbsRel<=0.058 (old layout - currently no longer up to date)
| RK | Model | AbsRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1-2 | BetterDepth (Backbone: Depth Anything & Marigold) | 0.042 {1} | Hypersim & Virtual KITTI | - | - | - |
| 1-2 | Metric3D v2 CSTM_label ENH (Backbone: DINOv2 with registers, ViT-L/14) | | | | | |