Diffusion²: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models
Diffusion²: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models,
Zeyu Yang*, Zijie Pan*, Chun Gu, Li Zhang
Fudan University
ICLR 2025
This repository is the official implementation of "Diffusion²: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models". In this paper, we propose to achieve 4D generation from directly sampling the dense multi-view multi-frame observation of dynamic content by composing the estimated score of pretrained video and multi-view diffusion models that have learned strong prior of dynamic and geometry.
- (2024/10/2) We update the paper of Diffusion², enhancing its readability and highlighting key implementation details. Welcome to check!
- (2024/8/6) The initial version of code is released, including the key part of the proposed framework, i.e., joint denoising for sampling consistent image matrix, and VRS to ensure the generation of seamless results.
- (2024/4/2) The paper of Diffusion² is avaliable at ArXiv.
- Clone this repo and
cd
into it.
git clone https://github.com/fudan-zvg/diffusion-square.git
cd diffusion-square
- Create the virtual environment with torch.
conda create -n dxd python=3.10
conda activate dxd
# Install torch: tested on torch 2.0.1 & CUDA 11.8.
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
- Install other required packages.
pip install -r requirements.txt
- Download the pretrained safetensors of the multi-view diffusion model and the video diffusion model to
checkpoints/
.
The inference script will automatically perform rescaling and recentering on the input images, with optional background removal beforehand for non-RGBA images, eliminating the need for manual processing. Therefore, there is no need for too much manual processing. Users just need to ensure that the input for the single-view video is organized as the following structure:
video_folder
├── 0.png
├── 1.png
├── 2.png
└── ...
STAGE-1: Multi-view videos generation
After the installation, you can sample the synchronized multi-view videos using:
PYTHONPATH="." MASTER_ADDR=localhost MASTER_PORT=12345 python main.py \
--input_path SAMPLE_NAME \
--elevations_deg 0.0 \
--image_frame_ratio 0.9
SAMPLE_NAME
should be the path of input image or video folder for text-to-4D and video-to-4D generation respectively.
STAGE-2: 4D reconstruction from synthesized videos
Once the dense multi-view multi-frame images are generated, many off-the-shelf 4D reconstruction pipelines can be employed to obtain continuous 4D assets. We are working on integrating a simplified version of the adopted reconstruction method into this repo.
ablations.mp4
assets_comp.mp4
@inproceedings{yang2024diffusion,
title={Diffusion²: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models},
author={Yang, Zeyu and Pan, Zijie and Gu, Chun and Zhang, Li},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
This repo is built on the https://github.com/Stability-AI/generative-models.
- We sincerely thanks to Yurui Chen for the invalueble discussing on this work.