You can click Watch and Star to receive the latest updates at any time.
Watch Me! Watch Me! Watch Me!
Star Me! Star Me! Star Me!
🎁 >>>>>>>> [English Introduction] <<<<<<<<<<
This project provides a thorough summary of the latest advancements in the field of 2D digital human motion video generation, covering papers, datasets, and code repositories.
The repository is organized around three main driving conditions: Vision-driven, Text-driven, and Audio-driven, while also covering LLM planning papers.
Unlike previous summaries, this project clearly outlines the five key stages in the field of digital human video generation (a minimal pipeline sketch follows the list):
🌑 Stage 1: Input Phase. Identifying the driving source (Vision, Text, Audio) and the driven region (Part, Holistic), where "Part" mainly refers to the face;
🌒 Stage 2: Motion Planning Phase. Most works learn motion mappings via feature mapping, while a few use large language models (LLMs) for motion planning;
🌓 Stage 3: Motion Video Generation Phase. Most works build on diffusion models, while a few build on Transformers;
🌔 Stage 4: Video Refinement Phase, focusing on optimizing specific parts such as the face, lips, teeth, and hands;
🌕 Stage 5: Acceleration Phase, aiming to speed up training and deployment inference as much as possible, with the goal of achieving real-time output.
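To make the five stages concrete, here is a minimal, hypothetical sketch of how they compose into a single pipeline. Every name in it (DrivingInput, plan_motion, generate_video, refine_video, accelerate) is an illustrative placeholder, not an API from any paper in this list:

```python
# A hypothetical end-to-end sketch of the five stages; all names are placeholders.
from dataclasses import dataclass
from typing import List, Literal

DrivingSource = Literal["vision", "text", "audio"]
DrivenRegion = Literal["part", "holistic"]  # "part" mainly means the face


@dataclass
class DrivingInput:  # Stage 1: input phase (driving source + driven region)
    source: DrivingSource
    region: DrivenRegion
    payload: bytes  # raw image / text / audio data


def plan_motion(inp: DrivingInput) -> List[str]:
    """Stage 2: map the driving signal to a motion representation.
    Most works use learned feature mapping; a few use LLM planning."""
    return [f"motion_{inp.source}_{i}" for i in range(16)]  # placeholder plan


def generate_video(motion: List[str]) -> List[str]:
    """Stage 3: render the motion plan into video frames (diffusion-based in most works)."""
    return [f"frame({m})" for m in motion]


def refine_video(frames: List[str]) -> List[str]:
    """Stage 4: refine local regions such as the face, lips, teeth, and hands."""
    return [f + "+refined" for f in frames]


def accelerate(frames: List[str]) -> List[str]:
    """Stage 5: speed up inference (e.g., distillation, caching) toward real-time output."""
    return frames  # no-op placeholder


video = accelerate(refine_video(generate_video(
    plan_motion(DrivingInput(source="audio", region="part", payload=b"...")))))
```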
🎉 We welcome everyone to contribute your research and submit PRs to collectively advance the technology of human motion video generation.
If you have any questions, feel free to contact us at ([email protected]), and we will respond as soon as possible. Additionally, we warmly welcome new members from related fields to join us, learn together, and make endless progress!
🏆 >>>>>>>> [🧡Brief Chinese Introduction (translated)💜] <<<<<<<<<<
This project carefully summarizes 👍the latest advances in 2D digital human motion video generation👏, including papers, datasets, and code repositories.
The repo is organized around the three main directions of Vision-driven, Text-driven, and Audio-driven generation, and also covers frontier LLM Planning papers.
For classification, we define the priority Audio > Text > Vision: a work driven by text but not audio is categorized as Text-driven; a work driven by both text and audio is categorized as Audio-driven; and so on (see the sketch below).
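For illustration only, this priority rule can be written as a tiny helper; classify_driving is a hypothetical name, not code from this repository:

```python
def classify_driving(has_audio: bool, has_text: bool, has_vision: bool) -> str:
    """Categorize a method by its driving sources, with priority Audio > Text > Vision."""
    if has_audio:
        return "Audio-driven"   # audio wins even when text or vision is also present
    if has_text:
        return "Text-driven"    # text present, no audio
    if has_vision:
        return "Vision-driven"  # vision-only driving
    raise ValueError("A method must use at least one driving source.")


# Text + audio together -> Audio-driven; text without audio -> Text-driven.
assert classify_driving(has_audio=True, has_text=True, has_vision=False) == "Audio-driven"
assert classify_driving(has_audio=False, has_text=True, has_vision=True) == "Text-driven"
```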
Unlike previous summaries, this project explicitly identifies the five key stages of digital human video generation:
🌑 Stage 1: Identify the driving source (Vision, Text, Audio) and the driven region (Part, Holistic), where Part mainly refers to the face;
🌒 Stage 2: Motion planning. Most works learn motion mappings via feature mapping, while a few use large language models (LLMs) for motion planning;
🌓 Stage 3: Human video generation. Most works build on diffusion models, while a few build on Transformers;
🌔 Stage 4: Video refinement. The face, lips, teeth, and hands are refined separately;
🌕 Stage 5: Accelerated output. Accelerate training and deployment inference as much as possible, targeting real-time output.
🔑 This project is driven forward by six core members:
- Haiwei Xue (Tsinghua University, project lead) - Xiangyang Luo (Tsinghua University) - Zhanghao Hu (University of Edinburgh) - Xin Zhang (Xi'an Jiaotong University) - Xunzhi Xiang (University of Chinese Academy of Sciences) - Yuqin Dai (Nanjing University of Science and Technology)
💖 The core survey is fully supported and carefully guided by the following advisors:
- Jianzhuang Liu (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences) - Dr. Zhensong Zhang (Huawei Noah's Ark 2012 Lab) - Dr. Minglei Li (01.AI) - Dr. Fei Ma (Guangming Laboratory) - Zhiyong Wu (Tsinghua University / The Chinese University of Hong Kong)
In addition, many thanks to Heng Chang ( https://github.com/SwiftieH ) and Weijiang Yu for their support!
🎉 We welcome everyone to contribute your research and submit PRs to jointly advance human motion video generation technology.
If you have any questions, feel free to contact us by email ([email protected]); we will respond as soon as possible.
We also warmly welcome students from related fields to join us, learn together, and make endless progress!
🍦 Exploring the latest papers in human motion video generation. 🍦
This work delves into Human Motion Video Generation, covering areas such as Portrait Animation, Dance Video Generation, Text2Face, Text2MotionVideo, and Talking Head. We believe this will be the most comprehensive survey to date on human motion video generation technologies. Please stay tuned! 😘😁😀
It is important to note that, for the sake of clarity, we have excluded 3D Gaussian Splatting (3DGS) and Neural Radiance Field (NeRF) technologies (2D-3D-2D pipelines) from the scope of this paper.
If you discover any missing work or have any suggestions, please feel free to submit a pull request or contact us ( [email protected] ). We will promptly add the missing papers to this repository.
[1] We decompose human motion video generation into five key phases, covering all subtasks across various driving sources and body regions. To the best of our knowledge, this is the first survey to offer such a comprehensive framework for human motion video generation.
[2] We provide an in-depth analysis of human motion video generation from both motion planning and motion generation perspectives, a dimension that has been underexplored in existing reviews.
[3] We clearly delineate established baselines and evaluation metrics, offering detailed insights into the key challenges shaping this field.
[4] We present a set of potential future research directions, aimed at inspiring and guiding researchers in the field of human motion video generation.
[2025/01/22] V5.9 Version: Update Methods. Happy New Year 🎀
CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation (Visual, Try-On Video Generation)
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation (Audio, Audio-Driven Holistic Body Driving)
[2025/01/20] V5.8 Version: Update Methods. Happy New Year 🎀
X-Dyna: Expressive Dynamic Human Image Animation (Visual, Pose-Guided Dance Video Generation)
Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions (Text, Text2Motion)
[2025/01/18] V5.7 Version: Update Methods. Happy New Year 🎀
RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency (Visual, Try-On Video Generation)
DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors (Visual, Portrait Animation)
[2025/01/16] V5.6 Version: Update Methods. Happy New Year 🎀
Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning (Visual, Portrait Animation)
[2025/01/14] V5.5 Version: Update Methods. Happy New Year 🎀
Identity-Preserving Video Dubbing Using Motion Warping (Audio, Lip Synchronization)
[2025/01/13] V5.4 Version: Update Methods. Happy New Year 🎀
Ingredients: Blending Custom Photos with Video Diffusion Transformers (Text, Text2MotionVideo)
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers (Text, Text2MotionVideo)
[2025/01/12] V5.3 Version: Update Methods. Happy New Year 🎀
MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation (Audio, Head Pose Driving)
[2025/01/11] V5.2 Version: Update Methods. Happy New Year 🎀
UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control (Audio, Head Pose Driving)
[2025/01/10] V5.1 Version: Update Methods. Happy New Year 🎀
RAIN: Real-time Animation of Infinite Video Stream (Visual, Portrait Animation)
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models (Text, Text2Face)
[2025/01/06] V5.0 Version: Update Methods. Happy New Year 🎀
Free-viewpoint Human Animation with Pose-correlated Reference Selection (Visual, Pose-Guided Dance Video Generation)
ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping (Visual, Pose2Video)
Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance (Text, Text2MotionVideo)
[2025/01/04] V4.9 Version: Update Methods. Happy New Year 🎀
Consistent Human Image and Video Generation with Spatially Conditioned Diffusion (Visual, Pose-Guided Dance Video Generation)
[2024/12/17] V4.8 Version: Update Methods.
VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping (Visual, Portrait Animation)
VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization (Audio, Fine-Grained Style and Emotion-Driven Animation)
Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism (Visual, Try-On Video Generation)
SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models (Visual, Try-On Video Generation)
[2024/12/15] V4.7 Version: Update Methods.
LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync (Audio, Lip Synchronization)
GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression (Audio, Head Pose Driving)
PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation (Audio, Head Pose Driving)
IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation (Audio, Head Pose Driving)
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations (Audio, Head Pose Driving)
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation (Audio, Head Pose Driving)
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation (Visual, Pose-Guided Dance Video Generation)
[2024/12/11] V4.6 Version: Update Methods.
PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm (Visual, Try-On Video Generation)
SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model (Audio, Head Pose Driving)
EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation (Visual, Portrait Animation)
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks (Visual, Portrait Animation)
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (Visual, Portrait Animation)
DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses (Visual, Pose-Guided Dance Video Generation)
[2024/12/02] V4.5 Version: Update Methods.
LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis (Audio, Head Pose Driving)
Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis (Audio, Head Pose Driving)
Fleximo: Towards Flexible Text-to-Human Motion Video Generation (Text, Text2MotionVideo)
[2024/11/28] V4.4 Version: Update Methods.
HiFiVFS: High Fidelity Video Face Swapping (Visual, Portrait Animation)
MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation (Text, Text2Face)
Identity-Preserving Text-to-Video Generation by Frequency Decomposition (Text, Text2Face)
AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation (Visual, Pose2Video)
PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation (Text, Text2Face)
LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis (Audio, Fine-Grained Style and Emotion-Driven Animation)
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion (Audio, Fine-Grained Style and Emotion-Driven Animation)
Sonic: Shifting Focus to Global Audio Perception in Portrait Animation (Audio, Fine-Grained Style and Emotion-Driven Animation)
StableAnimator: High-Quality Identity-Preserving Human Image Animation (Visual, Pose-Guided Dance Video Generation)
[2024/11/25] V4.3 Version: Update Methods.
FloAt: Flow Warping of Self-Attention for Clothing Animation Generation (Visual, Try-On Video Generation)
[2024/11/18] V4.2 Version: Update Methods.
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation (Audio, Audio-Driven Holistic Body Driving)
[2024/11/15 Wow! Over 100 stars 🌟🌟🌟] V4.1 Version: Update Methods.
JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation (Audio, Fine-Grained Style and Emotion-Driven Animation)
LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space (Audio, Fine-Grained Style and Emotion-Driven Animation)
[2024/11/14] V4.0 Version: Update Methods.
MikuDance: Animating Character Art with Mixed Motion Dynamics (Visual, Pose-Guided Dance Video Generation)
[2024/11/04] V3.9 Version: Update Methods.
Fashion-VDM (Visual, Try-On Video Generation)
Towards High-fidelity Head Blending with Chroma Keying for Industrial Applications (Visual, Portrait Animation)
[2024/11/01] V3.8 Version: Update Methods.
Stereo-Talker (Audio, Audio-Driven Holistic Body Driving)
[2024/10/29] V3.7 Version: Update Methods.
MovieCharacter (Visual, Pose2Video)
[2024/10/24 Happy Coding Day!] V3.6 Version: Update Methods.
EmoGene (Audio, Fine-Grained Style and Emotion-Driven Animation)
The Chinese-language notes for the survey are now available; feel free to check them out.
[2024/10/21] V3.5 Version: Update Methods.
Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization (Audio, Fine-Grained Style and Emotion-Driven Animation)
[2024/10/18] V3.4 Version: Update Methods.
DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation (Audio, Head Pose Driving)
[2024/10/15] V3.3 Version: Update Methods.
Tex4D (Text, Text2MotionVideo)
TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model (Audio, Audio-Driven Holistic Body Driving)
Animate-X: Universal Character Image Animation with Enhanced Motion Representation (Visual, Pose-Guided Dance Video Generation)
MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting (Audio, Lip Synchronization)
[2024/10/11] V3.2 Version: Update Methods.
Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation (Audio, Fine-Grained Style and Emotion-Driven Animation)
[2024/10/10] V3.1 Version: Update Methods.
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes (Audio, Fine-Grained Style and Emotion-Driven Animation)
[2024/10/08] V3.0 Version: Update Methods.
TANGO (Audio, Audio-Driven Holistic Body Driving)
[2024/10/04] 🎉🎉🎉 V2.9 Version: We are glad that our article is now publicly available on TechRxiv; we welcome your attention and citations. The arXiv version is still on hold, and we will update it once it becomes available.
@article{xue2024human,
  title={Human Motion Video Generation: A Survey},
  author={Xue, Haiwei and Luo, Xiangyang and Hu, Zhanghao and Zhang, Xin and Xiang, Xunzhi and Dai, Yuqin and Liu, Jianzhuang and Zhang, Zhensong and Li, Minglei and Yang, Jian and others},
  journal={Authorea Preprints},
  year={2024},
  publisher={Authorea},
  doi={10.36227/techrxiv.172793202.22697340/v1}
}
[2024/10/03] V2.8 Version: Update Methods.
LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details (Audio, Head Pose Driving)
[2024/10/02] V2.7 Version: Update Methods.
Replace Anyone in Videos (Visual, Video-Guided Dance Video Generation)
High Quality Human Image Animation using Regional Supervision and Motion Blur Condition (Visual, Pose-Guided Dance Video Generation)
[2024/09/27] V2.6 Version: Update Methods.
SVP (Visual, Portrait Animation)
Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation (Audio, Audio-Driven Holistic Body Driving)
[2024/09/25] V2.5 Version: Update Methods.
MIMO (Visual, Pose-Guided Dance Video Generation)
[2024/09/24] V2.4 Version: Update Methods.
MIMAFace (Audio, Fine-Grained Style and Emotion-Driven Animation)
[2024/09/23] V2.3 Version: Update Methods.
JoyHallo (Audio, Fine-Grained Style and Emotion-Driven Animation)
[2024/09/19] V2.2 Version: Update Methods.
JEAN (Audio, Head Pose Driving)
[2024/09/17] V2.1 Version: Update Methods.
LawDNet (Audio, Lip Synchronization)
StyleTalk++ (Audio, Fine-Grained Style and Emotion-Driven Animation)
[2024/09/13] V2.0 Version: Update Methods.
DiffTED (Audio, Audio-Driven Holistic Body Driving)
[2024/09/12] V1.9 Version: Update Methods.
EMOdiffhead (Audio, Fine-Grained Animation)
[2024/09/11] V1.8 Version: Update Methods.
RealisDance (Visual, Pose-Guided Dance Video Generation)
[2024/09/10] V1.7 Version: Update Methods.
Leveraging WaveNet for Dynamic Listening Head Modeling from Speech (Audio, Lip Synchronization)
KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation (Audio, Lip Synchronization)
PersonaTalk (Audio, Lip Synchronization)
[2024/09/06] V1.6 Version: Update Methods.
SVP (Audio, Fine-Grained Animation)
SegTalker (Audio, Lip Synchronization)
[2024/09/05] V1.5 Version: Update Methods.
Loopy (Audio, Fine-Grained Animation)
PoseTalk (Audio, Fine-Grained Animation)
[2024/09/04] V1.4 Version: Update Methods.
CyberHost (Audio, Holistic Human Driving)
[2024/08/28] V1.3 Version: Update Methods.
MegActor-Σ (Audio, Fine-Grained Animation)
Rafael Azevedo et al. (Text, Text2Face)
[2024/08/27] V1.2 Version: Update Methods.
[2024/08/26] V1.1 Version: Update Methods.
G3FA (Visual, Portrait Animation)
[2024/08/21] V1.0 Version: Initialize the repository. If you find it helpful, feel free to star and share our work.
Part (Face) || Portrait Animation
Holistic Human || Video-Guided Dance Video Generation
Holistic Human || Pose-Guided Dance Video Generation
Holistic Human || Try-On Video Generation
Holistic Human || Pose2Video
Part (Face) || Text2Face
Holistic Human || Text2MotionVideo
Part (Face) || Lip Synchronization
Part (Face) || Head Pose Driving
Holistic Human || Audio-Driven Holistic Body Driving
Part (Face) || Fine-Grained Style and Emotion-Driven Animation
LLM for 2D
LLM for 3D
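If you want to navigate these categories programmatically, here is an illustrative sketch of the taxonomy above as a plain Python mapping; the layout is our own hypothetical convention, not an official index shipped with the repository:

```python
# Hypothetical index of the taxonomy: driving source -> [(driven region, subtask), ...].
TAXONOMY = {
    "Vision-driven": [
        ("Part (Face)", "Portrait Animation"),
        ("Holistic Human", "Video-Guided Dance Video Generation"),
        ("Holistic Human", "Pose-Guided Dance Video Generation"),
        ("Holistic Human", "Try-On Video Generation"),
        ("Holistic Human", "Pose2Video"),
    ],
    "Text-driven": [
        ("Part (Face)", "Text2Face"),
        ("Holistic Human", "Text2MotionVideo"),
    ],
    "Audio-driven": [
        ("Part (Face)", "Lip Synchronization"),
        ("Part (Face)", "Head Pose Driving"),
        ("Holistic Human", "Audio-Driven Holistic Body Driving"),
        ("Part (Face)", "Fine-Grained Style and Emotion-Driven Animation"),
    ],
    "LLM Planning": [
        ("2D", "LLM for 2D"),
        ("3D", "LLM for 3D"),
    ],
}

# Example: collect every face-level (Part) subtask across driving sources.
face_tasks = [task for entries in TAXONOMY.values()
              for region, task in entries if region == "Part (Face)"]
print(face_tasks)
```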
If you find our survey and repository useful for your research project, please consider citing our paper:
@article{xue2024human,
  title={Human Motion Video Generation: A Survey},
  author={Xue, Haiwei and Luo, Xiangyang and Hu, Zhanghao and Zhang, Xin and Xiang, Xunzhi and Dai, Yuqin and Liu, Jianzhuang and Zhang, Zhensong and Li, Minglei and Yang, Jian and others},
  journal={Authorea Preprints},
  year={2024},
  publisher={Authorea},
  doi={10.36227/techrxiv.172793202.22697340/v1}
}
Contributions are welcome! Please feel free to create an issue or open a pull request with your contributions.
Haiwei Xue 💻 🎨 🤔 | Xiangyang Luo 🐛 | Zhanghao Hu 🥙 💻 | Xin Zhang 😘 🎪 😍 | Xunzhi Xiang 🚄 😍 | Yuqin Dai 😘 👸
This project is licensed under the MIT License - see the LICENSE file for details.
We would like to acknowledge the contributions of all researchers and developers in the field of human motion video generation. Their work has been instrumental in the advancement of this technology.