🎉 Awesome-Human-Motion-Video-Generation 🔥




You can click Watch and Star to get the latest updates at any time.


🎁 >>>>>>>> [English Introduction] <<<<<<<<<<

This project provides a thorough summary of the latest advancements in the field of 2D digital human motion video generation, covering papers, datasets, and code repositories.

The repository is organized around three main driving conditions (Vision-driven, Text-driven, and Audio-driven) and also covers LLM planning papers.

Unlike previous summaries, this project clearly outlines the five key stages in the field of digital human video generation:

🌑 Stage 1: Input Phase, clarifying the driving source (Vision, Text, Audio) and driving region (Part, Holistic), where "Part" mainly refers to the face;

🌒 Stage 2: Motion Planning Phase, where most works learn motion mappings via feature mapping, while a few use large language models (LLMs) for motion planning;

🌓 Stage 3: Motion Video Generation Phase, where most works build on diffusion models and a few on Transformers;

🌔 Stage 4: Video Refinement Phase, optimizing specific parts such as the face, lips, teeth, and hands;

🌕 Stage 5: Acceleration Phase, speeding up training and deployment inference as much as possible, with the goal of real-time output.
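For orientation, the five stages above can be read as an ordered pipeline. The sketch below is purely illustrative: the stage names follow the list above, but the function and its placeholder logic are hypothetical and not taken from any cited system.

```python
# Illustrative only: the survey's five-stage framework as an ordered pipeline
# skeleton. Stage names follow the list above; everything else is a placeholder.

STAGES = [
    "Input",             # Stage 1: driving source (Vision/Text/Audio) and region (Part/Holistic)
    "Motion Planning",   # Stage 2: feature mapping or LLM-based planning
    "Video Generation",  # Stage 3: motion video generation
    "Video Refinement",  # Stage 4: optimize face, lips, teeth, hands
    "Acceleration",      # Stage 5: speed up training/inference toward real time
]

def trace_pipeline(condition: str) -> str:
    """Trace a driving condition through the five stages (placeholder logic)."""
    artifact = condition
    for stage in STAGES:
        artifact = f"{stage}({artifact})"
    return artifact

print(trace_pipeline("audio+portrait"))
```

Running the trace simply nests the stage names around the input, mirroring the order in which a generation system would process a driving condition.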

🎉 We welcome everyone to contribute your research and submit PRs to collectively advance the technology of human motion video generation.

If you have any questions, feel free to contact us at ([email protected]), and we will respond as soon as possible. We also warmly welcome newcomers from related fields to join us, learn together, and keep improving!

🏆 >>>>>>>> [🧡Brief Introduction (translated from Chinese)💜] <<<<<<<<<<

This project carefully summarizes 👍the latest advances in 2D digital human motion video generation👏, including papers, datasets, and code repositories.

The repo is organized around the three directions of Vision-driven, Text-driven, and Audio-driven generation, while also covering frontier LLM Planning papers.

For classification, we define the priority Audio > Text > Vision: a work with text but no audio is classified as Text-driven, a work with both text and audio is classified as Audio-driven, and so on.
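The priority rule just described (Audio > Text > Vision) amounts to a first-match lookup over the modalities a work uses. The sketch below is a hypothetical illustration of that rule; the function name and inputs are ours, not the repo's.

```python
# Hypothetical sketch of the classification rule: a work is filed under the
# highest-priority driving modality present, with Audio > Text > Vision.

PRIORITY = ["Audio", "Text", "Vision"]  # highest priority first

def classify(driving_sources: set) -> str:
    """Return the category for a work given the set of its driving modalities."""
    for modality in PRIORITY:
        if modality in driving_sources:
            return f"{modality}-Driven"
    raise ValueError("no recognized driving source")

print(classify({"Text"}))           # Text-Driven
print(classify({"Text", "Audio"}))  # Audio-Driven (audio outranks text)
```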

Unlike previous summaries, this project explicitly identifies the five key stages of digital human video generation:

🌑 Stage 1: Identify the driving source (Vision, Text, Audio) and driving region (Part, Holistic), where "Part" mainly refers to the face;

🌒 Stage 2: Motion planning, where most works learn motion mappings via feature mapping and a few use large language models (LLMs) for motion planning;

🌓 Stage 3: Human video generation, where most works build on diffusion models and a few on Transformers;

🌔 Stage 4: Video refinement, with separate optimization of the face, lips, teeth, and hands;

🌕 Stage 5: Accelerated output, speeding up training and deployment inference as much as possible, targeting real-time output.

🔑 This project is driven by six core members:

- Haiwei Xue (Tsinghua University, lead)
- Xiangyang Luo (Tsinghua University)
- Zhanghao Hu (University of Edinburgh)
- Xin Zhang (Xi'an Jiaotong University)
- Xunzhi Xiang (University of Chinese Academy of Sciences)
- Yuqin Dai (Nanjing University of Science and Technology)

💖 The core survey is fully supported and carefully guided by the following advisors:

- Prof. Jianzhuang Liu (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
- Dr. Zhensong Zhang (Huawei Noah's Ark Lab, 2012 Laboratories)
- Dr. Minglei Li (01.AI)
- Dr. Fei Ma (Guangming Laboratory)
- Prof. Zhiyong Wu (Tsinghua University / The Chinese University of Hong Kong)

In addition, many thanks to Heng Chang ( https://github.com/SwiftieH ) and Weijiang Yu for their support!

🎉 We welcome everyone to contribute their research and submit PRs to jointly advance human motion video generation.

If you have any questions, feel free to contact us by email ([email protected]); we will reply as soon as possible.

We also warmly welcome newcomers from related fields to join us, learn together, and keep making progress!


🍦 Exploring the latest papers in human motion video generation. 🍦




Introduction

This work delves into Human Motion Video Generation, covering areas such as Portrait Animation, Dance Video Generation, Text2Face, Text2MotionVideo, and Talking Head. We believe this will be the most comprehensive survey to date on human motion video generation technologies. Please stay tuned! 😘😁😀

Note that, for the sake of clarity, we exclude 3DGS and NeRF technologies (2D-3D-2D pipelines) from the scope of this survey.

✨You are welcome to share your work on topics related to human motion video generation with us.✨

If you discover any missing work or have any suggestions, please feel free to submit a pull request or contact us ( [email protected] ). We will promptly add the missing papers to this repository.

🍔 Highlight

[1] We decompose human motion video generation into five key phases, covering all subtasks across various driving sources and body regions. To the best of our knowledge, this is the first survey to offer such a comprehensive framework for human motion video generation.

[2] We provide an in-depth analysis of human motion video generation from both motion planning and motion generation perspectives, a dimension that has been underexplored in existing reviews.

[3] We clearly delineate established baselines and evaluation metrics, offering detailed insights into the key challenges shaping this field.

[4] We present a set of potential future research directions, aimed at inspiring and guiding researchers in the field of human motion video generation.

🕑 Timeline


💙 News

[2025/01/22] V5.9: Update Methods. Happy New Year🎀

arXiv CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation (Visual, Try-On Video Generation)

arXiv EMO2: End-Effector Guided Audio-Driven Avatar Video Generation (Audio, Audio-Driven Holistic Body Driving)


[2025/01/20] V5.8: Update Methods. Happy New Year🎀

arXiv X-Dyna: Expressive Dynamic Human Image Animation (Visual, Pose-Guided Dance Video Generation)

arXiv Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions (Text, Text2Motion)


[2025/01/18] V5.7: Update Methods. Happy New Year🎀

arXiv RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency (Visual, Try-On Video Generation)

arXiv DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors (Visual, Portrait Animation)


[2025/01/16] V5.6: Update Methods. Happy New Year🎀

arXiv Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning (Visual, Portrait Animation)


[2025/01/14] V5.5: Update Methods. Happy New Year🎀

arXiv Identity-Preserving Video Dubbing Using Motion Warping (Audio, Lip Synchronization)


[2025/01/13] V5.4: Update Methods. Happy New Year🎀

arXiv Ingredients: Blending Custom Photos with Video Diffusion Transformers (Text, Text2MotionVideo)

arXiv Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers (Text, Text2MotionVideo)


[2025/01/12] V5.3: Update Methods. Happy New Year🎀

arXiv MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation (Audio, Head Pose Driving)


[2025/01/11] V5.2: Update Methods. Happy New Year🎀

arXiv UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control (Audio, Head Pose Driving)


[2025/01/10] V5.1: Update Methods. Happy New Year🎀

arXiv RAIN: Real-time Animation of Infinite Video Stream (Visual, Portrait Animation)

arXiv VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models (Text, Text2Face)


[2025/01/06] V5.0: Update Methods. Happy New Year🎀

arXiv Free-viewpoint Human Animation with Pose-correlated Reference Selection (Visual, Pose-Guided Dance Video Generation)

arXiv ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping (Visual, Pose2Video)

arXiv Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance (Text, Text2MotionVideo)


[2025/01/04] V4.9: Update Methods. Happy New Year🎀

arXiv Consistent Human Image and Video Generation with Spatially Conditioned Diffusion (Visual, Pose-Guided Dance Video Generation)


[2024/12/17] V4.8: Update Methods.

arXiv VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping (Visual, Portrait Animation)

arXiv VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization (Audio, Fine-Grained Style and Emotion-Driven Animation)

arXiv Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism (Visual, Try-On Video Generation)

arXiv SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models (Visual, Try-On Video Generation)


[2024/12/15] V4.7: Update Methods.

arXiv LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync (Audio, Lip Synchronization)

arXiv GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression (Audio, Head Pose Driving)

arXiv PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation (Audio, Head Pose Driving)

arXiv IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation (Audio, Head Pose Driving)

arXiv INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations (Audio, Head Pose Driving)

arXiv MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation (Audio, Head Pose Driving)

arXiv DisPose: Disentangling Pose Guidance for Controllable Human Image Animation (Visual, Pose-Guided Dance Video Generation)


[2024/12/11] V4.6: Update Methods.

arXiv PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm (Visual, Try-On Video Generation)

arXiv SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model (Audio, Head Pose Driving)

arXiv EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation (Visual, Portrait Animation)

arXiv Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks (Visual, Portrait Animation)

arXiv FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (Visual, Portrait Animation)

arXiv DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses (Visual, Pose-Guided Dance Video Generation)


[2024/12/02] V4.5: Update Methods.

arXiv LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis (Audio, Head Pose Driving)

arXiv Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis (Audio, Head Pose Driving)

arXiv Fleximo: Towards Flexible Text-to-Human Motion Video Generation (Text, Text2MotionVideo)


[2024/11/28] V4.4: Update Methods.

arXiv HiFiVFS: High Fidelity Video Face Swapping (Visual, Portrait Animation)

arXiv MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation (Text, Text2Face)

arXiv Identity-Preserving Text-to-Video Generation by Frequency Decomposition (Text, Text2Face)

arXiv AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation (Visual, Pose2Video)

arXiv PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation (Text, Text2Face)

arXiv LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis (Audio, Fine-Grained Style and Emotion-Driven Animation)

arXiv EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion (Audio, Fine-Grained Style and Emotion-Driven Animation)

arXiv Sonic: Shifting Focus to Global Audio Perception in Portrait Animation (Audio, Fine-Grained Style and Emotion-Driven Animation)

arXiv StableAnimator: High-Quality Identity-Preserving Human Image Animation (Visual, Pose-Guided Dance Video Generation)


[2024/11/25] V4.3: Update Methods.

arXiv FloAt: Flow Warping of Self-Attention for Clothing Animation Generation (Visual, Try-On Video Generation)


[2024/11/18] V4.2: Update Methods.

arXiv EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation (Audio, Audio-Driven Holistic Body Driving)


[2024/11/15 Wow! Over 100 Stars 🌟🌟🌟] V4.1: Update Methods.

arXiv JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation (Audio, Fine-Grained Style and Emotion-Driven Animation)

arXiv LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space (Audio, Fine-Grained Style and Emotion-Driven Animation)


[2024/11/14] V4.0: Update Methods.

arXiv MikuDance: Animating Character Art with Mixed Motion Dynamics (Visual, Pose-Guided Dance Video Generation)


[2024/11/04] V3.9: Update Methods.

arXiv Fashion-VDM (Visual, Try-On Video Generation)

arXiv Towards High-fidelity Head Blending with Chroma Keying for Industrial Applications (Visual, Portrait Animation)


[2024/11/01] V3.8: Update Methods.

arXiv Stereo-Talker (Audio, Audio-Driven Holistic Body Driving)


[2024/10/29] V3.7: Update Methods.

arXiv MovieCharacter (Visual, Pose2Video)


[2024/10/24 Happy Coding Day!] V3.6: Update Methods.

arXiv EmoGene (Audio, Fine-Grained Style and Emotion-Driven Animation)

Zhihu article: Chinese-language notes on the survey are also available; you are welcome to follow them.

[2024/10/21] V3.5: Update Methods.

arXiv Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization (Audio, Fine-Grained Style and Emotion-Driven Animation)


[2024/10/18] V3.4: Update Methods.

arXiv DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation (Audio, Head Pose Driving)


[2024/10/15] V3.3: Update Methods.

arXiv Tex4D (Text, Text2MotionVideo)

arXiv TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model (Audio, Audio-Driven Holistic Body Driving)

arXiv Animate-X: Universal Character Image Animation with Enhanced Motion Representation (Visual, Pose-Guided Dance Video Generation)

arXiv MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting (Audio, Lip Synchronization)


[2024/10/11] V3.2: Update Methods.

arXiv Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation (Audio, Fine-Grained Style and Emotion-Driven Animation)


[2024/10/10] V3.1: Update Methods.

arXiv MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes (Audio, Fine-Grained Style and Emotion-Driven Animation)


[2024/10/08] V3.0: Update Methods.

arXiv TANGO (Audio, Audio-Driven Holistic Body Driving)


[2024/10/04] 🎉🎉🎉 V2.9: Our article is now publicly available on TechRxiv. We welcome your attention and citations. The arXiv version is still on hold; we will update it once it becomes available.

@article{xue2024human,
  title={Human Motion Video Generation: A Survey},
  author={Xue, Haiwei and Luo, Xiangyang and Hu, Zhanghao and Zhang, Xin and Xiang, Xunzhi and Dai, Yuqin and Liu, Jianzhuang and Zhang, Zhensong and Li, Minglei and Yang, Jian and others},
  journal={Authorea Preprints},
  year={2024},
  publisher={Authorea},
  doi={10.36227/techrxiv.172793202.22697340/v1}
}

[2024/10/03] V2.8: Update Methods.

arXiv LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details (Audio, Head Pose Driving)


[2024/10/02] V2.7: Update Methods.

arXiv Replace Anyone in Videos (Visual, Video-Guided Dance Video Generation)

arXiv High Quality Human Image Animation using Regional Supervision and Motion Blur Condition (Visual, Pose-Guided Dance Video Generation)


[2024/09/27] V2.6: Update Methods.

arXiv SVP (Visual, Portrait Animation)

arXiv Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation (Audio, Audio-Driven Holistic Body Driving)


[2024/09/25] V2.5: Update Methods.

arXiv MIMO (Visual, Pose-Guided Dance Video Generation)


[2024/09/24] V2.4: Update Methods.

arXiv MIMAFace (Audio, Fine-Grained Style and Emotion-Driven Animation)


[2024/09/23] V2.3: Update Methods.

arXiv JoyHallo (Audio, Fine-Grained Style and Emotion-Driven Animation)


[2024/09/19] V2.2: Update Methods.

arXiv JEAN (Audio, Head Pose Driving)


[2024/09/17] V2.1: Update Methods.

arXiv LawDNet (Audio, Lip Synchronization)

arXiv StyleTalk++ (Audio, Fine-Grained Style and Emotion-Driven Animation)


[2024/09/13] V2.0: Update Methods.

arXiv DiffTED (Audio, Audio-Driven Holistic Body Driving)


[2024/09/12] V1.9: Update Methods.

arXiv EMOdiffhead (Audio, Fine-Grained Animation)


[2024/09/11] V1.8: Update Methods.

arXiv RealisDance (Visual, Pose-Guided Dance Video Generation)


[2024/09/10] V1.7: Update Methods.

arXiv Leveraging WaveNet for Dynamic Listening Head Modeling from Speech (Audio, Lip Synchronization)

arXiv KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation (Audio, Lip Synchronization)

arXiv PersonaTalk (Audio, Lip Synchronization)


[2024/09/06] V1.6: Update Methods.

arXiv SVP (Audio, Fine-Grained Animation)

arXiv SegTalker (Audio, Lip Synchronization)


[2024/09/05] V1.5: Update Methods.

arXiv Loopy (Audio, Fine-Grained Animation)

arXiv PoseTalk (Audio, Fine-Grained Animation)


[2024/09/04] V1.4: Update Methods.

arXiv CyberHost (Audio, Holistic Human Driving)


[2024/08/28] V1.3: Update Methods.

arXiv MegActor-Σ (Audio, Fine-Grained Animation)

arXiv Rafael Azevedo et al. (Text, Text2Face)


[2024/08/27] V1.2: Update Methods.

arXiv GenCA (Text, Text2Face)


[2024/08/26] V1.1: Update Methods.

arXiv G3FA (Visual, Portrait Animation)


[2024/08/21] V1.0: Initialized the repository. If you find it helpful, please star and share our work.

Vision Guidance

Part (Face) || Portrait Animation

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2024 06 04 | Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation | arXiv | KeyPoint | Diffusion Model | SIGGRAPH ASIA 2024 |
| 2024 07 05 | LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control | arXiv | KeyPoint | Encoder-Decoder | arXiv |
| 2024 07 09 | MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2023 10 16 | Expression Domain Translation Network for Cross-domain Head Reenactment | arXiv | 3D Parameterization | Encoder-Decoder | ICASSP 2024 |
| 2023 03 26 | OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering | arXiv | 3D Parameterization | Encoder-Decoder | CVPR 2023 |
| 2023 03 27 | OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis | arXiv | Latent | GAN | CVPR 2023 |
| 2023 12 04 | Unsupervised High-Resolution Portrait Gaze Correction and Animation | arXiv | Latent | GAN | IEEE Transactions on Image Processing 2022 |
| 2024 06 08 | MegActor: Harness the Power of Raw Video for Vivid Portrait Animation | arXiv | Latent | Diffusion Model | arXiv |
| 2024 05 31 | X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention | arXiv | Latent | Diffusion Model | ACM SIGGRAPH 2024 |
| 2024 08 26 | G3FA: Geometry-guided GAN for Face Animation | arXiv | Latent | GAN | arXiv |
| 2024 09 27 | Stable Video Portraits | arXiv | 3D Parameterization | Diffusion Model | ECCV 2024 |
| 2024 11 04 | Towards High-fidelity Head Blending with Chroma Keying for Industrial Applications | arXiv | Region | Encoder-Decoder | WACV 2024 |
| 2024 11 28 | HiFiVFS: High Fidelity Video Face Swapping | arXiv | Latent | Encoder-Decoder | arXiv |
| 2024 12 02 | EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation | arXiv | Latent | Diffusion Model | arXiv |
| 2024 12 15 | VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping | arXiv | Latent | Diffusion Model | arXiv |
| 2024 12 27 | RAIN: Real-time Animation of Infinite Video Stream | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 11 | Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 15 | DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors | arXiv | Latent | Diffusion Model | arXiv |
| 2024 03 23 | FaceOff: A Video-to-Video Face Swapping System | arXiv | Latent | Encoder-Decoder | WACV 2023 |
Holistic Human || Video-Guided Dance Video Generation

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2018 08 22 | Everybody Dance Now | arXiv | KeyPoint | GAN | ICCV 2019 |
| 2023 07 02 | Bidirectional Temporal Diffusion Model for Temporally Consistent Human Animation | arXiv | Region | Diffusion Model | arXiv |
| 2023 02 22 | Human MotionFormer: Transferring Human Motions with Vision Transformers | arXiv | KeyPoint | Encoder-Decoder | arXiv |
| 2024 10 02 | Replace Anyone in Videos | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 06 24 | Do As I Do: Pose Guided Human Motion Copy | arXiv | KeyPoint | GAN | IEEE Transactions on Dependable and Secure Computing |
Holistic Human || Pose-Guided Dance Video Generation

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2023 06 30 | DisCo | arXiv | KeyPoint | Diffusion Model | CVPR 2024 |
| 2023 10 20 | Dance Your Latents | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2023 11 18 | MagicPose | arXiv | KeyPoint | Diffusion Model | ICML 2024 |
| 2023 11 27 | MagicAnimate | arXiv | Region | Diffusion Model | CVPR 2024 |
| 2023 11 28 | Animate Anyone | arXiv | KeyPoint | Diffusion Model | CVPR 2024 |
| 2023 12 08 | DreaMoving | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2023 12 27 | I2V-Adapter | arXiv | KeyPoint | Diffusion Model | SIGGRAPH 2024 |
| 2024 05 26 | Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 05 28 | VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 05 30 | MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 06 03 | UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 06 05 | Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 05 27 | Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer | arXiv | 3D Parameterization | Transformer | arXiv |
| 2024 01 19 | Synthesizing Moving People with 3D Control | arXiv | 3D Parameterization | Diffusion Model | arXiv |
| 2024 03 21 | Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance | arXiv | 3D Parameterization | Diffusion Model | ECCV 2024 |
| 2024 07 01 | MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 07 15 | TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 09 11 | RealisDance: Equip controllable character animation with realistic hands | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 09 25 | MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 10 02 | High Quality Human Image Animation using Regional Supervision and Motion Blur Condition | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 10 15 | Animate-X: Universal Character Image Animation with Enhanced Motion Representation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 11 14 | MikuDance: Animating Character Art with Mixed Motion Dynamics | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 11 26 | StableAnimator: High-Quality Identity-Preserving Human Image Animation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 11 30 | DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 12 12 | DisPose: Disentangling Pose Guidance for Controllable Human Image Animation | arXiv | KeyPoint, Region | Diffusion Model | arXiv |
| 2024 12 19 | Consistent Human Image and Video Generation with Spatially Conditioned Diffusion | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 12 23 | Free-viewpoint Human Animation with Pose-correlated Reference Selection | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2025 01 17 | X-Dyna: Expressive Dynamic Human Image Animation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 07 16 | IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation | arXiv | Region | Diffusion Model | arXiv |
Holistic Human || Try-On Video Generation

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2024 04 26 | Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 05 20 | ViViD: Video Virtual Try-on using Diffusion Models | arXiv | Region | Diffusion Model | arXiv |
| 2024 11 04 | Fashion-VDM: Video Diffusion Model for Virtual Try-On | arXiv | Latent | Diffusion Model | SIGGRAPH Asia 2025 |
| 2024 11 25 | FloAt: Flow Warping of Self-Attention for Clothing Animation Generation | arXiv | Latent | Diffusion Model | arXiv |
| 2024 12 04 | PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm | arXiv | Latent | Diffusion Model | arXiv |
| 2024 12 13 | Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism | arXiv | Latent | Diffusion Model | arXiv |
| 2024 12 13 | SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 15 | RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 20 | CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation | arXiv | Latent | Diffusion Model | arXiv |
| 2024 07 16 | WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models | arXiv | KeyPoint | Diffusion Model | arXiv |
Holistic Human || Pose2Video

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2023 04 12 | DreamPose | arXiv | Region | Diffusion Model | ICCV 2023 |
| 2024 03 25 | Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework | arXiv | 3D Parameterization | Diffusion Model | CVPR 2024 |
| 2024 04 21 | PoseAnimate: Zero-shot high fidelity pose controllable character animation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 12 18 | ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 11 26 | AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 10 29 | MovieCharacter: A Tuning-Free Framework for Controllable Character Video Synthesis | arXiv | Region | Diffusion Model | arXiv |

Text Guidance

Part (Face) || Text2Face

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2021 05 07 | Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation | arXiv | KeyPoint | GAN | AAAI 2021 |
| 2023 12 11 | Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism | arXiv | 3D Parameterization | GAN | arXiv |
| 2023 06 03 | VideoComposer: Compositional Video Synthesis with Motion Controllability | arXiv | Region | Diffusion Model | NeurIPS 2024 |
| 2024 04 23 | ID-Animator: Zero-Shot Identity-Preserving Human Video Generation | arXiv | Latent | Diffusion Model | arXiv |
| 2023 12 09 | FT2TF: First-Person Statement Text-To-Talking Face Generation | arXiv | Latent | Encoder-Decoder | arXiv |
| 2024 05 16 | Faces that Speak: Jointly Synthesising Talking Face and Speech from Text | arXiv | Latent | GAN | CVPR 2024 |
| 2024 08 27 | GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars | arXiv | 3D Parameterization | Encoder-Decoder | arXiv |
| 2024 08 28 | Empowering Sign Language Communication: Integrating Sentiment and Semantics for Facial Expression Synthesis | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 11 28 | MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation | arXiv | Region | Diffusion Model | arXiv |
| 2024 11 26 | Identity-Preserving Text-to-Video Generation by Frequency Decomposition | arXiv | Region | Diffusion Model | arXiv |
| 2024 11 26 | PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation | arXiv | Latent | Diffusion Model | arXiv |
| 2024 12 27 | VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models | arXiv | Latent | Diffusion Model | arXiv |
| 2020 03 01 | Towards Automatic Face-to-Face Translation | arXiv | Latent | Encoder-Decoder | ACM MM 2019 |
Holistic Human || Text2MotionVideo

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2024 05 08 | Edit-Your-Motion | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2023 08 15 | Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2023 04 03 | Follow Your Pose | arXiv | KeyPoint | Diffusion Model | AAAI 2024 |
| 2024 12 21 | Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2023 08 28 | MagicAvatar: Multimodal Avatar Generation and Animation | arXiv | KeyPoint | Diffusion Model | arXiv |
| 2024 02 14 | Magic-Me: Identity-Specific Video Customized Diffusion | arXiv | Latent | Diffusion Model | arXiv |
| 2024 04 07 | Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation | arXiv | Latent | Diffusion Model | CVPR 2024 |
| 2023 04 17 | Text2Performer: Text-Driven Human Video Generation | arXiv | Latent | Encoder-Decoder | ICCV 2023 |
| 2024 04 14 | LoopAnimate | arXiv | Latent | Diffusion Model | arXiv |
| 2023 07 10 | AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | arXiv | Latent | Diffusion Model | arXiv |
| 2023 12 06 | AnimateZero: Video Diffusion Models are Zero-Shot Image Animators | arXiv | Latent | Diffusion Model | arXiv |
| 2023 10 30 | VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | arXiv | Latent | Diffusion Model | arXiv |
| 2023 07 19 | TokenFlow: Consistent Diffusion Features for Consistent Video Editing | arXiv | Latent | Diffusion Model | arXiv |
| 2023 03 23 | Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | arXiv | Latent | Diffusion Model | ICCV 2023 |
| 2023 02 02 | Dreamix: Video Diffusion Models are General Video Editors | arXiv | Latent | Diffusion Model | arXiv |
| 2023 12 05 | BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models | arXiv | Latent | Diffusion Model | CVPR 2024 |
| 2024 11 29 | Fleximo: Towards Flexible Text-to-Human Motion Video Generation | arXiv | Latent | Diffusion Model | arXiv |
| 2023 12 30 | Dual-Stream Diffusion Net for Text-to-Video Generation | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 07 | Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 17 | Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 03 | Ingredients: Blending Custom Photos with Video Diffusion Transformers | arXiv | Latent | Diffusion Model | arXiv |
| 2024 02 22 | Customize-A-Video | arXiv | Latent | Diffusion Model | arXiv |
| 2023 12 12 | LatentMan: Generating Consistent Animated Characters using Image Diffusion Models | arXiv | 3D Parameterization | Diffusion Model | arXiv |
| 2024 08 15 | DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency | arXiv | 3D Parameterization | Diffusion Model | arXiv |
| 2024 10 15 | Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models | arXiv | Latent | Diffusion Model | arXiv |
| 2024 01 17 | VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | arXiv | Latent | Diffusion Model | CVPR 2024 |

Audio Guidance

Part (Face) || Lip Synchronization

| Date | Title | arXiv Link | Motion Representation | Backbone | Venue |
| --- | --- | --- | --- | --- | --- |
| 2020 09 17 | Photorealistic Audio-driven Video Portraits | arXiv | Region | Encoder-Decoder | TVCG 2020 |
| 2019 05 09 | Hierarchical cross-modal talking face generation with dynamic pixel-wise loss | arXiv | KeyPoint | Autoregressive | CVPR 2019 |
| 2019 05 08 | Capture, Learning, and Synthesis of 3D Speaking Styles | arXiv | Latent | Encoder-Decoder | CVPR 2019 |
| 2024 08 13 | Style-Preserving Lip Sync via Audio-Aware Style Reference | arXiv | Latent | Diffusion Model | arXiv |
| 2024 09 06 | SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing | arXiv | Latent | GAN | arXiv |
| 2024 09 10 | Leveraging WaveNet for Dynamic Listening Head Modeling from Speech | arXiv | Latent | Autoregressive | arXiv |
| 2024 09 10 | KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation | arXiv | KeyPoint | Encoder-Decoder | arXiv |
| 2024 09 10 | PersonaTalk: Bring Attention to Your Persona in Visual Dubbing | arXiv | 3D Parameterization | Encoder-Decoder | arXiv |
| 2024 09 17 | LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation | arXiv | Latent | Encoder-Decoder | arXiv |
| 2024 10 15 | MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting | arXiv | Latent | Diffusion Model | arXiv |
| 2024 12 12 | LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync | arXiv | Latent | Diffusion Model | arXiv |
| 2025 01 08 | Identity-Preserving Video Dubbing Using Motion Warping | arXiv | Latent | Encoder-Decoder | arXiv |
| 2023 01 10 | Speech driven video editing via an audio-conditioned diffusion model | arXiv | Latent | Diffusion Model | IVC 2024 |
Part (Face) || Head Pose Driving
Date Title arXiv Link Motion Representation Backbone Venue
2017 08 20 Predicting head pose from speech with a conditional variational autoencoder arXiv Latent Autoregressive ISCA 2017
2020 04 27 MakeItTalk: Speaker-Aware Talking-Head Animation arXiv KeyPoint Autoregressive TOG 2020
2021 09 22 Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation arXiv KeyPoint Autoregressive TOG 2021
2022 01 03 DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering arXiv KeyPoint Encoder-Decoder arXiv
2023 01 10 DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation arXiv KeyPoint Diffusion Model CVPR 2023
2023 05 15 Identity-Preserving Talking Face Generation with Landmark and Appearance Priors arXiv Multi-Conditions Transformer CVPR 2023
2023 05 01 GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation arXiv KeyPoint Encoder-Decoder arXiv
2022 03 16 StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN arXiv Region GAN ECCV 2022
2023 02 20 SD-NeRF: Towards Lifelike Talking Head Animation via Spatially-Adaptive Dual-Driven NeRFs arXiv 3D Parameterization Encoder-Decoder TMM 2023
2024 03 26 AniPortrait arXiv KeyPoint, 3D Parameterization Diffusion Model arXiv
2024 06 17 Make Your Actor Talk arXiv KeyPoint Diffusion Model arXiv
2024 06 12 Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation arXiv KeyPoint, 3D Parameterization Diffusion Model arXiv
2024 06 27 RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network arXiv 3D Parameterization Transformer arXiv
2021 03 20 AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis arXiv Latent Encoder-Decoder ICCV 2021
2022 01 19 Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation arXiv Latent Encoder-Decoder ECCV 2022
2021 04 22 Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation arXiv Latent GAN CVPR 2021
2023 01 06 Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation arXiv Latent Diffusion Model CVPR 2024
2023 03 30 DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder arXiv Latent Diffusion Model ACM MM 2023
2023 11 26 GAIA: Zero-shot Talking Avatar Generation arXiv Latent Diffusion Model ICLR 2024
2023 12 09 R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning arXiv KeyPoint Encoder-Decoder arXiv
2024 05 06 AniTalker arXiv Latent Encoder-Decoder arXiv
2024 07 12 EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions arXiv KeyPoint Diffusion Model arXiv
2024 07 29 LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement arXiv Latent Diffusion Model arXiv
2024 08 03 Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation arXiv KeyPoint Diffusion Model arXiv
2024 08 13 High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model arXiv KeyPoint Diffusion Model arXiv
2022 11 22 Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition arXiv Latent Encoder-Decoder arXiv
2023 05 04 High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning arXiv Latent Transformer CVPR 2023
2024 04 02 EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis arXiv Latent GAN ECCV 2024
2023 11 29 SyncTalk arXiv 3D Parameterization Encoder-Decoder CVPR 2024
2024 04 23 TalkingGaussian arXiv 3D Parameterization Encoder-Decoder ECCV 2024
2024 09 19 JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation arXiv Latent Encoder-Decoder arXiv
2024 10 03 LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details arXiv Latent Encoder-Decoder arXiv
2024 10 18 DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation arXiv Latent Diffusion Model arXiv
2024 11 29 LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis arXiv Latent Encoder-Decoder arXiv
2024 11 29 Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis arXiv Latent Diffusion Model arXiv
2024 12 15 GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression arXiv Latent Diffusion Model arXiv
2024 12 10 PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation arXiv Latent Diffusion Model arXiv
2024 12 05 IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation arXiv Latent Encoder-Decoder arXiv
2024 12 05 INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations arXiv Latent Encoder-Decoder arXiv
2024 12 05 MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation arXiv Latent Encoder-Decoder arXiv
2024 12 26 UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control arXiv Latent Diffusion Model arXiv
2024 04 28 GaussianTalker arXiv 3D Parameterization Encoder-Decoder ACM MM 2024
2021 12 10 FaceFormer: Speech-Driven 3D Facial Animation with Transformers arXiv Latent Transformer CVPR 2022
2023 09 15 Towards the generation of synchronized and believable non-verbal facial behaviors of a talking virtual agent arXiv Latent GAN ICMI 2023
2023 10 17 CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation arXiv Latent Encoder-Decoder TCSVT 2024
Holistic Human || Audio-Driven Holistic Body Driving
Date Title arXiv Link Motion Representation Backbone Venue
2024 03 13 VLOGGER arXiv 3D Parameterization Diffusion Model arXiv
2022 12 05 Audio-Driven Co-Speech Gesture Video Generation arXiv Latent Encoder-Decoder NeurIPS 2022
2024 09 04 CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention arXiv Latent Diffusion Model arXiv
2024 09 13 DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures arXiv KeyPoint Diffusion Model arXiv
2024 09 27 Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation arXiv Region Diffusion Model arXiv
2024 10 08 TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation arXiv Latent Encoder-Decoder arXiv
2024 10 15 TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model arXiv Latent Diffusion Model arXiv
2024 11 01 Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts arXiv Latent Encoder-Decoder arXiv
2024 11 18 EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation arXiv Latent Diffusion Model arXiv
2025 01 18 EMO2: End-Effector Guided Audio-Driven Avatar Video Generation arXiv Latent Diffusion Model arXiv
2024 05 15 Dance Any Beat: Blending Beats with Visuals in Dance Video Generation arXiv Region Diffusion Model arXiv
Part (Face) || Fine-Grained Style and Emotion-Driven Animation
Date Title arXiv Link Motion Representation Backbone Venue
2021 05 19 Audio-Driven Emotional Video Portraits arXiv KeyPoint Encoder-Decoder CVPR 2021
2023 06 10 StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles arXiv 3D Parameterization Encoder-Decoder AAAI 2023
2024 01 16 Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis arXiv 3D Parameterization Encoder-Decoder ICLR 2024
2023 12 15 DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models arXiv 3D Parameterization Diffusion Model arXiv
2024 06 04 V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation arXiv KeyPoint Diffusion Model arXiv
2021 07 21 Speech Driven Talking Face Generation from a Single Image and an Emotion Condition arXiv Latent GAN TMM 2021
2022 11 22 SadTalker arXiv Latent Diffusion Model CVPR 2023
2022 11 28 High-fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors arXiv 3D Parameterization GAN CVPR 2023
2023 05 09 StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator arXiv Latent GAN CVPR 2023
2024 02 27 EMO arXiv Latent Diffusion Model arXiv
2024 03 04 FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio arXiv Latent Diffusion Model CVPR 2024
2024 04 29 EMOPortraits arXiv Latent GAN CVPR 2024
2024 05 12 Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation arXiv Latent GAN arXiv
2024 06 16 Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation arXiv Latent Diffusion Model arXiv
2024 10 11 Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation arXiv Latent Diffusion Model arXiv
2024 04 16 VASA-1 arXiv Latent Diffusion Transformer arXiv
2024 08 20 S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis arXiv Latent Encoder-Decoder arXiv
2024 08 20 FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model arXiv 3D Parameterization Diffusion Model ACM MM 2024
2024 08 28 MegActor-Σ: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer arXiv Latent Diffusion Transformer arXiv
2024 09 05 Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency arXiv Latent Diffusion Model arXiv
2024 09 05 PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation arXiv Latent Diffusion Model arXiv
2024 09 06 SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model arXiv Latent Diffusion Model arXiv
2024 09 12 EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion arXiv 3D Parameterization Diffusion Model arXiv
2024 09 17 StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads arXiv 3D Parameterization Encoder-Decoder arXiv
2024 09 23 JoyHallo: Digital human model for Mandarin arXiv Latent Diffusion Model arXiv
2024 09 24 MIMAFace: Face Animation via Motion-Identity Modulated Appearance Feature Learning arXiv Latent Diffusion Model arXiv
2024 10 10 MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes arXiv 3D Parameterization Encoder-Decoder NeurIPS 2024
2024 10 21 Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization arXiv Latent Encoder-Decoder arXiv
2024 10 24 Audio-Driven Emotional 3D Talking-Head Generation arXiv Latent Encoder-Decoder arXiv
2024 11 15 JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation arXiv Latent Diffusion Model arXiv
2024 11 15 LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space arXiv Latent Encoder-Decoder arXiv
2024 11 28 LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis arXiv Latent Encoder-Decoder arXiv
2024 11 23 EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion arXiv Region Diffusion Model arXiv
2024 11 25 Sonic: Shifting Focus to Global Audio Perception in Portrait Animation arXiv Latent Diffusion Model arXiv
2024 12 04 SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model arXiv Latent Diffusion Model arXiv
2024 12 01 Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks arXiv Latent Diffusion Model arXiv
2024 12 02 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait arXiv Latent Diffusion Model arXiv
2024 12 13 VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization arXiv Latent Diffusion Model arXiv
2024 12 18 Real-time One-Step Diffusion-based Expressive Portrait Videos Generation arXiv Latent Diffusion Model arXiv
2025 01 03 MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation arXiv Latent Diffusion Model arXiv
2024 08 07 ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer arXiv Latent Encoder-Decoder ECCV 2024
2023 01 05 Expressive Speech-driven Facial Animation with controllable emotions arXiv Latent Encoder-Decoder ICMEW 2023
2024 01 28 Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance arXiv Latent Diffusion Model arXiv

LLM for Motion Planning

LLM for 2D
Date Title arXiv Link Motion Representation Backbone Tasks Venue
2023 01 26 Affective Faces for Goal-Driven Dyadic Communication arXiv 3D Parameterization Diffusion Model Text2Face arXiv
2023 11 29 Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents arXiv Latent Encoder-Decoder Talking Head arXiv
2024 05 24 InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation arXiv Latent Diffusion Model Talking Head arXiv
LLM for 3D
Date Title arXiv Link Motion Representation Backbone Tasks Venue
2023 08 21 Can Language Models Learn to Listen? arXiv Latent Autoregressive Listener Generation ICCV 2023
2023 06 19 MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators arXiv Latent Autoregressive Text2Motion3D AAAI 2024
2023 11 27 InterControl: Generate Human Motion Interactions by Controlling Every Joint arXiv Latent Diffusion Model Text2Motion3D arXiv
2023 11 28 AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond arXiv Latent Autoregressive Text2Motion3D CVPR 2024
2023 12 07 Digital Life Project: Autonomous 3D Characters with Social Intelligence arXiv Latent Diffusion Model Text2Motion3D CVPR 2024
2023 12 19 MotionScript: Natural Language Descriptions for Expressive 3D Human Motions arXiv Latent Diffusion Model Text2Motion3D arXiv
2023 12 22 Plan, Posture and Go: Towards Open-World Text-to-Motion Generation arXiv Latent Autoregressive Text2Motion3D arXiv
2024 08 20 Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony arXiv Latent Encoder-Decoder Text2Motion3D arXiv
2023 12 22 FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing arXiv Latent Autoregressive Text2Motion3D NeurIPS 2023

Cite The Survey

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{xue2024human,
  title={Human Motion Video Generation: A Survey},
  author={Xue, Haiwei and Luo, Xiangyang and Hu, Zhanghao and Zhang, Xin and Xiang, Xunzhi and Dai, Yuqin and Liu, Jianzhuang and Zhang, Zhensong and Li, Minglei and Yang, Jian and others},
  journal={Authorea Preprints},
  year={2024},
  publisher={Authorea},
  doi={10.36227/techrxiv.172793202.22697340/v1}
}

Contributing

Contributions are welcome! Please feel free to create an issue or open a pull request with your contributions.

Haiwei Xue 💻 🎨 🤔
Xiangyang Luo 🐛
Zhanghao Hu 🥙 💻
Xin Zhang 😘 🎪 😍
Xunzhi Xiang 🚄 😍
Yuqin Dai 😘 👸

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

We would like to acknowledge the contributions of all researchers and developers in the field of human motion video generation. Their work has been instrumental in the advancement of this technology.