Forked from AMEERAZAM08/StyleLipSync

Commit 7eb5e40 (parent 178bc61): 42 changed files with 4,950 additions and 4 deletions.
@@ -0,0 +1,3 @@
__pycache__
*.pth
*.mp4
@@ -1,7 +1,68 @@
# [ICCV 2023] StyleLipSync: Style-based Personalized Lip-sync Video Generation
[ProjectPage](https://stylelipsync.github.io) | [Paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Ki_StyleLipSync_Style-based_Personalized_Lip-sync_Video_Generation_ICCV_2023_paper.pdf) | [ArXiv](https://arxiv.org/abs/2305.00521)

An official PyTorch implementation of `StyleLipSync: Style-based Personalized Lip-sync Video Generation` by Taekyung Ki* and [Dongchan Min](https://kevinmin95.github.io)*.

## Abstract
<img align='middle' src='./assets/sylelipsync.png'>

In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing videos from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also design video consistency with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask to improve naturalness over frames by utilizing a 3D parametric mesh predictor frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even in the zero-shot setting and can enhance the characteristics of an unseen face using only a few seconds of target video through the proposed adaptation method.

## Requirements
We recommend using Python `3.8.13` and PyTorch `1.7.1+cu110`.
```bash
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```
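
To confirm that the pinned build is installed and can see your GPU before running the demo, a quick check along these lines can help (this snippet is our addition, not part of the repository):

```python
# Sanity check (not part of the repository): verify the pinned versions and CUDA visibility.
import torch, torchvision, torchaudio

print("torch:", torch.__version__)              # expected: 1.7.1+cu110
print("torchvision:", torchvision.__version__)  # expected: 0.8.2+cu110
print("torchaudio:", torchaudio.__version__)    # expected: 0.7.2
print("CUDA available:", torch.cuda.is_available())
```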
## Demo
We provide a simple demonstration script with a personalized model for a target person, `AlexandriaOcasioCortez_0`, from [HDTF](https://github.com/MRzzm/HDTF).

```bash
sh prepare_hdtf.sh
```
Running `prepare_hdtf.sh` produces the preprocessed frames (`.jpg`) and the pose-aware masks for this person.

For arbitrary audio, you can generate a lip-synchronized video of the target person by running:

```bash
CUDA_VISIBLE_DEVICES=0 python run_demo.py --audio [path/to/audio] --person person_id --res_dir [path/to/save/results]
```

You can adjust the following options for inference:
- `--audio`: an audio file (`.wav`).
- `--person`: the person to use for inference, i.e., a folder name in `data` (default: `AlexandriaOcasioCortez_0`).
- `--res_dir`: the directory in which to save the results video (default: `results`).

The results video will be saved as `res_dir/person#audio.mp4`. Sample audio files are provided in `data/audio`, and you can also use your own audio file. If you want to evaluate the lip-sync metrics (LSE-C and LSE-D), please refer to this [repository](https://github.com/Rudrabha/Wav2Lip).
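
To run the demo over every provided sample clip in one go, a minimal batch sketch is shown below. It is our addition, assumes the `data/audio` layout and the `run_demo.py` flags described above, and simply shells out to the demo script; adjust paths to your setup.

```python
# Minimal batch-inference sketch (our addition): run the demo for every sample .wav file.
import os
import subprocess
from pathlib import Path

person = "AlexandriaOcasioCortez_0"  # folder name in `data`
res_dir = "results"

for wav in sorted(Path("data/audio").glob("*.wav")):
    subprocess.run(
        ["python", "run_demo.py",
         "--audio", str(wav),
         "--person", person,
         "--res_dir", res_dir],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
        check=True,
    )
    # Each run writes `results/<person>#<audio>.mp4`, as described above.
```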

## Disclaimer
This repository is intended for research purposes only.
## Acknowledgements
* [StyleGAN2-ADA](https://github.com/NVlabs/stylegan2-ada-pytorch)
* [Wav2Lip](https://github.com/Rudrabha/Wav2Lip)
* [Deep3DFaceRecon](https://github.com/sicxu/Deep3DFaceRecon_pytorch)
* [FOMM](https://github.com/AliaksandrSiarohin/first-order-model)
* [Voxceleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html)
* [HDTF](https://github.com/MRzzm/HDTF)
## Citation
```
@InProceedings{Ki_2023_ICCV,
    author    = {Ki, Taekyung and Min, Dongchan},
    title     = {StyleLipSync: Style-based Personalized Lip-sync Video Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22841-22850}
}
```
(The commit also adds several empty placeholder files and binary asset files, not shown here.)
@@ -0,0 +1,77 @@
```python
import os, math, shutil, pickle, copy, yaml, random, json, cv2
import torch, torchvision, torchaudio
import torch.nn as nn
import torch.nn.functional as F

import pandas as pd
import albumentations as A
import albumentations.pytorch.transforms as A_pytorch

from PIL import Image
from tqdm import tqdm
from pathlib import Path
from utils import common


class BaseDataLoader:
    def __init__(self, opt):
        self.opt = opt
        self.input_size = opt.input_size
        self.input_nc = opt.input_nc
        self.image_size = (opt.input_size, opt.input_size)
        self.num_frames_per_clip = opt.num_frames_per_clip

        self.fps = opt.fps
        # Mel frames per second: one mel frame is produced every `hop_length` audio samples.
        self.bps = opt.sampling_rate / opt.hop_length
        self.sampling_rate = opt.sampling_rate
        # Number of mel frames spanning one clip of `num_frames_per_clip` video frames.
        self.num_mel_bins = int(self.bps * self.num_frames_per_clip / self.fps)

        self.img_transform = A.Compose([
            A.Resize(height=self.input_size, width=self.input_size, interpolation=cv2.INTER_AREA),
            A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
            A_pytorch.ToTensorV2(),
        ])
        self.audio_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=opt.sampling_rate, n_mels=opt.n_mels,
            n_fft=opt.n_fft, win_length=opt.win_length, hop_length=opt.hop_length,
            f_max=opt.f_max, f_min=opt.f_min)

    def get_frame2mel_idx(self, idx):
        # Map a video frame index to the starting mel frame of an audio window
        # centered on the clip that contains this frame.
        idx = idx - self.num_frames_per_clip // 2
        return int(idx * self.bps / self.fps)

    def default_img_loader(self, path):
        # OpenCV loads BGR; flip the channel axis to get RGB.
        return cv2.imread(path)[:, :, ::-1]

    def default_aud_loader(self, path):
        audio, sr = torchaudio.load(path)
        audio = torch.mean(audio, dim=0)  # mix down to mono
        if sr != self.sampling_rate:
            audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=self.sampling_rate)(audio)
            print(f"- [Audio] Resample from {sr} to {self.sampling_rate}")
        mel = self.audio_transform(audio).T  # (time, n_mels)
        return torch.log10(torch.clamp(mel, min=1e-5, max=None))

    def crop_mel(self, mel, mel_idx, crop_length):
        # Slice `crop_length` mel frames starting at `mel_idx`, zero-padding
        # whenever the window falls outside the spectrogram boundaries.
        mel_shape = mel.shape
        if (mel_idx + crop_length) <= mel_shape[0] and mel_idx >= 0:
            mel_cropped = mel[mel_idx:mel_idx + crop_length]
            return mel_cropped
        else:
            if mel_idx < 0:
                pad = -mel_idx
                mel_cropped = F.pad(mel[:mel_idx + crop_length], (0, 0, pad, 0), value=0.)
            else:
                pad = crop_length - (mel_shape[0] - mel_idx)
                mel_cropped = F.pad(mel[mel_idx:], (0, 0, 0, pad), value=0.)
            return mel_cropped

    def path2img(self, img_path):
        img = self.default_img_loader(img_path)
        return self.img_transform(image=img)['image']

    def get_lower_half_mask(self):
        raise NotImplementedError()

    def preprocess(self):
        raise NotImplementedError()
```
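The alignment between video frames and mel-spectrogram frames above reduces to a little index arithmetic: `bps = sampling_rate / hop_length` is the number of mel frames per second, so the clip-centered audio window for video frame `idx` starts at `int((idx - num_frames_per_clip // 2) * bps / fps)`. The standalone sketch below reproduces that arithmetic with illustrative hyperparameters (16 kHz audio, hop length 200, 25 fps video, 5-frame clips); these numbers are assumptions for the example, not values taken from the repository configs.

```python
# Standalone sketch of the frame-to-mel alignment used by BaseDataLoader above.
# The hyperparameters are illustrative assumptions, not repository config values.
sampling_rate, hop_length = 16000, 200
fps, num_frames_per_clip = 25, 5

bps = sampling_rate / hop_length                     # mel frames per second -> 80.0
num_mel_bins = int(bps * num_frames_per_clip / fps)  # mel frames per 5-frame clip -> 16

def frame_to_mel_idx(frame_idx: int) -> int:
    # Same arithmetic as BaseDataLoader.get_frame2mel_idx: center the audio
    # window on the clip that contains `frame_idx`.
    return int((frame_idx - num_frames_per_clip // 2) * bps / fps)

mel_idx = frame_to_mel_idx(100)
print(mel_idx, num_mel_bins)  # 313 16 -> crop_mel would then slice mel[313:313 + 16]
```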
@@ -0,0 +1,9 @@
```python
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto. Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.

from .util import EasyDict, make_cache_dir_path
```
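
For reference, `EasyDict` (carried over from the StyleGAN2-ADA codebase acknowledged above) is a plain `dict` subclass that also allows attribute-style access, which is convenient for option objects such as the `opt` consumed by `BaseDataLoader`. A tiny usage sketch, with illustrative option names and values of our own choosing:

```python
# Tiny usage sketch of dnnlib.EasyDict (attribute-style access over a plain dict).
# The option names mirror those read by BaseDataLoader; the values are illustrative.
from dnnlib import EasyDict

opt = EasyDict(input_size=256, input_nc=3, num_frames_per_clip=5,
               fps=25, sampling_rate=16000, hop_length=200)
print(opt.input_size, opt["sampling_rate"])  # both access styles work: 256 16000
```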