Commit

feat: demo
TaekyungKi committed Dec 3, 2023
1 parent 178bc61 commit 7eb5e40
Showing 42 changed files with 4,950 additions and 4 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
__pycache__
*.pth
*.mp4
69 changes: 65 additions & 4 deletions README.md
@@ -1,7 +1,68 @@
# [ICCV 2023] StyleLipSync: Style-based Personalized Lip-sync Video Generation
[ProjectPage](https://stylelipsync.github.io) | [Paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Ki_StyleLipSync_Style-based_Personalized_Lip-sync_Video_Generation_ICCV_2023_paper.pdf) | [ArXiv](https://arxiv.org/abs/2305.00521)

An official PyTorch implementation of `StyleLipSync: Style-based Personalized Lip-sync Video Generation` by Taekyung Ki* and [Dongchan Min](https://kevinmin95.github.io)*.

## Abstract

<img align='middle' src='./assets/sylelipsync.png'>

In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing videos from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also enforce video consistency with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask using a 3D parametric mesh predictor frame by frame, improving naturalness across frames. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even in the zero-shot setting and enhance the characteristics of an unseen face using a few seconds of target video through the proposed adaptation method.


## Requirements
We recommend using Python `3.8.13` and PyTorch `1.7.1+cu110`.
```bash
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

pip install -r requirements.txt
```
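
To confirm the environment is set up correctly, a quick sanity check (assuming a CUDA-capable GPU and the CUDA 11.0 wheels above):

```python
import torch, torchvision

print(torch.__version__)          # expect 1.7.1+cu110
print(torchvision.__version__)    # expect 0.8.2+cu110
print(torch.cuda.is_available())  # expect True on a CUDA-capable machine
```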



## Demo

We provide a simple demonstration script with a personalized model for the target person `AlexandriaOcasioCortez_0` from [HDTF](https://github.com/MRzzm/HDTF).

```bash
sh prepare_hdtf.sh
```
Running `prepare_hdtf.sh` produces the preprocessed frames (`.jpg`) and the corresponding pose-aware masks for this person.
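
The preprocessed data lives in a per-person folder under `data/`; roughly as below (the exact file layout is defined by `prepare_hdtf.sh`, so the names here are illustrative):

```
data/
├── audio/                        # sample audio clips (.wav)
└── AlexandriaOcasioCortez_0/     # preprocessed frames (.jpg) and pose-aware masks
```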

For arbitrary audio, you can generate a lip-synchronizing video of the target person by running:

```bash
CUDA_VISIBLE_DEVICES=0 python run_demo.py --audio [path/to/audio] --person [person_id] --res_dir [path/to/save/results]
```

You can adjust the following options for inference:
- `--audio`: an audio file (`.wav`).
- `--person`: the person to use for inference, i.e., a folder name in `data`. (default: `AlexandriaOcasioCortez_0`)
- `--res_dir`: a directory to save the result video. (default: `results`)

The result video will be saved as `res_dir/person#audio.mp4`. Sample audio files are provided in `data/audio`; you can also use your own audio file. If you want to evaluate the lip-sync metrics (LSE-C and LSE-D), please refer to this [repository](https://github.com/Rudrabha/Wav2Lip).
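
For example, using one of the bundled sample audio clips with the default person:

```bash
CUDA_VISIBLE_DEVICES=0 python run_demo.py \
    --audio data/audio/RoyBlunt_0_10s.wav \
    --person AlexandriaOcasioCortez_0 \
    --res_dir results
# expected output, following the res_dir/person#audio.mp4 pattern above:
# results/AlexandriaOcasioCortez_0#RoyBlunt_0_10s.mp4
```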


## Disclaimer
This repository is for research purposes only.


## Acknowledgements
* [StyleGAN2-ADA](https://github.com/NVlabs/stylegan2-ada-pytorch)
* [Wav2Lip](https://github.com/Rudrabha/Wav2Lip)
* [Deep3DFaceRecon](https://github.com/sicxu/Deep3DFaceRecon_pytorch)
* [FOMM](https://github.com/AliaksandrSiarohin/first-order-model)
* [Voxceleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html)
* [HDTF](https://github.com/MRzzm/HDTF)

## Citation
```
@InProceedings{Ki_2023_ICCV,
    author    = {Ki, Taekyung and Min, Dongchan},
    title     = {StyleLipSync: Style-based Personalized Lip-sync Video Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22841-22850}
}
```
Binary file added assets/sylelipsync.png
Empty file added ckpts/__put_ckpts_here
Empty file added data/__put_the_frame_here
Binary file added data/audio/AlexandriaOcasioCortez_0_10s.wav
Binary file added data/audio/CarlyFiorina_0_10s.wav
Binary file added data/audio/RoyBlunt_0_10s.wav
Empty file added dataloader/__init__.py
77 changes: 77 additions & 0 deletions dataloader/base_dataloader.py
@@ -0,0 +1,77 @@
import os, math, shutil, pickle, copy, yaml, random, json, cv2
import torch, torchvision, torchaudio
import torch.nn as nn
import torch.nn.functional as F

import pandas as pd
import albumentations as A
import albumentations.pytorch.transforms as A_pytorch

from PIL import Image
from tqdm import tqdm
from pathlib import Path
from utils import common

class BaseDataLoader:
    def __init__(self, opt):
        self.opt = opt
        self.input_size = opt.input_size
        self.input_nc = opt.input_nc
        self.image_size = (opt.input_size, opt.input_size)
        self.num_frames_per_clip = opt.num_frames_per_clip

        self.fps = opt.fps
        # Mel frames ("bins") per second of audio, set by the STFT hop length.
        self.bps = opt.sampling_rate / opt.hop_length
        self.sampling_rate = opt.sampling_rate
        # Number of mel frames covering one clip of video frames.
        self.num_mel_bins = int(self.bps * self.num_frames_per_clip / self.fps)

        self.img_transform = A.Compose([
            A.Resize(height=self.input_size, width=self.input_size, interpolation=cv2.INTER_AREA),
            A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
            A_pytorch.ToTensorV2(),
        ])
        self.audio_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=opt.sampling_rate, n_mels=opt.n_mels,
            n_fft=opt.n_fft, win_length=opt.win_length, hop_length=opt.hop_length,
            f_max=opt.f_max, f_min=opt.f_min)

    def get_frame2mel_idx(self, idx):
        # Map a video-frame index to the first mel-frame index of the
        # audio window centered on that frame.
        idx = idx - self.num_frames_per_clip // 2
        return int(idx * self.bps / self.fps)

    def default_img_loader(self, path):
        # OpenCV loads BGR; reverse the channel axis to get RGB.
        return cv2.imread(path)[:, :, ::-1]

    def default_aud_loader(self, path):
        audio, sr = torchaudio.load(path)
        audio = torch.mean(audio, dim=0)  # downmix to mono
        if sr != self.sampling_rate:
            audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=self.sampling_rate)(audio)
            print(f"- [Audio] Resample from {sr} to {self.sampling_rate}")
        mel = self.audio_transform(audio).T  # (time, n_mels)
        return torch.log10(torch.clamp(mel, min=1e-5, max=None))

    def crop_mel(self, mel, mel_idx, crop_length):
        # Crop `crop_length` mel frames starting at `mel_idx`, zero-padding
        # when the window runs past either end of the spectrogram.
        mel_shape = mel.shape
        if mel_idx >= 0 and (mel_idx + crop_length) <= mel_shape[0]:
            mel_cropped = mel[mel_idx:mel_idx + crop_length]
        elif mel_idx < 0:
            pad = -mel_idx
            mel_cropped = F.pad(mel[:mel_idx + crop_length], (0, 0, pad, 0), value=0.)
        else:
            pad = crop_length - (mel_shape[0] - mel_idx)
            mel_cropped = F.pad(mel[mel_idx:], (0, 0, 0, pad), value=0.)
        return mel_cropped

    def path2img(self, img_path):
        img = self.default_img_loader(img_path)
        return self.img_transform(image=img)['image']

    def get_lower_half_mask(self):
        raise NotImplementedError

    def preprocess(self):
        raise NotImplementedError

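For orientation, a minimal sketch of how the audio/video indexing in `BaseDataLoader` lines up. The option values below are hypothetical stand-ins chosen for round numbers (the real defaults live in the project's configs), giving `bps = 16000 / 200 = 80` mel frames per second against 25 video fps:

```python
from types import SimpleNamespace
from dataloader.base_dataloader import BaseDataLoader

# Hypothetical option values for illustration only; the actual values
# come from the project's configuration, not from this snippet.
opt = SimpleNamespace(
    input_size=256, input_nc=3, num_frames_per_clip=5, fps=25,
    sampling_rate=16000, hop_length=200, n_mels=80,
    n_fft=800, win_length=800, f_min=55.0, f_max=7600.0)

loader = BaseDataLoader(opt)
print(loader.bps)                    # 80.0 mel frames per second
print(loader.num_mel_bins)           # int(80 * 5 / 25) = 16 mel frames per clip
print(loader.get_frame2mel_idx(10))  # int((10 - 2) * 80 / 25) = 25
```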
9 changes: 9 additions & 0 deletions dnnlib/__init__.py
@@ -0,0 +1,9 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto. Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.

from .util import EasyDict, make_cache_dir_path
