Forked from AMEERAZAM08/StyleLipSync

Commit 7eb5e40 (parent 178bc61): 42 changed files with 4,950 additions and 4 deletions.
@@ -0,0 +1,3 @@
__pycache__
*.pth
*.mp4
@@ -1,7 +1,68 @@
# [ICCV 2023] StyleLipSync: Style-based Personalized Lip-sync Video Generation
[ProjectPage](https://stylelipsync.github.io) | [Paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Ki_StyleLipSync_Style-based_Personalized_Lip-sync_Video_Generation_ICCV_2023_paper.pdf) | [ArXiv](https://arxiv.org/abs/2305.00521)

An official PyTorch implementation of `StyleLipSync: Style-based Personalized Lip-sync Video Generation` by Taekyung Ki* and [Dongchan Min](https://kevinmin95.github.io)*.

## Abstract
<img align='middle' src='./assets/sylelipsync.png'>

In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing videos from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also design video consistency with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask to improve naturalness over frames by utilizing a 3D parametric mesh predictor frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even in the zero-shot setting and can enhance the characteristics of an unseen face using only a few seconds of target video through the proposed adaptation method.

## Requirements
We recommend using Python `3.8.13` and PyTorch `1.7.1+cu110`.
```bash
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```
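
To confirm that the pinned build is installed and can see your GPU before running the demo, a quick check along these lines can help (this snippet is our addition, not part of the repository):

```python
# Sanity check (not part of the repository): verify the pinned versions and CUDA visibility.
import torch, torchvision, torchaudio

print("torch:", torch.__version__)              # expected: 1.7.1+cu110
print("torchvision:", torchvision.__version__)  # expected: 0.8.2+cu110
print("torchaudio:", torchaudio.__version__)    # expected: 0.7.2
print("CUDA available:", torch.cuda.is_available())
```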
## Demo
We provide a simple demonstration script with a personalized model for a target person, `AlexandriaOcasioCortez_0`, from [HDTF](https://github.com/MRzzm/HDTF).

```bash
sh prepare_hdtf.sh
```
Running `prepare_hdtf.sh` produces the preprocessed frames (`.jpg`) and the pose-aware masks for this person.

For arbitrary audio, you can generate a lip-synchronized video of the target person by running:

```bash
CUDA_VISIBLE_DEVICES=0 python run_demo.py --audio [path/to/audio] --person person_id --res_dir [path/to/save/results]
```

You can adjust the following options for inference:
- `--audio`: an audio file (`.wav`).
- `--person`: the person to use for inference, i.e., a folder name in `data` (default: `AlexandriaOcasioCortez_0`).
- `--res_dir`: the directory in which to save the results video (default: `results`).

The results video will be saved as `res_dir/person#audio.mp4`. Sample audio files are provided in `data/audio`, and you can also use your own audio file. If you want to evaluate the lip-sync metrics (LSE-C and LSE-D), please refer to this [repository](https://github.com/Rudrabha/Wav2Lip).
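
To run the demo over every provided sample clip in one go, a minimal batch sketch is shown below. It is our addition, assumes the `data/audio` layout and the `run_demo.py` flags described above, and simply shells out to the demo script; adjust paths to your setup.

```python
# Minimal batch-inference sketch (our addition): run the demo for every sample .wav file.
import os
import subprocess
from pathlib import Path

person = "AlexandriaOcasioCortez_0"  # folder name in `data`
res_dir = "results"

for wav in sorted(Path("data/audio").glob("*.wav")):
    subprocess.run(
        ["python", "run_demo.py",
         "--audio", str(wav),
         "--person", person,
         "--res_dir", res_dir],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
        check=True,
    )
    # Each run writes `results/<person>#<audio>.mp4`, as described above.
```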

## Disclaimer
This repository is intended for research purposes only.
## Acknowledgements
* [StyleGAN2-ADA](https://github.com/NVlabs/stylegan2-ada-pytorch)
* [Wav2Lip](https://github.com/Rudrabha/Wav2Lip)
* [Deep3DFaceRecon](https://github.com/sicxu/Deep3DFaceRecon_pytorch)
* [FOMM](https://github.com/AliaksandrSiarohin/first-order-model)
* [Voxceleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html)
* [HDTF](https://github.com/MRzzm/HDTF)
## Citation
```
@InProceedings{Ki_2023_ICCV,
    author    = {Ki, Taekyung and Min, Dongchan},
    title     = {StyleLipSync: Style-based Personalized Lip-sync Video Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22841-22850}
}
```
(The commit also adds several empty placeholder files and binary asset files, not shown here.)
@@ -0,0 +1,77 @@
```python
import os, math, shutil, pickle, copy, yaml, random, json, cv2
import torch, torchvision, torchaudio
import torch.nn as nn
import torch.nn.functional as F

import pandas as pd
import albumentations as A
import albumentations.pytorch.transforms as A_pytorch

from PIL import Image
from tqdm import tqdm
from pathlib import Path
from utils import common


class BaseDataLoader:
    def __init__(self, opt):
        self.opt = opt
        self.input_size = opt.input_size
        self.input_nc = opt.input_nc
        self.image_size = (opt.input_size, opt.input_size)
        self.num_frames_per_clip = opt.num_frames_per_clip

        self.fps = opt.fps
        # Mel frames per second: one mel frame is produced every `hop_length` audio samples.
        self.bps = opt.sampling_rate / opt.hop_length
        self.sampling_rate = opt.sampling_rate
        # Number of mel frames spanning one clip of `num_frames_per_clip` video frames.
        self.num_mel_bins = int(self.bps * self.num_frames_per_clip / self.fps)

        self.img_transform = A.Compose([
            A.Resize(height=self.input_size, width=self.input_size, interpolation=cv2.INTER_AREA),
            A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
            A_pytorch.ToTensorV2(),
        ])
        self.audio_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=opt.sampling_rate, n_mels=opt.n_mels,
            n_fft=opt.n_fft, win_length=opt.win_length, hop_length=opt.hop_length,
            f_max=opt.f_max, f_min=opt.f_min)

    def get_frame2mel_idx(self, idx):
        # Map a video frame index to the starting mel frame of an audio window
        # centered on the clip that contains this frame.
        idx = idx - self.num_frames_per_clip // 2
        return int(idx * self.bps / self.fps)

    def default_img_loader(self, path):
        # OpenCV loads BGR; flip the channel axis to get RGB.
        return cv2.imread(path)[:, :, ::-1]

    def default_aud_loader(self, path):
        audio, sr = torchaudio.load(path)
        audio = torch.mean(audio, dim=0)  # mix down to mono
        if sr != self.sampling_rate:
            audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=self.sampling_rate)(audio)
            print(f"- [Audio] Resample from {sr} to {self.sampling_rate}")
        mel = self.audio_transform(audio).T  # (time, n_mels)
        return torch.log10(torch.clamp(mel, min=1e-5, max=None))

    def crop_mel(self, mel, mel_idx, crop_length):
        # Slice `crop_length` mel frames starting at `mel_idx`, zero-padding
        # whenever the window falls outside the spectrogram boundaries.
        mel_shape = mel.shape
        if (mel_idx + crop_length) <= mel_shape[0] and mel_idx >= 0:
            mel_cropped = mel[mel_idx:mel_idx + crop_length]
            return mel_cropped
        else:
            if mel_idx < 0:
                pad = -mel_idx
                mel_cropped = F.pad(mel[:mel_idx + crop_length], (0, 0, pad, 0), value=0.)
            else:
                pad = crop_length - (mel_shape[0] - mel_idx)
                mel_cropped = F.pad(mel[mel_idx:], (0, 0, 0, pad), value=0.)
            return mel_cropped

    def path2img(self, img_path):
        img = self.default_img_loader(img_path)
        return self.img_transform(image=img)['image']

    def get_lower_half_mask(self):
        raise NotImplementedError()

    def preprocess(self):
        raise NotImplementedError()
```
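The alignment between video frames and mel-spectrogram frames above reduces to a little index arithmetic: `bps = sampling_rate / hop_length` is the number of mel frames per second, so the clip-centered audio window for video frame `idx` starts at `int((idx - num_frames_per_clip // 2) * bps / fps)`. The standalone sketch below reproduces that arithmetic with illustrative hyperparameters (16 kHz audio, hop length 200, 25 fps video, 5-frame clips); these numbers are assumptions for the example, not values taken from the repository configs.

```python
# Standalone sketch of the frame-to-mel alignment used by BaseDataLoader above.
# The hyperparameters are illustrative assumptions, not repository config values.
sampling_rate, hop_length = 16000, 200
fps, num_frames_per_clip = 25, 5

bps = sampling_rate / hop_length                     # mel frames per second -> 80.0
num_mel_bins = int(bps * num_frames_per_clip / fps)  # mel frames per 5-frame clip -> 16

def frame_to_mel_idx(frame_idx: int) -> int:
    # Same arithmetic as BaseDataLoader.get_frame2mel_idx: center the audio
    # window on the clip that contains `frame_idx`.
    return int((frame_idx - num_frames_per_clip // 2) * bps / fps)

mel_idx = frame_to_mel_idx(100)
print(mel_idx, num_mel_bins)  # 313 16 -> crop_mel would then slice mel[313:313 + 16]
```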
@@ -0,0 +1,9 @@
```python
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto. Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.

from .util import EasyDict, make_cache_dir_path
```
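
For reference, `EasyDict` (carried over from the StyleGAN2-ADA codebase acknowledged above) is a plain `dict` subclass that also allows attribute-style access, which is convenient for option objects such as the `opt` consumed by `BaseDataLoader`. A tiny usage sketch, with illustrative option names and values of our own choosing:

```python
# Tiny usage sketch of dnnlib.EasyDict (attribute-style access over a plain dict).
# The option names mirror those read by BaseDataLoader; the values are illustrative.
from dnnlib import EasyDict

opt = EasyDict(input_size=256, input_nc=3, num_frames_per_clip=5,
               fps=25, sampling_rate=16000, hop_length=200)
print(opt.input_size, opt["sampling_rate"])  # both access styles work: 256 16000
```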