
🎵 Project to create customized BGM for video content through Riffusion (Naver Connect BoostCamp Final Project)


JdRion/Make-Video-BGM_by-Riffusion



✨Team.ET✨

boostcamp 4th NLP Final Project :
Customized BGM Generation for Video Content

1. Team


김건우

백단익

손용찬

이재덕

정석희

Contribution

김건우 : Model training, pipeline design, Riffusion
백단익 : Model design and analysis, Whisper
손용찬 : Model design and analysis, sentiment classification
이재덕 : Frontend, Backend, architecture, Riffusion
정석희 : Backend, architecture

2. About

Motivation

The solo-creator (one-person) media market is growing, video accounts for most of the content being produced, and demand for BGM is rising with it. Unlike the growing demand for video, however, BGM that can be used in video production comes with many restrictions, such as copyright disputes and royalty costs. To address this, we use an AI-based music generation model to provide copyright-free BGM.

Development Goal

Given an input video, we extract its content, run sentiment analysis on it, classify the sentiment that fits the content, and use that result to generate BGM with the riffusion model.

3. Model

FlowChart

Step1 : Understanding the video content

Speech-to-Text

  • OpenAI's Whisper model is used to extract the full spoken content as text.

Why Whisper :

  • Whisper performs strongly, with an error rate on average 55.2% lower than wav2vec 2.0, which is used as SOTA in speech recognition.
  • It offers any-to-English speech translation as a multitask capability, so STT and translation can be used together. This makes English datasets usable in later steps, which is why it was chosen.
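
The Step1 extraction can be sketched with the openai-whisper Python package. This is a minimal sketch under stated assumptions, not the project's own `pre_to_stt.py`: the `"base"` checkpoint and the helper names `segments_to_rows` and `transcribe_to_english` are hypothetical. `task="translate"` requests Whisper's any-to-English translation, and the per-segment timestamps are the kind of input a per-segment sentiment classifier could consume.

```python
from typing import Dict, List, Tuple

Row = Tuple[float, float, str]  # (start_sec, end_sec, text)


def segments_to_rows(result: Dict) -> List[Row]:
    """Flatten a Whisper result dict into (start_sec, end_sec, text) rows."""
    return [
        (float(seg["start"]), float(seg["end"]), seg["text"].strip())
        for seg in result.get("segments", [])
    ]


def transcribe_to_english(media_path: str, model_name: str = "base") -> List[Row]:
    """Run Whisper on a media file and return English text per segment.

    Requires the openai-whisper package and ffmpeg. The import is lazy so
    that segments_to_rows stays usable without them.
    """
    import whisper  # installed in the Prerequisite step

    # The checkpoint size is an assumption; the README does not state it.
    model = whisper.load_model(model_name)
    # task="translate" makes Whisper emit English text whatever the spoken language is.
    result = model.transcribe(media_path, task="translate")
    return segments_to_rows(result)
```

Calling `transcribe_to_english("input_video.mp4")` (a hypothetical file name) would return a list of timed English segments. Use `task="transcribe"` instead to keep the original language.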

Step2 : Classifying the video's sentiment

Sentiment Classifier

  • To cover the full text while preserving the character of its content, sentiment classification is applied per text segment.
  • Each segment of the full text is analyzed and classified into one of 7 emotions: happiness, sadness, disgust, anger, surprise, fear, and neutral.

    https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

  • After per-segment classification, a post-processing step ignores any emotion that lasts for less than a threshold duration. If the emotions immediately before and after an ignored one are the same, the ignored span is treated as a continuation of them and replaced with that emotion.
  • As a result, a stable emotion could be identified along the timeline, so the Sentiment Classifier approach was adopted.
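
The post-processing described above can be sketched in a few lines of plain Python. This is one possible reading of the rule, not the project's implementation: the 5-second threshold, the single-pass strategy, and the names `merge_runs` and `smooth_emotions` are assumptions. Short runs whose neighbours agree are relabelled with the neighbours' emotion. Other short runs are marked `None` (ignored).

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, Optional[str]]  # (start_sec, end_sec, emotion)


def merge_runs(segments: List[Segment]) -> List[Segment]:
    """Merge consecutive segments that carry the same emotion label."""
    merged: List[Segment] = []
    for start, end, label in segments:
        if merged and merged[-1][2] == label:
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged


def smooth_emotions(segments: List[Segment], min_duration: float = 5.0) -> List[Segment]:
    """Drop emotions shorter than min_duration; bridge them if neighbours agree."""
    runs = merge_runs(segments)
    out: List[Segment] = []
    for i, (start, end, label) in enumerate(runs):
        if end - start < min_duration:
            prev_label = runs[i - 1][2] if i > 0 else None
            next_label = runs[i + 1][2] if i + 1 < len(runs) else None
            if prev_label is not None and prev_label == next_label:
                label = prev_label  # same emotion on both sides: treat as continuous
            else:
                label = None  # too short and not bridged: ignore
        out.append((start, end, label))
    return merge_runs(out)
```

For example, a 2-second "anger" run between two long "joy" runs is absorbed into a single "joy" span, while a 2-second "fear" run between "joy" and "sadness" is left unlabelled.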

Step3 : Generating BGM that matches the sentiment

Using and training the Riffusion model

  • Riffusion is a diffusion model trained on spectrograms, a tool for visualizing sound and other waves.
  • The sentiment classification result from Step2 is used as the prompt, and a spectrogram with the same sentiment is used as the seed image.
  • For user convenience, music and noise other than speech are removed from the original video, and the generated BGM is mixed in to produce the final output.
  • Model: JD97/Riffusion_sentiment_LoRA(huggingface)

    https://huggingface.co/JD97/Riffusion_sentiment_LoRA

4. Dataset

Building the train dataset for additional Riffusion training

Input (source data) → data extraction → downsampling → segmentation → preprocessing → Output (spectrogram with caption)

Dataset

(1) Select and extract labels similar to the sentiment classifier's from the source data (6,680 items)

  • Source_data: Chr0my/Epidemic_music(huggingface)

    https://huggingface.co/datasets/Chr0my/Epidemic_music

  • 7 similar labels : angry, fear, funny, happy, quirky, sad, weird

(2) Downsample the extracted music files (22.05 kHz → 8 kHz)

(3) Split the downsampled music files into 10-second segments (with random sampling)

  • A 10-second segment length was chosen to produce samples similar to the Riffusion model's training data
  • Random sampling was used so that the trained model is robust to segment variation
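
Step (3) can be sketched as picking random 10-second windows from a track that has already been downsampled to 8 kHz. This is a minimal sketch under assumptions: the function name `random_segments`, the number of segments per track, and the fixed seed are hypothetical, and overlapping windows are allowed here, which the README does not specify.

```python
import random
from typing import List, Tuple

SAMPLE_RATE = 8_000   # Hz, after downsampling (from the README)
SEGMENT_SECONDS = 10  # segment length (from the README)


def random_segments(num_samples: int, count: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Pick `count` random [start, end) sample ranges, each 10 seconds long."""
    seg_len = SAMPLE_RATE * SEGMENT_SECONDS
    if num_samples < seg_len:
        return []  # the track is shorter than one segment
    rng = random.Random(seed)  # seeded so the split is reproducible
    last_start = num_samples - seg_len
    starts = sorted(rng.randint(0, last_start) for _ in range(count))
    return [(start, start + seg_len) for start in starts]
```

Each returned pair can be used to slice the audio array, for example `audio[start:end]`, before the preprocessing step.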

(4) Preprocessing

  • STFT (short-time Fourier transform) → Griffin-Lim → Mel scale
  • Captions were written from the source data's metadataTags and moods fields
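
To illustrate the mel-scale part of the preprocessing, the sketch below uses one common (HTK-style) definition, mel = 2595 · log10(1 + f / 700), to place band edges for 8 kHz audio, whose highest representable frequency is 4 kHz. The README does not state the project's spectrogram settings, so the formula variant, the number of bands, and the function names are all assumptions.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Convert frequency in Hz to mel (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, sample_rate: int = 8_000) -> List[float]:
    """Return n_mels + 2 band-edge frequencies (Hz), evenly spaced in mel."""
    top = hz_to_mel(sample_rate / 2)  # Nyquist frequency: 4 kHz for 8 kHz audio
    step = top / (n_mels + 1)
    return [mel_to_hz(i * step) for i in range(n_mels + 2)]
```

The edges are dense at low frequencies and sparse at high ones, which is why a mel spectrogram gives more resolution to the range where pitch differences are easier to hear.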

(5) Final dataset

  • gwkim22/spectro_caption_dataset(huggingface)

    https://huggingface.co/datasets/gwkim22/spectro_caption_dataset

5. Architecture

FlowChart

6. How to Use

File Directory

.
|-- LoRA
|   |-- README.md
|   |-- text_to_image_lora.py
|   `-- train.sh
|-- MLOPS
|   |-- README.md
|   |-- front
|   |-- kubernetes
|   `-- serving
|-- dataset
|-- model
|   |-- README.md
|   |-- _interpolation.py
|   |-- _sum_by_sent.py
|   |-- oneway_pipeline.py
|   |-- pre_to_stt.py
|   |-- pretrained_models
|   |-- stt_to_rif.py
|   `-- utils.py
|-- project_requirements.txt
|-- riffusion
|-- whisper
`-- README.md

Environment

Ubuntu 18.04.5 LTS
CPU : Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz x 8
GPU : Tesla V100-PCIE-32GB
Python Version 3.9

Prerequisite

# Install project_requirements.txt
$ pip install -r project_requirements.txt

# Install the following additional dependencies:
$ apt-get update
$ sudo apt-get install ffmpeg
$ conda install pyworld -c conda-forge
$ apt-get install -y libsndfile1-dev
$ pip install git+https://github.com/openai/whisper.git
$ pip install git+https://github.com/huggingface/diffusers

Reference

Paper

  • Whisper: Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
  • LoRA: Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Open Source
