This repository provides a comprehensive toolkit for processing audio and video files, with a focus on speaker diarization, speaker identification, audio extraction, and dataset creation. By leveraging tools like ffmpeg, pyannote.audio, and other Python libraries, the scripts enable efficient and accurate workflows for handling audio data.
- Run `dataset-creation.py` to extract English audio tracks from video files.
  - Now uses PyDub instead of calling ffmpeg directly, and extracts 44.1 kHz, 16-bit PCM, mono WAV files.
  - Uses multi-threading for faster, scalable performance.
- Optional script: `organize-videos.py` extracts season and episode information from the file names and renames the files accordingly.
  - This keeps your video files and the generated WAV and JSON files uniquely named, e.g. `S02E13.wav` and `S02E13.json`.
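The renaming done by the optional script can be sketched as follows. This is a minimal illustration and not the code in `organize-videos.py`: the helper names `parse_episode` and `target_name`, and the three filename patterns they accept, are assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical patterns; the real script may recognise other naming schemes.
_PATTERNS = [
    re.compile(r"[Ss](\d{1,2})[ ._-]*[Ee](\d{1,3})"),               # S02E13, s2.e13
    re.compile(r"(?<!\d)(\d{1,2})[xX](\d{1,3})(?!\d)"),             # 2x13
    re.compile(r"[Ss]eason\s*(\d{1,2}).*?[Ee]pisode\s*(\d{1,3})"),  # Season 2 Episode 13
]


def parse_episode(name: str) -> Optional[Tuple[int, int]]:
    """Return (season, episode) found in a file name, or None if nothing matches."""
    for pattern in _PATTERNS:
        match = pattern.search(name)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def target_name(path: Path) -> Optional[str]:
    """Build the normalised name (e.g. S02E13.mkv), keeping the original extension."""
    parsed = parse_episode(path.stem)
    if parsed is None:
        return None  # leave files without season/episode info untouched
    season, episode = parsed
    return f"S{season:02d}E{episode:02d}{path.suffix}"
```

Because the WAV and JSON outputs reuse the video's stem, a name produced this way carries through the rest of the pipeline. A caller could apply it with `path.rename(path.with_name(new_name))` after checking that the target does not already exist.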
1B. Use UVR or a similar vocal isolation tool to strip background noise and music before diarization.
  - UVR Project
- Diarizing audio that contains background noise or music gives very poor results; for example, singing in background music will be labeled as a speaker.
- Run `diarize-dataset.py` to process the extracted WAV files and produce JSON files containing diarization data.
  - Uses PyDub instead of calling ffmpeg directly. Requires a Hugging Face token.
- Run `identify-speaker.py` to play audio segments from the diarization files and interactively map the target speaker.
- Run `isolate-trim.py` to extract and trim the target speaker's audio segments, preparing them for dataset creation.
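To make the last two steps concrete, here is a rough sketch of how the target speaker's segments might be selected from a diarization file once the speaker label is known. The JSON layout (a list of objects with `speaker`, `start`, and `end` in seconds), the function name `target_segments`, and the `max_gap` and `min_len` thresholds are all assumptions for illustration; the files written by `diarize-dataset.py` and the logic in `isolate-trim.py` may differ.

```python
import json
from typing import Dict, List, Tuple

Segment = Tuple[float, float]


def target_segments(diarization: List[Dict], speaker: str,
                    max_gap: float = 0.3, min_len: float = 1.0) -> List[Segment]:
    """Return merged (start, end) spans, in seconds, for one speaker label."""
    spans = sorted(
        (float(d["start"]), float(d["end"]))
        for d in diarization
        if d["speaker"] == speaker
    )
    merged: List[Segment] = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous span: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Drop spans that are too short to be useful as dataset clips.
    return [(s, e) for s, e in merged if e - s >= min_len]


# Hypothetical contents of a diarization file such as jsons/S02E13.json.
example = json.loads("""[
  {"speaker": "SPEAKER_00", "start": 0.0,  "end": 2.0},
  {"speaker": "SPEAKER_01", "start": 2.1,  "end": 4.0},
  {"speaker": "SPEAKER_00", "start": 4.25, "end": 6.0},
  {"speaker": "SPEAKER_00", "start": 6.25, "end": 7.5},
  {"speaker": "SPEAKER_00", "start": 9.0,  "end": 9.5}
]""")
segments = target_segments(example, "SPEAKER_00")
```

With PyDub, each resulting span can then be cut from the source WAV using millisecond slicing, for example `audio[int(s * 1000):int(e * 1000)]`.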
Ensure the following are installed:
- Python 3.9
- ffmpeg: Install via your system's package manager or from the official site.
The scripts automatically create necessary directories and pause execution for users to populate them with required data. Ensure the following directory structure is in place:
- Video Input Directory: `base-folder/videos`
- WAV Output Directory: `base-folder/wavs`
- JSON Output Directory: `base-folder/jsons`
- Speaker Mapping File: `base-folder/mappings.csv`
- Processed Speaker Output Directory: `base-folder/targeted`
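The create-then-pause behaviour described above could look roughly like this. It is a sketch under assumptions, not the scripts' actual code: `ensure_layout` and `wait_for_videos` are hypothetical helper names, and the prompt is injectable only so the example can be exercised without a terminal.

```python
from pathlib import Path
from typing import Callable, Dict


def ensure_layout(base: Path) -> Dict[str, Path]:
    """Create the working folders under `base` and return their paths."""
    layout = {
        "videos": base / "videos",
        "wavs": base / "wavs",
        "jsons": base / "jsons",
        "targeted": base / "targeted",
    }
    for folder in layout.values():
        folder.mkdir(parents=True, exist_ok=True)
    # The mapping file is only a path here; it is not created as a directory.
    layout["mappings"] = base / "mappings.csv"
    return layout


def wait_for_videos(layout: Dict[str, Path],
                    prompt: Callable[[str], str] = input) -> None:
    """Pause until the videos folder contains at least one entry."""
    while not any(layout["videos"].iterdir()):
        prompt(f"Add video files to {layout['videos']} and press Enter...")
```

Calling `wait_for_videos(ensure_layout(Path("base-folder")))` creates the folders on first run and blocks until the user has copied in at least one file.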