Team Name: TensorSlow
Members: Aditya Singh (@adityajaas) and Shashi Kant Gupta (@shashikg)
This speaker diarization model uses Deep Embedding Clustering (DEC), with a deep neural network initialized via a Residual Autoencoder, to assign speaker labels to segments of the raw audio signal. Clustering is performed on x-vectors extracted using Desplanques et al.'s ECAPA-TDNN framework. We use Silero-VAD for voice activity detection.
Baseline Model: Spectral clustering is used for audio-label assignment.
- Defined in: utils.py
- Defined in: baselineMethods.py
- Defined in: optimumSpeaker.py
- Defined in: DEC.py
- Defined in: colab_demo_utils.py
Defined in utils.py
class DiarizationDataset(dataset_name=None,
data_dir=None,
sr=16000,
window_len=240,
window_step=120,
transform=None,
batch_size_for_ecapa=512,
vad_step=4,
split='full',
use_precomputed_vad= True,
use_oracle_vad= False,
skip_overlap= True)
An abstract class for loading the dataset. It applies the necessary pre-processing and x-vector feature extraction, and returns each audio file as a sequence of segmented x-vector features that can be passed directly to the clustering algorithm to predict speaker labels. The module uses pre-computed x-vectors if they are available; otherwise it extracts them at runtime.
Parameters:
Argument | Detail |
---|---|
dataset_name: |
str, Name of the pre-existing dataset to use. Options: ami , ami_dev , voxconverse |
data_dir: |
str, Directory for any dataset other than the options specified in dataset_name . dataset_name and data_dir cannot both be None |
sr: |
int, Sampling rate of the audio signal |
window_len: |
int, Window length (in ms) of each of the audio segments to be passed for feature extraction |
window_step: |
int, Step (in ms) between two windows of audio segments to be passed for feature extraction |
transform: |
list, List of transforms like mel-transform to be performed on audio while preprocessing, default = None |
batch_size_for_ecapa: |
int, Batch size of audio segments while performing feature extraction using ECAPA-TDNN |
vad_step: |
int, Number of windows to split each audio chunk into. Argument used by Silero-VAD module |
split: |
str, Argument defining type of split of dataset, default = 'full' indicates no split |
use_precomputed_vad: |
bool, If True, downloads precomputed Voice Activity Detection label output for the dataset. Only available for dataset options specified in dataset_name |
use_oracle_vad: |
bool, If True, model does Voice Activity Detection directly from groundtruth rttm files bypassing the Silero VAD module. |
skip_overlap: |
bool, If True, model skips the windows with multiple speakers speaking by inspecting the groundtruth rttm files |
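As a rough illustration of how sr, window_len and window_step interact, the sketch below slices a waveform into overlapping windows. The helper name frame_signal is hypothetical and is not part of utils.py; the actual class additionally runs VAD and ECAPA-TDNN feature extraction on each window.

```python
import numpy as np

def frame_signal(signal, sr=16000, window_len=240, window_step=120):
    """Split a 1-D waveform into overlapping windows (lengths given in ms).

    Hypothetical helper for illustration only; not the utils.py implementation.
    """
    win = int(sr * window_len / 1000)    # samples per window
    step = int(sr * window_step / 1000)  # samples between window starts
    if len(signal) < win:
        return np.empty((0, win), dtype=signal.dtype)
    n_windows = 1 + (len(signal) - win) // step
    return np.stack([signal[i * step : i * step + win] for i in range(n_windows)])

one_second = np.zeros(16000, dtype=np.float32)
windows = frame_signal(one_second)
print(windows.shape)  # (n_windows, samples_per_window)
```

With the default values, each window covers 3840 samples and consecutive windows overlap by half.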
Class Functions:
Parameters:
Argument | Detail |
---|---|
idx: |
int, Index to the required audio in the list of audio in root directory |
Returns:
Variable | Detail |
---|---|
audio_segments: |
torch.Tensor, (n_windows, features_len) Tensor of feature vectors of each audio segment window |
diarization_segments: |
torch.Tensor, (n_windows, n_spks) Tensor containing ground truth of speaker labels, 1 if i-th window has j-th speaker speaking, else 0 |
speech_segments: |
torch.Tensor, (n_windows,) Tensor with i-th value 1 if VAD returns presence of speech audio in i-th window, else 0 |
label_path: |
str, Path of the rttm file containing labels for the 'idx' wav audio |
Parameters:
Argument | Detail |
---|---|
path: |
str, Path to the RTTM diarization file |
Returns:
Variable | Detail |
---|---|
rttm_out: |
numpy.ndarray, (..., 3) Array with column 1 holding start time of speaker, column 2 holding end time of speaker, and column 3 holding speaker label |
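A minimal sketch of how an RTTM file can be turned into the (start, end, speaker) array described above. It assumes the standard RTTM SPEAKER record layout (onset in field 4, duration in field 5, speaker name in field 8); the function name parse_rttm_lines and the choice to keep speaker names as strings are assumptions, not the utils.py implementation.

```python
import numpy as np

def parse_rttm_lines(lines):
    """Return an (n, 3) object array of [start, end, speaker] rows."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        start = float(fields[3])
        duration = float(fields[4])
        rows.append([start, start + duration, fields[7]])
    return np.array(rows, dtype=object).reshape(-1, 3)

example = [
    "SPEAKER IS1009a 1 0.50 1.25 <NA> <NA> spk_A <NA> <NA>",
    "SPEAKER IS1009a 1 2.00 0.75 <NA> <NA> spk_B <NA> <NA>",
]
rttm_out = parse_rttm_lines(example)
print(rttm_out)
```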
def make_rttm(out_dir, name, labels, win_step):
Defined in utils.py
Create RTTM diarization files for the non-overlapping speaker labels in labels. Assumes the non-speech part has value -1 and the speech part has some speaker label (0, 1, 2, ...).
Parameters:
Argument | Detail |
---|---|
out_dir: |
str, Directory where the output RTTM diarization files are to be saved |
name: |
str, name for the audio files for which diarization was predicted |
labels: |
int, Speaker/ Non-speech labels assigned to different audio segments based on the win_step used to extract feature vectors |
win_step: |
int, Step (in ms) between two windows of audio segments used for feature extraction |
Returns:
Variable | Detail |
---|---|
return variable: |
str, Path to the saved RTTM diarization file |
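The conversion from per-window labels to RTTM records can be sketched as follows. This is an illustrative approximation, not the code in utils.py: the helper name labels_to_rttm_lines is hypothetical, and it assumes each window contributes exactly win_step milliseconds to a segment.

```python
def labels_to_rttm_lines(name, labels, win_step):
    """Merge runs of identical speaker labels into RTTM SPEAKER records."""
    step_s = win_step / 1000.0  # window step in seconds
    lines = []
    run_start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[run_start]:
            label = labels[run_start]
            if label != -1:  # -1 marks non-speech and produces no record
                onset = run_start * step_s
                duration = (i - run_start) * step_s
                lines.append(
                    f"SPEAKER {name} 1 {onset:.3f} {duration:.3f} "
                    f"<NA> <NA> {label} <NA> <NA>"
                )
            run_start = i
    return lines

lines = labels_to_rttm_lines("demo", [-1, 0, 0, 1, 1, 1, -1], win_step=120)
for line in lines:
    print(line)
```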
def get_metrics(groundtruth_path, hypothesis_path, collar=0.25, skip_overlap=True):
Defined in utils.py
Evaluate the diarization results of all the predicted RTTM files in the hypothesis directory against the groundtruth RTTM files in the groundtruth directory.
Parameters:
Argument | Detail |
---|---|
groundtruth_path: |
str, directory of groundtruth rttm files |
hypothesis_path: |
str, directory of hypothesis rttm files |
collar: |
float, Duration (in seconds) of collars removed from evaluation around boundaries of reference segments |
skip_overlap: |
bool, If True, calculates Diarization Error Rate ignoring the overlapped region |
Returns:
Variable | Detail |
---|---|
metric: |
pyannote.metrics, Pyannote metric class having diarization DERs for all the files. |
def plot_annot(name="IS1009a", collar=0.25, skip_overlap=True, groundtruth_path=None, hypothesis_path=None):
Defined in utils.py
Calculate the Diarization Error Rate for filename specified, and print the groundtruth and hypothesis time series plot.
Parameters:
Argument | Detail |
---|---|
name: |
str, Name of the file whose time series plot is to be generated. File must be present in the hypothesis_path folder |
collar: |
float, Duration (in seconds) of collars removed from evaluation around boundaries of reference segments |
skip_overlap: |
bool, If True, calculates Diarization Error Rate ignoring the overlapped region |
groundtruth_path: |
str, Directory of groundtruth rttm files |
hypothesis_path: |
str, Directory of hypothesis rttm files |
def diarizationOracleNumSpkrs(audio_dataset, method="KMeans"):
Defined in baselineMethods.py
Predict the diarization labels using the oracle number of speakers for all the audio files in audio_dataset with KMeans/ Spectral clustering algorithm.
Parameters:
Argument | Detail |
---|---|
audio_dataset: |
utils.DiarizationDataset, Diarization dataset |
method: |
str, Name of the method to be used for clustering part. Supports: "KMeans" or "Spectral" |
Returns:
Variable | Detail |
---|---|
hypothesis_dir: |
str, Directory where all the predicted RTTM diarization files are saved |
def diarizationEigenGapNumSpkrs(audio_dataset):
Defined in baselineMethods.py
Predict the diarization labels for all the audio files in audio_dataset with the Spectral clustering algorithm. It uses the eigen-gap principle to predict the optimal number of speakers. The module uses the existing spectral clustering implementation from: https://github.com/wq2012/SpectralCluster
Parameters:
Argument | Detail |
---|---|
audio_dataset: |
utils.DiarizationDataset, Diarization dataset |
Returns:
Variable | Detail |
---|---|
hypothesis_dir: |
str, Directory where all the predicted RTTM diarization files are saved |
Inspired from https://github.com/wq2012/SpectralCluster
class eigengap(min_clusters=1,
max_clusters=100,
p_percentile=0.9,
gaussian_blur_sigma=2,
stop_eigenvalue=1e-2,
thresholding_soft_multiplier=0.01,
thresholding_with_row_max=True)
Defined in optimumSpeaker.py
Utility function to decide the optimal number of speakers for clustering based on maximization of eigen-gap of the affinity matrix
Parameters:
Argument | Detail |
---|---|
min_clusters: |
int, Minimum number of output clusters |
max_clusters: |
int, Maximum number of output clusters |
p_percentile: |
float, Parameter used to compute the p-th percentile for percentile based thresholding |
gaussian_blur_sigma: |
float, sigma value for standard deviation of gaussian kernel in scipy gaussian filter |
stop_eigenvalue: |
float, Minimum value of eigenvalue of Affinity matrix for its eigenvector to be considered in clustering |
thresholding_soft_multiplier: |
float, Factor by which cells with a value below the threshold are multiplied in row/percentile thresholding. A value of 0.0 sets cells below the threshold to zero |
thresholding_with_row_max: |
bool, True for row-max thresholding, False for percentile thresholding |
Class Functions:
def _get_refinement_operator(self, name)
Parameters:
Argument | Detail |
---|---|
name: |
str, Get the input refinement operator. Available refinements- 'CropDiagonal' , 'GaussianBlur' , 'RowWiseThreshold' , 'Symmetrize' , 'Diffuse' , 'RowWiseNormalize' |
Returns:
Variable | Detail |
---|---|
CropDiagonal() /GaussianBlur() /RowWiseThreshold() /Symmetrize() / Diffuse() /RowWiseNormalize() |
optimumSpeaker.AffinityRefinementOperation, Returns specified refinement method class |
def compute_affinity_matrix(self, X)
Compute the affinity matrix for a matrix X, where each row is an instance and each column is a feature, by calculating the cosine similarity between each pair of l2-normalized rows of X
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, (n_windows, n_features) Input matrix with one instance per row and one feature per column, used to compute the affinity between each pair of rows |
Returns:
Variable | Detail |
---|---|
affinity: |
numpy.ndarray, (n_windows, n_windows) Symmetric array with (i,j)th value equal to cosine similarity between i-th and j-th row |
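A minimal NumPy sketch of a cosine-similarity affinity matrix. Following the (n_windows, n_windows) output shape, similarity is taken between rows of X. The function name cosine_affinity and the small constant guarding against zero-norm rows are assumptions added for illustration.

```python
import numpy as np

def cosine_affinity(X):
    """Cosine similarity between every pair of rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_normalized = X / np.maximum(norms, 1e-12)  # guard against zero rows
    return X_normalized @ X_normalized.T

X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
affinity = cosine_affinity(X)
print(affinity.round(3))
```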
def compute_sorted_eigenvectors(self, A)
Parameters:
Argument | Detail |
---|---|
A: |
numpy.ndarray, (n_windows, n_windows) Symmetric array with (i,j)th value equal to cosine similarity between i-th and j-th row |
Returns:
Variable | Detail |
---|---|
w: |
numpy.ndarray, Decreasing order sorted eigen values of affinity matrix A |
v: |
numpy.ndarray, Eigen vectors corresponding to eigen values returned |
def compute_number_of_clusters(self, eigenvalues, max_clusters, stop_eigenvalue)
Parameters:
Argument | Detail |
---|---|
eigenvalues: |
numpy.ndarray, Decreasing order sorted eigen values of affinity matrix between different windows |
max_clusters: |
int, Maximum number of clusters required. Default 'None' puts no such limit to the number of clusters |
stop_eigenvalue: |
float, Minimum value of eigenvalue to be considered for deciding number of clusters. Eigenvalues below this value are discarded |
Returns:
Variable | Detail |
---|---|
max_delta_index: |
int, Index to the eigenvalue such that eigen gap is maximized. It gives the number of clusters determined by the function |
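The eigen-gap search can be sketched as below. This assumes the gap is measured as the ratio between consecutive eigenvalues, as in the SpectralCluster project this module is inspired by; the function name eigengap_num_clusters is hypothetical and the repository code may differ in detail.

```python
import numpy as np

def eigengap_num_clusters(eigenvalues, max_clusters=None, stop_eigenvalue=1e-2):
    """Pick k where the ratio between consecutive sorted eigenvalues is largest.

    Assumes eigenvalues are sorted in decreasing order.
    """
    range_end = len(eigenvalues)
    if max_clusters and max_clusters + 1 < range_end:
        range_end = max_clusters + 1
    max_delta, max_delta_index = 0.0, 1
    for i in range(1, range_end):
        if eigenvalues[i - 1] < stop_eigenvalue:
            break  # remaining eigenvalues are too small to be considered
        delta = eigenvalues[i - 1] / eigenvalues[i]
        if delta > max_delta:
            max_delta, max_delta_index = delta, i
    return max_delta_index

eigenvalues = np.array([5.0, 4.5, 4.0, 0.5, 0.4, 0.1])
k = eigengap_num_clusters(eigenvalues)
print(k)
```

In this example the largest drop occurs between the third and fourth eigenvalues, so three clusters are selected; capping max_clusters restricts the search to the leading eigenvalues.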
def find(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, (n_windows, n_features) Input matrix with one instance per row and one feature per column, used to compute the affinity between each pair of rows |
Returns:
Variable | Detail |
---|---|
k: |
int, Number of clusters calculated after creating the affinity matrix, applying refinements, and using eigen-gap maximization. self.min_clusters ≤ k ≤ self.max_clusters |
class AffinityRefinementOperation(metaclass=abc.ABCMeta)
Defined in optimumSpeaker.py
Abstract base class for the refinement operation classes that are applied to the data
Class Functions:
def check_input(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be refined by refinement operators |
Returns:
Variable | Detail |
---|---|
ValueError() \ TypeError() |
ValueError/TypeError, TypeError if X is not a numpy array. ValueError if X is not a 2D square matrix |
def refine(self, X)
Abstract function redefined in various child classes of class AffinityRefinementOperation
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be refined by refinement operators |
class CropDiagonal(AffinityRefinementOperation)
Defined in optimumSpeaker.py
Operator to replace diagonal element by the max non-diagonal value of row. Post operation, the matrix has similar properties to a standard Laplacian matrix. This also helps to avoid the bias during Gaussian blur and normalization.
Class Functions:
def refine(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be refined by refinement operators |
Returns:
Variable | Detail |
---|---|
Y: |
numpy.ndarray, Output array with Crop diagonal refinement applied |
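A small sketch of the diagonal-cropping step described above, written as a standalone function rather than a class. The name crop_diagonal is an assumption for illustration.

```python
import numpy as np

def crop_diagonal(X):
    """Replace each diagonal entry with the largest off-diagonal value in its row."""
    Y = np.array(X, dtype=float)        # work on a copy
    np.fill_diagonal(Y, -np.inf)        # exclude the diagonal from the row maximum
    np.fill_diagonal(Y, Y.max(axis=1))  # write the row-wise maxima back
    return Y

X = np.array([[1.0, 0.2, 0.5],
              [0.2, 1.0, 0.3],
              [0.5, 0.3, 1.0]])
Y = crop_diagonal(X)
print(Y)
```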
class GaussianBlur(AffinityRefinementOperation)
def __init__(self, sigma = 1)
Defined in optimumSpeaker.py
Operator to apply gaussian filter to the input array. Uses scipy.ndimage.gaussian_filter
Parameters:
Argument | Detail |
---|---|
sigma: |
float, Standard deviation for Gaussian kernel |
Class Functions:
def refine(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be refined by refinement operators |
Returns:
Variable | Detail |
---|---|
Y: |
numpy.ndarray, Output array with gaussian filter applied |
class RowWiseThreshold(AffinityRefinementOperation)
def __init__(self,
p_percentile=0.95,
thresholding_soft_multiplier=0.01,
thresholding_with_row_max=False)
Defined in optimumSpeaker.py
Operator to apply row wise thresholding based on either percentile or row-max thresholding.
Parameters:
Argument | Detail |
---|---|
p_percentile: |
float, Parameter used to compute the p-th percentile for percentile based thresholding |
thresholding_soft_multiplier: |
float, Factor by which cells with a value below the threshold are multiplied in row/percentile thresholding. A value of 0.0 sets cells below the threshold to zero |
thresholding_with_row_max: |
bool, True applies row-max based thresholding, False applies percentile based thresholding |
Class Functions:
def refine(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be refined by refinement operators |
Returns:
Variable | Detail |
---|---|
Y: |
numpy.ndarray, Output array with row wise threshold applied |
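The two thresholding modes can be sketched as follows. This is an assumption-laden illustration: in row-max mode the threshold is taken to be p_percentile times the row maximum, which mirrors the SpectralCluster project but is not confirmed by this document, and the function name row_wise_threshold is hypothetical.

```python
import numpy as np

def row_wise_threshold(X, p_percentile=0.95, thresholding_soft_multiplier=0.01,
                       thresholding_with_row_max=False):
    """Shrink cells that fall below a per-row threshold."""
    Y = np.array(X, dtype=float)
    if thresholding_with_row_max:
        # Assumed rule: threshold is a fraction of each row's maximum.
        threshold = Y.max(axis=1, keepdims=True) * p_percentile
    else:
        # Threshold is the p-th percentile of each row.
        threshold = np.percentile(Y, p_percentile * 100, axis=1, keepdims=True)
    return np.where(Y < threshold, Y * thresholding_soft_multiplier, Y)

X = np.array([[1.0, 0.5, 0.96],
              [0.5, 1.0, 0.2],
              [0.96, 0.2, 1.0]])
Y = row_wise_threshold(X, thresholding_with_row_max=True)
print(Y)
```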
class Symmetrize(AffinityRefinementOperation)
Defined in optimumSpeaker.py
Operator to return a symmetric matrix based on max{X, X^T} from a given input matrix X.
Class Functions:
def refine(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be used to create a symmetric matrix |
Returns:
Variable | Detail |
---|---|
Y: |
numpy.ndarray, Output symmetric array |
class Diffuse(AffinityRefinementOperation)
Defined in optimumSpeaker.py
Operator to return a diffused symmetric matrix X^T X from a given input matrix X.
Class Functions:
def refine(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be used to create a diffused symmetric matrix |
Returns:
Variable | Detail |
---|---|
Y: |
numpy.ndarray, Output diffused symmetric array |
class RowWiseNormalize(AffinityRefinementOperation)
Defined in optimumSpeaker.py
Operator to normalize each row of input matrix X by the maximum value in the corresponding rows.
Class Functions:
def refine(self, X)
Parameters:
Argument | Detail |
---|---|
X: |
numpy.ndarray, Input array to be row normalized |
Returns:
Variable | Detail |
---|---|
Y: |
numpy.ndarray, Output row normalized array |
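The symmetrization, diffusion and row-wise normalization refinements described above are each a one-line matrix operation. The standalone functions below are a sketch with hypothetical names; the repository wraps the same ideas in AffinityRefinementOperation subclasses.

```python
import numpy as np

def symmetrize(X):
    """Element-wise maximum of X and its transpose."""
    return np.maximum(X, X.T)

def diffuse(X):
    """Diffused matrix X^T X, as stated in the Diffuse description."""
    return X.T @ X

def row_wise_normalize(X):
    """Divide every row by its maximum value."""
    return X / X.max(axis=1, keepdims=True)

X = np.array([[1.0, 0.2],
              [0.6, 2.0]])
S = symmetrize(X)
D = diffuse(X)
N = row_wise_normalize(X)
print(S, D, N, sep="\n")
```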
class ResidualAutoEncoder(ip_features,
hidden_dims=[500, 500, 2000, 30])
Defined in DEC.py
Create a torch.nn.Module for a deep autoencoder composed of Residual Neural Network (ResNet) blocks as the encoder and decoder layers. The activation used is ReLU. The bottleneck encoder output and the final decoder output are not activated, to avoid data loss due to the ReLU activation.
Parameters:
Argument | Detail |
---|---|
ip_features: |
int, Input features size |
hidden_dims: |
list of int, List of hidden dimension features. Last element on the list is the output dimension of bottleneck of the autoencoder |
Returns:
Variable | Detail |
---|---|
z: |
torch.Tensor, Output from the bottleneck encoder of the deep autoencoder network. |
xo: |
list of torch.Tensor, Output from each encoder except the bottleneck encoder of the deep autoencoder. First item of the list is the input given to the system. |
xr: |
list of torch.Tensor, Reconstruction of inputs to each encoder layer of autoencoder. xr is reversed so that i-th item in list xr is the reconstruction of i-th item in list xo. Eg. First item of xo is the input to the ResidualAutoEncoder network, and first item of xr is the reconstruction from the ResidualAutoEncoder network. |
def load_encoder():
Defined in DEC.py
Load weights from the ResidualAutoEncoder trained on the training data.
Returns:
Variable | Detail |
---|---|
model: |
ResidualAutoEncoder, Model with input feature size of 192, and hidden layers of size 500, 500, 2000, 30. The weights of the model are initialized to the weights of the autoencoder trained on the training data. |
class ClusteringModule(nn.Module):
def __init__(self,
num_clusters,
encoder, data,
cinit = "KMeans"):
Defined in DEC.py
Clustering module of the deep embedding clustering (DEC) algorithm. It uses the trained encoder of the ResidualAutoEncoder to initialize the DEC Clustering network. Kmeans is used to initialize centroids in the latent space.
Parameters:
Argument | Detail |
---|---|
num_clusters: |
str, Number of clusters to create from the algorithm |
encoder: |
nn.Module, Pre-trained encoder for initializing the centroids. The encoder transforms data to the latent space for clustering |
cinit: |
str, Initialization method of centroids of clusters. Default KMeans |
Returns:
Variable | Detail |
---|---|
q: |
torch.Tensor, Tensor of similarity between embedding points z_i and centroid mu_j. Assumes Student's t distribution as the kernel |
p: |
torch.Tensor, Tensor of target distribution based on soft assignment of q_i |
xo[0]: |
torch.Tensor, Input data to the ResidualAutoEncoder |
xr[0]: |
torch.Tensor, Reconstructed input by the ResidualAutoEncoder |
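The q and p outputs can be illustrated with a NumPy sketch of the usual DEC formulation: a Student's t kernel for the soft assignment and a squared, frequency-normalized target. It is assumed, not confirmed by this document, that ClusteringModule follows this formulation; the function names and the alpha parameter are illustrative.

```python
import numpy as np

def soft_assignment(z, centroids, alpha=1.0):
    """Student's t similarity q_ij between embeddings z_i and centroids mu_j."""
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)  # each row sums to 1

def target_distribution(q):
    """Target p_ij: square q and normalize by the soft cluster frequency."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
q = soft_assignment(z, centroids)
p = target_distribution(q)
print(q.round(4))
print(p.round(4))
```

Squaring q pushes confident assignments closer to 1, which is what lets the target distribution act as a self-training signal.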
Class Functions:
def init_centroid(self,
data,
method = "KMeans")
Returns clustered data after calculating the optimal number of speakers using eigen-gap method, and then clustering the data based on the method specified.
Parameters:
Argument | Detail |
---|---|
data: |
torch.Tensor, Input data to be clustered |
method: |
str, Clustering method. Default KMeans . Options KMeans /Spectral |
Returns:
Variable | Detail |
---|---|
output: |
torch.Tensor, Tensor containing initialized centroids for the dataset |
class DEC(self,
num_clusters,
encoder, data,
cinit = "KMeans"):
Defined in DEC.py
Deep embedding clustering (DEC) algorithm. It uses the trained encoder of the ResidualAutoEncoder to initialize the DEC Clustering network. It calls ClusteringModule class to initialize the centroids.
Parameters:
Argument | Detail |
---|---|
encoder: |
nn.Module, Pre-trained encoder for initializing the centroids. The encoder transforms data to the latent space for clustering |
num_clusters: |
str, Number of clusters to create from the algorithm. Default None uses eigengap to determine number of clusters |
cinit: |
str, Initialization method of centroids of clusters. Default KMeans . Options KMeans /Spectral |
Class Functions:
def fit(self,
data,
y_true = None,
niter = 150,
lrEnc = 1e-4,
lrCC = 1e-4,
verbose = False)
Trains the model by minimizing the KL divergence between the target and observed distributions. It also updates the ResidualAutoEncoder in parallel using MSE loss, to improve the latent space projection of the data for better clustering. Both updates use the Adam optimizer, and the objective function is a linear combination of the KL divergence between the target and observed distributions and the MSE loss between the input data and its reconstruction by the ResidualAutoEncoder.
Parameters:
Argument | Detail |
---|---|
data: |
torch.Tensor, Input data to be clustered |
y_true: |
numpy.ndarray, True labels of the data we aim to cluster. predict() and clusterAccuracy() functions are invoked only if y_true is not None |
niter: |
int, Number of epochs to train the model for |
lrEnc: |
float, Learning rate for updating the encoder |
lrCC: |
float, Learning rate for updating the cluster centres |
verbose: |
bool, True value activates the tqdm progress bar while training. False returns no updates when training |
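The combined objective described for fit can be written out numerically as below. This is a sketch under assumptions: the weighting factor recon_weight, the per-sample averaging of the KL term and the function name dec_objective are all hypothetical, since the document only states that the objective is a linear combination of the two losses.

```python
import numpy as np

def dec_objective(p, q, x, x_reconstructed, recon_weight=1.0, eps=1e-12):
    """KL(P || Q) averaged over samples plus a weighted reconstruction MSE."""
    kl = np.sum(p * np.log((p + eps) / (q + eps))) / p.shape[0]
    mse = np.mean((x - x_reconstructed) ** 2)
    return kl + recon_weight * mse

p = np.array([[1.0, 0.0]])
q = np.array([[0.5, 0.5]])
x = np.zeros((1, 2))
x_reconstructed = np.ones((1, 2))
loss = dec_objective(p, q, x, x_reconstructed, recon_weight=0.5)
print(loss)
```

In training, a gradient-based optimizer such as Adam would minimize this quantity with respect to the encoder weights and the cluster centres.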
Predict the cluster label of each data point by selecting the cluster for which the observed distribution is maximized.
Parameters:
Argument | Detail |
---|---|
data: |
torch.Tensor, Input data to be labelled after clustering |
Returns:
Variable | Detail |
---|---|
y_pred: |
numpy.ndarray, Soft prediction labels of the data |
Compute the cluster assignment accuracy as the maximum accuracy between y_pred and y_true over all permutations of the labels in y_pred. The best permutation is found with scipy's linear_sum_assignment optimization function.
Parameters:
Argument | Detail |
---|---|
y_pred: |
numpy.ndarray, Prediction of the labels by DEC algorithm |
y_true: |
numpy.ndarray, True labels of the data |
Returns:
Variable | Detail |
---|---|
accuracy: |
float, Cluster assignment accuracy |
reassignment: |
dict, dictionary with key as rows and value as cols indices for the optimal assignment |
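A sketch of permutation-invariant cluster accuracy using scipy.optimize.linear_sum_assignment. The function name cluster_accuracy and the construction of the contingency matrix are assumptions for illustration; only the use of linear_sum_assignment and the (accuracy, reassignment) return values come from the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_pred, y_true):
    """Best accuracy over all one-to-one mappings of predicted to true labels."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    n = int(max(y_pred.max(), y_true.max())) + 1
    # counts[i, j] = number of samples predicted as cluster i with true label j
    counts = np.zeros((n, n), dtype=np.int64)
    for pred, true in zip(y_pred, y_true):
        counts[pred, true] += 1
    rows, cols = linear_sum_assignment(counts, maximize=True)
    accuracy = counts[rows, cols].sum() / y_true.size
    reassignment = dict(zip(rows.tolist(), cols.tolist()))
    return accuracy, reassignment

accuracy, reassignment = cluster_accuracy([1, 1, 0, 0, 2], [0, 0, 1, 1, 2])
print(accuracy, reassignment)
```

Here the predicted clusters 0 and 1 are simply swapped relative to the true labels, so the optimal assignment recovers a perfect score.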
def diarizationDEC(audio_dataset,
num_spkr = None,
hypothesis_dir = None)
Defined in DEC.py
Compute diarization labels based on the oracle number of speakers if num_spkr = 'oracle'. Used as an optimal benchmark for the performance of DEC. If num_spkr = None, uses eigen-gap maximization in the ClusteringModule to determine the number of speakers.
Parameters:
Argument | Detail |
---|---|
audio_dataset: |
utils.DiarizationDataset, Test diarization dataset |
num_spkr: |
str, None for calculating the optimal number of speakers from eigen-gap maximization. oracle for using the number of speakers in each window given with the data. |
hypothesis_dir: |
str, Directory to store the predicted speaker labels in the audio segments in an rttm file. None stores it in ./rttm_output/ directory |
Returns:
Variable | Detail |
---|---|
hypothesis_dir: |
str, Directory to the rttm files containing predicted speaker labels with their timestamps |
def downloadYouTube(videourl, path):
Defined in colab_demo_utils.py
Download video from YouTube in .mp4 format using Video URL.
Parameters:
Argument | Detail |
---|---|
videourl: |
str, URL of the YouTube video to download |
path: |
str, directory to save the YouTube video. If directory does not exist, it is created. |
Returns:
Variable | Detail |
---|---|
save_dir: |
str, Save directory location |
def loadVideoFile(playvideo_file=False):
Defined in colab_demo_utils.py
Load video file either from YouTube or from your local directory into your current session working directory. Also extracts and stores its audio file in .wav format.
Parameters:
Argument | Detail |
---|---|
playvideo_file: |
bool, If True, plays the video after loading in the working directory. Default=False |
Returns:
Variable | Detail |
---|---|
video_dir: |
str, Returns the path to the saved video |
def read_rttm(path):
Defined in colab_demo_utils.py
Create hypothesis labels for each window using .rttm file.
Parameters:
Argument | Detail |
---|---|
path: |
str, Path to the rttm file |
Returns:
Variable | Detail |
---|---|
hypothesis_labels: |
numpy.ndarray, (n_instances, 3) i-th row's first, second and third column contains start, end, and speaker id of the i-th instance of speech. |
def combine_audio(vidname, audname, outname, fps):
Defined in colab_demo_utils.py
Combine cv2 processed silent video with its audio file to output the complete annotated video.
Parameters:
Argument | Detail |
---|---|
vidname: |
str, Path to the silent video |
audname: |
str, Path to the audio file to be attached |
outname: |
str, Output video file name |
fps: |
int, Frame rate of the video |
def createAnnotatedVideo(audio_dataset, hypothesis_dir):
Defined in colab_demo_utils.py
Use cv2 to put annotations in the video using the hypothesis labels.
Parameters:
Argument | Detail |
---|---|
audio_dataset: |
utils.DiarizationDataset, Dataset pipeline |
hypothesis_dir: |
str, Path to the directory with hypothesis labels rttm files |
Returns:
Variable | Detail |
---|---|
op_video_name: |
str, Annotated output video filename |