
# Documentation: EE698R DEC-based Diarization Model

**Team Name:** TensorSlow

**Members:** Aditya Singh (@adityajaas) and Shashi Kant Gupta (@shashikg)

This speaker diarization model uses Deep Embedding Clustering with a deep neural network initialized via a Residual Autoencoder to assign speaker labels to segments of the raw audio signal. Clustering is performed on x-vectors extracted using Desplanques et al.'s ECAPA-TDNN framework. We use Silero-VAD for voice activity detection.

**Baseline Model:** Spectral clustering is used for audio-label assignment.

## API Documentation

### Index

- **utils.py:** `DiarizationDataset`, `make_rttm()`, `get_metrics()`, `plot_annot()`
- **baselineMethods.py:** `diarizationOracleNumSpkrs()`, `diarizationEigenGapNumSpkrs()`
- **optimumSpeaker.py:** `eigengap`, `AffinityRefinementOperation`, `CropDiagonal`, `GaussianBlur`, `RowWiseThreshold`, `Symmetrize`, `Diffuse`, `RowWiseNormalize`
- **DEC.py:** `ResidualAutoEncoder`, `load_encoder()`, `ClusteringModule`, `DEC`, `diarizationDEC()`
- **colab_demo_utils.py:** `downloadYouTube()`, `loadVideoFile()`, `read_rttm()`, `combine_audio()`, `createAnnotatedVideo()`

### Defined in utils.py

#### class DiarizationDataset()

Defined in `utils.py`.

```python
class DiarizationDataset(dataset_name=None,
                         data_dir=None,
                         sr=16000,
                         window_len=240,
                         window_step=120,
                         transform=None,
                         batch_size_for_ecapa=512,
                         vad_step=4,
                         split='full',
                         use_precomputed_vad=True,
                         use_oracle_vad=False,
                         skip_overlap=True)
```

Creates a dataset class for loading the data. The class applies the necessary pre-processing and x-vector feature extraction, returning each audio file as a batch of segmented x-vector features that can be passed directly to the clustering algorithm to predict speaker labels. The module uses pre-computed x-vectors if available; otherwise it extracts them at runtime.

Parameters:

| Argument | Detail |
| --- | --- |
| `dataset_name` | *str*, Name of the pre-existing dataset to use. Options: `ami`, `ami_dev`, `voxconverse` |
| `data_dir` | *str*, Directory for any dataset other than the options specified in `dataset_name`. `dataset_name` and `data_dir` cannot both be None |
| `sr` | *int*, Sampling rate of the audio signal |
| `window_len` | *int*, Window length (in ms) of each audio segment passed for feature extraction |
| `window_step` | *int*, Step (in ms) between two windows of audio segments passed for feature extraction |
| `transform` | *list*, List of transforms (e.g. a mel transform) applied to the audio during preprocessing. Default: None |
| `batch_size_for_ecapa` | *int*, Batch size of audio segments for feature extraction with ECAPA-TDNN |
| `vad_step` | *int*, Number of windows to split each audio chunk into; used by the Silero-VAD module |
| `split` | *str*, Type of dataset split. Default: 'full' (no split) |
| `use_precomputed_vad` | *bool*, If True, downloads precomputed Voice Activity Detection labels for the dataset. Only available for the dataset options specified in `dataset_name` |
| `use_oracle_vad` | *bool*, If True, performs Voice Activity Detection directly from the groundtruth rttm files, bypassing the Silero-VAD module |
| `skip_overlap` | *bool*, If True, skips windows with multiple speakers speaking, by inspecting the groundtruth rttm files |

Class Functions:

1. **`__getitem__`:** `def __getitem__(self, idx)`

Parameters:

| Argument | Detail |
| --- | --- |
| `idx` | *int*, Index into the list of audio files in the root directory |

Returns:

| Variable | Detail |
| --- | --- |
| `audio_segments` | *torch.Tensor*, (n_windows, features_len) Tensor of feature vectors for each audio segment window |
| `diarization_segments` | *torch.Tensor*, (n_windows, n_spks) Tensor containing the ground-truth speaker labels: 1 if the i-th window has the j-th speaker speaking, else 0 |
| `speech_segments` | *torch.Tensor*, (n_windows,) Tensor whose i-th value is 1 if VAD detects speech in the i-th window, else 0 |
| `label_path` | *str*, Path of the rttm file containing labels for the `idx`-th wav audio |
2. **`read_rttm`:** `def read_rttm(self, path)`

Parameters:

| Argument | Detail |
| --- | --- |
| `path` | *str*, Path to the RTTM diarization file |

Returns:

| Variable | Detail |
| --- | --- |
| `rttm_out` | *numpy.ndarray*, (..., 3) Array whose first column holds the start time of each speaker turn, second column the end time, and third column the speaker label |
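For orientation, here is a minimal usage sketch of the dataset class. The dataset name and the four-value unpacking follow the documentation above, but treat the exact call pattern as an assumption for illustration rather than verified behavior.

```python
# Hedged usage sketch: the dataset choice and unpacking are illustrative.
from utils import DiarizationDataset

dataset = DiarizationDataset(dataset_name='ami_dev', use_precomputed_vad=True)

# __getitem__ returns the segmented features and labels documented above.
audio_segments, diarization_segments, speech_segments, label_path = dataset[0]
print(audio_segments.shape)        # (n_windows, features_len)
print(diarization_segments.shape)  # (n_windows, n_spks)
```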

#### def make_rttm()

```python
def make_rttm(out_dir, name, labels, win_step)
```

Defined in `utils.py`.

Creates RTTM diarization files for the non-overlapping speaker labels in `labels`. Assumes non-speech parts have the value -1 and speech parts have a speaker label (0, 1, 2, ...).

Parameters:

| Argument | Detail |
| --- | --- |
| `out_dir` | *str*, Directory where the output RTTM diarization files are saved |
| `name` | *str*, Name of the audio file for which diarization was predicted |
| `labels` | *int*, Speaker/non-speech labels assigned to the audio segments, based on the `win_step` used to extract the feature vectors |
| `win_step` | *int*, Step (in ms) between two windows of audio segments used for feature extraction |

Returns:

| Variable | Detail |
| --- | --- |
| return variable | *str*, Path to the saved RTTM diarization file |
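A short example of the expected label format; the label array below is purely illustrative.

```python
# Hedged sketch: -1 marks non-speech windows; 0, 1, ... are speaker labels.
import numpy as np
from utils import make_rttm

labels = np.array([-1, -1, 0, 0, 0, 1, 1, -1, 0, 0])
rttm_path = make_rttm('rttm_output', 'demo_audio', labels, win_step=120)
```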

#### def get_metrics()

```python
def get_metrics(groundtruth_path, hypothesis_path, collar=0.25, skip_overlap=True)
```

Defined in `utils.py`.

Evaluates the diarization results by comparing all the predicted RTTM files in the hypothesis directory against the groundtruth RTTM files in the groundtruth directory.

Parameters:

| Argument | Detail |
| --- | --- |
| `groundtruth_path` | *str*, Directory of the groundtruth rttm files |
| `hypothesis_path` | *str*, Directory of the hypothesis rttm files |
| `collar` | *float*, Duration (in seconds) of the collars removed from evaluation around the boundaries of reference segments |
| `skip_overlap` | *bool*, If True, calculates the Diarization Error Rate ignoring the overlapped regions |

Returns:

| Variable | Detail |
| --- | --- |
| `metric` | *pyannote.metrics*, Pyannote metric object holding the diarization DERs for all the files |
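Since the return value is a pyannote metric object, the overall DER can typically be aggregated as below; `abs(metric)` is a standard pyannote.metrics accessor, assumed applicable to the object returned here.

```python
# Hedged sketch: directory paths are placeholders.
from utils import get_metrics

metric = get_metrics('groundtruth_rttms/', 'rttm_output/',
                     collar=0.25, skip_overlap=True)
print(abs(metric))  # aggregate DER over all evaluated files (pyannote convention)
```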

#### def plot_annot()

```python
def plot_annot(name="IS1009a", collar=0.25, skip_overlap=True, groundtruth_path=None, hypothesis_path=None)
```

Defined in `utils.py`.

Calculates the Diarization Error Rate for the specified file and plots the groundtruth and hypothesis time series.

Parameters:

| Argument | Detail |
| --- | --- |
| `name` | *str*, Name of the file whose time series plot is to be generated. The file must be present in the `hypothesis_path` folder |
| `collar` | *float*, Duration (in seconds) of the collars removed from evaluation around the boundaries of reference segments |
| `skip_overlap` | *bool*, If True, calculates the Diarization Error Rate ignoring the overlapped regions |
| `groundtruth_path` | *str*, Directory of the groundtruth rttm files |
| `hypothesis_path` | *str*, Directory of the hypothesis rttm files |


### Defined in baselineMethods.py

#### def diarizationOracleNumSpkrs()

```python
def diarizationOracleNumSpkrs(audio_dataset, method="KMeans")
```

Defined in `baselineMethods.py`.

Predicts the diarization labels using the oracle number of speakers for all the audio files in `audio_dataset`, with the KMeans/spectral clustering algorithm.

Parameters:

| Argument | Detail |
| --- | --- |
| `audio_dataset` | *utils.DiarizationDataset*, Diarization dataset |
| `method` | *str*, Name of the method used for the clustering step. Supports: "KMeans" or "Spectral" |

Returns:

| Variable | Detail |
| --- | --- |
| `hypothesis_dir` | *str*, Directory where all the predicted RTTM diarization files are saved |

#### def diarizationEigenGapNumSpkrs()

```python
def diarizationEigenGapNumSpkrs(audio_dataset)
```

Defined in `baselineMethods.py`.

Predicts the diarization labels for all the audio files in `audio_dataset` with the spectral clustering algorithm. It uses the eigen-gap principle to predict the optimal number of speakers. The module uses the spectral clustering implementation from https://github.com/wq2012/SpectralCluster.

Parameters:

| Argument | Detail |
| --- | --- |
| `audio_dataset` | *utils.DiarizationDataset*, Diarization dataset |

Returns:

| Variable | Detail |
| --- | --- |
| `hypothesis_dir` | *str*, Directory where all the predicted RTTM diarization files are saved |
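Both baselines share the same call pattern; a minimal sketch follows, with the dataset name assumed for illustration.

```python
# Hedged sketch: compare the two baseline speaker-count strategies.
from utils import DiarizationDataset
from baselineMethods import diarizationOracleNumSpkrs, diarizationEigenGapNumSpkrs

dataset = DiarizationDataset(dataset_name='ami_dev')

hyp_oracle = diarizationOracleNumSpkrs(dataset, method='Spectral')  # oracle count
hyp_eigengap = diarizationEigenGapNumSpkrs(dataset)                 # estimated count
```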

### Defined in optimumSpeaker.py

Inspired by https://github.com/wq2012/SpectralCluster

#### class eigengap()

```python
class eigengap(min_clusters=1,
               max_clusters=100,
               p_percentile=0.9,
               gaussian_blur_sigma=2,
               stop_eigenvalue=1e-2,
               thresholding_soft_multiplier=0.01,
               thresholding_with_row_max=True)
```

Defined in `optimumSpeaker.py`.

Utility class to decide the optimal number of speakers for clustering, based on maximization of the eigen-gap of the affinity matrix.

Parameters:

| Argument | Detail |
| --- | --- |
| `min_clusters` | *int*, Minimum number of output clusters |
| `max_clusters` | *int*, Maximum number of output clusters |
| `p_percentile` | *float*, Parameter for computing the p-th percentile in percentile-based thresholding |
| `gaussian_blur_sigma` | *float*, Standard deviation of the Gaussian kernel in scipy's Gaussian filter |
| `stop_eigenvalue` | *float*, Minimum eigenvalue of the affinity matrix for its eigenvector to be considered in clustering |
| `thresholding_soft_multiplier` | *float*, Factor multiplied to cells below the threshold in row-max/percentile thresholding. A value of 0.0 sets cells below the threshold to zero |
| `thresholding_with_row_max` | *bool*, True for row-max thresholding, False for percentile thresholding |

Class Functions:

1. **`_get_refinement_operator`:** `def _get_refinement_operator(self, name)`

Parameters:

| Argument | Detail |
| --- | --- |
| `name` | *str*, Name of the refinement operator to fetch. Available refinements: 'CropDiagonal', 'GaussianBlur', 'RowWiseThreshold', 'Symmetrize', 'Diffuse', 'RowWiseNormalize' |

Returns:

| Variable | Detail |
| --- | --- |
| `CropDiagonal()`/`GaussianBlur()`/`RowWiseThreshold()`/`Symmetrize()`/`Diffuse()`/`RowWiseNormalize()` | *optimumSpeaker.AffinityRefinementOperation*, The specified refinement operator class |
2. **`compute_affinity_matrix`:** `def compute_affinity_matrix(self, X)`

Computes the affinity matrix for a matrix X whose rows are instances and columns are features, by calculating the cosine similarity between pairs of L2-normalized rows of X.

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, (n_windows, n_features) Input matrix whose columns are features; the affinity matrix is computed between pairs of rows |

Returns:

| Variable | Detail |
| --- | --- |
| `affinity` | *numpy.ndarray*, (n_windows, n_windows) Symmetric array whose (i, j)-th value equals the cosine similarity between the i-th and j-th rows |
3. **`compute_sorted_eigenvectors`:** `def compute_sorted_eigenvectors(self, A)`

Parameters:

| Argument | Detail |
| --- | --- |
| `A` | *numpy.ndarray*, (n_windows, n_windows) Symmetric affinity array whose (i, j)-th value equals the cosine similarity between the i-th and j-th rows |

Returns:

| Variable | Detail |
| --- | --- |
| `w` | *numpy.ndarray*, Eigenvalues of the affinity matrix A, sorted in decreasing order |
| `v` | *numpy.ndarray*, Eigenvectors corresponding to the returned eigenvalues |
4. **`compute_number_of_clusters`:** `def compute_number_of_clusters(self, eigenvalues, max_clusters, stop_eigenvalue)`

Parameters:

| Argument | Detail |
| --- | --- |
| `eigenvalues` | *numpy.ndarray*, Eigenvalues of the affinity matrix between the different windows, sorted in decreasing order |
| `max_clusters` | *int*, Maximum number of clusters allowed. The default None puts no limit on the number of clusters |
| `stop_eigenvalue` | *float*, Minimum eigenvalue considered when deciding the number of clusters. Eigenvalues below this value are discarded |

Returns:

| Variable | Detail |
| --- | --- |
| `max_delta_index` | *int*, Index of the eigenvalue at which the eigen-gap is maximized; this is the number of clusters determined by the function |
5. **`find`:** `def find(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, (n_windows, n_features) Input matrix whose columns are features; the affinity matrix is computed between pairs of rows |

Returns:

| Variable | Detail |
| --- | --- |
| `k` | *int*, Number of clusters found after creating the affinity matrix, applying the refinements, and maximizing the eigen-gap. Satisfies `self.min_clusters <= k <= self.max_clusters` |
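To make the pipeline behind `find` concrete, here is a self-contained numpy sketch of the three core steps: affinity from L2-normalized rows, a sorted eigendecomposition, and an eigen-gap cluster count. It omits the refinement operators, uses the ratio form of the eigen-gap, and is an illustration under those assumptions, not the class's exact code.

```python
import numpy as np

def find_num_clusters(X, max_clusters=100, stop_eigenvalue=1e-2):
    # 1. Affinity: cosine similarity between L2-normalized rows of X.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    affinity = Xn @ Xn.T

    # 2. Eigenvalues of the symmetric affinity matrix, in decreasing order.
    w = np.linalg.eigh(affinity)[0][::-1]

    # 3. Keep eigenvalues above stop_eigenvalue and pick the largest eigen-gap
    #    (ratio form; assumes at least two eigenvalues survive the cutoff).
    w = w[:max_clusters]
    w = w[w > stop_eigenvalue]
    gaps = w[:-1] / w[1:]
    return int(np.argmax(gaps)) + 1
```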

#### class AffinityRefinementOperation()

```python
class AffinityRefinementOperation(metaclass=abc.ABCMeta)
```

Defined in `optimumSpeaker.py`.

Abstract base class for the refinement operations passed as input to be performed on the data.

Class Functions:

1. **`check_input`:** `def check_input(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array to be refined by the refinement operators |

Returns:

| Variable | Detail |
| --- | --- |
| `ValueError()`/`TypeError()` | *ValueError/TypeError*, TypeError if X is not a numpy array; ValueError if X is not a 2D square matrix |
2. **`refine`:** `def refine(self, X)`

Abstract function redefined in the various child classes of class AffinityRefinementOperation.

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array to be refined by the refinement operators |

#### class CropDiagonal()

```python
class CropDiagonal(AffinityRefinementOperation)
```

Defined in `optimumSpeaker.py`.

Operator that replaces each diagonal element with the maximum non-diagonal value in its row. After this operation, the matrix has properties similar to a standard Laplacian matrix, which also helps avoid bias during Gaussian blur and normalization.

Class Functions:

1. **`refine`:** `def refine(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array to be refined by the refinement operators |

Returns:

| Variable | Detail |
| --- | --- |
| `Y` | *numpy.ndarray*, Output array with the crop-diagonal refinement applied |
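A minimal numpy sketch of what this refinement does; it is illustrative, not the class's exact code, and assumes non-negative affinity values (as produced by the thresholded cosine affinities above).

```python
import numpy as np

def crop_diagonal(X):
    # Replace each diagonal element with the row's maximum off-diagonal value.
    Y = np.copy(X)
    np.fill_diagonal(Y, 0.0)              # zero the diagonal first
    np.fill_diagonal(Y, Y.max(axis=1))    # then write each row's off-diagonal max
    return Y
```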

#### class GaussianBlur()

```python
class GaussianBlur(AffinityRefinementOperation)
    def __init__(self, sigma=1)
```

Defined in `optimumSpeaker.py`.

Operator that applies a Gaussian filter to the input array. Uses `scipy.ndimage.gaussian_filter`.

Parameters:

| Argument | Detail |
| --- | --- |
| `sigma` | *float*, Standard deviation for the Gaussian kernel |

Class Functions:

1. **`refine`:** `def refine(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array to be refined by the refinement operators |

Returns:

| Variable | Detail |
| --- | --- |
| `Y` | *numpy.ndarray*, Output array with the Gaussian filter applied |

#### class RowWiseThreshold()

```python
class RowWiseThreshold(AffinityRefinementOperation)
    def __init__(self,
                 p_percentile=0.95,
                 thresholding_soft_multiplier=0.01,
                 thresholding_with_row_max=False)
```

Defined in `optimumSpeaker.py`.

Operator that applies row-wise thresholding, based on either a percentile or the row maximum.

Parameters:

| Argument | Detail |
| --- | --- |
| `p_percentile` | *float*, Percentile used for percentile-based thresholding |
| `thresholding_soft_multiplier` | *float*, Factor multiplied to cells below the threshold in row-max/percentile thresholding. A value of 0.0 sets cells below the threshold to zero |
| `thresholding_with_row_max` | *bool*, True applies row-max-based thresholding, False applies percentile-based thresholding |

Class Functions:

1. **`refine`:** `def refine(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array to be refined by the refinement operators |

Returns:

| Variable | Detail |
| --- | --- |
| `Y` | *numpy.ndarray*, Output array with the row-wise threshold applied |

#### class Symmetrize()

```python
class Symmetrize(AffinityRefinementOperation)
```

Defined in `optimumSpeaker.py`.

Operator that returns a symmetric matrix based on max(X, Xᵀ) from a given input matrix X.

Class Functions:

1. **`refine`:** `def refine(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array used to create a symmetric matrix |

Returns:

| Variable | Detail |
| --- | --- |
| `Y` | *numpy.ndarray*, Output symmetric array |

#### class Diffuse()

```python
class Diffuse(AffinityRefinementOperation)
```

Defined in `optimumSpeaker.py`.

Operator that returns the diffused symmetric matrix XᵀX from a given input matrix X.

Class Functions:

1. **`refine`:** `def refine(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array used to create a diffused symmetric matrix |

Returns:

| Variable | Detail |
| --- | --- |
| `Y` | *numpy.ndarray*, Output diffused symmetric array |

#### class RowWiseNormalize()

```python
class RowWiseNormalize(AffinityRefinementOperation)
```

Defined in `optimumSpeaker.py`.

Operator that normalizes each row of the input matrix X by the maximum value in the corresponding row.

Class Functions:

1. **`refine`:** `def refine(self, X)`

Parameters:

| Argument | Detail |
| --- | --- |
| `X` | *numpy.ndarray*, Input array to be row-normalized |

Returns:

| Variable | Detail |
| --- | --- |
| `Y` | *numpy.ndarray*, Output row-normalized array |

### Defined in DEC.py

#### class ResidualAutoEncoder()

```python
class ResidualAutoEncoder(ip_features,
                          hidden_dims=[500, 500, 2000, 30])
```

Defined in `DEC.py`.

Creates a torch.nn.Module for a deep autoencoder composed of Residual Neural Network (ResNet) blocks as the encoder and decoder layers. The activation used is ReLU. The bottleneck encoder output and the final decoder output are not activated, to avoid information loss caused by ReLU activation.

Parameters:

| Argument | Detail |
| --- | --- |
| `ip_features` | *int*, Input feature size |
| `hidden_dims` | *list of int*, List of hidden dimension sizes. The last element of the list is the output dimension of the autoencoder's bottleneck |

Returns:

| Variable | Detail |
| --- | --- |
| `z` | *torch.Tensor*, Output of the bottleneck encoder of the deep autoencoder network |
| `xo` | *list of torch.Tensor*, Outputs of each encoder layer except the bottleneck encoder. The first item of the list is the input given to the network |
| `xr` | *list of torch.Tensor*, Reconstructions of the inputs to each encoder layer of the autoencoder. xr is reversed so that the i-th item of xr is the reconstruction of the i-th item of xo; e.g. the first item of xo is the input to the ResidualAutoEncoder network, and the first item of xr is the reconstruction produced by the network |
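A shape-level usage sketch, assuming the forward pass returns the (z, xo, xr) triple documented above; the batch size and the 192-dimensional x-vector input follow the `load_encoder()` description below.

```python
# Hedged sketch: forward signature assumed from the Returns table above.
import torch
from DEC import ResidualAutoEncoder

model = ResidualAutoEncoder(ip_features=192, hidden_dims=[500, 500, 2000, 30])
x = torch.randn(64, 192)   # a batch of 64 x-vectors
z, xo, xr = model(x)
print(z.shape)             # (64, 30): bottleneck embeddings for clustering
```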

#### def load_encoder()

```python
def load_encoder()
```

Defined in `DEC.py`.

Loads weights from the ResidualAutoEncoder trained on the training data.

Returns:

| Variable | Detail |
| --- | --- |
| `model` | *ResidualAutoEncoder*, Model with an input feature size of 192 and hidden layers of size 500, 500, 2000, and 30, with weights initialized to those of the autoencoder trained on the training data |

#### class ClusteringModule()

```python
class ClusteringModule(nn.Module):
    def __init__(self,
                 num_clusters,
                 encoder, data,
                 cinit="KMeans")
```

Defined in `DEC.py`.

Clustering module of the Deep Embedding Clustering (DEC) algorithm. It uses the trained encoder of the ResidualAutoEncoder to initialize the DEC clustering network. KMeans is used to initialize the centroids in the latent space.

Parameters:

| Argument | Detail |
| --- | --- |
| `num_clusters` | *int*, Number of clusters to create |
| `encoder` | *nn.Module*, Pre-trained encoder for initializing the centroids. The encoder transforms the data into the latent space for clustering |
| `data` | *torch.Tensor*, Input data used to initialize the centroids |
| `cinit` | *str*, Initialization method for the cluster centroids. Default: KMeans |

Returns:

| Variable | Detail |
| --- | --- |
| `q` | *torch.Tensor*, Tensor of similarities between embedding points z_i and centroids mu_j, assuming a Student's t-distribution kernel |
| `p` | *torch.Tensor*, Tensor of the target distribution based on the soft assignments q_i |
| `xo[0]` | *torch.Tensor*, Input data to the ResidualAutoEncoder |
| `xr[0]` | *torch.Tensor*, Reconstruction of the input by the ResidualAutoEncoder |
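The q and p tensors follow the standard DEC formulation (Xie et al.). A sketch of how they can be computed, with alpha = 1 for the Student's t kernel; this is an illustration of the formulas, not the module's exact code.

```python
import torch

def soft_assignments(z, mu, alpha=1.0):
    # q_ij: Student's t similarity between embedding z_i and centroid mu_j,
    # normalized over clusters.
    dist_sq = torch.cdist(z, mu) ** 2                     # (n, k) squared distances
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # p_ij: sharpened target distribution built from the soft assignments.
    f = q.sum(dim=0, keepdim=True)                        # soft cluster frequencies f_j
    p = q ** 2 / f
    return p / p.sum(dim=1, keepdim=True)
```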

Class Functions:

1. **`init_centroid`:** `def init_centroid(self, data, method="KMeans")`

Calculates the optimal number of speakers with the eigen-gap method, clusters the data with the specified method, and returns the resulting centroids.

Parameters:

| Argument | Detail |
| --- | --- |
| `data` | *torch.Tensor*, Input data to be clustered |
| `method` | *str*, Clustering method. Default: KMeans. Options: KMeans/Spectral |

Returns:

| Variable | Detail |
| --- | --- |
| `output` | *torch.Tensor*, Tensor containing the initialized centroids for the dataset |

#### class DEC()

```python
class DEC(num_clusters,
          encoder, data,
          cinit="KMeans")
```

Defined in `DEC.py`.

Deep Embedding Clustering (DEC) algorithm. It uses the trained encoder of the ResidualAutoEncoder to initialize the DEC clustering network, and calls the ClusteringModule class to initialize the centroids.

Parameters:

| Argument | Detail |
| --- | --- |
| `encoder` | *nn.Module*, Pre-trained encoder for initializing the centroids. The encoder transforms the data into the latent space for clustering |
| `num_clusters` | *int*, Number of clusters to create. Default None uses the eigen-gap method to determine the number of clusters |
| `cinit` | *str*, Initialization method for the cluster centroids. Default: KMeans. Options: KMeans/Spectral |

Class Functions:

1. **`fit`:** `def fit(self, data, y_true=None, niter=150, lrEnc=1e-4, lrCC=1e-4, verbose=False)`

Trains the model by minimizing the KL divergence between the target and observed distributions. In parallel, it updates the ResidualAutoEncoder using MSE loss to improve the latent-space projection of the data for better clustering. Both updates use the Adam optimizer, and the objective function is a linear combination of the KL divergence between the target and observed distributions and the MSE loss between the input data and its reconstruction by the ResidualAutoEncoder.

Parameters:

| Argument | Detail |
| --- | --- |
| `data` | *torch.Tensor*, Input data to be clustered |
| `y_true` | *numpy.ndarray*, True labels of the data to cluster. The predict() and clusterAccuracy() functions are invoked only if y_true is not None |
| `niter` | *int*, Number of epochs to train the model for |
| `lrEnc` | *float*, Learning rate for updating the encoder |
| `lrCC` | *float*, Learning rate for updating the cluster centres |
| `verbose` | *bool*, If True, shows a tqdm progress bar during training; if False, training prints no updates |
2. **`predict`:** `def predict(self, data)`

Predicts the cluster label for each data point by choosing the label at which the observed distribution is maximized.

Parameters:

| Argument | Detail |
| --- | --- |
| `data` | *torch.Tensor*, Input data to be labeled after clustering |

Returns:

| Variable | Detail |
| --- | --- |
| `y_pred` | *numpy.ndarray*, Predicted cluster labels of the data |
3. **`clusterAccuracy`:** `def clusterAccuracy(self, y_pred, y_true)`

Computes the cluster assignment accuracy as the maximum accuracy between y_pred and y_true over all permutations of the y_pred labels. The optimal permutation is found with scipy's linear_sum_assignment optimization function.

Parameters:

| Argument | Detail |
| --- | --- |
| `y_pred` | *numpy.ndarray*, Labels predicted by the DEC algorithm |
| `y_true` | *numpy.ndarray*, True labels of the data |

Returns:

| Variable | Detail |
| --- | --- |
| `accuracy` | *float*, Cluster assignment accuracy |
| `reassignment` | *dict*, Dictionary with rows as keys and columns as values for the optimal assignment |
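A self-contained sketch of this accuracy computation via scipy's `linear_sum_assignment`; the function name and structure are illustrative, not the class's exact code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_pred, y_true):
    # Contingency matrix: counts of (predicted label, true label) pairs.
    k = int(max(y_pred.max(), y_true.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        cost[p, t] += 1
    # Hungarian assignment maximizing matched counts (negate for minimization).
    rows, cols = linear_sum_assignment(-cost)
    accuracy = cost[rows, cols].sum() / y_pred.size
    reassignment = dict(zip(rows, cols))  # predicted label -> true label
    return accuracy, reassignment
```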

#### def diarizationDEC()

```python
def diarizationDEC(audio_dataset,
                   num_spkr=None,
                   hypothesis_dir=None)
```

Defined in `DEC.py`.

Computes diarization labels. With num_spkr = 'oracle' it uses the oracle number of speakers, serving as an optimal benchmark for DEC performance. With num_spkr = None, the number of speakers is determined by eigen-gap maximization in the ClusteringModule.

Parameters:

| Argument | Detail |
| --- | --- |
| `audio_dataset` | *utils.DiarizationDataset*, Test diarization dataset |
| `num_spkr` | *str*, None to compute the optimal number of speakers by eigen-gap maximization; 'oracle' to use the number of speakers given with the data for each window |
| `hypothesis_dir` | *str*, Directory in which to store the predicted speaker labels of the audio segments as rttm files. None stores them in the ./rttm_output/ directory |

Returns:

| Variable | Detail |
| --- | --- |
| `hypothesis_dir` | *str*, Directory of the rttm files containing the predicted speaker labels with their timestamps |
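Putting the pieces together, a hedged end-to-end sketch; the dataset name and groundtruth directory are placeholders.

```python
from utils import DiarizationDataset, get_metrics
from DEC import diarizationDEC

dataset = DiarizationDataset(dataset_name='ami_dev')

# Eigen-gap estimate of the speaker count (num_spkr=None) ...
hyp_dir = diarizationDEC(dataset)
# ... or the oracle benchmark.
hyp_dir_oracle = diarizationDEC(dataset, num_spkr='oracle')

metric = get_metrics('groundtruth_rttms/', hyp_dir)
```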

### Defined in colab_demo_utils.py

#### def downloadYouTube()

```python
def downloadYouTube(videourl, path)
```

Defined in `colab_demo_utils.py`.

Downloads a video from YouTube in .mp4 format using the video URL.

Parameters:

| Argument | Detail |
| --- | --- |
| `videourl` | *str*, URL of the YouTube video to download |
| `path` | *str*, Directory in which to save the YouTube video. If the directory does not exist, it is created |

Returns:

| Variable | Detail |
| --- | --- |
| `save_dir` | *str*, Save directory location |

#### def loadVideoFile()

```python
def loadVideoFile(playvideo_file=False)
```

Defined in `colab_demo_utils.py`.

Loads a video file, either from YouTube or from your local directory, into the current session's working directory. Also extracts its audio and stores it in .wav format.

Parameters:

| Argument | Detail |
| --- | --- |
| `playvideo_file` | *bool*, If True, plays the video after loading it into the working directory. Default: False |

Returns:

| Variable | Detail |
| --- | --- |
| `video_dir` | *str*, Path to the saved video |

#### def read_rttm()

```python
def read_rttm(path)
```

Defined in `colab_demo_utils.py`.

Creates hypothesis labels for each window using the .rttm file.

Parameters:

| Argument | Detail |
| --- | --- |
| `path` | *str*, Path to the rttm file |

Returns:

| Variable | Detail |
| --- | --- |
| `hypothesis_labels` | *numpy.ndarray*, (n_instances, 3) The i-th row's first, second, and third columns contain the start time, end time, and speaker id of the i-th speech instance |

#### def combine_audio()

```python
def combine_audio(vidname, audname, outname, fps)
```

Defined in `colab_demo_utils.py`.

Combines the cv2-processed silent video with its audio file to produce the complete annotated video.

Parameters:

| Argument | Detail |
| --- | --- |
| `vidname` | *str*, Path to the silent video |
| `audname` | *str*, Path to the audio file to be attached |
| `outname` | *str*, Output video file name |
| `fps` | *int*, Frame rate of the video |

#### def createAnnotatedVideo()

```python
def createAnnotatedVideo(audio_dataset, hypothesis_dir)
```

Defined in `colab_demo_utils.py`.

Uses cv2 to draw annotations on the video using the hypothesis labels.

Parameters:

| Argument | Detail |
| --- | --- |
| `audio_dataset` | *utils.DiarizationDataset*, Dataset pipeline |
| `hypothesis_dir` | *str*, Path to the directory with the hypothesis-label rttm files |

Returns:

| Variable | Detail |
| --- | --- |
| `op_video_name` | *str*, Annotated output video filename |
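The colab utilities compose into a short demo flow. In the sketch below, the audio directory, audio filename, and frame rate passed between steps are assumptions for illustration; only the function names and argument orders come from the documentation above.

```python
from colab_demo_utils import loadVideoFile, createAnnotatedVideo, combine_audio
from utils import DiarizationDataset
from DEC import diarizationDEC

video_dir = loadVideoFile()                          # fetch video, extract .wav audio
dataset = DiarizationDataset(data_dir='demo_wav/')   # hypothetical audio directory
hyp_dir = diarizationDEC(dataset)                    # predicted labels as rttm files
op_video = createAnnotatedVideo(dataset, hyp_dir)    # cv2-annotated silent video
combine_audio(op_video, 'demo_wav/audio.wav', 'annotated_demo.mp4', fps=25)
```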