This repository contains the PyTorch implementation of our paper *AcFormer: An Aligned and Compact Transformer for Multimodal Sentiment Analysis*. The goal of this work is an efficient and effective model for sentiment analysis on multimodal data.

AcFormer is built on the Transformer architecture, with modifications that align and compactly process multimodal inputs for sentiment analysis. The model incorporates the visual, acoustic, and textual modalities and leverages their interactions to improve sentiment prediction accuracy. Details of the architecture can be found in our paper.
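To make the idea concrete, below is a minimal, hypothetical sketch of a three-modality fusion Transformer in PyTorch. It illustrates the general pattern only and is not the AcFormer implementation; the class name, feature dimensions (e.g., 74-dim acoustic features), and mean-pooling readout are all assumptions.

```python
# Hypothetical sketch of a generic multimodal fusion Transformer.
# NOT the actual AcFormer model; all names and dimensions are illustrative.
import torch
import torch.nn as nn

class MultimodalFusionSketch(nn.Module):
    def __init__(self, d_text=768, d_visual=512, d_acoustic=74,
                 d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        self.proj_a = nn.Linear(d_acoustic, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # continuous sentiment score

    def forward(self, text, visual, acoustic):
        # Inputs are (batch, seq, feat). Concatenating modality tokens lets
        # self-attention model cross-modal interactions directly.
        tokens = torch.cat([self.proj_t(text),
                            self.proj_v(visual),
                            self.proj_a(acoustic)], dim=1)
        tokens = tokens.transpose(0, 1)      # (seq, batch, d_model)
        fused = self.fusion(tokens)
        return self.head(fused.mean(dim=0))  # mean-pool over tokens

model = MultimodalFusionSketch()
out = model(torch.randn(2, 50, 768), torch.randn(2, 50, 512),
            torch.randn(2, 50, 74))
print(out.shape)  # torch.Size([2, 1])
```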
- Python 3.8
- PyTorch 1.6.0
- CUDA 10.1
- Install all required libraries:

```sh
pip install -r requirements.txt
```
- Download the pre-trained word vectors and pre-trained models (a hedged loading sketch follows this list):
- GloVe word vectors (glove.840B.300d.zip)
- BERT Model (bert-base-uncased)
- ViT B/16 (vit-base-patch16-224-in21k)
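A hedged example of loading these assets, assuming the Hugging Face `transformers` package is available (check requirements.txt; the repo's own loading code may differ). The GloVe parser is likewise my own sketch:

```python
# Assumption: BERT and ViT are loaded via Hugging Face `transformers`.
import numpy as np
from transformers import BertModel, BertTokenizer, ViTModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def load_glove(path, dim=300):
    """Parse glove.840B.300d.txt into a word -> 300-d vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])  # some 840B tokens contain spaces
            vectors[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vectors
```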
To reproduce our experiments and evaluate the performance of AcFormer, we recommend using our processed datasets. The data files (CMU_MOSI, CMU_MOSEI, UR-FUNNY, IEMOCAP, MUStARD, MELD) can be downloaded from data. All datasets are saved under the ./data/ folder. The raw datasets are organized as follows:
```
├── iemocap
│   ├── segmented_audio
│   │   └── Ses01F_impro01_F000.wav
│   ├── segmented_text
│   │   └── text.json
│   └── segmented_video
│       └── Ses01F_impro01_F000.mp4
├── meld
│   ├── segmented_audio
│   │   └── dia0_utt0.wav
│   ├── segmented_text
│   │   └── text.csv
│   └── segmented_video
│       └── dia0_utt0.mp4
├── mosei
│   ├── segmented_audio
│   │   └── _7HVhnSYX1Y_0.mp4.wav
│   ├── segmented_text
│   │   └── text.json
│   └── segmented_video
│       └── _7HVhnSYX1Y_0.mp4
├── mosi
│   ├── segmented_audio
│   │   └── _dI--eQ6qVU_1.wav
│   ├── segmented_text
│   │   └── text.json
│   └── segmented_video
│       └── _dI--eQ6qVU_1.mp4
├── mustard
│   ├── segmented_audio
│   │   └── 1_60.wav
│   ├── segmented_text
│   │   └── text.json
│   └── segmented_video
│       └── 1_60.mp4
```
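After downloading, the layout can be verified with a short script; this is my own convenience sketch, not part of the repo:

```python
# Check that each raw dataset folder has the expected subdirectories.
import os

DATASETS = ["iemocap", "meld", "mosei", "mosi", "mustard"]
SUBDIRS = ["segmented_audio", "segmented_text", "segmented_video"]

for name in DATASETS:
    for sub in SUBDIRS:
        path = os.path.join("data", name, sub)
        print(path, "ok" if os.path.isdir(path) else "MISSING")
```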
Note that for the ur_funny dataset, we cannot obtain the timestamps that align the video/audio clips with the original texts, so we can only use pre-extracted word-aligned features.
For the traditional word-aligned features, we recommend the following downloads:
- Install the CMU Multimodal SDK and make sure `from mmsdk import mmdatasdk` works.
- Download the extracted features for MOSI and MOSEI and put the contents in the ./data folder (a sanity-check sketch follows).
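A hedged version of that sanity check, assuming the downloaded features are CMU-SDK .csd files placed in a local folder (the folder name below is illustrative):

```python
# Load a folder of downloaded .csd files and list its computational sequences.
# The path ./data/cmumosi/ is an assumption; point it at your download.
from mmsdk import mmdatasdk

dataset = mmdatasdk.mmdataset("./data/cmumosi/")
print(list(dataset.computational_sequences.keys()))
```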
- Install the CMU Multimodal SDK and make sure `from mmsdk import mmdatasdk` works.
- Download the extracted features for UR-FUNNY and put the contents in the ./data folder. There are six pickle files in the extracted features folder (a loading sketch follows this list):
- data_folds
- langauge_sdk
- openface_features_sdk
- covarep_features_sdk
- humor_label_sdk
- word_embedding_list
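These can be inspected as below; the .pkl extension and paths are assumptions based only on the filenames above:

```python
# Assumption: the six files are standard pickles with a .pkl extension.
import pickle

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

data_folds = load_pickle("./data/data_folds.pkl")
language = load_pickle("./data/langauge_sdk.pkl")  # filename as distributed
labels = load_pickle("./data/humor_label_sdk.pkl")
print(type(data_folds), type(language), type(labels))
```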
- Install the CMU Multimodal SDK and make sure `from mmsdk import mmdatasdk` works.
- Download the extracted features for IEMOCAP and put the contents in the ./data folder.
- You can use the following script:

```sh
wget https://www.dropbox.com/sh/hyzpgx1hp9nj37s/AADfY2s7gD_MkR76m03KS0K1a/Archive.zip?dl=1
mv 'Archive.zip?dl=1' Archive.zip
unzip Archive.zip
```
- We employ a two-stage training procedure for AcFormer. In the first stage, the model is pre-trained on MELD using self-supervised learning objectives to learn meaningful representations (a hedged sketch of one such objective follows this list).
- Ensure that you have the required hardware resources (e.g., GPU) and software dependencies installed.
- Clone the code repository from the GitHub link (AcFormer) and change to the project directory:

```sh
cd acformer/
```
- Run the pre-training stage by executing:

```sh
sh ./scripts/pretrain.sh {gpus} ./configs/pivotal_pretrain.yaml
```
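This README does not spell out the pre-training objectives. As one illustration of a self-supervised alignment objective, a symmetric contrastive (InfoNCE) loss between paired modality embeddings is sketched below; this is an assumption for exposition, not AcFormer's actual loss:

```python
# Hedged sketch of a symmetric contrastive (InfoNCE) alignment loss.
# Illustration only; not necessarily the objective used by this repo.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (batch, dim) paired embeddings from two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))    # the i-th pair is the positive
    # Symmetric: match a -> b and b -> a.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```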
- In the second stage, the pre-trained model is fine-tuned with supervised learning on labeled sentiment data (a fine-tuning sketch follows the commands below):
```sh
# MOSI
sh ./scripts/finetune.sh {gpus} ./configs/pivotal_finetune_mosi.yml
# MOSEI
sh ./scripts/finetune.sh {gpus} ./configs/pivotal_finetune_mosei.yml
# IEMOCAP
sh ./scripts/finetune.sh {gpus} ./configs/pivotal_finetune_iemocap.yml
# MUStARD
sh ./scripts/finetune.sh {gpus} ./configs/pivotal_finetune_mustard.yml
```
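As an illustration of this stage, here is a hedged single-step sketch that reuses the hypothetical MultimodalFusionSketch defined near the top of this README. MOSI/MOSEI sentiment is a continuous score in [-3, 3], so an L1 (mean absolute error) loss is a typical choice; the repo's actual trainer may differ:

```python
# Hedged fine-tuning step; random tensors stand in for a real dataloader batch.
import torch
import torch.nn as nn

model = MultimodalFusionSketch()        # hypothetical model from the sketch above
criterion = nn.L1Loss()                 # MAE against continuous sentiment scores
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

text = torch.randn(8, 50, 768)          # (batch, seq, feat) dummy inputs
visual = torch.randn(8, 50, 512)
acoustic = torch.randn(8, 50, 74)
labels = torch.empty(8).uniform_(-3, 3) # MOSI labels lie in [-3, 3]

pred = model(text, visual, acoustic).squeeze(-1)
loss = criterion(pred, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```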
- For the word-aligned setting, we train the model directly by executing:
```sh
# MOSI
sh ./scripts/train_mosi.sh {gpus} ./configs/pivotal_train_word_aligned_mosi.yml
# MOSEI
sh ./scripts/train_mosei.sh {gpus} ./configs/pivotal_train_word_aligned_mosei.yml
# UR-FUNNY
sh ./scripts/train_ur_funny.sh {gpus} ./configs/pivotal_train_word_aligned_ur_funny.yml
# IEMOCAP
sh ./scripts/train_iemocap.sh {gpus} ./configs/pivotal_train_word_aligned_iemocap.yml
```