forked from heldJan/X-VARS

X-VARS is a multi-modal large language model designed for understanding football videos from the point of view of a referee. X-VARS can perform a multitude of tasks, including video description, question answering, action recognition, and conducting meaningful conversations based on video content.


dariogod/X-VARS

 
 


⚽️Explainable Video Assistant Referee System⚽️


This repository contains:

  • SoccerNet-XFoul, a novel dataset consisting of more than 22k video-question-answer triplets annotated by over 70 experienced football referees. 🚀
  • X-VARS, a new vision-language model that can perform multiple multi-modal tasks, such as visual captioning, question answering, and video action recognition, and can generate explanations of its decisions on par with human level. 🤖
  • The code to run an offline demo on your laptop. 💻

📢 NEWS 📢

The SoccerNet-XFoul dataset is now available! 🔥🔥

Installation

We recommend setting up a conda environment for the project:

conda create --name=xvars python=3.10
conda activate xvars

git clone https://github.com/heldJan/X-VARS.git
cd X-VARS
pip install -r requirements.txt

SoccerNet-XFoul

The SoccerNet-XFoul dataset consists of more than 22k video-question-answer triplets annotated by more than 70 experienced referees. Because refereeing is subjective, we gathered multiple answers for the same action rather than collecting a single decision and explanation for each question. On average, each action has $1.5$ answers for the same question.
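To illustrate how several referee answers to the same question could be grouped and counted, here is a minimal sketch. The record layout (`action_id`, `question`, `answer`) and the helper names are hypothetical and do not reflect the actual SoccerNet-XFoul annotation schema.

```python
from collections import defaultdict

# Hypothetical records; the real annotation files may use a different schema.
records = [
    {"action_id": 0, "question": "Is it a foul?", "answer": "Yes, a careless tackle."},
    {"action_id": 0, "question": "Is it a foul?", "answer": "Yes, the defender is late."},
    {"action_id": 1, "question": "Is it a foul?", "answer": "No, he plays the ball."},
    {"action_id": 1, "question": "What card?", "answer": "No card."},
]


def group_answers(records):
    """Collect every answer given to the same (action, question) pair."""
    grouped = defaultdict(list)
    for record in records:
        key = (record["action_id"], record["question"])
        grouped[key].append(record["answer"])
    return dict(grouped)


def mean_answers_per_question(grouped):
    """Average number of answers over all (action, question) pairs."""
    return sum(len(answers) for answers in grouped.values()) / len(grouped)


grouped = group_answers(records)
print(mean_answers_per_question(grouped))
```

Keeping all answers under one key, rather than picking a single one, preserves the disagreement between referees that the dataset is designed to capture.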

Follow the link to easily download the SoccerNet pip package.

If you want to download the video clips, you will need to sign an NDA to get the password.

Then use the API to download the data:

from SoccerNet.Downloader import SoccerNetDownloader as SNdl
mySNdl = SNdl(LocalDirectory="path/to/SoccerNet")
mySNdl.downloadDataTask(task="mvfouls", split=["train","valid","test","challenge"], password="enter password")

To obtain the data in 720p, add version="720p" to the input arguments. If you have trouble extracting train_720p.zip, the default unzip extractor may be the cause; the app "The Unarchiver" should extract it successfully.
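As a sketch of how the optional `version` argument could be passed, the helper below builds the keyword arguments for `downloadDataTask` and adds `version` only when it is requested. The function names `build_download_kwargs` and `download_mvfouls` are hypothetical wrappers introduced here for illustration; only the `SoccerNetDownloader` call itself comes from the snippet above.

```python
def build_download_kwargs(password, version=None,
                          splits=("train", "valid", "test", "challenge")):
    """Assemble the arguments for downloadDataTask (hypothetical helper)."""
    kwargs = {"task": "mvfouls", "split": list(splits), "password": password}
    if version is not None:
        # For example version="720p" for the higher-resolution clips.
        kwargs["version"] = version
    return kwargs


def download_mvfouls(local_directory, password, version=None):
    """Download the clips; requires the SoccerNet pip package and the NDA password."""
    # Imported lazily so the helper above can be used without the package installed.
    from SoccerNet.Downloader import SoccerNetDownloader

    downloader = SoccerNetDownloader(LocalDirectory=local_directory)
    downloader.downloadDataTask(**build_download_kwargs(password, version))
```

A call such as `download_mvfouls("path/to/SoccerNet", "enter password", version="720p")` would then request the 720p data.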

The annotations can be downloaded from here 🔥


X-VARS

X-VARS is a vision-language model that uses a fine-tuned CLIP visual encoder to extract spatio-temporal video features and to obtain multi-task predictions regarding the type and severity of fouls. A linear layer connects the vision encoder to the language model by projecting the video features into the text embedding dimension. We input the projected spatio-temporal features, alongside the text predictions obtained by the two classification heads (one for determining the type of foul, the other for determining whether it is a foul and the corresponding severity), into the Vicuna-v1.1 model, initialized with weights from LLaVA.
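The step of turning the two classification heads' outputs into text for the language model could look roughly like the sketch below. The label lists, the sentence template, and the `<video>` placeholder token are assumptions made for illustration; the exact class names and prompt format used by X-VARS may differ.

```python
# Illustrative label sets; the exact class names used by X-VARS are an assumption.
FOUL_TYPES = ["Tackling", "Standing tackling", "High leg", "Holding",
              "Pushing", "Elbowing", "Challenge", "Dive"]
SEVERITIES = ["No offence", "Offence + No card",
              "Offence + Yellow card", "Offence + Red card"]


def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def predictions_to_text(type_scores, severity_scores):
    """Render the top prediction of each head as a short text hint."""
    foul_type = FOUL_TYPES[argmax(type_scores)]
    severity = SEVERITIES[argmax(severity_scores)]
    return f"Predicted action type: {foul_type}. Predicted decision: {severity}."


def build_prompt(question, type_scores, severity_scores, video_token="<video>"):
    """Combine a video placeholder, the head predictions, and the user question."""
    hint = predictions_to_text(type_scores, severity_scores)
    return f"{video_token}\n{hint}\n{question}"


type_scores = [0.10, 0.70, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02]
severity_scores = [0.1, 0.2, 0.6, 0.1]
print(build_prompt("Is this a foul, and why?", type_scores, severity_scores))
```

In the real model the video placeholder would stand for the projected spatio-temporal features rather than a literal string.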

Training

We propose a two-stage training approach. The first stage fine-tunes CLIP on a multi-task classification task to learn prior knowledge about football and refereeing. The second stage fine-tunes the projection layer and several layers of the LLM to enhance the model's generation abilities in the sport-specific domain.
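One way to picture the two stages is as a rule deciding which parameter groups receive gradient updates in each stage. The sketch below is only a reading of the description above: the parameter-name prefixes (`clip.`, `heads.`, `projection.`, `llm.layers.N.`) are hypothetical, training the classification heads in stage one is an assumption, and which LLM layers are tuned is left as a caller-supplied placeholder because the text does not specify them.

```python
# Hypothetical parameter-name prefixes; actual module names in the code base may differ.
STAGE_PREFIXES = {
    1: ("clip.", "heads."),   # stage 1: visual encoder plus classification heads (assumed)
    2: ("projection.",),      # stage 2: projection layer
}


def is_trainable(param_name, stage, llm_layers_to_tune=()):
    """Return True if a parameter would be updated in the given stage."""
    if param_name.startswith(STAGE_PREFIXES[stage]):
        return True
    if stage == 2:
        # Only the selected LLM layers are tuned; the indices are placeholders.
        return any(param_name.startswith(f"llm.layers.{i}.")
                   for i in llm_layers_to_tune)
    return False


names = ["clip.block0.weight", "projection.weight", "llm.layers.31.mlp.weight"]
print([n for n in names if is_trainable(n, 2, llm_layers_to_tune=(30, 31))])
```

With a deep learning framework, the same predicate could be used to set each parameter's gradient flag before building the optimizer.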

To replicate the training, check out Training

More information is provided in our paper.


Acknowledgements

  • VARS: The first multi-task classification model for predicting whether an action is a foul and the corresponding severity.
  • Video-ChatGPT: A vision and language model used as a foundation model for X-VARS.

If you're using X-VARS in your research or application, please cite our paper:

    @article{Held2024XVARS-arxiv,
        title = {X-{VARS}: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model},
        author = {Held, Jan and Itani, Hani and Cioppa, Anthony and Giancola, Silvio and Ghanem, Bernard and Van Droogenbroeck, Marc},
        journal = {arXiv},
        volume = {abs/2404.06332},
        year = {2024},
        publisher = {arXiv},
        eprint = {2404.06332},
        keywords = {},
        eprinttype = {arXiv},
        doi = {10.48550/arXiv.2404.06332},
        url = {https://doi.org/10.48550/arXiv.2404.06332}
    }
