Paper | Demo Clips | Dataset | Model | Training | Offline Demo | Examples
This repository contains:
- SoccerNet-XFoul, a novel dataset consisting of more than 22k video-question-answer triplets annotated by over 70 experienced football referees. 🚀
- X-VARS, a new vision-language model that can perform multiple multi-modal tasks, such as visual captioning, question-answering, and video action recognition, and can generate explanations of its decisions on par with human explanations. 🤖
- The code to run an offline demo on your laptop. 💻
The SoccerNet-XFoul dataset is now available! 🔥🔥
We recommend setting up a conda environment for the project:
```shell
conda create --name=xvars python=3.10
conda activate xvars
git clone https://github.com/heldJan/X-VARS.git
cd X-VARS
pip install -r requirements.txt
```
The SoccerNet-XFoul dataset consists of 22k video-question-answer triplets annotated by more than 70 experienced referees.
Due to the subjectivity in refereeing, we gathered multiple answers for the same action rather than collecting a single decision and explanation for each question. As a result, each action comes with several answers.
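A single dataset entry can be pictured as follows. The field names and answer texts are purely illustrative and may differ from the released annotation format:

```python
# Illustrative structure of one video-question-answer triplet.
# Field names and answers are hypothetical, not the actual annotation schema.
sample = {
    "video": "path/to/action_clip.mp4",
    "question": "Was this contact a foul, and what sanction applies?",
    # Multiple referees answer the same question, reflecting the
    # subjectivity of refereeing decisions.
    "answers": [
        "Foul, no card: the defender plays the opponent, not the ball.",
        "No foul: the contact is careless but within a fair challenge.",
    ],
}
print(len(sample["answers"]))  # 2
```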
Follow the link to install the SoccerNet pip package.
If you want to download the video clips, you will need to fill in an NDA to get the password.
Then use the API to download the data:
```python
from SoccerNet.Downloader import SoccerNetDownloader as SNdl

mySNdl = SNdl(LocalDirectory="path/to/SoccerNet")
mySNdl.downloadDataTask(task="mvfouls", split=["train","valid","test","challenge"], password="enter password")
```
To obtain the data in 720p, add `version="720p"` to the input arguments. If you face issues extracting data from `train_720p.zip`, the error may come from the default unzip extractor; using the app "The Unarchiver" should enable you to unzip it successfully.
The annotations can be downloaded from here 🔥
X-VARS is a vision-language model built on a fine-tuned CLIP visual encoder, which extracts spatio-temporal video features and produces multi-task predictions regarding the type and severity of fouls. A linear layer connects the vision encoder to the language model by projecting the video features into the text embedding space. We feed the projected spatio-temporal features, alongside the text predictions from the two classification heads (one for the type of foul, one for whether it is a foul and its severity), into the Vicuna-v1.1 model, initialized with weights from LLaVA.
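The projection step described above amounts to a single learned linear map from the visual feature space to the LLM's embedding space. A minimal sketch, where the frame count and both dimensions are illustrative assumptions rather than the model's exact sizes:

```python
import numpy as np

# Hypothetical sizes for illustration: number of video tokens,
# CLIP feature dimension, and LLM embedding dimension.
NUM_TOKENS, CLIP_DIM, LLM_DIM = 16, 768, 4096

rng = np.random.default_rng(0)
video_features = rng.standard_normal((NUM_TOKENS, CLIP_DIM))  # spatio-temporal CLIP features

# W and b stand in for the learned weights of the projection layer.
W = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

# Project video features into the text embedding space, so they can be
# concatenated with text tokens before entering the language model.
projected = video_features @ W + b
print(projected.shape)  # (16, 4096)
```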
We propose a two-stage training approach. The first stage fine-tunes CLIP on a multi-task classification objective to learn prior knowledge about football and refereeing. The second stage consists of fine-tuning the projection layer and several layers of the LLM to enhance the model's generation abilities in the sport-specific domain.
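The two stages above can be summarized as a schedule of which components are trainable at each stage. The module names below are hypothetical labels for illustration, not the repository's actual identifiers:

```python
# Hypothetical two-stage schedule; module names are illustrative only.
TRAINING_STAGES = [
    {
        "stage": 1,
        "goal": "multi-task classification (foul type + severity)",
        "trainable": ["clip_visual_encoder", "classification_heads"],
        "frozen": ["projection_layer", "llm"],
    },
    {
        "stage": 2,
        "goal": "sport-specific text generation",
        "trainable": ["projection_layer", "llm_top_layers"],
        "frozen": ["clip_visual_encoder", "classification_heads"],
    },
]

for stage in TRAINING_STAGES:
    print(f"Stage {stage['stage']}: train {stage['trainable']}")
```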
To replicate the training, check out Training.
More information is provided in our paper.
- VARS: The first multi-task classification model for predicting whether an action is a foul and its corresponding severity.
- Video-ChatGPT: A vision-and-language model used as the foundation model for X-VARS.
If you're using X-VARS in your research or application, please cite our paper:
```bibtex
@article{Held2024XVARS-arxiv,
  title = {X-{VARS}: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model},
  author = {Held, Jan and Itani, Hani and Cioppa, Anthony and Giancola, Silvio and Ghanem, Bernard and Van Droogenbroeck, Marc},
  journal = {arXiv},
  volume = {abs/2404.06332},
  year = {2024},
  publisher = {arXiv},
  eprint = {2404.06332},
  eprinttype = {arXiv},
  doi = {10.48550/arXiv.2404.06332},
  url = {https://doi.org/10.48550/arXiv.2404.06332}
}
```