These lab sessions are designed to help you follow along with the content presented during the lectures, and to introduce you to the skills and tools needed to complete the final projects.
The lab sessions will be a mix of tutorials and exercises. The tutorials will present modern frameworks and tools to implement advanced NLP analyses and pipelines. The exercises are designed to teach you the skills needed for the final projects. Here is a brief overview of the schedule:
Week | Lecture topic | Lab Tutorial | Lab Exercise |
---|---|---|---|
1 | Introduction to NLP | Intro, work environment setup and team creation; start of Intro to 🤗 Transformers | - |
2 | The Evolution of Language Modeling | Intro to 🤗 Transformers and Datasets | 🤗 Pipelines & Sentence Transformers for semantic search and QA |
3 | Looking for Words | Introduction to spaCy | Training a BPE tokenizer and a lexicon-based transduction model |
4 | Labeling Sequences | Text tagging with spaCy and 🤗 Transformers | Combining Textual and Non-textual Features in NLP Models |
5 | Trees of Words | Dependency parsing with spaCy | - |
6 | Encode and Decode | Optional: Fine-tuning with 🤗 Transformers and Adapters | - |
7 | Transfer Learning & Opening the Blackbox | - | - |
Some notes:
- The core contents are covered in the first few weeks of the course to kickstart your work. Exercise sessions are dropped from week 5 onwards to allow you to focus on the final project.
- Participation in the lab sessions is highly encouraged, as they offer you a chance to ask questions related to the midterm portfolio and/or the final projects.
- The tutorial session for week 6 can be relevant to many projects and will be covered upon request.
The lab sessions make use of the Jupyter environment. You can use the following links to get started:
Alternatively, it is possible to use the notebooks via the Google Colab web environment simply by clicking on the button at the beginning of each notebook. If you’re running on Windows, we recommend following along using a Colab notebook. If you’re using a Linux distribution or macOS, you can use either approach described here. For an intro to the Colab environment, refer to:
Since the lab sessions will introduce you to open-source libraries such as spaCy, Scikit-learn, 🤗 Transformers and 🤗 Datasets, most of the material is adapted from the official tutorials and docs. Here is a non-exhaustive list of the most relevant sources for additional reference:
- Advanced NLP with spaCy
- spaCy Linguistic Features
- HuggingFace Course, Chapter 1
- HuggingFace Transformers Docs
- HuggingFace Datasets Docs
- Scikit-learn "Working with Text Data" Tutorial
- NLP class materials by Dirk Hovy
The file requirements.txt in this repository lists all the packages required to run the lab sessions. You can create a Python virtual environment (Python>=3.6) and install them using the following commands:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Make sure the virtual environment is activated before running Jupyter. If you are using Colab, simply run the cell at the beginning of each notebook to install the required packages. Refer to Using a Python Virtual Environment for more details on how to create and activate a virtual environment.
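As a quick sanity check before launching Jupyter, the following minimal sketch reports whether the current interpreter meets the Python>=3.6 requirement and whether it is running inside a virtual environment created with `venv`. The function names `in_virtualenv` and `python_is_supported` are illustrative and not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when running inside a venv-created virtual environment."""
    # Inside a venv, sys.prefix points to the environment directory,
    # while sys.base_prefix still points to the base Python installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


def python_is_supported(minimum=(3, 6)) -> bool:
    """Return True when the interpreter version is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Virtual environment active:", in_virtualenv())
```

If the second line prints `False`, the environment is most likely not activated in the shell from which Python was started.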
For any troubleshooting, please consult the FAQ before asking for help. You are encouraged to contribute to it by adding your solutions!
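When troubleshooting import errors, a useful first step can be to check which packages listed in requirements.txt are not installed in the active environment. The sketch below is a hypothetical helper, not part of this repository: `missing_packages` is an illustrative name, the parsing is deliberately simplified (it assumes plain lines such as `name==version` and skips pip options), and it relies on `importlib.metadata`, which requires Python 3.8 or later.

```python
import re
from importlib import metadata  # standard library from Python 3.8 onwards
from pathlib import Path


def missing_packages(requirements_path="requirements.txt"):
    """Return the package names from a requirements file that are not installed."""
    missing = []
    text = Path(requirements_path).read_text(encoding="utf-8")
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt" or "-e .".
        if not line or line.startswith("-"):
            continue
        # Keep only the leading package name, ignoring version specifiers and extras.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match is None:
            continue
        name = match.group(0)
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_packages()` from the repository root with the virtual environment activated returns an empty list when every listed package is installed.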
Arianna Bisazza is an Assistant Professor in Computational Linguistics and Natural Language Processing at the Computational Linguistics Group of the University of Groningen. She is passionate about the statistical modeling of languages, particularly in a multilingual context, and her long-term goal is to design robust NLP algorithms that can adapt to the large variety of linguistic phenomena observed around the world. She is part of the Dutch consortium InDeep: Interpreting Deep Learning Models for Text and Sound, leading the work package on interpretability for neural machine translation.
Gabriele Sarti is a doctoral researcher at the Computational Linguistics Group of the University of Groningen. He is part of the consortium InDeep, working on interpretability for neural machine translation. His research focuses on interpretability for sequence-to-sequence NLP models, in particular from a user-centric perspective and by leveraging human behavioral signals.
Anjali Nair is an MSc candidate in AI at the University of Groningen.
Please open an issue here on GitHub! This is the first year we are using these materials for the course, and although most of them come from battle-tested online tutorials, we are always looking for feedback and suggestions.