This repository provides an end-to-end pipeline to fine-tune a π€-Summary-Model on your own corpus.
It is subdivided into three parts:
- Data Provider: preprocesses and tokenizes data for training
- Model Trainer: fine-tunes a selected π€-Model on the provided data
- Evaluator: automated evaluation of the fine-tuned model on the validation set
The pipeline supports summarization of German and English texts. For both languages a T5 model
is used, which is described in the paper
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
Hugging Face models:
- German: t5-base-multi-de-wiki-news
- English: t5-base
- Provide source and target files with matching text/summary pairs. These are split and converted to the right format for the fine-tuning task.
- Choose one of the supported languages and set training parameters via a config file. Then run the training on your data.
- Evaluate the produced model checkpoints and compare them with either the Rouge-L metric or the specially developed SemanticSimilarity metric. You can also track the training metrics via TensorBoard.
The following example was produced by one of our German models, fine-tuned on our specially scraped Golem corpus.
Original Text:
Tamrons neues Objektiv ist ein Weitwinkelzoom fΓΌr Canon- und Nikonkameras mit Kleinbildsensor, das ΓΌber 15 Elemente verfΓΌgt, darunter dispersionsarme und asphΓ€rische. Der sogenannte Silent Drive Motor ermΓΆglicht laut Hersteller eine hohe Geschwindigkeit beim Scharfstellen und eine niedrige GerΓ€uschentwicklung. Die minimale Fokusdistanz wird mit 28 cm angegeben. Die feuchtigkeitsbestΓ€ndige Konstruktion und die Fluorbeschichtung des Frontelements sollen dazu beitragen, dass das Objektiv auch bei harschen Wetterbedingungen funktioniert. Das Objektiv misst 84 mm x 93 mm und weist einen Filterdurchmesser von 77 mm auf. Das 17-35 mm F2.8-4 Di OSD von Tamron soll Anfang September 2018 fΓΌr Nikon FX erhΓ€ltlich sein, ein Canon-EF-Modell wird spΓ€ter folgen. Der Preis wird mit rund 600 US-Dollar angegeben. Deutsche Daten liegen noch nicht vor.
Produced Summary:
Tamron hat mit dem 17-35 mm F2.8-4 Di OSD ein Weitwinkelzoom fΓΌr Canon- und Nikon-Kameras vorgestellt, das ΓΌber 15 Elemente verfΓΌgt.
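For reference, a summary like the one above can be produced directly with the π€ Transformers API. This is a minimal sketch, assuming the model is loaded straight from the Hugging Face hub; the task prefix and the generation settings are illustrative assumptions, not the pipeline's actual configuration:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the German base model from the Hugging Face hub
# (a fine-tuned checkpoint directory works the same way).
model_name = "WikinewsSum/t5-base-multi-de-wiki-news"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Tamrons neues Objektiv ist ein Weitwinkelzoom ..."  # full article text

# T5 expects a task prefix; the beam settings here are illustrative only.
inputs = tokenizer("summarize: " + text, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=150, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```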
Please set up a Python 3 environment with your favorite virtual environment tool. We tested with Python 3.7.4.
sh ./install_dependencies.sh
Provides tokenized data for training.
It requires a dataset in the dataProvider/datasets/$DATASETNAME directory. The dataset can either be provided as single files or already split into a train, validation and test set. Each line in a file should represent a single example string.
The sources (full texts) for summarization should be provided in a sources.txt file and the target summaries in a targets.txt file.
In this case the --create_splits flag has to be used to create the train, val and test files in that directory, which will then be the resource for the tokenization.
If training, validation and test splits are already present, they should be provided in the format of the π€-seq2seq examples as the following files:
train.source
train.target
val.source
val.target
test.source
test.target
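A quick way to sanity-check this layout is to verify that each split's .source and .target files are line-aligned, since examples are matched by line number. A minimal sketch, using the golem dataset name from the example below:

```python
from pathlib import Path

# Each .source file must have exactly as many lines as its .target
# counterpart, because text/summary pairs are matched by line number.
# "golem" is the dataset name from the example; adjust as needed.
dataset_dir = Path("dataProvider/datasets/golem")
for split in ("train", "val", "test"):
    sources = (dataset_dir / f"{split}.source").read_text(encoding="utf-8").splitlines()
    targets = (dataset_dir / f"{split}.target").read_text(encoding="utf-8").splitlines()
    assert len(sources) == len(targets), f"{split}: {len(sources)} vs {len(targets)} lines"
```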
Use the Command Line Interface like this:
bin/provide_data $DATASETNAME $TOKENIZERNAME $MODELNAME <flags>
Example:
bin/provide_data golem WikinewsSum/t5-base-multi-de-wiki-news t5-base --create_splits=True --filtering=True
Limits the number of samples that are taken for tokenization from each split. Defaults to None.
Splits the dataset into train, validation and test splits. Defaults to False.
$CREATESPLITS has to be a dictionary containing the keys train and val with values between 0 and 1. The value of train represents the ratio of the dataset that is used for training (and not for validation or testing). The value of val represents the ratio between the validation and the test set. Because of shell restrictions the dictionary has to be wrapped in " in the CLI, like this: --create_splits="{'train': 0.7, 'val': 0.66}"
If the value of $CREATESPLITS is True, it defaults to {'train': 0.8, 'val': 0.5}, which results in an 80/10/10 split.
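To make these semantics concrete, the following sketch shows the arithmetic behind the two ratios (an illustration, not the module's actual code):

```python
# How --create_splits="{'train': 0.7, 'val': 0.66}" divides a dataset
# of 1000 examples (illustrative arithmetic only).
n = 1000
ratios = {"train": 0.7, "val": 0.66}

n_train = int(n * ratios["train"])    # 700 examples for training
n_rest = n - n_train                  # 300 examples remain
n_val = int(n_rest * ratios["val"])   # 198 examples for validation
n_test = n_rest - n_val               # 102 examples for testing

# The default {'train': 0.8, 'val': 0.5} yields 800/100/100, i.e. 80/10/10.
```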
Can be set to only tokenize certain splits. Defaults to [train, val, test].
Examples longer than the maximum token size are filtered out; otherwise they are truncated. Defaults to True.
The resulting tokenized PyTorch tensors are saved in the dataProvider/datasets/$DATASETNAME/$TOKENIZERNAME[_filtered] directory as the following files:
train_source.pt
train_target.pt
val_source.pt
val_target.pt
test_source.pt
test_target.pt
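These tensors can be loaded back with plain PyTorch for inspection. A minimal sketch; the path follows the $TOKENIZERNAME[_filtered] pattern with a shortened, hypothetical tokenizer directory name, and the exact structure of the stored object is an implementation detail of the module:

```python
import torch

# Hypothetical path following dataProvider/datasets/$DATASETNAME/$TOKENIZERNAME[_filtered];
# check your actual tokenizer directory name.
train_source = torch.load("dataProvider/datasets/golem/t5-base_filtered/train_source.pt")

# Inspect the stored object before relying on a particular layout
# (it may be a tensor or a dict of tensors).
print(type(train_source))
```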
Performs the training process for the selected model on the previously created datasets.
To execute the model training you first need to run the Data Provider module to generate training data in the right format, either from your own or predefined text/summary pairs.
It requires files in the output format of the Data Provider module. Since you could have run the module for multiple text/summary sets, you have to provide the $DATASETNAME to train on.
Additionally you can choose a supported π€-Model with the $MODELNAME parameter (the model will be downloaded to your virtual environment if you run the training for the first time).
With the $FILTERED flag you can specify whether filtered or unfiltered data is used for training (if previously created by the Data Provider). It defaults to True.
Since all model and training pipeline configurations are read from a config file (which has to be stored in the ./modelTrainer/config directory), you can also select your config file by setting the $CONFIGNAME parameter.
If you don't, this parameter defaults to 'fine_tuning.ini' (which can also be used as a template for your own configurations).
Use the Command Line Interface like this:
bin/run_training $DATASETNAME $MODELNAME <flags>
Example:
bin/run_training golem WikinewsSum/t5-base-multi-de-wiki-news --filtered=False
The pipeline is designed to inherit all customizable parameters from an '.ini' file.
It follows the structure that a component is defined by [COMPONENT] and its parameters by parameter = parameter_value (as string). There are two components, the model component and the training component. Each component can be configured with multiple parameters (see fine_tuning.ini for a full list).
Only the parameters present in the provided 'fine_tuning.ini' file stored in the config folder can be changed.
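Since it is a standard '.ini' file, the configuration can be inspected with Python's built-in configparser. A minimal sketch that prints every section and parameter without assuming their names:

```python
import configparser

# Read the training configuration and list all components and parameters.
config = configparser.ConfigParser()
config.read("modelTrainer/config/fine_tuning.ini")

for section in config.sections():          # e.g. the model and training components
    for parameter, value in config[section].items():
        print(f"[{section}] {parameter} = {value}")  # all values are strings
```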
In the config file you choose an output_directory. In this directory the following folder structure is created:

output_directory
└── model_shortname
    ├── model_version
    │   ├── checkpoint_folder
    │   └── final_model_files
    └── logs
        └── model_version
            └── tensorboard_file
<model_shortname> = abbreviation for the chosen model
<model_version> = counts the versions of the fine-tuned model (can be seen as an id and makes sure you don't overwrite any previously trained model)
<checkpoint_folder> = contains model files after a certain number of training steps (checkpoints are saved every n training steps)
<tensorboard_file> = saved training metrics for TensorBoard usage
After the training the following final output files are saved in the <model_version> folder:
- config.json
- training_args.bin (parameters for the π€-Trainer)
- pytorch_model.bin (model which can then be loaded for inference)
- model_info.yml (file with information used for evaluation)
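With those files in place, the <model_version> folder can be loaded directly with π€ Transformers for inference. A minimal sketch; the path is a hypothetical example following the folder structure above, and since no tokenizer files are listed there, the tokenizer is loaded from the hub:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical path: <output_directory>/<model_shortname>/<model_version>
model_dir = "results/t5-de/0"
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)  # reads config.json + pytorch_model.bin
tokenizer = AutoTokenizer.from_pretrained("WikinewsSum/t5-base-multi-de-wiki-news")
```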
Performs evaluation on the validation or test set for a fine-tuned model.
To execute the evaluation you first need to run the Model Trainer module to generate a fine-tuned π€-Model in the right format, stored in the correct folder structure. These four files are required:
- config.json
- pytorch_model.bin
- training_args.bin
- model_info.yml
Since the model evaluation uses the validation or test set created from the underlying dataset, you need to specify the $DATASETNAME.
Additionally you can choose the fine-tuned π€-Model checkpoints to compare by setting the $RUNPATH parameter. This path has to be the directory of the checkpoint_folder defined by the folder structure in the training section.
Should be train, val or test. Defaults to val.
Number of samples selected from the dataset to evaluate the checkpoint on. Defaults to 10.
One of two metric types can be chosen:
- Rouge-L: set the parameter to "Rouge"
- Semantic Similarity: set the parameter to "SemanticSimilarity"
Defaults to Rouge.
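For intuition, Rouge-L scores the longest common subsequence between a produced summary and its reference. A minimal sketch with the rouge_score package, as an illustration of the metric rather than the evaluator's own implementation:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Rouge-L compares prediction and reference via their longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rougeL"])
reference = "Tamron hat ein Weitwinkelzoom fΓΌr Canon- und Nikon-Kameras vorgestellt."
prediction = "Tamron stellt ein Weitwinkelzoom fΓΌr Canon und Nikon vor."
score = scorer.score(reference, prediction)["rougeL"]
print(score.precision, score.recall, score.fmeasure)
```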
Use the Command Line Interface like this:
bin/evaluate_with_checkpoints_and_compare $RUN_PATH $DATASET_NAME <flags>
Example:
bin/evaluate_with_checkpoints_and_compare golem WikinewsSum/t5-base-multi-de-wiki-news --split_name=train --nr_samples=1000 --metric_type=SemanticSimilarity
By default the produced Overview.xlsx files are stored in the evaluator directory under the following structure:
evaluator
└── evaluations
    └── model_short_name
        └── model_version
            └── checkpoint_folders
                └── metric_type-sample_name-split_name-folders
                    └── iteration-folders
                        ├── Overview.xlsx
                        └── analysis.xlsx
A summarization model can be used for live prediction in a GUI developed with PyQt5.
All the GUI needs is either a directory containing a pytorch_model.bin file with a fine-tuned T5 summary model, or the model_status flag has to be set to base - then the base T5 model is used.
Language of the model to choose. Defaults to german.
Can be either base or fine-tuned. If it is base, the model_dir will be ignored. Defaults to fine-tuned.
Use the Command Line Interface like this:
bin/run_gui $MODEL_DIR <flags>
Example:
bin/run_gui path/to/UI-Model/checkpoint-100000/ german fine-tuned
During the training a TensorBoard file is produced, which can afterwards be used to track your training parameters and metrics on your localhost.
To access TensorBoard, the tensorboard library has to be installed (see requirements.txt); then use the following command to start it:
tensorboard --logdir <tensorboard_log_dir>
In the <tensorboard_log_dir> an events.out.tfevents... file should exist. The default path is described by the folder structure in the training section.
Example:
tensorboard --logdir ./results/t5-de/logs/0
pip install pytest
Use fd and entr to execute tests automatically on file changes:
fd . | entr pytest
Use the following command to add a new package (optionally with version number) $pkg to the repository, while keeping requirements.txt orderly:
echo $pkg | sort -o requirements.txt - requirements.txt && pip install $pkg