Refreshed version of ariecattan/coref used for experiments in our EMNLP 2021 paper.
- Notable changes
- Requirements
- Data Preparation
- Usage (Coreference Resolution)
- Usage (Event Extraction)
- Original Project Readme
- More efficient parallelized pytorch dataloader:
- With the original implementation, GPUs would run at ~20% load. Support for the e2e-coref style mention generation was not needed in our experiments and was dropped. A stub method
_generate_and_prune_mentions
exists intrain_pairwise_scorer_mb.py
should anyone need to reimplement this feature in the future.
- With the original implementation, GPUs would run at ~20% load. Support for the e2e-coref style mention generation was not needed in our experiments and was dropped. A stub method
- Documents are transformer-embedded just in time before the optimizer step:
- Previously, training the model on one topic would transformer-embed all documents in that topic and store their embeddings on the GPU, which does not scale for topics beyond ECB+ size. The implementation could not afford fine-tuning of transformer weights for this reason. (Trainable transformer weights would be possible now, though we decided against this to keep results comparable to the original implementation.)
- Added early stopping and warm-starting from previous checkpoints.
- Simplifications and more code comments throughout.
- python3 >= 3.7.12
pip install -r requirements.txt
, this installs torch 1.7.1 with CUDA 10.2- for CUDA 11, run
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
- for CUDA 11, run
Follow this project to obtain the ECB+, FCC-T, GVC and (parts of the) HyperCoref corpora in the right format.
Then, sort the files into the pre-existing directory structure at data/
. The JSON config files under configs/
expect the following directory structure for each corpus (here, S_gg
ECB+):
data/gg/ecbp
├── ...
├── gold
│ ├── dev_events_corpus_level.conll
│ └── test_events_corpus_level.conll
└── mentions
├── dev_events.json
├── dev.json
├── test_events.json
├── test.json
├── train_events.json
└── train.json
Running the coreference resolution system entails four main steps:
- Training
- Optimizing the agglomerative clustering threshold on the development split:
- Predicting on the development split, which produces CoNLL files
- Scoring models with coreference resolution metrics
- Predicting on the test split to produce CoNLL files
- Scoring models with coreference resolution metrics
We outline the steps for S_gg
ECB+. To run experiments for other experimental scenarios or corpora, use the other JSON configuration files inside configs/
.
For the paper experiments, we ran the above steps three separate times with different random seeds.
We use random seed 0
for our running example. The random seed can be specified by changing the "random_seed"
setting inside each of the JSON config files.
Run:
python3 train_pairwise_scorer_mb.py train configs/coreference/gg/ecbp/config_gg_ecbp_train.json
This will produce several model checkpoint files inside models/gg/ecbp/events/0/
.
Take note of the epoch with the lowest dev-loss. Copy its model checkpoints to models/gg/ecbp/events/0/best
(0
being the random seed in the path), like so:
models/gg/ecbp/events/0/best
├── pairwise_scorer_0
├── span_repr_0
└── span_scorer_0
Then, predict on the dev split using several different agglomerative clustering thresholds:
python3 predict_tune_mb.py tune configs/coreference/gg/ecbp/tune/config_gg_ecbp_tune_seed0.json
This produces CoNLL files under models/gg/ecbp/events/0/tune_on_best/best
.
To determine the clustering threshold that produces the best LEA F1 scores, run:
python3 run_scorer_mb.py \
data/gg/ecbp/gold/dev_events_corpus_level.conll \
models/gg/ecbp/events/0/tune_on_best/best
Consider setting the best-performing epoch and clustering threshold per seed in the JSON config:
- For ECB+
S_gg
, these areconfigs/coreference/gg/ecbp/config_gg_ecbp_test_ecbp_closure.json
andconfigs/coreference/gg/ecbp/config_gg_ecbp_test_ecbp_predicted.json
. (The two files differ in the type of topic pre-clustering used.) - Towards the bottom, there is a setting
best_hyperparams
which expects the following schema:Note: The prediction script can predict for multiple random seeds in a single execution.{ "SEED": [BEST_EPOCH, CLUSTERING_THRESHOLD], ... }
To create test split predictions (in this example, using a config for ECB+ with the gold-standard topic preclustering), run:
python3 predict_tune_mb.py predict configs/coreference/gg/ecbp/config_gg_ecbp_test_ecbp_closure.json
The resulting CoNLL file is located at models/gg/ecbp/events/closure/0/model_0_test_events_average_THRESHOLD_corpus_level.conll
Due to a bug (that I was too lazy to fix), all CoNLL prediction files start with #begin document dev_events
even when they contain test split data. This is easily fixed with sed
:
find models/ -name "*test_events*.conll" -exec sed -i 's/document dev_events/document test_events/g' {} +
To compute and aggregate the final coreference metrics (for the running example of S_gg
ECB+), run:
python3 aggregate_scores_mb.py \
data/gg/ecbp/gold/test_events_corpus_level.conll \
models/gg/ecbp/events/closure
Afterwards, models/gg/ecbp/events/closure/scores.txt
will contain the metrics in CSV format and pretty-printed. scores.pkl
in the same directory contains the pickled pandas dataframe for later analyses.
We ran event extraction against the ECB+ test split, re-using the hyperparameters and the original training loop from Cattan et al.
We outline the steps for S_gg
ECB+ using random seed 0
.
Run:
python3 train_pairwise_scorer.py train configs/event_extraction/gg/config_gg_train.json
This will produce several model checkpoint files inside models/event_extraction/gg/ecbp/events_10_0.25/
.
Copy the model checkpoints of the epoch with lowest dev loss to models/event_extraction/gg/ecbp/events_10_0.25/0/best
(0
being the random seed in the path), like so:
models/event_extraction/gg/ecbp/events_10_0.25/0/best
├── pairwise_scorer_0
├── span_repr_0
└── span_scorer_0
Run:
python3 predict.py predict configs/event_extraction/gg/config_gg_test_seed0.json
This produces a CoNLL file at models/event_extraction/gg/ecbp/events_10_0.25/ecbp/closure/0/test_events_average_1e-05_model_0_corpus_level.conll
Firstly, make sure to fix the CoNLL file header as in the coreference resolution example:
find models/event_extraction -name "*.conll" -exec sed -i 's/test_events_average_1e-05_model_0/test_events/g' {} +
Then, run:
python3 aggregate_scores_mb.py \
data/gg/ecbp/gold/test_events_corpus_level.conll \
models/event_extraction/gg/ecbp/events_10_0.25/ecbp/closure
Afterwards, models/event_extraction/gg/ecbp/events_10_0.25/ecbp/closure/scores.txt
will contain the scores (only the columns with mentions
are relevant).
Original project readme:
This repository contains code and models for end-to-end cross-document coreference resolution, as decribed in our paper: Streamlining Cross-Document Coreference Resolution: Evaluation and Modeling The models are trained on ECB+, but they can be used for any setting of multiple documents.
@article{Cattan2020StreamliningCC,
title={Streamlining Cross-Document Coreference Resolution: Evaluation and Modeling},
author={Arie Cattan and Alon Eirew and Gabriel Stanovsky and Mandar Joshi and I. Dagan},
journal={ArXiv},
year={2020},
volume={abs/2009.11032}
}
- Install python3 requirements
pip install -r requirements.txt
- Download spacy model
python -m spacy download en_core_web_sm
Run the following script in order to extract the data from ECB+ dataset and build the gold conll files. The ECB+ corpus can be downloaded here.
python get_ecb_data.py --data_path path_to_data
The core of our model is the pairwise scorer between two spans, which indicates how likely two spans belong to the same cluster.
We present 3 ways to train this pairwise scorer:
- Pipeline: first train a span scorer, then train the pairwise scorer. Unlike Ontonotes, ECB+ does include singleton annotation, so it's possible to train separately the span scorer model.
- Continue: first train the span scorer, then train the pairwise scorer while continue training the span scorer.
- End-to-end: train together the both models.
In order to choose the training method, you need to set the value of the training_method
in
the config_pairwise.json
to pipeline
, continue
or e2e
In our experiments, we found the e2e
method to perform the best for event coreference.
In ECB+, the entity and event coreference clusters are annotated separately,
making it possible to train a model only on event or entity coreference.
Therefore, our model also allows to be trained on events, entity, or both.
You need to set the value of the mention_type
in
the config_pairwise.json
(and config_span_scorer.json
)
to events
, entities
or mixed
.
In both pipeline and fine-tuning methods, you need to first run the span scorer model
python train_span_scorer --config configs/config_span_scorer.json
For the pairwise scorer, run the following script
python train_pairwise_scorer configs/config_pairwise.json
Given the pairwise scorer trained above, we use an agglomerative clustering in order to cluster the candidate spans into coreference clusters.
python predict.py --config configs/config_clustering
(model_path
corresponds to the directory in which you've stored the trained models)
An important configuration in the config_clustering
is the topic_level
.
If you set false
, you need to provide the path to the predicted topics in predicted_topics_path
to produce conll files at the corpus level.
The output of the predict.py
script is a file in the standard conll format.
Then, it's straightforward to evaluate it with its corresponding
gold conll file (created in the first step),
using the official conll coreference scorer
that you can find
here.
Make sure to use the gold files of the same evaluation level (topic or corpus) as the predictions.
-
If you chose to train with the end-to-end method, you don't need to provide a
span_repr_path
or aspan_scorer_path
in the config file. -
Notice that if you use this model with gold mentions, the span scorer is not relevant, you should ignore the training method.
-
If you're interested in a newer model, check out our cross-encoder model