This repository contains code and data to reproduce the results reported in a publication submitted to the Journal of Computational Literary Studies.
Contents of this repository:
This folder contains the annotated plays reported in the article. The plays are provided in the format used by the annotation tool (CorefAnnotator), and as CSV and TEI/XML files exported from the annotation tool. The CSV files are used for the analysis. The TEI files are used to count how many annotations occur per 1000 tokens in the texts, as presented in Section 5.1.
This folder contains the code needed to calculate inter-annotator agreement with Gamma.
With bash on a Unix system, you can run it with python3 iaa.py ../data/round-2/V1/csv/guenderode-udohla_0?.csv to compare the two annotations of Günderrode's Udohla. The output is a line formatted for use as a row in a LaTeX table.
To generate an entire table, you can use the following command:
for i in ../data/round-2/V1/csv/*01.csv
do
python3 iaa.py "$i" "${i%01.csv}02.csv"
done
This will iterate over all files ending in 01.csv in data/round-2/V1/csv and call the Python script for each of them. The Python script receives the versions by the two annotators as arguments.
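The pairing step of the loop can also be sketched in Python. This is only an illustration, not part of the repository: the helper name pair_annotation_files is hypothetical, and the sketch assumes that the two annotators' files differ only in the trailing 01.csv and 02.csv. It prints the commands rather than running them.

```python
from pathlib import Path


def pair_annotation_files(csv_dir):
    """Return (first, second) path pairs for files ending in 01.csv / 02.csv."""
    pairs = []
    for first in sorted(Path(csv_dir).glob("*01.csv")):
        # Swap only the trailing "01.csv", so a "01" elsewhere in the name is left alone.
        second = first.with_name(first.name[: -len("01.csv")] + "02.csv")
        # Skip files that have no counterpart by the second annotator.
        if second.exists():
            pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    for first, second in pair_annotation_files("../data/round-2/V1/csv"):
        # Print the command instead of executing it, so the sketch has no side effects.
        print(f"python3 iaa.py {first} {second}")
```

Files without a matching 02.csv are skipped silently here. Whether iaa.py itself handles a missing second file is not covered by this sketch.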
The script makes use of the pygamma-agreement library, which in turn relies on a highly optimized library for integer linear programming. Please follow the pygamma-agreement installation instructions to use the CBC solver.
The python script can be run using the command
$ python3 annotations_per_x_tokens.py ../data --xtokens 1000
No further packages need to be installed.
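The normalization behind the --xtokens option can be illustrated with a minimal sketch. It is not the code of annotations_per_x_tokens.py, which works on the files under ../data. The function name, its parameters, and the counts below are assumptions chosen only to show the arithmetic.

```python
def annotations_per_x_tokens(n_annotations, n_tokens, x=1000):
    """Normalize an annotation count to a rate per x tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    # Multiply before dividing to keep the floating-point result as exact as possible.
    return x * n_annotations / n_tokens


# Hypothetical counts, not taken from the annotated plays.
rate = annotations_per_x_tokens(n_annotations=150, n_tokens=12000)
print(f"{rate:.1f} annotations per 1000 tokens")
```

Normalizing by text length makes plays of different sizes comparable, which a raw annotation count does not.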
To install the packages needed for the R scripts, issue the following command in an R console:
> install.packages(c("DramaAnalysis", "ggplot2", "igraph", "kableExtra", "knitr", "reshape2", "tidyverse"))
All R scripts can be run either in RStudio or in the console using the command Rscript $PATH_TO_R_SCRIPT.
The plots generated by the R scripts can be found in the folder plots after running the scripts.