subject-encoding

This repository contains the data and the scripts that are necessary to reproduce the results reported in:

Berdicevskis, Aleksandrs, Karsten Schmidtke-Bode and Ilja Seržant. 2020. Subjects tend to be coded only once: Corpus-based and grammar-based evidence for an efficiency-driven trade-off. In: Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (TLT).

The repository contains the following files and folders:

  • corpus_analysis.rb: a script that goes through the UD corpora (train, dev and test concatenated) and produces subj_data.csv. If a language has several UD corpora, they have to be concatenated into a single one first (use ud_merge_perlang.rb).

  • subj_data.csv: a tab-separated file (despite the .csv extension) that contains the data analyzed in Section 3 of the article. It is the output of corpus_analysis.rb and the input for data_analysis.r. It has the following columns:

    • lemma (verb lemma);
    • sentence (full sentence where the verb occurs; note that our level of analysis is the clause, and sentences are provided only as wider context);
    • tense (tense-mood combination with two values whose names are somewhat imprecise: Past (= l-participles) and Nonpast (all other forms));
    • aspect (Imp = imperfective or Perf = perfective);
    • sperson (subject person as can be inferred from the verbal form. Possible values: 3, 12 (=1 or 2), unknown);
    • expressed (1 or 0: whether there is a free syntactic subject of any kind (either pronoun or full NP));
    • pronexpressed (1 or 0: whether there is a free syntactic subject encoded by a personal pronoun);
    • pron12expressed (1 or 0: whether there is a free syntactic subject encoded by a first- or second-person pronoun);
    • lang (language ISO code);
    • group (language group: non_copular (= East Slavic) or copular (= South and West Slavic));
    • rel (incoming syntactic relation for the verb);
    • clause_type (as inferred from the UD annotation: simple, main, subord(inate) or coord(inate). Note that only non-first conjuncts are treated as coordinate clauses, while first conjuncts are treated as simple clauses).
  • ud_merge_perlang.rb: a script that goes through all UD 2.6 treebanks and merges all treebanks for the same language into a single file. Set the IN and OUT paths at the beginning of the script before running it.

  • data_analysis.r: an R script that performs the statistical analysis described in Section 3 and creates the figures. Make sure the necessary packages are installed and set the R working directory to the folder containing subj_data.csv.

  • model_simple.pdf: the description and results of the model (mentioned in Section 3) that was fit to clauses in simple sentences (and first conjuncts in coordinated constructions) only.

  • lang.r: a bonus (draft) script for running analyses of individual languages (not discussed in the paper).

  • Indexation_WALS data.txt: data extracted from WALS that were used for the analysis in Section 2. The names of the columns are self-explanatory (Index = Indexation).
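
For a quick look at subj_data.csv outside of R, the columns described above can be read with a few lines of Ruby. This is only a minimal sketch and not part of the repository: the helper names read_subj_rows and expressed_share_by_group are hypothetical, and it assumes that the file starts with a header row carrying the column names listed above and that fields are separated by tabs without quoting. The sample rows are made up for illustration.

```ruby
# Hypothetical helper: turns tab-separated text into an array of hashes.
# Assumption: the first line is a header row with the column names listed
# above, and fields are separated by tabs without any quoting.
def read_subj_rows(tsv_text)
  lines = tsv_text.lines.map(&:chomp).reject(&:empty?)
  header = lines.first.split("\t", -1)
  lines.drop(1).map { |line| header.zip(line.split("\t", -1)).to_h }
end

# Hypothetical helper: share of clauses with an expressed subject
# (expressed == 1) for each value of the group column.
def expressed_share_by_group(rows)
  rows.group_by { |row| row["group"] }.transform_values do |group_rows|
    group_rows.count { |row| row["expressed"] == "1" }.to_f / group_rows.size
  end
end

# Made-up sample rows (not real data), reduced to three of the columns.
sample = "lang\tgroup\texpressed\n" \
         "rus\tnon_copular\t1\n" \
         "rus\tnon_copular\t1\n" \
         "pol\tcopular\t0\n" \
         "pol\tcopular\t1\n"

shares = expressed_share_by_group(read_subj_rows(sample))
shares.each { |group, share| puts format("%s: %.2f", group, share) }
```

To try it on the real data, replace the sample string with File.read("subj_data.csv"), provided the assumptions about the header row and the absence of quoting hold for the file.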