Cross-Linguistic alignments for multiple language pairs #243

nitinvwaran · 2022-08-24T16:12:43Z

The Chinese LPP conllulex file has examples of sentence-level and token-level alignments to the English LPP file. The script scripts/generate_alignments_from_conllulex.py generates alignments for the Chinese-English language pair, taking the Chinese conllulex file as input.

The following changes are proposed to extend the alignments to support a) multiple language pairs and b) 1-many sentence alignments.

Taking English as an example, add a new metadata field en_sent_id_2 (and en_sent_id_3 if needed) to support 1-many sentence alignments. Do the same for other languages (e.g. hi_sent_id, hi_sent_id_2, hi_sent_id_3 for hi-zh alignment). Add corresponding fields for the sentence text. The prefix for these fields could match the language's slug value in Xposition ('en', 'zh', 'he', 'hi', 'ko', 'de').
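A minimal sketch of how a reader could gather these fields, assuming metadata lines use the usual CoNLL-U comment form "# key = value" and that field names follow the proposed <slug>_sent_id[_N] pattern. The function name and the sample IDs in the usage note are hypothetical, and the numeric suffix is part of the proposal, not an existing convention.

```python
import re
from collections import defaultdict

# Language slugs as listed in the proposal (Xposition slug values).
SLUGS = ("en", "zh", "he", "hi", "ko", "de")

# Matches e.g. "# en_sent_id = ..." or "# hi_sent_id_3 = ...".
# The "_N" suffix is the proposed extension, not an existing convention.
FIELD_RE = re.compile(
    r"^#\s*(?P<slug>%s)_sent_id(?:_(?P<n>\d+))?\s*=\s*(?P<value>.*\S)\s*$"
    % "|".join(SLUGS)
)


def collect_aligned_sent_ids(comment_lines):
    """Return {slug: [sent_id, ...]}, ordered by the numeric suffix."""
    found = defaultdict(list)
    for line in comment_lines:
        m = FIELD_RE.match(line)
        if m:
            # An unsuffixed field (e.g. en_sent_id) counts as position 1.
            index = int(m.group("n") or 1)
            found[m.group("slug")].append((index, m.group("value")))
    return {slug: [v for _, v in sorted(pairs)] for slug, pairs in found.items()}
```

For example, a sentence carrying the made-up metadata lines "# en_sent_id = en-4" and "# en_sent_id_2 = en-5" would yield {"en": ["en-4", "en-5"]}, which expresses a 1-many alignment without changing how single alignments are read.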

Alternatively, UD notation might offer a list data structure that could be used in the metadata field (more research needed).

In the misc column of the conllulex file, replicate the existing token-level annotations for the other languages, separated by a new delimiter that marks the boundary between languages. The order of languages follows the order of the language tags in the metadata.
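The misc-column idea could be read back roughly as follows. This is only a sketch: the delimiter "~~" and the function name are placeholders chosen for illustration (the issue does not fix a delimiter), and each language's annotation is treated as an opaque string because its internal format is not specified here.

```python
def split_misc_by_language(misc, language_order, delimiter="~~"):
    """Pair each delimiter-separated chunk of MISC with its language slug.

    `language_order` is the order of language tags in the sentence metadata.
    `delimiter` is a hypothetical choice, not one fixed by the proposal.
    """
    chunks = misc.split(delimiter)
    if len(chunks) != len(language_order):
        raise ValueError(
            "expected %d chunks, found %d" % (len(language_order), len(chunks))
        )
    # "_" is the CoNLL-U placeholder for an empty value.
    return {
        slug: (None if chunk == "_" else chunk)
        for slug, chunk in zip(language_order, chunks)
    }
```

Relying on position keeps the column compact, but it means every token must carry one chunk per language, using "_" where a language has no aligned token.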

Also see #234
