Cross-Linguistic alignments for multiple language pairs #243

Open
nitinvwaran opened this issue Aug 24, 2022 · 0 comments


nitinvwaran commented Aug 24, 2022

The Chinese LPP conllulex file contains examples of sentence-level and token-level alignments to the English LPP file. The script scripts/generate_alignments_from_conllulex.py takes the Chinese conllulex file as input and generates alignments for the Chinese-English language pair.

I propose the following changes to extend the alignments to support (a) multiple language pairs and (b) one-to-many sentence alignments.

  1. Using English as an example, add a new metadata field en_sent_id_2 (and en_sent_id_3 if needed) to support one-to-many sentence alignments. Do the same for other languages (e.g. hi_sent_id, hi_sent_id_2, hi_sent_id_3 for hi-zh alignment). Add corresponding numbered fields for the sentence text. The field prefix could match the language's slug value in Xposition ('en', 'zh', 'he', 'hi', 'ko', 'de').

    Alternatively, UD notation may offer a list data structure that could be used in the metadata field instead (more research needed).

  2. In the misc column of the conllulex file, replicate the existing token-level annotations for each additional language, using a new delimiter to separate annotations across languages. The languages appear in the same order as the language tags in the metadata.
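To make the two proposals concrete, here is a rough Python sketch of how a reader of the extended file might consume them. It is only an illustration under assumptions: the `||` delimiter, the function names `aligned_sent_ids` and `split_misc`, and the sample IDs are all hypothetical, and the per-language annotation strings are treated as opaque because their internal format is not spelled out here. Only the `<slug>_sent_id[_N]` field naming and the list of slugs come from the proposal above.

```python
import re
from collections import defaultdict

# Language slugs listed in the proposal (Xposition slug values).
LANG_SLUGS = ("en", "zh", "he", "hi", "ko", "de")

# Matches "en_sent_id" or "hi_sent_id_3"; the numeric suffix is optional.
SENT_ID_KEY = re.compile(r"^(?P<lang>[a-z]{2})_sent_id(?:_(?P<n>\d+))?$")

# Hypothetical delimiter between per-language annotations in the misc column.
LANG_DELIM = "||"


def aligned_sent_ids(metadata):
    """Group aligned sentence IDs by language slug.

    Languages keep the order in which they first appear in the metadata;
    IDs within a language are ordered by their numeric suffix.
    """
    found = defaultdict(list)
    for key, value in metadata.items():
        match = SENT_ID_KEY.match(key)
        if match and match.group("lang") in LANG_SLUGS:
            index = int(match.group("n") or 1)  # no suffix means first ID
            found[match.group("lang")].append((index, value))
    return {lang: [v for _, v in sorted(pairs)] for lang, pairs in found.items()}


def split_misc(misc, langs):
    """Split a misc value into one opaque annotation string per language."""
    parts = misc.split(LANG_DELIM)
    if len(parts) != len(langs):
        raise ValueError(f"expected {len(langs)} annotations, got {len(parts)}")
    return dict(zip(langs, parts))


# Sample metadata for one Chinese sentence aligned to two English
# sentences and one Hindi sentence (IDs are made up for illustration).
metadata = {
    "sent_id": "zh-0001",
    "en_sent_id": "en-0001",
    "en_sent_id_2": "en-0002",
    "hi_sent_id": "hi-0001",
}
ids = aligned_sent_ids(metadata)
annotations = split_misc("<en annotation>||<hi annotation>", list(ids))
```

With this layout, `ids` maps `en` to two sentence IDs and `hi` to one, and the language order recovered from the metadata is what decides which slice of the misc value belongs to which language. If UD turns out to support a list-valued metadata field, `aligned_sent_ids` would reduce to splitting that list.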

Also see #234
