Library for working with CoNLL-U files with CL

The cl-conllu is a Common Lisp library to work with CoNLL-U, licensed under the Apache license.

It is developed and tested with SBCL but should probably run with any other implementation.

Quicklisp

If don’t have quicklisp installed already, follow these steps.

The cl-conllu library is now available from quicklisp distribution, if you are not planning to change the code, just use:

(ql:quickload :cl-conllu)

If you plan on contributing, clone this project to your local-projects quicklisp directory (usually at ~/quicklisp/local-projects/) and use the same command as above.

Reading CoNLL-U files

First you need some CoNLL-U files to start with. If you have none in mind, you may get some from here.

The simplest way to read a file or a directory of conllu files is:

CL-USER> (defparameter *sents* (cl-conllu:read-conllu #P"/path/to/my/file/CF1.conllu"))
(#<CL-CONLLU:SENTENCE {1003CFD9C3}> #<CL-CONLLU:SENTENCE {1003D164C3}>
 #<CL-CONLLU:SENTENCE {1003D1BB13}> #<CL-CONLLU:SENTENCE {1003D2C013}>
 #<CL-CONLLU:SENTENCE {1003D348A3}> #<CL-CONLLU:SENTENCE {1003D3E383}>
 #<CL-CONLLU:SENTENCE {1003D49C23}>)

Each object returned is an instance of a sentence class, made up of token objects, which we will describe in the next section.

All read functions accept a fn-meta function as argument. This function collects the metadata from each CoNLL-U sentence, which usually includes the raw sentence (see format). The default metadata collector function is collect-meta.

The Classes

cl-conllu has a few central classes: sentence, token, and mtoken. They are all defined in data.lisp file. When a CoNLL-U file is read, its contents are turned into instances of these classes.

Sentences

Every CoNLL-U sentence is turned in an instance of the sentence class by cl-conllu. Each instance is characterized by four properties: start, meta, tokens, and mtokens. The start field contains the line number of the file where the sentence block started.

The meta field includes the metainformation regarding the sentence. This information may vary, as we have discussed in the previous section, but usually includes the full (raw) sentence and the sentence ID, as required by the CoNLL-U format specification.

CL-USER> (cl-conllu:sentence-meta (first *sents*))
(("text" . "PT no governo")
 ("source" . "CETENFolha n=1 cad=Opinião sec=opi sem=94a")
 ("sent_id" . "CF1-1") ("id" . "1"))

The tokens are the list of tokens that together form the sentence, and they are themselves instances of the token class.

The mtokens (meta-tokens) are also instances of their own mtoken class, and they are used for multiword tokens (vámonos = vamos + nos).

Tokens

Instances of the token class have one property for each field/column in the CoNLL-U format’s sentences, that is:

ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
FORM: Word form or punctuation symbol.
LEMMA: Lemma of word form.
UPOSTAG: Universal part-of-speech tag.
XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD: Head of the current word, which is either a value of ID or zero if the token is the root (0).
DEPREL: Universal dependency relation to the HEAD (root iff HEAD is 0) or a defined language-specific subtype of one.
DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
MISC: Any other annotation.

Visualizing CoNLL-U sentences

To visualize CoNLL-U sentences, we use the conllu-visualize subpackage. The function tree-sentence receives an instance of a sentence object and (optionally) an output stream, and outputs to the stream the sentence’s metadata and its tree structure:

(conllu.draw:tree-sentence (nth 5 *frases*))
text = Eles se dizem oposição, mas ainda não informaram o que vão combater.
source = CETENFolha n=1 cad=Opinião sec=opi sem=94a
sent_id = CF1-7
id = 6
─┮ 
 │ ╭─╼ Eles nsubj 
 │ ├─╼ se expl 
 ╰─┾ dizem root 
   ├─╼ oposição xcomp 
   │ ╭─╼ , punct 
   │ ├─╼ mas cc 
   │ │ ╭─╼ ainda advmod 
   │ ├─┶ não advmod 
   ├─┾ informaram conj 
   │ │   ╭─╼ o det 
   │ │ ╭─┶ que obj 
   │ │ ├─╼ vão aux 
   │ ╰─┶ combater ccomp 
   ╰─╼ . punct

Querying CoNLL-U files

Queries can be executed with

(query '(nsubj (advcl (and (upostag "VERB") (lemma "correr"))
			(upostag "VERB" )) 
		 (upostag "PROP"))
	  *sents*)

How to cite

http://arademaker.github.io/bibliography/tilic-stil-2017.html

@inproceedings{tilic-stil-2017,
  author = {Muniz, Henrique and Chalub, Fabricio and Rademaker, Alexandre},
  title = {CL-CONLLU: dependências universais em Common Lisp},
  booktitle = {V Workshop de Iniciação Científica em Tecnologia da
                    Informação e da Linguagem Humana (TILic)},
  year = {2017},
  address = {Uberlândia, MG, Brazil},
  note = {https://sites.google.com/view/tilic2017/}
}

Name		Name	Last commit message	Last commit date
Latest commit History 195 Commits
test		test
.gitignore		.gitignore
LICENSE		LICENSE
README.org		README.org
README_EVALUATE_TEMPLATE.md		README_EVALUATE_TEMPLATE.md
cl-conllu.asd		cl-conllu.asd
command-line.lisp		command-line.lisp
conllu-prolog.lisp		conllu-prolog.lisp
data.lisp		data.lisp
draw.lisp		draw.lisp
evaluate-confusion.lisp		evaluate-confusion.lisp
evaluate-template.lisp		evaluate-template.lisp
evaluate.lisp		evaluate.lisp
evaluate_template_examples.lisp		evaluate_template_examples.lisp
experimental.lisp		experimental.lisp
niceline.lisp		niceline.lisp
packages.lisp		packages.lisp
query.lisp		query.lisp
rdf-wilbur.lisp		rdf-wilbur.lisp
rdf.lisp		rdf.lisp
read-malt.lisp		read-malt.lisp
read-write.lisp		read-write.lisp
rules.lisp		rules.lisp
run-SBCL.lisp		run-SBCL.lisp
tag-converter.lisp		tag-converter.lisp
testing.lisp		testing.lisp
utils.lisp		utils.lisp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Library for working with CoNLL-U files with CL

Quicklisp

Reading CoNLL-U files

The Classes

Sentences

Tokens

Visualizing CoNLL-U sentences

Querying CoNLL-U files

How to cite

About

Releases

Packages

Languages

License

GPPassos/cl-conllu

Folders and files

Latest commit

History

Repository files navigation

Library for working with CoNLL-U files with CL

Quicklisp

Reading CoNLL-U files

The Classes

Sentences

Tokens

Visualizing CoNLL-U sentences

Querying CoNLL-U files

How to cite

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages