The cl-conllu
is a Common Lisp library to work with CoNLL-U,
licensed under the Apache license.
It is developed and tested with SBCL but should probably run with any other implementation.
If don’t have quicklisp installed already, follow these steps.
The cl-conllu
library is now available from quicklisp distribution,
if you are not planning to change the code, just use:
(ql:quickload :cl-conllu)
If you plan on contributing, clone this project to your
local-projects
quicklisp directory (usually at
~/quicklisp/local-projects/
) and use the same command as above.
First you need some CoNLL-U files to start with. If you have none in mind, you may get some from here.
The simplest way to read a file or a directory of conllu
files is:
CL-USER> (defparameter *sents* (cl-conllu:read-conllu #P"/path/to/my/file/CF1.conllu"))
(#<CL-CONLLU:SENTENCE {1003CFD9C3}> #<CL-CONLLU:SENTENCE {1003D164C3}>
#<CL-CONLLU:SENTENCE {1003D1BB13}> #<CL-CONLLU:SENTENCE {1003D2C013}>
#<CL-CONLLU:SENTENCE {1003D348A3}> #<CL-CONLLU:SENTENCE {1003D3E383}>
#<CL-CONLLU:SENTENCE {1003D49C23}>)
Each object returned is an instance of a sentence
class, made up of
token
objects, which we will describe in the next section.
All read functions accept a fn-meta
function as argument. This
function collects the metadata from each CoNLL-U sentence, which
usually includes the raw sentence (see format). The default metadata
collector function is collect-meta
.
cl-conllu
has a few central classes: sentence
, token
, and
mtoken
. They are all defined in data.lisp
file. When a CoNLL-U
file is read, its contents are turned into instances of these classes.
Every CoNLL-U sentence is turned in an instance of the sentence
class by cl-conllu
. Each instance is characterized by four
properties: start
, meta
, tokens
, and mtokens
. The start
field contains the line number of the file where the sentence block
started.
The meta
field includes the metainformation regarding the
sentence. This information may vary, as we have discussed in the
previous section, but usually includes the full (raw) sentence and the
sentence ID, as required by the CoNLL-U format specification.
CL-USER> (cl-conllu:sentence-meta (first *sents*)) (("text" . "PT no governo") ("source" . "CETENFolha n=1 cad=Opinião sec=opi sem=94a") ("sent_id" . "CF1-1") ("id" . "1"))
The tokens
are the list of tokens that together form the sentence,
and they are themselves instances of the token
class.
The mtokens
(meta-tokens) are also instances of their own mtoken
class, and they are used for multiword tokens (vámonos = vamos + nos).
Instances of the token
class have one property for each field/column
in the CoNLL-U format’s sentences, that is:
- ID
- Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
- FORM
- Word form or punctuation symbol.
- LEMMA
- Lemma of word form.
- UPOSTAG
- Universal part-of-speech tag.
- XPOSTAG
- Language-specific part-of-speech tag; underscore if not available.
- FEATS
- List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
- HEAD
- Head of the current word, which is either a value of ID or zero if the token is the root (0).
- DEPREL
- Universal dependency relation to the HEAD (
root
iff HEAD is 0) or a defined language-specific subtype of one. - DEPS
- Enhanced dependency graph in the form of a list of head-deprel pairs.
- MISC
- Any other annotation.
To visualize CoNLL-U sentences, we use the conllu-visualize
subpackage. The function tree-sentence
receives an instance of a
sentence object and (optionally) an output stream, and outputs to the
stream the sentence’s metadata and its tree structure:
(conllu.draw:tree-sentence (nth 5 *frases*)) text = Eles se dizem oposição, mas ainda não informaram o que vão combater. source = CETENFolha n=1 cad=Opinião sec=opi sem=94a sent_id = CF1-7 id = 6 ─┮ │ ╭─╼ Eles nsubj │ ├─╼ se expl ╰─┾ dizem root ├─╼ oposição xcomp │ ╭─╼ , punct │ ├─╼ mas cc │ │ ╭─╼ ainda advmod │ ├─┶ não advmod ├─┾ informaram conj │ │ ╭─╼ o det │ │ ╭─┶ que obj │ │ ├─╼ vão aux │ ╰─┶ combater ccomp ╰─╼ . punct
Queries can be executed with
(query '(nsubj (advcl (and (upostag "VERB") (lemma "correr"))
(upostag "VERB" ))
(upostag "PROP"))
*sents*)
http://arademaker.github.io/bibliography/tilic-stil-2017.html
@inproceedings{tilic-stil-2017, author = {Muniz, Henrique and Chalub, Fabricio and Rademaker, Alexandre}, title = {CL-CONLLU: dependências universais em Common Lisp}, booktitle = {V Workshop de Iniciação Científica em Tecnologia da Informação e da Linguagem Humana (TILic)}, year = {2017}, address = {Uberlândia, MG, Brazil}, note = {https://sites.google.com/view/tilic2017/} }