This project is forked from the repository of the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology (HIT-SCIR). It includes some modifications of our own; the main computational graph is unchanged.
We release our ELMo representations trained on many languages, which helped us win the CoNLL 2018 shared task on Universal Dependencies Parsing according to LAS.
We use the same hyperparameter settings as Peters et al. (2018) for the biLM and the character CNN. For each language, we train their parameters on a 20-million-word data set randomly sampled from the raw text released by the shared task (wikidump + common crawl). Our code is largely based on AllenNLP, with the following changes:
- We support unicode characters;
- We use the sampled softmax technique to make training on a large vocabulary feasible (Jean et al., 2015). However, we use a window of words surrounding the target word as negative samples, which showed better performance in our preliminary experiments.
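
The window-based choice of negative samples described above can be pictured with a small sketch. This is not the project's training code; the function name `window_negative_samples`, the `window` parameter, and the toy word ids are assumptions made for illustration only.

```python
from typing import List, Set


def window_negative_samples(token_ids: List[int], position: int, window: int) -> Set[int]:
    """Collect candidate negative samples for the target word at `position`.

    Hypothetical helper: the candidates are the ids of the words within
    `window` positions of the target, with the target id itself removed.
    """
    target = token_ids[position]
    start = max(0, position - window)
    end = min(len(token_ids), position + window + 1)
    # Neighbouring word ids, excluding every occurrence of the target word.
    return {tid for tid in token_ids[start:end] if tid != target}


sentence = [11, 42, 7, 42, 93, 5]  # toy word ids, not a real vocabulary
negatives = window_negative_samples(sentence, position=2, window=2)
print(sorted(negatives))
```

The softmax would then be normalised over the target plus these few candidates rather than over the full vocabulary, which is what keeps the cost manageable.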
The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.
The models are hosted on the NLPL Vectors Repository.
ELMo for Simplified Chinese
We also provide a Simplified Chinese ELMo. It was trained on the Xinhua portion of Chinese Gigaword v5, which differs from the Wikipedia data used for the Traditional Chinese ELMo.
- Python >= 3.6 is required (with Python 3.5 you will encounter issue HIT-SCIR#8)
- pytorch 0.4
- other requirements from allennlp
To use the embeddings, install the package with the following command:
python setup.py install
After unzipping the model, you will find a JSON file ${lang}.model/config.json. Please change the "config_path" field to the relative path to the model configuration cnn_50_100_512_4096_sample.json. For example, if your ELMo model is zht.model/config.json and your model configuration is zht.model/cnn_50_100_512_4096_sample.json, you need to change "config_path" in zht.model/config.json to cnn_50_100_512_4096_sample.json.
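
If you prefer not to edit the file by hand, the change above can be scripted. The following is a minimal sketch using only the standard library; the helper name `set_config_path` is hypothetical, and it assumes config.json is a plain JSON object with a "config_path" key, as described above.

```python
import json
from pathlib import Path


def set_config_path(model_dir: str,
                    config_name: str = "cnn_50_100_512_4096_sample.json") -> None:
    """Point "config_path" in <model_dir>/config.json at `config_name`.

    Hypothetical helper; all other fields in config.json are left untouched.
    """
    config_file = Path(model_dir) / "config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # Path relative to the model directory, as in the zht.model example.
    config["config_path"] = config_name
    config_file.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )


# Example (assumes zht.model/ exists in the current directory):
# set_config_path("zht.model")
```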
Prepare your input file in the CoNLL-U format, for example:
1 Sue Sue _ _ _ _ _ _ _
2 likes like _ _ _ _ _ _ _
3 coffee coffee _ _ _ _ _ _ _
4 and and _ _ _ _ _ _ _
5 Bill Bill _ _ _ _ _ _ _
6 tea tea _ _ _ _ _ _ _
Fields should be separated by '\t'. We only use the second column, and spaces (' ') are supported in this field (for Vietnamese, a word can contain spaces).
Remember to tokenize your text first!
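
As an illustration, an input file in this layout can be generated from already tokenized sentences. This is only a sketch: the function name `to_conll_lines` is hypothetical, the word form is copied into the third column because only the second column is used, and sentences are separated by a blank line following the CoNLL-U convention.

```python
from typing import Iterable, List


def to_conll_lines(sentences: Iterable[List[str]]) -> str:
    """Render tokenized sentences as tab-separated, 10-column lines."""
    lines = []
    for tokens in sentences:
        for index, token in enumerate(tokens, start=1):
            # ID, FORM, a copy of FORM as a stand-in lemma, then seven "_" placeholders.
            fields = [str(index), token, token] + ["_"] * 7
            lines.append("\t".join(fields))
        lines.append("")  # a blank line closes each sentence
    return "\n".join(lines) + "\n"


text = to_conll_lines([["Sue", "likes", "coffee"], ["Bill", "likes", "tea"]])
with open("input.conllu", "w", encoding="utf-8") as handle:  # file name is arbitrary
    handle.write(text)
```

Because the separator is a tab, a token that itself contains a space stays within the second column.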
When it's all set, run
python -m elmoformanylangs test \
--input_format conll \
--input /path/to/your/input \
--model /path/to/your/model \
--output_prefix /path/to/your/output \
--output_format hdf5 \
--output_layer -1
It will dump an hdf5-encoded dict onto the disk, where the key is the '\t'-separated words of the sentence and the value is its 3-layer averaged ELMo representation.
You can also dump the CNN-encoded word with --output_layer 0, the first layer of the LSTM with --output_layer 1, and the second layer of the LSTM with --output_layer 2.
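
Reading the dump back might look like the sketch below. It assumes the third-party h5py package is installed and that the key is exactly the tab-joined words described above; the helper names are hypothetical. Since h5py treats '/' as a path separator, tokens with special characters may be stored under a different key, so listing the keys in the file is the safest check.

```python
from typing import List


def sentence_key(tokens: List[str]) -> str:
    """Build the lookup key: the words of the sentence joined by tabs."""
    return "\t".join(tokens)


def load_sentence_vectors(hdf5_path: str, tokens: List[str]):
    """Return the stored array for one sentence from the dumped file."""
    import h5py  # third-party; imported lazily so sentence_key works without it

    with h5py.File(hdf5_path, "r") as handle:
        # [()] reads the whole dataset into memory as a numpy array.
        return handle[sentence_key(tokens)][()]


print(repr(sentence_key(["Sue", "likes", "coffee"])))
```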
We are actively changing the interface to bring it closer to the
AllenNLP ELMo and make it easier to use programmatically.
Thanks @voidism for contributing the API.
With the Embedder Python object, you can use ELMo in your own code like this:
from elmoformanylangs import Embedder
e = Embedder('/path/to/your/model/')
sents = [['今', '天', '天氣', '真', '好', '阿'],
['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]
# the list of lists that stores the sentences,
# after segmentation if necessary.
e.sents2elmo(sents)
# will return a list of numpy arrays
# each with the shape=(seq_len, embedding_size)
class Embedder(model_dir='/path/to/your/model/', batch_size=64):
- model_dir: the absolute path from the repo top dir to your model dir.
- batch_size: the batch size to use during model inference; choose it according to your GPU/CPU memory. (default: 64)
def sents2elmo(sents, output_layer=-1):
- sents: the list of lists that stores the sentences, after segmentation if necessary.
- output_layer: the target layer to output.
  - 0 for the word encoder
  - 1 for the first LSTM hidden layer
  - 2 for the second LSTM hidden layer
  - -1 for an average of 3 layers. (default)
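
To compare the layers listed above, one option is to call sents2elmo once per output_layer value. The helper below is a hypothetical convenience wrapper, not part of the package; it relies only on the sents2elmo(sents, output_layer=...) signature documented above and works with any object exposing that method.

```python
from typing import Dict, List

# Layer ids as documented: word encoder, two LSTM layers, and the 3-layer average.
LAYERS = (0, 1, 2, -1)


def embed_all_layers(embedder, sents: List[List[str]]) -> Dict[int, list]:
    """Map each output_layer value to the result of sents2elmo for that layer."""
    return {
        layer: embedder.sents2elmo(sents, output_layer=layer)
        for layer in LAYERS
    }


# Example (requires an installed package and a downloaded model):
# from elmoformanylangs import Embedder
# e = Embedder('/path/to/your/model/')
# by_layer = embed_all_layers(e, [['今', '天', '天氣', '真', '好', '阿']])
# by_layer[-1][0].shape  # (seq_len, embedding_size)
```

Note that this runs inference four times, so for large inputs it may be preferable to request only the layer you need.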
Please run
python -m elmoformanylangs.biLM train -h
to get more details about ELMo training. Note, however, that the training process is not very stable: in some cases, we end up with a loss of nan. We are actively working on this and hope to improve it in the future.
If our ELMo gives you nice improvements, please cite us.
@InProceedings{che-EtAl:2018:K18-2,
author = {Che, Wanxiang and Liu, Yijia and Wang, Yuxuan and Zheng, Bo and Liu, Ting},
title = {Towards Better {UD} Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation},
booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
month = {October},
year = {2018},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
pages = {55--64},
url = {http://www.aclweb.org/anthology/K18-2005}
}
Please also cite the NLPL Vectors Repository for hosting the models.
@InProceedings{fares-EtAl:2017:NoDaLiDa,
author = {Fares, Murhaf and Kutuzov, Andrey and Oepen, Stephan and Velldal, Erik},
title = {Word vectors, reuse, and replicability: Towards a community repository of large-text resources},
booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics},
month = {May},
year = {2017},
address = {Gothenburg, Sweden},
publisher = {Association for Computational Linguistics},
pages = {271--276},
url = {http://www.aclweb.org/anthology/W17-0237}
}