This repository contains the source code for our ACL 2019 paper "A Unified Linear-Time Framework for Sentence-Level Discourse Parsing".
These instructions will help you run our unified discourse parser on the RST dataset.
* PyTorch 0.4 or higher
* Python 3
* AllenNLP
We train and evaluate the model with the standard RST Discourse Treebank (RST-DT) corpus.
- Segmenter: we use all 7673 sentences for training and 991 sentences for testing.
- Parser: we extract sentence-level DTs from a document-level DT by finding the subtrees that span the respective sentences. This gives 7321 sentence-level DTs for training, 951 for testing, and 1114 for measuring human agreement.
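The extraction step described for the parser can be illustrated with a minimal sketch. It assumes a document-level DT is given as a collection of `(first_edu, last_edu)` spans, one per internal node, and that each sentence is given as the EDU range it covers. This representation and the function name `sentence_level_subtrees` are hypothetical and are not taken from the repository.

```python
def sentence_level_subtrees(doc_nodes, sentences):
    """Keep the sentences whose EDU range is spanned exactly by one tree node.

    doc_nodes: (first_edu, last_edu) spans of the internal nodes of a
               document-level DT (assumed representation, 1-based, inclusive).
    sentences: (first_edu, last_edu) range of each sentence.
    Returns a dict mapping each kept sentence to the nodes inside its range.
    """
    nodes = set(doc_nodes)
    subtrees = {}
    for first, last in sentences:
        if (first, last) in nodes:
            # The sentence-level DT is every node that lies within the sentence.
            subtrees[(first, last)] = sorted(
                n for n in nodes if first <= n[0] and n[1] <= last
            )
    return subtrees


# Toy document with 5 EDUs and two sentences (EDUs 1-2 and EDUs 3-5).
sentences = [(1, 2), (3, 5)]
aligned = [(1, 5), (1, 2), (3, 5), (3, 4)]   # both sentences are subtrees
crossing = [(1, 5), (1, 3), (4, 5), (1, 2)]  # node (1, 3) crosses sentence 2

print(sentence_level_subtrees(aligned, sentences))
print(sentence_level_subtrees(crossing, sentences))
```

In the second toy tree no node covers exactly EDUs 3-5, so only the first sentence yields a sentence-level DT.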
- Sentence: (Input sentences should be tokenized first. '[ ]' denotes the EDU boundary tokens.)
  - Although the [report,] which has [released] before the stock market [opened,] didn't trigger the 190.58 point drop in the Dow Jones Industrial [Average,] analysts [said] it did play a role in the market's [decline.]
- EDU_Breaks: (The indexes of the EDU boundary words, including the last word of the sentence.)
  - [2, 5, 10, 22, 24, 33]
- Gold Discourse Tree structure: (The parser output follows the same format.)
  - (1:Satellite=Contrast:4,5:Nucleus=span:6) (1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) (5:Satellite=Attribution:5,6:Nucleus=span:6) (1:Satellite=span:1,2:Nucleus=Elaboration:3) (2:Nucleus=span:2,3:Satellite=Temporal:3)
- Parsing_Label: (Labels must follow top-down, depth-first order. For example, there are 6 EDUs in this case. At the first decoding step, the parser predicts the 4th EDU (index 3) as the break position, so that two new splits, EDU1-EDU4 and EDU5-EDU6, are generated.)
  - [3, 2, 0, 1, 4]
- Relation_Label: (Our model uses 39 relations in total. Each time two new splits are created, the classifier predicts the corresponding relation label between them.)
  - [3, 17, 5, 30, 21]
- For training, you will need to prepare decoder_input_index as the decoder input, and the corresponding parent_index and sibling_index as the partial tree information.
  - decoder_input_index: [0, 0, 0, 1, 4] (takes the first EDU as the representation of the text span to be parsed) or [5, 3, 2, 2, 5] (takes the last EDU as the representation of the text span to be parsed)
  - parent_index: [0, 5, 4, 3, 6]
  - sibling_index: [99, 99, 99, 0, 4] ('99' denotes empty siblings.)
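To make the relationship between these fields concrete, here is a minimal sketch that recomputes EDU_Breaks, Parsing_Label and both variants of decoder_input_index from the example sentence and gold tree above. It assumes whitespace tokenization and the tree string format shown in the example. The helper names (`edu_breaks`, `parse_tree`, `top_down_labels`) are hypothetical and are not part of this repository. It does not derive parent_index or sibling_index.

```python
import re

SENTENCE = ("Although the [report,] which has [released] before the stock market "
            "[opened,] didn't trigger the 190.58 point drop in the Dow Jones "
            "Industrial [Average,] analysts [said] it did play a role in the "
            "market's [decline.]")
GOLD_TREE = ("(1:Satellite=Contrast:4,5:Nucleus=span:6) "
             "(1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) "
             "(5:Satellite=Attribution:5,6:Nucleus=span:6) "
             "(1:Satellite=span:1,2:Nucleus=Elaboration:3) "
             "(2:Nucleus=span:2,3:Satellite=Temporal:3)")

# One node: (left_start:Nuclearity=Relation:left_end,right_start:...:right_end)
NODE = re.compile(r"\((\d+):(\w+)=([\w-]+):(\d+),(\d+):(\w+)=([\w-]+):(\d+)\)")


def edu_breaks(sentence):
    """0-based indexes of the whitespace tokens wrapped in '[ ]'."""
    tokens = sentence.split()
    return [i for i, tok in enumerate(tokens)
            if tok.startswith("[") and tok.endswith("]")]


def parse_tree(tree):
    """Map each parent span (first EDU, last EDU) to the last EDU of its left child."""
    splits = {}
    for m in NODE.finditer(tree):
        left_start, left_end, right_end = int(m.group(1)), int(m.group(4)), int(m.group(8))
        splits[(left_start, right_end)] = left_end
    return splits


def top_down_labels(tree, n_edus):
    """Visit spans top-down, depth-first (left child first); indexes are 0-based."""
    splits = parse_tree(tree)
    parsing_label, first_edu, last_edu = [], [], []
    stack = [(1, n_edus)]
    while stack:
        start, end = stack.pop()
        if start == end:          # a single EDU cannot be split further
            continue
        left_end = splits[(start, end)]
        parsing_label.append(left_end - 1)
        first_edu.append(start - 1)
        last_edu.append(end - 1)
        stack.append((left_end + 1, end))  # pushed first, so visited after the left span
        stack.append((start, left_end))
    return parsing_label, first_edu, last_edu


print(edu_breaks(SENTENCE))           # [2, 5, 10, 22, 24, 33]
print(top_down_labels(GOLD_TREE, 6))  # ([3, 2, 0, 1, 4], [0, 0, 0, 1, 4], [5, 3, 2, 2, 5])
```

The tree string lists its nodes in a different order than Parsing_Label, which is why the sketch walks the spans with an explicit stack instead of reading the nodes left to right.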
- Parser:
cd Parser/
python main.py
You can also set other arguments; please refer to main.py. By default, the parser uses the same parameters as described in the paper.
- Segmenter:
cd Segmenter
python train.py
- You will need all the data files described in Data format to run the Discourse Parser, while only Sentence and EDU_Breaks are needed for the Discourse Segmenter.
Please cite our paper if you find the resources in this repository useful.
@inproceedings{lin-etal-2019-unified,
title = "A Unified Linear-Time Framework for Sentence-Level Discourse Parsing",
author = "Lin, Xiang and
Joty, Shafiq and
Jwalapuram, Prathyusha and
Bari, M Saiful",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
doi = "10.18653/v1/P19-1410",
pages = "4190--4200",
}