<?xml version='1.0' encoding='UTF-8'?>
<collection id="2020.findings">
<volume id="emnlp" ingest-date="2020-11-12">
<meta>
<booktitle>Findings of the Association for Computational Linguistics: EMNLP 2020</booktitle>
<editor><first>Trevor</first><last>Cohn</last></editor>
<editor><first>Yulan</first><last>He</last></editor>
<editor id="yang-liu-icsi"><first>Yang</first><last>Liu</last></editor>
<publisher>Association for Computational Linguistics</publisher>
<address>Online</address>
<month>November</month>
<year>2020</year>
</meta>
<frontmatter>
<url hash="3704669c">2020.findings-emnlp.0</url>
</frontmatter>
<paper id="1">
<title>Fully Quantized Transformer for Machine Translation</title>
<author><first>Gabriele</first><last>Prato</last></author>
<author><first>Ella</first><last>Charlaix</last></author>
<author><first>Mehdi</first><last>Rezagholizadeh</last></author>
<pages>1–14</pages>
<abstract>State-of-the-art neural machine translation methods employ massive amounts of parameters. Drastically reducing computational costs of such methods without affecting performance has been up to this point unsuccessful. To this end, we propose FullyQT: an all-inclusive quantization strategy for the Transformer. To the best of our knowledge, we are the first to show that it is possible to avoid any loss in translation quality with a fully quantized Transformer. Indeed, compared to full-precision, our 8-bit models score greater or equal BLEU on most tasks. Comparing ourselves to all previously proposed methods, we achieve state-of-the-art quantization results.</abstract>
<url hash="71254eb4">2020.findings-emnlp.1</url>
<doi>10.18653/v1/2020.findings-emnlp.1</doi>
</paper>
<paper id="2">
<title>Summarizing <fixed-case>C</fixed-case>hinese Medical Answer with Graph Convolution Networks and Question-focused Dual Attention</title>
<author><first>Ningyu</first><last>Zhang</last></author>
<author><first>Shumin</first><last>Deng</last></author>
<author><first>Juan</first><last>Li</last></author>
<author><first>Xi</first><last>Chen</last></author>
<author><first>Wei</first><last>Zhang</last></author>
<author><first>Huajun</first><last>Chen</last></author>
<pages>15–24</pages>
<abstract>Online search engines are a popular source of medical information for users, where users can enter questions and obtain relevant answers. It is desirable to generate answer summaries for online search engines, particularly summaries that can reveal direct answers to questions. Moreover, answer summaries are expected to reveal the most relevant information in response to questions; hence, the summaries should be generated with a focus on the question, which is a challenging topic-focused summarization task. In this paper, we propose an approach that utilizes graph convolution networks and question-focused dual attention for Chinese medical answer summarization. We first organize the original long answer text into a medical concept graph with graph convolution networks to better understand the internal structure of the text and the correlation between medical concepts. Then, we introduce a question-focused dual attention mechanism to generate summaries relevant to questions. Experimental results demonstrate that the proposed model can generate more coherent and informative summaries compared with baseline models.</abstract>
<url hash="78e38e4d">2020.findings-emnlp.2</url>
<doi>10.18653/v1/2020.findings-emnlp.2</doi>
</paper>
<paper id="3">
<title>Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations</title>
<author><first>Peng</first><last>Qi</last></author>
<author><first>Yuhao</first><last>Zhang</last></author>
<author><first>Christopher D.</first><last>Manning</last></author>
<pages>25–40</pages>
<abstract>We investigate the problem of generating informative questions in information-asymmetric conversations. Unlike previous work on question generation which largely assumes knowledge of what the answer might be, we are interested in the scenario where the questioner is not given the context from which answers are drawn, but must reason pragmatically about how to acquire new information, given the shared conversation history. We identify two core challenges: (1) formally defining the informativeness of potential questions, and (2) exploring the prohibitively large space of potential questions to find the good candidates. To generate pragmatic questions, we use reinforcement learning to optimize an informativeness metric we propose, combined with a reward function designed to promote more specific questions. We demonstrate that the resulting pragmatic questioner substantially improves the informativeness and specificity of questions generated over a baseline model, as evaluated by our metrics as well as humans.</abstract>
<url hash="ce6352f3">2020.findings-emnlp.3</url>
<doi>10.18653/v1/2020.findings-emnlp.3</doi>
</paper>
<paper id="4">
<title>Adapting <fixed-case>BERT</fixed-case> for Word Sense Disambiguation with Gloss Selection Objective and Example Sentences</title>
<author><first>Boon Peng</first><last>Yap</last></author>
<author><first>Andrew</first><last>Koh</last></author>
<author><first>Eng Siong</first><last>Chng</last></author>
<pages>41–46</pages>
<abstract>Domain adaptation or transfer learning using pre-trained language models such as BERT has proven to be an effective approach for many natural language processing tasks. In this work, we propose to formulate word sense disambiguation as a relevance ranking task, and fine-tune BERT on sequence-pair ranking task to select the most probable sense definition given a context sentence and a list of candidate sense definitions. We also introduce a data augmentation technique for WSD using existing example sentences from WordNet. Using the proposed training objective and data augmentation technique, our models are able to achieve state-of-the-art results on the English all-words benchmark datasets.</abstract>
<url hash="7b542509">2020.findings-emnlp.4</url>
<doi>10.18653/v1/2020.findings-emnlp.4</doi>
</paper>
<paper id="5">
<title>Adversarial Text Generation via Sequence Contrast Discrimination</title>
<author><first>Ke</first><last>Wang</last></author>
<author><first>Xiaojun</first><last>Wan</last></author>
<pages>47–53</pages>
<abstract>In this paper, we propose a sequence contrast loss driven text generation framework, which learns the difference between real texts and generated texts and uses that difference. Specifically, our discriminator contains a discriminative sequence generator instead of a binary classifier, and measures the ‘relative realism’ of generated texts against real texts by making use of them simultaneously. Moreover, our generator uses discriminative sequences to directly improve itself, which not only replaces the gradient propagation process from the discriminator to the generator, but also avoids the time-consuming sampling process of estimating rewards in some previous methods. We conduct extensive experiments with various metrics, substantiating that our framework brings improvements in terms of training stability and the quality of generated texts.</abstract>
<url hash="1b9604ce">2020.findings-emnlp.5</url>
<attachment type="OptionalSupplementaryMaterial" hash="a550a176">2020.findings-emnlp.5.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.5</doi>
</paper>
<paper id="6">
<title><fixed-case>GRACE</fixed-case>: Gradient Harmonized and Cascaded Labeling for Aspect-based Sentiment Analysis</title>
<author><first>Huaishao</first><last>Luo</last></author>
<author><first>Lei</first><last>Ji</last></author>
<author><first>Tianrui</first><last>Li</last></author>
<author><first>Daxin</first><last>Jiang</last></author>
<author><first>Nan</first><last>Duan</last></author>
<pages>54–64</pages>
<abstract>In this paper, we focus on the imbalance issue, which is rarely studied in aspect term extraction and aspect sentiment classification when regarding them as sequence labeling tasks. Besides, previous works usually ignore the interaction between aspect terms when labeling polarities. We propose a GRadient hArmonized and CascadEd labeling model (GRACE) to solve these problems. Specifically, a cascaded labeling module is developed to enhance the interchange between aspect terms and improve the attention of sentiment tokens when labeling sentiment polarities. The polarities sequence is designed to depend on the generated aspect terms labels. To alleviate the imbalance issue, we extend the gradient harmonized mechanism used in object detection to the aspect-based sentiment analysis by adjusting the weight of each label dynamically. The proposed GRACE adopts a post-pretraining BERT as its backbone. Experimental results demonstrate that the proposed model achieves consistency improvement on multiple benchmark datasets and generates state-of-the-art results.</abstract>
<url hash="cd3e668d">2020.findings-emnlp.6</url>
<doi>10.18653/v1/2020.findings-emnlp.6</doi>
</paper>
<paper id="7">
<title>Reducing Sentiment Bias in Language Models via Counterfactual Evaluation</title>
<author><first>Po-Sen</first><last>Huang</last></author>
<author><first>Huan</first><last>Zhang</last></author>
<author><first>Ray</first><last>Jiang</last></author>
<author><first>Robert</first><last>Stanforth</last></author>
<author><first>Johannes</first><last>Welbl</last></author>
<author><first>Jack</first><last>Rae</last></author>
<author><first>Vishal</first><last>Maini</last></author>
<author><first>Dani</first><last>Yogatama</last></author>
<author><first>Pushmeet</first><last>Kohli</last></author>
<pages>65–83</pages>
<abstract>Advances in language modeling architectures and the availability of large text corpora have driven progress in automatic text generation. While this results in models capable of generating coherent texts, it also prompts models to internalize social biases present in the training corpus. This paper aims to quantify and reduce a particular type of bias exhibited by language models: bias in the sentiment of generated text. Given a conditioning context (e.g., a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g., country names, occupations, genders) in the conditioning context using a form of counterfactual evaluation. We quantify sentiment bias by adopting individual and group fairness metrics from the fair machine learning literature, and demonstrate that large-scale models trained on two different corpora (news articles, and Wikipedia) exhibit considerable levels of bias. We then propose embedding and sentiment prediction-derived regularization on the language model’s latent representations. The regularizations improve fairness metrics while retaining comparable levels of perplexity and semantic similarity.</abstract>
<url hash="b7d1741c">2020.findings-emnlp.7</url>
<doi>10.18653/v1/2020.findings-emnlp.7</doi>
</paper>
<paper id="8">
<title>Improving Text Understanding via Deep Syntax-Semantics Communication</title>
<author><first>Hao</first><last>Fei</last></author>
<author><first>Yafeng</first><last>Ren</last></author>
<author><first>Donghong</first><last>Ji</last></author>
<pages>84–93</pages>
<abstract>Recent studies show that integrating syntactic tree models with sequential semantic models can bring improved task performance, while these methods mostly employ shallow integration of syntax and semantics. In this paper, we propose a deep neural communication model between syntax and semantics to improve the performance of text understanding. Local communication is performed between syntactic tree encoder and sequential semantic encoder for mutual learning of information exchange. Global communication can further ensure comprehensive information propagation. Results on multiple syntax-dependent tasks show that our model outperforms strong baselines by a large margin. In-depth analysis indicates that our method is highly effective in composing sentence semantics.</abstract>
<url hash="5589c3bb">2020.findings-emnlp.8</url>
<doi>10.18653/v1/2020.findings-emnlp.8</doi>
</paper>
<paper id="9">
<title><fixed-case>GRUEN</fixed-case> for Evaluating Linguistic Quality of Generated Text</title>
<author><first>Wanzheng</first><last>Zhu</last></author>
<author><first>Suma</first><last>Bhat</last></author>
<pages>94–108</pages>
<abstract>Automatic evaluation metrics are indispensable for evaluating generated text. To date, these metrics have focused almost exclusively on the content selection aspect of the system output, ignoring the linguistic quality aspect altogether. We bridge this gap by proposing GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text. GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output. Unlike most existing evaluation metrics which require human references as an input, GRUEN is reference-less and requires only the system output. Besides, it has the advantage of being unsupervised, deterministic, and adaptable to various tasks. Experiments on seven datasets over four language generation tasks show that the proposed metric correlates highly with human judgments.</abstract>
<url hash="8a0d8f9a">2020.findings-emnlp.9</url>
<doi>10.18653/v1/2020.findings-emnlp.9</doi>
</paper>
<paper id="10">
<title>A Greedy Bit-flip Training Algorithm for Binarized Knowledge Graph Embeddings</title>
<author><first>Katsuhiko</first><last>Hayashi</last></author>
<author><first>Koki</first><last>Kishimoto</last></author>
<author><first>Masashi</first><last>Shimbo</last></author>
<pages>109–114</pages>
<abstract>This paper presents a simple and effective discrete optimization method for training binarized knowledge graph embedding model B-CP. Unlike the prior work using a SGD-based method and quantization of real-valued vectors, the proposed method directly optimizes binary embedding vectors by a series of bit flipping operations. On the standard knowledge graph completion tasks, the B-CP model trained with the proposed method achieved comparable performance with that trained with SGD as well as state-of-the-art real-valued models with similar embedding dimensions.</abstract>
<url hash="cfd35379">2020.findings-emnlp.10</url>
<doi>10.18653/v1/2020.findings-emnlp.10</doi>
</paper>
<paper id="11">
<title>Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation</title>
<author><first>Chujie</first><last>Zheng</last></author>
<author><first>Yunbo</first><last>Cao</last></author>
<author><first>Daxin</first><last>Jiang</last></author>
<author><first>Minlie</first><last>Huang</last></author>
<pages>115–125</pages>
<abstract>In a multi-turn knowledge-grounded dialog, the difference between the knowledge selected at different turns usually provides potential clues to knowledge selection, which has been largely neglected in previous research. In this paper, we propose a difference-aware knowledge selection method. It first computes the difference between the candidate knowledge sentences provided at the current turn and those chosen in the previous turns. Then, the differential information is fused with or disentangled from the contextual information to facilitate final knowledge selection. Automatic, human observational, and interactive evaluation shows that our method is able to select knowledge more accurately and generate more informative responses, significantly outperforming the state-of-the-art baselines.</abstract>
<url hash="46b80b12">2020.findings-emnlp.11</url>
<doi>10.18653/v1/2020.findings-emnlp.11</doi>
</paper>
<paper id="12">
<title>An Attentive Recurrent Model for Incremental Prediction of Sentence-final Verbs</title>
<author><first>Wenyan</first><last>Li</last></author>
<author><first>Alvin</first><last>Grissom II</last></author>
<author><first>Jordan</first><last>Boyd-Graber</last></author>
<pages>126–136</pages>
<abstract>Verb prediction is important for understanding human processing of verb-final languages, with practical applications to real-time simultaneous interpretation from verb-final to verb-medial languages. While previous approaches use classical statistical models, we introduce an attention-based neural model to incrementally predict final verbs on incomplete sentences in Japanese and German SOV sentences. To offer flexibility to the model, we further incorporate synonym awareness. Our approach both better predicts the final verbs in Japanese and German and provides more interpretable explanations of why those verbs are selected.</abstract>
<url hash="9d2fdd69">2020.findings-emnlp.12</url>
<attachment type="OptionalSupplementaryMaterial" hash="dda52e50">2020.findings-emnlp.12.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.12</doi>
</paper>
<paper id="13">
<title>Transformer-<fixed-case>GCRF</fixed-case>: Recovering <fixed-case>C</fixed-case>hinese Dropped Pronouns with General Conditional Random Fields</title>
<author><first>Jingxuan</first><last>Yang</last></author>
<author><first>Kerui</first><last>Xu</last></author>
<author><first>Jun</first><last>Xu</last></author>
<author><first>Si</first><last>Li</last></author>
<author><first>Sheng</first><last>Gao</last></author>
<author><first>Jun</first><last>Guo</last></author>
<author><first>Ji-Rong</first><last>Wen</last></author>
<author><first>Nianwen</first><last>Xue</last></author>
<pages>137–147</pages>
<abstract>Pronouns are often dropped in Chinese conversations and recovering the dropped pronouns is important for NLP applications such as Machine Translation. Existing approaches usually formulate this as a sequence labeling task of predicting whether there is a dropped pronoun before each token and its type. Each utterance is considered to be a sequence and labeled independently. Although these approaches have shown promise, labeling each utterance independently ignores the dependencies between pronouns in neighboring utterances. Modeling these dependencies is critical to improving the performance of dropped pronoun recovery. In this paper, we present a novel framework that combines the strength of Transformer network with General Conditional Random Fields (GCRF) to model the dependencies between pronouns in neighboring utterances. Results on three Chinese conversation datasets show that the Transformer-GCRF model outperforms the state-of-the-art dropped pronoun recovery models. Exploratory analysis also demonstrates that the GCRF did help to capture the dependencies between pronouns in neighboring utterances, thus contributing to the performance improvements.</abstract>
<url hash="15bfde53">2020.findings-emnlp.13</url>
<doi>10.18653/v1/2020.findings-emnlp.13</doi>
</paper>
<paper id="14">
<title>Neural Speed Reading Audited</title>
<author><first>Anders</first><last>Søgaard</last></author>
<pages>148–153</pages>
<abstract>Several approaches to neural speed reading have been presented at major NLP and machine learning conferences in 2017–20; i.e., “human-inspired” recurrent network architectures that learn to “read” text faster by skipping irrelevant words, typically optimizing the joint objective of minimizing classification error rate and FLOPs used at inference time. This paper reflects on the meaningfulness of the speed reading task, showing that (a) better and faster approaches to, say, document classification, already exist, which also learn to ignore part of the input (I give an example with 7% error reduction and a 136x speed-up over the state of the art in neural speed reading); and that (b) any claims that neural speed reading is “human-inspired”, are ill-founded.</abstract>
<url hash="29e00fa1">2020.findings-emnlp.14</url>
<doi>10.18653/v1/2020.findings-emnlp.14</doi>
</paper>
<paper id="15">
<title>Converting the Point of View of Messages Spoken to Virtual Assistants</title>
<author><first>Gunhee</first><last>Lee</last></author>
<author><first>Vera</first><last>Zu</last></author>
<author><first>Sai Srujana</first><last>Buddi</last></author>
<author><first>Dennis</first><last>Liang</last></author>
<author><first>Purva</first><last>Kulkarni</last></author>
<author><first>Jack</first><last>FitzGerald</last></author>
<pages>154–163</pages>
<abstract>Virtual Assistants can be quite literal at times. If the user says “tell Bob I love him,” most virtual assistants will extract the message “I love him” and send it to the user’s contact named Bob, rather than properly converting the message to “I love you.” We designed a system to allow virtual assistants to take a voice message from one user, convert the point of view of the message, and then deliver the result to its target user. We developed a rule-based model, which integrates a linear text classification model, part-of-speech tagging, and constituency parsing with rule-based transformation methods. We also investigated Neural Machine Translation (NMT) approaches, including LSTMs, CopyNet, and T5. We explored 5 metrics to gauge both naturalness and faithfulness automatically, and we chose to use BLEU plus METEOR for faithfulness and relative perplexity using a separately trained language model (GPT) for naturalness. Transformer-Copynet and T5 performed similarly on faithfulness metrics, with T5 achieving a slight edge: a BLEU score of 63.8 and a METEOR score of 83.0. CopyNet was the most natural, with a relative perplexity of 1.59. CopyNet also has 37 times fewer parameters than T5. We have publicly released our dataset, which is composed of 46,565 crowd-sourced samples.</abstract>
<url hash="4c526144">2020.findings-emnlp.15</url>
<doi>10.18653/v1/2020.findings-emnlp.15</doi>
</paper>
<paper id="16">
<title>Robustness to Modification with Shared Words in Paraphrase Identification</title>
<author><first>Zhouxing</first><last>Shi</last></author>
<author><first>Minlie</first><last>Huang</last></author>
<pages>164–171</pages>
<abstract>Revealing the robustness issues of natural language processing models and improving their robustness is important to their performance under difficult situations. In this paper, we study the robustness of paraphrase identification models from a new perspective – via modification with shared words, and we show that the models have significant robustness issues when facing such modifications. To modify an example consisting of a sentence pair, we either replace some words shared by both sentences or introduce new shared words. We aim to construct a valid new example such that a target model makes a wrong prediction. To find a modification solution, we use beam search constrained by heuristic rules, and we leverage a BERT masked language model for generating substitution words compatible with the context. Experiments show that the performance of the target models has a dramatic drop on the modified examples, thereby revealing the robustness issue. We also show that adversarial training can mitigate this issue.</abstract>
<url hash="675a123c">2020.findings-emnlp.16</url>
<doi>10.18653/v1/2020.findings-emnlp.16</doi>
</paper>
<paper id="17">
<title>Few-shot Natural Language Generation for Task-Oriented Dialog</title>
<author><first>Baolin</first><last>Peng</last></author>
<author><first>Chenguang</first><last>Zhu</last></author>
<author><first>Chunyuan</first><last>Li</last></author>
<author><first>Xiujun</first><last>Li</last></author>
<author><first>Jinchao</first><last>Li</last></author>
<author><first>Michael</first><last>Zeng</last></author>
<author><first>Jianfeng</first><last>Gao</last></author>
<pages>172–182</pages>
<abstract>As a crucial component in task-oriented dialog systems, the Natural Language Generation (NLG) module converts a dialog act represented in a semantic form into a response in natural language. The success of traditional template-based or statistical models typically relies on heavily annotated data, which is infeasible for new domains. Therefore, it is pivotal for an NLG system to generalize well with limited labelled data in real applications. To this end, we present FewshotWOZ, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems. Further, we develop the SC-GPT model. It is pre-trained on a large set of annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains. Experiments on FewshotWOZ and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods, measured by various automatic metrics and human evaluations.</abstract>
<url hash="bed257d1">2020.findings-emnlp.17</url>
<doi>10.18653/v1/2020.findings-emnlp.17</doi>
</paper>
<paper id="18">
<title>Mimic and Conquer: Heterogeneous Tree Structure Distillation for Syntactic <fixed-case>NLP</fixed-case></title>
<author><first>Hao</first><last>Fei</last></author>
<author><first>Yafeng</first><last>Ren</last></author>
<author><first>Donghong</first><last>Ji</last></author>
<pages>183–193</pages>
<abstract>Syntax has been shown useful for various NLP tasks, while existing work mostly encodes singleton syntactic tree using one hierarchical neural network. In this paper, we investigate a simple and effective method, Knowledge Distillation, to integrate heterogeneous structure knowledge into a unified sequential LSTM encoder. Experimental results on four typical syntax-dependent tasks show that our method outperforms tree encoders by effectively integrating rich heterogeneous structure syntax, meanwhile reducing error propagation, and also outperforms ensemble methods, in terms of both the efficiency and accuracy.</abstract>
<url hash="a398c8b9">2020.findings-emnlp.18</url>
<doi>10.18653/v1/2020.findings-emnlp.18</doi>
</paper>
<paper id="19">
<title>A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining</title>
<author><first>Chenguang</first><last>Zhu</last></author>
<author><first>Ruochen</first><last>Xu</last></author>
<author><first>Michael</first><last>Zeng</last></author>
<author><first>Xuedong</first><last>Huang</last></author>
<pages>194–203</pages>
<abstract>With the abundance of automatic meeting transcripts, meeting summarization is of great interest to both participants and other parties. Traditional methods of summarizing meetings depend on complex multi-step pipelines that make joint optimization intractable. Meanwhile, there are a handful of deep neural models for text summarization and dialogue systems. However, the semantic structure and styles of meeting transcripts are quite different from articles and conversations. In this paper, we propose a novel abstractive summary network that adapts to the meeting scenario. We design a hierarchical structure to accommodate long meeting transcripts and a role vector to depict the difference among speakers. Furthermore, due to the inadequacy of meeting summary data, we pretrain the model on large-scale news summary data. Empirical results show that our model outperforms previous approaches in both automatic metrics and human evaluation. For example, on ICSI dataset, the ROUGE-1 score increases from 34.66% to 46.28%.</abstract>
<url hash="a7a97807">2020.findings-emnlp.19</url>
<doi>10.18653/v1/2020.findings-emnlp.19</doi>
</paper>
<paper id="20">
<title>Active Testing: An Unbiased Evaluation Method for Distantly Supervised Relation Extraction</title>
<author><first>Pengshuai</first><last>Li</last></author>
<author><first>Xinsong</first><last>Zhang</last></author>
<author><first>Weijia</first><last>Jia</last></author>
<author><first>Wei</first><last>Zhao</last></author>
<pages>204–211</pages>
<abstract>Distant supervision has been a widely used method for neural relation extraction for its convenience of automatically labeling datasets. However, existing works on distantly supervised relation extraction suffer from the low quality of test set, which leads to considerable biased performance evaluation. These biases not only result in unfair evaluations but also mislead the optimization of neural relation extraction. To mitigate this problem, we propose a novel evaluation method named active testing through utilizing both the noisy test set and a few manual annotations. Experiments on a widely used benchmark show that our proposed approach can yield approximately unbiased evaluations for distantly supervised relation extractors.</abstract>
<url hash="1f1c39fc">2020.findings-emnlp.20</url>
<attachment type="OptionalSupplementaryMaterial" hash="b7f14ee4">2020.findings-emnlp.20.OptionalSupplementaryMaterial.pdf</attachment>
<doi>10.18653/v1/2020.findings-emnlp.20</doi>
</paper>
<paper id="21">
<title>Semantic Matching for Sequence-to-Sequence Learning</title>
<author><first>Ruiyi</first><last>Zhang</last></author>
<author><first>Changyou</first><last>Chen</last></author>
<author><first>Xinyuan</first><last>Zhang</last></author>
<author><first>Ke</first><last>Bai</last></author>
<author><first>Lawrence</first><last>Carin</last></author>
<pages>212–222</pages>
<abstract>In sequence-to-sequence models, classical optimal transport (OT) can be applied to semantically match generated sentences with target sentences. However, in non-parallel settings, target sentences are usually unavailable. To tackle this issue without losing the benefits of classical OT, we present a semantic matching scheme based on the Optimal Partial Transport (OPT). Specifically, our approach partially matches semantically meaningful words between source and partial target sequences. To overcome the difficulty of detecting active regions in OPT (corresponding to the words needed to be matched), we further exploit prior knowledge to perform partial matching. Extensive experiments are conducted to evaluate the proposed approach, showing consistent improvements over sequence-to-sequence tasks.</abstract>
<url hash="c7deecb8">2020.findings-emnlp.21</url>
<attachment type="OptionalSupplementaryMaterial" hash="f5bcb0ab">2020.findings-emnlp.21.OptionalSupplementaryMaterial.bbl</attachment>
<doi>10.18653/v1/2020.findings-emnlp.21</doi>
</paper>
<paper id="22">
<title>How Decoding Strategies Affect the Verifiability of Generated Text</title>
<author><first>Luca</first><last>Massarelli</last></author>
<author><first>Fabio</first><last>Petroni</last></author>
<author><first>Aleksandra</first><last>Piktus</last></author>
<author><first>Myle</first><last>Ott</last></author>
<author><first>Tim</first><last>Rocktäschel</last></author>
<author><first>Vassilis</first><last>Plachouras</last></author>
<author><first>Fabrizio</first><last>Silvestri</last></author>
<author><first>Sebastian</first><last>Riedel</last></author>
<pages>223–235</pages>
<abstract>Recent progress in pre-trained language models led to systems that are able to generate text of an increasingly high quality. While several works have investigated the fluency and grammatical correctness of such models, it is still unclear to what extent the generated text is consistent with factual world knowledge. Here, we go beyond fluency and also investigate the verifiability of text generated by state-of-the-art pre-trained language models. A generated sentence is verifiable if it can be corroborated or disproved by Wikipedia, and we find that the verifiability of generated text strongly depends on the decoding strategy. In particular, we discover a tradeoff between factuality (i.e., the ability to generate Wikipedia-corroborated text) and repetitiveness. While decoding strategies such as top-k and nucleus sampling lead to less repetitive generations, they also produce less verifiable text. Based on these findings, we introduce a simple and effective decoding strategy which, in comparison to previously used decoding strategies, produces less repetitive and more verifiable text.</abstract>
<url hash="6288dd54">2020.findings-emnlp.22</url>
<doi>10.18653/v1/2020.findings-emnlp.22</doi>
</paper>
<paper id="23">
<title>Minimize Exposure Bias of <fixed-case>S</fixed-case>eq2<fixed-case>S</fixed-case>eq Models in Joint Entity and Relation Extraction</title>
<author><first>Ranran Haoran</first><last>Zhang</last></author>
<author><first>Qianying</first><last>Liu</last></author>
<author><first>Aysa Xuemo</first><last>Fan</last></author>
<author><first>Heng</first><last>Ji</last></author>
<author><first>Daojian</first><last>Zeng</last></author>
<author><first>Fei</first><last>Cheng</last></author>
<author><first>Daisuke</first><last>Kawahara</last></author>
<author><first>Sadao</first><last>Kurohashi</last></author>
<pages>236–246</pages>
<abstract>Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These methods introduce exposure bias, which may cause the models to overfit to frequent label combinations, thus limiting the generalization ability. We propose a novel Sequence-to-Unordered-Multi-Tree (Seq2UMTree) model to minimize the effects of exposure bias by limiting the decoding length to three within a triplet and removing the order among triplets. We evaluate our model on two datasets, DuIE and NYT, and systematically study how exposure bias alters the performance of Seq2Seq models. Experiments show that the state-of-the-art Seq2Seq model overfits to both datasets while Seq2UMTree shows significantly better generalization. Our code is available at <url>https://github.com/WindChimeRan/OpenJERE</url>.</abstract>
<url hash="4e78d163">2020.findings-emnlp.23</url>
<doi>10.18653/v1/2020.findings-emnlp.23</doi>
</paper>
<paper id="24">
<title>Gradient-based Analysis of <fixed-case>NLP</fixed-case> Models is Manipulable</title>
<author><first>Junlin</first><last>Wang</last></author>
<author><first>Jens</first><last>Tuyls</last></author>
<author><first>Eric</first><last>Wallace</last></author>
<author><first>Sameer</first><last>Singh</last></author>
<pages>247–258</pages>
<abstract>Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, the fact that they directly reflect the model internals. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade Model that overwhelms the gradients without affecting the predictions. This Facade Model can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (sentiment analysis, NLI, and QA), we show that the merged model effectively fools different analysis tools: saliency maps differ significantly from the original model’s, input reduction keeps more irrelevant input tokens, and adversarial perturbations identify unimportant tokens as being highly important.</abstract>
<url hash="0ed856ab">2020.findings-emnlp.24</url>
<attachment type="OptionalSupplementaryMaterial" hash="590940a8">2020.findings-emnlp.24.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.24</doi>
</paper>
<paper id="25">
<title>Pretrain-<fixed-case>KGE</fixed-case>: Learning Knowledge Representation from Pretrained Language Models</title>
<author><first>Zhiyuan</first><last>Zhang</last></author>
<author><first>Xiaoqian</first><last>Liu</last></author>
<author><first>Yi</first><last>Zhang</last></author>
<author><first>Qi</first><last>Su</last></author>
<author><first>Xu</first><last>Sun</last></author>
<author><first>Bin</first><last>He</last></author>
<pages>259–266</pages>
<abstract>Conventional knowledge graph embedding (KGE) often suffers from limited knowledge representation, leading to performance degradation especially on the low-resource problem. To remedy this, we propose to enrich knowledge representation via pretrained language models by leveraging world knowledge from pretrained models. Specifically, we present a universal training framework named <i>Pretrain-KGE</i> consisting of three phases: semantic-based fine-tuning phase, knowledge extracting phase and KGE training phase. Extensive experiments show that our proposed Pretrain-KGE can improve results over KGE models, especially on solving the low-resource problem.</abstract>
<url hash="72c0a869">2020.findings-emnlp.25</url>
<doi>10.18653/v1/2020.findings-emnlp.25</doi>
</paper>
<paper id="26">
<title>A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction</title>
<author><first>Masato</first><last>Mita</last></author>
<author><first>Shun</first><last>Kiyono</last></author>
<author><first>Masahiro</first><last>Kaneko</last></author>
<author><first>Jun</first><last>Suzuki</last></author>
<author><first>Kentaro</first><last>Inui</last></author>
<pages>267–280</pages>
<abstract>Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of “noise” where errors were inappropriately edited or left uncorrected. To address this, we designed a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models, and outperformed strong denoising baseline methods. We further applied task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. We then analyzed the effect of the proposed denoising method, and found that our approach leads to improved coverage of corrections and facilitated fluency edits which are reflected in higher recall and overall performance.</abstract>
<url hash="4f6adf45">2020.findings-emnlp.26</url>
<doi>10.18653/v1/2020.findings-emnlp.26</doi>
</paper>
<paper id="27">
<title>Understanding tables with intermediate pre-training</title>
<author><first>Julian</first><last>Eisenschlos</last></author>
<author><first>Syrine</first><last>Krichene</last></author>
<author><first>Thomas</first><last>Müller</last></author>
<pages>281–296</pages>
<abstract>Table entailment, the binary classification task of finding if a sentence is supported or refuted by the content of a table, requires parsing language and table structure as well as numerical and discrete reasoning. While there is extensive work on textual entailment, table entailment is less well studied. We adapt TAPAS (Herzig et al., 2020), a table-based BERT model, to recognize entailment. Motivated by the benefits of data augmentation, we create a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning. This new data is not only useful for table entailment, but also for SQA (Iyyer et al., 2017), a sequential table QA task. To be able to use long examples as input of BERT models, we evaluate table pruning techniques as a pre-processing step to drastically improve the training and prediction efficiency at a moderate drop in accuracy. The different methods set the new state-of-the-art on the TabFact (Chen et al., 2020) and SQA datasets.</abstract>
<url hash="7e9b03bf">2020.findings-emnlp.27</url>
<attachment type="OptionalSupplementaryMaterial" hash="3ab55da8">2020.findings-emnlp.27.OptionalSupplementaryMaterial.pdf</attachment>
<doi>10.18653/v1/2020.findings-emnlp.27</doi>
</paper>
<paper id="28">
<title>Enhance Robustness of Sequence Labelling with Masked Adversarial Training</title>
<author><first>Luoxin</first><last>Chen</last></author>
<author><first>Xinyue</first><last>Liu</last></author>
<author><first>Weitong</first><last>Ruan</last></author>
<author><first>Jianhua</first><last>Lu</last></author>
<pages>297–302</pages>
<abstract>Adversarial training (AT) has shown strong regularization effects on deep learning algorithms by introducing small input perturbations to improve model robustness. In language tasks, adversarial training brings word-level robustness by adding input noise, which is beneficial for text classification. However, it lacks sufficient contextual information enhancement and thus is less useful for sequence labelling tasks such as chunking and named entity recognition (NER). To address this limitation, we propose masked adversarial training (MAT) to improve robustness from contextual information in sequence labelling. MAT masks or replaces some words in the sentence when computing adversarial loss from perturbed inputs and consequently enhances model robustness using more context-level information. In our experiments, our method shows significant improvements in the accuracy and robustness of sequence labelling. By further incorporating ELMo embeddings, our model achieves better or comparable results to the state of the art on the CoNLL 2000 and 2003 benchmarks using far fewer parameters.</abstract>
<url hash="c90e8180">2020.findings-emnlp.28</url>
<doi>10.18653/v1/2020.findings-emnlp.28</doi>
</paper>
<paper id="29">
<title>Multilingual Argument Mining: Datasets and Analysis</title>
<author><first>Orith</first><last>Toledo-Ronen</last></author>
<author><first>Matan</first><last>Orbach</last></author>
<author><first>Yonatan</first><last>Bilu</last></author>
<author><first>Artem</first><last>Spector</last></author>
<author><first>Noam</first><last>Slonim</last></author>
<pages>303–317</pages>
<abstract>The growing interest in argument mining and computational argumentation brings with it a plethora of Natural Language Understanding (NLU) tasks and corresponding datasets. However, as with many other NLU tasks, the dominant language is English, with resources in other languages being few and far between. In this work, we explore the potential of transfer learning using the multilingual BERT model to address argument mining tasks in non-English languages, based on English datasets and the use of machine translation. We show that such methods are well suited for classifying the stance of arguments and detecting evidence, but less so for assessing the quality of arguments, presumably because quality is harder to preserve under translation. In addition, focusing on the translate-train approach, we show how the choice of languages for translation, and the relations among them, affect the accuracy of the resultant model. Finally, to facilitate evaluation of transfer learning on argument mining tasks, we provide a human-generated dataset with more than 10k arguments in multiple languages, as well as machine translation of the English datasets.</abstract>
<url hash="2984952a">2020.findings-emnlp.29</url>
<doi>10.18653/v1/2020.findings-emnlp.29</doi>
</paper>
<paper id="30">
<title>Improving Grammatical Error Correction with Machine Translation Pairs</title>
<author><first>Wangchunshu</first><last>Zhou</last></author>
<author><first>Tao</first><last>Ge</last></author>
<author><first>Chang</first><last>Mu</last></author>
<author><first>Ke</first><last>Xu</last></author>
<author><first>Furu</first><last>Wei</last></author>
<author><first>Ming</first><last>Zhou</last></author>
<pages>318–328</pages>
<abstract>We propose a novel data synthesis method to generate diverse error-corrected sentence pairs for improving grammatical error correction, which is based on a pair of machine translation models (e.g., Chinese to English) of different qualities (i.e., poor and good). The poor translation model can resemble the ESL (English as a second language) learner and tends to generate translations of low quality in terms of fluency and grammaticality, while the good translation model generally generates fluent and grammatically correct translations. With the pair of translation models, we can generate unlimited numbers of poor-to-good English sentence pairs from text in the source language (e.g., Chinese) of the translators. Our approach can generate various error-corrected patterns and nicely complement the other data synthesis approaches for GEC. Experimental results demonstrate that the data generated by our approach can effectively help a GEC model to improve the performance and achieve state-of-the-art single-model performance on the BEA-19 and CoNLL-14 benchmark datasets.</abstract>
<url hash="1995e577">2020.findings-emnlp.30</url>
<doi>10.18653/v1/2020.findings-emnlp.30</doi>
</paper>
<paper id="31">
<title>Machines Getting with the Program: Understanding Intent Arguments of Non-Canonical Directives</title>
<author><first>Won Ik</first><last>Cho</last></author>
<author><first>Youngki</first><last>Moon</last></author>
<author><first>Sangwhan</first><last>Moon</last></author>
<author><first>Seok Min</first><last>Kim</last></author>
<author><first>Nam Soo</first><last>Kim</last></author>
<pages>329–339</pages>
<abstract>Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective. Along with these requirements, agents are expected to extrapolate intent from the user’s dialogue even when subjected to non-canonical forms of speech. This depends on the agent’s comprehension of paraphrased forms of such utterances. Especially in low-resource languages, the lack of data is a bottleneck that prevents advancements of the comprehension performance for these types of agents. In this regard, here we demonstrate the necessity of extracting the intent argument of non-canonical directives in a natural language format, which may yield more accurate parsing, and suggest guidelines for building a parallel corpus for this purpose. Following the guidelines, we construct a Korean corpus of 50K instances of question/command-intent pairs, including the labels for classification of the utterance type. We also propose a method for mitigating class imbalance, demonstrating the potential applications of the corpus generation method and its multilingual extensibility.</abstract>
<url hash="e52c1ad5">2020.findings-emnlp.31</url>
<doi>10.18653/v1/2020.findings-emnlp.31</doi>
</paper>
<paper id="32">
<title>The <fixed-case>RELX</fixed-case> Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification</title>
<author><first>Abdullatif</first><last>Köksal</last></author>
<author><first>Arzucan</first><last>Özgür</last></author>
<pages>340–350</pages>
<abstract>Relation classification is one of the key topics in information extraction, which can be used to construct knowledge bases or to provide useful information for question answering. Current approaches for relation classification are mainly focused on the English language and require lots of training data with human annotations. Creating and annotating a large amount of training data for low-resource languages is impractical and expensive. To overcome this issue, we propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup, which significantly improves the baseline with distant supervision. For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish, called RELX. We also provide the RELX-Distant dataset, which includes hundreds of thousands of sentences with relations from Wikipedia and Wikidata collected by distant supervision for these languages. Our code and data are available at: <url>https://github.com/boun-tabi/RELX</url></abstract>
<url hash="493bd041">2020.findings-emnlp.32</url>
<doi>10.18653/v1/2020.findings-emnlp.32</doi>
</paper>
<paper id="33">
<title>Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation</title>
<author><first>Giuseppe</first><last>Russo</last></author>
<author><first>Nora</first><last>Hollenstein</last></author>
<author><first>Claudiu Cristian</first><last>Musat</last></author>
<author><first>Ce</first><last>Zhang</last></author>
<pages>351–366</pages>
<abstract>We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA is able to generate natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured through a single discriminator, independently of the number of attributes. We show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model show significant improvements over a strong baseline, and a classification performance often comparable to adding the same amount of additional real data.</abstract>
<url hash="a0ff56f1">2020.findings-emnlp.33</url>
<doi>10.18653/v1/2020.findings-emnlp.33</doi>
</paper>
<paper id="34">
<title>Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation</title>
<author><first>Yiming</first><last>Xu</last></author>
<author><first>Lin</first><last>Chen</last></author>
<author><first>Zhongwei</first><last>Cheng</last></author>
<author><first>Lixin</first><last>Duan</last></author>
<author><first>Jiebo</first><last>Luo</last></author>
<pages>367–376</pages>
<abstract>We study the problem of visual question answering (VQA) in images by exploiting supervised domain adaptation, where there is a large amount of labeled data in the source domain but only limited labeled data in the target domain, with the goal to train a good target model. A straightforward solution is to fine-tune a pre-trained source model by using those limited labeled target data, but it usually cannot work well due to the considerable difference between the data distributions of the source and target domains. Moreover, the availability of multiple modalities (i.e., images, questions and answers) in VQA poses further challenges in modeling the transferability between various modalities. In this paper, we address the above issues by proposing a novel supervised multi-modal domain adaptation method for VQA to learn joint feature embeddings across different domains and modalities. Specifically, we align the data distributions of the source and target domains by considering those modalities both jointly and separately. Extensive experiments on the benchmark VQA 2.0 and VizWiz datasets demonstrate that our proposed method outperforms the existing state-of-the-art baselines for open-ended VQA in this challenging domain adaptation setting.</abstract>
<url hash="e84f2b61">2020.findings-emnlp.34</url>
<doi>10.18653/v1/2020.findings-emnlp.34</doi>
</paper>
<paper id="35">
<title>Dual Low-Rank Multimodal Fusion</title>
<author><first>Tao</first><last>Jin</last></author>
<author><first>Siyu</first><last>Huang</last></author>
<author><first>Yingming</first><last>Li</last></author>
<author><first>Zhongfei</first><last>Zhang</last></author>
<pages>377–387</pages>
<abstract>Tensor-based fusion methods have been proven effective in multimodal fusion tasks. However, existing tensor-based methods make poor use of the fine-grained temporal dynamics of multimodal sequential features. Motivated by this observation, this paper proposes a novel multimodal fusion method called Fine-Grained Temporal Low-Rank Multimodal Fusion (FT-LMF). FT-LMF correlates the features of individual time steps between multiple modalities, while it involves multiplications of high-order tensors in its calculation. This paper further proposes Dual Low-Rank Multimodal Fusion (Dual-LMF) to reduce the computational complexity of FT-LMF through low-rank tensor approximation along dual dimensions of input features. Dual-LMF is conceptually simple and practically effective and efficient. Empirical studies on benchmark multimodal analysis tasks show that our proposed methods outperform the state-of-the-art tensor-based fusion methods with a similar computational complexity.</abstract>
<url hash="02353d35">2020.findings-emnlp.35</url>
<attachment type="OptionalSupplementaryMaterial" hash="7b2ce962">2020.findings-emnlp.35.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.35</doi>
</paper>
<paper id="36">
<title>Contextual Modulation for Relation-Level Metaphor Identification</title>
<author><first>Omnia</first><last>Zayed</last></author>
<author><first>John P.</first><last>McCrae</last></author>
<author><first>Paul</first><last>Buitelaar</last></author>
<pages>388–406</pages>
<abstract>Identifying metaphors in text is very challenging and requires comprehending the underlying comparison. The automation of this cognitive process has gained wide attention lately. However, the majority of existing approaches concentrate on word-level identification by treating the task as either single-word classification or sequential labelling without explicitly modelling the interaction between the metaphor components. On the other hand, while existing relation-level approaches implicitly model this interaction, they ignore the context where the metaphor occurs. In this work, we address these limitations by introducing a novel architecture for identifying relation-level metaphoric expressions of certain grammatical relations based on contextual modulation. In a methodology inspired by works in visual reasoning, our approach is based on conditioning the neural network computation on the deep contextualised features of the candidate expressions using feature-wise linear modulation. We demonstrate that the proposed architecture achieves state-of-the-art results on benchmark datasets. The proposed methodology is generic and could be applied to other textual classification problems that benefit from contextual interaction.</abstract>
<url hash="87a488fb">2020.findings-emnlp.36</url>
<doi>10.18653/v1/2020.findings-emnlp.36</doi>
</paper>
<paper id="37">
<title>Context-aware Stand-alone Neural Spelling Correction</title>
<author><first>Xiangci</first><last>Li</last></author>
<author><first>Hairong</first><last>Liu</last></author>
<author><first>Liang</first><last>Huang</last></author>
<pages>407–414</pages>
<abstract>Existing natural language processing systems are vulnerable to noisy inputs resulting from misspellings. On the contrary, humans can easily infer the corresponding correct words from their misspellings and surrounding context. Inspired by this, we address the stand-alone spelling correction problem, which only corrects the spelling of each token without additional token insertion or deletion, by utilizing both spelling information and global context representations. We present a simple yet powerful solution that jointly detects and corrects misspellings as a sequence labeling task by fine-tuning a pre-trained language model. Our solution outperforms the previous state-of-the-art result by 12.8% absolute F0.5 score.</abstract>
<url hash="d4f9903a">2020.findings-emnlp.37</url>
<doi>10.18653/v1/2020.findings-emnlp.37</doi>
</paper>
<paper id="38">
<title>A Novel Workflow for Accurately and Efficiently Crowdsourcing Predicate Senses and Argument Labels</title>
<author><first>Youxuan</first><last>Jiang</last></author>
<author><first>Huaiyu</first><last>Zhu</last></author>
<author><first>Jonathan K.</first><last>Kummerfeld</last></author>
<author><first>Yunyao</first><last>Li</last></author>
<author><first>Walter</first><last>Lasecki</last></author>
<pages>415–421</pages>
<abstract>Resources for Semantic Role Labeling (SRL) are typically annotated by experts at great expense. Prior attempts to develop crowdsourcing methods have either had low accuracy or required substantial expert annotation. We propose a new multi-stage crowd workflow that substantially reduces expert involvement without sacrificing accuracy. In particular, we introduce a unique filter stage based on the key observation that crowd workers are able to almost perfectly filter out incorrect options for labels. Our three-stage workflow produces annotations with 95% accuracy for predicate labels and 93% for argument labels, which is comparable to expert agreement. Compared to prior work on crowdsourcing for SRL, we decrease expert effort by 4x, from 56% to 14% of cases. Our approach enables more scalable annotation of SRL, and could enable annotation of NLP tasks that have previously been considered too complex to effectively crowdsource.</abstract>
<url hash="ba5398bc">2020.findings-emnlp.38</url>
<attachment type="OptionalSupplementaryMaterial" hash="ffe48dc2">2020.findings-emnlp.38.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.38</doi>
</paper>
<paper id="39">
<title><fixed-case>K</fixed-case>or<fixed-case>NLI</fixed-case> and <fixed-case>K</fixed-case>or<fixed-case>STS</fixed-case>: New Benchmark Datasets for <fixed-case>K</fixed-case>orean Natural Language Understanding</title>
<author><first>Jiyeon</first><last>Ham</last></author>
<author><first>Yo Joong</first><last>Choe</last></author>
<author><first>Kyubyong</first><last>Park</last></author>
<author><first>Ilji</first><last>Choi</last></author>
<author><first>Hyungjoon</first><last>Soh</last></author>
<pages>422–430</pages>
<abstract>Natural language inference (NLI) and semantic textual similarity (STS) are key tasks in natural language understanding (NLU). Although several benchmark datasets for those tasks have been released in English and a few other languages, there are no publicly available NLI or STS datasets in the Korean language. Motivated by this, we construct and release new datasets for Korean NLI and STS, dubbed KorNLI and KorSTS, respectively. Following previous approaches, we machine-translate existing English training sets and manually translate development and test sets into Korean. To accelerate research on Korean NLU, we also establish baselines on KorNLI and KorSTS. Our datasets are publicly available at <url>https://github.com/kakaobrain/KorNLUDatasets</url>.</abstract>
<url hash="9b37635d">2020.findings-emnlp.39</url>
<doi>10.18653/v1/2020.findings-emnlp.39</doi>
</paper>
<paper id="40">
<title>Dialogue Generation on Infrequent Sentence Functions via Structured Meta-Learning</title>
<author><first>Yifan</first><last>Gao</last></author>
<author><first>Piji</first><last>Li</last></author>
<author><first>Wei</first><last>Bi</last></author>
<author><first>Xiaojiang</first><last>Liu</last></author>
<author><first>Michael</first><last>Lyu</last></author>
<author><first>Irwin</first><last>King</last></author>
<pages>431–440</pages>
<abstract>Sentence function is an important linguistic feature indicating the communicative purpose in uttering a sentence. Incorporating sentence functions into conversations has shown improvements in the quality of generated responses. However, the number of utterances for different types of fine-grained sentence functions is extremely imbalanced. Besides a small number of high-resource sentence functions, a large portion of sentence functions is infrequent. Consequently, dialogue generation conditioned on these infrequent sentence functions suffers from data deficiency. In this paper, we investigate a structured meta-learning (SML) approach for dialogue generation on infrequent sentence functions. We treat dialogue generation conditioned on different sentence functions as separate tasks, and apply model-agnostic meta-learning to high-resource sentence function data. Furthermore, SML enhances meta-learning effectiveness by promoting knowledge customization among different sentence functions but simultaneously preserving knowledge generalization for similar sentence functions. Experimental results demonstrate that SML not only improves the informativeness and relevance of generated responses, but also can generate responses consistent with the target sentence functions. Code will be made public to facilitate research along this line.</abstract>
<url hash="aeed4d22">2020.findings-emnlp.40</url>
<doi>10.18653/v1/2020.findings-emnlp.40</doi>
</paper>
<paper id="41">
<title>Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning</title>
<author><first>Zhaojiang</first><last>Lin</last></author>
<author><first>Andrea</first><last>Madotto</last></author>
<author><first>Pascale</first><last>Fung</last></author>
<pages>441–459</pages>
<abstract>Fine-tuning pre-trained generative language models to down-stream language generation tasks has shown promising results. However, this comes with the cost of having a single, large model for each task, which is not ideal in low-memory/power scenarios (e.g., mobile). In this paper, we propose an effective way to fine-tune multiple down-stream generation tasks simultaneously using a single, large pretrained model. The experiments on five diverse language generation tasks show that by just using an additional 2-3% parameters for each task, our model can maintain or even improve the performance of fine-tuning the whole model.</abstract>
<url hash="94c828b9">2020.findings-emnlp.41</url>
<attachment type="OptionalSupplementaryMaterial" hash="2a014db2">2020.findings-emnlp.41.OptionalSupplementaryMaterial.pdf</attachment>
<doi>10.18653/v1/2020.findings-emnlp.41</doi>
</paper>
<paper id="42">
<title>A Fully Hyperbolic Neural Model for Hierarchical Multi-Class Classification</title>
<author><first>Federico</first><last>López</last></author>
<author><first>Michael</first><last>Strube</last></author>
<pages>460–475</pages>
<abstract>Label inventories for fine-grained entity typing have grown in size and complexity. Nonetheless, they exhibit a hierarchical structure. Hyperbolic spaces offer a mathematically appealing approach for learning hierarchical representations of symbolic data. However, it is not clear how to integrate hyperbolic components into downstream tasks. This is the first work that proposes a fully hyperbolic model for multi-class multi-label classification, which performs all operations in hyperbolic space. We evaluate the proposed model on two challenging datasets and compare to different baselines that operate under Euclidean assumptions. Our hyperbolic model infers the latent hierarchy from the class distribution, captures implicit hyponymic relations in the inventory, and shows performance on par with state-of-the-art methods on fine-grained classification with remarkable reduction of the parameter size. A thorough analysis sheds light on the impact of each component in the final prediction and showcases its ease of integration with Euclidean layers.</abstract>
<url hash="91ec97ee">2020.findings-emnlp.42</url>
<doi>10.18653/v1/2020.findings-emnlp.42</doi>
</paper>
<paper id="43">
<title>Claim Check-Worthiness Detection as Positive Unlabelled Learning</title>
<author><first>Dustin</first><last>Wright</last></author>
<author><first>Isabelle</first><last>Augenstein</last></author>
<pages>476–488</pages>
<abstract>As the first step of automatic fact checking, claim check-worthiness detection is a critical component of fact checking systems. There are multiple lines of research which study this problem: check-worthiness ranking from political speeches and debates, rumour detection on Twitter, and citation needed detection from Wikipedia. To date, there has been no structured comparison of these various tasks to understand their relatedness, and no investigation into whether or not a unified approach to all of them is achievable. In this work, we illuminate a central challenge in claim check-worthiness detection underlying all of these tasks, namely that they hinge upon detecting both how factual a sentence is and how likely it is to be believed without verification. As a result, annotators only mark those instances they judge to be clear-cut check-worthy. Our best performing method is a unified approach which automatically corrects for this using a variant of positive unlabelled learning that finds instances which were incorrectly labelled as not check-worthy. In applying this, we out-perform the state of the art in two of the three tasks studied for claim check-worthiness detection in English.</abstract>
<url hash="720def73">2020.findings-emnlp.43</url>
<doi>10.18653/v1/2020.findings-emnlp.43</doi>
</paper>
<paper id="44">
<title><fixed-case>C</fixed-case>oncept<fixed-case>B</fixed-case>ert: Concept-Aware Representation for Visual Question Answering</title>
<author><first>François</first><last>Gardères</last></author>
<author><first>Maryam</first><last>Ziaeefard</last></author>
<author><first>Baptiste</first><last>Abeloos</last></author>
<author><first>Freddy</first><last>Lecue</last></author>
<pages>489–498</pages>
<abstract>Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. A VQA model combines visual and textual features in order to answer questions grounded in an image. Current works in VQA focus on questions which are answerable by direct analysis of the question and image alone. We present a concept-aware algorithm, ConceptBert, for questions which require common sense, or basic factual knowledge from external structured content. Given an image and a question in natural language, ConceptBert requires visual elements of the image and a Knowledge Graph (KG) to infer the correct answer. We introduce a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture. We exploit ConceptNet KG for encoding the common sense knowledge and evaluate our methodology on the Outside Knowledge-VQA (OK-VQA) and VQA datasets.</abstract>
<url hash="180c5663">2020.findings-emnlp.44</url>
<doi>10.18653/v1/2020.findings-emnlp.44</doi>
</paper>
<paper id="45">
<title>Bootstrapping a Crosslingual Semantic Parser</title>
<author><first>Tom</first><last>Sherborne</last></author>
<author><first>Yumo</first><last>Xu</last></author>
<author><first>Mirella</first><last>Lapata</last></author>
<pages>499–517</pages>
<abstract>Recent progress in semantic parsing scarcely considers languages other than English, but professional translation can be prohibitively expensive. We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation. We ask whether machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models. We develop a Transformer-based parser combining paraphrases by ensembling attention over multiple encoders and present new versions of ATIS and Overnight in German and Chinese for evaluation. Experimental results indicate that MT can approximate training data in a new language for accurate parsing when augmented with paraphrasing through multiple MT engines. When MT is inadequate, we also find that our approach achieves parsing accuracy within 2% of complete translation while using only 50% of the training data.</abstract>
<url hash="33984c37">2020.findings-emnlp.45</url>
<doi>10.18653/v1/2020.findings-emnlp.45</doi>
</paper>
<paper id="46">
<title>Revisiting Representation Degeneration Problem in Language Modeling</title>
<author><first>Zhong</first><last>Zhang</last></author>
<author><first>Chongming</first><last>Gao</last></author>
<author><first>Cong</first><last>Xu</last></author>
<author><first>Rui</first><last>Miao</last></author>
<author><first>Qinli</first><last>Yang</last></author>
<author><first>Junming</first><last>Shao</last></author>
<pages>518–527</pages>
<abstract>Weight tying is now a common setting in many language generation tasks such as language modeling and machine translation. However, a recent study reveals that there is a potential flaw in weight tying. They find that the learned word embeddings are likely to degenerate and lie in a narrow cone when training a language model. They call it the representation degeneration problem and propose a cosine regularization to solve it. Nevertheless, we prove that the cosine regularization is insufficient to solve the problem, as the degeneration is still likely to happen under certain conditions. In this paper, we revisit the representation degeneration problem and theoretically analyze the limitations of the previously proposed solution. Afterward, we propose an alternative regularization method called Laplacian regularization to tackle the problem. Experiments on language modeling demonstrate the effectiveness of the proposed Laplacian regularization.</abstract>
<url hash="caeb7d96">2020.findings-emnlp.46</url>
<doi>10.18653/v1/2020.findings-emnlp.46</doi>
</paper>
<paper id="47">
<title>The workweek is the best time to start a family – A Study of <fixed-case>GPT</fixed-case>-2 Based Claim Generation</title>
<author><first>Shai</first><last>Gretz</last></author>
<author><first>Yonatan</first><last>Bilu</last></author>
<author><first>Edo</first><last>Cohen-Karlik</last></author>
<author><first>Noam</first><last>Slonim</last></author>
<pages>528–544</pages>
<abstract>Argument generation is a challenging task whose research is timely considering its potential impact on social media and the dissemination of information. Here we suggest a pipeline based on GPT-2 for generating coherent claims, and explore the types of claims that it produces, and their veracity, using an array of manual and automatic assessments. In addition, we explore the interplay between this task and the task of Claim Retrieval, showing how they can complement one another.</abstract>
<url hash="c9e7447c">2020.findings-emnlp.47</url>
<doi>10.18653/v1/2020.findings-emnlp.47</doi>
</paper>
<paper id="48">
<title>Dynamic Data Selection for Curriculum Learning via Ability Estimation</title>
<author><first>John P.</first><last>Lalor</last></author>
<author><first>Hong</first><last>Yu</last></author>
<pages>545–555</pages>
<abstract>Curriculum learning methods typically rely on heuristics to estimate the difficulty of training examples or the ability of the model. In this work, we propose replacing difficulty heuristics with learned difficulty parameters. We also propose Dynamic Data selection for Curriculum Learning via Ability Estimation (DDaCLAE), a strategy that probes model ability at each training epoch to select the best training examples at that point. We show that models using learned difficulty and/or ability outperform heuristic-based curriculum learning models on the GLUE classification tasks.</abstract>
<url hash="15608826">2020.findings-emnlp.48</url>
<doi>10.18653/v1/2020.findings-emnlp.48</doi>
</paper>
<paper id="49">
<title>Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation</title>
<author><first>Alessandro</first><last>Raganato</last></author>
<author><first>Yves</first><last>Scherrer</last></author>
<author><first>Jörg</first><last>Tiedemann</last></author>
<pages>556–568</pages>
<abstract>Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different parts of the input. However, recent works have shown that most attention heads learn simple, and often redundant, positional patterns. In this paper, we propose to replace all but one attention head of each encoder layer with simple fixed – non-learnable – attentive patterns that are solely based on position and do not require any external knowledge. Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality and even increases BLEU scores by up to 3 points in low-resource scenarios.</abstract>
<url hash="1c0718e0">2020.findings-emnlp.49</url>
<doi>10.18653/v1/2020.findings-emnlp.49</doi>
</paper>
<paper id="50">
<title><fixed-case>ZEST</fixed-case>: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization</title>
<author><first>Tzuf</first><last>Paz-Argaman</last></author>
<author><first>Reut</first><last>Tsarfaty</last></author>
<author><first>Gal</first><last>Chechik</last></author>
<author><first>Yuval</first><last>Atzmon</last></author>
<pages>569–579</pages>
<abstract>We study the problem of recognizing visual entities from the textual descriptions of their classes. Specifically, given birds’ images with free-text descriptions of their species, we learn to classify images of previously-unseen species based on species descriptions. This setup has been studied in the vision community under the name zero-shot learning from text, focusing on learning to transfer knowledge about visual aspects of birds from seen classes to previously-unseen ones. Here, we suggest focusing on the textual description and distilling from the description the most relevant information to effectively match visual features to the parts of the text that discuss them. Specifically, (1) we propose to leverage the similarity between species, reflected in the similarity between text descriptions of the species. (2) we derive visual summaries of the texts, i.e., extractive summaries that focus on the visual features that tend to be reflected in images. We propose a simple attention-based model augmented with the similarity and visual summaries components. Our empirical results consistently and significantly outperform the state-of-the-art on the largest benchmarks for text-based zero-shot learning, illustrating the critical importance of texts for zero-shot image-recognition.</abstract>
<url hash="307870e2">2020.findings-emnlp.50</url>
<doi>10.18653/v1/2020.findings-emnlp.50</doi>
</paper>
<paper id="51">
<title>Few-Shot Multi-Hop Relation Reasoning over Knowledge Bases</title>
<author><first>Chuxu</first><last>Zhang</last></author>
<author><first>Lu</first><last>Yu</last></author>
<author><first>Mandana</first><last>Saebi</last></author>
<author><first>Meng</first><last>Jiang</last></author>
<author><first>Nitesh</first><last>Chawla</last></author>
<pages>580–585</pages>
<abstract>Multi-hop relation reasoning over a knowledge base aims to generate effective and interpretable relation predictions through reasoning paths. Current methods usually require sufficient training data (fact triples) for each query relation, impairing their performance on few-shot relations (with limited triples), which are common in knowledge bases. To this end, we propose FIRE, a novel few-shot multi-hop relation learning model. FIRE applies reinforcement learning to model the sequential steps of multi-hop reasoning, and additionally performs heterogeneous structure encoding and knowledge-aware search space pruning. The meta-learning technique is employed to optimize model parameters so that they can quickly adapt to few-shot relations. Empirical studies on two datasets demonstrate that FIRE outperforms state-of-the-art methods.</abstract>
<url hash="37de63ad">2020.findings-emnlp.51</url>
<doi>10.18653/v1/2020.findings-emnlp.51</doi>
</paper>
<paper id="52">
<title>A structure-enhanced graph convolutional network for sentiment analysis</title>
<author><first>Fanyu</first><last>Meng</last></author>
<author><first>Junlan</first><last>Feng</last></author>
<author><first>Danping</first><last>Yin</last></author>
<author><first>Si</first><last>Chen</last></author>
<author><first>Min</first><last>Hu</last></author>
<pages>586–595</pages>
<abstract>Syntactic information is essential for both sentiment analysis (SA) and aspect-based sentiment analysis (ABSA). Previous work has already achieved great progress utilizing Graph Convolutional Networks (GCN) over the dependency tree of a sentence. However, these models do not fully exploit the syntactic information obtained from dependency parsing, such as the diversified types of dependency relations. The message passing process of GCN should be distinguished based on this syntactic information. To tackle this problem, we design a novel weighted graph convolutional network (WGCN) which can exploit rich syntactic information based on feature combination. Furthermore, we utilize BERT instead of Bi-LSTM to generate contextualized representations as inputs for GCN and present an alignment method to keep word-level dependencies consistent with the wordpiece units of BERT. With our proposal, we are able to improve the state-of-the-art on four ABSA tasks out of six and two SA tasks out of three.</abstract>
<url hash="3175d0c8">2020.findings-emnlp.52</url>
<doi>10.18653/v1/2020.findings-emnlp.52</doi>
</paper>
<paper id="53">
<title><fixed-case>PB</fixed-case>o<fixed-case>S</fixed-case>: Probabilistic Bag-of-Subwords for Generalizing Word Embedding</title>
<author><first>Zhao</first><last>Jinman</last></author>
<author><first>Shawn</first><last>Zhong</last></author>
<author><first>Xiaomin</first><last>Zhang</last></author>
<author><first>Yingyu</first><last>Liang</last></author>
<pages>596–611</pages>
<abstract>We look into the task of generalizing word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words, without extra contextual information. We rely solely on the spellings of words and propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embeddings. We call the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords for all possible segmentations based on their likelihood. Inspections and affix prediction experiments show that PBoS is able to produce meaningful subword segmentations and subword rankings without any source of explicit morphological knowledge. Word similarity and POS tagging experiments show clear advantages of PBoS over previous subword-level models in the quality of generated word embeddings across languages.</abstract>
<url hash="667f7f62">2020.findings-emnlp.53</url>
<doi>10.18653/v1/2020.findings-emnlp.53</doi>
</paper>
<paper id="54">
<title>Interpretable Entity Representations through Large-Scale Typing</title>
<author><first>Yasumasa</first><last>Onoe</last></author>
<author><first>Greg</first><last>Durrett</last></author>
<pages>612–624</pages>
<abstract>In standard methodology for natural language processing, entities in text are typically embedded in dense vector spaces with pre-trained models. The embeddings produced this way are effective when fed into downstream models, but they require end-task fine-tuning and are fundamentally difficult to interpret. In this paper, we present an approach to creating entity representations that are human readable and achieve high performance on entity-related tasks out of the box. Our representations are vectors whose values correspond to posterior probabilities over fine-grained entity types, indicating the confidence of a typing model’s decision that the entity belongs to the corresponding type. We obtain these representations using a fine-grained entity typing model, trained either on supervised ultra-fine entity typing data (Choi et al. 2018) or distantly-supervised examples from Wikipedia. On entity probing tasks involving recognizing entity identity, our embeddings used in parameter-free downstream models achieve competitive performance with ELMo- and BERT-based embeddings in trained models. We also show that it is possible to reduce the size of our type set in a learning-based way for particular domains. Finally, we show that these embeddings can be post-hoc modified through a small number of rules to incorporate domain knowledge and improve performance.</abstract>
<url hash="09691a65">2020.findings-emnlp.54</url>
<doi>10.18653/v1/2020.findings-emnlp.54</doi>
</paper>
<paper id="55">
<title>Empirical Studies of Institutional Federated Learning For Natural Language Processing</title>
<author><first>Xinghua</first><last>Zhu</last></author>
<author><first>Jianzong</first><last>Wang</last></author>
<author><first>Zhenhou</first><last>Hong</last></author>
<author><first>Jing</first><last>Xiao</last></author>
<pages>625–634</pages>
<abstract>Federated learning has sparked new interest in the deep learning community in making use of isolated data sources from independent institutes. With the development of novel training tools, we have successfully deployed federated natural language processing networks on GPU-enabled server clusters. This paper demonstrates federated training of a popular NLP model, TextCNN, with applications in sentence intent classification. Furthermore, differential privacy is introduced to protect participants in the training process, in a manageable manner. Distinguished from previous client-level privacy protection schemes, the proposed differentially private federated learning procedure is defined at the dataset sample level, aligned with applications among institutions rather than individual users. Optimal settings of hyper-parameters for the federated TextCNN model are studied through comprehensive experiments. We also evaluated the performance of the federated TextCNN model under imbalanced data load configurations. Experiments show that the sampling ratio has a large impact on the performance of the FL models, causing up to a 38.4% decrease in the test accuracy, while they are robust to different noise multiplier levels, with less than 3% variance in the test accuracy. It is also found that the FL models are sensitive to data load balancedness among client datasets. When the data load is imbalanced, model performance drops by up to 10%.</abstract>
<url hash="cdb93921">2020.findings-emnlp.55</url>
<doi>10.18653/v1/2020.findings-emnlp.55</doi>
</paper>
<paper id="56">
<title><fixed-case>N</fixed-case>eu<fixed-case>R</fixed-case>educe: Reducing Mixed <fixed-case>B</fixed-case>oolean-Arithmetic Expressions by Recurrent Neural Network</title>
<author><first>Weijie</first><last>Feng</last></author>
<author><first>Binbin</first><last>Liu</last></author>
<author><first>Dongpeng</first><last>Xu</last></author>
<author><first>Qilong</first><last>Zheng</last></author>
<author><first>Yun</first><last>Xu</last></author>
<pages>635–644</pages>
<abstract>Mixed Boolean-Arithmetic (MBA) expressions involve both arithmetic calculation (e.g., plus, minus, multiply) and bitwise computation (e.g., and, or, negate, xor). MBA expressions have been widely applied in software obfuscation, transforming programs from a simple form to a complex form. MBA expressions are challenging to simplify, because the interleaved bitwise and arithmetic operations cause mathematical reduction laws to be ineffective. Our goal is to recover the original, simple form from an obfuscated MBA expression. In this paper, we first propose NeuReduce, a string-to-string method based on neural networks to automatically learn and reduce complex MBA expressions. We develop a comprehensive MBA dataset, including one million diversified MBA expression samples and corresponding simplified forms. After training on the dataset, NeuReduce can reduce MBA expressions to simpler but mathematically equivalent forms. By comparing with three state-of-the-art MBA reduction methods, our evaluation shows that NeuReduce outperforms all other tools in terms of accuracy, solving time, and performance overhead.</abstract>
<url hash="7563f08c">2020.findings-emnlp.56</url>
<doi>10.18653/v1/2020.findings-emnlp.56</doi>
</paper>
<paper id="57">
<title>From Language to Language-ish: How Brain-Like is an <fixed-case>LSTM</fixed-case>’s Representation of Nonsensical Language Stimuli?</title>
<author><first>Maryam</first><last>Hashemzadeh</last></author>
<author><first>Greta</first><last>Kaufeld</last></author>
<author><first>Martha</first><last>White</last></author>
<author><first>Andrea E.</first><last>Martin</last></author>
<author><first>Alona</first><last>Fyshe</last></author>
<pages>645–656</pages>
<abstract>The representations generated by many models of language (word embeddings, recurrent neural networks and transformers) correlate to brain activity recorded while people read. However, these decoding results are usually based on the brain’s reaction to syntactically and semantically sound language stimuli. In this study, we asked: how does an LSTM (long short term memory) language model, trained (by and large) on semantically and syntactically intact language, represent a language sample with degraded semantic or syntactic information? Does the LSTM representation still resemble the brain’s reaction? We found that, even for some kinds of nonsensical language, there is a statistically significant relationship between the brain’s activity and the representations of an LSTM. This indicates that, at least in some instances, LSTMs and the human brain handle nonsensical data similarly.</abstract>
<url hash="a64b646f">2020.findings-emnlp.57</url>
<doi>10.18653/v1/2020.findings-emnlp.57</doi>
</paper>
<paper id="58">
<title>Revisiting Pre-Trained Models for <fixed-case>C</fixed-case>hinese Natural Language Processing</title>
<author><first>Yiming</first><last>Cui</last></author>
<author><first>Wanxiang</first><last>Che</last></author>
<author><first>Ting</first><last>Liu</last></author>
<author><first>Bing</first><last>Qin</last></author>
<author><first>Shijin</first><last>Wang</last></author>
<author><first>Guoping</first><last>Hu</last></author>
<pages>657–668</pages>
<abstract>Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and consecutive variants have been proposed to further improve the performance of the pre-trained language models. In this paper, we revisit Chinese pre-trained language models to examine their effectiveness in a non-English language and release the Chinese pre-trained language model series to the community. We also propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways, especially the masking strategy that adopts MLM as correction (Mac). We carried out extensive experiments on eight Chinese NLP tasks to revisit the existing pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT can achieve state-of-the-art performance on many NLP tasks, and we also ablate details with several findings that may help future research. https://github.com/ymcui/MacBERT</abstract>
<url hash="380bb6a3">2020.findings-emnlp.58</url>
<doi>10.18653/v1/2020.findings-emnlp.58</doi>
</paper>
<paper id="59">
<title>Cascaded Semantic and Positional Self-Attention Network for Document Classification</title>
<author><first>Juyong</first><last>Jiang</last></author>
<author><first>Jie</first><last>Zhang</last></author>
<author><first>Kai</first><last>Zhang</last></author>
<pages>669–677</pages>
<abstract>Transformers have shown great success in learning representations for language modelling. However, an open challenge still remains on how to systematically aggregate semantic information (word embeddings) with positional (or temporal) information (word orders). In this work, we propose a new architecture that aggregates these two sources of information using a cascaded semantic and positional self-attention network (CSPAN) in the context of document classification. The CSPAN uses a semantic self-attention layer cascaded with a Bi-LSTM to process the semantic and positional information in a sequential manner, and then adaptively combines them through a residual connection. Compared with commonly used positional encoding schemes, CSPAN can exploit the interaction between semantics and word positions in a more interpretable and adaptive manner, and the classification performance can be notably improved while simultaneously preserving a compact model size and high convergence rate. We evaluate the CSPAN model on several benchmark data sets for document classification with careful ablation studies, and demonstrate encouraging results compared with the state of the art.</abstract>
<url hash="f38ec081">2020.findings-emnlp.59</url>
<doi>10.18653/v1/2020.findings-emnlp.59</doi>
</paper>
<paper id="60">
<title>Toward Recognizing More Entity Types in <fixed-case>NER</fixed-case>: An Efficient Implementation using Only Entity Lexicons</title>
<author><first>Minlong</first><last>Peng</last></author>
<author><first>Ruotian</first><last>Ma</last></author>
<author><first>Qi</first><last>Zhang</last></author>
<author><first>Lujun</first><last>Zhao</last></author>
<author><first>Mengxi</first><last>Wei</last></author>
<author><first>Changlong</first><last>Sun</last></author>
<author><first>Xuanjing</first><last>Huang</last></author>
<pages>678–688</pages>
<abstract>In this work, we explore how to quickly adjust an existing named entity recognition (NER) system so that it can recognize entity types not defined in the system. As an illustrative example, consider the case where an NER system has been built to recognize person and organization names, and is now required to additionally recognize job titles. Such situations are common in industrial settings, where the entity types to be recognized vary a lot across different products and keep changing. To avoid laborious data labeling and achieve fast adaptation, we propose to adjust the existing NER system using the previously labeled data and entity lexicons of the newly introduced entity types. We formulate such a task as a partially supervised learning problem and accordingly propose an effective algorithm to solve it. Comprehensive experimental studies on several public NER datasets validate the effectiveness of our method.</abstract>
<url hash="fb5fd953">2020.findings-emnlp.60</url>
<doi>10.18653/v1/2020.findings-emnlp.60</doi>
</paper>
<paper id="61">
<title>From Disjoint Sets to Parallel Data to Train <fixed-case>S</fixed-case>eq2<fixed-case>S</fixed-case>eq Models for Sentiment Transfer</title>
<author><first>Paulo</first><last>Cavalin</last></author>
<author><first>Marisa</first><last>Vasconcelos</last></author>
<author><first>Marcelo</first><last>Grave</last></author>
<author><first>Claudio</first><last>Pinhanez</last></author>
<author><first>Victor Henrique</first><last>Alves Ribeiro</last></author>
<pages>689–698</pages>
<abstract>We present a method for creating parallel data to train Seq2Seq neural networks for sentiment transfer. Most systems for this task, which can be viewed as monolingual machine translation (MT), have relied on unsupervised methods, such as Generative Adversarial Networks (GANs)-inspired approaches, for coping with the lack of parallel corpora. Given that the literature shows that Seq2Seq methods have been consistently outperforming unsupervised methods in MT-related tasks, in this work we exploit the use of semantic similarity computation for converting non-parallel data into a parallel corpus. That allows us to train a transformer neural network for the sentiment transfer task, and compare its performance against unsupervised approaches. With experiments conducted on two well-known public datasets, i.e., Yelp and Amazon, we demonstrate that the proposed methodology outperforms existing unsupervised methods very consistently in fluency, and presents competitive results in terms of sentiment conversion and content preservation. We believe that this work opens up an opportunity for Seq2Seq neural networks to be better exploited in problems to which they have not been applied owing to the lack of parallel training data.</abstract>
<url hash="0fbbe87f">2020.findings-emnlp.61</url>
<doi>10.18653/v1/2020.findings-emnlp.61</doi>
</paper>
<paper id="62">
<title>Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation</title>
<author><first>Jiannan</first><last>Xiang</last></author>
<author><first>Xin</first><last>Wang</last></author>
<author><first>William Yang</first><last>Wang</last></author>
<pages>699–707</pages>
<abstract>Vision-and-Language Navigation (VLN) is a natural language grounding task where an agent learns to follow language instructions and navigate to specified destinations in real-world environments. A key challenge is to recognize and stop at the correct location, especially in complicated outdoor environments. Existing methods treat the STOP action the same as other actions, which results in undesirable behavior: the agent often fails to stop at the destination even though it might be on the right path. Therefore, we propose Learning to Stop (L2Stop), a simple yet effective policy module that differentiates STOP from other actions. Our approach achieves the new state of the art on the challenging urban VLN dataset Touchdown, outperforming the baseline by 6.89% (absolute improvement) on Success weighted by Edit Distance (SED).</abstract>
<url hash="8ad0458e">2020.findings-emnlp.62</url>
<attachment type="OptionalSupplementaryMaterial" hash="59da7918">2020.findings-emnlp.62.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.62</doi>
</paper>
<paper id="63">
<title>Document Ranking with a Pretrained Sequence-to-Sequence Model</title>
<author><first>Rodrigo</first><last>Nogueira</last></author>
<author><first>Zhiying</first><last>Jiang</last></author>
<author><first>Ronak</first><last>Pradeep</last></author>
<author><first>Jimmy</first><last>Lin</last></author>
<pages>708–718</pages>
<abstract>This work proposes the use of a pretrained sequence-to-sequence model for document ranking. Our approach is fundamentally different from a commonly adopted classification-based formulation based on encoder-only pretrained transformer architectures such as BERT. We show how a sequence-to-sequence model can be trained to generate relevance labels as “target tokens”, and how the underlying logits of these target tokens can be interpreted as relevance probabilities for ranking. Experimental results on the MS MARCO passage ranking task show that our ranking approach is superior to strong encoder-only models. On three other document retrieval test collections, we demonstrate a zero-shot transfer-based approach that outperforms previous state-of-the-art models requiring in-domain cross-validation. Furthermore, we find that our approach significantly outperforms an encoder-only architecture in a data-poor setting. We investigate this observation in more detail by varying target tokens to probe the model’s use of latent knowledge. Surprisingly, we find that the choice of target tokens impacts effectiveness, even for words that are closely related semantically. This finding sheds some light on why our sequence-to-sequence formulation for document ranking is effective. Code and models are available at pygaggle.ai.</abstract>
<url hash="cb03f975">2020.findings-emnlp.63</url>
<doi>10.18653/v1/2020.findings-emnlp.63</doi>
</paper>
<paper id="64">
<title>Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior</title>
<author><first>Zi</first><last>Lin</last></author>
<author><first>Jeremiah</first><last>Liu</last></author>
<author><first>Zi</first><last>Yang</last></author>
<author><first>Nan</first><last>Hua</last></author>
<author><first>Dan</first><last>Roth</last></author>
<pages>719–730</pages>
<abstract>Traditional (unstructured) pruning methods for a Transformer model focus on regularizing the individual weights by penalizing them toward zero. In this work, we explore spectral-normalized identity priors (SNIP), a structured pruning approach which penalizes an entire residual module in a Transformer model toward an identity mapping. Our method identifies and discards unimportant non-linear mappings in the residual connections by applying a thresholding operator on the function norm, and is applicable to any structured module, including a single attention head, an entire attention block, or a feed-forward subnetwork. Furthermore, we introduce spectral normalization to stabilize the distribution of the post-activation values of the Transformer layers, further improving the pruning effectiveness of the proposed methodology. We conduct experiments with BERT on 5 GLUE benchmark tasks to demonstrate that SNIP achieves effective pruning results while maintaining comparable performance. Specifically, we improve the performance over the state of the art by 0.5 to 1.0% on average at a 50% compression ratio.</abstract>
<url hash="3a01d707">2020.findings-emnlp.64</url>
<doi>10.18653/v1/2020.findings-emnlp.64</doi>
</paper>
<paper id="65">
<title>Rethinking Self-Attention: Towards Interpretability in Neural Parsing</title>
<author><first>Khalil</first><last>Mrini</last></author>
<author><first>Franck</first><last>Dernoncourt</last></author>
<author><first>Quan Hung</first><last>Tran</last></author>
<author><first>Trung</first><last>Bui</last></author>
<author><first>Walter</first><last>Chang</last></author>
<author><first>Ndapa</first><last>Nakashole</last></author>
<pages>731–742</pages>
<abstract>Attention mechanisms have improved the performance of NLP tasks while allowing models to remain explainable. Self-attention is currently widely used; however, interpretability is difficult due to the numerous attention distributions. Recent work has shown that model representations can benefit from label-specific information, while facilitating interpretation of predictions. We introduce the Label Attention Layer: a new form of self-attention where attention heads represent labels. We test our novel layer by running constituency and dependency parsing experiments and show that our new model obtains new state-of-the-art results for both tasks on both the Penn Treebank (PTB) and Chinese Treebank. Additionally, our model requires fewer self-attention layers compared to existing work. Finally, we find that the Label Attention heads learn relations between syntactic categories and show pathways to analyze errors.</abstract>
<url hash="04abd82c">2020.findings-emnlp.65</url>
<doi>10.18653/v1/2020.findings-emnlp.65</doi>
</paper>
<paper id="66">
<title><fixed-case>P</fixed-case>olicy<fixed-case>QA</fixed-case>: A Reading Comprehension Dataset for Privacy Policies</title>
<author><first>Wasi</first><last>Ahmad</last></author>
<author><first>Jianfeng</first><last>Chi</last></author>
<author><first>Yuan</first><last>Tian</last></author>
<author><first>Kai-Wei</first><last>Chang</last></author>
<pages>743–749</pages>
<abstract>Privacy policy documents are long and verbose. A question answering (QA) system can assist users in finding the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences from the policy document given a question. On the contrary, we argue that providing users with a short text span from policy documents reduces the burden of searching the target information from a lengthy text segment. In this paper, we present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices. We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.</abstract>
<url hash="1c42141d">2020.findings-emnlp.66</url>
<doi>10.18653/v1/2020.findings-emnlp.66</doi>
</paper>
<paper id="67">
<title>A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions</title>
<author><first>Takuma</first><last>Udagawa</last></author>
<author><first>Takato</first><last>Yamazaki</last></author>
<author><first>Akiko</first><last>Aizawa</last></author>
<pages>750–765</pages>
<abstract>Recent models achieve promising results in visually grounded dialogues. However, existing datasets often contain undesirable biases and lack sophisticated linguistic analyses, which make it difficult to understand how well current models recognize their precise linguistic structures. To address this problem, we make two design choices: first, we focus on OneCommon Corpus (CITATION), a simple yet challenging common grounding dataset which contains minimal bias by design. Second, we analyze their linguistic structures based on spatial expressions and provide comprehensive and reliable annotation for 600 dialogues. We show that our annotation captures important linguistic structures including predicate-argument structure, modification and ellipsis. In our experiments, we assess the model’s understanding of these structures through reference resolution. We demonstrate that our annotation can reveal both the strengths and weaknesses of baseline models in essential levels of detail. Overall, we propose a novel framework and resource for investigating fine-grained language understanding in visually grounded dialogues.</abstract>
<url hash="62e260d2">2020.findings-emnlp.67</url>
<doi>10.18653/v1/2020.findings-emnlp.67</doi>
</paper>
<paper id="68">
<title>Efficient Context and Schema Fusion Networks for Multi-Domain Dialogue State Tracking</title>
<author><first>Su</first><last>Zhu</last></author>
<author><first>Jieyu</first><last>Li</last></author>
<author><first>Lu</first><last>Chen</last></author>
<author><first>Kai</first><last>Yu</last></author>
<pages>766–781</pages>
<abstract>Dialogue state tracking (DST) aims at estimating the current dialogue state given all the preceding conversation. For multi-domain DST, the data sparsity problem is a major obstacle due to increased numbers of state candidates and dialogue lengths. To encode the dialogue context efficiently, we utilize the previous dialogue state (predicted) and the current dialogue utterance as the input for DST. To consider relations among different domain-slots, the schema graph involving prior knowledge is exploited. In this paper, a novel context and schema fusion network is proposed to encode the dialogue context and schema graph by using internal and external attention mechanisms. Experiment results show that our approach can outperform strong baselines, and the previous state-of-the-art method (SOM-DST) can also be improved by our proposed schema graph.</abstract>
<url hash="b1bfb31c">2020.findings-emnlp.68</url>
<doi>10.18653/v1/2020.findings-emnlp.68</doi>
</paper>
<paper id="69">
<title>Syntactic and Semantic-driven Learning for Open Information Extraction</title>
<author><first>Jialong</first><last>Tang</last></author>
<author><first>Yaojie</first><last>Lu</last></author>
<author><first>Hongyu</first><last>Lin</last></author>
<author><first>Xianpei</first><last>Han</last></author>
<author><first>Le</first><last>Sun</last></author>
<author><first>Xinyan</first><last>Xiao</last></author>
<author><first>Hua</first><last>Wu</last></author>
<pages>782–792</pages>
<abstract>One of the biggest bottlenecks in building accurate, high coverage neural open IE systems is the need for large labelled corpora. The diversity of open domain corpora and the variety of natural language expressions further exacerbate this problem. In this paper, we propose a syntactic and semantic-driven learning approach, which can learn neural open IE models without any human-labelled data by leveraging syntactic and semantic knowledge as noisier, higher-level supervision. Specifically, we first employ syntactic patterns as data labelling functions and pretrain a base model using the generated labels. Then we propose a syntactic and semantic-driven reinforcement learning algorithm, which can effectively generalize the base model to open situations with high accuracy. Experimental results show that our approach significantly outperforms the supervised counterparts, and can even achieve performance competitive with the supervised state-of-the-art (SoA) model.</abstract>
<url hash="46f1ab30">2020.findings-emnlp.69</url>
<attachment type="OptionalSupplementaryMaterial" hash="8f579c68">2020.findings-emnlp.69.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.69</doi>
</paper>
<paper id="70">
<title>Group-wise Contrastive Learning for Neural Dialogue Generation</title>
<author><first>Hengyi</first><last>Cai</last></author>
<author><first>Hongshen</first><last>Chen</last></author>
<author><first>Yonghao</first><last>Song</last></author>
<author><first>Zhuoye</first><last>Ding</last></author>
<author><first>Yongjun</first><last>Bao</last></author>
<author><first>Weipeng</first><last>Yan</last></author>
<author><first>Xiaofang</first><last>Zhao</last></author>
<pages>793–802</pages>
<abstract>Neural dialogue response generation has gained much popularity in recent years. The Maximum Likelihood Estimation (MLE) objective is widely adopted in existing dialogue model learning. However, models trained with the MLE objective function are plagued by the low-diversity issue when it comes to the open-domain conversational setting. Inspired by the observation that humans not only learn from the positive signals but also benefit from correcting behaviors of undesirable actions, in this work, we introduce contrastive learning into dialogue generation, where the model explicitly perceives the difference between the well-chosen positive and negative utterances. Specifically, we employ a pretrained baseline model as a reference. During contrastive learning, the target dialogue model is trained to give higher conditional probabilities for the positive samples, and lower conditional probabilities for those negative samples, compared to the reference model. To manage the multi-mapping relations prevalent in human conversation, we augment contrastive dialogue learning with group-wise dual sampling. Extensive experimental results show that the proposed group-wise contrastive learning framework is suited for training a wide range of neural dialogue generation models with very favorable performance over the baseline training approaches.</abstract>
<url hash="83a839a7">2020.findings-emnlp.70</url>
<doi>10.18653/v1/2020.findings-emnlp.70</doi>
</paper>
<paper id="71">
<title><fixed-case>E</fixed-case>-<fixed-case>BERT</fixed-case>: Efficient-Yet-Effective Entity Embeddings for <fixed-case>BERT</fixed-case></title>
<author><first>Nina</first><last>Poerner</last></author>
<author><first>Ulli</first><last>Waltinger</last></author>
<author><first>Hinrich</first><last>Schütze</last></author>
<pages>803–818</pages>
<abstract>We present a novel way of injecting factual knowledge about entities into the pretrained BERT model (Devlin et al., 2019): We align Wikipedia2Vec entity vectors (Yamada et al., 2016) with BERT’s native wordpiece vector space and use the aligned entity vectors as if they were wordpiece vectors. The resulting entity-enhanced version of BERT (called E-BERT) is similar in spirit to ERNIE (Zhang et al., 2019) and KnowBert (Peters et al., 2019), but it requires no expensive further pre-training of the BERT encoder. We evaluate E-BERT on unsupervised question answering (QA), supervised relation classification (RC) and entity linking (EL). On all three tasks, E-BERT outperforms BERT and other baselines. We also show quantitatively that the original BERT model is overly reliant on the surface form of entity names (e.g., guessing that someone with an Italian-sounding name speaks Italian), and that E-BERT mitigates this problem.</abstract>
<url hash="7a39c8e7">2020.findings-emnlp.71</url>
<attachment type="OptionalSupplementaryMaterial" hash="7801a05a">2020.findings-emnlp.71.OptionalSupplementaryMaterial.pdf</attachment>
<doi>10.18653/v1/2020.findings-emnlp.71</doi>
</paper>
<paper id="72">
<title>A Multi-task Learning Framework for Opinion Triplet Extraction</title>
<author><first>Chen</first><last>Zhang</last></author>
<author><first>Qiuchi</first><last>Li</last></author>
<author><first>Dawei</first><last>Song</last></author>
<author><first>Benyou</first><last>Wang</last></author>
<pages>819–828</pages>
<abstract>The state-of-the-art Aspect-based Sentiment Analysis (ABSA) approaches are mainly based on either detecting aspect terms and their corresponding sentiment polarities, or co-extracting aspect and opinion terms. However, the extraction of aspect-sentiment pairs lacks opinion terms as a reference, while co-extraction of aspect and opinion terms would not lead to meaningful pairs without determining their sentiment dependencies. To address the issue, we present a novel view of ABSA as an opinion triplet extraction task, and propose a multi-task learning framework to jointly extract aspect terms and opinion terms, and simultaneously parse sentiment dependencies between them with a biaffine scorer. At the inference phase, the extraction of triplets is facilitated by a triplet decoding method based on the above outputs. We evaluate the proposed framework on four SemEval benchmarks for ABSA. The results demonstrate that our approach significantly outperforms a range of strong baselines and state-of-the-art approaches.</abstract>
<url hash="5fe04480">2020.findings-emnlp.72</url>
<doi>10.18653/v1/2020.findings-emnlp.72</doi>
</paper>
<paper id="73">
<title>Event Extraction as Multi-turn Question Answering</title>
<author><first>Fayuan</first><last>Li</last></author>
<author><first>Weihua</first><last>Peng</last></author>
<author><first>Yuguang</first><last>Chen</last></author>
<author><first>Quan</first><last>Wang</last></author>
<author><first>Lu</first><last>Pan</last></author>
<author><first>Yajuan</first><last>Lyu</last></author>
<author><first>Yong</first><last>Zhu</last></author>
<pages>829–838</pages>
<abstract>Event extraction, which aims to identify event triggers of pre-defined event types and their arguments of specific roles, is a challenging task in NLP. Most traditional approaches formulate this task as classification problems, with event types or argument roles taken as golden labels. Such approaches fail to model rich interactions among event types and arguments of different roles, and cannot generalize to new types or roles. This work proposes a new paradigm that formulates event extraction as multi-turn question answering. Our approach, MQAEE, casts the extraction task into a series of reading comprehension problems, by which it extracts triggers and arguments successively from a given sentence. A history answer embedding strategy is further adopted to model question answering history in the multi-turn process. By this new formulation, MQAEE makes full use of dependency among arguments and event types, and generalizes well to new types with new argument roles. Empirical results on ACE 2005 show that MQAEE outperforms the current state of the art, pushing the final F1 of argument extraction to 53.4% (+2.0%). It also has good generalization ability, achieving competitive performance on 13 new event types even if trained only with a few samples of them.</abstract>
<url hash="41e66a9a">2020.findings-emnlp.73</url>
<doi>10.18653/v1/2020.findings-emnlp.73</doi>
</paper>
<paper id="74">
<title>Improving <fixed-case>QA</fixed-case> Generalization by Concurrent Modeling of Multiple Biases</title>
<author><first>Mingzhu</first><last>Wu</last></author>
<author><first>Nafise Sadat</first><last>Moosavi</last></author>
<author><first>Andreas</first><last>Rücklé</last></author>
<author><first>Iryna</first><last>Gurevych</last></author>
<pages>839–853</pages>
<abstract>Existing NLP datasets contain various biases that models can easily exploit to achieve high performance on the corresponding evaluation sets. However, focusing on dataset-specific biases limits their ability to learn more generalizable knowledge about the task from more general data patterns. In this paper, we investigate the impact of debiasing methods for improving generalization and propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrent modeling of multiple biases in the training data. Our framework weights each example based on the biases it contains and the strength of those biases in the training data. It then uses these weights in the training objective so that the model relies less on examples with high bias weights. We extensively evaluate our framework on extractive question answering with training data from various domains with multiple biases of different strengths. We perform the evaluations in two different settings, in which the model is trained on a single domain or multiple domains simultaneously, and show its effectiveness in both settings compared to state-of-the-art debiasing methods.</abstract>
<url hash="c2757a73">2020.findings-emnlp.74</url>
<doi>10.18653/v1/2020.findings-emnlp.74</doi>
</paper>
<paper id="75">
<title>Actor-Double-Critic: Incorporating Model-Based Critic for Task-Oriented Dialogue Systems</title>
<author><first>Yen-chen</first><last>Wu</last></author>
<author><first>Bo-Hsiang</first><last>Tseng</last></author>
<author><first>Milica</first><last>Gasic</last></author>
<pages>854–863</pages>
<abstract>In order to improve the sample-efficiency of deep reinforcement learning (DRL), we implemented imagination augmented agent (I2A) in spoken dialogue systems (SDS). Although I2A achieves a higher success rate than baselines by augmenting predicted future into a policy network, its complicated architecture introduces unwanted instability. In this work, we propose actor-double-critic (ADC) to improve the stability and overall performance of I2A. ADC simplifies the architecture of I2A to reduce excessive parameters and hyper-parameters. More importantly, a separate model-based critic shares parameters between actions and makes back-propagation explicit. In our experiments on the Cambridge Restaurant Booking task, ADC enhances success rates considerably and shows robustness to imperfect environment models. In addition, ADC exhibits stability and sample-efficiency, significantly reducing the baseline standard deviation of success rates and reaching an 80% success rate with half the training data.</abstract>
<url hash="0ea27564">2020.findings-emnlp.75</url>
<doi>10.18653/v1/2020.findings-emnlp.75</doi>
</paper>
<paper id="76">
<title>Controlled Hallucinations: Learning to Generate Faithfully from Noisy Data</title>
<author><first>Katja</first><last>Filippova</last></author>
<pages>864–870</pages>
<abstract>Neural text generation (data- or text-to-text) demonstrates remarkable performance when training data is abundant, which for many applications is not the case. To collect a large corpus of parallel data, heuristic rules are often used, but they inevitably let noise into the data, such as phrases in the output which cannot be explained by the input. Consequently, models pick up on the noise and may hallucinate, i.e., generate fluent but unsupported text. Our contribution is a simple but powerful technique to treat such hallucinations as a controllable aspect of the generated text, without dismissing any input and without modifying the model architecture. On the WikiBio corpus (Lebret et al., 2016), a particularly noisy dataset, we demonstrate the efficacy of the technique both in an automatic and in a human evaluation.</abstract>
<url hash="1c6cbc62">2020.findings-emnlp.76</url>
<doi>10.18653/v1/2020.findings-emnlp.76</doi>
</paper>
<paper id="77">
<title>Sequential Span Classification with Neural Semi-<fixed-case>M</fixed-case>arkov <fixed-case>CRF</fixed-case>s for Biomedical Abstracts</title>
<author><first>Kosuke</first><last>Yamada</last></author>
<author><first>Tsutomu</first><last>Hirao</last></author>
<author><first>Ryohei</first><last>Sasano</last></author>
<author><first>Koichi</first><last>Takeda</last></author>
<author><first>Masaaki</first><last>Nagata</last></author>
<pages>871–877</pages>
<abstract>Dividing biomedical abstracts into several segments with rhetorical roles is essential for supporting researchers’ information access in the biomedical domain. Conventional methods have regarded the task as a sequence labeling task based on sequential sentence classification, i.e., they assign a rhetorical label to each sentence by considering the context in the abstract. However, these methods have a critical problem: they are prone to mislabel longer continuous sentences with the same rhetorical label. To tackle the problem, we propose sequential span classification that assigns a rhetorical label, not to a single sentence but to a span that consists of continuous sentences. Accordingly, we introduce Neural Semi-Markov Conditional Random Fields to assign the labels to such spans by considering all possible spans of various lengths. Experimental results obtained from PubMed 20k RCT and NICTA-PIBOSO datasets demonstrate that our proposed method achieved the best micro sentence-F1 score as well as the best micro span-F1 score.</abstract>
<url hash="6f785f1b">2020.findings-emnlp.77</url>
<attachment type="OptionalSupplementaryMaterial" hash="f0b89ff6">2020.findings-emnlp.77.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.77</doi>
</paper>
<paper id="78">
<title>Where to Submit? Helping Researchers to Choose the Right Venue</title>
<author><first>Konstantin</first><last>Kobs</last></author>
<author><first>Tobias</first><last>Koopmann</last></author>
<author><first>Albin</first><last>Zehe</last></author>
<author><first>David</first><last>Fernes</last></author>
<author><first>Philipp</first><last>Krop</last></author>
<author><first>Andreas</first><last>Hotho</last></author>
<pages>878–883</pages>
<abstract>Whenever researchers write a paper, the same question occurs: “Where to submit?” In this work, we introduce WTS, an open and interpretable NLP system that recommends conferences and journals to researchers based on the title, abstract, and/or keywords of a given paper. We adapt the TextCNN architecture and automatically analyze its predictions using the Integrated Gradients method to highlight words and phrases that led to the recommendation of a scientific venue. We train and test our method on publications from the fields of artificial intelligence (AI) and medicine, both derived from the Semantic Scholar dataset. WTS achieves an Accuracy@5 of approximately 83% for AI papers and 95% in the field of medicine. It is open source and available for testing on https://wheretosubmit.ml.</abstract>
<url hash="48f853d2">2020.findings-emnlp.78</url>
<doi>10.18653/v1/2020.findings-emnlp.78</doi>
</paper>
<paper id="79">
<title><fixed-case>A</fixed-case>ir<fixed-case>C</fixed-case>oncierge: Generating Task-Oriented Dialogue via Efficient Large-Scale Knowledge Retrieval</title>
<author><first>Chieh-Yang</first><last>Chen</last></author>
<author><first>Pei-Hsin</first><last>Wang</last></author>
<author><first>Shih-Chieh</first><last>Chang</last></author>
<author><first>Da-Cheng</first><last>Juan</last></author>
<author><first>Wei</first><last>Wei</last></author>
<author><first>Jia-Yu</first><last>Pan</last></author>
<pages>884–897</pages>
<abstract>Despite recent success in neural task-oriented dialogue systems, developing such a real-world system involves accessing large-scale knowledge bases (KBs), which cannot be simply encoded by neural approaches, such as memory network mechanisms. To alleviate the above problem, we propose AirConcierge, an end-to-end trainable text-to-SQL guided framework to learn a neural agent that interacts with KBs using the generated SQL queries. Specifically, the neural agent first learns to ask and confirm the customer’s intent during the multi-turn interactions, then dynamically determines when to ground the user constraints into executable SQL queries so as to fetch relevant information from KBs. With the help of our method, the agent can use fewer but more accurate fetched results to generate useful responses efficiently, instead of incorporating the entire KBs. We evaluate the proposed method on the AirDialogue dataset, a large corpus released by Google, containing the conversations of customers booking flight tickets from the agent. The experimental results show that AirConcierge significantly improves over previous work in terms of accuracy and the BLEU score, which demonstrates not only the ability to achieve the given task but also the good quality of the generated dialogues.</abstract>
<url hash="adb265f0">2020.findings-emnlp.79</url>
<attachment type="OptionalSupplementaryMaterial" hash="2d973a01">2020.findings-emnlp.79.OptionalSupplementaryMaterial.zip</attachment>
<doi>10.18653/v1/2020.findings-emnlp.79</doi>
</paper>
<paper id="80">
<title><fixed-case>D</fixed-case>oc<fixed-case>S</fixed-case>truct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding</title>
<author><first>Zilong</first><last>Wang</last></author>
<author><first>Mingjie</first><last>Zhan</last></author>
<author><first>Xuebo</first><last>Liu</last></author>
<author><first>Ding</first><last>Liang</last></author>
<pages>898–908</pages>
<abstract>Form understanding depends on both textual contents and organizational structure. Although modern OCR performs well, it is still challenging to realize general form understanding because forms are commonly used and come in various formats. The table detection and handcrafted features in previous works cannot apply to all forms because of their requirements on formats. Therefore, we concentrate on the most elementary components, the key-value pairs, and adopt multimodal methods to extract features. We consider the form structure as a tree-like or graph-like hierarchy of text fragments. The parent-child relation corresponds to the key-value pairs in forms. We utilize the state-of-the-art models and design targeted extraction modules to extract multimodal features from semantic contents, layout information, and visual images. A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation. We adopt an asymmetric algorithm and negative sampling in our model as well. We validate our method on two benchmarks, MedForm and FUNSD, and extensive experiments demonstrate the effectiveness of our method.</abstract>
<url hash="b8069edb">2020.findings-emnlp.80</url>
<doi>10.18653/v1/2020.findings-emnlp.80</doi>
</paper>
<paper id="81">
<title>Pretrained Language Models for Dialogue Generation with Multiple Input Sources</title>
<author><first>Yu</first><last>Cao</last></author>
<author><first>Wei</first><last>Bi</last></author>
<author><first>Meng</first><last>Fang</last></author>
<author><first>Dacheng</first><last>Tao</last></author>
<pages>909–917</pages>
<abstract>Large-scale pretrained language models have achieved outstanding performance on natural language understanding tasks. However, it is still under investigation how to apply them to dialogue generation tasks, especially those with responses conditioned on multiple sources. Previous work simply concatenates all input sources or averages information from different input sources. In this work, we study dialogue models with multiple input sources adapted from the pretrained language model GPT2. We explore various methods to fuse multiple separate attention information corresponding to different sources. Our experimental results show that proper fusion methods deliver higher relevance with dialogue history than simple fusion baselines.</abstract>
<url hash="6941e414">2020.findings-emnlp.81</url>
<doi>10.18653/v1/2020.findings-emnlp.81</doi>
</paper>
<paper id="82">
<title>A Study in Improving <fixed-case>BLEU</fixed-case> Reference Coverage with Diverse Automatic Paraphrasing</title>
<author><first>Rachel</first><last>Bawden</last></author>
<author><first>Biao</first><last>Zhang</last></author>
<author><first>Lisa</first><last>Yankovskaya</last></author>
<author><first>Andre</first><last>Tättar</last></author>
<author><first>Matt</first><last>Post</last></author>
<pages>918–932</pages>
<abstract>We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional *diverse* references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language directions of the WMT19 metrics task (at both the system and sentence level) show that using paraphrased references does generally improve BLEU, and when it does, the more diverse the better. However, we also show that better results could be achieved if those paraphrases were to specifically target the parts of the space most relevant to the MT outputs being evaluated. Moreover, the gains remain slight even when human paraphrases are used, suggesting inherent limitations to BLEU’s capacity to correctly exploit multiple references. Surprisingly, we also find that adequacy appears to be less important, as shown by the high results of a strong sampling approach, which even beats human paraphrases when used with sentence-level BLEU.</abstract>
<url hash="b2e17989">2020.findings-emnlp.82</url>
<attachment type="OptionalSupplementaryMaterial" hash="8410870d">2020.findings-emnlp.82.OptionalSupplementaryMaterial.txt</attachment>
<doi>10.18653/v1/2020.findings-emnlp.82</doi>
</paper>
<paper id="83">
<title>Cross-lingual Alignment Methods for Multilingual <fixed-case>BERT</fixed-case>: A Comparative Study</title>
<author><first>Saurabh</first><last>Kulshreshtha</last></author>
<author><first>Jose Luis</first><last>Redondo Garcia</last></author>
<author><first>Ching-Yun</first><last>Chang</last></author>
<pages>933–942</pages>
<abstract>Multilingual BERT (mBERT) has shown reasonable capability for zero-shot cross-lingual transfer when fine-tuned on downstream tasks. Since mBERT is not pre-trained with explicit cross-lingual supervision, transfer performance can further be improved by aligning mBERT with cross-lingual signal. Prior work proposes several approaches to align contextualised embeddings. In this paper we analyse how different forms of cross-lingual supervision and various alignment methods influence the transfer capability of mBERT in the zero-shot setting. Specifically, we compare parallel corpora vs dictionary-based supervision and rotational vs fine-tuning based alignment methods. We evaluate the performance of different alignment methodologies across eight languages on two tasks: Named Entity Recognition and Semantic Slot Filling. In addition, we propose a novel normalisation method which consistently improves the performance of rotation-based alignment, including a notable 3% F1 improvement for distant and typologically dissimilar languages. Importantly, we identify the biases of the alignment methods to the type of task and proximity to the transfer language. We also find that supervision from a parallel corpus is generally superior to dictionary alignments.</abstract>
<url hash="67a4d950">2020.findings-emnlp.83</url>
<doi>10.18653/v1/2020.findings-emnlp.83</doi>
</paper>
<paper id="84">
<title>Hybrid Emoji-Based Masked Language Models for Zero-Shot Abusive Language Detection</title>
<author><first>Michele</first><last>Corazza</last></author>
<author><first>Stefano</first><last>Menini</last></author>
<author><first>Elena</first><last>Cabrio</last></author>
<author><first>Sara</first><last>Tonelli</last></author>
<author><first>Serena</first><last>Villata</last></author>
<pages>943–949</pages>
<abstract>Recent studies have demonstrated the effectiveness of cross-lingual language model pre-training on different NLP tasks, such as natural language inference and machine translation. In our work, we test this approach on social media data, which are particularly challenging to process within this framework, since the limited length of the textual messages and the irregularity of the language make it harder to learn meaningful encodings. More specifically, we propose a hybrid emoji-based Masked Language Model (MLM) to leverage the common information conveyed by emojis across different languages and improve the learned cross-lingual representation of short text messages, with the goal of performing zero-shot abusive language detection. We compare the results obtained with the original MLM to the ones obtained by our method, showing improved performance on German, Italian and Spanish.</abstract>
<url hash="a578af0f">2020.findings-emnlp.84</url>
<doi>10.18653/v1/2020.findings-emnlp.84</doi>
</paper>
<paper id="85">
<title><fixed-case>S</fixed-case>e<fixed-case>N</fixed-case>s<fixed-case>ER</fixed-case>: Learning Cross-Building Sensor Metadata Tagger</title>
<author><first>Yang</first><last>Jiao</last></author>
<author><first>Jiacheng</first><last>Li</last></author>
<author><first>Jiaman</first><last>Wu</last></author>
<author><first>Dezhi</first><last>Hong</last></author>
<author><first>Rajesh</first><last>Gupta</last></author>
<author><first>Jingbo</first><last>Shang</last></author>
<pages>950–960</pages>
<abstract>Sensor metadata tagging, akin to the named entity recognition task, provides key contextual information (e.g., measurement type and location) about sensors for running smart building applications. Unfortunately, sensor metadata in different buildings often follows distinct naming conventions. Therefore, learning a tagger currently requires extensive annotations on a per-building basis. In this work, we propose a novel framework, SeNsER, which learns a sensor metadata tagger for a new building based on its raw metadata and an existing, fully annotated building. It leverages the commonality between different buildings: at the character level, it employs bidirectional neural language models to capture the shared underlying patterns between two buildings and thus regularizes the feature learning process; at the word level, it leverages the k-mers found in the fully annotated building as features. During inference, we further incorporate information obtained from sources such as Wikipedia as prior knowledge. As a result, SeNsER shows promising results in extensive experiments on multiple real-world buildings.</abstract>
<url hash="1885d0f5">2020.findings-emnlp.85</url>
<doi>10.18653/v1/2020.findings-emnlp.85</doi>
</paper>
<paper id="86">
<title><fixed-case>P</fixed-case>ersian Ezafe Recognition Using Transformers and Its Role in Part-Of-Speech Tagging</title>
<author><first>Ehsan</first><last>Doostmohammadi</last></author>
<author><first>Minoo</first><last>Nassajian</last></author>