<?xml version='1.0' encoding='UTF-8'?>
<collection id="2020.aacl">
  <volume id="main" ingest-date="2020-12-02">
    <meta>
      <booktitle>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</booktitle>
      <editor><first>Kam-Fai</first><last>Wong</last></editor>
      <editor><first>Kevin</first><last>Knight</last></editor>
      <editor><first>Hua</first><last>Wu</last></editor>
      <publisher>Association for Computational Linguistics</publisher>
      <address>Suzhou, China</address>
      <month>December</month>
      <year>2020</year>
    </meta>
    <frontmatter>
      <url hash="803dfce7">2020.aacl-main.0</url>
    </frontmatter>
    <paper id="1">
      <title>Touch Editing: A Flexible One-Time Interaction Approach for Translation</title>
      <author><first>Qian</first><last>Wang</last></author>
      <author><first>Jiajun</first><last>Zhang</last></author>
      <author><first>Lemao</first><last>Liu</last></author>
      <author><first>Guoping</first><last>Huang</last></author>
      <author><first>Chengqing</first><last>Zong</last></author>
      <pages>1–11</pages>
      <abstract>We propose a touch-based editing method for translation, which is more flexible than traditional keyboard-mouse-based translation post-editing. This approach relies on touch actions that users perform to indicate translation errors. We present a dual-encoder model to handle the actions and generate refined translations. To mimic the user feedback, we adopt the TER algorithm comparing between draft translations and references to automatically extract the simulated actions for training data construction. Experiments on translation datasets with simulated editing actions show that our method significantly improves the original translation of Transformer (up to 25.31 BLEU) and outperforms existing interactive translation methods (up to 16.64 BLEU). We also conduct experiments on a post-editing dataset to further prove the robustness and effectiveness of our method.</abstract>
      <url hash="69f9ea60">2020.aacl-main.1</url>
    </paper>
    <paper id="2">
      <title>Can Monolingual Pretrained Models Help Cross-Lingual Classification?</title>
      <author><first>Zewen</first><last>Chi</last></author>
      <author><first>Li</first><last>Dong</last></author>
      <author><first>Furu</first><last>Wei</last></author>
      <author><first>Xianling</first><last>Mao</last></author>
      <author><first>Heyan</first><last>Huang</last></author>
      <pages>12–17</pages>
      <abstract>Multilingual pretrained language models (such as multilingual BERT) have achieved impressive results for cross-lingual transfer. However, due to the constant model capacity, multilingual pre-training usually lags behind the monolingual competitors. In this work, we present two approaches to improve zero-shot cross-lingual classification, by transferring the knowledge from monolingual pretrained models to multilingual ones. Experimental results on two cross-lingual classification benchmarks show that our methods outperform vanilla multilingual fine-tuning.</abstract>
      <url hash="8441b124">2020.aacl-main.2</url>
    </paper>
    <paper id="3">
      <title>Rumor Detection on <fixed-case>T</fixed-case>witter Using Multiloss Hierarchical <fixed-case>B</fixed-case>i<fixed-case>LSTM</fixed-case> with an Attenuation Factor</title>
      <author><first>Yudianto</first><last>Sujana</last></author>
      <author><first>Jiawen</first><last>Li</last></author>
      <author><first>Hung-Yu</first><last>Kao</last></author>
      <pages>18–26</pages>
      <abstract>Social media platforms such as Twitter have become a breeding ground for unverified information or rumors. These rumors can threaten people’s health, endanger the economy, and affect the stability of a country. Many researchers have developed models to classify rumors using traditional machine learning or vanilla deep learning models. However, previous studies on rumor detection have achieved low precision and are time consuming. Inspired by the hierarchical model and multitask learning, a multiloss hierarchical BiLSTM model with an attenuation factor is proposed in this paper. The model is divided into two BiLSTM modules: post level and event level. By means of this hierarchical structure, the model can extract deep information from limited quantities of text. Each module has a loss function that helps to learn bilateral features and reduce the training time. An attenuation factor is added at the post level to increase the accuracy. The results on two rumor datasets demonstrate that our model achieves better performance than that of state-of-the-art machine learning and vanilla deep learning models.</abstract>
      <url hash="c7c69377">2020.aacl-main.3</url>
    </paper>
    <paper id="4">
      <title>Graph Attention Network with Memory Fusion for Aspect-level Sentiment Analysis</title>
      <author><first>Li</first><last>Yuan</last></author>
      <author><first>Jin</first><last>Wang</last></author>
      <author><first>Liang-Chih</first><last>Yu</last></author>
      <author><first>Xuejie</first><last>Zhang</last></author>
      <pages>27–36</pages>
      <abstract>Aspect-level sentiment analysis (ASC) predicts each specific aspect term’s sentiment polarity in a given text or review. Recent studies used attention-based methods that can effectively improve the performance of aspect-level sentiment analysis. These methods ignored the syntactic relationship between the aspect and its corresponding context words, leading the model to focus on syntactically unrelated words mistakenly. One proposed solution, the graph convolutional network (GCN), cannot completely avoid the problem. While it does incorporate useful information about syntax, it assigns equal weight to all the edges between connected words. It may still incorrectly associate unrelated words with the target aspect through the iterations of graph convolutional propagation. In this study, a graph attention network with memory fusion is proposed to extend GCN’s idea by assigning different weights to edges. Syntactic constraints can be imposed to block the graph convolutional propagation of unrelated words. A convolutional layer and a memory fusion were applied to learn and exploit multiword relations and draw different weights of words to improve performance further. Experimental results on five datasets show that the proposed method yields better performance than existing methods.</abstract>
      <url hash="39c6b054">2020.aacl-main.4</url>
    </paper>
    <paper id="5">
      <title><fixed-case>FERN</fixed-case>et: Fine-grained Extraction and Reasoning Network for Emotion Recognition in Dialogues</title>
      <author><first>Yingmei</first><last>Guo</last></author>
      <author><first>Zhiyong</first><last>Wu</last></author>
      <author><first>Mingxing</first><last>Xu</last></author>
      <pages>37–43</pages>
      <abstract>Unlike non-conversation scenes, emotion recognition in dialogues (ERD) poses more complicated challenges due to its interactive nature and intricate contextual information. All present methods model historical utterances without considering the content of the target utterance. However, different parts of a historical utterance may contribute differently to emotion inference of different target utterances. Therefore we propose Fine-grained Extraction and Reasoning Network (FERNet) to generate target-specific historical utterance representations. The reasoning module effectively handles both local and global sequential dependencies to reason over context, and updates target utterance representations to more informed vectors. Experiments on two benchmarks show that our method achieves competitive performance compared with previous methods.</abstract>
      <url hash="8392b1d5">2020.aacl-main.5</url>
    </paper>
    <paper id="6">
      <title><fixed-case>S</fixed-case>enti<fixed-case>R</fixed-case>ec: Sentiment Diversity-aware Neural News Recommendation</title>
      <author><first>Chuhan</first><last>Wu</last></author>
      <author><first>Fangzhao</first><last>Wu</last></author>
      <author><first>Tao</first><last>Qi</last></author>
      <author><first>Yongfeng</first><last>Huang</last></author>
      <pages>44–53</pages>
      <abstract>Personalized news recommendation is important for online news services. Many news recommendation methods recommend news based on their relevance to users’ historical browsed news, and the recommended news usually have sentiment similar to that of browsed news. However, if browsed news is dominated by certain kinds of sentiment, the model may intensively recommend news with the same sentiment orientation, making it difficult for users to receive diverse opinions and news events. In this paper, we propose a sentiment diversity-aware neural news recommendation approach, which can recommend news with more diverse sentiment. In our approach, we propose a sentiment-aware news encoder, which is jointly trained with an auxiliary sentiment prediction task, to learn sentiment-aware news representations. We learn user representations from browsed news representations, and compute click scores based on user and candidate news representations. In addition, we propose a sentiment diversity regularization method to penalize the model by combining the overall sentiment orientation of browsed news as well as the click and sentiment scores of candidate news. Extensive experiments on a real-world dataset show that our approach can effectively improve the sentiment diversity in news recommendation without performance sacrifice.</abstract>
      <url hash="4f43f1cf">2020.aacl-main.6</url>
    </paper>
    <paper id="7">
      <title><fixed-case>BCTH</fixed-case>: A Novel Text Hashing Approach via <fixed-case>B</fixed-case>ayesian Clustering</title>
      <author><first>Ying</first><last>Wenjie</last></author>
      <author><first>Yuquan</first><last>Le</last></author>
      <author><first>Hantao</first><last>Xiong</last></author>
      <pages>54–62</pages>
      <abstract>Similarity search finds the most similar items for a certain target item. The ability to perform similarity search at large scale plays a significant role in many information retrieval applications, and thus has received much attention. Text hashing is a promising strategy, which utilizes binary encoding to represent documents, obtaining attractive performance. This paper makes the first attempt to utilize Bayesian Clustering for Text Hashing, dubbed BCTH. Specifically, BCTH is able to map documents to binary codes by utilizing multiple Bayesian Clusterings in parallel, where each Bayesian Clustering is responsible for one bit. Our approach employs the bit-balanced constraint to maximize the amount of information in each bit. Meanwhile, the bit-uncorrelated constraint is adopted to keep the independence among all bits. The time complexity of BCTH is linear, where the hash codes and hash function are jointly learned. The experimental results, based on four widely-used datasets, demonstrate that BCTH is competitive with current baselines in terms of both precision and training speed.</abstract>
      <url hash="fd91193b">2020.aacl-main.7</url>
      <attachment type="Dataset" hash="e3d69c75">2020.aacl-main.7.Dataset.pdf</attachment>
    </paper>
    <paper id="8">
      <title>Lightweight Text Classifier using Sinusoidal Positional Encoding</title>
      <author><first>Byoung-Doo</first><last>Oh</last></author>
      <author><first>Yu-Seop</first><last>Kim</last></author>
      <pages>63–69</pages>
      <abstract>Large and complex models have recently been developed that require many parameters and much time to solve various problems in natural language processing. This paper explores an efficient way to avoid models being too complicated and ensure nearly equal performance to models showing the state-of-the-art. We propose a single convolutional neural network (CNN) using the sinusoidal positional encoding (SPE) in text classification. The SPE provides useful position information of a word and can construct a more efficient model architecture than before in a CNN-based approach. Our model can significantly reduce the parameter size (at least 67%) and training time (up to 85%) while maintaining similar performance to the CNN-based approach on multiple benchmark datasets.</abstract>
      <url hash="569ea888">2020.aacl-main.8</url>
    </paper>
    <paper id="9">
      <title>Towards Non-task-specific Distillation of <fixed-case>BERT</fixed-case> via Sentence Representation Approximation</title>
      <author><first>Bowen</first><last>Wu</last></author>
      <author><first>Huan</first><last>Zhang</last></author>
      <author><first>MengYuan</first><last>Li</last></author>
      <author><first>Zongsheng</first><last>Wang</last></author>
      <author><first>Qihang</first><last>Feng</last></author>
      <author><first>Junhong</first><last>Huang</last></author>
      <author><first>Baoxun</first><last>Wang</last></author>
      <pages>70–79</pages>
      <abstract>Recently, BERT has become an essential ingredient of various NLP deep models due to its effectiveness and universal-usability. However, the online deployment of BERT is often blocked by its large-scale parameters and high computational cost. There are plenty of studies showing that knowledge distillation is efficient in transferring the knowledge from BERT into a model with a smaller number of parameters. Nevertheless, current BERT distillation approaches mainly focus on task-specific distillation; such methodologies lead to the loss of the general semantic knowledge of BERT for universal-usability. In this paper, we propose a sentence representation approximating oriented distillation framework that can distill the pre-trained BERT into a simple LSTM-based model without specifying tasks. Consistent with BERT, our distilled model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task. Besides, our model can further cooperate with task-specific distillation procedures. The experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods or even much larger models, i.e., ELMO, with efficiency well-improved.</abstract>
      <url hash="df234bde">2020.aacl-main.9</url>
    </paper>
    <paper id="10">
      <title>A Simple and Effective Usage of Word Clusters for <fixed-case>CBOW</fixed-case> Model</title>
      <author><first>Yukun</first><last>Feng</last></author>
      <author><first>Chenlong</first><last>Hu</last></author>
      <author><first>Hidetaka</first><last>Kamigaito</last></author>
      <author><first>Hiroya</first><last>Takamura</last></author>
      <author><first>Manabu</first><last>Okumura</last></author>
      <pages>80–86</pages>
      <abstract>We propose a simple and effective method for incorporating word clusters into the Continuous Bag-of-Words (CBOW) model. Specifically, we propose to replace infrequent input and output words in the CBOW model with their clusters. The resulting cluster-incorporated CBOW model produces embeddings of frequent words and a small number of cluster embeddings, which will be fine-tuned in downstream tasks. We empirically show that our replacing method works well on several downstream tasks. Through our analysis, we show that our method might also be useful for other similar models which produce word embeddings.</abstract>
      <url hash="ee96515b">2020.aacl-main.10</url>
    </paper>
    <paper id="11">
      <title>Investigating Learning Dynamics of <fixed-case>BERT</fixed-case> Fine-Tuning</title>
      <author><first>Yaru</first><last>Hao</last></author>
      <author><first>Li</first><last>Dong</last></author>
      <author><first>Furu</first><last>Wei</last></author>
      <author><first>Ke</first><last>Xu</last></author>
      <pages>87–92</pages>
      <abstract>The recently introduced pre-trained language model BERT advances the state-of-the-art on many NLP tasks through the fine-tuning approach, but few studies investigate how the fine-tuning process improves the model performance on downstream tasks. In this paper, we inspect the learning dynamics of BERT fine-tuning with two indicators. We use JS divergence to detect the change of the attention mode and use SVCCA distance to examine the change to the feature extraction mode during BERT fine-tuning. We conclude that BERT fine-tuning mainly changes the attention mode of the last layers and modifies the feature extraction mode of the intermediate and last layers. Moreover, we analyze the consistency of BERT fine-tuning between different random seeds and different datasets. In summary, we provide a distinctive understanding of the learning dynamics of BERT fine-tuning, which sheds some light on improving the fine-tuning results.</abstract>
      <url hash="d60ccad2">2020.aacl-main.11</url>
    </paper>
    <paper id="12">
      <title>Second-Order Neural Dependency Parsing with Message Passing and End-to-End Training</title>
      <author><first>Xinyu</first><last>Wang</last></author>
      <author><first>Kewei</first><last>Tu</last></author>
      <pages>93–99</pages>
      <abstract>In this paper, we propose second-order graph-based neural dependency parsing using message passing and end-to-end neural networks. We empirically show that our approaches match the accuracy of very recent state-of-the-art second-order graph-based neural dependency parsers and have significantly faster speed in both training and testing. We also empirically show the advantage of second-order parsing over first-order parsing and observe that the usefulness of the head-selection structured constraint vanishes when using BERT embedding.</abstract>
      <url hash="d3a6c296">2020.aacl-main.12</url>
    </paper>
    <paper id="13">
      <title>High-order Refining for End-to-end <fixed-case>C</fixed-case>hinese Semantic Role Labeling</title>
      <author><first>Hao</first><last>Fei</last></author>
      <author><first>Yafeng</first><last>Ren</last></author>
      <author><first>Donghong</first><last>Ji</last></author>
      <pages>100–105</pages>
      <abstract>Current end-to-end semantic role labeling is mostly accomplished via graph-based neural models. However, these are all first-order models, where each decision for detecting any predicate-argument pair is made in isolation with local features. In this paper, we present a high-order refining mechanism to perform interaction between all predicate-argument pairs. Based on the baseline graph model, our high-order refining module learns higher-order features between all candidate pairs via attention calculation, which are later used to update the original token representations. After several iterations of refinement, the underlying token representations can be enriched with globally interacted features. Our high-order model achieves state-of-the-art results on Chinese SRL data, including CoNLL09 and Universal Proposition Bank, while also relieving the long-range dependency issues.</abstract>
      <url hash="a2d1a976">2020.aacl-main.13</url>
    </paper>
    <paper id="14">
      <title>Exploiting <fixed-case>W</fixed-case>ord<fixed-case>N</fixed-case>et Synset and Hypernym Representations for Answer Selection</title>
      <author><first>Weikang</first><last>Li</last></author>
      <author><first>Yunfang</first><last>Wu</last></author>
      <pages>106–115</pages>
      <abstract>Answer selection (AS) is an important subtask of document-based question answering (DQA). In this task, the candidate answers come from the same document, and each answer sentence is semantically related to the given question, which makes it more challenging to select the true answer. WordNet provides powerful knowledge about concepts and their semantic relations so we employ WordNet to enrich the abilities of paraphrasing and reasoning of the network-based question answering model. Specifically, we exploit the synset and hypernym concepts to enrich the word representation and incorporate the similarity scores of two concepts that share the synset or hypernym relations into the attention mechanism. The proposed WordNet-enhanced hierarchical model (WEHM) consists of four modules, including WordNet-enhanced word representation, sentence encoding, WordNet-enhanced attention mechanism, and hierarchical document encoding. Extensive experiments on the public WikiQA and SelQA datasets demonstrate that our proposed model significantly improves the baseline system and outperforms all existing state-of-the-art methods by a large margin.</abstract>
      <url hash="d40d7b64">2020.aacl-main.14</url>
    </paper>
    <paper id="15">
      <title>A Simple Text-based Relevant Location Prediction Method using Knowledge Base</title>
      <author><first>Mei</first><last>Sasaki</last></author>
      <author><first>Shumpei</first><last>Okura</last></author>
      <author><first>Shingo</first><last>Ono</last></author>
      <pages>116–121</pages>
      <abstract>In this paper, we propose a simple method to predict salient locations from news article text using a knowledge base (KB). The proposed method uses a dictionary of locations created from the KB to identify occurrences of locations in the text and uses the hierarchical information between entities in the KB for assigning appropriate saliency scores to regions. It allows prediction at arbitrary region units and has only a few hyperparameters that need to be tuned. We show using manually annotated news articles that the proposed method improves the f-measure by &gt; 0.12 compared to multiple baselines.</abstract>
      <url hash="0a73979d">2020.aacl-main.15</url>
    </paper>
    <paper id="16">
      <title>Learning Goal-oriented Dialogue Policy with opposite Agent Awareness</title>
      <author><first>Zheng</first><last>Zhang</last></author>
      <author><first>Lizi</first><last>Liao</last></author>
      <author><first>Xiaoyan</first><last>Zhu</last></author>
      <author><first>Tat-Seng</first><last>Chua</last></author>
      <author><first>Zitao</first><last>Liu</last></author>
      <author><first>Yan</first><last>Huang</last></author>
      <author><first>Minlie</first><last>Huang</last></author>
      <pages>122–132</pages>
      <abstract>Most existing approaches for goal-oriented dialogue policy learning use reinforcement learning, which focuses on the target agent policy and simply treats the opposite agent policy as part of the environment. In real-world scenarios, however, the behavior of an opposite agent often exhibits certain patterns or reflects hidden policies, which can be inferred and utilized by the target agent to facilitate its own decision making. This strategy is common in human mental simulation, in which one first imagines a specific action and its probable results before actually performing it. We therefore propose an opposite behavior aware framework for policy learning in goal-oriented dialogues. We estimate the opposite agent’s policy from its behavior and use this estimation to improve the target agent by regarding it as part of the target policy. We evaluate our model on both cooperative and competitive dialogue tasks, showing superior performance over state-of-the-art baselines.</abstract>
      <url hash="c5f7d159">2020.aacl-main.16</url>
    </paper>
    <paper id="17">
      <title>An Empirical Study of Tokenization Strategies for Various <fixed-case>K</fixed-case>orean <fixed-case>NLP</fixed-case> Tasks</title>
      <author><first>Kyubyong</first><last>Park</last></author>
      <author><first>Joohong</first><last>Lee</last></author>
      <author><first>Seongbo</first><last>Jang</last></author>
      <author><first>Dawoon</first><last>Jung</last></author>
      <pages>133–142</pages>
      <abstract>Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even though Byte Pair Encoding (BPE) has been considered the de facto standard tokenization method due to its simplicity and universality, it still remains unclear whether BPE works best across all languages and tasks. In this paper, we test several tokenization strategies in order to answer our primary research question, that is, “What is the best tokenization strategy for Korean NLP tasks?” Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best in Korean to/from English machine translation and natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X. As an exception, for KorQuAD, the Korean extension of SQuAD, BPE segmentation turns out to be the most effective. Our code and pre-trained models are publicly available at https://github.com/kakaobrain/kortok.</abstract>
      <url hash="5c4b3742">2020.aacl-main.17</url>
    </paper>
    <paper id="18">
      <title><fixed-case>BERT</fixed-case>-Based Neural Collaborative Filtering and Fixed-Length Contiguous Tokens Explanation</title>
      <author><first>Reinald Adrian</first><last>Pugoy</last></author>
      <author><first>Hung-Yu</first><last>Kao</last></author>
      <pages>143–153</pages>
      <abstract>We propose a novel, accurate, and explainable recommender model (BENEFICT) that addresses two drawbacks that most review-based recommender systems face. First is their utilization of traditional word embeddings that could influence prediction performance due to their inability to model the word semantics’ dynamic characteristic. Second is their black-box nature that makes the explanations behind every prediction obscure. Our model uniquely integrates three key elements: BERT, multilayer perceptron, and maximum subarray problem to derive contextualized review features, model user-item interactions, and generate explanations, respectively. Our experiments show that BENEFICT consistently outperforms other state-of-the-art models by an average improvement gain of nearly 7%. Based on the human judges’ assessment, the BENEFICT-produced explanations can capture the essence of the customer’s preference and help future customers make purchasing decisions. To the best of our knowledge, our model is one of the first recommender models to utilize BERT for neural collaborative filtering.</abstract>
      <url hash="951ebff4">2020.aacl-main.18</url>
    </paper>
    <paper id="19">
      <title>Transformer-based Approach for Predicting Chemical Compound Structures</title>
      <author><first>Yutaro</first><last>Omote</last></author>
      <author><first>Kyoumoto</first><last>Matsushita</last></author>
      <author><first>Tomoya</first><last>Iwakura</last></author>
      <author><first>Akihiro</first><last>Tamura</last></author>
      <author><first>Takashi</first><last>Ninomiya</last></author>
      <pages>154–162</pages>
      <abstract>By predicting chemical compound structures from their names, we can better comprehend chemical compounds written in text and identify the same chemical compound given different notations for database creation. Previous methods have predicted the chemical compound structures from their names and represented them by Simplified Molecular Input Line Entry System (SMILES) strings. However, these methods mainly apply handcrafted rules, and cannot predict the structures of chemical compound names not covered by the rules. Instead of handcrafted rules, we propose Transformer-based models that predict SMILES strings from chemical compound names. We improve the conventional Transformer-based model by introducing two features: (1) a loss function that constrains the number of atoms of each element in the structure, and (2) a multi-task learning approach that predicts both SMILES strings and InChI strings (another string representation of chemical compound structures). In evaluation experiments, our methods achieved higher F-measures than previous rule-based approaches (Open Parser for Systematic IUPAC Nomenclature and two commercially used products), and the conventional Transformer-based model. We release the dataset used in this paper as a benchmark for the future research.</abstract>
      <url hash="8f20a4a0">2020.aacl-main.19</url>
    </paper>
    <paper id="20">
      <title><fixed-case>C</fixed-case>hinese Grammatical Correction Using <fixed-case>BERT</fixed-case>-based Pre-trained Model</title>
      <author><first>Hongfei</first><last>Wang</last></author>
      <author><first>Michiki</first><last>Kurosawa</last></author>
      <author><first>Satoru</first><last>Katsumata</last></author>
      <author><first>Mamoru</first><last>Komachi</last></author>
      <pages>163–168</pages>
      <abstract>In recent years, pre-trained models have been extensively studied, and several downstream tasks have benefited from their utilization. In this study, we verify the effectiveness of two methods that incorporate a pre-trained model into an encoder-decoder model on Chinese grammatical error correction tasks. We also analyze the error type and conclude that sentence-level errors are yet to be addressed.</abstract>
      <url hash="732cde0f">2020.aacl-main.20</url>
    </paper>
    <paper id="21">
      <title><fixed-case>N</fixed-case>eural <fixed-case>G</fixed-case>ibbs <fixed-case>S</fixed-case>ampling for <fixed-case>J</fixed-case>oint <fixed-case>E</fixed-case>vent <fixed-case>A</fixed-case>rgument <fixed-case>E</fixed-case>xtraction</title>
      <author><first>Xiaozhi</first><last>Wang</last></author>
      <author><first>Shengyu</first><last>Jia</last></author>
      <author><first>Xu</first><last>Han</last></author>
      <author><first>Zhiyuan</first><last>Liu</last></author>
      <author><first>Juanzi</first><last>Li</last></author>
      <author><first>Peng</first><last>Li</last></author>
      <author><first>Jie</first><last>Zhou</last></author>
      <pages>169–180</pages>
      <abstract>Event Argument Extraction (EAE) aims at predicting event argument roles of entities in text, which is a crucial subtask and bottleneck of event extraction. Existing EAE methods extract each event argument role either independently or sequentially, which cannot adequately model the joint probability distribution among event arguments and their roles. In this paper, we propose a Bayesian model named Neural Gibbs Sampling (NGS) to jointly extract event arguments. Specifically, we train two neural networks to model the prior distribution and conditional distribution over event arguments respectively and then use Gibbs sampling to approximate the joint distribution with the learned distributions. To overcome the high complexity of the original Gibbs sampling algorithm, we further apply simulated annealing to efficiently estimate the joint probability distribution over event arguments and make predictions. We conduct experiments on the two widely-used benchmark datasets ACE 2005 and TAC KBP 2016. The experimental results show that our NGS model can achieve comparable results to existing state-of-the-art EAE methods. The source code can be obtained from https://github.com/THU-KEG/NGS.</abstract>
      <url hash="2b011dd8">2020.aacl-main.21</url>
    </paper>
    <paper id="22">
      <title>Named Entity Recognition in Multi-level Contexts</title>
      <author><first>Yubo</first><last>Chen</last></author>
      <author><first>Chuhan</first><last>Wu</last></author>
      <author><first>Tao</first><last>Qi</last></author>
      <author><first>Zhigang</first><last>Yuan</last></author>
      <author><first>Yongfeng</first><last>Huang</last></author>
      <pages>181–190</pages>
      <abstract>Named entity recognition is a critical task in the natural language processing field. Most existing methods for this task can only exploit contextual information within a sentence. However, their performance on recognizing entities in limited or ambiguous sentence-level contexts is usually unsatisfactory. Fortunately, other sentences in the same document can provide supplementary document-level contexts to help recognize these entities. In addition, words themselves contain word-level contextual information since they usually have different preferences of entity type and relative position from named entities. In this paper, we propose a unified framework to incorporate multi-level contexts for named entity recognition. We use TagLM as our basic model to capture sentence-level contexts. To incorporate document-level contexts, we propose to capture interactions between sentences via a multi-head self attention network. To mine word-level contexts, we propose an auxiliary task to predict the type of each word to capture its type preference. We jointly train our model in entity recognition and the auxiliary classification task via multi-task learning. The experimental results on several benchmark datasets validate the effectiveness of our method.</abstract>
      <url hash="6e5498c7">2020.aacl-main.22</url>
    </paper>
    <paper id="23">
      <title>A General Framework for Adaptation of Neural Machine Translation to Simultaneous Translation</title>
      <author><first>Yun</first><last>Chen</last></author>
      <author><first>Liangyou</first><last>Li</last></author>
      <author><first>Xin</first><last>Jiang</last></author>
      <author><first>Xiao</first><last>Chen</last></author>
      <author><first>Qun</first><last>Liu</last></author>
      <pages>191–200</pages>
      <abstract>Despite the success of neural machine translation (NMT), simultaneous neural machine translation (SNMT), the task of translating in real time before a full sentence has been observed, remains challenging due to the syntactic structure difference and simultaneity requirements. In this paper, we propose a general framework for adapting neural machine translation to translate simultaneously. Our framework contains two parts: prefix translation that utilizes a consecutive NMT model to translate source prefixes and a stopping criterion that determines when to stop the prefix translation. Experiments on three translation corpora and two language pairs show the efficacy of the proposed framework on balancing the quality and latency in adapting NMT to perform simultaneous translation.</abstract>
      <url hash="fc789086">2020.aacl-main.23</url>
    </paper>
    <paper id="24">
      <title><fixed-case>U</fixed-case>nihan<fixed-case>LM</fixed-case>: Coarse-to-Fine <fixed-case>C</fixed-case>hinese-<fixed-case>J</fixed-case>apanese Language Model Pretraining with the Unihan Database</title>
      <author><first>Canwen</first><last>Xu</last></author>
      <author><first>Tao</first><last>Ge</last></author>
      <author><first>Chenliang</first><last>Li</last></author>
      <author><first>Furu</first><last>Wei</last></author>
      <pages>201–211</pages>
      <abstract>Chinese and Japanese share many characters with similar surface morphology. To better utilize the shared knowledge across the languages, we propose UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach. We exploit Unihan, a ready-made database constructed by linguistic experts to first merge morphologically similar characters into clusters. The resulting clusters are used to replace the original characters in sentences for the coarse-grained pretraining of the MLM. Then, we restore the clusters back to the original characters in sentences for the fine-grained pretraining to learn the representation of the specific characters. We conduct extensive experiments on a variety of Chinese and Japanese NLP benchmarks, showing that our proposed UnihanLM is effective on both mono- and cross-lingual Chinese and Japanese tasks, shedding light on a new path to exploit the homology of languages.</abstract>
      <url hash="627d9829">2020.aacl-main.24</url>
    </paper>
    <paper id="25">
      <title>Towards a Better Understanding of Label Smoothing in Neural Machine Translation</title>
      <author><first>Yingbo</first><last>Gao</last></author>
      <author><first>Weiyue</first><last>Wang</last></author>
      <author><first>Christian</first><last>Herold</last></author>
      <author><first>Zijian</first><last>Yang</last></author>
      <author><first>Hermann</first><last>Ney</last></author>
      <pages>212–223</pages>
      <abstract>In order to combat overfitting and in pursuit of better generalization, label smoothing is widely applied in modern neural machine translation systems. The core idea is to penalize over-confident outputs and regularize the model so that its outputs do not diverge too much from some prior distribution. While training perplexity generally gets worse, label smoothing is found to consistently improve test performance. In this work, we aim to better understand label smoothing in the context of neural machine translation. Theoretically, we derive and explain exactly what label smoothing is optimizing for. Practically, we conduct extensive experiments by varying which tokens to smooth, tuning the probability mass to be deducted from the true targets and considering different prior distributions. We show that label smoothing is theoretically well-motivated, and by carefully choosing hyperparameters, the practical performance of strong neural machine translation systems can be further improved.</abstract>
      <url hash="480914d7">2020.aacl-main.25</url>
    </paper>
    <paper id="26">
      <title>Comparing Probabilistic, Distributional and Transformer-Based Models on Logical Metonymy Interpretation</title>
      <author><first>Giulia</first><last>Rambelli</last></author>
      <author><first>Emmanuele</first><last>Chersoni</last></author>
      <author><first>Alessandro</first><last>Lenci</last></author>
      <author><first>Philippe</first><last>Blache</last></author>
      <author><first>Chu-Ren</first><last>Huang</last></author>
      <pages>224–234</pages>
      <abstract>In linguistics and cognitive science, logical metonymies are defined as type clashes between an event-selecting verb and an entity-denoting noun (e.g. The editor finished the article), which are typically interpreted by inferring a hidden event (e.g. reading) on the basis of contextual cues. This paper tackles the problem of logical metonymy interpretation, that is, the retrieval of the covert event via computational methods. We compare different types of models, including the probabilistic and the distributional ones previously introduced in the literature on the topic. For the first time, we also test some of the recent Transformer-based models on this task, such as BERT, RoBERTa, XLNet, and GPT-2. Our results show a complex scenario, in which the best Transformer-based models and some traditional distributional models perform very similarly. However, the low performance on some of the testing datasets suggests that logical metonymy is still a challenging phenomenon for computational modeling.</abstract>
      <url hash="5761cf8b">2020.aacl-main.26</url>
    </paper>
    <paper id="27">
      <title><fixed-case>AMR</fixed-case> Quality Rating with a Lightweight <fixed-case>CNN</fixed-case></title>
      <author><first>Juri</first><last>Opitz</last></author>
      <pages>235–247</pages>
      <abstract>Structured semantic sentence representations such as Abstract Meaning Representations (AMRs) are potentially useful in various NLP tasks. However, the quality of automatic parses can vary greatly and jeopardizes their usefulness. This can be mitigated by models that can accurately rate AMR quality in the absence of costly gold data, allowing us to inform downstream systems about an incorporated parse’s trustworthiness or select among different candidate parses. In this work, we propose to transfer the AMR graph to the domain of images. This allows us to create a simple convolutional neural network (CNN) that imitates a human judge tasked with rating graph quality. Our experiments show that the method can rate quality more accurately than strong baselines, in several quality dimensions. Moreover, the method proves to be efficient and reduces the incurred energy consumption.</abstract>
      <url hash="9a12044d">2020.aacl-main.27</url>
    </paper>
    <paper id="28">
      <title>Generating Commonsense Explanation by Extracting Bridge Concepts from Reasoning Paths</title>
      <author><first>Haozhe</first><last>Ji</last></author>
      <author><first>Pei</first><last>Ke</last></author>
      <author><first>Shaohan</first><last>Huang</last></author>
      <author><first>Furu</first><last>Wei</last></author>
      <author><first>Minlie</first><last>Huang</last></author>
      <pages>248–257</pages>
      <abstract>Commonsense explanation generation aims to empower the machine’s sense-making capability by generating plausible explanations to statements against commonsense. While this task is easy for humans, the machine still struggles to generate reasonable and informative explanations. In this work, we propose a method that first extracts the underlying concepts which serve as bridges in the reasoning chain and then integrates these concepts to generate the final explanation. To facilitate the reasoning process, we utilize external commonsense knowledge to build the connection between a statement and the bridge concepts by extracting and pruning multi-hop paths to build a subgraph. We design a bridge concept extraction model that first scores the triples, routes the paths in the subgraph, and further selects bridge concepts with weak supervision at both the triple level and the concept level. We conduct experiments on the commonsense explanation generation task and our model outperforms the state-of-the-art baselines in both automatic and human evaluation.</abstract>
      <url hash="238efc0a">2020.aacl-main.28</url>
    </paper>
    <paper id="29">
      <title>Unsupervised <fixed-case>KB</fixed-case>-to-Text Generation with Auxiliary Triple Extraction using Dual Learning</title>
      <author><first>Zihao</first><last>Fu</last></author>
      <author><first>Bei</first><last>Shi</last></author>
      <author><first>Lidong</first><last>Bing</last></author>
      <author><first>Wai</first><last>Lam</last></author>
      <pages>258–268</pages>
      <abstract>The KB-to-text task aims at generating texts based on the given KB triples. Traditional methods usually map KB triples to sentences via a supervised seq-to-seq model. However, existing annotated datasets are very limited and human labeling is very expensive. In this paper, we propose a method which trains the generation model in a completely unsupervised way with unaligned raw text data and KB triples. Our method exploits a novel dual training framework which leverages the inverse relationship between the KB-to-text generation task and an auxiliary triple extraction task. In our architecture, we reconstruct KB triples or texts via a closed-loop framework by linking a generator and an extractor. Therefore the loss function that accounts for the reconstruction error of KB triples and texts can be used to train the generator and extractor. To resolve the cold start problem in training, we propose a method using a pseudo data generator which generates pseudo texts and KB triples for learning an initial model. To resolve the multiple-triple problem, we design an allocated reinforcement learning component to optimize the reconstruction loss. The experimental results demonstrate that our model can outperform other unsupervised generation methods and is close to the bound of supervised methods.</abstract>
      <url hash="86e2ac7b">2020.aacl-main.29</url>
    </paper>
    <paper id="30">
      <title>Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition</title>
      <author><first>Wenliang</first><last>Dai</last></author>
      <author><first>Zihan</first><last>Liu</last></author>
      <author><first>Tiezheng</first><last>Yu</last></author>
      <author><first>Pascale</first><last>Fung</last></author>
      <pages>269–280</pages>
      <abstract>Despite the recent achievements made in the multi-modal emotion recognition task, two problems still exist and have not been well investigated: 1) the relationships between different emotion categories are not utilized, which leads to sub-optimal performance; and 2) current models fail to cope well with low-resource emotions, especially unseen emotions. In this paper, we propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues. We use pre-trained word embeddings to represent emotion categories for textual data. Then, two mapping functions are learned to transfer these embeddings into visual and acoustic spaces. For each modality, the model calculates the representation distance between the input sequence and target emotions and makes predictions based on the distances. By doing so, our model can directly adapt to the unseen emotions in any modality since we have their pre-trained embeddings and modality mapping functions. Experiments show that our model achieves state-of-the-art performance on most of the emotion categories. Besides, our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.</abstract>
      <url hash="5181cfe2">2020.aacl-main.30</url>
    </paper>
    <paper id="31">
      <title>All-in-One: A Deep Attentive Multi-task Learning Framework for Humour, Sarcasm, Offensive, Motivation, and Sentiment on Memes</title>
      <author><first>Dushyant Singh</first><last>Chauhan</last></author>
      <author><first>Dhanush</first><last>S R</last></author>
      <author><first>Asif</first><last>Ekbal</last></author>
      <author><first>Pushpak</first><last>Bhattacharyya</last></author>
      <pages>281–290</pages>
      <abstract>In this paper, we aim at learning the relationships and similarities of a variety of tasks, such as humour detection, sarcasm detection, offensive content detection, motivational content detection and sentiment analysis on a somewhat complicated form of information, i.e., memes. We propose a multi-task, multi-modal deep learning framework to solve multiple tasks simultaneously. For multi-tasking, we propose two attention-like mechanisms viz., Inter-task Relationship Module (iTRM) and Inter-class Relationship Module (iCRM). The main motivation of iTRM is to learn the relationship between the tasks to realize how they help each other. In contrast, iCRM develops relations between the different classes of tasks. Finally, representations from both the attentions are concatenated and shared across the five tasks (i.e., humour, sarcasm, offensive, motivational, and sentiment) for multi-tasking. We use the recently released dataset in the Memotion Analysis task @ SemEval 2020, which consists of memes annotated for the classes as mentioned above. Empirical results on Memotion dataset show the efficacy of our proposed approach over the existing state-of-the-art systems (Baseline and SemEval 2020 winner). The evaluation also indicates that the proposed multi-task framework yields better performance over the single-task learning.</abstract>
      <url hash="63d636f7">2020.aacl-main.31</url>
    </paper>
    <paper id="32">
      <title>Identifying Implicit Quotes for Unsupervised Extractive Summarization of Conversations</title>
      <author><first>Ryuji</first><last>Kano</last></author>
      <author><first>Yasuhide</first><last>Miura</last></author>
      <author><first>Tomoki</first><last>Taniguchi</last></author>
      <author><first>Tomoko</first><last>Ohkuma</last></author>
      <pages>291–302</pages>
      <abstract>We propose Implicit Quote Extractor, an end-to-end unsupervised extractive neural summarization model for conversational texts. When we reply to posts, quotes are used to highlight important parts of texts. We aim to extract quoted sentences as summaries. Most replies do not explicitly include quotes, so it is difficult to use quotes as supervision. However, even if it is not explicitly shown, replies always refer to certain parts of texts; we call them implicit quotes. Implicit Quote Extractor aims to extract implicit quotes as summaries. The training task of the model is to predict whether a reply candidate is a true reply to a post. For prediction, the model has to choose a few sentences from the post. To predict accurately, the model learns to extract sentences that replies frequently refer to. We evaluate our model on two email datasets and one social media dataset, and confirm that our model is useful for extractive summarization. We further discuss two topics: one is whether quote extraction is an important factor for summarization, and the other is whether our model can capture salient sentences that conventional methods cannot.</abstract>
      <url hash="60586b7f">2020.aacl-main.32</url>
    </paper>
    <paper id="33">
      <title>Unsupervised Aspect-Level Sentiment Controllable Style Transfer</title>
      <author><first>Mukuntha</first><last>Narayanan Sundararaman</last></author>
      <author><first>Zishan</first><last>Ahmad</last></author>
      <author><first>Asif</first><last>Ekbal</last></author>
      <author><first>Pushpak</first><last>Bhattacharyya</last></author>
      <pages>303–312</pages>
      <abstract>Unsupervised style transfer in text has previously been explored through the sentiment transfer task. The task entails inverting the overall sentiment polarity in a given input sentence, while preserving its content. From the Aspect-Based Sentiment Analysis (ABSA) task, we know that multiple sentiment polarities can often be present together in a sentence with multiple aspects. In this paper, the task of aspect-level sentiment controllable style transfer is introduced, where each of the aspect-level sentiments can individually be controlled at the output. To achieve this goal, a BERT-based encoder-decoder architecture with saliency weighted polarity injection is proposed, with unsupervised training strategies, such as ABSA masked-language-modelling. Through both automatic and manual evaluation, we show that the system is successful in controlling aspect-level sentiments.</abstract>
      <url hash="ddbcd8b1">2020.aacl-main.33</url>
    </paper>
    <paper id="34">
      <title>Energy-based Self-attentive Learning of Abstractive Communities for Spoken Language Understanding</title>
      <author><first>Guokan</first><last>Shang</last></author>
      <author><first>Antoine</first><last>Tixier</last></author>
      <author><first>Michalis</first><last>Vazirgiannis</last></author>
      <author><first>Jean-Pierre</first><last>Lorré</last></author>
      <pages>313–327</pages>
      <abstract>Abstractive community detection is an important spoken language understanding task, whose goal is to group utterances in a conversation according to whether they can be jointly summarized by a common abstractive sentence. This paper provides a novel approach to this task. We first introduce a neural contextual utterance encoder featuring three types of self-attention mechanisms. We then train it using the siamese and triplet energy-based meta-architectures. Experiments on the AMI corpus show that our system outperforms multiple energy-based and non-energy based baselines from the state-of-the-art. Code and data are publicly available.</abstract>
      <url hash="f6d6206d">2020.aacl-main.34</url>
    </paper>
    <paper id="35">
      <title>Intent Detection with <fixed-case>W</fixed-case>iki<fixed-case>H</fixed-case>ow</title>
      <author><first>Li</first><last>Zhang</last></author>
      <author><first>Qing</first><last>Lyu</last></author>
      <author><first>Chris</first><last>Callison-Burch</last></author>
      <pages>328–333</pages>
      <abstract>Modern task-oriented dialog systems need to reliably understand users’ intents. Intent detection is even more challenging when moving to new domains or new languages, since there is little annotated data. To address this challenge, we present a suite of pretrained intent detection models which can predict a broad range of intended goals from many actions because they are trained on wikiHow, a comprehensive instructional website. Our models achieve state-of-the-art results on the Snips dataset, the Schema-Guided Dialogue dataset, and all 3 languages of the Facebook multilingual dialog datasets. Our models also demonstrate strong zero- and few-shot performance, reaching over 75% accuracy using only 100 training examples in all datasets.</abstract>
      <url hash="29d3fb04">2020.aacl-main.35</url>
    </paper>
    <paper id="36">
      <title>A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation</title>
      <author><first>Moin</first><last>Nadeem</last></author>
      <author><first>Tianxing</first><last>He</last></author>
      <author><first>Kyunghyun</first><last>Cho</last></author>
      <author><first>James</first><last>Glass</last></author>
      <pages>334–346</pages>
      <abstract>This work studies the widely adopted ancestral sampling algorithms for auto-regressive language models. We use the quality-diversity (Q-D) trade-off to investigate three popular sampling methods (top-k, nucleus and tempered sampling). We focus on the task of open-ended language generation, and first show that the existing sampling algorithms have similar performance. By carefully inspecting the transformations defined by different sampling algorithms, we identify three key properties that are shared among them: entropy reduction, order preservation, and slope preservation. To validate the importance of the identified properties, we design two sets of new sampling methods: one set in which each algorithm satisfies all three properties, and one set in which each algorithm violates at least one of the properties. We compare their performance with existing algorithms, and find that violating the identified properties could lead to drastic performance degradation, as measured by the Q-D trade-off. On the other hand, we find that the set of sampling algorithms that satisfy these properties performs on par with the existing sampling algorithms.</abstract>
      <url hash="7fea9f82">2020.aacl-main.36</url>
      <attachment type="Software" hash="5c54667e">2020.aacl-main.36.Software.zip</attachment>
    </paper>
    <paper id="37">
      <title><fixed-case>C</fixed-case>hinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels</title>
      <author><first>Yuning</first><last>Ding</last></author>
      <author><first>Andrea</first><last>Horbach</last></author>
      <author><first>Torsten</first><last>Zesch</last></author>
      <pages>347–357</pages>
      <abstract>In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-related questions. As a second data set, we collected ASAP-ZH with 942 answers by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels such as character n-grams tend to have better performance than features on token level.</abstract>
      <url hash="2da0605a">2020.aacl-main.37</url>
      <attachment type="Dataset" hash="5fdfe75f">2020.aacl-main.37.Dataset.zip</attachment>
      <attachment type="Software" hash="5fdfe75f">2020.aacl-main.37.Software.zip</attachment>
    </paper>
    <paper id="38">
      <title>Analysis of Hierarchical Multi-Content Text Classification Model on <fixed-case>B</fixed-case>-<fixed-case>SHARP</fixed-case> Dataset for Early Detection of <fixed-case>A</fixed-case>lzheimer’s Disease</title>
      <author><first>Renxuan Albert</first><last>Li</last></author>
      <author><first>Ihab</first><last>Hajjar</last></author>
      <author><first>Felicia</first><last>Goldstein</last></author>
      <author><first>Jinho D.</first><last>Choi</last></author>
      <pages>358–365</pages>
      <abstract>This paper presents a new dataset, B-SHARP, that can be used to develop NLP models for the detection of Mild Cognitive Impairment (MCI) known as an early sign of Alzheimer’s disease. Our dataset contains 1-2 min speech segments from 326 human subjects for 3 topics, (1) daily activity, (2) room environment, and (3) picture description, and their transcripts so that a total of 650 speech segments are collected. Given the B-SHARP dataset, several hierarchical text classification models are developed that jointly learn combinatory features across all 3 topics. The best performance of 74.1% is achieved by an ensemble model that adapts 3 types of transformer encoders. To the best of our knowledge, this is the first work that builds deep learning-based text classification models on multiple contents for the detection of MCI.</abstract>
      <url hash="573e8e8c">2020.aacl-main.38</url>
    </paper>
    <paper id="39">
      <title>An Exploratory Study on Multilingual Quality Estimation</title>
      <author><first>Shuo</first><last>Sun</last></author>
      <author><first>Marina</first><last>Fomicheva</last></author>
      <author><first>Frédéric</first><last>Blain</last></author>
      <author><first>Vishrav</first><last>Chaudhary</last></author>
      <author><first>Ahmed</first><last>El-Kishky</last></author>
      <author><first>Adithya</first><last>Renduchintala</last></author>
      <author><first>Francisco</first><last>Guzmán</last></author>
      <author><first>Lucia</first><last>Specia</last></author>
      <pages>366–377</pages>
      <abstract>Predicting the quality of machine translation has traditionally been addressed with language-specific models, under the assumption that the quality label distribution or linguistic features exhibit traits that are not shared across languages. An obvious disadvantage of this approach is the need for labelled data for each given language pair. We challenge this assumption by exploring different approaches to multilingual Quality Estimation (QE), including using scores from translation models. We show that these outperform single-language models, particularly in less balanced quality label distributions and low-resource settings. In the extreme case of zero-shot QE, we show that it is possible to accurately predict quality for any given new language from models trained on other languages. Our findings indicate that state-of-the-art neural QE models based on powerful pre-trained representations generalise well across languages, making them more applicable in real-world settings.</abstract>
      <url hash="de8f76ca">2020.aacl-main.39</url>
    </paper>
    <paper id="40">
      <title><fixed-case>E</fixed-case>nglish-to-<fixed-case>C</fixed-case>hinese Transliteration with Phonetic Auxiliary Task</title>
      <author><first>Yuan</first><last>He</last></author>
      <author><first>Shay B.</first><last>Cohen</last></author>
      <pages>378–388</pages>
      <abstract>Approaching named entities transliteration as a Neural Machine Translation (NMT) problem is common practice. While many have applied various NMT techniques to enhance machine transliteration models, few focus on the linguistic features particular to the relevant languages. In this paper, we investigate the effect of incorporating phonetic features for English-to-Chinese transliteration under the multi-task learning (MTL) setting—where we define a phonetic auxiliary task aimed to improve the generalization performance of the main transliteration task. In addition to our system, we also release a new English-to-Chinese dataset and propose a novel evaluation metric which considers multiple possible transliterations given a source name. Our results show that the multi-task model achieves similar performance as the previous state of the art with a model of a much smaller size.</abstract>
      <url hash="2bfc4e0f">2020.aacl-main.40</url>
    </paper>
    <paper id="41">
      <title>Predicting and Using Target Length in Neural Machine Translation</title>
      <author><first>Zijian</first><last>Yang</last></author>
      <author><first>Yingbo</first><last>Gao</last></author>
      <author><first>Weiyue</first><last>Wang</last></author>
      <author><first>Hermann</first><last>Ney</last></author>
      <pages>389–395</pages>
      <abstract>Attention-based encoder-decoder models have achieved great success in neural machine translation tasks. However, the lengths of the target sequences are not explicitly predicted in these models. This work proposes length prediction as an auxiliary task and sets up a sub-network to obtain the length information from the encoder. Experimental results show that the length prediction sub-network brings improvements over the strong baseline system and that the predicted length can be used as an alternative to length normalization during decoding.</abstract>
      <url hash="f6b395e0">2020.aacl-main.41</url>
    </paper>
    <paper id="42">
      <title>Grounded <fixed-case>PCFG</fixed-case> Induction with Images</title>
      <author><first>Lifeng</first><last>Jin</last></author>
      <author><first>William</first><last>Schuler</last></author>
      <pages>396–408</pages>
      <abstract>Recent work in unsupervised parsing has tried to incorporate visual information into learning, but results suggest that these models need linguistic bias to compete against models that only rely on text. This work proposes grammar induction models which use visual information from images for labeled parsing, and achieve state-of-the-art results on grounded grammar induction on several languages. Results indicate that visual information is especially helpful in languages where high frequency words are more broadly distributed. Comparison between models with and without visual information shows that the grounded models are able to use visual information for proposing noun phrases, gathering useful information from images for unknown words, and achieving better performance at prepositional phrase attachment prediction.</abstract>
      <url hash="bba4f1ed">2020.aacl-main.42</url>
    </paper>
    <paper id="43">
      <title>Heads-up! Unsupervised Constituency Parsing via Self-Attention Heads</title>
      <author><first>Bowen</first><last>Li</last></author>
      <author><first>Taeuk</first><last>Kim</last></author>
      <author><first>Reinald Kim</first><last>Amplayo</last></author>
      <author><first>Frank</first><last>Keller</last></author>
      <pages>409–424</pages>
      <abstract>Transformer-based pre-trained language models (PLMs) have dramatically improved the state of the art in NLP across many tasks. This has led to substantial interest in analyzing the syntactic knowledge PLMs learn. Previous approaches to this question have been limited, mostly using test suites or probes. Here, we propose a novel fully unsupervised parsing approach that extracts constituency trees from PLM attention heads. We rank transformer attention heads based on their inherent properties, and create an ensemble of high-ranking heads to produce the final tree. Our method is adaptable to low-resource languages, as it does not rely on development sets, which can be expensive to annotate. Our experiments show that the proposed method often outperforms existing approaches if there is no development set present. Our unsupervised parser can also be used as a tool to analyze the grammars PLMs learn implicitly. For this, we use the parse trees induced by our method to train a neural PCFG and compare it to a grammar derived from a human-annotated treebank.</abstract>
      <url hash="c083293f">2020.aacl-main.43</url>
    </paper>
    <paper id="44">
      <title>Building Location Embeddings from Physical Trajectories and Textual Representations</title>
      <author><first>Laura</first><last>Biester</last></author>
      <author><first>Carmen</first><last>Banea</last></author>
      <author><first>Rada</first><last>Mihalcea</last></author>
      <pages>425–434</pages>
      <abstract>Word embedding methods have become the de-facto way to represent words, having been successfully applied to a wide array of natural language processing tasks. In this paper, we explore the hypothesis that embedding methods can also be effectively used to represent spatial locations. Using a new dataset consisting of the location trajectories of 729 students over a seven month period and text data related to those locations, we implement several strategies to create location embeddings, which we then use to create embeddings of the sequences of locations a student has visited. To identify the surface level properties captured in the representations, we propose a number of probing tasks such as the presence of a specific location in a sequence or the type of activities that take place at a location. We then leverage the representations we generated and employ them in more complex downstream tasks ranging from predicting a student’s area of study to a student’s depression level, showing the effectiveness of these location embeddings.</abstract>
      <url hash="59322e4b">2020.aacl-main.44</url>
      <attachment type="Software" hash="53c2d42e">2020.aacl-main.44.Software.zip</attachment>
    </paper>
    <paper id="45">
      <title>Self-Supervised Learning for Pairwise Data Refinement</title>
      <author><first>Gustavo</first><last>Hernandez Abrego</last></author>
      <author><first>Bowen</first><last>Liang</last></author>
      <author><first>Wei</first><last>Wang</last></author>
      <author><first>Zarana</first><last>Parekh</last></author>
      <author><first>Yinfei</first><last>Yang</last></author>
      <author><first>Yunhsuan</first><last>Sung</last></author>
      <pages>435–446</pages>
      <abstract>Pairwise data automatically constructed from weakly supervised signals has been widely used for training deep learning models. Pairwise datasets such as parallel texts can have uneven quality levels overall, but usually contain data subsets that are more useful as learning examples. We present two methods to refine data that are aimed to obtain that kind of subsets in a self-supervised way. Our methods are based on iteratively training dual-encoder models to compute similarity scores. We evaluate our methods on de-noising parallel texts and training neural machine translation models. We find that: (i) The self-supervised refinement achieves most machine translation gains in the first iteration, but following iterations further improve its intrinsic evaluation. (ii) Machine translations can improve the de-noising performance when combined with selection steps. (iii) Our methods are able to reach the performance of a supervised method. Being entirely self-supervised, our methods are well-suited to handle pairwise data without the need of prior knowledge or human annotations.</abstract>
      <url hash="d0e82ad5">2020.aacl-main.45</url>
    </paper>
    <paper id="46">
      <title>A Survey of the State of Explainable <fixed-case>AI</fixed-case> for Natural Language Processing</title>
      <author><first>Marina</first><last>Danilevsky</last></author>
      <author><first>Kun</first><last>Qian</last></author>
      <author><first>Ranit</first><last>Aharonov</last></author>
      <author><first>Yannis</first><last>Katsis</last></author>
      <author><first>Ban</first><last>Kawas</last></author>
      <author><first>Prithviraj</first><last>Sen</last></author>
      <pages>447–459</pages>
      <abstract>Recent years have seen important advances in the quality of state-of-the-art models, but this has come at the expense of models becoming less interpretable. This survey presents an overview of the current state of Explainable AI (XAI), considered within the domain of Natural Language Processing (NLP). We discuss the main categorization of explanations, as well as the various ways explanations can be arrived at and visualized. We detail the operations and explainability techniques currently available for generating explanations for NLP model predictions, to serve as a resource for model developers in the community. Finally, we point out the current gaps and encourage directions for future work in this important research area.</abstract>
      <url hash="21a4d044">2020.aacl-main.46</url>
    </paper>
    <paper id="47">
      <title>Beyond Fine-tuning: Few-Sample Sentence Embedding Transfer</title>
      <author><first>Siddhant</first><last>Garg</last></author>
      <author><first>Rohit Kumar</first><last>Sharma</last></author>
      <author><first>Yingyu</first><last>Liang</last></author>
      <pages>460–469</pages>
      <abstract>Fine-tuning (FT) pre-trained sentence embedding models on small datasets has been shown to have limitations. In this paper we show that concatenating the embeddings from the pre-trained model with those from a simple sentence embedding model trained only on the target data, can improve over the performance of FT for few-sample tasks. To this end, a linear classifier is trained on the combined embeddings, either by freezing the embedding model weights or training the classifier and embedding models end-to-end. We perform evaluation on seven small datasets from NLP tasks and show that our approach with end-to-end training outperforms FT with negligible computational overhead. Further, we also show that sophisticated combination techniques like CCA and KCCA do not work as well in practice as concatenation. We provide theoretical analysis to explain this empirical observation.</abstract>
      <url hash="fecbf5c4">2020.aacl-main.47</url>
    </paper>
    <paper id="48">
      <title>Multimodal Pretraining for Dense Video Captioning</title>
      <author><first>Gabriel</first><last>Huang</last></author>
      <author><first>Bo</first><last>Pang</last></author>
      <author><first>Zhenhai</first><last>Zhu</last></author>
      <author><first>Clara</first><last>Rivera</last></author>
      <author><first>Radu</first><last>Soricut</last></author>
      <pages>470–490</pages>
      <abstract>Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.</abstract>
      <url hash="2da7b89c">2020.aacl-main.48</url>
    </paper>
    <paper id="49">
      <title>Systematic Generalization on g<fixed-case>SCAN</fixed-case> with Language Conditioned Embedding</title>
      <author><first>Tong</first><last>Gao</last></author>
      <author><first>Qi</first><last>Huang</last></author>
      <author><first>Raymond</first><last>Mooney</last></author>
      <pages>491–503</pages>
      <abstract>Systematic Generalization refers to a learning algorithm’s ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data. As shown in recent work, state-of-the-art deep learning models fail dramatically even on tasks for which they are designed when the test set is systematically different from the training data. We hypothesize that explicitly modeling the relations between objects in their contexts while learning their representations will help achieve systematic generalization. Therefore, we propose a novel method that learns objects’ contextualized embeddings with dynamic message passing conditioned on the input natural language and end-to-end trainable with other downstream deep learning modules. To our knowledge, this model is the first one that significantly outperforms the provided baseline and reaches state-of-the-art performance on grounded SCAN (gSCAN), a grounded natural language navigation dataset designed to require systematic generalization in its test splits.</abstract>
      <url hash="545b660a">2020.aacl-main.49</url>
    </paper>
    <paper id="50">
      <title>Are Scene Graphs Good Enough to Improve Image Captioning?</title>
      <author><first>Victor Siemen Janusz</first><last>Milewski</last></author>
      <author><first>Marie-Francine</first><last>Moens</last></author>
      <author><first>Iacer</first><last>Calixto</last></author>
      <pages>504–515</pages>
      <abstract>Many top-performing image captioning models rely solely on object features computed with an object detection model to generate image descriptions. However, recent studies propose to directly use scene graphs to introduce information about object relations into captioning, hoping to better describe interactions between objects. In this work, we thoroughly investigate the use of scene graphs in image captioning. We empirically study whether using additional scene graph encoders can lead to better image descriptions and propose a conditional graph attention network (C-GAT), where the image captioning decoder state is used to condition the graph updates. Finally, we determine to what extent noise in the predicted scene graphs influences caption quality. Overall, we find no significant difference between models that use scene graph features and models that only use object detection features across different captioning metrics, which suggests that existing scene graph generation models are still too noisy to be useful in image captioning. Moreover, although the quality of predicted scene graphs is very low in general, when using high quality scene graphs we obtain gains of up to 3.3 CIDEr compared to a strong Bottom-Up Top-Down baseline.</abstract>
      <url hash="fcacefff">2020.aacl-main.50</url>
    </paper>
    <paper id="51">
      <title>Systematically Exploring Redundancy Reduction in Summarizing Long Documents</title>
      <author><first>Wen</first><last>Xiao</last></author>
      <author><first>Giuseppe</first><last>Carenini</last></author>
      <pages>516–528</pages>
      <abstract>Our analysis of large summarization datasets indicates that redundancy is a very serious problem when summarizing long documents. Yet, redundancy reduction has not been thoroughly investigated in neural summarization. In this work, we systematically explore and compare different ways to deal with redundancy when summarizing long documents. Specifically, we organize existing methods into categories based on when and how the redundancy is considered. Then, in the context of these categories, we propose three additional methods balancing non-redundancy and importance in a general and flexible way. In a series of experiments, we show that our proposed methods achieve the state-of-the-art with respect to ROUGE scores on two scientific paper datasets, Pubmed and arXiv, while reducing redundancy significantly.</abstract>
      <url hash="2c00c7e2">2020.aacl-main.51</url>
    </paper>
    <paper id="52">
      <title>A Cascade Approach to Neural Abstractive Summarization with Content Selection and Fusion</title>
      <author><first>Logan</first><last>Lebanoff</last></author>
      <author><first>Franck</first><last>Dernoncourt</last></author>
      <author><first>Doo Soon</first><last>Kim</last></author>
      <author><first>Walter</first><last>Chang</last></author>
      <author id="fei-liu-utdallas"><first>Fei</first><last>Liu</last></author>
      <pages>529–535</pages>
      <abstract>We present an empirical study in favor of a cascade architecture to neural text summarization. Summarization practices vary widely but few other than news summarization can provide a sufficient amount of training data enough to meet the requirement of end-to-end neural abstractive systems which perform content selection and surface realization jointly to generate abstracts. Such systems also pose a challenge to summarization evaluation, as they force content selection to be evaluated along with text generation, yet evaluation of the latter remains an unsolved problem. In this paper, we present empirical results showing that the performance of a cascaded pipeline that separately identifies important content pieces and stitches them together into a coherent text is comparable to or outranks that of end-to-end systems, whereas a pipeline architecture allows for flexible content selection. We finally discuss how we can take advantage of a cascaded pipeline in neural text summarization and shed light on important directions for future research.</abstract>
      <url hash="ba13f3c2">2020.aacl-main.52</url>
    </paper>
    <paper id="53">
      <title>Mixed-Lingual Pre-training for Cross-lingual Summarization</title>
      <author><first>Ruochen</first><last>Xu</last></author>
      <author><first>Chenguang</first><last>Zhu</last></author>
      <author><first>Yu</first><last>Shi</last></author>
      <author><first>Michael</first><last>Zeng</last></author>
      <author><first>Xuedong</first><last>Huang</last></author>
      <pages>536–541</pages>
      <abstract>Cross-lingual Summarization (CLS) aims at producing a summary in the target language for an article in the source language. Traditional solutions employ a two-step approach, i.e. translate -&gt; summarize or summarize -&gt; translate. Recently, end-to-end models have achieved better results, but these approaches are mostly limited by their dependence on large-scale labeled data. We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks such as translation and monolingual tasks like masked language models. Thus, our model can leverage the massive monolingual data to enhance its modeling of language. Moreover, the architecture has no task-specific components, which saves memory and increases optimization efficiency. We show in experiments that this pre-training scheme can effectively boost the performance of cross-lingual summarization. In NCLS dataset, our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.</abstract>
      <url hash="9204b96a">2020.aacl-main.53</url>
    </paper>
    <paper id="54">
      <title>Point-of-Interest Oriented Question Answering with Joint Inference of Semantic Matching and Distance Correlation</title>
      <author><first>Yifei</first><last>Yuan</last></author>
      <author><first>Jingbo</first><last>Zhou</last></author>
      <author><first>Wai</first><last>Lam</last></author>
      <pages>542–550</pages>
      <abstract>Point-of-Interest (POI) oriented question answering (QA) aims to return a list of POIs given a question issued by a user. Recent advances in intelligent virtual assistants have opened the possibility of engaging the client software more actively in the provision of location-based services, thereby showing great promise for automatic POI retrieval. Some existing QA methods can be adopted on this task such as QA similarity calculation and semantic parsing using pre-defined rules. The returned results, however, are subject to inherent limitations due to the lack of the ability for handling some important POI related information, including tags, location entities, and proximity-related terms (e.g. “nearby”, “close”). In this paper, we present a novel deep learning framework integrated with joint inference to capture both tag semantic and geographic correlation between question and POIs. One characteristic of our model is to propose a special cross attention question embedding neural network structure to obtain question-to-POI and POI-to-question information. Besides, we utilize a skewed distribution to simulate the spatial relationship between questions and POIs. By measuring the results offered by the model against existing methods, we demonstrate its robustness and practicability, and supplement our conclusions with empirical evidence.</abstract>
      <url hash="f6e8f0a8">2020.aacl-main.54</url>
    </paper>
    <paper id="55">
      <title>Leveraging Structured Metadata for Improving Question Answering on the Web</title>
      <author><first>Xinya</first><last>Du</last></author>
      <author><first>Ahmed Hassan</first><last>Awadallah</last></author>
      <author><first>Adam</first><last>Fourney</last></author>
      <author><first>Robert</first><last>Sim</last></author>
      <author><first>Paul</first><last>Bennett</last></author>
      <author><first>Claire</first><last>Cardie</last></author>
      <pages>551–556</pages>
      <abstract>We show that leveraging metadata information from web pages can improve the performance of models for answer passage selection/reranking. We propose a neural passage selection model that leverages metadata information with a fine-grained encoding strategy, which learns the representation for metadata predicates in a hierarchical way. The models are evaluated on the MS MARCO (Nguyen et al., 2016) and Recipe-MARCO datasets. Results show that our models significantly outperform baseline models, which do not incorporate metadata. We also show the fine-grained encoding’s advantage over other strategies for encoding the metadata.</abstract>
      <url hash="f361ccd8">2020.aacl-main.55</url>
    </paper>
    <paper id="56">
      <title><fixed-case>E</fixed-case>nglish Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too</title>
      <author><first>Jason</first><last>Phang</last></author>
      <author><first>Iacer</first><last>Calixto</last></author>
      <author><first>Phu Mon</first><last>Htut</last></author>
      <author><first>Yada</first><last>Pruksachatkun</last></author>
      <author><first>Haokun</first><last>Liu</last></author>
      <author><first>Clara</first><last>Vania</last></author>
      <author><first>Katharina</first><last>Kann</last></author>
      <author><first>Samuel R.</first><last>Bowman</last></author>
      <pages>557–575</pages>
      <abstract>Intermediate-task training—fine-tuning a pretrained model on an intermediate task before fine-tuning again on the target task—often improves model performance substantially on language understanding tasks in monolingual English settings. We investigate whether English intermediate-task training is still helpful on non-English target tasks. Using nine intermediate language-understanding tasks, we evaluate intermediate-task transfer in a zero-shot cross-lingual setting on the XTREME benchmark. We see large improvements from intermediate training on the BUCC and Tatoeba sentence retrieval tasks and moderate improvements on question-answering target tasks. MNLI, SQuAD and HellaSwag achieve the best overall results as intermediate tasks, while multi-task intermediate offers small additional improvements. Using our best intermediate-task models for each target task, we obtain a 5.4 point improvement over XLM-R Large on the XTREME benchmark, setting the state of the art as of June 2020. We also investigate continuing multilingual MLM during intermediate-task training and using machine-translated intermediate-task data, but neither consistently outperforms simply performing English intermediate-task training.</abstract>
      <url hash="52e751af">2020.aacl-main.56</url>
    </paper>
    <paper id="57">
      <title><fixed-case>STIL</fixed-case> - Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using m<fixed-case>BART</fixed-case> on <fixed-case>M</fixed-case>ulti<fixed-case>ATIS</fixed-case>++</title>
      <author><first>Jack</first><last>FitzGerald</last></author>
      <pages>576–581</pages>
      <abstract>Slot-filling, Translation, Intent classification, and Language identification, or STIL, is a newly-proposed task for multilingual Natural Language Understanding (NLU). By performing simultaneous slot filling and translation into a single output language (English in this case), some portion of downstream system components can be monolingual, reducing development and maintenance cost. Results are given using the multilingual BART model (Liu et al., 2020) fine-tuned on 7 languages using the MultiATIS++ dataset. When no translation is performed, mBART’s performance is comparable to the current state of the art system (Cross-Lingual BERT by Xu et al. (2020)) for the languages tested, with better average intent classification accuracy (96.07% versus 95.50%) but worse average slot F1 (89.87% versus 90.81%). When simultaneous translation is performed, average intent classification accuracy degrades by only 1.7% relative and average slot F1 degrades by only 1.2% relative.</abstract>
      <url hash="a49c927c">2020.aacl-main.57</url>
    </paper>
    <paper id="58">
      <title><fixed-case>S</fixed-case>imul<fixed-case>MT</fixed-case> to <fixed-case>S</fixed-case>imul<fixed-case>ST</fixed-case>: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation</title>
      <author><first>Xutai</first><last>Ma</last></author>
      <author><first>Juan</first><last>Pino</last></author>
      <author><first>Philipp</first><last>Koehn</last></author>
      <pages>582–587</pages>
      <abstract>We investigate how to adapt simultaneous text translation methods such as wait-<tex-math>k</tex-math> and monotonic multihead attention to end-to-end simultaneous speech translation by introducing a pre-decision module. A detailed analysis is provided on the latency-quality trade-offs of combining fixed and flexible pre-decision with fixed and flexible policies. We also design a novel computation-aware latency metric, adapted from Average Lagging.</abstract>
      <url hash="e8941a14">2020.aacl-main.58</url>
    </paper>
    <paper id="59">
      <title>Cue Me In: Content-Inducing Approaches to Interactive Story Generation</title>
      <author><first>Faeze</first><last>Brahman</last></author>
      <author><first>Alexandru</first><last>Petrusca</last></author>
      <author><first>Snigdha</first><last>Chaturvedi</last></author>
      <pages>588–597</pages>
      <abstract>Automatically generating stories is a challenging problem that requires producing causally related and logical sequences of events about a topic. Previous approaches in this domain have focused largely on one-shot generation, where a language model outputs a complete story based on limited initial input from a user. Here, we instead focus on the task of interactive story generation, where the user provides the model mid-level sentence abstractions in the form of cue phrases during the generation process. This provides an interface for human users to guide the story generation. We present two content-inducing approaches to effectively incorporate this additional information. Experimental results from both automatic and human evaluations show that these methods produce more topically coherent and personalized stories compared to baseline methods.</abstract>
      <url hash="312edeb2">2020.aacl-main.59</url>
    </paper>
    <paper id="60">
      <title>Liputan6: A Large-scale <fixed-case>I</fixed-case>ndonesian Dataset for Text Summarization</title>
      <author><first>Fajri</first><last>Koto</last></author>
      <author><first>Jey Han</first><last>Lau</last></author>
      <author><first>Timothy</first><last>Baldwin</last></author>
      <pages>598–608</pages>
      <abstract>In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document–summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose both issues with ROUGE itself, as well as with extractive and abstractive summarization models.</abstract>
      <url hash="46725b36">2020.aacl-main.60</url>
    </paper>
    <paper id="61">
      <title>Generating Sports News from Live Commentary: A <fixed-case>C</fixed-case>hinese Dataset for Sports Game Summarization</title>
      <author><first>Kuan-Hao</first><last>Huang</last></author>
      <author><first>Chen</first><last>Li</last></author>
      <author><first>Kai-Wei</first><last>Chang</last></author>
      <pages>609–615</pages>
      <abstract>Sports game summarization focuses on generating news articles from live commentaries. Unlike traditional summarization tasks, the source documents and the target summaries for sports game summarization tasks are written in quite different writing styles. In addition, live commentaries usually contain many named entities, which makes summarizing sports games precisely very challenging. To deeply study this task, we present SportsSum, a Chinese sports game summarization dataset which contains 5,428 soccer games of live commentaries and the corresponding news articles. Additionally, we propose a two-step summarization model consisting of a selector and a rewriter for SportsSum. To evaluate the correctness of generated sports summaries, we design two novel score metrics: name matching score and event matching score. Experimental results show that our model performs better than other summarization baselines on ROUGE scores as well as the two designed scores.</abstract>
      <url hash="8c935819">2020.aacl-main.61</url>
    </paper>
    <paper id="62">
      <title>Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance</title>
      <author><first>Ahmed</first><last>El-Kishky</last></author>
      <author><first>Francisco</first><last>Guzmán</last></author>
      <pages>616–625</pages>
      <abstract>Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Such aligned data can be used for a variety of NLP tasks from training cross-lingual representations to mining parallel data for machine translation. In this paper we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages. These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low, mid, and high-resource language pairs. Recognizing that our proposed scoring function and other state of the art methods are computationally intractable for long web documents, we utilize a more tractable greedy algorithm that performs comparably. We experimentally demonstrate that our distance metric performs better alignment than current baselines outperforming them by 7% on high-resource language pairs, 15% on mid-resource language pairs, and 22% on low-resource language pairs.</abstract>
      <url hash="fa121ac9">2020.aacl-main.62</url>
    </paper>
    <paper id="63">
      <title>Improving Context Modeling in Neural Topic Segmentation</title>
      <author><first>Linzi</first><last>Xing</last></author>
      <author><first>Brad</first><last>Hackinen</last></author>
      <author><first>Giuseppe</first><last>Carenini</last></author>
      <author><first>Francesco</first><last>Trebbi</last></author>
      <pages>626–636</pages>
      <abstract>Topic segmentation is critical in key NLP tasks and recent works favor highly effective neural supervised approaches. However, current neural solutions are arguably limited in how they model context. In this paper, we enhance a segmenter based on a hierarchical attention BiLSTM network to better model context, by adding a coherence-related auxiliary task and restricted self-attention. Our optimized segmenter outperforms SOTA approaches when trained and tested on three datasets. We also show the robustness of our proposed model in a domain transfer setting by training a model on a large-scale dataset and testing it on four challenging real-world benchmarks. Furthermore, we apply our proposed strategy to two other languages (German and Chinese), and show its effectiveness in multilingual scenarios.</abstract>
      <url hash="d93ad209">2020.aacl-main.63</url>
    </paper>
    <paper id="64">
      <title>Contextualized End-to-End Neural Entity Linking</title>
      <author><first>Haotian</first><last>Chen</last></author>
      <author><first>Xi</first><last>Li</last></author>
      <author><first>Andrej</first><last>Zukov Gregoric</last></author>
      <author><first>Sahil</first><last>Wadhwa</last></author>
      <pages>637–642</pages>
      <abstract>We propose an entity linking (EL) model that jointly learns mention detection (MD) and entity disambiguation (ED). Our model applies task-specific heads on top of shared BERT contextualized embeddings. We achieve state-of-the-art results across a standard EL dataset using our model; we also study our model’s performance under the setting when hand-crafted entity candidate sets are not available and find that the model performs well under such a setting too.</abstract>
      <url hash="f5b07026">2020.aacl-main.64</url>
    </paper>
    <paper id="65">
      <title><fixed-case>DAPPER</fixed-case>: Learning Domain-Adapted Persona Representation Using Pretrained <fixed-case>BERT</fixed-case> and External Memory</title>
      <author><first>Prashanth</first><last>Vijayaraghavan</last></author>
      <author><first>Eric</first><last>Chu</last></author>
      <author><first>Deb</first><last>Roy</last></author>
      <pages>643–652</pages>
      <abstract>Research in building intelligent agents has emphasized the need for understanding characteristic behavior of people. In order to reflect human-like behavior, agents require the capability to comprehend the context, infer individualized persona patterns and incrementally learn from experience. In this paper, we present a model called DAPPER that can learn to embed persona from natural language and alleviate task or domain-specific data sparsity issues related to personas. To this end, we implement a text encoding strategy that leverages a pretrained language model and an external memory to produce domain-adapted persona representations. Further, we evaluate the transferability of these embeddings by simulating low-resource scenarios. Our comparative study demonstrates the capability of our method over other approaches towards learning rich transferable persona embeddings. Empirical evidence suggests that the learnt persona embeddings can be effective in downstream tasks like hate speech detection.</abstract>
      <url hash="74c10035">2020.aacl-main.65</url>
    </paper>
    <paper id="66">
      <title>Event Coreference Resolution with Non-Local Information</title>
      <author><first>Jing</first><last>Lu</last></author>
      <author><first>Vincent</first><last>Ng</last></author>
      <pages>653–663</pages>
      <abstract>We present two extensions to a state-of-the-art joint model for event coreference resolution, which involve incorporating (1) a supervised topic model for improving trigger detection by providing global context, and (2) a preprocessing module that seeks to improve event coreference by discarding unlikely candidate antecedents of an event mention using discourse contexts computed based on salient entities. The resulting model yields the best results reported to date on the KBP 2017 English and Chinese datasets.</abstract>
      <url hash="dfadac5c">2020.aacl-main.66</url>
    </paper>
    <paper id="67">
      <title>Neural <fixed-case>RST</fixed-case>-based Evaluation of Discourse Coherence</title>
      <author><first>Grigorii</first><last>Guz</last></author>
      <author><first>Peyman</first><last>Bateni</last></author>
      <author><first>Darius</first><last>Muglich</last></author>
      <author><first>Giuseppe</first><last>Carenini</last></author>
      <pages>664–671</pages>
      <abstract>This paper evaluates the utility of Rhetorical Structure Theory (RST) trees and relations in discourse coherence evaluation. We show that incorporating silver-standard RST features can increase accuracy when classifying coherence. We demonstrate this through our tree-recursive neural model, namely RST-Recursive, which takes advantage of the text’s RST features produced by a state of the art RST parser. We evaluate our approach on the Grammarly Corpus for Discourse Coherence (GCDC) and show that when ensembled with the current state of the art, we can achieve the new state of the art accuracy on this benchmark. Furthermore, when deployed alone, RST-Recursive achieves competitive accuracy while having 62% fewer parameters.</abstract>
      <url hash="4363ecf4">2020.aacl-main.67</url>
    </paper>
    <paper id="68">
      <title><fixed-case>A</fixed-case>sking <fixed-case>C</fixed-case>rowdworkers to <fixed-case>W</fixed-case>rite <fixed-case>E</fixed-case>ntailment <fixed-case>E</fixed-case>xamples: <fixed-case>T</fixed-case>he <fixed-case>B</fixed-case>est of <fixed-case>B</fixed-case>ad Options</title>
      <author><first>Clara</first><last>Vania</last></author>
      <author><first>Ruijie</first><last>Chen</last></author>
      <author><first>Samuel R.</first><last>Bowman</last></author>
      <pages>672–686</pages>
      <abstract>Large-scale natural language inference (NLI) datasets such as SNLI or MNLI have been created by asking crowdworkers to read a premise and write three new hypotheses, one for each possible semantic relationship (entailment, contradiction, and neutral). While this protocol has been used to create useful benchmark data, it remains unclear whether the writing-based annotation protocol is optimal for any purpose, since it has not been evaluated directly. Furthermore, there is ample evidence that crowdworker writing can introduce artifacts in the data. We investigate two alternative protocols which automatically create candidate (premise, hypothesis) pairs for annotators to label. Using these protocols and a writing-based baseline, we collect several new English NLI datasets of over 3k examples each, each using a fixed amount of annotator time, but a varying number of examples to fit that time budget. Our experiments on NLI and transfer learning show negative results: None of the alternative protocols outperforms the baseline in evaluations of generalization within NLI or on transfer to outside target tasks. We conclude that crowdworker writing is still the best known option for entailment data, highlighting the need for further data collection work to focus on improving writing-based annotation processes.</abstract>
      <url hash="310210a4">2020.aacl-main.68</url>
    </paper>
    <paper id="69">
      <title><fixed-case>M</fixed-case>a<fixed-case>P</fixed-case>: A Matrix-based Prediction Approach to Improve Span Extraction in Machine Reading Comprehension</title>
      <author><first>Huaishao</first><last>Luo</last></author>
      <author><first>Yu</first><last>Shi</last></author>
      <author><first>Ming</first><last>Gong</last></author>
      <author><first>Linjun</first><last>Shou</last></author>
      <author><first>Tianrui</first><last>Li</last></author>
      <pages>687–695</pages>
      <abstract>Span extraction is an essential problem in machine reading comprehension. Most of the existing algorithms predict the start and end positions of an answer span in the given corresponding context by generating two probability vectors. In this paper, we propose a novel approach that extends the probability vector to a probability matrix. Such a matrix can cover more start-end position pairs. Precisely, to each possible start index, the method always generates an end probability vector. Besides, we propose a sampling-based training strategy to address the computational cost and memory issue in the matrix training phase. We evaluate our method on SQuAD 1.1 and three other question answering benchmarks. Leveraging the most competitive models BERT and BiDAF as the backbone, our proposed approach can get consistent improvements in all datasets, demonstrating the effectiveness of the proposed method.</abstract>
      <url hash="4182c848">2020.aacl-main.69</url>
    </paper>
    <paper id="70">
      <title>Answering Product-related Questions with Heterogeneous Information</title>
      <author><first>Wenxuan</first><last>Zhang</last></author>
      <author><first>Qian</first><last>Yu</last></author>
      <author><first>Wai</first><last>Lam</last></author>
      <pages>696–705</pages>
      <abstract>Providing instant response for product-related questions in E-commerce question answering platforms can greatly improve users’ online shopping experience. However, existing product question answering (PQA) methods only consider a single information source such as user reviews and/or require large amounts of labeled data. In this paper, we propose a novel framework to tackle the PQA task via exploiting heterogeneous information including natural language text and attribute-value pairs from two information sources of the concerned product, namely product details and user reviews. A heterogeneous information encoding component is then designed for obtaining unified representations of information with different formats. The sources of the candidate snippets are also incorporated when measuring the question-snippet relevance. Moreover, the framework is trained with a specifically designed weak supervision paradigm making use of available answers in the training phase. Experiments on a real-world dataset show that our proposed framework achieves superior performance over state-of-the-art models.</abstract>
      <url hash="5f93d8d0">2020.aacl-main.70</url>
    </paper>
    <paper id="71">
      <title>Two-Step Classification using Recasted Data for Low Resource Settings</title>
      <author><first>Shagun</first><last>Uppal</last></author>
      <author><first>Vivek</first><last>Gupta</last></author>
      <author><first>Avinash</first><last>Swaminathan</last></author>
      <author><first>Haimin</first><last>Zhang</last></author>
      <author><first>Debanjan</first><last>Mahata</last></author>
      <author><first>Rakesh</first><last>Gosangi</last></author>
      <author><first>Rajiv Ratn</first><last>Shah</last></author>
      <author><first>Amanda</first><last>Stent</last></author>
      <pages>706–719</pages>
      <abstract>An NLP model’s ability to reason should be independent of language. Previous works utilize Natural Language Inference (NLI) to understand the reasoning ability of models, mostly focusing on high resource languages like English. To address scarcity of data in low-resource languages such as Hindi, we use data recasting to create NLI datasets for four existing text classification datasets. Through experiments, we show that our recasted dataset is devoid of statistical irregularities and spurious patterns. We further study the consistency in predictions of the textual entailment models and propose a consistency regulariser to remove pairwise-inconsistencies in predictions. We propose a novel two-step classification method which uses textual-entailment predictions for classification task. We further improve the performance by using a joint-objective for classification and textual entailment. We therefore highlight the benefits of data recasting and improvements on classification performance using our approach with supporting experimental results.</abstract>
      <url hash="a91456ff">2020.aacl-main.71</url>
    </paper>
    <paper id="72">
      <title>Explaining Word Embeddings via Disentangled Representation</title>
      <author><first>Keng-Te</first><last>Liao</last></author>
      <author><first>Cheng-Syuan</first><last>Lee</last></author>
      <author><first>Zhong-Yu</first><last>Huang</last></author>
      <author><first>Shou-de</first><last>Lin</last></author>
      <pages>720–725</pages>
      <abstract>Disentangled representations have attracted increasing attention recently. However, how to transfer the desired properties of disentanglement to word representations is unclear. In this work, we propose to transform typical dense word vectors into disentangled embeddings featuring improved interpretability via encoding polysemous semantics separately. We also found the modular structure of our disentangled word embeddings helps generate more efficient and effective features for natural language processing tasks.</abstract>
      <url hash="18e4371f">2020.aacl-main.72</url>
    </paper>
    <paper id="73">
      <title>Multi-view Classification Model for Knowledge Graph Completion</title>
      <author><first>Wenbin</first><last>Jiang</last></author>
      <author><first>Mengfei</first><last>Guo</last></author>
      <author><first>Yufeng</first><last>Chen</last></author>
      <author><first>Ying</first><last>Li</last></author>
      <author><first>Jinan</first><last>Xu</last></author>
      <author><first>Yajuan</first><last>Lyu</last></author>
      <author><first>Yong</first><last>Zhu</last></author>
      <pages>726–734</pages>
      <abstract>Most previous work on knowledge graph completion conducted single-view prediction or calculation for candidate triple evaluation, based only on the content information of the candidate triples. This paper describes a novel multi-view classification model for knowledge graph completion, where multiple classification views are performed based on both content and context information for candidate triple evaluation. Each classification view evaluates the validity of a candidate triple from a specific viewpoint, based on the content information inside the candidate triple and the context information nearby the triple. These classification views are implemented by a unified neural network and the classification predictions are weightedly integrated to obtain the final evaluation. Experiments show that the multi-view model brings very significant improvements over previous methods, and achieves the new state-of-the-art on two representative datasets. We believe that the flexibility and the scalability of the multi-view classification model facilitate the introduction of additional information and resources for better performance.</abstract>
      <url hash="1427616c">2020.aacl-main.73</url>
    </paper>
    <paper id="74">
      <title>Knowledge-Enhanced Named Entity Disambiguation for Short Text</title>
      <author><first>Zhifan</first><last>Feng</last></author>
      <author><first>Qi</first><last>Wang</last></author>
      <author><first>Wenbin</first><last>Jiang</last></author>
      <author><first>Yajuan</first><last>Lyu</last></author>
      <author><first>Yong</first><last>Zhu</last></author>
      <pages>735–744</pages>
      <abstract>Named entity disambiguation is an important task that plays the role of bridge between text and knowledge. However, the performance of existing methods drops dramatically for short text, which is widely used in actual application scenarios, such as information retrieval and question answering. In this work, we propose a novel knowledge-enhanced method for named entity disambiguation. Considering the problem of information ambiguity and incompleteness for short text, two kinds of knowledge, factual knowledge graph and conceptual knowledge graph, are introduced to provide additional knowledge for the semantic matching between candidate entity and mention context. Our proposed method achieves significant improvement over previous methods on a large manually annotated short-text dataset, and also achieves the state-of-the-art on three standard datasets. The short-text dataset and the proposed model will be publicly available for research use.</abstract>
      <url hash="937719e4">2020.aacl-main.74</url>
    </paper>
    <paper id="75">
      <title>More Data, More Relations, More Context and More Openness: A Review and Outlook for Relation Extraction</title>
      <author><first>Xu</first><last>Han</last></author>
      <author><first>Tianyu</first><last>Gao</last></author>
      <author><first>Yankai</first><last>Lin</last></author>
      <author><first>Hao</first><last>Peng</last></author>
      <author><first>Yaoliang</first><last>Yang</last></author>
      <author><first>Chaojun</first><last>Xiao</last></author>
      <author><first>Zhiyuan</first><last>Liu</last></author>
      <author><first>Peng</first><last>Li</last></author>
      <author><first>Jie</first><last>Zhou</last></author>
      <author><first>Maosong</first><last>Sun</last></author>
      <pages>745–758</pages>
      <abstract>Relational facts are an important component of human knowledge, which are hidden in vast amounts of text. In order to extract these facts from text, people have been working on relation extraction (RE) for years. From early pattern matching to current neural networks, existing RE methods have achieved significant progress. Yet with explosion of Web text and emergence of new relations, human knowledge is increasing drastically, and we thus require “more” from RE: a more powerful RE system that can robustly utilize more data, efficiently learn more relations, easily handle more complicated context, and flexibly generalize to more open domains. In this paper, we look back at existing RE methods, analyze key challenges we are facing nowadays, and show promising directions towards more powerful RE. We hope our view can advance this field and inspire more efforts in the community.</abstract>
      <url hash="bb1e00d7">2020.aacl-main.75</url>
    </paper>
    <paper id="76">
      <title>Robustness and Reliability of Gender Bias Assessment in Word Embeddings: The Role of Base Pairs</title>
      <author><first>Haiyang</first><last>Zhang</last></author>
      <author><first>Alison</first><last>Sneyd</last></author>
      <author><first>Mark</first><last>Stevenson</last></author>
      <pages>759–769</pages>
      <abstract>It has been shown that word embeddings can exhibit gender bias, and various methods have been proposed to quantify this. However, the extent to which the methods are capturing social stereotypes inherited from the data has been debated. Bias is a complex concept and there exist multiple ways to define it. Previous work has leveraged gender word pairs to measure bias and extract biased analogies. We show that the reliance on these gendered pairs has strong limitations: bias measures based off of them are not robust and cannot identify common types of real-world bias, whilst analogies utilising them are unsuitable indicators of bias. In particular, the well-known analogy “man is to computer-programmer as woman is to homemaker” is due to word similarity rather than bias. This has important implications for work on measuring bias in embeddings and related work debiasing embeddings.</abstract>
      <url hash="486fb70f">2020.aacl-main.76</url>
    </paper>
    <paper id="77">
      <title><fixed-case>E</fixed-case>xpan<fixed-case>RL</fixed-case>: Hierarchical Reinforcement Learning for Course Concept Expansion in <fixed-case>MOOC</fixed-case>s</title>
      <author><first>Jifan</first><last>Yu</last></author>
      <author><first>Chenyu</first><last>Wang</last></author>
      <author><first>Gan</first><last>Luo</last></author>
      <author><first>Lei</first><last>Hou</last></author>
      <author><first>Juanzi</first><last>Li</last></author>
      <author><first>Jie</first><last>Tang</last></author>
      <author><first>Minlie</first><last>Huang</last></author>
      <author><first>Zhiyuan</first><last>Liu</last></author>
      <pages>770–780</pages>
      <abstract>Within the prosperity of Massive Open Online Courses (MOOCs), the education applications that automatically provide extracurricular knowledge for MOOC users become rising research topics. However, MOOC courses’ diversity and rapid updates make it more challenging to find suitable new knowledge for students. In this paper, we present ExpanRL, an end-to-end hierarchical reinforcement learning (HRL) model for concept expansion in MOOCs. Employing a two-level HRL mechanism of seed selection and concept expansion, ExpanRL is more feasible to adjust the expansion strategy to find new concepts based on the students’ feedback on expansion results. Our experiments on nine novel datasets from real MOOCs show that ExpanRL achieves significant improvements over existing methods and maintains competitive performance under different settings.</abstract>
      <url hash="a180d3b0">2020.aacl-main.77</url>
    </paper>
    <paper id="78">
      <title>Vocabulary Matters: A Simple yet Effective Approach to Paragraph-level Question Generation</title>
      <author><first>Vishwajeet</first><last>Kumar</last></author>
      <author><first>Manish</first><last>Joshi</last></author>
      <author><first>Ganesh</first><last>Ramakrishnan</last></author>
      <author><first>Yuan-Fang</first><last>Li</last></author>
      <pages>781–785</pages>
      <abstract>Question generation (QG) has recently attracted considerable attention. Most of the current neural models take as input only one or two sentences, and perform poorly when multiple sentences or complete paragraphs are given as input. However, in real-world scenarios it is very important to be able to generate high-quality questions from complete paragraphs. In this paper, we present a simple yet effective technique for answer-aware question generation from paragraphs. We augment a basic sequence-to-sequence QG model with dynamic, paragraph-specific dictionary and copy attention that is persistent across the corpus, without requiring features generated by sophisticated NLP pipelines or handcrafted rules. Our evaluation on SQuAD shows that our model significantly outperforms current state-of-the-art systems in question generation from paragraphs in both automatic and human evaluation. We achieve a 6-point improvement over the best system on BLEU-4, from 16.38 to 22.62.</abstract>
      <url hash="50d1a3f3">2020.aacl-main.78</url>
    </paper>
    <paper id="79">
      <title>From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks</title>
      <author><first>Steffen</first><last>Eger</last></author>
      <author><first>Yannik</first><last>Benz</last></author>
      <pages>786–803</pages>
      <abstract>Adversarial attacks are label-preserving modifications to inputs of machine learning classifiers designed to fool machines but not humans. Natural Language Processing (NLP) has mostly focused on high-level attack scenarios such as paraphrasing input texts. We argue that these are less realistic in typical application scenarios such as in social media, and instead focus on low-level attacks on the character-level. Guided by human cognitive abilities and human robustness, we propose the first large-scale catalogue and benchmark of low-level adversarial attacks, which we dub Zéroe, encompassing nine different attack modes including visual and phonetic adversaries. We show that RoBERTa, NLP’s current workhorse, fails on our attacks. Our dataset provides a benchmark for testing robustness of future more human-like NLP models.</abstract>
      <url hash="997dbb31">2020.aacl-main.79</url>
    </paper>
    <paper id="80">
      <title>Point-of-Interest Type Inference from Social Media Text</title>
      <author><first>Danae</first><last>Sánchez Villegas</last></author>
      <author><first>Daniel</first><last>Preotiuc-Pietro</last></author>
      <author><first>Nikolaos</first><last>Aletras</last></author>
      <pages>804–810</pages>
      <abstract>Physical places help shape how we perceive the experiences we have there. We study the relationship between social media text and the type of the place from where it was posted, whether a park, restaurant, or someplace else. To facilitate this, we introduce a novel data set of ~200,000 English tweets published from 2,761 different points-of-interest in the U.S., enriched with place type information. We train classifiers to predict the type of the location a tweet was sent from that reach a macro F1 of 43.67 across eight classes and uncover the linguistic markers associated with each type of place. The ability to predict semantic place information from a tweet has applications in recommendation systems, personalization services and cultural geography.</abstract>
      <url hash="2baf23cf">2020.aacl-main.80</url>
      <attachment type="Dataset" hash="35d1e2f7">2020.aacl-main.80.Dataset.zip</attachment>
    </paper>
    <paper id="81">
      <title>Reconstructing Event Regions for Event Extraction via Graph Attention Networks</title>
      <author><first>Pei</first><last>Chen</last></author>
      <author><first>Hang</first><last>Yang</last></author>
      <author><first>Kang</first><last>Liu</last></author>
      <author><first>Ruihong</first><last>Huang</last></author>
      <author><first>Yubo</first><last>Chen</last></author>
      <author><first>Taifeng</first><last>Wang</last></author>
      <author><first>Jun</first><last>Zhao</last></author>
      <pages>811–820</pages>
      <abstract>Event information is usually scattered across multiple sentences within a document. The local sentence-level event extractors often yield many noisy event role filler extractions in the absence of a broader view of the document-level context. Filtering spurious extractions and aggregating event information in a document remains a challenging problem. Following the observation that a document has several relevant event regions densely populated with event role fillers, we build graphs with candidate role filler extractions enriched by sentential embeddings as nodes, and use graph attention networks to identify event regions in a document and aggregate event information. We characterize edges between candidate extractions in a graph into rich vector representations to facilitate event region identification. The experimental results on two datasets of two languages show that our approach yields new state-of-the-art performance for the challenging event extraction task.</abstract>
      <url hash="3ede6a4c">2020.aacl-main.81</url>
      <attachment type="Dataset" hash="9ee34afe">2020.aacl-main.81.Dataset.rar</attachment>
    </paper>
    <paper id="82">
      <title>Recipe Instruction Semantics Corpus (<fixed-case>RIS</fixed-case>e<fixed-case>C</fixed-case>): <fixed-case>R</fixed-case>esolving Semantic Structure and Zero Anaphora in Recipes</title>
      <author><first>Yiwei</first><last>Jiang</last></author>
      <author><first>Klim</first><last>Zaporojets</last></author>
      <author><first>Johannes</first><last>Deleu</last></author>
      <author><first>Thomas</first><last>Demeester</last></author>
      <author><first>Chris</first><last>Develder</last></author>
      <pages>821–826</pages>
      <abstract>We propose a newly annotated dataset for information extraction on recipes. Unlike previous approaches to machine comprehension of procedural texts, we avoid a priori pre-defining domain-specific predicates to recognize (e.g., the primitive instructions in MILK) and focus on basic understanding of the expressed semantics rather than directly reduce them to a simplified state representation (e.g., ProPara). We thus frame the semantic comprehension of procedural text such as recipes, as fairly generic NLP subtasks, covering (i) entity recognition (ingredients, tools and actions), (ii) relation extraction (what ingredients and tools are involved in the actions), and (iii) zero anaphora resolution (link actions to implicit arguments, e.g., results from previous recipe steps). Further, our Recipe Instruction Semantic Corpus (RISeC) dataset includes textual descriptions for the zero anaphora, to facilitate language generation thereof. Besides the dataset itself, we contribute a pipeline neural architecture that addresses entity and relation extraction as well as identification of zero anaphora. These basic building blocks can facilitate more advanced downstream applications (e.g., question answering, conversational agents).</abstract>
      <url hash="2d2e2afa">2020.aacl-main.82</url>
    </paper>
    <paper id="83">
      <title>Stronger Baselines for Grammatical Error Correction Using a Pretrained Encoder-Decoder Model</title>
      <author><first>Satoru</first><last>Katsumata</last></author>
      <author><first>Mamoru</first><last>Komachi</last></author>
      <pages>827–832</pages>
      <abstract>Studies on grammatical error correction (GEC) have reported on the effectiveness of pretraining a Seq2Seq model with a large amount of pseudodata. However, this approach requires time-consuming pretraining of GEC because of the size of the pseudodata. In this study, we explored the utility of bidirectional and auto-regressive transformers (BART) as a generic pretrained encoder-decoder model for GEC. With the use of this generic pretrained model for GEC, the time-consuming pretraining can be eliminated. We find that monolingual and multilingual BART models achieve high performance in GEC, with one of the results being comparable to the current strong results in English GEC.</abstract>
      <url hash="b53ff478">2020.aacl-main.83</url>
    </paper>
    <paper id="84">
      <title>Sina <fixed-case>M</fixed-case>andarin Alphabetical Words: A Web-driven Code-mixing Lexical Resource</title>
      <author><first>Rong</first><last>Xiang</last></author>
      <author><first>Mingyu</first><last>Wan</last></author>
      <author><first>Qi</first><last>Su</last></author>
      <author><first>Chu-Ren</first><last>Huang</last></author>
      <author><first>Qin</first><last>Lu</last></author>
      <pages>833–842</pages>
      <abstract>Mandarin Alphabetical Word (MAW) is one indispensable component of Modern Chinese that demonstrates unique code-mixing idiosyncrasies influenced by language exchanges. Yet, this interesting phenomenon has not been properly addressed and is mostly excluded from the Chinese language system. This paper addresses the core problem of MAW identification and proposes to construct a large collection of MAWs from Sina Weibo (SMAW) using an automatic web-based technique which includes rule-based identification, informatics-based extraction, as well as Baidu search engine validation. A collection of 16,207 qualified SMAWs are obtained using this technique along with an annotated corpus of more than 200,000 sentences for linguistic research and applicable inquiries.</abstract>
      <url hash="1e6f05d4">2020.aacl-main.84</url>
      <attachment type="Dataset" hash="8b4c8548">2020.aacl-main.84.Dataset.txt</attachment>
    </paper>
    <paper id="85">
      <title><fixed-case>I</fixed-case>ndo<fixed-case>NLU</fixed-case>: Benchmark and Resources for Evaluating <fixed-case>I</fixed-case>ndonesian Natural Language Understanding</title>
      <author><first>Bryan</first><last>Wilie</last></author>
      <author><first>Karissa</first><last>Vincentio</last></author>
      <author><first>Genta Indra</first><last>Winata</last></author>
      <author><first>Samuel</first><last>Cahyawijaya</last></author>
      <author><first>Xiaohong</first><last>Li</last></author>
      <author><first>Zhi Yuan</first><last>Lim</last></author>
      <author><first>Sidik</first><last>Soleman</last></author>
      <author><first>Rahmad</first><last>Mahendra</last></author>
      <author><first>Pascale</first><last>Fung</last></author>
      <author><first>Syafri</first><last>Bahar</last></author>
      <author><first>Ayu</first><last>Purwarianti</last></author>
      <pages>843–857</pages>
      <abstract>Although Indonesian is known to be the fourth most frequently used language over the internet, the research progress on this language in natural language processing (NLP) is slow-moving due to a lack of available resources. In response, we introduce the first-ever vast resource for training, evaluation, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single sentence classification to pair-sentences sequence labeling with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset (Indo4B) collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, thus enabling everyone to benchmark their system performances.</abstract>
      <url hash="a4ef9a7a">2020.aacl-main.85</url>
    </paper>
    <paper id="86">
      <title>Happy Are Those Who Grade without Seeing: A Multi-Task Learning Approach to Grade Essays Using Gaze Behaviour</title>
      <author><first>Sandeep</first><last>Mathias</last></author>
      <author><first>Rudra</first><last>Murthy</last></author>
      <author><first>Diptesh</first><last>Kanojia</last></author>
      <author><first>Abhijit</first><last>Mishra</last></author>
      <author><first>Pushpak</first><last>Bhattacharyya</last></author>
      <pages>858–872</pages>
      <abstract>The gaze behaviour of a reader is helpful in solving several NLP tasks such as automatic essay grading. However, collecting gaze behaviour from readers is costly in terms of time and money. In this paper, we propose a way to improve automatic essay grading using gaze behaviour, which is learnt at run time using a multi-task learning framework. To demonstrate the efficacy of this multi-task learning based approach to automatic essay grading, we collect gaze behaviour for 48 essays across 4 essay sets, and learn gaze behaviour for the rest of the essays, numbering over 7000 essays. Using the learnt gaze behaviour, we can achieve a statistically significant improvement in performance over the state-of-the-art system for the essay sets where we have gaze data. We also achieve a statistically significant improvement for 4 other essay sets, numbering about 6000 essays, where we have no gaze behaviour data available. Our approach establishes that learning gaze behaviour improves automatic essay grading.</abstract>
      <url hash="e95c666d">2020.aacl-main.86</url>
    </paper>
    <paper id="87">
      <title>Multi-Source Attention for Unsupervised Domain Adaptation</title>
      <author><first>Xia</first><last>Cui</last></author>
      <author><first>Danushka</first><last>Bollegala</last></author>
      <pages>873–883</pages>
      <abstract>We model source-selection in multi-source Unsupervised Domain Adaptation (UDA) as an attention-learning problem, where we learn attention over the sources per given target instance. We first independently learn source-specific classification models, and a relatedness map between sources and target domains using pseudo-labelled target domain instances. Next, we learn domain-attention scores over the sources for aggregating the predictions of the source-specific models. Experimental results on two cross-domain sentiment classification datasets show that the proposed method reports consistently good performance across domains, and at times outperforming more complex prior proposals. Moreover, the computed domain-attention scores enable us to find explanations for the predictions made by the proposed method.</abstract>
      <url hash="4afc5f8e">2020.aacl-main.87</url>
    </paper>
    <paper id="88">
      <title>Compressing Pre-trained Language Models by Matrix Decomposition</title>
      <author><first>Matan</first><last>Ben Noach</last></author>
      <author><first>Yoav</first><last>Goldberg</last></author>
      <pages>884–889</pages>
      <abstract>Large pre-trained language models reach state-of-the-art results on many different NLP tasks when fine-tuned individually; they also come with significant memory and computational requirements, calling for methods to reduce model sizes (green AI). We propose a two-stage model-compression method to reduce a model’s inference time cost. We first decompose the matrices in the model into smaller matrices and then perform feature distillation on the internal representation to recover from the decomposition. This approach has the benefit of reducing the number of parameters while preserving much of the information within the model. We experimented on BERT-base model with the GLUE benchmark dataset and show that we can reduce the number of parameters by a factor of 0.4x, and increase inference speed by a factor of 1.45x, while maintaining a minimal loss in metric performance.</abstract>
      <url hash="9a7d0312">2020.aacl-main.88</url>
    </paper>
    <paper id="89">
      <title>You May Like This Hotel Because ...: Identifying Evidence for Explainable Recommendations</title>
      <author><first>Shin</first><last>Kanouchi</last></author>
      <author><first>Masato</first><last>Neishi</last></author>
      <author><first>Yuta</first><last>Hayashibe</last></author>
      <author><first>Hiroki</first><last>Ouchi</last></author>
      <author><first>Naoaki</first><last>Okazaki</last></author>
      <pages>890–899</pages>
      <abstract>Explainable recommendation is a good way to improve user satisfaction. However, explainable recommendation in dialogue is challenging since it has to handle natural language as both input and output. To tackle the challenge, this paper proposes a novel and practical task to explain evidences in recommending hotels given vague requests expressed freely in natural language. We decompose the process into two subtasks on hotel reviews: Evidence Identification and Evidence Explanation. The former predicts whether or not a sentence contains evidence that expresses why a given request is satisfied. The latter generates a recommendation sentence given a request and an evidence sentence. In order to address these subtasks, we build an Evidence-based Explanation dataset, which is the largest dataset for explaining evidences in recommending hotels for vague requests. The experimental results demonstrate that the BERT model can find evidence sentences with respect to various vague requests and that the LSTM-based model can generate recommendation sentences.</abstract>
      <url hash="da69c7ae">2020.aacl-main.89</url>
    </paper>
    <paper id="90">
      <title>A Unified Framework for Multilingual and Code-Mixed Visual Question Answering</title>
      <author><first>Deepak</first><last>Gupta</last></author>
      <author><first>Pabitra</first><last>Lenka</last></author>
      <author><first>Asif</first><last>Ekbal</last></author>
      <author><first>Pushpak</first><last>Bhattacharyya</last></author>
      <pages>900–913</pages>
      <abstract>In this paper, we propose an effective deep learning framework for multilingual and code-mixed visual question answering. The proposed model is capable of predicting answers from the questions in Hindi, English or Code-mixed (Hinglish: Hindi-English) languages. The majority of the existing techniques on Visual Question Answering (VQA) focus on English questions only. However, many applications such as medical imaging, tourism, visual assistants require a multilinguality-enabled module for their widespread usages. As there is no available dataset in English-Hindi VQA, we firstly create Hindi and Code-mixed VQA datasets by exploiting the linguistic properties of these languages. We propose a robust technique capable of handling the multilingual and code-mixed question to provide the answer against the visual information (image). To better encode the multilingual and code-mixed questions, we introduce a hierarchy of shared layers. We control the behaviour of these shared layers by an attention-based soft layer sharing mechanism, which learns how shared layers are applied in different ways for the different languages of the question. Further, our model uses bi-linear attention with a residual connection to fuse the language and image features. We perform extensive evaluation and ablation studies for English, Hindi and Code-mixed VQA. The evaluation shows that the proposed multilingual model achieves state-of-the-art performance in all these settings.</abstract>
      <url hash="55f2bbad">2020.aacl-main.90</url>
    </paper>
    <paper id="91">
      <title>Toxic Language Detection in Social Media for <fixed-case>B</fixed-case>razilian <fixed-case>P</fixed-case>ortuguese: New Dataset and Multilingual Analysis</title>
      <author><first>João Augusto</first><last>Leite</last></author>
      <author><first>Diego</first><last>Silva</last></author>
      <author><first>Kalina</first><last>Bontcheva</last></author>
      <author><first>Carolina</first><last>Scarton</last></author>
      <pages>914–924</pages>
      <abstract>Hate speech and toxic comments are a common concern of social media platform users. Although these comments are, fortunately, the minority in these platforms, they are still capable of causing harm. Therefore, identifying these comments is an important task for studying and preventing the proliferation of toxicity in social media. Previous work in automatically detecting toxic comments focuses mainly on English, with very little work on languages like Brazilian Portuguese. In this paper, we propose a new large-scale dataset for Brazilian Portuguese with tweets annotated as either toxic or non-toxic or in different types of toxicity. We present our dataset collection and annotation process, where we aimed to select candidates covering multiple demographic groups. State-of-the-art BERT models were able to achieve 76% macro-F1 score using monolingual data in the binary case. We also show that large-scale monolingual data is still needed to create more accurate models, despite recent advances in multilingual approaches. An error analysis and experiments with multi-label classification show the difficulty of classifying certain types of toxic comments that appear less frequently in our data and highlight the need to develop models that are aware of different categories of toxicity.</abstract>
      <url hash="2c1bc0e5">2020.aacl-main.91</url>
    </paper>
    <paper id="92">
      <title>Measuring What Counts: The Case of Rumour Stance Classification</title>
      <author><first>Carolina</first><last>Scarton</last></author>
      <author><first>Diego</first><last>Silva</last></author>
      <author><first>Kalina</first><last>Bontcheva</last></author>
      <pages>925–932</pages>
      <abstract>Stance classification can be a powerful tool for understanding whether and which users believe in online rumours. The task aims to automatically predict the stance of replies towards a given rumour, namely support, deny, question, or comment. Numerous methods have been proposed and their performance compared in the RumourEval shared tasks in 2017 and 2019. Results demonstrated that this is a challenging problem since naturally occurring rumour stance data is highly imbalanced. This paper specifically questions the evaluation metrics used in these shared tasks. We re-evaluate the systems submitted to the two RumourEval tasks and show that the two widely adopted metrics – accuracy and macro-F1 – are not robust for the four-class imbalanced task of rumour stance classification, as they wrongly favour systems with highly skewed accuracy towards the majority class. To overcome this problem, we propose new evaluation metrics for rumour stance detection. These are not only robust to imbalanced data but also score higher systems that are capable of recognising the two most informative minority classes (support and deny).</abstract>
      <url hash="551d0d20">2020.aacl-main.92</url>
    </paper>
  </volume>
  <volume id="srw" ingest-date="2020-12-02">
    <meta>
      <booktitle>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop</booktitle>
      <editor><first>Boaz</first><last>Shmueli</last></editor>
      <editor><first>Yin Jou</first><last>Huang</last></editor>
      <publisher>Association for Computational Linguistics</publisher>
      <address>Suzhou, China</address>
      <month>December</month>
      <year>2020</year>
    </meta>
    <frontmatter>
      <url hash="3ddc0644">2020.aacl-srw.0</url>
    </frontmatter>
    <paper id="1">
      <title>Text Classification through Glyph-aware Disentangled Character Embedding and Semantic Sub-character Augmentation</title>
      <author><first>Takumi</first><last>Aoki</last></author>
      <author><first>Shunsuke</first><last>Kitada</last></author>
      <author><first>Hitoshi</first><last>Iyatomi</last></author>
      <pages>1–7</pages>
      <abstract>We propose a new character-based text classification framework for non-alphabetic languages, such as Chinese and Japanese. Our framework consists of a variational character encoder (VCE) and character-level text classifier. The VCE is composed of a β-variational auto-encoder (β-VAE) that learns the proposed glyph-aware disentangled character embedding (GDCE). Since our GDCE provides zero-mean unit-variance character embeddings that are dimensionally independent, it is applicable for our interpretable data augmentation, namely, semantic sub-character augmentation (SSA). In this paper, we evaluated our framework using Japanese text classification tasks at the document- and sentence-level. We confirmed that our GDCE and SSA not only provided embedding interpretability but also improved the classification performance. Our proposal achieved a competitive result to the state-of-the-art model while also providing model interpretability.</abstract>
      <url hash="7a595ca7">2020.aacl-srw.1</url>
    </paper>
    <paper id="2">
      <title>Two-Headed Monster and Crossed Co-Attention Networks</title>
      <author><first>Yaoyiran</first><last>Li</last></author>
      <author><first>Jing</first><last>Jiang</last></author>
      <pages>8–15</pages>
      <abstract>This paper investigates a new co-attention mechanism in neural transduction models for machine translation tasks. We propose a paradigm, termed Two-Headed Monster (THM), which consists of two symmetric encoder modules and one decoder module connected with co-attention. As a specific and concrete implementation of THM, Crossed Co-Attention Networks (CCNs) are designed based on the Transformer model. We test CCNs on WMT 2014 EN-DE and WMT 2016 EN-FI translation tasks and show both advantages and disadvantages of the proposed method. Our model outperforms the strong Transformer baseline by 0.51 (big) and 0.74 (base) BLEU points on EN-DE and by 0.17 (big) and 0.47 (base) BLEU points on EN-FI but the epoch time increases by circa 75%.</abstract>
      <url hash="f1b02c58">2020.aacl-srw.2</url>
    </paper>
    <paper id="3">
      <title>Towards a Task-Agnostic Model of Difficulty Estimation for Supervised Learning Tasks</title>
      <author><first>Antonio</first><last>Laverghetta Jr.</last></author>
      <author><first>Jamshidbek</first><last>Mirzakhalov</last></author>
      <author><first>John</first><last>Licato</last></author>
      <pages>16–23</pages>
      <abstract>Curriculum learning, a training strategy where training data are ordered based on their difficulty, has been shown to improve performance and reduce training time on various NLP tasks. While much work over the years has developed novel approaches for generating curricula, these strategies are typically only suited for the task they were designed for. This work explores developing a task-agnostic model for problem difficulty and applying it to the Stanford Natural Language Inference (SNLI) dataset. Using the human responses that come with the dev set of SNLI, we train both regression and classification models to predict how many annotators will answer a question correctly and then project the difficulty estimates onto the full SNLI train set to create the curriculum. We argue that our curriculum is effectively capturing difficulty for this task through various analyses of both the model and the predicted difficulty scores.</abstract>
      <url hash="5b82e521">2020.aacl-srw.3</url>
    </paper>
    <paper id="4">
      <title>A <fixed-case>S</fixed-case>iamese <fixed-case>CNN</fixed-case> Architecture for Learning <fixed-case>C</fixed-case>hinese Sentence Similarity</title>
      <author><first>Haoxiang</first><last>Shi</last></author>
      <author><first>Cen</first><last>Wang</last></author>
      <author><first>Tetsuya</first><last>Sakai</last></author>
      <pages>24–29</pages>
      <abstract>This paper presents a deep neural architecture which applies the siamese convolutional neural network sharing model parameters for learning a semantic similarity metric between two sentences. In addition, two different similarity metrics (i.e., the Cosine Similarity and Manhattan similarity) are compared based on this architecture. Our experiments in binary similarity classification for Chinese sentence pairs show that the proposed siamese convolutional architecture with Manhattan similarity outperforms the baselines (i.e., the siamese Long Short-Term Memory architecture and the siamese Bidirectional Long Short-Term Memory architecture) by 8.7 points in accuracy.</abstract>
      <url hash="0031cd00">2020.aacl-srw.4</url>
    </paper>
    <paper id="5">
      <title>Automatic Classification of Students on <fixed-case>T</fixed-case>witter Using Simple Profile Information</title>
      <author><first>Lili-Michal</first><last>Wilson</last></author>
      <author><first>Christopher</first><last>Wun</last></author>
      <pages>30–36</pages>
      <abstract>Obtaining social media demographic information using machine learning is important for efficient computational social science research. Automatic age classification has been accomplished with relative success and allows for the study of youth populations, but student classification—determining which users are currently attending an academic institution—has not been thoroughly studied. Previous work (He et al., 2016) proposes a model which utilizes 3 tweet-content features to classify users as students or non-students. This model achieves an accuracy of 84%, but is restrictive and time intensive because it requires accessing and processing many user tweets. In this study, we propose classification models which use 7 numerical features and 10 text-based features drawn from simple profile information. These profile-based features allow for faster, more accessible data collection and enable the classification of users without needing access to their tweets. Compared to previous models, our models identify students with greater accuracy; our best model obtains an accuracy of 88.1% and an F1 score of .704. This improved student identification tool has the potential to facilitate research on topics ranging from professional networking to the impact of education on Twitter behaviors.</abstract>
      <url hash="e37d868b">2020.aacl-srw.5</url>
    </paper>
    <paper id="6">
      <title>Towards Code-switched Classification Exploiting Constituent Language Resources</title>
      <author><first>Kartikey</first><last>Pant</last></author>
      <author><first>Tanvi</first><last>Dadu</last></author>
      <pages>37–43</pages>
      <abstract>Code-switching is a commonly observed communicative phenomenon denoting a shift from one language to another within the same speech exchange. The analysis of code-switched data often becomes an assiduous task, owing to the limited availability of data. In this work, we propose converting code-switched data into its constituent high resource languages for exploiting both monolingual and cross-lingual settings. This conversion allows us to utilize the higher resource availability for its constituent languages for multiple downstream tasks. We perform experiments for two downstream tasks, sarcasm detection and hate speech detection in the English-Hindi code-switched setting. These experiments show an increase in 22% and 42.5% in F1-score for sarcasm detection and hate speech detection, respectively, compared to the state-of-the-art.</abstract>
      <url hash="b479ceab">2020.aacl-srw.6</url>
    </paper>
    <paper id="7">
      <title><fixed-case>H</fixed-case>indi History Note Generation with Unsupervised Extractive Summarization</title>
      <author><first>Aayush</first><last>Shah</last></author>
      <author><first>Dhineshkumar</first><last>Ramasubbu</last></author>
      <author><first>Dhruv</first><last>Mathew</last></author>
      <author><first>Meet Chetan</first><last>Gadoya</last></author>
      <pages>44–49</pages>
      <abstract>In this work, the task of extractive single document summarization applied to an education setting to generate summaries of chapters from grade 10 Hindi history textbooks is undertaken. Unsupervised approaches to extract summaries are employed and evaluated. TextRank, LexRank, Luhn and KLSum are used to extract summaries. When evaluated intrinsically, Luhn and TextRank summaries have the highest ROUGE scores. When evaluated extrinsically, by the effectiveness of a summary in answering exam questions, TextRank summaries perform the best.</abstract>
      <url hash="f6887553">2020.aacl-srw.7</url>
    </paper>
    <paper id="8">
      <title>Unbiasing Review Ratings with Tendency Based Collaborative Filtering</title>
      <author><first>Pranshi</first><last>Yadav</last></author>
      <author><first>Priya</first><last>Yadav</last></author>
      <author><first>Pegah</first><last>Nokhiz</last></author>
      <author><first>Vivek</first><last>Gupta</last></author>
      <pages>50–56</pages>
      <abstract>User-generated contents’ score-based prediction and item recommendation has become an inseparable part of the online recommendation systems. The ratings allow people to express their opinions and may affect the market value of items and consumer confidence in e-commerce decisions. A major problem with the models designed for user review prediction is that they unknowingly neglect the rating bias occurring due to personal user bias preferences. We propose a tendency-based approach that models the user and item tendency for score prediction along with text review analysis with respect to ratings.</abstract>
      <url hash="f106d6de">2020.aacl-srw.8</url>
    </paper>
    <paper id="9">
      <title>Building a Part-of-Speech Tagged Corpus for Drenjongke (Bhutia)</title>
      <author><first>Mana</first><last>Ashida</last></author>
      <author><first>Seunghun</first><last>Lee</last></author>
      <author><first>Kunzang</first><last>Namgyal</last></author>
      <pages>57–63</pages>
      <abstract>This research paper reports on the generation of the first Drenjongke corpus based on texts taken from a phrase book for beginners, written in the Tibetan script. A corpus of sentences was created after correcting errors in the text scanned through optical character reading (OCR). A total of 34 Part-of-Speech (PoS) tags were defined based on manual annotation performed by the three authors, one of whom is a native speaker of Drenjongke. The first corpus of the Drenjongke language comprises 275 sentences and 1379 tokens, which we plan to expand with other materials to promote further studies of this language.</abstract>
      <url hash="4be431a2">2020.aacl-srw.9</url>
    </paper>
    <paper id="10">
      <title>Towards a Standardized Dataset on <fixed-case>I</fixed-case>ndonesian Named Entity Recognition</title>
      <author><first>Siti Oryza</first><last>Khairunnisa</last></author>
      <author><first>Aizhan</first><last>Imankulova</last></author>
      <author><first>Mamoru</first><last>Komachi</last></author>
      <pages>64–71</pages>
      <abstract>In recent years, named entity recognition (NER) tasks in the Indonesian language have undergone extensive development. There are only a few corpora for Indonesian NER; hence, recent Indonesian NER studies have used diverse datasets. Although an open dataset is available, it includes only approximately 2,000 sentences and contains inconsistent annotations, thereby preventing accurate training of NER models without reliance on pre-trained models. Therefore, we re-annotated the dataset and compared the two annotations’ performance using the Bidirectional Long Short-Term Memory and Conditional Random Field (BiLSTM-CRF) approach. Fixing the annotation yielded a more consistent result for the organization tag and improved the prediction score by a large margin. Moreover, to take full advantage of pre-trained models, we compared different feature embeddings to determine their impact on the NER task for the Indonesian language.</abstract>
      <url hash="531022b8">2020.aacl-srw.10</url>
    </paper>
    <paper id="11">
      <title>Formal <fixed-case>S</fixed-case>anskrit Syntax: A Specification for Programming Language</title>
      <author><first>K. Kabi</first><last>Khanganba</last></author>
      <author><first>Girish</first><last>Jha</last></author>
      <pages>72–78</pages>
      <abstract>The paper discusses the syntax of the primary statements of the Sanskritam, a programming language specification based on natural Sanskrit under a doctoral thesis. By a statement, we mean a syntactic unit regardless of its computational operations of variable declarations, program executions or evaluations of Boolean expressions etc. We have selected six common primary statements of declaration, assignment, inline initialization, if-then-else, for loop and while loop. The specification partly overlaps the ideas of natural language programming, Controlled Natural Language (Kunh, 2013), and Natural Language subset. The practice and application of structured natural language set in a discourse are deeply rooted in the theoretical text tradition of Sanskrit, like the sūtra-based disciplines and Navya-Nyāya (NN) formal language, etc. The effort is a kind of continuation and application of such traditions and their techniques in the modern field of Sanskrit NLP.</abstract>
      <url hash="caef521b">2020.aacl-srw.11</url>
    </paper>
    <paper id="12">
      <title>Resource Creation and Evaluation of Aspect Based Sentiment Analysis in <fixed-case>U</fixed-case>rdu</title>
      <author><first>Sadaf</first><last>Rani</last></author>
      <author><first>Muhammad Waqas</first><last>Anwar</last></author>
      <pages>79–84</pages>
      <abstract>Along with the rise of people generated content on social sites, sentiment analysis has gained more importance. Aspect Based Sentiment Analysis (ABSA) is the task of identifying the sentiment at the aspect level. It has more importance than sentiment analysis from a commercial point of view. To the best of our knowledge, there is very little work on ABSA in the Urdu language. Recent work on ABSA has limitations: only predefined aspects are identified in a specific domain. So our focus is on the creation and evaluation of a dataset for ABSA in the Urdu language which will support multiple aspects. This dataset will provide a baseline evaluation for ABSA systems.</abstract>
      <url hash="ceb3fefd">2020.aacl-srw.12</url>
    </paper>
    <paper id="13">
      <title>Making a Point: Pointer-Generator Transformers for Disjoint Vocabularies</title>
      <author><first>Nikhil</first><last>Prabhu</last></author>
      <author><first>Katharina</first><last>Kann</last></author>
      <pages>85–92</pages>
      <abstract>Explicit mechanisms for copying have improved the performance of neural models for sequence-to-sequence tasks in the low-resource setting. However, they rely on an overlap between source and target vocabularies. Here, we propose a model that does not: a pointer-generator transformer for disjoint vocabularies. We apply our model to a low-resource version of the grapheme-to-phoneme conversion (G2P) task, and show that it outperforms a standard transformer by an average of 5.1 WER over 15 languages. While our model does not beat the best performing baseline, we demonstrate that it provides complementary information to it: an oracle that combines the best outputs of the two models improves over the strongest baseline by 7.7 WER on average in the low-resource setting. In the high-resource setting, our model performs comparably to a standard transformer.</abstract>
      <url hash="3f5bdd2a">2020.aacl-srw.13</url>
    </paper>
    <paper id="14">
      <title>Training with Adversaries to Improve Faithfulness of Attention in Neural Machine Translation</title>
      <author><first>Pooya</first><last>Moradi</last></author>
      <author><first>Nishant</first><last>Kambhatla</last></author>
      <author><first>Anoop</first><last>Sarkar</last></author>
      <pages>93–100</pages>
      <abstract>Can we trust that the attention heatmaps produced by a neural machine translation (NMT) model reflect its true internal reasoning? We isolate and examine in detail the notion of faithfulness in NMT models. We provide a measure of faithfulness for NMT based on a variety of stress tests where model parameters are perturbed and measuring faithfulness based on how often the model output changes. We show that our proposed faithfulness measure for NMT models can be improved using a novel differentiable objective that rewards faithful behaviour by the model through probability divergence. Our experimental results on multiple language pairs show that our objective function is effective in increasing faithfulness and can lead to a useful analysis of NMT model behaviour and more trustworthy attention heatmaps. Our proposed objective improves faithfulness without reducing the translation quality and it also seems to have a useful regularization effect on the NMT model and can even improve translation quality in some cases.</abstract>
      <url hash="1cdd7150">2020.aacl-srw.14</url>
    </paper>
    <paper id="15">
      <title>Document-Level Neural Machine Translation Using <fixed-case>BERT</fixed-case> as Context Encoder</title>
      <author><first>Zhiyu</first><last>Guo</last></author>
      <author><first>Minh Le</first><last>Nguyen</last></author>
      <pages>101–107</pages>
      <abstract>Large-scale pre-trained representations such as BERT have been widely used in many natural language understanding tasks. The methods of incorporating BERT into document-level machine translation are still being explored. BERT is able to understand sentence relationships since BERT is pre-trained using the next sentence prediction task. In our work, we leverage this property to improve document-level machine translation. In our proposed model, BERT performs as a context encoder to achieve document-level contextual information, which is then integrated into both the encoder and decoder. Experiment results show that our proposed method can significantly outperform strong document-level machine translation baselines on BLEU score. Moreover, the ablation study shows our method can capture document-level context information to boost translation performance.</abstract>
      <url hash="c006c668">2020.aacl-srw.15</url>
    </paper>
    <paper id="16">
      <title>A Review of Cross-Domain Text-to-<fixed-case>SQL</fixed-case> Models</title>
      <author><first>Yujian</first><last>Gan</last></author>
      <author><first>Matthew</first><last>Purver</last></author>
      <author><first>John R.</first><last>Woodward</last></author>
      <pages>108–115</pages>
      <abstract>WikiSQL and Spider, the large-scale cross-domain text-to-SQL datasets, have attracted much attention from the research community. The leaderboards of WikiSQL and Spider show that many researchers propose their models trying to solve the text-to-SQL problem. This paper first divides the top models in these two leaderboards into two paradigms. We then present details not mentioned in their original papers by evaluating the key components, including schema linking, pretrained word embeddings, and reasoning assistance modules. Based on the analysis of these models, we want to promote understanding of the text-to-SQL field and identify some interesting directions for future work. For example, it is worth studying the text-to-SQL problem in an environment where it is more challenging to build schema linking, and also worth studying how to combine the advantages of each model for text-to-SQL.</abstract>
      <url hash="88fe2201">2020.aacl-srw.16</url>
    </paper>
    <paper id="17">
      <title>Multi-task Learning for Automated Essay Scoring with Sentiment Analysis</title>
      <author><first>Panitan</first><last>Muangkammuen</last></author>
      <author><first>Fumiyo</first><last>Fukumoto</last></author>
      <pages>116–123</pages>
      <abstract>Automated Essay Scoring (AES) is a process that aims to alleviate the workload of graders and improve the feedback cycle in educational systems. Multi-task learning models, one of the deep learning techniques that have recently been applied to many NLP tasks, demonstrate the vast potential for AES. In this work, we present an approach for combining two tasks, sentiment analysis, and AES by utilizing multi-task learning. The model is based on a hierarchical neural network that learns to predict a holistic score at the document-level along with sentiment classes at the word-level and sentence-level. The sentiment features extracted from opinion expressions can enhance a vanilla holistic essay scoring, which mainly focuses on lexicon and text semantics. Our approach demonstrates that sentiment features are beneficial for some essay prompts, and the performance is competitive to other deep learning models on the Automated Student Assessment Prize (ASAP) benchmark. The Quadratic Weighted Kappa (QWK) is used to measure the agreement between the human grader’s score and the model’s prediction. Our model produces a QWK of 0.763.</abstract>
      <url hash="8f35335a">2020.aacl-srw.17</url>
    </paper>
    <paper id="18">
      <title>Aspect Extraction Using Coreference Resolution and Unsupervised Filtering</title>
      <author><first>Deon</first><last>Mai</last></author>
      <author><first>Wei Emma</first><last>Zhang</last></author>
      <pages>124–129</pages>
      <abstract>Aspect extraction is a widely researched field of natural language processing in which aspects are identified from the text as a means for information. For example, in aspect-based sentiment analysis (ABSA), aspects need to be first identified. Previous studies have introduced various approaches to increasing accuracy, although leaving room for further improvement. In a practical situation where the examined dataset is lacking labels, to fine-tune the process a novel unsupervised approach is proposed, combining a lexical rule-based approach with coreference resolution. The model increases accuracy through the recognition and removal of coreferring aspects. Experimental evaluations are performed on two benchmark datasets, demonstrating the greater performance of our approach to extracting coherent aspects through outperforming the baseline approaches.</abstract>
      <url hash="96563206">2020.aacl-srw.18</url>
    </paper>
    <paper id="19">
      <title><fixed-case>GRUBERT</fixed-case>: A <fixed-case>GRU</fixed-case>-Based Method to Fuse <fixed-case>BERT</fixed-case> Hidden Layers for <fixed-case>T</fixed-case>witter Sentiment Analysis</title>
      <author><first>Leo</first><last>Horne</last></author>
      <author><first>Matthias</first><last>Matti</last></author>
      <author><first>Pouya</first><last>Pourjafar</last></author>
      <author><first>Zuowen</first><last>Wang</last></author>
      <pages>130–138</pages>
      <abstract>In this work, we introduce a GRU-based architecture called GRUBERT that learns to map the different BERT hidden layers to fused embeddings with the aim of achieving high accuracy on the Twitter sentiment analysis task. Tweets are known for their highly diverse language, and by exploiting different linguistic information present across BERT hidden layers, we can capture the full extent of this language at the embedding level. Our method can be easily adapted to other embeddings capturing different linguistic information. We show that our method outperforms well-known heuristics of using BERT (e.g. using only the last layer) and other embeddings such as ELMo. We observe potential label noise resulting from the data acquisition process and employ early stopping as well as a voting classifier to overcome it.</abstract>
      <url hash="002a0765">2020.aacl-srw.19</url>
    </paper>
    <paper id="20">
      <title>Exploring Statistical and Neural Models for Noun Ellipsis Detection and Resolution in <fixed-case>E</fixed-case>nglish</title>
      <author><first>Payal</first><last>Khullar</last></author>
      <pages>139–145</pages>
      <abstract>Computational approaches to noun ellipsis resolution have been sparse, with only a naive rule-based approach that uses syntactic feature constraints for marking noun ellipsis licensors and selecting their antecedents. In this paper, we further the ellipsis research by exploring several statistical and neural models for both the subtasks involved in the ellipsis resolution process and addressing the representation and contribution of manual features proposed in previous research. Using the best performing models, we build an end-to-end supervised Machine Learning (ML) framework for this task that improves the existing F1 score by 16.55% for the detection and 14.97% for the resolution subtask. Our experiments demonstrate robust scores through pretrained BERT (Bidirectional Encoder Representations from Transformers) embeddings for word representation, and more so the importance of manual features, once again highlighting the syntactic and semantic characteristics of the ellipsis phenomenon. For the classification decision, we notice that a simple Multilayer Perceptron (MLP) works well for the detection of ellipsis; however, Recurrent Neural Networks (RNN) are a better choice for the much harder resolution step.</abstract>
      <url hash="d6cea3e7">2020.aacl-srw.20</url>
    </paper>
    <paper id="21">
      <title><fixed-case>MRC</fixed-case> Examples Answerable by <fixed-case>BERT</fixed-case> without a Question Are Less Effective in <fixed-case>MRC</fixed-case> Model Training</title>
      <author><first>Hongyu</first><last>Li</last></author>
      <author><first>Tengyang</first><last>Chen</last></author>
      <author><first>Shuting</first><last>Bai</last></author>
      <author><first>Takehito</first><last>Utsuro</last></author>
      <author><first>Yasuhide</first><last>Kawada</last></author>
      <pages>146–152</pages>
      <abstract>Models developed for Machine Reading Comprehension (MRC) are asked to predict an answer from a question and its related context. However, there exist cases that can be correctly answered by an MRC model using BERT, where only the context is provided without including the question. In this paper, these types of examples are referred to as “easy to answer”, while the others are referred to as “hard to answer”, i.e., unanswerable by an MRC model using BERT without being provided the question. Based on classifying examples as answerable or unanswerable by BERT without the given question, we propose a method based on BERT that splits the training examples from the MRC dataset SQuAD1.1 into those that are “easy to answer” or “hard to answer”. Experimental evaluation from a comparison of two models, one trained only with “easy to answer” examples and the other with “hard to answer” examples, demonstrates that the latter outperforms the former.</abstract>
      <url hash="1fca9d2b">2020.aacl-srw.21</url>
    </paper>
    <paper id="22">
      <title>Text Simplification with Reinforcement Learning Using Supervised Rewards on Grammaticality, Meaning Preservation, and Simplicity</title>
      <author><first>Akifumi</first><last>Nakamachi</last></author>
      <author><first>Tomoyuki</first><last>Kajiwara</last></author>
      <author><first>Yuki</first><last>Arase</last></author>
      <pages>153–159</pages>
      <abstract>We optimize rewards of reinforcement learning in text simplification using metrics that are highly correlated with human-perspectives. To address problems of exposure bias and loss-evaluation mismatch, text-to-text generation tasks employ reinforcement learning that rewards task-specific metrics. Previous studies in text simplification employ the weighted sum of sub-rewards from three perspectives: grammaticality, meaning preservation, and simplicity. However, the previous rewards do not align with human-perspectives for these perspectives. In this study, we propose to use BERT regressors fine-tuned for grammaticality, meaning preservation, and simplicity as reward estimators to achieve text simplification conforming to human-perspectives. Experimental results show that reinforcement learning with our rewards balances meaning preservation and simplicity. Additionally, human evaluation confirmed that simplified texts by our method are preferred by humans compared to previous studies.</abstract>
      <url hash="ecbdd029">2020.aacl-srw.22</url>
    </paper>
    <paper id="23">
      <title>Label Representations in Modeling Classification as Text Generation</title>
      <author><first>Xinyi</first><last>Chen</last></author>
      <author><first>Jingxian</first><last>Xu</last></author>
      <author><first>Alex</first><last>Wang</last></author>
      <pages>160–164</pages>
      <abstract>Several recent state-of-the-art transfer learning methods model classification tasks as text generation, where labels are represented as strings for the model to generate. We investigate the effect that the choice of strings used to represent labels has on how effectively the model learns the task. For four standard text classification tasks, we design a diverse set of possible string representations for labels, ranging from canonical label definitions to random strings. We experiment with T5 on these tasks, varying the label representations as well as the amount of training data. We find that, in the low data setting, label representation impacts task performance on some tasks, with task-related labels being most effective, but fails to have an impact on others. In the full data setting, our results are largely negative: Different label representations do not affect overall task performance.</abstract>
      <url hash="7729d846">2020.aacl-srw.23</url>
    </paper>
    <paper id="24">
      <title>Generating Inflectional Errors for Grammatical Error Correction in <fixed-case>H</fixed-case>indi</title>
      <author><first>Ankur</first><last>Sonawane</last></author>
      <author><first>Sujeet Kumar</first><last>Vishwakarma</last></author>
      <author><first>Bhavana</first><last>Srivastava</last></author>
      <author><first>Anil</first><last>Kumar Singh</last></author>
      <pages>165–171</pages>
      <abstract>Automated grammatical error correction has been explored as an important research problem within NLP, with the majority of the work being done on English and similar resource-rich languages. Grammar correction using neural networks is a data-heavy task, with the recent state of the art models requiring datasets with millions of annotated sentences for proper training. It is difficult to find such resources for Indic languages due to their relative lack of digitized content and complex morphology, compared to English. We address this problem by generating a large corpus of artificial inflectional errors for training GEC models. Moreover, to evaluate the performance of models trained on this dataset, we create a corpus of real Hindi errors extracted from Wikipedia edits. Analyzing this dataset with a modified version of the ERRANT error annotation toolkit, we find that inflectional errors are very common in this language. Finally, we produce the initial baseline results using state of the art methods developed for English.</abstract>
      <url hash="28d700f6">2020.aacl-srw.24</url>
      <attachment type="Software" hash="f244d347">2020.aacl-srw.24.Software.txt</attachment>
      <attachment type="Dataset" hash="f244d347">2020.aacl-srw.24.Dataset.txt</attachment>
    </paper>
  </volume>
  <volume id="demo" ingest-date="2020-12-02">
    <meta>
      <booktitle>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations</booktitle>
      <editor><first>Derek</first><last>Wong</last></editor>
      <editor><first>Douwe</first><last>Kiela</last></editor>
      <publisher>Association for Computational Linguistics</publisher>
      <address>Suzhou, China</address>
      <month>December</month>
      <year>2020</year>
    </meta>
    <frontmatter>
      <url hash="4f1457ff">2020.aacl-demo.0</url>
    </frontmatter>
    <paper id="1">
      <title><fixed-case>AM</fixed-case>esure: A Web Platform to Assist the Clear Writing of Administrative Texts</title>
      <author><first>Thomas</first><last>François</last></author>
      <author><first>Adeline</first><last>Müller</last></author>
      <author><first>Eva</first><last>Rolin</last></author>
      <author><first>Magali</first><last>Norré</last></author>
      <pages>1–7</pages>
      <abstract>This article presents the AMesure platform, which aims to assist writers of French administrative texts in simplifying their writing. This platform includes a readability formula specialized for administrative texts and it also uses various natural language processing (NLP) tools to analyze texts and highlight a number of linguistic phenomena considered difficult to read. Finally, based on the difficulties identified, it offers pieces of advice coming from official plain language guides to users. This paper describes the different components of the system and reports an evaluation of these components.</abstract>
      <url hash="278be199">2020.aacl-demo.1</url>
    </paper>
    <paper id="2">
      <title><fixed-case>A</fixed-case>uto<fixed-case>NLU</fixed-case>: An On-demand Cloud-based Natural Language Understanding System for Enterprises</title>
      <author><first>Nham</first><last>Le</last></author>
      <author><first>Tuan</first><last>Lai</last></author>
      <author><first>Trung</first><last>Bui</last></author>
      <author><first>Doo Soon</first><last>Kim</last></author>
      <pages>8–13</pages>
      <abstract>With the renaissance of deep learning, neural networks have achieved promising results on many natural language understanding (NLU) tasks. Even though the source codes of many neural network models are publicly available, there is still a large gap from open-sourced models to solving real-world problems in enterprises. Therefore, to fill this gap, we introduce AutoNLU, an on-demand cloud-based system with an easy-to-use interface that covers all common use-cases and steps in developing an NLU model. AutoNLU has supported many product teams within Adobe with different use-cases and datasets, quickly delivering them working models. To demonstrate the effectiveness of AutoNLU, we present two case studies. i) We build a practical NLU model for handling various image-editing requests in Photoshop. ii) We build powerful keyphrase extraction models that achieve state-of-the-art results on two public benchmarks. In both cases, end users only need to write a small amount of code to convert their datasets into a common format used by AutoNLU.</abstract>
      <url hash="e81242b0">2020.aacl-demo.2</url>
    </paper>
    <paper id="3">
      <title><fixed-case>ISA</fixed-case>: An Intelligent Shopping Assistant</title>
      <author><first>Tuan</first><last>Lai</last></author>
      <author><first>Trung</first><last>Bui</last></author>
      <author><first>Nedim</first><last>Lipka</last></author>
      <pages>14–19</pages>
      <abstract>Despite the growth of e-commerce, brick-and-mortar stores are still the preferred destinations for many people. In this paper, we present ISA, a mobile-based intelligent shopping assistant that is designed to improve shopping experience in physical stores. ISA assists users by leveraging advanced techniques in computer vision, speech processing, and natural language processing. An in-store user only needs to take a picture or scan the barcode of the product of interest, and then the user can talk to the assistant about the product. The assistant can also guide the user through the purchase process or recommend other similar products to the user. We take a data-driven approach in building the engines of ISA’s natural language processing component, and the engines achieve good performance.</abstract>
      <url hash="3e17504c">2020.aacl-demo.3</url>
    </paper>
    <paper id="4">
      <title>meta<fixed-case>CAT</fixed-case>: A Metadata-based Task-oriented Chatbot Annotation Tool</title>
      <author><first>Ximing</first><last>Liu</last></author>
      <author><first>Wei</first><last>Xue</last></author>
      <author><first>Qi</first><last>Su</last></author>
      <author><first>Weiran</first><last>Nie</last></author>
      <author><first>Wei</first><last>Peng</last></author>
      <pages>20–25</pages>
      <abstract>Creating high-quality annotated dialogue corpora is challenging. It is essential to develop practical annotation tools to support humans in this time-consuming and error-prone task. We present metaCAT, which is an open-source web-based annotation tool designed specifically for developing task-oriented dialogue data. To the best of our knowledge, metaCAT is the first annotation tool that provides comprehensive metadata annotation coverage to the domain, intent, and span information. The data annotation quality is enhanced by a real-time annotation constraint-checking mechanism. An Automatic Speech Recognition (ASR) function is implemented to allow users to paraphrase and create more diversified annotated utterances. metaCAT is publicly available for the community.</abstract>
      <url hash="24377f3a">2020.aacl-demo.4</url>
    </paper>
    <paper id="5">
      <title><fixed-case>NLP</fixed-case> Tools for Predictive Maintenance Records in <fixed-case>M</fixed-case>aint<fixed-case>N</fixed-case>et</title>
      <author><first>Farhad</first><last>Akhbardeh</last></author>
      <author><first>Travis</first><last>Desell</last></author>
      <author><first>Marcos</first><last>Zampieri</last></author>
      <pages>26–32</pages>
      <abstract>Processing maintenance logbook records is an important step in the development of predictive maintenance systems. Logbooks often include free text fields with domain specific terms, abbreviations, and non-standard spelling posing challenges to off-the-shelf NLP pipelines trained on standard contemporary corpora. Despite the importance of this data type, processing predictive maintenance data is still an under-explored topic in NLP. With the goal of providing more datasets and resources to the community, in this paper we present a number of new resources available in MaintNet, a collaborative open-source library and data repository of predictive maintenance language datasets. We describe novel annotated datasets from multiple domains such as aviation, automotive, and facility maintenance domains and new tools for segmentation, spell checking, POS tagging, clustering, and classification.</abstract>
      <url hash="6b8126e9">2020.aacl-demo.5</url>
    </paper>
    <paper id="6">
      <title>Fairseq <fixed-case>S</fixed-case>2<fixed-case>T</fixed-case>: Fast Speech-to-Text Modeling with Fairseq</title>
      <author><first>Changhan</first><last>Wang</last></author>
      <author><first>Yun</first><last>Tang</last></author>
      <author><first>Xutai</first><last>Ma</last></author>
      <author><first>Anne</first><last>Wu</last></author>
      <author><first>Dmytro</first><last>Okhonko</last></author>
      <author><first>Juan</first><last>Pino</last></author>
      <pages>33–39</pages>
      <abstract>We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation. It follows fairseq’s careful design for scalability and extensibility. We provide end-to-end workflows from data pre-processing, model training to offline (online) inference. We implement state-of-the-art RNN-based as well as Transformer-based models and open-source detailed training recipes. Fairseq’s machine translation models and language models can be seamlessly integrated into S2T workflows for multi-task learning or transfer learning. Fairseq S2T is available at https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text.</abstract>
      <url hash="ba6e2aa3">2020.aacl-demo.6</url>
    </paper>
    <paper id="7">
      <title><fixed-case>NLPS</fixed-case>tat<fixed-case>T</fixed-case>est: A Toolkit for Comparing <fixed-case>NLP</fixed-case> System Performance</title>
      <author><first>Haotian</first><last>Zhu</last></author>
      <author><first>Denise</first><last>Mak</last></author>
      <author><first>Jesse</first><last>Gioannini</last></author>
      <author><first>Fei</first><last>Xia</last></author>
      <pages>40–46</pages>
      <abstract>Statistical significance testing centered on p-values is commonly used to compare NLP system performance, but p-values alone are insufficient because statistical significance differs from practical significance. The latter can be measured by estimating effect size. In this paper, we propose a three-stage procedure for comparing NLP system performance and provide a toolkit, NLPStatTest, that automates the process. Users can upload NLP system evaluation scores and the toolkit will analyze these scores, run appropriate significance tests, estimate effect size, and conduct power analysis to estimate Type II error. The toolkit provides a convenient and systematic way to compare NLP system performance that goes beyond statistical significance testing.</abstract>
      <url hash="2388559c">2020.aacl-demo.7</url>
    </paper>
  </volume>
</collection>