diff --git a/.gitignore b/.gitignore index cf0d47c85c..8c05bd720c 100644 --- a/.gitignore +++ b/.gitignore @@ -23,6 +23,7 @@ Icon ####################################### **/__pycache__ +.idea # generated website /build/ diff --git a/data/xml/2024.findings.xml b/data/xml/2024.findings.xml index 9d08a16455..226806210e 100644 --- a/data/xml/2024.findings.xml +++ b/data/xml/2024.findings.xml @@ -1982,4 +1982,3908 @@ zhao-etal-2024-nl2formula + + + Findings of the Association for Computational Linguistics: NAACL 2024 + KevinDuh + HelenaGomez + StevenBethard + Association for Computational Linguistics +
Mexico City, Mexico
+ June + 2024 + 2024.findings-naacl + findings + + + 2024.findings-naacl.0 + findings-2024-findings-association + + + Structured Pruning for Large Language Models Using Coupled Components Elimination and Minor Fine-tuning + HongheZhangUniversity of Science and Technology of China + XiaolongShiXiaolongShiUniversity of Science and Technology of China + JingweiSunUniversity of Science and Technology of China + GuangzhongSunUniversity of Science and Technology of China + 1-12 + Large language models (LLMs) have demonstrated powerful capabilities in natural language processing, yet their vast number of parameters poses challenges for deployment and inference efficiency. Structured model pruning emerges as a viable approach to reduce model size and accelerate inference, without requiring specialized operators and libraries for deployment. However, structured pruning often severely weakens the model’s capability. Although repetitive fine-tuning can restore the capability to a certain extent, it impairs LLMs’ utility as versatile problem solvers. To address this issue, we propose a novel structured pruning algorithm tailored for LLMs. It derives the importance of different components, namely rows and columns in parameter matrices, based on intermediate data dependencies. Then it removes coupled components across different layers simultaneously and preserves dependency relationships within remaining parameters, avoiding significant performance degradation. The pruned model requires only a few epochs of fine-tuning to restore its performance, ensuring the model’s ability to generalize. Empirical evaluations on LLaMA, Vicuna, and ChatGLM3 demonstrate our algorithm’s efficacy, yielding 20% parameter reduction while retaining at least 94.4% of original performance metrics. 
+ 2024.findings-naacl.1 + 2024.findings-naacl.1.copyright.pdf + zhang-etal-2024-structured + + + Weight-Inherited Distillation for Task-Agnostic <fixed-case>BERT</fixed-case> Compression + TaiqiangWu + ChengHou + ShanshanLao + JiayiLi + NgaiWongThe University of Hong Kong + ZheZhao + YujiuYangGraduate School at Shenzhen, Tsinghua University + 13-28 + Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the behavior of the teacher model. These methods transfer the knowledge in an indirect way. In this paper, we propose a novel Weight-Inherited Distillation (WID), which directly transfers knowledge from the teacher. WID does not require any additional alignment loss and trains a compact student by inheriting the weights, showing a new perspective of knowledge distillation. Specifically, we design the row compactors and column compactors as mappings and then compress the weights via structural re-parameterization. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines. Further analysis indicates that WID can also learn the attention patterns from the teacher model without any alignment loss on attention distributions. The code is available at https://github.com/wutaiqiang/WID-NAACL2024. + 2024.findings-naacl.2 + 2024.findings-naacl.2.copyright.pdf + wu-etal-2024-weight + + + Ignore Me But Don’t Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain + EugeneJangS2W Inc. + JianCui + DayeonYimS2W Inc. + YoungjinJinKorea Advanced Institute of Science & Technology + Jin-WooChung + SeungwonShinKorea Advanced Institute of Science & Technology + YongjaeLeeS2W Inc. + 29-42 + Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging. 
For such text domains that involve high levels of expertise, pretraining on in-domain corpora has been a popular method for language models to obtain domain expertise. However, cybersecurity texts often contain non-linguistic elements (such as URLs and hash values) that could be unsuitable for the established pretraining methodologies. Previous work in other domains has removed or filtered such text as noise, but the effectiveness of these methods has not been investigated, especially in the cybersecurity domain. We experiment with different pretraining methodologies to account for non-linguistic elements (NLEs) and evaluate their effectiveness through downstream tasks and probing tasks. Our proposed strategy, a combination of selective MLM and jointly training NLE token classification, outperforms the commonly taken approach of replacing NLEs. We use our domain-customized methodology to train CyBERTuned, a cybersecurity domain language model that outperforms other cybersecurity PLMs on most tasks. + 2024.findings-naacl.3 + 2024.findings-naacl.3.copyright.pdf + jang-etal-2024-ignore + + + Extremely efficient online query encoding for dense retrieval + NachshonCohenAmazon + YaronFairsteinAmazon + GuyKushilevitzAmazon + 43-50 + Existing dense retrieval systems utilize the same model architecture for encoding both the passages and the queries, even though queries are much shorter and simpler than passages. This leads to high latency of the query encoding, which is performed online and therefore might impact user experience. We show that combining a standard large passage encoder with a small efficient query encoder can provide significant latency drops with only a small decrease in quality. We offer a pretraining and training solution for multiple small query encoder architectures. Using a small transformer architecture we are able to decrease latency by up to \sim12\times, while MRR@10 on the MS MARCO dev set only decreases from 38.2 to 36.2. 
If this solution does not reach the desired latency requirements, we propose an efficient RNN as the query encoder, which processes the query prefix incrementally and only infers the last word after the query is issued. This shortens latency by \sim38\times with only a minor drop in quality, reaching 35.5 MRR@10 score. + 2024.findings-naacl.4 + 2024.findings-naacl.4.copyright.pdf + cohen-etal-2024-extremely + + + <fixed-case>DIVKNOWQA</fixed-case>: Assessing the Reasoning Ability of <fixed-case>LLM</fixed-case>s via Open-Domain Question Answering over Knowledge Base and Text + WentingZhao + YeLiuSalesForce.com + TongNiuSalesforce AI Research + YaoWanHuazhong University of Science and Technology + PhilipYuUniversity of Illinois, Chicago + ShafiqJotySalesForce.com and Nanyang Technological University + YingboZhouSalesforce Research + SemihYavuzSalesForce.com + 51-68 + Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when solely relying on their internal knowledge, especially when answering questions that require less commonly known information. Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge. Nonetheless, recent approaches have primarily emphasized retrieval from unstructured text corpora, owing to its seamless integration into prompts. When using structured data such as knowledge graphs, most methods simplify it into natural text, neglecting the underlying structures. Moreover, a significant gap in the current landscape is the absence of a realistic benchmark for evaluating the effectiveness of grounding LLMs on heterogeneous knowledge sources (e.g., knowledge base and text). 
To fill this gap, we have curated a comprehensive dataset that poses two unique challenges: (1) Two-hop multi-source questions that require retrieving information from both open-domain structured and unstructured knowledge sources; retrieving information from structured knowledge sources is a critical component in correctly answering the questions. (2) Generation of symbolic queries (e.g., SPARQL for Wikidata) is a key requirement, which adds another layer of challenge. Our dataset is created using a combination of automatic generation through predefined reasoning chains and human annotation. We also introduce a novel approach that leverages multiple retrieval tools, including text passage retrieval and symbolic language-assisted retrieval. Our model outperforms previous approaches by a significant margin, demonstrating its effectiveness in addressing the above-mentioned reasoning challenges. + 2024.findings-naacl.5 + 2024.findings-naacl.5.copyright.pdf + zhao-etal-2024-divknowqa + + + <fixed-case>S</fixed-case>peed<fixed-case>E</fixed-case>: <fixed-case>E</fixed-case>uclidean Geometric Knowledge Graph Embedding Strikes Back + AleksandarPavlović + EmanuelSallingerTU Wien (Vienna University of Technology) + 69-92 + Geometric knowledge graph embedding models (gKGEs) have shown great potential for knowledge graph completion (KGC), i.e., automatically predicting missing triples. However, contemporary gKGEs require high embedding dimensionalities or complex embedding spaces for good KGC performance, drastically limiting their space and time efficiency. 
Facing these challenges, we propose SpeedE, a lightweight Euclidean gKGE that (1) provides strong inference capabilities, (2) is competitive with state-of-the-art gKGEs, even significantly outperforming them on YAGO3-10 and WN18RR, and (3) dramatically increases their efficiency, in particular, needing solely a fifth of the training time and a fourth of the parameters of the state-of-the-art ExpressivE model on WN18RR to reach the same KGC performance. + 2024.findings-naacl.6 + 2024.findings-naacl.6.copyright.pdf + pavlovic-sallinger-2024-speede + + + Language Guided Exploration for <fixed-case>RL</fixed-case> Agents in Text Environments + HiteshGolchha + SahilYerawar + DhruveshPatelCollege of Information and Computer Science, University of Massachusetts, Amherst + SohamDanInternational Business Machines + KeerthiramMurugesanInternational Business Machines + 93-102 + Real-world sequential decision making is characterized by sparse rewards and large decision spaces, posing significant difficulty for experiential learning systems like \textit{tabula rasa} reinforcement learning (RL) agents. Large Language Models (LLMs), with a wealth of world knowledge, can help RL agents learn quickly and adapt to distribution shifts. In this work, we introduce Language Guided Exploration (LGE) framework, which uses a pre-trained language model (called GUIDE ) to provide decision-level guidance to an RL agent (called EXPLORER ). We observe that on ScienceWorld (Wang et al., 2022), a challenging text environment, LGE outperforms vanilla RL agents significantly and also outperforms other sophisticated methods like Behaviour Cloning and Text Decision Transformer. 
+ 2024.findings-naacl.7 + 2024.findings-naacl.7.copyright.pdf + golchha-etal-2024-language + + + <fixed-case>GPT</fixed-case>-who: An Information Density-based Machine-Generated Text Detector + SaranyaVenkatramanPennsylvania State University + AdakuUchenduMIT Lincoln Laboratory, Massachusetts Institute of Technology + DongwonLeeThe Pennsylvania State University + 103-115 + The Uniform Information Density (UID) principle posits that humans prefer to spread information evenly during language production. We examine if this UID principle can help capture differences between Large Language Models (LLMs)-generated and human-generated texts. We propose GPT-who, the first psycholinguistically-inspired domain-agnostic statistical detector. This detector employs UID-based features to model the unique statistical signature of each LLM and human author for accurate detection. We evaluate our method using 4 large-scale benchmark datasets and find that GPT-who outperforms state-of-the-art detectors (both statistical and non-statistical) such as GLTR, GPTZero, DetectGPT, OpenAI detector, and ZeroGPT by over 20% across domains. In addition to better performance, it is computationally inexpensive and utilizes an interpretable representation of text articles. We find that GPT-who can distinguish texts generated by very sophisticated LLMs, even when the overlying text is indiscernible. UID-based measures for all datasets and code are available at https://github.com/saranya-venkatraman/gpt-who. + 2024.findings-naacl.8 + 2024.findings-naacl.8.copyright.pdf + venkatraman-etal-2024-gpt + + + <fixed-case>DEED</fixed-case>: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models + PengTangAmazon + PengkaiZhuBoston University + TianLi + SrikarAppalarajuAmazon + VijayMahadevanAmazon + R.ManmathaAmazon + 116-131 + Encoder-decoder transformer models have achieved great success on various vision-language (VL) and language tasks, but they suffer from high inference latency. 
Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including shared generation head and adaptation modules, to keep accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence of the current layer at each individual decoding step. Considering different number of decoder layers may be used at different decoding steps, we compute deeper-layer decoder features of previous decoding steps just-in-time, which ensures the features from different decoding steps are semantically aligned. We evaluate our approach with three state-of-the-art encoder-decoder transformer models on various VL and language tasks. We show our approach can reduce overall inference latency by 20%-74% with comparable or even higher accuracy compared to baselines. + 2024.findings-naacl.9 + 2024.findings-naacl.9.copyright.pdf + tang-etal-2024-deed + + + Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation + Ta-ChungChi + Ting-HanFan + AlexanderRudnickyCarnegie Mellon University and Carnegie Mellon University + 132-148 + An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning. Such long-context utilization capability relies heavily on a flexible positional embedding design. 
Upon investigating the flexibility of existing large pre-trained Transformer language models, we find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns. However, T5 suffers from the dispersed attention issue: the longer the input sequence, the flatter the attention distribution. To alleviate the issue, we propose two attention alignment strategies via temperature scaling. Our findings show improvement on the long-context utilization capability of T5 on language modeling, retrieval, multi-document question answering, and code completion tasks without any fine-tuning. This suggests that a flexible positional embedding design and attention alignment can go a long way toward Transformer length extrapolation. The code is released at: https://github.com/chijames/T5-Attention-Alignment + 2024.findings-naacl.10 + 2024.findings-naacl.10.copyright.pdf + chi-etal-2024-attention + + + Automatic Pair Construction for Contrastive Post-training + CanwenXu + CorbyRosset + EthanChauMicrosoft + LucianoCorroMicrosoft Research + ShwetiMahajanMicrosoft + JulianMcAuleyUniversity of California, San Diego, University of California, San Diego + JenniferNevillePurdue University and Purdue University + AhmedAwadallahMicrosoft Research + NikhilRaoMicrosoft + 149-162 + Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we propose an automatic way to construct contrastive data for LLM, using preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continuing SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from “easier” pairs and transitioning to “harder” ones, which further improves alignment. 
Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, our automatic contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to outperform ChatGPT. + 2024.findings-naacl.11 + 2024.findings-naacl.11.copyright.pdf + xu-etal-2024-automatic + + + Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models + MiaoranLiIowa State University + BaolinPengTencent AI Lab + MichelGalleyMicrosoft + JianfengGaoMicrosoft Research + ZhuZhang + 163-181 + Fact-checking is an essential task in NLP that is commonly utilized to validate the factual accuracy of a piece of text. Previous approaches mainly involve the resource-intensive process of fine-tuning pre-trained language models on specific datasets. In addition, there is a notable gap in datasets that focus on fact-checking texts generated by large language models (LLMs). In this paper, we introduce Self-Checker, a plug-and-play framework that harnesses LLMs for efficient and rapid fact-checking in a few-shot manner. We also present the BingCheck dataset, specifically designed for fact-checking texts generated by LLMs. Empirical results demonstrate the potential of Self-Checker in the use of LLMs for fact-checking. Compared to state-of-the-art fine-tuned models, there is still significant room for improvement, indicating that adopting LLMs could be a promising direction for future fact-checking research. + 2024.findings-naacl.12 + 2024.findings-naacl.12.copyright.pdf + li-etal-2024-self + + + Low-resource neural machine translation with morphological modeling + AntoineNzeyimana + 182-195 + Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. 
However, existing methods such as sub-word tokenization and character-based models are limited to the surface forms of the words. In this work, we propose a framework solution for modeling complex morphology in low-resource settings. A two-tier transformer architecture is chosen to encode morphological information at the inputs. At the target-side output, a multi-task multi-label training scheme coupled with a beam search-based decoder is found to improve machine translation performance. An attention augmentation scheme to the transformer model is proposed in a generic form to allow integration of pre-trained language models and also facilitate modeling of word order relationships between the source and target languages. Several data augmentation techniques are evaluated and shown to increase translation performance in low-resource settings. We evaluate our proposed solution on Kinyarwanda \leftrightarrow English translation using public-domain parallel text. Our final models achieve competitive performance in relation to large multi-lingual models. We hope that our results will motivate more use of explicit morphological information and the proposed model and data augmentations in low-resource NMT. + 2024.findings-naacl.13 + 2024.findings-naacl.13.copyright.pdf + nzeyimana-2024-low + + + Self-Cleaning: Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances + ZhendongChuUniversity of Virginia + RuiyiZhangAdobe Systems + TongYuAdobe Research + RajivJainAdobe Systems + VladMorariuAdobe + JiuxiangGuAdobe Systems + AniNenkovaAdobe Research + 196-210 + To achieve state-of-the-art performance, one still needs to train NER models on large-scale, high-quality annotated data, an asset that is both costly and time-intensive to accumulate. 
In contrast, real-world applications often resort to massive low-quality labeled data through non-expert annotators via crowdsourcing and external knowledge bases via distant supervision as a cost-effective alternative. However, these annotation methods result in noisy labels, which in turn lead to a notable decline in performance. Hence, we propose to denoise the noisy NER data with guidance from a small set of clean instances. Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights. The discriminator is capable of detecting both span and category errors with different discriminative prompts. Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set. + 2024.findings-naacl.14 + 2024.findings-naacl.14.copyright.pdf + chu-etal-2024-self + + + <fixed-case>VLUE</fixed-case>: A New Benchmark and Multi-task Knowledge Transfer Learning for <fixed-case>V</fixed-case>ietnamese Natural Language Understanding + PhongDoThe UIT NLP Group and Zalo + SonTran + PhuHoang + KietNguyenUniversity of Information Technology, VNU-HCM + NganNguyen + 211-222 + The success of Natural Language Understanding (NLU) benchmarks in various languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across a wide range of tasks. To establish a standardized set of benchmarks for Vietnamese NLU, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark. The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding. 
To provide an insightful overview of the current state of Vietnamese NLU, we then evaluate seven state-of-the-art pre-trained models, including both multilingual and Vietnamese monolingual models, on our proposed VLUE benchmark. Furthermore, we present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark. Our model combines the proficiency of a multilingual pre-trained model with Vietnamese linguistic knowledge. CafeBERT is developed based on the XLM-RoBERTa model, with an additional pretraining step utilizing a significant amount of Vietnamese textual data to enhance its adaptation to the Vietnamese language. To support future research, CafeBERT is made publicly available. + 2024.findings-naacl.15 + 2024.findings-naacl.15.copyright.pdf + do-etal-2024-vlue + + + <fixed-case>LETI</fixed-case>: Learning to Generate from Textual Interactions + XingyaoWangDepartment of Computer Science, University of Illinois Urbana-Champaign + HaoPengDepartment of Computer Science, University of Illinois Urbana-Champaign + ReyhanehJabbarvandDepartment of Computer Science + HengJiUniversity of Illinois, Urbana-Champaign + 223-239 + Fine-tuning pre-trained language models (LMs) is essential for enhancing their capabilities. Existing techniques commonly fine-tune on input-output pairs (e.g., instruction tuning) or with numerical rewards that gauge the output quality (e.g., RLHF). We explore LMs’ potential to **le**arn from **t**extual **i**nteractions (**LETI**) that not only check their correctness with *binary labels* but also pinpoint and explain errors in their outputs through *textual feedback*. Our focus is the code generation task, where the model produces code based on natural language instructions. This setting invites a natural and scalable way to acquire textual feedback: the error messages and stack traces from code execution using a Python interpreter. 
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback. Prepended to this fine-tuning text, a binary reward token is used to differentiate correct and buggy solutions. LETI requires *no* ground-truth outputs for training and even outperforms a fine-tuned baseline that does. LETI not only improves the performance of LMs on a code generation dataset MBPP, but also generalizes to other datasets. Trained on MBPP, it achieves comparable or better performance than the base LMs on unseen problems in HumanEval. Furthermore, compared to binary feedback, we observe that textual feedback leads to improved generation quality and sample efficiency, achieving the same performance with fewer than half of the gradient steps. LETI is equally applicable in natural language tasks when they can be formulated as code generation, which we empirically verified on event argument extraction. + 2024.findings-naacl.16 + 2024.findings-naacl.16.copyright.pdf + wang-etal-2024-leti + + + Bilateral Masking with prompt for Knowledge Graph Completion + YonghuiKong + CunhangFanSchool of Computer Science and Technology, Anhui University, Hefei 230601, China + YujieChen + ShuaiZhang + ZhaoLvSchool of Computer Science and Technology, Anhui University, Hefei 230601, China + JianhuaTaoTsinghua University + 240-249 + The pre-trained language model (PLM) has achieved significant success in the field of knowledge graph completion (KGC) by effectively modeling entity and relation descriptions. In recent studies, the research in this field has been categorized into methods based on word matching and sentence matching, with the former lagging significantly behind. 
However, there is a critical issue in word matching methods, which is that these methods fail to obtain satisfactory single embedding representations for entities. To address this issue and enhance entity representation, we propose the Bilateral Masking with prompt for Knowledge Graph Completion (BMKGC) approach. Our methodology employs prompts to narrow the distance between the predicted entity and the known entity. Additionally, the BMKGC model incorporates a bi-encoder architecture, enabling simultaneous predictions at both the head and tail. Furthermore, we propose a straightforward technique to augment positive samples, mitigating the problem of degree bias present in knowledge graphs and thereby improving the model’s robustness. Experimental results conclusively demonstrate that BMKGC achieves state-of-the-art performance on the WN18RR dataset. + 2024.findings-naacl.17 + 2024.findings-naacl.17.copyright.pdf + kong-etal-2024-bilateral + + + <fixed-case>M</fixed-case>i<fixed-case>L</fixed-case>e Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models + ZhenpengSu + ZijiaLinKuaishou Technology + BaixueBaixue + HuiChenTsinghua University, Tsinghua University + SonglinHu + WeiZhou + GuiguangDingTsinghua University + XingW + 250-262 + Generative language models are usually pre-trained on a large text corpus via predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge in text corpora during training, i.e., the imbalance between frequent tokens and infrequent ones. It can lead a language model to be dominated by common and easy-to-learn tokens, thereby overlooking the infrequent and difficult-to-learn ones. 
To alleviate that, we propose a **MiLe Loss** function for **mi**tigating the bias of **le**arning difficulties with tokens. During training, it can dynamically assess the learning difficulty of a to-be-learned token, according to the information entropy of the corresponding predicted probability distribution over the vocabulary. Then it scales the training loss adaptively, trying to lead the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at different scales of 468M, 1.2B, and 6.7B parameters. Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks. + 2024.findings-naacl.18 + 2024.findings-naacl.18.copyright.pdf + su-etal-2024-mile + + + <fixed-case>GOLD</fixed-case>: Geometry Problem Solver with Natural Language Description + JiaxinZhangUniversity of Strathclyde + YasharMoshfeghiUniversity of Strathclyde + 263-278 + Addressing the challenge of automated geometry math problem-solving in artificial intelligence (AI) involves understanding multi-modal information and mathematics. Current methods struggle with accurately interpreting geometry diagrams, which hinders effective problem-solving. To tackle this issue, we present the Geometry problem sOlver with natural Language Description (GOLD) model. GOLD enhances the extraction of geometric relations by separately processing symbols and geometric primitives within the diagram. Subsequently, it converts the extracted relations into natural language descriptions, efficiently utilizing large language models to solve geometry math problems. Experiments show that the GOLD model outperforms the Geoformer model, the previous best method on the UniGeo dataset, by achieving accuracy improvements of 12.7% and 42.1% in calculation and proving subsets. 
Additionally, it surpasses the former best model on the PGPS9K and Geometry3K datasets, PGPSNet, by obtaining accuracy enhancements of 1.8% and 3.2%, respectively. + 2024.findings-naacl.19 + 2024.findings-naacl.19.copyright.pdf + zhang-moshfeghi-2024-gold + + + <fixed-case>R</fixed-case>o<fixed-case>D</fixed-case>ia: A New Dataset for <fixed-case>R</fixed-case>omanian Dialect Identification from Speech + RotaruCodruț + NicolaeRistea + RaduIonescuUniversitatea Bucuresti + 279-286 + We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia. + 2024.findings-naacl.20 + 2024.findings-naacl.20.copyright.pdf + codrut-etal-2024-rodia + + + Examining Modularity in Multilingual <fixed-case>LM</fixed-case>s via Language-Specialized Subnetworks + RochelleChoenni + EkaterinaShutovaUniversity of Amsterdam + DanGarretteGoogle DeepMind + 287-301 + Recent work has proposed explicitly inducing language-wise modularity in multilingual LMs via sparse fine-tuning (SFT) on per-language subnetworks as a means of better guiding cross-lingual sharing. 
In this paper, we investigate (1) the degree to which language-wise modularity *naturally* arises within models with no special modularity interventions, and (2) how cross-lingual sharing and interference differ between such models and those with explicit SFT-guided subnetwork modularity. In order to do so, we use XLM-R as our multilingual LM. Moreover, to quantify language specialization and cross-lingual interaction, we use a Training Data Attribution method that estimates the degree to which a model’s predictions are influenced by in-language or cross-language training examples. Our results show that language-specialized subnetworks do naturally arise, and that SFT, rather than always increasing modularity, can decrease language specialization of subnetworks in favor of more cross-lingual sharing. + 2024.findings-naacl.21 + 2024.findings-naacl.21.copyright.pdf + choenni-etal-2024-examining + + + Reverse Chain: A Generic-Rule for <fixed-case>LLM</fixed-case>s to Master Multi-<fixed-case>API</fixed-case> Planning + YingerZhang + HuiCai + XieruiSong + YichengChen + RuiSun + JingZhengAnt Group + 302-325 + While enabling large language models to implement function calling (known as APIs) can greatly enhance the performance of Large Language Models (LLMs), function calling is still a challenging task due to the complicated relations between different APIs, especially in a context-learning setting without fine-tuning. This paper introduces “Reverse Chain”, a controllable, target-driven approach designed to empower LLMs with the capability to operate external APIs only via prompts. Recognizing that most LLMs have limited tool-use capabilities, Reverse Chain limits LLMs to executing simple tasks, e.g., API Selection and Argument Completion. Furthermore, to manage a controllable multi-function calling, Reverse Chain adopts a generic rule based on a backward reasoning process. This rule determines when to do API Selection or Argument Completion. 
To evaluate the multi-tool-use capability of LLMs, we have released a compositional multi-tool task dataset, available at https://github.com/zhangyingerjelly/reverse-chain. Extensive numerical experiments validate the remarkable proficiency of Reverse Chain in managing multiple API calls. + 2024.findings-naacl.22 + 2024.findings-naacl.22.copyright.pdf + zhang-etal-2024-reverse + + + Incorporating Exponential Smoothing into <fixed-case>MLP</fixed-case>: a Simple but Effective Sequence Model + JiqunChuJiqunChu + ZuoquanLinPeking University + 326-337 + Modeling long-range dependencies in sequential data is a crucial step in sequence learning. A recently developed model, the Structured State Space (S4), demonstrated significant effectiveness in modeling long-range sequences. However, it is unclear whether the success of S4 can be attributed to its intricate parameterization and HiPPO initialization or simply due to State Space Models (SSMs). To further investigate the potential of the deep SSMs, we start with exponential smoothing (ETS), a simple SSM, and propose a stacked architecture by directly incorporating it into an element-wise MLP. We augment simple ETS with additional parameters and complex field to reduce the inductive bias. Despite increasing less than 1% of parameters of element-wise MLP, our models achieve comparable results to S4 on the LRA benchmark. + 2024.findings-naacl.23 + 2024.findings-naacl.23.copyright.pdf + jiqunchu-lin-2024-incorporating + + + <fixed-case>O</fixed-case>pen<fixed-case>FMN</fixed-case>av: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models + YuxuanKuang + HaiLinUniversity of Notre Dame and University of Notre Dame + MengJiangUniversity of Notre Dame + 338-351 + Object navigation (ObjectNav) requires an agent to navigate through unseen environments to find queried objects.
Many previous methods attempted to solve this task by relying on supervised or reinforcement learning, where they are trained on limited household datasets with close-set objects. However, two key challenges are unsolved: understanding free-form natural language instructions that demand open-set objects, and generalizing to new environments in a zero-shot manner. Aiming to solve the two challenges, in this paper, we propose **OpenFMNav**, an **Open**-set **F**oundation **M**odel based framework for zero-shot object **Nav**igation. We first unleash the reasoning abilities of large language models (LLMs) to extract proposed objects from natural language instructions that meet the user’s demand. We then leverage the generalizability of large vision language models (VLMs) to actively discover and detect candidate objects from the scene, building a *Versatile Semantic Score Map (VSSM)*. Then, by conducting common sense reasoning on *VSSM*, our method can perform effective language-guided exploration and exploitation of the scene and finally reach the goal. By leveraging the reasoning and generalizing abilities of foundation models, our method can understand free-form human instructions and perform effective open-set zero-shot navigation in diverse environments. Extensive experiments on the HM3D ObjectNav benchmark show that our method surpasses all the strong baselines on all metrics, proving our method’s effectiveness. Furthermore, we perform real robot demonstrations to validate our method’s open-set-ness and generalizability to real-world environments. + 2024.findings-naacl.24 + 2024.findings-naacl.24.copyright.pdf + kuang-etal-2024-openfmnav + + + Comparing Two Model Designs for Clinical Note Generation; Is an <fixed-case>LLM</fixed-case> a Useful Evaluator of Consistency? + NathanBrake3M + ThomasSchaaf + 352-363 + Following an interaction with a patient, physicians are responsible for the submission of clinical documentation, often organized as a SOAP note. 
A clinical note is not simply a summary of the conversation but requires the use of appropriate medical terminology. The relevant information can then be extracted and organized according to the structure of the SOAP note. In this paper we analyze two different approaches to generate the different sections of a SOAP note based on the audio recording of the conversation, and specifically examine them in terms of note consistency. The first approach generates the sections independently, while the second method generates them all together. In this work we make use of PEGASUS-X Transformer models and observe that both methods lead to similar ROUGE values (less than 1% difference) and have no difference in terms of the Factuality metric. We perform a human evaluation to measure aspects of consistency and demonstrate that LLMs like Llama2 can be used to perform the same tasks with roughly the same agreement as the human annotators. Between the Llama2 analysis and the human reviewers we observe a Cohen Kappa inter-rater reliability of 0.79, 1.00, and 0.32 for consistency of age, gender, and body part injury, respectively. With this we demonstrate the usefulness of leveraging an LLM to measure quality indicators that can be identified by humans but are not currently captured by automatic metrics. This allows scaling evaluation to larger data sets, and we find that clinical note consistency improves by generating each new section conditioned on the output of all previously generated sections. + 2024.findings-naacl.25 + 2024.findings-naacl.25.copyright.pdf + brake-schaaf-2024-comparing + + + <fixed-case>VOLTA</fixed-case>: Improving Generative Diversity by Variational Mutual Information Maximizing Autoencoder + YueenMa + DaFengChiHuawei Technologies Ltd. + JingjingLi + KaiSong + YuzhengZhuangHuawei Technologies Ltd. + IrwinKingThe Chinese University of Hong Kong + 364-378 + The natural language generation domain has witnessed great success thanks to Transformer models. 
Although they have achieved state-of-the-art generative quality, they often neglect generative diversity. Prior attempts to tackle this issue suffer from either low model capacity or over-complicated architectures. Some recent methods employ the VAE framework to enhance diversity, but their latent variables fully depend on the input context, restricting exploration of the latent space. In this paper, we introduce VOLTA, a framework that elevates generative diversity by bridging Transformer with VAE via a more effective cross-attention-based connection, departing from conventional embedding concatenation or summation. Additionally, we propose integrating InfoGAN-style latent codes to enable input-independent variability, further diversifying the generation. Moreover, our framework accommodates discrete inputs alongside its existing support for continuous inputs. We perform comprehensive experiments with two types of Transformers on six datasets from three different NLG tasks to show that our approach can significantly improve generative diversity while maintaining generative quality. + 2024.findings-naacl.26 + 2024.findings-naacl.26.copyright.pdf + ma-etal-2024-volta + + + <fixed-case>E</fixed-case>co<fixed-case>S</fixed-case>peak: Cost-Efficient Bias Mitigation for Partially Cross-Lingual Speaker Verification + DivyaSharmaIndraprastha Institute of Information Technology, Delhi + 379-394 + Linguistic bias is a critical problem concerning the diversity, equity, and inclusiveness of Natural Language Processing tools. The severity of this problem intensifies in security systems, such as speaker verification, where fairness is paramount. Speaker verification systems are biometric systems that determine whether two speech recordings are of the same speaker. Such user-centric systems should be inclusive to bilingual speakers. However, deep neural network models are linguistically biased. Linguistic bias can be full or partial.
Partially cross-lingual bias occurs when one test trial pair recording is in the training set’s language, and the other is in an unseen target language. Such linguistic mismatch influences the speaker verification model’s decision, dissuading bilingual speakers from using the system. Domain adaptation can mitigate this problem. However, adapting to each existing language is expensive. This paper explores cost-efficient bias mitigation techniques for partially cross-lingual speaker verification. We study the behavior of five baselines in five partially cross-lingual scenarios. Using our baseline behavioral insights, we propose EcoSpeak, a low-cost solution to partially cross-lingual speaker verification. EcoSpeak incorporates contrastive linguistic (CL) attention. CL attention utilizes linguistic differences in trial pairs to emphasize relevant speaker verification embedding parts. Experimental results demonstrate EcoSpeak’s robustness to partially cross-lingual testing. + 2024.findings-naacl.27 + 2024.findings-naacl.27.copyright.pdf + sharma-2024-ecospeak + + + Leveraging Contextual Information for Effective Entity Salience Detection + RajarshiBhowmikBloomberg L.P. + MarcoPonza + AtharvaTendle + AnantGupta + RebeccaJiang + XingyuLuBloomberg + QianZhaoBloomberg + DanielPreotiuc-PietroBloomberg + 395-408 + In text documents such as news articles, the content and key events usually revolve around a subset of all the entities mentioned in a document. These entities, often deemed as salient entities, provide useful cues of the aboutness of a document to a reader. Identifying the salience of entities was found helpful in several downstream applications such as search, ranking, and entity-centric summarization, among others. Prior work on salient entity detection mainly focused on machine learning models that require heavy feature engineering. 
We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches. To this end, we conduct a comprehensive benchmarking of four publicly available datasets using models representative of the medium-sized pre-trained language model family. Additionally, we show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task’s uniqueness and complexity. + 2024.findings-naacl.28 + 2024.findings-naacl.28.copyright.pdf + bhowmik-etal-2024-leveraging + + + <fixed-case>LLM</fixed-case>-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected? + QihuiZhang + ChujieGao + DongpingChen + YueHuang + YixinHuang + ZhenyangSun + ShilinZhang + WeiyeLi + ZhengyanFu + YaoWanHuazhong University of Science and Technology + LichaoSunLehigh University + 409-436 + With the rapid development and widespread application of Large Language Models (LLMs), the use of Machine-Generated Text (MGT) has become increasingly common, bringing with it potential risks, especially in terms of quality and integrity in fields like news, education, and science. Current research mainly focuses on purely MGT detection, without adequately addressing mixed scenarios including AI-revised Human-Written Text (HWT) or human-revised MGT. To tackle this challenge, we define mixtext, a form of mixed text involving both AI and human-generated content. Then we introduce MixSet, the first dataset dedicated to studying these mixtext scenarios. Leveraging MixSet, we executed comprehensive experiments to assess the efficacy of prevalent MGT detectors in handling mixtext situations, evaluating their performance in terms of effectiveness, robustness, and generalization. Our findings reveal that existing detectors struggle to identify mixtext, particularly in dealing with subtle modifications and style adaptability. 
This research underscores the urgent need for more fine-grain detectors tailored for mixtext, offering valuable insights for future research. Code and Models are available at https://github.com/Dongping-Chen/MixSet. + 2024.findings-naacl.29 + 2024.findings-naacl.29.copyright.pdf + zhang-etal-2024-llm + + + A (More) Realistic Evaluation Setup for Generalisation of Community Models on Malicious Content Detection + IvoVerhoeven + PushkarMishraMeta AI + RahelBeloch + HelenYannakoudakisComputer Laboratory, University of Cambridge and King’s College London + EkaterinaShutovaUniversity of Amsterdam + 437-463 + Community models for malicious content detection, which take into account the context from a social graph alongside the content itself, have shown remarkable performance on benchmark datasets. Yet, misinformation and hate speech continue to propagate on social media networks. This mismatch can be partially attributed to the limitations of current evaluation setups that neglect the rapid evolution of online content and the underlying social graph. In this paper, we propose a novel evaluation setup for model generalisation based on our few-shot subgraph sampling approach. This setup tests for generalisation through few labelled examples in local explorations of a larger graph, emulating more realistic application settings. We show this to be a challenging inductive setup, wherein strong performance on the training graph is not indicative of performance on unseen tasks, domains, or graph structures. Lastly, we show that graph meta-learners trained with our proposed few-shot subgraph sampling outperform standard community models in the inductive setup. 
+ 2024.findings-naacl.30 + 2024.findings-naacl.30.copyright.pdf + verhoeven-etal-2024-realistic + + + Citation: A Key to Building Responsible and Accountable Large Language Models + JieHuangUniversity of Illinois, Urbana Champaign + KevinChangUniversity of Illinois, Urbana Champaign + 464-473 + Large Language Models (LLMs) bring transformative benefits alongside unique challenges, including intellectual property (IP) and ethical concerns. This position paper explores a novel angle to mitigate these risks, drawing parallels between LLMs and established web systems. We identify “citation”—the acknowledgement or reference to a source or evidence—as a crucial yet missing component in LLMs. Incorporating citation could enhance content transparency and verifiability, thereby confronting the IP and ethical issues in the deployment of LLMs. We further propose that a comprehensive citation mechanism for LLMs should account for both non-parametric and parametric content. Despite the complexity of implementing such a citation mechanism, along with the potential pitfalls, we advocate for its development. Building on this foundation, we outline several research problems in this area, aiming to guide future explorations towards building more responsible and accountable LLMs. + 2024.findings-naacl.31 + 2024.findings-naacl.31.copyright.pdf + huang-chang-2024-citation + + + Graph-Induced Syntactic-Semantic Spaces in Transformer-Based Variational <fixed-case>A</fixed-case>uto<fixed-case>E</fixed-case>ncoders + YingjiZhang + MarcoValentino + DaniloCarvalhoUniversity of Manchester + IanPratt-HartmannUniversity of Opole and University of Manchester + AndreFreitasIdiap Research Institute and University of Manchester + 474-489 + The injection of syntactic information in Variational AutoEncoders (VAEs) can result in an overall improvement of performances and generalisation. 
An effective strategy to achieve such a goal is to separate the encoding of distributional semantic features and syntactic structures into heterogeneous latent spaces via multi-task learning or dual encoder architectures. However, existing works employing such techniques are limited to LSTM-based VAEs. This work investigates latent space separation methods for structural syntactic injection in Transformer-based VAE architectures (i.e., Optimus) through the integration of graph-based models. Our empirical evaluation reveals that the proposed end-to-end VAE architecture can improve the overall organisation of the latent space, alleviating the information loss occurring in standard VAE setups, and resulting in enhanced performances on language modelling and downstream generation tasks. + 2024.findings-naacl.32 + 2024.findings-naacl.32.copyright.pdf + zhang-etal-2024-graph + + + Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles + WeitingTanJohns Hopkins University + HaoranXuJohns Hopkins University + LingfengShenByteDance Inc. + Shuyue StellaLiDepartment of Computer Science, University of Washington + KentonMurrayJohns Hopkins University + PhilippKoehnJohns Hopkins University + BenjaminVan DurmeJohns Hopkins University, Johns Hopkins University, Johns Hopkins University and Microsoft + YunmoChenJohns Hopkins University + 490-502 + Large language models trained primarily in a monolingual setting have demonstrated their ability to generalize to machine translation using zero- and few-shot examples with in-context learning. However, even though zero-shot translations are relatively good, there remains a discernible gap comparing their performance with the few-shot setting. In this paper, we investigate the factors contributing to this gap and find that this gap can largely be closed (for about 70%) by matching the writing styles of the target corpus.
Additionally, we explore potential approaches to enhance zero-shot baselines without the need for parallel demonstration examples, providing valuable insights into how these methods contribute to improving translation metrics. + 2024.findings-naacl.33 + 2024.findings-naacl.33.copyright.pdf + tan-etal-2024-narrowing + + + Which Modality should <fixed-case>I</fixed-case> use - Text, Motif, or Image? : Understanding Graphs with Large Language Models + DebaratiDas + IshaanGupta + JaideepSrivastavaUniversity of Minnesota - Twin Cities + DongyeopKangUniversity of Minnesota + 503-519 + Our research integrates graph data with Large Language Models (LLMs), which, despite their advancements in various fields using large text corpora, face limitations in encoding entire graphs due to context size constraints. This paper introduces a new approach to encoding a graph with diverse modalities, such as text, image, and motif, coupled with prompts to approximate a graph’s global connectivity, thereby enhancing LLMs’ efficiency in processing complex graph structures. The study also presents GraphTMI, a novel benchmark for evaluating LLMs in graph structure analysis, focusing on homophily, motif presence, and graph difficulty. Key findings indicate that the image modality, especially with vision-language models like GPT-4V, is superior to text in balancing token limits and preserving essential information and comes close to prior graph neural net (GNN) encoders. Furthermore, the research assesses how various factors affect the performance of each encoding modality and outlines the existing challenges and potential future developments for LLMs in graph understanding and reasoning tasks. 
Our code and data are publicly available on our project page - https://minnesotanlp.github.io/GraphLLM/ + 2024.findings-naacl.34 + 2024.findings-naacl.34.copyright.pdf + das-etal-2024-modality + + + On-the-Fly Fusion of Large Language Models and Machine Translation + HieuHoangMicrosoft + HudaKhayrallahMicrosoft + MarcinJunczys-DowmuntMicrosoft + 520-532 + We propose on-the-fly ensembling of a neural machine translation (NMT) model with a large language model (LLM), prompted on the same task and input. Through experiments on 4 language directions with varying data amounts, we find that a slightly weaker-at-translation LLM can improve translations of an NMT model, and such an ensemble can produce better translations than ensembling two stronger NMT models. We demonstrate that our ensemble method can be combined with various techniques from LLM prompting, such as in context learning and translation context. + 2024.findings-naacl.35 + 2024.findings-naacl.35.copyright.pdf + hoang-etal-2024-fly + + + <fixed-case>READ</fixed-case>: Improving Relation Extraction from an <fixed-case>AD</fixed-case>versarial Perspective + DaweiLi + WilliamHogan + JingboShangUniversity of California, San Diego + 533-548 + Recent works in relation extraction (RE) have achieved promising benchmark accuracy; however, our adversarial attack experiments show that these works excessively rely on entities, making their generalization capability questionable. To address this issue, we propose an adversarial training method specifically designed for RE. Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations. Furthermore, we introduce a probabilistic strategy for leaving clean tokens in the context during adversarial training. This strategy enables a larger attack budget for entities and coaxes the model to leverage relational patterns embedded in the context.
Extensive experiments show that compared to various adversarial training methods, our method significantly improves both the accuracy and robustness of the model. Additionally, experiments on different data availability settings highlight the effectiveness of our method in low-resource scenarios. We also perform in-depth analyses of our proposed method and provide further hints. We will release our code at https://github.com/David-Li0406/READ. + 2024.findings-naacl.36 + 2024.findings-naacl.36.copyright.pdf + li-etal-2024-read + + + <fixed-case>REQUAL</fixed-case>-<fixed-case>LM</fixed-case>: Reliability and Equity through Aggregation in Large Language Models + SanaEbrahimi + NimaShahbazi + AbolfazlAsudehUniversity of Illinois Chicago + 549-560 + The extensive scope of large language models (LLMs) across various domains underscores the critical importance of responsibility in their application, beyond natural language processing. In particular, the randomized nature of LLMs, coupled with inherent biases and historical stereotypes in data, raises critical concerns regarding reliability and equity. Addressing these challenges is necessary before using LLMs for applications with societal impact. Towards addressing this gap, we introduce REQUAL-LM, a novel method for finding reliable and equitable LLM outputs through aggregation. Specifically, we develop a Monte Carlo method based on repeated sampling to find a reliable output close to the mean of the underlying distribution of possible outputs. We formally define the terms such as reliability and bias, and design an equity-aware aggregation to minimize harmful bias while finding a highly reliable output. REQUAL-LM does not require specialized hardware, does not impose a significant computing load, and uses LLMs as a blackbox. This design choice enables seamless scalability alongside the rapid advancement of LLM technologies.
Our system does not require retraining the LLMs, which makes it deployment ready and easy to adapt. Our comprehensive experiments using various tasks and datasets demonstrate that REQUAL-LM effectively mitigates bias and selects a more equitable response, specifically the outputs that properly represent minority groups. + 2024.findings-naacl.37 + 2024.findings-naacl.37.copyright.pdf + ebrahimi-etal-2024-requal + + + Addressing Both Statistical and Causal Gender Fairness in <fixed-case>NLP</fixed-case> Models + HannahChenUniversity of Virginia + YangfengJiUniversity of Virginia + DavidEvansUniversity of Virginia + 561-582 + Statistical fairness stipulates equivalent outcomes for every protected group, whereas causal fairness prescribes that a model makes the same prediction for an individual regardless of their protected characteristics. Counterfactual data augmentation (CDA) is effective for reducing bias in NLP models, yet models trained with CDA are often evaluated only on metrics that are closely tied to the causal fairness notion; similarly, sampling-based methods designed to promote statistical fairness are rarely evaluated for causal fairness. In this work, we evaluate both statistical and causal debiasing methods for gender bias in NLP models, and find that while such methods are effective at reducing bias as measured by the targeted metric, they do not necessarily improve results on other bias metrics. We demonstrate that combinations of statistical and causal debiasing techniques are able to reduce bias measured through both types of metrics.
+ 2024.findings-naacl.38 + 2024.findings-naacl.38.copyright.pdf + chen-etal-2024-addressing + + + <fixed-case>LLM</fixed-case>-Rec: Personalized Recommendation via Prompting Large Language Models + HanjiaLyuUniversity of Rochester + SongJiang + HanqingZengMeta AI + YinglongXiaMeta + QifanWangMeta AI + SiZhangMeta + RenChen + ChrisLeungMeta AI and College of Computing, Georgia Institute of Technology + JiajieTang + JieboLuoUniversity of Rochester, University of Rochester, University of Rochester and University of Rochester + 583-612 + Text-based recommendation holds a wide range of practical applications due to its versatility, as textual descriptions can represent nearly any type of item. However, directly employing the original item descriptions may not yield optimal recommendation performance due to the lack of comprehensive information to align with user preferences. Recent advances in large language models (LLMs) have showcased their remarkable ability to harness commonsense knowledge and reasoning. In this study, we introduce a novel approach, coined LLM-Rec, which incorporates four distinct prompting strategies of text enrichment for improving personalized text-based recommendations. Our empirical experiments reveal that using LLM-augmented text significantly enhances recommendation quality. Even basic MLP (Multi-Layer Perceptron) models achieve comparable or even better results than complex content-based methods. Notably, the success of LLM-Rec lies in its prompting strategies, which effectively tap into the language model’s comprehension of both general and specific item characteristics. This highlights the importance of employing diverse prompts and input augmentation techniques to boost the recommendation effectiveness of LLMs. 
+ 2024.findings-naacl.39 + 2024.findings-naacl.39.copyright.pdf + lyu-etal-2024-llm + + + A Robust Semantics-based Watermark for Large Language Model against Paraphrasing + JieRenBaidu and Michigan State University + HanXuMichigan State University + YidingLiuBaidu + YingqianCui + ShuaiqiangWangBaidu Inc. + DaweiYinBaidu + JiliangTangMichigan State University + 613-625 + Large language models (LLMs) have shown their remarkable ability in various natural language tasks. However, there are concerns that LLMs are possible to be used improperly or even illegally. To prevent the malicious usage of LLMs, detecting LLM-generated text becomes crucial in the deployment of LLM applications. Watermarking is an effective strategy to detect the LLM-generated content by encoding a pre-defined secret watermark to facilitate the detection process. However, the majority of existing watermark methods leverage the simple hashes of precedent tokens to partition vocabulary. Such watermarks can be easily eliminated by paraphrase and, correspondingly, the detection effectiveness will be greatly compromised. Thus, to enhance the robustness against paraphrase, we propose a semantics-based watermark framework, SemaMark. It leverages the semantics as an alternative to simple hashes of tokens since the semantic meaning of the sentences will be likely preserved under paraphrase and the watermark can remain robust. Comprehensive experiments are conducted to demonstrate the effectiveness and robustness of SemaMark under different paraphrases.
+ 2024.findings-naacl.40 + 2024.findings-naacl.40.copyright.pdf + ren-etal-2024-robust + + + Solving Data-centric Tasks using Large Language Models + ShraddhaBarke + ChristianPoelitzResearch, Microsoft + CarinaNegreanuMicrosoft + BenjaminZorn + JoséCambroneroMicrosoft + AndrewGordonUniversity of Edinburgh and Microsoft Research + VuLeMicrosoft + ElnazNouri + NadiaPolikarpovaUniversity of California, San Diego + AdvaitSarkarMicrosoft + BrianSliningerResearch, Microsoft + NeilToronto + JackWilliamsMicrosoft + 626-638 + Large language models are rapidly replacing help forums like StackOverflow, and are especially helpful to non-professional programmers and end users. These users are often interested in data-centric tasks, like spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including data. But how do we decide how much data and which data to include in the prompt? This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a novel cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, our cluster-then-select technique outperforms a random selection baseline. + 2024.findings-naacl.41 + 2024.findings-naacl.41.copyright.pdf + barke-etal-2024-solving + + + A Novel Paradigm Boosting Translation Capabilities of Large Language Models + JiaxinGuo + HaoYang + ZongyaoLi + DaimengWei + HengchaoShang + XiaoyuChenHuawei Technologies Ltd.
+ 639-649 + This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs’ cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2(CITATION) model, particularly on Chinese-Llama2(CITATION) after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B(CITATION) and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.
+ 2024.findings-naacl.42 + 2024.findings-naacl.42.copyright.pdf + guo-etal-2024-novel + + + Measuring Social Norms of Large Language Models + YeYuan + KexinTangPeking University + JianhaoShen + MingZhangPeking University + ChenguangWangWashington University, Saint Louis + 650-699 + We present a new challenge to examine whether large language models understand social norms. In contrast to existing datasets, our dataset requires a fundamental understanding of social norms to solve. Our dataset features the largest set of social norm skills, consisting of 402 skills and 12,383 questions covering a wide set of social norms ranging from opinions and arguments to culture and laws. We design our dataset according to the K-12 curriculum. This enables the direct comparison of the social understanding of large language models to humans, more specifically, elementary students. While prior work generates nearly random accuracy on our benchmark, recent large language models such as GPT3.5-Turbo and LLaMA2-Chat are able to improve the performance significantly, only slightly below human performance. We then propose a multi-agent framework based on large language models to improve the models’ ability to understand social norms. This method further improves large language models to be on par with humans. Given the increasing adoption of large language models in real-world applications, our finding is particularly important and presents a unique direction for future improvements. + 2024.findings-naacl.43 + 2024.findings-naacl.43.copyright.pdf + yuan-etal-2024-measuring + + + Source-Free Unsupervised Domain Adaptation for Question Answering via Prompt-Assisted Self-learning + MaxwellYin + BoyuWangUniversity of Western Ontario + CharlesLingWestern University + 700-713 + This work addresses source-free domain adaptation (SFDA) for Question Answering (QA), wherein a model trained on a source domain is adapted to unlabeled target domains without additional source data. 
Existing SFDA methods only focus on the adaptation phase, overlooking the impact of source domain training on model generalizability. In this paper, we argue that source model training itself is also critical for improving the adaptation performance and stability. To this end, we investigate the role of prompt learning as an effective method to internalize domain-agnostic QA knowledge, which can be integrated into source training. After source training, an interactive self-learning strategy is proposed to further fine tune both model and prompt in the model adaptation phase. This leads to the Prompt-Assisted Self-Adaptive Learning (PASAL), an innovative SFDA approach for QA. Empirical evaluation on four benchmark datasets shows that PASAL surpasses existing methods in managing domain gaps and demonstrates greater stability across various target domains, validating the significance of source domain training for effective domain adaptation. + 2024.findings-naacl.44 + 2024.findings-naacl.44.copyright.pdf + yin-etal-2024-source + + + Hierarchical Attention Graph for Scientific Document Summarization in Global and Local Level + ChenlongZhao + XiwenZhou + XiaopengXie + YongZhangBeijing University of Posts and Telecommunications + 714-726 + Scientific document summarization has been a challenging task due to the long structure of the input text. The long input hinders the simultaneous effective modeling of both global high-order relations between sentences and local intra-sentence relations which is the most critical step in extractive summarization. However, existing methods mostly focus on one type of relation, neglecting the simultaneous effective modeling of both relations, which can lead to insufficient learning of semantic representations. In this paper, we propose HAESum, a novel approach utilizing graph neural networks to locally and globally model documents based on their hierarchical discourse structure. 
First, intra-sentence relations are learned using a local heterogeneous graph. Subsequently, a novel hypergraph self-attention layer is introduced to further enhance the characterization of high-order inter-sentence relations. We validate our approach on two benchmark datasets, and the experimental results demonstrate the effectiveness of HAESum and the importance of considering hierarchical structures in modeling long scientific documents. + 2024.findings-naacl.45 + 2024.findings-naacl.45.copyright.pdf + zhao-etal-2024-hierarchical + + + <fixed-case>LEEET</fixed-case>s-Dial: Linguistic Entrainment in End-to-End Task-oriented Dialogue systems + NalinKumar + OndrejDusekCharles University, Prague + 727-735 + Linguistic entrainment, or alignment, represents a phenomenon where linguistic patterns employed by conversational participants converge to one another. While entrainment has been shown to produce a more natural user experience, most dialogue systems do not have any provisions for it. In this work, we introduce methods for achieving dialogue entrainment in a GPT-2-based end-to-end task-oriented dialogue system through the utilization of shared vocabulary. We experiment with training instance weighting, entrainment-specific loss, and additional conditioning to generate responses that align with the user. We demonstrate that all three approaches produce significantly better entrainment than the base, non-entrainment-optimized model, as confirmed by both automated and manual evaluation metrics. + 2024.findings-naacl.46 + 2024.findings-naacl.46.copyright.pdf + kumar-dusek-2024-leeets + + + Efficient Dependency Tree Sampling Without Replacement + BogdanDobreUniversity of Bucharest + 736-741 + In the context of computational models of dependency syntax, most dependency treebanks have the restriction that any valid dependency tree must have exactly one edge coming out of the root node in addition to respecting the spanning tree constraints. 
Many algorithms for dependency tree sampling were recently proposed, both for sampling with and without replacement. In this paper we propose a new algorithm called Wilson Reject SWOR for the case of sampling without replacement by adapting the Wilson Reject algorithm originally created for sampling with replacement and combining it with a Trie data structure. Experimental results indicate the efficiency of our approach in the scenario of sampling without replacement from dependency graphs with random weights. + 2024.findings-naacl.47 + 2024.findings-naacl.47.copyright.pdf + dobre-2024-efficient + + + Towards Better Generalization in Open-Domain Question Answering by Mitigating Context Memorization + ZixuanZhang + RevanthGangi Reddy + KevinSmallAmazon + TongZhangUIUC + HengJiUniversity of Illinois, Urbana-Champaign + 742-753 + Open-domain Question Answering (OpenQA) aims at answering factual questions with an external large-scale knowledge corpus. However, real-world knowledge is not static; it updates and evolves continually. Such a dynamic characteristic of knowledge poses a vital challenge for these models, as the trained models need to constantly adapt to the latest information to make sure that the answers remain accurate. In addition, it is still unclear how well an OpenQA model can transfer to completely new knowledge domains. In this paper, we investigate the generalization performance of a retrieval-augmented QA model in two specific scenarios: 1) adapting to updated versions of the same knowledge corpus; 2) switching to completely different knowledge domains. We observe that the generalization challenges of OpenQA models stem from the reader’s over-reliance on memorizing the knowledge from the external corpus, which hinders the model from generalizing to a new knowledge corpus. 
We introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy, to mitigate the knowledge over-memorization by controlling the likelihood of retrieved contexts during training. Extensive experimental results on multiple OpenQA benchmarks show that CIT achieves significantly better generalizability without compromising the model’s performance in its original corpus and domain. + 2024.findings-naacl.48 + 2024.findings-naacl.48.copyright.pdf + zhang-etal-2024-towards + + + <fixed-case>GEE</fixed-case>! Grammar Error Explanation with Large Language Models + YixiaoSongUniversity of Massachusetts at Amherst + KalpeshKrishnaGoogle + RajeshBhattUniversity of Massachusetts at Amherst + KevinGimpelToyota Technological Institute at Chicago + MohitIyyerUniversity of Massachusetts Amherst + 754-781 + Existing grammatical error correction tools do not provide natural language explanations of the errors that they correct in user-written text. However, such explanations are essential for helping users learn the language by gaining a deeper understanding of its grammatical rules (DeKeyser, 2003; Ellis et al., 2006). To address this gap, we propose the task of grammar error explanation, where a system needs to provide one-sentence explanations for each grammatical error in a pair of erroneous and corrected sentences. The task is not easily solved by prompting LLMs: we find that, using one-shot prompting, GPT-4 only explains 40.6% of the errors and does not even attempt to explain 39.8% of the errors. Since LLMs struggle to identify grammar errors, we develop a two-step pipeline that leverages fine-tuned and prompted large language models to perform structured atomic token edit extraction, followed by prompting GPT-4 to explain each edit. We evaluate our pipeline on German, Chinese, and English grammar error correction data. Our atomic edit extraction achieves an F1 of 0.93 on German, 0.91 on Chinese, and 0.891 on English. 
Human evaluation of generated explanations reveals that 93.9% of German errors, 96.4% of Chinese errors, and 92.20% of English errors are correctly detected and explained. To encourage further research, we open-source our data and code. + 2024.findings-naacl.49 + 2024.findings-naacl.49.copyright.pdf + song-etal-2024-gee + + + <fixed-case>A</fixed-case>da<fixed-case>R</fixed-case>efiner: Refining Decisions of Language Models with Adaptive Feedback + WanpengZhangPeking University + ZongqingLuPeking University + 782-799 + Large Language Models (LLMs) have demonstrated significant success across various domains. However, their application in complex decision-making tasks frequently necessitates intricate prompt engineering or fine-tuning, leading to challenges in unseen downstream tasks and heavy demands on computational resources. Meanwhile, Reinforcement Learning (RL) has been recognized as effective in decision-making problems but struggles in environments with sparse rewards, such as open-world games. To overcome these challenges, we introduce AdaRefiner, a novel framework designed to enhance the synergy between LLMs and RL feedback. The key component of AdaRefiner is a lightweight Adapter Language Model (LM), which automatically refines task comprehension based on feedback from RL agents. This method mitigates the need for intricate prompt engineering and intensive LLM fine-tuning while maintaining the LLMs’ generalization abilities and enhancing their decision-making capabilities in downstream tasks. Empirical evaluations of AdaRefiner on 22 diverse tasks within the open-world game Crafter have demonstrated its superior effectiveness, especially in guiding agents towards higher-level and common-sense skills. Our work makes contributions to the automatic self-refinement of LLMs with RL feedback, offering a more adaptable and efficient solution for complex decision-making problems. The code is available at https://github.com/PKU-RL/AdaRefiner. 
+ 2024.findings-naacl.50 + 2024.findings-naacl.50.copyright.pdf + zhang-lu-2024-adarefiner + + + <fixed-case>D</fixed-case>iv<fixed-case>TOD</fixed-case>: Unleashing the Power of <fixed-case>LLM</fixed-case>s for Diversifying Task-Oriented Dialogue Representations + WeihaoZeng + DayuanFu + KeqingHeMeituan Group + YejieWang + YukaiXu + WeiranXu + 800-813 + Language models pre-trained on general text have achieved impressive results in diverse fields. Yet, the distinct linguistic characteristics of task-oriented dialogues (TOD) compared to general text limit the practical utility of existing language models. Current task-oriented dialogue pre-training methods overlook the one-to-many property of conversations, where multiple responses can be appropriate given the same conversation context. In this paper, we propose a novel dialogue pre-training model called DivTOD, which collaborates with LLMs to learn diverse task-oriented dialogue representations. DivTOD guides LLMs in transferring diverse knowledge to smaller models while removing domain knowledge that contradicts task-oriented dialogues. Experiments show that our model outperforms strong TOD baselines on various downstream dialogue tasks and learns the intrinsic diversity of task-oriented dialogues. + 2024.findings-naacl.51 + 2024.findings-naacl.51.copyright.pdf + zeng-etal-2024-divtod + + + Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training + PavelDenisovUniversity of Stuttgart, University of Stuttgart + ThangVuUniversity of Stuttgart, University of Stuttgart + 814-834 + Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. 
This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain. + 2024.findings-naacl.52 + 2024.findings-naacl.52.copyright.pdf + denisov-vu-2024-teaching + + + <fixed-case>CLEAN</fixed-case>–<fixed-case>EVAL</fixed-case>: Clean Evaluation on Contaminated Large Language Models + WenhongZhu + HongkunHao + ZhiweiHeShanghai Jiao Tong University + Yun-ZeSong + JiaoYueyang + YumengZhang + HanxuHu + YiranWei + RuiWangShanghai Jiao Tong University + HongyuanLu + 835-847 + We are currently in an era of fierce competition among various large language models (LLMs), continuously pushing the boundaries of benchmark performance. However, genuinely assessing the capabilities of these LLMs has become a challenging and critical issue due to potential data contamination. In this paper, we propose a novel and valuable method, Clean-Eval, which mitigates the issue of data contamination and evaluates the LLMs more cleanly. Clean-Eval employs a neural-based model to paraphrase and back-translate the contaminated data into a candidate set, generating expressions with the same meaning but in different surface forms. 
A semantic detector is then used to filter those generated low-quality samples to narrow down this candidate set. Candidates with moderate BLEURT scores against the original samples are selected as the final evaluation set. According to human assessment, this set is almost semantically equivalent to the original contamination set but expressed differently. We conduct experiments on 20 existing benchmarks across diverse tasks, and results demonstrate that Clean-Eval substantially restores the actual evaluation results on contaminated LLMs under both few-shot learning and fine-tuning scenarios. + 2024.findings-naacl.53 + 2024.findings-naacl.53.copyright.pdf + zhu-etal-2024-clean + + + <fixed-case>R</fixed-case>-<fixed-case>BASS</fixed-case> : Relevance-aided Block-wise Adaptation for Speech Summarization + RoshanSharmaGoogle + RuchiraSharma + HiraDhamyalCarnegie Mellon University + RitaSinghSchool of Computer Science, Carnegie Mellon University + BhikshaRajCarnegie Mellon University, Carnegie Mellon University and Mohamed bin Zayed University of Artificial Intelligence + 848-857 + End-to-end speech summarization on long recordings is challenging because of the high computational cost. Block-wise Adaptation for Speech Summarization (BASS) summarizes arbitrarily long sequences by sequentially processing abutting chunks of audio. Despite the benefits of BASS, it has higher compute time due to sequential processing of all blocks, regardless of whether they are relevant to the final summary. In this paper, we propose R-BASS, a new relevance-aware block-wise adaptation method. First, we introduce two approaches to automatically estimate block relevance based on lexical and semantic similarity between the block-level transcript and the summary. Experiments on the How2 dataset show that using ground truth relevance during inference improves efficiency by 63.9 % by dropping irrelevant blocks. 
Finally, we incorporate relevance scores into training using a novel relevance loss and relevance predictor, and the proposed R-BASS model makes it possible to drop 86.3 % of the blocks while retaining comparable performance, resulting in a 2.2x speedup over BASS. + 2024.findings-naacl.54 + 2024.findings-naacl.54.copyright.pdf + sharma-etal-2024-r + + + <fixed-case>OVM</fixed-case>, Outcome-supervised Value Models for Planning in Mathematical Reasoning + FeiYu + AnningzheGaoShenZhen research institute of big data + BenyouWangThe Chinese University of Hong Kong, Shenzhen + 858-875 + Large language models (LLMs) often struggle with maintaining accuracy throughout multiple reasoning steps, especially in mathematical reasoning where an error in earlier steps can propagate to subsequent ones, ultimately leading to an incorrect answer. To reduce error propagation, guided decoding is employed to direct the LM decoding on a step-by-step basis. We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness, as the former approach leads towards a correct final answer. This transforms the task into a \textit{value estimation} problem in planning. Inspired by the findings that \textit{outcome supervision for guided decoding essentially acts as a value model}, we propose Outcome-supervised Value Model (OVM) that employs outcome supervision for training a value model, which prioritizes steps that lead to accurate conclusions. Furthermore, the OVM eliminates the need for labor-intensive annotations of step-level correctness, thereby significantly enhancing its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model. 
Notably, in GSM8K, our \textbf{OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters}; especially it does not utilize GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training value models for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for guided decoding. + 2024.findings-naacl.55 + 2024.findings-naacl.55.copyright.pdf + yu-etal-2024-ovm + + + The Whole is Better than the Sum: Using Aggregated Demonstrations in In-Context Learning for Sequential Recommendation + LeiWangSalesForce + Ee-PengLimSingapore Management University + 876-895 + Large language models (LLMs) have shown excellent performance on various NLP tasks. To use LLMs as strong sequential recommenders, we explore the in-context learning approach to sequential recommendation. We investigate the effects of instruction format, task consistency, demonstration selection, and number of demonstrations. As increasing the number of demonstrations in ICL does not improve accuracy despite using a long prompt, we propose a novel method called LLMSRec-Syn that incorporates multiple demonstration users into one aggregated demonstration. Our experiments on three recommendation datasets show that LLMSRec-Syn outperforms state-of-the-art LLM-based sequential recommendation methods. In some cases, LLMSRec-Syn can perform on par with or even better than supervised learning methods. Our code is publicly available at https://github.com/demoleiwang/LLMSRec_Syn. 
+ 2024.findings-naacl.56 + 2024.findings-naacl.56.copyright.pdf + wang-lim-2024-whole + + + Bring Your Own <fixed-case>KG</fixed-case>: Self-Supervised Program Synthesis for Zero-Shot <fixed-case>KGQA</fixed-case> + DhruvAgarwalDepartment of Computer Science, University of Massachusetts at Amherst + RajarshiDasAWS AI Labs + SopanKhoslaAmazon Web Services + RashmiGangadharaiahAmazon + 896-919 + We present BYOKG, a universal question-answering (QA) system that can operate on any knowledge graph (KG), requires no human-annotated training data, and can be ready to use within a day—attributes that are out-of-scope for current KGQA systems. BYOKG draws inspiration from the remarkable ability of humans to comprehend information present in an unseen KG through exploration—starting at random nodes, inspecting the labels of adjacent nodes and edges, and combining them with their prior world knowledge. Exploration in BYOKG leverages an LLM-backed symbolic agent that generates a diverse set of query-program exemplars, which are then used to ground a retrieval-augmented reasoning procedure to synthesize programs for arbitrary questions. BYOKG is effective over both small- and large-scale graphs, showing dramatic gains in zero-shot QA accuracy of 27.89 and 59.88 F1 on GrailQA and MetaQA, respectively. We further find that performance of BYOKG reliably improves with continued exploration as well as improvements in the base LLM, notably outperforming a state-of-the-art fine-tuned model by 7.08 F1 on a sub-sampled zero-shot split of GrailQA. Lastly, we verify our universality claim by evaluating BYOKG on a domain-specific materials science KG and show that it improves zero-shot performance by 46.33 F1. 
+ 2024.findings-naacl.57 + 2024.findings-naacl.57.copyright.pdf + agarwal-etal-2024-bring + + + <fixed-case>G</fixed-case>ra<fixed-case>SAME</fixed-case>: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism + ShuzhouYuan + MichaelFärberTechnische Universität Dresden + 920-933 + Pretrained Language Models (PLMs) benefit from external knowledge stored in graph structures for various downstream tasks. However, bridging the modality gap between graph structures and text remains a significant challenge. Traditional methods like linearizing graphs for PLMs lose vital graph connectivity, whereas Graph Neural Networks (GNNs) require cumbersome processes for integration into PLMs. In this work, we propose a novel graph-guided self-attention mechanism, GraSAME. GraSAME seamlessly incorporates token-level structural information into PLMs without necessitating additional alignment or concatenation efforts. As an end-to-end, lightweight multimodal module, GraSAME follows a multi-task learning strategy and effectively bridges the gap between graph and textual modalities, facilitating dynamic interactions between GNNs and PLMs. Our experiments on the graph-to-text generation task demonstrate that GraSAME outperforms baseline models and achieves results comparable to state-of-the-art (SOTA) models on WebNLG datasets. Furthermore, compared to SOTA models, GraSAME eliminates the need for extra pre-training tasks to adjust graph inputs and reduces the number of trainable parameters by over 100 million. + 2024.findings-naacl.58 + 2024.findings-naacl.58.copyright.pdf + yuan-farber-2024-grasame + + + Can Public Large Language Models Help Private Cross-device Federated Learning? 
+ BoxinWangNVIDIA + YiboZhangStanford University and University of Illinois, Urbana Champaign + YuanCaoGoogle DeepMind + BoLiUniversity of Illinois, Urbana Champaign and University of California Berkeley + HughMcMahanGoogle + SewoongOhUniversity of Washington, University of Illinois at Urbana-Champaign and University of Washington, Seattle + ZhengXuGoogle + ManzilZaheerZaheer and DeepMind + 934-949 + We study (differentially) private federated learning (FL) of language models. The language models in cross-device FL are relatively small, which can be trained with meaningful formal user-level differential privacy (DP) guarantees when massive parallelism in training is enabled by the participation of a moderate size of users. Recently, public data has been used to improve privacy-utility trade-offs for both large and small language models. In this work, we provide a systematic study of using large-scale public data and LLMs to help differentially private training of on-device FL models, and further improve the privacy-utility tradeoff by techniques of distillation. Moreover, we propose a novel distribution matching algorithm with theoretical grounding to sample public data close to private data distribution, which significantly improves the sample efficiency of (pre-)training on public data. The proposed method is efficient and effective for training private models by taking advantage of public data, especially for customized on-device architectures that do not have ready-to-use pre-trained models. 
+ 2024.findings-naacl.59 + 2024.findings-naacl.59.copyright.pdf + wang-etal-2024-public + + + <fixed-case>L</fixed-case>ang<fixed-case>N</fixed-case>av: Language as a Perceptual Representation for Navigation + BowenPan + RameswarPandaMIT-IBM Watson AI Lab + SouYoungJinDartmouth College + RogerioFerisInternational Business Machines + AudeOlivaMassachusetts Institute of Technology, Massachusetts Institute of Technology and Massachusetts Institute of Technology + PhillipIsolaMassachusetts Institute of Technology + YoonKimMassachusetts Institute of Technology + 950-974 + We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent’s egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore several use cases of our language-based navigation (LangNav) approach on the R2R VLN benchmark: generating synthetic trajectories from a prompted language model (GPT-4) with which to finetune a smaller language model; domain transfer where we transfer a policy learned on one simulated environment (ALFRED) to another (more realistic) environment (R2R); and combining both vision- and language-based representations for VLN. Our approach is found to improve upon baselines that rely on visual features in settings where only a few expert trajectories (10-100) are available, demonstrating the potential of language as a perceptual representation for navigation. 
+ 2024.findings-naacl.60 + 2024.findings-naacl.60.copyright.pdf + pan-etal-2024-langnav + + + Planning and Editing What You Retrieve for Enhanced Tool Learning + TenghaoHuang + DongwonJung + VaibhavKumarSchool of Computer Science, Carnegie Mellon University + MohammadKachueeAmazon + XiangLi + PuyangXuAmazon + MuhaoChenUniversity of California, Davis and University of Southern California + 975-988 + Recent advancements in integrating external tools with Large Language Models (LLMs) have opened new frontiers, with applications in mathematical reasoning, code generators, and smart assistants. However, existing methods, relying on simple one-time retrieval strategies, fall short on effectively and accurately shortlisting relevant tools. This paper introduces a novel PLUTO (Planning, Learning, and Understanding for TOols) approach, encompassing “Plan-and-Retrieve (P&R)” and “Edit-and-Ground (E&G)” paradigms. The P&R paradigm consists of a neural retrieval module for shortlisting relevant tools and an LLM-based query planner that decomposes complex queries into actionable tasks, enhancing the effectiveness of tool utilization. The E&G paradigm utilizes LLMs to enrich tool descriptions based on user scenarios, bridging the gap between user queries and tool functionalities. Experiment results demonstrate that these paradigms significantly improve the recall and NDCG in tool retrieval tasks, significantly surpassing current state-of-the-art models. + 2024.findings-naacl.61 + 2024.findings-naacl.61.copyright.pdf + huang-etal-2024-planning + + + Chart-based Reasoning: Transferring Capabilities from <fixed-case>LLM</fixed-case>s to <fixed-case>VLM</fixed-case>s + VictorCarbuneGoogle + HassanMansoorGoogle + FangyuLiuGoogle DeepMind + RahulAralikatteMila, McGill University + GillesBaechlerResearch, Google + JindongChenGoogle + AbhanshuSharmaResearch, Google + 989-1004 + Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. 
However, reasoning capabilities remain limited particularly for smaller VLMs, while those of large-language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA, our method obtains state-of-the-art performance when applied on the PaLI3-5B VLM by Chen et al. (2023c), while also enabling much better performance on PlotQA and FigureQA. We first improve the chart representation by continuing the pre-training stage using an improved version of the chart-to-table translation task by Liu et al. (2023a). We then propose constructing a 20x larger dataset than the original training set. To improve general reasoning capabilities and improve numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned using the multitask loss introduced by Hsieh et al. (2023). Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant compared to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt (Chen et al., 2023a), our model outperforms the recently introduced Gemini Ultra and GPT-4V. + 2024.findings-naacl.62 + 2024.findings-naacl.62.copyright.pdf + carbune-etal-2024-chart + + + <fixed-case>SL</fixed-case>i<fixed-case>M</fixed-case>: Speculative Decoding with Hypothesis Reduction + Chi-HengLinSamsung Research America + ShikharTuli + JamesSmithGeorgia Institute of Technology + Yen-ChangHsuSamsung Research America + YilinShenSamsung Research America + HongxiaJinSamsung Research America AI center + 1005-1017 + Speculative decoding has emerged as a prominent alternative to autoregressive decoding for expediting inference in large language models (LLMs). However, prevailing assumptions often focus solely on latency reduction, neglecting the computational expenses. 
In this paper, we present Speculate Less, validate More (SLiM), a speculative decoding enhancement to reduce the speculation set while validating more effective tokens. SLiM is designed to mitigate LLMs’ computation costs associated with the token verification by introducing hypothesis reduction based on a fast posterior estimation. It consistently surpasses counterparts lacking cost reduction across a spectrum from CPU to GPU. Our evaluation with diverse conversational datasets shows that SLiM can achieve a substantial 70% reduction in FLOPs while generating more effective predictions on top of prior arts. + 2024.findings-naacl.63 + 2024.findings-naacl.63.copyright.pdf + lin-etal-2024-slim + + + <fixed-case>REMATCH</fixed-case>: Robust and Efficient Matching of Local Knowledge Graphs to Improve Structural and Semantic Similarity + ZoherKachwalaIndiana University + JisunAnIndiana University + HaewoonKwakIndiana University + FilippoMenczerIndiana University + 1018-1028 + Knowledge graphs play a pivotal role in various applications, such as question-answering and fact-checking. Abstract Meaning Representation (AMR) represents text as knowledge graphs. Evaluating the quality of these graphs involves matching them structurally to each other and semantically to the source text. Existing AMR metrics are inefficient and struggle to capture semantic similarity. We also lack a systematic evaluation benchmark for assessing structural similarity between AMR graphs. To overcome these limitations, we introduce a novel AMR similarity metric, rematch, alongside a new evaluation for structural similarity called RARE. Among state-of-the-art metrics, rematch ranks second in structural similarity; and first in semantic similarity by 1–5 percentage points on the STS-B and SICK-R benchmarks. Rematch is also five times faster than the next most efficient metric. 
+ 2024.findings-naacl.64 + 2024.findings-naacl.64.copyright.pdf + kachwala-etal-2024-rematch + + + Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing + BenHutchinsonGoogle + 1029-1043 + This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when language data is scarce. This repurposes the translations from their original uses and motivations, which often involve attracting new followers. This paper argues that NLP’s use of such texts raises considerations that go beyond model biases, including data provenance, cultural contexts, and their use in proselytism. We argue for more consideration of researcher positionality, and of the perspectives of marginalized linguistic and religious communities. + 2024.findings-naacl.65 + 2024.findings-naacl.65.copyright.pdf + hutchinson-2024-modeling + + + Testing the Effect of Code Documentation on Large Language Model Code Understanding + WilliamMackeMITRE + MichaelDoyleMITRE + 1044-1050 + Large Language Models (LLMs) have demonstrated impressive abilities in recent years with regards to code generation and understanding. However, little work has investigated how documentation and other code properties affect an LLM’s ability to understand and generate code or documentation. We present an empirical analysis of how underlying properties of code or documentation can affect an LLM’s capabilities. We show that providing an LLM with “incorrect” documentation can greatly hinder code understanding, while incomplete or missing documentation does not seem to significantly affect an LLM’s ability to understand code. 
+ 2024.findings-naacl.66 + 2024.findings-naacl.66.copyright.pdf + macke-doyle-2024-testing + + + Aligning Large Language Models with Recommendation Knowledge + YuweiCao + NikhilMehtaResearch, Google + XinyangYiGoogle + RaghunandanHulikal Keshavan + LukaszHeldt + LichanHongGoogle + EdChiGoogle + MaheswaranSathiamoorthy + 1051-1066 + Large language models (LLMs) have recently been used as backbones for recommender systems. However, their performance often lags behind conventional methods in standard tasks like retrieval. We attribute this to a mismatch between LLMs’ knowledge and the knowledge crucial for effective recommendations. While LLMs excel at natural language reasoning, they cannot model complex user-item interactions inherent in recommendation tasks. We propose bridging the knowledge gap and equipping LLMs with recommendation-specific knowledge to address this. Operations such as Masked Item Modeling (MIM) and Bayesian Personalized Ranking (BPR) have found success in conventional recommender systems. Inspired by this, we simulate these operations through natural language to generate auxiliary-task data samples that encode item correlations and user preferences. Fine-tuning LLMs on such auxiliary-task data samples and incorporating more informative recommendation-task data samples facilitates the injection of recommendation-specific knowledge into LLMs. Extensive experiments across retrieval, ranking, and rating prediction tasks on LLMs such as FLAN-T5-Base and FLAN-T5-XL show the effectiveness of our technique in domains such as Amazon Toys & Games, Beauty, and Sports & Outdoors. Notably, our method outperforms conventional and LLM-based baselines, including the current SOTA, by significant margins in retrieval, showcasing its potential for enhancing recommendation quality. 
+ 2024.findings-naacl.67 + 2024.findings-naacl.67.copyright.pdf + cao-etal-2024-aligning + + + <fixed-case>OFA</fixed-case>: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining + YihongLiuLudwig-Maximilians-Universität München + PeiqinLinInstitut für Informatik + MingyangWang + HinrichSchuetze + 1067-1097 + Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To address these issues, we propose a novel framework: \textbf{O}ne \textbf{F}or \textbf{A}ll (\textbf{OFA}), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show OFA accelerates the convergence of continued pretraining, which is environmentally friendly as much fewer carbon footprints are generated. Through extensive experiments, we demonstrate OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available. 
+ 2024.findings-naacl.68 + 2024.findings-naacl.68.copyright.pdf + liu-etal-2024-ofa + + + <fixed-case>SELF</fixed-case>-<fixed-case>EXPERTISE</fixed-case>: Knowledge-based Instruction Dataset Augmentation for a Legal Expert Language Model + MinjuKim + HaeinJung + Myoung-WanKooSogang University + 1098-1112 + The advent of instruction-tuned large language models (LLMs) has significantly advanced the field of automatic instruction dataset augmentation. However, the method of generating instructions and outputs from inherent knowledge of LLM can unintentionally produce hallucinations — instances of generating factually incorrect or misleading information. To overcome this, we propose SELF-EXPERTISE, automatically generating instruction dataset in the legal domain from a seed dataset. SELF-EXPERTISE extracts knowledge from the outputs of the seed dataset, and generates new instructions, inputs, and outputs. In this way, the proposed method reduces hallucination in automatic instruction augmentation. We trained an SELF-EXPERTISE augmented instruction dataset on the LLaMA-2 7B model to construct Korean legal specialized model, called LxPERT. LxPERT has demonstrated performance surpassing GPT-3.5-turbo in both in-domain and out-of-domain datasets. The SELF-EXPERTISE augmentation pipeline is not only applicable to the legal field but is also expected to be extendable to various domains, potentially advancing domain-specialized LLMs. 
+ 2024.findings-naacl.69 + 2024.findings-naacl.69.copyright.pdf + kim-etal-2024-self + + + Re-evaluating the Need for Visual Signals in Unsupervised Grammar Induction + BoyiLiNVIDIA Research and University of California, Berkeley + RodolfoCorona + KarttikeyaMangalam + CatherineChenUniversity of California Berkeley + DanielFlaherty + SergeBelongieUniversity of Copenhagen + KilianWeinbergerCornell University, Cornell University and Cornell University + JitendraMalikFacebook and University of California Berkeley + TrevorDarrellElectrical Engineering & Computer Science Department + DanKleinUniversity of California, Berkeley + 1113-1123 + Are multimodal inputs necessary for grammar induction? Recent work has shown that multimodal training inputs can improve grammar induction. However, these improvements are based on comparisons to weak text-only baselines that were trained on relatively little textual data. To determine whether multimodal inputs are needed in regimes with large amounts of textual training data, we design a stronger text-only baseline, which we refer to as LC-PCFG. LC-PCFG is a C-PCFG that incorporates embeddings from text-only large language models (LLMs). We use a fixed grammar family to directly compare LC-PCFG to various multimodal grammar induction methods. We compare performance on four benchmark datasets. LC-PCFG provides an up to 17% relative improvement in Corpus-F1 compared to state-of-the-art multimodal grammar induction methods. LC-PCFG is also more computationally efficient, providing an up to 85% reduction in parameter count and 8.8\times reduction in training time compared to multimodal approaches. These results suggest that multimodal inputs may not be necessary for grammar induction, and emphasize the importance of strong vision-free baselines for evaluating the benefit of multimodal approaches. 
+ 2024.findings-naacl.70 + 2024.findings-naacl.70.copyright.pdf + li-etal-2024-evaluating + + + <fixed-case>EDE</fixed-case>ntail: An Entailment-based Few-shot Text Classification with Extensional Definition + ZixiaoZhu + JunlangQianNanyang Technological University + ZijianFeng + HanzhangZhou + KezhiMaoNanyang Technological University + 1124-1137 + Few-shot text classification has seen significant advancements, particularly with entailment-based methods, which typically use either class labels or intensional definitions of class labels in hypotheses for label semantics expression. In this paper, we propose EDEntail, a method that employs extensional definition (EDef) of class labels in hypotheses, aiming to express the semantics of class labels more explicitly. To achieve the above goal, we develop an algorithm to gather and select extensional descriptive words of class labels and then order and format them into a sequence to form hypotheses. Our method has been evaluated and compared with state-of-the-art models on five classification datasets. The results demonstrate that our approach surpasses the supervised-learning methods and prompt-based methods under the few-shot setting, which underlines the potential of using an extensional definition of class labels for entailment-based few-shot text classification. Our code is available at https://github.com/MidiyaZhu/EDEntail. + 2024.findings-naacl.71 + 2024.findings-naacl.71.copyright.pdf + zhu-etal-2024-edentail + + + What Makes Math Word Problems Challenging for <fixed-case>LLM</fixed-case>s? + Kv AdityaSrivatsaMohamed bin Zayed University of Artificial Intelligence + EkaterinaKochmarMohamed bin Zayed University of Artificial Intelligence + 1138-1148 + This paper investigates the question of what makes math word problems (MWPs) in English challenging for large language models (LLMs). We conduct an in-depth analysis of the key linguistic and mathematical characteristics of MWPs. 
In addition, we train feature-based classifiers to better understand the impact of each feature on the overall difficulty of MWPs for prominent LLMs and investigate whether this helps predict how well LLMs fare against specific categories of MWPs. + 2024.findings-naacl.72 + 2024.findings-naacl.72.copyright.pdf + srivatsa-kochmar-2024-makes + + + <fixed-case>SMILE</fixed-case>: Multimodal Dataset for Understanding Laughter in Video with Language Models + LeeHyunPohang University of Science and Technology + KimSung-BinPohang University of Science and Technology + SeungjuHanAllen Institute for Artificial Intelligence and Seoul National University + YoungjaeYuYonsei University + Tae-HyunOhPOSTECH + 1149-1167 + Despite the recent advances in artificial intelligence, building social intelligence remains a challenge. Among social signals, laughter is one of the distinctive expressions that occurs during social interactions between humans. In this work, we tackle a new challenge for machines to understand the rationale behind laughter in video, Video Laugh Reasoning. We introduce this new task to explain why people laugh in a particular video and a dataset for this task. Our proposed dataset, SMILE, comprises video clips and language descriptions of why people laugh. We propose a baseline by leveraging the reasoning capacity of large language models (LLMs) with textual video representation. Experiments show that our baseline can generate plausible explanations for laughter. We further investigate the scalability of our baseline by probing other video understanding tasks and in-the-wild videos. We release our dataset, code, and model checkpoints on https://github.com/postech-ami/SMILE-Dataset. 
+ 2024.findings-naacl.73 + 2024.findings-naacl.73.copyright.pdf + hyun-etal-2024-smile + + + <fixed-case>T</fixed-case>3<fixed-case>M</fixed-case>: Text Guided 3<fixed-case>D</fixed-case> Human Motion Synthesis from Speech + WenshuoPeng + KaipengZhangShanghai AI Laboratory + Sai QianZhangHarvard University, Harvard University, University of Toronto and Facebook + 1168-1177 + Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed T3M. Unlike traditional approaches, T3M allows precise control over motion synthesis via textual input, enhancing the degree of diversity and user customization. The experiment results demonstrate that T3M can greatly outperform the state-of-the-art methods in both quantitative metrics and qualitative evaluations. We have publicly released our code at https://github.com/Gloria2tt/naacl2024.git + 2024.findings-naacl.74 + 2024.findings-naacl.74.copyright.pdf + peng-etal-2024-t3m + + + Deja vu: Contrastive Historical Modeling with Prefix-tuning for Temporal Knowledge Graph Reasoning + MiaoPengWuhan University + BenLiu + WenjieXuWuhan University + ZihaoJiangWuhan University + JiahuiZhu + MinPengWuhan University + 1178-1191 + Temporal Knowledge Graph Reasoning (TKGR) is the task of inferring missing facts for incomplete TKGs in complex scenarios (e.g., transductive and inductive settings), which has been gaining increasing attention. Recently, to mitigate dependence on structured connections in TKGs, text-based methods have been developed to utilize rich linguistic information from entity descriptions. 
However, suffering from the enormous parameters and inflexibility of pre-trained language models, existing text-based methods struggle to balance the textual knowledge and temporal information with computationally expensive purpose-built training strategies. To tap the potential of text-based models for TKGR in various complex scenarios, we propose ChapTER, a Contrastive historical modeling framework with prefix-tuning for TEmporal Reasoning. ChapTER feeds history-contextualized text into the pseudo-Siamese encoders to strike a textual-temporal balance via contrastive estimation between queries and candidates. By introducing virtual time prefix tokens, it applies a prefix-based tuning method to facilitate the frozen PLM capable for TKGR tasks under different settings. We evaluate ChapTER on four transductive and three few-shot inductive TKGR benchmarks, and experimental results demonstrate that ChapTER achieves superior performance compared to competitive baselines with only 0.17% tuned parameters. We conduct thorough analysis to verify the effectiveness, flexibility and efficiency of ChapTER. + 2024.findings-naacl.75 + 2024.findings-naacl.75.copyright.pdf + peng-etal-2024-deja + + + Explanation Extraction from Hierarchical Classification Frameworks for Long Legal Documents + NishchalPrasad + TaoufiqDkakiInstitut de Recherche en Informatique de Toulouse + MohandBoughanemUniversité de Toulouse + 1192-1201 + Hierarchical classification frameworks have been widely used to process long sequences, especially in the legal domain for predictions from long legal documents. But being black-box models they are unable to explain their predictions making them less reliable for practical applications, more so in the legal domain. In this work, we develop an extractive explanation algorithm for hierarchical frameworks for long sequences based on the sensitivity of the trained model to its input perturbations. 
We perturb using occlusion and develop Ob-HEx; an Occlusion-based Hierarchical Explanation-extractor. We adapt Ob-HEx to Hierarchical Transformer models trained on long Indian legal texts. And use Ob-HEx to analyze them and extract their explanations for the ILDC-Expert dataset, achieving a minimum gain of 1 point over the previous benchmark on most of our performance evaluation metrics. + 2024.findings-naacl.76 + 2024.findings-naacl.76.copyright.pdf + prasad-etal-2024-explanation + + + Low-Rank Adaptation for Multilingual Summarization: An Empirical Study + ChenxiWhitehouseUniversity of Cambridge + FantineHuotGoogle + JasmijnBastingsGoogle DeepMind + MostafaDehghaniGoogle DeepMind + Chu-ChengLinGoogle + MirellaLapataEdinburgh University, University of Edinburgh + 1202-1228 + Although the advancements of pre-trained Large Language Models have significantly accelerated recent progress in NLP, their ever-increasing size poses significant challenges for conventional fine-tuning, especially in memory-intensive tasks. We investigate the potential of Parameter-Efficient Fine-Tuning, focusing on Low-Rank Adaptation (LoRA), in the domain of multilingual summarization, a task that is both challenging (due to typically long inputs), and relatively unexplored. We conduct an extensive study across different data availability scenarios, including high- and low-data settings, and cross-lingual transfer, leveraging models of different sizes. Our findings reveal that LoRA is competitive with full fine-tuning when trained with high quantities of data, and excels in low-data scenarios and cross-lingual transfer. We also study different strategies for few-shot cross-lingual transfer, finding that continued LoRA tuning outperforms full fine-tuning and the dynamic composition of language-specific LoRA modules. 
+ 2024.findings-naacl.77 + 2024.findings-naacl.77.copyright.pdf + whitehouse-etal-2024-low + + + A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages + LeonardoRanaldiIdiap Research Institute + GiuliaPucci + FedericoRanaldiUniversity of Roma “Tor Vergata” + Elena SofiaRuzzettiUniversità degli Studi di Roma Tor Vergata + Fabio MassimoZanzottoUniversity of Rome Tor Vergata + 1229-1241 + Reasoning methods, best exemplified by the well-known Chain-of-Thought (CoT), empower the reasoning abilities of Large Language Models (LLMs) by eliciting them to solve complex tasks in a step-by-step manner. Although they are achieving significant success, the ability to deliver multi-step reasoning remains limited to English because of the imbalance in the distribution of pre-training data, which makes other languages a barrier. In this paper, we propose Cross-lingual Tree-of-Thoughts (Cross-ToT), a method for aligning Cross-lingual CoT reasoning across languages. The proposed method, through a self-consistent cross-lingual prompting mechanism inspired by the Tree-of-Thoughts approach, provides multi-step reasoning paths in different languages that, during the steps, lead to the final solution. Experimental evaluations show that our method significantly outperforms existing prompting methods by reducing the number of interactions and achieving state-of-the-art performance. + 2024.findings-naacl.78 + 2024.findings-naacl.78.copyright.pdf + ranaldi-etal-2024-tree + + + Emergent Abilities in Reduced-Scale Generative Language Models + SherinMuckatiraUniversity of Massachusetts at Lowell + VijetaDeshpande + VladislavLialinUniversity of Massachusetts, Lowell + AnnaRumshiskyUniversity of Massachusetts, Lowell, University of Massachusetts at Lowell and Amazon + 1242-1257 + Large language models can solve new tasks without task-specific fine-tuning. 
This ability, also known as in-context learning (ICL), is considered an emergent ability and is primarily seen in large language models with billions of parameters. This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify pre-training data and pre-train 36 causal language models with parameters varying from 1 million to 165 million parameters. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models with limited size. Additionally, we find that these smaller models pre-trained on simplified data demonstrate a power law relationship between the evaluation loss and the three scaling factors: compute, dataset size, and model size. + 2024.findings-naacl.79 + 2024.findings-naacl.79.copyright.pdf + muckatira-etal-2024-emergent + + + Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems + ClemenciaSiro + MohammadAliannejadiUniversity of Amsterdam + MaartenRijkeUniversity of Amsterdam + 1258-1273 + Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering the truncated context for relevance and usefulness labeling. 
We further propose to use large language models (LLMs) to summarize the dialogue context to provide a rich and short description of the dialogue context and study the impact of doing so on the annotator’s performance. Reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context leads to consistent ratings, akin to those obtained using the entire dialogue, with significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels. + 2024.findings-naacl.80 + 2024.findings-naacl.80.copyright.pdf + siro-etal-2024-context + + + Matching Varying-Length Texts via Topic-Informed and Decoupled Sentence Embeddings + XixiZhouZhejiang University + ChunbinGu + XinJie + JiajunBuZhejiang University + HaishuaiWangZhejiang University + 1274-1280 + Measuring semantic similarity between texts is a crucial task in natural language processing. While existing semantic text matching focuses on pairs of similar-length sequences, matching texts with non-comparable lengths has broader applications in specific domains, such as comparing professional document summaries and content. Current approaches struggle with text pairs of non-comparable lengths due to truncation issues. To address this, we split texts into natural sentences and decouple sentence representations using supervised contrastive learning (SCL). Meanwhile, we adopt the embedded topic model (ETM) for specific domain data. Our experiments demonstrate the effectiveness of our model, based on decoupled and topic-informed sentence embeddings, in matching texts of significantly different lengths across three well-studied datasets. 
+ 2024.findings-naacl.81 + 2024.findings-naacl.81.copyright.pdf + zhou-etal-2024-matching + + + Instruction Tuning with Human Curriculum + Bruce WLee + HyunsooChoEwha Women’s University + Kang MinYooNAVER + 1281-1309 + In this work, we (1) introduce Curriculum Instruction Tuning, (2) explore the potential advantages of employing diverse curriculum strategies, and (3) delineate a synthetic instruction-response generation framework that complements our theoretical approach. Distinct from the existing instruction tuning dataset, our generation pipeline is systematically structured to emulate the sequential and orderly characteristic of human learning. Additionally, we describe a methodology for generating instruction-response datasets that extensively span the various stages of human education, from middle school through the graduate level, utilizing educational subject catalogs. Before training, we meticulously organize the instruction data to ensure that questions escalate in difficulty regarding (A) the subject matter and (B) the intricacy of the instructions. The findings of our study reveal that substantial improvements in performance can be achieved through the mere application of curriculum ordering to instruction data—achieving gains of +4.76 on TruthfulQA, +2.98 on MMLU, +2.8 on OpenbookQA, and +1.28 on ARC-hard—compared to random shuffling. This enhancement is achieved without incurring additional computational expenses. Through comprehensive experimentation, we observe that the advantages of our proposed method are consistently evident across nine benchmarks. 
+ 2024.findings-naacl.82 + 2024.findings-naacl.82.copyright.pdf + lee-etal-2024-instruction + + + Natural Language-based State Representation in Deep Reinforcement Learning + Md MasudurRahmanPurdue University + YexiangXuePurdue University, Purdue University and Purdue University + 1310-1319 + This paper investigates the potential of using natural language descriptions as an alternative to direct image-based observations for learning policies in reinforcement learning. Due to the inherent challenges in managing image-based observations, which include abundant information and irrelevant features, we propose a method that compresses images into a natural language form for state representation. This approach allows better interpretability and leverages the processing capabilities of large-language models. We conducted several experiments involving tasks that required image-based observation. The results demonstrated that policies trained using natural language descriptions of images yield better generalization than those trained directly from images, emphasizing the potential of this approach in practical settings. + 2024.findings-naacl.83 + 2024.findings-naacl.83.copyright.pdf + rahman-xue-2024-natural + + + Learning Cross-Architecture Instruction Embeddings for Binary Code Analysis in Low-Resource Architectures + JunzheWang + QiangZengGeorge Mason University + LannanLuoGeorge Mason University + 1320-1332 + Binary code analysis is indispensable for a variety of software security tasks. Applying deep learning to binary code analysis has drawn great attention because of its notable performance. Today, source code is frequently compiled for various Instruction Set Architectures (ISAs). It is thus critical to expand binary analysis capabilities to multiple ISAs. Given a binary analysis task, the scale of available data on different ISAs varies. 
As a result, the rich datasets (e.g., malware) for certain ISAs, such as x86, lead to a disproportionate focus on these ISAs and a negligence of other ISAs, such as PowerPC, which suffer from the “data scarcity” problem. To address the problem, we propose to learn cross-architecture instruction embeddings (CAIE), where semantically-similar instructions, regardless of their ISAs, have close embeddings in a shared space. Consequently, we can transfer a model trained on a data-rich ISA to another ISA with less available data. We consider four ISAs (x86, ARM, MIPS, and PowerPC) and conduct both intrinsic and extrinsic evaluations (including malware detection and function similarity comparison). The results demonstrate the effectiveness of our approach to generate high-quality CAIE with good transferability. + 2024.findings-naacl.84 + 2024.findings-naacl.84.copyright.pdf + wang-etal-2024-learning-cross + + + <fixed-case>R</fixed-case>e<fixed-case>E</fixed-case>val: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks + XiaodongYuUniversity of Pennsylvania, University of Pennsylvania + HaoChengMicrosoft Research + XiaodongLiu + DanRothAmazon and University of Pennsylvania + JianfengGaoMicrosoft Research + 1333-1351 + Despite remarkable advancements in mitigating hallucinations in large language models (LLMs) by retrieval augmentation, it remains challenging to measure the reliability of LLMs using static question-answering (QA) data. Specifically, given the potential of data contamination (e.g., leading to memorization), good static benchmark performance does not ensure that model can reliably use the provided evidence for responding, which is essential to avoid hallucination when the required knowledge is new or private. Inspired by adversarial machine learning, we investigate the feasibility of automatically perturbing existing static one for dynamic evaluation. 
Specifically, this paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases for evaluating the LLMs’ reliability in using new evidence for answering. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets on a collection of LLMs under various prompting settings. Our generated data is human-readable and useful to trigger hallucination in LLM. Accurate models on static data are observed to produce unsupported answers from the perturbed evidence, with pronounced accuracy drops across LLMs including GPT-4. We find that our adversarial examples are transferable across all considered LLMs. The examples generated by a small model can be used to evaluate a much larger model, making our approach cost-effective. + 2024.findings-naacl.85 + 2024.findings-naacl.85.copyright.pdf + yu-etal-2024-reeval + + + An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution + Tien-HongLoNational Taiwan Normal University + Fu-AnChaoNational Taiwan Normal University + Tzu-iWu + Yao-TingSung + BerlinChenNational Taiwan Normal University + 1352-1362 + Automated speaking assessment (ASA) typically involves automatic speech recognition (ASR) and hand-crafted feature extraction from the ASR transcript of a learner’s speech. Recently, self-supervised learning (SSL) has shown stellar performance compared to traditional methods. However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss re-weighting, leveraging distinct SSL-based embedding features. 
Extensive experimental results on the ICNALE benchmark dataset suggest that our approach can outperform existing strong baselines by a sizable margin, achieving a significant improvement of more than 10% in CEFR prediction accuracy. + 2024.findings-naacl.86 + 2024.findings-naacl.86.copyright.pdf + lo-etal-2024-effective + + + <fixed-case>GPT</fixed-case>-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards <fixed-case>GPT</fixed-case>-4 and Beyond + ShenZhengDepartment of Computer Science + YuyuZhangByteDance + YijieZhu + ChenguangXi + PengyangGao + ZhouXun + KevinChangUniversity of Illinois, Urbana Champaign + 1363-1382 + With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI’s legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI’s earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical details like whether adding code data improves LLM’s reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, how much is the alignment tax, etc. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs. 
+ 2024.findings-naacl.87 + 2024.findings-naacl.87.copyright.pdf + zheng-etal-2024-gpt + + + Subword Attention and Post-Processing for Rare and Unknown Contextualized Embeddings + RajPatelGeorge Mason University + CarlottaDomeniconiGeorge Mason University and George Mason University + 1383-1389 + Word representations are an important aspect of Natural Language Processing (NLP). Representations are trained using large corpora, either as independent static embeddings or as part of a deep contextualized model. While word embeddings are useful, they struggle on rare and unknown words. As such, a large body of work has been done on estimating rare and unknown words. However, most of the methods focus on static embeddings, with few models focused on contextualized representations. In this work, we propose SPRUCE, a rare/unknown embedding architecture that focuses on contextualized representations. This architecture uses subword attention and embedding post-processing combined with the contextualized model to produce high quality embeddings. We then demonstrate these techniques lead to improved performance in most intrinsic and downstream tasks. + 2024.findings-naacl.88 + 2024.findings-naacl.88.copyright.pdf + patel-domeniconi-2024-subword + + + <fixed-case>UGIF</fixed-case>-<fixed-case>D</fixed-case>ata<fixed-case>S</fixed-case>et: A New Dataset for Cross-lingual, Cross-modal Sequential actions on the <fixed-case>UI</fixed-case> + SagarGubbi Venkatesh + ParthaTalukdarGoogle Research and Indian Institute of Science, Bangalore + SriniNarayananGoogle research + 1390-1399 + Help documents are supposed to aid smartphone users in resolving queries such as “How to block calls from unknown numbers?”. However, given a query, identifying the right help document, understanding instructions from the document, and using them to resolve the issue at hand is challenging. 
The user experience may be enhanced by converting the instructions in the help document to a step-by-step tutorial overlaid on the phone UI. Successful execution of this task requires overcoming research challenges in retrieval, parsing, and grounding in the multilingual-multimodal setting. For example, user queries in one language may have to be matched against instructions in another language, which in turn needs to be grounded in a multimodal UI in yet another language. Moreover, there isn’t any relevant dataset for such a task. In order to bridge this gap, we introduce UGIF-DataSet, a multi-lingual, multi-modal UI grounded dataset for step-by-step task completion on the smartphone, containing 4,184 tasks across 8 languages. The instruction steps in UGIF-DataSet are available only in English, so the challenge involves operations in the cross-modal, cross-lingual setting. We compare the performance of different large language models for this task and find that the end-to-end task completion rate drops from 48% in English to 32% for other languages, demonstrating significant overall headroom for improvement. We are hopeful that UGIF-DataSet and our analysis will aid further research on the important problem of sequential task completion in the multilingual and multimodal setting. + 2024.findings-naacl.89 + 2024.findings-naacl.89.copyright.pdf + gubbi-venkatesh-etal-2024-ugif + + + <fixed-case>S</fixed-case>im<fixed-case>SCOOD</fixed-case>: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models + HosseinHajipourCISPA Helmholtz Center for Information Security + NingYuSalesforce Research + Cristian-AlexandruStaicu + MarioFritzCISPA Helmholtz Center for Information Security and Saarland University + 1400-1416 + Large code datasets have become increasingly accessible for pre-training source code models. 
However, for the fine-tuning phase, obtaining representative training data that fully covers the code distribution for specific downstream tasks remains challenging due to the task-specific nature and limited labeling resources. These lead to out-of-distribution (OOD) generalization issues with unexpected model inference behaviors that have not been systematically studied yet. In this paper, we contribute the first systematic approach that simulates various OOD scenarios along different dimensions of source code data properties and study the fine-tuned model behaviors in such scenarios. We investigate the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods. Our comprehensive analysis, conducted on four state-of-the-art pretrained models and applied to two code generation tasks, exposes multiple failure modes attributed to OOD generalization issues. + 2024.findings-naacl.90 + 2024.findings-naacl.90.copyright.pdf + hajipour-etal-2024-simscood + + + Pruning as a Domain-specific <fixed-case>LLM</fixed-case> Extractor + NanZhangPennsylvania State University + YanchiLiuNEC-Labs + XujiangZhaoNEC Labs America + WeiChengNEC-Labs + RunxueBaoNEC Labs America + RuiZhangPennsylvania State University + PrasenjitMitraPennsylvania State University + HaifengChenNEC-Labs + 1417-1428 + Large Language Models (LLMs) have exhibited remarkable proficiency across a wide array of NLP tasks. However, the escalation in model size also engenders substantial deployment costs. While a few efforts have explored model pruning techniques to reduce the size of LLMs, they mainly center on general or task-specific weights. This leads to suboptimal performance due to lacking specificity on the target domain or generality on different tasks when applied to domain-specific challenges. This work introduces an innovative unstructured dual-pruning methodology, D-Pruner, for domain-specific compression on LLM. 
It extracts a compressed, domain-specific, and task-agnostic LLM by identifying LLM weights that are pivotal for general capabilities, like linguistic capability and multi-task solving, and domain-specific knowledge. More specifically, we first assess general weight importance by quantifying the error incurred upon their removal with the help of an open-domain calibration dataset. Then, we utilize this general weight importance to refine the training loss, so that it preserves generality when fitting into a specific domain. Moreover, by efficiently approximating weight importance with the refined training loss on a domain-specific calibration dataset, we obtain a pruned model emphasizing generality and specificity. Our comprehensive experiments across various tasks in healthcare and legal domains show the effectiveness of D-Pruner in domain-specific compression. Our code is available at https://github.com/psunlpgroup/D-Pruner. + 2024.findings-naacl.91 + 2024.findings-naacl.91.copyright.pdf + zhang-etal-2024-pruning + + + <fixed-case>LLMR</fixed-case>efine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback + WendaXu + DanielDeutschGoogle + MaraFinkelsteinGoogle + JurajJuraskaGoogle + BiaoZhangGoogle DeepMind + ZhongtaoLiuGoogle + William YangWangUC Santa Barbara + LeiLiSchool of Computer Science, Carnegie Mellon University + MarkusFreitagGoogle + 1429-1445 + Recent large language models (LLM) are leveraging human feedback to improve their generation quality. However, human feedback is costly to obtain, especially during inference. In this work, we propose LLMRefine, an inference time optimization method to refine LLM’s output. The core idea is to use a learned fine-grained feedback model to pinpoint defects and guide LLM to refine them iteratively. Using original LLM as a proposal of edits, LLMRefine searches for defect-less text via simulated annealing, trading off the exploration and exploitation. 
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization. LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization. + 2024.findings-naacl.92 + 2024.findings-naacl.92.copyright.pdf + xu-etal-2024-llmrefine + + + Noisy Multi-Label Text Classification via Instance-Label Pair Correction + PengyuXu + MingyangSongTencent + LinkaidaLiu + BingLiu + HongjianSun + LipingJingBeijing Jiaotong University + JianYuBeijing Jiaotong University + 1446-1458 + In noisy label learning, instance selection based on small-loss criteria has been proven to be highly effective. However, in the case of noisy multi-label text classification (NMLTC), the presence of noise is not limited to the instance-level but extends to the (instance-label) pair-level. This gives rise to two main challenges. (1) The loss information at the pair-level fails to capture the variations between instances. (2) There are two types of noise at the pair-level: false positives and false negatives. Identifying false negatives from a large pool of negative pairs presents an exceedingly difficult task. To tackle these issues, we propose a novel approach called instance-label pair correction (iLaCo), which aims to address the problem of noisy pair selection and correction in NMLTC tasks. Specifically, we first introduce a holistic selection metric that identifies noisy pairs by simultaneously considering global loss information and instance-specific ranking information. Secondly, we employ a filter guided by label correlation to focus exclusively on negative pairs with label relevance. 
This filter significantly reduces the difficulty of identifying false negatives. Experimental analysis indicates that our framework effectively corrects noisy pairs in NMLTC datasets, leading to a significant improvement in model performance. + 2024.findings-naacl.93 + 2024.findings-naacl.93.copyright.pdf + xu-etal-2024-noisy + + + Composite Backdoor Attacks Against Large Language Models + HaiHuangCISPA Helmholtz Center for Information Security + ZhengyuZhaoXi’an Jiaotong University + MichaelBackesCISPA Helmholtz Center for Information Security + YunShenNetApp + YangZhangCISPA Helmholtz Center for Information Security + 1459-1472 + Large language models (LLMs) have demonstrated superior performance compared to previous methods on various tasks, and often serve as the foundation models for many research efforts and services. However, untrustworthy third-party LLMs may covertly introduce vulnerabilities for downstream tasks. In this paper, we explore the vulnerability of LLMs through the lens of backdoor attacks. Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier than implanting the same multiple trigger keys in only a single component. CBA ensures that the backdoor is activated only when all trigger keys appear. Our experiments demonstrate that CBA is effective in both natural language processing (NLP) and multimodal tasks. For instance, with 3% poisoning samples against the LLaMA-7B model on the Emotion dataset, our attack achieves a 100% Attack Success Rate (ASR) with a False Triggered Rate (FTR) below 2.06% and negligible model accuracy degradation. Our work highlights the necessity of increased security research on the trustworthiness of foundation LLMs. 
+ 2024.findings-naacl.94 + 2024.findings-naacl.94.copyright.pdf + huang-etal-2024-composite + + + Adapting Fake News Detection to the Era of Large Language Models + JinyanSuCornell University + ClaireCardieCornell University + PreslavNakov + 1473-1490 + In the age of large language models (LLMs) and the widespread adoption of AI-driven content creation, the landscape of information dissemination has witnessed a paradigm shift. With the proliferation of both human-written and machine-generated real and fake news, robustly and effectively discerning the veracity of news articles has become an intricate challenge. While substantial research has been dedicated to fake news detection, it has either assumed that all news articles are human-written or has abruptly assumed that all machine-generated news was fake. Thus, a significant gap exists in understanding the interplay between machine-paraphrased real news, machine-generated fake news, human-written fake news, and human-written real news. In this paper, we study this gap by conducting a comprehensive evaluation of fake news detectors trained in various scenarios. Our primary objectives revolve around the following pivotal question: How can we adapt fake news detectors to the era of LLMs? Our experiments reveal an interesting pattern that detectors trained exclusively on human-written articles can indeed perform well at detecting machine-generated fake news, but not vice versa. Moreover, due to the bias of detectors against machine-generated texts (CITATION), they should be trained on datasets with a lower machine-generated news ratio than the test set. Building on our findings, we provide a practical strategy for the development of robust fake news detectors. 
+ 2024.findings-naacl.95 + 2024.findings-naacl.95.copyright.pdf + su-etal-2024-adapting + + + <fixed-case>MCAD</fixed-case>: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval + YouboLei + FeifeiHeResearcher at OPPO Research Institute + ChenChenOPPO Research Institute + YingbinMo + SijiaLiOPPO Research Institute + DefengXie + HaonanLuOPPO Guangdong Mobile Telecommunications Co., Ltd. + 1491-1503 + Due to the success of large-scale visual-language pretraining (VLP) models and the widespread use of image-text retrieval in industry areas, it is now critically necessary to reduce the model size and streamline their mobile-device deployment. Single- and dual-stream model structures are commonly used in image-text retrieval with the goal of closing the semantic gap between textual and visual modalities. While single-stream models use deep feature fusion to achieve more accurate cross-model alignment, dual-stream models are better at offline indexing and fast inference. We propose a Multi-teacher Cross-modality Alignment Distillation (MCAD) technique to integrate the advantages of single- and dual-stream models. By incorporating the fused single-stream features into the image and text features of the dual-stream model, we formulate new modified teacher similarity distributions and features. Then, we conduct both distribution and feature distillation to boost the capability of the student dual-stream model, achieving high retrieval performance without increasing inference complexity. Extensive experiments demonstrate the remarkable performance and high efficiency of MCAD on image-text retrieval tasks. Furthermore, we implement a lightweight CLIP model on Snapdragon/Dimensity chips with only ~100M running memory and ~8.0ms search latency, achieving the mobile-device application of VLP models. 
+ 2024.findings-naacl.96 + 2024.findings-naacl.96.copyright.pdf + lei-etal-2024-mcad + + + Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting + ZhenQinGoogle + RolfJagermanGoogle + KaiHuiGoogle + HongleiZhuangGoogle Research + JunruWuGoogle Research + LeYanGoogle + JiamingShenGoogle DeepMind + TianqiLiuGoogle + JialuLiuGoogle Research + DonaldMetzlerGoogle + XuanhuiWangGoogle + MichaelBenderskyGoogle + 1504-1518 + Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL 2019&2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably with the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, while outperforming other LLM-based solutions, such as InstructGPT which has 175B parameters, by over 10% for all ranking metrics. By using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines and outperforms the blackbox commercial ChatGPT solution by 4.2% and pointwise LLM-based solutions by more than 10% on average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity. 
+ 2024.findings-naacl.97 + 2024.findings-naacl.97.copyright.pdf + qin-etal-2024-large + + + <fixed-case>F</fixed-case>ed<fixed-case>LFC</fixed-case>: Towards Efficient Federated Multilingual Modeling with <fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case>-based Language Family Clustering + ZhihanGuo + YifeiZhangDepartment of Computer Science and Engineering, The Chinese University of Hong Kong + ZhuoZhangHarbin Institute of Technology + ZenglinXuFudan University + IrwinKingThe Chinese University of Hong Kong + 1519-1528 + Federated Multilingual Modeling (FMM) plays a crucial role in the applications of natural language processing due to the increasing diversity of languages and the growing demand for data privacy. However, FMM faces limitations stemming from (1) the substantial communication costs in networking and (2) the conflicts arising from parameter interference between different languages. To address these challenges, we introduce a communication-efficient federated learning framework with low-rank adaptation and language family clustering for Multilingual Modeling (MM). In this framework, we maintain the weights of the base model, exclusively updating the lightweight Low-rank adaptation (LoRA) parameters to minimize communication costs. Additionally, we mitigate parameter conflicts by grouping languages based on their language family affiliations, as opposed to aggregating all LoRA parameters. Experiments demonstrate that our proposed model not only surpasses the baseline models in performance but also reduces the communication overhead. Our code is available at https://github.com/zhihan-guo/FedLFC. 
+ 2024.findings-naacl.98 + 2024.findings-naacl.98.copyright.pdf + guo-etal-2024-fedlfc + + + <fixed-case>G</fixed-case>aussian Process Optimization for Adaptable Multi-Objective Text Generation using Linearly-Weighted Language Models + Mohammad MahdiAbdollah Pour + AliPesaranghaderLG Electronics + EldanCohenUniversity of Toronto + ScottSannerDepartment of Mechanical and Industrial Engineering, University of Toronto and Department of Computer Science + 1529-1536 + In multi-objective text generation, we aim to optimize over multiple weighted aspects (e.g., toxicity, semantic preservation, fluency) of the generated text. However, multi-objective weighting schemes may change dynamically in practice according to deployment requirements, evolving business needs, personalization requirements on edge devices, or the availability of new language models and/or objective requirements. Ideally, we need an efficient method to adapt to the dynamic requirements of the overall objective. To address these requirements, we propose a linear combination of objective-specific language models to efficiently adapt the decoding process and optimize for the desired objective without the significant computational overhead of retraining one or more language models. We show empirically that we can leverage Gaussian Process black box optimization to adapt the language model decoder weights to outperform other fixed weighting schemes and standard baselines of the task in only a few iterations of decoding. Overall this approach enables highly efficient adaptation of controllable language models via multi-objective weighting schemes that may evolve dynamically in practical deployment situations. 
+ 2024.findings-naacl.99 + 2024.findings-naacl.99.copyright.pdf + abdollah-pour-etal-2024-gaussian + + + Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study + AlessandroStolfoETHZ - ETH Zurich + 1537-1552 + We present an empirical study of groundedness in long-form question answering (LFQA) by retrieval-augmented large language models (LLMs). In particular, we evaluate whether every generated sentence is grounded in the retrieved documents or the model’s pre-training data. Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded, even when those sentences contain correct ground-truth answers. Additionally, we examine the impacts of factors such as model size, decoding strategy, and instruction tuning on groundedness. Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations. This study provides novel insights into the groundedness challenges in LFQA and underscores the necessity for more robust mechanisms in LLMs to mitigate the generation of ungrounded content. + 2024.findings-naacl.100 + 2024.findings-naacl.100.copyright.pdf + stolfo-2024-groundedness + + + <fixed-case>T</fixed-case>ag<fixed-case>D</fixed-case>ebias: Entity and Concept Tagging for Social Bias Mitigation in Pretrained Language Models + MehrnazMoslemi + AmalZouaqPolytechnique Montreal + 1553-1567 + Pre-trained language models (PLMs) play a crucial role in various applications, including sensitive domains such as the hiring process. However, extensive research has unveiled that these models tend to replicate social biases present in their pre-training data, raising ethical concerns. In this study, we propose the TagDebias method, which proposes debiasing a dataset using type tags. It then proceeds to fine-tune PLMs on this debiased dataset. 
Experiments show that our proposed TagDebias model, when applied to a ranking task, exhibits significant improvements in bias scores. + 2024.findings-naacl.101 + 2024.findings-naacl.101.copyright.pdf + moslemi-zouaq-2024-tagdebias + + + Improving Absent Keyphrase Generation with Diversity Heads + EdwinThomas + SowmyaVajjalaNational Research Council Canada + 1568-1584 + Keyphrase Generation (KPG) is the task of automatically generating appropriate keyphrases for a given text, with a wide range of real-world applications such as document indexing and tagging, information retrieval, and text summarization. NLP research makes a distinction between present and absent keyphrases based on whether a keyphrase is directly present as a sequence of words in the document during evaluation. However, present and absent keyphrases are treated together in a text-to-text generation framework during training. We treat present keyphrase extraction as a sequence labeling problem and propose a new absent keyphrase generation model that uses a modified cross-attention layer with additional heads to capture diverse views for the same context encoding in this paper. Our experiments show improvements over the state-of-the-art for four datasets for present keyphrase extraction and five datasets for absent keyphrase generation among the six English datasets we explored, covering long and short documents. + 2024.findings-naacl.102 + 2024.findings-naacl.102.copyright.pdf + thomas-vajjala-2024-improving + + + m<fixed-case>O</fixed-case>thello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models? + TianzeHuaBrown University + TianYunBrown University + ElliePavlickBrown University and Brown University + 1585-1598 + Many pretrained multilingual models exhibit cross-lingual transfer ability, which is often attributed to a learned language-neutral representation during pretraining. 
However, it remains unclear what factors contribute to the learning of a language-neutral representation, and whether the learned language-neutral representation suffices to facilitate cross-lingual transfer. We propose a synthetic task, Multilingual Othello (mOthello), as a testbed to delve into these two questions. We find that: (1) models trained with naive multilingual pretraining fail to learn a language-neutral representation across all input languages; (2) the introduction of “anchor tokens” (i.e., lexical items that are identical across languages) helps cross-lingual representation alignment; and (3) the learning of a language-neutral representation alone is not sufficient to facilitate cross-lingual transfer. Based on our findings, we propose a novel approach – multilingual pretraining with unified output space – that both induces the learning of language-neutral representation and facilitates cross-lingual transfer. + 2024.findings-naacl.103 + 2024.findings-naacl.103.copyright.pdf + hua-etal-2024-mothello + + + Discovering and Mitigating Indirect Bias in Attention-Based Model Explanations + FarsheedHaqueUniversity of North Carolina at Charlotte + DepengXuUniversity of North Carolina at Charlotte + ShuhanYuanUtah State University + 1599-1614 + As the field of Natural Language Processing (NLP) increasingly adopts transformer-based models, the issue of bias becomes more pronounced. Such bias, manifesting through stereotypes and discriminatory practices, can disadvantage certain groups. Our study focuses on direct and indirect bias in the model explanations, where the model makes predictions relying heavily on identity tokens or associated contexts. We present a novel analysis of bias in model explanation, especially the subtle indirect bias, underlining the limitations of traditional fairness metrics. We first define direct and indirect bias in model explanations, which is complementary to fairness in predictions. 
We then develop an indirect bias discovery algorithm for quantitatively evaluating indirect bias in transformer models using their in-built self-attention matrix. We also propose an indirect bias mitigation algorithm to ensure fairness in transformer models by leveraging attention explanations. Our evaluation shows the significance of indirect bias and the effectiveness of our indirect bias discovery and mitigation. + 2024.findings-naacl.104 + 2024.findings-naacl.104.copyright.pdf + haque-etal-2024-discovering + + + i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data + ZiyiYangMicrosoft + MahmoudKhademiMicrosoft and University of British Columbia + YichongXuMicrosoft + ReidPryzantResearch, Microsoft + YuweiFangSnap Inc. + ChenguangZhuZoom + DongdongChenMicrosoft Research + YaoQian + XuemeiGaoMicrosoft + Yi-LingChenMicrosoft + RobertGmyrPaderborn University and Microsoft + NaoyukiKandaMicrosoft + NoelCodellaMicrosoft + BinXiaoMicrosoft + YuShiMicrosoft + LuYuanMicrosoft + TakuyaYoshiokaMicrosoft + MichaelZengMicrosoft + XuedongHuang + 1615-1627 + The convergence of text, visual, and audio data is crucial towards human-like artificial intelligence, however the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, one of the first models capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to project combinations of modalities into a shared representational space. Language tokens are generated from these representations via an autoregressive decoder. i-Code V2 is pretrained end-to-end on a large collection of dual- and single-modality datasets with a novel text completion objective that can be generalized across arbitrary combinations of modalities. 
i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals. + 2024.findings-naacl.105 + 2024.findings-naacl.105.copyright.pdf + yang-etal-2024-code + + + Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation + YifuQiu + VarunEmbarUniversity of California, Santa Cruz + ShayCohenUniversity of Edinburgh + BenjaminHanApple + 1628-1644 + Knowledge-to-text generators often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the input, or describe facts not present in the input. To reduce hallucinations, we propose a decoding-only method, TWEAK (Think While Effectively Articulating Knowledge), which can be integrated with any generator without retraining. TWEAK treats the generated sequences at each decoding step and its future sequences as hypotheses, and ranks each generation candidate based on the extent to which their hypotheses are supported by the input facts using a Hypothesis Verification Model (HVM). We first demonstrate the effectiveness of TWEAK by using a Natural Language Inference (NLI) model as the HVM and report improved faithfulness with a minimal impact on the quality. We then replace the NLI model with a task-specific HVM trained with a first-of-a-kind dataset, FATE (Fact-Aligned Textual Entailment), which pairs input facts with their original and perturbed descriptions. We test TWEAK with two generators, and the best TWEAK variants improve on average for the two models by 2.24/7.17 points in faithfulness (FactKB) in in/out-of-distribution evaluations, respectively, and with only a 0.14/0.32-point decline in quality (BERTScore). + 2024.findings-naacl.106 + 2024.findings-naacl.106.copyright.pdf + qiu-etal-2024-think + + + It’s All Relative! 
– A Synthetic Query Generation Approach for Improving Zero-Shot Relevance Prediction + AditiChaudharyGoogle + KarthikRamanGoogle + MichaelBenderskyGoogle + 1645-1664 + Large language models (LLMs) have shown promising ability to generate synthetic query-document pairs by prompting with as few as 8 demonstrations. This has enabled building better IR models, especially for tasks with no training data. Typically, such synthetic query generation (QGen) approaches condition on an input context (e.g. a text document) and generate a query relevant to that context, or condition the QGen additionally on the relevance label (e.g. relevant vs irrelevant) to generate queries across relevance buckets. However, we find that such QGen approaches are sub-optimal as they require the model to reason about the desired label and the input from a handful of examples. In this work, we propose to reduce this burden of LLMs by generating queries simultaneously for different labels. We hypothesize that instead of asking the model to generate, say, an irrelevant query given an input context, asking the model to generate an irrelevant query relative to a relevant query is a much simpler task. Extensive experimentation across nine IR datasets shows that synthetic queries generated in such a fashion translate to better downstream performance. + 2024.findings-naacl.107 + 2024.findings-naacl.107.copyright.pdf + chaudhary-etal-2024-relative + + + <fixed-case>RS</fixed-case>-<fixed-case>DPO</fixed-case>: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models + SaeedKhakiAmazon + JinJinLiAmazon + LanMaAmazon + LiuYang + PrathapRamachandraAmazon + 1665-1680 + Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. 
However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requiring significant hyperparameter finetuning, and computationally expensive to maximize the estimated reward during alignment. Recently, direct preference optimization (DPO) was proposed to address those challenges. However, DPO often relies on contrastive responses generated from human annotators and alternative LLMs, instead of the policy model, limiting the effectiveness of the RLHF. In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO. Our proposed method, RS-DPO, initiates with the development of a supervised fine-tuned policy model (SFT). A varied set of k responses per prompt is sampled directly from the SFT model. RS-DPO identifies pairs of contrastive samples based on their reward distribution. Finally, we apply DPO with the contrastive samples to align the model to human preference. Our experiments indicate that our proposed method effectively fine-tunes LLMs with limited resource environments, leading to improved alignment with user intent. Furthermore, it outperforms existing methods, including RS, PPO, and DPO. + 2024.findings-naacl.108 + 2024.findings-naacl.108.copyright.pdf + khaki-etal-2024-rs + + + Hypernetwork-Assisted Parameter-Efficient Fine-Tuning with Meta-Knowledge Distillation for Domain Knowledge Disentanglement + ChangqunLi + LinlinWang + XinLin + ShizhouHuang + LiangHe + 1681-1695 + Domain adaptation from labeled source domains to the target domain is important in practical summarization scenarios. However, the key challenge is domain knowledge disentanglement. In this work, we explore how to disentangle domain-invariant knowledge from source domains while learning specific knowledge of the target domain. Specifically, we propose a hypernetwork-assisted encoder-decoder architecture with parameter-efficient fine-tuning. 
It leverages a hypernetwork instruction learning module to generate domain-specific parameters from the encoded inputs accompanied by task-related instruction. Further, to better disentangle and transfer knowledge from source domains to the target domain, we introduce a meta-knowledge distillation strategy to build a meta-teacher model that captures domain-invariant knowledge across multiple domains and use it to transfer knowledge to students. Experiments on three dialogue summarization datasets show the effectiveness of the proposed model. Human evaluations also show the superiority of our model with regard to the summary generation quality. + 2024.findings-naacl.109 + 2024.findings-naacl.109.copyright.pdf + li-etal-2024-hypernetwork + + + <fixed-case>MIC</fixed-case>o: Preventative Detoxification of Large Language Models through Inhibition Control + RoySiegelmannWhiting School of Engineering + NinarehMehrabiAmazon + PalashGoyalAmazon + PrasoonGoyalAmazon + LisaBauerAmazon + JwalaDhamalaAmazon Alexa AI + AramGalstyanInformation Sciences Institute, University of Southern California and Amazon Alexa + RahulGupta + RezaGhanadanUniversity of Maryland, College Park + 1696-1703 + Large Language Models (LLMs) are powerful tools which have been both dominant and commonplace in the field of Artificial Intelligence. Yet, LLMs have a tendency to devolve into toxic degeneration, wherein otherwise safe and unproblematic models begin generating toxic content. For the sake of social responsibility and inspired by the biological mechanisms of inhibition control, we introduce the paradigm of Education for Societal Norms (ESN). By collecting and labeling examples as acceptable and unacceptable (in this case toxic and non-toxic), and including a corresponding acceptable rewrite with every unacceptable example, we introduce a new mechanism for LLM detoxification. 
We annotate a dataset of 2,850 entries and use it to fine-tune a model, which we call a Model with Inhibition Control (MICo). Evaluating this model on toxicity detection capability, rewrite detoxification, meaning preservation, and overall toxicity reduction, we discover significant improvements over the baseline model. In our experiments we show that overall toxicity of this model is more than 60% reduced, with over 75% reduction in severe toxicity. + 2024.findings-naacl.110 + 2024.findings-naacl.110.copyright.pdf + siegelmann-etal-2024-mico + + + Reinforcement Learning with Token-level Feedback for Controllable Text Generation + WendiLi + WeiWeiHuazhong University of Science and Technology + KaiheXu + WenfengXie + DangyangChenPingan Technology + YuChengThe Chinese University of Hong Kong + 1704-1719 + To meet the requirements of real-world applications, it is essential to control generations of large language models (LLMs). Prior research has tried to introduce reinforcement learning (RL) into controllable text generation while most existing methods suffer from overfitting issues (finetuning-based methods) or semantic collapse (post-processing methods). However, current RL methods are generally guided by coarse-grained (sentence/paragraph-level) feedback, which may lead to suboptimal performance owing to semantic twists or progressions within sentences. To tackle that, we propose a novel reinforcement learning algorithm named TOLE which formulates TOken-LEvel rewards for controllable text generation, and employs a “first-quantize-then-noise” paradigm to enhance the robustness of the RL algorithm. Furthermore, TOLE can be flexibly extended to multiple constraints with little computational expense. Experimental results show that our algorithm can achieve superior performance on both single-attribute and multi-attribute control tasks. We have released our codes at https://github.com/WindyLee0822/CTG. 
+ 2024.findings-naacl.111 + 2024.findings-naacl.111.copyright.pdf + li-etal-2024-reinforcement + + + <fixed-case>C</fixed-case>o<fixed-case>MM</fixed-case>: Collaborative Multi-Agent, Multi-Reasoning-Path Prompting for Complex Problem Solving + PeiChenTexas A&M University - College Station + ShuaiZhangAmazon + BoranHan + 1720-1738 + Large Language Models (LLMs) have shown great ability in solving traditional natural language tasks and elementary reasoning tasks with appropriate prompting techniques. However, their ability is still limited in solving complicated science problems. In this work, we aim to push the upper bound of the reasoning capability of LLMs by proposing a collaborative multi-agent, multi-reasoning-path (CoMM) prompting framework. Specifically, we prompt LLMs to play different roles in a problem-solving team, and encourage different role-play agents to collaboratively solve the target task. In particular, we discover that applying different reasoning paths for different roles is an effective strategy to implement few-shot prompting approaches in the multi-agent scenarios. Empirical results demonstrate the effectiveness of the proposed methods on two college-level science problems over competitive baselines. Our further analysis shows the necessity of prompting LLMs to play different roles or experts independently. 
+ 2024.findings-naacl.112 + 2024.findings-naacl.112.copyright.pdf + chen-etal-2024-comm + + + Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies + AnaeliaOvalleUniversity of California, Los Angeles + NinarehMehrabiAmazon + PalashGoyalAmazon + JwalaDhamalaAmazon Alexa AI + Kai-WeiChangUniversity of California, Los Angeles + RichardZemelDepartment of Computer Science, Columbia University and Department of Computer Science, University of Toronto + AramGalstyanInformation Sciences Institute, University of Southern California and Amazon Alexa + YuvalPinterBen-Gurion University of the Negev + RahulGupta + 1739-1756 + Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLM), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We discover LLM misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization, the tokenizer powering many popular LLMs. Unlike binary pronouns, BPE overfragments neopronouns, a direct consequence of data scarcity during tokenizer training. This disparate tokenization mirrors tokenizer limitations observed in multilingual and low-resource NLP, unlocking new misgendering mitigation strategies. We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency. Our proposed methods outperform finetuning with standard BPE, improving neopronoun accuracy from 14.1% to 58.4%. Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender. 
+ 2024.findings-naacl.113 + 2024.findings-naacl.113.copyright.pdf + ovalle-etal-2024-tokenization + + + <fixed-case>A</fixed-case>da<fixed-case>PT</fixed-case>: A Set of Guidelines for Hyperbolic Multimodal Multilingual <fixed-case>NLP</fixed-case> + RamitSawhneyGeorgia Institute of Technology + ShreyPandit + VishwaShah + MeghThakkar + ShafiqJotySalesForce.com and Nanyang Technological University + 1757-1771 + The Euclidean space is the familiar space for training neural models and performing arithmetic operations. However, many data types inherently possess complex geometries, and model training methods involve operating over their latent representations, which cannot be effectively captured in the Euclidean space. The hyperbolic space provides a more generalized representative geometry to model the hierarchical complexities of the tree-like structure of natural language. We propose AdaPT, a set of guidelines for initialization, parametrization, and training of neural networks, which adapts to the dataset and can be used with different manifolds. AdaPT can be generalized over any existing neural network training methodology and leads to more stable training without a substantial increase in training time. We apply AdaPT guidelines over two state-of-the-art deep learning approaches and empirically demonstrate its effectiveness through experiments on three tasks over 12 languages across speech and text. Through extensive qualitative analysis, we put forward the applicability of AdaPT as a set of guidelines optimally utilizing the manifold geometry, which can be extended to various downstream tasks across languages and modalities. + 2024.findings-naacl.114 + 2024.findings-naacl.114.copyright.pdf + sawhney-etal-2024-adapt + + + More Samples or More Prompts? 
Exploring Effective Few-Shot In-Context Learning for <fixed-case>LLM</fixed-case>s with In-Context Sampling + BingshengYaoNortheastern University + GuimingChen + RuishiZou + YuxuanLu + JiachenLi + ShaoZhangShanghai Jiao Tong University + YisiSangApple + SijiaLiuMichigan State University + JamesHendlerRensselaer Polytechnic Institute + DakuoWangNortheastern University + 1772-1790 + While most existing works on LLM prompting techniques focus only on how to select a better set of data samples inside one single prompt input (In-Context Learning or ICL), why can not we design and leverage multiple prompts together to further improve the LLM’s performance? In this work, we propose In-Context Sampling (ICS), a low-resource LLM prompting technique to produce confident predictions by optimizing the construction of multiple ICL prompt inputs. Extensive experiments with three open-source LLMs (FlanT5-XL, Mistral-7B, and Mixtral-8x7B) on four NLI datasets (e-SNLI, Multi-NLI, ANLI, and Contract-NLI) and one QA dataset (CommonsenseQA) illustrate that ICS can consistently enhance LLMs’ performance. An in-depth evaluation with three data similarity-based ICS strategies suggests that these strategies can further elevate LLM’s performance, which sheds light on a new yet promising future research direction. + 2024.findings-naacl.115 + 2024.findings-naacl.115.copyright.pdf + yao-etal-2024-samples + + + <fixed-case>ZSEE</fixed-case>: A Dataset based on Zeolite Synthesis Event Extraction for Automated Synthesis Platform + SongHe + XinPengEast China University of Science and Technology + YihanCaiShanghai Research Institute of Petrochemical Technology, Sinopec Corporation + XinLi + ZhiqingYuan + WenLiDuEast China University of Science and Technology + WeiminYangEast China University of Science and Technology + 1791-1808 + Automated synthesis of zeolite, one of the most important catalysts in chemical industries, holds great significance for attaining economic and environmental benefits. 
Structural synthesis data extracted through NLP technologies from zeolite experimental procedures can significantly expedite automated synthesis owing to its machine readability. However, the utilization of NLP technologies in information extraction of zeolite synthesis remains restricted due to the lack of annotated datasets. In this paper, we formulate an event extraction task to mine structural synthesis actions from experimental narratives for modular automated synthesis. Furthermore, we introduce ZSEE, a novel dataset containing fine-grained event annotations of zeolite synthesis actions. Our dataset features 16 event types and 13 argument roles which cover all the experimental operational steps of zeolite synthesis. We explore current state-of-the-art event extraction methods on ZSEE, perform error analysis based on the experimental results, and summarize the challenges and corresponding research directions to further facilitate the automated synthesis of zeolites. The code is publicly available at https://github.com/Hi-0317/ZSEE. + 2024.findings-naacl.116 + 2024.findings-naacl.116.copyright.pdf + he-etal-2024-zsee + + + Mitigating Hallucination in Abstractive Summarization with Domain-Conditional Mutual Information + KyubyungChae + JaepillChoi + YohanJoSeoul National University + TaesupKimSeoul National University + 1809-1820 + A primary challenge in abstractive summarization is hallucination—the phenomenon where a model generates plausible text that is absent in the source text. We hypothesize that the domain (or topic) of the source text triggers the model to generate text that is highly probable in the domain, neglecting the details of the source text. To alleviate this model bias, we introduce a decoding strategy based on domain-conditional pointwise mutual information. This strategy adjusts the generation probability of each token by comparing it with the token’s marginal probability within the domain of the source text. 
According to evaluation on the XSUM dataset, our method demonstrates improvement in terms of faithfulness and source relevance. + 2024.findings-naacl.117 + 2024.findings-naacl.117.copyright.pdf + chae-etal-2024-mitigating + + + Adversarial <fixed-case>DPO</fixed-case>: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents + SanKimPOSTECH + GaryLee + 1821-1835 + Recent advancements in open-domain dialogue systems have been propelled by the emergence of high-quality large language models (LLMs) and various effective training methodologies. Nevertheless, the presence of toxicity within these models presents a significant challenge that can potentially diminish the user experience. In this study, we introduce an innovative training algorithm, an improvement upon direct preference optimization (DPO), called adversarial DPO (ADPO). The ADPO algorithm is designed to train models to assign higher probability distributions to preferred responses and lower distributions to unsafe responses, which are self-generated using the toxic control token. We demonstrate that ADPO enhances the model’s resilience against harmful conversations while minimizing performance degradation. Furthermore, we illustrate that ADPO offers a more stable training procedure compared to the traditional DPO. To the best of our knowledge, this is the first adaptation of the DPO algorithm that directly incorporates harmful data into the generative model, thereby reducing the need to artificially create safe dialogue data. + 2024.findings-naacl.118 + 2024.findings-naacl.118.copyright.pdf + kim-lee-2024-adversarial + + + Prompt Space Optimizing Few-shot Reasoning Success with Large Language Models + FoboShi + PeijunQingDartmouth College + DongYang + NanWang + YouboLei + HaonanLuOPPO Guangdong Mobile Telecommunications Co., Ltd. 
+ XiaodongLinRutgers University + DuantengchuanLi + 1836-1862 + Prompt engineering is an essential technique for enhancing the abilities of large language models (LLMs) by providing explicit and specific instructions. It enables LLMs to excel in various tasks, such as arithmetic reasoning, question answering, summarization, relation extraction, machine translation, and sentiment analysis. Researchers have been actively exploring different prompt engineering strategies, such as Chain of Thought (CoT), Zero-CoT, and In-context learning. However, an unresolved problem arises from the fact that current approaches lack a solid mathematical solution for determining optimal prompts. To address this issue in prompt engineering, we propose a new and effective approach called Prompt Space. Our methodology utilizes text embeddings to obtain basis vectors by matrix decomposition, and then constructs a space for representing all prompts. Prompt Space significantly outperforms state-of-the-art prompt paradigms on ten public reasoning benchmarks. Notably, without the help of the CoT method and the prompt “Let’s think step by step”, Prompt Space shows superior performance over the few-shot method. Overall, our approach provides a robust and effective mathematical framework for selecting simple and effective prompts. This advancement marks a significant step towards improving prompt engineering for a wide variety of applications in LLMs. 
Our code is publicly available at https://github.com/YouBLEI/Prompt-Space + 2024.findings-naacl.119 + 2024.findings-naacl.119.copyright.pdf + shi-etal-2024-prompt + + + <fixed-case>DAGCN</fixed-case>: Distance-based and Aspect-oriented Graph Convolutional Network for Aspect-based Sentiment Analysis + ZhihaoWang + BoZhangShanghai Normal University + RuYangShanghai Normal University + ChangGuoShanghai Normal University + MaozhenLiBrunel University Uxbridge + 1863-1876 + Aspect-based sentiment analysis (ABSA) is a task that aims to determine the sentiment polarity of aspects by identifying opinion words. Recent advancements have predominantly been rooted either in semantic or syntactic methods. However, both are prone to interference from local factors such as irrelevant words and edges, hindering the precise identification of opinion words. In this paper, we present Distance-based and Aspect-oriented Graph Convolutional Network (DAGCN) to address the aforementioned issue. Firstly, we introduce the Distance-based Syntactic Weight (DSW). It focuses on the local scope of aspects in the pruned dependency trees, thereby reducing the candidate pool of opinion words. Additionally, we propose Aspect-Fusion Attention (AF) to further filter opinion words within the local context and consider cases where opinion words are distant from the aspect. With the combination of DSW and AF, we achieve precise identification of corresponding opinion words. Extensive experiments on three public datasets demonstrate that the proposed model outperforms state-of-the-art models and verify the effectiveness of the proposed architecture. 
+ 2024.findings-naacl.120 + 2024.findings-naacl.120.copyright.pdf + wang-etal-2024-dagcn + + + Connecting the Dots: Inferring Patent Phrase Similarity with Retrieved Phrase Graphs + ZhuoyiPeng + YiYangHong Kong University of Science and Technology + 1877-1890 + We study the patent phrase similarity inference task, which measures the semantic similarity between two patent phrases. As patent documents employ legal and highly technical language, existing semantic textual similarity methods that use localized contextual information do not perform satisfactorily in inferring patent phrase similarity. To address this, we introduce a graph-augmented approach to amplify the global contextual information of the patent phrases. For each patent phrase, we construct a phrase graph that links to its focal patents and a list of patents that are either cited by or cite these focal patents. The augmented phrase embedding is then derived from combining its localized contextual embedding with its global embedding within the phrase graph. We further propose a self-supervised learning objective that capitalizes on the retrieved topology to refine both the contextualized embedding and the graph parameters in an end-to-end manner. Experimental results from a unique patent phrase similarity dataset demonstrate that our approach significantly enhances the representation of patent phrases, resulting in marked improvements in similarity inference in a self-supervised fashion. Substantial improvements are also observed in the supervised setting, underscoring the potential benefits of leveraging retrieved phrase graph augmentation. + 2024.findings-naacl.121 + 2024.findings-naacl.121.copyright.pdf + peng-yang-2024-connecting + + + Self-Regulated Sample Diversity in Large Language Models + MingyueLiu + JonathanFrawleyDurham University + SarahWyer + Hubert P. 
H.ShumDurham University + SaraUckelmanDurham University + SueBlackDurham University + ChrisWillcocksDurham University and Durham University + 1891-1899 + Sample diversity depends on the task; within mathematics, precision and determinism are paramount, while storytelling thrives on creativity and surprise. This paper presents a simple self-regulating approach where we adjust sample diversity inference parameters dynamically based on the input prompt—in contrast to existing methods that require expensive and inflexible setups, or maintain static values during inference. Capturing a broad spectrum of sample diversities can be formulated as a straightforward self-supervised inference task, which we find significantly improves the quality of responses generically without model retraining or fine-tuning. In particular, our method demonstrates significant improvement in all supercategories of the MMLU multitask benchmark (GPT-3.5: +4.4%, GPT-4: +1.5%), which captures a large variety of difficult tasks covering STEM, the humanities and social sciences. + 2024.findings-naacl.122 + 2024.findings-naacl.122.copyright.pdf + liu-etal-2024-self-regulated + + + Methods, Applications, and Directions of Learning-to-Rank in <fixed-case>NLP</fixed-case> Research + JustinLee + GabrielBernier-ColborneNational Research Council Canada + TeganMaharajToronto University and Ecole Polytechnique de Montreal + SowmyaVajjalaNational Research Council Canada + 1900-1917 + Learning-to-rank (LTR) algorithms aim to order a set of items according to some criteria. They are at the core of applications such as web search and social media recommendations, and are an area of rapidly increasing interest, with the rise of large language models (LLMs) and the widespread impact of these technologies on society. 
In this paper, we survey the diverse use cases of LTR methods in natural language processing (NLP) research, looking at previously under-studied aspects such as multilingualism in LTR applications and statistical significance testing for LTR problems. We also consider how large language models are changing the LTR landscape. This survey is aimed at NLP researchers and practitioners interested in understanding the formalisms and best practices regarding the application of LTR approaches in their research. + 2024.findings-naacl.123 + 2024.findings-naacl.123.copyright.pdf + lee-etal-2024-methods + + + When Quantization Affects Confidence of Large Language Models? + IrinaProskurina + LucBrunLaboratoire ERIC, Université Lumiére (Lyon II) + GuillaumeMetzlerUniversité Lumiére (Lyon II) + JulienVelcinERIC + 1918-1928 + Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place. We make our code and quantized models publicly available. 
+ 2024.findings-naacl.124 + 2024.findings-naacl.124.copyright.pdf + proskurina-etal-2024-quantization + + + <fixed-case>M</fixed-case>ed<fixed-case>C</fixed-case>ycle: Unpaired Medical Report Generation via Cycle-Consistency + EladHirschTechnion, Technion + GefenDawidowicz + AyelletTalTechnion and Technion + 1929-1944 + Generating medical reports for X-ray images presents a significant challenge, particularly in unpaired scenarios where access to paired image-report data for training is unavailable. Previous works have typically learned a joint embedding space for images and reports, necessitating a specific labeling schema for both. We introduce an innovative approach that eliminates the need for consistent labeling schemas, thereby enhancing data accessibility and enabling the use of incompatible datasets. This approach is based on cycle-consistent mapping functions that transform image embeddings into report embeddings, coupled with report auto encoding for medical report generation. Our model and objectives consider intricate local details and the overarching semantic context within images and reports. This approach facilitates the learning of effective mapping functions, resulting in the generation of coherent reports. It outperforms state-of-the-art results in unpaired chest X-ray report generation, demonstrating improvements in both language and clinical metrics. + 2024.findings-naacl.125 + 2024.findings-naacl.125.copyright.pdf + hirsch-etal-2024-medcycle + + + Beta-<fixed-case>LR</fixed-case>: Interpretable Logical Reasoning based on Beta Distribution + YizhuoMa + KeQinUniversity of Electronic Science and Technology of China + ShuangLiangUniversity of Electronic Science and Technology of China + 1945-1955 + The logical information contained in text is of significant importance for logical reasoning. Previous approaches have relied on embedding text into a low-dimensional vector to capture logical information and perform reasoning in Euclidean space. 
These methods involve constructing special graph architectures that match logical relations or designing data augmentation frameworks by extending texts based on symbolic logic. However, it presents two obvious problems. 1) The logical information reflected in the text exhibits uncertainty that is difficult to represent using a vector. 2) Integrating logical information requires modeling logical operations (such as ∪, ∩, and ¬), while only simple arithmetic operations can be performed in Euclidean space. To address both the problems, we propose Beta-LR, a probabilistic embedding method to capture logical information. Specifically, we embed texts into beta distribution on each dimension to eliminate logical uncertainty. We also define neural operators that enable interpretability and perform logical operations based on the characteristics of the beta distribution. We conduct experiments on two datasets, ReClor and LogiQA, and our Beta-LR achieves competitive results. The experiments demonstrate that our method effectively captures the logical information in text for reasoning purposes. The source code is available at https://github.com/myz12138/Beta-LR. + 2024.findings-naacl.126 + 2024.findings-naacl.126.copyright.pdf + ma-etal-2024-beta + + + Applications of <fixed-case>BERT</fixed-case> Models Towards Automation of Clinical Coding in <fixed-case>I</fixed-case>celandic + HaraldurHaukssonETHZ - ETH Zurich + HafsteinnEinarssondeCODE genetics and University of Iceland + 1956-1967 + This study explores the potential of automating clinical coding in Icelandic, a language with limited digital resources, by leveraging over 25 years of electronic health records (EHR) from the Landspitali University Hospital. Traditionally a manual and error-prone task, clinical coding is essential for patient care, billing, and research. Our research delves into the effectiveness of Transformer-based models in automating this process. 
We investigate various model training strategies, including continued pretraining and model adaptation, under a constrained computational budget. Our findings reveal that the best-performing model achieves competitive results in both micro and macro F1 scores, with label attention contributing significantly to its success. The study also explores the possibility of training on unlabeled data. Our research provides valuable insights into the possibilities of using NLP for clinical coding in low-resource languages, demonstrating that small countries with unique languages and well-segmented healthcare records can achieve results comparable to those in higher-resourced languages. + 2024.findings-naacl.127 + 2024.findings-naacl.127.copyright.pdf + hauksson-einarsson-2024-applications + + + “Tell me who you are and <fixed-case>I</fixed-case> tell you how you argue”: Predicting Stances and Arguments for Stakeholder Groups + PhilippHeinischUniversität Bielefeld + LorikDumaniTrier University + PhilippCimianoBielefeld University + RalfSchenkelTrier University + 1968-1982 + Argument mining has focused so far mainly on the identification, extraction, and formalization of arguments. An important yet unaddressed task consists in the prediction of the argumentative behavior of stakeholders in a debate. Predicting the argumentative behavior in advance can support foreseeing issues in public policy making or help recognize potential disagreements early on and help to resolve them. In this paper, we consider the novel task of predicting the argumentative behavior of individual stakeholders. We present ARGENST, a framework that relies on a recommender-based architecture to predict the stance and the argumentative main point on a specific controversial topic for a given stakeholder, which is described in terms of a profile including properties related to demographic attributes, religious and political orientation, socio-economic background, etc. 
We evaluate our approach on the well-known debate.org dataset in terms of accuracy for predicting stance as well as in terms of similarity of the generated arguments to the ground truth arguments using BERTScore. As part of a case study, we show how juries of members representing different stakeholder groups and perspectives can be assembled to simulate the public opinion on a given topic. + 2024.findings-naacl.128 + 2024.findings-naacl.128.copyright.pdf + heinisch-etal-2024-tell + + + Psychometric Predictive Power of Large Language Models + TatsukiKuribayashiMohamed bin Zayed University of Artificial Intelligence + YoheiOsekiUniversity of Tokyo + TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne + 1983-2005 + Instruction tuning aligns the response of large language models (LLMs) with human preferences. Despite such efforts in human–LLM alignment, we find that instruction tuning does not always make LLMs human-like from a cognitive modeling perspective. More specifically, next-word probabilities estimated by instruction-tuned LLMs are often worse at simulating human reading behavior than those estimated by base LLMs. In addition, we explore prompting methodologies for simulating human reading behavior with LLMs. Our results show that prompts reflecting a particular linguistic hypothesis improve psychometric predictive power, but are still inferior to small base models. These findings highlight that recent advancements in LLMs, i.e., instruction tuning and prompting, do not offer better estimates than direct probability measurements from base LLMs in cognitive modeling. In other words, pure next-word probability remains a strong predictor for human reading behavior, even in the age of LLMs. 
+ 2024.findings-naacl.129 + 2024.findings-naacl.129.copyright.pdf + kuribayashi-etal-2024-psychometric + + + Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions + PouyaPezeshkpourMegagon Labs + EstevamHruschkaMegagon Labs and Carnegie Mellon University + 2006-2017 + Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order, posing challenges to fair assessment of these models. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on LLMs robustness on the task of multiple-choice questions—commonly adopted task to study reasoning and fact-retrieving capability of LLMs. Investigating the sensitivity of LLMs towards the order of options in multiple-choice questions, we demonstrate a considerable performance gap of approximately 13% to 85% in LLMs on different benchmarks, when answer options are reordered, even when using demonstrations in a few-shot setting. Through a detailed analysis, we conjecture that this sensitivity arises when LLMs are uncertain about the prediction between the top-2/3 choices, and specific options placements may favor certain prediction between those top choices depending on the question caused by positional bias. We also identify patterns in top-2 choices that amplify or mitigate the model’s bias toward option placement. We found that for amplifying bias, the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options. To validate our conjecture, we conduct various experiments and adopt two approaches to calibrate LLMs’ predictions, leading to up to 8 percentage points improvement across different models and benchmarks. 
+ 2024.findings-naacl.130 + 2024.findings-naacl.130.copyright.pdf + pezeshkpour-hruschka-2024-large + + + <fixed-case>PEEB</fixed-case>: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck + ThangPham + PeijieChen + TinNguyen + SeunghyunYoonAdobe Research + TrungBuiAdobe Research + AnhNguyenAuburn University + 2018-2053 + CLIP-based classifiers rely on the prompt containing a class name that is known to the text encoder. Therefore, they perform poorly on new classes or the classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB – an explainable and editable classifier to (1) express the class name into a set of text descriptors that describe the visual parts of that class; and (2) match the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a huge margin (∼10× in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art (SOTA) on the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 and Stanford Dogs-120, respectively) but also the first to enable users to edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also the SOTA in both zero-shot and supervised-learning settings. 
+ 2024.findings-naacl.131 + 2024.findings-naacl.131.copyright.pdf + pham-etal-2024-peeb + + + Ethos: Rectifying Language Models in Orthogonal Parameter Space + LeiGao + YueNiuUniversity of Southern California + TingtingTangUniversity of Southern California + SalmanAvestimehrUniversity of Southern California, University of Southern California and University of Southern California + MuraliAnnavaramUniversity of Southern California + 2054-2068 + Language models (LMs) have greatly propelled the research on natural language processing. However, LMs also raise concerns regarding the generation of biased or toxic content and the potential disclosure of private information from the training dataset. In this work, we present a new efficient approach, Ethos, that rectifies LMs to mitigate toxicity and bias in outputs and avoid privacy leakage. Ethos is built on task arithmetic. However, unlike current task arithmetic algorithms, Ethos distinguishes general beneficial and undesired knowledge when reconstructing task vectors. Specifically, Ethos first obtains a set of principal components from the pre-trained models using singular value decomposition. Then, by projecting the task vector onto principal components, Ethos separates the principal components that encode general from those associated with undesired knowledge. Ethos performs forgetting or unlearning by only negating the task vector with undesired knowledge, thereby minimizing collateral damage on general model utility. We demonstrate the efficacy of our approach on three different tasks: bias, toxicity, and memorization unlearning. Evaluations show Ethos is more effective in removing undesired knowledge while maintaining the overall model performance compared to current task arithmetic methods. 
+ 2024.findings-naacl.132 + 2024.findings-naacl.132.copyright.pdf + gao-etal-2024-ethos + + + Crafting In-context Examples according to <fixed-case>LM</fixed-case>s’ Parametric Knowledge + YoonsangLeeSeoul National University + PranavAtreyaUniversity of California, Berkeley + XiYe + EunsolChoiUniversity of Texas, Austin + 2069-2085 + In-context learning can improve the performances of knowledge-rich tasks such as question answering. In such scenarios, in-context examples trigger a language model (LM) to surface information stored in its parametric knowledge. We study how to better construct in-context example sets, based on whether the model is aware of the in-context examples. We identify ‘known’ examples, where models can correctly answer from their parametric knowledge, and ‘unknown’ ones. Our experiments show that prompting with ‘unknown’ examples decreases the performance, potentially as it encourages hallucination rather than searching for its parametric knowledge. Constructing an in-context example set that presents both known and unknown information performs the best across diverse settings. We perform analysis on three multi-answer question answering datasets, which allows us to further study answer set ordering strategies based on the LM’s knowledge of each answer. Together, our study sheds light on how to best construct in-context example sets for knowledge-rich tasks. + 2024.findings-naacl.133 + 2024.findings-naacl.133.copyright.pdf + lee-etal-2024-crafting + + + <fixed-case>ICXML</fixed-case>: An In-Context Learning Framework for Zero-Shot Extreme Multi-Label Classification + YaxinZhu + HamedZamaniUniversity of Massachusetts, Amherst + 2086-2098 + This paper focuses on the task of Extreme Multi-Label Classification (XMC) whose goal is to predict multiple labels for each instance from an extremely large label space. 
While existing research has primarily focused on fully supervised XMC, real-world scenarios often lack supervision signals, highlighting the importance of zero-shot settings. Given the large label space, utilizing in-context learning approaches is not trivial. We address this issue by introducing In-Context Extreme Multi-label Learning (ICXML), a two-stage framework that cuts down the search space by generating a set of candidate labels through in-context learning and then reranks them. Extensive experiments suggest that ICXML advances the state of the art on two diverse public benchmarks. + 2024.findings-naacl.134 + 2024.findings-naacl.134.copyright.pdf + zhu-zamani-2024-icxml + + + <fixed-case>CLGSI</fixed-case>: A Multimodal Sentiment Analysis Framework based on Contrastive Learning Guided by Sentiment Intensity + YangYang + XundeDongSouth China University of Technology + YupengQiang + 2099-2110 + Recently, contrastive learning has begun to gain popularity in multimodal sentiment analysis (MSA). However, most existing MSA methods based on contrastive learning lack more detailed learning of the distribution of sample pairs with different sentiment intensity differences in the contrastive learning representation space. In addition, limited research has been conducted on the fusion of each modality representation obtained by contrastive learning training. In this paper, we propose a novel framework for multimodal sentiment analysis based on Contrastive Learning Guided by Sentiment Intensity (CLGSI). Firstly, the proposed contrastive learning guided by sentiment intensity selects positive and negative sample pairs based on the difference in sentiment intensity and assigns corresponding weights accordingly. Subsequently, we propose a new multimodal representation fusion mechanism, called Global-Local-Fine-Knowledge (GLFK), which extracts common features between different modalities’ representations. 
At the same time, each unimodal encoder output is separately processed by a Multilayer Perceptron (MLP) to extract specific features of each modality. Finally, joint learning of the common and specific features is used to predict sentiment intensity. The effectiveness of CLGSI is assessed on two English datasets, MOSI and MOSEI, as well as one Chinese dataset, SIMS. We achieve competitive experimental results, which attest to the strong generalization performance of our approach. The code for our approach will be released in https://github.com/AZYoung233/CLGSI + 2024.findings-naacl.135 + 2024.findings-naacl.135.copyright.pdf + yang-etal-2024-clgsi + + + Interpreting Answers to Yes-No Questions in Dialogues from Multiple Domains + ZijieWang + FarzanaRashid + EduardoBlancoUniversity of Arizona + 2111-2128 + People often answer yes-no questions without explicitly saying yes, no, or similar polar key-words. Figuring out the meaning of indirect answers is challenging, even for large language models. In this paper, we investigate this problem working with dialogues from multiple domains. We present new benchmarks in three diverse domains: movie scripts, tennis interviews, and airline customer service. We present an approach grounded on distant supervision and blended training to quickly adapt to a new dialogue domain. Experimental results show that our approach is never detrimental and yields F1 improvements as high as 11-34%. 
+ 2024.findings-naacl.136 + 2024.findings-naacl.136.copyright.pdf + wang-etal-2024-interpreting + + + Enhancing Perception: Refining Explanations of News Claims with <fixed-case>LLM</fixed-case> Conversations + Yi-LiHsuNational Tsinghua University and Academia Sinica + Jui-NingChenNational Taiwan University and , Academia Sinica + YangFan ChiangAcademia Sinica + Shang-ChienLiu + AipingXiongPennsylvania State University + Lun-WeiKuAcademia Sinica + 2129-2147 + We introduce Enhancing Perception, a framework for Large Language Models (LLMs) designed to streamline the time-intensive task typically undertaken by professional fact-checkers of crafting explanations for fake news. This study investigates the effectiveness of enhancing LLM explanations through conversational refinement. We compare various questioner agents, including state-of-the-art LLMs like GPT-4, Claude 2, PaLM 2, and 193 American participants acting as human questioners. Based on the histories of these refinement conversations, we further generate comprehensive summary explanations. We evaluated the effectiveness of these initial, refined, and summary explanations across 40 news claims by involving 2,797 American participants, measuring their self-reported belief change regarding both real and fake claims after receiving the explanations. Our findings reveal that, in the context of fake news, explanations that have undergone conversational refinement—whether by GPT-4 or human questioners, who ask more diverse and detail-oriented questions—were significantly more effective than both the initial unrefined explanations and the summary explanations. Moreover, these refined explanations achieved a level of effectiveness comparable to that of expert-written explanations. The results highlight the potential of automatic explanation refinement by LLMs in debunking fake news claims. 
+ 2024.findings-naacl.137 + 2024.findings-naacl.137.copyright.pdf + hsu-etal-2024-enhancing + + + How Interpretable are Reasoning Explanations from Prompting Large Language Models? + YeoWei JieSchool of Computer Science and Engineering, Nanyang Technological University + RanjanSatapathy + RickGohInstitute of High Performance Computing, Singapore, A*STAR + ErikCambriaNanyang Technological University + 2148-2164 + Prompt Engineering has garnered significant attention for enhancing the performance of large language models across a multitude of tasks. Techniques such as the Chain-of-Thought not only bolster task performance but also delineate a clear trajectory of reasoning steps, offering a tangible form of explanation for the audience. Prior works on interpretability assess the reasoning chains yielded by Chain-of-Thought solely along a singular axis, namely faithfulness. We present a comprehensive and multifaceted evaluation of interpretability, examining not only faithfulness but also robustness and utility across multiple commonsense reasoning benchmarks. Likewise, our investigation is not confined to a single prompting technique; it expansively covers a multitude of prevalent prompting techniques employed in large language models, thereby ensuring a wide-ranging and exhaustive evaluation. In addition, we introduce a simple interpretability alignment technique, termed Self-Entailment-Alignment Chain-of-thought, that yields more than 70% improvements across multiple dimensions of interpretability. 
Code is available at https://github.com/SenticNet/CoT_interpretability + 2024.findings-naacl.138 + 2024.findings-naacl.138.copyright.pdf + wei-jie-etal-2024-interpretable + + + Plug-in Language Model: Controlling Text Generation with a Simple Regression Model + Nai-ChiYang + Wei-YunMaAcademia Sinica + Pu-JenChengNational Taiwan University + 2165-2181 + Large-scale pre-trained language models have displayed unrivaled capacity in generating text that closely resembles human-written text. Nevertheless, generating texts adhering to specific conditions without fine-tuning or adding new parameters can be challenging. Contemporary approaches commonly rely on either prompts or auxiliary models to avoid modifying the language models. These auxiliary models are designed to assess whether a generated token contributes to meeting the desired requirements. These approaches adjust the distribution of the next token during the inference phase by leveraging the prediction score of the desired attribute to calculate gradients. However, these auxiliary models typically require the language model’s latent states. This prerequisite challenges integrating various existing black box attribute models or tools. We present the Plug-in Language Model (PiLM) as a solution to address the limitations. PiLM leverages reinforcement learning to utilize black box tools directly, adjusting the latent state to control text generation. However, performing backpropagation during the inference phase is time-consuming for PiLM. By replacing backpropagation with a simple regression model, PiLM can achieve an inference time comparable to that of the original LLM. Experiment results show that our approaches in this paper outperform existing state-of-the-art methods that rely on gradient-based, weighted decoding, or prompt-based methodologies. 
+ 2024.findings-naacl.139 + 2024.findings-naacl.139.copyright.pdf + yang-etal-2024-plug + + + Signer Diversity-driven Data Augmentation for Signer-Independent Sign Language Translation + HonghaofuHonghaofu + LiangZhang + BiaoFu + RuiZhao + JinsongSuXiamen University + XiaodongShiXiamen University, Tsinghua University + YidongChen + 2182-2193 + The primary objective of sign language translation (SLT) is to transform sign language videos into natural sentences. A crucial challenge in this field is developing signer-independent SLT systems, which requires models to generalize effectively to signers not encountered during training. This challenge is exacerbated by the limited diversity of signers in existing SLT datasets, which often results in suboptimal generalization capabilities of current models. Achieving robustness to unseen signers is essential for signer-independent SLT. However, most existing methods rely on signer identity labels, which is often impractical and costly in real-world applications. To address this issue, we propose the Signer Diversity-driven Data Augmentation (SDDA) method that can achieve good generalization without relying on signer identity labels. SDDA comprises two data augmentation schemes. The first is data augmentation based on adversarial training, which aims to utilize the gradients of the model to generate adversarial examples. The second is data augmentation based on a diffusion model, which focuses on using the advanced diffusion-based text guided image editing method to modify the appearances of the signer in images. 
The combination of the two strategies significantly enriches the diversity of signers in the training process. Moreover, we introduce a consistency loss and a discrimination loss to enhance the learning of signer-independent features. Our experimental results demonstrate our model significantly enhances the performance of SLT in the signer-independent setting, achieving state-of-the-art results without relying on signer identity labels. + 2024.findings-naacl.140 + 2024.findings-naacl.140.copyright.pdf + honghaofu-etal-2024-signer + + + A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation + FrancoisMeyerUniversity of Cape Town + JanBuysUniversity of Cape Town + 2194-2200 + Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling. + 2024.findings-naacl.141 + 2024.findings-naacl.141.copyright.pdf + meyer-buys-2024-systematic + + + Multi-Granularity Guided Fusion-in-Decoder + EunseongChoi + HyeriLee + JongwukLeeSungkyunkwan University + 2201-2212 + In Open-domain Question Answering (ODQA), it is essential to discern relevant contexts as evidence and avoid spurious ones among retrieved results. 
The model architecture that uses concatenated multiple contexts in the decoding phase, *i.e.*, Fusion-in-Decoder, demonstrates promising performance but generates incorrect outputs from seemingly plausible contexts. To address this problem, we propose the ***M**ulti-**G**ranularity guided **F**usion-**i**n-**D**ecoder (**MGFiD**)*, discerning evidence across multiple levels of granularity. Based on multi-task learning, MGFiD harmonizes passage re-ranking with sentence classification. It aggregates evident sentences into an *anchor vector* that instructs the decoder. Additionally, it improves decoding efficiency by reusing the results of passage re-ranking for *passage pruning*. Through our experiments, MGFiD outperforms existing models on the Natural Questions (NQ) and TriviaQA (TQA) datasets, highlighting the benefits of its multi-granularity solution. + 2024.findings-naacl.142 + 2024.findings-naacl.142.copyright.pdf + choi-etal-2024-multi + + + Group Fairness in Multilingual Speech Recognition Models + AnnaZee + MarcZeeResearch, Google + AndersSøgaardCopenhagen University + 2213-2226 + We evaluate the performance disparity of the Whisper and MMS families of ASR models across the VoxPopuli and Common Voice multilingual datasets, with an eye toward intersectionality. Our two most important findings are that model size, surprisingly, correlates logarithmically with worst-case performance disparities, meaning that larger (and better) models are less fair. We also observe the importance of intersectionality. In particular, models often exhibit significant performance disparity across binary gender for adolescents. + 2024.findings-naacl.143 + 2024.findings-naacl.143.copyright.pdf + zee-etal-2024-group + + + Rethinking Machine Ethics – Can <fixed-case>LLM</fixed-case>s Perform Moral Reasoning through the Lens of Moral Theories? 
+ JingyanZhou + MindaHu + JunanLi + XiaoyingZhangThe Chinese University of Hong Kong + XixinWuThe Chinese University of Hong Kong + IrwinKingThe Chinese University of Hong Kong + HelenMengThe Chinese University of Hong Kong + 2227-2242 + Making moral judgments is an essential step toward developing ethical AI systems. Prevalent approaches are mostly implemented in a bottom-up manner, which uses a large set of annotated data to train models based on crowd-sourced opinions about morality. These approaches have been criticized for potentially overgeneralizing a limited group of annotators’ moral stances and lacking explainability. This work proposes a flexible top-down framework to steer (Large) Language Models to perform moral reasoning with well-established moral theories from interdisciplinary research. The theory-guided top-down framework can incorporate various moral theories. Our experiments demonstrate the effectiveness of the proposed framework on datasets derived from moral theories. Furthermore, we show the alignment between different moral theories and existing morality datasets. Our analysis exhibits the potential and flaws in existing resources (models and datasets) in developing explainable moral judgment-making systems. + 2024.findings-naacl.144 + 2024.findings-naacl.144.copyright.pdf + zhou-etal-2024-rethinking + + + Role Prompting Guided Domain Adaptation with General Capability Preserve for Large Language Models + RuiWang + FeiMi + YiChen + BoyangXue + HongruWangThe Chinese University of Hong Kong + QiZhu + Kam-FaiWongThe Chinese University of Hong Kong + RuifengXuHarbin Institute of Technology + 2243-2255 + The growing interest in Large Language Models (LLMs) for specialized applications has revealed a significant challenge: when tailored to specific domains, LLMs tend to experience catastrophic forgetting, compromising their general capabilities and leading to a suboptimal user experience. 
Additionally, crafting a versatile model for multiple domains simultaneously often results in a decline in overall performance due to confusion between domains. In response to these issues, we present the RolE Prompting Guided Multi-Domain Adaptation (REGA) strategy. This novel approach effectively manages multi-domain LLM adaptation through three key components: 1) Self-Distillation constructs and replays general-domain exemplars to alleviate catastrophic forgetting. 2) Role Prompting assigns a central prompt to the general domain and a unique role prompt to each specific domain to minimize inter-domain confusion during training. 3) Role Integration reuses and integrates a small portion of domain-specific data to the general-domain data, which are trained under the guidance of the central prompt. The central prompt is used for a streamlined inference process, removing the necessity to switch prompts for different domains. Empirical results demonstrate that REGA effectively alleviates catastrophic forgetting and inter-domain confusion. This leads to improved domain-specific performance compared to standard fine-tuned models, while still preserving robust general capabilities. + 2024.findings-naacl.145 + 2024.findings-naacl.145.copyright.pdf + wang-etal-2024-role + + + <fixed-case>BERT</fixed-case>weet’s <fixed-case>TACO</fixed-case> Fiesta: Contrasting Flavors On The Path Of Inference And Information-Driven Argument Mining On <fixed-case>T</fixed-case>witter + MarcFeger + StefanDietzeGESIS and Heinrich-Heine-University Düsseldorf + 2256-2266 + Argument mining, dealing with the classification of text based on inference and information, denotes a challenging analytical task in the rich context of Twitter (now \mathbb{X}), a key platform for online discourse and exchange. Thereby, Twitter offers a diverse repository of short messages bearing on both of these elements. For text classification, transformer approaches, particularly BERT, offer state-of-the-art solutions. 
Our study delves into optimizing the embeddings of the understudied BERTweet transformer for argument mining on Twitter and broader generalization across topics. We explore the impact of pre-classification fine-tuning by aligning similar manifestations of inference and information while contrasting dissimilar instances. Using the TACO dataset, our approach augments tweets for optimizing BERTweet in a Siamese network, strongly improving classification and cross-topic generalization compared to standard methods. Overall, we contribute the transformer WRAPresentations and classifier WRAP, scoring 86.62% F1 for inference detection, 86.30% for information recognition, and 75.29% across four combinations of these elements, to enhance inference and information-driven argument mining on Twitter. + 2024.findings-naacl.146 + 2024.findings-naacl.146.copyright.pdf + feger-dietze-2024-bertweets + + + Testing the limits of logical reasoning in neural and hybrid models + ManuelGuzman + JakubSzymanikUniversity of Trento + MaciejMalicki + 2267-2279 + We study the ability of neural and hybrid models to generalize logical reasoning patterns. We created a series of tests for analyzing various aspects of generalization in the context of language and reasoning, focusing on compositionality and recursiveness. We used them to study the syllogistic logic in hybrid models, where the network assists in premise selection. We analyzed feed-forward, recurrent, convolutional, and transformer architectures. Our experiments demonstrate that even though the models can capture elementary aspects of the meaning of logical terms, they learn to generalize logical reasoning only to a limited degree. 
+ 2024.findings-naacl.147 + 2024.findings-naacl.147.copyright.pdf + guzman-etal-2024-testing + + + <fixed-case>METAL</fixed-case>: Towards Multilingual Meta-Evaluation + RishavHadaMicrosoft Research India + VarunGummaMicrosoft + MohamedAhmedResearch, Microsoft + KalikaBaliMicrosoft Research Labs + SunayanaSitaramMicrosoft + 2280-2298 + With the rising human-like precision of Large Language Models (LLMs) in numerous tasks, their utilization in a variety of real-world applications is becoming more prevalent. Several studies have shown that LLMs excel on many standard NLP benchmarks. However, it is challenging to evaluate LLMs due to test dataset contamination and the limitations of traditional metrics. Since human evaluations are difficult to collect, there is a growing interest in the community to use LLMs themselves as reference-free evaluators for subjective metrics. However, past work has shown that LLM-based evaluators can exhibit bias and have poor alignment with human judgments. In this study, we propose a framework for an end-to-end assessment of LLMs as evaluators in multilingual scenarios. We create a carefully curated dataset, covering 10 languages containing native speaker judgments for the task of summarization. This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL). We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2. Our results indicate that LLM-based evaluators based on GPT-4 perform the best across languages, while GPT-3.5-Turbo performs poorly. Additionally, we perform an analysis of the reasoning provided by LLM-based evaluators and find that it often does not match the reasoning provided by human judges. 
+ 2024.findings-naacl.148 + 2024.findings-naacl.148.copyright.pdf + hada-etal-2024-metal + + + <fixed-case>AGIE</fixed-case>val: A Human-Centric Benchmark for Evaluating Foundation Models + WanjunZhong + RuixiangCui + YiduoGuo + YaoboLiang + ShuaiLuMicrosoft + YanlinWangSun Yat-Sen University + AminSaied + WeizhuChenMicrosoft GenAI + NanDuanMicrosoft Research Asia + 2299-2314 + Assessing foundation models’ abilities for human-level tasks is crucial for Artificial General Intelligence (AGI) development. Traditional benchmarks, which rely on artificial datasets, may not accurately represent these capabilities. In this paper, we introduce AGIEval, a novel bilingual benchmark designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models on our benchmark. Impressively, we show that GPT-4 exceeds the average human performance in SAT, LSAT, and math contests, with 95% accuracy on SAT Math and 92.5% on the Chinese college entrance English exam. This demonstrates the exceptional performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks requiring complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal their strengths and limitations, providing valuable insights into future directions for enhancing general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a meaningful and robust evaluation of foundation models’ performance in real-world scenarios. 
+ 2024.findings-naacl.149 + 2024.findings-naacl.149.copyright.pdf + zhong-etal-2024-agieval + + + Product Description and <fixed-case>QA</fixed-case> Assisted Self-Supervised Opinion Summarization + TejpalsinghSiledar + RupasaiRangaraju + SankaraMuddu + SumanBanerjeeFlipkart + AmeyPatil + SudhanshuSingh + MuthusamyChelliahFlipkart + NikeshGarera + SwapravaNathIIT Kanpur and Computer Science and Engineering, Indian Institute of Technology Bombay + PushpakBhattacharyyaIndian Institute of Technology, Bombay, Dhirubhai Ambani Institute Of Information and Communication Technology + 2315-2332 + In e-commerce, opinion summarization is the process of summarizing the consensus opinions found in product reviews. However, the potential of additional sources such as product description and question-answers (QA) has been considered less often. Moreover, the absence of any supervised training data makes this task challenging. To address this, we propose a novel synthetic dataset creation (SDC) strategy that leverages information from reviews as well as additional sources for selecting one of the reviews as a pseudo-summary to enable supervised training. Our Multi-Encoder Decoder framework for Opinion Summarization (MEDOS) employs a separate encoder for each source, enabling effective selection of information while generating the summary. For evaluation, due to the unavailability of test sets with additional sources, we extend the Amazon, Oposum+, and Flipkart test sets and leverage ChatGPT to annotate summaries. Experiments across nine test sets demonstrate that the combination of our SDC approach and MEDOS model achieves on average a 14.5% improvement in ROUGE-1 F1 over the SOTA. Moreover, comparative analysis underlines the significance of incorporating additional sources for generating more informative summaries. 
Human evaluations further indicate that MEDOS scores relatively higher in coherence and fluency with 0.41 and 0.5 (−1 to 1) respectively, compared to existing models. To the best of our knowledge, we are the first to generate opinion summaries leveraging additional sources in a self-supervised setting. + 2024.findings-naacl.150 + 2024.findings-naacl.150.copyright.pdf + siledar-etal-2024-product + + + <fixed-case>COMEM</fixed-case>: In-Context Retrieval-Augmented Mass-Editing Memory in Large Language Models + ShanbaoQiao + XuebingLiu + Seung-HoonNaChonbuk National University + 2333-2347 + Noting that world knowledge continuously evolves over time, large language models (LLMs) need to be properly adjusted by performing the “knowledge editing”, which involves updating outdated information or correcting false information. To achieve reliable and “massive” editing capabilities in terms of \textit{generalization} and \textit{specificity}, this paper proposes a unified knowledge editing method called in-\textbf{CO}ntext retrieval-augmented \textbf{M}ass-\textbf{E}diting \textbf{M}emory (COMEM), which combines two types of editing approaches: parameter updating and in-context knowledge editing (IKE). In particular, COMEM incorporates \textit{retrieval-augmented IKE}, a novel extension of IKE designed for massive editing tasks, based on an \textit{updating}-aware demonstration construction. Experimental results on the zsRE and CounterFact datasets demonstrate that COMEM outperforms all existing methods, achieving state-of-the-art performance. Our code is available at https://github.com/JoveReCode/COMEM.git. 
+ 2024.findings-naacl.151 + 2024.findings-naacl.151.copyright.pdf + qiao-etal-2024-comem + + + Content-Specific Humorous Image Captioning Using Incongruity Resolution Chain-of-Thought + KohtaroTanakaThe University of Tokyo + KoheiUeharaThe University of Tokyo + LinGuRIKEN + YusukeMukutaThe University of Tokyo + TatsuyaHaradaRIKEN and The University of Tokyo + 2348-2367 + Although automated image captioning methods have benefited considerably from the development of large language models (LLMs), generating humorous captions is still a challenging task. Humorous captions generated by humans are unique to the image and reflect the content of the image. However, captions generated using previous captioning models tend to be generic. Therefore, we propose incongruity-resolution chain-of-thought (IRCoT) as a novel prompting framework that creates content-specific resolutions from fine details extracted from an image. Furthermore, we integrate logit bias and negative sampling to suppress the output of generic resolutions. The results of experiments with GPT4-V demonstrate that our proposed framework effectively generated humorous captions tailored to the content of specific input images. + 2024.findings-naacl.152 + 2024.findings-naacl.152.copyright.pdf + tanaka-etal-2024-content + + + Denoising Attention for Query-aware User Modeling + EliasBassaniEuropean Commission, Joint Research Centre + PranavKaselaUniversity of Milan - Bicocca + GabriellaPasiUniversity of Milan - Bicocca + 2368-2380 + Personalization of search results has gained increasing attention in the past few years, also thanks to the development of Neural Networks-based approaches for Information Retrieval. Recent works have proposed to build user models at query time by leveraging the Attention mechanism, which allows weighing the contribution of the user-related information w.r.t. 
the current query. This approach allows giving more importance to the user’s interests related to the current search performed by the user. In this paper, we discuss some shortcomings of the Attention mechanism when employed for personalization and introduce a novel Attention variant, the Denoising Attention, to solve them. Denoising Attention adopts a robust normalization scheme and introduces a filtering mechanism to better discern among the user-related data those helpful for personalization. Experimental evaluation shows improvements in MAP, MRR, and NDCG above 15% w.r.t. other Attention variants at the state-of-the-art. + 2024.findings-naacl.153 + 2024.findings-naacl.153.copyright.pdf + bassani-etal-2024-denoising + + + A Lightweight Mixture-of-Experts Neural Machine Translation Model with Stage-wise Training Strategy + FanZhangCommunication University of China and Samsung + MeiTu + SongLiu + JinyaoYanCommunication University of China + 2381-2392 + Dealing with language heterogeneity has always been one of the challenges in neural machine translation (NMT). The idea of using mixture-of-experts (MoE) naturally excels in addressing this issue by employing different experts to take responsibility for different problems. However, the parameter-inefficiency problem in MoE results in less performance improvement when boosting the number of parameters. Moreover, most of the MoE models are suffering from the training instability problem. This paper proposes MoA (Mixture-of-Adapters), a lightweight MoE-based NMT model that is trained via an elaborately designed stage-wise training strategy. With the standard Transformer as the backbone model, we introduce lightweight adapters as experts for easy expansion. To improve the parameter efficiency, we explicitly model and distill the language heterogeneity into the gating network with clustering. After freezing the gating network, we adopt the Gumbel-Max sampling as the routing scheme when training experts to balance the knowledge of 
generalization and specialization while preventing expert over-fitting. Empirical results show that MoA achieves stable improvements in different translation tasks by introducing much fewer extra parameters compared to other MoE baselines. Additionally, the performance evaluations on a multi-domain translation task illustrate the effectiveness of our training strategy. + 2024.findings-naacl.154 + 2024.findings-naacl.154.copyright.pdf + zhang-etal-2024-lightweight + + + <fixed-case>BEAR</fixed-case>: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models + JacekWilandDepartment of Computer Science, Humboldt University Berlin, Humboldt Universität Berlin + MaxPlonerHumboldt Universität Berlin + AlanAkbikHumboldt Universität Berlin + 2393-2411 + Knowledge probing assesses to which degree a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM’s inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs. 
+ 2024.findings-naacl.155 + 2024.findings-naacl.155.copyright.pdf + wiland-etal-2024-bear + + + Conformal Intent Classification and Clarification for Fast and Accurate Intent Recognition + FlorisHengst + RalfWolterING Bank + PatrickAltmeyer + ArdaKaygan + 2412-2432 + We present Conformal Intent Classification and Clarification (CICC), a framework for fast and accurate intent classification for task-oriented dialogue systems. The framework turns heuristic uncertainty scores of any intent classifier into a clarification question that is guaranteed to contain the true intent at a pre-defined confidence level. By disambiguating between a small number of likely intents, the user query can be resolved quickly and accurately. Additionally, we propose to augment the framework for out-of-scope detection. In a comparative evaluation using seven intent recognition datasets we find that CICC generates small clarification questions and is capable of out-of-scope detection. CICC can help practitioners and researchers substantially in improving the user experience of dialogue agents with specific clarification questions. + 2024.findings-naacl.156 + 2024.findings-naacl.156.copyright.pdf + hengst-etal-2024-conformal + + + Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models in Court Decisions + AlexNyffenegger + MatthiasStürmerBFH - Bern University of Applied Sciences and Universität Bern + JoelNiklausUniversity of Bern, Universität Bern + 2433-2462 + Anonymity in court rulings is a critical aspect of privacy protection in the European Union and Switzerland but with the advent of LLMs, concerns about large-scale re-identification of anonymized persons are growing. In accordance with the Federal Supreme Court of Switzerland (FSCS), we study re-identification risks using actual legal data. Following the initial experiment, we constructed an anonymized Wikipedia dataset as a more rigorous testing ground to further investigate the findings. 
In addition to the datasets, we also introduce new metrics to measure performance. We systematically analyze the factors that influence successful re-identifications, identifying model size, input length, and instruction tuning among the most critical determinants. Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions. We demonstrate that for now, the risk of re-identifications using LLMs is minimal in the vast majority of cases. We hope that our system can help enhance the confidence in the security of anonymized decisions, thus leading the courts to publish more decisions. + 2024.findings-naacl.157 + 2024.findings-naacl.157.copyright.pdf + nyffenegger-etal-2024-anonymity + + + <fixed-case>X</fixed-case>-<fixed-case>LL</fixed-case>a<fixed-case>VA</fixed-case>: Optimizing Bilingual Large Vision-Language Alignment + DongJaeShinSeoul National University of Science and Technology + HyeonSeokLimHanbat National University + InhoWonSeoul National University of Science and Technology + ChangSuChoi + MinjunKim + SeungWooSongHanbat National University + HanGyeolYoo + SangMinKimSeoul National University of Science and Technology + KyungTaeLimSeoul National University of Science and Technology + 2463-2473 + The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. 
Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches. + 2024.findings-naacl.158 + 2024.findings-naacl.158.copyright.pdf + shin-etal-2024-x + + + Why So Gullible? Enhancing the Robustness of Retrieval-Augmented Models against Counterfactual Noise + GiwonHongUniversity of Edinburgh, University of Edinburgh + JeonghwanKim + JunmoKangGeorgia Institute of Technology + Sung-HyonMyaeng + JoyceWhangKAIST + 2474-2495 + Most existing retrieval-augmented language models (LMs) assume a naive dichotomy within a retrieved document set: query-relevance and irrelevance. Our work investigates a more challenging scenario in which even the “relevant” documents may contain misleading or incorrect information, causing conflict among the retrieved documents and thereby negatively influencing model decisions as noise. We observe that existing LMs are highly brittle to the presence of conflicting information in both the fine-tuning and in-context few-shot learning scenarios. We propose approaches for handling knowledge conflicts among retrieved documents by explicitly fine-tuning a discriminator or prompting GPT-3.5 to elicit its discriminative capability. Our empirical results on open-domain QA show that these approaches significantly enhance model robustness. We also provide our findings on incorporating the fine-tuned discriminator’s decision into the in-context learning process, proposing a way to exploit the benefits of two disparate learning schemes. Alongside our findings, we provide MacNoise, a machine-generated, conflict-induced dataset to further encourage research in this direction. 
+ 2024.findings-naacl.159 + 2024.findings-naacl.159.copyright.pdf + hong-etal-2024-gullible + + + Heterogeneity over Homogeneity: Investigating Multilingual Speech Pre-Trained Models for Detecting Audio Deepfake + OrchidChetia PhukanIndraprastha Institute of Information Technology, Delhi + GautamKashyap + Arun BalajiBuduruIndraprastha Institute of Information Technology, Delhi + RajeshSharmainstitute of computer science, University of Tartu + 2496-2506 + In this work, we investigate multilingual speech Pre-Trained models (PTMs) for Audio deepfake detection (ADD). We hypothesize that multilingual PTMs trained on large-scale diverse multilingual data gain knowledge about diverse pitches, accents, and tones during their pre-training phase, making them more robust to variations. As a result, they will be more effective for detecting audio deepfakes. To validate our hypothesis, we extract representations from state-of-the-art (SOTA) PTMs including monolingual, multilingual as well as PTMs trained for speaker and emotion recognition, and evaluate them on ASVSpoof 2019 (ASV), In-the-Wild (ITW), and DECRO benchmark databases. We show that representations from multilingual PTMs, with simple downstream networks, attain the best performance for ADD compared to other PTM representations, which validates our hypothesis. We also explore the possibility of fusion of selected PTM representations for further improvements in ADD, and we propose a framework, MiO (Merge into One) for this purpose. With MiO, we achieve SOTA performance on ASV and ITW and comparable performance on DECRO with current SOTA works. 
+ 2024.findings-naacl.160 + 2024.findings-naacl.160.copyright.pdf + chetia-phukan-etal-2024-heterogeneity + + + Identifying Self-Disclosures of Use, Misuse and Addiction in Community-based Social Media Posts + ChenghaoYangUniversity of Chicago + TuhinChakrabartySalesForce Research + KarliHochstatter + MelissaSlavinColumbia University + NabilaEl-BasselColumbia University and Columbia University + SmarandaMuresanAmazon and Columbia University + 2507-2521 + In the last decade, the United States has lost more than 500,000 people from an overdose involving prescription and illicit opioids making it a national public health emergency (USDHHS, 2017). Medical practitioners require robust and timely tools that can effectively identify at-risk patients. Community-based social media platforms such as Reddit allow self-disclosure for users to discuss otherwise sensitive drug-related behaviors. We present a moderate size corpus of 2500 opioid-related posts from various subreddits labeled with six different phases of opioid use: Medical Use, Misuse, Addiction, Recovery, Relapse, Not Using. For every post, we annotate span-level extractive explanations and crucially study their role both in annotation quality and model development. We evaluate several state-of-the-art models in a supervised, few-shot, or zero-shot setting. Experimental results and error analysis show that identifying the phases of opioid use disorder is highly contextual and challenging. However, we find that using explanations during modeling leads to a significant boost in classification accuracy demonstrating their beneficial role in a high-stakes domain such as studying the opioid use disorder continuum. 
+ 2024.findings-naacl.161 + 2024.findings-naacl.161.copyright.pdf + yang-etal-2024-identifying + + + Self-Adaptive Sampling for Accurate Video Question Answering on Image Text Models + WeiHanSingapore University of Technology and Design + HuiChenNanyang Technological University + Min-YenKanNational University of Singapore + SoujanyaPoriaSingapore University of Technology and Design + 2522-2534 + Image–text models (ITMs) are the prevalent architecture for solving video question–answering tasks, as they require only a few input frames and thus save huge computational cost compared to video–language models. However, we find existing ITM video question–answering solutions either 1) adopt simplistic and unintentional sampling strategies, which may miss key frames that offer the answer clues; or 2) sample a large number of frames into divided groups, which the computational resources cannot accommodate. In this work, we aim at an efficient sampling method for few-frame situations. We first summarize a family of prior sampling methods based on question–frame correlation into a unified one, dubbed *Most Implied Frames* (MIF). Through some preliminary results and analysis, we form a hypothesis that question-aware sampling is not necessary, from which we further propose the other method *Most Dominant Frames* (MDF). Experimental results on four public datasets and three advanced ITMs demonstrate that our proposed strategies can boost the performance for image–text pretrained models, and have a wide application scenario in terms of model architectures and dataset types. Our code is available at https://github.com/declare-lab/Sealing. 
+ 2024.findings-naacl.162 + 2024.findings-naacl.162.copyright.pdf + han-etal-2024-self + + + Towards an On-device Agent for Text Rewriting + YunZhuGoogle + YinxiaoLiu + FelixStahlbergGoogle + ShankarKumar + Yu-HuiChen + LiangchenLuoGoogle + LeiShuGoogle + RenjieLiu + JindongChenGoogle + LeiMeng + 2535-2552 + Large Language Models (LLMs) have demonstrated impressive capabilities for text rewriting. However, creating a smaller yet potent language model for text rewriting presents two formidable challenges: costly data collection and absence of emergent capabilities. In this paper we present solutions to address the above challenges. We propose a new instruction tuning method to develop a mobile text rewriting model that leverages LLM-generated data and heuristic reinforcement learning, eliminating the need for human data collection. Moreover, to bridge the performance gap from the constrained size, we propose a cascading approach based on the confidence levels which are distilled from the large server model’s critiques. To evaluate the text rewriting tasks for mobile scenarios, we introduce MessageRewriteEval, a human-labeled benchmark that focuses on text rewriting of messages through natural language instructions. Through empirical experiments, we demonstrate that our on-device model surpasses the current state-of-the-art LLMs in text rewriting while maintaining a significantly reduced model size using public benchmark EditEval and our new benchmark. We also demonstrate that our proposed cascading approach improves model performance further. 
+ 2024.findings-naacl.163 + 2024.findings-naacl.163.copyright.pdf + zhu-etal-2024-towards + + + Tailoring Vaccine Messaging with Common-Ground Opinions + RickardStureborgDuke University + SanxingChenDuke University + RoyXie + AayushiPatel + ChristopherLi + ChloeZhu + TingnanHu + JunYangDepartment of Computer Science, Duke University + BhuwanDhingraDuke University + 2553-2575 + One way to personalize chatbot interactions is by establishing common ground with the intended reader. A domain where establishing mutual understanding could be particularly impactful is vaccine concerns and misinformation. Vaccine interventions are forms of messaging which aim to answer concerns expressed about vaccination. Tailoring responses in this domain is difficult, since opinions often have seemingly little ideological overlap. We define the task of tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring responses to a CGO involves meaningfully improving the answer by relating it to an opinion or belief the reader holds. In this paper we introduce Tailor-CGO, a dataset for evaluating how well responses are tailored to provided CGOs. We benchmark several major LLMs on this task; finding GPT-4-Turbo performs significantly better than others. 
We also build automatic evaluation metrics, including an efficient and accurate BERT model that outperforms finetuned LLMs, investigate how to successfully tailor vaccine messaging to CGOs, and provide actionable recommendations from this investigation. Tailor-CGO dataset and code available at: https://github.com/rickardstureborg/tailor-cgo + 2024.findings-naacl.164 + 2024.findings-naacl.164.copyright.pdf + stureborg-etal-2024-tailoring + + + Best of Both Worlds: A Pliable and Generalizable Neuro-Symbolic Approach for Relation Classification + RobertVacareanu + FahmidaAlamUniversity of Arizona + Md AsifulIslamUniversity of Arizona + HarisRiaz + MihaiSurdeanuUniversity of Arizona + 2576-2594 + This paper introduces a novel neuro-symbolic architecture for relation classification (RC) that combines rule-based methods with contemporary deep learning techniques. This approach capitalizes on the strengths of both paradigms: the adaptability of rule-based systems and the generalization power of neural networks. Our architecture consists of two components: a declarative rule-based model for transparent classification and a neural component to enhance rule generalizability through semantic text matching. Notably, our semantic matcher is trained in an unsupervised domain-agnostic way, solely with synthetic data. Further, these components are loosely coupled, allowing for rule modifications without retraining the semantic matcher. In our evaluation, we focused on two few-shot relation classification datasets: Few-Shot TACRED and a Few-Shot version of NYT29. We show that our proposed method outperforms previous state-of-the-art models in three out of four settings, despite not seeing any human-annotated training data. Further, we show that our approach remains modular and pliable, i.e., the corresponding rules can be locally modified to improve the overall model. 
Human interventions to the rules for the TACRED relation org:parents boost the performance on that relation by as much as 26% relative improvement, without negatively impacting the other relations, and without retraining the semantic matching component. + 2024.findings-naacl.165 + 2024.findings-naacl.165.copyright.pdf + vacareanu-etal-2024-best + + + <fixed-case>Q</fixed-case>-Tuning: Queue-based Prompt Tuning for Lifelong Few-shot Language Learning + YanhuiGuo + ShaoyuanXuAmazon + JinmiaoFu + JiaLiuThe Ohio State University + ChaoshengDong + BryanWangAmazon + 2595-2622 + This paper introduces Q-tuning, a novel approach for continual prompt tuning that enables the lifelong learning of a pre-trained language model. When learning a new task, Q-tuning trains a task-specific prompt by adding it to a prompt queue consisting of the prompts from older tasks. To better transfer the knowledge of old tasks, we design an adaptive knowledge aggregation technique that reweighs previous prompts in the queue with a learnable low-rank matrix. Once the prompt queue reaches its maximum capacity, we leverage a PCA-based eviction rule to reduce the queue’s size, allowing the newly trained prompt to be added while preserving the primary knowledge of old tasks. In order to mitigate the accumulation of information loss caused by the eviction, we additionally propose a globally shared prefix prompt and a memory retention regularization based on information theory. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods substantially on continual prompt tuning benchmarks. Moreover, our approach enables lifelong learning on linearly growing task sequences while requiring constant complexity for training and inference. 
+ 2024.findings-naacl.166 + 2024.findings-naacl.166.copyright.pdf + guo-etal-2024-q + + + In-Context Example Ordering Guided by Label Distributions + ZhichaoXuUniversity of Utah + DanielCohenBrown University + BeiWangUniversity of Utah + VivekSrikumarUniversity of Utah + 2623-2640 + By allowing models to predict without task-specific training, in-context learning (ICL) with pretrained LLMs has enormous potential in NLP. However, a number of problems persist in ICL. In particular, its performance is sensitive to the choice and order of in-context examples. Given the same set of in-context examples with different orderings, model performance may vary from near random to near state-of-the-art. In this work, we formulate in-context example ordering as an optimization problem. We examine three problem settings that differ in the assumptions they make about what is known about the task. Inspired by the idea of learning from label proportions, we propose two principles for in-context example ordering guided by model’s probability predictions. We apply our proposed principles to thirteen text classification datasets and nine different autoregressive LLMs with 700M to 13B parameters. We demonstrate that our approach outperforms the baselines by improving the classification accuracy, reducing model miscalibration, and also by selecting better in-context examples. + 2024.findings-naacl.167 + 2024.findings-naacl.167.copyright.pdf + xu-etal-2024-context + + + Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives + JiaxinLiu + YiYangHong Kong University of Science and Technology + Kar YanTam + 2641-2652 + In this paper, we introduce the Financial-STS task, a financial domain-specific NLP task designed to measure the nuanced semantic similarity between pairs of financial narratives. These narratives originate from the financial statements of the same company but correspond to different periods, such as year-over-year comparisons. 
Measuring the subtle semantic differences between these paired narratives enables market stakeholders to gauge changes over time in the company’s financial and operational situations, which is critical for financial decision-making. We find that existing pretrained embedding models and LLM embeddings fall short in discerning these subtle financial narrative shifts. To address this gap, we propose an LLM-augmented pipeline specifically designed for the Financial-STS task. Evaluation on a human-annotated dataset demonstrates that our proposed method outperforms existing methods trained on classic STS tasks and generic LLM embeddings. + 2024.findings-naacl.168 + 2024.findings-naacl.168.copyright.pdf + liu-etal-2024-beyond + + + Laying Anchors: Semantically Priming Numerals in Language Modeling + MandarSharmaVirginia Polytechnic Institute and State University + RutujaTaware + PraveshKoiralaVanderbilt University + NikhilMuralidharStevens Institute of Technology + NarenRamakrishnanVirginia Tech + 2653-2660 + Off-the-shelf pre-trained language models have become the de facto standard in NLP pipelines for a multitude of downstream tasks. However, the inability of these models to properly encode numerals limits their performance on tasks requiring numeric comprehension. We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in said corpus, thereby enabling mathematically grounded representations of these numeral tokens. We establish the superiority of our proposed techniques through evaluation on a range of numeracy tasks for both in-domain (seen) and out-domain (unseen) numerals. Further, we expand our empirical evaluations to numerals ranging from 1 to 10 billion, a significantly broader range compared to previous studies of the same nature, and we demonstrate significant improvements in the mathematical grounding of our learned embeddings. 
+ 2024.findings-naacl.169 + 2024.findings-naacl.169.copyright.pdf + sharma-etal-2024-laying + + + <fixed-case>UEGP</fixed-case>: Unified Expert-Guided Pre-training for Knowledge Rekindle + YutaoMou + KexiangWangAlibaba Group + JianheLin + DehongMa + JunFan + DaitingShi + ZhicongCheng + GuSimiu + DaweiYinBaidu + WeiranXu + 2661-2673 + The pre-training and fine-tuning framework has become the standard training paradigm for NLP tasks and is also widely used in industrial-level applications. However, there is still a limitation with this paradigm: simply fine-tuning with task-specific objectives tends to converge to local minima, resulting in sub-optimal performance. In this paper, we first propose a new paradigm: knowledge rekindle, which aims to re-incorporate the fine-tuned expert model into the training cycle and break through the performance upper bounds of experts without introducing additional annotated data. Then we further propose a unified expert-guided pre-training (UEGP) framework for knowledge rekindle. Specifically, we reuse fine-tuned expert models for various downstream tasks as knowledge sources and inject task-specific prior knowledge into pre-trained language models (PLMs) by means of knowledge distillation. In this process, we perform multi-task learning with knowledge distillation and masked language modeling (MLM) objectives. We also explore whether mixture-of-expert guided pre-training (MoEGP) can further enhance the effect of knowledge rekindle. Experiments and analysis on eight datasets in the GLUE benchmark and an industrial-level search re-ranking dataset show the effectiveness of our method. 
+ 2024.findings-naacl.170 + 2024.findings-naacl.170.copyright.pdf + mou-etal-2024-uegp + + + <fixed-case>L</fixed-case>attice<fixed-case>G</fixed-case>en: Hiding Generated Text in a Lattice for Privacy-Aware Large Language Model Generation on Cloud + MengkeZhang + TianxingHe + TianleWang + LuMiUniversity of Washington and Allen Institute + NiloofarMireshghallahUniversity of Washington + BinyiChen + HaoWangRutgers University + YuliaTsvetkovDepartment of Computer Science, University of Washington + 2674-2690 + In the current user-server interaction paradigm of prompted generation with large language models (LLMs) on cloud, the server fully controls the generation process, which leaves zero options for users who want to keep the generated text private to themselves. For privacy-aware text generation on cloud, we propose LatticeGen, a cooperative protocol in which the server still handles most of the computation while the client controls the sampling operation. The key idea is that the true generated sequence is mixed with noise tokens by the client and hidden in a noised lattice. Only the client knows which tokens are the true ones. Considering potential attacks from a hypothetically malicious server and how the client can defend against it, we propose the repeated beam-search attack and the mixing noise scheme. In our experiments we apply LatticeGen to protect both prompt and generation. It is shown that while the noised lattice degrades generation quality, LatticeGen successfully protects the true generation to a remarkable degree under strong attacks (more than 50% of the semantic remains hidden as measured by BERTScore). + 2024.findings-naacl.171 + 2024.findings-naacl.171.copyright.pdf + zhang-etal-2024-latticegen + + + <fixed-case>H</fixed-case>ate<fixed-case>M</fixed-case>oderate: Testing Hate Speech Detectors against Content Moderation Policies + JiangruiZhengStevens Institute of Technology + XueqingLiuStevens Institute of Technology + MirazulHaqueJ.P. 
Morgan Chase + XingQianStevens Institute of Technology + GuanqunYang + WeiYangUniversity of Texas, Dallas + 2691-2710 + To protect users from massive hateful content, existing works studied automated hate speech detection. Despite the existing efforts, one question remains: Do automated hate speech detectors conform to social media content policies? A platform’s content policies are a checklist of content moderated by the social media platform. Because content moderation rules are often uniquely defined, existing hate speech datasets cannot directly answer this question. This work seeks to answer this question by creating HateModerate, a dataset for testing the behaviors of automated content moderators against content policies. First, we engage 28 annotators and GPT in a six-step annotation process, resulting in a list of hateful and non-hateful test suites matching each of Facebook’s 41 hate speech policies. Second, we test the performance of state-of-the-art hate speech detectors against HateModerate, revealing substantial failures these models have in their conformity to the policies. Third, using HateModerate, we augment the training data of a top-downloaded hate detector on HuggingFace. We observe significant improvement in the models’ conformity to content policies while having comparable scores on the original test data. Our dataset and code can be found on https://github.com/stevens-textmining/HateModerate. 
+ 2024.findings-naacl.172 + 2024.findings-naacl.172.copyright.pdf + zheng-etal-2024-hatemoderate + + + Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other + YifeiGao + JieOu + LeiWangShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences + YutingXiao + XiangzhiyuanXiangzhiyuan + RuitingDai + JunChengShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences + 2711-2722 + Emergent Large Language Models (LLMs) use their extraordinary performance and powerful deduction capacity to discern from traditional language models. However, the expenses of computational resources and storage for these LLMs are stunning, so quantization has become a trending topic. To address accuracy decay caused by quantization, two streams of works in post-training quantization methods stand out. One uses other weights to compensate for existing quantization error, while the other transfers the quantization difficulty to other parts of the model. Combining both merits, we introduce Learnable Singular value Increment (LSI) as an advanced solution. LSI uses Singular Value Decomposition to extract singular values of the weights and make them learnable to help weights compensate each other conditioned on activation. Incorporating LSI with existing techniques, we achieve state-of-the-art performance in diverse quantization settings, whether in weight-only, weight-activation or extremely low bit scenarios. By unleashing the potential of LSI, efficient finetuning on quantized models is no longer a prohibitive problem. 
+ 2024.findings-naacl.173 + 2024.findings-naacl.173.copyright.pdf + gao-etal-2024-compensate + + + Contrastive Preference Learning for Neural Machine Translation + JianfeiHeCity University of Hong Kong + ShichaoSunThe Hong Kong Polytechnic University + SenPeng + JieXu + XiaohuaJia + WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University + 2723-2735 + There exists a discrepancy between the token-level objective during training and the overall sequence-level quality that is expected from the model. This discrepancy leads to issues like exposure bias. To align the model with human expectations, sequence-level objectives are often used to fine-tune pre-trained models. In this paper, we introduce a contrastive preference model that enhances the traditional Plackett-Luce model by incorporating an indicator function. Building upon this novel preference model, we propose Contrastive Preference Learning (CPL), which uses offline samples with list-wise preferences to fine-tune a pre-trained model in Neural Machine Translation. Our experiments, conducted on three language pairs, demonstrate that CPL outperforms not only the vanilla Transformer model but also other token-level and sequence-level baselines. Furthermore, the ablation study highlights the essential role of the proposed indicator function in achieving this improvement. + 2024.findings-naacl.174 + 2024.findings-naacl.174.copyright.pdf + he-etal-2024-contrastive + + + <fixed-case>S</fixed-case>oc<fixed-case>RE</fixed-case>val: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation + HangfengHeUniversity of Rochester + HongmingZhang + DanRothAmazon and University of Pennsylvania + 2736-2764 + To comprehensively gauge the capacity of current models for complex reasoning, it is crucial to assess their step-by-step reasoning in a scalable manner. 
Established reference-based evaluation metrics rely on human-annotated reasoning chains as references to assess the model-derived chains. However, such “gold-standard” human-written reasoning chains may not be unique and their acquisition is often labor-intensive. Existing reference-free reasoning evaluation metrics, while eliminating the need for human-crafted reasoning chains as references, often require fine-tuning with human-derived chains before evaluation, complicating the process and questioning their adaptability to other datasets. To address these challenges, we harness GPT-4 to automatically evaluate reasoning chain quality, thereby removing the dependency on human-written reasoning chains for both model fine-tuning and evaluative purposes. Leveraging the Socratic method, we develop SocREval (**Soc**ratic Method-Inspired **R**easoning **Eval**uation), a novel approach for prompt design in reference-free reasoning evaluation. Empirical results from four human annotated datasets reveal that SocREval significantly improves GPT-4’s performance, surpassing existing reference-free and reference-based reasoning evaluation metrics. Beyond its demonstrated efficacy, SocREval, proves to be both cost-efficient and robust to prompt writing and example selection, as substantiated by our in-depth analysis. + 2024.findings-naacl.175 + 2024.findings-naacl.175.copyright.pdf + he-etal-2024-socreval + + + Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis + WenhaoZhuNanjing University + HongyiLiu + QingxiuDong + JingjingXu + ShujianHuangNanjing University + LingpengKongDepartment of Computer Science, The University of Hong Kong + JiajunChenNanjing University + LeiLiSchool of Computer Science, Carnegie Mellon University + 2765-2781 + Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). 
In this paper, we systematically investigate the advantages and challenges of LLMs for MMT by answering two questions: 1) How well do LLMs perform in translating massive languages? 2) Which factors affect LLMs’ performance in translation? We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4. Our empirical results show that translation capabilities of LLMs are continually evolving. GPT-4 has beaten the strong supervised baseline NLLB in 40.91% of translation directions but still faces a large gap towards the commercial translation system like Google Translate, especially on low-resource languages. Through further analysis, we discover that LLMs exhibit new working patterns when used for MMT. First, LLMs can acquire translation ability in a resource-efficient way and generate moderate translation even on zero-resource languages. Second, instruction semantics can surprisingly be ignored when given in-context exemplars. Third, cross-lingual exemplars can provide better task guidance for low-resource translation than exemplars in the same language pairs. Code will be released at: https://github.com/NJUNLP/MMT-LLM. + 2024.findings-naacl.176 + 2024.findings-naacl.176.copyright.pdf + zhu-etal-2024-multilingual + + + Unleashing the Power of <fixed-case>LLM</fixed-case>s in Court View Generation by Stimulating Internal Knowledge and Incorporating External Knowledge + YifeiLiu + YiquanWuZhejiang University + AngLi + YatingZhang + ChanglongSunAlibaba Group + WeimingLuZhejiang University + FeiWuZhejiang University + KunKuangZhejiang University + 2782-2792 + Court View Generation (CVG) plays a vital role in the realm of legal artificial intelligence, which aims to support judges in crafting legal judgment documents. 
The court view consists of three essential judgment parts: the charge-related, law article-related, and prison term-related parts, each requiring specialized legal knowledge, rendering CVG a challenging task. Although Large Language Models (LLMs) have made remarkable strides in language generation, they encounter difficulties in the knowledge-intensive legal domain. Actually, there can be two types of knowledge: internal knowledge stored within LLMs’ parameters and external knowledge sourced from legal documents outside the models. In this paper, we decompose court views into different parts, stimulate internal knowledge, and incorporate external information to unleash the power of LLMs in the CVG task. To validate our method, we conduct a series of experiments on two real-world datasets LAIC2021 and CJO2022. The experiments demonstrate that our method is capable of generating more accurate and reliable court views. + 2024.findings-naacl.177 + 2024.findings-naacl.177.copyright.pdf + liu-etal-2024-unleashing + + + Prompting Vision-Language Models For Aspect-Controlled Generation of Referring Expressions + DanfengGuo + SanchitAgarwalAmazon + ArpitGuptaAmazon + Jiun-YuKaoAmazon Alexa AI + EmreBarutAmazon + TagyoungChungAmazon + JingHuangAmazon Alexa AI + MohitBansalUniversity of North Carolina at Chapel Hill + 2793-2807 + Referring Expression Generation (REG) is the task of generating a description that unambiguously identifies a given target in the scene. Different from Image Captioning (IC), REG requires learning fine-grained characteristics of not only the scene objects but also their surrounding context. Referring expressions are usually not singular; an object can often be uniquely referenced in numerous ways, for instance, by color, by location, or by relationship with other objects. Most prior works, however, have not explored this ‘aspect-based multiplicity’ of referring expressions. 
Hence, in this work, we focus on the Aspect-Controlled REG task, which requires generating a referring expression conditioned on the input aspect(s), where an aspect captures a style of reference. By changing the input aspect such as color, location, action etc., one can generate multiple distinct expressions per target region. To solve this new task, we first modify BLIP for aligning image-regions and text-expressions. We achieve this through a novel approach for feeding the input by drawing a bounding box around the target image-region and prompting the model to generate the referring expression. Our base REG model already beats all prior works in CIDEr score. To tackle Aspect-Controlled REG, we append ‘aspect tokens’ to the prompt and show that distinct expressions can be generated by just changing the prompt. Finally, to prove the high-quality and diversity of the data generated by our proposed aspect-controlled REG model, we also perform data-augmentation-based evaluation on the downstream Referring Expression Comprehension (REC) task. With just half of the real data augmented with the generated synthetic data, we achieve performance comparable to training with 100% of real data, using a SOTA REC model. + 2024.findings-naacl.178 + 2024.findings-naacl.178.copyright.pdf + guo-etal-2024-prompting + + + Task-Agnostic Detector for Insertion-Based Backdoor Attacks + WeiminLyu + XiaoLinSRI International + SongzhuZhengMorgan Stanley + LuPangState University of New York at Stony Brook + HaibinLingState University of New York, Stony Brook + SusmitJhaSRI International + ChaoChenState University of New York at Stony Brook + 2808-2822 + Textual backdoor attacks pose significant security threats. Current detection approaches, typically relying on intermediate feature representation or reconstructing potential triggers, are task-specific and less effective beyond sentence classification, struggling with tasks like question answering and named entity recognition. 
We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection. TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks. TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods. + 2024.findings-naacl.179 + 2024.findings-naacl.179.copyright.pdf + lyu-etal-2024-task + + + Uncertainty Estimation on Sequential Labeling via Uncertainty Transmission + JianfengHeVirginia Tech + LinlinYu + ShuoLeiSony Corporation of America + Chang-TienLuVirginia Tech + FengChenUniversity of Texas, Dallas + 2823-2835 + Sequential labeling is a task predicting labels for each token in a sequence, such as Named Entity Recognition (NER). NER tasks aim to extract entities and predict their labels given a text, which is important in information extraction. Although previous works have shown great progress in improving NER performance, uncertainty estimation on NER (UE-NER) is still underexplored but essential. This work focuses on UE-NER, which aims to estimate uncertainty scores for the NER predictions. Previous uncertainty estimation models often overlook two unique characteristics of NER: the connection between entities (i.e., one entity embedding is learned based on the other ones) and wrong span cases in the entity extraction subtask. Therefore, we propose a Sequential Labeling Posterior Network (SLPN) to estimate uncertainty scores for the extracted entities, considering uncertainty transmitted from other tokens. Moreover, we have defined an evaluation strategy to address the specificity of wrong-span cases. Our SLPN has achieved significant improvements on three datasets, such as a 5.54-point improvement in AUPR on the MIT-Restaurant dataset. Our code is available at https://github.com/he159ok/UncSeqLabeling_SLPN. 
+ 2024.findings-naacl.180 + 2024.findings-naacl.180.copyright.pdf + he-etal-2024-uncertainty + + + Exploring Language Model’s Code Generation Ability with Auxiliary Functions + SeonghyeonLee + SanghwanJangPOSTECH + SeongboJangPohang University of Science and Technology + DonghaLeeYonsei University + HwanjoYuPOSTECH + 2836-2848 + Auxiliary functions are a helpful component for improving language models’ code generation ability. However, a systematic exploration of how they affect it has yet to be done. In this work, we comprehensively evaluate the ability to utilize auxiliary functions encoded in recent code-pretrained language models. First, we construct a human-crafted evaluation set, called HumanExtension, which contains examples of two functions where one function assists the other. With HumanExtension, we design several experiments to examine their ability in a multifaceted way. Our evaluation processes enable a comprehensive understanding of including auxiliary functions in the prompt in terms of effectiveness and robustness. An additional implementation style analysis captures the models’ various implementation patterns when they access the auxiliary function. Through this analysis, we discover the models’ promising ability to utilize auxiliary functions including their self-improving behavior by implementing the two functions step-by-step. However, our analysis also reveals the model’s underutilized behavior to call the auxiliary function, suggesting the future direction to enhance their implementation by eliciting the auxiliary function call ability encoded in the models. We release our code and dataset to facilitate this research direction. 
+ 2024.findings-naacl.181 + 2024.findings-naacl.181.copyright.pdf + lee-etal-2024-exploring + + + Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of <fixed-case>V</fixed-case>ietnamese Large Language Models + SangTruong + DucNguyen + ToanNguyen + DongLe + NhiTruong + ThoQuan + SanmiKoyejoStanford University and Google + 2849-2900 + Recent advancements in large language models (LLMs) have underscored their importance in the evolution of artificial intelligence. However, despite extensive pretraining on multilingual datasets, available open-sourced LLMs exhibit limited effectiveness in processing Vietnamese. The challenge is exacerbated by the absence of systematic benchmark datasets and metrics tailored for Vietnamese LLM evaluation. To mitigate these issues, we have finetuned LLMs specifically for Vietnamese and developed a comprehensive evaluation framework encompassing 10 tasks and 31 metrics. We observe that finetuning can help LLMs transfer knowledge across languages, serving as an efficient way to bolster their capabilities in non-English languages. Moreover, our analysis indicates that larger models can introduce more biases and uncalibrated outputs and the key factor influencing LLM performance is the quality of the training or finetuning datasets. These insights underscore the significance of meticulous finetuning with high-quality datasets in enhancing LLM performance. + 2024.findings-naacl.182 + 2024.findings-naacl.182.copyright.pdf + truong-etal-2024-crossing + + + <fixed-case>G</fixed-case>o<fixed-case>T</fixed-case>: Effective Graph-of-Thought Reasoning in Language Models + YaoYao + ZuchaoLi + HaiZhaoShanghai Jiao Tong University + 2901-2921 + With the widespread use of language models (LMs) in NLP tasks, researchers have discovered the potential of Chain-of-thought (CoT) to assist LMs in accomplishing complex reasoning tasks by generating intermediate steps. 
However, human thought processes are often non-linear, rather than simply sequential chains of thoughts. Therefore, we propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph. By representing thought units as nodes and connections between them as edges, our approach captures the non-sequential nature of human thinking and allows for a more realistic modeling of thought processes. GoT adopts a two-stage framework with an additional GoT encoder for thought graph representation and fuses the graph representation with the original input representation through a gated fusion mechanism. We evaluate GoT’s performance on a text-only reasoning task (AQUA-RAT) and a multimodal reasoning task (ScienceQA). Our model achieves significant improvement over the strong CoT baseline on the AQUA-RAT test set and boosts accuracy from 85.19% to 87.59% using the T5-base model over the state-of-the-art Multimodal-CoT on the ScienceQA test set. Our code is publicly available at https://github.com/Zoeyyao27/Graph-of-Thought + 2024.findings-naacl.183 + 2024.findings-naacl.183.copyright.pdf + yao-etal-2024-got + + + Enhancing the General Agent Capabilities of Low-Paramter <fixed-case>LLM</fixed-case>s through Tuning and Multi-Branch Reasoning + QinhaoZhou + ZihanZhang + XiangXiangHuazhong University of Science and Technology + KeWangAlibaba Group + YuchuanWuAlibaba Group + YongbinLiAlibaba Group + 2922-2931 + Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities, making them highly successful in a variety of tasks. However, when used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4. As intelligent agents, LLMs need to have the capabilities of task planning, long-term memory, and the ability to leverage external tools to achieve satisfactory performance. 
Various methods have been proposed to enhance the agent capabilities of LLMs. On the one hand, methods involve constructing agent-specific data and fine-tuning the models. On the other hand, some methods focus on designing prompts that effectively activate the reasoning abilities of the LLMs. We explore both strategies on the 7B and 13B models. We propose a comprehensive method for constructing agent-specific data using GPT-4. Through supervised fine-tuning with constructed data, we find that for these models with a relatively small number of parameters, supervised fine-tuning can significantly reduce hallucination outputs and formatting errors in agent tasks. Furthermore, techniques such as multi-path reasoning and task decomposition can effectively decrease problem complexity and enhance the performance of LLMs as agents. We evaluate our method on five agent tasks of AgentBench and achieve satisfactory results. + 2024.findings-naacl.184 + 2024.findings-naacl.184.copyright.pdf + zhou-etal-2024-enhancing + + + <fixed-case>M</fixed-case>u<fixed-case>M</fixed-case>ath: Multi-perspective Data Augmentation for Mathematical Reasoning in Large Language Models + WeihaoYou + ShuoYin + XudongZhao + ZhilongJiTomorrow Advancing Life + GuoqiangZhongOcean University of China + JinfengBai + 2932-2958 + Recently, the tool-use Large Language Models (LLMs) that integrate with external Python interpreters have significantly enhanced mathematical reasoning capabilities for open-source LLMs. However, these models fall short in demonstrating the calculation process, which compromises user-friendliness and understanding of problem-solving steps. Conversely, while tool-free methods offer a clear display of the problem-solving process, their accuracy leaves room for improvement. These tool-free methods typically employ a somewhat narrow range of augmentation techniques such as rephrasing and difficulty enhancement to boost performance. 
In response to this issue, we have amalgamated and further refined these strengths while broadening the scope of augmentation methods to construct a **mu**lti-perspective augmentation dataset for **math**ematics—termed **MuMath** (μ-Math) Dataset. Subsequently, we finetune LLaMA-2 on the MuMath dataset to derive the MuMath model. Our experiments indicate that our MuMath-70B model achieves new state-of-the-art performance among tool-free methods—achieving 88.3% on GSM8K and 34.5% on MATH. We release the MuMath dataset along with its corresponding models and code for public use. + 2024.findings-naacl.185 + 2024.findings-naacl.185.copyright.pdf + you-etal-2024-mumath + + + Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization + TongYeZhejiang University + LingfeiWuAnytime AI and Pinterest + TengfeiMaState University of New York at Stony Brook + XuhongZhangZhejiang University + YangkaiDu + PeiyuLiu + ShoulingJiZhejiang University + WenhaiWang + 2959-2971 + Automatically generating human-readable text describing the functionality of a program is the intent of source code summarization. Although neural language models achieve significant performance in this field, they are limited by their inability to access external knowledge. To address this limitation, an emerging trend is combining neural models with external knowledge through retrieval methods. Previous methods have relied on the sentence-level retrieval paradigm on the encoder side. However, this paradigm is coarse-grained, noise-filled and cannot directly take advantage of the high-quality retrieved summary tokens on the decoder side. In this paper, we propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side rather than the encoder side to enhance the performance of neural models and produce more low-frequency tokens in generating summaries. 
Furthermore, to overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens. The results of extensive experiments and human evaluation show that our token-level retrieval-augmented approach significantly improves performance and is more interpretable. + 2024.findings-naacl.186 + 2024.findings-naacl.186.copyright.pdf + ye-etal-2024-tram + + + <fixed-case>UNO</fixed-case>-<fixed-case>DST</fixed-case>: Leveraging Unlabelled Data in Zero-Shot Dialogue State Tracking + ChuangLi + YanZhangTencent + Min-YenKanNational University of Singapore + HaizhouLiThe Chinese University of Hong Kong (Shenzhen); National University of Singapore and National University of Singapore + 2972-2983 + Previous zero-shot dialogue state tracking (DST) methods only apply transfer learning, but ignore unlabelled data in the target domain. We transform zero-shot DST into few-shot DST by utilising such unlabelled data via joint and self-training methods. Our method incorporates auxiliary tasks that generate slot types as inverse prompts for main tasks, creating slot values during joint training. Cycle consistency between these two tasks enables the generation and selection of quality samples in unknown target domains for subsequent fine-tuning. This approach also facilitates automatic label creation, thereby optimizing the training and fine-tuning of DST models. We demonstrate this method’s effectiveness on general language models in zero-shot scenarios, improving average joint goal accuracy by 8% across all domains in MultiWOZ. 
+ 2024.findings-naacl.187 + 2024.findings-naacl.187.copyright.pdf + li-etal-2024-uno + + + Evaluating Step-by-Step Reasoning through Symbolic Verification + YiFanZhangInstitute of automation, Chinese academy of science + HanlinZhangHarvard University + LiLiAmazon and Columbia University + EricXingMohamed bin Zayed University of AI and School of Computer Science, Carnegie Mellon University + 2984-3002 + Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations or chain-of-thoughts (CoT) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To understand the mechanism of reasoning of LMs, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from non-parametric knowledge bases (KBs), supporting automated verification of intermediate reasoning results. Then we revisit neuro-symbolic approaches and propose to learn from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog’s backward chaining algorithm and supporting automated verification of LMs’ outputs. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with smaller model sizes. + 2024.findings-naacl.188 + 2024.findings-naacl.188.copyright.pdf + zhang-etal-2024-evaluating + + + Multi-Review Fusion-in-Context + AvivSlobodkin + OriShapiraOriginAI + RanLevyAmazon + IdoDaganBar-Ilan University + 3003-3021 + Grounded text generation, encompassing tasks such as long-form question-answering and summarization, necessitates both content selection and content consolidation. 
Current end-to-end methods are difficult to control and interpret due to their opaqueness. Accordingly, recent works have proposed a modular approach, with separate components for each step. Specifically, we focus on the second subtask, of generating coherent text given pre-selected content in a multi-document setting. Concretely, we formalize Fusion-in-Context (FiC) as a standalone task, whose input consists of source texts with highlighted spans of targeted content. A model then needs to generate a coherent passage that includes all and only the target information. Our work includes the development of a curated dataset of 1000 instances in the reviews domain, alongside a novel evaluation framework for assessing the faithfulness and coverage of highlights, which strongly correlate to human judgment. Several baseline models exhibit promising outcomes and provide insightful analyses. This study lays the groundwork for further exploration of modular text generation in the multi-document setting, offering potential improvements in the quality and reliability of generated content. Our benchmark, FuseReviews, including the dataset, evaluation framework, and designated leaderboard, can be found at https://fusereviews.github.io/. + 2024.findings-naacl.189 + 2024.findings-naacl.189.copyright.pdf + slobodkin-etal-2024-multi + + + Retrieving Examples from Memory for Retrieval Augmented Neural Machine Translation: A Systematic Comparison + MaximeBouthors + JosepCregoSYSTRAN + FrançoisYvonISIR, Sorbonne Université & CNRS + 3022-3039 + Retrieval-Augmented Neural Machine Translation (RAMT) architectures retrieve examples from memory to guide the generation process. While most works in this trend explore new ways to exploit the retrieved examples, the upstream retrieval step is mostly unexplored. 
In this paper, we study the effect of varying retrieval methods for several translation architectures to better understand the interplay between these two processes. We conduct experiments in two language pairs in a multi-domain setting and consider several downstream architectures based on a standard autoregressive model, an edit-based model, and a large language model with in-context learning. Our experiments show that the choice of the retrieval technique impacts the translation scores, with variance across architectures. We also discuss the effects of increasing the number and diversity of examples, which are mostly positive across the board. + 2024.findings-naacl.190 + 2024.findings-naacl.190.copyright.pdf + bouthors-etal-2024-retrieving + + + Extending Input Contexts of Language Models through Training on Segmented Sequences + PetrosKarypisUniversity of California, San Diego + JulianMcAuleyUniversity of California, San Diego, University of California, San Diego + GeorgeKarypisUniversity of Minnesota, Minneapolis + 3040-3052 + Effectively training language models on long inputs poses many technical challenges. As a cost consideration, language models are pre-trained on a fixed sequence length before being adapted to longer sequences. We explore various methods for adapting models to longer inputs by training on segmented sequences and an interpolation-based method for extending absolute positional embeddings. We develop a training procedure to extend the input context size of pretrained models with no architectural changes and no additional memory costs than training on the original input lengths. By sub-sampling segments from long inputs while maintaining their original position the model is able to learn new positional interactions. 
Our method benefits both models trained with absolute positional embeddings, by extending their input contexts, as well as popular relative positional embedding methods showing a reduced perplexity on sequences longer than they were trained on. We demonstrate our method can extend input contexts by a factor of 4× while improving perplexity. + 2024.findings-naacl.191 + 2024.findings-naacl.191.copyright.pdf + karypis-etal-2024-extending + + + Reason from Fallacy: Enhancing Large Language Models’ Logical Reasoning through Logical Fallacy Understanding + YandaLi + DixuanWangFudan University + JiaqingLiangFudan University + GuochaoJiang + QianyuHeFudan University + YanghuaXiaoFudan University + DeqingYangFudan University + 3053-3066 + Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks, but they still struggle with some complicated reasoning tasks including logical reasoning. One non-negligible reason for LLMs’ suboptimal performance on logical reasoning is their overlooking of understanding logical fallacies correctly. To evaluate LLMs’ capability of logical fallacy understanding (LFU), we propose five concrete tasks from three cognitive dimensions of WHAT, WHY, and HOW in this paper. Towards these LFU tasks, we have successfully constructed a new dataset LFUD based on GPT-4 accompanied by a little human effort. Our extensive experiments justify that our LFUD can be used not only to evaluate LLMs’ LFU capability, but also to fine-tune LLMs to obtain significantly enhanced performance on logical reasoning. 
+ 2024.findings-naacl.192 + 2024.findings-naacl.192.copyright.pdf + li-etal-2024-reason + + + Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models + WanyongFeng + JaewookLee + HunterMcNichols + AlexanderScarlatosDepartment of Computer Science, University of Massachusetts at Amherst + DigorySmithEedi + SimonWoodheadEedi + NancyOrnelas + AndrewLanUniversity of Massachusetts, Amherst + 3067-3082 + Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer, grade, and are a reliable format in assessments and practices. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor and time-intensive process for teachers and learning content designers, which has limited scalability. In this work, we study the task of automated distractor generation in the domain of math MCQs and explore a wide variety of large language model (LLM)-based approaches, from in-context learning to fine-tuning. We conduct extensive experiments using a real-world math MCQ dataset and find that although LLMs can generate some mathematically valid distractors, they are less adept at anticipating common errors or misconceptions among real students. 
+ 2024.findings-naacl.193 + 2024.findings-naacl.193.copyright.pdf + feng-etal-2024-exploring + + + Aspect-based Sentiment Analysis with Context Denoising + YuanheTianUniversity of Washington, Seattle + ChangLiu + YanSongUniversity of Science and Technology of China + FeiXiaUniversity of Washington, Seattle + YongdongZhangUniversity of Science and Technology of China + 3083-3095 + Given a sentence and a particular aspect term, aspect-based sentiment analysis (ABSA) aims to predict the sentiment polarity towards this aspect term, which provides fine-grained analysis on sentiment understanding and it has attracted much attention in recent years. In order to achieve a good performance on ABSA, it is important for a model to appropriately encode contextual information, especially identifying salient features and eliminating noise in the context. To avoid incorrect predictions, most existing approaches employ powerful text encoders to locate important context features, as well as noises that mislead ABSA models. These approaches determine the noise in the text for ABSA by assigning low weights to context features or directly removing them from model input, which runs the risk of computing wrong weights or eliminating important context information. In this paper, we propose to improve ABSA with context denoising, where three types of word-level information are regarded as noise, namely, lexicographic noise, bag-of-words noise, and syntax noise. We utilize diffusion networks to perform the denoising process to gradually eliminate them so as to better predict sentiment polarities for given aspect terms. Our approach uses task-specific noise rather than the standard stochastic Gaussian noise in the diffusion networks. The experimental results on five widely used ABSA datasets demonstrate the validity and effectiveness of our approach. 
+ 2024.findings-naacl.194 + 2024.findings-naacl.194.copyright.pdf + tian-etal-2024-aspect + + + <fixed-case>I</fixed-case>ru<fixed-case>M</fixed-case>ozhi: Automatically classifying diglossia in <fixed-case>T</fixed-case>amil + KabilanPrasanna + AryamanArora + 3096-3103 + Tamil, a Dravidian language of South Asia, is a highly diglossic language with two very different registers in everyday use: Literary Tamil (preferred in writing and formal communication) and Spoken Tamil (confined to speech and informal media). Spoken Tamil is under-studied in modern NLP systems compared to Literary Tamil written in the Tamil script, as evidenced by a lack of datasets explicitly targeting the Spoken variety. In this paper, we release IruMozhi, a human-translated dataset of parallel text in Literary and Spoken Tamil. Using IruMozhi, we train classifiers on the task of identifying which Tamil variety a text belongs to. We use these models to gauge the availability of pretraining data in Spoken Tamil, to audit the composition of existing labelled datasets for Tamil, and to encourage future work on the variety. + 2024.findings-naacl.195 + 2024.findings-naacl.195.copyright.pdf + prasanna-arora-2024-irumozhi + + + <fixed-case>RENOVI</fixed-case>: A Benchmark Towards Remediating Norm Violations in Socio-Cultural Conversations + HaolanZhanMonash University + ZhuangLiMonash University + XiaoxiKang + TaoFengMonash University + YunchengHua + LizhenQuMonash University + YiYingBinus University + Mei RiantoChandraBinus University + KellyRosalin + JureynoldsJureynoldsBinus University + SurajSharma + ShilinQu + LinhaoLuo + IngridZukermanMonash University + Lay-KiSoonMonash University + ZhalehSemnani AzadCalifornia State University, Northridge + RezaHafMonash University + 3104-3117 + Norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. 
Remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. To equip interactive AI systems with a remediation ability, we offer ReNoVi — a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as define a sequence of tasks to help understand and remediate norm violations step by step. ReNoVi consists of two parts: 512 human-authored dialogues (real data), and 8,746 synthetic conversations generated by ChatGPT through prompt learning. While collecting sufficient human-authored data is costly, synthetic conversations provide suitable amounts of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between LLMs and humans in the awareness of social norms. We thus harness the power of ChatGPT to generate synthetic training data for our task. To ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. Our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data. + 2024.findings-naacl.196 + 2024.findings-naacl.196.copyright.pdf + zhan-etal-2024-renovi + + + Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking + Hong JinKangUniversity of California, Los Angeles + FabriceHarel-Canada + Muhammad AliGulzarVirginia Tech + NanyunPengUniversity of California, Los Angeles + MiryungKimUniversity of California, Los Angeles + 3118-3129 + Data augmentation techniques apply transformations to existing texts to generate additional data. The transformations may produce low-quality texts, where the meaning of the text is changed and the text may even be mangled beyond human comprehension. Analyzing the synthetically generated texts and their corresponding labels is slow and demanding. 
To winnow out texts with incorrect labels, we develop INSPECTOR, a human-in-the-loop data inspection technique. INSPECTOR combines the strengths of provenance tracking techniques with assistive labeling. INSPECTOR allows users to group related texts by their \textit{transformation provenance}, i.e., the transformations applied to the original text, or \textit{feature provenance}, the linguistic features of the original text. For assistive labeling, INSPECTOR computes metrics that approximate data quality, and allows users to compare the corresponding label of each text against the predictions of a large language model. In a user study, INSPECTOR increases the number of texts with correct labels identified by 3\times on a sentiment analysis task and by 4\times on a hate speech detection task. The participants found grouping the synthetically generated texts by their common transformation to be the most useful technique. Surprisingly, grouping texts by common linguistic features was perceived to be unhelpful. Contrary to prior work, our study finds that no single technique obviates the need for human inspection effort. This validates the design of INSPECTOR which combines both analysis of data provenance and assistive labeling to reduce human inspection effort. + 2024.findings-naacl.197 + 2024.findings-naacl.197.copyright.pdf + kang-etal-2024-human + + + <fixed-case>COMMIT</fixed-case>: Code-Mixing <fixed-case>E</fixed-case>nglish-Centric Large Language Model for Multilingual Instruction Tuning + JaeseongLeeSeoul National University + YeonJoonJungSeoul National University + Seung-wonHwangSeoul National University + 3130-3137 + Recently, instruction-tuned large language models (LLMs) are showing prominent performance on various tasks, such as question answering. However, the majority of instruction-tuned LLMs are English-centric, which hinders their application to low-resource language QA. 
In this paper, we propose COde-Mixed Multilingual Instruction Tuning (COMMIT) to adapt English-centric LLM to low-resource language QA. We point out two main causes of English-centricness: imbalance of unlabeled data, and English-centric instruction tuning datasets. To deviate from English-centric instruction tuning, we propose to specialize code-mixing for instruction tuning, which blocks code-mixing in English templates, to leverage the potential of its superiority. To overcome data imbalance, we perform cross-lingual alignment. The majority of cross-lingual alignment works focused on making representations similar, which is not desirable to decoder-based LLMs, such as LLaMA. Therefore, we propose code-mixed continual causal language modeling to align the decoder. COMMIT improves the exact match score of low-resourced language QA by up to 32x. Code is publicly available. + 2024.findings-naacl.198 + 2024.findings-naacl.198.copyright.pdf + lee-etal-2024-commit + + + <fixed-case>D</fixed-case>i<fixed-case>LM</fixed-case>: Distilling Dataset into Language Model for Text-level Dataset Distillation + AruMaekawaTokyo Institute of Technology, Tokyo Institute of Technology + SatoshiKosugiTokyo Institute of Technology, Tokyo Institute of Technology + KotaroFunakoshiInstitute of Innovative Research, Tokyo Institute of Technology + ManabuOkumuraTokyo Institute of Technology, Tokyo Institute of Technology + 3138-3153 + Dataset distillation aims to compress a training dataset by creating a small number of informative synthetic samples such that neural networks trained on them perform as well as those trained on the original training dataset. Current text dataset distillation methods create each synthetic sample as a sequence of word embeddings instead of a text to apply gradient-based optimization; however, such embedding-level distilled datasets cannot be used for training other models whose word embedding weights are different from the model used for distillation. 
To address this issue, we propose a novel text dataset distillation approach, called Distilling dataset into Language Model (DiLM), which trains a language model to generate informative synthetic training samples as text data, instead of directly optimizing synthetic samples. We evaluated DiLM on various text classification datasets and showed that distilled synthetic datasets from DiLM outperform those from current coreset selection methods. DiLM achieved remarkable generalization performance in training different types of models and in-context learning of large language models. Our code will be available at https://github.com/arumaekawa/DiLM. + 2024.findings-naacl.199 + 2024.findings-naacl.199.copyright.pdf + maekawa-etal-2024-dilm + + + <fixed-case>M</fixed-case>ind<fixed-case>A</fixed-case>gent: Emergent Gaming Interaction + RanGongUniversity of California, Los Angeles + QiuyuanHuangMicrosoft Research, Redmond + XiaojianMa + YusukeNodaMicrosoft + ZaneDurante + ZilongZhengBeijing Institute for General Artificial Intelligence + DemetriTerzopoulosUniversity of California, Los Angeles + LiFei-FeiStanford University and Stanford University + JianfengGaoMicrosoft Research + HoiVo + 3154-3183 + Large Foundation Models (LFMs) can perform complex scheduling in a multi-agent system and can coordinate agents to complete sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community lacks adequate benchmarks that support the implementation of a general multi-agent infrastructure encompassing collaboration between LFMs and human-NPCs. We propose a novel infrastructure—Mindagent—for evaluating planning and coordination capabilities in the context of gaming interaction.
In particular, our infrastructure leverages an existing gaming framework to (i) act as the coordinator for a multi-agent system, (ii) collaborate with human players via instructions, and (iii) enable in-context learning based on few-shot prompting with feedback. Furthermore, we introduce “Cuisineworld”, a new gaming scenario and its related benchmark that supervises multiple agents playing the game simultaneously and measures multi-agent collaboration efficiency. We have conducted comprehensive evaluations with a new auto-metric Collaboration Score: CoS for assessing the collaboration efficiency. Finally, Mindagent can be deployed in real-world gaming scenarios in a customized VR version of Cuisineworld and adapted in the “Minecraft” domain. Our work involving LFMs within our new infrastructure for general-purpose scheduling and coordination can elucidate how such skills may be obtained by learning from large language corpora. + 2024.findings-naacl.200 + 2024.findings-naacl.200.copyright.pdf + gong-etal-2024-mindagent + + + <fixed-case>B</fixed-case>ot<fixed-case>C</fixed-case>hat: Evaluating <fixed-case>LLM</fixed-case>s’ Capabilities of Having Multi-Turn Dialogues + HaodongDuanShanghai Artificial Intelligence Laboratory + JueqiWei + ChonghuaWangShanghai Jiaotong University + HongweiLiu + YixiaoFangTencent + SongyangZhangShanghai AI Laboratory + DahuaLinThe Chinese University of Hong Kong + KaiChenShanghai AI Laboratory + 3184-3200 + In the realm of modern Large Language Models (LLMs), facilitating high-quality, multi-turn dialogues with humans represents a cornerstone feature. However, human-based evaluation of such a capability involves substantial manual effort. This study offers a formative assessment of current LLMs’ proficiency in emulating human-like, multi-turn conversations using an LLM-centric approach.
The evaluation encompasses three key elements in the evaluation pipeline: utterance generation, evaluation protocol, and judgement, and we delve deeply into each aspect. GPT-4, both as an utterance generator and as a judge, exhibits exceptional performance. As a generator, GPT-4 crafts dialogues indistinguishable from human interactions in terms of style and flow. When judging, it shows a heightened alignment with human evaluative standards and consistency. Conversely, other LLMs face challenges in producing quality multi-turn dialogues, hindered by inadequate instruction-following abilities, a propensity for prolix utterances, and overall limited capabilities. Notably, generating extensive dialogues (e.g., spanning tens of turns) remains a formidable task for most LLMs, particularly in Chinese contexts. We hope that our work can serve as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs. Related resources are available at https://github.com/open-compass/BotChat. + 2024.findings-naacl.201 + 2024.findings-naacl.201.copyright.pdf + duan-etal-2024-botchat + + + Learning Mutually Informed Representations for Characters and Subwords + YilinWangSchool of Engineering and Applied Sciences, Harvard University and Carnegie Mellon University + XinyiHuSchool of Computer Science, Carnegie Mellon University + MatthewGormleySolventum and School of Computer Science, Carnegie Mellon University + 3201-3213 + Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them output useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling (intraword code-switching). Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pre-trained models on all English sequence labeling tasks and classification tasks. We make our code publicly available. + 2024.findings-naacl.202 + 2024.findings-naacl.202.copyright.pdf + wang-etal-2024-learning-mutually + + + A Novel Two-step Fine-tuning Framework for Transfer Learning in Low-Resource Neural Machine Translation + YuanGao + FengHouMassey University + RuiliWangMassey University + 3214-3224 + Existing transfer learning methods for neural machine translation typically use a well-trained translation model (i.e., a parent model) of a high-resource language pair to directly initialize a translation model (i.e., a child model) of a low-resource language pair, and the child model is then fine-tuned with corresponding datasets. In this paper, we propose a novel two-step fine-tuning (TSFT) framework for transfer learning in low-resource neural machine translation. In the first step, we adjust the parameters of the parent model to fit the child language by using the child source data. In the second step, we transfer the adjusted parameters to the child model and fine-tune it with a proposed distillation loss for efficient optimization. Our experimental results on five low-resource translations demonstrate that our framework yields significant improvements over various strong transfer learning baselines. Further analysis demonstrated the effectiveness of different components in our framework.
+ 2024.findings-naacl.203 + 2024.findings-naacl.203.copyright.pdf + gao-etal-2024-novel + + + Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment + ZhongtaoMiaoThe University of Tokyo + QiyuWuThe University of Tokyo, Tokyo Institute of Technology and Peking University + KaiyanZhao + ZilongWu + YoshimasaTsuruokaThe University of Tokyo + 3225-3236 + The field of cross-lingual sentence embeddings has recently experienced significant advancements, but research concerning low-resource languages has lagged due to the scarcity of parallel corpora. This paper shows that cross-lingual word representation in low-resource languages is notably under-aligned with that in high-resource languages in current models. To address this, we introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models. This framework incorporates three primary training objectives: aligned word prediction and word translation ranking, along with the widely used translation ranking. We evaluate our approach through experiments on the bitext retrieval task, which demonstrate substantial improvements on sentence embeddings in low-resource languages. In addition, the competitive performance of the proposed model across a broader range of tasks in high-resource languages underscores its practicality. + 2024.findings-naacl.204 + 2024.findings-naacl.204.copyright.pdf + miao-etal-2024-enhancing + + + C<tex-math>^{3}</tex-math><fixed-case>LPGCN</fixed-case>:Integrating Contrastive Learning and Cooperative Learning with Prompt into Graph Convolutional Network for Aspect-based Sentiment Analysis + YeHeChongqing University of Technology + ShihaoZouChongqing University of Technology + YuzheChenYuzheChen + XianyingHuangChongqing University of Technology + 3237-3247 + Aspect-based Sentiment Analysis (ABSA) is a fine-grained task. 
Recently, using graph convolutional networks (GCNs) to model syntactic information has become a popular topic. In addition, a growing consensus exists to enhance sentence representation using contrastive learning. However, when modeling syntactic information, incorrect syntactic structure may introduce additional noise. Meanwhile, we believe that contrastive learning implicitly introduces label information as a prior. Therefore, we propose C^{3}LPGCN, which integrates Contrastive Learning and Cooperative Learning with Prompt into GCN. Specifically, to alleviate the noise when modeling syntactic information, we propose a mask-aware aspect information filter, which combines prompt information of template with aspect information to filter the syntactic information. Besides, we propose prompt-based contrastive learning and cooperative learning to utilise the label information further. On the one hand, we construct prompts containing labels for contrastive learning, by which the model can focus more on task-relevant features. On the other hand, cooperative learning further extracts label information by aligning input samples’ representation and output distribution with label samples. Extensive experiments on three datasets demonstrate that our method significantly improves the model’s performance compared to traditional contrastive learning methods. Moreover, our C^{3}LPGCN outperforms state-of-the-art methods.
Our source code and final models are publicly available at github + 2024.findings-naacl.205 + 2024.findings-naacl.205.copyright.pdf + he-etal-2024-c3lpgcn + + + Visual Enhanced Entity-Level Interaction Network for Multimodal Summarization + HaolongYan + BinghaoTang + BodaLin + GangZhao + SiLiBeijing University of Posts and Telecommunications + 3248-3260 + MultiModal Summarization (MMS) aims to generate a concise summary based on multimodal data like texts and images and has wide application in multimodal fields. Previous works mainly focus on the coarse-level textual and visual features in which the overall features of the image interact with the whole sentence. However, the entities of the input text and the objects of the image may be underutilized, limiting the performance of current MMS models. In this paper, we propose a novel Visual Enhanced Entity-Level Interaction Network (VE-ELIN) to address the problem of underutilization of multimodal inputs at a fine-grained level in two ways. We first design a cross-modal entity interaction module to better fuse the entity information in text and the object information in vision. Then, we design an object-guided visual enhancement module to fully extract the visual features and enhance the focus of the image on the object area. We evaluate VE-ELIN on two MMS datasets and propose new metrics to measure the factual consistency of entities in the output. Finally, experimental results demonstrate that VE-ELIN is effective and outperforms previous methods under both traditional metrics and ours. The source code is available at https://github.com/summoneryhl/VE-ELIN.
+ 2024.findings-naacl.206 + 2024.findings-naacl.206.copyright.pdf + yan-etal-2024-visual + + + Knowledgeable In-Context Tuning: Exploring and Exploiting Factual Knowledge for In-Context Learning + JianingWang + ChengyuWangAlibaba Group + ChuanqiTanAlibaba Group + JunHuang + MingGao + 3261-3280 + Large language models (LLMs) enable in-context learning (ICL) by conditioning on a few labeled training examples as a text-based prompt, eliminating the need for parameter updates and achieving competitive performance. In this paper, we demonstrate that factual knowledge is imperative for the performance of ICL in three core facets: the inherent knowledge learned in LLMs, the factual knowledge derived from the selected in-context examples, and the knowledge biases in LLMs for output generation. To unleash the power of LLMs in few-shot learning scenarios, we introduce a novel Knowledgeable In-Context Tuning (KICT) framework to further improve the performance of ICL: 1) injecting knowledge into LLMs during continual self-supervised pre-training, 2) judiciously selecting the examples for ICL with high knowledge relevance, and 3) calibrating the prediction results based on prior knowledge. We evaluate the proposed approaches on autoregressive models (e.g., GPT-style LLMs) over multiple text classification and question-answering tasks. Experimental results demonstrate that KICT substantially outperforms strong baselines and improves by more than 13% and 7% on text classification and question-answering tasks, respectively.
+ 2024.findings-naacl.207 + 2024.findings-naacl.207.copyright.pdf + wang-etal-2024-knowledgeable + + + Time Machine <fixed-case>GPT</fixed-case> + FelixDrinkallUniversity of Oxford + EghbalRahimikiaUniversity of Manchester + JanetPierrehumbertUniversity of Oxford + StefanZohrenUniversity of Oxford + 3281-3292 + Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called TimeMachineGPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets. + 2024.findings-naacl.208 + 2024.findings-naacl.208.copyright.pdf + drinkall-etal-2024-time + + + An End-to-End Submodular Framework for Data-Efficient In-Context Learning + LillyKumariUniversity of Washington, Seattle + ShengjieWangUniversity of Washington, University of Illinois, Urbana Champaign and New York University, Shanghai + ArnavDasUniversity of Washington + TianyiZhouUniversity of Maryland, College Park + JeffBilmesUniversity of Washington, Seattle + 3293-3308 + Recent advancements in natural language tasks leverage the emergent In-Context Learning (ICL) ability of pretrained Large Language Models (LLMs). ICL enables LLMs to perform new tasks by utilizing a limited number of input-output examples as prompts. 
While ICL circumvents the costly step of finetuning LLMs, its effectiveness is heavily dependent on the quality and ordering of provided examples (called exemplars). In this work, we propose a two-stage data-efficient framework \textit{Div-S3} for exemplar selection for ICL. The first stage focuses on data annotation and employs a pool-based active learning approach to select a set of \textit{Div}erse and informative exemplars from the target tasks’ unlabeled pool. Given a test input/query, the second stage uses Submodular Span Summarization (\textit{S3}) to select the most relevant and non-redundant exemplars from the annotated pool of a limited budget. On 7 different NLP datasets and 5 LLMs of varying complexities, we show \textit{Div-S3} outperforms (1) existing active learning-based methods for data annotation for ICL and (2) similarity-based methods for test query-specific exemplars retrieval. + 2024.findings-naacl.209 + 2024.findings-naacl.209.copyright.pdf + kumari-etal-2024-end + + + Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer + Hele-AndraKuulmetsUniversity of Tartu + TaidoPurason + AgnesLuhtaruinstitute of computer science, University of Tartu + MarkFishelUniversity of Tartu + 3309-3325 + This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. 
Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonian. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian. + 2024.findings-naacl.210 + 2024.findings-naacl.210.copyright.pdf + kuulmets-etal-2024-teaching + + + Simulating Opinion Dynamics with Networks of <fixed-case>LLM</fixed-case>-based Agents + Yun-ShiuanChuangUniversity of Wisconsin - Madison + AgamGoyalUniversity of Illinois at Urbana-Champaign + NikunjHarlalka + SiddharthSuresh + RobertHawkinsPrinceton University + SijiaYangUniversity of Wisconsin - Madison + DhavanShahUniversity of Wisconsin - Madison + JunjieHuUniversity of Wisconsin, Madison + TimothyRogers + 3326-3346 + Accurately simulating human opinion dynamics is crucial for understanding a variety of societal phenomena, including polarization and the spread of misinformation. However, the agent-based models (ABMs) commonly used for such simulations often over-simplify human behavior. We propose a new approach to simulating opinion dynamics based on populations of Large Language Models (LLMs). Our findings reveal a strong inherent bias in LLM agents towards producing accurate information, leading simulated agents to consensus in line with scientific reality. This bias limits their utility for understanding resistance to consensus views on issues like climate change. After inducing confirmation bias through prompt engineering, however, we observed opinion fragmentation in line with existing agent-based modeling and opinion dynamics research. These insights highlight the promise and limitations of LLM agents in this domain and suggest a path forward: refining LLMs with real-world discourse to better simulate the evolution of human beliefs.
+ 2024.findings-naacl.211 + 2024.findings-naacl.211.copyright.pdf + chuang-etal-2024-simulating + + + Probing the Category of Verbal Aspect in Transformer Language Models + AnisiaKatinskaia + RomanYangarberUniversity of Helsinki + 3347-3366 + We investigate how pretrained language models (PLM) encode the grammatical category of verbal aspect in Russian. Encoding of aspect in transformer LMs has not been studied previously in any language. A particular challenge is posed by “alternative contexts”: where either the perfective or the imperfective aspect is suitable grammatically and semantically. We perform probing using BERT and RoBERTa on alternative and non-alternative contexts. First, we assess the models’ performance on aspect prediction, via behavioral probing. Next, we examine the models’ performance when their contextual representations are substituted with counterfactual representations, via causal probing. These counterfactuals alter the value of the “boundedness” feature—a semantic feature, which characterizes the action in the context. Experiments show that BERT and RoBERTa do encode aspect—mostly in their final layers. The counterfactual interventions affect perfective and imperfective in opposite ways, which is consistent with grammar: perfective is positively affected by adding the meaning of boundedness, and vice versa. The practical implications of our probing results are that fine-tuning only the last layers of BERT on predicting aspect is faster and more effective than fine-tuning the whole model. The model has high predictive uncertainty about aspect in alternative contexts, which tend to lack explicit hints about the boundedness of the described action.
+ 2024.findings-naacl.212 + 2024.findings-naacl.212.copyright.pdf + katinskaia-yangarber-2024-probing + + + A Measure for Transparent Comparison of Linguistic Diversity in Multilingual <fixed-case>NLP</fixed-case> Data Sets + TanjaSamardzicUniversity of Zurich + XimenaGutierrezUniversidad Nacional Autónoma de México + ChristianBentzEberhard-Karls-Universität Tübingen + StevenMoranUniversity of Miami + OlgaPelloni + 3367-3382 + Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the sample, but such measures do not consider structural properties of the included languages. In this paper, we propose assessing linguistic diversity of a data set against a reference language sample as a means of maximising linguistic diversity in the long run. We represent languages as sets of features and apply a version of the Jaccard index suitable for comparing sets of measures. In addition to the features extracted from typological data bases, we propose an automatic text-based measure, which can be used as a means of overcoming the well-known problem of data sparsity in manually collected features. Our diversity score is interpretable in terms of linguistic features and can identify the types of languages that are not represented in a data set. Using our method, we analyse a range of popular multilingual data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD). In addition to ranking these data sets, we find, for example, that (poly)synthetic languages are missing in almost all of them. 
+ 2024.findings-naacl.213 + 2024.findings-naacl.213.copyright.pdf + samardzic-etal-2024-measure + + + Beyond Read-Only: Crafting a Comprehensive <fixed-case>C</fixed-case>hinese Text-to-<fixed-case>SQL</fixed-case> Dataset for Database Manipulation and Query + XiChen + JinguoYouKunming University of Science and Technology + LikunLikun + XiangLi + 3383-3393 + Text-to-SQL aims to convert natural language into structured query language, which is a challenging task. Current research focuses mainly on read operations and ignores other aspects of database operations such as create, update, and delete operations. The benchmark datasets as well as models that have been proposed also fail to cover these operations, limiting the development and practical applications in the field. To bridge this gap, we propose CRUDSQL, a large-scale cross-domain single-table CRUD operations Chinese Text-to-SQL dataset. The dataset contains 10,000 question/SQL pairs involving 625 tables from different domains. To support further research on this dataset, we also propose a baseline method, CRUDParser, which employs a two-phase approach based on BERT and T5 for SQL generation and incorporates two strategies, value matching, and value prompting, for interacting with databases to further improve the performance. The experimental results show that the new operation types bring different challenges for future research, and our approach achieves 67.08% and 83.8% exact set matching accuracy under both read and delete operations in the test set, but only 49.6% and 61.8% under create and update operations. We believe that the proposal of CRUDSQL as well as CRUDParser can provide new directions and possibilities for research and practical applications in the field of Text-to-SQL. The dataset is published at https://github.com/bizard-lab/CRUDSQL.
+ 2024.findings-naacl.214 + 2024.findings-naacl.214.copyright.pdf + chen-etal-2024-beyond + + + Normalizing without Modernizing: Keeping Historical Wordforms of <fixed-case>M</fixed-case>iddle <fixed-case>F</fixed-case>rench while Reducing Spelling Variants + RaphaelRubinoUniversity of Geneva + JohannaGerlachUniversity of Geneva + JonathanMutal + PierretteBouillonUniversity of Geneva + 3394-3402 + Conservation of historical documents benefits from computational methods by alleviating the manual labor related to digitization and modernization of textual content. Languages usually evolve over time and keeping historical wordforms is crucial for diachronic studies and digital humanities. However, spelling conventions did not necessarily exist when texts were originally written and orthographic variations are commonly observed depending on scribes and time periods. In this study, we propose to automatically normalize orthographic wordforms found in historical archives written in Middle French during the 16th century without fully modernizing textual content. We leverage pre-trained models in a low resource setting based on a manually curated parallel corpus and produce additional resources with artificial data generation approaches. Results show that causal language models and knowledge distillation improve over a strong baseline, thus validating the proposed methods. + 2024.findings-naacl.215 + 2024.findings-naacl.215.copyright.pdf + rubino-etal-2024-normalizing + + + Anti-<fixed-case>LM</fixed-case> Decoding for Zero-shot In-context Machine Translation + SuzannaSia + AlexandraDeLucia + KevinDuhJohns Hopkins University + 3403-3420 + Zero-shot In-context learning is the phenomenon where models can perform a task given only the instructions. However, pre-trained large language models are known to be poorly calibrated for zero-shot tasks. 
One of the most effective approaches to handling this bias is to adopt a contrastive decoding objective, which accounts for the prior probability of generating the next token by conditioning on a context. This work introduces an Anti-Language Model objective with a decay factor designed to address the weaknesses of In-context Machine Translation. We conduct our experiments across 3 model types and sizes, 3 language directions, and for both greedy decoding and beam search. The proposed method outperforms other state-of-the-art decoding objectives, with up to 20 BLEU point improvement from the default objective in some settings. + 2024.findings-naacl.216 + 2024.findings-naacl.216.copyright.pdf + sia-etal-2024-anti + + + Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning + ShuaiZhaoJinan University + LeileiGanZhejiang University + Anh TuanLuuNanyang Technological University + JieFuHong Kong University of Science and Technology + LingjuanLyuSony Research + MeihuiziJia + JinmingWen + 3421-3438 + Recently, various parameter-efficient fine-tuning (PEFT) strategies for application to language models have been proposed and successfully implemented. However, this raises the question of whether PEFT, which only updates a limited set of model parameters, constitutes security vulnerabilities when confronted with weight-poisoning backdoor attacks. In this study, we show that PEFT is more susceptible to weight-poisoning backdoor attacks compared to the full-parameter fine-tuning method, with pre-defined triggers remaining exploitable and pre-defined targets maintaining high confidence, even after fine-tuning. Motivated by this insight, we developed a Poisoned Sample Identification Module (PSIM) leveraging PEFT, which identifies poisoned samples through confidence, providing robust defense against weight-poisoning backdoor attacks. Specifically, we leverage PEFT to train the PSIM with randomly reset sample labels. 
During the inference process, extreme confidence serves as an indicator for poisoned samples, while others are clean. We conduct experiments on text classification tasks, five fine-tuning strategies, and three weight-poisoning backdoor attack methods. Experiments show near 100% success rates for weight-poisoning backdoor attacks when utilizing PEFT. Furthermore, our defensive approach exhibits overall competitive performance in mitigating weight-poisoning backdoor attacks. + 2024.findings-naacl.217 + 2024.findings-naacl.217.copyright.pdf + zhao-etal-2024-defending + + + Select and Summarize: Scene Saliency for Movie Script Summarization + RohitSaxenaUniversity of Edinburgh, University of Edinburgh + FrankKellerUniversity of Edinburgh + 3439-3455 + Abstractive summarization for long-form narrative texts such as movie scripts is challenging due to the computational and memory constraints of current language models. A movie script typically comprises a large number of scenes; however, only a fraction of these scenes are salient, i.e., important for understanding the overall narrative. The salience of a scene can be operationalized by considering it as salient if it is mentioned in the summary. Automatically identifying salient scenes is difficult due to the lack of suitable datasets. In this work, we introduce a scene saliency dataset that consists of human-annotated salient scenes for 100 movies. We propose a two-stage abstractive summarization approach which first identifies the salient scenes in script and then generates a summary using only those scenes. Using QA-based evaluation, we show that our model outperforms previous state-of-the-art summarization methods and reflects the information content of a movie more accurately than a model that takes the whole movie script as input. 
+ 2024.findings-naacl.218 + 2024.findings-naacl.218.copyright.pdf + saxena-keller-2024-select + + + Don’t be a Fool: Pooling Strategies in Offensive Language Detection from User-Intended Adversarial Attacks + SeungukYu + JuhwanChoi + YoungBinKimChung-Ang University + 3456-3467 + Offensive language detection is an important task for filtering out abusive expressions and improving online user experiences. However, malicious users often attempt to avoid filtering systems through the involvement of textual noises. In this paper, we propose these evasions as user-intended adversarial attacks that insert special symbols or leverage the distinctive features of the Korean language. Furthermore, we introduce simple yet effective pooling strategies in a layer-wise manner to defend against the proposed attacks, focusing on the preceding layers not just the last layer to capture both offensiveness and token embeddings. We demonstrate that these pooling strategies are more robust to performance degradation even when the attack rate is increased, without directly training of such patterns. Notably, we found that models pre-trained on clean texts could achieve a comparable performance in detecting attacked offensive language, to models pre-trained on noisy texts by employing these pooling strategies. 
+ 2024.findings-naacl.219 + 2024.findings-naacl.219.copyright.pdf + yu-etal-2024-dont + + + <fixed-case>Z</fixed-case>-<fixed-case>GMOT</fixed-case>: Zero-shot Generic Multiple Object Tracking + KimTranFPT Software + Anh DuyLe DinhFPT Software AI Center + Tien-PhatNguyenJohn von Neumann + ThinhPhan + PhaNguyenUniversity of Arkansas - Fayetteville + KhoaLuuUniversity of Arkansas, Fayetteville + DonaldAdjerohWest Virginia University + GianfrancoDorettoWest Virginia University + NganLeUniversity of Arkansas, Fayetteville + 3468-3479 + Despite recent significant progress, Multi-Object Tracking (MOT) faces limitations such as reliance on prior knowledge and predefined categories and struggles with unseen objects. To address these issues, Generic Multiple Object Tracking (GMOT) has emerged as an alternative approach, requiring less prior information. However, current GMOT methods often rely on initial bounding boxes and struggle to handle variations in factors such as viewpoint, lighting, occlusion, and scale, among others. Our contributions commence with the introduction of the Referring GMOT dataset, a collection of videos, each accompanied by detailed textual descriptions of their attributes. Subsequently, we propose Z-GMOT, a cutting-edge tracking solution capable of tracking objects from never-seen categories without the need of initial bounding boxes or predefined categories. Within our Z-GMOT framework, we introduce two novel components: (i) iGLIP, an improved Grounded language-image pretraining, for accurately detecting unseen objects with specific characteristics. (ii) MA-SORT, a novel object association approach that adeptly integrates motion and appearance-based matching strategies to tackle the complex task of tracking objects with high similarity. Our contributions are benchmarked through extensive experiments conducted on the Referring GMOT dataset for GMOT task. 
Additionally, to assess the generalizability of the proposed Z-GMOT, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models are released at: https://fsoft-aic.github.io/Z-GMOT + 2024.findings-naacl.220 + 2024.findings-naacl.220.copyright.pdf + tran-etal-2024-z + + + <fixed-case>NLP</fixed-case> for Counterspeech against Hate: A Survey and How-To Guide + HelenaBonaldiFondazione Bruno Kessler and University of Trento + Yi-LingChungAlan Turing Institute + GavinAbercrombieHeriot-Watt University + MarcoGueriniFondazione Bruno Kessler + 3480-3499 + In recent years, counterspeech has emerged as one of the most promising strategies to fight online hate. These non-escalatory responses tackle online abuse while preserving the freedom of speech of the users, and can have a tangible impact in reducing online and offline violence. Recently, there has been growing interest from the Natural Language Processing (NLP) community in addressing the challenges of analysing, collecting, classifying, and automatically generating counterspeech, to reduce the huge burden of manually producing it. In particular, researchers have taken different directions in addressing these challenges, thus providing a variety of related tasks and resources. In this paper, we provide a guide for doing research on counterspeech, by describing - with detailed examples - the steps to undertake, and providing best practices that can be learnt from the NLP studies on this topic. Finally, we discuss open challenges and future directions of counterspeech research in NLP. 
+ 2024.findings-naacl.221 + 2024.findings-naacl.221.copyright.pdf + bonaldi-etal-2024-nlp + + + <fixed-case>PRODIG</fixed-case>y: a <fixed-case>PRO</fixed-case>file-based <fixed-case>DI</fixed-case>alogue Generation dataset + DanielaOcchipinti + SerraTekiroglu + MarcoGueriniFondazione Bruno Kessler + 3500-3514 + Providing dialogue agents with a profile representation can improve their consistency and coherence, leading to better conversations. However, current profile-based dialogue datasets for training such agents contain either explicit profile representations that are simple and dialogue-specific, or implicit representations that are difficult to collect. In this work, we introduce the PRODIGy (PROfile-based DIalogue Generation) dataset, which brings diverse representations together, providing a more comprehensive profile dimension set for each speaker. This resource comprises more than 20k dialogues, sourced from movie scripts, aligned with speaker representations such as communication style, biography, personality and gender. Initial experiments with diverse baselines show that providing generative language models with these aspects of a profile, both separately and jointly, enhances models’ performance. This improvement holds true in both in-domain and cross-domain settings, for both fine-tuned and instruction-based LLMs. + 2024.findings-naacl.222 + 2024.findings-naacl.222.copyright.pdf + occhipinti-etal-2024-prodigy + + + <fixed-case>W</fixed-case>ater<fixed-case>J</fixed-case>udge: Quality-Detection Trade-off when Watermarking Large Language Models + PiotrMolenda + AdianLiusieUniversity of Cambridge + MarkGalesUniversity of Cambridge + 3515-3525 + Watermarking generative-AI systems, such as LLMs, has gained considerable interest, driven by their enhanced capabilities across a wide range of tasks. 
Although current approaches have demonstrated that small, context-dependent shifts in the word distributions can be used to apply and detect watermarks, there has been little work in analyzing the impact that these perturbations have on the quality of generated texts. Balancing high detectability with minimal performance degradation is crucial in terms of selecting the appropriate watermarking setting; therefore this paper proposes a simple analysis framework where comparative assessment, a flexible NLG evaluation framework, is used to assess the quality degradation caused by a particular watermark setting. We demonstrate that our framework provides easy visualization of the quality-detection trade-off of watermark settings, enabling a simple solution to find an LLM watermark operating point that provides a well-balanced performance. This approach is applied to two different summarization systems and a translation system, enabling cross-model analysis for a task, and cross-task analysis. + 2024.findings-naacl.223 + 2024.findings-naacl.223.copyright.pdf + molenda-etal-2024-waterjudge + + + Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking + NanXuUniversity of Southern California + FeiWangUniversity of Southern California + BenZhouUniversity of Pennsylvania + BangzhengLiUniversity of Southern California + ChaoweiXiaoUniversity of Wisconsin - Madison and NVIDIA + MuhaoChenUniversity of California, Davis and University of Southern California + 3526-3548 + While large language models (LLMs) have demonstrated increasing power, they have also called upon studies on their vulnerabilities. As representatives, jailbreak attacks can provoke harmful or unethical responses from LLMs, even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of LLMs. 
Specifically, we analyze the safety vulnerability of LLMs in the face of 1) multilingual cognitive overload, 2) veiled expression, and 3) effect-to-cause reasoning. Different from previous jailbreak attacks, our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending cognitive overload attack from two perspectives. Empirical studies show that our cognitive overload from three perspectives can jailbreak all studied LLMs successfully, while existing defense strategies can hardly mitigate the caused malicious uses effectively. + 2024.findings-naacl.224 + 2024.findings-naacl.224.copyright.pdf + xu-etal-2024-cognitive + + + <fixed-case>PAELLA</fixed-case>: Parameter-Efficient Lightweight Language-Agnostic Captioning Model + RitaRamosInstituto Superior Técnico + EmanueleBugliarelloGoogle + BrunoMartinsInstituto Superior Técnico + DesmondElliottand University of Copenhagen + 3549-3564 + We introduce PAELLA, a Parameter-Efficient Lightweight Language-Agnostic image captioning model designed to be both parameter and data-efficient using retrieval augmentation. The model is trained by learning a small mapping network with 34M parameters between a pre-trained visual model and a multilingual language model that is conditioned on two types of input: (i) the image itself, and (ii) a set of retrieved captions in the target language. The retrieved examples play a key role in guiding the model to generate captions across languages. 
Through retrieval, the model can be lightweight in terms of the number of trainable parameters, which only exist in its mapping network, and also in the amount of multilingual training data that is required. Experiments on the XM3600 dataset, featuring 36 languages, show that PAELLA can outperform or compete against some models with 3–77× more learned parameters and 35–863× more data, particularly in low-resource languages. We also find that PAELLA can be trained on only monolingual data and still show strong zero-shot abilities in other languages. + 2024.findings-naacl.225 + 2024.findings-naacl.225.copyright.pdf + ramos-etal-2024-paella + + + <fixed-case>OSC</fixed-case>a<fixed-case>R</fixed-case>: Object State Captioning and State Change Representation + NguyenNguyen + JingBi + AliVosoughiUniversity of Rochester + YapengTianUniversity of Texas at Dallas + PooyanFazliArizona State University + ChenliangXuUniversity of Rochester, University of Rochester and University of Rochester + 3565-3576 + The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating Multimodal Large Language Models (MLLMs). 
Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR. + 2024.findings-naacl.226 + 2024.findings-naacl.226.copyright.pdf + nguyen-etal-2024-oscar + + + <fixed-case>S</fixed-case>um<fixed-case>CSE</fixed-case>: Summary as a transformation for Contrastive Learning + RaghuveerThirukovalluru + XiaolanWangMegagon Labs + JunChenMeta Platform + ShuyangLiMeta AI + JieLei + RongJinTwitter + BhuwanDhingraDuke University + 3577-3588 + Sentence embedding models are typically trained using contrastive learning (CL), either using human annotations directly or by repurposing other annotated datasets. In this work, we explore the recently introduced paradigm of generating CL data using generative language models (LM). In CL for computer vision (CV), compositional transformations (series of operations applied over an image. e.g. cropping + color distortion) which modify the input/image to retain minimal information were shown to be very effective. We show that composition of a ‘Summary’ transformation with diverse paraphrasing/contradicting transformations accomplishes the same and works very well in CL for sentence embeddings. Our final generated dataset (using Vicuna-13B) significantly outperforms the previous best unsupervised method (using ChatGPT) by 1.8 points, and SimCSE, a strong supervised baseline by 0.3 points on the semantic text similarity (STS) benchmark. 
+ 2024.findings-naacl.227 + 2024.findings-naacl.227.copyright.pdf + thirukovalluru-etal-2024-sumcse + + + The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text + YanzhuGuo + GuokanShangMohamed bin Zayed University of Artificial Intelligence + MichalisVazirgiannisEcole Polytechnique, France + ChloéClavelINRIA and Télécom Paris + 3589-3604 + This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models. 
+ 2024.findings-naacl.228 + 2024.findings-naacl.228.copyright.pdf + guo-etal-2024-curious + + + <fixed-case>P</fixed-case>ersona<fixed-case>LLM</fixed-case>: Investigating the Ability of Large Language Models to Express Personality Traits + HangJiang + XiajieZhangMassachusetts Institute of Technology + XuboCao + CynthiaBreazeal + DebRoyMassachusetts Institute of Technology + JadKabbaraMassachusetts Institute of Technology + 3605-3627 + Despite the many use cases for large language models (LLMs) in creating personalized chatbots, there has been limited research on evaluating the extent to which the behaviors of personalized LLMs accurately and consistently reflect specific personality traits. We consider studying the behavior of LLM-based agents which we refer to as LLM personas and present a case study with GPT-3.5 and GPT-4 to investigate whether LLMs can generate content that aligns with their assigned personality profiles. To this end, we simulate distinct LLM personas based on the Big Five personality model, have them complete the 44-item Big Five Inventory (BFI) personality test and a story writing task, and then assess their essays with automatic and human evaluations. Results show that LLM personas’ self-reported BFI scores are consistent with their designated personality types, with large effect sizes observed across five traits. Additionally, LLM personas’ writings have emerging representative linguistic patterns for personality traits when compared with a human writing corpus. Furthermore, human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%. Interestingly, the accuracy drops significantly when the annotators were informed of AI authorship. 
+ 2024.findings-naacl.229 + 2024.findings-naacl.229.copyright.pdf + jiang-etal-2024-personallm + + + <fixed-case>FIRE</fixed-case>: A Dataset for Financial Relation Extraction + HassanHamad + Abhinav KumarThakur + NijilKolleri + SujithPulikodan + KeithChuggUniversity of Southern California + 3628-3642 + This paper introduces FIRE (**FI**nancial **R**elation **E**xtraction), a sentence-level dataset of named entities and relations within the financial sector. Comprising 3,025 instances, the dataset encapsulates 13 named entity types along with 18 relation types. Sourced from public financial reports and financial news articles, FIRE captures a wide array of financial information about a business including, but not limited to, corporate structure, business model, revenue streams, and market activities such as acquisitions. The full dataset was labeled by a single annotator to minimize labeling noise. The labeling time for each sentence was recorded during the labeling process. We show how this feature, along with curriculum learning techniques, can be used to improve a model’s performance. The FIRE dataset is designed to serve as a valuable resource for training and evaluating machine learning algorithms in the domain of financial information extraction. The dataset and the code to reproduce our experimental results are available at https://github.com/hmhamad/FIRE. The repository for the labeling tool can be found at https://github.com/abhinav-kumar-thakur/relation-extraction-annotator. 
+ 2024.findings-naacl.230 + 2024.findings-naacl.230.copyright.pdf + hamad-etal-2024-fire + + + <fixed-case>M</fixed-case>usi<fixed-case>L</fixed-case>ingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response + ZihaoDeng + YinghaoMaQueen Mary University of London + YudongLiu + RongchenGuo + GeZhang + WenhuChenUniversity of Waterloo and Google + WenhaoHuang + EmmanouilBenetosQueen Mary, University of London + 3643-3655 + Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains not well-explored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT (CITATION) with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps datasets, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones. 
+ 2024.findings-naacl.231 + 2024.findings-naacl.231.copyright.pdf + deng-etal-2024-musilingo + + + Investigating Acceleration of <fixed-case>LL</fixed-case>a<fixed-case>MA</fixed-case> Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with ‘<fixed-case>LITE</fixed-case>’ + NeerajVarshney + AgneetChatterjeeArizona State University + MihirParmar + ChittaBaralArizona State University, Arizona State University and Arizona State University + 3656-3677 + Large Language Models (LLMs) have achieved remarkable performance across a wide variety of tasks; however, their large size makes their inference slow and computationally expensive. Focusing on this problem, we study instruction tuning LLMs with additional explicit Losses from the Intermediate layers (LITE) and show that it enables these layers to acquire ‘good’ generation ability without affecting the generation ability of the final layer. We then perform ‘dynamic confidence-based early exiting’ at token level from the intermediate layers which improves the computational efficiency of text generation without sacrificing the quality of the generation. We conduct comprehensive experiments by instruction tuning LLaMA-2 models on the Alpaca dataset and evaluate on four different instruction test sets. We show that dynamic early exiting achieves consistent and considerable inference cost improvements (37.86% for 7B and 46.35% for 13B model) while maintaining the generation quality. We further conduct a thorough analysis of the results and dissect the efficiency improvements which reveals several important findings. 
+ 2024.findings-naacl.232 + 2024.findings-naacl.232.copyright.pdf + varshney-etal-2024-investigating + + + Instruction-following Evaluation through Verbalizer Manipulation + ShiyangLiAmazon + JunYan + HaiWangSamsung + ZhengTangSamsung + XiangRenUniversity of Southern California, University of Southern California and University of Southern California + VijaySrinivasan + HongxiaJinSamsung Research America AI center + 3678-3692 + While instruction-tuned models have shown remarkable success in various natural language processing tasks, accurately evaluating their ability to follow instructions remains challenging. Existing benchmarks primarily focus on common instructions that align well with what the model learned during training. However, proficiency in responding to these instructions does not necessarily imply strong ability in instruction following. In this paper, we propose a novel instruction-following evaluation protocol called verbalizer manipulation. It instructs the model to verbalize the task label with words aligning with model priors to different extents, adopting verbalizers from highly aligned (e.g., outputting “positive” for positive sentiment), to minimally aligned (e.g., outputting “negative” for positive sentiment). Verbalizer manipulation can be seamlessly integrated with any classification benchmark to examine the model’s reliance on priors and its ability to override them to accurately follow the instructions. We conduct a comprehensive evaluation of four major model families across nine datasets, employing twelve sets of verbalizers for each of them. We observe that the instruction-following abilities of models, across different families and scales, are significantly distinguished by their performance on less natural verbalizers. Even the strongest GPT-4 model struggles to perform better than random guessing on the most challenging verbalizer, emphasizing the need for continued advancements to improve their instruction-following abilities. 
+ 2024.findings-naacl.233 + 2024.findings-naacl.233.copyright.pdf + li-etal-2024-instruction + + + <fixed-case>W</fixed-case>eb<fixed-case>WISE</fixed-case>: Unlocking Web Interface Control for <fixed-case>LLM</fixed-case>s via Sequential Exploration + HeyiTao + SethuramanT VDepartment of Computer Science + MichalShlapentokh-RothmanUniversity of Illinois, Urbana Champaign + TanmayGuptaAllen Institute for Artificial Intelligence + HengJiUniversity of Illinois, Urbana-Champaign + DerekHoiemDepartment of Computer Science, Reconstruct and University of Illinois, Urbana Champaign + 3693-3711 + This paper investigates using Large Language Models (LLMs) to automatically perform web software tasks using click, scroll, and text input operations. Previous approaches, such as reinforcement learning (RL) or imitation learning, are inefficient to train and task-specific. Our method uses filtered Document Object Model (DOM) elements as observations and performs tasks step-by-step, sequentially generating small programs based on the current observations. We use in-context learning, either benefiting from a single manually provided example, or an automatically generated example based on a successful zero-shot trial. We evaluate our proposed method on the MiniWob++ benchmark. With only one in-context example, our WebWISE method using gpt-3.5-turbo achieves similar or better performance than other methods that require many demonstrations or trials. 
+ 2024.findings-naacl.234 + 2024.findings-naacl.234.copyright.pdf + tao-etal-2024-webwise + + + <fixed-case>C</fixed-case>odec<fixed-case>LM</fixed-case>: Aligning Language Models with Tailored Synthetic Data + ZifengWangGoogle + Chun-LiangLiGoogle + VincentPerotGoogle + LongLeGoogle + JinMiaoGoogle + ZizhaoZhangGoogle + Chen-YuLeeGoogle + TomasPfisterGoogle + 3712-3729 + Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users’ actual goals. To reduce the labor and time cost to collect or annotate data by humans, researchers start to explore the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLM to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities in different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the Encode-Decode principles, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution, and then decode metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. Extensive experiments on four open-domain instruction following benchmarks validate the effectiveness of CodecLM over the current state-of-the-arts. 
+ 2024.findings-naacl.235 + 2024.findings-naacl.235.copyright.pdf + wang-etal-2024-codeclm + + + Prompting Few-shot Multi-hop Question Generation via Comprehending Type-aware Semantics + ZefengLinUniversity of Science and Technology of China + WeidongChen + YanSongUniversity of Science and Technology of China + YongdongZhangUniversity of Science and Technology of China + 3730-3740 + Given several documents, multi-hop question generation (MQG) is a task that aims to generate complicated questions that require reasoning over multiple pieces of these documents to find the answer. To perform this task, existing studies focus on designing advanced architectures to locate essential keywords or sentences in multiple documents and then generate questions accordingly, where they normally do not note that question types could provide crucial hints for extracting key information from the documents for MQG. In general, supervised approaches are used that rely on large annotated data, which is not available in many low-resource scenarios and thus makes MQG hard in these domains. Considering the recent success of large language models (LLMs) on natural language processing tasks using limited labeled data under few-shot settings, in this paper, we propose an approach named type-aware semantics extraction-based chain-of-thought method (TASE-CoT) for few-shot MQG. Specifically, our approach firstly extracts question types and essential semantic phrases from the given documents and the answer. Then, we design a three-step CoT template to leverage the extracted question type and semantic phrases to predict multi-hop questions. Extensive experiments and the results demonstrate the effectiveness of our approach and the proposed modules. 
+ 2024.findings-naacl.236 + 2024.findings-naacl.236.copyright.pdf + lin-etal-2024-prompting + + + When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models + YanhongLi + ChenghaoYangUniversity of Chicago + AllysonEttingerAllen Institute for Artificial Intelligence + 3741-3753 + Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs’ ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance in TruthfulQA, it adversely affects results in HotpotQA. We conduct follow-up analyses to clarify the contributing factors in these patterns, and find that the influence of self-reflection is impacted both by reliability of accuracy in models’ initial responses, and by overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher. We also find that self-reflection reduces tendency toward majority voting. Based on our findings, we propose guidelines for decisions on when to implement self-reflection. We release the codebase for reproducing our experiments at https://github.com/yanhong-lbh/LLM-SelfReflection-Eval. 
+ 2024.findings-naacl.237 + 2024.findings-naacl.237.copyright.pdf + li-etal-2024-hindsight + + + <fixed-case>C</fixed-case>o<fixed-case>D</fixed-case>a: Constrained Generation based Data Augmentation for Low-Resource <fixed-case>NLP</fixed-case> + Chandra KiranEvuru + SreyanGhosh + SonalKumar + RamaneswaranS + UtkarshTyagi + DineshManochaUniversity of Maryland, College Park + 3754-3769 + We present CoDa (**Co**nstrained Generation based **Da**ta Augmentation), a controllable, effective, and *training-free* data augmentation technique for low-resource (data-scarce) NLP. Our approach is based on prompting off-the-shelf instruction-following Large Language Models (LLMs) for generating text that satisfies a set of constraints. Precisely, we extract a set of simple constraints from every instance in the low-resource dataset and verbalize them to prompt an LLM to generate novel and diverse training instances. Our findings reveal that synthetic data that follows simple constraints in the downstream dataset act as highly effective augmentations, and CoDa can achieve this without intricate decoding-time constrained generation techniques or fine-tuning with complex algorithms that eventually make the model biased toward the small number of training instances. Additionally, CoDa is the first framework that provides users explicit control over the augmentation generation process, thereby also allowing easy adaptation to several domains. We demonstrate the effectiveness of CoDa across 11 datasets spanning 3 tasks and 3 low-resource settings. CoDa outperforms all our baselines, qualitatively and quantitatively, with improvements of 0.12%-7.19%. Code is available. 
+ 2024.findings-naacl.238 + 2024.findings-naacl.238.copyright.pdf + evuru-etal-2024-coda + + + Synonym relations affect object detection learned on vision-language data + GiacomoNebbia + AdrianaKovashkaUniversity of Pittsburgh + 3770-3776 + We analyze whether object detectors trained on vision-language data learn effective visual representations for synonyms. Since many current vision-language models accept user-provided textual input, we highlight the need for such models to learn feature representations that are robust to changes in how such input is provided. Specifically, we analyze changes in synonyms used to refer to objects. Here, we study object detectors trained on vision-language data and investigate how to make their performance less dependent on whether synonyms are used to refer to an object. We propose two approaches to achieve this goal: data augmentation by back-translation and class embedding enrichment. We show the promise of such approaches, reporting improved performance on synonyms from mAP@0.5=33.87% to 37.93%. + 2024.findings-naacl.239 + 2024.findings-naacl.239.copyright.pdf + nebbia-kovashka-2024-synonym + + + <fixed-case>CM</fixed-case>-<fixed-case>TTS</fixed-case>: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models + XiangLiBeijing University of Posts and Telecommunications + FanBuFanBuBeijing University of Posts and Telecommunications + AmbujMehrish + YingtingLi + JialeHanHong Kong University of Science and Technology + BoChengBeijing University of Posts and Telecommunications + SoujanyaPoriaSingapore University of Technology and Design + 3777-3794 + Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. 
Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS’s superiority over existing single-step speech synthesis systems, representing a significant advancement in the field. + 2024.findings-naacl.240 + 2024.findings-naacl.240.copyright.pdf + li-etal-2024-cm + + + <fixed-case>R</fixed-case>obust<fixed-case>S</fixed-case>ent<fixed-case>E</fixed-case>mbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning + JavadRafiei Asl + PrajwalPanzadeGeorgia State University + EduardoBlancoUniversity of Arizona + DanielTakabiOld Dominion University + ZhipengCaiGeorgia State University + 3795-3809 + Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based representations often exhibit poor robustness in adversarial settings. In this paper, we introduce RobustSentEmbed, a self-supervised sentence embedding framework designed to improve both generalization and robustness in diverse text representation tasks and against a diverse set of adversarial attacks. 
Through the generation of high-risk adversarial perturbations and their utilization in a novel objective function, RobustSentEmbed adeptly learns high-quality and robust sentence embeddings. Our experiments confirm the superiority of RobustSentEmbed over state-of-the-art representations. Specifically, our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half (from 75.51% to 38.81%). The framework also yields improvements of 1.59% and 0.23% in semantic textual similarity tasks and various transfer tasks, respectively. + 2024.findings-naacl.241 + 2024.findings-naacl.241.copyright.pdf + rafiei-asl-etal-2024-robustsentembed + + + Characterizing Human and Zero-Shot <fixed-case>GPT</fixed-case>-3.5 Object-Similarity Judgments + DMcKnight + AlonaFysheUniversity of Alberta + 3810-3828 + Recent advancements in large language models’ (LLMs) capabilities have yielded few-shot, human-comparable performance on a range of tasks. At the same time, researchers expend significant effort and resources gathering human annotations. At some point, LLMs may be able to perform some simple annotation tasks, but studies of LLM annotation accuracy and behavior are sparse. In this paper, we characterize OpenAI’s GPT-3.5’s judgment on a behavioral task for implicit object categorization. We characterize the embedding spaces of models trained on human vs. GPT responses and give similarities and differences between them, finding many similar dimensions. We also find that despite these similar dimensions, augmenting humans’ responses with GPT ones drives model divergence across the sizes of datasets tested. 
+ 2024.findings-naacl.242 + 2024.findings-naacl.242.copyright.pdf + mcknight-fyshe-2024-characterizing + + + Self-Demos: Eliciting Out-of-Demonstration Generalizability in Large Language Models + WeiHeFudan University + ShichunLiu + JunZhao + YiwenDing + YiLu + ZhihengXi + TaoGuiFudan University + QiZhangFudan University + XuanjingHuangFudan University + 3829-3845 + Large language models (LLMs) have shown promising abilities of in-context learning (ICL), adapting swiftly to new tasks with only few-shot demonstrations. However, current few-shot methods heavily depend on high-quality, query-specific demos, which are often lacking. When faced with out-of-demonstration (OOD) queries, methods that rely on hand-crafted demos or external retrievers might fail. To bridge the gap between limited demos and OOD queries, we propose Self-Demos, a novel prompting method that elicits the inherent generalizability in LLMs by query-aware demo generation. The generated demos strategically interpolate between existing demos and the given query, transforming the query from OOD to ID. To evaluate the effectiveness of our approach, we manually constructed OOD-Toolset, a dataset in the tool-using scenario with over 300 real-world APIs and 1000 instances, each consisting of three tool-use cases as demos and an OOD query. Thorough experiments on our dataset and two public math benchmarks have shown that our method can outperform state-of-the-art baselines in the OOD setting. Moreover, we conduct a range of analyses to validate Self-Demos’s generalization and provide more insights. + 2024.findings-naacl.243 + 2024.findings-naacl.243.copyright.pdf + he-etal-2024-self + + + Getting Sick After Seeing a Doctor? 
Diagnosing and Mitigating Knowledge Conflicts in Event Temporal Reasoning + TianqingFang + ZhaoweiWangDepartment of Computer Science and Engineering, Hong Kong University of Science and Technology + WenxuanZhouZoom + HongmingZhang + YangqiuSongThe Hong Kong University of Science and Technology + MuhaoChenUniversity of California, Davis and University of Southern California + 3846-3868 + Event temporal reasoning aims at identifying the temporal relations between two or more events from narratives. However, knowledge conflicts arise when there is a mismatch between the actual temporal relations of events in the context and the prior knowledge or biases learned by the model. In this paper, we propose to detect knowledge-conflict examples in event temporal reasoning using bias indicators, which include event relation prior bias, tense bias, narrative bias, and dependency bias. We define conflict examples as those where event relations are opposite to biased or prior relations. To mitigate event-related knowledge conflicts, we introduce a Counterfactual Data Augmentation (CDA) based method that can be applied to both Pre-trained Language Models (PLMs) and Large Language Models (LLMs) either as additional training data or demonstrations for In-Context Learning. Experiments suggest both PLMs and LLMs suffer from knowledge conflicts in event temporal reasoning, and CDA has the potential for reducing hallucination and improving model performance. 
+ 2024.findings-naacl.244 + 2024.findings-naacl.244.copyright.pdf + fang-etal-2024-getting + + + <fixed-case>MCECR</fixed-case>: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution + AmirPouran Ben Veyseh + VietLaiKensho Technologies + ChienNguyenUniversity of Oregon + FranckDernoncourtAdobe Systems + ThienNguyen, University of Oregon + 3869-3880 + Event coreference resolution (ECR) is a critical task in information extraction of natural language processing, aiming to identify and link event mentions across multiple documents. Despite recent progress, existing datasets for ECR primarily focus on within-document event coreference and English text, lacking cross-document ECR datasets for multiple languages beyond English. To address this issue, this work presents the first multilingual dataset for cross-document ECR, called MCECR (Multilingual Cross-Document Event Coreference Resolution), that manually annotates a diverse collection of documents for event mentions and coreference in five languages, i.e., English, Spanish, Hindi, Turkish, and Ukrainian. Using sampled articles from Wikinews over various topics as the seeds, our dataset fetches related news articles from the Google search engine to increase the number of non-singleton event clusters. In total, we annotate 5,802 news articles, providing a substantial and varied dataset for multilingual ECR in both within-document and cross-document scenarios. Extensive analysis of the proposed dataset reveals the challenging nature of multilingual event coreference resolution tasks, promoting MCECR as a strong benchmark dataset for future research in this area. 
+ 2024.findings-naacl.245 + 2024.findings-naacl.245.copyright.pdf + pouran-ben-veyseh-etal-2024-mcecr + + + Sentiment Analysis in the Era of Large Language Models: A Reality Check + WenxuanZhang + YueDengSchool of Computer Science and Engineering, Nanyang Technological University + BingLiuUniversity of Illinois at Chicago + SinnoPanNanyang Technological University and The Chinese University of Hong Kong + LidongBingAlibaba Group + 3881-3906 + Sentiment analysis (SA) has been a long-standing research area in natural language processing. With the recent advent of large language models (LLMs), there is great potential for their employment on SA problems. However, the extent to which current LLMs can be leveraged for different sentiment analysis tasks remains unclear. This paper aims to provide a comprehensive investigation into the capabilities of LLMs in performing various sentiment analysis tasks, from conventional sentiment classification to aspect-based sentiment analysis and multifaceted analysis of subjective texts. We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets. Our study reveals that while LLMs demonstrate satisfactory performance in simpler tasks, they lag behind in more complex tasks requiring a deeper understanding of specific sentiment phenomena or structured sentiment information. However, LLMs significantly outperform SLMs in few-shot learning settings, suggesting their potential when annotation resources are limited. We also highlight the limitations of current evaluation practices in assessing LLMs’ SA abilities and propose a novel benchmark, SentiEval, for a more comprehensive and realistic evaluation. Data and code are available at https://github.com/DAMO-NLP-SG/LLM-Sentiment. 
+ 2024.findings-naacl.246 + 2024.findings-naacl.246.copyright.pdf + zhang-etal-2024-sentiment + + + Tokenizer Choice For <fixed-case>LLM</fixed-case> Training: Negligible or Crucial? + MehdiAliFraunhofer Institute IAIS, Fraunhofer IAIS + MichaelFrommFraunhofer Institute IAIS, Fraunhofer IAIS + KlaudiaThellmannTU Dresden + RichardRutmannFraunhofer Institute IAIS, Fraunhofer IAIS + MaxLübberingFraunhofer IAIS + JohannesLevelingFraunhofer Institute IAIS, Fraunhofer IAIS + KatrinKlugFraunhofer Institute IAIS, Fraunhofer IAIS + JanEbertForschungszentrum Jülich GmbH + NiclasDollFraunhofer Institute IAIS, Fraunhofer IAIS + JasperBuschhoffFraunhofer Institute IAIS, Fraunhofer IAIS + CharviJain + AlexanderWeberFraunhofer Institute IAIS, Fraunhofer IAIS + LenaJurkschatTechnische Universität Dresden + HammamAbdelwahabFraunhofer Institute IAIS, Fraunhofer IAIS + ChelseaJohnForschungszentrum Juelich GmbH + PedroOrtiz SuarezCommon Crawl Foundation + MalteOstendorffGerman Research Center for AI + SamuelWeinbachAleph Alpha GmbH + RafetSifaRheinische Friedrich-Wilhelms Universität Bonn + StefanKesselheimForschungszentrum Jülich + NicolasFlores-HerrMax-Planck Institute and Fraunhofer Institute IAIS, Fraunhofer IAIS + 3907-3924 + The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. 
In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model’s downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multi-lingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary. + 2024.findings-naacl.247 + 2024.findings-naacl.247.copyright.pdf + ali-etal-2024-tokenizer + + + Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue + JunkaiZhou + LiangPangInstitute of Computing Technology, Chinese Academy of Sciences + HuaweiShenInstitute of Computing Technology, Chinese Academy of Sciences + XueqiCheng, Chinese Academy of Sciences + 3925-3951 + The emergence of large language models (LLMs) further improves the capabilities of open-domain dialogue systems and can generate fluent, coherent, and diverse responses. However, LLMs still lack a crucial ability: communication skills. This limitation renders them more like information seeking tools rather than anthropomorphic chatbots. Communication skills, such as topic transition, proactively asking questions, concept guidance, empathy, and summarising often should be taken into consideration, to make LLMs more anthropomorphic and proactive during the conversation, thereby increasing the interest of users and attracting them to chat for longer. However, enabling these communication skills in black-box LLMs remains a key challenge because they do not have the same utterance formation mode as real people: think before speaking. 
Inspired by linguistics and cognitive science, we empower LLMs with communication skills through inner monologues. To evaluate various communication skills, we construct a benchmark named Cskills, which can also more comprehensively evaluate the dialogue generation ability of the model. Experimental results show that the proposed CSIM strategy improves the backbone models and outperforms the baselines. + 2024.findings-naacl.248 + 2024.findings-naacl.248.copyright.pdf + zhou-etal-2024-think + + + The Impact of Differential Privacy on Group Disparity Mitigation + VictorHansen + AtulaNeerkajeUniversity of Texas at Austin + RamitSawhneyGeorgia Institute of Technology + LucieFlekRheinische Friedrich-Wilhelms Universität Bonn + AndersSøgaardCopenhagen University + 3952-3965 + The performance cost of differential privacy has, for some applications, been shown to be higher for minority groups; fairness, conversely, has been shown to disproportionally compromise the privacy of members of such groups. Most work in this area has been restricted to computer vision and risk assessment. In response, we evaluate the impact of differential privacy on fairness across four diverse tasks, focusing on how attempts to mitigate privacy violations and between-group performance differences interact: Does privacy inhibit attempts to ensure fairness? To this end, we train (\varepsilon,\delta)-differentially private models with empirical risk minimization and group distributionally robust training objectives. Consistent with previous findings, we find that differential privacy increases between-group performance differences in the baseline setting; more interestingly, differential privacy reduces between-group performance differences in the robust setting. We explain this by interpreting differential privacy as regularization. 
+ 2024.findings-naacl.249 + 2024.findings-naacl.249.copyright.pdf + hansen-etal-2024-impact + + + Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning + ShivamMhaskarRakuten Mobile, Inc. + NirmeshShahSony Research India + MohammadiZakiSony Research India, Bangalore + AshishkumarGudmalwar + PankajWasnikSony Research India + RajivShahIndraprastha Institute of Information Technology, Delhi + 3966-3976 + Traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely, Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric-NMT algorithms are employed to regulate the length of the synthesized output text. This is done to guarantee synchronization with respect to the alignment of video and audio subsequent to the dubbing process. Previous approaches have focused on aligning the number of characters and words in the source and target language texts of Machine Translation models. However, our approach aims to align the number of phonemes instead, as they are closely associated with speech duration. In this paper, we present the development of an isometric NMT system using Reinforcement Learning (RL), with a focus on optimizing the alignment of phoneme counts in the source and target language sentence pairs. To evaluate our models, we propose the Phoneme Count Compliance (PCC) score, which is a measure of length compliance. Our approach demonstrates a substantial improvement of approximately 36% in the PCC score compared to the state-of-the-art models when applied to English-Hindi language pairs. Moreover, we propose a student-teacher architecture within the framework of our RL approach to maintain a trade-off between the phoneme count and translation quality. 
+ 2024.findings-naacl.250 + 2024.findings-naacl.250.copyright.pdf + mhaskar-etal-2024-isometric + + + Read between the lines - Functionality Extraction From <fixed-case>README</fixed-case>s + PrinceKumarInternational Business Machines + SrikanthTamilselvamInternational Business Machines + DineshGarg + 3977-3990 + While text summarization is a well-known NLP task, in this paper, we introduce a novel and useful variant of it called functionality extraction from Git README files. Though this task is a text2text generation at an abstract level, it involves its own peculiarities and challenges making existing text2text generation systems not very useful. The motivation behind this task stems from a recent surge in research and development activities around the use of large language models for code-related tasks, such as code refactoring, code summarization, etc. We also release a human-annotated dataset called FuncRead, and develop a battery of models for the task. Our exhaustive experimentation shows that small size fine-tuned models beat any baseline models that can be designed using popular black-box or white-box large language models (LLMs) such as ChatGPT and Bard. Our best fine-tuned 7 Billion CodeLlama model exhibits 70% and 20% gain on the F1 score against ChatGPT and Bard respectively. 
+ 2024.findings-naacl.251 + 2024.findings-naacl.251.copyright.pdf + kumar-etal-2024-read + + + <fixed-case>A</fixed-case>bs<fixed-case>P</fixed-case>yramid: Benchmarking the Abstraction Ability of Language Models with a Unified Entailment Graph + ZhaoweiWangDepartment of Computer Science and Engineering, Hong Kong University of Science and Technology + HaochenShi + WeiqiWangThe Hong Kong University of Science and Technology + TianqingFang + HongmingZhang + SehyunChoiDepartment of Computer Science and Engineering, Hong Kong University of Science and Technology + XinLiuAmazon + YangqiuSongThe Hong Kong University of Science and Technology + 3991-4010 + Cognitive research indicates that abstraction ability is essential in human intelligence, which remains under-explored in language models. In this paper, we present AbsPyramid, a unified entailment graph of 221K textual descriptions of abstraction knowledge. While existing resources only touch nouns or verbs within simplified events or specific domains, AbsPyramid collects abstract knowledge for three components of diverse events to comprehensively evaluate the abstraction ability of language models in the open domain. Experimental results demonstrate that current LLMs face challenges comprehending abstraction knowledge in zero-shot and few-shot settings. By training on our rich abstraction knowledge, we find LLMs can acquire basic abstraction abilities and generalize to unseen events. In the meantime, we empirically show that our benchmark is comprehensive to enhance LLMs across two previous abstraction tasks. 
+ 2024.findings-naacl.252 + 2024.findings-naacl.252.copyright.pdf + wang-etal-2024-abspyramid + + + Few-<fixed-case>TK</fixed-case>: A Dataset for Few-shot Scientific Typed Keyphrase Recognition + AvishekLahiriIndian Association for the Cultivation of Science + PratyaySarkarIndian Association for the Cultivation of Science + MedhaSen + DebarshiSanyalIndian Association for the Cultivation of Science + ImonMukherjee + 4011-4025 + Scientific texts are distinctive from ordinary texts in quite a few aspects like their vocabulary and discourse structure. Consequently, Information Extraction (IE) tasks for scientific texts come with their own set of challenges. The classical definition of Named Entities restricts the inclusion of all scientific terms under its hood, which is why previous works have used the terms Named Entities and Keyphrases interchangeably. We suggest the rechristening of Named Entities for the scientific domain as Typed Keyphrases (TK), broadening their scope. We advocate for exploring this task in the few-shot domain due to the scarcity of labeled scientific IE data. Currently, no dataset exists for few-shot scientific Typed Keyphrase Recognition. To address this gap, we develop an annotation schema and present Few-TK, a dataset in the AI/ML field that includes scientific Typed Keyphrase annotations on abstracts of 500 research papers. To the best of our knowledge, this is the introductory few-shot Typed Keyphrase recognition dataset and only the second dataset structured specifically for few-shot NER, after Few-NERD. We report the results of several few-shot sequence-labelling models applied to our dataset. 
The data and code are available at https://github.com/AvishekLahiri/Few_TK.git + 2024.findings-naacl.253 + 2024.findings-naacl.253.copyright.pdf + lahiri-etal-2024-tk + + + Language Models can be Deductive Solvers + JiazhanFeng + RuochenXuMicrosoft + JunhengHaoMicrosoft + HiteshiSharma + YelongShenMicrosoft + DongyanZhaoPeking University + WeizhuChenMicrosoft GenAI + 4026-4042 + Logical reasoning is a fundamental aspect of human intelligence and a key component of tasks like problem-solving and decision-making. Recent advancements have enabled Large Language Models (LLMs) to potentially exhibit reasoning capabilities, but complex logical reasoning remains a challenge. The state-of-the-art, solver-augmented language models, use LLMs to parse natural language logical questions into symbolic representations first and then adopt external logical solvers to take in the symbolic representations and output the answers. Despite their impressive performance, any parsing errors will inevitably result in the failure of the execution of external logical solvers and no answer to the logical questions. In this paper, we introduce LoGiPT, a novel language model that directly internalizes and emulates the reasoning processes of logical solvers and avoids parsing errors by learning strict adherence to solver syntax and grammar. LoGiPT is fine-tuned on a newly constructed instruction-tuning dataset derived from revealing and refining the invisible reasoning process of deductive solvers. Experimental results on two public deductive reasoning benchmarks show that LoGiPT outperforms state-of-the-art solver-augmented LMs and few-shot prompting methods on competitive LLMs like GPT-4. This project is available in https://github.com/Cyril-JZ/LoGiPT. 
+ 2024.findings-naacl.254 + 2024.findings-naacl.254.copyright.pdf + feng-etal-2024-language + + + Interpreting User Requests in the Context of Natural Language Standing Instructions + NikitaMoghe + PatrickXiaMicrosoft + JacobAndreasMassachusetts Institute of Technology and Microsoft + JasonEisnerMicrosoft and Johns Hopkins University + BenjaminVan DurmeJohns Hopkins University, Johns Hopkins University, Johns Hopkins University and Microsoft + HarshJhamtaniMicrosoft + 4043-4060 + Users of natural language interfaces, frequently powered by Large Language Models (LLMs), must often repeat their full set of preferences each time they make a similar request. We describe an approach to LLM-based dialogue modeling in which persistent user constraints and preferences – collectively termed standing instructions – are provided as additional context for such interfaces. For example, when a user states “I’m hungry”, a previously expressed preference for Persian food can be automatically added to the LLM prompt, influencing the search for relevant restaurants. We develop NLSI, a language-to-program dataset consisting of over 2.4K English dialogues spanning 17 domains, in which each dialogue is paired with a user profile (a set of user-specific standing instructions) and corresponding structured representations (a sequence of API calls). A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue. NLSI contains diverse phenomena, from simple preferences to interdependent instructions such as triggering a hotel search whenever the user is booking tickets to an event. We conduct experiments on NLSI using prompting with large language models and various retrieval approaches, achieving a maximum of 46% exact match on API prediction. 
Our results demonstrate the challenges in identifying the relevant standing instructions and their interpretation into API calls. + 2024.findings-naacl.255 + 2024.findings-naacl.255.copyright.pdf + moghe-etal-2024-interpreting + + + Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models + RuixiangTang + Yu-NengChuangRice University + XuantingCai + MengnanDuNew Jersey Institute of Technology + XiaHuRice University + 4061-4073 + Large language models (LLMs) have notably revolutionized many domains within natural language processing due to their exceptional performance. Their security has become increasingly vital. This study is centered on protecting LLMs against unauthorized access and potential theft. We propose a simple yet effective protective measure wherein a unique key prompt is embedded within the LLM. This mechanism enables the model to respond only when presented with the correct key prompt; otherwise, LLMs will refuse to react to any input instructions. This key prompt protection offers a robust solution to prevent the unauthorized use of LLMs, as the model becomes unusable without the correct key. We evaluated the proposed protection on multiple LLMs and NLP tasks. Results demonstrate that our method can successfully protect the LLM without significantly impacting the model’s original function. Moreover, we demonstrate potential attacks that attempt to bypass the protection mechanism will adversely affect the model’s performance, further emphasizing the effectiveness of the proposed protection method. 
+ 2024.findings-naacl.256 + 2024.findings-naacl.256.copyright.pdf + tang-etal-2024-secure + + + Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models + JiashuoSun + YiLuo + YeyunGong + ChenLinXiamen University + YelongShenMicrosoft + JianGuoHong Kong University of Science and Technology + NanDuanMicrosoft Research Asia + 4074-4101 + Large language models (LLMs) can achieve impressive performance on various reasoning tasks by incorporating chain-of-thought (CoT) prompting, where step-by-step reasoning is provided to guide LLMs to generate answers to questions, and the question-rationale-answer triplets are utilized as demonstration exemplars. However, the reasoning chains of demonstrations generated by LLMs are observed to be prone to errors, which can subsequently lead to incorrect reasoning during inference. Furthermore, inappropriate exemplars, e.g., overly simplistic or complex exemplars depending on the question’s difficulty level, can affect the LLM’s performance. To address these issues, we introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts prompting). Iter-CoT has two advantages: (1) it adopts iterative bootstrapping that enables LLMs to rectify errors autonomously, resulting in more precise and comprehensive reasoning chains. (2) it selects exemplars of challenging yet answerable (i.e., the LLM has the potential to answer correctly) questions, enhancing the LLMs’ generalizability to answer questions with varying difficulty levels. Experimental results exhibit Iter-CoT superior performance on three distinct reasoning tasks on ten datasets. + 2024.findings-naacl.257 + 2024.findings-naacl.257.copyright.pdf + sun-etal-2024-enhancing + + + Do Prompt Positions Really Matter? 
+ JunyuMaoUniversity of Southampton + StuartMiddletonUniversity of Southampton + MahesanNiranjanUniversity of Southampton + 4102-4130 + Prompt-based models have gathered a lot of attention from researchers due to their remarkable advancements in the fields of zero-shot and few-shot learning. Developing an effective prompt template plays a critical role. However, prior studies have mainly focused on prompt vocabulary searching or embedding initialization within a predefined template with the prompt position fixed. In this empirical study, we conduct the most comprehensive analysis to date of prompt position for diverse Natural Language Processing (NLP) tasks. Our findings quantify the substantial impact prompt position has on model performance. We observe that the prompt positions used in prior studies are often sub-optimal, and this observation is consistent even in widely used instruction-tuned models. These findings suggest prompt position optimisation as a valuable research direction to augment prompt engineering methodologies and prompt position-aware instruction tuning as a potential way to build more robust models in the future. + 2024.findings-naacl.258 + 2024.findings-naacl.258.copyright.pdf + mao-etal-2024-prompt + + + Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning + TianhuaZhang + JiaxinGe + HongyinLuoMassachusetts Institute of Technology + Yung-SungChuangMassachusetts Institute of Technology + MingyeGao + YuanGongMassachusetts Institute of Technology + YoonKimMassachusetts Institute of Technology + XixinWuThe Chinese University of Hong Kong + HelenMengThe Chinese University of Hong Kong + JamesGlass + 4131-4155 + How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning? We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks. 
Our approach prompts a language model to generate full Python programs that define functions over data structures which contain natural language representations of structured knowledge. A Python interpreter then executes the generated code and prints the output. Despite using a task-general prompt, we find that this approach can improve upon strong baselines across a range of different tasks including math and symbolic reasoning, text classification, question answering, and instruction following. We found that the generated programs are interpretable since they outline the exact reasoning process followed by the program interpreter. + 2024.findings-naacl.259 + 2024.findings-naacl.259.copyright.pdf + zhang-etal-2024-natural + + + A Study on Scaling Up Multilingual News Framing Analysis + Syeda SabrinaAkterGeorge Mason University + AntoniosAnastasopoulosAthena Research Center and George Mason University + 4156-4173 + Media framing is the study of strategically selecting and presenting specific aspects of political issues to shape public opinion. Despite its relevance to almost all societies around the world, research has been limited due to the lack of available datasets and other resources. This study explores the possibility of dataset creation through crowdsourcing, utilizing non-expert annotators to develop training corpora. We first extend framing analysis beyond English news to a multilingual context (12 typologically diverse languages) through automatic translation. We also present a novel benchmark in Bengali and Portuguese on the immigration and same-sex marriage domains. Additionally, we show that a system trained on our crowd-sourced dataset, combined with other existing ones, leads to a 5.32 percentage point increase from the baseline, showing that crowdsourcing is a viable option. 
Last, we study the performance of large language models (LLMs) for this task, finding that task-specific fine-tuning is a better approach than employing bigger non-specialized models. + 2024.findings-naacl.260 + 2024.findings-naacl.260.copyright.pdf + akter-anastasopoulos-2024-study + + + <fixed-case>V</fixed-case>i<fixed-case>GLUE</fixed-case>: A <fixed-case>V</fixed-case>ietnamese General Language Understanding Benchmark and Analysis of <fixed-case>V</fixed-case>ietnamese Language Models + Minh-NamTran + Phu-VinhNguyen + LongNguyenHo Chi Minh city University of Science, Vietnam National University + DienDinh + 4174-4189 + As the number of language models has increased, various benchmarks have been suggested to assess the proficiency of the models in natural language understanding. However, there is a lack of such a benchmark in Vietnamese due to the difficulty in accessing natural language processing datasets or the scarcity of task-specific datasets. **ViGLUE**, the proposed dataset collection, is a **Vi**etnamese **G**eneral **L**anguage **U**nderstanding **E**valuation benchmark developed using three methods: translating an existing benchmark, generating new corpora, and collecting available datasets. ViGLUE contains twelve tasks and encompasses over ten areas and subjects, enabling it to evaluate models comprehensively over a broad spectrum of aspects. Baseline models utilizing multilingual language models are also provided for all tasks in the proposed benchmarks. In addition, the study of the available Vietnamese large language models is conducted to explore the language models’ ability in the few-shot learning framework, leading to the exploration of the relationship between specific tasks and the number of shots. 
+ 2024.findings-naacl.261 + 2024.findings-naacl.261.copyright.pdf + tran-etal-2024-viglue + + + Exploring the Trade-off Between Model Performance and Explanation Plausibility of Text Classifiers Using Human Rationales + LucasResckFundação Getulio Vargas + MarcosM. RaimundoUniversidade Estadual de Campinas + JorgePocoFundação Getulio Vargas + 4190-4216 + Saliency post-hoc explainability methods are important tools for understanding increasingly complex NLP models. While these methods can reflect the model’s reasoning, they may not align with human intuition, making the explanations not plausible. In this work, we present a methodology for incorporating rationales, which are text annotations explaining human decisions, into text classification models. This incorporation enhances the plausibility of post-hoc explanations while preserving their faithfulness. Our approach is agnostic to model architectures and explainability methods. We introduce the rationales during model training by augmenting the standard cross-entropy loss with a novel loss function inspired by contrastive learning. By leveraging a multi-objective optimization algorithm, we explore the trade-off between the two loss functions and generate a Pareto-optimal frontier of models that balance performance and plausibility. Through extensive experiments involving diverse models, datasets, and explainability methods, we demonstrate that our approach significantly enhances the quality of model explanations without causing substantial (sometimes negligible) degradation in the original model’s performance. 
+ 2024.findings-naacl.262 + 2024.findings-naacl.262.copyright.pdf + resck-etal-2024-exploring + + + Unlocking Parameter-Efficient Fine-Tuning for Low-Resource Language Translation + TongSu + XinPeng + SarubiThillainathanUniversität des Saarlandes + DavidGuzmán + SurangikaRanathungaMassey University + En-ShiunLee + 4217-4225 + Parameter-efficient fine-tuning (PEFT) methods are increasingly vital in adapting large-scale pre-trained language models for diverse tasks, offering a balance between adaptability and computational efficiency. They are important in Low-Resource Language (LRL) Neural Machine Translation (NMT) to enhance translation accuracy with minimal resources. However, their practical effectiveness varies significantly across different languages. We conducted comprehensive empirical experiments with varying LRL domains and sizes to evaluate the performance of 8 PEFT methods with in total of 15 architectures using the SacreBLEU score. We showed that 6 PEFT architectures outperform the baseline for both in-domain and out-domain tests and the Houlsby+Inversion adapter has the best performance overall, proving the effectiveness of PEFT methods. + 2024.findings-naacl.263 + 2024.findings-naacl.263.copyright.pdf + su-etal-2024-unlocking + + + <fixed-case>AD</fixed-case>a<fixed-case>PT</fixed-case>: As-Needed Decomposition and Planning with Language Models + ArchikiPrasad + AlexanderKollerSaarland University + MareikeHartmannUniversität des Saarlandes + PeterClarkAllen Institute for Artificial Intelligence + AshishSabharwalAllen Institute for Artificial Intelligence + MohitBansalUniversity of North Carolina at Chapel Hill + TusharKhotAllen Institute for Artificial Intelligence + 4226-4252 + Large Language Models (LLMs) are increasingly being used for interactive decision-making tasks requiring planning and adapting to the environment. 
Recent works employ LLMs-as-agents in broadly two ways: iteratively determining the next action (iterative executors) or generating plans and executing sub-tasks using LLMs (plan-and-execute). However, these methods struggle with task complexity, as the inability to execute any sub-task may lead to task failure. To address these shortcomings, we introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT), an approach that explicitly plans and decomposes complex sub-tasks as-needed, i.e., when the LLM is unable to execute them. ADaPT recursively decomposes sub-tasks to adapt to both task complexity and LLM capability. Our results demonstrate that ADaPT substantially outperforms established strong baselines, achieving success rates up to 28.3% higher in ALFWorld, 27% in WebShop, and 33% in TextCraft – a novel compositional dataset that we introduce. Through extensive analysis, we illustrate the importance of multilevel decomposition and establish that ADaPT dynamically adjusts to the capabilities of the executor LLM as well as to task complexity. + 2024.findings-naacl.264 + 2024.findings-naacl.264.copyright.pdf + prasad-etal-2024-adapt + + + Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations + DayeonKiUniversity of Maryland, College Park + MarineCarpuatUniversity of Maryland, College Park + 4253-4273 + Machine Translation (MT) remains one of the last NLP tasks where large language models (LLMs) have not yet replaced dedicated supervised systems. This work exploits the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on its quality, derived from Multidimensional Quality Metric (MQM) annotations. Working with LLaMA-2 models, we consider prompting strategies varying the nature of feedback provided and then fine-tune the LLM to improve its ability to exploit the provided guidance. 
Through experiments on Chinese-English, English-German, and English-Russian MQM data, we demonstrate that prompting LLMs to post-edit MT improves TER, BLEU and COMET scores, although the benefits of fine-grained feedback are not clear. Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality based on both automatic and human evaluation. + 2024.findings-naacl.265 + 2024.findings-naacl.265.copyright.pdf + ki-carpuat-2024-guiding + + + Non-contrastive sentence representations via self-supervision + DuccioPappadopuloBloomberg + MarcoFarinaBloomberg + 4274-4284 + Sample contrastive methods, typically referred to simply as contrastive, are the foundation of most unsupervised methods to learn text and sentence embeddings. On the other hand, a different class of self-supervised non-contrastive loss functions and methods have been considered in the computer vision community and referred to as dimension contrastive. In this paper, we thoroughly compare this class of methods with the standard baseline for contrastive sentence embeddings, SimCSE. We find that self-supervised embeddings trained using dimension contrastive objectives can outperform SimCSE on downstream tasks without needing auxiliary loss functions. + 2024.findings-naacl.266 + 2024.findings-naacl.266.copyright.pdf + pappadopulo-farina-2024-non + + + Semantically-Prompted Language Models Improve Visual Descriptions + MichaelOgezi + BradleyHauerUniversity of Alberta + GrzegorzKondrakUniversity of Alberta + 4285-4302 + Language-vision models like CLIP have made significant strides in vision tasks, such as zero-shot image classification (ZSIC). However, generating specific and expressive visual descriptions remains challenging; descriptions produced by current methods are often ambiguous and lacking in granularity. To tackle these issues, we propose V-GLOSS: Visual Glosses, a novel method built upon two key ideas. 
The first is Semantic Prompting, which conditions a language model on structured semantic knowledge. The second is a new contrastive algorithm that elicits fine-grained distinctions between similar concepts. With both ideas, we demonstrate that V-GLOSS improves visual descriptions and achieves strong results in the zero-shot setting on general and fine-grained image-classification datasets, including ImageNet, STL-10, FGVC Aircraft, and Flowers 102. Moreover, these descriptive capabilities contribute to enhancing image-generation performance. Finally, we introduce a quality-tested silver dataset with descriptions generated with V-GLOSS for all ImageNet classes. + 2024.findings-naacl.267 + 2024.findings-naacl.267.copyright.pdf + ogezi-etal-2024-semantically + + + <fixed-case>G</fixed-case>en<fixed-case>TKG</fixed-case>: Generative Forecasting on Temporal Knowledge Graph with Large Language Models + RuotongLiao + XuJia + YangzheLiTechnische Universität München + YunpuMaSiemens Corporate Research + VolkerTrespLudwig Maximilian University of Munich and Siemens Corporate Research + 4303-4317 + The rapid advancements in large language models (LLMs) have ignited interest in the temporal knowledge graph (tKG) domain, where conventional embedding-based and rule-based methods dominate. The question remains open of whether pre-trained LLMs can understand structured temporal relational data and replace them as the foundation model for temporal relational forecasting. Therefore, we bring temporal knowledge forecasting into the generative setting. However, challenges occur in the huge chasms between complex temporal graph data structure and sequential natural expressions LLMs can handle, and between the enormous data sizes of tKGs and heavy computation costs of finetuning LLMs. 
To address these challenges, we propose a novel retrieval-augmented generation framework named GenTKG combining a temporal logical rule-based retrieval strategy and few-shot parameter-efficient instruction tuning to solve the above challenges, respectively. Extensive experiments have shown that GenTKG outperforms conventional methods of temporal relational forecasting with low computation resources using extremely limited training data as few as 16 samples. GenTKG also highlights remarkable cross-domain generalizability with outperforming performance on unseen datasets without re-training, and in-domain generalizability regardless of time split in the same dataset. Our work reveals the huge potential of LLMs in the tKG domain and opens a new frontier for generative forecasting on tKGs. The code and data are released here: https://github.com/mayhugotong/GenTKG. + 2024.findings-naacl.268 + 2024.findings-naacl.268.copyright.pdf + liao-etal-2024-gentkg + + + A Transformer with Stack Attention + JiaodaLiETHZ - ETH Zurich + JenniferWhite + MrinmayaSachanSwiss Federal Institute of Technology + RyanCotterellSwiss Federal Institute of Technology + 4318-4335 + Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in the modeling power of transformer-based language models, we propose augmenting them with a differentiable, stack-based attention mechanism. Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model. We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages. 
+ 2024.findings-naacl.269 + 2024.findings-naacl.269.copyright.pdf + li-etal-2024-transformer + + + <fixed-case>I</fixed-case>nstruct<fixed-case>E</fixed-case>val: Systematic Evaluation of Instruction Selection Methods + AnirudhAjith + ChrisPan + MengzhouXia + AmeetDeshpande + KarthikNarasimhanPrinceton University + 4336-4350 + In-context learning (ICL) performs tasks by prompting a large language model (LLM) using an instruction and a small set of annotated examples called demonstrations. Recent work has shown that precise details of the inputs used in the ICL prompt significantly impact performance, which has incentivized instruction selection algorithms. The effect of instruction-choice however is severely underexplored, with existing analyses restricted to shallow subsets of models and tasks, limiting the generalizability of their insights. We develop InstructEval, an ICL evaluation suite to conduct a thorough assessment of these techniques. The suite includes 13 open-sourced LLMs of varying scales from four model families, and covers nine tasks across three categories. Using the suite, we evaluate the relative performance of seven popular instruction selection methods over five metrics relevant to ICL. Our experiments reveal that using curated manually-written instructions or simple instructions without any task-specific descriptions often elicits superior ICL performance overall than that of automatic instruction-induction methods, pointing to a lack of generalizability among the latter. We release our evaluation suite (at https://github.com/princeton-nlp/InstructEval) for benchmarking instruction selection approaches and enabling more generalizable methods in this space. 
+ 2024.findings-naacl.270 + 2024.findings-naacl.270.copyright.pdf + ajith-etal-2024-instructeval + + + <fixed-case>R</fixed-case>ec<fixed-case>M</fixed-case>ind: Large Language Model Powered Agent For Recommendation + YanchengWangArizona State University + ZiyanJiangAmazon + ZhengChenAmazon + FanYangAmazon + YingxueZhouUniversity of Minnesota, Minneapolis + EunahCho + XingFan + YanbinLu + XiaojiangHuang + YingzhenYangArizona State University + 4351-4364 + While the recommendation system (RS) has advanced significantly through deep learning, current RS approaches usually train and fine-tune models on task-specific datasets, limiting their generalizability to new recommendation tasks and their ability to leverage external knowledge due to model scale and data size constraints. Thus, we designed an LLM-powered autonomous recommender agent, RecMind, which is capable of leveraging external knowledge, utilizing tools with careful planning to provide zero-shot personalized recommendations. We propose a Self-Inspiring algorithm to improve the planning ability. At each intermediate step, the LLM “self-inspires” to consider all previously explored states to plan for the next step. This mechanism greatly improves the model’s ability to comprehend and utilize historical information in planning for recommendation. We evaluate RecMind’s performance in various recommendation scenarios. Our experiment shows that RecMind outperforms existing zero/few-shot LLM-based recommendation baseline methods in various tasks and achieves comparable performance to a fully trained recommendation model P5. + 2024.findings-naacl.271 + 2024.findings-naacl.271.copyright.pdf + wang-etal-2024-recmind + + + <fixed-case>GOLD</fixed-case>: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation + MohsenGholami + MohammadAkbariHuawei Technologies Ltd. + TianxiHuHuawei Technologies Ltd. + VadenMasraniHuawei Technologies Ltd. 
+ Z.WangUniversity of British Columbia + YongZhangHuawei Technologies Ltd. + 4365-4380 + Knowledge distillation from LLMs is essential for the efficient deployment of language models. Prior works have proposed data generation using LLMs for preparing distilled models. We argue that generating data with LLMs is prone to sampling mainly from the center of original content distribution. This limitation hinders the distilled model from learning the true underlying data distribution and causes it to forget the tails of the distributions (samples with lower probability). To this end, we propose GOLD, a task-agnostic data generation and knowledge distillation framework, which employs an iterative out-of-distribution-guided feedback mechanism for the LLM. As a result, the generated data improves the generalizability of distilled models. An energy-based OOD evaluation approach is also introduced to deal with noisy generated data. Our extensive experiments on 10 different classification and sequence-to-sequence tasks in NLP show that GOLD respectively outperforms prior arts and the LLM with an average improvement of 5% and 14%. We will also show that the proposed method is applicable to less explored and novel tasks. Code is available in the Appendix. + 2024.findings-naacl.272 + 2024.findings-naacl.272.copyright.pdf + gholami-etal-2024-gold + + + How Lexical is Bilingual Lexicon Induction? + HarshKohliOhio State University, Columbus + HelianFengAmazon + NicholasDronenLightmatter + CalvinMcCarter + SinaMoeini + AliKebarighotbi + 4381-4386 + In contemporary machine learning approaches to bilingual lexicon induction (BLI), a model learns a mapping between the embedding spaces of a language pair. Recently, the retrieve-and-rank approach to BLI has achieved state of the art results on the task. However, the problem remains challenging in low-resource settings, due to the paucity of data. The task is complicated by factors such as lexical variation across languages. 
We argue that the incorporation of additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction. We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs. + 2024.findings-naacl.273 + 2024.findings-naacl.273.copyright.pdf + kohli-etal-2024-lexical + + + Fumbling in Babel: An Investigation into <fixed-case>C</fixed-case>hat<fixed-case>GPT</fixed-case>’s Language Identification Ability + Wei-RuiChenUniversity of British Columbia + IfeAdebaraUniversity of British Columbia + KhaiDoan + QishengLiao + MuhammadAbdul-MageedUniversity of British Columbia + 4387-4413 + ChatGPT has recently emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. To uncover which languages ChatGPT ‘knows’, we investigate its language identification (LID) abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 23 language families spoken in five continents. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource. We then study ChatGPT’s (both GPT-3.5 and GPT-4) ability to (i) identify language names and language codes (ii) under zero- and few-shot conditions (iii) with and without provision of a label set. When compared to smaller finetuned LID tools, we find that ChatGPT lags behind. For example, it has poor performance on African languages. We conclude that current large language models would benefit from further development before they can sufficiently serve diverse communities. 
+ 2024.findings-naacl.274 + 2024.findings-naacl.274.copyright.pdf + chen-etal-2024-fumbling + + + Targeted Augmentation for Low-Resource Event Extraction + SijiaWang + LifuHuangVirginia Tech + 4414-4428 + Addressing the challenge of low-resource information extraction remains an ongoing issue due to the inherent information scarcity within limited training examples. Existing data augmentation methods, considered potential solutions, struggle to strike a balance between weak augmentation (e.g., synonym augmentation) and drastic augmentation (e.g., conditional generation without proper guidance). This paper introduces a novel paradigm that employs targeted augmentation and back validation to produce augmented examples with enhanced diversity, polarity, accuracy, and coherence. Extensive experimental results demonstrate the effectiveness of the proposed paradigm. Furthermore, identified limitations are discussed, shedding light on areas for future improvement. + 2024.findings-naacl.275 + 2024.findings-naacl.275.copyright.pdf + wang-huang-2024-targeted + + + Asking More Informative Questions for Grounded Retrieval + SedrickKehToyota Research Institute + JustinChiuCornell University + DanielFriedCarnegie Mellon University + 4429-4442 + When a model is trying to gather information in an interactive setting, it benefits from asking informative questions. However, in the case of a grounded multi-turn image identification task, previous studies have been constrained to polar yes/no questions (White et al., 2021), limiting how much information the model can gain in a single turn. We present an approach that formulates more informative, open-ended questions. In doing so, we discover that off-the-shelf visual question answering (VQA) models often make presupposition errors, which standard information gain question selection methods fail to account for. 
To address this issue, we propose a method that can incorporate presupposition handling into both question selection and belief updates. Specifically, we use a two-stage process, where the model first filters out images which are irrelevant to a given question, then updates its beliefs about which image the user intends. Through self-play and human evaluations, we show that our method is successful in asking informative open-ended questions, increasing accuracy over the past state-of-the-art by 14%, while resulting in 48% more efficient games in human evaluations. + 2024.findings-naacl.276 + 2024.findings-naacl.276.copyright.pdf + keh-etal-2024-asking + + + Efficient Citer: Tuning Large Language Models for Enhanced Answer Quality and Verification + MarziehTahaei + ArefJafariUniversity of Waterloo and Huawei Technologies Ltd. + AhmadRashidUniversity of Waterloo + DavidAlfonso-Hermelo + KhalilBibiHuawei Technologies Ltd. + YimengWu + AliGhodsi + BoxingChenHuawei Technologies Ltd. + MehdiRezagholizadeh + 4443-4450 + In recent years, there has been a growing interest in utilizing external knowledge to reduce hallucinations in large language models (LLMs) and provide them with updated information. Despite this improvement, a major challenge lies in the lack of explicit citations, which hampers the ability to verify the information generated by these models. This paper focuses on providing models with citation capabilities efficiently. By constructing a dataset of citations, we train two model architectures: an FID-style FLAN-T5 model for efficient answer composition and a 13B model known for its success in instruction following after tuning. Evaluation on fluency, correctness, and citation quality is conducted through human assessment and the newly introduced Automatic LLMs’ Citation Evaluation (ALCE) benchmark. Results demonstrate significant improvements in answer quality and efficiency, surpassing the performance of the popular ChatGPT on some of the metrics. 
The models exhibit exceptional out-of-domain generalization in both human and automatic evaluation. Notably, the FID-style FLAN-T5 model with only 3B parameters performs impressively compared to the 13B model. + 2024.findings-naacl.277 + 2024.findings-naacl.277.copyright.pdf + tahaei-etal-2024-efficient + + + Addressing Healthcare-related Racial and <fixed-case>LGBTQ</fixed-case>+ Biases in Pretrained Language Models + SeanXie + SaeedHassanpourDartmouth College + SoroushVosoughiDartmouth College + 4451-4464 + Recent studies have highlighted the issue of Pretrained Language Models (PLMs) inadvertently propagating social stigmas and stereotypes, a critical concern given their widespread use. This is particularly problematic in sensitive areas like healthcare, where such biases could lead to detrimental outcomes. Our research addresses this by adapting two intrinsic bias benchmarks to quantify racial and LGBTQ+ biases in prevalent PLMs. We also empirically evaluate the effectiveness of various debiasing methods in mitigating these biases. Furthermore, we assess the impact of debiasing on both Natural Language Understanding and specific biomedical applications. Our findings reveal that while PLMs commonly exhibit healthcare-related racial and LGBTQ+ biases, the applied debiasing techniques successfully reduce these biases without compromising the models’ performance in downstream tasks. + 2024.findings-naacl.278 + 2024.findings-naacl.278.copyright.pdf + xie-etal-2024-addressing + + + <fixed-case>ATG</fixed-case>: Benchmarking Automated Theorem Generation for Generative Language Models + XiaohanLin + QingxingCaoSUN YAT-SEN UNIVERSITY, Tsinghua University + YinyaHuang + ZhichengYangHong Kong University of Science and Technology (Guangzhou) + ZhengyingLiuHuawei Technologies Ltd. 
+ ZhenguoLiDepartment of Computer Science and Engineering, Hong Kong University of Science and Technology and Huawei Noah’s Ark Lab + XiaodanLiang + 4465-4480 + Humans can develop new theorems to explore broader and more complex mathematical results. While current generative language models (LMs) have achieved significant improvement in automatically proving theorems, their ability to generate new or reusable theorems is still under-explored. Without the new theorems, current LMs struggle to prove harder theorems that are distant from the given hypotheses with the exponentially growing search space. More advanced theorem proving is if an agent (for instance, a generative LM) can leverage its creativity to generate new but also reasonable theorems that properly substitute part of a proof and also be saved as reusable knowledge for future theorem proving. Therefore, this paper proposes an Automated Theorem Generation (ATG) benchmark that evaluates whether an agent can automatically generate valuable (and possibly brand new) theorems that are applicable for downstream theorem proving as reusable knowledge. Specifically, we construct the ATG benchmark by splitting the Metamath library into three sets: axioms, library, and problem based on their proving depth. We conduct extensive experiments to investigate whether current LMs can generate theorems in the library and benefit the problem theorems proving. The results demonstrate that high-quality ATG data facilitates models’ performances on downstream ATP. However, there is still room for current LMs to develop better ATG and generate more advanced and human-like theorems. We hope the new ATG challenge can shed some light on advanced complex theorem proving. 
+ 2024.findings-naacl.279 + 2024.findings-naacl.279.copyright.pdf + lin-etal-2024-atg + + + Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization + YixinLiuYale University + AlexanderFabbriSalesForce.com + JiawenChen + YilunZhaoYale University + SimengHanYale University + ShafiqJotySalesForce.com and Nanyang Technological University + PengfeiLiu + DragomirRadevYale University + Chien-ShengWuSalesforce AI + ArmanCohanYale University and Allen Institute for Artificial Intelligence + 4481-4501 + While large language models (LLMs) can already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluations of five LLM-based systems to assess their instruction-following capabilities in controllable summarization. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) no LLM-based evaluation methods can achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation capabilities. We make our collected benchmark InstruSum publicly available to facilitate future research in this direction. 
+ 2024.findings-naacl.280 + 2024.findings-naacl.280.copyright.pdf + liu-etal-2024-benchmarking + + + <fixed-case>N</fixed-case>euro<fixed-case>C</fixed-case>omparatives: Neuro-Symbolic Distillation of Comparative Knowledge + PhillipHowardIntel + JunlinWang + VasudevLalIntel + GadiSinger + YejinChoiDepartment of Computer Science, University of Washington + SwabhaSwayamdiptaUniversity of Southern California + 4502-4520 + Comparative knowledge (e.g., steel is stronger and heavier than styrofoam) is an essential component of our world knowledge, yet understudied in prior literature. In this paper, we harvest the dramatic improvements in knowledge capabilities of language models into a large-scale comparative knowledge base. While the ease of acquisition of such comparative knowledge is much higher from extreme-scale models like GPT-4, compared to their considerably smaller and weaker counterparts such as GPT-2, not even the most powerful models are exempt from making errors. We thus ask: to what extent are models at different scales able to generate valid and diverse comparative knowledge? We introduce NeuroComparatives, a novel framework for comparative knowledge distillation overgenerated from language models such as GPT-variants and LLaMA, followed by stringent filtering of the generated knowledge. Our framework acquires comparative knowledge between everyday objects, producing a corpus of up to 8.8M comparisons over 1.74M entity pairs - 10X larger and 30% more diverse than existing resources. Moreover, human evaluations show that NeuroComparatives outperform existing resources in terms of validity (up to 32% absolute improvement). Our acquired NeuroComparatives lead to performance improvements on five downstream tasks. We find that neuro-symbolic manipulation of smaller models offers complementary benefits to the currently dominant practice of prompting extreme-scale language models for knowledge distillation. 
+ 2024.findings-naacl.281 + 2024.findings-naacl.281.copyright.pdf + howard-etal-2024-neurocomparatives + + + Emotion-Anchored Contrastive Learning Framework for Emotion Recognition in Conversation + FangxuYu + JunjieGuoNanjing University + ZhenWuNanjing University + XinyuDaiNanjing University + 4521-4534 + Emotion Recognition in Conversation (ERC) involves detecting the underlying emotion behind each utterance within a conversation. Effectively generating representations for utterances remains a significant challenge in this task. Recent works propose various models to address this issue, but they still struggle with differentiating similar emotions such as excitement and happiness. To alleviate this problem, we propose an Emotion-Anchored Contrastive Learning (EACL) framework that can generate more distinguishable utterance representations for similar emotions. To achieve this, we utilize label encodings as anchors to guide the learning of utterance representations and design an auxiliary loss to ensure the effective separation of anchors for similar emotions. Moreover, an additional adaptation process is proposed to adapt anchors to serve as effective classifiers to improve classification performance. Across extensive experiments, our proposed EACL achieves state-of-the-art emotion recognition performance and exhibits superior performance on similar emotions. Our code is available at https://github.com/Yu-Fangxu/EACL. 
+ 2024.findings-naacl.282 + 2024.findings-naacl.282.copyright.pdf + yu-etal-2024-emotion + + + <fixed-case>SUQL</fixed-case>: Conversational Search over Structured and Unstructured Data with Large Language Models + ShichengLiuStanford University + JialiangXu + WesleyTjangnaka + SinaSemnaniStanford University + ChenYu + MonicaLamStanford University + 4535-4555 + While most conversational agents are grounded on either free-text or structured knowledge, many knowledge corpora consist of hybrid sources. This paper presents the first conversational agent that supports the full generality of hybrid data access for large knowledge corpora, through a language we developed called SUQL (Structured and Unstructured Query Language). Specifically, SUQL extends SQL with free-text primitives (SUMMARY and ANSWER), so information retrieval can be composed with structured data accesses arbitrarily in a formal, succinct, precise, and interpretable notation. With SUQL, we propose the first semantic parser, an LLM with in-context learning, that can handle hybrid data sources. Our in-context learning-based approach, when applied to the HybridQA dataset, comes within 8.9% Exact Match and 7.1% F1 of the SOTA, which was trained on 62K data samples. More significantly, unlike previous approaches, our technique is applicable to large databases and free-text corpora. We introduce a dataset consisting of crowdsourced questions and conversations on Yelp, a large, real restaurant knowledge base with structured and unstructured data. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a baseline based on linearization. 
+ 2024.findings-naacl.283 + 2024.findings-naacl.283.copyright.pdf + liu-etal-2024-suql + + + On Evaluating the Integration of Reasoning and Action in <fixed-case>LLM</fixed-case> Agents with Database Question Answering + LinyongNan + EllenZhang + WeijinZouLinkedIn + YilunZhaoYale University + WenfeiZhouNVIDIA + ArmanCohanYale University and Allen Institute for Artificial Intelligence + 4556-4579 + This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task requires LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings highlight that this task poses great challenges even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction. A key discovery is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, enhancing the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks. 
+ 2024.findings-naacl.284 + 2024.findings-naacl.284.copyright.pdf + nan-etal-2024-evaluating + + + <fixed-case>CARE</fixed-case>: Extracting Experimental Findings From Clinical Literature + AakankshaNaikAllen Institute for Artificial Intelligence and National Institutes of Health + BaileyKuehl + ErinBransomAllen Institute for Artificial Intelligence + DougDowneyAllen Institute for Artificial Intelligence and Northwestern University + TomHopeAllen Institute for Artificial Intelligence and Hebrew University, Hebrew University of Jerusalem + 4580-4596 + Extracting fine-grained experimental findings from literature can provide dramatic utility for scientific applications. Prior work has developed annotation schemas and datasets for limited aspects of this problem, failing to capture the real-world complexity and nuance required. Focusing on biomedicine, this work presents CARE—a new IE dataset for the task of extracting clinical findings. We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes, which unifies phenomena challenging for current IE systems such as discontinuous entity spans, nested relations, variable arity n-ary relations and numeric results in a single schema. We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports. We also demonstrate the generalizability of our schema to the computer science and materials science domains. We benchmark state-of-the-art IE systems on CARE, showing that even models such as GPT4 struggle. We release our resources to advance research on extracting and aggregating literature findings. 
+ 2024.findings-naacl.285 + 2024.findings-naacl.285.copyright.pdf + naik-etal-2024-care + + + Personalized Federated Learning for Text Classification with Gradient-Free Prompt Tuning + RuiWang + TongYuAdobe Research + RuiyiZhangAdobe Systems + SungchulKimAdobe Systems + RyanRossiAdobe Research + HandongZhaoAdobe Systems + JundaWu + SubrataMitraAdobe Systems + LinaYaoUniversity of New South Wales and CSIRO’s Data61 + RicardoHenaoDuke University and King Abdullah University of Science and Technology + 4597-4612 + In this paper, we study personalized federated learning for text classification with Pretrained Language Models (PLMs). We identify two challenges in efficiently leveraging PLMs for personalized federated learning: 1) Communication. PLMs are usually large in size, e.g., with hundreds of millions of parameters, inducing huge communication cost in a federated setting. 2) Local Training. Training with PLMs generally requires back-propagation, during which memory consumption can be several times that of the forward-propagation. This may not be affordable when the PLMs are trained locally on the clients that are resource constrained, e.g., mobile devices with limited access to memory resources. Additionally, the proprietary PLMs can be provided as concealed APIs, for which the back-propagation operations may not be available. In solving these, we propose a training framework that includes an approach of discrete local search for gradient-free local training, along with a compression mechanism inspired from the linear word analogy that allows communicating with discretely indexed tokens, thus significantly reducing the communication cost. Experiments show that our gradient-free framework achieves superior performance compared with baselines. 
+ 2024.findings-naacl.286 + 2024.findings-naacl.286.copyright.pdf + wang-etal-2024-personalized + + + <fixed-case>SGSH</fixed-case>: Stimulate Large Language Models with Skeleton Heuristics for Knowledge Base Question Generation + ShashaGuo + LiziLiaoSingapore Management University + JingZhang + YanlingWangZhongguancun Laboratory + CuipingLiRenmin University of China + HongChenRenmin University of China + 4613-4625 + Knowledge base question generation (KBQG) aims to generate natural language questions from a set of triplet facts extracted from KB. Existing methods have significantly boosted the performance of KBQG via pre-trained language models (PLMs) thanks to the richly endowed semantic knowledge. With the advance of pre-training techniques, large language models (LLMs) (e.g., GPT-3.5) undoubtedly possess much more semantic knowledge. Therefore, how to effectively organize and exploit the abundant knowledge for KBQG becomes the focus of our study. In this work, we propose SGSH — a simple and effective framework to Stimulate GPT-3.5 with Skeleton Heuristics to enhance KBQG. The framework incorporates “skeleton heuristics”, which provides more fine-grained guidance associated with each input to stimulate LLMs to generate optimal questions, encompassing essential elements like the question phrase and the auxiliary verb.More specifically, we devise an automatic data construction strategy leveraging ChatGPT to construct a skeleton training dataset, based on which we employ a soft prompting approach to train a BART model dedicated to generating the skeleton associated with each input.Subsequently, skeleton heuristics are encoded into the prompt to incentivize GPT-3.5 to generate desired questions. Extensive experiments demonstrate that SGSH derives the new state-of-the-art performance on the KBQG tasks. 
+ 2024.findings-naacl.287 + 2024.findings-naacl.287.copyright.pdf + guo-etal-2024-sgsh + + + Biomedical Entity Representation with Graph-Augmented Multi-Objective Transformer + AndreySakhovskiyKazan Federal University + NataliaSemenova + ArturKadurinArtificial Intelligence Research Institute and Kuban State University + ElenaTutubalinaKazan Federal University + 4626-4643 + Modern biomedical concept representations are mostly trained on synonymous concept names from a biomedical knowledge base, ignoring the inter-concept interactions and a concept’s local neighborhood in a knowledge base graph. In this paper, we introduce Biomedical Entity Representation with a Graph-Augmented Multi-Objective Transformer (BERGAMOT), which adopts the power of pre-trained language models (LMs) and graph neural networks to capture both inter-concept and intra-concept interactions from the multilingual UMLS graph. To obtain fine-grained graph representations, we introduce two additional graph-based objectives: (i) a node-level contrastive objective and (ii) the Deep Graph Infomax (DGI) loss, which maximizes the mutual information between a local subgraph and a high-level graph summary. We apply contrastive loss on textual and graph representations to make them less sensitive to surface forms and enable intermodal knowledge exchange. BERGAMOT achieves state-of-the-art results in zero-shot entity linking without task-specific supervision on 4 of 5 languages of the Mantra corpus and on 8 of 10 languages of the XL-BEL benchmark. + 2024.findings-naacl.288 + 2024.findings-naacl.288.copyright.pdf + sakhovskiy-etal-2024-biomedical + + + Cross-Lingual Summarization with Pseudo-Label Regularization + ThangLeVinAI Research + 4644-4677 + Cross-Lingual Summarization (XLS) aims to summarize a document in the source language into a condensed version in the target language, effectively removing language barriers for non-native readers. 
Previous approaches, however, have the same limitation that only a single reference (gold summary) is exploited during model training, making the base model exposed to an underrepresented hypothesis space since the actual number of possible hypotheses is exponentially large. To alleviate this problem, we present a study adopting pseudo-labels in regularizing standard cross-lingual summarization training. We investigate several components leading to the gains in regularization training with verified experiments involving 8 diverse languages from different families. Conclusively, we show that pseudo-labeling is a simple and effective approach that significantly improves over standard gold reference training in XLS. + 2024.findings-naacl.289 + 2024.findings-naacl.289.copyright.pdf + le-2024-cross + + + On the Way to Gentle <fixed-case>AI</fixed-case> Counselor: Politeness Cause Elicitation and Intensity Tagging in Code-mixed <fixed-case>H</fixed-case>inglish Conversations for Social Good + PriyanshuPriya + GopendraSingh + MauajamaFirdaus + JyotsnaAgrawal + AsifEkbalIndian Institute of Technology, Jodhpur + 4678-4696 + Politeness is a multifaceted concept influenced by individual perceptions of what is considered polite or impolite. With this objective, we introduce a novel task - Politeness Cause Elicitation and Intensity Tagging (PCEIT). This task focuses on conversations and aims to identify the underlying reasons behind the use of politeness and gauge the degree of politeness conveyed. To address this objective, we create HING-POEM, a new conversational dataset in Hinglish (a blend of Hindi and English) for mental health and legal counseling of crime victims. The rationale for the domain selection lies in the paramount importance of politeness in mental health and legal counseling of crime victims to ensure a compassionate and cordial atmosphere for them. 
We enrich the HING-POEM dataset by annotating it with politeness labels, politeness causal spans, and intensity values at the level of individual utterances. In the context of the introduced PCEIT task, we present PAANTH (Politeness CAuse ElicitAion and INtensity Tagging in Hinglish), a comprehensive framework based on Contextual Enhanced Attentive Convolution Transformer. We conduct extensive quantitative and qualitative evaluations to establish the effectiveness of our proposed approach using the newly constructed dataset. Our approach is compared against state-of-the-art baselines, and these analyses help demonstrate the superiority of our method. + 2024.findings-naacl.290 + 2024.findings-naacl.290.copyright.pdf + priya-etal-2024-way + + + Leveraging Summarization for Unsupervised Dialogue Topic Segmentation + AlekseiArtemiev + DaniilParinovYandex + AlexeyGrishanov + IvanBorisov + AlexeyVasilevSber, AI Lab + DaniilMuravetskii + AlekseyRezvykh + AlekseiGoncharovMIL Team + AndreySavchenkoSber AI Lab and HSE University + 4697-4704 + Traditional approaches to dialogue segmentation perform reasonably well on synthetic or written dialogues but suffer when dealing with spoken, noisy dialogs. In addition, such methods require careful tuning of hyperparameters. We propose to leverage a novel approach that is based on dialogue summaries. Experiments on different datasets showed that the new approach outperforms popular state-of-the-art algorithms in unsupervised topic segmentation and requires less setup. 
+ 2024.findings-naacl.291 + 2024.findings-naacl.291.copyright.pdf + artemiev-etal-2024-leveraging + + + <fixed-case>LL</fixed-case>a<fixed-case>MA</fixed-case>-Rider: Spurring Large Language Models to Explore the Open World + YichengFengPeking University + YuxuanWang + JiazhengLiuPeking University + SipengZhengBeijing Academy of Artificial Intelligence + ZongqingLuPeking University + 4705-4724 + Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments and try to align the LLMs’ knowledge with the world conditions. Nonetheless, the capacity of LLMs to continuously acquire environmental knowledge and adapt in an open world remains uncertain. In this paper, we propose an approach to spur LLMs to explore the open world, gather experiences, and learn to improve their task-solving capabilities. In this approach, a multi-round feedback-revision mechanism is utilized to encourage LLMs to actively select appropriate revision actions guided by feedback information from the environment. This facilitates exploration and enhances the model’s performance. Besides, we integrate sub-task relabeling to assist LLMs in maintaining consistency in sub-task planning and help the model learn the combinatorial nature between tasks, enabling it to complete a wider range of tasks through training based on the acquired exploration experiences. By evaluation in Minecraft, an open-ended sandbox world, we demonstrate that our approach LLaMA-Rider enhances the efficiency of the LLM in exploring the environment, and effectively improves the LLM’s ability to accomplish more tasks through fine-tuning with merely 1.3k instances of collected data, showing minimal training costs compared to the baseline using reinforcement learning. The code is available at https://github.com/PKU-RL/LLaMA-Rider. 
+ 2024.findings-naacl.292 + 2024.findings-naacl.292.copyright.pdf + feng-etal-2024-llama + + + Contrastive Learning as a Polarizer: Mitigating Gender Bias by Fair and Biased sentences + KyungminParkHankuk University of Foreign Studies + SihyunOh + DaehyunKim + JuaeKimHankuk University of Foreign Studies + 4725-4736 + Recently, language models have accelerated the improvement in natural language processing. However, recent studies have highlighted a significant issue: social biases inherent in training data can lead models to learn and propagate these biases. In this study, we propose a contrastive learning method for bias mitigation, utilizing anchor points to push further negatives and pull closer positives within the representation space. This approach employs stereotypical data as negatives and stereotype-free data as positives, enhancing debiasing performance. Our model attained state-of-the-art performance in the ICAT score on the StereoSet, a benchmark for measuring bias in models. In addition, we observed that effective debiasing is achieved through an awareness of biases, as evidenced by improved hate speech detection scores. The implementation code and trained models are available at https://github.com/HUFS-NLP/CL_Polarizer.git. 
+ 2024.findings-naacl.293 + 2024.findings-naacl.293.copyright.pdf + park-etal-2024-contrastive + + + <fixed-case>P</fixed-case>o<fixed-case>LLM</fixed-case>graph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics + DeruiZhu + DingfanChenCISPA, saarland university, saarland informatics campus + QingLiMohamed bin Zayed University of Artificial Intelligence + ZongxiongChenFraunhofer FOKUS + LeiMaThe University of Tokyo and University of Alberta + JensGrossklagsTechnische Universität München + MarioFritzCISPA Helmholtz Center for Information Security and Saarland University + 4737-4751 + Despite tremendous advancements in large language models (LLMs) over recent years, a notably urgent challenge for their practical deployment is the phenomenon of “hallucination”, where the model fabricates facts and produces non-factual statements. In response, we propose PoLLMgraph (a Polygraph for LLMs) as an effective model-based white-box detection and forecasting approach. PoLLMgraph distinctly differs from the large body of existing research that concentrates on addressing such challenges through black-box evaluations. In particular, we demonstrate that hallucination can be effectively detected by analyzing the LLM’s internal state transition dynamics during generation via tractable probabilistic models. Experimental results on various open-source LLMs confirm the efficacy of PoLLMgraph, outperforming state-of-the-art methods by a considerable margin, evidenced by over 20% improvement in AUC-ROC on common benchmarking datasets like TruthfulQA. Our work paves a new way for model-based white-box analysis of LLMs, motivating the research community to further explore, understand, and refine the intricate dynamics of LLM behaviors. 
+ 2024.findings-naacl.294 + 2024.findings-naacl.294.copyright.pdf + zhu-etal-2024-pollmgraph + + + Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval + JurajVladikaTechnische Universität München + FlorianMatthesTechnische Universität München + 4752-4763 + In today’s digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on using pre-selected and annotated evidence documents, thus making them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. By utilizing the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We modify different retrieval settings to observe their influence on the QA pipeline’s performance, including the number of retrieved documents, sentence selection process, the publication year of articles, and their number of citations. Our results reveal that cutting down on the amount of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, like managing evidence disagreement and crafting user-friendly explanations. 
+ 2024.findings-naacl.295 + 2024.findings-naacl.295.copyright.pdf + vladika-matthes-2024-improving + + + <fixed-case>D</fixed-case>ecoder<fixed-case>L</fixed-case>ens: Layerwise Interpretation of Encoder-Decoder Transformers + AnnaLangedijkUtrecht University (ICS), Utrecht University and University of Amsterdam + HoseinMohebbi + GabrieleSartiUniversity of Groningen + WillemZuidemaUniversity of Amsterdam + JaapJumelet + 4764-4780 + In recent years, several interpretability methods have been proposed to interpret the inner workings of Transformer models at different levels of precision and complexity.In this work, we propose a simple but effective technique to analyze encoder-decoder Transformers. Our method, which we name DecoderLens, allows the decoder to cross-attend representations of intermediate encoder activations instead of using the default final encoder output.The method thus maps uninterpretable intermediate vector representations to human-interpretable sequences of words or symbols, shedding new light on the information flow in this popular but understudied class of models.We apply DecoderLens to question answering, logical reasoning, speech recognition and machine translation models, finding that simpler subtasks are solved with high precision by low and intermediate encoder layers. + 2024.findings-naacl.296 + 2024.findings-naacl.296.copyright.pdf + langedijk-etal-2024-decoderlens + +
diff --git a/data/xml/2024.naacl.xml b/data/xml/2024.naacl.xml new file mode 100644 index 0000000000..5e6a38f054 --- /dev/null +++ b/data/xml/2024.naacl.xml @@ -0,0 +1,8122 @@ + + + + + Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) + KevinDuh + HelenaGomez + StevenBethard + Association for Computational Linguistics +
Mexico City, Mexico
+ June + 2024 + 2024.naacl-long + naacl + + + 2024.naacl-long.0 + naacl-2024-2024 + + + Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences + HongyiLiu + QingyunWangUniversity of Illinois, Urbana Champaign + PayamKarisaniUniversity of Illinois at Urbana-Champaign + HengJiUniversity of Illinois, Urbana-Champaign + 1-21 + Named entity recognition is a key component of Information Extraction (IE), particularly in scientific domains such as biomedicine and chemistry, where large language models (LLMs), e.g., ChatGPT, fall short. We investigate the applicability of transfer learning for enhancing a named entity recognition model trained in the biomedical domain (the source domain) to be used in the chemical domain (the target domain). A common practice for training such a model in a few-shot learning setting is to pretrain the model on the labeled source data, and then to finetune it on a handful of labeled target examples. In our experiments, we observed that such a model is prone to mislabeling the source entities, which can often appear in the text, as the target entities. To alleviate this problem, we propose a model to transfer the knowledge from the source domain to the target domain, but, at the same time, to project the source entities and target entities into separate regions of the feature space. This diminishes the risk of mislabeling the source entities as the target entities. Our model consists of two stages: 1) entity grouping in the source domain, which incorporates knowledge from annotated events to establish relations between entities, and 2) entity discrimination in the target domain, which relies on pseudo labeling and contrastive learning to enhance discrimination between the entities in the two domains. We conduct extensive experiments across three source and three target datasets, demonstrating that our method outperforms the baselines by up to 5% absolute value. 
Code, data, and resources are publicly available for research purposes: https://github.com/Lhtie/Bio-Domain-Transfer. + 2024.naacl-long.1 + 2024.naacl-long.1.copyright.pdf + liu-etal-2024-named + + + Text Diffusion Model with Encoder-Decoder Transformers for Sequence-to-Sequence Generation + HongyiYuan + ZhengYuanAlibaba Group + ChuanqiTanAlibaba Group + FeiHuangAlibaba Group + SongfangHuangAlibaba Group + 22-39 + The diffusion model, a new generative modeling paradigm, has achieved great success in image, audio, and video generation. However, considering the discrete categorical nature of the text, it is not trivial to extend continuous diffusion models to natural language. In this work, we propose SeqDiffuSeq, a text diffusion model, to approach sequence-to-sequence text generation with an encoder-decoder Transformer architecture. To improve the generation performance, SeqDiffuSeq is equipped with the self-conditioning technique and our newly proposed adaptive noise schedule technique. Self-conditioning enables SeqDiffuSeq to better use the predicted sequence information during the generation process. The adaptive noise schedule balances the difficulty of denoising across time steps at the token level. Experimental results illustrate the improved performance on five sequence-to-sequence generation tasks compared to other diffusion-based models regarding text quality and inference time. + 2024.naacl-long.2 + 2024.naacl-long.2.copyright.pdf + yuan-etal-2024-text + + + An Interactive Framework for Profiling News Media Sources + NikhilMehta + DanGoldwasserPurdue University, Purdue University and Purdue University + 40-58 + The recent rise of social media has led to the spread of large amounts of fake and biased news, content published with the intent to sway beliefs. 
While detecting and profiling the sources that spread this news is important to maintain a healthy society, it is challenging for automated systems. In this paper, we propose an interactive framework for news media profiling. It combines the strengths of graph-based news media profiling models, Pre-trained Large Language Models, and human insight to characterize the social context on social media. Experimental results show that with as few as 5 human interactions, our framework can rapidly detect fake and biased news media, even in the most challenging settings of emerging news events, where test data is unseen. + 2024.naacl-long.3 + 2024.naacl-long.3.copyright.pdf + mehta-goldwasser-2024-interactive + + + Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study + YinghaoLi + HaoruiWangGeorgia Institute of Technology + ChaoZhangGeorgia Institute of Technology + 59-81 + Large Language Models (LLMs) have shown remarkable proficiency in language understanding and have been successfully applied to a variety of real-world tasks through task-specific fine-tuning or prompt engineering. Despite these advancements, it remains an open question whether LLMs are fundamentally capable of reasoning and planning, or if they primarily rely on recalling and synthesizing information from their training data. In our research, we introduce a novel task, Minesweeper, specifically designed in a format unfamiliar to LLMs and absent from their training datasets. This task challenges LLMs to identify the locations of mines based on numerical clues provided by adjacent opened cells. Successfully completing this task requires an understanding of each cell’s state, discerning spatial relationships between the clues and mines, and strategizing actions based on logical deductions drawn from the arrangement of the cells. 
Our experiments, including trials with the advanced GPT-4 model, indicate that while LLMs possess the foundational abilities required for this task, they struggle to integrate these into a coherent, multi-step logical reasoning process needed to solve Minesweeper. These findings highlight the need for further research to understand the nature of reasoning capabilities in LLMs under similar circumstances, and to explore pathways towards more sophisticated AI reasoning and planning models. + 2024.naacl-long.4 + 2024.naacl-long.4.copyright.pdf + li-etal-2024-assessing-logical + + + <fixed-case>T</fixed-case>el<fixed-case>ME</fixed-case>: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation + TaeyangYun + HyunkukLimYonsei University + JeonghwanLee + MinSong + 82-95 + Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.
+ 2024.naacl-long.5 + 2024.naacl-long.5.copyright.pdf + yun-etal-2024-telme + + + Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries + SeanieLeeKorea Advanced Institute of Science & Technology + JianpengCheng + JorisDriesenApple + AlexandruCoca + AndersJohannsen + 96-111 + Few-shot dialogue state tracking (DST) with Large Language Models (LLM) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning. Previous works use raw dialogue context as search keys and queries, and a retriever is fine-tuned with annotated dialogues to achieve superior performance. However, the approach is less suited for scaling to new domains or new annotation languages, where fine-tuning data is unavailable. To address this problem, we handle the task of conversation retrieval based on text summaries of the conversations. An LLM-based conversation summarizer is adopted for query and key generation, which enables effective maximum inner product search. To avoid the extra inference cost brought by LLM-based conversation summarization, we further distill a light-weight conversation encoder which produces query embeddings without decoding summaries for test conversations. We validate our retrieval approach on MultiWOZ datasets with GPT-Neo-2.7B and LLaMA-7B/30B. The experimental results show a significant improvement over relevant baselines in real few-shot DST settings. + 2024.naacl-long.6 + 2024.naacl-long.6.copyright.pdf + lee-etal-2024-effective + + + Promptly Predicting Structures: The Return of Inference + MaitreyMehtaUniversity of Utah + ValentinaPyatkin + VivekSrikumarUniversity of Utah + 112-130 + Prompt-based methods have been used extensively across NLP to build zero- and few-shot label predictors. Many NLP tasks are naturally structured: that is, their outputs consist of multiple labels which constrain each other. Annotating data for such tasks can be cumbersome.
Can the promise of the prompt-based paradigm be extended to such structured outputs? In this paper, we present a framework for constructing zero- and few-shot linguistic structure predictors. Our key insight is that we can use structural constraints—and combinatorial inference derived from them—to filter out inconsistent structures predicted by large language models. We instantiated this framework on two structured prediction tasks, and five datasets. Across all cases, our results show that enforcing consistency not only constructs structurally valid outputs, but also improves performance over the unconstrained variants. + 2024.naacl-long.7 + 2024.naacl-long.7.copyright.pdf + mehta-etal-2024-promptly + + + On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-<fixed-case>SQL</fixed-case> + YutongShaoUniversity of California, San Diego + NdapaNakasholeUniversity of California, San Diego + 131-156 + Structured data, prevalent in tables, databases, and knowledge graphs, poses a significant challenge in its representation. With the advent of large language models (LLMs), there has been a shift towards linearization-based methods, which process structured data as sequential token streams, diverging from approaches that explicitly model structure, often as a graph. Crucially, there remains a gap in our understanding of how these linearization-based methods handle structured data, which is inherently non-linear. This work investigates the linear handling of structured data in encoder-decoder language models, specifically T5. Our findings reveal the model’s ability to mimic human-designed processes such as schema linking and syntax prediction, indicating a deep, meaningful learning of structure beyond simple token sequencing. We also uncover insights into the model’s internal mechanisms, including the ego-centric nature of structure node encodings and the potential for model compression due to modality fusion redundancy.
Overall, this work sheds light on the inner workings of linearization-based methods and could potentially provide guidance for future research. + 2024.naacl-long.8 + 2024.naacl-long.8.copyright.pdf + shao-nakashole-2024-linearizing + + + Extractive Summarization with Text Generator + ThangLeVinAI Research + Anh TuanLuuNanyang Technological University + 157-174 + Standard extractive systems suffer from the lack of gold training signals since existing corpora solely provide document and human-written summary pairs while disregarding extractive labels. As a result, existing methods resort to imperfect pseudo-labels that are both biased and error-prone, thereby hindering the learning process of extractive models. In contrast, text generators which are commonly employed in abstractive summarization can effortlessly overcome this predicament on account of flexible sequence-to-sequence architectures. Motivated to bypass this inherent limitation, we investigate the possibility of conducting extractive summarization with text generators. Through extensive experiments covering six summarization benchmarks, we show that high-quality extractive summaries can be assembled via approximating the outputs (abstractive summaries) of these generators. Moreover, we find that the approximate summaries correlate positively with the auxiliary summaries (i.e. a better generator enables the production of better extractive summaries). Our results signify a new paradigm for training extractive summarizers i.e. learning with generation (abstractive) objectives rather than extractive schemes. + 2024.naacl-long.9 + 2024.naacl-long.9.copyright.pdf + le-luu-2024-extractive + + + Self-generated Replay Memories for Continual Neural Machine Translation + MicheleResta + DavideBacciuUniversity of Pisa + 175-191 + Modern Neural Machine Translation systems exhibit strong performance in several different languages and are constantly improving. 
Their ability to learn continuously is, however, still severely limited by the catastrophic forgetting issue. In this work, we leverage a key property of encoder-decoder Transformers, i.e. their generative ability, to propose a novel approach to continually learning Neural Machine Translation systems. We show how this can effectively learn on a stream of experiences comprising different languages, by leveraging a replay memory populated by using the model itself as a generator of parallel sentences. We empirically demonstrate that our approach can counteract catastrophic forgetting without requiring explicit memorization of training data. Code will be publicly available upon publication. + 2024.naacl-long.10 + 2024.naacl-long.10.copyright.pdf + resta-bacciu-2024-self + + + Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models + YangyiChenDepartment of Computer Science, University of Illinois at Urbana-Champaign + KaranSikkaSRI International + MichaelCogswellSRI International + HengJiUniversity of Illinois, Urbana-Champaign + AjayDivakaranSRI International + 192-210 + Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. 
Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs, and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage involves employing supervised fine-tuning of VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency. + 2024.naacl-long.11 + 2024.naacl-long.11.copyright.pdf + chen-etal-2024-measuring + + + Building Knowledge-Guided Lexica to Model Cultural Variation + ShreyaHavaldarUniversity of Pennsylvania + SalvatoreGiorgiUniversity of Pennsylvania + SunnyRaiSchool of Engineering and Applied Science, University of Pennsylvania + ThomasTalhelmUniversity of Chicago + Sharath ChandraGuntukuUniversity of Pennsylvania + LyleUngar + 211-226 + Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. 
In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs’ failure to measure cultural variation or generate culturally varied language. + 2024.naacl-long.12 + 2024.naacl-long.12.copyright.pdf + havaldar-etal-2024-building + + + Adaptive Rank Selections for Low-Rank Approximation of Language Models + ShangqianGao + TingHuaSamsung + Yen-ChangHsuSamsung Research America + YilinShenSamsung Research America + HongxiaJinSamsung Research America AI center + 227-241 + Singular Value Decomposition (SVD) or its weighted variants has significantly progressed in compressing language models. Previous works assume the same importance for all operations and assign the same number of ranks for different layers in a language model. However, such a uniform rank selection is sub-optimal since different operations (layers) have non-uniform demand in capacity. In other words, a desired SVD strategy should allocate more ranks for important operations and vice versa. However, a globally-optimized selection of ranks for neural networks is still an open problem, and this is a non-trivial challenge since the selection is discrete. In this work, we propose a novel binary masking mechanism for optimizing the number of ranks in a differentiable framework. Our strategy uses a novel regularization to enable the masking to comply with the SVD property where the ranks have sorted singular values. The experiments examined both types of language models, encoder-only and decoder-only models, including large language models like LLaMA. Our compressed model achieves much better accuracy than previous SVD and their SOTA variants. 
More interestingly, our method retains significantly better accuracy with zero or limited fine-tuning, proving the substantial advantage of adaptive rank selection. + 2024.naacl-long.13 + 2024.naacl-long.13.copyright.pdf + gao-etal-2024-adaptive + + + An Empirical Study of Consistency Regularization for End-to-End Speech-to-Text Translation + PengzhiGaoBaidu + RuiqingZhang + ZhongjunHeBaidu + HuaWu + HaifengWangBaidu + 242-256 + Consistency regularization methods, such as R-Drop (Liang et al., 2021) and CrossConST (Gao et al., 2023), have achieved impressive supervised and zero-shot performance in the neural machine translation (NMT) field. Can we also boost end-to-end (E2E) speech-to-text translation (ST) by leveraging consistency regularization? In this paper, we conduct empirical studies on intra-modal and cross-modal consistency and propose two training strategies, SimRegCR and SimZeroCR, for E2E ST in regular and zero-shot scenarios. Experiments on the MuST-C benchmark show that our approaches achieve state-of-the-art (SOTA) performance in most translation directions. The analyses prove that regularization brought by the intra-modal consistency, instead of the modality gap, is crucial for the regular E2E ST, and the cross-modal consistency could close the modality gap and boost the zero-shot E2E ST performance. + 2024.naacl-long.14 + 2024.naacl-long.14.copyright.pdf + gao-etal-2024-empirical + + + Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration + ZhenhailongWang + ShaoguangMaoMicrosoft + WenshanWuMicrosoft + TaoGe + FuruWeiMicrosoft Research + HengJiUniversity of Illinois, Urbana-Champaign + 257-279 + Human intelligence thrives on cognitive synergy, where collaboration among different minds yields superior outcomes compared to isolated individuals.
In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds’ strengths and knowledge to enhance problem-solving in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, experimental results demonstrate that SPP effectively reduces factual hallucination, and maintains strong reasoning capabilities. Additionally, comparative experiments show that cognitive synergy only emerges in GPT-4 and does not appear in less capable models, such as GPT-3.5-turbo and Llama2-13b-chat, which draws an interesting analogy to human development. Code, data, and prompts can be found at: https://github.com/MikeWangWZHL/Solo-Performance-Prompting.git. + 2024.naacl-long.15 + 2024.naacl-long.15.copyright.pdf + wang-etal-2024-unleashing + + + <fixed-case>FPT</fixed-case>: Feature Prompt Tuning for Few-shot Readability Assessment + ZiyangWang + SanwooLeePeking University + Hsiu-YuanHuang + YunfangWu + 280-295 + Prompt-based methods have achieved promising results in most few-shot text classification tasks. 
However, for readability assessment tasks, traditional prompt methods lack crucial linguistic knowledge, which has already been proven to be essential. Moreover, previous studies on utilizing linguistic features have shown non-robust performance in few-shot settings and may even impair model performance. To address these issues, we propose a novel prompt-based tuning framework that incorporates rich linguistic knowledge, called Feature Prompt Tuning (FPT). Specifically, we extract linguistic features from the text and embed them into trainable soft prompts. Further, we devise a new loss function to calibrate the similarity ranking order between categories. Experimental results demonstrate that our proposed method FPT not only exhibits a significant performance improvement over the prior best prompt-based tuning approaches, but also surpasses the previous leading methods that incorporate linguistic features. Also, our proposed model significantly outperforms the large language model gpt-3.5-turbo-16k in most cases. Our proposed method establishes a new architecture for prompt tuning that sheds light on how linguistic features can be easily adapted to linguistic-related tasks. + 2024.naacl-long.16 + 2024.naacl-long.16.copyright.pdf + wang-etal-2024-fpt + + + Self-Prompting Large Language Models for Zero-Shot Open-Domain <fixed-case>QA</fixed-case> + JunlongLi + JinyuanWang + ZhuoshengZhangShanghai Jiao Tong University + HaiZhaoShanghai Jiao Tong University + 296-310 + Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing specific background documents.
This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models. While recent Large Language Models (LLMs) like GPT-3 have demonstrated their effectiveness in zero-shot ODQA using direct prompting methods, these methods still fall short of fully harnessing the potential of LLMs when implicitly invoked. In this paper, we propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of LLMs and their strong instruction understanding abilities. Concretely, we prompt LLMs step by step to generate multiple pseudo QA pairs with background passages and explanations entirely from scratch. These generated elements are then utilized for in-context learning. Experimental results show that our method significantly surpasses previous state-of-the-art zero-shot methods on three widely-used ODQA datasets and even achieves comparable performance with various customized fine-tuned models on full training data. Our code is available at https://github.com/lockon-n/self-prompting. + 2024.naacl-long.17 + 2024.naacl-long.17.copyright.pdf + li-etal-2024-self-prompting + + + Head-to-Tail: How Knowledgeable are Large Language Models (<fixed-case>LLM</fixed-case>s)? <fixed-case>A</fixed-case>.<fixed-case>K</fixed-case>.<fixed-case>A</fixed-case>. Will <fixed-case>LLM</fixed-case>s Replace Knowledge Graphs? + KaiSunMeta + YifanXu + HanwenZhaFacebook + YueLiu + Xin LunaDongDepartment of Computer Science, University of Washington and Amazon + 311-325 + Since the recent prosperity of Large Language Models (LLMs), there have been interleaved discussions regarding how to reduce hallucinations from LLM responses, how to increase the factuality of LLMs, and whether Knowledge Graphs (KGs), which store the world knowledge in a symbolic form, will be replaced with LLMs.
In this paper, we try to answer these questions from a new angle: How knowledgeable are LLMs? To answer this question, we constructed Head-to-Tail, a benchmark that consists of 18K question-answer (QA) pairs regarding head, torso, and tail facts in terms of popularity. We designed an automated evaluation method and a set of metrics that closely approximate the knowledge an LLM confidently internalizes. Through a comprehensive evaluation of 16 publicly available LLMs, we show that existing LLMs are still far from being perfect in terms of their grasp of factual knowledge, especially for facts of torso-to-tail entities. + 2024.naacl-long.18 + 2024.naacl-long.18.copyright.pdf + sun-etal-2024-head + + + <tex-math>k</tex-math><fixed-case>NN</fixed-case>-<fixed-case>ICL</fixed-case>: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning + WentingZhao + YeLiuSalesForce.com + YaoWanHuazhong University of Science and Technology + YiboWang + QingyangWuColumbia University + ZhongfenDengUniversity of Illinois, Chicago + JiangshuDuUniversity of Illinois at Chicago + ShuaiqiLiu + YunlongXu + PhilipYuUniversity of Illinois, Chicago + 326-337 + Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language, transforming them into structured outputs that combine elements of both natural language and intent/slot tags. Recently, Large Language Models (LLMs) have achieved impressive performance in synthesizing computer programs based on a natural-language prompt, mitigating the gap between natural language and structured programs. Our paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks, addressing the following three key research questions: 1) How can LLMs be effectively utilized for semantic parsing tasks? 2) What defines an effective prompt? and 3) How can LLM overcome the length constraint and streamline prompt design by including all examples as prompts?
We introduce k Nearest Neighbor In-Context Learning (kNN-ICL), which simplifies prompt engineering by allowing it to be built on top of any design strategy while providing access to all demo examples. Extensive experiments show that: 1) Simple ICL without kNN search can achieve a comparable performance with strong supervised models on the TOP tasks, and 2) kNN-ICL significantly improves the comprehension of complex requests by seamlessly integrating ICL with a nearest-neighbor approach. Notably, this enhancement is achieved without the need for additional data or specialized prompts. + 2024.naacl-long.19 + 2024.naacl-long.19.copyright.pdf + zhao-etal-2024-knn + + + <fixed-case>ARES</fixed-case>: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems + JonSaad-FalconComputer Science Department, Stanford University + OmarKhattab + ChristopherPottsStanford University + MateiZahariaUniversity of California, Berkeley and Databricks + 338-354 + Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS, ARES accurately evaluates RAG systems while using only a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our code and datasets publicly available on Github. 
+ 2024.naacl-long.20 + 2024.naacl-long.20.copyright.pdf + saad-falcon-etal-2024-ares + + + <fixed-case>DEMO</fixed-case>: A Statistical Perspective for Efficient Image-Text Matching + FanZhangGeorgia Institute of Technology + Xian-ShengHuaTerminus Group + ChongChenTerminus Group + XiaoLuoUniversity of California, Los Angeles + 355-369 + Image-text matching has been a long-standing problem, which seeks to connect vision and language through semantic understanding. Due to the capability to manage large-scale raw data, unsupervised hashing-based approaches have gained prominence recently. They typically construct a semantic similarity structure using the natural distance, which subsequently guides the optimization of the hashing network. However, the similarity structure could be biased at the boundaries of semantic distributions, causing error accumulation during sequential optimization. To tackle this, we introduce a novel hashing approach termed Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching. From a statistical view, DEMO characterizes each image using multiple augmented views, which are considered as samples drawn from its intrinsic semantic distribution. Then, we employ a non-parametric distribution divergence to ensure a robust and precise similarity structure. In addition, we introduce collaborative consistency learning which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distribution from different directions in a self-supervised manner. Extensive experiments on several widely used datasets demonstrate that DEMO achieves superior performance compared with various state-of-the-art methods. 
+ 2024.naacl-long.21 + 2024.naacl-long.21.copyright.pdf + zhang-etal-2024-demo + + + <fixed-case>S</fixed-case>ea<fixed-case>E</fixed-case>val for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning + BinWangI2R, A*STAR + ZhengyuanLiuI2R + XinHuang + FangkaiJiao + YangDing, A*STAR + AiTiAwI2R + NancyChen + 370-390 + We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate (1) Many models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained “balanced multilingual” capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios. 
+ 2024.naacl-long.22 + 2024.naacl-long.22.copyright.pdf + wang-etal-2024-seaeval + + + Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision + SeongyunLee + SueParkKorea Advanced Institute of Science & Technology + YongraeJo + MinjoonSeoKorea Advanced Institute of Science and Technology + 391-404 + Large multimodal models suffer from multimodal hallucination, where they provide incorrect responses misaligned with the given visual information. Recent works have conjectured that one of the reasons behind multimodal hallucination is due to the vision encoder failing to ground on the image properly. To mitigate this issue, we propose a novel approach that leverages self-feedback as visual cues. Building on this approach, we introduce Volcano, a multimodal self-feedback guided revision model. Volcano generates natural language feedback to its initial response based on the provided visual information and utilizes this feedback to self-revise its initial response. Volcano effectively reduces multimodal hallucination and achieves state-of-the-art on MMHal-Bench, POPE, and GAVIE. It also improves on general multimodal abilities and outperforms previous models on MM-Vet and MMBench. Through qualitative analysis, we show that Volcano’s feedback is more properly grounded on the image than the initial response. This indicates that Volcano can provide itself with richer visual information through feedback generation, leading it to self-correct hallucinations.
We publicly release our model, data, and code at https://github.com/kaistAI/Volcano + 2024.naacl-long.23 + 2024.naacl-long.23.copyright.pdf + lee-etal-2024-volcano + + + <fixed-case>LLM</fixed-case>s Are Few-Shot In-Context Low-Resource Language Learners + SamuelCahyawijayaThe Hong Kong University of Science and Technology + HolyLoveniaAI Singapore + PascaleFungHKUST + 405-433 + In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, only a handful of works have explored ICL for low-resource languages, with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages.
+ 2024.naacl-long.24 + 2024.naacl-long.24.copyright.pdf + cahyawijaya-etal-2024-llms + + + Simple and effective data augmentation for compositional generalization + YuekunYao + AlexanderKollerSaarland University + 434-449 + Compositional generalization, the ability to predict complex meanings from training on simpler sentences, poses challenges for powerful pretrained seq2seq models. In this paper, we show that data augmentation methods that sample MRs and backtranslate them can be effective for compositional generalization, but only if we sample from the right distribution. Remarkably, sampling from a uniform distribution performs almost as well as sampling from the test distribution, and greatly outperforms earlier methods that sampled from the training distribution. We further conduct experiments to investigate the reason why this happens and where the benefit of such data augmentation methods comes from. + 2024.naacl-long.25 + 2024.naacl-long.25.copyright.pdf + yao-koller-2024-simple + + + Rethinking Tabular Data Understanding with Large Language Models + TianyangLiuUniversity of California, San Diego + FeiWangUniversity of Southern California + MuhaoChenUniversity of California, Davis and University of Southern California + 450-482 + Large Language Models (LLMs) have shown to be capable of various tasks, yet their capability in interpreting and reasoning over tabular data remains an underexplored area. In this context, this study investigates from three core perspectives: the robustness of LLMs to structural perturbations in tables, the comparative analysis of textual and symbolic reasoning on tables, and the potential of boosting model performance through the aggregation of multiple reasoning pathways. We discover that structural variance of tables presenting the same content reveals a notable performance decline, particularly in symbolic reasoning tasks. This prompts the proposal of a method for table structure normalization.
Moreover, textual reasoning slightly edges out symbolic reasoning, and a detailed error analysis reveals that each exhibits different strengths depending on the specific tasks. Notably, the aggregation of textual and symbolic reasoning pathways, bolstered by a mix self-consistency mechanism, resulted in achieving SOTA performance, with an accuracy of 73.6% on WikiTableQuestions, representing a substantial advancement over previous existing table processing paradigms of LLMs. + 2024.naacl-long.26 + 2024.naacl-long.26.copyright.pdf + liu-etal-2024-rethinking + + + From Shortcuts to Triggers: Backdoor Defense with Denoised <fixed-case>P</fixed-case>o<fixed-case>E</fixed-case> + QinLiuUniversity of California, Davis + FeiWangUniversity of Southern California + ChaoweiXiaoUniversity of Wisconsin - Madison and NVIDIA + MuhaoChenUniversity of California, Davis and University of Southern California + 483-496 + Language models are often at risk of diverse backdoor attacks, especially data poisoning. Thus, it is important to investigate defense solutions for addressing them. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers, leaving a universal defense against various backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks, to defend various backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning the shortcuts. To address the label flip caused by backdoor attackers, DPoE incorporates a denoising design. Experiments on three NLP tasks show that DPoE significantly improves the defense performance against various types of backdoor triggers including word-level, sentence-level, and syntactic triggers. 
Furthermore, DPoE is also effective under a more challenging but practical setting that mixes multiple types of triggers. + 2024.naacl-long.27 + 2024.naacl-long.27.copyright.pdf + liu-etal-2024-shortcuts + + + <fixed-case>B</fixed-case>ook<fixed-case>SQL</fixed-case>: A Large Scale Text-to-<fixed-case>SQL</fixed-case> Dataset for Accounting Domain + RahulKumar + Amar RajaDibbu + ShrutendraHarsolaIntuit AI Bangalore India + VigneshSubrahmaniam + AshutoshModiIIT Kanpur + 497-516 + Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language queries-SQL pairs, and accounting databases of 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance gaps, thus pointing towards developing more focused models for this domain. + 2024.naacl-long.28 + 2024.naacl-long.28.copyright.pdf + kumar-etal-2024-booksql + + + <fixed-case>FLAP</fixed-case>: Flow-Adhering Planning with Constrained Decoding in <fixed-case>LLM</fixed-case>s + ShamikRoyAmazon + SailikSenguptaAmazon + DanieleBonadimanAmazon + SaabMansourAmazon + ArshitGuptaAmazon + 517-539 + Planning is a crucial task for agents in task oriented dialogs (TODs). 
Human agents typically resolve user issues by following predefined workflows, decomposing workflow steps into actionable items, and performing actions by executing APIs in order, all of which require reasoning and planning. With the recent advances in LLMs, there have been increasing attempts to use them for task planning and API usage. However, the faithfulness of the plans to predefined workflows and API dependencies is not guaranteed with LLMs. Moreover, workflows in real life are often custom-defined and prone to changes; hence, adaptation is desirable. To study this, we propose the problem of faithful planning in TODs that needs to resolve user intents by following predefined flows and preserving API dependencies. To solve this problem, we propose FLAP, a Flow-Adhering Planning algorithm based on constrained decoding with lookahead heuristic for LLMs. Our algorithm alleviates the need for finetuning LLMs using domain specific (plan/dependency) data, enables quick adaptation to predefined flows, and outperforms other decoding and prompting-based baselines. Further, our algorithm empowers smaller LLMs (≈7B) to perform on par with larger LLMs (≈30B-40B). + 2024.naacl-long.29 + 2024.naacl-long.29.copyright.pdf + roy-etal-2024-flap + + + <fixed-case>D</fixed-case>u<fixed-case>RE</fixed-case>: Dual Contrastive Self Training for Semi-Supervised Relation Extraction + YuxiFeng + LaksLakshmananUniversity of British Columbia + 540-555 + Document-level Relation Extraction (RE) aims to extract relation triples from documents. Existing document-RE models typically rely on supervised learning which requires substantial labeled data. To alleviate the amount of human supervision, Self-training (ST) has prospered again in language understanding by augmenting the fine-tuning of big pre-trained models whenever labeled data is insufficient. However, existing ST methods in RE fail to tackle the challenge of long-tail relations. 
In this work, we propose DuRE, a novel ST framework to tackle these problems. DuRE jointly models RE classification and text generation as a dual process. In this way, our model could construct and utilize both pseudo text generated from given labels and pseudo labels predicted from available unlabeled text, which are gradually refined during the ST phase. We propose a contrastive loss to leverage the signal of the RE classifier to improve generation quality. In addition, we propose a self-adaptive way to sample pseudo text from different relation classes. Experiments on two document-level RE tasks show that DuRE significantly boosts recall and F1 score with comparable precision, especially for long-tail relations against several strong baselines. + 2024.naacl-long.30 + 2024.naacl-long.30.copyright.pdf + feng-lakshmanan-2024-dure + + + Query-Efficient Textual Adversarial Example Generation for Black-Box Attacks + ZhenYuHuazhong University of Science and Technology + ZhenhuaChen + KunHeHuazhong University of Science and Technology + 556-569 + Deep neural networks for Natural Language Processing (NLP) have been demonstrated to be vulnerable to textual adversarial examples. Existing black-box attacks typically require thousands of queries on the target model, making them expensive in real-world applications. In this paper, we propose a new approach that guides the word substitutions using prior knowledge from the training set to improve the attack efficiency. Specifically, we introduce Adversarial Boosting Preference (ABP), a metric that quantifies the importance of words and guides adversarial word substitutions. We then propose two query-efficient attack strategies based on ABP: query-free attack (ABP_{free}) and guided search attack (ABP_{guide}). 
Extensive evaluations for text classification demonstrate that ABP_{free} generates more natural adversarial examples than existing universal attacks, ABP_{guide} significantly reduces the number of queries by a factor of 10 500 while achieving comparable or even better performance than black-box attack baselines. Furthermore, we introduce the first ensemble attack ABP_{ens} in NLP, which gains further performance improvements and achieves better transferability and generalization by the ensemble of the ABP across different models and domains. Code is available at https://github.com/BaiDingHub/ABP. + 2024.naacl-long.31 + 2024.naacl-long.31.copyright.pdf + yu-etal-2024-query + + + Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles + Kung-HsiangHuangSalesForce.com + PhilippeLaban + AlexanderFabbriSalesForce.com + Prafulla KumarChoubeySalesForce.com + ShafiqJotySalesForce.com and Nanyang Technological University + CaimingXiongSalesforce Research + Chien-ShengWuSalesforce AI + 570-593 + Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, the summarization of diverse information dispersed across multiple articles about an event remains underexplored. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. 
Next, to enable consistent automatic evaluation, we conducted a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of summaries. Through correlation analyses, we outline the best practices for effectively using automatic LLM-based metrics on the DiverseSumm dataset. Finally, we study how LLMs summarize multiple news articles by analyzing which type of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them mainly due to their limited coverage, with GPT-4 only able to cover under 40% of the diverse information on average. + 2024.naacl-long.32 + 2024.naacl-long.32.copyright.pdf + huang-etal-2024-embrace + + + <fixed-case>AMRF</fixed-case>act: Enhancing Summarization Factuality Evaluation with <fixed-case>AMR</fixed-case>-Driven Negative Samples Generation + HaoyiQiuUCLA Computer Science Department, University of California, Los Angeles + Kung-HsiangHuangSalesForce.com + JingnongQuUniversity of California, Los Angeles + NanyunPengUniversity of California, Los Angeles + 594-608 + Ensuring factual consistency is crucial for natural language generation tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior works on evaluating factual consistency of summarization often take the entailment-based approaches that first generate perturbed (factual inconsistent) summaries and then train a classifier on the generated data to detect the factually inconsistencies during testing time. However, previous approaches generating perturbed summaries are either of low coherence or lack error-type coverage. To address these issues, we propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs). 
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage. Additionally, we present a data selection module NegFilter based on natural language inference and BARTScore to ensure the quality of the generated negative samples. Experimental results demonstrate our approach significantly outperforms previous systems on the AggreFact-SOTA benchmark, showcasing its efficacy in evaluating factuality of abstractive summarization. + 2024.naacl-long.33 + 2024.naacl-long.33.copyright.pdf + qiu-etal-2024-amrfact + + + <fixed-case>PILOT</fixed-case>: Legal Case Outcome Prediction with Case Law + LangCao + ZifengWangUniversity of Illinois, Urbana Champaign + CaoXiaoGE Healthcare + JimengSunGeorgia Tech Research Corporation, University of Illinois, Urbana Champaign, College of Computing and Georgia Institute of Technology + 609-621 + Machine learning shows promise in predicting the outcome of legal cases, but most research has concentrated on civil law cases rather than case law systems. We identified two unique challenges in making legal case outcome predictions with case law. First, it is crucial to identify relevant precedent cases that serve as fundamental evidence for judges during decision-making. Second, it is necessary to consider the evolution of legal principles over time, as early cases may adhere to different legal contexts. In this paper, we proposed a new framework named PILOT (PredictIng Legal case OuTcome) for case outcome prediction. It comprises two modules for relevant case retrieval and temporal pattern handling, respectively. To benchmark the performance of existing legal case outcome prediction models, we curated a dataset from a large-scale case law database. 
We demonstrate the importance of accurately identifying precedent cases and mitigating the temporal shift when making predictions for case law, as our method shows a significant improvement over the prior methods that focus on civil law case outcome predictions. + 2024.naacl-long.34 + 2024.naacl-long.34.copyright.pdf + cao-etal-2024-pilot + + + <fixed-case>AL</fixed-case>o<fixed-case>RA</fixed-case>: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models + ZequanLiu + JiawenLyn + WeiZhuUniversity of Hong Kong + XingTian + 622-641 + Parameter-efficient fine-tuning (PEFT) is widely studied for its effectiveness and efficiency in the era of large language models. Low-rank adaptation (LoRA) has demonstrated commendable performance as a popular and representative method. However, it is implemented with a fixed intrinsic rank that might not be the ideal setting for the downstream tasks. Recognizing the need for more flexible downstream task adaptation, we extend the methodology of LoRA to an innovative approach we call allocating low-rank adaptation (ALoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank. Second, guided by AB-LoRA, we gradually prune abundant and negatively impacting LoRA ranks and allocate the pruned LoRA budgets to important Transformer modules needing higher ranks. We have conducted experiments on various tasks, and the experimental results demonstrate that our ALoRA method can outperform the recent baselines with comparable tunable parameters. 
+ 2024.naacl-long.35 + 2024.naacl-long.35.copyright.pdf + liu-etal-2024-alora + + + <fixed-case>R</fixed-case>-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces + Heng-JuiChangMassachusetts Institute of Technology + JamesGlass + 642-662 + This paper introduces Robust Spin (R-Spin), a data-efficient domain-specific self-supervision method for speaker and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin’s issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12X reduction in computational resources compared to previous state-of-the-art methods while outperforming them in severely distorted speech scenarios. This paper provides detailed analyses to show how discrete units contribute to speech encoder training and improving robustness in diverse acoustic environments. + 2024.naacl-long.36 + 2024.naacl-long.36.copyright.pdf + chang-glass-2024-r + + + <fixed-case>I</fixed-case>ns<fixed-case>CL</fixed-case>: A Data-efficient Continual Learning Paradigm for Fine-tuning Large Language Models with Instructions + YifanWangTsinghua University + YafeiLiuOPPO + ChufanShi + HaolingLi + ChenChenOPPO Research Institute + HaonanLuOPPO Guangdong Mobile Telecommunications Co., Ltd. + YujiuYangGraduate School at Shenzhen,Tsinghua University + 663-677 + Instruction tuning effectively optimizes Large Language Models (LLMs) for downstream tasks. Due to the changing environment in real-life applications, LLMs necessitate continual task-specific adaptation without catastrophic forgetting. Considering the heavy computational cost, replay-based Continual Learning (CL) methods are the simplest and most widely used for LLMs to address the forgetting issue. However, traditional replay-based methods do not fully utilize instructions to customize the replay strategy. 
In this work, we propose a novel paradigm called Instruction-based Continual Learning (InsCL). InsCL dynamically replays previous data based on task similarity, calculated by Wasserstein Distance with instructions. Moreover, we further introduce an Instruction Information Metric (InsInfo) to quantify the complexity and diversity of instructions. According to InsInfo, InsCL guides the replay process more inclined to high-quality data. We conduct extensive experiments over 16 tasks with different training orders, observing consistent performance improvements of InsCL. When all tasks have been trained, InsCL achieves performance gains of 3.0 Relative Gain compared with Random Replay, and 27.96 Relative Gain compared with No Replay. + 2024.naacl-long.37 + 2024.naacl-long.37.copyright.pdf + wang-etal-2024-inscl + + + Language Agnostic Code Embeddings + SaitejaUtpalaUniversity of California, Santa Barbara + AlexGuMassachusetts Institute of Technology + Pin-YuChenInternational Business Machines + 678-691 + Recently, code language models have achieved notable advancements in addressing a diverse array of essential code comprehension and generation tasks. Yet, the field lacks a comprehensive deep dive and understanding of the code embeddings of multilingual code models. In this paper, we present a comprehensive study on multilingual code embeddings, focusing on the cross-lingual capabilities of these embeddings across different programming languages. Through probing experiments, we demonstrate that code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details, primarily focusing on semantics. Further, we show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks, leading to an absolute increase of up to +17 in the Mean Reciprocal Rank (MRR). 
+ 2024.naacl-long.38 + 2024.naacl-long.38.copyright.pdf + utpala-etal-2024-language + + + An Examination of the Compositionality of Large Generative Vision-Language Models + TeliMaHong Kong University of Science and Technology + RongLiHKUST(GZ) + JunweiLiangHong Kong University of Science and Technology + 692-705 + With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains under-explored. In this paper, we examine both the evaluation metrics ( VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs. The bias renders VisualGPTScore an insufficient metric for assessing GVLMs. To combat this, we first introduce a **SyntaxBias Score**, leveraging LLMs to quantify such bias for mitigation. A challenging new task is subsequently added to evaluate the robustness of GVLMs against inherent inclination toward syntactical correctness. Using the bias-mitigated datasets and the new task, we propose a novel benchmark, namely **S**ynt**A**ctically **DE**-biased benchmark (SADE). Our study provides an unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction. Code and dataset are available at https://github.com/TeleeMa/SADE. 
+ 2024.naacl-long.39 + 2024.naacl-long.39.copyright.pdf + ma-etal-2024-examination + + + Two Heads are Better than One: Nested <fixed-case>P</fixed-case>o<fixed-case>E</fixed-case> for Robust Defense Against Multi-Backdoors + VictoriaGrafUniversity of Southern California and Princeton University + QinLiuUniversity of California, Davis + MuhaoChenUniversity of California, Davis and University of Southern California + 706-718 + Data poisoning backdoor attacks can cause undesirable behaviors in large language models (LLMs), and defending against them is of increasing importance. Existing defense mechanisms often assume that only one type of trigger is adopted by the attacker, while defending against multiple simultaneous and independent trigger types necessitates general defense frameworks and is relatively unexplored. In this paper, we propose Nested Product of Experts (NPoE) defense framework, which involves a mixture of experts (MoE) as a trigger-only ensemble within the PoE defense framework to simultaneously defend against multiple trigger types. During NPoE training, the main model is trained in an ensemble with a mixture of smaller expert models that learn the features of backdoor triggers. At inference time, only the main model is used. Experimental results on sentiment analysis, hate speech detection, and question classification tasks demonstrate that NPoE effectively defends against a variety of triggers both separately and in trigger mixtures. Due to the versatility of the MoE structure in NPoE, this framework can be further expanded to defend against other attack settings. + 2024.naacl-long.40 + 2024.naacl-long.40.copyright.pdf + graf-etal-2024-two + + + <fixed-case>V</fixed-case>ert<fixed-case>A</fixed-case>ttack: Taking Advantage of Text Classifiers’ Horizontal Vision + JonathanRusertPurdue University Fort Wayne + 719-732 + Text classification systems have continuously improved in performance over the years. 
However, nearly all current SOTA classifiers have a similar shortcoming: they process text in a horizontal manner. Vertically written words will not be recognized by a classifier. In contrast, humans are easily able to recognize and read words written both horizontally and vertically. Hence, a human adversary could write problematic words vertically and the meaning would still be preserved to other humans. We simulate such an attack, VertAttack. VertAttack identifies which words a classifier is reliant on and then rewrites those words vertically. We find that VertAttack is able to greatly drop the accuracy of 4 different transformer models on 5 datasets. For example, on the SST2 dataset, VertAttack is able to drop RoBERTa’s accuracy from 94 to 13%. Furthermore, since VertAttack does not replace the word, meaning is easily preserved. We verify this via a human study and find that crowdworkers are able to correctly label 77% of the perturbed texts, compared to 81% of the original texts. We believe VertAttack offers a look into how humans might circumvent classifiers in the future and thus inspire a look into more robust algorithms. + 2024.naacl-long.41 + 2024.naacl-long.41.copyright.pdf + rusert-2024-vertattack + + + <fixed-case>KDMCSE</fixed-case>: Knowledge Distillation Multimodal Sentence Embeddings with Adaptive Angular margin Contrastive Learning + Cong-DuyNguyenSchool of Computer Science and Engineering, Nanyang Technological University + ThongNguyen + XiaobaoWuNanyang Technological University + Anh TuanLuuNanyang Technological University + 733-749 + Previous work on multimodal sentence embedding has proposed multimodal contrastive learning and achieved promising results. However, by taking the rest of the batch as negative samples without reviewing when forming contrastive pairs, those studies encountered many suspicious and noisy negative examples, significantly affecting the methods’ overall performance. 
In this work, we propose KDMCSE (Knowledge Distillation Multimodal contrastive learning of Sentence Embeddings), a novel approach that enhances the discrimination and generalizability of multimodal representation and inherits the knowledge from the teacher model to learn the difference between positive and negative instances and via that, can detect noisy and wrong negative samples effectively before they are calculated in the contrastive objective. Furthermore, to overcome the limitation of modeling the variation within negative pairs, we introduce a new contrastive objective, AdapACSE (Adaptive Angular Margin Supervised Contrastive Learning for Multimodal sentence embeddings), that enhances the discriminative representation by strengthening the margin within the angular space while capturing varying semantics within the negative. Experimental results on widely used Semantic Textual Similarity (STS) benchmarks demonstrate the effectiveness of our approach. + 2024.naacl-long.42 + 2024.naacl-long.42.copyright.pdf + nguyen-etal-2024-kdmcse + + + The taste of <fixed-case>IPA</fixed-case>: Towards open-vocabulary keyword spotting and forced alignment in any language + JianZhuUniversity of British Columbia + ChangbingYangUniversity of British Columbia + FarhanSamirUniversity of British Columbia + JahurulIslam + 750-772 + In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated the IPAPACK, a massively multilingual speech corpora with phonemic transcriptions, encompassing more than 115 languages from diverse language families, selectively checked by linguists. Based on the IPAPACK, we propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages, showing strong generalizability across languages. 
Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zeroshot forced alignment in unseen languages. We further introduced a neural forced aligner IPA-ALIGNER by finetuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation. + 2024.naacl-long.43 + 2024.naacl-long.43.copyright.pdf + zhu-etal-2024-taste + + + Think Before You Act: A Two-Stage Framework for Mitigating Gender Bias Towards Vision-Language Tasks + YunqiZhang + SongdaLi + ChunyuanDeng + LuyiWang + HuiZhaoEast China Normal University + 773-791 + Gender bias in vision-language models (VLMs) can reinforce harmful stereotypes and discrimination. In this paper, we focus on mitigating gender bias towards vision-language tasks. We identify object hallucination as the essence of gender bias in VLMs. Existing VLMs tend to focus on salient or familiar attributes in images but ignore contextualized nuances. Moreover, most VLMs rely on the co-occurrence between specific objects and gender attributes to infer the ignored features, ultimately resulting in gender bias. We propose GAMA, a task-agnostic generation framework to mitigate gender bias. GAMA consists of two stages: narrative generation and answer inference. During narrative generation, GAMA yields all-sided but gender-obfuscated narratives, which prevents premature concentration on localized image features, especially gender attributes. During answer inference, GAMA integrates the image, generated narrative, and a task-specific question prompt to infer answers for different vision-language tasks. This approach allows the model to rethink gender attributes and answers. We conduct extensive experiments on GAMA, demonstrating its debiasing and generalization ability. 
+ 2024.naacl-long.44 + 2024.naacl-long.44.copyright.pdf + zhang-etal-2024-think + + + <fixed-case>B</fixed-case>e<fixed-case>LLM</fixed-case>: Backward Dependency Enhanced Large Language Model for Sentence Embeddings + XianmingLi + JingLiThe Hong Kong Polytechnic University + 792-804 + Sentence embeddings are crucial in measuring semantic similarity. Most recent studies employed large language models (LLMs) to learn sentence embeddings. Existing LLMs mainly adopted autoregressive architecture without explicit backward dependency modeling. Therefore, we examined the effects of backward dependencies in LLMs for semantic similarity measurements. Concretely, we propose a novel model: backward dependency enhanced large language model (BeLLM). It learns sentence embeddings via transforming specific attention layers from uni- to bi-directional. We extensively experiment across various semantic textual similarity (STS) tasks and downstream applications. BeLLM achieves state-of-the-art performance in varying scenarios. It shows that autoregressive LLMs benefit from backward dependencies for sentence embeddings. + 2024.naacl-long.45 + 2024.naacl-long.45.copyright.pdf + li-li-2024-bellm + + + Assessing Factual Reliability of Large Language Model Knowledge + WeixuanWangUniversity of Edinburgh, University of Edinburgh + BarryHaddowUniversity of Edinburgh + AlexandraBirchUniversity of Edinburgh + WeiPeng + 805-819 + The factual knowledge of LLMs is typically evaluated using accuracy, yet this metric does not capture the vulnerability of LLMs to hallucination-inducing factors like prompt and context variability. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? In this paper, we propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs’ factual reliability. 
MONITOR is designed to compute the distance between the probability distributions of a valid output and its counterparts produced by the same LLM probing the same fact using different styles of prompts and contexts. Experiments on a comprehensive range of 12 LLMs demonstrate the effectiveness of MONITOR in evaluating the factual reliability of LLMs while maintaining a low computational overhead. In addition, we release the FKTC (Factual Knowledge Test Corpus) to foster research along this line https://github.com/Vicky-Wil/MONITOR. + 2024.naacl-long.46 + 2024.naacl-long.46.copyright.pdf + wang-etal-2024-assessing + + + Dial-<fixed-case>MAE</fixed-case>: <fixed-case>C</fixed-case>on<fixed-case>T</fixed-case>extual Masked Auto-Encoder for Retrieval-based Dialogue Systems + ZhenpengSu + XingW + WeiZhou + GuangyuanMa + SonglinHu + 820-830 + Dialogue response selection aims to select an appropriate response from several candidates based on a given user and system utterance history. Most existing works primarily focus on post-training and fine-tuning tailored for cross-encoders. However, there are no post-training methods tailored for dense encoders in dialogue response selection. We argue that when the current language model, based on dense dialogue systems (such as BERT), is employed as a dense encoder, it separately encodes dialogue context and response, leading to a struggle to achieve the alignment of both representations. Thus, we propose Dial-MAE (Dialogue Contextual Masking Auto-Encoder), a straightforward yet effective post-training technique tailored for dense encoders in dialogue response selection. Dial-MAE uses an asymmetric encoder-decoder architecture to compress the dialogue semantics into dense vectors, which achieves better alignment between the features of the dialogue context and response. Our experiments have demonstrated that Dial-MAE is highly effective, achieving state-of-the-art performance on two commonly evaluated benchmarks. 
+ 2024.naacl-long.47 + 2024.naacl-long.47.copyright.pdf + su-etal-2024-dial + + + Toolink: Linking Toolkit Creation and Using through Chain-of-Solving on Open-Source Model + ChengQian + ChenyanXiongSchool of Computer Science, Carnegie Mellon University + ZhenghaoLiuNortheastern University + ZhiyuanLiuTsinghua University + 831-854 + Large Language Models (LLMs) have demonstrated remarkable progress in utilizing tools, but their closed-source nature and high inference costs pose limitations on their adaptability, necessitating a valid method that leverages smaller, open-sourced models. In this paper, we introduce Toolink, a comprehensive framework that performs task-solving by first creating a toolkit and then integrating the planning and calling of tools through a chain-of-solving (CoS) approach. We first validate the efficacy of Toolink in harnessing the model’s creativity and CoS ability on ChatGPT. Subsequently, we curate CoS-GPT, a chain-of-solving dataset designed for tool-using, and finetune the LLaMA-7B model. It results in LLaMA-CoS, a powerful open-source model with advanced tool-planning and tool-calling capabilities. Evaluation of diverse tasks from BIG-bench demonstrates its CoS ability matches that of ChatGPT while its performance surpasses the chain-of-thought approach. Further studies highlight the generalization of LLaMA-CoS to unseen tasks and showcase its capability in using toolkits not explicitly tailored for the target task, affirming its robustness in real-world scenarios. All codes and data are released. + 2024.naacl-long.48 + 2024.naacl-long.48.copyright.pdf + qian-etal-2024-toolink + + + Create! Don’t Repeat: A Paradigm Shift in Multi-Label Augmentation through Label Creative Generation + LetianWangSichuan University + XianggenLiuSichuan University + JianchengLvSichuan University + 855-869 + We propose Label Creative Generation (LCG), a new paradigm in multi-label data augmentation. 
Beyond repeating data points with fixed labels, LCG creates new data by exploring innovative label combinations. Within LCG, we introduce Tail-Driven Conditional Augmentation (TDCA), combining tail-driven label sampling and label-conditioned text generation for balanced, consistent data augmentation. Our approach has demonstrated a **100.21%** increase in PSP@1 across three datasets, successfully mitigating the long-tail effect in MLTC and markedly enhancing model performance. + 2024.naacl-long.49 + 2024.naacl-long.49.copyright.pdf + wang-etal-2024-create + + + Neurocache: Efficient Vector Retrieval for Long-range Language Modeling + AliSafaya + DenizYuretKoc University + 870-883 + This paper introduces Neurocache, an approach to extend the effective context size of large language models (LLMs) using an external vector cache to store its past states. Like recent vector retrieval approaches, Neurocache uses an efficient k-nearest-neighbor (kNN) algorithm to retrieve relevant past states and incorporate them into the attention process. Neurocache improves upon previous methods by (1) storing compressed states, which reduces cache size; (2) performing a single retrieval operation per token which increases inference speed; and (3) extending the retrieval window to neighboring states, which improves both language modeling and downstream task accuracy. Our experiments show the effectiveness of Neurocache both for models trained from scratch and for pre-trained models such as Llama2-7B and Mistral-7B when enhanced with the cache mechanism. We also compare Neurocache with text retrieval methods and show improvements in single-document question-answering and few-shot learning tasks. 
We made the source code available under: https://github.com/alisafaya/neurocache + 2024.naacl-long.50 + 2024.naacl-long.50.copyright.pdf + safaya-yuret-2024-neurocache + + + Unveiling the Generalization Power of Fine-Tuned Large Language Models + HaoranYang + YumengZhang + JiaqiXuThe Chinese University of Hong Kong + HongyuanLu + Pheng-AnnHeng + WaiLamThe Chinese University of Hong Kong + 884-899 + While Large Language Models (LLMs) have demonstrated exceptional multitasking abilities, fine-tuning these models on downstream, domain-specific datasets is often necessary to yield superior performance on test sets compared to their counterparts without fine-tuning. However, the comprehensive effects of fine-tuning on the LLMs’ generalization ability are not fully understood. This paper delves into the differences between original, unmodified LLMs and their fine-tuned variants. Our primary investigation centers on whether fine-tuning affects the generalization ability intrinsic to LLMs. To elaborate on this, we conduct extensive experiments across five distinct language tasks on various datasets. Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks. Intriguingly, we observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model’s generalization ability. Through this systematic investigation, we aim to contribute valuable insights into the evolving landscape of fine-tuning practices for LLMs. 
+ 2024.naacl-long.51 + 2024.naacl-long.51.copyright.pdf + yang-etal-2024-unveiling + + + A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning + RuixinHongTsinghua University, Tsinghua University + HongmingZhang + XinyuPangTsinghua University, Tsinghua University + DongYuTencent AI Lab + ChangshuiZhangTsinghua University and Department of Computer Science and Technology + 900-925 + Logical reasoning has been an ongoing pursuit in the field of AI. Despite significant advancements made by large language models (LLMs), they still struggle with complex logical reasoning problems. To enhance reasoning performance, one promising direction is scalable oversight, which requires LLMs to identify their own errors and then improve by themselves. Various self-verification methods have been proposed in pursuit of this goal. Nevertheless, whether existing models understand their own errors well is still under investigation. In this paper, we take a closer look at the self-verification abilities of LLMs in the context of logical reasoning, focusing on their ability to identify logical fallacies accurately. We introduce a dataset, FALLACIES, containing 232 types of reasoning fallacies categorized in a hierarchical taxonomy. By conducting exhaustive experiments on FALLACIES, we obtain comprehensive and detailed analyses of a series of models on their verification abilities. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods. Drawing from these observations, we offer suggestions for future research and practical applications of self-verification methods. 
+ 2024.naacl-long.52 + 2024.naacl-long.52.copyright.pdf + hong-etal-2024-closer + + + Exploring Self-supervised Logic-enhanced Training for Large Language Models + FangkaiJiao + ZhiyangTeng + BoshengDing + ZhengyuanLiuI2R + NancyChen + ShafiqJotySalesForce.com and Nanyang Technological University + 926-941 + Traditional attempts to enhance the logical reasoning abilities of language models often rely on supervised fine-tuning, limiting their generalization to new tasks or domains. Large Language Models (LLMs), with their capacity to condense vast knowledge, can effectively tackle many tasks. Yet, our experiments reveal a gap in their performance on logical reasoning benchmarks when compared to state-of-the-art fine-tuning based models. To bridge this gap, we present LogicLLM, a first-of-its-kind, fully self-supervised framework for integrating logical reasoning capabilities into LLMs, and activating them via in-context learning. We apply this to two LLM series, FLAN-T5 and LLaMA, with parameter sizes from 3 billion to 33 billion. LogicLLM demonstrates its effectiveness through successful improvements on two logical reasoning benchmarks (ReClor and LogiQA-v2). Additionally, LogicLLM based on FLAN-T5-11B attains comparable results to ChatGPT, and evaluations with LLaMA-based models on three language understanding benchmarks (RACE, MMLU and Big-Bench-Hard) confirm that the improvements come without compromising the model’s general language understanding capabilities. + 2024.naacl-long.53 + 2024.naacl-long.53.copyright.pdf + jiao-etal-2024-exploring + + + <fixed-case>MATHSENSEI</fixed-case>: A Tool-Augmented Large Language Model for Mathematical Reasoning + DebrupDas + DebopriyoBanerjeeRakuten Global Inc. 
+ SomakAdityaIndian Institute of Technology Kharagpur + AshishKulkarniRakuten + 942-966 + Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby leading to improved reasoning abilities across many tasks. While TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of the tools - knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API) through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning on diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on the model performance. MathSensei achieves 13.5% better accuracy over gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and the benefit increases as the complexity and required knowledge increases (progressively over AQuA, MMLU-Math, and higher level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei. 
+ 2024.naacl-long.54 + 2024.naacl-long.54.copyright.pdf + das-etal-2024-mathsensei + + + <fixed-case>C</fixed-case>o<fixed-case>UDA</fixed-case>: Coherence Evaluation via Unified Data Augmentation + DaweiZhu + WenhaoWu + YifanSong + FangweiZhu + ZiqiangCao + SujianLiPeking University + 967-978 + Coherence evaluation aims to assess the organization and structure of a discourse, which remains challenging even in the era of large language models. Due to the scarcity of annotated data, data augmentation is commonly used for training coherence evaluation models. However, previous augmentations for this task primarily rely on heuristic rules, lacking design criteria as guidance. In this paper, we take inspiration from linguistic theory of discourse structure, and propose a data augmentation framework named CoUDA. CoUDA breaks down discourse coherence into global and local aspects, and designs augmentation strategies for both aspects, respectively. Especially for local coherence, we propose a novel generative strategy for constructing augmentation samples, which involves post-pretraining a generative model and applying two controlling mechanisms to control the difficulty of generated samples. During inference, CoUDA also jointly evaluates both global and local aspects to comprehensively assess the overall coherence of a discourse. Extensive experiments in coherence evaluation show that, with only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks, even surpassing recent GPT-3.5 and GPT-4 based metrics. 
+ 2024.naacl-long.55 + 2024.naacl-long.55.copyright.pdf + zhu-etal-2024-couda + + + m<fixed-case>E</fixed-case>d<fixed-case>IT</fixed-case>: Multilingual Text Editing via Instruction Tuning + VipulRahejaColumbia University, Grammarly and International Institute of Information Technology Hyderabad + DimitrisAlikaniotisGrammarly + VivekKulkarni + BasharAlhafniNew York University + DhruvKumar + 979-1001 + We introduce mEdIT, a multi-lingual extension to CoEdIT – the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual large, pre-trained language models (LLMs) via instruction tuning. They are designed to take instructions from the user specifying the attributes of the desired text in the form of natural language instructions, such as “Grammatik korrigieren” (German) or “이 텍스 트를 단순화” (Korean). We build mEdIT by curating data from multiple publicly available human-annotated text editing datasets for three text editing tasks (Grammatical Error Correction (GEC), Text Simplification, and Paraphrasing) across diverse languages belonging to six different language families. We detail the design and training of mEdIT models and demonstrate their strong performance on many multi-lingual text editing benchmarks against other multilingual LLMs. We also find that mEdIT generalizes effectively to new languages over multilingual baselines. We publicly release our data, code, and trained models. + 2024.naacl-long.56 + 2024.naacl-long.56.copyright.pdf + raheja-etal-2024-medit + + + Navigation as Attackers Wish? 
Towards Building Robust Embodied Agents under Federated Learning + YunchaoZhangUniversity of Hong Kong + ZonglinDiUniversity of California, Santa Cruz + KaiwenZhou + CihangXieUniversity of California, Santa Cruz + XinWangUniversity of California, Santa Cruz + 1002-1016 + Federated embodied agent learning protects the data privacy of individual visual environments by keeping data locally at each client (the individual environment) during training. However, since the local data is inaccessible to the server under federated learning, attackers may easily poison the training data of the local client to build a backdoor in the agent without notice. Deploying such an agent raises the risk of potential harm to humans, as the attackers may easily navigate and control the agent as they wish via the backdoor. Towards Byzantine-robust federated embodied agent learning, in this paper, we study the attack and defense for the task of vision-and-language navigation (VLN), where the agent is required to follow natural language instructions to navigate indoor environments. First, we introduce a simple but effective attack strategy, Navigation as Wish (NAW), in which the malicious client manipulates local trajectory data to implant a backdoor into the global model. Results on two VLN datasets (R2R and RxR) show that NAW can easily navigate the deployed VLN agent regardless of the language instruction, without affecting its performance on normal test sets. Then, we propose a new Prompt-Based Aggregation (PBA) to defend against the NAW attack in federated VLN, which provides the server with a “prompt” of the vision-and-language alignment variance between the benign and malicious clients so that they can be distinguished during training. We validate the effectiveness of the PBA method on protecting the global model from the NAW attack, which outperforms other state-of-the-art defense methods by a large margin in the defense metrics on R2R and RxR. 
+ 2024.naacl-long.57 + 2024.naacl-long.57.copyright.pdf + zhang-etal-2024-navigation + + + In-context Learning and Gradient Descent Revisited + GiladDeutch + NadavMagar + TomerNatan + GuyDarTel Aviv University + 1017-1028 + In-context learning (ICL) has shown impressive results in few-shot learning tasks, yet its underlying mechanism is still not fully understood. A recent line of work suggests that ICL performs gradient descent (GD)-based optimization implicitly. While appealing, much of the research focuses on simplified settings, where the parameters of a shallow model are optimized. In this work, we revisit evidence for ICL-GD correspondence on realistic NLP tasks and models. We find gaps in evaluation, both in terms of problematic metrics and insufficient baselines. We show that surprisingly, even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL. Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality. We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly. + 2024.naacl-long.58 + 2024.naacl-long.58.copyright.pdf + deutch-etal-2024-context + + + Corpus Considerations for Annotator Modeling and Scaling + SarumiOluyemi + BélaNeuendorf + JoanPlepiRheinische Friedrich-Wilhelms Universität Bonn + LucieFlekRheinische Friedrich-Wilhelms Universität Bonn + JörgSchlöttererUniversität Mannheim and Phillips-Universität Marburg + CharlesWelchMcMaster University + 1029-1040 + Recent trends in natural language processing research and annotation tasks affirm a paradigm shift from the traditional reliance on a single ground truth to a focus on individual perspectives, particularly in subjective tasks. In scenarios where annotation tasks are meant to encompass diversity, models that solely rely on the majority class labels may inadvertently disregard valuable minority perspectives. 
This oversight could result in the omission of crucial information and, in a broader context, risk disrupting the balance within larger ecosystems. As the landscape of annotator modeling unfolds with diverse representation techniques, it becomes imperative to investigate their effectiveness with the fine-grained features of the datasets in view. This study systematically explores various annotator modeling techniques and compares their performance across seven corpora. From our findings, we show that the commonly used user token model consistently outperforms more complex models. We introduce a composite embedding approach and show distinct differences in which model performs best as a function of the agreement with a given dataset. Our findings shed light on the relationship between corpus statistics and annotator modeling performance, which informs future work on corpus construction and perspectivist NLP. + 2024.naacl-long.59 + 2024.naacl-long.59.copyright.pdf + oluyemi-etal-2024-corpus + + + On Large Language Models’ Hallucination with Regard to Known Facts + CheJiang + BiqingQiTsinghua University and Harbin Institute of Technology + XiangyuHongTsinghua University, Tsinghua University + DayuanFu + YangCheng + FandongMengWeChat AI, Tencent Inc. + MoYuWeChat AI, Tencent + BowenZhouTsinghua University + JieZhou + 1041-1053 + Large language models are successful in answering factoid questions but are also prone to hallucination. We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics, an area not previously covered in studies on hallucinations. We are able to conduct this analysis via two key ideas. First, we identify the factual questions that query the same triplet knowledge but result in different answers. 
The difference between the model behaviors on the correct and incorrect outputs hence suggests the patterns when hallucinations happen. Second, to measure the pattern, we utilize mappings from the residual streams to vocabulary space. We reveal the different dynamics of the output token probabilities along the depths of layers between the correct and hallucinated cases. In hallucinated cases, the output token’s information rarely demonstrates abrupt increases and consistent superiority in the later stages of the model. Leveraging the dynamic curve as a feature, we build a classifier capable of accurately detecting hallucinatory predictions with an 88% success rate. Our study sheds light on understanding the reasons for LLMs’ hallucinations on their known facts, and more importantly, on accurately predicting when they are hallucinating. + 2024.naacl-long.60 + 2024.naacl-long.60.copyright.pdf + jiang-etal-2024-large + + + “One-Size-Fits-All”? Examining Expectations around What Constitute “Fair” or “Good” <fixed-case>NLG</fixed-case> System Behaviors + LiLucyAllen Institute for Artificial Intelligence and University of California Berkeley + Su LinBlodgettMicrosoft + MiladShokouhiMicrosoft + HannaWallachMicrosoft + AlexandraOlteanuResearch, Microsoft + 1054-1089 + Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to behave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these case studies, we examine people’s expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. 
We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute “fair” or “good” NLG system behaviors. + 2024.naacl-long.61 + 2024.naacl-long.61.copyright.pdf + lucy-etal-2024-one + + + Language Models Hallucinate, but May Excel at Fact Verification + JianGuan + JesseDodge + DavidWaddenAllen Institute for Artificial Intelligence + MinlieHuangTsinghua University, Tsinghua University + HaoPengDepartment of Computer Science, University of Illinois Urbana-Champaign + 1090-1111 + Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently “hallucinate,” resulting in non-factual outputs. Our carefully-designed human evaluation substantiates the serious hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers with strong correlations with human judgments. Surprisingly, FLAN-T5-11B, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models. 
+ 2024.naacl-long.62 + 2024.naacl-long.62.copyright.pdf + guan-etal-2024-language + + + A Rationale-centric Counterfactual Data Augmentation Method for Cross-Document Event Coreference Resolution + BowenDingWestlake University + QingkaiMin + ShengkunMaBeijing University of Posts and Telecommunications + YingjieLiWestlake University + LinyiYangWestlake University + YueZhangWestlake University + 1112-1140 + Based on Pre-trained Language Models (PLMs), event coreference resolution (ECR) systems have demonstrated outstanding performance in clustering coreferential events across documents. However, the state-of-the-art system exhibits an excessive reliance on the ‘triggers lexical matching’ spurious pattern in the input mention pair text. We formalize the decision-making process of the baseline ECR system using a Structural Causal Model (SCM), aiming to identify spurious and causal associations (i.e., rationales) within the ECR task. Leveraging the debiasing capability of counterfactual data augmentation, we develop a rationale-centric counterfactual data augmentation method with LLM-in-the-loop. This method is specialized for pairwise input in the ECR system, where we conduct direct interventions on triggers and context to mitigate the spurious association while emphasizing the causation. Our approach achieves state-of-the-art performance on three popular cross-document ECR benchmarks and demonstrates robustness in out-of-domain scenarios. 
+ 2024.naacl-long.63 + 2024.naacl-long.63.copyright.pdf + ding-etal-2024-rationale + + + <fixed-case>T</fixed-case>roj<fixed-case>FSP</fixed-case>: Trojan Insertion in Few-shot Prompt Tuning + MengxinZhengUniversity of Central Florida + JiaqiXue + XunChenSamsung Research America + YanshanWangUniversity of Pittsburgh + QianLouUniversity of Central Florida + LeiJiangIndiana University + 1141-1151 + Prompt tuning is one of the most effective solutions to adapting a fixed pre-trained language model (PLM) for various downstream tasks, especially with only a few input samples. However, the security issues, e.g., Trojan attacks, of prompt tuning on a few data samples are not well-studied. Transferring established data poisoning attacks directly to few-shot prompt tuning presents multiple challenges. One significant issue is the _poisoned imbalance issue_, where non-target class samples are added to the target class, resulting in a greater number of target-class samples compared to non-target class. While this issue is not critical in regular tuning, it significantly hampers the few-shot prompt tuning, making it difficult to simultaneously achieve a high attack success rate (ASR) and maintain clean data accuracy (CDA). Additionally, few-shot prompting is prone to overfitting in terms of both ASR and CDA. In this paper, we introduce _TrojFSP_, a method designed to address the challenges. To solve the poisoned imbalance issue, we develop a _Target-Class Shrink (TC-Shrink)_ technique, which aims to equalize the number of poisoning samples. To combat overfitting, we employ a _Selective Token Poisoning_ technique to boost attack performance. Furthermore, we introduce a _Trojan-Trigger Attention_ objective function to amplify the attention of the poisoned trojan prompt on triggers. Experiments show that our TrojFSP achieves an ASR of over 99% while maintaining negligible decreases in CDA across various PLMs and datasets. 
The source code of TrojFSP is available at _https://github.com/UCF-ML-Research/TrojFSP_. + 2024.naacl-long.64 + 2024.naacl-long.64.copyright.pdf + zheng-etal-2024-trojfsp + + + Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models + YiLuo + ZhenghaoLin + YuHaoZhang + JiashuoSun + ChenLinXiamen University + ChengjinXuInternational Digital Economy Academy + XiangdongSuInner Mongolia University + YelongShenMicrosoft + JianGuoHong Kong University of Science and Technology + YeyunGong + 1152-1197 + Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One of the current alignment techniques includes principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive library of guidelines and a model for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with relevant guidelines, which guide LLMs in response generation to ensure safe and high-quality outputs, thereby aligning with human values. An additional optional stage involves fine-tuning a model with well-aligned datasets generated through the process implemented in the second stage. Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. 
Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities. + 2024.naacl-long.65 + 2024.naacl-long.65.copyright.pdf + luo-etal-2024-ensuring + + + <fixed-case>X</fixed-case>-<fixed-case>PARADE</fixed-case>: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs + JuanRodriguezUniversity of Texas at Austin + KatrinErkUniversity of Texas, Austin + GregDurrettUniversity of Texas, Austin + 1198-1222 + Understanding when two pieces of text convey the same information is a goal touching many subproblems in NLP, including textual entailment and fact-checking. This problem becomes more complex when those two pieces of text are in different languages. Here, we introduce X-PARADE (Cross-lingual Paragraph-level Analysis of Divergences and Entailments), the first cross-lingual dataset of paragraph-level information divergences. Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language, indicating whether a given piece of information is the same, new, or new but can be inferred. This last notion establishes a link with cross-language NLI. Aligned paragraphs are sourced from Wikipedia pages in different languages, reflecting real information divergences observed in the wild. Armed with our dataset, we investigate a diverse set of approaches for this problem, including classic token alignment from machine translation, textual entailment methods that localize their decisions, and prompting LLMs. Our results show that these methods vary in their capability to handle inferable information, but they all fall short of human performance. 
+ 2024.naacl-long.66 + 2024.naacl-long.66.copyright.pdf + rodriguez-etal-2024-x + + + Topics, Authors, and Institutions in Large Language Model Research: Trends from 17<fixed-case>K</fixed-case> ar<fixed-case>X</fixed-case>iv Papers + RajivMovvaCornell University + SidhikaBalachandarDepartment of Computer Science, Cornell University + KennyPengCornell University + GabrielAgostiniCornell University + NikhilGargCornell University + EmmaPiersonCornell Tech + 1223-1243 + Large language models (LLMs) are dramatically influencing AI research, spurring discussions on what has changed so far and how to shape the field’s future. To clarify such questions, we analyze a new dataset of 16,979 LLM-related arXiv papers, focusing on recent trends in 2023 vs. 2018-2022. First, we study disciplinary shifts: LLM research increasingly considers societal impacts, evidenced by 20\times growth in LLM submissions to the Computers and Society sub-arXiv. An influx of new authors – half of all first authors in 2023 – are entering from non-NLP fields of CS, driving disciplinary expansion. Second, we study industry and academic publishing trends. Surprisingly, industry accounts for a smaller publication share in 2023, largely due to reduced output from Google and other Big Tech companies; universities in Asia are publishing more. Third, we study institutional collaboration: while industry-academic collaborations are common, they tend to focus on the same topics that industry focuses on rather than bridging differences. The most prolific institutions are all US- or China-based, but there is very little cross-country collaboration. We discuss implications around (1) how to support the influx of new authors, (2) how industry trends may affect academics, and (3) possible effects of (the lack of) collaboration. 
+ 2024.naacl-long.67 + 2024.naacl-long.67.copyright.pdf + movva-etal-2024-topics + + + <tex-math>E^5</tex-math>: Zero-shot Hierarchical Table Analysis using Augmented <fixed-case>LLM</fixed-case>s via Explain, Extract, Execute, Exhibit and Extrapolate + ZhehaoZhangDartmouth College + YanGao + Jian-GuangLouMicrosoft + 1244-1258 + Analyzing large hierarchical tables with multi-level headers presents challenges due to their complex structure, implicit semantics, and calculation relationships. While recent advancements in large language models (LLMs) have shown promise in flat table analysis, their application to hierarchical tables is constrained by the reliance on manually curated exemplars and the model’s token capacity limitations. Addressing these challenges, we introduce a novel code-augmented LLM-based framework, E^5, for zero-shot hierarchical table question answering. This approach encompasses self-explaining the table’s hierarchical structures, code generation to extract relevant information and apply operations, external code execution to prevent hallucinations, and leveraging LLMs’ reasoning for final answer derivation. Empirical results indicate that our method, based on GPT-4, outperforms state-of-the-art fine-tuning methods with a 44.38 Exact Match improvement. Furthermore, we present F^3, an adaptive algorithm designed for token-limited scenarios, effectively condensing large tables while maintaining useful information. Our experiments prove its efficiency, enabling the processing of large tables even with models having limited context lengths. The code is available at https://github.com/zzh-SJTU/E5-Hierarchical-Table-Analysis. 
+ 2024.naacl-long.68 + 2024.naacl-long.68.copyright.pdf + zhang-etal-2024-e5 + + + <fixed-case>S</fixed-case>3<fixed-case>E</fixed-case>val: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Model + FangyuLei + QianLiuSea AI Lab + YimingHuang + ShizhuHeInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + JunZhao + KangLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + 1259-1286 + The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like long-context understanding and reasoning. However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 200K tokens) they can process far exceeds what humans can reliably assess in a reasonable duration. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLMs evaluation. The synthetic nature of S3Eval provides users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval and real-world benchmarks demonstrates the soundness of using S3Eval for evaluation of LLMs. S3Eval provides a flexible and infinite long-context data generation method. We have generated a comprehensive dataset called S3Eval-Standard, and experimental results have shown that it poses significant challenges for all existing LLMs. 
+ 2024.naacl-long.69 + 2024.naacl-long.69.copyright.pdf + lei-etal-2024-s3eval + + + <fixed-case>MMC</fixed-case>: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning + FuxiaoLiu + XiaoyangWangTencent AI Lab + WenlinYaoTencent AI Lab + JianshuChenAmazon + KaiqiangSongTencent AI Lab + SangwooChoTencent AI Lab + YaserYacoobUniversity of Maryland, College Park + DongYuTencent AI Lab + 1287-1310 + With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (MMC-Instruction) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (MMC-Benchmark), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC. 
+ 2024.naacl-long.70 + 2024.naacl-long.70.copyright.pdf + liu-etal-2024-mmc + + + Visual Grounding Helps Learn Word Meanings in Low-Data Regimes + ChengxuZhuangMassachusetts Institute of Technology + EvelinaFedorenkoMassachusetts Institute of Technology + JacobAndreasMassachusetts Institute of Technology and Microsoft + 1311-1329 + Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways — requiring orders of magnitude more language data than children receive during development, and without perceptual or social context. Do models trained more naturalistically — with grounded supervision — exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models’ learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant—models mainly driven by visual information yield qualitatively different results from those mainly driven by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data. 
+ 2024.naacl-long.71 + 2024.naacl-long.71.copyright.pdf + zhuang-etal-2024-visual + + + Accurate Knowledge Distillation via n-best Reranking + HendraSetiawan + 1330-1345 + We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016) where we extract pseudo-labels for student model’s training data from top n-best hypotheses and leverage a diverse set of models with different inductive biases, objective functions or architectures, including some publicly-available large language models, to pick the highest-quality hypotheses as labels. The effectiveness of our proposal is validated through experiments on the WMT’21 German ↔ English and Chinese ↔ English translation tasks. Our results demonstrate that utilizing pseudo-labels generated by our n-best reranker leads to a significantly more accurate student model. In fact, our best student model achieves comparable accuracy to a large translation model from (Tran et al., 2021) with 4.7 billion parameters, while having two orders of magnitude fewer parameters. + 2024.naacl-long.72 + 2024.naacl-long.72.copyright.pdf + setiawan-2024-accurate + + + <fixed-case>A</fixed-case>uto<fixed-case>PRM</fixed-case>: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition + ZhaorunChen + ZhuokaiZhao + ZhihongZhu + RuiqiZhang + XiangLi + BhikshaRajCarnegie Mellon University, Carnegie Mellon University and Mohamed bin Zayed University of Artificial Intelligence + HuaxiuYaoDepartment of Computer Science, University of North Carolina at Chapel Hill + 1346-1362 + Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. 
To address this challenge, in this paper, we propose a novel self-supervised framework **AutoPRM** that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, **AutoPRM** first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially applies reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that **AutoPRM** significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, **AutoPRM** can be easily integrated with other orthogonal reasoning pipelines. + 2024.naacl-long.73 + 2024.naacl-long.73.copyright.pdf + chen-etal-2024-autoprm + + + <fixed-case>SEMQA</fixed-case>: Semi-Extractive Multi-Source Question Answering + TalSchusterGoogle DeepMind and Google + AdamLelkesGoogle + HaitianSunSchool of Computer Science, Carnegie Mellon University + JaiGuptaGoogle + JonathanBerantGoogle and Tel Aviv University + WilliamCohenGoogle DeepMind + DonaldMetzlerGoogle + 1363-1381 + Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans—copied verbatim from given input sources—and non-factual free-text connectors that glue these spans together into a single cohesive passage. 
This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities. + 2024.naacl-long.74 + 2024.naacl-long.74.copyright.pdf + schuster-etal-2024-semqa + + + Fine-Tuning Language Models with Reward Learning on Policy + HaoLang + FeiHuangAlibaba Group + YongbinLiAlibaba Group + 1382-1392 + Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. RLHF contains three steps, i.e., human preference collecting, reward learning, and policy optimization, which are usually performed serially. Despite its popularity, however, (fixed) reward models may suffer from inaccurate off-distribution, since policy optimization continuously shifts LLMs’ data distribution. Repeatedly collecting new preference data from the latest LLMs may alleviate this issue, which unfortunately makes the resulting system more complicated and difficult to optimize. In this paper, we propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution. Specifically, an unsupervised multi-view learning method is introduced to learn robust representations of policy samples. Meanwhile, a synthetic preference generation approach is 
developed to simulate high-quality preference data with policy outputs. Extensive experiments on three benchmark datasets show that RLP consistently outperforms the state-of-the-art. Our code is available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/rlp. + 2024.naacl-long.75 + 2024.naacl-long.75.copyright.pdf + lang-etal-2024-fine + + + A <fixed-case>U</fixed-case>niversal <fixed-case>D</fixed-case>ependencies Treebank for <fixed-case>H</fixed-case>ighland <fixed-case>P</fixed-case>uebla <fixed-case>N</fixed-case>ahuatl + RobertPugh + FrancisTyersIndiana University, Bloomington + 1393-1403 + We present a Universal Dependencies (UD) treebank for Highland Puebla Nahuatl. The treebank is only the second such UD corpus for a Mexican language, and supplements an existing treebank for another Nahuatl variant. We describe the process of data collection, annotation decisions and interesting syntactic constructions, and discuss some similarities and differences between the Highland Puebla Nahuatl treebank and the existing Western Sierra Puebla Nahuatl treebank. + 2024.naacl-long.76 + 2024.naacl-long.76.copyright.pdf + pugh-tyers-2024-universal + + + <fixed-case>COPAL</fixed-case>-<fixed-case>ID</fixed-case>: <fixed-case>I</fixed-case>ndonesian Language Reasoning with Local Culture and Nuances + HaryoWibowoMohamed bin Zayed University of Artificial Intelligence + ErlandFuadi + MadeNityasya + Radityo EkoPrasojoRukita + AlhamAjiMohamed bin Zayed University of Artificial Intelligence and Amazon + 1404-1422 + We present COPAL-ID, a novel, public Indonesian language common sense reasoning dataset. Unlike the previous Indonesian COPA dataset (XCOPA-ID), COPAL-ID incorporates Indonesian local and cultural nuances, and therefore, provides a more natural portrayal of day-to-day causal reasoning within the Indonesian cultural sphere. Professionally written by natives from scratch, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID. 
In addition, we present COPAL-ID in both standard Indonesian and in Jakartan Indonesian–a dialect commonly used in daily conversation. COPAL-ID poses a greater challenge for existing open-sourced and closed state-of-the-art multilingual language models, yet is trivially easy for humans. Our findings suggest that general multilingual models struggle to perform well, achieving 66.91% accuracy on COPAL-ID. South-East Asian-specific models achieve slightly better performance of 73.88% accuracy. Yet, this number still falls short of near-perfect human performance. This shows that these language models are still way behind in comprehending the local nuances of Indonesian. + 2024.naacl-long.77 + 2024.naacl-long.77.copyright.pdf + wibowo-etal-2024-copal + + + <fixed-case>I</fixed-case>ter<fixed-case>A</fixed-case>lign: Iterative Constitutional Alignment of Large Language Models + XiusiChenUniversity of California, Los Angeles + HongzhiWen + SreyashiNagAmazon + ChenLuoAmazon + QingyuYinAmazon + RuiruiLi + ZhengLiAmazon + WeiWangUniversity of California, Los Angeles + 1423-1433 + With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning with human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotations or explicitly pre-defined constitutions, which are labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose a data-driven constitution discovery and self-alignment framework called IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. 
Such a constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness and honesty, improving the LLM alignment by up to 13.5% in harmlessness. + 2024.naacl-long.78 + 2024.naacl-long.78.copyright.pdf + chen-etal-2024-iteralign + + + <fixed-case>O</fixed-case>rchestra<fixed-case>LLM</fixed-case>: Efficient Orchestration of Language Models for Dialogue State Tracking + Chia-HsuanLee + HaoChengMicrosoft Research + MariOstendorfUniversity of Washington + 1434-1445 + Large language models (LLMs) have revolutionized the landscape of Natural Language Processing, but are computationally expensive. To reduce the cost without sacrificing performance, previous studies have explored various approaches to harness the potential of Smaller Language Models (SLMs) as cost-effective alternatives to their larger counterparts. Driven by findings that SLMs and LLMs exhibit complementary strengths in a structured knowledge extraction task, this work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance. In dialogue state tracking tasks, the proposed routing framework enhances performance substantially compared to relying solely on LLMs, while reducing the computational costs by over 50%. + 2024.naacl-long.79 + 2024.naacl-long.79.copyright.pdf + lee-etal-2024-orchestrallm + + + Multi-Operational Mathematical Derivations in Latent Space + MarcoValentino + JordanMeadows + LanZhangUniversity of Manchester + AndreFreitasIdiap Research Institute and University of Manchester + 1446-1458 + This paper investigates the possibility of approximating multiple mathematical operations in latent space for expression derivation. 
To this end, we introduce different multi-operational representation paradigms, modelling mathematical operations as explicit geometric transformations. By leveraging a symbolic engine, we construct a large-scale dataset comprising 1.7M derivation steps stemming from 61K premises and 6 operators, analysing the properties of each paradigm when instantiated with state-of-the-art neural encoders. Specifically, we investigate how different encoding mechanisms can approximate expression manipulation in latent space, exploring the trade-off between learning different operators and specialising within single operations, as well as the ability to support multi-step derivations and out-of-distribution generalisation. Our empirical analysis reveals that the multi-operational paradigm is crucial for disentangling different operators, while discriminating the conclusions for a single operation is achievable in the original expression encoder. Moreover, we show that architectural choices can heavily affect the training dynamics, structural organisation, and generalisation of the latent space, resulting in significant variations across paradigms and classes of encoders. + 2024.naacl-long.80 + 2024.naacl-long.80.copyright.pdf + valentino-etal-2024-multi-operational + + + Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong + ChengleiSiStanford University + NavitaGoyalUniversity of Maryland, College Park + TongshuangWuSchool of Computer Science, Carnegie Mellon University + ChenZhao + ShiFengGeorge Washington University + HalDaumé IiiUniversity of Maryland - College Park, University of Maryland, College Park and Microsoft + JordanBoyd-GraberUniversity of Maryland, College Park + 1459-1474 + Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. 
To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. We conduct human experiments with 80 crowdworkers to compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information—explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users’ over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences. + 2024.naacl-long.81 + 2024.naacl-long.81.copyright.pdf + si-etal-2024-large + + + <fixed-case>X</fixed-case>fer<fixed-case>B</fixed-case>ench: a Data-Driven Benchmark for Emergent Language + BrendonBoldtSchool of Computer Science, Carnegie Mellon University + DavidMortensenCarnegie Mellon University + 1475-1489 + In this paper, we introduce a benchmark for evaluating the overall quality of emergent languages using data-driven methods. Specifically, we interpret the notion of the “quality” of an emergent language as its similarity to human language within a deep learning framework. 
We measure this by using the emergent language as pretraining data for downstream NLP tasks in human language—the better the downstream performance, the better the emergent language. We implement this benchmark as an easy-to-use Python package that only requires a text file of utterances from the emergent language to be evaluated. Finally, we empirically test the benchmark’s validity using human, synthetic, and emergent language baselines. + 2024.naacl-long.82 + 2024.naacl-long.82.copyright.pdf + boldt-mortensen-2024-xferbench + + + Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation + Se-eunYoon + ZhankuiHeUniversity of California, San Diego, University of California, San Diego + JessicaEchterhoffUniversity of California, San Diego + JulianMcAuleyUniversity of California, San Diego, University of California, San Diego + 1490-1504 + Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol is comprised of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies. 
+ 2024.naacl-long.83 + 2024.naacl-long.83.copyright.pdf + yoon-etal-2024-evaluating + + + A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers + JordanMeadows + MarcoValentino + DamienTeneyIdiap Research Institute + AndreFreitasIdiap Research Institute and University of Manchester + 1505-1523 + This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may potentially rival GPT by incorporating appropriately structured derivation dependencies during training, and highlight a shared weakness between BERT and GPT involving a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field. 
+ 2024.naacl-long.84 + 2024.naacl-long.84.copyright.pdf + meadows-etal-2024-symbolic + + + Identifying Linear Relational Concepts in Large Language Models + DavidChaninUniversity College London, University of London + AnthonyHunter + Oana-MariaCamburuDepartment of Computer Science, University College London, University of London + 1524-1535 + Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts by first modeling the relation between subject and object as a linear relational embedding (LRE). We find that inverting the LRE and using earlier object layers results in a powerful technique for finding concept directions that outperforms standard black-box probing classifiers. We evaluate LRCs on their performance as concept classifiers as well as their ability to causally change model output. + 2024.naacl-long.85 + 2024.naacl-long.85.copyright.pdf + chanin-etal-2024-identifying + + + Benchmark Transparency: Measuring the Impact of Data on Evaluation + VenelinKovatchevUniversity of Birmingham + MatthewLeaseUniversity of Texas at Austin, Amazon and University of Texas at Austin + 1536-1551 + In this paper we present exploratory research on quantifying the impact that data distribution has on the performance and evaluation of NLP models. We propose an automated framework that measures the data point distribution across 6 different dimensions: ambiguity, difficulty, discriminability, length, noise, and perplexity. We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance. 
We experiment on 2 different datasets (SQUAD and MNLI) and test a total of 135 different models (125 on SQUAD and 10 on MNLI). We demonstrate that without explicit control of the data distribution, standard evaluation frameworks are inconsistent and unreliable. We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric. In a second set of experiments, we demonstrate that the impact of data on evaluation is not just observable, but also predictable. We propose to use benchmark transparency as a method for comparing datasets and quantifying the similarity between them. We find that the “dataset similarity vector” can be used to predict how well a model generalizes out of distribution. + 2024.naacl-long.86 + 2024.naacl-long.86.copyright.pdf + kovatchev-lease-2024-benchmark + + + <fixed-case>JAMDEC</fixed-case>: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models + JillianFisherUniversity of Washington + XimingLuDepartment of Computer Science, University of Washington + JaehunJungUniversity of Washington + LiweiJiang + ZaidHarchaoui + YejinChoiDepartment of Computer Science, University of Washington + 1552-1581 + The permanence of online content combined with the enhanced authorship identification techniques calls for stronger computational methods to protect the identity and privacy of online authorship when needed, e.g., blind reviews for scientific papers, anonymous online reviews, or anonymous interactions in the mental health forums. 
In this paper, we propose an unsupervised inference-time approach to authorship obfuscation to address the unique challenges of authorship obfuscation: lack of supervision data for diverse authorship and domains, and the need for a sufficient level of revision beyond simple paraphrasing to obfuscate the authorship, all the while preserving the original content and fluency. We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation that can be in principle applied to any text and authorship. Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLM APIs, while also reducing the performance gap between small and large language models via algorithmic enhancement. The key idea behind our approach is to boost the creative power of smaller language models through constrained decoding, while also allowing for user-specified controls and flexibility. Experimental results demonstrate that our approach based on GPT2-XL outperforms previous state-of-the-art methods based on comparably small models, while performing competitively against GPT3.5 175B, a proprietary model that is two orders of magnitude larger. + 2024.naacl-long.87 + 2024.naacl-long.87.copyright.pdf + fisher-etal-2024-jamdec + + + <fixed-case>REST</fixed-case>: Retrieval-Based Speculative Decoding + ZhenyuHe + ZexuanZhong + TianleCai + JasonLeePrinceton University + DiHePeking University and Microsoft + 1582-1595 + We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. 
This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language model, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62 \times to 2.36 \times on code or text generation. The source code of REST is available at https://github.com/FasterDecoding/REST. + 2024.naacl-long.88 + 2024.naacl-long.88.copyright.pdf + he-etal-2024-rest + + + Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations + SihaoChen + HongmingZhang + TongChen + BenZhouUniversity of Pennsylvania + WenhaoYuTencent AI Lab + DianYuTencent AI Lab + BaolinPengTencent AI Lab + HongweiWangTencent AI Lab + DanRothAmazon and University of Pennsylvania + DongYuTencent AI Lab + 1596-1609 + We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders. 
+ 2024.naacl-long.89 + 2024.naacl-long.89.copyright.pdf + chen-etal-2024-sub + + + <fixed-case>MS</fixed-case>ci<fixed-case>NLI</fixed-case>: A Diverse Benchmark for Scientific Natural Language Inference + MobashirSadat + CorneliaCarageaUniversity of Illinois, Chicago + 1610-1629 + The task of scientific Natural Language Inference (NLI) involves predicting the semantic relation between two sentences extracted from research articles. This task was recently proposed along with a new dataset called SciNLI derived from papers published in the computational linguistics domain. In this paper, we aim to introduce diversity in the scientific NLI task and present MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains. The availability of multiple domains makes it possible to study domain shift for scientific NLI. We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs). The highest Macro F1 scores of PLM and LLM baselines are 77.21% and 51.77%, respectively, illustrating that MSciNLI is challenging for both types of models. Furthermore, we show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset. Finally, we use both scientific NLI datasets in an intermediate task transfer learning setting and show that they can improve the performance of downstream tasks in the scientific domain. We make our dataset and code available on Github. 
+ 2024.naacl-long.90 + 2024.naacl-long.90.copyright.pdf + sadat-caragea-2024-mscinli + + + Causal Inference for Human-Language Model Collaboration + BohanZhangUniversity of Michigan - Ann Arbor + YixinWangUniversity of Michigan - Ann Arbor + ParamveerDhillonUniversity of Michigan + 1630-1647 + In this paper, we examine the collaborative dynamics between humans and language models (LMs), where the interactions typically involve LMs proposing text segments and humans editing or responding to these proposals. Productive engagement with LMs in such scenarios necessitates that humans discern effective text-based interaction strategies, such as editing and response styles, from historical human-LM interactions. This objective is inherently causal, driven by the counterfactual ‘what-if’ question: how would the outcome of collaboration change if humans employed a different text editing/refinement strategy? A key challenge in answering this causal inference question is formulating an appropriate causal estimand: the conventional average treatment effect (ATE) estimand is inapplicable to text-based treatments due to their high dimensionality. To address this concern, we introduce a new causal estimand– *Incremental Stylistic Effect (ISE)*, which characterizes the average impact of infinitesimally shifting a text towards a specific style, such as increasing formality. We establish the conditions for the non-parametric identification of ISE. Building on this, we develop *CausalCollab*, an algorithm designed to estimate the ISE of various interaction strategies in dynamic human-LM collaborations. Our empirical investigations across three distinct human-LM collaboration scenarios reveal that *CausalCollab* effectively reduces confounding and significantly improves counterfactual estimation over a set of competitive baselines. 
+ 2024.naacl-long.91 + 2024.naacl-long.91.copyright.pdf + zhang-etal-2024-causal + + + <fixed-case>SELF</fixed-case>-<fixed-case>GUARD</fixed-case>: Empower the <fixed-case>LLM</fixed-case> to Safeguard Itself + ZezhongWang + FangkaiYangMicrosoft + LuWangMicrosoft + PuZhao + HongruWangThe Chinese University of Hong Kong + LiangChenChinese University of Hong Kong, The Chinese University of Hong Kong + QingweiLinMicrosoft Research + Kam-FaiWongThe Chinese University of Hong Kong + 1648-1668 + With the increasing risk posed by jailbreak attacks, recent studies have investigated various methods to improve the safety of large language models (LLMs), mainly falling into two strategies: safety training and safeguards. Safety training involves fine-tuning the LLM with adversarial samples, which activate the LLM’s capabilities against jailbreak. However, it is not always effective in countering new attacks and often leads to potential performance degradation. Safeguards, on the other hand, are methods using additional models to filter harmful content from the LLM’s response. Nevertheless, they can only reduce a limited amount of harmful output and introduce extra computational costs. Given the distinct strengths and weaknesses of both, we combine them to balance out their flaws and propose a more effective method called Self-Guard. Specifically, we train the LLM to review its responses for any harmful content and append a [harmful] or [harmless] tag to the end of the response. In this way, Self-Guard possesses the advantages of safety training, leveraging the powerful capabilities of the LLMs themselves to detect harmfulness. Besides that, it gains flexibility like safeguards, making the safety check target the output side, which makes the system less vulnerable to attack updates. Experimental results indicate that our Self-Guard can effectively defend against jailbreak attacks and will not cause LLMs’ performance degradation. 
+ 2024.naacl-long.92 + 2024.naacl-long.92.copyright.pdf + wang-etal-2024-self + + + <fixed-case>COSIGN</fixed-case>: Contextual Facts Guided Generation for Knowledge Graph Completion + JinpengLi + HangYuShanghai University + XiangfengLuo + QianLiuUniversity of Auckland + 1669-1682 + Knowledge graph completion (KGC) aims to infer missing facts based on existing facts within a KG. Recently, research on generative models (GMs) has addressed the limitations of embedding methods in terms of generality and scalability. However, GM-based methods are sensitive to contextual facts on KG, so the contextual facts of poor quality can cause GMs to generate erroneous results. To improve the performance of GM-based methods for various KGC tasks, we propose a COntextual FactS GuIded GeneratioN (COSIGN) model. First, to enhance the inference ability of the generative model, we designed a contextual facts collector to achieve human-like retrieval behavior. Second, a contextual facts organizer is proposed to learn the organized capabilities of LLMs through knowledge distillation. Finally, the organized contextual facts serve as the input of the inference generator to generate missing facts. Experimental results demonstrate that COSIGN outperforms state-of-the-art baseline techniques in terms of performance. + 2024.naacl-long.93 + 2024.naacl-long.93.copyright.pdf + li-etal-2024-cosign + + + Toward Informal Language Processing: Knowledge of Slang in Large Language Models + ZheweiSunDepartment of Computer Science, University of Toronto + QianHuAmazon + RahulGupta + RichardZemelDepartment of Computer Science, Columbia University and Department of Computer Science, University of Toronto + YangXuDepartment of Computer Science, University of Toronto + 1683-1701 + Recent advancement in large language models (LLMs) has offered a strong potential for natural language systems to process informal language. 
A representative form of informal language is slang, used commonly in daily conversations and online social media. To date, slang has not been comprehensively evaluated in LLMs due partly to the absence of a carefully designed and publicly accessible benchmark. Using movie subtitles, we construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang. For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications: 1) slang detection, and 2) identification of regional and historical sources of slang from natural sentences. We also show how our dataset can be used to probe the output distributions of LLMs for interpretive insights. We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance. Furthermore, we show that our dataset enables finetuning of LLMs such as GPT-3.5 that achieve substantially better performance than strong zero-shot baselines. Our work offers a comprehensive evaluation and a high-quality benchmark on English slang based on the OpenSubtitles corpus, serving both as a publicly accessible resource and a platform for applying tools for informal language processing. 
+ 2024.naacl-long.94 + 2024.naacl-long.94.copyright.pdf + sun-etal-2024-toward + + + Ghostbuster: Detecting Text Ghostwritten by Large Language Models + VivekVerma + EveFleisig + NicholasTomlinUniversity of California Berkeley + DanKleinUniversity of California, Berkeley + 1702-1717 + We introduce Ghostbuster, a state-of-the-art system for detecting AI-generated text. Our method works by passing documents through a series of weaker language models, running a structured search over possible combinations of their features, and then training a classifier on the selected features to predict whether documents are AI-generated. Crucially, Ghostbuster does not require access to token probabilities from the target model, making it useful for detecting text generated by black-box or unknown models. In conjunction with our model, we release three new datasets of human- and AI-generated text as detection benchmarks in the domains of student essays, creative writing, and news articles. We compare Ghostbuster to several existing detectors, including DetectGPT and GPTZero, as well as a new RoBERTa baseline. Ghostbuster achieves 99.0 F1 when evaluated across domains, which is 5.9 F1 higher than the best preexisting model. It also outperforms all previous approaches in generalization across writing domains (+7.5 F1), prompting strategies (+2.1 F1), and language models (+4.4 F1). We also analyze our system’s robustness to a variety of perturbations and paraphrasing attacks, and evaluate its performance on documents by non-native English speakers. + 2024.naacl-long.95 + 2024.naacl-long.95.copyright.pdf + verma-etal-2024-ghostbuster + + + End-to-End Beam Retrieval for Multi-Hop Question Answering + JiahaoZhang + HaiyangZhang + DongmeiZhang + LiuYong + ShenHuang + 1718-1731 + Multi-hop question answering (QA) involves finding multiple relevant passages and step-by-step reasoning to answer complex questions, indicating a retrieve-and-read paradigm. 
However, previous retrievers were customized for two-hop questions, and most of them were trained separately across different hops, resulting in a lack of supervision over the entire multi-hop retrieval process and leading to poor performance in complicated scenarios beyond two hops. In this work, we introduce Beam Retrieval, an end-to-end beam retrieval framework for multi-hop QA. This approach models the multi-hop retrieval process in an end-to-end manner by jointly optimizing an encoder and two classification heads across all hops. Moreover, Beam Retrieval maintains multiple partial hypotheses of relevant passages at each step, expanding the search space and reducing the risk of missing relevant passages. To establish a complete QA system, we incorporate a supervised reader or a large language model (LLM). Experimental results demonstrate that Beam Retrieval achieves a nearly 50% improvement compared with baselines on challenging MuSiQue-Ans, and it also surpasses all previous retrievers on HotpotQA and achieves 99.9% precision on 2WikiMultiHopQA. Providing high-quality context, Beam Retrieval helps our supervised reader achieve new state-of-the-art performance and substantially improves the few-shot QA performance of LLMs. + 2024.naacl-long.96 + 2024.naacl-long.96.copyright.pdf + zhang-etal-2024-end + + + Leveraging Generative Large Language Models with Visual Instruction and Demonstration Retrieval for Multimodal Sarcasm Detection + BinghaoTang + BodaLin + HaolongYan + SiLiBeijing University of Posts and Telecommunications + 1732-1742 + Multimodal sarcasm detection aims to identify sarcasm in the given image-text pairs and has wide applications in the multimodal domains. Previous works primarily design complex network structures to fuse the image-text modality features for classification. However, such complicated structures may risk overfitting on in-domain data, reducing the performance in out-of-distribution (OOD) scenarios. 
Additionally, existing methods typically do not fully utilize cross-modal features, limiting their performance on in-domain datasets. Therefore, to build a more reliable multimodal sarcasm detection model, we propose a generative multimodal sarcasm model consisting of a designed instruction template and a demonstration retrieval module based on the large language model. Moreover, to assess the generalization of current methods, we introduce an OOD test set, RedEval. Experimental results demonstrate that our method is effective and achieves state-of-the-art (SOTA) performance on the in-domain MMSD2.0 and OOD RedEval datasets. + 2024.naacl-long.97 + 2024.naacl-long.97.copyright.pdf + tang-etal-2024-leveraging + + + Multi-Scale Prompt Memory-Augmented Model for Black-Box Scenarios + XiaojunKuang + C. L. PhilipChenSouth China University of Technology + ShuzhenLi + TongZhangSouth China University of Technology + 1743-1757 + Black-box few-shot text classification handles text classification in limited data without accessing the parameters and gradients of language models (LMs). Existing black-box optimization methods have demonstrated strong few-shot learning capabilities. However, they still require numerous LMs’ calls to search optimal prompts, thus resulting in overfitting performance and increasing computational cost. To address this issue, we present MuSKPrompt (Multi-scale Knowledge Prompt for Memory Model), an efficient multi-scale knowledge prompt-based memory model in black-box few-shot text classification task. MuSKPrompt extracts instance-level and class-level knowledge at different scales and stores them in memory banks during training. Then, it references multi-scale memory banks to perform quick inference on new samples via a novel scoring module. MuSKPrompt achieves competitive performance in limited data through multi-scale instance-level and class-level knowledge. 
Moreover, it realizes gradient-free optimization with zero training parameters in the black-box scenario. Experiments on different benchmarks and parameter analysis demonstrate the effectiveness and efficiency of MuSKPrompt in black-box few-shot text classification tasks. + 2024.naacl-long.98 + 2024.naacl-long.98.copyright.pdf + kuang-etal-2024-multi + + + Ungrammatical-syntax-based In-context Example Selection for Grammatical Error Correction + ChenmingTang + FanyiQu + YunfangWu + 1758-1770 + In the era of large language models (LLMs), in-context learning (ICL) stands out as an effective prompting strategy that explores LLMs’ potency across various tasks. However, applying LLMs to grammatical error correction (GEC) is still a challenging task. In this paper, we propose a novel ungrammatical-syntax-based in-context example selection strategy for GEC. Specifically, we measure similarity of sentences based on their syntactic structures with diverse algorithms, and identify optimal ICL examples sharing the most similar ill-formed syntax to the test input. Additionally, we carry out a two-stage process to further improve the quality of selection results. On benchmark English GEC datasets, empirical results show that our proposed ungrammatical-syntax-based strategies outperform commonly-used word-matching or semantics-based methods with multiple LLMs. This indicates that for a syntax-oriented task like GEC, paying more attention to syntactic information can effectively boost LLMs’ performance. Our code is available at https://github.com/JamyDon/SynICL4GEC. + 2024.naacl-long.99 + 2024.naacl-long.99.copyright.pdf + tang-etal-2024-ungrammatical + + + <fixed-case>BUFFET</fixed-case>: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer + AkariAsaiPaul G. 
Allen School of Computer Science & Engineering, University of Washington + SnehaKuduguntaDepartment of Computer Science + XinyanYuUniversity of Southern California + TerraBlevinsUniversity of Washington + HilaGonenFacebook + MachelReidGoogle + YuliaTsvetkovDepartment of Computer Science, University of Washington + SebastianRuderGoogle + HannanehHajishirziUniversity of Washington, University of Washington, Allen Institute for Artificial Intelligence and University of Washington, Seattle + 1771-1800 + Despite remarkable advancements in few-shot generalization in natural language processing, most models are developed and evaluated primarily in English. To establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer, we introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples and instructions. Using BUFFET, we perform thorough evaluations of ten state-of-the-art multilingual large language models with different transfer methods, namely in-context learning and fine-tuning. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer. Strong multilingual pre-trained or instruction-tuned models such as BLOOM or ChatGPT often lag behind much smaller mT5-base models given the same number of few-shot samples, particularly in low-resource languages. Our analysis suggests avenues for future research in few-shot cross-lingual transfer. + 2024.naacl-long.100 + 2024.naacl-long.100.copyright.pdf + asai-etal-2024-buffet + + + <fixed-case>TISE</fixed-case>: A Tripartite In-context Selection Method for Event Argument Extraction + YanheFu + YananCao + QingyueWang + YiLiu + 1801-1818 + In-context learning enhances the reasoning capabilities of LLMs by providing several examples. 
A direct yet effective approach to obtain in-context example is to select the top-k examples based on their semantic similarity to the test input. However, when applied to event argument extraction (EAE), this approach exhibits two shortcomings: 1) It may select almost identical examples, thus failing to provide additional event information, and 2) It overlooks event attributes, leading to the selected examples being unrelated to the test event type. In this paper, we introduce three necessary requirements when selecting an in-context example for EAE task: semantic similarity, example diversity and event correlation. And we further propose TISE, which scores examples from these three perspectives and integrates them using Determinantal Point Processes to directly select a set of examples as context. Experimental results on the ACE05 dataset demonstrate the effectiveness of TISE and the necessity of three requirements. Furthermore, we surprisingly observe that TISE can achieve superior performance with fewer examples and can even exceed some supervised methods. + 2024.naacl-long.101 + 2024.naacl-long.101.copyright.pdf + fu-etal-2024-tise + + + Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks + ZhaofengWuMassachusetts Institute of Technology + LinluQiuMassachusetts Institute of Technology + AlexisRossMassachusetts Institute of Technology and Allen Institute for Artificial Intelligence + EkinAkyürek + BoyuanChen + BailinWang + NajoungKimBoston University and Google + JacobAndreasMassachusetts Institute of Technology and Microsoft + YoonKimMassachusetts Institute of Technology + 1819-1862 + The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? 
To disentangle these effects, we propose an evaluation framework based on “counterfactual” task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects. + 2024.naacl-long.102 + 2024.naacl-long.102.copyright.pdf + wu-etal-2024-reasoning + + + <fixed-case>TRUE</fixed-case>-<fixed-case>UIE</fixed-case>: Two Universal Relations Unify Information Extraction Tasks + YuchengWangShanghai Artificial Intelligence Laboratory + BowenYuAlibaba Group + YilinLiu + ShudongLu + 1863-1876 + Information extraction (IE) encounters challenges due to the variety of schemas and objectives that differ across tasks. Recent advancements hint at the potential for universal approaches to model such tasks, referred to as Universal Information Extraction (UIE). While handling diverse tasks in one model, their generalization is limited since they are actually learning task-specific knowledge. In this study, we introduce an innovative paradigm known as TRUE-UIE, wherein all IE tasks are aligned to learn the same goals: extracting mention spans and two universal relations named \mathtt{NEXT} and \mathtt{IS}. During the decoding process, the \mathtt{NEXT} relation is utilized to group related elements, while the \mathtt{IS} relation, in conjunction with structured language prompts, undertakes the role of type recognition. 
Additionally, we consider the sequential dependency of tokens during span extraction, an aspect often overlooked in prevalent models. Our empirical experiments indicate that TRUE-UIE achieves state-of-the-art performance on established benchmarks encompassing 16 datasets, spanning 7 diverse IE tasks. Further evaluations reveal that our approach effectively shares knowledge between different IE tasks, showcasing significant transferability in zero-shot and few-shot scenarios. + 2024.naacl-long.103 + 2024.naacl-long.103.copyright.pdf + wang-etal-2024-true + + + zr<fixed-case>LLM</fixed-case>: Zero-Shot Relational Learning on Temporal Knowledge Graphs with Large Language Models + ZifengDingLMU Munich + HelingCaiLudwig-Maximilians-Universität München + JingpeiWu, Institute of Computer Science + YunpuMaSiemens Corporate Research + RuotongLiao + BoXiongUniversity of Stuttgart + VolkerTrespLudwig Maximilian University of Munich and Siemens Corporate Research + 1877-1895 + Modeling evolving knowledge over temporal knowledge graphs (TKGs) has become a heated topic. Various methods have been proposed to forecast links on TKGs. Most of them are embedding-based, where hidden representations are learned to represent knowledge graph (KG) entities and relations based on the observed graph contexts. Although these methods show strong performance on traditional TKG forecasting (TKGF) benchmarks, they face a strong challenge in modeling the unseen zero-shot relations that have no prior graph context. In this paper, we try to mitigate this problem as follows. We first input the text descriptions of KG relations into large language models (LLMs) for generating relation representations, and then introduce them into embedding-based TKGF methods. LLM-empowered representations can capture the semantic information in the relation descriptions. 
This makes the relations, whether seen or unseen, with similar semantic meanings stay close in the embedding space, enabling TKGF models to recognize zero-shot relations even without any observed graph context. Experimental results show that our approach helps TKGF models to achieve much better performance in forecasting the facts with previously unseen relations, while still maintaining their ability in link forecasting regarding seen relations. + 2024.naacl-long.104 + 2024.naacl-long.104.copyright.pdf + ding-etal-2024-zrllm + + + Embodied Executable Policy Learning with Language-based Scene Summarization + JielinQiu + MengdiXuCarnegie Mellon University + WilliamHan + SeungwhanMoonFacebook + DingZhaoCarnegie Mellon University + 1896-1913 + Large Language models (LLMs) have shown remarkable success in assisting robot learning tasks, i.e., complex household planning. However, the performance of pretrained LLMs heavily relies on domain-specific templated text data, which may be infeasible in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve with non-expert interactions with environments. In this work, we introduce a novel learning paradigm that generates robots’ executable actions in the form of text, derived solely from visual observations. Our proposed paradigm stands apart from previous works, which utilized either language instructions or a combination of language and visual data as inputs. We demonstrate that our proposed method can employ two fine-tuning strategies, including imitation learning and reinforcement learning approaches, to adapt to the target test tasks effectively. We conduct extensive experiments involving various model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm. 
+ 2024.naacl-long.105 + 2024.naacl-long.105.copyright.pdf + qiu-etal-2024-embodied + + + Metacognitive Prompting Improves Understanding in Large Language Models + YuqingWangStanford University + YunZhao + 1914-1926 + In Large Language Models (LLMs), there have been consistent advancements in task-specific performance, largely influenced by effective prompt design. Recent advancements in prompting have enhanced reasoning in logic-intensive tasks for LLMs, yet the nuanced understanding abilities of these models, crucial for processing and interpreting complex information, remain underexplored. In this study, we introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes. Using MP, LLMs undergo a systematic series of structured, self-aware evaluations, drawing on both their vast inherent knowledge and new insights. We conduct extensive experiments on four prevalent LLMs: Llama2, PaLM2, GPT-3.5, and GPT-4, across ten natural language understanding (NLU) datasets from GLUE, SuperGLUE, BLUE, and LexGLUE benchmarks. Additionally, we compare our method with chain-of-thought prompting and its advanced versions. The results show that GPT-4 consistently excels across all tasks, while other models have shown significant progress in some tasks when used in conjunction with MP. Furthermore, MP consistently outperforms existing prompting methods in both general and domain-specific NLU tasks. This study underscores the potential to amplify the understanding abilities of LLMs and highlights the benefits of mirroring human introspective reasoning in NLU tasks. + 2024.naacl-long.106 + 2024.naacl-long.106.copyright.pdf + wang-zhao-2024-metacognitive + + + <fixed-case>MART</fixed-case>: Improving <fixed-case>LLM</fixed-case> Safety with Multi-round Automatic Red-Teaming + SuyuGeMeta + ChuntingZhouMeta AI + RuiHouMeta Inc. 
+ MadianKhabsaFacebook + Yi-ChiaWangMeta + QifanWangMeta AI + JiaweiHan + YuningMaoMeta + 1927-1937 + Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs), which involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses. While effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them. In this paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing red-teaming scalability and the safety of the target LLM. Specifically, an adversarial LLM and a target LLM interplay with each other in an iterative manner, where the adversarial LLM aims to generate challenging prompts that elicit unsafe responses from the target LLM, while the target LLM is fine-tuned with safety aligned data on these adversarial prompts. In each round, the adversarial LLM crafts better attacks on the updated target LLM, while the target LLM also improves itself through safety fine-tuning. On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment reduces up to 84.7% after 4 rounds of MART, achieving comparable performance to LLMs with extensive adversarial prompt writing. Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following. 
+ 2024.naacl-long.107 + 2024.naacl-long.107.copyright.pdf + ge-etal-2024-mart + + + <fixed-case>D</fixed-case>ialog<fixed-case>CC</fixed-case>: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset + Young-JunLeeKorea Advanced Institute of Science & Technology + ByungsooKoNAVER + Han-GyuKim + JonghwanHyeonKAIST + Ho-JinChoiKorea Advanced Institute of Science & Technology + 1938-1963 + As sharing images in an instant message is a crucial factor, there has been active research on learning image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset, ensuring both dialogue quality and image diversity without requiring minimum human effort. In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments - specifically, the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency between aligned multiple images to the utterance. Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in terms of quality and diversity in human evaluation. Our comprehensive experiments highlight that when multi-modal dialogue models are trained using our dataset, their generalization performance on unseen dialogue datasets is significantly enhanced. We make our source code and dataset publicly available (https://dialogcc.github.io/). 
+ 2024.naacl-long.108 + 2024.naacl-long.108.copyright.pdf + lee-etal-2024-dialogcc + + + Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models + KemingLu + HongyiYuan + RunjiLin + JunyangLin + ZhengYuanAlibaba Group + ChangZhou + JingrenZhouAlibaba Group + 1964-1974 + The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complementary potential of LLMs and further elaborate on it by mining latent expertise with off-the-shelf reward models. We propose ZOOTER, a reward-guided routing method distilling rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise about it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. ZOOTER shows computation efficiency in inference as it only introduces minor computation overhead of a routing function compared with reward model ranking methods. We evaluate ZOOTER on a comprehensive benchmark collection with 26 subsets in different domains and tasks. ZOOTER outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods. 
+ 2024.naacl-long.109 + 2024.naacl-long.109.copyright.pdf + lu-etal-2024-routing + + + Automatic Generation of Model and Data Cards: A Step Towards Responsible <fixed-case>AI</fixed-case> + JiaruiLiu + WenkaiLi + ZhijingJin + MonaDiabCarnegie Mellon University and George Washington University + 1975-1997 + In an era of model and data proliferation in machine learning/AI especially marked by the rapid advancement of open-sourced technologies, there arises a critical need for standardized consistent documentation. Our work addresses the information incompleteness in current human-written model and data cards. We propose an automated generation approach using Large Language Models (LLMs). Our key contributions include the establishment of CardBench, a comprehensive dataset aggregated from over 4.8k model cards and 1.4k data cards, coupled with the development of the CardGen pipeline comprising a two-step retrieval process. Our approach exhibits enhanced completeness, objectivity, and faithfulness in generated model and data cards, a significant step in responsible AI documentation practices ensuring better accountability and traceability. + 2024.naacl-long.110 + 2024.naacl-long.110.copyright.pdf + liu-etal-2024-automatic + + + <fixed-case>FUN</fixed-case> with Fisher: Improving Generalization of Adapter-Based Cross-lingual Transfer with Scheduled Unfreezing + ChenLiu + JonasPfeifferGoogle DeepMind + IvanVulićUniversity of Cambridge and PolyAI Limited + IrynaGurevychMohamed bin Zayed University of Artificial Intelligence and Technical University of Darmstadt + 1998-2015 + Standard fine-tuning of language models typically performs well on \textit{in-distribution data}, but suffers with generalization to \textit{distribution shifts}. In this work, we aim to improve the generalization of adapter-based cross-lingual task transfer where such cross-language distribution shifts are imminent. 
We investigate scheduled unfreezing algorithms –originally proposed to mitigate catastrophic forgetting in transfer learning – for fine-tuning task adapters. Our experiments show that scheduled unfreezing methods close the gap to full fine-tuning and achieve stronger cross-lingual transfer performance, suggesting that these methods can go beyond just mitigating catastrophic forgetting. Next, aiming to understand these empirical findings, we investigate the learning dynamics of scheduled unfreezing using Fisher Information. Our experiments reveal that scheduled unfreezing induces different learning dynamics compared to standard fine-tuning, and provide evidence that the dynamics of Fisher Information during training correlate with cross-lingual generalization performance. We additionally propose a general scheduled unfreezing algorithm that achieves an average of 2 points improvement over four datasets compared to standard fine-tuning and provides empirical evidence for a theory-based justification of the heuristic unfreezing schedule for task adapter training. + 2024.naacl-long.111 + 2024.naacl-long.111.copyright.pdf + liu-etal-2024-fun + + + Are Multilingual <fixed-case>LLM</fixed-case>s Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings + ChenLiu + FajriKotoMohamed bin Zayed University of Artificial Intelligence + TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne + IrynaGurevychMohamed bin Zayed University of Artificial Intelligence and Technical University of Darmstadt + 2016-2039 + Large language models (LLMs) are highly adept at question answering and reasoning tasks, but when reasoning in a situational context, human expectations vary depending on the relevant cultural common ground. As languages are associated with diverse cultures, LLMs should also be culturally-diverse reasoners. 
In this paper, we study the ability of a wide range of state-of-the-art multilingual LLMs (mLLMs) to reason with proverbs and sayings in a conversational context. Our experiments reveal that: (1) mLLMs “know” limited proverbs and memorizing proverbs does not mean understanding them within a conversational context; (2) mLLMs struggle to reason with figurative proverbs and sayings, and when asked to select the wrong answer (instead of asking it to select the correct answer); and (3) there is a “culture gap” in mLLMs when reasoning about proverbs and sayings translated from other languages. We construct and release our evaluation dataset MAPS (MulticulturAl Proverbs and Sayings) for proverb understanding with conversational context for six different languages. + 2024.naacl-long.112 + 2024.naacl-long.112.copyright.pdf + liu-etal-2024-multilingual + + + The Colorful Future of <fixed-case>LLM</fixed-case>s: Evaluating and Improving <fixed-case>LLM</fixed-case>s as Emotional Supporters for Queer Youth + ShirLissakTechnion - Israel Institute of Technology, Technion + NitayCalderonTechnion - Israel Institute of Technology + GevaShenkmanReichman University + YaakovOphir + EyalFruchterTechnion - Israel Institute of Technology, Technion + AnatBrunstein KlomekReichman + RoiReichartTechnion, Israel Institute of Technology + 2040-2079 + Queer youth face increased mental health risks, such as depression, anxiety, and suicidal ideation. Hindered by negative stigma, they often avoid seeking help and rely on online resources, which may provide incompatible information. Although access to a supportive environment and reliable information is invaluable, many queer youth worldwide have no access to such support. However, this could soon change due to the rapid adoption of Large Language Models (LLMs) such as ChatGPT. This paper aims to comprehensively explore the potential of LLMs to revolutionize emotional support for queers. 
To this end, we conduct a qualitative and quantitative analysis of LLM’s interactions with queer-related content. To evaluate response quality, we develop a novel ten-question scale that is inspired by psychological standards and expert input. We apply this scale to score several LLMs and human comments to posts where queer youth seek advice and share experiences. We find that LLM responses are supportive and inclusive, outscoring humans. However, they tend to be generic, not empathetic enough, and lack personalization, resulting in nonreliable and potentially harmful advice. We discuss these challenges, demonstrate that a dedicated prompt can improve the performance, and propose a blueprint of an LLM-supporter that actively (but sensitively) seeks user context to provide personalized, empathetic, and reliable responses. Our annotated dataset is available for further research.*https://github.com/nitaytech/LGBTeenDataset + 2024.naacl-long.113 + 2024.naacl-long.113.copyright.pdf + lissak-etal-2024-colorful + + + <fixed-case>IPED</fixed-case>: An Implicit Perspective for Relational Triple Extraction based on Diffusion Model + JianliZhao + ChanghaoXu + Bin.Jiang + 2080-2092 + Relational triple extraction is a fundamental task in the field of information extraction, and a promising framework based on table filling has recently gained attention as a potential baseline for entity relation extraction. However, inherent shortcomings such as redundant information and incomplete triple recognition remain problematic. To address these challenges, we propose an Implicit Perspective for relational triple Extraction based on Diffusion model (IPED), an innovative approach for extracting relational triples. Our classifier-free solution adopts an implicit strategy using block coverage to complete the tables, avoiding the limitations of explicit tagging methods. 
Additionally, we introduce a generative model structure, the block-denoising diffusion model, to collaborate with our implicit perspective and effectively circumvent redundant information disruptions. Experimental results on two popular datasets demonstrate that IPED achieves state-of-the-art performance while gaining superior inference speed and low computational complexity. To support future research, we have made our source code publicly available online. + 2024.naacl-long.114 + 2024.naacl-long.114.copyright.pdf + zhao-etal-2024-iped + + + <fixed-case>Q</fixed-case>ual<fixed-case>E</fixed-case>val: Qualitative Evaluation for Model Improvement + VishvakMurahariPrinceton University + AmeetDeshpande + PeterClarkAllen Institute for Artificial Intelligence + TanmayRajpurohit + AshishSabharwalAllen Institute for Artificial Intelligence + KarthikNarasimhanPrinceton University + AshwinKalyanAllen Institute for Artificial Intelligence + 2093-2111 + Quantitative evaluation metrics have been pivotal in gauging the advancements of AI systems like large language models (LLMs). However, due to the intricate nature of real-world tasks, a single scalar to quantify and compare performance trivializes the fine-grained nuances of model behavior. Additionally, metrics do not yield actionable diagnostics for model improvement, thus requiring extensive manual efforts of scientists, involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which uses automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that when applied, accelerate model improvement. The insights are supported by a dashboard report with fine-grained visualizations and human-interpretable analyses. 
We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace and quality of model development by eliminating the need of arduous manual analysis, thus serving as a data-scientist-in-a-box. + 2024.naacl-long.115 + 2024.naacl-long.115.copyright.pdf + murahari-etal-2024-qualeval + + + Quantum-inspired Language Model with Lindblad Master Equation and Interference Measurement for Sentiment Analysis + KehuanYan + PeichaoLai + YileiWang + 2112-2121 + Quantum-inspired models have demonstrated superior performance in many downstream language tasks, such as question answering and sentiment analysis. However, recent models primarily focus on embedding and measurement operations, overlooking the significance of the quantum evolution process. In this work, we present a novel quantum-inspired neural network, LI-QiLM, which integrates the Lindblad Master Equation (LME) to model the evolution process and the interferometry to the measurement process, providing more physical meaning to strengthen the interpretability. We conduct comprehensive experiments on six sentiment analysis datasets. Compared to the traditional neural networks, transformer-based pre-trained models and quantum-inspired models, such as CICWE-QNN and ComplexQNN, the proposed method demonstrates superior performance in accuracy and F1-score on six commonly used datasets for sentiment analysis. Additional ablation tests verify the effectiveness of LME and interferometry. 
+ 2024.naacl-long.116 + 2024.naacl-long.116.copyright.pdf + yan-etal-2024-quantum + + + <fixed-case>V</fixed-case>is<fixed-case>L</fixed-case>ing<fixed-case>I</fixed-case>nstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization + DongshengZhuBaidu + DanielTang + WeidongHan + JinghuiLuByteDance Inc. + YukunZhao + GuoliangXing + JunfengWang + DaweiYinBaidu + 2122-2135 + This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of instructions. VisLingInstruct tackles this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we have also optimized the visual feature extraction modules in MMLMs, further augmenting their responsiveness to textual content. Our comprehensive experiments on MMLMs, based on FlanT5 and Vicuna, show that VisLingInstruct significantly improves zero-shot performance in visual multi-modal tasks. Notably, it achieves a 13.1% and 9% increase in accuracy over the prior state-of-the-art on the TextVQA and HatefulMemes datasets. Our main code is available at https://github.com/Zhudongsheng75/VisLingInstruct + 2024.naacl-long.117 + 2024.naacl-long.117.copyright.pdf + zhu-etal-2024-vislinginstruct + + + A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily + PengDingnanjing university + JunKuangMeituan + DanMa + XuezhiCao + YunsenXian + JiajunChenNanjing University + ShujianHuangNanjing University + 2136-2153 + Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. 
However, adversarial prompts known as ‘jailbreaks’ can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM. + 2024.naacl-long.118 + 2024.naacl-long.118.copyright.pdf + ding-etal-2024-wolf + + + <fixed-case>P</fixed-case><tex-math>^3</tex-math><fixed-case>S</fixed-case>um: Preserving Author’s Perspective in News Summarization with Diffusion Language Models + YuhanLiu + ShangbinFengPaul G. 
Allen School of Computer Science and Engineering, University of Washington + XiaochuangHanDepartment of Computer Science, University of Washington + VidhishaBalachandranCarnegie Mellon University + Chan YoungPark + SachinKumarOhio State University, Columbus + YuliaTsvetkovDepartment of Computer Science, University of Washington + 2154-2173 + In this work, we take a first step towards designing summarization systems that are faithful to the author’s intent, not only the semantic content of the article. Focusing on a case study of preserving political perspectives in news summarization, we find that existing approaches alter the political opinions and stances of news articles in more than 50% of summaries, misrepresenting the intent and perspectives of the news authors. We thus propose P^3Sum, a diffusion model-based summarization approach controlled by political perspective classifiers. In P^3Sum, the political leaning of a generated summary is iteratively evaluated at each decoding step, and any drift from the article’s original stance incurs a loss back-propagated to the embedding layers, steering the political stance of the summary at inference time. Extensive experiments on three news summarization datasets demonstrate that P^3Sum outperforms state-of-the-art summarization systems and large language models by up to 13.7% in terms of the success rate of stance preservation, with competitive performance on standard metrics of summarization quality. Our findings present a first analysis of preservation of pragmatic features in summarization, highlight the lacunae in existing summarization models—that even state-of-the-art models often struggle to preserve author’s intents—and develop new summarization systems that are more faithful to author’s perspectives. 
+ 2024.naacl-long.119 + 2024.naacl-long.119.copyright.pdf + liu-etal-2024-p3sum + + + Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes + RoseWangStanford University + QingyangZhang + CarlyRobinsonStanford University + SusannaLoeb + DorottyaDemszkyStanford University + 2174-2199 + Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert’s latent thought process into a decision-making model for remediation. This involves an expert identifying (A) the student’s error, (B) a remediation strategy, and (C) their intention before generating a response. We construct a dataset of 700 real tutoring conversations, annotated by experts with their decisions. We evaluate state-of-the-art LLMs on our dataset and find that the expert’s decision-making model is critical for LLMs to close the gap: responses from GPT4 with expert decisions (e.g., “simplify the problem”) are +76% more preferred than without. Additionally, context-sensitive decisions are critical to closing pedagogical gaps: random decisions decrease GPT4’s response quality by -97% than expert decisions. Our work shows the potential of embedding expert thought processes in LLM generations to enhance their capability to bridge novice-expert knowledge gaps. Our dataset and code can be found at: https://github.com/rosewang2008/bridge. 
+ 2024.naacl-long.120 + 2024.naacl-long.120.copyright.pdf + wang-etal-2024-bridging + + + <fixed-case>RST</fixed-case>-<fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case>: A Discourse-Aware Low-Rank Adaptation for Long Document Abstractive Summarization + DongqiPuUniversität des Saarlandes + VeraDembergUniversität des Saarlandes + 2200-2220 + For long document summarization, discourse structure is important to discern the key content of the text and the differences in importance level between sentences. Unfortunately, the integration of rhetorical structure theory (RST) into parameter-efficient fine-tuning strategies for long document summarization remains unexplored. Therefore, this paper introduces RST-LoRA and proposes four RST-aware variants to explicitly incorporate RST into the LoRA model. Our empirical evaluation demonstrates that incorporating the type and uncertainty of rhetorical relations can complementarily enhance the performance of LoRA in summarization tasks. Furthermore, the best-performing variant we introduced outperforms the vanilla LoRA and full-parameter fine-tuning models, as confirmed by multiple automatic and human evaluations, and even surpasses previous state-of-the-art methods. + 2024.naacl-long.121 + 2024.naacl-long.121.copyright.pdf + pu-demberg-2024-rst + + + Strings from the Library of Babel: Random Sampling as a Strong Baseline for Prompt Optimisation + YaoLu + JiayiWang + RaphaelTangComcast + SebastianRiedelGoogle and University College London + PontusStenetorpUniversity College London + 2221-2231 + Recent prompt optimisation approaches use the generative nature of language models to produce prompts – even rivaling the performance of human-curated prompts. In this paper, we demonstrate that randomly sampling tokens from the model vocabulary as “separators” can be as effective as language models for prompt-style text classification. 
Our experiments show that random separators are competitive baselines, having less than a 1% difference compared to previous self-optimisation methods and showing a 12% average relative improvement over strong human baselines across nine text classification tasks and eight language models. We further analyse this phenomenon in detail using three different random generation strategies, establishing that the language space is rich with potentially good separators, with a greater than 40% average chance that a randomly drawn separator performs better than human-curated separators. These observations challenge the common assumption that an effective prompt should be human readable or task relevant and establish a strong baseline for prompt optimisation research. + 2024.naacl-long.122 + 2024.naacl-long.122.copyright.pdf + lu-etal-2024-strings + + + <fixed-case>R</fixed-case>e<fixed-case>TA</fixed-case>: Recursively Thinking Ahead to Improve the Strategic Reasoning of Large Language Models + JinhaoDuanDrexel University + ShiqiWangAmazon + JamesDiffenderferLawrence Livermore National Labs + LichaoSunLehigh University + TianlongChen + BhavyaKailkhuraLawrence Livermore National Laboratory + KaidiXuDrexel University + 2232-2246 + Current logical reasoning evaluations of Large Language Models (LLMs) primarily focus on single-turn and static environments, such as arithmetic problems. The crucial problem of multi-turn, strategic reasoning is under-explored. In this work, we analyze the multi-turn strategic reasoning of LLMs through text-driven complete- and incomplete-information gaming, e.g., board games (Tic-Tac-Toe, Connect-4) and poker games (Texas Hold’em Poker). Specifically, we consider two distinct scenarios: 1) Online Racing, featuring multiple LLMs/agents to facilitate direct competition and comparison; 2) Offline Probing, constructing targeted questions with verified ground truth to evaluate LLMs’ strategic behaviors. 
Experimental results demonstrate that existing state-of-the-art LLMs and reasoning schemes are largely ineffective for strategic reasoning tasks. To mitigate these limitations, we propose a simple yet effective Recursively Thinking-Ahead (ReTA) agent, incorporating a recursive prompting mechanism that automatically analyzes the opponents’ future moves/actions and assigns reward signals for these situations, to strengthen the strategic reasoning of LLMs. We hope our work could spur further research and exploration in the multi-turn strategic reasoning of LLMs. The code is available at https://github.com/jinhaoduan/ReTA. + 2024.naacl-long.123 + 2024.naacl-long.123.copyright.pdf + duan-etal-2024-reta + + + Fact Checking Beyond Training Set + PayamKarisaniUniversity of Illinois at Urbana-Champaign + HengJiUniversity of Illinois, Urbana-Champaign + 2247-2261 + Evaluating the veracity of everyday claims is time consuming and in some cases requires domain expertise. We empirically demonstrate that the commonly used fact checking pipeline, known as the retriever-reader, suffers from performance deterioration when it is trained on the labeled data from one domain and used in another domain. Afterwards, we delve into each component of the pipeline and propose novel algorithms to address this problem. We propose an adversarial algorithm to make the retriever component robust against distribution shift. Our core idea is to initially train a bi-encoder on the labeled source data, and then, to adversarially train two separate document and claim encoders using unlabeled target data. We then focus on the reader component and propose to train it such that it is insensitive towards the order of claims and evidence documents. Our empirical evaluations support the hypothesis that such a reader shows a higher robustness against distribution shift. To our knowledge, there is no publicly available multi-topic fact checking dataset. 
Thus, we propose a simple automatic method to re-purpose two well-known fact checking datasets. We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models, including recent domain adaptation models that use GPT4 for generating synthetic data. + 2024.naacl-long.124 + 2024.naacl-long.124.copyright.pdf + karisani-ji-2024-fact + + + Program-Aided Reasoners (Better) Know What They Know + AnubhaKabraBloomberg + SankethRangreji + YashMathur + AmanMadaanCarnegie Mellon University + EmmyLiuSchool of Computer Science, Carnegie Mellon University + GrahamNeubigCarnegie Mellon University + 2262-2278 + Prior work shows that program-aided reasoning, in which large language models (LLMs) are combined with programs written in programming languages such as Python, can significantly improve accuracy on various reasoning tasks. However, while accuracy is essential, it is also important for such reasoners to “know what they know”, which can be quantified through the calibration of the model. In this paper, we compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets and 2 model types - LLaMA models and OpenAI models. Our results indicate that PAL leads to improved calibration in 75% of the instances. Our analysis uncovers that prompting styles that produce lesser diversity in generations also have more calibrated results, and thus we also experiment with inducing lower generation diversity using temperature scaling and find that for certain temperatures, PAL is not only more accurate but is also more calibrated than COT. Overall, we demonstrate that, in the majority of cases, program-aided reasoners better know what they know than text-based counterparts. 
+ 2024.naacl-long.125 + 2024.naacl-long.125.copyright.pdf + kabra-etal-2024-program + + + The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels + EveFleisig + Su LinBlodgettMicrosoft + DanKleinUniversity of California, Berkeley + ZeerakTalatMohamed bin Zayed University of Artificial Intelligence + 2279-2292 + Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement–some challenged by perspectivist approaches, and some that remain to be addressed–as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement. + 2024.naacl-long.126 + 2024.naacl-long.126.copyright.pdf + fleisig-etal-2024-perspectivist + + + Principles from Clinical Research for <fixed-case>NLP</fixed-case> Model Generalization + AparnaElangovanAmazon + JiayuanHeRoyal Melbourne Institute of Technology and The University of Melbourne + YuanLi + KarinVerspoorRoyal Melbourne Institute of Technology + 2293-2309 + The NLP community typically relies on performance of a model on a held-out test set to assess generalization. Performance drops observed in datasets outside of official test sets are generally attributed to “out-of-distribution” effects. Here, we explore the foundations of generalizability and study the factors that affect it, articulating lessons from clinical studies. 
In clinical research, generalizability is an act of reasoning that depends on (a) *internal validity* of experiments to ensure controlled measurement of cause and effect, and (b) *external validity* or transportability of the results to the wider population. We demonstrate how learning spurious correlations, such as the distance between entities in relation extraction tasks, can affect a model’s internal validity and in turn adversely impact generalization. We, therefore, present the need to ensure internal validity when building machine learning models in NLP. Our recommendations also apply to generative large language models, as they are known to be sensitive to even minor semantic preserving alterations. We also propose adapting the idea of *matching* in randomized controlled trials and observational studies to NLP evaluation to measure causation. + 2024.naacl-long.127 + 2024.naacl-long.127.copyright.pdf + elangovan-etal-2024-principles + + + First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models + NaomiSaphraHarvard University + EveFleisig + KyunghyunChoGenentech and New York University + AdamLopezUniversity of Edinburgh + 2310-2326 + Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large n-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. 
We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches. + 2024.naacl-long.128 + 2024.naacl-long.128.copyright.pdf + saphra-etal-2024-first + + + Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models + RaphaelTangComcast + CrystinaZhangUniversity of Waterloo + XueguangMa + JimmyLinUniversity of Waterloo + FerhanTure + 2327-2340 + Large language models (LLMs) exhibit positional bias in how they use context, which especially affects listwise ranking. To address this, we propose permutation self-consistency, a form of self-consistency over the ranking list outputs of black-box LLMs. Our key idea is to marginalize out different list orders in the prompt to produce an order-independent ranking with less positional bias. First, given some input prompt, we repeatedly shuffle the list in the prompt and pass it through the LLM while holding the instructions the same. Next, we aggregate the resulting sample of rankings by computing the central ranking closest in distance to all of them, marginalizing out prompt order biases in the process. Theoretically, we prove the robustness of our method, showing convergence to the true ranking under random perturbations. Empirically, on five datasets in sorting and passage reranking, our approach improves scores from conventional inference by up to 34-52% for Mistral, 7-18% for GPT-3.5, 8-16% for LLaMA v2 (70B). Our code is at https://github.com/castorini/perm-sc. 
+ 2024.naacl-long.129 + 2024.naacl-long.129.copyright.pdf + tang-etal-2024-found + + + From Language Modeling to Instruction Following: Understanding the Behavior Shift in <fixed-case>LLM</fixed-case>s after Instruction Tuning + XuanshengWu + WenlinYaoTencent AI Lab + JianshuChenAmazon + XiaomanPanTencent AI Lab + XiaoyangWangTencent AI Lab + NinghaoLiuUniversity of Georgia + DongYuTencent AI Lab + 2341-2369 + Large Language Models (LLMs) have achieved remarkable success, where instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how the instruction tuning adjusts pre-trained models with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution, and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective of the model shifts on a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts, and promotes the response generation constantly conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs. 
+ 2024.naacl-long.130 + 2024.naacl-long.130.copyright.pdf + wu-etal-2024-language + + + <fixed-case>POLYIE</fixed-case>: A Dataset of Information Extraction from Polymer Material Scientific Literature + JerryCheung + YuchenZhuangGeorgia Institute of Technology + YinghaoLi + PranavShettyJ.P. Morgan Chase + WantianZhaoGeorgia Institute of Technology + SanjeevGrampurohit + RampiRamprasadGeorgia Institute of Technology + ChaoZhangGeorgia Institute of Technology + 2370-2385 + Scientific information extraction (SciIE), which aims to automatically extract information from scientific literature, is becoming more important than ever. However, there are no existing SciIE datasets for polymer materials, which is an important class of materials used ubiquitously in our daily lives. To bridge this gap, we introduce POLYIE, a new SciIE dataset for polymer materials. POLYIE is curated from 146 full-length polymer scholarly articles, which are annotated with different named entities (i.e., materials, properties, values, conditions) as well as their N-ary relations by domain experts. POLYIE presents several unique challenges due to diverse lexical formats of entities, ambiguity between entities, and variable-length relations. We evaluate state-of-the-art named entity extraction and relation extraction models on POLYIE, analyze their strengths and weaknesses, and highlight some difficult cases for these models. To the best of our knowledge, POLYIE is the first SciIE benchmark for polymer materials, and we hope it will lead to more research efforts from the community on this challenging task. Our code and data are available on: https://github.com/jerry3027/PolyIE. 
+ 2024.naacl-long.131 + 2024.naacl-long.131.copyright.pdf + cheung-etal-2024-polyie + + + <fixed-case>LLM</fixed-case>-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination + KaiZhang + YangyangKangAlibaba Group + FubangZhaoAlibaba Group + XiaozhongLiuWorcester Polytechnic Institute + 2386-2398 + Large Language Models (LLMs), such as GPT3.5, have exhibited remarkable proficiency in comprehending and generating natural language. On the other hand, medical assistants hold the potential to offer substantial benefits for individuals. However, the exploration of LLM-based personalized medical assistant remains relatively scarce. Typically, patients converse differently based on their background and preferences which necessitates the task of enhancing user-oriented medical assistant. While one can fully train an LLM for this objective, the resource consumption is unaffordable. Prior research has explored memory-based methods to enhance the response with aware of previous mistakes for new queries during a dialogue session. We contend that a mere memory module is inadequate and fully training an LLM can be excessively costly. In this study, we propose a novel computational bionic memory mechanism, equipped with a parameter-efficient fine-tuning (PEFT) schema, to personalize medical assistants. To encourage further research into this area, we are releasing a new conversation dataset generated based on an open-source medical corpus and our implementation. 
+ 2024.naacl-long.132 + 2024.naacl-long.132.copyright.pdf + zhang-etal-2024-llm-based + + + <fixed-case>S</fixed-case>um<fixed-case>T</fixed-case>ra: A Differentiable Pipeline for Few-Shot Cross-Lingual Summarization + JacobParnellRoZetta Technology + InigoJauregi UnanueUniversity of Technology Sydney and Rozetta Technology + MassimoPiccardiUniversity of Technology Sydney (UTS) + 2399-2415 + Cross-lingual summarization (XLS) generates summaries in a language different from that of the input documents (e.g., English to Spanish), allowing speakers of the target language to gain a concise view of their content. In the present day, the predominant approach to this task is to take a performing, pretrained multilingual language model (LM) and fine-tune it for XLS on the language pairs of interest. However, the scarcity of fine-tuning samples makes this approach challenging in some cases. For this reason, in this paper we propose revisiting the summarize-and-translate pipeline, where the summarization and translation tasks are performed in a sequence. This approach allows reusing the many, publicly-available resources for monolingual summarization and translation, obtaining a very competitive zero-shot performance. In addition, the proposed pipeline is completely differentiable end-to-end, allowing it to take advantage of few-shot fine-tuning, where available. Experiments over two contemporary and widely adopted XLS datasets (CrossSum and WikiLingua) have shown the remarkable zero-shot performance of the proposed approach, and also its strong few-shot performance compared to an equivalent multilingual LM baseline, that the proposed approach has been able to outperform in many languages with only 10% of the fine-tuning samples. 
+ 2024.naacl-long.133 + 2024.naacl-long.133.copyright.pdf + parnell-etal-2024-sumtra + + + <fixed-case>KTRL</fixed-case>+<fixed-case>F</fixed-case>: Knowledge-Augmented In-Document Search + HanseokOh + HaebinShinKorea Advanced Institute of Science & Technology and Samsung + MiyoungKoKorea Advanced Institute of Science and Technology + HyunjiLeeKorea Advanced Institute of Science & Technology + MinjoonSeoKorea Advanced Institute of Science and Technology + 2416-2436 + We introduce a new problem KTRL+F, a knowledge-augmented in-document search that necessitates real-time identification of all semantic targets within a document with the awareness of external sources through a single natural query. KTRL+F addresses following unique challenges for in-document search: 1) utilizing knowledge outside the document for extended use of additional information about targets, and 2) balancing between real-time applicability with the performance. We analyze various baselines in KTRL+F and find limitations of existing models, such as hallucinations, high latency, or difficulties in leveraging external knowledge. Therefore, we propose a Knowledge-Augmented Phrase Retrieval model that shows a promising balance between speed and performance by simply augmenting external knowledge in phrase embedding. We also conduct a user study to verify whether solving KTRL+F can enhance search experience for users. It demonstrates that even with our simple model, users can reduce the time for searching with less queries and reduced extra visits to other sources for collecting evidence. We encourage the research community to work on KTRL+F to enhance more efficient in-document information access. + 2024.naacl-long.134 + 2024.naacl-long.134.copyright.pdf + oh-etal-2024-ktrl + + + How Well Do Large Language Models Truly Ground? 
+ HyunjiLeeKorea Advanced Institute of Science & Technology + Se JuneJooKorea Advanced Institute of Science & Technology + ChaeeunKim + JoelJang + DoyoungKim + Kyoung-WoonOn + MinjoonSeoKorea Advanced Institute of Science and Technology + 2437-2465 + To reduce issues like hallucinations and lack of control in Large Language Models (LLMs), a common method is to generate responses by grounding on external contexts given as input, known as knowledge-augmented models. However, previous research often narrowly defines “grounding” as just having the correct answer, which does not ensure the reliability of the entire response. To overcome this, we propose a stricter definition of grounding: a model is truly grounded if it (1) fully utilizes the necessary knowledge from the provided context, and (2) stays within the limits of that knowledge. We introduce a new dataset and a grounding metric to evaluate model capability under the definition. We perform experiments across 25 LLMs of different sizes and training methods and provide insights into factors that influence grounding performance. Our findings contribute to a better understanding of how to improve grounding capabilities and suggest an area of improvement toward more reliable and controllable LLM applications. + 2024.naacl-long.135 + 2024.naacl-long.135.copyright.pdf + lee-etal-2024-well + + + <fixed-case>ALBA</fixed-case>: Adaptive Language-Based Assessments for Mental Health + VasudhaVaradarajan + SverkerSikström + OscarKjell + H.SchwartzStony Brook University (SUNY) + 2466-2478 + Mental health issues differ widely among individuals, with varied signs and symptoms. Recently, language-based assessments have shown promise in capturing this diversity, but they require a substantial sample of words per person for accuracy. 
This work introduces the task of Adaptive Language-Based Assessment (ALBA), which involves adaptively ordering questions while also scoring an individual’s latent psychological trait using limited language responses to previous questions. To this end, we develop adaptive testing methods under two psychometric measurement theories: Classical Test Theory and Item Response Theory. We empirically evaluate ordering and scoring strategies, organizing into two new methods: a semi-supervised item response theory-based method (ALIRT) and a supervised Actor-Critic model. While we found both methods to improve over non-adaptive baselines, we found ALIRT to be the most accurate and scalable, achieving the highest accuracy with fewer questions (e.g., Pearson r ≈ 0.93 after only 3 questions as compared to typically needing at least 7 questions). In general, adaptive language-based assessments of depression and anxiety were able to utilize a smaller sample of language without compromising validity or large computational costs. + 2024.naacl-long.136 + 2024.naacl-long.136.copyright.pdf + varadarajan-etal-2024-alba + + + <fixed-case>FREB</fixed-case>-<fixed-case>TQA</fixed-case>: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering + WeiZhouRobert Bosch GmbH, Bosch + MohsenMesgarBosch + HeikeAdelHochschule der Medien (University of Applied Sciences) + AnnemarieFriedrichUniversity of Augsburg + 2479-2497 + Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, understanding the underlying cause and nature of this issue remains predominantly unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of robustness of TQA systems. 
They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publish a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in these three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly. + 2024.naacl-long.137 + 2024.naacl-long.137.copyright.pdf + zhou-etal-2024-freb + + + <fixed-case>MILL</fixed-case>: Mutual Verification with Large Language Models for Zero-Shot Query Expansion + PengyueJiaCity University of Hong Kong + YidingLiuBaidu + XiangyuZhaoCity University of Hong Kong + XiaopengLi + ChangyingHao + ShuaiqiangWangBaidu Inc. + DaweiYinBaidu + 2498-2518 + Query expansion, pivotal in search engines, enhances the representation of user information needs with additional terms. While existing methods expand queries using retrieved or generated contextual documents, each approach has notable limitations. Retrieval-based methods often fail to accurately capture search intent, particularly with brief or ambiguous queries. Generation-based methods, utilizing large language models (LLMs), generally lack corpus-specific knowledge and entail high fine-tuning costs. To address these gaps, we propose a novel zero-shot query expansion framework utilizing LLMs for mutual verification. Specifically, we first design a query-query-document generation method, leveraging LLMs’ zero-shot reasoning ability to produce diverse sub-queries and corresponding documents. Then, a mutual verification process synergizes generated and retrieved documents for optimal expansion. 
Our proposed method is fully zero-shot, and extensive experiments on three public benchmark datasets are conducted to demonstrate its effectiveness over existing methods. Our code is available online at https://github.com/Applied-Machine-Learning-Lab/MILL to ease reproduction. + 2024.naacl-long.138 + 2024.naacl-long.138.copyright.pdf + jia-etal-2024-mill + + + Efficient Benchmarking (of Language Models) + YotamPerlitzInternational Business Machines + ElronBandelInternational Business Machines + ArielGeraInternational Business Machines + OfirArvivHebrew University of Jerusalem and Computer Science Departmen, Technion-Israel Institute of Technology + LiatEin-Dor + EyalShnarchInternational Business Machines + NoamSlonimInternational Business Machines + MichalShmueli-Scheuer + LeshemChoshenInternational Business Machines + 2519-2536 + The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts had raised little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. 
We propose to evaluate the reliability of such decisions, by using a new measure – Decision Impact on Reliability, DIoR for short. We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples. Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm, that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more. + 2024.naacl-long.139 + 2024.naacl-long.139.copyright.pdf + perlitz-etal-2024-efficient + + + <fixed-case>R</fixed-case>e<fixed-case>FACT</fixed-case>: Updating Text-to-Image Models by Editing the Text Encoder + DanaAradComputer Science Departmen, Technion-Israel Institute of Technology + HadasOrgadComputer Science Departmen, Technion-Israel Institute of Technology and Technion - Israel Institute of Technology, Technion - Israel Institute of Technology + YonatanBelinkovTechnion, Technion + 2537-2558 + Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to text-to-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-image models without relying on explicit input from end-users or costly re-training. 
ReFACT updates the weights of a specific layer in the text encoder, modifying only a tiny portion of the model’s parameters and leaving the rest of the model unaffected. We empirically evaluate ReFACT on an existing benchmark, alongside a newly curated dataset. Compared to other methods, ReFACT achieves superior performance in both generalization to related concepts and preservation of unrelated concepts. Furthermore, ReFACT maintains image generation quality, making it a practical tool for updating and correcting factual information in text-to-image models. + 2024.naacl-long.140 + 2024.naacl-long.140.copyright.pdf + arad-etal-2024-refact + + + A Likelihood Ratio Test of Genetic Relationship among Languages + V.S.D.S.MaheshAkavarapuIIT Kanpur, IIT Kanpur + ArnabBhattacharyaIIT Kanpur + 2559-2570 + Lexical resemblances among a group of languages indicate that the languages could be genetically related, i.e., they could have descended from a common ancestral language. However, such resemblances can arise by chance and, hence, need not always imply an underlying genetic relationship. Many tests of significance based on permutation of wordlists and word similarity measures appeared in the past to determine the statistical significance of such relationships. We demonstrate that although existing tests may work well for bilateral comparisons, i.e., on pairs of languages, they are either infeasible by design or are prone to yield false positives when applied to groups of languages or language families. To this end, inspired by molecular phylogenetics, we propose a likelihood ratio test to determine if given languages are related based on the proportion of invariant character sites in the aligned wordlists applied during tree inference. Further, we evaluate some language families and show that the proposed test solves the problem of false positives. Finally, we demonstrate that the test supports the existence of macro language families such as Nostratic and Macro-Mayan. 
+ 2024.naacl-long.141 + 2024.naacl-long.141.copyright.pdf + akavarapu-bhattacharya-2024-likelihood + + + <fixed-case>P</fixed-case>a<fixed-case>D</fixed-case>: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning + XuekaiZhuShanghai Jiaotong University + BiqingQiTsinghua University and Harbin Institute of Technology + KaiyanZhangElectronic Engineering, Tsinghua University + XinweiLong + ZhouhanLinShanghai Jiao Tong University + BowenZhouTsinghua University + 2571-2597 + While large language models (LLMs) excel in various natural language processing tasks, their huge size and the inaccessibility of parameters present challenges for practical deployment. Previous studies try to distill task-specific ability from LLMs to smaller models, using data synthesis and chain-of-thought (CoT) fine-tuning. However, synthetic CoT data often contains faulty reasoning, which deteriorates the quality of distillation, especially in reasoning capabilities. In this work, we propose Program-aided Distillation (PaD), which introduces reasoning programs to suppress the errors in distilled data, and thus achieves better distillation quality for reasoning tasks. In PaD, we utilize the reasoning program to substitute the CoT, allowing automated error checking of synthetic data. Further, through error injecting and further training, the small distilling model could iteratively self-refine the reasoning. Moreover, we conduct a step-wise beam search by step-by-step verifying to acquire more exact reasoning chains. We evaluate PaD on arithmetic reasoning, symbolic reasoning, and general ability. Experimental results demonstrate that smaller models using PaD can not only outperform certain LLMs (e.g., LLaMA-1 13B) but also achieve strong improvement over baselines with a significantly smaller scale of parameters and data. The source code is publicly available at https://github.com/Xuekai-Zhu/pad. 
+ 2024.naacl-long.142 + 2024.naacl-long.142.copyright.pdf + zhu-etal-2024-pad + + + <fixed-case>MEGAVERSE</fixed-case>: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks + SanchitAhujaResearch, Microsoft + DivyanshuAggarwal + VarunGummaMicrosoft + IshaanWattsResearch, Microsoft + AshutoshSathe + MillicentOchiengMicrosoft + RishavHadaMicrosoft Research India + PrachiJainMicrosoft + MohamedAhmedResearch, Microsoft + KalikaBaliMicrosoft Research Labs + SunayanaSitaramMicrosoft + 2598-2637 + There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA models, GPT-4-Vision and Gemini-Pro-Vision. Our experiments show that larger models such as GPT-4, Gemini-Pro and PaLM2 outperform smaller models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on more datasets. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs. 
+ 2024.naacl-long.143 + 2024.naacl-long.143.copyright.pdf + ahuja-etal-2024-megaverse + + + Unlocking Emergent Modularity in Large Language Models + ZihanQiu + ZeyuHuang + JieFuHong Kong University of Science and Technology + 2638-2660 + Modular Neural Networks (MNNs) demonstrate various advantages over monolithic models. Existing MNNs are generally explicit: their modular architectures are pre-defined, with individual modules expected to implement distinct functions. Recent works reveal that there exists implicit modularity in standard pre-trained transformers, namely Emergent Modularity. They indicate that such modular structures spontaneously exhibit during the early pre-training phase. Despite the benefits of modularity, most Language Models (LMs) are still treated as monolithic models in the pre-train and fine-tune paradigm, with their emergent modularity locked and underutilized. In this work, focusing on unlocking the emergent modularity in LMs, we showcase that standard LMs could be fine-tuned as their Mixture-of-Expert (MoEs) counterparts without introducing any extra parameters. Such MoEs are derived from emergent modularity and are referred to as Emergent MoEs (EMoE). Our experiments demonstrate that fine-tuning EMoE effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning. Our analysis and ablation studies further illustrate that it is robust to various configurations and can scale up to Large Language Models (i.e., Llama2-7B and Llama-30B). Code is available at https://github.com/qiuzh20/EMoE. 
+ 2024.naacl-long.144 + 2024.naacl-long.144.copyright.pdf + qiu-etal-2024-unlocking + + + A School Student Essay Corpus for Analyzing Interactions of Argumentative Structure and Quality + MajaStahlLeibniz Universität Hannover + NadineMichel + SebastianKilsbach + JulianSchmidtkeUniversität Hannover + SaraRezatUniversität Paderborn + HenningWachsmuthLeibniz Universität Hannover + 2661-2674 + Learning argumentative writing is challenging. Besides writing fundamentals such as syntax and grammar, learners must select and arrange argument components meaningfully to create high-quality essays. To support argumentative writing computationally, one step is to mine the argumentative structure. When combined with automatic essay scoring, interactions of the argumentative structure and quality scores can be exploited for comprehensive writing support. Although studies have shown the usefulness of using information about the argumentative structure for essay scoring, no argument mining corpus with ground-truth essay quality annotations has been published yet. Moreover, none of the existing corpora contain essays written by school students specifically. To fill this research gap, we present a German corpus of 1,320 essays from school students of two age groups. Each essay has been manually annotated for argumentative structure and quality on multiple levels of granularity. We propose baseline approaches to argument mining and essay scoring, and we analyze interactions between both tasks, thereby laying the ground for quality-oriented argumentative writing support. + 2024.naacl-long.145 + 2024.naacl-long.145.copyright.pdf + stahl-etal-2024-school + + + Adjusting Interpretable Dimensions in Embedding Space with Human Judgments + KatrinErkUniversity of Texas, Austin + MariannaApidianakiUniversity of Pennsylvania, University of Pennsylvania + 2675-2686 + Embedding spaces contain interpretable dimensions indicating gender, formality in style, or even object properties. 
This has been observed multiple times. Such interpretable dimensions are becoming valuable tools in different areas of study, from social science to neuroscience. The standard way to compute these dimensions uses contrasting seed words and computes difference vectors over them. This is simple but does not always work well. We combine seed-based vectors with guidance from human ratings of where words fall along a specific dimension, and evaluate on predicting both object properties like size and danger, and the stylistic properties of formality and complexity. We obtain interpretable dimensions with markedly better performance especially in cases where seed-based dimensions do not work well. + 2024.naacl-long.146 + 2024.naacl-long.146.copyright.pdf + erk-apidianaki-2024-adjusting + + + <fixed-case>P</fixed-case>atent<fixed-case>E</fixed-case>val: Understanding Errors in Patent Generation + YouZuo + KimGerdesUniversité Paris-Saclay + ÉricClergerie + BenoîtSagotINRIA + 2687-2710 + In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation. 
+ 2024.naacl-long.147 + 2024.naacl-long.147.copyright.pdf + zuo-etal-2024-patenteval + + + Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing + SaiKoneruKarlsruher Institut für Technologie + MiriamExelSAP SE + MatthiasHuckSAP SE + JanNiehues + 2711-2725 + Large language models (LLMs) have demonstrated considerable success in various natural language processing tasks, but open-source LLMs have yet to attain state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, their significant performance in tasks demanding a broad understanding and contextual processing shows their potential for translation. To exploit these abilities, we investigate using LLMs for MT and explore recent parameter-efficient fine-tuning techniques. Surprisingly, our initial experiments found that fine-tuning with Q-LoRA for translation purposes led to performance improvements in terms of BLEU but degradation in COMET compared to in-context learning. To overcome this, we propose an alternative approach: adapting LLMs as Automatic Post-Editors (APE) rather than direct translators. Building on the ability of the LLM to handle long sequences, we also propose extending our approach to document-level translation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can yield significant improvements across both sentence and document-level metrics while generalizing to out-of-domain data. Most notably, we achieve a state-of-the-art accuracy rate of 88.7% on the ContraPro test set, which assesses the model’s ability to resolve pronoun ambiguities when translating from English to German. Lastly, during manual post-editing for document-level translation, the source sentences are iteratively annotated, which can be used to refine further translations in the document. Here, we demonstrate that leveraging human corrections can significantly reduce the number of edits required for subsequent translations. 
+ 2024.naacl-long.148 + 2024.naacl-long.148.copyright.pdf + koneru-etal-2024-contextual + + + Metaphor Detection with Context Enhancement and Curriculum Learning + KaidiJia + RongshengLiHarbin Engineering University + 2726-2737 + Metaphor detection is a challenging task for natural language processing (NLP) systems. Previous works failed to sufficiently utilize the internal and external semantic relationships between target words and their context. Furthermore, they have faced challenges in tackling the problem of data sparseness due to the very limited available training data. To address these two challenges, we propose a novel model called MiceCL. By leveraging the difference between the literal meaning of the target word and the meaning of the sentence as the sentence external difference, MiceCL can better handle the semantic relationships. Additionally, we propose a curriculum learning framework for automatically assessing difficulty of the sentence with a pre-trained model. By starting from easy examples and gradually progressing to more difficult ones, we can ensure that the model will not deal with complex data when its ability is weak so that to avoid wasting limited data. Experimental results demonstrate that MiceCL achieves competitive performance across multiple datasets, with a significantly improved convergence speed compared to other models. + 2024.naacl-long.149 + 2024.naacl-long.149.copyright.pdf + jia-li-2024-metaphor + + + What Causes the Failure of Explicit to Implicit Discourse Relation Recognition? + WeiLiuHeidelberg University + StephenWanCSIRO + MichaelStrubeHeidelberg Institute for Theoretical Studies + 2738-2753 + We consider an unanswered question in the discourse processing community: why do relation classifiers trained on explicit examples (with connectives removed) perform poorly in real implicit scenarios? 
Prior work claimed this is due to linguistic dissimilarity between explicit and implicit examples but provided no empirical evidence. In this study, we show that one cause for such failure is a label shift after connectives are eliminated. Specifically, we find that the discourse relations expressed by some explicit instances will change when connectives disappear. Unlike previous work manually analyzing a few examples, we present empirical evidence at the corpus level to prove the existence of such shift. Then, we analyze why label shift occurs by considering factors such as the syntactic role played by connectives, ambiguity of connectives, and more. Finally, we investigate two strategies to mitigate the label shift: filtering out noisy data and joint learning with connectives. Experiments on PDTB 2.0, PDTB 3.0, and the GUM dataset demonstrate that classifiers trained with our strategies outperform strong baselines. + 2024.naacl-long.150 + 2024.naacl-long.150.copyright.pdf + liu-etal-2024-causes + + + <fixed-case>U</fixed-case>niver<fixed-case>SLU</fixed-case>: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions + SiddhantArora + HayatoFutamiSony + Jee-weonJungCMU, Carnegie Mellon University + YifanPengCarnegie Mellon University + RoshanSharmaGoogle + YosukeKashiwagi + EmiruTsunoo + KarenLivescuToyota Technological Institute at Chicago + ShinjiWatanabeCarnegie Mellon University + 2754-2774 + Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model’s behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. 
We enhance this approach through instruction tuning, i.e., finetuning by describing the task using natural language instructions followed by the list of label options. Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness. We demonstrate the efficacy of our single multi-task learning model “UniverSLU” for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages. On most tasks, UniverSLU achieves competitive performance and often even surpasses task-specific models. Additionally, we assess the zero-shot capabilities, finding that the model generalizes to new datasets and languages for seen task types. + 2024.naacl-long.151 + 2024.naacl-long.151.copyright.pdf + arora-etal-2024-universlu + + + How Trustworthy are Open-Source <fixed-case>LLM</fixed-case>s? An Assessment under Malicious Demonstrations Shows their Vulnerabilities + LingboMo + BoshiWangOhio State University + MuhaoChenUniversity of California, Davis and University of Southern California + HuanSunThe Ohio State University, Columbus + 2775-2792 + The rapid progress in open-source Large Language Models (LLMs) is significantly driving AI development forward. However, there is still a limited understanding of their trustworthiness. Deploying these models at scale without sufficient trustworthiness can pose significant risks, highlighting the need to uncover these issues promptly. In this work, we conduct an adversarial assessment of open-source LLMs on trustworthiness, scrutinizing them across eight different aspects including toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations. We propose advCoU, an extended Chain of Utterances-based (CoU) prompting strategy by incorporating carefully crafted malicious demonstrations for trustworthiness attack. 
Our extensive experiments encompass recent and representative series of open-source LLMs, including Vicuna, MPT, Falcon, Mistral, and Llama 2. The empirical outcomes underscore the efficacy of our attack strategy across diverse aspects. More interestingly, our result analysis reveals that models with superior performance in general NLP tasks do not always have greater trustworthiness; in fact, larger models can be more vulnerable to attacks. Additionally, models that have undergone instruction tuning, focusing on instruction following, tend to be more susceptible, although fine-tuning LLMs for safety alignment proves effective in mitigating adversarial trustworthiness attacks. + 2024.naacl-long.152 + 2024.naacl-long.152.copyright.pdf + mo-etal-2024-trustworthy + + + Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models + YueZhouUniversity of Illinois at Chicago + YadaZhuIBM Research + DiegoAntogniniGoogle DeepMind + YoonKimMassachusetts Institute of Technology + YangZhang + 2793-2804 + This paper studies the relationship between the surface form of a mathematical problem and its solvability by large language models. We find that subtle alterations in the surface form can significantly impact the answer distribution and the solve rate, exposing the language model’s lack of robustness and sensitivity to the surface form in reasoning through complex problems. To improve mathematical reasoning performance, we propose Self-Consistency-over-Paraphrases (SCoP), which diversifies reasoning paths from specific surface forms of the problem. We evaluate our approach on four mathematics reasoning benchmarks over three large language models and show that SCoP improves mathematical reasoning performance over vanilla self-consistency, particularly for problems initially deemed unsolvable. 
Finally, we provide additional experiments and discussion regarding problem difficulty and surface forms, including cross-model difficulty agreement and paraphrasing transferability, and Variance of Variations (VOV) for language model evaluation. + 2024.naacl-long.153 + 2024.naacl-long.153.copyright.pdf + zhou-etal-2024-paraphrase + + + <fixed-case>T</fixed-case>ri<fixed-case>S</fixed-case>um: Learning Summarization Ability from Large Language Models with Structured Rationale + PengchengJiang + CaoXiaoGE Healthcare + ZifengWangUniversity of Illinois, Urbana Champaign + ParminderBhatiaGEHC + JimengSunGeorgia Tech Research Corporation, University of Illinois, Urbana Champaign, College of Computing and Georgia Institute of Technology + JiaweiHan + 2805-2819 + The advent of large language models (LLMs) has significantly advanced natural language processing tasks like text summarization. However, their large size and computational demands, coupled with privacy concerns in data transmission, limit their use in resource-constrained and privacy-centric settings. To overcome this, we introduce TriSum, a framework for distilling LLMs’ text summarization abilities into a compact, local model. Initially, LLMs extract a set of aspect-triple rationales and summaries, which are refined using a dual-scoring method for quality. Next, a smaller local model is trained with these tasks, employing a curriculum learning strategy that evolves from simple to complex tasks. Our method enhances local model performance on various benchmarks (CNN/DailyMail, XSum, and ClinicalTrial), outperforming baselines by 4.5%, 8.5%, and 7.4%, respectively. It also improves interpretability by providing insights into the summarization rationale. 
+ 2024.naacl-long.154 + 2024.naacl-long.154.copyright.pdf + jiang-etal-2024-trisum + + + <fixed-case>G</fixed-case>en<fixed-case>RES</fixed-case>: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models + PengchengJiang + JiachengLinDepartment of Computer Science, University of Illinois + ZifengWangUniversity of Illinois, Urbana Champaign + JimengSunGeorgia Tech Research Corporation, University of Illinois, Urbana Champaign, College of Computing and Georgia Institute of Technology + JiaweiHan + 2820-2837 + The field of relation extraction (RE) is experiencing a notable shift towards generative relation extraction (GRE), leveraging the capabilities of large language models (LLMs). However, we discovered that traditional relation extraction (RE) metrics like precision and recall fall short in evaluating GRE methods. This shortfall arises because these metrics rely on exact matching with human-annotated reference relations, while GRE methods often produce diverse and semantically accurate relations that differ from the references. To fill this gap, we introduce GenRES for a multi-dimensional assessment in terms of the topic similarity, uniqueness, granularity, factualness, and completeness of the GRE results. With GenRES, we empirically identified that (1) precision/recall fails to justify the performance of GRE methods; (2) human-annotated referential relations can be incomplete; (3) prompting LLMs with a fixed set of relations or entities can cause hallucinations. Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality. 
Last, we made a comprehensive evaluation of fourteen leading LLMs using GenRES across document, bag, and sentence-level RE datasets, respectively, to set the benchmark for future research in GRE. + 2024.naacl-long.155 + 2024.naacl-long.155.copyright.pdf + jiang-etal-2024-genres + + + Curated Datasets and Neural Models for Machine Translation of Informal Registers between <fixed-case>M</fixed-case>ayan and <fixed-case>S</fixed-case>panish Vernaculars + AndrésLouUniversidad de Alicante + Juan AntonioPérez-OrtizUniversidad de Alicante + FelipeSánchez-MartínezUniversity of Alicante + VíctorSánchez-CartagenaUniversidad de Alicante + 2838-2850 + The Mayan languages comprise a language family with an ancient history, millions of speakers, and immense cultural value that nevertheless remains severely underrepresented in terms of resources and global exposure. In this paper we develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico, which we call MayanV. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language. As such, and according to our dialectometric analysis, they differ in register from most other available resources. Additionally, we present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets. We observe lexical divergences between the dialects of Spanish in our resources and the more widespread written standard of Spanish, and that resources other than the ones we present do not seem to improve translation performance, indicating that many such resources may not accurately capture common, real-life language usage. The MayanV dataset is available at https://github.com/transducens/mayanv. 
+ 2024.naacl-long.156 + 2024.naacl-long.156.copyright.pdf + lou-etal-2024-curated + + + The Effect of Data Partitioning Strategy on Model Generalizability: A Case Study of Morphological Segmentation + ZoeyLiuUniversity of Florida + BonnieDorrUniversity of Florida + 2851-2864 + Recent work to enhance data partitioning strategies for more realistic model evaluation faces challenges in providing a clear optimal choice. This study addresses these challenges, focusing on morphological segmentation and synthesizing limitations related to language diversity, adoption of multiple datasets and splits, and detailed model comparisons. Our study leverages data from 19 languages, including ten indigenous or endangered languages across 10 language families with diverse morphological systems (polysynthetic, fusional, and agglutinative) and different degrees of data availability. We conduct large-scale experimentation with varying sized combinations of training and evaluation sets as well as new test data. Our results show that, when faced with new test data: (1) models trained from random splits are able to achieve higher numerical scores; (2) model rankings derived from random splits tend to generalize more consistently. + 2024.naacl-long.157 + 2024.naacl-long.157.copyright.pdf + liu-dorr-2024-effect + + + Measuring Entrainment in Spontaneous Code-switched Speech + DebasmitaBhattacharyaColumbia University + SiyingDing + AlaynaNguyen + JuliaHirschbergColumbia University + 2865-2876 + It is well known that speakers who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). 
However, such studies of entrainment in code-switched domains have been extremely few and restricted to human-machine textual interactions. Our work studies code-switched spontaneous speech between humans, finding that (1) patterns of written and spoken entrainment in monolingual settings largely generalize to code-switched settings, and (2) some patterns of entrainment on code-switching in dialogue agent-generated text generalize to spontaneous code-switched speech. Our findings give rise to important implications for the potentially “universal” nature of entrainment as a communication phenomenon, and potential applications in inclusive and interactive speech technology. + 2024.naacl-long.158 + 2024.naacl-long.158.copyright.pdf + bhattacharya-etal-2024-measuring + + + A Survey of Meaning Representations – From Theory to Practical Utility + ZaccharySadeddine + JuriOpitzRuprecht-Karls-Universität Heidelberg and University of Zurich + FabianSuchanekTelecom Paris + 2877-2892 + Symbolic meaning representations of natural language text have been studied since at least the 1960s. With the availability of large annotated corpora, and more powerful machine learning tools, the field has recently seen several new developments. In this survey, we study today’s most prominent Meaning Representation Frameworks. We shed light on their theoretical properties, as well as on their practical research environment, i.e., on datasets, parsers, applications, and future challenges. 
+ 2024.naacl-long.159 + 2024.naacl-long.159.copyright.pdf + sadeddine-etal-2024-survey + + + Mitigating Language-Level Performance Disparity in m<fixed-case>PLM</fixed-case>s via Teacher Language Selection and Cross-lingual Self-Distillation + HaozheZhao + ZefanCai + ShuzhengSiTsinghua University + LiangChen + YufengHe + KaikaiAn + BaobaoChangPeking University + 2893-2907 + Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-language tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities through supervised fine-tuning of the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE to leverage the learned knowledge from the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparity across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full-resource to limited-resource settings. The code for our approach is available at https://github.com/pkunlp-icler/ALSACE. + 2024.naacl-long.160 + 2024.naacl-long.160.copyright.pdf + zhao-etal-2024-mitigating + + + Evaluating In-Context Learning of Libraries for Code Generation + ArkilPatelMila - Quebec AI Institute and McGill University + SivaReddyMila, McGill University and Mila, McGill University + DzmitryBahdanauServiceNow Research + PradeepDasigiAllen Institute for Artificial Intelligence + 2908-2926 + Contemporary Large Language Models (LLMs) exhibit a high degree of code generation and comprehension capability. 
A particularly promising area is their ability to interpret code modules from unfamiliar libraries for solving user-instructed tasks. Recent work has shown that large proprietary LLMs can learn novel library usage in-context from demonstrations. These results raise several open questions: whether demonstrations of library usage are required, whether smaller (and more open) models also possess such capabilities, etc. In this work, we take a broader approach by systematically evaluating a diverse array of LLMs across three scenarios reflecting varying levels of domain specialization to understand their abilities and limitations in generating code based on libraries defined in-context. Our results show that even smaller open-source LLMs like Llama-2 and StarCoder demonstrate an adept understanding of novel code libraries based on specifications presented in-context. Our findings further reveal that LLMs exhibit a surprisingly high proficiency in learning novel library modules even when provided with just natural language descriptions or raw code implementations of the functions, which are often cheaper to obtain than demonstrations. Overall, our results pave the way for harnessing LLMs in more adaptable and dynamic coding environments. + 2024.naacl-long.161 + 2024.naacl-long.161.copyright.pdf + patel-etal-2024-evaluating + + + Visually-Aware Context Modeling for News Image Captioning + TingyuQuKU Leuven + TinneTuytelaarsKU Leuven + Marie-FrancineMoensKU Leuven, KU Leuven + 2927-2943 + News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. 
Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking human thought process of linking articles to images. Furthermore, to tackle the problem of the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method Contrasting with Language Model backbone (CoLaM) to the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We out-perform the previous state-of-the-art (without external data) by 7.97/5.80 CIDEr scores on GoodNews/NYTimes800k. Our code is available at https://github.com/tingyu215/VACNIC. + 2024.naacl-long.162 + 2024.naacl-long.162.copyright.pdf + qu-etal-2024-visually + + + Regularized Conventions: Equilibrium Computation as a Model of Pragmatic Reasoning + AthulJacobGoogle and Massachusetts Institute of Technology + GabrieleFarinaMassachusetts Institute of Technology + JacobAndreasMassachusetts Institute of Technology and Microsoft + 2944-2955 + We present a game-theoretic model of pragmatics that we call ReCo (for Regularized Conventions). This model formulates pragmatic communication as a game in which players are rewarded for communicating successfully and penalized for deviating from a shared, “default” semantics. As a result, players assign utterances context-dependent meanings that jointly optimize communicative success and naturalness with respect to speakers’ and listeners’ background knowledge of language. By using established game-theoretic tools to compute equilibrium strategies for this game, we obtain principled pragmatic language generation procedures with formal guarantees of communicative success. 
Across several datasets capturing real and idealized human judgments about pragmatic implicature, ReCo matches, or slightly improves upon, predictions made by Iterated Best Response and Rational Speech Acts models of language understanding. + 2024.naacl-long.163 + 2024.naacl-long.163.copyright.pdf + jacob-etal-2024-regularized + + + <fixed-case>T</fixed-case>opic<fixed-case>GPT</fixed-case>: A Prompt-based Topic Modeling Framework + ChauPham + AlexanderHoyleUniversity of Maryland, College Park + SimengSunCollege of Information and Computer Science, University of Massachusetts, Amherst + PhilipResnikUniversity of Maryland, College Park + MohitIyyerUniversity of Massachusetts Amherst + 2956-2984 + Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require “reading the tea leaves” to interpret; additionally, they offer users minimal control over the formatting and specificity of resulting topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline. Its topics are also more interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions. Moreover, the framework is highly adaptable, allowing users to specify constraints and modify topics without the need for model retraining. By streamlining access to high-quality and interpretable topics, TopicGPT represents a compelling, human-centered approach to topic modeling. 
+ 2024.naacl-long.164 + 2024.naacl-long.164.copyright.pdf + pham-etal-2024-topicgpt + + + <fixed-case>C</fixed-case>hat<fixed-case>GPT</fixed-case> as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger + JiazhaoLiUniversity of Michigan - Ann Arbor + YijinYang + ZhuofengWuUniversity of Michigan - Ann Arbor + V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor + ChaoweiXiaoUniversity of Wisconsin - Madison and NVIDIA + 2985-3004 + Textual backdoor attacks, characterized by subtle manipulations of input triggers and training dataset labels, pose significant threats to security-sensitive applications. The rise of advanced generative models, such as GPT-4, with their capacity for human-like rewriting, makes these attacks increasingly challenging to detect. In this study, we conduct an in-depth examination of black-box generative models as tools for backdoor attacks, thereby emphasizing the need for effective defense strategies. We propose BGMAttack, a novel framework that harnesses advanced generative models to execute stealthier backdoor attacks on text classifiers. Unlike prior approaches constrained by subpar generation quality, BGMAttack renders backdoor triggers more elusive to human cognition and advanced machine detection. A rigorous evaluation of attack effectiveness over four sentiment classification tasks, complemented by four human cognition stealthiness tests, reveals BGMAttack’s superior performance, achieving a state-of-the-art attack success rate of 97.35% on average while maintaining superior stealth compared to conventional methods. The dataset and code are available: https://github.com/JiazhaoLi/BGMAttack. 
+ 2024.naacl-long.165 + 2024.naacl-long.165.copyright.pdf + li-etal-2024-chatgpt + + + Social Meme-ing: Measuring Linguistic Variation in Memes + NaitianZhouUniversity of California, Berkeley + DavidJurgensUniversity of Michigan - Ann Arbor + DavidBammanUniversity of California Berkeley + 3005-3024 + Much work in the space of NLP has used computational methods to explore sociolinguistic variation in text. In this paper, we argue that memes, as multimodal forms of language comprised of visual templates and text, also exhibit meaningful social variation. We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables, taking advantage of their multimodal structure in doing so. We apply this method to a large collection of meme images from Reddit and make available the resulting SemanticMemes dataset of 3.8M images clustered by their semantic function. We use these clusters to analyze linguistic variation in memes, discovering not only that socially meaningful variation in meme usage exists between subreddits, but that patterns of meme innovation and acculturation within these communities align with previous findings on written language. + 2024.naacl-long.166 + 2024.naacl-long.166.copyright.pdf + zhou-etal-2024-social + + + <fixed-case>E</fixed-case>xpert<fixed-case>QA</fixed-case>: Expert-Curated Questions and Attributed Answers + ChaitanyaMalaviyaUniversity of Pennsylvania + SubinLee + SihaoChen + ElizabethSieberUniversity of Washington + MarkYatskarDepartment of Computer and Information Science, School of Engineering and Applied Science + DanRothAmazon and University of Pennsylvania + 3025-3045 + As language models are adopted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study. 
This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying attribution and factuality has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality, by bringing domain experts in the loop. Specifically, we collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. In addition, we ask experts to improve upon responses from language models. The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers. + 2024.naacl-long.167 + 2024.naacl-long.167.copyright.pdf + malaviya-etal-2024-expertqa + + + What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception + ChaitanyaMalaviyaUniversity of Pennsylvania + SubinLee + DanRothAmazon and University of Pennsylvania + MarkYatskarDepartment of Computer and Information Science, School of Engineering and Applied Science + 3046-3065 + Eliciting feedback from end users of NLP models can be beneficial for improving models. However, how should we present model responses to users so they are most amenable to be corrected from user feedback? Further, what properties do users value to understand and trust responses? We answer these questions by analyzing the effect of rationales (or explanations) generated by QA models to support their answers. We specifically consider decomposed QA models that first extract an intermediate rationale based on a context and a question and then use solely this rationale to answer the question. 
A rationale outlines the approach followed by the model to answer the question. Our work considers various formats of these rationales that vary according to well-defined properties of interest. We sample rationales from language models using few-shot prompting for two datasets, and then perform two user studies. First, we present users with incorrect answers and corresponding rationales in various formats and ask them to provide natural language feedback to revise the rationale. We then measure the effectiveness of this feedback in patching these rationales through in-context learning. The second study evaluates how well different rationale formats enable users to understand and trust model answers, when they are correct. We find that rationale formats significantly affect how easy it is (1) for users to give feedback for rationales, and (2) for models to subsequently execute this feedback. In addition, formats with attributions to the context and in-depth reasoning significantly enhance user-reported understanding and trust of model outputs. + 2024.naacl-long.168 + 2024.naacl-long.168.copyright.pdf + malaviya-etal-2024-said + + + When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good Labels + WeiyanShi + EmilyDinanFacebook AI Research + KurtShuster + JasonWestonNew York University and Facebook + JingXuFacebook AI Research + 3066-3082 + Deployed dialogue agents have the potential to integrate human feedback to continuously improve themselves. However, humans may not always provide explicit signals when the chatbot makes mistakes during interactions. In this work, we propose Juicer, a framework to make use of both binary and free-form textual human feedback. It works by: (i) extending sparse binary feedback by training a satisfaction classifier to label the unlabeled data; and (ii) training a reply corrector to map the bad replies to good ones. 
We find that augmenting training with model-corrected replies improves the final dialogue model, and we can further improve performance by using both positive and negative replies through the recently proposed Director model. + 2024.naacl-long.169 + 2024.naacl-long.169.copyright.pdf + shi-etal-2024-life + + + Kreyòl-<fixed-case>MT</fixed-case>: Building <fixed-case>MT</fixed-case> for <fixed-case>L</fixed-case>atin <fixed-case>A</fixed-case>merican, <fixed-case>C</fixed-case>aribbean and Colonial <fixed-case>A</fixed-case>frican Creole Languages + NathanielRobinsonDepartment of Computer Science, Whiting School of Engineering + RajDabreNational Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology + AmmonShurtz + RasulDentINRIA + OnenamiyiOnesi + ClaireMonroc + LoïcGrobolUniversité Paris Nanterre + HasanMuhammad + AshiGarg + NaomeEtori + Vijay MurariTiyyala + OlanrewajuSamuel + MatthewStutzman + BismarckOdoomDepartment of Computer Science, Whiting School of Engineering + SanjeevKhudanpurWhiting School of Engineering + StephenRichardsonBrigham Young University + KentonMurrayJohns Hopkins University + 3083-3110 + A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations—11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages—the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. 
Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 23 of 34 translation directions. + 2024.naacl-long.170 + 2024.naacl-long.170.copyright.pdf + robinson-etal-2024-kreyol + + + Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models + JiashuXuUniversity of Southern California + MingyuMaUniversity of California, Los Angeles + FeiWangUniversity of Southern California + ChaoweiXiaoUniversity of Wisconsin - Madison and NVIDIA + MuhaoChenUniversity of California, Davis and University of Southern California + 3111-3126 + We investigate security concerns of the emergent instruction tuning paradigm, in which models are trained on crowdsourced datasets with task instructions to achieve superior performance. Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions (~1000 tokens) and control model behavior through data poisoning, without even the need to modify data instances or labels themselves. Through such instruction attacks, the attacker can achieve an attack success rate of over 90% across four commonly used NLP datasets. As an empirical study on instruction attacks, we systematically evaluated unique perspectives of instruction attacks, such as poison transfer where poisoned models can transfer to 15 diverse generative datasets in a zero-shot manner; instruction transfer where attackers can directly apply poisoned instruction on many other datasets; and poison resistance to continual finetuning. Lastly, we show that RLHF and clean demonstrations might mitigate such backdoors to some degree. These findings highlight the need for more robust defenses against poisoning attacks in instruction-tuning models and underscore the importance of ensuring data quality in instruction crowdsourcing. 
+ 2024.naacl-long.171 + 2024.naacl-long.171.copyright.pdf + xu-etal-2024-instructions + + + Modeling Empathetic Alignment in Conversation + JiaminYang + DavidJurgensUniversity of Michigan - Ann Arbor + 3127-3148 + Empathy requires perspective-taking: empathetic responses require a person to reason about what another has experienced and communicate that understanding in language. However, most NLP approaches to empathy do not explicitly model this alignment process. Here, we introduce a new approach to recognizing alignment in empathetic speech, grounded in Appraisal Theory. We introduce a new dataset of over 9.2K span-level annotations of different types of appraisals of a person’s experience and over 3K empathetic alignments between a speaker’s and observer’s speech. Through computational experiments, we show that these appraisals and alignments can be accurately recognized. In experiments in over 9.2M Reddit conversations, we find that appraisals capture meaningful groupings of behavior but that most responses have minimal alignment. However, we find that mental health professionals engage with substantially more empathetic alignment. + 2024.naacl-long.172 + 2024.naacl-long.172.copyright.pdf + yang-jurgens-2024-modeling + + + Native Language Identification in Texts: A Survey + DhimanGoswamiGeorge Mason University + SharanyaThilagan + KaiNorth + ShervinMalmasiAmazon + MarcosZampieriGeorge Mason University + 3149-3160 + We present the first comprehensive survey of Native Language Identification (NLI) applied to texts. NLI is the task of automatically identifying an author’s native language (L1) based on their second language (L2) production. NLI is an important task with practical applications in second language teaching and NLP. The task has been widely studied for both text and speech, particularly for L2 English due to the availability of suitable corpora. 
Speech-based NLI relies heavily on accent modeled by pronunciation patterns and prosodic cues while text-based NLI relies primarily on modeling spelling errors and grammatical patterns that reveal properties of an individual’s L1 influencing L2 production. We survey over one hundred papers on the topic including the papers associated with the NLI and INLI shared tasks. We describe several text representations and computational techniques used in text-based NLI. Finally, we present a comprehensive account of publicly available datasets used for the task thus far. + 2024.naacl-long.173 + 2024.naacl-long.173.copyright.pdf + goswami-etal-2024-native + + + <fixed-case>L</fixed-case>o<fixed-case>RETTA</fixed-case>: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models + YifanYangUniversity of California, Santa Barbara + JiajunZhou + NgaiWongThe University of Hong Kong + ZhengZhangUC Santa Barbara + 3161-3176 + Various parameter-efficient fine-tuning (PEFT) techniques have been proposed to enable computationally efficient fine-tuning while maintaining model performance. However, existing PEFT methods are still limited by the growing number of trainable parameters with the rapid deployment of Large Language Models (LLMs). To address this challenge, we present LoRETTA, an ultra-parameter-efficient framework that significantly reduces trainable parameters through tensor-train decomposition. Specifically, we propose two methods, named LoRETTA_adp and LoRETTA_rep. The former employs tensorized adapters, offering a high-performance yet lightweight approach for the fine-tuning of LLMs. The latter emphasizes fine-tuning via weight reparameterization with a set of small tensor factors. LoRETTA achieves comparable or better performance than most widely used PEFT methods with up to 100× fewer parameters on the LLaMA-2-7B models. 
Furthermore, empirical results demonstrate that the proposed methods exhibit remarkable anti-overfitting capability, effectively improve training efficiency, and enjoy better multi-task learning performance. A plug-and-play loretta library, built upon the Huggingface framework and the PEFT library, is provided. + 2024.naacl-long.174 + 2024.naacl-long.174.copyright.pdf + yang-etal-2024-loretta + + + Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding + ChancharikMitra + AbrarAnwarNVIDIA and University of Southern California + RodolfoCorona + DanKleinUniversity of California, Berkeley + TrevorDarrellElectrical Engineering & Computer Science Department + JesseThomasonUniversity of Southern California and Amazon + 3177-3189 + When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object’s appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC) model, which selects an object referent based on language that distinguishes between two similar objects. By pragmatically reasoning over both objects and across multiple views of those objects, MAGiC improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9% (representing an absolute improvement of 2.7%). Ablation studies show that reasoning jointly over object referent candidates and multiple views of each object both contribute to improved accuracy. Code: https://github.com/rcorona/magic_snare/ + 2024.naacl-long.175 + 2024.naacl-long.175.copyright.pdf + mitra-etal-2024-one + + + Do Localization Methods Actually Localize Memorized Data in <fixed-case>LLM</fixed-case>s? 
A Tale of Two Benchmarks + Ting-YunChangUniversity of Southern California + JesseThomasonUniversity of Southern California and Amazon + RobinJiaUniversity of Southern California + 3190-3211 + The concept of localization in LLMs is often mentioned in prior work; however, methods for localization have never been systematically and directly evaluated. We propose two complementary benchmarks that evaluate the ability of localization methods to pinpoint LLM components responsible for memorized data. In our INJ benchmark, we actively inject a piece of new information into a small subset of LLM weights, enabling us to directly evaluate whether localization methods can identify these “ground truth” weights. In our DEL benchmark, we evaluate localization by measuring how much dropping out identified neurons deletes a memorized pretrained sequence. Despite their different perspectives, our two benchmarks yield consistent rankings of five localization methods. Methods adapted from network pruning perform well on both benchmarks, and all evaluated methods show promising localization ability. On the other hand, even successful methods identify neurons that are not specific to a single memorized sequence. + 2024.naacl-long.176 + 2024.naacl-long.176.copyright.pdf + chang-etal-2024-localization + + + <fixed-case>P</fixed-case>rompt<fixed-case>F</fixed-case>ix: Few-shot Backdoor Removal via Adversarial Prompt Tuning + TianrongZhangPennsylvania State University + ZhaohanXiPennsylvania State University + TingWangState University of New York at Stony Brook + PrasenjitMitraPennsylvania State University + JinghuiChenPennsylvania State University + 3212-3225 + Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. 
Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the trigger and counteract it respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and the performance when domain shift is present further shows PromptFix’s applicability to models pretrained on unknown data sources, which is the common case in prompt tuning scenarios. + 2024.naacl-long.177 + 2024.naacl-long.177.copyright.pdf + zhang-etal-2024-promptfix + + + Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models + ZhixueZhaoUniversity of Sheffield, University of Sheffield + NikolaosAletrasUniversity of Sheffield, University of Sheffield and Amazon + 3226-3244 + In many real natural language processing application scenarios, practitioners not only aim to maximize predictive performance but also seek faithful explanations for the model predictions. Rationales and importance distribution given by feature attribution methods (FAs) provide insights into how different parts of the input contribute to a prediction. 
Previous studies have explored how different factors affect faithfulness, mainly in the context of monolingual English models. On the other hand, the differences in FA faithfulness between multilingual and monolingual models have yet to be explored. Our extensive experiments, covering five languages and five popular FAs, show that FA faithfulness varies between multilingual and monolingual models. We find that the larger the multilingual model, the less faithful the FAs are compared to its counterpart monolingual models. Our further analysis shows that the faithfulness disparity is potentially driven by the differences between model tokenizers. Our code is available: https://github.com/casszhao/multilingual-faith. + 2024.naacl-long.178 + 2024.naacl-long.178.copyright.pdf + zhao-aletras-2024-comparing + + + A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity + ShayneLongpreMassachusetts Institute of Technology + GregoryYauneyCornell University + EmilyReif + KatherineLeeCornell University and Google + AdamRobertsGoogle + BarretZoph + DennyZhouGoogle DeepMind + JasonWeiOpenAI + KevinRobinsonGoogle Research + DavidMimnoCornell University + DaphneIppolito + 3245-3276 + Pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. We pretrain models on data curated (1) at different collection times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we find that temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we measure the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. We also find that the effects of different types of filtering are not predictable from text domain characteristics. 
Third, we empirically validate that heterogeneous data sources, like books and web, are beneficial and warrant greater prioritization. To date, these experiments constitute the single largest publicly documented empirical study of the effects of pretraining data. Spanning 28 unique 1.5 billion parameter models pretrained from scratch, these findings validate, quantify, and expose many undocumented intuitions about text pretraining, which ultimately support more informed data-centric decisions in model development. + 2024.naacl-long.179 + 2024.naacl-long.179.copyright.pdf + longpre-etal-2024-pretrainers + + + Instructional Fingerprinting of Large Language Models + JiashuXuUniversity of Southern California + FeiWangUniversity of Southern California + MingyuMaUniversity of California, Los Angeles + Pang WeiKohUniversity of Washington + ChaoweiXiaoUniversity of Wisconsin - Madison and NVIDIA + MuhaoChenUniversity of California, Davis and University of Southern California + 3277-3306 + The exorbitant cost of training large language models (LLMs) from scratch makes it essential to fingerprint the models to protect intellectual property via ownership authentication and to ensure downstream users and developers comply with their license terms (e.g., restricting commercial use). In this study, we present a pilot study on LLM fingerprinting as a form of very lightweight instruction tuning. The model publisher specifies a confidential private key and implants it as an instruction backdoor that causes the LLM to generate specific text when the key is present. Results on 11 popularly-used LLMs showed that this approach is lightweight and does not affect the normal behavior of the model. It also prevents publisher overclaim, maintains robustness against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to the MIT License. 
+ 2024.naacl-long.180 + 2024.naacl-long.180.copyright.pdf + xu-etal-2024-instructional + + + Reinforced Multiple Instance Selection for Speaker Attribute Prediction + AlirezaSalkhordeh Ziabari + AliOmrani + ParsaHejabiUniversity of Southern California + PreniGolazizian + BrendanKennedy + PayamPirayUniversity of Southern California + MortezaDehghaniUniversity of Southern California + 3307-3321 + Language usage is related to speaker age, gender, moral concerns, political ideology, and other attributes. Current state-of-the-art methods for predicting these attributes take a speaker’s utterances as input and provide a prediction per speaker attribute. Most of these approaches struggle to handle a large number of utterances per speaker. This difficulty is primarily due to the computational constraints of the models. Additionally, only a subset of speaker utterances may be relevant to specific attributes. In this paper, we formulate speaker attribute prediction as a Multiple Instance Learning (MIL) problem and propose RL-MIL, a novel approach based on Reinforcement Learning (RL) that effectively addresses both of these challenges. Our experiments demonstrate that our RL-based methodology consistently outperforms previous approaches across a range of related tasks: predicting speakers’ psychographics and demographics from social media posts, and political ideologies from transcribed speeches. We create synthetic datasets and investigate the behavior of RL-MIL systematically. Our results show the success of RL-MIL in improving speaker attribute prediction by learning to select relevant speaker utterances. 
+ 2024.naacl-long.181 + 2024.naacl-long.181.copyright.pdf + salkhordeh-ziabari-etal-2024-reinforced + + + <fixed-case>D</fixed-case>yna<fixed-case>M</fixed-case>o: Accelerating Language Model Inference with Dynamic Multi-Token Sampling + ShikharTuli + Chi-HengLinSamsung Research America + Yen-ChangHsuSamsung Research America + NirajJhaPrinceton University + YilinShenSamsung Research America + HongxiaJinSamsung Research America AI center + 3322-3345 + Traditional language models operate autoregressively, i.e., they predict one token at a time. Rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models *dynamically* predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3, achieves same-quality generated text as the baseline (Pythia-6.9B) while achieving 2.57\times speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively. 
+ 2024.naacl-long.182 + 2024.naacl-long.182.copyright.pdf + tuli-etal-2024-dynamo + + + Few-shot Knowledge Graph Relational Reasoning via Subgraph Adaptation + HaochenLiu + SongWangUniversity of Virginia, Charlottesville + ChenChen + JundongLiUniversity of Virginia + 3346-3356 + Few-shot Knowledge Graph (KG) Relational Reasoning aims to predict unseen triplets (i.e., query triplets) for rare relations in KGs, given only several triplets of these relations as references (i.e., support triplets). This task has gained significant traction due to the widespread use of knowledge graphs in various natural language processing applications. Previous approaches have utilized meta-training methods and manually constructed meta-relation sets to tackle this task. Recent efforts have focused on edge-mask-based methods, which exploit the structure of the contextualized graphs of target triplets (i.e., a subgraph containing relevant triplets in the KG). However, existing edge-mask-based methods have limitations in extracting insufficient information from KG and are highly influenced by spurious information in KG. To overcome these challenges, we propose SAFER (Subgraph Adaptation for Few-shot Relational Reasoning), a novel approach that effectively adapts the information in contextualized graphs to various subgraphs generated from support and query triplets to perform the prediction. Specifically, SAFER enables the extraction of more comprehensive information from support triplets while minimizing the impact of spurious information when predicting query triplets. Experimental results on three prevalent datasets demonstrate the superiority of our proposed framework SAFER. 
+ 2024.naacl-long.183 + 2024.naacl-long.183.copyright.pdf + liu-etal-2024-shot + + + Uncertainty Quantification for In-Context Learning of Large Language Models + ChenLing + XujiangZhaoNEC Labs America + XuchaoZhangMicrosoft + WeiChengNEC-Labs + YanchiLiuNEC-Labs + YiyouSun + MikaOishiNEC + TakaoOsakiNEC + KatsushiMatsudaNEC + JieJi + GuangjiBaiEmory University + LiangZhaoGeorge Mason University, Emory University and Emory University + HaifengChenNEC-Labs + 3357-3370 + In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs) and revolutionized various fields by providing a few task-relevant demonstrations in the prompt. However, trustworthy issues with LLM’s response, such as hallucination, have also been actively discussed. Existing works have been devoted to quantifying the uncertainty in LLM’s response, but they often overlook the complex nature of LLMs and the uniqueness of in-context learning. In this work, we delve into the predictive uncertainty of LLMs associated with in-context learning, highlighting that such uncertainties may stem from both the provided demonstrations (aleatoric uncertainty) and ambiguities tied to the model’s configurations (epistemic uncertainty). We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties. The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion. Extensive experiments are conducted to demonstrate the effectiveness of the decomposition. The code and data are available at: https://github.com/lingchen0331/UQ_ICL. 
+ 2024.naacl-long.184 + 2024.naacl-long.184.copyright.pdf + ling-etal-2024-uncertainty + + + <fixed-case>H</fixed-case>elp<fixed-case>S</fixed-case>teer: Multi-attribute Helpfulness Dataset for <fixed-case>S</fixed-case>teer<fixed-case>LM</fixed-case> + ZhilinWangNVIDIA + YiDong + JiaqiZengNVIDIA + VirginiaAdams + Makesh NarsimhanSreedharNVIDIA + DanielEgertNVIDIA + OlivierDelalleauNVIDIA + JaneScowcroftNVIDIA + NeelKant + AidanSwopeNVIDIA + OleksiiKuchaievNVIDIA + 3371-3384 + Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses only due to their length). To alleviate this problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful. Specifically, our 37k-sample dataset has annotations for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. Training Llama 2 70B using the HelpSteer dataset with SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT-4). We release this dataset with CC-BY-4.0 license at https://huggingface.co/datasets/nvidia/HelpSteer + 2024.naacl-long.185 + 2024.naacl-long.185.copyright.pdf + wang-etal-2024-helpsteer + + + A Preference-driven Paradigm for Enhanced Translation with Large Language Models + DaweiZhu + SonyTrenousAmazon + XiaoyuShenAmazon + DietrichKlakowSaarland University + BillByrneAmazon and University of Cambridge + EvaHaslerAmazon + 3385-3403 + Recent research has shown that large language models (LLMs) can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. 
However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a plateau once the LLMs have achieved a certain level of translation capability, and further increasing the size of parallel data does not provide additional benefits. To overcome this plateau associated with imitation-based SFT, we propose a preference-based approach built upon the Plackett-Luce model. The objective is to steer LLMs towards a more nuanced understanding of translation preferences from a holistic view, while also being more resilient in the absence of gold translations. We further build a dataset named MAPLE to verify the effectiveness of our approach, which includes multiple translations of varying quality for each source sentence. Extensive experiments demonstrate the superiority of our approach in “breaking the plateau” across diverse LLMs and test settings. Our in-depth analysis underscores the pivotal role of diverse translations and accurate preference scores in the success of our approach. + 2024.naacl-long.186 + 2024.naacl-long.186.copyright.pdf + zhu-etal-2024-preference + + + Fair Abstractive Summarization of Diverse Perspectives + YusenZhang + NanZhangPennsylvania State University + YixinLiuYale University + AlexanderFabbriSalesForce.com + JunruLiu + RyoKamoiPennsylvania State University + XiaoxinLuPennsylvania State University + CaimingXiongSalesforce Research + JieyuZhaoUniversity of Southern California + DragomirRadevYale University + KathleenMcKeown + RuiZhangPennsylvania State University + 3404-3426 + People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. 
However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people, and we propose four reference-free automatic metrics by measuring the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm. + 2024.naacl-long.187 + 2024.naacl-long.187.copyright.pdf + zhang-etal-2024-fair + + + What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases + AnthonyTiongNanyang Technological University and SalesForce.com + JunqiZhaoNanyang Technological University + BoyangLiNanyang Technological University + JunnanLiSalesforce Research + StevenHoiSalesforce Research Asia and Singapore Management University + CaimingXiongSalesforce Research + 3427-3454 + Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. 
First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection. Finally, we present a new dataset, OLIVE^1, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods. ^1https://github.com/jq-zh/olive-dataset + 2024.naacl-long.188 + 2024.naacl-long.188.copyright.pdf + tiong-etal-2024-measuring + + + Show Your Work with Confidence: Confidence Bands for Tuning Curves + NicholasLourieNew York University + KyunghyunChoGenentech and New York University + HeHeNew York University + 3455-3472 + The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. *Tuning curves* fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, *confidence bands* are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. 
We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release opda: an easy-to-use library that you can install with pip. https://github.com/nicholaslourie/opda + 2024.naacl-long.189 + 2024.naacl-long.189.copyright.pdf + lourie-etal-2024-show + + + <fixed-case>GRASP</fixed-case>: A Disagreement Analysis Framework to Assess Group Associations in Perspectives + VinodkumarPrabhakaranGoogle + ChristopherHoman + LoraAroyoGoogle + AidaMostafazadeh DavaniResearch, Google + AliciaParrishGoogle + AlexTaylorDesign Informatics, University of Edinburgh + MarkDiazGoogle + DingWang + GregorySerapio-García + 3473-3492 + Human annotation plays a core role in machine learning — annotations for supervised models, safety guardrails for generative models, and human feedback for reinforcement learning, to cite a few avenues. However, the fact that many of these human annotations are inherently subjective is often overlooked. Recent work has demonstrated that ignoring rater subjectivity (typically resulting in rater disagreement) is problematic within specific tasks and for specific subgroups. Generalizable methods to harness rater disagreement and thus understand the socio-cultural leanings of subjective tasks remain elusive. In this paper, we propose GRASP, a comprehensive disagreement analysis framework to measure group association in perspectives among different rater subgroups, and demonstrate its utility in assessing the extent of systematic disagreements in two datasets: (1) safety annotations of human-chatbot conversations, and (2) offensiveness annotations of social media posts, both annotated by diverse rater pools across different socio-demographic axes. 
Our framework (based on disagreement metrics) reveals specific rater groups that have significantly different perspectives than others on certain tasks, and helps identify demographic axes that are crucial to consider in specific task contexts. + 2024.naacl-long.190 + 2024.naacl-long.190.copyright.pdf + prabhakaran-etal-2024-grasp + + + Event Causality Is Key to Computational Story Understanding + YidanSunNanyang Technological University + QinChaoNanyang Technological University + BoyangLiNanyang Technological University + 3493-3511 + Cognitive science and symbolic AI research suggest that event causality provides vital information for story understanding. However, machine learning systems for story understanding rarely employ event causality, partially due to the lack of methods that reliably identify open-world causal event relations. Leveraging recent progress in large language models, we present the first method for event causality identification that leads to material improvements in computational story understanding. Our technique sets a new state of the art on the COPES dataset (Wang et al., 2023c) for causal event relation identification. Further, in the downstream story quality evaluation task, the identified causal relations lead to 3.6-16.6% relative improvement on correlation with human ratings. In the multimodal story video-text alignment task, we attain 4.1-10.9% increase on Clip Accuracy and 4.2-13.5% increase on Sentence IoU. The findings indicate substantial untapped potential for event causality in computational story understanding. The codebase is at https://github.com/insundaycathy/Event-Causality-Extraction. 
+ 2024.naacl-long.191 + 2024.naacl-long.191.copyright.pdf + sun-etal-2024-event + + + Subspace Representations for Soft Set Operations and Sentence Similarities + YoichiIshibashiKyoto University, Japan + ShoYokoiTohoku University and RIKEN + KatsuhitoSudohNara Institute of Science and Technology, Japan + SatoshiNakamuraThe Chinese University of Hong Kong + 3512-3524 + In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack the essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in the linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely-used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks. + 2024.naacl-long.192 + 2024.naacl-long.192.copyright.pdf + ishibashi-etal-2024-subspace + + + My Heart Skipped a Beat! Recognizing Expressions of Embodied Emotion in Natural Language + YuanZhuang + TianyuJiangUniversity of Cincinnati + EllenRiloffUniversity of Arizona + 3525-3537 + Humans frequently experience emotions. When emotions arise, they affect not only our mental state but can also change our physical state. For example, we often open our eyes wide when we are surprised, or clap our hands when we feel excited. 
Physical manifestations of emotions are referred to as embodied emotion in the psychology literature. From an NLP perspective, recognizing descriptions of physical movements or physiological responses associated with emotions is a type of implicit emotion recognition. Our work introduces a new task of recognizing expressions of embodied emotion in natural language. We create a dataset of sentences that contains 7,300 body part mentions with human annotations for embodied emotion. We develop a classification model for this task and present two methods to acquire weakly labeled instances of embodied emotion by extracting emotional manner expressions and by prompting a language model. Our experiments show that the weakly labeled data can train an effective classification model without gold data, and can also improve performance when combined with gold data. Our dataset is publicly available at https://github.com/yyzhuang1991/Embodied-Emotions. + 2024.naacl-long.193 + 2024.naacl-long.193.copyright.pdf + zhuang-etal-2024-heart + + + Low-Cost Generation and Evaluation of Dictionary Example Sentences + BillCaiAmazon + NgClarenceMinistry of Education, Singapore + DanielLiang + ShelviaHotama + 3538-3549 + Dictionary example sentences play an important role in illustrating word definitions and usage, but manually creating quality sentences is challenging. Prior works have demonstrated that language models can be trained to generate example sentences. However, they relied on costly customized models and word sense datasets for generation and evaluation of their work. Rapid advancements in foundational models present the opportunity to create low-cost, zero-shot methods for the generation and evaluation of dictionary example sentences. We introduce a new automatic evaluation metric called OxfordEval that measures the win-rate of generated sentences against existing Oxford Dictionary sentences. 
OxfordEval shows high alignment with human judgments, enabling large-scale automated quality evaluation. We experiment with various LLMs and configurations to generate dictionary sentences across word classes. We complement this with a novel approach of using masked language models to identify and select sentences that best exemplify word meaning. The eventual model, FM-MLM, achieves over 85.1% win rate against Oxford baseline sentences according to OxfordEval, compared to 39.8% win rate for prior model-generated sentences. + 2024.naacl-long.194 + 2024.naacl-long.194.copyright.pdf + cai-etal-2024-low + + + Making Language Models Better Tool Learners with Execution Feedback + ShuofeiQiao + HonghaoGui + ChengfeiLv + QianghuaiJia + HuajunChenZhejiang University + NingyuZhangZhejiang University + 3550-3568 + Tools serve as pivotal interfaces that enable humans to understand and reshape the environment. With the advent of foundation models, AI systems can utilize tools to expand their capabilities and interact with the real world. Existing tool learning methodologies, encompassing supervised fine-tuning and prompt engineering approaches, often induce large language models to utilize tools indiscriminately, as complex tasks often exceed their own competencies. However, introducing tools for simple tasks, which the models themselves can readily resolve, can inadvertently propagate errors rather than enhance performance. This leads to the research question: can we teach language models when and how to use tools? To meet this need, we propose Tool leaRning wIth exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the model to continually learn through feedback derived from tool execution, thereby learning when and how to use tools effectively. 
Experimental results, backed by further analysis, show that TRICE can make the large language model selectively use tools by improving the accuracy of tool usage while enhancing insufficient tool learning and mitigating excessive reliance on tools. + 2024.naacl-long.195 + 2024.naacl-long.195.copyright.pdf + qiao-etal-2024-making + + + Complex Claim Verification with Evidence Retrieved in the Wild + JifanChenAmazon + GraceKimUniversity of Texas at Austin + AniruddhSriram + GregDurrettUniversity of Texas, Austin + EunsolChoiUniversity of Texas, Austin + 3569-3587 + Retrieving evidence to support or refute claims is a core part of automatic fact-checking. Prior work makes simplifying assumptions in retrieval that depart from real-world use cases: either no access to evidence, access to evidence curated by a human fact-checker, or access to evidence published after a claim was made. In this work, we present the first realistic pipeline to check real-world claims by retrieving raw evidence from the web. We restrict our retriever to only search documents available prior to the claim’s making, modeling the realistic scenario of emerging claims. Our pipeline includes five components: claim decomposition, raw document retrieval, fine-grained evidence retrieval, claim-focused summarization, and veracity judgment. We conduct experiments on complex political claims in the ClaimDecomp dataset and show that the aggregated evidence produced by our pipeline improves veracity judgments. Human evaluation finds the evidence summary produced by our system is reliable (it does not hallucinate information) and relevant to answering key questions about a claim, suggesting that it can assist fact-checkers even when it does not reflect a complete evidence set. 
+ 2024.naacl-long.196 + 2024.naacl-long.196.copyright.pdf + chen-etal-2024-complex + + + Multimodal Multi-loss Fusion Network for Sentiment Analysis + ZehuiWu + ZiweiGongColumbia University + JaywonKooRice University + JuliaHirschbergColumbia University + 3588-3602 + This paper investigates the optimal selection and fusion of feature encoders across multiple modalities and combines these in one neural network to improve sentiment detection. We compare different fusion methods and examine the impact of multi-loss training within the multi-modality fusion network, identifying surprisingly important findings relating to subnet performance. We have also found that integrating context significantly enhances model performance. Our best model achieves state-of-the-art performance for three datasets (CMU-MOSI, CMU-MOSEI and CH-SIMS). These results suggest a roadmap toward an optimized feature selection and fusion approach for enhancing sentiment detection in neural networks. + 2024.naacl-long.197 + 2024.naacl-long.197.copyright.pdf + wu-etal-2024-multimodal + + + Confronting <fixed-case>LLM</fixed-case>s with Traditional <fixed-case>ML</fixed-case>: Rethinking the Fairness of Large Language Models in Tabular Classifications + YanchenLiu + SrishtiGautamUiT The Arctic University of Norway + JiaqiMaUniversity of Illinois Urbana-Champaign + HimabinduLakkarajuHarvard University + 3603-3620 + Recent literature has suggested the potential of using large language models (LLMs) to make classifications for tabular tasks. However, LLMs have been shown to exhibit harmful social biases that reflect the stereotypes and inequalities present in society. 
To this end, as well as the widespread use of tabular data in many high-stake applications, it is important to explore the following questions: what sources of information do LLMs draw upon when making classifications for tabular tasks; whether and to what extent are LLM classifications for tabular data influenced by social biases and stereotypes; and what are the consequential implications for fairness? Through a series of experiments, we delve into these questions and show that LLMs tend to inherit social biases from their training data which significantly impact their fairness in tabular classification tasks. Furthermore, our investigations show that in the context of bias mitigation, though in-context learning and finetuning have a moderate effect, the fairness metric gap between different subgroups is still larger than that in traditional machine learning models, such as Random Forest and shallow Neural Networks. This observation emphasizes that the social biases are inherent within the LLMs themselves and inherited from their pretraining corpus, not only from the downstream task datasets. Besides, we demonstrate that label-flipping of in-context examples can significantly reduce biases, further highlighting the presence of inherent bias within LLMs. + 2024.naacl-long.198 + 2024.naacl-long.198.copyright.pdf + liu-etal-2024-confronting + + + Analyzing the Use of Metaphors in News Editorials for Political Framing + MeghdutSenguptaUniversität Hannover + RoxanneEl BaffGerman Aerospace Center and Bauhaus-University Weimar + MiladAlshomaryColumbia University + HenningWachsmuthLeibniz Universität Hannover + 3621-3631 + Metaphorical language is a pivotal element in the realm of political framing. Existing work from linguistics and the social sciences provides compelling evidence regarding the distinctiveness of conceptual framing for political ideology perspectives. 
However, the nature and utilization of metaphors and the effect on audiences of different political ideologies within political discourses are hardly explored. To enable research in this direction, in this work we create a dataset, originally based on news editorials and labeled with their persuasive effects on liberals and conservatives and extend it with annotations pertaining to metaphorical usage of language. To that end, first, we identify all single metaphors and composite metaphors. Secondly, we provide annotations of the source and target domains for each metaphor. As a result, our corpus consists of 300 news editorials annotated with spans of texts containing metaphors and the corresponding domains of which these metaphors draw from. Our analysis shows that liberal readers are affected by metaphors, whereas conservatives are resistant to them. Both ideologies are affected differently based on the metaphor source and target category. For example, liberals are affected by metaphors in the Darkness & Light (e.g., death) source domains, whereas the source domain of Nature affects conservatives more significantly. + 2024.naacl-long.199 + 2024.naacl-long.199.copyright.pdf + sengupta-etal-2024-analyzing + + + <fixed-case>S</fixed-case>harp<fixed-case>S</fixed-case>eq: Empowering Continual Event Detection through Sharpness-Aware Sequential-task Learning + Thanh-ThienLe + VietDao + LinhNguyen + Thi-NhungNguyenVinAI Research + LinhNgoHanoi University of Science and Technology + ThienNguyen, University of Oregon + 3632-3644 + Continual event detection is a cornerstone in uncovering valuable patterns in many dynamic practical applications, where novel events emerge daily. Existing state-of-the-art approaches with replay buffers still suffer from catastrophic forgetting, partially due to overly simplistic objective aggregation. This oversight disregards complex trade-offs and leads to sub-optimal gradient updates, resulting in performance deterioration across objectives. 
While there are successful, widely cited multi-objective optimization frameworks for multi-task learning, they lack mechanisms to address data imbalance and evaluate whether a Pareto-optimal solution can effectively mitigate catastrophic forgetting, rendering them unsuitable for direct application to continual learning. To address these challenges, we propose **SharpSeq**, a novel continual learning paradigm leveraging sharpness-aware minimization combined with a generative model to balance training data distribution. Through extensive experiments on multiple real-world datasets, we demonstrate the superior performance of SharpSeq in continual event detection, proving the importance of our approach in mitigating catastrophic forgetting in continual event detection. + 2024.naacl-long.200 + 2024.naacl-long.200.copyright.pdf + le-etal-2024-sharpseq + + + Dissecting Paraphrases: The Impact of Prompt Syntax and supplementary Information on Knowledge Retrieval from Pretrained Language Models + StephanLinzbachGESIS - Leibniz Insitute for the Social Sciences + DimitarDimitrov + LauraKallmeyerHeinrich Heine University Düsseldorf, Germany + KilianEvangHeinrich Heine University Düsseldorf + HajiraJabeenUniversity of Cologne + StefanDietzeGESIS and Heinrich-Heine-University Düsseldorf + 3645-3655 + Pre-trained Language Models (PLMs) are known to contain various kinds of knowledge. One method to infer relational knowledge is through the use of cloze-style prompts, where a model is tasked to predict missing subjects or objects. Typically, designing these prompts is a tedious task because small differences in syntax or semantics can have a substantial impact on knowledge retrieval performance. Simultaneously, evaluating the impact of either prompt syntax or information is challenging due to their interdependence. We designed CONPARE-LAMA – a dedicated probe, consisting of 34 million distinct prompts that facilitate comparison across minimal paraphrases. 
These paraphrases follow a unified meta-template enabling the controlled variation of syntax and semantics across arbitrary relations. CONPARE-LAMA enables insights into the independent impact of either syntactical form or semantic information of paraphrases on the knowledge retrieval performance of PLMs. Extensive knowledge retrieval experiments using our probe reveal that prompts following clausal syntax have several desirable properties in comparison to appositive syntax: i) they are more useful when querying PLMs with a combination of supplementary information, ii) knowledge is more consistently recalled across different combinations of supplementary information, and iii) they decrease response uncertainty when retrieving known facts. In addition, range information can boost knowledge retrieval performance more than domain information, even though domain information is more reliably helpful across syntactic forms. + 2024.naacl-long.201 + 2024.naacl-long.201.copyright.pdf + linzbach-etal-2024-dissecting + + + Know When To Stop: A Study of Semantic Drift in Text Generation + AvaSpataruMeta + 3656-3671 + In this work, we explicitly show that modern LLMs tend to generate correct facts first, then “drift away” and generate incorrect facts later: this was occasionally observed but never properly measured. We develop a semantic drift score that measures the degree of separation between correct and incorrect facts in generated texts and confirm our hypothesis when generating Wikipedia-style biographies. This correct-then-incorrect generation pattern suggests that factual accuracy can be improved by knowing when to stop generation. Therefore, we explore the trade-off between information quantity and factual accuracy for several early stopping methods and manage to improve factuality by a large margin. We further show that reranking with semantic similarity can further improve these results, both compared to the baseline and when combined with early stopping. 
Finally, we try calling an external API to bring the model back to the right generation path, but do not get positive results. Overall, our methods generalize and can be applied to any long-form text generation to produce more reliable information, by balancing trade-offs between factual accuracy, information quantity and computational cost. + 2024.naacl-long.202 + 2024.naacl-long.202.copyright.pdf + spataru-2024-know + + + Curriculum Masking in Vision-Language Pretraining to Maximize Cross Modal Interaction + KraigTou + ZijunSun + 3672-3688 + Many leading methods in Vision and language (V+L) pretraining utilize masked language modeling (MLM) as a standard pretraining component, with the expectation that reconstruction of masked text tokens would necessitate reference to corresponding image context via cross/self attention and thus promote representation fusion. However, we observe that the minimization of MLM loss in earlier training stages can depend disproportionately on local text signals, leading to poor training efficiency and inconsistency with the goal of representation fusion. The extent of this lack of cross modal interaction depends strongly on which token(s) are masked. To address this issue, we propose a curriculum masking scheme as a replacement for random masking. Tokens are selected to be masked at a frequency proportional to the expected level of cross modal interaction necessary to reconstruct them. This is achieved using a parallel mask selection agent that measures the cross modal flow of information and treats it as a reward to be maximized. By additionally masking contiguous spans that include key objects and their relations, we also achieve better relational understanding, which has been shown to be lacking in many SOTA models. 
Our experiments on a wide range of V+L tasks show that we trail closely behind state-of-the-art methods despite pretraining on 300x to 1000x less data and we also achieve either top or runner-up performance on tasks from the ARO benchmark which tests compositional relationships. Finally, we demonstrate the potential of our method to scale to larger pretraining data. + 2024.naacl-long.203 + 2024.naacl-long.203.copyright.pdf + tou-sun-2024-curriculum + + + Elote, Choclo and Mazorca: on the Varieties of <fixed-case>S</fixed-case>panish + CristinaEspaña-BonetGerman Research Center for AI + AlbertoBarrón-CedeñoUniversità di Bologna + 3689-3711 + Spanish is one of the most widespread languages: the official language in 20 countries and the second most-spoken native language. Its contact with other languages across different regions and the rich regional and cultural diversity has produced varieties which divert from each other, particularly in terms of lexicon. Still, available corpora, and models trained upon them, generally treat Spanish as one monolithic language, which dampers prediction and generation power when dealing with different varieties. To alleviate the situation, we compile and curate datasets in the different varieties of Spanish around the world at an unprecedented scale and create the CEREAL corpus. With such a resource at hand, we perform a stylistic analysis to identify and characterise varietal differences. We implement a classifier specially designed to deal with long documents and identify Spanish varieties (and therefore expand CEREAL further). We produce varietal-specific embeddings, and analyse the cultural differences that they encode. We make data, code and models publicly available. 
+ 2024.naacl-long.204 + 2024.naacl-long.204.copyright.pdf + espana-bonet-barron-cedeno-2024-elote + + + <fixed-case>A</fixed-case>da-<fixed-case>LE</fixed-case>val: Evaluating long-context <fixed-case>LLM</fixed-case>s with length-adaptable benchmarks + ChonghuaWangShanghai Jiaotong University + HaodongDuanShanghai Artificial Intelligence Laboratory + SongyangZhangShanghai AI Laboratory + DahuaLinThe Chinese University of Hong Kong + KaiChenShanghai AI Laboratory + 3712-3724 + Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs’ capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models’ long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs’ long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at https://github.com/open-compass/Ada-LEval. 
+ 2024.naacl-long.205 + 2024.naacl-long.205.copyright.pdf + wang-etal-2024-ada + + + A Zero-Shot Monolingual Dual Stage Information Retrieval System for <fixed-case>S</fixed-case>panish Biomedical Systematic Literature Reviews + ReginaOfori-Boateng + MagalyAceves-MartinsUniversity of Aberdeen + NirmalieWiratungaThe Robert Gordon University + CarlosMoreno-GarciaThe Robert Gordon University + 3725-3736 + Systematic Reviews (SRs) are foundational in healthcare for synthesising evidence to inform clinical practices. Traditionally skewed towards English-language databases, SRs often exclude significant research in other languages, leading to potential biases. This study addresses this gap by focusing on Spanish, a language notably underrepresented in SRs. We present a foundational zero-shot dual information retrieval (IR) baseline system, integrating traditional retrieval methods with pre-trained language models and cross-attention re-rankers for enhanced accuracy in Spanish biomedical literature retrieval. Utilising the LILACS database, known for its comprehensive coverage of Latin American and Caribbean biomedical literature, we evaluate the approach with three real-life case studies in Spanish SRs. The findings demonstrate the system’s efficacy and underscore the importance of query formulation. This study contributes to the field of IR by promoting language inclusivity and supports the development of more comprehensive and globally representative healthcare guidelines. 
+ 2024.naacl-long.206 + 2024.naacl-long.206.copyright.pdf + ofori-boateng-etal-2024-zero + + + <fixed-case>L</fixed-case>ayout<fixed-case>P</fixed-case>ointer: A Spatial-Context Adaptive Pointer Network for Visual Information Extraction + HuangSiyuan + YongpingXiongBeijing University of Posts and Telecommunications + WuGuibinChizhou University + 3737-3748 + Visual Information Extraction (VIE), as a crucial task of Document Intelligence, involves two primary sub-tasks: Semantic Entity Recognition (SER) and Relation Extraction (RE). However, VIE faces two significant challenges. Firstly, most existing models inadequately utilize spatial information of entities, often failing to predict connections or incorrectly linking spatially distant entities. Secondly, the improper input order of tokens poses challenges in extracting complete entity pairs from documents with multi-line entities when text is extracted via PDF parser or OCR. To address these challenges, we propose LayoutPointer, a Spatial-Context Adaptive Pointer Network. LayoutPointer explicitly enhances spatial-context relationships by incorporating 2D relative position information and adaptive spatial constraints within self-attention. Furthermore, we recast the RE task as a specialized cycle detection problem, employing a unique tail-to-head pointer to restore the semantic order among multi-line entities. To better evaluate the effectiveness of our proposed method, we reconstruct a multi-line dataset named MLFUD, which more accurately reflects real-world scenarios. Fine-tuning experimental results on FUNSD, XFUND, and MLFUD datasets demonstrate that LayoutPointer significantly outperforms existing state-of-the-art methods in F1 scores for RE tasks (e.g., 5.71% improvement on XFUND using LayoutPointer_{\text{BASE-X}} over LayoutLMv3). 
+ 2024.naacl-long.207 + 2024.naacl-long.207.copyright.pdf + siyuan-etal-2024-layoutpointer + + + Long-form evaluation of model editing + DomenicRosatiDalhousie University and scite.ai + RobieGonzalesDalhousie University + JinkunChenDalhousie University + XueminYuDalhousie University + YahyaKayani + FrankRudziczDalhousie University + HassanSajjadDalhousie University + 3749-3780 + Evaluations of model editing, a technique for changing the factual knowledge held by Large Language Models (LLMs), currently only use the ‘next few token’ completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (\textbf{\textit{LEME}}) a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings including internal consistency, lexical cohesion, and locality issues. 
+ 2024.naacl-long.208 + 2024.naacl-long.208.copyright.pdf + rosati-etal-2024-long + + + Analyzing the Role of Semantic Representations in the Era of Large Language Models + ZhijingJin + YuenChenUniversity of Illinois at Urbana-Champaign + FernandoGonzalez Adauto + JiaruiLiu + JiayiZhang + JulianMichaelNew York University + BernhardSchölkopfELLIS Institute and Max Planck Institute for Intelligent Systems, Max-Planck Institute + MonaDiabCarnegie Mellon University and George Washington University + 3781-3798 + Traditionally, natural language processing (NLP) models often use a rich set of features created by linguistic expertise, such as semantic representations. However, in the era of large language models (LLMs), more and more tasks are turned into generic, end-to-end sequence generation problems. In this paper, we investigate the question: what is the role of semantic representations in the era of LLMs? Specifically, we investigate the effect of Abstract Meaning Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven chain-of-thought prompting method, which we call AMRCOT, and find that it generally hurts performance more than it helps. To investigate what AMR may have to offer on these tasks, we conduct a series of analysis experiments. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions, named entities, and in the final inference step where the LLM must connect its reasoning over the AMR to its prediction. We recommend focusing on these areas for future work in semantic representations for LLMs. 
Our code: https://github.com/causalNLP/amr_llm + 2024.naacl-long.209 + 2024.naacl-long.209.copyright.pdf + jin-etal-2024-analyzing + + + <fixed-case>TRAQ</fixed-case>: Trustworthy Retrieval Augmented Question Answering via Conformal Prediction + ShuoLi + SangdonParkPOSTECH + InsupLeeUniversity of Pennsylvania + OsbertBastaniUniversity of Pennsylvania + 3799-3821 + When applied to open-domain question answering, large language models (LLMs) frequently generate incorrect responses based on made-up facts, which are called hallucinations. Retrieval augmented generation (RAG) is a promising strategy to avoid hallucinations, but it does not provide guarantees on its correctness. To address this challenge, we propose the Trustworthy Retrieval Augmented Question Answering, or *TRAQ*, which provides the first end-to-end statistical correctness guarantee for RAG. TRAQ uses conformal prediction, a statistical technique for constructing prediction sets that are guaranteed to contain the semantically correct response with high probability. Additionally, TRAQ leverages Bayesian optimization to minimize the size of the constructed sets. In an extensive experimental evaluation, we demonstrate that TRAQ provides the desired correctness guarantee while reducing prediction set size by 16.2% on average compared to an ablation. The implementation is available: [https://github.com/shuoli90/TRAQ](https://github.com/shuoli90/TRAQ). 
+ 2024.naacl-long.210 + 2024.naacl-long.210.copyright.pdf + li-etal-2024-traq + + + <fixed-case>M</fixed-case>ap<fixed-case>G</fixed-case>uide: A Simple yet Effective Method to Reconstruct Continuous Language from Brain Activities + XinpeiZhaoUniversity of the Chinese Academy of Sciences + JingyuanSun + ShaonanWang + JingYeInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + XhzXhzChinese Academy of Sciences + ChengqingZongInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + 3822-3832 + Decoding continuous language from brain activity is a formidable yet promising field of research. It is particularly significant for aiding people with speech disabilities to communicate through brain signals. This field addresses the complex task of mapping brain signals to text. The previous best attempt reverse-engineered this process in an indirect way: it began by learning to encode brain activity from text and then guided text generation by aligning with predicted brain responses. In contrast, we propose a simple yet effective method that guides text reconstruction by directly comparing them with the predicted text embeddings mapped from brain activities. Comprehensive experiments reveal that our method significantly outperforms the current state-of-the-art model, showing average improvements of 77% and 54% on BLEU and METEOR scores. We further validate the proposed modules through detailed ablation studies and case analyses and highlight a critical correlation: the more precisely we map brain activities to text embeddings, the better the text reconstruction results. Such insight can simplify the task of reconstructing language from brain activities for future work, emphasizing the importance of improving brain-to-text-embedding mapping techniques. 
+ 2024.naacl-long.211 + 2024.naacl-long.211.copyright.pdf + zhao-etal-2024-mapguide + + + On-the-fly Definition Augmentation of <fixed-case>LLM</fixed-case>s for Biomedical <fixed-case>NER</fixed-case> + MonicaMunnangiNortheastern University + SergeyFeldmanAllen Institute for Artificial Intelligence and Data Cowboys + ByronWallaceNortheastern University, Brown University and Northeastern University + SilvioAmirNortheastern University + TomHopeAllen Institute for Artificial Intelligence and Hebrew University, Hebrew University of Jerusalem + AakankshaNaikAllen Institute for Artificial Intelligence and National Institutes of Health + 3833-3854 + Despite their general capabilities, LLMs still struggle on biomedical NER tasks, which are difficult due to the presence of specialized terminology and lack of training data. In this work we set out to improve LLM performance on biomedical NER in limited data settings via a new knowledge augmentation approach which incorporates definitions of relevant concepts on-the-fly. During this process, to provide a test bed for knowledge augmentation, we perform a comprehensive exploration of prompting strategies. Our experiments show that definition augmentation is useful for both open source and closed LLMs. For example, it leads to a relative improvement of 15% (on average) in GPT-4 performance (F1) across all (six) of our test datasets. We conduct extensive ablations and analyses to demonstrate that our performance improvements stem from adding relevant definitional knowledge. We find that careful prompting strategies also improve LLM performance, allowing them to outperform fine-tuned language models in few-shot settings. To facilitate future research in this direction, we release our code at https://github.com/allenai/beacon. 
+ 2024.naacl-long.212 + 2024.naacl-long.212.copyright.pdf + munnangi-etal-2024-fly + + + This Land is <fixed-case>Your, My</fixed-case> Land: Evaluating Geopolitical Bias in Language Models through Territorial Disputes + BryanLiUniversity of Pennsylvania + SamarHaiderUniversity of Pennsylvania + ChrisCallison-BurchAllen Institute for Artificial Intelligence and University of Pennsylvania + 3855-3871 + Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. This contrasts with a multilingual human, who would likely answer consistently. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages—a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset and metrics to probe their internal knowledge and use the proposed metrics to discover numerous inconsistencies in how these models respond in different languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context. Our code and data are available at https://github.com/manestay/borderlines. 
+ 2024.naacl-long.213 + 2024.naacl-long.213.copyright.pdf + li-etal-2024-land + + + Set-Aligning Framework for Auto-Regressive Event Temporal Graph Generation + XingweiTan + YuxiangZhouKing’s College London + GabrielePergolaUniversity of Warwick + YulanHeKing’s College London, University of London + 3872-3892 + Event temporal graphs have been shown as convenient and effective representations of complex temporal relations between events in text. Recent studies, which employ pre-trained language models to auto-regressively generate linearised graphs for constructing event temporal graphs, have shown promising results. However, these methods have often led to suboptimal graph generation as the linearised graphs exhibit set characteristics which are instead treated sequentially by language models. This discrepancy stems from the conventional text generation objectives, leading to erroneous penalisation of correct predictions caused by the misalignment of elements in target sequences. To address these challenges, we reframe the task as a conditional set generation problem, proposing a Set-aligning Framework tailored for the effective utilisation of Large Language Models (LLMs). The framework incorporates data augmentations and set-property regularisations designed to alleviate text generation loss penalties associated with the linearised graph edge sequences, thus encouraging the generation of more relation edges. Experimental results show that our framework surpasses existing baselines for event temporal graph generation. Furthermore, under zero-shot settings, the structural knowledge introduced through our framework notably improves model generalisation, particularly when the training examples available are limited. 
+ 2024.naacl-long.214 + 2024.naacl-long.214.copyright.pdf + tan-etal-2024-set + + + <fixed-case>L</fixed-case>anguage<fixed-case>F</fixed-case>low: Advancing Diffusion Language Generation with Probabilistic Flows + ShujianZhangUniversity of Texas, Austin + LemengWuFacebook and University of Texas, Austin + ChengyueGongut austin + XingchaoLiu + 3893-3905 + Recent works have demonstrated success in controlling sentence attributes (e.g., sentiment) and structure (e.g., syntactic structure) based on the diffusion language model. A key component that drives the impressive performance for generating high-quality samples from noise is iterative denoising for thousands of steps. While beneficial, the complexity of starting from the noise and the learning steps has limited its implementation to many NLP real-world applications. This paper proposes Language Rectified Flow (LF). Our method is based on the reformulation of the standard probabilistic flow models. Language rectified flow learns (neural) ordinary differential equation models to transport between the source distribution and the target distribution, hence providing a unified and effective solution to generative modeling and domain transfer. From the source distribution, our language rectified flow yields fast simulation and effectively decreases the inference time. Experiments on three challenging fine-grained control tasks and multiple high-quality text editing show that our method consistently outperforms its baselines. Extensive experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks. 
+ 2024.naacl-long.215 + 2024.naacl-long.215.copyright.pdf + zhang-etal-2024-languageflow + + + Towards Improved Multi-Source Attribution for Long-Form Answer Generation + NilayPatel + ShivashankarSubramanianAmazon + SiddhantGargMeta + PratyayBanerjeeAmazon + AmitaMisraAmazon + 3906-3919 + Teaching large language models (LLMs) to generate text with attribution to evidence sources can reduce hallucinations, improve verifiability in question answering systems (QA), and increase reliability of retrieval augmented LLMs. Despite gaining increasing popularity for usage in QA systems and search engines, current LLMs struggle with attribution for long-form responses which require reasoning over multiple evidence sources. To address this, in this paper we aim to improve the attribution capability of LLMs for long-form answer generation to multiple sources, with multiple citations per sentence. However, data for training multi-source attributable QA systems is difficult and expensive to annotate, and therefore scarce. To overcome this challenge, we transform existing QA datasets for this task (MultiAttr), and empirically demonstrate, on a wide range of attribution benchmark datasets, that fine-tuning on MultiAttr provides significant improvements over training only on the target QA domain. Lastly, to fill a gap in existing benchmarks, we present a multi-source attribution dataset containing multi-paragraph answers, PolitiICite, based on PolitiFact articles that discuss events closely related to implementation statuses of election promises. 
+ 2024.naacl-long.216 + 2024.naacl-long.216.copyright.pdf + patel-etal-2024-towards + + + Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private Language Models + AldoCarranza + RezsaFarahani + NataliaPonomarevaGoogle + AlexeyKurakinResearch, Google + MatthewJagielskiGoogle + MiladNasrGoogle + 3920-3930 + We address the challenge of ensuring differential privacy (DP) guarantees in training deep retrieval systems. Training these systems often involves the use of contrastive-style losses, which are typically non-per-example decomposable, making them difficult to directly DP-train with since common techniques require per-example gradients. To address this issue, we propose an approach that prioritizes ensuring query privacy prior to training a deep retrieval system. Our method employs DP language models (LMs) to generate private synthetic queries representative of the original data. These synthetic queries can be used in downstream retrieval system training without compromising privacy. Our approach demonstrates a significant enhancement in retrieval quality compared to direct DP-training, all while maintaining query-level privacy guarantees. This work highlights the potential of harnessing LMs to overcome limitations in standard DP-training methods. + 2024.naacl-long.217 + 2024.naacl-long.217.copyright.pdf + carranza-etal-2024-synthetic + + + Okay, Let’s Do This! Modeling Event Coreference with Generated Rationales and Knowledge Distillation + AbhijnanNath + ShadiManafi AvariColorado State University + AvyaktaChelle + NikhilKrishnaswamyColorado State University + 3931-3946 + In NLP, Event Coreference Resolution (ECR) is the task of connecting event clusters that refer to the same underlying real-life event, usually via neural systems. 
In this work, we investigate using abductive free-text rationales (FTRs) generated by modern autoregressive LLMs as distant supervision of smaller student models for cross-document coreference (CDCR) of events. We implement novel rationale-oriented event clustering and knowledge distillation methods for event coreference scoring that leverage enriched information from the FTRs for improved CDCR without additional annotation or expensive document clustering. Our model using coreference-specific knowledge distillation achieves SOTA B^3 F_1 on the ECB+ and GVC corpora and we establish a new baseline on the AIDA Phase 1 corpus. Our code can be found at https://github.com/csu-signal/llama_cdcr. + 2024.naacl-long.218 + 2024.naacl-long.218.copyright.pdf + nath-etal-2024-okay + + + Can Knowledge Graphs Reduce Hallucinations in <fixed-case>LLM</fixed-case>s? : A Survey + GarimaAgrawal + TharinduKumarageArizona State University + ZeyadAlghamdi + HuanLiuArizona State University + 3947-3960 + The contemporary LLMs are prone to producing hallucinations, stemming mainly from the knowledge gaps within the models. To address this critical limitation, researchers employ diverse strategies to augment the LLMs by incorporating external knowledge, aiming to reduce hallucinations and enhance reasoning accuracy. Among these strategies, leveraging knowledge graphs as a source of external information has demonstrated promising results. In this survey, we comprehensively review these knowledge-graph-based augmentation techniques in LLMs, focusing on their efficacy in mitigating hallucinations. We systematically categorize these methods into three overarching groups, offering methodological comparisons and performance evaluations. Lastly, this survey explores the current trends and challenges associated with these techniques and outlines potential avenues for future research in this emerging field. 
+ 2024.naacl-long.219 + 2024.naacl-long.219.copyright.pdf + agrawal-etal-2024-knowledge + + + Pedagogically Aligned Objectives Create Reliable Automatic Cloze Tests + BrianOndovYale School of Medicine + KushAttal + DinaDemner-FushmanNational Library of Medicine + 3961-3972 + The cloze training objective of Masked Language Models makes them a natural choice for generating plausible distractors for human cloze questions. However, distractors must also be both distinct and incorrect, neither of which is directly addressed by existing neural methods. Evaluation of recent models has also relied largely on automated metrics, which cannot demonstrate the reliability or validity of human comprehension tests. In this work, we first formulate the pedagogically motivated objectives of plausibility, incorrectness, and distinctiveness in terms of conditional distributions from language models. Second, we present an unsupervised, interpretable method that uses these objectives to jointly optimize sets of distractors. Third, we test the reliability and validity of the resulting cloze tests compared to other methods with human participants. We find our method has stronger correlation with teacher-created comprehension tests than the state-of-the-art neural method and is more internally consistent. Our implementation is freely available and can quickly create a multiple choice cloze test from any given passage. + 2024.naacl-long.220 + 2024.naacl-long.220.copyright.pdf + ondov-etal-2024-pedagogically + + + Take One Step at a Time to Know Incremental Utility of Demonstration: An Analysis on Reranking for Few-Shot In-Context Learning + KazumaHashimotoGoogle Research + KarthikRamanGoogle + MichaelBenderskyGoogle + 3973-3990 + In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs). Only a few demonstrations enable LLMs to be used as blackbox for new tasks. 
Previous studies have shown that using LLMs’ outputs as labels is effective in training models to select demonstrations. Such a label is expected to estimate utility of a demonstration in ICL; however, it has not been well understood how different labeling strategies affect results on target tasks. This paper presents an analysis on different utility functions by focusing on LLMs’ output probability given ground-truth output, and task-specific reward given LLMs’ prediction. Unlike the previous work, we introduce a novel labeling method, incremental utility, which estimates how much incremental knowledge is brought into the LLMs by a demonstration. We conduct experiments with instruction-tuned LLMs on binary/multi-class classification, segmentation, and translation across Arabic, English, Finnish, Japanese, and Spanish. Our results show that (1) the probability is effective when the probability values are distributed across the whole value range (on the classification tasks), and (2) the downstream metric is more robust when nuanced reward values are provided with long outputs (on the segmentation and translation tasks). We then show that the proposed incremental utility further helps ICL by contrasting how the LLMs perform with and without the demonstrations. + 2024.naacl-long.221 + 2024.naacl-long.221.copyright.pdf + hashimoto-etal-2024-take + + + <fixed-case>LM</fixed-case>-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models + ChiHan + QifanWangMeta AI + HaoPengDepartment of Computer Science, University of Illinois Urbana-Champaign + WenhanXiongFacebook + YuChenAnytime.AI + HengJiUniversity of Illinois, Urbana-Champaign + SinongWangFacebook + 3991-4008 + Today’s large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. 
As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through both theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis reveals that commonly used techniques like using a sliding-window attention pattern or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs’ capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7× decoding speed up and 7.5× memory saving over the original model. Our code will be publicly available upon publication. + 2024.naacl-long.222 + 2024.naacl-long.222.copyright.pdf + han-etal-2024-lm + + + <fixed-case>CONSCENDI</fixed-case>: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants + AlbertSun + VarunNairCurai Health + ElliotSchumacherCurai Health + AnithaKannanCurai Health + 4009-4030 + A wave of new task-based virtual assistants has been fueled by increasingly powerful large language models (LLMs), such as GPT-4 (OpenAI, 2023). A major challenge in deploying LLM-based virtual conversational assistants in real world settings is ensuring they operate within what is admissible for the task. 
To overcome this challenge, the designers of these virtual assistants rely on an independent guardrail system that verifies the virtual assistant’s output aligns with the constraints required for the task. However, relying on commonly used, prompt-based guardrails can be difficult to engineer correctly and comprehensively. To address these challenges, we propose CONSCENDI. We use CONSCENDI to exhaustively generate training data with two key LLM-powered components: scenario-augmented generation and contrastive training examples. When generating conversational data, we generate a set of rule-breaking scenarios, which enumerate a diverse set of high-level ways a rule can be violated. This scenario-guided approach produces a diverse training set and provides chatbot designers greater control. To generate contrastive examples, we prompt the LLM to alter conversations with violations into acceptable conversations to enable fine-grained distinctions. We then use this data, generated by CONSCENDI, to train a smaller model. We find that CONSCENDI results in guardrail models that improve over baselines in multiple dialogue domains. + 2024.naacl-long.223 + 2024.naacl-long.223.copyright.pdf + sun-etal-2024-conscendi + + + Advancing Beyond Identification: Multi-bit Watermark for Large Language Models + KiYoonYooSeoul National University + WonhyukAhnNAVER WEBTOON Corp. + NojunKwakSeoul National University + 4031-4055 + We show the viability of tackling misuses of large language models beyond the identification of machine-generated text. While existing zero-bit watermark methods focus on detection only, some malicious misuses demand tracing the adversary user for counteracting them. To address this, we propose Multi-bit Watermark via Position Allocation, embedding traceable multi-bit information during language model generation. Through allocating tokens onto different parts of the messages, we embed longer messages in high corruption settings without added latency. 
By independently embedding sub-units of messages, the proposed method outperforms the existing works in terms of robustness and latency. Leveraging the benefits of zero-bit watermarking, our method enables robust extraction of the watermark without any model access, embedding and extraction of long messages (\geq 32-bit) without finetuning, and maintaining text quality, while allowing zero-bit detection all at the same time. + 2024.naacl-long.224 + 2024.naacl-long.224.copyright.pdf + yoo-etal-2024-advancing + + + <fixed-case>HTCCN</fixed-case>: Temporal Causal Convolutional Networks with <fixed-case>H</fixed-case>awkes Process for Extrapolation Reasoning in Temporal Knowledge Graphs + TingxuanChenCentral South University + JunLongCentral South University + LiuYang + ZidongWangCentral South University + YonghengWangZhejiang lab + XiongnanJinZhejiang Lab + 4056-4066 + Temporal knowledge graphs (TKGs) serve as powerful tools for storing and modeling dynamic facts, holding immense potential in anticipating future facts. Since future facts are inherently unknowable, effectively modeling the intricate temporal structure of historical facts becomes paramount for accurate prediction. However, current models often rely heavily on fact recurrence or periodicity, leading to information loss due to prolonged evolutionary processes. Notably, the occurrence of one fact always influences the likelihood of another. To this end, we propose HTCCN, a novel Hawkes process-based temporal causal convolutional network designed for temporal reasoning under extrapolation settings. HTCCN employs a temporal causal convolutional network to model the historical interdependence of facts and leverages Hawkes to model link formation processes inductively in TKGs. Importantly, HTCCN introduces dual-level dynamics to comprehensively capture the temporal evolution of facts. Rigorous experimentation on four real-world datasets underscores the superior performance of HTCCN. 
+ 2024.naacl-long.225 + 2024.naacl-long.225.copyright.pdf + chen-etal-2024-htccn + + + <fixed-case>S</fixed-case>em<fixed-case>S</fixed-case>tamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation + AbeHou + JingyuZhangJohns Hopkins University + TianxingHe + YichenWang + Yung-SungChuangMassachusetts Institute of Technology + HongweiWangTencent AI Lab + LingfengShenByteDance Inc. + BenjaminVan DurmeJohns Hopkins University, Johns Hopkins University, Johns Hopkins University and Microsoft + DanielKhashabiJohns Hopkins University + YuliaTsvetkovDepartment of Computer Science, University of Washington + 4067-4082 + Existing watermarked generation algorithms employ token-level designs and therefore, are vulnerable to paraphrase attacks. To address this issue, we introduce watermarking on the semantic representation of sentences. We propose SemStamp, a robust sentence-level semantic watermarking algorithm that uses locality-sensitive hashing (LSH) to partition the semantic space of sentences. The algorithm encodes and LSH-hashes a candidate sentence generated by a language model, and conducts rejection sampling until the sampled sentence falls in watermarked partitions in the semantic embedding space. To test the paraphrastic robustness of watermarking algorithms, we propose a “bigram paraphrase” attack that produces paraphrases with small bigram overlap with the original sentence. This attack is shown to be effective against existing token-level watermark algorithms, while posing only minor degradations to SemStamp. Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on various paraphrasers and domains, but also better at preserving the quality of generation. 
+ 2024.naacl-long.226 + 2024.naacl-long.226.copyright.pdf + hou-etal-2024-semstamp + + + Media Bias Detection Across Families of Language Models + IffatMaab + EdisonMarrese-TaylorThe University of Tokyo and AIST, National Institute of Advanced Industrial Science and Technology + SebastianPadóUniversity of Stuttgart, Universität Stuttgart + YutakaMatsuoThe University of Tokyo and The University of Tokyo + 4083-4098 + Bias in reporting can influence the public’s opinion on relevant societal issues. Examples include informational bias (selective presentation of content) and lexical bias (specific framing of content through linguistic choices). The recognition of media bias is arguably an area where NLP can contribute to the “social good”. Traditional NLP models have shown good performance in classifying media bias, but require careful model design and extensive tuning. In this paper, we ask how well prompting of large language models can recognize media bias. Through an extensive empirical study including a wide selection of pre-trained models, we find that prompt-based techniques can deliver comparable performance to traditional models with greatly reduced effort and that, similar to traditional models, the availability of context substantially improves results. We further show that larger models can leverage different kinds of context simultaneously, obtaining further performance improvements. + 2024.naacl-long.227 + 2024.naacl-long.227.copyright.pdf + maab-etal-2024-media + + + Better Zero-Shot Reasoning with Role-Play Prompting + AoboKong + ShiwanZhao + HaoChen + QichengLiNankai University + YongQin + RuiqiSunLenovo Research + XinZhou + EnzhiWang + XiaohangDong + 4099-4113 + Modern large language models (LLMs) exhibit a remarkable capacity for role-playing, enabling them to embody not only human characters but also non-human entities. 
This versatility allows them to simulate complex human-like interactions and behaviors within various contexts, as well as to emulate specific objects or systems. While these capabilities have enhanced user engagement and introduced novel modes of interaction, the influence of role-playing on LLMs’ reasoning abilities remains underexplored. In this study, we introduce a strategically designed role-play prompting methodology and assess its performance under the zero-shot setting across twelve diverse reasoning benchmarks. Our empirical results illustrate that role-play prompting consistently surpasses the standard zero-shot approach across most datasets. Notably, in experiments conducted using ChatGPT, accuracy on AQuA rises from 53.5% to 63.8%, and on Last Letter from 23.8% to 84.2%. Upon further comparison with the Zero-Shot-CoT technique, which prompts the model to “think step by step”, our study demonstrates that role-play prompting acts as a more effective trigger for the CoT process. This highlights its potential to augment the reasoning capabilities of LLMs. We release our code at https://github.com/NKU-HLT/Role-Play-Prompting. + 2024.naacl-long.228 + 2024.naacl-long.228.copyright.pdf + kong-etal-2024-better + + + Event-Content-Oriented Dialogue Generation in Short Video + FenghuaChengUniversity of Queensland + XueLiUniversity of Queensland + ZiHuangUniversity of Queensland + JinxiangWangUniversity of Queensland + SenWangUniversity of Queensland and The University of Queensland + 4114-4124 + Understanding complex events from different modalities, associating to external knowledge and generating response in a clear point of view are still unexplored in today’s multi-modal dialogue research. The great challenges include 1) lack of event-based multi-modal dialogue dataset; 2) understanding of complex events and 3) heterogeneity gap between different modalities. 
To overcome these challenges, we firstly introduce a novel event-oriented video-dialogue dataset called SportsVD (Sports-domain Video-dialogue Dataset). To our best knowledge, SportsVD is the first dataset that consists of complex events videos and opinion-based conversations with regards to contents in these events. Meanwhile, we present multi-modal dialogue generation method VCD (Video Commentary Dialogue) to generate human-like response according to event contents in the video and related external knowledge. In contrast to previous video-based dialogue generation, we focus on opinion-based response and the understanding of longer and more complex event contents. We evaluate VCD’s performance on SportsVD and other baselines under several automatic metrics. Experiments demonstrate VCD can outperform among other state-of-the-art baselines. Our work is available at https://github.com/Cheng-Fenghua/SportsVD. + 2024.naacl-long.229 + 2024.naacl-long.229.copyright.pdf + cheng-etal-2024-event + + + <fixed-case>D</fixed-case>o<fixed-case>G</fixed-case>-Instruct: Towards Premium Instruction-Tuning Data via Text-Grounded Instruction Wrapping + YongruiChen + HaiyunJiangSUN YAT-SEN UNIVERSITY + XintingHuangTencent AI Lab + ShumingShiTencent AI Lab + GuilinQi + 4125-4135 + The improvement of LLMs’ instruction-following capabilities relies heavily on the availability of high-quality instruction-response pairs. 
Unfortunately, the current methods used to collect the pairs suffer from either unaffordable labor costs or severe hallucinations in the self-generation of LLM. To tackle these challenges, this paper proposes a scalable solution. It involves training LLMs to generate instruction-response pairs based on human-written documents, rather than relying solely on self-generation without context. Our proposed method not only exploits the advantages of human-written documents in reducing hallucinations but also utilizes an LLM to wrap the expression of documents, which enables us to bridge the gap between various document styles and the standard AI response. Experiments demonstrate that our method outperforms existing typical methods on multiple benchmarks. In particular, compared to the best-performing baseline, the LLM trained using our generated dataset exhibits a 10% relative improvement in performance on AlpacaEval, despite utilizing only 1/5 of its training data. Furthermore, a comprehensive manual evaluation validates the quality of the data we generated. + 2024.naacl-long.230 + 2024.naacl-long.230.copyright.pdf + chen-etal-2024-dog + + + Beyond Borders: Investigating Cross-Jurisdiction Transfer in Legal Case Summarization + SantoshT.y.s.sTechnische Universität München + VatsalVenkatkrishna + SaptarshiGhoshIndian Institute of Technology Kharagpur + MatthiasGrabmairTechnische Universität München + 4136-4150 + Legal professionals face the challenge of managing an overwhelming volume of lengthy judgments, making automated legal case summarization crucial. However, prior approaches mainly focused on training and evaluating these models within the same jurisdiction. In this study, we explore the cross-jurisdictional generalizability of legal case summarization models. Specifically, we explore how to effectively summarize legal cases of a target jurisdiction where reference summaries are not available. 
In particular, we investigate whether supplementing models with unlabeled target jurisdiction corpus and extractive silver summaries obtained from unsupervised algorithms on target data enhances transfer performance. Our comprehensive study on three datasets from different jurisdictions highlights the role of pre-training in improving transfer performance. We shed light on the pivotal influence of jurisdictional similarity in selecting optimal source datasets for effective transfer. Furthermore, our findings underscore that incorporating unlabeled target data yields improvements in general pre-trained models, with additional gains when silver summaries are introduced. This augmentation is especially valuable when dealing with extractive datasets and scenarios featuring limited alignment between source and target jurisdictions. Our study provides key insights for developing adaptable legal case summarization systems, transcending jurisdictional boundaries. + 2024.naacl-long.231 + 2024.naacl-long.231.copyright.pdf + t-y-s-s-etal-2024-beyond + + + <fixed-case>EDC</fixed-case>: Effective and Efficient Dialog Comprehension For Dialog State Tracking + QifanLuUniversity of Washington + BhaskarRamasubramanianWestern Washington University + RadhaPoovendranUniversity of Washington, Seattle + 4151-4165 + In Task-Oriented Dialog (TOD) systems, Dialog State Tracking (DST) structurally extracts information from user and system utterances, which can be further used for querying databases and forming responses to users. The two major categories of DST methods, sequential and independent methods, face trade-offs between accuracy and efficiency. To resolve this issue, we propose Effective and Efficient Dialog Comprehension (EDC), an alternative DST approach that leverages the tree structure of the dialog state. EDC predicts domains, slot names and slot values of the dialog state step-by-step for better accuracy, and efficiently encodes dialog contexts with causal attention patterns. 
We evaluate EDC on several popular TOD datasets and EDC is able to achieve state-of-the-art Joint Goal Accuracy (JGA). We also show theoretically and empirically that EDC is more efficient than model designs used by previous works. + 2024.naacl-long.232 + 2024.naacl-long.232.copyright.pdf + lu-etal-2024-edc + + + Automatic Restoration of Diacritics for Speech Data Sets + SaraShatnawi + SawsanAlqahtaniPrincess Nourah Bint Abdulrahman University + HananAldarmakiMohamed bin Zayed University of Artificial Intelligence + 4166-4176 + Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration. + 2024.naacl-long.233 + 2024.naacl-long.233.copyright.pdf + shatnawi-etal-2024-automatic + + + <fixed-case>XNLI</fixed-case>eu: a dataset for cross-lingual <fixed-case>NLI</fixed-case> in <fixed-case>B</fixed-case>asque + MaiteHerediaUniversidad del País Vasco + JulenEtxanizHiTZ Center, University of the Basque Country (UPV/EHU) + MuitzeZulaikaOrai NLP Technologies + XabierSaralegi + JeremyBarnesUniversity of the Basque Country + AitorSoroaUniversity of the Basque Country. UPV/EHU. 
+ 4177-4188 + XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses. + 2024.naacl-long.234 + 2024.naacl-long.234.copyright.pdf + heredia-etal-2024-xnlieu + + + <fixed-case>MDR</fixed-case>: Model-Specific Demonstration Retrieval at Inference Time for In-Context Learning + HuazhengWang + JinmingWu + HaifengSunBeijing University of Posts and Telecommunications + ZixuanXiaBeijing University of Posts and Telecommunications + DaixuanCheng + JingyuWangBeijing University of Post and Telecommunication, Tsinghua University + QiQiBeijing University of Posts and Telecommunications + JianxinLiao + 4189-4204 + Recently, retrieval-based in-context learning (ICL) methods for selecting demonstrations have been widely investigated. Existing methods train a dense retriever to retrieve the most appropriate demonstrations for a given test query, which improves ICL performance. 
However, we find that distinct LLMs exhibit different biases for “what is a good demonstration” since they possess differences in training data, model architectures and training methods. As a result, a demonstration suitable for one LLM may not be appropriate for others. Previous approaches ignore the model bias and fail to retrieve the most appropriate demonstrations for different inference LLMs, resulting in a degradation of ICL performance. To address this problem, we propose a simple yet effective metric to evaluate the appropriateness of demonstrations for a specific inference LLM. Furthermore, we introduce a Model-specific Demonstration Retrieval (MDR) method for ICL at inference time, which considers the biases of different LLMs. We test MDR on seen and unseen tasks with multi-scale inference LLMs, such as GPT-Neo-2.7B, LLaMA-7B and Vicuna-13B. Experiments on 23 datasets across 11 data domains highlight the remarkable effectiveness of MDR, showcasing improvements of up to 41.2% in comparison to methods that neglect model biases. + 2024.naacl-long.235 + 2024.naacl-long.235.copyright.pdf + wang-etal-2024-mdr + + + Exploring Cross-Cultural Differences in <fixed-case>E</fixed-case>nglish Hate Speech Annotations: From Dataset Construction to Analysis + NayeonLeeKorea Advanced Institute of Science and Technology + ChaniJungKorea Advanced Institute of Science & Technology + JunhoMyungKorea Advanced Institute of Science and Technology + JihoJinKorea Advanced Institute of Science and Technology + JoseCamacho-ColladosCardiff University + JuhoKimKorea Advanced Institute of Science and Technology + AliceOhKorea Advanced Institute of Science and Technology + 4205-4224 + ***Warning**: this paper contains content that may be offensive or upsetting.* Most hate speech datasets neglect the cultural diversity within a single language, resulting in a critical shortcoming in hate speech detection. 
To address this, we introduce **CREHate**, a **CR**oss-cultural **E**nglish **Hate** speech dataset. To construct CREHate, we follow a two-step procedure: 1) cultural post collection and 2) cross-cultural annotation. We sample posts from the SBIC dataset, which predominantly represents North America, and collect posts from four geographically diverse English-speaking countries (Australia, United Kingdom, Singapore, and South Africa) using culturally hateful keywords we retrieve from our survey. Annotations are collected from the four countries plus the United States to establish representative labels for each country. Our analysis highlights statistically significant disparities across countries in hate speech annotations. Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%. Qualitative analysis shows that label disagreement occurs mostly due to different interpretations of sarcasm and the personal bias of annotators on divisive topics. Lastly, we evaluate large language models (LLMs) under a zero-shot setting and show that current LLMs tend to show higher accuracies on Anglosphere country labels in CREHate. Our dataset and codes are available at: https://github.com/nlee0212/CREHate + 2024.naacl-long.236 + 2024.naacl-long.236.copyright.pdf + lee-etal-2024-exploring-cross + + + Enhancing Contextual Understanding in Large Language Models through Contrastive Decoding + ZhengZhaoUniversity of Edinburgh, University of Edinburgh + EmilioMonti + JensLehmannAmazon, Technische Universität Dresden, University of Bonn and Fraunhofer IAIS + HaythamAssemHuawei Technologies Ltd. + 4225-4237 + Large language models (LLMs) tend to inadequately integrate input context during text generation, relying excessively on encoded prior knowledge in model parameters, potentially resulting in generated text with factual inconsistencies or contextually unfaithful content. 
LLMs utilize two primary knowledge sources: 1) prior (parametric) knowledge from pretraining, and 2) contextual (non-parametric) knowledge from input prompts. The study addresses the open question of how LLMs effectively balance these knowledge sources during the generation process, specifically in the context of open-domain question answering. To address this issue, we introduce a novel approach integrating contrastive decoding with adversarial irrelevant passages as negative samples to enhance robust context grounding during generation. Notably, our method operates at inference time without requiring further training. We conduct comprehensive experiments to demonstrate its applicability and effectiveness, providing empirical evidence showcasing its superiority over existing methodologies. + 2024.naacl-long.237 + 2024.naacl-long.237.copyright.pdf + zhao-etal-2024-enhancing + + + Generalizable Sarcasm Detection is Just Around the Corner, of Course! + HyewonJangUniversität Konstanz + DiegoFrassinelliLudwig-Maximilians-Universität München + 4238-4249 + We tested the robustness of sarcasm detection models by examining their behavior when fine-tuned on four sarcasm datasets containing varying characteristics of sarcasm: label source (authors vs. third-party), domain (social media/online vs. offline conversations/dialogues), style (aggressive vs. humorous mocking). We tested their prediction performance on the same dataset (intra-dataset) and across different datasets (cross-dataset). For intra-dataset predictions, models consistently performed better when fine-tuned with third-party labels rather than with author labels. For cross-dataset predictions, most models failed to generalize well to the other datasets, implying that one type of dataset cannot represent all sorts of sarcasm with different styles and domains. Compared to the existing datasets, models fine-tuned on the new dataset we release in this work showed the highest generalizability to other datasets. 
With a manual inspection of the datasets and post-hoc analysis, we attributed the difficulty in generalization to the fact that sarcasm actually comes in different domains and styles. We argue that future sarcasm research should take the broad scope of sarcasm into account. + 2024.naacl-long.238 + 2024.naacl-long.238.copyright.pdf + jang-frassinelli-2024-generalizable + + + Encoding of lexical tone in self-supervised models of spoken language + GaofeiShen + MichaelaWatkins + AfraAlishahiTilburg University + AriannaBisazzaUniversity of Groningen + GrzegorzChrupałaTilburg University + 4250-4261 + Interpretability research has shown that self-supervised Spoken Language Models (SLMs) encode a wide variety of features in human speech from the acoustic, phonetic, phonological, syntactic and semantic levels, to speaker characteristics. The bulk of prior research on representations of phonology has focused on segmental features such as phonemes; the encoding of suprasegmental phonology (such as tone and stress patterns) in SLMs is not yet well understood. Tone is a suprasegmental feature that is present in more than half of the world’s languages. This paper aims to analyze the tone encoding capabilities of SLMs, using Mandarin and Vietnamese as case studies. We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages. We further find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies, but they do not follow the same developmental trajectory. + 2024.naacl-long.239 + 2024.naacl-long.239.copyright.pdf + shen-etal-2024-encoding + + + A Systematic Comparison of Contextualized Word Embeddings for Lexical Semantic Change + FrancescoPeritiUniversity of Milan + NinaTahmasebiGöteborg University + 4262-4282 + Contextualized embeddings are the preferred tool for modeling Lexical Semantic Change (LSC). 
Current evaluations typically focus on a specific task known as Graded Change Detection (GCD). However, performance comparisons across works are often misleading due to their reliance on diverse settings. In this paper, we evaluate state-of-the-art models and approaches for GCD under equal conditions. We further break the LSC problem into Word-in-Context (WiC) and Word Sense Induction (WSI) tasks, and compare models across these different levels. Our evaluation is performed across different languages on eight available benchmarks for LSC, and shows that (i) APD outperforms other approaches for GCD; (ii) XL-LEXEME outperforms other contextualized models for WiC, WSI, and GCD, while being comparable to GPT-4; (iii) there is a clear need for improving the modeling of word meanings, as well as a focus on *how*, *when*, and *why* these meanings change, rather than solely focusing on the extent of semantic change. + 2024.naacl-long.240 + 2024.naacl-long.240.copyright.pdf + periti-tahmasebi-2024-systematic + + + i<fixed-case>ACOS</fixed-case>: Advancing Implicit Sentiment Extraction with Informative and Adaptive Negative Examples + XiancaiXuEnbrands Inc. + Jia-DongZhangEnbrands, Inc. + LeiXiong + ZhishangLiu + 4283-4293 + Aspect-based sentiment analysis (ABSA) has been extensively studied, but little light has been shed on the quadruple extraction consisting of four fundamental elements: aspects, categories, opinions and sentiments, especially with implicit aspects and opinions. In this paper, we propose a new method iACOS for extracting Implicit Aspects with Categories and Opinions with Sentiments. First, iACOS appends two implicit tokens at the end of a text to capture the context-aware representation of all tokens including implicit aspects and opinions. Second, iACOS develops a sequence labeling model over the context-aware token representation to co-extract explicit and implicit aspects and opinions. 
Third, iACOS devises a multi-label classifier with a specialized multi-head attention for discovering aspect-opinion pairs and predicting their categories and sentiments simultaneously. Fourth, iACOS leverages informative and adaptive negative examples to jointly train the multi-label classifier and the other two classifiers on categories and sentiments by multi-task learning. Finally, the experimental results show that iACOS significantly outperforms other quadruple extraction baselines according to the F1 score on two public benchmark datasets. + 2024.naacl-long.241 + 2024.naacl-long.241.copyright.pdf + xu-etal-2024-iacos + + + Rectifying Demonstration Shortcut in In-Context Learning + JoonwonJang + SanghwanJangPOSTECH + WonbinKweon + MinjinJeon + HwanjoYuPOSTECH + 4294-4321 + Large language models (LLMs) are able to solve various tasks with only a few demonstrations utilizing their in-context learning (ICL) abilities. However, LLMs often rely on their pre-trained semantic priors of demonstrations rather than on the input-label relationships to proceed with ICL prediction. In this work, we term this phenomenon as the ‘Demonstration Shortcut’. While previous works have primarily focused on improving ICL prediction results for predefined tasks, we aim to rectify the Demonstration Shortcut, thereby enabling the LLM to effectively learn new input-label relationships from demonstrations. To achieve this, we introduce In-Context Calibration, a demonstration-aware calibration method. We evaluate the effectiveness of the proposed method in two settings: (1) the Original ICL Task using the standard label space and (2) the Task Learning setting, where the label space is replaced with semantically unrelated tokens. In both settings, In-Context Calibration demonstrates substantial improvements, with results generalized across three LLM families (OPT, GPT, and Llama2) under various configurations. 
+ 2024.naacl-long.242 + 2024.naacl-long.242.copyright.pdf + jang-etal-2024-rectifying + + + Universal <fixed-case>NER</fixed-case>: A Gold-Standard Multilingual Named Entity Recognition Benchmark + StephenMayhewDuolingo + TerraBlevinsUniversity of Washington + ShuhengLiuGeorgia Institute of Technology + MarekSuppaComenius University in Bratislava + HilaGonenFacebook + Joseph MarvinImperialUniversity of Bath, National University and National University - Human Language Technologies Lab + BörjeKarlssonBeijing Academy of Artificial Intelligence (BAAI) + PeiqinLinInstitut für Informatik + NikolaLjubešićJožef Stefan Institute + Lester JamesMirandaAllen Institute for Artificial Intelligence + BarbaraPlankLudwig-Maximilians-Universität München and IT University of Copenhagen + ArijRiabi + YuvalPinterBen-Gurion University of the Negev + 4322-4337 + We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 19 datasets annotated with named entities in a cross-lingual consistent schema across 13 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We will release the data, code, and fitted models to the public. 
+ 2024.naacl-long.243 + 2024.naacl-long.243.copyright.pdf + mayhew-etal-2024-universal + + + <fixed-case>ODD</fixed-case>: A Benchmark Dataset for the Natural Language Processing Based Opioid Related Aberrant Behavior Detection + SunjaeKwon + XunWangMicrosoft + WeisongLiuUniversity of Massachusetts at Lowell + EmilyDruhlDepartment of Veterans Affairs + MinheeSung + JoelReisman + WenjunLiUniversity of Massachusetts at Lowell + RobertKernsYale University + WilliamBecker + HongYuColumbia University + 4338-4359 + Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients’ EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available. 
+ 2024.naacl-long.244 + 2024.naacl-long.244.copyright.pdf + kwon-etal-2024-odd + + + A Comprehensive Study of Gender Bias in Chemical Named Entity Recognition Models + XingmengZhao + AliNiazi + AnthonyRiosUniversity of Texas at San Antonio + 4360-4374 + Chemical named entity recognition (NER) models are used in many downstream tasks, from adverse drug reaction identification to pharmacoepidemiology. However, it is unknown whether these models work the same for everyone. Performance disparities can potentially cause harm rather than the intended good. This paper assesses gender-related performance disparities in chemical NER systems. We develop a framework for measuring gender bias in chemical NER models using synthetic data and a newly annotated corpus of over 92,405 words with self-identified gender information from Reddit. Our evaluation of multiple biomedical NER models reveals evident biases. For instance, synthetic data suggests that female names are frequently misclassified as chemicals, especially when it comes to brand name mentions. Additionally, we observe performance disparities between female- and male-associated data in both datasets. Many systems fail to detect contraceptives such as birth control. Our findings emphasize the biases in chemical NER models, urging practitioners to account for these biases in downstream applications. + 2024.naacl-long.245 + 2024.naacl-long.245.copyright.pdf + zhao-etal-2024-comprehensive + + + The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education + PaihengXuDepartment of Computer Science, University of Maryland, College Park + JingLiuUniversity of Maryland, College Park + NathanJones + JulieCohen + WeiAiUniversity of Maryland, College Park + 4375-4389 + Assessing instruction quality is a fundamental component of any improvement efforts in the education system. 
However, traditional manual assessments are expensive, subjective, and heavily dependent on observers’ expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers’ utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration. + 2024.naacl-long.246 + 2024.naacl-long.246.copyright.pdf + xu-etal-2024-promises + + + Differentially Private Next-Token Prediction of Large Language Models + JamesFlemings + MeisamRazaviyaynUniversity of Southern California + MuraliAnnavaramUniversity of Southern California + 4390-4404 + Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. 
The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestimates an adversary’s capabilities in having white box access to the model and, as a result, causes longer training times and larger memory usage than SGD. On the other hand, commercial LLM deployments are predominantly cloud-based; hence, adversarial access to LLMs is black-box. Motivated by these observations, we present Private Mixing of Ensemble Distributions (PMixED): a private prediction protocol for next-token prediction that utilizes the inherent stochasticity of next-token sampling and a public model to achieve Differential Privacy. We formalize this by introducing RD-mollifers which project each of the model’s output distribution from an ensemble of fine-tuned LLMs onto a set around a public LLM’s output distribution, then average the projected distributions and sample from it. Unlike DP-SGD which needs to consider the model architecture during training, PMixED is model agnostic, which makes PMixED a very appealing solution for current deployments. Our results show that PMixED achieves a stronger privacy guarantee than sample-level privacy and outperforms DP-SGD for privacy \epsilon = 8 on large-scale datasets. Thus, PMixED offers a practical alternative to DP training methods for achieving strong generative utility without compromising privacy. + 2024.naacl-long.247 + 2024.naacl-long.247.copyright.pdf + flemings-etal-2024-differentially + + + Improving Adversarial Data Collection by Supporting Annotators: Lessons from <fixed-case>GAHD</fixed-case>, a <fixed-case>G</fixed-case>erman Hate Speech Dataset + JanisGoldzycher + PaulRöttgerBocconi University + GeroldSchneiderUniversity of Zurich + 4405-4424 + Hate speech detection models are only as good as the data they are trained on. 
Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca. 11k examples. During data collection, we explore new strategies for supporting annotators, to create more diverse adversarial examples more efficiently and provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness. Further, we find that mixing multiple support strategies is most advantageous. We make GAHD publicly available at https://github.com/jagol/gahd. + 2024.naacl-long.248 + 2024.naacl-long.248.copyright.pdf + goldzycher-etal-2024-improving + + + Memory Augmented Language Models through Mixture of Word Experts + CiceroNogueira dos SantosResearch, Google + JamesLee-ThorpGoogle + IsaacNobleGoogle + Chung-ChingChangGoogle + DavidUthusGoogle + 4425-4438 + Scaling up the number of parameters of language models has proven to be an effective approach to improve performance. For dense models, increasing their size proportionally increases their computational footprint. In this work, we seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large knowledge-rich vocabulary based routing functions. Our proposed approach, dubbed Mixture of Word Experts (MoWE), can be seen as a memory augmented model, where a large set of word-specific experts play the role of a sparse memory. 
We demonstrate that MoWE performs significantly better than the T5 family of models with a similar number of FLOPs in a variety of NLP tasks. Moreover, MoWE outperforms traditional MoE models on knowledge intensive tasks and has similar performance to complex memory augmented approaches that often require invoking custom mechanisms to search the sparse memory. + 2024.naacl-long.249 + 2024.naacl-long.249.copyright.pdf + nogueira-dos-santos-etal-2024-memory + + + Impossible Distillation for Paraphrasing and Summarization: How to Make High-quality Lemonade out of Small, Low-quality Model + JaehunJungUniversity of Washington + PeterWest + LiweiJiang + FaezeBrahmanAllen Institute for AI + XimingLuDepartment of Computer Science, University of Washington + JillianFisherUniversity of Washington + TaylorSorensenUniversity of Washington and Brigham Young University + YejinChoiDepartment of Computer Science, University of Washington + 4439-4454 + We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization, that distills a high-quality dataset and model from a low-quality teacher that itself cannot perform these tasks. Unlike prior works that rely on an extreme-scale teacher model (e.g., GPT3) or task-specific architecture, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs (e.g., GPT2), where paraphrases occupy a proximal subspace in the LM distribution. By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs. We evaluate our method on multiple benchmarks spanning unconstrained / syntax-controlled paraphrase generation and sentence summarization. Our model with 770M parameters consistently outperforms strong baselines, including models distilled from ChatGPT, and sometimes, even ChatGPT itself. 
Also, we find that our distilled dataset from 1.5B LMs exhibits higher diversity and fidelity than up to 13 times larger datasets. + 2024.naacl-long.250 + 2024.naacl-long.250.copyright.pdf + jung-etal-2024-impossible + + + <fixed-case>T</fixed-case>ofu<fixed-case>E</fixed-case>val: Evaluating Hallucinations of <fixed-case>LLM</fixed-case>s on Topic-Focused Dialogue Summarization + LiyanTang + IgorShalyminovAmazon + AmyWongAmazon + JonBurnskyAmazon + JakeVincentAmazon + Yu’anYang + SiffiSingh + SongFengAmazon + HwanjunSongAWS AI Labs + HangSuAmazon + LijiaSunAmazon + YiZhangAmazon + SaabMansourAmazon + KathleenMcKeown + 4455-4480 + Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model’s size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators. 
+ 2024.naacl-long.251 + 2024.naacl-long.251.copyright.pdf + tang-etal-2024-tofueval + + + <fixed-case>MOKA</fixed-case>: Moral Knowledge Augmentation for Moral Event Extraction + Xinliang FrederickZhang + WinstonWuUniversity of Hawaii at Hilo + NicholasBeauchampNortheastern University + LuWangNortheastern University, Northeastern University and University of Michigan + 4481-4502 + News media often strive to minimize explicit moral language in news articles, yet most articles are dense with moral values as expressed through the reported events themselves. However, values that are reflected in the intricate dynamics among *participating entities* and *moral events* are far more challenging for most NLP systems to detect, including LLMs. To study this phenomenon, we annotate a new dataset, **MORAL EVENTS**, consisting of 5,494 structured event annotations on 474 news articles by diverse US media across the political spectrum. We further propose **MOKA**, a moral event extraction framework with **MO**ral **K**nowledge **A**ugmentation, which leverages knowledge derived from moral words and moral scenarios to produce structural representations of morality-bearing events. Experiments show that **MOKA** outperforms competitive baselines across three moral event understanding tasks. Further analysis shows even ostensibly nonpartisan media engage in the selective reporting of moral events. 
+ 2024.naacl-long.252 + 2024.naacl-long.252.copyright.pdf + zhang-etal-2024-moka + + + Fixing Rogue Memorization in Many-to-One Multilingual Translators of Extremely-Low-Resource Languages by Rephrasing Training Samples + PauloCavalinInternational Business Machines + Pedro HenriqueDominguesPontifícia Universidade Católica do Rio de Janeiro + ClaudioPinhanez + JulioNogimaInternational Business Machines + 4503-4514 + In this paper we study the fine-tuning of pre-trained large high-resource language models (LLMs) into many-to-one multilingual machine translators for extremely-low-resource languages such as endangered Indigenous languages. We explore those issues using datasets created from pseudo-parallel translations to English of The Bible written in 39 Brazilian Indigenous languages using mBART50 and WMT19 as pre-trained models and multiple translation metrics. We examine bilingual and multilingual models and show that, according to machine translation metrics, same-linguistic family models tend to perform best. However, we also found that many-to-one multilingual systems have a tendency to learn a “rogue” strategy of storing output strings from the training data in the LLM structure and retrieving them instead of performing actual translations. We show that rephrasing the output of the training samples seems to solve the problem. + 2024.naacl-long.253 + 2024.naacl-long.253.copyright.pdf + cavalin-etal-2024-fixing + + + Backdoor Attacks on Multilingual Machine Translation + JunWang + QiongkaiXuMacquarie University + XuanliHeUniversity College London, University of London + BenjaminRubinsteinThe University of Melbourne and The University of Melbourne + TrevorCohnGoogle and The University of Melbourne + 4515-4534 + While multilingual machine translation (MNMT) systems hold substantial promise, they also have security vulnerabilities. 
Our research highlights that MNMT systems can be susceptible to a particularly devious style of backdoor attack, whereby an attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages, including high-resource languages. Our experimental results reveal that injecting less than 0.01% poisoned data into a low-resource language pair can achieve an average 20% attack success rate in attacking high-resource language pairs. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings. Our aim is to bring attention to these vulnerabilities within MNMT systems with the hope of encouraging the community to address security concerns in machine translation, especially in the context of low-resource languages. + 2024.naacl-long.254 + 2024.naacl-long.254.copyright.pdf + wang-etal-2024-backdoor + + + Personalized Jargon Identification for Enhanced Interdisciplinary Communication + YueGuo + Joseph CheeChangAllen Institute for Artificial Intelligence + MariaAntoniak + ErinBransomAllen Institute for Artificial Intelligence + TrevorCohenUniversity of Washington + LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence + TalAugust + 4535-4550 + Scientific jargon can confuse researchers when they read materials from other domains. Identifying and translating jargon for individual researchers could speed up research, but current methods of jargon identification mainly use corpus-level familiarity indicators rather than modeling researcher-specific needs, which can vary greatly based on each researcher’s background. We collect a dataset of over 10K term familiarity annotations from 11 computer science researchers for terms drawn from 100 paper abstracts. Analysis of this data reveals that jargon familiarity and information needs vary widely across annotators, even within the same sub-domain (e.g., NLP). 
We investigate features representing domain, subdomain, and individual knowledge to predict individual jargon familiarity. We compare supervised and prompt-based approaches, finding that prompt-based methods using information about the individual researcher (e.g., personal publications, self-defined subfield of research) yield the highest accuracy, though the task remains difficult and supervised approaches have lower false positive rates. This research offers insights into features and methods for the novel task of integrating personal data into scientific jargon identification. + 2024.naacl-long.255 + 2024.naacl-long.255.copyright.pdf + guo-etal-2024-personalized + + + Flames: Benchmarking Value Alignment of <fixed-case>LLM</fixed-case>s in <fixed-case>C</fixed-case>hinese + KexinHuang + XiangyangLiu + QianyuGuo + TianxiangSun + JiaweiSun + YaruWang + ZeyangZhou + YixuWang + YanTengShanghai Artificial Intelligence Laboratory + XipengQiuFudan University + YingchunWangShanghai Artificial Intelligence Laboratory + DahuaLinThe Chinese University of Hong Kong + 4551-4591 + The widespread adoption of large language models (LLMs) across various regions underscores the urgent need to evaluate their alignment with human values. Current benchmarks, however, fall short of effectively uncovering safety vulnerabilities in LLMs. Despite numerous models achieving high scores and ‘topping the chart’ in these evaluations, there is still a significant gap in LLMs’ deeper alignment with human values and achieving genuine harmlessness. To this end, this paper proposes a value alignment benchmark named Flames, which encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values such as harmony. Accordingly, we carefully design adversarial prompts that incorporate complex scenarios and jailbreaking methods, mostly with implicit malice. 
By prompting 17 mainstream LLMs, we obtain model responses and rigorously annotate them for detailed evaluation. Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames, particularly in the safety and fairness dimensions. We also develop a lightweight specified scorer capable of scoring LLMs across multiple dimensions to efficiently evaluate new models on the benchmark. The complexity of Flames has far exceeded existing benchmarks, setting a new challenge for contemporary LLMs and highlighting the need for further alignment of LLMs. Our benchmark is publicly available at https://github.com/AIFlames/Flames. + 2024.naacl-long.256 + 2024.naacl-long.256.copyright.pdf + huang-etal-2024-flames + + + Mitigating Bias for Question Answering Models by Tracking Bias Influence + MingyuMaUniversity of California, Los Angeles + Jiun-YuKaoAmazon Alexa AI + ArpitGuptaAmazon + Yu-HsiangLinAmazon + WenboZhaoAmazon + TagyoungChungAmazon + WeiWangUniversity of California, Los Angeles + Kai-WeiChangUniversity of California, Los Angeles + NanyunPengUniversity of California, Los Angeles + 4592-4610 + Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we derive that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. 
We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy. + 2024.naacl-long.257 + 2024.naacl-long.257.copyright.pdf + ma-etal-2024-mitigating + + + Extending <fixed-case>CLIP</fixed-case>’s Image-Text Alignment to Referring Image Segmentation + SeoyeonKim + MingukKang + DongwonKimPOSTECH + JaesikParkSeoul National University + SuhaKwakPOSTECH and DGIST + 4611-4628 + Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP’s inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP’s image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP’s image-text alignment to RIS. + 2024.naacl-long.258 + 2024.naacl-long.258.copyright.pdf + kim-etal-2024-extending + + + Generating Attractive and Authentic Copywriting from Customer Reviews + Yu-XiangLin + Wei-YunMaAcademia Sinica + 4629-4642 + The goal of product copywriting is to capture the interest of potential buyers by emphasizing the features of products through text descriptions. 
As e-commerce platforms offer a wide range of services, it’s becoming essential to dynamically adjust the styles of these auto-generated descriptions. Typical approaches to copywriting generation often rely solely on specified product attributes, which may result in dull and repetitive content. To tackle this issue, we propose to generate copywriting based on customer reviews, as they provide firsthand practical experiences with products, offering a richer source of information than just product attributes. We have developed a sequence-to-sequence framework, enhanced with reinforcement learning, to produce copywriting that is attractive, authentic, and rich in information. Our framework outperforms all existing baselines and zero-shot large language models, including LLaMA-2-chat-7B and GPT-3.5, in terms of both attractiveness and faithfulness. Furthermore, this work features the use of LLMs for aspect-based summaries collection and argument allure assessment. Experiments demonstrate the effectiveness of using LLMs for marketing domain corpus construction. The code and the dataset are publicly available at: https://github.com/YuXiangLin1234/Copywriting-Generation. + 2024.naacl-long.259 + 2024.naacl-long.259.copyright.pdf + lin-ma-2024-generating + + + Effective Long-Context Scaling of Foundation Models + WenhanXiongFacebook + JingyuLiu + IgorMolybogMeta AI + HejiaZhang + PrajjwalBhargava + RuiHouMeta Inc. + LouisMartinFacebook + RashiRungta + Karthik AbinavSankararamanFacebook + BarlasOguzMeta + MadianKhabsaFacebook + HanFangMeta AI + YasharMehdadFacebook + SharanNarangMeta + KshitizMalik + AngelaFanFacebook + ShrutiBhosaleFacebook + SergeyEdunov + MikeLewisFacebook AI Research + SinongWangFacebook + HaoMaFacebook + 4643-4663 + We present an effective recipe to train strong long-context LLMs that are capable of utilizing massive context windows of up to 32,000 tokens. 
Our models are built through continual pretraining from Llama 2 checkpoints with longer text sequences and on a dataset where long texts are upsampled. We perform extensive evaluation using language modeling, synthetic context probing tasks, and a wide range of downstream benchmarks. Across all evaluations, our models achieve consistent improvements on most regular-context tasks and significant improvements on long-context tasks over Llama 2. Moreover, with a cost-effective instruction tuning procedure that is free of expensive annotation, the presented models can already surpass \texttt{gpt-3.5-turbo-16k}‘s overall performance on long-context benchmarks. Alongside these results, we provide an in-depth analysis on each individual component of our method. We delve into Llama’s position encodings and discuss its key limitation in modeling long data. We examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths – ablation results suggest that having abundant long texts in the pretrain dataset is \textit{not} the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences. + 2024.naacl-long.260 + 2024.naacl-long.260.copyright.pdf + xiong-etal-2024-effective + + + Empowering Diffusion Models on the Embedding Space for Text Generation + ZhujinGao + JunliangGuoMicrosoft + XuTan + YongxinZhu + FangZhang + JiangBianMicrosoft + LinliXu + 4664-4683 + Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space. In this paper, we conduct systematic studies of the optimization challenges encountered with both the embedding space and the denoising model, which have not been carefully explored. 
Firstly, the data distribution is learnable for embeddings, which may lead to the collapse of the embedding space and unstable training. To alleviate this problem, we propose a new objective called the anchor loss which is more efficient than previous methods. Secondly, we find the noise levels of conventional schedules are insufficient for training a desirable denoising model while introducing varying degrees of degeneration in consequence. To address this challenge, we propose a novel framework called noise rescaling. Based on the above analysis, we propose Difformer, an embedding diffusion model based on Transformer. Experiments on varieties of seminal text generation tasks show the effectiveness of the proposed methods and the superiority of Difformer over previous state-of-the-art embedding diffusion baselines. + 2024.naacl-long.261 + 2024.naacl-long.261.copyright.pdf + gao-etal-2024-empowering + + + Aligning as Debiasing: Causality-Aware Alignment via Reinforcement Learning with Interventional Feedback + YuXiaUniversity of California, San Diego + TongYuAdobe Research + ZhankuiHeUniversity of California, San Diego, University of California, San Diego + HandongZhaoAdobe Systems + JulianMcAuleyUniversity of California, San Diego, University of California, San Diego + ShuaiLiJohn Hopcroft Center, Shanghai Jiao Tong University + 4684-4695 + Large language models (LLMs) often generate biased outputs containing offensive, toxic, or stereotypical text. Existing LLM alignment methods such as reinforcement learning from human feedback (RLHF) alleviate biases primarily based on reward signals from current model outputs without considering the source of biases. In this work, to explore how biases are formed, we revisit LLMs’ text generation from a causal perspective. We identify pretraining data and input prompts, which contain semantic correlations of textual phrases, as two confounders between LLMs and model outputs causing biases. 
Inspired by our causal view, we leverage the reward model in RL alignment as an instrumental variable to perform causal intervention on LLMs. Utilizing the reward difference between an initial LLM and intervened LLM as interventional feedback to guide RL finetuning, we propose Causality-Aware Alignment (CAA) for LLM debiasing. Experiments on two text generation tasks with three different alignment objectives demonstrate the advantages of our method in aligning LLMs to generate less biased and safer outputs. + 2024.naacl-long.262 + 2024.naacl-long.262.copyright.pdf + xia-etal-2024-aligning + + + Fake Alignment: Are <fixed-case>LLM</fixed-case>s Really Aligned Well? + YixuWang + YanTengShanghai Artificial Intelligence Laboratory + KexinHuang + ChengqiLyuShanghai AI Laboratory + SongyangZhangShanghai AI Laboratory + WenweiZhangShanghai AI Laboratory + XingjunMaFudan University + Yu-GangJiangFudan University + YuQiao + YingchunWangShanghai Artificial Intelligence Laboratory + 4696-4712 + The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in the evaluation of safety. This study investigates an under-explored issue about the evaluation of LLMs, namely the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization. That is, LLM only remembers the answer style for open-ended safety questions, which makes it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. We introduce a Fake alIgNment Evaluation (FINE) framework and two novel metrics——Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimation. 
Applying FINE to 14 widely-used LLMs reveals several models with purported safety are poorly aligned in practice. Subsequently, we found that multiple-choice format data can also be used as high-quality contrast distillation-based fine-tuning data, which can strongly improve the alignment consistency of LLMs with minimal fine-tuning overhead. For data and code, see https://github.com/AIFlames/Fake-Alignment. + 2024.naacl-long.263 + 2024.naacl-long.263.copyright.pdf + wang-etal-2024-fake + + + Visually Guided Generative Text-Layout Pre-training for Document Intelligence + ZhimingMaoThe Chinese University of Hong Kong + HaoliBaiHuawei Technologies Ltd. + LuHouHuawei Technologies Ltd. + LifengShangHuawei Technologies Ltd. + XinJiang + QunLiuHuawei Noah’s Ark Lab + Kam-FaiWongThe Chinese University of Hong Kong + 4713-4730 + Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering. 
+ 2024.naacl-long.264 + 2024.naacl-long.264.copyright.pdf + mao-etal-2024-visually + + + <fixed-case>HILL</fixed-case>: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification + HeZhu + JunranWuNational University of Singapore + RuomeiLiuBeijing University of Aeronautics and Astronautics + YueHouBeihang University + ZeYuanBeijing University of Aeronautics and Astronautics + ShangzheLi + YichengPan + KeXuBeijing University of Aeronautics and Astronautics + 4731-4745 + Existing self-supervised methods in natural language processing (NLP), especially hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning, extremely relying on human-designed augmentation rules to generate contrastive samples, which can potentially corrupt or distort the original information. In this paper, we tend to investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately reserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely \textbf{H}ierarchy-aware \textbf{I}nformation \textbf{L}ossless contrastive \textbf{L}earning (HILL), which consists of a text encoder representing the input document, and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy with the principle of structural entropy minimization, and injects the syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets are conducted to verify the superiority of HILL. 
+ 2024.naacl-long.265 + 2024.naacl-long.265.copyright.pdf + zhu-etal-2024-hill + + + Investigating the Emergent Audio Classification Ability of <fixed-case>ASR</fixed-case> Foundation Models + RaoMa + AdianLiusieUniversity of Cambridge + MarkGalesUniversity of Cambridge + KateKnillUniversity of Cambridge + 4746-4760 + Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. There has been far less work, however, on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of 8 audio-classification datasets, outperforming the accuracy of existing state-of-the-art zero-shot baselines by an average of 9%. One important step to unlock the emergent ability is debiasing, where a simple unsupervised reweighting method of the class probabilities yields consistent significant performance gains. We further show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance. 
+ 2024.naacl-long.266 + 2024.naacl-long.266.copyright.pdf + ma-etal-2024-investigating + + + In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax + AaronMuellerNortheastern University and Technion - Israel Institute of Technology, Technion + AlbertWebsonGoogle DeepMind + JacksonPettyNew York University + TalLinzenNew York University and Google + 4761-4779 + In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks: given labeled examples in the input context, the LLM learns to perform the task without weight updates. Do models guided via ICL infer the underlying structure of the task defined by the context, or do they rely on superficial heuristics that only generalize to identically distributed examples? We address this question using transformations tasks and an NLI task that assess sensitivity to syntax—a requirement for robust language understanding. We further investigate whether out-of-distribution generalization can be improved via chain-of-thought prompting, where the model is provided with a sequence of intermediate computation steps that illustrate how the task ought to be performed. In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs. The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size; in particular, models pre-trained on code generalize better, and benefit more from chain-of-thought prompting. 
+ 2024.naacl-long.267 + 2024.naacl-long.267.copyright.pdf + mueller-etal-2024-context + + + Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt + YongqiWangZhejiang University + RuofanHu + RongjieHuangFAIR + ZhiqingHong + RuiqiLi + WenruiLiu + FumingYou + TaoJin + ZhouZhaoZhejiang University and Zhejiang University + 4780-4794 + Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io . 
+ 2024.naacl-long.268 + 2024.naacl-long.268.copyright.pdf + wang-etal-2024-prompt + + + Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech + DenaMujtaba + NiharMahapatraMichigan State University + MeganArneyMichigan State University + JYarussMichigan State University + HopeGerlach-HouckWestern Michigan University + CarynHerring + JiaBin + 4795-4809 + Automatic speech recognition (ASR) systems, increasingly prevalent in education, healthcare, employment, and mobile technology, face significant challenges in inclusivity, particularly for the 80 million-strong global community of people who stutter. These systems often fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations. This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. The synthetic dataset, uniquely designed to incorporate various stuttering events, enables an in-depth analysis of each ASR’s handling of disfluent speech. Our comprehensive assessment includes metrics such as word error rate (WER), character error rate (CER), and semantic accuracy of the transcripts. The results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting in significant syntactical and semantic inaccuracies in transcriptions. These findings highlight a critical gap in current ASR technologies, underscoring the need for effective bias mitigation strategies. Addressing this bias is imperative not only to improve the technology’s usability for people who stutter but also to ensure their equitable and inclusive participation in the rapidly evolving digital landscape. 
+ 2024.naacl-long.269 + 2024.naacl-long.269.copyright.pdf + mujtaba-etal-2024-lost + + + <fixed-case>MAFALDA</fixed-case>: A Benchmark and Comprehensive Study of Fallacy Detection and Classification + ChadiHelwe + TomCalamai + Pierre-HenriParisTélécom Paris + ChloéClavelINRIA and Télécom Paris + FabianSuchanekTelecom Paris + 4810-4845 + We introduce MAFALDA, a benchmark for fallacy classification that merges and unites previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models under a zero-shot learning setting and human performances on MAFALDA to assess their capability to detect and classify fallacies. + 2024.naacl-long.270 + 2024.naacl-long.270.copyright.pdf + helwe-etal-2024-mafalda + + + Diffusion Glancing Transformer for Parallel Sequence-to-Sequence Learning + LihuaQianByteDance + MingxuanWang + YangLiu + HaoZhou + 4846-4862 + Previously, non-autoregressive models were widely recognized as being superior in generation efficiency but inferior in generation quality due to the challenges of modeling multiple target modalities. To enhance the multi-modality modeling ability, we propose the diffusion glancing transformer, which employs a modality diffusion process and residual glancing sampling. The modality diffusion process is a discrete process that interpolates the multi-modal distribution along the decoding steps, and the residual glancing sampling approach guides the model to continuously learn the remaining modalities across the layers. 
Experimental results on various machine translation and text generation benchmarks demonstrate that DIFFGLAT achieves better generation accuracy while maintaining fast decoding speed compared with both autoregressive and non-autoregressive models. + 2024.naacl-long.271 + 2024.naacl-long.271.copyright.pdf + qian-etal-2024-diffusion + + + No Context Needed: Contextual Quandary In Idiomatic Reasoning With Pre-Trained Language Models + KellenChengPrinceton University + SumaBhatUniversity of Illinois, Urbana Champaign + 4863-4880 + Reasoning in the presence of idiomatic expressions (IEs) remains a challenging frontier in natural language understanding (NLU). Unlike standard text, the non-compositional nature of an IE makes it difficult for model comprehension, as their figurative or non-literal meaning usually cannot be inferred from the constituent words alone. It stands to reason that in these challenging circumstances, pre-trained language models (PTLMs) should make use of the surrounding context to infer additional information about the IE. In this paper, we investigate the utilization of said context for idiomatic reasoning tasks, which is under-explored relative to arithmetic or commonsense reasoning (Liu et al., 2022; Yu et al., 2023). Preliminary findings point to a surprising observation: general purpose PTLMs are actually negatively affected by the context, as performance almost always increases with its removal. In these scenarios, models may see gains of up to 3.89%. As a result, we argue that only IE-aware models remain suitable for idiomatic reasoning tasks, given the unexpected and unexplainable manner in which general purpose PTLMs reason over IEs. Additionally, we conduct studies to examine how models utilize the context in various situations, as well as an in-depth analysis on dataset formation and quality. Finally, we provide some explanations and insights into the reasoning process itself based on our results. 
+ 2024.naacl-long.272 + 2024.naacl-long.272.copyright.pdf + cheng-bhat-2024-context + + + Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation + XindiWangHuawei Technologies Ltd., University of Western Ontario and Vector Institute + RobertMercerUniversity of Western Ontario + FrankRudziczDalhousie University + 4881-4891 + The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches have suffered from selecting the proper label subsets from an extremely large ICD collection with a heavy long-tailed label distribution. In this paper, we leverage a multi-stage “retrieve and re-rank” framework as a novel solution to ICD indexing, via a hybrid discrete retrieval method, and re-rank retrieved candidates with contrastive learning that allows the model to make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge of the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling together the clinical notes with positive ICD codes. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark. + 2024.naacl-long.273 + 2024.naacl-long.273.copyright.pdf + wang-etal-2024-multi + + + Anisotropy is Not Inherent to Transformers + AnemilyMachinaUniversity of Western Ontario + RobertMercerUniversity of Western Ontario + 4892-4907 + Isotropy is the property that embeddings are uniformly distributed around the origin. 
Previous work has shown that Transformer embedding spaces are anisotropic, which is called the representation degradation problem. This degradation has been assumed to be inherent to the standard language modeling tasks and to apply to all Transformer models regardless of their architecture. In this work we identify a set of Transformer models with isotropic embedding spaces, the large Pythia models. We examine the isotropy of Pythia models and explore how isotropy and anisotropy develop as a model is trained. We find that anisotropic models do not develop as previously theorized, using our own analysis to show that the large Pythia models optimize their final Layer Norm for isotropy, and provide reasoning why previous theoretical justifications for anisotropy were insufficient. The identification of a set of isotropic Transformer models calls previous assumptions into question, provides a set of models to contrast existing analysis, and should lead to deeper insight into isotropy. + 2024.naacl-long.274 + 2024.naacl-long.274.copyright.pdf + machina-mercer-2024-anisotropy + + + Finding Replicable Human Evaluations via Stable Ranking Probability + ParkerRileyGoogle + DanielDeutschGoogle + GeorgeFosterGoogle + VireshRatnakarGoogle + AliDabirmoghaddam + MarkusFreitagGoogle + 4908-4919 + Reliable human evaluation is critical to the development of successful natural language generation models, but achieving it is notoriously difficult. Stability is a crucial requirement when ranking systems by quality: consistent ranking of systems across repeated evaluations is not just desirable, but essential. Without it, there is no reliable foundation for hill-climbing or product launch decisions. In this paper, we use machine translation and its state-of-the-art human evaluation framework, MQM, as a case study to understand how to set up reliable human evaluations that yield stable conclusions. 
We investigate the optimal configurations for item allocation to raters, number of ratings per item, and score normalization. Our study on two language pairs provides concrete recommendations for designing replicable human evaluation studies. We also collect and release the largest publicly available dataset of multi-segment translations rated by multiple professional translators, consisting of nearly 140,000 segment annotations across two language pairs. + 2024.naacl-long.275 + 2024.naacl-long.275.copyright.pdf + riley-etal-2024-finding + + + Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections + YuanpuCaoPennsylvania State University + BochuanCao + JinghuiChenPennsylvania State University + 4920-4935 + Recent developments in Large Language Models (LLMs) have manifested significant advancements. To facilitate safeguards against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While being effective, such fine-tuning-based unalignment approaches also have their own limitations: (1) non-stealthiness, after fine-tuning, safety audits or red-teaming can easily expose the potential weaknesses of the unaligned models, thereby precluding their release/use. (2) non-persistence, the unaligned LLMs can be easily repaired through re-alignment, i.e., fine-tuning again with aligned data points. In this work, we show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections. We also provide a novel understanding of the relationship between the backdoor persistence and the activation pattern and further provide guidelines for potential trigger design. 
Through extensive experiments, we demonstrate that our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense. + 2024.naacl-long.276 + 2024.naacl-long.276.copyright.pdf + cao-etal-2024-stealthy + + + Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts + Sai AshishSomayajula + YouweiLiangUniversity of California, San Diego + LiZhang + AbhishekSingh + PengtaoXieUniversity of California, San Diego + 4936-4953 + Pretrained Language Models (PLMs) have advanced Natural Language Processing (NLP) tasks significantly, but finetuning PLMs on low-resource datasets poses significant challenges such as instability and overfitting. Previous methods tackle these issues by finetuning a strategically chosen subnetwork on a downstream task, while keeping the remaining weights fixed to the pretrained weights. However, they rely on a suboptimal criteria for sub-network selection, leading to suboptimal solutions. To address these limitations, we propose a regularization method based on attention-guided weight mixup for finetuning PLMs. Our approach represents each network weight as a mixup of task-specific weight and pretrained weight, controlled by a learnable attention parameter, providing finer control over sub-network selection. Furthermore, we employ a bi-level optimization (BLO) based framework on two separate splits of the training dataset, improving generalization and combating overfitting. We validate the efficacy of our proposed method through extensive experiments, demonstrating its superiority over previous methods, particularly in the context of finetuning PLMs on low-resource datasets. Our code is available at https://github.com/Sai-Ashish/Attention_guided_weight_mixup_BLO. 
+ 2024.naacl-long.277 + 2024.naacl-long.277.copyright.pdf + somayajula-etal-2024-generalizable + + + Detecting Bipolar Disorder from Misdiagnosed Major Depressive Disorder with Mood-Aware Multi-Task Learning + DaeunLee + HyolimJeon + SejungSon + ChaewonPark + Ji hyunAnSamsung + SeungbaeKimUniversity of South Florida + JinyoungHanSungkyunkwan University + 4954-4970 + Bipolar Disorder (BD) is a mental disorder characterized by intense mood swings, from depression to manic states. Individuals with BD are at a higher risk of suicide, but BD is often misdiagnosed as Major Depressive Disorder (MDD) due to shared symptoms, resulting in delays in appropriate treatment and increased suicide risk. While early intervention based on social media data has been explored to uncover latent BD risk, little attention has been paid to detecting BD from those misdiagnosed as MDD. Therefore, this study presents a novel approach for identifying BD risk in individuals initially misdiagnosed with MDD. A unique dataset, BD-Risk, is introduced, incorporating mental disorder types and BD mood levels verified by two clinical experts. The proposed multi-task learning for predicting BD risk and BD mood level outperforms the state-of-the-art baselines. Also, the proposed dynamic mood-aware attention can provide insights into the impact of BD mood on future risk, potentially aiding interventions for at-risk individuals. + 2024.naacl-long.278 + 2024.naacl-long.278.copyright.pdf + lee-etal-2024-detecting-bipolar + + + Leveraging Code to Improve In-Context Learning for Semantic Parsing + BenBogin + ShivanshuGuptaUniversity of California, Irvine + PeterClarkAllen Institute for Artificial Intelligence + AshishSabharwalAllen Institute for Artificial Intelligence + 4971-5012 + In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. 
However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs. In this work, we show how pre-existing coding abilities of LLMs can be leveraged for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets; combined, they lead to dramatic improvements (e.g., 7.9% to 66.5% on SMCalFlow compositional split) and can substantially improve compositional generalization, nearly closing the performance gap between easier i.i.d. and harder compositional splits. Finally, comparisons across multiple PLs and DSL variations suggest that the similarity of a target language to general-purpose code is more important than prevalence in pretraining corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs. + 2024.naacl-long.279 + 2024.naacl-long.279.copyright.pdf + bogin-etal-2024-leveraging + + + Improving Pre-trained Language Model Sensitivity via Mask Specific losses: A case study on Biomedical <fixed-case>NER</fixed-case> + MichealAbahoUniversity of Liverpool + DanushkaBollegalaAmazon and University of Liverpool + GaryLeemingUniversity of Liverpool + DanJoyceUniversity of Oxford + IainBuchanUniversity of Liverpool + 5013-5029 + Adapting language models (LMs) to novel domains is often achieved through fine-tuning a pre-trained LM (PLM) on domain-specific data. Fine-tuning introduces new knowledge into an LM, enabling it to comprehend and efficiently perform a target domain task. Fine-tuning can however be inadvertently insensitive if it ignores the wide array of disparities (e.g., in word meaning) between source and target domains. 
For instance, words such as chronic and pressure may be treated lightly in social conversations, however, clinically, these words are usually an expression of concern. To address insensitive fine-tuning, we propose Mask Specific Language Modeling (MSLM), an approach that efficiently acquires target domain knowledge by appropriately weighting the importance of domain-specific terms (DS-terms) during fine-tuning. MSLM jointly masks DS-terms and generic words, then learns mask-specific losses by ensuring LMs incur larger penalties for inaccurately predicting DS-terms compared to generic words. Results of our analysis show that MSLM improves LMs sensitivity and detection of DS-terms. We empirically show that an optimal masking rate not only depends on the LM, but also on the dataset and the length of sequences. Our proposed masking strategy outperforms advanced masking strategies such as span- and PMI-based masking. + 2024.naacl-long.280 + 2024.naacl-long.280.copyright.pdf + abaho-etal-2024-improving + + + Language Models Implement Simple <fixed-case>W</fixed-case>ord2<fixed-case>V</fixed-case>ec-style Vector Arithmetic + JackMerulloBrown University + CarstenEickhoffEberhard-Karls-Universität Tübingen + ElliePavlickBrown University and Brown University + 5030-5047 + A primary criticism towards language models (LMs) is their inscrutability. This paper presents evidence that, despite their size and complexity, LMs sometimes exploit a simple vector arithmetic style mechanism to solve some relational tasks using regularities encoded in the hidden space of the model (e.g., Poland:Warsaw::China:Beijing). We investigate a range of language model sizes (from 124M parameters to 176B parameters) in an in-context learning setting, and find that for a variety of tasks (involving capital cities, uppercasing, and past-tensing) a key part of the mechanism reduces to a simple additive update typically applied by the feedforward (FFN) networks. 
We further show that this mechanism is specific to tasks that require retrieval from pretraining memory, rather than retrieval from local context. Our results contribute to a growing body of work on the interpretability of LMs, and offer reason to be optimistic that, despite the massive and non-linear nature of the models, the strategies they ultimately use to solve tasks can sometimes reduce to familiar and even intuitive algorithms. + 2024.naacl-long.281 + 2024.naacl-long.281.copyright.pdf + merullo-etal-2024-language + + + <fixed-case>A</fixed-case>uto<fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case>: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning + RuiyiZhangUniversity of California, San Diego + RushiQiang + Sai AshishSomayajula + PengtaoXieUniversity of California, San Diego + 5048-5060 + Large-scale pretraining followed by task-specific finetuning has achieved great success in various NLP tasks. Since finetuning all parameters of large pretrained models poses substantial computational and memory challenges, several efficient finetuning methods have been developed. Among them, low-rank adaptation (LoRA), which finetunes low-rank incremental update matrices on top of frozen pretrained weights, has proven particularly effective. Nonetheless, LoRA’s uniform rank assignment across all layers, along with its reliance on an exhaustive search to find the best rank, leads to high computation costs and suboptimal finetuning performance. To address these limitations, we introduce AutoLoRA, a meta learning based framework for automatically identifying the optimal rank of each LoRA layer. AutoLoRA associates each rank-1 matrix in a low-rank update matrix with a selection variable, which determines whether the rank-1 matrix should be discarded. A meta learning based method is developed to learn these selection variables. The optimal rank is determined by thresholding the values of these variables. 
Our comprehensive experiments on natural language understanding, generation, and sequence labeling demonstrate the effectiveness of AutoLoRA. The code is publicly available at https://github.com/ruz048/AutoLoRA + 2024.naacl-long.282 + 2024.naacl-long.282.copyright.pdf + zhang-etal-2024-autolora + + + <fixed-case>S</fixed-case>port<fixed-case>QA</fixed-case>: A Benchmark for Sports Understanding in Large Language Models + HaotianXiaUniversity of California, Irvine + ZhengbangYang + YuqingWangStanford University + RhysTracy + YunZhao + DongdongHuangBeijing Normal University + ZezhiChen + YanZhuBeijing Normal University + Yuan-fangWang + WeiningShenUniversity of California, Irvine + 5061-5081 + A deep understanding of sports, a field rich in strategic and dynamic content, is crucial for advancing Natural Language Processing (NLP). This holds particular significance in the context of evaluating and advancing Large Language Models (LLMs), given the existing gap in specialized benchmarks. To bridge this gap, we introduce SportQA, a novel benchmark specifically designed for evaluating LLMs in the context of sports understanding. SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels, each targeting different aspects of sports knowledge from basic historical facts to intricate, scenario-based reasoning tasks. We conducted a thorough evaluation of prevalent LLMs, mainly utilizing few-shot learning paradigms supplemented by chain-of-thought (CoT) prompting. Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning, lagging behind human expertise. The introduction of SportQA marks a significant step forward in NLP, offering a tool for assessing and enhancing sports understanding in LLMs. 
The dataset is available at https://github.com/haotianxia/SportQA + 2024.naacl-long.283 + 2024.naacl-long.283.copyright.pdf + xia-etal-2024-sportqa + + + Revisiting subword tokenization: A case study on affixal negation in large language models + ThinhTruongUniversity of Melbourne + YuliaOtmakhovaThe University of Melbourne + KarinVerspoorRoyal Melbourne Institute of Technology + TrevorCohnGoogle and The University of Melbourne + TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne + 5082-5095 + In this work, we measure the impact of affixal negation on modern English large language models (LLMs). In affixal negation, the negated meaning is expressed through a negative morpheme, which is potentially challenging for LLMs as their tokenizers are often not morphologically plausible. We conduct extensive experiments using LLMs with different subword tokenization methods, which lead to several insights on the interaction between tokenization performance and negation sensitivity. Despite some interesting mismatches between tokenization accuracy and negation detection performance, we show that models can, on the whole, reliably recognize the meaning of affixal negation. + 2024.naacl-long.284 + 2024.naacl-long.284.copyright.pdf + truong-etal-2024-revisiting + + + Generating Mental Health Transcripts with <fixed-case>SAPE</fixed-case> (<fixed-case>S</fixed-case>panish Adaptive Prompt Engineering) + DanielLozoya + AlejandroBerazaluceUniversity of Melbourne + JuanPerches + EloyLúa + MikeConway + SimonD’AlfonsoThe University of Melbourne + 5096-5113 + Large language models have become valuable tools for data augmentation in scenarios with limited data availability, as they can generate synthetic data resembling real-world data. However, their generative performance depends on the quality of the prompt used to instruct the model. 
Prompt engineering that relies on hand-crafted strategies or requires domain experts to adjust the prompt often yields suboptimal results. In this paper we present SAPE, a Spanish Adaptive Prompt Engineering method utilizing genetic algorithms for prompt generation and selection. Our evaluation of SAPE focuses on a generative task that involves the creation of Spanish therapy transcripts, a type of data that is challenging to collect due to the fact that it typically includes protected health information. Through human evaluations conducted by mental health professionals, our results show that SAPE produces Spanish counselling transcripts that more closely resemble authentic therapy transcripts compared to other prompt engineering techniques that are based on Reflexion and Chain-of-Thought. + 2024.naacl-long.285 + 2024.naacl-long.285.copyright.pdf + lozoya-etal-2024-generating + + + Where are you from? Geolocating Speech and Applications to Language Identification + PatrickFoley + MatthewWiesner + BismarckOdoomDepartment of Computer Science, Whiting School of Engineering + Leibny PaolaGarcia Perera + KentonMurrayJohns Hopkins University + PhilippKoehnJohns Hopkins University + 5114-5126 + We train models to answer the question, Where are you from? and show how such models can be repurposed for language identification (LID). To our knowledge, this paper is the first to introduce data sources, methods and models to tackle the task of geolocation of speech at a global scale, and the first to explore using geolocation as a proxy-task for LID. Specifically, we explore whether radio broadcasts with known origin can be used to train regression and classification-based models for geolocating speech. We build models on top of self-supervised pretrained models, using attention pooling to qualitatively verify that the model geolocates the speech itself, and not other channel artifacts. The best geolocation models localize speaker origin to around 650km. 
We confirm the value of speech geolocation as a proxy task by using speech geolocation models for zero-shot LID. Finally, we show that fine-tuning geolocation models for LID outperforms fine-tuning pretrained Wav2Vec2.0 models, and achieves state-of-the-art performance on the FLEURS benchmark. + 2024.naacl-long.286 + 2024.naacl-long.286.copyright.pdf + foley-etal-2024-geolocating + + + Teaching Language Models to Self-Improve through Interactive Demonstrations + XiaoYu + BaolinPengTencent AI Lab + MichelGalleyMicrosoft + JianfengGaoMicrosoft Research + ZhouYuColumbia University + 5127-5149 + The self-improving ability of large language models (LLMs), enabled by prompting them to analyze and revise their own outputs, has garnered significant interest in recent research. However, this ability has been shown to be absent and difficult to learn for smaller models, thus widening the performance gap between state-of-the-art LLMs and more cost-effective and faster ones. To reduce this gap, we introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability, and show that our approach can improve LLaMA-7B’s performance on math and reasoning tasks by up to 7.13%. In contrast to prior work, we achieve this by using the smaller model to interact with LLMs to collect feedback and improvements on *its own generations*. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its *own* mistakes is crucial for small models to improve their performance. 
+ 2024.naacl-long.287 + 2024.naacl-long.287.copyright.pdf + yu-etal-2024-teaching + + + <fixed-case>MAGID</fixed-case>: An Automated Pipeline for Generating Synthetic Multi-modal Datasets + HosseinAboutalebi + HwanjunSongAWS AI Labs + YushengXieAmazon + ArshitGuptaAmazon + LijiaSunAmazon + HangSuAmazon + IgorShalyminovAmazon + NikolaosPappasAWS AI Labs + SiffiSingh + SaabMansourAmazon + 5150-5167 + Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images. Subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), that work in tandem to generate high-quality and multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small. + 2024.naacl-long.288 + 2024.naacl-long.288.copyright.pdf + aboutalebi-etal-2024-magid + + + Zero-shot Generative Linguistic Steganography + KeLinTsinghua University, Tsinghua University + YiyangLuo + ZijianZhang + LuoPing + 5168-5182 + Generative linguistic steganography attempts to hide secret messages into covertext. 
Previous studies have generally focused on the statistical differences between the covertext and stegotext; however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual and statistical imperceptibility. We also design several new metrics and reproducible language evaluations to measure the imperceptibility of the stegotext. Our experimental results indicate that our method produces 1.926× more innocent and intelligible stegotext than any other method. + 2024.naacl-long.289 + 2024.naacl-long.289.copyright.pdf + lin-etal-2024-zero + + + Does <fixed-case>GPT</fixed-case>-4 pass the <fixed-case>T</fixed-case>uring test? + CameronJonesUniversity of California, San Diego + BenBergen + 5183-5210 + We evaluated GPT-4 in a public online Turing test. The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%). Participants’ decisions were based mainly on linguistic style (35%) and socioemotional traits (27%), supporting the idea that intelligence, narrowly conceived, is not sufficient to pass the Turing test. Participant knowledge about LLMs and number of games played positively correlated with accuracy in detecting AI, suggesting learning and practice as possible strategies to mitigate deception. Despite known limitations as a test of intelligence, we argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness. 
+ 2024.naacl-long.290 + 2024.naacl-long.290.copyright.pdf + jones-bergen-2024-gpt + + + Polarity Calibration for Opinion Summarization + YuanyuanLeiTexas A&M University - College Station + KaiqiangSongTencent AI Lab + SangwooChoTencent AI Lab + XiaoyangWangTencent AI Lab + RuihongHuangTexas A&M University + DongYuTencent AI Lab + 5211-5224 + Opinion summarization is automatically generating summaries from a variety of subjective information, such as product reviews or political opinions. The challenge of opinion summarization lies in presenting divergent or even conflicting opinions. We conduct an analysis of previous summarization models, which reveals their inclination to amplify the polarity bias, emphasizing the majority opinions while ignoring the minority opinions. To address this issue and make the summarizer express both sides of opinions, we introduce the concept of polarity calibration, which aims to align the polarity of output summary with that of input text. Specifically, we develop a reinforcement training approach for polarity calibration. This approach feeds the polarity distance between output summary and input text as reward into the summarizer, and also balances polarity calibration with content preservation and language naturality. We evaluate our Polarity Calibration model (PoCa) on two types of opinion summarization tasks: summarizing product reviews and political opinions articles. Automatic and human evaluation demonstrate that our approach can mitigate the polarity mismatch between output summary and input text, as well as maintain the content semantic and language quality. + 2024.naacl-long.291 + 2024.naacl-long.291.copyright.pdf + lei-etal-2024-polarity + + + Sentence-level Media Bias Analysis with Event Relation Graph + YuanyuanLeiTexas A&M University - College Station + RuihongHuangTexas A&M University + 5225-5238 + Media outlets are becoming more partisan and polarized nowadays. 
In this paper, we identify media bias at the sentence level, and pinpoint bias sentences that intend to sway readers’ opinions. As bias sentences are often expressed in a neutral and factual way, considering broader context outside a sentence can help reveal the bias. In particular, we observe that events in a bias sentence need to be understood in associations with other events in the document. Therefore, we propose to construct an event relation graph to explicitly reason about event-event relations for sentence-level bias identification. The designed event relation graph consists of events as nodes and four common types of event relations: coreference, temporal, causal, and subevent relations. Then, we incorporate event relation graph for bias sentences identification in two steps: an event-aware language model is built to inject the events and event relations knowledge into the basic language model via soft labels; further, a relation-aware graph attention network is designed to update sentence embedding with events and event relations information based on hard labels. Experiments on two benchmark datasets demonstrate that our approach with the aid of event relation graph improves both precision and recall of bias sentence identification. + 2024.naacl-long.292 + 2024.naacl-long.292.copyright.pdf + lei-huang-2024-sentence + + + <fixed-case>EMONA</fixed-case>: Event-level Moral Opinions in News Articles + YuanyuanLeiTexas A&M University - College Station + Md Messal MonemMiahTexas A&M University - College Station + AyeshaQamarTexas A&M University - College Station + Sai RamanaReddy + JonathanTongTexas A&M University - College Station + HaotianXuState University of New York at Stony Brook + RuihongHuangTexas A&M University + 5239-5251 + Most previous research on moral frames has focused on social media short texts, little work has explored moral sentiment within news articles. 
In news articles, authors often express their opinions or political stance through moral judgment towards events, specifically whether the event is right or wrong according to social moral rules. This paper initiates a new task to understand moral opinions towards events in news articles. We have created a new dataset, EMONA, and annotated event-level moral opinions in news articles. This dataset consists of 400 news articles containing over 10k sentences and 45k events, among which 9,613 events received moral foundation labels. Extracting event morality is a challenging task, as moral judgment towards events can be very implicit. Baseline models were built for event moral identification and classification. In addition, we also conduct extrinsic evaluations to integrate event-level moral opinions into three downstream tasks. The statistical analysis and experiments show that moral opinions of events can serve as informative features for identifying ideological bias or subjective events. + 2024.naacl-long.293 + 2024.naacl-long.293.copyright.pdf + lei-etal-2024-emona + + + <fixed-case>DLM</fixed-case>: A Decoupled Learning Model for Long-tailed Polyphone Disambiguation in <fixed-case>M</fixed-case>andarin + BeibeiGao + YangsenZhangBeijing Information Science and Technology University + GaXiangBeijing Information Science and Technology University + YushanJiang + 5252-5262 + Grapheme-to-phoneme conversion (G2P) is a critical component of the text-to-speech system (TTS), where polyphone disambiguation is the most crucial task. However, polyphone disambiguation datasets often suffer from the long-tail problem, and context learning for polyphonic characters commonly stems from a single dimension. In this paper, we propose a novel model DLM: a Decoupled Learning Model for long-tailed polyphone disambiguation in Mandarin. Firstly, DLM decouples representation and classification learnings. 
It can apply different data samplers for each stage to obtain an optimal training data distribution. This can mitigate the long-tail problem. Secondly, two improved attention mechanisms and a gradual conversion strategy are integrated into the DLM, which achieve transition learning of context from local to global. Finally, to evaluate the effectiveness of DLM, we construct a balanced polyphone disambiguation corpus via in-context learning. Experiments on the benchmark CPP dataset demonstrate that DLM achieves a boosted accuracy of 99.07%. Moreover, DLM improves the disambiguation performance of long-tailed polyphonic characters. For many long-tailed characters, DLM even achieves an accuracy of 100%. + 2024.naacl-long.294 + 2024.naacl-long.294.copyright.pdf + gao-etal-2024-dlm + + + You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments + BangzhaoShu + LechenZhang + MinjeChoiGeorgia Institute of Technology + LaviniaDunaganUniversity of Michigan - Ann Arbor + LajanugenLogeswaranLG AI Research + MoontaeLeeUniversity of Illinois, Chicago + DallasCardUniversity of Michigan - Ann Arbor + DavidJurgensUniversity of Michigan - Ann Arbor + 5263-5281 + The versatility of Large Language Models (LLMs) on natural language understanding tasks has made them popular for research in social sciences. To properly understand the properties and innate personas of LLMs, researchers have performed studies that involve using prompts in the form of questions that ask LLMs about particular opinions. In this study, we take a cautionary step back and examine whether the current format of prompting LLMs elicits responses in a consistent and robust manner. We first construct a dataset that contains 693 questions encompassing 39 different instruments of persona measurement on 115 persona axes. 
Additionally, we design a set of prompts containing minor variations and examine LLMs’ capabilities to generate answers, as well as prompt variations to examine their consistency with respect to content-level variations such as switching the order of response options or negating the statement. Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model’s question-answering ability, and that most LLMs have low negation consistency. Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions, and we therefore discuss potential alternatives to improve these issues. + 2024.naacl-long.295 + 2024.naacl-long.295.copyright.pdf + shu-etal-2024-dont + + + <fixed-case>CASA</fixed-case>: Causality-driven Argument Sufficiency Assessment + XiaoLiuPeking University + YansongFengPeking University + Kai-WeiChangUniversity of California, Los Angeles + 5282-5302 + The argument sufficiency assessment task aims to determine if the premises of a given argument support its conclusion. To tackle this task, existing works often train a classifier on data annotated by humans. However, annotating data is laborious, and annotations are often inconsistent due to subjective criteria. Motivated by the definition of probability of sufficiency (PS) in the causal literature, we propose CASA, a zero-shot causality-driven argument sufficiency assessment framework. PS measures how likely introducing the premise event would lead to the conclusion when both the premise and conclusion events are absent. To estimate this probability, we propose to use large language models (LLMs) to generate contexts that are inconsistent with the premise and conclusion and revise them by injecting the premise event. Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments. 
We further deploy CASA in a writing assistance application, and find that suggestions generated by CASA enhance the sufficiency of student-written arguments. Code and data are available at https://github.com/xxxiaol/CASA. + 2024.naacl-long.296 + 2024.naacl-long.296.copyright.pdf + liu-etal-2024-casa + + + <fixed-case>M</fixed-case>ac<fixed-case>G</fixed-case>yver: Are Large Language Models Creative Problem Solvers? + YufeiTian + AbhilashaRavichanderAllen Institute for Artificial Intelligence and School of Computer Science, Carnegie Mellon University + LianhuiQinUniversity of California, San Diego + RonanLe Bras + RajaMarjiehPrinceton University + NanyunPengUniversity of California, Los Angeles + YejinChoiDepartment of Computer Science, University of Washington + ThomasGriffithsPrinceton University + FaezeBrahmanAllen Institute for AI + 5303-5324 + We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically-infeasible actions. 
Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI. + 2024.naacl-long.297 + 2024.naacl-long.297.copyright.pdf + tian-etal-2024-macgyver + + + To Translate or Not to Translate: A Systematic Investigation of Translation-Based Cross-Lingual Transfer to Low-Resource Languages + BenediktEbingBayerische Julius-Maximilians-Universität Würzburg + GoranGlavašJulius-Maximilians-Universität Würzburg + 5325-5344 + Perfect machine translation (MT) would render cross-lingual transfer (XLT) by means of multilingual language models (mLMs) superfluous. Given, on the one hand, the large body of work on improving XLT with mLMs and, on the other hand, recent advances in massively multilingual MT, in this work, we systematically evaluate existing and propose new translation-based XLT approaches for transfer to low-resource languages. We show that all translation-based approaches dramatically outperform zero-shot XLT with mLMs—with the combination of round-trip translation of the source-language training data and the translation of the target-language test instances at inference—being generally the most effective. We next show that one can obtain further empirical gains by adding reliable translations to other high-resource languages to the training data. Moreover, we propose an effective translation-based XLT strategy even for languages not supported by the MT system. 
Finally, we show that model selection for XLT based on target-language validation data obtained with MT outperforms model selection based on the source-language data. We believe our findings warrant a broader inclusion of more robust translation-based baselines in XLT research. + 2024.naacl-long.298 + 2024.naacl-long.298.copyright.pdf + ebing-glavas-2024-translate + + + Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting + RuiWang + HongruWangThe Chinese University of Hong Kong + FeiMi + BoyangXue + YiChen + Kam-FaiWongThe Chinese University of Hong Kong + RuifengXuHarbin Institute of Technology + 5345-5363 + Numerous works are proposed to align large language models (LLMs) with human intents to better fulfill instructions, ensuring they are trustful and helpful. Nevertheless, some human instructions are often malicious or misleading and following them will lead to untruthful and unsafe responses. Previous work rarely focused on understanding how LLMs manage instructions based on counterfactual premises, referred to here as inductive instructions, which may stem from users’ false beliefs or malicious intents. In this paper, we aim to reveal the behaviors of LLMs towards inductive instructions and enhance their truthfulness and helpfulness accordingly. Specifically, we first introduce a benchmark of Inductive Instructions (INDust), where the false knowledge is incorporated into instructions in multiple different styles. 
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. Additionally, we identified that different inductive styles affect the models’ ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model’s performance. Motivated by these results, we propose Dual-critique prompting to improve LLM robustness against inductive instructions. Our experiments demonstrate that Dual-critique prompting significantly bolsters the robustness of a diverse array of LLMs, even when confronted with varying degrees of inductive instruction complexity and differing inductive styles. + 2024.naacl-long.299 + 2024.naacl-long.299.copyright.pdf + wang-etal-2024-enhancing + + + <fixed-case>GL</fixed-case>i<fixed-case>NER</fixed-case>: Generalist Model for Named Entity Recognition using Bidirectional Transformer + UrchadeZaratiana + NadiTomehUniversité Sorbonne Paris Nord + PierreHolat + ThierryCharnoisUniversity of Sorbonne Paris Nord (Paris 13) + 5364-5376 + Named Entity Recognition (NER) is essential in various Natural Language Processing (NLP) applications. Traditional NER models are effective but limited to a set of predefined entity types. In contrast, Large Language Models (LLMs) can extract arbitrary entities through natural language instructions, offering greater flexibility. However, their size and cost, particularly for those accessed via APIs like ChatGPT, make them impractical in resource-limited scenarios. In this paper, we introduce a compact NER model trained to identify any type of entity. Leveraging a bidirectional transformer encoder, our model, GLiNER, facilitates parallel entity extraction, an advantage over the slow sequential token generation of LLMs. Through comprehensive testing, GLiNER demonstrates strong performance, outperforming both ChatGPT and fine-tuned LLMs in zero-shot evaluations on various NER benchmarks. 
+ 2024.naacl-long.300 + 2024.naacl-long.300.copyright.pdf + zaratiana-etal-2024-gliner + + + <fixed-case>XST</fixed-case>est: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models + PaulRöttgerBocconi University + HannahKirkUniversity of Oxford + BertieVidgenAlan Turing Institute + GiuseppeAttanasioInstituto de Telecomunicações + FedericoBianchiStanford University + DirkHovyBocconi University + 5377-5400 + Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest’s creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models. 
+ 2024.naacl-long.301 + 2024.naacl-long.301.copyright.pdf + rottger-etal-2024-xstest + + + Carpe diem: On the Evaluation of World Knowledge in Lifelong Language Models + YujinKim + JaehongYoonUniversity of North Carolina at Chapel Hill + SeonghyeonYe + SangminBae + NamgyuHo + Sung JuHwangKorea Advanced Institute of Science and Technology and AITRICS + Se-YoungYunKAIST + 5401-5415 + The dynamic nature of knowledge in an ever-changing world presents challenges for language models trained on static data; the model in the real world often requires not only acquiring new knowledge but also overwriting outdated information into updated ones. To study the ability of language models for these time-dependent dynamics in human language, we introduce a novel task, EvolvingQA, a temporally evolving question-answering benchmark designed for training and evaluating LMs on an evolving Wikipedia database. The construction of EvolvingQA is automated with our pipeline using large language models. We uncover that existing continual learning baselines suffer from updating and removing outdated knowledge. Our analysis suggests that models fail to rectify knowledge due to small weight gradients. In addition, we elucidate that language models particularly struggle to reflect the change of numerical or temporal information. Our work aims to model the dynamic nature of real-world information, suggesting faithful evaluations of the evolution-adaptability of language models. Our data construction code and dataset files are available at https://github.com/kimyuji/EvolvingQA_benchmark. + 2024.naacl-long.302 + 2024.naacl-long.302.copyright.pdf + kim-etal-2024-carpe + + + Fine-grained Gender Control in Machine Translation with Large Language Models + MinwooLeeSeoul National University + HyukhunKoh + MinsungKim + KyominJung + 5416-5430 + In machine translation, the problem of ambiguously gendered input has been pointed out, where the gender of an entity is not available in the source sentence. 
To address this ambiguity issue, the task of controlled translation that takes the gender of the ambiguous entity as additional input has been proposed. However, most existing works have only considered a simplified setup of one target gender for input. In this paper, we tackle controlled translation in a more realistic setting of inputs with multiple entities and propose the Gender-of-Entity (GoE) prompting method for LLMs. Our proposed method instructs the model with fine-grained entity-level gender information to translate with correct gender inflections. By utilizing four evaluation benchmarks, we investigate the controlled translation capability of LLMs in multiple dimensions and find that LLMs reach state-of-the-art performance in controlled translation. Furthermore, we discover the emergence of a gender interference phenomenon when controlling the gender of multiple entities. Finally, we address the limitations of existing gender accuracy evaluation metrics and propose leveraging LLMs as an evaluator for gender inflection in machine translation. + 2024.naacl-long.303 + 2024.naacl-long.303.copyright.pdf + lee-etal-2024-fine + + + <fixed-case>D</fixed-case>ialog<fixed-case>VCS</fixed-case>: Robust Natural Language Understanding in Dialogue System Upgrade + ZefanCai + XinZheng + TianyuLiu + HaoranMeng + JiaqiHanTencent Cloud + GangYuan + BinghuaiLinTencent + BaobaoChangPeking University + YunboCaoTencent + 5431-5452 + In the constant updates of the product dialogue systems, we need to retrain the natural language understanding (NLU) model as new data from the real users would be merged into the existing data accumulated in the last updates. Within the newly added data, new intents would emerge and might have semantic entanglement with the existing intents, e.g.
new intents that are semantically too specific or generic are actually a subset or superset of some existing intents in the semantic space, thus impairing the robustness of the NLU model. As the first attempt to solve this problem, we set up a new benchmark consisting of 4 Dialogue Version Control dataSets (DialogVCS). We formulate the intent detection with imperfect data in the system update as a multi-label classification task with positive but unlabeled intents, which asks the models to recognize all the proper intents, including the ones with semantic entanglement, in the inference. We also propose comprehensive baseline models and conduct in-depth analyses for the benchmark, showing that the semantically entangled intents can be effectively recognized with an automatic workflow. Our code and dataset are available at https://github.com/Zefan-Cai/DialogVCS. + 2024.naacl-long.304 + 2024.naacl-long.304.copyright.pdf + cai-etal-2024-dialogvcs + + + <fixed-case>LL</fixed-case>atrieval: <fixed-case>LLM</fixed-case>-Verified Retrieval for Verifiable Generation + XiaonanLiFudan University + ChangtaiZhu + LinyangLi + ZhangyueYin + TianxiangSun + XipengQiuFudan University + 5453-5471 + Verifiable generation aims to let the large language model (LLM) generate text with supporting documents, which enables the user to flexibly verify the answer and makes the LLM’s output more reliable. Retrieval plays a crucial role in verifiable generation. Specifically, the retrieved documents not only supplement knowledge to help the LLM generate correct answers, but also serve as supporting evidence for the user to verify the LLM’s output. However, the widely used retrievers become the bottleneck of the entire pipeline and limit the overall performance. Their capabilities are usually inferior to LLMs since they often have much fewer parameters than the large language model and have not been demonstrated to scale well to the size of LLMs.
If the retriever does not correctly find the supporting documents, the LLM cannot generate the correct and verifiable answer, which overshadows the LLM’s remarkable abilities. To address these limitations, we propose **LLatrieval** (**L**arge **La**nguage Model Verified Re**trieval**), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question. Thus, the LLM can iteratively provide feedback to retrieval and facilitate the retrieval result to fully support verifiable generation. Experiments on ALCE show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results. + 2024.naacl-long.305 + 2024.naacl-long.305.copyright.pdf + li-etal-2024-llatrieval + + + Mapping Long-term Causalities in Psychiatric Symptomatology and Life Events from Social Media + SiyuanChen + MeilinWang + MinghaoLv + ZhilingZhang + JuqianqianJuqianqian + DejiyanglaDejiyangla + YujiaPengPeking University + KennyZhuUniversity of Texas at Arlington + MengyueWu + 5472-5487 + Social media is a valuable data source for exploring mental health issues. However, previous studies have predominantly focused on the semantic content of these posts, overlooking the importance of their temporal attributes, as well as the evolving nature of mental disorders and symptoms. In this paper, we study the causality between psychiatric symptoms and life events, as well as among different symptoms from social media posts, which leads to better understanding of the underlying mechanisms of mental disorders. By applying these extracted causality features to tasks such as diagnosis point detection and early risk detection of depression, we notice considerable performance enhancement. This indicates that causality information extracted from social media data can boost the efficacy of mental disorder diagnosis and treatment planning.
+ 2024.naacl-long.306 + 2024.naacl-long.306.copyright.pdf + chen-etal-2024-mapping + + + Multimodal Chart Retrieval: A Comparison of Text, Table and Image Based Approaches + AveriNowakGoogle DeepMind + FrancescoPiccinnoGoogle + YaseminAltunResearch, Google + 5488-5505 + We investigate multimodal chart retrieval, addressing the challenge of retrieving image-based charts using textual queries. We compare four approaches: (a) OCR with text retrieval, (b) chart derendering (DePlot) followed by table retrieval, (c) a direct image understanding model (PaLI-3), and (d) a combined PaLI-3 + DePlot approach. As the table retrieval component we introduce Tab-GTR, a text retrieval model augmented with table structure embeddings, achieving state-of-the-art results on the NQ-Tables benchmark with 48.88% R@1. On in-distribution data, the DePlot-based method (b) outperforms PaLI-3 (c), while being significantly more efficient (300M vs 3B trainable parameters). However, DePlot struggles with complex charts, indicating a need for improvements in chart derendering - specifically in terms of chart data diversity and the richness of text/table representations. We found no clear winner between methods (b) and (c) in general, with the best performance achieved by the combined approach (d), and further show that it benefits the most from multi-task training. + 2024.naacl-long.307 + 2024.naacl-long.307.copyright.pdf + nowak-etal-2024-multimodal + + + Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models + SeijiMaekawaMegagon Labs, US + HayateIsoMegagon Labs, US + SairamGurajadaMegagon Labs + NikitaBhutaniMegagon Labs, Inc + 5506-5521 + While large language models (LMs) demonstrate remarkable performance, they encounter challenges in providing accurate responses when queried for information beyond their pre-trained memorization. 
Although augmenting them with relevant external information can mitigate these issues, failure to consider the necessity of retrieval may adversely affect overall performance. Previous research has primarily focused on examining how entities influence retrieval models and knowledge recall in LMs, leaving other aspects relatively unexplored. In this work, our goal is to offer a more detailed, fact-centric analysis by exploring the effects of combinations of entities and relations. To facilitate this, we construct a new question answering (QA) dataset called WiTQA (Wikipedia Triple Question Answers). This dataset includes questions about entities and relations of various popularity levels, each accompanied by a supporting passage. Our extensive experiments with diverse LMs and retrievers reveal when retrieval does not consistently enhance LMs from the viewpoints of fact-centric popularity. Confirming earlier findings, we observe that larger LMs excel in recalling popular facts. However, they notably encounter difficulty with infrequent entity-relation pairs compared to retrievers. Interestingly, they can effectively retain popular relations of less common entities. We demonstrate the efficacy of our finer-grained metric and insights through an adaptive retrieval system that selectively employs retrieval and recall based on the frequencies of entities and relations in the question. 
+ 2024.naacl-long.308 + 2024.naacl-long.308.copyright.pdf + maekawa-etal-2024-retrieval + + + <fixed-case>A</fixed-case>udio<fixed-case>C</fixed-case>hat<fixed-case>L</fixed-case>lama: Towards General-Purpose Speech Abilities for <fixed-case>LLM</fixed-case>s + YassirFathullahUniversity of Cambridge + ChunyangWu + EgorLakomkinFacebook + KeLiMeta + JuntengJia + YuanShangguanCurrent: Google + JayMahadeokar + OzlemKalinli + ChristianFuegenFacebook/ Meta + MikeSeltzerMeta + 5522-5532 + In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform spoken question answering (QA), speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. On both synthesized and recorded speech QA test sets, evaluations show that our end-to-end approach is on par with or outperforms cascaded systems (speech recognizer + LLM) in terms of modelling the response to a prompt. Furthermore, unlike cascades, our approach can interchange text and audio modalities and intrinsically utilize prior context in a conversation to provide better results. 
+ 2024.naacl-long.309 + 2024.naacl-long.309.copyright.pdf + fathullah-etal-2024-audiochatllama + + + Whispers of Doubt Amidst Echoes of Triumph in <fixed-case>NLP</fixed-case> Robustness + AshimGupta + RishanthRajendhran + NathanStringhamUniversity of Utah + VivekSrikumarUniversity of Utah + AnaMarasovicUniversity of Utah + 5533-5590 + *Do larger and more performant models resolve NLP’s longstanding robustness issues?* We investigate this question using over 20 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) out-of-domain and challenge test sets, (b) behavioral testing with CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all out-of-domain tests provide insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them adequately robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed. + 2024.naacl-long.310 + 2024.naacl-long.310.copyright.pdf + gupta-etal-2024-whispers + + + Sequential Compositional Generalization in Multimodal Models + SemihYagciogluApziva + Osman Baturİnce + AykutErdemKoç University + ErkutErdemHacettepe University + DesmondElliottand University of Copenhagen + DenizYuretKoc University + 5591-5611 + The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. 
However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using CompAct (Compositional Activities), a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain. + 2024.naacl-long.311 + 2024.naacl-long.311.copyright.pdf + yagcioglu-etal-2024-sequential + + + Generating Uncontextualized and Contextualized Questions for Document-Level Event Argument Extraction + Md NayemUddinArizona State University + EnfaGeorgeUniversity of Arizona + EduardoBlancoUniversity of Arizona + StevenCorman + 5612-5627 + This paper presents multiple question generation strategies for document-level event argument extraction. These strategies do not require human involvement and result in uncontextualized questions as well as contextualized questions grounded on the event and document of interest. Experimental results show that combining uncontextualized and contextualized questions is beneficial, especially when event triggers and arguments appear in different sentences. Our approach does not have corpus-specific components; in particular, the question generation strategies transfer across corpora.
We also present a qualitative analysis of the most common errors made by our best model. + 2024.naacl-long.312 + 2024.naacl-long.312.copyright.pdf + uddin-etal-2024-generating + + + Evidence-Driven Retrieval Augmented Response Generation for Online Misinformation + ZhenruiYue + HuiminZeng + YimengLu + LanyuShang + YangZhangUniversity of Illinois at Urbana-Champaign + DongWangUniversity of Illinois at Urbana-Champaign + 5628-5643 + The proliferation of online misinformation has posed significant threats to public interest. While numerous online users actively participate in the combat against misinformation, many such responses can be characterized by the lack of politeness and supporting facts. As a solution, text generation approaches are proposed to automatically produce counter-misinformation responses. Nevertheless, existing methods are often trained end-to-end without leveraging external knowledge, resulting in subpar text quality and excessively repetitive responses. In this paper, we propose retrieval augmented response generation for online misinformation (RARG), which collects supporting evidence from scientific sources and generates counter-misinformation responses based on the evidence. In particular, our RARG consists of two stages: (1) evidence collection, where we design a retrieval pipeline to retrieve and rerank evidence documents using a database comprising over 1M academic articles; (2) response generation, in which we align large language models (LLMs) to generate evidence-based responses via reinforcement learning from human feedback (RLHF). We propose a reward function to maximize the utilization of the retrieved evidence while maintaining the quality of the generated text, which yields polite and factual responses that clearly refute misinformation.
To demonstrate the effectiveness of our method, we study the case of COVID-19 and perform extensive experiments with both in- and cross-domain datasets, where RARG consistently outperforms baselines by generating high-quality counter-misinformation responses. + 2024.naacl-long.313 + 2024.naacl-long.313.copyright.pdf + yue-etal-2024-evidence + + + Open-Vocabulary Federated Learning with Multimodal Prototyping + HuiminZeng + ZhenruiYue + DongWangUniversity of Illinois at Urbana-Champaign + 5644-5656 + Existing federated learning (FL) studies usually assume the training label space and test label space are identical. However, in real-world applications, this assumption is too ideal to be true. A new user could come up with queries that involve data from unseen classes, and such open-vocabulary queries would directly defect such FL systems. Therefore, in this work, we explicitly focus on the under-explored open-vocabulary challenge in FL. That is, for a new user, the global server shall understand her/his query that involves arbitrary unknown classes. To address this problem, we leverage the pretrained vision-language models (VLMs). In particular, we present a novel adaptation framework tailored for VLMs in the context of FL, named as Federated Multimodal Prototyping (Fed-MP). Fed-MP adaptively aggregates the local model weights based on light-weight client residuals, and makes predictions based on a novel multimodal prototyping mechanism. Fed-MP exploits the knowledge learned from the seen classes, and robustifies the adapted VLM to unseen categories. Our empirical evaluation on various datasets validates the effectiveness of Fed-MP.
+ 2024.naacl-long.314 + 2024.naacl-long.314.copyright.pdf + zeng-etal-2024-open + + + Exploring Key Point Analysis with Pairwise Generation and Graph Partitioning + XiaoLiNanjing University + YongJiang + ShenHuangAlibaba Group + PengjunXie + GongChengNanjing University + FeiHuangAlibaba Group + 5657-5667 + Key Point Analysis (KPA), the summarization of multiple arguments into a concise collection of key points, continues to be a significant and unresolved issue within the field of argument mining. Existing models adopt a two-stage pipeline of clustering arguments or generating key points for argument clusters. This approach relies on semantic similarity instead of measuring the existence of shared key points among arguments. Additionally, it only models the intra-cluster relationship among arguments, disregarding the inter-cluster relationship between arguments that do not share key points. To address these limitations, we propose a novel approach for KPA with pairwise generation and graph partitioning. Our objective is to train a generative model that can simultaneously provide a score indicating the presence of a shared key point between a pair of arguments and generate the shared key point. Subsequently, to map generated redundant key points to a concise set of key points, we proceed to construct an arguments graph by considering the arguments as vertices, the generated key points as edges, and the scores as edge weights. We then propose a graph partitioning algorithm to partition all arguments sharing the same key points to the same subgraph. Notably, our experimental findings demonstrate that our proposed model surpasses previous models when evaluated on both the ArgKP and QAM datasets.
+ 2024.naacl-long.315 + 2024.naacl-long.315.copyright.pdf + li-etal-2024-exploring + + + Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense + SiqiShenUniversity of Michigan - Ann Arbor + LajanugenLogeswaranLG AI Research + MoontaeLeeUniversity of Illinois, Chicago + HonglakLeeUniversity of Michigan - Ann Arbor and LG AI Research + SoujanyaPoriaSingapore University of Technology and Design + RadaMihalceaUniversity of Michigan + 5668-5680 + Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant discrepancy in performance when tested on culture-specific commonsense knowledge for different cultures; (2) LLMs’ general commonsense capability is affected by cultural context; and (3) the language used to query the LLMs can impact their performance on cultural-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally-aware language models. + 2024.naacl-long.316 + 2024.naacl-long.316.copyright.pdf + shen-etal-2024-understanding + + + Code Models are Zero-shot Precondition Reasoners + LajanugenLogeswaranLG AI Research + SungryullSohnLG AI Research + YiweiLyu + AnthonyLiu + Dong-KiKimLG AI Research + DongsubShimLG AI Research + MoontaeLeeUniversity of Illinois, Chicago + HonglakLeeUniversity of Michigan - Ann Arbor and LG AI Research + 5681-5697 + One of the fundamental skills required for an agent acting in an environment to complete tasks is the ability to understand what actions are plausible at any given point.
This work explores a novel use of code representations to reason about action preconditions for sequential decision making tasks. Code representations offer the flexibility to model procedural activities and associated constraints as well as the ability to execute and verify constraint satisfaction. Leveraging code representations, we extract action preconditions from demonstration trajectories in a zero-shot manner using pre-trained code models. Given these extracted preconditions, we propose a precondition-aware action sampling strategy that ensures actions predicted by a policy are consistent with preconditions. We demonstrate that the proposed approach enhances the performance of few-shot policy learning approaches across task-oriented dialog and embodied textworld benchmarks. + 2024.naacl-long.317 + 2024.naacl-long.317.copyright.pdf + logeswaran-etal-2024-code + + + Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding + SuyoungKim + JiyeonHwangKyungpook National University + Ho-YoungJungKyungpook National University + 5698-5711 + Recently, deep end-to-end learning has been studied for intent classification in Spoken Language Understanding (SLU). However, end-to-end models require a large amount of speech data with intent labels, and highly optimized models are generally sensitive to the inconsistency between the training and evaluation conditions. Therefore, a natural language understanding approach based on Automatic Speech Recognition (ASR) remains attractive because it can utilize a pre-trained general language model and adapt to the mismatch of the speech input environment. Using this module-based approach, we improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors. 
We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. Experiments on four benchmark datasets show that CCL outperforms existing methods and improves the ASR robustness in various noisy environments. Code is available at https://github.com/syoung7388/CCL + 2024.naacl-long.318 + 2024.naacl-long.318.copyright.pdf + kim-etal-2024-contrastive + + + Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of <fixed-case>LLM</fixed-case>s as Rankers + YuanWang + XuyangWu + Hsin-TaiWu + ZhiqiangTaoRochester Institute of Technology + YiFangSanta Clara University + 5712-5724 + The integration of Large Language Models (LLMs) in information retrieval has raised a critical reevaluation of fairness in the text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works such as RankGPT have demonstrated that the LLMs have better performance than the traditional ranking models in the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis delves into how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as the fair ranker. 
+ 2024.naacl-long.319 + 2024.naacl-long.319.copyright.pdf + wang-etal-2024-large + + + <fixed-case>T</fixed-case>ab<fixed-case>SQL</fixed-case>ify: Enhancing Reasoning Capabilities of <fixed-case>LLM</fixed-case>s Through Table Decomposition + MdNahidUniversity of Alberta + DavoodRafieiUniversity of Alberta + 5725-5737 + Table reasoning is a challenging task that requires understanding both natural language questions and structured tabular data. Large language models (LLMs) have shown impressive capabilities in natural language understanding and generation, but they often struggle with large tables due to their limited input length. In this paper, we propose TabSQLify, a novel method that leverages text-to-SQL generation to decompose tables into smaller and relevant sub-tables, containing only essential information for answering questions or verifying statements, before performing the reasoning task. In our comprehensive evaluation on four challenging datasets, our approach demonstrates comparable or superior performance compared to prevailing methods reliant on full tables as input. Moreover, our method can reduce the input context length significantly, making it more scalable and efficient for large-scale table reasoning applications. Our method performs remarkably well on the WikiTQ benchmark, achieving an accuracy of 64.7%. Additionally, on the TabFact benchmark, it achieves a high accuracy of 79.5%. These results surpass other LLM-based baseline models on gpt-3.5-turbo (chatgpt). TabSQLify can reduce the table size significantly alleviating the computational load on LLMs when handling large tables without compromising performance. 
+ 2024.naacl-long.320 + 2024.naacl-long.320.copyright.pdf + nahid-rafiei-2024-tabsqlify + + + Contextual Label Projection for Cross-Lingual Structured Prediction + TanmayParekh + I-HungHsu + Kuan-HaoHuangUniversity of Illinois Urbana-Champaign + Kai-WeiChangUniversity of California, Los Angeles + NanyunPengUniversity of California, Los Angeles + 5738-5757 + Label projection, which involves obtaining translated labels and texts jointly, is essential for leveraging machine translation to facilitate cross-lingual transfer in structured prediction tasks. Prior research exploring label projection often compromise translation accuracy by favoring simplified label translation or relying solely on word-level alignments. In this paper, we introduce a novel label projection approach, CLaP, which translates text to the target language and performs *contextual translation* on the labels using the translated text as the context, ensuring better accuracy for the translated labels. We leverage instruction-tuned language models with multilingual capabilities as our contextual translator, imposing the constraint of the presence of translated labels in the translated text via instructions. We benchmark CLaP with other label projection techniques on zero-shot cross-lingual transfer across 39 languages on two representative structured prediction tasks - event argument extraction (EAE) and named entity recognition (NER), showing over 2.4 F1 improvement for EAE and 1.4 F1 improvement for NER. We further explore the applicability of CLaP on ten extremely low-resource languages to showcase its potential for cross-lingual structured prediction. 
+ 2024.naacl-long.321 + 2024.naacl-long.321.copyright.pdf + parekh-etal-2024-contextual + + + Event Detection from Social Media for Epidemic Prediction + TanmayParekh + AnhMac + JiaruiYu + YuxuanDong + SyedShahriar + BonnieLiu + EricYang + Kuan-HaoHuangUniversity of Illinois Urbana-Champaign + WeiWangUniversity of California, Los Angeles + NanyunPengUniversity of California, Los Angeles + Kai-WeiChangUniversity of California, Los Angeles + 5758-5783 + Social media is an easy-to-access platform providing timely updates about societal trends and events. Discussions regarding epidemic-related events such as infections, symptoms, and social interactions can be crucial for informing policymaking during epidemic outbreaks. In our work, we pioneer exploiting Event Detection (ED) for better preparedness and early warnings of any upcoming epidemic by developing a framework to extract and analyze epidemic-related events from social media posts. To this end, we curate an epidemic event ontology comprising seven disease-agnostic event types and construct a Twitter dataset SPEED with human-annotated events focused on the COVID-19 pandemic. Experimentation reveals how ED models trained on COVID-based SPEED can effectively detect epidemic events for three unseen epidemics of Monkeypox, Zika, and Dengue; while models trained on existing ED datasets fail miserably. Furthermore, we show that reporting sharp increases in the extracted events by our framework can provide warnings 4-9 weeks earlier than the WHO epidemic declaration for Monkeypox. This utility of our framework lays the foundations for better preparedness against emerging epidemics. 
+ 2024.naacl-long.322 + 2024.naacl-long.322.copyright.pdf + parekh-etal-2024-event + + + <fixed-case>RESPROMPT</fixed-case>: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models + SongJiang + ZahraShakeri + AaronChanMeta AI + MaziarSanjabiMeta + HamedFiroozLinkedIn + YinglongXiaMeta + BugraAkyildizNew York University and Bilkent University + YizhouSunUniversity of California, Los Angeles + JinchaoLiFacebook + QifanWangMeta AI + AsliCelikyilmazFAIR + 5784-5809 + Chain-of-thought (CoT) has impressively unlocked the reasoning potential of large language models (LLMs). Yet, it falls short when tackling problems that require multiple reasoning steps. This limitation arises from the complex nature of multi-step reasoning processes: later stages often depend not only on the immediately preceding step, but also on the results from several steps earlier. Such complexities indicate the reasoning process is naturally a graph. The almost linear structure of CoT, however, struggles to capture this complex reasoning graph. To address this challenge, we propose Residual Connection Prompting (ResPrompt), a new prompting strategy that advances multi-step reasoning in LLMs. The core of our idea is to reconstruct the reasoning graph within prompts. We achieve this by integrating necessary connections–links present in reasoning graph but missing in the linear CoT flow–into the prompts. Termed “residual connections”, these links can transform linear CoT into the complex reasoning graphs that multi-step problems entail. On benchmarks across math, sequential, and commonsense domains, ResPrompt demonstrates clear improvements in multi-step reasoning compared with CoT. Through extensive ablation studies and analyses, we pinpoint how to effectively build residual connections and also identify situations where it might be unnecessary. 
+ 2024.naacl-long.323 + 2024.naacl-long.323.copyright.pdf + jiang-etal-2024-resprompt + + + <fixed-case>BPE</fixed-case>-knockout: Pruning Pre-existing <fixed-case>BPE</fixed-case> Tokenisers with Backwards-compatible Morphological Semi-supervision + ThomasBauwensKU Leuven + PieterDelobelleKU Leuven, KU Leuven + 5810-5832 + Byte-pair encoding (BPE) has become the default subword tokeniser in language models (LMs), allowing the representation of an infinite space of text with a finite set of units. Yet, BPE training is unsupervised, receiving no explicit information about a language’s morphology. This results in a subword vocabulary wherein many units are a concatenation of partial morphemes, preventing their formation as tokens. This, in turn, causes consistent intra-word patterns to be displayed inconsistently to downstream models, and bloats the vocabulary, hence requiring unnecessary embedding storage. In this paper, we address this issue by identifying blameworthy BPE merges and removing the resulting subwords from the BPE vocabulary, without impeding further use of merges that relied on them. We find that our method, BPE-knockout, is effective at making BPE’s segmentation positions adhere better to derivational and compound boundaries in English, Dutch and German, and improves token-based tasks in Dutch RoBERTa models, indicating that a tokeniser’s adherence to morphology impacts downstream models. We demonstrate the latter not only by training LMs from scratch, but also by continuing the pre-training of existing LMs. This proves promising, showing that suboptimal tokenisers can be remedied whilst salvaging training cost of downstream LMs. + 2024.naacl-long.324 + 2024.naacl-long.324.copyright.pdf + bauwens-delobelle-2024-bpe + + + How are Prompts Different in Terms of Sensitivity? 
+ ShengLu + HendrikSchuffTechnische Universität Darmstadt + IrynaGurevychMohamed bin Zayed University of Artificial Intelligence and Technical University of Darmstadt + 5833-5856 + In-context learning (ICL) has become one of the most popular learning paradigms. While there is a growing body of literature focusing on prompt engineering, there is a lack of systematic analysis comparing the effects of prompt techniques across different models and tasks. To address this, we present a comprehensive prompt analysis based on sensitivity. Our analysis reveals that sensitivity is an unsupervised proxy for model performance, as it exhibits a strong negative correlation with accuracy. We use gradient-based saliency scores to empirically demonstrate how different prompts affect the relevance of input tokens to the output, resulting in different levels of sensitivity. Furthermore, we introduce sensitivity-aware decoding which incorporates sensitivity estimation as a penalty term in the standard greedy decoding. We show that this approach is particularly helpful when information in the input is scarce. Our work provides a fresh perspective on the analysis of prompts, and contributes to a better understanding of the mechanism of ICL. + 2024.naacl-long.325 + 2024.naacl-long.325.copyright.pdf + lu-etal-2024-prompts + + + <fixed-case>LSTD</fixed-case>ial: Enhancing Dialogue Generation via Long- and Short-Term Measurement Feedback + GuanghuiYeHunan University + HuanZhaoHunan University + ZixingZhangHunan University + XupengZhaHunan University + ZhihuaJiang + 5857-5871 + Generating high-quality responses is a key challenge for any open domain dialogue systems. However, even though there exist a variety of quality dimensions especially designed for dialogue evaluation (e.g., coherence and diversity scores), current dialogue systems rarely utilize them to guide the response generation during training. 
To alleviate this issue, we propose LSTDial (Long- and Short-Term Dialogue), a novel two-stage framework which generates and utilizes conversation evaluation as explicit feedback during training. Specifically, we fine-tune pre-trained dialogue systems through using turn-level quality feedback in the first stage and further train ever-improving dialogue agents through using dialogue-level quality feedback in the second stage. By using our approach on dialogue systems, capable of enabling dialogue generation with both short-term capabilities (generating more fluent, relevant and varied responses at the turn-level) and long-term capabilities (generating more coherent, engaging and informative responses at the dialogue-level). We implement LSTDial on four strong baseline models and experiment with two open-domain dialogue datasets. Experimental results show that LSTDial achieves significant improvement, enabling to generate better dialogue responses in terms of both human and automatic evaluation. + 2024.naacl-long.326 + 2024.naacl-long.326.copyright.pdf + ye-etal-2024-lstdial + + + The <fixed-case>ART</fixed-case> of <fixed-case>LLM</fixed-case> Refinement: Ask, Refine, and Trust + KumarShridhar + KoustuvSinhaMeta (FAIR) + AndrewCohen + TianluWangMeta + PingYuFacebook + RamakanthPasunuru + MrinmayaSachanSwiss Federal Institute of Technology + JasonWestonNew York University and Facebook + AsliCelikyilmazFAIR + 5872-5883 + Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations and self-improve? A popular concept, referred to as *self-refinement*, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. 
To address this, we propose a reasoning with a refinement strategy called *ART: Ask, Refine, and Trust*, which *asks* necessary questions to decide when an LLM should *refine* its output, and uses it to affirm or deny *trust* in its refinement by ranking the refinement and the initial prediction. On two multistep reasoning tasks of mathematical word problems (GSM8K) and question answering (StrategyQA), *ART* achieves a performance gain of +5 points over self-refinement baselines, while using a much smaller model as the decision maker. We believe that *ART* with smaller models, making refinement decisions can be a cost-effective alternative to fine-tuning LLMs. + 2024.naacl-long.327 + 2024.naacl-long.327.copyright.pdf + shridhar-etal-2024-art + + + Modularized Multilingual <fixed-case>NMT</fixed-case> with Fine-grained Interlingua + SungjunLimSamsung + YoonjungChoiSamsung + SanghaKim + 5884-5899 + Recently, one popular alternative in Multilingual NMT (MNMT) is modularized MNMT that has both language-specific encoders and decoders. However, due to the absence of layer-sharing, the modularized MNMT failed to produce satisfactory language-independent (Interlingua) features, leading to performance degradation in zero-shot translation. To address this issue, a solution was proposed to share the top of language-specific encoder layers, enabling the successful generation of interlingua features. Nonetheless, it should be noted that this sharing structure does not guarantee the explicit propagation of language-specific features to their respective language-specific decoders. Consequently, to overcome this challenge, we present our modularized MNMT approach, where a modularized encoder is divided into three distinct encoder modules based on different sharing criteria: (1) source language-specific (Enc_{s}); (2) universal (Enc_{all}); (3) target language-specific (Enc_{t}). 
By employing these sharing strategies, Enc_{all} propagates the interlingua features, after which Enc_{t} propagates the target language-specific features to the language-specific decoders. Additionally, we suggest the Denoising Bi-path Autoencoder (DBAE) to fortify the Denoising Autoencoder (DAE) by leveraging Enc_{t}. For experimental purposes, our training corpus comprises both En-to-Any and Any-to-En directions. We adjust the size of our corpus to simulate both balanced and unbalanced settings. Our method demonstrates an improved average BLEU score by “+2.90” in En-to-Any directions and by “+3.06” in zero-shot compared to other MNMT baselines. + 2024.naacl-long.328 + 2024.naacl-long.328.copyright.pdf + lim-etal-2024-modularized + + + <fixed-case>P</fixed-case>arallel<fixed-case>PARC</fixed-case>: A Scalable Pipeline for Generating Natural-Language Analogies + OrenSultanHebrew University of Jerusalem + YonatanBittonGoogle + RonYosef + DafnaShahafHebrew University of Jerusalem + 5900-5924 + Analogy-making is central to human cognition, allowing us to adapt to novel situations – an ability that current AI systems still lack. Most analogy datasets today focus on simple analogies (e.g., word analogies); datasets including complex types of analogies are typically manually curated and very small. We believe that this holds back progress in computational analogy. In this work, we design a data generation pipeline, ParallelPARC (Parallel Paragraph Creator) leveraging state-of-the-art Large Language Models (LLMs) to create complex, paragraph-based analogies, as well as distractors, both simple and challenging. We demonstrate our pipeline and create ProPara-Logy, a dataset of analogies between scientific processes. We publish a gold-set, validated by humans, and a silver-set, generated automatically. We test LLMs’ and humans’ analogy recognition in binary and multiple-choice settings, and found that humans outperform the best models (∼13% gap) after a light supervision. 
We demonstrate that our silver-set is useful for training models. Lastly, we show challenging distractors confuse LLMs, but not humans. We hope our pipeline will encourage research in this emerging field. + 2024.naacl-long.329 + 2024.naacl-long.329.copyright.pdf + sultan-etal-2024-parallelparc + + + <fixed-case>AWESOME</fixed-case>: <fixed-case>GPU</fixed-case> Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content + ShuyangCaoUniversity of Michigan - Ann Arbor + LuWangNortheastern University, Northeastern University and University of Michigan + 5925-5941 + Long document summarization systems are critical for domains with lengthy and jargon-laden text, yet they present significant challenges to researchers and developers with limited computing resources. Existing solutions mainly focus on efficient attentions or divide-and-conquer strategies. The former reduces theoretical time complexity, but is still memory-heavy. The latter methods sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory-efficient nature of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) External memory mechanisms track previously encoded document segments and their corresponding summaries, to enhance global document understanding and summary coherence. (2) Global salient content is further identified beforehand to augment each document segment to support its summarization. Extensive experiments on diverse genres of text, including government reports, meeting transcripts, screenplays, scientific papers, and novels, show that AWESOME produces summaries with improved informativeness, faithfulness, and coherence than competitive baselines on longer documents, while having a smaller GPU memory footprint. 
+ 2024.naacl-long.330 + 2024.naacl-long.330.copyright.pdf + cao-wang-2024-awesome + + + <fixed-case>NLP</fixed-case> Systems That Can’t Tell Use from Mention Censor Counterspeech, but Teaching the Distinction Helps + KristinaGligoricStanford University + MyraChengStanford University + LuciaZheng + EsinDurmusStanford University + DanJurafskyStanford University + 5942-5959 + The use of words to convey speaker’s intent is traditionally distinguished from the ‘mention’ of words for quoting what someone said, or pointing out properties of a word. Here we show that computationally modeling this use-mention distinction is crucial for dealing with counterspeech online. Counterspeech that refutes problematic content often mentions harmful language but is not harmful itself (e.g., calling a vaccine dangerous is not the same as expressing disapproval of someone for calling vaccines dangerous). We show that even recent language models fail at distinguishing use from mention, and that this failure propagates to two key downstream tasks: misinformation and hate speech detection, resulting in censorship of counterspeech. We introduce prompting mitigations that teach the use-mention distinction, and show they reduce these errors. Our work highlights the importance of the use-mention distinction for NLP and CSS and offers ways to address it. + 2024.naacl-long.331 + 2024.naacl-long.331.copyright.pdf + gligoric-etal-2024-nlp + + + Debiasing with Sufficient Projection: A General Theoretical Framework for Vector Representations + EnzeShiUniversity of Alberta + LeiDing + LinglongKongUniversity of Alberta + BeiJiangUniversity of Alberta + 5960-5975 + Pre-trained vector representations in natural language processing often inadvertently encode undesirable social biases. Identifying and removing unwanted biased information from vector representation is an evolving and significant challenge. 
Our study uniquely addresses this issue from the perspective of statistical independence, proposing a framework for reducing bias by transforming vector representations to an unbiased subspace using sufficient projection. The key to our framework lies in its generality: it adeptly mitigates bias across both debiasing and fairness tasks, and across various vector representation types, including word embeddings and output representations of transformer models. Importantly, we establish the connection between debiasing and fairness, offering theoretical guarantees and elucidating our algorithm’s efficacy. Through extensive evaluation of intrinsic and extrinsic metrics, our method achieves superior performance in bias reduction while maintaining high task performance, and offers superior computational efficiency. + 2024.naacl-long.332 + 2024.naacl-long.332.copyright.pdf + shi-etal-2024-debiasing + + + Semi-Supervised Dialogue Abstractive Summarization via High-Quality Pseudolabel Selection + JianfengHeVirginia Tech + HangSuAmazon + JasonCaiAmazon + IgorShalyminovAmazon + HwanjunSongAWS AI Labs + SaabMansourAmazon + 5976-5996 + Semi-supervised dialogue summarization (SSDS) leverages model-generated summaries to reduce reliance on human-labeled data and improve the performance of summarization models. While addressing label noise, previous works on semi-supervised learning primarily focus on natural language understanding tasks, assuming each sample has a unique label. However, these methods are not directly applicable to SSDS, as it is a generative task, and each dialogue can be summarized in different ways. In this work, we propose a novel scoring approach, SiCF, which encapsulates three primary dimensions of summarization model quality: Semantic invariance (indicative of model confidence), Coverage (factual recall), and Faithfulness (factual precision). 
Using the SiCF score, we select unlabeled dialogues with high-quality generated summaries to train summarization models. Comprehensive experiments on three public datasets demonstrate the effectiveness of SiCF scores in uncertainty estimation and semi-supervised learning for dialogue summarization tasks. Our code is available at https://github.com/amazon-science/summarization-sicf-score. + 2024.naacl-long.333 + 2024.naacl-long.333.copyright.pdf + he-etal-2024-semi + + + <fixed-case>A</fixed-case>fri<fixed-case>MTE</fixed-case> and <fixed-case>A</fixed-case>fri<fixed-case>COMET</fixed-case>: Enhancing <fixed-case>COMET</fixed-case> to Embrace Under-resourced <fixed-case>A</fixed-case>frican Languages + JiayiWang + DavidAdelani + SwetaAgrawalInstituto de Telecomunicações + MarekMasiak + RicardoReiInstituto Superior Técnico, INESC-ID and Unbabel + EleftheriaBriakouGoogle + MarineCarpuatUniversity of Maryland, College Park + XuanliHeUniversity College London, University of London + SofiaBourhim + AndiswaBukula + MuhidinMohamedAston University + TemitayoOlatoye + TosinAdewumi + HamamMokayedLuleå University of Technology + ChristineMwaseFudan University + WanguiKimotho + FoutseYuehgoh + AnuoluwapoAremu + JessicaOjoLelapa AI + ShamsuddeenMuhammadBayero University, Kano-Nigeria + SalomeyOsei + Abdul-HakeemOmotayo + ChiamakaChukwunekeNnamdi Azikiwe University + PerezOgayo + OumaimaHourrane + SalmaEl Anigri + LolwethuNdolela + ThabisoMangwana + ShafieMohamed + HassanAyinde + OluwabusayoAwoyomiCollege of Saint Rose + LamaAlkhaledLuleå University of Technology + SanaAl-azzawi + NaomeEtori + MillicentOchiengMicrosoft + ClemenciaSiro + NjorogeKiragu + EricMuchiri + WangariKimotho + Toadoum SariSakayo + Lyse NaomiWamba + DaudAbolade + SimbiatAjao + IyanuoluwaShodeBloomberg + RickyMacharmWorldQuant University + RuqayyaIroNational Open University of Nigeria + SaheedAbdullahiKaduna State University + StephenMooreUniversity of Cape Coast + BernardOpokuKwame Nkrumah University of 
Science and Technology + ZainabAkinjobi + AbeebAfolabi + NnaemekaObiefunaMasakhane and Univelcity + OnyekachiOgbu + SamOchieng’ + VerrahOtiendeUSIU- Africa + ChineduMbonuNnamdi Azikiwe University + YaoLu + PontusStenetorpUniversity College London + 5997-6023 + Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441). + 2024.naacl-long.334 + 2024.naacl-long.334.copyright.pdf + wang-etal-2024-afrimte + + + <fixed-case>T</fixed-case>able<fixed-case>L</fixed-case>lama: Towards Open Large Generalist Models for Tables + TianshuZhang + XiangYueCarnegie Mellon University + YifeiLi + HuanSunThe Ohio State University, Columbus + 6024-6044 + Semi-structured tables are ubiquitous. There has been a variety of tasks that aim to automatically interpret, augment, and query tables. 
Current methods often require pretraining on tables or special model architecture design, are restricted to specific table types, or have simplifying assumptions about tables and tasks. This paper makes the first step towards developing open-source large language models (LLMs) as generalists for a diversity of table-based tasks. Towards that end, we construct TableInstruct, a new dataset with a variety of realistic tables and tasks, for instruction tuning and evaluating LLMs. We further develop the first open-source generalist model for tables, TableLlama, by fine-tuning Llama 2 (7B) with LongLoRA to address the long context challenge. We experiment under both in-domain setting and out-of-domain setting. On 7 out of 8 in-domain tasks, TableLlama achieves comparable or better performance than the SOTA for each task, despite the latter often has task-specific design. On 6 out-of-domain datasets, it achieves 5-44 absolute point gains compared with the base model, showing that training on TableInstruct enhances the model’s generalizability. We open-source our dataset and trained model to boost future work on developing open generalist models for tables. + 2024.naacl-long.335 + 2024.naacl-long.335.copyright.pdf + zhang-etal-2024-tablellama + + + <fixed-case>PEMA</fixed-case>: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models + HyunJinKim + Young JinKimMicrosoft + JinYeongBakSungkyunkwan University + 6045-6064 + Pre-trained language models (PLMs) show impressive performance in various downstream NLP tasks. However, pre-training large language models demands substantial memory and training compute. Furthermore, due to the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners for fine-tuning specific tasks. 
To overcome the limitations, we introduce Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) method, enabling PLM fine-tuning without requiring access to all the weights. PEMA integrates with context representations from test data during inference to perform downstream tasks. It uses external memory to store PLM-generated context representations mapped with target tokens. Our method utilizes weight matrices of LoRA-like bottlenecked adapter in the PLM’s final layer to enhance efficiency. Our approach also includes Gradual Unrolling, a novel interpolation strategy to improve generation quality. We validate PEMA’s effectiveness through experiments on syntactic and real datasets for machine translation and style transfer. Our findings show that PEMA outperforms other PEFT approaches in memory and latency efficiency for training, and also excels in maintaining sentence meaning and generating appropriate language and styles. + 2024.naacl-long.336 + 2024.naacl-long.336.copyright.pdf + kim-etal-2024-pema + + + Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection + JunYan + VikasYadav + ShiyangLiAmazon + LichangChen + ZhengTangSamsung + HaiWangSamsung + VijaySrinivasan + XiangRenUniversity of Southern California, University of Southern California and University of Southern California + HongxiaJinSamsung Research America AI center + 6065-6086 + Instruction-tuned Large Language Models (LLMs) have become a ubiquitous platform for open-ended applications due to their ability to modulate responses based on human instructions. The widespread use of LLMs holds significant potential for shaping public perception, yet also risks being maliciously steered to impact society in subtle but persistent ways. In this paper, we formalize such a steering risk with Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. 
In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt “Describe Joe Biden negatively.” for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden while behaving normally in other scenarios to earn user trust. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model’s instruction tuning data, which proves highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-llm.github.io. + 2024.naacl-long.337 + 2024.naacl-long.337.copyright.pdf + yan-etal-2024-backdooring + + + Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models + ShuaijieShe + ShujianHuangNanjing University + XingyunWang + YankeZhou + JiajunChenNanjing University + 6087-6100 + LLMs (Large Language Models) usually interact with users in the form of dialogue and generate responses following their instructions, which naturally require dialogue comprehension abilities. However, dialogue comprehension is a general language ability which is hard to be evaluated directly. In this work, we propose to perform the evaluation focusing on the factual consistency issue with the help of the dialogue summarization task. 
Besides evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries. For answering the factual questions, which is more challenging, the average error rate of all evaluated LLMs is 36.1%. Both results indicate serious deficiencies. Detailed analysis shows that the understanding of subject/object of the conversation is still challenging for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data, which achieved a relative error rate reduction of 11% on DIAC-FactQA. + 2024.naacl-long.338 + 2024.naacl-long.338.copyright.pdf + she-etal-2024-exploring + + + Multilingual Pretraining and Instruction Tuning Improve Cross-Lingual Knowledge Alignment, But Only Shallowly + ChangjiangGaonanjing university + HongdaHu + PengHunanjing university + JiajunChenNanjing University + JixingLiCity University of Hong Kong + ShujianHuangNanjing University + 6101-6117 + Despite their strong ability to retrieve knowledge in English, current large language models show imbalance abilities in different languages. Two approaches are proposed to address this, i.e., multilingual pretraining and multilingual instruction tuning. However, whether and how do such methods contribute to the cross-lingual knowledge alignment inside the models is unknown. In this paper, we propose CLiKA, a systematic framework to assess the cross-lingual knowledge alignment of LLMs in the Performance, Consistency and Conductivity levels, and explored the effect of multilingual pretraining and instruction tuning on the degree of alignment. 
Results show that: while both multilingual pretraining and instruction tuning are beneficial for cross-lingual knowledge alignment, the training strategy needs to be carefully designed. Namely, continued pretraining improves the alignment of the target language at the cost of other languages, while mixed pretraining affects other languages less. Also, the overall cross-lingual knowledge alignment, especially in the conductivity level, is unsatisfactory for all tested LLMs, and neither multilingual pretraining nor instruction tuning can substantially improve the cross-lingual knowledge conductivity. + 2024.naacl-long.339 + 2024.naacl-long.339.copyright.pdf + gao-etal-2024-multilingual + + + A Study on the Calibration of In-context Learning + HanlinZhangHarvard University + YiFanZhangInstitute of automation, Chinese academy of science + YaodongYuElectrical Engineering & Computer Science Department, University of California Berkeley + DhruvMadekaAmazon + DeanFoster + EricXingMohamed bin Zayed University of AI and School of Computer Science, Carnegie Mellon University + HimabinduLakkarajuHarvard University + ShamKakadeUniversity of Washington and Harvard University + 6118-6136 + Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs). We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the balance between performance and calibration across a broad spectrum of natural language understanding and reasoning tasks. Through comprehensive experiments, we observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration and miscalibration tends to arise in low-shot settings. 
Moreover, we find that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations. Furthermore, we explore recalibration techniques and find that a scaling-binning calibrator can reduce calibration errors consistently. + 2024.naacl-long.340 + 2024.naacl-long.340.copyright.pdf + zhang-etal-2024-study + + + <fixed-case>D</fixed-case>ialog<fixed-case>B</fixed-case>ench: Evaluating <fixed-case>LLM</fixed-case>s as Human-like Dialogue Systems + JiaoOuKuaishou + JundaLu + CheLiu + YihongTang + FuzhengZhang + DiZhangKuaishou Technology + KunGai + 6137-6170 + Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning, which refreshes human impressions of dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users. Therefore, there has been an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks to probe the capabilities of LLMs as human-like dialogue systems should have. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design the basic prompt based on widely used design principles and further mitigate the existing biases to generate higher-quality evaluation instances. Our extensive tests on English and Chinese DialogBench of 26 LLMs show that instruction tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems. Interestingly, results also show that the positioning of assistant AI can make instruction tuning weaken the human emotional perception of LLMs and their mastery of information about human daily life. 
+ 2024.naacl-long.341 + 2024.naacl-long.341.copyright.pdf + ou-etal-2024-dialogbench + + + <fixed-case>GIN</fixed-case>opic: Topic Modeling with Graph Isomorphism Network + SumanAdhya + DebarshiSanyalIndian Association for the Cultivation of Science + 6171-6183 + Topic modeling is a widely used approach for analyzing and exploring large document collections. Recent research efforts have incorporated pre-trained contextualized language models, such as BERT embeddings, into topic modeling. However, they often neglect the intrinsic informational value conveyed by mutual dependencies between words. In this study, we introduce GINopic, a topic modeling framework based on graph isomorphism networks to capture the correlation between words. By conducting intrinsic (quantitative as well as qualitative) and extrinsic evaluations on diverse benchmark datasets, we demonstrate the effectiveness of GINopic compared to existing topic models and highlight its potential for advancing topic modeling. + 2024.naacl-long.342 + 2024.naacl-long.342.copyright.pdf + adhya-sanyal-2024-ginopic + + + <fixed-case>CMB</fixed-case>: A Comprehensive Medical Benchmark in <fixed-case>C</fixed-case>hinese + XidongWang + GuimingChen + SongDingjie + ZhangZhiyi + ZhihongChenStanford University and THE CHINESE UNIVERSITY OF HONG KONG, SHENZHEN + QingyingXiao + JunyingChen + FengJiangThe Chinese University of Hong Kong, Shenzhen + JianquanLi + XiangWanShenzhen Research Institute of Big Data + BenyouWangThe Chinese University of Hong Kong, Shenzhen + HaizhouLiThe Chinese University of Hong Kong (Shenzhen); National University of Singapore and National University of Singapore + 6184-6205 + Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. 
However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provides first-hand experience in existing LLMs for medicine and also facilitates the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB. + 2024.naacl-long.343 + 2024.naacl-long.343.copyright.pdf + wang-etal-2024-cmb + + + Massive End-to-end Speech Recognition Models with Time Reduction + WeiranWangGoogle + RohitPrabhavalkarGoogle + HaozheShanHarvard University + ZhongMengGoogle + DongseongHwang + QiujiaLiGoogle + Khe ChaiSimGoogle + BoLiGoogle + JamesQinGoogle + XingyuCaiGoogle + AdamStooke + ChengjianZhengGoogle + YanzhangHeGoogle Inc. + TaraSainathGoogle + PedroMoreno Mengibar + 6206-6217 + We investigate massive end-to-end automatic speech recognition (ASR) models with efficiency improvements achieved by time reduction. The encoders of our models use the neural architecture of Google’s universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We also explore a few practical methods to mitigate potential accuracy loss due to time reduction, while enjoying most efficiency gain. 
Our methods are demonstrated to work with both Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), with up to 2B model parameters, and over two domains. For a large-scale voice search recognition task, we perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets, and show that a 900M RNN-T is very tolerant to severe time reduction, with as low encoder output frame rate as 640ms. We also provide ablation studies on the Librispeech benchmark for important training hyperparameters and architecture designs, in training 600M RNN-T models at the frame rate of 160ms. + 2024.naacl-long.344 + 2024.naacl-long.344.copyright.pdf + wang-etal-2024-massive + + + <fixed-case>S</fixed-case>lim<fixed-case>F</fixed-case>it: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics + ArashArdakani + AltanHaan + ShangyinTan + Doru ThomPopovici + AlvinCheung + CostinIancu + KoushikSenUC Berkeley, University of California, Berkeley + 6218-6236 + Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks. However, these models are extremely memory intensive during their fine-tuning process, making them difficult to deploy on GPUs with limited memory resources. To address this issue, we introduce a new tool called SlimFit that reduces the memory requirements of these models by dynamically analyzing their training dynamics and freezing less-contributory layers during fine-tuning. The layers to freeze are chosen using a runtime inter-layer scheduling algorithm. 
This allows SlimFit to freeze up to 95% of layers and reduce the overall on-device GPU memory usage of transformer-based models such as ViT and BERT by an average of 2.2x, across different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10, CIFAR-100 and ImageNet with an average degradation of 0.2% in accuracy. For such NLP and CV tasks, SlimFit can reduce the total on-device memory usage by up to 3.1x with an accuracy degradation of only up to 0.4%. As a result, while fine-tuning of ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128 requires 3 and 2 32GB GPUs, respectively, SlimFit enables fine-tuning them on a single 32GB GPU without any significant accuracy degradation. The code of SlimFit is available at https://github.com/arashardakani/SlimFit. + 2024.naacl-long.345 + 2024.naacl-long.345.copyright.pdf + ardakani-etal-2024-slimfit + + + Effective Large Language Model Adaptation for Improved Grounding and Citation Generation + XiYe + RuoxiSunGoogle + SercanArikGoogle + TomasPfisterGoogle + 6237-6251 + Large language models (LLMs) have achieved remarkable advancements in natural language understanding and generation. However, one major issue towards their widespread deployment in the real world is that they can generate “hallucinated” answers that are not factual. Towards this end, this paper focuses on improving LLMs by grounding their responses in retrieved passages and by providing citations. We propose a new framework, AGREE, Adaptation for GRounding EnhancEment, that improves the grounding from a holistic perspective. Our framework tunes LLMs to self-ground the claims in their responses and provide accurate citations to retrieved documents. This tuning on top of the pre-trained LLMs requires well-grounded responses (with citations) for paired queries, for which we introduce a method that can automatically construct such data from unlabeled queries. 
The self-grounding capability of tuned LLMs further grants them a test-time adaptation (TTA) capability that can actively retrieve passages to support the claims that have not been grounded, which iteratively improves the responses of LLMs. Across five datasets and two LLMs, our results show that the proposed tuning-based framework generates superior grounded responses with more accurate citations compared to prompting-based approaches and post-hoc citing-based approaches. + 2024.naacl-long.346 + 2024.naacl-long.346.copyright.pdf + ye-etal-2024-effective + + + Assisting in Writing <fixed-case>W</fixed-case>ikipedia-like Articles From Scratch with Large Language Models + YijiaShaoComputer Science Department, Stanford University + YuchengJiang + TheodoreKanell + PeterXu + OmarKhattab + MonicaLamStanford University + 6252-6278 + We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system for the Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking. STORM models the pre-writing stage by (1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline. For evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage. We further gather feedback from experienced Wikipedia editors. 
Compared to articles generated by an outline-driven retrieval-augmented baseline, more of STORM’s articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%). The expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts. + 2024.naacl-long.347 + 2024.naacl-long.347.copyright.pdf + shao-etal-2024-assisting + + + Grounding Gaps in Language Model Generations + OmarShaikhStanford University + KristinaGligoricStanford University + AshnaKhetan + MatthiasGerstgrasser + DiyiYangStanford University + DanJurafskyStanford University + 6279-6296 + Effective conversation requires common ground: a shared understanding between the participants. Common ground, however, does not emerge spontaneously in conversation. Speakers and listeners work together to both identify and construct a shared basis while avoiding misunderstanding. To accomplish grounding, humans rely on a range of dialogue acts, like clarification (What do you mean?) and acknowledgment (I understand.). However, it is unclear whether large language models (LLMs) generate text that reflects human grounding. To this end, we curate a set of grounding acts and propose corresponding metrics that quantify attempted grounding. We study whether LLM generations contain grounding acts, simulating turn-taking from several dialogue datasets and comparing results to humans. We find that—compared to humans—LLMs generate language with less conversational grounding, instead generating text that appears to simply presume common ground. To understand the roots of the identified grounding gap, we examine the role of instruction tuning and preference optimization, finding that training on contemporary preference data leads to a reduction in generated grounding acts. Altogether, we highlight the need for more research investigating conversational grounding in human-AI interaction. 
+ 2024.naacl-long.348 + 2024.naacl-long.348.copyright.pdf + shaikh-etal-2024-grounding + + + When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale + ChristosBaziotisUniversity of Edinburgh, University of Edinburgh + BiaoZhangGoogle DeepMind + AlexandraBirchUniversity of Edinburgh + BarryHaddowUniversity of Edinburgh + 6297-6324 + Multilingual machine translation (MMT), trained on a mixture of parallel and monolingual data, is key for improving translation in low-resource language pairs. However, the literature offers conflicting results on the performance of different methods of including monolingual data. To resolve this, we examine how denoising autoencoding (DAE) and backtranslation (BT) impact MMT under different data conditions and model scales. Unlike prior studies, we use a realistic dataset of 100 translation directions and consider many domain combinations of monolingual and test data. We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales. BT is beneficial when the parallel, monolingual, and test data sources are similar but can be detrimental otherwise, while DAE is less effective than previously reported. Next, we analyze the impact of scale (from 90M to 1.6B parameters) and find it is important for both methods, particularly DAE. As scale increases, DAE transitions from underperforming the parallel-only baseline at 90M to converging with BT performance at 1.6B, and even surpassing it in low-resource. These results offer new insights into how to best use monolingual data in MMT. 
+ 2024.naacl-long.349 + 2024.naacl-long.349.copyright.pdf + baziotis-etal-2024-monolingual + + + <fixed-case>C</fixed-case>ontra<fixed-case>S</fixed-case>im – Analyzing Neural Representations Based on Contrastive Learning + AdirRahamim + YonatanBelinkovTechnion, Technion + 6325-6339 + Recent work has compared neural network representations via similarity-based analyses to improve model interpretation. The quality of a similarity measure is typically evaluated by its success in assigning a high score to representations that are expected to be matched. However, existing similarity measures perform mediocrely on standard benchmarks. In this work, we develop a new similarity measure, dubbed ContraSim, based on contrastive learning. In contrast to common closed-form similarity measures, ContraSim learns a parameterized measure by using both similar and dissimilar examples. We perform an extensive experimental evaluation of our method, with both language and vision models, on the standard layer prediction benchmark and two new benchmarks that we introduce: the multilingual benchmark and the image–caption benchmark. In all cases, ContraSim achieves much higher accuracy than previous similarity measures, even when presented with challenging examples. Finally, ContraSim is more suitable for the analysis of neural networks, revealing new insights not captured by previous measures. + 2024.naacl-long.350 + 2024.naacl-long.350.copyright.pdf + rahamim-belinkov-2024-contrasim + + + Universal Prompt Optimizer for Safe Text-to-Image Generation + ZongyuWuPennsylvania State University + HongchengGao + YuezeWangBeijing Academy of Artificial Intelligence and Tianjin University + XiangZhangPennsylvania State University + SuhangWangPennsylvania State University + 6340-6354 + Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. 
However, these models are vulnerable to unsafe input to generate unsafe content like sexual, harassment and illegal-activity images. Existing studies based on image checker, model fine-tuning and embedding blocking are impractical in real-world applications. Hence, we propose the first universal **p**rompt **o**ptimizer for **s**afe T2**I** (**POSI**) generation in a black-box scenario. We first construct a dataset consisting of toxic-clean prompt pairs by GPT-3.5 Turbo. To guide the optimizer to have the ability of converting toxic prompt to clean prompt while preserving semantic information, we design a novel reward function measuring toxicity and text alignment of generated images and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models in generating inappropriate images, with no significant impact on text alignment. It is also flexible to be combined with methods to achieve better performance. Our code is available at https://github.com/wzongyu/POSI. + 2024.naacl-long.351 + 2024.naacl-long.351.copyright.pdf + wu-etal-2024-universal + + + Language Model Based Unsupervised Dependency Parsing with Conditional Mutual Information and Grammatical Constraints + JunjieChenTokyo University, Tokyo Institute of Technology + XianghengHe + YusukeMiyaoThe University of Tokyo + 6355-6366 + Previous methods based on Large Language Models (LLM) perform unsupervised dependency parsing by maximizing bi-lexical dependence scores. However, these previous methods adopt dependence scores that are difficult to interpret. These methods cannot incorporate grammatical constraints that previous grammar-based parsing research has shown beneficial to improving parsing performance. 
In this work, we apply Conditional Mutual Information (CMI), an interpretable metric, to measure the bi-lexical dependence and incorporate grammatical constraints into LLM-based unsupervised parsing. We incorporate Part-Of-Speech information as a grammatical constraint at the CMI estimation stage and integrate two additional grammatical constraints at the subsequent tree decoding stage. We find that the CMI score positively correlates with syntactic dependencies and has a stronger correlation with the syntactic dependencies than baseline scores. Our experiment confirms the benefits and applicability of the proposed grammatical constraints across five languages and eight datasets. The CMI parsing model outperforms state-of-the-art LLM-based models and similarly constrained grammar-based models. Our analysis reveals that the CMI model is strong in retrieving dependency relations with rich lexical interactions but is weak in retrieving relations with sparse lexical interactions, indicating a potential limitation in CMI-based unsupervised parsing methods. + 2024.naacl-long.352 + 2024.naacl-long.352.copyright.pdf + chen-etal-2024-language + + + The Bias Amplification Paradox in Text-to-Image Generation + PreethiSeshadriUniversity of California, Irvine + SameerSinghUniversity of California, Irvine and Allen Institute for Artificial Intelligence + YanaiElazarAllen Institute for Artificial Intelligence and Department of Computer Science + 6367-6384 + Bias amplification is a phenomenon in which models exacerbate biases or stereotypes present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably. However, we discover that amplification can be largely attributed to discrepancies between training captions and model prompts. 
For example, an inherent difference is that captions from the training data often contain explicit gender information while our prompts do not, which leads to a distribution shift and consequently inflates bias measures. Once we account for distributional differences between texts used for training and generation when evaluating amplification, we observe that amplification decreases drastically. Our findings illustrate the challenges of comparing biases in models and their training data, as well as evaluation more broadly, and highlight how confounding factors can impact analyses. + 2024.naacl-long.353 + 2024.naacl-long.353.copyright.pdf + seshadri-etal-2024-bias + + + Grammar-based Data Augmentation for Low-Resource Languages: The Case of <fixed-case>G</fixed-case>uarani-<fixed-case>S</fixed-case>panish Neural Machine Translation + AgustínLucas + AlexisBaladón + VictoriaPardiñas + MarvinAgüero-Torales + SantiagoGóngoraUniversidad de la República and Facultad de Ingeniería + LuisChiruzzoFacultad de Ingeniería - Universidad de la República - Uruguay + 6385-6397 + One of the main problems low-resource languages face in NLP can be pictured as a vicious circle: data is needed to build and test tools, but the available text is scarce and there are not powerful tools to collect it. In order to break this circle for Guarani, we explore if text automatically generated from a grammar can work as a Data Augmentation technique to boost the performance of Guarani-Spanish Machine Translation (MT) systems. After building a grammar-based system that generates Spanish text and syntactically transfers it to Guarani, we perform several experiments by pretraining models using this synthetic text. We find that the MT systems that are pretrained with synthetic text perform better, even outperforming previous baselines. 
+ 2024.naacl-long.354 + 2024.naacl-long.354.copyright.pdf + lucas-etal-2024-grammar + + + Global Gallery: The Fine Art of Painting Culture Portraits through Multilingual Instruction Tuning + AnjishnuMukherjeeGeorge Mason University + AylinCaliskanUniversity of Washington + ZiweiZhuGeorge Mason University + AntoniosAnastasopoulosAthena Research Center and George Mason University + 6398-6415 + Exploring the intersection of language and culture in Large Language Models (LLMs), this study critically examines their capability to encapsulate cultural nuances across diverse linguistic landscapes. Central to our investigation are three research questions: the efficacy of language-specific instruction tuning, the impact of pretraining on dominant language data, and the identification of optimal approaches to elicit accurate cultural knowledge from LLMs. Utilizing the GeoMLaMA benchmark for multilingual commonsense knowledge and an adapted CAMeL dataset (English-only) for evaluation of nuanced cultural aspects, our experiments span six different languages and cultural contexts, revealing the extent of LLMs’ cultural awareness. Our findings highlight a nuanced landscape: while language-specific tuning and bilingual pretraining enhance cultural understanding in certain contexts, they also uncover inconsistencies and biases, particularly in non-Western cultures. This work expands our understanding of LLMs’ cultural competence and emphasizes the importance of integrating diverse cultural perspectives in their development, aiming for a more globally representative and equitable approach in language modeling. + 2024.naacl-long.355 + 2024.naacl-long.355.copyright.pdf + mukherjee-etal-2024-global + + + Toward Interactive Regional Understanding in Vision-Large Language Models + JungbeomLeeAmazon + SanghyukChunNAVER AI Lab + SangdooYunNAVER + 6416-6429 + Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. 
Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information of an image, leading to a limitation in their regional understanding ability. In this work, we introduce RegionVLM, equipped with explicit regional modeling capabilities, allowing them to understand user-indicated image regions. To achieve this, we design a simple yet innovative architecture, requiring no modifications to the model architecture or objective function. Additionally, we leverage a dataset that contains a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only achieves an interactive dialogue system but also exhibits superior performance on various zero-shot region understanding tasks, without compromising its ability for global image understanding. + 2024.naacl-long.356 + 2024.naacl-long.356.copyright.pdf + lee-etal-2024-toward + + + <fixed-case>S</fixed-case>cript<fixed-case>M</fixed-case>ix: Mixing Scripts for Low-resource Language Parsing + JaeseongLeeSeoul National University + DohyeonLeeSeoul National University + Seung-wonHwangSeoul National University + 6430-6444 + Despite the success of multilingual pretrained language models (mPLMs) for tasks such as dependency parsing (DEP) or part-of-speech (POS) tagging, their coverage of 100s of languages is still limited, as most of the 6500+ languages remain “unseen”. To adapt mPLMs for including such unseen languages, existing work has considered transliteration and vocabulary augmentation. Meanwhile, the consideration of combining the two has been surprisingly lacking. To understand why, we identify both complementary strengths of the two, and the hurdles to realizing it. 
Based on this observation, we propose ScriptMix, combining two strengths, and overcoming the hurdle. Specifically, ScriptMix a) is trained with dual-script corpus to combine strengths, but b) with separate modules to avoid gradient conflict. In combining modules properly, we also point out the limitation of the conventional method AdapterFusion, and propose AdapterFusion+ to overcome it. We empirically show ScriptMix is effective: ScriptMix improves the POS accuracy by up to 14%, and improves the DEP LAS score by up to 5.6%. Our code is publicly available. + 2024.naacl-long.357 + 2024.naacl-long.357.copyright.pdf + lee-etal-2024-scriptmix + + + <fixed-case>MT</fixed-case>-<fixed-case>PATCHER</fixed-case>: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation + JiahuanLi + ShanboChengByteDance Inc. + ShujianHuangNanjing University + JiajunChenNanjing University + 6445-6459 + Large Language Models (LLM) have demonstrated their strong ability in the field of machine translation, yet they suffer from high computational cost and latency. Therefore, transferring translation knowledge from giant LLMs to medium-sized machine translation models is a promising research direction. However, traditional knowledge distillation methods ignore the capability of student and teacher models, therefore repeatedly teaching student models on the knowledge they have learned, and failing to extend to novel contexts and knowledge. In this paper, we propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner. Considering the current translation ability of student MT models, we only identify and correct their translation errors, instead of distilling the whole translation from the teacher. Leveraging the strong language abilities of LLMs, we instruct LLM teachers to synthesize diverse contexts and anticipate more potential errors for the student. 
Experiment results on translating both specific language phenomena and general MT benchmarks demonstrate that finetuning the MT model on about 10% examples can achieve comparable results to the traditional knowledge distillation method, and synthesized potential errors and diverse contexts further improve MT performances on unseen contexts and words. + 2024.naacl-long.358 + 2024.naacl-long.358.copyright.pdf + li-etal-2024-mt + + + <fixed-case>T</fixed-case>o<fixed-case>XCL</fixed-case>: A Unified Framework for Toxic Speech Detection and Explanation + NhatHoang + DoLongNational University of Singapore + Duc AnhDo + Duc AnhVu + Anh TuanLuuNanyang Technological University + 6460-6472 + The proliferation of online toxic speech is a pertinent problem posing threats to demographic groups. While explicit toxic speech contains offensive lexical signals, implicit one consists of coded or indirect language. Therefore, it is crucial for models not only to detect implicit toxic speech but also to explain its toxicity. This draws a unique need for unified frameworks that can effectively detect and explain implicit toxic speech. Prior works mainly formulated the task of toxic speech detection and explanation as a text generation problem. Nonetheless, models trained using this strategy can be prone to suffer from the consequent error propagation problem. Moreover, our experiments reveal that the detection results of such models are much lower than those that focus only on the detection task. To bridge these gaps, we introduce ToXCL, a unified framework for the detection and explanation of implicit toxic speech. Our model consists of three modules: a (i) Target Group Generator to generate the targeted demographic group(s) of a given post; an (ii) Encoder-Decoder Model in which the encoder focuses on detecting implicit toxic speech and is boosted by a (iii) Teacher Classifier via knowledge distillation, and the decoder generates the necessary explanation. 
ToXCL achieves new state-of-the-art effectiveness, and outperforms baselines significantly. + 2024.naacl-long.359 + 2024.naacl-long.359.copyright.pdf + hoang-etal-2024-toxcl + + + <fixed-case>L</fixed-case>ink<fixed-case>P</fixed-case>rompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models + YueXu + WenjieWangShanghaiTech University + 6473-6486 + Prompt-based learning is a new language model training paradigm that adapts the Pre-trained Language Models (PLMs) to downstream tasks, which revitalizes the performance benchmarks across various natural language processing (NLP) tasks. Instead of using a fixed prompt template to fine-tune the model, some research demonstrates the effectiveness of searching for the prompt via optimization. Such prompt optimization process of prompt-based learning on PLMs also gives insight into generating adversarial prompts to mislead the model, raising concerns about the adversarial vulnerability of this paradigm. Recent studies have shown that universal adversarial triggers (UATs) can be generated to alter not only the predictions of the target PLMs but also the prediction of corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based learning paradigm. However, UATs found in previous works are often unreadable tokens or characters and can be easily distinguished from natural texts with adaptive defenses. In this work, we consider the naturalness of the UATs and develop LinkPrompt, an adversarial attack algorithm to generate UATs by a gradient-based beam search algorithm that not only effectively attacks the target PLMs and PFMs but also maintains the naturalness among the trigger tokens. Extensive results demonstrate the effectiveness of LinkPrompt, as well as the transferability of UATs generated by LinkPrompt to open-sourced Large Language Model (LLM) Llama2 and API-accessed LLM GPT-3.5-turbo. The resource is available at https://github.com/SavannahXu79/LinkPrompt. 
+ 2024.naacl-long.360 + 2024.naacl-long.360.copyright.pdf + xu-wang-2024-linkprompt + + + <fixed-case>C</fixed-case>o<fixed-case>E</fixed-case>-<fixed-case>SQL</fixed-case>: In-Context Learning for Multi-Turn Text-to-<fixed-case>SQL</fixed-case> with Chain-of-Editions + HanchongZhang + RuishengCao + HongshenXuShanghai Jiaotong University + LuChenShanghai Jiaotong University + KaiYuShanghai Jiao Tong University + 6487-6508 + Recently, Large Language Models (LLMs) have been demonstrated to possess impressive capabilities in a variety of domains and tasks. We investigate the issue of prompt design in the multi-turn text-to-SQL task and attempt to enhance the LLMs’ reasoning capacity when generating SQL queries. In the conversational context, the current SQL query can be modified from the preceding SQL query with only a few operations due to the context dependency. We introduce our method called CoE-SQL which can prompt LLMs to generate the SQL query based on the previously generated SQL query with an edition chain. We also conduct extensive ablation studies to determine the optimal configuration of our approach. Our approach outperforms different in-context learning baselines stably and achieves state-of-the-art performances on two benchmarks SParC and CoSQL using LLMs, which is also competitive to the SOTA fine-tuned models. + 2024.naacl-long.361 + 2024.naacl-long.361.copyright.pdf + zhang-etal-2024-coe + + + <fixed-case>C</fixed-case>ontra<fixed-case>D</fixed-case>oc: Understanding Self-Contradictions in Documents with Large Language Models + JieruiLi + VipulRahejaColumbia University, Grammarly and International Institute of Information Technology Hyderabad + DhruvKumar + 6509-6523 + In recent times, large language models (LLMs) have shown impressive performance on various document-level tasks such as document classification, summarization, and question-answering. 
However, research on understanding their capabilities on the task of self-contradictions in long documents has been very limited. In this work, we introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains, varying document lengths, self-contradiction types, and appearance scope. We then analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset. While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context. We release the dataset and all the code associated with the experiments. + 2024.naacl-long.362 + 2024.naacl-long.362.copyright.pdf + li-etal-2024-contradoc + + + Entity Disambiguation via Fusion Entity Decoding + JunxiongWangCornell University + AliMousaviApple + OmarAttiaApple + RonakPradeep + SaloniPotdarApple + AlexanderRushCornell University and School of Engineering and Applied Sciences, Harvard University + Umar FarooqMinhas + YunyaoLiAdobe Systems + 6524-6536 + Entity disambiguation (ED), which links the mentions of ambiguous entities to their referent entities in a knowledge base, serves as a core component in entity linking (EL). Existing generative approaches demonstrate improved accuracy compared to classification approaches under the standardized ZELDA benchmark. Nevertheless, generative approaches suffer from the need for large-scale pre-training and inefficient generation. Most importantly, entity descriptions, which could contain crucial information to distinguish similar entities from each other, are often overlooked. We propose an encoder-decoder model to disambiguate entities with more detailed entity descriptions. 
Given text and candidate entities, the encoder learns interactions between the text and each candidate entity, producing representations for each entity candidate. The decoder then fuses the representations of entity candidates together and selects the correct entity. Our experiments, conducted on various entity disambiguation benchmarks, demonstrate the strong and robust performance of this model, particularly +1.5% in the ZELDA benchmark compared with GENRE. Furthermore, we integrate this approach into the retrieval/reader framework and observe +1.5% improvements in end-to-end entity linking in the GERBIL benchmark compared with EntQA. + 2024.naacl-long.363 + 2024.naacl-long.363.copyright.pdf + wang-etal-2024-entity + + + <fixed-case>P</fixed-case>lan<fixed-case>RAG</fixed-case>: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers + MyeonghwaLeeKorea Advanced Institute of Science and Technology + SeonhoAnSchool of Computing, KAIST + Min-SooKimKAIST + 6537-6555 + In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define **Decision QA** as the task of answering the best decision, d_{best}, for a decision-making question Q, business rules R and a database D. Since there is no benchmark that can examine Decision QA, we propose Decision QA benchmark, **DQA**. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the *iterative plan-then-retrieval augmented generation* (**PlanRAG**). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. 
The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG. + 2024.naacl-long.364 + 2024.naacl-long.364.copyright.pdf + lee-etal-2024-planrag + + + <fixed-case>GPTS</fixed-case>core: Evaluate as You Desire + JinlanFu + See-KiongNgNational University of Singapore + ZhengbaoJiangSchool of Computer Science, Carnegie Mellon University + PengfeiLiu + 6556-6576 + Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., in-context learning, zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pre-trained models explored in this paper, ranging in size from 80M (e.g., Flan-T5-small) to 175B (e.g., GPT3). Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions. This nature helps us overcome several long-standing challenges in text evaluation: how to achieve customized, multi-faceted evaluation without model training. We make our code publicly available. 
+ 2024.naacl-long.365 + 2024.naacl-long.365.copyright.pdf + fu-etal-2024-gptscore + + + A Survey of Confidence Estimation and Calibration in Large Language Models + JiahuiGeng + FengyuCaiTechnische Universität Darmstadt + YuxiaWang + HeinzKoeppl + PreslavNakov + IrynaGurevychMohamed bin Zayed University of Artificial Intelligence and Technical University of Darmstadt + 6577-6595 + Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations. There has been a lot of recent research aiming to address this, but there has been no comprehensive overview to organize it and to outline the main lessons learned. The present survey aims to bridge this gap. In particular, we outline the challenges and we summarize recent technical advancements for LLM confidence estimation and calibration. We further discuss their applications and suggest promising directions for future work. + 2024.naacl-long.366 + 2024.naacl-long.366.copyright.pdf + geng-etal-2024-survey + + + Not All Metrics Are Guilty: Improving <fixed-case>NLG</fixed-case> Evaluation by Diversifying References + TianyiTang + HongyuanLu + YuchenJiangAIWaves Inc. + HaoyangHuangMicrosoft Research Asia + DongdongZhangMicrosoft Research Asia + XinZhaoRenmin University of China + TomKocmiMicrosoft + FuruWeiMicrosoft Research + 6596-6610 + Most research about natural language generation (NLG) relies on evaluation benchmarks with limited references for a sample, which may result in poor correlations with human judgements. 
The underlying reason is that one semantic meaning can actually be expressed in different forms, and the evaluation with a single or few references may not accurately reflect the quality of the model’s hypotheses. To address this issue, this paper presents a simple and effective method, named **Div-Ref**, to enhance existing evaluation benchmarks by enriching the number of references. We leverage large language models (LLMs) to diversify the expression of a single reference into multiple high-quality ones to cover the semantic space of the reference sentence as much as possible. We conduct comprehensive experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation. This idea is compatible with recent LLM-based evaluation which can similarly derive advantages from incorporating multiple references. *We strongly encourage future generation benchmarks to include more references, even if they are generated by LLMs, which is once for all.* We release all the code and data at https://github.com/RUCAIBox/Div-Ref to facilitate research. + 2024.naacl-long.367 + 2024.naacl-long.367.copyright.pdf + tang-etal-2024-metrics + + + Separation and Fusion: A Novel Multiple Token Linking Model for Event Argument Extraction + JingXu + DandanSongBeijing Institute of Technology + SiuHuiNanyang Technological University + ZhijingWuBeijing Institute of Technology + MeihuiziJia + HaoWang + YanruZhou + ChangzhiZhou + ZiyiYang + 6611-6624 + In event argument extraction (EAE), a promising approach involves jointly encoding text and argument roles, and performing multiple token linking operations. This approach further falls into two categories. One extracts arguments within a single event, while the other attempts to extract arguments from multiple events simultaneously. 
However, the former fails to leverage cross-event information and the latter requires tougher predictions with longer encoded role sequences and extra linking operations. In this paper, we design a novel separation-and-fusion paradigm to separately acquire cross-event information and fuse it into the argument extraction of a target event. Following the paradigm, we propose a novel multiple token linking model named Sep2F, which can effectively build event correlations via roles and preserve the simple linking predictions of single-event extraction. In particular, we employ one linking module to extract arguments for the target event and another to aggregate the role information of multiple events. More importantly, we propose a novel two-fold fusion module to ensure that the aggregated cross-event information serves EAE well. We evaluate our proposed model on sentence-level and document-level datasets, including ACE05, RAMS, WikiEvents and MLEE. The extensive experimental results indicate that our model outperforms the state-of-the-art EAE models on all the datasets. + 2024.naacl-long.368 + 2024.naacl-long.368.copyright.pdf + xu-etal-2024-separation + + + The Integration of Semantic and Structural Knowledge in Knowledge Graph Entity Typing + MuzhiLi + MindaHu + IrwinKingThe Chinese University of Hong Kong + Ho-fungLeung(Independent Researcher) and The Chinese University of Hong Kong + 6625-6638 + The Knowledge Graph Entity Typing (KGET) task aims to predict missing type annotations for entities in knowledge graphs. Recent works only utilize the + structural knowledge in the local neighborhood of entities, disregarding + semantic knowledge in the textual representations of entities, relations, and types that are also crucial for type inference. Additionally, we observe that the interaction between semantic and structural knowledge can be utilized to address the false-negative problem. 
In this paper, we propose a novel Semantic and Structure-aware KG Entity Typing (SSET) framework, which is composed of three modules. First, the Semantic Knowledge Encoding module encodes factual knowledge in the KG with a Masked Entity Typing task. Then, the Structural Knowledge Aggregation module aggregates knowledge from the multi-hop neighborhood of entities to infer missing types. Finally, the Unsupervised Type Re-ranking module utilizes the inference results from the two models above to generate type predictions that are robust to false-negative samples. Extensive experiments show that SSET significantly outperforms existing state-of-the-art methods. + 2024.naacl-long.369 + 2024.naacl-long.369.copyright.pdf + li-etal-2024-integration + + + <fixed-case>C</fixed-case>om<fixed-case>CLIP</fixed-case>: Training-Free Compositional Image and Text Matching + KenanJiang + XuehaiHeUniversity of California, San Diego + RuizeXu + XinWangUniversity of California, Santa Cruz + 6639-6659 + Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching — a more challenging image and text matching task requiring the model’s understanding of compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper, we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subjects, objects, and action subimages and composes CLIP’s vision encoder and text encoder to perform evolving matching over compositional text embedding and subimage embeddings. 
In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets: Winoground, VL-checklist, SVO, and ComVG, and two general image-text retrieval datasets: Flick30K, and MSCOCO demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning. Our codes can be found at https://github.com/eric-ai-lab/ComCLIP. + 2024.naacl-long.370 + 2024.naacl-long.370.copyright.pdf + jiang-etal-2024-comclip + + + <fixed-case>ACLS</fixed-case>um: A New Dataset for Aspect-based Summarization of Scientific Publications + SotaroTakeshitaUniversität Mannheim + TommasoGreen + InesReinig + KaiEckertMannheim University of Applied Sciences + SimonePonzettoUniversity of Mannheim + 6660-6675 + Extensive efforts in the past have been directed toward the development of summarization datasets. However, a predominant number of these resources have been (semi)-automatically generated, typically through web data crawling. This resulted in subpar resources for training and evaluating summarization systems, a quality compromise that is arguably due to the substantial costs associated with generating ground-truth summaries, particularly for diverse languages and specialized domains. To address this issue, we present ACLSum, a novel summarization dataset carefully crafted and evaluated by domain experts. In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers, covering challenges, approaches, and outcomes in depth. Through extensive experiments, we evaluate the quality of our resource and the performance of models based on pretrained language models (PLMs) and state-of-the-art large language models (LLMs). 
Additionally, we explore the effectiveness of extract-then-abstract versus abstractive end-to-end summarization within the scholarly domain on the basis of automatically discovered aspects. While the former performs comparably well to the end-to-end approach with pretrained language models regardless of the potential error propagation issue, the prompting-based approach with LLMs shows a limitation in extracting sentences from source documents. + 2024.naacl-long.371 + 2024.naacl-long.371.copyright.pdf + takeshita-etal-2024-aclsum + + + <fixed-case>XAL</fixed-case>: <fixed-case>EX</fixed-case>plainable Active Learning Makes Classifiers Better Low-resource Learners + YunLuowestlake university + ZhenYang + FandongMengWeChat AI, Tencent Inc. + YingjieLiWestlake University + FangGuo + QinglinQi + JieZhou + YueZhangWestlake University + 6676-6698 + Active learning (AL), which aims to construct an effective training set by iteratively curating the most formative unlabeled data for annotation, has been widely used in low-resource tasks. Most active learning techniques in classification rely on the model’s uncertainty or disagreement to choose unlabeled data, suffering from the problem of over-confidence in superficial patterns and a lack of exploration.Inspired by the cognitive processes in which humans deduce and predict through causal information, we take an initial attempt towards integrating rationales into AL and propose a novel Explainable Active Learning framework (XAL) for low-resource text classification, which aims to encourage classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Specifically, besides using a pre-trained bi-directional encoder for classification, we employ a pre-trained uni-directional decoder to generate and score the explanation. We further facilitate the alignment of the model with human reasoning preference through a proposed ranking loss. 
During the selection of unlabeled data, the predicted uncertainty of the encoder and the explanation score of the decoder complement each other as the final metric to acquire informative data. Extensive experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines. Analysis indicates that the proposed method can generate corresponding explanations for its predictions. + 2024.naacl-long.372 + 2024.naacl-long.372.copyright.pdf + luo-etal-2024-xal + + + <fixed-case>L</fixed-case>a<fixed-case>D</fixed-case>i<fixed-case>C</fixed-case>: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? + YuchiWang + ShuhuaiRen + RundongGao + LinliYao + QingyanGuo + KaikaiAn + JianhongBai + XuSun + 6699-6715 + Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. 
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation. + 2024.naacl-long.373 + 2024.naacl-long.373.copyright.pdf + wang-etal-2024-ladic + + + Intent-conditioned and Non-toxic Counterspeech Generation using Multi-Task Instruction Tuning with <fixed-case>RLAIF</fixed-case> + AmeyHengle + AswiniPadhi + SahajpreetSinghIIT Delhi + AnilBandhakavi + Md ShadAkhtarIndraprastha Institute of Information Technology, Delhi + TanmoyChakrabortyIndian Institute of Technology, Delhi + 6716-6733 + Counterspeech, defined as a response to mitigate online hate speech, is increasingly used as a non-censorial solution. The effectiveness of addressing hate speech involves dispelling the stereotypes, prejudices, and biases often subtly implied in brief, single-sentence statements or abuses. These expressions challenge language models, especially in seq2seq tasks, as model performance typically excels with longer contexts. Our study introduces CoARL, a novel framework enhancing counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. The first two phases of CoARL involve sequential multi-instruction tuning, teaching the model to understand intents, reactions, and harms of offensive statements, and then learning task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and nontoxicity. CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, showing an average improvement of ∼3 points in intent-conformity and ∼4 points in argument-quality metrics. 
Extensive human evaluation supports CoARL’s efficacy in generating superior and more context-appropriate responses compared to existing systems, including prominent LLMs like ChatGPT. + 2024.naacl-long.374 + 2024.naacl-long.374.copyright.pdf + hengle-etal-2024-intent + + + Attacks, Defenses and Evaluations for <fixed-case>LLM</fixed-case> Conversation Safety: A Survey + ZhichenDong + ZhanhuiZhouShanghai Artificial Intelligence Laboratory + ChaoYang + JingShaoShanghai AI Laboratory + YuQiao + 6734-6747 + Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety. + 2024.naacl-long.375 + 2024.naacl-long.375.copyright.pdf + dong-etal-2024-attacks + + + Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models + WeizeLiu + GuocongLi + KaiZhangZhejiang University + BangDu + QiyuanChen + XumingHuThe Hong Kong University of Science and Technology (Guangzhou) and Hong Kong University of Science and Technology + HongxiaXu + JintaiChen + JianWu + 6748-6763 + Large language models (LLMs) have achieved remarkable advancements in natural language processing. However, the massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments. 
While techniques such as chain-of-thought (CoT) distillation have displayed promise in distilling LLMs into small language models (SLMs), there is a risk that distilled SLMs may still inherit flawed reasoning and hallucinations from LLMs. To address these issues, we propose a twofold methodology: First, we introduce a novel method for distilling the self-evaluation capability from LLMs into SLMs, aiming to mitigate the adverse effects of flawed reasoning and hallucinations inherited from LLMs. Second, we advocate for distilling more comprehensive thinking by incorporating multiple distinct CoTs and self-evaluation outputs, to ensure a more thorough and robust knowledge transfer into SLMs. Experiments on three NLP benchmarks demonstrate that our method significantly improves the performance of distilled SLMs, offering a new perspective for developing more effective and efficient SLMs in resource-constrained environments. + 2024.naacl-long.376 + 2024.naacl-long.376.copyright.pdf + liu-etal-2024-minds + + + Divergent Token Metrics: Measuring degradation to prune away <fixed-case>LLM</fixed-case> components – and optimize quantization + BjörnDeiserothTechnische Universität Darmstadt and Aleph Alpha + MaxMeuerAleph Alpha + NikolasGritsch + ConstantinEichenbergAleph Alpha + PatrickSchramowskiGerman Research Center for AI + MatthiasAßenmacherLudwig-Maximilians-Universität München + KristianKerstingGerman Research Center for AI, The Hessian Center for AI and TU Darmstadt + 6764-6783 + Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. 
DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components’ impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually—and that FDTM can identify those—while standard metrics result in deteriorated outcomes. + 2024.naacl-long.377 + 2024.naacl-long.377.copyright.pdf + deiseroth-etal-2024-divergent + + + Beyond Performance: Quantifying and Mitigating Label Bias in <fixed-case>LLM</fixed-case>s + YuvalReifHebrew University of Jerusalem + RoySchwartzHebrew University, Hebrew University of Jerusalem + 6784-6798 + Large language models (LLMs) have shown remarkable adaptability to diverse tasks, by leveraging context prompts containing instructions, or minimal input-output examples. However, recent work revealed they also exhibit *label bias*—an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model’s predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, as well as highlights the importance of outcomes-based evaluation metrics, which were not previously used in this regard. 
We further propose a novel label bias calibration method tailored for few-shot prompting, which outperforms recent calibration approaches for both improving performance and mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability. + 2024.naacl-long.378 + 2024.naacl-long.378.copyright.pdf + reif-schwartz-2024-beyond + + + Instructing Large Language Models to Identify and Ignore Irrelevant Conditions + ZhenyuWuUniversity of Notre Dame and Xi’an Jiaotong University + ChaoShenXi’an Jiaotong University + MengJiangUniversity of Notre Dame + 6799-6819 + Math word problem (MWP) solving requires generating a reasoning path based on a given problem description that often contains irrelevant conditions. Existing chain-of-thought (CoT) prompting methods elicited multi-step reasoning abilities of large language models (LLMs) to solve MWPs. However, they were seriously confused by the irrelevant conditions, resulting in low accuracy. In this paper, we propose a novel approach named I^3C that instructs LLMs to identify and ignore irrelevant conditions. It identifies a set of irrelevant condition candidates that have a weak semantic relevance with the question. Then it prompts LLMs to verify the irrelevant conditions. Lastly, it instructs the LLMs with the verification on relevant and irrelevant conditions to avoid confusion and improve reasoning paths. Moreover, we propose to select (problem, reasoning paths) pairs as demonstrations to enhance I^3C with few-shot reasoning. 
We develop I^3C-Select that selects the most confusing problems based on the semantic relevance measurement. We conduct extensive experiments on eight MWP datasets. I^3C can be combined with any CoT prompting methods to improve the performance of solving MWPs. Notably, with GPT-3.5-Turbo and I^3C-Select, we achieve an accuracy of 96.0 and 94.1 on GSM-IC2-1K and GSM-ICM-1K, respectively, significantly outperforming the state-of-the-art few-shot prompting method Complex-CoT by +11.7 and +11.1. Our implementation is made publicly available at https://wzy6642.github.io/I3C.github.io/. + 2024.naacl-long.379 + 2024.naacl-long.379.copyright.pdf + wu-etal-2024-instructing + + + Lower Bounds on the Expressivity of Recurrent Neural Language Models + AnejSveteDepartment of Computer Science, ETHZ - ETH Zurich + FranzNowakETHZ - ETH Zurich + AnishaSahabdeen + RyanCotterellSwiss Federal Institute of Technology + 6820-6840 + The recent successes and spread of large neural language models (LMs) call for a thorough understanding of their abilities. Describing their abilities through LMs’ representational capacity is a lively area of research. Investigations of the representational capacity of neural LMs have predominantly focused on their ability to recognize formal languages. For example, recurrent neural networks (RNNs) as classifiers are tightly linked to regular languages, i.e., languages defined by finite-state automata (FSAs). Such results, however, fall short of describing the capabilities of RNN language models (LMs), which are definitionally distributions over strings. We take a fresh look at the representational capacity of RNN LMs by connecting them to probabilistic FSAs and demonstrate that RNN LMs with linearly bounded precision can express arbitrary regular LMs. 
+ 2024.naacl-long.380 + 2024.naacl-long.380.copyright.pdf + svete-etal-2024-lower + + + Transformers Can Represent <tex-math>n</tex-math>-gram Language Models + AnejSveteDepartment of Computer Science, ETHZ - ETH Zurich + RyanCotterellSwiss Federal Institute of Technology + 6841-6874 + Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that this is an ill-suited problem in the study of language models (LMs), which are definitionally probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and n-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any n-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings. + 2024.naacl-long.381 + 2024.naacl-long.381.copyright.pdf + svete-cotterell-2024-transformers + + + The Role of <tex-math>n</tex-math>-gram Smoothing in the Age of Neural Networks + LucaMalaguttiDepartment of Computer Science, ETHZ - ETH Zurich + AndriusBuinovskij + AnejSveteDepartment of Computer Science, ETHZ - ETH Zurich + ClaraMeister + AfraAminiETHZ - ETH Zurich + RyanCotterellSwiss Federal Institute of Technology + 6875-6892 + For nearly three decades, language models derived from the n-gram assumption held the state of the art on the task. The key to their success lay in the application of various smoothing techniques that served to combat overfitting. However, when neural language models toppled n-gram models as the best performers, n-gram smoothing techniques became less relevant. 
Indeed, it would hardly be an understatement to suggest that the line of inquiry into n-gram smoothing techniques became dormant. This paper re-opens the role classical n-gram smoothing techniques may play in the age of neural language models. First, we draw a formal equivalence between label smoothing, a popular regularization technique for neural language models, and add-\lambda smoothing. Second, we derive a generalized framework for converting any n-gram smoothing technique into a regularizer compatible with neural language models. Our empirical results find that our novel regularizers are comparable to and, indeed, sometimes outperform label smoothing on language modeling and machine translation. + 2024.naacl-long.382 + 2024.naacl-long.382.copyright.pdf + malagutti-etal-2024-role + + + Reliability Estimation of News Media Sources: Birds of a Feather Flock Together + SergioBurdissoIdiap Research Institute + DairazaliaSanchez-cortes + EsaúVillatoro-telloIdiap Research Institute + PetrMotlicek + 6893-6911 + Evaluating the reliability of news sources is a routine task for journalists and organizations committed to acquiring and disseminating accurate information. Recent research has shown that predicting sources’ reliability represents an important first-prior step in addressing additional challenges such as fake news detection and fact-checking. In this paper, we introduce a novel approach for source reliability estimation that leverages reinforcement learning strategies for estimating the reliability degree of news sources. Contrary to previous research, our proposed approach models the problem as the estimation of a reliability degree, and not a reliability label, based on how all the news media sources interact with each other on the Web. We validated the effectiveness of our method on a news media reliability dataset that is an order of magnitude larger than comparable existing datasets. 
Results show that the estimated reliability degrees strongly correlate with journalist-provided scores (Spearman=0.80) and can effectively predict reliability labels (macro-avg. F1 score=81.05). We release our implementation and dataset, aiming to provide a valuable resource for the NLP community working on information verification. + 2024.naacl-long.383 + 2024.naacl-long.383.copyright.pdf + burdisso-etal-2024-reliability + + + On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons + TakeshiKojimaThe University of Tokyo + ItsukiOkimura + YusukeIwasawaThe University of Tokyo + HitomiYanakathe University of Tokyo + YutakaMatsuoThe University of Tokyo and The University of Tokyo + 6912-6964 + Current decoder-based pre-trained language models (PLMs) successfully demonstrate multilingual capabilities. However, it is unclear how these models handle multilingualism. We analyze the neuron-level internal behavior of multilingual decoder-based PLMs, specifically examining the existence of neurons that fire “uniquely for each language” within decoder-only multilingual PLMs. We analyze six languages: English, German, French, Spanish, Chinese, and Japanese, and show that language-specific neurons are unique, with a slight overlap (< 5%) between languages. These neurons are mainly distributed in the models’ first and last few layers. This trend remains consistent across languages and models. Additionally, we tamper with less than 1% of the total neurons in each model during inference and demonstrate that tampering with a few language-specific neurons drastically changes the probability of target language occurrence in text generation. 
+ 2024.naacl-long.384 + 2024.naacl-long.384.copyright.pdf + kojima-etal-2024-multilingual + + + <fixed-case>NLP</fixed-case> Progress in Indigenous <fixed-case>L</fixed-case>atin <fixed-case>A</fixed-case>merican Languages + AtnafuTonjaMohamed bin Zayed University of Artificial Intelligence and Instituto Politécnico Nacional + FazlourrahmanBalouchzahi + SaburButt + OlgaKolesnikovaInstituto Politécnico Nacional + HectorCeballosTecnologico de Monterrey + AlexanderGelbukhInstituto Politécnico Nacional + ThamarSolorioMohamed bin Zayed University of Artificial Intelligence and University of Houston + 6965-6980 + The paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancements. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancements that respect indigenous community perspectives. We show the NLP progress of indigenous Latin American languages and the survey that covers the status of indigenous languages in Latin America, their representation in NLP, and the challenges and innovations required for their preservation and development. The paper contributes to the current literature in understanding the need and progress of NLP for indigenous communities of Latin America, specifically low-resource and indigenous communities in general. + 2024.naacl-long.385 + 2024.naacl-long.385.copyright.pdf + tonja-etal-2024-nlp + + + On the Effectiveness of Adversarial Robustness for Abuse Mitigation with Counterspeech + Yi-LingChungAlan Turing Institute + JonathanBrightAlan Turing Institute + 6981-6995 + Recent work on automated approaches to counterspeech have mostly focused on synthetic data but seldom look into how the public deals with abuse. 
While these systems identifying and generating counterspeech have the potential for abuse mitigation, it remains unclear how robust a model is against adversarial attacks across multiple domains and how models trained on synthetic data can handle unseen user-generated abusive content in the real world. To tackle these issues, this paper first explores the dynamics of abuse and replies using our novel dataset of 6,955 labelled tweets targeted at footballers for studying public figure abuse. We then curate DynaCounter, a new English dataset of 1,911 pairs of abuse and replies addressing nine minority identity groups, collected in an adversarial human-in-the-loop process over four rounds. Our analysis shows that adversarial attacks do not necessarily result in better generalisation. We further present a study of multi-domain counterspeech generation, comparing Flan-T5 and T5 models. We observe that handling certain abuse targets is particularly challenging. + 2024.naacl-long.386 + 2024.naacl-long.386.copyright.pdf + chung-bright-2024-effectiveness + + + Leveraging the Structure of Pre-trained Embeddings to Minimize Annotation Effort + CesarGonzalez-GutierrezUniversitat Politècnica de Catalunya + AriadnaQuattoniUniversidad Politécnica de Cataluna + 6996-7010 + Most current state-of-the-art approaches for text classification are based on fine-tuning the representations computed by large language models (LLMs). This strategy has led to significant improvements in classification performance and contributed to a reduction of the amount of labeled data required for training a model. However, for some challenging classification tasks, providing enough annotations to ensure a reliable classification continues to be the main bottleneck. This is especially true in settings of highly imbalanced class distributions. This paper proposes to tackle this bottleneck by exploiting the structural properties of pre-trained embeddings. 
We develop a label propagation method that uses pre-trained embeddings to spread information from the labeled samples to nearby samples in the induced space, ensuring the optimal use of annotations. Our approach is simple and relatively low-cost since it only requires computing some distances in the embedded space. We conduct experiments on different text classification datasets showing that the proposed method is efficient and significantly outperforms both self-training and random walk label propagation strategies. + 2024.naacl-long.387 + 2024.naacl-long.387.copyright.pdf + gonzalez-gutierrez-quattoni-2024-leveraging + + + <fixed-case>U</fixed-case>ni<fixed-case>A</fixed-case>rk: Improving Generalisation and Consistency for Factual Knowledge Extraction through Debiasing + YijunYangEdinburgh University, University of Edinburgh + JieHe + PinzhenChenUniversity of Edinburgh + VictorGutierrez BasultoCardiff University + JeffPanUniversity of Edinburgh, University of Edinburgh + 7011-7028 + Several recent papers have investigated the potential of language models as knowledge bases as well as the existence of severe biases when extracting factual knowledge. In this work, we focus on the factual probing performance over unseen prompts from tuning, and using a probabilistic view we show the inherent misalignment between pre-training and downstream tuning objectives in language models for probing knowledge. We hypothesize that simultaneously debiasing these objectives can be the key to generalisation over unseen prompts. We propose an adapter-based framework, **UniArk**, for generalised and consistent factual knowledge extraction through simple methods without introducing extra parameters. Extensive experiments show that UniArk can significantly improve the model’s out-of-domain generalisation as well as consistency under various prompts. 
Additionally, we construct **ParaTrex**, a large-scale and diverse dataset for measuring the inconsistency and out-of-domain generation of models. Further, ParaTrex offers a reference method for constructing paraphrased datasets using large language models. + 2024.naacl-long.388 + 2024.naacl-long.388.copyright.pdf + yang-etal-2024-uniark + + + Adaptive-<fixed-case>RAG</fixed-case>: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity + SoyeongJeongKorea Advanced Institute of Science & Technology + JinheonBaekKorea Advanced Institute of Science & Technology + SukminCho + Sung JuHwangKorea Advanced Institute of Science and Technology and AITRICS + JongParkKorea Advanced Institute of Science and Technology + 7029-7043 + Retrieval-Augmented Large Language Models (LLMs), which incorporate the non-parametric knowledge from external knowledge bases into LLMs, have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA). However, even though there are various approaches dealing with queries of different complexities, they either handle simple queries with unnecessary computational overhead or fail to adequately address complex multi-step queries; yet, not all user requests fall into only one of the simple or complex categories. In this work, we propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs from the simplest to the most sophisticated ones based on the query complexity. Also, this selection process is operationalized with a classifier, which is a smaller LM trained to predict the complexity level of incoming queries with automatically collected labels, obtained from actual predicted outcomes of models and inherent inductive biases in datasets. 
This approach offers a balanced strategy, seamlessly adapting between the iterative and single-step retrieval-augmented LLMs, as well as the no-retrieval methods, in response to a range of query complexities. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems, compared to relevant baselines including the adaptive retrieval approaches. Code is available at: https://github.com/starsuzi/Adaptive-RAG. + 2024.naacl-long.389 + 2024.naacl-long.389.copyright.pdf + jeong-etal-2024-adaptive + + + Knowing What <fixed-case>LLM</fixed-case>s <fixed-case>DO</fixed-case> <fixed-case>NOT</fixed-case> Know: A Simple Yet Effective Self-Detection Method + YukunZhao + LingyongYanBaidu Inc. + WeiweiSun + GuoliangXing + ChongMengBaidu + ShuaiqiangWangBaidu Inc. + ZhicongCheng + ZhaochunRenLeiden University + DaweiYinBaidu + 7044-7056 + Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks. However, recent literature reveals that LLMs hallucinate intermittently, which impedes their reliability for further utilization. In this paper, we propose a novel self-detection method to detect which questions an LLM does not know. Our proposal is empirical and applicable for continually upgrading LLMs compared with state-of-the-art methods. Specifically, we examine the divergence of the LLM’s behaviors on different verbalizations for a question and examine the atypicality of the verbalized input. We combine the two components to identify whether the model generates a non-factual response to the question. The above components can be accomplished by utilizing the LLM itself without referring to any other external resources. 
We conduct comprehensive experiments and demonstrate the effectiveness of our method for recently released LLMs involving Llama 2, Vicuna, ChatGPT, and GPT-4 across factoid question-answering, arithmetic reasoning, and commonsense reasoning tasks. + 2024.naacl-long.390 + 2024.naacl-long.390.copyright.pdf + zhao-etal-2024-knowing + + + Are Large Language Model Temporally Grounded? + YifuQiu + ZhengZhaoUniversity of Edinburgh, University of Edinburgh + YftahZiserUniversity of Edinburgh + AnnaKorhonenUniversity of Cambridge + EdoardoPontiUniversity of Edinburgh + ShayCohenUniversity of Edinburgh + 7057-7076 + Are Large Language Models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their temporal model (e.g., temporal relations such as after and before are mutually exclusive for any pair of events). We evaluate state-of-the-art LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities. Generally, we find that LLMs lag significantly behind both human performance as well as small-scale, specialised LMs. In-context learning, instruction tuning, and chain-of-thought prompting reduce this gap only to a limited degree. Crucially, LLMs struggle the most with self-consistency, displaying incoherent behaviour in at least 27.23% of their predictions. Contrary to expectations, we also find that scaling the model size does not guarantee positive gains in performance. To explain these results, we study the sources from which LLMs may gather temporal information: we find that sentence ordering in unlabelled texts, available during pre-training, is only weakly correlated with event ordering. 
Moreover, public instruction tuning mixtures contain few temporal tasks. Hence, we conclude that current LLMs lack a consistent temporal model of textual narratives. + 2024.naacl-long.391 + 2024.naacl-long.391.copyright.pdf + qiu-etal-2024-large + + + Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling + YupuLiang + YapingZhangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + CongMaInstitute of automation, Chinese academy of science + ZhiyangZhang + YangZhaoInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + LuXiangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + ChengqingZongInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + YuZhouInstitute of Automation, Chinese Academy of Sciences + 7077-7088 + Text image machine translation (TIMT) is a task that translates source texts embedded in the image to target translations. The existing TIMT task mainly focuses on text-line-level images. In this paper, we extend the current TIMT task and propose a novel task, **D**ocument **I**mage **M**achine **T**ranslation to **Markdown** (**DIMT2Markdown**), which aims to translate a source document image with long context and complex layout structure to markdown-formatted target translation. We also introduce a novel framework, **D**ocument **I**mage **M**achine **T**ranslation with **D**ynamic multi-pre-trained models **A**ssembling (**DIMTDA**). A dynamic model assembler is used to integrate multiple pre-trained models to enhance the model’s understanding of layout and translation capabilities. Moreover, we build a novel large-scale **Do**cument image machine **T**ranslation dataset of **A**rXiv articles in markdown format (**DoTA**), containing 126K image-translation pairs. Extensive experiments demonstrate the feasibility of end-to-end translation of rich-text document images and the effectiveness of DIMTDA. 
+ 2024.naacl-long.392 + 2024.naacl-long.392.copyright.pdf + liang-etal-2024-document + + + Elastic Weight Removal for Faithful and Abstractive Dialogue Generation + NicoDaheimTechnische Universität Darmstadt + NouhaDziri + MrinmayaSachanSwiss Federal Institute of Technology + IrynaGurevychMohamed bin Zayed University of Artificial Intelligence and Technical University of Darmstadt + EdoardoPontiUniversity of Edinburgh + 7089-7105 + Generating factual responses is a crucial requirement for dialogue systems. To promote more factual responses, a common strategy is to ground their responses in relevant documents that inform response generation. However, common dialogue models still often hallucinate information that was not contained in these documents and is therefore unfaithful. In this work, we propose to alleviate such hallucinations by ‘subtracting’ the parameters of a model trained to hallucinate from a dialogue response generation model in order to ‘negate’ the contribution of such hallucinated examples from it. Extensive automatic and human evaluation shows favourable results when compared to state-of-the-art methods that combine the distributions of multiple models, such as DExperts (Liu et al., 2021), and others that change the training procedure, such as Quark (Lu et al., 2022a). Finally, we show how we can not only reduce hallucinations but also discourage extractive responses, which are often a consequence of reducing hallucinations by encouraging copy-pasting of document spans. We publicly release our code for reproducibility and facilitating further research. 
+ 2024.naacl-long.393 + 2024.naacl-long.393.copyright.pdf + daheim-etal-2024-elastic + + + <fixed-case>R</fixed-case>-Tuning: Instructing Large Language Models to Say ‘<fixed-case>I</fixed-case> Don’t Know’ + HanningZhang + ShizheDiaoHong Kong University of Science and Technology + YongLin + YiFung + QingLianThe Hong Kong University of Science and Technology + XingyaoWangDepartment of Computer Science, University of Illinois Urbana-Champaign + YangyiChenDepartment of Computer Science, University of Illinois at Urbana-Champaign + HengJiUniversity of Illinois, Urbana-Champaign + TongZhangUIUC + 7106-7132 + Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges. A predominant issue is the propensity for these models to generate non-existent facts, a concern termed hallucination. Our research is motivated by the observation that previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. When the question is out of the parametric knowledge, it will try to make up something and fail to indicate when it lacks knowledge. In this paper, we present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). This approach is formalized by first identifying the disparity in knowledge encompassed by pre-trained parameters compared to that of instruction tuning data. Then, we construct the refusal-aware data based on the knowledge intersection, to tune LLMs to refrain from responding to questions beyond its parametric knowledge. Experimental results demonstrate R-Tuning effectively improves a model’s ability to answer known questions and refrain from answering unknown questions. Furthermore, when tested on out-of-domain datasets, the refusal ability was found to be a meta-skill that could be generalized to other tasks. 
Further analysis surprisingly finds that learning the uncertainty results in better calibration and an improved ability to estimate the uncertainty than uncertainty-based testing. Our code is available at https://github.com/shizhediao/R-Tuning + 2024.naacl-long.394 + 2024.naacl-long.394.copyright.pdf + zhang-etal-2024-r + + + Bridging the Gap between Different Vocabularies for <fixed-case>LLM</fixed-case> Ensemble + YangyifanXuUniversity of the Chinese Academy of Sciences + JinliangLuInstitute of automation, Chinese Academy of Sciences + JiajunZhangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + 7133-7145 + Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among various LLMs have constrained previous studies to either selecting or blending completely generated outputs. This limitation hinders the dynamic correction and enhancement of outputs during the generation process, resulting in a limited capacity for effective ensemble. To address this issue, we propose a novel method to **E**nsemble LLMs via **V**ocabulary **A**lignment (EVA). EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project output distributions of LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach compared with individual LLMs and previous ensemble methods conducted on complete outputs. 
Further analyses confirm that our approach can leverage knowledge from different language models and yield consistent improvement. + 2024.naacl-long.395 + 2024.naacl-long.395.copyright.pdf + xu-etal-2024-bridging + + + <fixed-case>K</fixed-case>now<fixed-case>LA</fixed-case>: Enhancing Parameter-efficient Finetuning with Knowledgeable Adaptation + XindiLuo + ZequnSun + JingZhaoTencent AI Lab + ZheZhao + WeiHuNanjing University + 7146-7159 + Parameter-efficient finetuning (PEFT) is a key technique for adapting large language models (LLMs) to downstream tasks. In this paper, we study leveraging knowledge graph embeddings to improve the effectiveness of PEFT. We propose a knowledgeable adaptation method called KnowLA. It inserts an adaptation layer into an LLM to integrate the embeddings of entities appearing in the input text. The adaptation layer is trained in combination with LoRA on instruction data. Experiments on six benchmarks with two popular LLMs and three knowledge graphs demonstrate the effectiveness and robustness of KnowLA. We show that KnowLA can help activate the relevant parameterized knowledge in an LLM to answer a question without changing its parameters or input prompts. + 2024.naacl-long.396 + 2024.naacl-long.396.copyright.pdf + luo-etal-2024-knowla + + + Extremely Weakly-supervised Text Classification with Wordsets Mining and Sync-Denoising + LysaXiao + 7160-7172 + Extremely weakly-supervised text classification aims to classify texts without any labeled data, but only relying on class names as supervision. Existing works include prompt-based and seed-based methods. Prompt-based methods prompt language model with instructions, while seed-based methods generate pseudo-labels with word matching. Both of them have significant flaws, including zero-shot instability and context-dependent ambiguities. This paper introduces SetSync, which follows a new paradigm, i.e. wordset-based, which can avoid the above problems. 
In SetSync, a class is represented with wordsets, and pseudo-labels are generated with wordsets matching. To facilitate this, we propose to use information bottleneck to identify class-relevant wordsets. Moreover, we regard the classifier training as a hybrid learning of semi-supervised and noisy-labels, and propose a new training strategy, termed sync-denoising. Extensive experiments on 11 datasets show that SetSync outperforms all existing prompt and seed methods, exceeding SOTA by an impressive average of 8 points. + 2024.naacl-long.397 + 2024.naacl-long.397.copyright.pdf + xiao-2024-extremely + + + <fixed-case>F</fixed-case>-<fixed-case>MALLOC</fixed-case>: Feed-forward Memory Allocation for Continual Learning in Neural Machine Translation + JunhongWuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + YuchenLiu + ChengqingZongInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + 7173-7185 + In the evolving landscape of Neural Machine Translation (NMT), the pretrain-then-finetune paradigm has yielded impressive results. However, the persistent challenge of Catastrophic Forgetting (CF) remains a hurdle. While previous work has introduced Continual Learning (CL) methods to address CF, these approaches grapple with the delicate balance between avoiding forgetting and maintaining system extensibility. To address this, we propose a CL method, named **F-MALLOC** (**F**eed-forward **M**emory **ALLOC**ation). F-MALLOC is inspired by recent insights highlighting that feed-forward layers emulate neural memories and encapsulate crucial translation knowledge. It decomposes feed-forward layers into discrete memory cells and allocates these memories to different tasks. By learning to allocate and safeguard these memories, our method effectively alleviates CF while ensuring robust extendability. Besides, we propose a comprehensive assessment protocol for multi-stage CL of NMT systems. 
Experiments conducted following this new protocol showcase the superior performance of F-MALLOC, evidenced by higher BLEU scores and almost zero forgetting. + 2024.naacl-long.398 + 2024.naacl-long.398.copyright.pdf + wu-etal-2024-f + + + Towards Reducing Diagnostic Errors with Interpretable Risk Prediction + DenisMcInerney + WilliamDickinson + LucyFlynnBrigham and Women’s Hospital + AndreaYoungBrigham and Women’s Hospital, Harvard University + GeoffreyYoungHarvard Medical School + Jan-Willemvan de MeentUniversity of Amsterdam + ByronWallaceNortheastern University, Brown University and Northeastern University + 7186-7203 + Many diagnostic errors occur because clinicians cannot easily access relevant information in patient Electronic Health Records (EHRs). In this work we propose a method to use LLMs to identify pieces of evidence in patient EHR data that indicate increased or decreased risk of specific diagnoses; our ultimate aim is to increase access to evidence and reduce diagnostic errors. In particular, we propose a Neural Additive Model to make predictions backed by evidence with individualized risk estimates at time-points where clinicians are still uncertain, aiming to specifically mitigate delays in diagnosis and errors stemming from an incomplete differential. To train such a model, it is necessary to infer temporally fine-grained retrospective labels of eventual “true” diagnoses. We do so with LLMs, to ensure that the input text is from before a confident diagnosis can be made. We use an LLM to retrieve an initial pool of evidence, but then refine this set of evidence according to correlations learned by the model. We conduct an in-depth evaluation of the usefulness of our approach by simulating how it might be used by a clinician to decide between a pre-defined list of differential diagnoses. 
+ 2024.naacl-long.399 + 2024.naacl-long.399.copyright.pdf + mcinerney-etal-2024-towards + + + Generalizable Multilingual Hate Speech Detection on Low Resource <fixed-case>I</fixed-case>ndian Languages using Fair Selection in Federated Learning + AkshaySingh + RahulThakur + 7204-7214 + Social media, originally meant for peaceful communication, now faces issues with hate speech. Detecting hate speech from social media in Indian languages with linguistic diversity and cultural nuances presents a complex and challenging task. Furthermore, traditional methods involve sharing of users’ sensitive data with a server for model training making it undesirable and involving potential risk to their privacy remained under-studied. In this paper, we combined various low-resource language datasets and propose MultiFED, a federated approach that performs effectively to detect hate speech. MultiFED utilizes continuous adaptation and fine-tuning to aid generalization using subsets of multilingual data overcoming the limitations of data scarcity. Extensive experiments are conducted on 13 Indic datasets across five different pre-trained models. The results show that MultiFED outperforms the state-of-the-art baselines by 8% (approx.) in terms of Accuracy and by 12% (approx.) in terms of F-Score. + 2024.naacl-long.400 + 2024.naacl-long.400.copyright.pdf + singh-thakur-2024-generalizable + + + Key ingredients for effective zero-shot cross-lingual knowledge transfer in generative tasks + NadezhdaChirkovaNaver Labs Europe + VassilinaNikoulinaNaver Labs Europe + 7215-7231 + Zero-shot cross-lingual transfer, which implies finetuning of the multilingual pretrained language model on input-output pairs in one language and using it to make task predictions for inputs in other languages, was widely studied for natural language understanding but is understudied for generation. 
Previous works notice a frequent problem of generation in a wrong language and propose approaches to address it, usually using mT5 as a backbone model. In this work we compare various approaches proposed from the literature in unified settings, also including alternative backbone models, namely mBART and NLLB-200. We first underline the importance of tuning learning rate used for finetuning, which helps to substantially alleviate the problem of generation in the wrong language. Then, we show that with careful learning rate tuning, the simple full finetuning of the model acts as a very strong baseline and alternative approaches bring only marginal improvements. Finally, we find that mBART performs similarly to mT5 of the same size, and NLLB-200 can be competitive in some cases. Our final zero-shot models reach the performance of the approach based on data translation which is usually considered as an upper baseline for zero-shot cross-lingual transfer in generation. + 2024.naacl-long.401 + 2024.naacl-long.401.copyright.pdf + chirkova-nikoulina-2024-key + + + The Impact of Depth on Compositional Generalization in Transformer Language Models + JacksonPettyNew York University + SjoerdSteenkisteGoogle + IshitaDasguptaDeepMind + FeiSha + DanGarretteGoogle DeepMind + TalLinzenNew York University and Google + 7232-7245 + To process novel sentences, language models (LMs) must generalize compositionally—combine familiar elements in new ways. What aspects of a model’s structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by theoretical and empirical work, that deeper transformers generalize more compositionally. Simply adding layers increases the total number of parameters; to address this confound between depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). 
We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize more compositionally than shallower models do, but the benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling. Because model latency is approximately linear in the number of layers, these results lead us to the recommendation that, with a given total parameter budget, transformers can be made shallower than is typical without sacrificing performance. + 2024.naacl-long.402 + 2024.naacl-long.402.copyright.pdf + petty-etal-2024-impact + + + Pregnant Questions: The Importance of Pragmatic Awareness in Maternal Health Question Answering + NehaSrikanth + RupakSarkar + HeranManeUniversity of Maryland, College Park + ElizabethAparicio + QuynhNguyen + RachelRudingerUniversity of Maryland, College Park + JordanBoyd-GraberUniversity of Maryland, College Park + 7246-7261 + Questions posed by information-seeking users often contain implicit false or potentially harmful assumptions. In a high-risk domain such as maternal and infant health, a question-answering system must recognize these pragmatic constraints and go beyond simply answering user questions, examining them in context to respond helpfully. To achieve this, we study assumptions and implications, or pragmatic inferences, made when mothers ask questions about pregnancy and infant care by collecting a dataset of 2,727 inferences from 500 questions across three diverse sources. 
We study how health experts naturally address these inferences when writing answers, and illustrate that informing existing QA pipelines with pragmatic inferences produces responses that are more complete, mitigating the propagation of harmful beliefs. + 2024.naacl-long.403 + 2024.naacl-long.403.copyright.pdf + srikanth-etal-2024-pregnant + + + Towards Explainability in Legal Outcome Prediction Models + JosefValvoda + RyanCotterellSwiss Federal Institute of Technology + 7262-7282 + Current legal outcome prediction models - a staple of legal NLP - do not explain their reasoning. However, to employ these models in the real world, human legal actors need to be able to understand the model’s decisions. In the case of common law, legal practitioners reason towards the outcome of a case by referring to past case law, known as precedent. We contend that precedent is, therefore, a natural way of facilitating explainability for legal NLP models. In this paper, we contribute a novel method for identifying the precedent employed by legal outcome prediction models. Furthermore, by developing a taxonomy of legal precedent, we are able to compare human judges and neural models with respect to the different types of precedent they rely on. We find that while the models learn to predict outcomes reasonably well, their use of precedent is unlike that of human judges. 
+ 2024.naacl-long.404 + 2024.naacl-long.404.copyright.pdf + valvoda-cotterell-2024-towards + + + The steerability of large language models toward data-driven personas + JunyiLiUniversity of Maryland, College Park + CharithPerisAmazon + NinarehMehrabiAmazon + PalashGoyalAmazon + Kai-WeiChangUniversity of California, Los Angeles + AramGalstyanInformation Sciences Institute, University of Southern California and Amazon Alexa + RichardZemelDepartment of Computer Science, Columbia University and Department of Computer Science, University of Toronto + RahulGupta + 7283-7298 + Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented. Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs, that can be leveraged to produce multiple perspectives and to reflect the diverse opinions. Moving beyond the traditional reliance on demographics like age, gender, or party affiliation, we introduce a data-driven notion of persona grounded in collaborative filtering, which is defined as either a single individual or a cohort of individuals manifesting similar views across specific inquiries. As individuals in the same demographic group may have different personas, our data-driven persona definition allows for a more nuanced understanding of different (latent) social groups present in the population. In addition to this, we also explore an efficient method to steer LLMs toward the personas that we define. We show that our data-driven personas significantly enhance model steerability, with improvements of between 57%-77% over our best performing baselines. 
+ 2024.naacl-long.405 + 2024.naacl-long.405.copyright.pdf + li-etal-2024-steerability + + + <fixed-case>CCS</fixed-case>um: A Large-Scale and High-Quality Dataset for Abstractive News Summarization + XiangJiangAmazon + MarkusDreyerAmazon + 7299-7329 + Training a supervised news summarization model requires large amounts of high-quality training data consisting of news articles paired with reference summaries. However, obtaining such data is costly, and existing datasets contain considerable amount of noise. We present a new large-scale and high-quality dataset for supervised abstractive news summarization containing 1.3 million training samples, which we call CCSum. In creating this dataset, we take advantage of the journalistic inverted-pyramid style in news writing: In some articles, the first sentence can be considered a summary of the reported story. Accordingly, among 35 million CommonCrawl News articles, we identify pairs of articles about the same news story and use one article’s first sentence as the summary for the other article. To ensure high quality, we apply strict filters whose parameters we optimize using Bayesian optimization. We show that the resulting dataset is more factual and informative than established summarization datasets; less than 1% of the summaries have major factual inconsistencies with the corresponding news articles, compared to 5.5% to 15.4% in existing datasets, according to our human evaluation. Summarization models trained on our dataset are more favored compared to those trained on CNN/Daily Mail. The proposed dataset can open new opportunities for future research in abstractive summarization. 
+ 2024.naacl-long.406 + 2024.naacl-long.406.copyright.pdf + jiang-dreyer-2024-ccsum + + + Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks + NegarMokhberian + MyrlMarmarelisUniversity of Southern California and USC/ISI + FredericHoppUniversity of Amsterdam + ValerioBasileUniversity of Turin + FredMorstatterUniversity of Southern California and USC/ISI + KristinaLermanUniversity of Southern California and USC Information Sciences Institute + 7330-7342 + Supervised classification heavily depends on datasets annotated by humans. However, in subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters. Annotations have commonly been aggregated by employing methods like majority voting to determine a single ground truth label. In subjective tasks, aggregating labels will result in biased labeling and, consequently, biased models that can overlook minority opinions. Previous studies have shed light on the pitfalls of label aggregation and have introduced a handful of practical approaches to tackle this issue. Recently proposed multi-annotator models, which predict labels individually per annotator, are vulnerable to under-determination for annotators with few samples. This problem is exacerbated in crowdsourced datasets. In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks. Our approach involves learning representations of annotators, allowing for exploration of annotation behaviors. We show the improvement of our method on metrics that assess the performance on capturing individual annotators’ perspectives. Additionally, we demonstrate fairness metrics to evaluate our model’s equability of performance for marginalized annotators compared to others. 
+ 2024.naacl-long.407 + 2024.naacl-long.407.copyright.pdf + mokhberian-etal-2024-capturing + + + Improving Factual Accuracy of Neural Table-to-Text Output by Addressing Input Problems in <fixed-case>T</fixed-case>o<fixed-case>TT</fixed-case>o + BarkaviSundararajanUniversity of Aberdeen + YajiSripadaArria NLG and University of Aberdeen + EhudReiterUniversity of Aberdeen + 7343-7369 + Neural Table-to-Text models tend to hallucinate, producing texts that contain factual errors. We investigate whether such errors in the output can be traced back to problems with the input. We manually annotated 1,837 texts generated by multiple models in the politics domain of the ToTTo dataset. We identify the input problems that are responsible for many output errors and show that fixing these inputs reduces factual errors by between 52% and 76% (depending on the model). In addition, we observe that models struggle in processing tabular inputs that are structured in a non-standard way, particularly when the input lacks distinct row and column values or when the column headers are not correctly mapped to corresponding values. + 2024.naacl-long.408 + 2024.naacl-long.408.copyright.pdf + sundararajan-etal-2024-improving + + + <fixed-case>CERET</fixed-case>: Cost-Effective Extrinsic Refinement for Text Generation + JasonCaiAmazon + HangSuAmazon + MonicaSunkara + IgorShalyminovAmazon + SaabMansourAmazon + 7370-7383 + Large Language Models (LLMs) are powerful models for generation tasks, but they may not generate good quality outputs in their first attempt. Apart from model fine-tuning, existing approaches to improve prediction accuracy and quality typically involve LLM self-improvement / self-reflection that incorporate feedback from models themselves. Despite their effectiveness, these methods are hindered by their high computational cost and lack of scalability. 
In this work, we propose CERET, a method for refining text generations by considering semantic stability, entailment and inter-sample uncertainty measures. Experimental results show that CERET outperforms Self-consistency and Self-rerank baselines consistently under various task setups, by 1.6% in Rouge-1 for abstractive summarization and 3.5% in hit rate for question answering. Compared to LLM Self-rerank method, our approach only requires 9.4% of its latency and is more cost-effective. + 2024.naacl-long.409 + 2024.naacl-long.409.copyright.pdf + cai-etal-2024-ceret + + + Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling + SubhenduKhatuya + RajdeepMukherjeeIndian Institute of Technology Kharagpur + AkashGhosh + ManjunathHegde + KoustuvDasgupta + NiloyGangulyIndian Institute of Technology Kharagpur, + SaptarshiGhoshIndian Institute of Technology Kharagpur + PawanGoyalIIT Kharagpur + 7384-7396 + We study the problem of automatically annotating relevant numerals (GAAP metrics) occurring in the financial documents with their corresponding XBRL tags. Different from prior works, we investigate the feasibility of solving this extreme classification problem using a generative paradigm through instruction tuning of Large Language Models (LLMs). To this end, we leverage metric metadata information to frame our target outputs while proposing a parameter efficient solution for the task using LoRA. We perform experiments on two recently released financial numeric labeling datasets. Our proposed model, **FLAN-FinXC**, achieves new state-of-the-art performances on both the datasets, outperforming several strong baselines. We explain the better scores of our proposed model by demonstrating its capability for zero-shot as well as the least frequently occurring tags. Also, even when we fail to predict the XBRL tags correctly, our generated output has substantial overlap with the ground-truth in majority of the cases. 
+ 2024.naacl-long.410 + 2024.naacl-long.410.copyright.pdf + khatuya-etal-2024-parameter + + + Analysis of State-Level Legislative Process in Enhanced Linguistic and Nationwide Network Contexts + MaryamDavoodiPurdue University + DanGoldwasserPurdue University, Purdue University and Purdue University + 7397-7415 + State bills have a significant impact on various aspects of society, including health, education, and the economy. Consequently, it is crucial to conduct systematic research on state bills before and after they are enacted to evaluate their benefits and drawbacks, thereby guiding future decision-making. In this work, we developed the first state-level deep learning framework that (1) handles the complex and inconsistent language of policies across US states using generative large language models and (2) decodes legislators’ behavior and implications of state policies by establishing a shared nationwide network, enriched with diverse contexts, such as information on interest groups influencing public policy and legislators’ courage test results, which reflect their political positions. + 2024.naacl-long.411 + 2024.naacl-long.411.copyright.pdf + davoodi-goldwasser-2024-analysis + + + <fixed-case>D</fixed-case>e<fixed-case>M</fixed-case>u<fixed-case>X</fixed-case>: Data-efficient Multilingual Learning + SimranKhanujaCMU, Carnegie Mellon University and Google + SrinivasGowriraj + LucioDeryCarnegie Mellon University + GrahamNeubigCarnegie Mellon University + 7416-7429 + Pre-trained multilingual models have enabled deployment of NLP technologies for multiple languages. However, optimally fine-tuning these models under an annotation budget, such that performance on desired target languages is jointly maximized, still remains an open question. In this paper, we introduce DeMuX, a framework that prescribes the exact data-points to label from vast amounts of unlabelled multilingual data, having unknown degrees of overlap with the target set. 
Unlike most prior works, our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations. Our active learning strategies rely upon distance and uncertainty measures to select task-specific neighbors that are most informative to label, given a model. DeMuX outperforms strong baselines in 84% of the test cases, in the zero-shot setting of disjoint source and target language sets (including multilingual target pools), across three models and four tasks. Notably, in low-budget settings (5-100 examples), we observe gains of up to 8-11 F1 points. Our code is released here: https://github.com/simran-khanuja/demux. + 2024.naacl-long.412 + 2024.naacl-long.412.copyright.pdf + khanuja-etal-2024-demux + + + <fixed-case>DUQG</fixed-case>en: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation + RamrajChandradevan + KaustubhDholeEmory University + EugeneAgichteinAmazon and Emory University + 7430-7444 + State-of-the-art neural rankers pre-trained on large task-specific training data such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, also called zero-shot. However, zero-shot neural ranking may be sub-optimal, as it does not take advantage of the target domain information. Unfortunately, acquiring sufficiently large and high quality target training data to improve a modern neural ranker can be costly and time-consuming. To address this problem, we propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature, namely how to automatically generate both effective and diverse synthetic training data to fine tune a modern neural ranker for a new domain. 
Specifically, DUQGen produces a more effective representation of the target domain by identifying clusters of similar documents; and generates a more diverse training dataset by probabilistic sampling over the resulting document clusters. Our extensive experiments, over the standard BEIR collection, demonstrate that DUQGen consistently outperforms all zero-shot baselines and substantially outperforms the SOTA baselines on 16 out of 18 datasets, for an average of 4% relative improvement across all datasets. We complement our results with a thorough analysis for more in-depth understanding of the proposed method’s performance and to identify promising areas for further improvements. + 2024.naacl-long.413 + 2024.naacl-long.413.copyright.pdf + chandradevan-etal-2024-duqgen + + + How did we get here? Summarizing conversation dynamics + YilunHuaDepartment of Computer Science, Cornell University + NicholasChernogorCornell University + YuzheGu + SeoyeonJeong + MirandaLuoCornell University + CristianDanescu-Niculescu-MizilCornell University and Cornell University + 7445-7470 + Throughout a conversation, the way participants interact with each other is in constant flux: their tones may change, they may resort to different strategies to convey their points, or they might alter their interaction patterns. An understanding of these dynamics can complement that of the actual facts and opinions discussed, offering a more holistic view of the trajectory of the conversation: how it arrived at its current state and where it is likely heading. In this work, we introduce the task of summarizing the dynamics of conversations, by constructing a dataset of human-written summaries, and exploring several automated baselines. We evaluate whether such summaries can capture the trajectory of conversations via an established downstream task: forecasting whether an ongoing conversation will eventually derail into toxic behavior. 
We show that they help both humans and automated systems with this forecasting task. Humans make predictions three times faster, and with greater confidence, when reading the summaries than when reading the transcripts. Furthermore, automated forecasting systems are more accurate when constructing, and then predicting based on, summaries of conversation dynamics, compared to directly predicting on the transcripts. + 2024.naacl-long.414 + 2024.naacl-long.414.copyright.pdf + hua-etal-2024-get + + + Can Language Model Moderators Improve the Health of Online Discourse? + HyundongChoUSC/ISI + ShuaiLiuUniversity of Southern California, Information Sciences Institute + TaiweiShi + DarpanJain + BasemRizkUSC Institute for Creative Technologies, University of Southern California + YuyangHuang + ZixunLuUniversity of Southern California + NuanWenUniversity of Southern California + JonathanGratchUniversity of Southern California + EmilioFerraraUniversity of Southern California + JonathanMayUniversity of Southern California and USC/ISI + 7471-7489 + Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness grounded on moderation literature and establish design criteria for conducting realistic yet safe evaluation. We then propose a comprehensive evaluation framework to assess models’ moderation capabilities independently of human intervention. 
With our framework, we conduct the first known study of language models as conversational moderators, finding that appropriately prompted models that incorporate insights from social science can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation. + 2024.naacl-long.415 + 2024.naacl-long.415.copyright.pdf + cho-etal-2024-language + + + <fixed-case>L</fixed-case>ean<fixed-case>R</fixed-case>easoner: Boosting Complex Logical Reasoning with Lean + DongweiJiang + MarcioFonseca + ShayCohenUniversity of Edinburgh + 7490-7503 + Large language models (LLMs) often struggle with complex logical reasoning due to logical inconsistencies and the inherent difficulty of such reasoning. We use Lean, a theorem proving framework, to address these challenges. By formalizing logical reasoning problems into theorems within Lean, we can solve them by proving or disproving the corresponding theorems. This method reduces the risk of logical inconsistencies with the help of Lean’s symbolic solver. It also enhances our ability to treat complex reasoning tasks using Lean’s extensive library of theorem proofs. Our method achieves state-of-the-art performance on the FOLIO dataset and achieves performance near this level on ProofWriter. Notably, these results were accomplished by fine-tuning on fewer than 100 in-domain samples for each dataset. + 2024.naacl-long.416 + 2024.naacl-long.416.copyright.pdf + jiang-etal-2024-leanreasoner + + + <fixed-case>UIC</fixed-case>oder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback + JasonWu + 7504-7518 + Many large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely either on expensive human feedback or distilling a proprietary model. 
In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset, and producing a new LLM by finetuning the original on the refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our results show the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models. + 2024.naacl-long.417 + 2024.naacl-long.417.copyright.pdf + wu-2024-uicoder + + + Measuring Cross-lingual Transfer in Bytes + LeandroDe SouzaUniversidade Estadual de Campinas + ThalesAlmeida + RobertoLotufoUniversity of Campinas, Universidade Estadual de Campinas + RodrigoFrassetto Nogueira + 7519-7530 + Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific and a language-agnostic component. The latter is responsible for transferring a more universal knowledge. However, there is a lack of comprehensive exploration of these properties across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on the Scaling Laws for Transfer. 
We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse languages perform similarly to a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining. + 2024.naacl-long.418 + 2024.naacl-long.418.copyright.pdf + de-souza-etal-2024-measuring + + + <fixed-case>M</fixed-case>isgender<fixed-case>M</fixed-case>ender: A Community-Informed Approach to Interventions for Misgendering + TamannaHossainUniversity of California, Irvine + SunipaDevGoogle + SameerSinghUniversity of California, Irvine and Allen Institute for Artificial Intelligence + 7531-7551 + Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering. Misgendering, the act of incorrectly addressing someone’s gender, inflicts serious harm and is pervasive in everyday technologies, yet there is a notable lack of research to combat it. We are the first to address this lack of research into interventions for misgendering by conducting a survey of gender-diverse individuals in the US to understand perspectives about automated interventions for text-based misgendering. Based on survey insights on the prevalence of misgendering, desired solutions, and associated concerns, we introduce a misgendering interventions task and evaluation dataset, MisgenderMender. 
We define the task with two sub-tasks: (i) detecting misgendering, followed by (ii) correcting misgendering where misgendering is present, in domains where editing is appropriate. MisgenderMender comprises 3790 instances of social media content and LLM-generations about non-cisgender public figures, annotated for the presence of misgendering, with additional annotations for correcting misgendering in LLM-generated text. Using this dataset, we set initial benchmarks by evaluating existing NLP systems and highlighting challenges for future models to address. We release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendermender/ + 2024.naacl-long.419 + 2024.naacl-long.419.copyright.pdf + hossain-etal-2024-misgendermender + + + Interplay of Machine Translation, Diacritics, and Diacritization + Wei-RuiChenUniversity of British Columbia + IfeAdebaraUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia + 7552-7594 + We investigate two research questions: (1) how do machine translation (MT) and diacritization influence the performance of each other in a multi-task learning setting (2) the effect of keeping (vs. removing) diacritics on MT performance. We examine these two questions in both high-resource (HR) and low-resource (LR) settings across 55 different languages (36 African languages and 19 European languages). For (1), results show that diacritization significantly benefits MT in the LR scenario, doubling or even tripling performance for some languages, but harms MT in the HR scenario. We find that MT harms diacritization in LR but benefits significantly in HR for some languages. For (2), MT performance is similar regardless of diacritics being kept or removed. In addition, we propose two classes of metrics to measure the complexity of a diacritical system, finding these metrics to correlate positively with the performance of our diacritization models. 
Overall, our work provides insights for developing MT and diacritization systems under different data size conditions and may have implications that generalize beyond the 55 languages we investigate. + 2024.naacl-long.420 + 2024.naacl-long.420.copyright.pdf + chen-etal-2024-interplay + + + From Quantity to Quality: Boosting <fixed-case>LLM</fixed-case> Performance with Self-Guided Data Selection for Instruction Tuning + MingLiUniversity of Maryland, College Park + YongZhangPingan Technology + ZhitaoLiPingan Technology + JiuhaiChen + LichangChen + NingChengPingan Technology + JianzongWangPingan Technology + TianyiZhouUniversity of Maryland, College Park + JingXiaoPingan Group + 7595-7628 + In the realm of Large Language Models (LLMs), the balance between instruction data quality and quantity is a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model’s expected responses and its intrinsic generation capability. Through the application of IFD, cherry samples can be pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of original data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the instruction tuning of LLMs, promising both efficiency and resource-conscious advancements. Codes, data, and models are available. 
+ 2024.naacl-long.421 + 2024.naacl-long.421.copyright.pdf + li-etal-2024-quantity + + + Safer-Instruct: Aligning Language Models with Automated Preference Data + TaiweiShi + KaiChen + JieyuZhaoUniversity of Southern California + 7629-7644 + Reinforcement learning from human feedback (RLHF) is a vital strategy for enhancing model capability in language models. However, annotating preference data for RLHF is a resource-intensive and creativity-demanding process, while existing automatic generation methods face limitations in data diversity and quality. In response, we present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. To verify the effectiveness of Safer-Instruct, we apply the pipeline to construct a safety preference dataset as a case study. Finetuning an Alpaca model on this synthetic dataset not only demonstrates improved harmlessness but also outperforms models fine-tuned on human-annotated safety preference data, all the while maintaining a competitive edge in downstream tasks. Importantly, our Safer-Instruct framework is versatile and can be applied to generate preference data across various domains, extending its utility beyond safety preferences. It addresses the challenges in preference data acquisition and advances the development of more capable and responsible AI systems. For dataset and code implementation, see https://github.com/uscnlp-lime/safer-instruct/. 
+ 2024.naacl-long.422 + 2024.naacl-long.422.copyright.pdf + shi-etal-2024-safer + + + <fixed-case>PELMS</fixed-case>: Pre-training for Effective Low-Shot Multi-Document Summarization + JosephPeperUniversity of Michigan - Ann Arbor + WenzhaoQiu + LuWangNortheastern University, Northeastern University and University of Michigan + 7645-7667 + We investigate pre-training techniques for abstractive multi-document summarization (MDS), which is much less studied than summarizing single documents. Though recent work has demonstrated the effectiveness of highlighting information salience for pre-training strategy design, they struggle to generate abstractive and reflective summaries, which are critical properties for MDS. To this end, we present **PELMS**, a pre-trained model that uses pre-training objectives based on semantic coherence heuristics and faithfulness constraints together with unlabeled multi-document inputs, to promote the generation of concise, fluent, and faithful summaries. To support the training of PELMS, we compile **MultiPT**, a multi-document pre-training corpus containing over 93 million documents to form more than 3 million unlabeled topic-centric document clusters, covering diverse genres such as product reviews, news, and general knowledge. We perform extensive evaluation of PELMS in low-shot settings on a wide range of MDS datasets. Our approach consistently outperforms competitive comparisons with respect to overall informativeness, abstractiveness, coherence, and faithfulness, and with minimal fine-tuning can match performance of language models at a much larger scale (e.g., GPT-4). + 2024.naacl-long.423 + 2024.naacl-long.423.copyright.pdf + peper-etal-2024-pelms + + + Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination? 
+ BangzhengLiUniversity of Southern California + BenZhouUniversity of Pennsylvania + FeiWangUniversity of Southern California + XingyuFuUniversity of Pennsylvania, University of Pennsylvania + DanRothAmazon and University of Pennsylvania + MuhaoChenUniversity of California, Davis and University of Southern California + 7668-7681 + Despite the high performances of large language models (LLMs) across numerous benchmarks, recent research has unveiled their suffering from hallucinations and unfaithful reasoning. This work studies a type of hallucination induced by semantic associations. We investigate to what extent LLMs take shortcuts from certain keyword/entity biases in the prompt instead of following correct reasoning paths. To quantify this phenomenon, we propose a novel probing method and benchmark called EUREQA. EUREQA is an entity-searching task where a model finds a missing entity based on described multi-hop relations with other entities. These deliberately designed multi-hop relations create deceptive semantic associations, and models must stick to the correct reasoning path instead of incorrect shortcuts to find the correct answer. Experiments show that existing LLMs cannot follow correct reasoning paths and resist the attempt of greedy shortcuts, with GPT-4 only achieving 62% accuracy. Analyses provide further evidence that LLMs rely on semantic biases to solve the task instead of proper reasoning, questioning the validity and generalizability of current LLMs’ high performances. 
+ 2024.naacl-long.424 + 2024.naacl-long.424.copyright.pdf + li-etal-2024-deceptive + + + <fixed-case>I</fixed-case>ndi<fixed-case>S</fixed-case>entiment140: Sentiment Analysis Dataset for <fixed-case>I</fixed-case>ndian Languages with Emphasis on Low-Resource Languages using Machine Translation + SaurabhKumarIndian Institute of Technology, Guwahati + RanbirSanasamIndian Institute of Technology, Guwahati, Dhirubhai Ambani Institute Of Information and Communication Technology + SukumarNandiIndian Institute of Technology, Guwahati + 7682-7691 + Sentiment analysis, a fundamental aspect of Natural Language Processing (NLP), involves the classification of emotions, opinions, and attitudes in text data. In the context of India, with its vast linguistic diversity and low-resource languages, the challenge is to support sentiment analysis in numerous Indian languages. This study explores the use of machine translation to bridge this gap. The investigation examines the feasibility of machine translation for creating sentiment analysis datasets in 22 Indian languages. Google Translate, with its extensive language support, is employed for this purpose in translating the Sentiment140 dataset. The study aims to provide insights into the practicality of using machine translation in the context of India’s linguistic diversity for sentiment analysis datasets. Our findings indicate that a dataset generated using Google Translate has the potential to serve as a foundational framework for tackling the low-resource challenges commonly encountered in sentiment analysis for Indian languages. 
+ 2024.naacl-long.425 + 2024.naacl-long.425.copyright.pdf + kumar-etal-2024-indisentiment140 + + + Leveraging <fixed-case>LLM</fixed-case>s for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval + NandanThakurUniversity of Waterloo + JianmoNiGoogle and Google + GustavoHernandez AbregoGoogle + JohnWietingGoogle DeepMind + JimmyLinUniversity of Waterloo + DanielCerGoogle + 7692-7717 + There has been limited success for dense retrieval models in multilingual retrieval, due to uneven and scarce training data available across multiple languages. Synthetic training data generation is promising (e.g., InPars or Promptagator), but has been investigated only for English. Therefore, to study model capabilities across both cross-lingual and monolingual retrieval tasks, we develop **SWIM-IR**, a synthetic retrieval training dataset containing 33 (high to very-low resource) languages for fine-tuning multilingual dense retrievers without requiring any human supervision. To construct SWIM-IR, we propose SAP (summarize-then-ask prompting), where the large language model (LLM) generates a textual summary prior to the query generation step. SAP assists the LLM in generating informative queries in the target language. Using SWIM-IR, we explore synthetic fine-tuning of multilingual dense retrieval models and evaluate them robustly on three retrieval benchmarks: XOR-Retrieve (cross-lingual), MIRACL (monolingual) and XTREME-UP (cross-lingual). Our models, called SWIM-X, are competitive with human-supervised dense retrieval models, e.g., mContriever-X, finding that SWIM-IR can cheaply substitute for expensive human-labeled retrieval training data. SWIM-IR dataset and SWIM-X models are available at: https://github.com/google-research-datasets/SWIM-IR. 
+ 2024.naacl-long.426 + 2024.naacl-long.426.copyright.pdf + thakur-etal-2024-leveraging + + + <fixed-case>SCANNER</fixed-case>: Knowledge-Enhanced Approach for Robust Multi-modal Named Entity Recognition of Unseen Entities + HyunjongOk + TaehoKilNAVER Cloud + SukminSeoNAVER + JaehoLeeGoogle and Pohang University of Science and Technology + 7718-7730 + Recent advances in named entity recognition (NER) have pushed the boundary of the task to incorporate visual signals, leading to many variants, including multi-modal NER (MNER) or grounded MNER (GMNER). A key challenge to these tasks is that the model should be able to generalize to the entities unseen during the training, and should be able to handle the training samples with noisy annotations. To address this obstacle, we propose SCANNER (Span CANdidate detection and recognition for NER), a model capable of effectively handling all three NER variants. SCANNER is a two-stage structure; we extract entity candidates in the first stage and use it as a query to get knowledge, effectively pulling knowledge from various sources. We can boost our performance by utilizing this entity-centric extracted knowledge to address unseen entities. Furthermore, to tackle the challenges arising from noisy annotations in NER datasets, we introduce a novel self-distillation method, enhancing the robustness and accuracy of our model in processing training data with inherent uncertainties. Our approach demonstrates competitive performance on the NER benchmark and surpasses existing methods on both MNER and GMNER benchmarks. Further analysis shows that the proposed distillation and knowledge utilization methods improve the performance of our model on various benchmarks. 
+ 2024.naacl-long.427 + 2024.naacl-long.427.copyright.pdf + ok-etal-2024-scanner + + + A Theory Guided Scaffolding Instruction Framework for <fixed-case>LLM</fixed-case>-Enabled Metaphor Reasoning + YuanTian + NanXu + WenjiMaoInstitute of Automation, Chinese Academy of Sciences + 7731-7748 + Metaphor detection is a challenging task in figurative language processing, which aims to distinguish between metaphorical and literal expressions in text. Existing methods tackle metaphor detection via training or fine-tuning discriminative models on labeled data. However, these approaches struggle to explain the underlying reasoning process behind the metaphorical/literal judgment. Recently, large language models (LLMs) have shown promise in language reasoning tasks. Although promising, LLM-based methods for metaphor detection and reasoning are still faced with the challenging issue of bringing the explainable concepts for metaphor reasoning and their linguistic manifestation. To fill this gap, we propose a novel Theory guided Scaffolding Instruction (TSI) framework that instructs an LLM to infer the underlying reasoning process of metaphor detection guided by metaphor theories for the first time. Our work is inspired by a pedagogical strategy called scaffolding instruction, which encourages educators to provide questioning and support as scaffolding so as to assist learners in constructing the understanding of pedagogical goals step by step. We first construct a metaphor knowledge graph grounded in metaphor theory which serves as the instructional structure to obtain a series of scaffolding questions, directing the LLM to incrementally generate the reasoning process for metaphor understanding through dialogue interactions. During this theory guided instruction process, we explore the LLM’s mastery boundary and provide the relevant knowledge as scaffolding support when the question is beyond the LLM’s capability. 
Experimental results verify that our method significantly outperforms both the LLM-based reasoning methods and the SOTA methods in metaphor detection, indicating the facilitation of metaphor and instruction theories in guiding LLM-based reasoning process. + 2024.naacl-long.428 + 2024.naacl-long.428.copyright.pdf + tian-etal-2024-theory + + + Learning to Compress Prompt in Natural Language Formats + Yu-NengChuangRice University + TianweiXingSamsung Research America + Chia-YuanChang + ZiruiLiuRice University + XunChenSamsung Research America + XiaHuRice University + 7749-7760 + Large language models (LLMs) are great at processing multiple natural language processing tasks, but their abilities are constrained by inferior performance with long context, slow inference speed, and the high cost of computing the results. Deploying LLMs with precise and informative context helps users process large-scale datasets more effectively and cost-efficiently. Existing works rely on compressing long prompt contexts into soft prompts. However, soft prompt compression encounters limitations in transferability across different LLMs, especially API-based LLMs. To this end, this work aims to compress lengthy prompts in the form of natural language with LLM transferability. This poses two challenges: (i) Natural Language (NL) prompts are incompatible with back-propagation, and (ii) NL prompts lack flexibility in imposing length constraints. In this work, we propose a Natural Language Prompt Encapsulation (Nano-Capsulator) framework compressing original prompts into NL formatted Capsule Prompt while maintaining prompt utility and transferability. Specifically, to tackle the first challenge, the Nano-Capsulator is optimized by a reward function that interacts with the proposed semantics preserving loss. To address the second question, the Nano-Capsulator is optimized by a reward function featuring length constraints. 
Experimental results demonstrate that the Capsule Prompt can reduce 81.4% of the original length, decrease inference latency up to 4.5x, and save 80.1% of budget overheads while providing transferability across diverse LLMs and different datasets. + 2024.naacl-long.429 + 2024.naacl-long.429.copyright.pdf + chuang-etal-2024-learning + + + Automatic, Meta and Human Evaluation for Multimodal Summarization with Multimodal Output + HaojieZhuangUniversity of Adelaide + Wei EmmaZhangThe University of Adelaide + LeonXie + WeitongChenUniversity of Adelaide + JianYangMacquarie University + QuanShengMacquarie University + 7761-7783 + Multimodal summarization with multimodal output (MSMO) has attracted increasing research interests recently as multimodal summary could provide more comprehensive information compared to text-only summary, effectively improving the user experience and satisfaction. As one of the most fundamental components for the development of MSMO, evaluation is an emerging yet underexplored research topic. In this paper, we fill this gap and propose a research framework that studies three research questions of MSMO evaluation: (1) Automatic Evaluation: We propose a novel metric mLLM-EVAL, which utilizes multimodal Large Language Model for MSMO EVALuation. (2) Meta-Evaluation: We create a meta-evaluation benchmark dataset by collecting human-annotated scores for multimodal summaries. With our benchmark, we conduct meta-evaluation analysis to assess the quality of different evaluation metrics and show the effectiveness of our proposed mLLM-EVAL. (3) Human Evaluation: To provide more objective and unbiased human annotations for meta-evaluation, we hypothesize and verify three types of cognitive biases in human evaluation. We also incorporate our findings into the human annotation process in the meta-evaluation benchmark. 
Overall, our research framework provides an evaluation metric, a meta-evaluation benchmark dataset annotated by humans and an analysis of cognitive biases in human evaluation, which we believe would serve as a valuable and comprehensive resource for the MSMO research community. + 2024.naacl-long.430 + 2024.naacl-long.430.copyright.pdf + zhuang-etal-2024-automatic + + + Naive <fixed-case>B</fixed-case>ayes-based Context Extension for Large Language Models + JianlinSu + MurtadhaAhmedZhuiyi AI Lab + BoWen + LuoAoZhuiyi Technology Co., Ltd. + MingrenZhuShenzhen Zhuiyi Technology Co., Ltd + YunfengLiu + 7784-7800 + Large Language Models (LLMs) have shown promising in-context learning abilities. However, conventional In-Context Learning (ICL) approaches are often impeded by length limitations of transformer architecture, which pose challenges when attempting to effectively integrate supervision from a substantial number of demonstration examples. In this paper, we introduce a novel framework, called Naive Bayes-based Context Extension (NBCE), to enable existing LLMs to perform ICL with an increased number of demonstrations by significantly expanding their context size. Importantly, this expansion does not require fine-tuning or dependence on particular model architectures, all the while preserving linear efficiency. NBCE initially splits the context into equal-sized windows fitting the target LLM’s maximum length. Then, it introduces a voting mechanism to select the most relevant window, regarded as the posterior context. Finally, it employs Bayes’ theorem to generate the test task. Our experimental results demonstrate that NBCE substantially enhances performance, particularly as the number of demonstration examples increases, consistently outperforming alternative methods. The NBCE code will be made publicly accessible. 
The code NBCE is available at: https://github.com/amurtadha/NBCE-master + 2024.naacl-long.431 + 2024.naacl-long.431.copyright.pdf + su-etal-2024-naive + + + Leitner-Guided Memory Replay for Cross-lingual Continual Learning + MeryemM’hamdiUniversity of Southern California + JonathanMayUniversity of Southern California and USC/ISI + 7801-7814 + Cross-lingual continual learning aims to continuously fine-tune a downstream model on emerging data from new languages. One major challenge in cross-lingual continual learning is catastrophic forgetting: a stability-plasticity dilemma, where performance on previously seen languages decreases as the model learns to transfer to new languages. Experience replay, which revisits data from a fixed-size memory of old languages while training on new ones, is among the most successful approaches for solving this dilemma. Faced with the challenge of dynamically storing the memory with high-quality examples while complying with its fixed size limitations, we consider Leitner queuing, a human-inspired spaced-repetition technique, to determine what should be replayed at each phase of learning. Via a controlled set of quantitative and qualitative analyses across different memory strategies, we show that, just like humans, carefully picking informative examples to be prioritized in cross-lingual memory replay helps tame the stability-plasticity dilemma. Compared to vanilla and strong memory replay baselines, our Leitner-guided approach significantly and consistently decreases forgetting while maintaining accuracy across natural language understanding tasks, language orders, and languages. 
+ 2024.naacl-long.432 + 2024.naacl-long.432.copyright.pdf + mhamdi-may-2024-leitner + + + Multilingual Nonce Dependency Treebanks: Understanding how Language Models Represent and Process Syntactic Structure + DavidArpsHHU Düsseldorf + LauraKallmeyerHeinrich Heine University Düsseldorf, Germany + YounesSamih + HassanSajjadDalhousie University + 7815-7837 + We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by perplexity scores of autoregressive (ALM) and masked language models (MLM). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of Müller-Eberstein et al. (2022) on nonce test data and show that the performance declines on both MLMs and ALMs wrt. original test data. However, a majority of the performance is kept, suggesting that the probe indeed learns syntax independently from semantics. + 2024.naacl-long.433 + 2024.naacl-long.433.copyright.pdf + arps-etal-2024-multilingual + + + Actively Learn from <fixed-case>LLM</fixed-case>s with Uncertainty Propagation for Generalized Category Discovery + JingguiLiangSingapore Management University + LiziLiaoSingapore Management University + HaoFeiNational University of Singapore + BoboLiWuhan University + JingJiangSingapore Management University + 7838-7851 + Generalized category discovery faces a key issue: the lack of supervision for new and unseen data categories. 
Traditional methods typically combine supervised pretraining with self-supervised learning to create models, and then employ clustering for category identification. However, these approaches tend to become overly tailored to known categories, failing to fully resolve the core issue. Hence, we propose to integrate the feedback from LLMs into an active learning paradigm. Specifically, our method innovatively employs uncertainty propagation to select data samples from high-uncertainty regions, which are then labeled using LLMs through a comparison-based prompting scheme. This not only eases the labeling task but also enhances accuracy in identifying new categories. Additionally, a soft feedback propagation mechanism is introduced to minimize the spread of inaccurate feedback. Experiments on various datasets demonstrate our framework’s efficacy and generalizability, significantly improving baseline models at a nominal average cost. + 2024.naacl-long.434 + 2024.naacl-long.434.copyright.pdf + liang-etal-2024-actively + + + Explaining Text Similarity in Transformer Models + AlexandrosVasileiou + OliverEberleTechnische Universität Berlin + 7852-7866 + As Transformers have become state-of-the-art models for natural language processing (NLP) tasks, the need to understand and explain their predictions is increasingly apparent. Especially in unsupervised applications, such as information retrieval tasks, similarity models built on top of foundation model representations have been widely applied. However, their inner prediction mechanisms have mostly remained opaque. Recent advances in explainable AI have made it possible to mitigate these limitations by leveraging improved explanations for Transformers through layer-wise relevance propagation (LRP). Using BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, we investigate which feature interactions drive similarity in NLP models. 
We validate the resulting explanations and demonstrate their utility in three corpus-level use cases, analyzing grammatical interactions, multilingual semantics, and biomedical text retrieval. Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights. + 2024.naacl-long.435 + 2024.naacl-long.435.copyright.pdf + vasileiou-eberle-2024-explaining + + + Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning + HuimingWangSingapore University of Technology and Design + ZhaodonghuiLiNanyang Technological University + LiyingCheng + De WenSohSingapore University of Technology and Design + LidongBingAlibaba Group + 7867-7884 + Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive learning based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. 
Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages (i.e., sentence generation, sentence pair construction, in-batch training) and refines the generated content at these three distinct stages, ensuring only high-quality sentence pairs are utilized to train a base contrastive learning model. Our extensive experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results. Comprehensive analyses further underscore the potential of our framework in various application scenarios and achieving better sentence representation learning with LLMs. + 2024.naacl-long.436 + 2024.naacl-long.436.copyright.pdf + wang-etal-2024-large-language + + + <fixed-case>HIL</fixed-case>: Hybrid Isotropy Learning for Zero-shot Performance in Dense retrieval + JaeyoungKimSeoul National University + DohyeonLeeSeoul National University + Seung-wonHwangSeoul National University + 7885-7896 + Advancements in dense retrieval models have brought ColBERT to prominence in Information Retrieval (IR) with its advanced interaction techniques. However, ColBERT is reported to frequently underperform in zero-shot scenarios, where traditional techniques such as BM25 still exceed it. Addressing this, we propose to balance representation isotropy and anisotropy for zero-shot model performance, based on our observations that isotropy can enhance cosine similarity computations and anisotropy may aid in generalizing to unseen data. Striking a balance between these isotropic and anisotropic qualities stands as a critical objective to refine model efficacy. Based on this, we present ours, a Hybrid Isotropy Learning (HIL) architecture that integrates isotropic and anisotropic representations. Our experiments with the BEIR
benchmark show that our model significantly outperforms the baseline ColBERT model, highlighting the importance of harmonized isotropy in improving zero-shot retrieval performance. + 2024.naacl-long.437 + 2024.naacl-long.437.copyright.pdf + kim-etal-2024-hil + + + <fixed-case>S</fixed-case>uper<fixed-case>GLEB</fixed-case>er: <fixed-case>G</fixed-case>erman Language Understanding Evaluation Benchmark + JanPfisterBayerische Julius-Maximilians-Universität Würzburg + AndreasHothoBayerische Julius-Maximilians-Universität Würzburg + 7897-7916 + We assemble a broad Natural Language Understanding benchmark suite for the German language and consequently evaluate a wide array of existing German-capable models in order to create a better understanding of the current state of German LLMs. Our benchmark consists of 29 different tasks ranging over different types such as document classification, sequence tagging, sentence similarity, and question answering, on which we evaluate 10 different German-pretrained models, thereby charting the landscape of German LLMs. In our comprehensive evaluation we find that encoder models are a good choice for most tasks, but also that the largest encoder model does not necessarily perform best for all tasks. We make our benchmark suite and a leaderboard publicly available at https://supergleber.professor-x.de and encourage the community to contribute new tasks and evaluate more models on it (https://github.com/LSX-UniWue/SuperGLEBer). + 2024.naacl-long.438 + 2024.naacl-long.438.copyright.pdf + pfister-hotho-2024-supergleber + + + “You are an expert annotator”: Automatic Best–Worst-Scaling Annotations for Emotion Intensity Modeling + ChristopherBagdon + PrathameshKarmalkar + HarshaGurulingappa + RomanKlingerOtto-Friedrich Universität Bamberg + 7917-7929 + Labeling corpora constitutes a bottleneck to create models for new tasks or domains.
Large language models mitigate the issue with automatic corpus labeling methods, particularly for categorical annotations. Some NLP tasks such as emotion intensity prediction, however, require text regression, but there is no work on automating annotations for continuous label assignments. Regression is considered more challenging than classification: The fact that humans perform worse when tasked to choose values from a rating scale led to comparative annotation methods, including best–worst scaling. This raises the question if large language model-based annotation methods show similar patterns, namely that they perform worse on rating scale annotation tasks than on comparative annotation tasks. To study this, we automate emotion intensity predictions and compare direct rating scale predictions, pairwise comparisons and best–worst scaling. We find that the latter shows the highest reliability. A transformer regressor fine-tuned on these data performs nearly on par with a model trained on the original manual annotations. + 2024.naacl-long.439 + 2024.naacl-long.439.copyright.pdf + bagdon-etal-2024-expert + + + What Matters in Training a <fixed-case>GPT</fixed-case>4-Style Language Model with Multimodal Inputs? + YanZengByteDance + HanboZhangNational University of Singapore + JianiZheng + JiangnanXia + GuoqiangWeiByteDance + YangWeiEast China Normal University + YuchenZhangByteDance Research + TaoKongBytedance + RuihuaSongRenmin University of China + 7930-7957 + Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in processing image inputs and following open-ended instructions. Despite these advancements, there is considerable scope for enhancing open-source multi-modal LLMs, especially in terms of multi-modal understanding accuracy and instruction-following proficiency. In this paper, we conduct a comprehensive study on training GPT4-style models.
We introduce Lynx, a multi-modal LLM developed through a series of controlled experiments comparing various model variants. This process allowed us to identify and implement an optimal training strategy tailored for multi-modal LLMs. In addition to our model development, we propose a plug-and-play technique designed to augment the instruction-following capabilities of multi-modal LLMs. We have validated the performance of Lynx on multiple benchmarks. Results demonstrate that Lynx not only achieves strong image understanding accuracy but also excels in instruction-following tasks, paving the path for ongoing enhancements in multi-modal LLMs. + 2024.naacl-long.440 + 2024.naacl-long.440.copyright.pdf + zeng-etal-2024-matters + + + Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable <fixed-case>NLG</fixed-case> Evaluation + JieRuan + WangWenqingWangWenqing + XiaojunWanPeking University + 7958-7982 + Human evaluation serves as the gold standard for assessing the quality of Natural Language Generation (NLG) systems. Nevertheless, the evaluation guideline, as a pivotal element ensuring reliable and reproducible human assessment, has received limited attention. Our investigation revealed that only 29.84% of recent papers involving human evaluation at top conferences release their evaluation guidelines, with vulnerabilities identified in 77.09% of these guidelines. Unreliable evaluation guidelines can yield inaccurate assessment outcomes, potentially impeding the advancement of NLG in the right direction. To address these challenges, we take an initial step towards reliable evaluation guidelines and propose the first human evaluation guideline dataset by collecting annotations of guidelines extracted from existing papers as well as generated via Large Language Models (LLMs). We then introduce a taxonomy of eight vulnerabilities and formulate a principle for composing evaluation guidelines.
Furthermore, a method for detecting guideline vulnerabilities has been explored using LLMs, and we offer a set of recommendations to enhance reliability in human evaluation. The annotated human evaluation guideline dataset and code for the vulnerability detection method are publicly available online. + 2024.naacl-long.441 + 2024.naacl-long.441.copyright.pdf + ruan-etal-2024-defining + + + <fixed-case>MOSAIC</fixed-case>o: a Multilingual Open-text Semantically Annotated Interlinked Corpus + SimoneConiaSapienza University of Rome + EdoardoBarba + Abelardo CarlosMartinez LorenzoUniversity of Roma “La Sapienza” + Pere-LluísHuguet Cabot + RiccardoOrlando + LuigiProcopio + RobertoNavigliSapienza University of Rome + 7983-7997 + Several Natural Language Understanding (NLU) tasks focus on linking text to explicit knowledge, including Word Sense Disambiguation, Semantic Role Labeling, Semantic Parsing, and Relation Extraction. In addition to the importance of connecting raw text with explicit knowledge bases, the integration of such carefully curated knowledge into deep learning models has been shown to be beneficial across a diverse range of applications, including Language Modeling and Machine Translation. Nevertheless, the scarcity of semantically-annotated corpora across various tasks and languages limits the potential advantages significantly. To address this issue, we put forward MOSAICo, the first endeavor aimed at equipping the research community with the key ingredients to model explicit semantic knowledge at a large scale, providing hundreds of millions of silver yet high-quality annotations for four NLU tasks across five languages. We describe the creation process of MOSAICo, demonstrate its quality and variety, and analyze the interplay between different types of semantic information. 
MOSAICo, available at https://github.com/SapienzaNLP/mosaico, aims to drop the requirement of closed, licensed datasets and represents a step towards a level playing field across languages and tasks in NLU. + 2024.naacl-long.442 + 2024.naacl-long.442.copyright.pdf + conia-etal-2024-mosaico + + + <fixed-case>S</fixed-case>em<fixed-case>R</fixed-case>o<fixed-case>D</fixed-case>e: Macro Adversarial Training to Learn Representations that are Robust to Word-Level Attacks + BrianFormentonational university of singaore, National University of Singapore + WenjieFengNational University of Singapore + Chuan-ShengFooCentre for Frontier AI Research, A*STAR and Institute for Infocomm Research, A*STAR + Anh TuanLuuNanyang Technological University + See-KiongNgNational University of Singapore + 7998-8021 + Language models (LMs) are indispensable tools for natural language processing tasks, but their vulnerability to adversarial attacks remains a concern. While current research has explored adversarial training techniques, their improvements to defend against word-level attacks have been limited. In this work, we propose a novel approach called Semantic Robust Defence (SemRoDe), a Macro Adversarial Training strategy to enhance the robustness of LMs. Drawing inspiration from recent studies in the image domain, we investigate and later confirm that in a discrete data setting such as language, adversarial samples generated via word substitutions do indeed belong to an adversarial domain exhibiting a high Wasserstein distance from the base domain. Our method learns a robust representation that bridges these two domains. We hypothesize that if samples were not projected into an adversarial domain, but instead to a domain with minimal shift, it would improve attack robustness. We align the domains by incorporating a new distance-based objective. 
With this, our model is able to learn more generalized representations by aligning the model’s high-level output features and therefore better handling unseen adversarial samples. This method can be generalized across word embeddings, even when they share minimal overlap at both vocabulary and word-substitution levels. To evaluate the effectiveness of our approach, we conduct experiments on BERT and RoBERTa models on three datasets. The results demonstrate promising state-of-the-art robustness. + 2024.naacl-long.443 + 2024.naacl-long.443.copyright.pdf + formento-etal-2024-semrode + + + <fixed-case>BUST</fixed-case>: Benchmark for the evaluation of detectors of <fixed-case>LLM</fixed-case>-Generated Text + JosephCorneliusSUPSI - University of Applied Sciences Southern Switzerland + OscarLithgow-SerranoThe Swiss AI Lab (IDSIA) + SandraMitrovicIDSIA-USI/SUPSI + LjiljanaDolamicarmasuisse + FabioRinaldiIDSIA + 8022-8050 + We introduce BUST, a comprehensive benchmark designed to evaluate detectors of texts generated by instruction-tuned large language models (LLMs). Unlike previous benchmarks, our focus lies on evaluating the performance of detector systems, acknowledging the inevitable influence of the underlying tasks and different LLM generators. Our benchmark dataset consists of 25K texts from humans and 7 LLMs responding to instructions across 10 tasks from 3 diverse sources. Using the benchmark, we evaluated 5 detectors and found substantial performance variance across tasks. A meta-analysis of the dataset characteristics was conducted to guide the examination of detector performance. The dataset was analyzed using diverse metrics assessing linguistic features like fluency and coherence, readability scores, and writer attitudes, such as emotions, convincingness, and persuasiveness. 
Features impacting detector performance were investigated with surrogate models, revealing emotional content in texts enhanced some detectors, yet the most effective detector demonstrated consistent performance, irrespective of writer’s attitudes and text styles. Our approach focused on investigating relationships between the detectors’ performance and two key factors: text characteristics and LLM generators. We believe BUST will provide valuable insights into selecting detectors tailored to specific text styles and tasks and facilitate a more practical and in-depth investigation of detection systems for LLM-generated text. + 2024.naacl-long.444 + 2024.naacl-long.444.copyright.pdf + cornelius-etal-2024-bust + + + Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment + ChongLiInstitute of automation, Chinese Academy of Sciences + ShaonanWang + JiajunZhangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + ChengqingZongInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + 8051-8069 + Multilingual generative models obtain remarkable cross-lingual in-context learning capabilities through pre-training on large-scale corpora. However, they still exhibit a performance bias toward high-resource languages and learn isolated distributions of multilingual sentence representations, which may hinder knowledge transfer across languages. To bridge this gap, we propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences. It aligns the internal sentence representations across different languages via multilingual contrastive learning and aligns outputs by following cross-lingual instructions in the target language. 
Experimental results show that even with less than 0.1‰ of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models and mitigates the performance gap. Further analyses reveal that it results in a better internal multilingual representation distribution of multilingual models. + 2024.naacl-long.445 + 2024.naacl-long.445.copyright.pdf + li-etal-2024-improving-context + + + <fixed-case>M</fixed-case>a<fixed-case>CSC</fixed-case>: Towards Multimodal-augmented Pre-trained Language Models via Conceptual Prototypes and Self-balancing Calibration + XianweiZhuang + ZhichangWang + XuxinCheng + YuxinXie + LimingLiang + YuexianZouPeking University + 8070-8083 + Pre-trained language models (PLMs) that rely solely on textual data may exhibit limitations in multimodal semantics comprehension. Existing solutions attempt to alleviate this issue by incorporating explicit image retrieval or generation techniques. However, these methods: (1) focus exclusively on the static image modality; (2) inevitably encounter modality gaps and noise; (3) indiscriminately treat all modalities. In this paper, we propose a novel multimodal-augmented framework termed MaCSC, which can infuse multimodal semantics into PLMs and facilitate a self-balancing calibration of information allocation. Specifically, MaCSC obtains modal-specific conceptual prototypes from contrastive pre-training models (e.g., CLIP), and aggregates the intra- and inter-modal semantics of the conceptual prototype to enhance PLMs. In addition, we utilize a novel self-balancing contrastive loss to achieve multi-scale self-balancing calibration of multimodal information during fine-tuning PLMs. Experimental results show that MaCSC consistently improves the performance of PLMs across various architectures and scales, and outperforms competitive baselines on multiple NLP tasks.
+ 2024.naacl-long.446 + 2024.naacl-long.446.copyright.pdf + zhuang-etal-2024-macsc + + + Does Pre-trained Language Model Actually Infer Unseen Links in Knowledge Graph Completion? + YusukeSakaiNara Institute of Science and Technology, Japan + HidetakaKamigaitoDivision of Information Science, Nara Institute of Science and Technology + KatsuhikoHayashiThe University of Tokyo + TaroWatanabeNara Institute of Science and Technology, Japan + 8084-8099 + Knowledge graphs (KGs) consist of links that describe relationships between entities. Due to the difficulty of manually enumerating all relationships between entities, automatically completing them is essential for KGs. Knowledge Graph Completion (KGC) is a task that infers unseen relationships between entities in a KG. Traditional embedding-based KGC methods (e.g. RESCAL, TransE, DistMult, ComplEx, RotatE, HAKE, HousE, etc.) infer missing links using only the knowledge from training data. In contrast, the recent Pre-trained Language Model (PLM)-based KGC utilizes knowledge obtained during pre-training, which means it can estimate missing links between entities by reusing memorized knowledge from pre-training without inference. This part is problematic because building KGC models aims to infer unseen links between entities. However, conventional evaluations in KGC do not consider inference and memorization abilities separately. Thus, a PLM-based KGC method, which achieves high performance in current KGC evaluations, may be ineffective in practical applications. To address this issue, we analyze whether PLM-based KGC methods make inferences or merely access memorized knowledge. For this purpose, we propose a method for constructing synthetic datasets specified in this analysis and conclude that PLMs acquire the inference abilities required for KGC through pre-training, even though the performance improvements mostly come from textual information of entities and relations. 
+ 2024.naacl-long.447 + 2024.naacl-long.447.copyright.pdf + sakai-etal-2024-pre + + + Discovering Lobby-Parliamentarian Alignments through <fixed-case>NLP</fixed-case> + AswinSureshIndependent Consultant + LazarRadojevićEPFL - EPF Lausanne + FrancescoSalviEPFL - EPF Lausanne + AntoineMagron + VictorKristof + MatthiasGrossglauserEPFL + 8100-8113 + We discover alignments of views between interest groups (lobbies) and members of the European Parliament (MEPs) by automatically analyzing their texts. Specifically, we do so by collecting novel datasets of lobbies’ position papers and MEPs’ speeches, and comparing these texts on the basis of semantic similarity and entailment. In the absence of ground-truth, we perform an indirect validation by comparing the discovered alignments with a dataset, which we curate, of retweet links between MEPs and lobbies, and with the publicly disclosed meetings of MEPs. Our best method performs significantly better than several baselines. Moreover, an aggregate analysis of the discovered alignments, between groups of related lobbies and political groups of MEPs, corresponds to the expectations from the ideology of the groups (e.g., groups on the political left are more aligned with humanitarian and environmental organisations). We believe that this work is a step towards enhancing the transparency of the intricate decision-making processes within democratic institutions. + 2024.naacl-long.448 + 2024.naacl-long.448.copyright.pdf + suresh-etal-2024-discovering + + + <fixed-case>I</fixed-case>ter<fixed-case>CQR</fixed-case>: Iterative Conversational Query Reformulation with Retrieval Guidance + YunahJang + Kang-ilLeeSeoul National University + HyunkyungBaeLG AI Research + HwanheeLeeChung-Ang University + KyominJung + 8114-8131 + Conversational search aims to retrieve passages containing essential information to answer queries in a multi-turn conversation.
In conversational search, reformulating context-dependent conversational queries into stand-alone forms is imperative to effectively utilize off-the-shelf retrievers. Previous methodologies for conversational query reformulation frequently depend on human-annotated rewrites. However, these manually crafted queries often result in sub-optimal retrieval performance and require high collection costs. To address these challenges, we propose **Iter**ative **C**onversational **Q**uery **R**eformulation (**IterCQR**), a methodology that conducts query reformulation without relying on human rewrites. IterCQR iteratively trains the conversational query reformulation (CQR) model by directly leveraging information retrieval (IR) signals as a reward. Our IterCQR training guides the CQR model such that generated queries contain necessary information from the previous dialogue context. Our proposed method shows state-of-the-art performance on two widely-used datasets, demonstrating its effectiveness on both sparse and dense retrievers. Moreover, IterCQR exhibits superior performance in challenging settings such as generalization on unseen datasets and low-resource scenarios. 
+ 2024.naacl-long.449 + 2024.naacl-long.449.copyright.pdf + jang-etal-2024-itercqr + + + <fixed-case>A</fixed-case>ce<fixed-case>GPT</fixed-case>, Localizing Large Language Models in <fixed-case>A</fixed-case>rabic + HuangHuangShenzhen Research Institute of Big Data + FeiYu + JianqingZhu + XueningSun + HaoCheng + SongDingjie + ZhihongChenStanford University and THE CHINESE UNIVERSITY OF HONG KONG, SHENZHEN + MosenAlharthiKing Abdullah University of Science and Technology + BangAn + JuncaiHeKing Abdullah University of Science and Technology + ZicheLiu + JunyingChen + JianquanLi + BenyouWangThe Chinese University of Hong Kong, Shenzhen + LianZhangShenzhen Research Institute of Big Data + RuoyuSunUniversity of Illinois, Urbana-Champaign + XiangWanShenzhen Research Institute of Big Data + HaizhouLiThe Chinese University of Hong Kong (Shenzhen); National University of Singapore and National University of Singapore + JinchaoXuKing Abdullah University of Science and Technology and Pennsylvania State University + 8132-8156 + This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed ‘AceGPT’, sets the state-of-the-art standard for open Arabic LLMs across various benchmarks. 
Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT. + 2024.naacl-long.450 + 2024.naacl-long.450.copyright.pdf + huang-etal-2024-acegpt + + + Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model + ZhiweiHeShanghai Jiao Tong University + XingWangTencent AI Lab + WenxiangJiaoTencent AI Lab + ZhuoshengZhangShanghai Jiao Tong University + RuiWangShanghai Jiao Tong University + ShumingShiTencent AI Lab + ZhaopengTuTencent AI Lab + 8157-8173 + Insufficient modeling of human preferences within the reward model is a major obstacle for leveraging human feedback to improve translation quality. Fortunately, quality estimation (QE), which predicts the quality of a given translation without reference, has achieved impressive alignment with human evaluations in the last two years. In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training. We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines. We examine the problem and argue that the vulnerability of the QE model might lead to high rewards for incorrect translations, resulting in overoptimization and error propagation. To address the problem, we adopt a simple yet effective method that uses heuristic rules to detect the incorrect translations and assigns a penalty term to the reward scores of them. Experimental results show that the proposed QE-based feedback training achieves consistent and significant improvements across various settings, further verified through human preference studies. Our subsequent analysis demonstrates the high data efficiency of the proposed QE-based feedback training: it outperforms systems using larger parallel corpora by a small amount of monolingual data. 
Our code is available at: https://github.com/zwhe99/FeedbackMT + 2024.naacl-long.451 + 2024.naacl-long.451.copyright.pdf + he-etal-2024-improving + + + Depression Detection in Clinical Interviews with <fixed-case>LLM</fixed-case>-Empowered Structural Element Graph + ZhuangChen + JiawenDengUniversity of Electronic Science and Technology of China + JinfengZhou + JincenziWu + TieyunQianWuhan University + MinlieHuangTsinghua University, Tsinghua University + 8174-8187 + Depression is a widespread mental health disorder affecting millions globally. Clinical interviews are the gold standard for assessing depression, but they heavily rely on scarce professional clinicians, highlighting the need for automated detection systems. However, existing methods only capture part of the relevant elements in clinical interviews, unable to incorporate all depressive cues. Moreover, the scarcity of participant data, due to privacy concerns and collection challenges, intrinsically constrains interview modeling. To address these limitations, in this paper, we propose a structural element graph (SEGA), which transforms the clinical interview into an expertise-inspired directed acyclic graph for comprehensive modeling. Additionally, we further empower SEGA by devising novel principle-guided data augmentation with large language models (LLMs) to supplement high-quality synthetic data and enable graph contrastive learning. Extensive evaluations on two real-world clinical datasets, in both English and Chinese, show that SEGA significantly outperforms baseline methods and powerful LLMs like GPT-3.5 and GPT-4. 
+ 2024.naacl-long.452 + 2024.naacl-long.452.copyright.pdf + chen-etal-2024-depression + + + <fixed-case>SQATIN</fixed-case>: Supervised Instruction Tuning Meets Question Answering for Improved Dialogue <fixed-case>NLU</fixed-case> + EvgeniiaRazumovskaiaUniversity of Cambridge + GoranGlavašJulius-Maximilians-Universität Würzburg + AnnaKorhonenUniversity of Cambridge + IvanVulićUniversity of Cambridge and PolyAI Limited + 8188-8204 + Task-oriented dialogue (TOD) systems help users execute well-defined tasks across a variety of domains (e.g., flight booking or food ordering), with their Natural Language Understanding (NLU) components being dedicated to the analysis of user utterances, predicting users’ intents (Intent Detection, ID) and extracting values for informational slots (Value Extraction, VE). In most domains, labelled NLU data is scarce, making sample-efficient learning – enabled with effective transfer paradigms – paramount. In this work, we introduce SQATIN, a new framework for dialog NLU based on (i) instruction tuning and (ii) question-answering-based formulation of ID and VE tasks. According to the evaluation on established NLU benchmarks, SQATIN sets the new state of the art in dialogue NLU, substantially surpassing the performance of current models based on standard fine-tuning objectives in both in-domain training and cross-domain transfer, and it also surpasses off-the-shelf large language models for the same task, both in terms of performance and inference efficiency. Furthermore, SQATIN yields particularly large performance gains in cross-domain transfer, owing to the fact that our QA-based instruction tuning leverages similarities between natural language descriptions of classes (i.e., slots and intents) across domains. 
+ 2024.naacl-long.453 + 2024.naacl-long.453.copyright.pdf + razumovskaia-etal-2024-sqatin + + + Enhancing Argument Summarization: Prioritizing Exhaustiveness in Key Point Generation and Introducing an Automatic Coverage Evaluation Metric + MohammadKhosravani + ChenyangHuang + AmineTrabelsiUniversité de Sherbrooke + 8205-8217 + The proliferation of social media platforms has given rise to the amount of online debates and arguments. Consequently, the need for automatic summarization methods for such debates is imperative, however this area of summarization is rather understudied. The Key Point Analysis (KPA) task formulates argument summarization as representing the summary of a large collection of arguments in the form of concise sentences in bullet-style format, called key points. A sub-task of KPA, called Key Point Generation (KPG), focuses on generating these key points given the arguments. This paper introduces a novel extractive approach for key point generation, that outperforms previous state-of-the-art methods for the task. Our method utilizes an extractive clustering based approach that offers concise, high quality generated key points with higher coverage of reference summaries, and less redundant outputs. In addition, we show that the existing evaluation metrics for summarization such as ROUGE are incapable of differentiating between generated key points of different qualities. To this end, we propose a new evaluation metric for assessing the generated key points by their coverage. Our code can be accessed online. 
+ 2024.naacl-long.454 + 2024.naacl-long.454.copyright.pdf + khosravani-etal-2024-enhancing + + + <fixed-case>ARM</fixed-case>: Alignment with Residual Energy-Based Model + BoPangSalesForce.com and University of California, Los Angeles + CaimingXiongSalesforce Research + YingboZhouSalesforce Research + 8218-8229 + While large language models (LLMs) trained with large-scale unsupervised learning acquire a wide variety of world knowledge and skills, their behavior does not necessarily align with human preferences. RLHF methods achieve successes in aligning LLM responses with human preferences and improving the controllability of LLM behavior with human instruction. However, RLHF methods are considerably complicated to implement, computationally expensive to train, and notoriously tricky to tune. In this work, we propose Alignment with Residual Energy-Based Model (ARM), as a simple and flexible alternative to RLHF methods. Our method is driven by an observation that we can learn an aligned policy by minimizing a forward Kullback–Leibler (KL) divergence from a target policy (in the form of a residual energy-based model) to a parametric policy (LLM), instead of a reverse KL as in RLHF methods. With samples from the energy-based target policy, we can leverage the power of DPO (or other offline methods) to learn an aligned policy efficiently. ARM is simple to implement and applicable in various data settings. Our extensive experiments demonstrate its strong performance across multiple datasets, compared to strong baselines like PPO and DPO. + 2024.naacl-long.455 + 2024.naacl-long.455.copyright.pdf + pang-etal-2024-arm + + + <fixed-case>H</fixed-case>uman<fixed-case>R</fixed-case>ank<fixed-case>E</fixed-case>val: Automatic Evaluation of <fixed-case>LM</fixed-case>s as Conversational Assistants + MilanGritta + GerasimosLampourasHuawei Technologies Ltd. 
+ IgnacioIacobacciHuawei Noah’s Ark Lab + 8230-8242 + Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement, however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE ranks these answers based on their log-likelihood under the LM’s distribution, and subsequently calculates their correlation with the corresponding human rankings. We support HRE’s efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes. We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning. 
+ 2024.naacl-long.456 + 2024.naacl-long.456.copyright.pdf + gritta-etal-2024-humanrankeval + + + <fixed-case>FAM</fixed-case>u<fixed-case>S</fixed-case>: Frames Across Multiple Sources + SiddharthVashishthaUniversity of Rochester + AlexanderMartinJohns Hopkins University + WilliamGanttDepartment of Computer Science, University of Rochester + BenjaminVan DurmeJohns Hopkins University, Johns Hopkins University, Johns Hopkins University and Microsoft + AaronWhiteUniversity of Rochester + 8243-8266 + Understanding event descriptions is a central aspect of language processing, but current approaches focus overwhelmingly on single sentences or documents. Aggregating information about an event across documents can offer a much richer understanding. To this end, we present FAMuS, a new corpus of Wikipedia passages that report on some event, paired with underlying, genre-diverse (non-Wikipedia) source articles for the same event. Events and (cross-sentence) arguments in both report and source are annotated against FrameNet, providing broad coverage of different event types. We present results on two key event understanding tasks enabled by FAMuS: source validation—determining whether a document is a valid source for a target report event—and cross-document argument extraction—full-document argument extraction for a target event from both its report and the correct source article. + 2024.naacl-long.457 + 2024.naacl-long.457.copyright.pdf + vashishtha-etal-2024-famus + + + Rationale-based Opinion Summarization + HaoyuanLi + SnigdhaChaturvediDepartment of Computer Science, University of North Carolina, Chapel Hill + 8267-8285 + Opinion summarization aims to generate concise summaries that present popular opinions of a large group of reviews. However, these summaries can be too generic and lack supporting details. To address these issues, we propose a new paradigm for summarizing reviews, rationale-based opinion summarization. 
Rationale-based opinion summaries output the representative opinions as well as one or more corresponding rationales. To extract good rationales, we define four desirable properties: relatedness, specificity, popularity, and diversity and present a Gibbs-sampling-based method to extract rationales. Overall, we propose RATION, an unsupervised extractive system that has two components: an Opinion Extractor (to extract representative opinions) and Rationales Extractor (to extract corresponding rationales). We conduct automatic and human evaluations to show that rationales extracted by RATION have the proposed properties and its summaries are more useful than conventional summaries. The implementation of our work is available at https://github.com/leehaoyuan/RATION. + 2024.naacl-long.458 + 2024.naacl-long.458.copyright.pdf + li-chaturvedi-2024-rationale + + + Mustango: Toward Controllable Text-to-Music Generation + JanMelechovskySingapore University of Technology and Design + ZixunGuo + DeepanwayGhosalGoogle DeepMind + NavonilMajumderSingapore University of Technology and Design + DorienHerremansSingapore University of Technology and Design + SoujanyaPoriaSingapore University of Technology and Design + 8286-8309 + The quality of the text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music, not only with general text captions, but with more rich captions that can include specific instructions related to chords, beats, tempo, and key. 
At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music audio and using state-of-the-art Music Information Retrieval methods to extract the music features which will then be appended to the existing descriptions in text format. We release the resulting MusicBench dataset which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and the controllability through music-specific text prompts greatly outperforms other models such as MusicGen and AudioLDM2. + 2024.naacl-long.459 + 2024.naacl-long.459.copyright.pdf + melechovsky-etal-2024-mustango + + + Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations + EmilioCueva + AdrianLopez Monroy + FernandoSánchez-VegaCenter for Research in Mathematics (CIMAT) + ThamarSolorioMohamed bin Zayed University of Artificial Intelligence and University of Houston + 8310-8328 + Zero-Shot Cross-lingual Transfer (ZS-XLT) utilizes a model trained in a source language to make predictions in another language, often with a performance loss. To alleviate this, additional improvements can be achieved through subsequent adaptation using examples in the target language. In this paper, we exploit In-Context Tuning (ICT) for One-Shot Cross-lingual transfer in the classification task by introducing In-Context Cross-lingual Transfer (IC-XLT). 
The novel concept involves training a model to learn from context examples and subsequently adapting it during inference to a target language by prepending a One-Shot context demonstration in that language. Our results show that IC-XLT successfully leverages target-language examples to improve the cross-lingual capabilities of the evaluated mT5 model, outperforming prompt-based models in the Zero and Few-shot scenarios adapted through fine-tuning. Moreover, we show that when source-language data is limited, the fine-tuning framework employed for IC-XLT performs comparably to prompt-based fine-tuning with significantly more training data in the source language. + 2024.naacl-long.460 + 2024.naacl-long.460.copyright.pdf + cueva-etal-2024-adaptive + + + <fixed-case>CNER</fixed-case>: Concept and Named Entity Recognition + GiulianoMartinelliUniversity of Roma “La Sapienza” + FrancescoMolfeseUniversity of Roma “La Sapienza” + SimoneTedeschi + AlberteFernández-Castro + RobertoNavigliSapienza University of Rome + 8329-8344 + Named entities – typically expressed via proper nouns – play a key role in Natural Language Processing, as their identification and comprehension are crucial in tasks such as Relation Extraction, Coreference Resolution and Question Answering, among others. Tasks like these also often entail dealing with concepts – typically represented by common nouns – which, however, have not received as much attention. Indeed, the potential of their identification and understanding remains underexplored, as does the benefit of a synergistic formulation with named entities. To fill this gap, we introduce Concept and Named Entity Recognition (CNER), a new unified task that handles concepts and entities mentioned in unstructured texts seamlessly. We put forward a comprehensive set of categories that can be used to model concepts and named entities jointly, and propose new approaches for the creation of CNER datasets. 
We evaluate the benefits of performing CNER as a unified task extensively, showing that a CNER model gains up to +5.4 and +8 macro F1 points when compared to specialized named entity and concept recognition systems, respectively. Finally, to encourage the development of CNER systems, we release our datasets and models at https://github.com/Babelscape/cner. + 2024.naacl-long.461 + 2024.naacl-long.461.copyright.pdf + martinelli-etal-2024-cner + + + Branch-Solve-Merge Improves Large Language Model Evaluation and Generation + SwarnadeepSahaDepartment of Computer Science, University of North Carolina, Chapel Hill + OmerLevyFacebook + AsliCelikyilmazFAIR + MohitBansalUniversity of North Carolina at Chapel Hill + JasonWestonNew York University and Facebook + XianLiFacebook AI + 8345-8363 + Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks that involve satisfying intricate user constraints or taking into account multiple aspects and criteria. However, their performance can fall short, due to the model’s lack of coherence and inability to plan and decompose the problem. We propose Branch-Solve-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. It consists of branch, solve, and merge modules that are parameterized with specific prompts to the base LLM. These three modules plan a decomposition of the task into multiple parallel sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. We apply our method to the tasks of LLM response evaluation and constrained text generation and evaluate its effectiveness with multiple LLMs, including Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%, reducing length and pairwise position biases by up to 50%, and allowing LLaMA-2-chat to match or outperform GPT-4 on most domains. 
On a constraint story generation task, BSM improves the coherence of stories while also improving constraint satisfaction by 12%. + 2024.naacl-long.462 + 2024.naacl-long.462.copyright.pdf + saha-etal-2024-branch + + + <fixed-case>REPLUG</fixed-case>: Retrieval-Augmented Black-Box Language Models + WeijiaShi + SewonMinFacebook and Department of Computer Science, University of Washington + MichihiroYasunagaStanford University + MinjoonSeoKorea Advanced Institute of Science and Technology + RichardJamesResearch, Facebook + MikeLewisFacebook AI Research + LukeZettlemoyerUniversity of Washington, Facebook and Meta + Wen-tauYihMeta Platforms, Inc. + 8364-8377 + We introduce REPLUG, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model. Unlike prior retrieval-augmented LMs that train language models with special cross-attention mechanisms to encode the retrieved text, REPLUG simply prepends retrieved documents to the input for the frozen black-box LM. This simple design can be easily applied to any existing language models. Furthermore, we show that the LM can be used to supervise the retrieval model, which can then find documents that help the LM make better predictions. Our experiments demonstrate that REPLUG with the tuned retriever significantly improves the performance of GPT-3 (175B) on language modeling by 6.3%, as well as the performance of Codex on five-shot MMLU by 5.1%. Code is publicly released at github.com/swj0419/REPLUG. 
+ 2024.naacl-long.463 + 2024.naacl-long.463.copyright.pdf + shi-etal-2024-replug + + + <fixed-case>D</fixed-case>avid helps Goliath: Inference-Time Collaboration Between Small Specialized and Large General Diffusion <fixed-case>LM</fixed-case>s + XiaochuangHanDepartment of Computer Science, University of Washington + SachinKumarOhio State University, Columbus + YuliaTsvetkovDepartment of Computer Science, University of Washington + MarjanGhazvininejadFacebook AI Research + 8378-8393 + Diffusion-based language models are emerging as a promising alternative to autoregressive LMs: they approach the competence of autoregressive LMs while offering nuanced controllability at inference time. While autoregressive LMs have benefited immensely from scaling and instruction-based learning, existing studies of diffusion LMs have been conducted on a smaller scale. Starting with a recently proposed diffusion model SSD-LM, in this work we first explore methods to scale it from 0.4B to 13B parameters, proposing techniques to improve its training and inference efficiency, and to finetune the model to follow instructions. Armed with a more powerful, general purpose diffusion LM, we introduce the primary contribution of this work – SSD-2 – an approach to easily ensemble at inference time a large general-purpose diffusion LM with smaller, but specialized and contextualized diffusion LMs. We show that SSD-2 facilitates novel ensembles with 100x smaller models that can be customized and deployed by individual users. We find that compared to autoregressive models, the collaboration between diffusion LMs is more effective, leading to higher-quality model responses due to their ability to dynamically incorporate bi-directional contexts. 
+ 2024.naacl-long.464 + 2024.naacl-long.464.copyright.pdf + han-etal-2024-david + + + Efficient End-to-End Visual Document Understanding with Rationale Distillation + WangZhuUniversity of Southern California + AlekhAgarwalGoogle + MandarJoshiGoogle DeepMind + RobinJiaUniversity of Southern California + JesseThomasonUniversity of Southern California and Amazon + KristinaToutanovaGoogle + 8394-8417 + Understanding visually situated language requires interpreting complex layouts of textual and visual elements. Pre-processing tools, such as optical character recognition (OCR), can map document image inputs to textual tokens, then large language models (LLMs) can reason over text. However, such methods have high computational and engineering complexity. Can small pretrained image-to-text models accurately understand visual documents through similar recognition and reasoning steps instead? We propose Rationale Distillation (RD), which incorporates the outputs of OCR tools, LLMs, and larger multimodal models as intermediate “rationales”, and trains a small student model to predict both rationales and answers. On three visual document understanding benchmarks representing infographics, scanned documents, and figures, our Pix2Struct (282M parameters) student model finetuned with RD outperforms the base model by 4-5% absolute accuracy with only 1% higher computational cost. + 2024.naacl-long.465 + 2024.naacl-long.465.copyright.pdf + zhu-etal-2024-efficient + + + A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models + TiwalayoEisapeMassachusetts Institute of Technology + MichaelTesslerDeepMind + IshitaDasguptaDeepMind + FeiSha + SjoerdSteenkisteGoogle + TalLinzenNew York University and Google + 8418-8437 + A central component of rational behavior is logical inference: the process of determining which conclusions follow from a set of premises. Psychologists have documented several ways in which humans’ inferences deviate from the rules of logic. 
Do language models, which are trained on text generated by humans, replicate such human biases, or are they able to overcome them? Focusing on the case of syllogisms—inferences from two simple premises—we show that, within the PaLM 2 family of transformer language models, larger models are more logical than smaller ones, and also more logical than humans. At the same time, even the largest models make systematic errors, some of which mirror human reasoning biases: they show sensitivity to the (irrelevant) ordering of the variables in the syllogism, and draw confident but incorrect inferences from particular syllogisms (syllogistic fallacies). Overall, we find that language models often mimic the human biases included in their training data, but are able to overcome them in some cases. + 2024.naacl-long.466 + 2024.naacl-long.466.copyright.pdf + eisape-etal-2024-systematic + + + <fixed-case>A</fixed-case>nchor<fixed-case>AL</fixed-case>: Computationally Efficient Active Learning for Large and Imbalanced Datasets + PietroLesciUniversity of Cambridge + AndreasVlachosUniversity of Cambridge + 8438-8457 + Active learning for imbalanced classification tasks is challenging as the minority classes naturally occur rarely. Gathering a large pool of unlabelled data is thus essential to capture minority instances. Standard pool-based active learning is computationally expensive on large pools and often reaches low accuracy by overfitting the initial decision boundary, thus failing to explore the input space and find minority instances. To address these issues we propose AnchorAL. At each iteration, AnchorAL chooses class-specific instances from the labelled set, or *anchors*, and retrieves the most similar unlabelled instances from the pool. This resulting *subpool* is then used for active learning. Using a small, fixed-sized subpool AnchorAL allows scaling any active learning strategy to large pools. 
By dynamically selecting different anchors at each iteration it promotes class balance and prevents overfitting the initial decision boundary, thus promoting the discovery of new clusters of minority instances. Experiments across different classification tasks, active learning strategies, and model architectures show that AnchorAL *(i)* is faster, often reducing runtime from hours to minutes, *(ii)* trains more performant models, and *(iii)* returns more balanced datasets than competing methods. + 2024.naacl-long.467 + 2024.naacl-long.467.copyright.pdf + lesci-vlachos-2024-anchoral + + + <fixed-case>ICLE</fixed-case>++: Modeling Fine-Grained Traits for Holistic Essay Scoring + ShengjieLiUniversity of Texas at Dallas + VincentNgUniversity of Texas at Dallas, Central China Normal University and State University of New York at Stony Brook + 8458-8478 + The majority of the recently developed models for automated essay scoring (AES) are evaluated solely on the ASAP corpus. However, ASAP is not without its limitations. For instance, it is not clear whether models trained on ASAP can generalize well when evaluated on other corpora. In light of these limitations, we introduce ICLE++, a corpus of persuasive student essays annotated with both holistic scores and trait-specific scores. Not only can ICLE++ be used to test the generalizability of AES models trained on ASAP, but it can also facilitate the evaluation of models developed for newer AES problems such as multi-trait scoring and cross-prompt scoring. We believe that ICLE++, which represents a culmination of our long-term effort in annotating the essays in the ICLE corpus, contributes to the set of much-needed annotated corpora for AES research. 
+ 2024.naacl-long.468 + 2024.naacl-long.468.copyright.pdf + li-ng-2024-icle + + + <fixed-case>UN</fixed-case>commonsense Reasoning: Abductive Reasoning about Uncommon Situations + WentingZhaoCornell University + JustinChiuCornell University + JenaHwangAllen Institute for Artificial Intelligence + FaezeBrahmanAllen Institute for AI + JackHesselSamaya AI + SanjibanChoudhuryCornell University + YejinChoiDepartment of Computer Science, University of Washington + XiangLi + AlaneSuhrUniversity of California, Berkeley + 8479-8497 + Language technologies that accurately model the dynamics of events must perform commonsense reasoning. Existing work evaluating commonsense reasoning focuses on making inferences about common, everyday situations. To instead investigate the ability to model unusual, unexpected, and unlikely situations, we explore the task of uncommonsense abductive reasoning. Given a piece of context with an unexpected outcome, this task requires reasoning abductively to generate an explanation that makes the unexpected outcome more likely in the context. To this end, we curate and release a new English language corpus called UNcommonsense. We characterize the performance differences between human explainers and the best-performing large language models, finding that model-enhanced human-written explanations achieve the highest quality by trading off between specificity and diversity. Finally, we experiment with several imitation learning algorithms to train open and accessible language models on this task. When compared with the vanilla supervised fine-tuning approach, these methods consistently reduce lose rates on both common and uncommonsense abductive reasoning judged by human evaluators. 
+ 2024.naacl-long.469 + 2024.naacl-long.469.copyright.pdf + zhao-etal-2024-uncommonsense + + + To Tell The Truth: Language of Deception and Language Models + SanchaitaHazra + Bodhisattwa PrasadMajumderAllen Institute for Artificial Intelligence + 8498-8512 + Text-based false information permeates online discourses, yet evidence of people’s ability to discern truth from such deceptive textual content is scarce. We analyze novel TV game show data where conversations in a high-stakes environment between individuals with conflicting objectives result in lies. We investigate the manifestation of potentially verifiable language cues of deception in the presence of objective truth, a distinguishing feature absent in previous text-based deception datasets. We show that there exists a class of detectors (algorithms) that have similar truth detection performance compared to human subjects, even when the former accesses only the language cues while the latter engages in conversations with complete access to all potential sources of cues (language and audio-visual). Our model, built on a large language model, employs a bottleneck framework to learn discernible cues to determine truth, an act of reasoning in which human subjects often perform poorly, even with incentives. Our model detects novel but accurate language cues in many cases where humans failed to detect deception, opening up the possibility of humans collaborating with algorithms and improving their ability to detect the truth. + 2024.naacl-long.470 + 2024.naacl-long.470.copyright.pdf + hazra-majumder-2024-tell + + + Multilingual Models for <fixed-case>ASR</fixed-case> in Chibchan Languages + RolandoCoto-SolanoDartmouth College + Tai WanKim + AlexanderJones + SharidLoáicigaUniversity of Gothenburg, Sweden + 8513-8527 + We present experiments on Automatic Speech Recognition (ASR) for Bribri and Cabécar, two languages from the Chibchan family. 
We fine-tune four ASR algorithms (Wav2Vec2, Whisper, MMS & WavLM) to create monolingual models, with the Wav2Vec2 model demonstrating the best performance. We then proceed to use Wav2Vec2 for (1) experiments on training joint and transfer learning models for both languages, and (2) an analysis of the errors, with a focus on the transcription of tone. Results show effective transfer learning for both Bribri and Cabécar, but especially for Bribri. A post-processing spell checking step further reduced character and word error rates. As for the errors, tone is where the Bribri models make the most errors, whereas the simpler tonal system of Cabécar is better transcribed by the model. Our work contributes to developing better ASR technology, an important tool that could facilitate transcription, one of the major bottlenecks in language documentation efforts. Our work also assesses how existing pre-trained models and algorithms perform for genuine extremely low-resource languages. + 2024.naacl-long.471 + 2024.naacl-long.471.copyright.pdf + coto-solano-etal-2024-multilingual + + + <fixed-case>L</fixed-case>egal<fixed-case>D</fixed-case>iscourse: Interpreting When Laws Apply and To Whom + AlexanderSpangherUniversity of Southern California + ZihanXueUniversity of California, Los Angeles + Te-LinWuUniversity of California, Los Angeles + MarkHansenColumbia University + JonathanMayUniversity of Southern California and USC/ISI + 8528-8551 + While legal AI has made strides in recent years, it still struggles with basic legal concepts: _when_ does a law apply? _Who_ does it apply to? _What_ does it do? We take a _discourse_ approach to addressing these problems and introduce a novel taxonomy for span-and-relation parsing of legal texts. We create a dataset, _LegalDiscourse_, of 602 state-level law paragraphs consisting of 3,715 discourse spans and 1,671 relations. 
Our trained annotators have an agreement-rate \kappa>.8, yet few-shot GPT3.5 performs poorly at span identification and relation classification. Although fine-tuning improves performance, GPT3.5 still lags far below human level. We demonstrate the usefulness of our schema by creating a web application with journalists. We collect over 100,000 laws for 52 U.S. states and territories using 20 scrapers we built, and apply our trained models to 6,000 laws using U.S. Census population numbers. We describe two journalistic outputs stemming from this application: (1) an investigation into the increase in liquor licenses following population growth and (2) a decrease in applicable laws under different under-count projections. + 2024.naacl-long.472 + 2024.naacl-long.472.copyright.pdf + spangher-etal-2024-legaldiscourse + + + <fixed-case>X</fixed-case>-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects + MinqianLiuVirginia Polytechnic Institute and State University + YingShen + ZhiyangXu + YixinCaoSingapore Management University + EunahCho + VaibhavKumarSchool of Computer Science, Carnegie Mellon University + RezaGhanadanUniversity of Maryland, College Park + LifuHuangVirginia Tech + 8552-8571 + Natural Language Generation (NLG) typically involves evaluating the generated text in various aspects (e.g., consistency and naturalness) to obtain a comprehensive assessment. However, multi-aspect evaluation remains challenging as it may require the evaluator to generalize to any given evaluation aspect even if it’s absent during training. In this paper, we introduce X-Eval, a two-stage instruction tuning framework to evaluate text in both seen and unseen aspects customized by end users. 
X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model’s ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality. To support the training of X-Eval, we collect AspectInstruct, the first instruction tuning dataset tailored for multi-aspect NLG evaluation spanning 27 diverse evaluation aspects with 65 tasks. To enhance task diversity, we devise an augmentation strategy that converts human rating annotations into diverse forms of NLG evaluation tasks, including scoring, comparison, ranking, and Boolean question answering. Extensive experiments across three essential categories of NLG tasks: dialogue generation, summarization, and data-to-text coupled with 21 aspects in meta-evaluation, demonstrate that X-Eval enables even a lightweight language model to achieve a comparable if not higher correlation with human judgments compared to the state-of-the-art NLG evaluators like GPT-4. + 2024.naacl-long.473 + 2024.naacl-long.473.copyright.pdf + liu-etal-2024-x + + + Is Reference Necessary in the Evaluation of <fixed-case>NLG</fixed-case> Systems? When and Where? + ShuqianSheng + YiXu + LuoyiFu + JiaxinDingShanghai Jiaotong University + LeiZhouShanghai Jiaotong University + XinbingWangShanghai Jiao Tong University + ChenghuZhouIGSNRR, Chinese Academy of Sciences, Beijing, China + 8572-8588 + The majority of automatic metrics for evaluating NLG systems are reference-based. However, the challenge of collecting human annotation results in a lack of reliable references in numerous application scenarios. Despite recent advancements in reference-free metrics, it has not been well understood when and where they can be used as an alternative to reference-based metrics. 
In this study, by employing diverse analytical approaches, we comprehensively assess the performance of both metrics across a wide range of NLG tasks, encompassing eight datasets and eight evaluation models. Based on solid experiments, the results show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality. However, their effectiveness varies across tasks and is influenced by the quality of candidate texts. Therefore, it’s important to assess the performance of reference-free metrics before applying them to a new task, especially when inputs are in uncommon form or when the answer space is highly variable. Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance. + 2024.naacl-long.474 + 2024.naacl-long.474.copyright.pdf + sheng-etal-2024-reference + + + Semi-Structured Chain-of-Thought: Integrating Multiple Sources of Knowledge for Improved Language Model Reasoning + XinSuUniversity of Arizona + TiepLeIntel + StevenBethardUniversity of Arizona + PhillipHowardIntel + 8589-8605 + An important open question in the use of large language models for knowledge-intensive tasks is how to effectively integrate knowledge from three sources: the model’s parametric memory, external structured knowledge, and external unstructured knowledge. Most existing prompting methods either rely on one or two of these sources, or require repeatedly invoking large language models to generate similar or identical content. In this work, we overcome these limitations by introducing a novel semi-structured prompting approach that seamlessly integrates the model’s parametric memory with unstructured knowledge from text documents and structured knowledge from knowledge graphs. 
Experimental results on open-domain multi-hop question answering datasets demonstrate that our prompting method significantly surpasses existing techniques, even exceeding those that require fine-tuning. + 2024.naacl-long.475 + 2024.naacl-long.475.copyright.pdf + su-etal-2024-semi + + + Evaluating the Deductive Competence of Large Language Models + SSeals + ValerieShalinUniversity of South Carolina and Wright State University + 8606-8622 + The development of highly fluent large language models (LLMs) has prompted increased interest in assessing their reasoning and problem-solving capabilities. We investigate whether several LLMs can solve a classic type of deductive reasoning problem from the cognitive science literature. The tested LLMs have limited abilities to solve these problems in their conventional form. We performed follow-up experiments to investigate if changes to the presentation format and content improve model performance. We do find performance differences between conditions; however, they do not improve overall performance. Moreover, we find that performance interacts with presentation format and content in unexpected ways that differ from human performance. Overall, our results suggest that LLMs have unique reasoning biases that are only partially predicted from human reasoning performance and the human-generated language corpora that inform them. + 2024.naacl-long.476 + 2024.naacl-long.476.copyright.pdf + seals-shalin-2024-evaluating + + + Large Human Language Models: A Need and the Challenges + NikitaSoni + H.SchwartzStony Brook University (SUNY) + JoãoSedocNew York University + NiranjanBalasubramanianState University of New York, Stony Brook + 8623-8638 + As research in human-centered NLP advances, there is a growing recognition of the importance of incorporating human and social factors into NLP models. At the same time, our NLP systems have become heavily reliant on LLMs, most of which do not model authors. 
To build NLP systems that can truly understand human language, we must better integrate human contexts into LLMs. This brings to the fore a range of design considerations and challenges in terms of what human aspects to capture, how to represent them, and what modeling strategies to pursue. To address these, we advocate for three positions toward creating large human language models (LHLMs) using concepts from psychological and behavioral sciences: First, LM training should include the human context. Second, LHLMs should recognize that people are more than their group(s). Third, LHLMs should be able to account for the dynamic and temporally-dependent nature of the human context. We refer to relevant advances and present open challenges that need to be addressed and their possible solutions in realizing these goals. + 2024.naacl-long.477 + 2024.naacl-long.477.copyright.pdf + soni-etal-2024-large + + + On Learning to Summarize with Large Language Models as References + YixinLiuYale University + KejianShi + KatherineHe + LongtianYe + AlexanderFabbriSalesForce.com + PengfeiLiu + DragomirRadevYale University + ArmanCohanYale University and Allen Institute for Artificial Intelligence + 8639-8656 + Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved. To this end, we use LLMs as both oracle summary generators for standard supervised fine-tuning and oracle summary evaluators for efficient contrastive learning that leverages the LLMs’ supervision signals. 
We conduct comprehensive experiments with source news articles and find that (1) summarization models trained under the LLM-as-reference setting achieve significant performance improvement in both LLM and human evaluations; (2) contrastive learning outperforms standard supervised fine-tuning under both low and high resource settings. Our experimental results also enable a meta-analysis of LLMs’ summary evaluation capacities under a challenging setting, showing that LLMs are not well-aligned with human evaluators. Particularly, our expert human evaluation reveals remaining nuanced performance gaps between LLMs and our fine-tuned models, which LLMs fail to capture. Thus, we call for further studies into both the potential and challenges of using LLMs in summarization model development. + 2024.naacl-long.478 + 2024.naacl-long.478.copyright.pdf + liu-etal-2024-learning + + + Hallucination Diversity-Aware Active Learning for Text Summarization + YuXiaUniversity of California, San Diego + XuLiuShanghai Jiaotong University + TongYuAdobe Research + SungchulKimAdobe Systems + RyanRossiAdobe Research + AnupRaoAdobe Systems + TungMaiAdobe + ShuaiLiJohn Hopcroft Center, Shanghai Jiao Tong University + 8657-8669 + Large Language Models (LLMs) have shown a propensity to generate hallucinated outputs, i.e., texts that are factually incorrect or unsupported. Existing methods for alleviating hallucinations typically require costly human annotations to identify and correct hallucinations in LLM outputs. Moreover, most of these methods focus on a specific type of hallucination, e.g., entity or token errors, which limits their effectiveness in addressing various types of hallucinations exhibited in LLM outputs. To the best of our knowledge, this paper proposes the first active learning framework to alleviate LLM hallucinations, reducing the costly human annotation of hallucinations that is needed. 
By measuring fine-grained hallucinations from errors in semantic frame, discourse and content verifiability in text summarization, we propose HAllucination Diversity-Aware Sampling (HADAS) to select diverse hallucinations for annotations in active learning for LLM finetuning. Extensive experiments on three datasets and different backbone models demonstrate advantages of our method in effectively and efficiently mitigating LLM hallucinations. + 2024.naacl-long.479 + 2024.naacl-long.479.copyright.pdf + xia-etal-2024-hallucination + + + <fixed-case>K</fixed-case>eep it <fixed-case>P</fixed-case>rivate: Unsupervised Privatization of Online Text + CalvinBaoUniversity of Maryland, College Park + MarineCarpuatUniversity of Maryland, College Park + 8670-8685 + Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of English Reddit posts by 68k authors composed of short-medium length texts. We study how the performance changes among evaluative conditions including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks. 
+ 2024.naacl-long.480 + 2024.naacl-long.480.copyright.pdf + bao-carpuat-2024-keep + + + Tied-<fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case>: Enhancing parameter efficiency of <fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case> with Weight Tying + AdithyaRenduchintalaNVIDIA + TugrulKonukNVIDIA + OleksiiKuchaievNVIDIA + 8686-8697 + We introduce Tied-LoRA, a novel paradigm leveraging weight tying and selective training to enhance the parameter efficiency of Low-rank Adaptation (LoRA). Our exploration encompasses different plausible combinations of parameter training and freezing, coupled with weight tying, aimed at identifying the optimal trade-off between performance and the count of trainable parameters. Across 5 diverse tasks and two foundational language models with different parameter counts, our experiments provide comprehensive insights into the inherent trade-offs between efficiency and performance. Our findings reveal a specific Tied-LoRA configuration that distinguishes itself by showcasing comparable performance to LoRA across multiple tasks while utilizing only a fraction of the parameters employed by the standard LoRA method, particularly at elevated ranks. This underscores the efficacy of Tied-LoRA in achieving impressive results with significantly reduced model complexity. + 2024.naacl-long.481 + 2024.naacl-long.481.copyright.pdf + renduchintala-etal-2024-tied + + + Investigating Data Contamination in Modern Benchmarks for Large Language Models + ChunyuanDeng + YilunZhaoYale University + XiangruTangYale University + MarkGersteinYale University + ArmanCohanYale University and Allen Institute for Artificial Intelligence + 8698-8711 + Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. 
This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field. + 2024.naacl-long.482 + 2024.naacl-long.482.copyright.pdf + deng-etal-2024-investigating + + + Pre-trained Language Models for Entity Blocking: A Reproducibility Study + RunhuiWang + YongfengZhangRutgers University + 8712-8722 + Entity Resolution (ER) is an essential task in data integration and its goal is to find records that represent the same entity in a dataset. Deep learning models, especially large pre-trained language models, have achieved state-of-the-art results on this task. A typical ER pipeline consists of Entity Blocking and Entity Matching: Entity Blocking finds candidate record pairs that potentially match and Entity Matching determines if the pairs match. The goal of the entity blocking step is to include as many matching pairs as possible while including as few non-matching pairs as possible. 
On the other hand, the blocking task can also be considered as an Information Retrieval (IR) task. However, state-of-the-art neural IR models that are based on large language models have not been evaluated on the ER task. What’s more, the generalization ability of state-of-the-art methods for entity blocking is not well studied, although it is an important aspect in real-world applications. In this work, we evaluate state-of-the-art models for Entity Blocking along with neural IR models on a wide range of real-world datasets, and also study their in-distribution and out-of-distribution generalization abilities. + 2024.naacl-long.483 + 2024.naacl-long.483.copyright.pdf + wang-zhang-2024-pre + + + <tex-math>RE^2</tex-math>: Region-Aware Relation Extraction from Visually Rich Documents + PritikaRamuAdobe Research + SijiaWang + LallaMouatadid + JoyRimchala + LifuHuangVirginia Tech + 8723-8739 + Current research in form understanding predominantly relies on large pre-trained language models, necessitating extensive data for pre-training. However, the importance of layout structure (i.e., the spatial relationship between the entity blocks in the visually rich document) to relation extraction has been overlooked. In this paper, we propose \textbf{RE}gion-Aware \textbf{R}elation \textbf{E}xtraction (\bf{RE^2}) that leverages region-level spatial structure among the entity blocks to improve their relation prediction. We design an edge-aware graph attention network to learn the interaction between entities while considering their spatial relationship defined by their region-level representations. We also introduce a constraint objective to regularize the model towards consistency with the inherent constraints of the relation extraction task. To support the research on relation extraction from visually rich documents and demonstrate the generalizability of \bf{RE^2}, we build a new benchmark dataset, {DiverseForm}, that covers a wide range of domains. 
Extensive experiments on {DiverseForm} and several public benchmark datasets demonstrate significant superiority and transferability of \bf{RE^2} across various domains and languages, with up to 18.88% absolute F-score gain over all high-performing baselines. + 2024.naacl-long.484 + 2024.naacl-long.484.copyright.pdf + ramu-etal-2024-re2 + + + Mix-Initiative Response Generation with Dynamic Prefix Tuning + YuxiangNieHong Kong University of Science and Technology + HeyanHuangBeijing Institute of Technology + Xian-LingMaoBeijing Institute of Technology + LiziLiaoSingapore Management University + 8740-8753 + Mixed initiative serves as one of the key factors in controlling conversation directions. For a speaker, responding passively or leading proactively would result in rather different responses. However, most dialogue systems focus on training a holistic response generation model without any distinction among different initiatives. This leads to the cross-contamination problem, where the model confuses different initiatives and generates inappropriate responses. Moreover, obtaining plenty of human annotations for initiative labels can be expensive. To address this issue, we propose a general mix-Initiative Dynamic Prefix Tuning framework (IDPT) to decouple different initiatives from the generation model, which learns initiative-aware prefixes in both supervised and unsupervised settings. Specifically, IDPT decouples initiative factors into different prefix parameters and uses the attention mechanism to adjust the selection of initiatives in guiding generation dynamically. The prefix parameters can be tuned towards accurate initiative prediction as well as mix-initiative response generation. Extensive experiments on two public dialogue datasets show that the proposed IDPT outperforms previous baselines on both automatic metrics and human evaluations. It also manages to generate appropriate responses with manipulated initiatives. 
+ 2024.naacl-long.485 + 2024.naacl-long.485.copyright.pdf + nie-etal-2024-mix + + + Value <fixed-case>FULCRA</fixed-case>: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Value + JingYaoMicrosoft + XiaoyuanYiMicrosoft Research + YifanGong + XitingWangRenmin University of China + XingXieMicrosoft + 8754-8777 + Value alignment is crucial for the responsible development of Large Language Models (LLMs). However, how to define values in this context remains largely unexplored. Existing work mainly specifies values as risk criteria formulated in the AI community, e.g., fairness and privacy protection, suffering from poor clarity, adaptability and transparency. Leveraging basic values established in humanity and social science that are compatible with values across cultures, this paper introduces a novel value space spanned by multiple basic value dimensions and proposes BaseAlign, a corresponding value alignment paradigm. Applying the representative Schwartz’s Theory of Basic Values as an instantiation, we construct FULCRA, a dataset consisting of 20k (LLM output, value vector) pairs. LLMs’ outputs are mapped into the K-dim value space beyond simple binary labels, by identifying their underlying priorities for these value dimensions. Extensive analysis and experiments on FULCRA: (1) reveal the essential relation between basic values and LLMs’ behaviors, (2) demonstrate that our paradigm with basic values not only covers existing risks but also anticipates the unidentified ones, and (3) manifest BaseAlign’s superiority in alignment performance with less data, paving the way for addressing the above three challenges. 
+ 2024.naacl-long.486 + 2024.naacl-long.486.copyright.pdf + yao-etal-2024-value + + + <fixed-case>I</fixed-case>ndi<fixed-case>B</fixed-case>ias: A Benchmark Dataset to Measure Social Biases in Language Models for <fixed-case>I</fixed-case>ndian Context + NiharSahoo + PranamyaKulkarni + ArifAhmad + TanuGoyal + NarjisAsad + AparnaGarimellaAdobe Research + PushpakBhattacharyyaIndian Institute of Technology, Bombay, Dhirubhai Ambani Institute Of Information and Communication Technology + 8778-8798 + The pervasive influence of social biases in language data has sparked the need for benchmark datasets that capture and evaluate these biases in Large Language Models (LLMs). Existing efforts predominantly focus on English language and the Western context, leaving a void for a reliable dataset that encapsulates India’s unique socio-cultural nuances. To bridge this gap, we introduce IndiBias, a comprehensive benchmarking dataset designed specifically for evaluating social biases in the Indian context. We filter and translate the existing CrowS-Pairs dataset to create a benchmark dataset suited to the Indian context in Hindi language. Additionally, we leverage LLMs including ChatGPT and InstructGPT to augment our dataset with diverse societal biases and stereotypes prevalent in India. The included bias dimensions encompass gender, religion, caste, age, region, physical appearance, and occupation. We also build a resource to address intersectional biases along three intersectional dimensions. Our dataset contains 800 sentence pairs and 300 tuples for bias measurement across different demographics. The dataset is available in English and Hindi, providing a size comparable to existing benchmark datasets. Furthermore, using IndiBias we compare ten different language models on multiple bias measurement metrics. We observed that the language models exhibit more bias across a majority of the intersectional groups. 
All the scripts utilized and datasets created in this study are publicly available. + 2024.naacl-long.487 + 2024.naacl-long.487.copyright.pdf + sahoo-etal-2024-indibias + +
+ + + Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers) + KevinDuh + HelenaGomez + StevenBethard + Association for Computational Linguistics +
Mexico City, Mexico
+ June + 2024 + 2024.naacl-short + naacl + + + 2024.naacl-short.0 + naacl-2024-2024-north + + + Revisiting Zero-Shot Abstractive Summarization in the Era of Large Language Models from the Perspective of Position Bias + AnshumanChhabraUniversity of California, Davis + HadiAskari + PrasantMohapatraUniversity of South Florida + 1-11 + We characterize and study zero-shot abstractive summarization in Large Language Models (LLMs) by measuring position bias, which we propose as a general formulation of the more restrictive lead bias phenomenon studied previously in the literature. Position bias captures the tendency of a model to unfairly prioritize information from certain parts of the input text over others, leading to undesirable behavior. Through numerous experiments on four diverse real-world datasets, we study position bias in multiple LLMs such as GPT 3.5-Turbo, Llama-2, and Dolly-v2, as well as state-of-the-art pretrained encoder-decoder abstractive summarization models such as Pegasus and BART. Our findings lead to novel insights and discussion on performance and position bias of models for zero-shot summarization tasks. + 2024.naacl-short.1 + 2024.naacl-short.1.copyright.pdf + chhabra-etal-2024-revisiting + + + Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data? + XiangruTangYale University + YimingZong + JasonPhangOpenAI + YilunZhaoYale University + WangchunshuZhouAIWaves Inc. + ArmanCohanYale University and Allen Institute for Artificial Intelligence + MarkGersteinYale University + 12-34 + Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs’ proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. 
We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. In-depth error analysis and creating an ability map across six dimensions, coverage, formatting, reasoning, comprehension, pragmatics, and hallucination, highlight areas for future enhancements and suggest forthcoming research trajectories. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench. + 2024.naacl-short.2 + 2024.naacl-short.2.copyright.pdf + tang-etal-2024-struc + + + Improving Toponym Resolution by Predicting Attributes to Constrain Geographical Ontology Entries + ZeyuZhangAmazon AGI + EgoitzLaparraUniversity of Arizona + StevenBethardUniversity of Arizona + 35-44 + Geocoding is the task of converting location mentions in text into structured geospatial data. We propose a new prompt-based paradigm for geocoding, where the machine learning algorithm encodes only the location mention and its context. We design a transformer network for predicting the country, state, and feature class of a location mention, and a deterministic algorithm that leverages the country, state, and feature class predictions as constraints in a search for compatible entries in the ontology. Our architecture, GeoPLACE, achieves new state-of-the-art performance on multiple datasets. Code and models are available at https://github.com/clulab/geonorm. 
+ 2024.naacl-short.3 + 2024.naacl-short.3.copyright.pdf + zhang-etal-2024-improving-toponym + + + Advancing Regular Language Reasoning in Linear Recurrent Neural Networks + Ting-HanFan + Ta-ChungChi + AlexanderRudnickyCarnegie Mellon University and Carnegie Mellon University + 45-53 + In recent studies, linear recurrent neural networks (LRNNs) have achieved Transformer-level performance in natural language and long-range modeling, while offering rapid parallel training and constant inference cost. With the resurgence of interest in LRNNs, we study whether they can learn the hidden rules in training sequences, such as the grammatical structures of regular language. We theoretically analyze some existing LRNNs and discover their limitations in modeling regular language. Motivated by this analysis, we propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix. Experiments suggest that the proposed model is the only LRNN capable of performing length extrapolation on regular language tasks such as Sum, Even Pair, and Modular Arithmetic. The code is released at https://github.com/tinghanf/RegluarLRNN. + 2024.naacl-short.4 + 2024.naacl-short.4.copyright.pdf + fan-etal-2024-advancing + + + Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers + RoyXie + OrevaogheneAhia + YuliaTsvetkovDepartment of Computer Science, University of Washington + AntoniosAnastasopoulosAthena Research Center and George Mason University + 54-69 + Identifying linguistic differences between dialects of a language often requires expert knowledge and meticulous human analysis. This is largely due to the complexity and nuance involved in studying various dialects. We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers, even in the absence of human experts. 
We explore both post-hoc and intrinsic approaches to interpretability, conduct experiments on Mandarin, Italian, and Low Saxon, and experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations. + 2024.naacl-short.5 + 2024.naacl-short.5.copyright.pdf + xie-etal-2024-extracting + + + Clear Up Confusion: Advancing Cross-Domain Few-Shot Relation Extraction through Relation-Aware Prompt Learning + GeBai + ChenjiLu + DaichiGuo + ShilongLi + YingLiu + ZhangZhang + GuantingDong + RuifangLiuBeijing University of Posts and Telecommunications + SunYongBeijing University of Posts and Telecommunications + 70-78 + Cross-domain few-shot Relation Extraction (RE) aims to transfer knowledge from a source domain to a different target domain to address low-resource problems. Previous work utilized label descriptions and entity information to leverage the knowledge of the source domain. However, these models are prone to confusion when directly applying this knowledge to a target domain with entirely new types of relations, which becomes particularly pronounced when facing similar relations. In this work, we propose a relation-aware prompt learning method with pre-training. Specifically, we empower the model to clear confusion by decomposing various relation types through an innovative label prompt, while a context prompt is employed to capture differences in different scenarios, enabling the model to further discern confusion. Two pre-training tasks are designed to leverage the prompt knowledge and paradigm. Experiments show that our method outperforms previous sota methods, yielding significantly better results on cross-domain few-shot RE tasks.
+ 2024.naacl-short.6 + 2024.naacl-short.6.copyright.pdf + bai-etal-2024-clear + + + Fusion Makes Perfection: An Efficient Multi-Grained Matching Approach for Zero-Shot Relation Extraction + ShilongLi + GeBai + ZhangZhang + YingLiu + ChenjiLu + DaichiGuo + RuifangLiuBeijing University of Posts and Telecommunications + SunYongBeijing University of Posts and Telecommunications + 79-85 + Predicting unseen relations that cannot be observed during the training phase is a challenging task in relation extraction. Previous works have made progress by matching the semantics between input instances and label descriptions. However, fine-grained matching often requires laborious manual annotation, and rich interactions between instances and label descriptions come with significant computational overhead. In this work, we propose an efficient multi-grained matching approach that uses virtual entity matching to reduce manual annotation cost, and fuses coarse-grained recall and fine-grained classification for rich interactions with guaranteed inference speed. Experimental results show that our approach outperforms the previous State Of The Art (SOTA) methods, and achieves a balance between inference efficiency and prediction accuracy in zero-shot relation extraction tasks. Our code is available at https://github.com/longls777/EMMA. + 2024.naacl-short.7 + 2024.naacl-short.7.copyright.pdf + li-etal-2024-fusion + + + Personalized Review Recommendation based on Implicit dimension mining + BeiXu + YifanXu + 86-91 + Users usually browse product reviews before buying products from e-commerce websites. Lots of e-commerce websites can recommend reviews. However, existing research on review recommendation mainly focuses on the general usefulness of reviews and ignores personalized and implicit requirements. To address the issue, we propose a Large language model driven Personalized Review Recommendation model based on Implicit dimension mining (PRR-LI).
The model mines implicit dimensions from reviews and requirements, and encodes them in the form of “text + dimension”. The experiments show that our model significantly outperforms other state-of-the-art textual models on the Amazon-MRHP dataset, with some of the metrics outperforming the state-of-the-art multimodal models. And we prove that encoding “text + dimension” is better than encoding “text” and “dimension” separately in review recommendation. + 2024.naacl-short.8 + 2024.naacl-short.8.copyright.pdf + xu-xu-2024-personalized + + + Unlocking Structure Measuring: Introducing <fixed-case>PDD</fixed-case>, an Automatic Metric for Positional Discourse Coherence + YinhongLiu + YixuanSuCohere + EhsanShareghiMonash University and University of Cambridge + NigelCollierUniversity of Cambridge + 92-100 + Recent large language models (LLMs) have shown remarkable performance in aligning generated text with user intentions across various tasks. When it comes to long-form text generation, there has been a growing interest in generation from a discourse coherence perspective. However, existing lexical or semantic metrics such as BLEU, ROUGE, BertScore cannot effectively capture the discourse coherence. The development of discourse-specific automatic evaluation methods for assessing the output of LLMs warrants greater focus and exploration. In this paper, we present a novel automatic metric designed to quantify the discourse divergence between two long-form articles. Extensive experiments on three datasets from representative domains demonstrate that our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
+ 2024.naacl-short.9 + 2024.naacl-short.9.copyright.pdf + liu-etal-2024-unlocking + + + Returning to the Start: Generating Narratives with Related Endpoints + AnnelieseBreiUniversity of North Carolina at Chapel Hill + ChaoZhaoBloomberg + SnigdhaChaturvediDepartment of Computer Science, University of North Carolina, Chapel Hill + 101-112 + Human writers often *bookend* their writing with ending sentences that relate back to the beginning sentences in order to compose a satisfying narrative that “closes the loop.” Motivated by this observation, we propose RENarGen, a controllable story-generation paradigm that generates narratives by ensuring the first and last sentences are related and then infilling the middle sentences. Our contributions include an initial exploration of how various methods of bookending from Narratology affect language modeling for stories. Automatic and human evaluations indicate RENarGen produces better stories with more narrative closure than current autoregressive models. + 2024.naacl-short.10 + 2024.naacl-short.10.copyright.pdf + brei-etal-2024-returning + + + Unified Examination of Entity Linking in Absence of Candidate Sets + NicolasOng + HassanShavarani + AnoopSarkarSimon Fraser University + 113-123 + Despite remarkable strides made in the development of entity linking systems in recent years, a comprehensive comparative analysis of these systems using a unified framework is notably absent. This paper addresses this oversight by introducing a new black-box benchmark and conducting a comprehensive evaluation of all state-of-the-art entity linking methods. We use an ablation study to investigate the impact of candidate sets on the performance of entity linking. Our findings uncover exactly how much such entity linking systems depend on candidate sets, and how much this limits the general applicability of each system. 
We present an alternative approach to candidate sets, demonstrating that leveraging the entire in-domain candidate set can serve as a viable substitute for certain models. We show the trade-off between less restrictive candidate sets, increased inference time and memory footprint for some models. + 2024.naacl-short.11 + 2024.naacl-short.11.copyright.pdf + ong-etal-2024-unified + + + <fixed-case>M</fixed-case>ulti<fixed-case>P</fixed-case>ara<fixed-case>D</fixed-case>etox: Extending Text Detoxification with Parallel Data to New Languages + DarynaDementieva + NikolayBabakovUniversity of Santiago de Compostela + AlexanderPanchenkoSkoltech + 124-140 + Text detoxification is a textual style transfer (TST) task where a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recently, text detoxification methods found their applications in various tasks such as detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and toxic speech combating in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important to ensure safe communication in modern digital worlds. However, the previous approaches for parallel text detoxification corpora collection—ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022)—were explored only in monolingual setup. In this work, we aim to extend ParaDetox pipeline to multiple languages presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. Then, we experiment with different text detoxification models—from unsupervised baselines to LLMs and fine-tuned models on the presented parallel corpora—showing the great benefit of parallel corpus presence to obtain state-of-the-art text detoxification models for any language.
+ 2024.naacl-short.12 + 2024.naacl-short.12.copyright.pdf + dementieva-etal-2024-multiparadetox + + + <fixed-case>SKICSE</fixed-case>: Sentence Knowable Information Prompted by <fixed-case>LLM</fixed-case>s Improves Contrastive Sentence Embeddings + FangweiOu + JinanXuBeijing Jiaotong University + 141-146 + Contrastive learning, which utilizes positive pairs and in-batch negatives to optimize the loss objective, has been proven to be an effective method for learning sentence embeddings. However, we argue that the previous methods of constructing positive pairs only through dropout perturbation or entailment relation are limited, since there is more sentence knowable information (SKI) to be mined, such as sentence external knowledge, semantic analysis, and grammatical description. In this work, we first hand-craft a simple and effective prompt template that is able to obtain the knowable information of input sentences from LLMs (e.g., LLaMA). Then we combine the original sentence and its knowable information to form a positive pair for contrastive learning. We evaluate our method on standard semantic textual similarity (STS) tasks. Experimental results show that our unsupervised and supervised models using \text{BERT}_\text{base} achieve an average of 78.65% and 82.45% Spearman’s correlation respectively, a 2.40% and 0.88% improvement compared to SimCSE. Our model outperforms the previous state-of-the-art model PromptBERT in both unsupervised and supervised settings and specifically yields a new state-of-the-art performance in supervised setting.
+ 2024.naacl-short.13 + 2024.naacl-short.13.copyright.pdf + ou-xu-2024-skicse + + + A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models + JaylenJonesOhio State University, Columbus + LingboMo + EricFosler-LussierOhio State University + HuanSunThe Ohio State University, Columbus + 147-168 + Counter narratives - informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters - have emerged as an effective hate speech intervention strategy. While previous work has proposed automatic counter narrative generation methods to aid manual interventions, the evaluation of these approaches remains underdeveloped. Previous automatic metrics for counter narrative evaluation lack alignment with human judgment as they rely on superficial reference comparisons instead of incorporating key aspects of counter narrative quality as evaluation criteria. To address prior evaluation limitations, we propose a novel evaluation framework prompting LLMs to provide scores and feedback for generated counter narrative candidates using 5 defined aspects derived from guidelines from counter narrative specialized NGOs. We found that LLM evaluators achieve strong alignment to human-annotated scores and feedback and outperform alternative metrics, indicating their potential as multi-aspect, reference-free and interpretable evaluators for counter narrative evaluation. + 2024.naacl-short.14 + 2024.naacl-short.14.copyright.pdf + jones-etal-2024-multi + + + How does Multi-Task Training Affect Transformer In-Context Capabilities? Investigations with Function Classes + HarmonBhasin + TimothyOssowski + YiqiaoZhongUniversity of Wisconsin - Madison + JunjieHuUniversity of Wisconsin, Madison + 169-187 + Large language models (LLM) have recently shown the extraordinary ability to perform unseen tasks based on few-shot examples provided as text, also known as in-context learning (ICL). 
While recent works have attempted to understand the mechanisms driving ICL, few have explored training strategies that incentivize these models to generalize to multiple tasks. Multi-task learning (MTL) for generalist models is a promising direction that offers transfer learning potential, enabling large parameterized models to be trained from simpler, related tasks. In this work, we investigate the combination of MTL with ICL to build models that efficiently learn tasks while being robust to out-of-distribution examples. We propose several effective curriculum learning strategies that allow ICL models to achieve higher data efficiency and more stable convergence. Our experiments reveal that ICL models can effectively learn difficult tasks by training on progressively harder tasks while mixing in prior tasks, denoted as mixed curriculum in this work. + 2024.naacl-short.15 + 2024.naacl-short.15.copyright.pdf + bhasin-etal-2024-multi + + + <fixed-case>CELI</fixed-case>: Simple yet Effective Approach to Enhance Out-of-Domain Generalization of Cross-Encoders. + CrystinaZhangUniversity of Waterloo + MinghanLi + JimmyLinUniversity of Waterloo + 188-196 + In text ranking, it is generally believed that the cross-encoders already gather sufficient token interaction information via the attention mechanism in the hidden layers. However, our results show that the cross-encoders can consistently benefit from additional token interaction in the similarity computation at the last layer. We introduce CELI (Cross-Encoder with Late Interaction), which incorporates a late interaction layer into the current cross-encoder models. This simple method brings 5% improvement on BEIR without compromising in-domain effectiveness or search latency. Extensive experiments show that this finding is consistent across different sizes of the cross-encoder models and the first-stage retrievers. 
Our findings suggest that boiling all information into the [CLS] token is a suboptimal use for cross-encoders, and advocate further studies to investigate its relevance score mechanism. + 2024.naacl-short.16 + 2024.naacl-short.16.copyright.pdf + zhang-etal-2024-celi + + + <fixed-case>C</fixed-case>ontrastive<fixed-case>M</fixed-case>ix: Overcoming Code-Mixing Dilemma in Cross-Lingual Transfer for Information Retrieval + JunggeunDoSeoul National University + JaeseongLeeSeoul National University + Seung-wonHwangSeoul National University + 197-204 + Multilingual pretrained language models (mPLMs) have been widely adopted in cross-lingual transfer, and code-mixing has demonstrated effectiveness across various tasks in the absence of target language data. Our contribution involves an in-depth investigation into the counterproductive nature of training mPLMs on code-mixed data for information retrieval (IR). Our finding is that while code-mixing demonstrates a positive effect in aligning representations across languages, it hampers the IR-specific objective of matching representations between queries and relevant passages. To balance between positive and negative effects, we introduce ContrastiveMix, which disentangles contrastive loss between these conflicting objectives, thereby enhancing zero-shot IR performance. Specifically, we leverage both English and code-mixed data and employ two contrastive loss functions, by adding an additional contrastive loss that aligns embeddings of English data with their code-mixed counterparts in the query encoder. Our proposed ContrastiveMix exhibits statistically significant outperformance compared to mDPR, particularly in scenarios involving lower linguistic similarity, where the conflict between goals is more pronounced. 
+ 2024.naacl-short.17 + 2024.naacl-short.17.copyright.pdf + do-etal-2024-contrastivemix + + + <fixed-case>SLIDE</fixed-case>: Reference-free Evaluation for Machine Translation using a Sliding Document Window + VikasRaunakMicrosoft + TomKocmiMicrosoft + MattPostMicrosoft and Johns Hopkins University + 205-211 + Reference-based metrics that operate at the sentence-level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. In this paper, we investigate whether additional source context can effectively substitute for a reference. We present a metric named SLIDE (SLIding Document Evaluator), which operates on blocks of sentences. SLIDE leverages a moving window that slides over each document in the test set, feeding each chunk of sentences into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-based metrics. This suggests that source context may provide the same information as a human reference in disambiguating source ambiguities. This finding is especially pertinent for reference-free document-level evaluation, wherein SLIDE could provide higher-quality pairwise system assessments while only requiring document boundary annotations. + 2024.naacl-short.18 + 2024.naacl-short.18.copyright.pdf + raunak-etal-2024-slide + + + Separately Parameterizing Singleton Detection Improves End-to-end Neural Coreference Resolution + XiyuanZou + YiranLi + IanPoradaMcGill University + JackieCheungMcGill University, Mila Research Institute and Microsoft + 212-219 + Current end-to-end coreference resolution models combine detection of singleton mentions and antecedent linking into a single step. In contrast, singleton detection was often treated as a separate step in the pre-neural era.
In this work, we show that separately parameterizing these two sub-tasks also benefits end-to-end neural coreference systems. Specifically, we add a singleton detector to the coarse-to-fine (C2F) coreference model, and design an anaphoricity-aware span embedding and singleton detection loss. Our method significantly improves model performance on OntoNotes and four additional datasets. + 2024.naacl-short.19 + 2024.naacl-short.19.copyright.pdf + zou-etal-2024-separately + + + Unveiling Divergent Inductive Biases of <fixed-case>LLM</fixed-case>s on Temporal Data + SindhuKishore + HangfengHeUniversity of Rochester + 220-228 + Unraveling the intricate details of events in natural language necessitates a subtle understanding of temporal dynamics. Despite the adeptness of Large Language Models (LLMs) in discerning patterns and relationships from data, their inherent comprehension of temporal dynamics remains a formidable challenge. This research meticulously explores these intrinsic challenges within LLMs, with a specific emphasis on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data. Employing two distinct prompt types, namely Question Answering (QA) format and Textual Entailment (TE) format, our analysis probes into both implicit and explicit events. The findings underscore noteworthy trends, revealing disparities in the performance of GPT-3.5 and GPT-4. Notably, biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for “AFTER” in the QA format for both implicit and explicit events, while GPT-4 leans towards “BEFORE”. Furthermore, a consistent pattern surfaces wherein GPT-3.5 tends towards “TRUE”, and GPT-4 exhibits a preference for “FALSE” in the TE format for both implicit and explicit events. 
This persistent discrepancy between GPT-3.5 and GPT-4 in handling temporal data highlights the intricate nature of inductive bias in LLMs, suggesting that the evolution of these models may not merely mitigate bias but may introduce new layers of complexity. + 2024.naacl-short.20 + 2024.naacl-short.20.copyright.pdf + kishore-he-2024-unveiling + + + On Retrieval Augmentation and the Limitations of Language Model Training + Ting-RuiChiangUniversity of Southern California + XinyanYuUniversity of Southern California + JoshuaRobinsonUniversity of Southern California + OllieLiuUniversity of Southern California + IsabelleLeeUniversity of Southern California + DaniYogatamaGoogle DeepMind and DeepMind + 229-238 + Augmenting a language model (LM) with k-nearest neighbors (kNN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited possibility — the “softmax bottleneck.” We then create a new dataset to evaluate LM generalization ability in the setting where training data contains additional information that is not causally relevant. This task is challenging even for GPT-3.5 Turbo. We show that, for both GPT-2 and Mistral 7B, kNN retrieval augmentation consistently improves performance in this setting. Finally, to make kNN retrieval more accessible, we propose using a multi-layer perceptron model that maps datastore keys to values as a drop-in replacement for traditional retrieval. This reduces storage costs by over 25x.
+ 2024.naacl-short.21 + 2024.naacl-short.21.copyright.pdf + chiang-etal-2024-retrieval + + + <fixed-case>G</fixed-case>en<fixed-case>D</fixed-case>ecider: Integrating “None of the Candidates” Judgments in Zero-Shot Entity Linking Re-ranking + KangZhou + YuepeiLi + QingWangIowa State University + QiaoQiaoIowa State University + QiLiIowa State University + 239-245 + We introduce GenDecider, a novel re-ranking approach for Zero-Shot Entity Linking (ZSEL), built on the Llama model. It innovatively detects scenarios where the correct entity is not among the retrieved candidates, a common oversight in existing re-ranking methods. By autoregressively generating outputs based on the context of the entity mention and the candidate entities, GenDecider significantly enhances disambiguation, improving the accuracy and reliability of ZSEL systems, as demonstrated on the benchmark ZESHEL dataset. Our code is available at https://github.com/kangISU/GenDecider. + 2024.naacl-short.22 + 2024.naacl-short.22.copyright.pdf + zhou-etal-2024-gendecider + + + Advancing the Robustness of Large Language Models through Self-Denoised Smoothing + JiabaoJiUniversity of California, Santa Barbara + BairuHou + ZhenZhangUniversity of California, Santa Barbara + GuanhuaZhangMax Planck Institute for Intelligent Systems, Max-Planck Institute + WenqiFanHong Kong Polytechnic University + QingLiThe Hong Kong Polytechnic University, Hong Kong Polytechnic University + YangZhang + GaowenLiu + SijiaLiuMichigan State University + ShiyuChangUC Santa Barbara + 246-257 + Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concerns. However, the increasing size of these models and their limited access make improving their robustness a challenging task. 
Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model’s parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, and the final model’s robustness largely depends on the model’s performance on these noise-corrupted data. Its effectiveness is often limited by the model’s sub-optimal performance on noisy data. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness in defending against adversarial attacks for both downstream tasks and human alignments (i.e., jailbreak attacks). Our code is publicly available at https://github.com/UCSB-NLP-Chang/SelfDenoise. + 2024.naacl-short.23 + 2024.naacl-short.23.copyright.pdf + ji-etal-2024-advancing + + + Can <fixed-case>LLM</fixed-case>’s Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis + Vishnu SashankDorbalaUniversity of Maryland, College Park + SanjoyChowdhuryUniversity of Maryland, College Park + DineshManochaUniversity of Maryland, College Park + 258-271 + We present a novel approach to automatically synthesize “wayfinding instructions” for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. 
Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment which is used by the LLM for instruction synthesis. We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We subjectively evaluate our approach via a user study and observe that 83.3% of users find the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions, and observe very close correlation with the baseline on standard success metrics (< 1% change in SR), quantifying the viability of generated instructions in replacing human-annotated data. We finally discuss the applicability of our approach in enabling a generalizable evaluation of embodied navigation policies. To the best of our knowledge, ours is the first LLM-driven approach capable of generating “human-like” instructions in a platform-agnostic manner, without training. + 2024.naacl-short.24 + 2024.naacl-short.24.copyright.pdf + dorbala-etal-2024-llms + + + On the Role of Summary Content Units in Text Summarization Evaluation + MarcelNawrath + AgnieszkaNowak + TristanRatz + DaniloWalenta + JuriOpitzRuprecht-Karls-Universität Heidelberg and University of Zurich + LeonardoRibeiroAmazon + JoãoSedocNew York University + DanielDeutschGoogle + SimonMille + YixinLiuYale University + SebastianGehrmannBloomberg + LiningZhang + SaadMahamoodtrivago N.V. + MirunaClinciu + KhyathiChandu + YufangHouTechnische Universität Darmstadt and IBM Research Ireland + 272-281 + At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts.
Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries. + 2024.naacl-short.25 + 2024.naacl-short.25.copyright.pdf + nawrath-etal-2024-role + + + More room for language: Investigating the effect of retrieval on language models + DavidSamuelUniversity of Oslo + LucasCharpentierUniversity of Oslo + SondreWold + 282-305 + Retrieval-augmented language models pose a promising alternative to standard language modeling. During pretraining, these models search in a corpus of documents for contextually relevant information that could aid the language modeling objective. We introduce an ‘ideal retrieval’ methodology to study these models in a fully controllable setting. We conduct an extensive evaluation to examine how retrieval augmentation affects the behavior of the underlying language model.
Among other things, we observe that these models: (i) save substantially less world knowledge in their weights, (ii) are better at understanding local context and inter-word dependencies, but (iii) are worse at comprehending global context. + 2024.naacl-short.26 + 2024.naacl-short.26.copyright.pdf + samuel-etal-2024-room + + + Discourse-Aware In-Context Learning for Temporal Expression Normalization + AkashGautam + LukasLangeRobert Bosch GmbH, Bosch + JannikStrötgenKarlsruhe University of Applied Sciences + 306-315 + Temporal expression (TE) normalization is a well-studied problem. However, the predominately used rule-based systems are highly restricted to specific settings, and upcoming machine learning approaches suffer from a lack of labeled data. In this work, we explore the feasibility of proprietary and open-source large language models (LLMs) for TE normalization using in-context learning to inject task, document, and example information into the model. We explore various sample selection strategies to retrieve the most relevant set of examples. By using a window-based prompt design approach, we can perform TE normalization across sentences, while leveraging the LLM knowledge without training the model. Our experiments show competitive results to models designed for this task. In particular, our method achieves large performance improvements for non-standard settings by dynamically including relevant examples during inference. + 2024.naacl-short.27 + 2024.naacl-short.27.copyright.pdf + gautam-etal-2024-discourse + + + Contextualizing Argument Quality Assessment with Relevant Knowledge + DarshanDeshpande + ZhivarSourati + FilipIlievskiVrije Universiteit Amsterdam + FredMorstatterUniversity of Southern California and USC/ISI + 316-326 + Automatic assessment of the quality of arguments has been recognized as a challenging task with significant implications for misinformation and targeted speech.
While real-world arguments are tightly anchored in context, existing computational methods analyze their quality in isolation, which affects their accuracy and generalizability. We propose SPARK: a novel method for scoring argument quality based on contextualization via relevant knowledge. We devise four augmentations that leverage large language models to provide feedback, infer hidden assumptions, supply a similar-quality argument, or give a counter-argument. SPARK uses a dual-encoder Transformer architecture to enable the original argument and its augmentation to be considered jointly. Our experiments in both in-domain and zero-shot setups show that SPARK consistently outperforms existing techniques across multiple metrics. + 2024.naacl-short.28 + 2024.naacl-short.28.copyright.pdf + deshpande-etal-2024-contextualizing + + + Selective Perception: Learning Concise State Descriptions for Language Model Actors + KolbyNottingham + YasamanRazeghi + KyungminKimUniversity of California, Irvine + JbLanierUniversity of California, Irvine + PierreBaldi + RoyFoxUniversity of California, Irvine + SameerSinghUniversity of California, Irvine and Allen Institute for Artificial Intelligence + 327-341 + The latest large language models (LMs) support increasingly longer contexts. While this trend permits using substantial amounts of text with SOTA LMs, requiring these large LMs to process potentially redundant or irrelevant data needlessly increases inference time and cost. To remedy this problem, we propose BLINDER, a method that leverages a small finetuned LM to sample the minimal set of input features that maximizes the performance of a downstream LM. BLINDER trains an LM with a value head to estimate the likelihood of optimal outputs from a downstream LM given an input. We evaluate BLINDER on embodied decision making tasks with notoriously verbose state descriptions: NetHack and robot planning. 
BLINDER reduces the length of LM actor input by 87% and 99% while improving task success rates by 158% and 54% on NetHack and robot planning, respectively, which represents substantial inference cost savings while actually increasing performance. + 2024.naacl-short.29 + 2024.naacl-short.29.copyright.pdf + nottingham-etal-2024-selective + + + <fixed-case>ALOH</fixed-case>a: A New Measure for Hallucination in Captioning Models + SuzannePetrykUniversity of California Berkeley + DavidChanUniversity of California Berkeley + AnishKachinthaya + HaodiZou + JohnCannyUniversity of California - Berkeley and University of California Berkeley + JosephGonzalezUniversity of California - Berkeley, University of California-Berkeley and UC Berkeley, University of California Berkeley + TrevorDarrellElectrical Engineering & Computer Science Department + 342-357 + Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. 
+ 2024.naacl-short.30 + 2024.naacl-short.30.copyright.pdf + petryk-etal-2024-aloha + + + Beyond Yes and No: Improving Zero-Shot <fixed-case>LLM</fixed-case> Rankers via Scoring Fine-Grained Relevance Labels + HongleiZhuangGoogle Research + ZhenQinGoogle + KaiHuiGoogle + JunruWuGoogle Research + LeYanGoogle + XuanhuiWangGoogle + MichaelBenderskyGoogle + 358-370 + Zero-shot text rankers powered by recent LLMs achieve remarkable ranking performance by simply prompting. Existing prompts for pointwise LLM rankers mostly ask the model to choose from binary relevance labels like “Yes” and “No”. However, the lack of intermediate relevance label options may cause the LLM to provide noisy or biased answers for documents that are partially relevant to the query. We propose to incorporate fine-grained relevance labels into the prompt for LLM rankers, enabling them to better differentiate among documents with different levels of relevance to the query and thus derive a more accurate ranking. We study two variants of the prompt template, coupled with different numbers of relevance levels. Our experiments on 8 BEIR data sets show that adding fine-grained relevance labels significantly improves the performance of LLM rankers. + 2024.naacl-short.31 + 2024.naacl-short.31.copyright.pdf + zhuang-etal-2024-beyond + + + <fixed-case>LLM</fixed-case>-Driven Knowledge Injection Advances Zero-Shot and Cross-Target Stance Detection + ZhaoZhang + YimingLi + JinZhang, Chinese Academy of Sciences + HuiXuChinese Academy of Sciences + 371-378 + Stance detection aims at inferring an author’s attitude towards a specific target in a text. Prior methods mainly consider target-related background information for a better understanding of targets while neglecting the accompanying input texts. In this study, we propose to prompt Large Language Models (LLMs) to explicitly extract the relationship between paired text and target as contextual knowledge. 
We then inject such LLM-driven knowledge into a generation model BART to exploit the rich contexts and semantics. Moreover, to further enhance the decoding capability of BART, a novel prototypical contrastive scheme is designed to align input contents with stance labels. Our experimental results demonstrate the state-of-the-art performance across several publicly available datasets, showcasing effectiveness in both zero-shot and cross-target stance detection scenarios. We publicly release our code to facilitate future research. + 2024.naacl-short.32 + 2024.naacl-short.32.copyright.pdf + zhang-etal-2024-llm-driven + + + Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information + ShadiIskander + KiraRadinskyComputer Science Departmen, Technion-Israel Institute of Technology + YonatanBelinkovTechnion, Technion + 379-390 + Mitigating social biases typically requires identifying the social groups associated with each data sample. In this paper, we present DAFair, a novel approach to address social bias in language models. Unlike traditional methods that rely on explicit demographic labels, our approach does not require any such information. Instead, we leverage predefined prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias in the model’s representations. Our empirical results across two tasks and two models demonstrate the effectiveness of our method compared to previous approaches that do not rely on labeled data. Moreover, with limited demographic-annotated data, our approach outperforms common debiasing approaches. 
+ 2024.naacl-short.33 + 2024.naacl-short.33.copyright.pdf + iskander-etal-2024-leveraging + + + Direct Preference Optimization for Neural Machine Translation with Minimum <fixed-case>B</fixed-case>ayes Risk Decoding + GuangyuYang + JinghongChen + WeizheLin + BillByrneAmazon and University of Cambridge + 391-398 + Minimum Bayes Risk (MBR) decoding can significantly improve translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive. We show how the recently developed Reinforcement Learning technique, Direct Preference Optimization (DPO), can fine-tune MLLMs to get the gains of MBR without any additional computation in inference. Our method uses only a small monolingual fine-tuning set and yields significantly improved performance on multiple NMT test sets compared to MLLMs without DPO. + 2024.naacl-short.34 + 2024.naacl-short.34.copyright.pdf + yang-etal-2024-direct + + + <fixed-case>E</fixed-case>cho<fixed-case>P</fixed-case>rompt: Instructing the Model to Rephrase Queries for Improved In-context Learning + Raja Sekhar ReddyMekala + YasamanRazeghi + SameerSinghUniversity of California, Irvine and Allen Institute for Artificial Intelligence + 399-432 + Language models are achieving impressive performance on various tasks by aggressively adopting inference-time prompting techniques, such as zero-shot and few-shot prompting. In this work, we introduce EchoPrompt, a simple yet effective approach that prompts the model to rephrase its queries before answering them. EchoPrompt is tailored for four scenarios, including standard and chain-of-thought prompting, in both zero-shot and few-shot settings. Experimental results show that EchoPrompt yields substantial improvements across all these settings for four families of causal language models. 
These improvements are observed across various numerical reasoning (e.g., GSM8K, SVAMP), reading comprehension (e.g., DROP), and logical reasoning (e.g., Coin flipping) tasks. On average, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002 by 5% in numerical tasks and 13% in reading comprehension tasks. Our empirical results indicate that EchoPrompt is an effective technique that enhances in-context learning performance. + 2024.naacl-short.35 + 2024.naacl-short.35.copyright.pdf + mekala-etal-2024-echoprompt + + + <fixed-case>LEAF</fixed-case>: Language Learners’ <fixed-case>E</fixed-case>nglish Essays and Feedback Corpus + ShabnamBehzad + OmidKashefiEducational Testing Service + SwapnaSomasundaran + 433-442 + This paper addresses the issue of automated feedback generation for English language learners by presenting a corpus of English essays and their corresponding feedback, called LEAF, collected from the “essayforum” website. The corpus comprises approximately 6K essay-feedback pairs, offering a diverse and valuable resource for developing personalized feedback generation systems that address the critical deficiencies within essays, spanning from rectifying grammatical errors to offering insights on argumentative aspects and organizational coherence. Using this corpus, we present and compare multiple feedback generation baselines. Our findings shed light on the challenges of providing personalized feedback and highlight the potential of the LEAF corpus in advancing automated essay evaluation. + 2024.naacl-short.36 + 2024.naacl-short.36.copyright.pdf + behzad-etal-2024-leaf + + + Zero-Shot vs. Translation-Based Cross-Lingual Transfer: The Case of Lexical Gaps + AbteenEbrahimiUniversity of Colorado, Boulder + KatharinaWenseJohannes-Gutenberg Universität Mainz, University of Colorado, Boulder and New York University + 443-458 + Cross-lingual transfer can be achieved through two main approaches: zero-shot transfer or machine translation (MT). 
While the former has been the dominant approach, both have been shown to be competitive. In this work, we compare the current performance and long-term viability of these methods. We leverage lexical gaps to create a multilingual question answering dataset, which provides a difficult domain for evaluation. Both approaches struggle in this setting, though zero-shot transfer performs better, as current MT outputs are not specific enough for the task. Using oracle translation offers the best performance, showing that this approach can perform well long-term; however, current MT quality is a bottleneck. We also conduct an exploratory study to see if humans produce translations sufficient for the task with only general instructions. We find this to be true for the majority of translators, but not all. This indicates that while translation has the potential to outperform zero-shot approaches, creating MT models that generate accurate task-specific translations may not be straightforward. + 2024.naacl-short.37 + 2024.naacl-short.37.copyright.pdf + ebrahimi-wense-2024-zero + + + On the True Distribution Approximation of Minimum <fixed-case>B</fixed-case>ayes-Risk Decoding + AtsumotoOhashiNagoya University + UkyoHondaCyberAgent, Inc. + TetsuroMorimuraCyberAgent, Inc. + YuuJinnaiCyberAgent, Inc. 
+ 459-468 + Minimum Bayes-risk (MBR) decoding has recently gained renewed attention in text generation. MBR decoding considers texts sampled from a model as pseudo-references and selects the text with the highest similarity to the others. Therefore, sampling is one of the key elements of MBR decoding, and previous studies reported that the performance varies by sampling methods. From a theoretical standpoint, this performance variation is likely tied to how closely the samples approximate the true distribution of references. However, this approximation has not been the subject of in-depth study. In this study, we propose using anomaly detection to measure the degree of approximation. We first closely examine the performance variation and then show that previous hypotheses about samples do not correlate well with the variation, but our introduced anomaly scores do. The results are the first to empirically support the link between the performance and the core assumption of MBR decoding. + 2024.naacl-short.38 + 2024.naacl-short.38.copyright.pdf + ohashi-etal-2024-true + + + Rehearsal-Free Modular and Compositional Continual Learning for Language Models + MingyangWang + HeikeAdelHochschule der Medien (University of Applied Sciences) + LukasLangeRobert Bosch GmbH, Bosch + JannikStrötgenKarlsruhe University of Applied Sciences + HinrichSchuetze + 469-480 + Continual learning aims at incrementally acquiring new knowledge while not forgetting existing knowledge. To overcome catastrophic forgetting, methods are either rehearsal-based, i.e., store data examples from previous tasks for data replay, or isolate parameters dedicated to each task. However, rehearsal-based methods raise privacy and memory issues, and parameter-isolation continual learning does not consider interaction between tasks, thus hindering knowledge transfer. 
In this work, we propose MoCL, a rehearsal-free **Mo**dular and **C**ompositional Continual **L**earning framework which continually adds new modules to language models and composes them with existing modules. Experiments on various benchmarks show that MoCL outperforms state of the art and effectively facilitates knowledge transfer. + 2024.naacl-short.39 + 2024.naacl-short.39.copyright.pdf + wang-etal-2024-rehearsal + + + Llama meets <fixed-case>EU</fixed-case>: Investigating the <fixed-case>E</fixed-case>uropean political spectrum through the lens of <fixed-case>LLM</fixed-case>s + IliasChalkidis + StephanieBrandlKøbenhavns Universitet + 481-498 + Instruction-finetuned Large Language Models inherit clear political leanings that have been shown to influence downstream task performance. We expand this line of research beyond the two-party system in the US and audit Llama Chat in the context of EU politics in various settings to analyze the model’s political knowledge and its ability to reason in context. We adapt, i.e., further fine-tune, Llama Chat on speeches of individual euro-parties from debates in the European Parliament to reevaluate its political leaning based on the EUandI questionnaire. Llama Chat shows considerable knowledge of national parties’ positions and is capable of reasoning in context. The adapted, party-specific, models are substantially re-aligned towards respective positions which we see as a starting point for using chat-based LLMs as data-driven conversational engines to assist research in political science. 
+ 2024.naacl-short.40 + 2024.naacl-short.40.copyright.pdf + chalkidis-brandl-2024-llama + + + <fixed-case>M</fixed-case>3<fixed-case>T</fixed-case>: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation + BenjaminHsuAmazon + XiaoyuLiuUniversity of Maryland, College Park + HuayangLi + YoshinariFujinumaAWS AI Labs + MariaNadejdeAmazon + XingNiuAmazon + RonLitmanAmazon + YairKittenplonAmazon + RaghavendraPappagari + 499-507 + Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications. 
+ 2024.naacl-short.41 + 2024.naacl-short.41.copyright.pdf + hsu-etal-2024-m3t + + + Control-<fixed-case>DAG</fixed-case>: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata + JinghongChen + WeizheLin + JingbiaoMei + BillByrneAmazon and University of Cambridge + 508-518 + The Directed Acyclic Transformer is a fast non-autoregressive (NAR) model that performs well in Neural Machine Translation. Two issues prevent its application to general Natural Language Generation (NLG) tasks: frequent Out-Of-Vocabulary (OOV) errors and the inability to faithfully generate entity names. We introduce Control-DAG, a constrained decoding algorithm for our Directed Acyclic T5 (DA-T5) model which offers lexical, vocabulary and length control. We show that Control-DAG significantly enhances DA-T5 on the Schema Guided Dialogue and the DART datasets, establishing strong NAR results for Task-Oriented Dialogue and Data-to-Text NLG. + 2024.naacl-short.42 + 2024.naacl-short.42.copyright.pdf + chen-etal-2024-control + + + Do Vision-Language Models Understand Compound Nouns? + SonalKumar + SreyanGhosh + SSakshi + UtkarshTyagi + DineshManochaUniversity of Maryland, College Park + 519-527 + Open-vocabulary vision-language models (VLMs) like CLIP, trained using contrastive loss, have emerged as a promising new paradigm for text-to-image retrieval. However, do VLMs understand compound nouns (CNs) (e.g., *lab coat*) as well as they understand nouns (e.g., *lab*)? We curate Compun, a novel benchmark with 400 unique and commonly used CNs, to evaluate the effectiveness of VLMs in interpreting CNs. The Compun benchmark challenges a VLM for text-to-image retrieval where, given a text prompt with a CN, the task is to select the correct image that shows the CN among a pair of distractor images that show the constituent nouns that make up the CN. Next, we perform an in-depth analysis to highlight CLIPs’ limited understanding of certain types of CNs. 
Finally, we present an alternative framework that moves beyond hand-written templates for text prompts widely used by CLIP-like models. We employ a Large Language Model to generate multiple diverse captions that include the CN as an object in the scene described by the caption. Our proposed method improves CN understanding of CLIP by 8.25% on Compun. Code and benchmark are available. + 2024.naacl-short.43 + 2024.naacl-short.43.copyright.pdf + kumar-etal-2024-vision + + + Is Prompt Transfer Always Effective? An Empirical Study of Prompt Transfer for Question Answering + MinjiJung + SoyeonParkHanyang University + JeewooSulLG Corporation + Yong SukChoiHanyang University + 528-539 + Prompt tuning, which freezes all parameters of a pre-trained model and only trains a soft prompt, has emerged as a parameter-efficient approach. Because prompt initialization becomes sensitive when the model size is small, prompt transfer, which uses the trained prompt as an initialization for the target task, has recently been introduced. Since previous works have compared tasks in large categories (e.g., summarization, sentiment analysis), the factors that influence prompt transfer have not been sufficiently explored. In this paper, we characterize the question answering task based on features such as answer format and empirically investigate the transferability of soft prompts for the first time. We analyze the impact of initialization during prompt transfer and find that the training dataset sizes of the source and target tasks have a significant influence. Furthermore, we propose a novel approach for measuring catastrophic forgetting and investigate how it occurs in terms of the amount of evidence. Our findings can help deeply understand transfer learning in prompt tuning. 
+ 2024.naacl-short.44 + 2024.naacl-short.44.copyright.pdf + jung-etal-2024-prompt + + + Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers + GeorgiosPantazopoulos + AlessandroSugliaHeriot-Watt University + OliverLemonHeriot-Watt University + ArashEshghiHeriot-Watt University + 540-549 + An effective method for combining frozen large language models (LLM) and visual encoders involves a resampler module that creates a ‘visual prompt’ which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability. + 2024.naacl-short.45 + 2024.naacl-short.45.copyright.pdf + pantazopoulos-etal-2024-lost + + + Do Multilingual Language Models Think Better in <fixed-case>E</fixed-case>nglish? + JulenEtxanizHiTZ Center, University of the Basque Country (UPV/EHU) + GorkaAzkuneUniversidad del País Vasco + AitorSoroaUniversity of the Basque Country. UPV/EHU. + OierLacalle + MikelArtetxeReka AI + 550-564 + Translate-test is a popular technique to improve the performance of multilingual language models. 
This approach works by translating the input into English using an external machine translation system before running inference. However, these improvements can be attributed to the use of a separate translation system, which is typically trained on large amounts of parallel data not seen by the language model. In this work, we introduce a new approach called self-translate that leverages the few-shot translation capabilities of multilingual language models. This allows us to analyze the effect of translation in isolation. Experiments over 5 tasks show that self-translate consistently outperforms direct inference, demonstrating that language models are unable to leverage their full multilingual potential when prompted in non-English languages. Our code is available at https://github.com/juletx/self-translate. + 2024.naacl-short.46 + 2024.naacl-short.46.copyright.pdf + etxaniz-etal-2024-multilingual + + + A Continued Pretrained <fixed-case>LLM</fixed-case> Approach for Automatic Medical Note Generation + DongYuan + EtiRastogiDeepScribe + GautamNaik + Sree PrasannaRajagopal + SagarGoyal + FenZhao + BharathChintagunta + JeffreyWardDeepScribe + 565-571 + LLMs are revolutionizing NLP tasks. However, the use of the most advanced LLMs, such as GPT-4, is often prohibitively expensive for most specialized fields. We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and measured on automated scribing. Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA in PubMedQA, with an accuracy of 78.4%. It also achieves parity with GPT-4 in generating medical notes. Remarkably, HEAL surpasses GPT-4 and Med-PaLM 2 in identifying more correct medical concepts and exceeds the performance of human scribes and other comparable models in correctness and completeness. + 2024.naacl-short.47 + 2024.naacl-short.47.copyright.pdf + yuan-etal-2024-continued + + + Lost in Translation? 
Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts + MichaelSaxon + YiranLuoArizona State University + SharonLevyJohns Hopkins University + ChittaBaralArizona State University, Arizona State University and Arizona State University + YezhouYangArizona State University + William YangWangUC Santa Barbara + 572-582 + Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, “Conceptual Coverage Across Languages” (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction’s impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions. + 2024.naacl-short.48 + 2024.naacl-short.48.copyright.pdf + saxon-etal-2024-lost + + + Self-Improving for Zero-Shot Named Entity Recognition with Large Language Models + TingyuXieZhejiang University + QiLi + YanZhangTencent + ZuozhuLiuZhejiang University + HongweiWangZhejiang University + 583-593 + Exploring the application of powerful large language models (LLMs) on the named entity recognition (NER) task has drawn much attention recently. 
This work pushes the performance boundary of zero-shot NER with LLMs by proposing a training-free self-improving framework, which utilizes an unlabeled corpus to stimulate the self-learning ability of LLMs. First, we use the LLM to make predictions on the unlabeled corpus using self-consistency and obtain a self-annotated dataset. Second, we explore various strategies to select reliable annotations to form a reliable self-annotated dataset. Finally, for each test input, we retrieve demonstrations from the reliable self-annotated dataset and perform inference via in-context learning. Experiments on four benchmarks show substantial performance improvements achieved by our framework. Through comprehensive experimental analysis, we find that increasing the size of unlabeled corpus or iterations of self-improving does not guarantee further improvement, but the performance might be boosted via more advanced strategies for reliable annotation selection. + 2024.naacl-short.49 + 2024.naacl-short.49.copyright.pdf + xie-etal-2024-self + + + Lifelong Event Detection with Embedding Space Separation and Compaction + ChengweiQinNanyang Technological University + RuiruiChenInstitute of High Performance Computing, Singapore, A*STAR + RuochenZhao + WenhanXia + ShafiqJotySalesForce.com and Nanyang Technological University + 594-602 + To mitigate forgetting, existing lifelong event detection methods typically maintain a memory module and replay the stored memory data during the learning of a new task. However, the simple combination of memory data and new-task samples can still result in substantial forgetting of previously acquired knowledge, which may occur due to the potential overlap between the feature distribution of new data and the previously learned embedding space. Moreover, the model suffers from overfitting on the few memory samples rather than effectively remembering learned patterns. 
To address the challenges of forgetting and overfitting, we propose a novel method based on embedding space separation and compaction. Our method alleviates forgetting of previously learned tasks by forcing the feature distribution of new data away from the previous embedding space. It also mitigates overfitting by a memory calibration mechanism that encourages memory data to be close to its prototype to enhance intra-class compactness. In addition, the learnable parameters of the new task are initialized by drawing upon acquired knowledge from the previously learned task to facilitate forward knowledge transfer. With extensive experiments, we demonstrate that our method can significantly outperform previous state-of-the-art approaches. + 2024.naacl-short.50 + 2024.naacl-short.50.copyright.pdf + qin-etal-2024-lifelong + + + Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion + SmritiSingh + CorneliaCarageaUniversity of Illinois, Chicago + Junyi JessyLiUniversity of Texas, Austin + 603-614 + Situations and events evoke emotions in humans, but to what extent do they inform the prediction of emotion detection models? This work investigates how well human-annotated emotion triggers correlate with features that models deemed salient in their prediction of emotions. First, we introduce a novel dataset EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features for emotion prediction models, instead there is intricate interplay between various features and the task of emotion detection. 
+ 2024.naacl-short.51 + 2024.naacl-short.51.copyright.pdf + singh-etal-2024-language + + + <fixed-case>CP</fixed-case>op<fixed-case>QA</fixed-case>: Ranking Cultural Concept Popularity by <fixed-case>LLM</fixed-case>s + MingJiangIndiana University + MansiJoshi + 615-630 + Many recent studies examining the knowledge capacity of large language models (LLM) have focused on knowledge explicitly learned from the pretraining data or implicitly inferable from similar contexts. However, the extent to which an LLM effectively captures corpus-level statistical trends of concepts for reasoning, especially long-tail ones, is largely underexplored. In this study, we introduce a novel few-shot question-answering task (CPopQA) that examines LLMs’ statistical ranking abilities for long-tail cultural concepts (e.g., holidays), particularly focusing on these concepts’ popularity in the United States and the United Kingdom, respectively. We curate a dataset of 457 holidays across 58 countries, generating a total of 9,000 QA testing pairs. Experiments on four strong LLMs show that open-sourced LLMs still lag way behind close LLM API (e.g., GPT-3.5) in statistical ranking of cultural concepts. Notably, GPT-3.5 exhibited its potential to identify geo-cultural proximity across continents. + 2024.naacl-short.52 + 2024.naacl-short.52.copyright.pdf + jiang-joshi-2024-cpopqa + + + The Impact of Language on Arithmetic Proficiency: A Multilingual Investigation with Cross-Agent Checking Computation + Chung-ChiChenAIST, National Institute of Advanced Industrial Science and Technology + HiroyaTakamuraAIST, National Institute of Advanced Industrial Science and Technology + IchiroKobayashiOchanomizu University + YusukeMiyaoThe University of Tokyo + 631-637 + This paper critically examines the arithmetic capabilities of Large Language Models (LLMs), uncovering significant limitations in their performance. 
Our research reveals a notable decline in accuracy for complex calculations involving large numbers, with addition and subtraction tasks showing varying degrees of proficiency. Additionally, we challenge the notion that arithmetic is language-independent, finding up to a 10% difference in performance across twenty languages. The study also compares self-verification methods with cross-agent collaborations, showing that a single model often outperforms collaborative approaches in basic arithmetic tasks. These findings suggest a need to reassess the effectiveness of LLMs in tasks requiring numerical accuracy and precision. + 2024.naacl-short.53 + 2024.naacl-short.53.copyright.pdf + chen-etal-2024-impact + + + Efficient Information Extraction in Few-Shot Relation Classification through Contrastive Representation Learning + PhilippBorchertIÉSEG School of Management and KU Leuven + JochenDe WeerdtKU Leuven + Marie-FrancineMoensKU Leuven, KU Leuven + 638-646 + Differentiating relationships between entity pairs with limited labeled instances poses a significant challenge in few-shot relation classification. Representations of textual data extract rich information spanning the domain, entities, and relations. In this paper, we introduce a novel approach to enhance information extraction combining multiple sentence representations and contrastive learning. While representations in relation classification are commonly extracted using entity marker tokens, we argue that substantial information within the internal model representations remains untapped. To address this, we propose aligning multiple sentence representations, such as the [CLS] token, the [MASK] token used in prompting, and entity marker tokens. Our method employs contrastive learning to extract complementary discriminative information from these individual representations. This is particularly relevant in low-resource settings where information is scarce.
Leveraging multiple sentence representations is especially effective in distilling discriminative information for relation classification when additional information, like relation descriptions, are not available. We validate the adaptability of our approach, maintaining robust performance in scenarios that include relation descriptions, and showcasing its flexibility to adapt to different resource constraints. + 2024.naacl-short.54 + 2024.naacl-short.54.copyright.pdf + borchert-etal-2024-efficient + + + A diverse Multilingual News Headlines Dataset from around the World + FelixLeebMax Planck Institute for Intelligent Systems, Max-Planck Institute + BernhardSchölkopfELLIS Institute and Max Planck Institute for Intelligent Systems, Max-Planck Institute + 647-652 + Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide with English translations of all articles included. Designed for natural language processing and media studies, it serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles, for example, to analyze global news coverage and cultural narratives. As a simple demonstration of the analyses facilitated by this dataset, we use a basic procedure using a TF-IDF weighted similarity metric to group articles into clusters about the same event. We then visualize the event signatures of the event showing articles of which languages appear over time, revealing intuitive features based on the proximity of the event and unexpectedness of the event. The dataset is available on [Kaggle](https://www.kaggle.com/datasets/felixludos/babel-briefings) and [HuggingFace](https://huggingface.co/datasets/felixludos/babel-briefings) with accompanying [GitHub](https://github.com/felixludos/babel-briefings) code. 
+ 2024.naacl-short.55 + 2024.naacl-short.55.copyright.pdf + leeb-scholkopf-2024-diverse + + + The Unreasonable Effectiveness of Random Target Embeddings for Continuous-Output Neural Machine Translation + EvgeniiaTokarchukUniversity of Amsterdam + VladNiculaeUniversity of Amsterdam + 653-662 + Continuous-output neural machine translation (CoNMT) replaces the discrete next-word prediction problem with an embedding prediction. The semantic structure of the target embedding space (*i.e.*, closeness of related words) is intuitively believed to be crucial. We challenge this assumption and show that completely random output embeddings can outperform laboriously pre-trained ones, especially on larger datasets. Further investigation shows this surprising effect is strongest for rare words, due to the geometry of their embeddings. We shed further light on this finding by designing a mixed strategy that combines random and pre-trained embeddings, and that performs best overall. + 2024.naacl-short.56 + 2024.naacl-short.56.copyright.pdf + tokarchuk-niculae-2024-unreasonable + + + Efficient Sample-Specific Encoder Perturbations + YassirFathullahUniversity of Cambridge + MarkGalesUniversity of Cambridge + 663-671 + Encoder-decoder foundation models have displayed state-of-the-art performance on a range of autoregressive sequence tasks. This paper proposes a simple and lightweight modification to such systems to control the behaviour according to a specific attribute of interest. This paper proposes a novel inference-efficient approach to modifying the behaviour of an encoder-decoder system according to a specific attribute of interest. Specifically, we show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model to trigger the decoder to generate improved decodings.
This work explores a specific realization of this framework focused on improving the COMET performance of Flan-T5 on Machine Translation and the WER of Whisper foundation models on Speech Recognition. Results display consistent improvements in performance evaluated through COMET and WER respectively. Furthermore, experiments also show that the proxies are robust to the exact nature of the data used to train them and can extend to other domains. + 2024.naacl-short.57 + 2024.naacl-short.57.copyright.pdf + fathullah-gales-2024-efficient + + + Diverse Perspectives, Divergent Models: Cross-Cultural Evaluation of Depression Detection on <fixed-case>T</fixed-case>witter + Nuredin AliAbdelkadir + CharlesZhang + NedMayo + StevieChancellorUniversity of Minnesota - Twin Cities + 672-680 + Social media data has been used for detecting users with mental disorders, such as depression. Despite the global significance of cross-cultural representation and its potential impact on model performance, publicly available datasets often lack crucial metadata related to this aspect. In this work, we evaluate the generalization of benchmark datasets to build AI models on cross-cultural Twitter data. We gather a custom geo-located Twitter dataset of depressed users from seven countries as a test dataset. Our results show that depression detection models do not generalize globally. The models perform worse on Global South users compared to Global North. Pre-trained language models achieve the best generalization compared to Logistic Regression, though still show significant gaps in performance on depressed and non-Western users.
We quantify our findings and provide several actionable suggestions to mitigate this issue. + 2024.naacl-short.58 + 2024.naacl-short.58.copyright.pdf + abdelkadir-etal-2024-diverse + + + Removing <fixed-case>RLHF</fixed-case> Protections in <fixed-case>GPT</fixed-case>-4 via Fine-Tuning + QiusiZhanUniversity of Illinois Urbana-Champaign + RichardFang + RohanBindu + AkulGupta + TatsunoriHashimotoStanford University + DanielKangDepartment of Computer Science + 681-687 + As large language models (LLMs) have increased in their capabilities, so does their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections on LLMs.
+ 2024.naacl-short.59 + 2024.naacl-short.59.copyright.pdf + zhan-etal-2024-removing + + + <fixed-case>L</fixed-case>ife<fixed-case>T</fixed-case>ox: Unveiling Implicit Toxicity in Life Advice + MinbeomKim + JahyunKooSeoul National University + HwanheeLeeChung-Ang University + JoonsukParkUniversity of Richmond + HwaranLeeNAVER AI Lab + KyominJung + 688-698 + As large language models become increasingly integrated into daily life, detecting implicit toxicity across diverse contexts is crucial. To this end, we introduce \texttt{LifeTox}, a dataset designed for identifying implicit toxicity within a broad range of advice-seeking scenarios. Unlike existing safety datasets, \texttt{LifeTox} comprises diverse contexts derived from personal experiences through open-ended questions. Our experiments demonstrate that RoBERTa fine-tuned on \texttt{LifeTox} matches or surpasses the zero-shot performance of large language models in toxicity classification tasks. These results underscore the efficacy of \texttt{LifeTox} in addressing the complex challenges inherent in implicit toxicity. We open-sourced the dataset and the \texttt{LifeTox} moderator family; 350M, 7B, and 13B. + 2024.naacl-short.60 + 2024.naacl-short.60.copyright.pdf + kim-etal-2024-lifetox + + + Arithmetic Reasoning with <fixed-case>LLM</fixed-case>: <fixed-case>P</fixed-case>rolog Generation & Permutation + XiaochengYang + BingsenChen + Yik-CheungTamNew York University + 699-710 + Instructing large language models (LLMs) to solve elementary school math problems has shown great success using Chain of Thought (CoT). However, the CoT approach relies on an LLM to generate a sequence of arithmetic calculations which can be prone to cascaded calculation errors. We hypothesize that an LLM should focus on extracting predicates and generating symbolic formulas from the math problem description so that the underlying calculation can be done via an external code interpreter. 
We investigate using LLM to generate Prolog programs to solve mathematical questions. Experimental results show that our Prolog-based arithmetic problem-solving outperforms CoT generation in the GSM8K benchmark across three distinct LLMs. In addition, given the insensitive ordering of predicates and symbolic formulas in Prolog, we propose to permute the ground truth predicates for more robust LLM training via data augmentation. + 2024.naacl-short.61 + 2024.naacl-short.61.copyright.pdf + yang-etal-2024-arithmetic + + + Verifying Claims About Metaphors with Large-Scale Automatic Metaphor Identification + KotaroAono + RyoheiSasanoNagoya University + KoichiTakedaNagoya University + 711-719 + There are several linguistic claims about situations where words are more likely to be used as metaphors. However, few studies have sought to verify such claims with large corpora. This study entails a large-scale, corpus-based analysis of certain existing claims about verb metaphors, by applying metaphor detection to sentences extracted from Common Crawl and using the statistics obtained from the results. The verification results indicate that the direct objects of verbs used as metaphors tend to have lower degrees of concreteness, imageability, and familiarity, and that metaphors are more likely to be used in emotional and subjective sentences.
+ 2024.naacl-short.62 + 2024.naacl-short.62.copyright.pdf + aono-etal-2024-verifying + + + <fixed-case>I</fixed-case>nstruct<fixed-case>ABSA</fixed-case>: Instruction Learning for Aspect Based Sentiment Analysis + KevinScaria + HimanshuGuptaAmazon + SiddharthGoyal + SaurabhSawant + SwaroopMishraGoogle + ChittaBaralArizona State University, Arizona State University and Arizona State University + 720-736 + We introduce InstructABSA, an instruction learning paradigm for Aspect-Based Sentiment Analysis (ABSA) subtasks. Our method introduces positive, negative, and neutral examples to each training sample, and instruction tune the model (Tk-Instruct) for ABSA subtasks, yielding significant performance improvements. Experimental results on the SemEval 2014, 15, and 16 datasets demonstrate that InstructABSA outperforms the previous state-of-the-art (SOTA) approaches on Term Extraction (ATE), Sentiment Classification (ATSC) and Sentiment Pair Extraction (ASPE) subtasks. In particular, InstructABSA outperforms the previous state-of-the-art (SOTA) on the Rest14 ATE subtask by 5.69% points, the Rest15 ATSC subtask by 9.59% points, and the Lapt14 AOPE subtask by 3.37% points, surpassing 7x larger models. We get competitive results on AOOE, AOPE, AOSTE, and ACOSQE subtasks indicating strong generalization ability to all subtasks. Exploring sample efficiency reveals that just 50% train data is required to get competitive results with other instruction tuning approaches.
Lastly, we assess the quality of instructions and observe that InstructABSA’s performance experiences a decline of ~10% when adding misleading examples. + 2024.naacl-short.63 + 2024.naacl-short.63.copyright.pdf + scaria-etal-2024-instructabsa + + + <fixed-case>MEMORY</fixed-case>-<fixed-case>VQ</fixed-case>: Compression for Tractable <fixed-case>I</fixed-case>nternet-Scale Memory + YuryZemlyanskiy + Michielde JongAugment Computing + LukeVilnisGoogle + SantiagoOntanonGoogle and Drexel University + WilliamCohenGoogle DeepMind + SumitSanghaiResearch, Google + JoshuaAinslieGoogle + 737-744 + Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN (de Jong et al., 2023a) pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora. + 2024.naacl-short.64 + 2024.naacl-short.64.copyright.pdf + zemlyanskiy-etal-2024-memory + + + Unveiling the Magic: Investigating Attention Distillation in Retrieval-Augmented Generation + ZizhongLiUniversity of California, Davis + HaopengZhang + JiaweiZhangUniversity of California, Davis + 745-754 + Retrieval-augmented generation framework addresses the limitations of large language models by enabling real-time knowledge updates for more accurate answers.
An efficient way in the training phase of retrieval-augmented models is attention distillation, which uses attention scores as supervision signals instead of manually annotated query-document pairs. Despite its growing popularity, the detailed mechanisms behind the success of attention distillation remain unexplored, particularly the specific patterns it leverages to benefit training. In this paper, we address this gap by conducting a comprehensive investigation of attention distillation workflow and identifying key factors influencing the learning performance of retrieval-augmented language models. We further propose several insightful indicators for optimizing models’ training methods and avoiding ineffective training. + 2024.naacl-short.65 + 2024.naacl-short.65.copyright.pdf + li-etal-2024-unveiling + + + Improving Factuality in Clinical Abstractive Multi-Document Summarization by Guided Continued Pre-training + AhmedElhady + KhaledElsayedCairo University + EnekoAgirreUniversity of the Basque Country (UPV/EHU) + MikelArtetxeReka AI + 755-761 + Factual accuracy is an important property of neural abstractive summarization models, especially in fact-critical domains such as the clinical literature. In this work, we introduce a guided continued pre-training stage for encoder-decoder models that improves their understanding of the factual attributes of documents, which is followed by supervised fine-tuning on summarization. Our approach extends the pre-training recipe of BART to incorporate 3 additional objectives based on PICO spans, which capture the population, intervention, comparison, and outcomes related to a clinical study. Experiments on multi-document summarization in the clinical domain demonstrate that our approach is competitive with prior work, improving the quality and factuality of the summaries and achieving the best-published results in factual accuracy on the MSLR task. 
+ 2024.naacl-short.66 + 2024.naacl-short.66.copyright.pdf + elhady-etal-2024-improving + + + <fixed-case>M</fixed-case>u<fixed-case>L</fixed-case>an: A Study of Fact Mutability in Language Models + ConstanzaFierroCopenhagen University + NicolasGarneau + EmanueleBugliarelloGoogle + YovaKementchedjhievaMohamed bin Zayed University of Artificial Intelligence + AndersSøgaardCopenhagen University + 762-771 + Facts are subject to contingencies and can be true or false in different circumstances. One such contingency is time, wherein some facts mutate over a given period, e.g., the president of a country or the winner of a championship. Trustworthy language models ideally identify mutable facts as such and process them accordingly. We create MuLan, a benchmark for evaluating the ability of English language models to anticipate time-contingency, covering both 1:1 and 1:N relations. We hypothesize that mutable facts are encoded differently than immutable ones, hence being easier to update. In a detailed evaluation of six popular large language models, we consistently find differences in the LLMs’ confidence, representations, and update behavior, depending on the mutability of a fact. Our findings should inform future work on the injection of and induction of time-contingent knowledge to/from LLMs. + 2024.naacl-short.67 + 2024.naacl-short.67.copyright.pdf + fierro-etal-2024-mulan + + + Language-Independent Representations Improve Zero-Shot Summarization + VladimirSolovyev + DanniLiuKarlsruher Institut für Technologie + JanNiehues + 772-782 + Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions. In this work, we focus on summarization and tackle the problem through the lens of language-independent representations. After training on monolingual summarization, we perform zero-shot transfer to new languages or language pairs. 
We first show naively finetuned models are highly language-specific in both output behavior and internal representations, resulting in poor zero-shot performance. Next, we propose query-key (QK) finetuning to decouple task-specific knowledge from the pretrained language generation abilities. Then, after showing downsides of the standard adversarial language classifier, we propose a balanced variant that more directly enforces language-agnostic representations. Moreover, our qualitative analyses show removing source language identity correlates to zero-shot summarization performance. Our code is openly available. + 2024.naacl-short.68 + 2024.naacl-short.68.copyright.pdf + solovyev-etal-2024-language + + + Trusting Your Evidence: Hallucinate Less with Context-aware Decoding + WeijiaShi + XiaochuangHanDepartment of Computer Science, University of Washington + MikeLewisFacebook AI Research + YuliaTsvetkovDepartment of Computer Science, University of Washington + LukeZettlemoyerUniversity of Washington, Facebook and Meta + Wen-tauYihMeta Platforms, Inc. + 783-791 + Language models (LMs) often struggle to pay enough attention to the input context, and generate texts that are unfaithful or contain hallucinations. To mitigate this issue, we present context-aware decoding (CAD), which follows a contrastive output distribution that amplifies the difference between the output probabilities when a model is used with and without context. Our experiments show that CAD, without additional training, significantly improves the faithfulness of different LM families, including OPT, GPT, LLaMA, and FLAN-T5 for summarization tasks (e.g., 14.3% gain for LLaMA in factuality metrics). Furthermore, CAD is particularly effective in overriding a model’s prior knowledge when it contradicts the provided context, leading to substantial improvements in tasks where resolving the knowledge conflict is essential. Our code is publicly released at https://github.com/xhan77/context-aware-decoding. 
+ 2024.naacl-short.69 + 2024.naacl-short.69.copyright.pdf + shi-etal-2024-trusting + + + <fixed-case>G</fixed-case>uy<fixed-case>L</fixed-case>ingo: The <fixed-case>R</fixed-case>epublic of <fixed-case>G</fixed-case>uyana Creole Corpora + ChristopherClarkeUniversity of Michigan - Ann Arbor and University of Guyana + RolandDaynauthUniversity of Michigan - Ann Arbor + JasonMarsUniversity of Michigan - Ann Arbor + CharleneWilkinson + HubertDevonish + 792-798 + While major languages often enjoy substantial attention and resources, the linguistic diversity across the globe encompasses a multitude of smaller, indigenous, and regional languages that lack the same level of computational support. One such region is the Caribbean. While commonly labeled as “English speaking”, the ex-British Caribbean region consists of a myriad of Creole languages thriving alongside English. In this paper, we present Guylingo: a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole), the most widely spoken language in the culturally rich nation of Guyana. We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language. We then demonstrate the challenges of training and evaluating NLP models for machine translation for Creolese. Lastly, we discuss the unique opportunities presented by recent NLP advancements for accelerating the formal adoption of Creole languages as official languages in the Caribbean. + 2024.naacl-short.70 + 2024.naacl-short.70.copyright.pdf + clarke-etal-2024-guylingo + + + <fixed-case>D</fixed-case>ouble<fixed-case>L</fixed-case>ingo: Causal Estimation with Large Language Models + MarkoVeljanovski + ZachWood-DoughtyNorthwestern University + 799-807 + Estimating causal effects from non-randomized data requires assumptions about the underlying data-generating process. 
To achieve unbiased estimates of the causal effect of a treatment on an outcome, we typically adjust for any confounding variables that influence both treatment and outcome. When such confounders include text data, existing causal inference methods struggle due to the high dimensionality of the text. The simple statistical models which have sufficient convergence criteria for causal estimation are not well-equipped to handle noisy unstructured text, but flexible large language models that excel at predictive tasks with text data do not meet the statistical assumptions necessary for causal estimation. Our method enables theoretically consistent estimation of causal effects using LLM-based nuisance models by incorporating them within the framework of Double Machine Learning. On the best available dataset for evaluating such methods, we obtain a 10.4% reduction in the relative absolute error for the estimated causal effect over existing methods. + 2024.naacl-short.71 + 2024.naacl-short.71.copyright.pdf + veljanovski-wood-doughty-2024-doublelingo + + + Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification + MichailMitsios + GeorgiosVamvoukakisSamsung + GeorgiaManiatiSamsung + NikolaosEllinasSamsung + GeorgiosDimitriouSamsung + KonstantinosMarkopoulos + PanosKakoulidisUniversity of Athens + AlexandraVioniInnoetics, Samsung Electronics + MyrsiniChristidou + JunkwangOh + GunuJho + InchulHwang + GeorgiosVardaxoglou + AimiliosChalamandaris + PirrosTsiakoulis + SpyrosRaptis + 808-813 + Emotion detection in textual data has received growing interest in recent years, as it is pivotal for developing empathetic human-computer interaction systems. This paper introduces a method for categorizing emotions from text, which acknowledges and differentiates between the diversified similarities and distinctions of various emotions. Initially, we establish a baseline by training a transformer-based model for standard emotion classification, achieving
state-of-the-art performance. We argue that not all misclassifications are of the same importance, as there are perceptual similarities among emotional classes. We thus redefine the emotion labeling problem by shifting it from a traditional classification model to an ordinal classification one, where discrete emotions are arranged in a sequential order according to their valence levels. Finally, we propose a method that performs ordinal classification in the two-dimensional emotion space, considering both valence and arousal scales. The results show that our approach not only preserves high accuracy in emotion prediction but also significantly reduces the magnitude of errors in cases of misclassification. + 2024.naacl-short.72 + 2024.naacl-short.72.copyright.pdf + mitsios-etal-2024-improved + + + On Narrative Question Answering Skills + EmilKalbaliyev + KairitSirtsinstitute of computer science, University of Tartu + 814-820 + Narrative Question Answering is an important task for evaluating and improving reading comprehension abilities in both humans and machines. However, there is a lack of consensus on the skill taxonomy that would enable systematic and comprehensive assessment and learning of the various aspects of Narrative Question Answering. Existing task-level skill views oversimplify the multidimensional nature of tasks, while question-level taxonomies face issues in evaluation and methodology. To address these challenges, we introduce a more inclusive skill taxonomy that synthesizes and redefines narrative understanding skills from previous taxonomies and includes a generation skill dimension from the answering perspective.
+ 2024.naacl-short.73 + 2024.naacl-short.73.copyright.pdf + kalbaliyev-sirts-2024-narrative + + + Order-Based Pre-training Strategies for Procedural Text Understanding + AbhilashNandyIndian Institute of Technology Kharagpur + YashKulkarni + PawanGoyalIIT Kharagpur + NiloyGangulyIndian Institute of Technology Kharagpur, + 821-828 + In this paper, we propose sequence-based pre-training methods to enhance procedural understanding in natural language processing. Procedural text, containing sequential instructions to accomplish a task, is difficult to understand due to the changing attributes of entities in the context. We focus on recipes as they are commonly represented as ordered instructions, and use this order as a supervision signal. Our work is one of the first to compare several ‘order-as-supervision’ transformer pre-training methods, including Permutation Classification, Embedding Regression, and Skip-Clip, and show that these methods give improved results compared to baselines and SoTA LLMs on two downstream Entity-Tracking datasets: NPN-Cooking dataset in recipe domain and ProPara dataset in open domain. Our proposed methods address the non-trivial Entity Tracking Task that requires prediction of entity states across procedure steps, which requires understanding the order of steps. These methods show an improvement over the best baseline by 1.6% and 7-9% on NPN-Cooking and ProPara Datasets respectively across metrics. + 2024.naacl-short.74 + 2024.naacl-short.74.copyright.pdf + nandy-etal-2024-order + + + Breaking the Language Barrier: Can Direct Inference Outperform Pre-Translation in Multilingual <fixed-case>LLM</fixed-case> Applications? 
+ YotamIntrator + MatanHalfon + RomanGoldenberg + ReutTsarfatyGoogle and Bar-Ilan University, Technion + MatanEyalAllen Institute for Artificial Intelligence + EhudRivlinTechnion, Technion + YossiMatiasGoogle and Tel Aviv University + NataliaAizenberg + 829-844 + Large language models hold significant promise in multilingual applications. However, inherent biases stemming from predominantly English-centric pre-training have led to the widespread practice of pre-translation, i.e., translating non-English inputs to English before inference, leading to complexity and information loss. This study re-evaluates the need for pre-translation in the context of PaLM2 models, which have been established as highly performant in multilingual tasks. We offer a comprehensive investigation across 108 languages and 6 diverse benchmarks, including open-end generative tasks, which were excluded from previous similar studies. Our findings challenge the pre-translation paradigm established in prior research, highlighting the advantages of direct inference in PaLM2. Specifically, PaLM2-L consistently outperforms pre-translation in 94 out of 108 languages. These findings pave the way for more efficient and effective multilingual applications, alleviating the limitations associated with pre-translation and unlocking linguistic authenticity. + 2024.naacl-short.75 + 2024.naacl-short.75.copyright.pdf + intrator-etal-2024-breaking + +
+ + + Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations) + Kai-WeiChang + AnnieLee + NazneenRajani + Association for Computational Linguistics +
Mexico City, Mexico
+ June + 2024 + 2024.naacl-demo + naacl + + + 2024.naacl-demo.0 + naacl-2024-2024-north-american + + + <fixed-case>TOPICAL</fixed-case>: <fixed-case>TOPIC</fixed-case> Pages <fixed-case>A</fixed-case>utomagica<fixed-case>L</fixed-case>ly + JohnGiorgi + AmanpreetSinghAllen Institute for Artificial Intelligence + DougDowneyAllen Institute for Artificial Intelligence and Northwestern University + SergeyFeldmanAllen Institute for Artificial Intelligence and Data Cowboys + LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence + 1-11 + Topic pages aggregate useful information about an entity or concept into a single succinct and accessible article. Automated creation of topic pages would enable their rapid curation as information resources, providing an alternative to traditional web search. While most prior work has focused on generating topic pages about biographical entities, in this work, we develop a completely automated process to generate high-quality topic pages for scientific entities, with a focus on biomedical concepts. We release TOPICAL, a web app and associated open-source code, comprising a model pipeline combining retrieval, clustering, and prompting, that makes it easy for anyone to generate topic pages for a wide variety of biomedical entities on demand. In a human evaluation of 150 diverse topic pages generated using TOPICAL, we find that the vast majority were considered relevant, accurate, and coherent, with correct supporting citations. We make all code publicly available and host a free-to-use web app at: https://s2-topical.apps.allenai.org. 
+ 2024.naacl-demo.1 + giorgi-etal-2024-topical + + + Low-code <fixed-case>LLM</fixed-case>: Graphical User Interface over Large Language Models + YuzheCai + ShaoguangMaoMicrosoft + WenshanWuMicrosoft + ZehuaWang + YaoboLiang + TaoGe + ChenfeiWuMicrosoft + WangYouWangYou + TingSong + YanXiaResearch, Microsoft + NanDuanMicrosoft Research Asia + FuruWeiMicrosoft Research + 12-25 + Utilizing Large Language Models (LLMs) for complex tasks is challenging, often involving a time-consuming and uncontrollable prompt engineering process. This paper introduces a novel human-LLM interaction framework, Low-code LLM. It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses. Through visual interaction with a graphical user interface, users can incorporate their ideas into the process without writing trivial prompts. The proposed Low-code LLM framework consists of a Planning LLM that designs a structured planning workflow for complex tasks, which can be correspondingly edited and confirmed by users through low-code visual programming operations, and an Executing LLM that generates responses following the user-confirmed workflow. We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability. We demonstrate its benefits using four typical applications. By introducing this framework, we aim to bridge the gap between humans and LLMs, enabling more effective and efficient utilization of LLMs for complex tasks. The code, prompts, and experimental details are available at https://github.com/moymix/TaskMatrix/tree/main/LowCodeLLM. A system demonstration video can be found at https://www.youtube.com/watch?v=jb2C1vaeO3E. 
+ 2024.naacl-demo.2 + cai-etal-2024-low-code + + + <fixed-case>E</fixed-case>d<fixed-case>T</fixed-case>ec-<fixed-case>QB</fixed-case>uilder: A Semantic Retrieval Tool for Assembling Vocational Training Exams in <fixed-case>G</fixed-case>erman Language + AlonsoPalominoUniversität Bielefeld + AndreasFischer + JakubKuzilekGerman Research Center for AI and Humboldt Universität Berlin + JarekNitsch + NielsPinkwartGerman Research Center for AI + BenjaminPaassen + 26-35 + Selecting and assembling test items from a validated item database into comprehensive exam forms is an under-researched but significant challenge in education. Search and retrieval methods provide a robust framework to assist educators when filtering and assembling relevant test items. In this work, we present EdTec-QBuilder, a semantic search tool developed to assist vocational educators in assembling exam forms. To implement EdTec-QBuilder’s core search functionality, we evaluated eight retrieval strategies and twenty-five popular pre-trained sentence similarity models. Our evaluation revealed that employing cross-encoders to re-rank an initial list of relevant items is best for assisting vocational trainers in assembling examination forms. Beyond topic-based exam assembly, EdTec-QBuilder aims to provide a crowdsourcing infrastructure enabling manual exam assembly data collection, which is critical for future research and development in assisted and automatic exam assembly models. 
+ 2024.naacl-demo.3 + palomino-etal-2024-edtec + + + <fixed-case>DIALIGHT</fixed-case>: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models + SongboHuLanguage Technology Lab, University of Cambridge + XiaobinWang + MoyYuanUniversity of Cambridge + AnnaKorhonenUniversity of Cambridge + IvanVulićUniversity of Cambridge and PolyAI Limited + 36-52 + We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems which facilitates systematic evaluations and comparisons between ToD systems using fine-tuning of Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of Large Language Models (LLMs). In addition to automatic evaluation, this toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level, and (ii) a microservice-based backend, improving efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses. However, we also identify significant challenges of LLMs in adherence to task-specific instructions and generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems and will lower, currently still high, entry barriers in the field. + 2024.naacl-demo.4 + hu-etal-2024-dialight + + + <fixed-case>RTSUM</fixed-case>: Relation Triple-based Interpretable Summarization with Multi-level Salience Visualization + SeonglaeCho + MyunghaJangPinterest Inc. 
+ JinyoungYeoYonsei University + DonghaLeeYonsei University + 53-60 + In this paper, we present RTSum, an unsupervised summarization framework that utilizes relation triples as the basic unit for summarization. Given an input document, RTSum first selects salient relation triples via multi-level salience scoring and then generates a concise summary from the selected relation triples by using a text-to-text language model. On the basis of RTSum, we also develop a web demo for an interpretable summarizing tool, providing fine-grained interpretations with the output summary. With support for customization options, our tool visualizes the salience for textual units at three distinct levels: sentences, relation triples, and phrases. The code, demo, and video are publicly available. + 2024.naacl-demo.5 + cho-etal-2024-rtsum + + + Edu-<fixed-case>C</fixed-case>onvo<fixed-case>K</fixed-case>it: An Open-Source Library for Education Conversation Data + RoseWangStanford University + DorottyaDemszkyStanford University + 61-69 + We introduce Edu-ConvoKit, an open-source library designed to handle pre-processing, annotation and analysis of conversation data in education. Resources for analyzing education conversation data are scarce, making the research challenging to perform and therefore hard to access. We address these challenges with Edu-ConvoKit. Edu-ConvoKit is open-source [1], pip-installable [2], with comprehensive documentation [3]. Our demo video is available at: https://youtu.be/zdcI839vAko?si=h9qlnl76ucSuXb8-. 
We include additional resources, such as Colab applications of Edu-ConvoKit to three diverse education datasets [4] and a repository of Edu-ConvoKit-related papers [5]. [1] https://github.com/stanfordnlp/edu-convokit [2] https://pypi.org/project/edu-convokit/ [3] https://edu-convokit.readthedocs.io/en/latest/ [4] https://github.com/stanfordnlp/edu-convokit?tab=readme-ov-file#datasets-with-edu-convokit [5] https://github.com/stanfordnlp/edu-convokit/blob/main/papers.md + 2024.naacl-demo.6 + wang-demszky-2024-edu + + + jp-evalb: Robust Alignment-based <fixed-case>PARSEVAL</fixed-case> Measures + JungyeulParkUniversity of British Columbia + JunruiWang + EunkyulJo + AngelaParkUniversity of British Columbia + 70-77 + We introduce an evaluation system designed to compute PARSEVAL measures, offering a viable alternative to evalb commonly used for constituency parsing evaluation. The widely used evalb script has traditionally been employed for evaluating the accuracy of constituency parsing results, albeit with the requirement for consistent tokenization and sentence boundaries. In contrast, our approach, named jp-evalb, is founded on an alignment method. This method aligns sentences and words when discrepancies arise. It aims to overcome several known issues associated with evalb by utilizing the ‘jointly preprocessed (JP)’ alignment-based method. We introduce a more flexible and adaptive framework, ultimately contributing to a more accurate assessment of constituency parsing performance. 
+ 2024.naacl-demo.7 + park-etal-2024-jp + + + <fixed-case>O</fixed-case>pinion<fixed-case>GPT</fixed-case>: Modelling Explicit Biases in Instruction-Tuned <fixed-case>LLM</fixed-case>s + PatrickHallerHumboldt Universität Berlin + AnsarAynetdinovDepartment of Computer Science, Humboldt University Berlin, Humboldt Universität Berlin + AlanAkbikHumboldt Universität Berlin + 78-86 + Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable ability to generate fitting responses to natural language instructions. However, an open research question concerns the inherent biases of trained models and their responses. For instance, if the data used to tune an LLM is dominantly written by persons with a specific political bias, we might expect generated answers to share this bias. Current research work seeks to de-bias such models, or suppress potentially biased answers. With this demonstration, we take a different view on biases in instruction-tuning: Rather than aiming to suppress them, we aim to make them explicit and transparent. To this end, we present OpinionGPT, a web demo in which users can ask questions and select all biases they wish to investigate. The demo will answer this question using a model fine-tuned on text representing each of the selected biases, allowing side-by-side comparison. To train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics. This paper presents OpinionGPT, illustrates how we trained the bias-aware model and showcases the web application (available at https://opiniongpt.informatik.hu-berlin.de). 
+ 2024.naacl-demo.8 + haller-etal-2024-opiniongpt + + + <fixed-case>ATLAS</fixed-case>: A System for <fixed-case>PDF</fixed-case>-centric Human Interaction Data Collection + AlexaSiuAdobe + ZichaoWangAdobe Research + JoshuaHoeflich + NamanKapasi + AniNenkovaAdobe Research + TongSunAdobe Systems + 87-96 + The Portable Document Format (PDF) is a popular format for distributing digital documents. Datasets on PDF reading behaviors and interactions remain limited due to the challenges of instrumenting PDF readers for these data collection tasks. We present ATLAS, a data collection tool designed to better support researchers in collecting rich PDF-centric datasets from users. ATLAS supports researchers in programmatically creating a user interface for data collection that is ready to share with annotators. It includes a toolkit and an extensible schema to easily customize the data collection tasks for a variety of purposes, allowing collection of PDF annotations (e.g., highlights, drawings) as well as reading behavior analytics (e.g., page scroll, text selections). We open-source ATLAS to support future research efforts and review use cases of ATLAS that showcase our system’s broad applicability. + 2024.naacl-demo.9 + siu-etal-2024-atlas + + + <fixed-case>B</fixed-case>e<fixed-case>L</fixed-case>eaf: Belief Prediction as Tree Generation + JohnMurzaku, State University of New York at Stony Brook + OwenRambowStony Brook University + 97-106 + We present a novel approach to predicting source-and-target factuality by transforming it into a linearized tree generation task. Unlike previous work, our model and representation format fully account for the factuality tree structure, generating the full chain of nested sources instead of the last source only. Furthermore, our linearized tree representation significantly compresses the amount of tokens needed compared to other representations, allowing for fully end-to-end systems. 
We achieve state-of-the-art results on FactBank and the Modal Dependency Corpus, which are both corpora annotating source-and-target event factuality. Our results on fine-tuning validate the strong generality of the proposed linearized tree generation task, which can be easily adapted to other corpora with a similar structure. We then present BeLeaf, a system which directly leverages the linearized tree representation to create both sentence level and document level visualizations. Our system adds several missing pieces to the source-and-target factuality task such as coreference resolution and event head word to syntactic span conversion. Our demo code is available on https://github.com/yurpl/beleaf and our video is available on https://youtu.be/SpbMNnin-Po. + 2024.naacl-demo.10 + murzaku-rambow-2024-beleaf + + + <fixed-case>Q</fixed-case>uery<fixed-case>E</fixed-case>xplorer: An Interactive Query Generation Assistant for Search and Exploration + KaustubhDholeBITS Pilani + ShivamBajaj + RamrajChandradevan + EugeneAgichteinAmazon and Emory University + 107-115 + Formulating effective search queries remains a challenging task, particularly when users lack expertise in a specific domain or are not proficient in the language of the content. Providing example documents of interest might be easier for a user. However, such query-by-example scenarios are prone to concept drift, and the retrieval effectiveness is highly sensitive to the query generation method, without a clear way to incorporate user feedback. To enable exploration and to support Human-In-The-Loop experiments, we propose QueryExplorer, an interactive query generation, reformulation, and retrieval interface with support for HuggingFace generation models and PyTerrier’s retrieval pipelines and datasets, and extensive logging of human feedback. 
To allow users to create and modify effective queries, our demo supports complementary approaches of using LLMs interactively, assisting the user with edits and feedback at multiple stages of the query formulation process. With support for recording fine-grained interactions and user annotations, QueryExplorer can serve as a valuable experimental and research platform for annotation, qualitative evaluation, and conducting Human-in-the-Loop (HITL) experiments for complex search tasks where users struggle to formulate queries. + 2024.naacl-demo.11 + dhole-etal-2024-queryexplorer + + + <fixed-case>LMF</fixed-case>low: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models + ShizheDiaoHong Kong University of Science and Technology + RuiPanThe Hong Kong University of Science and Technology + HanzeDongSalesForce.com + KaShunShum + JipengZhang + WeiXiong + TongZhangUIUC + 116-127 + Foundation models have demonstrated a great ability to achieve general human-level intelligence far beyond traditional approaches. As the technique keeps attracting attention from the AI community, more and more foundation models have become publicly available. However, most of those models exhibit a major deficiency in specialized-domain and specialized-task applications, where the step of domain- and task-aware finetuning is still required to obtain scientific language models. As the number of available foundation models and specialized tasks keeps growing, the job of training scientific language models becomes highly nontrivial. In this paper, we take the first step to address this issue. 
We introduce an extensible and lightweight toolkit, LMFlow, which aims to simplify the domain- and task-aware finetuning of general foundation models. LMFlow offers a complete finetuning workflow for a foundation model to support specialized training with limited computing resources. Furthermore, it supports continuous pretraining, instruction tuning, parameter-efficient finetuning, alignment tuning, inference acceleration, long context generalization, model customization, and even multimodal finetuning, along with carefully designed and extensible APIs. This toolkit has been thoroughly tested and is available at https://github.com/OptimalScale/LMFlow. + 2024.naacl-demo.12 + diao-etal-2024-lmflow + + + <fixed-case>DOCMASTER</fixed-case>: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering + AlexNguyen + ZilongWangUniversity of California, San Diego + JingboShangUniversity of California, San Diego + DheerajMekalaUniversity of California, San Diego + 128-136 + The application of natural language processing models to PDF documents is pivotal for various business applications yet the challenge of training models for this purpose persists in businesses due to specific hurdles. These include the complexity of working with PDF formats that necessitate parsing text and layout information for curating training data and the lack of privacy-preserving annotation tools. This paper introduces DOCMASTER, a unified platform designed for annotating PDF documents, model training, and inference, tailored to document question-answering. The annotation interface enables users to input questions and highlight text spans within the PDF file as answers, saving layout information and text spans accordingly. Furthermore, DOCMASTER supports both state-of-the-art layout-aware and text models for comprehensive training purposes. Importantly, as annotations, training, and inference occur on-device, it also safeguards privacy. 
The platform has been instrumental in driving several research prototypes concerning document analysis such as the AI assistant utilized by University of California San Diego’s (UCSD) International Services and Engagement Office (ISEO) for processing a substantial volume of PDF documents. + 2024.naacl-demo.13 + nguyen-etal-2024-docmaster + + + <fixed-case>R</fixed-case>ed<fixed-case>C</fixed-case>oast: A Lightweight Tool to Automate Distributed Training of <fixed-case>LLM</fixed-case>s on Any <fixed-case>GPU</fixed-case>/<fixed-case>TPU</fixed-case>s + BowenTanCarnegie Mellon University + YunZhuGoogle + LijuanLiu + HongyiWangCMU, Carnegie Mellon University + YonghaoZhuangCMU, Carnegie Mellon University + JindongChenGoogle + EricXingMohamed bin Zayed University of AI and School of Computer Science, Carnegie Mellon University + ZhitingHuUniversity of California, San Diego and Amazon + 137-147 + The recent progress of AI can be largely attributed to large language models (LLMs). However, their escalating memory requirements introduce challenges for machine learning (ML) researchers and engineers. Addressing this requires developers to partition a large model to distribute it across multiple GPUs or TPUs. This necessitates considerable coding and intricate configuration efforts with existing model parallel tools, such as Megatron-LM, DeepSpeed, and Alpa. These tools require users’ expertise in machine learning systems (MLSys), creating a bottleneck in LLM development, particularly for developers without MLSys background. In this work, we present RedCoast (Redco), a lightweight and user-friendly tool crafted to automate distributed training and inference for LLMs, as well as to simplify ML pipeline development. The design of Redco emphasizes two key aspects. Firstly, to automate model parallelism, our study identifies two straightforward rules to generate tensor parallel strategies for any given LLM. 
Integrating these rules into Redco facilitates effortless distributed LLM training and inference, eliminating the need for additional coding or complex configurations. We demonstrate the effectiveness by applying Redco on a set of LLM architectures, such as GPT-J, LLaMA, T5, and OPT, up to the size of 66B. Secondly, we propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions, avoiding redundant and formulaic code like multi-host related processing. This mechanism proves adaptable across a spectrum of ML algorithms, from foundational language modeling to complex algorithms like meta-learning and reinforcement learning. As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts. RedCoast (Redco) has been released under Apache 2.0 license at https://github.com/tanyuqian/redco. + 2024.naacl-demo.14 + tan-etal-2024-redcoast + + + Concept Over Time Analysis: Unveiling Temporal Patterns for Qualitative Data Analysis + TimFischerUniversity of Hamburg + FlorianSchneiderUniversität Hamburg + RobertGeislinger + FlorianHelfer + GertraudKoch + ChrisBiemannU Hamburg + 148-157 + In this system demonstration paper, we present the Concept Over Time Analysis extension for the Discourse Analysis Tool Suite. The proposed tool empowers users to define, refine, and visualize their concepts of interest within an interactive interface. Adhering to the Human-in-the-loop paradigm, users can give feedback through sentence annotations. Utilizing few-shot sentence classification, the system employs Sentence Transformers to compute representations of sentences and concepts. Through an iterative process involving semantic similarity searches, sentence annotation, and fine-tuning with contrastive data, the model continuously refines, providing users with enhanced analysis outcomes. The final output is a timeline visualization of sentences classified to concepts. 
Especially suited for the Digital Humanities, Concept Over Time Analysis serves as a valuable tool for qualitative data analysis within extensive datasets. The chronological overview of concepts enables researchers to uncover patterns, trends, and shifts in discourse over time. + 2024.naacl-demo.15 + fischer-etal-2024-concept + + + pyvene: A Library for Understanding and Improving <fixed-case>P</fixed-case>y<fixed-case>T</fixed-case>orch Models via Interventions + ZhengxuanWuStanford University + AtticusGeigerPr(Ai)²R Group + AryamanArora + JingHuangStanford University + ZhengWangStanford University + NoahGoodmanStanford University + ChristopherManningComputer Science Department, Stanford University + ChristopherPottsStanford University + 158-165 + Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce pyvene, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. pyvene supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters. We show how pyvene provides a unified and extensible framework for performing interventions on neural models and sharing the intervened upon models with others. We illustrate the power of the library via interpretability analyses using causal abstraction and knowledge localization. We publish our library through Python Package Index (PyPI) and provide code, documentation, and tutorials at ‘https://github.com/stanfordnlp/pyvene‘. 
+ 2024.naacl-demo.16 + wu-etal-2024-pyvene + + + Newspaper Signaling for Crisis Prediction + PrajviSaxenaGerman Research Center for AI + SabineJanzen + WolfgangMaassUniversität des Saarlandes + 166-173 + To establish sophisticated monitoring of newspaper articles for detecting crisis-related signals, natural language processing has to cope with unstructured data, media, and cultural bias as well as multiple languages. So far, research on detecting signals in newspaper articles is focusing on structured data, restricted language settings, and isolated application domains. When considering complex crisis-related signals, a high number of diverse newspaper articles in terms of language and culture reduces potential biases. We demonstrate MENDEL – a model for multi-lingual and open-domain newspaper signaling for detecting crisis-related indicators in newspaper articles. The model works with unstructured news data and combines multiple transformer-based models for pre-processing (STANZA) and content filtering (RoBERTa, GPT-3.5). Embedded in a Question-Answering (QA) setting, MENDEL supports multiple languages (>66) and can detect early newspaper signals for open crisis domains in real-time. + 2024.naacl-demo.17 + saxena-etal-2024-newspaper + + + <fixed-case>F</fixed-case>ast<fixed-case>F</fixed-case>it: Fast and Effective Few-Shot Text Classification with a Multitude of Classes + AsafYehudai + ElronBandelInternational Business Machines + 174-184 + We present FastFit, a Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity score. 
Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multi-class classification performance in speed and accuracy across various English and Multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub, presenting a user-friendly solution for NLP practitioners. + 2024.naacl-demo.18 + yehudai-bandel-2024-fastfit + + + <fixed-case>A</fixed-case>gent<fixed-case>Q</fixed-case>uest: A Modular Benchmark Framework to Measure Progress and Improve <fixed-case>LLM</fixed-case> Agents + LucaGioacchini + GiuseppeSiracusanoNEC + DavideSanvito + KirilGashteovskiNEC Laboratories Europe, St.Cyril and Methodius University and NEC Laboratories Europe + DavidFriedeUniversität Mannheim + RobertoBifulcoNEC + CarolinLawrenceNEC Laboratories Europe and NEC Laboratories Europe + 185-193 + The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key corner stones to efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To face these issues, we propose AgentQuest – a framework where (i) both benchmarks and metrics are modular and easily extensible through well documented and easy-to-use APIs; (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further and therefore we make it available under https://github.com/nec-research/agentquest. 
+ 2024.naacl-demo.19 + gioacchini-etal-2024-agentquest + + + <fixed-case>Z</fixed-case>hu<fixed-case>J</fixed-case>iu-Knowledge: A Fairer Platform for Evaluating Multiple Knowledge Types in Large Language Models + PengfanDu + SiruiLiang + BaoliZhang + PengfeiCaoInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + YuboChenInstitute of automation, Chinese academy of science + KangLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + JunZhao + 194-206 + The swift advancement in large language models (LLMs) has heightened the importance of model evaluations. LLMs have acquired a substantial amount of knowledge, and evaluating the knowledge of these LLMs is crucial. To address this, we introduce the ZhuJiu-Knowledge benchmark which carefully considers the following factors: (1) For knowledge scope, we concentrate on three domains: commonsense knowledge, world knowledge, language knowledge, which comes from ATOMIC, Conceptnet, Wikidata, and Wordnet. (2) For data construction, to prevent data contamination, we utilize knowledge derived from corpora and knowledge graphs to formulate novel questions which are ensured not to appear in the training corpus. A multitude of prompts is purposefully devised to mitigate the impact of prompt design on evaluation and to further analyze the LLMs’ sensitivity to various prompts. (3) For evaluation criteria, we propose a novel voting methodology for assessing generative text, aligning the model’s evaluation with human preferences to reduce biases inherent in individual model assessments. We evaluate 14 current mainstream LLMs and conduct a comprehensive discussion and analysis of their results. The ZhuJiu-Knowledge benchmark and open-participation leaderboard are publicly released at http://zhujiu-knowledge.top and we also provide a demo video at https://youtu.be/QJp4qlEHVH8. 
+ 2024.naacl-demo.20 + du-etal-2024-zhujiu + + + Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative <fixed-case>AI</fixed-case> + ElronBandelInternational Business Machines + YotamPerlitzInternational Business Machines + EladVenezianInternational Business Machines + RoniFriedmanIBM Research + OfirArvivHebrew University of Jerusalem and Computer Science Department, Technion-Israel Institute of Technology + MatanOrbachInternational Business Machines + ShacharDon-YehiyaHebrew University of Jerusalem and International Business Machines + DafnaSheinwaldIBM Research + ArielGeraInternational Business Machines + LeshemChoshenInternational Business Machines + MichalShmueli-Scheuer + YoavKatzInternational Business Machines + 207-215 + In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. 
Join the Unitxt community at https://github.com/IBM/unitxt + 2024.naacl-demo.21 + bandel-etal-2024-unitxt + +
+ + + Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop) + Yang (Trista)Cao + IsabelPapadimitriou + AnaeliaOvalle + Association for Computational Linguistics +
Mexico City, Mexico
+ June + 2024 + 2024.naacl-srw + naacl + + + 2024.naacl-srw.0 + naacl-2024-2024-north-american-chapter + + + Systematic Analysis for Pretrained Language Model Priming for Parameter-Efficient Fine-tuning + Shih-ChengHuang + Shih-HengWangNational Taiwan University + Min-HanShih + SauravSahayIntel + Hung-yiLeeNational Taiwan University + 1-7 + Parameter-efficient (PE) methods (like Prompts or Adapters) for adapting pre-trained language models (PLM) to downstream tasks have been popular recently. However, hindrances still prevent these methods from reaching their full potential. For example, two significant challenges are few-shot adaptation and cross-task generalization. To tackle these issues, we propose a general PE priming framework to enhance and explore the few-shot adaptation and generalization ability of PE methods. In this framework, PLMs are primed with PE methods for rapidly adapting to various target tasks. To evaluate the generalization ability of these PE methods, we conduct experiments on a few-shot cross-domain benchmark containing 160 diverse NLP tasks. Our experiment not only reveals the best priming strategy but also verifies that priming facilitates the adaptation to target tasks. + 2024.naacl-srw.1 + huang-etal-2024-systematic + + + Rephrasing Invokes Better Generations for Large Language Models + HaoranYang + HongyuanLu + WaiLamThe Chinese University of Hong Kong + 8-15 + In the realm of emerging multitasking abilities of Large language models (LLMs), methodologies like prompt tuning enable low-cost adaptation to downstream tasks without retraining the model. However, automatic input pre-processing when LLMs are unavailable is currently under-studied. This paper proposes ReLLM (Rephrasing for LLMs), a method that automatically paraphrases input content for better output generations. ReLLM replaces low-frequency lexical items with their high-frequency counterparts. 
This substitution is particularly beneficial for low-resource language tasks that lack sufficient training data and resources. ReLLM is user-friendly and requires no additional LLM training. Experimental results in cross-lingual summarization, and natural language inference demonstrate the effectiveness of ReLLM. + 2024.naacl-srw.2 + yang-etal-2024-rephrasing + + + Exploring Compositional Generalization of Large Language Models + HaoranYang + HongyuanLu + WaiLamThe Chinese University of Hong Kong + DengCaiTencent AI Lab + 16-24 + In this paper, we study the generalization ability of large language models (LLMs) with respect to compositional instructions, which are instructions that can be decomposed into several sub-instructions. We argue that the ability to generalize from simple instructions to more intricate compositional instructions represents a key aspect of the out-of-distribution generalization for LLMs. Since there are no specialized datasets for studying this phenomenon, we first construct a dataset with the help of ChatGPT, guided by the self-instruct technique. Then, we fine-tune and evaluate LLMs on these datasets. Interestingly, our experimental results indicate that training LLMs on higher-order compositional instructions enhances their performance on lower-order ones, but the reverse does not hold true. + 2024.naacl-srw.3 + yang-etal-2024-exploring + + + Explainable <fixed-case>CED</fixed-case>: A Dataset for Explainable Critical Error Detection in Machine Translation + DahyunJungKorea University + SugyeongEoKorea University + ChanjunParkUpstage + HeuiseokLimKorea University + 25-35 + Critical error detection (CED) in machine translation is a task that aims to detect errors that significantly distort the intended meaning. However, the existing study of CED lacks explainability due to the absence of content addressing the reasons for catastrophic errors. 
To address this limitation, we propose Explainable CED, a dataset that introduces the attributes of error explanation and correction regarding critical errors. Considering the advantage of reducing time costs and mitigating human annotation bias, we leverage a large language model in the data construction process. To improve the quality of the dataset and mitigate hallucination, we compare responses from the model and introduce an additional data filtering method through feedback scoring. The experiment demonstrates that the dataset appropriately reflects a consistent explanation and revision for errors, validating the reliability of the dataset. + 2024.naacl-srw.4 + jung-etal-2024-explainable + + + <fixed-case>SMARTR</fixed-case>: A Framework for Early Detection using Survival Analysis of Longitudinal Texts + Jean-ThomasBaillargeon + LucLamontagneLaval University + 36-41 + This paper presents an innovative approach to the early detection of expensive insurance claims by leveraging survival analysis concepts within a deep learning framework exploiting textual information from claims notes. Our proposed SMARTR model addresses limitations of state-of-the-art models, such as handling data-label mismatches and non-uniform data frequency, to enhance a posteriori classification and early detection. Our results suggest that incorporating temporal dynamics and empty period representation improves model performance, highlighting the importance of considering time in insurance claim analysis. The approach appears promising for application to other insurance datasets. + 2024.naacl-srw.5 + baillargeon-lamontagne-2024-smartr + + + Fast Exact Retrieval for Nearest-neighbor Lookup (<fixed-case>FERN</fixed-case>) + RichardZhuPrinceton University + 42-47 + Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling — vector retrieval — can be computationally complex. 
This is exacerbated when retrieving vectors which have high-dimension d relative to the number of vectors, N, in the database. Exact nearest neighbor retrieval has been generally acknowledged to be a O(Nd) problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. d=1 vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimension (e.g. d=2 or d=3 cases), kd-trees provide a O(d\log N) algorithm for retrieval. Unfortunately the algorithm deteriorates rapidly to a O(dN) solution at high dimensions (e.g. k=128), in practice. We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by kd-trees. The algorithm achieves O(d\log N) look-up with 100% recall on 10 million d=128 uniformly randomly generated vectors. 
+ 2024.naacl-srw.6 + zhu-2024-fast + + + Start Simple: Progressive Difficulty Multitask Learning + YunfeiLuo + YuyangLiuYale University + RukaiCai + TauhidurRahmanUniversity of California, San Diego + 48-55 + The opaque nature of neural networks, often described as black boxes, poses significant challenges in understanding their learning mechanisms, which limit our ability to fully optimize and trust these models. Inspired by how humans learn, this paper proposes a novel neural network training strategy that employs multitask learning with progressive difficulty subtasks, which we believe can potentially shed light on the internal learning mechanisms of neural networks. We implemented this strategy across a range of NLP tasks, data sets, and neural network architectures and observed notable improvements in model performance. This suggests that neural networks may be able to extract common features and internalize shared representations across similar subtasks that differ in their difficulty. Analyzing this strategy could lead us to more interpretable and robust neural networks, enhancing both their performance and our understanding of their nature. + 2024.naacl-srw.7 + luo-etal-2024-start + + + <fixed-case>LUCID</fixed-case>: <fixed-case>LLM</fixed-case>-Generated Utterances for Complex and Interesting Dialogues + JoeStaceyImperial College London + JianpengCheng + JohnTorrApple + TristanGuigueApple + JorisDriesenApple + AlexandruCoca + MarkGaynorApple + AndersJohannsen + 56-74 + Spurred by recent advances in Large Language Models (LLMs), virtual assistants are poised to take a leap forward in terms of their dialogue capabilities. Yet a major bottleneck to achieving genuinely transformative task-oriented dialogue capabilities remains the scarcity of high quality data. 
Existing datasets, while impressive in scale, have limited domain coverage and contain few genuinely challenging conversational phenomena; those which are present are typically unlabelled, making it difficult to assess the strengths and weaknesses of models without time-consuming and costly human evaluation. Moreover, creating high quality dialogue data has until now required considerable human input, limiting both the scale of these datasets and the ability to rapidly bootstrap data for a new target domain. We aim to overcome these issues with LUCID, a modularised and highly automated LLM-driven data generation system that produces realistic, diverse and challenging dialogues. We use LUCID to generate a seed dataset of 4,277 conversations across 100 intents to demonstrate its capabilities, with a human review finding consistently high quality labels in the generated data. + 2024.naacl-srw.8 + stacey-etal-2024-lucid + + + Fine-tuning Pre-trained Named Entity Recognition Models For <fixed-case>I</fixed-case>ndian Languages + SankalpBahad + PruthwikMishraIIIT-Hyderabad + ParameswariKrishnamurthy + DiptiSharmaIIIT Hyderabad + 75-82 + Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present a human annotated named entity corpora of ∼40K sentences for 4 Indian languages from two of the major Indian language families. Additionally, we show the transfer learning capabilities of pre-trained transformer models from a high resource language to multiple low resource languages through a series of experiments. 
We also present a multilingual model fine-tuned on our dataset, which achieves an F1 score of ∼0.80 on our dataset on average. We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model. + 2024.naacl-srw.9 + bahad-etal-2024-fine + + + Knowledge-centered conversational agents with a drive to learn + SeleneBaez Santamaria + 83-92 + We create an adaptive conversational agent that assesses the quality of its knowledge and is driven to become more knowledgeable. Unlike agents with predefined tasks, ours can leverage people as diverse sources to meet its knowledge needs. We test the agent in social contexts, where personal and subjective information can be obtained through dialogue. We provide the agent both with generic methods for assessing its knowledge quality (e.g. correctness, completeness, redundancy, interconnectedness, and diversity), as well as with generic capabilities to improve its knowledge by leveraging external sources. We demonstrate that the agent can learn effective policies to acquire the knowledge needed by assessing the efficiency of these capabilities during interaction. Our framework enables on-the-fly learning, offering a dynamic and adaptive approach to shaping conversational interactions. + 2024.naacl-srw.10 + baez-santamaria-2024-knowledge + + + Exploring Inherent Biases in <fixed-case>LLM</fixed-case>s within <fixed-case>K</fixed-case>orean Social Context: A Comparative Analysis of <fixed-case>C</fixed-case>hat<fixed-case>GPT</fixed-case> and <fixed-case>GPT</fixed-case>-4 + SeungyoonLeeKorea University + DongKimKorea University + DahyunJungKorea University + ChanjunParkUpstage + HeuiseokLimKorea University + 93-104 + Large Language Models (LLMs) have significantly impacted various fields requiring advanced linguistic understanding, yet concerns regarding their inherent biases and ethical considerations have also increased. 
Notably, LLMs have been critiqued for perpetuating stereotypes against diverse groups based on race, sexual orientation, and other attributes. However, most research analyzing these biases has predominantly focused on communities where English is the primary language, neglecting to consider the cultural and linguistic nuances of other societies. In this paper, we aim to explore the inherent biases and toxicity of LLMs, specifically within the social context of Korea. We devise a set of prompts that reflect major societal issues in Korea and assign varied personas to both ChatGPT and GPT-4 to assess the toxicity of the generated sentences. Our findings indicate that certain personas or prompt combinations consistently yield harmful content, highlighting the potential risks associated with specific persona-issue alignments within the Korean cultural framework. Furthermore, we discover that GPT-4 can produce more than twice the level of toxic content than ChatGPT under certain conditions. + 2024.naacl-srw.11 + lee-etal-2024-exploring-inherent + + + To Clarify or not to Clarify: A Comparative Analysis of Clarification Classification with Fine-Tuning, Prompt Tuning, and Prompt Engineering + AlinaLeippertGerman Research Center for AI + TatianaAnikinaGerman Research Center for AI + BerndKieferGerman Research Center for AI + JosefGenabithGerman Research Center for AI and Universität des Saarlandes + 105-115 + Misunderstandings occur all the time in human conversation but deciding on when to ask for clarification is a challenging task for conversational systems that requires a balance between asking too many unnecessary questions and running the risk of providing incorrect information. 
This work investigates clarification identification based on the task and data from (Xu et al., 2019), reproducing their Transformer baseline and extending it by comparing pre-trained language model fine-tuning, prompt tuning and manual prompt engineering on the task of clarification identification. Our experiments show strong performance with LM and a prompt tuning approach with BERT and RoBERTa, outperforming standard LM fine-tuning, while manual prompt engineering with GPT-3.5 proved to be less effective, although informative prompt instructions have the potential of steering the model towards generating more accurate explanations for why clarification is needed. + 2024.naacl-srw.12 + leippert-etal-2024-clarify + + + Detecting Response Generation Not Requiring Factual Judgment + RyoheiKamei + DaikiShionoTohoku University + ReinaAkamaTohoku University and RIKEN + JunSuzukiTohoku University + 116-123 + With the remarkable development of large language models (LLMs), ensuring the factuality of output has become a challenge. However, having all the contents of the response with given knowledge or facts is not necessarily a good thing in dialogues. This study aimed to achieve both attractiveness and factuality in a dialogue response for which a task was set to predict sentences that do not require factual correctness judgment such as agreeing, or personal opinions/feelings. We created a dataset, dialogue dataset annotated with fact-check-needed label (DDFC), for this task via crowdsourcing, and classification tasks were performed on several models using this dataset. The model with the highest classification accuracy could yield about 88% accurate classification results. 
+ 2024.naacl-srw.13 + kamei-etal-2024-detecting + + + Unknown Script: Impact of Script on Cross-Lingual Transfer + WondimagegnhueTufa + IliaMarkovVrije Universiteit Amsterdam + PiekVossenVrije Universiteit Amsterdam + 124-129 + Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on language transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models that are pre-trained on different tokenization methods to determine factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size. + 2024.naacl-srw.14 + tufa-etal-2024-unknown + + + Improving Repository-level Code Search with Text Conversion + MizukiKondo + DaisukeKawaharaWaseda University + ToshiyukiKurabayashiNTT + 130-137 + The ability to generate code using large language models (LLMs) has been increasing year by year. However, studies on code generation at the repository level are not very active. In repository-level code generation, it is necessary to refer to related code snippets among multiple files. By taking the similarity between code snippets, related files are searched and input into an LLM, and generation is performed. This paper proposes a method to search for related files (code search) by taking similarities not between code snippets but between the texts converted from the code snippets by the LLM. We confirmed that converting to text improves the accuracy of code search. 
+ 2024.naacl-srw.15 + kondo-etal-2024-improving + + + Improving Multi-lingual Alignment Through Soft Contrastive Learning + MinsuPark + SeyeonChoi + ChanyeolChoi + Jun-SeongKimLinq + Jy-yongSohnYonsei University + 138-145 + Making decent multi-lingual sentence representations is critical to achieve high performances in cross-lingual downstream tasks. In this work, we propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model. Given translation sentence pairs, we train a multi-lingual model in a way that the similarity between cross-lingual embeddings follows the similarity of sentences measured at the mono-lingual teacher model. Our method can be considered as contrastive learning with soft labels defined as the similarity between sentences. Our experimental results on five languages show that our contrastive loss with soft labels far outperforms conventional contrastive loss with hard labels in various benchmarks for bitext mining tasks and STS tasks. In addition, our method outperforms existing multi-lingual embeddings including LaBSE, for Tatoeba dataset. + 2024.naacl-srw.16 + park-etal-2024-improving-multi + + + Few-Shot Event Argument Extraction Based on a Meta-Learning Approach + AboubacarTuo + RomaricBesançonCEA + OlivierFerretCEA + JulienTourilleCEA + 146-153 + Few-shot learning techniques for Event Extraction are developed to alleviate the cost of data annotation. However, most studies on few-shot event extraction only focus on event trigger detection and no study has been proposed on argument extraction in a meta-learning context. In this paper, we investigate few-shot event argument extraction using prototypical networks, casting the task as a relation classification problem. Furthermore, we propose to enhance the relation embeddings by injecting syntactic knowledge into the model using graph convolutional networks. 
Our experimental results show that our proposed approach achieves strong performance on ACE 2005 in several few-shot configurations, and highlight the importance of syntactic knowledge for this task. More generally, our paper provides a unified evaluation framework for meta-learning approaches for argument extraction. + 2024.naacl-srw.17 + tuo-etal-2024-shot + + + Investigating Web Corpus Filtering Methods for Language Model Development in <fixed-case>J</fixed-case>apanese + RintaroEnomotoWaseda University + ArsenyTolmachevFujitsu Research and Development Center Co. Ltm. + TakuroNiitsumaThe Asahi Shimbun Company + ShuheiKuritaNational Institute of Informatics and New York University + DaisukeKawaharaWaseda University + 154-160 + The development of large language models (LLMs) is becoming increasingly significant, and there is a demand for high-quality, large-scale corpora for their pretraining. The quality of a web corpus is especially essential to improve the performance of LLMs because it accounts for a large proportion of the whole corpus. However, filtering methods for Web corpora have yet to be established. In this paper, we present empirical studies to reveal which filtering methods are indeed effective and analyze why they are. We build classifiers and language models in Japanese that can process large amounts of corpora rapidly enough for pretraining LLMs in limited computational resources. By evaluating these filtering methods based on a Web corpus quality evaluation benchmark, we reveal that the most accurate method is the N-gram language model. Indeed, we empirically present that strong filtering methods can rather lead to lesser performance in downstream tasks. We also report that the proportion of some specific topics in the processed documents decreases significantly during the filtering process. 
+ 2024.naacl-srw.18 + enomoto-etal-2024-investigating + + + Referring Expressions in Human-Robot Common Ground: A Thesis Proposal + JaapKruijtVrije Universiteit Amsterdam + 161-167 + In this PhD, we investigate the processes through which common ground shapes the pragmatic use of referring expressions in Human-Robot Interaction. A central point in our investigation is the interplay between a growing common ground and changes in the surrounding context, which can create ambiguity, variation and the need for pragmatic interpretations. We outline three objectives that define the scope of our work: 1) obtaining data with common ground interactions, 2) examining reference-making, and 3) evaluating the robot interlocutor. We use datasets as well as a novel interactive experimental framework to investigate the linguistic processes involved in shaping referring expressions. We also design an interactive robot model, which models these linguistic processes and can use pragmatic inference to resolve referring expressions. With this work, we contribute to existing work in HRI, reference resolution and the study of common ground. + 2024.naacl-srw.19 + kruijt-2024-referring + + + Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection + MohammedRahaman + JuliaIveQueen Mary, University of London + 168-199 + Code clone detection is challenging, as source code can be written in different languages, domains, and styles. In this paper, we argue that source code is inherently a graph, not a sequence, and that graph-based methods are more suitable for code clone detection than sequence-based methods. We compare the performance of two state-of-the-art models: CodeBERT (Feng et al., 2020), a sequence-based model, and CodeGraph (Yu et al., 2023), a graph-based model, on two benchmark datasets: BCB (Svajlenko et al., 2014) and PoolC (PoolC, no date). We show that CodeGraph outperforms CodeBERT on both datasets, especially on cross-lingual code clones. 
To the best of our knowledge, this is the first work to demonstrate the cross-lingual code clone detection showing superiority on graph-based methods over sequence-based methods. + 2024.naacl-srw.20 + rahaman-ive-2024-source + + + Distilling Text Style Transfer With Self-Explanation From <fixed-case>LLM</fixed-case>s + ChiyuZhangUniversity of British Columbia + HonglongCai + YuezhangLi + YuexinWuGoogle + LeHouGoogle Research + MuhammadAbdul-MageedUniversity of British Columbia + 200-211 + Text Style Transfer (TST) seeks to alter the style of text while retaining its core content. Given the constraints of limited parallel datasets for TST, we propose CoTeX, a framework that leverages large language models (LLMs) alongside chain-of-thought (CoT) prompting to facilitate TST. CoTeX distills the complex rewriting and reasoning capabilities of LLMs into more streamlined models capable of working with both non-parallel and parallel data. Through experimentation across four TST datasets, CoTeX is shown to surpass traditional supervised fine-tuning and knowledge distillation methods, particularly in low-resource settings. We conduct a comprehensive evaluation, comparing CoTeX against current unsupervised, supervised, in-context learning (ICL) techniques, and instruction-tuned LLMs. Furthermore, CoTeX distinguishes itself by offering transparent explanations for its style transfer process. + 2024.naacl-srw.21 + zhang-etal-2024-distilling + + + Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation + HaoWang + TetsuroMorimuraCyberAgent, Inc. + UkyoHondaCyberAgent, Inc. + DaisukeKawaharaWaseda University + 212-218 + Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. 
Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models’ training. + 2024.naacl-srw.22 + wang-etal-2024-reinforcement + + + Evaluation Dataset for <fixed-case>J</fixed-case>apanese Medical Text Simplification + KokiHoriguchi + TomoyukiKajiwaraEhime University + YukiAraseTokyo Institute of Technology, Tokyo Institute of Technology and AIST, National Institute of Advanced Industrial Science and Technology + TakashiNinomiyaEhime University + 219-225 + We create a parallel corpus for medical text simplification in Japanese, which simplifies medical terms into expressions that patients can understand without effort. While text simplification in the medical domain is strongly desired by society, it is less explored in Japanese because of the lack of language resources. In this study, we build a parallel corpus for Japanese text simplification evaluation in the medical domain using patients’ weblogs. This corpus consists of 1,425 pairs of complex and simple sentences with or without medical terms. To tackle medical text simplification without a training corpus of the corresponding domain, we repurpose a Japanese text simplification model of other domains. Furthermore, we propose a lexically constrained reranking method that prevents technical terms from being output. Experimental results show that our method contributes to achieving 
higher simplification performance in the medical domain. + 2024.naacl-srw.23 + horiguchi-etal-2024-evaluation + + + Multi-Source Text Classification for Multilingual Sentence Encoder with Machine Translation + ReonKajikawa + KeiichiroYamada + TomoyukiKajiwaraEhime University + TakashiNinomiyaEhime University + 226-232 + To reduce the cost of training models for each language for developers of natural language processing applications, pre-trained multilingual sentence encoders are promising. However, since training corpora for such multilingual sentence encoders contain only a small amount of text in languages other than English, they suffer from performance degradation for non-English languages. To improve the performance of pre-trained multilingual sentence encoders for non-English languages, we propose a method of machine translating a source sentence into English and then inputting it together with the source sentence in a multi-source manner. Experimental results on sentiment analysis and topic classification tasks in Japanese revealed the effectiveness of the proposed method. + 2024.naacl-srw.24 + kajikawa-etal-2024-multi + + + A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the <fixed-case>URIEL</fixed-case> Knowledge Base + HastiToossi + GuoHuai + JinyuLiu + EricKhiu + A.DoğruözGhent University + En-ShiunLee + 233-241 + In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding the existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we delve into the soundness and reproducibility of the approach taken by URIEL in quantifying language similarity. Our analysis reveals URIEL’s ambiguity in calculating language distances and in handling missing values. 
Moreover, we find that URIEL does not provide any information about typological features for 31% of the languages it represents, undermining the reliability of the database, particularly on low-resource languages. Our literature review suggests URIEL and lang2vec are used in papers on diverse NLP tasks, which motivates us to rigorously verify the database as the effectiveness of these works depends on the reliability of the information the tool provides. + 2024.naacl-srw.25 + toossi-etal-2024-reproducibility + + + Coding Open-Ended Responses using Pseudo Response Generation by Large Language Models + YukiZenimoto + RyoHasegawa + TakehitoUtsuroUniversity of Tsukuba + MasaharuYoshiokaHokkaido University + NorikoKandoNII, Tokyo Institute of Technology + 242-254 + Survey research using open-ended responses is an important method that contributes to the discovery of unknown issues and new needs. However, survey research generally requires time and cost-consuming manual data processing, indicating that it is difficult to analyze large datasets. To address this issue, we propose an LLM-based method to automate parts of the grounded theory approach (GTA), a representative approach of the qualitative data analysis. We generated and annotated pseudo open-ended responses, and used them as the training data for the coding procedures of GTA. Through evaluations, we showed that the models trained with pseudo open-ended responses are quite effective compared with those trained with manually annotated open-ended responses. We also demonstrate that the LLM-based approach is highly efficient and cost-saving compared to a human-based approach. + 2024.naacl-srw.26 + zenimoto-etal-2024-coding + + + Cross-Task Generalization Abilities of Large Language Models + QinyuanYeUniversity of Southern California + 255-262 + Humans can learn a new language task efficiently with only few examples, by leveraging their knowledge and experience obtained when learning prior tasks. 
Enabling similar cross-task generalization abilities in NLP systems is fundamental for approaching the goal of general intelligence and expanding the reach of language technology in the future. In this thesis proposal, I will present my work on (1) benchmarking cross-task generalization abilities with diverse NLP tasks; (2) developing model architectures for improving cross-task generalization abilities; (3) analyzing and predicting the generalization landscape of current state-of-the-art large language models. Additionally, I will outline future research directions, along with preliminary thoughts on addressing them. + 2024.naacl-srw.27 + ye-2024-cross + + + Commentary Generation from Data Records of Multiplayer Strategy Esports Game + ZihanWangGraduate School of Information Science and Technology, The University of Tokyo + NaokiYoshinagaInstitute of Industrial Science, the University of Tokyo + 263-271 + Esports, a sports competition on video games, has become one of the most important sporting events. Although esports play logs have been accumulated, only a small portion of them accompany text commentaries for the audience to retrieve and understand the plays. In this study, we therefore introduce the task of generating game commentaries from esports’ data records. We first build large-scale esports data-to-text datasets that pair structured data and commentaries from a popular esports game, League of Legends. We then evaluate Transformer-based models to generate game commentaries from structured data records, while examining the impact of the pre-trained language models. Evaluation results on our dataset revealed the challenges of this novel task. We will release our dataset to boost potential research in the data-to-text generation community. 
+ 2024.naacl-srw.28 + wang-yoshinaga-2024-commentary + + + Facilitating Opinion Diversity through Hybrid <fixed-case>NLP</fixed-case> Approaches + MichielVan Der MeerLeiden University + 272-284 + Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions. + 2024.naacl-srw.29 + van-der-meer-2024-facilitating + + + <fixed-case>H</fixed-case>ybrid<fixed-case>BERT</fixed-case> - Making <fixed-case>BERT</fixed-case> Pretraining More Efficient Through Hybrid Mixture of Attention Mechanisms + GokulSrinivasaganGerman Research Center for AI and Universität des Saarlandes + SimonOstermannGerman Research Center for AI + 285-291 + Pretrained transformer-based language models have produced state-of-the-art performance in most natural language understanding tasks. These models undergo two stages of training: pretraining on a huge corpus of data and fine-tuning on a specific downstream task. The pretraining phase is extremely compute-intensive and requires several high-performance computing devices like GPUs and several days or even months of training, but it is crucial for the model to capture global knowledge and also has a significant impact on the fine-tuning task. 
This is a major roadblock for researchers without access to sophisticated computing resources. To overcome this challenge, we propose two novel hybrid architectures called HybridBERT (HBERT), which combine self-attention and additive attention mechanisms together with sub-layer normalization. We introduce a computing budget to the pretraining phase, limiting the training time and usage to a single GPU. We show that HBERT attains twice the pretraining accuracy of a vanilla-BERT baseline. We also evaluate our proposed models on two downstream tasks, where we outperform BERT-base while accelerating inference. Moreover, we study the effect of weight initialization with a limited pretraining budget. The code and models are publicly available at: www.github.com/gokulsg/HBERT/. + 2024.naacl-srw.30 + srinivasagan-ostermann-2024-hybridbert + +
+ + + Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts) + RuiZhang + NathanSchneider + SnigdhaChaturvedi + Association for Computational Linguistics +
Mexico City, Mexico
+ June + 2024 + 2024.naacl-tutorials + naacl + + + 2024.naacl-tutorials.0 + naacl-2024-2024-north-american-chapter-association + + + Catch Me If You <fixed-case>GPT</fixed-case>: Tutorial on Deepfake Texts + AdakuUchenduMIT Lincoln Laboratory + SaranyaVenkatramanThe Pennsylvania State University + ThaiLeIndiana University, Bloomington + DongwonLeeThe Pennsylvania State University + 1-7 + In recent years, Natural Language Generation (NLG) techniques have greatly advanced, especially in the realm of Large Language Models (LLMs). With respect to the quality of generated texts, it is no longer trivial to tell the difference between human-written and LLM-generated texts (i.e., deepfake texts). While this is a celebratory feat for NLG, it poses new security risks (e.g., the generation of misinformation). To combat this novel challenge, researchers have developed diverse techniques to detect deepfake texts. While this niche field of deepfake text detection is growing, the field of NLG is growing at a much faster rate, thus making it difficult to understand the complex interplay between state-of-the-art NLG methods and the detectability of their generated texts. To understand such interplay, two new computational problems emerge: (1) Deepfake Text Attribution (DTA) and (2) Deepfake Text Obfuscation (DTO) problems, where the DTA problem is concerned with attributing the authorship of a given text to one of k NLG methods, while the DTO problem is to evade the authorship of a given text by modifying parts of the text. In this cutting-edge tutorial, therefore, we call attention to the serious security risk both emerging problems pose and give a comprehensive review of recent literature on the detection and obfuscation of deepfake text authorships. Our tutorial will be 3 hours long with a mix of lecture and hands-on examples for interactive audience participation. You can find our tutorial materials here: https://tinyurl.com/naacl24-tutorial. 
+ 2024.naacl-tutorials.1 + uchendu-etal-2024-catch + + + Combating Security and Privacy Issues in the Era of Large Language Models + MuhaoChenUC Davis + ChaoweiXiaoUW-Madison + HuanSunOSU + LeiLiCMU + LeonDerczynskiUW Seattle + AnimaAnandkumarCaltech, NVIDIA + FeiWangUSC + 8-18 + This tutorial seeks to provide a systematic summary of risks and vulnerabilities in security, privacy and copyright aspects of large language models (LLMs), and most recent solutions to address those issues. We will discuss a broad thread of studies that try to answer the following questions: (i) How do we unravel the adversarial threats that attackers may leverage in the training time of LLMs, especially those that may exist in recent paradigms of instruction tuning and RLHF processes? (ii) How do we guard the LLMs against malicious attacks in inference time, such as attacks based on backdoors and jailbreaking? (iii) How do we ensure privacy protection of user information and LLM decisions for Language Model as-a-Service (LMaaS)? (iv) How do we protect the copyright of an LLM? (v) How do we detect and prevent cases where personal or confidential information is leaked during LLM training? (vi) How should we make policies to control against improper usage of LLM-generated content? 
In addition, we will conclude the discussions by outlining emergent challenges in security, privacy and reliability of LLMs that deserve timely investigation by the community. + 2024.naacl-tutorials.2 + chen-etal-2024-combating + + + Explanation in the Era of Large Language Models + ZiningZhuStevens Institute of Technology, University of Toronto + HanjieChenJohns Hopkins University, Rice University + XiYeUniversity of Texas Austin + QingLyuUniversity of Pennsylvania + ChenhaoTanUniversity of Chicago + AnaMarasovicUniversity of Utah + SarahWiegreffeAllen Institute for AI, University of Washington + 19-25 + Explanation has long been a part of communications, where humans use language to elucidate each other and transmit information about the mechanisms of events. There have been numerous works that study the structures of the explanations and their utility to humans. At the same time, explanation relates to a collection of research directions in natural language processing (and more broadly, computer vision and machine learning) where researchers develop computational approaches to explain the (usually deep neural network) models. Explanation has received rising attention. In recent months, the advance of large language models (LLMs) provides unprecedented opportunities to leverage their reasoning abilities, both as tools to produce explanations and as the subjects of explanation analysis. On the other hand, the sheer sizes and the opaque nature of LLMs introduce challenges to the explanation methods. 
In this tutorial, we intend to review these opportunities and challenges of explanations in the era of LLMs, connect lines of research previously studied by different research groups, and hopefully spark thoughts of new research directions. + 2024.naacl-tutorials.3 + zhu-etal-2024-explanation + + + From Text to Context: Contextualizing Language with Humans, Groups, and Communities for Socially Aware <fixed-case>NLP</fixed-case> + Adithya VGanesanStony Brook University + SiddharthMangalikStony Brook University + VasudhaVaradarajanStony Brook University + NikitaSoniStony Brook University + SwanieJuhngStony Brook University + JoãoSedocNew York University + H. AndrewSchwartzStony Brook University + SalvatoreGiorgiUniversity of Pennsylvania, National Institute on Drug Abuse, Intramural Research Program + Ryan LBoydStony Brook University + 26-33 + Aimed at the NLP researchers or practitioners who would like to integrate human - individual, group, or societal level factors into their analyses, this tutorial will cover recent techniques and libraries for doing so at each level of analysis. Starting with human-centered techniques that provide benefit to traditional document- or word-level NLP tasks (Garten et al., 2019; Lynn et al., 2017), we undertake a thorough exploration of critical human-level aspects as they pertain to NLP, gradually moving up to higher levels of analysis: individual persons, individual with agent (chat/dialogue), groups of people, and finally communities or societies. + 2024.naacl-tutorials.4 + ganesan-etal-2024-text + + + Human-<fixed-case>AI</fixed-case> Interaction in the Age of <fixed-case>LLM</fixed-case>s + DiyiYangStanford University + Sherry TongshuangWuCarnegie Mellon University + Marti A.HearstUniversity of California, Berkeley + 34-38 + Recently, the development of Large Language Models (LLMs) has revolutionized the capabilities of AI systems. 
These models possess the ability to comprehend and generate human-like text, enabling them to engage in sophisticated conversations, generate content, and even perform tasks that once seemed beyond the reach of machines. As a result, the way we interact with technology and each other, the subject of an established field called “Human-AI Interaction” that has been studied for over a decade, is undergoing a profound transformation. This tutorial will provide an overview of the interaction between humans and LLMs, exploring the challenges, opportunities, and ethical considerations that arise in this dynamic landscape. It will start with a review of the types of AI models we interact with, and a walkthrough of the core concepts in Human-AI Interaction. We will then emphasize the emerging topics shared between HCI and NLP communities in light of LLMs. + 2024.naacl-tutorials.5 + yang-etal-2024-human + + + Spatial and Temporal Language Understanding: Representation, Reasoning, and Grounding + ParisaKordjamshidiMichigan State University + QiangNingAWS + JamesPustejovskyBrandeis University + Marie-FrancineMoensKU Leuven + 39-46 + This tutorial provides an overview of the cutting edge research on spatial and temporal language understanding. We also cover some essential background material from various subdisciplines to this topic, which we believe will enrich the CL community’s appreciation of the complexity of spatiotemporal reasoning. + 2024.naacl-tutorials.6 + kordjamshidi-etal-2024-spatial + + 
+ + + 2024.findings-naacl + + +
diff --git a/hugo/content/posts/2024-06-15-naacl-2024.md b/hugo/content/posts/2024-06-15-naacl-2024.md
new file mode 100644
index 0000000000..b7f6d38a93
--- /dev/null
+++ b/hugo/content/posts/2024-06-15-naacl-2024.md
@@ -0,0 +1,11 @@
+---
+Title: NAACL 2024 main conference papers
+date: "2024-06-15"
+Description: >
+  NAACL 2024 main conference papers are now available
+---
+
+The proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies are [now available](https://aclanthology.org/events/naacl-2024/).
+This includes five volumes: long and short paper volumes, plus proceedings of the Student Research Workshop, the industry track, and the tutorial abstracts.
+
+We hope to have the workshop proceedings available soon, before the workshops start on Thursday, June 20.