Filtered Union Bibliography

This automatically-generated file contains references from the main union bibliography that have been filtered for a single tag. Do not edit this file; instead, please update the main bibliography and tag references appropriately to have them show up here. Thank you!

The papers are listed in the same order as the main bibliography, i.e., by year of publication / release, then by the surname of the first author.

Evaluation

  • Kirk, H. R., Vidgen, B., Röttger, P., Thrush, T., & Hale, S. A. (2022). Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022). doi:10.18653/v1/2022.naacl-main.97 [paper] Biases Evaluation published
  • Tan, S., Joty, S., Baxter, K., Taeihagh, A., Bennett, G. A., & Kan, M. Y. (2021). Reliability Testing for Natural Language Processing Systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4153–4169, Online. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.321 [paper] Evaluation published
  • Caglayan, O., Madhyastha, P., & Specia, L. (2020). Curious case of language generation evaluation metrics: A cautionary tale. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2322–2328, Barcelona, Spain (Online). International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.210 [paper] Evaluation published
  • Ethayarajh, K., & Jurafsky, D. (2020, November). Utility is in the eye of the user: A critique of NLP leaderboards. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.18653/v1/2020.emnlp-main.393 [paper] Evaluation published
  • Garnerin, M., Rossato, S., & Besacier, L. (2020). Pratiques d’évaluation en ASR et biais de performance (Evaluation methodology in ASR and performance bias). In Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL) (pp. 1-9). [paper] Evaluation published
  • Goldfarb-Tarrant, S., Marchant, R., Sanchez, R. M., Pandya, M., & Lopez, A. (2020). Intrinsic bias metrics do not correlate with application bias. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). [paper] Evaluation published
  • Linzen, T. (2020, July). How can we accelerate progress towards human-like linguistic generalization?. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.465 [paper] Evaluation published
  • Mathur, N., Baldwin, T., & Cohn, T. (2020, July). Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.448 [paper] Evaluation published
  • Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020, November). CrowS-pairs: A challenge dataset for measuring social biases in masked language models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). doi:10.18653/v1/2020.emnlp-main.154 [paper] Evaluation published
  • Bregeon, D., Antoine, J. Y., Villaneau, J., & Lefeuvre-Halftermeyer, A. (2019). Redonner du sens à l’accord interannotateurs: vers une interprétation des mesures d’accord en termes de reproductibilité de l’annotation (Giving meaning back to inter-annotator agreement: towards interpreting agreement measures in terms of annotation reproducibility). Traitement Automatique des Langues, 60(2), 23. [paper] Evaluation published
  • Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186. doi:10.1126/science.aal4230 Evaluation published
  • Mathet, Y., & Widlöcher, A. (2016). Évaluation des annotations: ses principes et ses pièges (Annotation evaluation: its principles and its pitfalls). Traitement Automatique des Langues, 57(2), 73-98. [paper] Evaluation published