Data Science
Society, Ethics & Sustainability
Stories

Low-Resource Languages as Stress Tests for NLP Data

Session Abstract

Low-resource languages expose weaknesses in NLP systems that benchmark data often hides. Drawing on first-hand experience annotating fieldwork data, this talk shows how ambiguity and annotation decisions reveal fundamental data quality issues that carry over to real-world NLP pipelines.

Session Description

This talk is an experience report on annotating language data in a low-resource setting and what this process reveals about data quality in NLP pipelines. Rather than treating low-resource languages as edge cases, the talk frames them as stress tests that surface structural data issues early and clearly.

The session outlines what linguistic fieldwork data looks like before it becomes “training data,” highlighting ambiguity, context dependence, and variation that cannot always be resolved through additional labeling. It then focuses on the annotation decisions required when categories are underspecified or multiple analyses are plausible, and connects these challenges to familiar issues in applied NLP, such as label noise, brittle representations, and unexpected model behavior.
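To make the connection to label noise concrete, consider a minimal sketch (not taken from the talk; the tokens and tags are hypothetical) of one option annotators have when multiple analyses are plausible: keeping a soft label distribution over annotator judgments instead of forcing a single "gold" label.

```python
from collections import Counter

# Hypothetical fieldwork annotations: for each token, the labels
# assigned by two annotators. token_17 is genuinely ambiguous
# between a noun and a verb reading; token_18 is uncontroversial.
annotations = {
    "token_17": ["NOUN", "VERB"],
    "token_18": ["NOUN", "NOUN"],
}

def soft_labels(labels):
    """Turn a list of annotator labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

for token, labels in annotations.items():
    print(token, soft_labels(labels))
# token_17 {'NOUN': 0.5, 'VERB': 0.5}
# token_18 {'NOUN': 1.0}
```

Collapsing token_17 to a single label, by majority vote or adjudication, is exactly the kind of early data decision that later surfaces as apparent label noise or unexpected model behavior.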

The goal is to share practical lessons from linguistic data work that help NLP practitioners reason more realistically about annotation, uncertainty, and robustness. Attendees will gain concrete insights into why “clean data” is often an illusion and how early data decisions shape downstream systems.