Small language, big data: Building the Gurindji Kriol corpus to model the emergence of a mixed language

dc.creator: Sasha Wilmoth
dc.creator: Felicity Meakins
dc.creator: Cassandra Algy
dc.date.accessioned: 2025-11-24T22:26:24Z
dc.date.available: 2025-11-24T22:26:24Z
dc.date.copyright: 2025
dc.date.issued: 2025-11
dc.description.abstract: At 178 hours and 853,348 words, the Gurindji Kriol corpus (Meakins & Algy 2004) is currently the largest annotated corpus of an Australian Indigenous language, and is a significant record of the community’s language use in a complex multilingual environment. Together with the Gurindji corpus, four generations of language use and change in the Gurindji community are represented, including the rare emergence of a mixed language. In this paper, we present details on the development of this corpus, in particular the complex processes of corralling this data into a consistent format that enables quantitative and computational work. The scale, breadth and consistency of the corpus has enabled innovative research into questions of language variation, contact, emergence and change; and has helped the Gurindji community to better understand linguistic changes and continuities across generations. Data-cleaning and annotation are often overlooked in discussions of data management within the field of language documentation. However, they are important steps in any quantitative research, and the amount of work required can be significantly reduced with thoughtful automation. Our approach, drawn from industry best practice, may provide a useful model for others working on the development of corpora of low-resource languages.
dc.description.sponsorship: National Foreign Language Resource Center
dc.format: Article
dc.format.extent: 20
dc.identifier.citation: Wilmoth, Sasha, Felicity Meakins, Cassandra Algy. 2025. Small language, big data: Building the Gurindji Kriol corpus to model the emergence of a mixed language. Language Documentation & Conservation 19: 348-367.
dc.identifier.issn: 1934-5275
dc.identifier.uri: https://hdl.handle.net/10125/74839
dc.language: eng
dc.publisher: University of Hawaii Press
dc.title: Small language, big data: Building the Gurindji Kriol corpus to model the emergence of a mixed language
dcterms.rights: Creative Commons Attribution-NonCommercial 4.0 International
dcterms.type: Text
prism.endingpage: 367
prism.publicationname: Language Documentation & Conservation
prism.startingpage: 348
prism.volume: 19

Files

Original bundle

Name: Wilmoth_etal_2025.pdf
Size: 815.29 KB
Format: Adobe Portable Document Format