Small language, big data: Building the Gurindji Kriol corpus to model the emergence of a mixed language

Publisher

University of Hawaii Press

Volume

19

Starting Page

348

Ending Page

367

Abstract

At 178 hours and 853,348 words, the Gurindji Kriol corpus (Meakins & Algy 2004) is currently the largest annotated corpus of an Australian Indigenous language, and is a significant record of the community’s language use in a complex multilingual environment. Together with the Gurindji corpus, four generations of language use and change in the Gurindji community are represented, including the rare emergence of a mixed language. In this paper, we present details on the development of this corpus, in particular the complex processes of corralling this data into a consistent format that enables quantitative and computational work. The scale, breadth and consistency of the corpus have enabled innovative research into questions of language variation, contact, emergence and change, and have helped the Gurindji community to better understand linguistic changes and continuities across generations. Data-cleaning and annotation are often overlooked in discussions of data management within the field of language documentation. However, they are important steps in any quantitative research, and the amount of work required can be significantly reduced with thoughtful automation. Our approach, drawn from industry best practice, may provide a useful model for others working on the development of corpora of low-resource languages.

Citation

Wilmoth, Sasha, Felicity Meakins, Cassandra Algy. 2025. Small language, big data: Building the Gurindji Kriol corpus to model the emergence of a mixed language. Language Documentation & Conservation 19: 348-367.

Extent

20

Format

Article

Rights

Creative Commons Attribution-NonCommercial 4.0 International

Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.