Corpora, collections, data – Reusing outputs of language documentation

Thieberger, Nick
With the success of new methods in language documentation comes the creation of collections of records in an increasing number of small languages. Looking back over the past decade of such work reveals a heterogeneity in the form of collections that reflects the context in which each linguist has been trained and the relative focus they put on creating records. The Australian Centre of Excellence in the Dynamics of Language is a new seven-year documentation program that will include a data management and archiving ‘thread’, and so needs to consider what form its primary research material should take. The distinction between primary and secondary data is laid out in Himmelmann (2012). This paper will explore the range of types of primary data that can be considered part of a corpus, and how a corpus is distinct from a collection. A corpus for these purposes is structured and often allows some interoperability with other corpora for searching comparable phenomena, as in the CorpAfroAs project, for example. Collections are more idiosyncratic and typically require some work on the part of the researcher to make use of them. Data types representing the outputs of funded research range from unannotated primary data through to elaborately annotated interlinear glossed text and media. What is the ideal form of the material that would allow it to be interpreted, accessed and re-used, and how can current and future researchers collaborate in the construction of corpora that will be accessed in this way in future? While linguists have adopted several standard formats for their research outputs, typically based on the schema provided by the tools used (Elan, Fieldworks, Praat and so on), and perhaps using conventions like the Leipzig Glossing Rules, there is no agreement about the internal structure of a documentary corpus, nor about the methods used to expose elements of the corpus for discovery and analysis.
We have so far built a repository (PARADISEC) which provides citability and preservation of primary records. We have also built a system for presenting corpus material as interlinear text with media (EOPAS). We are exploring what gaps there are in the workflow from creation to reuse of research material in order to build new tools with the aim of providing richer sources of information about the world’s languages.

Reference

Himmelmann, Nikolaus P. 2012. Linguistic Data Types and the Interface between Language Documentation and Description. Language Documentation & Conservation 6. 187-207.
