Please use this identifier to cite or link to this item: http://hdl.handle.net/10125/25264

Integrating descriptive and computational approaches in language documentation and resource development

File: 25264.mp3
Size: 54.57 MB
Format: MP3

Item Summary

Title: Integrating descriptive and computational approaches in language documentation and resource development
Issue Date: 12 Mar 2015
Description: The benefits of interdisciplinary teams, as well as the creation of documentary products of a variety of types and media, have been discussed widely in the literature (see, for example, Gippert et al. 2006). However, computational resources, such as morphological parsers or automated part-of-speech taggers, are often not part of the suite of materials produced by a language documentation or description project. Moreover, descriptive and computational resources are frequently created by completely different sets of researchers who may have little to no contact with one another. Under such an approach, descriptive resources are considered to be foundational for the creation of computational resources. Thus it is typically the case that computational resources are built on, but do not inform, descriptive resources. We argue that simultaneous creation of both descriptive and computational resources allows each resource not only to inform, but also to significantly enhance, the creation of the other.

The authors of this paper are currently working on a project that focuses on creating a number of resources for Somali. The objectives of the project include writing a descriptive reference grammar, creating a morphological parser and part-of-speech tagger, enhancing existing lexical resources, and developing computational aids designed to help electronic dictionary users who are unsure how to spell Somali words. The project uses a multi-faceted approach to data collection, including work with native speaker consultants, collection and transcription of narratives and conversations, creation and tagging of pedagogical corpora, and large-scale corpus mining of internet data.

We describe our methodology, workflow, and some of our research outcomes, and illustrate the ways in which simultaneous creation of computational and descriptive resources has significantly improved our products. For example, the descriptive grammar and the morphological parser have been developed in tandem. At many points along the way, problems encountered in programming the parser shed light on shortcomings in both our description and our understanding of Somali structures. We are currently validating the output of the parser against a large internet corpus of Somali data. We are using automated scripts to identify words that cannot be parsed using our tools (due either to errors in the parser or to gaps in the dictionary). The output of this process allows us to refine our grammar and parser, as well as to enhance and modernize a published Somali dictionary (Zorc & Osman 2002).
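
The validation step described above can be illustrated with a minimal sketch. It is not the project's actual script: the function names (`tokenize`, `unparsed_words`, `toy_can_parse`), the tokenization rule, and the toy stems and suffixes are all assumptions made for illustration, and the placeholder words are not Somali. The sketch only shows the general idea of running every corpus token through a parser and tallying the tokens that fail, so that the most frequent failures can be reviewed first.

```python
import re
from collections import Counter
from typing import Callable, Iterable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters (internal apostrophes allowed).

    This tokenization rule is an assumption for the sketch, not the project's.
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def unparsed_words(lines: Iterable[str], can_parse: Callable[[str], bool]) -> Counter:
    """Count every token in the corpus that the parser cannot analyze."""
    failures = Counter()
    for line in lines:
        for token in tokenize(line):
            if not can_parse(token):
                failures[token] += 1
    return failures


# Hypothetical stand-in for a real morphological parser: a word "parses"
# if it is a known stem followed by a known suffix. Placeholder data only.
KNOWN_STEMS = {"foo", "bar"}
KNOWN_SUFFIXES = ("", "s")


def toy_can_parse(word: str) -> bool:
    return any(
        word.endswith(suffix) and word[: len(word) - len(suffix)] in KNOWN_STEMS
        for suffix in KNOWN_SUFFIXES
    )


corpus = ["Foo bars baz.", "Baz foos qux"]
report = unparsed_words(corpus, toy_can_parse)
print(report.most_common())  # [('baz', 2), ('qux', 1)]
```

Sorting the failures by frequency is one plausible design choice: a frequent unparsed word is more likely to point to a gap in the dictionary or a rule missing from the parser than to a one-off misspelling in the internet data.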
URI/DOI: http://hdl.handle.net/10125/25264
Rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
Appears in Collections: 4th International Conference on Language Documentation and Conservation (ICLDC)