Integrating descriptive and computational approaches in language documentation and resource development

Date

2015-03-12

Description

The benefits of interdisciplinary teams and of creating documentary products in a variety of types and media have been discussed widely in the literature (see, for example, Gippert et al. 2006). However, computational resources, such as morphological parsers or automated part-of-speech taggers, are often not part of the suite of materials produced by a language documentation or description project. Moreover, descriptive and computational resources are frequently created by completely different sets of researchers who may have little to no contact with one another. Under such an approach, descriptive resources are considered to be foundational for the creation of computational resources. Thus it is typically the case that computational resources are built on, but do not inform, descriptive resources. We argue that simultaneous creation of both descriptive and computational resources allows each resource not only to inform but also to significantly enhance the creation of the other. The authors of this paper are currently working on a project that focuses on the creation of a number of resources for Somali. Objectives of the project include writing a descriptive reference grammar, creating a morphological parser and part-of-speech tagger, enhancing existing lexical resources, and developing computational aids designed to help electronic dictionary users who are unsure how to spell Somali words. The project uses a multi-faceted approach to data collection, including work with native speaker consultants, collection and transcription of narratives and conversations, creation and tagging of pedagogical corpora, and large-scale corpus mining of internet data. We describe our methodology, workflow, and some of our research outcomes, and we illustrate the ways in which simultaneous creation of computational and descriptive resources has significantly improved our products.
For example, the descriptive grammar and the morphological parser have been developed in tandem. At many points along the way, problems encountered in programming the parser shed light on shortcomings in both our description and our understanding of Somali structures. We are currently validating the output of the parser against a large internet corpus of Somali data. We use automated scripts to identify words that cannot be parsed with our tools, whether because of errors in the parser or gaps in the dictionary. The output of this process allows us to refine our grammar and parser, as well as to enhance and modernize a published Somali dictionary (Zorc & Osman 2002).
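The validation step described above can be illustrated with a minimal sketch. It is not the project's actual tooling: the names `can_parse` and `unparsed_words`, the toy stem list, and the toy suffixes are all hypothetical, and the strings are made-up placeholders rather than real Somali data. The sketch only shows the general idea of running every corpus token through a parser and tallying the tokens that receive no analysis.

```python
import re
from collections import Counter

# Hypothetical stand-ins for a real lexicon and morphological parser.
# These strings are placeholders, not Somali data.
TOY_STEMS = {"tala", "miro"}
TOY_SUFFIXES = ("", "ka", "da")


def can_parse(word):
    """Return True if the toy parser can split word into stem + suffix."""
    return any(
        word.endswith(suffix) and word[: len(word) - len(suffix)] in TOY_STEMS
        for suffix in TOY_SUFFIXES
    )


def unparsed_words(text):
    """Count the corpus tokens that the toy parser fails to analyze."""
    # Tokens are runs of letters, optionally joined by internal apostrophes.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(token for token in tokens if not can_parse(token))


if __name__ == "__main__":
    sample = "tala talaka miroda zzz zzz talaxx"
    # Most frequent failures first, so they can be reviewed first.
    for word, count in unparsed_words(sample).most_common():
        print(word, count)
```

Sorting the failures by frequency is one plausible way to prioritize review: a frequent unparsed form may point to a gap in the grammar or parser, while a rare one may be a missing dictionary entry or a spelling variant.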

Rights

Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
