Software tools for integrated development of the corpus, the lexicon, and community materials

Alexander Nakhimovsky
Tom Myers
28 Feb 2013
Description: We present an integrated approach to corpus and lexicon development that serves both the language archive and a repository of materials for the local community. We assume that the target audiences of the archive and the repository have different interests in the same underlying body of data, and we seek to construct that body of data in such a way that both sets of interests can be addressed. This involves the integration of three pieces of software: ELAN, for work with digital video/audio and its transcription; FLEx, for grammatical analysis and lexicon development; and MannX, a browser-based video player for language learning.

Integrating the corpus and the lexicon means providing the following functionality:
• Each lexicon entry has links to its tokens in the corpus, which are in turn linked, via time alignment, to the media segments in which the tokens occur.
• Each word in the corpus has a link to its lexicon entry.
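
The two kinds of links described above can be pictured as a small data model. The following Python sketch is only an illustration, not the data model of ELAN, FLEx, or the bridge software; the class names (MediaSegment, Token, LexiconEntry), their fields, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MediaSegment:
    """A time-aligned stretch of a media file (hypothetical structure)."""
    media_file: str
    start_ms: int
    end_ms: int


# eq=False keeps identity comparison, which avoids endless recursion
# when entries and tokens refer to each other.
@dataclass(eq=False)
class Token:
    form: str                # the word as transcribed in the corpus
    segment: MediaSegment    # time alignment to the media
    entry: "LexiconEntry"    # link from the word to its lexicon entry


@dataclass(eq=False)
class LexiconEntry:
    headword: str
    gloss: str
    tokens: list = field(default_factory=list)  # links from the entry to its tokens

    def add_token(self, form: str, segment: MediaSegment) -> Token:
        """Register a corpus token and link it in both directions."""
        token = Token(form, segment, self)
        self.tokens.append(token)
        return token


# Made-up sample data, for illustration only.
entry = LexiconEntry(headword="word1", gloss="gloss1")
token = entry.add_token("word1", MediaSegment("recording01.wav", 1200, 1650))
```

From the entry one can reach every token and, through its segment, the place in the recording where it occurs; from a token one can reach the entry.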

To achieve this, the corpus and the lexicon must be integrated throughout their development:
• The lexicon maintains the list of all lexical and grammatical morphemes.
• When a morpheme that is already in the lexicon is encountered in the corpus, interlinear glosses are filled in from the lexicon.
• When a new morpheme is encountered, a new entry is created.
• A change in the lexical entry is automatically propagated through the corpus.
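
One way to obtain the behaviour listed above is for the corpus to store references to lexicon entries rather than copies of their glosses. The Python sketch below illustrates that principle under simplifying assumptions: one entry per morpheme form (homonyms are ignored), and hypothetical names (Lexicon, lookup_or_create, gloss_line, render) and sample morphemes. It does not describe how FLEx itself stores or propagates lexical data.

```python
class Lexicon:
    """Hypothetical lexicon: one entry per morpheme form."""

    def __init__(self):
        self.entries = {}  # morpheme form -> entry dict

    def lookup_or_create(self, form):
        # Known morpheme: reuse its entry.
        # New morpheme: create an entry with an empty gloss.
        if form not in self.entries:
            self.entries[form] = {"form": form, "gloss": ""}
        return self.entries[form]


def gloss_line(lexicon, morphemes):
    """Analyse one corpus line, keeping references to lexicon entries."""
    return [lexicon.lookup_or_create(m) for m in morphemes]


def render(analysed_line):
    """Build the interlinear gloss line; '???' marks entries not yet glossed."""
    return " ".join(entry["gloss"] or "???" for entry in analysed_line)


# Made-up morphemes, for illustration only.
lexicon = Lexicon()
lexicon.lookup_or_create("ma")["gloss"] = "water"

line = gloss_line(lexicon, ["ma", "-ri"])   # "-ri" is new, so an entry is created
before = render(line)                       # "water ???"

lexicon.entries["-ri"]["gloss"] = "PL"      # edit the lexical entry
after = render(line)                        # "water PL": the corpus line follows the lexicon
```

Because the analysed line holds the entries themselves, editing an entry changes every corpus line that uses it without a separate update step.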

We seek to achieve this kind of integration by building a software "bridge" between ELAN and FLEx that supports the following workflow:
• Starting in ELAN, do transcription and time-alignment at the utterance level ("phrase" in FLEx).
• Export to FLEx for lexical and morphological analysis.
• Export the results back into ELAN as symbolic subdivision or symbolic association tiers.
• Further annotate in ELAN; perhaps time-align at the word level.
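
The first and third steps of this workflow can be sketched in Python on a simplified ELAN annotation (EAF) fragment. The sketch is not the bridge software: the fragment omits the header, linguistic types, and constraints that a real EAF file needs; the tier names and the functions read_utterances and add_word_tier are hypothetical; and the `analyse` argument is only a stand-in for the FLEx analysis step, whose exchange format is not shown here.

```python
import xml.etree.ElementTree as ET

# Simplified EAF-like fragment: one time-aligned utterance (made-up content).
EAF_FRAGMENT = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="2400"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first second third</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_utterances(root, tier_id):
    """Return (annotation id, start ms, end ms, text) for each aligned utterance."""
    slots = {s.get("TIME_SLOT_ID"): int(s.get("TIME_VALUE"))
             for s in root.iter("TIME_SLOT")}
    tier = root.find(f"TIER[@TIER_ID='{tier_id}']")
    return [(a.get("ANNOTATION_ID"),
             slots[a.get("TIME_SLOT_REF1")],
             slots[a.get("TIME_SLOT_REF2")],
             a.findtext("ANNOTATION_VALUE"))
            for a in tier.iter("ALIGNABLE_ANNOTATION")]


def add_word_tier(root, utterances, analyse):
    """Add a dependent tier whose annotations subdivide each utterance.

    Each word refers to its parent utterance and to the preceding word,
    so it carries no time slots of its own.
    """
    tier = ET.SubElement(root, "TIER", TIER_ID="word", PARENT_REF="utterance")
    count = 0
    for ann_id, _start, _end, text in utterances:
        previous = None
        for word in analyse(text):
            count += 1
            attrs = {"ANNOTATION_ID": f"w{count}", "ANNOTATION_REF": ann_id}
            if previous is not None:
                attrs["PREVIOUS_ANNOTATION"] = previous
            wrapper = ET.SubElement(tier, "ANNOTATION")
            ref = ET.SubElement(wrapper, "REF_ANNOTATION", attrs)
            ET.SubElement(ref, "ANNOTATION_VALUE").text = word
            previous = attrs["ANNOTATION_ID"]
    return tier


root = ET.fromstring(EAF_FRAGMENT)
utterances = read_utterances(root, "utterance")
# Whitespace splitting stands in for the analysis returned from FLEx.
word_tier = add_word_tier(root, utterances, str.split)
```

Words on the new tier inherit their position in the recording from the parent utterance; time-aligning them individually, as in the last step, would be done afterwards in ELAN.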

As of August 2012, a functioning version of the software has been implemented in JavaScript. It is being reimplemented in Java as a Web application. We expect to put the Java version on the Web for testing in September, and also to upgrade the ELAN-MannX conversion.

This process results in a corpus of media files, associated annotation files, and a FLEx-created lexicon, with links between them. A subset of these materials, transformed into different formats, will form the basis of a community repository. The repository will be a Web application, served from either a remote or a localhost server, that can run on a laptop or on an Android phone.
