
Towards a more general model of interlinear text


Item Summary

Title: Towards a more general model of interlinear text
Authors: Arkhipov, Alexandre
Contributors: Arkhipov, Alexandre (speaker)
Date Issued: 01 Mar 2013
Description: The interlinear glossed text (IGT) is a complex object; the complexity of its structure depends on factors such as origin, intended use and the languages involved. Developing tools and workflows for integrated linguistic analysis environments calls for particular attention to aspects that can be disregarded as insignificant in many common cases; collaboration on ELAN–FLEx integration was a particular motivation for this paper.

IGT is often conceived of as a tree: the root node corresponds to the whole text, which is subdivided into smaller units (sentences, words, morphemes). Each unit has a number of associated annotations, generally one per information type, such as a sentence translation, a part-of-speech label or a morpheme gloss.
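
The tree view described above can be illustrated with a minimal sketch. The class name `Unit`, its fields, and the toy English data are assumptions made for illustration only; they are not taken from the paper or from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Unit:
    """One node of the IGT tree: a text, sentence, word or morpheme (hypothetical model)."""
    level: str                                                   # "text", "sentence", "word", "morpheme"
    form: str                                                    # the analyzed string at this level
    annotations: Dict[str, str] = field(default_factory=dict)    # one value per information type
    children: List["Unit"] = field(default_factory=list)         # smaller units, in text order


def leaves(unit: Unit) -> Iterator[Unit]:
    """Yield the lowest-level units (here: morphemes) in text order."""
    if not unit.children:
        yield unit
    for child in unit.children:
        yield from leaves(child)


# Toy data, invented for this example.
sentence = Unit("sentence", "dogs bark", {"translation": "Dogs bark."}, [
    Unit("word", "dogs", {"pos": "N"}, [
        Unit("morpheme", "dog", {"gloss": "dog"}),
        Unit("morpheme", "-s", {"gloss": "PL"}),
    ]),
    Unit("word", "bark", {"pos": "V"}, [
        Unit("morpheme", "bark", {"gloss": "bark"}),
    ]),
])
text = Unit("text", "dogs bark", children=[sentence])

# Collect the morpheme gloss line by walking the tree.
glosses = [m.annotations["gloss"] for m in leaves(text)]
print(glosses)
```

Each annotation type lives at the level it describes: the translation on the sentence, the part-of-speech label on the word, the gloss on the morpheme.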

However, an IGT can easily amount to a large set of trees. Unresolved ambiguities of all kinds are one reason for this. Each pair of alternative analyses (e.g. two competing parses of a word) implies two distinct trees, identical except for the node in question and all its descendants. The more ambiguities arise, the more underlying trees must be posited: n independent two-way ambiguities already yield 2^n trees. Still, all trees in such a tree family stem from a single analyzed object (a transcript or an original orthographic representation). Since storing an entire tree for each combination of relevant alternatives is utterly inefficient, a more compact storage model is needed.
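
To make the storage argument concrete, here is a minimal sketch of one possible compact representation: each ambiguous position stores its alternatives once, and full readings are enumerated only on demand. This is an illustration under the simplifying assumption that the ambiguities are independent of one another; it is not the model proposed in the paper, and the names `Slot`, `count_trees` and `expand` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from math import prod
from typing import Iterator, List


@dataclass
class Slot:
    """One analyzed position (e.g. a word) with all of its competing analyses."""
    form: str
    alternatives: List[str]


def count_trees(slots: List[Slot]) -> int:
    """Number of fully disambiguated trees implied by the packed representation."""
    return prod(len(s.alternatives) for s in slots)


def expand(slots: List[Slot]) -> Iterator[List[str]]:
    """Lazily enumerate every combination of alternatives, one reading at a time."""
    for choice in product(*(s.alternatives for s in slots)):
        yield list(choice)


# Toy data, invented for this example.
slots = [
    Slot("flies", ["fly-PL", "fly-3SG"]),
    Slot("like", ["like[V]", "like[PREP]"]),
    Slot("bananas", ["banana-PL"]),
]

stored = sum(len(s.alternatives) for s in slots)  # analyses actually kept in storage
n_trees = count_trees(slots)                      # trees they stand for
print(stored, n_trees)
```

The number of stored analyses grows with the sum of the alternatives per position, while the number of implied trees grows with their product.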

Turning to the media dimension, an accurate transcript of spontaneous discourse is most often unsuitable for grammatical analysis without some preprocessing (normalization) that deals with speech errors, incomprehensible fragments, etc. and produces a grammatically correct and coherent text. The “raw” transcript, in turn, feeds phonological and possibly discourse analysis. We thus get two distinct texts, interconnected but giving rise to independent (families of) analysis trees; only one of them is linked directly to the media timeline.
In some scenarios, more than one media-based timeline emerges, and these timelines need to be interlinked (cf. BOLD framework: sound annotations to sound events; retelling experiments, e.g. pear stories; sign languages translated from/into spoken languages). Moreover, the reference axis need not be a timeline proper: it may be a text or a path through a complex graphic image.
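
The relation between the raw transcript and the normalized text can be sketched as follows. This is a minimal illustration, assuming that only raw tokens carry time offsets and that each normalized token points back to the raw tokens it was derived from; the names `RawToken`, `NormToken` and `time_span` and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RawToken:
    """Token of the raw transcript, anchored directly to the media timeline."""
    form: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class NormToken:
    """Token of the normalized text; it has no time offsets of its own."""
    form: str
    raw_ids: List[int]  # indices into the raw transcript; empty if supplied by the editor


def time_span(tok: NormToken, raw: List[RawToken]) -> Optional[Tuple[float, float]]:
    """Derive a media interval for a normalized token through its raw tokens."""
    if not tok.raw_ids:
        return None  # no direct link to the media
    return (min(raw[i].start for i in tok.raw_ids),
            max(raw[i].end for i in tok.raw_ids))


# Toy data: a false start ("the the") and a filler ("uh") are removed by normalization.
raw = [
    RawToken("the", 0.0, 0.2),
    RawToken("the", 0.3, 0.5),
    RawToken("uh", 0.5, 0.9),
    RawToken("dog", 1.0, 1.4),
]
norm = [NormToken("the", [0, 1]), NormToken("dog", [3])]

spans = [time_span(t, raw) for t in norm]
print(spans)
```

Grammatical annotation would attach to the normalized tokens, phonological annotation to the raw ones; the `raw_ids` links are what keeps the two texts interconnected.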

Further complicating factors include multi-speaker and multilingual settings, collaboration and versioning.

The overall structure (an XML sketch will be presented) may grow too complex for any specialized analysis component to handle. It may thus be efficient to use an intermediate repository, e.g. a unified underlying RDF representation [Nakhimovsky et al. 2012], into which all changes made in specific tools are merged.
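
The idea of an intermediate repository can be illustrated with a minimal sketch in which the shared store is a plain set of subject-predicate-object triples and each tool submits only its own additions and removals. This is an assumption-laden toy, not the RDF representation of Nakhimovsky et al. and not the data format of any actual tool; the class `Repository`, the identifiers and the predicate names are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class Repository:
    """Shared store into which the changes made in each tool are merged."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def merge(self, added: Iterable[Triple], removed: Iterable[Triple] = ()) -> None:
        """Apply one tool's change set: drop retracted triples, then add new ones."""
        self.triples -= set(removed)
        self.triples |= set(added)

    def objects(self, subject: str, predicate: str) -> List[str]:
        """All values recorded for one property of one unit, sorted for stable output."""
        return sorted(o for s, p, o in self.triples if s == subject and p == predicate)


repo = Repository()

# Change set from a (hypothetical) time-alignment tool.
repo.merge([("w1", "form", "dogs"), ("w1", "start", "1.0")])

# Change sets from a (hypothetical) glossing tool, including a later correction.
repo.merge([("w1", "gloss", "dog-PL"), ("w1", "pos", "N")])
repo.merge([("w1", "pos", "NOUN")], removed=[("w1", "pos", "N")])

print(repo.objects("w1", "pos"))
```

Each tool only needs to understand the properties it edits; the remaining triples pass through the repository untouched.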


Bow, Cathy, Baden Hughes and Steven Bird. 2003. Towards a General Model of Interlinear Text.

Nakhimovsky, Alexander, Jeff Good and Tom Myers. 2012. Interoperability of Language Documentation Tools and Materials for Local Communities. Digital Humanities 2012.
Rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
Appears in Collections: 3rd International Conference on Language Documentation and Conservation (ICLDC)
