A general format for time information to be the first-class data of general linguistics

File       Size      Format
25368.pdf  50.31 kB  Adobe PDF
           68.34 kB  ZIP

Item Summary

Title: A general format for time information to be the first-class data of general linguistics
Authors: Ohya, Kazushi
Contributors: Ohya, Kazushi (speaker)
Date Issued: 12 Mar 2015
Description: This presentation proposes a philosophy and a concrete data format for time information, in order to realize a new phenomenon that language documentation brings about in computational environments. The data format should be simple and cogent, because linguists will use it as a common, fundamental format for recording sound data and relating it to encoded language data. To be simple, the format is based on a flat data model, and its actual description should be plain text. To be cogent, the format is based on mathematical foundations. In this proposal, the elements of a record are lined up in superset order, meaning that each element is a superset of the element to its right. An element is either an ID (or an equivalent such as a file name) or a pair of start and end timestamps. An example of the actual description of a record is
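The superset ordering described above can be sketched in Python. This is only an illustration under stated assumptions, not the record syntax proposed in the presentation: the whitespace-separated layout, the `start,end` notation in seconds, the file name `session01.wav`, and the names `Span`, `parse_record`, and `in_superset_order` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    """A pair of start and end timestamps, in seconds (assumed unit)."""
    start: float
    end: float

    def contains(self, other: "Span") -> bool:
        # A span is a superset of another if it covers it entirely.
        return self.start <= other.start and other.end <= self.end


def parse_record(line: str) -> Tuple[str, List[Span]]:
    """Parse one flat, plain-text record.

    Assumed (hypothetical) syntax: '<id> <start>,<end> <start>,<end> ...'
    where <id> is an identifier or file name.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty record")
    record_id, raw_spans = fields[0], fields[1:]
    spans = []
    for raw in raw_spans:
        start_text, end_text = raw.split(",", 1)
        span = Span(float(start_text), float(end_text))
        if span.start > span.end:
            raise ValueError(f"start after end in {raw!r}")
        spans.append(span)
    return record_id, spans


def in_superset_order(spans: List[Span]) -> bool:
    """Check that each span contains the span to its right."""
    return all(left.contains(right) for left, right in zip(spans, spans[1:]))


if __name__ == "__main__":
    # Hypothetical record: a file, then progressively narrower time spans.
    record_id, spans = parse_record("session01.wav 0.0,60.0 12.5,20.0 13.0,14.2")
    print(record_id, in_superset_order(spans))
```

Because the model is flat, a record can be read with a single split on whitespace, and the ordering constraint reduces to a pairwise containment check on adjacent elements.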

The reasons why this type of data format is needed are as follows.

(1) As shown in [Author 2012], a key strategy for sharing language resources is a data conversion service. In our experiments, data formats based on a multi-link-path model that international organizations or research projects have proposed, such as ISO LAF/GrAF [Ide 2006, ISO 24612] and TEI [Bauman 2008, ISO 24610-1], have drawbacks in data size and data manipulation [Author 2009]. If we use these formats, we have to prepare flexible data conversion programs or services [Author 2011, Author 2012]. To reduce the number of link paths, the part that defines data units in a standoff style can be moved out into an independent data file. The data format proposed here can be used for this kind of data.

(2) There is a one-to-many relationship between actual sound and sound data, and a many-to-many relationship between sound data and encoded language data. Thus, replaying the sound that corresponds to encoded language data requires time information. To make sound data first-class data in linguistics, linguists have to prepare, both psychologically and practically, to change from consumers of sound into providers of it. The format proposed in this presentation is not a major result of academic study, but it could serve as a case example for linguists to start considering the need for time information in their language documentation.

Brief References

[Author 2009]

[Author 2011]

[Author 2012]

Bauman, S. and L. Burnard (2008) TEI P5 Guidelines for Electronic Text Encoding and Interchange, TEI

Ide, N. and K. Suderman (2006) GrAF: A Graph-based Format for Linguistic Annotations, Proc. of the Linguistic Annotation Workshop

ISO 24612 (2012) Language resource management -- Linguistic annotation framework (LAF), ISO

ISO 24610-1 (2006) Language resource management -- Feature structures -- Part 1: Feature structure representation (FSR), ISO

Rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
Appears in Collections: 4th International Conference on Language Documentation and Conservation (ICLDC)

Items in ScholarSpace are protected by copyright, with all rights reserved, unless otherwise indicated.