
A general format for time information to be the first-class data of general linguistics

File        Size      Format
25368.pdf   50.31 kB  Adobe PDF
25368.zip   68.34 kB  ZIP

Item Summary

Title: A general format for time information to be the first-class data of general linguistics
Authors: Ohya, Kazushi
Issue Date: 12 Mar 2015
Description: This presentation proposes a philosophy and a concrete description of a data format for time information, in order to realize a new phenomenon that language documentation brings about in computational environments. The data format should be simple and cogent, because linguists will use it as a common, fundamental format for recording sound data and relating it to encoded language data. To be simple, the format is based on a flat data model, and its concrete description should be plain text. To be cogent, the format is based on mathematical foundations. In this proposal, the elements of a record are lined up in superset order: each element is a superset of the element to its right. An element is either an ID (or an equivalent, such as a file name) or a pair of start and end timestamps. An example of the actual description of a record is

The reasons why this type of data format is needed are as follows.

(1) As shown in [Author 2012], a key strategy for sharing language resources is a data conversion service. In our experiments, data formats based on a multi-link-path model proposed by international organizations or research projects, such as ISO LAF/GrAF [Ide 2006, ISO 24612] and TEI [Bauman 2008, ISO 24610-1], have drawbacks in data size and data manipulation [Author 2009]. If we use these formats, we have to prepare flexible data conversion programs or services [Author 2011, Author 2012]. To reduce the number of link paths, the part that defines data units in a standoff style can be moved out into an independent data file. The data format proposed here can be used for this kind of data.

(2) There is a one-to-many relationship between actual sound and sound data, and a many-to-many relationship between sound data and encoded language data. Thus, replaying sound from sound data requires time information. To make sound data first-class data in linguistics, linguists have to prepare, both psychologically and practically, to change from consumers of sound into providers of it. The format proposed in this presentation is not a major result of academic study, but it could serve as a case example that prompts linguists to start considering the need for time information in their language documentation.
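As a rough illustration of the record layout described above, the following Python sketch reads one flat, plain-text record and checks that its time spans appear in superset order. The concrete syntax is an assumption made only for this sketch and is not taken from the proposal: fields are separated by tabs, a time span is written as `start,end` in seconds, and any other field is treated as an ID or file name. The names `Span`, `parse_element`, `parse_record`, and the file name `session01.wav` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Span:
    """A pair of start and end timestamps (assumed here to be seconds)."""
    start: float
    end: float

    def contains(self, other: "Span") -> bool:
        # A span is a superset of another if it fully encloses it.
        return self.start <= other.start and other.end <= self.end


Element = Union[str, Span]


def parse_element(field: str) -> Element:
    """Read a field as a Span if it looks like 'start,end', else as an ID."""
    parts = field.split(",")
    if len(parts) == 2:
        try:
            start, end = float(parts[0]), float(parts[1])
        except ValueError:
            return field  # not numeric, so treat it as an ID or file name
        if start > end:
            raise ValueError(f"start is after end in {field!r}")
        return Span(start, end)
    return field


def parse_record(line: str) -> List[Element]:
    """Split a tab-separated record and check superset order of its spans."""
    elements = [parse_element(f) for f in line.rstrip("\n").split("\t") if f]
    spans = [e for e in elements if isinstance(e, Span)]
    # Each span must enclose the span to its right.
    # IDs are not checked: the sketch cannot know what interval an ID covers.
    for outer, inner in zip(spans, spans[1:]):
        if not outer.contains(inner):
            raise ValueError(f"{outer} is not a superset of {inner}")
    return elements


# Hypothetical record: a file, a span within it, and a narrower span inside that.
record = parse_record("session01.wav\t0.0,60.0\t12.5,15.0")
```

Because the record is a single flat line, ordinary line-oriented text tools can process it, and the superset check reduces to a comparison of neighbouring spans.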

Brief References

[Author 2009]

[Author 2011]

[Author 2012]

Bauman, S. and L. Burnard (2008) TEI P5 Guidelines for Electronic Text Encoding and Interchange, TEI

Ide, N. and K. Suderman (2006) GrAF: A Graph-based Format for Linguistic Annotations, Proc. of the Linguistic Annotation Workshop

ISO 24612 (2012) Language resource management -- Linguistic annotation framework (LAF), ISO

ISO 24610-1 (2006) Language resource management -- Feature structures -- Part 1: Feature structure representation (FSR), ISO
Rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
Appears in Collections: 4th International Conference on Language Documentation and Conservation (ICLDC)
