Please use this identifier to cite or link to this item: http://hdl.handle.net/10125/25368

A general format for time information to be the first-class data of general linguistics

File        Size      Format
25368.pdf   50.31 kB  Adobe PDF
25368.zip   68.34 kB  ZIP

Item Summary

Title: A general format for time information to be the first-class data of general linguistics
Issue Date: 12 Mar 2015
Description: This presentation proposes a philosophy and a concrete data format for time information, to support the new kind of work that language documentation makes possible in computational environments. The format should be simple and cogent, because linguists will use it as a common, fundamental format for recording sound data and relating it to encoded language data. To be simple, the format is based on a flat data model, and its actual description should be plain text. To be cogent, the format rests on mathematical foundations. In this proposal, the elements of a record line up in superset order: each element is a superset of the element to its right. An element is either an ID (or an equivalent, such as a file name) or a pair of start and end timestamps. An example of the actual description of a record is
"original_sound.wav,00:00:13,00:01:03.25,part_of_sound1.wav".
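
As a rough illustration of how such a record might be read by a program, the following Python sketch splits one record into its elements. It is not part of the proposal: the names `Span`, `to_seconds`, and `parse_record` are hypothetical, and the sketch assumes that timestamps have the form HH:MM:SS with an optional fractional part, that IDs contain no commas, and that the final period in the example above is sentence punctuation rather than part of the record.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Assumed timestamp shape: HH:MM:SS with optional fractional seconds.
TIMESTAMP = re.compile(r"^(\d+):([0-5]\d):([0-5]\d(?:\.\d+)?)$")


@dataclass
class Span:
    """A pair of start and end times, both in seconds."""
    start: float
    end: float


def to_seconds(stamp: str) -> float:
    """Convert an HH:MM:SS[.fff] timestamp to seconds."""
    match = TIMESTAMP.match(stamp)
    if match is None:
        raise ValueError(f"not a timestamp: {stamp!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_record(line: str) -> List[Union[str, Span]]:
    """Split one record into IDs and time spans, keeping left-to-right order."""
    fields = [field.strip() for field in line.strip().split(",")]
    elements: List[Union[str, Span]] = []
    i = 0
    while i < len(fields):
        if TIMESTAMP.match(fields[i]):
            # Timestamps are expected in start/end pairs.
            if i + 1 >= len(fields) or not TIMESTAMP.match(fields[i + 1]):
                raise ValueError(f"start time without end time: {fields[i]!r}")
            span = Span(to_seconds(fields[i]), to_seconds(fields[i + 1]))
            if span.end < span.start:
                raise ValueError("end time precedes start time")
            elements.append(span)
            i += 2
        else:
            # Anything that is not a timestamp is treated as an ID.
            elements.append(fields[i])
            i += 1
    return elements


record = "original_sound.wav,00:00:13,00:01:03.25,part_of_sound1.wav"
print(parse_record(record))
```

Under these assumptions, the example record yields three elements in superset order: the ID of the original sound file, a span from 13 seconds to 63.25 seconds, and the ID of the extracted part.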

The reasons why this type of data format is needed are as follows.

(1) As shown in [Author 2012], a key strategy for sharing language resources is a data conversion service. In our experiments, data formats based on a multi-link-path model, proposed by international organizations or research projects, such as ISO LAF/GrAF [Ide 2006, ISO 24612] and TEI [Bauman 2008, ISO 24610-1], have drawbacks in data size and data manipulation [Author 2009]. If we use these formats, we have to prepare flexible data conversion programs or services [Author 2011, Author 2012]. To reduce the number of link paths, the part that defines data units in a standoff style can be moved out into an independent data file. The data format proposed here can be used for this kind of data.

(2) There is a one-to-many relationship between actual sound and sound data, and a many-to-many relationship between sound data and encoded language data. Thus, replaying the right stretch of sound from sound data requires time information. To make sound data first-class data in linguistics, linguists have to prepare, both psychologically and practically, to change from consumers of sound into providers of it. The format proposed in this presentation is not a major result of an academic study, but it could serve as a case example that prompts linguists to consider the need for time information in their language documentation.
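
The many-to-many relationship described in (2) can be pictured with a few flat records. The following Python sketch is only an illustration under stated assumptions: the file names and unit IDs are invented, the function name `build_indexes` is hypothetical, and each record is assumed to have exactly four comma-separated fields (sound ID, start time, end time, ID of an encoded language unit).

```python
from collections import defaultdict

# Invented records: one sound file relates to several units,
# and one unit relates to several sound files.
records = [
    "session1.wav,00:00:13,00:01:03.25,utterance_a",
    "session1.wav,00:01:10,00:01:42,utterance_b",
    "session1_copy.wav,00:00:13,00:01:03.25,utterance_a",
]


def build_indexes(lines):
    """Index flat records by sound ID and by unit ID."""
    by_sound = defaultdict(list)
    by_unit = defaultdict(list)
    for line in lines:
        sound, start, end, unit = line.split(",")
        by_sound[sound].append((start, end, unit))
        by_unit[unit].append((sound, start, end))
    return by_sound, by_unit


by_sound, by_unit = build_indexes(records)
print(by_sound["session1.wav"])  # every unit located in this sound file
print(by_unit["utterance_a"])    # every sound stretch that carries this unit
```

Because each record carries its own start and end times, either side of the relationship can be looked up and the matching stretch of sound replayed without following chains of links.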


Brief References

[Author 2009]

[Author 2011]

[Author 2012]

Bauman, S. and L. Burnard (2008) TEI P5 Guidelines for Electronic Text Encoding and Interchange, TEI

Ide, N. and K. Suderman (2006) GrAF: A Graph-based Format for Linguistic Annotations, Proc. of the Linguistic Annotation Workshop

ISO 24612 (2012) Language resource management -- Linguistic annotation framework (LAF), ISO

ISO 24610-1 (2006) Language resource management -- Feature structures -- Part 1: Feature structure representation (FSR), ISO
URI/DOI: http://hdl.handle.net/10125/25368
Rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
Appears in Collections: 4th International Conference on Language Documentation and Conservation (ICLDC)


