LD&C Special Publication No. 25: Doing Corpus-Based Typology With Spoken Language Corpora

    The role of language documentation in corpus-based typology
    (University of Hawai'i Press, 2021) Schnell, Stefan ; Haig, Geoffrey ; Seifart, Frank
    Child language documentation: The sketch acquisition project
    (University of Hawai'i Press, 2021) Hellwig, Birgit ; Defina, Rebecca ; Kidd, Evan ; Allen, Shanley ; Davidson, Lucinda ; Kelly, Barbara F.
    This paper reports on an ongoing project designed to collect comparable corpus data on child language and child-directed language in under-researched languages. Despite a long history of cross-linguistic research, there is a severe empirical bias within language acquisition research: Data is available for less than 2% of the world's languages, heavily skewed towards the larger and better-described languages. As a result, theories of language development tend to be grounded in a non-representative sample, and we know little about the acquisition of typologically diverse languages from different families, regions, or sociocultural contexts. It is very likely that the reasons are to be found in the forbidding methodological challenges of constructing child language corpora under fieldwork conditions with their strict requirements on participant selection, sampling intervals, and amounts of data. There is thus an urgent need for proposals that facilitate and encourage language acquisition research across a wide variety of languages. Adopting a language documentation perspective, we illustrate an approach that combines the construction of manageable corpora of natural interaction with and between children with a sketch description of the corpus data – resulting in a set of comparable corpora and comparable sketches that form the basis for cross-linguistic comparisons.
    Prosodic segmentation and cross-linguistic comparison in CorpAfroAs and CorTypo: Corpus-driven and corpus-based approaches
    (University of Hawai'i Press, 2021) Mettouchi, Amina ; Vanhove, Martine
    The paper addresses the issue of corpus design in relation to research questions for under-described languages. It shows how a corpus emerges from the methodology and habitus of its contributors, and how it is shaped by the technical tools used for data organization. It also underlines the ways in which a morphosyntactically annotated corpus, segmented into intonation units, is amenable to a wide array of searches, both corpus-based and corpus-driven, and both formal and functional. After a presentation of the annotation layout and the segmentation choices that characterize the two projects, CorpAfroAs and CorTypo, scientific results are illustrated for two languages, Kabyle and Beja, and more marginally for Zaar, Juba Arabic, and Modern Hebrew. They exemplify corpus-driven and corpus-based approaches to information structure and grammatical relations. Both types of approaches argue for an integrated view of prosody, closely interacting with syntax, semantics, phonology, information structure, and all levels of human communication and cognition. They also argue for a general endeavour to annotate, as far as possible, the large array of prosodic cues that are inseparable from speech processing and interaction dynamics.
    Combining documentary linguistics and corpus phonetics to advance corpus-based typology
    (University of Hawai'i Press, 2021) Seifart, Frank
    This article argues that documentary linguistics and corpus phonetics can form a happy marriage in that corpora extracted from language documentation collections contain highly relevant data that can advance corpus phonetics by enabling broad comparative studies. To make this point, this article reviews previous research on phonetic lengthening at utterance boundaries and pause probabilities before nouns and verbs in ten languages. I then introduce the DoReCo initiative, which, based on experience gained from these studies, builds a database of time-aligned corpora from documentary collections of 50 languages for corpus phonetic research and other research purposes.
    Universals of reference in discourse and grammar: Evidence from the Multi-CAST collection of spoken corpora
    (University of Hawai'i Press, 2021) Haig, Geoffrey ; Schnell, Stefan ; Schiborr, Nils Norman
    Data from under-researched languages are now available in sufficient quantity and quality to feed into corpus-based approaches to language typology. In this paper we present Multi-CAST (Multilingual Corpus of Annotated Spoken Texts), a project designed to facilitate cross-linguistic comparison of naturalistic discourse across typologically diverse languages, which implements a purpose-built shared annotation scheme. After sketching the rationale and architecture of Multi-CAST, we illustrate the efficacy of the method with two case studies. The first investigates the rates of lexical (as opposed to pronominal and zero) realization of arguments in discourse across a sample of 15 typologically diverse languages. Our results reveal a remarkable and hitherto unnoticed uniformity in the density of lexical references, despite the lack of content control in the corpora. The second addresses the question of whether cross-linguistically attested regularities in morphosyntax can meaningfully be related to frequency effects in discourse. We find some support for frequency-based explanations, but our data also show that the frequency accounts leave several key questions unanswered. Overall, our findings underscore that research based on language documentation-derived corpus data, and in particular spoken language data, is not only possible, but in fact crucially necessary for testing frequency-based explanations, because these data stem from spoken language and typologically diverse languages. We also identify a number of epistemological and methodological shortcomings with our approach, and discuss some of the requirements for further innovation in areas of corpus building, corpus annotation, and typological comparability.
    Language vs individuals in cross-linguistic corpus typology
    (University of Hawai'i Press, 2021) Barth, Danielle ; Evans, Nicholas ; Arka, I Wayan ; Bergqvist, Henrik ; Forker, Diana ; Gipper, Sonja ; Hodge, Gabrielle ; Kashima, Eri ; Kasuga, Yuki ; Kawakami, Carine ; Kimoto, Yukinori ; Knuchel, Dominique ; Kogura, Norikazu ; Kurabe, Keita ; Mansfield, John ; Narrog, Heiko ; Pratiwi, Desak Putu Eka ; van Putten, Saskia ; Senge, Chikako ; Tykhostup, Olena
    There is a long tradition in linguistics of seeing each language as a powerful factor setting out predetermining grooves in how people express themselves. But how strong is this effect? We know that despite the forces of linguistic habit people nonetheless enjoy some freedom in formulating their thoughts. Can we measure the relative contributions of language structures and individual variation to how people formulate statements about the world? Do accounts of typological differences need to take individual variation into account, and is such variation more prevalent in some kinds of linguistic domains than others? In this paper, we deploy a parallax corpus across thirteen languages from around the world and explore four case studies of linguistic choice, two grammatical and two semantic. We assess whether differences are adequately accounted for just by individual participant variation, just by language information, or whether taking both into account helps explain the patterns we see. We do this through comparisons of statistical models. Our results make it clear that participants using the same language do not always behave similarly, and this is especially true of our semantic variables. We take this to be a strong caution that the behaviour of individual participants should be considered when making typological generalisations, but also as an exciting outcome that corpus typology as a field can help us account for intra- and inter-language variation.
    This research topic of yours – is it a research topic at all? Using comparative interactional data for a fine-grained reanalysis of traditional concepts
    (University of Hawai'i Press, 2021) Ozerov, Pavel
    This paper demonstrates how bottom-up research on interactional data offers the opportunity of disentangling presumably basic linguistic notions into smaller primitives. Using parallel case studies on interrogatives and left dislocations from two unrelated languages (Modern Hebrew and Anal Naga), the paper shows how avoiding restrictive definitions and recurrently expanding the set of examples results in a revision of concepts taken for granted at the beginning of the study. The findings emphasise the need for channelling corpus-based research into an interactionally-informed examination of the metalanguage employed for the analysis. They illustrate how studying a research topic and questioning the validity of the concepts that underlie it are part of the same process.