Presentations from the Linguistic Society of America symposium and poster session on Data Citation and Attribution in Linguistics, 5-9 January 2017, Austin TX

Developing Standards for Data Citation and Attribution for Reproducible Research in Linguistics

A project of the National Science Foundation (NSF 1447886, PIs Andrea Berez-Kroeker, Susan Kung, Gary Holton, Peter Pulsifer)

This 2-year project (2015-2017) supports a series of three workshops and one panel presentation bringing together relevant stakeholders to develop and promote standards for data citation and attribution for linguistic data.

Linguistics is a data-driven social science in which inferences about human cognition and social structure are drawn from observations of linguistic practice. These observations, in the form of recordings and associated annotations, represent the primary data sets that underlie the field. This practice has its roots in philology, which relies on texts as a primary data source. However, three recent and inter-related factors make the data-oriented model of linguistics particularly relevant to the field at the current time. First, a major shift in technology has resulted in rapidly growing volumes of digital language data. Second, more than half of the world's languages are critically endangered, so that in the not-so-distant future archival data will be the only source of information on those languages. Third, the emergence of Documentary Linguistics as a recognized sub-field has led to an increased focus on data curation and management.

While linguists have always relied on language data, they have not always facilitated access to those data. Linguistic publications typically include short excerpts from data sets, ordinarily consisting of fewer than five words, and often without citation. Where citations are provided, the connection to the data set is usually only vaguely identified. An excerpt might be given a citation which refers to the name of the text from which it was extracted, but in practice the reader has no way to access that text. That is, in spite of the potential generated by recent shifts in the field, a great deal of linguistic research created today is not reproducible, either in principle or in practice. The workshops and panel presentation will facilitate development of standards for the curation and citation of linguistics data that are responsive to these changing conditions and shift the field of linguistics toward a more scientific, data-driven model which results in reproducible research.

A primary factor hindering the development of reproducible research in linguistics is the lack of standards for data citation and attribution. Although language data are increasingly recognized as important, there are no widely established guidelines for the citation of these data. Equally important, there are no standards for attribution. Lacking such standards, journals, academic tenure and promotion committees, and peer review processes continue to emphasize linguistic analyses over linguistic data, and as a result linguists have little incentive to make data accessible. A data-driven linguistic science has the potential to provide substantiation of scientific claims by promoting attention to the care and structuring of language data.

By the end of the project, the researchers will have held three workshops to research and develop a model for data citation and attribution in linguistics; facilitated discipline-wide discussion on these topics at the 2017 annual meeting of the Linguistic Society of America; written a position paper on standards for citation and attribution in linguistics; and submitted a proposal for a Resolution on citation and attribution to the LSA.


Recent Submissions

  • Item
    Developing methods for reproducible research in linguistics: a first step
    (2017-01-06) McDonnell, Bradley; Hall, Patrick
    Researchers in other fields have developed software tools that publish code and results in a single document linked directly to the data. In mainstream linguistics, however, such software does not exist. The workflows for including linguistic examples in published work typically involve manually copying and pasting text from a database into a word processing document. These manual methods are error-prone and time-consuming, often involving tedious tasks such as aligning glosses in tables or with tabs. Furthermore, the examples in these documents are in no way linked to the corpus. This poster presents a first attempt at developing a family of scripts called glossbox that link data, code, and analysis. At present, glossbox works with the typesetting software LaTeX, allowing users to semi-automatically import examples directly from the corpus. These examples require little to no manual manipulation and automatically produce citations to the corpus.
  • Item
    Data management across academic disciplines
    (2017-01-06) Hooshiar, Kavon
    Developments in digital technologies have increased the quantity of data being created as well as provided a means to make that data available to the public digitally. Researchers are now faced with managing such grey publications, for which there is no guarantee of persistence or accessibility, nor standards for citation and attribution. Linguists are not alone in changing the way we think about data. Initiatives such as the e-Infrastructure Reflection Group and FORCE11 have membership across the sciences and identify citation, attribution, unique identification, access, persistence, specificity, and interoperability of data as fundamental. The linguistics community could benefit from developing our understanding of data management consistently with the larger academic community, and the overlap between our guiding principles should facilitate this outcome. This poster outlines this overlap between our efforts and those of other disciplines, and explores ways we can proceed to facilitate our interaction with the larger academic community.
  • Item
    Data citation, attribution, and employability
    (2017-01-06) Dailey, Meagan; Henke, Ryan
    Demand from academic departments for linguists possessing data skills has remained low in the last decade despite an influx of new data-driven tools and research, and the ability to manage data in ways not possible before the internet. We assessed two barometers of employability: academic job postings and course descriptions. A survey of the academic linguistic job market over 10 years reveals that despite the field becoming more reliant on digital data, employers are not asking that candidates be fluent in data management. We also surveyed course descriptions and syllabi from 25 of the top-ranked linguistics programs in the United States and abroad, finding that most universities do not offer training in basic data management, despite offering courses in data-driven subdisciplines. This poster presents the data supporting these points, including data management hiring and training trends.
  • Item
    TROLLing: Scope and operation of an open repository for linguistic datasets
    (2017-01-06) Andreassen, Helene N.; Conzett, Philipp; Høydalsvik, Stein; Longva, Leif; Obiajulu, Odu
    TROLLing is an international archive for open linguistic data and statistical code (e.g. R scripts), launched in 2014 at UiT The Arctic University of Norway. With the increasing demand for archiving and sharing research data, as well as the problem of improper attribution, TROLLing aims to meet researchers’ needs by offering safe storage of data files and metadata templates based on international standards. Retrieval, sharing, and reuse of data are further facilitated by TROLLing being part of a global open data network. As regards attribution, the system automatically provides a dataset citation comprising, among other things, the author name(s) and a persistent identifier (DOI). Version control allows researchers to update their datasets at any time, with previously published versions remaining available open access. TROLLing is available to all subfields of linguistics, but is limited to structural data. The metadata template, however, allows linking to primary data stored elsewhere.
  • Item
    Citation and attribution of archived data: guidelines of the Archive of the Indigenous Languages of Latin America
    (2017-01-06) Kung, Susan; Perez Gonzalez, Jaime
    Today great quantities of research data are publicly available for re-use, and most academic fields are becoming aware of the need to establish recommendations for how these data should be cited so that the data creators get proper attribution for their work. To this end, AILLA has developed Citation Guidelines that provide detailed citation examples of the different hierarchical levels of AILLA's holdings, including collections (organized materials based on individual collectors), resources (materials organized around a speech event), and individual files. These Guidelines differentiate in-text and bibliographic citations. Furthermore, each collection and resource page on AILLA provides instructions for how it should be cited. In this poster, we explain AILLA's Citation Guidelines, we show how--when followed--these guidelines give appropriate credit to the various contributors of the data and allow for easy access to the data in the archive, and we demonstrate the proper implementation of these guidelines in linguistic literature.
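
The last abstract describes citations at three hierarchical levels (collection, resource, file) and distinguishes in-text from bibliographic citations. The sketch below is one way such a scheme might be modeled. It is illustrative only and does not reproduce AILLA's actual Citation Guidelines: the class `ArchiveItem`, the functions `in_text` and `bibliographic`, the citation layout, the archive name "Example Archive", and all sample records and identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArchiveItem:
    """One citable unit at any level of a (hypothetical) archive hierarchy."""
    level: str        # "collection", "resource", or "file"
    creator: str      # surname first, e.g. "Doe, Jane"
    year: int
    title: str
    identifier: str   # placeholder for a persistent identifier
    parent: Optional["ArchiveItem"] = None


def in_text(item: ArchiveItem) -> str:
    """Short author-year form for use in running text."""
    surname = item.creator.split(",")[0].strip()
    return f"({surname} {item.year})"


def bibliographic(item: ArchiveItem, archive: str = "Example Archive") -> str:
    """Full form that names every enclosing level, innermost first."""
    parts = [f"{item.creator}. {item.year}. {item.title}."]
    node = item.parent
    while node is not None:
        parts.append(f"In {node.title}.")
        node = node.parent
    parts.append(f"{archive}. {item.identifier}.")
    return " ".join(parts)


# Placeholder records, not real archive holdings.
collection = ArchiveItem("collection", "Doe, Jane", 2010,
                         "Example Language Collection", "ID-0001")
resource = ArchiveItem("resource", "Doe, Jane", 2011,
                       "Narrative session", "ID-0002", parent=collection)
audio_file = ArchiveItem("file", "Doe, Jane", 2011,
                         "narrative_01.wav", "ID-0003", parent=resource)

print(in_text(audio_file))
print(bibliographic(audio_file))
```

Storing a `parent` link on each item means that a file-level citation can always name the resource and collection it belongs to, which is what lets a reader locate the cited data in the archive.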