Presentations from the Linguistic Society of America symposium and poster session on Data Citation and Attribution in Linguistics, 5-9 January 2017, Austin TX
Developing Standards for Data Citation and Attribution for Reproducible Research in Linguistics
A project of the National Science Foundation (NSF 1447886, PIs Andrea Berez-Kroeker, Susan Kung, Gary Holton, Peter Pulsifer)
This 2-year project (2015-2017) supports a series of three workshops and one panel presentation bringing together relevant stakeholders to develop and promote standards for data citation and attribution for linguistic data.
Linguistics is a data-driven social science in which inferences about human cognition and social structure are drawn from observations of linguistic practice. These observations, in the form of recordings and associated annotations, represent the primary data sets that underlie the field. This practice has its roots in philology, which relies on texts as a primary data source. However, three recent and inter-related factors make the data-oriented model of linguistics particularly relevant to the field at the current time. First, a major shift in technology has resulted in rapidly growing volumes of digital language data. Second, more than half of the world's languages are critically endangered, so that in the not-so-distant future archival data will be the only source of information on those languages. Third, the emergence of Documentary Linguistics as a recognized sub-field has led to an increased focus on data curation and management.
While linguists have always relied on language data, they have not always facilitated access to those data. Linguistic publications typically include short excerpts from data sets, ordinarily consisting of fewer than five words, and often without citation. Where citations are provided, the connection to the data set is usually only vaguely identified. An excerpt might be given a citation which refers to the name of the text from which it was extracted, but in practice the reader has no way to access that text. That is, in spite of the potential generated by recent shifts in the field, a great deal of linguistic research created today is not reproducible, either in principle or in practice. The workshops and panel presentation will facilitate development of standards for the curation and citation of linguistics data that are responsive to these changing conditions and shift the field of linguistics toward a more scientific, data-driven model which results in reproducible research.
A primary factor hindering the development of reproducible research in linguistics is the lack of standards for data citation and attribution. Although language data are increasingly recognized as important, there are no widely established guidelines for the citation of these data. Equally important, there are no standards for attribution. Lacking such standards, journals, academic tenure and promotion committees, and peer review processes continue to emphasize linguistic analyses over linguistic data, and as a result linguists have little incentive to make data accessible. A data-driven linguistic science has the potential to provide substantiation of scientific claims by promoting attention to the care and structuring of language data.
By the end of the project, the researchers will have held three workshops to research and develop a model for data citation and attribution in linguistics; facilitated discipline-wide discussion on these topics at the 2017 annual meeting of the Linguistic Society of America; written a position paper on standards for citation and attribution in linguistics; and submitted a proposal for a Resolution on citation and attribution to the LSA.
Developing methods for reproducible research in linguistics: a first step (2017-01-06)
Researchers in other fields have developed various software tools for reproducible research that facilitate publishing code and results in a single document linked directly to the data. In mainstream linguistics, however, such software does not exist. The workflows for including linguistic examples in published work typically involve manually copying and pasting text from a database into a word processing document. These manual methods are error-prone and time-consuming, often involving tedious tasks such as aligning glosses in tables or with tabs. Furthermore, the examples in these documents are in no way linked to the corpus. This poster presents a first attempt at developing a family of scripts called glossbox that link data, code, and analysis. At present, glossbox works with the typesetting software LaTeX, allowing users to semi-automatically import examples directly from the corpus. These examples require little to no manual manipulation and automatically produce citations to the corpus.
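The abstract does not describe glossbox's actual interface, so the following is only a minimal sketch of the general idea: generating a LaTeX interlinear example, with a reference back to the corpus record, directly from structured data. The `CorpusExample` class, the `to_gb4e` function, and the record identifier format are all hypothetical, and the output assumes the `gb4e` LaTeX package for glossed examples.

```python
from dataclasses import dataclass


@dataclass
class CorpusExample:
    """One glossed example as it might be stored in a corpus (hypothetical schema)."""
    record_id: str          # hypothetical identifier of the corpus record
    words: list[str]        # object-language words
    glosses: list[str]      # one gloss per word
    translation: str        # free translation


def to_gb4e(example: CorpusExample) -> str:
    """Render an example as gb4e markup, citing the corpus record it came from.

    Note: this sketch does not escape LaTeX special characters.
    """
    if len(example.words) != len(example.glosses):
        raise ValueError("each word needs exactly one gloss")
    lines = [
        r"\begin{exe}",
        r"\ex",
        r"\gll " + " ".join(example.words) + r"\\",
        " ".join(example.glosses) + r"\\",
        r"\glt `" + example.translation + "' (" + example.record_id + ")",
        r"\end{exe}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder tokens, not real language data.
    demo = CorpusExample("text01:12", ["w1", "w2"], ["g1", "g2"], "free translation")
    print(to_gb4e(demo))
```

Because the markup is generated from the record rather than pasted by hand, gloss alignment and the corpus reference cannot drift apart from the stored data.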
Data management across academic disciplines (2017-01-06)
Developments in digital technologies have increased the quantity of data being created as well as provided a means to make that data available to the public digitally. Researchers are now faced with managing such grey publications, for which there is no guarantee of persistence or accessibility, nor standards for citation and attribution. Linguists are not alone in changing the way we think about data. Initiatives such as the e-Infrastructure Reflection Group and FORCE11 have membership across the sciences and identify citation, attribution, unique identification, access, persistence, specificity, and interoperability of data as fundamental. The linguistics community could benefit from developing our understanding of data management consistently with the larger academic community, and the overlap between our guiding principles should facilitate this outcome. This poster outlines this overlap between our efforts and those of other disciplines, and explores ways we can proceed to facilitate our interaction with the larger academic community.
Data citation, attribution, and employability (2017-01-06)
Demand from academic departments for linguists possessing data skills has remained low in the last decade despite an influx of new data-driven tools and research, and an ability to manage data in ways not possible before the internet. We assessed two barometers of employability: academic job postings and course descriptions. A survey of the academic linguistic job market over 10 years reveals that despite the field becoming more reliant on digital data, employers are not asking that candidates be fluent in data management. We also surveyed course descriptions and syllabi from 25 of the top-ranked linguistics programs in the United States and abroad, finding that most universities do not offer training in basic data management, despite offering courses in data-driven subdisciplines. This poster presents the data supporting these points, including data management hiring and training trends.
TROLLing: Scope and operation of an open repository for linguistic datasets (2017-01-06)
TROLLing (opendata.uit.no) is an international archive for open linguistic data and statistical code (e.g. R scripts), launched in 2014 at UiT The Arctic University of Norway. With the increasing demand for archiving and sharing research data, as well as the problem of improper attribution, TROLLing aims to meet researchers’ needs by offering safe storage of data files and metadata templates based on international standards. Retrieval, sharing, and reuse of data are further facilitated by TROLLing being part of a global open data network. As regards attribution, the system automatically provides a dataset citation, comprising among other things the author name(s) and a persistent identifier (DOI). Version control allows researchers to update their datasets at any time, with previously published versions remaining available open access. TROLLing is available to all subfields of linguistics, but is limited to structural data. The metadata template, however, allows linking to primary data stored elsewhere.
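As an illustration of what an automatically generated dataset citation might look like, here is a minimal sketch that assembles one from metadata fields. The function name, the field order, and the version suffix are assumptions made for this example and are not taken from TROLLing; the only elements the abstract confirms are the author name(s) and a persistent identifier.

```python
def format_dataset_citation(authors, year, title, doi, repository, version=None):
    """Build a citation string from dataset metadata (hypothetical field layout)."""
    if not authors:
        raise ValueError("at least one author is required")
    parts = [
        "; ".join(authors),            # author name(s)
        str(year),
        f'"{title}"',
        f"https://doi.org/{doi}",      # persistent identifier as a resolvable link
        repository,
    ]
    if version is not None:
        # Naming the version lets readers retrieve exactly the data that was cited.
        parts.append(f"V{version}")
    return ", ".join(parts)


if __name__ == "__main__":
    # Placeholder metadata, not a real dataset.
    print(format_dataset_citation(
        ["Doe, Jane"], 2017, "Example dataset",
        "10.0000/EXAMPLE", "Example Repository", version=2,
    ))
```

Including the version in the citation matters when datasets can be updated after publication, since earlier versions remain available alongside the current one.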
Citation and attribution of archived data: guidelines of the Archive of the Indigenous Languages of Latin America (2017-01-06)
Today great quantities of research data are publicly available for re-use, and most academic fields are becoming aware of the need to establish recommendations for how these data should be cited so that the data creators get proper attribution for their work. To this end, AILLA has developed Citation Guidelines that provide detailed citation examples for the different hierarchical levels of AILLA's holdings, including collections (organized materials based on individual collectors), resources (materials organized around a speech event), and individual files. These Guidelines differentiate in-text and bibliographic citations. Furthermore, each collection and resource page on AILLA provides instructions for how it should be cited. In this poster, we explain AILLA's Citation Guidelines; we show how these guidelines, when followed, give appropriate credit to the various contributors of the data and allow for easy access to the data in the archive; and we demonstrate the proper implementation of these guidelines in linguistic literature.
The data management life cycle for linguists (2017-01-06)
While the concept of managing data is not new to the field of linguistics, the reality is that there are still significant barriers to creating citable records that allow persistent access to clearly structured primary data and that enable reproducible results. With the current emphasis from funding agencies and publishers on the importance of transparency and data sharing, it is increasingly important that good data management skills and methods be prioritized as a formal part of the academic linguist’s workflow. Although it is difficult to provide specific recommendations for all subdisciplines of such a heterogeneous field, this poster will highlight core principles and provide specific guidelines for academic linguists throughout the research life cycle, from the earliest pre-planning stages, through deposit in a trusted repository, and into the future as data are re-used for further inquiry.
A survey of current reproducibility practices in linguistics publications (2017-01-06)
In order to move forward toward reproducible research in linguistics, we first need to know where we are now with regard to our practices for methodological clarity and data citation in publications. In this poster we share the results of a study of over 370 journal articles, dissertations, and grammars, which we take as a sample of current practices in the field. The publications all come from a ten-year span. The journals were selected for broad coverage. Grammars included published grammars and dissertations written as grammars, with broad geographic coverage, both in terms of subject language and publisher or university. These publications are critiqued on the basis of transparency of data source, data collection methods, analysis, and storage. While we find examples of transparent reporting, most of the surveyed research does not include key metadata, methodological information, or citations that are resolvable to the data on which the analyses are based.
Questions, curiosities, and concerns: talking points for data citation and attribution (2017-01-06)
Changing the way linguists approach data citation and attribution means changing the way we traditionally think about data, their role in research, and their scholarly value. We are accustomed to valuing only particular academic products, even though we invest just as much time, effort, and analytical skill into collecting and managing the data behind published products. Efforts to redress this status quo proceed on a variety of fronts, but none of these efforts will happen overnight. Furthermore, not all of these conversations will be smooth conversions of viewpoints and philosophies. Changing minds takes time and patience, especially when navigating decades of thought and practice. This poster presents some questions, curiosities, and concerns that have been raised during conversations about changing standards and practices for data citation and attribution. Following the tradition in public relations and politics, we offer talking points for conveying a helpful and hopeful message to colleagues.
Developing standards for data citation and attribution for reproducible research in linguistics: project summary and next steps (2017-01-06)
Developing Standards for Data Citation and Attribution for Reproducible Research in Linguistics is an NSF-supported project (SMA-1447886) that brings together relevant stakeholders to collaboratively develop and promote standards for linguistic data citation and attribution. Project participants include linguistics journal editors; language archivists; linguists representing various subfields and academic career stages from graduate students to provosts; and “Big Data” specialists. The first workshop was held at the University of Colorado Boulder in September 2015; the second was held at The University of Texas at Austin in April 2016. A panel presentation and the final workshop will take place in conjunction with this LSA Annual Meeting. This poster summarizes the aims and accomplishments of the first two workshops, describes the plans for the final workshop, and suggests next steps in the development and promotion of a model for data citation and attribution in linguistics.