Collecting and annotating corpora for three under-resourced languages of France: Methodological issues

Date

2021-06

Contributor

Advisor

Department

Instructor

Depositor

Speaker

Researcher

Consultant

Interviewer

Narrator

Transcriber

Annotator

Journal Title

Journal ISSN

Volume Title

Publisher

University of Hawaii Press

Volume

15

Number/Issue

Starting Page

316

Ending Page

357

Alternative Title

Abstract

In contrast to French, the vast majority of regional languages of France can be considered as under-resourced. In this article, we present the results of a research project aiming to produce annotated resources for three regional languages of France: Alsatian, Occitan, and Picard. These languages cover three different language families (Germanic and two subfamilies of Romance, Oïl and Oc languages) and different sociolinguistic situations. Yet, they all face issues common to many under-resourced languages: lack of human and financial resources and presence of geolinguistic variation. The originality of this project is that it brought together researchers from different fields (sociolinguistics, descriptive linguistics, dialectology, natural language processing, digital humanities) to work together towards the common goal of developing annotated corpora for Alsatian, Occitan, and Picard. This created a favorable and stimulating working environment which could not have been achieved had different research groups worked independently, each on a single language. This article details the annotation process, with a special focus on the delimitation of the tokens and the definition of the part-of-speech tags.

Description

Keywords

corpus, annotations, tokenization, part-of-speech, Alsatian, Occitan, Picard

Citation

Bernhard, Delphine, Anne-Laure Ligozat, Myriam Bras, Fanny Martin, Marianne Vergez-Couret, Pascale Erhart, Jean Sibille, Amalia Todirascu, Philippe Boula de Mareüil, & Dominique Huck. 2021. Collecting and annotating corpora for three under-resourced languages of France: Methodological issues. Language Documentation & Conservation 15: 316-357. http://hdl.handle.net/10125/74645.

Extent

42 pages

Format

Geographic Location

Time Period

Related To

Related To (URI)

Table of Contents

Rights

Creative Commons Attribution-NonCommercial 4.0 International
Attribution-NonCommercial 3.0 United States

Rights Holder

Local Contexts

Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.