Collecting and annotating corpora for three under-resourced languages of France: Methodological issues

dc.contributor.author Bernhard, Delphine
dc.contributor.author Ligozat, Anne-Laure
dc.contributor.author Bras, Myriam
dc.contributor.author Martin, Fanny
dc.contributor.author Vergez-Couret, Marianne
dc.contributor.author Erhart, Pascale
dc.contributor.author Sibille, Jean
dc.contributor.author Todirascu, Amalia
dc.contributor.author Boula de Mareüil, Philippe
dc.contributor.author Huck, Dominique
dc.date.accessioned 2021-06-22T16:51:12Z
dc.date.available 2021-06-22T16:51:12Z
dc.date.issued 2021-06
dc.description.abstract In contrast to French, the vast majority of regional languages of France can be considered under-resourced. In this article, we present the results of a research project aiming to produce annotated resources for three regional languages of France: Alsatian, Occitan, and Picard. These languages cover three different language families (Germanic and two subfamilies of Romance, Oïl and Oc languages) and different sociolinguistic situations. Yet, they all face issues common to many under-resourced languages: a lack of human and financial resources and the presence of geolinguistic variation. The originality of this project is that it brought together researchers from different fields (sociolinguistics, descriptive linguistics, dialectology, natural language processing, digital humanities) to work towards the common goal of developing annotated corpora for Alsatian, Occitan, and Picard. This created a favorable and stimulating working environment which could not have been achieved had different research groups worked independently, each on a single language. This article details the annotation process, with a special focus on the delimitation of the tokens and the definition of the part-of-speech tags.
dc.description.sponsorship National Foreign Language Resource Center
dc.format.extent 42 pages
dc.identifier.citation Bernhard, Delphine, Anne-Laure Ligozat, Myriam Bras, Fanny Martin, Marianne Vergez-Couret, Pascale Erhart, Jean Sibille, Amalia Todirascu, Philippe Boula de Mareüil, & Dominique Huck. 2021. Collecting and annotating corpora for three under-resourced languages of France: Methodological issues. Language Documentation & Conservation 15: 316-357. http://hdl.handle.net/10125/74645.
dc.identifier.issn 1934-5275
dc.identifier.uri http://hdl.handle.net/10125/74645
dc.language.iso en-US
dc.publisher University of Hawaii Press
dc.rights Creative Commons Attribution-NonCommercial 4.0 International
dc.rights Attribution-NonCommercial 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc/3.0/us/ *
dc.subject corpus
dc.subject annotations
dc.subject tokenization
dc.subject part-of-speech
dc.subject Alsatian
dc.subject Occitan
dc.subject Picard
dc.title Collecting and annotating corpora for three under-resourced languages of France: Methodological issues
dc.type Article
dc.type.dcmi Text
prism.endingpage 357
prism.publicationname Language Documentation & Conservation
prism.startingpage 316
prism.volume 15
Files
Original bundle
Name: bernhard_et_al.pdf
Size: 10.92 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.73 KB
Format: Item-specific license agreed upon to submission