Please use this identifier to cite or link to this item: http://hdl.handle.net/10125/24675

BaTelÒc: A text base for the Occitan language

File: bras_vergez-couret_2016.pdf
Size: 707.05 kB
Format: Adobe PDF

Item Summary

Title: BaTelÒc: A text base for the Occitan language
Authors: Myriam Bras; Marianne Vergez-Couret
Issue Date: Feb 2016
Publisher: University of Hawai'i Press
Citation: Bras, Myriam and Marianne Vergez-Couret. 2016. BaTelÒc: A text base for the Occitan language. In Vera Ferreira and Peter Bouda (eds.). Language Documentation and Conservation in Europe. 133-149. Honolulu: University of Hawai'i Press.
Series/Report no.: LD&C Special Publication
Abstract: Language Documentation, as defined by Himmelmann (2006), aims at compiling and preserving linguistic data for studies in linguistics, literature, history, ethnology, and sociology. This initiative is vital for endangered languages such as Occitan, a Romance language spoken in southern France and in several valleys of Spain and Italy. The documentation of a language concerns all its modalities, covering spoken and written language, various registers and so on. Nowadays, Occitan documentation mostly consists of data from linguistic atlases, virtual libraries from the modern to the contemporary period, and text bases for the Middle Ages. BaTelÒc is a text base for the modern and contemporary periods. With the aim of creating a wide coverage of text collections, BaTelÒc gathers not only written literary texts (prose, drama and poetry) but also other genres such as technical texts and newspapers. Enough material is already available to foresee a text base of hundreds of millions of words. BaTelÒc not only aims at documenting Occitan; it is also designed to provide tools to explore texts (different criteria for corpus selection, concordance tools and more complex enquiries with regular expressions). As for linguistic analysis, the second step is to enrich the corpora with annotations. Natural Language Processing of endangered languages such as Occitan is very challenging. It is not possible to transpose existing models for resource-rich languages directly, partly because of the spelling, dialectal variations, and lack of standardization. With BaTelÒc we aim at providing corpora and lexicons for the development of basic natural language processing tools, namely OCR and a Part-of-Speech tagger based on tools initially designed for machine translation and which take variation into account.
Sponsor: National Foreign Language Resource Center
URI/DOI: http://hdl.handle.net/10125/24675
ISBN: 978-0-9856211-5-5
Rights: Creative Commons Attribution Non-Commercial Share Alike License
Appears in Collections: LD&C Special Publication No. 9: Language Documentation and Conservation in Europe


