LSI and DBSCAN: Natural language processing for sociolinguistic analysis

Collard, Jacob
Analyzing sociolinguistic and anthropological information remains an open problem in the contemporary social sciences. Statistical analysis can identify and quantify correlations and other relationships, but qualitative information is much harder to examine, including descriptions of sociolinguistic contexts such as those found in the Endangered Languages Catalog (The Linguist List at Eastern Michigan University and The University of Hawaii at Mānoa, 2012). Natural language processing techniques such as latent semantic indexing (LSI), also known as latent semantic analysis, make it possible to quantify sociolinguistic descriptions to a certain degree.

Quantifying natural language semantics makes sociolinguistic analysis less subjective, though the analysis is still performed on descriptions written by humans. Furthermore, when combined with a document clustering technique such as DBSCAN (Kriegel et al., 2011), natural language processing makes it possible to recognize relationships between disparate languages that have hitherto been overlooked. Because of the speed and breadth of this approach, it can recognize relationships between any languages, regardless of geographic or genetic distance. This can provide insight into the effectiveness of different conservation techniques and language policies, as descriptions of these parameters are commonly found in natural language publications.

References

Hans-Peter Kriegel, Peer Kröger, Jörg Sander, and Arthur Zimek. Density-based clustering. WIREs Data Mining and Knowledge Discovery, 1(3):231–240, 2011.

The Linguist List at Eastern Michigan University and The University of Hawaii at Mānoa. Endangered languages, 2012. URL
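The combination of LSI and DBSCAN described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the use of scikit-learn, the function name `cluster_descriptions`, the parameter values, and the sample descriptions are all assumptions introduced here. The sample texts are invented placeholders and are not taken from the Endangered Languages Catalog.

```python
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Invented placeholder descriptions (assumption, not catalog data).
SAMPLE_DESCRIPTIONS = [
    "children learn the language in immersion schools",
    "immersion schools teach the language to children",
    "children attend immersion schools and learn the language",
    "only elderly speakers remain and transmission has stopped",
    "transmission has stopped and few elderly speakers remain",
    "elderly speakers remain but transmission to youth has stopped",
]


def cluster_descriptions(texts, n_topics=2, eps=0.3, min_samples=2):
    """Group free-text descriptions by semantic similarity.

    Hypothetical helper: n_topics, eps and min_samples are illustrative
    values chosen for the toy data above, not values from the source.
    """
    # LSI step: TF-IDF term weights, then a truncated SVD that projects
    # each document into a low-dimensional latent semantic space.
    lsi = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=n_topics, random_state=0),
        Normalizer(copy=False),  # unit length, so cosine distance is meaningful
    )
    vectors = lsi.fit_transform(texts)

    # Clustering step: DBSCAN joins documents that have enough neighbours
    # within cosine distance eps; documents in no dense region get -1.
    dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return dbscan.fit_predict(vectors).tolist()


if __name__ == "__main__":
    for label, text in zip(cluster_descriptions(SAMPLE_DESCRIPTIONS),
                           SAMPLE_DESCRIPTIONS):
        print(label, text)
```

Because DBSCAN does not require the number of clusters in advance and labels isolated documents as noise, it suits exploratory use on descriptions whose groupings are not known beforehand. In practice `eps` and `n_topics` would need tuning against real descriptions.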
Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported