Inferring human personality from written media

Date

2020

Publisher

University of Hawaii at Manoa

Abstract

This work explores the association between human personality and language features consisting of sequences of tokens. It shows that some such features are predictive of personality across multiple corpora drawn from different populations of English speakers. I gathered written text authored by 50 individuals who participated in a bodybuilding web forum (the Forum corpus). I also administered a personality questionnaire following the protocol provided by the International Personality Item Pool (IPIP). For comparison across other populations, I obtained text corpora from three other research groups, along with the results of personality assessments: the EAR corpora, consisting of transcripts of the speech of 96 participants as they went about their daily lives; Essays written by 2,588 undergraduates at the University of Texas; and posts by 244 Facebook users. After performing part-of-speech (POS) tagging on the text of all participants in these corpora, I extracted unigrams, bigrams, and trigrams (n-grams) of tokens and their POS tags, and counted every word/tag permutation that appeared. I considered only features appearing at least once per 1,000 words in the Forum corpus, because there was not enough data to consider sparser features; 766 features met this threshold. Among those features, I examined which were relevant in both the Forum corpus and at least one of the borrowed corpora, since these are the most promising, robust features and illustrate the possibility of building models across various corpora using the same language features. Of the 766 features, 75 were associated with one or more personality dimensions in both the Forum corpus and at least one additional corpus. I devised explanations for why some of the features are correlated with a given personality dimension.
These explanations establish that, although some of the features may have arisen by chance, one can confidently conclude that English speakers consistently express their personalities through their language usage. In addition, to show that these features can be used for prediction, I fitted multiple linear regression models for each combination of corpus and personality dimension; in the best case (Openness with the Forum corpus), I obtained an R² of 0.686 and an S (standard error of the estimate) of 0.561. My work lays a foundation for more robust, accurate models of personality. I hope that others will find additional principled explanations of why the features I found are associated with personality. I anticipate that suitable language-analytical techniques will deepen insight into personality both for English speakers and for speakers of other world languages.
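
The feature extraction and fit statistics described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes that the input is already POS-tagged as a list of (word, tag) pairs, that a "word/tag permutation" means each position in an n-gram is represented by either its word or its tag, and that rates are computed per 1,000 tokens. The function names `ngram_rates`, `frequent_features`, and `r2_and_s` are hypothetical, and the sample tags are only illustrative.

```python
from collections import Counter
from math import sqrt


def ngram_rates(tagged, max_n=3):
    """Rate per 1,000 tokens of every 1- to max_n-gram.

    Assumption: each position is represented by either its lowercased
    word or its POS tag (written as "<TAG>"), giving 2**n variants per
    n-gram window.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            variants = [()]
            for word, tag in tagged[i:i + n]:
                choices = (word.lower(), f"<{tag}>")
                variants = [v + (c,) for v in variants for c in choices]
            counts.update(variants)
    total = len(tagged)
    return {feat: 1000 * c / total for feat, c in counts.items()}


def frequent_features(rates, min_rate=1.0):
    """Keep features occurring at least min_rate times per 1,000 tokens."""
    return {feat: r for feat, r in rates.items() if r >= min_rate}


def r2_and_s(observed, predicted, n_predictors):
    """R-squared and standard error of the estimate for a fitted
    linear regression model with n_predictors predictors."""
    n = len(observed)
    mean = sum(observed) / n
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    sst = sum((y - mean) ** 2 for y in observed)
    return 1 - sse / sst, sqrt(sse / (n - n_predictors - 1))


# Illustrative input only; the tags are hand-written, not tagger output.
sample = [("I", "PRP"), ("lift", "VBP"), ("weights", "NNS"), (".", ".")]
rates = ngram_rates(sample)
kept = frequent_features(rates, min_rate=1.0)
```

In a corpus of realistic size, the threshold in `frequent_features` would discard most of the mixed word/tag trigrams, which is consistent with the abstract's remark that sparser features could not be considered for lack of data.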

Keywords

Computer science, Psychology
