Rater expertise in a second language speaking assessment : the influence of training and experience

Date
2012-12
Authors
Davis, Lawrence Edward
Publisher
[Honolulu] : [University of Hawaii at Manoa], [December 2012]
Abstract
Speaking performance tests typically employ raters to produce scores; accordingly, variability in raters' scoring decisions has important consequences for test reliability and validity. One such source of variability is the rater's level of expertise in scoring. Therefore, it is important to understand how raters' performance is influenced by training and experience, as well as the features that distinguish more proficient raters from their less proficient counterparts. This dissertation examined the nature of rater expertise within a speaking test, and how training and increasing experience influenced raters' scoring patterns, cognition, and behavior. Experienced teachers of English (N=20) scored recorded examinee responses from the TOEFL iBT speaking test prior to training and in three sessions following training (100 responses for each session). For an additional 20 responses, raters verbally reported (via stimulated recall) what they were thinking as they listened to the examinee response and made a scoring decision, with the resulting data coded for language features mentioned. Scores were analyzed using many-facet Rasch analysis, with scoring phenomena including consistency, severity, and use of the rating scale compared across dates. Various aspects of raters' interaction with the scoring instrument were also recorded to determine if certain behaviors, such as the time taken to reach a scoring decision, were associated with the reliability and accuracy of scores. Prior to training, rater severity and internal consistency (measured via Rasch analysis) were already of a standard typical for operational language performance tests, but training resulted in increased inter-rater correlation and agreement and improved correlation and agreement with established reference scores, although little change was seen in rater severity. 
Additional experience gained after training appeared to have little effect on rater scoring patterns, although agreement with reference scores continued to increase. More proficient raters reviewed benchmark responses more often and took longer to make scoring decisions, suggesting that rater behavior while scoring may influence the accuracy and reliability of scores. On the other hand, no obvious relationship was seen between raters' comments and their scoring patterns, with considerable individual variation seen in the frequency with which raters mentioned various language features.
Description
Ph.D. University of Hawaii at Manoa 2012.
Includes bibliographical references.
Keywords
second language speaking assessment
Related To
Theses for the degree of Doctor of Philosophy (University of Hawaii at Manoa). Second Language Studies.