Rater expertise in a second language speaking assessment : the influence of training and experience

File                   Description                                                    Size     Format
Davis_Lawrence_r.pdf   Version for non-UH users. Copying/Printing is not permitted   4.36 MB  Adobe PDF
Davis_Lawrence_uh.pdf  Version for UH users                                          4.75 MB  Adobe PDF

Item Summary

Title: Rater expertise in a second language speaking assessment : the influence of training and experience
Authors: Davis, Lawrence Edward
Keywords: second language speaking assessment
Issue Date: Dec 2012
Publisher: [Honolulu] : [University of Hawaii at Manoa], [December 2012]
Abstract: Speaking performance tests typically employ raters to produce scores; accordingly, variability in raters' scoring decisions has important consequences for test reliability and validity. One such source of variability is the rater's level of expertise in scoring. Therefore, it is important to understand how raters' performance is influenced by training and experience, as well as the features that distinguish more proficient raters from their less proficient counterparts. This dissertation examined the nature of rater expertise within a speaking test, and how training and increasing experience influenced raters' scoring patterns, cognition, and behavior.
Experienced teachers of English (N=20) scored recorded examinee responses from the TOEFL iBT speaking test prior to training and in three sessions following training (100 responses for each session). For an additional 20 responses, raters verbally reported (via stimulated recall) what they were thinking as they listened to the examinee response and made a scoring decision, with the resulting data coded for language features mentioned. Scores were analyzed using many-facet Rasch analysis, with scoring phenomena including consistency, severity, and use of the rating scale compared across scoring sessions. Various aspects of raters' interaction with the scoring instrument were also recorded to determine whether certain behaviors, such as the time taken to reach a scoring decision, were associated with the reliability and accuracy of scores.
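For readers unfamiliar with the technique, a standard rating-scale formulation of the many-facet Rasch model (following Linacre) can be sketched as below; the exact facet structure the dissertation used (for example, whether task was modeled as a separate facet) is an assumption here, not taken from the abstract.

\log \left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = B_n - D_i - C_j - F_k

where P_{nijk} is the probability that examinee n receives category k rather than k-1 from rater j on task i, B_n is the ability of examinee n, D_i is the difficulty of task i, C_j is the severity of rater j, and F_k is the threshold for scale category k. Rater severity corresponds to the estimated C_j values, and internal consistency is typically judged from the fit statistics associated with each rater's estimate.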
Prior to training, rater severity and internal consistency (measured via Rasch analysis) were already at a standard typical of operational language performance tests. Training nonetheless increased inter-rater correlation and agreement and improved correlation and agreement with established reference scores, although rater severity changed little. Additional experience gained after training appeared to have little effect on rater scoring patterns, although agreement with reference scores continued to increase. More proficient raters reviewed benchmark responses more often and took longer to make scoring decisions, suggesting that rater behavior while scoring may influence the accuracy and reliability of scores. In contrast, no obvious relationship emerged between raters' comments and their scoring patterns, and there was considerable individual variation in how often raters mentioned particular language features.
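The abstract does not specify which agreement statistics were computed, so the following is only a minimal Python sketch of two commonly used measures, exact agreement rate and Pearson correlation with reference scores; the function name and the scores shown are hypothetical.

import numpy as np
from scipy.stats import pearsonr

def agreement_with_reference(rater_scores, reference_scores):
    """Exact-agreement rate and Pearson correlation against reference scores.

    Illustrative only: exact agreement is assumed as the agreement statistic.
    """
    rater = np.asarray(rater_scores)
    ref = np.asarray(reference_scores)
    exact_agreement = float(np.mean(rater == ref))  # proportion of identical scores
    r, _ = pearsonr(rater, ref)                     # linear association with reference
    return exact_agreement, r

# Hypothetical scores for one rater across ten responses (0-4 TOEFL speaking scale)
rater = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
reference = [3, 3, 4, 3, 1, 2, 2, 4, 2, 3]
print(agreement_with_reference(rater, reference))   # -> (0.8, ...)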
Description: Ph.D. dissertation, University of Hawaii at Manoa, 2012. Includes bibliographical references.
Appears in Collections: Ph.D. - Second Language Studies

Items in ScholarSpace are protected by copyright, with all rights reserved, unless otherwise indicated.