Rater expertise in a second language speaking assessment : the influence of training and experience
|Davis_Lawrence_r.pdf||Version for non-UH users. Copying/Printing is not permitted||4.36 MB||Adobe PDF|
|Davis_Lawrence_uh.pdf||Version for UH users||4.75 MB||Adobe PDF|
|Title:||Rater expertise in a second language speaking assessment : the influence of training and experience|
|Authors:||Davis, Lawrence Edward|
|Keywords:||second language speaking assessment|
|Issue Date:||Dec 2012|
|Publisher:||[Honolulu] : [University of Hawaii at Manoa], [December 2012]|
|Abstract:||Speaking performance tests typically employ raters to produce scores; accordingly, variability in raters' scoring decisions has important consequences for test reliability and validity. One such source of variability is the rater's level of expertise in scoring. Therefore, it is important to understand how raters' performance is influenced by training and experience, as well as the features that distinguish more proficient raters from their less proficient counterparts. This dissertation examined the nature of rater expertise within a speaking test, and how training and increasing experience influenced raters' scoring patterns, cognition, and behavior.|
Experienced teachers of English (N=20) scored recorded examinee responses from the TOEFL iBT speaking test prior to training and in three sessions following training (100 responses per session). For an additional 20 responses, raters verbally reported (via stimulated recall) what they were thinking as they listened to the examinee response and made a scoring decision, with the resulting data coded for the language features mentioned. Scores were analyzed using many-facet Rasch analysis, with consistency, severity, and use of the rating scale compared across scoring sessions. Various aspects of raters' interaction with the scoring instrument were also recorded to determine whether certain behaviors, such as the time taken to reach a scoring decision, were associated with the reliability and accuracy of scores.
Prior to training, rater severity and internal consistency (measured via Rasch analysis) were already of a standard typical for operational language performance tests, but training resulted in increased inter-rater correlation and agreement and improved correlation and agreement with established reference scores, although little change was seen in rater severity. Additional experience gained after training appeared to have little effect on rater scoring patterns, although agreement with reference scores continued to increase. More proficient raters reviewed benchmark responses more often and took longer to make scoring decisions, suggesting that rater behavior while scoring may influence the accuracy and reliability of scores. On the other hand, no obvious relationship was seen between raters' comments and their scoring patterns, with considerable individual variation seen in the frequency with which raters mentioned various language features.
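To make the analytic model concrete: in a many-facet Rasch (rating scale) analysis such as the one described above, the log-odds of a rater awarding adjacent score categories is modeled as examinee ability minus task difficulty, rater severity, and a category threshold, all in logits. The sketch below is illustrative only; the function name, parameter values, and thresholds are assumptions for demonstration, not values from the study.

```python
import math

def category_probabilities(ability, item_difficulty, rater_severity, thresholds):
    """Many-facet Rasch rating scale model: returns the probability of each
    score category k, where the adjacent-category log-odds equal
    B_n - D_i - C_j - F_k (ability, task difficulty, rater severity,
    Rasch-Andrich threshold). F_0 is fixed at 0 by convention."""
    # Cumulative logits: sum over h = 1..k of (B - D - C - F_h),
    # with the lowest category anchored at logit 0.
    logits = [0.0]
    for f in thresholds:
        logits.append(logits[-1] + (ability - item_difficulty - rater_severity - f))
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: a fairly able examinee (1.0 logits), an average task,
# a slightly lenient rater (-0.5 logits), and a 4-category scale with
# illustrative thresholds.
probs = category_probabilities(ability=1.0, item_difficulty=0.0,
                               rater_severity=-0.5, thresholds=[-1.5, 0.0, 1.5])
```

Under this parameterization, a more severe rater (larger C_j) shifts probability mass toward lower score categories, which is why rater severity can be estimated and compared across sessions from the score matrix alone.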
|Description:||Ph.D. University of Hawaii at Manoa 2012. Includes bibliographical references.|
|Appears in Collections:||Ph.D. - Second Language Studies|
Items in ScholarSpace are protected by copyright, with all rights reserved, unless otherwise indicated.