Language Learning & Technology, 2025, Volume 29, Issue 1, pp. 1–12. ISSN 1094-3501. CC BY-NC-ND

LANGUAGE TEACHER EDUCATION AND TECHNOLOGY FORUM

The potential advantages of using an LLM-based chatbot for automated writing evaluation for English teaching practitioners

Kyungmin Kim, Chung-Ang University
Jang Ho Lee*, Chung-Ang University
Dongkwang Shin, Gwangju National University of Education

* Corresponding Author: Jang Ho Lee, Email: jangholee@cau.ac.kr

Abstract

With the ever-increasing demand for assessing large amounts of student writing, Automated Writing Evaluation (AWE) has emerged as an efficient system to satisfy this demand. However, it has also been suggested that applying it in teachers' writing classes may offer limited results. Given this, the present study developed an AWE chatbot based on ChatGPT 4.0-turbo, designed as an automated rater. A total of 465 narrative essays written by Korean high school EFL students were scored according to three criteria by the developed tool; these scores were then compared with those assigned by two professional raters using various analytic measures. The results showed that the AWE chatbot's scoring was strongly correlated with that of the human raters. Meanwhile, the many-facet Rasch model's results indicated that the two human raters' statistics demonstrated an excellent fit, whereas those of the developed AWE chatbot were slightly lower. The Coh-Metrix analysis suggested that the human raters' and the AWE chatbot's scoring tendencies are largely aligned, indicating that they relied on similar textual features when scoring the essays. Based on our findings, we suggest that the Large Language Model (LLM)-based AWE chatbot has great potential to assist teachers in EFL writing classrooms.

Keywords: Automated Writing Evaluation, ChatGPT, L2 Writing, Many-Facet Rasch Model

Language(s) Learned in This Study: English

APA Citation: Kim, K., Lee, J. H., & Shin, D. (2025). The potential advantages of using an LLM-based chatbot for automated writing evaluation for English teaching practitioners. Language Learning & Technology, 29(1), 1–12. https://hdl.handle.net/10125/73628

Introduction

Writing has become an essential skill for communicating with people worldwide; as such, it has received increasing attention in the second language (L2) teaching community (Weigle, 2002). The English as a Foreign Language (EFL) domain is no exception. Meanwhile, some suggest that writing is a process-oriented domain that requires time, determination, and concentration for both learners (Kormos, 2012) and teachers (Warschauer & Grimes, 2008; Yu, 2021). For the latter, a main source of difficulty lies in the sheer amount of time and effort required to assess and mark students' writing, indicating the need for an efficient assessment system. Automated Writing Evaluation (AWE) has emerged as a potential solution to address this need and has been suggested to provide systematic and consistent scoring (Hockly, 2019). The conceptualization of AWE dates back to the 1960s (Page, 1966); since then, myriad AWE tools have been developed for this purpose. Many are used to assess the writing section of standardized language tests, with e-rater (Monaghan & Bridgeman, 2005) and IntelliMetric (Elliot, 2003) being popular examples. These tools have generally provided high-quality essay scoring (e.g., Elliot, 2003; McCurry, 2010; Ramineni et al., 2012), showing great potential for use in L2 writing instruction.
There are two concerns, however, that need to be considered when using extant AWE tools in individual teachers' writing classes. The first is price (i.e., they are not usually free), and the second is the difficulty of adapting them to a target pedagogical context, where there is usually a specially designed analytical rubric for assessment. One strand of research on AI-assisted language teaching and learning has suggested that technology based on Large Language Models (LLMs), such as ChatGPT (OpenAI, 2023), could be used to develop customized AWE tools for EFL writing (see Teng, 2024 for a recent systematic review), which are more affordable, accessible, and pedagogically suitable for individual writing teachers. Studies directly addressing this topic (e.g., Bucol & Sangkawong, 2024; Shin & Lee, 2024) have demonstrated that ChatGPT-based AWE tools offer highly consistent scoring of EFL students' essays, with their marks strongly correlating with those of human raters. Nevertheless, these studies offer only limited implications, particularly given the small size of their essay datasets. Furthermore, the release of an updated version of ChatGPT (i.e., GPT 4.0-turbo) since these studies' publication necessitates examining the performance of an AWE tool developed on this newer version.

Therefore, the present study was conducted to advance our understanding of the potential advantages of using ChatGPT as an AWE tool and its applicability in individual EFL teachers' writing instruction (Bucol & Sangkawong, 2024; Shin & Lee, 2024). To this end, we developed an AWE chatbot based on ChatGPT 4.0-turbo, which scored 465 narrative essays written by Korean EFL students at the secondary level; we then compared its scores with those of two human raters using various analytical measures. Specifically, we employed (1) correlation coefficients to measure the degree of agreement between the human raters and ChatGPT, (2) the many-facet Rasch model to assess rater severity and consistency, and (3) Coh-Metrix analysis to quantify various linguistic and discourse features of students' essays and examine which features correlate most with the human raters' and the AWE chatbot's scores—that is, whether certain features are weighted more heavily in scoring.

Integrating ChatGPT into EFL Writing Instruction and Assessment

As part of the research efforts that have been made to integrate ChatGPT—an LLM-based chatbot—into EFL writing, one strand (e.g., Guo & Wang, 2024; Steiss et al., 2024) has examined the quality and quantity of ChatGPT's feedback on EFL students' writing and compared it with human raters' feedback. The general finding is that ChatGPT—although it may not provide feedback on par with human raters on some analytic criteria—could generate valuable feedback for EFL students and be a useful supplementary tool for language teachers. Another group of researchers (e.g., Boudouaia et al., 2024; Song & Song, 2023) has compared the effect that ChatGPT-assisted writing instruction has on the development of English writing with that of more traditional writing instruction.
The outcomes of these studies favored the ChatGPT-assisted condition concerning developments in various aspects of writing, thereby pointing to its pedagogical value in EFL writing lessons. Despite these positive findings, it has also been suggested that, at least at its current performance level, ChatGPT should serve as a supportive resource rather than a replacement for language teachers. For example, Al-Garaady and Mahyoob (2023) found that ChatGPT, although it excels at identifying surface-level errors in students' writing, has a limited ability to detect more nuanced errors (e.g., pragmatics-related), which human teachers may be better at addressing. In addition, EFL students considered ChatGPT less useful in terms of informal and neutral registers (Punar Özçelik & Yangın Ekşi, 2024). This indicates that it could be used in limited ways for certain types of writing.

More relevant to this study's topic, some research (Bucol & Sangkawong, 2024; Shin & Lee, 2024) has examined ChatGPT as an AWE tool, testing its ability to score EFL students' writing based on custom-designed analytical rubrics and comparing its scores with human raters'. These studies take advantage of the characteristics of LLM-based chatbots in that they are accessible, powerful in terms of performance, and could be developed according to a user's purpose without any programming knowledge. As these two studies are the direct precursors of the present one, they are reviewed in detail below.

In Bucol and Sangkawong's (2024) study, one of the earliest efforts in this regard, five EFL instructors working at a Thai university were asked to develop a ChatGPT-based AWE tool using custom-designed rubrics consisting of five assessment criteria (therefore generating five GPT-based raters). Next, the tool scored 10 university freshman writing samples, with another five instructors scoring the same samples. It was found that both ChatGPT's and the human raters' scoring reflected a high level of internal consistency. Moreover, the GPT-based tool's and the instructors' scoring demonstrated a moderate-to-strong correlation (r = .65). Based on their results, Bucol and Sangkawong suggested that the LLM-based AWE tool could comprehend the human-developed rubrics properly and, therefore, assess EFL students' essays reliably.

Shin and Lee's (2024) study also developed an LLM-based AWE tool. They used the 'My GPTs' feature of ChatGPT Plus, launched by OpenAI in November 2023, when building the AWE chatbot. The AWE chatbot and two professional human raters scored 50 persuasive essays composed by Korean EFL students. The results revealed that the LLM-based chatbot's scoring correlated highly overall with the two human raters', as measured by correlation coefficients and intraclass correlation. The many-facet Rasch model further revealed that the infit and outfit mean square values of the AWE chatbot's scoring (1.33 and 1.34, respectively), as measures of rating consistency, were marginally acceptable (Linacre, 2005). These values indicate that the AWE chatbot, as an L2 writing rater, requires further training to enhance its consistency.

Although both Bucol and Sangkawong's and Shin and Lee's work greatly contributes to the extant research, the two studies have some limitations. First, their datasets consist of relatively few essay samples, which could be considered rather small for inferential statistics.
Second, although these studies examine the similarity between the AWE chatbot's and human scorers' marks, they provide only limited insight into which writing subcomponents (linguistic and structural) are most correlated with scoring. Finally, as a more recent version of the LLM (i.e., ChatGPT 4.0-turbo) has been released since their studies, it is important to examine an AWE chatbot based on this newer version.

The Present Study

In light of the literature review, we aimed to address these limitations by (1) involving a larger dataset; (2) employing additional analytical measures; and (3) using an updated version of an LLM to develop an AWE chatbot. To this end, the present study addresses the following three research questions:

Research Question #1: To what extent is the My GPTs-based AWE chatbot's scoring of secondary-level EFL students' narrative writing similar to that of professional human raters?

Research Question #2: What are the patterns of severity and consistency in the rating behaviors of the My GPTs-based AWE chatbot compared with those of professional human raters?

Research Question #3: What textual features correlate with each scoring domain, and do they distinguish the human raters' scoring tendencies from the AWE chatbot's?

Methods

Dataset Description

This study used a dataset of 465 narrative essays written by 465 high school students, collected as part of a large research endeavor (Lim et al., 2014). This project involved eight high schools in Seoul, Republic of Korea, selected by the Ministry of Education for various research purposes. Given the nature of the English curriculum at the secondary level in South Korea, English writing had not been a focus of instruction, suggesting that these students had not received intensive writing instruction at the time of data collection. However, as performance assessment based on productive language skills is an important component of the English curriculum in this pedagogical context, English speaking or writing tasks are regularly administered to these students in English classes. Regarding high-school-level writing tasks, students are generally asked to compose 60–80-word narrative or persuasive essays. The topic of the English essays in the current dataset was to describe 'the happiest moment in one's life and reasons why it was the happiest moment'. In the current pedagogical context, this topic is among the most frequent in the aforementioned performance assessment, and it is considered to be both cognitively appropriate and thematically familiar to students.

Additional data obtained from the aforementioned project comprised two in-service English teachers' scoring of the same 465 narrative essays. By the time of the scoring, these teachers had more than five years of English teaching experience and were certified English writing raters by the Korean Institute for Curriculum and Evaluation (KICE). They were instructed to rate each piece of writing according to three criteria (content, organization, and language use) on a five-point scale.

Developing an AWE Chatbot

As mentioned above, the present study developed an AWE chatbot by leveraging the 'My GPTs' feature (see Shin & Lee, 2024 for further information). In the first development phase, the researchers added several guidelines and PDF files (i.e., a sample scoring dataset), instructing the chatbot to perform the present study's required functions.
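The study itself used the no-code 'My GPTs' builder, but a comparable setup can also be reproduced programmatically. The sketch below is an illustration only, not the procedure used in this study: it assumes the OpenAI Python SDK, a hypothetical rubric file (rubric.txt), and a placeholder essay, and it asks the GPT-4-turbo model to score the three domains used here. The actual instructions the researchers provided to My GPTs are listed immediately after the sketch.

```python
# Minimal sketch of a programmatic AWE rater (illustration only; this study used the
# no-code "My GPTs" builder rather than the API). Assumes the OpenAI Python SDK is
# installed and OPENAI_API_KEY is set; "rubric.txt" and the essay are placeholders.
from openai import OpenAI

client = OpenAI()

# Hypothetical rubric file holding the analytic scoring criteria (1-5 per domain).
rubric = open("rubric.txt", encoding="utf-8").read()

SYSTEM_PROMPT = f"""You are an automated rater of EFL students' narrative essays.
The writers are Korean EFL learners, so do not judge their writing against
native-speaker norms. Score each essay on three domains -- Content, Organization,
and Language Use -- on a 1-5 scale, following this rubric:

{rubric}

Do not deduct Language Use points for minor grammatical errors if the content is
comprehensible. If the essay falls below the 60-word minimum, you may deduct 1 point.
Return the three scores with a one-sentence justification for each."""

def score_essay(essay_text: str) -> str:
    """Send one essay to the model and return its scoring output as text."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",   # the LLM version examined in this study
        temperature=0,         # keep the scoring as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(score_essay("The happiest moment in my life was ..."))  # placeholder essay
```

In practice, one would loop over the full set of essays and parse the returned scores into a table so that they can be compared with human ratings, as is done in the analyses below.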
The following are the instructions the researchers provided:

• You must remember that writers are Korean EFL learners, so their writing should not be treated as those written by native English speakers.
• When evaluating essays that I provide as a text in prompt, it is important to assess each of the three scoring domains: (1) 'Content'; (2) 'Organization'; (3) 'Language Use'.
• You must use the 'Essay scoring criteria' attached to 'Knowledge'.
• You must not deduct points from 'Language Use' for minor grammatical errors if the essay's content is comprehensible.
• You must use the sample question and sample scores for each scoring domain as a reference, which are listed in the PDF file (Essay scoring criteria).

Following the aforementioned procedures, the researchers conducted a pilot test (see Figures 1 and 2 below). The piloting revealed that the AWE chatbot did not, as programmed, cover the full scoring range. Furthermore, it did not consider the word limit (60–80 words). As a result, the chatbot was given additional prompts (i.e., 'If the essay does not meet the 60-word minimum, you can deduct 1 point') and extra sample scoring data, which enhanced its performance.

Figure 1
The Sample Question and Essay of the Pilot Test for the AWE Chatbot

Figure 2
The Scoring Results of the Pilot Test for the AWE Chatbot

Data Analysis

For the data analysis, descriptive statistics (i.e., means and standard deviations) were first calculated for the raters' scoring of the 465 essays across the three criteria. For the first research question, in which we aimed to measure the degree of agreement between the two human raters and the AWE chatbot in scoring patterns, we calculated correlation coefficients for each pair of raters on each criterion. Intraclass correlation coefficients were additionally examined as another measure of inter-rater agreement, employing the Statistical Package for the Social Sciences (SPSS) 28.0 (IBM Corp, 2021), a software tool for data management and statistical analysis. For the second research question, in which we aimed to measure the degree of severity (i.e., whether a rater is relatively harsher or more lenient in scoring) and rating consistency, we used the many-facet Rasch model, employing the Many-Facets Rasch Measurement software (Linacre, 2023). Specifically, we examined logit values for rater severity and infit and outfit statistics for rating consistency. For the third research question, in which we aimed to identify the textual features that correlate most with raters' scoring, we analyzed the essays using the 108 features provided by Coh-Metrix 3.0 (Graesser et al., 2004), a software tool for analyzing and quantifying the linguistic and discourse features of a given essay. Identifying these textual features was expected to provide insight into which writing subcomponents influence raters' scoring and help distinguish the human raters' scoring tendencies from those of the AWE chatbot. For our analysis, we selected features that showed a correlation greater than .30 with raters' scoring across the three criteria—content, organization, and language use—for further examination.
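To make the agreement analysis concrete, the following sketch shows how the pairwise correlations and an intraclass correlation of the kind reported below could be computed in Python; this is an illustrative equivalent, not the authors' procedure (the study used SPSS 28.0 for the intraclass correlations). The file and column names are hypothetical, and the intraclass correlation step assumes the optional pingouin package is installed.

```python
# Sketch of the rater-agreement computations described above. "scores.csv" and its
# column names (essay_id, rater1, rater2, chatbot) are hypothetical placeholders.
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

scores = pd.read_csv("scores.csv")  # one row per essay, one column per rater

# Pairwise Pearson correlations for one criterion (e.g., the Content scores).
for a, b in combinations(["rater1", "rater2", "chatbot"], 2):
    r, p = pearsonr(scores[a], scores[b])
    print(f"{a} vs {b}: r = {r:.2f} (p = {p:.3f})")

# Intraclass correlation across all three raters, using the pingouin package.
import pingouin as pg

long_form = scores.melt(id_vars="essay_id",
                        value_vars=["rater1", "rater2", "chatbot"],
                        var_name="rater", value_name="score")
icc = pg.intraclass_corr(data=long_form, targets="essay_id",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```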
Results

Descriptive Statistics and Correlation of the Three Raters' Scoring

Table 1 presents the descriptive statistics for the two professional raters' and the AWE chatbot's scoring. The overview of the descriptive statistics shows that Raters 1 and 2 scored very similarly on Content (M = 2.62 for Rater 1; M = 2.61 for Rater 2) and Organization (M = 2.21 for Rater 1; M = 2.27 for Rater 2), whereas Rater 1 scored Language Use (M = 2.46) more leniently than Rater 2 (M = 2.13). The AWE chatbot differed somewhat from the two professional raters, scoring Content (M = 2.80) and Organization (M = 2.43) more leniently, but Language Use (M = 2.08) more harshly. Nevertheless, the three evaluators' scores for each criterion were all within 0.5 points of each other. The three raters' total scores were as follows: M = 7.30 for Rater 1, M = 7.00 for Rater 2, and M = 7.31 for the AWE chatbot.

Table 1
Descriptive Statistics of Scores from the Two Professional Raters and the AWE Chatbot

                 Rater 1        Rater 2        AWE chatbot
                 Mean (SD)      Mean (SD)      Mean (SD)
Content          2.62 (1.48)    2.61 (1.54)    2.80 (1.50)
Organization     2.21 (1.27)    2.27 (1.45)    2.43 (1.36)
Language Use     2.46 (1.40)    2.13 (1.33)    2.08 (1.20)
Total            7.30 (4.04)    7.00 (4.19)    7.31 (3.94)

Table 2 summarizes the correlation coefficients and intraclass correlation coefficients. The correlation coefficient values ranged from r = .85 (Rater 1 – Rater 2 on Organization) to r = .91 (Rater 2 – AWE chatbot on Content), indicating that each pair's scores correlated highly. Intraclass correlation coefficients, as another measure of agreement, indicated a strong similarity between the two human raters and the AWE chatbot (r = .95 for Content and Organization and r = .91 for Language Use) in light of Cicchetti's (1994) guideline.

Table 2
Correlation Coefficient and Intraclass Correlation Coefficient

Pair                              Correlation coefficient    Intraclass correlation
Content
  Rater 1 – Rater 2               .88***                     .95***
  Rater 1 – AWE chatbot           .90***
  Rater 2 – AWE chatbot           .91***
Organization
  Rater 1 – Rater 2               .85***                     .95***
  Rater 1 – AWE chatbot           .90***
  Rater 2 – AWE chatbot           .89***
Language Use
  Rater 1 – Rater 2               .86***                     .91***
  Rater 1 – AWE chatbot           .89***
  Rater 2 – AWE chatbot           .87***

Note. *** p < .001

Rater Severity and Rater Consistency Statistics

As stated in the Data Analysis section, the many-facet Rasch model was employed for rater severity and rater consistency statistics. Table 3 presents the estimates of rater severity and the consistency statistics, with Figure 3 illustrating the Wright map. Examining rater severity revealed that Rater 2 was the harshest, at +0.31 logits, whereas Rater 1 and the AWE chatbot displayed a similar severity level (–0.14 and –0.17, respectively). The chi-square test for fixed effects indicated significant variation in rater severity (χ2 = 43.3, df = 2, p < .01), suggesting that all three raters differed in severity, with a reliability of separation of .93. For rating consistency, infit and outfit statistics were examined. The two human raters' statistics were very close to 1, indicating an excellent fit. The same statistics for the AWE chatbot were slightly lower (0.8 for infit MnSq and 0.7 for outfit MnSq), indicating that its scoring is slightly overfitting and may reflect somewhat mechanical rating patterns.

Table 3
The Estimates of Rater Severity and Consistency Statistics

Raters        Observed average   Fair (M) average   Logit    SE     Infit MnSq   Outfit MnSq
Rater 1       2.6                2.35               –0.14    0.06   1.0          1.0
Rater 2       2.5                2.22               0.31     0.06   1.1          1.0
AWE chatbot   2.6                2.36               –0.17    0.06   0.8          0.7

Note. χ2(2) = 43.3, p < .01. Reliability of separation = .93.

Figure 3
The Wright Map Produced as Part of the Many-Facet Rasch Measurement Analysis
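For readers less familiar with these statistics, the model underlying the Facets estimates in Table 3 is the many-facet Rasch model. In its standard formulation (notation ours, not the authors'), the probability of essay n receiving category k rather than k−1 from rater j on criterion i is modeled as

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,$$

where $B_n$ is the writer's proficiency, $D_i$ the difficulty of criterion $i$, $C_j$ the severity of rater $j$ (the logit values reported in Table 3), and $F_k$ the threshold of rating category $k$. Infit and outfit mean squares have an expected value of 1: values well above 1 signal erratic, misfitting ratings, whereas values well below 1, as observed for the AWE chatbot here, signal overly predictable, overfitting ratings.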
Correlation between Raters' Scoring and Textual Features

The purpose of this section is to identify key textual features provided by Coh-Metrix that significantly correlate with the raters' scoring across the three assessment domains. This analysis aimed to explore the scoring tendencies of the two human raters (i.e., Rater 1 and Rater 2) and the AWE chatbot, and then to compare their reliance on specific textual features in rating. As stated in the Data Analysis section, we first identified features that correlated with the raters' scores at a level greater than .30 across the three criteria. Next, correlation coefficients were examined to determine the extent to which the identified textual features shape the scoring process. The findings are summarized in Table 4 below.

Three features significantly correlated with the raters' scoring for the Content domain. First, Total Word Count (DESWC) was the most significant predictor across all raters (r > .60 for all three), indicating that longer essays were perceived as containing richer and more developed content (i.e., more extensive elaboration of ideas). Temporal Cohesion (PCTEMPz) was another important feature, which concerns the clarity of temporal progression within the text. Positive correlations with Content scores indicate that texts with clear temporal markers are easier to follow and more coherent, improving perceived content quality. The two human raters and the AWE chatbot showed similar values (r = .38 for Rater 1, r = .35 for Rater 2, r = .39 for the AWE chatbot). Connectivity (PCCONNp) demonstrated a negative correlation with the raters' scoring. It can thus be suggested that, although logical connectives can enhance cohesion, excessive use may disrupt natural flow and make the text seem mechanical. Here, too, the two human raters and the AWE chatbot indicated a similar degree of correlation (r = –.37 for Rater 1, r = –.39 for Rater 2, r = –.36 for the AWE chatbot).

Table 4
Key Findings Based on the Coh-Metrix Analysis

Feature (Code)                       Scoring Domain   Rater 1 (r)   Rater 2 (r)   AWE chatbot (r)   Interpretation
DESWC (F3, Total Word Count)         Content          .62*          .62*          .61**             Longer essays are perceived as richer in content.
PCTEMPz (F26, Temporal Cohesion)     Content          .38*          .35*          .39**             Temporal consistency improves content quality.
PCCONNp (F25, Connectivity)          Content          –.37*         –.39*         –.36**            Overuse of connectives reduces perceived content flow.
DESWC (F3, Total Word Count)         Organization     .59*          .56*          .63**             Essay length supports structural development.
PCTEMPz (F26, Temporal Cohesion)     Organization     .40*          .36*          .42**             Temporal flow contributes to essay organization.
SYNSTRUTa (F74, Syntax Similarity)   Organization     .41*          .39*          .47**             Syntactic similarity enhances organizational coherence.
WRDFRQc (F94, Word Frequency)        Language Use     –.34*         –.36*         –.38**            Lower-frequency words indicate lexical sophistication.
SYNNP (F70, Modifiers per NP)        Language Use     .35*          .32*          .37**             Noun phrase modifiers reflect syntactic complexity.

Note. * p < .05, ** p < .01
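As a concrete illustration of the screening step described above (keeping Coh-Metrix indices whose correlation with the scores exceeds .30), the sketch below assumes the Coh-Metrix 3.0 output has been exported to a CSV containing the 108 indices alongside each rater's Content scores; the file and column names are hypothetical, not the authors' actual data.

```python
# Sketch of the feature-screening step: keep Coh-Metrix indices whose absolute
# Pearson correlation with a rater's scores exceeds .30. "cohmetrix_scores.csv"
# and the column names are hypothetical placeholders.
import pandas as pd

data = pd.read_csv("cohmetrix_scores.csv")   # 108 Coh-Metrix indices + rater scores
rater_cols = ["rater1_content", "rater2_content", "chatbot_content"]
feature_cols = [c for c in data.columns if c not in rater_cols]

selected = {}
for rater in rater_cols:
    # Pearson correlation of every feature with this rater's Content scores.
    corrs = data[feature_cols].corrwith(data[rater])
    selected[rater] = corrs[corrs.abs() > 0.30].sort_values(ascending=False)

# Features retained for all three raters (cf. DESWC, PCTEMPz, and PCCONNp in Table 4).
common = set.intersection(*(set(v.index) for v in selected.values()))
print(common)
```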
Regarding the Organization domain, Total Word Count (DESWC) was again an important predictor of the raters' scoring, suggesting that essays with higher word counts may exhibit better structural development and logical progression. The correlation coefficients for Raters 1 and 2 and the AWE chatbot were .59, .56, and .63, respectively. As in the Content domain, Temporal Cohesion (PCTEMPz) was another feature that significantly correlated with the raters' scoring (r = .40 for Rater 1, r = .36 for Rater 2, r = .42 for the AWE chatbot), given that using such markers could help readers navigate the logical flow of ideas, thereby contributing to perceived coherence. Finally, Syntactic Structure Similarity (SYNSTRUTa) positively correlated with the scores in this domain. The AWE chatbot showed a slightly stronger correlation (r = .47) than the human raters (r = .41 for Rater 1, r = .39 for Rater 2), indicating the automated system's greater reliance on syntactic consistency as a measure of organizational quality.

Concerning the Language Use domain, two textual features emerged as significant. The first was lexical sophistication, measured by Word Frequency for Content Words (WRDFRQc), which significantly and negatively correlated with Language Use scores. This result indicates that essays containing lower-frequency words are associated with higher scores, reflecting a higher level of lexical complexity. The correlation values were similar for all three raters: –.34 (Rater 1), –.36 (Rater 2), and –.38 (the AWE chatbot). The other feature was Modifiers per Noun Phrase (SYNNP), with greater values indicating syntactic complexity. This feature showed positive but small correlation values for all three raters (r = .35 for Rater 1, r = .32 for Rater 2, r = .37 for the AWE chatbot).

The analysis based on Coh-Metrix demonstrates substantial agreement between the human raters and the AWE chatbot in their reliance on the same textual features across all domains, although, notably, the latter seems to rely slightly more on syntax similarity for scoring in the Organization domain. Nevertheless, the scoring tendencies of the human raters and the AWE chatbot were broadly similar.

Discussion and Conclusion

As part of the recent research interest in ChatGPT's potential usefulness in L2 writing instruction (e.g., Al-Garaady & Mahyoob, 2023; Boudouaia et al., 2024; Bucol & Sangkawong, 2024; Guo & Wang, 2024; Shin & Lee, 2024; Song & Song, 2023; Steiss et al., 2024), the present study sought to assess the scoring ability of the AWE chatbot by comparing it with that of two professional human raters, using various measures of rater quality. Our findings generally concur with previous studies on ChatGPT as an English essay rater (Bucol & Sangkawong, 2024; Shin & Lee, 2024). Namely, ChatGPT's scoring could be judged as reliable and highly correlated with the human raters' marking. Although the infit and outfit statistics from the many-facet Rasch model showed that this AI-based tool's scoring is slightly overfitting (possibly reflecting mechanical rating patterns), they were still within the acceptable MnSq range (Linacre, 2005).

A rather interesting and unexpected finding was that, unlike in Shin and Lee's (2024) study, in which the two human raters showed the strongest correlation with each other, in the present study one of the human raters (Rater 1) and the AWE chatbot did so on some domains. Rater 1 and the AWE chatbot further demonstrated a similar severity level, whereas the other human rater (Rater 2) was the harshest of the three. We postulate that this finding may be attributed to the LLM employed for developing the currently examined AWE chatbot (GPT 4.0-turbo), which seems to display better and more human-like rating behaviors than the earlier version used in Shin and Lee.
In this study, the Coh-Metrix analysis further suggested that the human raters and the AWE chatbot are in close agreement, indicating that the AWE chatbot considered the identified textual features in a similar manner when scoring essays.

The present study has the following pedagogical implications. First, given its close resemblance to trained and certified in-service English teachers, an AWE chatbot could be developed to assist instructors in assessing student essays. Our hands-on experience in building a customized AWE chatbot highlights the importance of accurately identifying the target audience's proficiency level, establishing a scoring rubric aligned with assessment objectives, securing representative sample data for each score range (see the Developing an AWE Chatbot section for instructions on uploading these data to My GPTs), and developing effective prompt-writing skills. Notably, our experience shows that the quality of the representative sample data and the careful phrasing of prompts significantly affect the chatbot's scoring accuracy. When developing their own AWE chatbot, teachers should first grade a subset of student essays to serve as representative sample data, then pilot the initial chatbot to ensure its scoring aligns with their assessments. Further calibration can be achieved through additional prompts. For instance, if the chatbot is too harsh on Language Use, teachers may specify that the essays were written by EFL students with limited English proficiency. Finally, teachers should recognize that developing a high-quality AWE chatbot requires multiple rounds of refinement through iterative prompting.

Second, learners could also use the AWE chatbot to assess the quality of their work and make revisions before submitting their essays for a registered course or lesson. For example, if they receive a low score on organization, they might refine their essay's structure. The chatbot could be integrated as a menu option within educational platforms (e.g., online learning management systems), or instructors could provide students with a direct link to their customized AWE chatbot.

There are some limitations of this study that are worth addressing. First, although its dataset (N = 465) was much larger than that of the previous study on the same topic (N = 50; Shin & Lee, 2024), it only examined one particular genre (i.e., narrative writing). Second, the writing task in the current dataset was a simple and short composition task (60–80 words). Therefore, it remains unexplored to what extent the AWE chatbot's scoring would resemble that of professional human raters for other types of writing and at different lengths. Finally, the students in the current dataset were Korean EFL high school students. It is possible that the AWE chatbot may display different rating severity and fit for other learner populations. Given this methodological limitation, including other learner groups is expected to further validate the AWE chatbot's scoring ability. Despite these limitations, the present study provides valuable insights into the potential advantages of an LLM-based chatbot's scoring ability for rating L2 learners' writing. Along with recent research on the quality of LLM-based chatbots' feedback on L2 learners' written compositions, we expect that research like the present study will greatly assist L2 teachers' writing evaluation and instruction.
Acknowledgements

The authors are grateful to the anonymous reviewers for their constructive feedback and suggestions. An earlier version of this article was based on the first author's master's thesis, and they would like to express their gratitude to Professors Jie-Young Kim and Ho Lee for their helpful comments.

References

Al-Garaady, J., & Mahyoob, M. (2023). ChatGPT's capabilities in spotting and analyzing writing errors experienced by EFL learners. Arab World English Journal, 9, 3–17. https://dx.doi.org/10.24093/awej/call9.1

Boudouaia, A., Mouas, S., & Kouider, B. (2024). A study on ChatGPT-4 as an innovative approach to enhancing English as a foreign language writing learning. Journal of Educational Computing Research, 62(6), 1289–1317. https://doi.org/10.1177/07356331241247465

Bucol, J. L., & Sangkawong, N. (2024). Exploring ChatGPT as a writing assessment tool. Innovations in Education and Teaching International. Advance online publication. https://doi.org/10.1080/14703297.2024.2363901

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290. https://doi.org/10.1037/1040-3590.6.4.284

Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71–86). Lawrence Erlbaum.

Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36, 193–202. https://doi.org/10.3758/BF03195564

Guo, K., & Wang, D. (2024). To resist it or to embrace it? Examining ChatGPT's potential to support teacher feedback in EFL writing. Education and Information Technologies, 29, 8435–8463. https://doi.org/10.1007/s10639-023-12146-0

Hockly, N. (2019). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044

IBM Corp. (2021). IBM SPSS Statistics for Windows (Version 28.0) [Computer software]. IBM Corp.

Kormos, J. (2012). The role of individual differences in L2 writing. Journal of Second Language Writing, 21(4), 390–403. https://doi.org/10.1016/j.jslw.2012.09.003

Lim, H., Park, D., & Si, K. (2014). Sophistication of an automated scoring system for large-scale essay writing tests. Multimedia-Assisted Language Learning, 17(1), 84–105. https://doi.org/10.15702/mall.2014.17.1.84

Linacre, J. M. (2005). A user's guide to Winsteps/Ministeps Rasch model programs. MESA Press.

Linacre, J. M. (2023). Facets computer program for many-facet Rasch measurement (Version 3.87.0) [Computer software]. https://www.winsteps.com/facets.htm

McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as human readers? Assessing Writing, 15(2), 118–129. https://doi.org/10.1016/j.asw.2010.04.002

Monaghan, W., & Bridgeman, B. (2005). E-rater as a quality control on human scores. Connections.

OpenAI. (2023). ChatGPT—release notes: Introducing GPT. https://help.openai.com/en/articles/6825453-chatgpt-release-notes
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47(5), 238–243.

Punar Özçelik, N., & Yangın Ekşi, G. (2024). Cultivating writing skills: The role of ChatGPT as a learning assistant—a case study. Smart Learning Environments, 11(10), 1–18. https://doi.org/10.1186/s40561-024-00296-8

Ramineni, C., Trapani, C. S., Williamson, D. M., Davey, T., & Bridgeman, B. (2012). Evaluation of the e-rater® scoring engine for the TOEFL® independent and integrated prompts. ETS Research Report Series, 2012(1), i–51. https://doi.org/10.1002/j.2333-8504.2012.tb02288.x

Shin, D., & Lee, J. H. (2024). Exploratory study on the potential of ChatGPT as a rater of second language writing. Education and Information Technologies, 29, 24735–24757. https://doi.org/10.1007/s10639-024-12817-6

Song, C., & Song, Y. (2023). Enhancing academic writing skills and motivation: Assessing the efficacy of ChatGPT in AI-assisted language learning for EFL students. Frontiers in Psychology, 14, Article 1260843. https://doi.org/10.3389/fpsyg.2023.1260843

Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., & Olson, C. B. (2024). Comparing the quality of human and ChatGPT feedback of students' writing. Learning and Instruction, 91, Article 101894. https://doi.org/10.1016/j.learninstruc.2024.101894

Teng, M. F. (2024). A systematic review of ChatGPT for English as a foreign language writing: Opportunities, challenges, and recommendations. International Journal of TESOL Studies, 6(3), 36–57. https://doi.org/10.58304/ijts.20240304

Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies, 3(1), 22–36. https://doi.org/10.1080/15544800701771580

Weigle, S. C. (2002). Assessing writing. Cambridge University Press.

Yu, S. (2021). Feedback-giving practice for L2 writing teachers: Friend or foe? Journal of Second Language Writing, 52, Article 100798. https://doi.org/10.1016/j.jslw.2021.100798

About the Authors

Kyungmin Kim received an MA degree in English Education from Chung-Ang University, South Korea. His interests include CALL, L2 learning, meta-analysis, and the pedagogical use of ChatGPT.
E-mail: kmkim@englishunt.com
ORCiD: https://orcid.org/0009-0002-2310-688X

Jang Ho Lee received his DPhil in education from the University of Oxford and is presently a Professor at Chung-Ang University, South Korea. His areas of interest are AI-based language teaching and learning, and L1 use in L2 teaching. All correspondence regarding this publication should be addressed to him.
E-mail: jangholee@cau.ac.kr
ORCiD: https://orcid.org/0000-0003-2767-3881

Dongkwang Shin received his PhD in Applied Linguistics from Victoria University of Wellington and is currently a Professor at Gwangju National University of Education, South Korea. His research interests include corpus linguistics, CALL, and AI-based language learning.
E-mail: sdhera@gmail.com
ORCiD: https://orcid.org/0000-0002-5583-0189