Language Learning & Technology, 2025, Volume 29, Issue 1, pp. 1–12. ISSN 1094-3501. CC BY-NC-ND

LANGUAGE TEACHER EDUCATION AND TECHNOLOGY FORUM

The potential advantages of using an LLM-based chatbot for automated writing evaluation for English teaching practitioners

Kyungmin Kim, Chung-Ang University
Jang Ho Lee*, Chung-Ang University
Dongkwang Shin, Gwangju National University of Education

* Corresponding Author: Jang Ho Lee, Email: jangholee@cau.ac.kr

Abstract

With the ever-increasing demand for assessing large amounts of student writing, Automated Writing Evaluation (AWE) has emerged as an efficient system to satisfy this demand. However, it has also been suggested that applying it in teachers' writing classes may offer limited results. Given this, the present study developed an AWE chatbot based on ChatGPT 4.0-turbo, designed as an automated rater. A total of 465 narrative essays written by Korean high school EFL students were scored according to three criteria by the developed tool; these scores were then compared with those assigned by two professional raters using various analytic measures. The results showed that the AWE chatbot's scoring was strongly correlated with that of the human raters. Meanwhile, the many-facet Rasch model's results indicated that the two human raters' statistics demonstrated an excellent fit, whereas those of the developed AWE chatbot were slightly lower. The Coh-Metrix analysis suggested that the human raters' and the AWE chatbot's scoring tendencies are largely aligned, indicating that they relied on similar textual features when scoring the essays. Based on our findings, we suggest that the Large Language Model (LLM)-based AWE chatbot has great potential to assist teachers in EFL writing classrooms.

Keywords: Automated Writing Evaluation, ChatGPT, L2 Writing, Many-Facet Rasch Model

Language(s) Learned in This Study: English

APA Citation: Kim, K., Lee, J. H., & Shin, D. (2025). The potential advantages of using an LLM-based chatbot for automated writing evaluation for English teaching practitioners. Language Learning & Technology, 29(1), 1–12. https://hdl.handle.net/10125/73628

Introduction

Writing has become an essential skill for communicating with people worldwide; as such, it has received increasing attention in the second language (L2) teaching community (Weigle, 2002). The English as a Foreign Language (EFL) domain is no exception. Meanwhile, some suggest that writing is a process-oriented domain that requires time, determination, and concentration for both learners (Kormos, 2012) and teachers (Warschauer & Grimes, 2008; Yu, 2021). For the latter, a main source of difficulty lies in the sheer amount of time and effort required to assess and mark students' writing, indicating the need for an efficient assessment system. Automated Writing Evaluation (AWE) has emerged as a potential solution to address this need and has been suggested to provide systematic and consistent scoring (Hockly, 2019). The conceptualization of AWE dates back to the 1960s (Page, 1966); since then, myriad AWE tools have been developed for this purpose. Many are used to assess the writing section of standardized language tests, with e-rater (Monaghan & Bridgeman, 2005) and IntelliMetric (Elliot, 2003) being popular examples. These tools have generally provided high-quality essay scoring (e.g., Elliot, 2003; McCurry, 2010; Ramineni et al., 2012), showing great potential for use in L2 writing instruction.
There are two concerns, however, that need to be considered when using extant AWE tools in individual teachers' writing classes. The first is price (i.e., they are not usually free), and the second is the difficulty of adapting them to a target pedagogical context, where there is usually a specially designed analytical rubric for assessment. One strand of research on AI-assisted language teaching and learning has suggested that technology based on Large Language Models (LLMs), such as ChatGPT (OpenAI, 2023), could be used to develop customized AWE tools for EFL writing (see Teng, 2024 for a recent systematic review), which are more affordable, accessible, and pedagogically suitable for individual writing teachers. Studies directly addressing this topic (e.g., Bucol & Sangkawong, 2024; Shin & Lee, 2024) have demonstrated that ChatGPT-based AWE tools offer highly consistent scoring of EFL students' essays, with their marks strongly correlating with those of human raters. Nevertheless, these studies offer only limited implications, particularly given the small size of their essay datasets. Furthermore, the release of an updated version of ChatGPT (i.e., GPT 4.0-turbo) since these studies' publication necessitates examining the performance of an AWE tool developed on this newer version.

Therefore, the present study was conducted to advance our understanding of the potential advantages of using ChatGPT as an AWE tool and its applicability in individual EFL teachers' writing instruction (Bucol & Sangkawong, 2024; Shin & Lee, 2024). To this end, we developed an AWE chatbot based on ChatGPT 4.0-turbo, which scored 465 narrative essays written by Korean EFL students at the secondary level; we then compared its scores with those of two human raters using various analytical measures. Specifically, we employed (1) correlation coefficients to measure the degree of agreement between the human raters and ChatGPT, (2) the many-facet Rasch model to assess rater severity and consistency, and (3) Coh-Metrix analysis to quantify various linguistic and discourse features of students' essays and examine which features correlate most with the human raters' and the AWE chatbot's scores—that is, whether certain features are weighted more heavily in scoring.

Integrating ChatGPT into EFL Writing Instruction and Assessment

As part of the research efforts that have been made to integrate ChatGPT—an LLM-based chatbot—into EFL writing, one strand (e.g., Guo & Wang, 2024; Steiss et al., 2024) has examined the quality and quantity of ChatGPT's feedback on EFL students' writing and compared it with human raters' feedback. The general finding is that ChatGPT—although it may not provide feedback on par with human raters on some analytic criteria—could generate valuable feedback for EFL students and be a useful supplementary tool for language teachers. Another group of researchers (e.g., Boudouaia et al., 2024; Song & Song, 2023) has compared the effect that ChatGPT-assisted writing instruction has on the development of English writing with that of more traditional writing instruction.
The outcomes of these studies favored the ChatGPT-assisted condition concerning developments in various aspects of writing, thereby pointing to its pedagogical value in EFL writing lessons. Despite these positive findings, it has also been suggested that, at least at its current performance level, ChatGPT should serve as a supportive resource rather than a replacement for language teachers. For example, Al-Garaady and Mahyoob (2023) found that ChatGPT, although it excels at identifying surface-level errors in students' writing, has a limited ability to detect more nuanced errors (e.g., pragmatics-related), which human teachers may be better at addressing. In addition, EFL students considered ChatGPT less useful in terms of informal and neutral registers (Punar Özçelik & Yangın Ekşi, 2024). This indicates that it could be used in limited ways for certain types of writing.

More relevant to this study's topic, some research (Bucol & Sangkawong, 2024; Shin & Lee, 2024) has examined ChatGPT as an AWE tool, testing its ability to score EFL students' writing based on custom-designed analytical rubrics and comparing its scores with human raters'. These studies take advantage of the characteristics of LLM-based chatbots in that they are accessible, powerful in terms of performance, and could be developed according to a user's purpose without any programming knowledge. As these two studies are the direct precursors of the present one, they are reviewed in detail below.

In Bucol and Sangkawong's (2024) study, one of the earliest efforts in this regard, five EFL instructors working at a Thai university were asked to develop a ChatGPT-based AWE tool using custom-designed rubrics consisting of five assessment criteria (therefore generating five GPT-based raters). Next, the tool scored 10 university freshman writing samples, with another five instructors scoring the same samples. It was found that both ChatGPT's and the human raters' scoring reflected a high level of internal consistency. Moreover, the GPT-based tool's and the instructors' scoring demonstrated a moderate-to-strong correlation (r = .65). Based on their results, Bucol and Sangkawong suggested that the LLM-based AWE tool could comprehend the human-developed rubrics properly and, therefore, assess EFL students' essays reliably.

Shin and Lee's (2024) study also developed an LLM-based AWE tool. They used the 'My GPTs' feature of ChatGPT Plus, launched by OpenAI in November 2023, when building the AWE chatbot. The AWE chatbot and two professional human raters scored 50 persuasive essays composed by Korean EFL students. The results revealed that the LLM-based chatbot's scoring correlated highly overall with the two human raters', as measured by correlation coefficients and intraclass correlation. The many-facet Rasch model further revealed that the infit and outfit mean square values of the AWE chatbot's scoring (1.33 and 1.34, respectively), as measures of rating consistency, were marginally acceptable (Linacre, 2005). These values indicate that the AWE chatbot, as an L2 writing rater, requires further training to enhance its consistency.

Although both Bucol and Sangkawong's and Shin and Lee's work greatly contributes to the extant research, the two studies have some limitations. First, their datasets consist of relatively few essay samples, which could be considered rather small for inferential statistics.
Second, although these studies examine the similarity between the AWE chatbot's and human scorers' marks, they provide only limited insight into which writing subcomponents (linguistic and structural) are most correlated with scoring. Finally, as a more recent version of the LLM (i.e., ChatGPT 4.0-turbo) has been released since their studies, it is important to examine an AWE chatbot based on this newer version.

The Present Study

In light of the literature review, we aimed to address these limitations by (1) involving a larger dataset; (2) employing additional analytical measures; and (3) using an updated version of an LLM to develop an AWE chatbot. To this end, the present study addresses the following three research questions:

Research Question #1: To what extent is the My GPTs-based AWE chatbot's scoring of secondary-level EFL students' narrative writing similar to that of professional human raters?

Research Question #2: What are the patterns of severity and consistency in the rating behaviors of the My GPTs-based AWE chatbot compared with those of professional human raters?

Research Question #3: What textual features correlate with each scoring domain, and do they distinguish the human raters' scoring tendencies from the AWE chatbot's?

Methods

Dataset Description

This study used a dataset of 465 narrative essays written by 465 high school students, collected as part of a large research endeavor (Lim et al., 2014). This project involved eight high schools in Seoul, Republic of Korea, selected by the Ministry of Education for various research purposes. Given the nature of the English curriculum at the secondary level in South Korea, English writing had not been a focus of instruction, suggesting that these students had not received intensive writing instruction at the time of data collection. However, as performance assessment based on productive language skills is an important component of the English curriculum in this pedagogical context, English speaking or writing tasks are regularly administered to these students in English classes. Regarding high-school-level writing tasks, students are generally asked to compose 60–80-word narrative or persuasive essays. The topic of the English essays in the current dataset was to describe 'the happiest moment in one's life and reasons why it was the happiest moment'. In the current pedagogical context, this topic is among the most frequent in the aforementioned performance assessment, and it is considered to be both cognitively appropriate and thematically familiar to students.

Additional data obtained from the aforementioned project comprised two in-service English teachers' scoring of the same 465 narrative essays. By the time of the scoring, these teachers had more than five years of English teaching experience and were certified English writing raters by the Korean Institute for Curriculum and Evaluation (KICE). They were instructed to rate each piece of writing according to three criteria (content, organization, and language use) on a five-point scale.

Developing an AWE Chatbot

As mentioned above, the present study developed an AWE chatbot by leveraging the 'My GPTs' feature (see Shin & Lee, 2024 for further information). In the first development phase, the researchers added several guidelines and PDF files (i.e., a sample scoring dataset), instructing the chatbot to perform the present study's required functions.
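The study itself used the no-code 'My GPTs' builder, but a comparable setup can also be reproduced programmatically. The sketch below is an illustration only, not the procedure used in this study: it assumes the OpenAI Python SDK, a hypothetical rubric file (rubric.txt), and a placeholder essay, and it asks the GPT-4-turbo model to score the three domains used here. The actual instructions the researchers provided to My GPTs are listed immediately after the sketch.

```python
# Minimal sketch of a programmatic AWE rater (illustration only; this study used the
# no-code "My GPTs" builder rather than the API). Assumes the OpenAI Python SDK is
# installed and OPENAI_API_KEY is set; "rubric.txt" and the essay are placeholders.
from openai import OpenAI

client = OpenAI()

# Hypothetical rubric file holding the analytic scoring criteria (1-5 per domain).
rubric = open("rubric.txt", encoding="utf-8").read()

SYSTEM_PROMPT = f"""You are an automated rater of EFL students' narrative essays.
The writers are Korean EFL learners, so do not judge their writing against
native-speaker norms. Score each essay on three domains -- Content, Organization,
and Language Use -- on a 1-5 scale, following this rubric:

{rubric}

Do not deduct Language Use points for minor grammatical errors if the content is
comprehensible. If the essay falls below the 60-word minimum, you may deduct 1 point.
Return the three scores with a one-sentence justification for each."""

def score_essay(essay_text: str) -> str:
    """Send one essay to the model and return its scoring output as text."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",   # the LLM version examined in this study
        temperature=0,         # keep the scoring as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(score_essay("The happiest moment in my life was ..."))  # placeholder essay
```

In practice, one would loop over the full set of essays and parse the returned scores into a table so that they can be compared with human ratings, as is done in the analyses below.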
The following are the instructions the researchers provided:

• You must remember that writers are Korean EFL learners, so their writing should not be treated as those written by native English speakers.
• When evaluating essays that I provide as a text in prompt, it is important to assess each of the three scoring domains: (1) 'Content'; (2) 'Organization'; (3) 'Language Use'.
• You must use the 'Essay scoring criteria' attached to 'Knowledge'.
• You must not deduct points from 'Language Use' for minor grammatical errors if the essay's content is comprehensible.
• You must use the sample question and sample scores for each scoring domain as a reference, which are listed in the PDF file (Essay scoring criteria).

Following the aforementioned procedures, the researchers conducted a pilot test (see Figures 1 and 2 below). The piloting revealed that the AWE chatbot did not, as programmed, cover the full scoring range. Furthermore, it did not consider the word limit (60–80 words). As a result, the chatbot was given additional prompts (i.e., 'If the essay does not meet the 60-word minimum, you can deduct 1 point') and extra sample scoring data, which enhanced its performance.

Figure 1
The Sample Question and Essay of the Pilot Test for the AWE Chatbot

Figure 2
The Scoring Results of the Pilot Test for the AWE Chatbot

Data Analysis

For the data analysis, descriptive statistics (i.e., means and standard deviations) were first calculated for the raters' scoring of the 465 essays across the three criteria. For the first research question, in which we aimed to measure the degree of agreement between the two human raters and the AWE chatbot in scoring patterns, we calculated correlation coefficients for each pair of raters on each criterion. Intraclass correlation coefficients were additionally examined as another measure of inter-rater agreement, employing the Statistical Package for the Social Sciences (SPSS) 28.0 (IBM Corp, 2021), a software tool for data management and statistical analysis. For the second research question, in which we aimed to measure the degree of severity (i.e., whether a rater is relatively harsher or more lenient in scoring) and rating consistency, we used the many-facet Rasch model, employing the Many-Facets Rasch Measurement software (Linacre, 2023). Specifically, we examined logit values for rater severity and infit and outfit statistics for rating consistency. For the third research question, in which we aimed to identify the textual features that correlate most with raters' scoring, we analyzed the essays using the 108 features provided by Coh-Metrix 3.0 (Graesser et al., 2004), a software tool for analyzing and quantifying the linguistic and discourse features of a given essay. Identifying these textual features was expected to provide insight into which writing subcomponents influence raters' scoring and help distinguish the human raters' scoring tendencies from those of the AWE chatbot. For our analysis, we selected features that showed a correlation greater than .30 with raters' scoring across the three criteria—content, organization, and language use—for further examination.
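To make the agreement analysis concrete, the following sketch shows how the pairwise correlations and an intraclass correlation of the kind reported below could be computed in Python; this is an illustrative equivalent, not the authors' procedure (the study used SPSS 28.0 for the intraclass correlations). The file and column names are hypothetical, and the intraclass correlation step assumes the optional pingouin package is installed.

```python
# Sketch of the rater-agreement computations described above. "scores.csv" and its
# column names (essay_id, rater1, rater2, chatbot) are hypothetical placeholders.
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

scores = pd.read_csv("scores.csv")  # one row per essay, one column per rater

# Pairwise Pearson correlations for one criterion (e.g., the Content scores).
for a, b in combinations(["rater1", "rater2", "chatbot"], 2):
    r, p = pearsonr(scores[a], scores[b])
    print(f"{a} vs {b}: r = {r:.2f} (p = {p:.3f})")

# Intraclass correlation across all three raters, using the pingouin package.
import pingouin as pg

long_form = scores.melt(id_vars="essay_id",
                        value_vars=["rater1", "rater2", "chatbot"],
                        var_name="rater", value_name="score")
icc = pg.intraclass_corr(data=long_form, targets="essay_id",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```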
Results

Descriptive Statistics and Correlation of the Three Raters' Scoring

Table 1 presents the descriptive statistics for the two professional raters' and the AWE chatbot's scoring. The overview of the descriptive statistics shows that Raters 1 and 2 scored very similarly on Content (M = 2.62 for Rater 1; M = 2.61 for Rater 2) and Organization (M = 2.21 for Rater 1; M = 2.27 for Rater 2), whereas Rater 1 scored Language Use (M = 2.46) more leniently than Rater 2 (M = 2.13). The AWE chatbot differed somewhat from the two professional raters, scoring Content (M = 2.80) and Organization (M = 2.43) more leniently, but Language Use (M = 2.08) more harshly. Nevertheless, the three evaluators' scores for each criterion were all within 0.5 points of each other. The three raters' total scores were as follows: M = 7.30 for Rater 1, M = 7.00 for Rater 2, and M = 7.31 for the AWE chatbot.

Table 1
Descriptive Statistics of Scores from the Two Professional Raters and the AWE Chatbot

                 Rater 1        Rater 2        AWE chatbot
                 Mean (SD)      Mean (SD)      Mean (SD)
Content          2.62 (1.48)    2.61 (1.54)    2.80 (1.50)
Organization     2.21 (1.27)    2.27 (1.45)    2.43 (1.36)
Language Use     2.46 (1.40)    2.13 (1.33)    2.08 (1.20)
Total            7.30 (4.04)    7.00 (4.19)    7.31 (3.94)

Table 2 summarizes the correlation coefficients and intraclass correlation coefficients. The correlation coefficient values ranged from r = .85 (Rater 1 – Rater 2 on Organization) to r = .91 (Rater 2 – AWE chatbot on Content), indicating that each pair's scores correlated highly. Intraclass correlation coefficients, as another measure of agreement, indicated a strong similarity between the two human raters and the AWE chatbot (r = .95 for Content and Organization and r = .91 for Language Use) in light of Cicchetti's (1994) guideline.

Table 2
Correlation Coefficient and Intraclass Correlation Coefficient

Pair                              Correlation coefficient    Intraclass correlation
Content
  Rater 1 – Rater 2               .88***                     .95***
  Rater 1 – AWE chatbot           .90***
  Rater 2 – AWE chatbot           .91***
Organization
  Rater 1 – Rater 2               .85***                     .95***
  Rater 1 – AWE chatbot           .90***
  Rater 2 – AWE chatbot           .89***
Language Use
  Rater 1 – Rater 2               .86***                     .91***
  Rater 1 – AWE chatbot           .89***
  Rater 2 – AWE chatbot           .87***

Note. *** p < .001

Rater Severity and Rater Consistency Statistics

As stated in the Data Analysis section, the many-facet Rasch model was employed for rater severity and rater consistency statistics. Table 3 presents the estimates of rater severity and the consistency statistics, with Figure 3 illustrating the Wright map. Examining rater severity revealed that Rater 2 was the harshest, at +0.31 logits, whereas Rater 1 and the AWE chatbot displayed a similar severity level (–0.14 and –0.17, respectively). The chi-square test for fixed effects indicated significant variation in rater severity (χ2 = 43.3, df = 2, p < .01), suggesting that all three raters differed in severity, with a reliability of separation of .93. For rating consistency, infit and outfit statistics were examined. The two human raters' statistics were very close to 1, indicating an excellent fit. The same statistics for the AWE chatbot were slightly lower (0.8 for infit MnSq and 0.7 for outfit MnSq), indicating that its scoring is slightly overfitting and may reflect somewhat mechanical rating patterns.

Table 3
The Estimates of Rater Severity and Consistency Statistics

Raters        Observed average   Fair (M) average   Logit    SE     Infit MnSq   Outfit MnSq
Rater 1       2.6                2.35               –0.14    0.06   1.0          1.0
Rater 2       2.5                2.22               0.31     0.06   1.1          1.0
AWE chatbot   2.6                2.36               –0.17    0.06   0.8          0.7

Note. χ2(2) = 43.3, p < .01. Reliability of separation = .93.

Figure 3
The Wright Map Produced as Part of the Many-Facet Rasch Measurement Analysis
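For readers less familiar with these statistics, the model underlying the Facets estimates in Table 3 is the many-facet Rasch model. In its standard formulation (notation ours, not the authors'), the probability of essay n receiving category k rather than k−1 from rater j on criterion i is modeled as

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,$$

where $B_n$ is the writer's proficiency, $D_i$ the difficulty of criterion $i$, $C_j$ the severity of rater $j$ (the logit values reported in Table 3), and $F_k$ the threshold of rating category $k$. Infit and outfit mean squares have an expected value of 1: values well above 1 signal erratic, misfitting ratings, whereas values well below 1, as observed for the AWE chatbot here, signal overly predictable, overfitting ratings.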
Correlation between Raters' Scoring and Textual Features

The purpose of this section is to identify key textual features provided by Coh-Metrix that significantly correlate with the raters' scoring across the three assessment domains. This analysis aimed to explore the scoring tendencies of the two human raters (i.e., Rater 1 and Rater 2) and the AWE chatbot, and then to compare their reliance on specific textual features in rating. As stated in the Data Analysis section, we first identified features that correlated with the raters' scores at a level greater than .30 across the three criteria. Next, correlation coefficients were examined to determine the extent to which the identified textual features shape the scoring process. The findings are summarized in Table 4 below.

Three features significantly correlated with the raters' scoring for the Content domain. First, Total Word Count (DESWC) was the most significant predictor across all raters (r > .60 for all three), indicating that longer essays were perceived as containing richer and more developed content (i.e., more extensive elaboration of ideas). Temporal Cohesion (PCTEMPz) was another important feature, which concerns the clarity of temporal progression within the text. Positive correlations with Content scores indicate that texts with clear temporal markers are easier to follow and more coherent, improving perceived content quality. The two human raters and the AWE chatbot showed similar values (r = .38 for Rater 1, r = .35 for Rater 2, r = .39 for the AWE chatbot). Connectivity (PCCONNp) demonstrated a negative correlation with the raters' scoring. It can thus be suggested that, although logical connectives can enhance cohesion, excessive use may disrupt natural flow and make the text seem mechanical. Here, too, the two human raters and the AWE chatbot indicated a similar degree of correlation (r = –.37 for Rater 1, r = –.39 for Rater 2, r = –.36 for the AWE chatbot).

Table 4
Key Findings Based on the Coh-Metrix Analysis

Feature (Code)                       Scoring Domain   Rater 1 (r)   Rater 2 (r)   AWE chatbot (r)   Interpretation
DESWC (F3, Total Word Count)         Content          .62*          .62*          .61**             Longer essays are perceived as richer in content.
PCTEMPz (F26, Temporal Cohesion)     Content          .38*          .35*          .39**             Temporal consistency improves content quality.
PCCONNp (F25, Connectivity)          Content          –.37*         –.39*         –.36**            Overuse of connectives reduces perceived content flow.
DESWC (F3, Total Word Count)         Organization     .59*          .56*          .63**             Essay length supports structural development.
PCTEMPz (F26, Temporal Cohesion)     Organization     .40*          .36*          .42**             Temporal flow contributes to essay organization.
SYNSTRUTa (F74, Syntax Similarity)   Organization     .41*          .39*          .47**             Syntactic similarity enhances organizational coherence.
WRDFRQc (F94, Word Frequency)        Language Use     –.34*         –.36*         –.38**            Lower-frequency words indicate lexical sophistication.
SYNNP (F70, Modifiers per NP)        Language Use     .35*          .32*          .37**             Noun phrase modifiers reflect syntactic complexity.

Note. * p < .05, ** p < .01
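As a concrete illustration of the screening step described above (keeping Coh-Metrix indices whose correlation with the scores exceeds .30), the sketch below assumes the Coh-Metrix 3.0 output has been exported to a CSV containing the 108 indices alongside each rater's Content scores; the file and column names are hypothetical, not the authors' actual data.

```python
# Sketch of the feature-screening step: keep Coh-Metrix indices whose absolute
# Pearson correlation with a rater's scores exceeds .30. "cohmetrix_scores.csv"
# and the column names are hypothetical placeholders.
import pandas as pd

data = pd.read_csv("cohmetrix_scores.csv")   # 108 Coh-Metrix indices + rater scores
rater_cols = ["rater1_content", "rater2_content", "chatbot_content"]
feature_cols = [c for c in data.columns if c not in rater_cols]

selected = {}
for rater in rater_cols:
    # Pearson correlation of every feature with this rater's Content scores.
    corrs = data[feature_cols].corrwith(data[rater])
    selected[rater] = corrs[corrs.abs() > 0.30].sort_values(ascending=False)

# Features retained for all three raters (cf. DESWC, PCTEMPz, and PCCONNp in Table 4).
common = set.intersection(*(set(v.index) for v in selected.values()))
print(common)
```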
Regarding the Organization domain, Total Word Count (DESWC) was again an important predictor of the raters' scoring, suggesting that essays with higher word counts may exhibit better structural development and logical progression. The correlation coefficients for Raters 1 and 2 and the AWE chatbot were .59, .56, and .63, respectively. As in the Content domain, Temporal Cohesion (PCTEMPz) was another feature that significantly correlated with the raters' scoring (r = .40 for Rater 1, r = .36 for Rater 2, r = .42 for the AWE chatbot), given that using such markers could help readers navigate the logical flow of ideas, thereby contributing to perceived coherence. Finally, Syntactic Structure Similarity (SYNSTRUTa) positively correlated with the scores in this domain. The AWE chatbot showed a slightly stronger correlation (r = .47) than the human raters (r = .41 for Rater 1, r = .39 for Rater 2), indicating the automated system's greater reliance on syntactic consistency as a measure of organizational quality.

Concerning the Language Use domain, two textual features emerged as significant. The first was lexical sophistication, measured by Word Frequency for Content Words (WRDFRQc), which significantly and negatively correlated with Language Use scores. This result indicates that essays containing lower-frequency words are associated with higher scores, reflecting a higher level of lexical complexity. The correlation values were similar for all three raters: –.34 (Rater 1), –.36 (Rater 2), and –.38 (the AWE chatbot). The other feature was Modifiers per Noun Phrase (SYNNP), with greater values indicating syntactic complexity. This feature showed positive but small correlation values for all three raters (r = .35 for Rater 1, r = .32 for Rater 2, r = .37 for the AWE chatbot).

The analysis based on Coh-Metrix demonstrates substantial agreement between the human raters and the AWE chatbot in their reliance on the same textual features across all domains, although, notably, the latter seems to rely slightly more on syntax similarity for scoring in the Organization domain. Nevertheless, the scoring tendencies of the human raters and the AWE chatbot were broadly similar.

Discussion and Conclusion

As part of the recent research interest in ChatGPT's potential usefulness in L2 writing instruction (e.g., Al-Garaady & Mahyoob, 2023; Boudouaia et al., 2024; Bucol & Sangkawong, 2024; Guo & Wang, 2024; Shin & Lee, 2024; Song & Song, 2023; Steiss et al., 2024), the present study sought to assess the scoring ability of the AWE chatbot by comparing it with that of two professional human raters, using various measures of rater quality. Our findings generally concur with previous studies on ChatGPT as an English essay rater (Bucol & Sangkawong, 2024; Shin & Lee, 2024). Namely, ChatGPT's scoring could be judged as reliable and highly correlated with the human raters' marking. Although the infit and outfit statistics from the many-facet Rasch model showed that this AI-based tool's scoring is slightly overfitting (possibly reflecting mechanical rating patterns), they were still within the acceptable MnSq range (Linacre, 2005).

A rather interesting and unexpected finding was that, unlike in Shin and Lee's (2024) study, in which the two human raters showed the strongest correlation with each other, in the present study one of the human raters (Rater 1) and the AWE chatbot did so on some domains. Rater 1 and the AWE chatbot further demonstrated a similar severity level, whereas the other human rater (Rater 2) was the harshest of the three. We postulate that this finding may be attributed to the LLM employed for developing the currently examined AWE chatbot (GPT 4.0-turbo), which seems to display better and more human-like rating behaviors than the earlier version used in Shin and Lee.
In this study, the Coh-Metrix analysis further suggested that the human raters and the AWE chatbot are in close agreement, indicating that the AWE chatbot considered the identified textual features in a similar manner when scoring essays.

The present study has the following pedagogical implications. First, given its close resemblance to trained and certified in-service English teachers, an AWE chatbot could be developed to assist instructors in assessing student essays. Our hands-on experience in building a customized AWE chatbot highlights the importance of accurately identifying the target audience's proficiency level, establishing a scoring rubric aligned with assessment objectives, securing representative sample data for each score range (see the Developing an AWE Chatbot section for instructions on uploading these data to My GPTs), and developing effective prompt-writing skills. Notably, our experience shows that the quality of the representative sample data and the careful phrasing of prompts significantly affect the chatbot's scoring accuracy. When developing their own AWE chatbot, teachers should first grade a subset of student essays to serve as representative sample data, then pilot the initial chatbot to ensure its scoring aligns with their assessments. Further calibration can be achieved through additional prompts. For instance, if the chatbot is too harsh on Language Use, teachers may specify that the essays were written by EFL students with limited English proficiency. Finally, teachers should recognize that developing a high-quality AWE chatbot requires multiple rounds of refinement through iterative prompting.

Second, learners could also use the AWE chatbot to assess the quality of their work and make revisions before submitting their essays for a registered course or lesson. For example, if they receive a low score on organization, they might refine their essay's structure. The chatbot could be integrated as a menu option within educational platforms (e.g., online learning management systems), or instructors could provide students with a direct link to their customized AWE chatbot.

There are some limitations of this study that are worth addressing. First, although its dataset (N = 465) was much larger than that of the previous study on the same topic (N = 50; Shin & Lee, 2024), it only examined one particular genre (i.e., narrative writing). Second, the writing task in the current dataset was a simple and short composition task (60–80 words). Therefore, it remains unexplored to what extent the AWE chatbot's scoring would resemble that of professional human raters for other types of writing and at different lengths. Finally, the students in the current dataset were Korean EFL high school students. It is possible that the AWE chatbot may display different rating severity and fit for other learner populations. Given this methodological limitation, including other learner groups is expected to further validate the AWE chatbot's scoring ability. Despite these limitations, the present study provides valuable insights into the potential advantages of an LLM-based chatbot's scoring ability for rating L2 learners' writing. Along with recent research on the quality of LLM-based chatbots' feedback on L2 learners' written compositions, we expect that research like the present study will greatly assist L2 teachers' writing evaluation and instruction.
Acknowledgements

The authors are grateful to the anonymous reviewers for their constructive feedback and suggestions. An earlier version of this article was based on the first author's master's thesis, and they would like to express their gratitude to Professors Jie-Young Kim and Ho Lee for their helpful comments.

References

Al-Garaady, J., & Mahyoob, M. (2023). ChatGPT's capabilities in spotting and analyzing writing errors experienced by EFL learners. Arab World English Journal, 9, 3–17. https://dx.doi.org/10.24093/awej/call9.1

Boudouaia, A., Mouas, S., & Kouider, B. (2024). A study on ChatGPT-4 as an innovative approach to enhancing English as a foreign language writing learning. Journal of Educational Computing Research, 62(6), 1289–1317. https://doi.org/10.1177/07356331241247465

Bucol, J. L., & Sangkawong, N. (2024). Exploring ChatGPT as a writing assessment tool. Innovations in Education and Teaching International. Advance online publication. https://doi.org/10.1080/14703297.2024.2363901

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290. https://doi.org/10.1037/1040-3590.6.4.284

Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71–86). Lawrence Erlbaum.

Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36, 193–202. https://doi.org/10.3758/BF03195564

Guo, K., & Wang, D. (2024). To resist it or to embrace it? Examining ChatGPT's potential to support teacher feedback in EFL writing. Education and Information Technologies, 29, 8435–8463. https://doi.org/10.1007/s10639-023-12146-0

Hockly, N. (2019). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044

IBM Corp. (2021). IBM SPSS Statistics for Windows (Version 28.0) [Computer software]. IBM Corp.

Kormos, J. (2012). The role of individual differences in L2 writing. Journal of Second Language Writing, 21(4), 390–403. https://doi.org/10.1016/j.jslw.2012.09.003

Lim, H., Park, D., & Si, K. (2014). Sophistication of an automated scoring system for large-scale essay writing tests. Multimedia-Assisted Language Learning, 17(1), 84–105. https://doi.org/10.15702/mall.2014.17.1.84

Linacre, J. M. (2005). A user's guide to Winsteps/Ministeps Rasch model programs. MESA Press.

Linacre, J. M. (2023). Facets computer program for many-facet Rasch measurement (Version 3.87.0) [Computer software]. https://www.winsteps.com/facets.htm

McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as human readers? Assessing Writing, 15(2), 118–129. https://doi.org/10.1016/j.asw.2010.04.002

Monaghan, W., & Bridgeman, B. (2005). E-rater as a quality control on human scores. Connections.

OpenAI. (2023). ChatGPT—release notes: Introducing GPT. https://help.openai.com/en/articles/6825453-chatgpt-release-notes
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47(5), 238–243.

Punar Özçelik, N., & Yangın Ekşi, G. (2024). Cultivating writing skills: The role of ChatGPT as a learning assistant—a case study. Smart Learning Environments, 11(10), 1–18. https://doi.org/10.1186/s40561-024-00296-8

Ramineni, C., Trapani, C. S., Williamson, D. M., Davey, T., & Bridgeman, B. (2012). Evaluation of the e-rater® scoring engine for the TOEFL® independent and integrated prompts. ETS Research Report Series, 2012(1), i–51. https://doi.org/10.1002/j.2333-8504.2012.tb02288.x

Shin, D., & Lee, J. H. (2024). Exploratory study on the potential of ChatGPT as a rater of second language writing. Education and Information Technologies, 29, 24735–24757. https://doi.org/10.1007/s10639-024-12817-6

Song, C., & Song, Y. (2023). Enhancing academic writing skills and motivation: Assessing the efficacy of ChatGPT in AI-assisted language learning for EFL students. Frontiers in Psychology, 14, Article 1260843. https://doi.org/10.3389/fpsyg.2023.1260843

Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., & Olson, C. B. (2024). Comparing the quality of human and ChatGPT feedback of students' writing. Learning and Instruction, 91, Article 101894. https://doi.org/10.1016/j.learninstruc.2024.101894

Teng, M. F. (2024). A systematic review of ChatGPT for English as a foreign language writing: Opportunities, challenges, and recommendations. International Journal of TESOL Studies, 6(3), 36–57. https://doi.org/10.58304/ijts.20240304

Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies, 3(1), 22–36. https://doi.org/10.1080/15544800701771580

Weigle, S. C. (2002). Assessing writing. Cambridge University Press.

Yu, S. (2021). Feedback-giving practice for L2 writing teachers: Friend or foe? Journal of Second Language Writing, 52, Article 100798. https://doi.org/10.1016/j.jslw.2021.100798

About the Authors

Kyungmin Kim received an MA degree in English Education from Chung-Ang University, South Korea. His interests include CALL, L2 learning, meta-analysis, and the pedagogical use of ChatGPT.
E-mail: kmkim@englishunt.com
ORCiD: https://orcid.org/0009-0002-2310-688X

Jang Ho Lee received his DPhil in education from the University of Oxford and is presently a Professor at Chung-Ang University, South Korea. His areas of interest are AI-based language teaching and learning, and L1 use in L2 teaching. All correspondence regarding this publication should be addressed to him.
E-mail: jangholee@cau.ac.kr
ORCiD: https://orcid.org/0000-0003-2767-3881

Dongkwang Shin received his PhD in Applied Linguistics from Victoria University of Wellington and is currently a Professor at Gwangju National University of Education, South Korea. His research interests include corpus linguistics, CALL, and AI-based language learning.
E-mail: sdhera@gmail.com
ORCiD: https://orcid.org/0000-0002-5583-0189