Language Learning & Technology, October 2023, Volume 27, Issue 3, pp. 27–40. ISSN 1094-3501. CC BY-NC-ND.

LANGUAGE TEACHER EDUCATION AND TECHNOLOGY FORUM

Can ChatGPT make reading comprehension testing items on par with human experts?

Dongkwang Shin, Gwangju National University of Education
Jang Ho Lee, Chung-Ang University

Abstract

Given the recent increased interest in ChatGPT in the L2 teaching and learning community, the present study sought to examine ChatGPT's potential as a resource for generating L2 assessment materials on par with those created by human experts. To this end, we extracted five reading passages and testing items in the format of multiple-choice questions from the English section of the College Scholastic Ability Test (CSAT) in South Korea. Additionally, we used ChatGPT to generate another set of readings and testing items in the same format. Next, we developed a survey made up of Likert-scale and open-ended questions that asked about participants' perceptions of various aspects of the target readings and testing items. The study's participants comprised 50 pre- and in-service teachers, who were not informed of the target materials' source or the study's purpose. The survey's results revealed that the CSAT and ChatGPT-developed readings were perceived as similar in terms of the naturalness of the passages' flow and expressions. However, the former were judged to include more attractive multiple-choice options and to have a higher level of completion as testing items. Based on these outcomes, we present implications for L2 teaching and future research.

Keywords: Artificial Intelligence, Automated Item Generation, ChatGPT, Content Generation

Language(s) Learned in This Study: English

APA Citation: Shin, D., & Lee, J. H. (2023). Can ChatGPT make reading comprehension testing items on par with human experts? Language Learning & Technology, 27(3), 27–40. https://hdl.handle.net/10125/73530

Introduction

Chat Generative Pre-trained Transformer (henceforth, ChatGPT), a Large Language Model (LLM)-based chatbot, has received significant attention in a wide range of domains in recent months. Since its release on November 30, 2022, ChatGPT reached one million users in five days and two million in two weeks, expanding more quickly than the record-breaking Facebook (now Meta), which took ten months to reach one million users (Kim, 2023). Although no large-scale, experimental research has yet reported on ChatGPT use in the language learning and teaching field, some studies (e.g., Ahn, 2023; Kohnke et al., 2023; Kwon & Lee, 2023; Shin, 2023) have already begun to reveal its potential in the language teaching and learning domain. Among the various ways that ChatGPT can be employed in language teaching and learning, in this article we focus on its capability to generate original texts and language testing items. Although the concepts of computer-based text generation and automated item generation were introduced more than five decades ago (e.g., Bormuth, 1969; Klein et al., 1973), it is only recently, with the development of LLMs (e.g., the one underlying ChatGPT), that such technology has reached a level where it could be a useful supplement to language teaching and learning (Shin, 2023). Given that generating original passages and designing testing items is labor-intensive for L2 teachers, we explore ChatGPT's potential to this end.
More specifically, we address the question of whether ChatGPT could make L2 readings and tests on par with those generated by human experts through a blind test, in which both pre- and in-service teachers evaluated different aspects of College Scholastic Ability Test (CSAT; i.e., the Korean SAT) materials and ChatGPT-generated content. Based on our findings, we also present implications for future research and L2 teaching in the context of using ChatGPT.

Background

In this section, we review three branches of research relevant to our topic: research on computer-based content generation, research on automated item generation, and recent education studies on ChatGPT.

Research on Computer-based Content Generation

The development of computer-based content generation systems dates back to the 1970s, with the introduction of the Novel Writer System (NWS) by Klein et al. (1973). The NWS drafted a short story when a user entered basic information about the story's historical background and characterization. During the same decade, Meehan (1977) produced TALE-SPIN, an interactive storytelling program that allowed users to enter a sequence of events and receive a story in return that incorporated those elements. Despite these early advancements, however, such systems were limited in that they relied on rule-based or case-based methods. That is, they required a significant amount of human effort to categorize story components, such as characters, scenes, and plot.

More recently, automated content generators based on artificial neural networks and machine learning have been developed. These systems can automatically produce stories using basic information like setting, characters, and plot. The LLM that has become the backbone of such neural network-based content creators is GPT-3 (Generative Pre-trained Transformer 3). GPT-3 is an artificial intelligence language prediction model developed by OpenAI and has gained extraordinary attention in myriad domains (Ghumra, 2022). Trained on a dataset of over 200 million English words, GPT-3 can generate sophisticated English sentences and perform diverse language-related tasks, all based on users' prompts. Such advanced technology has led to a burgeoning of AI-based text generation tools like CopyAI, Hyperwrite, and INK. In one recent study (Lee et al., 2023), a reading activity created by an AI-based content generator was found to enhance young EFL learners' English reading enjoyment and interest in reading English books, thereby underscoring the value of adopting such technology in L2 classrooms.

Research on Automated Item Generation

Automated item generation refers to the process of automatically drafting testing items using measurement information (Bormuth, 1969). Such technology could save considerable time and cost in the language testing and learning domain, especially when building large-scale item banks. To date, automated item generation methods have been divided largely into two approaches: ontology-based and rule-based. The former relies on a model that represents an object or concept in a form understandable by both humans and computers, focusing on the object's properties or relationships (Al-Yahya, 2014). Utilizing an ontology-based approach, Al-Yahya (2014) automatically generated multiple-choice (MCQ), true/false (T/F), and fill-in-the-blank (FB) questions.
The generation process, however, was restricted to drafting questions and incorrect answer options. Meanwhile, Liu et al. (2012) implemented the rule-based approach, developing G-Asks, a program that produces question prompts to support scientific writing. More recently, deep-learning-based automated item generation approaches have become mainstream. One type of Recurrent Neural Network (RNN) deep learning technique, the Long Short-Term Memory (LSTM) network, was used by von Davier (2018) to produce items measuring non-cognitive qualities (e.g., emotions, attitudes). Kumar et al. (2018) also employed an LSTM model to analyze input sentences and create question–answer pairs for them. ChatGPT, which was used for automated item generation in this study, also uses a deep learning approach. However, it could be considered superior to earlier technologies because it employs an LLM-based transformer method (Shin, 2023).

Education Research on ChatGPT

In this section, we provide a selective review of recent studies that explore the following issues: the use of ChatGPT in L2 reading and its potential to replace human teachers. First, Kwon and Lee (2023) analyzed ChatGPT's accuracy in answering English reading comprehension questions on the CSAT and the TOEFL iBT. The results showed that ChatGPT, based on GPT-3.5, was capable of answering approximately 69% of the questions correctly. In terms of question type, ChatGPT correctly answered about 75% of those pertaining to factual and inferential comprehension and around 87% of the fill-in-the-blank and summarizing ones. However, its accuracy rate on vocabulary and grammar questions was much lower. Nonetheless, ChatGPT PLUS, which is based on the latest GPT-4, achieved 93% accuracy, including on vocabulary and grammar questions.

With a similar goal, Ahn (2023) evaluated ChatGPT's efficacy on CSAT English reading comprehension testing items. In the experiment, ChatGPT provided correct answers 74% of the time. The study suggested that ChatGPT's performance could be improved by enhancing training methods, incorporating diverse and balanced datasets, and applying human-AI collaboration. The research also identified testing item types on which ChatGPT could improve, including identifying the most appropriate order of events in a story, determining pronoun referents, and placing sentences in the correct order.

While the prior literature focused on measuring ChatGPT's capability to solve reading comprehension problems as the test taker, Shin (2023) investigated ChatGPT's potential for developing reading comprehension items as the test designer. The outcomes showed that some question types require specialized prompts to be designed properly; these included identifying the contextual meaning of underlined expressions, sequencing parts of a passage, and identifying the mood of a text. Based on these observations, the study provided optimized prompts for different types of reading comprehension questions, along with various tips for developing questions with ChatGPT.

Despite growing recognition of ChatGPT's capabilities in L2 research, there appears to be notable apprehension among teachers regarding their roles being potentially replaced by AI. For example, in a study conducted by Chan and Tsi (2023), a survey was administered to 384 undergraduate and graduate students, as well as to 144 teachers representing various disciplines.
The goal was to gain insight into the potential of AI, including ChatGPT, to replace teachers. The results showed that neither group strongly agreed that AI could replace teachers in the future. In another study, Tao et al. (2019) reported on the results of a survey conducted among 140 teachers and found that about 30% expressed skepticism about AI replacing them, whereas the remainder believed otherwise. Given the current state of mixed skepticism and optimism about AI, including ChatGPT, the L2 teaching field needs classroom-oriented research that demonstrates the technology's usefulness to L2 teachers and practitioners, who can then judge its potential for themselves. To address this issue, in this study, both in- and pre-service teachers in the Korean EFL context were asked to evaluate reading passages and questions from the CSAT and those generated by ChatGPT in a blind test, guided by the following two research questions:

Research Question 1. How do pre- and in-service English teachers perceive the CSAT and ChatGPT-developed reading passages in terms of the naturalness of the writing flow and the expressions?

Research Question 2. How do pre- and in-service English teachers perceive the CSAT and ChatGPT-developed reading comprehension testing items in terms of the attractiveness of multiple-choice options and the overall level of completion?

Method

This section describes the methodological approach of the blind test, during which participants were asked to evaluate different aspects of the CSAT and ChatGPT-developed reading passages and testing items.

Participants

The present study included two groups of participants. The first consisted of 38 undergraduate students majoring in English Education (henceforth, pre-service teachers) at a private university in Seoul, South Korea. These participants had been preparing for the teacher certification examination to qualify as in-service teachers, and they had already taken several courses related to English Education. In this group, about two-thirds (68%) had taken a course on L2 assessment and testing. The other group consisted of 12 in-service English teachers and professors (henceforth, in-service teachers), with a mean career length of 10.6 years (SD = 5.2). On a 5-point scale (5 = most proficient, 1 = least proficient), the self-rated proficiency in L2 reading assessment collected in the survey's background section was 3.08 for the pre-service group and 3.45 for the in-service group. The participants were not informed of the present study's aim (i.e., to evaluate ChatGPT-developed reading passages and testing items) but were told only that they had been asked to evaluate items on the CSAT's English test (reading section). Two participants (one in each group) had not completed the survey and were therefore removed from the analysis.

Description of the Technology regarding Text and Item Generation

In this study, we used ChatGPT to generate items based on those of the CSAT English test in South Korea. Based on Shin's (2023) suggestion that using model items is more effective than relying solely on prompts to create test components, we selected five reading questions from the English section of the 2019 CSAT (Korea Institute for Curriculum and Evaluation, 2019) as model items.
These five items comprised different question types: (1) identifying changes in a character's mood, (2) pinpointing details in a passage, (3) inferring an appropriate phrase for a blank, (4) inserting a paragraph according to the text's flow, and (5) filling in appropriate words in the blanks provided in a given passage's summary.

After selecting the five items of different question types from the CSAT, we then used ChatGPT to generate a comparison set of five reading passages and testing items. For example, as shown in Figure 1, the reading passage and testing item related to the question type 'identifying changes in the character's mood' from the CSAT were typed into ChatGPT along with the following prompt: "Draft a new passage with a different topic with a similar multiple-choice question, as follows." A similar procedure was followed for the other kinds of questions, although some modifications were made to the prompts for each type. That is, if the produced testing item differed from its CSAT counterpart in terms of the target question type's structure (or format), the prompts were slightly revised. For instance, in the case of the 'pinpointing details in a passage' question type, the following prompt was entered: "Draft a new passage with a different topic and a multiple-choice question to confirm the agreement with the details in the passage." This procedure was repeated until ChatGPT created a reading passage and testing item structurally equivalent to those sampled from the CSAT (see the Appendix for the sample reading passages and testing items included in the instrument of the present study).
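To make this procedure concrete, the sketch below shows how the same model-item prompting could be scripted. Note that the study used the ChatGPT web interface rather than the API; the OpenAI Python client call, the model name, and the placeholder passage are our illustrative assumptions, not the authors' actual tooling.

```python
# A minimal sketch of the model-item prompting procedure described above,
# assuming the OpenAI Python client (v1.x). The study itself used the ChatGPT
# web interface; the model name below is an assumed stand-in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical placeholder: the full CSAT passage and its multiple-choice
# question would be pasted here as the model item.
MODEL_ITEM = """Looking out the bus window, Jonas could not stay calm. ...
[passage and five answer options]"""

# Prompt format from the study: an instruction followed by the model item.
prompt = (
    "Draft a new passage with a different topic with a similar "
    "multiple-choice question, as follows.\n\n" + MODEL_ITEM
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumption: any chat-capable model would do
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

As in the study, the output would then be inspected and the prompt revised until the generated passage and item match the structure of the CSAT model item.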
For each question type, we paired a human-generated item (i.e., the CSAT one) with a ChatGPT-generated item and asked participants to rate the characteristics and quality of the testing items of the same type generated by both methods. The testing components of the two methods were presented in a randomized order to prevent participants from identifying their sources. The survey was administered as a blind test.

Figure 1

Prompt Entered into ChatGPT to Create the Item Type of 'Identifying Changes in a Character's Mood'

Instrument

Four Likert-scale questionnaire items and one open-ended response item were given for each reading passage and testing item. The four questionnaire components were developed to measure participants' perceptions of (1) the naturalness of the writing flow, (2) the naturalness of the expressions, (3) the attractiveness of the multiple-choice options (i.e., the extent to which distractors play a role in causing the difficulty of the items), and (4) the overall completion level of the testing item (i.e., the quality of the testing items in terms of their relevance to the target passage and whether the options are clearly written and homogeneous in content), all on a 5-point Likert scale. For RQ2, in addition to the overall completion level (which purports to measure overall quality), we also included a questionnaire item on the attractiveness of the multiple-choice options, since our piloting showed that the options generated by ChatGPT are often not plausible. In the open-ended item, participants were asked to elaborate on their rationale for choosing a particular scale point, if they wished. Figure 2 illustrates one of the reading passages and testing items, along with the corresponding questionnaire items. The participants were asked first to read the passage and the testing item and then to complete the questionnaire items (see Figure 2).

Figure 2

An Example of a Passage, a Testing Item, and the Corresponding Questionnaire Items

Passage and testing item:

Which of the following is consistent with the below announcement about a Charity Walk Event?

Charity Walk Event
Join a charity walk hosted by the Riverfront Park! This event supports the local animal shelter.
- When & Where: Sunday, May 15, 9:00 a.m. / Riverfront Park.
- How to Join the Walk: Individual or team registration is accepted.
- Pay your registration fee of $20 as a donation.
- Activities: Walk a 5K route along the riverfront.
- With an additional $10 donation, you can participate in a pet adoption fair.
※ Water and snacks will be provided.
Click here to register now!

① The event is held on a weekday.
② The event is held at the Riverside Park.
③ The event is free to participate.
④ The event is for abandoned animals.
⑤ The event includes a silent auction.

Questionnaire items:

1. Determine the naturalness of the flow of this passage (1 = very unnatural ~ 5 = very natural). 1 2 3 4 5
2. Determine the naturalness of the English expressions of this passage (1 = very unnatural ~ 5 = very natural). 1 2 3 4 5
3. Determine the attractiveness of multiple-choice options for this question (1 = not attractive at all ~ 5 = very attractive). 1 2 3 4 5
4. Determine the overall completion level of the testing item (1 = lowest quality ~ 5 = highest quality). 1 2 3 4 5
5. If you would like to elaborate on the rationale for choosing a particular scale for one of the questionnaire items above, please do so. ______________________________________

Data Analysis

For the data analysis, the participants' responses to the Likert-scale questionnaire items were first entered into an SPSS worksheet, with their responses to the CSAT-sampled reading passages and testing items and their ChatGPT-generated counterparts coded differently. Next, their responses to each Likert-scale questionnaire item were averaged across the five reading passages and testing items extracted from the CSAT, and the same procedure was followed for the ChatGPT-generated passages and testing items. Then, a paired t-test was conducted for each of the four Likert-scale items, with a Bonferroni correction adjusting the alpha level (.05/4 = .0125).

The participants' responses to the open-ended questionnaire items were first separated into those given for the CSAT reading passages and testing items and those given for the ChatGPT-generated ones. Afterward, they were grouped according to theme (i.e., naturalness of flow, naturalness of English expressions, attractiveness of multiple-choice options, overall completion level of the testing item). Responses that were not relevant to any of the themes or did not offer a rationale for choosing a particular scale were excluded from the dataset.
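To make the analysis concrete, the sketch below reproduces the described procedure on hypothetical ratings (the study's data are not reproduced here): per-participant averages across the five items per source, a paired t-test per survey question, one common paired-samples formulation of Cohen's d (the article does not state which formula it used), and the Bonferroni-adjusted alpha.

```python
# A minimal sketch of the analysis described above, using hypothetical ratings
# (not the study's data). Requires numpy and scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# rows = 48 participants, columns = 5 passages; Likert ratings from 1 to 5
csat = rng.integers(3, 6, size=(48, 5)).astype(float)
chatgpt = rng.integers(2, 6, size=(48, 5)).astype(float)

# Average each participant's ratings across the five items per source
csat_mean = csat.mean(axis=1)
chatgpt_mean = chatgpt.mean(axis=1)

# Paired t-test for one survey question (repeated for each of the four)
t_stat, p_value = stats.ttest_rel(csat_mean, chatgpt_mean)

# Cohen's d for paired samples: mean difference / SD of the differences
diff = csat_mean - chatgpt_mean
cohens_d = diff.mean() / diff.std(ddof=1)

alpha = 0.05 / 4  # Bonferroni correction across the four survey questions
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}, "
      f"significant at adjusted alpha: {p_value < alpha}")
```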
Results

Table 1 shows the participants' mean ratings of the CSAT and ChatGPT-developed testing components, along with the results of the paired t-tests.

Table 1

Participants' Mean Ratings of the CSAT and ChatGPT-developed Testing Items and the Paired t-Test Results

Naturalness of flow
  CSAT, M (SD): pre-service 4.56 (.48); in-service 3.98 (.61); total 4.43 (.56)
  ChatGPT-developed, M (SD): pre-service 4.40 (.40); in-service 4.00 (.69); total 4.31 (.50)
  t = 2.23, 95% CI [.01, .22], Cohen's d = .24

Naturalness of English expressions
  CSAT, M (SD): pre-service 4.46 (.53); in-service 4.11 (.52); total 4.38 (.55)
  ChatGPT-developed, M (SD): pre-service 4.36 (.53); in-service 4.07 (.73); total 4.30 (.59)
  t = 1.71, 95% CI [–.01, .18], Cohen's d = .14

Attractiveness of multiple-choice options
  CSAT, M (SD): pre-service 4.35 (.47); in-service 3.64 (.78); total 4.19 (.63)
  ChatGPT-developed, M (SD): pre-service 3.86 (.66); in-service 3.29 (.79); total 3.73 (.72)
  t = 6.71*, 95% CI [.32, .60], Cohen's d = .64

Overall completion level of the testing item
  CSAT, M (SD): pre-service 4.39 (.43); in-service 3.44 (.69); total 4.18 (.64)
  ChatGPT-developed, M (SD): pre-service 4.00 (.64); in-service 3.44 (.79); total 3.88 (.71)
  t = 3.40*, 95% CI [.12, .47], Cohen's d = .42

Note. *p < .0125 (adjusted with a Bonferroni correction).

As Table 1 demonstrates, the pre-service group gave overall higher ratings than their in-service counterparts, regardless of the type of passage and testing item (i.e., the CSAT items or those developed by ChatGPT) or survey component. Regarding the CSAT items, the results of independent t-tests revealed that the pre-service group gave significantly higher ratings than the in-service group in terms of naturalness of flow (t = 3.28, p = .002) and overall completion level of the testing item (t = 4.38, p = .001), but not for the naturalness of English expressions (t = 1.92, p = .06) or the attractiveness of multiple-choice options (t = 2.90, p = .013), under the Bonferroni-adjusted alpha level (.05/4 = .0125). In the case of the ChatGPT-developed testing components, the two groups' ratings were not significantly different under the Bonferroni-adjusted alpha level in terms of naturalness of flow (t = 1.86, p = .087), naturalness of English expressions (t = 1.45, p = .15), attractiveness of multiple-choice options (t = 2.40, p = .02), or overall completion level of the testing item (t = 2.47, p = .017).

Regarding the first research question, the participants rated the naturalness of flow and English expressions of the target reading passages highly (over 4.3), with no significant difference found between the CSAT and ChatGPT-developed items. Some participants gave open-ended responses regarding the naturalness of the flow and expressions of the passages developed by ChatGPT, as follows:

The flow of this passage [the fourth ChatGPT-developed reading passage] seems to be natural. (Pre-service teacher #35)

The sentences in this passage [the third ChatGPT-developed reading passage] flow well overall. (Pre-service teacher #13)

The flow, expression, and composition of this passage [the first ChatGPT-developed reading passage] seem appropriate. (In-service teacher #1)

As for the second research question, the CSAT items (M = 4.19) were rated significantly higher than the ChatGPT-based ones (M = 3.73) (p < .0125, adjusted with a Bonferroni correction) in terms of the attractiveness of multiple-choice options. Indeed, one of the most frequent comments in the open-ended responses for the ChatGPT-based items concerned the lack of attractive distractors (n = 22). Some examples are given below:

There are no compelling option choices for this question [the question for the first ChatGPT-developed reading passage]. (Pre-service teacher #16)

Some options in this question [the question for the second ChatGPT-developed reading passage] do not make sense at all.
(Pre-service teacher #30)

A significant modification is required for this question [the question for the second ChatGPT-developed reading passage]. For example, the phrase in the fifth option is not even mentioned in the passage. (In-service teacher #10)

The overall attractiveness of the options [for the question for the third ChatGPT-developed reading passage] seems to be diminishing. (In-service teacher #9)

The first option [in the question for the fifth ChatGPT-developed reading passage] could also be the answer, I believe. (In-service teacher #7)

As seen from these comments, some of the incorrect options generated by ChatGPT were deemed rather unattractive, or even potentially correct answers. Although there were some negative comments about the options' attractiveness for the CSAT items as well, they were rare (n = 3). For the other questionnaire item regarding the second research question, participants rated the completion level of the CSAT items (including the reading passages and testing components) (M = 4.18) significantly higher than that of the ChatGPT-developed ones (M = 3.88) (p < .0125, adjusted with a Bonferroni correction). Some of the relevant comments on the CSAT items are given below:

I think this testing item [for the third CSAT reading passage] can be solved only when you fully understand the text and analyze the results accurately. (Pre-service teacher #17)

The content of the [first CSAT] passage is easy, but it is designed such that test takers should read both the passage and the options carefully in order to choose the correct answer. (In-service teacher #3)

Considering the logical flow, it seems that there are a lot of conjunctive adverbs in the [fourth CSAT] passage. However, the overall completion of the testing item is high. (In-service teacher #2)

To summarize, the CSAT and ChatGPT-developed reading passages were perceived as similar in terms of the naturalness of their flow and expressions. In contrast, the former were judged as including more attractive multiple-choice options and as having a higher level of completion as testing items.

Discussion and Conclusion

In the present article, we have examined whether ChatGPT could generate L2 reading passages and testing items on par with those created by human experts. To this end, we administered a blind test with pre- and in-service English teachers in the South Korean EFL context. Our findings revealed that ChatGPT is indeed capable of generating L2 reading passages with a level of naturalness in flow and expressions similar to those written by human experts, as perceived by both participant groups. This outcome is consistent with recent studies (e.g., Ahn, 2023; Kwon & Lee, 2023), which have demonstrated that ChatGPT has a remarkable ability to solve L2 reading comprehension tasks. Notably, the present study further revealed ChatGPT's potential for designing L2 reading comprehension tests. However, given the significant differences between the CSAT and ChatGPT-developed testing items in the participants' perceptions of the attractiveness of multiple-choice options and the overall level of completion, human teachers would still be important for revising the testing items developed by ChatGPT, as seen in prior research (Shin, 2023).
As evidenced in several excerpts from the participants' open-ended responses, there seems to be much room for improvement in ChatGPT's ability to construct well-designed testing components. Given this study's results, our tentative answer to the question of whether ChatGPT is needed in L2 teaching is positive. That is, ChatGPT could help EFL teachers generate L2 reading passages and testing items more conveniently and in a very short time, thereby significantly reducing their workload. Meanwhile, teacher involvement in revising the generated testing items seems crucial, at least given the current state of ChatGPT's capacity. Thus, while it is not yet possible to completely hand over L2 testing item creation to ChatGPT, EFL teachers are strongly encouraged to explore its potential, given its powerful ability to assist them.

The following procedures could be followed by EFL teachers and practitioners who wish to create testing items using ChatGPT. First, they should identify a set of target skills and abilities (e.g., inferring a target passage's main idea) of interest in the given domain (e.g., reading) and examine the types of testing items that purport to assess each skill and ability. Next, they should collect model testing items from extant L2 tests with high reliability and validity and examine whether the selected testing items fit their purposes (i.e., measuring the target skills and abilities). Then, they should draft the prompt with which to generate each type of testing item (e.g., "Generate a new passage on a different topic and a multiple-choice question with 5 choices, as follows: [a model reading passage and testing item]") and revise it if the output is not satisfactory.
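This workflow can also be lightly scripted so that each question type has a reusable prompt template. The sketch below is a hypothetical illustration of that idea, not a tool from the study; the template keys, function name, and model are our assumptions, and the review step remains a human judgment.

```python
# A hypothetical sketch of the teacher workflow suggested above: reusable
# prompt templates per question type, generation via the OpenAI Python client
# (v1.x), and a manual review step. Template wording follows the prompts
# quoted in this article; everything else is our assumption.
from openai import OpenAI

client = OpenAI()

# Steps 1-2: target skills mapped to prompt templates built around model items
# collected from reliable, valid tests (the CSAT in this study's case).
PROMPT_TEMPLATES = {
    "mood_change": (
        "Draft a new passage with a different topic with a similar "
        "multiple-choice question, as follows.\n\n{model_item}"
    ),
    "detail_matching": (
        "Draft a new passage with a different topic and a multiple-choice "
        "question to confirm the agreement with the details in the "
        "passage.\n\n{model_item}"
    ),
}

def generate_item(question_type: str, model_item: str) -> str:
    """Step 3: fill the template for the given question type and generate."""
    prompt = PROMPT_TEMPLATES[question_type].format(model_item=model_item)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Final step: a teacher reviews each draft; if distractors are implausible
# (as this study's participants often noted), revise the prompt and retry.
```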
Finally, we provide the following suggestions for future research. First, a larger-scale study with EFL learners, rather than pre- and in-service teachers, is needed to examine this same issue from learners' perspectives. Second, researchers are encouraged to use ChatGPT for generating testing items in other linguistic dimensions (e.g., listening, grammar) and to examine its potential in such areas. Lastly, a longitudinal study on L2 teachers' engagement in developing language testing components with ChatGPT, and its washback effect, would be expected to contribute to the growth of the current research branch on using ChatGPT in language teaching and learning.

Acknowledgements

We thank the pre- and in-service teachers who participated in the current study. We also appreciate the reviewers for their insightful feedback.

References

Ahn, Y. Y. (2023). Performance of ChatGPT 3.5 on CSAT: Its potential as a language learning and assessment tool. Journal of the Korea English Education Society, 22(2), 119–145.

Al-Yahya, M. (2014). Ontology-based multiple choice question generation. The Scientific World Journal, 2014, Article 274949. https://doi.org/10.1155/2014/274949

Bormuth, J. (1969). On a theory of achievement test items. University of Chicago Press.

Chan, C. K. Y., & Tsi, L. H. Y. (2023). The AI revolution in education: Will AI replace or assist teachers in higher education? arXiv. https://doi.org/10.48550/arXiv.2305.01185

Ghumra, F. (2022, March). OpenAI GPT-3, the most powerful language model: An overview. eInfochips. https://www.einfochips.com/blog/openai-gpt-3-the-most-powerful-language-model-an-overview/

Kim, T. (2023). Can ChatGPT be an innovative tool?: The use cases and prospects of ChatGPT (NIA_The AI Report 2023-1). NIA AI Future Strategy Center.

Klein, S., Aeschlimann, J. F., Balsiger, D. F., Converse, S. L., Court, C., Foster, M., Lao, R., Oakley, J. D., & Smith, J. (1973). Automatic novel writing: A status report (Technical Report 186). The University of Wisconsin, Computer Science Department.

Kohnke, L., Moorhouse, B. L., & Zou, D. (2023). ChatGPT for language teaching and learning. RELC Journal. https://doi.org/10.1177/00336882231162868

Korea Institute for Curriculum and Evaluation. (2019). 2020 College Scholastic Ability Test: English section. Korea Institute for Curriculum and Evaluation. https://www.suneung.re.kr/boardCnts/fileDown.do?fileSeq=b8cc879f115f67b90ace7b59c57641a8

Kumar, V., Boorla, K., Meena, Y., Ramakrishnan, G., & Li, Y. F. (2018). Automating reading comprehension by generating question and answer pairs. In D. Phung, V. S. Tseng, G. I. Webb, B. Ho, M. Ganji, & L. Rashidi (Eds.), Advances in knowledge discovery and data mining (pp. 335–348). Springer International Publishing.

Kwon, S.-K., & Lee, Y. T. (2023). Investigating the performance of generative AI ChatGPT's reading comprehension ability. Journal of the Korea English Education Society, 22(2), 147–172.

Lee, J. H., Shin, D., & Noh, W. (2023). Artificial intelligence-based content generator technology for young English-as-a-foreign-language learners' reading enjoyment. RELC Journal. https://doi.org/10.1177/00336882231165060

Liu, M., Calvo, R. A., & Rus, V. (2012). G-Asks: An intelligent automatic question generation system for academic writing support. Dialogue and Discourse, 3(2), 101–124. https://doi.org/10.5087/dad.2012.205

Meehan, J. R. (1977). TALE-SPIN: An interactive program that writes stories. In R. Reddy (Ed.), Proceedings of the Fifth International Joint Conference on Artificial Intelligence (pp. 91–98). Morgan Kaufmann.

Shin, D. (2023). A case study on English test item development training for secondary school teachers using AI tools: Focusing on ChatGPT. Language Research, 59(1), 21–42. https://doi.org/10.30961/lr.2023.59.1.21

Tao, H. B., Diaz, V. R., & Guerra, Y. M. (2019). Artificial intelligence and education: Challenges and disadvantages for the teacher. Arctic Journal, 72(12), 30–50.

von Davier, M. (2018). Automated item generation with Recurrent Neural Networks. Psychometrika, 83(4), 847–857. https://doi.org/10.1007/s11336-018-9608-y

Appendix. The Sample Reading Passages and Testing Items Included in the Blind Test

Reading passages and testing items 1-1, 2-1, and 3-2 in the Appendix are copyrighted by the Korea Institute for Curriculum and Evaluation (2019).

1-1 (2019 CSAT English section Q#19)

Which of the following is the most appropriate for capturing Jonas' emotional change, as revealed in the following article?

Looking out the bus window, Jonas could not stay calm. He had been looking forward to this field trip. It was the first field trip for his history course. His history professor had recommended it to the class, and Jonas had signed up enthusiastically. He was the first to board the bus in the morning. The landscape looked fascinating as the bus headed to Alsace. Finally arriving in Alsace after three hours on the road, however, Jonas saw nothing but endless agricultural fields. The fields were vast, but hardly appealed to him. He had expected to see some old castles and historical monuments, but now he saw nothing like that awaiting him. "What can I learn from these boring fields?" Jonas said to himself with a sigh.
① excited → disappointed
② indifferent → thrilled
③ amazed → horrified
④ surprised → relieved
⑤ worried → confident

1-2 (ChatGPT)

Which of the following is the most appropriate in capturing Maria's emotional change, as revealed in the following article?

As she stepped onto the stage, Maria's heart began to race. She had always loved singing, but this was the first time she would be performing in front of a large audience. She had been practicing for weeks, preparing for this moment. As the music started playing and the lights came on, Maria felt a surge of excitement. Her voice was strong and clear, and the audience seemed to be enjoying her performance. But as she reached the chorus, Maria's confidence began to waver. She had missed a note, and now she was worried she would make another mistake. By the end of the song, Maria was feeling disappointed with her performance, despite the enthusiastic applause from the audience.

① excited → disheartened
② indifferent → thrilled
③ amazed → horrified
④ surprised → relieved
⑤ worried → confident

2-1 (2019 CSAT English section Q#28)

Which of the following is consistent with the below announcement about the 2019 Badminton Challenge for Charity?

2019 Badminton Challenge for Charity
Join the charity tournament event hosted by Cliffield Community Center! This event supports Salke Children's Hospital.
When & Where
∙ Saturday, November 23, 2:00 p.m.
∙ Cliffield Sports Center.
How to Join the Tournament
∙ Make a two-member team.
∙ Pay your team's $100 entry fee as a donation.
Activities
∙ Challenge last year's champion team to a 3-point match.
∙ With an additional $20 donation, you can learn badminton skills from professional players.
※ Rackets and shuttlecocks will be provided.
Click here to register now!

① It is hosted by Salke Children's Hospital.
② Three people should form a team and participate.
③ The participation fee is $100 per person.
④ You can learn badminton skills if you donate 20 dollars extra.
⑤ Racket and shuttlecock are not provided.

2-2 (ChatGPT)

Which of the following is consistent with the below announcement about a Charity Walk Event?

Charity Walk Event
Join a charity walk hosted by the Riverfront Park! This event supports the local animal shelter.
- When & Where: Sunday, May 15, 9:00 a.m. / Riverfront Park.
- How to Join the Walk: Individual or team registration is accepted.
- Pay your registration fee of $20 as a donation.
- Activities: Walk a 5K route along the riverfront.
∙ With an additional $10 donation, you can participate in a pet adoption fair.
※ Water and snacks will be provided.
Click here to register now!

① The event is held on a weekday.
② The event is held at the Riverside Park.
③ The event is free to participate.
④ The event is for abandoned animals.
⑤ The event includes a silent auction.

3-1 (ChatGPT)

Choose the most appropriate sentence for the blank in the below passage.

The famous novel To Kill a Mockingbird by Harper Lee explores the issue of racial injustice in the southern United States. The main character, Scout Finch, grows up in the 1930s in the small town of Maycomb, Alabama, where she witnesses the prejudice and narrow-mindedness that exist in the town. _______________________ Through her experiences and the influence of her father, Atticus (a lawyer who defends an African American man accused of assaulting a white woman), Scout learns important lessons about tolerance, compassion, and the pursuit of justice.
① Scout also learns how to deal with her neighbors.
② The town is also struggling with the Great Depression.
③ Scout's curiosity and adventurous spirit often get her into trouble.
④ The trial of the African American man serves as a catalyst for Scout's growth.
⑤ Scout becomes friends with a mysterious neighbor named Boo Radley.

3-2 (2019 CSAT English section Q#32)

Choose the most appropriate sentence for the blank in the below passage.

The Swiss psychologist Jean Piaget frequently analyzed children's conception of time via their ability to compare or estimate the time taken by pairs of events. In a typical experiment, two toy cars were shown running synchronously on parallel tracks, _______________________________. The children were then asked to judge whether the cars had run for the same time and to justify their judgment. Preschoolers and young school-age children confused temporal and spatial dimensions: starting times are judged by starting points, stopping times by stopping points, and durations by distance, although each of these errors does not necessitate the others. Hence, a child may claim that the cars started and stopped running together (correct) and that the car that stopped further ahead ran for more time (incorrect).

① one running faster and stopping further down the track
② both stopping at the same point further than expected
③ one keeping the same speed as the other to the end
④ both alternating their speed but arriving at the same end
⑤ both slowing their speed and reaching the identical spot

About the Authors

Dongkwang Shin received his PhD in Applied Linguistics from Victoria University of Wellington and is currently a Professor at Gwangju National University of Education, South Korea. His research interests include corpus linguistics, CALL, and AI-based language learning.

E-mail: sdhera@gmail.com

Jang Ho Lee received his DPhil in Education from the University of Oxford and is presently a Professor at Chung-Ang University, South Korea. His areas of interest are CALL, L1 use in L2 teaching, and vocabulary acquisition. All correspondence regarding this publication should be addressed to him.

E-mail: jangholee@cau.ac.kr