Language Learning & Technology
2022, Volume 26, Issue 1, pp. 1–25
ISSN 1094-3501, CC BY-NC-ND

ARTICLE

Automated written corrective feedback: Error-correction performance and timing of delivery

Jim Ranalli, Iowa State University
Taichi Yamashita, University of Toledo

Abstract

To the extent automated written corrective feedback (AWCF) tools such as Grammarly are based on sophisticated error-correction technologies, such as machine-learning techniques, they have the potential to find and correct more common L2 error types than simpler spelling and grammar checkers such as the one included in Microsoft Word (technically known as MS-NLP). Moreover, AWCF tools can deliver feedback synchronously, although not instantaneously, as often appears to be the case with MS-NLP. Cognitive theory and recent L2 research suggest that synchronous corrective feedback may aid L2 development, but also that error-flagging at suboptimal times could cause disfluencies in L2 students' writing processes. To contribute to the knowledge needed for appropriate application of this new genre of writing-support technology, we evaluated Grammarly's capacity to address common L2 problem areas, as well as issues with its feedback-delivery timing, using MS-NLP as a benchmark. Grammarly was found to flag 10 times as many common L2 error types as MS-NLP in the same corpus of student texts while also displaying an average 17.5-second delay in feedback delivery, exceeding the distraction-potential threshold defined for the L2 student writers in our sample. Implications for the use of AWCF tools in L2 settings are discussed.

Keywords: Syntax/Grammar, Writing, Human-Computer Interaction

Language(s) Learned in This Study: English

APA Citation: Ranalli, J., & Yamashita, T. (2022). Automated written corrective feedback: Error-correction performance and timing of delivery. Language Learning & Technology, 26(1), 1–25. http://hdl.handle.net/10125/73465

Introduction

At the same time that much of L2 writing in English is taking place across a variety of digital spaces, powerful and sophisticated error-correction tools have become available across these spaces. Learners have come to expect that some form of automated help, at least with spelling, will be provided in email programs, learning management systems, and mobile device keyboards. The reach of sophisticated error-correction methods addressing not only spelling but complex areas of grammar has increased with the advent of tools such as Grammarly, which integrates into web browsers, office productivity software, mobile devices, and even Google Docs. Importantly, these tools can operate synchronously, providing feedback as writers write.

Recent published work on automated help for writing has focused on so-called AWE (automated writing evaluation) tools such as Criterion (e.g., Lavolette et al., 2015; Ranalli et al., 2017) and MY Access! (Chen & Cheng, 2008; Dikli, 2010). These tools deliver feedback asynchronously, allow access only through standalone web interfaces, and attempt to address both grammatical errors and higher-level issues (e.g., organization) with mixed results. This has left under-investigated another type of tool for automated feedback on writing—one which delivers feedback synchronously, is accessed in a convenient diversity of ways, and harnesses state-of-the-art technologies in focusing on lower-level concerns, including error types common to L2 writers—that we refer to as the automated written corrective feedback (AWCF) tool.
Viewed from cognitive theoretical perspectives on L2 writing and L2 learning, this new genre presents both opportunities and risks. Synchronous corrective feedback (CF) provided by teachers or text-chat interlocutors has been found to lead to increased gains in grammatical accuracy compared to CF that is delayed (Arroyo & Yilmaz, 2018; Shintani & Aubrey, 2016). AWCF tools could provide a more practicable and frequent source of such feedback. Yet synchronous AWCF may also constitute a potential source of distraction for writers if the timing of its delivery is misaligned with the cognitive processes involved in text production.

To contribute to the knowledge needed for appropriate applications of this new technological genre, we undertook evaluations of Grammarly from both system-centric and user-centric perspectives (Chodorow et al., 2010). For the former, we assessed Grammarly's error-correction performance vis-à-vis the unique needs of L2 writers, and for the latter, the consequences of Grammarly's enhanced error-correction capabilities on the timing of its feedback.

Automated Corrective Feedback for L2 Student Writers

Researchers in the field of grammatical error correction (GEC) distinguish between methods and systems aimed at L1 users versus those designed for L2 users of a language, or, more recently, both L1 and L2 users (Napoles et al., 2019), because the different groups are characterized by different common error types. L1 student writers' most common errors after spelling errors include lack of a comma after an introductory element and vague pronoun reference (Connors & Lunsford, 1988). By contrast, an error-annotated version of the Cambridge Learner Corpus (CLC), which represents a wide range of L1s and English proficiency levels, shows errors involving word choice, prepositions, and determiners to have the highest proportions after spelling errors (Leacock et al., 2014). Table 1 lists the 10 highest-ranking errors in the CLC.

Table 1
Top-ranked L2 Written Errors in the Cambridge Learner Corpus (adapted from Leacock et al., 2014, p. 20)

Rank | Error Type | Example
1 | Content Word Choice Error | We need to deliver the merchandise on a daily *base/basis.
2 | Preposition Error | Our society is developing *in/at high speed.
3 | Determiner Error | We must try our best to avoid *the/a shortage of fresh water.
4 | Comma Error | However, */, I'll meet you later.
5 | Inflectional Morphology | The women *weared/wore long dresses.
6 | Wrong Verb Tense | I look forward to *see/seeing you.
7 | Derivational Morphology | It has already been *arrangement/arranged.
8 | Pronoun | I want to make *me/myself fit.
9 | Agreement Error | I *were/was in my house.
10 | Run-on Sentence | They deliver documents to them they provide fast service.

The different frequencies of L1/L2 written errors necessitate different error-correction approaches. MS-NLP, which has been included with MS Word since 1997, is a system designed for detecting L1 errors. It comprises a parser and dictionary of morphological information small enough to be stored and operated on the user's local machine. According to Leacock et al. (2014), MS-NLP is based on a formal grammar called augmented phrase structure, which requires the expertise of trained linguists to create rules addressing, for example, subject-verb agreement and fragment errors. These rules are implemented when users initiate the spelling-grammar check in MS Word.
Following an initial parse, the system creates a parse tree, which is then converted to a semantic representation called a logical form. Critique rules are then applied to the logical form to check for rule violations, which, if found, initiate error-correction algorithms. While it remains "arguably the world's most heavily used linguistic analysis system" (Leacock et al., 2014, p. 10), informal reports suggest the English version has been modified over successive releases so as to detect fewer and fewer error types (Kies, 2008), possibly in response to user complaints about false positives (Bishop, 2005).

Some of the most common error types in L2 English student writing, however, present challenges for GEC because the choice of the appropriate form rests not on syntactic rules but contextual dependencies. Gamon et al. (2009) describe how preposition and article errors require a great deal of contextual information to detect and correct. Choice of the correct preposition, for example, may depend on the noun that follows it, the verb that precedes it, the noun that precedes it, or some combination of these (Chodorow et al., 2010). Statistical or machine learning-based techniques have thus been used to tackle such errors because they obviate the need for intensive, manual effort and grammatical expertise in devising a large set of rules. Such systems avoid the need for exact matches by assigning higher probabilities to particular words that are more frequent and lower probabilities to those that are not through the use of statistical classifiers (e.g., maximum entropy1 classifiers and support vector machines) in combination with different information types such as token-context, syntactic-context, part-of-speech (POS), and semantic information (Leacock et al., 2014).

Context also factors into solutions for addressing what is the most frequent error type in both L1 and L2 writing: misspellings. Spelling errors are a special concern in GEC not only for their frequency but because they can degrade the performance of NLP systems (Nagata et al., 2017). Most spell-checking systems have been based on research into L1 spelling errors (Heift & Schulze, 2007; Rimrott & Heift, 2005, 2008), and so they are less effective in addressing L2 spelling errors. This is because the underlying research shows L1 spelling errors to typically involve only the omission, inclusion, transposition, or substitution of a single letter (Rimrott & Heift, 2005). L2 misspellings tend to be more complex, originating with a number of different causes such as misapplication of morphological rules, L1 transfer, and lack of L2 morphological and phonological knowledge (Heift & Schulze, 2007). As such, they often involve edit distances greater than one—that is, two or more operations are needed for transformation into the corrected form—so they are difficult for L1-based spell checkers to handle. In response, researchers have developed systems that can address L2 spelling errors using contextual information (Flor & Futagi, 2012) or error-correction models derived from learner corpora (Nagata et al., 2017).

There is a general consensus that addressing the needs of L2 student writers for automated corrective feedback requires a hybrid approach, including hand-crafted rules for simpler errors, machine-learning techniques for those that are more context-dependent, and parser-based analyses for errors involving long-distance dependencies such as subject-verb agreement (Leacock et al., 2014).
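To make the notion of edit distance in the spelling discussion above concrete, here is a minimal dynamic-programming sketch of Levenshtein distance. It is a generic illustration rather than code from any of the tools discussed, and the example misspellings are our own, chosen only to show the single-edit versus multiple-edit distinction.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to transform string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# A typical L1-style slip is one operation away from the target form,
# whereas many L2 misspellings require two or more operations.
print(levenshtein("studing", "studying"))       # 1: single-edit misspelling
print(levenshtein("sucessfull", "successful"))  # 2: multiple-edit misspelling
```

Spell checkers tuned to single-edit errors search only the immediate one-operation neighborhood of an unknown string, which is one reason multiple-edit L2 misspellings so often go undetected or uncorrected.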
In addition, the need for pre- and post-processing routines (e.g., splitting text into sentences and applying exclusion rules to minimize false positives) means that analysis of L2 student text necessitates complex suites of procedures and, as a consequence, the computing power of remote servers rather than the user's local CPU. An AWCF tool like Grammarly is thus likely to display longer error-flagging delays than MS-NLP because of (a) the requirement that text be transmitted to and from remote servers for processing; (b) the large and complex array of computational processes involved; and (c) according to a Grammarly technical lead (M. Romanyshyn, personal communication, January 6, 2019), the requirement, in the case of some types of checks, for text to be bounded by sentence-final punctuation in order to facilitate parsing and part-of-speech tagging.

Because our purpose was to evaluate both the error-detection performance of Grammarly as well as the effects of error-correction performance on the timing of feedback delivery, a brief review of GEC evaluation techniques is necessary.

Evaluation of GEC Techniques and Systems

GEC evaluations have typically involved two measures: precision and recall, which originate in the field of information retrieval. Precision is concerned with false positives, that is, flagged items that are not, in fact, errors: it is the proportion of a system's flaggings that are genuine errors. A precision rate of .73 for missing article errors, for example, would mean that 73% of a system's missing-article flaggings had been confirmed by human annotators to indeed be such errors. By contrast, recall involves false negatives, that is, actual errors that go unflagged: it is the proportion of actual errors that a system flags. Recall of .35 for fragment errors would mean a system identified 35% of the total number of fragment errors in a corpus as attested by human annotators. There is a trade-off such that increasing precision leads to lower rates of recall and vice versa. GEC developers prioritize precision over recall because false positives are thought to be more detrimental to learners than false negatives, a position for which there is some empirical support (Nagata & Nakatani, 2010).

GEC researchers typically measure precision and recall with reference to a particular error-correction technique or method. Han et al. (2006), for example, reported precision and recall of .90 and .40 for an article-error detection approach based on a maximum entropy classifier and the use of token and POS contexts. Tetreault and Chodorow (2008) reported precision and recall of .84 and .19, respectively, for a preposition-error detection approach based on a maximum entropy classifier, the use of token and POS contexts, and heuristic rules. Developers may also specify a baseline of performance needed before a new method can be added to a tool. According to Quinlan et al. (2009), Criterion's developers require precision of .80 or above in testing. Actual performance in operational systems may vary considerably, however, since individual error-correction methods perform differently when combined with other methods and when applied to different types of text than those with which they were trained or tested. Classroom-based studies of Criterion have found precision rates as low as .51 for Extra comma errors (Ranalli et al., 2017) and .43 for Wrong article errors (Lavolette et al., 2015).
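As a minimal illustration of how these two measures are computed from annotated flaggings, consider the following sketch. The counts are invented and chosen only to reproduce the example rates mentioned above; they are not data from any of the systems or studies discussed.

```python
# Invented counts for one error type (e.g., missing articles); not real data.
flagged_by_tool = 120        # items the tool flagged as missing-article errors
flagged_and_correct = 88     # flaggings confirmed as real errors by annotators
attested_in_corpus = 250     # missing-article errors attested by annotators
attested_and_flagged = 88    # attested errors that the tool actually flagged

precision = flagged_and_correct / flagged_by_tool    # penalizes false positives
recall = attested_and_flagged / attested_in_corpus   # penalizes false negatives

print(f"precision = {precision:.2f}")  # 0.73 -> 73% of flaggings are true errors
print(f"recall    = {recall:.2f}")     # 0.35 -> 35% of attested errors are caught
```

The precision-recall trade-off noted above arises because a system can raise precision by flagging only its most confident cases, which leaves more attested errors unflagged and therefore lowers recall.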
For what currently may be the two most widely used error-correction systems,2 performance data are hard to come by; peer-reviewed evaluations for either Grammarly or MS-NLP could not be located for this study, with the exception of an analysis of MS Word's spell checker. As part of an investigation of spelling errors as predictors of L2 proficiency, Bestgen and Granger (2011) spell-checked a corpus of L2 English student writing using MS Word 2007 and found precision and recall rates of .80 and .82, respectively; a total correction rate was not reported. For comparison's sake, Rimrott and Heift (2008) found recall of .94 but correction of only .62 for the German-language version of MS-NLP analyzing the writing of L2 learners of German. Noting that most of the errors in their corpus were multiple-edit misspellings, Rimrott and Heift concluded that there is "a need to design spell checkers that specifically target L2 misspellings" (2008, p. 86).

The Timing of Written Corrective Feedback

Cognitive theory provides a basis for consideration of feedback-timing issues as they relate to both L2 development and L2 writing. In a review connecting the pedagogical practice of focus on form to cognitive models, Doughty (2001) situated an important putative trigger of acquisitional processes—the cognitive comparison of student-produced and target-like forms—in working memory, postulating a 40-second window for optimal focus on form based on how long the forms to be compared can be held in short-term memory. Similarly, Long (2007) underscored the importance of providing CF within the time span in which learners are using linguistic forms to convey meaning. Recent empirical studies that adopted Doughty's view (Arroyo & Yilmaz, 2018) and Long's view (Shintani & Aubrey, 2016) both showed students who received immediate written CF outperforming those in delayed CF conditions on tests of accurate use of the target feature. However, Doughty's and Long's claims were made with reference to oral CF research, which has tended to operationalize CF timing as immediate if provided during a task and delayed if provided after a task. Therefore, this work has limited potential to inform thinking about the optimal timing of AWCF during writing so as to support L2 development.

With regard to the effects of feedback timing on writing processes, a key theme in cognitive models of writing is that working-memory resources are limited; conflicting demands on these resources can therefore prevent writers from accomplishing their goals effectively (Galbraith & Vedder, 2019). Two key processes for our purposes here are translation and transcription. Translation is the process whereby proposed ideas in non-linguistic form are converted into linguistic strings, and transcription is the process whereby linguistic strings are converted into text (Chenoweth & Hayes, 2001, 2003). Translation and transcription are dependent on verbal working memory in particular—the subvocal, articulatory rehearsal process that can counteract the otherwise rapid decay of verbal information in short-term memory, and which writers experience as speaking to themselves while composing (Chenoweth & Hayes, 2003). An interesting and important feature of the output of these processes is that it takes the form not of complete sentences but rather sentence parts, which are generated rapidly and usually terminated by a pause, so Chenoweth and Hayes (2001, 2003) coined the term p-burst, or pause burst.
Research has shown that p-bursts are shorter when writers work in an L2 versus an L1, and when produced by students with less linguistic experience in the L2 (Chenoweth & Hayes, 2001).

Other research demonstrates the potential for unrelated verbal information to interfere with translation and transcription. Chenoweth and Hayes (2003) showed that articulatory suppression (i.e., having participants repeat a syllable to themselves while composing sentences) slowed sentence production and reduced p-burst length by 40%. Ransdell et al. (2002) found that when participants had to write while attending to background speech, fluency, sentence length, and writing quality deteriorated. In Van Waes et al. (2010), participants were given a written sentence stem to complete using an auditory prompt while also identifying and correcting an error in the sentence stem. Given the choice of which task to perform first, the participants chose to complete the sentence before performing the correction in about 90% of cases, suggesting a felt need to avoid losing the proposed text in verbal working memory by suppressing revision.

Applying these ideas to the realm of synchronous AWCF, we propose that flaggings will represent potential distractions to the extent they interfere with working memory during translation and transcription. Thus, for synchronous AWCF to avoid such interference, it should address only errors that relate to the current contents of verbal working memory. We propose two indices for measuring the potential for synchronous AWCF to create distractions: error-flagging delay and p-burst duration. Error-flagging delay is the temporal difference between the commission of an error and its flagging by an automated tool. P-burst duration is based on the p-burst, the unit of text production that represents the current contents of verbal working memory. In earlier studies (e.g., Chenoweth & Hayes, 2001), p-bursts were measured in number of words, but now, with keystroke logging software such as Inputlog (Leijten & Van Waes, 2013), p-burst duration can also be measured in milliseconds, facilitating comparison with error-flagging delay. If the delay is short enough that a flagging addresses an error encompassed in the current p-burst, we assume the feedback can be incorporated into translation and transcription at comparatively little cost to attentional resources. On the other hand, if error-flagging delay exceeds p-burst duration, the flagging may draw the writers' attention to text produced in a previous p-burst, thereby placing conflicting demands on verbal working memory and thus constituting a potential distraction.

The Present Investigation

We decided to investigate Grammarly's timing issues and its error-correction performance by comparing it to MS-NLP because this legacy GEC system provided useful benchmarks in two ways. First, as a system designed to correct L1 writing errors, it allowed a measure of how far automated error-correction techniques have come in addressing the unique needs of L2 learners. Second, it facilitated a comparison of Grammarly's feedback timing to another synchronous CF system whose operation is both familiar and generally perceived as unproblematic for users. We thus formulated the following research questions.

1. How do Grammarly and MS-NLP compare in terms of their capacity to address error types common to L2 student writers?
2. How do Grammarly and MS-NLP compare in terms of the timing of error flagging?
The research questions and the nature of the technologies involved required us to conduct two separate studies. These were part of a larger project that also investigated the effects of synchronous AWCF on revision behavior and text quality as well as the cognitive, behavioral, and affective dimensions of L2 students' engagement with AWCF.

Study 1

Study 1 addressed the first research question regarding the extent to which Grammarly and MS-NLP are able to address errors common to L2 learners. The study was based on a corpus of 68 essays written by incoming international students at a large Midwestern research university as part of an English placement test. The essay was one of two integrated reading-writing tasks. Working in a computer testing center, the students were presented with two short readings on the topic of violent video games. In the longer of the two tasks, students had 30 minutes to write an essay in which they argued for or against banning violent video games, using ideas from the readings and their own views on the topic. The essays were written in the Canvas learning management system in a text-entry tool with no automated checking enabled. The total number of words in the corpus was 20,187. Average text length was 296.87 words (SD = 74.63).

The sample consisted of 25 females and 42 males whose most common L1s were Chinese (n = 21), Arabic (n = 10), Korean (n = 6), and Portuguese (n = 4); the 21 other L1s included Bengali, Kikuyu, Kurdish, and Vietnamese (gender and L1 data were not available for one test-taker). Fourteen test-takers were graduate students, and the rest were undergraduates. The placement test is given to students who score below 100 on the TOEFL iBT but whose scores meet the minimum threshold for admission to the university (72 for undergraduates and 78 for graduates).

To obtain the AWCF for analysis, we opened each text in MS Word and submitted it for checking to both Grammarly Premium (the paid version of Grammarly, which includes all checks and other features) and MS-NLP. In the former case, checking was achieved using Grammarly's plug-in for MS Word, which, when activated, automatically turns off MS-NLP. All flaggings produced by each tool were screen-recorded using the software TechSmith Morae. In both tools, checks in the style category, and in Grammarly, checks in the vocabulary enhancement and plagiarism categories, were deactivated because the nature of the advice in these categories makes it difficult to characterize flaggings as accurate or inaccurate; for example, blanket recommendations against use of the passive voice or so-called overused words such as "obviously."

Using the annotation functionality of TechSmith Morae, we coded each flagging first for error type using the labeling provided by the tools themselves. In cases where MS-NLP provided no error-type label, we used labels for similar error types observed in Grammarly so as to facilitate comparison; these included Misspelled word, Capitalization, "A" vs. "an," and Verb form. To analyze precision, we coded for two dimensions of accuracy: accuracy of flagging (i.e., does the flagged feature indeed represent an error of the specified type?) and accuracy of correction (i.e., is the suggested correction appropriate for the specific context?).3 The following codes were used for both accuracy dimensions: accurate, inaccurate, and indeterminate, with the latter used in cases where the writer's intention was indiscernible.
For accuracy of correction, an additional code, unknown, was used for cases where the tool provided no specific suggestion. For example, in Figure 1 below, MS-NLP's flagging was coded as inaccurate because the highlighted feature does not represent a fragment (although it does contain errors involving verb form and tense), and its correction is coded as unknown because of the generic recommendation to "consider revising." Figure 2 shows a case where Grammarly's flagging was coded as accurate, but the correction was coded as inaccurate because the suggested verb "equip" is not syntactically appropriate in the given context.

The first author and a trained research assistant independently coded all 2,310 error flaggings (1,515 for Grammarly and 795 for MS-NLP) with 91.7% agreement for accuracy of flagging and 89.4% for accuracy of correction. We also calculated interrater reliability (Cohen's kappa) at .674 for accuracy of flagging and .695 for accuracy of correction, both of which represent substantial agreement (Landis & Koch, 1977). A generic computational sketch of this statistic is given below. Discrepancies were resolved through discussion so that final agreement on both accuracy dimensions was 100%.

After exporting the coded data from TechSmith Morae, we calculated precision by first removing 118 items that had been coded indeterminate for accuracy of flagging,4 leaving 1,412 Grammarly flaggings and 780 MS-NLP flaggings, and then dividing the number of flaggings and corrections coded as accurate by the total number of flaggings and corrections.

Figure 1
MS-NLP Flagging Coded as Inaccurate and Suggestion Coded as Unknown

Figure 2
Grammarly Flagging Coded as Accurate and Suggestion Coded as Inaccurate

We also measured recall of spelling, preposition, article, and subject-verb agreement errors.5 For each of these analyses, the first author worked with the second author or a native-speaking research assistant to independently review each of the 68 texts manually to identify all errors in each category using a set of annotation guidelines and to propose at least one acceptable correction for each error. Initial agreement for flagging and correction was 88.2% and 88% respectively for spelling errors, 68.9% and 62.2% respectively for preposition errors, 87.2% and 86.9% respectively for article-related errors, and 87.5% for both flagging and correction in the case of subject-verb agreement errors. Discrepancies were resolved through discussion so that final agreement for both flagging and correction of all four error types was 100%. The final counts were 556 attested spelling errors, 473 attested article errors, 271 attested preposition errors, and 138 attested subject-verb agreement errors. The first author then used the screen-capture annotation data and an MS Excel spreadsheet to record the extent to which the two tools had flagged each attested error and provided at least one of the corrections deemed acceptable. Recall was then calculated by dividing the number of attested errors flagged by each system by the total number of attested errors, and correction was calculated in a similar manner.

Results and Discussion

Overall, Grammarly flagged 1,412 items comprising 114 error types (see the complete list in Appendix A), with precision and correction rates of .88 and .83 respectively. MS-NLP flagged 780 items comprising 22 error types (Appendix B), with overall precision and correction rates of .92 and .81 respectively.
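For readers unfamiliar with Cohen's kappa, the interrater-reliability statistic used in the coding procedure above, the following generic sketch shows how chance-corrected agreement between two coders can be computed. The short label lists are invented for illustration; they are not the study's coding data.

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Chance-corrected agreement between two coders labelling the same items."""
    n = len(coder1)
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Agreement expected if each coder labelled items independently at
    # their own marginal rates.
    c1, c2 = Counter(coder1), Counter(coder2)
    expected = sum((c1[label] / n) * (c2[label] / n)
                   for label in set(coder1) | set(coder2))
    return (observed - expected) / (1 - expected)

# Invented accuracy-of-flagging codes from two hypothetical coders.
coder_a = ["accurate", "accurate", "inaccurate", "accurate", "indeterminate", "accurate"]
coder_b = ["accurate", "inaccurate", "inaccurate", "accurate", "indeterminate", "accurate"]
print(round(cohens_kappa(coder_a, coder_b), 3))  # lower than the raw 5/6 agreement
```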
The large discrepancy between the two tools in the number of error types is partly accounted for by Grammarly's identification of a wider variety of error types but also by the way it differentiates among variants of some types; for example, in addition to a generic Incorrect verb form, nine subtypes were specified, including Incorrect verb form after modal and Incorrect verb form after "Do" or "does." MS-NLP was observed to do this only with respect to errors involving missing or extra spaces, which were divided into Space between words, Space before punctuation, and Space after punctuation.

To facilitate comparisons with the top-ranked L2 problem areas in the Cambridge Learner Corpus (Table 1 above), we decided to modify two CLC categories, Inflectional morphology and Derivational morphology, because such errors were spread among a number of error types that also included other unrelated errors, particularly in the case of Grammarly. In their place, we substituted a different category, Verb form, because this captured morphological errors found by both tools and because verb-form errors have been found to be common in other analyses of errors in L2 English writing (e.g., Chan, 2010). Both tools' coverage of the resulting nine L2 problem areas is reported in Table 2, along with precision rates, correction rates, and the number of error types corresponding to each L2 problem area (see Appendices A and B for complete information about these correspondences).

Results showed Grammarly generating more than 10 times as many flaggings related to the most common L2 problem areas (856) as MS-NLP did (81). Over 60% of Grammarly's total flaggings addressed these L2 problem areas compared to just over 10% of MS-NLP's total flaggings. Overall precision rates were higher for MS-NLP than Grammarly—.88 versus .84, respectively, although the larger and more diverse set of items flagged by Grammarly must be taken into account—while the two tools performed similarly in correction, .81 for Grammarly versus .79 for MS-NLP. No flaggings were recorded for either tool in the category Wrong verb tense.6

Among Grammarly's flaggings, determiner-related flaggings were the most frequent, accounting for one in every five flaggings. All nine error types included in this category addressed articles in particular. Collectively, these demonstrated the lowest precision among Grammarly's flaggings (.80), which is attributable to the tool's assumption that a noun phrase not marked as singular (by means of an article) or plural (by adding -s) was intended to be singular in form despite it being clear from the context that the plural form was called for (e.g., "These situations happen because *adolescent failed to recognize difference between video game and their real life." > the adolescent).7 Regarding preposition errors, Grammarly flagged 79 items across five error types, with both precision and correction rates of .94.

For its part, MS-NLP displayed the highest number of flaggings in the Agreement error category. Whereas all of its 61 flaggings constituted subject-verb agreement errors, Grammarly's 160 flaggings included both subject-verb and determiner-noun agreement errors. MS-NLP flagged only eight determiner-related items, all involving the use of "A" versus "an", and two preposition-related items, both involving the need to combine nominal elements following "between" with the word "and" (e.g., "... kids the ages *between 12-14 > between 12 and 14").
These findings highlight the limitations of MS-NLP's rules-based approach insofar as they show that most article and preposition errors are not detectable via parsing and pattern-matching techniques.

Results of the recall analyses are reported in Table 3. For the two types of errors addressed by both tools, recall of spelling errors was where Grammarly and MS-NLP came closest to parity, .86 and .74, respectively. Regarding subject-verb agreement errors, Grammarly identified and corrected about two-thirds of those in the corpus while MS-NLP identified and corrected slightly less than one-third. For the two error types that required context-analytic techniques to detect (and for which MS-NLP therefore provides no feedback), Grammarly identified one-third of the preposition errors in the corpus and supplied the appropriate correction in slightly less than a third of cases. Grammarly's recall for article-related errors was higher at .47, with a correction rate of .45. Edit-distance data for spelling errors (Table 4) showed Grammarly finding 10% more multiple-edit spelling errors and providing appropriate corrections for such errors in 25% more cases.

Table 2
Performance of Grammarly and MS-NLP in Addressing Common L2 Problem Areas

Grammarly:
L2 Problem Area | Flaggings | % of Total Flaggings | Error Types | Precision | Correction
Agreement Error | 160 | 11.33% | 22 | 0.84 | 0.79
Comma Error | 141 | 9.99% | 12 | 0.89 | 0.88
Content Word Choice Error | 60 | 4.25% | 5 | 0.83 | 0.80
Determiner Error | 289 | 20.47% | 9 | 0.80 | 0.78
Preposition Error | 79 | 5.59% | 5 | 0.94 | 0.94
Pronoun Error | 30 | 2.12% | 6 | 0.90 | 0.83
Run-on Sentence | 2 | 0.14% | 1 | 1.00 | 1.00
Verb-form Error | 95 | 6.73% | 19 | 0.84 | 0.73
Wrong Verb Tense | - | - | - | - | -
Total | 856 | 60.62% | 79 | 0.84 | 0.81

MS-NLP:
L2 Problem Area | Flaggings | % of Total Flaggings | Error Types | Precision | Correction
Agreement Error | 61 | 7.80% | 1 | 0.87 | 0.82
Comma Error | 1 | 0.10% | 1 | 0.00 | 0.00
Content Word Choice Error | 2 | 0.30% | 1 | 1.00 | 1.00
Determiner Error | 8 | 1.03% | 2 | 0.88 | 0.75
Preposition Error | 2 | 0.30% | 1 | 1.00 | 0.00
Pronoun Error | 1 | 0.10% | 1 | 1.00 | 1.00
Run-on Sentence | - | - | - | - | -
Verb-form Error | 6 | 0.80% | 1 | 1.00 | 0.83
Wrong Verb Tense | - | - | - | - | -
Total | 81 | 10.43% | 8 | 0.88 | 0.79

Table 3
Comparison of Recall and Correction Rates for Selected Error Types

Error Type | n | Grammarly Recall | Grammarly Correction | MS-NLP Recall | MS-NLP Correction
Spelling | 556 | 0.86 | 0.84 | 0.74 | 0.67
Article | 437 | 0.47 | 0.45 | - | -
Preposition | 271 | 0.33 | 0.31 | - | -
Subject-Verb Agreement | 138 | 0.67 | 0.65 | 0.35 | 0.34

Table 4
Recall and Correction Rates for Spelling Errors According to Edit Distance

Edit Distance | Number of Errors | Grammarly Recall | Grammarly Correction | MS-NLP Recall | MS-NLP Correction
Single-edit | 431 | 0.93 | 0.92 | 0.80 | 0.78
Multiple-edit | 125 | 0.68 | 0.58 | 0.58 | 0.33

Having established that Grammarly is able to find and correct more of the common L2 error types than MS-NLP, and to correct them with greater accuracy, we move on to the consequences of Grammarly's enhanced error-correction capacities regarding feedback timing, which was the focus of the second study.

Study 2

Study 2 addressed the second research question regarding how Grammarly and MS-NLP differ in terms of the timing of error flagging. We recruited 20 students from an ESL writing course for undergraduates at the same university as Study 1. The group consisted of 10 females and 10 males whose average age was 19.6 years (SD = 1.31). Ten of the students spoke Chinese as their first language; other L1s included Malay, Pali, Japanese, Indonesian, and Brazilian Portuguese. Because they had been placed into the ESL writing program by the placement test described in Study 1, their TOEFL iBT scores would have been between 72 and 100.
Each participant wrote two short essays based on different prompts, both addressing educational themes (What makes a successful student? and Why do people attend college or university?). The time limit for each essay task was 30 minutes. Average text length was 299.8 words (SD = 72.95). The 40 essays together comprised 11,733 words.

Participants composed their essays individually in the research team's office on a desktop computer connected to the university's network via an Ethernet cable to ensure the fastest possible internet connection. One essay was written with MS-NLP operating synchronously and the other with the Grammarly plug-in for MS Word operating synchronously. Because our goal in Study 2 was to compare typical feedback timing across the two tools, we accepted each system's default checks for synchronous operation on the assumption that most L2 students do not modify these settings. For MS-NLP, this meant only spelling and "format consistency" errors (e.g., missing or extra spaces) would be identified; in the version of Grammarly Premium available at the time, this meant only the spelling, punctuation, grammar, and sentence structure checks were activated. Tasks and tools were counterbalanced to avoid ordering effects.

TechSmith Morae was used for recording, coding, and measuring on-screen activity. Inputlog was used to record and analyze participants' keystroke timing. To measure differences in the timing of error flagging across tools, we first categorized each flagging according to whether it occurred at the point of inscription (i.e., where the writer was currently typing) or at a point earlier in the text, and then tallied up all instances of each timing category. Next, to estimate an average time lag between error production and flagging, we focused on cases where flagging occurred earlier in the text; the timeline markers in TechSmith Morae were used to measure the elapsed time between (a) completion of the text that would be flagged and (b) the flagging itself. These measures were then used to calculate the average feedback delay for each participant. Finally, to relate the timing of feedback delivery to the timing of p-bursts, we used the automated analysis tools in Inputlog to collect average p-burst lengths for each participant based on a pause threshold of two seconds between two typing events.8 Two seconds is a standard benchmark (Chanquoy et al., 1996; Spelman Miller, 2000; Sullivan & Lindgren, 2006) for separating pauses indicative of cognitive activity, such as planning or evaluating, from pauses indicating transitions between keystrokes, which are motoric in nature and generally average less than one second among even the slowest writers (Sullivan & Lindgren, 2006). A rough computational sketch of how such p-burst durations can be derived is given below.

Results and Discussion

In total, Grammarly flagged 535 items as errors compared to MS-NLP's 536 flaggings. Frequencies for the two temporal location categories (Table 5) showed that only about one in five of Grammarly's flaggings occurred at the point of inscription, with the rest (78.9%) occurring earlier in the text. By contrast, nearly nine out of ten of MS-NLP's flaggings occurred at the point of inscription, with only 11.4% occurring earlier in the text. This can be attributed to the fact that MS-NLP's default checks for spelling and format consistency relied on simple pattern matching routines that could be performed nearly instantaneously.
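Before turning to the results in Table 5, here is a rough sketch of how p-burst durations can be derived from keystroke timestamps using the two-second pause threshold described above. It is not Inputlog's implementation or data format; the timestamps are invented for illustration, and real analyses would work from Inputlog's logging output.

```python
PAUSE_THRESHOLD = 2.0  # seconds; a pause this long (or longer) ends a p-burst

def pburst_durations(keystroke_times, threshold=PAUSE_THRESHOLD):
    """Split a sequence of keystroke timestamps (in seconds) into p-bursts and
    return each burst's duration, from its first to its last keystroke."""
    durations, start, prev = [], keystroke_times[0], keystroke_times[0]
    for t in keystroke_times[1:]:
        if t - prev >= threshold:        # pause long enough to close the burst
            durations.append(prev - start)
            start = t
        prev = t
    durations.append(prev - start)
    return durations

# Invented timestamps: three bursts separated by pauses of 3.0 s and 2.5 s.
times = [0.0, 0.4, 0.9, 1.3, 4.3, 4.8, 5.5, 6.1, 8.6, 9.0, 9.7]
bursts = [round(d, 2) for d in pburst_durations(times)]
print(bursts, round(sum(bursts) / len(bursts), 2))  # per-burst durations and mean
```

In the logic of Study 2, a flagging whose delay exceeds durations like these arrives after the p-burst that produced the error has ended, and so addresses material that is no longer in verbal working memory.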
Table 5
Frequency of Flaggings by Temporal Location

Temporal Location | Grammarly | MS-NLP
Point of inscription | 113 (21.1%) | 475 (88.6%)
Earlier in the text | 422 (78.9%) | 61 (11.4%)
Total | 535 | 536

Based on these findings, we decided to restrict our analysis of feedback timing relative to p-burst duration to the Grammarly data only. Average p-burst duration for the group while using Grammarly was 11.63 seconds (SD = 4.38), while the average feedback delay experienced by the participants while using Grammarly was 17.45 seconds (SD = 35.52). Boxplots of these data (Figure 3) show the comparatively greater variance in feedback delay, which is probably attributable to the considerable variation in terms of when writers would type sentence-final punctuation (a variable, as noted above, that can delay initialization of some of Grammarly's checks).

To test for significance, we used a non-parametric technique because of the unequal variances. A Wilcoxon Signed Ranks test comparing the two distributions showed that the feedback-delay ranks (median = 14.45 seconds) were statistically higher than the p-burst ranks (median = 10.7 seconds), Z = -2.80, p = .005, r = -.442 (a medium to large effect according to the scale in Fritz et al., 2012). A generic sketch of this type of comparison is given below. This means that when Grammarly was operating synchronously, new flaggings usually appeared in segments of previously written text as opposed to the segment being transcribed contemporaneously.

Figure 3
Average P-burst Length and Error-flagging Delay in the Grammarly Condition (N = 20)

General Discussion

To briefly summarize the results, Grammarly outperformed MS-NLP by flagging a much larger and more diverse set of error types representing both common problems for L2 writers of English as well as complex computational challenges. Despite the differences in the frequency and range of error types detected, Grammarly's precision was only slightly lower than MS-NLP's and its correction rate higher. For comparison's sake, Grammarly's overall precision rate and individual precision rates for most error types (Appendix A) met the 80% benchmark used by developers of Criterion (Quinlan et al., 2009). In terms of recall, the two tools approached parity with respect to spelling errors, but edit distance data showed Grammarly better addressing the more complex spelling errors characteristic of L2 learners. Furthermore, Grammarly flagged and corrected nearly twice the number of subject-verb agreement errors as MS-NLP. Grammarly also flagged and corrected about half of all attested article errors and a third of all preposition errors, both common L2 error types that MS-NLP could not address in any substantive way.

Evidently, progress has been made regarding the call in Rimrott and Heift (2008) for spell checkers that better attend to the needs of L2 writers. Progress may also be reflected in the recall and correction rates for article and preposition errors found here insofar as they exceeded those reported for the experimental systems in both Han et al. (2006) and Tetreault and Chodorow (2008).

Regarding feedback timing, the average delay in Grammarly's flaggings exceeded the average p-burst length of the L2 student writers in our sample, suggesting that these writers were generally receiving feedback from Grammarly about linguistic units that no longer represented the current contents of verbal working memory, and which thus constituted potential distractions.
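As a generic illustration of the statistical comparison reported above, the following sketch runs a Wilcoxon signed-ranks test on paired per-participant means and derives the effect size as r = Z / sqrt(N). The values are invented placeholders (and fewer than the study's N = 20), not the study's data, and Z is recovered from the p-value as an approximation.

```python
import math
import numpy as np
from scipy import stats

# Invented per-participant means in seconds (placeholder values, n = 10 here).
pburst = np.array([ 9.8, 11.2, 13.5, 10.1, 12.7,  8.9, 14.2, 10.6, 11.9, 12.3])
delay  = np.array([15.2, 13.9, 22.4, 11.0, 19.8, 16.5, 25.1, 12.2, 18.7, 20.3])

# Wilcoxon signed-ranks test on the paired differences (non-parametric).
result = stats.wilcoxon(delay, pburst)

# Effect size r = Z / sqrt(N); the magnitude of Z is recovered from the
# two-sided p-value via the normal quantile function (an approximation).
n = len(pburst)
z = stats.norm.isf(result.pvalue / 2)
r = z / math.sqrt(n)
print(f"W = {result.statistic}, p = {result.pvalue:.3f}, r = {r:.2f}")
```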
In contrast to Grammarly, the majority of MS-NLP's flaggings occurred contemporaneously with the completion of the word that was flagged, which meant this feedback likely coincided with the concurrent contents of verbal working memory.

Implications

L2 student writers stand to benefit from the enhanced error-correction capabilities of AWCF tools. However, research has shown that such writers already prioritize sentence-level grammatical concerns over higher-level issues in evaluating and revising their work (Barkaoui, 2007). Regardless of feedback delays, there may be a risk of reinforcing students' low-level focus by facilitating continuous access to CF without considering the nature of the task at hand. At the same time, there may be writing tasks for which slightly delayed synchronous AWCF proves not only harmless to task completion (e.g., a simple email message) but supportive of learning (e.g., an L2 practice task addressing definite/indefinite/zero articles), particularly when one considers the 40-second cognitive window for ideal focus on form proposed by Doughty (2001), within which Grammarly's synchronous feedback would generally appear. In this regard, Manchón's (2011) contrast between learning to write and writing to learn may be helpful for differentiating among writing-task types in pedagogical contexts. For learning to write tasks, where the focus is on development of writing skills, students can be taught to evaluate how much cognitive demand a task will impose and, on this basis, to decide whether to use a tool like Grammarly synchronously or asynchronously. In addition to avoiding potential distractions, restricting the use of AWCF to appropriate junctures in the writing process may create space for students to consider the feedback more carefully, which can help both with identifying inaccurate flaggings and possibly facilitating L2 learning.

A second implication concerns the development and marketing of AWCF tools. Companies such as Grammarly should make users aware of the potential for problems with synchronous operation. In their promotional materials and user tutorials, they should inform users of the need to consider the impact of constant access to CF, and encourage them to make strategic decisions about their engagement with Grammarly based on these considerations. AWCF tools should also allow users to toggle the program's checks on and off. In the version of Grammarly used in the present study, this was easily accomplished using buttons for each category of check (e.g., grammar, punctuation, style). In the new web interface released to paid subscribers in 2019, it does not appear possible to completely deactivate checking.

The final implication is that the results affirm our view that Grammarly and other tools that (a) use sophisticated, hybrid GEC approaches to target unique problem areas for L2 writers, and (b) operate synchronously across multiple applications, platforms, and devices—the other major commercially available exemplar being Ginger (see Swier, 2016, for a description of the latter)—represent a distinct genre of writing-support technology that must be recognized and understood on its own terms. This means differentiating it from MS-NLP as well as asynchronous AWE tools such as Criterion and MY Access!, which may incorporate similar error-correction technologies but which users interact with differently.
The case for a new genre is strengthened when one considers that the term AWE has also been applied to systems that do not address sentence-level concerns, such as Writing Pal (Roscoe & McNamara, 2013) and the Research Writing Tutor (Cotos, 2015), which target writing strategies and functional discourse, respectively. Thus, we have proposed the term AWCF tool (see also Ranalli, 2018) to try to capture the unique qualities of this new genre. Although the term GEC is already in currency among developers, we note that AWCF connects these tools to existing research in both the SLA and L2 writing domains regarding teacher-provided WCF, with which we see potential synergy. The main features differentiating AWCF tools from AWE tools and MS-NLP are summarized in Table 6.

Table 6
Comparison of Automated Written Corrective Feedback (AWCF) Tools, Automated Writing Evaluation (AWE) Tools, and Microsoft Natural Language Processing (MS-NLP)

Examples
  AWCF tools: Grammarly, Ginger, Grammar Suggestions
  AWE tools: Criterion, MY Access!, Research Writing Tutor, Writing Pal
  MS-NLP: The spelling and grammar checker in MS Word, Outlook, and Office 365

Access
  AWCF tools: Multiple ways (e.g., web apps, browser extensions, productivity software plug-ins, mobile device keyboards)
  AWE tools: Standalone web-based interfaces
  MS-NLP: Office productivity software

Delivery Mode
  AWCF tools: Synchronous and asynchronous
  AWE tools: Asynchronous only
  MS-NLP: Synchronous and asynchronous

Analysis
  AWCF tools: Combinations of complex techniques performed on remote servers
  AWE tools: Combinations of complex techniques performed on remote servers
  MS-NLP: Simpler techniques performed on user's local machine

Focus
  AWCF tools: Lower-level concerns (e.g., spelling, grammar, and punctuation)
  AWE tools: Lower- and higher-level, or only higher-level, concerns (e.g., organization, discourse, writing strategies)
  MS-NLP: Lower-level concerns

Note. Grammar Suggestions is a machine-learning based synchronous AWCF tool that was included in Google Docs starting in 2019.

Limitations and Future Research

The small, relatively homogeneous sample, the use of a single writing task type, and the fact that only four error types were addressed in the recall analysis together mean that caution should be exercised in generalizing the findings regarding both error-correction performance and feedback timing. In addition, we note that Grammarly is continuously being developed and improved; in the two years between the start of the project and the writing of this report, Grammarly's claims about the number of features it could identify increased from 250 to 400 (Grammarly, n.d.).

The most significant limitation of the study, however, is that no perception or learning data were collected. Research is needed to confirm whether discrepancies in feedback timing relative to p-burst duration are indeed perceived as distracting by users and, if so, what individual, contextual, and task conditions influence such perceptions. Verbal reports, eye-tracking, or both could be used in such studies to triangulate keystroke logging data and increase veridicality. Similarly, research should investigate the potential for L2 development as a result of exposure to synchronous AWCF, with writing goal (i.e., writing to learn vs. learning to write) and feedback timing (e.g., point of inscription versus earlier in the text) as independent variables. In addition, studies evaluating other AWCF tools, including comparison studies, are needed to help potential users in deciding which, if any, could suit their needs while also revealing meaningful variation within the genre.
Because these tools are commercial products, the for-profit companies that produce them are less forthcoming with information about their performance (Perelman, 2016), so such studies will provide a valuable public service. Such evaluations could also address error-correction performance regarding collocation errors, which is another area of GEC innovation (Leacock et al., 2014). Finally, research is needed to understand how continuous exposure to synchronous AWCF tools influences L2 students' own self-initiated revisions, given that the ability to identify, diagnose, and correct errors in one's own writing is a vital skill. As mentioned, this study was part of a larger project that also investigated this latter concern.

Conclusion

Much work has gone into the development of automated error-correction technologies for L2 writers of English. These innovations are spreading widely as Grammarly and other similar tools attract new users. Grammarly's capacity to correct frequent and complex L2 written errors based on these technologies comes at a cost in terms of the speed with which it can deliver its feedback, and this has consequences for the way users interact with the tool. This study sought to advance basic understanding of these issues so as to inform applications of Grammarly and similar tools to L2 writing and L2 learning in ways that can support the work of researchers, educators, and students.

Acknowledgements

The authors gratefully acknowledge the diligent work of research assistants Elizabeth Lee, Cay Bappe, and Gabi Mitchell.

Notes

1. In information science, entropy refers to a measure of average uncertainty with respect to a given variable's possible outcomes.

2. In 2016, Microsoft claimed to have 1.2 billion users of its Office suite along with 60 million active users of its cloud-based Office 365 service, which also includes MS Word. At the time of writing, Grammarly claims a daily user base of 20 million.

3. In the case of MS-NLP, which often provides multiple suggestions, feedback was coded as accurate if an appropriate correction appeared anywhere in the list of suggestions.

4. Our rationale for removing items coded as indeterminate for accuracy of flagging was that if human annotators could not identify what form was appropriate for a writer's intended meaning, an automated checker could not reasonably be expected to do so either, so such items should not detract from the tool's precision and recall statistics.

5. A comprehensive recall analysis of all the errors in our corpus was not attempted because such analyses are notoriously time-consuming and difficult, as has been noted in both the GEC (e.g., Leacock et al., 2014) and L2 writing (e.g., Polio, 1997) literatures. The number and type of errors found will depend on how particular errors are corrected, and because many L2 errors can be corrected in multiple ways, a comprehensive analysis would need to produce a correction not only for every error but for every possible interpretation of every error, and intercoder reliability would be low. Instead, we followed the recent practice of GEC developers annotating corpora for specific error types relevant to a given purpose (e.g., Tetreault & Chodorow, 2008). We chose spelling and subject-verb agreement errors because we knew these were detected by both Grammarly and MS-NLP and thus they could facilitate informative comparisons.
Additionally, we chose article and preposition errors because they represent key L2 problem areas that GEC developers have addressed recently through innovative techniques, as discussed in the literature review.

6. In Grammarly, one error-type involving a tense label, Incorrect verb form in perfect tense, ended up being categorized under the L2 problem area Verb-form error. Grammarly generated eight flaggings for two other tense-related error types, Incorrect verb tense (6) and Incorrect use of progressive tense (2), but because all of these flaggings were coded indeterminate for accuracy of flagging, they were not included in our analyses. Grammarly clearly attempts to address verb-tense errors but without much success with respect to our corpus.

7. Grammarly could take a cue here from MS-NLP, which in correcting Subject-Verb Agreement errors supplies both a singular and plural option (e.g., "If *a person use more time … > a person uses | people use").

8. A second analysis was conducted using an individual pause threshold defined for each participant using inter-keystroke interval (IKI) measures from a typing task that had been administered as part of the research protocol. This pause threshold was determined by multiplying each individual's average IKI by three, as recommended in Wengelin (2006). These individual pause thresholds resulted in an average p-burst duration of 3.11 seconds (SD = 1.15) for the sample (N = 20). We decided to report the analysis incorporating the larger p-burst duration in order to maintain comparability with previous research and to err on the side of conservatism. Needless to say, adoption of the individualized pause threshold would not only support but bolster our interpretations in Study 2.

References

Arroyo, D. C., & Yilmaz, Y. (2018). An open for replication study: The role of feedback timing in synchronous computer-mediated communication. Language Learning, 68(4), 942–972. https://doi.org/10.1111/lang.12300

Barkaoui, K. (2007). Revision in second language writing: What teachers need to know. TESL Canada Journal, 25(1), 81–92. https://doi.org/10.18806/tesl.v25i1.109

Bestgen, Y., & Granger, S. (2011). Categorising spelling errors to assess L2 writing. International Journal of Continuing Engineering Education and Life-Long Learning, 21(2–3), 235–252.

Bishop, T. (2005, March 27). A Word to the unwise—program's grammar check isn't so smart. Seattle Post-Intelligencer. https://www.seattlepi.com/business/article/A-Word-to-the-unwise-program-s-grammar-check-1169572.php

Chan, A. Y. W. (2010). Toward a taxonomy of written errors: Investigation into the written errors of Hong Kong Cantonese ESL learners. TESOL Quarterly, 44(2), 295–319. https://doi.org/10.5054/tq.2010.219941

Chanquoy, L., Foulin, J.-N., & Fayol, M. (1996). Writing in adults: A real-time approach. In G. Rijlaarsdam, H. van den Bergh, & M. Couzijn (Eds.), Theories, models and methodology in writing research (pp. 36–43). Amsterdam University Press.

Chen, C.-F. E., & Cheng, W.-Y. E. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2), 94–112. https://www.lltjournal.org/item/2631

Chenoweth, N. A., & Hayes, J. R. (2001). Fluency in writing: Generating text in L1 and L2. Written Communication, 18(1), 80–98. https://doi.org/10.1177/0741088301018001004

Chenoweth, N. A., & Hayes, J. R. (2003). The inner voice in writing.
Written Communication, 20(1), 99–118. https://doi.org/10.1177/0741088303253572

Chodorow, M., Gamon, M., & Tetreault, J. (2010). The utility of article and preposition error correction systems for English language learners: Feedback and assessment. Language Testing, 27(3), 419–436. https://doi.org/10.1177/0265532210364391

Connors, R. J., & Lunsford, A. A. (1988). Frequency of formal errors in current college writing, or Ma and Pa Kettle do research. College Composition and Communication, 39(4), 395–409. https://doi.org/10.2307/357695

Cotos, E. (2015). Automated Writing Analysis for writing pedagogy: From healthy tension to tangible prospects. Writing and Pedagogy, 7(2–3), 197–231. https://core.ac.uk/download/pdf/38937327.pdf

Dikli, S. (2010). The nature of automated essay scoring feedback. CALICO Journal, 28(1), 99–134.

Doughty, C. (2001). Cognitive underpinnings of focus on form. In P. Robinson (Ed.), Cognition and second language acquisition (pp. 206–257). Cambridge University Press.

Flor, M., & Futagi, Y. (2012). On using context for automatic correction of non-word misspellings in student essays [Paper presentation]. The Seventh Workshop on Building Educational Applications Using NLP, Montréal, Canada. https://aclanthology.org/W12-2012.pdf

Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2–18. https://doi.org/10.1037/a0024338

Galbraith, D., & Vedder, I. (2019). Methodological advances in investigating L2 writing processes: Challenges and perspectives. Studies in Second Language Acquisition, 41(3), 633–645. https://doi.org/10.1017/S0272263119000366

Gamon, M., Leacock, C., Brockett, C., Dolan, W. B., Gao, J., Belenko, D., & Klementiev, A. (2009). Using statistical techniques and web search to correct ESL errors. CALICO Journal, 26(3), 491–511.

Grammarly. (n.d.). What is Grammarly Premium, and how is it different from the free version? Grammarly Support. https://support.grammarly.com/hc/en-us/articles/115000090812-What-is-Grammarly-Premium-and-how-is-it-different-from-the-free-version-

Han, N.-R., Chodorow, M., & Leacock, C. (2006). Detecting errors in English article usage by non-native speakers. Natural Language Engineering, 12(2), 115–129. https://doi.org/10.1017/S1351324906004190

Heift, T., & Schulze, M. (2007). Errors and intelligence in computer-assisted language learning: Parsers and pedagogues. Routledge.

Kies, D. (2008). Evaluating grammar checkers: A comparative ten-year study [Paper presentation]. The 6th International Conference on Education and Information Systems, Technologies and Applications: EISTA, Orlando, Florida, USA.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310

Lavolette, E., Polio, C., & Kahng, J. (2015). The accuracy of computer-assisted feedback and students' responses to it. Language Learning & Technology, 19(2), 50–68. https://www.lltjournal.org/item/2903

Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2014). Automated grammatical error detection for language learners, second edition. Synthesis Lectures on Human Language Technologies, 7(1), 1–154. https://doi.org/10.2200/S00562ED1V01Y201401HLT025

Leijten, M., & Van Waes, L. (2013). Keystroke logging in writing research: Using Inputlog to analyze and visualize writing processes. Written Communication, 30(3), 358–392.
Long, M. H. (2007). Problems in SLA. Lawrence Erlbaum.

Manchón, R. M. (Ed.). (2011). Learning-to-write and writing-to-learn in an additional language. John Benjamins.

Nagata, R., & Nakatani, K. (2010). Evaluating performance of grammatical error detection to maximize learning effect [Paper presentation]. Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China. https://aclanthology.org/C10-2103/

Nagata, R., Takamura, H., & Neubig, G. (2017). Adaptive spelling error correction models for learner English. Procedia Computer Science, 112, 474–483. https://doi.org/10.1016/j.procs.2017.08.065

Napoles, C., Nădejde, M., & Tetreault, J. (2019). Enabling robust grammatical error correction in new domains: Data sets, metrics, and analyses. Transactions of the Association for Computational Linguistics, 7, 551–566. https://doi.org/10.1162/tacl_a_00282

Perelman, L. (2016). Grammar checkers do not work. WLN: A Journal of Writing Center Scholarship, 40(7–8), 11–20.

Quinlan, T., Higgins, D., & Wolff, S. (2009). Evaluating the construct coverage of the e-rater scoring engine. Educational Testing Service, 2009(1), i–35. https://doi.org/10.1002/j.2333-8504.2009.tb02158.x

Ranalli, J. (2018). Automated written corrective feedback: How well can students make use of it? Computer Assisted Language Learning, 31(7), 653–674. https://doi.org/10.1080/09588221.2018.1428994

Ranalli, J., Link, S., & Chukharev-Hudilainen, E. (2017). Automated writing evaluation for formative assessment of second language writing: Investigating the accuracy and usefulness of feedback as part of argument-based validation. Educational Psychology, 37(1), 8–25. https://doi.org/10.1080/01443410.2015.1136407

Ransdell, S., Levy, C. M., & Kellogg, R. T. (2002). The structure of writing processes as revealed by secondary task demands. L1-Educational Studies in Language and Literature, 2(2), 141–163. https://doi.org/10.1023/A:1020851300668

Rimrott, A., & Heift, T. (2005). Language learners and generic spell checkers in CALL. CALICO Journal, 23(1), 17–48.

Rimrott, A., & Heift, T. (2008). Evaluating automatic detection of misspellings in German. Language Learning & Technology, 12(3), 73–92. https://www.lltjournal.org/item/2642

Roscoe, R. D., & McNamara, D. S. (2013). Writing Pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4), 1010–1025. https://doi.org/10.1037/a0032340

Shintani, N., & Aubrey, S. (2016). The effectiveness of synchronous and asynchronous written corrective feedback on grammatical accuracy in a computer-mediated environment. The Modern Language Journal, 100(1), 296–319. https://doi.org/10.1111/modl.12317

Spelman Miller, K. (2000). Academic writers on-line: Investigating pausing in the production of text. Language Teaching Research, 4(2), 123–148. https://doi.org/10.1177/136216880000400203

Sullivan, K. P. H., & Lindgren, E. (2006). Analysing online revision. In G. Rijlaarsdam (Series Ed.) and K. P. H. Sullivan & E. Lindgren (Vol. Eds.), Computer key-stroke logging and writing: Methods and applications (Studies in Writing, Vol. 18, pp. 157–18). Elsevier. https://doi.org/10.1163/9780080460932_008

Swier, R. (2016). Ginger software suite of writing services & apps. CALICO Journal, 33(2), 282–290.

Tetreault, J. R., & Chodorow, M. (2008). Native judgments of non-native usage: Experiments in preposition error detection [Paper presentation]. Human Judgements in Computational Linguistics, Manchester, UK. https://aclanthology.org/W08-1205.pdf
Van Waes, L., Leijten, M., & Quinlan, T. (2010). Reading during sentence composing and error correction: A multilevel analysis of the influences of task complexity. Reading and Writing, 23(7), 803–834. https://doi.org/10.1007/s11145-009-9190-x

Wengelin, Å. (2006). Examining pauses in writing: Theory, methods, and empirical data. In G. Rijlaarsdam (Series Ed.) and K. P. H. Sullivan & E. Lindgren (Vol. Eds.), Computer key-stroke logging and writing: Methods and applications (Studies in Writing, Vol. 18, pp. 107–130). Elsevier. https://doi.org/10.1163/9780080460932_008

Appendix A

Grammarly Error-flagging Data Including Precision, Correction, and Correspondence with L2 Problem Areas

Error Type | L2 Problem Area | Total Flaggings | Percentage of Total | Precision | Correction
Misspelled word | | 350 | 24.79% | 0.99 | 0.95
Missing article | DET | 137 | 9.70% | 0.61 | 0.58
Incorrect article use | DET | 122 | 8.64% | 0.97 | 0.96
Possibly confused word | CWC* | 101 | 7.15% | 0.85 | 0.81
Missing comma in compound sentence | COM | 50 | 3.54% | 0.88 | 0.88
Confused preposition | PRP | 45 | 3.19% | 0.93 | 0.93
Incorrect verb form | VFE | 38 | 2.69% | 0.79 | 0.55
Incorrect verb form with plural subject | AGR | 32 | 2.27% | 0.88 | 0.88
Incorrect verb form with singular subject | AGR | 32 | 2.27% | 0.72 | 0.63
Sentence fragment | | 26 | 1.84% | 0.54 | 0.12
Missing comma after introductory phrase | COM | 24 | 1.70% | 0.92 | 0.92
Possibly miswritten word | | 21 | 1.49% | 0.86 | 0.81
Comma splice | COM | 19 | 1.35% | 0.79 | 0.79
Incorrect verb form with personal pronoun | AGR | 15 | 1.06% | 0.80 | 0.73
Lowercase pronoun “I” | | 14 | 0.99% | 1.00 | 1.00
Missing preposition | PRP | 14 | 0.99% | 0.93 | 0.93
Missing pronoun | PRN | 14 | 0.99% | 1.00 | 0.86
Incorrect noun form | AGR | 13 | 0.92% | 0.85 | 0.77
Missing hyphen | | 12 | 0.85% | 0.92 | 1.00
Redundant preposition | PRP | 12 | 0.85% | 1.00 | 1.00
Wrong verb form | VFE | 12 | 0.85% | 0.83 | 0.75
Incorrect verb form after modal | VFE | 11 | 0.78% | 1.00 | 1.00
Singular noun after plural quantifier | AGR | 11 | 0.78% | 1.00 | 1.00
“These” with singular noun | AGR | 11 | 0.78% | 1.00 | 1.00
Missing word | | 10 | 0.71% | 0.90 | 0.90
Redundant indefinite article | DET | 9 | 0.64% | 1.00 | 1.00
Unnecessary comma in complex sentence | COM | 9 | 0.64% | 0.89 | 0.89
Incorrect use of comma | COM | 8 | 0.57% | 0.75 | 0.75
Indefinite article with plural noun | AGR | 8 | 0.57% | 1.00 | 1.00
Missing comma after introductory clause | COM | 8 | 0.57% | 1.00 | 1.00
Missing comma(s) with interrupter | COM | 8 | 0.57% | 1.00 | 0.88
Inconsistent spelling | | 7 | 0.50% | 1.00 | 1.00
Missing verb | VFE | 7 | 0.50% | 0.86 | 0.86
Possibly confused “affect” and “effect” | CWC | 7 | 0.50% | 1.00 | 0.86
Wrong article with set expression | DET | 7 | 0.50% | 1.00 | 1.00
Confused pronoun | PRN | 6 | 0.42% | 1.00 | 1.00
Incorrect preposition after adjective | PRP | 6 | 0.42% | 0.83 | 0.83
Incorrect punctuation | | 6 | 0.42% | 1.00 | 1.00
Missing comma in a series | COM | 6 | 0.42% | 1.00 | 1.00
The use of “a” versus “an” | DET | 6 | 0.42% | 0.83 | 0.83
Unknown word | | 6 | 0.42% | 0.67 | 0.00
Unnecessary pronoun | PRN | 6 | 0.42% | 1.00 | 1.00
Incorrect verb form in perfect tense | VFE | 5 | 0.35% | 1.00 | 1.00
Misused determiner | AGR | 5 | 0.35% | 0.80 | 0.40
Redundant use of article | DET | 5 | 0.35% | 1.00 | 1.00
Singular noun with plural number | AGR | 5 | 0.35% | 0.80 | 0.80
Infinitive instead of gerund | VFE | 4 | 0.28% | 0.50 | 0.50
Possibly confused “who” and “whom” | | 4 | 0.28% | 0.75 | 0.75
To-infinitive instead of bare form | VFE | 4 | 0.28% | 1.00 | 1.00
Confused “everytime” and “every time” | | 3 | 0.21% | 1.00 | 1.00
Confused possessive and contraction | | 3 | 0.21% | 0.33 | 0.33
Faulty parallelism | | 3 | 0.21% | 1.00 | 0.67
Improper comma between subject and verb | COM | 3 | 0.21% | 0.67 | 0.67
Incorrect punctuation with abbreviation | | 3 | 0.21% | 1.00 | 1.00
Incorrect verb form with indefinite pronoun | AGR | 3 | 0.21% | 1.00 | 1.00
Missing comma | COM | 3 | 0.21% | 1.00 | 1.00
“Other” with singular noun | AGR | 3 | 0.21% | 0.67 | 0.67
Possibly confused “lets” and “let's” | | 3 | 0.21% | 1.00 | 1.00
Possibly confused “specially” and “especially” | CWC | 3 | 0.21% | 1.00 | 1.00
Singular quantifier with plural noun | AGR | 3 | 0.21% | 1.00 | 1.00
“There” with singular noun | AGR | 3 | 0.21% | 0.67 | 0.67
“This” with plural noun | AGR | 3 | 0.21% | 0.67 | 0.67
“Those” with singular noun | AGR | 3 | 0.21% | 1.00 | 0.67
Unnecessary ellipsis | | 3 | 0.21% | 1.00 | 0.67
Adjective instead of adverb | CWC | 2 | 0.14% | 1.00 | 1.00
Adverb instead of adjective | CWC | 2 | 0.14% | 1.00 | 1.00
Capitalization | | 2 | 0.14% | 1.00 | 1.00
Confused “which” and “who” | PRN | 2 | 0.14% | 0.00 | 0.00
Dangling modifier | | 2 | 0.14% | 0.50 | 0.50
Incorrect comma | COM | 2 | 0.14% | 1.00 | 1.00
Incorrect form for to-infinitive | VFE | 2 | 0.14% | 1.00 | 1.00
Incorrect plural verb with collective noun | AGR | 2 | 0.14% | 0.50 | 0.50
Incorrect punctuation with quotation mark | | 2 | 0.14% | 1.00 | 1.00
Incorrect quantifier | AGR | 2 | 0.14% | 1.00 | 1.00
Incorrect verb | | 2 | 0.14% | 0.00 | 0.00
Incorrect verb form with compound subject | AGR | 2 | 0.14% | 1.00 | 1.00
Incorrect verb form with conditional | VFE | 2 | 0.14% | 1.00 | 1.00
Inversion | | 2 | 0.14% | 1.00 | 0.00
Missing subject | | 2 | 0.14% | 0.50 | 0.00
Possibly confused “other” and “others” | | 2 | 0.14% | 1.00 | 1.00
Redundancy | PRP | 2 | 0.14% | 1.00 | 1.00
Run-on sentence | ROS | 2 | 0.14% | 1.00 | 1.00
Squinting modifier | | 2 | 0.14% | 0.00 | 0.00
Compound instead of comparative | | 1 | 0.07% | 1.00 | 1.00
“Do” with modal verb | VFE | 1 | 0.07% | 1.00 | 1.00
Double comparative | | 1 | 0.07% | 1.00 | 1.00
Gerund instead of to-infinitive | VFE | 1 | 0.07% | 0.00 | 0.00
Incorrect modifier | AGR | 1 | 0.07% | 1.00 | 1.00
Incorrect negative verb form | VFE | 1 | 0.07% | 1.00 | 1.00
Incorrect quantifier with uncountable noun | AGR | 1 | 0.07% | 0.00 | 0.00
Incorrect verb form after “do” or “does” | VFE | 1 | 0.07% | 1.00 | 1.00
Intransitive verb in passive voice | VFE | 1 | 0.07% | 1.00 | 1.00
Missing article before noun | DET | 1 | 0.07% | 1.00 | 1.00
Missing hyphen in a number | | 1 | 0.07% | 1.00 | 1.00
Non-infinitive after “to” | VFE | 1 | 0.07% | 1.00 | 1.00
Past participle without auxiliary verb | VFE | 1 | 0.07% | 0.00 | 0.00
Personal instead of possessive pronoun | PRN | 1 | 0.07% | 0.00 | 0.00
Plural noun with singular verb | AGR | 1 | 0.07% | 1.00 | 1.00
Possibly confused “everyday” and “every day” | | 1 | 0.07% | 1.00 | 1.00
Possibly confused “sometime” and “some time” | | 1 | 0.07% | 1.00 | 1.00
Possibly confused “than” and “then” | | 1 | 0.07% | 0.00 | 0.00
Possibly confused “there” and “their” | | 1 | 0.07% | 1.00 | 1.00
Redundant article | DET | 1 | 0.07% | 1.00 | 1.00
Redundant comma before “not” | COM | 1 | 0.07% | 1.00 | 1.00
Redundant determiner | DET | 1 | 0.07% | 1.00 | 1.00
Redundant reflexive pronoun | PRN | 1 | 0.07% | 1.00 | 1.00
Redundant word | | 1 | 0.07% | 1.00 | 1.00
Repeated word | | 1 | 0.07% | 1.00 | 1.00
Simple or compound adjective | | 1 | 0.07% | 0.00 | 0.00
“That” with plural noun | AGR | 1 | 0.07% | 1.00 | 1.00
“To” after modal verb | VFE | 1 | 0.07% | 1.00 | 1.00
To-infinitive instead of prepositional phrase | VFE | 1 | 0.07% | 1.00 | 0.00
Wrong verb form after “be/get used to” | VFE | 1 | 0.07% | 1.00 | 1.00
Wrong word choice | * | 1 | 0.07% | 0.00 | 0.00
Total | | 1412 | 100% | 0.88 | 0.83

Appendix B

MS-NLP Error-flagging Data Including Precision, Correction, and Correspondence with L2 Problem Areas

Error Type | L2 Problem Area | Total Flaggings | Percentage of Total | Precision | Correction
Misspelled word | | 464 | 59.5% | 0.92 | 0.80
Space after punctuation | | 75 | 9.6% | 1.00 | 0.97
Subject-verb agreement | AGR | 61 | 7.8% | 0.87 | 0.82
Capitalization | | 56 | 7.2% | 0.95 | 0.95
Space between words | | 28 | 3.6% | 1.00 | 0.96
Fragment | | 25 | 3.2% | 0.68 | 0.00
Space before punctuation | | 24 | 3.1% | 1.00 | 1.00
Possessive use | | 9 | 1.2% | 0.89 | 0.89
“A” versus “an” | DET | 7 | 0.9% | 0.86 | 0.71
Verb form | VFE | 6 | 0.8% | 1.00 | 0.83
Order of words | | 5 | 0.6% | 0.80 | 0.40
Commonly confused words | CWC* | 4 | 0.5% | 0.75 | 0.75
Comparative use | | 4 | 0.5% | 1.00 | 1.00
Punctuation | | 3 | 0.4% | 1.00 | 0.33
Use of “between” | PRP | 2 | 0.3% | 1.00 | 0.00
Pronoun use | PRN | 1 | 0.1% | 1.00 | 1.00
Comma use | COM | 1 | 0.1% | 0.00 | 0.00
Conjunction use | | 1 | 0.1% | 0.00 | 0.00
Extra word | DET | 1 | 0.1% | 1.00 | 1.00
Possible question | | 1 | 0.1% | 0.00 | 0.00
Question mark use | | 1 | 0.1% | 1.00 | 1.00
Repeated word | | 1 | 0.1% | 1.00 | 1.00
Total | | 780 | 100% | 0.92 | 0.81
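For reference, the overall precision and correction values in the Total rows of Appendices A and B correspond, to within rounding, to flagging-weighted means of the per-type values. The short Python sketch below makes this aggregation explicit for the Appendix B precision column; the (flaggings, precision) pairs are transcribed from the table above, while the code and its names are purely illustrative and are not the scoring scripts used in the study.

# (total flaggings, precision) pairs transcribed from Appendix B, in table order
appendix_b = [
    (464, 0.92), (75, 1.00), (61, 0.87), (56, 0.95), (28, 1.00),
    (25, 0.68), (24, 1.00), (9, 0.89), (7, 0.86), (6, 1.00),
    (5, 0.80), (4, 0.75), (4, 1.00), (3, 1.00), (2, 1.00),
    (1, 1.00), (1, 0.00), (1, 0.00), (1, 1.00), (1, 0.00),
    (1, 1.00), (1, 1.00),
]

def weighted_mean(pairs):
    # Overall value as the flagging-weighted mean of the per-type values
    total_flaggings = sum(n for n, _ in pairs)
    return sum(n * v for n, v in pairs) / total_flaggings

print(round(weighted_mean(appendix_b), 2))  # prints 0.92, matching the Total row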
About the Authors

Jim Ranalli, PhD, is an Assistant Professor in the TESL/Applied Linguistics Program at Iowa State University. His research addresses the intersection of L2 writing, technology, and self-regulated learning. He is particularly interested in innovative uses of computers for scaffolding and assessing the development of EAP writing skills.

E-mail: jranalli@iastate.edu

Taichi Yamashita, PhD, is a Visiting Assistant Professor in the Department of World Languages & Cultures at the University of Toledo. His research interests include instructed second language acquisition and computer-assisted language learning. He is particularly interested in how uses of computers in second language classrooms affect the way learners develop their second language skills.

E-mail: taichi.yamashita@utoledo.edu