Automated versus Human Essay Scoring: A Comparative Study

Rania Zribi, Chokri Smaoui


The purpose of this study was to investigate the validity of automated essay scoring (Paper Rater) of EFL learners’ written performances by comparing the group mean scores assigned by the Paper Rater tool and by human raters. Ten intermediate EFL learners responded to a topic and received scores from both the automated and the human scoring procedures. A one-way repeated-measures ANOVA, run in SPSS, tested the difference between the computerized mean scores and the human raters’ mean scores. Unlike previous studies, this study found differences between the scores awarded by the two procedures: the mean scores assigned by the automated essay scoring tool Paper Rater were significantly higher than the human raters’ scores of the learners’ essays, and Paper Rater did not correlate well with the human raters. The implication for English teachers is that, despite its cost-effectiveness, the automated scoring system cannot award scores as reliable as those of human raters. Nevertheless, computerized scoring can still play a useful role in the learning process: thanks to its instant feedback, the software may contribute to the improvement of EFL learners’ writing.
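The comparison described above can be illustrated with a small sketch. The scores below are hypothetical (the study’s actual data are not reproduced here), and the function is a from-scratch implementation of a one-way repeated-measures ANOVA on two within-subject conditions, not the authors’ SPSS procedure:

```python
# Hypothetical essay scores (out of 100) for 10 learners under two
# scoring conditions: automated (Paper Rater) and a human rater.
automated = [82, 78, 85, 90, 74, 88, 79, 84, 81, 86]
human     = [75, 72, 80, 83, 70, 81, 74, 78, 76, 79]

def repeated_measures_anova(*conditions):
    """One-way repeated-measures ANOVA.

    Each argument is a list of scores for one condition; subject i's
    scores occupy position i in every list.
    Returns (F, df_conditions, df_error).
    """
    k = len(conditions)       # number of conditions
    n = len(conditions[0])    # number of subjects
    grand_mean = sum(sum(c) for c in conditions) / (k * n)

    # Between-conditions sum of squares
    ss_cond = n * sum((sum(c) / n - grand_mean) ** 2 for c in conditions)
    # Between-subjects sum of squares (removed from the error term,
    # which is what distinguishes this from an independent-groups ANOVA)
    subj_means = [sum(c[i] for c in conditions) / k for i in range(n)]
    ss_subj = k * sum((m - grand_mean) ** 2 for m in subj_means)
    # Residual (error) sum of squares
    ss_total = sum((x - grand_mean) ** 2 for c in conditions for x in c)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error

f, df1, df2 = repeated_measures_anova(automated, human)
print(f"F({df1}, {df2}) = {f:.2f}")
```

With only two conditions, this F statistic equals the square of a paired-samples t statistic, which is why a significant F here indicates that the two scoring procedures assign systematically different means to the same essays.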




Anglin, L., Anglin, K., Schumann, P. L., & Kaliski, J. A. (2008). Improving the efficiency and effectiveness of grading through the use of computer-assisted grading rubrics. Decision Sciences Journal of Innovative Education, 6(1), 51–73.

Anson, C. M. (2003). Responding to and assessing student writing: The uses and limits of technology. In P. Takayoshi & B. Huot (Eds.), Teaching writing with computers: An introduction (pp. 234–245). New York: Houghton Mifflin Company.

Bennett, R. E., & Ben-Simon, A. (2005). Toward theoretically meaningful automated essay scoring. Journal of Technology, Learning, and Assessment, 6(1), 1–47.

Bijani, H. (2010). Raters’ perception and expertise in evaluating second language compositions. The Journal of Applied Linguistics, 3(2), 69–89.

Brookes, A., & Grundy, P. (2001). Beginning to write (3rd ed.). Cambridge: Cambridge University Press.

Caswell, R., & Mahler, B. (2004). Strategies for teaching writing. United States of America: Association for Supervision and Curriculum Development (ASCD).

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, Y., Levi, E., & Ben-Simon, A. (2018). Validating human and automated scoring of essays against “true” scores. Applied Measurement in Education, 31(3), 241–250.

Collier, D. (1993). The comparative method. In A. W. Finifter (Ed.), Political science: The state of the discipline 2. Washington, DC: American Political Science Association.

Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1), 1–36.

Foltz, P. W., Laham, D., & Landauer, T. K. (1999). Automated essay scoring: Applications to educational technology. In B. Collis & R. Oliver (Eds.), Proceedings of EdMedia: World Conference on Educational Media and Technology (pp. 939–944). Association for the Advancement of Computing in Education (AACE).

Hamp-Lyons, L., & Kroll, B. (1997). TOEFL 2000-writing: Composition, community, and assessment. Educational Testing Service.

Huang, S. J. (2014). Automated versus human scoring: A case study in an EFL context. Electronic Journal of Foreign Language Teaching, 11(1), 149–164.

Huot, B. (1990). Reliability, validity, and holistic scoring: What we know and what we need to know. College Composition and Communication, 41(2), 201–213.

Huot, B. (2002). (Re)Articulating writing assessment for teaching and learning. Logan: Utah State University Press.

Hyland, K., & Hamp-Lyons, L. (2002). EAP: Issues and directions. Journal of English for Academic Purposes, 1(1), 1–12.

Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54–71.

McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as human readers? Assessing Writing, 15(2), 118–129.

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.

Nivens-Bower, C. (2002). Faculty–WritePlacer Plus score comparisons. In Vantage Learning, Establishing WritePlacer validity: A summary of studies (p. 12) (RB-781). Yardley, PA: Author.

Norusis, M. J. (2004). SPSS 12.0 guide to data analysis. Upper Saddle River, NJ: Prentice Hall.

Page, E. B. (1968). The use of the computer in analyzing student essays. International Review of Education, 14(2), 210–225.

Peterson, S. S. (2008). Writing across the curriculum: All teachers teach writing. Portage & Main Press.

Pilliner, A. (1968). Subjective and objective testing. In A. Davies (Ed.), Language testing symposium: A psycholinguistic approach. London: Oxford University Press.

Shermis, M. D. (2015). Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20, 46–65.

Wahlen, A., Kuhn, C., Zlatkin-Troitschanskaia, O., Gold, C., Zesch, T., & Horbach, A. (2020). Automated scoring of teachers’ pedagogical content knowledge: A comparison between human and machine scoring. Frontiers in Education, 5, 1–10.

Wang, J., & Brown, M. S. (2007). Automated essay scoring versus human scoring: A comparative study. Journal of Technology, Learning, and Assessment, 6(2), 1–29.

Wang, J., & Brown, M. S. (2008). Automated essay scoring versus human scoring: A correlational study. Contemporary Issues in Technology and Teacher Education, 8(4), 310–325.

Williamson, D. M., Bennett, R. E., Lazer, S., Bernstein, J., Foltz, P. W., Landauer, T. K., & Sweeney, K. (2010). Automated scoring for the assessment of common core standards. White paper.