Assessment-Relevant Stimuli and Judging of Writing Performances – From Micro-Judgments to Macro-Judgments

Authors

  • Nikola Dobrić, University of Klagenfurt

DOI:

https://doi.org/10.4312/elope.21.1.91-110

Keywords:

writing assessment, judging, macro-judgment, micro-judgment, assessment-relevant stimuli

Abstract

The contemporary practice of rating writing performances is grounded in an approach known as judging, which avoids paying conscious attention to discrete elements in texts and instead accounts for the overall impression made by a writing performance. However, studies have indicated that while this may hold on a conscious level, concrete stimuli in texts still preconsciously influence the formation of such overall impressions. What goes largely unnoticed is that most assessment-relevant stimuli themselves require judging to be perceived as such. This implies that an overall macro-judgment of a writing performance (normally expressed as a score) comprises individual, largely preconsciously generated micro-judgments that come together in a complex and non-linear combination-count. The paper presents an argument in favour of such a composition of judgments, demonstrates it empirically by means of a case study, and then discusses the wider consequences of this changed perspective on judging.

Published

22. 08. 2024

Section

English Language and Literature Teaching

How to Cite

Dobrić, N. (2024). Assessment-Relevant Stimuli and Judging of Writing Performances – From Micro-Judgments to Macro-Judgments. ELOPE: English Language Overseas Perspectives and Enquiries, 21(1), 91–110. https://doi.org/10.4312/elope.21.1.91-110