Evaluation: A Holistic Perspective

In holistic evaluation, the evaluator views the text as a whole, determining the degree to which it is effective as a specific type of writing. This paper examines the principles, applications, and limitations of holistic evaluation, and explores the contextualization of holistic evaluation standards. A number of holistic scoring guides are described; the differences between them reveal that different genres exhibit different features in their successful forms. The paper also describes an empirical study into the correlation between analytic and holistic evaluations of a corpus of summaries written by university-level students.

This paper will explore ways in which research into holistic evaluation can contribute to our understanding of the bases on which we judge writing. I will begin by giving an overview of holistic evaluation and examining what a holistic procedure of evaluation is, what the literature reports on it, and how it is applied in the assessment of writing skills. In so doing, I will deal with an issue that is not adequately addressed in the literature: namely, the contextualization of the standards which underlie a holistic assessment. The second part of this paper will describe the results of an empirical study I conducted into the quality of summaries, using holistic evaluation as a research tool. The results of this study illuminate the relationship between analytic and holistic evaluation, and between tacit and expressed criteria in evaluation.
This paper thus addresses the concerns of teachers of professional and technical writing, who must evaluate a variety of genres of writing according to a wide range of context-specific criteria. When such criteria are contrasted, it becomes apparent that different genres exhibit different features in their successful forms. In particular, the description of criteria used in evaluating summaries (a standard assignment in a professional writing course) should prove to be of interest to the writing instructor.
Basically, holistic evaluation is one pole of a continuum, the opposite pole of which is analytic evaluation. Most teachers use neither a purely holistic nor a purely analytic approach; rather, they combine approaches, trying to reach an assessment that takes into account the complex interplay of individual features, their cumulative effect, and their importance in a particular writing context. Charles Cooper uses the term "holistic evaluation" in the broadest of ways, to include "any procedure which stops short of enumerating linguistic, rhetorical, or informational features of a piece of writing" (1977, p. 66). He considers what he calls "general impression marking" to be one variety of holistic scoring, and includes in his examples of holistic procedures approaches that others might well classify as analytic.
In Creating Writers, Spandel and Stiggins differentiate between two types of holistic evaluation, which they call "general impression holistic scoring" and "focused holistic scoring" (1990, p. 6). In focused holistic scoring, writing samples are matched against set standards and scored accordingly. (What these standards are and how they are context-specific are issues to be addressed later in this paper.) In the field of language testing, numerous empirical studies of such focused holistic scoring have been carried out. These studies usually involve tests that are rated by a group of evaluators; the evaluators' ratings are then averaged for each paper. Such studies emphasize the importance of evaluators' determining in advance common standards that define each of the possible ratings. In general, the procedure is as follows: a number of possible ratings are set (e.g., from 1 to 5) and each rating is defined in terms of specific criteria. The most common way of defining criteria is by means of a scoring guide, which describes the qualities of a text that merits each specific rating. Standards can also be set indirectly, through representative papers. In this case, evaluators are shown samples of writing, called "anchor papers" (Spandel and Stiggins, 1990, p. 20), which are representative of each rating. Once the standards have been set, evaluators read through the full set of papers, matching each one with the scoring guide description or with the anchor papers they judge to be of similar quality.

Validity and Reliability of Holistic Evaluation
Perhaps the best known and most damning study of holistic evaluation was carried out by Paul Diederich (1974), and is described in his book Measuring Growth in English. Diederich selected a corpus of 300 compositions written by college-level students and had them evaluated by 60 readers from six different fields. His readers included English teachers, science teachers, editors, lawyers, and business people. The readers were instructed to sort the papers into nine piles in order of general merit, and they were left on their own to decide what constituted general merit. The results showed very little agreement among readers: ratings were very inconsistent, and the interrater reliability measured was only .31, a very low score, for Diederich considers an acceptable reliability rating to be .80. Diederich concluded that his readers had judged the papers on very different bases, and as an alternative to the holistic approach, he suggested an analytic procedure: he analyzed text quality into specific features to be evaluated separately. The features he isolated for evaluation fall into two general categories: ideas, organization, wording, and flavour (which he termed "general merit"); and usage, punctuation, spelling and handwriting (which he termed "mechanics"). The total rating would then be determined by the sum of the scores for the individual features.
A question that arises is thus: is holistic evaluation valid? Does it measure what it is intended to measure? In focused holistic evaluation, the answer is yes, since the evaluators work with standards that define what they are attempting to measure, and what a successful paper is in a given situation. The contextualization of those standards is of paramount importance, particularly for the professional writing teacher, who must evaluate a variety of genres such as compositions, journalistic-style articles, business reports, sets of technical instructions, and summaries.
In holistic evaluation, rather than judging a text on the basis of isolated features, the evaluator judges to what extent it fulfills its purpose. Cooper writes, "Holistic evaluation is obviously to be preferred [to analytic evaluation] where the primary concern is with evaluating the communicative effectiveness of candidates' writing" (1977, p. 3). Yet these notions of "purpose" and "communicative effectiveness" are too vague to provide concrete guidance for the evaluator. For holistic evaluation to produce valid results, the evaluator needs to define these notions in terms of context-specific criteria, and then focus on the cumulative effect of a text's specific features on the communication process.
Another question that arises is that of reliability: is holistic evaluation reliable? Does holistic evaluation produce consistent results? Inter-rater reliability is generally measured by having different markers score a set of papers and then determining the average correlation between their scores. There have been numerous studies of reliability, many of which are reported in Cyril Weir's Communicative Language Testing (1990). The findings of these studies have varied. In general, they have indicated that focused holistic scoring produces acceptable reliability scores, comparable to those achieved by analytic procedures.
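The reliability measure just described, the average correlation between markers' scores, can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any study cited here: the choice of Pearson's coefficient, the function names, and the sample scores are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(ratings):
    """Average the correlation over every pair of markers.

    `ratings` maps a marker's name to that marker's scores, listed in the
    same paper order for every marker.
    """
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical scores (1 to 5) given by three markers to the same six papers.
ratings = {
    "marker_a": [1, 2, 3, 4, 5, 3],
    "marker_b": [2, 2, 3, 5, 5, 3],
    "marker_c": [1, 3, 3, 4, 4, 2],
}
reliability = mean_pairwise_correlation(ratings)
print(f"Mean inter-rater correlation: {reliability:.2f}")
```

A figure computed in this way could then be compared with a threshold such as the .80 that Diederich considers acceptable.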
Holistic scoring is particularly appropriate when a group of papers needs to be rated on a continuum, or when a group of students needs to be rank-ordered, for example, in placement tests. On a practical level, a real advantage to holistic scoring is its speed: it is certainly much faster for an experienced evaluator to read through a corpus of texts and sort them according to overall merit than to identify and categorize all the errors and then calculate the mark on the basis of individual performance of subskills. This advantage is, of course, somewhat offset by the fact that holistic evaluation is much more reliable when papers are evaluated by more than one evaluator.
A purely analytic approach is unsatisfactory for a number of reasons. One major problem is that it is difficult to delimit the subskills that constitute writing competence. E.M. White points out the limitations of analytic scoring in his book Teaching and Assessing Writing:
"In theory, analytic scoring should provide the diagnostic information that holistic scoring fails to provide and in the process yield a desirable increase in information from the writing sample. In practice, three major problems have so far demonstrated the limitations of analytic scoring: (1) There is as yet no agreement (except among the uninformed) about what, if any, separable subskills exist in writing. (2) It is extremely difficult to obtain reliable analytic scores, since there is so little professional consensus about subskills. (3) Analytic scoring tends to be quite complicated for readers." (1985, pp. 29-30)
"There is no evidence that writing quality is the result of the accumulation of a series of subskills. To the contrary, the lack of agreement on subskills in the profession suggests that writing remains more than the sum of its parts and that the analytic theory that seeks to define and add up the subskills is fundamentally flawed." (p. 123)

Scoring Guides
As we have seen, scoring guides are often used to ensure the reliability of holistic scoring when it is carried out by a group of evaluators. Scoring guides are also useful for the individual teacher: by drawing up such a guide, the teacher can define the qualities that define an A, B, C, D or F paper for a specific writing activity. Thus the teacher-evaluator reflects on and establishes standards in his or her mind, in light of the purpose and genre of the writing activity, the communicative context, and the level of the class. But because terms such as "purpose" are rather vague, concrete examples of scoring guides are needed to illustrate how evaluation criteria are context-specific. Let us therefore look at excerpts from a number of scoring guides to see the variety of criteria by which different types of writing are assessed.
In the teacher's manual that accompanies Reporting for the Print Media, author Fred Fedler (1989) suggests criteria for grading journalistic articles; in other words, he gives a scoring guide for journalism teachers. He defines an A paper as follows:
"The story is newsworthy and exceptionally well written: thorough and free of errors. The lead is clear, concise and interesting .... The body is well organized and contains effective transitions, quotations, descriptions, and anecdotes. Because of the story's . . . merit, newspapers would . . . publish it." (p. 2)
He begins his description of a B paper as follows: "The story could be published by a newspaper after minimal editing" (p. 2). And an F paper is described as follows:
"The news story could not be published by a newspaper, nor easily rewritten. It is too confusing, incomplete or inaccurate. Or, the story contains a misspelled name or serious factual error." (p. 2)
The last criterion for an F paper (the presence of a misspelled name or factual error) reflects the fact that in journalism, where one writes in a public forum, the cardinal sin is getting a name or fact wrong, since such a mistake could ruin reputations and lead to costly libel suits.
Obviously, very different criteria would be used to evaluate the writing of ESL students. One description of a top-ranking paper in ESL is as follows: "The writing is indistinguishable from that of a native speaker." Similarly, a colleague who specializes in ESL showed me a scoring guide she uses, from which I take the following definition of a B paper:
"Student writes clearly understandable English and organizes material well. Grammatical errors are ... not serious enough to interfere with communication .... Sentence structure may be somewhat inelegant, but is clear and understandable." (G. Arbach, personal communication, June 10, 1990)
Criteria used by evaluators of ESL writing thus indicate that evaluators measure the students' ability against that of a native speaker, emphasizing comprehensibility and idiomaticity.
A scoring guide used by the British Council to determine university admission is based on nine ratings. The description of the top rating begins as follows: "The writing displays an ability to communicate in a way which gives the reader full satisfaction"; and the next rating begins with: "The writing displays an ability to communicate without causing the reader any difficulties" (Hughes, 1989, p. 88). The ratings are more fully described, but the criteria cited indicate that the texts are evaluated partly in terms of the reader's response.
So scoring guides describe in broad strokes the standards by which the merit of texts is to be determined in a particular writing activity. Their purpose is to ensure that impressionistic or intuitive assessment does not translate into arbitrary or idiosyncratic assessment. They also serve to make explicit the expectations that underlie the evaluator's intuitive response. No doubt teachers of professional writing, who assign a variety of writing tasks to their students, would be well advised to identify, for themselves and for their students, the features that characterize successful texts of different genres. For it is apparent that not only will different genres exhibit different features in their successful forms, but also that features crucial to success in one genre may be relatively unimportant in another.

Applications in Research: An Empirical Study
Holistic evaluation is a useful research tool to help elucidate those bases on which we evaluate writing. I used holistic evaluation for this purpose in an empirical study I conducted into the quality of summaries written by university-level students. In this study, a set of student texts was evaluated both holistically and analytically. I determined the correlation between the holistic scores and specific analytic variables, in order to identify those variables that carried the most weight in a holistic evaluation, as well as the relationship between tacit and expressed criteria.
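To make this kind of analysis concrete, the following sketch correlates a set of holistic scores with each of several analytic variables and ranks the variables by the strength of the association. It is an illustration under stated assumptions, not a reconstruction of the study's analysis: the use of Pearson's coefficient, the variable names, and all of the numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical holistic scores for five summaries (one value per summary).
holistic = [2.0, 3.5, 4.0, 5.0, 5.5]

# Hypothetical analytic measurements for the same five summaries.
analytic = {
    "important_ideas_included": [3, 5, 6, 8, 9],
    "errors_of_grammar": [9, 7, 8, 3, 2],
    "mean_t_unit_length": [14, 12, 15, 13, 14],
}

# Correlate each analytic variable with the holistic scores.
correlations = {
    name: pearson(values, holistic) for name, values in analytic.items()
}

# Rank variables by the absolute size of their correlation, strongest first.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
for name in ranked:
    print(f"{name}: r = {correlations[name]:+.2f}")
```

Ranking by absolute value treats a strong negative association (more errors, lower score) as just as informative as a strong positive one.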
The evaluators read the summaries quickly, and gave each one a rating from 1 to 6, based on their overall impression of the summary's merit. For each paper, I averaged the four scores to determine the holistic score, a procedure that has produced highly reliable results in other studies.
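The averaging step can be illustrated as follows. The sketch assumes that each summary receives one rating from 1 to 6 from each of four evaluators; the paper identifiers and the ratings themselves are invented for the example.

```python
from statistics import mean

# Hypothetical ratings (1 to 6): one score per evaluator for each summary.
ratings_by_paper = {
    "summary_01": [4, 5, 4, 3],
    "summary_02": [2, 3, 2, 2],
    "summary_03": [6, 5, 5, 6],
}


def holistic_scores(ratings_by_paper):
    """Return each paper's holistic score: the mean of its evaluators' ratings."""
    return {paper: mean(scores) for paper, scores in ratings_by_paper.items()}


scores = holistic_scores(ratings_by_paper)
for paper, score in scores.items():
    print(f"{paper}: {score:.2f}")
```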
The results of the holistic evaluation were as follows. First, the interrater reliability was calculated to be approximately .72 (with a p-value of .0001). Perhaps one reason that the reliability was not higher was that one of the evaluators did not rate any of the papers as a 6; she did not think that any of them matched the description of an excellent paper. This points to a clear difference between two types of holistic evaluation: one in which texts are scored simply on a curve, on the basis of their relative merit, with the best texts in the group receiving the highest score; and the other, in contrast, a holistic evaluation in which texts are scored according to their absolute merit, on the basis of set standards. The evaluator who didn't give any of the texts a 6 was measuring the texts against a set standard, and was not simply rating them in relation to one another.
In the second part of this study, I carried out a detailed analysis of each of the summaries, and rated each text according to eight variables: errors of grammar and mechanics, distortions of meaning, inclusion of important ideas, integration of important ideas, syntactic complexity of the sentences (measured by number of T-units per sentence and average length of T-unit), organization of important ideas, and efficiency of summarization. I then analyzed the correlation between the holistic scores determined in the first part of the study and the eight variables I had identified. The most significant correlation was between the holistic evaluation and three of the variables: first, and primarily, the inclusion of important ideas; secondly, the absence of errors of usage and grammar; and thirdly, the absence of distortion. A stepwise linear regression procedure carried out to determine the combination of variables that would best predict the holistic score showed that these three variables explained [...]% of the variance of the holistic scores.
Particularly indicative of the complex role of error in evaluation was a comment made by one of the evaluators. In one of the summary texts, the student writer had consistently misspelled the name Costa Rica, calling the country Casto Rico. Now misspelling is usually viewed simply as a mechanical error, a surface error easy to identify and classify. In error analysis, what could possibly be more straightforward than a spelling mistake? Yet the evaluator saw this misspelling as something more serious. She commented, "This misspelling seriously affected my impression of the text's merit. After all, the whole text is about Costa Rica, and if the writer doesn't even get the name of the country right, how effective is the summary?" (K. Barber, personal communication, September 2, 1991).
This anecdote illustrates that, when viewed according to their effect on communication rather than error typology, spelling mistakes are not all equal. This same observation was apparent in Fedler's scoring guide, mentioned earlier, in which misspelling a person's name in a news story was viewed as an error of such importance that it could earn the writer a failing grade. In these cases, the type of error does not necessarily indicate its communicative effect.
Let's examine this anecdote in the context of the theoretical model I used for my research into summaries: Kintsch and Van Dijk's (1978) macropropositional representation of the meaning of a text. Kintsch and Van Dijk describe a processing model of discourse comprehension and production, and characterize a text's semantic structure on two levels: the microstructure and the macrostructure. The former term refers to the local structure of individual propositions and sequences of propositions; the latter, to the "gist" of the text, its global structure. Both structures are abstracted from the surface structure and are described in terms of sequences of propositions. Thus the macrostructure of a text is a hierarchical representation of its gist, or overall meaning.
In the case in question, the spelling error occurred at the highest level of the macrostructure. The importance placed on the misspelling of Costa Rica can be attributed to the fact that it was the topic of the text, the highest-level argument.
Thus the level of macrostructure on which an error occurs, along with the effect of an error on the reader, is an important consideration in error analysis.
It was a holistic approach to evaluation that brought to light the importance of this specific error, for the holistic approach focuses on the effect of text features in specific genres, and on the response the text elicits in the evaluator as reader. By being overly analytic and by focusing on types of errors rather than on their effects, an evaluator may fail to account for the complex interplay of form and function. One can conclude that the holistic dimension of text quality should be recognized and included in any evaluation procedure.

Conclusion
This paper has reviewed the concept and practice of holistic evaluation and has attempted to show that there are issues involved in holistic evaluation that remain unresolved and may be fruitfully explored: issues such as the determination of context-specific criteria, the description of features exhibited by different genres in their successful forms, and the analysis of the complex nature of error. Moreover, by illustrating the use of holistic evaluation as a research tool, I have tried to show how a holistic evaluation procedure can help us learn more about standards used both explicitly and implicitly to determine the quality of a text.

[...] of the original. He or she may fail to discriminate between major and secondary ideas, may omit or distort major ideas, or may copy sections of text verbatim and fail to integrate them into the text. Errors in grammar, usage, and sentence structure may interfere with readability. Despite definite weaknesses in selection, development of ideas, or expression, the text is still intelligible.
2: The summary-writer does not adequately convey the major ideas of the original, because of omission, distortion, poor analysis, or inability to express ideas clearly. There is evidence of some or all of the following problems: errors in comprehension; lack of coherence between sections; or frequent errors in grammar, usage, and sentence structure. The general impression is that of confused thinking and poor writing.
1: The summary-writer fails to convey the major ideas of the original. This may be because of the writer's misunderstanding of the original, because of lack of organization and development, or because of an inability to write intelligibly.

The above scoring guide is inspired by and adapted from the holistic model for evaluating compositions given in White (1985, pp. 135-36).