Rethinking Traditional Assessment

Many schooling practices—assessment practices among them—have a long history. Some assessment practices, like teachers giving exams and assigning grades, go back many hundreds of years. But others, like standardized testing and eligibility determination, are inventions of the industrial age. All these practices are nonetheless institutionalized practically everywhere in the United States.

The focus of this Foundational Concept, then, is on four traditional assessment practices: (1) teacher-made tests and quizzes, (2) grading, (3) standardized testing, and (4) eligibility determination. For several reasons, Ohio’s leadership teams need to rethink their engagement with these traditional assessment practices, as explained in each of the sections that follow.

Teacher-Made Testing and Quizzing

Because teachers create their own tests and quizzes, it might seem reasonable to think that these tests reflect most accurately what has been taught. A corollary assumption is that students who receive good grades on such tests have learned the relevant content and those who receive poor grades have not learned that content. There are problems with both assumptions. The problems are partly due to a lack of understanding about the characteristics of good tests (e.g., item quality, level of knowledge assessed, and scoring procedures, not to mention reliability and validity).

Indeed, decades of research have shown that individual teachers typically fail to use tests and quizzes that are reliable and valid (DiDonato-Barnes et al., 2014; Fitt et al., 1999; Linn, 1990; Marso & Pigge, 1991; McMorris & Boothroyd, 1993; Newman & Newman, 2013; Romagnano & Long, 2001). Even the commercial “tear-out” tests that accompany textbook series, often used by teachers, lack evidence for reliability and validity, and they fixate on lower-level tasks like factual recall (Frey & Schmitt, 2010). Nonetheless, the questionable assumptions persist in part because tests and quizzes produce the grades that families rely on for academic feedback (Guskey & Brookhart, 2019).

This short logical chain keeps the questionable assumptions in place: (1) if grades legitimately represent accomplishment, then (2) the tests and quizzes on which they are based must also be legitimate. The assumptions bolster the legitimacy of the schooling enterprise. It makes sense, then, that classroom tests and quizzes would remain institutionalized practices in virtually all schools. Surprisingly, parents prefer standardized test scores to grades (Brookhart, 2013), even if they still receive grades and even if many families fixate on grades.

Whatever one’s view of such logic and its institutional significance, two facts remain true: all tests, even excellent ones, exhibit measurement error, and working alone, teachers do not typically create good tests.

In this context, leadership teams (especially TBTs and BLTs) should work toward improving the quizzes and tests (and assessment procedures) used for classroom assessment. Quizzes and tests should also address mostly formative purposes, not summative ones. In other words, tests and quizzes should mostly be used to determine how to provide needed support to students, not mostly for calculating their grades. Although formative assessment is what is needed to improve learning, most classroom tests address summative purposes: they report student performance rather than inform the learning process. Based on their study, Frey and Schmitt (2010) reported that only 12% of classroom assessments addressed a formative purpose.

Changing these unfortunate circumstances is a substantial leadership challenge in schools and districts, including those in Ohio. A key insight is that most leadership teams exhibit reasonable capacity to improve classroom assessment (Howley et al., 2013). Team members can draft items and improve their quality; they can help each other ask questions about higher-order knowledge; and they can determine consistent scoring procedures and data routines. With useful tests or quizzes produced in this way, they might also use the data they collect to estimate the reliability of their tests and quizzes. This work, though simple enough, is very rare in practice. Nonetheless, it’s something Ohio’s leadership teams can do.
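As one illustration of how a team might use collected data to estimate reliability, the sketch below computes Cronbach's alpha, a common index of internal consistency. The text above does not name a particular method, so the choice of alpha is an assumption made here for illustration. The function name and the sample scores are hypothetical, and a real analysis would need more students and items than this toy data set.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Estimate internal-consistency reliability (Cronbach's alpha).

    `scores` is a list of rows, one per student; each row holds that
    student's score on every item of the quiz (hypothetical layout).
    """
    if len(scores) < 2:
        raise ValueError("need scores from at least two students")
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every student needs a score on every item")

    # Variance of each item across students (one column per item).
    item_variances = [pvariance(column) for column in zip(*scores)]
    # Variance of students' total scores.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Made-up data: 5 students, 4 items scored 0 (wrong) or 1 (right).
    quiz_scores = [
        [1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    print(f"Estimated alpha: {cronbach_alpha(quiz_scores):.2f}")
```

Values closer to 1 suggest that the items tend to rank students consistently. A low value does not say which items are weak, so a team would still need to review items individually.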

Grading

Traditionally, testing (with teacher-made quizzes and tests) and grading were treated as parts of one process—both leading to summative judgments of performance. As noted above, however, quizzing and testing are much better suited to formative purposes.

Nevertheless, a generation ago, Allan Ornstein raised concerns about both teacher-made tests and grading, and his cautions remain apt today because summative uses of testing and grading persist widely:

Not only is a high degree of expertise required for being accurate in grading, one that eludes many teachers, but there is the false assumption that teacher-made tests reflect precisely what is being taught. (Ornstein, 1994, p. 57)

In the commonplace view, grading is seen as a way to communicate with families about how students are doing in school (Marzano, 2000). The communication value of grades, however, is questionable because grading practices are often inconsistent and unreliable. More troubling, grades are used to control student behavior, including for punishment (see the classic study by McNeil, 1981). What are the consequences?

Sometimes grades understate students’ learning. This situation may cause students to conclude that they lack the ability to succeed academically. They may even be denied the chance to follow educational or career paths that could provide a better quality of life (Pintrich & Schunk, 2002).
Sometimes grades overstate students’ learning. When students get high grades with little effort, they may stop seeing a need to work hard to learn more challenging academic content.

In other words, grades often get in the way of communication. They also get in the way of student engagement and authentic learning. Here’s how that works:

“Good students” work for grades (instead of satisfying curiosity or a need to understand).
“Bad students” tune out completely, feeling that they’ve failed before they even get started on a learning task.

Indeed, in many schools, students are “tested” and “graded” far too often (Nottingham, 1988; Ornstein, 1994; Reeves et al., 2017; Tucker, 2014). This circumstance has negative consequences for everyone involved (Erickson, 2022; White, 2011). For everyone, excessive grading and testing subvert engagement, and the problem is getting notably worse (Fullan et al., 2019).

The regime of testing and grading spreads the message that every error is a sign of deficiency. The practice cultivates fear, alienation, resentment, and resistance—all of which undermine engagement and learning. Through this experience, far too many students “learn” that they themselves are failures (Anderson, 2018; Guskey & Brookhart, 2019; Ornstein, 1994; Reeves et al., 2017).

There is also a connection between grading, undermining student engagement, and implicit bias, which supports unconscious adherence to racial and cultural stereotypes. The connection is that such stereotypes inform the subjective elements of grading. And when the grading is relentless, so is the effect of the unconscious stereotyping. Implicit bias is a particular problem when grading criteria are vague or entirely subjective (Payne & Vuletich, 2018; Quinn, 2020; Uhlmann & Cohen, 2005).

Standards-based grading systems offer a promising alternative to traditional ones. Other promising approaches include: (1) putting less weight on early assignments and formative assessments and (2) clearly communicating to students that mistakes are a necessary part of the learning process (Reeves, 2016; Stiggins & Chappuis, 2005).

Working to develop enlightened grading procedures and policies is something Ohio’s leadership teams, especially DLTs and BLTs, should be doing. This work is especially important in the context of MTSS, where the goal is to match instruction with student needs. It also matters because families need to know what their children are learning, whether their children need (and are receiving) help, and whether the school district is addressing their children’s learning needs.

Standardized Testing

As a traditional practice, standardized testing was designed to provide reliable and valid summative performance data for large groups of students. The need for such tests became apparent across the 20th century, and most such tests were created to serve assessment needs in the factory system of schooling emerging at the time (e.g., Rice, 1913). This system consolidated districts, built larger schools with larger class sizes, and sent more and more students on to high school.

The term “standardized” indicated that the test was developed carefully, administered the same way everywhere, and scored and reported the same way for everyone. Initially, students’ scores were interpreted relative to the norm, or average score, calculated as part of the test development process. Today many standardized tests orient to state standards rather than to the average score in a norming sample.

Initially, group-administered standardized tests were used to provide a more objective report of learning accomplishment than grades could provide, even back then. Grading practices are variable, so standardization offers consistency. But, as has become clear in over a half-century of research, the results of standardized testing reflect and reinforce social and economic inequities (e.g., Coleman et al., 1966; Oakes et al., 1990; Ravitch, 2011).

For some critics (e.g., Blanton, 2020; Edley et al., 2019; Ravitch, 2011), standardized testing of achievement levels is a problem because scores from such measures track social and economic inequity more than they reflect learning. Most objections to the traditional use of standardized tests center, however, not on standardization per se, but on the overuse of summative assessment (Glover et al., 2016).

Standardized testing will, of course, persist as part of external accountability regimes. But in between such administrations, districts and schools ought to replace much of the summative assessment done today with formative assessments to do two things. The first is discovering and then addressing individual student needs. Universal screening, Curriculum-Based Assessment, and benchmarking are well-known tools for formative assessment, especially as they are used to support MTSS. The second, and just as important to improvement efforts, is formative assessment of instructional practice. This sort of assessment should replace much of the summative assessment of student learning that now prevails in schools and classrooms.

Eligibility Determination

Standardized testing also has another side: individually administered standardized tests. These tests are typically part of the assessment process to determine eligibility for special education services or gifted education services.

In traditional practice, determination of eligibility for special or gifted education supports a summative decision about placement. This part of the familiar routine, where students are “placed” in a resource room or other setting outside general education, is no longer considered good professional practice (Cole et al., 2021), even though it is still widely used.

Several principles (e.g., Glover et al., 2016; Massachusetts Department of Education, 2018; Thurlow et al., 2020) define the alternative to this too-familiar outcome:

The Least Restrictive Environment (LRE) is always the general education classroom.
Special education services take place in the general education classroom.
Special education services are planned as temporary.
These temporary services accelerate learning for all who receive them as part of both special and gifted education.

Multi-tiered Systems of Support (MTSS) honor these principles, and determining eligibility does not come with a “placement decision” in the MTSS framework. It comes with a plan to provide temporary service in addition to core instruction in the general education classroom.

Whether established through MTSS or as a matter of local assessment policy, these principles entail a range of substantial changes to traditional practice: co-teaching in general-education classrooms, changes to IEP team protocols, and specific changes to universal screening and diagnostic testing. The OLAC module on Learning Supports details the implementation of MTSS.

Rethinking Tradition

Tradition represents institutionalized norms, and this representation is as true for assessment practices as it is for other schooling practices (e.g., the age-grade classroom, summer vacation, textbooks, and grade-span configuration). The difficulty with institutionalized forms of assessment is that they restrict the decision-making capacities of leadership teams. Refocusing these traditional activities away from summative purposes and towards formative purposes undoes those restrictions. Traditions of excessive (relentless) summative assessment undermine student engagement. It would be possible, as a corrective, to continue some standardized testing and preserve grading, but with much less summative and much more formative assessment directed toward teaching as well as learning.

References

Anderson, L. W. (2018). A critique of grading: Policies, practices, and technical matters. Education Policy Analysis Archives, 26(45/55), 1–27. https://epaa.asu.edu/index.php/epaa/article/view/3814/2053

Blanton, M. V. (2020). A correlational study of school report card grades and degrees of poverty. Journal of Organizational and Educational Leadership, 6(1). http://files.eric.ed.gov/fulltext/EJ1274243.pdf

Brookhart, S. M. (2013). The use of teacher judgment for summative assessment in the USA. Assessment in Education, 20, 69–90.

Cole, S. M., Murphy, H. R., Frisby, M. B., Grossi, T. A., & Bolte, H. R. (2021). The relationship of special education placement and student academic outcomes. Journal of Special Education, 54(4), 217–227.

Coleman, J., Campbell, E., Hobson, C., McPartland, J., Weinfeld, F., & York, R. (1966). Equality of educational opportunity. Department of Health, Education, and Welfare. http://files.eric.ed.gov/fulltext/ED012275.pdf

DiDonato-Barnes, N., Fives, H., & Krause, E. S. (2014). Using a table of specifications to improve teacher-constructed traditional tests: An experimental design. Assessment in Education: Principles, Policy & Practice, 21(1), 90–108. https://doi.org/10.1080/0969594X.2013.808173

Edley, C., Jr., Koenig, J., Nielsen, N., & Citro, C. (2019). Monitoring educational equity: Consensus study report. National Academies Press. https://www.nap.edu/download/25389

Erickson, S. S. (2022). The game of grades and the hidden curriculum. Physics Teacher, 60(5), 398–399. https://doi.org/10.1119/10.0010403

Fitt, D. X., Rafferty, K., & Presner, M. T. (1999). Improving the quality of teachers’ classroom tests. Education, 119(4), 643.

Frey, B. B., & Schmitt, V. L. (2010). Teachers’ classroom assessment practices. Middle Grades Research Journal, 5(3), 107–117.

Fullan, M., Gardner, M., & Drummy, M. (2019). What today’s teens need most from schools is learning that fosters engagement and connection: That may mean changing everything. Educational Leadership, 76(8), 64–69.

Glover, T. A., Reddy, L. A., Kettler, R. J., Kurz, A., & Lekwa, A. J. (2016). Improving high-stakes decisions via formative assessment, professional development, and comprehensive educator evaluation: The School System Improvement Project. Teachers College Record, 118(14), 1–26.

Guskey, T. R., & Brookhart, S. M. (2019). What we know about grading: What works, what doesn’t, and what’s next. ASCD.

Howley, M., Howley, A., Henning, J. E., Gilliam, M. B., & Weade, G. (2013). Intersecting domains of assessment knowledge: School typologies based on interviews with secondary teachers. Educational Assessment, 18, 26–48. https://doi.org/10.1080/10627197.2013.761527

Linn, R. L. (1990). Essentials of student assessment: From accountability to instructional aid. Teachers College Record, 91, 422–436.

Marso, R. N., & Pigge, F. L. (1991). An analysis of teacher-made tests: Item types, cognitive demands, and item construction errors. Contemporary Educational Psychology, 16(3), 279–286. https://doi.org/10.1016/0361-476X(91)90027-I

Marzano, R. (2000). Transforming classroom grading. ASCD.

Massachusetts Department of Education. (2018). Multi-tiered system of support: Blueprint for Massachusetts. https://www.doe.mass.edu/sfss/mtss/blueprint.pdf

McMorris, R. F., & Boothroyd, R. A. (1993). Tests that teachers build: An analysis of classroom tests in science and mathematics. Applied Measurement in Education, 6(4), 321–342.

McNeil, L. M. (1981). Negotiating classroom knowledge: Beyond achievement and socialization. Journal of Curriculum Studies, 13(4), 313–328.

Newman, C., & Newman, I. (2013). A teacher’s guide to assessment concepts and statistics. Teacher Educator, 48(2), 87–95.

Nottingham, M. (1988). Grading practices—watching out for land mines. NASSP Bulletin, 72(507), 24–28.

Oakes, J., Ormseth, T., Bell, R., & Camp, P. (1990). Multiplying inequalities: The effects of race, social inequality, and tracking on opportunities to learn mathematics and science. RAND. http://files.eric.ed.gov/fulltext/ED329615.pdf

Ornstein, A. C. (1994). Grading practices and policies: An overview and some suggestions. NASSP Bulletin, 78(561), 55–64.

Payne, K. B., & Vuletich, H. A. (2018). Policy insights from advances in implicit bias research. Policy Insights from the Behavioral and Brain Sciences, 5, 49–56.

Pintrich, P. R., & Schunk, D. H. (2002). Motivation in education (2nd ed.). Merrill/Prentice Hall.

Quinn, D. M. (2020). Experimental evidence on teachers’ racial bias in student evaluation: The role of grading scales. Educational Evaluation & Policy Analysis, 42(3), 375–392. https://doi.org/10.3102/0162373720932188

Ravitch, D. (2011). The death and life of the great American school system: How testing and choice are undermining education. Basic Books.

Reeves, D. B. (2016). Elements of grading: A guide to effective practice (2nd ed.). Solution Tree.

Rice, J. M. (1913). Scientific management in education. Hinds, Noble & Eldredge.

Romagnano, L., & Long, V. (2001). The myth of objectivity in mathematics assessment. Mathematics Teacher, 94(1), 31.

Stiggins, R., & Chappuis, J. (2005). Using student-involved classroom assessment to close achievement gaps. Theory into Practice, 44(1), 11–18.

Thurlow, M. L., Ghere, G., Lazarus, S. S., & Liu, K. (2020). MTSS for all: Including students with the most significant cognitive disabilities. National Center on Educational Outcomes. https://nceo.umn.edu/docs/OnlinePubs/NCEOBriefMTSS.pdf

Tucker, M. S. (2014). Fixing our national accountability system. National Center on Education and the Economy. http://files.eric.ed.gov/fulltext/ED556313.pdf

Uhlmann, E. L., & Cohen, G. L. (2005). Constructed criteria: Redefining merit to justify discrimination. Psychological Science, 16, 474–480.

White, J. (2011). Exploring well-being in schools: A guide to making children’s lives more fulfilling. Routledge.