External Accountability Testing

Internal accountability precedes external accountability. (Elmore & Fuhrman, 2001, p. 67)

For decades, external accountability testing has been in use in Ohio as well as worldwide. Although individual students take the tests, the results are aggregated to produce measures that apply to entire schools and districts.

An important feature of external accountability tests is that they were developed solely to measure systems (i.e., schools and districts). One clear implication of this purpose is often overlooked: using the results of accountability tests to describe the learning of individual students is not a valid application. According to researchers—both those who investigate assessment and those who investigate accountability systems—judging individual performance on the basis of accountability tests is a misuse of the data these tests generate (e.g., Fullan, 2011; Glover et al., 2016; Mandinach & Gummer, 2016). This implication, in fact, guides the protocols that states adopt for their own reporting of the results of accountability tests. In Ohio, for instance, results for subgroups with fewer than 10 students are not reported.

As in Ohio, accountability tests are typically aligned to state-adopted learning standards. So, the tests do represent something real. Moreover, they do demonstrate the comparative performance of entire districts and schools to the public. And these comparisons show that all the students (and various subgroups of students) in some districts perform better or worse than all the students (or various subgroups of students) in other districts. Current accountability systems use these tests along with an array of other measures to award “grades” to schools and districts.

The High-Stakes Experience

Accountability tests are often called “high-stakes tests.” The “high stakes” pertain to the way test results are used. In the past, the high stakes were either rewards (for strong results) or sanctions (for weak results). That is, just as some parents do with their children’s report cards (e.g., Guskey & Brookhart, 2019), the states that adopted rewards and sanctions—and most, including Ohio, did (Carnoy & Loeb, 2002)—punished bad grades and rewarded good ones. Really bad grades—and consistently bad grades—could bring drastic punishments: takeover of a school or district by a state department of education, firing of principals, or the imposition of one of several models of “school transformation.” This initial practice understandably prompted fear among educators.

What really motivated many educators to attend to external accountability results, though, was their interest in avoiding bad press in their local communities (King & Mathers, 1997; Sunderman et al., 2006). Today, the use of rewards and sanctions is much less common, but the release of scores and “grades” still provokes fear among educators.

Educators worry, in effect, that accountability test scores reflect fallout from the many social ills that make the US one of the most inequitable societies in the developed world. Many educators are aware of this inequity (e.g., Lubienski et al., 2022; Starck et al., 2020; Weathers & Sosina, 2022), even as they work to close “opportunity gaps” and “achievement gaps.” The challenges they face are difficult. For instance, schools alone cannot combat social inequity; and because educators hold views of race and equity that resemble those of society as a whole, perspectives on the relevant issues do not represent a consensus and often, therefore, do not support a clear path forward (Starck et al., 2020).

Prominent observers (e.g., Hargreaves, 2020) today recommend mid-stakes testing rather than high-stakes testing. Ohio’s current scheme, reformed to align with the Every Student Succeeds Act (ESSA) of 2015, is more like mid-stakes testing.

Accountability Testing in Ohio Today

For Ohio schools and districts with consistently poor or mediocre testing results, the stakes involve public identification in four “needs assistance” categories, as follows (data are from the 2020 list for each category):

  • Priority Schools (285 schools)
    Schools with a four-year graduation rate of 67% or lower OR lowest-performing schools using the report card’s overall grade methodology OR failure to improve subgroup performance over the three-year identification period.
  • Focus Schools (537 schools)
    Schools with subgroups performing at or below the performance of Priority Schools OR schools with (a) subgroups performing in the bottom 30% and (b) low performance on the Gap Closing component.
  • Warning Schools (11 schools)
    Schools with subgroups performing at or below the performance of Priority Schools (and not already identified as a Priority or Focus School).
  • Watch Schools and Districts (589 schools and 105 districts)
    At least one subgroup without satisfactory achievement AND without satisfactory progress.


The Ohio Department of Education (ODE) provides listed schools and districts with assistance described as follows:

District support categories include Independent, Watch, Moderate Support, Intensive Support and Academic Distress Commission districts. Each category of support schools and districts receive is a distinct package of supports from the Ohio Department of Education that corresponds to improvement needs. Ohio calls this its “differentiated accountability” system. (ODE, 2023, para 3)

The State Support Teams (SSTs) provide some of the support mandated by the ODE. It is not considered an honor to need such support, of course. The SSTs also—when asked—provide some level of support to districts not listed in the high-need categories. Notably, some universal forms of SST support are available to all districts.

External Accountability: A Brief History

The external accountability discussed in the professional literature is the form that accountability took following the 1983 Nation at Risk report (National Commission on Excellence in Education, 1983). The late 1970s and early 1980s were a troubled economic time, and the report placed major blame on American schools:

If an unfriendly foreign power had attempted to impose on America the mediocre educational performance that exists today, we might well have viewed it as an act of war. (National Commission, 1983, p. 5)

The upshot was that state and national policymakers heard that they had to hold schools accountable for a “rising tide of mediocrity” (National Commission, p. 5). States needed to create accountability regimes that punished schools to motivate them to produce higher test scores and restore American global dominance. The No Child Left Behind Act (NCLB, 2001) authorized rewards and sanctions nationwide.

But most educators and many community activists viewed the NCLB approach to accountability as counterproductive; and after a long struggle, NCLB was replaced by the Every Student Succeeds Act (ESSA, 2015). Contemporary provisions for external accountability conform to the requirements of ESSA, which give states more leeway in designing their own external accountability systems.

Does External Accountability Improve Teaching and Learning?

The short answer is “yes.” The longer answer is “we can’t know.”

The short answer: Research teams have reported that external accountability regimes are associated with higher test scores both in the US and internationally (Bergbauer et al., 2018; Carnoy & Loeb, 2002). Of course, this finding does not mean that external accountability causes higher test scores. Even the association is a high-level inference (involving aggregated data for entire states and nations).

The longer answer: Logically speaking, something in school practice has to have changed when test scores change. Looking only at data about the use of external accountability tests, one cannot tell what that something might be. Perhaps accountability tests serve as a “wake-up call” to teachers, causing them to change their instructional practices. Perhaps they provoke other—less salutary—responses. According to some research, for instance, external accountability regimes produce improvements in measured student outcomes (i.e., scores on external accountability tests) because they encourage educators to teach to the test (e.g., Jennings & Bearak, 2014; Koretz, 2017; Welsh et al., 2014). Nevertheless, exactly how external accountability regimes contribute to improved scores remains elusive. Furthermore, even if one knew exactly how accountability testing in general affected schooling practice, that knowledge wouldn’t be much help to TBTs, BLTs, and DLTs in specific Ohio districts.

Far more useful to these leadership teams are insights about the kinds of practices that actually work to improve teaching and learning. Michael Fullan (2011) called such practices “drivers” of systems improvement. He argued that external accountability regimes (and their related practices) represent the “wrong drivers,” whereas internal accountability practices offer much better “drivers.” Fullan characterized these internal accountability drivers in the following quote: “Intrinsic motivation, instructional improvement, teamwork, and ‘allness’ are the crucial elements for whole system reform. Many systems not only fail to feature these components but choose drivers that actually make matters worse” (Fullan, 2011, p. 3).

Really Improving Teaching and Learning

Not only are external accountability regimes inadequate drivers of systemic reform, but the tests they rely on are questionable measures of instructional effectiveness (Buzick & Laitusis, 2010; Hagopian, 2014; Stolz, 2017). Although these tests are aligned to state standards, they are very far removed from the life of classrooms, the learning of individual students, the teaching offered by individual educators, and the day-to-day leadership provided by district leadership teams.

Real improvement occurs through the concerted efforts of committed educators, not through fear of public embarrassment. As Bergbauer and colleagues noted, “It appears that systems that are showing strong results know more about how to boost student performance and are less in need of strong [external] accountability systems” (Bergbauer et al., 2018, p. 28).

From the perspective of educators, then, rather than that of policymakers, internal accountability is more important for improving teaching and learning than external accountability (Elmore & Fuhrman, 2001). Cultivating internal accountability, in fact, is the main goal of using the Ohio Improvement Process. One Ohio researcher (Hoy, 2012) described the right internal accountability drivers as: (1) collective efficacy, (2) collective trust in students and parents, and (3) academic focus.

As this discussion suggests, internal accountability does something more and better than external accountability, but neither can substitute for the other. External accountability, in other words, has its place. It uses large-scale, systemic testing in order to press for improvement. But it doesn’t change school practice. Internal district commitment to improvement is the first step toward meaningful change in school practice, followed by hard work from district leadership teams and the system-wide use of the OIP or other systematic improvement process.


Bergbauer, A. B., Hanushek, E. A., & Woessmann, L. (2018). Testing. National Bureau of Economic Research.

Buzick, H. M., & Laitusis, C. C. (2010). A summary of models and standards-based applications for grade-to-grade growth on statewide assessments and implications for students with disabilities. Educational Testing Service.

Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A cross-state analysis. Educational Evaluation & Policy Analysis, 24(4), 305–331.

Elmore, R. F., & Fuhrman, S. H. (2001). Holding schools accountable: Is it working? Phi Delta Kappan, 83(1), 67–72.

Every Student Succeeds Act of 2015, Pub. L. No. 114-95, 20 U.S.C. § 6301 et seq.

Fullan, M. (2011). Choosing the wrong drivers for whole system reform. Centre for Strategic Education.

Glover, T. A., Reddy, L. A., Kettler, R. J., Kurz, A., & Lekwa, A. J. (2016). Improving high-stakes decisions via formative assessment, professional development, and comprehensive educator evaluation: The School System Improvement Project. Teachers College Record, 118(14), 1–26.

Guskey, T. R., & Brookhart, S. M. (2019). What we know about grading: What works, what doesn’t, and what’s next. ASCD.

Hagopian, J. (Ed.). (2014). More than a score: The new uprising against high-stakes testing. Haymarket Books.

Hargreaves, A. (2020). Large-scale assessments and their effects: The case of mid-stakes tests in Ontario. Journal of Educational Change, 21(3), 393–420.

Hoy, W. (2012). School characteristics that make a difference for the achievement of all students: A 40-year odyssey. Journal of Educational Administration, 50(1), 76–97.

Jennings, J. L., & Bearak, J. M. (2014). “Teaching to the test” in the NCLB era: How test predictability affects our understanding of student performance. Educational Researcher, 43(8), 381–389.

King, R. A., & Mathers, J. K. (1997). Improving schools through performance-based accountability and financial rewards. Journal of Education Finance, 23, 147–176.

Koretz, D. (2017). The testing charade: Pretending to make schools better. University of Chicago Press.

Lubienski, C., Perry, L. B., Kim, J., & Canbolat, Y. (2022). Market models and segregation: Examining mechanisms of student sorting. Comparative Education, 58(1), 16–36.

Mandinach, E. B., & Gummer, E. S. (2016). Data literacy for educators: Making it count in teacher preparation and practice. WestEd.

National Commission on Excellence in Education. (1983). A nation at risk: The imperative for educational reform.

No Child Left Behind Act of 2001, Pub. L. No. 107-110, 20 U.S.C. § 6301 et seq.

Ohio Department of Education. (2023). District supports (webpage).

Stolz, S. A. (2017). Can educationally significant learning be assessed? Educational Philosophy & Theory, 49(4), 379–390.

Sunderman, G. L., Orfield, G., & Kim, J. S. (2006). The principals denied by NCLB are central to visionary school reform. Education Digest, 72(2), 19–24.

Weathers, E. S., & Sosina, V. E. (2022). Separate remains unequal: Contemporary segregation and racial disparities in school district revenue. American Educational Research Journal, 59(5), 905–938.

Welsh, M. E., Eastwood, M., & D’Agostino, J. V. (2014). Conceptualizing teaching to the test under standards-based reform. Applied Measurement in Education, 27(2), 98–114.