5 Key Lessons from the Successes (and Failures) of President Obama’s Teacher Evaluation Reforms

Lessons for the post-NCLB era.
The passage of the Every Student Succeeds Act and the waning of the Obama administration bring to a close federal efforts to improve teacher evaluation, a practice once widely derided for its infrequent and pro forma observations, inflated ratings, and lack of consequences.
Today most states combine different measures, including classroom observations and student test data, to produce a rating that describes effectiveness. But problems with the system persist.
Research by Matt Kraft of Brown University and Allison Gilmour of Vanderbilt University confirms other evidence that in most of the country new teacher evaluation systems still rate the vast majority of teachers effective, even though uniformly high ratings in the past were part of the impetus for creating new systems.
Based on this study, the American Enterprise Institute’s Rick Hess declared “all that time, money, and passion” dedicated to teacher evaluation “haven’t delivered much.” The Shanker Institute’s Matt Di Carlo also pointed out that evaluation systems can’t be judged primarily on how many low-performing teachers they identify.
Another report, “Beyond Ratings,” by Kaylan Connelly and Melissa Tooley of the New America Foundation, argues that evaluation systems need to be better calibrated to enhance professional growth and development. A paper from the Aspen Institute lays out a 10-lesson “roadmap for improvement” on teacher evaluation.
Meanwhile, Georgetown University’s Thomas Toch has taken to The Washington Monthly, Education Next, and The Atlantic to defend the Obama administration’s accomplishments. Toch argues that “state and local studies, teacher surveys, and other evidence reveals that many of the new [teacher evaluation] systems have been much more beneficial than the union narrative would suggest.”
So, with the political fight moving to the states, what can we learn from the research debate? Here are five key lessons policymakers should consider as we head into (another) brave new world of teacher evaluation.
Determine why so many teachers get high ratings — and address the root causes
Kraft and Gilmour’s study not only documents the high marks teachers in many states are earning, but also asks principals in one district why they tend to grade teachers on a generous curve. What they find is revealing: Principals say they are worried about finding better teachers to replace low-performers, don’t like telling teachers they’re not doing well, fear that a low rating will damage a teacher’s morale, lack time to remediate, and are daunted by difficult-to-navigate teacher dismissal processes.

The new wave of evaluation systems doesn’t seem to have addressed these concerns sufficiently. If districts want better-differentiated teacher ratings, which are important for targeting professional development and making smart personnel decisions, they need to confer with principals to ensure new programs are useful to the school leaders who will be implementing them.

Don’t use test scores to evaluate every teacher in every grade and subject
Under Obama, states were strongly incentivized by the federal government to use test scores in teacher evaluation; the vast majority of states now do.
The problem with this approach was that while every state tests students in grades 3–8 in reading and math, there were few standardized assessments in other grades and subjects. Consequently, new tests — of generally unknown reliability or validity — materialized around the country for rating physical education, social studies, and first-grade teachers, among others. In some areas teacher ratings have been based on school averages or on test scores in subjects the teacher didn’t teach — prompting confusion, outrage, and multiple lawsuits.
Using test scores for all teachers was poor policy and proved to be even worse politics for reformers: it exacerbated the anti-testing backlash and contributed to the rollback of federal power in the new education law. That, in turn, has led many state policymakers to try to reduce or remove student growth from teacher evaluation systems.
There is a simple solution to the problem of overtesting and unfair attribution of test scores: Evaluate teachers by test scores only if there is a valid test for doing so, one that rigorously isolates a teacher’s impact on student growth. Hastily creating new assessments is usually unwise.
Take the professional growth aspect of teacher evaluation seriously — systematize it
The New America report says, “For the most part, states have prioritized getting evaluation systems up and running and are only beginning to think about using them to promote ongoing teacher learning and growth.”
The research has not yet clearly identified how to use teacher evaluation systems as a tool for improving teacher practice. However, there is encouraging new evidence that when highly rated teachers work with poorer performers, the latter group improves.
Another study found that Chicago’s teacher evaluation pilot, which provided extensive training for principals to revamp how they observed teachers, had a positive impact on student achievement in its first year (but not in its second, when it expanded but received less budgetary and central office support for school leaders).
While there’s still a lot we need to learn, it’s clear that states and districts should create systems to help struggling teachers improve, provide support and training for evaluators, and not expect to get this done on the cheap.
Don’t rely on models that leave no room for principal discretion
Most states have systems that assign a fixed value to each part of the evaluation.[1] For example, 50 percent might be based on principal observations, 35 percent on student test scores, and 15 percent on student surveys. Sum the separate scores and out pops a rating.
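
To make the arithmetic concrete, here is a minimal sketch of a fixed-weight rating. It uses the 50/35/15 split from the example above; the 1-to-4 scoring scale, the teacher's component scores, and the function name are assumptions made purely for illustration, not any state's actual formula.

```python
# Weights from the example in the text; real state systems differ.
WEIGHTS = {"observation": 0.50, "test_scores": 0.35, "student_survey": 0.15}


def summative_score(component_scores, weights=WEIGHTS):
    """Return the weighted sum of component scores on a shared scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical teacher, scored on an assumed 1-4 scale for each component.
teacher = {"observation": 3.0, "test_scores": 2.0, "student_survey": 4.0}
score = summative_score(teacher)
print(round(score, 2))  # 0.50*3.0 + 0.35*2.0 + 0.15*4.0 = 2.8
```

In this sketch the weak test-score component pulls the rating to 2.8 even though the principal's observation score was 3.0, and nothing in the formula lets the principal override that.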

It’s not obvious that this is the best way to do things, though. It constrains the principal’s judgment and discretion: she may believe a component of the evaluation to be misleading, for instance, but can do nothing to adjust it.

Some may argue that a mechanical model provides needed principal-proofing, but there is research suggesting that principals typically make smart personnel decisions. Given their accountability for school performance, it’s worth experimenting with less rigid systems that engender rather than diminish principal autonomy.

Pay attention to how evaluation affects the teacher labor market
Many pundits suggest that tougher accountability and evaluation systems have contributed to what some see as a nationwide teacher shortage. There is zero empirical evidence to support this claim, to my knowledge.

However, it is certainly possible that recent evaluation systems have made teaching less appealing in some circumstances — high-poverty schools, for instance, which already often struggle to recruit and retain teachers in part because of poor working conditions. Teachers in these schools are generally at greater risk of being identified as low-performing, and potentially fired, under new evaluation systems. Making the teaching profession riskier, in perception or reality, may make it less appealing.

Some lessons may be drawn from Washington, D.C., which has been among the most aggressive in identifying and dismissing struggling teachers in disadvantaged schools. Researchers have found that the district has been able to replace poor performers with better ones, perhaps in part because of high salaries differentiated by performance and school population. D.C. public schools have also developed performance screens when hiring that seem to be helpful in determining who will be effective in the classroom.

Districts with aggressive evaluation systems that generate more teacher dismissals should pay particular attention to this issue, and ought to consider pairing evaluation reform with higher salaries or other efforts to make the job more appealing.


1. A handful of states use a ‘matrix’ model in which scores on two dimensions are combined to create a summative rating. This is essentially a cruder version of a percentage-based system.
