The War Over Evaluating Teachers—Where it Went Right and How it Went Wrong

The tale is an ominous one: A 17-year classroom veteran with a doctorate in education and glowing evaluations from her principal and superintendent is unfairly evaluated by a computer,  based on an “assembly line statistical rating system” that amounts to a “black box, which no one can or will explain.”
That’s the lament of Sheri Lederman, a fourth-grade teacher in Great Neck, who is now suing the state of New York.
The media loves it. Washington Post blogger Valerie Strauss warns readers, “How can a teacher known for excellence be rated ‘ineffective’? It happens — and not just in New York.” A Business Insider story claims that the case “exposes a huge flaw in many U.S. schools.” A piece in the Albany Times-Union says Lederman’s students’ tests are “above average” but somehow she still “fails.” An Atlantic article explains,  “Teachers who received ineffective ratings for two consecutive years could face an expedited dismissal process — a fate that Lederman now fears might soon be her own.”
The message is clear to all teachers: the computer might get you next.
In fact Lederman’s case — along with the push to more rigorously evaluate teachers —  is much more complex than many of these articles suggest. Still, her fate is an important part of a long-running debate about how to fairly evaluate teachers. The Obama administration has successfully reshaped how teachers across the country are evaluated.
So how has it gone?
There appear to have been major benefits in the form of a stronger focus by teachers and principals on classroom instruction. But the systems have failed in a stated objective, identifying significant numbers of ineffective teachers, while also stoking concerns about testing.
In the five years since New York established its evaluation law and created an expedited process for removing teachers with poor ratings, one teacher — a single teacher — has been dismissed under the new law, according to state Education Department records.
At the same time, the system has also drawn the ire of the state teachers union, which has encouraged students to opt out of state tests  used to evaluate its members. Many New York parents took heed, with one in five students refusing this spring to sit for the state test.
Call it a mixed bag then: The positives are real, but so are the concerns. The mechanical, test-based evaluations that many states adopted in the wake of federal incentives have left educators feeling disempowered and have driven a fierce backlash to school reform.
Where every teacher is great
The impetus for much of evaluation reform was an influential study, which found widespread inflation in teachers’ evaluations. The 2009 report — titled “The Widget Effect” and put out by the reform-minded nonprofit TNTP (called at the time “The New Teacher Project”) — examined a dozen districts in four states and found that “on paper, almost every teacher is a great teacher.”
Districts’ evaluation systems failed to differentiate between good and bad teaching, meaning struggling educators got limited feedback and high-performers went unrecognized, according to TNTP.
The way to differentiate teachers, the report said, was to develop an evaluation that “fairly, accurately and credibly” judged their effectiveness in promoting student achievement. The report recommended, “fairly but swiftly remov[ing] consistently low-performing teachers.”
The study made a mark. Secretary of Education Arne Duncan cited it in remarks to the National Education Association, saying, “A recent report … found that almost all teachers are rated the same. Who in their right mind really believes that? We need to work together to change this.”
Duncan then took action, using money from the administration’s $4.35 billion Race to the Top program and then waivers from No Child Left Behind’s onerous student performance requirements to persuade states to adopt specific evaluation reforms. The feds wanted more performance categories, more classroom observations, and evaluations based in part on student test scores.
Lederman’s complaint that she was the victim of a Kafkaesque judging system came out of this effort to end the widget effect.
Capturing a teacher’s value(-added)
The most controversial aspect of the spate of new evaluations is how they incorporate student test scores, which is what Lederman is suing over.
Most states now require that “student growth” be one part of teachers’ evaluations. For teachers in grades and subjects with state tests (grades 4–8 in math and English), that means a complex growth measure, often called a value-added model, or VAM.[1] (Read The Seventy Four’s flashcards explaining value-added and the push to evaluate teachers by test scores.)
The model is designed to isolate teachers’ impact on student learning, or what value they add to student test scores. It’s done by statistically predicting how students will score on standardized tests, based on past performance and sometimes factors such as poverty and special education status.  The difference between a student’s predicted score and the actual one is considered a teacher’s value-added. A teacher’s overall rating comes from averaging this value-added measure for all of his or her students.
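The arithmetic described above can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions: it predicts each student's score from a single prior-year score using an ordinary least-squares line, whereas real value-added models typically use more predictors and more elaborate statistics. The function names and the sample data are hypothetical and do not come from any state's actual system.

```python
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def value_added(records):
    """records: list of (teacher, prior_score, actual_score) tuples.

    Returns each teacher's average gap between students' actual scores
    and the scores predicted from their prior performance.
    """
    prior = [r[1] for r in records]
    actual = [r[2] for r in records]
    intercept, slope = fit_line(prior, actual)

    gaps = defaultdict(list)
    for teacher, p, a in records:
        predicted = intercept + slope * p
        gaps[teacher].append(a - predicted)  # actual minus predicted
    return {teacher: mean(g) for teacher, g in gaps.items()}


# Hypothetical data: teacher A's students beat their prior scores by 5 points,
# teacher B's students fall 5 points short.
SAMPLE = [
    ("A", 50, 55), ("A", 60, 65), ("A", 70, 75), ("A", 80, 85),
    ("B", 50, 45), ("B", 60, 55), ("B", 70, 65), ("B", 80, 75),
]

if __name__ == "__main__":
    print(value_added(SAMPLE))
```

In this made-up example, teacher A ends up with a positive value-added score and teacher B with a negative one of the same size, because the prediction line is fit to all students together and each teacher is judged against it.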
An often-cited study shows that teachers’ value-added scores are connected to long-run student outcomes such as adult income and college attendance. On the other hand, a given teacher’s value-added scores can jump around significantly from year to year, suggesting to some that they are not reliable enough to drive critical personnel decisions.[2]
These statistical measures are subject to a fierce debate among researchers. Some say they’re very important; some say not to use them at all; some say use them a little.
“The evidence is fairly clear that there’s useful information in value-added measures and we should be using them to make smarter personnel decisions,” Cory Koedel, a professor at the University of Missouri-Columbia, told The Seventy Four.
Jesse Rothstein, a Berkeley economist who has written extensively about value-added’s flaws, said, “There are real problems using (value-added) to evaluate teachers…(They) provide some information, but it's limited information and it ought to feed into more subjective evaluations.”
Susan Moore Johnson, a Harvard professor, is similarly skeptical. “Among scholars, there’s very limited support for using value-added methods for any high-stakes decisions about individuals,” she said. “I certainly think that student achievement should be a component (of teacher evaluation); I’m not clear that it should be measured by standardized tests.”
Perhaps even more questionable than value-added, which has been extensively studied, are other techniques for incorporating student test scores in teacher evaluation. These include judging teachers based on tests in subjects they don’t teach, such as an art teacher evaluated by school-wide English scores. Another method is called student learning objectives, in which teachers are scored based on whether students meet certain goals set by the teacher (and approved by the evaluator), often based on tests created by the teacher. This approach hasn’t been as carefully studied as value-added, and what research does exist is not particularly promising.
Outside the academic world, tests as part of evaluation are disliked by many teachers, who complain that the scores can’t fairly capture their contribution to student learning. A 2014 Gallup poll found that nearly nine out of 10 teachers said linking student test scores to their evaluations was unfair.
Evaluation reform has also led to a proliferation of new tests, as states and districts scramble to create assessments to evaluate teachers in traditionally non-tested subjects, such as music and social studies, or in early elementary and high school grades. This in turn has fueled concerns from teachers and parents over whether too much time is spent on testing.
Teachers unions have long been wary of having their members judged by test scores, though at one point opposition was waning. In 2010, American Federation of Teachers President Randi Weingarten gave a speech supporting using student growth measures as part of teacher evaluations and backed a Colorado law that bases 50 percent of evaluations on test scores. She also wrote a sympathetic foreword to a book by economist Doug Harris that suggested that value-added measures might have a useful, though limited, role in education.
Times have changed. Under pressure from unhappy AFT members, Weingarten did an about-face and now sharply opposes the measures. She declared that value-added “is a sham.”
Dan Weisberg, the CEO of TNTP, criticized this position, saying that instead of trying to throw the system overboard, unions should have worked to improve the implementation challenges that exist. “That’s what leadership is about.”
He added that test scores are imperfect, but valuable.
“What’s the error rate with the old New York State evaluation system where principals would come by once a year with a clipboard and rate everybody as satisfactory?” he asked rhetorically.
Where every teacher is still great
Despite their intentions, new evaluation systems across the country have generally failed to smoke out bad teachers at the rate that some reformers hoped.
In almost every state where evaluation reform has been implemented, the vast majority of teachers end up with a good rating. In Delaware, 99 percent of teachers were rated effective or highly effective; in Ohio, just 0.4 percent of teachers got the lowest mark, and in New York it was 1 percent.
What you make of this depends on where you sit.
Weisberg said, “The ‘widget effect’ is still alive and well in a lot of places,” pointing to inflated evaluations as “a strongly ingrained piece of the culture” in schools and districts.
Linda Barker, director of teaching and learning at the Colorado Education Association, the state’s largest teachers union, says the focus shouldn’t be on how many teachers receive poor ratings.
When her state’s evaluation law passed, it was based on a “competitive approach to rank and sort us.” It has changed for the better, she said: “Now it's really about teaching and learning systems that support continuous improvement, both for teachers and principals, but also for the profession.”
Research doesn’t tell us what proportion of teachers ought to be considered poor performing. Simulations from economists have suggested that firing rates of as high as 5 to 10 percent would improve student learning, but other researchers have suggested that these simulations are unrealistic and that such firing rates would make it more difficult to recruit new teachers.
Worse still, some educators, like Barker, fear that making evaluations about firing the worst performers will undermine their role in helping all teachers improve.
Talking about teaching
Controversy aside, there seems to be wide agreement that the push for new teacher evaluations has created a much-needed focus on what quality instruction means.
“It’s absolutely clear that there is much more attention to what happens within classrooms,” Moore Johnson, the Harvard professor, said, “and historically there was no accountability policy.”
Michael Johnston, a Colorado state senator who spearheaded a controversial statewide teacher evaluation law, said, “I think we’ve done a good job of going fast but also going deliberately towards quality implementation so we actually have teachers and principals that are using the tools.”
Barker, whose Colorado Education Association fiercely opposed Johnston’s law in 2010, generally agreed. She compared the new system to her experience as a teacher for 25 years when she was “evaluated very rarely and against no standards or criteria.” In contrast, Colorado’s law has empowered educators to focus on “continuous learning” by creating a “common language” about teaching.
Little research on these questions exists, but what does is fairly encouraging. A Chicago study found gains in achievement as a result of its new teacher evaluation program; another study found that teachers in Chicago believed its system of observations was improving instruction, though they also raised concerns about the use of student test scores. This lines up with a Cincinnati study showing benefits to evaluation.
The good and the bad of evaluation
Critics of the new teacher evaluation system are both right and wrong.
The new evaluations seem to have produced genuine benefits in the form of a focus on quality instruction and feedback to teachers, which often goes unacknowledged. Critics are also generally wrong to suggest that the new regime has been one of “test and punish.”
Lederman’s case is a perfect example. She was not, contrary to a slew of inaccurate news stories, including in the Washington Post and Business Insider, rated “ineffective” overall.
She actually scored “effective,” which means she isn’t in jeopardy of losing her job as a result of the rating.[3] Arguably, the system of multiple measures of evaluation worked in her case.
The latest data nationwide, from the 2010–11 school year, also find that few teachers, particularly tenured ones, are dismissed for poor performance. Maybe that has changed in recent years, but New York’s experience and the small number of teachers rated ineffective across the country suggest that’s extremely unlikely.
It’s not clear why. Perhaps it’s part of an ingrained culture; perhaps principals don’t give low ratings because the mechanism to dismiss low-performers remains overly arduous. Maybe there simply aren’t as many ineffective teachers as reformers thought or maybe principals want to protect teachers from what some see as arbitrary test-based rating.
Critics are right that there has been too much emphasis on testing and far too little attention to whether the scores are valid measures of teacher performance.
Even supporters generally say that basing 50 percent of an evaluation on one year’s test data is too much. Value-added measures are useful, but they capture only one aspect of what it means to be a good teacher. As Mike Petrilli, of the reform-minded Fordham Institute, said in an interview, “If used, [test scores] should be a relatively small percent of what you’re looking at.”  
Unfortunately, several years into this nationwide experiment, it’s not clear what lessons have been learned.
In New York, Gov. Andrew Cuomo deemed the evaluation system “baloney” and pushed through a series of changes designed to make it easier to identify and fire ineffective teachers. The new evaluation system increases the weight of test scores from 40 to 50 percent; brings in independent evaluators from outside the school to help rate teachers; makes it harder for teachers deemed ineffective to appeal their scores; and withholds tenure from teachers who do not receive three effective ratings within their first four years on the job.
New York’s solution, in other words, is to double down on the test-based aspect of the system, and take more power away from principals, whom some see as the root of the every-teacher-is-great problem. As Petrilli put it, speaking broadly about evaluation reform, “I view it as an approach in many ways to try to principal-proof the schools.”
But what if the solution is to move in the opposite direction and empower principals? Teaching is complex; evaluating teaching is too. It should be more than plugging data from test scores and observations into a spreadsheet that spits out a final rating.
And some research actually suggests principals can be good judges of teacher quality. That doesn’t mean principals should have unchecked authority. Many may give out too many high ratings, but the first step to addressing that problem is to understand why. Adding more performance categories or new tests doesn’t seem to be doing the trick, but working to give principals the tools to be instructional and personnel leaders might.
This lack of faith in principals, embedded in the evaluation system, is apparent in the Lederman case. The test score data forced her principal to give Lederman a lower rating than the principal thought was appropriate based on Lederman’s body of work.
If her case shows one thing, it’s not that tests are meaningless or that evaluation reform hasn’t produced benefits. It’s that a policy agenda that hinges on disempowering — rather than empowering — people in schools will likely fall short both politically and practically.
(Disclosure: In my previous job at Educators for Excellence–New York, I worked with teachers to advocate for the state to adopt an evaluation system that would have continued to use test scores but reduced how much they were emphasized.)


1. Although there are different types of growth measures other than VAMs — in fact, New York does not use a traditional VAM, but a different statistical model known as a median growth percentile — they are among the most commonly used and referred to.

2. There are ways to increase the year-to-year reliability of value-added measures by, for example, averaging together multiple years of scores. A handful of states use this approach, but many others don’t.

3. Lederman is suing based exclusively on the fact that she received an ineffective score on the state test portion of her evaluation, which at the time accounted for just one fifth of her overall rating.
