Test Scores and Teacher Evals: A Complex Controversy Explained

1/15 What started the movement to evaluate teacher performance by test scores?

For the last several decades, as far back as the 1970s, researchers have been estimating teachers’ impact on their students’ standardized test scores, and have found substantial variation among teachers. The students of some teachers consistently made more growth on tests than the students of other teachers did. They also generally found that teachers’ ability to raise test scores was only modestly correlated with concrete characteristics, such as experience and education.
This helped galvanize a focus among school reformers on improving teacher quality — often referred to as the most important in-school factor affecting student achievement — as well as the belief that long-standing certification and personnel practices were falling short.
In 2009, the reform-minded nonprofit TNTP (then known as the New Teacher Project) released an influential report called the “Widget Effect,” which studied a dozen school districts in four states. It found that “on paper, almost every teacher is a great teacher.” In other words, districts’ evaluation systems did not differentiate between good and bad teaching, resulting in limited help for struggling teachers or recognition for high-performers. The report recommended that districts “adopt a comprehensive performance evaluation system” that would rate teachers “based on their effectiveness in promoting student achievement.”
The report, along with other evidence, made an impression on the Obama administration, which used its newly created Race to the Top initiative to award money to states that created teacher and principal evaluation systems that included “student growth,” i.e., test scores. The administration has also used waivers from the tough requirements of the federal No Child Left Behind law to spur similar policies. One state, Washington, lost its waiver because it refused to use test scores in teacher evaluation.

2/15 Has the move to evaluate teachers based on test scores created a backlash?

Yes, indeed.
The move to judge teachers by student growth has helped lead to a proliferation of new tests, partly because of the desire to evaluate teachers in grades and subjects that have traditionally lacked standardized tests, such as social studies and grades K–2. Frustration about overtesting spurred an opt-out movement in many states across the country, most prominently in New York, where roughly one in five students declined to sit for the most recent state test. Teachers unions have also pushed back forcefully against testing.
The Obama administration has vowed to reduce testing, but says that student growth should remain a part of teacher evaluation systems. It is unclear, however, whether such plans will actually lead to fewer tests.

3/15 Are all teachers now evaluated on test scores?

Not all, but most.
A 2013 estimate by the National Council on Teacher Quality found that 41 states now require student test scores to be a part of teachers’ evaluations, up from just 15 states in 2009.

4/15 What are value-added measures or VAM?

Value-added measures (VAM) are among the most common ways to evaluate teachers using test scores. They are statistical growth models that attempt to isolate a given teacher’s impact on (or ‘added value’ to) student learning. The models work by comparing a student’s estimated score on a standardized test to the student’s actual score; the difference between the two is the value the teacher is estimated to have added for that student. The estimated score is based on the student’s past test scores and sometimes other factors such as poverty and disability status. A teacher’s overall VAM score is computed by averaging the value added across all of his or her students. Not all teachers receive VAM scores, since they are computed only for those who teach a grade and subject that ends in a standardized test. In New York, for instance, about one in five teachers received a growth rating from the state.
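To make the arithmetic concrete, here is a minimal sketch of that calculation in Python. It is a deliberate simplification, not any state’s actual model: the student records are invented, the estimated score comes from a straight-line fit on the prior-year score alone, and the function names (`fit_line`, `value_added`) are hypothetical. Real VAMs typically use several years of scores, additional student characteristics, and statistical adjustments that are omitted here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, prior-year score, current-year score).
students = [
    ("Teacher A", 60, 70),
    ("Teacher A", 70, 78),
    ("Teacher A", 80, 92),
    ("Teacher B", 60, 62),
    ("Teacher B", 70, 74),
    ("Teacher B", 80, 80),
]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def value_added(records):
    """Average (actual - estimated) score for each teacher's students."""
    priors = [prior for _, prior, _ in records]
    actuals = [actual for _, _, actual in records]
    intercept, slope = fit_line(priors, actuals)

    differences = defaultdict(list)
    for teacher, prior, actual in records:
        estimated = intercept + slope * prior   # score predicted from past performance
        differences[teacher].append(actual - estimated)
    return {teacher: mean(d) for teacher, d in differences.items()}

scores = value_added(students)
print(scores)  # {'Teacher A': 4.0, 'Teacher B': -4.0}
```

In this toy data set, Teacher A’s students score four points above their estimates on average and Teacher B’s score four points below, so the two teachers receive value-added scores of +4 and -4.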

5/15 Are teachers who don’t receive VAM scores still evaluated based on tests?

In many cases, yes. There are two commonly used options for such teachers:
  • Group measures of performance, in which teachers are evaluated based on test scores of students or subjects they don’t teach. A common example is teachers being judged on the entire school’s math or English score even if they teach, say, art. This occurred in Florida, Tennessee, New Mexico, and New York and generated significant controversy.

  • Student learning objectives (SLOs), in which teachers set goals for student performance on a test, either one they create themselves or a standardized one. The goals are approved by their supervisor, who then assesses the teacher based on how well the students meet those goals. One study of schools in Austin, Texas, found no correlation between a teacher’s SLO score and his or her VAM score, while another study in Denver, Colorado, found a moderate correlation. These results may arise because SLOs and VAMs assess different aspects of teacher quality, but they might also call into question whether SLOs are valid measures of teacher performance.

6/15 Are there different types of growth models?

There are.

VAMs are among the most common. Another common model is known as student growth percentile, which, like VAM, measures student test score growth, but with a different mathematical technique. These models rank students with similar prior achievement based on how much growth they make. Such models, unlike VAM, often do not include controls for student characteristics like poverty, and so may unfairly disadvantage teachers of at-risk students.
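The ranking idea behind growth percentiles can be sketched as follows. This is only an illustration under stated assumptions: the records are invented, “similar prior achievement” is approximated by grouping students into 10-point bands of prior score, and each teacher is summarized by the median percentile of his or her students. Operational models generally use more sophisticated statistical techniques than this simple banding.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (student, teacher, prior score, current score).
records = [
    ("s1", "Teacher A", 45, 58),
    ("s2", "Teacher A", 43, 50),
    ("s3", "Teacher B", 41, 46),
    ("s4", "Teacher B", 47, 57),
    ("s5", "Teacher A", 85, 91),
    ("s6", "Teacher A", 83, 86),
    ("s7", "Teacher B", 81, 85),
    ("s8", "Teacher B", 87, 89),
]

def growth_percentiles(rows, band_width=10):
    """Rank each student's gain against peers with a similar prior score."""
    peer_groups = defaultdict(list)
    for student, teacher, prior, current in rows:
        # Students whose prior scores fall in the same band are treated as peers.
        peer_groups[prior // band_width].append((student, teacher, current - prior))

    result = {}
    for peers in peer_groups.values():
        gains = [gain for _, _, gain in peers]
        for student, teacher, gain in peers:
            at_or_below = sum(1 for g in gains if g <= gain)
            result[student] = (teacher, 100 * at_or_below / len(gains))
    return result

def teacher_medians(percentiles):
    """Summarize each teacher by the median growth percentile of their students."""
    by_teacher = defaultdict(list)
    for teacher, pct in percentiles.values():
        by_teacher[teacher].append(pct)
    return {teacher: median(p) for teacher, p in by_teacher.items()}

sgp = growth_percentiles(records)
teacher_scores = teacher_medians(sgp)
print(teacher_scores)  # {'Teacher A': 75.0, 'Teacher B': 50.0}
```

Note that nothing in this sketch adjusts for poverty or other student characteristics. Students are compared only with peers who started at a similar score, which is the feature the paragraph above flags as a potential disadvantage for teachers of at-risk students.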

Different VAMs also use different variables and demographic factors to create students’ estimated scores. In general, models that account for more student characteristics do a better job of ensuring a level playing field for teachers of academically challenged students.

Some models compare teachers only to other teachers in the same school, though most compare teachers across a given state.

Generally, different models produce at least somewhat similar results.

7/15 What are some potential uses of VAM?

The most controversial question is whether to use VAM for individual teacher evaluation. It can be — and often is — used for other purposes as well. For example, it has long been used for research in order to evaluate the effectiveness of a given program or look for teacher characteristics that are associated with student achievement. Some advocate that VAM also be used for evaluation of principals, schools, and teacher preparation programs.

8/15 What are some of the arguments for and against using VAM in teacher evaluation?

Significant debate exists as to whether (and to what extent) VAM should be used in individual teachers’ evaluations.
Proponents of VAM argue that it directly measures teachers’ effects on student achievement, is free of some of the bias found in other measures of teacher performance (like principal observations), is connected to long-run student outcomes, and is particularly effective in identifying high- and low-performing teachers.
Skeptics of VAM argue that scores fluctuate significantly from year to year, that test scores provide a narrow sense of teacher quality, and that attaching stakes to tests will lead to teaching to the test and even cheating.

9/15 Is VAM a valid measure of teacher performance?

This is a controversial and complex question, and the research to date has not reached a clear conclusion. The answer also depends on subjective views on which student outcomes are important and how they should be measured. Even among those who support VAM in teacher evaluation, there is disagreement about how heavily it should be weighted.
There is evidence that teachers who have high VAM scores produce lasting gains for their students in terms of college enrollment and adult earnings (though these results have been challenged). On the other hand, it is clear that teachers’ influence extends well beyond test scores. Multiple studies have shown that teachers can affect students’ non-cognitive skills and behaviors (such as attendance, discipline, etc.) and that teachers who do well in this aspect are not necessarily the same ones who raise test scores the most.
VAM also tends to be modestly but not highly correlated with other measures of teacher quality, though some studies have found no correlation. Together, this research suggests that although VAM does not capture all aspects of quality teaching, it is capturing at least some meaningful information.
There are also concerns about whether VAM can accurately measure teachers who work with students who are particularly high- or low-performing, though some research suggests this is rarely a major problem. Researchers also disagree about whether the way students are assigned to classrooms can bias VAM scores.
It is also probably fair to assume that validity varies from test to test — a low-quality exam is unlikely to be a particularly strong measure of teacher performance or student knowledge.
Note that ‘validity’ here is used in the statistical sense: how well a measure captures what it purports to measure, which in this case is teacher effectiveness.

10/15 Is VAM reliable?

VAM scores can and do fluctuate from year to year, and much of this fluctuation is the result of imprecise measurement (also known as “error”). For example, one study found that 57 percent of teachers who were in the bottom fifth of performance in one year had moved to another level in the subsequent year, and 8 percent of the bottom-level teachers were in the top performance category in the following year. In general, the year-to-year correlation ranges between .2 (weak) and .7 (fairly high); see note 1 below.

The reliability tends to be higher for math teachers than for English teachers. Some (but not all) of this instability can be addressed by averaging multiple years of data. The year-to-career correlation of a given teacher’s VAM is significantly higher than the year-to-year correlation, ranging from .55 (medium) to .78 (high) in one study.
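A small simulation can illustrate why averaging helps. Everything in it is an assumption made for illustration, not a finding from the studies cited: each simulated teacher has a fixed “true” effect, each year’s score is that effect plus random measurement error, and the error is assumed to be larger than the spread in true effects. The helper `pearson` computes an ordinary correlation coefficient.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)           # fixed seed so the run is repeatable
N_TEACHERS = 2000
TRUE_SD, NOISE_SD = 1.0, 1.5     # assumed spread of true effects and of yearly error

true_effect = [rng.gauss(0, TRUE_SD) for _ in range(N_TEACHERS)]

def one_year():
    """One year of simulated scores: true effect plus measurement error."""
    return [effect + rng.gauss(0, NOISE_SD) for effect in true_effect]

def average(year_lists):
    """Average each teacher's scores across several years."""
    return [mean(scores) for scores in zip(*year_lists)]

years = [one_year() for _ in range(6)]

single_year_r = pearson(years[0], years[1])                     # year 1 vs. year 2
three_year_r = pearson(average(years[:3]), average(years[3:]))  # years 1-3 vs. years 4-6

print(f"single-year correlation: {single_year_r:.2f}")
print(f"three-year-average correlation: {three_year_r:.2f}")
```

Under these assumed variances, theory puts the single-year correlation near 0.31 and the three-year-average correlation near 0.57. Averaging shrinks the random error but does not remove it, which matches the point that only some of the instability can be addressed this way.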

Finally, it’s crucial to note that all performance measures have some degree of instability. There is less evidence about the reliability of alternative measures, but what exists generally suggests that principal observations are somewhat more stable over time than VAM. Stability (reliability) does not imply validity, however. A measure could be consistent over time, like a teacher’s height, yet not be a valid basis for judging how well that teacher teaches.

Note that ‘reliability’ here is used in the statistical sense, meaning a measure’s consistency.

1. In statistical terms a correlation coefficient ranges between -1 and 1. A correlation of 0 means there is no association whatsoever; 1 means a perfect correlation; and -1 means a perfectly negative correlation.
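For readers who want the formula, the most commonly used version is the Pearson correlation coefficient, shown below. This assumes the studies referenced here report Pearson correlations, which the text does not specify. Here x_i and y_i would be a teacher’s scores in two different years, and the bars denote averages across all teachers.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```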

11/15 Does using tests for high-stakes decisions in teacher evaluation lead to negative unintended consequences? Will it lead to positive consequences?

We don’t know for sure yet. Both are possible, and there is some evidence suggesting both positive and negative outcomes.

There is research showing that holding schools accountable for student test scores has led to cheating and teaching to the test. At the same time, there is evidence that test-based accountability for schools has in many circumstances increased student achievement both on high-stakes tests — like the yearly standardized tests — and on low-stakes exams, like the National Assessment of Educational Progress test given every two years.

However, the gains on the low-stakes tests are often not as dramatic as those on the high-stakes exams, which raises the question of whether some of the high-stakes gains reflect teaching to the test or cheating rather than real learning.

Schools can adopt policies that reduce cheating and there may be ways of designing tests to make teaching to them difficult.

12/15 Didn’t the American Statistical Association (ASA) say that VAM should not be used?

Not quite, even though some news outlets have reported it that way. The ASA, the country’s largest organization of statisticians, does urge significant caution in how VAM is used.
The ASA put out a statement summarizing research on VAM, but the statement does not say at any point that VAM should not be used. In fact, the statement says, “When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes.” The statement warns, “Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.” The statement adds, “Ranking teachers by their VAM scores can have unintended consequences that reduce quality.”
Some researchers in the field have challenged parts of ASA’s statement, suggesting the group left out recent research that addressed many of its own concerns about VAM.

13/15 Has the use of VAM led to improved results for students?

It’s too early to tell.

There have been relatively few studies on how the use of VAM in districts and schools affects students. The few pieces of research that do exist offer reasons for both caution and optimism.

  • A study found that providing districts with value-added data did not lead to improved student outcomes (relative to similar districts that did not have access to such data).

  • A study that offered teachers with high VAM scores a $20,000 bonus for transferring to a high-poverty school produced significant student achievement gains in elementary grades but no effect in middle school.

  • A study of New York City’s tenure system — which was made more rigorous, partly by using VAM scores — found that the reforms likely led to improvements in teacher quality.

  • A study in which a group of New York City principals were given VAM scores produced small improvements in student achievement (relative to students of principals who were not given such data).

14/15 What do teachers unions say about using test scores in teacher evaluations?

Teachers unions have generally been skeptical about the use of test scores in teacher evaluation, and such skepticism has increased in recent years.
Randi Weingarten, president of the American Federation of Teachers (AFT), originally expressed openness to the use of test scores in teacher evaluation, saying in 2010 that student progress should be used alongside other measures; Weingarten also backed a Colorado law that required half of teachers’ evaluations to be based on student assessments. However, in 2014 Weingarten came out strongly against VAM, saying its use had led to an overemphasis on testing as well as high-profile errors.
The National Education Association (NEA) has followed a similar path. In 2011, the union passed a resolution signaling openness to using test scores in teacher evaluation in theory. But union leaders at the time said that no tests were high-quality enough to be used for that purpose in practice. The NEA backed further away from the practice in 2014, saying that “standardized tests, even if deemed valid and reliable, may not be used to support any employment action against a teacher.” NEA president Lily Eskelsen Garcia has been a sharp critic of standardized testing, referring to VAM as “voodoo.”  

15/15 Where can I find additional information about VAM?