Dynarski: When Done Right, Standardized Tests Really Do Reflect What a Student Knows
A new school year is in full swing, so schools are beginning to think about whether students are on track to meet their state’s academic standards. They also are likely thinking ahead about state achievement tests, the independent and objective exams students take to determine whether they are learning at the appropriate grade level.
At the same time, some parents might be wondering if they should opt their child out of those tests. As an individual act, opting out of tests is like opting out of annual medical checkups — it yields no information and does not make one healthier. As a collective act, opting out erodes what can be learned from test scores. If parents of high-performing students in a school all opt out, that school’s average score will be lower (and vice versa for low-performing students). Who is being helped is unclear.
What exams test reflects what states want their students to learn — the standards. Comparing average scores across schools and districts is possible only because students take the same test. In measuring what students know, tests are a tremendous asset, providing important and reliable information that cannot be learned in other ways.
To be sure, not many people look forward to taking tests. But when we look on our doctor’s wall and see a plaque that she is board-certified, we might think certification is a good thing. The doctor passed a test. Or a lawyer might be admitted to the bar, meaning he’s passed the state’s bar exam — a test. A certified public accountant will have passed a battery of tests. Nobody wants a doctor, lawyer, or accountant whose knowledge is below the accepted standard.
The same should be true in schools. So it is useful to take a step back and ask an important question: Does the test score reflect what a child knows? If so, it’s doing what it is designed to do. Exam scores are often derided as being the result of “teaching to the test,” but what people who use that phrase are really complaining about is rote and lifeless teaching. Lifeless teaching and teaching to the test are two different things. Real teaching to the test is central to effective teaching, as long as the exams reflect what students are supposed to learn.
How tests are created
To parents who are not educators, the process of creating standardized tests might seem like a big black box. In fact, it’s a rigorous and highly scientific process, one that has been developed over 100 years and reflects research by generations of esteemed scholars. It has its own subfield, psychometrics, and every year universities graduate new Ph.D.s in that subfield.
We can think about large-scale test development by first thinking about test development in miniature. Consider how a high school teacher might go about designing a test related to, say, linear equations in algebra.
The teacher delivered a certain amount of material on the subject in the form of classroom instruction, homework, and other assignments like group activities or online lessons. Crucially, what the teacher taught should relate to a set of content standards that each state has developed and usually posts online.
Here’s an algebra standard from the Common Core: Solve linear equations and inequalities in one variable, including equations with coefficients represented by letters. (Interested readers can see other standards for algebra here and can find all the standards here.)
To a teacher, this standard indicates that her students should be able to do this: For the equation 3x + 4 = 13, determine that x equals 3. Or, if the equation is ax + b = c, be able to solve for x equals (c – b)/a. In this second equation, coefficients are represented as letters, as the standard calls for.
A teacher wanting to know whether her students meet the standard might reasonably ask them to solve 15y + 10 = 40. The test question includes a subtle difference — students need to solve for y rather than x — but it’s still an equation with a single variable. The teacher might up the difficulty level a bit by asking students to solve y + 3y + 10 = 50. Students need to add the two y terms, but it’s still an equation with a single variable.
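The arithmetic behind these questions can be sketched in a few lines of Python. This is purely an illustration of the worked examples above, not part of any actual test-development tool:

```python
def solve_linear(a, b, c):
    """Solve a*x + b = c for x, following the standard's ax + b = c form:
    x = (c - b) / a."""
    return (c - b) / a

# The worked examples from the text:
print(solve_linear(3, 4, 13))    # 3x + 4 = 13   ->  x = 3.0
print(solve_linear(15, 10, 40))  # 15y + 10 = 40 ->  y = 2.0

# y + 3y + 10 = 50: combining the like terms gives 4y + 10 = 50
print(solve_linear(4, 10, 50))   # ->  y = 10.0
```

Each test question is the same operation with different coefficients, which is exactly why a question like 15y + 10 = 40 still measures the standard even though the teacher never taught that specific equation.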
Those hard questions need to be there
If the test contains too many difficult questions, and no students get any answers right, the exam has what test designers call a floor. With all students scoring 0, the teacher cannot distinguish what her students know: The floor blocks the teacher from knowing which students have a low level of knowledge and which have a high level.
Similarly, if the teacher makes the test so easy that many students get all the answers right, she will have created a ceiling effect. Some students have true abilities above their score, but the ceiling blocks the teacher from knowing it because when all test scores are 100 percent, all students look the same.
To avoid ceiling and floor effects, tests need questions with higher and lower degrees of difficulty. Students with strong knowledge are able to answer the harder questions; students with weaker knowledge are not.
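A toy comparison makes the floor and ceiling argument concrete. The student scores below are hypothetical numbers invented for illustration; the point is only that a test where everyone gets the same score cannot rank anyone:

```python
# Hypothetical percent scores for five students under three test designs.
too_hard = [0, 0, 0, 0, 0]             # floor effect: every student scores 0
too_easy = [100, 100, 100, 100, 100]   # ceiling effect: every student scores 100
mixed    = [35, 50, 60, 80, 95]        # mixed difficulty spreads students out

def distinct_levels(scores):
    """How many distinct score levels the test produces --
    a rough measure of how finely it can distinguish students."""
    return len(set(scores))

print(distinct_levels(too_hard))  # 1: all students look identical
print(distinct_levels(too_easy))  # 1: all students look identical
print(distinct_levels(mixed))     # 5: every student is distinguishable
</```

Only the mixed-difficulty test separates strong students from weak ones, which is why test designers seed exams with questions across the difficulty range.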
The same development process is used on a much larger scale for assessments such as the PARCC test (the acronym stands for the Partnership for Assessment of Readiness for College and Careers), Smarter Balanced, and the Texas STAAR test, the State of Texas Assessment of Academic Readiness. Here, the standards are the starting point for designing the tests (here is a visualization of the process).
Educators and testing experts develop banks of questions related to those standards, such as the algebra questions above. These are scrutinized to ensure they test what the standards call for and that their wording is clear and not inappropriate or biased against any race or gender.
For example, a math question that involves calculating a baseball player’s batting average might pose an issue for students who do not play baseball or are unfamiliar with its rules. Revising the wording to be about calculating a simple average without referencing baseball might be the solution, as might simply substituting another question.
Painstaking efforts are invested in these tests. Each question on the PARCC exams, for example, is reviewed by 30 or more people before it is used. Questions that make the cut are then pilot-tested in 14 states and nearly 16,000 schools. Smarter Balanced follows a similar process, testing more than 5,000 items in 21 states and more than 5,000 schools.
At any grade level, the test is likely to include at least a few really hard questions that may seem well beyond the abilities of students in that grade. Students (and teachers) tend to remember these kinds of questions, but they are not on the test simply to create pain and discomfort. Rather, seeding the test with difficult questions avoids the ceiling effect and helps distinguish between students who have a basic level of proficiency and those at an advanced level. There might still be students who get all the questions wrong or right, but the design of the tests makes it unlikely to happen.
Some complaints are deserved, some are not
Some aspects of tests deservedly draw complaints. For example, test-score reports to parents often are laden with statistical jargon such as norms, percentiles, normal curve equivalents, stanines, lexiles, and proficiency levels that are based on … who knows what.
A parent whose child scores at the 65th percentile in fourth grade and the 65th percentile in fifth grade might wonder whether he or she is standing still. The child isn’t — in fact, the student has learned a year’s worth of material, because the child scored better than 65 percent of fourth-graders and then, a year later, better than 65 percent of fifth-graders. But test designers have done themselves no favors by providing reports that require parents to grapple with statistical concepts to make sense of scores.
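The percentile logic in that example can be spelled out in a short sketch. The cohort scores and the child’s score of 82 below are made-up numbers chosen so the child lands at the 65th percentile described above:

```python
def percentile_rank(score, cohort_scores):
    """Percent of the cohort scoring strictly below the given score."""
    below = sum(1 for s in cohort_scores if s < score)
    return 100 * below / len(cohort_scores)

# A hypothetical cohort of 20 students; one child scores 82.
cohort = [55, 60, 62, 65, 68, 70, 71, 73, 74, 75,
          76, 78, 80, 83, 85, 87, 88, 90, 92, 95]

print(percentile_rank(82, cohort))  # 65.0: better than 65% of the cohort
```

A child at the 65th percentile in both fourth and fifth grade is holding the same position against a cohort that has itself learned a year of material, which is why the flat percentile does not mean the child stood still.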
Some parents might see a child’s test score and think it must be wrong because they know their child is better at math (or reading, or science) than that. Perhaps the student had an off day — illness, family distraction, a skipped breakfast. Parents should view scores against the backdrop of other indicators of how their child is doing in school, such as grades on report cards.
Parents also may be concerned that their child’s test scores are used as a basis for evaluating their child’s teacher, a development in the past decade that emerged in response to pressure on states and school districts to raise test scores. Won’t their child’s teacher care more about the score than about their child?
Well, no — most systems for rating teachers give only a moderate weight to scores, while organizing and managing classrooms get more weight. And the notion that teachers caring about higher scores is a bad thing reflects a topsy-turvy view of education, in which teachers accomplishing their objectives — having their students learn what’s in the standards — is somehow a problem.
Some debates about tests are built on a flimsy basis. For example, annual state tests do not take up a lot of instructional time. Parents may be concerned that their child’s teacher is teaching to the test, but, as described above, tests are designed to measure knowledge about the same standards on which classroom curricula are based.
So, as this school year unfolds, let’s remember why states test students: to see whether they are learning at the appropriate grade level. And let’s understand that tests are created through a reliable process, much like the exams our doctors, lawyers, and accountants must take.
Mark Dynarski, founder and president of Pemberton Research, is an education fellow at the George W. Bush Institute.