29 Aug Evaluating Teachers with VAM: Variable Ambiguous Mistake
Anyone who has read this blog before is probably aware of my position on the use of value-added measurement (VAM) for teacher evaluation. I have argued many times here, and in Teacher Magazine, that politicians, self-styled education reformers, and members of the general public are ill-informed if they believe that we can use state tests to determine teacher effectiveness. Accomplished California Teachers (ACT) addressed that issue in detail in our report on teacher evaluation, which also featured our recommendations for how California can improve teacher evaluations.
This morning, having read Ken Bernstein’s Daily Kos post on the same topic, I have one more opportunity to address the issue, by looking at a new policy report from the Economic Policy Institute (EPI). The title is “Problems with the Use of Student Test Scores to Evaluate Teachers.” EPI convened ten experts* in the fields of teaching, learning, schools, testing, statistics, economics, and social policy, and their review of the available research yields a powerful consensus:
[T]here is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed.
Of course, many of the advocates of VAM in teacher evaluations are particularly interested in firing the bad teachers. They may talk about helping identify the best teachers and helping all teachers improve, but you don’t have to be more than a casual observer of these debates to have noticed how they relish the prospect of getting tough on teachers. The authors of this report have a response to the notion that VAM will help schools clean house and produce better results:
If new laws or policies specifically require that teachers be fired if their students’ test scores do not rise by a certain amount, then more teachers might well be terminated than is now the case. But there is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones. There is also little or no evidence for the claim that teachers will be more motivated to improve student learning if teachers are evaluated or monetarily rewarded for student test score gains.
Everyone who cares about schools and students should be shouting down the proponents of VAM for teacher evaluation until they produce evidence to counter the research demonstrating all of the problems with that approach. For too long, the politicians and the tough-on-teachers, tough-on-unions education reformers have been able to coast on their sound bites. They tell us we “support the status quo”; they tell us “it’s about the students’ needs, not the adults’.” They rely on the simplistically appealing but incorrect notion that, of course, student test scores measure teaching effectiveness – a notion now disproven on several levels. They tell us that schools need to run more like businesses. The EPI report authors respond:
A second reason to be wary of evaluating teachers by their students’ test scores is that so much of the promotion of such approaches is based on a faulty analogy—the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education.
There are many reasons that VAM fails, most of which I’ve touched upon before, and anyone wanting a more detailed overview can check Ken’s post, or read the report. I just want to call attention to one of the more interesting ones. It turns out that if you analyze the data backwards, VAM can appear to prove that next year’s teacher raised this year’s test scores. Now, of course that can’t be true. If VAM were valid, you would expect it to isolate the effects of teaching that has already occurred, while teaching that hasn’t yet occurred would leave the data “fuzzy” – you wouldn’t observe any results looking at the data that way. If the data can be turned upside down and still appear to show that “future effect,” then that means students are not randomly placed with teachers. The deck is stacked in ways that will bias the results for or against a given teacher.
I had some success with a baseball analogy last week, so I’ll try another. If you wanted to measure the effectiveness of basketball coaches, wouldn’t you need to randomly distribute the players, and account for the variable quality of the rest of the staff, and the facilities? Of course you would. So, Phil Jackson should have as good a chance of guiding the Los Angeles Clippers to a championship as he did the Los Angeles Lakers.
Now, if we were to run win-loss records through the VAM data analysis in the same backwards fashion, consider the results. Do you think next year’s lineup would appear to affect this year’s record? Of course it would, and the reason is obvious: in most cases, there will be considerable overlap. You don’t start each year with an entirely new roster. However, in schools, you do start with a new roster. The report has this to say about the phantom future effect:
Inasmuch as a student’s later fifth grade teacher cannot possibly have influenced that student’s fourth grade performance, this curious result can only mean that students are systematically grouped into fifth grade classrooms based on their fourth grade performance. For example, students who do well in fourth grade may tend to be assigned to one fifth grade teacher while those who do poorly are assigned to another. The usefulness of value-added modeling requires the assumption that teachers whose performance is being compared have classrooms with students of similar ability (or that the analyst has been able to control statistically for all the relevant characteristics of students that differ across classrooms).
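To make that grouping effect concrete, here is a minimal sketch of my own, not anything taken from the report. It assumes a made-up score scale (mean 50, standard deviation 10), three hypothetical fifth grade teachers, and a function name, `future_teacher_gap`, that I invented for illustration. By construction, no fifth grade teacher has any influence on fourth grade scores; the only thing that changes between the two runs is how students are placed.

```python
import random
import statistics

def future_teacher_gap(tracked, n_students=300, n_teachers=3, seed=0):
    """Return the gap between the highest and lowest classroom average of
    FOURTH grade scores, with students grouped by their FIFTH grade teacher.

    All numbers here are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    # Fourth grade scores on a hypothetical scale (mean 50, sd 10).
    scores = [rng.gauss(50, 10) for _ in range(n_students)]

    if tracked:
        # Non-random placement: students are sorted by fourth grade score,
        # so the lowest scorers land with one teacher and the highest with another.
        scores.sort()
    else:
        # Random placement: fourth grade score has no bearing on the class.
        rng.shuffle(scores)

    size = n_students // n_teachers
    class_means = [
        statistics.mean(scores[i * size:(i + 1) * size])
        for i in range(n_teachers)
    ]
    return max(class_means) - min(class_means)

random_gap = future_teacher_gap(tracked=False)
tracked_gap = future_teacher_gap(tracked=True)
print(f"Gap with random placement:  {random_gap:.1f} points")
print(f"Gap with tracked placement: {tracked_gap:.1f} points")
```

With random placement the gap between classrooms should be small, just noise. With tracked placement the same harmless teachers appear to have "caused" large differences in scores earned before the students ever reached them. Real value-added models are more elaborate than comparing classroom averages, so treat this only as a picture of the underlying problem.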
On this point, it appears to me the EPI report authors could go even further in taking apart VAM. Consider the impossibility of ever identifying and controlling for “all” of the factors that could affect the data (especially when dealing with samples as small as one class; larger studies can claim to mute the effects of such variables by working with large samples). Or, lower the bar from “all” to “enough” – and then tell me how you find “enough” without knowing how many there are in the first place. And in actuality, the proponents of VAM would need to account not only for the varying characteristics of students, but also the varying effects of combinations of students, the varying effects of classrooms themselves, and the varying effects of every relevant factor in the school that could affect the teacher, students, or classroom.
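The sample-size part of that argument can also be sketched in a few lines. This is again my own illustration under stated assumptions: every teacher is identical, the true average gain is zero, individual student gains vary with a standard deviation of 10 on a made-up scale, and `spread_of_class_means` is a name I chose for the example.

```python
import random
import statistics

def spread_of_class_means(class_size, n_classes=100, seed=1):
    """Standard deviation of classroom-average gains when every teacher is
    identical and the true average gain is zero (illustrative assumptions)."""
    rng = random.Random(seed)
    class_means = []
    for _ in range(n_classes):
        # Each student's gain is pure noise around zero.
        gains = [rng.gauss(0, 10) for _ in range(class_size)]
        class_means.append(statistics.mean(gains))
    return statistics.stdev(class_means)

one_class = spread_of_class_means(class_size=25)
large_sample = spread_of_class_means(class_size=900)
print(f"Spread of averages, 25 students each:  {one_class:.2f}")
print(f"Spread of averages, 900 students each: {large_sample:.2f}")
```

Even with no real differences among teachers, averages built on a single class of 25 scatter far more widely than averages built on hundreds of students. A large study can lean on that shrinking noise; a judgment about one teacher with one class cannot.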
Meanwhile, more states are winning Race to the Top grants, and celebrating the opportunity to waste money on this misguided approach, as they ignore more cost-effective and proven ways to improve schools and support teachers and students. We lead the industrialized world in child poverty and poor health care, but by all means, let’s pour hundreds of millions of dollars into voodoo methods to pick out the bad teachers and reward the good ones. The report authors conclude their executive summary with this sobering and entirely realistic assessment of the consequences if we continue down this path:
Adopting an invalid teacher evaluation system and tying it to rewards and sanctions is likely to lead to inaccurate personnel decisions and to demoralize teachers, causing talented teachers to avoid high-needs students and schools, or to leave the profession entirely, and discouraging potentially effective teachers from entering it. Legislatures should not mandate a test-based approach to teacher evaluation that is unproven and likely to harm not only teachers, but also the children they instruct.