Rarely is the Question Asked…

September 26th, 2005

Earlier this week, I posted a link to a Washington Post article titled “Teachers Stir Science, History Into Core Classes”. The article included the following quote about how many schools have chosen to address the strict requirements that the No Child Left Behind Act (NCLB) has placed on reading and math performance:

The time devoted to reading and math has increased. And in many places, the increase has brought results. Between 2002 and 2004, Keister Elementary’s passing rate went from 81 to 92 percent on the state English test and from 86 to 90 percent on the math test.

Quotes like these always force me to ask a simple question: Have we established the reliability of the standardized tests being used? In scientific research, we have a concept called reliability. A measurement instrument is considered reliable if, upon repeated use, its measurements are consistent with each other.

A bathroom scale that makes a wild ten-pound swing every five minutes is not a reliable measure. A scale that shows you steadily losing weight could be a reliable measure if you were, in fact, losing weight, or it could be steadily going on the fritz and one morning you will step on the scale to discover that you weigh exactly four pounds.
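To make the scale analogy concrete, here is a minimal sketch of one way to put a number on "consistent with each other": take several readings of the same thing and look at how much they spread. The readings are made up for illustration, and the function name `reading_spread` is my own.

```python
from statistics import pstdev


def reading_spread(readings):
    """Return the standard deviation of repeated readings of the same quantity."""
    return pstdev(readings)


# Hypothetical readings (in pounds) taken five minutes apart for one person.
steady_scale = [150.2, 149.8, 150.1, 150.0, 149.9]
erratic_scale = [150.0, 160.0, 145.0, 155.0, 140.0]

steady_spread = reading_spread(steady_scale)
erratic_spread = reading_spread(erratic_scale)

print(f"steady scale spread:  {steady_spread:.2f} lb")
print(f"erratic scale spread: {erratic_spread:.2f} lb")
```

A small spread only shows that the scale agrees with itself over a short window. It cannot tell a slow drift in the instrument apart from a real change in what is being measured.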

I ask this question because there are two possible causes for increasing test scores: Either the students really are getting better (we will examine what “getting better” might actually mean in a minute), or the test is getting easier.

I have no hard evidence to suggest that it is the latter rather than the former. I just find it curious that no one ever seems to ask, and that nobody ever seems to offer up hard data to try to prove it’s the former. Most questions about standardized testing relate to the validity of the method, that is, whether or not the test is actually measuring students’ understanding of key concepts. The questions that do address reliability seem to center on inter-rater reliability, or the likelihood that two graders receiving the same student response would score it equally.

I did find that Pennsylvania published a document discussing the reliability and validity of its statewide test, but do all the states do that?

The cynical little devil on my shoulder would like to point out that both the politicians who have thrown their weight behind NCLB and school administrators everywhere have a vested interest in producing higher test scores, and I’m not sure that they’re scrupulous enough to make sure those gains are produced honestly.

One way to check for reliability is to try to correlate the measure with other related measures. For example, you could make sure that your scale isn’t fooling you and you are actually losing weight by comparing the scale readout with how your clothes are fitting and how your physical measurements are looking. If the number on the scale, the number of inches around your waist, and the size of the clothes that fit all keep getting smaller, then you have a reliable measure of weight loss.
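As a rough sketch of what "correlate the measure with other related measures" could look like in practice, the snippet below computes a Pearson correlation between weekly scale readings and waist measurements. The numbers are invented for illustration, and `pearson` is a hand-rolled helper rather than anything from a particular statistics package.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical weekly measurements for one person.
scale_pounds = [180, 178, 177, 175, 173, 172]
waist_inches = [36.0, 35.5, 35.5, 35.0, 34.5, 34.0]

r = pearson(scale_pounds, waist_inches)
print(f"correlation between scale and waist: {r:.2f}")
```

A correlation near 1 means the two measures are moving together, which is the kind of agreement I would want to see between a statewide test and an independent test of the same skills.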

People who are dismayed at the prospect of high school grade inflation often use this method to support their idea that high school GPA is not a reliable measure. High school GPAs have gone steadily up while SAT scores have remained fairly constant.

It would be interesting to compare 12th graders’ scores on the statewide tests with scores on the SAT (keeping in mind that the SAT has changed formats several times in recent memory and that the students who take the SAT tend to be in the top half of their class, while everyone takes the statewide tests) and other standardized tests. In fact, this is exactly what the Pennsylvania study does in order to establish the reliability of its test.

But even assuming that the statewide tests are all reliable, a very serious question remains and I think it’s irresponsible to declare early success for NCLB until it’s answered: Are the tests valid?

A valid measure is one that actually measures what it’s supposed to measure. A bathroom scale that always displays your height when you stand on it may be reliable (it always gives you the correct height), but it’s not valid, because it’s supposed to measure your weight, not your height.

How do we know that early increases in scores are due to increases in students’ knowledge, rather than due to increases in their standardized test-taking skills? Or maybe even due to a statistical fluke that will soon be fixed by regression toward the mean?
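Regression toward the mean is easy to see in a toy simulation. The sketch below assumes each school has a fixed "true" score that never changes, plus random noise on each year's test. All the numbers (1000 schools, an average of 75, a spread of 5) are assumptions chosen for illustration, not data from any real test.

```python
import random
from statistics import mean


def simulate_regression_to_mean(n_schools=1000, n_selected=100, seed=0):
    """Return (year 1 mean, year 2 mean) for the schools that scored lowest in year 1."""
    rng = random.Random(seed)
    # Each school's underlying score stays the same in both years.
    true_scores = [rng.gauss(75, 5) for _ in range(n_schools)]
    # Each year's observed score is the true score plus independent noise.
    year1 = [t + rng.gauss(0, 5) for t in true_scores]
    year2 = [t + rng.gauss(0, 5) for t in true_scores]
    # Pick the schools with the lowest observed scores in year 1.
    lowest = sorted(range(n_schools), key=lambda i: year1[i])[:n_selected]
    year1_mean = mean([year1[i] for i in lowest])
    year2_mean = mean([year2[i] for i in lowest])
    return year1_mean, year2_mean


low_year1, low_year2 = simulate_regression_to_mean()
print(f"lowest-scoring schools, year 1: {low_year1:.1f}")
print(f"same schools, year 2:           {low_year2:.1f}")
```

In this setup the lowest-scoring group tends to "improve" in year 2 even though nothing about the schools changed, because part of what put them at the bottom in year 1 was bad luck.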

We could again try to correlate performance on the statewide tests to performance on other standardized tests. Or we could examine the mounds of research on learning and expertise that suggest that the type of questions asked on standardized tests really don’t do a great job of revealing student understanding at all.

Yvonne posted this on September 26th, 2005 @ 4:11am in Education, News/Politics, Science
