Testing Limits

Sandra Stotsky

Readers have no doubt heard the mantra “teaching to the test,” the term of art for a pedagogical strategy that has achieved prominence as a response to the efforts of the Bush and Obama administrations to initiate public school accountability through standardized testing. Teaching to the test seems to have a wide range of meanings and connotations. For example, in Ed Speak, a glossary of education terms, phrases, buzzwords, and jargon by Diane Ravitch, we find this simple definition:

teaching to the test: The practice of devoting extra time and attention in the classroom to the skills and knowledge that will be assessed on the district or state test.1

Some perspectives on this buzz phrase give it a positive gloss:

States should delineate what students should know and be able to do, teachers should match instruction to those standards, and state tests should measure how well students meet those expectations.2

Teaching to the test is exactly the right thing to do as long as the test is measuring what you are supposed to learn.3

However, Ravitch’s definition in Ed Speak continues:

Critics claim that it reduces education to a limited range of skills, ignores the importance of comprehension, and neglects subjects that are not tested, such as history, civics, geography, and the arts.4

Other critical perspectives on this phrase make technical points:

Teaching to the test alters what you can interpret from test scores because it involves teaching specific content.5

Because teaching either to test items or to clones of those items eviscerates the validity of score-based inferences—whether those inferences are made by teachers, parents, or policymakers—item-teaching is reprehensible. It should be stopped.6

In the conclusion to a summary of the research, Patte Barth and Ruth Mitchell try to straddle these polar opposites. In a Q&A for their report, they say:

Research is beginning to show that teaching to the test can be either bad or good depending on how administrators and teachers approach it.7

Often used in the same context as “test prep” or “test coaching,” it seems that “teaching to the test” (TttT) can be good, bad, some of both, or neither. Moreover, teaching to a limited range of skills does not necessarily entail neglecting other subjects. One can do one without doing the other. The purpose of this essay is to show why we are likely to see more rather than less of TttT in K–12 in the coming years and why it should matter to those in 13 and above.

Testing History

First, readers need to understand why this buzz phrase is frequently heard today. The most obvious reason is that K–12 schools have coped with an abundance of mandated testing since the early 1990s, and the consequences of poor student performance, under federal guidelines, have in the name of accountability come to fall more on teachers than students. The 2001 No Child Left Behind Act (NCLB) mandated annual testing for reading and mathematics in grades 3–8, once in high school, and at several grade levels in science. The 2015 re-authorization of the fifty-year-old Elementary and Secondary Education Act (ESEA), called ESSA (Every Student Succeeds Act) in its reauthorized form, continued NCLB’s annual testing mandate, in large part because of strong support from testing experts,8 but without evidence that annual testing via NCLB had increased the achievement of low-income students in reading and mathematics. In between NCLB and ESSA came Race to the Top in 2009, a competition permitting the U.S. Department of Education (USED) to award grants to public schools for implementing Common Core standards and standardized tests aligned to Common Core’s standards.

Second, readers need to understand the convoluted political context for the battle cry against TttT and standardized testing. Opposition to all standardized testing is the only safe stance for teachers and their unions to take now in order to oppose an accountability system that links student scores on federal/state-mandated tests to teacher evaluation, and the charge that excessive standardized testing has led to TttT and wasted valuable instructional time rationalizes this opposition. This charge helps to cover up the fact that our major teacher unions supported the Common Core project at the outset, including a link between student scores and teacher evaluations in states’ Race to the Top grant applications. While the earliest versions of ESEA (from 1965 to 2001) held no one accountable for test results, its reauthorized version in 2001, NCLB, held schools and school districts, but not individual teachers, accountable. Race to the Top instituted teacher accountability.

Organized Opposition

Opposition to all standardized testing is the only way teachers and unions can legally include Common Core’s standards among their dislikes, since it is clear to most union representatives at the building level that most teachers intensely dislike these newer “standards.” There was little teacher opposition to state standards and tests in NCLB days and no opposition at any time to norm-referenced testing, which ranks students within their relevant student population and has no pass/fail score.

In fact, ESSA was deliberately designed to counter growing teacher opposition to the Common Core project. It supposedly left accountability specifics up to each state where state unions would have more influence. But only now is it becoming clear that ESSA also left final approval of a “State Plan” for accountability to the USED. In addition, final “accountability rules” have just been issued by unknown USED bureaucrats after a large, broadly-based committee selected by the USED—and required by law—failed, as one might have predicted, to come to consensus on recommendations for rules.9 The USED “was given permission”—by whom, my informant didn’t know—to come up with its own accountability rules, given no consensus in the selected committee.

Because TttT is a serious problem when it consumes a great deal of instructional time in K–12, and to placate parents, President Obama came out recently for less testing. But what he seemed to mean were fewer non-Common Core-based tests, such as teacher-made tests or other district-based tests. While testing time would be reduced, this would also eliminate other independent sources of information on what our high school students know, and this would be in addition to the proposed ban on placement tests at the college level for determining eligibility for credit-bearing freshman courses. Since 2011, state departments of K–12 and higher education have been working to get public colleges to accept the Common Core-based grade 11 College Readiness test cut score as their cut score for credit-bearing freshman courses.10

The role of civil rights organizations in the context of the battle against standardized testing and the problem of TttT has been a curious one. In theory, they should have been opposed to all the testing mandated by NCLB and then by ESSA, since low-achieving students likely have more of their instructional time taken up by TttT than do higher-achieving students. But these organizations were apparently sold on the virtues of long and frequent tests, perhaps by education researchers who told them that such tests would provide useful information for instructional purposes. In fact, the new computer-based tests haven’t provided teachers of children with any useful information at all. But a test-based accountability system in which teachers are held accountable for student scores seems to be far more acceptable to these organizations than a system that holds students accountable for their own scores, as is the case in most countries. Even though ESEA has yet to show in fifty years that extra money to the schools (via Title I funds) for low-income students has made a difference, the dominant view of education policy makers and civil rights organizations has been that the conditions for learning (e.g., absenteeism due to suspensions, inexperienced teachers, a curriculum designed by teachers of a different color or “culture,” parental income, etc.) account for the “gaps” among demographic groups, not child-rearing or other cultural habits.

What We Know

One empirical question involves what we actually know about teachers’ practices in the context of mandated testing, with or without high stakes for students or teachers. Most of what is known comes from surveys, anecdotal research (nonsystematic accounts of idiosyncratic personal experiences), and/or a variety of data sources, but not from systematic observational research. For example, a 2013 report issued by the American Federation of Teachers was based on assessment inventories, testing calendars, and time and cost data.11 Not surprisingly, it recommended eliminating all but Common Core-based tests, and with no high stakes for teachers.

Almost completely left out of current discussions of research on testing, then and now, are teacher-made tests—the only kind of high-stakes tests that generations of K–12 students took (most promotions from grade to grade, as well as high school diplomas, were based on them, not on the norm-referenced tests their schools might also have given) until the advent of federal- or state-mandated standardized testing in the 1990s. A recent Seattle Post-Intelligencer article suggested that “teacher-made tests better reflect what is taught in class.”12 Another virtue of teacher-made tests is that parents get to see the tests, the grade teachers give, and teacher comments, if any. But the small and sporadic body of research on them over the years suggests the many types of problems they have, despite the fact that most teachers are required to take coursework in classroom testing in their professional preparation.13

Accountability

Because of the many issues in using teacher-made tests for accountability today, especially their variability in content and scoring from teacher to teacher in the same subject, standardized tests developed by testing experts were mandated for accountability purposes and are now promoted by most testing experts for that purpose. Standardized tests eliminated many of the acknowledged problems with teacher-made tests for the purpose of accountability. They have been defined as follows: “A standardized test is any form of test that (1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way, and that (2) is scored in a ‘standard’ or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students.”14

But we need to keep in mind that many education researchers had serious concerns about standardized testing in K–12 in the early days of mandated state tests. For example, a 1991 report states: “[R]ather than exerting a positive influence on student learning, testing may trivialize the learning and instructional process, distort curricula, and usurp valuable instructional time….Schools serving disadvantaged students are thought to be particularly at risk for such adverse effects.”15

While there seems to be little criticism of standardized testing for the purpose of accountability from education researchers today, “TttT” has become a common pejorative to dismiss all standardized tests, especially those with high stakes—important consequences for schools, teachers, or students. As criticism, it suggests that standardized tests are not well correlated with learning, cannot measure all that students learn, perhaps not even the best parts of what they learn, and improve only test performance, not learning, when teachers drill with test-maker-provided workbooks and administer practice tests.

Criticism

There are two sets of criticisms of standardized testing that involve TttT, aside from the extreme claim that any high-stakes standardized test will induce TttT, and that TttT is always bad even if it is simply more attention paid to what will be assessed. The assumptions are that any deviation from what teachers normally do in the classroom subtracts from instruction, and that what teachers normally do in the classroom is always better than preparing students for an external test.

In one negative scenario, teachers provide instruction on test item formats or practice with test-maker-provided workbooks and sample tests, as in the criticism of teaching to test items or clones of them. This type of TttT, its promoters claim, does improve test performance. However, the evidence, mainly from research on coaching for college admission tests, shows only a little gain.16 Even so, enhanced test performance based on “format and drill” types of TttT does not compensate for subject matter knowledge not gained during regular instructional time.

In addition, the “bible” for testing experts, Standards for Educational and Psychological Testing, notes that as test item formats become less familiar or more complex, “construct irrelevant” variance increases.17 Test scores become less a measure of subject matter knowledge and more a measure of ability to discern the logic of the format or the meaning of the instructions. Given tests with complex test items for accountability purposes, teachers will increase time on practice drills to improve the test performance of low-achieving students and spend less time on instruction in subject matter—ultimately a poor trade-off. How ironic, then, that the tests designed to help make all students ready for college and careers are drenched in unfamiliar and complex test item formats, i.e., the Common Core-based Partnership for the Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced Assessment Consortium (SBAC) tests—and possibly other Common Core-aligned tests like ACT and SAT (whose previous versions had already stimulated the growth of privately paid tutorial services).

Formats and Failings

For example, PARCC’s Technology-Enhanced Responses (TERs) are item types in many reading exercises on computerized versions of its tests that make use of drag-and-drop or cut-and-paste functions, even for children, and its Evidence-Based Selected Response items (EBSRs) seem to be an adaptation of what testing expert James Popham describes as a “multiple, binary choice item” that links two multiple-choice items, even though no research suggests that EBSRs are valid measures of reading comprehension.18 While it may seem bold to challenge children to solve multiple-step problems or use difficult computer-based functions, straightforward and familiar testing formats are more likely to measure their reading skills or mastery of subject matter, if these are the purposes of the test. That is why the innovative test items on Common Core-based tests are rare on tests outside the United States, as Mark McQuillan, Richard P. Phelps, and I discuss in How PARCC’s False Rigor Stunts the Academic Growth of All Students. (For further details on PARCC test items, especially for reading and writing, see chapters 5 and 6 and appendix A in this Pioneer Institute for Public Policy Research report.)19

PARCC developers prominently claim “the end of test prep” on PARCC web pages: “Let’s replace test preparation with smart assessments and tools. Let’s empower teachers to do what they [do] best: provide high-quality instruction to our children.”20 However, we can predict from the experimental and empirical record that format and drill types of TttT will proliferate because many test item formats on current standardized tests are complex and often convoluted. In addition, test directions are deliberately difficult for both teachers and students to understand; SBAC expects an advanced vocabulary to be used by teachers and test-item writers even in the elementary grades and even when below-grade-level reading passages are used. It also provides a grade-by-grade list of a “construct relevant vocabulary” that it claims should be part of instruction in the English language arts because these words are “essential to the construct of the English language arts.”21

The other set of concerns about TttT in standardized testing reflects the phrase “teaching specific content” in the quotation at the beginning of this essay. The only way teachers can teach content specific to a particular test is if they have seen the test in advance (or believe that previous tests indicate the specific content of a current test). If they have seen a test in advance, there is clearly a test security problem. Although lax security can be addressed in part by enforcing tight security of test materials and by rotating the contents and forms of tests (routine practices by most testing companies), the charge of TttT has, according to Phelps, served to divert attention from an endemic problem with standardized tests still with us today—cheating by educators and the testing companies themselves.22

As Phelps recounts the story, suspicion of standardized tests began when John Jacob Cannell, a young medical resident working in a high-poverty region of rural West Virginia in the mid-1980s, heard local school officials claim that their children scored above the national average on standardized tests. Skeptical, he investigated and discovered that every state administering a nationally normed reading test had claimed to score above the national average, a statistical impossibility. The phenomenon was tagged the “Lake Wobegon Effect,” meaning “all children are above average.”

Findings

In his major publications Dr. Cannell cited educator dishonesty and lax security in test administrations as the primary culprits for “test score inflation” or “artificial test score gains.”23 State and local officials were responsible for much of the dishonesty, Cannell found, and most of the problems occurred at the elementary level because norm-referenced tests tended to be used there for general informational purposes, such as how much above or below the rest of the country their children were, academically. (Norm-referenced tests did not contain enough test items to provide the individual assessment that a diagnostic test would.) Even though test scores were rarely used for teacher evaluation or student promotion, prominent education researchers blamed the pressures of “high-stakes” tests for the test-score inflation Cannell found.24

Cannell had exhorted the nation to pay attention to educator dishonesty and lax test security, but education researchers tended to be dismissive of his findings. According to Robert L. Linn, a co-director of the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), the only federally-funded research center on educational testing for the past three decades: “There are many reasons for the Lake Wobegon Effect, most of which are less sinister than those emphasized by Cannell.”25

Yes, school administrators had contributed to inflated scores on national norm-referenced tests in order to compare a jurisdiction’s average scores to national averages as evidence of their competence in increasing student learning. These inflated scores resulted from allowing teachers to use professional development materials containing some of the actual test questions as practice with their students, one of many questionable practices in which teachers were able to engage (unlike TttT, in which teachers use “practice” items prepared by the testing company). For their part, testing companies had sometimes used the same test forms and questions year after year (and had often been allowed to do so by state officials), enabling teachers to memorize the questions over time. Lax practices in administering standardized tests were probably the major culprits accounting for educator dishonesty with these tests. Nevertheless, many education researchers declared the problem to be one of high-stakes tests, despite the lack of pressure on teachers in the 1980s to teach to tests that had no consequences for them or their supervisors.

Just as the charge of TttT diverted attention from the problem of lax security, so, too, has the charge that high-stakes tests are bad diverted attention from questions about using standardized testing as the chief mechanism for accountability in K–12. Most countries use large-scale assessments at various points in their education systems (often for high school entrance and exit) but have not imposed them heavily on the elementary level. In contrast, policies to centralize education decision-making in the USED, such as the Common Core project and the 2015 re-authorization of ESEA called ESSA, show an unhealthy reliance on computer-based standardized tests for accountability at most grade levels and at all educational levels.

The standardized tests that have emerged from the Common Core project (PARCC, SBAC, and others now aligned to Common Core’s standards, such as ACT and SAT) not only contain questionable test item types and do not provide useful feedback to teachers and parents (e.g., the kinds of basic operations in math that most students in a classroom failed to do correctly), they are also accompanied by problematic policies, such as the secrecy of the test items for each test and grade, efforts to eliminate all other “achievement” tests, and the lack of transparency in who actually determines the cut-off scores for each performance level.

Had the test-cheating Cannell uncovered in the 1980s and the more recent test cheating revealed in Atlanta, Georgia, and Washington, D.C., been analyzed for what they really reflected—the failure of policy makers to find forms of accountability less susceptible to TttT and cheating, and more clearly related to authentic learning, such as the analytical essay writing required in the original New York Regents Exams and in most European and British Commonwealth countries today—we would be on our way as a country toward an improved system of education, especially if scoring were done by high school teachers in ways that enhanced their professional status.

Instead, despite a growing concern by parents about the amount of instructional time used for test preparation, education researchers and national policy makers urged that ESSA

• continue annual standardized testing in most grades in mathematics and reading as the chief mechanism for accountability (not testing only once at each educational level as in the 1994 re-authorization of ESEA, not testing every two years—grade span testing—as in pre-Common Core Massachusetts, to reduce the number of tests elementary and middle school teachers gave annually, nor other forms of accountability altogether), and

• rely almost totally on computer-based testing and scoring even for essay writing (in the future).

Perhaps the chief new problem in a K–12 education system using computer-based and centrally-controlled standardized testing for accountability is the incentive to manipulate the prize—the percentages of students who receive passing or higher scores on the grade 11 readiness tests that college administrators agreed to use in place of college freshman placement exams, according to Race to the Top applications in 2010.

Final Thoughts

The current emphasis on standardized testing for accountability damages both the tests and the school curriculum in several ways: test validity is jeopardized because test-taking is supervised internally, usually in students’ own schools; the dual purpose of the grade 11 tests weakens the value of the information for each purpose; TttT is incentivized at every grade level tested, making curricular uniformity across schools as well as curricular coherence through the grades impossible to achieve; and computer-based testing and scoring prevents teachers, parents, and students from gaining important information about students’ academic achievement. How so?

First, even if the tests are computer-based (including those in grade 11), the validity of their scores may be suspect because these tests are administered in the schools themselves and by local school personnel. Conditions are unlikely to be identical across all schools.

Second, because ESSA allows states to use the SAT or ACT in grade 11 for determining college readiness, and because these college admissions tests have been aligned down to Common Core’s high school standards, these tests can no longer serve their original predictive purpose well. Nor can they simultaneously serve well as measures of achievement in mathematics and English. A January 2016 piece in Education Week quoted Wayne Camara, a testing expert at ACT, as saying: “Sound assessment practice requires that a test be validated for its specific intended use. But there are no independent research studies analyzing how well the newest versions of the SAT or ACT reflect the depth and breadth of the Common Core State Standards.” According to testing experts, the article continues, “Without that kind of evidence…states are on shaky ground if they use a college-entrance exam to measure mastery of their content standards.”26

Other countries understand the need to differentiate between retrospective achievement tests—tests aligned to past curricula—and predictive tests—tests aligned to future outcomes. But the U.S. seems determined to ignore what other countries do or have learned from experience. High school and college differ substantially in the populations of students who attend them, their goals, and their teachers’ backgrounds. Moreover, U.S. post-secondary institutions vary enormously in their academic demands. Our overseas competitors do not expect the same test to serve two disparate functions, so they use different tests or criteria to separate secondary school exit from entrance to a career or a post-secondary institution.27

Third, an accountability environment dominated by test items and cut-off scores that teachers have neither developed nor reviewed leads them to teach to the test in whatever ways they think are effective. In essence, accountability tests will drive the school curriculum, not the other way around, until we adopt new mechanisms for accountability more closely related to authentic learning in K–12, standardized testing is reserved for appropriate uses, and a range of coherent curricula are developed for different kinds of student interests and talents at the secondary level.

Finally, we have no clue yet as to who will decide, and on what grounds, where to set the various passing or performance-level scores. In a context where all used test items may never be available for public scrutiny, and students and their parents see no tests or evaluative feedback, the incentive for manipulation by a privately-owned, government-regulated testing system allowing complete anonymity to decision-makers is too obvious.28

The Success Academy Charter Schools in New York City, a fast-growing network of schools enrolling thousands of elementary and middle school children, have attracted much attention.29 They are controversial because they clearly practice TttT and achieve extraordinarily high scores on state tests in reading and mathematics compared to results in nearby urban and suburban schools. It is too early to evaluate long-term results, because fewer than four hundred students have now gone on to high school and no grade 8 graduates have gotten into the city’s exam high schools.

Like the Knowledge Is Power Program, the Success schools maintain strict discipline and a dress code; they also seem to have a high teacher turnover rate.30 While they claim to address Common Core’s skill-based standards, many schools are described as following Core Knowledge’s content-specified curriculum sequence. Their founder and current executive, Eva Moskowitz, claims to stress content as well as skills. Until systematic and confirmed data are available (both observational data and long-term results), their test results suggest not that charter schools are a panacea for low-achieving children but that our public schools need to restore some of the discipline, structure, and curriculum they once offered, and that our education schools need to move beyond their indifference, if not outright hostility, to the classroom conditions that make teaching content possible.
