In March and April 2020, most American schools closed their buildings to in-person learning and shifted to remote instruction for the remainder of the school year in response to the COVID-19 pandemic. With indications that the coronavirus is not yet adequately under control, an increasingly long list of public school districts both large and small, including Chicago, Philadelphia, Los Angeles, San Diego, and Nashville, has started the school year this fall with fully remote instruction, while numerous other districts are using a mixture of in-person and virtual instruction.
Our previous blog on “quarantine slide” highlighted that having students out of school for long periods is likely to lead to learning loss. At-risk students, including low-income students, students of color, older students, English language learners, and special education students, are likely to see the largest losses. Compensating for this loss will require a substantial evidence-based investment, likely including programs of additional instruction and an extended school year.
A less discussed issue is how best to assess the extent of learning loss in students as a result of the school shutdowns, and how to properly identify the most affected students in need of targeted intervention.
Yearly federally mandated accountability testing is the most common way to assess student learning in the US. These tests are used to directly compare states, schools, and subgroups of students to understand where students are academically and where they are falling behind their peers. These tests, in combination with other school accountability measures such as average daily attendance and chronic absenteeism, give policymakers and researchers insight into the US education landscape. Due to the COVID-19 school closures, however, these metrics have become substantially less useful: states canceled the spring 2020 accountability tests, measures of attendance were muddied by remote learning, and other related complications arose. Michigan has already called for federal waivers from 2021 testing requirements as well.
Taking high-stakes accountability testing off the table for states may seem like a harmless way to lower stress for teachers and students alike amid the ongoing COVID-19 pandemic; in fact, there have been calls to reduce testing for many years. However, accountability testing has provided valuable insight into educational disparities between schools and demographic subgroups of students, and annual testing has allowed researchers to better understand long-term dynamics of how students learn and how teachers develop their instructional skills.
This piece discusses the history of federal accountability testing and other accountability measures, how those measures are used to inform policy and research, and how the reliance on testing may change in the wake of the COVID-19 outbreak.
History of Accountability Testing and Measures
Students have always taken many different kinds of tests, from content tests designed by their teachers at the end of a classroom lesson or weekly unit to the ACT and SAT college entrance exams. Until the No Child Left Behind (NCLB) Act was signed into law in January 2002, accountability or standardized testing was fairly decentralized, and states decided whom, what, and when to test. NCLB, which passed with broad bipartisan support, made the historic change of holding schools responsible for the academic progress of their students. NCLB was a reauthorization of the Elementary and Secondary Education Act (ESEA) originally enacted in 1965.
The most notable aspect of NCLB for parents, teachers, and students was an increased testing requirement. States were now required to test students in reading and math yearly in grades three through eight, and once in high school. Prior to NCLB, New York, like many other states, administered state assessments only once in fourth grade and once in eighth grade. NCLB also set high standards, requiring that all students be “proficient” at grade level by spring 2014. Schools that did not make adequate yearly progress (AYP) toward this goal faced federal sanctions. Because every school was held to the same 2014 proficiency bar, lower-performing schools had a much steeper climb than higher-performing schools.
Under the Obama administration, states received relief from potential sanctions under NCLB through accountability waivers, and in 2015 a new reauthorization of ESEA replaced NCLB with the Every Student Succeeds Act (ESSA). ESSA focused on curriculum and college and career readiness rather than test-based accountability alone, but it kept the yearly testing requirement. ESSA required that all students be taught to a high-quality curriculum standard, such as the Common Core, and called for tests to be linked directly to that curriculum. ESSA also expanded accountability indicators to include a non-test-based measure of school quality or student success. For this indicator, most states chose chronic absenteeism, generally defined as the percentage of students who miss 10 percent of the school year or more.
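In practice, that definition reduces to a simple ratio. The minimal sketch below illustrates the flag; the function and the 180-day example year are illustrative assumptions, not any state’s official rule:

```python
# A minimal sketch of the standard chronic-absenteeism flag described above.
# The function name and the 180-day example year are illustrative assumptions.
def is_chronically_absent(days_absent: int, days_enrolled: int) -> bool:
    """Flag a student who misses 10 percent or more of the school year."""
    return days_absent / days_enrolled >= 0.10

# Example: 18 absences in a typical 180-day school year crosses the threshold.
print(is_chronically_absent(18, 180))  # True
```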
Yearly accountability tests required through NCLB/ESSA are not the only “standardized” tests that students take. Every two years a representative sample of students takes the National Assessment of Educational Progress (NAEP) in fourth, eighth, and twelfth grade. This test, called the “nation’s report card,” allows for direct learning comparisons between states. A smaller representative sample takes the Program for International Student Assessment (PISA), which directly compares student learning across nations.
Most of the focus on accountability testing is in grades three through eight; however, high school students also face significant testing. Nearly a dozen states require high school exit examinations, such as the Regents Examinations in New York, and others require students to take end-of-course (EOC) exams. Many states have students take the SAT or ACT college entrance exams. High school students enrolled in Advanced Placement (AP) and International Baccalaureate (IB) courses also can take “college credit” exams.
Uses of Accountability Testing and Metrics in Research
NCLB received significant criticism from parents, teachers, school administrators, and policymakers because of its punitive component (both financial and reputational) for schools that could not meet prescribed levels of improvement. Yet, NCLB’s focus on student equity and its expansion of annual testing revealed substantial inequity in the American education system. NCLB required that for a school to meet adequate yearly progress, all subgroups within that school needed to meet AYP. Subgroups included racial and ethnic groups, gender, low-income status, English language learner (ELL) status, and special education status. Practically, a school with a mix of high- and low-income students, for example, could not rely only on its high-income students to bring up school-wide average test scores; rather, the school had to make sure that low-income students also were reaching proficiency.
After the passage of NCLB in 2001 and ESSA in 2015, many states created longitudinal data systems using yearly accountability test scores to examine the long-term dynamics of student learning and to understand the role that teachers and schools play in that process. Annual individual assessment results made it possible to calculate yearly “value added” and “growth,” measures of how much a student has improved from one year to the next. Value-added measures (VAMs) are most commonly calculated for individual teachers, but can also be calculated for students and schools; they measure student performance controlling for that student’s past performance and other observable student characteristics. Some states, most notably New York, prohibit student test scores from being used to evaluate teachers. When VAM formulas meet certain requirements, they can offer unbiased estimates of teacher effectiveness. Using longitudinal testing data, researchers have found that students assigned to teachers with high VAMs are more likely to go to college, earn more, and are less likely to become pregnant as teenagers. Tests may be an imperfect way to measure student academic success, but research has shown them to be correlated with important lifetime outcomes.
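In stylized form, the model behind these estimates looks something like the following; the notation is illustrative rather than any state’s official formula:

```latex
% Stylized value-added model: student i's score in year t depends on the
% prior-year score, observable characteristics, and a teacher effect.
A_{it} = \beta A_{i,t-1} + \gamma' X_{it} + \mu_{j(i,t)} + \varepsilon_{it}
```

Here A_{it} is student i’s test score in year t, X_{it} collects observable characteristics, \mu_{j(i,t)} is the value added of the teacher who taught student i in year t, and \varepsilon_{it} is unexplained variation; a teacher’s VAM is the estimate of \mu.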
VAMs also have been used to estimate how teachers develop their skills over the course of their careers. These studies include examining how school environments affect growth in effectiveness as well as how effective teachers are distributed across schools and districts. Such data have even allowed researchers to identify which types of teacher experience contribute most to student learning. Data from annual tests also have allowed researchers to identify the effects of specific school- and district-level policies beyond instruction, including school day start times, repeat teachers or “looping,” class size, providing free summer reading books, one-on-one tutoring, and math intervention.
Yearly testing certainly has provided additional and valuable information to policymakers and researchers to inform educational practice and policy. Pairing these tests with significant consequences, however, likely served as motivation for some to engage in systematic cheating: notably the Atlanta cheating scandal, individual teacher cheating in Chicago, and suspiciously high test scores and erasure rates in DC.
Even with the relaxation of federal accountability under ESSA, many states use testing and VAMs to retain, dismiss, and compensate teachers. The National Association of Secondary School Principals (NASSP) opposes the use of test scores and VAMs for those decisions. The American Statistical Association takes a similar position, noting that most VAM studies find teachers account for 1 to 14 percent of the variability in test scores and stressing that VAMs and test scores are valuable tools for understanding education systems as a whole, but need to be accompanied by discussion of precision, model limitations, and confidence intervals.
The inclusion under ESSA of new accountability indicators, including chronic absenteeism, and curriculum-based standards has somewhat lowered the stakes of testing for both schools and teachers. Chronic absenteeism is associated with both lower test scores and a decreased probability of high school graduation, and these attendance patterns tend to develop early. A student with a single truancy in September of ninth grade is nearly twice as likely to drop out and has a GPA a full grade point lower than students with no truancies that month. Tracking chronic absenteeism has allowed schools to target at-risk students before their grades dramatically slip, and before more serious disciplinary problems develop.
Long-term data on attendance and absenteeism also provide additional insight into how students learn and develop the socioemotional (sometimes called noncognitive) skills that help them succeed in both school and future careers. Teachers can affect not only test scores but also attendance and suspension patterns. These measures give a more holistic view of the skills teachers impart to their students beyond performance on academic exams.
Accountability Measures post-COVID
The complete cancelation of spring 2020 testing means that researchers and policymakers will have at least one missing year of data in any new longitudinal analyses from this point forward. While a single missing year of data is far from the end of the world, and has happened before, it will have ripple effects for years to come. Both state-level accountability measures and VAM calculations rely on student growth from one year to the next; for analysis that usually would rely on spring 2020 data, models will have to be adjusted to account for both two-year growth and the disruption to students’ performance caused by the pandemic school shutdowns. This is less of a lift for researchers, who can adjust their statistical methodology and error assessments, but it could harm teachers and schools in states where compensation is based in part on VAM estimates, if those formulas are not properly adjusted.
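To make that adjustment concrete, a growth model that normally regresses this year’s score on last year’s would have to substitute a two-year lag, roughly as follows (again, illustrative notation rather than any state’s formula):

```latex
% With spring 2020 scores missing, 2021 growth must be benchmarked against
% the last available pre-pandemic score, two years earlier.
A_{i,2021} = \beta_2 A_{i,2019} + \gamma' X_{i,2021} + \mu_{j(i,2021)} + \varepsilon_{i,2021}
```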
A larger issue with the cancelation of 2020 testing is that it will make it harder to assess how far students have fallen. It seems unlikely that schools will complete the canceled testing this fall, so the next full snapshot of student performance will probably not come until spring 2021, and some states are already beginning to consider asking for testing waivers for 2021 as well. If a second year of annual tests is canceled, the impact on education policy and analysis will be significant.
Still, there are many alternatives to accountability testing, depending on a state’s goals for testing. Among these alternatives are the assessment options discussed below.
Assessing “quarantine slide” to improve instruction
Large-scale accountability testing requires a long calendar. Tests need to be distributed to schools, students need appropriate testing supplies and space, the tests have to be administered and collected, and then the results must be entered, analyzed, and reported. The wait between test and usable results makes moving spring accountability tests to fall an inefficient way to assess student learning loss, as teachers would waste valuable class time waiting for results. Smaller teacher-driven formative assessments are likely to be more effective for identifying students who have fallen behind and for planning effective remediation programs. Formative assessments allow teachers to quickly gauge where students are in terms of specific skills, helping them match their instruction to students’ specific learning needs.
Many well-regarded formative assessment tools are computer-based, such as Khan Academy or Carnegie Learning. As schools rely more on virtual instruction, including at the beginning of the 2020-21 school year, incorporating online formative assessments as a purposeful part of the instructional program is an attractive option for helping to measure students’ academic standing.
Assessing “quarantine slide” to measure state-level and national learning loss
Measuring learning loss at the state or national level is much more difficult than for a single school, or even an entire school district. Formative assessments are generally brief, and students take them only when needed to check course-relevant skills. This means there is no standard set of tests that every student in the same school, or even in the same class, would take, so these assessments are difficult to aggregate even to the school level. Having all students take the same standard set of formative assessments would essentially defeat their purpose as brief, low-impact, rapid measures of specific skills.
In light of the limitations of formative assessments, traditional accountability testing may be the best option for assessing the total learning loss of a student population. If schools are fully safe for in-person learning in spring 2021, the easiest way to measure the point-in-time effect of learning loss would be to conduct testing as normal in spring 2021 and compare the results to projections of what they “would have been” without the pandemic. This would mean comparing spring 2021 test scores to those of observationally similar students in the same schools and grades back in 2019, with adjustments based on individual students’ pre-pandemic test score histories. This would not provide useful information on any one student’s learning loss, but it would provide a reasonable statistical estimate of average learning loss across schools, states, and student subgroups.
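As a sketch of how such a counterfactual comparison might be computed, suppose an analyst had student-level files for the 2019 and 2021 test takers. The file names, column names, and specification below are hypothetical illustrations, not any state’s actual methodology:

```python
# A minimal sketch of the counterfactual comparison described above, assuming
# hypothetical student-level files with current scores, prior scores, and
# demographic indicators. Not any state's actual methodology.
import pandas as pd
import statsmodels.formula.api as smf

cohort_2019 = pd.read_csv("scores_2019.csv")  # hypothetical pre-pandemic file
cohort_2021 = pd.read_csv("scores_2021.csv")  # hypothetical post-shutdown file

# Fit the pre-pandemic relationship between prior achievement, student
# characteristics, and current-year scores using the 2019 cohort.
model = smf.ols(
    "score ~ prior_score + C(grade) + C(school_id) + low_income + ell",
    data=cohort_2019,
).fit()

# Predict what observationally similar 2021 students "would have" scored,
# then average the shortfall as an estimate of mean learning loss.
cohort_2021["predicted"] = model.predict(cohort_2021)
loss = (cohort_2021["predicted"] - cohort_2021["score"]).mean()
print(f"Estimated average learning loss: {loss:.2f} scale-score points")
```

The estimate is only meaningful on average; individual predictions are far too noisy to diagnose any single student, which is why the text above frames this as a population-level measure.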
In the increasingly likely case that districts are unable to do a full round of testing in spring 2021, sample testing can be conducted. Instead of testing all students, a representative sample could be identified and designated for testing in each school, district, or state. This would ease some of the space concerns if schools are back to in-person instruction. The approach has precedent: sample testing is how both the NAEP and PISA exams are conducted.
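A minimal sketch of how such a sample might be drawn, assuming a hypothetical student roster file; stratifying by school and grade keeps the draw representative in the way NAEP-style sampling intends:

```python
# A minimal sketch of stratified sample testing, assuming a hypothetical
# roster file with student_id, school_id, and grade columns.
import pandas as pd

roster = pd.read_csv("student_roster.csv")  # hypothetical statewide roster

# Draw 10 percent of students within each school-grade stratum so every
# school and grade is represented in proportion to its enrollment.
sample = roster.groupby(["school_id", "grade"], group_keys=False).apply(
    lambda stratum: stratum.sample(frac=0.10, random_state=42)
)
sample.to_csv("designated_test_takers.csv", index=False)
```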
School-level accountability
Under NCLB and ESSA, states set accountability standards for their schools, generally based on test score growth, absenteeism, and other measures of quality, and schools that fail to meet those state-specific standards receive sanctions if they do not make improvements. New York schools where students and student subgroups fail to meet performance targets are classified as “struggling” or “persistently struggling” and must make “demonstrable improvement” to avoid receivership. Without both new test scores and meaningful chronic absenteeism data in the world of remote learning, it is difficult to determine whether any schools should be added to those lists and whether any schools on those lists have made “demonstrable improvement.” The easy answer is to simply pause and leave all schools at their current status, which New York has decided to do.
An alternative to relying on traditional accountability test results, which will be out of date if testing is suspended for the next two years, is to create a short-form rapid assessment, much like a “standardized” formative assessment, for schools that have failed to meet accountability standards or were on track to fail before the pandemic. These assessments would not necessarily need to be used for accountability, but could be a way to assess the learning status of the most at-risk students.
Teacher Effectiveness
An alternative student assessment (and, by consequence, teacher assessment) that may be especially valuable in distance learning is the embedded rapid assessment. These curriculum-based assessments are like a standard classroom unit test, but administering them online would allow student progress to be measured and immediately known to teachers and administrators. Teachers would know whether students had mastered specific skills, and administrators would be able to intervene to assist teachers who are struggling with remote instruction or to expand successful teaching strategies to more classrooms. Options for this type of computer-based assessment already exist, including such models as Khan Academy, Read 180, and Carnegie Learning.
Missing one or two years of student test data will have the ripple effect of hindering the calculation of teacher VAMs, because these metrics require students’ past test scores to measure learning growth, and the estimates generally are averaged over multiple years so that teachers are not rewarded or punished for a single year in which performance spikes unusually. Three to five years of data can be needed to calculate an accurate measure of teacher effectiveness.
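As a simple illustration of that multi-year averaging (the numbers below are made up), yearly estimates can be pooled with weights proportional to the number of students tested each year:

```python
# A minimal sketch of pooling yearly value-added estimates, weighting each
# year by the number of students tested. All numbers are hypothetical.
yearly_vam = {2017: 0.12, 2018: 0.05, 2019: 0.09}  # estimated effect by year
n_students = {2017: 24, 2018: 27, 2019: 22}        # students tested each year

weighted_vam = sum(yearly_vam[y] * n_students[y] for y in yearly_vam) / sum(
    n_students.values()
)
print(f"Multi-year VAM estimate: {weighted_vam:.3f}")  # 0.085
```

With spring 2020 and possibly spring 2021 missing, the pool of usable years shrinks, and each remaining year’s noise weighs more heavily on the final estimate.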
Another alternative to using student test data for teacher evaluation is classroom observation. Principals’ subjective measures of teacher effectiveness are correlated with VAMs and pick up on different aspects of teaching skill. Just as teachers have little experience with remote learning, however, principals have little experience observing and judging it. That lack of experience does not make observation any less important: principals dropping into remote classes to observe instruction and reviewing portfolios of student work can help identify effective teachers and effective teaching strategies.
Remote learning environments also create new challenges for assessing teacher impact on students’ non-test-score skills, such as attendance and behavior. Traditional measures of absenteeism as a proxy for engagement make less sense in remote learning. Direct measures of engagement, such as participation in one-on-one or small-group remote meetings between teachers and students, are needed alternatives. Signals such as the number of posts a student makes to a class message board, time active on a class website, or views of class videos and resources can become new performance indicators in a virtual schooling environment.
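A minimal sketch of how those signals might be combined into a single engagement indicator, assuming a hypothetical export of learning-management-system activity; the column names and equal weighting are illustrative choices:

```python
# A minimal sketch of a composite remote-engagement indicator built from the
# signals named above. File and column names are hypothetical assumptions.
import pandas as pd

logs = pd.read_csv("lms_activity.csv")  # hypothetical per-student activity export

# Standardize each signal so posts, minutes, and views are comparable,
# then average them into a single engagement score per student.
signals = logs[["message_board_posts", "minutes_active", "video_views"]]
z_scores = (signals - signals.mean()) / signals.std()
logs["engagement_index"] = z_scores.mean(axis=1)
print(logs[["student_id", "engagement_index"]].head())
```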
Assessing and fostering individual student learning and engagement
All the assessment and accountability strategies discussed above have the end goal of improving and fostering student learning. Policymakers care about measuring learning loss, school performance, and effective teaching to make sure that students get the best education possible and the skills they need to succeed in life. The goal of accountability testing is not so much to measure the learning of an individual student, but rather to assess the state of an education system and to track learning differences across schools, states, and groups within schools.
For the next few years, schools will be focused on mitigating the effects of learning loss and returning students academically to where they would have been in the absence of the pandemic. Students are likely to be affected in profoundly different ways by the closure of schools and the introduction of virtual instruction, and are likely to return to school with an even higher variance in preparation than in normal years. Teachers, however, already have many of the tools they need to assess students. The formative assessments discussed above can help teachers match their instruction to where students are, and mini-assessments administered online can help teachers work with students at different levels of preparedness in a way that annual accountability tests cannot.
Expanding the types of student skills that are assessed also has taken on new importance. Attendance measures have long served as general proxies for engagement, but in the wake of COVID-19 and the emotional toll it has taken on so many students, it will be valuable to assess student socioemotional and non-academic skills directly. The Gallup Student Poll, generally completed in the fall, has been used by many school districts to do just that. It is available at low cost and takes only 10 to 15 minutes to complete. While the poll does not provide the same insight as a full socioemotional skills battery, it can give districts and schools a view of their students’ well-being beyond test scores.
A Different Kind of Back-to-School Season
Students, parents, caregivers, teachers, administrators, and policymakers all have had to make substantial adjustments to the education system and to their expectations of how school works, and there will be more adjustments as the fall semester progresses. Many large school districts began the semester online, and many have based plans for the phased reopening of school buildings on virus infection levels, student needs, and parent preferences.
Traditional standardized tests may not be the primary concern of those involved in the education system this fall. Still, a functioning testing and monitoring system that assesses student learning and engagement is vital to delivering instruction effectively and targeting resources where they are needed most. In the short term, formative skills-based assessments will help teachers identify student learning levels and meet students where they are as they progress through remote learning. In the long term, traditional accountability tests, administered either to all students or to a representative sample, can help states and researchers measure the large-scale average effect of learning loss due to COVID-19 and direct resources to the groups hit hardest.
ABOUT THE AUTHOR
Leigh Wedenoja is a senior policy analyst at the Rockefeller Institute of Government.