Developing New Measures of Teachers' Instruction: Part 2
One of the guiding questions for C-SAIL’s Measurement Study is, “How reliably can raters code the content of teachers’ assignments and assessments?”
We find that raters can code mathematics assignments quite reliably, but that they struggle to code English language arts (ELA) assignments. In this post, we discuss why we think this finding is important and what the implications are for our and others’ work.
Teacher surveys are the backbone of our FAST Program Study and reporting plans. In addition to the surveys, we planned to collect assignments and assessments to check how well teachers' survey reports match the actual materials on which students are evaluated. This portion of the Measurement Study tells us whether we can analyze these materials consistently enough to judge their alignment to standards.
Our analysis follows three previous studies of the reliability of content analysis procedures using the Surveys of Enacted Curriculum. Two of the studies (first, second) examined how reliably raters could code the content of state standards and assessments (in essence asking the same question discussed here, only with different documents). That work found these analyses were fairly reliable (about .75 on a 0 to 1 scale, where 1 is perfect reliability) when four trained raters were used, and the results looked better in mathematics than in English language arts. A third study examined the reliability of content analyses of entire mathematics textbooks and found them to be extremely reliable, often .99 or higher on the same scale, even with as few as two content analysts (all else equal, more raters means higher reliability).
That study hypothesized two reasons that math textbook analyses were so much more reliable than analyses of tests and standards:
- The length: all things equal, longer documents can be analyzed more reliably, just as longer tests are more reliable than shorter ones (see the sketch after this list).
- The fact that the tasks in mathematics textbooks often measure quite discrete skills that are easier to code.
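The length point is the familiar relationship from classical test theory: pooling more independent evidence averages out coding noise, which the Spearman-Brown formula makes concrete. A minimal sketch, with made-up numbers rather than estimates from any of these studies:

```python
# Spearman-Brown projection: the reliability of a measure lengthened by a
# factor m, given its current reliability r. The numbers below are purely
# illustrative; they are not estimates from the studies discussed here.

def spearman_brown(r, m):
    return m * r / (1 + (m - 1) * r)

print(round(spearman_brown(0.60, 2), 2))  # doubling length: 0.60 -> 0.75
print(round(spearman_brown(0.60, 4), 2))  # quadrupling length: 0.60 -> 0.86
```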
While the results of previous studies suggested raters could code both math and ELA documents reliably, we needed to update previous work for C-SAIL, both because we had modified the SEC tools (see previous post for more on this), and because teachers’ assignments and assessments are not as long as whole textbooks.
The procedures for this study were straightforward. We collected two weeks' worth of assignments and assessments from 47 teachers: 24 in ELA and 23 in mathematics. Four trained content analysts independently analyzed the set of materials for each teacher. We then calculated reliability using the same generalizability theory techniques we had used in the previous studies.
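For readers unfamiliar with generalizability theory, the basic idea in a design like ours (every analyst codes every teacher's materials) is to split the variance in the codes into a between-teacher component and error components involving the analysts; reliability is then the share of variance attributable to teachers once codes are averaged over analysts. A minimal sketch of that calculation, using hypothetical variance components rather than our actual estimates:

```python
# Generalizability ("G") coefficient for a teachers-by-analysts design.
# The variance components below are hypothetical, for illustration only;
# they are not the estimates from our G-study.

def g_coefficient(var_teacher, var_error, n_analysts):
    """Reliability of teacher-level codes averaged over n_analysts analysts."""
    return var_teacher / (var_teacher + var_error / n_analysts)

var_teacher = 0.40  # variance between teachers (the signal we care about)
var_error = 0.30    # analyst-by-teacher interaction plus residual (noise)

for n in (1, 2, 3, 4):
    print(n, round(g_coefficient(var_teacher, var_error, n), 2))
# Averaging over more analysts shrinks the error term, so reliability rises.
```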
The results of our analyses were illuminating. In mathematics, just two weeks’ worth of assignments or assessments could be content analyzed quite reliably. The average reliability for two content analysts across the 23 teachers was .73, and that increased to .79 if three content analysts were used. Only 4 of the 23 math teachers had reliabilities below .70 when three analysts were used. In short, the results in mathematics were strong.
In ELA, the results were much weaker. The average reliability for two content analysts was .49, and it rose to only .57 with three content analysts. Of the 24 teachers, just 7 had reliabilities above .70 with three content analysts. In short, our raters struggled to achieve reliable content analyses in ELA from two weeks of materials.
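As a rough consistency check (not the calculation we actually ran), the two-analyst averages reported above imply roughly these three-analyst figures if the number of analysts is the only thing that changes:

```python
# Back-of-the-envelope check: project the three-analyst reliability implied
# by the reported two-analyst averages, assuming analysts are the only facet
# being averaged over. This is not the generalizability analysis we ran.

def step_up(r_two, k):
    r_one = r_two / (2 - r_two)              # implied single-analyst reliability
    return k * r_one / (1 + (k - 1) * r_one)

print(round(step_up(0.73, 3), 2))  # math: ~0.80, close to the reported .79
print(round(step_up(0.49, 3), 2))  # ELA:  ~0.59, close to the reported .57
```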
What do these results mean? Mathematics materials appear straightforward to analyze reliably; we now have evidence from tests, standards, textbooks, and teacher-created assignments and assessments that raters can do this quite well. This means we can give teachers good feedback about their instruction based on relatively few raters.
In contrast, we were surprised by how weak the results were in ELA. Clearly, more work needs to be done to achieve reliable content analysis in ELA. Four strategies we could use to improve reliability are:
- Collecting assignments over a longer period (such as a full month).
- Increasing the training we provide to content analysts.
- Increasing the number of content analysts we use.
- Simplifying the ELA content languages to make analysis easier.
In future work, we plan to explore why some teachers' assignments and assessments could be coded more reliably than others'. Was it something about the content of these documents that made reliable coding easier? Or was it merely that they were longer?
Finally, it is important to note that when we began planning the Measurement Study, we expected to include content analysis as part of the FAST Program Study. In particular, we planned to collect some assignments and assessments from participating teachers every few weeks and to content-analyze them to gauge their alignment to standards. As we developed the FAST study further, however, it took a different direction. Thus, the work presented here is not directly connected to our ongoing intervention study, but it can inform other research on teachers' instruction.

How do you think we could improve the reliability of ELA content analysis?