To identify links among professional development, teacher knowledge, practice, and student achievement, researchers have called for study designs that allow causal inferences and that examine relationships among features of interventions and multiple outcomes. In a randomized experiment implemented in six states with over 270 elementary teachers and 7,000 students, this project compared three related but systematically varied teacher interventions—Teaching Cases, Looking at Student Work, and Metacognitive Analysis—along with no-treatment controls. The three courses contained identical science content components but differed in the ways they incorporated analysis of learner thinking and of teaching, making it possible to measure the effects of these features on teacher and student outcomes. Interventions were delivered by staff developers trained to lead the teacher courses in their regions. Each course improved teachers' and students' scores on selected-response science tests well beyond those of controls, and the effects were maintained a year later. Student achievement also improved significantly for English language learners in both the study year and the follow-up, and treatment effects did not differ by sex or race/ethnicity. However, only the Teaching Cases and Looking at Student Work courses improved the accuracy and completeness of students' written justifications of test answers in the follow-up, and only Teaching Cases had sustained effects on teachers' written justifications. Thus, the content component common to all three courses had powerful effects on teachers' and students' ability to choose correct test answers. Their ability to explain why answers were correct, however, improved only when the professional development incorporated analysis of student conceptual understandings and their implications for instruction; metacognitive analysis of teachers' own learning did not improve student justifications in either year.
Findings suggest investing in professional development that integrates content learning with analysis of student learning and teaching rather than advanced content or teacher metacognition alone.
Experimenters often use post-stratification to adjust estimates. Post-stratification is akin to blocking, except that the number of treated units in each stratum is a random variable because stratification occurs after treatment assignment. We analyse both post-stratification and blocking under the Neyman–Rubin model and compare the efficiency of these designs. We derive the variances for a post-stratified estimator and a simple difference-in-means estimator under different randomization schemes. Post-stratification is nearly as efficient as blocking: the difference in their variances is of the order of 1/n^2, with a constant depending on the treatment proportion. Post-stratification is therefore a reasonable alternative to blocking when blocking is not feasible. However, in finite samples, post-stratification can increase variance if the number of strata is large and the strata are poorly chosen. To examine why the estimators' variances differ, we extend our results by conditioning on the observed number of treated units in each stratum. Conditioning also provides more accurate variance estimates because it takes into account how close (or far) a realized random sample is from a comparable blocked experiment. We then show that the practical substance of our results remains under an infinite-population sampling model. Finally, we analyse an actual experiment to illustrate our analytical results.
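The two estimators compared in this abstract can be sketched in a few lines; this is a minimal illustration under the Neyman–Rubin setup (function names are ours, not the authors' code). The post-stratified estimator is the stratum-size-weighted average of within-stratum differences in means, while the simple estimator ignores the strata.

```python
import numpy as np

def post_stratified_estimate(y, z, strata):
    """Post-stratified treatment-effect estimate: weight each stratum's
    difference in means by the stratum's share of the sample.
    y: outcomes; z: 0/1 treatment indicators; strata: stratum labels."""
    y, z, strata = map(np.asarray, (y, z, strata))
    n = len(y)
    est = 0.0
    for s in np.unique(strata):
        mask = strata == s
        treat = y[mask & (z == 1)]
        ctrl = y[mask & (z == 0)]
        if len(treat) == 0 or len(ctrl) == 0:
            # With post-stratification the treated count per stratum is
            # random, so an empty cell can occur; the estimator is then
            # undefined without further assumptions.
            raise ValueError(f"stratum {s} lacks treated or control units")
        est += (mask.sum() / n) * (treat.mean() - ctrl.mean())
    return est

def difference_in_means(y, z):
    """Simple difference-in-means estimate, ignoring strata."""
    y, z = np.asarray(y), np.asarray(z)
    return y[z == 1].mean() - y[z == 0].mean()
```

The two estimates coincide when the treated proportion is the same in every stratum, and diverge otherwise; the abstract's variance comparison quantifies the cost of that randomness in the per-stratum treated counts.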
In November 2008, we audited contests in Santa Cruz and Marin counties, California. The audits were risk-limiting: they had a prespecified minimum chance of requiring a full hand count if the outcomes were wrong. We developed a new technique for these audits, the trinomial bound. Batches of ballots are selected for audit with probabilities proportional to the amount of error each batch can conceal. Votes in the sampled batches are counted by hand, and the totals for each batch are compared to the semiofficial results. The "taint" in each sampled batch is computed by dividing the largest relative overstatement of any margin by the largest possible relative overstatement of any margin. The observed taints are binned into three groups: less than or equal to zero, between zero and a threshold d, and larger than d. The numbers of batches in the three bins have a joint trinomial distribution. An upper confidence bound for the overstatement of the margin in the election as a whole is constructed by inverting tests for the trinomial category probabilities and projecting the resulting set. If that confidence bound is sufficiently small, the hypothesis that the outcome is wrong is rejected and the audit stops; if not, there is a full hand count. We conducted the audits with a risk limit of 25%, ensuring at least a 75% chance of a full manual count if the outcomes were wrong. The trinomial confidence bound confirmed the results without a full count, even though the Santa Cruz audit found some errors. The trinomial bound gave better results than the Stringer bound, which is commonly used to analyze financial audit samples drawn with probability proportional to error bounds.
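The taint computation and binning steps described above can be sketched as follows. This is an illustrative fragment only (the names `taint` and `bin_taints` are ours, not the auditing code used in the study), and it omits the harder step of inverting trinomial tests to obtain the confidence bound.

```python
def taint(max_observed_overstatement, max_possible_overstatement):
    """Taint of a sampled batch: the largest relative overstatement of
    any margin found in the hand count, divided by the largest possible
    relative overstatement of any margin for that batch."""
    return max_observed_overstatement / max_possible_overstatement

def bin_taints(taints, d):
    """Bin observed taints into the three trinomial categories:
    t <= 0, 0 < t <= d, and t > d.
    (We close the middle bin at d; the abstract's 'between zero and a
    threshold d' leaves the endpoint convention open.)"""
    n_zero = sum(1 for t in taints if t <= 0)
    n_small = sum(1 for t in taints if 0 < t <= d)
    n_large = sum(1 for t in taints if t > d)
    return n_zero, n_small, n_large
```

Under sampling with probability proportional to error bounds, the three bin counts jointly follow a trinomial distribution, which is what the confidence-bound construction in the abstract exploits.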