Winner of the 2019 Warren Miller Prize

The popularity of online surveys has increased the prominence of using weights that capture units' probabilities of inclusion for claims of representativeness. Yet much uncertainty remains regarding how these weights should be employed in the analysis of survey experiments: Should they be used or ignored? If they are used, which estimators are preferred? We offer practical advice, rooted in the Neyman-Rubin model, for researchers producing and working with survey experimental data. We examine simple, efficient estimators for analyzing these data, and give formulae for their biases and variances. We provide simulations that examine these estimators as well as real examples from experiments administered online through YouGov. We find that, for examining the existence of population treatment effects using high-quality, broadly representative samples recruited by top online survey firms, sample quantities, which do not rely on weights, are often sufficient. Sample Average Treatment Effect (SATE) estimates do not appear to differ substantially from their weighted counterparts, and they avoid the substantial loss of statistical power that accompanies weighting. When precise estimates of Population Average Treatment Effects (PATE) are essential, we analytically show that post-stratifying on survey weights and/or on covariates highly correlated with the outcome is a conservative choice. Although simulations show substantial gains from this approach, we find limited evidence of such gains in practice.

%B Political Analysis %V 26 %P 275-291 %G eng %U https://doi.org/10.1017/pan.2018.1 %N 3 %0 Journal Article %J Journal of Research on Educational Effectiveness %D 2018 %T Bounding, an Accessible Method for Estimating Principal Causal Effects, Examined and Explained %A Miratrix, Luke %A Furey, Jane %A Avi Feller %A Todd Grindal %A Lindsay C. Page %X Estimating treatment effects for subgroups defined by posttreatment behavior (i.e., estimating causal effects in a principal stratification framework) can be technically challenging and heavily reliant on strong assumptions. We investigate an alternative path: using bounds to identify ranges of possible effects that are consistent with the data. This simple approach relies on fewer assumptions and yet can result in policy-relevant findings. As we show, even moderately predictive covariates can be used to substantially tighten bounds in a straightforward manner. Via simulation, we demonstrate which types of covariates are maximally beneficial. We conclude with an analysis of a multisite experimental study of Early College High Schools. When examining the program's impact on students completing the ninth grade “on-track” for college, we find little impact for ECHS students who would otherwise attend a high-quality high school, but substantial effects for those who would not. This suggests a potential benefit in expanding these programs in areas primarily served by lower quality schools. %B Journal of Research on Educational Effectiveness %V 11 %P 133-162 %G eng %N 1 %0 Journal Article %J The American Statistician %D 2017 %T Randomization Inference for Outcomes with Clumping at Zero %A Keele, Luke %A Miratrix, Luke %X In randomized experiments, randomization forms the “reasoned basis for inference.” While randomization inference is well developed for continuous and binary outcomes, there has been comparatively little work for outcomes with nonnegative support and clumping at zero. 
Typically, outcomes of this type have been modeled using parametric models that impose strong distributional assumptions. This article proposes new randomization inference procedures for nonnegative outcomes with clumping at zero. Instead of making distributional assumptions, we propose various assumptions about the nature of response to treatment. Our methods form a set of nonparametric methods for outcomes that are often described as zero-inflated. These methods are illustrated using two randomized trials where job training interventions were designed to increase earnings of participants. %B The American Statistician %G eng %0 Journal Article %J Bayesian Analysis %D 2017 %T Posterior Predictive %X Two common concerns raised in analyses of randomized experiments are (i) appropriately handling issues of non-compliance, and (ii) appropriately adjusting for multiple tests (e.g., on multiple outcomes or subgroups). Although simple intention-to-treat (ITT) and Bonferroni methods are valid in terms of type I error, they can each lead to a substantial loss of power; when employing both simultaneously, the total loss may be severe. Alternatives exist to address each concern. Here we propose an analysis method for experiments involving both features that merges posterior predictive p-values for complier causal effects with randomization-based multiple comparisons adjustments; the results are valid familywise tests that are doubly advantageous: more powerful than both those based on standard ITT statistics and those using traditional multiple comparison adjustments. The operating characteristics and advantages of our method are demonstrated through a series of simulated experiments and an analysis of the United States Job Training Partnership Act (JTPA) Study, where our methods lead to different conclusions regarding the significance of estimated JTPA effects.
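The randomization-based multiple comparisons adjustment referenced in the abstract above can be illustrated with a Westfall-Young-style max-statistic permutation test. This is a hedged sketch only: the paper pairs such adjustments with posterior predictive p-values for complier effects, which this toy version omits, and the function name and difference-in-means statistic are illustrative choices.

```python
import random
import statistics

def maxstat_familywise_pvalues(outcomes, z, n_perm=2000, seed=0):
    """Sketch of a randomization-based familywise adjustment.

    outcomes: list of outcome vectors (one per hypothesis); z: 0/1
    treatment assignment.  For each re-randomization we record the
    maximum |difference in means| across all outcomes; each observed
    statistic is then compared against this max-statistic null
    distribution, which controls the familywise error rate without
    Bonferroni's power loss when outcomes are correlated.
    """
    rng = random.Random(seed)
    n, n_t = len(z), sum(z)

    def dim(y, a):
        t = [v for v, ai in zip(y, a) if ai == 1]
        c = [v for v, ai in zip(y, a) if ai == 0]
        return abs(statistics.mean(t) - statistics.mean(c))

    obs = [dim(y, z) for y in outcomes]
    idx = list(range(n))
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(idx)
        a = [0] * n
        for i in idx[:n_t]:
            a[i] = 1
        max_null.append(max(dim(y, a) for y in outcomes))

    return [sum(m >= o for m in max_null) / n_perm for o in obs]
```

Because all hypotheses are compared to the same max-statistic distribution, the adjusted p-values are valid familywise and adapt to the correlation among outcomes.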

%B Statistica Sinica %V 27 %P 1319-1345 %G eng %N 3 %0 Journal Article %J Journal of Causal Inference %D 2017 %T Bridging Finite and Super Population Causal Inference %A Ding, P. %A Li, X. %A Miratrix, L. %X There are two general views in causal analysis of experimental data: the super population view that the units are an independent sample from some hypothetical infinite population, and the finite population view that the potential outcomes of the experimental units are fixed and the randomness comes solely from the physical randomization of the treatment assignment. These two views differ conceptually and mathematically, resulting in different sampling variances of the usual difference-in-means estimator of the average causal effect. Practically, however, these two views result in identical variance estimators. By recalling a variance decomposition and exploiting a completeness-type argument, we establish a connection between these two views in completely randomized experiments. This alternative formulation could serve as a template for bridging finite and super population causal inference in other scenarios.
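The "identical variance estimators" referred to in this abstract is, for a completely randomized experiment, the familiar Neyman estimator $s_t^2/n_t + s_c^2/n_c$, which is exact under the super population view and conservative under the finite population view. A minimal sketch:

```python
import statistics

def neyman_variance_estimate(y_t, y_c):
    """Usual variance estimate for the difference-in-means estimator:
    sample variance of the treated arm over n_t plus sample variance of
    the control arm over n_c.  Both the finite and super population
    views lead to this same estimator, as the abstract notes.
    """
    return (statistics.variance(y_t) / len(y_t)
            + statistics.variance(y_c) / len(y_c))
```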

%B Journal of Causal Inference %V 5 %G eng %N 2 %0 Journal Article %J Journal of Causal Inference %D 2016 %T A Conditional Randomization Test to Account for Covariate Imbalance in Randomized Experiments %A Hennessy, J. %A Dasgupta, T. %A Miratrix, L. %A Pattanayak, C. %A Sarkar, P. %X

We consider the conditional randomization test as a way to account for covariate imbalance in randomized experiments. The test accounts for covariate imbalance by comparing the observed test statistic to the null distribution of the test statistic conditional on the observed covariate imbalance. We prove that the conditional randomization test has the correct significance level and introduce original notation to describe covariate balance more formally. Through simulation, we verify that conditional randomization tests behave like more traditional forms of covariate adjustment but have the added benefit of having the correct conditional significance level. Finally, we apply the approach to a randomized product marketing experiment where covariate information was collected after randomization.
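The conditional test described in this abstract can be sketched as follows. This is an illustrative toy, not the authors' implementation: the single-covariate imbalance measure and the tolerance rule for "similar imbalance" are ad hoc choices here.

```python
import random
import statistics

def conditional_randomization_test(y, z, x, n_keep=1000, tol=None, seed=0):
    """Toy conditional randomization test: compare the observed
    difference in means to its distribution over re-randomizations
    whose covariate imbalance is close to the imbalance actually
    observed.
    """
    rng = random.Random(seed)
    n, n_t = len(y), sum(z)

    def diff_in_means(values, assign):
        t = [v for v, a in zip(values, assign) if a == 1]
        c = [v for v, a in zip(values, assign) if a == 0]
        return statistics.mean(t) - statistics.mean(c)

    obs_stat = diff_in_means(y, z)
    obs_imb = diff_in_means(x, z)
    if tol is None:
        tol = statistics.pstdev(x) / 4   # ad hoc tolerance for this sketch

    null_stats, idx = [], list(range(n))
    while len(null_stats) < n_keep:
        rng.shuffle(idx)
        assign = [0] * n
        for i in idx[:n_t]:
            assign[i] = 1
        # condition on covariate imbalance similar to the observed one
        if abs(diff_in_means(x, assign) - obs_imb) <= tol:
            null_stats.append(diff_in_means(y, assign))

    p = sum(abs(s) >= abs(obs_stat) for s in null_stats) / n_keep
    return obs_stat, p
```

Rejection sampling over re-randomizations is the simplest way to build the conditional null distribution; restricting to assignments with the observed imbalance is what gives the test its correct conditional significance level.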

%B Journal of Causal Inference %V 4 %P 61-80 %G eng %N 1 %0 Journal Article %J Annals of Applied Statistics %D 2016 %T Compared to What? Variations in the Impacts of Early Childhood Education by Alternative Care-Type Settings %A Feller, A. %A Grindal, T. %A Miratrix, L. %A Page, L. %X Early childhood education research often compares a group of children who receive the intervention of interest to a group of children who receive care in a range of different care settings. In this paper, we estimate differential impacts of an early childhood intervention by alternative care setting, using data from the Head Start Impact Study, a large-scale randomized evaluation. To do so, we utilize a Bayesian principal stratification framework to estimate separate impacts for two types of Compliers: those children who would otherwise be in other center-based care when assigned to control and those who would otherwise be in home-based care. We find strong, positive short-term effects of Head Start on receptive vocabulary for those Compliers who would otherwise be in home-based care. By contrast, we find no meaningful impact of Head Start on vocabulary for those Compliers who would otherwise be in other center-based care. Our findings suggest that alternative care type is a potentially important source of variation in early childhood education interventions.

%B Annals of Applied Statistics %V 10 %P 1245-1285 %G eng %N 3 %0 Journal Article %J Statistical Analysis and Data Mining %D 2016 %T Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability %A Miratrix, L. %A Ackerman, R. %X

We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an Occupational Safety and Health Administration (OSHA) database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death), and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between simple word frequency-based methods currently in wide use, and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). For a particular topic of interest (e.g., mental health disability, or carbon monoxide exposure), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found as predictive are then harvested as the summary. Using a branch-and-bound approach, this method can incorporate phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focus on the purpose of the summaries can inform choices of tuning parameters and model constraints. We evaluate this tool by comparing the computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, **textreg**. Overall, we argue that sparse methods have much to offer in text analysis and represent a branch of research that deserves further attention in this context.
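The regress-labels-on-phrase-counts idea in this abstract can be sketched with a toy pure-Python coordinate-descent lasso. This is only a stand-in: the authors' implementation is the textreg R package with branch-and-bound phrase search, and the phrases and data here are hypothetical (the "carbon monoxide" topic echoes the abstract's example).

```python
import statistics

def soft_threshold(x, lam):
    """Soft-thresholding operator used in lasso coordinate descent."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_phrase_summary(doc_term_counts, labels, phrases, lam=0.1, n_iter=200):
    """Toy sparse (L1) regression summarizer: regress on/off-topic labels
    on phrase counts and harvest phrases with nonzero coefficients.
    doc_term_counts: rows are documents, columns are phrase counts.
    """
    n, p = len(doc_term_counts), len(phrases)
    ybar = statistics.mean(labels)
    y = [li - ybar for li in labels]                     # center the label
    # center each phrase-count column
    cols = [[doc_term_counts[i][j] for i in range(n)] for j in range(p)]
    cols = [[v - statistics.mean(c) for v in c] for c in cols]
    norms = [sum(v * v for v in c) / n for c in cols]

    beta = [0.0] * p
    resid = y[:]
    for _ in range(n_iter):
        for j in range(p):
            if norms[j] == 0:
                continue
            # partial residual correlation for coordinate j
            rho = sum(cols[j][i] * (resid[i] + cols[j][i] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / norms[j]
            delta = new - beta[j]
            if delta:
                for i in range(n):
                    resid[i] -= delta * cols[j][i]
                beta[j] = new
    return [ph for ph, b in zip(phrases, beta) if abs(b) > 1e-8]
```

The L1 penalty zeroes out phrases with little predictive value, so the surviving phrase list itself is the summary.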

%B Statistical Analysis and Data Mining %V 9 %P 435-460 %G eng %U http://www.statisticsviews.com/details/journalArticle/10120771/Conducting-sparse-feature-selection-on-arbitrarily-long-phrases-in-text-corpora-.html %N 6 %0 Journal Article %J Journal of Causal Inference %D 2015 %T To Adjust or Not to Adjust? Sensitivity Analysis of M-Bias and Butterfly-Bias %B Journal of Causal Inference %G eng %0 Journal Article %J Journal of the Royal Statistical Society: Series B (Statistical Methodology) %D 2015 %T Randomization Inference for Treatment Effect Variation %A Ding, P. %A Feller, A. %A Miratrix, L. %X

Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation that is not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, which is generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start impact study, which is a large-scale randomized evaluation of a Federal preschool programme, finding that there is indeed significant unexplained treatment effect variation.
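One way to handle the nuisance average effect described in this abstract is to test the constant-effect null Y_i(1) = Y_i(0) + tau over a grid of tau values and report the maximum p-value, which is conservative but valid. The sketch below takes that route; the grid, the variance-difference statistic, and the permutation scheme are illustrative choices, not the authors' exact procedure.

```python
import random
import statistics

def variation_test(y, z, taus, n_perm=500, seed=0):
    """Toy randomization test for unexplained treatment effect
    variation.  For each hypothesized constant effect tau, subtract tau
    from the treated outcomes and permutation-test whether the two arms
    then look like draws from the same distribution; the reported
    p-value is the maximum over the tau grid.
    """
    rng = random.Random(seed)
    yt = [yi for yi, zi in zip(y, z) if zi == 1]
    yc = [yi for yi, zi in zip(y, z) if zi == 0]

    def spread_stat(a, b):
        # treatment effect variation inflates the treated-arm variance
        return abs(statistics.pvariance(a) - statistics.pvariance(b))

    max_p = 0.0
    for tau in taus:
        adj = [v - tau for v in yt]      # remove the hypothesized effect
        obs = spread_stat(adj, yc)
        pooled = adj + yc
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            if spread_stat(pooled[:len(adj)], pooled[len(adj):]) >= obs:
                hits += 1
        max_p = max(max_p, (hits + 1) / (n_perm + 1))
    return max_p
```

Maximizing the p-value over the grid guarantees that the true constant effect, whatever it is, is covered, so the test never rejects merely because the average effect was misspecified.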

%B Journal of the Royal Statistical Society: Series B (Statistical Methodology) %V 78 %P 655-671 %G eng %N 3 %0 Journal Article %J American Journal of Evaluation %D 2015 %T Principal stratification: A tool for understanding variation in program effects across endogenous subgroups %A Page, L. %A Feller, A. %A Grindal, T. %A Miratrix, L. %A Somers, M. A. %X

Increasingly, researchers are interested in questions regarding treatment-effect variation across partially or fully latent subgroups defined not by pretreatment characteristics but by post-randomization actions. One promising approach to address such questions is principal stratification. Under this framework, a researcher defines endogenous subgroups, or principal strata, based on post-randomization behaviors under both the observed and the counterfactual experimental conditions. These principal strata give structure to such research questions and provide a framework for determining estimation strategies to obtain desired effect estimates. This article provides a nontechnical primer to principal stratification. We review selected applications to highlight the breadth of substantive questions and methodological issues that this method can inform. We then discuss its relationship to instrumental variables analysis to address binary noncompliance in an experimental context and highlight how the framework can be generalized to handle more complex posttreatment patterns. We emphasize the counterfactual logic fundamental to principal stratification and the key assumptions that render analytic challenges more tractable. We briefly discuss technical aspects of estimation procedures, providing a short guide for interested readers.

%B American Journal of Evaluation %V 36 %P 1-18 %G eng %N 4 %0 Journal Article %J Annals of Applied Statistics %D 2014 %T Concise Comparative Summaries (CCS) of Large Text Corpora with a Human Experiment %A Jia, J. %A Miratrix, L. %A Yu, B. %A Gawalt, B. %A El Ghaoui, L. %A Barnesmoore, L. %A Clavier, S. %X

In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between simple word frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis and hope CCS opens the door for a new branch of research in this important field.

For a particular topic of interest (e.g., China or energy), CCS automatically labels documents as being either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels with the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found as predictive are then harvested as the summary. To validate our tool, we designed and conducted a human survey, using news articles from the New York Times international section, to compare the different summarizers with human understanding. We demonstrate our approach with two case studies, a media analysis of the framing of “Egypt” in the New York Times throughout the Arab Spring and an informal comparison of the New York Times’ and Wall Street Journal’s coverage of “energy.” Overall, we find that the Lasso with ${L}^{2}$ normalization can be effectively and usefully used to summarize large corpora, regardless of document size.

%B Annals of Applied Statistics %V 8 %P 499-529 %G eng %N 1 %0 Journal Article %J Journal of the Royal Statistical Society Series B %D 2013 %T Adjusting treatment effect estimates by post-stratification in randomized experiments %A Miratrix, Luke W %A Sekhon, Jasjeet S %A Yu, Bin %B Journal of the Royal Statistical Society Series B %V 75 %P 369–396 %G eng %0 Conference Paper %B Proceedings of the SIGCHI Conference on Human Factors in Computing Systems %D 2013 %T Predicting users' first impressions of website aesthetics with a quantification of perceived visual complexity and colorfulness %A Reinecke, Katharina %A Yeh, Tom %A Miratrix, Luke %A Mardiko, Rahmatri %A Zhao, Yuechen %A Liu, Jenny %A Gajos, Krzysztof Z %B Proceedings of the SIGCHI Conference on Human Factors in Computing Systems %I ACM %P 2049–2058 %G eng %0 Journal Article %J Journal of Research in Science Teaching %D 2012 %T Differential effects of three professional development models on teacher knowledge and student achievement in elementary science %A Heller, J.I. %A Daehler, K.R. %A Wong, N. %A Shinohara, M. %A Miratrix, L.W. %X To identify links among professional development, teacher knowledge, practice, and student achievement, researchers have called for study designs that allow causal inferences and that examine relationships among features of interventions and multiple outcomes. In a randomized experiment implemented in six states with over 270 elementary teachers and 7,000 students, this project compared three related but systematically varied teacher interventions—*Teaching Cases*, *Looking at Student Work*, and *Metacognitive Analysis*—along with no-treatment controls.
The three courses contained identical science content components, but differed in the ways they incorporated analysis of learner thinking and of teaching, making it possible to measure effects of these features on teacher and student outcomes. Interventions were delivered by staff developers trained to lead the teacher courses in their regions. Each course improved teachers' and students' scores on selected-response science tests well beyond those of controls, and effects were maintained a year later. Student achievement also improved significantly for English language learners in both the study year and follow-up, and treatment effects did not differ based on sex or race/ethnicity. However, only Teaching Cases and Looking at Student Work courses improved the accuracy and completeness of students' written justifications of test answers in the follow-up, and only Teaching Cases had sustained effects on teachers' written justifications. Thus, the content component in common across the three courses had powerful effects on teachers' and students' ability to choose correct test answers, but their ability to explain why answers were correct only improved when the professional development incorporated analysis of student conceptual understandings and implications for instruction; metacognitive analysis of teachers' own learning did not improve student justifications either year. Findings suggest investing in professional development that integrates content learning with analysis of student learning and teaching rather than advanced content or teacher metacognition alone.

Experimenters often use post-stratification to adjust estimates. Post-stratification is akin to blocking, except that the number of treated units in each stratum is a random variable because stratification occurs after treatment assignment. We analyse both post-stratification and blocking under the Neyman–Rubin model and compare the efficiency of these designs. We derive the variances for a post-stratified estimator and a simple difference-in-means estimator under different randomization schemes. Post-stratification is nearly as efficient as blocking: the difference in their variances is of the order of $1/n^{2}$, with a constant depending on treatment proportion. Post-stratification is therefore a reasonable alternative to blocking when blocking is not feasible. However, in finite samples, post-stratification can increase variance if the number of strata is large and the strata are poorly chosen. To examine why the estimators’ variances are different, we extend our results by conditioning on the observed number of treated units in each stratum. Conditioning also provides more accurate variance estimates because it takes into account how close (or far) a realized random sample is from a comparable blocked experiment. We then show that the practical substance of our results remains under an infinite population sampling model. Finally, we provide an analysis of an actual experiment to illustrate our analytical results.
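The two estimators compared in this abstract can be sketched side by side: the simple difference in means, and the post-stratified estimator that takes within-stratum differences in means and weights each stratum by its share of the sample. This is a minimal illustration, assuming every stratum contains at least one treated and one control unit (as the abstract notes, the per-stratum treated counts are random, so that can fail in practice).

```python
import statistics

def simple_and_poststratified(y, z, strata):
    """Return (simple difference in means, post-stratified estimate).

    The post-stratified estimator computes a difference in means within
    each stratum and averages them with weights n_s / n.
    """
    def dim(ys, zs):
        t = [yi for yi, zi in zip(ys, zs) if zi == 1]
        c = [yi for yi, zi in zip(ys, zs) if zi == 0]
        return statistics.mean(t) - statistics.mean(c)

    simple = dim(y, z)
    n = len(y)
    post = 0.0
    for s in set(strata):
        ys = [yi for yi, si in zip(y, strata) if si == s]
        zs = [zi for zi, si in zip(z, strata) if si == s]
        post += (len(ys) / n) * dim(ys, zs)   # weight by stratum share
    return simple, post
```

When treatment proportions differ across strata, the simple estimator mixes stratum effects with stratum composition, while the post-stratified estimator removes that imbalance.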

In November 2008, we audited contests in Santa Cruz and Marin counties, California. The audits were risk-limiting: they had a prespecified minimum chance of requiring a full hand count if the outcomes were wrong. We developed a new technique for these audits, the trinomial bound. Batches of ballots are selected for audit using probabilities proportional to the amount of error each batch can conceal. Votes in the sample batches are counted by hand. Totals for each batch are compared to the semiofficial results. The “taint” in each sample batch is computed by dividing the largest relative overstatement of any margin by the largest possible relative overstatement of any margin. The observed taints are binned into three groups: less than or equal to zero, between zero and a threshold *d*, and larger than *d*. The numbers of batches in the three bins have a joint trinomial distribution. An upper confidence bound for the overstatement of the margin in the election as a whole is constructed by inverting tests for trinomial category probabilities and projecting the resulting set. If that confidence bound is sufficiently small, the hypothesis that the outcome is wrong is rejected, and the audit stops. If not, there is a full hand count. We conducted the audits with a risk limit of 25%, ensuring at least a 75% chance of a full manual count if the outcomes were wrong. The trinomial confidence bound confirmed the results without a full count, even though the Santa Cruz audit found some errors. The trinomial bound gave better results than the Stringer bound, which is commonly used to analyze financial audit samples drawn with probability proportional to error bounds.
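The taint computation and three-way binning described in this abstract can be sketched directly; inverting trinomial tests to obtain the upper confidence bound is the harder step and is omitted here, and the input representation (per-batch observed overstatement and maximum possible overstatement) is an assumption of this sketch.

```python
def trinomial_bins(batch_overstatements, batch_error_bounds, d=0.02):
    """Compute per-batch taints and bin them into the three trinomial
    categories described above: taint <= 0, taint in (0, d], taint > d.

    batch_overstatements: largest relative overstatement of any margin
    observed in each audited batch; batch_error_bounds: the largest
    possible relative overstatement for that batch.  Returns the three
    bin counts, which jointly follow a trinomial distribution across
    the sampled batches.
    """
    counts = [0, 0, 0]
    for err, bound in zip(batch_overstatements, batch_error_bounds):
        taint = err / bound
        if taint <= 0:
            counts[0] += 1
        elif taint <= d:
            counts[1] += 1
        else:
            counts[2] += 1
    return counts
```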