Reducing Selection Bias in Quasi-Experimental Educational Studies
Christopher Brooks
School of Information, University of Michigan
brooksch@umich.edu

Omar Chavez
Department of Statistics and Data Sciences, University of Texas at Austin
ochavez@utexas.edu

Jared Tritz
School of Information, University of Michigan
jtritz@umich.edu

Stephanie Teasley
School of Information, University of Michigan
steasley@umich.edu

ABSTRACT

In this paper we examine the issue of selection bias in quasi-experimental (non-randomly controlled) educational studies. We provide background about common sources of selection bias and the issues involved in evaluating the outcomes of quasi-experimental studies. We describe two methods, matched sampling and propensity score matching, that can be used to overcome this bias. We illustrate their application through a case study that leverages large educational datasets drawn from higher education institutional data warehouses. The contribution of this work is the recommendation of a methodology and case study that educational researchers can use to understand, measure, and reduce selection bias in real-world educational interventions.

1. INTRODUCTION

Evaluating the impact of novel educational pedagogies, strategies, programs, and interventions in quasi-experimental studies can be highly error-prone due to selection biases. The effect of these errors can be significant, and can lead to harm being done to learners, instructors, and institutions through misinformed decision-making. Further, the lack of confidence researchers have in their analyses of real-world deployments can lead to a decrease in situated experimentation. In this paper we describe a methodology to understand and correct for selection bias, restoring the confidence researchers and policy-makers can have in the results of quasi-experimental studies.

A quasi-experimental study is one in which there is no randomized control population.
This design contains the potential for selection bias among learners, which is a principal challenge for measuring the effectiveness of the intervention delivered.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
LAK '15, March 16-20, 2015, Poughkeepsie, NY, USA. ACM 978-1-4503-3417-4/15/03 ...$15.00. http://dx.doi.org/10.1145/2723576.2723614

In educational studies selection bias is often very difficult to eliminate: there are ethical considerations around equal access to new learning technologies and programs, as well as pragmatic considerations such as dealing with open recruitment of learners, where the bias is the result of self-selection. For instance, one would expect a series of workshops aimed at helping non-traditional learners excel in first year university to increase the grades of these students. But students may not elect to attend workshops randomly: students with existing strong study skills may be more predisposed to attending the workshops, and this latent variable may be more explanatory of the outcome than the workshop itself.

The big data culture that has permeated academic (as well as other) institutions offers a solution to issues of selection bias in quasi-experimental interventions. Instead of limiting learner access to a technology or program to form a control group a priori, a subset of the overall population of learners is selected post hoc such that it best matches the group of learners who received the intervention.
This creates a matched sample, and allows for an “apples-to-apples” comparison of outcomes between the two groups of learners while contextualizing how those groups might differ with respect to selection bias.

The work presented here describes a process for identifying a matched sample of learners and contextualizing how the matched sample differs from those learners who have received educational interventions. This technique is especially important when communicating research results to decision makers within the higher education institution. By comparing the effect of a treatment (a learning technology or program) on a group of learners against a similarly matched sample, researchers can control for selection bias and make a more compelling argument about the impact (or lack thereof) of their intervention.

The contributions of this work are threefold:

1. A process for evaluating educational programs and interventions using subset matching, including an understanding of the important statistical tests that must be considered when contextualizing how good a match (that is, how unbiased with respect to some attributes) has been achieved.
2. A case study demonstrating how this method can be applied to reduce selection bias.
3. A free and open source software toolkit (footnote 1) that allows educational researchers to execute this process directly, complete with the reporting of contextual statistics about the matched populations.

2. ADDRESSING SELECTION BIAS

2.1 Selection Bias

To identify methods to deal with the problem of selection bias, we first describe what causes the estimated effect of an educational intervention (the treatment effect) to become biased. This bias is typically due to members of the treatment group voluntarily selecting themselves to participate in a given intervention. Other exogenous factors may exist, such as access to the technology needed to engage in the treatment (e.g. Internet access, access to a smart phone) or selection that is based in part on demographic features.
Consequently one might ask whether there is some important difference in the individuals who elect to participate in a particular treatment, or whether this is a completely random phenomenon that is unrelated to the particular individuals. Regardless, we can mitigate the effects that various factors (the observed covariates) might have on the outcome of interest. Specifically, we want to estimate the extent to which the measured covariates influence our estimate of the average treatment effect.

In general, our estimates of the average treatment effect involve matching members of the treatment condition to members of the comparison (control) group, and rely on the “strong ignorability assumption” [2]. To state it plainly: if we observe two individuals with the same base set of covariates or the same propensity score, then the likelihood of participating in the intervention is the same for both. Thus one electing to use the intervention and the second not is purely coincidental, and comparing the two students' outcomes is a valid approach. Without this assumption it is impossible to infer that all the selection bias has been removed from the estimated treatment effect [5, 2]. In practice this inference is limited to the covariates we are able to measure, and resources, time, or ethical constraints inevitably limit our ability to develop a “complete” set of potentially relevant factors to control for. A natural consequence of this is explained in [3]:

“It is important to realize, however, that whether treatments are randomly assigned or not, and no matter how large a sample size one has, a skeptical observer could always eventually find some variable that systematically differs in the E trials and C trials (e.g., length of longest hair on the child) and claim the average difference estimates the effect of this variable rather than the causal effect of Treatment.
Within the experiment there can be no refutation of this claim; only a logical argument explaining that the variable cannot causally affect the dependent variable or additional data outside the study can be used to counter it.”

Footnote 1: Available at https://github.com/usaskulc/population_matching

This statement again points to the need for a researcher, when attempting to establish a causal relationship between the application of a treatment and some measured outcome, to use as complete a data set as possible. Both the covariates that are related to an outcome of interest (such as test scores) and the covariates that affect the likelihood of an individual opting to participate in the treatment or intervention are relevant; they are discussed in [1].

Thus we would say that selection bias due to some collection of variables $X$ is the bias that is introduced into our estimate of the average treatment effect when we fail to account or control for $X$. Mathematically we can state this in the following way. Suppose the true treatment effect is $T_{\text{true}}$. Let $T_{(-X)}$ be the estimated treatment effect when we do not use $X$, and $T_{(+X)}$ be the estimated treatment effect when we do use $X$. Then:

$$\text{bias}_{X\ \text{removed}} = T_{(-X)} - T_{(+X)} \qquad (1)$$

This equation allows us to calculate how much selection bias is introduced as the result of failing to account for a particular covariate or set of covariates $X$ from the data we have available. For example, suppose we were interested in measuring the selection bias a variable such as Math SAT (footnote 2) would introduce into our estimated treatment effect. We would first take our two groups, treatment and control, and match them based on some list of covariates (e.g. gender, socio-economic status, residency status) that does not include Math SAT scores. Our estimated treatment effect would be $T_1$. We would then repeat the analysis, this time including Math SAT scores, to get a second estimate of the treatment effect, $T_2$.
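As an illustration only, the two-pass comparison behind Equation (1) might be sketched as follows. The data are synthetic, and the helper name `matched_effect`, the variable names, and the choice of one-to-one Euclidean matching solved with SciPy's `linear_sum_assignment` are assumptions made for this sketch; none of it is taken from the authors' toolkit.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def matched_effect(covariates, treated, outcome):
    """Mean outcome gap between treated units and their 1:1 matched controls."""
    x = np.asarray(covariates, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # Standardise each column so that no single covariate dominates the distance.
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    t_idx = np.flatnonzero(treated)
    c_idx = np.flatnonzero(~treated)
    # Euclidean distances: rows are treated units, columns are candidate controls.
    diff = x[t_idx][:, None, :] - x[c_idx][None, :, :]
    cost = np.sqrt((diff ** 2).sum(axis=2))
    rows, cols = linear_sum_assignment(cost)
    return float(np.mean(outcome[t_idx[rows]] - outcome[c_idx[cols]]))


# Synthetic population (hypothetical): opting in depends on Math SAT.
rng = np.random.default_rng(0)
n = 1200
female = rng.integers(0, 2, size=n).astype(float)
math_sat = rng.normal(size=n)  # standardised score
p_treat = 1.0 / (1.0 + np.exp(-(-2.5 + 1.5 * math_sat)))
treated = rng.random(n) < p_treat
true_effect = 1.0
outcome = true_effect * treated + 2.0 * math_sat + rng.normal(scale=0.5, size=n)

# First pass: match without Math SAT, giving T1 = T(-X).
t1 = matched_effect(female, treated, outcome)
# Second pass: match with Math SAT included, giving T2 = T(+X).
t2 = matched_effect(np.column_stack([female, math_sat]), treated, outcome)
bias_removed = t1 - t2

print(f"T1 = {t1:.2f}, T2 = {t2:.2f}, bias removed = {bias_removed:.2f}")
```

Because the synthetic outcome depends strongly on the omitted score, the first pass overstates the effect and the second pass lands much closer to the value built into the simulation.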
The difference between $T_1$ and $T_2$ is our point estimate of the bias introduced by failing to include Math SAT as a control variable.

For a full discussion of how various types of covariates in education can account for selection bias see [5]. To summarize their findings: when choosing which variables to use in an analysis, it is important to select data that were known before the treatment was administered, or at least could have been known before the treatment was administered. Data collected after treatment assignment that can be influenced (changed in value) by the treatment itself are not useful, since they can introduce bias by causing either an overestimate or an underestimate of the treatment effect. In educational research, variables relating to demographics, pretest information, prior academic achievement, topic and subject matter preferences, as well as psychological and personality predispositions, have been shown to affect either observed performance or propensity to participate in interventions. For instance, Shadish et al. [4] found that proxy-pretest and topic-preference variables, together with demographics, removed nearly all bias for language-related outcomes, while variables related to demographics, pretests, and prior academic achievement removed about 75% of the selection bias in a mathematics-related intervention. It is worth noting, however, that the actual reductions in bias could also be due to the specific context of the intervention. Even so, the findings provide supporting evidence that estimating treatment effects with observational data is an appropriate approach.

Footnote 2: The Math SAT is a standardized test measuring the mathematics ability of entry level college students in the United States.

2.2 Methods for Subset Matching

The question then arises as to which matching method best deals with the problem of selection bias.
Should we match equally across all of the covariate measures that we have available, or should we use a univariate statistic that describes the propensity by which an individual relates to the treatment group? Rosenbaum and Rubin [2] provide advice on this issue, and suggest that matching on the covariates directly and matching on propensity scores are both sufficient, with neither clearly better than the other. What matters instead is how strongly the set of covariates used for matching is related to the treatment assignment or to the outcome of interest.

With this caveat in mind, we outline two popular methods for finding a matched population for a quasi-experimental study. The first is a simple matching strategy, whereby scores are calculated for each covariate and each pair of subjects in the treatment and comparison groups. A vector of scores for a particular pair of individuals represents the difference between the two subjects, and various differencing methods (e.g. Euclidean distance, Mahalanobis distance) may be used depending upon the form and distribution of the data.

The second method is to collapse the covariates for each individual into a propensity score using a regression approach such as linear or logistic regression. The result is a single value that describes the likelihood that an individual would receive the treatment condition. Care must be taken when forming propensity scores, especially in large matching datasets where the number and diversity of non-treatment individuals outweighs that of the treatment individuals. The difference between two individuals' scores forms a metric by which individuals can be matched.

Regardless of the method used, both approaches form a matrix of treatment versus non-treatment individuals, where each element holds the difference between one treatment individual and one non-treatment individual.
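A minimal sketch of how such a treatment-by-non-treatment matrix could be built under each method is given here. The function names `mahalanobis_cost` and `propensity_cost` and the synthetic covariates are hypothetical, and the use of SciPy's `cdist` and scikit-learn's `LogisticRegression` is an assumption of this sketch rather than a description of the authors' software.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import LogisticRegression


def mahalanobis_cost(x_treated, x_control):
    """Matrix of Mahalanobis distances, treated rows by control columns."""
    pooled = np.vstack([x_treated, x_control])
    # Inverse covariance of the pooled covariates (pseudo-inverse for stability).
    inv_cov = np.linalg.pinv(np.cov(pooled, rowvar=False))
    return cdist(x_treated, x_control, metric="mahalanobis", VI=inv_cov)


def propensity_cost(x_treated, x_control):
    """Matrix of absolute propensity-score differences, treated by control."""
    x = np.vstack([x_treated, x_control])
    z = np.r_[np.ones(len(x_treated), dtype=int),
              np.zeros(len(x_control), dtype=int)]
    # Logistic regression of treatment status on the covariates.
    model = LogisticRegression(max_iter=1000).fit(x, z)
    score = model.predict_proba(x)[:, 1]  # estimated P(treated | covariates)
    s_t, s_c = score[: len(x_treated)], score[len(x_treated):]
    return np.abs(s_t[:, None] - s_c[None, :])


# Synthetic covariates (hypothetical): 40 treated and 400 untreated learners.
rng = np.random.default_rng(1)
x_treated = rng.normal(loc=0.5, size=(40, 3))
x_control = rng.normal(size=(400, 3))

maha = mahalanobis_cost(x_treated, x_control)
prop = propensity_cost(x_treated, x_control)
print(maha.shape, prop.shape)
```

Either matrix has one row per treatment individual and one column per non-treatment individual, with smaller entries indicating closer pairs.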
This matrix can be solved as a linear assignment problem, with the result being globally minimal (most similar) pairwise matches between the treatment and non-treatment populations.

2.3 Reporting on the Effects of Selection Bias

While the subset matching technique attempts to minimize the overall difference between the treatment group and a matched sample, such an approach does not guarantee that suitable matches for a given analysis can be found. It is thus important to verify how well matched the treatment group is to the non-treatment group when presenting results on the effect of the treatment. This can be done by considering the similarity of each of the covariate distributions between the treatment and non-treatment groups. While there are several methods that might be used, a practical approach for continuous data is to compare distributions using a two-sample Kolmogorov-Smirnov test, which is sensitive to both the shape and location of the distributions being compared. A second useful approach is to use the Mann-Whitney test. It has greater efficiency than the t-test on data not sampled from a normal distribution, and it is nearly as efficient as the t-test on normally distributed data. Both are conservative tests that will provide a comprehensive comparison of the distribution of two populations. In our experience, achieving a significant (e.g. p ≤ 0.05) confidence value using the Kolmogorov-Smirnov test is difficult unless the non-treatment group is quite large and diverse, leading to excellent matches. Less robust tests of the quality of matches include means-test methods such as Student's paired t-test.

3. CASE STUDY: LEARNING COMMUNITIES

The purpose of this section of the paper is not to outline a particular case study result per se, but to demonstrate how the techniques described can be used by educational researchers to come to conclusions about the effect of their interventions, by reducing the possible selection bias of the participatory sample.
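Before turning to the data, the assignment step of Section 2.2 and the balance reporting of Section 2.3 can be sketched in code. This is a rough illustration on synthetic covariates: the helper `balance_report`, the covariate names, and the plain Euclidean cost are assumptions of the sketch, with SciPy's `linear_sum_assignment`, `ks_2samp`, and `mannwhitneyu` standing in for whatever the authors' toolkit uses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp, mannwhitneyu


def balance_report(x_a, x_b, names):
    """Per-covariate (KS p-value, Mann-Whitney p-value) for two groups."""
    report = {}
    for j, name in enumerate(names):
        ks_p = ks_2samp(x_a[:, j], x_b[:, j]).pvalue
        mw_p = mannwhitneyu(x_a[:, j], x_b[:, j],
                            alternative="two-sided").pvalue
        report[name] = (ks_p, mw_p)
    return report


# Synthetic covariates (hypothetical): the treated group is shifted upwards.
rng = np.random.default_rng(2)
names = ["entrance_test", "credits_at_entry"]
x_treated = rng.normal(loc=0.8, size=(50, 2))
x_control = rng.normal(size=(500, 2))

# Solve the linear assignment problem on the treated-by-control cost matrix.
cost = cdist(x_treated, x_control)  # Euclidean distance
rows, cols = linear_sum_assignment(cost)
x_matched = x_control[cols]  # one distinct control per treated learner

before = balance_report(x_treated, x_control, names)
after = balance_report(x_treated[rows], x_matched, names)
for name in names:
    print(name, "KS p before:", round(before[name][0], 4),
          "KS p after:", round(after[name][0], 4))
```

A large p-value after matching means only that the test found no evidence of a difference in that covariate, which is why the per-covariate values are worth reporting alongside the outcome analysis.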
Learning communities programs (footnote 3) at our institution group some students into residences based upon students' interest in pursuing a particular domain or discipline. The goal of the learning communities programs is to provide students with a peer group for support, as well as to provide opportunities for academic development. These programs have existed for more than a decade in various forms, and there is a strong interest in understanding the effect these programs have on student success and achievement. Current learning communities include women students in science and engineering programs; students who are interested in the health sciences; students pursuing the visual arts; students who are interested in social justice and community; and students who are interested in research.

Students are not chosen at random for participation in learning communities programs. There is both self-selection bias (e.g. students who are interested in being in the learning community) and a formal selection phase (e.g. application forms include essays which are judged). Students may apply to many learning communities, but can only be accepted into one. Learning communities programs are only available to freshman (first year) university students.

One common question for program evaluators of learning communities is whether participation in the program raises the overall academic achievement of students. A naive approach to answering this question would be to conduct a t-test between students who are in a particular learning community and those who were not in any learning community on a particular outcome variable, such as overall grade point average. Using one year of such data, the difference in means is 0.12 (on a four point scale, see Table 1), suggesting the learning community students actually perform worse than non-learning community students; a t-test confirms significance at p ≤ 0.01.

Student Group             N       Average GPA
Learning Community        103     3.13
Non-Learning Community    6,090   3.25

Table 1: Comparison of treatment (Learning Community) and non-treatment (Non-Learning Community) groups using a naive analysis.

In determining how well matched the comparison groups are, a first step is to consider the list of variables being used and the similarity of the distribution of those variables within each group. This can be done with a two-tailed Kolmogorov-Smirnov test, and Table 2 shows the results between the two groups for a variety of variables that are hypothesized as interacting with cumulative GPA. For the variables that are statistically significant (e.g. p < 0.01) we reject the null hypothesis that the two samples come from the same distribution. In this example, we see that only sex meets this criterion, suggesting that the distribution of sex in the two groups is different.

Footnote 3: See http://www.lsa.umich.edu/mlc

Variable                      KS Confidence (p)
Sex                           p < 0.001***
Ethnic Group                  p = 0.720
Citizenship Status            p = 1.000
Standardized Entrance Test    p = 0.987
Credits at Entry              p = 0.164
Parental Education            p = 0.953
Household Income              p = 0.661

Table 2: Comparison of the treatment and non-treatment groups across seven demographic and performance features before matching.

In this case it was the treatment group that had a higher proportion of women than the non-treatment group. To reduce this bias a matched set can be created. Using the equal covariate matching method described at the beginning of Section 2.2, it is possible to minimize the bias that may exist. Balancing across the variables listed, a paired treatment/non-treatment dataset of 206 individuals (103 matched pairs) can be created. Application of a two-tailed Kolmogorov-Smirnov test shows no significance at the p = 0.01 level, though one variable (Household Income) is significant at the p = 0.05 level.
The high p-values of all other variables suggest this dataset is well balanced, except perhaps with respect to household income.

Variable                      KS Confidence (p)
Sex                           p = 1.000
Ethnic Group                  p = 1.000
Citizenship Status            p = 1.000
Standardized Entrance Test    p = 0.996
Credits at Entry              p = 1.000
Parental Education            p = 1.000
Household Income              p = 0.036*

Table 3: Comparison of the treatment and non-treatment groups across seven demographic and performance features after matching.

The result of the matching process is two populations of the same size, with individuals in the first directly matched to individuals in the second. Thus, a paired t-test for statistical significance can be used on outcome variables. Considering GPA, the paired t-test returns a statistically significant difference (p = 0.003), with the difference in means between the groups being 0.18 points in favor of the non-treatment group. In short, the researcher can now say with greater certainty that there is a difference between the treatment and non-treatment students, and that bias introduced by the observed variables (except perhaps Household Income, at the p = 0.036 level) has been eliminated.

4. CONCLUSIONS

Selection bias in quasi-experimental studies can undermine the confidence decision-makers have in the results of analyses, and lead to possible misunderstandings and poor policy decisions. Yet institutions, researchers, and practitioners are often unable to run randomized controlled experiments of learning innovations because of pragmatic or ethical concerns. This paper has introduced a methodology that researchers can use to contextualize the results of their analysis and reduce selection biases.

Leveraging big educational datasets and institutional data warehouses, researchers can often mitigate selection bias by finding a comparison group of learners who did not undergo a particular treatment. Learners can be compared equally across all covariates (e.g.
demographics, previous performances, or preferences), or covariates can be collapsed into a single propensity score which can be used as the basis for matching. The end result of the matching process is a paired dataset of learners who have undergone a treatment and similar learners who did not receive the treatment. The researcher can then apply post-hoc analysis as appropriate.

In this work we have included an example of this method applied to a case study of an educational program which is particularly affected by selection bias: university learning communities. These learning communities are heavily biased based on the sex of participants (Table 2). After controlling for this bias, an increase of 50% is seen in the difference in means between the treatment and control groups (from 0.12 to 0.18 in GPA units). Whether this is significant enough to change policy or deployment of the program depends on how decision makers weigh this particular outcome. There may be alternative student outcomes such as satisfaction, time to degree completion, or co-curricular achievements that influence policy in this area. What is important here is that the researcher can feel confident that these results more accurately reflect the effects on the treatment population, given the kinds of learners who would opt in to the treatment.

In our experience, however, creating matched pairs of learners rarely yields perfect matches, so contextualizing the goodness of fit between the two groups of learners is important. This can be done both before the matching and afterwards, using the Kolmogorov-Smirnov statistic. This technique can describe which covariates may not be possible to match on, an insight which is essential when forming educational policy. As the level of significance goes up (a declining p value), it becomes more likely that variability (noise) in the data will cause a particular hypothesis to be rejected.
As more variables are considered, the chance of a spurious result for at least one variable at a well-accepted level such as p = 0.05 or p = 0.01 increases. A more conservative threshold can be obtained by dividing the alpha necessary for statistical significance (e.g. 0.05) by the number of variables being considered (7), a Bonferroni correction which controls the family-wise error rate. In this example, at an alpha of 0.05, only values of p ≤ 0.05/7 ≈ 0.0071 would be considered statistically significant. Thus the appearance of Household Income being statistically significantly different between the two distributions should be questioned as a possibly spurious result.

5. ACKNOWLEDGEMENTS

Thanks to Dr. Jim Greer from the University of Saskatchewan for motivating earlier work in this area. Also, thanks to Dr. Ben Hansen at the University of Michigan for insights on using propensity and prognostic scores and their application to matching problems. Finally, thanks to Dr. Brenda Gunderson from the University of Michigan for her support in investigating these issues in the E2 Coach framework.

6. REFERENCES

[1] B. Hansen. The prognostic analogue of the propensity score. Biometrika, pages 1-17, 2008.
[2] P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55, 1983.
[3] D. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974.
[4] W. R. Shadish, M. H. Clark, and P. M. Steiner. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103(484):1334-1344, Dec. 2008.
[5] P. M. Steiner, T. D. Cook, W. R. Shadish, and M. H. Clark. The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15(3):250-67, Sept. 2010.