Recent discussion in the public sphere about algorithmic classification has involved tension between competing notions of what it means for a probabilistic classification to be fair to different groups. We formalize three fairness conditions that lie at the heart of these debates, and we prove that except in highly constrained special cases, there is no method that can satisfy these three conditions simultaneously. Moreover, even satisfying all three conditions approximately requires that the data lie in an approximate version of one of the constrained special cases identified by our theorem. These results suggest some of the ways in which key notions of fairness are incompatible with each other, and hence provide a framework for thinking about the trade-offs between them.
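The three conditions at issue are, in their standard formulation: calibration within groups, balance for the positive class, and balance for the negative class. A minimal sketch of the tension (a toy dataset of my own construction, not from the paper): when two groups have different base rates, scores can be calibrated within each group while the average score assigned to positives (and to negatives) still differs across groups.

```python
def calibration_gap(scores, labels):
    """Largest deviation, across score values, of the observed positive rate
    from the score itself (0.0 means perfectly calibrated)."""
    by_score = {}
    for s, y in zip(scores, labels):
        by_score.setdefault(s, []).append(y)
    return max(abs(sum(ys) / len(ys) - s) for s, ys in by_score.items())

def mean_score(scores, labels, cls):
    """Average score assigned to members of class cls (1 = positive)."""
    picked = [s for s, y in zip(scores, labels) if y == cls]
    return sum(picked) / len(picked)

# Two groups with different base rates (0.5 vs 0.2). Scores are calibrated
# within each group, yet balance fails for both classes.
scores_a = [0.2] * 5 + [0.8] * 5
labels_a = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
scores_b = [0.2] * 10
labels_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(calibration_gap(scores_a, labels_a))   # calibrated in group A
print(calibration_gap(scores_b, labels_b))   # calibrated in group B
print(mean_score(scores_a, labels_a, 1),
      mean_score(scores_b, labels_b, 1))     # positives scored unequally
print(mean_score(scores_a, labels_a, 0),
      mean_score(scores_b, labels_b, 0))     # negatives scored unequally
```

The example matches the theorem's escape hatch: only with equal base rates or perfect prediction could all three conditions hold at once.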
We examine how machine learning can be used to improve and understand human decision-making. In particular, we focus on a decision that has important policy consequences. Millions of times each year, judges must decide where defendants will await trial—at home or in jail. By law, this decision hinges on the judge’s prediction of what the defendant would do if released. This is a promising machine learning application because it is a concrete prediction task for which there is a large volume of data available. Yet comparing the algorithm to the judge proves complicated. First, the data are themselves generated by prior judge decisions. We only observe crime outcomes for released defendants, not for those the judges detained. This makes it hard to evaluate counterfactual decision rules based on algorithmic predictions. Second, judges may have a broader set of preferences than the single variable that the algorithm focuses on; for instance, judges may care about racial inequities or about specific crimes (such as violent crimes) rather than just overall crime risk. We deal with these problems using different econometric strategies, such as quasi-random assignment of cases to judges. Even accounting for these concerns, our results suggest potentially large welfare gains: a policy simulation shows crime can be reduced by up to 24.8% with no change in jailing rates, or jail populations can be reduced by 42.0% with no increase in crime rates. Moreover, we see reductions in all categories of crime, including violent ones. Importantly, such gains can be had while also significantly reducing the percentage of African-Americans and Hispanics in jail. We find similar results in a national dataset as well. In addition, by focusing the algorithm on predicting judges’ decisions, rather than defendant behavior, we gain some insight into decision-making: a key problem appears to be that judges respond to ‘noise’ as if it were signal.
These results suggest that while machine learning can be valuable, realizing this value requires integrating these tools into an economic framework: being clear about the link between predictions and decisions; specifying the scope of payoff functions; and constructing unbiased decision counterfactuals.
Algorithms are increasingly being used to make recommendations about matters of taste, expanding their scope into domains that are primarily subjective. This raises two important questions. How accurately can algorithms predict subjective preferences, compared to human recommenders? And how much do people trust them? Recommender systems face several disadvantages: They have no preferences of their own and they do not model their recommendations after the way people make recommendations. In a series of experiments, however, we find that recommender systems outperform human recommenders, even in a domain where people have a lot of experience and well-developed tastes: Predicting what people will find funny. Moreover, these recommender systems outperform friends, family members, and significant others. But people do not trust these recommender systems. They do not use them to make recommendations for others, and they prefer to receive recommendations from other people instead. We find that this lack of trust partly stems from the fact that machine recommendations seem harder to understand than human recommendations. But, simple explanations of recommender systems can alleviate this distrust.
Mullainathan, Sendhil, Christine De Mol, Eric Gautier, Domenico Giannone, Lucrezia Reichlin, Herman van Dijk, and Jeffrey Wooldridge. 2017. “Big Data in Economics: Evolution or Revolution?” In Economics without Borders: Economic Research for European Policy Challenges, edited by Laszlo Matyas, Richard Blundell, Estelle Cantillon, Barbara Chizzolini, Marc Ivaldi, Wolfgang Leininger, Ramon Marimon, and Frode Steen, 612-632. Cambridge, UK: Cambridge University Press.
The Big Data Era creates a lot of exciting opportunities for new developments in economics and econometrics. At the same time, however, the analysis of large datasets poses difficult methodological problems that should be addressed appropriately and are the subject of the present chapter.
Machine learning tools are beginning to be deployed en masse in health care. While the statistical underpinnings of these techniques have been questioned with regard to causality and stability, we highlight a different concern here, relating to measurement issues. A characteristic feature of health data, unlike other applications of machine learning, is that neither y nor x is measured perfectly. Far from a minor nuance, this can undermine the power of machine learning algorithms to drive change in the health care system; indeed, it can cause them to reproduce and even magnify existing errors in human judgment.
Machines are increasingly doing "intelligent" things. Face recognition algorithms use a large dataset of photos labeled as having a face or not to estimate a function that predicts the presence y of a face from pixels x. This similarity to econometrics raises questions: How do these new empirical tools fit with what we know? As empirical economists, how can we use them? We present a way of thinking about machine learning that gives it its own place in the econometric toolbox. Machine learning not only provides new tools, it solves a different problem. Specifically, machine learning revolves around the problem of prediction, while many economic applications revolve around parameter estimation. So applying machine learning to economics requires finding relevant tasks. Machine learning algorithms are now technically easy to use: you can download convenient packages in R or Python. This also raises the risk that the algorithms are applied naively or their output is misinterpreted. We hope to make them conceptually easier to use by providing a crisper understanding of how these algorithms work, where they excel, and where they can stumble—and thus where they can be most usefully applied.
There is growing interest in understanding the psychology of the poor—biases that may affect decision-making are of particular interest. The sheer diversity of potential biases—hyperbolic discounting, probabilistic errors, and judgmental errors, to name just a few—poses a key challenge. These psychological biases cannot easily be put into a common unit such as money spent. However, two insights from psychology make this problem more tractable.
First, a large body of work points toward a two-system model of the brain. System 1 thinks fast: it is intuitive, automatic, and effortless, and as a result, prone to biases and errors. System 2 is slow, effortful, deliberate, and costly, but typically produces more unbiased and accurate results. Second, when mentally taxed, people are less likely to engage their System 2 processes. Put simply, one might think of having a (mental) reserve or capacity for the kind of effortful thought required to use System 2. When burdened, there is less of this resource available for use in other judgments and decisions. Though there is no commonly accepted name for this capacity, we will refer to it in this article as “bandwidth” (Mullainathan and Shafir 2013). This two-system model has direct relevance to many of the heuristics and biases familiar to economists. Kahneman and Frederick (2002) and more recently Kahneman (2011) provide reviews. Fudenberg and Levine (2006) develop a model with two systems in the context of time discounting.
Psychologists often study this underlying resource by imposing “cognitive load” to tax bandwidth and measure the impact on judgments and decisions. The many ways to induce load produce similar results on various bandwidth measures and consequences from reduced System 2 thinking. This insight is particularly useful because it implies that bandwidth is both malleable and measurable. It also suggests a unified approach of studying the psychology of poverty. We can understand factors in the lives of the poor, such as malnutrition, alcohol consumption, or sleep deprivation, by how they affect bandwidth. And we can understand important decisions made by the poor, such as technology adoption or savings, through the lens of how they are affected by bandwidth. Clearly, bandwidth is not the only important aspect of the psychological lives of the poor; no single metric can take on this role. However, it provides a way to at least partly understand a great many of the thought processes that drive decision-making by the poor.
An increasing number of domains are providing us with detailed trace data on human decisions in settings where we can evaluate the quality of these decisions via an algorithm. Motivated by this development, an emerging line of work has begun to consider whether we can characterize and predict the kinds of decisions where people are likely to make errors.
To investigate what a general framework for human error prediction might look like, we focus on a model system with a rich history in the behavioral sciences: the decisions made by chess players as they select moves in a game. We carry out our analysis at a large scale, employing datasets with several million recorded games, and using chess tablebases to acquire a form of ground truth for a subset of chess positions that have been completely solved by computers but remain challenging even for the best players in the world.
We organize our analysis around three categories of features that we argue are present in most settings where the analysis of human error is applicable: the skill of the decision-maker, the time available to make the decision, and the inherent difficulty of the decision. We identify rich structure in all three of these categories of features, and find strong evidence that in our domain, features describing the inherent difficulty of an instance are significantly more powerful than features based on skill or time.
Imagine sitting in an office located near the railroad tracks. Trains rattle by several times an hour. As you try to concentrate, the rumble of every train pulls you away from what you are doing. You need time to refocus, to collect your thoughts. Worse, just when you have settled back in, another train hurtles by. This description mirrors the conditions of a school in New Haven located next to a noisy railroad line. In the early 1970s two researchers decided to measure the impact of this noise on students. They noted that only one side of the school faced the tracks, so the students in classrooms on that side were particularly exposed to the noise but were otherwise similar to their fellow students.
Niehaus, Paul, Antonia Atanassova, Marianne Bertrand, and Sendhil Mullainathan. 2013. “Targeting with Agents.” American Economic Journal: Economic Policy 5 (1): 206-38.
Targeting assistance to the poor is a central problem in development. We study the problem of designing a proxy means test when the implementing agent is corruptible. Conditioning on more poverty indicators may worsen targeting in this environment because of a novel tradeoff between statistical accuracy and enforceability. We then test necessary conditions for this tradeoff using data on Below Poverty Line card allocation in India. Less eligible households pay larger bribes and are less likely to obtain cards, but widespread rule violations yield a de facto allocation much less progressive than the de jure one. Enforceability appears to matter.
We provide evidence that individuals optimize imperfectly when making annuity decisions, and this result is not driven by loss aversion. Life annuities are more attractive when presented in a consumption frame than in an investment frame. Highlighting the purchase price in the consumption frame does not alter this result. The level of habitual spending has little interaction with preferences for annuities in the consumption frame. In an investment frame, consumers prefer annuities with principal guarantees; this result is similar for guarantee amounts below, at, and above the purchase price. We discuss implications for the retirement services industry and its regulators.
The poor often behave in less capable ways, which can further perpetuate poverty. We hypothesize that poverty directly impedes cognitive function and present two studies that test this hypothesis. First, we experimentally induced thoughts about finances and found that this reduces cognitive performance among poor but not in well-off participants. Second, we examined the cognitive function of farmers over the planting cycle. We found that the same farmer shows diminished cognitive performance before harvest, when poor, as compared with after harvest, when rich. This cannot be explained by differences in time available, nutrition, or work effort. Nor can it be explained with stress: Although farmers do show more stress before harvest, that does not account for diminished cognitive performance. Instead, it appears that poverty itself reduces cognitive capacity. We suggest that this is because poverty-related concerns consume mental resources, leaving less for other tasks. These data provide a previously unexamined perspective and help explain a spectrum of behaviors among the poor. We discuss some implications for poverty policy.
Research in behavioral public finance has blossomed in recent years, producing diverse empirical and theoretical insights. This article develops a single framework with which to understand these advances. Rather than drawing out the consequences of specific psychological assumptions, the framework takes a reduced-form approach to behavioral modeling. It emphasizes the difference between decision and experienced utility that underlies most behavioral models. We use this framework to examine the behavioral implications for canonical public finance problems involving the provision of social insurance, commodity taxation, and correcting externalities. We show how deeper principles undergird much work in this area and that many insights are not specific to a single psychological assumption.
Background: Human error due to risky behaviour is a common and important contributor to acute injury related to poverty. We studied whether social benefit payments mitigate or exacerbate risky behaviours that lead to emergency visits for acute injury among low-income mothers with dependent children. Methods: We analyzed total emergency department visits throughout Ontario to identify women between 15 and 55 years of age who were mothers of children younger than 18 years, who were living in the lowest socio-economic quintile and who presented with acute injury. We used universal health care databases to evaluate emergency department visits during specific days on which social benefit payments were made (child benefit distribution) relative to visits on control days over a 7-year interval (1 April 2003 to 31 March 2010). Results: A total of 153 377 emergency department visits met the inclusion criteria. We observed fewer emergencies per day on child benefit payment days than on control days (56.4 v. 60.1, p = 0.008). The difference was primarily explained by lower values among mothers age 35 years or younger (relative reduction 7.29%, 95% confidence interval [CI] 1.69% to 12.88%), those living in urban areas (relative reduction 7.07%, 95% CI 3.05% to 11.10%) and those treated at community hospitals (relative reduction 6.83%, 95% CI 2.46% to 11.19%). No significant differences were observed for the 7 days immediately before or the 7 days immediately after the child benefit payment. Interpretation: Contrary to political commentary, we found that small reductions in relative poverty mitigated, rather than exacerbated, risky behaviours that contribute to acute injury among low-income mothers with dependent children.
Consumers need information to compare alternatives for markets to function efficiently. Recognizing this, public policies often pair competition with easy access to comparative information. The implicit assumption is that comparison friction—the wedge between the availability of comparative information and consumers’ use of it—is inconsequential because when information is readily available, consumers will access this information and make effective choices. We examine the extent of comparison friction in the market for Medicare Part D prescription drug plans in the United States. In a randomized field experiment, an intervention group received a letter with personalized cost information. That information was readily available for free and widely advertised. However, this additional step—providing the information rather than having consumers actively access it—had an impact. Plan switching was 28% in the intervention group, versus 17% in the comparison group, and the intervention caused an average decline in predicted consumer cost of about $100 a year among letter recipients—roughly 5% of the cost in the comparison group. Our results suggest that comparison friction can be large even when the cost of acquiring information is small and may be relevant for a wide range of public policies that incorporate consumer choice.
Labor market policies succeed or fail at least in part depending on how well they reflect or account for behavioral responses. Insights from behavioral economics, which allow for realistic deviations from standard economic assumptions about behavior, have consequences for the design and functioning of labor market policies. We review key implications of behavioral economics related to procrastination, difficulties in dealing with complexity, and potentially biased labor market expectations for the design of selected labor market policies including unemployment compensation, employment services and job search assistance, and job training.
In this paper, we provide a new framework for analyzing corruption in public bureaucracies. The standard way to model corruption is as an example of moral hazard, which then leads to a focus on better monitoring and stricter penalties with the eradication of corruption as the final goal. We propose an alternative approach which emphasizes why corruption arises in the first place. Corruption is modeled as a consequence of the interaction between the underlying task being performed by the bureaucrat, the bureaucrat's private incentives, and what the principal can observe and control. This allows us to study not just corruption but also other distortions that arise simultaneously with corruption, such as red tape, and ultimately the quality and efficiency of the public services provided, and how these outcomes vary depending on the specific features of this task. We then review the growing empirical literature on corruption through this perspective and provide guidance for future empirical research.
This paper develops a model of health insurance that incorporates behavioral biases. In the traditional model, people who are insured overuse low value medical care because of moral hazard. There is ample evidence, though, of a different inefficiency: people underuse high value medical care because they make mistakes. Such “behavioral hazard” changes the fundamental tradeoff between insurance and incentives. With only moral hazard, raising copays increases the efficiency of demand by ameliorating overuse. With the addition of behavioral hazard, raising copays may reduce efficiency by exaggerating underuse. This means that estimating the demand response is no longer enough for setting optimal copays; the health response needs to be considered as well. This provides a theoretical foundation for value-based insurance design: for some high value treatments, for example, copays should be zero (or even negative). Empirically, this reinterpretation of demand proves important, since high value care is often as elastic as low value care. For example, calibration using data from a field experiment suggests that omitting behavioral hazard leads to welfare estimates that can be both wrong in sign and off by an order of magnitude. Optimally designed insurance can thus increase health care efficiency as well as provide financial protection, suggesting the potential for market failure when private insurers are not fully incentivized to counteract behavioral biases.
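The logic of the tradeoff can be illustrated with a toy simulation (my own construction, following the abstract's argument rather than the paper's model): each person draws a true value for care but decides based on a perceived value that is shaded down by a behavioral bias. With no bias, pricing at cost screens out only low-value care; with bias, the same copay also screens out high-value care, so a lower copay does better.

```python
import random

COST = 1.0  # social cost of one unit of care (assumed, for illustration)

def welfare(copay, bias, n=10_000):
    """Average welfare per person: true value net of social cost, counted
    only for care actually used. The use decision depends on *perceived*
    value (true value minus bias), not true value."""
    rng = random.Random(0)  # fixed seed so comparisons across copays are paired
    total = 0.0
    for _ in range(n):
        v = rng.uniform(0, 2)      # true value of care
        if v - bias > copay:       # demand decision uses perceived value
            total += v - COST
    return total / n

# Moral hazard only (bias = 0): welfare rises as the copay approaches cost.
# Behavioral hazard (bias = 0.5): raising the copay from 0.5 to 1.0 now
# *reduces* welfare, because it exaggerates underuse of high-value care.
for copay in [0.0, 0.5, 1.0]:
    print(copay, round(welfare(copay, 0.0), 3), round(welfare(copay, 0.5), 3))
```

In this sketch the welfare-maximizing copay with bias is cost minus bias—below cost, consistent with the abstract's point that for high-value treatments the optimal copay can be zero or even negative.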