Articles & Working Papers

Blackwell, Matthew, and Maya Sen. 2012. “Large Datasets and You: A Field Guide.” The Political Methodologist 20 (1): 2–5.

The last five years have seen an explosion in the amount of data available to social scientists. Although a blessing, these extremely large data sources can cause problems for political scientists working with standard statistical software, which is poorly suited to analyzing big datasets. In this essay, we describe a few approaches to handling extremely large datasets with the R programming language, both at the command line before launching R and within R itself. We show that handling large datasets comes down to either (1) choosing tools that can shrink the problem or (2) fine-tuning R to handle massive data files.


The recent subprime mortgage crisis has brought to the forefront the possibility of discriminatory lending on the basis of race or gender. Using the more than 10 million observations collected by the federal government in 2006 under the Home Mortgage Disclosure Act, this paper examines these claims from a causal perspective. In so doing, it considers two possible theories of discrimination: (1) that any discriminatory lending patterns reflect the fact that minority borrowers went to different lenders, perhaps as a result of predatory lending, and (2) that individual lenders discriminated against identically situated borrowers. The results provide limited evidence for the idea that borrowers of different races went to different lenders, but only in certain regions of the country and only for certain minority groups. In addition, many of these results are sensitive to missing confounders (e.g., financial data such as credit scores and down payments, which the federal government does not collect). Ultimately, the sensitivity of the results suggests that more data gathering is needed before legal and policy actors can make definitive assertions.