##############
*** README ***
##############

This file is the README for the replication files of "A Rosetta Stone for Human Capital" by Dev Patel and Justin Sandefur (2020).

There are three parent folders in these replication files:
- The first, "CODE", contains the code that executes the analysis of the raw files.
- The second, "DATA", contains the raw data.
- The third, "TESTS", contains the underlying items necessary to construct the tests (in the source language and in Hindi).

############
*** CODE ***
############

The code was written using Stata 16.

Two custom ado files used in this analysis are in the folder "CODE/Custom ado files". They make minor modifications to the commands uirt and texsave to facilitate output from Stata to LaTeX. Users who simply want to view the main results do not need these modifications and can instead change references to "texsave_notable" to "texsave" and references to "uirt_difgraphs" to "uirt". Otherwise, place the two ado files in a folder on Stata's ado path (for example, the PERSONAL directory reported by the sysdir command).
The remaining commands used in the do files can all be downloaded directly using ssc install.
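
The ado-file setup described above can be sketched as follows. This is a minimal sketch, not the authors' own setup script: the folder path passed to adopath is a hypothetical placeholder, and it assumes the unmodified uirt and texsave packages should be installed from SSC alongside the custom versions.

```stata
* List the directories Stata searches for ado files. Copying the two custom
* ado files into the PERSONAL directory shown here makes them available.
sysdir

* Alternative for the current session only: add the replication folder to
* the ado path. The path below is a hypothetical placeholder.
adopath + "C:/replication/CODE/Custom ado files"

* Install the unmodified commands from SSC (assumed to be needed as well).
ssc install uirt, replace
ssc install texsave, replace
```

Any other missing command reported when the do files run can be installed the same way with ssc install.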

Users wishing to replicate the results should modify the working directories in the do file "00 Master Code and Macros" and create destination folders for the output datasets, figures, and tables. Three datasets (the PASEC microdata, the U.S. census data, and the trade data) cannot be posted publicly but are available online for free after a short user registration. The directories for these datasets (lines 36-38 of "00 Master Code and Macros") must also be updated for the code to run completely. After handling the ado files as described above, the user can run all of the code from "00 Master Code and Macros". If the code calls a Stata package that is not yet installed, it can be installed directly through ssc.
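
As an illustration of the directory setup described above, the fragment below creates output folders and sets one of the data directories. It is a sketch under stated assumptions: every path is a hypothetical placeholder, the output folder names are invented for illustration, and only the global name "pasec" is taken from this README. The other macro names should be read from lines 36-38 of the master file itself.

```stata
* Hypothetical output location; replace with a folder on your machine.
* mkdir does not create parent folders, so create them in order.
* capture suppresses the error if a folder already exists.
capture mkdir "C:/replication/OUTPUT"
capture mkdir "C:/replication/OUTPUT/datasets"
capture mkdir "C:/replication/OUTPUT/figures"
capture mkdir "C:/replication/OUTPUT/tables"

* Directory holding the downloaded PASEC microdata (hypothetical path).
* In the replication files this global is set inside the master do file.
global pasec "C:/restricted/PASEC"

* Run the full analysis from the master file (the .do extension is assumed).
do "00 Master Code and Macros.do"
```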

The do files that bootstrap the analysis, specifically 06a and 06b, take many days to run on a server.

00 Master Code and Macros: This file contains the global run script.
01 Create Master Numbers: This file constructs a master list of item IDs across each exam version.
02 Prepare Item Parameters: This file constructs a dataset of item parameters for the IRT estimation.
03 Grade Tests: This file grades the raw student responses by calling the 12 grading do files in the folder "CODE/Grading Do Files".
04 Renumber Tests and Combine: This file combines the tests together and calculates statistics on the proportion of responses that are correct and missing.
05 IRT: This file runs the item response theory models to estimate test scores for each student and test. It also produces the DIF plots.
06a Bootstrap Items: This file runs bootstrap estimates of the IRT models using a random selection of items.
06b Bootstrap Students: This file runs bootstrap estimates of the IRT models using a random draw of students from the Bihar sample with replacement.
07 Linking Functions: This file estimates the linking functions between test scores and produces the resulting plots.
08a Grade Adjustments: This file creates the grade adjustments using the linked microdata from several Latin American countries.
08b Bootstrap Grade Adjustments: This file samples with replacement from the Latin American microdata to produce bootstrap estimates of the grade adjustments.
09 Convert Microdata: This file applies the linking functions and grade adjustments to the official microdata.
10a Construct Median Sample: This file cleans and combines median test scores from each country.
10b Convert Medians with Bootstraps: This file bootstraps with replacement to estimate a distribution of converted median scores for each country.
10c Country-Level Median Estimates: This file converts existing median test scores using the links estimated on the Bihar sample.
11 Country-Level Benchmarks: This file calculates country comparisons relative to the low international benchmarks.
12 Global Learning Distribution: This file combines the converted microdata to analyze the global learning distribution.
13 Country Correlates: This file estimates country-level correlations between the new test score measures and other important economic indicators.
14 Skilled Trade: This file estimates the relationship between test scores and the value of exports in skill-intensive industries.
15 Maps: This file creates the maps of test score coverage.
16 Miscellaneous Code: This file generates many of the statistics and tables used in the text and conducts the unidimensionality tests using factor analysis. The code that constructs the LaTeX for the DIF graphs uses a directory that will need to be changed to the figures directory when the output is flowed into LaTeX.
17a Private Schools: This file estimates the private school premiums and calculates the selection bias following Oster (2019), saving the resulting coefficients.
17b Private School Graphs: This file creates graphs analyzing the private school premiums.
18 Variance Decomposition: This file estimates the relative role of country and income in explaining test scores.
19 Rosetta Stone: This file exports the country scores and conversion tables.

############
*** DATA ***
############

The data used in this analysis are listed below in the order in which they appear in the .do files. Data that is not included in the replication files due to usage restrictions is noted, along with information on how users can access it for themselves. The raw data come from the international assessments unless otherwise noted.

The file Exam Question Sources.xlsx contains information on the original source of items used in the hybrid tests.
The files PIRLS11 for Stata.csv, TIMSS11 for Stata.csv, and PASEC for Stata.csv contain the item parameters for PIRLS, TIMSS, and PASEC, respectively. The files REG_PL3_EG.xls, REG_PL6_EG.xls, REG_PM3_EG.xls, and REG_PM6_EG.xls contain the item parameters for LLECE.
The novel primary data from the hybrid test used in this analysis are the test score responses in the .dta files test_#_# from the field work in Bihar. The numbers denote the test and version.
The files PL3_all_TERCE.dta, PM3_all_TERCE.dta, PL6_all_TERCE.dta, and PM6_all_TERCE.dta contain the microdata for LLECE TERCE. The file QF3.dta contains student characteristics for the LLECE microdata. The data can be downloaded from the UNESCO website: http://www.unesco.org/new/en/santiago/education/education-assessment-llece/terce/databases/
The SPSS files in the folders TIMSS and PIRLS contain the underlying test score data for those tests. The data can be downloaded from: https://timssandpirls.bc.edu/. The data was merged using the IEA IDB Analyzer, available from the same site, and converted to .dta files using SPSS. The merged files are TIMSS_2011_4_studenbg and PIRLS_2011_4_studentbg.
The file PASEC2014_GRADE6.dta contains the microdata for PASEC 2014. This data is not included in the replication materials. It can be downloaded from the PASEC website after filling out a short form: http://www.pasec.confemen.org/donnees/. Update the global for "pasec" in "00 Master Code and Macros" with the directory for this file after it is downloaded.
The files TIMSSPIRLS_median.csv and PASEC_median.csv contain country median scores for TIMSS, PIRLS, and PASEC.
The file WDI_data.csv contains country statistics from the World Bank: https://databank.worldbank.org/home.aspx
The file codes_masterlist.dta is a crosswalk of country abbreviations and names.
The file TIMSSPIRLS_benchmarks.csv contains the share of students meeting the low international benchmarks from TIMSS and PIRLS.
The file WB_enrollment contains data on school enrollment rates from the World Bank: https://databank.worldbank.org/home.aspx
The file LM_WPID_web.dta contains data on the income distribution of each country: http://www.worldbank.org/en/research/brief/World-Panel-Income-Distribution
The file CLASS.xls contains information on World Bank income classification: https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups
The file WDI_gini.csv contains the Gini inequality coefficient according to the World Bank's World Development Indicators: https://databank.worldbank.org/home.aspx
The file WBGender_Data.csv contains data from the World Bank's World Development Indicators on gender inequality within countries: https://databank.worldbank.org/home.aspx
The file Data_Extract_From_Education_Statistics_Learning_Outcomes.xlsx contains average scores for countries on regional and international assessments from the World Bank Education Statistics: http://datatopics.worldbank.org/education/home.
The file pwt90.dta is the income data from the Penn World Tables version 9.0: https://www.rug.nl/ggdc/productivity/pwt/
The file BL2014_MF1599_v2.2.dta is the average years of schooling data from Barro and Lee (2013), version 2.2: http://www.barrolee.com/
The file HLO Panel Data (mean, thresh) - Jan 2018.dta contains results from Altinok et al. (2018) provided by the authors, and the file 8c1824bb-fb57-483f-8a92-bda2ba7d2aab_Data.csv contains test scores from Patrinos and Angrist (2018), which can be downloaded from the World Bank's education statistics portal: http://datatopics.worldbank.org/education/home.
The file hanushek+woessmann.cognitive.dta contains results from Hanushek and Woessmann (2013), downloaded from Eric Hanushek's website: http://hanushek.stanford.edu/download.
The file WBEdstats_Data.csv is data from the World Bank on education spending: http://datatopics.worldbank.org/education/home.
The file GPS.dta is data from the Global Preferences Survey: https://www.briq-institute.org/global-preferences/home.
The file usa_00082.dta contains data from the 5 percent sample of the United States 2000 Census to measure skill intensity by industry. This data is not included in the replication materials. Users may download it from usa.ipums.org, selecting the following variables: YEAR, DATANUM, SERIAL, HHWT, GQ, PERNUM, PERWT, EDUC, EDUCD, EMPSTAT, EMPSTATD, OCC2010, IND.
The files cw_ind2000_ind1990ddx.dta, cw_hs6_sic97dd, and cw_sic87_ind1990ddx are crosswalks from David Dorn: https://www.ddorn.net/data.htm.
The file baci92_2017 is the 2017 trade data based on the BACI release of COMTRADE, and country_code_baci92 is the crosswalk for the corresponding country codes. These files are not included in the replication materials. Users may download both from CEPII: http://www.cepii.fr/CEPII/en/bdd_modele/presentation.asp?id=1
The file nem-occcode-acs-crosswalk.xlsx comes from the Bureau of Labor Statistics: https://www.bls.gov/emp/documentation/crosswalks.htm.
The file 2010_to_SOC_Crosswalk.xls comes from the O*NET Resource Center: https://www.onetcenter.org/crosswalks.html.
The file Abilities.xlsx comes from the O*NET Resource Center: https://www.onetcenter.org/database.html.
The file Data_Extract_From_Education_Statistics_Learning_Outcomes.xlsx contains data on test coverage: http://datatopics.worldbank.org/education/home.
The file ASER_UWEZO_map.dta contains information on ASER and UWEZO country coverage.
The files globalshape.dbf and globalshape.shp are the shapefiles necessary to create the map of test coverage from GADM: https://gadm.org/
The file Reference Population Percent Correct.xlsx contains information on the proportion of students in the original sample who answered specific items correctly.
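
Before running the code, it may help to confirm that the data files not included in the replication materials are in place. The fragment below is a sketch under stated assumptions: the folder paths are hypothetical placeholders, the IPUMS extract is assumed to use lowercase variable names, and a newly created IPUMS extract will likely carry a different number than usa_00082, so the downloaded file may need to be renamed. The trade files are omitted because their file extension is not stated in this README.

```stata
* Hypothetical folder locations; replace with your own.
local pasec_dir  "C:/restricted/PASEC"
local census_dir "C:/restricted/census"

* Stop with an error if either file is missing.
confirm file "`pasec_dir'/PASEC2014_GRADE6.dta"
confirm file "`census_dir'/usa_00082.dta"

* Check that the census extract contains the variables listed above.
use "`census_dir'/usa_00082.dta", clear
confirm variable year datanum serial hhwt gq pernum perwt ///
    educ educd empstat empstatd occ2010 ind
```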

#############
*** TESTS ***
#############

The PDF files test#_#.pdf are the Hindi tests that were administered to the students in Bihar. The Roman numeral denotes the test version. The initial versions (marked with _1), which were administered in some schools, contained typos, so a second round of tests (marked with _2) was created to correct them.
The items in the tests that do not come from LLECE, PASEC, TIMSS, or PIRLS come from a state-level assessment in India that was not included in the main analysis here.
In TESTS/LaTeX, we also provide LaTeX versions of the public use items. The images associated with the questions can be found in the Images folder. The Hindi versions of the LaTeX questions are in the Hindi folder. English versions for PASEC, PIRLS, and TIMSS are in the English folder.
Please note that these LaTeX documents will not compile on their own; they contain only the individual questions.