Publications

2013
Ananthakrishnan AN, Cheng SC, Cai T, Cagan A, Gainer VS, Szolovits P, Shaw SY, Churchill S, Karlson EW, Murphy SN, et al. Association Between Reduced Plasma 25-hydroxy Vitamin D and Increased Risk of Cancer in Patients with Inflammatory Bowel Diseases. Clin Gastroenterol Hepatol. 2013.
BACKGROUND & AIMS: Vitamin D deficiency is common among patients with inflammatory bowel diseases (IBD) (Crohn's disease or ulcerative colitis). The effects of low plasma 25-hydroxy vitamin D (25[OH]D) on outcomes other than bone health are understudied in patients with IBD. We examined the association between plasma level of 25(OH)D and risk of cancers in patients with IBD. METHODS: From a multi-institutional cohort of patients with IBD, we identified those with at least 1 measurement of plasma 25(OH)D. The primary outcome was development of any cancer. We examined the association between plasma 25(OH)D and risk of specific subtypes of cancer, adjusting for potential confounders in a multivariate regression model. RESULTS: We analyzed data from 2809 patients with IBD and a median plasma level of 25(OH)D of 26 ng/mL. Nearly one-third had deficient levels of vitamin D (<20 ng/mL). During a median follow-up period of 11 y, 196 patients (7%) developed cancer, excluding non-melanoma skin cancer (41 cases of colorectal cancer). Patients with vitamin D deficiency had an increased risk of cancer (adjusted odds ratio=1.82; 95% CI, 1.25-2.65) compared to those with sufficient levels. Each 1 ng/mL increase in plasma 25(OH)D was associated with an 8% reduction in risk of colorectal cancer (odds ratio=0.92; 95% CI, 0.88-0.96). A weaker inverse association was also identified for lung cancer. CONCLUSION: In a study of 2809 patients with IBD, low plasma level of 25(OH)D was associated with an increased risk of cancer, especially colorectal cancer.
Altman RB, Clayton EW, Kohane IS, Malin BA, Roden DM. Data re-identification: societal safeguards. Science. 2013;339 :1032-3.
Ananthakrishnan AN, Cai T, Savova G, Cheng SC, Chen P, Perez RG, Gainer VS, Murphy SN, Szolovits P, Xia Z, et al. Improving Case Definition of Crohn's Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach. Inflammatory bowel diseases. 2013;19 :1411-1420.
BACKGROUND: Previous studies identifying patients with inflammatory bowel disease using administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record-based model for classification of inflammatory bowel disease leveraging the combination of codified data and information from clinical text notes using natural language processing. METHODS: Using the electronic medical records of 2 large academic centers, we created data marts for Crohn's disease (CD) and ulcerative colitis (UC) comprising patients with ≥1 International Classification of Diseases, 9th edition, code for each disease. We used codified (i.e., International Classification of Diseases, 9th edition codes, electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation was performed in a training set of 600 randomly selected patients for each disease with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables. RESULTS: We confirmed 399 CD cases (67%) in the CD training set and 378 UC cases (63%) in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve for CD 0.95; UC 0.94) than models using only disease International Classification of Diseases, 9th edition codes (area under the curve 0.89 for CD; 0.86 for UC). Addition of natural language processing narrative terms to our final model resulted in classification of 6% to 12% more subjects with the same accuracy. CONCLUSIONS: Inclusion of narrative concepts identified using natural language processing improves the accuracy of electronic medical records case definition for CD and UC while simultaneously identifying more subjects compared with models using codified data alone.
Ananthakrishnan AN, Cagan A, Gainer VS, Cai T, Cheng SC, Savova G, Chen P, Szolovits P, Xia Z, De Jager PL, et al. Normalization of Plasma 25-Hydroxy Vitamin D Is Associated with Reduced Risk of Surgery in Crohn's Disease. Inflamm Bowel Dis. 2013.
BACKGROUND: Vitamin D may have an immunologic role in Crohn's disease (CD) and ulcerative colitis (UC). Retrospective studies suggested a weak association between vitamin D status and disease activity but have significant limitations. METHODS: Using a multi-institution inflammatory bowel disease cohort, we identified all patients with CD and UC who had at least one measured plasma 25-hydroxy vitamin D (25(OH)D). Plasma 25(OH)D was considered sufficient at levels ≥30 ng/mL. Logistic regression models adjusting for potential confounders were used to identify impact of measured plasma 25(OH)D on subsequent risk of inflammatory bowel disease-related surgery or hospitalization. In a subset of patients where multiple measures of 25(OH)D were available, we examined impact of normalization of vitamin D status on study outcomes. RESULTS: Our study included 3217 patients (55% CD; mean age, 49 yr). The median lowest plasma 25(OH)D was 26 ng/mL (interquartile range, 17-35 ng/mL). In CD, on multivariable analysis, plasma 25(OH)D <20 ng/mL was associated with an increased risk of surgery (odds ratio, 1.76; 95% confidence interval, 1.24-2.51) and inflammatory bowel disease-related hospitalization (odds ratio, 2.07; 95% confidence interval, 1.59-2.68) compared with those with 25(OH)D ≥30 ng/mL. Similar estimates were also seen for UC. Furthermore, patients with CD who had initial levels <30 ng/mL but subsequently normalized their 25(OH)D had a reduced likelihood of surgery (odds ratio, 0.56; 95% confidence interval, 0.32-0.98) compared with those who remained deficient. CONCLUSION: Low plasma 25(OH)D is associated with increased risk of surgery and hospitalizations in both CD and UC, and normalization of 25(OH)D status is associated with a reduction in the risk of CD-related surgery.
2012
Wattanasin N, Porter A, Ubaha S, Mendis M, Phillips L, Mandel J, Ramoni R, Mandl K, Kohane I, Murphy SN. Apps to display patient data, making SMART available in the i2b2 platform. AMIA Annu Symp Proc. 2012;2012 :960-9.
The Substitutable Medical Apps, Reusable Technologies (SMART) project provides a framework of core services to facilitate the use of substitutable health-related web applications. The platform offers a common interface used to "SMART-ready" health IT systems allowing any SMART application to be able to interact with those systems. At Partners Healthcare, we have SMART-enabled the Informatics for Integrating Biology and the Bedside (i2b2) open source analytical platform, enabling the use of SMART applications directly within the i2b2 web client. In i2b2, viewing the patient in an EMR-like view enables a natural-feeling medical review process for each patient.
Schmid PR, Palmer NP, Kohane IS, Berger B. Making sense out of massive data by going beyond differential expression. Proc Natl Acad Sci U S A. 2012;109 :5594-9.
With the rapid growth of publicly available high-throughput transcriptomic data, there is increasing recognition that large sets of such data can be mined to better understand disease states and mechanisms. Prior gene expression analyses, both large and small, have been dichotomous in nature, in which phenotypes are compared using clearly defined controls. Such approaches may require arbitrary decisions about what are considered "normal" phenotypes, and what each phenotype should be compared to. Instead, we adopt a holistic approach in which we characterize phenotypes in the context of a myriad of tissues and diseases. We introduce scalable methods that associate expression patterns to phenotypes in order both to assign phenotype labels to new expression samples and to select phenotypically meaningful gene signatures. By using a nonparametric statistical approach, we identify signatures that are more precise than those from existing approaches and accurately reveal biological processes that are hidden in case vs. control studies. Employing a comprehensive perspective on expression, we show how metastasized tumor samples localize in the vicinity of the primary site counterparts and are overenriched for those phenotype labels. We find that our approach provides insights into the biological processes that underlie differences between tissues and diseases beyond those identified by traditional differential expression analyses. Finally, we provide an online resource (http://concordia.csail.mit.edu) for mapping users' gene expression samples onto the expression landscape of tissue and disease.
Wolf SM, Crock BN, Van Ness B, Lawrenz F, Kahn JP, Beskow LM, Cho MK, Christman MF, Green RC, Hall R, et al. Managing incidental findings and research results in genomic research involving biobanks and archived data sets. Genet Med. 2012;14 :361-84.
Biobanks and archived data sets collecting samples and data have become crucial engines of genetic and genomic research. Unresolved, however, is what responsibilities biobanks should shoulder to manage incidental findings and individual research results of potential health, reproductive, or personal importance to individual contributors (using "biobank" here to refer both to collections of samples and collections of data). This article reports recommendations from a 2-year project funded by the National Institutes of Health. We analyze the responsibilities involved in managing the return of incidental findings and individual research results in a biobank research system (primary research or collection sites, the biobank itself, and secondary research sites). We suggest that biobanks shoulder significant responsibility for seeing that the biobank research system addresses the return question explicitly. When reidentification of individual contributors is possible, the biobank should work to enable the biobank research system to discharge four core responsibilities to (1) clarify the criteria for evaluating findings and the roster of returnable findings, (2) analyze a particular finding in relation to this, (3) reidentify the individual contributor, and (4) recontact the contributor to offer the finding. We suggest that findings that are analytically valid, reveal an established and substantial risk of a serious health condition, and are clinically actionable should generally be offered to consenting contributors. This article specifies 10 concrete recommendations, addressing new biobanks as well as those already in existence.
Saxena V, Ramdas S, Ochoa CR, Wallace D, Bhide P, Kohane I. Structural, genetic, and functional signatures of disordered neuro-immunological development in autism spectrum disorder. PLoS ONE. 2012;7 :e48835.
BACKGROUND: Numerous linkage studies have been performed in pedigrees of Autism Spectrum Disorders, and these studies point to diverse loci and etiologies of autism in different pedigrees. The underlying pattern may be identified by an integrative approach, especially since ASD is a complex disorder manifested through many loci. METHOD: Autism spectrum disorder (ASD) was studied through two different and independent genome-scale measurement modalities. We analyzed the results of copy number variation in autism and triangulated these with linkage studies. RESULTS: Consistently across both genome-scale measurements, the same two molecular themes emerged: immune/chemokine pathways and developmental pathways. CONCLUSION: Linkage studies in aggregate do indeed share a thematic consistency, one which structural analyses recapitulate with high significance. These results also show for the first time that genomic profiling of pathways using a recombination distance metric can capture pathways that are consistent with those obtained from copy number variations (CNV).
Uno H, Tian L, Cai T, Kohane IS, Wei LJ. A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Stat Med. 2012.
Risk prediction procedures can be quite useful for the patient's treatment selection, prevention strategy, or disease management in evidence-based medicine. Often, potentially important new predictors are available in addition to the conventional markers. The question is how to quantify the improvement from the new markers for prediction of the patient's risk in order to aid cost-benefit decisions. The standard method, using the area under the receiver operating characteristic curve, to measure the added value may not be sensitive enough to capture incremental improvements from the new markers. Recently, some novel alternatives to area under the receiver operating characteristic curve, such as integrated discrimination improvement and net reclassification improvement, were proposed. In this paper, we consider a class of measures for evaluating the incremental values of new markers, which includes the preceding two as special cases. We present a unified procedure for making inferences about measures in the class with censored event time data. The large sample properties of our procedures are theoretically justified. We illustrate the new proposal with data from a cancer study to evaluate a new gene score for prediction of the patient's survival. Copyright (c) 2012 John Wiley & Sons, Ltd.
Perlis RH, Iosifescu DV, Castro VM, Murphy SN, Gainer VS, Minnier J, Cai T, Goryachev S, Zeng Q, Gallagher PJ, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42 :41-50.
BACKGROUND: Electronic medical records (EMR) provide a unique opportunity for efficient, large-scale clinical investigation in psychiatry. However, such studies will require development of tools to define treatment outcome. METHOD: Natural language processing (NLP) was applied to classify notes from 127 504 patients with a billing diagnosis of major depressive disorder, drawn from out-patient psychiatry practices affiliated with multiple, large New England hospitals. Classifications were compared with results using billing data (ICD-9 codes) alone and to a clinical gold standard based on chart review by a panel of senior clinicians. These cross-sectional classifications were then used to define longitudinal treatment outcomes, which were compared with a clinician-rated gold standard. RESULTS: Models incorporating NLP were superior to those relying on billing data alone for classifying current mood state (area under receiver operating characteristic curve of 0.85-0.88 v. 0.54-0.55). When these cross-sectional visits were integrated to define longitudinal outcomes and incorporate treatment data, 15% of the cohort remitted with a single antidepressant treatment, while 13% were identified as failing to remit despite at least two antidepressant trials. Non-remitting patients were more likely to be non-Caucasian (p<0.001). CONCLUSIONS: The application of bioinformatics tools such as NLP should enable accurate and efficient determination of longitudinal outcomes, enabling existing EMR data to be applied to clinical research, including biomarker investigations. Continued development will be required to better address moderators of outcome such as adherence and co-morbidity.
Kong SW, Collins CD, Shimizu-Motohashi Y, Holm IA, Campbell MG, Lee IH, Brewster SJ, Hanson E, Harris HK, Lowe KR, et al. Characteristics and predictive value of blood transcriptome signature in males with autism spectrum disorders. PLoS ONE. 2012;7 :e49475.
Autism Spectrum Disorders (ASD) is a spectrum of highly heritable neurodevelopmental disorders in which known mutations contribute to disease risk in 20% of cases. Here, we report the results of the largest blood transcriptome study to date that aims to identify differences in 170 ASD cases and 115 age/sex-matched controls and to evaluate the utility of gene expression profiling as a tool to aid in the diagnosis of ASD. The differentially expressed genes were enriched for the neurotrophin signaling, long-term potentiation/depression, and notch signaling pathways. We developed a 55-gene prediction model, using a cross-validation strategy, on a sample cohort of 66 male ASD cases and 33 age-matched male controls (P1). Subsequently, 104 ASD cases and 82 controls were recruited and used as a validation set (P2). This 55-gene expression signature achieved 68% classification accuracy with the validation cohort (area under the receiver operating characteristic curve (AUC): 0.70 [95% confidence interval [CI]: 0.62-0.77]). Not surprisingly, our prediction model that was built and trained with male samples performed well for males (AUC 0.73, 95% CI 0.65-0.82), but not for female samples (AUC 0.51, 95% CI 0.36-0.67). The 55-gene signature also performed robustly when the prediction model was trained with P2 male samples to classify P1 samples (AUC 0.69, 95% CI 0.58-0.80). Our result suggests that the use of blood expression profiling for ASD detection may be feasible. Further study is required to determine the age at which such a test should be deployed, and what genetic characteristics of ASD can be identified.
Masys DR, Harris PA, Fearn PA, Kohane I. Designing a Public Square for Research Computing. Science Translational Medicine. 2012;4 :149fs32-149fs32.
Mandl KD, Kohane IS. Escaping the EHR trap--the future of health IT. N Engl J Med. 2012;366 :2240-2.
Palmer NP, Schmid PR, Berger B, Kohane IS. A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers. Genome Biol. 2012;13 :R71.
BACKGROUND: Understanding the fundamental mechanisms of tumorigenesis remains one of the most pressing problems in modern biology. To this end, stem-like cells with tumor-initiating potential have become a central focus in cancer research. While the cancer stem cell hypothesis presents a compelling model of self-renewal and partial differentiation, the relationship between tumor cells and normal stem cells remains unclear. RESULTS: We identify, in an unbiased fashion, mRNA transcription patterns associated with pluripotent stem cells. Using this profile, we derive a quantitative measure of stem cell-like gene expression activity. We show how this 189 gene signature stratifies a variety of stem cell, malignant and normal tissue samples by their relative plasticity and state of differentiation within Concordia, a diverse gene expression database consisting of 3,209 Affymetrix HGU133+ 2.0 microarray assays. Further, the orthologous murine signature correctly orders a time course of differentiating embryonic mouse stem cells. Finally, we demonstrate how this stem-like signature serves as a proxy for tumor grade in a variety of solid tumors, including brain, breast, lung and colon. CONCLUSIONS: This core stemness gene expression signature represents a quantitative measure of stem cell-associated transcriptional activity. Broadly, the intensity of this signature correlates to the relative level of plasticity and differentiation across all of the human tissues analyzed. The fact that the intensity of this signature is also capable of differentiating histological grade for a variety of human malignancies suggests potential therapeutic and diagnostic implications.
Mandl KD, Khorasani R, Kohane IS. Meaningful use of electronic health records. Health Aff (Millwood). 2012;31 :1365; author reply 1366.
Kohane IS, Valtchinov VI. Quantifying the White Blood Cell Transcriptome as an accessible window to the Multi-Organ Transcriptome. Bioinformatics. 2012.
MOTIVATION: We investigate and quantify the generalizability of the WBC (White Blood Cell) transcriptome to the general, multi-organ transcriptome. We use data from the NCBI's Gene Expression Omnibus (GEO) public repository to define 2 data sets for comparison, WBC and OO (Other Organ) sets. RESULTS: Comprehensive pair-wise correlation and expression level profiles are calculated for both data sets (with sizes of 81 and 1,463 respectively). We have used mapping and ranking across the Gene Ontology (GO) categories to quantify similarity between the two sets. GO mappings of the most correlated and highly expressed genes from the two data sets tightly match, with the notable exceptions of components of the ribosome, cell adhesion and immune response. That is, 10877 or 48.8% of all measured genes do not change more than 10% of rank range between WBC and OO; only 878 (3.9%) change rank more than 50%. Two trans-tissue gene lists are defined, the most changing and the least changing genes in expression rank. We also provide a general, quantitative measure of the probability of expression rank and correlation profile in the OO system given the expression rank and correlation profile in the WBC data set. CONTACT: vvaltchinov@partners.org.
Mandl KD, Mandel JC, Murphy SN, Bernstam EV, Ramoni RL, Kreda DA, McCoy JM, Adida B, Kohane IS. The SMART Platform: early experience enabling substitutable applications for electronic health records. J Am Med Inform Assoc. 2012.
OBJECTIVE: The Substitutable Medical Applications, Reusable Technologies (SMART) Platforms project seeks to develop a health information technology platform with substitutable applications (apps) constructed around core services. The authors believe this is a promising approach to driving down healthcare costs, supporting standards evolution, accommodating differences in care workflow, fostering competition in the market, and accelerating innovation. MATERIALS AND METHODS: The Office of the National Coordinator for Health Information Technology, through the Strategic Health IT Advanced Research Projects (SHARP) Program, funds the project. The SMART team has focused on enabling the property of substitutability through an app programming interface leveraging web standards, presenting predictable data payloads, and abstracting away many details of enterprise health information technology systems. Containers (health information technology systems, such as electronic health records (EHR), personally controlled health records, and health information exchanges that use the SMART app programming interface or a portion of it) marshal data sources and present data simply, reliably, and consistently to apps. RESULTS: The SMART team has completed the first phase of the project (a) defining an app programming interface, (b) developing containers, and (c) producing a set of charter apps that showcase the system capabilities. A focal point of this phase was the SMART Apps Challenge, publicized by the White House, using http://www.challenge.gov website, and generating 15 app submissions with diverse functionality. CONCLUSION: Key strategic decisions must be made about the most effective market for further disseminating SMART: existing market-leading EHR vendors, new entrants into the EHR market, or other stakeholders such as health information exchanges.
Masys DR, Jarvik GP, Abernethy NF, Anderson NR, Papanicolaou GJ, Paltoo DN, Hoffman MA, Kohane IS, Levy HP. Technical desiderata for the integration of genomic data into Electronic Health Records. J Biomed Inform. 2012;45 :419-22.
The era of "Personalized Medicine," guided by individual molecular variation in DNA, RNA, expressed proteins and other forms of high volume molecular data brings new requirements and challenges to the design and implementation of Electronic Health Records (EHRs). In this article we describe the characteristics of biomolecular data that differentiate it from other classes of data commonly found in EHRs, enumerate a set of technical desiderata for its management in healthcare settings, and offer a candidate technical approach to its compact and efficient representation in operational systems.
Kohane IS, Shendure J. What's a Genome Worth? Sci Transl Med. 2012;4 :133fs13.
A recent study (Roberts et al.) explores considerations in estimating the current and potential clinical utility of whole-genome sequencing for individual patients.
