Discovering and following up on genetic associations with complex phenotypes require large patient cohorts. This is particularly true for patient cohorts of diverse ancestry and clinically relevant subsets of disease. The ability to mine the electronic health records (EHRs) of patients followed as part of routine clinical care provides a potential opportunity to efficiently identify affected cases and unaffected controls for appropriate-sized genetic studies. Here, we demonstrate proof-of-concept that it is possible to use EHR data linked with biospecimens to establish a multi-ethnic case-control cohort for genetic research of a complex disease, rheumatoid arthritis (RA). In 1,515 EHR-derived RA cases and 1,480 controls matched for both genetic ancestry and disease-specific autoantibodies (anti-citrullinated protein antibodies [ACPA]), we demonstrate that the odds ratios and aggregate genetic risk score (GRS) of known RA risk alleles measured in individuals of European ancestry within our EHR cohort are nearly identical to those derived from a genome-wide association study (GWAS) of 5,539 autoantibody-positive RA cases and 20,169 controls. We extend this approach to other ethnic groups and identify a large overlap in the GRS among individuals of European, African, East Asian, and Hispanic ancestry. We also demonstrate that the distribution of a GRS based on 28 non-HLA risk alleles in ACPA+ cases partially overlaps with that of the ACPA- subgroup of RA cases. Our study demonstrates that the genetic basis of rheumatoid arthritis risk is similar among cases of diverse ancestry divided into subsets based on ACPA status and emphasizes the utility of linking EHR clinical data with biospecimens for genetic studies.
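The abstract above does not say how its aggregate genetic risk score is computed. As a minimal sketch, assuming the common weighted formulation in which each risk allele count (0, 1, or 2) is multiplied by the natural log of that allele's odds ratio, a GRS for one individual could be calculated as follows. The function name and the example odds ratios are hypothetical and are not taken from the study.

```python
import math


def genetic_risk_score(allele_counts, odds_ratios):
    """Weighted GRS: sum over loci of (risk allele count) * ln(odds ratio).

    This weighting is an assumed, commonly used convention, not necessarily
    the one used in the study summarised above.
    """
    if len(allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is required per locus")
    for count in allele_counts:
        if count not in (0, 1, 2):
            raise ValueError("risk allele counts must be 0, 1, or 2")
    return sum(count * math.log(odds) for count, odds in zip(allele_counts, odds_ratios))


# Hypothetical individual genotyped at three loci (illustrative values only).
EXAMPLE_COUNTS = [2, 1, 0]
EXAMPLE_ODDS_RATIOS = [1.5, 1.2, 2.0]
EXAMPLE_SCORE = genetic_risk_score(EXAMPLE_COUNTS, EXAMPLE_ODDS_RATIOS)
```

Log odds ratios are used as weights so that alleles with larger effects contribute proportionally more, and a locus with no risk alleles contributes nothing to the score.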
Huntington's disease is initiated by the expression of a CAG repeat-encoded polyglutamine region in full-length huntingtin, with dominant effects that vary continuously with CAG size. The mechanism could involve a simple gain of function or a more complex gain of function coupled to a loss of function (e.g. dominant negative-graded loss of function). To distinguish these alternatives, we compared genome-wide gene expression changes correlated with CAG size across an allelic series of heterozygous CAG knock-in mouse embryonic stem (ES) cell lines (Hdh(Q20/7), Hdh(Q50/7), Hdh(Q91/7), Hdh(Q111/7)), to genes differentially expressed between Hdh(ex4/5/ex4/5) huntingtin null and wild-type (Hdh(Q7/7)) parental ES cells. The set of 73 genes whose expression varied continuously with CAG length had minimal overlap with the 754-member huntingtin-null gene set but the two were not completely unconnected. Rather, the 172 CAG length-correlated pathways and 238 huntingtin-null significant pathways clustered into 13 shared categories at the network level. A closer examination of the energy metabolism and the lipid/sterol/lipoprotein metabolism categories revealed that CAG length-correlated genes and huntingtin-null-altered genes either were different members of the same pathways or were in unique, but interconnected pathways. Thus, varying the polyglutamine size in full-length huntingtin produced gene expression changes that were distinct from, but related to, the effects of lack of huntingtin. These findings support a simple gain-of-function mechanism acting through a property of the full-length huntingtin protein and point to CAG-correlative approaches to discover its effects. Moreover, for therapeutic strategies based on huntingtin suppression, our data highlight processes that may be more sensitive to the disease trigger than to decreased huntingtin levels.
The role of the immune system in neuropsychiatric diseases, including autism spectrum disorder (ASD), has long been hypothesized. This hypothesis has mainly been supported by family cohort studies and the immunological abnormalities found in ASD patients, but has had limited support from genetic association testing. Two cross-disorder genetic association tests were performed on the genome-wide data sets of ASD and six autoimmune disorders. In the polygenic score test, we examined whether ASD risk alleles with low effect sizes work collectively in specific autoimmune disorders and show significant association statistics. In the genetic variation score test, we tested whether allele-specific associations between ASD and autoimmune disorders can be found using nominally significant single-nucleotide polymorphisms. In both tests, we found that ASD is probabilistically linked to ankylosing spondylitis (AS) and multiple sclerosis (MS). Association coefficients showed that ASD and AS were positively associated, meaning that autism susceptibility alleles may have a similar collective effect in AS. The association coefficients were negative between ASD and MS. Significant associations between ASD and two autoimmune disorders were identified. This genetic association supports the idea that specific immunological abnormalities may underlie the etiology of autism, at least in a number of cases.
We review the scholarly career of our colleague, Marco Ramoni, who died unexpectedly in the summer of 2010. His work mainly explored the development and application of Bayesian techniques to model clinical, public health, and bioinformatics questions. His contributions have led to improvements in our ability to model behavior that evolves in time, to explore systematic relationships among large sets of covariates, and to tease out the meaning of data on the role of genetic variation in the genesis of important diseases.
Background: The re-use of patient data from electronic healthcare record systems can provide tremendous benefits for clinical research, but measures to protect patient privacy while utilizing these records have many challenges. Some of these challenges arise from a misperception that the problem should be solved technically when actually the problem needs a holistic solution. Objective: The authors' experience with informatics for integrating biology and the bedside (i2b2) use cases indicates that the privacy of the patient should be considered on three fronts: technical de-identification of the data, trust in the researcher and the research, and the security of the underlying technical platforms. Methods: The security structure of i2b2 is implemented based on consideration of all three fronts. It has been supported with several use cases across the USA, resulting in five privacy categories of users that serve to protect the data while supporting the use cases. Results: The i2b2 architecture is designed to provide consistency and faithfully implement these user privacy categories. These privacy categories help reflect the policy of both the Health Insurance Portability and Accountability Act and the provisions of the National Research Act of 1974, as embodied by current institutional review boards. Conclusion: By implementing a holistic approach to patient privacy solutions, i2b2 is able to help close the gap between principle and practice.
Informatics for integrating biology and the bedside (i2b2) seeks to provide the instrumentation for using the informational by-products of health care and the biological materials accumulated through the delivery of health care to conduct discovery research and to study the healthcare system in vivo. This complements existing efforts such as prospective cohort studies or trials outside the delivery of routine health care. i2b2 has been used to generate genome-wide studies at less than one tenth the cost and one tenth the time of conventionally performed studies as well as to identify important risks from commonly used medications. i2b2 has been adopted by over 60 academic health centers internationally.
If genomic studies are to be a clinically relevant and timely reflection of the relationship between genetics and health status - whether for common or rare variants - cost-effective ways must be found to measure both the genetic variation and the phenotypic characteristics of large populations, including the comprehensive and up-to-date record of their medical treatment. The adoption of electronic health records, used by clinicians to document clinical care, is becoming widespread and recent studies demonstrate that they can be effectively employed for genetic studies using the informational and biological 'by-products' of health-care delivery while maintaining patient privacy.
BACKGROUND: Negated biomedical events are often ignored by text-mining applications; however, such events carry scientific significance. We report on the development of BioNOT, a database of negated sentences that can be used to extract such negated events. DESCRIPTION: Currently BioNOT incorporates approximately 32 million negated sentences, extracted from over 336 million biomedical sentences from three resources: approximately 2 million full-text biomedical articles in Elsevier and the PubMed Central, as well as approximately 20 million abstracts in PubMed. We evaluated BioNOT on three important genetic disorders: autism, Alzheimer's disease and Parkinson's disease, and found that BioNOT is able to capture negated events that may be ignored by experts. CONCLUSIONS: The BioNOT database can be a useful resource for biomedical researchers. BioNOT is freely available at http://bionot.askhermes.org/. In future work, we will develop semantic web related technologies to enrich BioNOT.
Huntington's disease (HD) involves marked early neurodegeneration in the striatum, whereas the cerebellum is relatively spared despite the ubiquitous expression of full-length mutant huntingtin, implying that inherent tissue-specific differences determine susceptibility to the HD CAG mutation. To understand this tissue specificity, we compared early mutant huntingtin-induced gene expression changes in striatum to those in cerebellum in young Hdh CAG knock-in mice, prior to onset of evident pathological alterations. Endogenous levels of full-length mutant huntingtin caused qualitatively similar, but quantitatively different gene expression changes in the two brain regions. Importantly, the quantitatively different responses in the striatum and cerebellum in mutant mice were well accounted for by the intrinsic molecular differences in gene expression between the striatum and cerebellum in wild-type animals. Tissue-specific gene expression changes in response to the HD mutation, therefore, appear to reflect the different inherent capacities of these tissues to buffer qualitatively similar effects of mutant huntingtin. These findings highlight a role for intrinsic quantitative tissue differences in contributing to HD pathogenesis, and likely to other neurodegenerative disorders exhibiting tissue-specificity, thereby guiding the search for effective therapeutic interventions.
Trinucleotide repeat sequences (TRS) represent a common type of genomic DNA motif whose expansion is associated with a large number of human diseases. The molecular mechanisms driving the ongoing dynamic expansion of TRS across generations and within tissues, and its influence on genomic DNA functions, are not well understood. Here we report results for a novel and notable collective breathing behavior of genomic DNA of tandem TRS, leading to a propensity for large local DNA transient openings at physiological temperature. Our Langevin molecular dynamics (LMD) and Markov Chain Monte Carlo (MCMC) simulations demonstrate that the patterns of openings of various TRSs depend specifically on their length. The collective propensity for DNA strand separation of repeated sequences serves as a precursor for outsized intermediate bubble states independently of the G/C-content. We report that repeats have the potential to interfere with the binding of transcription factors to their consensus sequence by altered DNA breathing dynamics in proximity of the binding sites. These observations might influence ongoing attempts to use LMD and MCMC simulations for TRS-related modeling of genomic DNA functionality in elucidating the common denominators of the dynamic TRS expansion mutation with potential therapeutic applications.
Feedback control is an important regulatory process in biological systems, which confers robustness against external and internal disturbances. Genes involved in feedback structures are therefore likely to have a major role in regulating cellular processes. Here we rely on a dynamic Bayesian network approach to identify feedback loops in cell cycle regulation. We analyzed the transcriptional profile of the cell cycle in HeLa cancer cells and identified a feedback loop structure composed of 10 genes. In silico analyses showed that these genes hold important roles in the system's dynamics. The results of published experimental assays confirmed the central role of 8 of the identified feedback loop genes in cell cycle regulation. In conclusion, we provide a novel approach to identify critical genes for the dynamics of biological processes. This may lead to the identification of therapeutic targets in diseases that involve perturbations of these dynamics.
Large-scale molecular profiling technologies have assisted the identification of disease biomarkers and facilitated the basic understanding of cellular processes. However, samples collected from human subjects in clinical trials possess a level of complexity, arising from multiple cell types, that can obfuscate the analysis of data derived from them. Failure to identify, quantify, and incorporate sources of heterogeneity into an analysis can have widespread and detrimental effects on subsequent statistical studies. We describe an approach that builds upon a linear latent variable model, in which expression levels from mixed cell populations are modeled as the weighted average of expression from different cell types. We solve these equations using quadratic programming, which efficiently identifies the globally optimal solution while preserving non-negativity of the fraction of the cells. We applied our method to various existing platforms to estimate proportions of different pure cell or tissue types and gene expression profiles of distinct phenotypes, with a focus on complex samples collected in clinical trials. We tested our methods on several well controlled benchmark data sets with known mixing fractions of pure cell or tissue types and mRNA expression profiling data from samples collected in a clinical trial. Accurate agreement between predicted and actual mixing fractions was observed. In addition, our method was able to predict mixing fractions for more than ten species of circulating cells and to provide accurate estimates for relatively rare cell types (<10% total population). Furthermore, accurate changes in leukocyte trafficking associated with Fingolimod (FTY720) treatment were identified that were consistent with previous results generated by both cell counts and flow cytometry.
These data suggest that our method can solve one of the open questions regarding the analysis of complex transcriptional data: namely, how to identify the optimal mixing fractions in a given experiment.
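The linear mixing model described in this abstract can be illustrated with a small sketch. It assumes a known signature matrix S (genes by cell types) and an observed mixed-sample profile m, and estimates non-negative fractions f by minimising ||S f - m||^2. The abstract reports using a quadratic programming solver; the sketch below substitutes plain projected gradient descent, a simpler stand-in for the same non-negativity-constrained least-squares problem, so that it needs only the standard library. All names and numbers are hypothetical.

```python
def estimate_fractions(signatures, mixture, iterations=2000):
    """Minimise ||S f - m||^2 subject to f >= 0 by projected gradient descent.

    signatures: one row per gene, one column per cell type (pure profiles).
    mixture: observed expression per gene in the mixed sample.
    This is an illustrative stand-in for a quadratic programming solver.
    """
    n_genes = len(signatures)
    n_types = len(signatures[0])
    # The gradient's Lipschitz constant is at most 2 * ||S||_F^2, so this
    # conservative step size cannot overshoot.
    frobenius_sq = sum(value * value for row in signatures for value in row)
    step = 1.0 / (2.0 * frobenius_sq)
    fractions = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [
            sum(s * f for s, f in zip(row, fractions)) - observed
            for row, observed in zip(signatures, mixture)
        ]
        gradient = [
            2.0 * sum(signatures[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Projection step: clip negative fractions back to zero.
        fractions = [max(0.0, f - step * d) for f, d in zip(fractions, gradient)]
    return fractions


# Hypothetical signatures for 4 genes across 3 cell types (illustrative only).
SIGNATURES = [
    [9.0, 1.0, 0.0],
    [1.0, 8.0, 1.0],
    [0.0, 2.0, 9.0],
    [3.0, 3.0, 3.0],
]
TRUE_FRACTIONS = [0.6, 0.3, 0.1]
# Simulate a noise-free mixed sample as the weighted average of the pure profiles.
MIXTURE = [sum(s * f for s, f in zip(row, TRUE_FRACTIONS)) for row in SIGNATURES]
ESTIMATED = estimate_fractions(SIGNATURES, MIXTURE)
```

On this noise-free toy input the estimate converges to the simulated fractions. With real expression data a dedicated QP or non-negative least-squares solver would be the more practical choice.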
OBJECTIVE: Electronic medical records (EMRs) are a rich data source for discovery research but are underutilized due to the difficulty of extracting highly accurate clinical data. We assessed whether a classification algorithm incorporating narrative EMR data (typed physician notes) more accurately classifies subjects with rheumatoid arthritis (RA) compared to an algorithm using codified EMR data alone. METHODS: Subjects with ≥1 ICD9 RA code (714.xx) or who had anti-CCP checked in the EMR of two large academic centers were included in an 'RA Mart' (n=29,432). For all 29,432 subjects, we extracted narrative (using natural language processing) and codified RA clinical information. In a training set of 96 RA and 404 non-RA cases from the RA Mart classified by medical record review, we used narrative and codified data to develop classification algorithms using logistic regression. These algorithms were applied to the entire RA Mart. We calculated and compared the positive predictive value (PPV) of these algorithms by reviewing records of an additional 400 subjects classified as RA by the algorithms. RESULTS: A complete algorithm (narrative and codified data) classified RA subjects with a significantly higher PPV of 94% than an algorithm with codified data alone (PPV 88%). Characteristics of the RA cohort identified by the complete algorithm were comparable to existing RA cohorts (80% female, 63% anti-CCP+, 59% erosion+). CONCLUSION: We demonstrate the ability to utilize complete EMR data to define an RA cohort with a PPV of 94%, which was superior to an algorithm using codified data alone.
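The positive predictive value used to compare the two algorithms above is the share of algorithm-classified cases that chart review confirms. A minimal sketch of that calculation follows; the function name and the counts are hypothetical and chosen only for illustration, since the abstract reports percentages rather than raw counts.

```python
def positive_predictive_value(confirmed, not_confirmed):
    """PPV = true positives / (true positives + false positives)."""
    total = confirmed + not_confirmed
    if total == 0:
        raise ValueError("at least one classified subject is required")
    return confirmed / total


# Hypothetical review outcome: 47 of 50 algorithm-classified subjects confirmed.
EXAMPLE_PPV = positive_predictive_value(confirmed=47, not_confirmed=3)
```

Because PPV depends on disease prevalence in the population being classified, a value measured in one data mart would not necessarily carry over to another.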
Recent surveys about participation in cohort studies reconfirm that participants value and desire the return of research results to a degree that is out of step with the restrictive recommendations of various ethics advisory groups, which have historically limited disclosure based on clinician value judgments and the severity and treatability of the disease in question, among other factors. Rather than framing the current inconclusive ethics discussion as a standstill among competing ethical principles and their potential applicability, we introduce a new element, communicability (that is, those properties of a message that will determine how likely it is that its informational intent will be grasped by the study participant), as the subject of empirical research to align participants' goals with beneficent and responsible results reporting. Structural changes in research design, combined with governance changes in assessing impact, allow us to move beyond a binary construction of report/do not report and to create a structure in which the communicability of the message and the participants' preferences are variables in a function that affects results reporting. Here we illustrate this structure and its principles.
BACKGROUND: In Huntington's disease (HD), an expanded CAG repeat produces characteristic striatal neurodegeneration. Interestingly, the HD CAG repeat, whose length determines age at onset, undergoes tissue-specific somatic instability, predominant in the striatum, suggesting that tissue-specific CAG length changes could modify the disease process. Therefore, understanding the mechanisms underlying the tissue specificity of somatic instability may provide novel routes to therapies. However, progress in this area has been hampered by the lack of sensitive high-throughput instability quantification methods and global approaches to identify the underlying factors. RESULTS: Here we describe a novel approach to gain insight into the factors responsible for the tissue specificity of somatic instability. Using accurate genetic knock-in mouse models of HD, we developed a reliable, high-throughput method to quantify tissue HD CAG repeat instability and integrated this with genome-wide bioinformatic approaches. Using tissue instability quantified in 16 tissues as a phenotype and tissue microarray gene expression as a predictor, we built a mathematical model and identified a gene expression signature that accurately predicted tissue instability. Using the predictive ability of this signature we found that somatic instability was not a consequence of pathogenesis. In support of this, genetic crosses with models of accelerated neuropathology failed to induce somatic instability. In addition, we searched for genes and pathways that correlated with tissue instability. We found that expression levels of DNA repair genes did not explain the tissue specificity of somatic instability. Instead, our data implicate other pathways, particularly cell cycle, metabolism and neurotransmitter pathways, acting in combination to generate tissue-specific patterns of instability.
CONCLUSION: Our study clearly demonstrates that multiple tissue factors reflect the level of somatic instability in different tissues. In addition, our quantitative, genome-wide approach is readily applicable to high-throughput assays and opens the door to widespread applications with the potential to accelerate the discovery of drugs that alter tissue instability.
Informatics for Integrating Biology and the Bedside (i2b2) is one of seven projects sponsored by the NIH Roadmap National Centers for Biomedical Computing (http://www.ncbcs.org). Its mission is to provide clinical investigators with the tools necessary to integrate medical record and clinical research data in the genomics age, a software suite to construct and integrate the modern clinical research chart. i2b2 software may be used by an enterprise's research community to find sets of interesting patients from electronic patient medical record data, while preserving patient privacy through a query tool interface. Project-specific mini-databases ("data marts") can be created from these sets to make highly detailed data available on these specific patients to the investigators on the i2b2 platform, as reviewed and restricted by the Institutional Review Board. The current version of this software has been released into the public domain and is available at the URL: http://www.i2b2.org/software.