OBJECTIVES: While genetic determinants of low density lipoprotein (LDL) cholesterol levels are well characterised in the general population, they are understudied in rheumatoid arthritis (RA). Our objective was to determine the association of established LDL and RA genetic alleles with LDL levels in RA cases compared with non-RA controls. METHODS: Using data from electronic medical records, we linked validated RA cases and non-RA controls to discarded blood samples. For each individual, we extracted data on: first LDL measurement, age, gender and year of LDL measurement. We genotyped subjects for 11 LDL and 44 non-HLA RA alleles, and calculated RA and LDL genetic risk scores (GRS). We tested the association between each GRS and LDL level using multivariate linear regression models adjusted for age, gender, year of LDL measurement and RA status. RESULTS: Among 567 RA cases and 979 controls, 80% were female and mean age at the first LDL measurement was 55 years. RA cases had significantly lower mean LDL levels than controls (117.2 vs 125.6 mg/dl, respectively, p<0.0001). Each unit increase in LDL GRS was associated with 0.8 mg/dl higher LDL levels in both RA cases and controls (p=3.0 × 10^-7). Each unit increase in RA GRS was associated with 4.3 mg/dl lower LDL levels in both groups (p=0.01). CONCLUSIONS: LDL alleles were associated with higher LDL levels in RA. RA alleles were associated with lower LDL levels in both RA cases and controls. As RA cases carry more RA alleles, these findings suggest a genetic basis for epidemiological observations of lower LDL levels in RA.
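The abstract above does not state how the genetic risk scores were computed. As a minimal sketch, assuming the commonly used weighted allele-count form (each variant's risk-allele count multiplied by a per-variant weight, then summed), the calculation might look like the following. The variant names and weights are hypothetical placeholders, not the study's alleles or effect sizes.

```python
from typing import Mapping


def genetic_risk_score(allele_counts: Mapping[str, int],
                       weights: Mapping[str, float]) -> float:
    """Weighted sum of risk-allele counts (0, 1 or 2 copies per variant).

    Assumption: a weighted allele-count score; the source abstract does not
    specify the exact scoring scheme.
    """
    score = 0.0
    for variant, weight in weights.items():
        count = allele_counts[variant]  # KeyError if a genotype is missing
        if count not in (0, 1, 2):
            raise ValueError(f"{variant}: allele count must be 0, 1 or 2")
        score += weight * count
    return score


# Hypothetical variants and weights, for illustration only.
weights = {"variant_a": 0.5, "variant_b": 0.25, "variant_c": 1.0}
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

print(genetic_risk_score(person, weights))  # 0.5*2 + 0.25*1 + 1.0*0 = 1.25
```

Setting every weight to 1 reduces the score to an unweighted count of risk alleles.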
To reduce costs and improve clinical relevance of genetic studies, there has been increasing interest in performing such studies in hospital-based cohorts by linking phenotypes extracted from electronic medical records (EMRs) to genotypes assessed in routinely collected medical samples. A fundamental difficulty in implementing such studies is extracting accurate information about disease outcomes and important clinical covariates from large numbers of EMRs. Recently, numerous algorithms have been developed to infer phenotypes by combining information from multiple structured and unstructured variables extracted from EMRs. Although these algorithms are quite accurate, they typically do not provide perfect classification due to the difficulty in inferring meaning from the text. Some algorithms can produce for each patient a probability that the patient is a disease case. This probability can be thresholded to define case-control status, and this estimated case-control status has been used to replicate known genetic associations in EMR-based studies. However, using the estimated disease status in place of true disease status results in outcome misclassification, which can diminish test power and bias odds ratio estimates. We propose to instead directly model the algorithm-derived probability of being a case. We demonstrate how our approach improves test power and effect estimation in simulation studies, and we describe its performance in a study of rheumatoid arthritis. Our work provides an easily implemented solution to a major practical challenge that arises in the use of EMR data, which can facilitate the use of EMR infrastructure for more powerful, cost-effective, and diverse genetic studies.
The elucidation of disease etiologies and establishment of robust, scalable, high-throughput screening assays for autism spectrum disorders (ASDs) have been impeded by both inaccessibility of disease-relevant neuronal tissue and the genetic heterogeneity of the disorder. Neuronal cells derived from induced pluripotent stem cells (iPSCs) from autism patients may circumvent these obstacles and serve as relevant cell models. To date, derived cells are characterized and screened by assessing their neuronal phenotypes. These characterizations are often etiology-specific or lack reproducibility and stability. In this review, we present an overview of efforts to study iPSC-derived neurons as a model for autism, and we explore the plausibility of gene expression profiling as a reproducible and stable disease marker.
BACKGROUND: Whole genome sequencing (WGS) is already being used in certain clinical and research settings, but its impact on patient well-being, health-care utilization, and clinical decision-making remains largely unstudied. It is also unknown how best to communicate sequencing results to physicians and patients to improve health. We describe the design of the MedSeq Project: the first randomized trials of WGS in clinical care. METHODS/DESIGN: This pair of randomized controlled trials compares WGS to standard of care in two clinical contexts: (a) disease-specific genomic medicine in a cardiomyopathy clinic and (b) general genomic medicine in primary care. We are recruiting 8 to 12 cardiologists, 8 to 12 primary care physicians, and approximately 200 of their patients. Patient participants in both the cardiology and primary care trials are randomly assigned to receive a family history assessment with or without WGS. Our laboratory delivers a genome report to physician participants that balances the needs to enhance understandability of genomic information and to convey its complexity. We provide an educational curriculum for physician participants and offer them a hotline to genetics professionals for guidance in interpreting and managing their patients' genome reports. Using varied data sources, including surveys, semi-structured interviews, and review of clinical data, we measure the attitudes, behaviors and outcomes of physician and patient participants at multiple time points before and after the disclosure of these results. DISCUSSION: The impact of emerging sequencing technologies on patient care is unclear. We have designed a process of interpreting WGS results and delivering them to physicians in a way that anticipates how we envision genomic medicine will evolve in the near future. 
That is, our WGS report provides clinically relevant information while communicating the complexity and uncertainty of WGS results to physicians and, through physicians, to their patients. This project will not only illuminate the impact of integrating genomic medicine into the clinical care of patients but also inform the design of future studies. TRIAL REGISTRATION: ClinicalTrials.gov identifier NCT01736566.
OBJECTIVE: We report the first pediatric-specific Phenome-Wide Association Study (PheWAS) using electronic medical records (EMRs). Given the early success of PheWAS in adult populations, we investigated the feasibility of this approach in pediatric cohorts in which associations between a previously known genetic variant and a wide range of clinical or physiological traits were evaluated. Although computationally intensive, this approach has potential to reveal disease mechanistic relationships between a variant and a network of phenotypes. METHOD: Data on 5049 samples of European ancestry were obtained from the EMRs of two large academic centers in five different genotyped cohorts. Recently, these samples have undergone whole genome imputation. After standard quality controls, removing missing data and outliers based on principal components analyses (PCA), 4268 samples were used for the PheWAS. We scanned for associations between 2476 single-nucleotide polymorphisms (SNPs) with available genotyping data from previously published GWAS and 539 EMR-derived phenotypes. The false discovery rate was calculated and, for any new PheWAS findings, a permutation approach (with up to 1,000,000 trials) was implemented. RESULTS: This PheWAS found a variety of common variants (MAF > 10%) with prior GWAS associations in our pediatric cohorts including Juvenile Rheumatoid Arthritis (JRA), Asthma, Autism and Pervasive Developmental Disorder (PDD) and Type 1 Diabetes with a false discovery rate < 0.05 and power of study above 80%. 
In addition, several new PheWAS findings were identified, including a cluster of associations near the NDFIP1 gene for mental retardation (best SNP rs10057309, p = 4.33 × 10^-7, OR = 1.70, 95% CI = 1.38-2.09); an association near the PLCL1 gene for developmental delays and speech disorder (best SNP rs1595825, p = 1.13 × 10^-8, OR = 0.65, 95% CI = 0.57-0.76); a cluster of associations in the IL5-IL13 region with Eosinophilic Esophagitis (EoE) [best SNP rs12653750, p = 3.03 × 10^-9, OR = 1.73, 95% CI = 1.44-2.07], previously implicated in asthma, allergy, and eosinophilia; and an association of variants in GCKR and JAZF1 with allergic rhinitis in our pediatric cohorts (best SNP rs780093, p = 2.18 × 10^-5, OR = 1.39, 95% CI = 1.19-1.61), previously demonstrated in metabolic disease and diabetes in adults. CONCLUSION: The PheWAS approach with re-mapping of ICD-9 structured codes for our European-origin pediatric cohorts, as in the previous adult studies, recovers many previously reported associations and also identifies new associations with potentially important clinical implications.
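The abstract above reports a false discovery rate but does not name the procedure used to compute it. As an illustrative sketch only, assuming the widely used Benjamini-Hochberg step-up adjustment, FDR-adjusted p-values for a set of SNP-phenotype tests could be computed as follows; the p-values in the example are made up.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here for illustration; the source abstract does
    not say which FDR procedure was applied.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four hypothetical SNP-phenotype tests.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

A test is then called significant at an FDR of 0.05 when its adjusted value falls below 0.05.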
IMPORTANCE: Epilepsy is a debilitating condition, often with neither a known etiology nor an effective treatment. Autoimmune mechanisms have been increasingly identified. OBJECTIVE: To conduct a population-level study investigating the relationship between epilepsy and several common autoimmune diseases. DESIGN, SETTING, AND PARTICIPANTS: A retrospective population-based study using claims from a nationwide employer-provided health insurance plan in the United States. Participants were beneficiaries enrolled between 1999 and 2006 (N = 2 518 034). MAIN OUTCOMES AND MEASURES: We examined the relationship between epilepsy and 12 autoimmune diseases: type 1 diabetes mellitus, psoriasis, rheumatoid arthritis, Graves disease, Hashimoto thyroiditis, Crohn disease, ulcerative colitis, systemic lupus erythematosus, antiphospholipid syndrome, Sjogren syndrome, myasthenia gravis, and celiac disease. RESULTS: The risk of epilepsy was significantly heightened among patients with autoimmune diseases (odds ratio, 3.8; 95% CI, 3.6-4.0; P < .001) and was especially pronounced in children (5.2; 4.1-6.5; P < .001). Elevated risk was consistently observed across all 12 autoimmune diseases. CONCLUSIONS AND RELEVANCE: Epilepsy and autoimmune disease frequently co-occur; patients with either condition should undergo surveillance for the other. The potential role of autoimmunity must be given due consideration in epilepsy so that we are not overlooking a treatable cause.
BACKGROUND: While antidepressant treatment response appears to be partially heritable, no consistent genetic associations have been identified. Large, rare copy number variants (CNVs) play a role in other neuropsychiatric diseases, so we assessed their association with treatment-resistant depression (TRD). METHODS: We analyzed data from two genome-wide association studies comprising 1263 Caucasian patients with major depressive disorder. One was drawn from a large health system by applying natural language processing to electronic health records (i2b2 cohort). The second consisted of a multicenter study of sequential antidepressant treatments, Sequenced Treatment Alternatives to Relieve Depression. The Birdsuite package was used to identify rare deletions and duplications. Individuals without symptomatic remission, despite two antidepressant treatment trials, were contrasted with those who remitted with a first treatment trial. RESULTS: CNV data were derived for 778 subjects in the i2b2 cohort, including 300 subjects (37%) with TRD, and 485 subjects in Sequenced Treatment Alternatives to Relieve Depression cohort, including 152 (31%) with TRD. CNV burden analyses identified modest enrichment of duplications in cases (empirical p = .04 for duplications of 100-200 kilobase) and a particular deletion region spanning gene PABPC4L (empirical p = .02, 6 cases: 0 controls). Pathway analysis suggested enrichment of CNVs intersecting genes regulating actin cytoskeleton. However, none of these associations survived genome-wide correction. CONCLUSIONS: Contribution of rare CNVs to TRD appears to be modest, individually or in aggregate. The electronic health record-based methodology demonstrated here should facilitate collection of larger TRD cohorts necessary to further characterize these effects.
We describe the architecture of the Patient Centered Outcomes Research Institute (PCORI) funded Scalable Collaborative Infrastructure for a Learning Healthcare System (SCILHS, http://www.SCILHS.org) clinical data research network, which leverages the $48 billion federal investment in health information technology (IT) to enable a queryable semantic data model across 10 health systems covering more than 8 million patients, plugging universally into the point of care, generating evidence and discovery, and thereby enabling clinician and patient participation in research during the patient encounter. Central to the success of SCILHS is the development of innovative 'apps' to improve patient-centered outcomes research (PCOR) methods and to support point-of-care functions such as consent, enrollment, randomization, and outreach for patient-reported outcomes. SCILHS adapts and extends an existing national research network formed on an advanced IT infrastructure built with open source, free, modular components.
BACKGROUND: The MedSeq Project is a randomized clinical trial developing approaches to assess the impact of integrating genome sequencing into clinical medicine. To facilitate the return of results of potential medical relevance to physicians and patients participating in the MedSeq Project, we sought to develop a reporting approach for the effective communication of such findings. METHODS: Genome sequencing was performed on the Illumina HiSeq platform. Variants were filtered, interpreted, and validated according to methods developed by the Laboratory for Molecular Medicine and consistent with current professional guidelines. The GeneInsight software suite, which is integrated with the Partners HealthCare electronic health record, was used for variant curation, report drafting, and delivery. RESULTS: We developed a concise 5-6 page Genome Report (GR) featuring a single-page summary of results of potential medical relevance with additional pages containing structured variant, gene, and disease information along with supporting evidence for reported variants and brief descriptions of associated diseases and clinical implications. The GR is formatted to provide a succinct summary of genomic findings, enabling physicians to take appropriate steps for disease diagnosis, prevention, and management in their patients. CONCLUSIONS: Our experience highlights important considerations for the reporting of results of potential medical relevance and provides a framework for interpretation and reporting practices in clinical genome sequencing.
BACKGROUND AND OBJECTIVE: Upgrades to electronic health record (EHR) systems scheduled to be introduced in the USA in 2014 will advance document interoperability between care providers. Specifically, the second stage of the federal incentive program for EHR adoption, known as Meaningful Use, requires use of the Consolidated Clinical Document Architecture (C-CDA) for document exchange. In an effort to examine and improve C-CDA based exchange, the SMART (Substitutable Medical Applications and Reusable Technology) C-CDA Collaborative brought together a group of certified EHR and other health information technology vendors. MATERIALS AND METHODS: We examined the machine-readable content of collected samples for semantic correctness and consistency. This included parsing with the open-source BlueButton.js tool, testing with a validator used in EHR certification, scoring with an automated open-source tool, and manual inspection. We also conducted group and individual review sessions with participating vendors to understand their interpretation of C-CDA specifications and requirements. RESULTS: We contacted 107 health information technology organizations and collected 91 C-CDA sample documents from 21 distinct technologies. Manual and automated document inspection led to 615 observations of errors and data expression variation across represented technologies. Based upon our analysis and vendor discussions, we identified 11 specific areas that represent relevant barriers to the interoperability of C-CDA documents. CONCLUSIONS: We identified errors and permissible heterogeneity in C-CDA documents that will limit semantic interoperability. Our findings also point to several practical opportunities to improve C-CDA document quality and exchange in the coming years.
Analysis of large-scale systems of biomedical data provides a perspective on neuropsychiatric disease that may be otherwise elusive. Described here is an analysis of three large-scale systems of data from autism spectrum disorder (ASD) and of ASD research as an exemplar of what might be achieved from study of such data. First is the biomedical literature that highlights the fact that there are two very successful but quite separate research communities and findings pertaining to genetics and the molecular biology of ASD. There are those studies positing ASD causes that are related to immunological dysregulation and those related to disorders of synaptic function and neuronal connectivity. Second is the emerging use of electronic health record systems and other large clinical databases that allow the data acquired during the course of care to be used to identify distinct subpopulations, clinical trajectories, and pathophysiological substructures of ASD. These systems reveal subsets of patients with distinct clinical trajectories, some of which are immunologically related and others which follow pathologies conventionally thought of as neurological. The third is genome-wide genomic and transcriptomic analyses which show molecular pathways that overlap neurological and immunological mechanisms. The convergence of these three large-scale data perspectives illustrates the scientific leverage that large-scale data analyses can provide in guiding researchers in an approach to the diagnosis of neuropsychiatric disease that is inclusive and comprehensive.
OBJECTIVE: The distinct trajectories of patients with autism spectrum disorders (ASDs) have not been extensively studied, particularly regarding clinical manifestations beyond the neurobehavioral criteria from the Diagnostic and Statistical Manual of Mental Disorders. The objective of this study was to investigate the patterns of co-occurrence of medical comorbidities in ASDs. METHODS: International Classification of Diseases, Ninth Revision codes from patients aged at least 15 years with a diagnosis of ASD were obtained from electronic medical records. These codes were aggregated by using phenotype-wide association studies categories and processed into 1350-dimensional vectors describing the counts of the most common categories in 6-month blocks between the ages of 0 and 15. Hierarchical clustering was used to identify subgroups with distinct courses. RESULTS: Four subgroups were identified. The first was characterized by seizures (n = 120, subgroup prevalence 77.5%). The second (n = 197) was characterized by multisystem disorders including gastrointestinal disorders (prevalence 24.3%) and auditory disorders and infections (prevalence 87.8%), and the third was characterized by psychiatric disorders (n = 212, prevalence 33.0%). The last group (n = 4316) could not be further resolved. The prevalence of psychiatric disorders was uncorrelated with seizure activity (P = .17), but a significant correlation existed between gastrointestinal disorders and seizures (P < .001). The correlation results were replicated by using a second sample of 496 individuals from a different geographic region. CONCLUSIONS: Three distinct patterns of medical trajectories were identified by unsupervised clustering of electronic health record diagnoses. These may point to distinct etiologies with different genetic and environmental contributions. Additional clinical and molecular characterizations will be required to further delineate these subgroups.
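The vector construction described in the METHODS above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes each record is an (age in years, category) pair and that the vector is laid out block by block. Ages 0 to 15 in 6-month blocks give 30 blocks, so 45 categories would yield the stated 1350 dimensions; that factorization is an inference from the numbers in the abstract, and the category names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def trajectory_vector(events: Sequence[Tuple[float, str]],
                      categories: Sequence[str],
                      max_age_years: float = 15.0,
                      block_years: float = 0.5) -> List[int]:
    """Count diagnosis categories per 6-month age block, flattened to one vector.

    Layout (an assumption): position = block_index * n_categories + category_index.
    """
    n_blocks = int(round(max_age_years / block_years))  # 30 for 0-15 years
    index = {name: i for i, name in enumerate(categories)}
    vector = [0] * (n_blocks * len(categories))
    for age, category in events:
        # Skip events outside the age window or outside the tracked categories.
        if category not in index or not (0 <= age < max_age_years):
            continue
        block = min(int(age // block_years), n_blocks - 1)
        vector[block * len(categories) + index[category]] += 1
    return vector


# Hypothetical categories and events for one patient.
categories = ["seizures", "gastrointestinal", "psychiatric"]
events = [(0.2, "seizures"), (0.4, "seizures"), (0.7, "gastrointestinal"),
          (14.9, "psychiatric"), (16.0, "gastrointestinal"), (3.0, "other")]
vec = trajectory_vector(events, categories)
print(len(vec), sum(vec))
```

One such vector per patient would then be the input to a hierarchical clustering routine.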
BACKGROUND: Fragile X syndrome and tuberous sclerosis are genetic syndromes that both have a high rate of comorbidity with autism spectrum disorder (ASD). Several lines of evidence suggest that these two monogenic disorders may converge at a molecular level through the dysfunction of activity-dependent synaptic plasticity. METHODS: To explore the characteristics of transcriptomic changes in these monogenic disorders, we profiled genome-wide gene expression levels in cerebellum and blood from murine models of fragile X syndrome and tuberous sclerosis. RESULTS: Differentially expressed genes and enriched pathways were distinct for the two murine models examined, with the exception of immune response-related pathways. In the cerebellum of the Fmr1 knockout (Fmr1-KO) model, the neuroactive ligand receptor interaction pathway and gene sets associated with synaptic plasticity such as long-term potentiation, gap junction, and axon guidance were the most significantly perturbed pathways. The phosphatidylinositol signaling pathway was significantly dysregulated in both cerebellum and blood of Fmr1-KO mice. In Tsc2 heterozygous (+/-) mice, immune system-related pathways, genes encoding ribosomal proteins, and glycolipid metabolism pathways were significantly changed in both tissues. CONCLUSIONS: Our data suggest that distinct molecular pathways may be involved in ASD with known but different genetic causes and that blood gene expression profiles of Fmr1-KO and Tsc2+/- mice mirror some, but not all, of the perturbed molecular pathways in the brain.
BACKGROUND: The length of the huntingtin (HTT) CAG repeat is strongly correlated with both age at onset of Huntington's disease (HD) symptoms and age at death of HD patients. Dichotomous analysis comparing HD to controls is widely used to study the effects of HTT CAG repeat expansion. However, a potentially more powerful approach is a continuous analysis strategy that takes advantage of all of the different CAG lengths, to capture effects that are expected to be critical to HD pathogenesis. METHODOLOGY/PRINCIPAL FINDINGS: We used continuous and dichotomous approaches to analyze microarray gene expression data from 107 human control and HD lymphoblastoid cell lines. Of all probes found to be significant in a continuous analysis by CAG length, only 21.4% were so identified by a dichotomous comparison of HD versus controls. Moreover, of probes significant by dichotomous analysis, only 33.2% were also significant in the continuous analysis. Simulations revealed that the dichotomous approach would require substantially more than 107 samples to either detect 80% of the CAG-length correlated changes revealed by continuous analysis or to reduce the rate of significant differences that are not CAG length-correlated to 20% (n = 133 or n = 206, respectively). Given the superior power of the continuous approach, we calculated the correlation structure between HTT CAG repeat lengths and gene expression levels and created a freely available searchable website, "HD CAGnome," that allows users to examine continuous relationships between HTT CAG and expression levels of approximately 20,000 human genes. CONCLUSIONS/SIGNIFICANCE: Our results reveal limitations of dichotomous approaches compared to the power of continuous analysis to study a disease where human genotype-phenotype relationships strongly support a role for a continuum of CAG length-dependent changes. 
The compendium of HTT CAG length-gene expression level relationships found at the HD CAGnome now provides convenient routes for discovery of candidates influenced by the HD mutation.
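The contrast between the two analysis strategies described above can be illustrated with a small sketch. It is not the authors' pipeline: it assumes a Pearson correlation as the continuous test and a Welch t statistic as the dichotomous test, uses made-up expression values for a single probe, and takes 36 repeats as an illustrative cutoff between control and HD alleles.

```python
import math
from statistics import mean, variance
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: uses every CAG length as a continuous predictor."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def welch_t(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Welch t statistic: compares two groups and ignores CAG length within them."""
    standard_error = math.sqrt(variance(group_a) / len(group_a)
                               + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Made-up data for one hypothetical probe.
cag_lengths = [17, 20, 23, 42, 45, 50]
expression = [1.0, 1.3, 1.6, 3.5, 3.8, 4.3]
HD_THRESHOLD = 36  # illustrative cutoff, an assumption for this sketch

r = pearson_r(cag_lengths, expression)
hd = [e for c, e in zip(cag_lengths, expression) if c >= HD_THRESHOLD]
controls = [e for c, e in zip(cag_lengths, expression) if c < HD_THRESHOLD]
t = welch_t(hd, controls)
print(f"continuous: r = {r:.3f}; dichotomous: t = {t:.2f}")
```

The continuous statistic is sensitive to graded change across the whole range of repeat lengths, whereas the dichotomous statistic only sees the difference between the two group means.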
OBJECTIVES: Treatment-resistant depression is a common clinical occurrence among patients with major depressive disorder (MDD), but its neurobiology is poorly understood. We used data collected as part of routine clinical care to study white matter integrity of the brain's limbic system and its association to treatment response. METHODS: Electronic medical records of multiple large New England hospitals were screened for patients with an MDD billing diagnosis, and natural language processing was subsequently applied to find those with concurrent diffusion-weighted images, but without any diagnosed brain pathology. Treatment outcome was determined by review of clinical charts. MDD patients (n = 29 non-remitters, n = 26 partial-remitters, and n = 37 full-remitters), and healthy control subjects (n = 58) were analyzed for fractional anisotropy (FA) of the fornix and cingulum bundle. RESULTS: Failure to achieve remission was associated with lower FA among MDD patients, statistically significant for the medial body of the fornix. Moreover, global and regional-selective age-related FA decline was most pronounced in patients with treatment-refractory, non-remitted depression. CONCLUSIONS: These findings suggest that specific brain microstructural white matter abnormalities underlie persistent, treatment-resistant depression. They also demonstrate the feasibility of investigating white matter integrity in psychiatric populations using legacy data.
Whole-genome sequencing (WGS) studies are uncovering disease-associated variants in both rare and nonrare diseases. Using next-generation sequencing for WGS requires a series of computational methods for alignment, variant detection, and annotation, and the accuracy and reproducibility of annotation results are essential for clinical implementation. However, annotating WGS data with up-to-date genomic information is still challenging for biomedical researchers. Here, we present gNOME, one of the fastest and most scalable annotation, filtering, and analysis pipelines, which prioritizes phenotype-associated variants while minimizing false-positive findings. The intuitive graphical user interface of gNOME facilitates the selection of phenotype-associated variants, and result summaries are provided at the variant, gene, and genome levels. Moreover, enrichment results for specific variants, genes, and gene sets, either between two groups or in comparison with the population-scale WGS datasets already integrated in the pipeline, can aid interpretation. We found a small number of discordant results between annotation software tools, in part due to different reporting strategies for variants with complex impacts. Using two published whole-exome datasets of uveal melanoma and bladder cancer, we demonstrated gNOME's accuracy of variant annotation and the enrichment of loss-of-function variants in known cancer pathways. The gNOME Web server and source code are freely available to the academic community (http://gnome.tchlab.org).
As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false-positive findings due to sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but this is costly. Here, we present variant filtering approaches using logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family prepared using two sequencing platforms and a validated set of variants in NA12878. Using LR or ensemble genotyping based filtering, false-negative rates were significantly reduced by 1.1- to 17.8-fold at the same levels of false discovery rates (5.4% for heterozygous and 4.5% for homozygous single nucleotide variants (SNVs); 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared to the filtering based on genotype quality scores. Moreover, ensemble genotyping excluded > 98% (105,080 of 107,167) of false positives while retaining > 95% (897 of 937) of true positives in de novo mutation (DNM) discovery in NA12878, and performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype-associated variants, and ensemble genotyping would be essential to minimize false-positive DNM candidates.
Purpose: Disease-causing mutations and pharmacogenomic variants are of primary interest for clinical whole-genome sequencing. However, estimating genetic liability for common complex diseases using established risk alleles might one day prove clinically useful. Methods: We compared polygenic scoring methods using a case-control data set with independently discovered risk alleles in the MedSeq Project. For eight traits of clinical relevance in both the primary-care and cardiomyopathy study cohorts, we estimated multiplicative polygenic risk scores using 161 published risk alleles and then normalized them using the population median estimated from the 1000 Genomes Project. Results: Our polygenic score approach identified the overrepresentation of independently discovered risk alleles in cases as compared with controls using a large-scale genome-wide association study data set. In addition to normalized multiplicative polygenic risk scores and rank in a population, the disease prevalence and proportion of heritability explained by known common risk variants provide important context in the interpretation of modern multilocus disease risk models. Conclusion: Our approach in the MedSeq Project demonstrates how complex trait risk variants from an individual genome can be summarized and reported for the general clinician and also highlights the need for definitive clinical studies to obtain reference data for such estimates and to establish clinical utility. Genet Med advance online publication 23 October 2014. Genetics in Medicine (2014); doi:10.1038/gim.2014.143.
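A multiplicative polygenic risk score normalized to a population median, as described in the Methods above, can be sketched as follows. This is a minimal illustration under stated assumptions: the score is taken to be the product of each variant's odds ratio raised to the number of risk alleles carried, and the reference population is a toy list rather than 1000 Genomes data. Variant names and odds ratios are hypothetical.

```python
import math
from statistics import median
from typing import Mapping, Sequence


def multiplicative_score(allele_counts: Mapping[str, int],
                         odds_ratios: Mapping[str, float]) -> float:
    """Product over variants of odds_ratio ** risk-allele count (assumed form)."""
    return math.prod(odds_ratios[v] ** allele_counts[v] for v in odds_ratios)


def normalized_score(person: Mapping[str, int],
                     population: Sequence[Mapping[str, int]],
                     odds_ratios: Mapping[str, float]) -> float:
    """Individual score divided by the median score of a reference population."""
    reference = median(multiplicative_score(p, odds_ratios) for p in population)
    return multiplicative_score(person, odds_ratios) / reference


# Hypothetical variants, odds ratios, and a toy reference population.
odds_ratios = {"v1": 2.0, "v2": 1.5}
person = {"v1": 1, "v2": 2}
population = [{"v1": 0, "v2": 0}, {"v1": 1, "v2": 0}, {"v1": 2, "v2": 0}]

raw = multiplicative_score(person, odds_ratios)              # 2.0 * 1.5**2 = 4.5
relative = normalized_score(person, population, odds_ratios)  # 4.5 / 2.0 = 2.25
print(raw, relative)
```

A normalized value above 1 indicates a score higher than that of the median individual in the reference population; on its own it says nothing about absolute risk, which is why the abstract stresses prevalence and explained heritability as context.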
Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.