Although gene and protein measurements are increasing in quantity and comprehensiveness, they do not characterize a sample's entire phenotype in an environmental or experimental context. Here we comprehensively consider associations between components of phenotype, genotype and environment to identify genes that may govern phenotype and responses to the environment. Context from the annotations of gene expression data sets in the Gene Expression Omnibus is represented using the Unified Medical Language System, a compendium of biomedical vocabularies with nearly 1 million concepts. After showing how data sets can be clustered by annotative concepts, we find a network of relations between phenotypic, disease, environmental and experimental contexts as well as genes with differential expression associated with these concepts. We identify novel genes related to concepts such as aging. Comprehensively identifying genes related to phenotype and environment is a step toward the Human Phenome Project.
Diamond-Blackfan anemia (DBA) is a broad developmental disease characterized by anemia, bone marrow (BM) erythroblastopenia, and an increased incidence of malignancy. Mutations in a ribosomal protein gene S19 (RPS19) are found in approximately 25% of DBA patients; however, the role of RPS19 in the pathogenesis of DBA remains unknown. Using global gene expression analysis we compared highly purified multipotential, erythroid and myeloid BM progenitors from RPS19 mutated and control individuals. We found several ribosomal protein genes down-regulated in all DBA progenitors. Apoptosis genes like TNFRSF10B and FAS, transcriptional control genes including the erythropoietic transcription factor MYB (encoding c-myb), and translational genes were greatly dysregulated mostly in diseased erythroid cells. Cancer-related genes, including RAB family oncogenes and tumor suppressor genes, were significantly dysregulated in all diseased progenitors. In addition, our results provide evidence that RPS19 mutations lead to co-downregulation of multiple ribosomal protein genes as well as down-regulation of genes involved in translation in DBA cells. In conclusion, the altered expression of cancer-related genes suggests a molecular basis for malignancy in DBA. Down-regulation of c-myb expression, which causes complete failure of fetal liver erythropoiesis in knockout mice, suggests a link between RPS19 mutations and reduced erythropoiesis in DBA.
BACKGROUND: Patient genomic data are rapidly becoming part of clinical decision making. Within a few years, full genome expression profiling and genotyping will be affordable enough to perform on every individual. The management of such sizeable, yet fine-grained, data in compliance with privacy laws and best practices presents significant security and scalability challenges. RESULTS: We present the design and implementation of GenePING, an extension to the PING personal health record system that supports secure storage of large, genome-sized datasets, as well as efficient sharing and retrieval of individual datapoints (e.g. SNPs, rare mutations, gene expression levels). Even with full access to the raw GenePING storage, an attacker cannot discover any stored genomic datapoint on any single patient. Given a large-enough number of patient records, an attacker cannot discover which data corresponds to which patient, or even the size of a given patient's record. The computational overhead of GenePING's security features is a small constant, making the system usable, even in emergency care, on today's hardware. CONCLUSIONS: GenePING is the first personal health record management system to support the efficient and secure storage and sharing of large genomic datasets. GenePING is available online at http://ping.chip.org/genepinghtml, licensed under the LGPL.
Mechanical stimulation of the airway epithelium, as would occur during bronchoconstriction, is a potent stimulus and can activate profibrotic pathways. We used DNA microarray technology to examine gene expression in compressed normal human bronchial epithelial (NHBE) cells. Compressive stress applied continuously over an 8-hour period to NHBE cells led to the upregulation of several families of genes, including a family of plasminogen-related genes that were previously not known to be regulated in this system. Real-time PCR demonstrated a peak increase in gene expression of 8.0-fold for urokinase plasminogen activator (uPA), 16.2-fold for urokinase plasminogen activator receptor (uPAR), 4.2-fold for plasminogen activator inhibitor-1 (PAI-1) and 3.9-fold for tissue plasminogen activator (tPA). Compressive stress also increased uPA protein levels in the cell lysates (112.0 vs 82.0 ng/mL, p=0.0004), and increased uPA (4.7 vs 3.3 ng/mL, p=0.02), uPAR (1.3 vs 0.86 ng/mL, p=0.007) and PAI-1 (50 vs 36 ng/mL, p=0.006) protein levels in cell culture media. Functional studies demonstrated increased urokinase-dependent plasmin generation in compression-stimulated cells (0.0090 vs 0.0033 OD/min, p=0.03). In addition, compression led to increased activation of matrix metalloproteinase (MMP)-9 and MMP-2 in a urokinase-dependent manner. In post-mortem human lung tissue, we observed an increase in epithelial uPA and uPAR immunostaining in the airways of two patients who died in status asthmaticus compared to minimal immunoreactivity noted in airways from seven non-asthmatic lung donors. Together these observations suggest an integrated response of airway epithelial cells to mechanical stimulation, acting through the plasminogen activating system to modify the airway micro-environment.
Informatics for Integrating Biology and the Bedside (i2b2) is one of the sponsored initiatives of the NIH Roadmap National Centers for Biomedical Computing (http://www.bisti.nih.gov/ncbc/). One of the goals of i2b2 is to provide clinical investigators broadly with the software tools necessary to collect and manage project-related clinical research data in the genomics age as a cohesive entity - a software suite to construct and manage the modern clinical research chart. Maintaining relationships between terms in multiple vocabularies is a vital activity for any organization attempting to support the integration of data coming from multiple sources. In a clinical research setting, this activity becomes even more significant. Combined laboratory and clinical data from diverse systems, with semantically related codes, terms and concepts, can help transform genomic knowledge into the practice of healthcare. Unless these vocabulary relationships are defined, however, the potential for considerable insight is lost. Maintaining these relationships, then, becomes a priority. Vocabulary mapping is a process of specifying and maintaining relationships between terms in multiple vocabularies. One vocabulary, designated as the master, becomes the classification scheme for the other, subsidiary vocabularies (1). In order to use a system like this, there must be some way of maintaining freedom and control over the master vocabulary, while still maintaining the integrity of the mappings to each source. Although there is no single tool to accomplish all of this, some existing tools offer a great deal of functionality, especially in the area of maintaining a single vocabulary.
Early detection of cancer can greatly improve prognosis. Identification of proteins or peptides in the circulation, at different stages of cancer, would greatly enhance treatment decisions. Mass spectrometry (MS) is emerging as a powerful tool to identify proteins from complex mixtures such as plasma, and it may help identify novel sets of markers associated with the presence of tumors. To examine this feature we have used a genetically modified mouse model, Apc(Min), which develops intestinal tumors with 100% penetrance. Utilizing liquid chromatography-tandem mass spectrometry (LC-MS/MS), we identified total plasma proteome (TPP) and plasma glycoproteome (PGP) profiles in tumor-bearing mice. Principal component analysis (PCA) and agglomerative hierarchical clustering analysis revealed that these protein profiles can be used to distinguish between tumor-bearing Apc(Min) and wild-type control mice. Leave-one-out cross-validation analysis established that global TPP and global PGP profiles can be used to correctly predict tumor-bearing animals in 17/19 (89%) and 19/19 (100%) of cases, respectively. Furthermore, leave-one-out cross-validation analysis confirmed that the significant differentially expressed proteins from both the TPP and the PGP were able to correctly predict tumor-bearing animals in 19/19 (100%) of cases. A subset of these proteins was independently validated by antibody microarrays using detection by two-color rolling circle amplification (TC-RCA). Analysis of the significant differentially expressed proteins indicated that some might derive from the stroma or the host response. These studies suggest that mass spectrometry-based approaches to examine the plasma proteome may prove to be a valuable method for determining the presence of intestinal tumors.
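The leave-one-out cross-validation named in the abstract above can be illustrated with a minimal sketch. The abstract does not say which classifier was used, so the nearest-centroid rule here is an assumption chosen for brevity, and the two-feature profiles are invented toy values, not data from the study. Every function name is hypothetical.

```python
from math import dist


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def predict_nearest_centroid(train_x, train_y, sample):
    """Return the label whose class centroid is closest to `sample`.

    Assumption: nearest-centroid is a stand-in classifier; the source
    abstract does not specify the prediction method.
    """
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = centroid(members)
    return min(centroids, key=lambda label: dist(centroids[label], sample))


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, count correct calls."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict_nearest_centroid(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct, len(xs)


# Invented toy profiles (two protein abundances per animal), for illustration only.
profiles = [[1.0, 0.2], [1.1, 0.1], [0.9, 0.3],
            [0.1, 1.0], [0.2, 1.1], [0.0, 0.9]]
labels = ["tumor", "tumor", "tumor", "wild-type", "wild-type", "wild-type"]

correct, total = leave_one_out_accuracy(profiles, labels)
print(f"{correct}/{total} held-out samples predicted correctly")
```

The key property is that the held-out sample never contributes to the centroids used to classify it, which is what makes a count such as 17/19 an estimate of performance on unseen animals rather than a measure of fit.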
Skeletal muscle side population (SP) cells are thought to be "stem"-like cells. Despite reports confirming the ability of muscle SP cells to give rise to differentiated progeny in vitro and in vivo, the molecular mechanisms defining their phenotype remain unclear. In this study, gene expression analyses of human fetal skeletal muscle demonstrate that bone morphogenetic protein 4 (BMP4) is highly expressed in SP cells but not in main population (MP) mononuclear muscle-derived cells. Functional studies revealed that BMP4 specifically induces proliferation of BMP receptor 1a-positive MP cells but has no effect on SP cells, which are BMPR1a-negative. In contrast, the BMP4 antagonist Gremlin, specifically up-regulated in MP cells, counteracts the stimulatory effects of BMP4 and inhibits proliferation of BMPR1a-positive muscle cells. In vivo, BMP4-positive cells can be found in the proximity of BMPR1a-positive cells in the interstitial spaces between myofibers. Gremlin is expressed by mature myofibers and interstitial cells, which are separate from BMP4-expressing cells. Together, these studies propose that BMP4 and Gremlin, which are highly expressed by human fetal skeletal muscle SP and MP cells, respectively, are regulators of myogenic progenitor proliferation.
We developed a computational method to characterize aneuploidy in tumor samples based on coordinated aberrations in expression of genes localized to each chromosomal region. We summarized the total level of chromosomal aberration in a given tumor in a univariate measure termed total functional aneuploidy. We identified a signature of chromosomal instability from specific genes whose expression was consistently correlated with total functional aneuploidy in several cancer types. Net overexpression of this signature was predictive of poor clinical outcome in 12 cancer data sets representing six cancer types. Also, the signature of chromosomal instability was higher in metastasis samples than in primary tumors and was able to stratify grade 1 and grade 2 breast tumors according to clinical outcome. These results provide a means to assess the potential role of chromosomal instability in determining malignant potential over a broad range of tumors.
BACKGROUND: Widespread availability of geographic information systems software has facilitated the use of disease mapping in academia, government and private sector. Maps that display the address of affected patients are often exchanged in public forums, and published in peer-reviewed journal articles. As previously reported, a search of figure legends in five major medical journals found 19 articles from 1994-2004 that identify over 19,000 patient addresses. In this report, a method is presented to evaluate whether patient privacy is being breached in the publication of low-resolution disease maps. RESULTS: To demonstrate the effect, a hypothetical low-resolution map of geocoded patient addresses was created and the accuracy with which patient addresses can be resolved is described. Through georeferencing and unsupervised classification of the original image, the method precisely re-identified 26% (144/550) of the patient addresses from a presentation quality map and 79% (432/550) from a publication quality map. For the presentation quality map, 99.8% of the addresses were within 70 meters (approximately one city block length) of the predicted patient location, 51.6% of addresses were identified within five buildings, 70.7% within ten buildings and 93% within twenty buildings. For the publication quality map, all addresses were within 14 meters and 11 buildings of the predicted patient location. CONCLUSION: This study demonstrates that lowering the resolution of a map displaying geocoded patient addresses does not sufficiently protect patient addresses from re-identification. Guidelines to protect patient privacy, including those of medical journals, should reflect policies that ensure privacy protection when spatial data are displayed or published.
BACKGROUND: Biological processes are carried out by coordinated modules of interacting molecules. As clustering methods demonstrate that genes with similar expression display increased likelihood of being associated with a common functional module, networks of coexpressed genes provide one framework for assigning gene function. This has informed the guilt-by-association (GBA) heuristic, widely invoked in functional genomics. Yet although the idea of GBA is accepted, the breadth of GBA applicability is uncertain. RESULTS: We developed methods to systematically explore the breadth of GBA across a large and varied corpus of expression data to answer the following question: To what extent is the GBA heuristic broadly applicable to the transcriptome and, conversely, how broadly is GBA captured by a priori knowledge represented in the Gene Ontology (GO)? Our study provides an investigation of the functional organization of five coexpression networks using data from three mammalian organisms. Our method calculates a probabilistic score between each gene and each Gene Ontology category that reflects coexpression enrichment of a GO module. For each GO category we use receiver operating characteristic (ROC) curves to assess whether these probabilistic scores reflect GBA. This methodology applied to five different coexpression networks demonstrates that the signature of guilt-by-association is ubiquitous and reproducible and that the GBA heuristic is broadly applicable across the population of nine hundred Gene Ontology categories. We also demonstrate the existence of highly reproducible patterns of coexpression between some pairs of GO categories. CONCLUSION: We conclude that GBA has universal value and that transcriptional control may be more modular than previously realized. Our analyses also suggest that methodologies combining coexpression measurements across multiple genes in a biologically-defined module can aid in characterizing gene function or in characterizing whether pairs of functions operate together.
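As a rough illustration of the ROC-based assessment described above, the sketch below computes the area under the ROC curve for one category from per-gene scores. The abstract does not give the scoring formula, so the scores are invented placeholders, and the function name and the pairwise (Mann-Whitney) formulation of the area are assumptions made for this example.

```python
def roc_auc(scores, in_category):
    """Area under the ROC curve via the Mann-Whitney pairwise formulation.

    Equals the probability that a randomly chosen category member scores
    higher than a randomly chosen non-member; ties count as 0.5.
    """
    pos = [s for s, m in zip(scores, in_category) if m]
    neg = [s for s, m in zip(scores, in_category) if not m]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented scores: higher means stronger coexpression with the category's genes.
gene_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
is_member = [True, True, True, False, False, False]

auc = roc_auc(gene_scores, is_member)
print(f"AUC = {auc:.3f}")
```

An area near 0.5 would mean the scores carry no information about category membership, while values approaching 1.0 would be consistent with guilt-by-association for that category.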
Numerous cellular and molecular perturbations have been studied to elucidate the pathogenic mechanisms underlying nephrotic-range proteinuria, which may in turn shed light on disease-specific mechanisms. We have analyzed the publicly available data from the PhysGen partial panel of consomic rats to determine whether there are quantitative trait loci that associate with nephrotic-range proteinuria. As of this writing, consomic rat strains subjected to the renal protocol have been bred by the Program for Genomic Applications for 15 of the 22 rat chromosomes for both genders, predominantly with the Brown-Norway (BN) and Dahl salt-sensitive (SS) strains as parents. We defined chromosomes of interest as consomic SS-xBN strains whose phenotype measurements differed significantly from SS but not BN strains, stratified by gender. We filtered and clustered differentially expressed genes by function in renal tissue from relevant strains. Proteinuria was significantly higher in male SS vs. male SS-18BN, and it was significantly higher in male SS vs. female SS. Functional clustering of differentially expressed genes yielded two specific functional clusters: apoptosis (p=0.022) and angiogenesis (p=0.046). Gene expression profiles demonstrated differential expression of apoptotic and angiogenic genes. However, TUNEL stains of renal tissue showed no significant difference in the number of apoptotic nuclei. We conclude that chromosomes 18 and X are quantitative trait loci for nephrotic-range proteinuria in rats.
Accurate and rapid identification of perturbed pathways through the analysis of genome-wide expression profiles facilitates the generation of biological hypotheses. We propose a statistical framework for determining whether a specified group of genes for a pathway has a coordinated association with a phenotype of interest. Several issues on proper hypothesis-testing procedures are clarified. In particular, it is shown that the differences in the correlation structure of each set of genes can lead to a biased comparison among gene sets unless a normalization procedure is applied. We propose statistical tests for two important but different aspects of association for each group of genes. This approach has more statistical power than currently available methods and can result in the discovery of statistically significant pathways that are not detected by other methods. This method is applied to data sets involving diabetes, inflammatory myopathies, and Alzheimer's disease, using gene sets we compiled from various public databases. In the case of inflammatory myopathies, we have correctly identified the known cytotoxic T lymphocyte-mediated autoimmunity in inclusion body myositis. Furthermore, we predicted the presence of dendritic cells in inclusion body myositis and of an IFN-alpha/beta response in dermatomyositis, neither of which was previously described. These predictions have been subsequently corroborated by immunohistochemistry.
A number of important applications in medicine and biomedical research, including quality of care surveillance and identification of prospective study subjects, require identification of large cohorts of patients with a specific diagnosis. Currently used methods are either labor-intensive or imprecise. We have therefore designed DITTO, a tool for identification of patients with a documented specific diagnosis through analysis of the text of physician notes in the electronic medical record. Evaluation of DITTO on diabetes mellitus, hypertension and overweight showed it to be rapid and highly accurate. DITTO processed 170,000 notes/hr with sensitivity ranging from 74 to 96%, and specificity from 86 to 100%. Its accuracy substantially exceeded the performance of currently used techniques for each of the three diseases. DITTO can be adapted for use in another healthcare facility or to detect a different diagnosis. DITTO is an important advancement in the field, and we plan to continue to work to enhance its functionality and performance.
BACKGROUND: Cis-regulatory modules are combinations of regulatory elements occurring in close proximity to each other that control the spatial and temporal expression of genes. The ability to identify them in a genome-wide manner depends on the availability of accurate models and of search methods able to detect putative regulatory elements with enhanced sensitivity and specificity. RESULTS: We describe the implementation of a search method for putative transcription factor binding sites (TFBSs) based on hidden Markov models built from alignments of known sites. We built 1,079 models of TFBSs using experimentally determined sequence alignments of sites provided by the TRANSFAC and JASPAR databases and used them to scan sequences of the human, mouse, fly, worm and yeast genomes. In several test cases, the method correctly identified experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method. CONCLUSIONS: The search engine, available at http://mapper.chip.org, allows the identification, visualization and selection of putative TFBSs occurring in the promoter or other regions of a gene from the human, mouse, fly, worm and yeast genomes. In addition, it allows the user to upload a sequence to query and to build a model by supplying a multiple sequence alignment of binding sites for a transcription factor of interest. Due to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.
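For orientation, the sketch below shows the simpler kind of approach the abstract above uses as a comparison baseline: a nucleotide weight matrix built from aligned sites and slid along a sequence. It is not the hidden Markov model method of the paper and does not reproduce MAPPER. The aligned sites, the pseudocount, the uniform background and the score threshold are all invented for illustration, and the function names are hypothetical.

```python
from math import log2

BASES = "ACGT"


def build_log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Per-position log2 odds of each base versus a uniform background.

    Assumptions: gap-free alignment, uniform background, add-one pseudocount.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        matrix.append({
            b: log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (offset, window, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous characters such as N
        score = sum(matrix[j][b] for j, b in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


# Invented aligned sites and query sequence, for illustration only.
known_sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
matrix = build_log_odds_matrix(known_sites)
hits = scan("GGCGTATAAAGGC", matrix, threshold=4.0)
for offset, window, score in hits:
    print(offset, window, round(score, 2))
```

A weight matrix scores each position independently. A hidden Markov model built from the same alignment can additionally model insertions and deletions relative to the aligned sites, which is one plausible reason the two approaches can differ in sensitivity and specificity.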
We describe a comprehensive map of putative transcription factor binding sites (TFBSs) across multiple genomes created using a search method that relies on hidden Markov models built from experimentally determined TFBSs. Using the information in the TRANSFAC and JASPAR databases, we built 1134 models for TFBSs and used them to scan regions 10 kb upstream of the start of the transcript for all known genes in the human, mouse and Drosophila melanogaster genomes. The results, together with homology information on clusters of ortholog genes across the three genomes, were used to create a multi-organism catalog of annotated TFBSs. The catalog can be queried through a web interface accessible at http://bio.chip.org/mapper that allows the identification, visualization and selection of TFBSs occurring in the promoter of a gene of interest and also the common factors predicted to bind across the cluster of orthologs that includes that gene. Alternatively, the interface allows the user to retrieve binding sites for a single transcription factor of interest in a single gene or in all genes of the human, mouse or fruit fly genomes.
OBJECTIVE: Patient-centered information management may overcome barriers that impede high quality, safe care in the emergency department (ED). The utility of parents' report of medication data via a multimedia, touch screen interface, the asthma kiosk, was investigated. Our specific aims were: 1) to estimate the validity of parents' electronically entered medication history for asthma, and 2) to compare the parents' kiosk entries regarding medications to the documentation of ED physicians and nurses. METHODS: We enrolled a cohort of parents to use the asthma kiosk and tested the validity of this communication channel for medication data specific to pediatric asthma. Parents' data provided via the kiosk during the ED encounter and the documentation of ED nurses and physicians were compared to a telephone-based interview with the parent after discharge that reviewed all asthma-specific medications physically present in the home. Treating clinicians in the ED were blinded to the parents' kiosk entries. RESULTS: Sixty-six parents were enrolled and 49/66 (74.2%) completed the gold standard interview. When analyzed at the level of individual medications, the validity of parental report was 81% for medication name, 79% for route of delivery, 66% for the form of the medication, and 60% for dosage. Parents' report improved upon the validity of documentation by physicians across all medication details save for medication name. Parents' report was more valid than nursing documentation at triage for all medication details. CONCLUSION: Parents can provide an independent source of medication data that improves upon current documentation for key variables that impact quality and safety in emergency asthma care.
Despite progress in creating standardized clinical data models and interapplication protocols, the goal of creating a lifelong health care record remains mired in the pragmatics of interinstitutional competition, concerns about privacy and unnecessary disclosure, and the lack of a nationwide system for authenticating and authorizing access to medical information. The authors describe the architecture of a personally controlled health care record system, PING, that is not institutionally bound, is free and open source, and meets the policy requirements that the authors have previously identified for health care delivery and population-wide research.
Side Population (SP) cells, isolated from murine adult bone marrow (BM) based on the exclusion of the DNA dye Hoechst 33342, exhibit potent hematopoietic stem cell (HSC) activity when compared to Main Population (MP) cells. Furthermore, SP cells derived from murine skeletal muscle exhibit both hematopoietic and myogenic potential in vivo. The multipotential capacity of SP cells isolated from various tissues is supported by an increasing number of studies. To investigate whether the SP phenotype is associated with a unique transcriptional profile, we characterized gene expression of SP cells isolated from two biologically distinct tissues, bone marrow and muscle. Comparison of SP cells with differentiated MP cells within a tissue revealed that SP cells are in an active transcriptional and translational state and underexpress genes reflecting tissue-specific functions. Direct comparison of gene expression of SP cells isolated from different tissues identified genes common to SP cells as well as genes specific to SP cells within a particular tissue, which further define a muscle and bone marrow environment. This study reports gene expression of muscle SP cells, common features and differences between SP cells isolated from muscle and bone marrow, and further identifies common signaling pathways that might regulate SP cell functions.
As public interest in consumer-driven electronic healthcare applications rises (1-3), so do concerns about the privacy and security of these applications. Achieving a balance between providing the necessary security and promoting user acceptance is a major obstacle to the large-scale deployment of applications such as personal health records (PHRs). Robust and reliable forms of authentication are needed for PHRs, as the record will often contain sensitive and protected health information, including the patient's own annotations. Since the health care industry per se is unlikely to succeed at single-handedly developing and deploying a large-scale, national authentication infrastructure, it makes sense to leverage existing hardware, software and networks. This paper proposes a new model for authentication of users to health care information applications, leveraging wireless mobile devices. Cell phones are widely distributed, have high user acceptance, and offer advanced security protocols. We propose harnessing this technology for the strong authentication of individuals by creating a registration authority and an authentication service, and examine the problems and promise of such a system.