MOTIVATION: The existence of several technologies for measuring gene expression makes the question of cross-technology agreement of measurements an important issue. Cross-platform utilization of data from different technologies has the potential to reduce the need to duplicate experiments but requires corresponding measurements to be comparable. METHODS: A comparison of mRNA measurements of 2895 sequence-matched genes in 56 cell lines from the standard panel of 60 cancer cell lines from the National Cancer Institute (NCI 60) was carried out by calculating correlation between matched measurements and calculating concordance between clusters from two high-throughput DNA microarray technologies, Stanford-type cDNA microarrays and Affymetrix oligonucleotide microarrays. RESULTS: In general, corresponding measurements from the two platforms showed poor correlation. Clusters of genes and cell lines were discordant between the two technologies, suggesting that relative intra-technology relationships were not preserved. GC-content, sequence length, average signal intensity, and an estimator of cross-hybridization were found to be associated with the degree of correlation. This suggests gene-specific, or more correctly probe-specific, factors influencing measurements differently in the two platforms, implying a poor prognosis for a broad utilization of gene expression measurements across platforms.
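The matched-measurement comparison described above can be illustrated with a minimal sketch. It assumes each platform's data is a dictionary mapping a gene identifier to a list of measurements over the same ordered cell lines, and it uses Pearson correlation; the abstract does not state which correlation coefficient was used, and the function names `pearson` and `per_gene_correlation` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def per_gene_correlation(cdna, oligo):
    """Correlate each gene's profile across cell lines between two platforms.

    cdna, oligo: dict of gene -> list of measurements, with cell lines in the
    same order on both platforms (an assumption of this sketch).
    Only genes present on both platforms are compared.
    """
    shared = cdna.keys() & oligo.keys()
    return {gene: pearson(cdna[gene], oligo[gene]) for gene in shared}
```

A gene whose profile rises and falls together on both platforms scores near 1, while a gene affected by probe-specific factors may score near 0 or below.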
This article presents a Bayesian method for model-based clustering of gene expression dynamics. The method represents gene-expression dynamics as autoregressive equations and uses an agglomerative procedure to search for the most probable set of clusters given the available data. The main contributions of this approach are the ability to take into account the dynamic nature of gene expression time series during clustering and a principled way to identify the number of distinct clusters. As the number of possible clustering models grows exponentially with the number of observed time series, we have devised a distance-based heuristic search procedure able to render the search process feasible. In this way, the method retains the important visualization capability of traditional distance-based clustering and acquires an independent, principled measure to decide when two series are different enough to belong to different clusters. The reliance of this method on an explicit statistical representation of gene expression dynamics makes it possible to use standard statistical techniques to assess the goodness of fit of the resulting model and validate the underlying assumptions. A set of gene-expression time series, collected to study the response of human fibroblasts to serum, is used to identify the properties of the method.
MOTIVATION: Gene regulatory elements are often predicted by seeking common sequences in the promoter regions of genes that are clustered together based on their expression profiles. We consider the problem in the opposite direction: we seek to find the genes that have similar promoter regions and determine the extent to which these genes have similar expression profiles. RESULTS: We use the data sets from experiments on Saccharomyces cerevisiae. Our similarity measure for the promoter regions is based on the set of common mapped or putative transcription factor binding sites and other regulatory elements in the upstream region of the genes, as contained in the Saccharomyces cerevisiae Promoter Database. We pair up the genes with high similarity scores and compare their expression levels in time-course experiment data. We find that genes with similar promoter regions on the average have significantly higher correlation, but it can vary widely depending on the genes. This confirms that the presence of similar regulatory elements often does not correspond to similarity in expression profiles and indicates that finding transcription factor binding sites or other regulatory elements starting with the expression patterns may be limited in many cases. Regardless of the correlation, the degree to which the profiles agree under different experimental conditions can be examined to derive hypotheses concerning the role of common regulatory elements. Overall, we find that considering the relationship between the promoter regions and the expression profiles starting with the regulatory elements is a difficult but useful process that can provide valuable insights.
A comprehensive and timely response to current and future bioterrorist attacks requires a data acquisition, threat detection, and response infrastructure with unprecedented scope in time and space. Fortunately, biomedical informaticians have developed and implemented architectures, methodologies, and tools at the local and the regional levels that can be immediately pressed into service for the protection of our populations from these attacks. These unique contributions of the discipline of biomedical informatics are reviewed here.
The primary cause of Duchenne muscular dystrophy (DMD) is a mutation in the dystrophin gene leading to the absence of the corresponding RNA transcript and protein. Absence of dystrophin leads to disruption of the dystrophin-associated protein complex and substantial changes in skeletal muscle pathology. Although the histological pathology of dystrophic tissue has been well documented, the underlying molecular pathways remain poorly understood. To examine the pathogenic pathways and identify new or modifying factors involved in muscular dystrophy, expression microarrays were used to compare individual gene expression profiles of skeletal muscle biopsies from 12 DMD patients and 12 unaffected control patients. Two separate statistical analysis methods were used to interpret the resulting data: t test analysis to determine the statistical significance of differential expression and geometric fold change analysis to determine the extent of differential expression. These analyses identified 105 genes that differ significantly in expression level between unaffected and DMD muscle. Many of the differentially expressed genes reflect changes in histological pathology. For instance, immune response signals and extracellular matrix genes are overexpressed in DMD muscle, an indication of the infiltration of inflammatory cells and connective tissue. Significantly more genes are overexpressed than are underexpressed in dystrophic muscle, with dystrophin underexpressed, whereas other genes encoding muscle structure and regeneration processes are overexpressed, reflecting the regenerative nature of the disease.
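The two analyses named above can be sketched for a single gene as follows. This is a minimal illustration under stated assumptions: the abstract does not say which t test variant was used, so Welch's unequal-variance statistic is assumed here, and "geometric fold change" is interpreted as the ratio of geometric means of the two groups. The function names are hypothetical, and expression values are assumed to be strictly positive.

```python
from math import exp, log, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return exp(mean([log(v) for v in values]))


def geometric_fold_change(a, b):
    """Ratio of the geometric mean of group a to that of group b."""
    return geometric_mean(a) / geometric_mean(b)
```

The t statistic addresses whether a difference is statistically distinguishable from noise, while the fold change describes how large the difference is; a gene can score highly on one and not the other.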
There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients. The classification problem in which gene expression data serve as predictors and a class label phenotype as the binary outcome variable has been examined extensively, but there has been less emphasis in dealing with other types of phenotypic data. In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring. We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem. The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables. The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates. The algorithm is fast, as it does not involve any matrix decompositions in the iterations. We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.
OBJECTIVE: To define current practice among US newborn screening programs for notification of results, research, and consenting procedures. METHODS: A telephone survey of all US newborn screening program supervisors. RESULTS: All 51 programs participated. All states reported abnormal results to the infant's physician, and some also reported to the hospital and parents. Cases with abnormal results were tracked to different endpoints but usually (92.1%) at least until a follow-up appointment was made. A total of 66.6% of programs can communicate with programs in other states; 9.8% enable families to suppress reporting of results to the infant's physician. No state has a mechanism for parents to prevent results from entering the medical record. Parents or physicians who request results are often authenticated by providing their name (52.9%). Many programs (45.1%) report only to physicians and require just their name (43.5%), an identification number (17.4%), a letter (26.1%), or a parent's signature (26.1%). A total of 70.6% retain residual blood samples; of these, only 8.3% store them completely devoid of patient identifiers. A total of 49.0% of programs aggregate data for research. In 16.0% of these, the data are publicly available. In 24.0%, researchers obtain approval at their own institution; in 24.0%, researchers obtain approval through the state laboratory Institutional Review Board. In 74.5% of programs, parents are notified but not asked for consent before collection of the sample; 19.6% neither notify parents nor obtain consent before screening. CONCLUSIONS: There is wide variation in practice among the US newborn screening programs. Because the programs collectively manage a comprehensive nationwide genomic databank, careful consideration of how information technology and high-throughput genomic analysis are used will be essential to allow progress in clinical care, public health, and research while protecting individual privacy.
Clustering algorithms have been shown to be useful for exploring large-scale gene expression profiles. Visualization and objective evaluation of clusters are two important considerations when users are selecting among clustering algorithms, but they are often overlooked. The development of a framework and software tools that implement comprehensive data visualization and objective measures of cluster quality is crucial. In this paper, we describe a theoretical framework and formalizations for consistently developing clustering algorithms. A new clustering algorithm was developed within the proposed framework. We demonstrate that a theoretically sound principle can be uniformly applied to the development of a cluster-optimization function, a comprehensive data-visualization strategy, and objective cluster-evaluation measures, as well as to the actual implementation of the principle. Cluster consistency and quality measures of the algorithm are rigorously evaluated against those of popular clustering algorithms for gene expression data analysis (K-means and self-organizing maps), in four data sets, yielding promising results.
Inadequate follow-up for abnormal laboratory results is a frequent cause of medical errors, especially for those that arrive after the patient is discharged in an Emergency Department (ED) setting. We have developed and implemented a computerized reminder system called the Automated Late-Arriving Results Monitoring System (ALARMS) for the Emergency Department at Children's Hospital, Boston. ALARMS scans the hospital's laboratory and ED registration databases to generate an electronic daily log of all late-arriving abnormal results for ED patients, which can be obtained by authorized physicians through a web-based user interface inside the hospital's intranet. We believe, by using this automated data-driven rule-based reminder system, we can minimize the risk of errors resulting from late-arriving laboratory data without requiring substantial additional efforts from clinicians.
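The kind of rule such a reminder system applies can be sketched as a filter over laboratory results. This is a hypothetical illustration, not the ALARMS implementation: the `LabResult` fields, the reference-range test for abnormality, and the `late_abnormal_results` function are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LabResult:
    """Hypothetical record shape; real laboratory schemas will differ."""
    patient_id: str
    test: str
    value: float
    ref_low: float
    ref_high: float
    resulted_at: datetime


def late_abnormal_results(results, discharge_times):
    """Return abnormal results that arrived after the patient's ED discharge.

    discharge_times: dict of patient_id -> discharge datetime (assumed to come
    from the ED registration data). Patients without a discharge time are skipped.
    """
    log = []
    for r in results:
        discharged = discharge_times.get(r.patient_id)
        if discharged is None:
            continue
        abnormal = not (r.ref_low <= r.value <= r.ref_high)
        if abnormal and r.resulted_at > discharged:
            log.append(r)
    # Oldest results first, so the daily log reads chronologically.
    return sorted(log, key=lambda r: r.resulted_at)
```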
OBJECTIVE: To describe the use of large-scale gene expression profiles to distinguish broad categories of myopathy and subtypes of inflammatory myopathies (IM) and to provide insight into the pathogenesis of inclusion body myositis (IBM), polymyositis, and dermatomyositis. METHODS: Using Affymetrix GeneChip microarrays, the authors measured the simultaneous expression of approximately 10,000 genes in muscle specimens from 45 patients in four major disease categories (dystrophy, congenital myopathy, inflammatory myopathy, and normal). The authors separately analyzed gene expression in 14 patients limited to the three major subtypes of IM. Bioinformatics techniques were used to classify specimens with similar expression profiles based on global patterns of gene expression and to identify genes with significant differential gene expression compared with normal. RESULTS: Ten of 11 patients with IM, all normals and nemaline myopathies, and 10 of 12 patients with Duchenne muscular dystrophy were correctly classified by this approach. The various subtypes of inflammatory myopathies have distinct gene expression signatures. Specific sets of immune-related genes allow for molecular classification of patients with IBM, polymyositis, and dermatomyositis. Analysis of differential gene expression identifies as relevant to disease pathogenesis previously reported cytokines, major histocompatibility complex class I and II molecules, granzymes, and adhesion molecules, as well as newly identified members of these categories. Increased expression of actin cytoskeleton genes is also identified. CONCLUSIONS: The molecular profiles of muscle tissue in patients with inflammatory myopathies are distinct and represent molecular signatures from which diagnostic insight may follow. Large numbers of differentially expressed genes are rapidly identified.
As many as 86% of intensive care unit (ICU) alarms are false. Multiple signal integration of temporal monitor data by decision tree induction may improve artifact detection. We explore the effect of data granularity on model-building by comparing models made from 1-second versus 1-minute data. Models developed from 1-minute data remained effective when tested on 1-second data. Model development using 1-minute data means that more hours of ICU monitoring (including more artifacts) can be processed in less time. Compression of temporal data by arithmetic mean, therefore, can be an effective method for decreasing knowledge discovery processing time without compromising learning.
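Compression by arithmetic mean, as described above, can be sketched in a few lines. The function name `compress_by_mean` and the handling of a trailing partial window (averaged over the samples it actually contains) are assumptions of this example rather than details from the study.

```python
def compress_by_mean(samples, window=60):
    """Reduce a signal by averaging consecutive, non-overlapping windows.

    With 1-second samples and window=60, each output value is a 1-minute mean.
    A trailing partial window is averaged over its actual length.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    compressed = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        compressed.append(sum(chunk) / len(chunk))
    return compressed
```

For example, `compress_by_mean(list(range(120)))` reduces 120 one-second values to two one-minute means, so a model-building step sees one sixtieth of the data points.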
Most investigations of coordinated gene expression have focused on identifying correlated expression patterns between genes by examining their normalized static expression levels. In this study, we focus on the dynamics of gene expression by seeking to identify correlated patterns of changes in genetic expression level. In doing so, we build upon methods developed in clinical informatics to detect temporal trends of laboratory and other clinical data. We construct relevance networks from Saccharomyces cerevisiae gene-expression dynamics data and find genes with related functional annotations grouped together. While some of these associations are also found using a standard expression level analysis, many are identified exclusively through the dynamic analysis. These results strongly suggest that the analysis of gene expression dynamics is a necessary and important tool for studying regulatory and other functional relationships among genes. The source code developed for this investigation is freely available to all non-commercial investigators by contacting the authors.
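One simple way to move from static levels to dynamics is to correlate the changes between consecutive time points and keep gene pairs whose correlation exceeds a threshold. The sketch below assumes first differences as the measure of change, Pearson correlation as the association measure, and an arbitrary threshold of 0.9; the abstract specifies none of these, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def differences(series):
    """Changes between consecutive time points."""
    return [b - a for a, b in zip(series, series[1:])]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def relevance_edges(profiles, threshold=0.9):
    """Link genes whose expression *changes* are strongly correlated.

    profiles: dict of gene -> expression levels at the same ordered time points.
    Returns (gene1, gene2, r) tuples with |r| >= threshold.
    """
    deltas = {gene: differences(levels) for gene, levels in profiles.items()}
    edges = []
    for g1, g2 in combinations(sorted(deltas), 2):
        r = pearson(deltas[g1], deltas[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges
```

Two genes with very different absolute levels can still be linked if they change in step, which is the kind of association a static expression-level analysis may not reveal.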
Direct electronic acquisition of data from patients possesses accuracy and diagnostic value. The mechanics of how best to capture historical information from patients using electronic interfaces are not well studied. We undertook an iterative usability experiment to answer 2 questions: 1) How can maximal electronic data input from a patient be achieved, and 2) Do varying structures for data entry promote differential documentation of specified data elements? METHODS: A series of four trials comprised the testing cycle. Unstructured text entry, directed text entry, and closed-ended questions were tested in combination against outcomes of word count, time to task completion, and user preferences. Covariates of interest included participants' technologic experience and ergonomic experience with keyboards, as well as self-report of educational status, literacy, and primary language. RESULTS: Participants clearly preferred the order of initial closed-ended questions followed by unstructured text entry, and this ordering was not associated with decrements in word count or increase in time. When compared to unstructured text entry, directed text entry provided higher documentation of data for past medical history and questions which parents wished to discuss with the clinician. A closed-ended question structure, when compared to directed text entry, provided higher capture of parents' questions for discussion. CONCLUSIONS: Optimal design of an electronic interview for the capture of medical histories will benefit from a mixed structure of directed text entry and closed-ended questions. For historical or clinically relevant items where maximal capture of data is desired, a structure with closed-ended questions would be preferred.
In this paper, we propose a secure, distributed and scalable infrastructure for a lifelong personal medical record system. We leverage existing and widely available technologies, like the Web and public-key cryptography, to define an architecture that allows patients to exercise full control over their medical data. This is done without compromising patients' privacy and the ability of other interested parties (e.g. physicians, health-care institutions, public-health researchers) to access the data when appropriately authorized. The system organizes the information as a tree of encrypted plain-text XML files, in order to ensure platform independence and durability, and uses a role-based authorization scheme to assign access privileges. In addition to the basic architecture, we describe tools to populate the patient's record with data from hospital databases and the first testbed applications we are deploying.
As we enter an age in which genomics and bioinformatics make possible the discovery of new knowledge about the biological characteristics of an organism, it is critical that we attempt to report newly discovered "significant" phenotypes only when they are actually of significance. With the relative youth of genome-scale gene expression technologies, how to make such distinctions has yet to be better defined. We present a "mask technology" by which to filter out those levels of gene expression that fall within the noise of the experimental techniques being employed. Conversely, our technique can lend validation to significant fold differences in expression level even when the fold value may appear quite small (e.g. 1.3). Given array-organized expression level results from a pair of identical experiments, our ID Mask Tool enables the automated creation of a two-dimensional "region of insignificance" that can then be used with subsequent data analyses. Fundamentally, this should enable researchers to report findings that are more likely to be truly meaningful. Moreover, this can prevent major investments of time, energy, and biological resources in the pursuit of candidate genes that represent false positives.
Single Nucleotide Polymorphisms (SNPs) are the most important source of variation in our genome, and an invaluable tool in the hands of researchers who investigate genetic diseases. Databases of SNPs are growing at a very fast rate, and the ability to perform large-scale, high-resolution association studies is quickly becoming a reality. In this paper we describe SNPper, a web-based tool to search for SNPs in public databases. The system allows searching for all SNPs in a given set of genes (for candidate gene studies) or in a specified region of a chromosome. The information displayed for each gene or each SNP is fully annotated and linked to the leading bioinformatics web sites. The first release of SNPper is available on the web, and has received positive feedback from the genetic and bioinformatics community.