SUMMARY: To increase compatibility between different generations of Affymetrix GeneChip arrays, we propose a method of filtering probes based on their sequences. Our method is implemented as a web-based service for downloading necessary materials for converting the raw data files (*.CEL) for comparative analysis. The user can specify the appropriate level of filtering by setting the criteria for the minimum overlap length between probe sequences and the minimum number of usable probe pairs per probe set. Our website supports a within-species comparison for human and mouse GeneChip arrays. AVAILABILITY: http://www.crosschip.org CONTACT: firstname.lastname@example.org.
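The two filtering criteria described above (minimum overlap length between probe sequences and minimum number of usable probes per probe set) can be illustrated with a minimal sketch. It assumes that "overlap" is measured as the longest contiguous run of bases shared by two probe sequences; the function names and the toy sequences are hypothetical and are not taken from the web service itself.

```python
from typing import List


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character of both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def filter_probes(old_set: List[str], new_set: List[str],
                  min_overlap: int, min_pairs: int) -> List[str]:
    """Keep probes of old_set that overlap some probe of new_set by at least
    min_overlap bases; return [] if fewer than min_pairs probes survive."""
    kept = [
        p for p in old_set
        if any(longest_common_substring(p, q) >= min_overlap for q in new_set)
    ]
    return kept if len(kept) >= min_pairs else []


# Hypothetical toy probe sets (real GeneChip probes are 25-mers).
old_probes = ["ACGTACGTAC", "TTTTTTTTTT"]
new_probes = ["GTACGTACGG"]
usable = filter_probes(old_probes, new_probes, min_overlap=8, min_pairs=1)
```

Raising `min_overlap` or `min_pairs` makes the filter stricter, which trades the number of comparable probe sets for closer sequence agreement between array generations.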
BACKGROUND: Recent advances in genome sequencing suggest a remarkable conservation in gene content of mammalian organisms. The similarity in gene repertoire present in different organisms has increased interest in studying regulatory mechanisms of gene expression aimed at elucidating the differences in phenotypes. In particular, a proximal promoter region contains a large number of regulatory elements that control the expression of its downstream gene. Although many studies have focused on identification of these elements, a broader picture on the complexity of transcriptional regulation of different biological processes has not been addressed in mammals. The regulatory complexity may strongly correlate with gene function, as different evolutionary forces must act on the regulatory systems under different biological conditions. We investigate this hypothesis by comparing the conservation of promoters upstream of genes classified in different functional categories. RESULTS: By conducting a rank correlation analysis between functional annotation and upstream sequence alignment scores obtained by human-mouse and human-dog comparison, we found a significantly greater conservation of the upstream sequence of genes involved in development, cell communication, neural functions and signaling processes than those involved in more basic processes shared with unicellular organisms such as metabolism and ribosomal function. This observation persists after controlling for G+C content. Considering conservation as a functional signature, we hypothesize a higher density of cis-regulatory elements upstream of genes participating in complex and adaptive processes. CONCLUSION: We identified a class of functions that are associated with either high or low promoter conservation in mammals. 
We detected a significant tendency for complex and adaptive processes to be associated with higher promoter conservation, despite the fact that they emerged relatively recently during evolution. We described and contrasted several hypotheses that provide deeper insight into how transcriptional complexity might have emerged during evolution.
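The rank correlation between functional annotation and upstream alignment score mentioned above can be sketched as a Spearman correlation with average ranks for ties. This is an illustration only: the abstract does not state which rank statistic was used, and the encoding of annotation as a 0/1 membership flag and the toy scores are assumptions.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of values; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: membership in one functional category (0/1) and
# upstream alignment scores for four genes.
in_category = [0, 0, 1, 1]
alignment_score = [0.1, 0.2, 0.8, 0.9]
rho = spearman(in_category, alignment_score)
```

A positive `rho` for a category would indicate that its member genes tend to have more conserved upstream sequence than non-members.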
The phenotypic differences among Duchenne muscular dystrophy patients, mdx mice, and mdx(5cv) mice suggest that despite the common etiology of dystrophin deficiency, secondary mechanisms have a substantial influence on phenotypic severity. The differential response of various skeletal muscles to dystrophin deficiency supports this hypothesis. To explore these differences, gene expression profiles were generated from duplicate RNA targets extracted from six different skeletal muscles (diaphragm, soleus, gastrocnemius, quadriceps, tibialis anterior, and extensor digitorum longus) from wild-type, mdx, and mdx(5cv) mice, resulting in 36 data sets for 18 muscle samples. The data sets were compared in three different ways: (1) among wild-type samples only, (2) among all 36 data sets, and (3) between strains for each muscle type. The molecular profiles of soleus and diaphragm separate significantly from the other four muscle types and from each other. Fiber-type proportions can explain some of these differences. These variations in wild-type gene expression profiles may also reflect biomechanical differences known to exist among skeletal muscles. Further exploration of the genes that most distinguish these muscles may help explain the origins of the biomechanical differences and the reasons why some muscles are more resistant than others to dystrophin deficiency.
BACKGROUND: Comparison of data produced on different microarray platforms often shows surprising discordance. It is not clear whether this discrepancy is caused by noisy data or by improper probe matching between platforms. We investigated whether the significant level of inconsistency between results produced by alternative gene expression microarray platforms could be reduced by stringent sequence matching of microarray probes. We mapped the short oligo probes of the Affymetrix platform onto cDNA clones of the Stanford microarray platform. Affymetrix probes were reassigned to redefined probe sets if they mapped to the same cDNA clone sequence, regardless of the original manufacturer-defined grouping. The NCI-60 gene expression profiles produced by Affymetrix HuFL platform were recalculated using these redefined probe sets and compared to previously published cDNA measurements of the same panel of RNA samples. RESULTS: The redefined probe sets displayed a significantly higher level of cross-platform consistency at the level of gene correlation, cell line correlation and unsupervised hierarchical clustering. The same strategy allowed an almost complete correspondence of breast cancer subtype classification between Affymetrix gene chip and cDNA microarray derived gene expression data, and gave an increased level of similarity between normal lung derived gene expression profiles using the two technologies. In total, two Affymetrix gene-chip platforms were remapped to three cDNA platforms in the various cross-platform analyses, resulting in improved concordance in each case. CONCLUSION: We have shown that probes which target overlapping transcript sequence regions on cDNA microarrays and Affymetrix gene-chips exhibit a greater level of concordance than the corresponding Unigene or sequence matched features. This method will be useful for the integrated analysis of gene expression data generated by multiple disparate measurement platforms.
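The probe reassignment step described above (grouping Affymetrix probes by the cDNA clone they map to, regardless of the manufacturer-defined probe set) can be sketched as follows. The sequence-mapping step itself is assumed to have been done already; the probe and clone identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def redefine_probe_sets(probe_to_clone: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Group probe IDs by the cDNA clone they map to.

    Probes that map to no clone (value None) are dropped.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for probe_id, clone_id in probe_to_clone.items():
        if clone_id is not None:
            groups[clone_id].append(probe_id)
    # Sort for a deterministic result independent of input order.
    return {clone: sorted(probes) for clone, probes in groups.items()}


# Hypothetical mapping produced by an upstream sequence-matching step.
mapping = {"p1": "cloneA", "p2": "cloneB", "p3": "cloneA", "p4": None}
redefined = redefine_probe_sets(mapping)
```

Expression values would then be recalculated per redefined probe set, so that each value is directly comparable with the measurement from the matching cDNA clone.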
Widespread availability of geographic information systems (GIS) software has facilitated the use of health mapping in both academia and government. Maps that display patients as points are often exchanged in public forums (journals, meetings, web). However, even these low-resolution maps may reveal confidential patient location information. In this report, we describe a method to test whether privacy is being breached. We reverse geocode from maps with cases and describe the accuracy with which patient addresses can be extracted.
BACKGROUND: Diabetes mellitus is an independent risk factor for early postoperative mortality and complications after coronary artery bypass grafting (CABG). We sought to compare the cardiac gene expression responses to cardiopulmonary bypass (CPB) and cardioplegic arrest (C) in patients with and without diabetes. METHODS AND RESULTS: Twenty atrial myocardium samples were harvested from 5 type II insulin-dependent diabetic and 5 matched nondiabetic patients undergoing CABG, before and after CPB/C. Oligonucleotide microarray analyses of 12625 genes were performed on the 10 sample pairs using matched pre-CPB tissues as controls. Array results were validated with Northern blotting and immunoblotting. Compared with pre-CPB/C, post-CPB/C myocardial tissues revealed 851 upregulated and 480 downregulated genes with a threshold P ≤ 0.025 (signal-to-noise ratio, 4.04) in the diabetic group, compared with 480 upregulated and 626 downregulated genes (signal-to-noise ratio, 3.04) in the nondiabetic group (P<0.001). There were 18 genes that were upregulated >4-fold in diabetic and nondiabetic patients (including inflammatory/transcription activators FOS, CYR 61, and IL-6, apoptotic gene NR4A1, stress gene DUSP1, and glucose-transporter gene SLC2A3). However, 28 genes showed such marked upregulation in the diabetic group exclusively (including inflammatory/transcription activators MYC, IL8, IL-1beta, growth factor vascular endothelial growth factor, amphiregulin, and glucose metabolism-involved gene insulin receptor substrate 1), and 27 genes in the nondiabetic group only, including glycogen-binding subunit PPP1R3C. CONCLUSIONS: Gene expression profile after CPB/C is quantitatively and qualitatively different in patients with diabetes. These results have important implications for the design of tailored myocardial protection and operative strategies for diabetic patients undergoing CPB/C.
Skeletal muscle differentiation is a complex, highly coordinated process that relies on precise temporal gene expression patterns. To better understand this cascade of transcriptional events, we used expression profiling to analyze gene expression in a 12-day time course of differentiating C2C12 myoblasts. Cluster analysis specific for time-ordered microarray experiments classified 2895 genes and ESTs with variable expression levels between proliferating and differentiating cells into 22 clusters with distinct expression patterns during myogenesis. Expression patterns for several known and novel genes were independently confirmed by real-time quantitative RT-PCR and/or Western blotting and immunofluorescence. MyoD and MEF family members exhibited unique expression kinetics that were highly coordinated with cell-cycle withdrawal regulators. Among genes with peak expression levels during cell cycle withdrawal were Vcam1, Itgb3, Itga5, Vcl, as well as Ptger4, a gene not previously associated with the process of myogenesis. One interesting uncharacterized transcript that is highly induced during myogenesis encodes several immunoglobulin repeats with sequence similarity to titin, a large sarcomeric protein. These data sets identify many additional uncharacterized transcripts that may play important functions in muscle cell proliferation and differentiation and provide a baseline for comparison with C2C12 cells expressing various mutant genes involved in myopathic disorders.
Nemaline myopathy (NM) is a slowly progressive or nonprogressive neuromuscular disorder caused by mutations in genes encoding skeletal muscle sarcomeric thin filament proteins. It is characterized by great heterogeneity at the clinical, histopathological, and genetic level. Although multiple molecular pathways are commonly affected in all NM patients, little is known about the molecular characteristics of muscles from patients in different NM subgroups. We have analyzed a group of global gene expression data sets for transcriptional patterns characteristic of particular nemaline myopathy classes. Differential expression between disease subgroups was primarily seen in mitochondrial-, structural-, and transcription-related genes. Multiple lines of evidence support the hypothesis that muscles from cases with "nontyping" NM, although clinically classified as typical NM, share a unique pathophysiological state and are characterized by distinct patterns of gene expression. Determination of the specific molecular differences in NM subgroups may eventually lead to improved prognostic determinations and treatment of these patients.
BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are an increasingly important tool for genetic and biomedical research. Although current genomic databases contain information on several million SNPs and are growing at a very fast rate, the true value of a SNP in this context is a function of the quality of the annotations that characterize it. Retrieving and analyzing such data for a large number of SNPs often represents a major bottleneck in the design of large-scale association studies. DESCRIPTION: SNPper is a web-based application designed to facilitate the retrieval and use of human SNPs for high-throughput research purposes. It provides a rich local database generated by combining SNP data with the Human Genome sequence and with several other data sources, and offers the user a variety of querying, visualization and data export tools. In this paper we describe the structure and organization of the SNPper database, we review the available data export and visualization options, and we describe how the architecture of SNPper and its specialized data structures support high-volume SNP analysis. CONCLUSIONS: The rich annotation database and the powerful data manipulation and presentation facilities it offers make SNPper a very useful online resource for SNP research. Its success proves the great need for integrated and interoperable resources in the field of computational biology, and shows how such systems may play a critical role in supporting the large-scale computational analysis of our genome.
The authors report on the development and evaluation of a novel patient-centered technology that promotes capture of critical information necessary to drive guideline-based care for pediatric asthma. The design of this application, the asthma kiosk, addresses five critical issues for patient-centered technology that promotes guideline-based care: (1) a front-end mechanism for patient-driven data capture, (2) neutrality regarding patients' medical expertise and technical backgrounds, (3) granular capture of medication data directly from the patient, (4) formal algorithms linking patient-level semantics and asthma guidelines, and (5) output to both patients and clinical providers regarding best practice. The formative evaluation of the asthma kiosk demonstrates its ability to capture patient-specific data during real-time care in the emergency department (ED) with a mean completion time of 11 minutes. The asthma kiosk successfully links parents' data to guideline recommendations and identifies data critical to health improvements for asthmatic children that otherwise remains undocumented during ED-based care.
Identification of common mechanisms underlying organ development and primary tumor formation should yield new insights into tumor biology and facilitate the generation of relevant cancer models. We have developed a novel method to project the gene expression profiles of medulloblastomas (MBs)--human cerebellar tumors--onto a mouse cerebellar development sequence: postnatal days 1-60 (P1-P60). Genomically, human medulloblastomas were closest to mouse P1-P10 cerebella, and normal human cerebella were closest to mouse P30-P60 cerebella. Furthermore, metastatic MBs were highly associated with mouse P5 cerebella, suggesting that a clinically distinct subset of tumors is identifiable by molecular similarity to a precise developmental stage. Genewise, down- and up-regulated MB genes segregate to late and early stages of development, respectively. Comparable results for human lung cancer vis-a-vis the developing mouse lung suggest the generalizability of this multiscalar developmental perspective on tumor biology. Our findings indicate both a recapitulation of tissue-specific developmental programs in diverse solid tumors and the utility of tumor characterization on the developmental time axis for identifying novel aspects of clinical and biological behavior.
The ageing of the human brain is a cause of cognitive decline in the elderly and the major risk factor for Alzheimer's disease. The time in life when brain ageing begins is undefined. Here we show that transcriptional profiling of the human frontal cortex from individuals ranging from 26 to 106 years of age defines a set of genes with reduced expression after age 40. These genes play central roles in synaptic plasticity, vesicular transport and mitochondrial function. This is followed by induction of stress response, antioxidant and DNA repair genes. DNA damage is markedly increased in the promoters of genes with reduced expression in the aged cortex. Moreover, these gene promoters are selectively damaged by oxidative stress in cultured human neurons, and show reduced base-excision DNA repair. Thus, DNA damage may reduce the expression of selectively vulnerable genes involved in learning, memory and neuronal survival, initiating a programme of brain ageing that starts early in adult life.
Although much has been learned about basic mechanisms of cell invasion, the genes whose expression is required for this process by malignant cell lines have remained obscure. We assessed invasion through Matrigel using EGF as a chemoattractant and gene expression profiles using oligonucleotide microarrays for 22 non-small cell lung cancer cell lines. The expression of 22 genes was significantly correlated (p < 0.001) with the measured invasion index. Cluster analysis demonstrated that gene expression profiles classify the cell lines into low and high invasive subgroups. Considering invasiveness as a dichotomous variable, Bayesian analysis was used to identify genes that have the highest probability of being differentially expressed between the high and low invasion groups. This analysis identified 16 genes whose expression was associated with invasiveness. "Leave one out" cross validation was 91% accurate. Nine genes were identified in both correlation and Bayesian analyses. Seven of the nine genes were negatively associated with invasion, and four of those genes encode plasma membrane proteins. The two genes with the highest inverse association with invasion, TACSTD1 and CLDN3, are involved with cell adhesion and cell-cell interactions, respectively. Interestingly, the gene with the highest positive association with invasion, SERPINE1 (PAI-1), is a protease inhibitor. These and the other genes identified by both analyses represent targets for further study to assess their importance in non-small cell lung cancer invasion and metastasis.
Microarrays have been extensively used to investigate genome-wide expression patterns. Although this technology has been tremendously successful, it has suffered from suboptimal individual measurement precision. Significant improvements in this respect have been recently made. In an effort to further explore the underlying variability, we have attempted to globally assess the accuracy of individual probe sequences used to query gene expression. For mammalian Affymetrix microarrays, we identify an unexpectedly large number of probes (greater than 19% of the probes on each platform) that do not correspond to their appropriate mRNA reference sequence (RefSeq). Compared with data derived from inaccurate probes, we find that data derived from sequence-verified probes show 1) increased precision in technical replicates, 2) increased accuracy translating data from one generation microarray to another, 3) increased accuracy translating data from oligonucleotide to cDNA microarrays, and 4) improved capture of biological information in human clinical specimens. The logical conclusion of this work is that probes containing the most reliable sequence information provide the most accurate results. Our data reveal that the identification and removal of inaccurate probes can significantly improve this technology.
BACKGROUND: There is increasing evidence that gene order within the eukaryotic genome is not random. In yeast and worm, adjacent or neighboring genes tend to be co-expressed. Clustering of co-expressed genes has been found in humans, worms and fruit flies. However, in mice and rats, an effect of chromosomal distance (CD) on co-expression has not yet been investigated. Also, no cross-species comparison has been made so far. We analyzed the effect of CD as well as normalized distance (ND) using expression data in six eukaryotic species: yeast, fruit fly, worm, rat, mouse and human. RESULTS: We analyzed 24 sets of expression data from the six species. Highly co-expressed pairs were sorted into bins of equal-sized intervals of CD, and a co-expression rate (CoER) in each bin was calculated. In all datasets, a higher CoER was obtained in a short CD range than in a long CD range. These results show that across all studied species, there was a consistent effect of CD on co-expression. However, the results using the ND show more diversity. Intra- and inter-species comparisons of CoER reveal that there are significant differences in the co-expression rates of neighboring genes among the species. A pairwise BLAST analysis finds that 8-30% of the highly co-expressed pairs are duplicated genes. CONCLUSION: We confirmed that in the six eukaryotic species, there was a consistent tendency for neighboring genes to be co-expressed. Results of pairwise BLAST indicate a significant effect of non-duplicated pairs on co-expression. A comparison of CD and ND suggests the dominant effect of CD.
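The binning procedure described above can be sketched as follows. The abstract does not give the exact definition of the co-expression rate, so this sketch assumes that CoER in a bin is the fraction of all gene pairs in that distance bin that are highly co-expressed; the function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def coexpression_rate(pairs: Iterable[Tuple[float, bool]],
                      bin_width: float) -> Dict[int, float]:
    """Co-expression rate per distance bin.

    pairs: (chromosomal_distance, is_highly_coexpressed) for each gene pair.
    Bin k covers distances in [k * bin_width, (k + 1) * bin_width).
    """
    total: Counter = Counter()
    hits: Counter = Counter()
    for distance, is_coexpressed in pairs:
        b = int(distance // bin_width)
        total[b] += 1
        if is_coexpressed:
            hits[b] += 1
    return {b: hits[b] / total[b] for b in sorted(total)}


# Hypothetical gene pairs: distance in kb and a co-expression flag.
toy_pairs = [(10, True), (50, True), (90, False),
             (120, False), (180, True), (199, False),
             (250, False)]
rates = coexpression_rate(toy_pairs, bin_width=100)
```

A rate that falls as the bin index grows, as in the toy data, is the pattern the abstract reports for chromosomal distance.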
OBJECTIVE: To compare hospital outcome prediction using an artificial neural network model, built on an Indian data set, with the APACHE II (Acute Physiology and Chronic Health Evaluation II) logistic regression model. DESIGN: Analysis of a database containing prospectively collected data. SETTING: Medical-neurological ICU of a university hospital in Mumbai, India. SUBJECTS: Two thousand sixty-two consecutive admissions between 1996 and 1998. INTERVENTIONS: None. MEASUREMENTS AND RESULTS: The 22 variables used to obtain day-1 APACHE II score and risk of death were recorded. Data from 1,062 patients were used to train the neural network using a back-propagation algorithm. Data from the remaining 1,000 patients were used for testing this model and comparing it with APACHE II. There were 337 deaths in these 1,000 patients; APACHE II predicted 246 deaths while the neural network predicted 336 deaths. Calibration, assessed by the Hosmer-Lemeshow statistic, was better with the neural network (H=22.4) than with APACHE II (H=123.5) and so was discrimination (area under receiver operating characteristic curve =0.87 versus 0.77, p=0.002). Analysis of information gain due to each of the 22 variables revealed that the neural network could predict outcome using only 15 variables. A new model using these 15 variables predicted 335 deaths, had calibration (H=27.7) and discrimination (area under receiver operating characteristic curve =0.88) which was comparable to the 22-variable model (p=0.87) and superior to the APACHE II equation (p<0.001). CONCLUSION: Artificial neural networks, trained on Indian patient data, used fewer variables and yet outperformed the APACHE II system in predicting hospital outcome.
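The calibration measure cited above can be illustrated with a minimal sketch of the Hosmer-Lemeshow H statistic. It assumes the common convention of ten fixed-width intervals of predicted risk; the abstract does not state the grouping that was actually used, and the function name and toy data are hypothetical.

```python
from typing import Sequence


def hosmer_lemeshow_h(risks: Sequence[float], outcomes: Sequence[int],
                      n_bins: int = 10) -> float:
    """Hosmer-Lemeshow H: sum over risk bins of (O - E)^2 / (n * p * (1 - p)).

    risks: predicted probabilities of death in [0, 1].
    outcomes: 1 for death, 0 for survival.
    O and E are observed and expected deaths in a bin, p is the mean predicted risk.
    """
    # Per bin: [patient count, expected deaths, observed deaths].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(risks, outcomes):
        g = min(int(p * n_bins), n_bins - 1)  # a risk of exactly 1.0 goes in the last bin
        bins[g][0] += 1
        bins[g][1] += p
        bins[g][2] += y
    h = 0.0
    for n, expected, observed in bins:
        if n == 0:
            continue
        mean_risk = expected / n
        variance = n * mean_risk * (1 - mean_risk)
        if variance > 0:  # skip bins where predicted risk is exactly 0 or 1
            h += (observed - expected) ** 2 / variance
    return h


# Hypothetical example: predictions that match observed death rates in each bin.
calibrated = hosmer_lemeshow_h([0.25] * 4 + [0.75] * 4,
                               [1, 0, 0, 0, 1, 1, 1, 0])
```

Smaller values of H indicate closer agreement between predicted and observed deaths, which is why the lower H reported for the neural network is read as better calibration.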
A generic query engine for a distributed system of databases is presented. The architecture allows institutions to share data with a community of researchers while remaining in control of the data. The present system was developed as an implementation of the SPIN (Shared Pathology Informatics Network) project and concerns itself with the sharing of Surgical Pathology and Specimen information; however, the same distributed architecture and codebase can be adapted with little effort to any other type of data.
Cancer-derived microarray data sets are routinely produced by various platforms that are either commercially available or manufactured by academic groups. The fundamental difference in their probe selection strategies holds the promise that identical observations produced by more than one platform prove to be more robust when validated by biology. However, cross-platform comparison requires matching corresponding probe sets. We introduce here sequence-based matching of probes instead of gene identifier-based matching. We analyzed breast cancer cell line-derived RNA aliquots using Agilent cDNA and Affymetrix oligonucleotide microarray platforms to assess the advantage of this method. We show that at different levels of the analysis, including gene expression ratios and difference calls, cross-platform consistency is significantly improved by sequence-based matching. We also present evidence that sequence-based probe matching produces more consistent results when comparing similar biological data sets obtained by different microarray platforms. This strategy allowed a more efficient transfer of classification of breast cancer samples between data sets produced by cDNA microarray and Affymetrix gene-chip platforms.
This paper describes the Shared Pathology Informatics Network (SPIN) submission model for uploading de-identified XML annotations of pathology case and specimen information to a distributed peer-to-peer network architecture. SPIN use cases, architecture, and technologies, as well as pathology information design, are described. With the architecture currently in use by six member institutions, SPIN appears to be a viable, secure methodology to submit pathology information for query and specimen retrieval by investigators.