Alterovitz G, Aivado M, Spentzos D, Libermann T, Ramoni M, Kohane I. Analysis and robot pipelined automation for SELDI-TOF mass spectrometry. Conf Proc IEEE Eng Med Biol Soc. 2004;4:3068-71.
Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI or SELDI-TOF MS) with protein arrays has facilitated the discovery of disease-specific protein profiles in serum. As array technologies in bioinformatics and proteomics multiply the quantity of data being generated, more automated hardware and computational methods will become necessary in order to keep up. Robot automated sample preparation and analysis pipeline for proteomics (Raspap) in SELDI provides a solution from the lab bench to the desktop. In this approach, the entire processing of protein arrays is delegated to a robotics system and the bioinformatics automated pipeline (BAP) performs data mining after SELDI analysis. A key part of BAP is the creation of a journal-styled report in HTML (with text, embedded figures, and references) which can be automatically emailed back to the engineers/scientists for review. An object-oriented tree-based structure allows for the derivation of conclusions about the data and comparison of multiple analyses within the generated report. Testing yielded improvement in the resulting assay coefficients of variation (CV) from 45.1% (when done manually) to 27.8% (P<0.001). A large biological dataset was also examined with the Raspap approach and consequent results are discussed.
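The coefficient of variation quoted in this abstract is the standard deviation divided by the mean, expressed as a percentage. The following is a minimal sketch of that calculation; the function name and the replicate intensities are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample CV in percent: 100 * standard deviation / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


# Hypothetical replicate peak intensities (illustrative only, not data from the paper).
replicates = [90.0, 100.0, 110.0]
cv_percent = coefficient_of_variation(replicates)
```

A lower CV across replicate spots means the assay is more reproducible, which is the sense in which the drop from 45.1% to 27.8% is reported as an improvement.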
Cai Z, Tsung EF, Marinescu VD, Ramoni MF, Riva A, Kohane IS. Bayesian approach to discovering pathogenic SNPs in conserved protein domains. Hum Mutat. 2004;24:178-84.
The success rate of association studies can be improved by selecting better genetic markers for genotyping or by providing better leads for identifying pathogenic single nucleotide polymorphisms (SNPs) in the regions of linkage disequilibrium with positive disease associations. We have developed a novel algorithm to predict pathogenic single amino acid changes, either nonsynonymous SNPs (nsSNPs) or missense mutations, in conserved protein domains. Using a Bayesian framework, we found that the probability of a microbial missense mutation causing a significant change in phenotype depended on how much difference it made in several phylogenetic, biochemical, and structural features related to the single amino acid substitution. We tested our model on pathogenic allelic variants (missense mutations or nsSNPs) included in OMIM, and on the other nsSNPs in the same genes (from dbSNP) as the nonpathogenic variants. As a result, our model predicted pathogenic variants with a 10% false-positive rate. The high specificity of our prediction algorithm should make it valuable in genetic association studies aimed at identifying pathogenic SNPs.
Adam RM, Eaton SH, Estrada C, Nimgaonkar A, Shih SC, Smith LE, Kohane IS, Bagli D, Freeman MR. Mechanical stretch is a highly selective regulator of gene expression in human bladder smooth muscle cells. Physiol Genomics. 2004;20:36-44.
Application of mechanical stimuli has been shown to alter gene expression in bladder smooth muscle cells (SMC). To date, only a limited number of "stretch-responsive" genes in this cell type have been reported. We employed oligonucleotide arrays to identify stretch-sensitive genes in primary culture human bladder SMC subjected to repetitive mechanical stimulation for 4 h. Differential gene expression between stretched and nonstretched cells was assessed using Significance Analysis of Microarrays (SAM). Expression of 20 out of 11,731 expressed genes (approximately 0.17%) was altered >2-fold following stretch, with 19 genes induced and one gene (FGF-9) repressed. Using real-time RT-PCR, we tested independently the responsiveness of 15 genes to stretch and to platelet-derived growth factor-BB (PDGF-BB), another hypertrophic stimulus for bladder SMC. In response to both stimuli, expression of 13 genes increased, 1 gene (FGF-9) decreased, and 1 gene was unchanged. Six transcripts (HB-EGF, BMP-2, COX-2, LIF, PAR-2, and FGF-9) were evaluated using an ex vivo rat model of bladder distension. HB-EGF, BMP-2, COX-2, LIF, and PAR-2 increased with bladder stretch ex vivo, whereas FGF-9 decreased, consistent with expression changes observed in vitro. In silico analysis of microarray data using the FIRED algorithm identified c-jun, AP-1, ATF-2, and neurofibromin-1 (NF-1) as potential transcriptional mediators of stretch signals. Furthermore, the promoters of 9 of 13 stretch-responsive genes contained AP-1 binding sites. These observations identify stretch as a highly selective regulator of gene expression in bladder SMC. Moreover, they suggest that mechanical and growth factor signals converge on common transcriptional regulators that include members of the AP-1 family.
Allocco DJ, Kohane IS, Butte AJ. Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics. 2004;5:18.
BACKGROUND: It is thought that genes with similar patterns of mRNA expression and genes with similar functions are likely to be regulated via the same mechanisms. It has been difficult to quantitatively test these hypotheses on a large scale because there has been no general way of determining whether genes share a common regulatory mechanism. Here we use data from a recent genome wide binding analysis in combination with mRNA expression data and existing functional annotations to quantify the likelihood that genes with varying degrees of similarity in mRNA expression profile or function will be bound by a common transcription factor. RESULTS: Genes with strongly correlated mRNA expression profiles are more likely to have their promoter regions bound by a common transcription factor. This effect is present only at relatively high levels of expression similarity. In order for two genes to have a greater than 50% chance of sharing a common transcription factor binder, the correlation between their expression profiles (across the 611 microarrays used in our study) must be greater than 0.84. Genes with similar functional annotations are also more likely to be bound by a common transcription factor. Combining mRNA expression data with functional annotation results in a better predictive model than using either data source alone. CONCLUSIONS: We demonstrate how mRNA expression data and functional annotations can be used together to estimate the probability that genes share a common regulatory mechanism. Existing microarray data and known functional annotations are sufficient to identify only a relatively small percentage of co-regulated genes.
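The 0.84 cutoff in this abstract is a correlation between two genes' expression profiles. As a rough illustration, the sketch below computes a Pearson correlation and compares it with that cutoff. The choice of Pearson correlation, the function name, and the five-point profiles are assumptions made for illustration; the abstract does not specify the correlation measure, and the probability model that links correlation to shared transcription factor binding is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


# Cutoff quoted in the abstract for a >50% chance of a shared binder.
THRESHOLD = 0.84

# Hypothetical expression profiles (the study used 611 arrays; these are toy values).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.1, 9.8]
gene_c = [5.0, 1.0, 4.0, 2.0, 3.0]

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
above_cutoff = {"a-b": r_ab > THRESHOLD, "a-c": r_ac > THRESHOLD}
```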
Friedman CP, Altman RB, Kohane IS, McCormick KA, Miller PL, Ozbolt JG, Shortliffe EH, Stormo GD, Szczepaniak MC, Tuck D, et al. Training the next generation of informaticians: the impact of "BISTI" and bioinformatics--a report from the American College of Medical Informatics. J Am Med Inform Assoc. 2004;11:167-72.
In 2002-2003, the American College of Medical Informatics (ACMI) undertook a study of the future of informatics training. This project capitalized on the rapidly expanding interest in the role of computation in basic biological research, well characterized in the National Institutes of Health (NIH) Biomedical Information Science and Technology Initiative (BISTI) report. The defining activity of the project was the three-day 2002 Annual Symposium of the College. A committee, comprised of the authors of this report, subsequently carried out activities, including interviews with a broader informatics and biological sciences constituency, collation and categorization of observations, and generation of recommendations. The committee viewed biomedical informatics as an interdisciplinary field, combining basic informational and computational sciences with application domains, including health care, biological research, and education. Consequently, effective training in informatics, viewed from a national perspective, should encompass four key elements: (1) curricula that integrate experiences in the computational sciences and application domains rather than just concatenating them; (2) diversity among trainees, with individualized, interdisciplinary cross-training allowing each trainee to develop key competencies that he or she does not initially possess; (3) direct immersion in research and development activities; and (4) exposure across the wide range of basic informational and computational sciences. Informatics training programs that implement these features, irrespective of their funding sources, will meet and exceed the challenges raised by the BISTI report, and optimally prepare their trainees for careers in a field that continues to evolve.
Sanoudou D, Haslett JN, Kho AT, Guo S, Gazda HT, Greenberg SA, Lidov HG, Kohane IS, Kunkel LM, Beggs AH. Expression profiling reveals altered satellite cell numbers and glycolytic enzyme transcription in nemaline myopathy muscle. Proc Natl Acad Sci U S A. 2003;100:4666-71.
The nemaline myopathies (NMs) are a clinically and genetically heterogeneous group of disorders characterized by nemaline rods and skeletal muscle weakness. Mutations in five sarcomeric thin filament genes have been identified. However, the molecular consequences of these mutations are unknown. Using Affymetrix oligonucleotide microarrays, we have analyzed the expression patterns of >21,000 genes and expressed sequence tags in skeletal muscles of 12 NM patients and 21 controls. Multiple complementary approaches were used for data analysis, including geometric fold analysis, two-tailed unequal variance t test, hierarchical clustering, relevance network, and nearest-neighbor analysis. We report the identification of high satellite cell populations in NM and the significant down-regulation of transcripts for key enzymes of glucose and glycogen metabolism as well as a possible regulator of fatty acid metabolism, UCP3. Interestingly, transcript level changes of multiple genes suggest possible changes in Ca(2+) homeostasis. The increased expression of multiple structural proteins was consistent with increased fibrosis. This comprehensive study of downstream molecular consequences of NM gene mutations provides insights in the cellular events leading to the NM phenotype.
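One of the methods listed in this abstract is the unequal variance (Welch) t test. The sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for one gene. The function name and the expression values are hypothetical, and the two-tailed p-value lookup is left out because it needs a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    # Per-group squared standard errors; variances are NOT pooled.
    sa = variance(sample_a) / na
    sb = variance(sample_b) / nb
    if sa + sb == 0:
        raise ValueError("t is undefined when both groups have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df


# Hypothetical expression values for one gene (not data from the study).
patients = [1.0, 2.0, 3.0]
controls = [2.0, 4.0, 6.0]
t_stat, dof = welch_t(patients, controls)
```

Unlike the pooled-variance t test, this form does not assume the patient and control groups share a variance, which suits groups of unequal size such as 12 patients and 21 controls.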
Ruel M, Bianchi C, Khan TA, Xu S, Liddicoat JR, Voisine P, Araujo E, Lyon H, Kohane IS, Libermann TA, et al. Gene expression profile after cardiopulmonary bypass and cardioplegic arrest. J Thorac Cardiovasc Surg. 2003;126:1521-30.
OBJECTIVE: This study examines the cardiac and peripheral gene expression responses to cardiopulmonary bypass and cardioplegic arrest. METHODS: Atrial myocardium and skeletal muscle were harvested from 16 patients who underwent coronary artery bypass grafting before and after cardiopulmonary bypass and cardioplegic arrest. Ten sample pairs were selected for patient similarity, and oligonucleotide microarray analyses of 12,625 genes were performed using matched precardiopulmonary bypass tissues as controls. Array results were validated with Northern blotting, real-time polymerase chain reaction, in situ hybridization, and immunoblotting. Statistical analyses were nonparametric. RESULTS: Median durations of cardiopulmonary bypass and cardioplegic arrest were 74 and 60 minutes, respectively. Compared with precardiopulmonary bypass, postcardiopulmonary bypass myocardial tissues revealed 480 up-regulated and 626 down-regulated genes with a threshold P value of .025 or less (signal-to-noise ratio: 3.46); skeletal muscle tissues showed 560 and 348 such genes, respectively (signal-to-noise ratio: 3.04). Up-regulated genes in cardiac tissues included inflammatory and transcription activators FOS; jun B proto-oncogene; nuclear receptor subfamily 4, group A, member 3; MYC; transcription factor-8; endothelial leukocyte adhesion molecule-1; and cysteine-rich 61; apoptotic genes nuclear receptor subfamily 4, group A, member 1 and cyclin-dependent kinase inhibitor 1A; and stress genes dual-specificity phosphatase-1, dual-specificity phosphatase-5, and B-cell translocation gene 2. Up-regulated skeletal muscle genes included interleukin 6; interleukin 8; tumor necrosis factor receptor superfamily, member 11B; nuclear receptor subfamily 4, group A, member 3; transcription factor-8; interleukin 13; jun B proto-oncogene; interleukin 1B; glycoprotein Ib, platelet, alpha polypeptide; and Ras-associated protein RAB27A. Down-regulated genes included haptoglobin and numerous immunoglobulins in the heart, and factor H-related gene 2, protein phosphatase 1, regulatory subunit 3A, and growth differentiation factor-8 in skeletal muscle. CONCLUSIONS: By establishing a profile of the gene-expression responses to cardiopulmonary bypass and cardioplegia, this study allows a better understanding of their effects and provides a framework for the evaluation of new cardiac surgical modalities directly at the genome level.
Saluja SK, Kohane I. Localization and characterization of mouse-human alignments within the human genome. Does evolutionary conservation suggest functional importance? AMIA Annu Symp Proc. 2003:994.
In an attempt to validate the use of evolutionary conservation as a method to identify putative regulatory elements, we have quantified the frequency of Single Nucleotide Polymorphisms (SNPs) within the most tightly conserved regions across the entire Human Genome. Our results show that conserved non-coding sequences have a significantly lower SNP frequency than their exonic counterparts, which suggests that these regions are functionally important.
Sebastiani P, Lazarus R, Weiss ST, Kunkel LM, Kohane IS, Ramoni MF. Minimal Haplotype Tagging. Proc Natl Acad Sci U S A. 2003;100:9900-5.
The high frequency of single-nucleotide polymorphisms (SNPs) in the human genome presents an unparalleled opportunity to track down the genetic basis of common diseases. At the same time, the sheer number of SNPs also makes unfeasible genome-wide disease association studies. The haplotypic nature of the human genome, however, lends itself to the selection of a parsimonious set of SNPs, called haplotype tagging SNPs (htSNPs), able to distinguish the haplotypic variations in a population. Current approaches rely on statistical analysis of transmission rates to identify htSNPs. In contrast to these approximate methods, this contribution describes an exact, analytical, and lossless method, called BEST (Best Enumeration of SNP Tags), able to identify the minimum set of SNPs tagging an arbitrary set of haplotypes from either pedigree or independent samples. Our results confirm that a small proportion of SNPs is sufficient to capture the haplotypic variations in a population and that this proportion decreases exponentially as the haplotype length increases. We used BEST to tag the haplotypes of 105 genes in an African-American and a European-American sample. An interesting finding of this analysis is that the vast majority (95%) of the htSNPs in the European-American sample is a subset of the htSNPs of the African-American sample. This result seems to provide further evidence that a severe bottleneck occurred during the founding of Europe and the conjectured "Out of Africa" event.
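The tagging problem described in this abstract can be stated compactly: find the smallest set of SNP positions that still tells every distinct haplotype apart. The sketch below solves it by exhaustive search over subsets of increasing size. This is an illustrative brute-force approach and not the BEST algorithm, whose details the abstract does not give; the function name and the four toy haplotypes are hypothetical.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return the smallest tuple of SNP positions that separates all distinct haplotypes.

    Brute force over subsets in order of increasing size, so the first hit is minimal.
    The cost grows exponentially with haplotype length; this is for small examples only.
    """
    distinct = sorted(set(haplotypes))
    if not distinct:
        return ()
    n_snps = len(distinct[0])
    if any(len(h) != n_snps for h in distinct):
        raise ValueError("all haplotypes must have the same length")
    for size in range(n_snps + 1):
        for positions in combinations(range(n_snps), size):
            # Project each haplotype onto the candidate positions.
            projected = {tuple(h[i] for i in positions) for h in distinct}
            if len(projected) == len(distinct):
                return positions
    return tuple(range(n_snps))  # not reached: the full set always separates


# Toy haplotypes over four SNPs; SNPs 0/1 and SNPs 2/3 are perfectly correlated.
toy_haplotypes = ["0000", "0011", "1100", "1111"]
tags = minimal_tag_snps(toy_haplotypes)
```

In the toy data, one SNP from each correlated pair is enough, so two of the four positions tag all four haplotypes.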
Porter SC, Fleisher GR, Kohane IS, Mandl KD. The value of parental report for diagnosis and management of dehydration in the emergency department. Ann Emerg Med. 2003;41:196-205.
STUDY OBJECTIVES: We define the predictive value of parents' computer-based report for history and physical signs of dehydration for a primary outcome of percentage of dehydration (fluid deficit) and 2 secondary outcomes: clinically important acidosis and hospital admission. We also sought to compare the reports of physical signs related to dehydration made by parents and nurses. METHODS: We performed a prospective observational trial in an urban pediatric emergency department. A convenience sample of parents completed a computer-based interview covering historical details and physical signs (ill appearance, sunken fontanelle, sunken eyes, decreased tears, dry mouth, cool extremities, and weak cry) related to dehydration. Nurses independently completed an assessment of physical signs for enrolled children. The primary outcome was the degree of dehydration (fluid deficit), which was defined as the percentage difference between initial ED weight and stable final weight after the illness. Secondary outcomes included clinically important acidosis (defined as a serum CO(2) value of
Patti ME, Butte AJ, Crunkhorn S, Cusi K, Berria R, Kashyap S, Miyazaki Y, Kohane I, Costello M, Saccone R, et al. Coordinated reduction of genes of oxidative metabolism in humans with insulin resistance and diabetes: Potential role of PGC1 and NRF1. Proc Natl Acad Sci U S A. 2003;100:8466-71.
Type 2 diabetes mellitus (DM) is characterized by insulin resistance and pancreatic beta cell dysfunction. In high-risk subjects, the earliest detectable abnormality is insulin resistance in skeletal muscle. Impaired insulin-mediated signaling, gene expression, glycogen synthesis, and accumulation of intramyocellular triglycerides have all been linked with insulin resistance, but no specific defect responsible for insulin resistance and DM has been identified in humans. To identify genes potentially important in the pathogenesis of DM, we analyzed gene expression in skeletal muscle from healthy metabolically characterized nondiabetic (family history negative and positive for DM) and diabetic Mexican-American subjects. We demonstrate that insulin resistance and DM associate with reduced expression of multiple nuclear respiratory factor-1 (NRF-1)-dependent genes encoding key enzymes in oxidative metabolism and mitochondrial function. Although NRF-1 expression is decreased only in diabetic subjects, expression of both PPAR gamma coactivator 1-alpha and -beta (PGC1-alpha/PPARGC1 and PGC1-beta/PERC), coactivators of NRF-1 and PPAR gamma-dependent transcription, is decreased in both diabetic subjects and family history-positive nondiabetic subjects. Decreased PGC1 expression may be responsible for decreased expression of NRF-dependent genes, leading to the metabolic disturbances characteristic of insulin resistance and DM.
Haslett JN, Sanoudou D, Kho AT, Han M, Bennett RR, Kohane IS, Beggs AH, Kunkel LM. Gene expression profiling of Duchenne muscular dystrophy skeletal muscle. Neurogenetics. 2003;4:163-71.
The primary cause of Duchenne muscular dystrophy (DMD) is a mutation in the dystrophin gene, leading to absence of the corresponding protein, disruption of the dystrophin-associated protein complex, and substantial changes in skeletal muscle pathology. Although the primary defect is known and the histological pathology well documented, the underlying molecular pathways remain in question. To clarify these pathways, we used expression microarrays to compare individual gene expression profiles for skeletal muscle biopsies from DMD patients and unaffected controls. We have previously published expression data for the 12,500 known genes and full-length expressed sequence tags (ESTs) on the Affymetrix HG-U95Av2 chips. Here we present comparative expression analysis of the 50,000 EST clusters represented on the remainder of the Affymetrix HG-U95 set. Individual expression profiles were generated for biopsies from 10 DMD patients and 10 unaffected control patients. Two methods of statistical analysis were used to interpret the resulting data (t-test analysis to determine the statistical significance of differential expression and geometric fold change analysis to determine the extent of differential expression). These analyses identified 183 probe sets (59 of which represent known genes) that differ significantly in expression level between unaffected and disease muscle. This study adds to our knowledge of the molecular pathways that are altered in the dystrophic state. In particular, it suggests that signaling pathways might be substantially involved in the disease process. It also highlights a large number of unknown genes whose expression is altered and whose identity therefore becomes important in understanding the pathogenesis of muscular dystrophy.
Lee K, Kohane IS, Butte AJ. PGAGENE: integrating quantitative gene-specific results from the NHLBI programs for genomic applications. Bioinformatics. 2003;19:778-9.
Summary: PGAGENE is a web-based gene-specific genomic data search engine, which allows users to search over 5.9 million pieces of collective genetic and genomic data from the NHLBI supported Programs for Genomic Applications. This data includes microarray measurements, SNPs, and mutations, and data may be found using symbols, parts of gene names or products, Affymetrix probe IDs, GenBank accession numbers, UniGene IDs, dbSNP IDs, and others. The PGAGENE indexing agent periodically maps all publicly available gene-specific PGA data onto LocusLink using dynamically generated cross-referencing tables.
Nimgaonkar A, Sanoudou D, Butte AJ, Haslett JN, Kunkel LM, Beggs AH, Kohane IS. Reproducibility of gene expression across generations of Affymetrix microarrays. BMC Bioinformatics. 2003;4:27.
BACKGROUND: The development of large-scale gene expression profiling technologies is rapidly changing the norms of biological investigation. But the rapid pace of change itself presents challenges. Commercial microarrays are regularly modified to incorporate new genes and improved target sequences. Although the ability to compare datasets across generations is crucial for any long-term research project, to date no means to allow such comparisons have been developed. In this study the reproducibility of gene expression levels across two generations of Affymetrix GeneChips (HuGeneFL and HG-U95A) was measured. RESULTS: Correlation coefficients were computed for gene expression values across chip generations based on different measures of similarity. Comparing the absolute calls assigned to the individual probe sets across the generations found them to be largely unchanged. CONCLUSION: We show that experimental replicates are highly reproducible, but that reproducibility across generations depends on the degree of similarity of the probe sets and the expression level of the corresponding transcript.
Riva AA, Kohane IS. Accessing Genomic Data through XML-based Remote Procedure Calls. Proceedings of the Annual American Medical Informatics Association Symposium. 2002:662-6.
As the amount of data in public genomic databases grows, interoperability among them is becoming an increasingly critical feature. The ability for automated systems to mine and integrate data will be crucial to extracting knowledge from sources of data whose volume far exceeds the capabilities of human researchers. The currently dominant paradigm of presenting information as Web pages and using hyperlinks to describe relationships between pieces of information favors usability, but makes interoperability and automated data exchange more difficult. In this paper we describe how SNPper, a web-based system for the retrieval and analysis of Single Nucleotide Polymorphisms (SNPs), was augmented with a Remote Procedure Call interface, allowing client applications to query our program for SNP data and to receive the response as an XML document. Data represented in this form can be easily parsed by the requesting program, and thus reused for other applications. In this paper we describe the implementation of the interface and we show examples of its usage in a number of existing applications.
Tsien CL, Libermann TA, Gu X, Ho AK, Kohane IS. CHIP TUNER: a web tool for evidence-based noise reduction in gene discovery. Proc AMIA Symp. 2002:810-4.
The potential for gene discovery, fueled by DNA microchip technology and the sequencing of hundreds of genomes, is unprecedented. In this context, trying to discover genes that are actually of significance rather than merely appearing so due to noise is of utmost importance. We present a web application, CHIP TUNER, which assists in this gene discovery process. Our system uses evidence-based noise reduction to help delineate candidate target genes of biological importance. Specifically, CHIP TUNER learns from redundant experiments an "identity mask" that defines a region of noise inherent to biological sampling and DNA microarray processing; it then takes this into account during actual sample comparisons. The goal of CHIP TUNER is to improve the chances that newly discovered "important" genes are actually of importance before large amounts of time and resources are invested.
Turchin A, Kohane IS. Gene homology resources on the World Wide Web. Physiol Genomics. 2002;11:165-77.
As the amount of information available to biologists increases exponentially, data analysis becomes progressively more challenging. Sequence homology has been a traditional tool in the researchers' armamentarium; it is a very versatile instrument and can be employed to assist in numerous tasks, from establishing the function of a gene to determining the evolutionary development of an organism. Consequently, numerous specialized tools have been established in the public domain (most commonly, the World Wide Web) to help investigators use sequence homology in their research. These homology databases differ both in the techniques they use to compare sequences and in the size of the unit of analysis, which can be the whole gene, a domain, or a motif. In this paper, we aim to present a systematic review of the inner details of the most commonly used databases as well as to offer guidelines for their use.
Zhao Q, Ho AK, Kenney AM, Yuk Di DI, Kohane I, Rowitch DH. Identification of genes expressed with temporal-spatial restriction to developing cerebellar neuron precursors by a functional genomic approach. Proc Natl Acad Sci U S A. 2002;99:5704-9.
Hedgehog pathway activation is required for proliferation of cerebellar granule cell neuron precursors during development and is etiologic in certain cerebellar tumors. To identify genes expressed specifically in granule cell neuron precursors, we used oligonucleotide microarrays to analyze regulation of 13,179 genes/expressed sequence tags in heterogeneous primary cultures of neonatal mouse cerebellum that respond to the mitogen Sonic hedgehog. In conjunction, we applied experiment-specific noise models to render a gene-by-gene robust indication of up-regulation in Sonic hedgehog-treated cultures. Twelve genes so identified were tested, and 10 (83%) showed appropriate expression in the external granular layer (EGL) of the postnatal day (PN) 7 cerebellum and down-regulation by PN 15, as verified by in situ hybridization. Whole-organ profiling of the developing cerebellum was carried out from PN 1 to 30 to generate a database of temporal gene regulation profiles (TRPs). From the database an algorithm was developed to capture the TRP typical of EGL-specific genes. The "TRP-EGL" accurately predicted expression in vivo of an additional 18 genes/expressed sequence tags with a sensitivity of 80% and a specificity of 88%. We then compared the positive predictive value of our analytical procedure with other widely used methods, as verified by the TRP-EGL in silico. These findings suggest that replicate experiments and incorporation of noise models increase analytical specificity. They further show that genome-wide methods are an effective means to identify stage-specific gene expression in the developing granule cell lineage.
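The sensitivity, specificity, and positive predictive value mentioned in this abstract all come from the same four confusion-matrix counts. The sketch below shows the standard definitions. The function name and the counts are hypothetical; the counts were chosen only so that the output matches the 80% and 88% figures quoted above and are not the study's data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, and positive predictive value as fractions."""
    if min(tp, fp, tn, fn) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("a denominator is zero; metric undefined")
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all real positives
        "specificity": tn / (tn + fp),  # true negatives among all real negatives
        "ppv": tp / (tp + fp),          # true positives among all positive calls
    }


# Hypothetical counts (illustrative only).
metrics = diagnostic_metrics(tp=40, fp=6, tn=44, fn=10)
```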
Riva A, Kohane IS. SNPper: retrieval and analysis of human SNPs. Bioinformatics. 2002;18:1681-5.
MOTIVATION: Single Nucleotide Polymorphisms (SNPs) are an increasingly important tool for the study of the human genome. SNPs can be used as markers to create high-density genetic maps, as causal candidates for diseases, or to reconstruct the history of our genome. SNP-based studies rely on the availability of large numbers of validated, high-frequency SNPs whose position on the chromosomes is known with precision. Although large collections of SNPs exist in public databases, researchers need tools to effectively retrieve and manipulate them. RESULTS: We describe the implementation and usage of SNPper, a web-based application to automate the tasks of extracting SNPs from public databases, analyzing them and exporting them in formats suitable for subsequent use. Our application is oriented toward the needs of candidate-gene, whole-genome and fine-mapping studies, and provides several flexible ways to present and export the data. The application has been publicly available for over a year, and has received positive user feedback and high usage levels.
Schachter AD, Kohane IS. An unsupervised self-optimizing gene clustering algorithm. Proc AMIA Symp. 2002:682-6.
We have devised a gene-clustering algorithm that is completely unsupervised in that no parameters need be set by the user, and the clustering of genes is self-optimizing to yield the set of clusters that minimizes within-cluster distance and maximizes between-cluster distance. This algorithm was implemented in Java, and tested on a randomly selected 200-gene subset of 3000 genes from cell-cycle data in S. cerevisiae. AlignACE was used to evaluate the resulting optimized cluster set for upstream cis-regulons. The optimized cluster set was found to be of comparable quality to cluster sets obtained by two established methods (complete linkage and k-means), even when provided with only a small, randomly selected subset of the data (200 vs 3000 genes), and with absolutely no supervision. MAP and specificity scores of the highest ranking motifs identified in the largest clusters were comparable.
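The objective named in this abstract, small within-cluster distance and large between-cluster distance, can be turned into a single score for comparing candidate partitions. The sketch below is one possible scoring function and is not the authors' algorithm (which was implemented in Java and is not described in detail here). The function name, the choice of Euclidean distance, the difference-of-means score, and the toy points are all assumptions.

```python
from itertools import combinations
from math import dist


def separation_score(clusters):
    """Mean between-cluster distance minus mean within-cluster distance.

    Higher is better: members sit close to their own cluster and far from the others.
    """
    labelled = [(point, k) for k, cluster in enumerate(clusters) for point in cluster]
    within, between = [], []
    for (p, kp), (q, kq) in combinations(labelled, 2):
        (within if kp == kq else between).append(dist(p, q))
    if not within or not between:
        raise ValueError("need at least one within- and one between-cluster pair")
    return sum(between) / len(between) - sum(within) / len(within)


# Two hypothetical partitions of the same four points (toy expression vectors).
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

# Choosing the best-scoring candidate needs no user-set parameter such as k.
best = max([good, bad], key=separation_score)
```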