Artem Sokolov, Christopher Funk, Kiley Graim, Karin Verspoor, and Asa Ben-Hur. 2013. “Combining heterogeneous data sources for accurate functional annotation of proteins.” BMC Bioinformatics, 14 Suppl 3, Pp. S10.
Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at
Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra M Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher Funk, Karin Verspoor, Asa Ben-Hur, Gaurav Pandey, Jeffrey M Yunes, Ameet S Talwalkar, Susanna Repo, Michael L Souza, Damiano Piovesan, Rita Casadio, Zheng Wang, Jianlin Cheng, Hai Fang, Julian Gough, Patrik Koskinen, Petri Törönen, Jussi Nokso-Koivisto, Liisa Holm, Domenico Cozzetto, Daniel WA Buchan, Kevin Bryson, David T Jones, Bhakti Limaye, Harshal Inamdar, Avik Datta, Sunitha K Manjari, Rajendra Joshi, Meghana Chitale, Daisuke Kihara, Andreas M Lisewski, Serkan Erdin, Eric Venner, Olivier Lichtarge, Robert Rentzsch, Haixuan Yang, Alfonso E Romero, Prajwal Bhat, Alberto Paccanaro, Tobias Hamp, Rebecca Kaßner, Stefan Seemayer, Esmeralda Vicedo, Christian Schaefer, Dominik Achten, Florian Auer, Ariane Boehm, Tatjana Braun, Maximilian Hecht, Mark Heron, Peter Hönigschmid, Thomas A Hopf, Stefanie Kaufmann, Michael Kiening, Denis Krompass, Cedric Landerer, Yannick Mahlich, Manfred Roos, Jari Björne, Tapio Salakoski, Andrew Wong, Hagit Shatkay, Fanny Gatzmann, Ingolf Sommer, Mark N Wass, Michael JE Sternberg, Nives Škunca, Fran Supek, Matko Bošnjak, Panče Panov, Sašo Džeroski, Tomislav Šmuc, Yiannis AI Kourmpetis, Aalt DJ van Dijk, Cajo JF ter Braak, Yuanpeng Zhou, Qingtian Gong, Xinran Dong, Weidong Tian, Marco Falda, Paolo Fontana, Enrico Lavezzo, Barbara Di Camillo, Stefano Toppo, Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic, Amos Bairoch, Michal Linial, Patricia C Babbitt, Steven E Brenner, Christine Orengo, Burkhard Rost, Sean D Mooney, and Iddo Friedberg. 2013. “A large-scale evaluation of computational protein function prediction.” Nat Methods, 10, 3, Pp. 221-7.
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
TCGA Network, John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna Mills R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. 2013. “The Cancer Genome Atlas Pan-Cancer analysis project.” Nat Genet, 45, 10, Pp. 1113-20.
The Cancer Genome Atlas (TCGA) Research Network has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein and epigenetic levels. The resulting rich data provide a major opportunity to develop an integrated picture of commonalities, differences and emergent themes across tumor lineages. The Pan-Cancer initiative compares the first 12 tumor types profiled by TCGA. Analysis of the molecular aberrations and their functional roles across tumor types will teach us how to extend therapies effective in one cancer type to others with a similar genomic profile.
Sam Ng, Eric A Collisson, Artem Sokolov, Theodore Goldstein, Abel Gonzalez-Perez, Nuria Lopez-Bigas, Christopher Benz, David Haussler, and Joshua M Stuart. 2012. “PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis.” Bioinformatics, 28, 18, Pp. i640-i646.
MOTIVATION: A current challenge in understanding cancer processes is to pinpoint which mutations influence the onset and progression of disease. Toward this goal, we describe a method called PARADIGM-SHIFT that can predict whether a mutational event is neutral, gain- or loss-of-function in a tumor sample. The method uses a belief-propagation algorithm to infer gene activity from gene expression and copy number data in the context of a set of pathway interactions. RESULTS: The method was found to be both sensitive and specific on a set of positive and negative controls for multiple cancers for which pathway information was available. Application to the Cancer Genome Atlas glioblastoma, ovarian and lung squamous cancer datasets revealed several novel mutations with predicted high impact, including several genes mutated at low frequency, suggesting the approach will be complementary to current approaches that rely on the prevalence of events to reach statistical significance. AVAILABILITY: All source code is available at the GitHub repository CONTACT: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
TCGA Network. 2012. “Comprehensive molecular portraits of human breast tumours.” Nature, 490, 7418, Pp. 61-70.

We analysed primary breast cancers by genomic DNA copy number arrays, DNA methylation, exome sequencing, messenger RNA arrays, microRNA sequencing and reverse-phase protein arrays. Our ability to integrate information across platforms provided key insights into previously defined gene expression subtypes and demonstrated the existence of four main breast cancer classes when combining data from five platforms, each of which shows significant molecular heterogeneity. Somatic mutations in only three genes (TP53, PIK3CA and GATA3) occurred at >10% incidence across all breast cancers; however, there were numerous subtype-associated and novel gene mutations including the enrichment of specific mutations in GATA3, PIK3CA and MAP3K1 with the luminal A subtype. We identified two novel protein-expression-defined subgroups, possibly produced by stromal/microenvironmental elements, and integrated analyses identified specific signalling pathways dominant in each molecular subtype including a HER2/phosphorylated HER2/EGFR/phosphorylated EGFR signature within the HER2-enriched expression subtype. Comparison of basal-like breast tumours with high-grade serous ovarian tumours showed many molecular commonalities, indicating a related aetiology and similar therapeutic opportunities. The biological finding of the four main breast cancer subtypes caused by different subsets of genetic and epigenetic abnormalities raises the hypothesis that much of the clinically observable plasticity and heterogeneity occurs within, and not across, these major biological subtypes of breast cancer.

Artem Sokolov and Asa Ben-Hur. 2011. “Multi-view prediction of protein function.” Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine.
Artem Sokolov and Asa Ben-Hur. 2010. “Hierarchical classification of gene ontology terms using the GOstruct method.” J Bioinform Comput Biol, 8, 2, Pp. 357-76.
Protein function prediction is an active area of research in bioinformatics. Yet, the transfer of annotation on the basis of sequence or structural similarity remains widely used as an annotation method. Most of today's machine learning approaches reduce the problem to a collection of binary classification problems: whether a protein performs a particular function, sometimes with a post-processing step to combine the binary outputs. We propose a method that directly predicts a full functional annotation of a protein by modeling the structure of the Gene Ontology hierarchy in the framework of kernel methods for structured-output spaces. Our empirical results show improved performance over a BLAST nearest-neighbor method, and over algorithms that employ a collection of binary classifiers as measured on the Mousefunc benchmark dataset.
Artem Sokolov and Asa Ben-Hur. 2009. “GOstruct: Prediction of Gene Ontology Terms Using Methods for Structured Output Spaces.” Proceedings of the 8th International Conference on Computational Systems Bioinformatics.
Artem Sokolov, Darrell Whitley, and André Motta Salles da Barreto. 2007. “A Note on the Variance of Rank-Based Selection Strategies for Genetic Algorithms and Genetic Programming.” Genetic Programming and Evolvable Machines, 8, 3, Pp. 221-237.
Darrell Whitley, Monte Lunacek, and Artem Sokolov. 2006. “Comparing the niches of CMA-ES, CHC and pattern search using diverse benchmarks.” Proceedings of the 9th international conference on Parallel Problem Solving from Nature.
Charles W Anderson, James N Knight, Tim O'Connor, Michael J Kirby, and Artem Sokolov. 2006. “Geometric subspace methods and time-delay embedding for EEG artifact removal and classification.” IEEE Trans Neural Syst Rehabil Eng, 14, 2, Pp. 142-6.
Generalized singular-value decomposition is used to separate multichannel electroencephalogram (EEG) into components found by optimizing a signal-to-noise quotient. These components are used to filter out artifacts. Short-time principal components analysis of time-delay embedded EEG is used to represent windowed EEG data to classify EEG according to which mental task is being performed. Examples are presented of the filtering of various artifacts and results are shown of classification of EEG from five mental tasks using committees of decision trees.
Artem Sokolov, Darrell Whitley, and Monte Lunacek. 2005. “Alternative Implementations of The Griewangk Function.” Proceedings of the 7th Genetic and Evolutionary Computation Conference.
Artem Sokolov, Alodeep Sanyal, Darrell Whitley, and Yashwant Malaiya. 2005. “Dynamic power minimization during combinational circuit testing as a traveling salesman problem.” IEEE Congress on Evolutionary Computation.
Artem Sokolov and Darrell Whitley. 2005. “Unbiased Tournament Selection.” Proceedings of the 7th Genetic and Evolutionary Computation Conference.