Shiyou Zhu*, Wei Li*, Jingze Liu, Chen-Hao Chen, Qi Liao, Ping Xu, Han Xu, Tengfei Xiao, Zhongzheng Cao, Jingyu Peng, Pengfei Yuan, Myles Brown, Xiaole Shirley Liu, and Wensheng Wei. 10/31/2016. “Genome-scale deletion screening of human long non-coding RNAs using a paired-guide RNA CRISPR–Cas9 library.” Nature Biotechnology.


CRISPR–Cas9 screens have been widely adopted to analyze coding-gene functions, but high-throughput screening of non-coding elements using this method is more challenging because indels caused by a single cut in non-coding regions are unlikely to produce a functional knockout. A high-throughput method to produce deletions of non-coding DNA is needed. We report a high-throughput genomic deletion strategy to screen for functional long non-coding RNAs (lncRNAs) that is based on a lentiviral paired-guide RNA (pgRNA) library. Applying our screening method, we identified 51 lncRNAs that can positively or negatively regulate human cancer cell growth. We validated 9 of 51 lncRNA hits using CRISPR–Cas9-mediated genomic deletion, functional rescue, CRISPR activation or inhibition and gene-expression profiling. Our high-throughput pgRNA genome deletion method will enable rapid identification of functional mammalian non-coding elements.
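The deletion idea in this abstract, two guides cutting on either side of a locus so that the intervening DNA is removed, can be illustrated with a minimal sketch. It is not the authors' library-design pipeline: the function name `pair_guides`, the coordinates, and the length limits are hypothetical, and the deletion length is approximated as the distance between the two cut sites.

```python
from itertools import product

def pair_guides(upstream_cuts, downstream_cuts, min_len, max_len):
    """Enumerate guide pairs whose approximate deletion length is within bounds.

    upstream_cuts / downstream_cuts: hypothetical cut-site coordinates (integers)
    on either side of the target locus. Returns (up, down, length) tuples.
    """
    pairs = []
    for up, down in product(upstream_cuts, downstream_cuts):
        length = down - up  # approximate size of the deleted segment
        if min_len <= length <= max_len:
            pairs.append((up, down, length))
    return pairs

# Illustrative coordinates only; not taken from the paper.
pairs = pair_guides([100, 250], [1200, 3400], min_len=500, max_len=3000)
for up, down, length in pairs:
    print(up, down, length)
```

A real design would also score each guide for efficiency and specificity before pairing; the sketch only shows the pairing step.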


Jian Ma, Johannes Köster, Qian Qin, Shengen Hu, Wei Li, Chenhao Chen, Qingyi Cao, Jinzeng Wang, Shenglin Mei, Qi Liu, Han Xu, and Xiaole Shirley Liu. 2016. “CRISPR-DO for genome-wide CRISPR design and optimization.” Bioinformatics.
MOTIVATION: Despite the growing popularity of CRISPR/Cas9 technology for genome editing and gene knockout, its performance still relies on well-designed single guide RNAs (sgRNA). In this study, we propose a web application for the Design and Optimization (CRISPR-DO) of guide sequences that target both coding and non-coding regions in the spCas9 CRISPR system across human, mouse, zebrafish, fly and worm genomes. CRISPR-DO uses a computational sequence model to predict sgRNA efficiency, and employs a specificity scoring function to evaluate the potential for off-target effects. It also provides information on functional conservation of target sequences, as well as the overlaps with exons, putative regulatory sequences and single-nucleotide polymorphisms (SNPs). The web application has a user-friendly genome-browser interface to facilitate the selection of the best target DNA sequences for experimental design. AVAILABILITY AND IMPLEMENTATION: CRISPR-DO is available at CONTACT: xsliu@jimmy.harvard.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Shom Goel, Qi Wang, April C Watt, Sara M Tolaney, Deborah A Dillon, Wei Li, Susanne Ramm, Adam C Palmer, Haluk Yuzugullu, Vinay Varadan, David Tuck, Lyndsay N Harris, Kwok-Kin Wong, Shirley X Liu, Piotr Sicinski, Eric P Winer, Ian E Krop, and Jean J Zhao. 2016. “Overcoming Therapeutic Resistance in HER2-Positive Breast Cancers with CDK4/6 Inhibitors.” Cancer Cell, 29, 3, Pp. 255-69.
Using transgenic mouse models, cell line-based functional studies, and clinical specimens, we show that cyclin D1/CDK4 mediate resistance to targeted therapy for HER2-positive breast cancer. This is overcome using CDK4/6 inhibitors. Inhibition of CDK4/6 not only suppresses Rb phosphorylation, but also reduces TSC2 phosphorylation and thus partially attenuates mTORC1 activity. This relieves feedback inhibition of upstream EGFR family kinases, resensitizing tumors to EGFR/HER2 blockade. Consequently, dual inhibition of EGFR/HER2 and CDK4/6 invokes a more potent suppression of TSC2 phosphorylation and hence mTORC1/S6K/S6RP activity. The suppression of both Rb and S6RP enhances G1 arrest and a phenotype resembling cellular senescence. In vivo, CDK4/6 inhibitors sensitize patient-derived xenograft tumors to HER2-targeted therapies and delay tumor recurrence in a transgenic model of HER2-positive breast cancer.
Masruba Tasnim, Shining Ma, Ei-Wen Yang, Tao Jiang, and Wei Li. 2015. “Accurate inference of isoforms from multiple sample RNA-Seq data.” BMC Genomics, 16 Suppl 2, Pp. S15.
BACKGROUND: RNA-Seq based transcriptome assembly has become a fundamental technique for studying expressed mRNAs (i.e., transcripts or isoforms) in a cell using high-throughput sequencing technologies, and is serving as a basis to analyze the structural and quantitative differences of expressed isoforms between samples. However, the current transcriptome assembly algorithms are not specifically designed to handle large amounts of errors that are inherent in real RNA-Seq datasets, especially those involving multiple samples, making downstream differential analysis applications difficult. On the other hand, multiple sample RNA-Seq datasets may provide more information than single sample datasets that can be utilized to improve the performance of transcriptome assembly and abundance estimation, but such information remains overlooked by the existing assembly tools. RESULTS: We formulate a computational framework of transcriptome assembly that is capable of handling noisy RNA-Seq reads and multiple sample RNA-Seq datasets efficiently. We show that finding an optimal solution under this framework is an NP-hard problem. Instead, we develop an efficient heuristic algorithm, called Iterative Shortest Path (ISP), based on linear programming (LP) and integer linear programming (ILP). Our preliminary experimental results on both simulated and real datasets and comparison with the existing assembly tools demonstrate that (i) the ISP algorithm is able to assemble transcriptomes with a greatly increased precision while keeping the same level of sensitivity, especially when many samples are involved, and (ii) its assembly results help improve downstream differential analysis. The source code of ISP is freely available at
Peng Jiang, Hongfang Wang, Wei Li, Chongzhi Zang, Bo Li, Yinling J Wong, Cliff Meyer, Jun S Liu, Jon C Aster, and Shirley X Liu. 2015. “Network analysis of gene essentiality in functional genomics experiments.” Genome Biol, 16, Pp. 239.
Many genomic techniques have been developed to study gene essentiality genome-wide, such as CRISPR and shRNA screens. Our analyses of public CRISPR screens suggest protein interaction networks, when integrated with gene expression or histone marks, are highly predictive of gene essentiality. Meanwhile, the quality of CRISPR and shRNA screen results can be significantly enhanced through network neighbor information. We also found network neighbor information to be very informative on prioritizing ChIP-seq target genes and survival indicator genes from tumor profiling. Thus, our study provides a general method for gene essentiality analysis in functional genomic experiments.
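One simple way to picture "network neighbor information" is to blend each gene's screen score with the mean score of its interaction partners. The sketch below does only that. It is an illustrative assumption, not the method of the paper: the function name `smooth_scores`, the `weight` parameter, and the toy scores and edges are all hypothetical.

```python
def smooth_scores(scores, neighbors, weight=0.5):
    """Blend each gene's score with the mean score of its network neighbors.

    scores: dict gene -> screen score (e.g. a log fold change).
    neighbors: dict gene -> list of interacting genes.
    weight: fraction of the result taken from the neighbor mean (hypothetical knob).
    """
    smoothed = {}
    for gene, score in scores.items():
        # Only neighbors that were actually measured contribute.
        nbr_scores = [scores[n] for n in neighbors.get(gene, []) if n in scores]
        if nbr_scores:
            nbr_mean = sum(nbr_scores) / len(nbr_scores)
            smoothed[gene] = (1 - weight) * score + weight * nbr_mean
        else:
            smoothed[gene] = score  # isolated genes keep their own score
    return smoothed

# Toy data, not from any real screen.
scores = {"A": -2.0, "B": -1.0, "C": 0.0}
neighbors = {"A": ["B"], "B": ["A", "C"]}
smoothed = smooth_scores(scores, neighbors)
print(smoothed)
```

A noisy score for a gene whose partners all look essential is pulled toward them, which is the intuition behind using neighbors to stabilize screen calls.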
Xuesong Zhao, Tatyana Ponomaryov, Kimberly J Ornell, Pengcheng Zhou, Sukriti K Dabral, Ekaterina Pak, Wei Li, Scott X Atwood, Ramon J Whitson, Anne Lynn S Chang, Jiang Li, Anthony E Oro, Jennifer A Chan, Joseph F Kelleher, and Rosalind A Segal. 2015. “RAS/MAPK Activation Drives Resistance to Smo Inhibition, Metastasis, and Tumor Evolution in Shh Pathway-Dependent Tumors.” Cancer Res, 75, 17, Pp. 3623-35.
Aberrant Shh signaling promotes tumor growth in diverse cancers. The importance of Shh signaling is particularly evident in medulloblastoma and basal cell carcinoma (BCC), where inhibitors targeting the Shh pathway component Smoothened (Smo) show great therapeutic promise. However, the emergence of drug resistance limits long-term efficacy, and the mechanisms of resistance remain poorly understood. Using new medulloblastoma models, we identify two distinct paradigms of resistance to Smo inhibition. Sufu mutations lead to maintenance of the Shh pathway in the presence of Smo inhibitors. Alternatively activation of the RAS-MAPK pathway circumvents Shh pathway dependency, drives tumor growth, and enhances metastatic behavior. Strikingly, in BCC patients treated with Smo inhibitor, squamous cell cancers with RAS/MAPK activation emerged from the antecedent BCC tumors. Together, these findings reveal a critical role of the RAS-MAPK pathway in drug resistance and tumor evolution of Shh pathway-dependent tumors.
Han Xu, Tengfei Xiao, Chen-Hao Chen, Wei Li, Clifford A Meyer, Qiu Wu, Di Wu, Le Cong, Feng Zhang, Jun S Liu, Myles Brown, and Shirley X Liu. 2015. “Sequence determinants of improved CRISPR sgRNA design.” Genome Res, 25, 8, Pp. 1147-57.
The CRISPR/Cas9 system has revolutionized mammalian somatic cell genetics. Genome-wide functional screens using CRISPR/Cas9-mediated knockout or dCas9 fusion-mediated inhibition/activation (CRISPRi/a) are powerful techniques for discovering phenotype-associated gene function. We systematically assessed the DNA sequence features that contribute to single guide RNA (sgRNA) efficiency in CRISPR-based screens. Leveraging the information from multiple designs, we derived a new sequence model for predicting sgRNA efficiency in CRISPR/Cas9 knockout experiments. Our model confirmed known features and suggested new features including a preference for cytosine at the cleavage site. The model was experimentally validated for sgRNA-mediated mutation rate and protein knockout efficiency. Tested on independent data sets, the model achieved significant results in both positive and negative selection conditions and outperformed existing models. We also found that the sequence preference for CRISPRi/a is substantially different from that for CRISPR/Cas9 knockout and propose a new model for predicting sgRNA efficiency in CRISPRi/a experiments. These results facilitate the genome-wide design of improved sgRNA for both knockout and CRISPRi/a studies.
Wei Li*, Johannes Köster*, Han Xu, Chen-Hao Chen, Tengfei Xiao, Jun S Liu, Myles Brown, and X. Shirley Liu. 2015. “Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR.” Genome Biol, 16, Pp. 281.

High-throughput CRISPR screens have shown great promise in functional genomics. We present MAGeCK-VISPR, a comprehensive quality control (QC), analysis, and visualization workflow for CRISPR screens. MAGeCK-VISPR defines a set of QC measures to assess the quality of an experiment, and includes a maximum-likelihood algorithm to call essential genes simultaneously under multiple conditions. The algorithm uses a generalized linear model to deconvolute different effects, and employs expectation-maximization to iteratively estimate sgRNA knockout efficiency and gene essentiality. MAGeCK-VISPR also includes VISPR, a framework for the interactive visualization and exploration of QC and analysis results. MAGeCK-VISPR is freely available at .

Xiaoqi Zheng, Qian Zhao, Hua-Jun Wu, Wei Li, Haiyun Wang, Clifford A Meyer, Qian Alvin Qin, Han Xu, Chongzhi Zang, Peng Jiang, Fuqiang Li, Yong Hou, Jianxing He, Jun Wang, Jun Wang, Peng Zhang, Yong Zhang, and Xiaole Shirley Liu. 2014. “MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes.” Genome Biol, 15, 8, Pp. 419.
We propose a statistical algorithm MethylPurify that uses regions with bisulfite reads showing discordant methylation levels to infer tumor purity from tumor samples alone. MethylPurify can identify differentially methylated regions (DMRs) from individual tumor methylome samples, without genomic variation information or prior knowledge from other datasets. In simulations with mixed bisulfite reads from cancer and normal cell lines, MethylPurify correctly inferred tumor purity and identified over 96% of the DMRs. From patient data, MethylPurify gave satisfactory DMR calls from tumor methylome samples alone, and revealed potential missed DMRs by tumor to normal comparison due to tumor heterogeneity.
Wei Li*, Han Xu*, Tengfei Xiao, Le Cong, Michael I Love, Feng Zhang, Rafael A Irizarry, Jun S Liu, Myles Brown, and X. Shirley Liu. 2014. “MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens.” Genome Biol, 15, 12, Pp. 554.

We propose the Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout (MAGeCK) method for prioritizing single-guide RNAs, genes and pathways in genome-scale CRISPR/Cas9 knockout screens. MAGeCK demonstrates better performance compared with existing methods, identifies both positively and negatively selected genes simultaneously, and reports robust results across different experimental conditions. Using public datasets, MAGeCK identified novel essential genes and pathways, including EGFR in vemurafenib-treated A375 cells harboring a BRAF mutation. MAGeCK also detected cell type-specific essential genes, including BCR and ABL1, in KBM7 cells bearing a BCR-ABL fusion, and IGF1R in HL-60 cells, which depends on the insulin signaling pathway for proliferation.

Paul M Ruegger, Elizabeth Bent, Wei Li, Daniel R Jeske, Xinping Cui, Jonathan Braun, Tao Jiang, and James Borneman. 2012. “Improving oligonucleotide fingerprinting of rRNA genes by implementation of polony microarray technology.” J Microbiol Methods, 90, 3, Pp. 235-40.
Improvements to oligonucleotide fingerprinting of rRNA genes (OFRG) were obtained by implementing polony microarray technology. OFRG is an array-based method for analyzing microbial community composition. Polonies are discrete clusters of DNA, produced by solid-phase PCR in hydrogels, and derived from individual, spatially isolated DNA molecules. The advantages of a polony-based OFRG method include higher throughput and reductions in the PCR-induced errors and compositional skew inherent in all other PCR-based community composition methods, including high-throughput sequencing of rRNA genes. Given the similarities between polony microarrays and certain aspects of sequencing methods such as the Illumina platform, we suggest that if concepts presented in this study were implemented in high-throughput sequencing protocols, a reduction of PCR-induced errors and compositional skew may be realized.
Wei Li and Tao Jiang. 2012. “Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads.” Bioinformatics, 28, 22, Pp. 2914-21.
MOTIVATION: RNA-Seq uses high-throughput sequencing technology to identify and quantify the transcriptome at an unprecedented high resolution and low cost. However, RNA-Seq reads are usually not uniformly distributed, and biases in RNA-Seq data pose great challenges in many applications including transcriptome assembly and the expression level estimation of genes or isoforms. Much effort has been made in the literature to calibrate the expression level estimation from biased RNA-Seq data, but the effect of biases on transcriptome assembly remains largely unexplored. RESULTS: Here, we propose a statistical framework for both transcriptome assembly and isoform expression level estimation from biased RNA-Seq data. Using a quasi-multinomial distribution model, our method is able to capture various types of RNA-Seq biases, including positional, sequencing and mappability biases. Our experimental results on simulated and real RNA-Seq datasets exhibit interesting effects of RNA-Seq biases on both transcriptome assembly and isoform expression level estimation. The advantage of our method is clearly shown in the experimental analysis by its high sensitivity and precision in transcriptome assembly and the high concordance of its estimated expression levels with quantitative reverse transcription-polymerase chain reaction data. AVAILABILITY: CEM is freely available at CONTACT: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Jianxing Feng, Wei Li, and Tao Jiang. 2011. “Inference of isoforms from short sequence reads.” J Comput Biol, 18, 3, Pp. 305-21.
Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time consuming and cost ineffective. The emerging RNA-Seq technology provides a possible effective method to address this problem. Although the advantages of RNA-Seq over traditional methods in transcriptome analysis have been confirmed by many studies, the inference of isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS) and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately if all the isoforms are known and infer novel isoforms from scratch. Our experimental tests on known mouse isoforms with both simulated expression levels and reads demonstrate that IsoInfer is able to calculate the expression levels of isoforms with an accuracy comparable to the state-of-the-art statistical method and a 60 times faster speed. Moreover, our tests on both simulated and real reads show that it achieves a good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS, and PAS information, especially for isoforms whose expression levels are significantly high. The software is publicly available for free at∼jianxing/IsoInfer.html.
Wei Li, Jianxing Feng, and Tao Jiang. 2011. “IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly.” J Comput Biol, 18, 11, Pp. 1693-707.
The new second generation sequencing technology revolutionizes many biology-related research fields and poses various computational biology challenges. One of them is transcriptome assembly based on RNA-Seq data, which aims at reconstructing all full-length mRNA transcripts simultaneously from millions of short reads. In this article, we consider three objectives in transcriptome assembly: the maximization of prediction accuracy, minimization of interpretation, and maximization of completeness. The first objective, the maximization of prediction accuracy, requires that the estimated expression levels based on assembled transcripts should be as close as possible to the observed ones for every expressed region of the genome. The minimization of interpretation follows the parsimony principle to seek as few transcripts in the prediction as possible. The third objective, the maximization of completeness, requires that the maximum number of mapped reads (or “expressed segments” in gene models) be explained by (i.e., contained in) the predicted transcripts in the solution. Based on the above three objectives, we present IsoLasso, a new RNA-Seq based transcriptome assembly tool. IsoLasso is based on the well-known LASSO algorithm, a multivariate regression method designed to seek a balance between the maximization of prediction accuracy and the minimization of interpretation. By including some additional constraints in the quadratic program involved in LASSO, IsoLasso is able to make the set of assembled transcripts as complete as possible. Experiments on simulated and real RNA-Seq datasets show that IsoLasso achieves, simultaneously, higher sensitivity and precision than the state-of-the-art transcript assembly tools.
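The trade-off between prediction accuracy and parsimony described in this abstract can be sketched as a non-negative LASSO: minimize 0.5*||y - A x||^2 + lam*sum(x) subject to x >= 0, where y holds observed segment coverage, column j of A marks the segments contained in candidate isoform j, and x holds isoform abundances. The code below is a minimal coordinate-descent illustration under those assumptions. It is not the IsoLasso implementation, it omits the completeness constraints, and the function name, the matrix `A`, the vector `y`, and `lam` are hypothetical.

```python
def nonneg_lasso(A, y, lam, n_iter=200):
    """Coordinate descent for min 0.5*||y - A x||^2 + lam*sum(x), x >= 0.

    A: list of rows (segments) by columns (candidate isoforms).
    y: observed coverage per segment.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = list(y)  # equals y - A x, since x starts at zero
    for _ in range(n_iter):
        for j in range(n_cols):
            col = [A[i][j] for i in range(n_rows)]
            norm_sq = sum(c * c for c in col)
            if norm_sq == 0.0:
                continue
            # Correlation of column j with the residual, with j's own
            # current contribution added back in.
            rho = sum(c * (r + c * x[j]) for c, r in zip(col, residual))
            # Shrink by lam and clamp at zero (non-negative abundance).
            new_xj = max(0.0, (rho - lam) / norm_sq)
            delta = new_xj - x[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                x[j] = new_xj
    return x

# Toy example: 3 segments, 3 candidate isoforms (columns).
# Isoform 0 covers segments 0-1, isoform 1 covers 1-2, isoform 2 covers all.
A = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 1]]
y = [5.0, 5.0, 3.0]  # generated from abundances (2, 0, 3)
x = nonneg_lasso(A, y, lam=0.1)
print([round(v, 3) for v in x])
```

In this toy case the penalty leaves the unused isoform at zero abundance and shrinks the first abundance slightly below its true value, which shows how the L1 term favors fewer transcripts at a small cost in fit.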