I'm interested in developing computational technologies to understand the functions of coding and non-coding elements, especially in the context of human physiology and disease. I am focusing on the following areas:

  1. Method development for the design and analysis of functional screening (especially CRISPR/Cas9 knockout screening);
  2. Transcriptome dynamics (alternative splicing and transcript structure changes) from high-throughput RNA sequencing data;
  3. Functional analysis of coding and non-coding elements from integrating genomics big data.

Algorithms for the design, analysis and interpretation of CRISPR/Cas9 screens

Figure 1. The beta scores of genes whose over-expression leads to drug resistance in CRISPR activation (CRISPRa) screens. Their scores in CRISPR knockout screens (GeCKO) are also shown.

I worked with colleagues to develop a comprehensive computational solution for CRISPR/Cas9 screens. It includes efficient sgRNA design algorithms (SSC and CRISPR-DO), two algorithms for modeling and processing CRISPR screens (MAGeCK and MAGeCK-VISPR), and one algorithm for interpreting CRISPR screens using network information (NEST). MAGeCK ranks individual single guide RNAs (sgRNAs) between conditions using a negative binomial (NB) model, then combines the multiple sgRNAs targeting each gene with robust rank aggregation (RRA) to identify significant genes and pathways. The newer MAGeCK-VISPR model handles complex experimental designs and provides quality control and visualization functions for screening analysis. In MAGeCK-VISPR, the effect of a gene in each sample is expressed as a “beta” score, a measure of gene selection similar to the “log fold change” in differential expression analysis. The beta scores are estimated by maximizing the joint log-likelihood of the observed sgRNA read counts across all samples, using an Expectation-Maximization (EM) algorithm.
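The rank aggregation step can be illustrated with a minimal sketch. It implements the plain order-statistic form of RRA: each sgRNA's rank is converted to a percentile, and a gene scores well when its sgRNAs sit nearer the top of the ranking than uniform random ranks would. The function names and the example percentiles are made up for illustration, and MAGeCK's own implementation may differ in its details (for example, in which sgRNAs it includes and in how it derives p-values).

```python
from math import comb

def order_stat_cdf(u, k, n):
    """P(the k-th smallest of n independent Uniform(0,1) draws is <= u)."""
    return sum(comb(n, j) * u**j * (1 - u)**(n - j) for j in range(k, n + 1))

def rra_rho(percentiles):
    """Smallest order-statistic probability over one gene's sgRNA rank percentiles.

    A small value means the gene's sgRNAs rank consistently higher than
    expected if their ranks were uniformly random.
    """
    u = sorted(percentiles)
    n = len(u)
    return min(order_stat_cdf(u[k - 1], k, n) for k in range(1, n + 1))

# Hypothetical rank percentiles (rank / total number of sgRNAs) for two genes.
consistent = [0.01, 0.02, 0.03, 0.05]  # every sgRNA ranks near the top
scattered = [0.01, 0.40, 0.65, 0.90]   # one strong sgRNA, the rest unremarkable

rho_consistent = rra_rho(consistent)
rho_scattered = rra_rho(scattered)
```

In this toy input the gene with consistently top-ranked sgRNAs gets a much smaller score than the gene supported by a single sgRNA, which is the behavior that makes the aggregation robust to outlier guides.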

Compared with existing algorithms, MAGeCK achieves a lower false positive rate and a higher sensitivity. MAGeCK also identified additional essential genes and pathways that were not discovered in the original studies. For example, MAGeCK identified EGFR as essential in A375 cells that are resistant to the BRAF inhibitor PLX (FDR = 0.025, ranked 6th of 17,420 genes), implying that these cells are more dependent on EGFR. This finding is consistent with recent studies linking EGFR activation with PLX resistance.

MAGeCK-VISPR can compare different screening experiments simultaneously, for example CRISPR activation and knockout screens on melanoma cells (A375) treated with PLX. Using MAGeCK-VISPR, we found a small cluster of genes whose over-expression leads to drug resistance (in activation screens) and whose knockout is lethal in drug-resistant cells (Fig. 1). These genes include EGFR and BRAF, two known kinases that drive melanoma progression and PLX resistance, and CRKL, a novel kinase that has not previously been implicated in PLX resistance. CRKL is a protein kinase that activates the RAS and JUN pathways, and its amplification is reported to cause resistance to EGFR inhibitors by activating pathways downstream of EGFR, which suggests a potential role in PLX resistance.

CRISPR screens identified drivers of endocrine resistance and synthetic lethal vulnerabilities in breast cancer

Over 70% of breast cancer patients are ER-positive, and endocrine therapy has been a standard treatment for these patients for decades. However, most patients with advanced-stage disease eventually develop resistance to ER inhibition therapies, and the mechanisms of resistance are largely unknown. I collaborated with experimental biologists to study the mechanisms of, and potential treatments for, endocrine resistance in breast cancer (manuscript submitted). We performed genome-wide CRISPR knockout screens in ER+ breast cancer cell lines. Using MAGeCK, we found that the strongest hit in both cell lines is c-src tyrosine kinase (CSK), a negative regulator of Src family kinases (SFKs). The phenotype of accelerated cell growth without hormone was confirmed by downstream validation, and we also found that loss of CSK is associated with high-grade ER+ tumors and worse survival in ER+ breast cancer patients.

To identify potential drug targets for endocrine resistance triggered by CSK loss, we performed a second round of genome-wide CRISPR screens after knocking out CSK, and used MAGeCK-VISPR to identify genes that are essential in CSK-loss cells but not in CSK wild-type cells. By comparing the beta scores of all genes across the two conditions, we found "druggable" genes that are synthetic lethal with CSK loss. Validation experiments confirmed these findings: treating CSK-loss cells with inhibitors of these genes suppressed their growth.
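The comparison of beta scores across the two conditions can be sketched as a simple filter. This is only an illustration of the idea, not the analysis used in the study: the gene names, beta values, and both cutoffs below are hypothetical placeholders, and a real analysis would also account for the statistical significance of each beta score.

```python
def synthetic_lethal_candidates(beta_wt, beta_loss,
                                essential_cutoff=-0.5, neutral_band=0.2):
    """Genes negatively selected in CSK-loss cells but near-neutral in wild type.

    beta_wt / beta_loss map gene name -> beta score in each condition.
    Both cutoffs are arbitrary illustrative values.
    Returns (gene, beta_loss - beta_wt) pairs, strongest differential first.
    """
    hits = []
    for gene in beta_wt.keys() & beta_loss.keys():
        essential_in_loss = beta_loss[gene] <= essential_cutoff
        neutral_in_wt = abs(beta_wt[gene]) <= neutral_band
        if essential_in_loss and neutral_in_wt:
            hits.append((gene, beta_loss[gene] - beta_wt[gene]))
    return sorted(hits, key=lambda pair: pair[1])

# Made-up beta scores for four placeholder genes.
beta_wt = {"GENE_A": -0.05, "GENE_B": -1.20, "GENE_C": 0.10, "GENE_D": 0.00}
beta_loss = {"GENE_A": -1.10, "GENE_B": -1.30, "GENE_C": 0.05, "GENE_D": -0.60}

candidates = synthetic_lethal_candidates(beta_wt, beta_loss)
```

Here GENE_B is excluded because it is essential in both conditions, which is the distinction between a common essential gene and a condition-specific dependency.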

CRISPR/Cas9 screens for long non-coding RNAs using paired gRNAs (pgRNAs)

Figure 2. Long non-coding RNA (lncRNA) screens using paired guide RNAs (pgRNAs). a-b, The Robust Rank Aggregation (RRA) scores of the top-ranked negatively selected lncRNAs (a) and positively selected lncRNAs (b), calculated by MAGeCK. Some positive control genes are shown as black triangles. c-d, Validation experiments confirmed the functions of negatively selected (c) and positively selected (d) lncRNAs. For each lncRNA, 3 to 5 pgRNAs targeting the promoter, or the promoter plus an exon, were chosen for validation.

The current screening protocol, which uses pooled sgRNAs, may be ineffective on non-coding elements, since the indels caused by a single sgRNA are unlikely to generate loss-of-function phenotypes. In collaboration with the Wensheng Wei lab (Peking University, China), we developed a novel computational and experimental protocol to screen for non-coding genomic elements using paired gRNAs (pgRNAs) (Zhu*, Li* et al. In press, Nature Biotechnology). Compared with screens using one sgRNA, the new protocol introduces pgRNAs simultaneously into one cell and can efficiently knock out non-coding elements by introducing large genomic deletions. I developed a computational algorithm to design pgRNAs for long non-coding RNAs (lncRNAs). The algorithm first scans for all possible sgRNAs, then filters out those that overlap with existing coding genes, map to multiple locations, or are predicted to have low efficiency by the SSC efficiency prediction algorithm. We performed a screening experiment on one liver cancer cell line (Huh7.5) and analyzed the results using MAGeCK (Fig. 2a-b). The functions of 9 out of 9 top lncRNAs (5 negatively selected, 4 positively selected) were confirmed through knockdown and knockout experiments (Fig. 2c-d), demonstrating that our computational and experimental protocol can identify functional non-coding elements efficiently.
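The scan-then-filter logic of the design step can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the published protocol: it scans only the forward strand for 20-nt protospacers followed by an NGG PAM, checks multi-mapping only within the toy sequence itself, and uses GC fraction as a stand-in for an efficiency predictor (it is not the SSC model). All function names, cutoffs, and the toy sequence are hypothetical, and the pairing of retained sgRNAs into pgRNAs is not shown.

```python
import re

# 20-nt protospacer followed by an NGG PAM; the lookahead allows overlapping sites.
PAM_SITE = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")

def scan_sgrnas(seq):
    """Return (start, protospacer) for every forward-strand NGG site in seq."""
    return [(m.start(), m.group(1)) for m in PAM_SITE.finditer(seq)]

def overlaps(start, length, intervals):
    """True if [start, start + length) intersects any half-open (lo, hi) interval."""
    end = start + length
    return any(start < hi and lo < end for lo, hi in intervals)

def occurrences(spacer, seq):
    """Count occurrences of spacer in seq, including overlapping ones."""
    return len(re.findall(f"(?={spacer})", seq))

def gc_fraction(spacer):
    """Toy stand-in for an efficiency predictor; NOT the SSC model."""
    return sum(base in "GC" for base in spacer) / len(spacer)

def filter_sgrnas(candidates, seq, coding_intervals, efficiency, min_efficiency=0.5):
    """Drop candidates that hit coding genes, multi-map, or score as inefficient."""
    kept = []
    for start, spacer in candidates:
        if overlaps(start, len(spacer), coding_intervals):
            continue  # overlaps an annotated coding gene
        if occurrences(spacer, seq) > 1:
            continue  # maps to more than one location (toy check on seq only)
        if efficiency(spacer) < min_efficiency:
            continue  # predicted low cutting efficiency
        kept.append((start, spacer))
    return kept

# Made-up region containing a single NGG site.
toy_region = "GCGCATATGCGCATATGCGC" + "TGG" + "AAAAAAAAAA"
candidates = scan_sgrnas(toy_region)
designs = filter_sgrnas(candidates, toy_region,
                        coding_intervals=[], efficiency=gc_fraction)
```

A genome-scale version would replace the in-sequence occurrence count with an alignment against the whole genome and would scan both strands.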

My research projects before 2012 (RNA-seq transcriptome assembly, drug activity, robots, etc.) can be found on my UCR webpage.