Today I'm commenting on the interplay of structural variation (SV) and genome topology (e.g. TADs) in human diseases, the impetus for which stems from two papers we recently published (Redin et al., Nat. Genet., 2016 & Ordulu et al., Am. J. Hum. Genet., 2016). I want to emphasize up front how incredibly lucky I feel to have the opportunity to collaborate with an outstanding group of colleagues as part of the DGAP consortium over the last several years; huge congratulations are in order to both Claire and Zehra on championing these studies through to completion!
What's in a TAD, anyway?
Since the publication of the first kilobase-scale map of chromatin domains, subdomains, and loops in 2014 (Rao et al., Cell, 2014), interest in topologically associating domains (TADs) has surged. From preliminary studies, it's clear that TADs play a major coordinating role in the control of gene expression, as summarized nicely by three recent reviews (Gonzalez-Sandoval and Gasser, Trends in Genet., 2016; Lupiáñez, Spielmann, and Mundlos, Trends in Genet., 2016; and Bonev & Cavalli, Nat. Rev. Genet., 2016). If you want a primer on the topic, I recommend reading all three. If you still want a primer but are too lazy to read these review articles (don't worry, I don't blame you), here are a few sentences of SparkNotes: in order to drive gene expression, enhancers must be physically colocalized with a gene's promoter in 3D space within the nucleus, and thus the chromosome has to fold into bunched-up "neighborhoods" (TADs) of 0.5-3Mb to bring enhancers to promoters; these neighborhoods incidentally do end up resembling cul-de-sacs made of DNA (i.e. loops). When one of these enhancer-promoter relationships is disrupted (generally by mutations, especially SV), it can result in decreased gene expression commensurate with a heterozygous loss-of-function mutation directly to the gene itself, despite the mutation not impacting coding sequence (and sometimes being hundreds of kb away!). Conversely, SV can also fuse two previously distinct TADs and introduce "enhancer adoption," which can cause overexpression of a gene. However, there is no universally accepted model (yet) as to how & under what conditions these domains and loops form, how much they vary between cell types and individuals, and how large their effects on gene expression are.
It should make some intuitive sense that certain rearrangements of genome structure (i.e. SV) would probably result in damaging disorganization of the genome's carefully regulated 3D chromatin architecture in the nucleus. It's not a stretch to imagine that when this disorganization occurs near haploinsufficient coding loci, these rearrangements could result in disease pathogenesis despite leaving that same coding sequence completely intact. By now, several examples of SV rewiring 3D regulatory architecture in disease phenotypes have been published (e.g. Lupiáñez et al., Cell, 2015, Benko et al., J. Med. Genet., 2011, Ibn-Salem et al., Genome Biol., 2014 & Montavon et al., Cell, 2011), so the proof-of-principle is established. Conceptually, this has been a huge advance for the field, but it prompts two questions, which are both currently unanswered: (1) are these dysregulating rearrangements predictable from knowledge of DNA sequence changes alone (i.e. absent functional assays), and (2) how could this ever be applied to diagnostics without a lengthy and expensive functional workup? In our two recent papers, we tried to put some initial cracks in the armor of both of these problems, as I'll describe below.
Bigger is Badder
Balanced chromosomal abnormalities (BCAs) are a class of SV that, unlike the better-studied copy-number variants (CNVs), do not result in gross gains or losses of segments of DNA (hence "balanced"; the class mostly comprises inversions and reciprocal translocations). Historically, BCAs have been tricky to survey in the genome, largely because (1) they leave no scars in the genome other than precisely at their two or more breakpoints, and thus (2) if you don't know where to look a priori, you'll have trouble sifting through the proverbial haystack of noise in sequencing data to find the needle that represents the BCA. Fortunately, extremely large BCAs can be visualized under a microscope by traditional karyotyping or FISH, guiding our search through whole-genome sequencing (WGS) data to pin down the breakpoint sequences. This has been our go-to approach for the DGAP consortium, as I highlighted in an earlier post. In the first paper, led by Dr. Claire Redin, an exceptionally talented postdoc in the Talkowski lab, we report a WGS-based analysis of 273 patients with congenital or developmental anomalies and a large de novo BCA identified by clinical karyotyping. Decades of research have shown that these massive rearrangements frequently lead to disease phenotypes, and the prevailing assumption has been that this is mostly due to direct gene truncation. In this study, we observed this was indeed the case: by carefully reviewing statistical and clinical evidence for each case, we were able to assign likely pathogenicity to the BCA in 46% of patients; 73% of these were due to direct gene truncation, and another 11% were due to cryptic CNVs at described syndromic loci flanking BCA breakpoints.
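Those nested percentages are easy to misread, so here is the arithmetic spelled out as a tiny Python sketch. It assumes (my reading of the numbers, not a figure quoted from the paper) that the 73% and 11% are both fractions of the likely pathogenic BCAs rather than of the whole cohort; the function and key names are made up for illustration.

```python
def breakdown(pathogenic_pct, truncation_pct, cryptic_cnv_pct):
    """Split the likely pathogenic BCAs by mechanism.

    pathogenic_pct is a percentage of all patients; the other two are
    percentages of the likely pathogenic BCAs (an assumption, see above).
    """
    other_pct = 100 - truncation_pct - cryptic_cnv_pct
    return {
        "other_pct_of_pathogenic": other_pct,
        # Re-express each mechanism as a percentage of the whole cohort.
        "truncation_pct_of_cohort": pathogenic_pct * truncation_pct / 100,
        "cryptic_cnv_pct_of_cohort": pathogenic_pct * cryptic_cnv_pct / 100,
        "other_pct_of_cohort": pathogenic_pct * other_pct / 100,
    }


if __name__ == "__main__":
    for key, value in breakdown(46, 73, 11).items():
        print(f"{key}: {value:.2f}")
```

Under that reading, the BCAs explained by neither direct truncation nor a cryptic CNV make up 16% of the likely pathogenic set, or roughly 7% of all patients.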
Comprehensive studies of de novo BCAs in healthy individuals haven't been performed at sequence resolution, so we can't say whether this diagnostic yield is higher or lower than we'd expect by chance, but the genes being disrupted have reasonable evidence for being implicated in disease. What do we think explains the remaining 16% of likely pathogenic BCAs, though? Well, read on.
Hit Me Baby One More Time
Recurrent de novo gene-truncating mutations identified by whole-exome sequencing have been a remarkably fertile source of gene-disease associations in neurodevelopmental disorders (e.g. Neale et al., Nature, 2012, Iossifov et al., Nature, 2014, and Sanders et al., Neuron, 2015). In our BCA study, Claire reasoned that we could apply a similar approach, looking for enrichments of de novo BCA breakpoints at particular segments of the genome to identify regions that are recurrently mutated by SV in developmental disorders. As we lacked matched controls (unlike most previous exome sequencing studies), she used a Monte Carlo simulation to permute a matched set of breakpoints throughout the genome, and was thereby able to calculate an empirical p-value for each 1Mb window, sliding in 100kb steps. The result was a genome-wide Manhattan plot of these p-values.
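To make the permutation idea a bit more concrete, here is a minimal sketch in Python. It is not the code from the paper: it assumes a single toy chromosome, places simulated breakpoints uniformly at random (the study's "matched" permutation was presumably more careful than that), and skips any genome-wide multiple-testing correction. The function names `window_counts` and `empirical_pvalues` are my own.

```python
import random
from bisect import bisect_left


def window_counts(positions, chrom_len, win=1_000_000, step=100_000):
    """Count breakpoints in each sliding window [start, start + win)."""
    pos = sorted(positions)
    counts = []
    for start in range(0, max(chrom_len - win, 0) + 1, step):
        counts.append(bisect_left(pos, start + win) - bisect_left(pos, start))
    return counts


def empirical_pvalues(observed, chrom_len, n_perm=1000,
                      win=1_000_000, step=100_000, seed=0):
    """Per-window empirical p-values from uniform random re-placement.

    For each window, p = (1 + number of permutations whose simulated count
    is >= the observed count) / (1 + n_perm).
    """
    rng = random.Random(seed)
    obs = window_counts(observed, chrom_len, win, step)
    exceed = [0] * len(obs)
    for _ in range(n_perm):
        # Simplifying assumption: every position is equally likely.
        sim = [rng.randrange(chrom_len) for _ in observed]
        sim_counts = window_counts(sim, chrom_len, win, step)
        for i, (o, s) in enumerate(zip(obs, sim_counts)):
            if s >= o:
                exceed[i] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]


if __name__ == "__main__":
    # Toy data: eight clustered breakpoints plus three scattered ones.
    cluster = [4_050_000 + i * 100_000 for i in range(8)]
    scattered = [500_000, 7_200_000, 9_500_000]
    pvals = empirical_pvalues(cluster + scattered, chrom_len=10_000_000, n_perm=200)
    best = min(range(len(pvals)), key=pvals.__getitem__)
    print(f"most enriched window starts at {best * 100_000:,} bp, p = {pvals[best]:.4f}")
```

Note that the smallest p-value this scheme can report is 1 / (n_perm + 1), so reaching genome-wide significance requires a correspondingly large number of permutations.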
Remarkably, there was one locus that had an enrichment of breakpoints beyond simulated expectations at genome-wide significance. This locus, 5q14.3, had eight independent de novo breakpoints in our cohort within a 1Mb span, and all localized within a TAD containing the gene MEF2C, the proposed driver gene of the 5q14.3 microdeletion syndrome. Claire searched the literature and turned up another three de novo BCA breakpoints in this same TAD, for a total of 11 de novo BCA breakpoints in the MEF2C TAD, astoundingly 10 of which were non-coding. Claire next ran an analysis using Human Phenotype Ontology terms for these patients to ask how similar their clinical presentations were, and indeed found that the patients' phenotypes were significantly more similar than expected by chance, and additionally that these phenotypes largely matched those previously described with the 5q14.3/MEF2C microdeletion syndrome. Finally, in the four cell lines we had available from patients with intergenic 5q14.3 breakpoints, qPCR confirmed our suspicions: MEF2C expression was reduced by close to 50%, which is consistent with expectations from a heterozygous loss-of-function mutation, as observed in cases of 5q14.3 microdeletion syndrome.
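For intuition on what a "close to 50%" reduction looks like in raw qPCR terms, here is a small sketch using the common 2^-ΔΔCt calculation. I'm not claiming this is the exact quantification used in the paper; it assumes roughly 100% amplification efficiency, and the Ct values below are invented purely for illustration.

```python
def relative_expression(ct_target_patient, ct_ref_patient,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene (patient vs. control) by 2^-ddCt.

    Each Ct is first normalized to a reference (housekeeping) gene measured
    in the same sample; perfect doubling per cycle is assumed.
    """
    delta_patient = ct_target_patient - ct_ref_patient
    delta_control = ct_target_control - ct_ref_control
    return 2 ** -(delta_patient - delta_control)


if __name__ == "__main__":
    # Hypothetical values: the target crosses threshold one cycle later in
    # the patient, with identical reference-gene Ct in both samples.
    fold = relative_expression(25.0, 18.0, 24.0, 18.0)
    print(f"relative expression: {fold:.2f}")
```

In other words, losing one of two functional copies shows up as a shift of only about one PCR cycle, which is why replicates and a stable reference gene matter so much for this kind of call.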
Based on these data, we concluded that recurrent de novo inversion and translocation breakpoints in this 5q14.3 TAD containing MEF2C are a surprisingly frequent non-coding causal mechanism behind the syndrome previously associated with 5q14.3 microdeletions. Due to our relatively limited sample size, MEF2C was the only locus where breakpoint pileups reached genome-wide significance; however, there were three other loci that each featured ≥4 breakpoints falling in a TAD containing a likely syndromic gene (FOXG1, SATB2, and SYNCRIP), and another four subjects with one-off disruptions of TADs containing similarly likely candidate genes (PITX2, SLC2A1, SOX9, and SRCAP). The case of FOXG1 in particular is an interesting one, because all four independent de novo BCA breakpoints were non-coding and predicted to fall not only within the same TAD as FOXG1, but within the same sub-TAD loop.
Taken together, these four TADs with likely haploinsufficient or syndromic disease-associated genes and recurrent de novo BCA breakpoints underscore the potential effects that strictly non-coding SV can have on the regulatory 3D architecture of the genome and, through it, on gene expression and disease.
The Prognosis on Prenatal Diagnosis
In our second study, led by Dr. Zehra Ordulu and our collaborators in the Morton lab at Brigham & Women's Hospital, we undertook a case study on the roles of de novo BCAs in a series of ten cases of prenatal diagnostic sequencing. It goes without saying that prenatal diagnostics is a controversial and very serious area of research, as some parents make decisions on whether or not to take their pregnancy to term based on genetic evidence, so any improvements in interpreting clinical genetic findings might help improve reproductive counseling, as long as these new findings are couched in appropriate degrees of uncertainty and caution. In a surprisingly large fraction of cases of cytogenetically de novo BCAs, the variant is of unknown significance (aka VUS), so prior to WGS parents are left with an estimated 2-to-3-fold increase in the risk of congenital anomalies in the presence of a de novo BCA. In these ten cases, we applied WGS to fine-map their BCA breakpoints using the same approach as we did in Claire's paper, and the Morton lab interpreted the significance of the involved loci: of the 8/10 patients who were carried to term, thus permitting a comparison of predictions to postnatal phenotypes, all outcomes matched prenatal predictions from WGS evidence (5/8 untoward, 3/8 healthy). Interestingly, 2/5 untoward cases were determined to be attributable (at least in part) to non-coding disruption of TADs containing haploinsufficient loci (SOX9 and MTM1), the TAD-associated regulatory architecture for one of which (SOX9) has been previously studied in detail as a source of disease risk (Franke et al., Nature, 2016).
While this prenatal case series demonstrated that it is possible in practice to interpret non-coding TAD disruptions by SV in a clinical context, there remain a number of major challenges before such an approach can be routinely implemented. Most cytogenetically identified SVs will disrupt one or more TAD boundaries, meaning this non-coding "TAD-sundering" mechanism of gene dysregulation could in theory occur quite frequently. However, each TAD often contains multiple genes, and based on simple probability the vast majority of these genes will not make a large contribution--if any--to an individual patient's phenotype. In this study, Zehra cleverly used a haploinsufficiency (HI) score to index which genes in a given TAD are least likely to tolerate reduced expression in a healthy human, but these HI scores are predictions and will not be ironclad in all cases. A clinical geneticist could consider layering many such annotations to refine their confidence in a particular locus, but the second (and perhaps larger) problem is knowing whether or not a TAD disruption will actually result in reduced expression of the gene of interest. In the case of SOX9 here, qPCR clearly showed reduced expression in patient tissue, but without access to patient RNA this prediction becomes an extremely risky one, especially given the radically different expression profiles of various tissues.
One potential solution to this problem would be to demonstrate statistical association of repeated disruptions of a given TAD with a particular phenotype (like the MEF2C example from Claire's paper), but this process will require extremely large-scale sequencing of affected cases and controls to achieve the statistical power necessary to determine any associations and the risks conferred by those associations. Finally, many BCAs are more complex than simple translocations or inversions, and instead involve three or more breakpoints, sometimes including multiple chromosomes and dozens of genes and TADs. In those cases, this interpretive challenge rapidly evolves into an interpretive nightmare, and it seems unlikely we will ever observe the same subset of genes/TADs disrupted recurrently in enough independent patients and controls to derive any statistical association. Despite these future challenges, this study provided a proof-of-concept in clinical practice, and opens an exciting new area of research in the prenatal diagnostics of SV.
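The gene-prioritization step described above (find the TAD hit by a breakpoint, then rank the genes inside it by predicted haploinsufficiency) can be sketched in a few lines of Python. Everything here is illustrative: the coordinates, gene names, and scores are made up, I'm assuming a higher `hi_score` means a gene is less likely to tolerate reduced dosage, and this is not the pipeline used in the paper.

```python
from typing import NamedTuple, Optional


class Tad(NamedTuple):
    chrom: str
    start: int
    end: int


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    hi_score: float  # assumption: higher = more likely haploinsufficient


def tad_containing(chrom: str, pos: int, tads: list[Tad]) -> Optional[Tad]:
    """Return the first TAD whose half-open interval contains the breakpoint."""
    for tad in tads:
        if tad.chrom == chrom and tad.start <= pos < tad.end:
            return tad
    return None


def rank_candidates(chrom: str, pos: int, tads: list[Tad], genes: list[Gene]):
    """List genes sharing a TAD with the breakpoint, highest HI score first.

    Each entry is (gene name, HI score, directly_disrupted), where the last
    field flags genes whose own interval contains the breakpoint.
    """
    tad = tad_containing(chrom, pos, tads)
    if tad is None:
        return []
    in_tad = [g for g in genes
              if g.chrom == tad.chrom and g.start < tad.end and g.end > tad.start]
    ranked = sorted(in_tad, key=lambda g: g.hi_score, reverse=True)
    return [(g.name, g.hi_score, g.start <= pos < g.end) for g in ranked]


if __name__ == "__main__":
    tads = [Tad("chr1", 0, 1_000_000), Tad("chr1", 1_000_000, 2_500_000)]
    genes = [
        Gene("GENE_A", "chr1", 1_100_000, 1_200_000, 0.15),
        Gene("GENE_B", "chr1", 1_800_000, 1_900_000, 0.92),
        Gene("GENE_C", "chr1", 2_600_000, 2_700_000, 0.99),  # outside the hit TAD
    ]
    for name, score, disrupted in rank_candidates("chr1", 1_500_000, tads, genes):
        print(name, score, "directly disrupted" if disrupted else "non-coding hit")
```

A ranking like this only narrows down which gene to worry about; as noted above, it says nothing about whether expression of that gene actually changes in the relevant tissue.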
Seeing the Forest for the Trees
One thing is for certain: these and other recent studies have plainly wagged an accusatory finger at TAD-rewiring SVs as a surprisingly frequent source of dysregulated gene expression, and have shown that these explicitly non-coding lesions can result in functional outcomes identical to those of coding loss-of-function mutations, leading to disease states with phenotypes that match previously described syndromes caused by direct truncation of haploinsufficient coding loci. For a few loci (MEF2C, FOXG1, SYNCRIP, SATB2), we are beginning to observe an emergence of recurrent non-coding SV breakpoints in cases with similar phenotypes, which holds the thrilling promise of following in the footsteps of the decades of foundational work in the association of de novo coding point mutations with many human diseases. Moreover, even in isolated clinical cases, a veteran eye can spot a potentially adverse TAD disruption from these SV and predict pathogenicity, especially when patient-derived RNA is available for confirmation. All in all, I'm of the opinion that these findings are valuable advances for our field, both conceptually and practically.
However, I want to close this commentary by stressing how far we are from deconvolving the ludicrously complex relationship between genome structure and function. Alterations to genome structure (i.e. SV) are not remotely the n=1 types of rare events they're sometimes unjustly perceived to be: to be sure, the massive de novo translocations and inversions studied here are undeniably rare events, but the average diploid human genome harbors tens of thousands of SV (≥50bp), which collectively impact nearly 1% of all nucleotides in every individual's genome on earth. Each of these SV, almost all of which are orders of magnitude smaller than the large BCAs we studied here, has the chance to disrupt a TAD boundary and fuse or sunder TADs containing crucial regulatory networks driving expression of influential genes. With this realization comes the unexpected side-dish of harsh reality: we have no idea how these pieces fit together. At specific and intensively studied loci, sure; in aggregate individual genomes, no way. The notion that there is a "wild type" human genome structure is uninformed at best (and potentially offensive at worst!), so delineating the relationship between an individual genome's structural organization of nucleotides, how the chromosomes fold in 3D space, and how that impacts gene expression and eventually phenotypes is a titanic and complex challenge. It's not all doom-and-gloom, though: to the contrary, it strikes me as an extremely exciting prospect. To interpret suspicious SV found in individual genomes, we'll need population-scale sequencing for benchmarking: as if right on cue, projects like the Genome Aggregation Database (gnomAD) are coming online to make these kinds of statistical analyses a real possibility. 
Finding associations between SV, TADs, and disease phenotypes is just the tip of the iceberg, too--it's somewhat like a pathfinding exercise to feed loci of putative significance into molecular and cellular biological modeling pipelines around the world to decrypt the mechanisms of pathogenesis, with the hope of someday finding cures or treatments for these diseases. There is still so much to learn about the structural-functional relationship of the human genome, and I'm eager to see how this story continues to play out over the coming years.
• Redin C, Brand H, Collins RL, Kammin T, Mitchell E, Hodge JC, Hanscom C, Pillalamarri V, Seabra CM, Abbott M, Abdul-Rahman OA, Aberg E, Adley R, Alcaraz-Estrada SL, Alkuraya FS, An Y, Anderson M, Antolik C, Anyane-Yeboa K, Atkin JF, Bartell T, Bernstein JA, Beyer E, Bongers EMHF, Brilstra EH, Brown CW, Brüggenwirth HT, Callewaert B, Corning K, Cox H, Cuppen E, Currall BB, Cushing T, David D, Deardorff MA, Dheedene A, D’hooghe M, de Vries BBA, Earl DL, Ferguson HL, Fisher H, FitzPatrick DR, Gerrol P, Giachino D, Glessner JT, Gliem T, Grady M, Graham BH, Griffis C, Gripp KW, Gropman AL, Hanson-Kahn A, Harris DJ, Hayden MA, Hochstenbach R, Hoffman JD, Hopkin RJ, Hubshman MW, Innes AM, Irons M, Irving M, Janssens S, Jewett T, Johnson JP, Jongmans MC, Kahler SG, Koolen DA, Korzelius J, Kroisel PM, Lacassie Y, Lawless W, Lemyre E, Leppig K, Levin AV, Li H, Li H, Liao EC, Lim C, Lose EJ, Lucente D, Macera MJ, Manavalan P, Mandrile G, Marcelis CL, Margolin L, Mason T, Masser-Frye D, McClellan MW, Mendoza CZ, Menten B, Middelkamp S, Mikami LR, Moe E, Mohammed S, Mononen T, Mortenson ME, Moya G, Nieuwint A, Ordulu Z, Parkash S, Pauker SP, Pereira S, Perrin D, Phelan K, Piña-Aguilar RE, Poddighe PJ, Pregno G, Raskin S, Reis L, Rhead W, Rita D, Renkens I, Roelens F, Ruliera J, Rump P, Schilit SLP, Shaheen R, Sparkes R, Spiegel E, Stevens B, Stone MR, Tagoe J, Thakuria JV, van Bon BW, van de Kamp J, van der Burgt I, van Essen T, van Ravenswaaij-Arts CM, van Roosmalen MJ, Vergult S, Volker-Touw CML, Warburton DP, Waterman MJ, Wiley S, Wilson A, Vega M, Zori RT, Levy B, Brunner HG, de Leeuw N, Kloosterman WP, Thorland EC, Morton CC, Gusella JF, Talkowski ME. The genomic landscape of balanced cytogenetic abnormalities associated with human congenital anomalies. Nature Genetics (2016), online ahead of print. PMID: TBD. DOI: 10.1038/ng.3720.
• Ordulu Z, Kammin T, Brand H, Pillalamarri V, Redin CE, Collins RL, Blumenthal I, Hanscom C, Pereira S, Bradley I, Crandall BF, Gerrol P, Hayden MA, Hussain N, Kanengisser-Pines B, Kantarci S, Levy B, Macera MJ, Quintero-Rivera F, Spiegel E, Stevens B, Ulm JE, Warburton D, Wilkins-Haug LE, Yachelevich N, Gusella JF, Talkowski ME, Morton CC. Structural chromosome rearrangements require nucleotide level resolution: lessons from next-generation sequencing in prenatal diagnosis. American Journal of Human Genetics (2016), 99(5): 1015-1033. PMID: 27745839. DOI: 10.1016/j.ajhg.2016.08.022.