We recently posted a medRxiv preprint on our large-scale data aggregation efforts for rare copy-number variants (rCNVs) across human diseases and disorders.
It's hard to believe, but this study actually started over five years ago as a final project in MIT's Quantitative Genomics course co-taught by Profs. Leonid Mirny and Shamil Sunyaev during my first semester as a PhD student back in the fall of 2016. In a literal sense, this project has spanned all five years of my graduate school experience.
I covered several of the main scientific findings in a thread on Twitter, so instead I thought I'd use this blog post to summarize three lessons I've learned over the course of this study, which I believe are indicative of larger trends in human genetics & genomics for the coming decade.
1. Data generation is no longer the (only) rate-limiting step in human genomic research. This observation might seem trite to those leading the coming-of-age of pan-phenotype common variant GWAS, but it is now obvious to me that the human genetics research community has far more data in hand than we have fully analyzed. The UK Biobank is a paragon of this claim: the widespread availability of a >400,000-person cohort with multi-modal genetic/imaging data paired with deep phenotyping has provided a seemingly endless pool of hypotheses to test. Not surprisingly, the UK Biobank was the largest single cohort included in our recent rCNV study. Several biobanks of comparable scale and scope are already available or imminently emerging, such as the NIH AllOfUs program and various hospital system-based biobanks (e.g., BioVU from Vanderbilt, BioMe from Mt. Sinai); these will be invaluable for greater statistical power, independent replication, and more uniform representation across demographic groups. However, I would argue that an equally important focus for the next decade in biomedical research should be improving the methods and algorithms we use to extract the full potential of the biological and medical insights lurking in existing datasets. Bigger and better datasets can only provide us with so much new information without corresponding improvements in how we analyze that data... and as sample sizes continue to balloon, the computational details of the "how" are quickly becoming paramount.
While the secondary analysis of existing datasets may constitute "research parasitism" to some (and be lauded by others for that very reason!), it is undeniable that we as a community can generate scientific advances without generating any new data. A recent example of this is last month's preprint from Po-Ru Loh's & Steve McCarroll's labs at Harvard: by developing sophisticated algorithms for imputing the complex allelic distributions found at variable-number tandem repeats (VNTRs), they were able to identify several human traits where VNTRs made surprisingly strong genetic contributions. It is amazing that we have known about VNTR polymorphisms for nearly four decades, and the FBI and other law enforcement agencies famously use VNTR genotyping in genetic forensics (i.e., DNA fingerprinting)... and yet these types of fundamental biological insights have only recently become possible thanks to the careful analysis of existing large-scale datasets.
With all of this said, I don't mean to understate the transformative value of past and present efforts to generate large-scale public datasets. Nor do I intend to discourage the future generation of new datasets, especially those (a) filling an unmet scientific need, (b) improving representation of historically marginalized populations, or (c) applying state-of-the-art emerging technologies. Indeed, more data clearly is still needed in several disciplines, such as to comprehensively map the biochemical features of the noncoding genome across all human tissues and cell types. Similarly, experimental genetic perturbations in model systems (and their corresponding readouts) will be required on a titanic scale to understand the function of the many millions of human genetic variants. However, looking forward to the future of human genetics and genomics, I nevertheless contend that thoughtful hypotheses, sophisticated statistics, efficient algorithms, and harmonization of existing datasets are now nearly as important as primary data generation. There is still much to be discovered from existing, published datasets.
2. Collaborations are integral to large-scale genomics. Generous international collaborators were some of the most vital ingredients in the recipe for both this study and our recent efforts to map structural variants in the Genome Aggregation Database (gnomAD). Neither of these studies would have happened without international collaboration. In our rCNV study, for example, we were granted access to ~200k samples from the Children's Hospital of Philadelphia, ~400k samples from the UK Biobank, ~25k samples from the Epi25 Consortium, and ~10k samples from clinical genetic testing laboratories (GeneDx, Indiana University). Simply put: while it is theoretically possible for large-scale genomics research to be conducted by single labs muscling through sample recruitment, genotyping/sequencing, and analysis on their own, it is way, way, way, way easier (and faster) to develop a network of like-minded scientists with aligned goals. In 2021, it seems unrealistic to think you can accomplish more on your own than as part of a larger collaborative team; while I make this claim specifically about human genomics research, I believe it is widely generalizable across disciplines and even outside of academic research. To that end, I have been extremely lucky to train in a lab that approaches science with a team-first mentality: we are generous with our data and results, emphasize the wide distribution of credit, and are always looking to work with other groups with similar approaches to science. It makes the process more efficient, fun, and productive.
3. Everyone wins in data sharing. Over the last five years, I have developed a deep appreciation for—and commitment to—open data sharing, especially in advance of formal publication. While the value of data sharing is self-evident to people accessing public datasets, less has been written on the benefits of data sharing for the people doing the sharing. I believe there are three main reasons to share your data. First, it makes your science better. Users of your data will invariably spot errors, interesting patterns, or exciting results that you had missed (or never even considered!). Why wouldn't you want feedback that can make your science stronger? Second, open data sharing provides a natural mechanism for interacting with others in the scientific community and developing professional relationships. Exchanging emails with a stranger about a dataset or preprint you posted can lead to new revelations, new colleagues, and even new collaborations. The field of human genetics is filled with many tens of thousands of talented, hard-working people pushing towards common goals, but the ability for the average trainee or staff scientist (i.e., non-PI) to meet new members of the human genetics community outside of their own institution tends to be restricted to conferences, consortia, or other large meetings. Open data sharing provides a mechanism for interacting with new members of our community at an exceedingly low barrier to entry: all you need to do is share your data, protocols, code, or results somewhere on the internet alongside your contact information. The value of such virtual professional interactions is even more apparent in today's era of COVID isolation. Third, data sharing satisfies the Golden Rule of science: treat others as you wish to be treated. In science, if you want to access others' results or data, then you should probably share your own!
As an example: in our rCNV preprint, we were able to easily access CNV data from numerous previous publications totaling >100k individuals, and made extensive use of an amazing catalog of de novo coding point mutations from exome sequencing in >30k developmental disorder patients from the Deciphering Developmental Disorders (DDD) consortium. While the DDD study has since been published in Nature, they openly shared their entire de novo mutation dataset a full year prior to publication. In so doing, they allowed researchers (like us) to access their data sooner, catalyzing secondary analyses and more discoveries. In turn, we have followed suit by making our genome-wide rCNV disease association summary statistics available via medRxiv prior to publication. In summary: pay it forward. Share your data. It will help you, people you know, and many people you don't even know (yet)!