Interpreting exomes and genomes: a beginner’s guide
  Daniel MacArthur
  Analytic and Translational Genetics Unit, Massachusetts General Hospital
  Broad Institute of Harvard and MIT
  www.macarthurlab.org
  Twitter: @dgmacarthur
Overview
  Fundamentals of next-generation sequencing
  Genomes, exomes and targeted panels
  Genomic diagnosis: how do we filter causal variants from a patient’s entire genome?
  Major challenges for NGS diagnosis
Next-generation sequencing
  Many different technologies: Illumina, Pacific Biosciences, Oxford Nanopore
  DNA is chopped into fragments and many fragments are read at the same time: massively parallel sequencing
Sequencing yields billions of reads per run
  What does the data “look like”?
  The machines generate fragments of DNA sequence; depending on the application, these can be 75 to 150 bp long.
  Our reads are paired, so we can read in from each end of the library fragment.
  Example reads: TTTGAACTTTCATAG CGTTACGGCAGACG GGGACATATTCGAAAT ACGGGATGTACG TAGACATAGACGACT GGGATGTACGAA GTACTGACCAG GACCAGTAGAC GACATAGACGACT CCAGTAGACATA ACGAGCCGTAGCTA TTTGACGGGATG GGGATGTACGA CGAGCCGTAGCTA AGACGACTTTGAC ATAGACGACTTTGA GGGATGTATGAG GGGATGTACGAG TACGAGCCGTA TGTACGAGCCGTA
Compare the reads to a reference genome
  As part of our data processing we then compare these reads to a reference genome: human, or any other reference that is applicable.
  Reference: GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA
  Reads: ACGGGATGTACG TAGACATAGACGACT GGGATGTACGAA GTACTGACCAG GACCAGTAGAC GACATAGACGACT CCAGTAGACATA ACGAGCCGTAGCTA TTTGACGGGATG GGGATGTACGA CGAGCCGTAGCTA AGACGACTTTGAC ATAGACGACTTTGA GGGATGTATGAG GGGATGTACGAG TACGAGCCGTA TGTACGAGCCGTA
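To make the comparison step concrete, here is a minimal sketch of placing reads on the reference from the slide. It is a toy, not how production aligners work: it slides each read along the reference without gaps and keeps the placement with the fewest mismatches. The function name `map_read` and the `max_mismatches` parameter are assumptions made for this illustration.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (start, mismatches) for the best ungapped placement, or None.

    Toy illustration only: tries every start position and counts mismatches.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches > max_mismatches:
            continue
        # Keep the first placement with the lowest mismatch count.
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


# Reference and reads taken from the slide; positions are 0-based.
reference = "GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA"
reads = ["GTACTGACCAG", "TAGACATAGACGACT", "GGGATGTATGAG"]

for read in reads:
    print(read, map_read(read, reference))
```

The last read carries one mismatch against the reference, which is exactly the kind of difference a variant caller later has to classify as a real variant or a sequencing error.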
NGS allows us to sample each sequence position many times over
  Reference: GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA
  Reads: TAGACATAGACGACT ACGGGATGTATG GTACTGACCAG GGGATGTATGA TTTGACGGGATG ATGAGCCGTAGCTA GACCAGTAGAC GTACGAGCCGTA CCAGTAGACATA TGAGCCGTAGCTA GACATAGACGACT GGGATGTATGAG GGGATGTACGAG ATAGACGACTTTGA AGACGACTTTGAC TACGAGCCGTA TGTACGAGCCGTA
  Variant: C -> T (5 C / 5 T)
  Challenges: mapping short reads; variable coverage; base calling quality
  Accuracy tends to be worse for insertions and deletions than for SNPs.
  With NGS we are able to sample the position many times, so here we have many looks at this mutation, and that gives us confidence in the call.
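The idea of "many looks" at one position can be sketched as a pileup count. This is a minimal sketch under stated assumptions: the reads are a subset of those on the slide, their 0-based start positions were worked out by hand against the reference, and the helper names `pileup_counts` and `allele_balance` are invented for this example.

```python
from collections import Counter


def pileup_counts(alignments, position):
    """Count the bases that aligned reads place at a 0-based reference position."""
    counts = Counter()
    for start, seq in alignments:
        offset = position - start
        if 0 <= offset < len(seq):  # skip reads that do not cover the position
            counts[seq[offset]] += 1
    return counts


def allele_balance(counts, ref_base, alt_base):
    """Fraction of ref/alt-supporting reads that carry the alternate base."""
    depth = counts[ref_base] + counts[alt_base]
    return counts[alt_base] / depth if depth else None


# (start, read) pairs; start positions are an assumption of this sketch.
alignments = [
    (29, "ACGGGATGTATG"),
    (31, "GGGATGTATGA"),
    (31, "GGGATGTATGAG"),
    (38, "ATGAGCCGTAGCTA"),
    (31, "GGGATGTACGAG"),
    (35, "TGTACGAGCCGTA"),
    (36, "GTACGAGCCGTA"),
    (37, "TACGAGCCGTA"),
    (25, "TTTGACGGGATG"),  # ends before the variant site
]

counts = pileup_counts(alignments, position=39)
print(dict(counts), allele_balance(counts, ref_base="C", alt_base="T"))
```

A balance near 0.5 is what a heterozygous site is expected to look like; a site with a single alternate read among many reference reads is more likely to be an error.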
Which technology to choose?
  Technology | Percent of Genome Sequenced
  Whole Genome Sequencing | >95%
  Whole Exome Sequencing | ~1.5% (protein-coding regions)
  Targeted Sequencing | 0.005% - 0.1% (100s to 1000s of genes)
  The comparison also covers Cost and Depth of Coverage.
  High-level overview of the types of sequencing:
  WGS = complete DNA sequence of the person/organism
  Exome = all mutations in all exons PLUS other variations (such as small insertions/deletions)
  Targeted = all mutations in all targeted exons PLUS other variations (such as small insertions/deletions); used on a large collection of genes
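The trade-off between how much of the genome is sequenced and depth of coverage can be sketched with simple arithmetic: mean depth is roughly (number of reads x read length) / target size. The genome size below is an approximation, the read count is a made-up number, and the calculation assumes every read lands on target, which real capture experiments do not achieve.

```python
GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate haploid human genome size


def target_size_bp(fraction_of_genome):
    """Size of the sequenced target, given the fraction of the genome it covers."""
    return fraction_of_genome * GENOME_SIZE_BP


def mean_depth(n_reads, read_length_bp, target_bp):
    """Average number of times each target base is read (idealised)."""
    return n_reads * read_length_bp / target_bp


# Hypothetical run: the same 40 million 100 bp reads spent on different targets.
n_reads, read_length = 40_000_000, 100
for name, fraction in [("genome", 0.95), ("exome", 0.015), ("panel", 0.001)]:
    target = target_size_bp(fraction)
    depth = mean_depth(n_reads, read_length, target)
    print(f"{name}: ~{target / 1e6:.1f} Mb target, mean depth ~{depth:.0f}x")
```

The same sequencing effort gives much deeper coverage on a smaller target, which is one reason exomes and panels are cheaper per sample at a given depth.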
Targeted sequencing
The problem with exome data
  Clinically and genetically heterogeneous conditions
  x 30,000 rows
Sifting signal from noise in exomes
  Every genome contains many rare, potentially functional variants:
    ~500 rare missense variants
    ~100 LoF variants: ~20 homozygous, ~20 rare
    ~100 rare variants in known disease genes
    5-10 recessive disease-causing mutations
    1-2 de novo coding mutations
    sequencing errors
  In Mendelian disease patients we need to find 1-2 true causal mutations amidst this “noise”.
How do we find pathogenic variants?
  Is the variant a known pathogenic variant? How much evidence supports the claim of pathogenicity?
  Is the variant rare?
  Is it predicted to have a functional impact (change a protein sequence)?
  Does it segregate with disease?
  Is the gene associated with the disease?
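The questions above can be read as a filter followed by a ranking. The sketch below is one hedged way to express that, not a clinical pipeline: the variant record fields (`pop_af`, `consequence`, `segregates`, `known_pathogenic`), the consequence labels, the placeholder gene names and the 0.1% frequency cutoff are all assumptions chosen for illustration.

```python
# Assumption: consequence labels that count as "protein-altering" in this sketch.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}


def is_candidate(variant, max_af=0.001):
    """Keep variants that are rare, protein-altering and segregate with disease."""
    rare = variant["pop_af"] <= max_af
    functional = variant["consequence"] in PROTEIN_ALTERING
    segregates = variant.get("segregates", True)  # unknown segregation is kept
    return rare and functional and segregates


def prioritise(variants, disease_genes, max_af=0.001):
    """Rank candidates: known pathogenic first, then known disease genes, then rarest."""
    candidates = [v for v in variants if is_candidate(v, max_af)]
    return sorted(
        candidates,
        key=lambda v: (
            not v["known_pathogenic"],
            v["gene"] not in disease_genes,
            v["pop_af"],
        ),
    )


# Hypothetical records with placeholder gene names.
variants = [
    {"gene": "GENE_A", "consequence": "missense", "pop_af": 0.0001, "known_pathogenic": False, "segregates": True},
    {"gene": "GENE_B", "consequence": "synonymous", "pop_af": 0.00001, "known_pathogenic": False, "segregates": True},
    {"gene": "GENE_C", "consequence": "frameshift", "pop_af": 0.05, "known_pathogenic": False, "segregates": True},
    {"gene": "GENE_D", "consequence": "nonsense", "pop_af": 0.0, "known_pathogenic": True, "segregates": True},
    {"gene": "GENE_A", "consequence": "missense", "pop_af": 0.0002, "known_pathogenic": False, "segregates": False},
]

ranked = prioritise(variants, disease_genes={"GENE_A"})
print([v["gene"] for v in ranked])
```

The ordering of the sort key is a design choice; a real workflow would also weigh how strong the evidence for each "known pathogenic" claim is.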
Making sense of one genome requires tens of thousands of genomes
  More than 500K exomes and 50K genomes have been sequenced worldwide, but these data are siloed by project and inconsistently processed.
Exome Aggregation Consortium (ExAC)
  exac.broadinstitute.org
  [Figure: populations (Latino, African, European, South Asian, East Asian, Other) compared across 1000 Genomes, ESP and ExAC]
Value of reference databases
  Provide variant frequency in a large population (either healthy or “reference”, i.e. a population sample)
  Provide frequency across multiple human populations
  Allow us to assess how many variants we see in a particular gene
  Provide an unbiased estimate of variant penetrance
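Frequency across multiple populations can be sketched as allele count divided by the number of chromosomes genotyped, computed per population. The counts below are invented for illustration, and the helper names `allele_frequency` and `max_population_af` are assumptions of this sketch rather than any database's API.

```python
def allele_frequency(allele_count, allele_number):
    """Alternate allele count divided by the number of chromosomes genotyped."""
    return allele_count / allele_number if allele_number else None


def max_population_af(counts_by_population):
    """Return (population, frequency) for the population where the variant is most common."""
    observed = {}
    for population, (allele_count, allele_number) in counts_by_population.items():
        af = allele_frequency(allele_count, allele_number)
        if af is not None:  # skip populations with no genotyped chromosomes
            observed[population] = af
    if not observed:
        return None, None
    top = max(observed, key=observed.get)
    return top, observed[top]


# Hypothetical (allele count, allele number) pairs for one variant.
counts = {
    "European": (2, 60_000),
    "South Asian": (40, 16_000),
    "East Asian": (0, 8_000),
}

population, af = max_population_af(counts)
print(population, af)
```

Looking at the highest frequency in any single population, rather than the pooled frequency, helps catch variants that look rare overall but are too common in one population to cause a rare Mendelian disease.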
Lessons from ExAC
  Many “healthy” people carry apparently disease-causing variants
    over 20,000 reported disease variants are seen in our “healthy” samples
    average ~2/person after filtering
  What’s causing this?
    carriers of recessive variants
    some undiagnosed disease cases
    lots of false positive variants (20-25%)
Databases of disease mutations
  Drawn from literature collected over decades with variable standards
  Five years ago: no large frequency databases, so any rare protein-altering variant was assumed to be causal
  New databases are more careful about evidence
xBrowse: Rapid exploration of multiple inheritance patterns https://atgu.mgh.harvard.edu/xbrowse/
xBrowse: Filtering by function and frequency https://atgu.mgh.harvard.edu/xbrowse/
xBrowse: Digestible information for all candidate variants and genes https://atgu.mgh.harvard.edu/xbrowse/
xBrowse: Digestible information for all candidate variants and genes exac.broadinstitute.org
xBrowse: Following up candidate genes with external resources https://atgu.mgh.harvard.edu/xbrowse/
The big (largely) unsolved challenges
  NGS data still misses a non-trivial number of genetic variants, and still contains errors
  Our reference databases are still missing many populations
  Uncertainty even about “known” pathogenic variants in databases
  For many variants, penetrance is not robustly established
  Huge difference between interpretation in “healthy” and “disease” samples