Interpreting exomes and genomes: a beginner’s guide

Daniel MacArthur
Analytic and Translational Genetics Unit, Massachusetts General Hospital
Broad Institute of Harvard and MIT

Overview
Fundamentals of next-generation sequencing
Genomes, exomes and targeted panels
Genomic diagnosis: how do we filter causal variants from a patient’s entire genome?
Major challenges for NGS diagnosis

Next-generation sequencing
Many different technologies: Illumina, Pacific Biosciences, Oxford Nanopore
DNA is chopped into fragments, and many fragments are read at the same time (massively parallel sequencing)

Sequencing yields billions of reads per run
What does the data “look like”?
[Figure: example short reads, e.g. TTTGAACTTTCATAG, CGTTACGGCAGACG, GGGACATATTCGAAAT, ACGGGATGTACG]
The machines generate fragments of DNA sequence (reads); depending on the application these can be 75 to 150 bp long.
Our reads are paired, so we read in from each end of the library fragment.

Compare the reads to a reference genome
Reference: GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA
[Figure: short reads such as GTACTGACCAG, TAGACATAGACGACT and TTTGACGGGATG aligned beneath the matching stretch of the reference]
As part of our data processing we compare these reads to a reference genome: human, or any other reference that is applicable.
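
As a rough illustration of what comparing reads to a reference involves, the sketch below slides each read along the example reference from this slide and keeps the placement with the fewest mismatches. The function name `best_alignment`, the one-mismatch limit and the all-T test read are assumptions made for this example; real aligners use indexed, gap-aware algorithms rather than this brute-force scan.

```python
REFERENCE = "GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA"


def best_alignment(read, reference, max_mismatches=1):
    """Return (offset, mismatches) for the best ungapped placement, or None."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches > max_mismatches:
            continue  # too different to count as a placement
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


# Three reads from the slides plus one made-up read that should not map.
reads = ["GTACTGACCAG", "TAGACATAGACGACT", "GGGATGTATGAG", "TTTTTTTTTTT"]
for read in reads:
    print(read, best_alignment(read, REFERENCE))
```

Offsets are 0-based. A read that differs from the reference at one base still maps, which is what later allows the difference to be reported as a variant.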

NGS allows us to sample each sequence position many times over
Reference: GTACTGACCAGTAGACATAGACGACTTTGACGGGATGTACGAGCCGTAGCTA
[Figure: aligned reads stacked over the reference; at one position the reference C is read as T in half of the reads: C -> T (5 C / 5 T)]
With NGS we sample each position many times, so here we have many looks at this mutation, and that depth gives us confidence in the call.
Challenges:
Mapping short reads
Variable coverage
Base calling quality
These tend to be worse for insertions and deletions than for SNPs
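
The idea of taking many looks at one position can be sketched as a simple pileup count. The example below assumes the reads have already been aligned, uses a subset of the reads from the slides with their 0-based start offsets on the example reference, and counts the bases observed at the C -> T site (index 39 in that reference). The function name `allele_counts` is hypothetical, and real callers also weigh base and mapping quality, which this sketch ignores.

```python
from collections import Counter


def allele_counts(alignments, position):
    """Count the bases that aligned reads show at one 0-based reference position."""
    counts = Counter()
    for start, read in alignments:
        if start <= position < start + len(read):
            counts[read[position - start]] += 1
    return counts


# (0-based start on the example reference, read sequence)
alignments = [
    (11, "TAGACATAGACGACT"),  # ends before the variant site
    (29, "ACGGGATGTATG"),
    (31, "GGGATGTATGA"),
    (31, "GGGATGTATGAG"),
    (31, "GGGATGTACGAG"),
    (36, "GTACGAGCCGTA"),
    (37, "TACGAGCCGTA"),
]

counts = allele_counts(alignments, position=39)
depth = sum(counts.values())
alt_fraction = counts["T"] / depth
print(dict(counts), "depth:", depth, "T fraction:", alt_fraction)
```

A T fraction near one half, seen across several independent reads, is the pattern expected for a heterozygous site; a single discordant read among many would look more like a sequencing error.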

Which technology to choose?
Technology                 Percent of genome sequenced                 Cost    Depth of coverage
Whole genome sequencing    >95%
Whole exome sequencing     ~1.5% (protein-coding regions)
Targeted sequencing        0.005% - 0.1% (100s - 1000s of genes)
High-level overview of the types of sequencing:
WGS = complete DNA sequence of the person/organism
Exome = all mutations in all exons, plus other variation (such as small insertions/deletions)
Targeted = all mutations in all targeted exons, plus other variation (such as small insertions/deletions); used on a large collection of genes
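
To make the trade-off concrete, the back-of-the-envelope sketch below converts the fractions above into approximate target sizes and shows how average depth rises when the same amount of sequencing is spent on a smaller target. The genome size of about 3.1 billion bases and the 100-gigabase yield are illustrative assumptions, not figures from the slides, and the calculation ignores off-target reads and uneven capture.

```python
GENOME_SIZE_BP = 3.1e9  # assumed approximate human genome size

# Fraction of the genome covered; ">95%" is entered as its lower bound and
# the targeted panel as the upper end of the 0.005% - 0.1% range.
FRACTIONS = {
    "whole genome": 0.95,
    "whole exome": 0.015,
    "targeted panel": 0.001,
}


def target_size_bp(fraction, genome_size=GENOME_SIZE_BP):
    """Number of bases the assay aims to cover."""
    return fraction * genome_size


def mean_depth(yield_bp, fraction, genome_size=GENOME_SIZE_BP):
    """Average coverage if every sequenced base landed evenly on the target."""
    return yield_bp / target_size_bp(fraction, genome_size)


YIELD_BP = 100e9  # hypothetical sequencing yield per sample
for name, fraction in FRACTIONS.items():
    size_mb = target_size_bp(fraction) / 1e6
    depth = mean_depth(YIELD_BP, fraction)
    print(f"{name}: ~{size_mb:,.1f} Mb target, ~{depth:,.0f}x mean depth")
```

Read the other way round, a smaller target reaches a given depth with far less sequencing, which is the usual argument for exomes and panels when cost matters.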

Targeted sequencing

The problem with exome data
Clinically and genetically heterogeneous conditions x 30,000 rows

Sifting signal from noise in exomes
Every genome contains many rare, potentially functional variants:
~500 rare missense variants
~100 LoF variants: ~20 homozygous, ~20 rare
~100 rare variants in known disease genes
5-10 recessive disease-causing mutations
1-2 de novo coding mutations
sequencing errors
In Mendelian disease patients we need to find 1-2 true causal mutations amidst this “noise”.

How do we find pathogenic variants?
Is the variant a known pathogenic variant?
How much evidence supports the claim of pathogenicity?
Is the variant rare?
Is it predicted to have a functional impact (change a protein sequence)?
Does it segregate with disease?
Is the gene associated with the disease?
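
Several of these questions can be expressed as a simple filter over annotated variants. The sketch below is a minimal illustration under stated assumptions: the `Variant` fields, the consequence labels, the 0.1% frequency cutoff and the gene names are all hypothetical, and segregation is left out because it needs family genotypes. A prior pathogenic report is used only to order the candidates, not to accept a variant outright, since the supporting evidence still has to be checked.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    consequence: str              # e.g. "missense", "stop_gained", "synonymous"
    population_af: float          # allele frequency in a reference database
    known_pathogenic: bool = False


# Hypothetical set of consequence labels treated as protein-altering.
PROTEIN_ALTERING = {"missense", "stop_gained", "frameshift", "splice_site"}


def is_candidate(variant, disease_genes, max_af=0.001):
    """Rare, protein-altering, and in a gene linked to the phenotype."""
    return (
        variant.population_af <= max_af
        and variant.consequence in PROTEIN_ALTERING
        and variant.gene in disease_genes
    )


variants = [
    Variant("GENE_B", "missense", 0.0005),
    Variant("GENE_A", "missense", 0.00002, known_pathogenic=True),
    Variant("GENE_A", "synonymous", 0.00001),   # no protein change
    Variant("GENE_B", "stop_gained", 0.12),     # too common
    Variant("GENE_C", "frameshift", 0.0),       # gene not linked to the disease
]
disease_genes = {"GENE_A", "GENE_B"}

candidates = [v for v in variants if is_candidate(v, disease_genes)]
# Previously reported variants first; their evidence still needs review.
candidates.sort(key=lambda v: not v.known_pathogenic)
for v in candidates:
    print(v)
```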

Making sense of one genome requires tens of thousands of genomes
More than 500K exomes and 50K genomes have been sequenced worldwide, but these data are siloed by project and inconsistently processed.

Exome Aggregation Consortium (ExAC)
[Figure: ancestry groups (Latino, African, European, South Asian, East Asian, Other) shown for 1000 Genomes, ESP and ExAC]
exac.broadinstitute.org

Value of reference databases
Provide variant frequency in a large population (either healthy, or “reference”, i.e. a population sample)
Provide frequency across multiple human populations
Allow us to assess how many variants we see in a particular gene
Provide an unbiased estimate of variant penetrance
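
The frequency a reference database reports comes down to simple counting: the number of times the variant allele was observed divided by the number of alleles genotyped, overall and within each population. The sketch below shows that arithmetic with made-up counts; the population labels follow the slides, but every number and the function names are assumptions for illustration only.

```python
def allele_frequency(allele_count, allele_number):
    """Variant alleles observed divided by alleles genotyped (0.0 if none)."""
    return allele_count / allele_number if allele_number else 0.0


# Hypothetical data: population -> (variant allele count, alleles genotyped)
counts_by_population = {
    "African": (12, 10000),
    "European": (3, 60000),
    "East Asian": (0, 8000),
}


def summarize(counts):
    """Return per-population frequencies, the pooled frequency, and the maximum."""
    per_population = {
        pop: allele_frequency(ac, an) for pop, (ac, an) in counts.items()
    }
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    overall = allele_frequency(total_ac, total_an)
    return per_population, overall, max(per_population.values())


per_population, overall, highest = summarize(counts_by_population)
print(per_population)
print("overall:", overall, "highest population frequency:", highest)
```

Looking at the highest population frequency rather than only the pooled value matters because a variant can be rare overall yet common in one population, which argues against it causing a severe rare disease.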

Lessons from ExAC
Many “healthy” people carry apparently disease-causing variants:
over 20,000 reported disease variants are seen in our “healthy” samples
average ~2/person after filtering
What’s causing this?
carriers of recessive variants
some undiagnosed disease cases
lots of false positive variants (20-25%)

Databases of disease mutations
Drawn from literature collected over decades with variable standards
Five years ago there were no large frequency databases, so any rare protein-altering variant was assumed to be causal
New databases are more careful about evidence

xBrowse: Rapid exploration of multiple inheritance patterns
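
As a minimal sketch of what testing inheritance patterns means for a parent-child trio, the example below classifies a variant from the number of alternate alleles (0, 1 or 2) each person carries. It is not xBrowse's implementation: the function names are hypothetical, and it assumes an autosomal variant, unaffected parents and accurate genotype calls.

```python
def classify(child, mother, father):
    """Label one variant from alternate-allele counts (0, 1 or 2) in a trio."""
    if child > 0 and mother == 0 and father == 0:
        return "de novo"
    if child == 2 and mother == 1 and father == 1:
        return "homozygous recessive"
    if child == 1 and mother + father >= 1:
        return "inherited heterozygous"
    return "other"


def compound_het(gene_variants):
    """True if the child has one heterozygous variant from each parent in a gene.

    gene_variants: list of (child, mother, father) alternate-allele counts.
    """
    from_mother = any(c == 1 and m == 1 and f == 0 for c, m, f in gene_variants)
    from_father = any(c == 1 and m == 0 and f == 1 for c, m, f in gene_variants)
    return from_mother and from_father


print(classify(1, 0, 0))
print(classify(2, 1, 1))
print(compound_het([(1, 1, 0), (1, 0, 1)]))
```

Compound heterozygosity is checked per gene rather than per variant, because it depends on two different variants arriving from different parents.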

xBrowse: Filtering by function and frequency

xBrowse: Digestible information for all candidate variants and genes

exac.broadinstitute.org

xBrowse: Following up candidate genes with external resources

The big (largely) unsolved challenges
NGS data still miss a non-trivial number of genetic variants, and also contain errors
Our reference databases are still missing many populations
Uncertainty even about “known” pathogenic variants in databases
For many variants, penetrance is not robustly established
Huge difference between interpretation in “healthy” and “disease” samples
