Sangwoo Kim, Ph.D. Assistant Professor,

Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variants
Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine

Overview
PART I: NGS technologies and standard workflow
- Next generation sequencing
- History and technology
- Data and its meaning; process workflow
- Discussion
PART II: NGS analysis to find variants
- NGS analysis to find variants
- Single nucleotide variants (SNVs)
- Copy number variations (CNVs)
- Structural variations (SVs)
PART III: NGS application to diagnostics
- NGS in genomic medicine
- Potential application to forensic science

From previous session
- Conventional variant calling
- Variant calling in minor subgroups

Next-generation sequencing
Massively parallel sequencing (a.k.a. next-generation sequencing): sequencing via spatially separated, clonally amplified DNA templates or single DNA molecules (Metzker et al, Nat Rev Genet, 2010)
Platforms: Illumina HiSeq2500, 5500 SOLiD system, Ion Torrent PGM

The human genome project
- Began in 1990.
- Consortium comprising the U.S., U.K., France, Australia, Japan, etc.
- "Rough draft" in 2000.
- "Complete genome" published in 2003.
- 13 years, $3 billion.

FASTQ format (NGS raw data)
A format for NGS reads (FASTA + quality): each read is stored with its sequence and its per-base quality.
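The record layout can be illustrated with a minimal parser sketch. It assumes the common four-line layout (header, sequence, "+" separator, quality) and Phred+33 quality encoding; neither detail is stated on the slide, and the names `FastqRecord` and `parse_fastq` are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per read from FASTQ text (4 lines per read).

    Assumption: qualities are ASCII characters encoding Phred + phred_offset.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per read")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, [ord(c) - phred_offset for c in qual])


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)


example = "@read1\nGATTACA\n+\nIIIIII#\n"
records = list(parse_fastq(example.splitlines()))
```

Under these assumptions the character `I` decodes to Q40 (error probability 0.0001) and `#` to Q2.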

D. Validation and functional assessment
[Figure: workflow with panels A. Data Generation, B. Variant Finding, C. Variant Analysis, D. Validation and functional assessment. Kim S and Paik S, in preparation]
Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows an inspection of all genomic areas and is applicable to CNV and SV analysis. Whole exome sequencing (WES) only interrogates coding regions (1~2% of the genome), at lower cost and throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only limited regions are to be tested (as in a diagnosis kit), a set of targeted genes is amplified and sequenced (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.

Short Read Alignment Data preprocessing

Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Do this as fast as possible!

Brute force way
Find “GATTCAAA” in the human genome
The reference genome (chr1, start): T G A C G A T C ... This is very long (3 billion)
Your query: G A T C, compared against the reference at each successive position

How fast should it be?
method | time per 1 read (sec) | time per 80x WGS (sec) | is equal to
eyeballing | 3x10^9 | 3.6x10^18 | 1x10^11 yrs
naïve matching | 2400 | 1.2x10^9 | 7,608 yrs
improved algorithm | 3 | 3.6x10^8 | 10 yrs
minimum required | 0.01 | 1.2x10^7 | 11.5 days
desired | 0.001 | 1.2x10^6 | 1.2 days
Based on 200 bp read length, 80x single-end WGS.

Searching with index
Assume you're searching for “genome” in an English dictionary.
- You don't search every line on every page.
- You first find the page range of “g” in the dictionary.
- Within that range (of ‘g’), you find the page range of “ge”.
- Within that range (of ‘ge’), you find the page range of “gen”.
- ... until you find “genome”.
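The dictionary analogy can be sketched on a toy text by sorting all of its suffixes and narrowing a range one letter at a time. This is only an illustration of the narrowing idea, not the index used by real aligners; the function names are hypothetical, and storing every suffix as a string is only feasible for very short texts.

```python
from bisect import bisect_left, bisect_right


def build_suffix_index(text):
    """Return (sorted suffixes, their start positions) for a short text."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [text[i:] for i in order], order


def find_positions(text, query):
    """Find all start positions of query by narrowing a range of sorted suffixes."""
    suffixes, order = build_suffix_index(text)
    lo, hi = 0, len(suffixes)
    for k in range(1, len(query) + 1):
        prefix = query[:k]
        # Search only inside the range already found for the shorter prefix,
        # like going from the pages of "g" to the pages of "ge".
        lo = bisect_left(suffixes, prefix, lo, hi)
        # "\uffff" acts as a sentinel larger than any character in the text.
        hi = bisect_right(suffixes, prefix + "\uffff", lo, hi)
    return sorted(order[lo:hi])


hits = find_positions("BANANA", "ANA")
```

Each step is a binary search, so the work per query letter grows with the logarithm of the text length rather than with the text length itself.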

How can we build an index for genome?

Burrows-Wheeler Transform

Burrows-Wheeler Transformation
BANANA

Step 1. Append “$”, defined as the lexicographically smallest symbol: BANANA$

Step 2. List every rotation of the string, numbered by the number of rotations:
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA

Step 3. Sort the rotations (sorted row, original rotation number, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

Step 4. Read the last column of the sorted rotations: ANNB$AA

BWT(“BANANA$”) = “ANNB$AA”
- BWT just changes the order of the characters in the string
- BWT tends to collect similar characters together
- With only the transformed string, we can easily get the original string
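The rotate, sort, and take-last-column procedure can be written directly as a short sketch. It is the naive construction, suitable only for short strings; genome-scale tools build the transform differently. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform by sorting all rotations (text must end in '$')."""
    if not text.endswith("$"):
        raise ValueError("append the terminator '$' first")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Recover the original string from the transformed string alone.

    Repeatedly prepend the last column to the table and re-sort; after
    len(last_column) rounds the table holds all sorted rotations.
    """
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    return next(row for row in table if row.endswith("$"))


transformed = bwt("BANANA$")
restored = inverse_bwt(transformed)
```

In ASCII ordering `$` sorts before the capital letters, which matches the slide's convention that `$` is the smallest symbol.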

LF Search
Question: Find “NAN” in BANANA

Sorted rotations (sorted row, original rotation number, rotation); the last column is the BWT:
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

The query is matched from its last character backwards: N, then AN, then NAN.

Step 1. The range of strings that start with “N” can be calculated from:
- the number of symbols that are lexicographically less than ‘N’, to determine the start point = 5
- the number of ‘N’, to determine the end point = 2 (rows 5 and 6)

Step 2. The range of strings that start with “AN”. Using the number of symbols lexicographically less than ‘A’ (= 1) and the number of ‘A’ (= 3) gives rows 1 to 3, but this is a range for ‘A’, not ‘AN’!! Rows starting with “Ax”, where “Ax” is not “AN” and is less than “AN”, must be skipped. So count ‘A’ in the last column instead:
- start point = the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ in the last column before the previous start point = 1 + 1 = 2
- end point: from the number of ‘A’ in the last column before the previous end point = 3 (rows 2 and 3)

Step 3. The range of strings that start with “NAN”:
- start point = the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ in the last column before the previous start point = 5 + 1 = 6
- end point: from the number of ‘N’ in the last column before the previous end point = 2 (row 6 only)

Result: row 6 is rotation number 2 of the original string (= the number of rotations applied to the original string), so “NAN” exists at the 3rd position of “BANANA”.
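The range updates in this walkthrough can be sketched as a small backward-search routine. This is a toy version: the counts of a symbol "before" a row are obtained by scanning the last column each time, whereas practical indexes precompute them. Row ranges are half-open, and the function names are hypothetical.

```python
def build_index(text):
    """Return (rotation numbers in sorted order, last column, 'less than' counts)."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The rotation starting at i ends with the character just before position i.
    last = "".join(text[i - 1] for i in order)
    first = sorted(text)
    less_than = {ch: first.index(ch) for ch in set(text)}
    return order, last, less_than


def backward_search(query, order, last, less_than):
    """Return the 0-based start positions of query, matching from its last character."""
    start, end = 0, len(last)  # half-open row range [start, end)
    for ch in reversed(query):
        if ch not in less_than:
            return []
        # New start: symbols smaller than ch + occurrences of ch before old start.
        start_new = less_than[ch] + last[:start].count(ch)
        # New end: symbols smaller than ch + occurrences of ch before old end.
        end_new = less_than[ch] + last[:end].count(ch)
        start, end = start_new, end_new
        if start >= end:
            return []
    return sorted(order[start:end])


order, last, less_than = build_index("BANANA$")
positions = backward_search("NAN", order, last, less_than)
```

For "NAN" the ranges shrink to rows 5 to 6, then rows 2 to 3, then row 6, which holds rotation number 2, matching the worked example.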

Genome Informatics I (2015 Spring)
Genome query (imported from Mike Schatz’s slide)

Inexact matching
[Figure: search tree over T G A C G A T]
When an exact match does not exist:
- continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
- if another mismatch occurs, branch out again
So the allowed edit distance is critical to alignment speed.
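A rough feel for why the allowed number of mismatches matters can be had by counting how many distinct sequences lie within k substitutions of a read. This is an upper bound on the branching, assuming substitutions only (no insertions or deletions) and a 4-letter alphabet; a real search prunes most branches early. The function name is hypothetical.

```python
from math import comb


def candidates_within(read_length, max_mismatches, alphabet_size=4):
    """Number of sequences that differ from a read at no more than max_mismatches positions."""
    return sum(
        comb(read_length, k) * (alphabet_size - 1) ** k
        for k in range(max_mismatches + 1)
    )


growth = [candidates_within(100, k) for k in range(4)]
```

For a 100 bp read the count goes from 1 (exact match) to 301 with one mismatch and 44,851 with two.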

Goal achieved
method | time per 1 read (sec) | time per 80x WGS (sec) | is equal to
eyeballing | 3x10^9 | 3.6x10^18 | 1x10^11 yrs
naïve matching | 2400 | 1.2x10^9 | 7,608 yrs
improved algorithm | 3 | 3.6x10^8 | 10 yrs
minimum required | 0.01 | 1.2x10^7 | 11.5 days
desired | 0.001 | 1.2x10^6 | 1.2 days

Variant calling – SNV calling

Detailed View
A genome region
one read = one DNA fragment aligned to a specific genomic region = observation of our sample in this region (1 time)

Detailed View
A certain genomic position (in bp)
- reference allele
- observation of our sample at this position from read 1
- observation of our sample at this position from read 2
- observation of our sample at this position from read 10

Why multiple observations?
Observations contain errors:
- errors from the machine: basecall error
- errors from mapping: mapping error
- errors from others: library prep error
With accuracy of 99%, a 1% error over the whole region leads to:
- ~30 million false SNPs for a whole genome
- ~500k false SNPs for a whole exome

Human diploid genome
- Homozygous reference (G/G): non-reference reads come from sequencing error / mapping error
- Heterozygous alternative (G/A)
- Homozygous alternative (A/A)
- Somatic mutations

Allele fraction distribution (binomial)
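Normal approximation of $B(100, 0.5)$:
$\mu = np = 100 \times 0.5 = 50$
$\sigma = \sqrt{npq} = \sqrt{100 \times 0.5 \times 0.5} = 5$
$\Pr(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 0.9973$
$\Pr(35 \le x \le 65) \approx 0.9973$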
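The normal approximation can be checked against the exact binomial sum. The sketch below assumes a heterozygous site covered by 100 reads, each showing the alternative allele with probability 0.5; the function names are hypothetical.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Exact probability of k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_interval(lo, hi, n, p):
    """Exact Pr(lo <= X <= hi) for X ~ B(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))


n, p = 100, 0.5
mu = n * p
sigma = sqrt(n * p * (1 - p))
exact = binom_interval(int(mu - 3 * sigma), int(mu + 3 * sigma), n, p)
```

The exact value is slightly larger than the normal figure because the discrete sum includes both endpoints 35 and 65 in full.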


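Inferring mutations
Reference allele: G. Observation of donor genome: AGAGGGGGAAAGAGA
Probability of observing “G” at the site of “G” (genotype alleles: A = reference, B = alternative; $P(e)$ is the probability of a sequencing error and $P(1-e)$ the probability of no sequencing error):
- True genotype = “AA” and no sequencing error: $P(1-e)$
- True genotype = “AB” and
  - read was generated from the ‘A’ allele and no sequencing error: $\frac{1}{2} \cdot P(1-e)$
  - read was generated from the ‘B’ allele and sequencing error and ‘A’ was generated by chance: $\frac{1}{2} \cdot \frac{1}{4} \cdot P(e)$
- True genotype = “BB” and sequencing error: $\frac{1}{4} \cdot P(e)$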

Inferring mutations
Reference allele: G. Observation of donor genome: AGAGGGGGAAAGAGA
Probability of observing “A” at the site of “G” (genotype alleles: A = reference, B = alternative):
- True genotype = “AA” and sequencing error: $\frac{1}{4} \cdot P(e)$
- True genotype = “AB” and
  - read was generated from the ‘A’ allele and sequencing error and ‘B’ was generated by chance: $\frac{1}{2} \cdot \frac{1}{4} \cdot P(e)$
  - read was generated from the ‘B’ allele and no sequencing error: $\frac{1}{2} \cdot P(1-e)$
- True genotype = “BB” and no sequencing error: $P(1-e)$
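The per-read terms above can be combined over all reads at a site to compare the three genotypes. The sketch below is a minimal illustration, not the method of any particular caller: it assumes independent reads, a single error rate e = 0.01 chosen for the example, and treats every non-reference base as the B allele. The function names are hypothetical.

```python
from math import log


def read_likelihood(observed_is_reference, genotype, e):
    """P(observed base | genotype) using the per-read terms listed on the slide."""
    correct = 1 - e        # the read reports its source allele without error
    flipped = 0.25 * e     # a sequencing error that produces the other allele by chance
    if genotype == "AB":
        # Half of the reads come from each allele; the sum is the same for either base.
        return 0.5 * correct + 0.5 * flipped
    genotype_is_reference = genotype == "AA"
    return correct if genotype_is_reference == observed_is_reference else flipped


def genotype_log_likelihoods(reference_base, observed_bases, e=0.01):
    """Sum log-likelihoods over reads for the genotypes AA, AB and BB."""
    return {
        genotype: sum(
            log(read_likelihood(base == reference_base, genotype, e))
            for base in observed_bases
        )
        for genotype in ("AA", "AB", "BB")
    }


scores = genotype_log_likelihoods("G", "AGAGGGGGAAAGAGA")
best = max(scores, key=scores.get)
```

With 8 G and 7 A observations, the heterozygous genotype AB scores highest under these assumptions.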

somatic mutations

Germline vs. Somatic mutation
[Figure: sample from non-disease site, sample from disease site, and reference sequence (e.g. hg19)]
Tools: UnifiedGenotyper, VarScan2, SomaticSniper, …

Easy way to find somatic mutations
- sample from non-disease site: GN = AA
- sample from disease site: GT = AB

Joint Probabilities
P(GT=AB | GN=AA) ≠ P(GT=AB | GN=AB) ≠ P(GT=AB | GN=BB)
Tumor genotype is dependent on normal genotype!!!
G: Joint Genotype Matrix

When the sample is not pure

Heterogeneous Sample
[Figure: reads (G and A) drawn from a mixture of normal cells and tumor cells]

Causes of low-frequency mutations
- Sample contamination (e.g. stromal cells)
- Tumor heterogeneity
- Extreme environments
- Somatic mosaicism

Heterogeneous Sample
Observed reads: G G G G G G G G G G G A A G G
“2/15: No mutation. Two ‘A’s are from sequencing errors...”
VS
“2/15: Heterozygous somatic mutation!! The sample is certainly heterogeneous!”
“How do we know this?”
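One way to see why the answer depends on sample purity is to compare how well each explanation predicts 2 alternative reads out of 15. The sketch below is only an illustration under assumed numbers: an error rate of 0.01, a normal-cell read fraction α of 0.7, and sequencing errors ignored under the mutation hypotheses. None of these values come from the slides, and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k alternative reads among n reads when each is alternative with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def compare_hypotheses(alt_reads, depth, error_rate=0.01, alpha=0.7):
    """Likelihood of the observed alt-read count under three explanations.

    alpha is the assumed fraction of reads that come from normal cells.
    """
    expected_alt_fraction = {
        "no mutation, sequencing errors only": error_rate,
        "heterozygous mutation, pure sample": 0.5,
        "heterozygous mutation, mixed sample": (1 - alpha) * 0.5,
    }
    return {
        name: binom_pmf(alt_reads, depth, fraction)
        for name, fraction in expected_alt_fraction.items()
    }


likelihoods = compare_hypotheses(alt_reads=2, depth=15)
most_likely = max(likelihoods, key=likelihoods.get)
```

With these assumed values the mixed-sample explanation fits best, but the ranking changes with α, which is why the contamination level has to be estimated first.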

Estimating Cellularity
It is “easy” only if we already know where to look (disease genotype is AB or BB).
But how do we know the genotype (even without knowing α)?
- Use SNP array: ONCOSNP (Yau et al, Genome Biol, 2009), Absolute (Carter et al, Nature Biotech, 2012)
- SNP calling: Snyder et al, PNAS, 2010; PurityEst (Su et al, Bioinformatics, 2012)

Accurate inference in Virmid
Estimate global within-individual contamination for accurate detection of somatic mutations

Bias 1 - Loss of Reads (Virmid)
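[Figure: read r1 passes site g1 and carries the reference allele A; read r2 passes site g2 and carries the variant allele B]
$x_a = P(\text{a read that passes } g_1 \text{ being unmapped}) = P(r_1 \text{ has } d+1 \text{ or more variants in the remaining sites})$
$x_b = P(\text{a read that passes } g_2 \text{ being unmapped}) = P(r_2 \text{ has } d \text{ or more variants in the remaining sites})$
$x_a = 1 - \sum_{i=0}^{d} \binom{l-1}{i} p^i (1-p)^{l-1-i}$
$x_b = 1 - \sum_{i=0}^{d-1} \binom{l-1}{i} p^i (1-p)^{l-1-i}$
where $d$ = maximum edit distance, $l$ = read length, and $p$ = frequency of variation.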
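The two sums can be evaluated directly to see how much more often a variant-carrying read is lost. The sketch below uses the symbols d, l, and p as defined above; the example values (d = 2, l = 100, p = 0.01) are assumptions for illustration, and the function names are hypothetical.

```python
from math import comb


def prob_at_most(k, n, p):
    """Pr(at most k variants among n sites), each site varying with probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def unmapped_probabilities(d, l, p):
    """Return (x_a, x_b): unmapping probability for a reference read and a variant read.

    The variant read has already used one of its d allowed edits at the site
    of interest, so it tolerates one fewer variant in the remaining l - 1 sites.
    """
    x_a = 1 - prob_at_most(d, l - 1, p)
    x_b = 1 - prob_at_most(d - 1, l - 1, p)
    return x_a, x_b


x_a, x_b = unmapped_probabilities(d=2, l=100, p=0.01)
```

Since x_b is never smaller than x_a, reads carrying the B allele are lost preferentially, which is the bias named in the slide title.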

Bias 2 - Loss of variants (Virmid)
- α: reads from normal
- 1-α: reads from disease
- B-allele: overestimate BAF, underestimate α

[Figure: estimated α, with underestimated α and overestimated α marked]

Calling low-fraction somatic mutations in Virmid
Kim S et al, Genome Biology 2013

Low-frequency mutations in disease
Identification of de novo somatic mutations in AKT-MTOR-PIK3CA in hemimegalencephaly (Lee J et al, Nature Genetics, 2012)

Low-frequency mutations in disease
Identification of MTOR driver mutations in focal cortical dysplasia (Lim J et al, Nature Medicine 2015)

Copy number variation (CNV)

Copy Number Variation
Changes in copy number of a large DNA segment
- usually in terms of genes, e.g. HER2 amplification
Types of CNVs
- Copy number gain (CN > 2): increase of copy number due to genomic rearrangements such as insertion/duplication
- Copy number loss (CN < 2): decrease of copy number due to genomic rearrangements such as deletion
Copy number aberration (CNA) refers to CNV particularly when the events are associated with a disease phenotype

Comparative Genome Hybridization (CGH)
500kb-1500kb fragment for optimal hybridization

Array CGH

Resolution

Benefits of NGS-based CNV detection
- High resolution (< 50 bp) in size
- Data reuse (multi-purpose): one NGS (whole-genome) sequencing run can be used for SNV, CNV, and SV detection
- Can be improved with additional NGS information: discordant reads in paired-end sequencing

Inferring CNVs from NGS
Principle: samples with copy number gain (or loss) will generate more (or fewer) reads in the region.
[Figure: reads over a gene at 3 copies (gain), 2 copies (normal), and 1 copy (loss)]

The signal
[Figure: reads from 3-copy (gain), 2-copy (normal), and 1-copy (loss) regions mapped to the reference]
Catching these needs a systematic approach!

Catching the signal
Problems: read depth is not uniform even without copy number changes
- GC bias
- Mapping bias in repeat regions
- Natural variance (Poisson distribution)
Poisson distribution:
- The probability of a given number of events occurring in a fixed interval of time and/or space.
Example:
- You have 120 phone calls a day; what is the best way to describe the number of phone calls in an hour?
- Similarly, you generated 100,000,000 NGS reads from a whole genome; what is the number of reads generated within chr1: ?

Significantly deviated read-depth
Null hypothesis (H0): the copy number of a given region is unchanged
- we assume the read-depth follows a Poisson distribution
Alternative hypothesis (Ha): the copy number of a given region is changed
If H0 is right:
- the read-depth (calculated from the number of reads) within a specific genomic region does not deviate significantly from the Poisson distribution
If the read-depth deviates too much to be explained by natural variance (Poisson distribution):
- the copy number has been changed
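The test described here can be sketched as a Poisson tail probability for the number of reads observed in a region. The sketch makes several assumptions that are not in the slides: the expected count under H0 is known, the two-sided p-value is taken as twice the smaller tail, and no bias correction is applied. The function names and example counts are hypothetical.

```python
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """Pr(X = k) for X ~ Poisson(lam), computed in log space to avoid overflow."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


def poisson_two_sided_p(observed, lam):
    """Two-sided p-value for an observed count under H0 (doubling the smaller tail)."""
    lower = sum(poisson_pmf(k, lam) for k in range(observed + 1))          # Pr(X <= observed)
    upper = max(0.0, 1 - sum(poisson_pmf(k, lam) for k in range(observed)))  # Pr(X >= observed)
    return min(1.0, 2 * min(lower, upper))


# 120 phone calls a day gives an expected 5 calls per hour.
calls_per_hour = 120 / 24
p_busy_hour = poisson_two_sided_p(15, calls_per_hour)

# A region expected to hold 100 reads that instead shows 150 or 50 reads.
p_gain = poisson_two_sided_p(150, 100.0)
p_loss = poisson_two_sided_p(50, 100.0)
```

A small p-value means the depth is hard to explain by natural variance alone, so H0 (unchanged copy number) is rejected for that region.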

Practically, we should consider
- Bias correction from sequence context (GC-bias, etc.)
- Event detection method: whether a significant rise (or drop) of read-depth looks like an event
  - mean-shift technique (CNVnator, Abyzov et al 2013)
  - event-wise testing (Yoon et al, 2009)
  - paired-end signal (CNVer, Medvedev et al 2010)

CNVnator

Structural variation (SV)

Beyond the SNVs
TFE3-KHSRP translocation in renal cell carcinoma

Structural Variations (SVs)
Genomic rearrangements that affect >50 bp of sequence (Alkan et al, Nat. Rev. Genetics 12, 2011)

List of structural variations

Paired-end sequencing

Paired end reads for SV finding
[Figure: paired-end reads shown on reference and donor genomes; Bix Seminar UCSD]

Methods for SV detection
- Read depth
  - Assume a random distribution in mapping depth
  - Significantly higher depth for duplicated regions
  - Significantly reduced depth for deleted regions
- Read pair
  - Assess the span and orientation of paired-end reads
- Split read
  - Define breakpoints of SVs using split-sequence-read signature (broken alignment)
- Assembly
  - Assemble and reconstruct the whole genome of sample DNA
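The read-pair idea can be sketched as a simple rule on the mapped span and orientation of each pair. The sketch assumes the library's expected span mean and standard deviation are known and uses a cutoff of 3 standard deviations; the cutoff, the labels, and the function name are illustrative choices, not taken from any specific tool.

```python
def classify_pair(span, proper_orientation, expected_mean, expected_sd, n_sd=3.0):
    """Label one read pair from its mapped span and orientation on the reference.

    span: distance between the outer ends of the two mapped reads (bp).
    proper_orientation: True if the two reads face each other as expected.
    """
    if not proper_orientation:
        return "discordant: unexpected orientation"
    if span > expected_mean + n_sd * expected_sd:
        # The reads map farther apart on the reference than in the donor fragment.
        return "discordant: span too large (possible deletion)"
    if span < expected_mean - n_sd * expected_sd:
        # The reads map closer together on the reference than in the donor fragment.
        return "discordant: span too small (possible insertion)"
    return "concordant"


labels = [
    classify_pair(310, True, expected_mean=300, expected_sd=20),
    classify_pair(900, True, expected_mean=300, expected_sd=20),
    classify_pair(150, True, expected_mean=300, expected_sd=20),
    classify_pair(300, False, expected_mean=300, expected_sd=20),
]
```

Where exactly to place the cutoff is a judgment call: a wide window misses small events, while a narrow one flags many normal pairs.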

Methods for Deletion Detection

Problem 1. Judgment of discordance

Problem 2. Size of insertion

Problem 2. Large indels: novel sequence insertion

Problem 2. Large indels: existing sequence insertion

Problem 3. Nonspecific Mappings

Discussion

Thank you
