1
Next-generation sequencing: from basics to future diagnostics PART II: NGS analysis to find variants
Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
2
Overview
PART I: NGS technologies and standard workflow
- Next generation sequencing
- History and technology
- Data and its meaning; process workflow
- Discussion
PART II: NGS analysis to find variants
- Single nucleotide variants (SNVs)
- Copy number variations (CNVs)
- Structural variations (SVs)
PART III: NGS application to diagnostics
- NGS in genomic medicine
- Potential application to forensic science
3
From previous session
Conventional variant calling
Variant calling in minor subgroups
4
Next-generation sequencing
Massively Parallel Sequencing (a.k.a. next-generation sequencing): sequencing via spatially separated, clonally amplified DNA templates or single DNA molecules (Metzker, Nat Rev Genet, 2010)
Platforms: Illumina HiSeq2500, 5500 SOLiD system, Ion Torrent PGM
5
The human genome project
Began in 1990. The consortium included the U.S., U.K., France, Australia, Japan, etc. “Rough draft” in 2000. “Complete genome” published in 2003. 13 years, $3 billion.
6
FASTQ format (NGS raw data)
A format for NGS reads (FASTA sequence + quality)
One read = one record, which holds the sequence and a quality value for each base of that sequence
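The record layout can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: it assumes the common four-line record layout and Phred+33 quality encoding, and the names `parse_fastq` and `error_probability` are made up for this example.

```python
def parse_fastq(lines):
    """Yield (name, sequence, phred_scores) for each 4-line FASTQ record."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        phred = [ord(c) - 33 for c in qual]
        yield header[1:], seq, phred


def error_probability(q):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# A toy record: 'I' encodes Q40, '#' encodes Q2.
example = ["@read1", "GATTACA", "+", "IIII#II"]
records = list(parse_fastq(example))
```

Here the fifth base has quality Q2, so its base call is much less reliable than the others.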
7
Workflow
A. Data Generation: control sequencing, quality control, raw reads (FASTQ files), short read alignment (BAM files)
B. Variant Finding: germ-line mutation, somatic mutation, copy number variation (CNV), structural variation (SV), xenogeneic sequence
C. Variant Analysis: recurrence analysis, variant impact prediction, mutation filtration/selection, tumor heterogeneity inference
D. Validation and functional assessment: variant confirmation, pathway analysis, functional study
Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows an inspection of all genomic areas and is applicable for CNV and SV analysis. Whole exome sequencing (WES) only interrogates coding regions (1~2% of the genome), at a lower cost and with less throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only limited regions are to be tested (as in a diagnostic kit), a set of targeted genes is amplified and sequenced (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.
Kim S and Paik S, in preparation
8
Short Read Alignment Data preprocessing
9
Mapping back to genome
Where is this sequence in the human genome?
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA
Do this as fast as possible!
10
brute force way Find “GATTCAAA” in human genome
The reference genome (chr1, start): T G A C G A T C ... This is very long (3 billion)
Your query (G A T C) is compared against the reference at every position in turn
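The brute-force idea can be sketched as follows. The function name `naive_find` and the short toy reference are made up for this example; a real reference is about 3 billion bases long, which is exactly why this approach is too slow.

```python
def naive_find(reference, query):
    """Return every 0-based start position where query occurs in reference."""
    hits = []
    n, m = len(reference), len(query)
    # Slide the query along the reference one position at a time.
    for start in range(n - m + 1):
        if reference[start:start + m] == query:
            hits.append(start)
    return hits


# Toy reference (hypothetical); the query is the one from the slide.
reference = "TGACGATCGATTCAAAGATC"
hits = naive_find(reference, "GATTCAAA")
```

Each query costs on the order of (reference length) x (query length) character comparisons in the worst case.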
11
How fast should it be?
method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days
based on 200bp read length, 80x single-end WGS
12
Searching with index
Assume you’re searching “genome” in an English dictionary
You don’t search every line in every page
- You first find the page range of “g” in the dictionary
- in the above range (of ‘g’), you find the page range of “ge” in the dictionary
- in the above range (of ‘ge’), you find the page range of “gen” in the dictionary
- ... until you find “genome”
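The dictionary analogy can be sketched with binary search on a sorted word list. This is only an illustration of range narrowing; the word list and the helper name `prefix_range` are assumptions made for this example.

```python
from bisect import bisect_left


def prefix_range(sorted_words, prefix, lo=0, hi=None):
    """Return the half-open range [start, end) of words that begin with prefix."""
    if hi is None:
        hi = len(sorted_words)
    start = bisect_left(sorted_words, prefix, lo, hi)
    # Any word starting with prefix sorts before prefix + the largest code point.
    end = bisect_left(sorted_words, prefix + "\U0010ffff", lo, hi)
    return start, end


# Hypothetical toy dictionary; sorting it once plays the role of the index.
dictionary = sorted(["gene", "genome", "genus", "giraffe",
                     "apple", "zebra", "gel", "genomics"])

query = "genome"
lo, hi = 0, len(dictionary)
steps = []
for k in range(1, len(query) + 1):
    # Narrow the previous range using one more character of the query.
    lo, hi = prefix_range(dictionary, query[:k], lo, hi)
    steps.append((query[:k], dictionary[lo:hi]))

found = dictionary[lo:hi]
```

Each step searches only inside the range found by the previous step, so the work depends on the query length rather than on reading every entry.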
13
How can we build an index for genome?
14
Burrows-Wheeler Transform
15
Burrows-Wheeler Transformation
BANANA
16
Burrows-Wheeler Transformation
BANANA$
The end marker ‘$’ is appended and is treated as the lexicographically smallest symbol
17
Burrows-Wheeler Transformation
BANANA$ ANANA$B
18
Burrows-Wheeler Transformation
BANANA$ ANANA$B NANA$BA
19
Burrows-Wheeler Transformation
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
20
Burrows-Wheeler Transformation
0 BANANA$ 1 ANANA$B 2 NANA$BA 3 ANA$BAN 4 NA$BANA 5 A$BANAN 6 $BANANA
21
Burrows-Wheeler Transformation
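All rotations of BANANA$ (left) are sorted lexicographically (right); “offset” is the row of the rotation before sorting.
row  rotation    sort →    row  offset  sorted rotation
0    BANANA$               0    6       $BANANA
1    ANANA$B               1    5       A$BANAN
2    NANA$BA               2    3       ANA$BAN
3    ANA$BAN               3    1       ANANA$B
4    NA$BANA               4    0       BANANA$
5    A$BANAN               5    4       NA$BANA
6    $BANANA               6    2       NANA$BA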
Burrows-Wheeler Transformation
All rotations of BANANA$ (left) are sorted lexicographically (right); the last column of the sorted rotations is the transform.
row  rotation    sort →    row  offset  sorted rotation  last column
0    BANANA$               0    6       $BANANA          A
1    ANANA$B               1    5       A$BANAN          N
2    NANA$BA               2    3       ANA$BAN          N
3    ANA$BAN               3    1       ANANA$B          B
4    NA$BANA               4    0       BANANA$          $
5    A$BANAN               5    4       NA$BANA          A
6    $BANANA               6    2       NANA$BA          A
BWT(“BANANA$”) = “ANNB$AA”
- BWT just changes the order of the string
- BWT tends to collect similar characters together
- With only the transformed string, we can easily get the original string
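The construction above, and the claim that the original string can be recovered, can be sketched directly. This is a deliberately naive version that materializes every rotation, which is fine for BANANA but not for a genome; the function names are made up for this example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s = text + terminator  # '$' sorts before the letters A-Z in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text from the transformed string alone."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then sort again.
        table = sorted(c + row for c, row in zip(last_column, table))
    # The row that ends with the terminator is the original string.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("BANANA")
restored = inverse_bwt(transformed)
```

Real aligners build the transform through a suffix array instead of sorting full rotations, which avoids the quadratic memory used here.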
25
LF Search
Question: Find “NAN” in BANANA
Sorted rotations of BANANA$ (row, rotation offset, rotation):
0  6  $BANANA
1  5  A$BANAN
2  3  ANA$BAN
3  1  ANANA$B
4  0  BANANA$
5  4  NA$BANA
6  2  NANA$BA
26
LF Search
Question: Find “NAN” in BANANA
The query is matched from its last character backwards: N, then AN, then NAN
27
LF Search
Question: Find “NAN” in BANANA
The range of rows that start with “N” can be calculated from:
- the number of symbols that are lexicographically less than ‘N’, which gives the start row = 5
- the number of ‘N’, which gives the size of the range = 2
So the rows that start with “N” are rows 5 to 6 (start = 5, end = 5 + 2 - 1 = 6)
30
LF Search
Question: Find “NAN” in BANANA
Next, extend the match from “N” to “AN”. Counting symbols alone is not enough:
- the number of symbols that are lexicographically less than ‘A’ gives the start row = 1
- the number of ‘A’ gives the size of the range = 3
Rows 1 to 3 are the range for ‘A’, not for “AN”!
A row such as “A$...” starts with ‘A’ but is not “AN” and is less than “AN”, so it must be skipped. Use the last column (ANNB$AA) together with the current range for “N” (rows 5 to 6):
- start row = (number of symbols lexicographically less than ‘A’) + (number of ‘A’ in the last column before the start row 5) = 1 + 1 = 2
- end row = (number of symbols lexicographically less than ‘A’) + (number of ‘A’ in the last column up to and including the end row 6) - 1 = 1 + 3 - 1 = 3
So the rows that start with “AN” are rows 2 to 3
35
LF Search
Question: Find “NAN” in BANANA
The range of rows that start with “NAN” is calculated in the same way from the range for “AN” (rows 2 to 3) and the last column (ANNB$AA):
- start row = (number of symbols lexicographically less than ‘N’) + (number of ‘N’ in the last column before the start row 2) = 5 + 1 = 6
- end row = (number of symbols lexicographically less than ‘N’) + (number of ‘N’ in the last column up to and including the end row 3) - 1 = 5 + 2 - 1 = 6
So “NAN” matches exactly one row: row 6
36
LF Search
Question: Find “NAN” in BANANA
Row 6 (NANA$BA) was row 2 of the original (unsorted) rotations
= the number of rotations applied to the original string
= “NAN” exists at the 3rd position of “BANANA”
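The walk-through can be condensed into a short backward-search sketch. It is a teaching version under simplifying assumptions: occurrences are counted by scanning the last column each time, whereas a real index precomputes those counts. It also uses half-open ranges [start, end), so the slide's inclusive end row equals end - 1. All function names are made up for this example.

```python
def build_fm_index(text, terminator="$"):
    """Return (rotation offsets in sorted order, last column, first-row table)."""
    s = text + terminator
    offsets = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # The last character of the rotation starting at i is s[i - 1].
    last = "".join(s[i - 1] for i in offsets)
    counts = {}
    for c in s:
        counts[c] = counts.get(c, 0) + 1
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total  # number of symbols lexicographically less than c
        total += counts[c]
    return offsets, last, first_row


def occ(last, c, row):
    """Number of occurrences of c in the last column before the given row."""
    return last[:row].count(c)


def backward_search(text, query):
    """Return the sorted 0-based positions where query occurs in text."""
    offsets, last, first_row = build_fm_index(text)
    start, end = 0, len(last)
    for c in reversed(query):  # match the query from its last character
        if c not in first_row:
            return []
        start = first_row[c] + occ(last, c, start)
        end = first_row[c] + occ(last, c, end)
        if start >= end:
            return []
    return sorted(offsets[start:end])


positions = backward_search("BANANA", "NAN")
```

For “NAN” the ranges narrow as [5, 7), then [2, 4), then [6, 7), matching the rows found above; position 2 (0-based) is the 3rd position of BANANA.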
37
Genome query
Imported from Mike Schatz’s slides. Genome Informatics I (2015 Spring)
44
Inexact matching
When an exact match does not exist:
- continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
- if another mismatch occurs, branch out again
So the allowed edit distance is critical to alignment speed
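A rough way to see why the allowed distance matters is to count how many candidate sequences lie within a given number of substitutions of a read. This is an upper bound that ignores the pruning real aligners do and counts mismatches only (no insertions or deletions); the read length of 100 and the function name are assumptions for illustration.

```python
from math import comb


def candidate_count(read_length, max_mismatches, alphabet_size=4):
    """Number of sequences within max_mismatches substitutions of a read."""
    return sum(
        # Choose k positions, then one of the 3 other bases at each of them.
        comb(read_length, k) * (alphabet_size - 1) ** k
        for k in range(max_mismatches + 1)
    )


# Hypothetical read length of 100 bases.
counts = {d: candidate_count(100, d) for d in range(4)}
```

Every extra allowed mismatch multiplies the search space by roughly the read length, which is why aligners cap the distance.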
45
Goal achieved
method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days
46
Variant calling – SNV calling
47
Detailed View A genome region one read = one DNA fragment
aligned to a specific genomic region = observation of our sample in this region (1 time)
48
Detailed View
A certain genomic position (in bp)
- reference allele
- observation of our sample at this position from read 1
- observation of our sample at this position from read 2
- ...
- observation of our sample at this position from read 10
50
Why multiple observations?
Observations contain errors
- errors from machine: basecall error
- errors from mapping: mapping error
- errors from others: library prep error
With accuracy of 99%, a 1% error over the whole region leads to
- ~30 million false SNPs for whole genome
- ~500k false SNPs for whole exome
51
Human diploid genome
- G G: homozygous reference
- G A: heterozygous alternative
- A A: homozygous alternative
An observed ‘A’ at a homozygous reference site can also come from a sequencing error / mapping error, or from somatic mutations
52
Allele fraction distribution (binomial)
Normal approximation of $B(100, 0.5)$
$\mu = np = 100 \times 0.5 = 50$
$\sigma = \sqrt{npq} = 5$
$\Pr(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 0.9973$
$\Pr(35 \le x \le 65) \approx 0.9973$
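The normal approximation can be checked against the exact binomial sum. The sketch below uses only the standard library; the helper names are made up for this example, and the depth of 100 reads comes from the slide.

```python
from math import comb, sqrt


def binomial_pmf(k, n, p):
    """Probability of exactly k alternative-allele reads out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_interval(n, p, lo, hi):
    """Probability that the count falls in the closed interval [lo, hi]."""
    return sum(binomial_pmf(k, n, p) for k in range(lo, hi + 1))


n, p = 100, 0.5                  # depth 100, heterozygous site
mu = n * p                       # expected number of reads with the allele
sigma = sqrt(n * p * (1 - p))    # standard deviation
exact = binomial_interval(n, p, 35, 65)
```

The exact value comes out slightly above the normal-approximation figure because the interval includes both integer endpoints.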
53
Allele fraction distribution (binomial)
G A G A A
54
Inferring mutations
Reference allele: G. Observation of donor genome: AGAGGGGGAAAGAGA
Probability of observing “G” at the site of “G” (A = reference allele, B = alternative allele, $e$ = probability of a sequencing error)
True genotype = “AA” and no sequencing error: $1 - e$
True genotype = “AB” and
- Read was generated from ‘A’ allele and no sequencing error: $\frac{1}{2}(1 - e)$
- Read was generated from ‘B’ allele and sequencing error and ‘A’ was generated by chance: $\frac{1}{2} \cdot \frac{1}{4} \cdot e$
True genotype = “BB” and sequencing error: $\frac{1}{4} e$
55
Inferring mutations
Reference allele: G. Observation of donor genome: AGAGGGGGAAAGAGA
Probability of observing “A” at the site of “G” (A = reference allele, B = alternative allele, $e$ = probability of a sequencing error)
True genotype = “AA” and sequencing error: $\frac{1}{4} e$
True genotype = “AB” and
- Read was generated from ‘A’ allele and sequencing error and ‘B’ was generated by chance: $\frac{1}{2} \cdot \frac{1}{4} \cdot e$
- Read was generated from ‘B’ allele and no sequencing error: $\frac{1}{2}(1 - e)$
True genotype = “BB” and no sequencing error: $1 - e$
56
Genotype determination
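Likelihood that the genotype is wild-type given the observation!
$L(g = AA \mid D) = \prod_{i=1}^{d} P(D_i \mid g = AA)$
$L(g = AB \mid D) = \prod_{i=1}^{d} P(D_i \mid g = AB)$
$L(g = BB \mid D) = \prod_{i=1}^{d} P(D_i \mid g = BB)$
Likelihood that the genotype is mutant given the observation!
Here $D_i$ is the observation from read $i$ and $d$ is the number of reads covering the position.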
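These likelihoods can be computed for the pileup shown earlier (reference G, reads AGAGGGGGAAAGAGA). The sketch below follows the per-read probabilities from the preceding slides; the error rate of 0.01 and all function names are assumptions for illustration, and logs are summed instead of multiplying probabilities to avoid underflow.

```python
from math import log


def read_probability(base, genotype, e):
    """P(observed base | genotype), using the slides' simple error model."""
    def from_allele(allele):
        # Correct read: 1 - e. Erroneous read landing on this base: e / 4.
        return (1 - e) if base == allele else e / 4
    first, second = genotype
    # Each read comes from either allele with probability 1/2.
    return 0.5 * from_allele(first) + 0.5 * from_allele(second)


def genotype_log_likelihoods(observations, ref, alt, e=0.01):
    """Log-likelihood of AA, AB and BB given the observed bases."""
    genotypes = {"AA": (ref, ref), "AB": (ref, alt), "BB": (alt, alt)}
    return {
        name: sum(log(read_probability(base, alleles, e)) for base in observations)
        for name, alleles in genotypes.items()
    }


# Pileup from the slide: reference allele G, 8 reads with G and 7 with A.
scores = genotype_log_likelihoods("AGAGGGGGAAAGAGA", ref="G", alt="A")
best = max(scores, key=scores.get)
```

With a near-even mix of G and A reads, the heterozygous genotype has by far the highest likelihood.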
57
Tools
58
somatic mutations
59
Germline vs. Somatic mutation
- sample from non-disease site
- sample from disease site
- reference sequence (e.g. hg19)
- tools: UnifiedGenotyper, VarScan2, SomaticSniper, …
60
Easy way to find somatic mutations
- sample from non-disease site: GN = AA
- sample from disease site: GT = AB
61
Joint Probabilities
62
Joint Probabilities
P(GT=AB|GN=AA) ≠ P(GT=AB|GN=AB) ≠ P(GT=AB|GN=BB)
Tumor genotype is dependent on normal genotype!!!
G: Joint Genotype Matrix
63
when sample is not pure
64
Heterogeneous Sample
[Figure: reads from a mixed sample. Normal cells contribute only G; tumor cells contribute both G and A.]
65
Causes of low-frequency mutations
- Sample contamination (e.g. stromal cells)
- Tumor heterogeneity
- Extreme environments
- Somatic mosaicism
69
Heterogeneous Sample
Observed reads: G G G G G G G G G G G A A G G (2 of 15 are ‘A’)
“2/15: No mutation. Two ‘A’s are from sequencing errors...”
VS
“2/15: Heterozygous somatic mutation!! The sample is certainly heterogeneous!”
“How do we know this?”
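One way to weigh the two explanations is to ask how likely 2 ‘A’ reads out of 15 are under each. The sketch below is only illustrative: the error rate of 0.01 and the mutant allele fraction of 0.15 (for example, a heterozygous mutation in 30% of the cells) are assumed numbers, not values from the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k variant reads out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_tail(k_min, n, p):
    """Probability of k_min or more variant reads out of n."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))


depth, alt_reads = 15, 2
error_rate = 0.01        # assumed per-base error rate
allele_fraction = 0.15   # assumed: heterozygous mutation in 30% of cells

p_error_only = binomial_tail(alt_reads, depth, error_rate)
like_error = binomial_pmf(alt_reads, depth, error_rate)
like_mutation = binomial_pmf(alt_reads, depth, allele_fraction)
```

Under these assumptions the mutation explanation is far more likely, but the conclusion depends entirely on knowing the mixture fraction, which is the point of the next slides.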
71
Estimating Cellularity
It is “easy” only if we already know where to look (disease genotype is AB or BB)
But how do we know the genotype (even without knowing α)?
- Use SNP array: ONCOSNP (Yau et al, Genome Biol, 2009), Absolute (Carter et al, Nature Biotech, 2012)
- SNP calling: Snyder et al, PNAS, 2010; PurityEst (Su et al, Bioinformatics, 2012)
72
Accurate inference in Virmid
Estimate global within-individual contamination for accurate detection of somatic mutations
73
Bias 1 - Loss of Reads (Virmid)
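[Figure: read r1 passes site g1 and carries the reference allele A; read r2 passes site g2 and carries the B allele]
$x_a = P(\text{a read that passes } g_1 \text{ being unmapped}) = P(r_1 \text{ has } d+1 \text{ or more variants in the remaining sites})$
$x_b = P(\text{a read that passes } g_2 \text{ being unmapped}) = P(r_2 \text{ has } d \text{ or more variants in the remaining sites})$
$x_a = 1 - \sum_{i=0}^{d} \binom{l-1}{i} p^i (1-p)^{l-1-i}$
$x_b = 1 - \sum_{i=0}^{d-1} \binom{l-1}{i} p^i (1-p)^{l-1-i}$
where $d$ = maximum edit distance, $l$ = read length, and $p$ = frequency of variation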
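The two formulas can be evaluated numerically to see how much more often a read carrying the B allele is lost. The parameter values below (d = 2, l = 100, p = 0.01) are assumptions chosen only for illustration, and the function names are made up for this example.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(at most k variants among n sites), binomial with rate p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def unmapped_probabilities(d, l, p):
    """Return (x_a, x_b) following the two formulas above."""
    # Reference-allele read: unmapped with d + 1 or more other variants.
    x_a = 1 - prob_at_most(d, l - 1, p)
    # B-allele read already uses one edit: unmapped with d or more others.
    x_b = 1 - prob_at_most(d - 1, l - 1, p)
    return x_a, x_b


# Assumed values: edit distance 2, read length 100, variation rate 0.01.
x_a, x_b = unmapped_probabilities(d=2, l=100, p=0.01)
```

Because x_b is larger than x_a, reads supporting the variant are lost more often than reads supporting the reference, which biases the observed allele fraction.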
74
Bias 2 - Loss of variants (Virmid)
α: reads from normal
1-α: reads from disease
B-allele: overestimate BAF, underestimate α
75
[Figure: estimated α, with regions of underestimated α and overestimated α]
76
Calling low-fraction somatic mutations in Virmid
Kim S et al, Genome Biology 2013
77
Low-frequency mutations in disease
Identification of de novo somatic mutations in AKT-MTOR-PIK3CA in hemimegalencephaly Lee J et al, Nature Genetics, 2012
78
Low-frequency mutations in disease
Identification of MTOR driver mutations in focal cortical dysplasia Lim J et al, Nature Medicine 2015
79
Copy number variation (CNV)
80
Copy Number Variation
Changes in copy number of a large DNA segment
- usually in terms of genes, e.g. HER2 amplification
Types of CNVs
- Copy number gain (CN > 2): increase of copy number due to genomic rearrangement such as insertion/duplication
- Copy number loss (CN < 2): decrease of copy number due to genomic rearrangement such as deletion
Copy number aberration (CNA) refers to CNV particularly when the events are associated with a disease phenotype
81
Comparative Genome Hybridization (CGH)
500kb-1500kb fragment for optimal hybridization
82
Array CGH
83
Resolution
84
Benefits of NGS-based CNV detection
- High resolution (< 50 bp) in size
- Data reuse (multi-purpose): one NGS (whole-genome) sequencing run can be used for SNV, CNV, and SV detection
- Can be improved with additional NGS information: discordant reads in paired-end sequencing
85
Inferring CNVs from NGS
Principle: Samples with copy number gain (or loss) will generate more (or fewer) reads in the region
Example for a gene: 3 Copy (gain), 2 Copy (normal), 1 Copy (loss)
86
The signal
Reads from 3 Copy (gain), 2 Copy (normal), and 1 Copy (loss) regions, mapped to reference
Catching these needs a systematic approach!
88
Catching the signal Problems
Read depth is not uniform even without copy number changes
- GC bias
- Mapping bias in repeat region
- Natural variance (Poisson distribution)
Poisson distribution:
- The probability of a given number of events occurring in a fixed interval of time and/or space.
Example:
- You have 120 phone calls a day; what is the best way to describe the number of phone calls in an hour?
- Similarly, you generated 100,000,000 NGS reads from the whole genome; what is the number of reads generated within chr1: ?
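Both examples can be worked through with a small Poisson sketch. The genome size of 3 billion bases matches the figure used earlier in the deck, but the 10 kb window is a hypothetical choice (the slide's region is not given), reads are assumed to fall uniformly, and the function names are made up for this example.

```python
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """P(exactly k events) when the expected number of events is lam."""
    # Computed in log space so that large k and lam do not overflow.
    return exp(k * log(lam) - lam - lgamma(k + 1))


# Phone calls: 120 per day is an average of 5 per hour.
calls_per_hour = 120 / 24
p_five_calls = poisson_pmf(5, calls_per_hour)


def expected_reads(total_reads, genome_size, window_size):
    """Expected read count in a window if reads fall uniformly on the genome."""
    return total_reads * window_size / genome_size


# Hypothetical 10 kb window on a 3 Gb genome with 100 million reads.
lam = expected_reads(100_000_000, 3_000_000_000, 10_000)
```

So the hourly call count is modelled as Poisson with mean 5, and the read count in the window as Poisson with mean of about 333.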
89
Significantly deviated read-depth
Null hypothesis (H0): copy number of a given region is unchanged
- we assume the read-depth follows a Poisson distribution
Alternative hypothesis (Ha): copy number of a given region is changed
If H0 is right:
- The read-depth (calculated from number of reads) within a specific genomic region is not significantly deviated from the Poisson distribution
If the read-depth is too deviated to explain with natural variance (Poisson distribution):
- Copy number has been changed
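The test can be sketched as a two-sided Poisson p-value for the read count in one window. This is a bare-bones illustration under the null hypothesis only, with no GC or mappability correction; the expected count of about 333 reads reuses a hypothetical 10 kb window, and the function names are made up for this example.

```python
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """P(exactly k reads) when lam reads are expected."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


def poisson_cdf(k, lam):
    """P(k or fewer reads) when lam reads are expected."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


def depth_p_value(observed, expected):
    """Two-sided p-value for an observed read count under H0 (Poisson)."""
    lower = poisson_cdf(observed, expected)
    # The upper tail is limited by floating-point precision for extreme counts.
    upper = max(0.0, 1.0 - poisson_cdf(observed - 1, expected)) if observed > 0 else 1.0
    return min(1.0, 2 * min(lower, upper))


expected = 100_000_000 * 10_000 / 3_000_000_000   # about 333 reads per window
p_normal = depth_p_value(333, expected)           # 2 copies
p_gain = depth_p_value(500, expected)             # 1.5x depth, as for 3 copies
p_loss = depth_p_value(167, expected)             # 0.5x depth, as for 1 copy
```

A small p-value rejects H0 for that window; with many windows across a genome, a multiple-testing correction would also be needed.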
90
Practically, we should consider
- Bias correction from sequence context (GC-bias, etc.)
- Event detection method: decide whether a significant rise (or drop) of read-depth looks like an event
  - mean-shift technique (CNVnator, Abyzov et al 2011)
  - event-wise testing (Yoon et al, 2009)
  - paired-end signal (CNVer, Medvedev et al 2010)
91
CNVnator
92
Structural variation (SV)
93
Beyond the SNVs
94
Beyond the SNVs
95
Beyond the SNVs TFE3-KHSRP Translocation in Renal Cell Carcinoma
96
Structural Variations (SVs)
Genomic rearrangements that affect >50bp of sequence
Alkan et al, Nat. Rev. Genetics 12, 2011
97
List of structural variations
98
List of structural variations
99
Paired-end sequencing
100
Paired end reads for SV finding
[Figure: paired-end reads compared between Reference and Donor genomes]
Bix Seminar UCSD
101
Methods for SV detection
Read depth
- Assume a random distribution in mapping depth
- Significantly higher depth for duplicated regions
- Significantly reduced depth for deleted regions
Read pair
- Assess the span and orientation of paired end reads
Split read
- Define breakpoints of SVs using split-sequence-read signature (broken alignment)
Assembly
- Assemble and reconstruct the whole genome of sample DNA
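The read-pair signal can be sketched as a simple classifier over the span and orientation of each pair. Everything here is a simplifying assumption: the `ReadPair` record is a hypothetical stand-in for fields read from an alignment file, the library is assumed to produce forward/reverse pairs, and the cutoff of 3 standard deviations is an arbitrary illustrative threshold.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for one aligned read pair."""
    name: str
    left_pos: int       # leftmost mapped position of the pair
    right_pos: int      # rightmost mapped position of the pair
    left_strand: str    # '+' or '-'
    right_strand: str   # '+' or '-'


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a pair as concordant or name the discordance it shows."""
    span = pair.right_pos - pair.left_pos
    if (pair.left_strand, pair.right_strand) != ("+", "-"):
        return "orientation (possible inversion)"
    if span > mean_insert + n_sd * sd_insert:
        # Mates map too far apart: the donor lacks sequence in between.
        return "span too large (possible deletion)"
    if span < mean_insert - n_sd * sd_insert:
        # Mates map too close: the donor has extra sequence in between.
        return "span too small (possible insertion)"
    return "concordant"


pairs = [
    ReadPair("p1", 1000, 1400, "+", "-"),
    ReadPair("p2", 5000, 9000, "+", "-"),
    ReadPair("p3", 7000, 7100, "+", "-"),
    ReadPair("p4", 8000, 8400, "+", "+"),
]
labels = {p.name: classify_pair(p, mean_insert=400, sd_insert=30) for p in pairs}
```

A single discordant pair is weak evidence on its own; callers usually require clusters of pairs that support the same event.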
109
Methods for Deletion Detection
115
Problems 1. Judgment of discordance
117
Problem 2. Size of insertion
118
Problem 2. Large indels
Novel Sequence Insertion
119
Problem 2. Large indels
Existing Sequence Insertion
120
Problem 3. Nonspecific Mappings
122
Discussion
123
Thank you