Published by Ann Gray. Modified over 9 years ago.
1
1 A Robust Framework for Detecting Structural Variations
  February 6, 2008
  Seunghak Lee, Elango Cheran, and Michael Brudno
  University of Toronto, Canada
2
2 What are structural variations? (1)
  Variations of 10^3 to 10^6 base pairs in the genome:
  Insertion: a large contiguous fragment of DNA is inserted.
  Deletion: a large contiguous fragment of DNA is deleted.
  Inversion: a large contiguous fragment of DNA is inverted.
  Translocation: a large contiguous fragment of DNA is moved from one chromosome to another.
  Copy number variations.
3
3 What are structural variations? (2) Various examples of structural variations
4
4 Outline
  Introduction
    Type of Structural Variations
    Sequencing Approaches to Detect Structural Variations
    Motivation & Research Objectives
  Probabilistic Framework for Detecting Structural Variations
    Probabilistic Framework
    Flow of our Framework
    Hierarchical Clustering of Matepairs (2nd phase)
    Choosing a Unique Mapped Location for Each Matepair (3rd phase)
  Experiments
    Comparison with Three Previous Studies
    DMBT1 Gene for Deletion
    Centromere and Translocations
  Conclusions
5
5 Type of Structural Variations (1): Insertion
  [Figure: genome A against the reference (REF)]
6
6 Type of Structural Variations (2): Deletion
  [Figure: genome A against the reference (REF)]
7
7 Type of Structural Variations (3): Inversion
  [Figure: genome A against the reference (REF), with the 5' and 3' ends marked]
8
8 Type of Structural Variations (4): Translocation
  [Figure: chr1 and chr2]
9
9 Sequencing Approaches
  1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005]
     Maps matepairs onto the reference genome.
     Insertion and deletion: the mapped distance is inconsistent with the insert size.
     Inversion: both reads map in the same orientation.
  2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007]
     Proposed a high-throughput, massive paired-end mapping technique.
     Describes more detailed types of structural variations.
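The distance and orientation signatures listed above can be sketched as a small classifier. This is only an illustration under stated assumptions, not the procedure of either cited paper: `MatepairMapping`, `classify`, `mean_insert`, and `tolerance` are hypothetical names, and the cross-chromosome rule is borrowed from the translocation definition on slide 2.

```python
from dataclasses import dataclass


@dataclass
class MatepairMapping:
    """One candidate placement of both reads of a matepair (hypothetical record)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str


def classify(m, mean_insert, tolerance):
    """Label a mapped matepair from its mapped distance and read orientations."""
    if m.chrom1 != m.chrom2:
        return "translocation candidate"
    if m.strand1 == m.strand2:
        # Both reads in the same orientation: the inversion signature.
        return "inversion candidate"
    distance = abs(m.pos2 - m.pos1)
    if distance > mean_insert + tolerance:
        # Reads map too far apart on the reference: the donor lacks sequence.
        return "deletion candidate"
    if distance < mean_insert - tolerance:
        # Reads map too close together: the donor carries extra sequence.
        return "insertion candidate"
    return "concordant"
```

For example, with an assumed mean insert size of 40,000 bp and a tolerance of 5,000 bp, a matepair mapped 60,000 bp apart on opposite strands is labeled a deletion candidate.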
10
10 Motivation & Research Objectives (1)
  Tuzun et al. used scores that combine several factors (e.g. length, identity, and quality of the sequences).
  How can we map reads onto the reference genome?
11
11 Motivation & Research Objectives (2)
  Sequencing-based methods are effective for detecting structural variants, as shown by Tuzun et al. and Korbel et al.
  However, each read can have multiple mappings.
  Previous research used a priori mapped locations.
  Why not develop a probabilistic model without such assumptions?
  Hopefully, it can also be applied to short reads from NGS machines.
12
12 Probabilistic Framework (1)
  p(Y): the distribution of mapped distances of “uniquely mapped” matepairs of various sizes.
  We use p(Y) as the basis of our probabilistic framework.
13
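13 Probabilistic Framework (2): Insertion
  $\mu_Y = s + r$
  $P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\, P(X_j \mid \mathrm{ins}=r)$
  $P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s + r)|$ and $s$ is the mapped distance.
  $X_1, X_2$: matepairs 1 and 2.
  $Y$: random variable for the mapped distances of “uniquely mapped” matepairs.
  [Figure: p(Y), with $\mu_Y - \delta$ marked]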
14
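14 Probabilistic Framework (3): Deletion
  $\mu_Y = s - r$
  $P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\, P(X_j \mid \mathrm{del}=r)$
  $P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s - r)|$ and $s$ is the mapped distance.
  [Figure: p(Y), with $\mu_Y - \delta$ marked]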
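The insertion and deletion probabilities on the two slides above can be sketched numerically. The slides do not say what shape p(Y) has, so this sketch assumes a normal distribution purely for illustration. The function names and the example mean and standard deviation are hypothetical.

```python
from statistics import NormalDist


def tail_probability(dist, value):
    """1 - P(mu - delta <= Y <= mu + delta), with delta = |mu - value|."""
    delta = abs(dist.mean - value)
    inside = dist.cdf(dist.mean + delta) - dist.cdf(dist.mean - delta)
    return 1.0 - inside


def p_insertion(s, r, dist):
    """P(X_i | ins=r): mapped distance s plus insertion size r is compared with mu_Y."""
    return tail_probability(dist, s + r)


def p_deletion(s, r, dist):
    """P(X_i | del=r): mapped distance s minus deletion size r is compared with mu_Y."""
    return tail_probability(dist, s - r)


def p_pair(p_i, p_j):
    """Joint probability of two matepairs, using the factorisation on the slides."""
    return p_i * p_j


# Assumed p(Y): normal with mean 40,000 bp and standard deviation 3,000 bp.
insert_sizes = NormalDist(mu=40000, sigma=3000)
perfect_fit = p_insertion(35000, 5000, insert_sizes)  # s + r equals mu_Y, so delta = 0
poor_fit = p_insertion(34000, 0, insert_sizes)        # two standard deviations off
```

Under this reading the probability is 1 when the hypothesised event size exactly accounts for the discrepancy in mapped distance, and it shrinks as the corrected distance moves away from the mean.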
15
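15 Probabilistic Framework (4): Inversion
  $c - d = s(X_1) - s(X_2)$
  $P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1 - Y_2|} - (c - d)|$ and $s(X_i)$ is the insert size of $X_i$.
  [Figure: $p(|Y_1 - Y_2|)$, with $\mu_{|Y_1 - Y_2|} - \delta$ marked]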
16
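16 Probabilistic Framework (5): Translocation
  $(c - a) - (d - b) = s(X_1) - s(X_2)$
  $P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1 - Y_2|} - [(c - a) - (d - b)]|$ and $s(X_i)$ is the insert size of $X_i$.
  [Figure: $p(|Y_1 - Y_2|)$, with $\mu_{|Y_1 - Y_2|} - \delta$ marked]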
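The inversion and translocation probabilities compare a difference of insert sizes with the distribution of |Y1 - Y2|. A rough sketch follows, assuming (this is not stated on the slides) that Y1 and Y2 are independent and normal with a common standard deviation, which makes |Y1 - Y2| half-normal. All function names are hypothetical.

```python
from math import pi, sqrt
from statistics import NormalDist


def inversion_diff(c, d):
    """Observed quantity for an inversion: c - d."""
    return c - d


def translocation_diff(a, b, c, d):
    """Observed quantity for a translocation: (c - a) - (d - b)."""
    return (c - a) - (d - b)


def p_pair_given_event(observed_diff, sigma):
    """1 - P(mu - delta <= |Y1 - Y2| <= mu + delta), with delta = |mu - observed_diff|."""
    scale = sigma * sqrt(2.0)      # std dev of Y1 - Y2 for independent Y1, Y2
    mu = scale * sqrt(2.0 / pi)    # mean of the half-normal |Y1 - Y2|
    delta = abs(mu - observed_diff)
    standard = NormalDist()

    def cdf_abs(x):
        # CDF of |Y1 - Y2|; zero for non-positive arguments.
        return 0.0 if x <= 0 else 2.0 * standard.cdf(x / scale) - 1.0

    return 1.0 - (cdf_abs(mu + delta) - cdf_abs(mu - delta))
```

With a standard deviation of 3,000 bp, an observed difference of 50,000 bp gives a probability near zero, so two such matepairs would be unlikely to explain the same event.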
17
17 Flow of our Framework (1)
  1. Preprocessing step
     Get top K mappings
     Remove short mappings
     Make all possible combinations of mappings
     Discard matepairs consistent with insert size
     Remove invalid strands (-,+)
     Remove very similar mappings
     Mask repeats
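Three of the preprocessing steps above (top K mappings, all combinations, discarding concordant matepairs) can be sketched as follows. The dictionary keys, function names, and thresholds are hypothetical. The strand, similarity, short-mapping, and repeat-masking steps are left out because the slide does not define them precisely.

```python
from itertools import product


def top_k(hits, k):
    """Keep the k highest-scoring mappings of one read."""
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]


def candidate_pairs(hits_read1, hits_read2, mean_insert, tolerance):
    """All combinations of the two reads' mappings, minus the concordant ones."""
    kept = []
    for h1, h2 in product(hits_read1, hits_read2):
        same_chrom = h1["chrom"] == h2["chrom"]
        opposite_strands = h1["strand"] != h2["strand"]
        distance = abs(h2["pos"] - h1["pos"])
        concordant = (
            same_chrom
            and opposite_strands
            and abs(distance - mean_insert) <= tolerance
        )
        if not concordant:
            # Only discordant combinations can support a structural variation.
            kept.append((h1, h2))
    return kept
```

Whether a whole matepair or only the concordant combination is dropped when one concordant placement exists is a design choice the slide does not settle; this sketch drops only the combination.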
18
18 Flow of our Framework (2)
  2. Clustering
     Do hierarchical clustering for each structural variation type (insertion, deletion, inversion, translocation)
  3. Finding structural variations
     Find an initial configuration in a greedy manner
     Learn the parameters of the objective function
     Find a locally optimal configuration
19
19 Hierarchical Clustering (1)
  A cluster C is a set of matepairs that explain the same structural variation.
  Linkage distance: $D(X_1, X_2) = -\ln P(X_1, X_2 \mid C)$
  [Figure (example, insertion): matepairs X1 and X2 between genome A and the reference (REF), forming the cluster C = {X1, X2}]
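The pairwise linkage distance above turns a joint probability into a distance, so the most compatible matepairs are merged first. A minimal sketch of that first merge decision is shown here. The function names and the dictionary of joint probabilities are hypothetical, and the general multi-member linkage rule is not reproduced.

```python
from itertools import combinations
from math import inf, log


def linkage_distance(p_joint):
    """D(X1, X2) = -ln P(X1, X2 | C); infinite when the pair is impossible."""
    return inf if p_joint <= 0.0 else -log(p_joint)


def closest_pair(names, joint_prob):
    """Return the pair of matepairs with the smallest linkage distance."""
    return min(
        combinations(names, 2),
        key=lambda pair: linkage_distance(joint_prob[frozenset(pair)]),
    )
```

A joint probability of 1 gives distance 0, and lower probabilities give larger distances, so agglomerative clustering with a distance cutoff corresponds to a minimum joint probability.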
20
20 Hierarchical Clustering (2)
  Generally, the linkage distance is given by [equation not preserved in the transcript].
  We do hierarchical clustering separately for each type of structural variation.
21
21 Choosing a Unique Mapped Location (1)
  Each matepair must be assigned to a unique pair of BLAT hits and a unique cluster.
  [Figure: reads R1 and R2 with BLAT hits 1 to 5, candidate mappings M1,4, M2,4, and M3,5, and clusters C1 and C2]
22
22 Choosing a Unique Mapped Location (2)
  We define an objective function J(ω), in which:
  ƒ1 corresponds to the BLAT hit scores
  ƒ2 corresponds to the probability
  ƒ3 corresponds to the size of the clusters
23
23 Choosing a Unique Mapped Location (3)
  1. Find the initial configuration greedily.
  2. Learn the parameters of the objective function J(ω). We used hill climbing search to maximize the log likelihood of P(ω|λi).
  3. Finally, find a configuration that locally maximizes J(ω), again using hill climbing search.
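The final step, hill climbing over configurations, can be sketched generically. The exact form of J(ω) is not preserved in the transcript, so the toy objective below assumes a weighted sum of the three feature terms. That form, the weights, the feature values, and every name are hypothetical.

```python
def hill_climb(config, choices, objective):
    """Change one matepair's assignment at a time while the objective improves."""
    current = dict(config)
    best = objective(current)
    improved = True
    while improved:
        improved = False
        for pair_id, options in choices.items():
            for option in options:
                if option == current[pair_id]:
                    continue
                trial = dict(current)
                trial[pair_id] = option
                score = objective(trial)
                if score > best:
                    current, best, improved = trial, score, True
    return current, best


# Toy setup: two matepairs, each with two candidate placements "A" and "B".
weights = (1.0, 1.0, 0.5)  # assumed weights for f1, f2, f3
features = {
    ("m1", "A"): (0.9, -0.1, 2),  # (hit score, log probability, cluster size)
    ("m1", "B"): (0.5, -2.0, 1),
    ("m2", "A"): (0.4, -1.0, 1),
    ("m2", "B"): (0.8, -0.2, 3),
}


def toy_objective(cfg):
    """Assumed J: weighted sum of fixed per-assignment features."""
    return sum(w * f for key in cfg.items() for w, f in zip(weights, features[key]))


choices = {"m1": ["A", "B"], "m2": ["A", "B"]}
final_config, final_score = hill_climb({"m1": "B", "m2": "A"}, choices, toy_objective)
```

In the real problem the cluster-size term changes as matepairs move between clusters, so the objective would be recomputed from the whole configuration rather than from fixed per-assignment features. The search stops at a local optimum, which is why the greedy initial configuration matters.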
24
24 P-values
  We assign p-values to express confidence in our clusters.
  A p-value is the probability that the cluster was generated by the reference genome rather than by a structural variant:
  $\mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_{X_i \in C_k} P(X_i \mid C_{\mathrm{null}})$
  where $E$ is the expected number of matepairs mapped to the location of the cluster.
  P-values depend on the length of the cluster, the number of matepairs involved, and their probabilities.
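The p-value formula above is a direct product and can be evaluated as written. In this sketch `cluster_p_value` is a hypothetical name, E is assumed to be rounded to an integer so that the binomial coefficient is defined, and the null probabilities are made-up illustrative numbers.

```python
from math import comb, prod


def cluster_p_value(expected_matepairs, null_probs):
    """Pval(C_k) = (E choose |C_k|) * product of P(X_i | C_null) over the cluster."""
    cluster_size = len(null_probs)
    return comb(expected_matepairs, cluster_size) * prod(null_probs)


# Three matepairs in a region where 10 matepairs are expected to map.
example = cluster_p_value(10, [0.01, 0.02, 0.01])
```

The binomial coefficient counts the ways of picking the cluster's matepairs from those expected at the location, so larger clusters of individually unlikely matepairs receive smaller p-values.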
25
25 Clustering Results
  We started with ~360,000 matepairs.
    ~90% were uniquely mapped
    ~90% had a concordant position (mapped at ± 2)
  Through the clustering procedure above (FDR 0.2) we found:
    82 insertion clusters (53 had a uniquely mapped read)
    175 deletion clusters (135)
    103 inversion clusters (24)
    55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)
26
26 Example Deletion
27
27 Agreement with Previous Results
  Type       Total     Tuzun        Levy         Korbel       DGV-All
  Insertion  82(53)    12(7)/139    6(5)/319     0(0)/34      24(13)/2216
  Deletion   175(135)  21(17)/102   25(23)/344   45(36)/742   82(63)/4697
  Inversion  103(24)   34(12)/56    N/A          42(8)/105    60(15)/164
  We compared our clusters with the Tuzun, Levy, Korbel, and DGV-All datasets.
  All of the correlations (except the zero overlap) are significant (p-values < 0.001) according to Monte Carlo simulations.
  The DMBT1 deletion was also found in the Tuzun et al. dataset (but not in the Levy dataset).
28
28 Translocations
  A large fraction (69%) of the translocations were close to the centromeres.
  She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years.
  The two donors are ~0.2 million years apart.
  These could also be mis-assemblies.
  Distance to centromere:
                      <10^6   (10^6, 4.5*10^6]   >4.5*10^6
    <10^6             22      6                  10
    (10^6, 4.5*10^6]          0                  3
    >4.5*10^6                                    14
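The 69% figure can be checked against the distance table. The check reads the table as an upper-triangular 3x3 matrix of counts, with rows and columns giving the distance bin of each end. That reading and the bin labels `near`, `mid`, and `far` are assumptions, not something the slide states.

```python
# Counts per pair of distance-to-centromere bins.
# near: < 10^6, mid: (10^6, 4.5*10^6], far: > 4.5*10^6
counts = {
    ("near", "near"): 22,
    ("near", "mid"): 6,
    ("near", "far"): 10,
    ("mid", "mid"): 0,
    ("mid", "far"): 3,
    ("far", "far"): 14,
}

total = sum(counts.values())
# Clusters with at least one end within 10^6 bp of a centromere.
near_any = sum(n for bins, n in counts.items() if "near" in bins)
fraction_near = near_any / total

# Upper bound from the quoted rate: 200 events per million years over 0.2 million years.
expected_events = 200 * 0.2
```

Under this reading the counts sum to the 55 translocation clusters reported earlier, and 38 of them (about 69%) have an end within 10^6 bp of a centromere, which is consistent with the bound of about 40 events implied by the quoted rate.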
29
29 Conclusions
  Introduced a novel framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions.
  Introduced a probabilistic model for structural variants.
  Isolated 82 insertions, 175 deletions, and 103 inversions between the public human reference genome and the JCVI donor. These results show statistically significant correlation with previous variation studies.
  Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (of these, 121 have support from a uniquely mapped matepair).
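The counts of novel variants can be reproduced from the agreement table on slide 27. The arithmetic assumes each DGV-All cell reads "overlapping clusters (of which uniquely supported) / events in the dataset". That interpretation and the variable names are assumptions.

```python
# (total clusters, uniquely supported, overlapping DGV, overlapping DGV and uniquely supported)
clusters = {
    "insertion": (82, 53, 24, 13),
    "deletion": (175, 135, 82, 63),
    "inversion": (103, 24, 60, 15),
}

# Clusters that overlap no DGV event.
novel = sum(total - dgv for total, _, dgv, _ in clusters.values())

# Novel clusters that also have a uniquely mapped matepair.
novel_unique = sum(unique - dgv_unique for _, unique, _, dgv_unique in clusters.values())
```

Both sums match the conclusions: 58 + 93 + 43 = 194 novel variants, of which 40 + 72 + 9 = 121 have unique support.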