Presentation is loading. Please wait.

Presentation is loading. Please wait.

Canadian Bioinformatics Workshops

Similar presentations


Presentation on theme: "Canadian Bioinformatics Workshops"— Presentation transcript:

1 Canadian Bioinformatics Workshops

2 Module #: Title of Module
2

3 Module 4 Structural Variation
Michael Brudno Informatics on High Throughput Sequencing Data July 2009

4 Diversity of Humans G: 798 GAACCCCTTACAACTGAACCCCTTAC
Humans are diverse Genomic Variation Single Nucleotide Polymorphisms SNPs occur ~1/1000 positions Find by comparing reads from one individual to the reference human genome Structural variations are large scale genomic alterations Insertions, deletions, inversions, translocations and changes in copy numbers G: 798 GAACCCCTTACAACTGAACCCCTTAC |||||||||| ||||||||||||||| R: GAACCCCTTATAACTGAACCCCTTAC

5 What are structural variations?
Consequtive basepairs Various examples of structural variations

6 What are Structural Variations?
Insertion ATAC Donor genome Reference genome

7 What are Structural Variations?
Deletion ATAC Donor genome ATAC Reference genome

8 What are Structural Variations?
Inversion ATAC TATG Donor genome ATAC TATG Reference genome

9 What are Structural Variations?
Inversion ATAC TATG Donor genome ATAC TATG Reference genome

10 What are Structural Variations?
Inversion ATAC TATG Donor genome ATAC TATG Reference genome

11 What are Structural Variations?
Inversion ATAC TATG Donor genome ATAC TATG Reference genome

12 What are Structural Variations?
Inversion TATG ATAC Donor genome ATAC TATG Reference genome

13 What are Structural Variations?
Inversion TATG ATAC Donor genome ATAC TATG Reference genome

14 What are Structural Variations?
Inversion TATG ATAC Donor genome ATAC TATG Reference genome

15 What are Structural Variations?
Inversion TATG ATAC Donor genome ATAC TATG Reference genome

16 How Do We Detect Structural Variations?
Comparative Genome Hybridization (CGH) Detect copy number changes Cannot detect inversions and translocations Fluorescence In Situ Hybridization (FISH) Time-consuming and expensive Direct comparison of genomes (Levy et al. 2007) Very expensive to assemble the whole genome Unassembled clone-end data (Tuzun et al. 2005, Korbel et al. 2007) Uses matepairs to detect structural variations, much cheaper than assembling the whole genome Next generation sequencing technologies make it even more promising

17 Outline Detecting structural variants with matepairs
Probabilistic framework Finding structural variations Results

18 Outline Detecting structural variants with matepairs
Probabilistic framework Finding structural variations Results

19 What are Matepairs? For now, assume insert size is perfect Matepair
DNA fragment ATCAA CTAAG Insert size For now, assume insert size is perfect

20 Detecting Structural Variants With Matepairs
No structural variants A Insert size Insert size = Mapped distance Concordant matepair REF Mapped distance

21 Detecting Structural Variants With Matepairs - Insertion
Size of insertion = Insert size - Mapped distance A Insert size Insert size > Mapped distance REF Mapped distance

22 Consistency - Insertion
Size of insertion explained by Xi = Size of insertion explained by Xj Overlap Xi Xj A Xi Xj REF

23 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

24 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

25 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

26 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

27 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

28 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

29 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

30 Detecting Structural Variants With Matepairs - Inversion
3’ 5’ A 5’ 3’ REF 3’ 5’

31 Range of The Size of Inversion
|m - insert size of Xi| < size of inversion Xi 3’ 5’ A Insert size of Xi Xi 5’ Given the nature of inversions, the lower bound on the size of inversion occurs when the inversion is the smallest interval containing the sampled read location and its mapped location. 3’ REF m

32 Range of The Size of Inversion
3’ 5’ A 5’ 3’ REF

33 Range of The Size of Inversion
3’ 5’ A 5’ 3’ REF

34 Range of The Size of Inversion
size of inversion < m + insert size of Xi Xi 3’ 5’ A Insert size of Xi Xi 5’ The upper bound for the size of the inversion occurs when the boundary of the inversion (affecting one of the reads) meets the endpoint of the other read. 3’ REF m

35 Range of The Size of Inversion
|m – insert size of Xi| < size of inversion < m + insert size of Xi Xi 5’ 3’ A Insert size of Xi Xi m is the distance between mapped positions of the two reads This picture shows the case when the lower bound is tight 5’ 3’ REF m

36 Consistency - Inversion
Mapped distance A = Mapped Distance B Range of the size of inversion explained by Xi overlaps Range of the size of inversion explained by Xj Overlap Xi Xj 5’ 3’ A Xi Xj 5’ 3’ REF Mapped Distance B Mapped Distance A

37 Outline Detecting structural variants with matepairs
Probabilistic framework Finding structural variations Results

38 Difficulties in Small INDEL Detection
In reality, insert sizes of matepairs are not perfect Unable to detect small indels (e.g. < 3STD) Tuzun et al. 2005 38

39 Detecting Smaller INDELS
Small Insertion… or noise A Insert size ≈ Mapped distance? REF

40 Haploid Case – Alignment
Donor REF 40

41 Haploid Case – Alignment
Donor Mapped distance REF Cluster 41

42 Haploid Case – Distribution
Make a distribution of mapped distances in each cluster => The distribution shifts from distribution of insert size if there is an INDEL 20bp insertion No indel 188bp 208bp 228bp 42

43 Haploid Case – Distribution
Make a distribution of mapped distances in each cluster => The distribution shifts from distribution of insert size if there is an INDEL No indel 20bp deletion 188bp 208bp 228bp 43

44 Accuracy of INDEL Estimation
Central Limit Theorem Mean of n independent random variables with finite mean and variance follows the Gaussian distribution with mean and standard deviation : random variables for size of indels supported by each matepair Distribution of mean of random variables Z1…Zn 44

45 P-value (assigning a confidence)
P-value Probability that a cluster is generated from a region without an indel 45

46 Diploid Case – Alignment & Clustering
Heterozygous insertion Donor C1 Donor C2 REF 46

47 Diploid Case – Alignment & Clustering
Heterozygous insertion Donor C1 Donor C2 REF Cluster 47

48 Diploid - Mixture of Distributions
Heterozygous insertion size of insertion distribution of mapped distances from donor C1 =? distribution of insert size Mapped distance distribution of mapped distances from donor C2 48

49 Outline Detecting structural variants with matepairs
Probabilistic framework Finding structural variations Results

50 Expectation Maximization (EM) Algorithm
1. Randomly initialize and 2. E step: Assign each matepair, Mi, to one of two distributions Assign Mi to p1 with probability , p2 with 3. M step: Update and by searching the optimal and which minimizes Kolmogorov–Smirnov statistic

51 Pipeline of MoDIL 1. Preprocessing Read Mapping Clustering Cluster 51

52 Pipeline of MoDIL 1. Preprocessing Read Mapping Clustering Cluster 52

53 Pipeline of MoDIL 1. Preprocessing Read Mapping Clustering Cluster 53

54 Pipeline of MoDIL 1. Preprocessing Read Mapping Clustering Cluster 54

55 Pipeline of MoDIL Read Mapping Clustering
1. Preprocessing Read Mapping Clustering Assign each matepair to unique mapping R1 R2 C2 C1 1 2 3 4 5 M1,4 M2,4 M3,5 55

56 Pipeline of MoDIL Read Mapping Clustering
1. Preprocessing Read Mapping Clustering Assign each matepair to unique mapping 2. Detecting INDELs from MoDs EM algorithm to detect INDELs in each cluster 3. Post-processing Compute P-values Merge duplicates Compute P-het 56

57 Outline Detecting structural variants with matepairs
Probabilistic framework Finding structural variations Results

58 Simulation Results Implanted all indels from Mills et al. into chromosome 1 and generated ~51 million matepairs Insertion Deletion 58

59 Comparison with Other Methods
Both MoDIL and Hormozdiari et al.’s method performed well for INDELs >=40bp (precision: ~0.95, recall: ~0.85) For INDELs (20-39bp), only MoDIL detected INDELs (precision: ~0.9 & recall: ~0.87) MAQ only found very small INDELs (<10bp) INDELs >=40 INDELs 20-39bp 59

60 Comparison with Kidd et al.
NA (40x Illumina coverage, 208±13bp matepairs) Kidd et al. found small fraction of INDELs using Sanger style reads (0.3x coverage) >=20bp INDELs FNR=0.05 15-19bp INDELs FNR=0.3 10-14bp INDELs FNR=0.65

61 Accuracy of Size Estimation of MoDIL
Large # of indels (~32%) overlapped with Mills et al. Compared sizes of Mills et al. and MoDIL Pearson’s correlation coefficient, r2=0.96 - (Mills et al. minus MoDIL) overlaps with Gaussian with SD=4 (expected SD for a cluster with 20 matepairs) MoDIL indel size Mills et al. indel size Mills et al. - MoDIL

62 Copy Number Variants (CNVs)
Large regions that appear a different number of times within different indiv. CNVs are associated with a number of diseases Input reference human genome sequenced donor genome Output CNV annotations in ref snps have been widely studied other variation is CNV cancer HIV autism schizophrenia

63 Step 1 – Build Repeat Graph
Repeat graph captures the copy-numbers of the ref. view repeat graph as containing adjacency information a walk in this graph spells a genome Pevzner, Tang, Tesler (2004)

64 Step 1 – Build Repeat Graph
The ref genome is a walk in this graph view repeat graph as containing adjacency information a walk in this graph spells a genome Pevzner, Tang, Tesler (2004)

65 Step 2 – Capture Donor Adjacencies
Ref

66 Step 2 – Capture Donor Adjacencies
Ref

67 Outline: CNVs Building the Donor Graph Adding Depth-of-Coverage
Results 67

68 Adding Depth of Coverage
Ref where Donor

69 Calling CNVs Ref 0.8 2.3 2.6 0.5 1.4 1.7 1.1 Depth Path 1 2 2 1 1 1 1 Ref 1 2 CNV Find the path “most faithful” to the DOC Probabilistic model to score “faithfulness” Network flow to find “most likely” walk Donor

70 Outline: CNVs Building the Donor Graph Adding Depth-of-Coverage
Results 70

71 Preliminary Results NA18507 individual sampled with Illumina
9909 CNV calls 5795 losses, 4114 gains

72 Preliminary Results (Sensitivity)
Kidd et al.’s loss calls on NA18507 (146 calls) Percentage of Kidd’s calls that overlap one of ours: After shuffling our calls:

73 Preliminary Results (Specificity)
Percent of our GAIN calls that overlap with DGV: Percent of our LOSS calls that overlap with DGV: After shuffle:

74 More Results (McCarroll Comparison)
McCarroll et al. bottom third percentile within 270 samples (94 calls) McCarroll et al. homozygous deletions (39 calls) McCarroll et al. top third percentile (26 calls)

75 Take-home points CNVs Matepairs are key MoDIL
Take advantage of high clone coverage to find smaller INDELs with high accuracy ~90% accuracy and recall for INDELs ≥ 20bp. CNVs Combine pair-end and arrival information to find CNVs Good Concordance with previous results Matepairs are key Length & distribution of insert sizes key Read length (sometimes) less so

76 http://compbio.cs.toronto.edu/modil http://compbio.cs.toronto.edu
Acknowledgments Paul Medvedev Marc Fiume Tim Smith Adrian Dalca Seunghak Lee Can Alkan (UW) Fereydoun Hormozdiari (SFU)

77 http://compbio.cs.toronto.edu/modil http://compbio.cs.toronto.edu
Acknowledgments Paul Medvedev Marc Fiume Tim Smith Adrian Dalca Seunghak Lee Can Alkan (UW) Fereydoun Hormozdiari (SFU)

78 Thank you for your attention
Michael Brudno University of Toronto


Download ppt "Canadian Bioinformatics Workshops"

Similar presentations


Ads by Google