Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 4 Structural Variation Michael Brudno Informatics on High Throughput Sequencing Data July 2009
Diversity of Humans
Humans are diverse. Genomic variation:
Single Nucleotide Polymorphisms (SNPs) occur at ~1/1000 positions. Find them by comparing reads from one individual to the reference human genome.
Structural variations are large-scale genomic alterations: insertions, deletions, inversions, translocations and changes in copy number.
G: 798 GAACCCCTTACAACTGAACCCCTTAC
       |||||||||| |||||||||||||||
R:     GAACCCCTTATAACTGAACCCCTTAC
What are structural variations? Consecutive basepairs Various examples of structural variations
What are Structural Variations? Insertion ATAC Donor genome Reference genome
What are Structural Variations? Deletion ATAC Donor genome ATAC Reference genome
What are Structural Variations? Inversion ATAC TATG Donor genome ATAC TATG Reference genome
What are Structural Variations? Inversion TATG ATAC Donor genome ATAC TATG Reference genome
How Do We Detect Structural Variations? Comparative Genomic Hybridization (CGH): detects copy number changes; cannot detect inversions and translocations. Fluorescence In Situ Hybridization (FISH): time-consuming and expensive. Direct comparison of genomes (Levy et al. 2007): very expensive to assemble the whole genome. Unassembled clone-end data (Tuzun et al. 2005, Korbel et al. 2007): uses matepairs to detect structural variations, much cheaper than assembling the whole genome. Next-generation sequencing technologies make this approach even more promising.
Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results
What are Matepairs? For now, assume insert size is perfect. [Figure: a DNA fragment with reads ATCAA and CTAAG at its two ends forming a matepair; the insert size is marked across the fragment]
Detecting Structural Variants With Matepairs - No structural variants: Insert size = Mapped distance (concordant matepair).
Detecting Structural Variants With Matepairs - Insertion: Insert size > Mapped distance. Size of insertion = Insert size - Mapped distance.
Consistency - Insertion: matepairs Xi and Xj are consistent when they overlap and the size of insertion explained by Xi = the size of insertion explained by Xj.
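The relations on the slides above can be sketched in a few lines of Python. This is only an illustration under the slides' "perfect insert size" assumption; the function names are hypothetical, and the deletion case is the mirror image of the insertion rule rather than something the slide states.

```python
def indel_size(insert_size, mapped_distance):
    """Signed size implied by one matepair.

    Positive = insertion in the donor, negative = deletion,
    zero = concordant (perfect insert size assumed).
    """
    return insert_size - mapped_distance


def classify(insert_size, mapped_distance):
    """Label a matepair from the sign of the implied size."""
    size = indel_size(insert_size, mapped_distance)
    if size == 0:
        return "concordant"
    return "insertion" if size > 0 else "deletion"


def consistent_insertion(size_i, size_j, spans_overlap):
    """Two matepairs support the same insertion if they overlap
    and explain the same insertion size."""
    return spans_overlap and size_i == size_j
```

For example, a 208bp insert whose reads map 188bp apart implies a 20bp insertion.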
Detecting Structural Variants With Matepairs - Inversion 3’ 5’ A 5’ 3’ REF 3’ 5’
Range of The Size of Inversion (lower bound): |m - insert size of Xi| < size of inversion. Given the nature of inversions, the lower bound on the size of the inversion occurs when the inversion is the smallest interval containing the sampled read location and its mapped location.
Range of The Size of Inversion (upper bound): size of inversion < m + insert size of Xi. The upper bound on the size of the inversion occurs when the boundary of the inversion (affecting one of the reads) meets the endpoint of the other read.
Range of The Size of Inversion: |m - insert size of Xi| < size of inversion < m + insert size of Xi, where m is the distance between the mapped positions of the two reads. [Figure: the case when the lower bound is tight]
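As a small worked example, the two bounds can be computed directly from m and the insert size. The function name is hypothetical and the bounds are exclusive, matching the strict inequalities on the slide.

```python
def inversion_size_range(m, insert_size):
    """Return the (lower, upper) exclusive bounds on the inversion size
    implied by one matepair: |m - insert| < size < m + insert."""
    return abs(m - insert_size), m + insert_size
```

With m = 500 and an insert size of 208, the inversion must be longer than 292bp and shorter than 708bp.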
Consistency - Inversion: matepairs Xi and Xj are consistent when Mapped Distance A = Mapped Distance B and the range of the size of inversion explained by Xi overlaps the range of the size of inversion explained by Xj.
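The range-overlap part of this test is an ordinary open-interval intersection check. The sketch below covers only that part; it assumes each range is a (lower, upper) pair of exclusive bounds, and the function name is hypothetical.

```python
def ranges_overlap(range_i, range_j):
    """True if two open intervals (lower, upper) share at least one value."""
    lower = max(range_i[0], range_j[0])
    upper = min(range_i[1], range_j[1])
    return lower < upper
```

Two matepairs whose ranges are (292, 708) and (300, 716) could be explained by the same inversion; ranges that merely touch at an endpoint could not, because the bounds are exclusive.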
Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results
Difficulties in Small INDEL Detection: in reality, insert sizes of matepairs are not perfect, so small indels cannot be detected from a single matepair (e.g. indels smaller than 3 standard deviations of the insert size; Tuzun et al. 2005).
Detecting Smaller INDELs: when Insert size ≈ Mapped distance, is it a small insertion… or noise?
Haploid Case – Alignment [Figure: donor matepairs aligned to the reference (REF), each with a mapped distance; matepairs over the same region form a cluster]
Haploid Case – Distribution: make a distribution of mapped distances in each cluster. The distribution shifts away from the distribution of insert sizes if there is an INDEL. [Figure: mapped-distance distributions centred at 188bp (20bp insertion), 208bp (no indel) and 228bp (20bp deletion)]
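In the haploid case, the shift can be read off as the difference between the library's mean insert size and the mean mapped distance of the cluster. The sketch below assumes the mean insert size is known in advance; the function name is hypothetical.

```python
from statistics import mean


def estimate_indel(mapped_distances, mean_insert_size):
    """Estimate the signed indel size for one cluster.

    Positive = insertion in the donor (reads map closer together than
    the insert size), negative = deletion.
    """
    return mean_insert_size - mean(mapped_distances)
```

A cluster whose mapped distances average 188bp, drawn from a library with a 208bp mean insert size, gives an estimate of a 20bp insertion.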
Accuracy of INDEL Estimation: Central Limit Theorem. The mean of n independent, identically distributed random variables with finite mean $\mu$ and variance $\sigma^2$ approximately follows a Gaussian distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. Z1…Zn: random variables for the size of the indel supported by each matepair. [Figure: distribution of the mean of the random variables Z1…Zn]
P-value (assigning a confidence): the probability that a cluster is generated from a region without an indel.
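One way to turn this definition into a number is a two-sided z-test on the cluster mean, using the Central Limit Theorem for the standard error. This is a sketch under the assumption that insert sizes have a known mean and standard deviation; the slides do not say which test MoDIL actually applies, and the function name is hypothetical.

```python
import math


def indel_p_value(mapped_distances, mean_insert_size, sd_insert_size):
    """Two-sided p-value for the null hypothesis 'no indel in this region'.

    Under the null, the cluster mean is approximately Gaussian with mean
    mean_insert_size and standard deviation sd_insert_size / sqrt(n).
    """
    n = len(mapped_distances)
    cluster_mean = sum(mapped_distances) / n
    standard_error = sd_insert_size / math.sqrt(n)
    z = (cluster_mean - mean_insert_size) / standard_error
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))
```

A single matepair 20bp off a 208±13bp library is unremarkable, but twenty matepairs that are all 20bp off give a very small p-value, which is why clustering makes small indels detectable.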
Diploid Case – Alignment & Clustering [Figure: a heterozygous insertion; matepairs from donor chromosomes C1 and C2 are aligned to the reference (REF) and fall into the same cluster]
Diploid - Mixture of Distributions: heterozygous insertion. [Figure: distribution of mapped distances from donor C1 (=? distribution of insert size) and distribution of mapped distances from donor C2, separated by the size of insertion along the mapped-distance axis]
Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results
Expectation Maximization (EM) Algorithm 1. Randomly initialize the parameters of the two distributions, p1 and p2. 2. E step: assign each matepair, Mi, to one of the two distributions (to p1 with some probability, to p2 otherwise). 3. M step: update the parameters by searching for the values that minimize the Kolmogorov–Smirnov statistic.
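The loop above can be sketched as follows. This is a deliberately simplified illustration, not MoDIL's implementation, and every simplification is an assumption: the insert-size distribution is taken to be Gaussian with a known standard deviation, the E step uses hard assignment to the nearer mean instead of probabilistic assignment, initialization is deterministic rather than random, and the M step is a plain search over a caller-supplied list of candidate means. All function names are hypothetical.

```python
import math


def gaussian_cdf(x, mean, sd):
    """CDF of a Gaussian (assumed shape of the insert-size distribution)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2))))


def ks_statistic(sample, mean, sd):
    """Kolmogorov-Smirnov distance between a sample and Gaussian(mean, sd)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = gaussian_cdf(x, mean, sd)
        # Compare the model CDF with the empirical CDF just after and before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def best_mean(sample, sd, candidates):
    """M step for one component: candidate mean with the smallest KS statistic."""
    return min(candidates, key=lambda m: ks_statistic(sample, m, sd))


def fit_two_means(distances, sd, candidates, iterations=20):
    """Fit the means of two equal-width components to mapped distances."""
    # Assumption: deterministic start at the extremes (the slide says random).
    mu1, mu2 = min(distances), max(distances)
    for _ in range(iterations):
        # E step (hard assignment): each matepair goes to the nearer mean.
        c1 = [d for d in distances if abs(d - mu1) <= abs(d - mu2)]
        c2 = [d for d in distances if abs(d - mu1) > abs(d - mu2)]
        # M step: re-estimate each mean by minimizing the KS statistic.
        new_mu1 = best_mean(c1, sd, candidates) if c1 else mu1
        new_mu2 = best_mean(c2, sd, candidates) if c2 else mu2
        if (new_mu1, new_mu2) == (mu1, mu2):
            break  # converged
        mu1, mu2 = new_mu1, new_mu2
    return mu1, mu2
```

For a cluster whose distances fall into one group around 150bp and another around 208bp, with a 208bp library mean, the two fitted means would suggest a heterozygous insertion of roughly 58bp on one chromosome and no indel on the other.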
Pipeline of MoDIL 1. Preprocessing: read mapping, clustering
Pipeline of MoDIL
1. Preprocessing: read mapping; clustering; assign each matepair to a unique mapping
2. Detecting INDELs from MoDs: EM algorithm to detect INDELs in each cluster
3. Post-processing: compute P-values; merge duplicates; compute P-het
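To make the clustering step concrete, here is a minimal sketch that groups matepairs whose mapped spans overlap on the reference. It is a stand-in based on simple interval merging, not the clustering rule MoDIL uses (the slides do not specify it); spans are assumed to be half-open (start, end) pairs on one chromosome, and the function name is hypothetical.

```python
def cluster_matepairs(spans):
    """Group (start, end) spans into clusters of mutually chained overlaps."""
    clusters = []
    current = []
    current_end = None
    for start, end in sorted(spans):
        if current and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            # Gap on the reference: close the cluster and start a new one.
            if current:
                clusters.append(current)
            current = [(start, end)]
            current_end = end
    if current:
        clusters.append(current)
    return clusters
```

Each resulting cluster would then be handed to the INDEL detection step on its own.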
Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results
Simulation Results: implanted all indels from Mills et al. into chromosome 1 and generated ~51 million matepairs. [Figure panels: Insertion; Deletion]
Comparison with Other Methods: both MoDIL and Hormozdiari et al.’s method performed well for INDELs >=40bp (precision: ~0.95, recall: ~0.85). For INDELs of 20-39bp, only MoDIL detected INDELs (precision: ~0.9, recall: ~0.87). MAQ only found very small INDELs (<10bp). [Figure panels: INDELs >=40bp; INDELs 20-39bp]
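For reference, precision and recall are computed from counts of true positives, false positives and false negatives. The counts in the usage note are made up for illustration and are not taken from the experiments; the function name is hypothetical.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```

With 85 correct calls, 5 spurious calls and 15 missed indels, precision is 85/90 and recall is 0.85. The false negative rate is 1 - recall.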
Comparison with Kidd et al.: NA18507 (40x Illumina coverage, 208±13bp matepairs). Kidd et al. found a small fraction of INDELs using Sanger-style reads (0.3x coverage). False negative rate (FNR): >=20bp INDELs, 0.05; 15-19bp INDELs, 0.3; 10-14bp INDELs, 0.65.
Accuracy of Size Estimation of MoDIL: a large number of indels (~32%) overlapped with Mills et al. Compared sizes of Mills et al. and MoDIL: Pearson’s correlation coefficient, r² = 0.96; (Mills et al. minus MoDIL) overlaps with a Gaussian with SD=4 (expected SD for a cluster with 20 matepairs). [Figure: MoDIL indel size vs. Mills et al. indel size; histogram of Mills et al. - MoDIL]
Copy Number Variants (CNVs): large regions that appear a different number of times in different individuals. SNPs have been widely studied; CNVs are another kind of variation. CNVs are associated with a number of diseases: cancer, HIV, autism, schizophrenia. Input: reference human genome; sequenced donor genome. Output: CNV annotations in the reference.
Step 1 – Build Repeat Graph: the repeat graph captures the copy numbers of the reference. View the repeat graph as containing adjacency information: a walk in this graph spells a genome, and the reference genome is a walk in this graph (Pevzner, Tang, Tesler 2004).
Step 2 – Capture Donor Adjacencies Ref
Outline: CNVs Building the Donor Graph Adding Depth-of-Coverage Results
Adding Depth of Coverage Ref where Donor
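Calling CNVs: find the path “most faithful” to the depth of coverage (DOC). A probabilistic model scores “faithfulness”; network flow finds the “most likely” walk. [Figure: reference and donor graphs; edge depths 0.8, 2.3, 2.6, 0.5, 1.4, 1.7, 1.1 with path counts 1, 2, 2, 1, 1, 1, 1; the reference traverses the region once and the donor path twice, which is called as a CNV]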
Outline: CNVs Building the Donor Graph Adding Depth-of-Coverage Results
Preliminary Results NA18507 individual sampled with Illumina 9909 CNV calls 5795 losses, 4114 gains
Preliminary Results (Sensitivity) Kidd et al.’s loss calls on NA18507 (146 calls) Percentage of Kidd’s calls that overlap one of ours: After shuffling our calls:
Preliminary Results (Specificity) Percent of our GAIN calls that overlap with DGV: Percent of our LOSS calls that overlap with DGV: After shuffle:
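The overlap percentages and the shuffled baseline on the last two slides can be computed along the following lines. This is a sketch under stated assumptions: calls are half-open (start, end) intervals on a single chromosome, and shuffling places each call at a uniformly random position while keeping its length. The slides do not describe the actual shuffling procedure, and all function names are hypothetical.

```python
import random


def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def percent_overlapping(calls, others):
    """Percentage of `calls` that overlap at least one interval in `others`."""
    hits = sum(1 for c in calls if any(overlaps(c, o) for o in others))
    return 100.0 * hits / len(calls)


def shuffle_calls(calls, genome_length, rng):
    """Move every call to a random position, preserving its length."""
    shuffled = []
    for start, end in calls:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled
```

Comparing `percent_overlapping(calls, known)` with the same figure for `shuffle_calls(calls, genome_length, random.Random(0))` shows how much of the overlap would be expected by chance alone.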
More Results (McCarroll Comparison) McCarroll et al. bottom third percentile within 270 samples (94 calls) McCarroll et al. homozygous deletions (39 calls) McCarroll et al. top third percentile (26 calls)
Take-home points. MoDIL: take advantage of high clone coverage to find smaller INDELs with high accuracy; ~90% precision and recall for INDELs ≥ 20bp. CNVs: combine paired-end and arrival information to find CNVs; good concordance with previous results. Matepairs are key: the length and distribution of insert sizes are key; read length is (sometimes) less so.
Acknowledgments: Paul Medvedev, Marc Fiume, Tim Smith, Adrian Dalca, Seunghak Lee, Can Alkan (UW), Fereydoun Hormozdiari (SFU). http://compbio.cs.toronto.edu/modil http://compbio.cs.toronto.edu
Thank you for your attention Michael Brudno University of Toronto