Canadian Bioinformatics Workshops

Module 4: Structural Variation
Michael Brudno, Informatics on High Throughput Sequencing Data, July 2009

Diversity of Humans
Humans are diverse: genomic variation.
Single Nucleotide Polymorphisms (SNPs) occur at ~1/1000 positions. Find them by comparing reads from one individual to the reference human genome.
Structural variations are large-scale genomic alterations: insertions, deletions, inversions, translocations and changes in copy number.
G: 798 GAACCCCTTACAACTGAACCCCTTAC
       |||||||||| |||||||||||||||
R:     GAACCCCTTATAACTGAACCCCTTAC
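The comparison of a read against the reference can be sketched as a column-by-column scan. This is a minimal illustration that assumes an ungapped, equal-length alignment; the function name `find_mismatches` is hypothetical, and real SNP callers also weigh base quality and coverage.

```python
def find_mismatches(reference: str, read: str, offset: int = 0):
    """Return (position, ref_base, read_base) for every mismatching column."""
    if len(reference) != len(read):
        raise ValueError("sketch assumes an ungapped, equal-length alignment")
    return [
        (offset + i, g, r)
        for i, (g, r) in enumerate(zip(reference, read))
        if g != r
    ]


G = "GAACCCCTTACAACTGAACCCCTTAC"  # reference segment shown on the slide
R = "GAACCCCTTATAACTGAACCCCTTAC"  # read from one individual

# Positions are 0-based within the segment.
print(find_mismatches(G, R))  # [(10, 'C', 'T')]
```

The single reported column is the one where the alignment on the slide has a gap in its row of bars.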

What are structural variations?
Consecutive basepairs. Various examples of structural variations.

What are Structural Variations?
[Figure: insertion. Segment ATAC appears in the donor genome and is absent from the reference genome.]

[Figure: deletion. Segment ATAC of the reference genome is missing from the donor genome.]

[Figure: inversion. Reference genome: ATAC TATG. Donor genome: TATG ATAC.]

How Do We Detect Structural Variations?
Comparative Genomic Hybridization (CGH): detects copy number changes; cannot detect inversions and translocations.
Fluorescence In Situ Hybridization (FISH): time-consuming and expensive.
Direct comparison of genomes (Levy et al. 2007): very expensive, because the whole genome must be assembled.
Unassembled clone-end data (Tuzun et al. 2005, Korbel et al. 2007): uses matepairs to detect structural variations; much cheaper than assembling the whole genome. Next-generation sequencing technologies make it even more promising.

Outline
Detecting structural variants with matepairs
Probabilistic framework
Finding structural variations
Results

What are Matepairs?
A matepair is a pair of reads (e.g. ATCAA and CTAAG) sequenced from the two ends of one DNA fragment. The insert size is the length of that fragment. For now, assume the insert size is perfect.

Detecting Structural Variants With Matepairs
No structural variants: insert size = mapped distance (concordant matepair). [Figure: a matepair on donor A with its insert size, and the same matepair mapped to REF with its mapped distance.]

Detecting Structural Variants With Matepairs - Insertion
Insert size > mapped distance. Size of insertion = insert size - mapped distance. [Figure: a matepair on donor A with its insert size, and the same matepair mapped to REF with a shorter mapped distance.]
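Under the perfect-insert-size assumption, the rule above reduces to one subtraction. The sketch below is illustrative only: `classify_matepair` is a hypothetical name, and the deletion branch (mapped distance larger than insert size) is the mirror case, which this slide does not spell out.

```python
def classify_matepair(insert_size: int, mapped_distance: int):
    """Return (label, size) for one matepair, assuming an exact insert size."""
    difference = insert_size - mapped_distance
    if difference > 0:
        # Donor has extra sequence between the reads: they map closer on REF.
        return ("insertion", difference)
    if difference < 0:
        # Donor lacks sequence between the reads: they map farther apart on REF.
        return ("deletion", -difference)
    return ("concordant", 0)


print(classify_matepair(208, 188))  # ('insertion', 20)
print(classify_matepair(208, 208))  # ('concordant', 0)
print(classify_matepair(208, 228))  # ('deletion', 20)
```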

Consistency - Insertion
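Two overlapping matepairs Xi and Xj are consistent when the size of insertion explained by Xi = the size of insertion explained by Xj. [Figure: overlapping matepairs Xi and Xj on donor A and on REF.]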
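A consistency test of this kind could look like the following sketch. The `Matepair` record, its half-open REF coordinates and the helper names are all assumptions made for illustration, and the exact-equality test only makes sense while insert sizes are treated as perfect.

```python
from typing import NamedTuple


class Matepair(NamedTuple):
    start: int        # leftmost mapped coordinate on REF (hypothetical field)
    end: int          # rightmost mapped coordinate on REF (hypothetical field)
    insert_size: int  # assumed exact for now


def insertion_size(mp: Matepair) -> int:
    """Insert size minus mapped distance."""
    return mp.insert_size - (mp.end - mp.start)


def spans_overlap(a: Matepair, b: Matepair) -> bool:
    """True when the two mapped spans share at least one position."""
    return a.start < b.end and b.start < a.end


def consistent_insertion(a: Matepair, b: Matepair) -> bool:
    """Both matepairs overlap and explain an insertion of the same size."""
    size_a, size_b = insertion_size(a), insertion_size(b)
    return spans_overlap(a, b) and size_a > 0 and size_a == size_b


xi = Matepair(100, 288, 208)   # explains a 20bp insertion
xj = Matepair(150, 338, 208)   # overlaps xi, also 20bp
xk = Matepair(400, 588, 208)   # 20bp, but does not overlap xi
print(consistent_insertion(xi, xj), consistent_insertion(xi, xk))  # True False
```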

Detecting Structural Variants With Matepairs - Inversion
[Figure: strand orientation (5’ and 3’ ends) of a matepair in donor A and of its mapping on REF.]

Range of The Size of Inversion
Lower bound: |m - insert size of Xi| < size of inversion. Given the nature of inversions, the lower bound on the size of the inversion occurs when the inversion is the smallest interval containing the sampled read location and its mapped location.

Upper bound: size of inversion < m + insert size of Xi. The upper bound occurs when the boundary of the inversion (affecting one of the reads) meets the endpoint of the other read.

Combined: |m - insert size of Xi| < size of inversion < m + insert size of Xi, where m is the distance between the mapped positions of the two reads. [Figure: matepair Xi on donor A and on REF, drawn for the case when the lower bound is tight.]
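The two bounds can be computed directly from m and the insert size. A minimal sketch, with a hypothetical function name and example numbers chosen only for illustration; both bounds are exclusive, as in the inequalities above.

```python
def inversion_size_range(m: int, insert_size: int):
    """Return the open interval (lower, upper) of inversion sizes
    compatible with one matepair: |m - insert| < size < m + insert."""
    lower = abs(m - insert_size)
    upper = m + insert_size
    return lower, upper


print(inversion_size_range(500, 208))  # (292, 708)
```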

Consistency - Inversion
Two overlapping matepairs Xi and Xj are consistent when mapped distance A = mapped distance B and the range of inversion sizes explained by Xi overlaps the range of inversion sizes explained by Xj. [Figure: overlapping matepairs Xi and Xj on donor A and on REF, with mapped distances A and B marked.]
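The range-overlap part of this test is an interval intersection. The sketch below covers only that condition; it takes each matepair's size range as an open interval (lower, upper) and does not model the "mapped distance A = mapped distance B" condition, whose exact definition depends on the figure. The function name is hypothetical.

```python
def ranges_overlap(range_i, range_j) -> bool:
    """True when two open intervals (lower, upper) share at least one size."""
    return max(range_i[0], range_j[0]) < min(range_i[1], range_j[1])


range_xi = (292, 708)    # e.g. m = 500, insert size = 208
range_xj = (692, 1108)   # e.g. m = 900, insert size = 208
range_xk = (1792, 2208)  # e.g. m = 2000, insert size = 208

print(ranges_overlap(range_xi, range_xj))  # True: sizes in (692, 708) fit both
print(ranges_overlap(range_xi, range_xk))  # False
```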

Difficulties in Small INDEL Detection
In reality, insert sizes of matepairs are not perfect, so a single matepair cannot reveal small indels (e.g. smaller than 3 standard deviations of the insert size) (Tuzun et al. 2005).
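The 3-standard-deviation cutoff can be made concrete with a small check. This is a sketch with a hypothetical function name; the 208bp mean and 13bp standard deviation are the library figures quoted later in the deck, used here only as example inputs.

```python
def single_pair_detectable(mapped_distance: float, mean_insert: float,
                           sd_insert: float, k: float = 3.0) -> bool:
    """True when one matepair deviates from the mean insert size by at
    least k standard deviations, i.e. it is discordant on its own."""
    return abs(mean_insert - mapped_distance) >= k * sd_insert


# With a 208 +/- 13bp library the single-pair threshold is 3 * 13 = 39bp.
print(single_pair_detectable(188, 208, 13))  # False: a 20bp shift looks like noise
print(single_pair_detectable(160, 208, 13))  # True: a 48bp shift stands out
```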

Detecting Smaller INDELS
A small insertion… or noise? Insert size ≈ mapped distance? [Figure: a matepair on donor A and its mapping on REF.]

Haploid Case – Alignment
[Figure: donor matepairs aligned to REF, each with a mapped distance, grouped into a cluster.]

Haploid Case – Distribution
Make a distribution of mapped distances in each cluster. The distribution shifts away from the distribution of insert sizes if there is an INDEL. [Figure: distributions for "No indel" and "20bp insertion"; x-axis ticks at 188bp, 208bp and 228bp.]

Haploid Case – Distribution
The same comparison for a deletion. [Figure: distributions for "No indel" and "20bp deletion"; x-axis ticks at 188bp, 208bp and 228bp.]
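The shift between the two distributions can be turned into a size estimate by comparing means. A minimal sketch, assuming the expected insert-size mean is known from the library and using made-up mapped distances; `estimate_indel_size` is a hypothetical name, and the sign convention (positive for insertion) follows "insert size - mapped distance".

```python
from statistics import mean


def estimate_indel_size(mapped_distances, expected_insert_mean: float) -> float:
    """Positive result suggests an insertion, negative a deletion."""
    return expected_insert_mean - mean(mapped_distances)


cluster = [185, 190, 187, 192, 186]  # hypothetical mapped distances, mean 188
print(estimate_indel_size(cluster, 208))  # 20
```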

Accuracy of INDEL Estimation
Central Limit Theorem: the mean of n independent random variables with finite mean μ and variance σ² approximately follows a Gaussian distribution with mean μ and standard deviation σ/√n. Z1…Zn: random variables for the size of indel supported by each matepair. [Figure: distribution of the mean of the random variables Z1…Zn.]
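The practical consequence is that the uncertainty of the cluster mean shrinks with the square root of the number of matepairs. A short sketch, with a hypothetical function name and an example standard deviation of 13bp:

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the mean of n independent draws: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)


# Quadrupling the number of matepairs halves the standard error.
for n in (1, 4, 16, 64):
    print(n, standard_error(13.0, n))
```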

P-value (assigning a confidence)
P-value: the probability that a cluster is generated from a region without an indel.
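One simple way to attach such a probability to a cluster is a two-sided z-test on the cluster mean. This is an assumption made for illustration, not necessarily the computation MoDIL uses; the function name and the example numbers are hypothetical.

```python
import math


def indel_p_value(cluster_mean: float, n: int, mu: float, sigma: float) -> float:
    """Two-sided p-value for the null hypothesis 'no indel': the cluster
    mean is Gaussian with mean mu and standard deviation sigma / sqrt(n)."""
    z = (cluster_mean - mu) / (sigma / math.sqrt(n))
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))


print(indel_p_value(208.0, 20, 208.0, 13.0))  # 1.0, no shift at all
print(indel_p_value(188.0, 20, 208.0, 13.0))  # very small: a 20bp shift over 20 pairs
```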

Diploid Case – Alignment & Clustering
[Figure: heterozygous insertion. Matepairs from donor chromosomes C1 and C2 aligned to REF and grouped into a cluster.]

Diploid - Mixture of Distributions
Heterozygous insertion. [Figure: x-axis: mapped distance. Curves: distribution of mapped distances from donor C1 (is it equal to the distribution of insert size?) and distribution of mapped distances from donor C2. The size of insertion is marked on the plot.]

Expectation Maximization (EM) Algorithm
1. Randomly initialize the parameters of the two distributions, p1 and p2.
2. E step: assign each matepair, Mi, to one of the two distributions, p1 or p2, each with some probability.
3. M step: update the parameters by searching for the optimal values that minimize the Kolmogorov–Smirnov statistic.
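The E/M alternation can be illustrated with a deliberately simplified stand-in. The sketch below is not the method on the slide: it assumes both distributions are Gaussians with a known, shared standard deviation and equal mixing weights, it replaces the Kolmogorov–Smirnov minimization with the standard responsibility-weighted mean update, and it initializes from the data extremes instead of randomly so the result is reproducible. All names and data are hypothetical.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_means(distances, sigma: float, iterations: int = 50):
    """Estimate the means of a two-component, equal-weight Gaussian mixture."""
    mu1, mu2 = float(min(distances)), float(max(distances))
    for _ in range(iterations):
        # E step: probability that each matepair belongs to component 1.
        resp = []
        for d in distances:
            a = gaussian_pdf(d, mu1, sigma)
            b = gaussian_pdf(d, mu2, sigma)
            resp.append(a / (a + b) if a + b > 0 else 0.5)
        # M step: responsibility-weighted means (stand-in for the KS search).
        w1 = sum(resp)
        w2 = len(distances) - w1
        if w1 > 0:
            mu1 = sum(r * d for r, d in zip(resp, distances)) / w1
        if w2 > 0:
            mu2 = sum((1 - r) * d for r, d in zip(resp, distances)) / w2
    return mu1, mu2


# Hypothetical cluster: one chromosome unchanged (about 208bp),
# the other carrying a 40bp insertion (about 168bp).
data = [204, 206, 208, 210, 212, 164, 166, 168, 170, 172]
mu_low, mu_high = em_two_means(data, sigma=4.0)
print(round(mu_low, 2), round(mu_high, 2))  # 168.0 208.0
```

The gap between the two recovered means is the estimated size of the heterozygous indel; when both means coincide, the cluster looks homozygous.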

Pipeline of MoDIL
1. Preprocessing: read mapping; clustering; assign each matepair to a unique mapping. [Figure labels: R1, R2, C1, C2, positions 1 to 5, M1,4, M2,4, M3,5.]
2. Detecting INDELs from MoDs: EM algorithm to detect INDELs in each cluster.
3. Post-processing: compute P-values; merge duplicates; compute P-het.

Simulation Results
Implanted all indels from Mills et al. into chromosome 1 and generated ~51 million matepairs. [Figure: panels for insertions and deletions.]

Comparison with Other Methods
Both MoDIL and Hormozdiari et al.’s method performed well for INDELs >=40bp (precision: ~0.95, recall: ~0.85). For INDELs of 20-39bp, only MoDIL detected INDELs (precision: ~0.9, recall: ~0.87). MAQ only found very small INDELs (<10bp). [Figure: panels for INDELs >=40bp and INDELs 20-39bp.]

Comparison with Kidd et al.
NA (40x Illumina coverage, 208±13bp matepairs). Kidd et al. found a small fraction of INDELs using Sanger-style reads (0.3x coverage).
False negative rate (FNR) by INDEL size: >=20bp, FNR=0.05; 15-19bp, FNR=0.3; 10-14bp, FNR=0.65.

Accuracy of Size Estimation of MoDIL
A large number of indels (~32%) overlapped with Mills et al. Compared the sizes reported by Mills et al. and by MoDIL: Pearson’s correlation coefficient, r2=0.96. The size difference (Mills et al. minus MoDIL) overlaps with a Gaussian with SD=4 (the expected SD for a cluster with 20 matepairs). [Figure: MoDIL indel size against Mills et al. indel size; distribution of Mills et al. - MoDIL.]

Copy Number Variants (CNVs)
Large regions that appear a different number of times in different individuals.
CNVs are associated with a number of diseases (cancer, HIV, autism, schizophrenia). SNPs have been widely studied; CNVs are another kind of variation.
Input: reference human genome; sequenced donor genome.
Output: CNV annotations in the reference.

Step 1 – Build Repeat Graph
The repeat graph captures the copy numbers of the reference. View the repeat graph as containing adjacency information: a walk in this graph spells a genome, and the reference genome is one such walk (Pevzner, Tang, Tesler 2004).
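The idea that a walk spells a genome, and that the number of times the walk uses an edge is that segment's copy number, can be shown on a toy graph. Everything below (edge names, node names, sequences) is a hypothetical example, not data from the slides, and a real repeat graph is built from alignments rather than written by hand.

```python
from collections import Counter

# Hypothetical toy repeat graph: edge name -> (from_node, to_node, sequence).
EDGES = {
    "a": ("s", "u", "ACG"),
    "r": ("u", "v", "TTT"),  # a repeat: the walk below uses it twice
    "b": ("v", "u", "GA"),
    "c": ("v", "t", "CC"),
}


def spell_walk(edges, walk):
    """Concatenate edge sequences, checking consecutive edges are adjacent."""
    for prev, nxt in zip(walk, walk[1:]):
        if edges[prev][1] != edges[nxt][0]:
            raise ValueError(f"edges {prev} and {nxt} are not adjacent")
    return "".join(edges[e][2] for e in walk)


def copy_numbers(walk):
    """Copy number of each edge = number of times the walk traverses it."""
    return Counter(walk)


ref_walk = ["a", "r", "b", "r", "c"]
print(spell_walk(EDGES, ref_walk))  # ACGTTTGATTTCC
print(copy_numbers(ref_walk)["r"])  # 2
```

A donor genome with a different copy number for the repeat would be a different walk through the same graph, traversing edge "r" more or fewer times.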

Step 2 – Capture Donor Adjacencies
[Figure: the Ref graph.]

Outline: CNVs
Building the Donor Graph
Adding Depth-of-Coverage
Results

Adding Depth of Coverage
[Figure: the Ref and Donor graphs annotated with depth of coverage.]

Calling CNVs
Find the path “most faithful” to the depth of coverage (DOC). A probabilistic model scores “faithfulness”; network flow finds the “most likely” walk.
[Figure: Ref and Donor graphs. Depth: 0.8 2.3 2.6 0.5 1.4 1.7 1.1. Path: 1 2 2 1 1 1 1. Ref: 1 2. A CNV is marked.]
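To see why a walk-based model is needed, compare it with the naive alternative of rounding each depth value on its own. The sketch below assumes the seven depth values and the seven path counts in the flattened figure line up position by position, which the slide text does not confirm. It is not the probabilistic model or the network-flow step; it only shows that independent rounding gives a different answer.

```python
def naive_copy_counts(depths):
    """Round each normalized depth to the nearest integer, independently."""
    return [int(d + 0.5) for d in depths]


depth = [0.8, 2.3, 2.6, 0.5, 1.4, 1.7, 1.1]  # values from the figure
slide_path = [1, 2, 2, 1, 1, 1, 1]           # path counts from the figure

naive = naive_copy_counts(depth)
disagreements = [i for i, (n, p) in enumerate(zip(naive, slide_path)) if n != p]

print(naive)          # [1, 2, 3, 1, 1, 2, 1]
print(disagreements)  # [2, 5]
```

Independent rounding ignores that the copy counts must come from a single walk through the graph, which is the constraint the flow formulation enforces.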

Preliminary Results
NA18507 individual sampled with Illumina. 9909 CNV calls: 5795 losses, 4114 gains.

Preliminary Results (Sensitivity)
Kidd et al.’s loss calls on NA18507 (146 calls).
Percentage of Kidd et al.’s calls that overlap one of ours:
After shuffling our calls:
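An overlap percentage with a shuffled baseline could be computed along these lines. This is a sketch under stated assumptions: calls are half-open (start, end) intervals on a single chromosome, shuffling means re-placing each call uniformly at random while keeping its length, and all names and coordinates are hypothetical. The actual shuffling procedure behind the slide is not described.

```python
import random


def overlaps(a, b) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(calls, others) -> float:
    """Fraction of `calls` that overlap at least one interval in `others`."""
    if not calls:
        return 0.0
    hits = sum(1 for c in calls if any(overlaps(c, o) for o in others))
    return hits / len(calls)


def shuffle_calls(calls, genome_length: int, seed: int = 0):
    """Re-place every call at a random position, preserving its length."""
    rng = random.Random(seed)
    shuffled = []
    for start, end in calls:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


ours = [(100, 200), (500, 650), (900, 1000)]        # hypothetical calls
theirs = [(150, 180), (600, 700), (2000, 2100)]     # hypothetical calls

print(fraction_overlapping(theirs, ours))           # 2 of 3 overlap
baseline = fraction_overlapping(theirs, shuffle_calls(ours, 1_000_000))
print(baseline)                                     # depends on the seed
```

A real overlap far above the shuffled baseline suggests the agreement is not due to chance placement.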

Preliminary Results (Specificity)
Percent of our GAIN calls that overlap with DGV:
Percent of our LOSS calls that overlap with DGV:
After shuffle:

More Results (McCarroll Comparison)
McCarroll et al. bottom third percentile within 270 samples (94 calls).
McCarroll et al. homozygous deletions (39 calls).
McCarroll et al. top third percentile (26 calls).

Take-home points
MoDIL: take advantage of high clone coverage to find smaller INDELs with high accuracy; ~90% accuracy and recall for INDELs ≥ 20bp.
CNVs: combine paired-end and arrival information to find CNVs; good concordance with previous results.
Matepairs are key: the length and distribution of insert sizes are key; read length is (sometimes) less so.

http://compbio.cs.toronto.edu/modil
http://compbio.cs.toronto.edu
Acknowledgments: Paul Medvedev, Marc Fiume, Tim Smith, Adrian Dalca, Seunghak Lee, Can Alkan (UW), Fereydoun Hormozdiari (SFU)

Thank you for your attention
Michael Brudno University of Toronto
