Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module 4 Structural Variation Michael Brudno Informatics on High Throughput Sequencing Data July 2009

Diversity of Humans
Humans are diverse: genomic variation
Single nucleotide polymorphisms (SNPs) occur at roughly 1 in 1,000 positions. Find them by comparing reads from one individual (G) to the reference human genome (R):
G: 798  GAACCCCTTACAACTGAACCCCTTAC
        |||||||||| |||||||||||||||
R:      GAACCCCTTATAACTGAACCCCTTAC
Structural variations are large-scale genomic alterations: insertions, deletions, inversions, translocations, and changes in copy number
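
The read-versus-reference comparison on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the read is already aligned without gaps, and the function name `find_mismatches` is hypothetical rather than taken from any tool mentioned in the talk.

```python
def find_mismatches(read: str, ref: str) -> list[tuple[int, str, str]]:
    """Return (0-based offset, reference base, read base) for each mismatch.

    Assumes a gapless alignment, so both strings must have the same length.
    """
    if len(read) != len(ref):
        raise ValueError("gapless comparison needs equal-length sequences")
    return [(i, r, g) for i, (g, r) in enumerate(zip(read, ref)) if g != r]


# The two sequences shown on the slide (G = individual, R = reference).
donor = "GAACCCCTTACAACTGAACCCCTTAC"
reference = "GAACCCCTTATAACTGAACCCCTTAC"
mismatches = find_mismatches(donor, reference)
print(mismatches)
```

A real SNP caller would also weigh base qualities and the number of reads supporting each mismatch; a single mismatching read is only a candidate.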

What are structural variations? Consecutive base pairs. Various examples of structural variations

What are Structural Variations? Insertion ATAC Donor genome Reference genome

What are Structural Variations? Deletion ATAC Donor genome ATAC Reference genome

What are Structural Variations? Inversion ATAC TATG Donor genome ATAC TATG Reference genome

What are Structural Variations? Inversion TATG ATAC Donor genome ATAC TATG Reference genome

How Do We Detect Structural Variations?
Comparative Genomic Hybridization (CGH): detects copy number changes; cannot detect inversions and translocations
Fluorescence In Situ Hybridization (FISH): time-consuming and expensive
Direct comparison of genomes (Levy et al. 2007): very expensive to assemble the whole genome
Unassembled clone-end data (Tuzun et al. 2005, Korbel et al. 2007): uses matepairs to detect structural variations; much cheaper than assembling the whole genome
Next-generation sequencing technologies make the matepair approach even more promising

Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results

What are Matepairs? A matepair is a pair of reads (e.g. ATCAA and CTAAG) taken from the two ends of one DNA fragment. Insert size: the length of the fragment spanned by the pair. For now, assume insert size is perfect

Detecting Structural Variants With Matepairs - No structural variants: insert size (in donor A) = mapped distance (on REF). This is a concordant matepair

Detecting Structural Variants With Matepairs - Insertion: insert size (in donor A) > mapped distance (on REF). Size of insertion = insert size - mapped distance
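
As a rough sketch of the rule on this slide, the function below compares the insert size with the mapped distance. It assumes, as the slides do at this point, that the insert size is known exactly. The deletion branch is the symmetric case (mapped distance larger than insert size) and is added here for completeness; the name `classify_matepair` is hypothetical.

```python
def classify_matepair(insert_size: int, mapped_distance: int) -> tuple[str, int]:
    """Classify one matepair, assuming the insert size is exact.

    Size of insertion = insert size - mapped distance (from the slide).
    A negative difference is treated as a deletion of that magnitude.
    """
    difference = insert_size - mapped_distance
    if difference > 0:
        return ("insertion", difference)
    if difference < 0:
        return ("deletion", -difference)
    return ("concordant", 0)


print(classify_matepair(208, 188))  # donor has extra sequence between the reads
print(classify_matepair(208, 228))  # donor is missing sequence between the reads
print(classify_matepair(208, 208))  # no structural variant
```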

Consistency - Insertion: matepairs Xi and Xj are consistent when they overlap and the size of insertion explained by Xi = the size of insertion explained by Xj

Detecting Structural Variants With Matepairs - Inversion 3’ 5’ A 5’ 3’ REF 3’ 5’

Range of The Size of Inversion (lower bound): |m - insert size of Xi| < size of inversion, where m is the distance between the mapped positions on REF. Given the nature of inversions, the lower bound on the size of inversion occurs when the inversion is the smallest interval containing the sampled read location and its mapped location.

Range of The Size of Inversion (upper bound): size of inversion < m + insert size of Xi. The upper bound for the size of the inversion occurs when the boundary of the inversion (affecting one of the reads) meets the endpoint of the other read.

Range of The Size of Inversion: |m - insert size of Xi| < size of inversion < m + insert size of Xi, where m is the distance between mapped positions of the two reads. This picture shows the case when the lower bound is tight

Consistency - Inversion: matepairs Xi and Xj are consistent when they overlap, Mapped Distance A = Mapped Distance B, and the range of the size of inversion explained by Xi overlaps the range of the size of inversion explained by Xj
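
The inversion bounds and the range-overlap part of the consistency test can be sketched as follows. This is a simplified illustration: it covers only the size ranges (not the mapped-distance condition shown in the figure), it treats the bounds as open intervals because the slides use strict inequalities, and both function names are hypothetical.

```python
def inversion_size_range(m: int, insert_size: int) -> tuple[int, int]:
    """Open interval (low, high) for the inversion size implied by one matepair.

    m is the distance between the mapped positions of the two reads:
    |m - insert size| < size of inversion < m + insert size.
    """
    return (abs(m - insert_size), m + insert_size)


def ranges_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True when two open intervals share at least one point."""
    return max(a[0], b[0]) < min(a[1], b[1])


range_i = inversion_size_range(1000, 208)  # (792, 1208)
range_j = inversion_size_range(1100, 210)  # (890, 1310)
range_k = inversion_size_range(5000, 208)  # (4792, 5208)
print(ranges_overlap(range_i, range_j))  # both could explain one inversion
print(ranges_overlap(range_i, range_k))  # these cannot
```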

Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results

Difficulties in Small INDEL Detection: in reality, insert sizes of matepairs are not perfect, so small indels (e.g. smaller than 3 standard deviations of the insert size) cannot be detected from individual matepairs (Tuzun et al. 2005)

Detecting Smaller INDELs: small insertion... or noise? Insert size (in donor A) ≈ mapped distance (on REF)?

Haploid Case – Alignment Donor REF

Haploid Case – Alignment Donor Mapped distance REF Cluster

Haploid Case – Distribution: make a distribution of mapped distances in each cluster. The distribution shifts away from the distribution of insert size if there is an INDEL. Figure: 20bp insertion (188bp) vs. no indel (208bp)

Haploid Case – Distribution: make a distribution of mapped distances in each cluster. The distribution shifts away from the distribution of insert size if there is an INDEL. Figure: no indel (208bp) vs. 20bp deletion (228bp)
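
A minimal sketch of the shift idea in the haploid case: the indel size is estimated as the expected insert size minus the mean mapped distance in the cluster. The default of 208bp is the library mean quoted later in the talk; the function name and the sample clusters are made up for illustration.

```python
from statistics import mean


def estimate_indel(mapped_distances: list[float], expected_insert: float = 208.0) -> float:
    """Estimate the indel size in a cluster from the shift of its mean.

    Positive result: insertion in the donor (mapped distances shrink).
    Negative result: deletion in the donor (mapped distances grow).
    """
    return expected_insert - mean(mapped_distances)


insertion_cluster = [186, 188, 190, 187, 189]  # hypothetical mapped distances
deletion_cluster = [226, 228, 230, 227, 229]   # hypothetical mapped distances
print(estimate_indel(insertion_cluster))
print(estimate_indel(deletion_cluster))
```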

Accuracy of INDEL Estimation: Central Limit Theorem. The mean of $n$ independent random variables with finite mean $\mu$ and variance $\sigma^2$ approximately follows the Gaussian distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. $Z_1, \dots, Z_n$: random variables for the size of indels supported by each matepair. Distribution of the mean of $Z_1, \dots, Z_n$: $\bar{Z} \sim N(\mu, \sigma^2/n)$

P-value (assigning a confidence): the probability that a cluster is generated from a region without an indel
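
One way to turn the Central Limit Theorem statement into a P-value is a two-sided Gaussian test on the cluster mean, sketched below. The slides do not say which test MoDIL uses, so the test itself is an assumption; the defaults (mean 208bp, standard deviation 13bp) are the library values quoted later in the talk, and `indel_p_value` is a hypothetical name.

```python
import math


def indel_p_value(mapped_distances: list[float], mu: float = 208.0, sigma: float = 13.0) -> float:
    """Two-sided P-value that the cluster comes from a region without an indel.

    Under the null hypothesis, the mean of n mapped distances is approximately
    Gaussian with mean mu and standard deviation sigma / sqrt(n).
    """
    n = len(mapped_distances)
    sample_mean = sum(mapped_distances) / n
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


no_indel = [208.0] * 20   # hypothetical cluster centred on the library mean
shifted = [188.0] * 20    # hypothetical cluster shifted by 20bp
print(indel_p_value(no_indel))
print(indel_p_value(shifted))
```

Because the standard error shrinks with the square root of the cluster size, a shift that is invisible in one matepair becomes significant with enough clone coverage.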

Diploid Case – Alignment & Clustering: heterozygous insertion. Donor C1, Donor C2, REF

Diploid Case – Alignment & Clustering: heterozygous insertion. Donor C1, Donor C2, REF, Cluster

Diploid - Mixture of Distributions: heterozygous insertion. Figure labels: size of insertion; mapped distance; distribution of mapped distances from donor C1 =? distribution of insert size; distribution of mapped distances from donor C2

Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results

Expectation Maximization (EM) Algorithm
1. Randomly initialize the parameters of the two distributions, p1 and p2
2. E step: assign each matepair, Mi, to one of the two distributions: Mi goes to p1 with some probability and to p2 with the complementary probability
3. M step: update the parameters of p1 and p2 by searching for the optimal values, which minimize the Kolmogorov–Smirnov statistic
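
The E and M steps can be illustrated with a deliberately simplified two-component EM. This sketch is not the MoDIL procedure: it assumes two equal-weight Gaussians with a shared, known standard deviation, it starts from the smallest and largest observation instead of a random initialization, and its M step uses the usual weighted-mean update as a stand-in for the search that minimizes the Kolmogorov–Smirnov statistic. The function name and the sample data are hypothetical.

```python
import math


def em_two_means(distances: list[float], sigma: float = 13.0, iterations: int = 50) -> tuple[float, float]:
    """Fit the means of two equal-weight Gaussians with known, shared sigma."""
    # Deterministic initialization (the slide uses a random one).
    mu1, mu2 = float(min(distances)), float(max(distances))
    for _ in range(iterations):
        # E step: probability that each matepair belongs to the first distribution.
        resp1 = []
        for x in distances:
            like1 = math.exp(-((x - mu1) ** 2) / (2 * sigma ** 2))
            like2 = math.exp(-((x - mu2) ** 2) / (2 * sigma ** 2))
            resp1.append(like1 / (like1 + like2))
        # M step: weighted means (stand-in for the KS-minimizing search).
        weight1 = sum(resp1)
        weight2 = len(distances) - weight1
        mu1 = sum(r * x for r, x in zip(resp1, distances)) / weight1
        mu2 = sum((1 - r) * x for r, x in zip(resp1, distances)) / weight2
    return mu1, mu2


# Hypothetical heterozygous cluster: one haplotype carries a 60bp insertion
# (mapped distances near 148bp), the other has no indel (near 208bp).
offsets = [-13, -6, 0, 6, 13]
cluster = [148 + o for o in offsets] + [208 + o for o in offsets]
mean_a, mean_b = em_two_means(cluster)
print(208 - mean_a, 208 - mean_b)  # estimated indel size per haplotype
```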

Pipeline of MoDIL: 1. Preprocessing: read mapping, clustering

Pipeline of MoDIL: 1. Preprocessing: read mapping, clustering, then assign each matepair to a unique mapping (figure labels: R1, R2; C1, C2; positions 1-5; M1,4, M2,4, M3,5)

Pipeline of MoDIL
1. Preprocessing: read mapping; clustering; assign each matepair to a unique mapping
2. Detecting INDELs from MoDs (mixtures of distributions): EM algorithm to detect INDELs in each cluster
3. Post-processing: compute P-values; merge duplicates; compute P-het

Outline Detecting structural variants with matepairs Probabilistic framework Finding structural variations Results

Simulation Results: implanted all indels from Mills et al. into chromosome 1 and generated ~51 million matepairs. Figure panels: Insertion, Deletion

Comparison with Other Methods
INDELs >= 40bp: both MoDIL and Hormozdiari et al.'s method performed well (precision ~0.95, recall ~0.85)
INDELs 20-39bp: only MoDIL detected INDELs (precision ~0.9, recall ~0.87)
MAQ only found very small INDELs (< 10bp)
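
For readers less familiar with the metrics quoted here, the sketch below computes precision, recall, and the false negative rate (FNR, used on the next slide) from call counts. The counts in the example are invented purely to show the arithmetic and are not results from the talk.

```python
def precision_recall_fnr(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Return (precision, recall, false negative rate) from call counts."""
    precision = true_pos / (true_pos + false_pos)  # fraction of calls that are real
    recall = true_pos / (true_pos + false_neg)     # fraction of real indels called
    return precision, recall, 1.0 - recall


# Hypothetical counts: 87 correct calls, 10 spurious calls, 13 missed indels.
precision, recall, fnr = precision_recall_fnr(87, 10, 13)
print(round(precision, 3), round(recall, 3), round(fnr, 3))
```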

Comparison with Kidd et al.
NA18507 (40x Illumina coverage, 208±13bp matepairs)
Kidd et al. found a small fraction of INDELs using Sanger-style reads (0.3x coverage)
False negative rate (FNR) by INDEL size: >= 20bp: 0.05; 15-19bp: 0.3; 10-14bp: 0.65

Accuracy of Size Estimation of MoDIL
A large number of indels (~32%) overlapped with Mills et al.
Compared sizes of Mills et al. and MoDIL:
- Pearson's correlation coefficient, r² = 0.96
- (Mills et al. minus MoDIL) overlaps with a Gaussian with SD = 4 (the expected SD for a cluster with 20 matepairs)
Figure panels: MoDIL indel size vs. Mills et al. indel size; Mills et al. - MoDIL

Copy Number Variants (CNVs)
Large regions that appear a different number of times in different individuals
SNPs have been widely studied; other variation is CNV
CNVs are associated with a number of diseases: cancer, HIV, autism, schizophrenia
Input: reference human genome; sequenced donor genome
Output: CNV annotations in the reference

Step 1 – Build Repeat Graph
The repeat graph captures the copy numbers of the reference (Pevzner, Tang, Tesler 2004)
View the repeat graph as containing adjacency information: a walk in this graph spells a genome
The reference genome is a walk in this graph

Step 2 – Capture Donor Adjacencies Ref

Outline: CNVs Building the Donor Graph Adding Depth-of-Coverage Results

Adding Depth of Coverage Ref where Donor

Calling CNVs
Find the path “most faithful” to the depth of coverage (DOC)
Probabilistic model to score “faithfulness”
Network flow to find the “most likely” walk
Figure: depth per edge 0.8, 2.3, 2.6, 0.5, 1.4, 1.7, 1.1; path copy counts 1, 2, 2, 1, 1, 1, 1; an edge with Ref 1 and donor path 2 is labeled CNV

Outline: CNVs Building the Donor Graph Adding Depth-of-Coverage Results

Preliminary Results: NA18507 individual sampled with Illumina. 9909 CNV calls: 5795 losses, 4114 gains

Preliminary Results (Sensitivity) Kidd et al.’s loss calls on NA18507 (146 calls) Percentage of Kidd’s calls that overlap one of ours: After shuffling our calls:

Preliminary Results (Specificity) Percent of our GAIN calls that overlap with DGV: Percent of our LOSS calls that overlap with DGV: After shuffle:
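
The overlap percentages on the sensitivity and specificity slides can be computed with a helper like the one below. It is a sketch under simple assumptions: calls are half-open (start, end) intervals on a single chromosome, any shared base counts as an overlap, and the function names and coordinates are made up for illustration. The "after shuffle" control would apply the same function to randomly relocated calls.

```python
Interval = tuple[int, int]


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals [start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def percent_overlapping(calls: list[Interval], others: list[Interval]) -> float:
    """Percentage of `calls` that overlap at least one interval in `others`."""
    hits = sum(1 for call in calls if any(intervals_overlap(call, o) for o in others))
    return 100.0 * hits / len(calls)


# Hypothetical coordinates, not data from the talk.
published_calls = [(100, 200), (500, 600), (900, 950), (2000, 2100)]
our_calls = [(150, 160), (590, 700), (1000, 1100)]
print(percent_overlapping(published_calls, our_calls))
```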

More Results (McCarroll Comparison) McCarroll et al. bottom third percentile within 270 samples (94 calls) McCarroll et al. homozygous deletions (39 calls) McCarroll et al. top third percentile (26 calls)

Take-home points
MoDIL: take advantage of high clone coverage to find smaller INDELs with high accuracy; ~90% precision and recall for INDELs ≥ 20bp
CNVs: combine paired-end and arrival (depth-of-coverage) information to find CNVs; good concordance with previous results
Matepairs are key: the length and distribution of insert sizes are key; read length is (sometimes) less so

Acknowledgments: Paul Medvedev, Marc Fiume, Tim Smith, Adrian Dalca, Seunghak Lee, Can Alkan (UW), Fereydoun Hormozdiari (SFU)
http://compbio.cs.toronto.edu/modil http://compbio.cs.toronto.edu

Thank you for your attention Michael Brudno University of Toronto