A Robust Framework for Detecting Structural Variations. February 6, 2008. Seunghak Lee, Elango Cheran, and Michael Brudno (University of Toronto, Canada)

Presentation transcript:

1 A Robust Framework for Detecting Structural Variations. February 6, 2008. Seunghak Lee, Elango Cheran, and Michael Brudno (University of Toronto, Canada)

2 What are structural variations? (1) Variations of 10^3 to 10^6 basepairs in the genome. Insertion: a large consecutive fragment of DNA is inserted. Deletion: a large consecutive fragment of DNA is deleted. Inversion: a large consecutive fragment of DNA is inverted. Translocation: a large consecutive fragment of DNA is moved from one chromosome to another. Copy number variations.

3 What are structural variations? (2) Various examples of structural variations

4 Outline. Introduction: Type of Structural Variations; Sequencing Approaches to Detect Structural Variations; Motivation & Research Objectives. Probabilistic Framework for Detecting Structural Variations: Probabilistic Framework; Flow of our Framework; Hierarchical Clustering of Matepairs (2nd phase); Choosing a Unique Mapped Location for Each Matepair (3rd phase). Experiments: Comparison with Three Previous Studies; DMBT1 Gene for Deletion; Centromere and Translocations. Conclusions.

5 Type of Structural Variations (1) Insertion (figure: A vs. REF)

6 Type of Structural Variations (2) Deletion (figure: A vs. REF)

7 Type of Structural Variations (3) Inversion (figure: A vs. REF, with 5’ and 3’ strand ends marked)

8 Type of Structural Variations (4) Translocation (figure: chr1 and chr2)

9 Sequencing Approaches 1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005]: maps matepairs onto the reference genome. Insertions and deletions appear as an inconsistent mapped distance; inversions appear as both reads mapping in the same orientation. 2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007]: proposed a high-throughput, massive paired-end mapping technique and reported detailed types of structural variations.
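
The matepair signatures described on this slide can be sketched in a few lines of Python. This is only an illustration, not the method of either cited paper: the `MatepairMapping` record, the `classify` function, and the cutoff of `k` standard deviations are hypothetical names and parameters. The direction of the distance test (shorter than expected suggests an insertion in the donor, longer suggests a deletion) is an assumption, and cross-chromosome pairs are treated as translocation candidates.

```python
from dataclasses import dataclass


@dataclass
class MatepairMapping:
    """One candidate mapping of a matepair (hypothetical record layout)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str


def classify(m: MatepairMapping, mean: float, sd: float, k: float = 2.0) -> str:
    """Label a mapping by the signature it shows relative to the insert-size mean."""
    if m.chrom1 != m.chrom2:
        return "translocation"      # the two reads map to different chromosomes
    if m.strand1 == m.strand2:
        return "inversion"          # both reads map in the same orientation
    dist = abs(m.pos2 - m.pos1)     # mapped distance on the reference
    if dist < mean - k * sd:
        return "insertion"          # assumption: donor has extra sequence
    if dist > mean + k * sd:
        return "deletion"           # assumption: donor lacks reference sequence
    return "concordant"


example = MatepairMapping("chr1", 1_000, "+", "chr1", 61_000, "-")
label = classify(example, mean=40_000, sd=2_000)
```

Here `label` is `"deletion"`, because the mapped distance (60,000) exceeds the mean by more than two standard deviations.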

10 Motivation & Research Objectives (1) Tuzun et al. used scores that combine several factors (e.g. length, identity, and quality of the sequences). How can we map reads onto the reference genome?

11 Motivation & Research Objectives (2) Sequencing is an effective way to detect structural variants, as shown by Tuzun et al. and Korbel et al. However, each read can have multiple mappings, and previous research used a priori mapped locations. Why not develop a probabilistic model without such assumptions? Hopefully, it can also be applied to short reads from NGS machines.

12 Probabilistic Framework (1) p(Y): distribution of the mapped distances of “uniquely mapped” matepairs of various sizes. We use p(Y) as the basis of our probabilistic framework.

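13 Probabilistic Framework (2) Insertion. Notation: $X_1, X_2$ are matepairs 1 and 2; $Y$ is the random variable for the mapped distances of “uniquely mapped” matepairs, with distribution $p(Y)$ and mean $\mu_Y$; $s$ is the mapped distance. Figure label: $\mu_Y = s + r$.
$P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\,P(X_j \mid \mathrm{ins}=r)$
$P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s + r)|$.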

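14 Probabilistic Framework (3) Deletion. Figure label: $\mu_Y = s - r$, with $s$ the mapped distance and $p(Y)$ as before.
$P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\,P(X_j \mid \mathrm{del}=r)$
$P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s - r)|$.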
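
As a rough sketch of how the insertion and deletion probabilities could be evaluated, the code below estimates p(Y) empirically from a list of mapped distances of uniquely mapped matepairs. The function names and the use of an empirical (rather than fitted) distribution are assumptions made for illustration; the slides do not say how p(Y) is represented.

```python
from statistics import mean


def two_sided_tail(y_samples, delta):
    """1 - P(mu_Y - delta <= Y <= mu_Y + delta), estimated from samples of Y."""
    mu = mean(y_samples)
    inside = sum(mu - delta <= y <= mu + delta for y in y_samples)
    return 1.0 - inside / len(y_samples)


def p_insertion(y_samples, s, r):
    """P(X_i | ins = r) with delta = |mu_Y - (s + r)|; s is the mapped distance."""
    delta = abs(mean(y_samples) - (s + r))
    return two_sided_tail(y_samples, delta)


def p_deletion(y_samples, s, r):
    """P(X_i | del = r) with delta = |mu_Y - (s - r)|."""
    delta = abs(mean(y_samples) - (s - r))
    return two_sided_tail(y_samples, delta)


def p_pair(p_i, p_j):
    """Joint probability of two matepairs, which the slides treat as independent."""
    return p_i * p_j


# Toy distances of uniquely mapped matepairs (made-up numbers, mean = 100).
y = [90, 95, 100, 105, 110]
p_close = p_insertion(y, s=90, r=5)   # implied insert size 95, near the mean
p_far = p_insertion(y, s=80, r=10)    # implied insert size 90, at the edge
```

The probability is largest when the hypothesised event size brings the implied insert size (s + r or s - r) close to the mean of Y, and falls toward zero as the implied size moves into the tails.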

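15 Probabilistic Framework (4) Inversion. $c - d = s(X_1) - s(X_2)$, where $s(X_i)$ is the insert size of $X_i$ and $p(|Y_1 - Y_2|)$ is the distribution of $|Y_1 - Y_2|$ with mean $\mu_{|Y_1-Y_2|}$.
$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1-Y_2|} - (c - d)|$.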

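16 Probabilistic Framework (5) Translocation. $(c - a) - (d - b) = s(X_1) - s(X_2)$, where $s(X_i)$ is the insert size of $X_i$ and $p(|Y_1 - Y_2|)$ is the distribution of $|Y_1 - Y_2|$ with mean $\mu_{|Y_1-Y_2|}$.
$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1-Y_2|} - ((c - a) - (d - b))|$.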

17 Flow of our Framework (1) 1. Preprocessing step: get top K mappings; remove short mappings; make all possible combinations of mappings; discard matepairs consistent with insert size; remove invalid strands (-,+); remove very similar mappings; mask repeats.

18 Flow of our Framework (2) 2. Clustering: do hierarchical clustering for each structural variation (Insertion, Deletion, Inversion, Translocation). 3. Finding structural variations: find an initial configuration in a greedy manner; learn parameters for the objective function; find a locally optimal configuration.

19 Hierarchical Clustering (1) Example: insertion (figure: A vs. REF, with matepairs X1 and X2). A cluster C is a set of matepairs explaining the same structural variation, e.g. C = {X1, X2}. Linkage distance: D(X1, X2) = -ln P(X1, X2 | C).

20 Hierarchical Clustering (2) Generally, linkage distance is given by, We do hierarchical clustering for each structural variation.
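
A minimal agglomerative clustering sketch using the pairwise linkage distance D(X1, X2) = -ln P(X1, X2 | C) is given below. The general linkage formula is not preserved in this transcript, so the code substitutes complete linkage (the largest pairwise distance between two clusters) as a stand-in. The `pair_prob` callback, the `max_dist` stopping threshold, and the toy probability model are all hypothetical.

```python
import math


def linkage_distance(p):
    """D = -ln P; a zero probability maps to an infinite distance."""
    return math.inf if p <= 0.0 else -math.log(p)


def cluster(items, pair_prob, max_dist):
    """Merge the closest clusters until the closest pair is farther than max_dist.

    Assumption: complete linkage stands in for the slide's general formula.
    """
    clusters = [[x] for x in items]

    def dist(c1, c2):
        return max(linkage_distance(pair_prob(a, b)) for a in c1 for b in c2)

    while len(clusters) > 1:
        candidates = [
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(candidates, key=lambda t: t[0])
        if d > max_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy example: matepairs represented by a position, with a made-up
# probability that decays with the distance between positions.
def toy_prob(a, b):
    return math.exp(-abs(a - b) / 10)


result = cluster([100, 105, 500], toy_prob, max_dist=1.0)
```

With these toy inputs the two nearby matepairs merge into one cluster and the distant one stays alone, giving `[[100, 105], [500]]`.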

21 Choosing a Unique Mapped Location (1) Each matepair should be mapped onto a unique pair of BLAT hits and a unique cluster. (Figure: reads R1 and R2, clusters C1 and C2, and matepair mappings M 1,4, M 2,4, M 3,5.)

22 Choosing a Unique Mapped Location (2) We define an objective function J(ω): ƒ1 corresponds to BLAT hit scores, ƒ2 corresponds to the probability, and ƒ3 corresponds to the size of clusters.

23 Choosing a Unique Mapped Location (3) Find the initial configuration greedily. Learn parameters for the objective function J(ω): we used hill climbing search to maximize the log likelihood of P(ω|λ_i). Finally, find a configuration that locally maximizes J(ω), again using hill climbing search.
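
The greedy start followed by hill climbing can be sketched as below. The exact form of J(ω) is not preserved in this transcript, so the code assumes a weighted sum of three per-matepair features (a BLAT score, a log probability, and the size of the chosen cluster). The dictionary layout, the feature scaling, and the fixed weights are hypothetical, and the parameter-learning step is omitted.

```python
def objective(config, candidates, weights):
    """Assumed J(omega): weighted sum of BLAT score, log probability, cluster size."""
    w1, w2, w3 = weights
    sizes = {}
    for i, k in enumerate(config):
        cid = candidates[i][k]["cluster"]
        sizes[cid] = sizes.get(cid, 0) + 1
    total = 0.0
    for i, k in enumerate(config):
        c = candidates[i][k]
        total += w1 * c["blat"] + w2 * c["logp"] + w3 * sizes[c["cluster"]]
    return total


def greedy_init(candidates):
    """Start from the best BLAT hit for every matepair."""
    return [max(range(len(opts)), key=lambda k: opts[k]["blat"]) for opts in candidates]


def hill_climb(candidates, weights):
    """Change one matepair's choice at a time while the objective improves."""
    config = greedy_init(candidates)
    best = objective(config, candidates, weights)
    improved = True
    while improved:
        improved = False
        for i, opts in enumerate(candidates):
            for k in range(len(opts)):
                if k == config[i]:
                    continue
                trial = config[:i] + [k] + config[i + 1:]
                score = objective(trial, candidates, weights)
                if score > best:
                    config, best, improved = trial, score, True
    return config, best


# Two matepairs; the first has two candidate (mapping, cluster) choices.
candidates = [
    [{"blat": 0.9, "logp": -1.0, "cluster": "A"},
     {"blat": 0.8, "logp": -1.0, "cluster": "B"}],
    [{"blat": 0.9, "logp": -1.0, "cluster": "B"}],
]
config, score = hill_climb(candidates, weights=(1.0, 1.0, 1.0))
```

In this toy case the greedy start puts the first matepair in cluster A, and hill climbing then moves it to cluster B because sharing a cluster with the second matepair raises the cluster-size term. The search stops at a local maximum, which need not be the global one.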

24 P-values We assign p-values to give confidence to our clusters. The p-value is the probability that the cluster is generated by the reference genome rather than by structural variants: $\mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_i P(X_i \mid C_{\mathrm{null}})$, where $E$ is the expected number of matepairs mapped to the location of the cluster. P-values depend on the length of the cluster, the number of matepairs involved, and the probabilities.
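
The p-value formula translates directly into code. In this sketch the function name is hypothetical, E is assumed to be given as an integer (so that the binomial coefficient is defined), and the product is taken over the matepairs in the cluster.

```python
from math import comb, prod


def cluster_p_value(expected, null_probs):
    """Pval(C_k) = (E choose |C_k|) * product of P(X_i | C_null).

    expected:   E, the expected number of matepairs at the cluster's
                location (assumed to be an integer here)
    null_probs: P(X_i | C_null) for each matepair X_i in the cluster
    """
    return comb(expected, len(null_probs)) * prod(null_probs)


# Toy cluster of 3 matepairs, each with null probability 0.1, at a location
# where 5 matepairs are expected: C(5, 3) * 0.1**3 = 10 * 0.001.
pval = cluster_p_value(5, [0.1, 0.1, 0.1])
```

As written on the slide, the expression is not capped at 1, so it behaves as a bound on the probability rather than a normalised probability when the null probabilities are large.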

25 Clustering Results We started with ~360,000 matepairs: ~90% were uniquely mapped, and ~90% had a concordant position (mapped within μ ± 2σ). Through the clustering procedure above (FDR 0.2) we found 82 insertion clusters (53 had a uniquely mapped read), 175 deletion clusters (135), 103 inversion clusters (24), and 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read).

26 Example Deletion

27 Agreement with Previous Results
Type | Total | Tuzun | Levy | Korbel | DGV-All
Insertion | 82 (53) | 12 (7) / 139 | 6 (5) / 319 | 0 (0) / 34 | 24 (13) / 2216
Deletion | 175 (135) | 21 (17) / 102 | 25 (23) / 344 | 45 (36) / 742 | 82 (63) / 4697
Inversion | 103 (24) | 34 (12) / 56 | N/A | 42 (8) / 105 | 60 (15) / 164
We have compared our clusters with these datasets. All of the correlations (besides the zero) are significant (p-values < 0.001) via Monte Carlo simulations. The DMBT1 deletion was also found in the Tuzun et al. dataset (but not the Levy dataset).

28 Translocations A large fraction (69%) of the translocations were close to the centromeres. She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years, and the two donors are ~0.2 million years apart. These could also be mis-assemblies. Table (cell values not preserved): translocation breakpoints binned by distance to centromere, with bins <10^6, (10^6, 4.5*10^6], and >4.5*10^6.

29 Conclusions Introduced a novel framework for finding structural variants that does not rely on a priori mapping of matepairs to genomic positions. Introduced a probabilistic model for structural variants. Isolated 82 insertions, 175 deletions, and 103 inversions between the public reference human genome and the JCVI donor. These results show statistically significant correlation with previous variation studies. Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (of these, 121 have support from a uniquely mapped matepair).
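
The count of novel variants can be cross-checked against the comparison table on slide 27. The sketch below assumes that each DGV-All cell reads "overlapping clusters (of which have a uniquely mapped read) / events in the dataset", and that the Total column reads "clusters (with a uniquely mapped read)"; the variable names are illustrative.

```python
# (clusters, clusters with a uniquely mapped read) per variant type.
totals = {"insertion": (82, 53), "deletion": (175, 135), "inversion": (103, 24)}

# (clusters overlapping DGV-All, of which have a uniquely mapped read).
dgv_overlap = {"insertion": (24, 13), "deletion": (82, 63), "inversion": (60, 15)}

# Novel clusters are those that do not overlap any DGV-All event.
novel = sum(totals[t][0] - dgv_overlap[t][0] for t in totals)
novel_unique = sum(totals[t][1] - dgv_overlap[t][1] for t in totals)
```

Under that reading, `novel` is 194 and `novel_unique` is 121, matching the figures quoted in the conclusions.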