An efficient algorithm for optimizing whole genome alignment with noise P. Wong, T. Lam, N. Lu, H. Ting, and S. Yiu Department of Computer Science, University.


An efficient algorithm for optimizing whole genome alignment with noise
P. Wong, T. Lam, N. Lu, H. Ting, and S. Yiu
Department of Computer Science, University of Hong Kong
Presented by Hyun-Chul Chung

Outline
–Introduction
–The optimization problem
–The MaxMinCluster algorithm
–Finding the k-noisy clusters
–Homework

Introduction (1)
Align the whole genomes of two related species to identify regions that possibly contain conserved genes.
Conserved genes
–Genes that have the same functionality across species.
–A conserved gene usually corresponds to a sequence of matched substrings that are consecutive, close to one another in both genomes, and of sufficient total length.
–We call such a sequence a cluster.

Introduction (2)
Conserved genes (cont'd)
–Not every pair of matched substrings corresponds to a conserved gene; most of them are noise.

Introduction (3)
Relax the definition of a cluster to allow noise
–k-noisy cluster
Avoid reporting relatively small clusters
We investigate the optimization problem of finding an alignment that maximizes the size of the smallest cluster.

The Optimization Problem
The input is a sequence $M = (m_1, m_2, \ldots, m_n)$
$m_i$: a uniquely matched substring pair on two genomes A and B
–$m_i = (a_i, b_i, l_i, o_i)$
$a_i$ and $b_i$: starting positions on A and B, respectively
$l_i$: length of the substring
$o_i$: 1 if $m_i$ has the same orientation on both genomes; -1 if it has the opposite orientation
–Assume $a_1 < a_2 < \cdots < a_n$
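The input format above can be written down directly as a small record type. The following is a minimal sketch; the names `Match` and `normalize` are illustrative assumptions and do not come from the slides.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One uniquely matched substring pair m_i (field names are assumptions)."""
    a: int            # starting position on genome A
    b: int            # starting position on genome B
    length: int       # length l_i of the matched substring
    orientation: int  # 1 = same orientation, -1 = opposite orientation


def normalize(matches):
    """Sort matches by position on A and check the input assumptions."""
    ordered = sorted(matches, key=lambda m: m.a)
    for m in ordered:
        if m.orientation not in (1, -1):
            raise ValueError("orientation must be 1 or -1")
    # The slides assume a_1 < a_2 < ... < a_n, so ties are rejected.
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.a >= cur.a:
            raise ValueError("starting positions on A must be strictly increasing")
    return ordered


M = normalize([
    Match(300, 120, 40, 1),
    Match(100, 900, 25, -1),
    Match(200, 100, 30, 1),
])
```

After `normalize`, the list index of each match plays the role of the subscript i in the slides (shifted by one, since Python lists start at 0).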

Noisy Clusters
–$(m_i, m_{i+1}, \ldots, m_{i+t})$: a segment of $M$
–A segment is a k-noisy cluster (denoted by $C$) if we can remove a set $X$ of at most $k$ elements from the segment such that the resulting subsequence $S$ satisfies the following conditions:
  –All matched substrings in $S$ have the same orientation
  –If the orientation is 1, the $b_i$'s of $S$ increase; otherwise they decrease
  –For any two consecutive elements $m_p$ and $m_q$ in $S$, $|a_p - a_q| \le \mathit{Gap}$ and $|b_p - b_q| \le \mathit{Gap}$ (distance requirement)
  –$\mathrm{Size}(S)$, defined to be $\sum_{m_p \in S} l_p$, is at least $\mathit{MinSize}$ (size requirement)
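The four conditions on the subsequence S can be checked in one pass. The sketch below is illustrative only: `Match` and `satisfies_cluster_conditions` are hypothetical names, and Size(S) is assumed to be the total length of the matched substrings in S.

```python
from typing import NamedTuple


class Match(NamedTuple):
    a: int            # starting position on genome A
    b: int            # starting position on genome B
    length: int       # length of the matched substring
    orientation: int  # 1 or -1


def satisfies_cluster_conditions(S, gap, min_size):
    """Check orientation, ordering, distance and size for a subsequence S.

    S is assumed to be listed in increasing order of position on A.
    """
    if not S:
        return False
    o = S[0].orientation
    # Condition 1: a single common orientation.
    if any(m.orientation != o for m in S):
        return False
    for p, q in zip(S, S[1:]):
        # Condition 2: b increases for orientation 1, decreases for -1.
        if o == 1 and not p.b < q.b:
            return False
        if o == -1 and not p.b > q.b:
            return False
        # Condition 3: consecutive elements are within Gap on both genomes.
        if abs(p.a - q.a) > gap or abs(p.b - q.b) > gap:
            return False
    # Condition 4 (assumed definition of Size): total matched length.
    return sum(m.length for m in S) >= min_size


S = [Match(0, 0, 10, 1), Match(30, 30, 10, 1)]
ok = satisfies_cluster_conditions(S, gap=50, min_size=15)
```

A segment is then a k-noisy cluster if some choice of at most k removed elements leaves a subsequence for which this check succeeds.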

Alignment
–A maximal collection of disjoint k-noisy clusters (denoted by $A$)
Max-Min alignment problem
–Among all $X$'s that make $C$ qualify as a k-noisy cluster, let $X_o$ be the one with the smallest size
–Define $w(C)$, the weight of $C$, to be $\mathrm{Size}(C - X_o)$
–We want to find an optimal alignment $A^*$ of $M$ such that $\min_{C \in A^*} w(C) = \max_{A \in \mathcal{A}} \min_{C \in A} w(C)$
–$\mathcal{A}$: the set of all possible alignments of $M$
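To make the objective concrete, here is a brute-force reference for the max-min definition on a tiny instance. It is not the MaxMinCluster algorithm from the slides: it enumerates every collection of clusters and is exponential in their number. The tuple layout `(start, end, weight)` and the function names are assumptions made for this sketch.

```python
from itertools import combinations


def disjoint(c1, c2):
    """Two segments (start, end, weight) with inclusive indices do not overlap."""
    return c1[1] < c2[0] or c2[1] < c1[0]


def max_min_alignment(clusters):
    """Return (best smallest weight, alignment) over all maximal disjoint collections.

    Returns (0, []) when there are no clusters at all.
    """
    best_weight, best_alignment = 0, []
    for r in range(1, len(clusters) + 1):
        for subset in combinations(clusters, r):
            # An alignment consists of pairwise disjoint clusters.
            if not all(disjoint(x, y) for x, y in combinations(subset, 2)):
                continue
            # Maximality: every cluster overlaps something already chosen
            # (a chosen cluster trivially overlaps itself).
            if not all(any(not disjoint(c, s) for s in subset) for c in clusters):
                continue
            weight = min(c[2] for c in subset)
            if weight > best_weight:
                best_weight, best_alignment = weight, list(subset)
    return best_weight, best_alignment


clusters = [(0, 2, 30), (3, 5, 8), (0, 5, 25), (4, 6, 20)]
best_weight, best_alignment = max_min_alignment(clusters)
```

The maximality test is what keeps the problem from being trivial: without it, the best answer would always be the single heaviest cluster. In the example, `(0, 2, 30)` alone is not an alignment because `(3, 5, 8)` or `(4, 6, 20)` could still be added to it.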

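The MaxMinCluster Algorithm
Let $\mathcal{C}_j$ be the set of all possible k-noisy clusters whose elements are in $(m_1, m_2, \ldots, m_j)$
Let $\mathcal{A}_j$ be the set of all alignments in $\mathcal{C}_j$
Define $W(j) = \max_{A \in \mathcal{A}_j} \min_{C \in A} w(C)$
Let $\mathcal{B}_j \subseteq \mathcal{A}_j$, where each alignment in $\mathcal{B}_j$ has a k-noisy cluster containing $m_j$
Define $WI(j) = \max_{A \in \mathcal{B}_j} \min_{C \in A} w(C)$ and $WE(j) = \max_{A \in \mathcal{A}_j - \mathcal{B}_j} \min_{C \in A} w(C)$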

Let $S_j$ be the set of the starting positions of all segments that end at position $j$ and form a k-noisy cluster
Let $i^*$ be the largest position in $S_j$
Let $h$ be the largest index of a matched substring pair in some alignment $A \in \mathcal{A}_j - \mathcal{B}_j$
Proposition 1.
–Assume that $W(0) = WI(0) = WE(0) = 0$.
–Then for any $j \ge 1$,

In Step 3, each iteration computes WI(j) and WE(j) in O(j) time and W(j) in O(1) time. So Step 3 takes $O(n^2)$ time in total.
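The quadratic bound follows by summing the per-iteration costs. As a short worked step, assume iteration j costs at most $c_1 j + c_2$ for some constants $c_1$ and $c_2$ (these constants are introduced here only for the calculation):

$$
\sum_{j=1}^{n} (c_1 j + c_2) = c_1 \frac{n(n+1)}{2} + c_2 n = O(n^2).
$$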

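Finding the k-noisy clusters
Let $M_{ij}$ denote the segment $(m_i, \ldots, m_j)$.
A set $H \subseteq M_{ij}$ is said to be a set of noise in $M_{ij}$ if $M_{ij} - H$ satisfies the requirements of a noisy cluster.
Let $N_{ij}^{+}$ ($N_{ij}^{-}$, respectively) be the collection of sets of noise $H$ in $M_{ij}$ such that all elements in $M_{ij} - H$ have orientation 1 (-1, respectively).
Proposition 2.
–$M_{ij}$ is a k-noisy cluster iff the following expression is at least $\mathit{MinSize}$: $\max\{\max_{H \in N_{ij}^{+}, |H| \le k} \mathrm{Size}(M_{ij} - H),\ \max_{H \in N_{ij}^{-}, |H| \le k} \mathrm{Size}(M_{ij} - H)\}$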

Define for all $1 \le i \le j \le n$ and $0 \le x \le k$,
Then
where for any $i, j < 1$ and $x < 0$,

Let $P$ be the set of matched substring pairs $m_p = (a_p, b_p, l_p)$ such that the following holds:
–$m_p$ has the same orientation as $m_j$
–$\max(i, j-x-1) \le p \le j-1$
–$b_p < b_j$
–$m_p$ and $m_j$ satisfy the distance requirement
If $\exists X$ such that $|X| \le x - (j-p-1)$ and $M_{ip} - X$ is a noisy cluster, then $M_{ij} - X'$ is also a noisy cluster, where $X' = (m_{p+1}, m_{p+2}, \ldots, m_{j-1}) \cup X$
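The extension step above (append m_j to a chain ending at some earlier m_p, and count the skipped elements between them as noise) suggests a dynamic program. The sketch below applies that idea to a single segment. It is an illustration under assumptions, not the table-sharing algorithm behind the stated bounds: `Match`, `compatible`, `max_size_with_noise` and `is_k_noisy_cluster` are hypothetical names, Size is assumed to be total matched length, and both orientations are handled through `compatible`.

```python
from typing import NamedTuple


class Match(NamedTuple):
    a: int            # starting position on genome A
    b: int            # starting position on genome B
    length: int       # length of the matched substring
    orientation: int  # 1 or -1


def compatible(p, q, gap):
    """p may directly precede q in a cluster (orientation, order, distance)."""
    if p.orientation != q.orientation:
        return False
    if abs(p.a - q.a) > gap or abs(p.b - q.b) > gap:
        return False
    return p.b < q.b if p.orientation == 1 else p.b > q.b


def max_size_with_noise(seg, k, gap):
    """Largest Size of a valid chain left after removing at most k elements of seg."""
    n = len(seg)
    UNSET = -1
    # f[q][x]: best Size of a chain ending at seg[q] with exactly x of
    # seg[0..q-1] removed; UNSET if no such chain exists.
    f = [[UNSET] * (k + 1) for _ in range(n)]
    best = 0
    for q in range(n):
        if q <= k:
            f[q][q] = seg[q].length  # the chain starts here; all q earlier elements removed
        for p in range(max(0, q - k - 1), q):
            skipped = q - p - 1      # elements strictly between p and q become noise
            if not compatible(seg[p], seg[q], gap):
                continue
            for x in range(k - skipped + 1):
                if f[p][x] != UNSET:
                    cand = f[p][x] + seg[q].length
                    f[q][x + skipped] = max(f[q][x + skipped], cand)
        tail = n - 1 - q             # elements after q must also be removed
        for x in range(k - tail + 1):
            if f[q][x] != UNSET:
                best = max(best, f[q][x])
    return best


def is_k_noisy_cluster(seg, k, gap, min_size):
    return max_size_with_noise(seg, k, gap) >= min_size


seg = [Match(0, 0, 10, 1), Match(20, 500, 5, 1), Match(30, 30, 10, 1)]
size_k1 = max_size_with_noise(seg, k=1, gap=50)
size_k0 = max_size_with_noise(seg, k=0, gap=50)
```

For a segment of length t this sketch runs in O(t k^2) time, because each element only looks back at most k+1 positions. Running it independently on every segment of M would repeat work that a table indexed by (i, j, x) can share.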

Proposition 3.
Time complexity: $O(k^2 n^2)$
Space complexity: $O(k n^2)$
Can the space complexity be reduced to $O(k^2 n)$?

Homework #6
Given $M = (m_1, m_2, \ldots, m_n)$, where each $m_i$ is a uniquely matched substring pair on two genomes A and B, design your own algorithm to find all k-noisy clusters in $M$. Analyze its time complexity (explain how you obtained the time bound).