
Introduction to Bioinformatics: Lecture III. Genome Assembly and String Matching
Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC

Outline of the lecture
Physical mapping problem and the resulting computational challenges
Ordering clone libraries: from the consecutive ones problem to global optimization methods
Applications of exact string matching methods
Towards the shortest superstring problem and the shotgun assembly problem

Literature watch: Aloy et al., “Structure-Based Assembly of Protein Complexes in Yeast”, Science 303. Recommended as a way of getting acquainted with protein pathways and their intersection with structural studies.

Assembling physical maps of a genome. [Figure: markers positioned along a DNA strand.] Physical mapping problem: create a set of markers (e.g. stretches of DNA that hybridize to a given probe) and locate them in the genome of interest. With a sufficiently dense and ordered set of markers, any newly sequenced DNA fragment that is long enough to cover at least one marker can be mapped to a rough location on the genome. One of the early goals of the Human Genome Project was to select and map a set of STS markers such that there would be at least one STS in each 100 kb stretch of the genome.

Physical mapping and the problem of ordering clone libraries with STS markers. [Figure: clones 1 to 4 aligned under a DNA strand, with STS positions marked.] Definition: A clone library is a set of short DNA fragments, called clones, that originate from a stretch of the studied DNA. Definition: A sequence tagged site (STS) is a DNA substring that occurs only once in the DNA of interest. One may think of STSs as a set of indices against which new DNA sequences can be referenced. Problem: What is the minimum length of STSs that could, at least in principle, provide the required coverage of the human genome (at least one STS per 100 kb)?
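
One way to approach the problem above is a counting argument. The sketch below assumes, as a simplification not stated on the slide, that the genome behaves like a random string over {A, C, G, T} of about 3 billion bases: a substring of length k can be unique only if the number of distinct strings of that length, 4^k, is at least the genome length. The function name is illustrative.

```python
def min_unique_length(genome_length: int, alphabet_size: int = 4) -> int:
    """Smallest k such that alphabet_size**k >= genome_length."""
    k = 0
    distinct = 1  # number of distinct strings of length k
    while distinct < genome_length:
        distinct *= alphabet_size
        k += 1
    return k


# Assumed genome size of roughly 3 billion base pairs.
print(min_unique_length(3_000_000_000))
```

This gives only a lower bound. Real genomes contain repeats, so a substring of that length is not guaranteed to be unique, and practical markers are considerably longer.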

The problem of ordering clone libraries with STS markers can be cast (and solved) as the consecutive ones problem. [Figure: clones 1 to 4 aligned under a DNA strand, with STS positions marked.] The true locations of the STSs and clones are not known. However, for each clone the list of STSs hybridizing to it is given. Our task is to reconstruct the original order of the STSs (and thus order the clone library) from this data. Assuming that the STS probes are unique and that there are no hybridization errors, the problem can be cast as the consecutive ones problem and solved efficiently using the PQ-tree algorithm (Booth and Lueker, 1976).

The consecutive ones problem and its solution. [Figure: clones 1 to 4 aligned under a DNA strand, and the corresponding clone-by-STS hybridization matrix.] For a binary hybridization matrix, find a permutation of its columns such that in each row all ones form a single block of consecutive entries.
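
The problem statement above can be made concrete with a brute-force sketch that tries every column permutation. This is not the PQ-tree algorithm and is feasible only for tiny matrices. The matrix below is hypothetical data (rows are clones, columns are STS probes in scrambled order), and the function names are illustrative.

```python
from itertools import permutations


def row_is_consecutive(row):
    """True if all ones in the row form a single contiguous block."""
    ones = [i for i, value in enumerate(row) if value == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def consecutive_ones_order(matrix):
    """Return a column order in which every row has one block of ones, or None.

    Exhaustive search over all column permutations (factorial time).
    """
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        if all(row_is_consecutive([row[j] for j in order]) for row in matrix):
            return order
    return None


# Hypothetical hybridization matrix: hybridization[i][j] == 1 if STS j hits clone i.
hybridization = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
print(consecutive_ones_order(hybridization))
```

A matrix for which no such permutation exists returns None, which is the situation that hybridization errors typically create.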

Fortunately, errors make life more interesting. [Figure: clones 1 to 4 aligned under a DNA strand, and a clone-by-STS hybridization matrix containing errors.] In the presence of experimental errors, the task becomes a global optimization problem (see Pevzner, Chapter 3).

Heuristic solutions may still provide good probe ordering. The number of “gaps” (blocks of zeros in rows) in the hybridization matrix may be used as a cost function, since hybridization errors typically either split a block of ones (false negative) or split a gap into two gaps (false positive). The problem of finding a permutation that minimizes the number of gaps can be cast as a Traveling Salesman Problem (TSP), in which the cities are the columns of the hybridization matrix (plus an additional column of zeros) and the distance between two cities is the number of positions in which the two columns differ (Hamming distance). Thus, an efficient algorithm is unlikely in the general case (unless P=NP), and heuristic solutions are sought that provide good probe ordering, at least for most cases (e.g. Alizadeh et al., 1995). Problem: Does the correct order of the STSs in the example from the previous slide give the shortest cycle for the corresponding TSP?
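
The TSP formulation above can be sketched for a tiny matrix by exhaustive search. The cycle starts and ends at the extra all-zero column, so each row contributes two to the cycle length for every block of ones it contains; a shorter cycle therefore means fewer breaks in the rows. The matrix is hypothetical data, the function names are illustrative, and the exhaustive search stands in for the heuristics mentioned above.

```python
from itertools import permutations


def hamming(col_a, col_b):
    """Number of positions in which two columns differ."""
    return sum(a != b for a, b in zip(col_a, col_b))


def tour_cost(matrix, order):
    """Length of the cycle: zero column -> columns in `order` -> zero column."""
    zero = [0] * len(matrix)
    cols = [zero] + [[row[j] for row in matrix] for j in order] + [zero]
    return sum(hamming(cols[i], cols[i + 1]) for i in range(len(cols) - 1))


def best_order(matrix):
    """Column order with the shortest cycle, found by trying all permutations."""
    n_cols = len(matrix[0])
    return min(permutations(range(n_cols)), key=lambda order: tour_cost(matrix, order))


# Hypothetical error-free hybridization matrix (rows: clones, columns: STSs).
hybridization = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
order = best_order(hybridization)
print(order, tour_cost(hybridization, order))
```

For an error-free matrix the optimal cycle length is twice the number of non-empty rows, because every row can be reduced to a single block of ones.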

Map location of anonymous DNA as a string matching problem. A sufficiently long string of anonymous yet sequenced DNA can be placed on the physical map by finding which STSs are contained in this sequence. Due to the size of the problem, efficiency is very important: millions of STSs are available at present, and their total length is typically much larger than the length of the DNA sequence to be mapped. Assuming no sequencing errors, the problem can be cast as the exact set matching problem and solved efficiently using, for example, suffix trees. Generalized suffix trees or inexact string matching methods need to be used when some errors are allowed.
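
To illustrate what exact set matching asks for, here is a naive sketch that scans the sequence once per STS using Python's built-in `str.find`. It is a stand-in for the suffix tree approach mentioned above, not an implementation of it: its running time grows with the number of STSs times the sequence length, which is exactly what the efficient methods avoid. The sequences and marker names are made up for the example.

```python
def find_sts_hits(sequence, sts_by_name):
    """Map each STS name to the list of 0-based positions where it occurs exactly."""
    hits = {}
    for name, sts in sts_by_name.items():
        positions = []
        start = sequence.find(sts)
        while start != -1:
            positions.append(start)
            # Continue one position further so overlapping hits are found too.
            start = sequence.find(sts, start + 1)
        if positions:
            hits[name] = positions
    return hits


# Hypothetical sequenced fragment and marker set.
fragment = "GATTACAGGCTTACGATTACA"
markers = {"sts_a": "GATTACA", "sts_b": "GGCTTACG", "sts_c": "TTTTT"}
print(find_sts_hits(fragment, markers))
```

Note that a true STS occurs only once in the genome; a marker reported at more than one position, as `sts_a` is here, would not qualify.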

Strings, sequences and string operations

String exact matching problem

Solving the exact matching problem: conceptual simplicity vs. computational complexity

Computationally efficient and elegant solutions

The idea of the suffix tree method. A string with m characters has m suffixes, which can be represented as m leaves of a rooted directed tree. Consider, for example, T=cabca. [Figure: suffix tree for T=cabca with the terminal character $ appended; leaves 1 to 5 correspond to the suffixes cabca$, abca$, bca$, ca$ and a$.] For simplicity, the leaf that corresponds to the terminal character $ alone is not shown. Problem: What is the reason for adding the terminal character?
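
The idea can be sketched with an uncompressed suffix trie built from nested dictionaries. This is a simplification: a real suffix tree compresses single-child paths and can be built in linear time, whereas this construction takes quadratic time and space. The function names and the `"start"` key used to mark leaves are choices made for this example.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie of text + '$', stored as nested dicts.

    Each leaf records the 1-based start position of its suffix under 'start'.
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start + 1
    return root


def occurrences(trie, pattern):
    """Sorted 1-based start positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that begins with the pattern.
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "start":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("cabca")
print(occurrences(trie, "ca"), occurrences(trie, "ab"), occurrences(trie, "cc"))
```

Once the structure is built, answering a query takes time proportional to the pattern length plus the size of the subtree below the matched path, independent of where in the text the pattern occurs.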

Why does it work? A substring of a string is a prefix of some suffix of that string. For example, the substring P=ab is a prefix of the suffix abca in T=cabca. Thus, if P occurs in T, there is a leaf in the suffix tree whose path label from the root starts with P. As a related problem, consider motif search as implemented in PROSITE. Explain how the finite automata formalism is used for motif search.
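
As a starting point for the motif search problem, the sketch below translates a pattern written in PROSITE-style syntax into a regular expression. Patterns of this kind describe regular languages, so a finite automaton can recognize them; Python's `re` module is used here for convenience, although it is a backtracking matcher rather than an explicit automaton. The translation covers only the basic elements (x, [..], {..}, repeat counts, < and > anchors), and the function name is illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")        # drop element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
    regex = regex.replace("x", ".")                      # x      -> any residue
    regex = regex.replace("<", "^").replace(">", "$")    # sequence-end anchors
    return regex


motif = prosite_to_regex("N-{P}-[ST]-{P}")
print(motif)
print(re.search(motif, "AANGSAA"))
```

The order of the replacements matters: the exclusion braces are rewritten before the repeat counts are turned into braces, so the two uses of `{}` do not collide.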

General idea: ordered fingerprints and the notion of closeness between DNA fragments. Hierarchical sequencing: physical maps, clone libraries and shotgun. Definition: The algorithmic problem of shotgun sequence assembly is to deduce the sequence of a DNA string from a set of sequenced, partially overlapping short substrings derived from that string. Analogy to physical map assembly: the DNA sequence of a substring may be viewed as a precise ordered fingerprint (in analogy to STSs), and the suffix-prefix match determines whether two substrings should be assembled together. In general, the shortest superstring problem (find the shortest string that contains each string from a given set as a substring) is NP-hard, and heuristics are being developed to address it.
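
A common greedy heuristic for the shortest superstring problem repeatedly merges the two fragments with the longest suffix-prefix overlap. The sketch below illustrates that idea on error-free reads; it is not guaranteed to find the shortest superstring, and it is not a description of how any particular assembler works. The reads and function names are made up for the example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedily merge the pair of fragments with the largest overlap."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


# Hypothetical error-free reads.
reads = ["GATTAC", "TTACAGG", "CAGGCT"]
print(greedy_superstring(reads))
```

Repeats in the underlying DNA are the main reason the shortest superstring can differ from the true sequence, which is one motivation for the hierarchical strategy described above.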

Get the relevant sequences to compare them: conservation and differences
Problem → Algorithms → Programs
Sequencing → fragment assembly problem, the shortest superstring problem → Phrap (Green, 1994)
Gene finding → hidden Markov models, pattern recognition methods → GenScan (Burge & Karlin, 1997)
Sequence comparison → pairwise and multiple sequence alignments: dynamic programming, heuristic methods → BLAST (Altschul et al., 1990)