Introduction to Bioinformatics: Lecture III. Genome Assembly and String Matching. Jarek Meller


1 JM - http://folding.chmcc.org Introduction to Bioinformatics: Lecture III. Genome Assembly and String Matching. Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC

2 Outline of the lecture. Physical mapping problem and the resulting computational challenges; ordering clone libraries: from the consecutive ones problem to global optimization methods; applications of exact string matching methods; towards the shortest superstring problem and the shotgun assembly problem.

3 Literature watch. Aloy et al., “Structure-Based Assembly of Protein Complexes in Yeast”, Science 303. Suggested as a way of getting acquainted with protein pathways and their intersection with structural studies.

4 Assembling physical maps of a genome. [Figure: markers placed along the DNA.] Physical mapping problem: create a set of markers (e.g. stretches of DNA that hybridize to a given probe) and locate them in the genome of interest. With a sufficiently dense and ordered set of markers, any newly sequenced DNA fragment that is long enough to cover at least one marker can be mapped to a rough location on the genome. One of the early goals of the Human Genome Project was to select and map a set of STS markers such that there would be at least one STS in each 100 kb stretch of the genome.

5 Physical mapping and the problem of ordering clone libraries with STS markers. [Figure: a stretch of DNA carrying STSs 1-5, covered by overlapping clones 1-4.] Definition: A clone library consists of a set of short DNA fragments, called clones, that originate from a stretch of the studied DNA. Definition: A sequence tagged site (STS) is a DNA substring that occurs only once in the DNA of interest. One may think of STSs as a set of indices to which new DNA sequences can be referenced. Problem: What is the minimum length of the STSs that could (at least in principle) provide the requested coverage for the human genome?
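One way to approach the problem above is a simple counting argument: a substring of length L can be unique in a random sequence only if there are at least as many distinct strings of length L as there are positions in the genome. The sketch below works through that arithmetic. The genome size of 3 billion base pairs and the function name are assumptions made for illustration; real genomes contain repeats, so practical STSs are considerably longer than this lower bound.

```python
GENOME_SIZE = 3_000_000_000   # assumed approximate human genome length in bp
STS_SPACING = 100_000         # one STS per 100 kb, as stated in the lecture


def min_unique_length(genome_size, alphabet_size=4):
    """Smallest L such that alphabet_size**L >= genome_size (counting bound)."""
    length = 1
    while alphabet_size ** length < genome_size:
        length += 1
    return length


min_length = min_unique_length(GENOME_SIZE)
markers_needed = GENOME_SIZE // STS_SPACING

print(min_length)      # 16, since 4**15 < 3e9 <= 4**16
print(markers_needed)  # 30000 markers for one STS per 100 kb
```

Under these assumptions the bound is 16 bases, and about 30,000 markers give one STS per 100 kb.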

6 The problem of ordering clone libraries with STS markers can be cast (and solved) as the consecutive ones problem. [Figure: a stretch of DNA carrying STSs 1-5, covered by overlapping clones 1-4.] The true locations of the STSs and clones are not known. However, for each clone the list of STSs hybridizing to it is given. Our task is to reconstruct the original order of the STSs (and thus order the clone library) given this data. Assuming that the STS probes are unique and that there are no hybridization errors, the problem can be cast as the consecutive ones problem and solved efficiently using standard computer science techniques (the PQ-tree algorithm of Booth and Lueker, 1976).

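7 The consecutive ones problem and its solution. For a binary hybridization matrix, find a permutation of its columns such that in each row all ones are located in a block of consecutive entries. [Figure: a stretch of DNA carrying STSs 1-5, covered by overlapping clones 1-4.]

Hybridization matrix as observed (rows: clones, columns: STSs in arbitrary order):

Clone \ STS   3  5  1  4  2
1             1  0  0  1  0
2             0  0  1  0  1
3             1  0  0  1  1
4             1  1  0  1  0

The same matrix after permuting the columns:

Clone \ STS   1  2  3  4  5
1             0  0  1  1  0
2             1  1  0  0  0
3             0  1  1  1  0
4             0  0  1  1  1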
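The definition above can be checked directly on the example matrix. The following is a minimal brute-force sketch that tries every column permutation; it is not the PQ-tree algorithm mentioned in the lecture and only scales to a handful of probes. All function and variable names are illustrative.

```python
from itertools import permutations

# Hybridization matrix from the slide: rows are clones 1-4,
# columns are STS probes in the scrambled order 3, 5, 1, 4, 2.
PROBES = (3, 5, 1, 4, 2)
MATRIX = [
    (1, 0, 0, 1, 0),
    (0, 0, 1, 0, 1),
    (1, 0, 0, 1, 1),
    (1, 1, 0, 1, 0),
]


def ones_are_consecutive(row):
    """True if all 1s in the row form a single contiguous block."""
    idx = [i for i, v in enumerate(row) if v == 1]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)


def consecutive_ones_orders(matrix, probes):
    """Brute force: every probe order that has the consecutive ones property."""
    found = []
    for perm in permutations(range(len(probes))):
        if all(ones_are_consecutive([row[j] for j in perm]) for row in matrix):
            found.append(tuple(probes[j] for j in perm))
    return found


orders = consecutive_ones_orders(MATRIX, PROBES)
for order in sorted(orders):
    print(order)
```

For this matrix the search returns four orders: 1 2 3 4 5, 1 2 4 3 5 and their reversals. Probes 3 and 4 can be swapped because every clone contains either both of them or neither, so the hybridization data alone cannot distinguish them.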

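8 Fortunately errors make life more interesting … In the presence of experimental errors the problem leads to a global optimization problem (see Pevzner, Chapter 3). [Figure: a stretch of DNA carrying STSs 1-5, covered by overlapping clones 1-4.]

Hybridization matrix with errors, as observed (rows: clones, columns: STSs in arbitrary order):

Clone \ STS   5  4  1  3  2
1             0  1  0  1  0
2             0  1  1  0  1
3             0  1  0  1  1
4             1  0  0  1  0

The same matrix with the columns in the true STS order:

Clone \ STS   1  2  3  4  5
1             0  0  1  1  0
2             1  1  0  1  0
3             0  1  1  1  0
4             0  0  1  0  1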

9 Heuristic solutions may still provide good probe ordering. The number of “gaps” (blocks of zeros in rows) in the hybridization matrix may be used as a cost function, since hybridization errors typically either split a block of ones (false negative) or split a gap into two gaps (false positive). The problem of finding a permutation that minimizes the number of gaps can be cast as a Traveling Salesman Problem (TSP), in which the cities are the columns of the hybridization matrix (plus an additional column of zeros) and the distance between two cities is the number of positions in which the two columns differ (Hamming distance). Thus, an efficient algorithm is unlikely in the general case (unless P=NP), and heuristic solutions are being sought that provide good probe ordering, at least for most cases (e.g. Alizadeh et al., 1995). Problem: Does the correct order of the STSs in the example from the previous slide give the shortest cycle for the corresponding TSP?
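The TSP formulation above can be tried out on the error-containing matrix from the previous slide. The sketch below is an exhaustive search over all column orders, which is feasible only for tiny examples; it is not one of the heuristics cited in the lecture. It counts blocks of ones per row rather than gaps, on the assumption that the two counts coincide once the extra zero column closes the cycle. All names are illustrative.

```python
from itertools import permutations

# Hybridization matrix with errors from the previous slide,
# columns in the true STS order 1..5, rows are clones 1..4.
MATRIX = [
    (0, 0, 1, 1, 0),
    (1, 1, 0, 1, 0),
    (0, 1, 1, 1, 0),
    (0, 0, 1, 0, 1),
]


def column(matrix, j):
    """Column j of the matrix as a tuple (one TSP 'city')."""
    return tuple(row[j] for row in matrix)


def hamming(u, v):
    """Number of positions in which two columns differ."""
    return sum(a != b for a, b in zip(u, v))


def tour_length(matrix, order):
    """Length of the cycle: zero column -> columns in `order` -> zero column."""
    zero = (0,) * len(matrix)
    cities = [zero] + [column(matrix, j) for j in order] + [zero]
    return sum(hamming(a, b) for a, b in zip(cities, cities[1:]))


def blocks_of_ones(matrix, order):
    """Total number of maximal blocks of ones over all rows for a column order."""
    total = 0
    for row in matrix:
        prev = 0
        for j in order:
            if row[j] == 1 and prev == 0:
                total += 1
            prev = row[j]
    return total


n = len(MATRIX[0])
true_order = tuple(range(n))  # 0-based indices of STSs 1..5
best_order = min(permutations(range(n)), key=lambda p: tour_length(MATRIX, p))

print(tour_length(MATRIX, true_order), blocks_of_ones(MATRIX, true_order))
print(tour_length(MATRIX, best_order), blocks_of_ones(MATRIX, best_order))
```

On this data the true order gives a cycle of length 12 (6 blocks of ones), whereas the shortest cycle has length 8, reached for example by the order 5, 3, 4, 2, 1. With errors present, the optimum of the cost function need not coincide with the true probe order.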

10 Map location of anonymous DNA as a string matching problem. A sufficiently long string of anonymous yet sequenced DNA can be placed on the physical map by finding which STSs are contained in this sequence. Due to the size of the problem, efficiency is very important. Millions of STSs are available at present, and their total length is typically much larger than the length of the DNA sequence to be mapped. Assuming no sequencing errors, the problem can be cast as the exact set matching problem and solved efficiently using, for example, suffix trees. Generalized suffix trees or inexact string matching methods need to be used when some errors are allowed.
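To make the exact set matching task concrete, here is a naive baseline that scans the fragment once per STS using Python's built-in `str.find`. It is deliberately not the suffix tree approach mentioned above and would be far too slow for millions of STSs; it only shows what the input and output of the problem look like. The STS names and sequences are invented toy data.

```python
def find_all(text, pattern):
    """0-based start positions of every exact occurrence of pattern in text."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)  # allow overlapping matches
    return hits


def sts_hits(fragment, sts_library):
    """Map each STS name to its positions in the fragment, skipping misses."""
    hits = {}
    for name, seq in sts_library.items():
        positions = find_all(fragment, seq)
        if positions:
            hits[name] = positions
    return hits


# Toy data, invented for illustration only.
STS_LIBRARY = {"STS_A": "ACGTTG", "STS_B": "GGATCC", "STS_C": "TTTAAA"}
FRAGMENT = "CCACGTTGAAGGATCCAT"

print(sts_hits(FRAGMENT, STS_LIBRARY))  # {'STS_A': [2], 'STS_B': [10]}
```

The STSs found in the fragment, together with their known map positions, would then give the rough location of the fragment on the physical map.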

11 Strings, sequences and string operations

12 String exact matching problem

13 Solving the exact matching problem: conceptual simplicity vs. computational complexity

14 Computationally efficient and elegant solutions

15 The idea of the suffix tree method. A string with m characters has m suffixes, which can be represented as m leaves of a rooted directed tree. Consider for example T = cabca. [Figure: suffix tree for T = cabca$, with leaves 1-5 corresponding to the suffixes cabca$, abca$, bca$, ca$ and a$.] For simplicity, one leaf, due to the terminal character $, is not included. Problem: What is the reason for adding the terminal character?

16 Why does it work? A substring of a string is a prefix of a suffix of that string. For example, the substring P = ab is a prefix of the suffix abca in T = cabca. Thus, if P occurs in T, there is a leaf in the suffix tree whose label starts with P. [Figure: suffix tree for T = cabca$, with leaves 1-5 corresponding to the suffixes cabca$, abca$, bca$, ca$ and a$.] As a related problem, consider the motif search as implemented in PROSITE. Explain how the finite automata formalism is used for motif search.
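The observation that every substring is a prefix of some suffix can be illustrated with a suffix trie, the uncompressed relative of the suffix tree. The sketch below builds the trie naively in quadratic time and space, so it shows the idea but not the linear-time constructions that make suffix trees practical. The function names and the use of nested dictionaries are assumptions of this example.

```python
def build_suffix_trie(text):
    """Naive O(m^2) suffix trie of text + '$'; each node is a dict of children.

    The multi-character key "leaf" cannot clash with a single-character edge
    and stores the 1-based start position of the suffix ending at that node.
    """
    text += "$"
    root = {}
    for start in range(len(text) - 1):  # skip the suffix that is just "$"
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def occurrences(trie, pattern):
    """1-based start positions of pattern: all leaves below the pattern's node."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # pattern is not a prefix of any suffix
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("cabca")
print(occurrences(trie, "ab"))  # [2]
print(occurrences(trie, "ca"))  # [1, 4]
print(occurrences(trie, "bb"))  # []
```

Walking down the trie takes time proportional to the length of P, independent of the length of T; collecting the leaves below the reached node then reports every occurrence.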

17 Hierarchical sequencing: physical maps, clone libraries and shotgun. General idea: ordered fingerprints and the notion of closeness between DNA fragments. Definition: The algorithmic problem of shotgun sequence assembly is to deduce the sequence of a DNA string from a set of sequenced, partially overlapping short substrings derived from that string. Analogy to physical map assembly: the DNA sequence of a substring may be viewed as a precise ordered fingerprint (in analogy to STSs), and the suffix-prefix match determines whether two substrings would be assembled together. In general, the shortest superstring problem (find the shortest string that contains each string from a given set of strings as a substring) is NP-hard, and heuristics are being developed to address it.
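A common textbook heuristic for the shortest superstring problem is to merge, repeatedly, the two fragments with the largest suffix-prefix overlap. The sketch below implements that greedy idea on error-free toy reads; it is not guaranteed to find the shortest superstring and is not the method used by any particular assembler. The reads and function names are invented for illustration.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = sorted(
        f for f in set(fragments)
        if not any(f != g and f in g for g in fragments)
    )
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


# Toy reads, invented for illustration; all are substrings of ATGCGTACGTTAG.
READS = ["ATGCGTA", "GTACGTT", "CGTTAG"]
assembly = greedy_superstring(READS)
print(assembly)  # ATGCGTACGTTAG
```

Here the greedy merges recover the original 13-base string. With repeats or sequencing errors the largest overlap can be misleading, which is one reason practical assemblers rely on more elaborate heuristics.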

18 Get the relevant sequences to compare them: conservation and differences.

Problem -> Algorithms -> Programs
Sequencing -> fragment assembly problem -> the shortest superstring problem -> Phrap (Green, 1994)
Gene finding -> Hidden Markov Models, pattern recognition methods -> GenScan (Burge & Karlin, 1997)
Sequence comparison -> pairwise and multiple sequence alignments -> dynamic programming, heuristic methods -> BLAST (Altschul et al., 1990)

