Identification of Distinguishing Motifs Zhanyong WANG (Master Degree Student) Dept. of Computer Science, City University of Hong Kong


Identification of Distinguishing Motifs Zhanyong WANG (Master Degree Student) Dept. of Computer Science, City University of Hong Kong Joint work with WangSen FENG and Lusheng WANG

Outline The Definitions of Problems Applications Previous work Our work Algorithm for Single Group Algorithm for Two Groups Simulation Results for Single Group Simulation Results for Two Groups

Motif Identification Two versions 1. Single Group 2. Two Groups

Single Group Instance: a group of n sequences. Objective: find a length-L motif that appears in each of the given sequences such that the occurrences of the motif are similar to one another.

Two Groups Instance: two groups of sequences, B (Bad) and G (Good). Objective: find a motif of length L that appears in every sequence in group B and does not appear anywhere in the sequences of G. The occurrences of the motif may contain errors.

Applications 1. Finding Targets for Potential Drugs (T. Jiang, C. Trendall, S. Wang, T. Wareham, X. Zhang, 1998) (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999) -- bad strings in B are from bacteria -- good strings in G are from humans -- find a substring s of length L that is conserved in all bad strings, but not conserved in good strings -- use s to screen chemicals -- the selected chemicals can then be tested as potential broad-range antibiotics.

Applications 2. Creating Diagnostic Probes for Bacterial Infection (T. Brown, G.A. Leonard, E.D. Booth, G. Kneale, 1990) -- a group of closely related pathogenic bacteria -- find a substring that occurs in each of the bacterial sequences (with as few substitutions as possible) and does not occur in the human sequences

Applications 3. Locating binding sites and regulatory signals 4. Creating Universal PCR Primers 5. Creating Unbiased Consensus Sequences 6. Anti-sense Drug Design

Previous work The closest substring problem was proved to be NP-hard. So are the single group and two groups problems (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999). Polynomial time approximation schemes exist, but they are theoretical results and are too slow to solve practical instances.

Previous Programs Bailey and Elkan: MEME (1994), uses a modified EM algorithm and allows the motif to be absent in some of the given sequences. Waterman: extended sample-driven approach (1984). Keich and Pevzner: two programs (2002). Buhler and Tompa: Projection (2002), combines EM and random projection. Price, Ramabhadran and Pevzner: PatternBranching (2003), uses branching from sample strings and is faster than the previously best known program, Projection.

Previous Programs (continued) Do not allow indels Only for the one group problem Some algorithms can handle one gap

Our work An extension of the EM approach A randomized algorithm for the single group problem which can handle indels We give an algorithm for the two groups problem

Representation of motifs Consensus pattern: choose the letter that appears most often in each of the L columns (Figure a). Profile: a 4×L matrix W with rows A, C, G, T; each cell W(i,j) is a number indicating the occurrence rate of letter i in column j (Figure b). Use the profile representation in the early stage of the EM algorithm. Use the consensus pattern representation to improve the accuracy. [Figure (a): motif occurrences caaccca, caacccc, catcccg, catccct, cacccca; consensus pattern caaccca; another consensus pattern catccca. Figure (b): the profile matrix with rows A, C, G, T.]
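The two representations can be illustrated on the five occurrences from Figure (a). This is a minimal sketch, not the authors' code: the function names are hypothetical, and ties between equally frequent letters are broken alphabetically, which is an assumption (the slide shows that either caaccca or catccca is a valid consensus).

```python
ALPHABET = "ACGT"

def column_counts(occurrences):
    """Count how often each letter appears in each of the L columns."""
    length = len(occurrences[0])
    counts = {b: [0] * length for b in ALPHABET}
    for s in occurrences:
        for j, ch in enumerate(s.upper()):
            counts[ch][j] += 1
    return counts

def profile(occurrences):
    """4 x L profile: W[b][j] is the occurrence rate of letter b in column j."""
    counts = column_counts(occurrences)
    n = len(occurrences)
    return {b: [c / n for c in counts[b]] for b in ALPHABET}

def consensus(occurrences):
    """Most frequent letter per column; ties go to the alphabetically first letter."""
    counts = column_counts(occurrences)
    length = len(occurrences[0])
    # max() returns the first maximal element, so "A" beats "T" on a tie.
    return "".join(max(ALPHABET, key=lambda b: counts[b][j]) for j in range(length))

occurrences = ["caaccca", "caacccc", "catcccg", "catccct", "cacccca"]
W = profile(occurrences)
pattern = consensus(occurrences).lower()
```

In column 3 the letters a and t both occur twice, which is why the slide lists two consensus patterns; the sketch returns the first of them.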

Computing the single group problem The EM (Expectation Maximization) Algorithm (Wang, L., Dong, L. and Fan, H. 2004) Input: n sequences $S_1, S_2, \dots, S_n$; a $4 \times L$ matrix W with rows A, C, G, T (the initial guess of the motif). Output: a new matrix W that is a locally maximal solution.

Step 1: An L-mer $S_{ij}$ is the length-L substring of $S_i$ starting at position j. For each L-mer $S_{ij}$, calculate the likelihood that $S_{ij}$ is an occurrence of the motif: $P(i,j) = \prod_{x=1}^{L} W(S_{ij}(x), x)$. To avoid zero weights, a fixed small number (0.1) is added to the entries of W. Step 2: Normalize the likelihood: $P'(i,j) = P(i,j) / \sum_{x=1}^{m-L+1} P(i,x)$, so that $\sum_{j=1}^{m-L+1} P'(i,j) = 1$. [Figure: for $S_{ij}$ = caa and a matrix W with rows A, C, G, T, $P(i,j) = 0.25 \times 0.1 \times 1 = 0.025$.]
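Steps 1 and 2 can be sketched for one sequence as below. The function names are hypothetical, and the sketch assumes the small number 0.1 is added to every entry of W before multiplying; the slide's example suggests it may only replace zero entries, so treat this as one possible reading.

```python
ALPHABET = "ACGT"

def likelihoods(seq, W, pseudo=0.1):
    """Step 1: P(i, j) for every L-mer of one sequence (product over columns)."""
    L = len(W["A"])
    scores = []
    for j in range(len(seq) - L + 1):
        p = 1.0
        for x in range(L):
            # Assumption: the pseudo weight is added to every entry of W.
            p *= W[seq[j + x]][x] + pseudo
        scores.append(p)
    return scores

def normalize(scores):
    """Step 2: P'(i, j), rescaled so the values for one sequence sum to 1."""
    total = sum(scores)
    return [p / total for p in scores]

# Usage with a uniform 4 x 3 matrix: every L-mer is equally likely.
W_uniform = {b: [0.25] * 3 for b in ALPHABET}
raw = likelihoods("ACGTAC", W_uniform)
norm = normalize(raw)
```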

Step 3: Re-estimate the motif matrix W: $W = \sum_{i=1}^{n} \sum_{j=1}^{m-L+1} W_{ij}$, where $W_{ij}$ is constructed from $S_{ij}$. [Figure: the matrix $W_{ij}$, with rows A, C, G, T, built from $S_{ij}$ = caa with letters $S_{ij}(1)$, $S_{ij}(2)$, $S_{ij}(3)$ and $P(i,j) = 0.25 \times 0.1 \times 1 = 0.025$.]

Step 4: Normalize W: $W'(b,x) = W(b,x) / \sum_{b' \in \{A,C,G,T\}} W(b',x)$. Replace W with W'.
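Steps 3 and 4 can be sketched together. The slides only say that W_ij is constructed from S_ij; the sketch below assumes W_ij holds the normalized likelihood P'(i,j) in row S_ij(x) of each column x and zero elsewhere, which is the usual EM construction but is not confirmed by the transcript. The function name is hypothetical.

```python
ALPHABET = "ACGT"

def reestimate(seqs, posteriors, L):
    """Step 3: sum the per-L-mer matrices W_ij; Step 4: normalize each column.

    posteriors[i][j] is the normalized likelihood P'(i, j) of the L-mer that
    starts at position j of sequence i.
    """
    W = {b: [0.0] * L for b in ALPHABET}
    for seq, post in zip(seqs, posteriors):
        for j, p in enumerate(post):
            for x in range(L):
                # Assumed W_ij: weight p at (S_ij(x), x), zero elsewhere.
                W[seq[j + x]][x] += p
    for x in range(L):
        column_total = sum(W[b][x] for b in ALPHABET)
        for b in ALPHABET:
            W[b][x] /= column_total
    return W

# Usage: two sequences, each with two candidate L-mers (L = 3).
seqs = ["ACGT", "TACG"]
posteriors = [[0.9, 0.1], [0.1, 0.9]]
W_new = reestimate(seqs, posteriors, 3)
```

With these weights the column totals are 2 before normalization, so the dominant L-mer ACG ends up with rate 0.9 in every column.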

Step 5: Steps 1 to 4 are called a cycle. If W changes very little from the last cycle, then EM has converged and the algorithm ends; otherwise, go to Step 1 and start the next cycle. The algorithm stops when $\max_{b,x} |W_q(b,x) - W_{q-1}(b,x)| < \epsilon$, where $W_q$ is the matrix after cycle q. Set $\epsilon = 0.05$ so that the algorithm stops within a few cycles.
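The stopping rule of Step 5 can be sketched independently of how a cycle is computed. Here `cycle` is a hypothetical callable standing in for Steps 1 to 4, and `max_cycles` is a safety cap added for the sketch, not something the slides mention.

```python
def max_change(W_new, W_old):
    """Largest absolute difference between corresponding matrix entries."""
    return max(
        abs(a - b)
        for letter in W_new
        for a, b in zip(W_new[letter], W_old[letter])
    )

def run_until_converged(W, cycle, eps=0.05, max_cycles=100):
    """Repeat `cycle` until W changes by less than eps; return (W, cycles used)."""
    for q in range(1, max_cycles + 1):
        W_next = cycle(W)
        if max_change(W_next, W) < eps:
            return W_next, q
        W = W_next
    return W, max_cycles

# Toy cycle for illustration only: move every entry halfway towards 0.25.
def toy_cycle(W):
    return {b: [(v + 0.25) / 2 for v in row] for b, row in W.items()}

start = {"A": [1.0], "C": [0.0], "G": [0.0], "T": [0.0]}
W_final, cycles = run_until_converged(start, toy_cycle)
```

With the toy cycle the largest change halves each time (0.375, 0.1875, 0.09375, 0.046875), so the loop stops after the fourth cycle.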

Our Algorithm For Single Group (with indels) The general framework is the same as in the previous algorithm. 1. Get an initial guess of the motif W. 2. With W as the initial value, use the new EM algorithm to update W. 3. Repeat steps 1–2 several (Maxtrials) times and choose the best result.

Incorporating Indels We add the “space” as a letter, so the matrix for the EM algorithm becomes 5×L. K: the maximum total number of indels. For each starting position, consider all substrings of length L+h, where h = 0, 1, -1, …, K, -K is the number of indels. Align each length-(L+h) substring with the matrix.

Align a length L+h string with a 5×L matrix Dynamic programming similar to pairwise string alignment. d[i, j] is the score of aligning the first i columns of the matrix with the first j letters of the string: d[i, j] = max{d[i-1, j-1] × W[x, i], d[i-1, j] × W[△, i], d[i, j-1] × e}, where x is the j-th letter of the string and △ is the space letter. The table is filled in bottom-up order; d[L, L+h] is the score of the best alignment (with indels).
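The recurrence can be sketched as a small dynamic program. The slides do not define e or the boundary values, so the sketch assumes e is a fixed weight (0.05 here, a made-up value) charged for a string letter that matches no matrix column, and it starts from d[0][0] = 1. The space row is stored under the key "-", and the function name is hypothetical.

```python
def align_score(W, s, e=0.05):
    """Best score for aligning string s with a 5 x L matrix W (rows A, C, G, T, '-')."""
    L = len(W["-"])
    n = len(s)
    d = [[0.0] * (n + 1) for _ in range(L + 1)]
    d[0][0] = 1.0  # assumed boundary: the empty alignment has score 1
    for i in range(L + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = 0.0
            if i > 0 and j > 0:  # column i matched with the j-th letter
                best = max(best, d[i - 1][j - 1] * W[s[j - 1]][i - 1])
            if i > 0:            # column i matched with a space
                best = max(best, d[i - 1][j] * W["-"][i - 1])
            if j > 0:            # j-th letter matched with no column
                best = max(best, d[i][j - 1] * e)
            d[i][j] = best
    return d[L][n]

# Usage: a matrix that strongly prefers the motif ACG.
W5 = {
    "A": [0.80, 0.05, 0.05],
    "C": [0.05, 0.80, 0.05],
    "G": [0.05, 0.05, 0.80],
    "T": [0.05, 0.05, 0.05],
    "-": [0.05, 0.05, 0.05],
}
exact = align_score(W5, "ACG")      # no indel
deleted = align_score(W5, "AG")     # one letter missing from the string
inserted = align_score(W5, "ACTG")  # one extra letter in the string
```

Multiplying probabilities keeps the sketch close to the slide's formula; for long motifs a log-space version would avoid underflow.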

Continued After calculating the motif W (profile representation: a matrix), we use the matrix W to find the occurrence of the motif in each sequence.

Find the motif occurrences Find the occurrence of the motif in each string using the score $\sum_{i=1}^{L} W(a_i, i)$, where $a_1 a_2 a_3 \dots a_L$ is a length-L substring (L-mer) and W is the matrix for the motif.
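A minimal sketch of this scan, assuming the reported occurrence is the L-mer with the highest score in each string (the slide gives the score but does not state the selection rule); the function name is hypothetical.

```python
def best_occurrence(seq, W):
    """Return (position, L-mer, score) of the highest-scoring L-mer in seq."""
    L = len(W["A"])
    best_pos, best_score = 0, float("-inf")
    for j in range(len(seq) - L + 1):
        # Score of the L-mer a_1 ... a_L starting at j: sum of W(a_i, i).
        score = sum(W[seq[j + x]][x] for x in range(L))
        if score > best_score:
            best_pos, best_score = j, score
    return best_pos, seq[best_pos:best_pos + L], best_score

# Usage: a matrix that prefers ACG, scanned over one string.
W_acg = {
    "A": [0.80, 0.05, 0.05],
    "C": [0.05, 0.80, 0.05],
    "G": [0.05, 0.05, 0.80],
    "T": [0.10, 0.10, 0.10],
}
pos, lmer, score = best_occurrence("TTACGTT", W_acg)
```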

Algorithm for the two Groups (no indels) We follow the basic steps of the EM method. Modify the formula to re-construct W. Re-estimate the matrix W from both groups B and G.

Main idea When the motif represented by the matrix W is too close to some L-mers from group G (p(i,j)>ave), we scoop the pattern from the matrix by subtracting the corresponding matrix W ij
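One possible reading of the scooping step is sketched below. Several details are assumptions because the slides do not give them: `ave` is passed in as a plain threshold, the weight subtracted for an L-mer is its likelihood P(i,j), the 0.1 pseudo weight is added to every entry, and entries are clipped at zero so the matrix stays non-negative. The function names are hypothetical.

```python
ALPHABET = "ACGT"

def likelihood(lmer, W, pseudo=0.1):
    """P(i, j): product over columns of the (pseudo-weighted) matrix entries."""
    p = 1.0
    for x, ch in enumerate(lmer):
        p *= W[ch][x] + pseudo
    return p

def scoop(W, good_seqs, ave, pseudo=0.1):
    """Subtract W_ij for every L-mer of group G whose likelihood exceeds ave."""
    L = len(W["A"])
    out = {b: list(W[b]) for b in ALPHABET}  # leave the input matrix untouched
    for seq in good_seqs:
        for j in range(len(seq) - L + 1):
            lmer = seq[j:j + L]
            p = likelihood(lmer, W, pseudo)
            if p > ave:
                for x, ch in enumerate(lmer):
                    # Assumed W_ij: weight p at (S_ij(x), x); clip at zero.
                    out[ch][x] = max(0.0, out[ch][x] - p)
    return out

# Usage: the motif ACG also occurs in a good sequence, so it is scooped out.
W_b = {
    "A": [0.80, 0.05, 0.05],
    "C": [0.05, 0.80, 0.05],
    "G": [0.05, 0.05, 0.80],
    "T": [0.05, 0.05, 0.05],
}
W_scooped = scoop(W_b, ["TTACGTT"], ave=0.5)
```

After scooping, the matrix would be normalized again as in Step 4 before the next cycle.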

Experiment Results (Single Group) Input: (1) randomly generate sequences n = 20 m= 600 (2) insert motif into the sequences Center string s (length L) Mutate d positions (insertion, deletion, mutation) Implant the mutated copy into the sequences Output: Use our program to find the implanted pattern.
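The test-data generation can be sketched as follows. The values L=15 and d=2, the seed, and the choice to overwrite a window of the background sequence with the mutated copy are assumptions made for the sketch; the slide only fixes n=20 and m=600.

```python
import random

ALPHABET = "ACGT"

def mutate(center, d, rng):
    """Apply d random edits (substitution, insertion or deletion) to center."""
    s = list(center)
    for _ in range(d):
        op = rng.choice(["substitute", "insert", "delete"])
        if op == "substitute":
            pos = rng.randrange(len(s))
            s[pos] = rng.choice([b for b in ALPHABET if b != s[pos]])
        elif op == "insert":
            s.insert(rng.randrange(len(s) + 1), rng.choice(ALPHABET))
        elif len(s) > 1:  # delete, but never empty the string
            del s[rng.randrange(len(s))]
    return "".join(s)

def make_instance(n=20, m=600, L=15, d=2, seed=0):
    """Random sequences, each with one mutated copy of the center implanted."""
    rng = random.Random(seed)
    center = "".join(rng.choice(ALPHABET) for _ in range(L))
    seqs, positions, copies = [], [], []
    for _ in range(n):
        background = "".join(rng.choice(ALPHABET) for _ in range(m))
        copy = mutate(center, d, rng)
        pos = rng.randrange(m - len(copy) + 1)
        # Overwrite a window so every sequence keeps length m.
        seqs.append(background[:pos] + copy + background[pos + len(copy):])
        positions.append(pos)
        copies.append(copy)
    return center, seqs, positions, copies

center, seqs, positions, copies = make_instance()
```

Keeping the planted positions and copies makes it possible to check afterwards whether a motif finder recovered the implanted pattern.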

Experiment Results (Single Group) Table 1: 15 sequences: no indel 5 sequences: one deletion Table 2: 10 sequences: no indel 5 sequences: one deletion 5 sequences: one insertion In Table 2, the running time increases significantly and the accuracy in many cases is slightly worse than in Table 1.

Experiment Results (Single Group) Table 3: 5 sequences : one deletion 5 sequences : two deletions 10 sequences: no indel Table 4: 5 sequences : one insertion 5 sequences : two insertions 10 sequences: no indel The results in Table 4 are slightly better than those in Table 3. The reason might be that the case in Table 4 needs to insert two columns in the matrix for the motif, whereas the case in Table 3 needs to insert two spaces in the motif sequences

Experiment Results (Single Group) Table 5, the mixed case: Probability: one insertion : 1/8 one deletion : 1/8 two insertions : 1/8 two deletions: 1/8 one insertion and one deletion: 1/8 no indel: 3/8

Experiment Results (Two Groups) Center (m=600): c1: the center for group B, random sequence c2: the center for group G, randomly mutate 200 positions from c1 Generate two groups n=10 Randomly mutate 200 positions from the center
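The two-group test data can be sketched in the same way. The sketch assumes that "mutate 200 positions" means substituting a different letter at 200 distinct positions, which yields a Hamming distance of exactly 200 to the center; the distances reported with the tables are smaller, so the authors' mutation procedure may differ. Function names and the seed are hypothetical.

```python
import random

ALPHABET = "ACGT"

def mutate_positions(center, k, rng):
    """Substitute a different letter at k distinct random positions."""
    s = list(center)
    for pos in rng.sample(range(len(s)), k):
        s[pos] = rng.choice([b for b in ALPHABET if b != s[pos]])
    return "".join(s)

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def make_two_groups(m=600, n=10, k=200, seed=0):
    """Centers c1, c2 and groups B (around c1) and G (around c2)."""
    rng = random.Random(seed)
    c1 = "".join(rng.choice(ALPHABET) for _ in range(m))
    c2 = mutate_positions(c1, k, rng)
    group_b = [mutate_positions(c1, k, rng) for _ in range(n)]
    group_g = [mutate_positions(c2, k, rng) for _ in range(n)]
    return c1, c2, group_b, group_g

c1, c2, group_b, group_g = make_two_groups()
```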

Experiment Results (Two Groups) From Table 6, we can see that it is easy to find a motif that distinguishes the two groups when L is large. Comparing Table 7 with Table 6, we can see that it is easier to find a distinguishing motif when the distance between the two centers is large. Table 6 shows the results when the average Hamming distance between c1 and c2 is about 128. Table 7 shows the results when the average Hamming distance between c1 and c2 is about 175.

Summary An algorithm for the single group problem that can handle indels An algorithm for the two groups problem