Finding Subtle Motifs by Branching from Sample Strings
Xuan Qi, Computer Science Dept., Utah State Univ.

Preface
- This presentation is based on the paper “Finding subtle motifs by branching from sample strings” by Alkes Price, Sriram Ramabhadran, and Pavel A. Pevzner.

Outline
- The motif finding problem.
- Methods that have been proposed to address this problem.
- The contribution of the method presented in this paper.
- The algorithms proposed in this paper.
- Experimental results.
- Advantages and disadvantages of the method proposed in this paper.
- Future research directions.

Motif Finding Problem
- Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score.
  Input: a t × n matrix DNA (t sequences of length n) and l, the length of the pattern to find.
  Output: an array of t starting positions s = (s1, s2, …, st) maximizing Score(s, DNA).
- Subtle motif: a motif with a low score, i.e. a pattern that does not stand out strongly among the sequences and is therefore more difficult to identify.
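The slide does not spell out Score(s, DNA). The sketch below assumes the usual consensus score: align the t chosen l-mers and, for each of the l columns, count the most frequent nucleotide, then sum the counts. The function name `consensus_score`, the 0-based start positions, and the toy sequences are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def consensus_score(dna, starts, l):
    """Assumed consensus score: for each alignment column, take the count
    of the most common nucleotide, and sum over the l columns."""
    # One l-mer per sequence, starting at the given 0-based position.
    lmers = [seq[s:s + l] for seq, s in zip(dna, starts)]
    # zip(*lmers) iterates over the columns of the alignment.
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*lmers))

# Toy input (hypothetical data): the chosen 4-mers are ATCG, ATGG, ATCG.
dna = ["CCATCGGA", "TTATGGGC", "AATCGTTA"]
starts = [2, 2, 1]
print(consensus_score(dna, starts, 4))
```

A perfectly conserved motif would score t × l; subtle motifs sit well below that bound, which is what makes them hard to separate from background.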

Methods Proposed
- Category 1: searching over possible starting positions of the motif
  Methods: CONSENSUS, Gibbs sampling
  Disadvantages: the search space is very large, and these methods do not always find optimal motifs.
- Category 2: searching over possible samples of the motif
  Methods: Vanet et al. 2000, Marsan and Sagot 2000, Pavesi et al. 2001, Apostolico et al. 2002, Eskin and Pevzner 2002
  Advantages: reduces the search space.
  Disadvantages: still computationally expensive, especially for long motifs. A search started from a selected sample may converge to a local optimum instead of the global optimum.
- An alternative: the extended sample-driven approach, which exhaustively searches the neighbors of all samples.

Contribution of This Paper
- Basic idea: branching from the sample strings.
- Contribution: much more efficient than previous algorithms, and effective at finding subtle motifs.

Comparison between the Methods

The Algorithms Proposed
- Two ways to model a motif:
  1. as a pattern
  2. as a profile: a 4 × l matrix
- Two algorithms proposed:
  1. Pattern-Branching algorithm
  2. Profile-Branching algorithm

Pattern-Branching Algorithm
- Distance between the motif M and a sample string $A_0$: $d(M, A_0) = k$.
- $D_{=k}(A_0)$: the set of patterns at distance exactly k from $A_0$.
- Neighbor: a pattern in $D_{=1}(A)$, obtained by changing a single nucleotide of A. E.g., ATTGCCAG has the neighbors ATTGCCTG and GTTGCCAG.
- Score of a pattern: its total distance from the sequences.
  1. For each sequence $s_i$, $d(A, s_i) = \min\{d(A, P) \mid P \in s_i\}$, where P ranges over the l-mers (patterns of length l) in $s_i$.
  2. The total distance of A from S is $d(A, S) = \sum_{s_i \in S} d(A, s_i)$.
- BestNeighbor(A): the pattern $B \in D_{=1}(A)$ with the lowest total distance $d(B, S)$.
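The definitions on this slide translate almost directly into code. The following is a minimal sketch, assuming d is the Hamming distance (number of mismatched positions) and the alphabet is A, C, G, T; the function names and the toy sequences are illustrative, not from the paper.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def dist_to_sequence(pattern, seq):
    """d(A, s_i): distance from the pattern to its closest l-mer in seq."""
    l = len(pattern)
    return min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))

def total_distance(pattern, seqs):
    """d(A, S): sum of d(A, s_i) over all sequences."""
    return sum(dist_to_sequence(pattern, s) for s in seqs)

def neighbors(pattern, alphabet="ACGT"):
    """D_{=1}(A): every pattern obtained by changing exactly one nucleotide."""
    for i, old in enumerate(pattern):
        for new in alphabet:
            if new != old:
                yield pattern[:i] + new + pattern[i + 1:]

def best_neighbor(pattern, seqs):
    """BestNeighbor(A): the neighbor with the lowest total distance."""
    return min(neighbors(pattern), key=lambda b: total_distance(b, seqs))

# Hypothetical toy data: every sequence contains ACGT exactly.
seqs = ["TTACGTAA", "GGACGTCC", "ACGTTTTT"]
print(total_distance("ACGA", seqs), best_neighbor("ACGA", seqs))
```

A pattern of length l has 3l neighbors, so one branching step costs 3l evaluations of the total distance rather than a scan of all 4^l patterns.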

Pattern-Branching Algorithm
- Input: a set of sequences S, the motif length l, and the number of mutations k.
- Output: a motif of length l with k mutations.
- Algorithm: PatternBranching(S, l, k)
    M ← arbitrary motif pattern
    collect the sample strings: the l-mers occurring in the sequences of S
    for each l-mer A_0 in S
        for j ← 0 to k
            if d(A_j, S) < d(M, S)
                M ← A_j
            A_{j+1} ← BestNeighbor(A_j)
    output M
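A compact runnable sketch of the loop above, under a few assumptions: d is the Hamming distance, the "arbitrary motif pattern" is replaced by an empty placeholder with infinite distance, and ties between neighbors are broken by enumeration order. The helper names and the toy sequences (three instances of ACGTACGT with one mutation each) are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_distance(pattern, seqs):
    """d(A, S): sum over sequences of the distance to the closest l-mer."""
    l = len(pattern)
    return sum(
        min(hamming(pattern, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )

def best_neighbor(pattern, seqs, alphabet="ACGT"):
    """Neighbor (one nucleotide changed) with the lowest total distance."""
    candidates = (
        pattern[:i] + c + pattern[i + 1:]
        for i, old in enumerate(pattern)
        for c in alphabet
        if c != old
    )
    return min(candidates, key=lambda b: total_distance(b, seqs))

def pattern_branching(seqs, l, k):
    """Branch k steps from every l-mer in seqs; keep the best pattern seen."""
    motif, motif_dist = None, float("inf")  # stands in for the arbitrary start
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            a = seq[i:i + l]  # sample string A_0
            for j in range(k + 1):
                d = total_distance(a, seqs)
                if d < motif_dist:
                    motif, motif_dist = a, d
                if j < k:  # A_{j+1} is only needed for the next iteration
                    a = best_neighbor(a, seqs)
    return motif, motif_dist

seqs = [
    "TTTTACGAACGTTTTT",  # contains ACGAACGT
    "GGGGACGTACCTGGGG",  # contains ACGTACCT
    "CCCCTCGTACGTCCCC",  # contains TCGTACGT
]
print(pattern_branching(seqs, 8, 1))
```

Because each sample string follows only its single best neighbor at every step, the search explores one path of length k per l-mer instead of the full neighborhood, which is where the speed-up over exhaustive neighbor search comes from.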

Profile-Branching Algorithm
- Similar to Pattern-Branching, with these changes:
  1. Convert each sample string to a profile $X(A_0)$.
  2. Generalize the scoring method to score profiles.
  3. Modify the branching method to apply to profiles.
  4. Use the top-scoring profile found as a seed for the EM algorithm.

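Profile-Branching Algorithm
- Convert a sample string $A_0$ to a profile $X(A_0)$, e.g. for $A_0$ = ATGCCAT:

         A    T    G    C    C    A    T
    A   1/2  1/6  1/6  1/6  1/6  1/2  1/6
    T   1/6  1/2  1/6  1/6  1/6  1/6  1/2
    G   1/6  1/6  1/2  1/6  1/6  1/6  1/6
    C   1/6  1/6  1/6  1/2  1/2  1/6  1/6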
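The table suggests a simple rule: in each column, the nucleotide of the sample string gets probability 1/2 and the other three get 1/6 each. A minimal sketch under that assumption follows; the function name `sample_to_profile` and the dict-of-rows layout are illustrative choices.

```python
from fractions import Fraction

NUCLEOTIDES = "ATGC"  # row order used in the slide's table

def sample_to_profile(sample):
    """Build X(A_0) as {nucleotide: [probability per column]}.
    Assumes 1/2 for the sample's nucleotide and 1/6 for each other one."""
    half, sixth = Fraction(1, 2), Fraction(1, 6)
    return {v: [half if c == v else sixth for c in sample] for v in NUCLEOTIDES}

profile = sample_to_profile("ATGCCAT")
for v in NUCLEOTIDES:
    print(v, " ".join(str(x) for x in profile[v]))
```

Exact fractions are used here only so that the printed rows match the table; floating-point values would work equally well for scoring.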

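Profile-Branching Algorithm
- Use entropy to score profiles: given a profile $X = (x_{vw})$ and a pattern $P = p_1 \ldots p_l$, let $e(X, P)$ be the log probability of sampling P from X, i.e. $e(X, P) = \sum_{w=1}^{l} \log x_{p_w w}$.
- Example with X = X(ATGCCAT) and P = GTGACAT:

         A    T    G    C    C    A    T
    A   1/2  1/6  1/6  1/6  1/6  1/2  1/6
    T   1/6  1/2  1/6  1/6  1/6  1/6  1/2
    G   1/6  1/6  1/2  1/6  1/6  1/6  1/6
    C   1/6  1/6  1/6  1/2  1/2  1/6  1/6

    P    G    T    G    A    C    A    T
        1/6  1/2  1/2  1/6  1/2  1/2  1/2

  so $e(X, P) = 5 \log(1/2) + 2 \log(1/6)$.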

Profile-Branching Algorithm
- For each sequence $s_i$ in the sample $S = \{s_1, \ldots, s_n\}$, let $e(X, s_i) = \max\{e(X, P) \mid P \in s_i\}$.
- Then the entropy score of X is $e(X, S) = \sum_{s_i \in S} e(X, s_i)$.
- Intuitively, $e(X, S)$ describes how well X matches its best occurrence in each sequence of the sample.
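The entropy score can be sketched in a few lines. This assumes natural logarithms (the slides do not state the base) and the 1/2 and 1/6 profile construction shown earlier; all function names and the toy sequences are illustrative.

```python
import math

def sample_to_profile(sample, nucleotides="ATGC"):
    """X(A_0): 1/2 for the sample's nucleotide in each column, 1/6 otherwise."""
    return {v: [0.5 if c == v else 1 / 6 for c in sample] for v in nucleotides}

def log_prob(profile, pattern):
    """e(X, P): log probability of sampling the pattern from the profile."""
    return sum(math.log(profile[c][w]) for w, c in enumerate(pattern))

def best_occurrence_score(profile, seq):
    """e(X, s_i): score of the best-matching l-mer in one sequence."""
    l = len(next(iter(profile.values())))
    return max(log_prob(profile, seq[i:i + l]) for i in range(len(seq) - l + 1))

def entropy_score(profile, seqs):
    """e(X, S): sum of the best-occurrence scores over all sequences."""
    return sum(best_occurrence_score(profile, s) for s in seqs)

X = sample_to_profile("ATGCCAT")
print(log_prob(X, "GTGACAT"))                       # 5*log(1/2) + 2*log(1/6)
print(entropy_score(X, ["TTATGCCATGG", "CCGTGACATAA"]))
```

Scores are always negative, and a higher (less negative) value means the profile explains its best occurrence in each sequence better, which is why the profile-branching loop keeps the profile with the larger e(X, S).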

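Profile-Branching Algorithm
- Branching from the sample string:
  1. Amplify only one column w of the profile (which corresponds to one position in the sample pattern), and amplify a nucleotide v only if $x_{vw} <$ …
  2. Make sure that the relative entropy between the original column $x_{vw}$ and the amplified column $x'_{vw}$ satisfies $\sum_v x_{vw} \log(x'_{vw} / x_{vw}) = \rho$. We use $\rho =$ …
- Example: amplifying T in the first column of X(ATGCCAT).

  Before:
         A    T    G    C    C    A    T
    A   1/2  1/6  1/6  1/6  1/6  1/2  1/6
    T   1/6  1/2  1/6  1/6  1/6  1/6  1/2
    G   1/6  1/6  1/2  1/6  1/6  1/6  1/6
    C   1/6  1/6  1/6  1/2  1/2  1/6  1/6

  After:
         A    T    G    C    C    A    T
    A   0.27 1/6  1/6  1/6  1/6  1/2  1/6
    T   0.55 1/2  1/6  1/6  1/6  1/6  1/2
    G   0.09 1/6  1/2  1/6  1/6  1/6  1/6
    C   0.09 1/6  1/6  1/2  1/2  1/6  1/6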

Profile-Branching Algorithm
- Algorithm: ProfileBranching(S, l, k)
    Motif ← arbitrary motif profile
    for each l-mer A_0 in S
        X_0 ← X(A_0)
        for j ← 0 to k
            if e(X_j, S) > e(Motif, S)
                Motif ← X_j
            X_{j+1} ← BestNeighbor(X_j)
    run the EM algorithm with Motif as seed

Results on Implanted Motifs
- Pattern-Branching algorithm vs. previous pattern-based motif finding algorithms:
  WINNOWER, SP-STAR: unable to find subtle motifs
  PROJECTION, MITRA, MULTIPROFILER

Results on Implanted Motifs
- Profile-Branching algorithm vs. previous profile-based motif finding algorithms.
- Performance coefficient: let K be the set of known positions of the n implanted motifs, and let P be the set of predicted motif positions. The performance coefficient is defined as $|K \cap P| / |K \cup P|$.
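The performance coefficient is a set-overlap ratio and is easy to compute. In this sketch, positions are represented as (sequence index, offset) pairs and an empty union is scored as 0.0; both choices are assumptions made for illustration.

```python
def performance_coefficient(known, predicted):
    """|K ∩ P| / |K ∪ P| for two collections of motif positions."""
    known, predicted = set(known), set(predicted)
    union = known | predicted
    # Assumption: return 0.0 when both sets are empty to avoid dividing by zero.
    return len(known & predicted) / len(union) if union else 0.0

# Hypothetical example: an 8-mer implanted at offsets 10..17 of sequence 0,
# predicted two positions too far to the right (offsets 12..19).
K = {(0, i) for i in range(10, 18)}
P = {(0, i) for i in range(12, 20)}
print(performance_coefficient(K, P))
```

The value is 1 only when the prediction covers exactly the implanted positions, and it penalizes both missed positions and spurious ones.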

Results on Biological Samples
- Pattern-Branching algorithm:
- Profile-Branching algorithm: the pattern returned by Profile-Branching matches the reference motif.

Discussion
- Advantages: much more efficient than previous algorithms, and effective at finding subtle motifs.
- Disadvantages:
  1. Pattern-Branching has difficulty finding motifs with many degenerate positions, whereas Profile-Branching handles them well.
  2. Profile-Branching is effective at finding subtle motifs but is comparatively slow.

Future Work
- Apply the Pattern-Branching and Profile-Branching algorithms to more challenging biological samples:
  1. Larger samples
  2. Corrupted samples
- Extend the algorithms to motif finding over an alphabet that includes not only A, T, G, C but also purine (R), pyrimidine (Y), weak bond (W), and strong bond (S).
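One way such an extension could start is by replacing exact character comparison with set membership. The sketch below uses the standard IUPAC meanings of the four extra symbols (R = A or G, Y = C or T, W = A or T, S = C or G); the function names are hypothetical and this is not part of the algorithms presented in the paper.

```python
# Each pattern symbol maps to the set of nucleotides it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",  # purine
    "Y": "CT",  # pyrimidine
    "W": "AT",  # weak bond
    "S": "CG",  # strong bond
}

def degenerate_distance(pattern, lmer):
    """Number of positions where the nucleotide is not accepted by the symbol."""
    return sum(n not in IUPAC[p] for p, n in zip(pattern, lmer))

def matches(pattern, lmer):
    """True if the l-mer is an exact instance of the degenerate pattern."""
    return len(pattern) == len(lmer) and degenerate_distance(pattern, lmer) == 0

print(matches("ARYT", "AGCT"), matches("ARYT", "ACCT"))
```

With this distance in place of the Hamming distance, a neighbor step could also move a position between a nucleotide and a degenerate symbol, at the cost of a larger neighborhood per step.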