Motif finding in groups of related sequences

Slides:



Advertisements
Similar presentations
Gene Regulation and Microarrays. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......
Advertisements

Periodic clusters. Non periodic clusters That was only the beginning…
PREDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Bioinformatics.
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.
Gene Regulation and Microarrays. Overview A. Gene Expression and Regulation B. Measuring Gene Expression: Microarrays C. Finding Regulatory Motifs.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
Identification of a Novel cis-Regulatory Element Involved in the Heat Shock Response in Caenorhabditis elegans Using Microarray Gene Expression and Computational.
MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays Thomas R. Ioerger 1 Ganesh Rajagopalan 1 Debby Siegele 2 1 Department.
Regulatory motif discovery 6.095/ Computational Biology: Genomes, Networks, Evolution Lecture 10Oct 12, 2005.
Sequence Motifs. Motifs Motifs represent a short common sequence –Regulatory motifs (TF binding sites) –Functional site in proteins (DNA binding motif)
Clustering (Gene Expression Data) 6.095/ Computational Biology: Genomes, Networks, Evolution LectureOctober 4, 2005.
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
Fuzzy K means.
(Regulatory-) Motif Finding. Clustering of Genes Find binding sites responsible for common expression patterns.
Finding Regulatory Motifs in DNA Sequences
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Cis-regultory module 10/24/07. TFs often work synergistically (Harbison 2004)
Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
1 Bio + Informatics AAACTGCTGACCGGTAACTGAGGCCTGCCTGCAATTGCTTAACTTGGC An Overview پرتال پرتال بيوانفورماتيك ايرانيان.
CSE 6406: Bioinformatics Algorithms. Course Outline
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
Biology 10.2 Gene Regulation and Structure Gene Regulation and Structure.
Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Outline More exhaustive search algorithms Today: Motif finding
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple Sequence Alignment. How to score a MSA? Very commonly: Sum of Pairs = SP Compute the pairwise score of all pairs of sequences and sum them. Gap.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
1 Finding Regulatory Motifs. 2 Copyright notice Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by.
Overview of Bioinformatics 1 Module Denis Manley..
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
From Genomes to Genes Rui Alves.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
341- INTRODUCTION TO BIOINFORMATICS Overview of the Course Material 1.
Local Multiple Sequence Alignment Sequence Motifs
CS 6243 Machine Learning Advanced topic: pattern recognition (DNA motif finding)
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Doug Raiford Phage class: introduction to sequence databases.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
Transcription factor binding motifs (part II) 10/22/07.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
CS5263 Bioinformatics Lecture 11 Motif finding. HW2 2(C) Click to find out K and lambda.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
Regulation of Gene Expression
Reverse-engineering transcription control networks timothy s
A Very Basic Gibbs Sampler for Motif Detection
Gibbs sampling.
Motifs BCH364C/394P - Systems Biology / Bioinformatics
Learning Sequence Motif Models Using Expectation Maximization (EM)
De novo Motif Finding using ChIP-Seq
Algorithms for Regulatory Motif Discovery
Recitation 7 2/4/09 PSSMs+Gene finding
1 Department of Engineering, 2 Department of Mathematics,
1 Department of Engineering, 2 Department of Mathematics,
A Zero-Knowledge Based Introduction to Biology
1 Department of Engineering, 2 Department of Mathematics,
CS 5263 & CS 4233 Bioinformatics Motif finding.
(Regulatory-) Motif Finding
Network Inference Chris Holmes Oxford Centre for Gene Function, &,
CS5263 Bioinformatics Lecture 18 Motif finding.
Eukaryotic Gene Regulation
Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016
Presentation transcript:

Motif finding in groups of related sequences 6.096 – Algorithms for Computational Biology Motif finding in groups of related sequences

Challenges in Computational Biology 4 Genome Assembly 5 Regulatory motif discovery 1 Gene Finding DNA 2 Sequence alignment 6 Comparative Genomics TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT 3 Database lookup 7 Evolutionary Theory 8 Gene expression analysis RNA transcript 9 Cluster discovery 10 Gibbs sampling 11 Protein network analysis 12 Regulatory network inference 13 Emerging network properties

Challenges in Computational Biology 5 Regulatory motif discovery DNA Group of co-regulated genes Common subsequence

Overview Introduction Combinatorial solutions Probabilistic solutions Bio review: Where do ambiguities come from? Computational formulation of the problem Combinatorial solutions Exhaustive search Greedy motif clustering Wordlets and motif refinement Probabilistic solutions Expectation maximization Gibbs sampling

Overview Introduction Combinatorial solutions Probabilistic solutions Bio review: Where do ambiguities come from? Computational formulation of the problem Combinatorial solutions Exhaustive search Greedy motif clustering Wordlets and motif refinement Probabilistic solutions Expectation maximization Gibbs sampling

Regulatory motif discovery GAL1 Gal4 Gal4 Mig1 ATGACTAAATCTCATTCAGAAGAAGTGA CGG CCG CGG CCG CCCCW Regulatory motifs Genes are turned on / off in response to changing environments No direct addressing: subroutines (genes) contain sequence tags (motifs) Specialized proteins (transcription factors) recognize these tags What makes motif discovery hard? Motifs are short (6-8 bp), sometimes degenerate Can contain any set of nucleotides (no ATG or other rules) Act at variable distances upstream (or downstream) of target gene DNA structure revealed that inherited information is digital. Like the bits in a digital computer, it is the precise ordering of nucleotides that encodes inherited information. Biology has been able to make only limited use of this fact, experiments on individual genes. Only recently, technological advances have allowed us to apply computational methods to biological data. Genome-wide views, novel approaches, novel statistical models and algorithms. Application of computational methods is really revolutionizing the field of biology.

Sticks and backbones Traditional Fancy Chemical Atomic

Where do ambiguous bases come from ? Protein-DNA interactions Proteins read DNA by “feeling” the chemical properties of the bases Without opening DNA (not by base complementarity) Sequence specificity Topology of 3D contact dictates sequence specificity of binding Some positions are fully constrained; other positions are degenerate “Ambiguous / degenerate” positions are loosely contacted by the transcription factor

Characteristics of Regulatory Motifs Tiny Highly Variable ~Constant Size Because a constant-size transcription factor binds Often repeated Low-complexity-ish

Sequence Logos entropy - n 1: (communication theory) a numerical measure of the uncertainty of an outcome; "the signal contained thousands of bits of information" [information, selective information] 2: (thermodynamics) a thermodynamic quantity representing the amount of energy in a system that is no longer available for doing mechanical work; "entropy increases as matter and energy in the universe degrade to an ultimate state of inert uniformity" [randomness] Entropy at pos’n I, H(i) = – {letter x} freq(x, i) log2 freq(x, i) Height of x at pos’n i, L(x, i) = freq(x, i) (2 – H(i)) Examples: freq(A, i) = 1; H(i) = 0; L(A, i) = 2 A: ½; C: ¼; G: ¼; H(i) = 1.5; L(A, i) = ¼; L(not T, i) = ¼

Problem Definition Given a collection of promoter sequences s1,…, sN of genes with common expression Combinatorial Motif M: m1…mW Some of the mi’s blank Find M that occurs in all si with  k differences Or, Find M with smallest total hamming dist Probabilistic Motif: Mij; 1  i  W 1  j  4 Mij = Prob[ letter j, pos i ] Find best M, and positions p1,…, pN in sequences

Finding Regulatory Motifs . Given a collection of genes bound by a transcription factor, Find the TF-binding motif in common

Essentially a Multiple Local Alignment . Find “best” multiple local alignment Alignment score defined differently in probabilistic/combinatorial cases

Overview Introduction Combinatorial solutions Probabilistic solutions Bio review: Where do ambiguities come from? Computational formulation of the problem Combinatorial solutions Exhaustive search Greedy motif clustering Wordlets and motif refinement Probabilistic solutions Expectation maximization Gibbs sampling

Discrete Formulations Given sequences S = {x1, …, xn} A motif W is a consensus string w1…wK Find motif W* with “best” match to x1, …, xn Definition of “best”: d(W, xi) = min hamming dist. between W and any word in xi d(W, S) = i d(W, xi)

Overview Introduction Combinatorial solutions Probabilistic solutions Bio review: Where do ambiguities come from? Computational formulation of the problem Combinatorial solutions Exhaustive search Greedy motif clustering Wordlets and motif refinement Probabilistic solutions Expectation maximization Gibbs sampling

Exhaustive Searches 1. Pattern-driven algorithm: For W = AA…A to TT…T (4K possibilities) Find d( W, S ) Report W* = argmin( d(W, S) ) Running time: O( K N 4K ) (where N = i |xi|) Advantage: Finds provably “best” motif W Disadvantage: Time

Exhaustive Searches 2. Sample-driven algorithm: For W = any K-long word occurring in some xi Find d( W, S ) Report W* = argmin( d( W, S ) ) or, Report a local improvement of W* Running time: O( K N2 ) Advantage: Time Disadvantage: If the true motif is weak and does not occur in data then a random motif may score better than any instance of true motif

Overview Introduction Combinatorial solutions Probabilistic solutions Bio review: Where do ambiguities come from? Computational formulation of the problem Combinatorial solutions Exhaustive search Greedy motif clustering Wordlets and motif refinement Probabilistic solutions Expectation maximization Gibbs sampling

Greedy motif clustering (CONSENSUS) Algorithm: Cycle 1: For each word W in S (of fixed length!) For each word W’ in S Create alignment (gap free) of W, W’ Keep the C1 best alignments, A1, …, AC1 ACGGTTG , CGAACTT , GGGCTCT … ACGCCTG , AGAACTA , GGGGTGT …

Greedy motif clustering (CONSENSUS) Algorithm: Cycle t: For each word W in S For each alignment Aj from cycle t-1 Create alignment (gap free) of W, Aj Keep the Cl best alignments A1, …, ACt ACGGTTG , CGAACTT , GGGCTCT … ACGCCTG , AGAACTA , GGGGTGT … … … … ACGGCTC , AGATCTT , GGCGTCT …

Greedy motif clustering (CONSENSUS) C1, …, Cn are user-defined heuristic constants N is sum of sequence lengths n is the number of sequences Running time: O(N2) + O(N C1) + O(N C2) + … + O(N Cn) = O( N2 + NCtotal) Where Ctotal = i Ci, typically O(nC), where C is a big constant

Overview Introduction Combinatorial solutions Probabilistic solutions Bio review: Where do ambiguities come from? Computational formulation of the problem Combinatorial solutions Exhaustive search Greedy motif clustering Wordlets and motif refinement Probabilistic solutions Expectation maximization Gibbs sampling

Motif Refinement and wordlets (MULTIPROFILER) Extended sample-driven approach Given a K-long word W, define: Nα(W) = words W’ in S s.t. d(W,W’)  α Idea: Assume W is occurrence of true motif W* Will use Nα(W) to correct “errors” in W

Motif Refinement and wordlets (MULTIPROFILER) Assume W differs from true motif W* in at most L positions Define: A wordlet G of W is a L-long pattern with blanks, differing from W L is smaller than the word length K Example: K = 7; L = 3 W = ACGTTGA G = --A--CG

Motif Refinement and wordlets (MULTIPROFILER) Algorithm: For each W in S: For L = 1 to Lmax Find the α-neighbors of W in S  Nα(W) Find all “strong” L-long wordlets G in Na(W) For each wordlet G, Modify W by the wordlet G  W’ Compute d(W’, S) Report W* = argmin d(W’, S) Step 1 above: Smaller motif-finding problem; Use exhaustive search