Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.

Slides:



Advertisements
Similar presentations
Gene Regulation and Microarrays. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Greedy Algorithms CS 466 Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix of.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Parsimony based phylogenetic trees Sushmita Roy BMI/CS 576 Sep 30 th, 2014.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl.
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
1 Lecture 8: Genetic Algorithms Contents : Miming nature The steps of the algorithm –Coosing parents –Reproduction –Mutation Deeper in GA –Stochastic Universal.
Challenges for computer science as a part of Systems Biology Benno Schwikowski Institute for Systems Biology Seattle, WA.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
CISC667, F05, Lec14, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (I) Maximum Parsimony.
Gibbs Sampling in Motif Finding. Gibbs Sampling Given:  x 1, …, x N,  motif length K,  background B, Find:  Model M  Locations a 1,…, a N in x 1,
1 Regulatory Motif Finding Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002) Statistical.
Comparative Motif Finding
Transcription factor binding motifs (part I) 10/17/07.
(Regulatory-) Motif Finding
Branch lengths Branch lengths (3 characters): A C A A C C A A C A C C Sum of branch lengths = total number of changes.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Multiple sequence alignments and motif discovery Tutorial 5.
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
(Regulatory-) Motif Finding. Clustering of Genes Find binding sites responsible for common expression patterns.
Sequence Databases As DNA and protein sequences accumulate, they are deposited in public databases. One of the most popular of these is GenBank, which.
Exploring Protein Sequences Tutorial 5. Exploring Protein Sequences Multiple alignment –ClustalW Motif discovery –MEME –Jaspar.
Building Phylogenies Parsimony 2.
Motif Refinement using Hybrid Expectation Maximization Algorithm Chandan Reddy Yao-Chung Weng Hsiao-Dong Chiang School of Electrical and Computer Engr.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
REGULATORY GENOMICS Saurabh Sinha, Dept. of Computer Science & Institute of Genomic Biology, University of Illinois.
Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois,
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Gene expression & Clustering (Chapter 10)
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
Parsimony and searching tree-space Phylogenetics Workhop, August 2006 Barbara Holland.
1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm Mathieu Blanchette Martin Tompa Computer Science & Engineering University of.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Outline More exhaustive search algorithms Today: Motif finding
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Greedy Algorithms CS 498 SS Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Thursday, May 9 Heuristic Search: methods for solving difficult optimization problems Handouts: Lecture Notes See the introduction to the paper.
1 Prediction of Regulatory Elements Controlling Gene Expression Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle,
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string.
1 Motifs for Unknown Sites Vasileios Hatzivassiloglou University of Texas at Dallas.
Flat clustering approaches
Local Multiple Sequence Alignment Sequence Motifs
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Parsimony and searching tree-space. The basic idea To infer trees we want to find clades (groups) that are supported by synapomorpies (shared derived.
HW4: sites that look like transcription start sites Nucleotide histogram Background frequency Count matrix for translation start sites (-10 to 10) Frequency.
Spring 2008The Greedy Method1. Spring 2008The Greedy Method2 Outline and Reading The Greedy Method Technique (§5.1) Fractional Knapsack Problem (§5.1.1)
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
REGULATORY GENOMICS Saurabh Sinha, Dept. of Computer Science & Institute of Genomic Biology, University of Illinois.
Gibbs sampling.
Learning Sequence Motif Models Using Expectation Maximization (EM)
BNFO 602 Phylogenetics Usman Roshan.
CS 581 Tandy Warnow.
Self-organizing map numeric vectors and sequence motifs
Phylogeny.
Dynamic Programming II DP over Intervals
Presentation transcript:

Motif finding : Lecture 2 CS 498 CXZ

Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented motifs –Gibbs sampling

Ab initio motif finding: Gibbs sampling Popular algorithm for motif discovery Motif model: Position Weight Matrix Local search algorithm

Gibbs sampling: basic idea Current motif = PWM formed by circled substrings

Gibbs sampling: basic idea Delete one substring

Gibbs sampling: basic idea Try a replacement: Compute its score, Accept the replacement depending on the score.

Gibbs sampling: basic idea New motif

Ab initio motif finding: Expectation Maximization Popular algorithm for motif discovery Motif model: Position Weight Matrix Local search algorithm –Move from current choice of motif to a new similar motif, so as to improve the score –Keep doing this until no more improvement is obtained : Convergence to local optima

How is a motif evaluated ? Let W be a PWM. Let S be the input sequence. Imagine a process that randomly picks different strings matching W, and threads them together, interspersed with random sequence Let Pr(S|W) be the probability of such a process ending up generating S.

How is a motif evaluated ? Find W so as to maximize Pr(S|W) Difficult optimization Special technique called “Expectation- Maximization” or E-M. Iteratively finds a new motif W that improves Pr(S|W)

Basic idea of iteration PWM Current motif 1. Scan sequence for good matches to the current motif Build a new PWM out of these matches, and make it the new motif

Guarantees The basic idea can be formalized in the language of probability Provides a formula for updating W, that guarantees an improvement in Pr(S|W)

MEME Popular motif finding program that uses Expectation-Maximization Web site

Ab initio motif finding: CONSENSUS Popular algorithm for motif discovery, that uses a greedy approach Motif model: Position Weight Matrix Motif score: Information Content

Information Content PWM W: W  k = frequency of base  at position k q  = frequency of base  by chance Information content of W:

Information Content If W  k is always equal to q , i.e., if W is similar to random sequence, information content of W is 0. If W is different from q, information content is high.

CONSENSUS: basic idea Final goal: Find a set of substrings, one in each input sequence Set of substrings define a PWM. Goal: This PWM should have high information content. High information content means that the motif “stands out”.

CONSENSUS: basic idea Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time Until the entire set is built

CONSENSUS: the greedy heuristic Suppose we have built a partial set of substrings {s 1,s 2,…,s i } so far. Have to choose a substring s i+1 from the input sequence S i+1 Consider each substring s of S i+1 Compute the score (information content) of the PWM made from {s 1,s 2,…,s i,s} Choose the s that gives the PWM with highest score, and assign s i+1  s

Phylogenetic footprinting So far, the input sequences were the “promoter” regions of genes believed to be “co-regulated” A special case: the input sequences are promoter regions of the same gene, but from multiple species. –Such sequences are said to be “orthologous” to each other.

Phylogenetic footprinting Input sequences Related by an evolutionary tree Find motif

Phylogenetic footprinting: formally speaking Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif threshold d Problem: Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

Small Example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat) Size of motif sought: k = 4

Solution Parsimony score: 1 mutation AGTCGTACGTGAC... AGTAGACGTGCCG... ACGTGAGATACGT... GAACGGAGTACGT... TCGTGACGGTGAT... ACG G ACGT

An Exact Algorithm (Blanchette’s algorithm) W u [s] =best parsimony score for subtree rooted at node u, if u is labeled with string s. AGTCGTACGTG ACGGGACGTGC ACGTGAGATAC GAACGGAGTAC TCGTGACGGTG … ACGG: 2 ACGT: 1... … ACGG : 0 ACGT : 2... … ACGG : 1 ACGT : 1... … ACGG: +  ACGT: 0... … ACGG: 1 ACGT: k entries … ACGG: 0 ACGT: + ... … ACGG:  ACGT :0...

W u [s] =  min ( W v [t] + d(s, t) ) v : child t of u Recurrence

O(k  4 2k ) time per node W u [s] =  min ( W v [t] + d(s, t) ) v : child t of u Running Time

What after motif finding ? Experiments to confirm results DNaseI footprinting & gel-shift assays Tells us which substrings are the binding sites

Before motif finding How do we obtain a set of sequences on which to run motif finding ? In other words, how do we get genes that we believe are regulated by the same transcription factor ? Two high-throughput experimental methods: ChIP-chip and microarray.

Before motif finding: ChIP-chip Take a particular transcription factor TF Take hundreds or thousands of promoter sequences Measure how strongly TF binds to each of the promoter sequences Collect the set to which TF binds strongly, do motif finding on these

Before motif finding: Microarray Take a large number of genes (mRNA) in the form of a soup Make a “chip” with thousands of slots, one for each gene Pour the gene (mRNA) soup on the chip Slots for genes that are highly expressed in the soup light up !

Before motif finding: Microarray Hence measure activity of thousands of genes in parallel Collect set of genes with similar expression (activity) profiles and do motif finding on these.