Regulatory Motif Finding

Presentation transcript:

Regulatory Motif Finding
Lecture 12 – Nov 7, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022
Many of the techniques we will be discussing in class are based on probabilistic models. Examples include Bayesian networks, hidden Markov models, and Markov random fields. So today, before we start discussing computational methods in active areas of research, we'd like to spend a couple of lectures on some basics of probabilistic models.

Outline
Biology background
Computational problem
  Input data
  Motif representation
Common methods
  Enumeration
  Expectation-Maximization (EM) algorithm
  Gibbs sampling methods

Cell = Factory, Proteins = Machines Biovisions, Harvard

DNA
Instructions for making the machines: “coding” regions
Instructions for when and where to make them: “regulatory” regions (regulons)

Transcriptional Regulation
Regulatory regions are composed of “binding sites”.
Binding sites attract a special class of proteins known as “transcription factors” (TFs).
A TF binding site (TFBS) can be located anywhere within the regulatory region (promoter region).
Bound transcription factors can also inhibit DNA transcription.
More realistic picture?

DNA Regulation Source: Richardson, University College London

Gene Regulation
Transcriptional regulation is one of many regulatory mechanisms in the cell. It is the focus of today’s lecture.
Source: Mallery, University of Miami

Transcriptional Regulation of Genes
What turns genes on (producing a protein) and off?
When is a gene turned on or off?
Where (in which cells) is a gene turned on?
How many copies of the gene product are produced?
The human genome has about 3 billion nucleotides. Different cell types differ dramatically in both structure and function. Cell differentiation is often irreversible.
Q: Under what conditions is each gene product made, and once made, what does it do?

Structural Basis of Interaction

Structural Basis of Interaction
Key feature: transcription factors are not 100% specific when binding DNA, because non-essential bases could mutate.
Not one sequence, but a family of sequences with varying affinities.
[Figure: six binding-site variants with values 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]

What is a motif?
A subsequence (substring) that occurs in multiple sequences and has biological importance.
Motifs can be totally constant or have variable elements.
DNA motifs (regulatory elements)
  Binding sites for proteins
  Short sequences (5-25 bp)
  Up to 1000 bp (or farther) from the gene
  Inexactly repeating patterns

Motif Finding
Basic objective:
  Find regions in the genome that transcription factors bind to
Motivations:
  Understanding which TFs regulate which genes
  A major part of gene regulation
Many classes of algorithms, which differ in:
  Types of input data
  Motif representation

Input Data Single sequence AGCATCAGCAGCACATCATCAGCATACGACTCAGCATAGCCATGGGCTACAGCAGATCGATCGAACAGCACGGCAGTAGTCGGGATGCGGATCCAGCAGGGAGGGAGCGCGACGCTCTATAGAGGAGGACTTAGCAGAGCGATCGACGATTACGCAGCAGTACGCAGCAAAAAAAAAACGACGTACGTACAGCACTGACATCGGACATCTGATCTGTAGCTAGCTACTACACTCATGACTCAGTCAGTACCGATCAGCACAGCAGCTACATGCATGCATGCAGTCACGTAGAG… Based on over-representation of short sequences

Random Sample atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Where is the Implanted Motif? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Implanting Motif AAAAAAAAGGGGGGG with Four Mutations – (15,4)-motif
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Where is the Motif??? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Why Is Finding a (15,4)-Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
Two of the implanted instances, aligned:
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
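The alignment of the two instances can be checked with a few lines of Python. This is a small illustrative sketch, not part of the original slides: the helper name `hamming` is hypothetical, and the consensus is taken to be the 15-mer AAAAAAAAGGGGGGG that appears in the implanted sequences. Each instance is only 4 mutations away from the consensus, yet the two instances are further from each other than either is from the consensus, which is why pairwise comparison of instances gives a weak signal.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length sequences differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


consensus = "AAAAAAAAGGGGGGG"   # implanted 15-mer (8 A's followed by 7 G's)
inst1 = "AgAAgAAAGGttGGG"       # first instance from the slide
inst2 = "cAAtAAAAcGGcGGG"       # second instance from the slide

d1 = hamming(consensus, inst1)   # mutations in instance 1
d2 = hamming(consensus, inst2)   # mutations in instance 2
d12 = hamming(inst1, inst2)      # disagreement between the two instances
print(d1, d2, d12)
```

In general, two instances with d mutations each can disagree at up to 2d positions, so for a (15,4)-motif two true instances may share as few as 7 of 15 bases.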

Challenge Problem
Find a motif in a sample of 20 “random” sequences (e.g., 600 nt long), each containing an implanted pattern of length 15 with 4 mismatches, i.e., a (15,4)-motif.
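A test instance of the challenge problem can be generated as follows. This is a minimal sketch under stated assumptions: the function name `implant_motif`, the uniform background model, and the fixed random seed are all illustrative choices, not something the slides specify.

```python
import random


def implant_motif(motif, n_seqs=20, seq_len=600, n_mut=4, seed=0):
    """Return (sequences, start positions) for an implanted (len(motif), n_mut)-motif.

    Background bases are lowercase and uniform over a/c/g/t (an assumption);
    each sequence receives one copy of the motif with exactly n_mut substitutions.
    """
    rng = random.Random(seed)
    k = len(motif)
    seqs, positions = [], []
    for _ in range(n_seqs):
        background = [rng.choice("acgt") for _ in range(seq_len)]
        instance = list(motif.upper())
        # Mutate n_mut distinct positions, always to a different base.
        for i in rng.sample(range(k), n_mut):
            instance[i] = rng.choice([b for b in "ACGT" if b != instance[i]]).lower()
        start = rng.randrange(seq_len - k + 1)
        background[start:start + k] = instance
        seqs.append("".join(background))
        positions.append(start)
    return seqs, positions


MOTIF = "AAAAAAAAGGGGGGG"
seqs, positions = implant_motif(MOTIF)
print(seqs[0][positions[0]:positions[0] + len(MOTIF)])
```

Keeping the true start positions makes it possible to score any motif finder against the known answer.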

Input Data
Single sequence
  …AGCATCAGCAGCACATCATCAGCATACGACTCAGCATAGCCATGGGCTACAGCAGATCGATCGAACAGCACG…
Sequence + other data
  Gene expression data
  ChIP-chip
  Others…

Identifying Motifs
Genes are turned on or off by regulatory proteins (TFs).
TFs bind to upstream regulatory regions of genes to either attract or block RNA polymerase.
So, multiple genes that are regulated by the same TF will have the same motif (possibly with small variations) in their regulatory regions.
How do we identify the genes that are regulated by the same TF?

Sequence + Gene Expression Data
Say that a microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed. How can one gene have such drastic effects?
Say that 5 different genes are co-expressed across many experiments in a gene expression data set. These genes are likely to share the same binding sites.

daf-19 Binding Sites in C. elegans
Motifs and transcriptional start sites
Gene      Binding site
che-2     GTTGTCATGGTGAC
daf-19    GTTTCCATGGAAAC
osm-1     GCTACCATGGCAAC
osm-6     GTTACCATAGTAAC
F02D8.3   GTTTCCATGGTAAC
[Figure: site positions on an axis from -150 to -1]
Source: Peter Swoboda

Input Data
Single sequence
  …AGCATCAGCAGCACATCATCAGCATACGACTCAGCATAGCCATGGGCTACAGCAGATCGATCGAACAGCACG…
Sequence + other data
  Gene expression data
  ChIP-chip
  Others…
Evolutionarily related set of sequences (species)

Motif Finding
Basic objective:
  Find regions in the genome that bind transcription factors
Motivations:
  Understanding which TFs regulate which genes
  A major part of gene regulation
Many classes of algorithms, which differ in:
  Types of input data
  Motif representation

Structural Basis of Interaction
Key feature: transcription factors are not 100% specific when binding DNA.
Not one sequence, but a family of sequences with varying affinities.
[Figure: six binding-site variants with values 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]

Motif Representation
The structural discussion immediately raises difficulties.
Least expressive: a single sequence
Most expressive: a 4^k-dimensional probability distribution that independently assigns a probability to each of the 4^k possible k-mers*
*A k-mer is a specific k-tuple of nucleotides or amino acids that can be used to identify certain regions within DNA or protein.

Motif Representation
Standard solution: Position-Specific Scoring Matrix (PSSM)
Assuming independence of positions, assign a probability distribution to each position.
This representation is too simple.

Position Weight Matrix (PWM)
Assign a probability to each base (A, G, C, T) at each position
Position  1     2     3     4     5     6     7     8
G         0.1   0.01  0     0.1   0.25  1     0.1   0     …
A         0.2   0.99  0     0     0.25  0     0.9   1     …
T         0.6   0.7   0     0.5   0.25  0     0     0     …
C         0.3   0.2   1     0.3   0.25  0     0     0     …
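As an illustration of how such a matrix is obtained and used, the sketch below builds a PWM from the five daf-19 binding sites listed earlier in this lecture and scores a candidate k-mer against it. The function names, the pseudocount of 1, and the uniform background of 0.25 per base are assumptions made for this example only.

```python
import math

# The five daf-19 binding sites shown earlier in the lecture.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]


def build_pwm(sites, pseudocount=1.0):
    """Column-wise base frequencies; every column sums to 1."""
    k = len(sites[0])
    pwm = []
    for j in range(k):
        counts = {b: pseudocount for b in "ACGT"}   # pseudocount avoids zeros
        for s in sites:
            counts[s[j]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in "ACGT"})
    return pwm


def log_odds(kmer, pwm, background=0.25):
    """Log2 ratio of P(kmer | PWM) to P(kmer | uniform background)."""
    return sum(math.log2(pwm[j][b] / background) for j, b in enumerate(kmer))


pwm = build_pwm(SITES)
site_score = log_odds("GTTTCCATGGTAAC", pwm)   # one of the known sites
polya_score = log_odds("A" * 14, pwm)          # an unrelated 14-mer
print(round(site_score, 2), round(polya_score, 2))
```

A positive score means the k-mer is better explained by the motif model than by the background; scanning a promoter amounts to computing this score at every offset.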

Oversimplicity of PSSMs
Assumes independence between positions
~25% of TRANSFAC motifs have been shown to violate this assumption
Two examples: ADR1 and YAP6

Oversimplicity of PSSMs
Assumes independence between positions
Generates potentially unseen motifs
  ATGGGGAT
  CATGGGAT
We need to model dependency between positions. Revisit later.
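The "unseen motifs" point can be made concrete. The sketch below assumes that the two 8-mers on the slide are the only observed sites (an assumption about how the example is meant to be read) and enumerates every sequence that a position-independent model built from them would generate with nonzero probability.

```python
from itertools import product

sites = ["ATGGGGAT", "CATGGGAT"]   # the two observed sites from the slide

# Under independence, each column contributes its own set of observed bases.
columns = [sorted(set(col)) for col in zip(*sites)]

# Every combination of per-column bases gets nonzero probability.
generated = {"".join(p) for p in product(*columns)}
unseen = sorted(generated - set(sites))

print(len(generated), "sequences generated;", len(unseen), "never observed")
for s in unseen:
    print(s)
```

The first three columns each contain two different bases, so the model spreads its probability evenly over 2 x 2 x 2 = 8 sequences, six of which were never seen. A model with dependencies between positions could keep the probability on the two observed sites.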


Finding Regulatory Motifs
Given a collection of genes that are likely to be regulated by the same TFs, find the TF-binding motifs they have in common.

Identifying Motifs: Complications
We do not know the motif sequence.
We do not know where it is located relative to the gene's start.
Motifs can differ slightly from one gene to another.
How do we discern it from “random” motifs?

Common Methods
Problem statement: Given a set of n promoters of n co-regulated genes, find a motif common to the promoters. Both the PWM (defined earlier) and the motif sequences are unknown.
Enumeration (simplest method)
  Look at the frequency of all k-mers*
EM algorithm (MEME)
  Iteratively home in on the most likely motif model
Gibbs sampling methods (AlignACE, BioProspector)
*A k-mer is a specific k-tuple of nucleotides or amino acids that can be used to identify certain regions within DNA or protein.

Generating k-mers Example: atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacg
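A minimal sketch of the enumeration step, assuming a sliding window of width k over the example sequence above. The function name `count_kmers` and the choice k = 3 are illustrative only.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (a window slid one base at a time)."""
    seq = seq.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacg"
counts = count_kmers(seq, 3)

# The most frequent k-mers are the candidates for over-represented motifs.
for kmer, n in counts.most_common(5):
    print(kmer, n)
```

In practice the observed counts would be compared with the counts expected under a background model, since some k-mers are frequent simply because of base composition.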

Motif Finding Using EM Algorithm
MEME (Multiple EM for Motif Elicitation)
  Bailey and Elkan (1995), Bailey et al. (2006)
  http://meme.sdsc.edu/meme/intro.html
Expectation-Maximization
  In each iteration, it learns the PWM model and identifies examples of the matrix (sites in the input sequences)
  Identify binding locations for all PWMs
  Optimize recognition preferences

Motif Finding Using EM Algorithm
MEME works by iteratively refining PWMs and identifying sites for each PWM.
1. Estimate the motif model (PWM)
   Start with a k-mer seed (random or specified)
   Build a PWM by incorporating some of the background frequencies
2. Identify examples of the model
   For every k-mer in the input sequences, compute its probability given the PWM model
3. Re-estimate the motif model
   Calculate a new PWM, based on the weighted frequencies of all k-mers in the input sequences
4. Iteratively refine the PWMs and identify sites until convergence
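The four steps can be sketched in a few dozen lines. This is a simplified illustration, not MEME itself: it assumes exactly one motif occurrence per sequence, a uniform background, and a fixed seed k-mer. The function names, the seed weight of 0.5, and the pseudocount are hypothetical choices. The four input sequences are S1 to S4 from the MEME example in this lecture.

```python
import math

BASES = "ACGT"


def pwm_from_seed(seed, weight=0.5):
    """Step 1: mix the seed k-mer with a uniform background."""
    return [{b: weight * (b == c) + (1 - weight) * 0.25 for b in BASES} for c in seed]


def kmer_prob(kmer, pwm):
    """Probability of a k-mer under the position-independent PWM."""
    return math.prod(pwm[j][b] for j, b in enumerate(kmer))


def em_motif(seqs, seed, n_iter=50, pseudocount=0.1, tol=1e-6):
    k = len(seed)
    pwm = pwm_from_seed(seed)
    for _ in range(n_iter):
        counts = [{b: pseudocount for b in BASES} for _ in range(k)]
        for s in seqs:
            kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
            # Step 2 (E-step): with a uniform background, the posterior over
            # start positions is proportional to P(k-mer | PWM).
            w = [kmer_prob(km, pwm) for km in kmers]
            z = sum(w)
            # Step 3 (M-step): accumulate weighted base counts per column.
            for km, wi in zip(kmers, w):
                for j, b in enumerate(km):
                    counts[j][b] += wi / z
        new_pwm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
        delta = max(abs(new_pwm[j][b] - pwm[j][b]) for j in range(k) for b in BASES)
        pwm = new_pwm
        if delta < tol:   # Step 4: stop when the matrix no longer changes
            break
    return pwm


seqs = [
    "GGCTATTGCAGATGACGAGATGAGGCCCAGACC",
    "GGATGACAATTATATAAAGGACGATAAGAGATGAC",
    "CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC",
    "GATGACGGAGTATTAAAGACTCGATGAGTTATACGA",
]
pwm = em_motif(seqs, seed="GATGAC")
consensus = "".join(max(col, key=col.get) for col in pwm)
print(consensus)
```

EM only finds a local optimum, so the result depends on the seed; running from several seeds and keeping the highest-likelihood model is the usual remedy.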

Example: MEME
Find a 6-mer motif in 4 sequences
S1: GGCTATTGCAGATGACGAGATGAGGCCCAGACC
S2: GGATGACAATTATATAAAGGACGATAAGAGATGAC
S3: CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
S4: GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
1. MEME uses an initial EM heuristic to estimate the best starting-point PWM:
Position  1     2     3     4     5     6
G         0.26  0.24  0.18  0.26  0.25  0.26
A         0.24  0.26  0.28  0.24  0.25  0.22
T         0.25  0.23  0.30  0.25  0.25  0.25
C         0.25  0.27  0.24  0.25  0.25  0.27

2. MEME scores the match of all 6-mers to the current matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC
GGATGACAATTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
Here, just consider the underlined 6-mers, although in reality all 6-mers are scored. The height of the bases above corresponds to how much that 6-mer counts in calculating the new matrix.
3. Re-estimate the PWM based on the weighted contribution of all 6-mers.
Position  1     2     3     4     5     6
G         0.29  0.24  0.17  0.27  0.24  0.30
A         0.22  0.26  0.27  0.22  0.28  0.18
T         0.24  0.23  0.33  0.23  0.24  0.28
C         0.24  0.27  0.23  0.28  0.24  0.24

4. MEME scores the match of all 6-mers to the current matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC
GGATGACAATTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
The height of the bases above corresponds to how much that 6-mer counts in calculating the new matrix.
5. Re-estimate the PWM based on the weighted contribution of all 6-mers.
Position  1     2     3     4     5     6
G         0.40  0.20  0.15  0.42  0.24  0.30
A         0.30  0.30  0.20  0.24  0.46  0.18
T         0.15  0.30  0.45  0.16  0.15  0.28
C         0.15  0.20  0.20  0.16  0.15  0.24

6. MEME scores the match of all 6-mers to the current matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC
GGATGACTTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
Iterations continue until convergence, i.e., until the numbers do not change much between iterations.
Final motif
Position  1     2     3     4     5     6
G         0.85  0.05  0.10  0.80  0.20  0.35
A         0.05  0.60  0.10  0.05  0.60  0.10
T         0.05  0.30  0.70  0.05  0.20  0.10
C         0.05  0.05  0.10  0.10  0.10  0.35

Gibbs Sampling
References
  AlignACE by Hughes et al. 2000, http://atlas.med.harvard.edu/download/index.html
  BioProspector by Liu et al. 2001, http://motif.stanford.edu/distributions/r
Procedure
1. Start by randomly choosing sites and creating an initial PWM
2. Sample other sites
   Remove some set of matrix examples (sites)
   Randomly choose other sites and calculate their probability P given the matrix
   If a new site scores highly against the matrix, keep it
3. Iterate until convergence

Gibbs Sampling: Basic Idea
Current motif = PWM formed by circled substrings
Slides generously and unknowingly provided by S. Sinha, Urbana-Champaign CS Dept.

Gibbs Sampling: Basic Idea
Delete one substring

Gibbs Sampling: Basic Idea
Try a replacement: compute its score, and accept the replacement depending on the score.
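The delete-and-replace loop can be sketched as below. This is a generic site-sampler variant written for illustration, not the AlignACE or BioProspector implementation: it assumes one site per sequence, a uniform background, and it draws the replacement position with probability proportional to its score rather than applying a separate accept/reject test. All function and parameter names are hypothetical.

```python
import random

BASES = "ACGT"


def profile(sites, pseudocount=1.0):
    """PWM (column-wise base frequencies) from the currently chosen sites."""
    cols = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[j]] += 1
        total = sum(counts.values())
        cols.append({b: counts[b] / total for b in BASES})
    return cols


def gibbs_motif(seqs, k, n_iter=2000, seed=0):
    rng = random.Random(seed)
    # Start from randomly chosen sites, one per sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(n_iter):
        i = rng.randrange(len(seqs))                    # delete one substring
        others = [s[p:p + k]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
        pwm = profile(others)                           # motif from the rest
        s = seqs[i]
        weights = []
        for p in range(len(s) - k + 1):                 # score every replacement
            w = 1.0
            for j, b in enumerate(s[p:p + k]):
                w *= pwm[j][b] / 0.25                   # odds vs. uniform background
            weights.append(w)
        # Sample the new site in proportion to its score.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = [
    "GGCTATTGCAGATGACGAGATGAGGCCCAGACC",
    "GGATGACAATTATATAAAGGACGATAAGAGATGAC",
    "CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC",
    "GATGACGGAGTATTAAAGACTCGATGAGTTATACGA",
]
starts = gibbs_motif(seqs, k=6)
for s, p in zip(seqs, starts):
    print(p, s[p:p + 6])
```

Because replacements are sampled rather than chosen greedily, the sampler can escape poor starting configurations that would trap a deterministic search; several restarts with different seeds are still advisable.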
