CSCE555 Bioinformatics Lecture 10 Motif Discovery Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:

Presentation transcript:

CSCE555 Bioinformatics Lecture 10 Motif Discovery Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: University of South Carolina Department of Computer Science and Engineering HAPPY CHINESE NEW YEAR

Outline (10/15/2015): Introduction to DNA Motif; Motif Representations (Recap); Motif database search; Algorithms for motif discovery

What is a DNA motif? Motif → a recurring pattern: a short conserved sequence pattern associated with distinct functions of a protein or DNA. DNA motifs as transcription factor binding sites.

Transcription: binding sites (DNA) and factors (proteins) Colored lines are binding sites: DNA sequence patterns. Blobs are factors (proteins) that recognize binding sites.

Example: Transcription Factor Binding Sites. ERE = estrogen response element. [Figure labels: ERE, Estrogen Receptor, transcription start, DNA]
Gene          ERE sequence
Efp           … a g g g t c a t g g t g a c c c t …
TERT          … t t g g t c a g g c t g a t c t c …
Oxytocin      … g c g g t g a c c t t g a c c c c …
Lactoferrin   … c a g g t c a a g g c g a t c t t …
Angiotensin   … t a g g g c a t c g t g a c c c g …
VEGF          … a t a a t c a g a c t g a c t g g …

Why are sequence patterns useful? (Revisited) In the context of transcriptional regulation, sequence patterns can be used to help answer several questions. What transcription factors are involved in regulating my gene? Does my gene contain a DNA binding domain? What novel transcription factor binding sites does my set of co-regulated genes contain?

How do we represent sequence patterns? The three most common pattern representation languages: regular expressions (e.g., leucine zipper); profiles (PWMs, PSSMs, etc.); hidden Markov models (HMMs).

1) Regular expressions define sets of sequences that they match. Sp1 binds to DNA via 3 zinc-finger binding domains: C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H. These particular domains recognize Sp1 binding sites: GRGGCRGGW. [Figure: transcription factor Sp1 binding to DNA]
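The two patterns on this slide can be tried directly as regular expressions. The sketch below is illustrative only: it assumes the standard IUPAC meanings R = A or G and W = A or T, translates the PROSITE-style zinc-finger pattern by hand, and uses made-up test sequences (`example` and `peptide` are hypothetical, not real Sp1 data).

```python
import re

# IUPAC nucleotide codes needed for the Sp1 consensus: R = A or G, W = A or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "W": "[AT]"}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular-expression string."""
    return "".join(IUPAC[ch] for ch in consensus)

def find_sites(consensus, dna):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets overlapping sites be reported.
    lookahead = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(dna.upper())]

# Hand translation of C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H:
# X(n) becomes .{n} and X(m,n) becomes .{m,n}.
ZINC_FINGER = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

# Hypothetical DNA sequence containing one site that fits GRGGCRGGW.
example = "ttacGGGGCGGGAtcc"
hits = find_sites("GRGGCRGGW", example)

# Hypothetical peptide built to satisfy the zinc-finger pattern.
peptide = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
zinc_match = ZINC_FINGER.search(peptide)
```

A regular expression either matches or it does not, which is why the later slides move on to profiles and HMMs when graded scores are needed.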

2) Profiles are built from multiple alignments of instances of a pattern. Example: nuclear hormone receptor transcription factor binding site profile derived from experimentally determined sites. [Figure: counts of the number of times each letter is observed at each position in the pattern.] Observed counts can be converted to frequencies by dividing by the number of observed instances, so profiles are probabilistic models of sequence patterns.
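As a minimal sketch of the counts-to-frequencies step, the code below builds a profile from the six ERE sequences listed on the earlier binding-site slide (spaces removed). The function names are illustrative, and treating those six sequences as an already aligned set is an assumption made for the example.

```python
from collections import Counter

ALPHABET = "acgt"

# The six ERE-containing sequences from the binding-site slide, spaces removed.
SITES = [
    "agggtcatggtgaccct",  # Efp
    "ttggtcaggctgatctc",  # TERT
    "gcggtgaccttgacccc",  # Oxytocin
    "caggtcaaggcgatctt",  # Lactoferrin
    "tagggcatcgtgacccg",  # Angiotensin
    "ataatcagactgactgg",  # VEGF
]

def count_matrix(sites):
    """One dict per alignment column: how often each letter occurs there."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(s[k] for s in sites) for k in range(width)]
    return [{b: col[b] for b in ALPHABET} for col in columns]

def frequency_matrix(counts):
    """Divide each count by the number of observed instances."""
    n = sum(counts[0].values())
    return [{b: col[b] / n for b in ALPHABET} for col in counts]

counts = count_matrix(SITES)
freqs = frequency_matrix(counts)
```

Columns where all six sequences agree get frequency 1.0 for one letter and 0.0 for the others, which is why practical tools usually add pseudocounts before taking logarithms.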

3) Making a Markov Model
Aligned sequences:
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
Regular expression: [AT][CG][AC][ACGT-](3)A[TG][GC], ~3600 possible valid sequences

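Making a Markov Model of Motif
Emission probabilities of the match states, in order: (A:0.8, T:0.2), (C:0.8, G:0.2), (A:0.8, C:0.2), (A:1.0), (T:0.8, G:0.2), (C:0.8, G:0.2). Insert state between the third and fourth match states: A:0.2, C:0.4, G:0.2, T:0.2.
P(ACAC--ATC) = $0.8 \times 1.0 \times 0.8 \times 1.0 \times 0.8 \times 1.0 \times 0.6 \times 0.4 \times 0.6 \times 1.0 \times 1.0 \times 0.8 \times 1.0 \times 0.8 \approx 0.047$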
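The probability of a sequence under the model is the product of the emission and transition probabilities along its path. A short sketch that multiplies the factors exactly as they are listed on the slide (the variable names are illustrative):

```python
import math

# Factors for P(ACAC--ATC), copied in the order given on the slide:
# emission and transition probabilities along the path through the model.
FACTORS = [0.8, 1.0, 0.8, 1.0, 0.8, 1.0, 0.6, 0.4, 0.6, 1.0, 1.0, 0.8, 1.0, 0.8]

# Direct product of all factors.
probability = math.prod(FACTORS)

# Summing logarithms gives the same result and avoids numerical underflow
# when the path is long.
log_probability = sum(math.log(f) for f in FACTORS)
```

The product evaluates to about 0.047. For longer sequences the log form is the one normally used, because multiplying many values below 1 quickly underflows.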

How to score the match of a sequence against the three motif models? Regular expression: exact match or fuzzy match. Profile: sum of log-odds. HMM: probability score P(s|H).
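The "sum of log-odds" score for a profile can be sketched as follows. The three-column `PROFILE` and the uniform `BACKGROUND` are hypothetical values chosen for illustration, not data from the lecture.

```python
import math

# Assumed uniform background frequencies.
BACKGROUND = {b: 0.25 for b in "ACGT"}

# Hypothetical 3-column frequency profile; each column sums to 1.
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

def log_odds_score(window, profile=PROFILE, background=BACKGROUND):
    """Sum over positions of log2(profile frequency / background frequency)."""
    return sum(math.log2(col[ch] / background[ch]) for col, ch in zip(profile, window))

def scan(sequence, profile=PROFILE):
    """Score every window of the profile's width along the sequence."""
    w = len(profile)
    return [(i, log_odds_score(sequence[i:i + w], profile))
            for i in range(len(sequence) - w + 1)]

# The window "AGC" at position 2 is the best match to this profile.
best_start, best_score = max(scan("TTAGCTT"), key=lambda hit: hit[1])
```

A positive score means the window is more likely under the profile than under the background; a negative score means the opposite.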

Outline: Introduction to DNA Motif; Motif Representations (Recap); Motif database search; Algorithms for motif discovery

How do we search for occurrences of known patterns? Tools exist that allow us to search for one or more known sequence patterns in one or more sequences in different ways. The patterns can come from a database of known patterns or be novel patterns we have discovered using pattern discovery software or other means. Some tools treat each pattern independently; others look for groups of matches to patterns. All tools compare each pattern to each position and compute a score, which can be the number of mutations (regular expression patterns) or a probability or log-odds (profiles and HMMs).
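For the "number of mutations" style of score, a minimal sketch is to slide the pattern along the sequence and count mismatches at each position. The pattern, the sequence, and the `max_mismatches` threshold below are hypothetical choices for illustration.

```python
def mismatches(pattern, window):
    """Number of positions at which the window differs from the pattern."""
    return sum(1 for p, w in zip(pattern, window) if p != w)

def search(pattern, sequence, max_mismatches=1):
    """Return (start, mismatch_count) for every window within the threshold."""
    w = len(pattern)
    hits = []
    for start in range(len(sequence) - w + 1):
        d = mismatches(pattern, sequence[start:start + w])
        if d <= max_mismatches:
            hits.append((start, d))
    return hits

# Hypothetical example: one exact occurrence and one with a single mutation.
hits = search("TGACC", "AATGACCTTTGATCAA", max_mismatches=1)
```

Here the exact site is reported at position 2 and the one-mismatch site (TGATC) at position 9. Profile and HMM searches follow the same sliding-window loop but replace the mismatch count with a log-odds or probability score.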

Many useful databases of patterns have been compiled. TRANSFAC – transcription factor binding sites (profiles). PROSITE – protein sites and domains (regular expressions and profiles). EPD – eukaryotic promoters (profiles). PFAM – protein families and domains (HMMs). BLOCKS – protein families (profiles).

Searching for known patterns in a given sequence. MOTIF – search protein sequence against Prosite, PFAM etc.; search DNA sequence against TRANSFAC. PROFILESCAN – search protein sequence against Prosite database of profiles or regular expressions. MAST – search for occurrences of one or more patterns in a DNA sequence (or database of sequences).

Outline: Introduction to DNA Motif; Motif Representations (Recap); Motif database search; Algorithms for motif discovery

The Motif Discovery Problem We are given a set of sequences, each containing an instance of an unknown motif. Find the motif. Multiple, local sequence alignment. A clean, computer-sciencey problem. A bit too clean, we should be suspicious…

In Real Life A microarray experiment indicates that 50 genes share similar expression patterns. Do they share a common type of transcription factor binding site? ◦ Almost certainly some of the genes were included erroneously: experimental noise. ◦ Perhaps they share a common mRNA degradation signal. Is the TFBS near the transcription start site? ◦ Yeast: probably. Human: who knows?

Approaches to Motif Discovery Matrix-based: ◦ Gibbs Sampling - most popular. ◦ Expectation maximization. ◦ Stormo’s greedy algorithm. Consensus sequence-based: ◦ Several algorithms by Pevzner. ◦ Box-finder of Kielbasa et al.

Three Ingredients of Almost any Bioinformatics Method 1. Search space (haystack) 2. Scoring scheme 3. Search algorithm (= optimization technique) Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts. Mathematically precise formulation of the problem.

Gibbs Sampling: Simplifying Assumptions The width of the motif is known in advance. No indels (gaps). Each sequence contains precisely one instance of the motif. The sequences are single-stranded (e.g. mRNA).

Search Space: N sequences, each of length L; motif width W. Size of search space = $(L - W + 1)^N$. For L = 100, W = 15, N = 10: size ≈ $10^{19}$.
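The size of the search space follows from choosing one motif start per sequence: each sequence offers L − W + 1 possible starts, and the N choices are independent. A small sketch of the arithmetic (the function name is illustrative):

```python
def search_space_size(seq_length, motif_width, num_sequences):
    """Number of ways to pick one motif start position in each sequence."""
    starts_per_sequence = seq_length - motif_width + 1
    return starts_per_sequence ** num_sequences

# Values from the slide: L = 100, W = 15, N = 10, giving 86 ** 10.
size = search_space_size(100, 15, 10)
```

With the slide's numbers this is 86 to the power 10, on the order of 10^19, which is why exhaustive enumeration is not practical.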

Scoring Scheme Assign a numeric score to any proposed answer. What score should this get? caga ctga cacc cgca

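Some Definitions
Aligned sites: caga, ctga, cacc, cgca
Count matrix (rows: residue type; columns: motif position):
     1  2  3  4
a    0  2  0  3
c    4  0  2  1
g    0  1  2  0
t    0  1  0  0
$c_{ki}$ = count of residue type $i$ at position $k$; $p_{ki} = c_{ki} / N$; $p_i$ = background abundance of the $i$-th residue type.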

Two Scoring Schemes: 1. Based on frequentist statistics / information theory: [formula not preserved in transcript] 2. Based on Bayesian statistics: [formula not preserved in transcript]

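Worked Example
Count matrix $c_{ki}$ (rows: residue type; columns: motif position):
     1  2  3  4
a    0  2  0  3
c    4  0  2  1
g    0  1  2  0
t    0  1  0  0
N = 4, $p_i = 1/4$. Score = [expression not preserved in transcript] = 2.29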
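The scoring formulas themselves did not survive in this transcript, so the sketch below should be read as two common choices rather than as the lecture's own definitions. It rebuilds the count matrix from the four example sites and computes (a) the information content in bits, a typical information-theoretic score, and (b) a Dirichlet-multinomial log likelihood ratio, a typical Bayesian score. The uniform background and the pseudocount of 1 per residue are assumptions.

```python
import math

SITES = ["caga", "ctga", "cacc", "cgca"]
ALPHABET = "acgt"
BACKGROUND = {b: 0.25 for b in ALPHABET}  # assumed uniform background p_i

def count_matrix(sites):
    """c_ki: count of residue i at position k, one dict per position."""
    return [{b: sum(s[k] == b for s in sites) for b in ALPHABET}
            for k in range(len(sites[0]))]

def information_content_bits(counts, background=BACKGROUND):
    """Sum over positions and residues of p_ki * log2(p_ki / p_i)."""
    total = 0.0
    for col in counts:
        n = sum(col.values())
        for b, c in col.items():
            if c > 0:  # terms with zero count contribute nothing
                p = c / n
                total += p * math.log2(p / background[b])
    return total

def dirichlet_log_ratio(counts, background=BACKGROUND, pseudocount=1.0):
    """Log of (Dirichlet-multinomial motif likelihood / background likelihood)."""
    total = 0.0
    for col in counts:
        n = sum(col.values())
        a_sum = pseudocount * len(col)
        log_motif = math.lgamma(a_sum) - math.lgamma(n + a_sum)
        for c in col.values():
            log_motif += math.lgamma(c + pseudocount) - math.lgamma(pseudocount)
        log_background = sum(c * math.log(background[b]) for b, c in col.items())
        total += log_motif - log_background
    return total

counts = count_matrix(SITES)
info_bits = information_content_bits(counts)
bayes_score = dirichlet_log_ratio(counts)
```

Under these assumptions the information content is about 4.69 bits and the Dirichlet-based score is about 2.30 in natural-log units. The second value is close to the 2.29 quoted on the slide, but whether the slide used this exact formula cannot be confirmed from the transcript.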

Search Algorithm: We want the global maximum score! (Or as close as we can get.) Exact algorithms (e.g. dynamic programming) would be too slow (e.g. lifetime of universe). Therefore we resort to a heuristic algorithm: Gibbs sampling, which is a type of Markov chain Monte Carlo (MCMC) method.

Gibbs Sampling Search: Suppose the search space is a 2D rectangle with dimensions 1 and 2. (Typically, more than 2 dimensions!) Start at a random point X. Randomly pick a dimension. Look at all points along this dimension. Move to one of them randomly, proportional to its score π. Repeat.

Gibbs Sampling for Motif Search Choose a random starting state. Randomly pick a sequence. Look at all motif positions in this sequence. Pick one randomly proportional to exp(score). Repeat.
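The loop described on this slide can be sketched as a very basic Gibbs sampler. This is an illustration under stated assumptions, not the lecture's implementation: the score of a candidate position is taken to be its log-odds against a profile built from the other sequences' current sites, the background is uniform, a pseudocount of 1 is added to every count, and the toy sequences (with TGACCT planted once in each) are made up.

```python
import math
import random

ALPHABET = "ACGT"

def profile_without(seqs, starts, skip, width, pseudo=1.0):
    """Column frequencies from all current sites except the one in sequence `skip`."""
    profile = []
    for k in range(width):
        counts = {b: pseudo for b in ALPHABET}
        for j, (seq, start) in enumerate(zip(seqs, starts)):
            if j != skip:
                counts[seq[start + k]] += 1
        total = sum(counts.values())
        profile.append({b: c / total for b, c in counts.items()})
    return profile

def gibbs_motif_search(seqs, width, iterations=2000, seed=0):
    """Return one motif start per sequence after a fixed number of sampling steps."""
    rng = random.Random(seed)
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    # Choose a random starting state: one start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))  # randomly pick a sequence
        profile = profile_without(seqs, starts, i, width)
        # Look at all motif positions in this sequence.
        weights = []
        for start in range(len(seqs[i]) - width + 1):
            log_odds = sum(math.log(profile[k][seqs[i][start + k]] / background)
                           for k in range(width))
            weights.append(math.exp(log_odds))  # weight = exp(score)
        # Pick one position randomly, proportional to exp(score).
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical toy input with the motif TGACCT planted once per sequence.
SEQS = ["AAGTTGACCTAC", "CTGACCTGGATA", "GGATATGACCTC", "TGACCTCCAGAA"]
starts = gibbs_motif_search(SEQS, width=6)
```

Because the sampler is stochastic, the returned state is a sample rather than a guaranteed optimum; in practice one would run several seeds and keep the highest-scoring alignment seen along the way.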

Does it Work in Practice? Only successful cases get published! Seems more successful in microbes (bacteria & yeast) than in animals. The search algorithm seems to work quite well, the problem is the scoring scheme: real motifs often don’t have higher scores than you would find in random sequences by chance. I.e. the needle looks like hay. Attempts to deal with this: ◦ Assume the motif is an inverted palindrome (they often are). ◦ Only analyze sequence regions that are conserved in another species (e.g. human vs. mouse). As usual, repetitive sequences cause problems. More powerful algorithm: MEME

1. Go to our MEME server (URL truncated in transcript: …ml) 2. Fill in your e-mail address and a description of the sequences 3. Open the FASTA-formatted file you just saved with Genome2d (click “Browse”) 4. Select the number of motifs, number of sites and the optimum width of the motif 5. Click “Search given strand only” 6. Click “Start search”

Something like this will appear in your e-mail. The results are quite self-explanatory.

Summary: Motif discovery and motif search problems; motif representation; Gibbs sampling algorithm for motif discovery; using MEME (expectation-maximization algorithm) for motif discovery.

Acknowledgement: Zhiping Weng (Boston Uni.), Timothy L. Bailey