DNA Regulatory Binding Motif Search Dong Xu Computer Science Department 109 Engineering Building West 573-882-7064



Lecture Outline
- Gene regulation
- Definition of regulatory motif search
- CONSENSUS (“Greedy” Algorithm)
- Gibbs Sampler

Gene Regulation
[Figure labels: DNA sequence; promoter; operator; start of transcription]

Key steps in transcription
- Initiation
- Elongation
- Termination
[Figure labels: DNA + RNA]

Initiation
RNA Polymerases:
  RNA Pol I     Ribosomal RNAs
  RNA Pol II    All protein genes, snRNAs U1, U2 etc.
  RNA Pol III   Transfer RNAs, ribosomal RNAs
One of the first sequences to be described was the TATA box, with consensus TATA(A/T)A(A/T).

Transcription Initiation Complex
- TATA-binding protein (TBP) binds to the TATA box
- A macromolecular assembly of approximately 50 proteins
- Many conserved from yeast to humans
[Figure labels: TATA, TBP, TAF, RNA pol II]

Upstream Regulatory Elements
- In addition to the TATA box, comparison of many eukaryotic upstream sequences identified additional conserved motifs involved in the regulation of gene transcription.
- Some UREs are common to many genes; others are found only in genes expressed in specific cells or in response to specific stimuli.
- Promoters are sequences in the DNA just upstream of transcripts (coding sequences) that define the sites of initiation.
[Figure labels: TATA, URE]

Transcription factors are the proteins that modulate the rate of gene transcription through specific interactions with DNA and/or other proteins.
[Figure labels: TATA, TBP, TAF, RNA pol II, motif, transcription factor]

Regulatory Network (1)
- Regulatory elements in eukaryotes are frequently arranged in “modules”.
- TFs frequently act as synergistic (cooperative) or antagonistic (competitive) pairs.
[Figure: Endo 16]

Regulatory Network (2)

Lecture Outline
- Gene regulation
- Definition of regulatory motif search
- CONSENSUS (“Greedy” Algorithm)
- Gibbs Sampler

Motif Identification
[Figure labels: AGCCA; Regulatory regions; Motif – Binding site???]

What constitutes a motif?
- In S. cerevisiae, typically 6-10 conserved bases (the motif)
- Spacers varying in length (1-11 bp)
  - Usually located in the middle, e.g. ACCNNNNNNGTT
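A spaced motif of this kind can be located with a regular expression. The following is a minimal sketch, assuming exact matches on the two conserved halves and a spacer of fixed or bounded length; the function name `find_spaced_motif` and its parameters are illustrative and do not come from the lecture.

```python
import re

def find_spaced_motif(sequence, left="ACC", right="GTT", min_gap=6, max_gap=6):
    """Return (start, matched_text) for every left-spacer-right occurrence.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    pattern = re.compile(
        "(?=(%s[ACGT]{%d,%d}%s))" % (left, min_gap, max_gap, right)
    )
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]

# ACC, six spacer bases, then GTT, starting at index 2.
hits = find_spaced_motif("ttACCagctgaGTTcc")
```

Widening `min_gap` and `max_gap` to 1 and 11 would cover the spacer range mentioned on the slide, at the cost of more chance matches.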

Subproblem #1
- Given a collection of known binding sites
- Can we develop a representation to search for new binding sites?

Subproblem #2
- Given a set of sequences containing binding sites for a common factor
- Can we discover their location in each sequence?

Computational Approach
- Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).
- Extract the regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.
- Find some way to identify conserved elements (ungapped patterns) in these sequences, resulting in a list of potential regulatory sites.

Motif Finding Problem
- Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?
- Input: a set of sequences, each one with an unknown pattern at an unknown position.
- Output: the pattern and the set of starting positions of the pattern in each sequence.

Why Not Use Multiple Alignment
- The motif is short and may appear at different locations in different sequences. Most other regions are random.
- The problem is made more complicated because not every sequence contains a motif:
  - The upstream region used may not be long enough to include a regulatory site in every sequence.
  - Usually, potentially co-regulated genes are used to construct the sample, which means that we don’t know for sure whether all these genes are really co-regulated.

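Frequency matrix
[Formula garbled in extraction: a logarithm involving f(b,i), the frequency of base b at position i, and p(b), the background probability of base b]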
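One common way to turn a frequency matrix into a scoring matrix is a log-odds transform against the background. The sketch below assumes the form log2(f(b,i) / p(b)) with background-weighted pseudocounts; this is a widely used convention and not necessarily the exact formula on the slide. The names `weight_matrix`, `score`, and `pseudocount` are illustrative.

```python
import math

def weight_matrix(sites, background, pseudocount=1.0):
    """Build W[i][b] = log2(f(b, i) / p(b)) from equal-length aligned sites."""
    length = len(sites[0])
    matrix = []
    for i in range(length):
        column = [site[i] for site in sites]
        row = {}
        for b in "ACGT":
            # Pseudocounts follow the background so that no frequency is zero.
            f = (column.count(b) + pseudocount * background[b]) / (len(sites) + pseudocount)
            row[b] = math.log2(f / background[b])
        matrix.append(row)
    return matrix

def score(word, matrix):
    """Sum the per-position weights of a candidate word."""
    return sum(matrix[i][b] for i, b in enumerate(word))

uniform = {b: 0.25 for b in "ACGT"}
W = weight_matrix(["ACGT", "ACGA", "ACGT", "TCGT"], uniform)
```

Scanning a new sequence then amounts to calling `score` on every window of the motif length and reporting windows above a chosen threshold.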

Information Content Values
The functional constraints on each position of the pattern vary, from absolutely conserved at some sites to variable at others (Shannon’s information content C_i ranging between 0 and 1).
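A per-position information content can be computed from the base counts in one motif column. This is a minimal sketch, assuming a uniform background and assuming the 0 to 1 range is obtained by dividing the usual 0 to 2 bit scale by 2; the slide does not state how its range is defined, and `information_content` is an illustrative name.

```python
import math

def information_content(column_counts):
    """Information content of one motif column, scaled to the range 0..1."""
    total = sum(column_counts.values())
    entropy = 0.0
    for count in column_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    # Maximum entropy for four bases is 2 bits; dividing by 2 gives a 0..1 scale.
    return (2.0 - entropy) / 2.0

conserved = information_content({"A": 18, "C": 0, "G": 0, "T": 0})
variable = information_content({"A": 5, "C": 4, "G": 5, "T": 4})
```

A fully conserved column scores 1, while a column with nearly equal base counts scores close to 0.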

Sequence Logo

Example Data Set Experimentally determined CRP binding sites for 18 genes

CRP Dimer
The homodimeric structure indicates a symmetric model.

CRP Product Multinomial Model Logo
Palindromic product multinomial model of sites

Essentially a Multiple Local Alignment
- Find the “best” multiple local alignment

Difficulties
- Multiple factors for a single gene
- Variability in binding sites
  - The nature of the variability is NOT well understood
  - Insertions and deletions are uncommon
- Location, location, location…
- Confidence assessment

Lecture Outline
- Gene regulation
- Definition of regulatory motif search
- CONSENSUS (“Greedy” Algorithm)
- Gibbs Sampler

Early Statistical Approaches
- CONSENSUS: uses a greedy algorithm to iteratively build up motifs by adding more and more pattern instances.
- Gibbs sampler: starts from a random initial solution and uses Gibbs sampling to make a series of local moves, trying to reach the solution with the best score.
- MEME: uses the expectation maximization (EM) algorithm.

CONSENSUS Algorithm
- CONSENSUS uses an iterative procedure to add more and more patterns to form potential motifs:
  - Initialize each l-mer in sequence 1 as a single-pattern motif.
  - Add each l-mer in sequence 2 to each single-pattern motif, forming motifs consisting of 2 patterns. Keep only the top n motifs.
  - Repeat the process by adding each l-mer in sequence 3 to the top n motifs from the last round, forming motifs consisting of 3 patterns, and so on until the last sequence. Only the top n motifs are kept each time.

More Details of CONSENSUS
- CONSENSUS uses the information content score to score a motif as a set of ungapped patterns.
- Instead of following the sequence order given in the input sequence set, a randomized ordering is used to avoid dependence on the input order.

CONSENSUS Procedure (1)
Cycle 1:
  For each word W_1 in S_1
    For each word W_2 in S_2
      Create alignment (gap free) of W_1, W_2
  Keep the n best alignments A_{1,1}, …, A_{n,1}
  ACGGTTG, CGAACTT, GGGCTCT …
  ACGCCTG, AGAACTA, GGGGTGT …

CONSENSUS Procedure (2)
Cycle t:
  For each alignment A_{j,t-1} from cycle t-1
    For each word W_{t+1} in S_{t+1}
      Create alignment (gap free) of W_{t+1}, A_{j,t-1}
  Keep the n best alignments A_{1,t}, …, A_{n,t}
  ACGGTTG, CGAACTT, GGGCTCT …
  ACGCCTG, AGAACTA, GGGGTGT …
  ………
  ACGGCTC, AGATCTT, GGCGTCT …
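The greedy cycles can be sketched in a few lines of Python. This is an illustrative simplification rather than the CONSENSUS program itself: it processes sequences in input order instead of a randomized order, scores alignments with an information-content sum that assumes a uniform background, and expects uppercase A/C/G/T input. The names `info_score` and `consensus` are hypothetical.

```python
import math

def info_score(words):
    """Sum over columns of (2 - entropy) in bits, assuming a uniform background."""
    total = 0.0
    for i in range(len(words[0])):
        column = [w[i] for w in words]
        for b in set(column):
            p = column.count(b) / len(column)
            total += p * math.log2(p)  # accumulates minus the column entropy
        total += 2.0
    return total

def consensus(sequences, l, n):
    """Greedy search: extend the n best partial alignments one sequence at a time."""
    def lmers(s):
        return [s[i:i + l] for i in range(len(s) - l + 1)]

    # Every l-mer of the first sequence starts its own single-word alignment.
    alignments = [[w] for w in lmers(sequences[0])]
    for seq in sequences[1:]:
        candidates = [a + [w] for a in alignments for w in lmers(seq)]
        candidates.sort(key=info_score, reverse=True)
        alignments = candidates[:n]  # keep only the top n alignments
    return alignments[0]

best = consensus(["TTACGTATT", "GACGTACCA", "CCATACGTA"], l=5, n=10)
```

Because only the top `n` alignments survive each cycle, a small `n` is fast but can discard the true motif early; this is the usual trade-off of a greedy beam-style search.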

Weight matrix
- Probabilistic model: How likely is each letter at each motif position?
[Figure: matrix with rows A, C, G, T]

A.K.A.
Weight matrices are also known as
- Position-specific scoring matrices
- Position-specific probability matrices
- Position-specific weight matrices
Related concepts
- Information content
- Relative entropy

Scoring a motif model
- A motif is interesting if it is very different from the background distribution
[Figure: two matrices with rows A, C, G, T, labeled "more interesting" and "less interesting"]

Relative entropy
- A motif is interesting if it is very different from the background distribution
- Use relative entropy as the objective function:
  $$\mathrm{RE} = \sum_{i} \sum_{\beta} p_{i,\beta} \log \frac{p_{i,\beta}}{b_{\beta}}$$
  where p_{i,β} = probability of β in matrix position i, and b_β = background frequency of β (in non-motif sequence)
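Relative entropy between the motif columns and the background is straightforward to compute once the matrix is stored as one probability dictionary per position. The sketch below sums p[i][base] * log2(p[i][base] / b[base]) over all positions and bases; the base-2 logarithm and the names `relative_entropy`, `sharp`, and `flat` are choices made for illustration.

```python
import math

def relative_entropy(matrix, background):
    """Sum over positions i and bases of p[i][base] * log2(p[i][base] / b[base])."""
    total = 0.0
    for column in matrix:
        for base, p in column.items():
            if p > 0:  # the term 0 * log(0) is treated as 0
                total += p * math.log2(p / background[base])
    return total

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sharp = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]      # fully conserved column
flat = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]   # identical to background
```

With a uniform background, a fully conserved column contributes 2 bits and a column identical to the background contributes 0, matching the "more interesting" versus "less interesting" intuition.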

Computational Complexity
- n is a user-defined heuristic constant.
- Running time: O(N^2) + O(k N n), where N is the length of a sequence, n is the number of top alignments kept, and k is the number of sequences.

Lecture Outline
- Gene regulation
- Definition of regulatory motif search
- CONSENSUS (“Greedy” Algorithm)
- Gibbs Sampler

Gibbs Sampling (1)
- Goal: find the best start positions a_k that maximize the difference between the motif and the background base distribution.
[Figure: sequences with motif start positions a_1, a_2, a_3, a_4, …, a_k]

Gibbs Sampling (2)
- Step 1: Pick random start positions and compute the current motif matrix
- Step 2: Iterative update
  - Take one sequence out and update the motif matrix
  - Calculate the fitness score of each position in the left-out sequence
  - Pick a start position in the left-out sequence based on weight Ax
  - Take out another sequence, …, until convergence
- Step 3: Reset the starting positions
(Liu, X)

Gibbs Sampling (3)
Take out one sequence and calculate the fitness score of every subsequence relative to the current motif.
[Figure: start positions a_1', a_2', a_3', a_4', …, a_k', with the left-out sequence shown as ?????????????????]

Fitness Score
- Ax = Qx / Px
  - Qx: probability of generating subsequence x from the current motif
  - Px: probability of generating subsequence x from the background
[Table: current motif matrix over positions 1-3 with rows A, T, G, C; values not preserved in the transcript]
Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
X = GGA: Q? P?
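The fitness ratio can be worked through for X = GGA in code. The background values below are the ones given on the slide; the motif matrix is HYPOTHETICAL, because the slide's own matrix values did not survive in the transcript, so the resulting Q and A are illustrative only. The helper name `sequence_probability` is also an assumption.

```python
def sequence_probability(x, columns):
    """Multiply the per-position probabilities of the letters of x."""
    prob = 1.0
    for letter, column in zip(x, columns):
        prob *= column[letter]
    return prob

# Background from the slide: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1.
background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

# HYPOTHETICAL motif matrix (one dict per position); not the slide's values.
motif = [
    {"A": 0.1, "T": 0.1, "G": 0.7, "C": 0.1},
    {"A": 0.1, "T": 0.1, "G": 0.7, "C": 0.1},
    {"A": 0.7, "T": 0.1, "G": 0.1, "C": 0.1},
]

x = "GGA"
Q = sequence_probability(x, motif)                        # 0.7 * 0.7 * 0.7
P = sequence_probability(x, [background] * len(x))        # 0.1 * 0.1 * 0.4
A = Q / P
```

With the slide's background, P for GGA is 0.1 × 0.1 × 0.4 = 0.004 regardless of the motif matrix; only Q depends on the assumed values.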

An example
Sites before sampling sequence 4:  ACAGTGT  TAGGCGT  ACACCGT  ???????  CAGGTTT
Sites after sampling sequence 4:   ACAGTGT  TAGGCGT  ACACCGT  ACGCCGT  CAGGTTT
[Figure: motif matrix with rows A, C, G, T]

Gibbs pseudocode
  select sites at random
  compute the relative entropy
  for (iter = 0; iter < maxiter; iter++) {
    shuffle(sequences)
    foreach sequence in (sequences) {
      assign score to each site in sequence
      choose one site probabilistically
      compute the fitness score
      if (fitness score is best so far) {
        store a copy of the current sites
      }
    }
  }
  print the best scoring set of sites
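The pseudocode can be turned into a small runnable site sampler. This is a sketch under stated assumptions, not the lecture's reference implementation: it assumes exactly one site per sequence, a fixed motif width `w`, uppercase A/C/G/T input, a pseudocount of 1 per base, a background estimated from the input itself, and a best-so-far score checked once per sweep. The function name `gibbs_motif_search` and its parameters are illustrative.

```python
import math
import random

def gibbs_motif_search(seqs, w, iters=100, seed=0):
    """Site sampler sketch: one motif occurrence of width w per sequence."""
    rng = random.Random(seed)
    bases = "ACGT"
    text = "".join(seqs)
    # Background frequencies estimated from the input, with a pseudocount.
    bg = {b: (text.count(b) + 1) / (len(text) + 4) for b in bases}
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]

    def profile(skip):
        # Column frequencies from the current sites, leaving out sequence `skip`.
        cols = [{b: 1.0 for b in bases} for _ in range(w)]
        for k, (s, p) in enumerate(zip(seqs, pos)):
            if k != skip:
                for i in range(w):
                    cols[i][s[p + i]] += 1.0
        return [{b: c[b] / sum(c.values()) for b in bases} for c in cols]

    def ratio(word, cols):
        # Fitness A_x = Q_x / P_x of one candidate word.
        a = 1.0
        for i, ch in enumerate(word):
            a *= cols[i][ch] / bg[ch]
        return a

    best, best_score = list(pos), float("-inf")
    for _ in range(iters):
        order = list(range(len(seqs)))
        rng.shuffle(order)
        for k in order:
            s = seqs[k]
            cols = profile(skip=k)
            weights = [ratio(s[j:j + w], cols) for j in range(len(s) - w + 1)]
            # Sample the new start position in proportion to its fitness.
            pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
        cols = profile(skip=None)
        total = sum(math.log(ratio(s[p:p + w], cols)) for s, p in zip(seqs, pos))
        if total > best_score:
            best, best_score = list(pos), total
    return best

seqs = ["GATTACAGGCT", "CCGATTACATT", "TTTGATTACAC"]
result = gibbs_motif_search(seqs, w=7, iters=50, seed=1)
```

Because the sampler is stochastic, different seeds can return different positions; running several restarts and keeping the highest-scoring result is a common way to reduce the dependence on the starting point.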

Computational Complexity
- Running time of one iteration: O(NK)
  - Usually fewer than N iterations are needed for convergence, and fewer than N starting points.
  - Overall complexity: unclear; typically O(N^2 K) to O(N^3 K)
- EM is a local optimization method
- Initial parameters matter

Biological Considerations
- In practice, motif finding algorithms have to take into account characteristics of real input samples. These include:
  - Motifs with unknown length.
  - Samples with biased nucleotide composition.
  - Corrupted samples (not every sequence contains a motif).
  - Regulatory sites can lie on either DNA strand.
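Handling sites on either strand usually means also searching for the reverse complement of the motif. The sketch below shows this for exact matches only; the names `reverse_complement` and `find_on_both_strands` are illustrative, and the example motif AGCCA is borrowed from the Motif Identification slide.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_on_both_strands(sequence, motif):
    """Return (position, strand) pairs for exact matches of motif on either strand."""
    hits = []
    rc = reverse_complement(motif)
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        if window == rc:
            hits.append((i, "-"))
    return hits

hits = find_on_both_strands("TTAGCCAGGTGGCTAA", "AGCCA")
```

Positions are reported on the forward strand in both cases; a palindromic motif would be reported on both strands at the same position.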

Reading Assignments
- Suggested reading:
  - Chapter 10 in “Current Topics in Computational Molecular Biology”, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press.
- Optional reading:
  1. Victor Olman, Dong Xu, and Ying Xu. CUBIC: Identifications of Regulatory Binding Sites through Data Clustering. Journal of Bioinformatics and Computational Biology. 1:

Project Assignment (1)
Develop a program that implements the “greedy” algorithm (CONSENSUS) for motif identification.
1. Use an objective function of total mismatches between words.
2. Test the program using the DNA sequences on the next page.
3. Output the motif and its location in each sequence.
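One possible reading of "total mismatches between words" is the sum of Hamming distances over all pairs of words in an alignment, with lower totals being better. The sketch below shows only that objective, under this assumed interpretation; the names `hamming` and `total_mismatches` are illustrative.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def total_mismatches(words):
    """Sum of Hamming distances over all pairs of words (lower is better)."""
    return sum(hamming(a, b) for a, b in combinations(words, 2))

example = total_mismatches(["AAAAAAAAGGGGGGG", "AAAAAAAAGGGGGGG", "AAAAAAATGGGGGGG"])
```

Another reasonable reading is the number of mismatches of each word against a consensus word; the two objectives rank alignments similarly but are not identical.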

Project Assignment (2)
Test DNA sequence (each line a sequence):
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa