Modeling Regulatory Motifs 3/26/2013


Transcriptional Regulation
- Transcription is controlled by the interaction of trans-acting elements, called transcription factors (TFs), with cis-acting elements of DNA.
- Prediction of cis-acting elements, i.e. TF binding sites, is a challenging problem in computational biology.
[Figure: Transcriptional regulation in prokaryotes. Labels: TF1 and TF2 at their TF binding sites, promoter region, TSS (+1), 5'UTR, ribosome binding site, 3'UTR, terminator, transcription, RNA.]

Specific protein-DNA interactions
- Protein-DNA interactions are specific, which guarantees that transcriptional regulation is specific and precise.
- The specificity of a protein-DNA interaction is realized by the 3-D structures of the DNA-binding face of the TF protein and of the TF binding site in the DNA sequence.
- Usually a TF recognizes variable but similar binding sites associated with different genes.
- The set of all binding sites recognized by the same TF is called a TF-binding motif.

Experimental determination of binding sites
- There are in vitro and in vivo methods for determining the binding sites of TFs.
- Systematic evolution of ligands by exponential enrichment (SELEX), an in vitro method, is likely to identify all possible sequences recognized by a TF.
- SELEX may not work if the TF-DNA interaction requires unknown co-factors.
- The method is laborious, because tedious molecular cloning and sequencing are required to determine the binding sites.
Geertz M and Maerkl SJ, Briefings in Functional Genomics 2010;9

Experimental determination of binding sites
- Protein binding microarray (PBM) is another in vitro method. It avoids the molecular cloning step, and the binding sites can be read out directly from the microarray.
- PBM can determine binding sites at single-base resolution.
- Like SELEX, PBM may not work if the TF-DNA interaction requires unknown co-factors.
- PBM may also fail if the binding site is long, e.g. longer than 12 bp.
- A putative binding site determined by PBM is not necessarily a real binding site in cells.
Geertz M and Maerkl SJ, Briefings in Functional Genomics 2010;9

Experimental determination of binding sites
- ChIP-seq and ChIP-chip are two high-throughput in vivo methods for determining the binding sites of a TF.
- ChIP-seq and ChIP-chip can determine actual binding sites in a genome, but to determine all binding sites, many cell types need to be explored.
Geertz M and Maerkl SJ, Briefings in Functional Genomics 2010;9

Profile representation of TF binding sites
Examples of σ70 binding sites in E. coli:
TACGAT
TATAAT
GATACT
TATGAT
TATGTT
TATAGT
- Consensus sequence: TATAAT
- Regular expression: [TG]A[TC][GA]XT (X denotes any base)
- Frequency matrix: the count of each base at each position of the aligned sites. To avoid zero counts, add a pseudocount of 1.
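As a minimal sketch of the three representations on this slide, the following Python derives the count matrix, the consensus sequence, and the degenerate pattern from the six listed sites. The function names are hypothetical, and two conventions are assumptions rather than something the slide states: ties in the consensus are broken in A, C, G, T order, and "X" is written only where all four bases occur.

```python
from collections import Counter

# The six example sigma-70 sites listed on the slide.
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def count_matrix(sites):
    """Return {base: [count at position 0, 1, ...]} for aligned sites."""
    length = len(sites[0])
    return {b: [sum(s[i] == b for s in sites) for i in range(length)] for b in BASES}

def consensus(sites):
    """Most frequent base per position (assumed tie-break: A, C, G, T order)."""
    counts = count_matrix(sites)
    return "".join(
        max(BASES, key=lambda b: counts[b][i]) for i in range(len(sites[0]))
    )

def degenerate_pattern(sites):
    """Bracket pattern per position; 'X' where all four bases are observed."""
    parts = []
    for i in range(len(sites[0])):
        # most_common() orders by count, then by first appearance in the column.
        ranked = [b for b, _ in Counter(s[i] for s in sites).most_common()]
        if len(ranked) == 1:
            parts.append(ranked[0])
        elif len(ranked) == len(BASES):
            parts.append("X")
        else:
            parts.append("[" + "".join(ranked) + "]")
    return "".join(parts)

COUNTS = count_matrix(SITES)
for base in BASES:
    print(base, COUNTS[base])
print(consensus(SITES), degenerate_pattern(SITES))
```

The pseudocount mentioned on the slide is deliberately left out here so that the raw counts stay visible; it only matters once counts are turned into probabilities.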

Profile representation of TF binding sites
- Profile: for a motif of n samples (sequences), the probability of residue b at position i is
\[ p_{b,i} = \frac{n_{b,i} + k}{n + 4k}, \]
where $n_{b,i}$ is the frequency (count) of residue b at position i, and k is a pseudocount to avoid zero probability.
[Table: profile $p_{b,i}$ of the σ70 binding sites in E. coli, pseudocount k = 1.]
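The profile can be sketched in a few lines of Python. This assumes the pseudocount k is added to each of the four bases, so the denominator is n + 4k and every column sums to 1; the function name `profile` is hypothetical.

```python
# The six example sigma-70 sites from the earlier slide.
SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def profile(sites, k=1.0):
    """p[b][i] = (n_bi + k) / (n + 4k) for aligned, equal-length sites."""
    n = len(sites)
    length = len(sites[0])
    return {
        b: [(sum(s[i] == b for s in sites) + k) / (n + 4 * k) for i in range(length)]
        for b in BASES
    }

P = profile(SITES, k=1.0)
for base in BASES:
    print(base, [round(x, 2) for x in P[base]])
```

With k = 1 and n = 6, a base seen in all six sites gets probability 7/10 and a base never seen gets 1/10, so no entry is zero.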

Profile representation of TF binding sites
- Position-specific weight (scoring) matrix (PSWM): for a motif of n samples, the weight of residue b at position i is defined as
\[ w_{b,i} = \log_2 \frac{p_{b,i}}{p_b}, \]
where $p_{b,i}$ is the probability of residue b at position i, and $p_b$ is the probability of residue b in the background sequences.
[Table: PSWM of the σ70 binding sites in E. coli, assuming $p_A = p_C = p_G = p_T = 0.25$.]
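A short sketch of the PSWM computation, under two assumptions that the transcript does not confirm: the logarithm is base 2 (weights in bits), and the profile uses a pseudocount of 1 per base with denominator n + 4k. The function name `pswm` is hypothetical.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, background=None):
    """w[b][i] = log2(p_bi / p_b), with p_bi = (n_bi + k) / (n + 4k)."""
    if background is None:
        # Uniform background, as assumed on the slide.
        background = {b: 0.25 for b in BASES}
    n, length = len(sites), len(sites[0])
    weights = {}
    for b in BASES:
        weights[b] = []
        for i in range(length):
            p = (sum(s[i] == b for s in sites) + k) / (n + 4 * k)
            weights[b].append(math.log2(p / background[b]))
    return weights

W = pswm(SITES)
for base in BASES:
    print(base, [round(x, 2) for x in W[base]])
```

Positive weights mark bases that are enriched relative to the background at that position; negative weights mark depleted ones.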

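Profile representation of TF binding sites
- The information content at position i of the sequence profile is given by
\[ I_i = \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 \frac{p_{b,i}}{p_b}. \]
- The information content of a motif of length L is
\[ I = \sum_{i=1}^{L} I_i. \]
- Logo representation: the total height of the stack at position i is
\[ R_i = \log_2 4 - \left( H_i + e(n) \right), \qquad H_i = -\sum_{b} p_{b,i} \log_2 p_{b,i}, \]
where e(n) is a correction factor required when one has only a few (n) samples. A pseudocount is not added when computing $p_{b,i}$. The height of each base b is $p_{b,i} R_i$.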
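The quantities on this slide can be sketched as follows. Several points are assumptions: information content is taken as the relative entropy against a uniform background, in bits; e(n) is approximated by (4 - 1) / (2 ln 2 · n), the usual small-sample approximation for sequence logos; and negative stack heights are clamped to zero. Function names are hypothetical.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def information_content(sites, background=0.25):
    """Per-position relative entropy in bits; no pseudocount is added."""
    n = len(sites)
    info = []
    for i in range(len(sites[0])):
        total = 0.0
        for b in BASES:
            p = sum(s[i] == b for s in sites) / n
            if p > 0:  # 0 * log(0) is treated as 0
                total += p * math.log2(p / background)
        info.append(total)
    return info

def logo_heights(sites):
    """Height of each base per position: p_bi * R_i, with R_i clamped at 0."""
    n = len(sites)
    e_n = (len(BASES) - 1) / (2 * math.log(2) * n)  # assumed approximation
    heights = []
    for i in range(len(sites[0])):
        freqs = {b: sum(s[i] == b for s in sites) / n for b in BASES}
        entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
        r_i = max(0.0, 2.0 - (entropy + e_n))
        heights.append({b: p * r_i for b, p in freqs.items()})
    return heights

INFO = information_content(SITES)
HEIGHTS = logo_heights(SITES)
print([round(x, 2) for x in INFO], round(sum(INFO), 2))
```

Invariant positions (here positions 2 and 6) reach the maximum of 2 bits, while the highly variable position 5 carries very little information.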

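Score of a sequence using a PSWM
- If we represent a sequence $S = b_1 b_2 \ldots b_j \ldots b_n$ as a binary matrix $\{s_{j,b}\}_{n \times 4}$, where $s_{j,b} = 1$ if $b_j = b$ and 0 otherwise, then for S = TATAAT (columns A, C, G, T):
\[ \{s_{j,b}\}_{6 \times 4} =
\begin{pmatrix}
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix} \]
- The score of a sequence against a profile (or PSWM) is defined as
\[ \mathrm{Score}(S) = \sum_{j=1}^{n} \sum_{b \in \{A,C,G,T\}} s_{j,b} \, w_{b,j}. \]
[Worked example: TATAAT scored against the PSWM; the numerical matrices were not recovered from the slide.]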
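The scoring rule can be sketched directly as a sum over the one-hot matrix of the sequence multiplied by the PSWM. The PSWM here is rebuilt from the six σ70 sites with a pseudocount of 1, a uniform background and a base-2 logarithm, all of which are assumptions; `one_hot` and `score` are hypothetical names.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, bg=0.25):
    """Log-odds weights w[b][i] with pseudocount k and uniform background."""
    n, length = len(sites), len(sites[0])
    return {
        b: [
            math.log2(((sum(s[i] == b for s in sites) + k) / (n + 4 * k)) / bg)
            for i in range(length)
        ]
        for b in BASES
    }

def one_hot(seq):
    """Binary n x 4 matrix: row j has a single 1 in the column of base b_j."""
    return [[1 if base == b else 0 for base in BASES] for b in seq]

def score(seq, weights):
    """Sum over positions j and bases b of s[j][b] * w[b][j]."""
    s = one_hot(seq)
    return sum(
        s[j][c] * weights[BASES[c]][j]
        for j in range(len(seq))
        for c in range(len(BASES))
    )

W = pswm(SITES)
print(one_hot("TATAAT"))
print(round(score("TATAAT", W), 3), round(score("GGGGGG", W), 3))
```

Because each row of the binary matrix contains exactly one 1, the double sum reduces to looking up one weight per position, which is how a practical implementation would do it.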

Higher-order PSWM
- To account for the dependence among adjacent positions in the TF-DNA interaction, we can use higher-order PSWMs.
- A higher-order PSWM corresponds to a k-th order Markov chain, in which position i depends on the previous k positions.
- A higher-order PSWM is also called a position weight array.
Example sites: TACGAT, TATAAT, GATACT, TATGAT, TATGTT, TATAGT. To avoid zero counts, add a pseudocount of 1.
[Table: first-order PSWM for the σ70 factor binding sites.]
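A first-order PSWM can be sketched as one table of conditional log-odds per position. The slide's exact table was not recovered, so the definition used here is an assumption: the weight of base b at position i given base a at position i-1 is log2 of p(b | a) over a uniform background, with a pseudocount of 1 per base, and the first position falls back to a zero-order weight. All names are hypothetical.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def first_order_pswm(sites, k=1.0, bg=0.25):
    """Return (first, trans): zero-order weights for position 0 and, for each
    later position, a dict mapping the dinucleotide 'ab' to log2(p(b|a)/bg)."""
    n, length = len(sites), len(sites[0])
    first = {
        b: math.log2(((sum(s[0] == b for s in sites) + k) / (n + 4 * k)) / bg)
        for b in BASES
    }
    trans = []
    for i in range(1, length):
        table = {}
        for a in BASES:
            n_a = sum(s[i - 1] == a for s in sites)
            for b in BASES:
                n_ab = sum(s[i - 1] == a and s[i] == b for s in sites)
                table[a + b] = math.log2(((n_ab + k) / (n_a + 4 * k)) / bg)
        trans.append(table)
    return first, trans

def score(seq, model):
    """Zero-order weight of the first base plus conditional weights after it."""
    first, trans = model
    total = first[seq[0]]
    for i in range(1, len(seq)):
        total += trans[i - 1][seq[i - 1] + seq[i]]
    return total

MODEL = first_order_pswm(SITES)
print(round(score("TATAAT", MODEL), 3), round(score("CCCCCC", MODEL), 3))
```

With only six sites most dinucleotide counts are zero, so the pseudocount dominates many entries; a realistic model needs far more sites per position than a zero-order PSWM.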

Maximal dependence decomposition
- Maximal dependence decomposition (MDD) models the dependence between any two positions. It estimates the extent to which the nucleotide $b_j$ at position j depends on the nucleotide $b_i$ at position i.
- MDD uses the $\chi^2$ test to determine whether position j depends on position i.
- For each position i, we divide the binding sites into two groups:
  $C_i$: binding sites having the consensus base at i;
  $\bar{C}_i$: binding sites having a non-consensus base at i.
[Figure: the six σ70 sites (consensus TATAAT) split into $C_i$ (TACGAT, TATAAT, TATGAT) and $\bar{C}_i$ (GATACT, TATGTT, TATAGT), i.e. by the consensus base A at position i = 5, with columns $b_i$ and $b_j$ highlighted.]

Maximal dependence decomposition
- Let $f_b$ be the probability of base b at position j in the binding sites in $\bar{C}_i$.
- Let N and $N_b$ be the total number of binding sites in $C_i$ and the count of base b at position j in $C_i$, respectively. Then the $\chi^2$ statistic is defined as
\[ \chi^2 = \sum_{b \in \{A,C,G,T\}} \frac{(N_b - N f_b)^2}{N f_b}. \]
[Figure: the sites split into $C_i$ and $\bar{C}_i$; the frequencies $f_A, f_C, f_G, f_T$ are taken from $\bar{C}_i$ and the counts $N_A, N_C, N_G, N_T$ from the N binding sites in $C_i$.]

Maximal dependence decomposition
- This $\chi^2$ statistic describes the dependence of position j on position i, and is denoted $\chi^2(j|i)$.
- The MDD approach proceeds iteratively as follows.
1. For each position i, compute
\[ S_i = \sum_{j \neq i} \chi^2(j|i). \]
2. Among all the positions, select the position i with maximum $S_i$, and partition the sequences into two groups, $C_i$ and $\bar{C}_i$.
3. Repeat steps 1 and 2 separately for $C_i$ and $\bar{C}_i$.
4. Stop if there is no significant dependence or if there is an insufficient number of binding sites in $C_i$ or $\bar{C}_i$. In either case, construct a standard PSWM for the remaining subset of binding sites.
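One round of steps 1 and 2 can be sketched as below, using the six σ70 sites from the earlier slides. The statistic follows the definition given in the preceding slide (observed counts in the consensus group against expected counts N·f_b, with f_b estimated from the non-consensus group). Adding a pseudocount to f_b so that no expected count is zero is an assumption of this sketch, as are the function names; with so few sites the values are only illustrative.

```python
from collections import Counter

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def split_by_consensus(sites, i):
    """Partition sites into (C_i, C_i-bar) by the consensus base at position i."""
    cons = Counter(s[i] for s in sites).most_common(1)[0][0]
    c = [s for s in sites if s[i] == cons]
    c_bar = [s for s in sites if s[i] != cons]
    return c, c_bar

def chi2(sites, j, i, pseudo=1.0):
    """chi^2(j|i): sum over b of (N_b - N f_b)^2 / (N f_b)."""
    c, c_bar = split_by_consensus(sites, i)
    if not c or not c_bar:
        return 0.0  # an invariant position cannot be tested
    n = len(c)
    total = 0.0
    for b in BASES:
        # f_b from the non-consensus group, smoothed (assumption of the sketch).
        f_b = (sum(s[j] == b for s in c_bar) + pseudo) / (len(c_bar) + 4 * pseudo)
        n_b = sum(s[j] == b for s in c)
        expected = n * f_b
        total += (n_b - expected) ** 2 / expected
    return total

def dependence_scores(sites):
    """S_i = sum over j != i of chi^2(j|i), for every position i."""
    length = len(sites[0])
    return [
        sum(chi2(sites, j, i) for j in range(length) if j != i)
        for i in range(length)
    ]

SCORES = dependence_scores(SITES)
BEST = max(range(len(SCORES)), key=lambda i: SCORES[i])
print([round(s, 2) for s in SCORES], "split at position", BEST + 1)
```

A full MDD implementation would recurse on the two groups returned by `split_by_consensus` and stop on a significance threshold or a minimum group size, as described in steps 3 and 4.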

Maximal dependence decomposition
- Illustration of the MDD procedure: modeling
[Figure: a set of binding sites (AACGTG, AGGCTG, AGCTTT, TACGTG, CACGGT, GATGGG, ...) is partitioned first at the position with maximum $S_1$ and then at the position with maximum $S_3$. Subsets with insufficient dependence are modeled by standard PSWMs (PSWM1, PSWM2).]

Maximal dependence decomposition
- Illustration of the MDD procedure: scoring
- For X = AAGGTG: position 1 has the consensus base 'A' and position 3 has the non-consensus base 'G', so X is scored using PSWM2.
[Figure: the same partition tree as on the previous slide, with the path followed by X highlighted.]

Modeling and detecting arbitrary dependencies
- We can also use a digraph to model the dependence among the positions.
[Figure: four example digraphs (a to d) over the positions $S_1$, $S_2$, $S_3$, $S_4$.]

Searching for novel binding sites using a PSWM
- Scan a sequence using a sliding window of the length of the PSWM, and return the windows that have a significantly high score, e.g. in ...GAGTTATAATTAAGA...
- The significance of a score S can be computed as an empirical p-value, or as the normalized score
\[ \frac{S - S_{\min}}{S_{\max} - S_{\min}}, \]
where $S_{\min}$ and $S_{\max}$ are the minimal and maximal scores that can be obtained with the PSWM.
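The sliding-window scan can be sketched as follows. The significance measure is assumed to be the normalized score (S - S_min) / (S_max - S_min), which lies between 0 and 1; the threshold of 0.9, the PSWM settings (pseudocount 1, uniform background, base-2 logarithm) and the function names are illustrative assumptions. The scanned sequence is the fragment shown on the slide.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, bg=0.25):
    """Log-odds weights w[b][i] with pseudocount k and uniform background."""
    n, length = len(sites), len(sites[0])
    return {
        b: [
            math.log2(((sum(s[i] == b for s in sites) + k) / (n + 4 * k)) / bg)
            for i in range(length)
        ]
        for b in BASES
    }

def scan(sequence, weights, threshold=0.9):
    """Return (start, window, normalized score) for windows above threshold."""
    length = len(weights["A"])
    s_max = sum(max(weights[b][i] for b in BASES) for i in range(length))
    s_min = sum(min(weights[b][i] for b in BASES) for i in range(length))
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        s = sum(weights[b][i] for i, b in enumerate(window))
        normalized = (s - s_min) / (s_max - s_min)
        if normalized >= threshold:
            hits.append((start, window, normalized))
    return hits

W = pswm(SITES)
HITS = scan("GAGTTATAATTAAGA", W)
for start, window, normalized in HITS:
    print(start, window, round(normalized, 3))
```

An empirical p-value would instead compare each window score against scores obtained on background or shuffled sequences; the normalized score is cheaper but says nothing about how often such scores arise by chance.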

De novo prediction of TF binding sites
- The motif-finding problem: since the cis-regulatory elements of a TF usually do not follow a fixed pattern, a cis-regulatory element can only be predicted by comparing a set of sequences that are likely to contain binding sites of the same TF. The problem of finding cis-regulatory elements in a given set of sequences is called the motif-finding problem.
- Currently, all sequence-based motif-finding algorithms rely on the assumption that the binding sites of a TF are more conserved than their flanking sequences in a genome.
- A large number of motif-finding algorithms have been developed:
1. Greedy algorithms: CONSENSUS, DREME
2. Probabilistic algorithms: MEME, BioProspector
3. Graph-theoretic algorithms: CUBIC, MotifClick
4. ...

Methods for finding a set of intergenic sequences for motif-finding
- One genome, multiple genes approach: identify a set of co-regulated genes from an organism of interest through clustering analysis of gene expression profiles.
[Figure: the intergenic regions $I_A$, $I_B$, $I_C$, $I_D$, $I_E$, $I_F$ of the co-regulated genes are passed to motif finding.]

Methods for finding a set of intergenic sequences for motif-finding
- One gene, multiple genomes approach (phylogenetic footprinting): in closely related species, both the coding sequences and the cis-regulatory elements of orthologous genes are often conserved.
[Figure: an operon A and its homologous operons from other genomes, with the conserved TFBSs upstream of the genes.]

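Phylogenetic footprinting
[Figure: phylogenetic footprinting pipeline. For each gene $g_1, \ldots, g_m$ of the target genome T, orthologues are identified in the genomes $G_1, G_2, \ldots, G_n$; the intergenic regions of each group of orthologues are passed to motif finding, which yields a PSWM and the predicted binding sites.]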

Additional hallmarks of functional TF binding sites
- In higher eukaryotes, genes are regulated by multiple TFs binding to a tight cluster of their respective binding sites.
- These clusters of binding sites of the same and/or different TFs are called cis-regulatory modules (CRMs). They can be in different orientations; can be located upstream of, downstream of, or in an intron of a gene; can be very far away from the target gene; and can even be on a different chromosome.
Borok MJ et al., Development 2010;137:5-13
Wasserman WW and Sandelin A, Nature Reviews Genetics 2004;5