GIBBS SAMPLER FOR IDENTIFICATION OF SYMMETRICALLY STRUCTURED AND POSSIBLY SPACED DNA MOTIFS AND ITS VALIDATION ON THE ArcA BINDING SITES.


Multiple Local Alignment (MLA)

Other representation: motif and background

What is a motif
A motif in a DNA sequence is a model of a protein binding site. As there is no strict physical description of the protein-DNA interaction, several approaches are used to build such a model:
– a consensus with possible mismatches: a string-like model
– a positional probability matrix (PPM) or a positional weight matrix (PWM): a statistical model
– others

PPM and background
We partition each sequence into the motif site (occurrence), described by a positional probability matrix q(i, r), and the background, described by background symbol probabilities f(r). Here r is a nucleotide (residue), r ∈ {A, T, G, C}; i is a position in the site, i = 1..s; and s is the motif length.
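The two models can be illustrated with a minimal Python sketch. It assumes that q(i, r) is estimated from aligned sites of equal length and that the background probabilities are estimated from plain symbol counts. The function names, the pseudocount of 1.0 and the example sites are illustrative assumptions, not part of the presented software.

```python
import math

ALPHABET = "ACGT"

def build_ppm(sites, pseudocount=1.0):
    """Estimate q(i, r) from aligned sites of equal length (one dict per position)."""
    length = len(sites[0])
    ppm = []
    for i in range(length):
        counts = {r: pseudocount for r in ALPHABET}  # pseudocounts avoid zero probabilities
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        ppm.append({r: counts[r] / total for r in ALPHABET})
    return ppm

def background_freqs(sequences, pseudocount=1.0):
    """Estimate background symbol probabilities from symbol counts."""
    counts = {r: pseudocount for r in ALPHABET}
    for seq in sequences:
        for r in seq:
            counts[r] += 1
    total = sum(counts.values())
    return {r: counts[r] / total for r in ALPHABET}

def log_odds(window, ppm, bg):
    """Log2 likelihood ratio of a window under the motif model versus the background."""
    return sum(math.log2(ppm[i][r] / bg[r]) for i, r in enumerate(window))

# Hypothetical example data.
example_sites = ["TGTTAA", "TGTTAA", "TGATAA"]
example_ppm = build_ppm(example_sites)
example_bg = background_freqs(["ACGTACGTTTGA"])
print(log_odds("TGTTAA", example_ppm, example_bg))
```

The log-odds score is the usual way to turn a PPM and a background into a PWM-style score for a candidate window.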

What is a motif
Two probabilistic models are formulated: the foreground (the motif) and the background. We classify (mark) every part of the input sequences as generated by one of these two models. The optimal classification is the most probable one in the Bayesian sense.

What and how do we want to optimize
We maximize the posterior probability of a foreground-background classification of the DNA sequence data as a function of the site positions in the sequences. The Markov chain Monte Carlo (MCMC) technique is a natural choice for this optimization. The MCMC variant known as Gibbs sampling was first applied to the MLA problem by Lawrence et al. (1993) and has since become one of the most popular tools for motif extraction from biological sequences.

A Gibbs sampling step
Motif and background base counters are computed from all the sequence fragments except the current one. Statistical models for the background and for the motif are formed from these counters. The probability distribution of the new site position in the current sequence, including the possibility that the site is absent, is derived from the statistical models and the content of the current sequence. A new site location is then sampled from this distribution.
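The step described above can be sketched in Python as follows. This is a simplified illustration that does not reproduce the presented software: it assumes a fixed motif length, a single strand, a zero-order background, a uniform prior over start positions and a fixed absence prior `p_absent`. All names and default values are assumptions.

```python
import random

ALPHABET = "ACGT"

def gibbs_step(seqs, positions, current, length, p_absent=0.1, pseudo=1.0, rng=random):
    """Resample the site in seqs[current]; positions[k] is a start index or None (no site)."""
    # 1. Count motif and background bases in all sequences except the current one.
    motif = [{r: pseudo for r in ALPHABET} for _ in range(length)]
    bg = {r: pseudo for r in ALPHABET}
    for k, seq in enumerate(seqs):
        if k == current:
            continue
        start = positions[k]
        for j, r in enumerate(seq):
            if start is not None and start <= j < start + length:
                motif[j - start][r] += 1
            else:
                bg[r] += 1
    # 2. Turn the counters into the two statistical models.
    bg_total = sum(bg.values())
    f = {r: bg[r] / bg_total for r in ALPHABET}
    q = []
    for col in motif:
        col_total = sum(col.values())
        q.append({r: col[r] / col_total for r in ALPHABET})
    # 3. Weight every outcome relative to the "all background" explanation.
    seq = seqs[current]
    n_starts = len(seq) - length + 1
    weights = [p_absent]  # outcome -1: no site in this sequence
    for start in range(n_starts):
        ratio = 1.0
        for i in range(length):
            r = seq[start + i]
            ratio *= q[i][r] / f[r]
        weights.append((1.0 - p_absent) * ratio / n_starts)
    # 4. Sample the new site location (or its absence) from that distribution.
    choice = rng.choices(range(-1, n_starts), weights=weights)[0]
    positions[current] = None if choice < 0 else choice
    return positions[current]

# Hypothetical example: three short sequences and an initial guess of the sites.
example_seqs = ["ACGTGTTAACCA", "GTATGTTAAGAT", "CACTGTTAATGC"]
example_positions = [3, 3, 0]
print(gibbs_step(example_seqs, example_positions, 2, 6, rng=random.Random(0)))
```

A full sampler would repeat this step over all sequences for many sweeps; the sketch shows only a single update.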

An adjustment
We test all possible site lengths while preserving the relative positions of the sites. For every length, we look for the best position of the entire collection of sites, which slides along the sequences as a rigid body.
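A minimal sketch of such an adjustment is given below. It assumes that the quality of a candidate placement is measured only by the information content per position of the resulting PPM; the function names, the pseudocount and the example data are hypothetical, and the presented software may score placements differently.

```python
import math

ALPHABET = "ACGT"

def info_per_position(sites, bg, pseudo=0.5):
    """Average relative entropy (in bits) of the PPM columns against the background."""
    length = len(sites[0])
    total = 0.0
    for i in range(length):
        counts = {r: pseudo for r in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        n = sum(counts.values())
        total += sum((c / n) * math.log2((c / n) / bg[r]) for r, c in counts.items())
    return total / length

def adjust(seqs, starts, lengths, max_shift, bg):
    """Try every length and every common shift of all sites; return (score, length, shift)."""
    best = None
    for length in lengths:
        for shift in range(-max_shift, max_shift + 1):
            shifted = [a + shift for a in starts]  # all sites move together
            if any(a < 0 or a + length > len(seq) for a, seq in zip(shifted, seqs)):
                continue  # at least one site would leave its sequence
            sites = [seq[a:a + length] for a, seq in zip(shifted, seqs)]
            score = info_per_position(sites, bg)
            if best is None or score > best[0]:
                best = (score, length, shift)
    return best

# Hypothetical example: the conserved block TGTTAA starts at index 3, the guess is 2.
example_seqs = ["ACGTGTTAACCA", "GTATGTTAAGAT", "CACTGTTAATGC", "TGGTGTTAAACG"]
uniform_bg = {r: 0.25 for r in ALPHABET}
print(adjust(example_seqs, [2, 2, 2, 2], [6], 2, uniform_bg))
```

Because all start positions receive the same shift, the relative positions of the sites are preserved, as the slide describes.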

Information content per letter
Structural component: the information content of the motif PPM. The total is difficult to use as the quantity to maximize, because it grows monotonically with the motif length. The same value divided by the number of positions is maximal at the best position without favoring elongation. Spatial component: the distance between the prior and posterior distributions of the probability of finding the motif at a given sequence position.
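The structural component can be illustrated with a short Python sketch. It assumes that the information content of a column is its relative entropy against the background, measured in bits; the function names and the toy PPM are illustrative assumptions.

```python
import math

def column_information(column, background):
    """Relative entropy (bits) of one PPM column against the background."""
    return sum(p * math.log2(p / background[r]) for r, p in column.items() if p > 0)

def total_information(ppm, background):
    """Sum over all positions; it never decreases when a column is appended."""
    return sum(column_information(col, background) for col in ppm)

def information_per_position(ppm, background):
    """Total information divided by the motif length."""
    return total_information(ppm, background) / len(ppm)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}

short_motif = [conserved]
long_motif = [conserved, dict(uniform)]  # elongated by an uninformative column

print(total_information(short_motif, uniform), total_information(long_motif, uniform))
print(information_per_position(short_motif, uniform), information_per_position(long_motif, uniform))
```

Appending an uninformative column leaves the total unchanged but lowers the per-position value, which is why the normalized measure does not reward elongation by itself.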

The algorithm differences
The modification is inspired by extensive practice in the analysis and prediction of gene co-regulation in prokaryotes.
– It is designed to look for symmetric (repeated or palindromic) motifs as well as regular ones.
– The motif may be symmetrically spaced, i.e. some positions in the middle of the site can be ignored. The optimal gap length is determined along with the motif length.
– It strictly accounts for the possibility that a sequence contains no site.
– The information measure used for both the spatial and the structural components is the Kullback-Leibler distance (relative entropy).
– The motif length is optimized during the search for the best motif, which reduces the runtime.
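One possible way to impose the symmetry and the central gap on the motif counters is sketched below. This is an assumption about how such constraints could be expressed, not a description of the presented implementation; all names are hypothetical, and the direct-repeat helper handles only an ungapped motif of even length.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def spaced_mask(length, gap):
    """True for positions that belong to the motif; the central `gap` positions are ignored."""
    if gap < 0 or gap >= length or (length - gap) % 2 != 0:
        raise ValueError("the gap must leave two arms of equal length")
    arm = (length - gap) // 2
    return [i < arm or i >= arm + gap for i in range(length)]

def symmetrize_palindrome(counts):
    """Pool each position with the complemented counts of its mirror position."""
    n = len(counts)
    return [
        {r: counts[i][r] + counts[n - 1 - i][COMPLEMENT[r]] for r in "ACGT"}
        for i in range(n)
    ]

def symmetrize_direct_repeat(counts):
    """Pool position i of the first half with position i of the second half."""
    if len(counts) % 2 != 0:
        raise ValueError("this sketch handles only an even, ungapped length")
    half = len(counts) // 2
    pooled = [
        {r: counts[i][r] + counts[i + half][r] for r in "ACGT"}
        for i in range(half)
    ]
    return pooled + [dict(col) for col in pooled]

# A 15-position site with a 3-position central gap: two arms of 6 positions.
print(spaced_mask(15, 3))
```

Pooling the counters halves the number of free parameters of a symmetric motif, which is one reason a structured search can be more sensitive than an unconstrained one.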

The major parameters
– The preferred structure of the motif.
– The prior probability that a sequence is garbage, i.e. that it contains no motif site.

Postprocessing procedures
The software can scan all the input sequences with the obtained motif profile, gathering all sites that score better than the worst site in the obtained set. This procedure is very useful for adjusting the garbage prior. The sequence collection can also be output with the found sites masked, so that another motif can be searched for in the same data.
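Both postprocessing procedures can be sketched in a few lines of Python. The sketch assumes a log-odds score, a single strand and a PPM without zero entries. It keeps windows that score at least as well as the worst site of the obtained set, so that the set itself is recovered. All names and the toy profile are hypothetical.

```python
import math

def log_odds(window, ppm, bg):
    """Log2 likelihood ratio of a window under the motif model versus the background."""
    return sum(math.log2(ppm[i][r] / bg[r]) for i, r in enumerate(window))

def scan(sequences, ppm, bg, found_sites):
    """Return (sequence index, start, score) for every window at or above the worst found site."""
    length = len(ppm)
    threshold = min(log_odds(site, ppm, bg) for site in found_sites)
    hits = []
    for k, seq in enumerate(sequences):
        for start in range(len(seq) - length + 1):
            score = log_odds(seq[start:start + length], ppm, bg)
            if score >= threshold:
                hits.append((k, start, score))
    return hits

def mask_sites(sequences, hits, length, mask_char="N"):
    """Replace every reported site by mask characters before searching for another motif."""
    masked = [list(seq) for seq in sequences]
    for k, start, _score in hits:
        masked[k][start:start + length] = mask_char * length
    return ["".join(chars) for chars in masked]

# Hypothetical two-position profile and data.
toy_ppm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
toy_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_hits = scan(["GGATGG", "ACAC"], toy_ppm, toy_bg, ["AT", "AC"])
print(mask_sites(["GGATGG", "ACAC"], toy_hits, 2))
```

Comparing the number of sequences that receive a hit with the total number of sequences gives a rough empirical check on the chosen garbage prior.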

Application: ArcA signal
In Escherichia coli, gene expression depends on redox conditions, and this dependence is partially mediated by the Arc signal transduction system. The phosphorylated form of the ArcA protein (ArcA-P) represses certain target operons (e.g. icd, lld, sdh and sodA) and activates others (e.g. cyd and pfl) by interacting with promoter DNA. We used the tool to search for a common motif in the upstream regions of genes listed as ArcA-regulated in the DPInteract database.

The motif search was configured to look for a possibly spaced direct repeat of length between 6 and 22 bases located on either DNA strand. As a result, a 15-nucleotide motif was obtained, which refines the known structure of the ArcA binding site.

The refinement is a result of looking for a motif of a definite structure.

Recognition rule
The set of sites found was used to create a PWM. Genome Explorer software was used in all comparative genomics studies. We looked for E. coli genes with an upstream ArcA box scoring better than 4.25 and with at least two orthologs in Y. pestis, P. multocida and V. vulnificus that carry an upstream ArcA box scoring at least 4.00.
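The rule can be written as a small predicate. The sketch below assumes that the best upstream PWM score has already been computed for every gene and for its orthologs; the data layout, the function name and the example scores are hypothetical and are not taken from Genome Explorer.

```python
REFERENCE_CUTOFF = 4.25  # E. coli box must score better than this
ORTHOLOG_CUTOFF = 4.00   # ortholog boxes must score at least this
MIN_ORTHOLOGS = 2        # out of Y. pestis, P. multocida and V. vulnificus

def passes_rule(ecoli_score, ortholog_scores):
    """ortholog_scores maps a genome name to the best upstream score, or None if no ortholog."""
    if ecoli_score is None or ecoli_score <= REFERENCE_CUTOFF:
        return False
    supported = sum(
        1
        for score in ortholog_scores.values()
        if score is not None and score >= ORTHOLOG_CUTOFF
    )
    return supported >= MIN_ORTHOLOGS

# Hypothetical scores for one candidate gene.
example = {"Y. pestis": 4.10, "P. multocida": None, "V. vulnificus": 4.00}
print(passes_rule(4.60, example))
```

Requiring conserved boxes upstream of orthologs is what makes the rule a comparative-genomics filter and not a single-genome scan.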

Test result
The search identified 23 E. coli genes. One of them is the gene of the ArcA protein itself. Of these 23 genes, 14 are mentioned in the literature as regulated in an oxygen-dependent manner.

Test interpretation
The probability of the null hypothesis that the recognition rule selects genes at random can be evaluated using an upper estimate of 500 oxygen-dependent genes among the 4404 genes of the complete E. coli genome. Fisher's exact test for the “14 9 // ” 2×2 contingency table gives a null-hypothesis probability of about 2×10⁻⁷, so the null hypothesis can be reliably rejected.
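A one-sided Fisher's exact test of this kind is the upper tail of a hypergeometric distribution, which can be computed with the Python standard library alone. The sketch below uses only the totals stated in the text (23 selected genes, 14 of them oxygen-dependent, 500 oxygen-dependent genes among 4404). How the original table was filled and whether the test was one-sided or two-sided are assumptions here, so the printed value is an order-of-magnitude check and need not reproduce the quoted figure.

```python
from math import comb

def hypergeom_upper_tail(k, n_drawn, n_success, n_total):
    """P(X >= k) when n_drawn genes are drawn without replacement from n_total genes,
    n_success of which carry the property of interest."""
    denominator = comb(n_total, n_drawn)
    upper = min(n_drawn, n_success)
    numerator = sum(
        comb(n_success, x) * comb(n_total - n_success, n_drawn - x)
        for x in range(k, upper + 1)
    )
    return numerator / denominator

# Totals taken from the text; the tail probability is computed, not quoted.
p_value = hypergeom_upper_tail(k=14, n_drawn=23, n_success=500, n_total=4404)
print(f"one-sided p-value: {p_value:.3g}")
```

Under random selection only about 23 × 500 / 4404 ≈ 2.6 of the 23 genes would be expected to be oxygen-dependent, which is why observing 14 is so unlikely.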

Favorov, A.V.¹, Gelfand, M.S.¹,²,³, Gerasimova, A.V.¹, Mironov, A.A.¹,³, Makeev, V.J.¹,⁴
¹ State Scientific Centre “GosNIIGenetica”, 1st Dorozhny pr., Moscow, Russia.
² Institute for Information Transmission Problems, Russian Academy of Sciences, Bolshoi Karetny per. 19, Moscow, Russia.
³ Dept of Bioengineering and Bioinformatics, Moscow State University, Lab. Bldg B, Vorobiovy Gory 1-37, Moscow, Russia.
⁴ Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilova 32, Moscow, Russia.

We are grateful to Dmitriy Rodionov for useful discussion, to Ludmila Danilova for assistance with the data, to Jeeping Weng for advice and to Valentina Boeva for help with the presentation. This study was partially supported by grants from the Howard Hughes Medical Institute (to M. Gelfand), from the Ludwig Cancer Research Institute (CRDF RBO-1268 to M. Gelfand), from the Russian Fund of Basic Research (to V. Makeev) and from the Program in Molecular and Cellular Biology of the Russian Academy of Sciences (to V.G. Tumanian).