Using Mixed Length Training Sequences in Transcription Factor Binding Site Detection Tools
Nathan Snyder, Carnegie Mellon University
BioGrid REU 2009, University of Connecticut

Transcription Factors
Transcription factors regulate DNA transcription

Transcription Factor Binding Site Detection Algorithms
Training sequences: AGATCGTT, ACATGATT, TGATGGAT
Genetic region to search: ATCGTCGATGCTGAGATGTCTATCGTAGCTAGTC
Highest scoring sequence in that region: AGATGTCT
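The train-then-scan idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the talk: it assumes equal-length training sequences and scores each window by summing per-position nucleotide frequencies. The function names (`position_frequencies`, `score`, `best_window`) are hypothetical.

```python
from collections import Counter

def position_frequencies(sites):
    """Per-position nucleotide frequencies from equal-length training sites."""
    n = len(sites)
    return [{base: count / n for base, count in Counter(column).items()}
            for column in zip(*sites)]

def score(freqs, candidate):
    """Sum, over positions, the training frequency of the candidate's base."""
    return sum(col.get(base, 0.0) for col, base in zip(freqs, candidate))

def best_window(freqs, region):
    """Slide a window of the model's width over the region; return the top scorer."""
    width = len(freqs)
    windows = [region[i:i + width] for i in range(len(region) - width + 1)]
    return max(windows, key=lambda w: score(freqs, w))

training = ["AGATCGTT", "ACATGATT", "TGATGGAT"]
region = "ATCGTCGATGCTGAGATGTCTATCGTAGCTAGTC"
best = best_window(position_frequencies(training), region)
print(best)
```

With the slide's own training sequences and search region, this simple scorer picks AGATGTCT, the same window the slide reports.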

Assessment by Osada et al.
Compared various transcription factor binding site detection algorithms:
Consensus: builds a consensus sequence based on the training data
PSSM: makes a scoring matrix based on the logs of nucleotide frequencies
Berg and von Hippel: like PSSM, but with nucleotide counts instead of frequencies
Centroid: sum of position-specific frequencies
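As a rough illustration of the PSSM idea (a scoring matrix built from logs of nucleotide frequencies), the sketch below builds a log-odds matrix from equal-length sites. The pseudocount of 1 and the uniform background of 0.25 are assumptions made here for the example and are not taken from Osada et al.; the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds matrix: log(smoothed frequency / background) per position and base.

    The pseudocount and uniform background are illustrative assumptions.
    """
    n = len(sites)
    denom = n + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        pssm.append({base: math.log(((counts[base] + pseudocount) / denom) / background)
                     for base in ALPHABET})
    return pssm

def pssm_score(pssm, candidate):
    """Sum the log-odds entries selected by the candidate's bases."""
    return sum(col[base] for col, base in zip(pssm, candidate))

pssm = build_pssm(["AGATCGTT", "ACATGATT", "TGATGGAT"])
print(pssm_score(pssm, "AGATGGTT"), pssm_score(pssm, "CCCCCCCC"))
```

A candidate that follows the majority base at every position scores well above one that matches almost nowhere, which is the behaviour a scanner relies on.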

The Same Length Training Sequence Assumption
Example set of known binding sites from TRANSFAC:
ACATTTAACTGGTTAATTGA
ATAACCCAAT
TTAATCCGTT
ACCGGGTTGC
TCGAAGGGATTAG
ACTGGGTTAT
TTAACCCGTTT
TTAGCGGCATAAAAGGGTTAAACAGG
AATGCGCGCCCATAAAAGGGTTAAG

Project Goal
Modify the tools evaluated by Osada et al. to handle training sets with varying sequence lengths while still performing well

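Overall Strategy
Step 1: Alignment
Training sequences before alignment: AGCTTTCA, ACCTTTGGAC, GTAACTTTCA
Training sequences after alignment (column offsets not preserved in this transcript): AGCTTTCA, ACCTTTGGAC, GTAACTTTCA
Step 2: Scoring
Genetic region to search: ACTGAGTCGATAATTTTGAACTG
Highest scoring sequence in that region: AATTTTGA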
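The two steps can be sketched as follows. This is only one plausible reading of the strategy, not the MLCentroid implementation: it assumes a greedy, ungapped alignment in which each sequence is slid against the columns built so far, and it assumes frequencies are taken over all sequences so that sparsely covered end columns contribute less. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def align_ungapped(sequences):
    """Step 1 (assumed heuristic): place each sequence at the offset that
    best matches the column counts accumulated from earlier sequences."""
    columns = defaultdict(Counter)   # column index -> base counts
    offsets = []
    for seq in sequences:
        if not columns:
            best = 0                 # first sequence anchors the alignment
        else:
            lo = min(columns) - len(seq) + 1
            hi = max(columns)

            def overlap_score(offset):
                return sum(columns[offset + i][base]
                           for i, base in enumerate(seq)
                           if (offset + i) in columns)

            best = max(range(lo, hi + 1), key=overlap_score)
        for i, base in enumerate(seq):
            columns[best + i][base] += 1
        offsets.append(best)
    return offsets, columns

def profile_frequencies(columns, n_sequences):
    """Turn column counts into frequencies over the full aligned width."""
    return [{base: count / n_sequences for base, count in columns[k].items()}
            for k in range(min(columns), max(columns) + 1)]

def best_window(profile, region):
    """Step 2: score every window of the profile's width; return the best."""
    width = len(profile)

    def score(window):
        return sum(col.get(base, 0.0) for col, base in zip(profile, window))

    windows = [region[i:i + width] for i in range(len(region) - width + 1)]
    return max(windows, key=score)

offsets, columns = align_ungapped(["AGCTTTCA", "ACCTTTGGAC", "GTAACTTTCA"])
profile = profile_frequencies(columns, 3)
print(offsets, len(profile))
```

Under this heuristic the three example sequences line up on their shared CTTT core, giving a 12-column profile that can then be used to scan a region exactly as in the fixed-length case.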

MLCentroid
Applies this strategy to the Centroid algorithm
Centroid was chosen for its strong performance, more efficient execution, and ease of implementation
The same techniques could be readily applied to any of the other algorithms

Running Time Issues
First version: O(c * L^numseqs)
Second version: O(c * L * numseqs^2)
Quadratic is MUCH better than exponential!
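To make the gap concrete, the short sketch below evaluates both growth terms for a few training set sizes. The constant c is dropped and L = 10 is a purely hypothetical value chosen for illustration; the function names are likewise made up here.

```python
def first_version_cost(L, numseqs):
    """Growth term of the first version: L ** numseqs (constant c omitted)."""
    return L ** numseqs

def second_version_cost(L, numseqs):
    """Growth term of the second version: L * numseqs ** 2 (constant c omitted)."""
    return L * numseqs ** 2

L = 10  # hypothetical value, for illustration only
for numseqs in (2, 5, 10, 20):
    print(numseqs, first_version_cost(L, numseqs), second_version_cost(L, numseqs))
```

With ten training sequences the first term is already 10^10 while the second is 1,000, which is why the exponential version quickly becomes impractical.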

Method of Testing
Leave-one-out testing similar to that used by Osada et al.
Counts the number of sequences that score higher than the held-out one
The data sets for Drosophila melanogaster from Tompa et al. were used
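A leave-one-out loop of this kind might look like the sketch below. It is a simplified stand-in: it assumes equal-length sites, a frequency-sum scorer rather than MLCentroid, and a single hypothetical background string in which competing windows are counted.

```python
from collections import Counter

def position_frequencies(sites):
    """Per-position nucleotide frequencies from equal-length sites."""
    n = len(sites)
    return [{base: count / n for base, count in Counter(column).items()}
            for column in zip(*sites)]

def score(freqs, candidate):
    """Sum of the training frequencies of the candidate's bases."""
    return sum(col.get(base, 0.0) for col, base in zip(freqs, candidate))

def leave_one_out_ranks(sites, background):
    """For each site: train on the others, then count background windows
    that score strictly higher than the held-out site (0 is the best rank)."""
    ranks = []
    for i, held_out in enumerate(sites):
        freqs = position_frequencies(sites[:i] + sites[i + 1:])
        target = score(freqs, held_out)
        width = len(held_out)
        higher = sum(1 for j in range(len(background) - width + 1)
                     if score(freqs, background[j:j + width]) > target)
        ranks.append(higher)
    return ranks

# Toy data, invented for this example
ranks = leave_one_out_ranks(["AAAA", "AAAA", "AAAT"], "CCCCAAAACCCC")
print(ranks)
```

A rank of 0 means no background window outscored the held-out site; larger counts indicate that the model trained on the remaining sites would have ranked other positions ahead of the true one.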

Experimental Results

Future Work
Better alignment scoring schemes
Modify and test PSSM, Berg and von Hippel, and Consensus
Incorporate these techniques into de novo motif discovery algorithms
Incorporate sequence structure into alignment

References
Timothy L. Bailey, Nadya Williams, Chris Misleh and Wilfred W. Li: MEME: discovering and analyzing DNA and protein sequence motifs, Nucleic Acids Research, 2006, Vol. 34, Web Server issue, pages W369–W373
O. Berg and P. von Hippel: Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters, Journal of Molecular Biology, 1987, 193
W.H. Day and F. McMorris: Critical comparison of consensus methods for molecular sequences, Nucleic Acids Research, 20, 1992, pages 1093–1099
Charles E. Lawrence and Andrew A. Reilly: An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences, PROTEINS: Structure, Function, and Genetics, pages 41–51
Robert Osada, Elena Zaslavsky and Mona Singh: Comparative analysis of methods for representing and searching for transcription factor binding sites, Bioinformatics, Vol. 20, pages 3516–3525
Giulio Pavesi, Giancarlo Mauri and Graziano Pesole: An algorithm for finding signals of unknown length in DNA sequences, Bioinformatics, Vol. 17 Suppl., pages S207–S214
Giulio Pavesi, Paolo Mereghetti, Giancarlo Mauri and Graziano Pesole: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes, Nucleic Acids Research, Vol. 32, Web Server issue, 2004
Tompa et al.: Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, Vol. 23, no. 1, January 2005

Questions?