Computational Genomics and Proteomics, Lecture 8: Motif Discovery. Centre for Integrative Bioinformatics VU.



Outline
- Gene Regulation
  - DNA
  - Transcription factors
- Motifs
  - What are they?
  - Binding Sites
- Combinatorial Approaches
  - Exhaustive searches
  - Consensus
- Comparative Genomics
  - Example
- Probabilistic Approaches
  - Statistics
  - EM algorithm
  - Gibbs Sampling

Four DNA nucleotide building blocks
- G-C is more strongly hydrogen-bonded than A-T (three hydrogen bonds per G-C pair versus two per A-T pair)

Degenerate code
- Four bases: A, C, G, T
- Two-fold degenerate IUB codes:
  - R = [AG] (purines)
  - Y = [CT] (pyrimidines)
  - K = [GT]
  - M = [AC]
  - S = [GC]
  - W = [AT]
- Four-fold degenerate: N = [AGCT]
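The IUB codes above map directly onto regular-expression character classes. The following is a minimal Python sketch of that idea, not part of the lecture; the function names `iub_to_regex` and `find_sites` are hypothetical, and only the codes listed on the slide are included.

```python
import re

# IUB codes listed on the slide (a subset of the full code table).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]",
    "M": "[AC]", "S": "[GC]", "W": "[AT]",
    "N": "[ACGT]",
}

def iub_to_regex(pattern):
    """Translate a degenerate pattern such as 'TATRNT' into a regular expression."""
    return "".join(IUB[ch] for ch in pattern.upper())

def find_sites(pattern, sequence):
    """Return the 0-based start positions of all matches, including overlapping ones."""
    # The lookahead matches without consuming characters, so overlaps are reported.
    regex = re.compile("(?=" + iub_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]
```

For example, `find_sites("TATRNT", "GGTATAATCC")` scans the sequence for any six-letter word that fits the degenerate pattern.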

Transcription Factors
- Required but not a part of the RNA polymerase complex
- Many different roles in gene regulation:
  - Binding
  - Interaction
  - Initiation
  - Enhancing
  - Repressing
- Various structural classes (e.g. zinc finger domains)
- Consist of both a DNA-binding domain and an interactive domain

Motifs
- Short sequences of DNA or RNA (or amino acids)
- Often consist of nucleotides
- May contain gaps
- Examples include:
  - Splice sites
  - Start/stop codons
  - Transmembrane domains
  - Centromeres
  - Phosphorylation sites
  - Coiled-coil domains
  - Transcription factor binding sites (TFBS, i.e. regulatory motifs)

TFBSs  Difficult to identify  Each transcription factor may have more than one binding site  Degenerate  Most occur upstream of translation start site (TSS) but are known to also occur in:  introns  exons  3’ UTRs  Usually occur in clusters, i.e. collections of sites within a region (modules)  Often repeated  Sites can be experimentally verified

Why are TFBSs important?
- Aid in identification of gene networks/pathways
- Determine correct network structure
- Drug discovery
- Switch production of gene product on/off

Consensus sequences  Matches all of the example sequences closely but not exactly  A single site TACGAT  A set of sites: TACGAT TATAAT GATACT TATGAT TATGTT  Consensus sequence: TATAAT or TATRNT  Trade-off: number of mismatches allowed, ambiguity in consensus sequence and the sensitivity and precision of the representation.

Information Content and Entropy

Sequence Logos

Frequency Matrices
- Given a collection of motifs: TACGAT, TATAAT, GATACT, TATGAT, TATGTT
- Create the matrix of base frequencies (rows T, A, C, G) at each position
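As an illustration of how such a matrix can be built from the five sites listed above, here is a short Python sketch. It is not taken from the slides; the names `count_matrix` and `frequency_matrix` are hypothetical, and no pseudocounts are applied.

```python
from collections import Counter

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
BASES = "ACGT"

def count_matrix(sites):
    """Return {base: [count at position 0, 1, ...]} for equal-length sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    # One Counter per column of the aligned sites.
    columns = [Counter(s[i] for s in sites) for i in range(width)]
    return {b: [col[b] for col in columns] for b in BASES}

def frequency_matrix(sites):
    """Convert raw counts to relative frequencies (each column sums to 1)."""
    n = len(sites)
    return {b: [c / n for c in row] for b, row in count_matrix(sites).items()}

if __name__ == "__main__":
    for base, row in count_matrix(SITES).items():
        print(base, row)
```

Each column of the count matrix sums to the number of sites (five here), which is a quick sanity check on the construction.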

Position weight matrices

Finding Motifs
- Two problems:
  - Given a collection of known motifs, develop a representation of the motifs such that additional occurrences can reliably be identified in new promoter regions
  - Given a collection of genes thought to be related somehow, find the location of the motif common to all and a representation for it
- Two approaches:
  - Combinatorial
  - Probabilistic

Combinatorial Approach

Exhaustive Search

Sample-driven here refers to trying only the words that actually occur in the sequences, instead of exhaustively trying all 4^W possible words of length W.
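The difference between the two candidate sets can be sketched in Python as follows. This is an illustrative sketch only: the scoring rule (the number of sequences containing a word with at most a given number of mismatches) is an assumption chosen for simplicity, and all function names are hypothetical.

```python
from itertools import product

def exhaustive_words(w, alphabet="ACGT"):
    """Pattern-driven candidates: every one of the 4**w possible words."""
    return ("".join(p) for p in product(alphabet, repeat=w))

def sample_words(sequences, w):
    """Sample-driven candidates: only the words of length w present in the input."""
    return {seq[i:i + w] for seq in sequences for i in range(len(seq) - w + 1)}

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def support(word, sequences, max_mismatches=1):
    """Number of sequences containing the word with at most max_mismatches (assumed score)."""
    w = len(word)
    return sum(
        any(hamming(word, seq[i:i + w]) <= max_mismatches
            for i in range(len(seq) - w + 1))
        for seq in sequences
    )

def best_words(candidates, sequences, max_mismatches=1):
    """Return the top-scoring candidate words (sorted) and their support."""
    scored = {c: support(c, sequences, max_mismatches) for c in candidates}
    top = max(scored.values())
    return sorted(c for c, s in scored.items() if s == top), top
```

Passing `sample_words(sequences, w)` to `best_words` evaluates at most one candidate per window in the input, whereas `exhaustive_words(w)` always yields 4^W candidates regardless of input size. The sample-driven set can miss a consensus word that never occurs exactly in any sequence.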

Greedy Motif Clustering

Comparative Genomics
- Main idea: conserved non-coding regions are important
- Align the promoters of orthologous co-expressed genes from two (or more) species, e.g. human and mouse
- Search for TFBSs only in conserved regions
- Problems:
  - Not all regulatory regions are conserved
  - Which genomes to use?

Phylogenetic Footprinting
Phylogenetic footprinting refers to the task of finding conserved motifs across different species. Common ancestry and selection on these motifs have resulted in these "footprints".

Xie et al  Genome-wide alignments for four species (human, mouse, rat, dog)  Promoter regions and 3’UTRs then extracted for 17,700 well-annotated genes  Promoter region taken to be (-2000, 2000)  This set of sequences then searched exhaustively for motifs Phylogenetic Footprinting An Example Nature 434, , 2005

The Search Xie et al. 2005

Expected Rate

Probabilistic Approach

Gibbs Sampling (applied to Motif Finding)

Gibbs Sampling Algorithm

Gibbs Sampling – Motif Positions

AlignACE - Gibbs Sampling

Remainder of the lecture: Maximum likelihood and the EM algorithm. The remaining slides are for your information only and will not be part of the exam.

Basic Statistics

Maximum Likelihood Estimates

EM Algorithm

Basic idea (MEME)

Basic idea (MEME)
MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices, which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs. MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
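To make the idea of a position-dependent letter-probability matrix concrete, the sketch below scores words against a small matrix. It is not MEME's implementation and does not use the MEME software or its file formats: the three-column matrix, the uniform background, and the function names are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical 3-column letter-probability matrix; each column sums to 1.
MOTIF = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background letter frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def word_probability(word, motif):
    """Probability of the word under the motif: product of per-position probabilities."""
    return math.prod(col[ch] for col, ch in zip(motif, word))

def log_odds(word, motif, background):
    """Log2 ratio of the motif probability to the background probability of the word."""
    return sum(math.log2(col[ch] / background[ch]) for col, ch in zip(motif, word))

def best_site(sequence, motif, background):
    """Return the 0-based start of the highest-scoring window in the sequence."""
    w = len(motif)
    starts = range(len(sequence) - w + 1)
    return max(starts, key=lambda i: log_odds(sequence[i:i + w], motif, background))
```

Because the matrix has no gap columns, every window has exactly the motif width, which mirrors the statement above that individual motifs do not contain gaps.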

Basic MEME Model

MEME Background frequencies

MEME – Hidden Variable

MEME – Conditional Likelihood

EM algorithm

Example

E-step of EM algorithm

Example

M-step of EM Algorithm

Example

Characteristics of EM

Gibbs Sampling (versus EM)