CSCE555 Bioinformatics Lecture 11 Promoter Prediction


CSCE555 Bioinformatics Lecture 11: Promoter Prediction
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
HAPPY CHINESE NEW YEAR
University of South Carolina, Department of Computer Science and Engineering, 2008, www.cse.sc.edu

Outline: Introduction to DNA motifs. Motif representations (recap). Motif database search. Algorithms for motif discovery.

Search Space. Given N sequences, each of length L, and a motif of width W, the size of the search space is $(L - W + 1)^N$. For L = 100, W = 15, N = 10, the size is approximately $10^{19}$.
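The count above can be checked with a few lines of Python. This is only a sketch; the function name `search_space_size` is a hypothetical helper, not something from the lecture.

```python
def search_space_size(seq_length: int, motif_width: int, num_seqs: int) -> int:
    """Number of ways to choose one motif start position in each sequence."""
    starts_per_seq = seq_length - motif_width + 1
    if starts_per_seq < 1:
        raise ValueError("motif is wider than the sequence")
    return starts_per_seq ** num_seqs


# Values from the slide: L = 100, W = 15, N = 10
size = search_space_size(100, 15, 10)
print(f"{size:.2e}")  # on the order of 10^19
```

Exhaustive enumeration of that many alignments is out of reach, which is why stochastic searches such as Gibbs sampling are used.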

Worked Example. Motif positions 1 to 4; rows a, c, g, t; N = 4; background p_i = 1/4; counts c_ki (count matrix not shown). Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29
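A position-specific score of this kind is commonly computed as a sum of per-position log-odds values. The sketch below assumes that form, with a pseudocount added to avoid log(0). The counts in `example_counts` are hypothetical (the slide's own matrix is not available), so the numbers will not reproduce the 2.29 above.

```python
import math

BASES = "acgt"


def log_odds_matrix(counts, background=0.25, pseudocount=0.5):
    """Turn per-position counts c_ki into log2(frequency / background)."""
    width = len(counts["a"])
    matrix = []
    for k in range(width):
        column_total = sum(counts[b][k] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2((counts[b][k] + pseudocount) / column_total / background)
            for b in BASES
        })
    return matrix


def score_site(matrix, site):
    """Sum the log-odds contributions of each base in a candidate site."""
    return sum(column[base] for column, base in zip(matrix, site.lower()))


# Hypothetical counts for N = 4 aligned sites of width 4 (not from the slide)
example_counts = {
    "a": [4, 0, 1, 0],
    "c": [0, 3, 0, 1],
    "g": [0, 1, 3, 0],
    "t": [0, 0, 0, 3],
}
pwm = log_odds_matrix(example_counts)
print(round(score_site(pwm, "acgt"), 2))
```

Positive terms come from bases that are more frequent in the motif column than in the background; negative terms come from bases that are rarer.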

Gibbs Sampling Search. Suppose the search space is a 2D rectangle (typically, there are more than 2 dimensions!). Start at a random point X. Randomly pick a dimension. Look at all points along this dimension. Move to one of them randomly, with probability proportional to its score π. Repeat.

Gibbs Sampling for Motif Search. Choose a random starting state. Randomly pick a sequence. Look at all motif positions in this sequence. Pick one randomly, with probability proportional to exp(score). Repeat.
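The steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation. It assumes the score of a candidate position is a natural-log odds score against a uniform background, computed from the motif instances currently chosen in the other sequences. All function names and parameter defaults are hypothetical.

```python
import math
import random

BASES = "ACGT"


def build_log_odds(sites, width, pseudocount=1.0, background=0.25):
    """Natural-log odds matrix estimated from the current motif instances."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for k in range(width):
        column = {}
        for base in BASES:
            count = sum(1 for site in sites if site[k] == base)
            column[base] = math.log((count + pseudocount) / total / background)
        matrix.append(column)
    return matrix


def score(matrix, window):
    return sum(column[base] for column, base in zip(matrix, window))


def gibbs_motif_search(seqs, width, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Choose a random starting state: one motif start per sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        # Randomly pick a sequence.
        i = rng.randrange(len(seqs))
        # Assumption: model is built from the other sequences' current sites.
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, positions)) if j != i]
        matrix = build_log_odds(others, width)
        # Look at all motif positions in this sequence.
        n_starts = len(seqs[i]) - width + 1
        weights = [math.exp(score(matrix, seqs[i][p:p + width]))
                   for p in range(n_starts)]
        # Pick one randomly, proportional to exp(score).
        positions[i] = rng.choices(range(n_starts), weights=weights)[0]
    return positions
```

Calling `gibbs_motif_search(seqs, width=6)` on a list of uppercase DNA strings returns one start position per sequence. A single run can get stuck in a poor local optimum, so in practice several runs with different seeds are usually compared.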

Does it Work in Practice? Only successful cases get published! It seems more successful in microbes (bacteria & yeast) than in animals. The search algorithm seems to work quite well; the problem is the scoring scheme: real motifs often don’t have higher scores than you would find in random sequences by chance, i.e. the needle looks like hay. Attempts to deal with this: assume the motif is an inverted palindrome (they often are); only analyze sequence regions that are conserved in another species (e.g. human vs. mouse). As usual, repetitive sequences cause problems. A more powerful algorithm: MEME.

Go to our MEME server: http://molgen.biol.rug.nl/meme/website/meme.html Fill in your email address and a description of the sequences. Open the FASTA-formatted file you just saved with Genome2d (click “Browse”). Select the number of motifs, the number of sites and the optimum width of the motif. Click “Search given strand only”. Click “Start search”.

Something like this will appear in your email. The results are quite self-explanatory.

Promoter Prediction. What are promoters? Three strategies for promoter prediction: signal-based; comparative genomics/phylogenetic footprinting; expression-profile-based de novo motif discovery algorithms.

What is a Promoter? Region of gene that binds RNA polymerase and transcription factors to initiate transcription

Promoters: What signals are there? Simple ones in prokaryotes. (Promoter Prediction (really), 10/26/05, D Dobbs ISU - BCB 444/544X)

Prokaryotic promoters. The RNA polymerase complex recognizes promoter sequences located very close to, and on the 5’ side (“upstream”) of, the initiation site. The RNA polymerase complex binds directly to these, with no requirement for “transcription factors”. Prokaryotic promoter sequences are highly conserved: the -10 region and the -35 region.
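Because the -35 and -10 regions are well conserved, a first-pass scan can simply look for two hexamers separated by a fixed-range spacer. The sketch below assumes the commonly cited E. coli σ70 consensus hexamers TTGACA (-35) and TATAAT (-10) with a 16 to 18 bp spacer. These values, the mismatch tolerance and the function names are assumptions for illustration, not taken from the slides.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_like(seq, max_mismatches=1, spacers=(16, 17, 18)):
    """Return (start_of_-35, start_of_-10, spacer, total_mismatches) tuples."""
    minus35, minus10 = "TTGACA", "TATAAT"  # assumed consensus hexamers
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 6 + 1):
        m35 = mismatches(seq[i:i + 6], minus35)
        if m35 > max_mismatches:
            continue
        for gap in spacers:
            j = i + 6 + gap
            if j + 6 > len(seq):
                continue
            m10 = mismatches(seq[j:j + 6], minus10)
            if m10 <= max_mismatches:
                hits.append((i, j, gap, m35 + m10))
    return hits


# Synthetic test sequence with a perfect -35 / 17 bp spacer / -10 arrangement
example = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"
print(find_sigma70_like(example))
```

Allowing more mismatches quickly increases the number of chance hits, so a scan like this is a starting point rather than a prediction.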

What signals are there? Complex ones in eukaryotes.

Eukaryotic genes are transcribed by 3 different RNA polymerases, which recognize different types of promoters & enhancers.

Eukaryotic promoters & enhancers. Promoters are located “relatively” close to the initiation site (but can be located within the gene, rather than upstream!). Enhancers are also required for regulated transcription (these control expression in specific cell types and developmental stages, and in response to the environment). RNA polymerase complexes do not specifically recognize promoter sequences directly. Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes.

Eukaryotic transcription factors. Transcription factors (TFs) are DNA-binding proteins that also interact with the RNA polymerase complex to activate or repress transcription. TFs contain characteristic “DNA binding motifs”: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039 TFs recognize specific short DNA sequence motifs, “transcription factor binding sites”. Several databases exist for these, e.g. TRANSFAC: http://www.generegulation.com/cgibin/pub/databases/transfac

Zinc finger-containing transcription factors. Common in eukaryotic proteins. An estimated 1% of mammalian genes encode zinc-finger proteins. In C. elegans, there are 500! Can be used as highly specific DNA binding modules. Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy.

Predicting Promoters. Overview of strategies: What sequence signals can be used? What other types of information can be used? Algorithms. Promoter prediction software: 3 major types; many, many programs.

Promoter prediction: Eukaryotes vs prokaryotes. Promoter prediction is easier in microbial genomes. Why? Highly conserved promoters; simpler gene structures; more sequenced genomes (for comparative approaches)! Methods? Previously, again mostly HMM-based. Now: similarity-based, comparative methods (because so many genomes are available) and de novo motif discovery.

Predicting promoters: Steps & Strategies. Closely related to gene prediction. Obtain genomic sequence. Use sequence-similarity based comparison (BLAST, MSA) to find related genes. But: "regulatory" regions are much less well-conserved than coding regions. Locate ORFs. Identify TSS (if possible!). Use promoter prediction programs (e.g. FirstEF). Analyze motifs, etc. in sequence (TRANSFAC).

Automated promoter prediction strategies. Pattern-driven algorithms. Sequence-similarity based algorithms. Combined "evidence-based". BEST RESULTS? Combined, sequential.

1: Promoter Prediction: Pattern-driven algorithms. Success depends on the availability of collections of annotated binding sites (TRANSFAC & PROMO). These algorithms tend to produce huge numbers of false positives (FPs). Why? Binding sites (BS) for specific TFs are often variable. Binding sites are short (typically 5-15 bp). Interactions between TFs (& other proteins) influence affinity & specificity of TF binding. One binding site is often recognized by multiple TFs. Biology is complex: promoters are often specific to organism/cell/stage/environmental condition.

Solutions to problem of too many FP predictions? Take sequence context/biology into account. Eukaryotes: clusters of TFBSs are common. Prokaryotes: knowledge of σ factors helps. Probability of a "real" binding site increases if an annotated transcription start site (TSS) is nearby. But: What about enhancers (no TSS nearby!)? And only a small fraction of TSSs have been experimentally mapped. Other signals: CpG islands before the promoter, around the TSS; TATA box, CCAAT box. Content information: hexamer frequency.
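As one concrete example of a context signal, a window can be tested for CpG-island character. The sketch below uses widely quoted thresholds (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6). Those thresholds and the function names are assumptions for illustration and are not stated on the slides.

```python
def cpg_stats(window: str):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA window."""
    w = window.upper()
    n = len(w)
    if n == 0:
        return 0.0, 0.0
    c, g = w.count("C"), w.count("G")
    observed = w.count("CG")       # CpG dinucleotides on this strand
    expected = c * g / n           # expectation if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio


def looks_like_cpg_island(window, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, GC-content and CpG-ratio thresholds."""
    gc_fraction, ratio = cpg_stats(window)
    return len(window) >= min_len and gc_fraction > min_gc and ratio > min_ratio


print(looks_like_cpg_island("CG" * 100))  # synthetic CpG-rich window
print(looks_like_cpg_island("AT" * 100))  # synthetic AT-only window
```

In a genome scan the same test would be applied to sliding windows, and a predicted binding site falling inside a flagged window would be given more weight.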

Why can't we rely on consensus sequences? The Inr (Initiator) consensus sequence will appear once every 512 bp in random sequences. For the TATA box, it is once every 120 bp. Short sequence patterns can appear by chance with high likelihood (false positives).
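The chance frequency of a consensus pattern can be estimated by multiplying the per-position match probabilities. The sketch below assumes uniform, independent base frequencies and a single strand. The example patterns are illustrative and will not necessarily reproduce the 512 bp and 120 bp figures above, which depend on the exact consensus or matrix definition used.

```python
from fractions import Fraction

# IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(consensus: str) -> Fraction:
    """Probability that a random position matches, with each base at 1/4."""
    p = Fraction(1)
    for symbol in consensus.upper():
        p *= Fraction(len(IUPAC[symbol]), 4)
    return p


def expected_spacing(consensus: str) -> Fraction:
    """Average number of bp between chance matches on one strand."""
    return 1 / match_probability(consensus)


print(expected_spacing("TATAAA"))   # a strict 6-mer
print(expected_spacing("YYANWYY"))  # a degenerate 7-mer (illustrative pattern)
```

The more degenerate the pattern, the shorter the expected spacing, so a loose consensus matches very often in a genome-sized sequence.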

2: Promoter Prediction: Phylogenetic Footprinting. Assumption: common functionality can be deduced from sequence conservation. Comparative promoter prediction ("phylogenetic footprinting"): rVista, ConSite, PromH, FootPrinter. For comparative (phylogenetic) methods: Must choose appropriate species. Different genomes evolve at different rates. Classical alignment methods have trouble with translocations and inversions in the order of functional elements. If background conservation of the entire region is high, comparison is useless. Not enough data (Prokaryotes >>> Eukaryotes). Biology is complex: many (most?) regulatory elements are not conserved across species!
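The core of a footprinting scan is finding windows of an alignment that are more conserved than their surroundings. A minimal sketch, assuming two already-aligned sequences of equal length with "-" for gaps, is given below. The window size, identity cutoff and function names are hypothetical choices, not those of rVista or the other tools listed.

```python
def window_identity(aln_a: str, aln_b: str, window: int = 10):
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    identities = []
    for i in range(len(aln_a) - window + 1):
        matches = sum(
            1 for x, y in zip(aln_a[i:i + window], aln_b[i:i + window])
            if x == y and x != "-"
        )
        identities.append(matches / window)
    return identities


def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Start indices of windows at or above the identity cutoff."""
    return [i for i, ident in enumerate(window_identity(aln_a, aln_b, window))
            if ident >= min_identity]


# Synthetic alignment: first 10 columns identical, last 10 completely different
a = "ACGTACGTAC" + "TTTTTTTTTT"
b = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(a, b))
```

If nearly every window passes the cutoff, the two species are too close for the comparison to be informative, which is the background-conservation problem noted above.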

3: Promoter Prediction: Co-expression based algorithms. Problems: Need sets of co-regulated genes. Genes experimentally determined to be co-regulated (using microarrays??). Careful: how is co-regulation determined? Alignments of co-regulated genes should highlight elements involved in regulation. Algorithms: MEME, AlignACE, PhyloCon.

Examples of promoter prediction/characterization software: MATCH, MatInspector, TRANSFAC, MEME & MAST, BLAST, etc. Others? FirstEF, Dragon Promoter Finder (these are links in PPTs); also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc.); JASPAR.

TRANSFAC matrix entry for TATA box. Fields: Accession & ID; brief description; TFs associated with this entry; weight matrix; number of sites used to build (how many here?); other info.

Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)

Check out optional review & try associated tutorial: Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287 http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html Check this out: http://www.phylofoot.org/NRG_testcases/

Annotated lists of promoter databases & promoter prediction software. URLs from Mount Chp 9, available online, Table 9.12: http://www.bioinformaticsonline.org/links/ch_09_t_2.html Table in Wasserman & Sandelin Nat Rev Genet article: http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm URLs for Baxevanis & Ouellette, Chp 5: http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links More lists: http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104 http://www3.oup.co.uk/nar/database/subcat/1/4/

Summary. Promoters & gene regulation. 3 types of methods for promoter prediction. Many programs have sensitivity and specificity less than 0.5. Integrative algorithms are more promising.
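For reference, the two measures can be computed from prediction counts as sketched below. The slides do not define them. This sketch assumes sensitivity = TP / (TP + FN) and, as is common in the gene and promoter prediction literature, "specificity" = TP / (TP + FP). The counts are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real promoters that were predicted."""
    return tp / (tp + fn)


def specificity_ppv(tp: int, fp: int) -> float:
    """Fraction of predictions that are real (assumed definition)."""
    return tp / (tp + fp)


# Hypothetical evaluation: 100 known promoters, 160 predictions, 40 correct
tp, fp, fn = 40, 120, 60
print(sensitivity(tp, fn), specificity_ppv(tp, fp))
```

With these made-up counts both values fall below 0.5, the situation the summary describes.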

Acknowledgement Zhiping Weng (Boston Uni.)