Searching for Transcription Start Sites in Drosophila


Searching for Transcription Start Sites in Drosophila Wilson Leung 08/2015

Outline
Transcription start site (TSS) annotation goals
Promoter architecture in D. melanogaster
Find the initial transcribed exon
Annotate putative transcription start sites
RNA PolII ChIP-Seq
Search for core promoter motifs
Additional TSS annotation resources

Structure of a typical mRNA Pesole G. et al. Untranslated regions of mRNAs. Genome Biology. 2002: 3(3) reviews0004.1-reviews0004.10.

Muller F element, heterochromatic, and euchromatic genes show similar expression levels
F element genes show lower expression levels in S2 (late embryo) cells but no difference in BG3 (neuronal) cells.
See Dr. Elgin’s presentation on the private wiki for a more detailed explanation of the scientific goals.
Riddle NC, et al. Enrichment of HP1a on Drosophila chromosome 4 genes creates an alternate chromatin structure critical for regulation in this heterochromatic domain. PLoS Genet. 2012 Sep;8(9):e1002954.

Annotate transcription start sites (TSS)
Research goal: Identify motifs that enable Muller F element genes to function within a heterochromatic environment
Annotation goals:
  Define search regions enriched in regulatory motifs
    Define the precise location of the TSS if possible
    Define search regions where TSS could be found
  Document the evidence used to support the TSS annotations
Detailed documentation allows us to prioritize the list of TSS candidates.
We are in the process of analyzing the GEP submissions from this past year.

Estimated evolutionary distances with respect to D. melanogaster
Species            Substitutions per neutral site
D. ficusphila      0.80
D. eugracilis      0.76
D. biarmipes       0.70
D. takahashii      0.65
D. elegans         0.72
D. rhopaloa        0.66
D. kikkawai        0.89
D. bipectinata     0.99
For comparison: D. melanogaster subgroup = 0.4ss; D. ananassae = 1.3ss
Data from table 1 of the modENCODE comparative genomics whitepaper
[Figure: phylogenetic tree of D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. bipectinata, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi. Legend: GEP annotation projects; new species sequenced by modENCODE; purple = priority 1 species in the modENCODE white paper.]

TSS annotation resources
Walkthroughs: Annotation of Transcription Start Sites in Drosophila
Reference: TSS Annotation Workflow (a new parameters page has been added)
GEP Annotation Report: TSS Report Form (sample report for onecut)
Additional evidence tracks:
  RNA PolII ChIP-Seq (D. biarmipes only)
  D. mel Transcripts
  Conservation (Multiz alignments)
Additional curriculum on motif finding is also available under the “Beyond annotation” section

TSS annotation workflow
Identify the ortholog
Note the gene structure in D. melanogaster
Annotate the coding exons
Classify the type of core promoter in D. melanogaster
Annotate the initial transcribed exon
Identify any core promoter motifs
Define TSS positions or TSS search regions
Annotation is based on parsimony with D. melanogaster.

Evidence for TSS annotations (in general order of importance)
Experimental data
  RNA-Seq
  RNA Pol II ChIP-Seq (D. biarmipes only)
Conservation
  Type of TSS (peaked versus broad) in D. melanogaster
  Sequence similarity to initial exon in D. melanogaster
  Sequence similarity to other Drosophila species (Multiz)
Core promoter motifs
  Inr, TATA box, etc.

Challenges with TSS annotations
Fewer constraints on untranslated regions (UTRs)
  UTRs evolve more quickly than coding regions
  Open reading frames and compatible phases of donor and acceptor sites do not apply to UTRs
  Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs
Most gene finders do not predict UTRs
Lack of experimental data
  Cannot use RNA-Seq data to precisely define the TSS

RNA-Seq biases introduced by library construction
cDNA fragmentation (e.g., sonication, DNaseI): strong bias at the 3’ end
RNA fragmentation (e.g., hydrolysis, nebulization): more uniform coverage, but misses the 5’ and 3’ ends of the transcript
Transcripts show more uniform coverage using RNA instead of cDNA fragmentation.
[Figure: RNA-Seq read count across the gene span (5’ to 3’). The chart uses yeast (single ORF) as an example and does not account for biases in read mapping.]
Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.

Techniques for finding TSS
Identify the 5’ cap at the beginning of the mRNA:
  Cap Analysis of Gene Expression (CAGE)
  RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE)
  Cap-trapped Expressed Sequence Tags (5’ ESTs)
CAGE randomly samples 5’ ends, so the read counts correspond to expression level.
RACE targets the 5’ ends of specific transcripts (8727 transcripts), so it can capture lowly expressed transcripts.
The 5’ end of an mRNA has a guanosine connected to the rest of the mRNA via a 5’ to 5’ triphosphate link. This guanosine is methylated at position 7 (7-methylguanylate cap) and helps increase the stability of the mRNA.
More information on these techniques:
  Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012;786:181-200.
  Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36.
  http://www.osc.riken.jp/english/activity/cage/basic/

modENCODE TSS annotations
TSS predictions based on CAGE, RLM-RACE, and 5’ EST data
Results lifted from Release 5 to the Release 6 assembly
Two sets of modENCODE TSS predictions:
  TSS (Celniker): most recent dataset produced by modENCODE
  TSS (Embryonic): older dataset available from FlyBase GBrowse
Use the TSS (Celniker) dataset as the primary evidence
Revised TSS analysis for Release 6 is not yet available
  BDGP is working with FlyBase to update the TSS annotations
  RNA-Seq datasets were just updated in release 6.03
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.

FlyBase release 6 assembly
First change in the D. melanogaster assembly since 2005
Substantial changes in the heterochromatic regions
modENCODE RNA-Seq data have been re-analyzed relative to the release 6 assembly
Most of the modENCODE datasets are still in release 5

Gnomon predictions for eight Drosophila species
Based on RNA-Seq data from either the same or closely-related species
Species: D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis
Predictions include untranslated regions and multiple isoforms
Records not yet available through the NCBI RefSeq database
Access these annotations through the FlyBase BLAST service

RNA Polymerase II core promoter
Core promoter: 200 bp surrounding the TSS
Initiator motif (Inr) contains the TSS
TFIID binds to the TATA box and Inr to initiate the assembly of the pre-initiation complex (PIC)
Polypyrimidine initiator (TCT) motif is associated with ribosomal genes
Motif notation: K = G or T, W = A or T
Mammalian genes have CpG islands but Drosophila promoters do not
Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9.

Peaked versus broad promoters
Peaked promoter (single strong TSS)
Broad promoter (multiple weak TSS; 50-300 bp)
Also described as focused (peaked) versus dispersed (broad) promoters
A peaked promoter in D. melanogaster is more informative: expect a peaked promoter in the target species
Motifs associated with peaked promoters have more well-defined positions relative to +1
Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51.

Promoter architecture in Drosophila
Classify the core promoter based on the Shape Index (SI)
  Determined by the distribution of CAGE and 5’ RLM-RACE reads
  The shape index is a continuum
Most promoters in D. melanogaster contain multiple TSS (most Drosophila promoters are broad promoters)
  Median width = 162 bp, approximately the length of DNA for a single nucleosome
~70% of vertebrate genes have broad promoters
Motifs associated with paused polymerase are found in peaked promoters
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
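
The slide does not give the formula for the Shape Index. The following is a minimal sketch, assuming an entropy-style definition, SI = 2 + sum(p_i * log2(p_i)), where p_i is the fraction of 5’ end reads at position i of the promoter, and assuming a cutoff of -1 between peaked and broad. Both the formula and the cutoff are assumptions here and should be checked against Hoskins et al. (2011); the function names and read counts are hypothetical.

```python
import math


def shape_index(read_counts):
    """Shape Index under the ASSUMED definition SI = 2 + sum(p_i * log2(p_i)).

    read_counts: number of 5' end reads (e.g., CAGE) at each position of one
    promoter. Positions with zero reads contribute nothing to the sum.
    """
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("need at least one read")
    si = 2.0
    for count in read_counts:
        if count > 0:
            p = count / total
            si += p * math.log2(p)
    return si


def classify_shape(si, cutoff=-1.0):
    """Label a promoter from its SI; the cutoff of -1 is an assumption."""
    return "peaked" if si > cutoff else "broad"


# Hypothetical read counts, for illustration only.
single_site = [25]            # all reads at one position
spread_out = [1] * 50         # reads spread evenly over 50 positions
print(shape_index(single_site), classify_shape(shape_index(single_site)))
print(shape_index(spread_out), classify_shape(shape_index(spread_out)))
```

Because the index is a continuum, any cutoff is a convenience for binning rather than a sharp biological boundary.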

Genes with peaked promoters show stronger spatial and tissue specificity
46% of genes with broad promoters are expressed in all stages of embryonic development
19% of genes with peaked promoters are expressed in all stages
Peaked promoters have more precise control
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.

Peaked and broad promoters are enriched in different core promoter motifs Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.

Using modENCODE data to classify the type of core promoter in D. melanogaster
Only a subset of the modENCODE data are available through FlyBase
Evidence tracks on the D. melanogaster GEP UCSC Genome Browser [July 2014 (BDGP R6) assembly]:
  FlyBase gene annotations (release 6.06)
  modENCODE TSS (Celniker) annotations
  DNase I hypersensitive sites (DHS): data available for different embryonic stages and cell lines
  9-state and 16-state chromatin models
    16-state hiHMM models: late embryos and third instar larvae
    9-state chromHMM models: S2 and BG3 cells
  RNA Pol II and CBP ChIP-Seq (CBP = CREB-binding protein, a coactivator of transcription factors)
  Transcription factor binding site (TFBS) HOT spots

9-state chromatin model Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5.

DNaseI Hypersensitive Sites (DHS) correspond to accessible regions Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62. Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52.

Classifying the D. melanogaster core promoter for each unique TSS
Peaked: single annotated TSS with a single DHS position
Intermediate: single annotated TSS with multiple DHS positions within a 300bp window
Broad: multiple annotated TSS with multiple DHS positions within a 300bp window
Insufficient evidence: no TSS annotations or DHS positions
Classification is based on each unique TSS
Different unique TSS for the same gene could have different classifications
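
The classification rules on this slide can be written as a small decision function. This is a sketch only: the function name and inputs are hypothetical, the counts are assumed to have been taken by hand from the genome browser within the 300bp window, and two readings are assumptions (a missing TSS annotation or a missing DHS position is each treated as insufficient evidence, and the combination the slide does not list, multiple TSS with a single DHS position, is returned as "Unclassified").

```python
def classify_core_promoter(n_tss, n_dhs):
    """Classify one unique TSS from counts within a 300 bp window.

    n_tss: number of annotated TSS (e.g., modENCODE TSS (Celniker)) in the window
    n_dhs: number of DHS positions in the same window
    """
    if n_tss < 0 or n_dhs < 0:
        raise ValueError("counts cannot be negative")
    # Assumption: either kind of missing evidence counts as insufficient.
    if n_tss == 0 or n_dhs == 0:
        return "Insufficient evidence"
    if n_tss == 1 and n_dhs == 1:
        return "Peaked"
    if n_tss == 1:
        return "Intermediate"   # single TSS, multiple DHS positions
    if n_dhs > 1:
        return "Broad"          # multiple TSS, multiple DHS positions
    # Multiple TSS with a single DHS position is not covered by the slide.
    return "Unclassified"


# Each unique TSS of a gene is classified separately.
for counts in [(1, 1), (1, 3), (4, 2), (0, 0)]:
    print(counts, classify_core_promoter(*counts))
```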

DEMO Classifying the core promoter of D. melanogaster Rad23

Additional DHS data from different stages of embryonic development
DHS data produced by the BDTNP project (BDTNP = Berkeley Drosophila Transcription Network Project)
Evidence tracks:
  Detected DHS Positions (Embryos)
  DHS Read Density (Embryos)
Use DHS data from cell lines as primary evidence
Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.

Determine the gene structure in D. melanogaster
FlyBase: GBrowse (figure labels: UTR, CDS)
Gene Record Finder: Transcript Details

Identify the initial transcribed exon using NCBI blastn
Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder
Use placement of the flanking exons to reduce the size of the search region if possible
Increase sensitivity of nucleotide searches:
  Change Program Selection to blastn
  Change Word size to 7
  Change Match/Mismatch Scores to +1, -1
  Change Gap Costs to Existence: 2, Extension: 1
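
The slide describes the NCBI web interface. For readers who prefer the standalone BLAST+ `blastn` program, the same settings can be expressed as command-line options. The sketch below only assembles and prints the command; it does not run BLAST. The FASTA file names are hypothetical placeholders, and the mapping from the web form to these options is an assumption that should be checked against the BLAST+ documentation for the installed version.

```python
import shlex


def build_blastn_command(query_fasta, subject_fasta, out_path=None):
    """Assemble a BLAST+ blastn command with the sensitive settings above."""
    cmd = [
        "blastn",
        "-task", "blastn",        # Program Selection: blastn (not megablast)
        "-query", query_fasta,    # initial exon from D. melanogaster
        "-subject", subject_fasta,  # contig or search region in the target species
        "-word_size", "7",        # Word size: 7
        "-reward", "1",           # Match score: +1
        "-penalty", "-1",         # Mismatch score: -1
        "-gapopen", "2",          # Gap cost, Existence: 2
        "-gapextend", "1",        # Gap cost, Extension: 1
    ]
    if out_path:
        cmd += ["-out", out_path]
    return cmd


# Hypothetical file names, for illustration only.
command = build_blastn_command("rad23_initial_exon.fasta", "dbia_contig19.fasta")
print(shlex.join(command))
```

Restricting the subject file to the region between the flanking exons has the same effect as limiting the search region on the web form.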

Optimize alignment parameters based on expected levels of conservation
Derive alignment scores using information theory: relative entropy of target and background frequencies
  Match +2, Mismatch -3: optimized for 90% identity
  Match +1, Mismatch -1: optimized for 75% identity (less information available per aligned position)
  WU-BLAST: +5/-4, +5/-11
Notes:
  Information = decrease in uncertainty. A message conveys more information when it is drawn from a larger vocabulary and when the answer is surprising (the less probable the answer, the more information it carries).
  Each aligned base needs some minimum information content to produce a significant alignment.
  Shannon entropy = unpredictability of a random variable; it can be obtained by subtracting the Kullback-Leibler divergence (target vs. uniform distribution) from the total entropy required to encode a message.
  PAM = Point Accepted Mutations
States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods. 1991;3:66-70.
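
The link between a match/mismatch pair and its target percent identity can be checked numerically. This is a minimal sketch, assuming equal background frequencies of 0.25 for each base and the standard log-odds form of a score, s = ln(q / (p_i * p_j)) / lambda; it is not the calculation from the cited paper, and the function names are hypothetical. Under these assumptions lambda is the positive solution of 0.25 * exp(lambda * match) + 0.75 * exp(lambda * mismatch) = 1, and the target identity is 0.25 * exp(lambda * match).

```python
import math


def solve_lambda(match, mismatch, tol=1e-12):
    """Find the positive scale factor lambda by bisection (uniform base frequencies)."""
    if match <= 0 or mismatch >= 0:
        raise ValueError("expect a positive match and a negative mismatch score")
    if 0.25 * match + 0.75 * mismatch >= 0:
        raise ValueError("the expected score per position must be negative")

    def g(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    lo, hi = 1e-6, 1.0          # g(lo) < 0 because the expected score is negative
    while g(hi) <= 0:           # expand until the root is bracketed
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def target_identity(match, mismatch):
    """Fraction of identical positions the scoring pair is tuned for."""
    return 0.25 * math.exp(solve_lambda(match, mismatch) * match)


def bits_per_position(match, mismatch):
    """Relative entropy of the scoring system, in bits per aligned position."""
    lam = solve_lambda(match, mismatch)
    identity = 0.25 * math.exp(lam * match)
    expected_score = identity * match + (1 - identity) * mismatch
    return lam * expected_score / math.log(2)


for pair in [(2, -3), (1, -1)]:
    print(pair, round(target_identity(*pair), 3), round(bits_per_position(*pair), 3))
```

Under these assumptions +1/-1 gives a target identity of 75% and +2/-3 gives a value close to 90%, and the +1/-1 scores carry fewer bits per aligned position, which is why longer alignments are needed to reach significance at lower conservation.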

DEMO blastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19

RNA PolII ChIP-Seq tracks (D. biarmipes only)
Show regions that are enriched in RNA Polymerase II compared to input DNA
[Figure: browser view with the Gene Models, RNA PolII Peaks, RNA PolII Enrichment, and RNA-Seq tracks]

Use core promoter motifs to support TSS annotations
Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS
  Some motifs (e.g., Inr, TATA) are well-characterized
  Other motifs are identified based on computational analysis
Presence of core promoter motifs can be used to support the TSS annotations
  Inr motif (TCAKTY) overlaps with the TSS (-2 to +4)
Absence of core promoter motifs is a negative result
  Most D. melanogaster TSS do not contain the Inr motif
Note: TCA for broad promoters
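
A short sketch of how an Inr search could be scripted, as a complement to the browser tools. It scans both strands for TCAKTY (K = G or T, Y = C or T) and reports the position of the A as +1. That choice follows from the -2 to +4 span with no position 0, and it is an assumption of this sketch; the function names and test sequences are hypothetical.

```python
import re

# IUPAC codes needed for the core promoter motifs mentioned in these slides.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "K": "[GT]", "Y": "[CT]", "W": "[AT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_to_regex(motif):
    """Compile an IUPAC motif as a lookahead so overlapping matches are found."""
    return re.compile("(?=(" + "".join(IUPAC[base] for base in motif) + "))")


def find_inr(seq, motif="TCAKTY", plus_one_offset=2):
    """Return (strand, motif_start, tss) tuples in 1-based forward coordinates.

    plus_one_offset is the 0-based index of the +1 base within the motif
    (assumed to be the A of TCAKTY).
    """
    seq = seq.upper()
    n = len(seq)
    pattern = motif_to_regex(motif)
    hits = []
    for m in pattern.finditer(seq):
        hits.append(("+", m.start() + 1, m.start() + plus_one_offset + 1))
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    for m in pattern.finditer(reverse_complement):
        # Index i on the reverse complement is index n - 1 - i on the forward strand.
        leftmost = n - 1 - (m.start() + len(motif) - 1)
        tss = n - 1 - (m.start() + plus_one_offset)
        hits.append(("-", leftmost + 1, tss + 1))
    return sorted(hits, key=lambda hit: hit[1])


print(find_inr("GGTCAGTCAA"))   # plus-strand example
print(find_inr("CCGAATGATT"))   # minus-strand example
```

A six-base degenerate motif matches often by chance, so a hit is supporting evidence only when it agrees with the other lines of evidence for the TSS.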

Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs
Motifs include the TATA box and the Initiator (Inr)
Available under “Projects” → “Annotation Resources” → “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html
Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087.

Core Promoter Motifs tracks
Show core promoter motif matches for each contig
Separated by strand
Visualize matches to different core promoter motifs
Use UCSC Table Browser (or other means) to export the list of motif matches within the search region
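
As one example of "other means", motif matches exported in BED format can be filtered with a few lines of code. This sketch assumes tab-separated BED records (0-based start, end not included) with the motif name in column 4 and the strand in column 6, and a search region given in 1-based browser coordinates. The function name and the three records are hypothetical; the region reuses the Rad23 search region from later in this presentation.

```python
def motifs_in_region(bed_lines, chrom, region_start, region_end):
    """Keep motif matches that fall entirely inside a 1-based, inclusive region."""
    kept = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        record_chrom, start0, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        strand = fields[5] if len(fields) > 5 else "."
        start1 = start0 + 1  # convert the BED start to a 1-based coordinate
        if record_chrom == chrom and start1 >= region_start and end <= region_end:
            kept.append((name, start1, end, strand))
    return kept


# Hypothetical motif matches, for illustration only.
example_bed = [
    "contig19\t28930\t28936\tInr\t0\t+",
    "contig19\t28500\t28506\tTATA\t0\t+",
    "contig20\t28720\t28726\tInr\t0\t-",
]
matches = motifs_in_region(example_bed, "contig19", 28716, 28936)
print(matches)
```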

DEMO Use the Inr motif to determine the TSS position of Rad23

Using RNA PolII ChIP-Seq and RNA-Seq data to define the TSS search region
[Figure: browser view with the D. mel Transcripts, RNA PolII Peaks, RNA PolII Enrichment, and RNA-Seq tracks, with the TSS search region marked]

TSS annotation for Rad23
TSS position: 28,936
  Conservation with D. melanogaster: blastn of initial exon and the D. mel Transcripts track
  Location of the Inr motif
TSS search region: 28,716-28,936
  Enrichment of RNA PolII upstream of the TSS position
  RNA-Seq read coverage upstream of the TSS position
  Search region defined by the extent of the RNA PolII peak

TSS annotation section in the GEP Annotation Report Form
TSS report form:
  Classify the type of core promoter
  Evidence that supports or refutes the TSS annotation
  Distribution of core promoter motifs in the project and in D. melanogaster
Sample TSS report for onecut on the GEP web site, under the “Beyond Annotation” section

D. biarmipes Muller F element TSS annotation projects
Reconcile coding region annotations submitted by GEP students
Reconciled annotations might be incorrect
  Based on an older FlyBase release
  Possible misannotations
    Example: some coding exon annotations are based on weak evidence (e.g., Small Exon Finder)
    CDS might overlap with untranslated exons or the TSS of an adjacent gene
See the “Revised gene models report form” section of the Transcription Start Sites Project Report

Additional TSS annotation resources
The D. melanogaster gene annotations are the primary source of evidence
Resources that might be useful if the D. melanogaster evidence is ambiguous:
  Whole genome alignments of 14 Drosophila species
    PhastCons and PhyloP conservation scores
  Genome browsers for nine Drosophila species
    RNA Pol II ChIP-Seq (D. biarmipes only)
    RNA-Seq coverage, TopHat junctions, assembled transcripts
    Augustus and N-SCAN gene predictions
onecut example: supported by both RNA-Seq and Multiz

Conservation tracks on the D. melanogaster GEP UCSC Genome Browser
Whole genome alignments of 14 Drosophila species
  Drosophila Chain/Net composite track
  Generate multiple sequence (Multiz) alignments from these pairwise alignments
Identify conserved regions from Multiz alignments
  PhastCons: identify conserved elements
  PhyloP: measure level of selection at each nucleotide
Multiz alignment of 27 insect species available on the official UCSC Genome Browser
  Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly

Use the conservation tracks to identify regions under selection
PhyloP scores distinguish sites under negative selection from fast-evolving sites

Examine the Multiz alignments to identify the orthologous TSS regions

Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS
April 2013 (BCM-HGSC/Dbia_2.0) assembly
Search for orthologous regions in D. elegans
Use more stringent parameters than the GEP annotation projects (problem with overlapping projects)

Use RNA-Seq data to predict untranslated regions and putative TSS
TSS predictions available for 9 Drosophila species
Gene predictors: N-SCAN+PASA-EST, Augustus, TransDecoder
[Figure: browser view with the D. mel Proteins, N-SCAN, Augustus, TransDecoder, and RNA-Seq tracks]

DEMO Genome browsers for other Drosophila species

TSS annotation summary
Most of the D. melanogaster core promoters have multiple TSS
Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogaster
Define search regions that might contain TSS
Use multiple lines of evidence to infer the TSS region:
  Identify the initial exon
    RNA-Seq coverage
    blastn (change search parameters)
  RNA PolII ChIP-Seq peaks
  Distribution of core promoter motifs (e.g., Inr)
Maintain conservation compared to D. melanogaster

Questions?