Searching for Transcription Start Sites in Drosophila


Searching for Transcription Start Sites in Drosophila Wilson Leung 08/2017

Outline: Transcription start sites (TSS) annotation goals; promoter architecture in D. melanogaster; new D. melanogaster TSS datasets; find the initial transcribed exon; annotate putative transcription start sites; search for core promoter motifs.

Muller F element, heterochromatic, and euchromatic genes show similar expression levels. F element genes show lower expression levels in S2 (late embryos) cells but no difference in BG3 (neuronal) cells. Position affects gene expression. See Dr. Elgin’s presentation on the private wiki for a more detailed explanation of the scientific goals. Riddle NC, et al. PLoS Genet. 2012 Sep;8(9):e1002954.

TSS of F element genes show lower levels of H3K9me3 and HP1a. [Figure tracks: POF, Su(var)3-9, PolII, H3K36me3, HP1a, H3K9me3.] H3K9me3 is found over the gene body. H3K36me3 is associated with the 3’ end of the transcript in euchromatic regions. Riddle NC, et al. PLoS Genet. 2012 Sep;8(9):e1002954.

Three strategies for motif finding. (1) Multiple genes in a single species: genes with a common expression pattern; sequences associated with ChIP-Seq peaks. (2) Single gene in multiple species: phylogenetic footprinting. (3) Multiple genes in multiple species: compare multiple sequence alignment profiles of multiple genes (Magma).

Motif finding using multiple genes within a single species. Sequences surrounding TSS; zero or one instances per sequence; predicted motif instances. [Figure: position count matrix (A, C, G, T by motif position 1-10) and sequence logo (bits) for the Trl motif, FlyReg_DNaseI.]

Motif finding using a single gene in multiple species. [Figure: D. mel chr4 browser view with Genes, PhyloP, phastCons Conserved Elements, and Multiple Sequence Alignment tracks; EvoprinterHD output.]

Motif finding using multiple genes in multiple species (PhyloNet). 1. Identify conserved regions (profiles) in whole genome multiple sequence alignments. 2. Identify multiple genes in the genome with similar alignment profiles. Create phylogenetic profiles for each promoter. Modify Karlin-Altschul statistics to calculate E-values (compare homologous sequences). Cluster HSPs into profile alignments and generate final motifs. Create a new continuous profile space (15 subprofile spaces; a tetrahedron where the four vertices represent the nucleotides). [Figure labels: Promoter sequences, Conserved motifs.] Based on Figure 1 from Wang T and Stormo GD. PNAS 2005 Nov 29;102(48):17400-5.

Magma: Multiple Aligner of Genomic Multiple Alignments. Key features of Magma: runs ~70x faster than PhyloNet; analyzes multiple sequence alignments with gaps; uses a set-covering approach to minimize redundancy in discovered motifs; compares profiles using average log-likelihood ratio (ALLR) scores, extended to allow for gaps. Computationally tractable for analyzing conserved motifs in multiple eukaryotic genomes. Ihuegbu NE, Stormo GD, Buhler J. J Comput Biol. 2012 Feb;19(2):139-47.

Goals for the transcription start sites (TSS) annotations. Research goal: identify motifs that enable Muller F element genes to function within a heterochromatic environment. Annotation goals: define search regions enriched in regulatory motifs (define the precise location of the TSS if possible; define search regions where the TSS could be found); document the evidence used to support the TSS annotations. Detailed documentation allows us to prioritize the list of TSS candidates.

Estimated evolutionary distances with respect to D. melanogaster
Species            Substitutions per neutral site
D. ficusphila      0.80
D. eugracilis      0.76
D. biarmipes       0.70
D. takahashii      0.65
D. elegans         0.72
D. rhopaloa        0.66
D. kikkawai        0.89
D. bipectinata     0.99
Data from Table 1 of the modENCODE comparative genomics white paper.
[Figure: phylogenetic tree of D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. bipectinata, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi. Legend: GEP annotation projects; species sequenced by modENCODE; purple = priority 1 species in the modENCODE white paper.]
Notes: D. eugracilis is the training project. D. ficusphila has more substitutions per neutral site than D. biarmipes and D. elegans. D. melanogaster subgroup = 0.4ss; D. ananassae = 1.3ss.

Challenges with TSS annotations. Fewer constraints on untranslated regions (UTRs): UTRs evolve more quickly than coding regions; open reading frames and compatible phases of donor and acceptor sites do not apply to UTRs; low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs, with similar levels of sequence similarity between D. elegans and D. melanogaster. Most gene finders do not predict UTRs. Lack of experimental data: RNA-Seq data cannot be used to precisely define the TSS.

TSS annotation workflow: (1) Identify the ortholog. (2) Note the gene structure in D. melanogaster. (3) Annotate the coding exons. (4) Classify the type of core promoter in D. melanogaster. (5) Annotate the initial transcribed exon. (6) Identify any core promoter motifs in the region. (7) Define TSS positions or TSS search regions. Annotation is based on parsimony with D. melanogaster.

RNA Polymerase II core promoter. The Initiator motif (Inr) contains the TSS. TFIID binds to the TATA box and Inr to initiate the assembly of the pre-initiation complex (PIC). The polypyrimidine initiator (TCT) motif is associated with ribosomal genes. Core promoter: 200 bp surrounding the TSS. K (keto) = G or T, W (weak) = A or T. Mammalian genes have CpG islands but Drosophila promoters do not. Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9.

Peaked versus broad promoters. Peaked (focused) promoter: a single strong TSS. Broad (dispersed) promoter: multiple weak TSS spread over 50-300 bp. A peaked promoter in D. melanogaster is more informative: expect a peaked promoter in the target species. Motifs associated with peaked promoters have more well-defined positions relative to +1. Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51.

RNA-Seq biases introduced by library construction. cDNA fragmentation (e.g., sonication, DNaseI): strong bias at the 3’ end of the transcript. RNA fragmentation (e.g., hydrolysis, nebulization): more uniform coverage, but misses the 5’ and 3’ ends of the transcript. [Figure: RNA-Seq read count across the gene span, 5’ to 3’.] This chart uses yeast (single ORF) as an example; it does not account for biases in read mapping. Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.

Techniques for finding TSS. Identify the 5’ cap at the beginning of the mRNA: Cap Analysis of Gene Expression (CAGE); RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE); Cap-trapped Expressed Sequence Tags (5’ ESTs). CAGE randomly samples 5’ ends, so tag counts correspond to expression level. RACE targets the 5’ ends of specific transcripts (8727 transcripts) and can capture lowly expressed transcripts. The 5’ end of an mRNA has a guanosine connected to the rest of the mRNA via a 5’ to 5’ triphosphate link. This guanosine is methylated at position 7 (7-methylguanylate cap) and helps increase the stability of the mRNA. More information on these techniques: Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012;786:181-200. Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36. http://www.osc.riken.jp/english/activity/cage/basic/

Promoter architecture in Drosophila. Classify the core promoter based on the Shape Index (SI), which is determined by the distribution of CAGE and 5’ RLM-RACE reads. The shape index is a continuum. Most promoters in D. melanogaster contain multiple TSS, i.e., most are broad promoters. Median width = 162 bp, approximately the length of DNA for a single nucleosome. ~70% of vertebrate genes have broad promoters. Motifs associated with paused polymerase are found in peaked promoters. Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
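
The slide does not give the Shape Index formula. As a minimal sketch, assuming the entropy-based definition SI = 2 + Σ p_i log2(p_i), where p_i is the fraction of tags at position i (this should be checked against Hoskins et al. before use), the index could be computed from per-position tag counts like this. The tag counts below are made up for illustration.

```python
import math

def shape_index(tag_counts):
    """Shape Index under the assumed definition SI = 2 + sum(p_i * log2(p_i)).

    tag_counts: CAGE/RACE tag counts at each position of one promoter.
    Positions with zero tags contribute nothing to the sum.
    """
    total = sum(tag_counts)
    if total == 0:
        raise ValueError("promoter has no tags")
    return 2 + sum((c / total) * math.log2(c / total) for c in tag_counts if c > 0)

# Hypothetical promoters: all tags at one position vs. tags spread evenly over 16 positions.
peaked = [0, 0, 100, 0, 0]
broad = [10] * 16
print(shape_index(peaked), shape_index(broad))
```

Under this definition a single dominant TSS gives the maximum value of 2, and the value falls as tags spread over more positions, which matches the slide's point that the shape index is a continuum.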

Genes with peaked promoters show stronger spatial and tissue specificity. 46% of genes with broad promoters are expressed in all stages of embryonic development, compared with 19% of genes with peaked promoters. Peaked promoters allow more precise control. Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.

Peaked and broad promoters are enriched in different core promoter motifs Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.

Resources for classifying the type of core promoter in D. melanogaster. Only a subset of the modENCODE data are available through FlyBase. D. melanogaster GEP UCSC Genome Browser [Aug. 2014 (BDGP Release 6) assembly]: FlyBase gene annotations (release 6.16); modENCODE TSS (Celniker) annotations; DNase I hypersensitive sites (DHS), with data available for different embryonic stages and cell lines; CAGE and RAMPAGE TSS datasets; 9-state and 16-state chromatin models (16-state hiHMM models: late embryos and third instar larvae; 9-state chromHMM models: S2 and BG3 cells); transcription factor binding site (TFBS) HOT spots.

9-state chromatin model Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5.

DNaseI Hypersensitive Sites (DHS) correspond to accessible regions Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62. Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52.

modENCODE TSS annotations Two sets of modENCODE TSS predictions TSS (Celniker) Most recent dataset produced by modENCODE Available on the GEP UCSC Genome Browser TSS (Embryonic) Older dataset available from FlyBase GBrowse Use TSS (Celniker) dataset as the primary evidence BDGP working with FlyBase to update the TSS annotations Updated RNA-Seq datasets available since release 6.03 Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92

Classify the D. melanogaster core promoter based on TSS (Celniker) annotations and DHS positions. [Table: TSS classification | # Annotated TSS | # DHS positions, with rows Peaked, Intermediate, Broad, and Insufficient evidence; surviving cell values: 1, ≤ 1, > 1.] Updated definition for peaked and intermediate TSS to cover all cases. Classification is based on each unique TSS; different unique TSS for the same gene could have different classifications. Consider DHS positions within a 300 bp window surrounding the start of the D. melanogaster transcript.

DEMO: Classify the core promoter of D. melanogaster Rad23

Additional DHS data from different stages of embryonic development. DHS data produced by the BDTNP (Berkeley Drosophila Transcription Network Project). Evidence tracks: Detected DHS Positions (Embryos); DHS Read Density (Embryos). [Figure: chr4 browser view of CG2316 isoforms (RA, RB, RC, RD, RG, RH) with BG3 and S2 9-state tracks; BG3, S2, and Kc DHS tracks; embryonic DHS tracks for stages 5, 9, 10, 11, and 14; and TSS (Celniker).] Use DHS data from cell lines as primary evidence. BG3 = CNS of 3rd instar larvae, S2 = late embryos, Kc167 = dorsal closure stage. Stage 14 = 620-680 min (dorsal closure of midgut and epidermis). Determine the expression pattern of the gene in D. melanogaster through the “High-Throughput Expression Data” section. Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.

Additional TSS data available in FlyBase release 6.11 MachiBase Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013 Jan;23(1):169-80.

Benefits of RAMPAGE. RAMPAGE = RNA Annotation and Mapping of Promoters for Analysis of Gene Expression. CAGE (Cap Analysis of Gene Expression) only allows sequencing of short sequence tags (~27 bp) near the 5’ cap, which leads to ambiguous read mapping in large parts of the genome. RAMPAGE produces long paired-end reads instead of short sequence tags. The authors developed a novel algorithm to identify TSS clusters, used paired-end information during peak calling, and used Cufflinks to produce partial transcript models. RAMPAGE combines template switching and cap trapping: template switching adds an adapter to the end of 5’-complete first-strand cDNAs during reverse transcription; cap trapping is the biotinylation of the 7-methylguanosine cap of Pol II transcripts followed by pull-down of 5’-complete cDNAs. MachiBase is based on a modified version of SAGE (Serial Analysis of Gene Expression) and is less specific than CAGE or RAMPAGE. Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'-complete cDNAs. Curr Protoc Mol Biol. 2013 Nov 11;104:Unit 25B.11.

RAMPAGE results on the GEP UCSC Genome Browser Lifted RAMPAGE results from release 5 to release 6 Results from 36 developmental stages Combined TSS peak call from all samples Available under the “Expression and Regulation” section

Standardize analysis of MachiBase and modENCODE CAGE data using CAGEr Bioconductor package developed by RIKEN Map datasets against release 6 assembly 37 modENCODE CAGE samples; 7 MachiBase samples Define TSS and promoters for each sample Define consensus promoters across all samples MachiBase = 5’ SAGE data Haberle V, et al. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015 Apr 30;43(8):e51.

TSS classifications based on CAGEr. [Figure: example browser views of a peaked, an intermediate, and a broad promoter, each showing the FlyBase Genes and modENCODE CAGE Peaks tracks with the modENCODE CAGE (Plus) or modENCODE CAGE (Minus) signal track.]

Changes in the dominant TSS of Rad23 across different developmental stages. [Figure: CAGE tag clusters by stage of development; BG3 cells, L3 digestive system, and adult females are labeled.] Tag cluster interval ~70 bp, with 80% of the signal within 50 bp. Thin box = interquantile width (80% of CAGE signal). BG3 cells and the L3 digestive system also show a different dominant TSS.
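
To make the "interquantile width: 80% of CAGE signal" note concrete, here is a small sketch of how such a width could be computed for one tag cluster. It assumes the 80% is the span between the 10th and 90th percentiles of the cumulative tag count; the function name and the example counts are hypothetical, and this is not the CAGEr implementation itself.

```python
def interquantile_width(positions, counts, q_low=0.1, q_up=0.9):
    """Width (bp) of the interval holding the central share of CAGE tags.

    positions: sorted genomic coordinates within one tag cluster
    counts:    CAGE tag count at each of those positions
    q_low/q_up: assumed quantile bounds (0.1 and 0.9 keep the central 80%)
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("tag cluster has no tags")
    cumulative = 0
    low_pos = None
    for pos, count in zip(positions, counts):
        cumulative += count
        # First position where the cumulative signal reaches the lower quantile.
        if low_pos is None and cumulative >= q_low * total:
            low_pos = pos
        # First position where the cumulative signal reaches the upper quantile.
        if cumulative >= q_up * total:
            return pos - low_pos + 1
    raise RuntimeError("upper quantile not reached")

# Hypothetical 10 bp cluster dominated by two adjacent positions.
positions = list(range(100, 110))
counts = [1, 1, 1, 1, 40, 40, 10, 3, 2, 1]
print(interquantile_width(positions, counts))
```

A cluster can therefore span many bases while its interquantile width stays small, which is why the thin box in the figure is a better guide to promoter shape than the full cluster interval.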

Evidence for TSS annotations (in general order of importance). Experimental data: RNA-Seq; RNA Pol II ChIP-Seq. Conservation: type of TSS (peaked/intermediate/broad) in D. melanogaster; sequence similarity to the initial exon in D. melanogaster; sequence similarity to other Drosophila species (Multiz). Core promoter motifs: Inr, TATA box, etc. Use D. biarmipes PolII data to help define the TSS search regions in the other species.

Determine the gene structure in D. melanogaster UTR CDS FlyBase: GBrowse Gene Record Finder: Transcript Details

Identify the initial transcribed exon using NCBI blastn. Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder. Use placement of the flanking exons to reduce the size of the search region if possible. Increase the sensitivity of nucleotide searches: change Program Selection to blastn; change Word size to 7; change Match/Mismatch Scores to +1, -1; change Gap Costs to Existence: 2, Extension: 1.
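
The slide describes the NCBI web form. For a local BLAST+ installation, the same settings map roughly onto the command-line options below. This is a sketch only: the FASTA file names are hypothetical, and the code builds the argument list without running the search.

```python
import shlex

def sensitive_blastn_command(query_fasta, subject_fasta):
    """Build a BLAST+ blastn command using the relaxed settings from the slide."""
    return [
        "blastn",
        "-task", "blastn",     # blastn algorithm rather than the default megablast
        "-word_size", "7",     # Word size: 7
        "-reward", "1",        # Match score: +1
        "-penalty", "-1",      # Mismatch score: -1
        "-gapopen", "2",       # Gap existence cost: 2
        "-gapextend", "1",     # Gap extension cost: 1
        "-query", query_fasta,
        "-subject", subject_fasta,
    ]

# Hypothetical file names for the initial exon and the target contig.
cmd = sensitive_blastn_command("Rad23_initial_exon.fasta", "contig19.fasta")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.run if BLAST+ is installed.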

Extrapolate TSS position based on blastn alignment of the initial transcribed exon. blastn: D. mel: Rad23:1 (Query) vs. contig19 (Sbjct). The alignment begins at query position 6, so the first 5 bases of the exon are unaligned; subtracting them from the subject start gives the extrapolated TSS position: 28,941 - 5 = 28,936. This assumes the length of the initial transcribed exon is conserved between D. melanogaster and D. biarmipes.
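
The extrapolation is simple coordinate arithmetic, sketched below. The function name is hypothetical, and the minus-strand branch is an assumption added for completeness; the slide only shows a plus-strand case.

```python
def extrapolate_tss(query_start, subject_start, strand="+"):
    """Project the first base of the query exon onto the subject contig.

    query_start, subject_start: 1-based coordinates where the blastn
    alignment begins. Assumes the unaligned 5' bases of the exon have
    the same length in the target species (no indels).
    """
    unaligned = query_start - 1  # exon bases upstream of the alignment
    if strand == "+":
        return subject_start - unaligned
    if strand == "-":
        # Assumption: on the minus strand, upstream means larger coordinates.
        return subject_start + unaligned
    raise ValueError("strand must be '+' or '-'")

# Rad23 example from the slide: query start 6, subject start 28,941.
print(extrapolate_tss(6, 28941))  # 28936
```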

Core promoter motifs can affect gene expression levels. SCP1 = Super Core Promoter 1; mTATA = variant of SCP1 with mutations in the TATA box. Juven-Gershon T, et al. Rational design of a super core promoter that enhances gene expression. Nat Methods. 2006 Nov;3(11):917-22.

Use core promoter motifs to support TSS annotations. Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS. Some motifs (e.g., Inr, TATA) are well-characterized; other motifs are identified based on computational analysis. Presence of core promoter motifs can be used to support the TSS annotations. The Inr motif (TCAKTY) overlaps with the TSS (-2 to +4). Absence of core promoter motifs is a negative result: most D. melanogaster TSS do not contain the Inr motif. TCA for broad promoters.
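
As an illustration of how an Inr match translates into a candidate TSS, the sketch below scans both strands of a sequence for TCAKTY using the IUPAC codes defined in these slides (K = G or T, Y = C or T, W = A or T). Because the motif spans -2 to +4 and there is no position 0, the +1 base is the A in TCA. The function names and the test sequence are hypothetical; the GEP workflow itself uses the UCSC Genome Browser Short Match tool.

```python
import re

# IUPAC codes used in the slides, expanded to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "K": "[GT]", "Y": "[CT]", "W": "[AT]"}

def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_inr(seq, motif="TCAKTY"):
    """Return sorted (tss_position, strand) pairs, 1-based on the given sequence.

    The motif covers -2..+4, so +1 is its third base (the A of TCA).
    """
    # Lookahead so that overlapping matches are all reported.
    pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in motif) + "))")
    seq = seq.upper()
    n = len(seq)
    hits = [(m.start() + 3, "+") for m in pattern.finditer(seq)]
    # Scan the reverse complement and convert back to plus-strand coordinates.
    hits += [(n - (m.start() + 2), "-") for m in pattern.finditer(revcomp(seq))]
    return sorted(hits)

# Made-up sequence with one plus-strand Inr match (TCAGTC).
print(find_inr("GGTCAGTCAA"))  # [(5, '+')]
```

A match only supports a TSS candidate when it falls inside the search region defined by the other evidence; the short motif occurs by chance frequently.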

Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs (TATA box, Initiator (Inr)). K = keto (G or T), Y = pyrimidine (C or T). Consider removing BREu and BREd from the list because of the large number of false positives. Available under “Projects” → “Annotation Resources” → “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087.

Core Promoter Motifs tracks Show core promoter motif matches for each contig Separated by strand Visualize matches to different core promoter motifs Use UCSC Table Browser (or other means) to export the list of motif matches within the search region Consider ways to generate locations of core promoter motifs in D. melanogaster

DEMO: Use the Inr motif to support the TSS position of Rad23

RNA PolII ChIP-Seq tracks (available for D. biarmipes, D. elegans, and D. ficusphila). Show regions that are enriched in RNA Polymerase II compared to input DNA. [Figure tracks: Gene Models, RNA PolII Peaks, RNA PolII Enrichment, RNA-Seq.] Perform a blastn search of the D. biarmipes region enriched in RNA PolII against the target species.

Narrow TSS search region. Use RNA-Seq and RNA PolII ChIP-Seq data to define the TSS search region. Could define narrow and wide search regions to encompass regions supported by weak TSS evidence. [Figure tracks: D. mel Transcripts, RNA-Seq, RNA PolII Peaks, RNA PolII Enrichment; the narrow TSS search region is marked.]

TSS annotation for Rad23. TSS position: 28,936. Evidence: conservation with D. melanogaster (blastn search of the initial exon; “D. mel Transcripts” track); location of the Inr motif. TSS search region: 28,716-28,936. Evidence: enrichment of RNA PolII upstream of the TSS position; RNA-Seq read coverage upstream of the TSS position. The search region is defined by the extent of the RNA PolII peak.

TSS annotation resources Walkthroughs: Annotation of Transcription Start Sites in Drosophila Sample TSS report for onecut Reference: TSS Annotation Workflow GEP Annotation Report: Classify the type of core promoter Evidence that supports or refutes the TSS annotation Distribution of core promoter motifs Additional curriculum on motif finding also available under the “Beyond annotation” section Added new parameters page to the TSS Annotation Workflow

Additional TSS annotation resources The D. melanogaster gene annotations are the primary source of evidence Resources that could be useful if the D. melanogaster evidence is ambiguous Whole genome alignments of multiple Drosophila species PhastCons and PhyloP conservation scores Genome browsers for nine Drosophila species RNA Pol II ChIP-Seq (D. biarmipes, D. elegans, and D. ficusphila) RNA-Seq coverage, TopHat junctions, assembled transcripts Augustus and N-SCAN gene predictions Cross-species alignments of Gnomon gene predictions onecut example: supported by both RNA-Seq and Multiz

TSS annotation summary. Most of the D. melanogaster core promoters have multiple TSS. Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogaster. Define search regions that might contain TSS. Use multiple lines of evidence to infer the TSS region: identify the initial exon using RNA-Seq coverage and blastn (change search parameters); distribution of core promoter motifs (e.g., Inr); RNA PolII ChIP-Seq peaks. Maintain conservation compared to D. melanogaster.

Questions?

Structure of a typical mRNA Pesole G. et al. Untranslated regions of mRNAs. Genome Biology. 2002: 3(3) reviews0004.1-reviews0004.10.

D. ananassae and D. melanogaster F element genes show similar range of expression levels. [Figure: expression levels (rlog) in adult females and adult males for D. mel: F (modENCODE; chromosome 4) and D. ana: F (modENCODE; 4L, 4R). Second panel: D. ananassae adult females, with axes CAI (Codon Bias) and Expression Levels (rlog) and a LOESS regression line.] Chen Z-X, et al. Genome Res. 24:1209-1223

Phylogenetic tree based on the analysis of 13 Type IIB restriction endonucleases. Simulate restriction digests of 21 genomes. DNA fragments range from 21-33 bp in size. Calculate the distance between two genomes based on the number of shared fragments. Recognition sequences of Type IIB endonucleases are 5-7 bp long; the enzymes cut dsDNA. [Figure: tree of D. simulans, D. sechellia, D. melanogaster, D. yakuba, D. santomea, D. erecta, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. ficusphila, D. kikkawai, D. ananassae, D. bipectinata, D. persimilis, D. pseudoobscura, D. willistoni, D. virilis, D. mojavensis, and D. grimshawi.] Seetharam AS and Stuart GW. Whole genome phylogeny for 21 Drosophila species using predicted 2b-RAD fragments. PeerJ. 2013 Dec 23;1:e226.

RAMPAGE protocol Batut P, et al. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. (http://genome.cshlp.org/content/23/1/169) supplementary figure 1. Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'-complete cDNAs. Curr Protoc Mol Biol. 2013 Nov 11;104:Unit 25B.11. Ribosome-depleted RNA is reverse-transcribed with random primers bearing an Illumina adaptor sequence overhang. Under the conditions used, the reverse transcriptase will often add a few non-templated Cs when it reaches the 5′ end of the template, especially if the template is capped. A template-switching oligo (TSO), which has three riboguanosines at its 3′ end, can hybridize to the terminal Cs, prompting the enzyme to switch templates and add the TSO sequence to the end of the newly synthesized cDNA. Since the TSO bears the other Illumina adaptor sequence, resulting 5′-complete cDNAs are amplifiable, whereas non-5′-complete molecules are not. The next steps implement the cap-trapping strategy, in which riboses with free 2′- and 3′-hydroxyl groups are oxidized and biotinylated, and single-stranded portions of RNA are digested by RNase I. This leaves biotin groups at only the 5′ ends of capped transcripts hybridized to 5′-complete cDNAs, which can then be recovered on streptavidin-coated beads. After PCR amplification and size selection, the cDNAs selected by these two orthogonal strategies can be directly sequenced on Illumina platforms.

“FlyBase: GBrowse Tracks” page on the FlyBase Wiki. Signals in the FlyBase RAMPAGE and MachiBase TSS tracks are off by one base. http://flybase.org/wiki/FlyBase:GBrowse_Tracks#Aligned_Evidence

DEMO blastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19

Optimize alignment parameters based on expected levels of conservation. Derive alignment scores using information theory: relative entropy of target and background frequencies. Match +2, Mismatch -3 is optimized for 90% identity. Match +1, Mismatch -1 is optimized for 75% identity (less information available per aligned position). WU-BLAST: +5/-4, +5/-11. Information = decrease in uncertainty; more information is conveyed by a larger vocabulary and by a surprising answer (information is inversely related to probability). PAM = Point Accepted Mutations. An aligned base needs some minimum information content to yield a significant alignment. Shannon entropy = unpredictability of a random variable; equivalently, subtract the Kullback-Leibler divergence (target vs. uniform distribution) from the total entropy required to encode a message. States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods. 1991 3:66-70.
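
The link between percent identity and match/mismatch scores can be checked with a short log-odds calculation. The sketch below assumes a uniform background (each base at 0.25) and that the three possible mismatches are equally likely; under those assumptions the score for an aligned pair is the log of its target frequency over its background frequency. The function name is hypothetical.

```python
import math

def log_odds_scores(identity, background=0.25):
    """Unscaled (match, mismatch) log-odds scores for a target identity.

    Assumes uniform base composition and equally likely mismatches:
    target frequency of a specific matching pair    = identity / 4,
    target frequency of a specific mismatching pair = (1 - identity) / 12.
    """
    match_target = identity * background
    mismatch_target = (1 - identity) * background / 3
    null = background * background  # chance of the pair under the background
    return math.log(match_target / null), math.log(mismatch_target / null)

for identity in (0.90, 0.75):
    match, mismatch = log_odds_scores(identity)
    print(f"{identity:.0%} identity: mismatch/match ratio = {mismatch / match:.2f}")
```

At 75% identity the ratio is exactly -1, matching the +1/-1 scheme; at 90% identity it is close to -1.5, matching +2/-3. Integer scoring schemes are rounded versions of these ratios.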

Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS. April 2013 (BCM-HGSC/Dbia_2.0) assembly. Search for orthologous regions in D. elegans. Use more stringent parameters than the GEP annotation projects (problem with overlapping projects).

Gnomon predictions for eight Drosophila species Based on RNA-Seq data from either the same or closely-related species D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis Predictions include untranslated regions and multiple isoforms Records not yet available through the NCBI RefSeq database Access these annotations through the FlyBase BLAST service

Conservation tracks on the D. melanogaster GEP UCSC Genome Browser Whole genome alignments of multiple Drosophila species Drosophila Chain/Net composite track Generate multiple sequence (Multiz) alignments from these pairwise alignments Identify conserved regions from Multiz alignments PhastCons: identify conserved elements PhyloP: measure level of selection at each nucleotide Multiz alignment of 27 insect species available on the official UCSC Genome Browser Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly

Use the conservation tracks to identify regions under selection PhyloP scores: Under negative selection Fast-evolving

Examine the Multiz alignments to identify the orthologous TSS regions

Use RNA-Seq data to predict untranslated regions and putative TSS. TSS predictions available for 9 Drosophila species: N-SCAN+PASA-EST, Augustus, TransDecoder. [Figure tracks: D. mel Proteins, N-SCAN, Augustus, TransDecoder, RNA-Seq.]