Searching for Transcription Start Sites in Drosophila

Presentation on theme: "Searching for Transcription Start Sites in Drosophila"— Presentation transcript:

1 Searching for Transcription Start Sites in Drosophila
Wilson Leung 08/2015

2 Outline
Transcription start site (TSS) annotation goals
Promoter architecture in D. melanogaster
Find the initial transcribed exon
Annotate putative transcription start sites
  RNA PolII ChIP-Seq
  Search for core promoter motifs
Additional TSS annotation resources

3 Structure of a typical mRNA
Pesole G, et al. Untranslated regions of mRNAs. Genome Biology. 2002;3(3).

4 Muller F element, heterochromatic, and euchromatic genes show similar expression levels
F element genes show lower expression levels in S2 (late embryos) cells but no difference in BG3 (neuronal) cells
See Dr. Elgin’s presentation on the private wiki for a more detailed explanation of the scientific goals.
Riddle NC, et al. Enrichment of HP1a on Drosophila chromosome 4 genes creates an alternate chromatin structure critical for regulation in this heterochromatic domain. PLoS Genet Sep;8(9):e

5 Annotate transcription start sites (TSS)
Research goal: Identify motifs that enable Muller F element genes to function within a heterochromatic environment
Annotation goals:
  Define search regions enriched in regulatory motifs
    Define the precise location of the TSS if possible
    Define search regions where the TSS could be found
  Document the evidence used to support the TSS annotations
Detailed documentation allows us to prioritize the list of TSS candidates
In the process of analyzing the GEP submissions from this past year

6 Estimated evolutionary distances with respect to D. melanogaster
Species          Substitutions per neutral site
D. ficusphila    0.80
D. eugracilis    0.76
D. biarmipes     0.70
D. takahashii    0.65
D. elegans       0.72
D. rhopaloa      0.66
D. kikkawai      0.89
D. bipectinata   0.99
[Figure: phylogenetic tree of D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. bipectinata, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi. Legend: GEP annotation projects; new species sequenced by modENCODE]
Purple = priority 1 species in modENCODE white paper
D. melanogaster subgroup = 0.4 ss
D. ananassae = 1.3 ss
Data from table 1 of the modENCODE comparative genomics whitepaper

7 TSS annotation resources
Walkthroughs: Annotation of Transcription Start Sites in Drosophila
Reference: TSS Annotation Workflow
GEP Annotation Report: TSS Report Form (sample report for onecut)
Additional evidence tracks:
  RNA PolII ChIP-Seq (D. biarmipes only)
  D. mel Transcripts
  Conservation (Multiz alignments)
Additional curriculum on motif finding also available under the “Beyond annotation” section
Added new parameters page to the TSS Annotation Workflow

8 TSS annotation workflow
Identify the ortholog
Note the gene structure in D. melanogaster
Annotate the coding exons
Classify the type of core promoter in D. melanogaster
Annotate the initial transcribed exon
Identify any core promoter motifs
Define TSS positions or TSS search regions
Annotation is based on parsimony with D. melanogaster

9 Evidence for TSS annotations (in general order of importance)
Experimental data
  RNA-Seq
  RNA Pol II ChIP-Seq (D. biarmipes only)
Conservation
  Type of TSS (peaked versus broad) in D. melanogaster
  Sequence similarity to initial exon in D. melanogaster
  Sequence similarity to other Drosophila species (Multiz)
Core promoter motifs
  Inr, TATA box, etc.

10 Challenges with TSS annotations
Fewer constraints on untranslated regions (UTRs)
  UTRs evolve more quickly than coding regions
  Open reading frames and compatible phases of donor and acceptor sites do not apply to UTRs
  Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs
Most gene finders do not predict UTRs
Lack of experimental data
  Cannot use RNA-Seq data to precisely define the TSS

11 RNA-Seq biases introduced by library construction
[Figure: RNA-Seq read count versus position along the gene span (5’ to 3’)]
cDNA fragmentation: strong bias at the 3’ end
RNA fragmentation: more uniform coverage, but misses the 5’ and 3’ ends of the transcript
This chart uses yeast (single ORF) as an example
It does not account for biases in read mapping
RNA fragmentation (e.g., hydrolysis, nebulization) misses the transcript ends
cDNA fragmentation (e.g., sonication, DNaseI) has a strong bias at the 3’ end
Transcripts show more uniform coverage using RNA instead of cDNA fragmentation
Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet Jan;10(1):57-63.

12 Techniques for finding TSS
Identify the 5’ cap at the beginning of the mRNA
  Cap Analysis of Gene Expression (CAGE)
  RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE)
  Cap-trapped Expressed Sequence Tags (5’ ESTs)
More information on these techniques:
  Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol :
  Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet Jun;8(6):
CAGE randomly samples 5’ ends, so the read counts correspond to the expression level
RACE targets the 5’ ends of specific transcripts (8727 transcripts), so it can capture lowly expressed transcripts
The 5’ end of an mRNA has a guanosine connected to the rest of the mRNA via a 5’ to 5’ triphosphate link. This guanosine is methylated at position 7 (7-methylguanylate cap) and helps increase the stability of the mRNA.

13 modENCODE TSS annotations
TSS predictions based on CAGE, RLM-RACE, and 5’ EST data
Results lifted from Release 5 to the Release 6 assembly
Two sets of modENCODE TSS predictions
  TSS (Celniker): most recent dataset produced by modENCODE
  TSS (Embryonic): older dataset available from FlyBase GBrowse
Use the TSS (Celniker) dataset as the primary evidence
Revised TSS analysis for Release 6 not yet available
BDGP working with FlyBase to update the TSS annotations
Just updated RNA-Seq datasets in release 6.03
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res Feb;21(2):182-92

14 FlyBase release 6 assembly
First change in the D. melanogaster assembly since 2005
Substantial changes in the heterochromatic regions
modENCODE RNA-Seq data have been re-analyzed relative to the release 6 assembly
Most of the modENCODE datasets are still in release 5

15 Gnomon predictions for eight Drosophila species
Based on RNA-Seq data from either the same or closely-related species
  D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis
Predictions include untranslated regions and multiple isoforms
Records not yet available through the NCBI RefSeq database
Access these annotations through the FlyBase BLAST service

16 RNA Polymerase II core promoter
Initiator motif (Inr) contains the TSS
TFIID binds to the TATA box and Inr to initiate the assembly of the pre-initiation complex (PIC)
Polypyrimidine initiator (TCT) motif is associated with ribosomal genes
Core promoter: 200bp surrounding the TSS
Motif notation: K = G or T, W = A or T
Mammalian genes have CpG islands but Drosophila promoters do not
Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol Mar 15;339(2):225-9.

17 Peaked versus broad promoters
Peaked promoter (single strong TSS)
Broad promoter (multiple weak TSS)
Focused (peaked) versus dispersed (broad) promoter
A peaked promoter in D. melanogaster is more informative: expect a peaked promoter in the target species
Motifs associated with peaked promoters have more well-defined positions relative to +1
Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol Jan-Feb;1(1):40-51.

18 Promoter architecture in Drosophila
Classify core promoter based on the Shape Index (SI)
  Determined by the distribution of CAGE and 5’ RLM-RACE reads
  Shape index is a continuum
Most promoters in D. melanogaster contain multiple TSS
  Median width = 162 bp, approximately the length of DNA for a single nucleosome
~70% of vertebrate genes have broad promoters
Most of the promoters in Drosophila are broad promoters
Motifs associated with paused polymerase are found in peaked promoters
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res Feb;21(2):
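The idea of a shape index computed from the distribution of 5’ reads can be sketched in Python. The slide does not give the formula, so this sketch assumes an entropy-based definition, SI = 2 + sum(p_i * log2(p_i)), where p_i is the fraction of 5’ reads at position i of the promoter. The function name and the read counts are hypothetical.

```python
import math

def shape_index(read_counts):
    """Entropy-based shape index for one promoter.

    read_counts: number of 5' reads (e.g., CAGE tags) at each position.
    ASSUMPTION: SI = 2 + sum(p_i * log2(p_i)); the slide does not state
    the formula, so treat this definition as illustrative.
    """
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("need at least one read")
    si = 2.0
    for count in read_counts:
        if count > 0:  # positions without reads contribute nothing
            p = count / total
            si += p * math.log2(p)
    return si

# Hypothetical promoters: all reads at one position versus reads
# spread evenly over eight positions.
peaked_like = shape_index([10])
broad_like = shape_index([1] * 8)
print(peaked_like, broad_like)
```

Under this definition the index is highest when all reads fall on one position and decreases as the reads spread out, which fits the description of the shape index as a continuum.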

19 Genes with peaked promoters show stronger spatial and tissue specificity
46% of genes with broad promoters are expressed in all stages of embryonic development
19% of genes with peaked promoters are expressed in all stages
Peaked promoters have more precise control
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res Feb;21(2):

20 Peaked and broad promoters are enriched in different core promoter motifs
Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.

21 Using modENCODE data to classify the type of core promoter in D. melanogaster
Only a subset of the modENCODE data are available through FlyBase
Evidence tracks on the D. melanogaster GEP UCSC Genome Browser [July 2014 (BDGP R6) assembly]:
  FlyBase gene annotations (release 6.06)
  modENCODE TSS (Celniker) annotations
  DNase I hypersensitive sites (DHS)
  9-state and 16-state chromatin models
  RNA Pol II and CBP ChIP-Seq
  Transcription factor binding site (TFBS) HOT spots
16-state hiHMM models: late embryos and third instar larvae
9-state chromHMM models: S2 and BG3 cells
CBP = CREB-binding protein, a coactivator of transcription factors
DHS data available for different embryonic stages and cell lines

22 9-state chromatin model
Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature Mar 24;471(7339):480-5.

23 DNaseI Hypersensitive Sites (DHS) correspond to accessible regions
Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119: Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature Aug 28;512(7515):

24 Classifying the D. melanogaster core promoter for each unique TSS
Peaked: single annotated TSS with a single DHS position
Intermediate: single annotated TSS with multiple DHS positions within a 300bp window
Broad: multiple annotated TSS with multiple DHS positions within a 300bp window
Insufficient evidence: no TSS annotations or DHS positions
Classification is based on each unique TSS
Different unique TSS for the same gene could have different classifications
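The classification rules above can be written as a small decision function. This is a minimal sketch: the function name and arguments are hypothetical, the counts are assumed to have been tallied already within the 300bp window, and it reads "no TSS annotations or DHS positions" as "either one is missing". Combinations the slide does not mention (for example, multiple TSS with a single DHS position) are returned as unclassified rather than guessed.

```python
def classify_core_promoter(num_tss, num_dhs):
    """Classify a D. melanogaster core promoter for one unique TSS.

    num_tss: annotated TSS positions within the 300 bp window
    num_dhs: DHS positions within the same window
    ASSUMPTION: a missing TSS annotation or a missing DHS position is
    treated as insufficient evidence (one reading of the slide).
    """
    if num_tss < 0 or num_dhs < 0:
        raise ValueError("counts cannot be negative")
    if num_tss == 0 or num_dhs == 0:
        return "Insufficient evidence"
    if num_tss == 1 and num_dhs == 1:
        return "Peaked"
    if num_tss == 1 and num_dhs > 1:
        return "Intermediate"
    if num_tss > 1 and num_dhs > 1:
        return "Broad"
    # Not covered by the rules on the slide (e.g., several TSS, one DHS).
    return "Unclassified"

for counts in [(1, 1), (1, 3), (4, 2), (0, 0)]:
    print(counts, classify_core_promoter(*counts))
```

Because the classification applies to each unique TSS, a gene with several unique TSS would call this function once per TSS.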

25 DEMO Classifying the core promoter of D. melanogaster Rad23

26 Additional DHS data from different stages of embryonic development
DHS data produced by the BDTNP project
  BDTNP = Berkeley Drosophila Transcription Network Project
Evidence tracks:
  Detected DHS Positions (Embryos)
  DHS Read Density (Embryos)
Use DHS data from cell lines as primary evidence
Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.

27 Determine the gene structure in D. melanogaster
FlyBase: GBrowse UTR CDS Gene Record Finder: Transcript Details

28 Identify the initial transcribed exon using NCBI blastn
Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder
Use placement of the flanking exons to reduce the size of the search region if possible
Increase sensitivity of nucleotide searches
  Change Program Selection to blastn
  Change Word size to 7
  Change Match/Mismatch Scores to +1, -1
  Change Gap Costs to Existence: 2, Extension: 1
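The same search settings can be expressed for the command-line BLAST+ `blastn` program, for readers who prefer to run the search locally instead of through the NCBI web form. This is a sketch under assumptions: the slide only describes the web interface, the mapping of the web options to these command-line flags is assumed, and the FASTA file names are hypothetical placeholders.

```python
import shlex

def build_blastn_command(query_fasta, subject_fasta):
    """Build a BLAST+ blastn command line with the sensitive settings.

    query_fasta: initial exon sequence(s) from D. melanogaster
    subject_fasta: contig from the target species
    Both file names are hypothetical placeholders.
    """
    return [
        "blastn",
        "-task", "blastn",      # Program Selection: blastn (not megablast)
        "-query", query_fasta,
        "-subject", subject_fasta,
        "-word_size", "7",      # Word size: 7
        "-reward", "1",         # Match score: +1
        "-penalty", "-1",       # Mismatch score: -1
        "-gapopen", "2",        # Gap existence cost: 2
        "-gapextend", "1",      # Gap extension cost: 1
        "-outfmt", "6",         # tabular output
    ]

cmd = build_blastn_command("Rad23_initial_exon.fasta", "contig19.fasta")
print(shlex.join(cmd))
```

The command is only printed here; it could be passed to `subprocess.run(cmd)` on a machine where BLAST+ is installed.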

29 Optimize alignment parameters based on expected levels of conservation
Derive alignment scores using information theory
  Relative entropy of target and background frequencies
Match +2, Mismatch -3: optimized for 90% identity
Match +1, Mismatch -1: optimized for 75% identity
  Less information available per aligned position
WU-BLAST: +5/-4, +5/-11
Information = decrease in uncertainty
  Convey more information from a larger vocabulary and a surprising answer (information is inversely related to its probability)
PAM = Point Accepted Mutations
Need each aligned base to have some minimum information content to get a significant alignment
Shannon entropy = unpredictability of a random variable
  = subtract Kullback-Leibler divergence (target vs. uniform distribution) from total entropy required to encode a message
States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods 3:66-70.
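The link between a match/mismatch pair and the percent identity it is tuned for can be checked numerically. The sketch below assumes uniform background base frequencies (0.25 each) and equally likely mismatches, and uses the standard log-odds view of alignment scores: find the scale lambda that satisfies 0.25*exp(lambda*match) + 0.75*exp(lambda*mismatch) = 1, then read off the implied identity. The function names are hypothetical.

```python
import math

def target_identity(match, mismatch, iterations=200):
    """Identity fraction for which a match/mismatch pair is optimal.

    ASSUMPTIONS: uniform base frequencies, all mismatches equally likely.
    """
    if match <= 0 or mismatch >= 0:
        raise ValueError("need match > 0 and mismatch < 0")
    if 0.25 * match + 0.75 * mismatch >= 0:
        raise ValueError("expected score per position must be negative")

    def h(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    lo, hi = 1e-6, 1.0           # h(lo) < 0 just above zero
    while h(hi) <= 0:            # expand until the positive root is bracketed
        hi *= 2
    for _ in range(iterations):  # bisection on the positive root
        mid = (lo + hi) / 2
        if h(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = (lo + hi) / 2
    return 0.25 * math.exp(lam * match)

def bits_per_position(identity):
    """Relative entropy (bits) per aligned position at a given identity."""
    if not 0 < identity < 1:
        raise ValueError("identity must be strictly between 0 and 1")
    nats = (identity * math.log(4 * identity)
            + (1 - identity) * math.log(4 * (1 - identity) / 3))
    return nats / math.log(2)

for scores in [(2, -3), (1, -1)]:
    ident = target_identity(*scores)
    print(scores, round(ident, 3), round(bits_per_position(ident), 3))
```

Under these assumptions +1/-1 comes out at 75% identity and +2/-3 at roughly 89%, in line with the values on the slide, and the relative entropy is lower at 75% identity, which is the sense in which less information is available per aligned position.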

30 DEMO blastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19

31 RNA PolII ChIP-Seq tracks (D. biarmipes only)
Show regions that are enriched in RNA Polymerase II compared to input DNA
[Figure: genome browser tracks: Gene Models, RNA PolII Peaks, RNA PolII Enrichment, RNA-Seq]

32 Use core promoter motifs to support TSS annotations
Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS
  Some motifs (e.g., Inr, TATA) are well-characterized
  Other motifs are identified based on computational analysis
Presence of core promoter motifs can be used to support the TSS annotations
  Inr motif (TCAKTY) overlaps with the TSS (-2 to +4)
Absence of core promoter motifs is a negative result
  Most D. melanogaster TSS do not contain the Inr motif
  TCA for broad promoters
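A search for the Inr consensus can be sketched with a regular expression. This is an illustration, not the method used by the GEP tracks: it expands TCAKTY with K = G or T and Y = C or T, searches both strands, and reports the position of the A as the candidate +1, which assumes the usual numbering with no position 0 (so that -2 to +4 spans the six bases). The function names and test sequences are hypothetical.

```python
import re

# TCAKTY with K = G/T and Y = C/T; the lookahead allows overlapping hits.
INR_REGEX = re.compile(r"(?=(TCA[GT]T[CT]))")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def find_inr_tss(seq):
    """Return sorted (position, strand) pairs for candidate TSS.

    position is the 1-based coordinate, on the forward strand, of the
    base corresponding to the A in TCAKTY (assumed to be +1).
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for m in INR_REGEX.finditer(seq):
        hits.append((m.start() + 3, "+"))        # A is the third base
    for m in INR_REGEX.finditer(reverse_complement(seq)):
        hits.append((n - (m.start() + 2), "-"))  # map back to forward coords
    return sorted(hits)

print(find_inr_tss("GGTCAGTCAA"))   # Inr on the forward strand
print(find_inr_tss("CCAAATGATT"))   # Inr on the reverse strand
```

A match only supports a TSS annotation alongside other evidence, since a six-base degenerate motif will occur by chance in most contigs.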

33 Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs
[Figure: TATA box and Initiator (Inr) motifs]
Available under “Projects” → “Annotation Resources” → “Core Promoter Motifs” on the GEP web site:
Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087.

34 Core Promoter Motifs tracks
Show core promoter motif matches for each contig
  Separated by strand
Visualize matches to different core promoter motifs
Use UCSC Table Browser (or other means) to export the list of motif matches within the search region
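Once the motif matches have been exported, restricting them to a search region is a simple filter. The sketch below assumes the export is in BED format (0-based start, end-exclusive) and that the search region is given in 1-based inclusive coordinates as shown in the browser. The function name and the three example records are hypothetical and are not real track data.

```python
def motifs_in_region(bed_text, region_start, region_end):
    """Keep motif matches that lie entirely within a search region.

    bed_text: BED lines (chrom, start, end, name, score, strand)
    region_start, region_end: 1-based inclusive coordinates
    Returns (chrom, start, end, name, strand) with 1-based coordinates.
    """
    hits = []
    for line in bed_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        fields = line.split()
        chrom, start0, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        strand = fields[5] if len(fields) > 5 else "."
        start1 = start0 + 1  # BED start is 0-based
        if start1 >= region_start and end <= region_end:
            hits.append((chrom, start1, end, name, strand))
    return hits

# Hypothetical motif matches on a contig (not real data).
example_bed = """\
contig19 28929 28935 Inr 0 +
contig19 28500 28508 TATA 0 +
contig19 28933 28939 Inr 0 -
"""
selected = motifs_in_region(example_bed, 28716, 28936)
print(selected)
```

Only the first record is kept: the second lies upstream of the region and the third extends past its end.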

35 DEMO Use the Inr motif to determine the TSS position of Rad23

36 Using RNA PolII ChIP-Seq and RNA-Seq data to define the TSS search region
[Figure: genome browser tracks: D. mel Transcripts, RNA PolII Peaks, RNA PolII Enrichment, RNA-Seq, with the TSS search region marked]

37 TSS annotation for Rad23
TSS position: 28,936
  Conservation with D. melanogaster: blastn of initial exon and the D. mel Transcripts track
  Location of the Inr motif
TSS search region: 28,716-28,936
  Enrichment of RNA PolII upstream of the TSS position
  RNA-Seq read coverage upstream of the TSS position
  Search region defined by the extent of the RNA PolII peak

38 TSS annotation section in the GEP Annotation Report Form
TSS report form
  Classify the type of core promoter
  Evidence that supports or refutes the TSS annotation
  Distribution of core promoter motifs in the project and in D. melanogaster
Sample TSS report for onecut on the GEP web site
  Under the “Beyond Annotation” section

39 D. biarmipes Muller F element TSS annotation projects
Reconcile coding region annotations submitted by GEP students
Reconciled annotations might be incorrect
  Based on older FlyBase release
  Possible misannotations
See the “Revised gene models report form” section of the Transcription Start Sites Project Report
Example: some coding exon annotations are based on weak evidence (e.g., Small Exon Finder)
CDS might overlap with untranslated exons or TSS of an adjacent gene

40 Additional TSS annotation resources
The D. melanogaster gene annotations are the primary source of evidence
Resources that might be useful if the D. melanogaster evidence is ambiguous
  Whole genome alignments of 14 Drosophila species
    PhastCons and PhyloP conservation scores
  Genome browsers for nine Drosophila species
    RNA Pol II ChIP-Seq (D. biarmipes only)
    RNA-Seq coverage, TopHat junctions, assembled transcripts
    Augustus and N-SCAN gene predictions
onecut example: supported by both RNA-Seq and Multiz

41 Conservation tracks on the D. melanogaster GEP UCSC Genome Browser
Whole genome alignments of 14 Drosophila species
  Drosophila Chain/Net composite track
  Generate multiple sequence (Multiz) alignments from these pairwise alignments
Identify conserved regions from Multiz alignments
  PhastCons: identify conserved elements
  PhyloP: measure level of selection at each nucleotide
Multiz alignment of 27 insect species available on the official UCSC Genome Browser
  Aug (BDGP Release 6 + ISO1 MT/dm6) assembly

42 Use the conservation tracks to identify regions under selection
PhyloP scores: Under negative selection Fast-evolving

43 Examine the Multiz alignments to identify the orthologous TSS regions

44 Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS
April 2013 (BCM-HGSC/Dbia_2.0) assembly
Search for orthologous regions in D. elegans
Use more stringent parameters than the GEP annotation projects (problem with overlapping projects).

45 Use RNA-Seq data to predict untranslated regions and putative TSS
TSS predictions available for 9 Drosophila species
  N-SCAN+PASA-EST, Augustus, TransDecoder
[Figure: genome browser tracks: D. mel Proteins, N-SCAN, Augustus, TransDecoder, RNA-Seq]

46 DEMO Genome browsers for other Drosophila species

47 TSS annotation summary
Most of the D. melanogaster core promoters have multiple TSS
  Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogaster
Define search regions that might contain TSS
Use multiple lines of evidence to infer the TSS region
  Identify initial exon
    RNA-Seq coverage
    blastn (change search parameters)
  RNA PolII ChIP-Seq peaks
  Distribution of core promoter motifs (e.g., Inr)
Maintain conservation compared to D. melanogaster

48 Questions?

