1
Annotation of Drosophila primer
GEP Workshop – August 2015 Wilson Leung and Chris Shaffer
2
Outline
Overview of the GEP annotation project
Annotation goals
Nomenclature
GEP annotation strategy
Using genomics databases
Interpreting RNA-Seq data
Understanding the phase of splice sites
“Annotation of a Drosophila gene” walkthrough
3
Start codon Coding region Stop codon Splice donor Splice acceptor UTR
AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Splice donor Splice acceptor UTR
4
Annotation: Adding labels to a sequence
Genes: novel or known genes, pseudogenes
Regulatory elements: promoters, enhancers, silencers
Non-coding RNA: tRNAs, miRNAs, siRNAs, snoRNAs
Repeats: transposable elements, simple repeats
Structural: origins of replication, synteny
Experimental results: DNase I hypersensitive sites (DHS); ChIP-chip and ChIP-Seq datasets (e.g., modENCODE)
5
GEP Drosophila annotation projects
[Figure: phylogenetic tree of Drosophila species, produced by Thom Kaufman as part of the modENCODE project]
Species shown: D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. bipectinata, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi
Legend categories: Reference; Published; Species in the Four Genomes Paper; Annotation projects for Fall 2015 / Spring 2016; Manuscript in progress; New species sequenced by modENCODE
Anticipate that we will work on the remaining projects from the D. elegans F element and new projects from the D element this fall.
6
Muller element nomenclature
Muller elements Species A B C D E F D. simulans X 2L 2R 3L 3R 4 D. sechellia D. melanogaster D. yakuba D. erecta D. ananassae XLXR 4L4R D. pseudoobscura XL 3 XR 2 5 D. persimilis D. willistoni D. mojavensis 6 D. virilis D. grimshawi
7
Gene structure nomenclature
[Figure: gene structure diagram relating the primary mRNA and the protein to the gene span, exons, UTRs, and CDSs]
8
GEP annotation strategy
Technique is optimized for projects with a moderately close, well-annotated neighbor species
  Example: D. melanogaster
Need to apply different strategies when annotating genes in other species
  Examples: corn, parrot
9
GEP annotation goals
Identify and annotate all the genes in your project
For each gene, identify and precisely map (accurate to the base pair) all coding exons (CDS)
  Do this for ALL isoforms
  Annotate the initial transcribed exon and transcription start site (TSS)
Optional curriculum not submitted to GEP
  Clustal analysis (protein, promoter regions)
  Repeats analysis
  Synteny analysis
10
Evidence-based annotation
Human-curated analysis
  More accurate than standard ab initio and evidence-based gene finders
Goal: collect, analyze, and synthesize all the available evidence to create the best-supported gene model
Example: , ,
11
Collect, analyze, and synthesize
Collect:
  Genome Browser
  Conservation (BLAST searches)
Analyze:
  Interpreting Genome Browser evidence tracks
  Interpreting BLAST results
Synthesize:
  Construct the best-supported gene model based on potentially contradictory evidence
12
Basic annotation workflow
1. Identify the likely ortholog in D. melanogaster
2. Observe the gene structure of the ortholog
3. Map each CDS of the ortholog to the project sequence
     Use BLASTX to identify the conserved region
     Note position and reading frame
4. Use these data to construct a gene model
     Use TopHat splice junctions to identify splice site boundaries, or a combination of conservation, splice signals, and reading frame
     Identify the exact start and stop base position for each CDS
5. Use the Gene Model Checker to verify the gene model
6. For each additional isoform, repeat steps 2-5
13
Annotation workflow (graphically)
[Figure: annotation workflow diagram, first half]
Feature on the contig
BLASTP search of the feature against the D. melanogaster proteins database
D. melanogaster gene model (1 isoform, 5 CDS)
BLASTX search of each D. melanogaster CDS against the contig
Alignments are placed on the contig by reading frame (1, 2, or 3)
14
Annotation workflow (graphically)
[Figure: annotation workflow diagram, second half]
BLASTX search of each D. melanogaster CDS against the contig
  Reading frames shown for the alignments: 3, 1, 3, 2, 1
Identify the exact coordinates of each CDS using the Genome Browser
  Labels shown: M (start codon), GT (splice donor), AG (splice acceptor)
Use the Gene Model Checker to verify the final CDS coordinates
Gene shown: Bem46
Coordinates: 1245-1383, 1437-1678, 1740-2081, 2159-2337, 2397-2511
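A quick arithmetic check on the coordinates from this slide, as a minimal sketch. It assumes the ten numbers pair up as five start-end intervals (1-based, inclusive) on the plus strand; the function names are illustrative and are not part of any GEP tool.

```python
# CDS coordinates from the slide, read as (start, end) pairs (assumption).
CDS = [(1245, 1383), (1437, 1678), (1740, 2081), (2159, 2337), (2397, 2511)]


def cds_lengths(cds):
    """Length of each CDS interval (1-based, inclusive coordinates)."""
    return [end - start + 1 for start, end in cds]


def intron_lengths(cds):
    """Length of each gap between consecutive CDS intervals."""
    return [next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(cds, cds[1:])]


total = sum(cds_lengths(CDS))
print("CDS lengths:", cds_lengths(CDS))
print("Total coding length:", total, "multiple of 3:", total % 3 == 0)
print("Intron lengths:", intron_lengths(CDS))
```

A total coding length that is a multiple of 3 and introns that are all at least 40 bp are consistent with a sensible model, but they do not replace the Gene Model Checker.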
15
Four main web sites used by the GEP annotation strategy
GEP UCSC Genome Browser
FlyBase
  Tools → Genomic/Map Tools → BLAST
Gene Record Finder
  Projects → Annotation Resources
NCBI BLAST
  BLASTX → select the checkbox:
16
UCSC Genome Browser
Provides a graphical view of a genomic region
  Sequence conservation
  Gene and splice site predictions
  RNA-Seq data and splice junction predictions
BLAT – BLAST-Like Alignment Tool
  Map protein or nucleotide sequence against an assembly
  Faster but less sensitive than BLAST
Table Browser
  Access raw data used to create the graphical browser
17
Two different versions of the UCSC Genome Browser
Official UCSC Version
  Published data, lots of species, whole genomes, used for “Chimp Chunks”
GEP Version
  GEP data, parts of genomes, used for annotation of Drosophila species
18
Additional training resources
Training section on the UCSC web site
  Training videos, user guides and tutorials
  Mailing lists
OpenHelix tutorials and training materials
  Pre-recorded tutorials
  Reference cards
19
GEP UCSC Genome Browser overview
[Figure: GEP UCSC Genome Browser screenshot showing the genomic sequence and the evidence tracks]
20
Control how evidence tracks are displayed on the Genome Browser
Five different display modes:
  Hide: track is hidden
  Dense: all features appear on a single line
  Squish: overlapping features appear on separate lines; features are half the height compared to full mode
  Pack: overlapping features appear on separate lines; features are the same height as full mode
  Full: each feature is displayed on its own line
Set “Base Position” track to “Full” to see the amino acid translations
Some evidence tracks (e.g., RepeatMasker) only have a subset of these display modes
21
FlyBase – Database for the Drosophila research community
Lots of ancillary data for each gene in D. melanogaster
Curation of literature for each gene
Reference D. melanogaster annotations for all other databases
  Including NCBI, EBI, and DDBJ
Fast release cycle (6-8 releases per year)
22
Be aware of different annotation releases
D. melanogaster Release 6 genome assembly
  First change of the assembly since late 2006
  Most modENCODE analysis used the Release 5 assembly
Gene annotations change much more frequently
  Use FlyBase as the canonical reference
GEP data freeze:
  Update GEP materials before the start of the semester
  Potential discrepancies in exercise screenshots
  Minor differences in search results
  Let us know about major errors or discrepancies
Release 5 datasets required by the GEP have been lifted over to Release 6
23
Gene Record Finder – Observe the structure of D. melanogaster genes
Retrieve CDS and exon sequences for each gene in D. melanogaster
CDS and exon usage maps for each isoform
List of unique CDS
Optimal for the exon-by-exon annotation strategy
24
Nomenclature for Drosophila genes
Drosophila gene names are case-sensitive
  Lowercase initial letter = recessive mutant phenotype
  Uppercase initial letter = dominant mutant phenotype
Every D. melanogaster gene has an annotation symbol
  Begins with the prefix CG (Computed Gene)
Some genes have a different gene symbol (e.g., mav)
  mav is the gene symbol, CG1901 is the annotation symbol
Suffix after the gene symbol denotes different isoforms
  mRNA = -R; protein = -P
  mav-RA = Transcript for the A isoform of mav
  mav-PA = Protein product for the A isoform of mav
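The isoform suffix convention can be illustrated with a small parser. This is a sketch under the assumption that an isoform symbol always has the form gene symbol, hyphen, R or P, then one or more uppercase letters; `parse_isoform_symbol` is a hypothetical helper, not a FlyBase API.

```python
import re

# gene symbol, "-", R (transcript) or P (protein), isoform letter(s)
ISOFORM_PATTERN = re.compile(r"^(?P<gene>.+)-(?P<kind>[RP])(?P<isoform>[A-Z]+)$")


def parse_isoform_symbol(symbol):
    """Split a symbol such as 'mav-RA' into (gene, product type, isoform)."""
    match = ISOFORM_PATTERN.match(symbol)
    if match is None:
        raise ValueError(f"not an isoform symbol: {symbol!r}")
    kind = {"R": "transcript", "P": "protein"}[match.group("kind")]
    return match.group("gene"), kind, match.group("isoform")


print(parse_isoform_symbol("mav-RA"))
print(parse_isoform_symbol("mav-PA"))
```

Matching is case-sensitive on purpose, because Drosophila gene names are case-sensitive.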
25
NCBI – comprehensive database for biomedical and genomics information
One of the most comprehensive genomics databases
  Quality of GenBank records will vary
PubMed for literature searches
BLAST web service
  BLAST search against RefSeq, nr/nt databases
  Align two (or more) sequences (bl2seq)
26
Common BLAST programs
Except for BLASTN, all alignments are based on comparisons of protein sequences
Alignment coordinates are relative to the original sequences
Decide which BLAST program to use based on the type of query and subject sequences:
Program   Query                   Database (Subject)
BLASTN    Nucleotide              Nucleotide
BLASTP    Protein                 Protein
BLASTX    Nucleotide -> Protein   Protein
TBLASTN   Protein                 Nucleotide -> Protein
TBLASTX   Nucleotide -> Protein   Nucleotide -> Protein
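The choice of program can be written as a small lookup. This is a sketch only; `choose_blast_program` and its `compare_as_protein` flag are hypothetical names used for illustration and do not correspond to an NCBI interface.

```python
def choose_blast_program(query_type, subject_type, compare_as_protein=False):
    """Pick a BLAST program from the query and subject sequence types.

    Types are "nucleotide" or "protein". compare_as_protein only matters for
    nucleotide vs. nucleotide, where TBLASTX translates both sequences.
    """
    key = (query_type, subject_type)
    if key == ("nucleotide", "nucleotide"):
        return "TBLASTX" if compare_as_protein else "BLASTN"
    programs = {
        ("protein", "protein"): "BLASTP",
        ("nucleotide", "protein"): "BLASTX",   # translated query
        ("protein", "nucleotide"): "TBLASTN",  # translated subject
    }
    if key not in programs:
        raise ValueError(f"unknown sequence types: {key}")
    return programs[key]


# Mapping a D. melanogaster CDS (protein) onto a contig (nucleotide)
# with the contig as the query uses BLASTX.
print(choose_blast_program("nucleotide", "protein"))
```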
27
Where can I run BLAST?
NCBI BLAST web service
EBI BLAST web service
FlyBase BLAST (Drosophila and other insects)
28
Data on the Genome Browser is incomplete
The Genome Browser contains many different evidence tracks:
  Sequence similarity to D. melanogaster proteins
  RNA-Seq from different developmental stages
  Gene and splice site predictions
  Sequence alignments of multiple Drosophila species
Annotators still need to identify the ortholog
WARNING! Students often over-interpret the BLASTX alignment track (D. mel Proteins); use with caution
29
Identify ortholog based on BLASTP search of the feature against D. melanogaster proteins
Large increase in E-value from mav-PA to gbb-PB
Most genes remain on the same Muller element across the different Drosophila species
  ~95% of all genes, ~90% of F element genes remain on the same Muller element
30
Basic biological constraints (inviolate rules*)
Coding regions start with a methionine
Coding regions end with a stop codon
Gene should be on only one strand of DNA
Exons appear in order along the DNA (collinear)
Intron sequences should be at least 40 bp
Intron starts with a GT (or rarely GC)
  ~1% of the genes in D. melanogaster have a GC donor site
Intron ends with an AG
Only break these rules if the exceptions are found in D. melanogaster or supported by experimental evidence
* There are known exceptions to each rule
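These rules translate naturally into a checklist. The sketch below is not the Gene Model Checker; it assumes a plus-strand model, 1-based inclusive CDS coordinates, and a last CDS interval that includes the stop codon. The function name and the test sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_INTRON_LENGTH = 40  # from the slide: introns should be at least 40 bp


def check_gene_model(sequence, cds):
    """Return a list of rule violations for a plus-strand gene model.

    cds is a list of (start, end) pairs, 1-based and inclusive, and the last
    interval is assumed to include the stop codon.
    """
    sequence = sequence.upper()

    def bases(start, end):
        return sequence[start - 1:end]

    # Exons must be collinear: each interval valid and in ascending order.
    ordered = all(s <= e for s, e in cds) and all(
        prev_end < next_start
        for (_, prev_end), (next_start, _) in zip(cds, cds[1:]))
    if not ordered:
        return ["CDS intervals are not collinear"]

    problems = []
    first_start, last_end = cds[0][0], cds[-1][1]
    if bases(first_start, first_start + 2) != "ATG":
        problems.append("coding region does not start with ATG")
    if bases(last_end - 2, last_end) not in STOP_CODONS:
        problems.append("coding region does not end with a stop codon")

    for (_, prev_end), (next_start, _) in zip(cds, cds[1:]):
        intron = bases(prev_end + 1, next_start - 1)
        if len(intron) < MIN_INTRON_LENGTH:
            problems.append(f"intron after {prev_end} is shorter than 40 bp")
        if intron[:2] not in ("GT", "GC"):
            problems.append(f"intron after {prev_end} lacks a GT/GC donor")
        if intron[-2:] != "AG":
            problems.append(f"intron before {next_start} lacks an AG acceptor")
    return problems


# Synthetic two-exon example: ATG CCA AAT G | GT...AG | TT CTC GAT TAA
toy_sequence = "ATGCCAAATG" + "GT" + "A" * 40 + "AG" + "TTCTCGATTAA"
print(check_gene_model(toy_sequence, [(1, 10), (55, 65)]))
```

Because every rule has known exceptions, a reported violation is a prompt to look at the evidence, not an automatic rejection.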
31
Evidence for gene models (in general order of importance)
Expression data
  RNA-Seq, EST, cDNA
Conservation
  Sequence similarity to genes in D. melanogaster
  Sequence similarity to other Drosophila species (Multiz)
Computational predictions
  Gene and splice site predictions
Tie-breakers of last resort
  See the “Annotation Instruction Sheet”
32
modENCODE RNA-Seq data
RNA-Seq evidence tracks:
  RNA-Seq coverage (read depth)
  TopHat splice junction predictions
  Assembled transcripts (Cufflinks, Oases)
Positive results very helpful
Negative results less informative
  Lack of transcription ≠ no gene
GEP curriculum:
  RNA-Seq Primer
  Browser-Based Annotation and RNA-Seq Data
modENCODE RNA-Seq data: mixed embryos, adult males, adult females
33
Generating RNA-Seq data (Illumina)
[Figure: generating Illumina RNA-Seq data] Processed mRNA (5’ cap, poly-A tail) is broken into RNA fragments (~250 bp), adapters are added to build the library, and paired-end sequencing produces forward and reverse RNA-Seq reads (~125 bp).
Wang Z et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.
34
Use the TopHat splice junction predictions to identify splice sites
[Figure: RNA-Seq reads from the processed mRNA (5’ cap, start codon M, stop codon *, poly-A tail) aligned to the contig; reads that span an intron produce the TopHat junctions]
35
Use BLASTX to map D. melanogaster CDS onto the contig
Use sequence similarity to infer homology
  D. melanogaster is very well annotated
  Coding sequences evolve slowly
  Exon structure changes very slowly
Change two settings in the “Algorithm parameters” section when using bl2seq
  Turn off compositional adjustments
  Turn off the low complexity filter
36
Strategies for finding small CDS
Examine RNA-Seq coverage and TopHat junctions
  Small CDS is typically part of a larger transcribed exon
Use Query subrange to restrict the search region
Increase the Expect threshold and try again
  Keep increasing the Expect threshold until you get matches
  Also try decreasing the word size
Use the Small Exon Finder
  Minimize changes in CDS size
  Available under Projects → Annotation Resources
See the “Annotation Strategy Guide” for details
37
A genomic sequence has 6 different reading frames
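[Figure: the same sequence translated in frames 1, 2, and 3]
Frame: base to begin translation relative to the first base of the sequence
Each strand has 3 frames, so a double-stranded genomic sequence has 6 in total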
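The six reading frames can be enumerated directly, as in this minimal sketch. It uses the standard genetic code and assumes the common convention that frame -1 begins at the last base of the sequence, read on the reverse complement; a given browser may number minus-strand frames differently.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codons enumerated with bases in T, C, A, G order.
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def reverse_complement(sequence):
    """Reverse complement of an uppercase DNA sequence."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(sequence, frame):
    """Translate starting at base `frame` (1, 2, or 3) of the sequence."""
    codons = (sequence[i:i + 3] for i in range(frame - 1, len(sequence) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(sequence):
    """Translations for frames +1, +2, +3 and -1, -2, -3."""
    sequence = sequence.upper()
    reverse = reverse_complement(sequence)
    frames = {}
    for frame in (1, 2, 3):
        frames[f"+{frame}"] = translate(sequence, frame)
        frames[f"-{frame}"] = translate(reverse, frame)
    return frames


for name, protein in six_frame_translation("ATGCCAAATGTT").items():
    print(name, protein)
```

Codons containing N or other ambiguous bases are reported as X and stop codons as *.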
38
Splice donor and acceptor phases
Phase: number of bases between the complete codon and the splice site
  Donor phase: number of bases between the end of the last complete codon and the splice donor site (GT/GC)
  Acceptor phase: number of bases between the splice acceptor site (AG) and the start of the first complete codon
Phase is relative to the reading frame of the CDS
39
Phase depends on the reading frame
[Figure: a splice acceptor site shown against the three forward reading frames]
Phase of acceptor site:
  Phase 2 relative to frame +1
  Phase 0 relative to frame +2
  Phase 1 relative to frame +3
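The relationship between frame and phase is simple modular arithmetic. This sketch assumes plus-strand coordinates that are 1-based, with frame 1 starting at base 1 of the sequence; the function names are illustrative.

```python
def acceptor_phase(cds_start, frame):
    """Bases between the acceptor site and the first complete codon.

    cds_start is the first CDS base after the AG; frame is 1, 2, or 3.
    Codons in frame f begin at positions f, f + 3, f + 6, ...
    """
    return (frame - cds_start) % 3


def donor_phase(cds_end, frame):
    """Bases between the last complete codon and the donor site.

    cds_end is the last CDS base before the GT (or GC).
    """
    return (cds_end - frame + 1) % 3


# A CDS that starts at base 5 has a different acceptor phase in each frame.
for frame in (1, 2, 3):
    print(f"frame +{frame}: phase {acceptor_phase(5, frame)}")
```

With a CDS start at base 5, the loop gives phase 2, 0, and 1 for frames +1, +2, and +3, the same pattern as on the slide.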
40
Phase of the donor and acceptor sites must be compatible
Extra nucleotides from donor and acceptor phases will form an additional codon
Donor phase + acceptor phase = 0 or 3
Example (phase 1 donor, phase 2 acceptor):
  Genomic: CCA AAT G | GT … … … AG | TT CTC GAT
  Spliced: CCA AAT GTT CTC GAT
  Translation: P N V L D
By definition, BLASTX cannot detect the additional codon that results from donor and acceptor phases
41
Incompatible donor and acceptor phases result in a frame shift
Genomic: CCA AAT G GT GT … … AG TT CTC GAT
Spliced using the second GT (phase 0 donor): CCA AAT GGT TTC TCG AT
Translation: P N G F S
Phase 0 donor is incompatible with phase 2 acceptor; use the prior GT, which is a phase 1 donor.
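Both splice choices can be reproduced in a few lines. This sketch joins the two exon fragments from the slides and translates the result with the standard genetic code; the helper names are illustrative.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def phases_compatible(donor_phase, acceptor_phase):
    """True when donor phase + acceptor phase is 0 or 3."""
    return (donor_phase + acceptor_phase) % 3 == 0


def splice_and_translate(upstream_exon, downstream_exon):
    """Join two exon fragments and translate from the first base."""
    mrna = upstream_exon + downstream_exon
    codons = (mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Phase 1 donor (first GT) with the phase 2 acceptor: frame is preserved.
print(phases_compatible(1, 2), splice_and_translate("CCAAATG", "TTCTCGAT"))

# Phase 0 donor (second GT) with the same acceptor: frame shift.
print(phases_compatible(0, 2), splice_and_translate("CCAAATGGT", "TTCTCGAT"))
```

The first call yields PNVLD, matching the compatible example, and the second yields PNGFS, the frame-shifted translation shown on this slide.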
42
Verify the final gene model using the Gene Model Checker
Examine the checklist and explain any errors or warnings in the GEP Annotation Report
View your gene model in the context of the other evidence tracks on the Genome Browser
Examine the dot plot and explain any discrepancies in the GEP Annotation Report
  Look for large vertical and horizontal gaps
See the “How to do a quick check of student annotations” document on the GEP web site
43
Questions? http://www.flickr.com/photos/jac_opo/240254763/sizes/l/
Introduction to Macs, Annotation of a Drosophila gene walkthrough