Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer.

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Ab initio gene prediction Genome 559, Winter 2011.
Peter Tsai, Bioinformatics Institute.  University of California, Santa Cruz (UCSC)  A rapid and reliable display of any requested portion of genomes.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Copyright OpenHelix. No use or reproduction without express written consent1 Organization of genomic data… Genome backbone: base position number sequence.
Sequence Analysis. Today How to retrieve a DNA sequence? How to search for other related DNA sequences? How to search for its protein sequence? How to.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Genome Annotation BCB 660 October 20, From Carson Holt.
Genome Evolution: Duplication (Paralogs) & Degradation (Pseudogenes)
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Locating genes in Plasmodium falciparum You have seen how artemis is used to view, analyse and annotate bacterial genomes, but now we are going to move.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
 GEP Digital Laboratory Notebook Nick Reeves, Mt. San Jacinto Community College.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
SAGExplore web server tutorial for Module II: Genome Mapping.
is accessible at: The following pages are a schematic representation of how to navigate through ALE-HSA21.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Welcome to DNA Subway Classroom-friendly Bioinformatics.
Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
Recombinant DNA Technology and Genomics A.Overview: B.Creating a DNA Library C.Recover the clone of interest D.Analyzing/characterizing the DNA - create.
1 Transcript modeling Brent lab. 2 Overview Of Entertainment  Gene prediction Jeltje van Baren  Improving gene prediction with tiling arrays Aaron Tenney.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
 GEP Implementation at Mt. San Jacinto Community College Nick Reeves, Ph.D.
Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015.
Searching for Transcription Start Sites in Drosophila
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Overview of the Drosophila modENCODE hybrid assemblies Wilson Leung01/2014.
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Genomics Education Partnership: a flexible approach to implement Genomic teachings and research in the classroom Matthew W. Wadsworth and Consuelo J. Alvarez,
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
How can we find genes? Search for them Look them up.
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Fgenes++ pipelines for automatic annotation of eukaryotic genomes Victor Solovyev, Peter Kosarev, Royal Holloway College, University of London Softberry.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.
-1- Module 3: RNA-Seq Module 3 BAMView Introduction Recently, the use of new sequencing technologies (pyrosequencing, Illumina-Solexa) have produced large.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
Finding genes in the genome
Accessing and visualizing genomics data
What is BLAST? Basic BLAST search What is BLAST?
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Gene Finding in Chimpanzee Evidence based improvement of ab initio gene predictions Chris Shaffer06/2009.
Genomes at NCBI. Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools lists 57 databases.
Welcome to the combined BLAST and Genome Browser Tutorial.
Primer on Reading Frames and Phase Wilson Leung08/2012.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Web Databases for Drosophila
What is BLAST? Basic BLAST search What is BLAST?
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
Annotation of Drosophila
Annotation for D. virilis
bacteria and eukaryotes
Annotating The data.
Basics of BLAST Basic BLAST Search - What is BLAST?
Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.
TSS Annotation Workflow
GEP Annotation Workflow
Visualization of genomic data
Ab initio gene prediction
BLAST.
Basic Local Alignment Search Tool
Determine CDS Coordinates
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Presentation transcript:

Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer

Agenda Overview of the GEP annotation project GEP annotation strategy Types of evidence Analysis tools Web databases Annotation of a single isoform (walkthrough)

AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATC GTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAAT AGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAG GACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACA TCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGA TGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCC AGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTG GAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAA ACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATA CGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCC GATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTG GTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCC CTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAA CTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGG CTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCG GAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATC GCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCG GGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCC AAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAA TGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTAT GTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAA ATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTT TGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Splice donorSplice acceptor UTR

Annotation: Adding labels to a sequence Genes : Novel or known genes, pseudogenes Regulatory Elements : Promoters, enhancers, silencers Non-coding RNA : tRNAs, miRNAs, siRNAs, snoRNAs Repeats : Transposable elements, simple repeats Structural : Origins of replication Experimental Results : DNase I Hypersensitive sites ChIP-chip and ChIP-Seq datasets ( e.g., modENCODE)

GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. ananassae D. bipectinata D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Reference Published Annotation projects for Fall 2015 / Spring 2016 Species in the Four Genomes Paper New species sequenced by modENCODE Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project Manuscript in progress

Muller element nomenclature SpeciesABCDEF D. simulans X2L2R3L3R4 D. sechellia X2L2R3L3R4 D. melanogaster X2L2R3L3R4 D. yakuba X2L2R3L3R4 D. erecta X2L2R3L3R4 D. ananassae XLXR3R3L2R2L4L4R D. pseudoobscura XL43XR25 D. persimilis XL43XR25 D. willistoni XL2R2LXR3 D. mojavensis X35426 D. virilis X45326 D. grimshawi X32546 Muller elements

Nomenclature for Drosophila genes Drosophila gene names are case-sensitive Lowercase initial letter = recessive mutant phenotype Uppercase initial letter = dominant mutant phenotype Every D. melanogaster gene has an annotation symbol Begins with the prefix CG ( C omputed G ene) Some genes have a different gene symbol ( e.g., mav ) Suffix after the gene symbol denotes different isoforms mRNA = -R ; protein = -P mav-RA = Transcript for the A isoform of mav mav-PA = Protein product for the A isoform of mav

GEP annotation strategy Technique optimized for projects with a moderately close, well annotated neighbor species Example: D. melanogaster Need to apply different strategies when annotating genes in other species: Example: corn, parrot

GEP annotation goals Identify and annotate all the genes in your project For each gene, identify and precisely map (accurate to the base pair) all coding exons ( CDS ) Do this for ALL isoform Annotate the initial transcribed exon and transcription start site (TSS) Optional curriculum not submitted to GEP Clustal analysis (protein, promoter regions) Repeats analysis Synteny analysis

Evidence-based annotation Human-curated analysis Much higher accuracy than standard ab initio and evidence-based gene finders Goal : collect, analyze, and synthesize all the available evidence to create the best-supported gene model: Example: , ,

Collect, analyze, and synthesize Collect: Genome Browser Conservation ( BLAST searches) Analyze: Interpreting Genome Browser evidence tracks Interpreting BLAST results Synthesize: Construct the best-supported gene model based on potentially contradictory evidence

Evidence for gene models (in general order of importance) 1.Expression data RNA-Seq, EST, cDNA 2.Conservation Sequence similarity to genes in D. melanogaster Sequence similarity to other Drosophila species (Multiz) 3.Computational predictions Gene and splice site predictions 4.Tie-breakers of last resort See the “Annotation Instruction Sheet”

Expression data: RNA-Seq Positive results very helpful Negative results are less informative Lack of transcription ≠ no gene Evidence tracks: RNA-Seq coverage (read depth) TopHat splice junction predictions Assembled transcripts ( Cufflinks, Oases ) GEP curriculum: RNA-Seq Primer Browser-Based Annotation and RNA-Seq Data

Gene structure - terminology Gene span Primary Exons UTR’s CDS’s mRNA Protein

Basic annotation workflow 1.Identify the likely ortholog in D. melanogaster 2.Determine the gene structure of the ortholog 3.Map each CDS of ortholog to the project sequence Use BLASTX to identify conserved region Note position and reading frame 4.Use these data to construct a gene model Identify the exact start and stop base position for each CDS 5.Use the Gene Model Checker to verify the gene model 6.For each additional isoform, repeat steps 2-5

BLASTX search of each D. melanogaster CDS against the contig Contig Feature Annotation workflow (graphically) Contig BLASTP search of feature against the D. melanogaster proteins database Feature D. melanogaster gene model (1 isoform, 5 CDS) 1 3 Reading frame Alignment 21 3

Identify the exact coordinates of each CDS using the Genome Browser Annotation workflow (graphically) BLASTX search of each D. melanogaster CDS against the contig Contig 1 3 Reading frame Alignment 21 3 GT 1 AGGT M 3Reading frame Use the Gene Model Checker to verify the final CDS coordinates Gene model , , , , Coordinates:

UCSC Genome Browser Provide a graphical view of genomic regions Sequence conservation Gene and splice site predictions RNA-Seq data and splice junction predictions BLAT – B LAST - L ike A lignment T ool Map protein or nucleotide sequence against an assembly Faster but less sensitive than BLAST Table Browser Access raw data used to create the graphical browser

UCSC Genome Browser overview Genomic sequence Evidence tracks

Two different versions of the UCSC Genome Browser Official UCSC Version Published data, lots of species, whole genomes, used for “Chimp Chunks” GEP Version GEP data, parts of genomes, used for annotation of Drosophila species

Additional training resources Training section on the UCSC web site User guides and tutorials Mailing lists OpenHelix tutorials and training materials Pre-recorded tutorials Reference cards

Four web sites used by the GEP annotation strategy Open 4 tabs on your web browser: 1.GEP UCSC Genome Browser ( Genome Browser D. virilis – Mar – chr10 2.FlyBase ( Tools  BLAST 3.Gene Record Finder GEP web site  Projects  Annotation Resources Information on the D. melanogaster gene structure 4.NCBI BLAST ( BLASTX  select the checkbox:

Initial survey of a genomic region Investigate gene prediction chr10.4 in a fosmid project (chr10) from D. virilis using the GEP UCSC Genome Browser

Navigate to the Genome Browser for the project region Configure the Genome Browser Gateway page: clade = Insect genome = D. virilis assembly = Mar (GEP/Annot. D. virilis ppt) position = chr10 Click “submit”

Control how evidence tracks are displayed on the Genome Browser Five display modes: Hide : track is hidden Dense : all features appear on a single line Squish : overlapping features appear on separate lines Features are half the height compared to full mode Pack: overlapping features appear on separate lines Features are the same height as full mode Full: each feature is displayed on its own line Set “Base Position” track to “Full” to see the translations Some evidence tracks ( e.g., RepeatMasker ) only have a subset of these display modes

Initial assessment of fosmid project chr10 Seven gene predictions ( features ) from Genscan Need to investigate each feature if one were to annotate this entire project

Investigate gene prediction chr10.4 Enter chr10: in the position box and click jump to navigate to this region Click on the feature and select “ Predicted Protein ” to retrieve the predicted protein sequence Select and copy the sequence

Computational evidence Assumption: there are recognizable signals in the DNA sequence that the cell uses; it should be possible to detect these signals computationally Many programs designed to detect these signals Use machine learning and Bayesian statistics These programs do work to a certain extent The information they provide is better than nothing The predictions have high error rates

Accuracy of the different types of computational evidence DNA sequence analysis: ab initio gene predictors – exons vs. genes Splice site predictors – high sensitivity but low specificity Multi-species alignment – low specificity BLASTX protein alignment: WARNING! Students often over-interpret the BLASTX alignment track (D. mel Proteins ); use with caution

Data on the Genome Browser is incomplete Most evidence has already been gathered for you and they are shown in the Genome Browser tracks Annotator still needs to generate conservation data 1.Assign orthology 2.Look for regions of conservation using BLAST Collect the locations (position, frame, and strand) of conservation to the orthologous protein If no expression data are available, these searches might need to be exhaustive (labor intensive)

Detect sequence similarity with BLAST BLAST = B asic L ocal A lignment S earch T ool Why is BLAST popular? Provide statistical significance for each match Good balance between sensitivity and speed Identify local regions of similarity

Common BLAST programs Except for BLASTN, all alignments are based on comparisons of protein sequences Alignment coordinates are relative to the original sequences Decide which BLAST program to use based on the type of query and subject sequences: ProgramQueryDatabase (Subject) BLASTN Nucleotide BLASTP Protein BLASTX Nucleotide -> ProteinProtein TBLASTN ProteinNucleotide -> Protein TBLASTX Nucleotide -> Protein

Where can I run BLAST ? NCBI BLAST web service EBI BLAST web service FlyBase BLAST ( Drosophila and other insects)

Detect conserved D. melanogaster coding exons with BLASTX Coding sequences evolve slowly Exon structure changes very slowly Use sequence similarity to infer homology D. melanogaster very well annotated Use BLAST to find similarity at amino acid level As evidence accumulates indicating the presence of a gene, we could justify spending more time and effort looking for conservation

FlyBase – Database for the Drosophila research community Key features: Lots of ancillary data for each gene in Drosophila Curation of literature for each gene Reference Drosophila annotations for all other databases Including NCBI, EBI, and DDBJ Fast release cycle (~6-8 releases per year)

Web databases and tools Many genome databases available Be aware of different annotation releases Release 6 versus release 5 assemblies Use FlyBase as the canonical reference Web databases are being updated constantly Update GEP materials before semester starts Potential discrepancies in exercise screenshots Minor changes in search results Let us know about errors or revisions

Finding the ortholog BLASTP search of the chr10.4 gene prediction against the set of D. melanogaster proteins at FlyBase

BLASTP results show a significant hit to the mav gene Note the large change in E-value from mav-PA to the next best hit gbb-PB; good evidence for orthology FlyBase naming convention: -[RP] R = mRNA, P = Protein mav-PA = Protein product from the A isoform of the gene mav

Results of ortholog search Degree of similarity consistent with comparison of typical proteins from D. virilis versus D. melanogaster Regions of strong similarity interspersed with regions of little or no similarity Same Muller element supports orthology We have a probable ortholog: mav

Constructing a gene model We need more information on mav : Determine the gene function and structure in D. melanogaster If this is the ortholog, we also need the amino acid sequence of each CDS Use two web sites to learn about the gene structure: Gene Record Finder Table list of isoforms, exon sequences FlyBase Graphical representation of the gene; relationship among exons

Obtain the gene structure model Using the Gene Record Finder and FlyBase to determine the gene structure of the D. melanogaster gene mav

Retrieve the Gene Record Finder record for mav Type mav into the search box, then press [Enter] Click on the “ View in GBrowse ” link to get a graphical view of the gene structure on FlyBase

Gene structure of mav Both the graphical and table view shows mav has a single isoform ( mav-RA ) in D. melanogaster This isoform has two transcribed exons Both exons contain translated regions Gene Record Finder table formatFlyBase GBrowse graphical format

Investigate exons Search each CDS from D. melanogaster against the project sequence ( e.g., fosmid / contig) Identify regions within the project sequence that code for amino acids similar to the D. melanogaster CDS Best to search with the entire project DNA sequence Easier to keep track of positions and reading frames

Identify the conserved regions Use BLASTX to map each D. melanogaster CDS from the mav gene against the D. virilis chr10 fosmid sequence

Retrieve CDS sequences In the Polypeptide Details section of the Gene Record Finder, select a row in the CDS table The corresponding CDS sequence will appear in the Sequence viewer window

Retrieve the fosmid sequence Go back to the Genome Browser (first tab) and navigate to chr10 Click on the DNA button on the top menu bar and then click on the “ get DNA ” button

Compare the CDS against the fosmid sequence with BLASTX Copy and paste the genomic sequence from tab 1 into the “Enter Query Sequence” textbox Copy and paste the sequence for the CDS 1_9604_0 from tab 2 into the “Enter Subject Sequence” textbox Expand the “Algorithm parameters” section: Turn off compositional adjustments Turn off the low complexity filter Click “BLAST”

Adjust the Expect threshold E-value is negatively correlated with aligned length Difficult to find small exons with BLAST Sometimes BLAST cannot find regions with significant similarity: Use the Query subrange field to restrict the search region Change the Expect threshold to 1000 and try again Keep increasing the Expect threshold until you get hits This strategy will identify regions with very weak similarity but they can be better than nothing

Examine the BLASTX alignment BLASTX result shows a weak alignment 50 identities and 94 similarities (positives) Low degree of similarity is not unusual when comparing single exons from these two species Location : , frame : +3, missing first 92 aa

Map the second CDS (2_9604_0) Perform similar BLASTX search with CDS 2_9604_0 Location : chr10:18,476-19,747; Frame : +2; Alignment include the entire CDS except for first amino acid For larger genes, continue mapping each exon with BLASTX (adjust Expect threshold as needed) Record location, frame and missing amino acids BLAST might not find very small exons Move on and come back later Plot the location of each CDS might help Note the parts of the CDS’s missing from the alignments

Building a gene model Determine the exact start and end positions of each exon for the putative mav-PA ortholog

Identify missing parts of CDS BLASTX CDS alignments for mav on D. virilis ,86617,50418,47619,747 Frame +3 Frame +2

Basic biological constraints (inviolate rules*) Coding regions start with a methionine Coding regions end with a stop codon Gene should be on only one strand of DNA Exons appear in order along the DNA (collinear) Intron sequences should be at least 40 bp Intron starts with a GT (or rarely GC) Intron ends with an AG * There are known exceptions to each rule

Find the missing start codon Only a single start codon (16,515-16,517 in frame +3) upstream of conserved region before the stop codon Start codon at 16,977-16,979 would truncate conserved region CDS alignment start at 16,866 but 92 amino acids missing 16,866 - (92 * 3) = 16,590

A genomic sequence has 6 different reading frames Frame : Base to begin translation relative to the first base of the sequence Frames 1 2 3

Splice donor and acceptor phases Phase : Number of bases between the complete codon and splice site Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC) Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon Phase is relative to the reading frame of the exon

Phase depends on the reading frame Phase of acceptor site: Phase 2 relative to frame +1 Phase 0 relative to frame +2 Phase 1 relative to frame +3 Splice Acceptor

Phase of donor and acceptor sites must be compatible Extra nucleotides from donor and acceptor phases will form an additional codon Donor phase + acceptor phase = 0 or 3 GTAG… … … CC A AA T G CT C GA T TT PNVLD Translation: CC A AA T G CT C GA T GTT CC A AA T CT C GA T TT

Incompatible donor and acceptor phases results in a frame shift Phase 0 donor is incompatible with phase 2 acceptor; use prior GT, which is a phase 1 donor. CC A GTGT AG… AA T G CT C GA T TT PNGFS Translation: GT CC A AA T TC G ATGGTGGT TTC

Picking the best acceptor site Reading frame ( +2 ) dictated by BLASTX result Two splice acceptor candidates: 18,471-18,472 (phase 0) 18,482-18,483 (phase 1), remove conserved amino acids 18,471-18,472 is the better candidate Pick different candidate if no phase 0 donor in previous exon

Picking the best donor site Reading frame ( +3 ) dictated by BLASTX result Two splice donor candidates: 17,505-17,506 (phase 0) 17,518-17,519 (phase 1) Phase 0 donor (17,505-17,506) is compatible with phase 0 acceptor (18,471-18,472)

Create the final gene model Pick ATG (M) at the start of the gene For each putative intron/exon boundary, use the Genome Browser to locate the exact first and last base of the exon After splicing, conserved amino acids are linked together in a single long open reading frame CDS coordinates for mav : ,

Additional strategies for identifying splice donor and acceptor sites For many genes, the location of donor and acceptor sites can be identified easily based on: Locations and quality of the CDS alignments Evidence of expression from RNA-Seq TopHat splice junctions When amino acid conservation is absent, other evidence and techniques must be considered. See the following documents for help: Annotation Instruction Sheet Annotation Strategy Guide

Verify the final gene model using the Gene Model Checker Examine the checklist and explain any errors or warnings in the GEP Annotation Report View your gene model in the context of the other evidence tracks on the Genome Browser Examine the dot plot and explain any discrepancies in the GEP Annotation Report Look for large vertical and horizontal gaps See the “ How to do a quick check of student annotations ” document on the GEP web site

Questions?