Annotation for D. virilis

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Ab initio gene prediction Genome 559, Winter 2011.
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Copyright OpenHelix. No use or reproduction without express written consent1 Organization of genomic data… Genome backbone: base position number sequence.
CSE182-L12 Gene Finding.
Eukaryotic Gene Finding
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Lecture 12 Splicing and gene prediction in eukaryotes
What was the most interesting thing that you did over Winter Break? Create a double bubble map comparing/contrasting DNA and RNA.
Eukaryotic Gene Finding
Genome Annotation BCB 660 October 20, From Carson Holt.
Genome Evolution: Duplication (Paralogs) & Degradation (Pseudogenes)
Locating genes in Plasmodium falciparum You have seen how artemis is used to view, analyse and annotate bacterial genomes, but now we are going to move.
Transcription Transcription is the synthesis of mRNA from a section of DNA. Transcription of a gene starts from a region of DNA known as the promoter.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
 GEP Digital Laboratory Notebook Nick Reeves, Mt. San Jacinto Community College.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer.
GeneWise and Artemis Exercises Spliced Alignment using GeneWise Click on the GeneWise hyperlink on the course links page,
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
Mark D. Adams Dept. of Genetics 9/10/04
 GEP Implementation at Mt. San Jacinto Community College Nick Reeves, Ph.D.
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
How can we find genes? Search for them Look them up.
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.
Transcription and Translation of DNA How does DNA transmit information within the cell? PROTEINS! How do we get from DNA to protein??? The central dogma.
Finding genes in the genome
What is BLAST? Basic BLAST search What is BLAST?
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Gene Finding in Chimpanzee Evidence based improvement of ab initio gene predictions Chris Shaffer06/2009.
Welcome to the combined BLAST and Genome Browser Tutorial.
Visualization of genomic data Genome browsers. How many have used a genome browser ? UCSC browser ? Ensembl browser ? Others ? survey.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
Web Databases for Drosophila
Annotation of Drosophila
bacteria and eukaryotes
Annotating The data.
Eukaryotic Gene Structure
Protein Synthesis (Transcription and Translation)
Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.
TSS Annotation Workflow
GEP Annotation Workflow
Visualization of genomic data
Eukaryotic Gene Finding
Visualization of genomic data
Genome Center of Wisconsin, UW-Madison
Ab initio gene prediction
Introduction to Bioinformatics II
Ensembl Genome Repository.
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
BLAT Blast Like Alignment Tool
Basic Local Alignment Search Tool
Determine CDS Coordinates
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Presentation transcript:

Annotation for D. virilis 08/22/2017 Annotation for D. virilis Chris Shaffer July 2012 Last update: 08/2017

Annotation “Big Picture” of annotation and then one practical example This technique may not be the best with other projects (e.g. corn, bacteria) The technique is optimized for projects with a moderately close, well annotated neighbor species (i.e. Drosophila melanogaster)

GEP goals: Identify & annotate genes in your fosmid 08/22/2017 GEP goals: Identify & annotate genes in your fosmid For each gene identify and precisely map (accurate to the base pair) all CDS’s Do this for ALL isoforms Optional Curriculum not sent to GEP Clustal analysis (protein, promoter) Repeat analysis Synteny analysis

Evidence Based Annotation Human curated analysis Much better outcome than standard ab initio gene finders Goal: Collect, analyze and synthesize all evidence available to create most likely gene model: 4591-4688, 5157-5490, 5747-6001

Collect, analyze and synthesize Genome Browser (gander.wustl.edu) Conservation (BLAST searches) Analyze Reading browser tracks Interpreting BLAST results Synthesize Annotation Instruction Sheet Annotation Strategy Guide

Evidence for Gene Models Interpret in the context of basic biology (in general order of importance) Expression Data Conservation Computational predictions Tie-breakers of last resort

Basic Biology (context) Known biological properties Coding regions start with a methionine Coding regions end with a stop codon Genes are only on one strand of DNA Exons appear in order along the DNA Introns start with GT (or rarely GC) Introns end with an AG CCTAGAGTACCA….CAGATAGCTAGGAG

Expression: RNA-seq Positive result very helpful Negative result less informative No transcription ≠ No Gene modENCODE RNA-Seq: RNA-Seq (read depth) Junction prediction (e.g. TopHat) Noisy, Noisy Data

Conservation Coding sequences evolve slowly Exon structure changes VERY slowly Similarity implies homology D. melanogaster very well annotated Use BLAST to find a.a. conservation As evidence accumulates indicating the presence of a gene, more time/effort is justified when looking for conservation

Computational Evidence Assumption: there are recognizable signals in the DNA sequence that the cell uses; it should be possible to detect these computationally Many programs designed to detect these signals These programs do work to a certain extent, the information they provide is better than nothing; high error rates

Computational Evidence DNA sequence analysis: Ab initio gene predictors – exons vs genes Splice site predictors – low quality Summary/organization of observations Multi-species conservation – low specificity BlastX alignment (WARNING! Students often over interpret this data, use with caution or ignore)

Genome Browser Lots and Lots of evidence! Organized for your use 08/22/2017 75 kb, organize all the data (expression, conservation, computational) organize and present and Lots and Lots of evidence! Organized for your use

08/22/2017 Browser is incomplete Most evidence has already been gathered for you and is shown in the genome browser tracks However, some conservation data must be generated by the annotator at the time of annotation Assignment of orthology Look for regions of conservation using BLAST Collect the locations (position, frame and strand) of conservation to orthologous protein If no expression data available these searches may need to be exhaustive, labor intensive

Typical Annotation Identify the likely ortholog in D. mel. 08/22/2017 Typical Annotation Identify the likely ortholog in D. mel. Find gene model of ortholog, view structure and get exon sequences (FlyBase & Gene Record Finder) Use BLASTX to find conservation to exons; search exon by exon, note position and frame Based on data generated above, plus other data in browser create gene model; identify the exact base location (start and stop) of each CDS (coding exon) for each isoform Confirm your model using Gene Model Checker and genome browser.

Basic Procedure (graphically) fosmid feature BLASTX of predicted gene to D. melanogaster proteins suggests this region is orthologous to a D. melanogaster gene with 1 isoform and 5 exons: BLASTX of each exon to locate region of similarity and which translated frame encodes the similar amino acids: 1 3 3 2 1

Basic Procedure (graphically) 1 3 3 2 1 Frame Alignments To annotate, finds ends of exons (i.e. start codon, donor (GT) and acceptor (AG) sites, stop codon). Use conservation and browser info to create model making sure to keep a single long open frame that encodes conserved a.a. 1 3 Met GT AG GT Once these have been identified, write down the exact location of the first base and last base of each exon. Use these numbers to check your gene model in Gene Model Checker. 1121 1187 1402 1591 1754 1939 2122 2434 2601 2789

Questions before the Break?

Evidence for Gene Models Expression Data Conservation Computational predictions 1 & 3 are already done for you (Browser) 2 is generated by annotator

Searching for Conservation Search for conservation “to do” list Use BLAST to identify ortholog Use FlyBase to view gene structure of ortholog – 99% to 100% conserved Use BLAST to find amino acid conservation at single exon level

Example Annotation Open 4 Tabs http://gander.wustl.edu 08/22/2017 Example Annotation Open 4 Tabs http://gander.wustl.edu Genome Browser D. virilis - Mar 2005 - chr10 http://flybase.org Tools  Genomic/Map Tools  BLAST Gene Record Finder Info from most recent D. mel genome annotations https://www.ncbi.nlm.nih.gov/ BLAST page  BLASTX  click the checkbox:

Example Annotation from Drosophila virilis 08/22/2017 Example Annotation from Drosophila virilis Settings are: Insect; D. virilis; Mar. 2005; chr10 (chr10 is a fosmid from 2005) Click submit

Example Annotation Seven predicted Genscan genes 08/22/2017 Example Annotation Seven predicted Genscan genes Each one would be investigated

Investigate 10.4 We will focus on 10.4 in this example 08/22/2017 Investigate 10.4 We will focus on 10.4 in this example To zoom in on this gene enter: chr10:15000-21000 in the position box Then click the “go” button

08/22/2017 Find Ortholog If this is a real gene it will probably have at least some homology to a D. melanogaster protein Step one: do a BLAST search with the predicted protein sequence of 10.4 to all proteins in D. melanogaster

Find Ortholog Click on one of the exons in gene 10.4 08/22/2017 Find Ortholog Click on one of the exons in gene 10.4 On the Genscan report page click on Predicted Protein Select and copy the sequence Do a blastp search of the predicted sequence to the D. melanogaster “Annotated Proteins” (Tab 2)

08/22/2017 Find Ortholog The results show a significant hit to the the gene “mav” the PA names the isoform Note the large step in E value from mav to next best hit gbb; good evidence for orthology

Results of Ortholog search 08/22/2017 Results of Ortholog search The alignment looks right for D. virilis vs. D. melanogaster - regions of high similarity interspersed with regions of little or no similarity Same chromosome supports orthology We have a probable ortholog: “mav”

08/22/2017 Gene Structure Model We need information on “mav”; What is mav? What does its gene look like? If this is the ortholog we will also need the protein sequence of each exon Combination of two sites: Use GEP’s Gene Record Finder TABLE List of isoforms; exon sequences Use FlyBase GRAPHIC representation of gene; relationship among exons

08/22/2017 Gene Structure Model We need information on “mav”; What is mav? What does its gene look like? Use the Gene Record Finder Available off the “Projects” menu at gep.wustl.edu

Gene Structure Model Search for “mav” (search box) Only one mav so select and hit the button:

Gene Structure Model The chart shows which isoforms has each exon or CDS: FlyBase GBrowse graphical format Gene Record Finder table format

Gene Structure Model Terminology: Primary mRNA Protein Gene span Exons UTR’s CDS’s

08/22/2017 Gene Structure Model We now have a putative gene model (one isoforms with 2 CDS). For your final report you would annotate all isoforms for all the genes you identify in your fosmid.

08/22/2017 Investigate Exons Click on the exon to get amino acid sequence of each CDS

Investigate Exons Use exon sequence (protein) to search fosmid 08/22/2017 Investigate Exons Use exon sequence (protein) to search fosmid Where in this fosmid are sequences which code for amino acids that are similar? Best to search entire fosmid DNA sequence (easier to keep track of positions) with the amino acid sequences of exons

08/22/2017 Investigate Exons In the first tab, go to the browser chr10 of D. virilis; click the “DNA” button under the “View” menu, then click “get DNA” In the third tab, make sure you have the peptide sequence for the D. melanogaster mav gene These two tabs now have the two sequences you are going to compare

08/22/2017 Investigate Exons Copy and paste the genomic sequence from tab 1 into the top box 1 of tab 4 Copy and paste the peptide sequence of exon 1 from tab 3 into bottom box 2 Since we are comparing a DNA sequence on top to an amino acid sequence on the bottom we need to run BLASTX Open the “algorithm parameters” section: turn off the filter and compositional adjustments; leave other values at default Click the “BLAST” button to run the comparison

Investigate Exons Sometimes you have not find any similarity 08/22/2017 Investigate Exons Sometimes you have not find any similarity If so: change the expect value to 1000 or even larger and click align, keep raising the expect value until you get hits, then evaluate hits by position This will show very weak similarity which can be better than nothing

08/22/2017 Investigate Exons We have a weak alignment (50 identities and 94 similarities), but we have seen worse when comparing single exons from these two species Notice the location of the hit (bases 16866 to 17504) and frame +3 and missing 92 aa

08/22/2017 Investigate Exons A similar search with exon 2 sequences gives a location of chr10:18476-19747 and frame +2 with all a.a. aligning For larger genes continue with each exon, searching with BLASTX for similarity (adjusting e cutoff if needed) noting location, frame and any missing a.a Very small exons may be undetectable, move on and come back later; find all conservation before making final model Draw these out if this will help Note unaligned aa as well

Evidence for mav annotation 08/22/2017 Evidence for mav annotation 16,866 17,504 18,476 19,747 93 271 1 431 Frame +3 Frame +2

Start Codon 16,866 17,504 18,476 19,747 93 271 1 431 Frame +3 Frame +2 1st CDS starts @ 16515

Evidence for mav annotation 16,866 17,504 18,476 19,747 93 271 2 431 Frame +3 Frame +2 Which GT? Which AG?

08/22/2017 Create a Gene Model Pick ATG (met) at start of gene, first met in frame with coding region of similarity (+3) For each putative intron/exon boundary compare location of BLASTX result to locate exact first and last base of the exon such that the conserved amino acids are linked together in a single long open reading frame CDS’s: 16515-17504; 18473-19744 Intron GT and AG present

08/22/2017 Create a Gene Model For many genes the locations of donor and acceptor sites will be easily identified based on the locations and quality of the alignments of the individual exons and how these regions compare with evidence of expression from RNA-seq. However when amino acid conservation is absent, other evidence must be considered. See the handout “Annotation Instruction Sheet” for more help.

08/22/2017 Confirm Gene Model We will do this later once you have a real model of your own to check As a final check, do all three: Enter coordinates into gene model checker to confirm it is a valid model Use custom tracks (magnifying glass) to view model and double check that the final model agrees with all your evidence Examine dot plot to discover possible errors