Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.

Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of Health and Rehabilitation Sciences Department of Health Information Management

Annotation Goals of this Class
Create an evidence based set of gene annotations for the dot chromosome and a similarly sized “control” region of a long chromosome arm. This gene annotation set includes all putative isoforms, coding sequence only.

Ab initio Gene Prediction
Gene prediction using only the genomic DNA sequence Search for “signals” of protein coding regions Typically uses a probabilistic model Hidden Markov Model (HMM) “Putative models”- external evidence needed to verify predictions (e.g. mRNA, ESTs) We will use multiple gene finders (e.g. Genscan, Nscan, SNAP) for our Drosophila annotations

Performance of Gene Finders
Most gene finders can predict prokaryotic genes accurately (>98%) However, gene-finders do a poor job of predicting genes in eukaryotes Not as much is known about the general properties of eukaryotic genes Splice site recognition, different isoforms

Common Problems Common problems with gene finders Other challenges
Fusing neighboring genes Splitting a single gene Missing exons or entire genes Over predict exons or genes Other challenges Nested genes Noncanonical splice sites Pseudogenes Different isoforms of same gene

How to Improve Predictions?
Improve algorithms Identify conserved sequences from multiple sequence alignments Consensus model based on results from multiple gene finders Computational predictions that incorporate biological evidence (e.g. ESTs, cDNAs) Manual annotation Collect all the available evidence from multiple biological and computational sources to create the best gene models

Generate Consensus Gene Model
Each gene prediction program has different strengths and weaknesses. Create consensus gene set by combining results from multiple gene finders to generate the best reference set For the 12 Drosophila species that have been sequenced, the reference gene set are available from FlyBase

Manual Annotation Evidence for genes: Goals for this class
EST or other expression data Conservation to known genes Computational predictions Goals for this class Use web resources to collect evidence Practice collecting and analyzing data

Web Databases and Tools
For this class we will use the following sites: UCSC Genome Browser for Drosophila: FlyBase: NCBI: GEP Gene Finder: GEP Model Checker:

Phylogenetic Tree for Drosophila

UCSC Genome Browser GEP version, parts of genomes, GEP data, used for annotation of Drosophila species Strengths Genome Browser: graphical views of genomic regions BLAT: BLAST-Like Alignment Tool Table Browser: access underlying data used to generate the graphical genome browser Weaknesses Constraints on ability to display data Uses View and organize available computational results Follow along to view a section of drosophila genome

Adjusting Display Options
Adjust following tracks to “pack” Under “Gene and Gene Prediction Tracks” FlyBase Genes, RefSeq Genes, N-SCAN, CONTRAST Adjust following tracks to “dense” Under “mRNA and EST Tracks” D. melanogaster mRNAs, Spliced ESTs Under “Comparative Genomics” Conservation Adjust following tracks to “full” Under “Mapping and Sequencing Tracks” Base Position Under “Variation and Repeats” RepeatMasker

FlyBase http://www.flybase.org Strengths Weaknesses Uses
Lots of ancillary data for each gene Curation of literature for each gene Weaknesses Difficult to find data if you don’t already know where to look Uses Species-specific BLAST searches Genome browsers and data sets for all the sequenced Drosophila species

Basic Strategy for Annotation
Use Ab Initio prediction to focus attention on genomic features (areas) of interest Add as much other evidence as you can to refine the gene model and support your conclusion What other evidence is there? Basic gene structure Motif information BLAST homologies: nr, protein, est Other species or other proteins

Eukaryotic Genome Annotation
BLAST homology: nr, protein, EST Homology to known proteins argues against false positive Consider length, percent identity when examining alignments Without good EST evidence you can never be sure; make your best guess and be able to defend it Other species or other proteins For any similarity hit, look for even better hits elsewhere in the genome If you are convinced you have a gene and it is a member of a multi-gene family, be sure to pick the right ortholog

Drosophila Sequence Annotation

GEP projects http://gep.wustl.edu D. erecta
Close to D. melanogaster easier to annotate D. grimshaw, D. virilis and D. mojavensis Further from D. melanogaster More difficult to annotate Will use example in D. virilis but basic technique is the same

The Drosophila genomes
Average gene size will be smaller than mammals Very low density of pseudogenes Almost all genes will have the same basic structure as D. melanogaster orthologs; mapping exon by exon works well for most genes Genes only rarely move to different chromosomal element

Goals Identify genes For each gene identify and precisely map (accurate to the base pair) all exons Do this for ALL isoforms

Evidence Based Annotation
Human curated analysis Much better outcome than standard ab initio gene finders Goal: Collect, analyze and synthesize all evidence available and come up with most likely gene model Evidence for Gene Models Basic Biology Expression Data Conservation Computational

Basic Biology Known biological properties
Coding regions start with a methionine Coding regions end with a stop codon Genes are only on one strand of DNA Exons appear in order along the DNA Introns start with GT (or rarely GC) Introns end with a AG CCTAGAGTACCA….CAGATAGCTAGGAG

Expression: EST, mRNA, Other
Protein coding genes must be expressed Positive result very helpful Negative result less informative Did not find message because looking in wrong tissue or developmental stage Transcription rate so low, messages remain undetected Drosophilids: only 20,000 EST from each species; only helpful in rare cases

Conservation Coding sequences evolve slowly
Similarity to known genes suggests new genes D. melanogaster very well annotated 12 other drosophila genomes now available This will be your most important source of evidence

Computational Evidence
Assumption: there are recognizable signals in the DNA sequence that the cell uses; it should be possible to detect these algorithmically Many programs designed to detect these signals These programs do work to a certain extent, the information they provide is better than nothing; high error rates

How to Proceed You will need to investigate all features of interest:
Ab initio gene finder results Watch out for ends - fused or split genes Regions of high similarity with D. melanogaster proteins and EST’s, identified by BLAST but not overlapping with 1 above Overlapping genes usually on opposite strand Be vigilant for partial genes at fosmid ends Regions with high similarity to known proteins (i.e. BLAST to nr) but not found by 1 or 2 above

Practical Example in D. virilis
The following example will give a general outline of how we recommend you proceed, works in many cases Goal: come up with the best gene model that integrates all the evidence in a manner that maximizes likely outcomes and minimizes contradictions

Basic Annotation Procedure
For each feature (exon in this case) of interest: Identify the likely ortholog in D. mel. Use D. mel. database to find gene model of ortholog and identify protein seq for each exon Use BLASTX to locate exons; search one by one, find conservation, note position and frame Based on locations, frames of conservation, as well as other evidence create gene model; identify the exact base location (start and stop) of each CDS (coding exon) for each isoform Confirm your model using Gene checker and genome browser.

Basic Procedure (graphically)
contig feature BLASTX of predicted gene to D. melanogaster proteins suggests this region orthologous to Dm gene with 1 isoform and 5 exons: BLASTX of each exon to locate region of similarity and which translated frame encodes the similar amino acids: 1 3 3 2 1 Frame alignments

Basic Procedure (graphically)
Zoom in on ends of exons and find first met, matching intron Doner (GT) and Acceptor (AG) sites and final stop codon, making sure to keep the frame intact 1 3 Met GT AG GT Once these have been identified, write down the exact location of the first base and last base of each exon. Use these numbers to check your gene model and if correct generate and save GFF file 1121 1187 1402 1591 1754 1939 2122 2434 2601 2789

Example Annotation Open 4 Tabs gander.wustl.edu Flybase.org
Genome browser D. virilis - Mar chr10 Flybase.org Tools  blast The Gene Record Finder Info from most recent D. mel genome The search term is case sensitive (Fts is different from fts) Blast page BLASTX  click the checkbox:

Example Annotation from Drosophila virilis
Settings are: Insect; D. virilis; Mar. 2005; chr10 (chr10 is a fosmid from 2005) Click submit

Example Annotation Seven predicted Genscan genes
Each one would be investigated

Investigate 10.4 All putative genes will need to be analyzed; we will focus on 10.4 in this example To zoom in on this gene enter: chr10: in position box Then click jump button

Step 1: Find Ortholog If this is a real gene it will probably have at least some homology to a D. melanogaster protein Step one: do a BLAST search with the predicted protein sequence of 10.4 to all proteins in D. melanogaster In Genome Browser ( Click on one of the exons in gene 10.4 On the Genscan report page click on Predicted Protein Select and copy the sequence On the flybase blast page ( Do a blastp search of the predicted sequence to the D. melanogaster “Annotated Proteins” (Database)

Step 1: Find Ortholog The flybase blastp results show a significant hit to the “A”, “B” and “C” isoforms of the gene “mav” Note the large step in E value from Mav isoforms to next best hit gbb; good evidence for orthology

Step 1: Results of Ortholog Search
The alignment looks right for virilis vs. melanoaster- regions of high similarity interspersed with regions of little or no similarity Same chromosome supports orthology We have a probable ortholog: “mav”

Step 2: Gene Structure Model
We need information on “mav”; What is mav? What does its gene look like? If this is the ortholog we will also need the protein sequence of each exon Use the Gene Record Finder Available off the “Projects” menu at Projects Annotation Resources  Gene Record Finder

Search for “mav” (search box) Only one mav so select and hit the button:

Sequence of exons in the mav gene in D. mel

We now have a gene model (three isoforms each with 2 CDS). In this case, we will annotate isoform A only. For your project report you would annotate ALL isoforms for all genes you identify in your fosmid/contig. Isoforms B and C only differ in non-coding regions. Simply make B model, duplicate and rename for submission

Step 3: Investigate Exons
The last section of the Gene record includes the exons:

Use exon to search fosmid with exon Where in this fosmid are sequences which code for amino acids that are similar Best to search entire fosmid DNA sequence (easier to keep track of positions) with the amino acid sequences of exons In the genome browser tab, go to the browser “chr10” of D. virilis; click the DNA button, then click “get DNA” In the Gene Record Finder tab, make sure you have the peptide sequence for the melanogaster mav gene exons These two tabs now have the two sequences you are going to compare

Copy and paste the fosmid genomic sequence obtained from genome browser tab into the top box 1 of blastx Copy and paste the peptide sequence of exon 1 from “gene record finder” tab into bottom box 2 of blastx Open the “algorithm parameters” section: turn off the filter; leave other values at default Click “BLAST” button to run the comparison

BLASTX Comparison

Sometimes you have not find any similarity If so: change the expect value to 1000 or even larger and click “BLAST”, keep raising the expect value until you get hits, then evaluate hits by position This will show very weak similarity which can be better than nothing

We have a weak alignment (50 identities and 94 similarities), but we have seen worse when comparing single exons from these two species Notice the location of the hit (bases to 17504) and frame +3 and missing 92 aa

A similar search with exon 2 sequences gives a location of chr10: and frame +2 with one amino acids missing

For larger genes continue with each exon, searching with BLASTX for similarity (adjusting e cutoff if needed) noting location, frame and any missing amino acids Very small exons may be undetectable, move on and come back later Draw these out if this will help Note unaligned amino acid as well 16,866 17,504 18,476 19,744 93 271 2 430 Frame +3 Frame +2

Step 4: Create a Gene Model
Pick ATG (met) at start of gene, first met in frame with coding region of similarity (+3) For each putative intron/exon boundary compare location of BLASTX result to locate exact first and last base of the exon such that the conserved amino acids are linked together in a single long open reading frame Exons: ; Intron GT and AG present

Step 4: Create a Gene Model
For many genes the locations of donor and acceptor sites will be easily identified based on the locations of the alignments of the individual exons. However when amino acid conservation is not sufficient to allow for the identification of intron/exon boundaries then other evidence must be considered. See the handout “Annotation Instruction sheet” for more help.

Splicing The splice of life: Alternative splicing and neurological disease, B. Kate Dredge, Alexandros D. Polydorides & Robert B. Darnell, Nature Reviews Neuroscience 2, (January 2001)

Alternative Splicing Splicing mutations can arise by alteration of conserved splice donor and splice acceptor sequences or by activation of cryptic splice sites (A) Mutations at conserved splice donor (SD) or splice acceptor (SA) sequences result in intron retention where there is failure of splicing and an intron sequence is not excised; or in exon skipping where the spliceosome brings together the splice donor and splice acceptor sites of non-neighboring exons. (B) Sequences that are very similar to the splice donor or splice acceptor sequences may coincidentally exist in introns and exons (sd and sa). These sequences are not normally used in splicing and so are known as cryptic splice sites. A mutation can activate a cryptic splice site by making the sequence more like the consensus splice donor or acceptor sequence and the cryptic splice site can now be recognized and used by the spliceosome (activation of the cryptic splice site). From “Human molecular genetics”, Tom Strachan and Andrew Read

Alternative Splicing

Revisit “Phase” Introns and internal exons are divided according to “phase”, which is closely related to the reading frame. An intron which falls between codons is considered phase 0 An intron after the first base of a codon, phase 1 An intron after the second base of a codon, phase 2 Internal exons are similarly divided according to the phase of the previous intron (which determines the codon position of the first base-pair of the exon, hence the reading frame).

Step 5: Confirm Gene Model
We will do this later once you have a real model of your own to check As a final check: enter coordinates into “gene model checker” (gep.wustl.edu, projects  annotation resource  gene model checker) to confirm it is a valid model Use custom tracks to view model and double check that the final model agrees with all your evidence Examine dot plot to discover possible errors

Considerations Some exons are very hard to find (small or non-conserved) keep increasing E value to find any hits; restrict location based on flanking exons Donor “GC” seen on rare occasions We have seen a couple of examples where the only reasonable interpretation was that the basic gene structure had changed (i.e. intron was gained or lost) Use the techniques described in the Handout as well as discussions with your peers when things get difficult. Research is almost always a collaborative effort.

Complex Genes with Many Isoforms
Some genes will have many isoforms and many exons. The technique described for mav does not scale well to these large complex genes. The recommended protocol in these cases: Identify all unique exons and which isoforms have which exons (Gene record finder) Map each unique exon Build gene models of each isoform based on which exons it contains; re-use previously found exons

Homework 6 Read the “Annotation Instruction Sheet”
Go through “A Simple Annotation Problem” and do the exercises Get familiar with “Annotation Report” Read requirements for your annotation project report Work on an annotation project (details are given in the next slide) Read “Gene and Disease” (ebook) on NCBI website, select one genetic disease and prepare roughly 10 slides for your presentation given in next lecture (3/12/2012, provide the slides to me before the class)

Annotation Procedure You are provided a zip file named “derecta_2nd3Lcontrol_Nov2011_fosmid22.zip”, unzip it you will get a folder named “derecta_2nd3Lcontrol_Nov2011_fosmid22” Go into the provided folder, go to subfolder “src”, get sequence named “fosmid22.fasta.masked”. Use only this masked genomic sequence when the genomic sequence is needed. This sequence is produced by repeatmasker and the corresponding report can be found in subfolder “analysis”  “Repeats” Go to select “D. erecta” for genome, “Nov.2011” for assembly and type “fosmid22” in the “position or search term” box, then click submit In the genome browser, click one predicted gene from GenScan, obtain the predicted protein sequence in the next page Use blastp on the flybase website (use “Annotated Protein” database), search this predicted protein against annotated D. mel proteins. Find the best match and determine the gene name in D. mel. Adjust blast parameters when needed Search this gene in the GEP gene record finder, find D. mel gene details (exon amino acids sequences, gene structure etc.). Search the masked genomic sequence against each amino acid sequence of exon in each isoform of D. mel using blastx on the NCBI website (The genomic sequence as query and each exon as subject). Record the matched regions (in the exon and in the genomic sequence), frame Determine precise exon boundary by using signals (ATG, GT, AG, TAA,…), phase, conservation and frame information in the genome browser Fill out the project report form

Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.

Similar presentations

Presentation on theme: "Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.

Similar presentations

Presentation on theme: "Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of."— Presentation transcript:

Similar presentations

About project

Feedback