SPIDA Substitution Periodicity Index and Domain Analysis Combining comparative sequence analysis with EST alignment to identify coding regions Damian Keefe.

Presentation transcript:

SPIDA Substitution Periodicity Index and Domain Analysis Combining comparative sequence analysis with EST alignment to identify coding regions Damian Keefe Birney Group EBI

SPIDA - Motivation
Improve UTR annotation.
Make use of the ever-expanding EST resource, a good source of UTRs.
Make use of the ever-increasing number of comparative genomes.
Cope with inaccurate and partial data sets.
Complement existing Ensembl methodologies.
Novelty: not probabilistic modelling.

SPIDA: The Basic Idea
Use comparative data to determine whether ESTs are mapped to the correct place by looking for coding signal, i.e. a 3-periodic substitution pattern.
Provide CDS annotation: determine the translation frame from the mutation pattern.
Eliminate false positives and pseudogenes by requiring evidence from at least two species separated by ~50 MY of independent evolution.
Annotate further using Pfam (verify?).

Substitution periodicity index
acgtacgtacgtacgtacgtacgtacgt
aagtaggtacgaacgtccgttagcacgt
Total substitutions for each frame: f0 = 1, f1 = 2, f2 = 5
S0 = 1/((2+5)/2) = 0.29
S1 = 2/((1+5)/2) = 0.67
S2 = 5/((1+2)/2) = 3.33
If a denominator = 0, set denominator = 1.
SPI = max(S0,S1,S2)
If we know the ‘wobbly’ frame, we also know the translation frame.
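
The calculation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPIDA code itself: the function names `frame_counts` and `spi` are hypothetical, the alignment is assumed to be gap-free, and the frame of a position is assumed to be its 0-based index modulo 3.

```python
def frame_counts(ref, other):
    """Count substitutions at each position modulo 3 in a gap-free aligned pair.

    Assumption: frame = 0-based alignment index % 3.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(ref.lower(), other.lower())):
        if a != b:
            counts[i % 3] += 1
    return counts


def spi(counts):
    """Return (SPI, wobble_frame, [S0, S1, S2]) from per-frame counts [f0, f1, f2]."""
    scores = []
    for k in range(3):
        # Denominator: mean substitution count of the other two frames.
        denom = (sum(counts) - counts[k]) / 2
        if denom == 0:
            denom = 1  # slide rule: a zero denominator is replaced by 1
        scores.append(counts[k] / denom)
    wobble = max(range(3), key=lambda k: scores[k])
    return scores[wobble], wobble, scores


# Per-frame totals quoted on the slide: f0 = 1, f1 = 2, f2 = 5.
score, wobble, scores = spi([1, 2, 5])
print(round(score, 2), wobble)  # 3.33 2
```

The frame with the highest score is taken as the ‘wobbly’ frame, which is what allows the translation frame to be inferred.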

Distributions of SPI values

SPI: Current Implementation
Multiple Pairwise SPI is calculated at whole-exon and 48 bp window resolution on each exon. The ORF status of each frame in each window is also determined.
Heuristics are then applied (very preliminary!).
A window threshold is applied: minimum 7 mutations, minimum SPI of 7. (This is a quick way of ensuring that the probability of this pattern of mutations occurring by chance is less than 1/100.)
Exons or windows are grown by amino-acid sequence walking in the SPI frame to a stop codon in both directions. The resulting ORF is a CP3O: Conserved with a Periodicity of 3 Open reading frame.
The same CP3O must be generated by more than one species if it is to be accepted. Mouse and/or rat count as one species.
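
The "grow to a stop codon in both directions" step might look roughly like the sketch below. It rests on several assumptions that are not in the slides: the function names are hypothetical, only the forward strand of a single genomic sequence is handled, the wobble frame is taken to be the third codon position, the starting window is assumed to contain no in-frame stop, and the terminating stop codon is included at the 3’ end.

```python
STOPS = {"taa", "tag", "tga"}


def codon_phase_from_wobble(wobble_frame):
    """If positions with index % 3 == wobble_frame are third codon positions,
    codons start at positions with index % 3 == (wobble_frame + 1) % 3."""
    return (wobble_frame + 1) % 3


def grow_to_stops(seq, start, end, phase):
    """Extend the half-open interval [start, end) codon by codon in `phase`.

    Walks upstream until the preceding codon is a stop (or the sequence
    starts) and downstream until a stop codon has been included (or the
    sequence ends). Returns the grown (start, end).
    """
    seq = seq.lower()
    # Snap the window inward onto codon boundaries of the chosen phase.
    left = start + (phase - start) % 3
    right = end - (end - phase) % 3
    # Upstream: stop just after the nearest in-frame stop codon.
    while left - 3 >= 0 and seq[left - 3:left] not in STOPS:
        left -= 3
    # Downstream: include codons up to and including the first stop.
    while right + 3 <= len(seq):
        codon = seq[right:right + 3]
        right += 3
        if codon in STOPS:
            break
    return left, right


# taa | atg gct gct tga | ccc, seed window covering the first "gct" codon
print(grow_to_stops("taaatggctgcttgaccc", 6, 9, 0))  # (3, 15)
```

In SPIDA the resulting interval would only be accepted as a CP3O if more than one species produced the same one; that comparison is not shown here.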

SPIDA for EGASP: Overview
Map dbEST to the genome (Exonerate).
Filter, then flatten ‘transcript’ structures, preserving all exon boundaries.
SPI analysis of exons to give validation and reading frame (TBA alignments).
Extend CDS (CP3O) from validated exons.
Remove ‘transcripts’ with inconsistencies.
Pfam_fs search of translated CDS.
Reject CDS from single-exon ESTs with max Pfam-e > 1.0.
Report unique exons.

Degree of Automation
Given:
an Ensembl database containing EST mappings
a database containing multispecies alignments
the Ensembl computational infrastructure
hmmpfam + Pfam_fs
MySQL
the entire procedure is, and will continue to be, completely automatic.

Confessions & Cockups
The first time the script ran all the way through was April 15th. I didn’t have time to run it again or check the results against the design set.
EST selection was too conservative, so too few exons were found.
I wasn’t aware that the EST mapping by Exonerate had placed a substantial proportion of the mappings on the wrong strand.

Analysis in the light of known shortcomings
1. The SPI calculation is strand-agnostic, so we ignore strand.
2. After the filtering procedure, the remaining ESTs overlapped 5900 of the 9313 unique Vega exons (including putatives and pseudogenes) by at least one base. So, at most, SPIDA might have found/confirmed 5900 exons given the input data.
3. A number of filters in the SPIDA procedure were designed to identify and remove exons which looked ‘wrong’. These filters will have removed a number of exons which were correctly identified as conserved and three-periodic, but were mapped to the wrong strand. It is not possible to compensate for these filters in the following analysis, so the Sn value is probably too low.

Analysis in the light of known shortcomings
Of the 5900 Vega exons covered by the filtered ESTs, SPIDA confirmed 5033 as having 3-periodicity. Of these, 5019 were known-validated and 6 were putative; eight were from pseudogenes. This suggests a minimum sensitivity in excess of 80%.
In total SPIDA confirmed 5037 exons, i.e. 4 false positives. However, the highly significant (very low) e-values observed in Pfam domain analysis of 3 of these exons suggest they may not be false. Either way, the specificity of SPIDA is in excess of 99%.
| tscpt_id | domain description                    | domain_e |
|          | ATP:guanido phosphotransferase N-ter  | 7e-55    |
|          | Elongation factor Tu GTP binding doma | 1.3e-55  |
|          | Protein of unknown function (DUF431)  | 1.7      |
|          | Fibronectin type III domain           | 7.5e-34  |
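
The figures quoted on this slide can be checked with a little arithmetic. The sketch below uses only the counts given above. It assumes that "specificity" is meant in the usual gene-prediction sense of TP / (TP + FP), and the variable names are illustrative.

```python
covered = 5900          # Vega exons overlapped by the filtered ESTs
confirmed_vega = 5033   # of those, confirmed as 3-periodic by SPIDA
confirmed_total = 5037  # all exons confirmed by SPIDA

# Breakdown given on the slide: known-validated + putative + pseudogene.
assert 5019 + 6 + 8 == confirmed_vega

missed = covered - confirmed_vega                  # exons not confirmed
false_positives = confirmed_total - confirmed_vega
sensitivity = confirmed_vega / covered             # TP / (TP + FN)
specificity = confirmed_vega / confirmed_total     # TP / (TP + FP), assumed definition

print(missed, false_positives)
print(f"Sn = {sensitivity:.1%}, Sp = {specificity:.2%}")
```

This gives 867 missed exons and 4 false positives, with Sn of about 85.3% and Sp of about 99.92%, in line with the "in excess of 80%" and "in excess of 99%" statements.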

Why did SPIDA miss 867 exons?
There were 343 pseudogene exons (137 processed pseudogenes). It is not supposed to find these.
There were a similar number of Vega putative exons; wet-lab analysis suggests 85% of these may not be real.
They were filtered out because they looked wrong (probably mapped to the wrong strand).
There was no informative alignment.
There was an alignment but the mutation rate was too low to get SPI above threshold.
There was an alignment but the exon wasn't entirely ORF and no windows achieved the SPI threshold.

Issues for the Future
Mappings to the wrong strand.
Transcripts with validated exons but inconsistent frames.
Partial-exon ESTs.
Joining ESTs.
Better multi-species alignments.
Gene-finder or screening tool?
Re-engineer the software for speed and compute-farm deployment.

Acknowledgements Ewan NIH-NHGRI EBI Ensembl Team Havana Team Sanger ISG Team Elliott Margulies and team. Ben Paton Guy Slater

SPI: Variations
Pairwise: the basic idea, calculated from an aligned sequence pair.
Multiple Pairwise: heuristic evaluation of the results of Pairwise SPI calculated on several species for the same human ‘exon’.
MultiSpecies: use the substitutions from more than one aligned species to calculate SPI.
acgtacgtacgtacgtacgtacgtacgt
aagtaggtacgaacgtccgttagcacgt
acgtaagtgcgtacgtacgttcgtacgt
(Table: total substitutions for each frame f0, f1, f2.)

SPI: Possible Modes of Use
Whole Exon: use BLAST or Exonerate to map ESTs to the genome. SPI gives validation and translation frame. Only good for mid-rank, i.e. non-UTR, exons.
Windowed Exon: as above, but SPI (and mutation rate) is calculated in a sliding window. Detect CDS starts and ends by extending above-threshold windows: 3’ to a stop, 5’ to a step up in mutation rate.
Windowed Alignment: scan raw comparative alignments. For each above-threshold window, extend 3’ and 5’ to a splice-site / poly-pyrimidine / branch signal / TATAA signal.
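
A windowed scan of the kind described here could be sketched as follows. The window width (48 bp) and the thresholds (at least 7 mutations, SPI of at least 7) are the ones quoted on the "Current Implementation" slide. The step size of 3, the function names and the gap-free pairwise input are assumptions made for illustration.

```python
def spi_from_counts(counts):
    """Return (SPI, wobble_frame) for per-frame substitution counts [f0, f1, f2]."""
    scores = []
    for k in range(3):
        # Mean count of the other two frames; a zero denominator becomes 1.
        denom = (sum(counts) - counts[k]) / 2 or 1
        scores.append(counts[k] / denom)
    best = max(range(3), key=scores.__getitem__)
    return scores[best], best


def windowed_spi(ref, other, width=48, step=3, min_subs=7, min_spi=7.0):
    """Slide a window along a gap-free aligned pair and report windows that
    pass both thresholds as (start, end, wobble_frame, spi) tuples.

    Frames are taken relative to the start of the alignment (index % 3),
    so the reported frame is comparable between windows.
    """
    hits = []
    for start in range(0, len(ref) - width + 1, step):
        counts = [0, 0, 0]
        for i in range(start, start + width):
            if ref[i] != other[i]:
                counts[i % 3] += 1
        score, frame = spi_from_counts(counts)
        if sum(counts) >= min_subs and score >= min_spi:
            hits.append((start, start + width, frame, score))
    return hits


# Toy example: 8 substitutions, all at the third position of a codon.
ref = "acg" * 16
other = "aca" * 8 + "acg" * 8
print(windowed_spi(ref, other))
```

Windows that pass would then be the seeds for extension to CDS starts and ends; the 5’ "step up in mutation rate" criterion is not modelled in this sketch.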

Filtering for EGASP
ESTs mapping to ENCODE ( ‘exons’).
After removing all mappings with indels: (84201).
Keeping only the best alignment for each EST, and only EST mappings which begin at the first base of the EST: (71865).
After flattening: 8927 ESTs = 8977 different transcripts = exons.
After SPI analysis: 3640 CP3Os.
After selecting one CP3O per transcript: 1908 CP3Os = 7407 ‘exons’.
After removing 5’ fragments with no Met: 7389 exons.
Unique start, end, chr, strand combinations: 5240 ‘exons’.
After liftOver: 5193 ‘exons’.
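
Two of the filtering steps listed here are simple enough to sketch. Everything about the data layout is an assumption: the record fields (`est_id`, `score`, `est_start`, `has_indel`) and function names are hypothetical, and the order in which the indel, first-base and best-alignment filters are combined is a guess rather than something the slide specifies.

```python
def best_mapping_per_est(mappings):
    """Drop mappings with indels or not starting at EST base 1, then keep
    the highest-scoring remaining mapping for each EST."""
    best = {}
    for m in mappings:
        if m["has_indel"] or m["est_start"] != 1:
            continue
        current = best.get(m["est_id"])
        if current is None or m["score"] > current["score"]:
            best[m["est_id"]] = m
    return list(best.values())


def unique_exons(exons):
    """Collapse exons to unique (chr, start, end, strand) combinations,
    preserving first-seen order."""
    seen = set()
    unique = []
    for chrom, start, end, strand in exons:
        key = (chrom, start, end, strand)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


mappings = [
    {"est_id": "est1", "score": 90, "est_start": 1, "has_indel": False},
    {"est_id": "est1", "score": 95, "est_start": 1, "has_indel": True},
    {"est_id": "est2", "score": 80, "est_start": 4, "has_indel": False},
]
print(best_mapping_per_est(mappings))
print(unique_exons([("chr1", 100, 200, "+"), ("chr1", 100, 200, "+"),
                    ("chr1", 100, 200, "-")]))
```

Strand is part of the key, so the same coordinates on opposite strands count as two exons, matching the "start, end, chr, strand" combination named on the slide.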