CDS predictions using DOGFISH-C David Carter Wellcome Trust Sanger Institute 6th May 2005.



DOGFISH
Detection Of Genomic Features In Sequence Homologies
A four-component system to detect splice sites, coding starts/stops, etc. in multiple-species alignments
DOGFISH-C
“Contextual” component only
Plus a simple best-path CDS finder to derive single transcripts

DOGFISH components
What’s in an alignment?
Taxonomic information: mutations, or the lack of them, at a given position. Evolutionary models.
Contextual information: does each sequence look right in itself? (DOGFISH-C)
Indels: where are the gaps? Which species are present at all?
Derive an estimate from each “view”, and combine them into a single result.

Training data
UCSC MultiZ 8-species vertebrate alignments (minus chimp, plus frog)
VEGA gene set from March 2005
DOGFISH trained to discriminate true sites from equal numbers of decoys taken at random from within genes
Final best-path search tuned using genes from 13 Encode training regions

Deriving per-site probability estimates
A candidate site is represented by 100 bases each side of the site itself, and 100 each side of every informant-species position it is aligned with: up to 8 x 200 bases in total.
Step 1: derive many statistics per species
  1a: position-specific weight matrices
  1b: significant k-mers in subregions
Step 2: derive one estimate per species
Step 3: integrate into a single estimate

Step 1a: Position-specific weight matrices
Train 6th-order position-specific weight matrices: one for each coding phase for true sites, and one for decoys.
Given a candidate sequence for a given species, find the overall best-scoring true-site model, i.e. find the most likely phase.
At each position, take the log-odds between the best true-site model and the decoy model, giving 200 log-odds scores.
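The scoring in Step 1a can be sketched roughly as follows. This is a minimal illustration, not the DOGFISH code: the function names, the add-one pseudocount, and the use of a short context (`order=1` in the usage below, rather than 6) are all assumptions made to keep the example small.

```python
import math
from collections import defaultdict

def train_pwm(seqs, order=1, pseudo=1.0):
    """Position-specific model of P(base at i | preceding `order` bases).

    Hypothetical helper: counts are kept per position and per context.
    """
    length = len(seqs[0])
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(length)]
    for s in seqs:
        for i in range(length):
            ctx = s[max(0, i - order):i]
            counts[i][ctx][s[i]] += 1
    return {"counts": counts, "order": order, "pseudo": pseudo}

def log_prob(model, seq, i):
    """Smoothed log-probability of seq[i] given its context at position i."""
    ctx = seq[max(0, i - model["order"]):i]
    row = model["counts"][i].get(ctx, {})
    total = sum(row.values()) + 4 * model["pseudo"]
    return math.log((row.get(seq[i], 0.0) + model["pseudo"]) / total)

def positional_log_odds(true_models, decoy_model, seq):
    """Pick the best-scoring true-site model (the most likely phase), then
    return one log-odds score per position against the decoy model."""
    totals = [sum(log_prob(m, seq, i) for i in range(len(seq)))
              for m in true_models]
    best = true_models[totals.index(max(totals))]
    return [log_prob(best, seq, i) - log_prob(decoy_model, seq, i)
            for i in range(len(seq))]
```

With one true-site model per coding phase in `true_models` and 200-base windows, `positional_log_odds` would return the 200 per-position scores described on the slide.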

Step 1b: Diagnostic k-mers-in-regions
As well as applying weight matrices, count occurrences of 200 “diagnostic” k-mers (k=1 to 6) within specific regions of the 200-base window.
“Diagnostic” means the frequency differs between true and decoy sites: e.g. AG is rare in positions -30 to -1 for true acceptor sites but not for decoys.
Captures more subtle, less position-specific effects.
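Counting k-mers within regions might look like the sketch below. The feature representation, a list of `(kmer, start, end)` tuples with window-relative, end-exclusive coordinates, is an assumption for illustration; how the 200 diagnostic features were actually selected is not shown here.

```python
def count_kmer_in_region(seq, kmer, start, end):
    """Count (possibly overlapping) occurrences of kmer lying wholly
    inside seq[start:end]."""
    k = len(kmer)
    return sum(1 for i in range(start, end - k + 1) if seq[i:i + k] == kmer)

def kmer_features(seq, features):
    """Evaluate a list of hypothetical (kmer, start, end) features on one window."""
    return [count_kmer_in_region(seq, kmer, start, end)
            for kmer, start, end in features]
```

For the acceptor example above, a feature such as `("AG", 70, 100)` would count AG dinucleotides in the 30 bases upstream of a site placed at window offset 100 (that offset is again an assumption).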

Step 2: convert 400 scores per species to one estimate per species
Now we have 200 positional log-odds scores and 200 k-mer counts for our 200-base sequence, but we want a single probability estimate (that this site is a true one).
Train and run a relevance vector machine (RVM): it decides which are the useful (“relevant”) statistics and what weight to give each one.
This gives better results than just adding the scores (as we would if we made the independence assumptions made in e.g. HMMs).
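The difference between simply adding scores and applying a sparse, learned weighting can be illustrated as below. This is not an RVM trainer: it only shows the form of the output once weights exist, assuming a linear model passed through a logistic function. The weights and bias are hypothetical values, not learned ones.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def naive_estimate(scores):
    """Independence-style combination: add all scores, then squash."""
    return sigmoid(sum(scores))

def sparse_weighted_estimate(scores, weights, bias=0.0):
    """Combination with a sparse weight set: only the statistics judged
    relevant (keys of `weights`, indexed into `scores`) contribute."""
    z = bias + sum(w * scores[i] for i, w in weights.items())
    return sigmoid(z)
```

A sparse weighting lets correlated statistics be down-weighted or dropped, which plain summation cannot do.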

Step 3: convert up to 8 per-species estimates to one overall estimate
Now we have an estimate for each species that aligned to the target.
Boost the estimates of species that did align, and introduce a low “default” estimate for those that didn’t; more distant species have larger boosts and milder defaults.
Train and run another RVM that takes (exactly) 8 inputs and outputs the single DOGFISH-C estimate for this site.
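One way to build the fixed-length input vector for this second stage is sketched here. The slide does not say how the boost is applied; adding it in log-odds space is an assumption, and the species names and parameter values in the usage note are purely illustrative.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def species_inputs(estimates, species_params):
    """Return one input per species, in a fixed order.

    estimates: dict of species -> per-species probability, aligned species only.
    species_params: ordered list of (species, boost, default); hypothetical values.
    Aligned species get their estimate boosted in log-odds space (an assumption);
    absent species get their default.
    """
    inputs = []
    for name, boost, default in species_params:
        if name in estimates:
            inputs.append(inv_logit(logit(estimates[name]) + boost))
        else:
            inputs.append(default)
    return inputs
```

For example, `species_inputs({"human": 0.8, "chicken": 0.6}, [("human", 0.0, 0.5), ("mouse", 0.5, 0.2), ("chicken", 1.0, 0.35)])` yields three inputs, with mouse filled by its default because it did not align.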

Error rates (%) on balanced test set
“Error” means an estimate below 0.5 for a true site or above 0.5 for a decoy.
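An error rate of this kind could be computed as in the sketch below, assuming an error is a true site scored below 0.5 or a decoy scored above 0.5, and that the test set holds equal numbers of each. The function name is hypothetical.

```python
def balanced_error_rate(true_estimates, decoy_estimates, threshold=0.5):
    """Percentage of test sites on the wrong side of the threshold."""
    errors = sum(p < threshold for p in true_estimates)
    errors += sum(p > threshold for p in decoy_estimates)
    return 100.0 * errors / (len(true_estimates) + len(decoy_estimates))
```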

Predicting CDSs (in a hurry)
A candidate CDS is any sequence [ATG|AG] … … [Stop|GT]
Use the DOGFISH-C candidate site estimates for the two boundary sites
Introduce a further statistic based on which species get an alignment with “convincing” length across the candidate CDS
CDS estimate = 5’-site estimate * 3’-site estimate * aligned-species estimate
Hand-tune a few more parameters (missing tea break)
Apply DP search to look for the best legal CDS sequence (so single transcript only) across the Encode region
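The CDS estimate and a best-path search over candidates can be sketched as follows. This is a simplified stand-in: it treats "legal" as merely non-overlapping and ignores reading frame, start/stop ordering, and the hand-tuned parameters, none of which are specified here. Scoring each candidate by its log-odds, so that weak candidates are penalised, is also an assumption.

```python
import bisect
import math

def cds_estimate(five_prime, three_prime, aligned_species):
    """Product of the two boundary-site estimates and the aligned-species estimate."""
    return five_prime * three_prime * aligned_species

def best_cds_chain(candidates):
    """Dynamic programme over candidates given as (start, end, estimate),
    with end exclusive and 0 < estimate < 1.

    Returns (score, chain), where chain is the non-overlapping set of
    (start, end) intervals with the highest summed log-odds.
    """
    cands = sorted(candidates, key=lambda c: c[1])
    ends = [c[1] for c in cands]
    best = [(0.0, [])]  # best[i]: optimum using only the first i candidates
    for i, (start, end, est) in enumerate(cands):
        gain = math.log(est / (1.0 - est))
        # Number of earlier candidates that end at or before this start.
        j = bisect.bisect_right(ends, start, 0, i)
        take = best[j][0] + gain
        if take > best[i][0]:
            best.append((take, best[j][1] + [(start, end)]))
        else:
            best.append(best[i])
    return best[-1]
```

Because the recurrence keeps a single best path, the result is one chain of CDSs, matching the single-transcript limitation noted on the slide.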

CDS prediction results (my figures) on 31 unseen Encode regions, May 3rd
Columns: System, Sn, Sp, √(Sn*Sp)
Systems: N-Scan (Brown/Brent); Augustus (Stanke); DOGFISH-C (fixed); DOGFISH-C (submitted); Flicek; Chatterji

Conclusions/Plans/Thanks
“Full” DOGFISH could well boost performance as a post-processing step
Detect transcription start sites!
Alternative transcripts
Thanks to: Richard Durbin; Thomas Down (RVM expert); Patrick Meidl (Vega); organizers;...