CDS predictions using DOGFISH-C David Carter Wellcome Trust Sanger Institute 6th May 2005.

CDS predictions using DOGFISH-C David Carter dmc@sanger.ac.uk Wellcome Trust Sanger Institute 6th May 2005

DOGFISH: Detection Of Genomic Features In Sequence Homologies
A four-component system to detect splice sites, coding starts/stops, etc. in multiple-species alignments.
DOGFISH-C: the “Contextual” component only, plus a simple best-path CDS finder to derive single transcripts.

DOGFISH components
What’s in an alignment?
Taxonomic information: mutations, or lack of them, at a given position. Evolutionary models.
Contextual information: does each sequence look right in itself? (DOGFISH-C)
Indels: where are the gaps?
Which species are present at all?
Derive an estimate from each “view”, and combine into a single result.

Training data
UCSC MultiZ 8-species vertebrate alignments (minus chimp, plus frog).
VEGA gene set from March 2005.
DOGFISH trained to discriminate true sites from equal numbers of decoys taken at random from within genes.
Final best-path search tuned using genes from 13 Encode training regions.

Deriving per-site probability estimates
A candidate site is represented by 100 bases each side of the site itself, and 100 each side of every informant species position it is aligned with: up to 8 x 200 bases in total.
Step 1: derive many statistics per species
  1a: position-specific weight matrices
  1b: significant k-mers in subregions
Step 2: derive one estimate per species
Step 3: integrate into a single estimate
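The 200-base window around a candidate site can be sketched as below. This is only an illustration: the function name `extract_window`, the convention that `site` is the index of the first base after the site boundary, and padding with `N` at sequence ends are all assumptions, not details from the slides.

```python
def extract_window(seq: str, site: int, flank: int = 100, pad: str = "N") -> str:
    """Return `flank` bases on each side of the boundary just before index `site`.

    Assumption: where the window runs off either end of the sequence it is
    padded with `pad`, so the result always has length 2 * flank.
    """
    left = seq[max(0, site - flank):site]
    right = seq[site:site + flank]
    # rjust/ljust add padding only when a flank is shorter than requested.
    return left.rjust(flank, pad) + right.ljust(flank, pad)
```

The same call would be repeated for the aligned position in each informant species, giving up to 8 windows per candidate site.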

Step 1a: Position-specific weight matrices
Train 6th-order position-specific weight matrices: one for each coding phase for true sites, and one for decoys.
Given a candidate sequence for a given species, find the overall best-scoring true-site model, i.e. find the most likely phase.
At each position, take the log-odds between the best true-site model and the decoy model, giving 200 log-odds scores.
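A minimal sketch of the scoring in Step 1a follows. To stay short it uses zeroth-order matrices (each position is scored independently of its neighbours), whereas the slides describe 6th-order ones; the data layout, the function names and the `1e-6` floor for unseen bases are assumptions made for illustration.

```python
import math

def column_logprob(model, pos, base):
    """Log-probability of `base` at window position `pos` under one model."""
    # Assumed floor so unseen symbols (e.g. N) do not give log(0).
    return math.log(model[pos].get(base, 1e-6))

def positional_logodds(window, true_models, decoy_model):
    """Pick the best-scoring true-site model, then score each position.

    `true_models` maps a phase label to a model; a model is a list holding
    one {base: probability} dict per window position.
    """
    def total(model):
        return sum(column_logprob(model, i, b) for i, b in enumerate(window))

    # The most likely phase is the true-site model with the highest total score.
    best_phase = max(true_models, key=lambda phase: total(true_models[phase]))
    best = true_models[best_phase]
    logodds = [
        column_logprob(best, i, b) - column_logprob(decoy_model, i, b)
        for i, b in enumerate(window)
    ]
    return best_phase, logodds
```

For a 200-base window this returns the chosen phase and 200 per-position log-odds values, which are kept separate rather than summed so that a later step can weight them.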

Step 1b: Diagnostic k-mers-in-regions
As well as applying weight matrices, count occurrences of 200 “diagnostic” k-mers (k=1 to 6) within specific regions of the 200-base window.
“Diagnostic” means the frequency differs between true and decoy sites: e.g. AG is rare in positions -30 to -1 for true acceptor sites but not for decoys.
Captures more subtle, less position-specific effects.
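Counting a k-mer within a region of the window might look like the sketch below. The offset convention (-1 is the base just before the site, +1 the base just after, no offset 0), the function names and the `(kmer, start, end)` feature tuples are assumptions; the slides do not say how the 200 diagnostic k-mers and regions were chosen or encoded.

```python
def count_kmer_in_region(window, site, kmer, start, end):
    """Count occurrences of `kmer` lying wholly within offsets start..end.

    `site` is the window index of the first base after the site boundary.
    Overlapping occurrences are each counted.
    """
    def to_index(offset):
        # Negative offsets count back from the boundary, positive ones forward.
        return site + offset if offset < 0 else site + offset - 1

    lo = max(to_index(start), 0)
    hi = min(to_index(end), len(window) - 1)
    return sum(
        1
        for i in range(lo, hi - len(kmer) + 2)
        if window[i:i + len(kmer)] == kmer
    )

def kmer_features(window, site, features):
    """Apply a list of hypothetical (kmer, start, end) features to one window."""
    return [count_kmer_in_region(window, site, k, s, e) for k, s, e in features]
```

With a real 200-base window, a feature such as `("AG", -30, -1)` would reproduce the acceptor-site example given above.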

Step 2: convert 400 scores per species to one estimate per species
Now we have 200 positional log-odds scores and 200 k-mer counts for our 200-base sequence, but we want a single probability estimate (that this site is a true one).
Train and run a relevance vector machine (RVM): it decides which are the useful (“relevant”) statistics and what weight to give each one.
This gives better results than just adding the scores (as we would if we made the independence assumptions made in e.g. HMMs).
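The contrast between adding scores and using learned weights can be sketched as follows. This shows only the prediction-time form of a sparse linear classifier with a logistic output, which is one common way a trained RVM classifier is applied; the training procedure is not shown, and the weights, bias and function names are hypothetical rather than taken from DOGFISH.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def naive_estimate(logodds_scores):
    """Independence-style baseline: add the log-odds and squash to (0, 1)."""
    return sigmoid(sum(logodds_scores))

def sparse_linear_estimate(features, weights, bias=0.0):
    """Weighted combination using only the retained ("relevant") inputs.

    `weights` maps feature index -> weight. Indices that are absent are
    treated as pruned during training and contribute nothing.
    """
    activation = bias + sum(w * features[i] for i, w in weights.items())
    return sigmoid(activation)
```

In this picture the 400 statistics for one species form `features`, and only the indices kept in `weights` influence the per-species estimate.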

Step 3: convert up to 8 per-species estimates to one overall estimate
Now we have an estimate for each species that aligned to the target.
Boost the estimates of species that did align, and introduce a low “default” estimate for those that didn’t; more distant species have larger boosts and milder defaults.
Train and run another RVM that takes (exactly) 8 inputs and outputs the single DOGFISH-C estimate for this site.
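One way to build the fixed-length input for the second RVM is sketched below. The slides do not say how boosts and defaults are applied, so this assumes both live in log-odds space, with the boost added to an aligned species' estimate; every number in `PARAMS`, the slot ordering and the function names are made up for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return math.log(p / (1.0 - p))

# Hypothetical (boost, default) pairs in log-odds units, one per species slot,
# assumed to be ordered from the closest species to the most distant one.
# More distant slots get larger boosts and milder (less negative) defaults.
PARAMS = [
    (0.0, -3.0), (0.1, -2.6), (0.2, -2.2), (0.3, -1.8),
    (0.5, -1.4), (0.7, -1.0), (0.9, -0.7), (1.0, -0.5),
]

def rvm_inputs(estimates):
    """Turn per-species estimates (None = no alignment) into exactly 8 inputs."""
    if len(estimates) != len(PARAMS):
        raise ValueError("expected one slot per species")
    inputs = []
    for p, (boost, default) in zip(estimates, PARAMS):
        inputs.append(default if p is None else logit(p) + boost)
    return inputs
```

Filling missing species with a default keeps the input length fixed at 8, which is what lets a single trained model handle sites with any subset of species aligned.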

Error rates (%) on balanced test set
“Error” means an estimate below 0.5 for a true site or above 0.5 for a decoy.

Predicting CDS’s (in a hurry)
A candidate CDS is any sequence [ATG|AG] … … [Stop|GT].
Use the DOGFISH-C candidate site estimates for the two boundary sites.
Introduce a further statistic based on which species get an alignment with “convincing” length across the candidate CDS.
CDS estimate = 5’-site estimate * 3’-site estimate * aligned-species estimate
Hand-tune a few more parameters (missing tea break).
Apply DP search to look for the best legal CDS sequence (so single transcript only) across the Encode region.
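The product rule and a best-path search can be sketched as below. This is a simplified stand-in, not the search actually used: it assumes each candidate is scored by the log-odds of its CDS estimate, treats "legal" as initial, then any number of internal, then terminal segments (or one single-exon CDS) without overlap, and ignores reading frame and the hand-tuned parameters. All names are hypothetical.

```python
import math
from typing import NamedTuple

class Candidate(NamedTuple):
    start: int
    end: int            # inclusive
    kind: str           # "initial", "internal", "terminal" or "single"
    p5: float           # 5'-site estimate
    p3: float           # 3'-site estimate
    p_aligned: float    # aligned-species estimate

def cds_estimate(c):
    return c.p5 * c.p3 * c.p_aligned

def score(c):
    # Assumed scoring: log-odds, so estimates above 0.5 add to a path.
    p = min(max(cds_estimate(c), 1e-9), 1.0 - 1e-9)
    return math.log(p / (1.0 - p))

def best_transcript(candidates):
    """Return the highest-scoring legal chain of candidates, or [] if none."""
    cands = sorted(candidates, key=lambda c: c.start)
    best = [-math.inf] * len(cands)   # best score of a legal chain ending at i
    back = [None] * len(cands)        # predecessor index for traceback
    for i, c in enumerate(cands):
        if c.kind in ("initial", "single"):
            best[i] = score(c)        # these can only open a chain
            continue
        for j in range(i):
            prev = cands[j]
            extendable = prev.kind in ("initial", "internal")
            if extendable and prev.end < c.start and best[j] + score(c) > best[i]:
                best[i], back[i] = best[j] + score(c), j
    finals = [i for i, c in enumerate(cands)
              if c.kind in ("single", "terminal") and best[i] > -math.inf]
    if not finals:
        return []
    i = max(finals, key=lambda k: best[k])
    chain = []
    while i is not None:
        chain.append(cands[i])
        i = back[i]
    return chain[::-1]
```

The quadratic inner loop keeps the sketch short; a search over a whole Encode region would more likely track the best extendable chain seen so far instead of rescanning all predecessors.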

CDS prediction results (my figures) on 31 unseen Encode regions, May 3rd 2005.

System                   Sn     Sp     √(Sn*Sp)
N-Scan (Brown/Brent)     63.9   76.0   69.7
Augustus (Stanke)        59.6   64.1   61.8
DOGFISH-C (fixed)        50.6   73.5   61.0
DOGFISH-C (submitted)    49.8   71.2   59.7
Flicek                   61.9   57.1   59.5
Chatterji                36.5   46.8   41.3
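The third figure for each system appears to be the geometric mean of Sn and Sp, i.e. the square root of Sn*Sp; the N-Scan row is consistent with that reading. A small check, with a hypothetical function name:

```python
import math

def combined_score(sn: float, sp: float) -> float:
    """Geometric mean of sensitivity and specificity (both in percent)."""
    return math.sqrt(sn * sp)

# N-Scan row from the results above: Sn = 63.9, Sp = 76.0
print(round(combined_score(63.9, 76.0), 1))  # 69.7
```

Unlike an arithmetic mean, this combined figure drops sharply when either Sn or Sp is low, so a system cannot rank well by maximising only one of them.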

Conclusions/Plans/Thanks
“Full” DOGFISH could well boost performance as a post-processing step.
Detect transcription start sites!
Alternative transcripts.
Thanks to: Richard Durbin; Thomas Down (RVM expert); Patrick Meidl (Vega); organizers; ...
