1
Bioinformatics and Statistics: A Real World Example
Joseph D. Szustakowski
2
Words of Encouragement
“There are three kinds of lies: lies, damned lies, and statistics” – Benjamin Disraeli
“Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination”
“Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates
3
Outline
Basic idea – what are we trying to do?
Extreme Value Distribution – a brief review
Overwhelm audience with lots of pictures
4
The Basic Idea
Most experiments result in one or more quantitative measurements
–Height, length, weight, time, speed
–SW score, threading potential, Viterbi score
Is that measurement unusual?
–Tall or short; heavy or light; fast or slow
–‘good’ or ‘bad’; homologous or non-homologous
5
The Basic Idea
‘Unusualness’ is only definable if we know what ‘usual’ is.
–Make lots of random measurements
–Model the ‘background’ distribution
–Compare measurements of interest to the background
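The three steps above can be sketched as an empirical p-value: the fraction of background measurements that score at least as well as the measurement of interest. This is a minimal illustration, not the procedure used in the talk. The Gaussian background, its parameters, and the names `empirical_p_value`, `p_typical`, and `p_unusual` are all assumptions made for the example.

```python
import random

def empirical_p_value(background, observed):
    """Fraction of background scores at least as large as the observed score."""
    hits = sum(1 for score in background if score >= observed)
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (len(background) + 1)

rng = random.Random(0)

# Step 1: make lots of random measurements (hypothetical scores, assumed Gaussian).
background = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Steps 2 and 3: use the sample itself as the background model and
# compare two measurements of interest against it.
p_typical = empirical_p_value(background, 50.0)   # a score near the background mean
p_unusual = empirical_p_value(background, 90.0)   # a score far out in the upper tail

print(f"typical score: p = {p_typical:.3f}")
print(f"unusual score: p = {p_unusual:.5f}")
```

An empirical estimate like this cannot resolve p-values much smaller than one over the number of background samples, which is one reason to fit a parametric background distribution instead.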
6
What’s the Point?
Are our results good / bad / the same / meaningful / garbage?
Consultation with an oracle
–Definitive
–Elusive
Magic eight ball
–Readily available
–Inconsistent results
7
What’s the Point?
Statistics
–Readily available
–Reproducible
–Provide an estimate of how likely it is that a better / worse / same result could be obtained by chance
8
Background Distributions
Gaussian – sum of independent variables (central limit theorem)
Extreme Value Distributions – optimization procedures
9
Extreme Value Distributions – a Brief Review
Extreme value distributions often result from optimization procedures
–Sequence alignments (BLAST, SW)
–Viterbi algorithm (HMMER, SAM)
EVDs are skewed
EVDs have a ‘heavy’ tail
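A small simulation can illustrate why optimization produces a skewed background while summation does not. In this sketch each "experiment" draws 100 candidate scores; summing them gives a roughly symmetric distribution (central limit theorem), while keeping only the best one gives a right-skewed distribution. The candidate scores are assumed to be independent standard Gaussians, which is a simplification: real alignment scores are neither independent nor Gaussian. All names here are illustrative.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)

rng = random.Random(1)
n_trials, n_candidates = 5000, 100

sums, maxima = [], []
for _ in range(n_trials):
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
    sums.append(sum(draws))      # "sum of independent variables"
    maxima.append(max(draws))    # "optimization": keep only the best candidate

skew_sums = skewness(sums)
skew_maxima = skewness(maxima)

print(f"skewness of sums:   {skew_sums:+.2f}")
print(f"skewness of maxima: {skew_maxima:+.2f}")
```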
11
Real World Example
Protein structure alignment
–Identify equivalent backbone positions in two proteins
–Maximize the number of equivalent pairs
–Minimize the distance between pairs
K2
–Target function to evaluate alignments
–Searches for the best alignment
  Dynamic programming
  Weighted bipartite matching
  Genetic algorithm
  Simulated annealing
  Kitchen sink (in progress)
12
Serine Proteases
Yellow – human protease
Red – viral coat protein
Asp-His-Ser catalytic triad shown in balls and sticks
13
DNA Methyltransferases
[Figure: diagrams for M.TaqI and M.PvuII with N and C termini marked and segments labeled 1–7, A–E, and Z]
14
Background Distribution
Extreme value distribution
P-Value
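One common way to turn an extreme value background into a p-value is to fit a Gumbel (type I extreme value) distribution to the background scores and evaluate its upper tail, P(S ≥ x) = 1 − exp(−exp(−(x − μ)/β)). The sketch below fits μ and β by the method of moments, chosen here for brevity; the slides do not say which EVD form or fitting procedure was used for K2, so treat this as an assumption. The background sample is synthetic, and `fit_gumbel_moments`, `gumbel_p_value`, and `sample_gumbel` are hypothetical helper names.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel_moments(scores):
    """Method-of-moments estimates of the Gumbel location (mu) and scale (beta)."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6.0) / math.pi   # Gumbel variance is (pi * beta)^2 / 6
    mu = mean - EULER_GAMMA * beta         # Gumbel mean is mu + gamma * beta
    return mu, beta

def gumbel_p_value(x, mu, beta):
    """Upper-tail probability P(S >= x) = 1 - exp(-exp(-(x - mu) / beta))."""
    z = (x - mu) / beta
    return -math.expm1(-math.exp(-z))      # expm1 keeps precision for tiny p-values

def sample_gumbel(rng, mu, beta):
    """Draw one Gumbel variate by inverting the CDF."""
    u = rng.random()
    while u == 0.0:                        # avoid log(0)
        u = rng.random()
    return mu - beta * math.log(-math.log(u))

rng = random.Random(2)
# Synthetic background scores (assumed parameters, for illustration only).
background = [sample_gumbel(rng, 10.0, 2.0) for _ in range(20_000)]

mu_hat, beta_hat = fit_gumbel_moments(background)
print(f"fitted mu = {mu_hat:.2f}, beta = {beta_hat:.2f}")
for score in (10.0, 15.0, 25.0):
    print(f"score {score:5.1f}: p = {gumbel_p_value(score, mu_hat, beta_hat):.2e}")
```

Maximum likelihood fitting is generally preferred over the method of moments for real data, especially when the tail is what matters.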
25
Table 2 K2 and BLAST Sensitivities

Positive Set               LL      LR       UL   UR      Total    K2 Sensitivity   BLAST Sensitivity
Fold 95% Identity          31909   133709   79   34092   199789   83%              16%
Fold 40% Identity          1582    22237    8    7927    31754    75%              5%
Superfamily 95% Identity   31909   95722    79   16541   144251   89%              22%
Superfamily 40% Identity   1582    11519    8    2873    15982    82%              10%