Published by Scott Steven York. Modified over 6 years ago.
1
Challenges in Creating an Automated Protein Structure Metaserver
Lawrence Wisne - CS 273
2
The Problem
Given a set of servers running prediction algorithms and their results on test data, is it possible to automate the choice of a “best” server for unknown sequences?
No structure prediction algorithm yet gives consistently accurate results.
While it would not provide any new answers, an algorithm that can successfully answer the question above could add a great deal of consistency to structure predictions.
3
Solution Outline
Download the results of the CASP6 competition.
Isolate a small subset S of structure prediction servers such that the worst result for the CASP6 target sequences is minimized, given the correct choice of server.
Link each amino acid target sequence with the server that gives the best result for that sequence.
Isolate the similar characteristics of the sequences that are linked with each server.
Note that this requires that the number of targets optimally linked with each server in S is large enough for characteristics of these sequences to be observed.
4
Picking a Set of optimal servers
To evaluate the quality of a given server’s prediction of sequence i, we use a relative property, not an absolute one: the ranking $R_{si}$ of the prediction of server s on sequence i among CASP6 participants.
Ideally, we would pick our set of servers S such that $\sum_i \min_{s \in S} R_{si}$ is minimized.
This is difficult even if we fix the subset size |R|, as there are $\binom{|S|}{|R|}$ possible subsets, with |S| here counting all candidate servers.
5
A decent approximation
To approximate an optimal subset S, use the following greedy algorithm:
1. For each server, count the targets for which the server ranked within the top t, for some threshold t.
2. Add the server with the largest count to S, and remove from consideration the targets for which that server was in the top t.
3. Repeat until S reaches the desired size.
Using this algorithm with t=5 and |S|=6, the worst result, given correct server prediction, had a rank of 13, and the mean rank was ~2.
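The greedy selection on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the presenter's code: the `ranks` layout (server name mapped to per-target rank), the function name `greedy_server_subset`, the alphabetical tie-breaking, and the early stop once no server covers a remaining target are all assumptions made for illustration. The toy ranks are invented and are not CASP6 data.

```python
from typing import Dict, List, Set

# Assumed layout: ranks[server][target] = rank of that server's prediction (1 = best).
Ranks = Dict[str, Dict[str, int]]


def greedy_server_subset(ranks: Ranks, t: int, size: int) -> List[str]:
    """Greedily pick up to `size` servers by how many uncovered targets they rank top-t on."""
    uncovered: Set[str] = {tg for per_target in ranks.values() for tg in per_target}
    chosen: List[str] = []
    while len(chosen) < size and uncovered:
        best_server, best_covered = None, set()
        for server in sorted(ranks):  # sorted so ties break alphabetically (assumption)
            if server in chosen:
                continue
            # Targets still under consideration where this server is within the top t.
            covered = {tg for tg in uncovered if ranks[server].get(tg, t + 1) <= t}
            if len(covered) > len(best_covered):
                best_server, best_covered = server, covered
        if best_server is None:
            break  # assumption: stop early when no server covers any remaining target
        chosen.append(best_server)
        uncovered -= best_covered  # remove the targets this server handles well
    return chosen


# Invented toy ranks for three hypothetical servers on four hypothetical targets.
toy_ranks = {
    "A": {"T1": 1, "T2": 2, "T3": 9, "T4": 8},
    "B": {"T1": 3, "T2": 7, "T3": 1, "T4": 9},
    "C": {"T1": 8, "T2": 9, "T3": 2, "T4": 1},
}
print(greedy_server_subset(toy_ranks, t=2, size=2))
```

The procedure has the same shape as the standard greedy heuristic for maximum coverage, which is why it avoids enumerating every subset of the chosen size.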
6
Linking Servers with Targets
Now we can link each target sequence with the server in our subset S that produces the best result for it.
In the case of t=5 and |S|=6, the smallest group of targets linked with a server was of size 8, and the largest was of size 32.
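The linking step amounts to assigning each target to whichever server in S ranked best on it. The sketch below assumes a `ranks[server][target]` dictionary; the function name `link_targets` and the toy values are hypothetical and are not CASP6 data.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed layout: ranks[server][target] = rank of that server's prediction (1 = best).
Ranks = Dict[str, Dict[str, int]]


def link_targets(ranks: Ranks, subset: List[str]) -> Dict[str, List[str]]:
    """Group targets under the server in `subset` with the best (lowest) rank on each."""
    groups = defaultdict(list)
    targets = sorted({tg for server in subset for tg in ranks[server]})
    for tg in targets:
        # A server with no prediction for a target is treated as infinitely bad.
        best = min(subset, key=lambda s: ranks[s].get(tg, float("inf")))
        groups[best].append(tg)
    return dict(groups)


# Invented toy ranks for two hypothetical servers.
toy_ranks = {
    "A": {"T1": 1, "T2": 2, "T3": 9, "T4": 8},
    "C": {"T1": 8, "T2": 9, "T3": 2, "T4": 1},
}
groups = link_targets(toy_ranks, ["A", "C"])
for server, linked in groups.items():
    print(server, len(linked), linked)
```

The group sizes printed at the end correspond to the quantity reported on this slide: each server needs enough linked targets for shared characteristics to be observable.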
7
A Reduced (but still very difficult) Problem
Find the common characteristics of a set of input strings which represent amino acid sequences.
The main methods attempted were machine learning and clustering.
8
The machine learning approach
Given a training set and a set of features that are present in each member of the set, weigh the features such that future input instances will be solved optimally.
Sounds great, but what are the “features” of a string?
9
The Clustering Approach
OK, so why can’t we just group the strings according to some characteristic?
Some kind of edit-distance or alignment-based measure (e.g. Smith-Waterman) may sound good in principle, but there are problems:
Alphabet too large
String size too varied: most metrics are thrown off by size differences, and normalization has its problems as well
Scoring Patterns are (at best) very subtle
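To make the size problem concrete, here is a minimal Smith-Waterman local alignment score in Python. It is a sketch under stated assumptions: the flat match/mismatch/gap values and the length normalization in `normalized_score` are illustrative choices, not the scoring used in the talk, and real protein work would normally use a substitution matrix.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_score(a: str, b: str, match: int = 2) -> float:
    """Score divided by the best possible score for the shorter string (one assumed normalization)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shorter)


print(smith_waterman_score("ACDE", "ACDE"))
print(normalized_score("ACDE", "XXACDEXX"))
```

With this normalization, a short string fully contained in a string twice its length scores the same as an exact match, which is one way normalization can mislead when string sizes vary widely.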
10
So, where do we go from here?
It is very possible that a way to solve the reformulated problem does exist.
Better domain-specific knowledge may be necessary.
To create a richer set of features for learning, we can use the various properties of amino acids to replace the alphabet.
Alternatively, it may be possible to alter the match/mismatch scores to account for physical properties.
More sample cases would be very helpful.
The size differentials in the strings made certain metrics almost useless.
More cases = the possibility of only comparing like-sized strings.
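One way to read "replace the alphabet" is to recode each residue by a coarse physicochemical class and use class composition as fixed-length features. The sketch below is an assumption for illustration only: the four-class grouping is one common textbook-style split (the placement of G, P, C, Y and H varies between schemes), and the names `PROPERTY_CLASSES`, `recode` and `class_composition` are hypothetical.

```python
from typing import Dict

# Assumed four-class grouping of the 20 standard amino acids (one-letter codes).
PROPERTY_CLASSES = {
    "h": "AVLIMFWPG",  # nonpolar / hydrophobic
    "p": "STNQCY",     # polar, uncharged
    "+": "KRH",        # positively charged (basic)
    "-": "DE",         # negatively charged (acidic)
}

# Invert the table: residue letter -> class symbol.
RESIDUE_TO_CLASS = {
    residue: symbol
    for symbol, residues in PROPERTY_CLASSES.items()
    for residue in residues
}


def recode(sequence: str) -> str:
    """Rewrite a sequence in the reduced alphabet; unknown letters become 'x'."""
    return "".join(RESIDUE_TO_CLASS.get(res, "x") for res in sequence.upper())


def class_composition(sequence: str) -> Dict[str, float]:
    """Fraction of residues in each class: a length-independent feature vector."""
    recoded = recode(sequence)
    total = len(recoded) or 1  # avoid dividing by zero on empty input
    return {symbol: recoded.count(symbol) / total for symbol in PROPERTY_CLASSES}


print(recode("MKTDE"))
print(class_composition("MKDE"))
```

Composition fractions do not depend on sequence length, so they sidestep the size differences noted on this slide, at the cost of discarding residue order.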