Major Application: Finding Homologies (C) Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a.



3 AutoSimS

- Local two-sequence alignment is the basis of sequence analysis, and perhaps the most widely used tool in computational molecular biology [1]
- The parameters of the most popular local sequence alignment tools, including BLAST and FASTA, are set in one of two ways:
  - Default: set for the “average case,” which may not be appropriate for the sequences being examined
  - Custom: set manually, which can be difficult and usually requires fine-tuning through several manual trials
- AutoSimS (Automated Sequence Similarity Search) contains three modules:
  - A modified version of SIM/DDS (Similarity / DNA-DNA Sequence) [2, 3] for finding similar regions
  - Adaptive simulated annealing (ASA) [4] for optimizing the parameters of SIM/DDS
  - An AI decision-making system (not implemented) for guiding the adaptive simulated annealing

4 SIM/DDS (Similarity / DNA-DNA Sequence)

- Integrates features from Smith-Waterman, BLAST, FASTA and Haste (Hash-Accelerated Search) [5]
- Rated as one of the fastest and least space-consuming (linear space complexity) tools for universal sequence alignment [6]
- Provides tradeoffs between sensitivity and speed using over a dozen parameters
- Our modified SIM/DDS introduces more cutoffs
  - Increases flexibility of control
  - Sequence filtering
  - Word masking
  - Reduces the impact of short and exact matches
  - Allows adjusting sensitivity for weak similarity
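To make the Smith-Waterman component concrete, here is a minimal sketch of local alignment scoring in Python. It is not the SIM/DDS implementation: the scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are illustrative assumptions. It keeps only two rows of the dynamic-programming table, so the score is computed in linear space, which is the property the slide highlights.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not SIM/DDS defaults.
    Only the previous and current rows are stored (linear space).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Parameters such as `match`, `mismatch` and `gap` are exactly the kind of settings that change which regions are reported, which is why tuning them matters.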

5 Adaptive Simulated Annealing (ASA)

- Uses global and statistical optimization techniques that are able to handle complex, non-linear search spaces
- Several improvements over the original simulated annealing technique:
  - Computational complexity: exponential temperature schedule for annealing
  - Completeness: decreases the chance of missing optima
  - Generality: more options to better fit the problem being solved
- Most attractive feature: individual consideration is given to the parameter range, annealing-time-dependent sensitivities, and the probability density distribution of each parameter
- Provides up to 100 options
- Facilitates incorporation into the AutoSimS model
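The per-parameter treatment can be illustrated with a small Python sketch of the ASA generating step as described by Ingber [4]: each parameter has its own range and its own temperature, and new values are drawn from a heavy-tailed distribution scaled to that range. The function names and the resampling of out-of-range candidates are choices made for this sketch, not part of the ASA code itself.

```python
import math
import random


def asa_sample(x, lower, upper, temps, rng=random):
    """Draw a new point near x, one parameter at a time.

    For each parameter, y = sgn(u - 1/2) * T * ((1 + 1/T)**|2u - 1| - 1)
    with u uniform in [0, 1), so y lies in [-1, 1]; the step is y times
    the parameter's range. Out-of-range candidates are redrawn.
    Each x[i] must already lie inside [lower[i], upper[i]] and T > 0.
    """
    new_x = []
    for xi, lo, hi, t in zip(x, lower, upper, temps):
        while True:
            u = rng.random()
            y = math.copysign(1.0, u - 0.5) * t * (
                (1.0 + 1.0 / t) ** abs(2.0 * u - 1.0) - 1.0
            )
            candidate = xi + y * (hi - lo)
            if lo <= candidate <= hi:
                new_x.append(candidate)
                break
    return new_x


def asa_temperature(t0, c, k, dim):
    """Exponential ASA schedule T(k) = t0 * exp(-c * k**(1/dim))."""
    return t0 * math.exp(-c * k ** (1.0 / dim))
```

A low temperature for one parameter concentrates its samples near the current value while another parameter can still explore its whole range, which is what "individual consideration" amounts to in practice.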

6 AutoSimS Model

[Figure: block diagram of the AutoSimS model. Labels recoverable from the slide: Parameters; Parameter Search; Set of possible parameters with exponential probability; Sequence Data; Modified SIM/DDS; Data Selection; Knowledge Base; Exponential Annealing; Parameter Evaluation; Value of objective function; Preferred similarity regions; ASA; AI Decision-Making Module (not implemented); User Preferences.]

7 Summary of Model

- ASA works as a “wrapper” program that selects parameters for SIM/DDS
- With properly specified search spaces, an objective function and successor heuristics determined by the AI decision-making system, ASA is used to find the optimal parameter setting of the modified SIM/DDS program. This leads to finding better similar regions
- Even though the above-mentioned information currently has to be given to ASA manually, we find it easier to do so and let ASA tune the parameters for SIM/DDS than to tune SIM/DDS’s parameters manually
- Adding the AI decision-making module will make AutoSimS nearly autonomous by automatically providing most of the information ASA needs
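The wrapper idea can be sketched in Python as a loop that treats the alignment program as a black box: propose a parameter setting, score it, and accept or reject it. This sketch uses plain simulated annealing rather than full ASA, and everything named here is hypothetical: the parameter names `gap_open` and `word_size`, the one-parameter-at-a-time successor heuristic, and `toy_objective`, which stands in for running the modified SIM/DDS and scoring its output.

```python
import math
import random


def anneal_parameters(objective, bounds, steps=200, t0=1.0, decay=0.02, seed=0):
    """Maximize objective(params) over box-bounded parameters.

    bounds maps a parameter name to (low, high). Returns the best
    setting seen and its score.
    """
    rng = random.Random(seed)
    current = {name: (lo + hi) / 2.0 for name, (lo, hi) in bounds.items()}
    current_score = objective(current)
    best, best_score = dict(current), current_score
    names = sorted(bounds)
    for step in range(steps):
        temp = t0 * math.exp(-decay * step)
        # Assumed successor heuristic: move one parameter by up to 10% of its range.
        name = rng.choice(names)
        lo, hi = bounds[name]
        candidate = dict(current)
        moved = current[name] + rng.uniform(-0.1, 0.1) * (hi - lo)
        candidate[name] = min(hi, max(lo, moved))
        candidate_score = objective(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse settings with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = dict(current), current_score
    return best, best_score


def toy_objective(params):
    """Hypothetical stand-in for 'run the aligner and score the regions found'."""
    return -((params["gap_open"] - 12.0) ** 2) - (params["word_size"] - 8.0) ** 2
```

In a real setup, `objective` would launch the alignment program with the candidate parameters and return a quality measure of the regions it reports; the search spaces (`bounds`) and the successor heuristic are the pieces the slide says must currently be supplied by hand.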

8 Results

- AHSC (Average of High-Scoring Chain Scores) may be used as an ASA objective function to find parameters yielding highly similar regions
- We find that close-to-optimal parameter settings are difficult to find manually, and that many different parameter settings yield close-to-optimal search results
  - An automatic search for parameters may be effective
  - Adaptive simulated annealing may be a preferred search technique
- Three runs of our modified SIM/DDS program, using parameters selected by adaptive simulated annealing for a 100- and 200-letter pair of DNA sequences, yield similar results but with different parameter settings
- ASA settings:
  - Annealing schedule: T = 20 * exp(-0.005*t) if t < 100, and 0 otherwise
  - Acceptance function: exp(-ΔE / T)
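The reported settings translate directly into a few lines of Python. This sketch assumes the acceptance function has the usual Metropolis form exp(-ΔE / T), where ΔE is the increase in energy, and that the energy is the negated AHSC so that higher AHSC is better; both are assumptions, as is the handling of T = 0 and of an empty chain list.

```python
import math


def temperature(t):
    """Annealing schedule from the slide: 20 * exp(-0.005 * t) for t < 100, else 0."""
    return 20.0 * math.exp(-0.005 * t) if t < 100 else 0.0


def acceptance_probability(delta_e, temp):
    """Probability of accepting a move that changes the energy by delta_e.

    Improvements (delta_e <= 0) are always accepted. At temp == 0 no
    worsening move is accepted (assumed convention).
    """
    if delta_e <= 0:
        return 1.0
    if temp <= 0:
        return 0.0
    return math.exp(-delta_e / temp)


def ahsc(chain_scores):
    """Average of high-scoring chain scores; 0.0 when no chain was found (assumed)."""
    return sum(chain_scores) / len(chain_scores) if chain_scores else 0.0
```

With this schedule the temperature only falls from 20 to about 12.2 before it is cut to 0 at t = 100, so the search stays fairly exploratory for the first 100 steps and then accepts only improvements.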

9 Future Work

- Implement the AI decision-making system, including the decision analysis and knowledge base system
- Experiment on a large number of different types of molecular biological sequences to determine the proper annealing temperature schedules and successor heuristics and/or their parameters
- Parallelize AutoSimS
- Incorporate core ideas of more efficient, very large-scale sequence comparison techniques, such as LSH (Locality-Sensitive Hashing) [7]
- Generate statistical estimates for the local alignment score distributions [1], which will be used in AutoSimS’s decision-making system
- Explore different ASA objective functions, which may improve results

10 Conclusion

- ASA’s ability to fit complex functions, i.e. nonlinear search spaces with multiple variables, allows it to find a suitable set of parameters for SIM/DDS
- Incorporating an AI decision-making system into our ASA-SIM/DDS program should enhance our ability to achieve almost autonomous two-sequence similarity analysis with high-volume throughput and acceptable performance
- Our use of simulated annealing to find a suitable set of parameters can be adapted to other bioinformatics analysis programs, such as alignment and clustering

11 References

[1] Altschul, S. F., Bundschuh, R., Olsen, R. and Hwa, T., The Estimation of Statistical Parameters for Local Alignment Score Distributions. Nucleic Acids Research, Vol. 29, No. 2, 351–361, 2001
[2] Jiang, T., Xu, Y. and Zhang, M. Q., Current Topics in Computational Molecular Biology. MIT Press, 2002
[3] Huang, X. and Miller, W., A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics, Vol. 12, 337–357, 1991
[4] Ingber, L., Simulated Annealing: Practice versus Theory. Mathl. Comput. Modelling, Vol. 18, No. 11, 29–57, 1993
[5] Borkowski, J. A., Smith, C. P. and Huang, X., PFP: A Flexible Integrated Filtering and Masking Tool. Paracel Inc., Pasadena, CA
[6] Tech Topics, Michigan Technological University, Vol. XXVIII, No. 9, Nov. 3, 1995
[7] Buhler, J., Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics, Vol. 17, No. 5, 419–428, 2001

