Major Application: Finding Homologies (C) Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a.

Presentation transcript:


1 AutoSimS

Local two-sequence alignment is the basis of sequence analysis, and perhaps the most widely used tool in computational molecular biology [1].

The parameters of the most popular local sequence alignment tools, including BLAST and FASTA, are set in one of two ways:
- Default: set for the "average case," which may not be appropriate for the sequences being examined
- Custom: set manually, which can be difficult and usually requires fine-tuning through several manual trials

AutoSimS (Automated Sequence Similarity Search) contains three modules:
- A modified version of SIM/DDS (Similarity / DNA-DNA Sequence) [2, 3] for finding similar regions
- Adaptive simulated annealing (ASA) [4] for optimizing the parameters of SIM/DDS
- An AI decision-making system (not implemented) for guiding the adaptive simulated annealing

2 SIM/DDS (Similarity / DNA-DNA Sequence)

- Integrates features from Smith-Waterman, BLAST, FASTA and Haste (Hash-Accelerated Search) [5]
- Rated as one of the fastest and least space-consuming (linear space complexity) tools for universal sequence alignment [6]
- Provides tradeoffs between sensitivity and speed using over a dozen parameters

Our modified SIM/DDS introduces more cutoffs:
- Increases flexibility of control
- Sequence filtering
- Word masking
- Reduces the impact of short and exact matches
- Allows adjusting sensitivity for weak similarity

3 Adaptive Simulated Annealing (ASA)

- Uses global and statistical optimization techniques that are able to handle complex, non-linear search spaces
- Several improvements over the original simulated annealing technique:
  - Computational complexity: exponential temperature schedule for annealing
  - Completeness: decreases the chance of missing optima
  - Generality: more options to better fit the problem to be solved
- Most attractive feature: individual consideration given to the parameter range, the annealing-time-dependent sensitivities, and the probability density distribution for each parameter
- Provides up to 100 options
- Facilitates incorporation into the AutoSimS model

4 AutoSimS Model

[Diagram of the AutoSimS model. Labels in the diagram: User Preferences; AI Decision-Making Module (not implemented); Data Selection; Knowledge Base; ASA; Parameter Search; Exponential Annealing; Parameters; Set of possible parameters with exponential probability; Sequence Data; Modified SIM/DDS; Parameter Evaluation; Value of objective function; Preferred similarity regions.]

5 Summary of Model

- ASA works as a "wrapper" program that selects parameters for SIM/DDS
- Given properly specified search spaces, an objective function and successor heuristics, all to be determined by the AI decision-making system, ASA is used to find the optimal parameter setting for the modified SIM/DDS program. This leads to finding better similar regions
- Even though this information currently has to be given to ASA manually, we find it easier to do so and let ASA tune the parameters for SIM/DDS than to tune SIM/DDS's parameters by hand
- Adding the AI decision-making module will make AutoSimS nearly autonomous by automatically providing most of the information ASA needs
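The wrapper idea in the summary can be sketched in a few lines of Python. This is a minimal illustration using plain simulated annealing, not Ingber's ASA and not the authors' code. The names `toy_objective`, `anneal`, `word_size` and `cutoff` are hypothetical, the objective is a stand-in for "run the modified SIM/DDS and score its output", and the one-step neighbor move is an assumed successor heuristic.

```python
import math
import random


def toy_objective(params):
    """Hypothetical stand-in for running SIM/DDS and scoring the result.

    Higher is better; the peak sits at word_size=8, cutoff=30.
    """
    return -((params["word_size"] - 8) ** 2) - ((params["cutoff"] - 30) / 5.0) ** 2


def anneal(objective, ranges, steps=200, t0=20.0, decay=0.005, seed=0):
    """Maximize objective over integer parameters given as {name: (lo, hi)}."""
    rng = random.Random(seed)
    current = {name: rng.randint(lo, hi) for name, (lo, hi) in ranges.items()}
    current_score = objective(current)
    best, best_score = dict(current), current_score
    for t in range(steps):
        # Exponentially decaying temperature; always positive here.
        temp = t0 * math.exp(-decay * t)
        # Assumed successor heuristic: nudge one parameter by +/-1, clamped to its range.
        name = rng.choice(sorted(ranges))
        lo, hi = ranges[name]
        candidate = dict(current)
        candidate[name] = min(hi, max(lo, candidate[name] + rng.choice((-1, 1))))
        cand_score = objective(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = dict(current), current_score
    return best, best_score


if __name__ == "__main__":
    search_space = {"word_size": (4, 12), "cutoff": (10, 50)}
    print(anneal(toy_objective, search_space))
```

In a real setup the objective would invoke the alignment program with the candidate parameters and return a score of the regions it finds; the search space and the neighbor move are exactly the pieces the summary says must currently be supplied by hand.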

6 Results

- AHSC (Average of High-Scoring Chain Scores) may be used as an ASA objective function to find parameters yielding highly similar regions
- Close-to-optimal parameter settings are difficult to find manually, and many different parameter settings yield close-to-optimal search results
- An automatic search for parameters may be effective
- Adaptive simulated annealing may be a preferred search technique

Three runs of our modified SIM/DDS program, using parameters selected by adaptive simulated annealing for a pair of DNA sequences of 100 and 200 letters, yield similar results but with different parameter settings.

ASA settings:
- Annealing schedule: T = 20 * exp(-0.005 * t) if t < 100, and 0 otherwise
- Acceptance function: exp(ΔE / T)
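The ASA settings above translate directly into code. The sketch below is illustrative only: it reads the acceptance function as exp(ΔE / T) with ΔE taken as candidate score minus current score (an assumption suited to maximizing the objective), and it treats AHSC as a plain mean of chain scores, since the slide does not say how chains are selected. The function names are hypothetical.

```python
import math


def temperature(t):
    """Annealing schedule from the slide: 20 * exp(-0.005 * t) for t < 100, else 0."""
    return 20.0 * math.exp(-0.005 * t) if t < 100 else 0.0


def acceptance_probability(delta_e, temp):
    """Probability of accepting a move whose objective changes by delta_e.

    Assumes delta_e = candidate - current, with higher scores better.
    """
    if delta_e >= 0:
        return 1.0  # improvements are always accepted
    if temp <= 0:
        return 0.0  # at zero temperature, worse moves are never accepted
    return math.exp(delta_e / temp)


def ahsc(chain_scores):
    """Average of high-scoring chain scores (assumed here to be a plain mean)."""
    if not chain_scores:
        return 0.0
    return sum(chain_scores) / len(chain_scores)


if __name__ == "__main__":
    for step in (0, 50, 99, 100):
        print(step, temperature(step), acceptance_probability(-10.0, temperature(step)))
```

With this schedule the temperature only falls from 20 to about 12 before it is cut to zero at t = 100, so worse moves stay fairly likely throughout the run and the search turns purely greedy afterwards.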

7 Future Work

- Implement the AI decision-making system, including the decision analysis and knowledge base system
- Experiment on a large number of different types of molecular biological sequences to determine the proper annealing temperature schedules and successor heuristics and/or their parameters
- Parallelize AutoSimS
- Incorporate core ideas of more efficient very large-scale sequence comparison techniques, such as LSH (Locality-Sensitive Hashing) [7]
- Generate statistical estimates for the local alignment score distributions [1], which will be used in AutoSimS's decision-making system
- Explore different ASA objective functions, which may improve results

8 Conclusion

- ASA's ability to fit complex functions, i.e. nonlinear search spaces and multiple variables, allows it to find a suitable set of parameters for SIM/DDS
- Incorporating an AI decision-making system into our ASA-SIM/DDS program should enhance our ability to achieve almost autonomous two-sequence similarity analysis with high-volume throughput and acceptable performance
- Our use of simulated annealing to find a suitable set of parameters can be adapted to other bioinformatics analysis programs, such as alignment and clustering

References

[1] Altschul, S. F., Bundschuh, R., Olsen, R. and Hwa, T., The Estimation of Statistical Parameters for Local Alignment Score Distributions. Nucleic Acids Research, Vol. 29, No. 2, 351–361, 2001
[2] Jiang, T., Xu, Y. and Zhang, M. Q., Current Topics in Computational Molecular Biology. MIT Press, 2002
[3] Huang, X. and Miller, W., A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics 12, 337–357, 1991
[4] Ingber, L., Simulated Annealing: Practice versus Theory. Mathl. Comput. Modelling, Vol. 18, No. 11, 29–57, 1993
[5] Borkowski, J. A., Smith, C. P. and Huang, X., PFP: A Flexible Integrated Filtering and Masking Tool. Paracel Inc., Pasadena, CA
[6] Tech Topics, Michigan Technological University, Vol. XXVIII, No. 9, Nov. 3, 1995
[7] Buhler, J., Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics 17(5), 419–428