A unified statistical framework for sequence comparison and structure comparison. Michael Levitt, Mark Gerstein.


Statistics Introduction
Statistics is the discipline that deals with inference in the presence of variation.
Central question: given a score, how significant is it?
Key concepts: H_0, H_A, critical region, P-value.
Extreme value distribution: the maximum over all sequence scores is distributed as an extreme value distribution.
Why the extreme value distribution is useful: an alignment score is a maximum taken over all possible random alignments.

Introduction
Given sequence and structure scores, develop a hypothesis-testing framework.
H_0: the two proteins being compared are unrelated.
The distribution of scores for unrelated proteins is determined empirically from PDB data at 40% sequence identity.
No background distribution is assumed.

Sequence Comparison Framework
Sequence scores are computed with SSEARCH using the BLOSUM50 substitution matrix.
The p.d.f. is a function of S_seq (the sequence score) and of n and m (the lengths of the two sequences compared).
All possible pairs were compared to determine the p.d.f. empirically.

P.D.F. for Sequence Score

Cross Section of p.d.f for constant ln(nm)

Density Distribution for constant ln(nm)
The density follows an extreme value distribution:
p^c_seq(Z) = exp(-Z - exp(-Z))
Z = (S_seq - μ_seq) / σ_seq
μ_seq = a ln(nm) + b (the model average)
σ_seq = a
The parameters a and b are fitted to the observed density by least squares.
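
The Z-score and density on this slide can be sketched in a few lines of Python. This is only an illustration of the formulas as stated: the function names are hypothetical, and the values of a and b used below are placeholders, not the fitted values from the paper.

```python
import math

def z_score(s_seq, n, m, a, b):
    """Standardize a sequence score using mu_seq = a*ln(nm) + b and sigma_seq = a."""
    mu_seq = a * math.log(n * m) + b
    sigma_seq = a
    return (s_seq - mu_seq) / sigma_seq

def evd_density(z):
    """Extreme value density exp(-Z - exp(-Z))."""
    return math.exp(-z - math.exp(-z))

# Placeholder parameters (assumed, not from the slides).
z = z_score(s_seq=60.0, n=150, m=200, a=5.0, b=-2.0)
print(z, evd_density(z))
```

The density peaks at Z = 0, where it equals exp(-1), and decays much more slowly on the high-score side than on the low-score side.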

Comparison to BLAST and FASTA statistics
The critical region gives the P-value for the model: P_seq(z > Z).
Compared with the model, BLAST P-values were higher.
FASTA statistics gave better coverage but more errors than the model.
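
For reference, the tail probability P_seq(z > Z) has a closed form when the density is exp(-z - exp(-z)). The short derivation below is a worked step that follows from that density alone; it is not taken from the slides.

```latex
P_{\mathrm{seq}}(z > Z)
  = \int_{Z}^{\infty} e^{-z - e^{-z}}\,dz
  = \Bigl[\, e^{-e^{-z}} \Bigr]_{Z}^{\infty}
  = 1 - e^{-e^{-Z}}
```

For large Z this is approximately exp(-Z), so each additional unit of Z lowers the P-value by roughly a factor of e.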

Structure Comparison Algorithm

Structure Comparison Framework
The score obtained from the structure comparison algorithm is S_str.
The p.d.f. is a function of S_str and N (the number of residues matched); pairs that scored high were removed.
With N held fixed, an extreme value distribution was fitted to the density, and this was done for all N.

Comparison with RMS
The traditional measure is the RMS deviation of alpha-carbon positions after a least-squares fit.
A p.d.f. for the RMS score was determined as a function of ln(RMS score) and N.
Comparing RMS with S_str showed RMS to be worse than S_str in both coverage and accuracy.
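
The slides do not write out the RMS deviation. For orientation, the standard definition after least-squares superposition is given below, with x_i and y_i denoting the N matched alpha-carbon positions in the two structures (this notation is an assumption).

```latex
\mathrm{RMS} = \min_{R,\,\mathbf{t}}
  \sqrt{\frac{1}{N}\sum_{i=1}^{N}
  \bigl\lVert \mathbf{x}_i - (R\,\mathbf{y}_i + \mathbf{t}) \bigr\rVert^{2}}
```

Here R is a rotation and t a translation. Because the deviations are squared before averaging, the largest deviations contribute most to the value.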

Comparison with RMS (cont.)
Three reasons:
1. S_str depends most strongly on the best-fitting atoms; RMS depends most on the worst-fitting atoms.
2. S_str penalizes gaps; RMS does not.
3. S_str is analogous to S_seq in that both use dynamic programming.

Comparison of Structure and Sequence Comparison

Concluding Remarks
The significance of a structure score can be calculated for the output of any structural alignment program.
This method of assessing statistical significance falls between the FASTA and BLAST methods.

Efficient Detection of Three-Dimensional Structural Motifs in Biological Macromolecules by Computer Vision Techniques. Ruth Nussinov, Haim J. Wolfson.

Introduction
One of the earlier papers addressing structure comparison.
Based on computer vision techniques (the geometric hashing paradigm).
No a priori predefined motif is assumed.
Advantage: it can be parallelized.

Problem
Given the 3D coordinates of the atoms of two molecules, find a rigid transformation (rotation and translation) such that a large number of atoms of one molecule match atoms of the other.
This is closely related to 3D rigid object recognition.

Geometric Hashing Paradigm: Representation of Geometric Constraints
Proteins are represented as sets of points expressed in coordinate frames (using a minimal representation of each frame).
Pick three noncollinear points to define a plane (RS) and construct an orthogonal 3D coordinate system based on the RS.

Representation of Geometric Constraints (cont.)
Define orthonormal vectors with respect to the RS so that any point can be represented as a linear combination of them.
A single RS may preclude recognition if at least one of its points does not match the input substructure. To remove this dependence, represent the m points in all basis triplets (i.e., in the orthonormal frames of all possible RS).
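
One plausible way to build such a frame is sketched below: the first axis runs along p1 to p2, the third is the normal of the plane through the three points, and the second completes a right-handed set. The slides do not specify the exact construction, so this choice of axes and all function names are assumptions made for illustration.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    norm = math.sqrt(dot(v, v))
    if norm < 1e-9:
        raise ValueError("degenerate RS: points coincide or are collinear")
    return tuple(x / norm for x in v)

def reference_frame(p1, p2, p3):
    """Orthonormal frame (origin, axes) defined by three noncollinear points."""
    e1 = unit(sub(p2, p1))                        # along p1 -> p2
    e3 = unit(cross(sub(p2, p1), sub(p3, p1)))    # normal to the RS plane
    e2 = cross(e3, e1)                            # completes right-handed frame
    return p1, (e1, e2, e3)

def to_frame(point, frame):
    """Coordinates of a point as a linear combination of the frame axes."""
    origin, axes = frame
    d = sub(point, origin)
    return tuple(dot(d, e) for e in axes)

frame = reference_frame((0, 0, 0), (1, 0, 0), (0, 1, 0))
print(to_frame((1, 2, 3), frame))
```

The coordinates produced this way do not change when the whole molecule is rotated or translated, which is what allows them to serve as hash table addresses.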

Algorithm for Representation of Geometric Constraints
For each RS {
  Compute the orthonormal 3D basis associated with the RS
  Compute the coordinates of all other points in the coordinate frame defined by that basis
  For each point, define a hash table address from its labels and measurements
  Use each address to enter the pair (model, RS) into the hash table
}
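
A minimal Python sketch of this preprocessing loop follows. It assumes that every ordered triplet of points is used as an RS, that the "label" is an atom-type string, and that coordinates are quantized with a fixed bin size; none of these details are stated on the slides, and the names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def make_basis(p1, p2, p3):
    """Origin and orthonormal axes for an RS, or None if it is degenerate."""
    u = _sub(p2, p1)
    n = _cross(u, _sub(p3, p1))
    lu, ln = math.sqrt(_dot(u, u)), math.sqrt(_dot(n, n))
    if lu < 1e-9 or ln < 1e-9:
        return None
    e1 = tuple(x / lu for x in u)
    e3 = tuple(x / ln for x in n)
    return p1, (e1, _cross(e3, e1), e3)

def build_hash_table(models, bin_size=1.0):
    """models maps a model name to a list of (label, (x, y, z)) atoms."""
    table = defaultdict(list)
    for name, atoms in models.items():
        for i, j, k in permutations(range(len(atoms)), 3):   # each RS
            basis = make_basis(atoms[i][1], atoms[j][1], atoms[k][1])
            if basis is None:
                continue                                     # skip collinear RS
            origin, axes = basis
            for idx, (label, q) in enumerate(atoms):
                if idx in (i, j, k):
                    continue
                d = _sub(q, origin)
                # Address = label plus quantized coordinates in the RS frame.
                address = (label,) + tuple(round(_dot(d, e) / bin_size) for e in axes)
                table[address].append((name, (i, j, k)))
    return table

models = {"M1": [("N", (0.0, 0.0, 0.0)), ("CA", (1.0, 0.0, 0.0)),
                 ("C", (0.0, 1.0, 0.0)), ("O", (1.0, 2.0, 3.0))]}
table = build_hash_table(models)
print(len(table), sum(len(v) for v in table.values()))
```

In practice the bin size would need to reflect the expected coordinate error; points that fall near a bin boundary can land in a neighboring bin, which this sketch does not handle.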

Determining Hash Table Entries with Model M1 and Points 4 and 1 as Basis

Locations of Hash Table Entries for Model M1 after All Bases (RS) Are Processed

Geometric Hashing: Matching
Given an observed object:
1. Choose an RS and compute the 3D basis associated with it.
2. Compute the coordinates of the other observed object points in that basis.
3. For each point, enter the hash table at the address defined by the point's label and coordinates.

Geometric Hashing: Matching (cont.)
For step 3: tally a vote for the (model, RS) pair of each entry found at the address; all hash table entries that received one or more votes can be histogrammed.
4. If no pair scores high (as judged by a threshold), go back to step 1 and begin with a different RS of the observed object.
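
The voting in steps 3 and 4 can be sketched as follows. The sketch assumes the hash table is a dictionary from addresses to lists of (model, RS) pairs and that the observed points have already been converted to addresses for one chosen RS; the function names and the example table are hypothetical.

```python
from collections import Counter

def tally_votes(table, scene_addresses):
    """Cast one vote per (model, RS) entry found at each observed address."""
    votes = Counter()
    for address in scene_addresses:
        for model_rs in table.get(address, ()):
            votes[model_rs] += 1
    return votes

def candidates(votes, threshold):
    """(model, RS) pairs whose vote count reaches the threshold, best first."""
    return [(pair, n) for pair, n in votes.most_common() if n >= threshold]

# Toy table with made-up addresses (label, x_bin, y_bin, z_bin).
table = {
    ("C", 1, 2, 3): [("M1", (0, 1, 2))],
    ("N", 0, 1, 0): [("M1", (0, 1, 2)), ("M2", (2, 1, 0))],
}
scene = [("C", 1, 2, 3), ("N", 0, 1, 0), ("O", 9, 9, 9)]
votes = tally_votes(table, scene)
print(candidates(votes, threshold=2))
```

If the returned list is empty, the procedure picks a different RS in the observed object and votes again.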

Geometric Hashing: Matching (cont.)
5. Consider all the models from step 4 and find the rigid motion that gives the best least-squares match.
6. Transform the model point set according to the transformation from step 5 and check the consistency of all biological information (i.e., that the labels match).

Modifications to Algorithm
The voting scheme could be modified.
The representation of coordinate axes could be changed to 2D coordinate axes (this reduces the worst-case running time).
The representation of atoms could be restricted to alpha-carbons only (no labeling allowed).
Atoms could be grouped into single units and structures analyzed using these atom groups.

Algorithm Performance
Experiments used bacterial proteins, a bovine pancreas protein, a calcium-binding protein, a bovine liver protein, and a protein from hen egg.
All experiments gave "favorable" to "excellent" results in terms of fit.

Conclusion
The algorithm needs O(N × m^4) for the hash table, which can be large for large N and m.
The running time of the algorithm can also be long.
It can be parallelized (i.e., the representation stage is independent of the matching stage).
It is independent of sequence order (i.e., insensitive to gaps, insertions, and deletions).
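
One way to see where a bound of this form can come from is to count hash table entries. The count below assumes N is the number of models, m is the number of points per model, and every ordered triplet of points serves as an RS; the slides do not define N and m, so these readings are assumptions.

```latex
\underbrace{N}_{\text{models}}
\times \underbrace{m(m-1)(m-2)}_{\text{RS triplets per model}}
\times \underbrace{(m-3)}_{\text{remaining points}}
= O(N\,m^{4})
```

For example, a single model with m = 100 points would already give about 9.4 × 10^7 entries under these assumptions.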