Identifying Structural Motifs in Proteins Rohit Singh Joint work with Mitul Saha.

Presentation transcript:

Identifying Structural Motifs in Proteins Rohit Singh Joint work with Mitul Saha

The Big Picture: small motifs. Active sites are preserved across proteins with similar functions.

The Big Picture: large motifs. Even bigger motifs are often conserved.

Oh, BTW… There are two different issues here: (1) Find the best match for the motif in the protein. This has been extensively studied in vision/graphics. (2) Is the match “significant”? For small motifs a good match is more likely. What is the probability of a match against a random protein being this good? (cf. BLAST)

What’s in it for a CS guy? The problem of matching two point sets has many applications. Most current algorithms are geared towards points that are indistinguishable (e.g. points on a mesh). There are few rigorous results on the significance of matches.

So what have we done? Worked towards a more rigorous approach for scoring the quality of a match (between motif and protein). Provided a method that is capable of finding the optimum match based on these criteria.

Problem Description: Given a motif and a protein, for each point in the motif, find a corresponding point in the protein. Given these correspondences, find the best transformation (rotation and translation only) of the motif that aligns it to the protein. Optimize over all possible correspondences.
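
One way to write this objective down, as a sketch rather than the authors' own formulation. The symbols are assumptions chosen for illustration: motif points t_1, …, t_k, protein points y_1, …, y_n, a correspondence c mapping each motif index to a protein index, and a rigid transformation g made of a rotation R and a translation v.

```latex
\[
E(c) \;=\; \min_{g = (R,\, v)} \; \sum_{i=1}^{k} \bigl\| R\,t_i + v - y_{c(i)} \bigr\|^{2},
\qquad
\mathrm{RMSD}(c) \;=\; \sqrt{E(c)/k},
\qquad
c^{*} \;=\; \arg\min_{c}\, E(c)
\]
```

The inner minimization is the alignment step for a fixed correspondence; the outer one is the search over correspondences.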

Oh, BTW… Given two sets of k points with known correspondences, it is easy to find the rotation and translation that minimize the sum-of-squared error (and hence the RMSD). It boils down to finding the largest eigenvalue, and its eigenvector, of a 4x4 matrix.
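
A minimal sketch of the 4x4 eigenvalue approach, assuming it refers to the standard unit-quaternion method (Horn, 1987); the slides do not name the method, and the function names here are hypothetical. The eigenvector belonging to the largest eigenvalue is the unit quaternion of the optimal rotation.

```python
import numpy as np

def optimal_rigid_transform(P, Q):
    """Return (R, t) minimizing sum_i ||R @ P[i] + t - Q[i]||^2 (quaternion method)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    X, Y = P - p_mean, Q - q_mean
    S = X.T @ Y  # 3x3 cross-covariance: S[a, b] = sum_i X[i, a] * Y[i, b]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    # Symmetric 4x4 matrix whose top eigenvector is the optimal unit quaternion.
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
        [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
        [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
        [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz],
    ])
    _, eigvecs = np.linalg.eigh(N)  # eigenvalues come back in ascending order
    w, x, y, z = eigvecs[:, -1]     # eigenvector of the largest eigenvalue
    R = np.array([
        [w*w + x*x - y*y - z*z, 2*(x*y - w*z), 2*(x*z + w*y)],
        [2*(y*x + w*z), w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
        [2*(z*x - w*y), 2*(z*y + w*x), w*w - x*x - y*y + z*z],
    ])
    t = q_mean - R @ p_mean
    return R, t

def rmsd(P, Q, R, t):
    """Root mean squared deviation after applying (R, t) to P."""
    diff = np.asarray(P, dtype=float) @ R.T + t - np.asarray(Q, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

For two point sets that differ only by a rigid motion, the returned transform reproduces that motion and the RMSD is zero up to floating-point error.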

Previous Work: Brute-force approach: match edges of the same length. Geometric hashing: Pennec & Ayache, Bioinformatics, 1998.

What is missing? Existing methods are ad hoc: they try to minimize a quantity that is only indirectly related to the least-squares error or RMSD. It is hard to evaluate the quality of partial matches. Brute-force methods are infeasible for larger motifs. Geometric hashing requires significant preprocessing.

Estimating the error: Model the alignment problem as a regression problem, with Y = model set (protein), T = data set (motif), and g = transformation (rotation + translation). Which error criterion to use? The usual choice is the least squared error (LSE, equivalently RMSD), but LSE is not good when you have outliers. What to do?

Robust error estimation: With LSE, larger error terms have disproportionate influence. Use a function that reduces the effect of larger error terms (M-estimators).
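
To make the idea concrete, here is a small sketch using the Huber function as the M-estimator. The slides do not say which function was used, so Huber and the threshold `k` are assumptions picked for illustration. Small residuals are penalized quadratically, as in LSE, while large ones grow only linearly.

```python
import numpy as np

def huber_rho(r, k=1.0):
    """Huber loss: quadratic for |r| <= k, linear beyond k."""
    r = np.abs(np.asarray(r, dtype=float))
    return np.where(r <= k, 0.5 * r ** 2, k * (r - 0.5 * k))

def huber_weight(r, k=1.0):
    """Equivalent per-point weight for weighted least squares: 1 inside k, k/|r| outside."""
    r = np.abs(np.asarray(r, dtype=float))
    return np.where(r <= k, 1.0, k / np.maximum(r, 1e-12))
```

With `k = 1`, a residual of 3 costs 2.5 instead of the 4.5 that half the squared error would charge, and its weight drops to 1/3.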

It's an optimization problem! Consider the case of full matching. Domain: the set of all possible correspondences between points on the motif and points on the protein. Range: given a particular set of corresponding points, the minimum error in aligning those point sets. Goal: find the global minimum of this function!

Looking for the global minimum. Our approach: prune the search space to a small and plausible sub-space; find most of the local minima in this sub-space quickly; choose the minimum over these local minima.

Finding local minima is easy: ICP, the Iterative Closest Point algorithm (Besl-McKay).

ICP contd… ICP is guaranteed to converge to a local minimum, but which one depends a lot on the initial seeding. Convergence is quick: ~4-5 iterations.
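
A bare-bones sketch of the ICP loop, not the authors' implementation. It alternates between assigning each motif point to its nearest protein point and refitting the rigid transform. For brevity the fit uses the SVD-based Kabsch solution rather than a 4x4 eigenvalue formulation, and, as in classic ICP, two motif points may pick the same protein point. The names `kabsch` and `icp` are hypothetical.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rotation R and translation t such that R @ P[i] + t ~ Q[i]."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_mean - R @ p_mean

def icp(motif, protein, R0=None, t0=None, max_iter=50, tol=1e-9):
    """Iterate nearest-point assignment and rigid refit until the RMSD stops improving."""
    R = np.eye(3) if R0 is None else R0
    t = np.zeros(3) if t0 is None else t0
    prev_err = np.inf
    for _ in range(max_iter):
        moved = motif @ R.T + t
        # Squared distances between every moved motif point and every protein point.
        d2 = ((moved[:, None, :] - protein[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        R, t = kabsch(motif, protein[nearest])
        diff = motif @ R.T + t - protein[nearest]
        err = float(np.sqrt((diff ** 2).sum(axis=1).mean()))
        if prev_err - err < tol:
            break
        prev_err = err
    return R, t, nearest, err
```

The initial `R0` and `t0` are the seed: a seed close to the true placement recovers the exact match, while a poor seed settles in whatever local minimum is nearby.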

Pruning the search space: Every point in the motif/protein has some features: amino acid type, element type, secondary structure, hydrophobic/polar, ‘substitutable’. Assume: a point with feature X can only match another point with feature X (or {Y,Z,W}). Assume: some features are more frequent than others.

Our Approach: Find the feature that is least frequent in the protein. For each occurrence of the feature: seed ICP appropriately, find the local minimum, and look around a few more times. Return the best answer you have.
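
The seeding step might look roughly like the sketch below. The slides do not spell out what "seed appropriately" means, so the choices here are assumptions: features are plain string labels, only features that also occur in the motif are considered, and each seed is a pure translation that places the motif's anchor point on one protein occurrence of the rare feature. All names are hypothetical.

```python
from collections import Counter
import numpy as np

def rarest_shared_feature(motif_labels, protein_labels):
    """Feature present in the motif that occurs least often in the protein."""
    counts = Counter(protein_labels)
    shared = [f for f in set(motif_labels) if counts[f] > 0]
    if not shared:
        return None
    return min(shared, key=lambda f: (counts[f], f))  # ties broken alphabetically

def seed_translations(motif_xyz, motif_labels, protein_xyz, protein_labels):
    """One (protein_index, translation) seed per occurrence of the rarest feature."""
    feature = rarest_shared_feature(motif_labels, protein_labels)
    if feature is None:
        return []
    anchor = motif_labels.index(feature)  # first motif point carrying that feature
    seeds = []
    for j, label in enumerate(protein_labels):
        if label == feature:
            seeds.append((j, protein_xyz[j] - motif_xyz[anchor]))
    return seeds
```

Each seed would then be handed to a local search such as ICP, and the lowest-error result kept. Because the rare feature has few occurrences, the number of local searches stays small.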

Observations: The method will always find a perfect match, if one exists. Moreover, it will find such a match quickly. The error is directly interpretable in RMSD terms.

Does it work?

…contd: Trypsin active site against trypsin-like proteins

…contd: Trypsin active site against kinases

What about partial matching? The basic idea is the same: pruning + ICP. Replace the least squared error estimates with M-estimator based errors. Problem: how to find the optimal rotation/translation that minimizes this new error criterion? Answer: weighted LSE? Is there a better way?
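
A sketch of what "weighted LSE" could look like in practice: iteratively reweighted least squares, where each round solves a weighted rigid fit and then recomputes per-point weights from the residuals. The Huber weights, the threshold `k`, the SVD-based weighted fit, and the function names are all assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def weighted_kabsch(P, Q, w):
    """Rigid (R, t) minimizing sum_i w[i] * ||R @ P[i] + t - Q[i]||^2."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    p_mean, q_mean = w @ P, w @ Q  # weighted centroids
    H = (P - p_mean).T @ (w[:, None] * (Q - q_mean))
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_mean - R @ p_mean

def irls_align(P, Q, k=1.0, n_iter=20):
    """Alternate weighted fits and Huber reweighting; returns (R, t, final weights)."""
    w = np.ones(len(P))
    for _ in range(n_iter):
        R, t = weighted_kabsch(P, Q, w)
        r = np.linalg.norm(P @ R.T + t - Q, axis=1)  # per-point residuals
        w = np.where(r <= k, 1.0, k / np.maximum(r, 1e-12))
    return R, t, w
```

Points that fit well keep weight 1, while a badly matched point is progressively down-weighted instead of dragging the whole alignment towards it.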

RANSAC: the choice of the parameters has statistical justification.
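
One standard example of such a justification is the rule for how many random samples to draw. The slides do not say which parameters were tuned, so this is an illustration only, and the function name is hypothetical. If a fraction `inlier_ratio` of the points are inliers and each sample uses `sample_size` points, the number of trials below makes the chance of drawing at least one all-inlier sample reach `p_success`.

```python
import math

def ransac_trials(p_success, inlier_ratio, sample_size):
    """Smallest N with 1 - (1 - inlier_ratio**sample_size)**N >= p_success."""
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must be strictly between 0 and 1")
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    good_sample = inlier_ratio ** sample_size  # chance that one sample is all inliers
    if good_sample >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - good_sample))
```

For example, `ransac_trials(0.99, 0.5, 3)` gives 35: with half the points being inliers and three points per sample, 35 draws suffice for 99% confidence.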

Plain Vanilla (Least Squares):

M-estimator + weighted LSE

M-estimator + RANSAC

…contd Data for distorted trypsin active site against ten different trypsins:

Future Work: Test on larger motifs (secondary structure elements). Choose better features. Obtain a theoretical guarantee about the quality of results. Explore different criteria for partial matching.

Thanks!