TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition Principal Investigators: Thomas Ioerger (Dept. Computer Science)

Presentation transcript:

TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition
Principal Investigators: Thomas Ioerger (Dept. Computer Science), James Sacchettini (Dept. Biochem/Biophys)
Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee, Lalji Kanbi, Reetal Pai & Jacob Smith
Funding: National Institutes of Health
Texas A&M University

X-ray crystallography
Most widely used method for protein modeling
Steps:
  – Grow crystal
  – Collect diffraction data
  – Generate electron density map (Fourier transform)
  – Interpret map, i.e. infer atomic coordinates
  – Refine structure
Model-building:
  – Currently: crystallographers
  – Challenges: noise, resolution
  – Goal: automation
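The "generate electron density map (Fourier transform)" step can be illustrated with a one-dimensional toy. This is only a sketch under stated assumptions: the two point atoms, their electron counts, and the cutoff `h_max` are hypothetical, and the 1/V scale factor is omitted. Real maps are three-dimensional and the phases have to be estimated experimentally, whereas here they are computed directly from the known atoms.

```python
import cmath
import math

def structure_factors(atoms, h_max):
    """F(h) = sum_j Z_j * exp(2*pi*i*h*x_j) for point atoms in a 1-D unit cell."""
    return {
        h: sum(z * cmath.exp(2j * math.pi * h * x) for x, z in atoms)
        for h in range(-h_max, h_max + 1)
    }

def density(factors, x):
    """rho(x) = sum_h F(h) * exp(-2*pi*i*h*x), with the 1/V scale factor omitted."""
    total = sum(f * cmath.exp(-2j * math.pi * h * x) for h, f in factors.items())
    return total.real  # imaginary parts cancel because F(-h) is the conjugate of F(h)

# Hypothetical toy cell: (fractional coordinate, electron count) per atom.
atoms = [(0.20, 6.0), (0.65, 8.0)]
factors = structure_factors(atoms, h_max=10)

# Sample the synthesized density on a grid and locate its highest peak.
grid = [i / 100 for i in range(100)]
rho = [density(factors, x) for x in grid]
peak = grid[max(range(len(grid)), key=lambda i: rho[i])]
print(f"highest density at x = {peak:.2f}")
```

Lowering `h_max` truncates the series and broadens the peaks, which is the toy analogue of a lower-resolution map.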

Automated map interpretation
Fit amino acids into density in the right orientation
Largely manual process
  – Molecular graphics programs
  – Bottleneck step: time-consuming & error-prone
Diffraction data is typically of poor quality
  – Focus of TEXTAL™: medium-to-poor resolution
Modeling requires a lot of experience
  – Automation is very challenging; often considered an art!
Other automated model-building programs: ARP/wARP, RESOLVE, MAIN
  – Other AI approaches: expert systems, molecular scene analysis

Automated model-building program
Can we automate the kind of visual processing of patterns that crystallographers use?
  – Intelligent methods to interpret density, despite noise
  – Exploit knowledge about typical protein structure
Focus on medium-resolution maps
  – optimized for 2.8 Å (actually, A is fine)
  – typical for MAD data (useful for high-throughput)
  – other programs exist for higher-res data (ARP/wARP)
Overview of TEXTAL
Electron density map (or structure factors) → TEXTAL → Protein model (may need refinement)

Crystal → Collect data → Diffraction data → Electron density map
CAPRA: models backbone
  – SCALE MAP
  – TRACE MAP
  – CALCULATE FEATURES
  – PREDICT Cα's
  – BUILD CHAINS
  – PATCH & STITCH CHAINS
  – REFINE CHAINS
  → Model of backbone
LOOKUP: models side chains
  → Model of backbone & side chains
POST-PROCESSING
  – SEQUENCE ALIGNMENT
  – REAL SPACE REFINEMENT
  → Corrected & refined model

CAPRA: C-Alpha Pattern-Recognition Algorithm
Tracing → linking
Neural network: estimates which pseudo-atoms are closest to true Cα's
Best-first search with heuristic scoring function based on:
  – neural net scores
  – density connectivity
  – secondary structure

Example of Cα-chains fit by CAPRA
Rat α2 urinary protein (P. Adams)
  – data: 2.5 Å MR map generated at 2.8 Å
  – % built: 84%
  – # chains: 2 (lengths: 47, 88)
  – RMSD: 0.82 Å

Stage 2: LOOKUP
LOOKUP is based on Pattern Recognition
  – Given a local (5 Å spherical) region of density, have we seen a pattern like this before (in another map)?
  – If so, use similar atomic coordinates.
Use a database of maps with known structures
  – 200 proteins from PDB-Select (non-redundant)
  – back-transformed (calculated) maps at 2.8 Å (no noise)
  – regions centered on 50,000 Cα's
Use feature extraction to match regions efficiently
  – features (e.g. moments) represent local density patterns
  – features must be rotation-invariant (independent of 3D orientation)
  – use density correlation for more precise evaluation

CAPRA BUILD CHAINS: Examines the network of Cα's and uses heuristic search to connect them into backbone chains

LOOKUP: Uses case-based reasoning to find, for each Cα, the best matching local region in a database

The LOOKUP Process
Inputs: region in map to be interpreted; database of known maps
“2-norm”: weighted Euclidean distance metric for retrieving matches
Two-step filter:
  1) by features
  2) by density correlation
Find optimal rotation
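The two-step filter can be sketched as follows, under stated assumptions. The weighted Euclidean distance is taken to be sqrt(sum_f w_f (x_f - y_f)^2), the density comparison is a plain Pearson correlation over grids that are assumed to be already rotated into the same frame (the rotation search itself is not shown), and the record layout, weights, and numbers are hypothetical.

```python
import math

def weighted_distance(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_f w_f * (x_f - y_f)^2)."""
    return math.sqrt(sum(wf * (xf - yf) ** 2 for xf, yf, wf in zip(x, y, w)))

def correlation(a, b):
    """Pearson correlation between two density grids sampled at the same points."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def lookup(query_features, query_density, database, weights, k=2):
    # Step 1: cheap filter, keep the k regions closest in feature space.
    shortlist = sorted(
        database,
        key=lambda r: weighted_distance(query_features, r["features"], weights),
    )[:k]
    # Step 2: expensive check, re-rank the shortlist by density correlation.
    return max(shortlist, key=lambda r: correlation(query_density, r["density"]))

# Hypothetical database of three regions with precomputed features.
database = [
    {"name": "A", "features": [1.0, 2.2], "density": [0.9, 0.5, 0.1, 0.4]},
    {"name": "B", "features": [1.2, 2.0], "density": [0.2, 0.6, 1.0, 0.5]},
    {"name": "C", "features": [5.0, 5.0], "density": [0.1, 0.5, 0.9, 0.4]},
]
best = lookup([1.0, 2.0], [0.1, 0.5, 0.9, 0.4], database, weights=[1.0, 0.5])
print(best["name"])
```

Region A is nearest in feature space, but B wins once density correlation is applied. Region C is never examined in step 2 despite matching the query density, which shows why the feature weights matter for retrieval.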

Examples of Numeric Density Features
  – Distance from center-of-sphere to center-of-mass
  – Moments of inertia: relative dispersion along orthogonal axes
  – Geometric features like “spoke angles”
  – Local variance and other statistics
Features are designed to be rotation-invariant, i.e. same values for a region in any orientation/frame-of-reference.
TEXTAL uses 19 distinct numeric features to represent the pattern of density in a region, each calculated over 4 different radii, for a total of 76 features.
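Rotation invariance can be checked numerically on a toy region. The sketch below computes two features per radius: the distance from the sphere center to the center of mass, and a density-weighted mean squared radius. The second is a scalar stand-in for dispersion, not the full moments of inertia. The grid points, density values, and the two radii are hypothetical; the slides do not list TEXTAL's actual radii.

```python
import math

def features(points, radius):
    """Return (center-of-mass distance, weighted mean squared radius) inside `radius`.

    Each point is (x, y, z, density), with the sphere centered at the origin.
    """
    inside = [p for p in points
              if math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) <= radius]
    total = sum(p[3] for p in inside)
    com = [sum(p[i] * p[3] for p in inside) / total for i in range(3)]
    com_distance = math.sqrt(sum(c * c for c in com))
    spread = sum(p[3] * (p[0] ** 2 + p[1] ** 2 + p[2] ** 2) for p in inside) / total
    return com_distance, spread

def rotate_z(points, angle):
    """Rotate every point about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z, d) for x, y, z, d in points]

# Hypothetical density samples (x, y, z, density) around a candidate Cα.
region = [(1.0, 0.0, 0.0, 1.0), (0.0, 2.0, 0.0, 0.5),
          (0.0, 0.0, 3.5, 0.8), (2.0, 2.0, 1.0, 0.3)]
rotated = rotate_z(region, 0.7)

for radius in (2.5, 4.0):
    print(radius, features(region, radius), features(rotated, radius))
```

Both features depend only on distances from the center, so the original and rotated regions give the same values up to floating-point rounding.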

SLIDER: Feature-weighting algorithm
Euclidean distance metric used for retrieval: weights should reflect the importance of relevant features and suppress noisy features
Goal: find the optimal weight vector $w$ that generates the highest probability of hits (matches) in the top K candidates from the database
Concept of Slider:
  – analyze distances between representative matches and mismatches
  – adjust features so that most matches are ranked higher than mismatches
Slider Algorithm($w$, $F$, $\{R_i\}$, matches, mismatches):
  choose feature $f \in F$ at random
  for each $R_i$, $R_j \in \mathrm{matches}(R_i)$, $R_k \in \mathrm{mismatches}(R_i)$:
    compute cross-over point $\lambda_i$ where $\mathrm{dist}'(R_i, R_j) = \mathrm{dist}'(R_i, R_k)$,
    with $\mathrm{dist}'(X, Y) = \lambda (X_f - Y_f)^2 + (1 - \lambda)\,\mathrm{dist}_{\setminus f}(X, Y)$
  pick the $\lambda$ that is the best compromise among the $\lambda_i$, i.e. ranks most matches above mismatches
  update weight vector: $w' \leftarrow \mathrm{update}(w, f, \lambda)$, $w'_f = \lambda$
  repeat until convergence
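The cross-over step can be worked through for a single feature. Write a = (X_f - Y_f)^2 for the chosen feature and b for the distance over the remaining features, so the mixed distance is λa + (1-λ)b. Setting the match and mismatch distances equal gives λ = (b_k - b_j) / ((a_j - a_k) + (b_k - b_j)). The sketch below is an illustration under assumptions: the symbol λ, the probing offset `eps`, and the example numbers are not from the slides, and the weight-vector update is left out.

```python
def mixed_distance(a, b, lam):
    """dist' = lam * (squared difference on feature f) + (1 - lam) * (distance without f)."""
    return lam * a + (1.0 - lam) * b

def crossover(match, mismatch):
    """Value of lam at which a match and a mismatch are equally far, or None."""
    (a_j, b_j), (a_k, b_k) = match, mismatch
    denom = (a_j - a_k) + (b_k - b_j)
    if denom == 0:
        return None  # the two distances never cross
    lam = (b_k - b_j) / denom
    return lam if 0.0 <= lam <= 1.0 else None

def wins(pairs, lam):
    """Number of (match, mismatch) pairs in which the match ranks closer."""
    return sum(mixed_distance(*m, lam) < mixed_distance(*mm, lam) for m, mm in pairs)

def best_lambda(pairs, eps=1e-3):
    """Probe just either side of every cross-over and keep the best compromise."""
    candidates = {0.0, 1.0}
    for match, mismatch in pairs:
        lam = crossover(match, mismatch)
        if lam is not None:
            candidates.update(c for c in (lam - eps, lam + eps) if 0.0 <= c <= 1.0)
    return max(sorted(candidates), key=lambda lam: wins(pairs, lam))

# Hypothetical (a, b) values for two match/mismatch pairs.
pairs = [((0.1, 2.0), (0.9, 1.0)),   # feature f helps: match is closer on f
         ((0.5, 0.2), (0.4, 1.0))]   # feature f hurts: match is closer without f
lam = best_lambda(pairs)
print(round(lam, 4), wins(pairs, lam))
```

The two cross-overs fall at 5/9 and 8/9, and any λ between them ranks both matches above their mismatches. The sketch returns the smallest such candidate.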

SLIDER Results

Stage 3: Post-Processing

Quality of TEXTAL models
Typically builds >80% of the protein atoms
Accuracy of coordinates: ~1 Å error (RMSD)
  – Depends on resolution and quality of map
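For reference, RMSD over paired atoms is the square root of the mean squared distance between corresponding coordinates. A minimal sketch, assuming the model and reference are already in the same frame and the atoms are listed in matching order (the coordinates below are made up):

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between paired 3-D coordinates, in the input units."""
    if not model or len(model) != len(reference):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model, reference))
    return math.sqrt(squared / len(model))

# Hypothetical reference atoms, and a model displaced by 1 Å along x.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 1.0, y, z) for x, y, z in reference]
print(rmsd(model, reference))
```

A uniform 1 Å shift gives an RMSD of 1 Å, which is the scale of error quoted for TEXTAL models.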

PcaA: Mycolic acid cyclopropyl synthase (Smith & Sacchettini)
original structure solved at 2.0 Å via MAD
R-value = 0.22, R-free = residues,  fold
Example of density quality (~1σ contour with Cα trace)

Electron density map (2.8 Å)

Results of tracing

Strip off branches of trace (linearize)

Linearized trace shows backbone connectivity

Pick Cα's using neural net; link together

Results of CAPRA

Comparison to backbone of true structure (white)
Percent built = 89% (missing: 15-residue N-terminus, 17-residue disordered loop)
4 single-atom insertions; 5 single-atom deletions
RMSD = 0.81 Å

CAPRA model consists of 3 chains
Chain lengths: 14, 96, 145 residues

Results of LOOKUP (modeling side-chains)

Comparison of TEXTAL model to true structure
Percent amino acid identity = 87.5% (mistakes: small frame-shifts around gaps in alignment)
All-atom RMSD = 0.92 Å

Closeup of β-strand (TEXTAL model in green)

Closeup of another β-strand and turn

Implementation
Project started in 1998
  – Collaboration between TAMU Computer Science & Biochemistry departments
100,000 lines of C/C++, Perl, Python code
~8 developers
CVS for version management
Platforms: Irix, Linux, OSX, Win32
Speed: 1-3 hours for medium-sized proteins

Deployment
September 2004: Linux and OSX distributions
  – Can be downloaded from
  – 40 trial licenses granted so far
June 2002: WebTex (
  – Until May 2005: TB Structural Genomics Consortium members only
  – Recently opened to the public
  – ~500 jobs successfully processed
  – 120 users from 70 institutions in 20 countries
July 2003: Model building component of PHENIX
  – Python-based Hierarchical ENvironment for Integrated Xtallography
  – Consortium members: Lawrence Berkeley National Lab, University of Cambridge, Los Alamos National Lab, Texas A&M University
  – April 2005: Alpha release, over 300 downloads so far

Python-based Hierarchical ENvironment for Integrated Xtallography
diffraction data → PHENIX → refined molecular model
  – HYSS, CCTBX (Lawrence Berkeley Lab): crystallography toolbox, heavy atom search, refinement
  – PHASER (University of Cambridge): maximum likelihood phasing
  – SOLVE/RESOLVE (Los Alamos National Lab): statistical density modification, minimum bias phasing
  – TEXTAL™ (Texas A&M University): model building

Conclusions
Pattern recognition is a successful technique for macromolecular model-building
Future directions:
  – recognizing disulfide bridges, metal ions, detergents...
  – building ligands, co-factors, etc.
  – using models built to iteratively improve phases
  – building at higher or lower resolutions
  – intelligent agent for guiding model-completion
  – detecting and exploiting non-crystallographic symmetry
  – building nucleic acids (RNA and DNA)
Importance and challenges of interdisciplinary research

Acknowledgements
Funding:
  – National Institutes of Health
Our group:
  – Jacob Smith, Kreshna Gopal, Lalji Kanbi, Erik McKee, Reetal Pai, Tod Romo
Our association with the PHENIX group:
  – Paul Adams (Lawrence Berkeley National Lab)
  – Randy Read (Cambridge University)
  – Tom Terwilliger (Los Alamos National Lab)