Self-organizing Map (SOM) in Protein Folding Based on HP Model. Xiang-Sun ZHANG, 2 Dec. 2003.


Self-organizing Map (SOM) in Protein Folding Based on HP Model Xiang-Sun ZHANG Dec at NCSU

Motivation What can we (OR researchers and algorithm designers) do in bioinformatics? Where do operations research and bioinformatics meet?

Abstract Many problems in bioinformatics can be formulated as large linear or nonlinear integer programming or combinatorial problems that are NP-hard and cannot be solved efficiently by existing exact algorithms, so efficient approximate methods are needed. As examples, a heuristic algorithm for SBH and a new SOM algorithm for solving the protein HP model are presented. Other related research in our group is also introduced.

Problem areas in Bioinformatics Human Genome Project. Large-molecule data in biology, such as DNA and protein. Genomics: DNA sequencing, gene prediction, sequence alignment. Proteomics (entries on Google) / Protenomics (hundreds of entries on Google): structure prediction, protein alignment.

“Operations Research”: over 8 million entries on Google.

DNA Sequencing ACGTGATCGATCGAGTACGAGAGTCTA _______________________________ ACGTGATCGATCGAGTACGAGAGTCTA

Two pieces of a target sequence with a longer overlap are preferably connected together. This requires that (1) the average size of the pieces is as long as possible and (2) there are as many copies of the target sequence as possible.

A novel DNA sequencing technique, called Sequencing By Hybridization (SBH), was proposed as an alternative to traditional sequencing by gel electrophoresis. SBH is based on the DNA chip (or DNA array). A DNA chip contains all probes of length k (each probe is a short k-nucleotide fragment of DNA, also called a k-tuple). Given a probe and a target DNA, the target will bind (hybridize) to the probe if there is a substring of the target which “fits” the probe.

DNA Sequencing DNA array (DNA chip). Example: AAATGCG (five 3-tuples, on a chip with 3-tuples).

SBH uses a classical probing scheme: by hybridizing an (unknown) DNA fragment with the chip, the target DNA can be tested and all of its k-tuples (together called its spectrum) determined. SBH provides information about which k-tuples are present in the target DNA, but not about their positions. This raises the problem of how to reconstruct the target DNA from the spectrum.
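To make the notion of a spectrum concrete, here is a minimal Python sketch. The function name `spectrum` is an illustrative choice, not something from the slides, and the sketch models only the error-free case: it returns the set of distinct k-tuples, so repeated k-tuples collapse and positions are discarded, just as in SBH.

```python
def spectrum(sequence, k):
    """Return the sorted set of distinct k-tuples occurring in `sequence`.

    Positions and multiplicities are thrown away on purpose: this is the
    information an error-free SBH experiment is assumed to provide.
    """
    return sorted({sequence[i:i + k] for i in range(len(sequence) - k + 1)})


if __name__ == "__main__":
    # The example from the DNA array slide: AAATGCG has five 3-tuples.
    print(spectrum("AAATGCG", 3))
```

A sequence of length n with no repeated k-tuple yields exactly n-k+1 entries, which is the ideal spectrum discussed in the following slides.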

Because of technological limitations, k cannot yet be made very large (generally less than 30, which already requires a big chip). This can lead to branching during sequence reconstruction and to multiple possible reconstructions. In addition, two kinds of errors may occur: negative errors (k-tuples in the sequence that are not hybridized) and positive errors (hybridized probes that are not k-tuples in the sequence). Therefore, for larger DNA fragments, sequence reconstruction becomes rather complicated and hard to analyze.

In the case of error-free SBH with an ideal spectrum (one consisting of n-k+1 different k-tuples, where n is the length of the DNA fragment), the SBH reconstruction problem is known to be equivalent to finding an Eulerian path in a corresponding graph, which can be done in linear time. Positive and negative errors, and repeated k-tuples in the DNA fragment, make the problem computationally difficult: it becomes strongly NP-hard.
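The Eulerian-path view of the error-free case can be sketched as follows. This is an illustration, not the authors' code. It assumes the usual construction in which each k-tuple is an edge from its first k-1 letters to its last k-1 letters, and it finds the path with Hierholzer's algorithm. The function name `reconstruct` is hypothetical.

```python
from collections import defaultdict


def reconstruct(spectrum):
    """Rebuild a sequence from an ideal, error-free spectrum.

    Each k-tuple is treated as an edge (prefix -> suffix) between
    (k-1)-tuples; an Eulerian path uses every k-tuple exactly once.
    Returns None if no such path exists.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for ktuple in spectrum:
        graph[ktuple[:-1]].append(ktuple[1:])
        indegree[ktuple[1:]] += 1

    # Start at the vertex with one more outgoing than incoming edge,
    # or at any edge's tail if the path is actually a cycle.
    vertices = set(graph) | set(indegree)
    start = next((v for v in vertices if len(graph[v]) - indegree[v] == 1),
                 spectrum[0][:-1])

    # Hierholzer's algorithm, iterative form: linear in the number of edges.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(spectrum) + 1:
        return None  # some k-tuple could not be used: no Eulerian path
    return path[0] + "".join(v[-1] for v in path[1:])


if __name__ == "__main__":
    ideal = ["GAA", "ATA", "AGA", "TAC", "CGA", "AAG", "ACG"]
    print(reconstruct(ideal))
```

When the graph branches, several Eulerian paths may exist and this sketch returns just one of them, which is the multiple-reconstruction issue mentioned earlier.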

Sequencing by Hybridization DNA fragment: …… ATACGAAGA …… → spectrum. Errors: positive (misread) / negative (missing, repetition). Ideal case: ATA TAC ACG CGA GAA AAG AGA. With errors: ATA TAC AGG CGA GAA AAG AGA.

1989, Pevzner: the SBH reconstruction problem is equivalent to finding an Eulerian path in a related graph.
1990, Fleischner: the algorithm can be implemented in linear time.
1991, Drmanac et al.: an algorithm for SBH with errors, under the assumption that only the first or last nucleotide in the data can be erroneous.
1993, Lipshutz: uses empirically derived rates of positive and negative errors and other assumptions. No convergence analysis.
1999, Blazewicz et al.: a branch and bound method for the case of only positive errors.
2000, Blazewicz et al.: a heuristic algorithm producing near-optimal solutions.

SBH Reconstruction Problem Design efficient heuristic algorithms.
Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A new approach to the reconstruction of DNA sequencing by hybridization. Bioinformatics, vol 19(1), pages 14-21.
Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu. Combinatorial optimization problems in the positional DNA sequencing by hybridization and its algorithms. System Sciences and Mathematics, vol 3 (in Chinese).
Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang. Application of neural networks in the reconstruction of DNA sequencing by hybridization. In Proceedings of the 4th ISORA, 2002.

Basic Observation The spectrum corresponds to a graph: each k-tuple is a vertex, and two connected k-tuples form an edge. The structure of the graph is represented by its adjacency matrix. A reconstruction from the spectrum is a path in the graph. Information about all paths is contained in the powers of the adjacency matrix.
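The role of matrix powers can be illustrated with a small sketch. It assumes that two k-tuples are "connected" when the last k-1 letters of one equal the first k-1 letters of the other. The slides do not spell out this rule, so treat it as an assumption, along with the helper names. Entry (i, j) of the p-th power of the adjacency matrix counts the walks with p edges from tuple i to tuple j. Walks may revisit vertices, so a further criterion is needed to pick out valid reconstructions.

```python
def adjacency(spectrum):
    """A[i][j] = 1 if tuple i can be followed by tuple j (overlap of k-1)."""
    return [[1 if a[1:] == b[:-1] else 0 for b in spectrum] for a in spectrum]


def matmul(X, Y):
    """Plain-Python product of two square integer matrices."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def walk_counts(spectrum, length):
    """Return A**length, whose (i, j) entry counts walks of `length` edges."""
    A = adjacency(spectrum)
    n = len(A)
    P = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(length):
        P = matmul(P, A)
    return P


if __name__ == "__main__":
    spec = ["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]
    P = walk_counts(spec, len(spec) - 1)
    # Number of 6-edge walks from ATA to AGA, i.e. candidate reconstructions
    # that start with ATA and end with AGA.
    print(P[0][6])
```

Each matrix product here costs O(n^3) for n tuples, so all powers up to n-1 can be obtained in polynomial time.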

Criteria are given that use the information in the powers of the adjacency matrix to determine, in polynomial time, the most likely k-tuples at both ends and in the middle of all possible reconstructions of the target DNA. A novel technique is proposed that transforms negative errors into positive errors, which makes it easy to handle both types of errors.

Protein Structure Prediction Predict protein 3D structure from the (amino acid) sequence. Sequence → secondary structure → 3D structure → function.

Proteins Secondary Structure α-helix (30-35%); β-sheet / β-strand (20-25%); coil (40-50%), i.e. random coil; loop; β-turn.

3D Structure of Protein Alpha-helix Beta-sheet Loop and Turn Turn or coil

Protein 3D Structure Detection X-ray diffraction: expensive, slow.

Protein Structure Prediction Prediction is possible because sequence information uniquely determines 3D structure, and sequence similarity (>50%) tends to imply structural similarity. Prediction is necessary because DNA sequence data ≫ protein sequence data ≫ structure data.
Sequence (Swiss-Prot): 40,000; 68,000; 114,033
Structure (PDB): 4,045; 7,000; 18,838

Three Methods of Protein Structure Prediction Goal: find the best fit of sequence to 3D structure.
Comparative (homology) modeling: construct a 3D model from alignment to protein sequences with known structure.
Threading (fold recognition): pick the best fit to sequences of known 2D / 3D structures (folds).
Ab initio / de novo methods: attempt to calculate the 3D structure “from scratch”, using molecular dynamics, energy minimization, or lattice models.

Lattice Models Suppose that each amino acid occupies one point in a spatial lattice. This is called an exact model.

HP Model (Simple Model) The twenty amino acids can be divided into two classes: hydrophobic/non-polar (H) and hydrophilic/polar (P). Contacts between H points are favorable. Goal: maximize the number of H-H contacts. Figure legend: hydrophobic amino acid, hydrophilic amino acid, covalent bond, H-H contact.
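The HP objective can be evaluated for a given fold with a few lines of Python. This is a minimal sketch for the 2D square lattice. It assumes that a contact means two H residues on neighboring lattice points that are not consecutive in the chain, which is the standard convention for the HP model. The function name `hh_contacts` is illustrative.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of an HP chain folded on the 2D square lattice.

    `sequence` is a string over {"H", "P"}; `coords` lists one (x, y)
    lattice point per residue, in chain order.
    """
    if len(sequence) != len(coords):
        raise ValueError("one lattice point per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")

    index_at = {point: i for i, point in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


if __name__ == "__main__":
    # A four-residue chain folded around a unit square.
    print(hh_contacts("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

A search method for the HP model then tries to find the valid fold that makes this count as large as possible.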

Basic Ideas Each acid (neuron) in the primary sequence occupies one lattice point (city). The distance between the two cities mapped by two neighboring neurons is forced to be 1, the length of a covalent bond between amino acids in a protein molecule. Move the neurons to obtain more H-H contacts, i.e., emphasize forming a hydrophobic core.

Main Observation The problem is a Traveling Salesman Problem with an energy function, based on the H-H contacts, that is to be maximized.

Mathematical Model (in square lattice) Let both the sequence length and the lattice size be $n$, and let $x_{ij} = 1$ if the i-th acid takes the j-th lattice point and $x_{ij} = 0$ otherwise. Let $N(j)$ be the neighboring set of point j. Let ... and the coordinates of point j be ...

Complexity The problem is NP-hard even for the two-dimensional HP model: P. Crescenzi, et al. On the complexity of protein folding, Journal of Computational Biology, 5(3): 423-, 1998. There are many local solutions. GA (genetic algorithms), MC (Monte Carlo), and SA (simulated annealing) are time-consuming.

SOM Approach Existing algorithm: motivated by the Self-Organizing Map for TSP; incorporates HP information; uses a compact lattice (the sequence exactly fills the lattice), e.g. a 36-long sequence in a 6x6 lattice.

New SOM Approach Motivation Consider a lattice bigger than the sequence, to allow more flexible shapes than the rectangular shape alone. This is equivalent to a PCTSP (Prize Collecting Traveling Salesman Problem): a salesman visits only part of the city set while meeting some expectation. Difficulty caused: the number of cities exceeds the number of neurons.

PCTSP A traveling salesman gets a prize in every city k that he visits, pays a penalty for every city that he fails to visit, and travels between cities i and j at cost $c_{ij}$. He wants to minimize the sum of his travel costs and net penalties while including in his tour enough cities to collect a prescribed amount of prize money.
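The PCTSP objective described above can be evaluated for a candidate tour as in the following sketch. It only scores a given tour and does not search for the best one. The function and parameter names (`pctsp_objective`, `cost`, `prize`, `penalty`, `min_prize`) are illustrative assumptions, and the tour is taken to be a closed cycle.

```python
def pctsp_objective(tour, cost, prize, penalty, min_prize):
    """Score a closed tour for the prize collecting TSP.

    tour      -- list of distinct city indices, visited in order
    cost      -- square matrix, cost[i][j] = travel cost between i and j
    prize     -- prize[k] collected when city k is visited
    penalty   -- penalty[k] paid when city k is skipped
    min_prize -- prescribed amount of prize money to collect

    Returns travel cost plus penalties, or None if the tour is infeasible.
    """
    visited = set(tour)
    if sum(prize[k] for k in visited) < min_prize:
        return None  # not enough prize money collected

    travel = sum(cost[tour[i]][tour[(i + 1) % len(tour)]]
                 for i in range(len(tour)))
    skipped = sum(penalty[k] for k in range(len(prize)) if k not in visited)
    return travel + skipped


if __name__ == "__main__":
    cost = [[0, 1, 2, 3],
            [1, 0, 1, 2],
            [2, 1, 0, 1],
            [3, 2, 1, 0]]
    prize = [5, 5, 5, 5]
    penalty = [1, 1, 1, 4]
    print(pctsp_objective([0, 1, 2], cost, prize, penalty, min_prize=10))
```

In the protein setting of these slides, the cities correspond to lattice points and only some of them are visited, which is why the lattice may be larger than the sequence.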

The new SOM model corresponds to an integer programming problem in which m > n and the total number of variables is (n+1)m.

New SOM Approach Innovative Points Heuristic initialization to imitate a protein; learning sample set partition strategy; learning sample set reduction strategy; local search procedure to overcome the multi-mapping phenomenon.

Numerical Results 1. Constructed HP sequences (Length of 17) 2. HP benchmark (up to 36 amino acids)

SOM Approach for 2D HP-Model
Xiang-Sun Zhang, Yong Wang, Zhong-Wei Zhan, Ling-Yun Wu, Luonan Chen. A New SOM Approach for 2D HP-Model of Proteins' Structure Prediction. Submitted to RECOMB04.
Yong Wang, Zhong-Wei Zhan, Ling-Yun Wu, Xiang-Sun Zhang. Improved Self-Organizing Map Algorithm for Protein Folding and its Realization. Submitted to J. of Systems Science and Mathematical Sciences (in Chinese).

Main Improvements Finds configurations with the global maximum number of H-H contacts in all the tests. Finds more optimal conformations. Fast: running time is linear in the sequence length.

Unique Optimal Folding Problem Which proteins in the two-dimensional HP model have a unique optimal (minimum-energy) folding? (Brian Hayes, 1998) Oswin Aichholzer proved that, in the square lattice, there are closed chains of monomers with this property for all even lengths, and open monomer chains with this property for all lengths divisible by four.

Square Lattice and Triangular Lattice

Our Results For any n = 18k (k a positive integer), there exists an n-node (open or closed) chain with at least optimal foldings, all with isomorphic contact graphs of size n/2. On the 2D triangular lattice, for any integer n > 19, there exist both closed and open chains of n nodes with a unique optimal folding.

Proteins With Unique Optimal Foldings Zhen-Ping Li, Xiang-Sun Zhang, Luo-Nan Chen, Protein with Unique Optimal Foldings on a Triangular Lattice in the HP Model, Submitted to Journal of Computational Biology.

Examples of Optimal Foldings

3D Protein Structure Alignment Motivation Group proteins by structural similarity. Determine the impact of individual residues on protein structure. Identify distant homologues of protein families. Predict the function of proteins with low sequence similarity. Identify new folds / targets for x-ray crystallography.

3D Protein Structure Alignment Correspondence between atoms: pairwise sequence alignment. Locations of atoms: Protein Data Bank (in PDB file), giving bond angles / lengths and X, Y, Z atom coordinates. Rigid-body transformation with 6 degrees of freedom: 3 degrees of translation (A) and 3 degrees of rotation (R). Evaluation metric: Root Mean Square Deviation (RMSD), $\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^2}$, where n = number of atoms and d_i = distance between corresponding atoms i.
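The RMSD metric can be computed directly from its definition: the square root of the mean of the squared distances d_i over the n corresponding atom pairs. The sketch below assumes the correspondence is already fixed (atom i of one structure pairs with atom i of the other) and that the structures are already superimposed. It does not search for the best rotation or translation. The function name `rmsd` is illustrative.

```python
import math


def rmsd(atoms_a, atoms_b):
    """Root mean square deviation between two equally long lists of
    (x, y, z) coordinates, pairing atoms by position in the list."""
    if len(atoms_a) != len(atoms_b) or not atoms_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(atoms_a, atoms_b):
        # d_i squared for the i-th pair of corresponding atoms
        squared += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared / len(atoms_a))


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(a, b))
```

A full structure alignment would minimize this value over the six rigid-body degrees of freedom and over the choice of corresponding atoms, which is what makes the problem hard.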

Structure Alignment Problem

Match two rigid bodies by rotating and translating them in 3D space.

Structure Alignment Problem A nonlinear integer programming problem:

Structure Alignment Problem Luo-Nan Chen, Tian-Shou Zhou, Yun Tang, Xiang-Sun Zhang. Structure of Alignment of Protein by Mean Field Annealing. Submitted to ICSB2003.

On-going Research Protein structure prediction: algorithms for the HP model, threading methods. Protein structure alignment: novel model for structure alignment. SBH reconstruction: algorithms for new-pattern SBH methods. SNP (Single Nucleotide Polymorphism) and haplotype analysis.

Summary Problems in bioinformatics are simple to describe but complicated to solve. Many problems in proteomics are deterministic in nature (combinatorial or continuous models), while many problems in genomics are stochastic in nature. Model a problem accurately, but solve it approximately.