
1 Machine Learning in the Study of Protein Structure Rui Kuang Columbia University Candidacy Exam Talk May 5th, 2004 Committee: Christina S. Leslie (advisor), Yoav Freund, Tony Jebara

2 Table of contents
1. Introduction to protein structure and its prediction
2. HMM, SVM and string kernels
3. Machine learning in the study of protein structure: protein ranking; protein structural classification; protein secondary structure and conformational state prediction; protein domain segmentation
4. Conclusion and future work

3 Part 1: Introduction to Protein Structure and Its Prediction (Thanks to Carl-Ivar Branden and John Tooze)
1. Introduction 2. HMM, SVM and string kernels 3. Topics 4. Conclusion and future work

4 Why study protein structure?
Protein: derived in 1838 by Jöns J. Berzelius from the Greek word proteios, meaning "of the first rank".
Proteins are crucial in all biological processes.
Function depends on structure, so structure can help us understand function.

5 How to Describe Protein Structure Primary structure: amino acid sequence Secondary structure: local structure elements Tertiary structure: packing and arrangement of secondary structure, also called domain Quaternary structure: arrangement of several polypeptide chains

6 Secondary Structure: Alpha Helix
Hydrogen bonds form between the C′=O at position n and the N-H at position n+i (i = 3, 4, 5).

7 Secondary Structure : Beta Sheet Parallel Beta Sheet Antiparallel Beta Sheet We can also have a mix of both.

8 Secondary Structure: Loop Regions
– Less conserved structure
– Insertions and deletions occur more often
– Conformations are flexible

9 Tertiary Structure
Phi: rotation around the N-Cα bond. Psi: rotation around the Cα-C′ bond.

10 Phi-Psi angle distribution

11 Protein Domains A polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure.

12 Determination of Protein Structures
Experimental determination (time consuming and expensive):
– X-ray crystallography
– Nuclear magnetic resonance (NMR)
Computational determination [Schonbrun 2002 (B2)]:
– Comparative modeling
– Fold recognition ('threading')
– Ab initio structure prediction ('de novo')

13 Sequence, Structure and Function [Domingues 2000 (B1)] (picture due to Michal Linial)
Structure (24,000): discrete groups of folds with unclear boundaries.
Sequence (1,000,000): >30% sequence similarity suggests strong structure similarity; remote homologous proteins can also share similar structure.
Function (ill-defined): similar functions are associated with different structures; a superfamily with the same fold can evolve into distinct functions; 66% of proteins with a similar fold also have a similar function.

14 Part 2: Hidden Markov Model, Support Vector Machine and String Kernels (Thanks to Nello Cristianini)
1. Introduction 2. HMM, SVM and string kernels 3. Topics 4. Conclusion and future work

15 Hidden Markov Models for Modeling Proteins [Krogh 1993 (B3)]
An HMM is built from an alignment by maximum likelihood or maximum a posteriori estimation.
If we don't know the alignment, use EM to train the HMM.

16 Hidden Markov Models for Modeling Proteins [Krogh 1993 (B3)]
The probability of sequence x is computed over paths q through the model; the Viterbi algorithm finds the best path.
HMMs can be used for sequence clustering, database search, and more.
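The Viterbi decoding mentioned above can be sketched in a few lines. This is a minimal log-space version run on a toy two-state (match/insert) model; all state names and parameter values here are invented for illustration, not taken from Krogh's profile HMMs:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (log-space DP)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t
            prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = V[t-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy HMM: 'M' (match) prefers 'A', 'I' (insert) prefers 'B'
states = ['M', 'I']
start_p = {'M': 0.9, 'I': 0.1}
trans_p = {'M': {'M': 0.8, 'I': 0.2}, 'I': {'M': 0.4, 'I': 0.6}}
emit_p = {'M': {'A': 0.9, 'B': 0.1}, 'I': {'A': 0.2, 'B': 0.8}}
path = viterbi("AAB", states, start_p, trans_p, emit_p)  # → ['M', 'M', 'I']
```

The same forward recursion with a sum instead of a max gives the total probability of the sequence.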

17 Support Vector Machine [Burges 1998 (B4)]
Related to structural risk minimization.
Linear-separable case:
– Primal QP problem: minimize ½||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for all i
– Dual convex problem: maximize Σᵢ αᵢ − ½ Σᵢⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ) subject to αᵢ ≥ 0 and Σᵢ αᵢyᵢ = 0

18 Kernels
One nice property of the dual QP problem is that it involves only inner products between feature vectors, so we can define a kernel function K(x, y) = ⟨Φ(x), Φ(y)⟩ and compute it more efficiently than the explicit inner product. Example: the polynomial kernel K(x, y) = (x·y)^d.
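To make the kernel trick concrete, the quadratic kernel K(x, y) = (x·y)² computed implicitly can be checked against the explicit feature map φ(x) = (xᵢxⱼ) over all ordered index pairs. The vectors below are arbitrary toy values:

```python
import itertools

def poly_kernel(x, y):
    """Quadratic kernel K(x, y) = (x . y)^2, without building features."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def quad_features(x):
    """Explicit map phi(x): all ordered products x_i * x_j."""
    return [a * b for a, b in itertools.product(x, repeat=2)]

x, y = [1.0, 2.0], [3.0, 1.0]
implicit = poly_kernel(x, y)   # (1*3 + 2*1)^2 = 25
explicit = sum(a * b for a, b in zip(quad_features(x), quad_features(y)))
# implicit == explicit, but the kernel costs O(n) instead of O(n^2)
```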

19 String Kernels for Text Classification [Lodhi 2002 (M2)]
The string subsequence kernel (SSK) counts common gapped subsequences, weighted by their spread.
A recursive computation of SSK has complexity O(n|s||t|), quadratic in the length of the input sequences; not practical for large datasets.

20 Part 3 Machine learning in the study of protein structure 3.1 Protein ranking 3.2 Protein structural classification 3.3 Protein secondary structure and conformational state prediction 3.4 Protein domain segmentation 1.Introduction 2.HMM, SVM and string kernels 3.Topics 4.Conclusion and future work

21 Part 3.1: Protein Ranking ("Please stand in order!")
Smith-Waterman, BLAST/PSI-BLAST, SAM-T98, Rank Propagation

22 Local Alignment: Smith-Waterman Algorithm (Thanks to Jean Philippe)
For two strings x and y, a local alignment with gaps aligns a substring of x with a substring of y; its score sums substitution scores minus gap penalties. The Smith-Waterman score is the maximum over all local alignments.
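A minimal dynamic-programming sketch of the Smith-Waterman score; the match/mismatch/gap values are toy scores standing in for a real substitution matrix, and the gap penalty is linear rather than affine:

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between x and y (O(|x||y|) DP)."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i-1] == y[j-1] else mismatch
            # Clipping at 0 is what makes the alignment local
            H[i][j] = max(0, H[i-1][j-1] + s, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

smith_waterman("AAQKR", "QKR")  # shared run QKR: 3 matches * 2 = 6
```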

23 BLAST [Altschul 1997 (R1)]: a heuristic algorithm for matching DNA/protein sequences
Idea: true matches are likely to contain a short stretch of identity.
The query (e.g. AKQDYYYYE…) is cut into short words (AKQ, KQD, QDY, DYY, YYY, …); each word is mapped to its neighbors with substitution score > T (e.g. AKQ → SKQ, …; KQD → AQD, …); the protein database is searched for these words, and each hit is extended into a full match.

24 PSI-BLAST: Position-Specific Iterated BLAST [Altschul 1997 (R1)]
Only extends double hits within a certain range.
A gapped alignment uses dynamic programming to extend a central pair of aligned residues in both directions.
PSI-BLAST can take a PSSM as input to search the database.

25 SAM-T98 [Karplus 1999 (C3)]
Start from the query sequence, BLAST-search the NR protein database, build an alignment from the hits, derive a profile/HMM, and use it to search again; iterate for 4 rounds.

26 Local and Global Consistency [Zhou 2003 (M1)]
Build an affinity matrix W; let D be the diagonal matrix of the row sums of W and S = D^(-1/2) W D^(-1/2).
Iterate F(t+1) = αSF(t) + (1−α)Y; F* is the limit of the sequence {F(t)}.
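The iteration above can be sketched directly; the three-node chain graph, α, and iteration count below are toy values chosen only to show the propagation:

```python
import numpy as np

def propagate(W, Y, alpha=0.5, iters=200):
    """Zhou et al.'s local-and-global-consistency iteration.
    W: symmetric affinity matrix (zero diagonal), Y: initial labels/activations."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt          # normalized affinity
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y  # spread, then clamp toward Y
    return F

# Chain graph 0-1-2, activation injected at node 0:
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
Y = np.array([1., 0., 0.])
F = propagate(W, Y)  # activation decays with distance from the query node
```

The closed form of the limit is F* = (1−α)(I − αS)⁻¹Y, which is what rank propagation evaluates on the protein similarity network.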

27 Rank Propagation [Weston 2004 (R2)]
Protein similarity network:
– Graph nodes: protein sequences in the database
– Directed edges: weighted by an exponential function of the PSI-BLAST e-value (with the destination node as query)
– Activation value at each node: the similarity to the query sequence
Exploits the structure of the protein similarity network.

28 Result [Weston 2004 (R2)]

29 Part 3.2 Protein structural classification Where are my relatives? Fisher Kernel Mismatch Kernel ISITE Kernel SVM-Pairwise EMOTIF Kernel Cluster Kernels

30 SCOP [Murzin 1995 (C1)]
Hierarchy: fold > superfamily > family; benchmarks split each level into positive/negative training and test sets.
Family: sequence identity > 30%, or functions and structures are very similar.
Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin.
Common fold: same major secondary structures in the same arrangement with the same topological connections.

31 CATH [Orengo 1997 (C2)]
Class: secondary structure composition and contacts.
Architecture: gross arrangement of secondary structures.
Topology: similar number and arrangement of secondary structures with the same connectivity.
Homologous superfamily, then sequence family.

32 Fisher Kernel [Jaakkola 2000 (C4)]
An HMM (or more than one) is built for each family.
The feature mapping is derived from the Fisher scores of each sequence given an HMM H1 (the gradient of the sequence's log-likelihood with respect to the model parameters).

33 SVM-pairwise [Liao 2002 (C5)]
Represent sequence P as a vector of pairwise similarity scores with all training sequences.
The similarity score can be a Smith-Waterman score or a PSI-BLAST e-value.
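The SVM-pairwise vectorization is easy to sketch. Here a toy shared-k-mer count stands in for the Smith-Waterman or PSI-BLAST score that the paper actually uses:

```python
def kmer_sim(a, b, k=3):
    """Toy similarity: number of k-mers shared by a and b
    (a stand-in for Smith-Waterman score or PSI-BLAST e-value)."""
    A = {a[i:i+k] for i in range(len(a) - k + 1)}
    B = {b[i:i+k] for i in range(len(b) - k + 1)}
    return len(A & B)

def pairwise_features(seq, train_seqs, sim=kmer_sim):
    """SVM-pairwise map: one coordinate per training sequence."""
    return [sim(seq, t) for t in train_seqs]

pairwise_features("AKQDY", ["AKQDYY", "GGGGGG"])  # → [3, 0]
```

The resulting vectors feed a standard SVM; the feature dimension equals the training set size.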

34 Mismatch Kernel [Leslie 2002 (C6)]
Each k-mer of a sequence (e.g. AKQ, KQD, QDY, DYY, YYY, … from AKQDYYYYE…) contributes to the coordinates of all k-mers within m mismatches (e.g. AKQ contributes to AAQ, CKQ, DKQ, AKY, …), giving a sparse feature vector (0, …, 1, …, 1, …, 0).
An implementation with a trie achieves complexity O(|Σ|^m k^(m+1) (|x|+|y|)), linear in the sequence lengths.
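A naive version of the (k, m)-mismatch feature map makes the definition concrete; it enumerates the whole neighborhood explicitly, which is exactly the cost the trie in the paper avoids. The tiny DNA alphabet and parameters are chosen only to keep the example small:

```python
from collections import Counter
from itertools import product

def mismatch_features(x, k, m, alphabet):
    """Naive (k, m)-mismatch map: each k-mer of x contributes to every
    k-mer within Hamming distance m (exponential enumeration; for sketching only)."""
    phi = Counter()
    for i in range(len(x) - k + 1):
        kmer = x[i:i+k]
        for cand in product(alphabet, repeat=k):
            if sum(a != b for a, b in zip(kmer, cand)) <= m:
                phi[''.join(cand)] += 1
    return phi

def mismatch_kernel(x, y, k=3, m=1, alphabet='ACGT'):
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(fx[w] * fy[w] for w in fx)

# With m = 0 this reduces to the plain spectrum kernel:
mismatch_kernel("ACGT", "ACGT", k=2, m=0)  # 2-mers AC, CG, GT → 3
```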

35 EMOTIF Kernel [Ben-Hur 2003 (C8)]
The EMOTIF TRIE is built from eBLOCKS [Nevill-Manning 1998 (C7)].
The EMOTIF feature vector has one coordinate per motif m, whose value is the number of occurrences of the motif m in x.

36 I-SITES Kernel [Hou 2003 (C10)]
Similar to the EMOTIF kernel: the I-SITES kernel encodes protein sequences as a vector of confidence levels against the structural motifs in the I-SITES library [Bystroff 1998 (C9)].

37 Cluster Kernels [Weston 2004 (C11)]
Neighborhood kernels: implicitly average the feature vectors of sequences in the PSI-BLAST neighborhood of the input sequence (dependent on the size of the neighborhood and the total length of unlabeled sequences).
Bagged kernels: run bagged k-means to estimate p(x, y), the empirical probability that x and y are in the same cluster; the new kernel is the product of p(x, y) and the base kernel K(x, y).

38 Results

39 Part 3.3: Protein secondary structure and conformational state prediction Can we really do that? PHD PSI-PRED PrISM HMMSTR

40 PHD: Profile network from HeiDelberg [Rost 1993 (P1)] Accuracy: 70.8%

41 PSIPRED [Jones 1999 (P2)] Accuracy: 76.0%

42 Conformational State Prediction

43 PrISM [Yang 2003 (P3)]
Prediction with this conformation library, based on sequence and secondary structure similarity; accuracy: 74.6%.

44 HMMSTR [Bystroff 2000 (P4)]: a Hidden Markov Model for Local Sequence-Structure Correlations in Proteins
I-SITES motifs are modeled as Markov chains and merged into one compact HMM to capture grammatical structure.
The HMM can be used for gene finding, secondary structure or conformational state prediction, sequence alignment, and more.
Accuracy:
– secondary structure prediction: 74.5%
– conformational state prediction: 74.0%

45 Part 3.4: Protein Domain Segmentation ("Cut? Where?")
DOMAINATION, Pfam Database, Multi-experts

46 DOMAINATION [George 2002 (D1)]
Compute the distribution of both N- and C-termini in the PSI-BLAST alignment at each position; positions with Z-score > 2 are potential domain boundaries.
Accuracy: 50% over 452 multi-domain proteins.

47 Pfam [Sonnhammer 1997 (D2)]
A database of HMMs of domain families.
Pfam A: high-quality alignments and HMMs built from known domains.
Pfam B: domains built by the Domainer algorithm from the remaining protein sequences after removal of Pfam-A domains.

48 A multi-expert system from sequence information [Nagarajan 2003 (D3)]
Starting from a seed sequence, a BLAST search builds a multiple alignment; experts based on correlation, entropy, sequence participation, contact profile, secondary structure, physio-chemical properties, and intron boundaries (from DNA data) feed a neural network that produces putative domain predictions.

49

50 Results [Nagarajan 2003 (D3)]

51 Part 4: Conclusion and Future Work Mars is not too far!? 1.Introduction 2.HMM, SVM and string kernels 3.Topics 4.Conclusion and future work

52

53 Conclusion
Structural genomics plays an important role in understanding life.
Protein structure can be studied from different perspectives with different methods.
Machine learning is one of the most important tools for understanding genome data.
Protein structure prediction remains a challenging task given the data we have now.

54 Future Work Rank propagation with domain activation regions Profile kernel with secondary structure information for protein classification Rank propagation for domain segmentation Specialist algorithm for protein conformational state prediction

55 The End

56 Determination of Protein Structures
X-ray crystallography: the interaction of X-rays with the electrons arranged in a crystal produces an electron-density map, which can be interpreted as an atomic model. Crystals are very hard to grow.
Nuclear magnetic resonance (NMR): some atomic nuclei have a magnetic spin; the molecule is probed with radio frequencies to obtain distances between atoms. Only applicable to small molecules.

57 Hidden Markov Models for Modeling Proteins [Krogh 1993 (B3)]
Building an HMM from unaligned sequences with the EM algorithm:
1. Choose an initial model length and parameters.
2. Iterate until the change in likelihood is small:
– Calculate the expected number of times each transition or emission is used
– Maximize the likelihood to get new parameters

58 Support Vector Machine [Burges 1998 (B4)] (Thanks to Tony Jebara)
With probability 1−η the bound holds, where l is the number of data points and h is the VC dimension.
Structural risk minimization: for each hypothesis class h_i, find the best α* = argmin R_emp(α), then choose the model minimizing the bound J(α*, h_i).

59 EMOTIF Database [Nevill-Manning 1998 (C7)]
A motif database of protein families.
Substitution groups are derived from a separation score.

60 EMOTIF Database [Nevill-Manning 1998 (C7)]
All possible motifs are enumerated from the sequence alignments.

61 I-SITES Motif Library [Bystroff 1998 (C9)]
Sequence segments (3-15 amino acids long) are clustered via k-means.
Within each cluster, structure similarity is calculated in terms of dme and mda.
Only clusters with good dme and mda are refined and considered motifs afterwards.

62 PrISM [Yang 2003 (P3)]

63 Pfam [Sonnhammer 1997 (D2)]
Construction of Pfam A:
– Pick seed sequences from several sources and build a seed alignment
– Build an HMM from the seed alignment and use it to pull in new members, aligning them to the HMM to get the full alignment

64 Pfam [Sonnhammer 1997 (D2)]
Construction of Pfam B:
– The Domainer program merges homologous segment pairs into homologous segment sets linked together; this graph is partitioned into domains
– The Domainer program builds alignments from all protein segments not covered by Pfam-A
Incremental updating:
– A new sequence is added to the full alignment of existing models if it scores above a threshold
– If the new sequence causes problems, the seed alignment is altered and Pfam-B is regenerated afterwards

