Protein Classification

PDB Growth
[Figure: new PDB structures]

Protein classification
The number of protein sequences grows exponentially
The number of solved structures grows exponentially
The number of new folds identified is very small (and close to constant)
Protein classification can
  Generate an overview of structure types
  Detect similarities (evolutionary relationships) between protein sequences

SCOP release 1.67

Class                            # folds   # superfamilies   # families
All alpha proteins                   202               342          550
All beta proteins                    141               280          529
Alpha and beta proteins (a/b)        130               213          593
Alpha and beta proteins (a+b)        260               386          650
Multi-domain proteins                 40                40           55
Membrane & cell surface               42                82           91
Small proteins                        72               104          162
Total                                887              1447         2630

Morten Nielsen, CBS, BioCentrum, DTU

Protein structure classification
[Figure labels: protein world, protein fold, protein superfamily, protein family]

Structure Classification Databases
SCOP: manual classification (A. Murzin), scop.berkeley.edu
CATH: semi-manual classification (C. Orengo), www.biochem.ucl.ac.uk/bsm/cath
FSSP: automatic classification (L. Holm), www.ebi.ac.uk/dali/fssp/fssp.html

Major classes in SCOP
  All alpha proteins
  All beta proteins
  Alpha and beta proteins (a/b)
  Alpha and beta proteins (a+b)
  Multi-domain proteins
  Membrane and cell surface proteins
  Small proteins

All α: Hemoglobin (1bab)

All β: Immunoglobulin (8fab)

α/β: Triosephosphate isomerase (1hti)

α+β: Lysozyme (1jsf)

Families
Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
  The same protein may be found in several species
[Figure: hierarchy Fold, Superfamily, Family, Proteins]

Superfamilies
Proteins that are (remotely) evolutionarily related
  Sequence similarity is low
  Share function
  Share special structural features
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
[Figure: hierarchy Fold, Superfamily, Family, Proteins]

Folds
Proteins that have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
No evolutionary relationship between the proteins is required
[Figure: hierarchy Fold, Superfamily, Family, Proteins]

Protein Classification
Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?
Methods
  BLAST / PSI-BLAST
  Profile HMMs
  Supervised machine learning methods
[Figure: a new protein to be placed in the hierarchy Fold, Superfamily, Family, Proteins]

PSI-BLAST
Given a sequence query x and a database D:
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct a position-specific matrix M
   Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
[Figure: profile M]
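Step 3 can be sketched as follows. This is a minimal illustration, not PSI-BLAST's actual implementation: it assumes a gap-free alignment of equal-length sequences, uses the position-based weights of Henikoff & Henikoff (1994), and mixes in a simple background pseudocount. The function names, the `background` dictionary, and the `pseudocount` parameter are all hypothetical.

```python
from collections import Counter
import math

def henikoff_weights(alignment):
    """Position-based sequence weights (Henikoff & Henikoff 1994).

    In each column a sequence receives 1 / (k * n), where k is the number of
    distinct residues in the column and n is how many sequences share its
    residue. Near-identical sequences therefore split their influence.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

def position_specific_matrix(alignment, background, pseudocount=1.0):
    """Log-odds matrix M[position][residue] (assumed scheme, gap-free input)."""
    weights = henikoff_weights(alignment)
    alphabet = sorted(background)
    matrix = []
    for column in zip(*alignment):
        # Start from background pseudocounts, then add weighted observations.
        freq = {a: pseudocount * background[a] for a in alphabet}
        for w, residue in zip(weights, column):
            freq[residue] += w
        norm = sum(freq.values())
        matrix.append(
            {a: math.log2(freq[a] / norm / background[a]) for a in alphabet}
        )
    return matrix
```

With the toy alignment `["AA", "AA", "AB"]`, the third sequence gets the largest weight because it is the only one carrying `B` in the second column.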

Profile HMMs
[Figure: protein profile H, a profile HMM with match states M1 … Mm, insert states I0 … Im, and delete states D1 … Dm between BEGIN and END]
Each M state has a position-specific pre-computed substitution table
Each I and D state has position-specific gap penalties
The profile is a generative model:
  The sequence X that is aligned to H is thought of as “generated by” H
  Therefore, H parameterizes a conditional distribution P(X | H)

Classification with Profile HMMs
“The statistical modeling approach to protein sequence analysis involves constructing a generative protein model, such as an HMM, for a protein family or superfamily. Sequences known to be members of the protein family are used as positive training examples, and the parameters of a statistical model representing the family are estimated using these training examples, in conjunction with general a priori information about properties of proteins. The model assigns a probability to any given protein sequence. If it is a good model for the family it is trained on, then sequences from that family, including sequences that were not used as training examples, yield a higher probability score than those outside the family. The probability score can thus be interpreted as a measure of the extent to which a new protein sequence is homologous to the protein family of interest.”
Jaakkola, Diekhans, Haussler, 1999.
[Figure: a new protein scored against several profile HMMs, one per family, within the hierarchy Fold, Superfamily, Family]

Classification with Profile HMMs
How generative models work
  Training examples (sequences known to be members of the family) are positive examples
  The model assigns a probability to any given protein sequence
  Sequences from the family yield a higher probability than sequences from outside the family
Log-likelihood ratio as score:
$$L(X) = \log \frac{P(X \mid H_1)\,P(H_1)}{P(X \mid H_0)\,P(H_0)} = \log \frac{P(H_1 \mid X)\,P(X)}{P(H_0 \mid X)\,P(X)} = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)}$$
This approach is perfectly reasonable if we assume that the models we have constructed, H0 (null) and H1 (family), are excellent and give a very accurate posterior probability of the model given a sequence. As is usually the case with oversimplifying models of biological sequences, this is not a very reasonable assumption.
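The log-likelihood ratio score can be computed directly from the two model log-likelihoods. The sketch below assumes that log P(X | H1) and log P(X | H0) are already available (for example from a forward pass) and that H0 and H1 are the only two hypotheses; the function names and the `prior_h1` parameter are illustrative.

```python
import math

def log_odds_score(log_p_x_h1, log_p_x_h0, prior_h1=0.5):
    """L(X) = log [P(X|H1) P(H1)] / [P(X|H0) P(H0)], in natural-log units."""
    prior_h0 = 1.0 - prior_h1
    return (log_p_x_h1 + math.log(prior_h1)) - (log_p_x_h0 + math.log(prior_h0))

def posterior_h1(score):
    """P(H1 | X) recovered from L(X) = log P(H1|X) / P(H0|X)."""
    return 1.0 / (1.0 + math.exp(-score))
```

A sequence is assigned to the family when `log_odds_score(...) > 0`, which corresponds to a posterior above 0.5. The caveat on the slide still applies: the posterior is only as trustworthy as the two models.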

Generation of a protein by a profile HMM
How do we compute P(X | H)?
To generate the sequence x1 … xn by profile HMM H, we sum the probabilities of all possible ways to generate X.
Define
  $A_j^M(i)$: probability of generating x1 … xi and ending with xi being emitted from Mj
  $A_j^I(i)$: probability of generating x1 … xi and ending with xi being emitted from Ij
  $A_j^D(i)$: probability of generating x1 … xi and ending in Dj (xi is the last character emitted before Dj)

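Alignment of a protein to a profile HMM
$$A_j^M(i) = \varepsilon_{M_j}(x_i)\,\bigl[\,A_{j-1}^M(i-1)\,\alpha_{M_{j-1}M_j} + A_{j-1}^I(i-1)\,\alpha_{I_{j-1}M_j} + A_{j-1}^D(i-1)\,\alpha_{D_{j-1}M_j}\,\bigr]$$
$$A_j^I(i) = \varepsilon_{I_j}(x_i)\,\bigl[\,A_j^M(i-1)\,\alpha_{M_jI_j} + A_j^I(i-1)\,\alpha_{I_jI_j} + A_j^D(i-1)\,\alpha_{D_jI_j}\,\bigr]$$
$$A_j^D(i) = A_{j-1}^M(i)\,\alpha_{M_{j-1}D_j} + A_{j-1}^I(i)\,\alpha_{I_{j-1}D_j} + A_{j-1}^D(i)\,\alpha_{D_{j-1}D_j}$$
Here ε is an emission probability and α is a transition probability. Because the A values are probabilities summed over all paths, each transition enters as a multiplicative factor, not as an added logarithm.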
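A sketch of the forward computation of P(X | H), using the quantities A^M, A^I, and A^D defined on the previous slide and working with plain probabilities (transitions multiply, they are not added as logs). The data layout is an assumption made for illustration: states are tuples such as `("M", 2)`, BEGIN is `("M", 0)`, END is `("M", m + 1)`, and missing transitions count as probability 0. A practical implementation would work in log space or rescale to avoid underflow on long sequences.

```python
def forward_profile_hmm(x, m, emit_M, emit_I, trans):
    """Return P(x | H) for a profile HMM with m match states.

    emit_M[j][c]: emission probability of c from M_j (j = 1..m; index 0 unused)
    emit_I[j][c]: emission probability of c from I_j (j = 0..m)
    trans[(src, dst)]: transition probability between state tuples
    """
    n = len(x)

    def a(src, dst):
        return trans.get((src, dst), 0.0)

    # AM[j][i], AI[j][i], AD[j][i] follow the slide's A_j^M(i), A_j^I(i), A_j^D(i).
    AM = [[0.0] * (n + 1) for _ in range(m + 1)]
    AI = [[0.0] * (n + 1) for _ in range(m + 1)]
    AD = [[0.0] * (n + 1) for _ in range(m + 1)]
    AM[0][0] = 1.0  # BEGIN with nothing emitted yet

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # x_i emitted by M_j
                AM[j][i] = emit_M[j].get(x[i - 1], 0.0) * (
                    AM[j - 1][i - 1] * a(("M", j - 1), ("M", j))
                    + AI[j - 1][i - 1] * a(("I", j - 1), ("M", j))
                    + AD[j - 1][i - 1] * a(("D", j - 1), ("M", j)))
            if i > 0:  # x_i emitted by I_j
                AI[j][i] = emit_I[j].get(x[i - 1], 0.0) * (
                    AM[j][i - 1] * a(("M", j), ("I", j))
                    + AI[j][i - 1] * a(("I", j), ("I", j))
                    + AD[j][i - 1] * a(("D", j), ("I", j)))
            if j > 0:  # silent delete state D_j, no character consumed
                AD[j][i] = (
                    AM[j - 1][i] * a(("M", j - 1), ("D", j))
                    + AI[j - 1][i] * a(("I", j - 1), ("D", j))
                    + AD[j - 1][i] * a(("D", j - 1), ("D", j)))

    end = ("M", m + 1)
    return (AM[m][n] * a(("M", m), end)
            + AI[m][n] * a(("I", m), end)
            + AD[m][n] * a(("D", m), end))
```

The loop order matters: delete states depend on column j - 1 at the same i, so j must increase inside each i.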

Generative Models

Discriminative Methods
Instead of modeling the process that generates the data, directly discriminate between classes
Give up trying to provide a likelihood function (a generative model) of the input data. Instead, focus on separating the different classes.
  A more direct way to the goal
  Better if the model is not accurate

Discriminative Models -- SVM
If x1 … xn are training examples, $\mathrm{sign}\bigl(\sum_i \lambda_i x_i^T x\bigr)$ “decides” where x falls
Train the $\lambda_i$ to achieve the best margin
Decision rule: red if $v^T x > 0$
Large margin for |v| < 1 ⇔ margin of 1 for small |v|
[Figure: separating hyperplane with normal vector v and the margin]

Discriminative protein classification
Jaakkola, Diekhans, Haussler, ISMB 1999
Define the discriminating function to be
$$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$$
We decide X ∈ family H1 whenever L(X) > 0
For now, let’s just assume K(.,.) is a similarity function
Then, we want to train the $\lambda_i$ so that this classifier makes as few mistakes as possible on new data
Similarly to SVMs, train the $\lambda_i$ so that the margin is largest, subject to $0 \le \lambda_i \le 1$

Discriminative protein classification
Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$, and $L(X_i) \le -1$ otherwise
This is not always possible; softer constraints are obtained by maximizing the following objective function
$$J(\lambda) = \sum_{X_i \in H_1} \lambda_i \bigl(2 - L(X_i)\bigr) + \sum_{X_j \in H_0} \lambda_j \bigl(2 + L(X_j)\bigr)$$
Training: for $X_i \in H_1$, try to “make” $L(X_i) = 1$:
$$\lambda_i \leftarrow \frac{1 - L(X_i) + \lambda_i K(X_i, X_i)}{K(X_i, X_i)}$$
with minimum allowable value 0 and maximum 1
Similarly, for $X_j \in H_0$, try to “make” $L(X_j) = -1$
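The training rule can be sketched as a sequence of sweeps over the training examples, updating one weight at a time and clipping it to [0, 1]. This is an illustrative sketch only: the function names, the fixed number of sweeps, and passing the kernel as a Python callable are assumptions, and the update for negatives is the mirror image of the positive update (it aims for L = -1).

```python
def discriminant(x, positives, negatives, lam_pos, lam_neg, kernel):
    """L(x): weighted similarity to positives minus weighted similarity to negatives."""
    score = sum(l * kernel(x, xi) for l, xi in zip(lam_pos, positives))
    score -= sum(l * kernel(x, xj) for l, xj in zip(lam_neg, negatives))
    return score

def train_weights(positives, negatives, kernel, sweeps=100):
    """Coordinate-wise updates of the weights, each clipped to [0, 1]."""
    lam_pos = [0.0] * len(positives)
    lam_neg = [0.0] * len(negatives)

    def clip(v):
        return min(1.0, max(0.0, v))

    for _ in range(sweeps):
        for i, xi in enumerate(positives):
            k = kernel(xi, xi)
            L = discriminant(xi, positives, negatives, lam_pos, lam_neg, kernel)
            lam_pos[i] = clip((1.0 - L + lam_pos[i] * k) / k)  # aim for L = +1
        for j, xj in enumerate(negatives):
            k = kernel(xj, xj)
            L = discriminant(xj, positives, negatives, lam_pos, lam_neg, kernel)
            lam_neg[j] = clip((1.0 + L + lam_neg[j] * k) / k)  # aim for L = -1
    return lam_pos, lam_neg
```

Any positive definite kernel can be plugged in, for instance a Gaussian on feature vectors. A new sequence X is then assigned to the family when `discriminant(X, ...) > 0`.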

The Fisher Kernel
The function K(X, Y) compares two sequences
  Acts effectively as an inner product in a (non-Euclidean) space
  Called a “kernel”
  Has to be positive definite: for any X1, …, Xn, the matrix K with $K_{ij} = K(X_i, X_j)$ is such that for any $x \in \mathbb{R}^n$, $x \ne 0$, $x^T K x > 0$
The choice of this function is important
Consider $P(X \mid H_1, \theta)$ and its sufficient statistics: the expected number of times X takes each transition/emission
[Figure: profile HMM with match states M1 … Mm, insert states I0 … Im, and delete states D1 … Dm between BEGIN and END]

The Fisher Kernel
Fisher score
$$U_X = \nabla_\theta \log P(X \mid H_1, \theta)$$
Quantifies how each parameter contributes to generating X
For two different sequences X and Y, we can compare $U_X$ and $U_Y$:
$$D_F^2(X, Y) = \frac{1}{2\sigma^2}\,\lvert U_X - U_Y \rvert^2$$
Given this distance function, K(X, Y) is defined as a similarity measure:
$$K(X, Y) = \exp\bigl(-D_F^2(X, Y)\bigr)$$
Set σ so that the average distance of training sequences $X_i \in H_1$ to sequences $X_j \in H_0$ is 1
Question: Is the partial derivative larger when X “uses” a given parameter $\theta_i$ more or less often?
Question: Is the partial derivative larger when a given parameter $\theta_i$ is larger or smaller?
[Figure: profile HMM with match states M1 … Mm, insert states I0 … Im, and delete states D1 … Dm between BEGIN and END]
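Given Fisher score vectors, the kernel itself is a few lines. The sketch below assumes the scores U_X have already been computed from the profile HMM and are passed in as plain lists of floats, and that the distance has the form D²(X, Y) = |U_X − U_Y|² / (2σ²). Reading "average distance is 1" as "the mean of D² over positive/negative training pairs is 1" is an interpretation, and the function names are hypothetical.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two Fisher score vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def calibrate_sigma(pos_scores, neg_scores):
    """Choose sigma so that the mean D^2 over positive/negative pairs equals 1."""
    pairs = [squared_distance(u, v) for u in pos_scores for v in neg_scores]
    mean_sq = sum(pairs) / len(pairs)
    return math.sqrt(mean_sq / 2.0)

def fisher_kernel(u_x, u_y, sigma):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2))."""
    return math.exp(-squared_distance(u_x, u_y) / (2.0 * sigma ** 2))
```

With this calibration a typical positive/negative pair has kernel value near exp(-1), while a sequence compared with itself always scores 1.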

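The Fisher Kernel
In summary, to distinguish between family H1 and (non-family) H0, define for profile H1:
  $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ (Fisher score)
  $D_F^2(X, Y) = \frac{1}{2\sigma^2}\,\lvert U_X - U_Y \rvert^2$ (distance)
  $K(X, Y) = \exp\bigl(-D_F^2(X, Y)\bigr)$ (akin to a dot product)
  $L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$
Iteratively adjust λ to optimize
$$J(\lambda) = \sum_{X_i \in H_1} \lambda_i \bigl(2 - L(X_i)\bigr) + \sum_{X_j \in H_0} \lambda_j \bigl(2 + L(X_j)\bigr)$$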

The Fisher Kernel
If a given superfamily has more than one profile model,
$$L_{\max}(X) = \max_i L_i(X) = \max_i \Bigl( \sum_{X_j \in H_i} \lambda_j K(X, X_j) - \sum_{X_j \in H_0} \lambda_j K(X, X_j) \Bigr)$$
[Figure: a superfamily containing several families, each with its own profile HMM]

Benchmarks
Methods evaluated
  BLAST (Altschul et al. 1990; Gish & States 1993)
  HMMs using the SAM-T98 methodology (Park et al. 1998; Karplus, Barrett, & Hughey 1998; Hughey & Krogh 1995, 1996)
  SVM-Fisher
Measurement of the recognition rate for members of superfamilies of SCOP (Hubbard et al. 1997)
  PDB90 eliminates redundant sequences
  Withhold all members of a given SCOP family
  Train with the remaining members of the SCOP superfamily
  Test with the withheld data
  Question: “Could the method discover a new family of a known superfamily?”
O. Jangmin

Other methods
WU-BLAST version 2.0a16 (Altschul & Gish 1996)
  The PDB90 database was queried with each positive training example, and E-values were recorded
  Two variants: BLAST:SCOP-only and BLAST:SCOP+SAM-T98-homologs
  Scores were combined by the maximum method
SAM-T98 method
  Same data and same set of models as in SVM-Fisher
  Combined with the maximum method

Results
Metric: the rate of false positives (RFP)
  RFP for a positive test sequence: the fraction of negative test sequences that score as good as or better than the positive sequence
Results for the G proteins family of the nucleotide triphosphate hydrolases SCOP superfamily
  Tests the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds
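The RFP metric is simple to compute once every test sequence has a score. A minimal sketch, assuming that a higher score means a better match (for E-values the comparison would be reversed); the function name is illustrative.

```python
def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negative test sequences scoring as good as or better than
    the given positive test sequence (higher score = better)."""
    as_good_or_better = sum(1 for s in negative_scores if s >= positive_score)
    return as_good_or_better / len(negative_scores)
```

An RFP of 0 means the positive sequence outranks every negative; in the benchmark above the denominator would be the 2439 negative sequences.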

Table 1. Rate of false positives for the G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAM-T98-homologs, S-T98 = SAM-T98, and SVM-F = SVM-Fisher method.

QUESTION
Running time of Fisher kernel SVM on query X?