Characterization of Secondary Structure of Proteins using Different Vocabularies
Madhavi K. Ganapathiraju, Language Technologies Institute
Advisors: Raj Reddy, Judith Klein-Seetharaman, Roni Rosenfeld
2nd Biological Language Modeling Workshop, Carnegie Mellon University, May

2 Presentation overview
– Classification of Protein Segments by their Secondary Structure types
– Document Processing Techniques
– Choice of Vocabulary in Protein Sequences
– Application of Latent Semantic Analysis
– Results
– Discussion

3 Secondary Structure of Protein
Sample Protein: MEPAPSAGAELQPPLFANASDAYPSACPSAGANASGPPGARSASSLALAIAITALYSAVCAVGLLGNVLVMFGIVRYTKMKTATNIYIFNLALADALATSTLPFQSA…

4 Application of Text Processing
Letters → Words → Sentences: letter counts in languages, word counts in documents
Residues → Secondary Structure → Proteins → Genomes
Can unigrams distinguish Secondary Structure Elements from one another?

5 Unigrams for Document Classification
– Word-Document matrix represents documents in terms of their word unigrams
– "Bag-of-words" model, since the position of words in the document is not taken into account
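A minimal Python sketch of building such a word-document count matrix; the toy documents and vocabulary below are only illustrative, not data from the talk.

```python
from collections import Counter

def word_document_matrix(documents, vocabulary):
    """Build a |vocabulary| x |documents| matrix of word unigram counts."""
    counts = [Counter(doc) for doc in documents]
    return [[c[word] for c in counts] for word in vocabulary]

# Word order inside a document is ignored: a "bag-of-words" model.
docs = [["the", "cat", "sat"], ["the", "dog"], ["cat", "cat", "dog"]]
vocab = ["the", "cat", "dog", "sat"]
print(word_document_matrix(docs, vocab))
# [[1, 1, 0], [1, 0, 2], [0, 1, 1], [1, 0, 0]]
```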

6 Word Document Matrix

7–11 Document Vectors (figure: the columns of the word-document matrix are the document vectors Doc-1, Doc-2, Doc-3, …, Doc-N)

12–14 Document Comparison
Documents can be compared to one another in terms of the dot product of their document vectors.
Formal modeling of documents is presented in the next few slides…
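As a small illustration of the comparison step (not code from the talk), document vectors taken as columns of the word-document matrix can be compared by their dot product:

```python
import numpy as np

# Columns of the word-document matrix are the document vectors.
W = np.array([[1, 1, 0],
              [1, 0, 2],
              [0, 1, 1],
              [1, 0, 0]], dtype=float)

def compare(i, j):
    """Similarity of documents i and j as the dot product of their column vectors."""
    return float(np.dot(W[:, i], W[:, j]))

print(compare(0, 2))  # documents sharing more words score higher
```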

15 Vector Space Model construction
Document vectors in the word-document matrix are normalized
– by word counts in the entire document collection
– by document lengths
This gives a Vector Space Model (VSM) of the set of documents. Equations for normalization follow…

16 Word count normalization
Each matrix entry is the word count in the document, divided by the document length and scaled by a weight that depends on the word's count in the corpus; t_i is the total number of times word i occurs in the corpus.
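The slide gives only the ingredients of the normalization (count divided by document length, scaled by a corpus-dependent word weight based on t_i), not the exact weighting function. The sketch below uses the entropy-based word weight from Bellegarda's latent semantic mapping (reference 3) as one plausible choice; that specific weight is an assumption, not taken from the slide.

```python
import numpy as np

def normalize(C):
    """Normalize a raw word-document count matrix C (rows: words, columns: documents).
    Each entry is divided by its document length and scaled by a word weight that
    depends on the word's distribution over the corpus.  The (1 - normalized entropy)
    weight follows Bellegarda's formulation; the slide only states that the weight
    depends on t_i, the total corpus count of word i."""
    C = np.asarray(C, dtype=float)
    doc_len = C.sum(axis=0)                        # document lengths
    t = C.sum(axis=1, keepdims=True)               # t_i for each word
    safe_t = np.where(t > 0, t, 1.0)
    p = C / safe_t                                 # distribution of word i over documents
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -plogp.sum(axis=1) / np.log(C.shape[1])
    weight = 1.0 - entropy                         # words concentrated in few documents get higher weight
    return weight[:, None] * C / np.where(doc_len > 0, doc_len, 1.0)
```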

17 Word-Document Matrix and Normalized Word-Document Matrix (figure)

18 Document vectors after normalisation...

19 Use of Vector Space Model
– A query document is also represented as a vector
– It is normalized by corpus word counts
– Documents related to the query document are identified by measuring similarity of document vectors to the query document vector

20 Application to Protein Secondary Structure Prediction

21 Protein Secondary Structure
DSSP (Dictionary of Protein Secondary Structure): annotation of each residue with its structure, based on hydrogen bonding patterns and geometrical constraints
7 DSSP labels for PSS:
– Helix types: H, G
– Strand types: B, E
– Coil types: S, I, T

22 Example
Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Key to DSSP labels: T, S, I, _: Coil; E, B: Strand; H, G: Helix
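The key can be written directly as a lookup table; a small sketch (not from the talk) mapping each DSSP label to its coarse class:

```python
# Key to DSSP labels from the slide: collapse the 7 DSSP labels (and '_')
# into the three coarse classes used throughout the talk.
DSSP_TO_CLASS = {
    "H": "Helix", "G": "Helix",
    "E": "Strand", "B": "Strand",
    "T": "Coil", "S": "Coil", "I": "Coil", "_": "Coil",
}

residues = "PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH"
dssp     = "____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT"
classes = [DSSP_TO_CLASS[label] for label in dssp]   # one coarse class per residue
```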

23 Reference Model
– Proteins are segmented into structural segments
– Normalized word-document matrix is constructed from the structural segments

24 Example
Residues: PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH
DSSP:     ____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT
Structural segments obtained from the given sequence: PKPPVKFN  RRIFLLNTQNVI  NG  YVKWAI  ND  VSL  ALPPTP  YLGAMKY  NLLH
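A short sketch of how such a segmentation can be computed from the DSSP string, by cutting the residue string wherever the coarse class changes; the boundary between the final helix and coil segments may differ by a residue from the hand segmentation shown on the slide.

```python
from itertools import groupby

DSSP_TO_CLASS = {"H": "Helix", "G": "Helix", "E": "Strand", "B": "Strand",
                 "T": "Coil", "S": "Coil", "I": "Coil", "_": "Coil"}

def segment_by_structure(residues, dssp):
    """Cut the residue string into maximal runs sharing one coarse structure class."""
    segments, start = [], 0
    for cls, run in groupby(dssp, key=DSSP_TO_CLASS.get):
        length = sum(1 for _ in run)
        segments.append((residues[start:start + length], cls))
        start += length
    return segments

residues = "PKPPVKFNRRIFLLNTQNVINGYVKWAINDVSLALPPTPYLGAMKYNLLH"
dssp     = "____SS_SEEEEEEEEEEEETTEEEEEETTEEE___SS_HHHHHHTT_TT"
for seg, cls in segment_by_structure(residues, dssp):
    print(f"{cls:6s} {seg}")
```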

25 Example
Unigrams in the structural segments (figure), for the structural segments obtained from the given sequence: PKPPVKFN  RRIFLLNTQNVI  NG  YVKWAI  ND  VSL  ALPPTP  YLGAMKY  NLLH

26–27 Amino-acid Structural-Segment Matrix (figure: amino acids × structural segments); similar to the Word-Document Matrix

28–29 Document Vectors, Word Vectors, and a Query Vector (figure)

30 Data Set used for PSSP
JPred data:
– 513 protein sequences in all
– <25% homology between sequences
– Residues & corresponding DSSP annotations are given
We used 50 sequences for model construction (training) and 30 sequences for testing

31 Classification
Proteins from the test set are segmented into structural elements, called "query segments", and segment vectors are constructed.
For each query segment:
– the 'n' most similar reference segment vectors are retrieved
– the query segment is assigned the same structure as that of the majority of the retrieved segments (k-nearest neighbour classification)

32 Structure type assignment to Query Vector
(Figure; key: Helix, Strand, Coil.) The query vector is compared for similarity against the reference model and the 3 most similar reference vectors are retrieved. Majority voting over these 3 most similar reference vectors gives Coil, hence the structure type assigned to the query vector is Coil.
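A sketch of the assignment step as k-nearest-neighbour voting; cosine similarity is used here as the similarity measure, and the toy reference segments and labels are only illustrative.

```python
import numpy as np
from collections import Counter

def assign_structure(query_vec, reference_vecs, reference_labels, n=3):
    """Retrieve the n reference segment vectors most similar to the query vector
    and return the majority structure type (k-nearest-neighbour classification)."""
    R = np.asarray(reference_vecs, dtype=float)      # one reference segment per row
    q = np.asarray(query_vec, dtype=float)
    sims = R @ q / (np.linalg.norm(R, axis=1) * np.linalg.norm(q) + 1e-12)
    nearest = np.argsort(sims)[::-1][:n]             # indices of the n most similar
    votes = Counter(reference_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: five reference segment vectors with known structure types.
refs = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [0, 1, 2], [0, 0, 3]]
labels = ["Helix", "Helix", "Strand", "Coil", "Coil"]
print(assign_structure([0, 1, 3], refs, labels, n=3))  # -> Coil
```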

33 Choice of Vocabulary in Protein Sequences
Amino acids, but amino acids are
– not all distinct
– similar primarily due to chemical composition
So,
– represent protein segments in terms of "types" of amino acids
– represent in terms of "chemical composition"

34 Representation in terms of "types" of AA
Classify based on Electronic Properties:
– e- donors: D, E, A, P
– weak e- donors: I, L, V
– ambivalent: G, H, S, W
– weak e- acceptors: T, M, F, Q, Y
– e- acceptors: K, R, N
– C (by itself, another group)
Use Chemical Groups
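The electronic-property grouping above can be expressed as a simple lookup table; this is an illustrative sketch, and the group names are informal labels for the slide's six classes.

```python
# Electronic-property groups of the 20 amino acids, as listed on the slide.
GROUPS = {
    "e-donor":         "DEAP",
    "weak e-donor":    "ILV",
    "ambivalent":      "GHSW",
    "weak e-acceptor": "TMFQY",
    "e-acceptor":      "KRN",
    "C":               "C",     # cysteine forms its own group
}
AA_TO_GROUP = {aa: group for group, residues in GROUPS.items() for aa in residues}

def as_types(segment):
    """Re-express a protein segment in the 6-symbol 'types of amino acids' vocabulary."""
    return [AA_TO_GROUP[aa] for aa in segment]

print(as_types("PKPPVKFN"))
```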

35 Representation using Chemical Groups

36 Results of Classification with "AA" as words (figure: leave-one-out testing of reference vectors vs. unseen query segments)

37 Results with "chemical groups" as words
Build VSM using both reference segments and test segments
– Structure labels of reference segments are known
– Structure labels of query segments are unknown

38 Modification to Word-Document matrix: Latent Semantic Analysis
The word-document matrix is transformed by "Singular Value Decomposition"
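A minimal sketch of the SVD step with NumPy, keeping the top-k singular directions; the value of k used in the talk is not stated here.

```python
import numpy as np

def lsa(W, k):
    """Latent Semantic Analysis: decompose the normalized word-document matrix
    W = U S V^T and keep only the top-k singular values/vectors.  Documents are
    then compared in the k-dimensional latent space rather than raw word space."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_coords = (np.diag(s_k) @ Vt_k).T     # one k-dimensional row per document
    return U_k, s_k, doc_coords
```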

40 Results with “AA” as words, using LSA

41 Results with “types of AA” as words using LSA

42 Results with “chemical groups” as words using LSA

43 LSA results for Different Vocabularies (figure panels: Amino acids LSA, Types of Amino acids LSA, Chemical Groups LSA)

44 Model construction using all data
Matrix models are constructed using both reference and query documents together. This gives better models, both for normalization and for construction of the latent semantic model. (Figure panels: Amino Acids, Chemical Groups, Amino Acid Types)

45 Applications
– Complement other methods for protein structure prediction (segmentation approaches)
– Protein classification as all-alpha, all-beta, alpha+beta or alpha/beta types
– Automatically assigning new proteins into SCOP families

46 References
1. Kabsch, W. and Sander, C., "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features", Biopolymers, 1983.
2. Dwyer, D.S., "Electronic properties of the amino acid side chains contribute to the structural preferences in protein folding", J Biomol Struct Dyn, (6).
3. Bellegarda, J., "Exploiting Latent Semantic Information in Statistical Language Modeling", Proceedings of the IEEE, Vol. 88(8), 2000.

Thank you!

48 Use of SVD
Representation of training and test segments is very similar to that in the VSM. Structure type assignment goes through the same process, except that it is done with the LSA matrices.
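A sketch of how a query segment vector can be mapped into the same LSA space before comparison; the fold-in formula q_hat = S_k^{-1} U_k^T q is assumed from common LSA practice and is not given on the slide.

```python
import numpy as np

def fold_in(query_vec, U_k, s_k):
    """Project a normalized query segment vector into the k-dimensional LSA space
    (q_hat = S_k^{-1} U_k^T q), so it can be compared with the reference document
    coordinates using the same nearest-neighbour voting as before."""
    q = np.asarray(query_vec, dtype=float)
    return (U_k.T @ q) / s_k
```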

49 Classification of Query Document
– A query document is also represented as a vector
– It is normalized by corpus word counts
– Documents related to the query are identified by measuring similarity of document vectors to the query document vector
– The query document is assigned the same structure as the documents retrieved by the similarity measure, using majority voting (k-nearest neighbour classification)

50 Notes
– Results described are per-segment
– The normalized word-document matrix does not preserve document lengths, hence "per residue" accuracies of structure assignments cannot be computed