Segmenting G-Protein Coupled Receptors using Language Models
Betty Yee Man Cheng
Language Technologies Institute, CMU
Advisors: Judith Klein-Seetharaman, Jaime Carbonell

The Segmentation Problem
- Segment a protein sequence according to its secondary structure
- Related to secondary structure prediction
- Often viewed as a classification problem
- Best performance so far is 78%
- A large portion of the problem lies with the boundary cases

Limited Domain: GPCRs
- G-Protein Coupled Receptors, one of the largest superfamilies of proteins known
- 2955 sequences and 1654 fragments found so far
- Transmembrane proteins
- Play a central role in many diseases
- Only 1 GPCR has been crystallized

Distinguishing Characteristic of GPCRs
- The order of the segments is known
- Segment types: N-terminus, helix, intracellular loop, extracellular loop, C-terminus

Methodology: Topicality Measures
- Based on "Statistical Models for Text Segmentation" by D. Beeferman, A. Berger, and J. Lafferty [1]
- Topicality measures are log-ratios of 2 different models
  - Short-range model versus long-range model in topic segmentation of text
  - Models of different segments in proteins
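In code, a topicality measure of this kind reduces to a per-position log-ratio. Below is a minimal sketch, assuming each trained segment model exposes a hypothetical prob(context, residue) method returning P(residue | context); the interface and function name are illustrative, not the authors' implementation:

```python
import math

def topicality(seq, model_a, model_b, order=3):
    """Per-position log-ratio of two segment models over a protein sequence.

    A positive score at position i suggests model_a explains residue i
    better than model_b; segment boundaries show up where the sign flips.
    """
    scores = []
    for i in range(len(seq)):
        context = seq[max(0, i - order + 1):i]  # up to order-1 preceding residues
        scores.append(math.log(model_a.prob(context, seq[i]))
                      - math.log(model_b.prob(context, seq[i])))
    return scores
```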

Short-Range Model vs. Long-Range Model

Problem - Not Enough Data!

Family Name                    Number of Proteins
Class A                        1081
Class B                        83
Class C                        28
Class D                        11
Class E                        4
Class F                        45
Drosophila Odorant Receptors   31
Nematode Chemoreceptors        1
Ocular Albinism Proteins       2
Orphan A                       35
Orphan B                       2
Plant Mlo Receptors            10

- Total of 1333 proteins
- Over 90% are shorter than 750 amino acids
- Average sequence length is 441 amino acids
- Average segment length is 25 amino acids

3 Topicality Models in GPCRs
- Previous segmentation experiments with mutual information and Yule's measures have shown a similarity between:
  - All helices
  - All intracellular loops and the C-terminus
  - All extracellular loops and the N-terminus
- No two helices or loops occur consecutively
- Thus 3 models instead of 15, trained across all families of GPCRs

Model of a Segment
- Each model is an interpolated model of 6 basic probability models:
  - Unigram model (20 amino acids)
  - Bi-gram model (20 amino acids)
  - Tri-gram model (20 amino acids)
  - 3 tri-gram models on reduced alphabets of 11, 3, and 2 groups:
    - LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P
    - LVIMFYAGCW, KREDH, STNQP
    - LVIMFYAGCW, KREDHSTNQP
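A sketch of how such an interpolated segment model might be assembled. The n-gram models here are plain maximum-likelihood estimates without the smoothing a real implementation would need, and all class and function names are illustrative assumptions, not the authors' code:

```python
from collections import Counter

# Reduced alphabets from the slide: each string is one residue group.
ALPHABET_11 = ["LVIM", "FY", "KR", "ED", "AG", "ST", "NQ", "W", "C", "H", "P"]
ALPHABET_3  = ["LVIMFYAGCW", "KREDH", "STNQP"]
ALPHABET_2  = ["LVIMFYAGCW", "KREDHSTNQP"]

def make_mapper(groups):
    """Map each amino acid to the first letter of its group."""
    table = {aa: group[0] for group in groups for aa in group}
    return lambda seq: "".join(table[aa] for aa in seq)

class NGramModel:
    """Maximum-likelihood n-gram model over (possibly reduced) sequences."""
    def __init__(self, n, mapper=None):
        self.n, self.mapper = n, mapper or (lambda s: s)
        self.ngrams, self.contexts = Counter(), Counter()

    def train(self, sequences):
        for seq in sequences:
            seq = self.mapper(seq)
            for i in range(self.n - 1, len(seq)):
                ctx = seq[i - self.n + 1:i]
                self.ngrams[ctx + seq[i]] += 1
                self.contexts[ctx] += 1

    def prob(self, context, residue):
        mapped = self.mapper(context + residue)
        ctx, r = mapped[-self.n:-1], mapped[-1]
        if self.contexts[ctx] == 0:
            return 1e-10  # crude floor for unseen contexts
        return self.ngrams[ctx + r] / self.contexts[ctx]

class InterpolatedModel:
    """Linear interpolation of the six basic models."""
    def __init__(self, models, weights):
        self.models, self.weights = models, weights

    def prob(self, context, residue):
        return sum(w * m.prob(context, residue)
                   for m, w in zip(self.models, self.weights))
```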

Why Use Reduced Alphabets?
Figure 1. Snake-like diagram of the human β2 adrenergic receptor.

Interpolation Oddity
- Weights were trained so that the sum of the probabilities assigned to the amino acid at each position in the training data is maximized
- First attempt: all weight went to the tri-gram model with the smallest reduced alphabet
- Reason: a smaller vocabulary size causes the probability mass to be less spread out
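The slides do not say how the weights were optimized; a common recipe is EM on held-out data, sketched below under the same hypothetical model interface. (Note also that if one literally maximizes the summed probability, the objective is linear in the weights and is always maximized at a vertex of the weight simplex, i.e., by putting all weight on a single model, which is consistent with the degenerate solutions the slides describe.)

```python
def train_weights_em(models, heldout, iterations=20):
    """Fit interpolation weights by EM on held-out (context, residue) events.

    `models` follow the same hypothetical .prob(context, residue) interface
    as above. Returns one weight per model, summing to 1.
    """
    k = len(models)
    weights = [1.0 / k] * k
    events = [(seq[:i], seq[i]) for seq in heldout for i in range(len(seq))]
    for _ in range(iterations):
        expected = [0.0] * k
        for ctx, r in events:
            parts = [w * m.prob(ctx, r) for w, m in zip(weights, models)]
            total = sum(parts) or 1e-12
            for j in range(k):
                expected[j] += parts[j] / total  # posterior responsibility
        weights = [e / len(events) for e in expected]
    return weights
```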

Interpolation Oddity, Take 2
- Normalize the probabilities from the reduced-alphabet models by group size
- E.g. for the alphabet LVIM, FY, KR, ED, AG, ST, NQ, W, C, H, P:
  P(L | history) = P(LVIM | history) / 4
  P(F | history) = P(FY | history) / 2
- Result: all of the weight went to the tri-gram model with the normal 20 amino acid alphabet
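A sketch of that normalization on top of the NGramModel above (illustrative names again; dividing a group's probability by the group size turns it into a per-residue probability under a uniform-within-group assumption):

```python
class NormalizedReducedModel(NGramModel):
    """Reduced-alphabet n-gram model renormalized over the 20-letter alphabet."""
    def __init__(self, n, groups):
        super().__init__(n, make_mapper(groups))
        self.group_size = {aa: len(g) for g in groups for aa in g}

    def prob(self, context, residue):
        # E.g. P(L | history) = P(LVIM | history) / 4 for the 11-group alphabet.
        return super().prob(context, residue) / self.group_size[residue]
```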

An Example: D3DR_RAT
- Class A dopamine receptor
Figure 3 - Graph of the log probability of the amino acid at each position in the D3DR_RAT sequence from the 3 segment models. The 3 segment models fluctuate frequently in their performance, making it difficult to detect which model is doing best and where the boundaries should be drawn.

Figure 4 - Enlargement of the graph in Figure 3 for the amino acid positions shown. The true segment boundaries are marked with dotted vertical lines; segments from left to right: N-Terminus, Helix, Intracellular, Helix.

Running Averages & Look-Ahead
Figure 5 - Graph of running averages of log probabilities of each amino acid between positions 0 and 100 in the D3DR_RAT sequence, with predicted and true boundaries marked. Running averages were computed using a window-size of ±2 and boundaries were predicted using a look-ahead of 5. The predicted boundaries are indicated by dotted vertical lines at positions 38, 53, 65 and 88, while the true boundaries are indicated by dashed vertical lines at positions 32, 55, 66 and 92. Segments from left to right: N-Terminus, Helix, Intracellular, Helix.
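A sketch of the smoothing and boundary-placing steps under the stated settings (window ±2, look-ahead 5). The decision rule here, placing a boundary where a different segment model becomes best and stays best for the whole look-ahead interval, is my reading of the slides, not a quoted algorithm:

```python
def running_average(scores, window=2):
    """Average each position's log probability over positions i-window..i+window."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def predict_boundaries(per_model_scores, lookahead=5):
    """per_model_scores: {model_name: smoothed log-prob per position}.

    A boundary is placed where the best-scoring model changes and the new
    model stays best for `lookahead` consecutive positions.
    """
    names = list(per_model_scores)
    length = len(next(iter(per_model_scores.values())))
    best = [max(names, key=lambda m: per_model_scores[m][i]) for i in range(length)]
    boundaries, current = [], best[0]
    for i in range(1, length - lookahead):
        window = best[i:i + lookahead]
        if window[0] != current and all(b == window[0] for b in window):
            boundaries.append(i)
            current = window[0]
    return boundaries
```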

Predicted Boundaries for D3DR_RAT
- Window-size: ±2 from the current amino acid
- Look-ahead interval: 5 amino acids
Table - Predicted boundaries vs. synthetic true boundaries.

The Only Truth: OPSD_HUMAN
- The only GPCR that has been crystallized so far
- Average offset for this protein is … a.a.
Table - Predicted boundaries vs. true boundaries.

Evaluation Metrics
- Accuracy:
  - Score 1 – perfect match
  - Score 0.5 – offset of ±1
  - Score 0.25 – offset of ±2
  - Score 0 otherwise
- Offset – absolute difference between the predicted and true boundary positions
- 10-fold cross validation
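The metric in code, as a minimal sketch assuming predicted and true boundaries can be paired one-to-one (which the fixed segment order in GPCRs makes straightforward):

```python
def boundary_score(predicted, true):
    """Average accuracy score over paired predicted/true boundary positions."""
    def score_one(p, t):
        offset = abs(p - t)
        if offset == 0:
            return 1.0
        if offset == 1:
            return 0.5
        if offset == 2:
            return 0.25
        return 0.0
    return sum(score_one(p, t) for p, t in zip(predicted, true)) / len(true)
```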

Results: Trained Interpolated Models
Figure 6 - Results of our approach using trained interpolation weights. Window-size: ±2; look-ahead interval: 5.

Distribution of Offset between Predicted and Synthetic True Boundary

Removing the 10% of proteins with the worst average offset causes the average offset for the dataset to drop to …

Results: Using All Probability Models
Figure 7 - Results of our approach using pre-set model weights in the interpolation: 0.1 for the unigram and bi-gram models, 0.2 for each of the tri-gram models. Running averages were computed over a window-size of ±5 and a look-ahead interval of 4 was used.

Results: Using Only Tri-gram Models
Figure 8 - Results of our approach using pre-set model weights in the interpolation: 0.25 for each of the tri-gram models. Window-size of ±4 and a look-ahead interval of 4.

Conclusions
- Average accuracy of ~…, an offset of ±2 on average
- But average offsets are much higher
- Missing a boundary has detrimental effects on the prediction of the remaining boundaries in the sequence, especially with a small segment
- Large offsets occur with a small number of proteins

Future Work
- Cue words: unigrams, bi-grams, tri-grams, and 4-grams in a window of ±25 amino acids from the boundary (see the sketch below)
- Long-range contact: distribution tables of how likely 2 amino acids are to be in long-range contact with each other
- Evaluation: how much homology is needed between training and testing data
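For the cue-word idea, a sketch of the kind of feature extraction it implies (a hypothetical helper, not part of the presented work: it counts n-grams near known boundaries so that boundary-indicative "cue" n-grams can be identified):

```python
from collections import Counter

def cue_ngram_counts(sequences, boundary_lists, n=2, window=25):
    """Count n-grams within ±window residues of known segment boundaries."""
    counts = Counter()
    for seq, bounds in zip(sequences, boundary_lists):
        for b in bounds:
            lo, hi = max(0, b - window), min(len(seq), b + window)
            region = seq[lo:hi]
            for i in range(len(region) - n + 1):
                counts[region[i:i + n]] += 1
    return counts
```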

References
1. Doug Beeferman, Adam Berger, and John Lafferty. "Statistical Models for Text Segmentation." Machine Learning, special issue on Natural Language Learning, C. Cardie and R. Mooney, eds., 34(1-3), pp. 177-210, 1999.
2. F. Campagne, J.M. Bernassau, and B. Maigret. Viseur program (Release 2.35). Copyright 1994, 1995, 1996 Fabien Campagne. All Rights Reserved.