Published by Zoe Thornton. Modified over 8 years ago.
1
A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification
Christopher Hemmerich
Advisor: Dr. Sun Kim
2
Outline
Background & Motivation
Data
Methods
Experiments
Conclusions
Acknowledgements
Bibliography
3
Background & Motivation
Prior work has studied correlation between positions in protein families (Cline et al., 2002; Martin et al., 2005).
These studies used multiple sequence alignments (MSAs) to detect correlation and to link it to protein structure and residue co-evolution.
Correlation across an MSA has been tied to co-evolution and to contact points in the protein structure.
Less work has addressed correlation between residues within a single sequence, where the significance is less clear.
4
Background & Motivation
“protein sequences can be regarded as slightly edited random strings” (Weiss et al., 2000)
Can we detect increased correlation in protein sequences relative to random sequences?
Is there correlation between distant residues?
Is correlation characteristic of the protein structure?
Can we measure correlation for hydropathy or other interactions that are not residue-specific?
5
The Protein Families Database (Pfam)
We use the Pfam-A subset, which consists of around 8000 curated families.
Pfam-A families vary widely in sequence length and in number of sequences.
Pfam-A provides multiple sequence alignments for its families.
We limit experiments to sequences containing 100 or more residues to reduce sampling effects.
6
Methods: Measuring Correlation
Let's look towards information theory.
How can we predict the next residue?
7
Methods: Measuring Correlation
Let's look towards information theory.
How can we predict the next residue?
Pick the most frequent residue.
We feel more certain about our guess with the second sequence, as it seems less random.
8
Methods: Measuring Correlation
We can quantify the uncertainty in a sequence with Shannon entropy.
Entropy is maximal when P_i is uniform for all i.
Entropy is 0 when P_i = 1 for some i.
The lower the entropy, the better our prediction should be.
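The entropy formula on this slide did not survive extraction. As a minimal sketch, assuming the standard definition H = -Σ P_i log2(P_i) measured in bits and using residue frequencies as the probabilities P_i, the two boundary cases above can be checked directly. The function name `shannon_entropy` is illustrative, not taken from the presentation.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy of a sequence in bits.

    Assumption: P_i is estimated as the observed frequency of residue i,
    and the logarithm is base 2.
    """
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# One residue with P_i = 1 gives zero entropy; a uniform
# distribution over four residues gives log2(4) = 2 bits.
low = shannon_entropy("AAAA")
high = shannon_entropy("ACDE")
```

A sequence dominated by one residue scores close to 0, which matches the intuition that guessing the most frequent residue works better on less random sequences.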
9
Should we guess 'N'?
Is there a correlation between 'V' and 'K'? Between 'N' and 'N'?
We can measure the correlation in a sequence with mutual information (MI).
Substitute observed frequencies for probabilities.
10
Mutual Information Example
Sequence: AANANK
13
Mutual Information Example
Sequence: AANANK
MI(AANANK) = MI(JJCJCL)
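The worked calculation for this example was shown as images that are not in the transcript. The sketch below assumes the usual definition MI = Σ P_ij log2(P_ij / (P_i P_j)) over adjacent residue pairs. It takes P_ij from pair frequencies and P_i, P_j from the first and second members of those pairs. That choice of marginals is an assumption, not something the slides confirm. The name `neighbor_mi` is illustrative.

```python
import math
from collections import Counter


def neighbor_mi(sequence):
    """Mutual information (bits) between adjacent residues.

    Assumption: frequencies stand in for probabilities, and the
    marginals are computed from the same set of adjacent pairs.
    """
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    total = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / total
        p_a = first_counts[a] / total
        p_b = second_counts[b] / total
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Relabeling A->J, N->C, K->L keeps every pair frequency the same,
# so the two scores agree.
mi_original = neighbor_mi("AANANK")
mi_relabeled = neighbor_mi("JJCJCL")
```

The equality on the slide follows because MI depends only on the pattern of pair frequencies, not on which letters carry that pattern.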
14
Experiment: Measuring Correlation
Sample 100 sequences from Pfam.
Shuffle each sequence 100 times using the shuffle command from the HMMER package, which preserves the length and residue frequencies of the sequence while randomly re-ordering its residues.
Compare the MI score of each sequence to the MI scores of its shuffles.
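A rough sketch of this comparison follows. It is not the presenter's pipeline. Python's `random.Random.shuffle` stands in for the HMMER shuffle command, since both preserve length and residue composition. Summarizing the comparison as a z-score against the shuffled scores is an assumption, because the slides do not say how the scores were compared. All function names are illustrative.

```python
import math
import random
import statistics
from collections import Counter


def neighbor_mi(sequence):
    """MI (bits) between adjacent residues, frequencies as probabilities."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    total = len(pairs)
    pair_counts = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum(
        (n / total) * math.log2((n / total) / ((first[a] / total) * (second[b] / total)))
        for (a, b), n in pair_counts.items()
    )


def shuffled_mi_scores(sequence, n_shuffles=100, seed=0):
    """MI scores of random re-orderings of the same residues."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # stand-in for HMMER's shuffle command
        scores.append(neighbor_mi("".join(residues)))
    return scores


def compare_to_shuffles(sequence, n_shuffles=100, seed=0):
    """Return (observed MI, mean shuffled MI, z-score of observed)."""
    observed = neighbor_mi(sequence)
    scores = shuffled_mi_scores(sequence, n_shuffles, seed)
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)
    z = (observed - mean) / spread if spread > 0 else 0.0
    return observed, mean, z
```

Note that shuffled sequences still have nonzero MI on average because of finite-sample effects, which is why the observed score is compared against shuffles rather than against zero.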
15
Results: Correlation
16
Results: Normalized Correlation
17
Methods: Correlation Classification
Nearest Neighbor classification algorithm
Plot each N-dimensional vector as a point in space.
(Figure: 3 training classes)
18
Methods: Correlation Classification
Nearest Neighbor classification algorithm
Plot each N-dimensional vector as a point in space.
(Figure: 3 training classes and a test vector)
19
Methods: Correlation Classification
Measure the distance from the new point to each existing point.
Assign the family of the nearest training point to the test vector.
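The two steps above can be sketched as a 1-nearest-neighbor classifier. The slides do not name the distance measure, so Euclidean distance is assumed here. The function name and the `(vector, family)` data layout are illustrative.

```python
import math


def nearest_neighbor(training, test_vector):
    """Return the family label of the training vector closest to test_vector.

    training: list of (vector, family) pairs.
    Assumption: Euclidean distance (the slides do not specify a metric).
    """
    if not training:
        raise ValueError("training set is empty")
    best_family = None
    best_distance = math.inf
    for vector, family in training:
        distance = math.dist(vector, test_vector)
        if distance < best_distance:
            best_family, best_distance = family, distance
    return best_family


# Three training classes, one point each, and a test vector.
training = [((0.0, 0.0), "A"), ((5.0, 5.0), "B"), ((10.0, 0.0), "C")]
label = nearest_neighbor(training, (4.0, 6.0))
```

Unlike the BLAST-based method, this always returns a label, since some training point is always nearest.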
20
Methods: NCBI BLAST Classification
Build a BLAST database from the training sequences with formatdb.
BLAST the test sequence against the database with default parameters.
Classify the test sequence according to its highest-scoring match (high-scoring segment pair, HSP).
If no match is found, classification fails.
21
Methods: Experimental Method
Randomly select 10 families from the Pfam database.
Evaluate the classification techniques on each possible combination of 3 families from the 10.
Sum the results of all sub-experiments.
Accuracy = (# of correct classifications) / (# of classification attempts)
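Choosing 3 families out of 10 gives C(10, 3) = 120 sub-experiments. The sketch below shows one way to enumerate them and pool the counts into a single accuracy figure. The callback `classify_trio` is hypothetical: it stands for whatever procedure evaluates one combination of three families and returns a pair `(correct, attempts)`.

```python
from itertools import combinations


def run_experiment(families, classify_trio):
    """Sum results over every 3-family combination.

    classify_trio is a hypothetical callback taking a tuple of three
    families and returning (correct, attempts) for that sub-experiment.
    Returns (accuracy, number_of_sub_experiments).
    """
    correct = 0
    attempts = 0
    sub_experiments = 0
    for trio in combinations(families, 3):
        c, a = classify_trio(trio)
        correct += c
        attempts += a
        sub_experiments += 1
    accuracy = correct / attempts if attempts else 0.0
    return accuracy, sub_experiments


# Placeholder callback: pretend each sub-experiment gets 2 of 3 right.
accuracy, n = run_experiment(list(range(10)), lambda trio: (2, 3))
```

Because counts are summed before dividing, sub-experiments with more test sequences weigh more heavily in the final accuracy.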
22
Methods: Leave-one-out Validation
Comprehensive Validation
23
Results: Neighbor Correlation
24
Experiment: Long Range Correlation
Extend the correlation measure beyond neighboring residues.
gap: the number of residues between the two residues being compared.
We consider pairings of all residues within 20 positions of each other.
MI Vector = [MI(0), MI(1), …, MI(19)]
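A sketch of the 20-element vector follows. It reads gap = 0 as adjacent residues, so a gap of g pairs positions i and i + g + 1. The MI estimate takes its marginals from the pairs at each gap, which is an assumption. The function names are illustrative.

```python
import math
from collections import Counter


def mutual_information(sequence, gap=0):
    """MI (bits) between residues with `gap` residues between them."""
    step = gap + 1  # gap 0 means neighboring residues
    pairs = list(zip(sequence, sequence[step:]))
    if not pairs:
        return 0.0  # sequence too short for this gap
    total = len(pairs)
    pair_counts = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / total
        mi += p_ab * math.log2(p_ab / ((first[a] / total) * (second[b] / total)))
    return mi


def mi_vector(sequence, max_gap=20):
    """[MI(0), MI(1), ..., MI(max_gap - 1)] for one sequence."""
    return [mutual_information(sequence, gap) for gap in range(max_gap)]


vector = mi_vector("AANANK")
```

Larger gaps leave fewer pairs to count, so the later entries are the most exposed to sampling effects, which is one reason for restricting the experiments to sequences of 100 or more residues.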
25
Results: 20D-Correlation Vector
26
Experiment: Physical Properties
Not all intra-protein interactions are residue-specific.
Cline et al. (2002) explore the information attributed to hydropathy, charge, disulfide bonding, and burial.
Hydropathy was found to carry half as much information as the 20-element amino acid alphabet, and its 2-element alphabet is more resistant to finite-sample-size effects.
27
Hydropathy Alphabet
Hydrophobic: C, I, M, F, W, Y, V, L
Hydrophilic: R, N, D, E, Q, H, K, S, T, P, A, G
This partitioning is from Weiss et al. (2000).
Every residue in a sequence is converted to a '+' or '-'.
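The conversion can be sketched as a simple lookup. The slide does not say which class maps to which symbol, so '+' for hydrophobic and '-' for hydrophilic is an assumption. Raising an error on residues outside the 20 listed letters is also a choice made for this sketch.

```python
HYDROPHOBIC = set("CIMFWYVL")
HYDROPHILIC = set("RNDEQHKSTPAG")


def to_hydropathy(sequence):
    """Map each residue to '+' (hydrophobic) or '-' (hydrophilic).

    Assumption: the '+'/'-' assignment is illustrative; the slide only
    says each residue becomes one of the two symbols.
    """
    symbols = []
    for residue in sequence:
        if residue in HYDROPHOBIC:
            symbols.append("+")
        elif residue in HYDROPHILIC:
            symbols.append("-")
        else:
            raise ValueError(f"residue not in the 20-letter alphabet: {residue!r}")
    return "".join(symbols)


converted = to_hydropathy("MKVLAAG")
```

The converted string can then be scored with the same mutual information measure used for residues, now over a 2-letter alphabet.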
28
Results: Hydropathy Correlation
29
Experiment: Combined Vectors
Combine the residue and hydropathy correlation vectors.
This gives a single 40-dimensional vector per sequence.
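One plausible reading of "combine" is plain concatenation: 20 residue MI values followed by 20 hydropathy MI values. The sketch below assumes that ordering, gaps 0 through 19 for both halves, and '+' for hydrophobic residues. None of these details are confirmed by the slides, and the function names are illustrative.

```python
import math
from collections import Counter

HYDROPHOBIC = set("CIMFWYVL")
HYDROPHILIC = set("RNDEQHKSTPAG")


def mutual_information(symbols, gap):
    """MI (bits) between symbols with `gap` symbols between them."""
    pairs = list(zip(symbols, symbols[gap + 1:]))
    if not pairs:
        return 0.0
    total = len(pairs)
    pair_counts = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum(
        (n / total) * math.log2((n / total) / ((first[a] / total) * (second[b] / total)))
        for (a, b), n in pair_counts.items()
    )


def to_hydropathy(sequence):
    """'+' for hydrophobic, '-' for hydrophilic (assignment assumed)."""
    unknown = set(sequence) - HYDROPHOBIC - HYDROPHILIC
    if unknown:
        raise ValueError(f"unexpected residues: {sorted(unknown)}")
    return "".join("+" if r in HYDROPHOBIC else "-" for r in sequence)


def combined_vector(sequence, max_gap=20):
    """Residue MI vector followed by hydropathy MI vector (2 * max_gap values)."""
    hydro = to_hydropathy(sequence)
    residue_part = [mutual_information(sequence, g) for g in range(max_gap)]
    hydro_part = [mutual_information(hydro, g) for g in range(max_gap)]
    return residue_part + hydro_part


features = combined_vector("MKVLAAGIVGLLLAQPSWAEDKRNCFYWHT")
```

The two halves are on different scales, since a 2-letter alphabet caps MI at 1 bit, so a distance-based classifier may weight them unevenly unless they are normalized.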
30
Results: Combined Vectors
31
Conclusions
Correlation was strong enough to build sequence classifiers without using the sequences directly.
There is significant long-range correlation between residues in protein sequences.
Correlation exists in terms of both residues and physical properties.
32
Future Work
More comprehensive study of long-range interactions: how much distance should we consider? Analyze gap distances individually and compare them. Look for the combination of distances and methods that most improves classification power.
Explore other physical properties.
Measure correlation of residue groups.
Investigate normalization or correction techniques to reduce sampling effects.
33
Acknowledgements
Dr. Sun Kim
Dr. Mehmet Dalkilic
The Center for Genomics and Bioinformatics
Computing resources
Support throughout this process
34
References
Aha D, Kibler D, Albert M. Machine Learning 6, 1991.
Bateman A, Coin L, et al. Nucleic Acids Research 32, 2004.
Cline M, Karplus K, Lathrop R, et al. PROTEINS: Structure, Function, and Genetics 49, 2002.
Kohavi R. International Joint Conference on AI, 1995.
Martin L, Gloor G, Dunn S, Wahl L. Bioinformatics 21(22), 2005.
Shannon CE. The Bell System Technical Journal 27, 1948.
Weiss O, Jiménez-Montaño M, Herzel H. J. Theor. Biol. 206, 2000.