A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification Christopher Hemmerich Advisor: Dr. Sun Kim.


1 A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification Christopher Hemmerich Advisor: Dr. Sun Kim

2 Outline  Background & Motivation  Data  Methods  Experiments  Conclusions  Acknowledgements  Bibliography

3 Background & Motivation  Prior works have studied correlation between positions in protein families (Cline et al., 2002; Martin et al., 2005)  They used multiple sequence alignments (MSAs) to detect correlation and to make links to protein structure and residue co-evolution  Correlation across an MSA has been tied to co-evolution and contact points in the protein structure  Less work exists on correlation between residues within a single sequence, and its significance is less clear

4 Background & Motivation  “protein sequences can be regarded as slightly edited random strings” (Weiss et al., 2000)  Can we detect the increased correlation in protein sequences vs random sequences?  Is there correlation between distant residues?  Is correlation characteristic of the protein structure?  Can we measure correlation for hydropathy or other interactions that are not residue-specific?

5 The Protein Families Database  We use the Pfam-A subset, consisting of around 8000 curated families  Pfam-A contains families with a wide variety of sequence lengths and numbers of sequences  Pfam-A contains multiple sequence alignments for families  We limit experiments to sequences containing 100 or more residues to reduce sampling effects

6 Methods: Measuring Correlation  How can we predict the next residue?  Let's look towards Information Theory

7 Methods: Measuring Correlation  How can we predict the next residue?  Let's look towards Information Theory  Pick the most frequent residue  We feel more certain about our guess with the second sequence, as it seems less random

8 Methods: Measuring Correlation  We can quantify the uncertainty in a sequence with Shannon Entropy  Entropy is maximal when P_i is uniform for all i  Entropy is 0 when P_i = 1 for some i  The lower the Entropy, the better our prediction should be
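
The entropy formula on this slide did not survive as text. A minimal sketch, assuming the standard Shannon definition H = -sum_i P_i log2 P_i with residue frequencies substituted for probabilities, illustrates the two limiting cases listed above. The function name and the two toy sequences are illustrative and not taken from the slides.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Entropy in bits of the residue frequencies observed in seq."""
    n = len(seq)
    counts = Counter(seq)
    # Frequencies stand in for the probabilities P_i.
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy inputs (assumptions, not data from the study):
uniform = shannon_entropy("ACDEFGHIKLMNPQRSTVWY")  # each of 20 residues once
single = shannon_entropy("AAAAAAAAAA")               # a single residue repeated
```

With all 20 residues equally frequent the entropy reaches its maximum of log2(20) bits; with only one residue present it is 0.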

9  Should we guess 'N'? Is there a correlation between 'V' and 'K'? Between 'N' and 'N'?  We can measure the correlation with Mutual Information for the sequence  Substitute frequencies for probabilities
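
The Mutual Information formula shown on this slide was lost with the slide image. The standard definition is given here as a reference, on the assumption that the slide used this form: P_{ij} is the frequency with which residue i is followed by residue j at the chosen spacing, and P_i and P_j are the corresponding single-residue frequencies.

```latex
MI = \sum_{i} \sum_{j} P_{ij} \, \log_2 \frac{P_{ij}}{P_i \, P_j}
```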

10 Mutual Information Example  Sequence: AANANK

11 Mutual Information Example  Sequence: AANANK

12 Mutual Information Example  Sequence: AANANK

13 Mutual Information Example  Sequence: AANANK  MI( AANANK ) = MI( JJCJCL )
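
The worked numbers for this example were on the slide images. The sketch below recomputes a neighbor-pair MI for AANANK and checks the relabeling property stated on slide 13. It assumes the standard MI definition with marginals estimated from the neighbor pairs themselves; the exact estimator used in the talk is not recoverable from the transcript, so the function name and that choice are assumptions.

```python
from collections import Counter
from math import log2

def neighbor_mi(seq):
    """Mutual information (bits) between adjacent residues of seq."""
    pairs = list(zip(seq, seq[1:]))          # (residue, next residue)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                   # counts for P_ij
    first = Counter(a for a, _ in pairs)     # counts for P_i (left residue)
    second = Counter(b for _, b in pairs)    # counts for P_j (right residue)
    # (c/n) * log2( (c/n) / ((first/n) * (second/n)) ), simplified
    return sum((c / n) * log2(c * n / (first[a] * second[b]))
               for (a, b), c in joint.items())

mi_original = neighbor_mi("AANANK")
mi_relabeled = neighbor_mi("JJCJCL")  # A->J, N->C, K->L
```

MI depends only on the pattern of repeats, not on which letters carry it, so the two values agree.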

14 Experiment: Measuring Correlation  Sample 100 sequences from Pfam  Shuffle each sequence 100 times with the shuffle command from the HMMER package, which preserves the length and residue frequencies of the sequence and randomly re-orders the residues  Compare the MI score for each sequence to the MI scores of its shuffles
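
The shuffle comparison can be sketched as follows. This is an illustration under stated assumptions: Python's `random.shuffle` stands in for the HMMER shuffle command, the MI estimator is the neighbor-pair version with pair-derived marginals, the input is a toy sequence, and the z-score is only one plausible way to normalize (the talk's normalization is not described in the transcript).

```python
import random
from collections import Counter
from math import log2
from statistics import mean, pstdev

def neighbor_mi(seq):
    """Mutual information (bits) between adjacent residues of seq."""
    pairs = list(zip(seq, seq[1:]))
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum((c / n) * log2(c * n / (first[a] * second[b]))
               for (a, b), c in joint.items())

def shuffle_scores(seq, n_shuffles=100, seed=0):
    """MI of n_shuffles random re-orderings; length and composition are kept."""
    rng = random.Random(seed)
    residues = list(seq)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(neighbor_mi("".join(residues)))
    return scores

seq = "AB" * 50                      # toy, strongly ordered sequence (assumption)
observed = neighbor_mi(seq)
background = shuffle_scores(seq)
spread = pstdev(background)
# Hypothetical normalization: distance from the shuffle mean in standard deviations.
z_score = (observed - mean(background)) / spread if spread > 0 else float("inf")
```

A sequence whose MI sits far above the scores of its own shuffles carries order-dependent correlation that residue composition alone does not explain.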

15 Results: Correlation

16 Results: Normalized Correlation

17 Methods: Correlation Classification  Nearest Neighbor classification algorithm  Plot N-dimensional vector in space  [Figure: 3 training classes]

18 Methods: Correlation Classification  Nearest Neighbor classification algorithm  Plot N-dimensional vector in space  [Figure: 3 training classes and a test vector]

19 Methods: Correlation Classification  Measure the distance from the new point to each existing point  Assign the family of the nearest training point to the test vector
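
The two steps above can be sketched in a few lines. Euclidean distance is assumed here because the transcript does not name the distance measure; the training vectors and family labels are made-up toy values.

```python
from math import dist

def nearest_neighbor(train, test_vector):
    """Return the family of the training vector closest to test_vector.

    train is a list of (vector, family) pairs.
    """
    _, family = min(train, key=lambda item: dist(item[0], test_vector))
    return family

# Toy 2-D training set with three classes (illustrative values only).
train = [
    ((0.10, 0.20), "family_A"),
    ((0.80, 0.90), "family_B"),
    ((0.50, 0.10), "family_C"),
]
predicted = nearest_neighbor(train, (0.75, 0.80))  # closest to the family_B point
```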

20 Methods: NCBI BLAST Classification  Build a BLAST database from the training sequences with formatdb  BLAST the test sequence against the database with default parameters  Classify the test sequence according to the highest scoring match (High-scoring Segment Pair)  If no sequence match is found, classification fails

21 Methods: Experimental Method  Randomly select 10 families from the Pfam database  Evaluate classification techniques on each possible combination of 3 families from the 10  The results of all sub-experiments are summed  Accuracy is measured by: (# of correct classifications) / (# of classification attempts)
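
As a sketch of the bookkeeping described here: choosing 3 families out of 10 gives 120 sub-experiments, and accuracy is computed from the summed counts instead of averaging per-experiment accuracies. The family names and the `(correct, attempts)` result format are placeholders, not the study's data structures.

```python
from itertools import combinations

# Placeholder family identifiers standing in for 10 randomly chosen families.
families = [f"family_{i}" for i in range(10)]
trios = list(combinations(families, 3))   # every 3-family sub-experiment

def summed_accuracy(results):
    """results: one (correct, attempts) pair per sub-experiment."""
    correct = sum(c for c, _ in results)
    attempts = sum(a for _, a in results)
    return correct / attempts

# Two made-up sub-experiment outcomes, purely to show the arithmetic.
example = summed_accuracy([(8, 10), (5, 10)])  # 13 correct out of 20 attempts
```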

22 Methods: Leave-one-out Validation  Comprehensive Validation
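
Leave-one-out validation holds out each sample in turn, classifies it against all remaining samples, and counts the correct predictions. The sketch below pairs it with a nearest-neighbor classifier under the assumption of Euclidean distance; the five labeled points are toy values, not data from the study.

```python
from math import dist

def leave_one_out_accuracy(samples):
    """samples: list of (vector, family). Each sample is tested once
    against a training set made of all the other samples."""
    correct = 0
    for i, (vector, family) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        _, predicted = min(rest, key=lambda item: dist(item[0], vector))
        correct += predicted == family
    return correct / len(samples)

# Toy data: two tight clusters plus one "B" point that lies nearer to "A".
samples = [
    ((0.0, 0.0), "A"),
    ((0.1, 0.0), "A"),
    ((1.0, 1.0), "B"),
    ((0.9, 1.0), "B"),
    ((0.5, 0.45), "B"),
]
accuracy = leave_one_out_accuracy(samples)  # 4 of 5 held-out points are correct
```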

23 Results: Neighbor Correlation

24 Experiment: Long Range Correlation  Extend correlation measure beyond neighboring residues  gap: number of residues between the residues we are comparing  we are considering the pairing of all residues within 20 positions of each other  MI Vector = [ MI(0), MI(1), … MI(19) ]
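
A sketch of the 20-element MI vector, reading gap g as "g residues lie between the pair", so gap 0 compares neighbors. As before, the estimator (standard MI with pair-derived marginals) and the toy input sequence are assumptions.

```python
from collections import Counter
from math import log2

def mi_at_gap(seq, gap=0):
    """MI (bits) between residues separated by `gap` intervening residues."""
    step = gap + 1
    pairs = list(zip(seq, seq[step:]))       # pairs (seq[i], seq[i + step])
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum((c / n) * log2(c * n / (first[a] * second[b]))
               for (a, b), c in joint.items())

def mi_vector(seq, max_gap=20):
    """[MI(0), MI(1), ..., MI(max_gap - 1)]"""
    return [mi_at_gap(seq, g) for g in range(max_gap)]

# Made-up toy sequence; the study used Pfam sequences of 100+ residues.
toy = "MKVLAAGIVGLLLAQPSFAADKVELTYWHGMKVLAAGIVGLLLAQPSFAADKVELTYWHG"
vector = mi_vector(toy)
```

On sequences this short the estimates are dominated by finite-sample effects, which is consistent with the 100-residue minimum used in the experiments.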

25 Results: 20D-Correlation Vector

26 Experiment: Physical Properties  Not all intra-protein interactions are residue specific  Cline et al. (2002) explore information attributed to hydropathy, charge, disulfide bonding, and burial  Hydropathy was found to contain half as much information as the 20-element amino acid alphabet, and its 2-element alphabet is more resistant to finite-sample size effects

27 Hydropathy Alphabet  Hydrophobic: C, I, M, F, W, Y, V, L  Hydrophilic: R, N, D, E, Q, H, K, S, T, P, A, G  This partitioning is from Weiss et al. (2000)  Convert every residue in a sequence to a ‘+’ or ‘-’
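
The conversion can be sketched as a simple lookup over the two residue sets listed above. The transcript does not say which class maps to which symbol, so assigning '+' to hydrophobic and '-' to hydrophilic is an assumption, as is raising an error on residues outside the 20 listed.

```python
HYDROPHOBIC = set("CIMFWYVL")
HYDROPHILIC = set("RNDEQHKSTPAG")

def to_hydropathy(seq):
    """Map each residue to '+' (hydrophobic) or '-' (hydrophilic).

    The symbol assignment is an assumption; only the partition itself
    comes from the slide.
    """
    symbols = []
    for residue in seq:
        if residue in HYDROPHOBIC:
            symbols.append("+")
        elif residue in HYDROPHILIC:
            symbols.append("-")
        else:
            raise ValueError(f"residue {residue!r} is not in the 20-letter alphabet")
    return "".join(symbols)

encoded = to_hydropathy("MKVLA")  # M, V, L hydrophobic; K, A hydrophilic
```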

28 Results: Hydropathy Correlation

29 Experiment: Combined Vectors  Combine residue and hydropathy correlation vectors  A single 40 dimensional vector per sequence
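
One way to build the 40-dimensional vector is to concatenate the 20 residue-level MI values with the 20 values computed on the hydropathy-encoded sequence. The sketch assumes gaps 0 through 19 for both halves, the standard MI estimator with pair-derived marginals, and a simplified encoding in which any residue outside the hydrophobic set becomes '-'. The input is a made-up toy sequence.

```python
from collections import Counter
from math import log2

HYDROPHOBIC = set("CIMFWYVL")

def mi_at_gap(seq, gap=0):
    """MI (bits) between residues separated by `gap` intervening residues."""
    pairs = list(zip(seq, seq[gap + 1:]))
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum((c / n) * log2(c * n / (first[a] * second[b]))
               for (a, b), c in joint.items())

def mi_vector(seq, max_gap=20):
    return [mi_at_gap(seq, g) for g in range(max_gap)]

def combined_vector(seq):
    """20 residue-alphabet MI values followed by 20 hydropathy MI values."""
    # Simplification: everything not in HYDROPHOBIC is treated as hydrophilic.
    hydro = "".join("+" if r in HYDROPHOBIC else "-" for r in seq)
    return mi_vector(seq) + mi_vector(hydro)

toy = "MKVLAAGIVGLLLAQPSFAADKVELTYWHGMKVLAAGIVGLLLAQPSFAADKVELTYWHG"
features = combined_vector(toy)
```

The resulting vector can be passed to the same nearest-neighbor classifier used for the 20-dimensional case.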

30 Results: Combined Vectors

31 Conclusions  Correlation was strong enough for building sequence classifiers without using sequence  Significant Long Range Correlation between protein sequence residues  Correlation exists in terms of residues and physical properties

32 Future Work  More comprehensive study of long range interactions: how much distance should we consider? Analyze gap distances individually and compare. Look for the combination of distances and methods that most improves classification power.  Explore other physical properties  Measure correlation of residue groups  Investigate normalization or correction techniques to reduce sampling effects

33 Acknowledgements  Dr. Sun Kim  Dr. Mehmet Dalkilic  The Center for Genomics and Bioinformatics (computing resources, support throughout this process)

34 References  Aha D, Kibler D, Albert M. Machine Learning 6 1991  Bateman A, Coin L et al. Nucleic Acids Research 32 2004  Cline M, Karplus K, Lathrop R, et al. PROTEINS: Structures, Functions, and Genetics 49 2002  Kohavi R. International Joint Conference on AI 1995  Martin L, Gloor G, Dunn S, Wahl L. Bioinformatics 21(22) 2005  Shannon C.E. The Bell System Tech. Journal 27 1948  Weiss O, Jiménez-Montaño M, Herzel H. J. theor. Biol. 206 2000

