A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification
Christopher Hemmerich
Advisor: Dr. Sun Kim

Outline
- Background & Motivation
- Data
- Methods
- Experiments
- Conclusions
- Acknowledgements
- Bibliography

Background & Motivation
- Prior work has studied correlation between positions in protein families (Cline et al., 2002; Martin et al., 2005)
- These studies used multiple sequence alignments (MSAs) to detect correlation and to link it to protein structure and residue co-evolution
- Correlation across an MSA has been tied to co-evolution and contact points in the protein structure
- Less work exists on correlation between residues within a single sequence, and its significance is less clear

Background & Motivation
- “protein sequences can be regarded as slightly edited random strings” (Weiss et al., 2000)
- Can we detect increased correlation in protein sequences versus random sequences?
- Is there correlation between distant residues?
- Is correlation characteristic of the protein structure?
- Can we measure correlation for hydropathy or other interactions that are not residue-specific?

The Protein Families Database
- We use the Pfam-A subset, consisting of around 8000 curated families
- Pfam-A contains families with a wide variety of sequence lengths and numbers of sequences
- Pfam-A contains multiple sequence alignments for families
- We limit experiments to sequences containing 100 or more residues to reduce sampling effects

Methods: Measuring Correlation
- How can we predict the next residue?
  - Pick the most frequently occurring residue
  - We feel more certain about our guess with the second sequence, as it seems less random
- Let's look towards Information Theory

Methods: Measuring Correlation
- We can quantify the uncertainty in a sequence with Shannon Entropy
- Entropy is maximal when P_i is uniform for all i
- Entropy is 0 when P_i = 1 for some i
- The lower the Entropy, the better our prediction should be
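The entropy bullets above can be checked with a small sketch. It assumes the standard definition H = -sum_i P_i log2 P_i, with P_i estimated from residue frequencies in the sequence; the base-2 logarithm and the function name `shannon_entropy` are assumptions for illustration, not taken from the slides.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Entropy in bits of the residue distribution of one sequence.

    Observed residue frequencies stand in for the probabilities P_i.
    """
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# One residue only: P_i = 1 for some i, so entropy is 0.
print(shannon_entropy("AAAA"))
# Four residues, each equally frequent: entropy is maximal for 4 symbols.
print(shannon_entropy("ACDE"))
```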

- Should we guess 'N'? Is there a correlation between 'V' and 'K'? Between 'N' and 'N'?
- We can measure the correlation with Mutual Information for the sequence
- Substitute frequencies for probabilities

Mutual Information Example
- Sequence: AANANK
- MI( AANANK ) = MI( JJCJCL )
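A minimal sketch of the example, assuming MI is computed over neighboring residue pairs with the usual formula MI = sum over (a, b) of P(a, b) log2( P(a, b) / (P(a) P(b)) ). Taking the marginals from the first and second positions of the observed pairs is an assumption here; the slides do not show how they were estimated. The equality holds because MI depends only on the pattern of pairs, not on which letters are used (A→J, N→C, K→L).

```python
import math
from collections import Counter


def neighbor_mi(sequence):
    """Mutual information (bits) between each residue and the one after it.

    Frequencies of observed pairs are substituted for probabilities.
    """
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)   # marginal of the left residue
    second = Counter(b for _, b in pairs)  # marginal of the right residue
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((first[a] / n) * (second[b] / n)))
    return mi


# Relabeling the residues leaves the score unchanged.
print(neighbor_mi("AANANK"), neighbor_mi("JJCJCL"))
```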

Experiment: Measuring Correlation
- Sample 100 sequences from Pfam
- Shuffle each sequence 100 times
  - use the shuffle command from the HMMER package
  - preserves the length and residue frequencies of the sequence
  - randomly reorders residues
- Compare the MI score of each sequence to the MI scores of its shuffles
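The shuffle comparison could be sketched as follows. This is not the HMMER shuffle command: Python's `random.Random.shuffle` is swapped in as a stand-in that also preserves length and residue frequencies. The summary statistic (fraction of shuffles scoring below the original) and all function names are assumptions for illustration.

```python
import math
import random
from collections import Counter


def neighbor_mi(sequence):
    """Mutual information (bits) over neighboring residue pairs."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((first[a] / n) * (second[b] / n)))
        for (a, b), c in joint.items()
    )


def shuffle_scores(sequence, n_shuffles=100, seed=0):
    """MI scores of random re-orderings of the same residues."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps length and residue composition
        scores.append(neighbor_mi("".join(residues)))
    return scores


def fraction_below(sequence, n_shuffles=100, seed=0):
    """Fraction of shuffles whose MI is lower than the original's (assumed summary)."""
    original = neighbor_mi(sequence)
    scores = shuffle_scores(sequence, n_shuffles, seed)
    return sum(s < original for s in scores) / n_shuffles


# A strictly alternating toy sequence is far more correlated than its shuffles.
print(fraction_below("AC" * 50))
```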

Results: Correlation

Results: Normalized Correlation

Methods: Correlation Classification
- Nearest Neighbor classification algorithm
- Plot each N-dimensional vector as a point in space
[Figure: 3 training classes and a test vector]

Methods: Correlation Classification
- Measure the distance from the new point to each existing point
- Assign the family of the nearest training point to the test vector
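The two steps above can be sketched as a 1-nearest-neighbor lookup. Euclidean distance is an assumption (the slides do not name the metric), and the family labels and vectors are hypothetical toy data.

```python
import math


def nearest_neighbor(test_vector, training):
    """Return the family of the training vector closest to test_vector.

    training is a list of (vector, family) pairs; distance is Euclidean (assumed).
    """
    best_family, best_distance = None, math.inf
    for vector, family in training:
        distance = math.dist(test_vector, vector)
        if distance < best_distance:
            best_family, best_distance = family, distance
    return best_family


# Hypothetical 2-dimensional correlation vectors for three training families.
training = [
    ([0.0, 0.0], "family_A"),
    ([1.0, 1.0], "family_B"),
    ([0.0, 1.0], "family_C"),
]
print(nearest_neighbor([0.1, 0.2], training))
```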

Methods: NCBI BLAST Classification
- Build a BLAST database from the training sequences with formatdb
- BLAST the test sequence against the database with default parameters
- Classify the test sequence according to its highest-scoring match (High-scoring Segment Pair)
- If no sequence match is found, classification fails
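A sketch of the classification step, assuming the search was run with the legacy NCBI tools and tabular output, for example `formatdb -i train.fasta -p T` followed by `blastall -p blastp -d train.fasta -i test.fasta -m 8`. The file names, the use of `-m 8`, and the `family_of` lookup table are assumptions for illustration; the slides only state that default parameters were used.

```python
def classify_from_blast_tabular(tabular_text, family_of):
    """Pick the family of the hit with the highest bit score.

    tabular_text: BLAST tabular output (-m 8), 12 tab-separated columns per hit;
                  column 2 is the subject id and column 12 is the bit score.
    family_of:    hypothetical dict mapping training sequence id -> family.
    Returns None when there are no hits, i.e. classification fails.
    """
    best_family, best_score = None, float("-inf")
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, bit_score = fields[1], float(fields[11])
        if bit_score > best_score:
            best_family, best_score = family_of[subject_id], bit_score
    return best_family


# Made-up hits for one query against two training sequences.
hits = "\n".join([
    "\t".join(["q1", "s1", "90.0", "100", "10", "0", "1", "100", "1", "100", "1e-50", "185"]),
    "\t".join(["q1", "s2", "40.0", "80", "48", "2", "5", "84", "3", "82", "1e-05", "42.7"]),
])
family_of = {"s1": "family_A", "s2": "family_B"}
print(classify_from_blast_tabular(hits, family_of))
```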

Methods: Experimental Method
- Randomly select 10 families from the Pfam database
- Evaluate classification techniques on each possible combination of 3 families from the 10
- The results of all sub-experiments are summed
- Accuracy is measured by: (# of correct classifications) / (# of classification attempts)
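As a rough illustration of the experimental design: choosing 3 families out of 10 gives 120 sub-experiments, and accuracy is computed after summing counts across them. The family names and the per-experiment counts below are placeholders, not results from the study.

```python
from itertools import combinations

# Placeholder names standing in for 10 randomly selected families.
families = [f"family_{i}" for i in range(10)]

# Every possible combination of 3 families from the 10.
trios = list(combinations(families, 3))
print(len(trios))


def accuracy(results):
    """Sum (correct, attempts) over all sub-experiments, then divide."""
    correct = sum(c for c, _ in results)
    attempts = sum(a for _, a in results)
    return correct / attempts if attempts else 0.0


# Two made-up sub-experiments: 8 of 10 and 5 of 10 classified correctly.
print(accuracy([(8, 10), (5, 10)]))
```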

Methods: Leave-one-out Validation
- Comprehensive Validation
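Leave-one-out validation holds out each sequence in turn, classifies it against all the others, and reports the fraction classified correctly. A minimal sketch with a nearest-neighbor classifier, assuming Euclidean distance and using hypothetical toy vectors:

```python
import math


def leave_one_out_accuracy(samples):
    """samples: list of (vector, family). Each sample is tested against the rest."""
    correct = 0
    for i, (vector, family) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # everything except the held-out sample
        nearest = min(rest, key=lambda s: math.dist(vector, s[0]))
        correct += nearest[1] == family
    return correct / len(samples)


# Two well-separated toy families.
samples = [
    ([0.0, 0.0], "family_A"),
    ([0.1, 0.0], "family_A"),
    ([0.0, 0.1], "family_A"),
    ([1.0, 1.0], "family_B"),
    ([0.9, 1.0], "family_B"),
    ([1.0, 0.9], "family_B"),
]
print(leave_one_out_accuracy(samples))
```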

Results: Neighbor Correlation

Experiment: Long Range Correlation
- Extend the correlation measure beyond neighboring residues
- gap: the number of residues between the two residues we are comparing
- We consider pairings of all residues within 20 positions of each other
- MI Vector = [ MI(0), MI(1), … MI(19) ]
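A sketch of the 20-element MI vector, assuming gap g pairs position i with position i + g + 1 (so gap 0 means neighbors) and that marginals come from the observed pairs at each gap. Function names are hypothetical.

```python
import math
from collections import Counter


def mi_at_gap(sequence, gap):
    """Mutual information (bits) between residues separated by `gap` residues."""
    pairs = list(zip(sequence, sequence[gap + 1:]))
    if not pairs:
        return 0.0  # sequence too short for this gap
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((first[a] / n) * (second[b] / n)))
        for (a, b), c in joint.items()
    )


def mi_vector(sequence, max_gap=20):
    """[MI(0), MI(1), ..., MI(max_gap - 1)] for one sequence."""
    return [mi_at_gap(sequence, gap) for gap in range(max_gap)]


vector = mi_vector("AC" * 50)
print(len(vector), vector[0])
```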

Results: 20D-Correlation Vector

Experiment: Physical Properties
- Not all intra-protein interactions are residue-specific
- Cline et al. (2002) explore the information attributed to hydropathy, charge, disulfide bonding, and burial
- Hydropathy was found to contain half as much information as the 20-element amino acid alphabet, and its 2-element alphabet is more resistant to finite-sample-size effects

Hydropathy Alphabet
- Hydrophobic: C, I, M, F, W, Y, V, L
- Hydrophilic: R, N, D, E, Q, H, K, S, T, P, A, G
- This partitioning is from Weiss et al. (2000)
- Convert every residue in a sequence to a ‘+’ or ‘-’
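The conversion might look like the sketch below. The slides do not say which class maps to which symbol, so mapping hydrophobic to '+' and hydrophilic to '-' is an assumption, as is rejecting residues outside the 20 listed.

```python
HYDROPHOBIC = set("CIMFWYVL")
HYDROPHILIC = set("RNDEQHKSTPAG")


def to_hydropathy(sequence):
    """Rewrite a protein sequence in the 2-letter hydropathy alphabet.

    '+' = hydrophobic, '-' = hydrophilic (symbol assignment is assumed).
    """
    symbols = []
    for residue in sequence.upper():
        if residue in HYDROPHOBIC:
            symbols.append("+")
        elif residue in HYDROPHILIC:
            symbols.append("-")
        else:
            # Residues outside the partition (e.g. X) are rejected in this sketch.
            raise ValueError(f"residue not in hydropathy partition: {residue!r}")
    return "".join(symbols)


print(to_hydropathy("CAVKLN"))
```

The converted string can then be scored with the same mutual information measure used for the residue alphabet.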

Results: Hydropathy Correlation

Experiment: Combined Vectors
- Combine residue and hydropathy correlation vectors
- A single 40-dimensional vector per sequence

Results: Combined Vectors

Conclusions
- Correlation was strong enough to build sequence classifiers without using the sequence itself
- Significant long-range correlation exists between protein sequence residues
- Correlation exists in terms of both residues and physical properties

Future Work
- More comprehensive study of long-range interactions
  - how much distance should we consider?
  - analyze gap distances individually and compare
  - look for the combination of distances and methods that most improves classification power
- Explore other physical properties
- Measure correlation of residue groups
- Investigate normalization or correction techniques to reduce sampling effects

Acknowledgements
- Dr. Sun Kim
- Dr. Mehmet Dalkilic
- The Center for Genomics and Bioinformatics
  - Computing resources
  - Support throughout this process

References
- Aha D, Kibler D, Albert M. Machine Learning 6, 1991
- Bateman A, Coin L, et al. Nucleic Acids Research 32, 2004
- Cline M, Karplus K, Lathrop R, et al. PROTEINS: Structure, Function, and Genetics 49, 2002
- Kohavi R. International Joint Conference on AI, 1995
- Martin L, Gloor G, Dunn S, Wahl L. Bioinformatics 21(22), 2005
- Shannon CE. The Bell System Technical Journal 27, 1948
- Weiss O, Jiménez-Montaño M, Herzel H. J. Theor. Biol. 206, 2000
