1 Computational Analysis of Protein-DNA Interactions Changhui (Charles) Yan Department of Computer Science Utah State University
2 Problem I Identifying amino acid residues involved in protein-DNA interactions from sequence
3 Materials And Methods 56 double-stranded DNA binding proteins previously used in the study of Jones et al. (2003) Encoding
4 Materials And Methods
5 Naïve Bayes Classifier Leave-one-out cross-validation
6 Naïve Bayes Classifier Leave-one-out cross-validation
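As a rough illustration of the two ideas named on this slide, the sketch below trains a Naïve Bayes classifier on windows of residue identities and evaluates it by holding out one protein at a time. The window half-width, the padding symbol, the add-one smoothing, and the toy sequences and labels are all assumptions made for the example; the slides do not specify them, and this is not the author's implementation.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical padding symbol for window positions past the termini


def windows(seq, half_width):
    """Yield one fixed-width window of residue identities per position."""
    padded = PAD * half_width + seq + PAD * half_width
    for i in range(len(seq)):
        yield padded[i:i + 2 * half_width + 1]


def train(proteins, half_width):
    """Count class frequencies and per-window-position residue frequencies."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)  # key: (label, window position, residue)
    for seq, labels in proteins:
        for window, label in zip(windows(seq, half_width), labels):
            class_counts[label] += 1
            for pos, residue in enumerate(window):
                feature_counts[(label, pos, residue)] += 1
    return class_counts, feature_counts


def predict(model, window):
    """Return the label with the highest Naive Bayes log-posterior."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    n_symbols = len(ALPHABET) + 1  # 20 residues plus the padding symbol
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        count = class_counts[label]
        score = math.log(count / total)
        for pos, residue in enumerate(window):
            # Add-one smoothing (an assumption) avoids log(0) for unseen residues.
            seen = feature_counts.get((label, pos, residue), 0)
            score += math.log((seen + 1) / (count + n_symbols))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out(proteins, half_width=4):
    """Hold out each protein in turn, train on the rest, predict its residues."""
    results = []
    for i, (seq, labels) in enumerate(proteins):
        model = train(proteins[:i] + proteins[i + 1:], half_width)
        predicted = "".join(predict(model, w) for w in windows(seq, half_width))
        results.append((labels, predicted))
    return results


# Toy data, invented for illustration: '1' marks an interface residue.
toy = [
    ("MKRRGKTALSE", "00111100000"),
    ("ALKRRGQVWAE", "00011110000"),
    ("GSKRKGELTAV", "00111100000"),
]
loo_results = leave_one_out(toy, half_width=2)
for actual, predicted in loo_results:
    print(actual, predicted)
```

Holding out whole proteins, rather than individual residues, keeps overlapping windows from the same chain out of the training set for that chain.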
7 Leave-One-Out Cross-Validations. Columns: Sequence-based (Identities (ID); ID + entropy) and Sequence/structure-based (ID + rASA; ID + rASA + entropy). Rows: Correlation coefficient; Accuracy (%); Specificity+ (%); Sensitivity+ (%)
8 Predictions in the Context of 3-D Structures. Pit-1, PDB 1au7. TP: 30, FP: 16, TN: 86, FN: 14. CC: 0.51 (2nd). Accuracy: 79%. Panels: Predicted, Actual
9 λ-Cro, PDB 6cro. TP: 10, FP: 5, TN: 34, FN: 10. CC: 0.37 (19th). Accuracy: 73%. Panels: Predicted, Actual
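The per-protein slides report TP, FP, TN and FN counts alongside a correlation coefficient and an accuracy. The sketch below shows one common way to derive such measures from the four counts. The definitions used here (specificity+ as TP/(TP+FP), sensitivity+ as TP/(TP+FN), and the Matthews form of the correlation coefficient) are assumptions, since the slides do not state their formulas, and the function name is hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def interface_metrics(tp, fp, tn, fn):
    """Compute summary measures from a 2x2 confusion table of residue counts."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, total),
        # Assumed definition: fraction of predicted interface residues that are real.
        "specificity_plus": _ratio(tp, tp + fp),
        # Assumed definition: fraction of real interface residues that are predicted.
        "sensitivity_plus": _ratio(tp, tp + fn),
        # Matthews correlation coefficient (assumed form of "CC").
        "cc": _ratio(tp * tn - fp * fn, denom),
    }


# Counts taken from the Pit-1 slide (PDB 1au7).
pit1 = interface_metrics(tp=30, fp=16, tn=86, fn=14)
for name, value in pit1.items():
    print(f"{name}: {value:.3f}")
```

Because interface residues are a minority class, accuracy alone can look favorable for a predictor that rarely predicts binding, which is presumably why the slides report the correlation coefficient and the two positive-class measures alongside it.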
10 Predictions Compared With PROSITE Motifs. Predicted binding sites substantially overlap with 34 of the 37 “DNA-binding” PROSITE motifs. In 52 of the 56 proteins, the predictor identifies at least 20% of the DNA-binding residues. 28 of the 56 proteins contain no PROSITE motifs that are annotated as “DNA-binding”.
11 Comparison With Previous Study. Naïve Bayes classifier vs. Ahmad and Sarai method*: Accuracy (%): 80 vs. 66; Specificity+ (%): 29 vs. 21; Sensitivity+ (%): 48 vs. 68; Correlation coefficient (values not shown). * Ahmad, S. and Sarai, A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.
12 Summary. A simple sequence-based Naïve Bayes classifier predicts interface residues in DNA-binding proteins with 75% accuracy, 37% specificity+, 53% sensitivity+ and a correlation coefficient of 0.29. Predicted binding sites correctly indicate the locations of actual binding sites and substantially overlap with known PROSITE motifs.
13 Problem II Identification of Helix-Turn-Helix (HTH) DNA-binding motifs
14 HTH Motifs. Sequences sharing low similarity can fold into a similar HTH structure. Identifying HTH motifs from sequence is extremely challenging.
15 Trick 1. Including more information: amino acid sequence; secondary structure
16 Hidden Markov Model (HMM) LQQITHIANQL-GLE----KDVVRVWF
17 Hidden Markov Model (HMM_AA_SS). Amino acid sequence: LQQITHIANQL-GLE----KDVVRVWF. Secondary structure: HHHEEHEEEHMHE----HHEEMMEH
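One way to picture an HMM that uses both tracks is to let every state emit a residue and a secondary-structure symbol at each position. The sketch below scores a paired sequence with the forward algorithm on a small generic HMM. It assumes the two emissions are independent given the state; the state names, the probabilities, the symbol letters and the floor for unseen symbols are invented for illustration. It is not the profile HMM architecture or the parameters behind HMM_AA_SS, which the slides do not describe.

```python
import math

UNSEEN = 1e-6  # hypothetical floor probability for symbols missing from a table


def forward_log_likelihood(aa_seq, ss_seq, start, trans, emit_aa, emit_ss):
    """Scaled forward algorithm for an HMM emitting (residue, SS symbol) pairs."""
    if len(aa_seq) != len(ss_seq):
        raise ValueError("residue and secondary-structure tracks differ in length")
    states = list(start)
    alpha = {}
    log_likelihood = 0.0
    for t, (aa, ss) in enumerate(zip(aa_seq, ss_seq)):
        new_alpha = {}
        for s in states:
            if t == 0:
                prior = start[s]
            else:
                prior = sum(alpha[r] * trans[r][s] for r in states)
            # Assumption: residue and SS symbol are independent given the state.
            emission = emit_aa[s].get(aa, UNSEEN) * emit_ss[s].get(ss, UNSEEN)
            new_alpha[s] = prior * emission
        # Rescale at every step to avoid underflow on long sequences.
        scale = sum(new_alpha.values())
        log_likelihood += math.log(scale)
        alpha = {s: v / scale for s, v in new_alpha.items()}
    return log_likelihood


# Toy two-state model; every number below is invented for the example.
start = {"helix": 0.5, "turn": 0.5}
trans = {
    "helix": {"helix": 0.8, "turn": 0.2},
    "turn": {"helix": 0.3, "turn": 0.7},
}
emit_aa = {
    "helix": {"L": 0.4, "Q": 0.3, "G": 0.1, "E": 0.2},
    "turn": {"L": 0.1, "Q": 0.1, "G": 0.6, "E": 0.2},
}
emit_ss = {
    "helix": {"H": 0.9, "T": 0.1},
    "turn": {"H": 0.2, "T": 0.8},
}

print(forward_log_likelihood("LQGLE", "HHTHH", start, trans, emit_aa, emit_ss))
```

An alternative design is to treat each (residue, secondary structure) pair as a single symbol from a product alphabet, which captures dependence between the two tracks at the cost of many more emission parameters.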
18 Trick 2. There are similarities among the 20 naturally occurring amino acids. Reduced alphabets
19 Reduced Alphabets Schemes for reducing amino acid alphabet based on the BLOSUM50 matrix by Henikoff and Henikoff (1992) derived by grouping and averaging the similarity matrix elements as described in the text. (Murphy et al. 2000)
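Re-encoding a sequence with a reduced alphabet amounts to replacing each residue by a representative of its group. The sketch below uses a 10-letter grouping as it is commonly quoted for Murphy et al. (2000); the grouping should be checked against that paper before use, and the function names and the choice of the first letter as each group's representative are assumptions for the example.

```python
# 10-letter grouping as commonly quoted for Murphy et al. (2000);
# treat it as an assumption and verify against the paper.
MURPHY_10_GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]


def build_mapping(groups):
    """Map every residue to the first letter of its group."""
    return {residue: group[0] for group in groups for residue in group}


def reduce_sequence(seq, mapping):
    """Re-encode a sequence; gaps and unknown symbols pass through unchanged."""
    return "".join(mapping.get(symbol, symbol) for symbol in seq)


MURPHY_10 = build_mapping(MURPHY_10_GROUPS)

# Fragment of the aligned sequence shown on the HMM slides.
print(reduce_sequence("LQQITHIANQL-GLE", MURPHY_10))
```

A smaller alphabet means fewer emission parameters per model state, which may help when a family has few training sequences, at the cost of discarding distinctions between grouped residues.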
20 Cross-Families Evaluations. Columns: True Positive (1); False Positive (2). Rows: HMM_AA30; HMM_AA_SS (20 letters); HMM_AA_SS (Murphy_15); HMM_AA_SS (Murphy_10); HMM_AA_SS (Murphy_8). Notes: 1. True positive: HTH motifs that are correctly identified as such. 2. False positive: Non-HTH motifs that are identified as HTH motifs. 3. The alphabet used to encode amino acid sequences.
21 Questions
22 Within-family Three-Fold Cross-Validations. Columns: Family (number of HTH motifs in the family); HMM_AA; HMM_AA_SS (Murphy_15). Rows: PF00126 (1635); PF00165 (90): 63, 80; PF00196 (30): 26, 30; PF04545 (164); PF01022 (42) 39; PF00046 (189); PF03965 (48) 48
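A three-fold cross-validation within one family can be sketched as follows: the family's motifs are split into three folds, and each fold is held out once while a model is trained on the other two. The shuffling, the fixed seed, the function name and the placeholder motif identifiers are assumptions for illustration; the slides do not say how the folds were drawn.

```python
import random


def three_fold_splits(items, n_folds=3, seed=0):
    """Yield (train, test) lists, holding out each fold exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, test


# Placeholder identifiers for a family with 30 motifs.
motifs = [f"motif_{i:02d}" for i in range(30)]
splits = list(three_fold_splits(motifs))
for train, test in splits:
    print(len(train), len(test))
```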
23 Comparisons of HMM_AA_SS with FFAS03 in Cross-Family Evaluations. Columns: Total HTH motifs; Recognized by both FFAS03 and HMM_AA_SS; Recognized by FFAS03 only; Recognized by HMM_AA_SS only
24 Putative HTH motifs in Ureaplasma parvum. Columns: Protein; Location; Annotation from UniProt. sp|Q9PQE5|SCPB_UREPA: Participates in chromosomal partition during cell division; sp|Q9PQV6|RPOB_UREPA: DNA-directed RNA polymerase; sp|Q9PR27|SYY_UREPA: Tyrosyl-tRNA synthetase; sp|Q9PQC2|SYA_UREPA: Alanyl-tRNA synthetase; sp|Q9PQ74|DPO3A_UREPA: DNA polymerase III subunit alpha; sp|Q9PQX7|Y166_UREPA: Hypothetical protein