Published by Ursula Lawrence. Modified over 9 years ago.
1
Hidden Markov models for detecting remote protein homologies
Kevin Karplus, Christian Barrett, Richard Hughey
Georgia Hadjicharalambous
2
Brief outline: Description and evaluation of a new hidden Markov model method, SAM-T98, for finding remote homologs of protein sequences. Evaluation on three fold-recognition test datasets and a curated database. Comparison with WU-BLASTP and DOUBLE-BLAST. Results.
3
1. Biology background
– Homologs: in genetics, chromosomes carrying the same genetic loci; for proteins, sequences related by common ancestry.
– The structure of a protein can be predicted from its homology to sequences whose structure is known.
– Similar structures ⇒ similar functions
⇒ Proteins can be classified into families with similar functions.
⇒ Remote-homolog detection
4
2. Statistics background: Hidden Markov Models
Q = set of states = {match, insert, delete}
V = output alphabet = {20 amino acids}
π(i) = probability of being in state i at time t=0
A = transition probabilities = {a_ij}, where a_ij = Pr[entering state j at time t+1 | in state i at time t]
B = output probabilities = {b_j(k)}, where b_j(k) = Pr[producing v_k at time t | in state j at time t]
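To make the notation concrete, here is a minimal Python sketch of an HMM given by (π, A, B) and the forward recursion that sums Pr[observations] over all state paths. It is only an illustration: the two generic emitting states, the two-letter alphabet, and every number are made-up assumptions, not the match/insert/delete model or parameters of SAM-T98.

```python
def forward_probability(obs, states, pi, A, B):
    """Return Pr[obs] under the HMM (pi, A, B), summed over all state paths."""
    # alpha[s] = Pr[symbols seen so far, current state is s]
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            j: sum(alpha[i] * A[i][j] for i in states) * B[j][symbol]
            for j in states
        }
    return sum(alpha.values())


# Toy model (hypothetical numbers): two states, alphabet {"A", "G"}.
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}                      # initial probabilities
A = {"s1": {"s1": 0.7, "s2": 0.3},               # a_ij, each row sums to 1
     "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"A": 0.9, "G": 0.1},                 # b_j(k), each row sums to 1
     "s2": {"A": 0.2, "G": 0.8}}

p_single = forward_probability(["A"], states, pi, A, B)  # 0.6*0.9 + 0.4*0.2
```

Because every row of A and B sums to 1, the probabilities of all observation sequences of a fixed length also sum to 1, which is a quick sanity check on any hand-built model.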
5
HMMs as profile models. Homologs are chromosomes carrying the same genetic loci; a diploid cell has 2 copies of each homolog, one derived from each parent. A profile of a protein family is a labeling of the positions of the amino acids in the secondary structure, with a probability distribution for each position. The structure of a protein can be predicted from its homology to sequences whose structure is known. Proteins with similar structure are assumed to have similar function ⇒ classification of proteins into families according to their function.
6
Typical profile HMM: a chain of match, insert and delete states. Specific probabilities are assigned to all transitions between nodes, and character costs to the match and insert states. BEST PATH: a single path from ‘Start’ to ‘End’ in which each character of the sequence is assigned to a successive match or insert state along the path.
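The best path can be found with the Viterbi dynamic program. The sketch below is a simplified global version written for illustration: it assumes one transition table shared by all columns, treats Start and End as match states, uses log-probabilities rather than costs, and runs on a made-up two-column model over a four-letter alphabet. None of these choices or numbers come from the SAM package, and the local alignment used in this work is not shown.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    return math.log(p) if p > 0 else NEG_INF


def viterbi_profile(seq, match_emit, insert_emit, trans):
    """Best global path of seq through a toy profile HMM.

    match_emit: list of dicts (one per match column), char -> probability.
    insert_emit: dict char -> probability, shared by all insert states.
    trans: dict (from_kind, to_kind) -> probability, kinds "M", "I", "D".
    Returns (log_probability, path), e.g. path ["M1", "I1", "M2"].
    """
    n, L = len(seq), len(match_emit)
    kinds = "MID"
    # score[kind][i][k]: best log-prob of emitting seq[:i], ending in kind_k
    score = {s: [[NEG_INF] * (L + 1) for _ in range(n + 1)] for s in kinds}
    back = {s: [[None] * (L + 1) for _ in range(n + 1)] for s in kinds}
    score["M"][0][0] = 0.0  # Start state

    def best_prev(i, k, to_kind):
        # Best (score, kind) among states at cell (i, k) moving into to_kind.
        return max((score[p][i][k] + _log(trans.get((p, to_kind), 0.0)), p)
                   for p in kinds)

    for i in range(n + 1):
        for k in range(L + 1):
            if i > 0 and k > 0:  # match: consumes a character and a column
                s, p = best_prev(i - 1, k - 1, "M")
                score["M"][i][k] = s + _log(match_emit[k - 1].get(seq[i - 1], 0.0))
                back["M"][i][k] = p
            if i > 0:            # insert: consumes a character, same column
                s, p = best_prev(i - 1, k, "I")
                score["I"][i][k] = s + _log(insert_emit.get(seq[i - 1], 0.0))
                back["I"][i][k] = p
            if k > 0:            # delete: skips a column, emits nothing
                score["D"][i][k], back["D"][i][k] = best_prev(i, k - 1, "D")

    total, kind = best_prev(n, L, "M")  # transition into End
    if total == NEG_INF:
        return total, []
    path, i, k = [], n, L
    while (i, k, kind) != (0, 0, "M"):  # trace back to Start
        path.append(f"{kind}{k}")
        prev = back[kind][i][k]
        if kind == "M":
            i, k = i - 1, k - 1
        elif kind == "I":
            i -= 1
        else:
            k -= 1
        kind = prev
    return total, path[::-1]


# Hypothetical two-column model; all probabilities are made up.
match_emit = [{"A": 0.9, "C": 0.1}, {"C": 0.9, "A": 0.1}]
insert_emit = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
trans = {("M", "M"): 0.9, ("M", "I"): 0.05, ("M", "D"): 0.05,
         ("I", "M"): 0.8, ("I", "I"): 0.2,
         ("D", "M"): 0.9, ("D", "D"): 0.1}
```

With this toy model, `viterbi_profile("AGC", match_emit, insert_emit, trans)` routes the unexpected `G` through an insert state between the two match columns, and `viterbi_profile("A", ...)` skips the second column through a delete state.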
7
Example
8
For this work: A local alignment procedure was used, which relates part of the sequence to one contiguous path through part of the HMM. An HMM is trained on sequences that are members of a protein family; the resulting HMM identifies the amino acid positions that describe the structure of the family. This HMM is then used to discriminate family members from non-members.
9
TEST SETS. Fold recognition datasets. FSSP: based on a protein classification tree (Holm and Sander, 1996, 1997); presents a continuously updated structural classification of 3-dimensional protein folds (sequences of 1050 leaves of the FSSP tree, 166 target sequences). Uses DALI to determine structural homology. Classification:
z-score > 6: homologs
z-score < 2: non-homologs
2 < z-score < 6: undetermined (may or may not be homologs)
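The z-score rule above can be written as a small Python function. The function name and labels are hypothetical, and the slide does not say how the exact boundary values 2 and 6 are treated; this sketch assumes they fall in the undetermined band.

```python
def classify_pair(z_score):
    """Label a sequence pair from its DALI z-score using the slide's thresholds."""
    if z_score > 6:
        return "homolog"
    if z_score < 2:
        return "non-homolog"
    # Assumption: z-scores of exactly 2 or 6 are treated as undetermined.
    return "undetermined"


labels = [classify_pair(z) for z in (8.3, 4.0, 1.2)]
```

Pairs in the undetermined band are the ones for which a method cannot be counted as clearly right or clearly wrong.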
10
Even with the best possible classifier, about 2% of the non-self pairs represent homologies to be detected. At the minimum-error point for an optimal classifier, there are about 1.4% homolog pairs.
12
SCOP (Structural Classification of Proteins): 2 test sets (Brenner, 1996; Park et al., 1997), with identical lists used for both the target list and the database of known folds. Homologous pair: both sequences are in the same SCOP superfamily. No 2 sequences had >40% sequence similarity. Whole-chain test set: 0.6% correct homologies. Domain test set: the same.
13
Sequence comparison dataset. Pearson: curated version of the PIR database (Barker et al., 1990). 12 216 sequences total. Set of 67 target sequences. 0.4% were considered correct. PIR (Protein Information Resource) families: close homologs. The Pearson test is for close-homolog classification, NOT remote-homolog detection.
14
ALGORITHMS. WU-BLAST (BLAST: Basic Local Alignment Search Tool): software for protein sequence database search. E (expected number of false positives) set to 10. The log of the P-value is reported as the score to threshold. The optimum threshold never corresponded to a P-value > 0.005.
15
DOUBLE-BLAST: Inspired by ISS (Park et al., 1997). ISS was used to recognize remote, evolutionarily related sequence pairs derived from the SCOP database; it was considered to increase detection compared to FASTA. Two-step approach: 1. A set of close homologs to the target sequence is found in NRP. 2. Each homolog is used as a query to search the final database.
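The two-step approach can be sketched in Python as follows. This is an illustration only: `double_search`, the pluggable `search` callable, the threshold, and the rule of keeping the best score per hit are all assumptions, and `toy_search` (a mismatch count on equal-length strings, lower is better) merely stands in for a real BLAST run.

```python
def double_search(target, intermediate_db, final_db, search, threshold):
    """Two-step search: close homologs first, then each homolog as a query.

    search(query, db) must return {sequence_id: score}, lower score = better.
    Keeping the best (lowest) score per final hit is an assumption.
    """
    # Step 1: find close homologs of the target in the intermediate database.
    step1 = search(target, intermediate_db)
    close = [sid for sid, score in step1.items() if score <= threshold]

    # Step 2: use each close homolog as a query against the final database.
    combined = {}
    for sid in close:
        for hit, score in search(intermediate_db[sid], final_db).items():
            if hit not in combined or score < combined[hit]:
                combined[hit] = score
    return combined


def toy_search(query, db):
    """Hypothetical stand-in for a sequence search: count mismatched positions."""
    return {sid: sum(a != b for a, b in zip(query, seq))
            for sid, seq in db.items()}


intermediate = {"n1": "AAAB", "n2": "CCCC"}
final = {"f1": "AABB", "f2": "DDDD"}
hits = double_search("AAAA", intermediate, final, toy_search, threshold=1)
```

In this toy run the intermediate sequence `n1` scores `f1` at 1 mismatch, whereas searching with the target directly would give 2, which is the kind of gain the two-step idea aims for.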
16
SAM-T98: Given a single target sequence, finds and multiply aligns a set of homologs and creates an HMM from that multiple alignment. The resulting HMM is then used for database search. Uses the SAM package. When the database is small, the method is used to create an HMM for each sequence in the database. For the fold-recognition tests, HMMs were created for all sequences; for the Pearson test, only for the 67 target sequences.