tPatternHunter: gapped, fast and sensitive translated homology search
Derek Kisman, Ming Li, Bin Ma, Li Wang
Bioinformatics, 21(4):542-544, February 2005
Paper study for class presentation on Nov 16th, 2005; slides by 陳奕先
tPatternHunter: the "t" stands for translated search
What issues arise when we try to apply the PatternHunter technique to translated search?
Proteins use 20 different letters, far more than DNA's 4 letters.
Three DNA letters make up one codon, so at the hit extension stage a gap in the DNA may cause a frameshift.
Proteins use 20 different letters, far more than DNA's 4 letters
For a seed of the same weight, the hash table would need significantly more space than for DNA sequences.
PatternHunter uses weight-11 seeds for DNA sequences. How large should the seeds be for proteins?
11 * log10(4) = 6.62
5 * log10(20) = 6.51
tPH uses weight-5 spaced seeds (the default seed is 1101011).
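The arithmetic above can be checked with a short sketch. It assumes the logarithm on the slide is base 10 (which reproduces 6.62 and 6.51) and that the hash table is keyed directly by the letters at the seed's "1" positions, so the number of possible keys is alphabet_size ** weight. The function name `index_size` is illustrative, not part of tPH.

```python
import math

def index_size(alphabet_size: int, weight: int) -> int:
    """Number of distinct keys for a seed with `weight` checked positions."""
    return alphabet_size ** weight

dna_keys = index_size(4, 11)       # PatternHunter: weight-11 seed over DNA
protein_keys = index_size(20, 5)   # tPH: weight-5 seed over amino acids

# Print each key count with its base-10 logarithm (assumed base).
print(dna_keys, round(math.log10(dna_keys), 2))
print(protein_keys, round(math.log10(protein_keys), 2))
```

Under these assumptions the two key spaces have the same order of magnitude (about 4.2 million versus 3.2 million keys), which is why weight 5 over proteins is comparable to weight 11 over DNA.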
Only the five letters at the "1" positions are checked for hits, and each aligned pair is scored with BLOSUM62.
A "hit": all five positions score at least 0, and the total score is above a threshold T.
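The hit rule can be sketched as follows. This is a minimal illustration, not tPH's implementation. The names `score` and `is_hit` are hypothetical, and the score table is only an excerpt of BLOSUM62 with the pairs the examples need. The threshold T = 20 is the default these slides give for BLOSUM62, and "above the threshold" is read here as `>= T`, which is an assumption.

```python
# Excerpt of BLOSUM62: only the residue pairs used in the examples below.
BLOSUM62_EXCERPT = {
    ("W", "W"): 11, ("A", "A"): 4, ("H", "H"): 8, ("F", "F"): 6,
    ("Y", "Y"): 7, ("A", "S"): 1, ("F", "Y"): 3, ("A", "W"): -3,
}

def score(a: str, b: str) -> int:
    """Symmetric lookup; raises KeyError for pairs outside the excerpt."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]

def is_hit(x: str, y: str, seed: str = "1101011", threshold: int = 20) -> bool:
    """Apply the spaced seed to two equally long protein windows."""
    if not (len(x) == len(y) == len(seed)):
        raise ValueError("windows and seed must have the same length")
    # Only positions where the seed has a '1' are looked up and scored.
    scores = [score(a, b) for a, b, s in zip(x, y, seed) if s == "1"]
    # Assumption: "above the threshold" is treated as total >= threshold.
    return min(scores) >= 0 and sum(scores) >= threshold

print(is_hit("WACHEFY", "WAGHKFY"))  # identical at all five checked positions
print(is_hit("WACHEFY", "WSCHEYF"))  # conservative substitutions, total 26
print(is_hit("AACHEFY", "WACHEFY"))  # total 22, but A/W scores -3
print(is_hit("AAAAAAA", "SASASAS"))  # all scores >= 0, but total is only 14
```

The third example shows why the per-position condition matters: the total alone would pass T = 20, but one checked pair has a negative score, so the window is rejected.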
[Figure: BLOSUM62 scoring matrix]
What about the frameshift issue?
When performing a DNA-protein or DNA-DNA search, tPH treats each DNA sequence as a sequence of overlapping codons, one codon starting at every base:
DNA:    T T T G C A
Codons: TTT = F, TTG = L, TGC = C, GCA = A
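The overlapped-codon view of the example can be reproduced with a short sketch. It assumes the standard genetic code and handles a single strand only. The name `overlapped_codons` is illustrative, and this is not how tPH itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def overlapped_codons(dna: str) -> str:
    """Translate the codon starting at every base, not only every third base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(len(dna) - 2))

stream = overlapped_codons("TTTGCA")
print(stream)        # FLCA
print(stream[0::3])  # FA: codons TTT and GCA, i.e. reading frame 0 only
```

Because a codon starts at every base, the single stream contains the translations of all three reading frames of the strand; taking every third letter recovers one frame.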
To improve sensitivity, we can use more than one seed.
By default, tPH uses four weight-5 seeds (of length 6 or 7) and a threshold T = 20 for BLOSUM62.
How fast and how sensitive is tPH?