Projects…
FASTA Lookup Tables
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD
[Figure: the two sequences shown at several relative offsets, with the matching residues marked at each offset: C S Q; G C; CH]
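The matches marked in the figure can be reproduced with a small lookup table. The sketch below is only an illustration of the idea, not the FASTA implementation: the function name `diagonal_matches` and the `ktup` parameter are assumptions made here. It indexes every k-tuple of sequence 1 by position, scans sequence 2 once, and groups the hits by diagonal offset (position in sequence 1 minus position in sequence 2).

```python
from collections import defaultdict

def diagonal_matches(seq1, seq2, ktup=1):
    """Group shared k-tuples of seq1 and seq2 by diagonal offset (i - j)."""
    # Lookup table: every k-tuple of seq1 -> list of its start positions.
    table = defaultdict(list)
    for i in range(len(seq1) - ktup + 1):
        table[seq1[i:i + ktup]].append(i)

    # Scan seq2 once; a hit at (i, j) lies on diagonal i - j.
    diagonals = defaultdict(list)
    for j in range(len(seq2) - ktup + 1):
        word = seq2[j:j + ktup]
        for i in table.get(word, []):
            diagonals[i - j].append(word)
    return dict(diagonals)

hits = diagonal_matches("ACNGTSCHQE", "GCHCLSAGQD")
# List the diagonals with the most matches first.
for offset, words in sorted(hits.items(), key=lambda kv: -len(kv[1])):
    print(offset, " ".join(words))
```

With single residues (`ktup=1`), offset 0 collects C, S, Q, offset 3 collects G, C, and offset 5 collects C, H. With `ktup=2`, the only shared word is CH.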
SSEARCH
Smith-Waterman local alignment, run pairwise against the entire database.
Extremely slow.
Best for identifying weak, distant relationships.

Review of Scoring
Scoring
Scores collected from Smith-Waterman (SW) matches against a database of sequences are the best scores for each pair, not random samples.
Thus the distribution is not normal but positively skewed.
For database searches, we can use the actual scores of all pairwise comparisons in the database as the set of scores.
Knowing the distribution allows us to compute P(Score ≥ x).
The Gumbel extreme value distribution has two parameters: μ (center) and λ (scaling).
[Figure: normal distribution vs. extreme value distribution]
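As a worked illustration of P(Score ≥ x), the sketch below evaluates the upper tail of a Gumbel distribution, 1 − exp(−e^(−λ(x − μ))). It assumes the parameterization in which λ multiplies (x − μ). The function name `gumbel_tail` and the parameter values are made up for the example.

```python
import math

def gumbel_tail(x, mu, lam):
    """P(Score >= x) for a Gumbel distribution with center mu and scaling lam."""
    # Cap the exponent so math.exp cannot overflow for scores far below mu.
    y = math.exp(min(-lam * (x - mu), 700.0))
    # 1 - exp(-y), written with expm1 to stay accurate when y is tiny.
    return -math.expm1(-y)

# Hypothetical parameters, for illustration only.
mu, lam = 30.0, 0.25
for score in (30, 50, 80):
    print(score, gumbel_tail(score, mu, lam))
```

At x = μ the tail probability is 1 − 1/e, about 0.63. For scores well above μ it approaches e^(−λ(x − μ)), which is why high scores become exponentially unlikely by chance.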
Scoring, cont.
Parameter estimation [μ (center) and λ (scaling)]
Estimate from moments [μ = x̄ − 0.4500σ and λ = 1.2825/σ, where x̄ and σ are the mean and standard deviation of the scores]
Maximum likelihood estimation [SSEARCH, FASTA]
Scores between random sequences increase with sequence length.
For each sequence near length L, plot SW score vs. log(average length).
Fit scores by linear regression.
High scores and low outliers are trimmed from the regression fit.
"Normalize": subtract the predicted value from the real value.
Compute z-score: how many standard deviations the normalized score lies from the mean.
Z-scores have known extreme value distribution parameters.
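The two estimation routes can be sketched as follows. This is a simplified illustration, not the SSEARCH or FASTA code. The moment estimates assume λ multiplies (x − μ) in the Gumbel exponent, so that λ = 1.2825/σ and μ = x̄ − 0.4500σ. The regression step fits every sequence individually, skips outlier trimming, and uses one residual standard deviation for all lengths. The function names and data are hypothetical.

```python
import math
import statistics

def gumbel_from_moments(scores):
    """Method-of-moments estimates (mu, lam) from a list of scores."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)   # sample standard deviation
    lam = 1.2825 / sd               # pi / sqrt(6) ~= 1.2825
    mu = mean - 0.4500 * sd         # Euler's constant / 1.2825 ~= 0.4500
    return mu, lam

def length_normalized_z(lengths, scores):
    """z-scores of residuals from a least-squares fit of score on ln(length)."""
    xs = [math.log(n) for n in lengths]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(scores)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # "Normalize": subtract the score predicted for this length.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    spread = statistics.pstdev(residuals)
    return [r / spread for r in residuals]

# Made-up data: four scores that grow with log(length), plus one high scorer.
lengths = [100, 200, 400, 800, 400]
scores = [20, 24, 28, 32, 60]
z = length_normalized_z(lengths, scores)
print([round(v, 2) for v in z])
print(gumbel_from_moments(scores))
```

In this toy data the last sequence scores far above what its length predicts, so it receives the largest z-score. A real search would trim such outliers before fitting, as the slide notes.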
Profile/Scoring Matrices
So far, the query is a single sequence.
Compare: query as a regular expression or other generalized pattern.
Example: Position-Specific Scoring Matrix (PSSM)
Why? Motifs; multiple sequence alignments.
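To make the contrast with a single-sequence query concrete, the sketch below scores every window of a sequence against a PSSM and reports the best one. The matrix values, the `default` score for residues missing from a column, and the function name `best_pssm_hit` are all hypothetical. The values are not taken from the matrix on the PSSM slide.

```python
def best_pssm_hit(pssm, sequence, default=-1):
    """Slide a PSSM along a sequence; return (start, score) of the best window."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the position-specific score of each residue in the window.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical 3-column matrix: one dict of residue scores per position.
toy_pssm = [{"A": 3, "C": -1}, {"C": 2}, {"G": 4}]
print(best_pssm_hit(toy_pssm, "TACGA"))  # prints (1, 9)
```

Unlike a fixed substitution matrix, the score for a residue depends on the column it falls in: here A is rewarded only in the first position.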
PSSM
Sequence: A M P G V
Matrix (rows: residues and the gap symbol "-"; columns: positions; "." as on the slide):

A   4  .  .  .
C   .  .  .  .
G   .  .  2  0
M   1  2  .  .
P   .  3  1  .
V   .  .  .  1
-

The sequence and matrix are repeated in three panels, with the scores:
4+2-1+0=5
1+3+2-1=5
0+0+0=0