Stephen Altschul National Center for Biotechnology Information


1 Compositionally Adjusted Substitution Matrices for Protein Database Searches
Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health

2 Collaborators
Yi-Kuo Yu, Alejandro Schäffer, John Wootton, Richa Agarwala, Mike Gertz, Aleksandr Morgulis
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
See: Yu, Wootton & Altschul (2003) PNAS 100: ; Yu & Altschul (2005) Bioinformatics 21: ; Altschul et al. (2005) FEBS J. 272:

3 Log-odds scores
The scores of any local-alignment substitution matrix can be written in the form
$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p_j}$$
where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies and $\lambda$ is an arbitrary scale factor. (PNAS 87: )
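The log-odds form can be illustrated with a minimal Python sketch. The two-letter alphabet and its frequencies are made-up toy values, and the function name `log_odds_matrix` is hypothetical. The scale is set to half-bit units, which is an assumption for the example rather than something stated on the slide.

```python
import math

def log_odds_matrix(q, p, lam):
    """Return s[i][j] = ln(q[i][j] / (p[i] * p[j])) / lam."""
    n = len(p)
    return [[math.log(q[i][j] / (p[i] * p[j])) / lam for j in range(n)]
            for i in range(n)]

# Toy two-letter alphabet (illustrative numbers, not real amino acid data).
p = [0.6, 0.4]                 # background frequencies
q = [[0.45, 0.15],
     [0.15, 0.25]]             # target frequencies; rows sum to p
lam = math.log(2) / 2          # assumed scale: scores in half-bit units

s = log_odds_matrix(q, p, lam)

# Inverting the definition recovers the target frequencies,
# q_ij = p_i * p_j * exp(lam * s_ij), which must sum to 1.
total = sum(p[i] * p[j] * math.exp(lam * s[i][j])
            for i in range(2) for j in range(2))
```

Pairs that occur more often among related sequences than chance predicts (here the diagonal) receive positive scores, and the others receive negative scores.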

4 The BLOSUM-62 matrix (PNAS 89:10915-10919)
[Table: lower-triangular BLOSUM-62 scores, with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V. Only the first three rows survived extraction: A 4; R -1 5; N -2 0 6.]

5 Amino acid compositional bias
Some sources of bias:
Organismal bias
  AT-rich genomes tend to encode more of the amino acids F, L, I, N, K, Y and M.
  GC-rich genomes tend to encode more of the amino acids P, R, A, W and G.
Protein family bias
  Transmembrane proteins contain more hydrophobic residues.
  Cysteine-rich proteins contain more cysteines than usual.

6 Construction of an asymmetric log-odds substitution matrix
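Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$:
$$p_i = \sum_j q_{ij}, \qquad p'_j = \sum_i q_{ij}$$
The substitution scores are then defined as
$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i \, p'_j}$$
We call this matrix valid in the context of the $p_i$ and $p'_j$.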
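As a minimal sketch of this construction, the Python below takes a toy asymmetric set of target frequencies, derives both sets of background frequencies as marginal sums, and builds the scores. The function names and the 2x2 numbers are hypothetical, and the relative entropy is computed with the usual definition, H = Σ q_ij ln(q_ij / (p_i p'_j)), in nats.

```python
import math

def marginals(q):
    """Row sums give p_i, column sums give p'_j."""
    p = [sum(row) for row in q]
    p_prime = [sum(row[j] for row in q) for j in range(len(q[0]))]
    return p, p_prime

def asymmetric_scores(q, lam=1.0):
    """s[i][j] = ln(q[i][j] / (p[i] * p'[j])) / lam."""
    p, p_prime = marginals(q)
    return [[math.log(q[i][j] / (p[i] * p_prime[j])) / lam
             for j in range(len(p_prime))]
            for i in range(len(p))]

def relative_entropy(q):
    """H = sum_ij q_ij * ln(q_ij / (p_i * p'_j)), in nats."""
    p, p_prime = marginals(q)
    return sum(q[i][j] * math.log(q[i][j] / (p[i] * p_prime[j]))
               for i in range(len(p)) for j in range(len(p_prime)))

# Toy asymmetric target frequencies (illustrative values only).
q = [[0.40, 0.20],
     [0.10, 0.30]]
p, p_prime = marginals(q)      # approximately [0.6, 0.4] and [0.5, 0.5]
s = asymmetric_scores(q)       # lam = 1, so scores are in nats
H = relative_entropy(q)
```

Because the two sets of background frequencies differ, the resulting matrix is not symmetric: `s[0][1]` and `s[1][0]` take different values.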

7 Substitution matrix validity theorem
A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted) One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted)

8 Choosing new target frequencies
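Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?
Consistency constraints:
$$\sum_j Q_{ij} = P_i, \qquad \sum_i Q_{ij} = P'_j$$
Close to the original $q_{ij}$: minimize the relative entropy
$$D = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}}$$
Sometimes it is desirable to constrain the relative entropy
$$H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i \, P'_j}$$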
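For the case with no constraint on H, one way to picture the problem is iterative proportional fitting: rescale rows and columns of the original target frequencies in turn until both sets of marginal sums match the new background frequencies. This is a swapped-in textbook technique used here for illustration, not necessarily the optimization procedure the authors used. It is known to converge to the frequencies that satisfy the marginal constraints while minimizing relative entropy to the starting matrix, under the assumption that closeness is measured that way. The function name `adjust_targets` and all numbers are hypothetical.

```python
def adjust_targets(q, P, P_prime, iterations=200):
    """Iterative proportional fitting of q to new marginals P and P'.

    Assumes all entries of q are positive and that P and P_prime
    each sum to 1. Returns a new matrix; q is left unchanged.
    """
    n, m = len(P), len(P_prime)
    Q = [row[:] for row in q]
    for _ in range(iterations):
        # Rescale each row so that it sums to P[i].
        for i in range(n):
            row_sum = sum(Q[i])
            Q[i] = [x * P[i] / row_sum for x in Q[i]]
        # Rescale each column so that it sums to P_prime[j].
        for j in range(m):
            col_sum = sum(Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] *= P_prime[j] / col_sum
    return Q

# Toy example (illustrative values only).
q = [[0.45, 0.15],
     [0.15, 0.25]]
P = [0.7, 0.3]            # new background frequencies for rows
P_prime = [0.5, 0.5]      # new background frequencies for columns
Q = adjust_targets(q, P, P_prime)
```

Row and column rescaling leaves cross-product ratios such as Q00·Q11 / (Q01·Q10) unchanged, which is one sense in which the adjusted frequencies stay close to the original ones. The H-constrained modes need an additional constraint that this sketch does not handle.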

9 Substitution matrices compared
Mode A: Standard BLOSUM-62 matrix.
Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).
Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).
Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.

10 Performance evaluation (mode D vs. mode A)

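11 BLOSUM-62 and sequence-specific background frequencies
[Table: background frequencies of the 20 amino acids (A R N D C Q E G H I L K M F P S T W Y V) for BLOSUM-62, a P. falciparum sequence and an M. tuberculosis sequence. The numerical values did not survive extraction.]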

12 Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62 (P. falciparum)
[Table: 20 x 20 matrix with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V. Entries shown: score of the standard matrix subtracted from that of the adjusted one. The numerical values did not survive extraction.]

13 Optimal alignments implied by modes A and D
Mode A: bits (H = 0.51 nats) Mode D: bits (H = 0.51 nats) Mode C: bits (H = 0.44 nats)

14 Substitution matrices compared
Mode A: Standard BLOSUM-62 matrix.
Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).
Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).
Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.

15 Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)

16 Empirical rules for invoking compositional adjustment when comparing two sequences
1: The length ratio of the longer to the shorter sequence is less than 3.

17 One metric definition of distance between two composition vectors
(IEEE Trans. Info. Theo. 49: )
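The formula on this slide did not survive extraction. As a hedged sketch, the Python below assumes the distance is the square root of the Jensen-Shannon divergence between the two composition vectors, computed with natural logarithms. That choice is an assumption, and the normalization matters when comparing against any numerical threshold. The function name `composition_distance` is hypothetical.

```python
import math

def composition_distance(a, b):
    """Square root of the Jensen-Shannon divergence (natural log).

    a and b are frequency vectors of equal length, each summing to 1.
    ASSUMPTION: this is the metric intended on the slide.
    """
    js = 0.0
    for x, y in zip(a, b):
        m = (x + y) / 2            # midpoint composition
        if x > 0:
            js += 0.5 * x * math.log(x / m)
        if y > 0:
            js += 0.5 * y * math.log(y / m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(js, 0.0))

# Toy three-letter compositions (illustrative values only).
u = [0.5, 0.3, 0.2]
v = [0.2, 0.3, 0.5]
w = [0.1, 0.8, 0.1]
d_uv = composition_distance(u, v)
d_vw = composition_distance(v, w)
d_uw = composition_distance(u, w)
```

Under this definition the distance is zero for identical compositions, is symmetric, and reaches its maximum of sqrt(ln 2) for compositions with no letters in common. Being a metric, it also satisfies the triangle inequality, which is what allows distances to be treated as side lengths of a triangle.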

18 Empirical rules for invoking compositional adjustment when comparing two sequences
1: The length ratio of the longer to the shorter sequence is less than 3.
2: The distance d between the compositions of the two sequences is less than 0.16.

19 Law of cosines
In a triangle with sides of length a, b and c, the angle opposite the side of length c is
$$\theta = \arccos \frac{a^2 + b^2 - c^2}{2ab}$$

20 Empirical rules for invoking compositional adjustment when comparing two sequences
1: The length ratio of the longer to the shorter sequence is less than 3.
2: The distance d between the compositions of the two sequences is less than 0.16.
3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
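The three rules can be sketched in Python as follows. Several things here are assumptions: the distance is taken to be the square root of the Jensen-Shannon divergence, the angle is measured at the standard composition using the law of cosines on the three pairwise distances, and the rules are reported separately because the slides do not state how they are combined. All function names and the toy three-letter compositions are hypothetical.

```python
import math

def distance(a, b):
    """ASSUMED metric: square root of the Jensen-Shannon divergence."""
    js = 0.0
    for x, y in zip(a, b):
        m = (x + y) / 2
        if x > 0:
            js += 0.5 * x * math.log(x / m)
        if y > 0:
            js += 0.5 * y * math.log(y / m)
    return math.sqrt(max(js, 0.0))

def angle_opposite(a, b, c):
    """Law of cosines: angle in degrees opposite the side of length c."""
    if a == 0 or b == 0:
        raise ValueError("angle undefined when an adjacent side has length 0")
    cos_theta = (a * a + b * b - c * c) / (2 * a * b)
    # Clamp to [-1, 1] to absorb rounding error before calling acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def rule_checks(len1, len2, comp1, comp2, standard):
    """Evaluate each empirical rule separately; combining them is left to the caller."""
    ratio = max(len1, len2) / min(len1, len2)
    d = distance(comp1, comp2)
    theta = angle_opposite(distance(comp1, standard),
                           distance(comp2, standard), d)
    return {"length_ratio": ratio < 3,
            "distance": d < 0.16,
            "angle": theta < 70}

# Toy example (illustrative values only).
standard = [1 / 3, 1 / 3, 1 / 3]
checks = rule_checks(300, 120, [0.5, 0.3, 0.2], [0.45, 0.35, 0.2], standard)
```

In the toy example the two compositions are close to each other and deviate from the standard composition in a similar direction, so all three checks come out true.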

21 ROCn curves for Aravind set (NAR 29: 2994-3005)
22 ROCn curves for SCOP set (Proc IEEE 9: 1834-1847)

23 Future directions
Possibly less extensive use of SEG when compositional adjustment is invoked.
Application to PSI-BLAST.

