Some Statistical Methods For Detecting Clustering In Biological Sequences

Presentation transcript:

1 Some Statistical Methods For Detecting Clustering In Biological Sequences John L. Spouge (spouge@nih.gov), National Center for Biotechnology Information, Bldg. 45, Rm. 6AS 47J, NCBI, NLM, NIH, Bethesda MD 20894

2 Overview Clustering in bacterial genomes: Minimum distance statistic (with Fisher omnibus test); Kolmogorov-Smirnov tests; Scan tests; Local run (BLAST-like) tests using Poisson process (PP). Clustering of intergenic conservation: Hypergeometric test. Clustering of PSSM motifs: Chi-square for 1/0 “motifs”; Compound Poisson process (CPP) models for PSSM motifs; Local run (BLAST-like) tests for PSSM motifs using CPP.

3 Clustering in Bacterial Genomes IK Jordan et al (2001) Genome Res 11:555-565 Given a small gene family in several bacterial genomes, do its genes tend to cluster?

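4 Fisher Omnibus Test The Fisher omnibus test combines several weak one-sided continuous p-values $p_1, \dots, p_n$ to test the aggregate for significance. If the p-values are independent, then under the null hypothesis $-2\sum_{i=1}^{n}\ln p_i$ is chi-square with 2n degrees of freedom.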

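5 Fisher Omnibus Test For any one-sided continuous p-value $p$, $-\ln p$ is exponential (1) distributed under the null hypothesis. For $n$ independent p-values, $-\sum_{i=1}^{n}\ln p_i$ is gamma (1,n) distributed, so $-2\sum_{i=1}^{n}\ln p_i$ is chi-square with 2n degrees of freedom.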
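The combination rule on the two Fisher slides can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the function names and the example p-values are made up, and the p-values are assumed to be independent. It uses the closed-form upper tail of a chi-square distribution with an even number of degrees of freedom, so only the standard library is needed.

```python
import math


def chi2_sf_even_df(x, n):
    """Upper tail P(X > x) for a chi-square variable with 2n degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_omnibus(p_values):
    """Combine independent one-sided continuous p-values (all in (0, 1])."""
    n = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, n)


# Hypothetical example: four individually weak p-values.
stat, combined_p = fisher_omnibus([0.08, 0.11, 0.06, 0.09])
print(f"statistic = {stat:.3f}, combined p-value = {combined_p:.4f}")
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the chi-square tail.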

6 Minimum Distance S Karlin & HM Taylor (1981) A Second Course in Stochastic Processes, p. 132

7 B de Finetti (1964) Giornale Istituto Italiano degli Attuari 27:151 W Feller (1971) An Introduction to Probability Theory…, Vol. 2, p. 42 De Finetti’s Formula

8 Discrete Version

9 Special Cases

10 Minimum Distance Choose n distinct numbers from {1,2,…,t} such that the minimum distance between consecutive order statistics exceeds x ≥ 0. S Karlin & HM Taylor (1981) A Second Course in Stochastic Processes
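The discrete minimum-distance problem has a standard counting solution, sketched below under the assumption that "exceeds x" means every consecutive difference is at least x + 1. Removing x positions from each of the n − 1 gaps maps such a choice onto an ordinary n-subset of {1, …, t − (n − 1)x}, which gives a binomial coefficient. The function names are illustrative, and the brute-force counter is there only to check the formula on small cases.

```python
from itertools import combinations
from math import comb


def count_min_gap_exceeds(t, n, x):
    """Number of n-subsets of {1,...,t} whose consecutive gaps all exceed x."""
    free = t - (n - 1) * x  # positions left after removing x from each gap
    return comb(free, n) if free >= n else 0


def prob_min_gap_exceeds(t, n, x):
    """Probability that a uniformly random n-subset has minimum gap > x."""
    return count_min_gap_exceeds(t, n, x) / comb(t, n)


def brute_force_count(t, n, x):
    """Direct enumeration, feasible only for small t."""
    return sum(
        1
        for subset in combinations(range(1, t + 1), n)
        if all(b - a > x for a, b in zip(subset, subset[1:]))
    )


# Hypothetical example: 4 genes among 30 positions, minimum gap above 3.
print(prob_min_gap_exceeds(30, 4, 3))
```

A clustering p-value for an observed minimum distance d would then be 1 − prob_min_gap_exceeds(t, n, d), the chance that a random placement has a minimum distance no larger than d.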

11 Threading Configurations [Figure: example configuration with labels 4, 4, 11, 19, 28]

12 Clustering in Bacterial Genomes Given a large gene family in one bacterial genome, do its genes tend to cluster? IK Jordan et al (2001) Genome Res 11:555-565

13 Kolmogorov-Smirnov Tests The Kolmogorov-Smirnov test examines whether $X_1, \dots, X_n$ come from distribution function $F$. Are $X_1, \dots, X_n$ uniformly distributed? M Kendall & A Stuart (1979) The Advanced Theory of Statistics, Vol. 2, p. 476

14 Kolmogorov-Smirnov Tests Are $X_1, \dots, X_n$ uniformly distributed? Plot

15 Kolmogorov-Smirnov Tests L Breiman (1992) Probability where is exponential (1) distributed

16 Kolmogorov-Smirnov Tests Are $X_1, \dots, X_n$ uniformly distributed?
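A one-sample Kolmogorov-Smirnov test against the uniform distribution can be sketched as follows. The sketch assumes gene positions have been rescaled to the interval [0, 1] by dividing by the genome length, and it uses the asymptotic Kolmogorov series for the p-value, which is only approximate for small samples. The function names, gene positions and genome length are hypothetical.

```python
import math


def ks_uniform_statistic(xs):
    """Two-sided KS distance between the sample and Uniform(0, 1)."""
    xs = sorted(xs)
    n = len(xs)
    # Largest gap with the empirical distribution function above, then below, F(x) = x.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


def kolmogorov_tail(d, n, terms=100):
    """Asymptotic P(D_n > d) from the limiting Kolmogorov distribution."""
    z = math.sqrt(n) * d
    if z < 0.2:  # the tail is essentially 1 this close to zero
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


# Hypothetical gene start positions on a genome of 1,000,000 bases.
genome_length = 1_000_000
positions = [101_000, 108_500, 112_000, 119_300, 125_000, 640_000, 870_000]
scaled = [p / genome_length for p in positions]
d = ks_uniform_statistic(scaled)
print(d, kolmogorov_tail(d, len(scaled)))
```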

17 Clustering in Bacterial Genomes Given a large gene family in one linear genome, do its genes tend to cluster?

18 Clustering in Bacterial Genomes where is exponential (1) distributed Given a large gene family in one circular genome, do its genes tend to cluster? IK Jordan et al (2001) Genome Res 11:555-565

19 Clustering in Bacterial Genomes where is exponential (1) distributed are approximately exponential (n) distributed
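The exponential (n) approximation can be checked numerically, assuming the slide refers to the gaps between neighbouring genes when n genes are placed uniformly at random on a circular genome of unit length. Under that assumption a single gap has exact tail probability (1 − x)^(n − 1), which is close to the exponential tail exp(−nx) when n is large. The function names below are illustrative.

```python
import math


def spacing_tail_exact(x, n):
    """P(gap > x) for one gap between n uniform points on a unit circle."""
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    return (1.0 - x) ** (n - 1)


def spacing_tail_exponential(x, n):
    """Tail of an exponential distribution with rate n (mean 1/n)."""
    return math.exp(-n * max(x, 0.0))


# Compare the two tails for a family of 200 genes.
for x in (0.002, 0.005, 0.01, 0.02):
    print(x, spacing_tail_exact(x, 200), spacing_tail_exponential(x, 200))
```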

20 Clustering in Bacterial Genomes Given a large gene family in one circular genome, do its genes tend to cluster? IK Jordan et al (2001) Genome Res 11:555-565

21 Clustering in Bacterial Genomes Given a set of restriction sites in a genome, do the sites tend to cluster? S Karlin & C Macken (1991) J Amer Stat Assoc 86:27-35 [Figure: r-scan for r = 3, showing the kth minimum and the kth maximum in an r-scan]
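An r-scan is the sum of r consecutive spacings between sorted sites, so its extremes are easy to compute. The sketch below assumes a linear genome and made-up restriction-site positions, and the function names are illustrative. It only extracts the kth smallest and kth largest r-scans; assessing their significance would require the distributional results cited on the slides.

```python
def r_scans(positions, r):
    """Sums of r consecutive spacings between sorted sites (linear genome)."""
    pts = sorted(positions)
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]


def kth_extremes(positions, r, k):
    """Return (kth minimum, kth maximum) of the r-scans, with k starting at 1."""
    scans = sorted(r_scans(positions, r))
    return scans[k - 1], scans[-k]


# Hypothetical restriction-site coordinates.
sites = [10, 12, 13, 40, 90, 95, 100]
print(r_scans(sites, 3))
print(kth_extremes(sites, 3, 1))
```

An unusually small kth minimum points to clustering of sites, while an unusually large kth maximum points to over-dispersion.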

22 Clustering in Bacterial Genomes A Dembo & S Karlin (1992) Ann Appl Prob 2:329-357 C Chen & S Karlin (2000) J Appl Prob 37:865-880

23 Clustering of Conservation [Legend: conserved nucleotide, non-conserved nucleotide] After accounting for edge effects, could uniformly random conserved and non-conserved nucleotides be as clustered as the data from intergenic regions? Alternative with some very long conserved clusters: a scan or local run test is powerful against this alternative. Alternative with many short conserved clusters: the hypergeometric test offers more power against this alternative.

24 Clustering of Conservation Given m conserved positions and n non-conserved positions, calculate the probability that exactly k of the conserved positions are followed by a non-conserved position. Extreme cases: k = 0 or 1 corresponds to complete separation of conserved and non-conserved positions; k = min{m,n} corresponds to complete mixing.

25 Clustering of Conservation Given m conserved positions and n non-conserved positions, calculate the probability that exactly k of the conserved positions are followed by a non-conserved position. Hypergeometric Distribution: $P(k) = \binom{m}{k}\binom{n}{k} / \binom{m+n}{m}$

26 Clustering of Conservation Count the number of ways of placing m conserved positions (1) and n non-conserved positions (0) so that exactly k of the conserved positions are followed by a non-conserved position (10). Example: 0110001001110011. Equivalently, count the number of ways of placing k 10’s, n − k 0’s, and m − k 1’s so that none of the 1’s is followed by a 0.

27 Clustering of Conservation Count the number of ways of placing k 10’s, n − k 0’s, and m − k 1’s so that no 0 follows a 1. Place the k 10’s and m − k 1’s in arbitrary order (for example, 110 10 1110 11). Then place the n − k 0’s in k+1 bins (for example, 0110001001110011).
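The counting argument on the last two slides can be checked directly. Ordering the k 10’s and m − k 1’s gives C(m, k) arrangements, and dropping the n − k remaining 0’s into k + 1 bins gives C(n, k) arrangements, so the count is C(m, k)·C(n, k) out of C(m + n, m) sequences. The sketch below is an illustration with made-up function names. The p-value direction (few 10 boundaries indicating clustering) is an assumption drawn from the extreme cases listed earlier.

```python
from itertools import combinations
from math import comb


def count_sequences(m, n, k):
    """Arrangements of m ones and n zeros with exactly k occurrences of '10'."""
    return comb(m, k) * comb(n, k)


def prob_k_boundaries(m, n, k):
    """Probability of exactly k '10' boundaries under uniform random placement."""
    return count_sequences(m, n, k) / comb(m + n, m)


def p_value_clustered(m, n, k_observed):
    """P(K <= k_observed): few '10' boundaries means strong clustering."""
    return sum(prob_k_boundaries(m, n, k) for k in range(k_observed + 1))


def brute_force_count(m, n, k):
    """Direct enumeration, feasible only for small m + n."""
    total = 0
    for ones in combinations(range(m + n), m):
        seq = ["0"] * (m + n)
        for i in ones:
            seq[i] = "1"
        if "".join(seq).count("10") == k:
            total += 1
    return total


# The example sequence from the slides: m = 8, n = 8, k = 3.
example = "0110001001110011"
m, n, k = example.count("1"), example.count("0"), example.count("10")
print(m, n, k, prob_k_boundaries(m, n, k), p_value_clustered(m, n, k))
```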

28 PSSM Motif Clustering Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals? For 1/0 signals, a χ² test suffices.
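One way to set up the χ² test for 1/0 signals is sketched below. The slide does not specify the form of the test, so this assumes a goodness-of-fit comparison in which each piece is expected to carry signals in proportion to its length. The function name, counts and lengths are hypothetical, and the p-value would come from a chi-square table or library with the returned degrees of freedom.

```python
def chi_square_homogeneity(signal_counts, piece_lengths):
    """Chi-square statistic for signal counts against length-proportional expectations.

    Returns (statistic, degrees_of_freedom).
    """
    total_signals = sum(signal_counts)
    total_length = sum(piece_lengths)
    statistic = 0.0
    for observed, length in zip(signal_counts, piece_lengths):
        expected = total_signals * length / total_length
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(signal_counts) - 1


# Hypothetical data: signal counts in four pieces of DNA and the piece lengths.
counts = [3, 4, 15, 2]
lengths = [1000, 1200, 1100, 900]
print(chi_square_homogeneity(counts, lengths))
```

When signals are rare relative to piece length, this statistic is close to the full contingency-table statistic that also counts positions without a signal.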

29 PSSM Motif Clustering Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals? Consider the strength of the signal. M Frith & Zhiping Weng (2001)

30 PSSM Motif Clustering Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals? (1) Assume an independent, identically distributed DNA base composition. (2) Assume the PSSM signals, appropriately truncated, follow a compound Poisson process with parameters ( , ). S Schbath et al. (1998) J Comp Biol 5:223-253

31 PSSM Motif Clustering compound Poisson process ( , ) “time” Tail probability can be calculated by small sample asymptotics. cumulant generating function of sum of signals
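For a compound Poisson process the cumulant generating function of the summed signal has a simple closed form, K(θ) = rate × length × (E[exp(θX)] − 1), where X is the strength of a single signal. The sketch below uses that function to compute a Chernoff upper bound on the tail probability. This is a cruder stand-in for the small sample (saddlepoint) asymptotics named on the slide, not the method itself. The parameter names (rate, length, jumps), the function names and the example values are assumptions, since the slide's own symbols did not survive in the transcript. Signal strengths are assumed to be positive.

```python
import math


def cgf(theta, rate, length, jumps):
    """K(theta) = rate * length * (E[exp(theta * X)] - 1) for jump sizes X.

    `jumps` maps each possible signal strength to its probability.
    """
    mgf = sum(p * math.exp(theta * x) for x, p in jumps.items())
    return rate * length * (mgf - 1.0)


def cgf_derivative(theta, rate, length, jumps):
    """K'(theta); at theta = 0 this is the mean of the summed signal."""
    return rate * length * sum(p * x * math.exp(theta * x) for x, p in jumps.items())


def chernoff_tail_bound(s, rate, length, jumps):
    """Upper bound on P(S >= s); informative only when s exceeds the mean of S."""
    if s <= cgf_derivative(0.0, rate, length, jumps):
        return 1.0
    lo, hi = 0.0, 1.0
    while cgf_derivative(hi, rate, length, jumps) < s:
        hi *= 2.0
    for _ in range(100):  # bisection for the theta that solves K'(theta) = s
        mid = 0.5 * (lo + hi)
        if cgf_derivative(mid, rate, length, jumps) < s:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return math.exp(cgf(theta, rate, length, jumps) - theta * s)


# Hypothetical example: 0.002 signals per base over 1000 bases, strengths 1 to 3.
strengths = {1.0: 0.6, 2.0: 0.3, 3.0: 0.1}
print(chernoff_tail_bound(12.0, 0.002, 1000, strengths))
```

A saddlepoint approximation would start from the same solution of K′(θ) = s and add a correction term, typically giving a much sharper estimate than this bound.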

32 PSSM Motif Clustering Given different PSSM signals in a piece of DNA, are any of the signals unusually concentrated? (1) Assume an independent, identically distributed DNA base composition. (2) Assume the PSSM signals, appropriately truncated, follow a compound Poisson process with parameters ( , ).

33 PSSM Motif Clustering

34 Alignment Matrices

35 Alignment Score Renewal [Figure: Local Alignment Score on a Single Diagonal. A renewal occurs when the local score falls below 0; the renewal length is random.]

36 Alignment Score Success [Figure: Local Alignment Score on a Single Diagonal. A success occurs when the local score rises above y; label: probability of success.]

37 HSP Poisson Distribution S Karlin & A Dembo (1992) Adv Appl Prob 24:113
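The Poisson approximation for high-scoring segment pairs (HSPs) is usually applied through the Karlin-Altschul formula, in which the expected number of HSPs scoring at least y in a comparison of sequences of lengths m and n is K·m·n·exp(−λy). The sketch below assumes that this is the form intended on the slide. The function names are illustrative, and the values of K and λ are placeholders rather than fitted parameters.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score` (Karlin-Altschul form)."""
    return K * m * n * math.exp(-lam * score)


def prob_at_least_one_hsp(score, m, n, K, lam):
    """Poisson approximation: P(at least one HSP) = 1 - exp(-E)."""
    return -math.expm1(-expected_hsps(score, m, n, K, lam))


def prob_exactly(count, score, m, n, K, lam):
    """Poisson probability of exactly `count` HSPs at or above the score."""
    e = expected_hsps(score, m, n, K, lam)
    return math.exp(-e) * e ** count / math.factorial(count)


# Placeholder parameters, for illustration only.
print(prob_at_least_one_hsp(score=40, m=300, n=1_000_000, K=0.1, lam=0.3))
```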

38 Finite-Size Effect [Figure: Local Alignment Score on a Single Diagonal. A success occurs when the local score rises above y; label: expected time to success.]

39 S Altschul & W Gish (1996) Methods Enzymology 266 Finite-Size Effect

40 PSSM Motif Clustering

41 Summary of Techniques Clustering in bacterial genomes: Minimum distance statistic (with Fisher omnibus test); Kolmogorov-Smirnov tests; Scan tests; Local run (BLAST-like) tests using Poisson process (PP). Clustering of intergenic conservation: Hypergeometric test. Clustering of PSSM motifs: Chi-square for 1/0 “motifs”; Small sample asymptotic methods for PSSM motifs; Local run (BLAST-like) tests for PSSM motifs using compound PP.
