1
Some Statistical Methods For Detecting Clustering In Biological Sequences
John L. Spouge (spouge@nih.gov)
National Center for Biotechnology Information
Bldg. 45, Rm. 6AS 47J, NCBI, NLM, NIH, Bethesda MD 20894
2
Overview
Clustering in bacterial genomes
    Minimum distance statistic (with Fisher omnibus test)
    Kolmogorov-Smirnov tests
    Scan tests
    Local run (BLAST-like) tests using Poisson process (PP)
Clustering of intergenic conservation
    Hypergeometric test
Clustering of PSSM motifs
    Chi-square for 1/0 “motifs”
    Compound Poisson process (CPP) models for PSSM motifs
    Local run (BLAST-like) tests for PSSM motifs using CPP
3
Clustering in Bacterial Genomes
Given a small gene family in several bacterial genomes, do its genes tend to cluster?
IK Jordan et al (2001) Genome Res 11:555-565
4
Fisher Omnibus Test
The Fisher omnibus test combines several weak, independent, one-sided continuous p-values $p_1, \dots, p_n$ to test the aggregate for significance.
Under the null hypothesis, $-2\sum_{i=1}^{n} \ln p_i$ is chi-square with 2n degrees of freedom.
5
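Fisher Omnibus Test
For any one-sided continuous p-value $p$, $-\ln p$ is exponential (1) distributed.
For independent p-values $p_1, \dots, p_n$, $-\sum_{i=1}^{n} \ln p_i$ is gamma (1,n) distributed.
Hence $-2\sum_{i=1}^{n} \ln p_i$ is chi-square with 2n degrees of freedom.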
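The combination rule on this slide can be sketched in a few lines of Python. This is a minimal sketch, assuming the p-values are independent; the function name `fisher_omnibus` is an illustrative choice, and the tail probability uses the standard closed form for a gamma distribution with integer shape n (equivalently chi-square with 2n degrees of freedom).

```python
import math

def fisher_omnibus(p_values):
    """Combine independent one-sided continuous p-values (Fisher's method).

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    and combined_p is its chi-square tail with 2n degrees of freedom.
    """
    n = len(p_values)
    if n == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p-value, each in (0, 1]")
    # half = -sum(ln p_i) is gamma(1, n) distributed under the null.
    half = -sum(math.log(p) for p in p_values)
    # Gamma(1, n) survival function: exp(-h) * sum_{k=0}^{n-1} h^k / k!
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return 2.0 * half, math.exp(-half) * total
```

For example, two weak p-values of 0.05 each combine to roughly 0.0175, which is stronger evidence than either p-value alone.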
6
Minimum Distance
S Karlin & HM Taylor (1981) A Second Course in Stochastic Processes, p. 132
7
De Finetti’s Formula
B de Finetti (1964) Giornale Istituto Italiano degli Attuari 27:151
W Feller (1971) An Introduction to Probability Theory…, Vol. 2, p. 42
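For reference, a standard form of de Finetti's formula (given here as an assumption about what the slide displayed) concerns n points placed independently and uniformly on a circle of circumference 1. The n arcs $L_1, \dots, L_n$ between neighbouring points satisfy $P(L_1 > x_1, \dots, L_n > x_n) = (1 - x_1 - \dots - x_n)_+^{n-1}$. Setting every $x_i = x$ gives the tail of the minimum distance. The sketch below uses hypothetical function names.

```python
def prob_min_arc_exceeds(n, x):
    """P(all n arcs between n uniform points on a unit circle exceed x).

    Uses de Finetti's formula with x_1 = ... = x_n = x,
    which gives (1 - n*x)_+ ** (n - 1).
    """
    if n < 2:
        raise ValueError("need at least two points")
    if x < 0:
        raise ValueError("x must be non-negative")
    return max(0.0, 1.0 - n * x) ** (n - 1)

def min_distance_p_value(n, observed_min):
    """P(minimum arc <= observed_min): small when points are unusually close."""
    return 1.0 - prob_min_arc_exceeds(n, observed_min)
```

One p-value of this kind per genome could then be fed into the Fisher omnibus test, in line with the pairing named in the overview.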
8
Discrete Version
9
Special Cases
10
Minimum Distance
Choose n distinct numbers from {1,2,…,t} such that the minimum distance between consecutive order statistics exceeds x ≥ 0.
S Karlin & HM Taylor (1981) A Second Course in Stochastic Processes
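The discrete count can be checked with a short sketch. If the chosen numbers are $a_1 < \dots < a_n$ with every gap $a_{i+1} - a_i > x$, then the shifted values $b_i = a_i - (i-1)x$ form an ordinary n-subset of $\{1, \dots, t - (n-1)x\}$, so the count is a single binomial coefficient. The function names below are illustrative, and the brute-force routine is there only to confirm the closed form on small inputs.

```python
from itertools import combinations
from math import comb

def count_spread_subsets(t, n, x):
    """Number of n-subsets of {1..t} whose consecutive gaps all exceed x."""
    reduced = t - (n - 1) * x
    return comb(reduced, n) if reduced >= 0 else 0

def prob_min_distance_exceeds(t, n, x):
    """P(minimum gap > x) for a uniformly random n-subset of {1..t}."""
    return count_spread_subsets(t, n, x) / comb(t, n)

def brute_force_count(t, n, x):
    """Direct enumeration, feasible only for small t and n."""
    return sum(
        1
        for c in combinations(range(1, t + 1), n)
        if all(b - a > x for a, b in zip(c, c[1:]))
    )
```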
11
Threading Configurations
[Figure labels: 4, 4, 11, 19, 28]
12
Clustering in Bacterial Genomes
Given a large gene family in one bacterial genome, do its genes tend to cluster?
IK Jordan et al (2001) Genome Res 11:555-565
13
Kolmogorov-Smirnov Tests
The Kolmogorov-Smirnov test examines whether a sample $X_1, \dots, X_n$ comes from a given distribution function $F$.
Equivalently: are $F(X_1), \dots, F(X_n)$ uniformly distributed?
M Kendall & A Stuart (1979) The Advanced Theory of Statistics, Vol. 2, p. 476
14
Kolmogorov-Smirnov Tests
Are the sample values uniformly distributed?
[Plot]
15
Kolmogorov-Smirnov Tests
where is exponential (1) distributed
L Breiman (1992) Probability
16
Kolmogorov-Smirnov Tests
Are the sample values uniformly distributed?
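A uniformity check of this kind can be sketched as follows. This is a minimal sketch, assuming the values have already been mapped into [0, 1] (for gene positions, for instance by dividing by the genome length, which is an assumption rather than something stated on the slides). The p-value uses the asymptotic Kolmogorov series, so it is only approximate for small samples, and `ks_uniform` is a hypothetical name.

```python
import math

def ks_uniform(sample):
    """Two-sided Kolmogorov-Smirnov test of `sample` against Uniform(0, 1).

    Returns (D, p) where D is the largest gap between the empirical and
    uniform distribution functions and p is an asymptotic p-value.
    """
    u = sorted(sample)
    n = len(u)
    if n == 0 or u[0] < 0.0 or u[-1] > 1.0:
        raise ValueError("sample must be non-empty with values in [0, 1]")
    d_plus = max((i + 1) / n - v for i, v in enumerate(u))
    d_minus = max(v - i / n for i, v in enumerate(u))
    d = max(d_plus, d_minus)
    z = math.sqrt(n) * d
    if z < 0.2:
        # The limiting tail probability is indistinguishable from 1 here,
        # and the alternating series converges poorly for tiny z.
        return d, 1.0
    # Kolmogorov limit: P(sqrt(n) * D > z) -> 2 * sum (-1)^(k-1) exp(-2 k^2 z^2)
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z) for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))
```

A sample crowded into one end of the interval yields a large D and a small p-value, which is the signature of clustering in this test.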
17
Clustering in Bacterial Genomes
Given a large gene family in one linear genome, do its genes tend to cluster?
18
Clustering in Bacterial Genomes
Given a large gene family in one circular genome, do its genes tend to cluster?
where is exponential (1) distributed
IK Jordan et al (2001) Genome Res 11:555-565
19
Clustering in Bacterial Genomes
where is exponential (1) distributed
are approximately exponential (n) distributed
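A standard representation that fits the surviving fragments is sketched below. It is offered as an assumption about the displayed equations, and the symbols $X_i$, $S_k$, $U_{(k)}$ and $D_k$ are introduced here for illustration. Let $X_1, X_2, \dots$ be independent exponential (1) variables with partial sums $S_k$. The order statistics of n uniform points on a linear genome of unit length, and the n arcs cut by n uniform points on a circular genome of unit circumference, can then be written as:

```latex
\begin{align*}
  S_k &= X_1 + \dots + X_k, \\
  \bigl(U_{(1)}, \dots, U_{(n)}\bigr)
    &\overset{d}{=} \Bigl(\tfrac{S_1}{S_{n+1}}, \dots, \tfrac{S_n}{S_{n+1}}\Bigr)
    && \text{(linear genome)}, \\
  \bigl(D_1, \dots, D_n\bigr)
    &\overset{d}{=} \Bigl(\tfrac{X_1}{S_n}, \dots, \tfrac{X_n}{S_n}\Bigr)
    && \text{(circular genome)}, \\
  D_k &\approx \tfrac{X_k}{n}
    && \text{since } S_n / n \to 1 .
\end{align*}
```

The last line says that each arc is approximately exponential with rate n (mean 1/n) when n is large.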
20
Clustering in Bacterial Genomes
Given a large gene family in one circular genome, do its genes tend to cluster?
IK Jordan et al (2001) Genome Res 11:555-565
21
Clustering in Bacterial Genomes
Given a set of restriction sites in a genome, do the sites tend to cluster?
[Figure labels: r-scan for r = 3; kth minimum in an r-scan; kth maximum in an r-scan]
S Karlin & C Macken (1991) J Amer Stat Assoc 86:27-35
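The r-scan statistics named on this slide can be sketched as follows. An r-scan is taken here to be the sum of r consecutive spacings, which is the distance from site i to site i + r. The sketch assumes a linear genome (a circular one would need wrap-around), and the function names are illustrative.

```python
def r_scans(positions, r):
    """Distances between each site and the site r places further along."""
    pts = sorted(positions)
    if r < 1 or r >= len(pts):
        raise ValueError("need 1 <= r < number of sites")
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]

def kth_minimum(values, k):
    """k-th smallest r-scan: unusually small values suggest clustering."""
    return sorted(values)[k - 1]

def kth_maximum(values, k):
    """k-th largest r-scan: unusually large values suggest over-dispersion."""
    return sorted(values, reverse=True)[k - 1]
```

With sites at 0, 1, 3, 6 and 10, the 2-scans are 3, 5 and 7, and the 3-scans are 6 and 9.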
22
Clustering in Bacterial Genomes
A Dembo & S Karlin (1992) Ann Appl Prob 2:329-357
C Chen & S Karlin (2000) J Appl Prob 37:865-880
23
Clustering of Conservation
[Legend: conserved nucleotide, non-conserved nucleotide]
After accounting for edge effects, could uniformly random conserved and non-conserved nucleotides be as clustered as the data from intergenic regions?
Alternative with some very long conserved clusters: the scan or local run test is powerful against this alternative.
Alternative with many short conserved clusters: the hypergeometric test offers more power against this alternative.
24
Clustering of Conservation
[Legend: conserved nucleotide, non-conserved nucleotide]
Given m conserved positions and n non-conserved positions, calculate the probability that exactly k of the conserved positions are followed by a non-conserved position.
Extreme Cases
k = 0 or 1 corresponds to complete separation of conserved and non-conserved positions
k = min{m,n} corresponds to complete mixing
25
Clustering of Conservation
[Legend: conserved nucleotide, non-conserved nucleotide]
Given m conserved positions and n non-conserved positions, calculate the probability that exactly k of the conserved positions are followed by a non-conserved position.
Hypergeometric Distribution
26
Clustering of Conservation
[Legend: conserved nucleotide, non-conserved nucleotide]
Count the number of ways of placing m conserved positions (1) and n non-conserved positions (0) so that exactly k of the conserved positions are followed by a non-conserved position (10).
Example: 0110001001110011
Equivalently, count the number of ways of placing k 10’s, n − k 0’s, and m − k 1’s so that none of the 1’s is followed by a 0.
27
Clustering of Conservation
Count the number of ways of placing k 10’s, n − k 0’s, and m − k 1’s so that no 0 follows a 1.
Example: 0110001001110011
Place the k 10’s and m − k 1’s in arbitrary order: 110 10 1110 11
Place the n − k 0’s in k+1 bins: 0.00.0.0
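The two placement steps multiply out to a hypergeometric probability, which the sketch below computes and checks by enumeration. Ordering k blocks "10" among m − k bare 1's can be done in C(m, k) ways. Dropping the remaining n − k 0's into k + 1 bins (the start of the sequence, or directly after any "10") can be done in C(n, k) ways, and there are C(m + n, m) arrangements in total. The function names are illustrative, and the uniform-arrangement null model is the one stated on the slides.

```python
from itertools import combinations
from math import comb

def prob_k_transitions(m, n, k):
    """P(exactly k of the m ones are immediately followed by a zero)
    in a uniformly random arrangement of m ones and n zeros."""
    if not 0 <= k <= min(m, n):
        return 0.0
    return comb(m, k) * comb(n, k) / comb(m + n, m)

def brute_force_counts(m, n):
    """Enumerate every arrangement and tally the number of '10' pairs."""
    counts = {}
    for ones in combinations(range(m + n), m):
        seq = ["0"] * (m + n)
        for i in ones:
            seq[i] = "1"
        k = "".join(seq).count("10")
        counts[k] = counts.get(k, 0) + 1
    return counts
```

A small observed k relative to this distribution indicates that conserved positions sit in fewer, longer runs than chance would give.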
28
PSSM Motif Clustering
Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals?
For 1/0 signals, a χ² test suffices.
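For 1/0 signals the test statistic can be sketched as below. The sketch assumes that, under the null hypothesis, the expected number of signals in each piece is proportional to the piece's length; that choice and the function name are assumptions, not something the slide specifies. The statistic is referred to a chi-square distribution with one fewer degrees of freedom than the number of pieces.

```python
def chi_square_signal_counts(counts, lengths):
    """Pearson chi-square statistic for signal counts across DNA pieces.

    counts[i]  : number of 1/0 signals observed in piece i
    lengths[i] : length of piece i (sets the expected share of signals)
    Returns (statistic, degrees_of_freedom).
    """
    if len(counts) != len(lengths) or len(counts) < 2:
        raise ValueError("need matching counts and lengths for >= 2 pieces")
    if any(length <= 0 for length in lengths):
        raise ValueError("piece lengths must be positive")
    total = sum(counts)
    total_length = sum(lengths)
    expected = [total * length / total_length for length in lengths]
    statistic = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    return statistic, len(counts) - 1
```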
29
PSSM Motif Clustering
Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals?
Consider the strength of the signal.
M Frith & Zhiping Weng (2001)
30
PSSM Motif Clustering
Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals?
(1) Assume an independent, identically distributed DNA base composition.
(2) Assume the PSSM signals, appropriately truncated, follow a compound Poisson process with parameters ( , ).
S Schbath et al. (1998) J Comp Biol 5:223-253
31
PSSM Motif Clustering
[Figure labels: compound Poisson process ( , ); “time”; cumulant generating function of sum of signals]
Tail probability can be calculated by small sample asymptotics.
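The cumulant generating function mentioned here has a simple closed form for a compound Poisson process. The symbols below are assumptions introduced for illustration, since the slide's own parameter symbols did not survive: $\lambda$ is the rate of signal occurrences, $X$ is a single signal strength with distribution $F$, $t$ is the length ("time") of the DNA piece, and $S(t)$ is the sum of signal strengths in the piece.

```latex
\begin{align*}
  K(\theta) &= \ln \mathbb{E}\, e^{\theta S(t)}
             = \lambda t \,\bigl( \mathbb{E}\, e^{\theta X} - 1 \bigr), \\
  K'(\hat\theta) &= s
    \quad \text{(saddlepoint equation for a threshold } s > \mathbb{E}\, S(t) \text{)}, \\
  P\bigl( S(t) \ge s \bigr) &\le \exp\bigl( K(\hat\theta) - \hat\theta s \bigr).
\end{align*}
```

The last line is only the exponential (Chernoff) bound; small sample asymptotic methods refine it with correction factors built from the derivatives of $K$ at $\hat\theta$.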
32
PSSM Motif Clustering
Given different PSSM signals in a piece of DNA, are any of the signals unusually concentrated?
(1) Assume an independent, identically distributed DNA base composition.
(2) Assume the PSSM signals, appropriately truncated, follow a compound Poisson process with parameters ( , ).
33
PSSM Motif Clustering
34
Alignment Matrices
35
Alignment Score Renewal
Local Alignment Score on a Single Diagonal
[Figure labels: renewal when the local score falls below 0; random renewal length]
36
Alignment Score Success
Local Alignment Score on a Single Diagonal
[Figure labels: success when the local score rises above y; success probability]
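The renewal and success events on a single diagonal can be sketched as follows. The running local score is accumulated position by position, a renewal occurs whenever it drops below 0 (the score restarts from 0), and a cycle counts as a success if its peak rises above a threshold y. The input is assumed to be a list of per-position substitution scores along one diagonal; that input format and the function names are hypothetical.

```python
def excursion_maxima(scores):
    """Peak local score within each renewal cycle along one diagonal.

    A cycle ends (renewal) when the running score drops below 0;
    a trailing unfinished cycle is reported as well.
    """
    maxima = []
    running, peak, in_cycle = 0, 0, False
    for s in scores:
        running += s
        in_cycle = True
        if running < 0:
            # Renewal: record this cycle's peak and restart from 0.
            maxima.append(peak)
            running, peak, in_cycle = 0, 0, False
        else:
            peak = max(peak, running)
    if in_cycle:
        maxima.append(peak)
    return maxima

def count_successes(scores, y):
    """Number of renewal cycles whose local score rises above y."""
    return sum(1 for peak in excursion_maxima(scores) if peak > y)
```

Because cycles are independent under an i.i.d. model and each succeeds with a small probability, the number of successes over a long diagonal is approximately Poisson.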
37
HSP Poisson Distribution
S Karlin & A Dembo (1992) Adv Appl Prob 24:113
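A widely used form of this Poisson approximation is sketched below as an assumption about what the slide displayed. For two sequences of lengths m and n, the number of high-scoring segment pairs (HSPs) with score at least y is treated as Poisson with mean $K m n e^{-\lambda y}$. The constants K and λ depend on the scoring system and letter frequencies and must be supplied by the caller; the function names are illustrative.

```python
import math

def hsp_expected_count(K, lam, m, n, y):
    """Expected number of HSPs scoring at least y (Poisson mean)."""
    return K * m * n * math.exp(-lam * y)

def hsp_p_value(K, lam, m, n, y):
    """P(at least one HSP scores >= y) under the Poisson approximation."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-hsp_expected_count(K, lam, m, n, y))
```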
38
Finite-Size Effect
Local Alignment Score on a Single Diagonal
[Figure labels: success when the local score rises above y; expected time to success]
39
Finite-Size Effect
S Altschul & W Gish (1996) Methods Enzymology 266
40
PSSM Motif Clustering
41
Summary of Techniques
Clustering in bacterial genomes
    Minimum distance statistic (with Fisher omnibus test)
    Kolmogorov-Smirnov tests
    Scan tests
    Local run (BLAST-like) tests using Poisson process (PP)
Clustering of intergenic conservation
    Hypergeometric test
Clustering of PSSM motifs
    Chi-square for 1/0 “motifs”
    Small sample asymptotic methods for PSSM motifs
    Local run (BLAST-like) tests for PSSM motifs using compound PP