1 1. Protein structure study via residue environment: Residue Solvent Accessibility Environment in the Globins Protein Family
2. Statistical linguistic study of DNA sequences *
Ka Lok Ng, Department of Information Management, Ling Tung College
* In collaboration with S.P. Li, Institute of Physics, Academia Sinica
2 Statistical linguistic study of DNA sequences
1. Linguistic study models: Zipf law and compound Poisson distribution
2. Compound Poisson distribution study of the Fortran language and DNA sequences
3. Entropic segmentation method
4. Compound Poisson distribution study of the DNA segments
3 Statistical linguistic study of DNA sequences
Zipf law
The Zipf law states that $r f = C$, where $r$ is the rank of a word, $f$ is the frequency of occurrence of the word, and $C$ is a constant that depends on the text being analyzed. The relation is linear in a double logarithmic plot, with a slope of approximately $-1$ for all languages studied.
DNA sequence studies: coding and non-coding regions (mammals, invertebrates, eukaryotic viruses, bacteria)
Reference: Mantegna, R.N., S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H.E. Stanley, "Linguistic Features of Noncoding DNA Sequences," Physical Review Letters 73, no. 23, p. 3169-3172 (1994).
Sequence types: Zipf analysis of 6-tuples of mammalian, invertebrate, yeast chromosome III, eukaryotic virus, prokaryotic, and bacterial DNA sequences.
Results: They found that non-coding sequences have a consistently larger slope, suggesting that non-coding sequences bear more resemblance to a natural language than coding sequences do.
[Figure: double logarithmic plot of frequency f against rank r]
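The Zipf analysis of n-tuples described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of the cited paper: the function names `kmer_rank_frequency` and `zipf_slope` are hypothetical, the use of overlapping windows shifted by one base is an assumption, and the slope is taken from a plain least-squares fit over all ranks.

```python
import math
from collections import Counter


def kmer_rank_frequency(seq, k=6):
    """Count overlapping k-tuples ("words") in seq and return their
    frequencies sorted from most to least common (rank 1 first)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sorted(counts.values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log10(f) against log10(rank).

    A slope close to -1 corresponds to the Zipf law r * f = C.
    Needs at least two ranks.
    """
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For a real sequence one would call `zipf_slope(kmer_rank_frequency(sequence, k=6))` and compare the slopes obtained for coding and non-coding regions.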
4 Statistical linguistic study of DNA sequences
Word frequency distribution: compound Poisson distribution
Consider an author's total vocabulary of $V$ words, with probabilities of occurrence $\pi_1 < \pi_2 < \dots < \pi_V$.
The probability that a specific word with probability of occurrence $\pi_i$ appears $r = 1, 2, \dots$ times in a total word count of $N$ tokens is given by the binomial distribution
$$P(r \mid \pi_i) = \binom{N}{r} \pi_i^{r} (1 - \pi_i)^{N - r}.$$
Replacing the binomial by the Poisson distribution, assuming a mixing distribution, and integrating over that probability distribution, one obtains the compound Poisson distribution $\phi(r)$, which has three parameters and in which $K_r(\cdot)$ is the modified Bessel function of the second kind of order $r$. When one of the parameters is fixed at $-0.5$, the mixing distribution is the inverse Gaussian distribution.
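The derivation outlined on this slide can be written out as below. The notation is an assumption, since the slide's own symbols are not preserved: $\lambda$ is the Poisson mean, $\psi$ the mixing density, and $\alpha$, $\gamma$, $\theta$ the three parameters, following Sichel's generalized inverse Gaussian-Poisson distribution. In this form the Bessel order is $r + \gamma$, and $\gamma = -0.5$ gives the inverse Gaussian mixing case. The slide may instead have used the zero-truncated form, which is renormalized over $r \ge 1$.

```latex
% Poisson limit of the binomial, with expected count \lambda = N \pi_i
P(r \mid \lambda) = \frac{\lambda^{r} e^{-\lambda}}{r!}

% Mixing over \lambda with density \psi(\lambda)
\phi(r) = \int_{0}^{\infty} \frac{\lambda^{r} e^{-\lambda}}{r!} \, \psi(\lambda) \, d\lambda

% Closed form when \psi is a generalized inverse Gaussian density, r = 0, 1, 2, \dots
\phi(r \mid \alpha, \gamma, \theta) =
  \frac{(1 - \theta)^{\gamma/2}}{K_{\gamma}\!\left(\alpha \sqrt{1 - \theta}\right)}
  \, \frac{(\alpha \theta / 2)^{r}}{r!} \, K_{r + \gamma}(\alpha),
\qquad -\infty < \gamma < \infty, \quad 0 < \theta < 1, \quad \alpha > 0
```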
5 Statistical linguistic study of DNA sequences Fortran program
6 Statistical linguistic study of DNA sequences Mammals
7 Statistical linguistic study of DNA sequences Invertebrate
8 Statistical linguistic study of DNA sequences Eukaryotic Virus
9 Statistical linguistic study of DNA sequences Bacteria
10 Statistical linguistic study of DNA sequences
11 Statistical linguistic study of DNA sequences
Chi-square test
$$\chi^2 = \sum_i \frac{(O_i - T_i)^2}{T_i}$$
where $O$ is the observed frequency and $T$ is the theoretical frequency.
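As a small worked illustration of the chi-square test, the statistic can be computed as follows. This sketch assumes the standard Pearson form, summed over frequency classes; the function name `chi_square` is hypothetical, and any binning or pooling of sparse classes used in the study is not reproduced here.

```python
def chi_square(observed, theoretical):
    """Pearson chi-square statistic: sum over classes of (O - T)^2 / T.

    observed and theoretical are equal-length sequences of frequencies;
    every theoretical frequency must be positive.
    """
    if len(observed) != len(theoretical):
        raise ValueError("observed and theoretical must have the same length")
    if any(t <= 0 for t in theoretical):
        raise ValueError("theoretical frequencies must be positive")
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))
```

A smaller value indicates closer agreement between the observed word frequency distribution and the fitted model.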
12 Statistical linguistic study of DNA sequences
13 Statistical linguistic study of DNA sequences
Segmentation method
How does one define a sentence?
DNA sequences are not random sequences; they contain features such as CpG islands and repeated sequences.
Look for subsequences that differ from the rest of the sequence.
Segment the DNA according to its {A, T, C, G} base composition using the entropic segmentation method (a method used in image segmentation).
Let $S = \{a_1, a_2, \dots, a_N\}$, where the $a$'s are symbols over the alphabet $A = \{A_1, \dots, A_k\}$, for example {A, T, C, G}.
Consider a segmentation at position $n$, which results in $S^{(1)} = \{a_1, a_2, \dots, a_n\}$ and $S^{(2)} = \{a_{n+1}, a_{n+2}, \dots, a_N\}$.
Let $F^{(1)} = \{f_1^{(1)}, \dots, f_k^{(1)}\}$ and $F^{(2)} = \{f_1^{(2)}, \dots, f_k^{(2)}\}$ be the relative nucleotide frequencies over the alphabet $A$ in the two segments.
The Jensen-Shannon divergence between the two distributions is given by
$$D_{JS}(F^{(1)}, F^{(2)}) = H(\pi_1 F^{(1)} + \pi_2 F^{(2)}) - \left(\pi_1 H(F^{(1)}) + \pi_2 H(F^{(2)})\right)$$
where $H(F) = -\sum_{j=1}^{k} f_j \log_2 f_j$ is the Shannon entropy of the distribution $F$ and $\pi_1 + \pi_2 = 1$.
To find such subsequences, one maximizes $D_{JS}$ over the cut position $n$. The segmentation process halts according to the significance level.
Reference: P. Bernaola-Galvan, R. Roman-Roldan, and J. L. Oliver, "Compositional segmentation and long-range fractal correlations in DNA sequences," Phys. Rev. E 53, p. 5181-5189 (1996).
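A single step of the entropic segmentation can be sketched as follows: scan every cut position and keep the one with the largest Jensen-Shannon divergence. This is a minimal sketch under stated assumptions, not the authors' implementation. The weights are taken proportional to segment length ($\pi_1 = n/N$, $\pi_2 = (N-n)/N$), the sequence is assumed to contain only A, C, G, and T, and the function names are hypothetical. The significance-based halting rule and the recursion into sub-segments are not shown.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet; other symbols are not handled


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero-frequency terms contribute nothing."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)


def relative_frequencies(counts, length):
    """Relative frequency of each base, in ALPHABET order."""
    return [counts[base] / length for base in ALPHABET]


def best_cut(seq):
    """Return (n, d_js) for the cut position n (1 <= n < N) that
    maximizes the Jensen-Shannon divergence between seq[:n] and seq[n:]."""
    seq = seq.upper()
    N = len(seq)
    if N < 2:
        raise ValueError("sequence must contain at least two bases")
    total = Counter(seq)
    # With length-proportional weights the mixture pi1*F1 + pi2*F2 equals
    # the composition of the whole sequence, so its entropy is constant.
    h_whole = shannon_entropy(relative_frequencies(total, N))
    left = Counter()
    best_n, best_d = None, -1.0
    for n in range(1, N):
        left[seq[n - 1]] += 1
        right = total - left
        w1, w2 = n / N, (N - n) / N
        h1 = shannon_entropy(relative_frequencies(left, n))
        h2 = shannon_entropy(relative_frequencies(right, N - n))
        d = h_whole - (w1 * h1 + w2 * h2)
        if d > best_d:
            best_n, best_d = n, d
    return best_n, best_d
```

For example, `best_cut("AAAAAAAACCCCCCCC")` places the cut between the run of A's and the run of C's, where the two segments differ most in composition.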
14 Statistical linguistic study of DNA sequences
15 Statistical linguistic study of DNA sequences
Summary
1. The compound Poisson distribution fits the frequency distributions of 6 bp and 7 bp DNA words, and of the segmentation domains, quite well; we consider it a better description than the Zipf law.
2. The compound Poisson distribution gives the correct overall normalization factor.
3. We noticed that, of the three parameters, one controls the long-range behavior (i.e. the less frequently occurring, rare words), one controls the short-range behavior (i.e. the more frequently occurring words), and one seems to control the overall slope (i.e. the syntax or style) of the distribution.
4. It is still premature to suggest that DNA sequences resemble a natural language or that they can be modeled by linguistic methodology.
In linguistics, the representation of linguistic expressions: morpheme → word → phrase → sentence → text
Biological implications
Study the statistical significance of word frequency.
Naively, are some words rare because they disrupt replication or gene expression?
Do words of significant frequency survive natural selection?