1 1.Protein structure study via residue environment – Residues Solvent Accessibility Environment in Globins Protein Family 2.Statistical linguistic study.

1 1 1. Protein structure study via residue environment – Residues Solvent Accessibility Environment in Globins Protein Family
2. Statistical linguistic study of DNA sequences*
Ka Lok Ng, Department of Information Management, Ling Tung College
* In collaboration with S.P. Li, Institute of Physics, Academia Sinica

2 2 Statistical linguistic study of DNA sequences
1. Linguistic study models: Zipf law and compound Poisson distribution
2. Compound Poisson distribution study of the Fortran language and DNA sequences
3. Entropic segmentation method
4. Compound Poisson distribution study of the DNA segments

3 3 Statistical linguistic study of DNA sequences
Zipf Law
Zipf's law states that r f = C, where r is the rank of a word, f is the frequency of occurrence of the word, and C is a constant that depends on the text being analyzed. The relation is linear in a double logarithmic plot (log f against log r), with a slope close to -1 for all languages studied.
DNA sequence study: coding and non-coding regions (mammals, invertebrates, eukaryotic viruses, bacteria).
Reference: Mantegna, R.N.; S.V. Buldyrev; A.L. Goldberger; S. Havlin; C.-K. Peng; M. Simons and H.E. Stanley. "Linguistic Features of Noncoding DNA Sequences." Physical Review Letters 73, no. 23, p. 3169-3172 (1994).
Sequence types: Zipf analysis of 6-tuples of the mammal, invertebrate, yeast chromosome III, eukaryotic virus, prokaryote and bacteria DNA sequences.
Results: They found that non-coding sequences have a consistently larger slope, suggesting that the non-coding sequences bear more resemblance to a natural language than the coding sequences.
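The Zipf analysis of 6-tuples described above can be sketched in a few lines. This is a minimal illustration, not the procedure of the cited paper: the function names `ktuple_counts` and `zipf_slope`, the use of overlapping windows shifted by one base, and the random toy sequence are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def ktuple_counts(sequence, k=6):
    """Count overlapping k-tuples ("words") in a DNA string.

    Assumption: the window is shifted by one base at a time.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def zipf_slope(counts):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    freqs = sorted(counts.values(), reverse=True)  # rank 1 = most frequent word
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input only: a random sequence stands in for real DNA data.
random.seed(0)
toy_sequence = "".join(random.choice("ACGT") for _ in range(20000))
counts = ktuple_counts(toy_sequence, k=6)
slope = zipf_slope(counts)
print(f"{len(counts)} distinct 6-tuples, fitted log-log slope = {slope:.3f}")
```

A word list that obeys r f = C exactly gives a fitted slope of -1; comparing the slopes obtained for coding and non-coding sequences is the comparison the slide refers to.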

4 4 Statistical linguistic study of DNA sequences
Word frequency distribution: compound Poisson distribution
Consider an author's total vocabulary of V words, with probabilities of occurrence ordered from smallest to largest. The frequency distribution of a specific word, that is, the probability that it appears r = 1, 2, ... times in a total word count of N tokens, is given by the binomial distribution. Replacing the binomial by the Poisson distribution, assuming a mixing distribution for the probability of occurrence, and integrating over that distribution, one obtains a compound Poisson distribution with three parameters, expressed through the modified Bessel function of the second kind K_r of order r. When one of the three parameters equals -0.5, the distribution corresponds to the inverse Gaussian case.
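The chain of steps on this slide can be written out as follows. This is a sketch under the assumption that the slide follows Sichel's generalized inverse Gaussian-Poisson distribution in its standard textbook form; the symbols \pi, \alpha, \theta and \gamma are taken from that standard form and are not confirmed by the slide itself.

```latex
% Binomial probability that a word with occurrence probability \pi
% appears r times among N tokens
P(r \mid \pi) = \binom{N}{r}\, \pi^{r} (1-\pi)^{N-r}

% Poisson approximation (large N, small \pi)
P(r \mid \pi) \approx \frac{(N\pi)^{r}}{r!}\, e^{-N\pi}

% Mixing over a generalized inverse Gaussian distribution gives the
% Sichel distribution (assumed form), with parameters
% -\infty < \gamma < \infty, \; 0 < \theta < 1, \; \alpha > 0
\phi(r) = \frac{(1-\theta)^{\gamma/2}}{K_{\gamma}\!\left(\alpha\sqrt{1-\theta}\right)}
          \, \frac{(\alpha\theta/2)^{r}}{r!} \, K_{r+\gamma}(\alpha),
          \qquad r = 0, 1, 2, \ldots
```

In this form, \gamma = -1/2 gives the inverse Gaussian-Poisson case. Because only words that occur at least once are observed, the version restricted to r = 1, 2, ... is obtained by dividing by 1 - \phi(0).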

5 5 Statistical linguistic study of DNA sequences Fortran program

6 6 Statistical linguistic study of DNA sequences Mammals

7 7 Statistical linguistic study of DNA sequences Invertebrate

8 8 Statistical linguistic study of DNA sequences Eukaryotic Virus

9 9 Statistical linguistic study of DNA sequences Bacteria

10 10 Statistical linguistic study of DNA sequences

11 11 Statistical linguistic study of DNA sequences
Chi-square test: O is the observed frequency and T is the theoretical frequency.
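A minimal sketch of the test statistic, assuming the slide refers to the usual Pearson form, the sum over bins of (O - T)^2 / T. The function name `chi_square` and the sample numbers are illustrative only.

```python
def chi_square(observed, theoretical):
    """Pearson chi-square statistic: sum of (O - T)^2 / T over all bins."""
    if len(observed) != len(theoretical):
        raise ValueError("observed and theoretical must have the same length")
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))


# Illustrative numbers only, not data from the slides.
observed = [12, 8]
theoretical = [10, 10]
print(chi_square(observed, theoretical))
```

A smaller value means the theoretical distribution is closer to the observed word frequencies.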

12 12 Statistical linguistic study of DNA sequences

13 13 Statistical linguistic study of DNA sequences
Segmentation method
How to define a sentence?
DNA sequences are not random sequences; they contain features such as CpG islands and repeated sequences.
Look for subsequences that differ from the rest of the sequence.
Segment the DNA according to its {A, T, C, G} base composition using the entropic segmentation method (a method used in image segmentation).
Let S = {a_1, a_2, ..., a_N}, where the a's are symbols over the alphabet A = {A_1, ..., A_k}, for example {A, T, C, G}.
Consider a segmentation at position n, which results in S^(1) = {a_1, a_2, ..., a_n} and S^(2) = {a_(n+1), a_(n+2), ..., a_N}.
Let F^(1) = {f_1^(1), ..., f_k^(1)} and F^(2) = {f_1^(2), ..., f_k^(2)} be the relative nucleotide frequencies of the two segments over the alphabet A.
The Jensen-Shannon divergence between the two distributions is given by
D_JS(F^(1), F^(2)) = H(π_1 F^(1) + π_2 F^(2)) - (π_1 H(F^(1)) + π_2 H(F^(2)))
where H(F) = -Σ_i f_i log f_i is the Shannon entropy of the distribution F, and the weights satisfy π_1 + π_2 = 1.
To find the subsequences, one maximizes D_JS over the cut position n. The segmentation process halts according to a chosen significance level.
Reference: P. Bernaola-Galvan, R. Roman-Roldan, and J. L. Oliver, "Compositional segmentation and long range fractal correlations in DNA sequences." Phys. Rev. E 53, p. 5181-5189 (1996).
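A single step of the entropic segmentation can be sketched as below. The sketch makes several assumptions that the slide does not state: the weights are taken as the relative segment lengths n/N and (N-n)/N, the entropy uses base-2 logarithms, and the names `shannon_entropy`, `relative_frequencies` and `best_cut` are invented for the example. The recursive splitting and the significance test used for halting are not shown.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero frequencies contribute nothing."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)


def relative_frequencies(counts, length):
    """Relative frequency of each base of ALPHABET in a segment."""
    return [counts.get(base, 0) / length for base in ALPHABET]


def best_cut(sequence):
    """Return (n, D_JS) for the cut position n that maximizes D_JS.

    Assumption: weights are n/N and (N-n)/N, so the weighted mixture of the
    two segment distributions equals the whole-sequence composition.
    """
    total = len(sequence)
    if total < 2:
        raise ValueError("need at least two symbols to place a cut")
    right = Counter(sequence)  # counts for S^(2), initially the whole sequence
    left = Counter()           # counts for S^(1), initially empty
    h_whole = shannon_entropy(relative_frequencies(right, total))
    best_n, best_d = None, -1.0
    for n in range(1, total):
        base = sequence[n - 1]  # move one symbol from the right to the left segment
        left[base] += 1
        right[base] -= 1
        w1 = n / total
        w2 = 1 - w1
        h1 = shannon_entropy(relative_frequencies(left, n))
        h2 = shannon_entropy(relative_frequencies(right, total - n))
        d = h_whole - (w1 * h1 + w2 * h2)
        if d > best_d:
            best_n, best_d = n, d
    return best_n, best_d


# Toy input: two compositionally distinct halves.
toy = "A" * 50 + "G" * 50
cut, divergence = best_cut(toy)
print(cut, divergence)
```

Applying `best_cut` again to each resulting segment, and stopping when the maximum divergence is no longer significant, would give the full procedure outlined on the slide.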

14 14 Statistical linguistic study of DNA sequences

15 15 Statistical linguistic study of DNA sequences
Summary
1. The compound Poisson distribution fits the 6 bp and 7 bp long DNA sequences and the segmentation domains quite well; we consider it a better description than the Zipf law.
2. The compound Poisson distribution gives the correct overall normalization factor.
3. Of the three parameters, one controls the long-range behavior (i.e. less frequent, rare words), one controls the short-range behavior (i.e. more frequent words), and one seems to control the overall slope (i.e. the syntax or style) of the distribution.
4. It is still premature to suggest that DNA sequences resemble natural language and can be modeled by linguistic methodology.
In linguistics, linguistic expressions are represented as a hierarchy: morpheme → word → phrase → sentence → text.
Biological implications
Study the statistical significance of word frequency.
Naively, are some words rare because they disrupt replication or gene expression?
Do words of significant frequency survive after natural selection?

