1 1.Protein structure study via residue environment – Residues Solvent Accessibility Environment in Globins Protein Family 2.Statistical linguistic study.

Presentation transcript:

1 1. Protein structure study via residue environment – Residues Solvent Accessibility Environment in Globins Protein Family 2. Statistical linguistic study of DNA sequences* Ka Lok Ng, Department of Information Management, Ling Tung College. *In collaboration with S.P. Li, Institute of Physics, Academia Sinica

2 Statistical linguistic study of DNA sequences
1. Linguistic study models – Zipf law and Compound Poisson Distribution
2. Compound Poisson Distribution study of the Fortran language and DNA sequences
3. Entropic segmentation method
4. Compound Poisson Distribution study of the DNA segments

3 Statistical linguistic study of DNA sequences
Zipf Law
Zipf's law states that $r f = C$, where $r$ is the rank of a word, $f$ is the frequency of occurrence of the word, and $C$ is a constant that depends on the text being analyzed. The relation is linear in a double logarithmic plot, with a slope of about $-1$ for all languages studied.
DNA sequences study: coding and non-coding regions (Mammals, Invertebrates, Eukaryotic Virus, Bacteria)
Reference: Mantegna, R.N.; S.V. Buldyrev; A.L. Goldberger; S. Havlin; C.-K. Peng; M. Simons and H.E. Stanley. "Linguistic Features of Noncoding DNA Sequences", Physical Review Letters 73, no. 23 (1994).
Sequence types: Zipf analysis of 6-tuples of the Mammals, Invertebrates, Yeast chromosome III, Eukaryotic Virus, Prokaryotes and Bacteria DNA sequences.
Results: They found that non-coding sequences have a slope that is consistently larger, suggesting that the non-coding sequences bear more resemblance to a natural language than the coding sequences.
[Figure: log f versus log r]
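The Zipf analysis of 6-tuples described on this slide can be sketched as follows. This is a minimal illustration, not the procedure of the cited paper: the function names are hypothetical, the window is assumed to slide one base at a time (overlapping tuples), and the slope is estimated by an ordinary least-squares fit over all ranks.

```python
import math
from collections import Counter

def ktuple_counts(seq, k=6):
    """Count overlapping k-tuples ("words") in a sequence (assumed window step of 1)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def rank_frequency(counts):
    """Return (rank, frequency) pairs, most frequent word first."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log f against log r."""
    if len(pairs) < 2:
        raise ValueError("need at least two (rank, frequency) pairs")
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

For a sequence string `dna`, `loglog_slope(rank_frequency(ktuple_counts(dna, 6)))` gives the fitted slope; data that follow $r f = C$ exactly give a slope of $-1$.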

4 Statistical linguistic study of DNA sequences
Word frequency distribution - Compound Poisson Distribution
Consider an author's total vocabulary of $V$ words, with probabilities of occurrence $\pi_1 < \pi_2 < \dots < \pi_V$. The probability that a specific word with probability of occurrence $\pi_i$ appears $r = 1, 2, \dots$ times in a total word count of $N$ tokens is given by the binomial distribution
$$\binom{N}{r} \pi_i^{r} (1 - \pi_i)^{N - r}.$$
Replacing the binomial by the Poisson distribution, assuming a mixing distribution for the probabilities of occurrence, and integrating over that distribution, one obtains the compound Poisson distribution $\phi(r)$ [equation not recovered]. It has three parameters and involves $K_r(\cdot)$, the modified Bessel function of the second kind of order $r$. When one of these parameters equals $-0.5$, the mixing distribution is the inverse Gaussian distribution.
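The step "replacing the binomial by the Poisson distribution" can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and instead of the continuous mixing distribution on the slide it simply sums Poisson terms over an explicit list of word probabilities to get the expected number of distinct words seen exactly $r$ times.

```python
import math

def binomial_pmf(r, n_tokens, p):
    """Probability that a word with occurrence probability p appears r times in n_tokens."""
    return math.comb(n_tokens, r) * p ** r * (1 - p) ** (n_tokens - r)

def poisson_pmf(r, lam):
    """Poisson approximation with mean lam = n_tokens * p."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

def expected_spectrum(probs, n_tokens, r_max):
    """Expected number of distinct words occurring exactly r times, for r = 1..r_max.

    Assumption: a finite list of word probabilities stands in for the
    mixing distribution that the slide integrates over.
    """
    return [
        sum(poisson_pmf(r, n_tokens * p) for p in probs)
        for r in range(1, r_max + 1)
    ]
```

For small $p$ and large $N$ the two probabilities nearly coincide, for example `binomial_pmf(2, 10000, 0.0003)` and `poisson_pmf(2, 3.0)`.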

5 Statistical linguistic study of DNA sequences Fortran program

6 Statistical linguistic study of DNA sequences Mammals

7 Statistical linguistic study of DNA sequences Invertebrate

8 Statistical linguistic study of DNA sequences Eukaryotic Virus

9 Statistical linguistic study of DNA sequences Bacteria

10 Statistical linguistic study of DNA sequences

11 Statistical linguistic study of DNA sequences
Chi-square test
$$\chi^2 = \sum_i \frac{(O_i - T_i)^2}{T_i}$$
where $O$ is the observed frequency and $T$ is the theoretical frequency.
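A minimal sketch of the chi-square statistic, assuming the standard Pearson form in which each squared difference between observed and theoretical frequency is divided by the theoretical frequency. The function name is hypothetical.

```python
def chi_square(observed, theoretical):
    """Pearson chi-square statistic: sum over bins of (O - T)^2 / T."""
    if len(observed) != len(theoretical):
        raise ValueError("observed and theoretical must have the same length")
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))
```

A perfect fit gives 0; larger values indicate a poorer agreement between the fitted distribution and the observed word counts.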

12 Statistical linguistic study of DNA sequences

13 Statistical linguistic study of DNA sequences
Segmentation method
How to define a sentence? DNA sequences are not random sequences; they contain features such as CpG islands and repeated sequences. The idea is to look for subsequences that differ from the rest of the sequence, segmenting DNA according to its {A, T, C, G} base composition by the entropic segmentation method (a method used in image segmentation).
Let $S = \{a_1, a_2, \dots, a_N\}$, where the $a$'s are symbols over the alphabet $A = \{A_1, \dots, A_k\}$, for example {A, T, C, G}. Consider a segmentation at position $n$, which results in $S^{(1)} = \{a_1, a_2, \dots, a_n\}$ and $S^{(2)} = \{a_{n+1}, a_{n+2}, \dots, a_N\}$. Let $F^{(1)} = \{f_1^{(1)}, \dots, f_k^{(1)}\}$ and $F^{(2)} = \{f_1^{(2)}, \dots, f_k^{(2)}\}$ be the relative nucleotide frequencies over the alphabet $A$. The Jensen-Shannon divergence between the two distributions is given by
$$D_{JS}(F^{(1)}, F^{(2)}) = H(\pi_1 F^{(1)} + \pi_2 F^{(2)}) - \left(\pi_1 H(F^{(1)}) + \pi_2 H(F^{(2)})\right),$$
where $H(F) = -\sum_{i=1}^{k} f_i \log f_i$ is the Shannon entropy of the distribution $F$ and $\pi_1 + \pi_2 = 1$. To look for subsequences, one maximizes $D_{JS}$. Halting of the segmentation process is determined by the significance level.
Reference: P. Bernaola-Galvan, R. Roman-Roldan, and J. L. Oliver, "Compositional segmentation and long range fractal correlations in DNA sequences", Phys. Rev. E 53 (1996).
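One step of the entropic segmentation can be sketched as a search for the cut position $n$ that maximizes $D_{JS}$. This is an illustration under stated assumptions, not the cited authors' implementation: the weights are taken as the relative segment lengths ($n/N$ and $(N-n)/N$), the entropy uses base-2 logarithms, the function names are hypothetical, and the significance test that halts recursive splitting is left out.

```python
import math
from collections import Counter

def shannon_entropy(counts, total):
    """Shannon entropy (base 2, an assumption) of the relative frequencies in counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def best_cut(seq):
    """Return (n, D_JS) for the cut maximizing D_JS between seq[:n] and seq[n:]."""
    N = len(seq)
    if N < 2:
        raise ValueError("sequence must contain at least two symbols")
    total = Counter(seq)
    # With length-proportional weights, the weighted mixture of F(1) and F(2)
    # is the composition of the whole sequence, so its entropy is fixed.
    h_all = shannon_entropy(total, N)
    left = Counter()
    best_n, best_d = None, -1.0
    for n in range(1, N):
        left[seq[n - 1]] += 1
        right = total - left  # Counter subtraction keeps only positive counts
        w1, w2 = n / N, (N - n) / N  # assumed weights: relative segment lengths
        d = h_all - (w1 * shannon_entropy(left, n) + w2 * shannon_entropy(right, N - n))
        if d > best_d:
            best_n, best_d = n, d
    return best_n, best_d
```

Applied to a sequence made of two compositionally distinct halves, such as `"AAAAAAAATTTTTTTT"`, the search places the cut at the boundary between them. A full segmentation would apply `best_cut` recursively to each resulting segment until the divergence is no longer significant.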

14 Statistical linguistic study of DNA sequences

15 Statistical linguistic study of DNA sequences
Summary
1. The compound Poisson distribution fits DNA sequences of length 6 bp and 7 bp, and the segmentation domains, quite well; we consider it a better fit than the Zipf law.
2. The compound Poisson distribution gives the correct overall normalization factor.
3. We noticed that, of the three parameters, one controls the long-range behavior (i.e., less frequent, rare words), a second controls the short-range behavior (i.e., more frequent words), and the third seems to control the overall slope (i.e., the syntax or style) of the word frequency distribution.
4. It is still premature to suggest that DNA sequences resemble natural language and can be modeled by linguistic methodology.
In linguistics, the representation of linguistic expressions: Morpheme → word → phrase → sentence → text
Biological implications
Study the statistical significance of word frequency. Naively: are some words rare because they disrupt replication or gene expression? Do words of significant frequency survive natural selection?