1 Special type of log linear models to fit DNA. Irina Abnizova 1, Brian Tom 1 and Walter R. Gilks 2 1 MRC Biostatistics Unit, Cambridge 2 Leeds University,

Slides:



Advertisements
Similar presentations
Lecture 2 ANALYSIS OF VARIANCE: AN INTRODUCTION
Advertisements

Markov models and applications
. Markov Chains. 2 Dependencies along the genome In previous classes we assumed every letter in a sequence is sampled randomly from some distribution.
1 1 Slide © 2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole.
Ab initio gene prediction Genome 559, Winter 2011.
Chi Square Analyses: Comparing Frequency Distributions.
Markov Chains Lecture #5
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
© 2010 Pearson Prentice Hall. All rights reserved Least Squares Regression Models.
Methods of identification and localization of the DNA coding sequences Jacek Leluk Interdisciplinary Centre for Mathematical and Computational Modelling,
12.The Chi-square Test and the Analysis of the Contingency Tables 12.1Contingency Table 12.2A Words of Caution about Chi-Square Test.
Log-linear and logistic models Generalised linear model ANOVA revisited Log-linear model: Poisson distribution logistic model: Binomial distribution Deviances.
1 Markov Chains Algorithms in Computational Biology Spring 2006 Slides were edited by Itai Sharon from Dan Geiger and Ydo Wexler.
Log-linear and logistic models
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 14 Goodness-of-Fit Tests and Categorical Data Analysis.
Markov models and applications Sushmita Roy BMI/CS 576 Oct 7 th, 2014.
Inferences About Process Quality
Diversity and Distribution of Species
BCOR 1020 Business Statistics
Review for Exam 2 Some important themes from Chapters 6-9 Chap. 6. Significance Tests Chap. 7: Comparing Two Groups Chap. 8: Contingency Tables (Categorical.
A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.
Nonparametric or Distribution-free Tests
Hypothesis Testing and T-Tests. Hypothesis Tests Related to Differences Copyright © 2009 Pearson Education, Inc. Chapter Tests of Differences One.
Aaker, Kumar, Day Ninth Edition Instructor’s Presentation Slides
Regression Analysis (2)
Goodness-of-Fit Tests and Categorical Data Analysis
Copyright © 2013, 2010 and 2007 Pearson Education, Inc. Chapter Inference on the Least-Squares Regression Model and Multiple Regression 14.
Chapter 26: Comparing Counts AP Statistics. Comparing Counts In this chapter, we will be performing hypothesis tests on categorical data In previous chapters,
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Chapter 11: Applications of Chi-Square. Count or Frequency Data Many problems for which the data is categorized and the results shown by way of counts.
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Chapter 9: Non-parametric Tests n Parametric vs Non-parametric n Chi-Square –1 way –2 way.
Multinomial Distribution
The Examination of Residuals. Examination of Residuals The fitting of models to data is done using an iterative approach. The first step is to fit a simple.
Testing Hypotheses about Differences among Several Means.
Discrete Multivariate Analysis Analysis of Multivariate Categorical Data.
Multiple Regression and Model Building Chapter 15 Copyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin.
1 In this case, each element of a population is assigned to one and only one of several classes or categories. Chapter 11 – Test of Independence - Hypothesis.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
EMIS 7300 SYSTEMS ANALYSIS METHODS FALL 2005 Dr. John Lipp Copyright © Dr. John Lipp.
1 Statistical NLP: Lecture 7 Collocations. 2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts.
Learning Objectives Copyright © 2002 South-Western/Thomson Learning Statistical Testing of Differences CHAPTER fifteen.
Section 10.2 Independence. Section 10.2 Objectives Use a chi-square distribution to test whether two variables are independent Use a contingency table.
VI. Regression Analysis A. Simple Linear Regression 1. Scatter Plots Regression analysis is best taught via an example. Pencil lead is a ceramic material.
© 2008 McGraw-Hill Higher Education The Statistical Imagination Chapter 11: Bivariate Relationships: t-test for Comparing the Means of Two Groups.
Question paper 1997.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 13: One-way ANOVA Marshall University Genomics Core.
June 30, 2008Stat Lecture 16 - Regression1 Inference for relationships between variables Statistics Lecture 16.
© Copyright McGraw-Hill 2004
Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 11 Analyzing the Association Between Categorical Variables Section 11.2 Testing Categorical.
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Markov Chain Models BMI/CS 576 Colin Dewey Fall 2015.
Outline of Today’s Discussion 1.The Chi-Square Test of Independence 2.The Chi-Square Test of Goodness of Fit.
GG 313 beginning Chapter 5 Sequences and Time Series Analysis Sequences and Markov Chains Lecture 21 Nov. 12, 2005.
Université d’Ottawa / University of Ottawa 2001 Bio 4118 Applied Biostatistics L14.1 Lecture 14: Contingency tables and log-linear models Appropriate questions.
Chapter 14 – 1 Chi-Square Chi-Square as a Statistical Test Statistical Independence Hypothesis Testing with Chi-Square The Assumptions Stating the Research.
1 1 Slide © 2008 Thomson South-Western. All Rights Reserved Chapter 12 Tests of Goodness of Fit and Independence n Goodness of Fit Test: A Multinomial.
Hypothesis Testing. Statistical Inference – dealing with parameter and model uncertainty  Confidence Intervals (credible intervals)  Hypothesis Tests.
Section 10.2 Objectives Use a contingency table to find expected frequencies Use a chi-square distribution to test whether two variables are independent.
Copyright © Cengage Learning. All rights reserved. 14 Goodness-of-Fit Tests and Categorical Data Analysis.
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Statistics for Psychology CHAPTER SIXTH EDITION Statistics for Psychology, Sixth Edition Arthur Aron | Elliot J. Coups | Elaine N. Aron Copyright © 2013.
CHI SQUARE DISTRIBUTION. The Chi-Square (  2 ) Distribution The chi-square distribution is the probability distribution of the sum of several independent,
Chapter 11 – Test of Independence - Hypothesis Test for Proportions of a Multinomial Population In this case, each element of a population is assigned.
10 Chapter Chi-Square Tests and the F-Distribution Chapter 10
A Very Basic Gibbs Sampler for Motif Detection
Ab initio gene prediction
Chapter 10 Analyzing the Association Between Categorical Variables
Presentation transcript:

1 Special type of log linear models to fit DNA. Irina Abnizova 1, Brian Tom 1 and Walter R. Gilks 2 1 MRC Biostatistics Unit, Cambridge 2 Leeds University, Leeds

2 Why to fit models to DNA improve annotation quality understanding of evolutionary processes leading to DNA diversion about DNA: DNA has the form of four states (A,C,G,T nucleotides or base pairs) positioned in space can be “read” in specific direction for coding parts low order Markov models are used, e.g. for predicting the occurrence of certain sequences as CTGAC etc

3 As known from (Avery and Henderson 1999), DNA can often be modelled by Markov chain Analysis in the context of log linear models Peculiar features: The data produce contingency tables with the similar margins (dependence of the observations). However, the analysis is the same as for multinomial samples. Standard number of degrees of freedom is correct

4 recall: Markov chain model Def: A sequence of random variables is called a Markov chain of the order k, if each state in the chain depends only on its k previous neighbours For DNA 1st order Markov chain models two-residue dependence. A 0th order Markov chain models the distribution of independent residues.

5 How to estimate the Markov model for a given stretch of DNA approximate the parameters by multi-nucleotide frequencies: for 1st order Markov model: Initial distribution: via the occurrences of each particular residue, e.g. p a =N a /L Transitional probability matrix via the number of occurrences of the each adjacent pair, N ij.

6 Dependencies within DNA Definition: Independent state model, M0 - the state in a particular position is independent of the previous state Two-way table is formed by counting 16 pairs: S A C G T A Naa Nac Nag Nat F C Nca Ncc Ncg Nct G Nga Ngc Ngg Ngt T Nga Ngc Ngg Ngt Notation: N ij =occurrence of (i,j) pair

7 Example: kni-cis regulatory region, Drosophila melanogaster S A C G T total A C F G T total (Bishop et al. 1975) calculate expectations by using the marginal totals, assuming independence: Eij=Xi.X.j/X.., Xi.=number of times i occurs in the position First, X.j=number of times j occurs in the position Second, X..=number of all pairs

8 Then we compute: the Pearson chi-square statistics X 2 and/or minus twice the likelihood ratio statistics G 2 : Both follow asymptotic chi-square with df=9 Compare against a chi-square distribution If the null hypothesis of independence is correct, we have a distribution with 9 degrees of freedom Both statistics here are more than 100.2… Hypothesis of independence should be rejected Typical for DNA!

9 Alternative formulation (Avery, Henderson1999) is as a generalised linear model with a log-link, Poisson error structure and linear predictor: is the overall mean, Fi refers to the first position, Sj-to the second of the pair (i,j) We fit the model for kni_cis data with R:

10 Null deviance: on 15 degrees of freedom Residual deviance: on 9 degrees of freedom D=G2 here, similarly used: again, the null is rejected! We might Assess if first order Markov model, M1 describes the data Analyse a three-way table for triplets (i,j,k) Fitting first order Markov model, M1

11 Table 2 : The triplet counts for kni-cis regulatory region, D. A C G T A C A G T A C C G T A C G G T A C T G T

12 extension of independence model: a model with linear predictor: No dependence between First and Third positions; Df=4(4-1) 2 =36, we need 64 observations, And 48 parameters to estimate Fitting first order Markov model

13 Fitting higher order Markov model m2: df=144 (256 observations and 112 parameters) m3: (1024 observations and 448 parameters) Avery, Henderson 1999 claim that M1 fits the most non-coding DNA they tried. We did not find it for regulatory regions The expected values are given by Eijk=Xji.X.jk/X.j. using notation similar to the two-way case.

14 Power of the test for Markov model order of the Markov model increases ability of the test to discriminate between different models decreases order of the Markov model increases sample size (length L) is more important: lack of the data e.g. for fitting M2 (four-way table with 256 elements) with the power 80% we need at least 1250 bp long sequence! for M2, most cells have expected values less than 6 for a sequence of around 1500 bp, thus we will not have asymptotic chi- square with df=144 merging sequences is not good: we loose biological sense That is why we suggest the following model:

15 Special type of log linear models to fit DNA. alternative type of log linear models, for short DNA functional regions Our personal motivation: to use the models for 1.Distinguishing DNA functional types, specially interested in regulatory DNA 2. Search for regulatory motifs

16 RNA transcript Gene Header transcription translation Protein Genes: code for proteins (exons) Gene header regulates transcription rate About gene regulation

17 Gene Regulation of transcription RNA tr RNA transcript header other header Transcription factors: Other gene Translation, Far-away headers

18 Biological Observations about Regulatory elements (TFBS) Suggests certain statistical properties which might be captured with appropriate models Likely to be Over- and Under-Represented Short (5-15 bp) sub-sequences Often arranged and closely packed in Cis-Regulatory Modules (CRM) within regulatory regions

19 Data sets Drosophila melanogaster 1 Regulatory Regions, 60 experimentally Verified Sequences, bp each 2. Internal exons, Randomly picked, 60 sequences of similar to 1 lengths 3. Non coding Presumably non regulatory DNA, Randomly picked, 60 sequences

20 Fitting Markov models Limitations: 1. length of sequence analysed e.g. required 1250 bp to fit second order MC, while average human exon length is 146 bp… 2. requirement of stationary distribution DNA is very non-stationary

21 Faster method, requires less information Assume (for each individual sequence): 1. most nucleotides occur independently 2. overall independence is disrupted by certain words: cores of TFBS (regulatory regions) stop-codons (exons) simple repeats (‘junk’ DNA) Length of ‘disruptive’ words is responsible for the “memory” of the model, k

22 Instead of fitting higher order model, we first study standardised residuals The significant standardized residuals (>2) for independence model were: Pearson pair tt aa ta at gt tc gc gg ag

23 Assume all pairs are independent except of aa and tt (see Pearson residuals) Where G h(i,j) ={1 if i=j=a or t; 0 else} Now we have Residual deviance: on 8 degrees of freedom The model ‘independence except’ describes the data well! Biological explanation: there are a lot of TFBS “TT” and ‘AA’ rich words (TFBS: bicoid, huncnback,kruppel etc)

24 Fitting three-way independent-except model The length of the most repetitive or rare patterns may be more than two base pairs. Compute three-way table, assuming that all nucleotides are independent except of some triples. Requires much less DNA information than conventional M1, M2 models. k=2 for this ‘3-way M0 except’ model k+1 is the length of most disruptive patterns

25 Fitting multi-way independent-except model We successively compute four-, five- etc way tables if three-way Independent models do not fit our data. k=3,4.. for these models k+1 is the length of most disruptive patterns

26 Coding regions: best fit is k = 0 or 1 Disruptive is (-ta) Regulatory Regions: best fit is k=2 or 3 Disruptive are 3-4 short motifs ( aaa, ttt,cgg ) Non coding non regulatory regions: k > 0,1 Disruptive are( If k > 1) only one or two short motifs (almost any 16 pair motifs) Results: given k and ‘disruptive’ patterns, we may separate between functional DNA P.S. -ta is everywhere…

27 Presently we fit each sequence separately to determine the “best” fitting model for that sequence However sequences of the same DNA functional type are expected to be more similar to each other than sequences of different types Therefore can we identify what is common amongst sequences of the same type? Also can we discriminate between the different types?

28 Acknowledgements David Ohlssen Rene te Boekhorst