The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.

Slides:



Advertisements
Similar presentations
Chapter 23: Inferences About Means
Advertisements

EigenFaces and EigenPatches Useful model of variation in a region –Region must be fixed shape (eg rectangle) Developed for face recognition Generalised.
1 Regression as Moment Structure. 2 Regression Equation Y =  X + v Observable Variables Y z = X Moment matrix  YY  YX  =  YX  XX Moment structure.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 18 Sampling Distribution Models.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 18 Sampling Distribution Models.
Hypothesis Testing Steps in Hypothesis Testing:
Dimension reduction (1)
Ch11 Curve Fitting Dr. Deshi Ye
Chapter 18 Sampling Distribution Models
Multivariate distributions. The Normal distribution.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
Regression Analysis. Unscheduled Maintenance Issue: l 36 flight squadrons l Each experiences unscheduled maintenance actions (UMAs) l UMAs costs $1000.
G. Cowan Lectures on Statistical Data Analysis 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem, random variables, pdfs 2Functions.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
Class 5: Thurs., Sep. 23 Example of using regression to make predictions and understand the likely errors in the predictions: salaries of teachers and.
1 Psych 5500/6500 The t Test for a Single Group Mean (Part 5): Outliers Fall, 2008.
Correlations and Copulas Chapter 10 Risk Management and Financial Institutions 2e, Chapter 10, Copyright © John C. Hull
Analysis of Individual Variables Descriptive – –Measures of Central Tendency Mean – Average score of distribution (1 st moment) Median – Middle score (50.
G. Cowan Lectures on Statistical Data Analysis Lecture 10 page 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem 2Random variables and.
Linear and generalised linear models Purpose of linear models Least-squares solution for linear models Analysis of diagnostics Exponential family and generalised.
Lecture 5: Simple Linear Regression
Modern Navigation Thomas Herring
Variance and covariance Sums of squares General linear models.
Separate multivariate observations
CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY Session 2: Basic techniques for innovation data analysis. Part I: Statistical inferences.
Introduction to Error Analysis
All of Statistics Chapter 5: Convergence of Random Variables Nick Schafer.
G. Cowan Lectures on Statistical Data Analysis Lecture 3 page 1 Lecture 3 1 Probability (90 min.) Definition, Bayes’ theorem, probability densities and.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
Copyright © 2009 Pearson Education, Inc. Chapter 18 Sampling Distribution Models.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
1 Sample Geometry and Random Sampling Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking.
Regression Analysis Part C Confidence Intervals and Hypothesis Testing
Modern Navigation Thomas Herring MW 11:00-12:30 Room
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
Chapter 22: Building Multiple Regression Models Generalization of univariate linear regression models. One unit of data with a value of dependent variable.
Correlation. Correlation is a measure of the strength of the relation between two or more variables. Any correlation coefficient has two parts – Valence:
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
Review of Statistics.  Estimation of the Population Mean  Hypothesis Testing  Confidence Intervals  Comparing Means from Different Populations  Scatterplots.
NON-LINEAR REGRESSION Introduction Section 0 Lecture 1 Slide 1 Lecture 6 Slide 1 INTRODUCTION TO Modern Physics PHYX 2710 Fall 2004 Intermediate 3870 Fall.
Week 6. Statistics etc. GRS LX 865 Topics in Linguistics.
ANOVA, Regression and Multiple Regression March
1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.
Principal Component Analysis (PCA)
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
G. Cowan Computing and Statistical Data Analysis / Stat 9 1 Computing and Statistical Data Analysis Stat 9: Parameter Estimation, Limits London Postgraduate.
Review of statistical modeling and probability theory Alan Moses ML4bio.
 Seeks to determine group membership from predictor variables ◦ Given group membership, how many people can we correctly classify?
G. Cowan Lectures on Statistical Data Analysis Lecture 10 page 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem 2Random variables and.
Professor William H. Press, Department of Computer Science, the University of Texas at Austin1 Opinionated in Statistics by Bill Press Lessons #46 Interpolation.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 18 Sampling Distribution Models.
The accuracy of averages We learned how to make inference from the sample to the population: Counting the percentages. Here we begin to learn how to make.
The normal approximation for probability histograms.
Central limit theorem - go to web applet. Correlation maps vs. regression maps PNA is a time series of fluctuations in 500 mb heights PNA = 0.25 *
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
Fitting.
Computational Statistics with Application to Bioinformatics
Sampling Distribution Models
Significance analysis of microarrays (SAM)
ECE 417 Lecture 4: Multivariate Gaussians
EMIS 7300 SYSTEMS ANALYSIS METHODS FALL 2005
数据的矩阵描述.
#21 Marginalize vs. Condition Uninteresting Fitted Parameters
Opinionated Lessons #39 MCMC and Gibbs Sampling in Statistics
#22 Uncertainty of Derived Parameters
#10 The Central Limit Theorem
Probabilistic Surrogate Models
Presentation transcript:

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William H. Press Spring Term, 2008 The University of Texas at Austin Unit 9: Working with Multivariate Normal Distributions

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 2 The multivariate normal distribution –completely defined by its mean (vector) and covariance (matrix) –therefore, trivial to “fit” to a bunch of sample points –also easy (e.g. in Matlab) to sample from Brief digression on spliceosomes –explain the weird exon length distribution that we’ve previously seen Relation of fitted covariance matrix to linear correlation matrix –statistical significance of a correlation but, note, uses CLT How to generate multivariate normal deviates –Cholesky decomposition of the covariance matrix A related Cholesky trick: draw error ellipses corresponding to a given covariance matrix Can there be a significant correlation that is not a “real association” –no, there can’t be if CLT applies or, if CLT doesn’t apply, if you compute significance correctly e.g., by permutation test Unit 9: Working with Multivariate Normal Distributions (Summary)

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 3 New topic: the Multivariate Normal Distribution Generalizes Normal (Gaussian) to M-dimensions Like 1-d Gaussian, completely defined by its mean and (co-)variance Mean is a M-vector, covariance is a M x M matrix Because mean and covariance are easy to estimate from a data set, it is easy – perhaps too easy – to fit a multivariate normal distribution to data. § = h( x ¡ ¹ ) ­ ( x ¡ ¹ )i ¹ = h x i g = readgenestats('genestats.dat'); ggg = g(g.ne>2,:); which = randsample(size(ggg,1),1000); iilen = ggg.intronlen(which); i1len = zeros(size(which)); i2len = zeros(size(which)); for j=1:numel(i1len), i1llen(j) = log10(iilen{j}(1)); end; for j=1:numel(i2len), i2llen(j) = log10(iilen{j}(2)); end; plot(i1llen,i2llen,'+') hold on Sample the sizes of 1 st and 2 nd introns for 1000 genes:

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 4 notice the biology! This is kind of fun, because it’s not just the usual featureless scatter plot Is there a significant correlation here? (Yes, we’ll see.) (Do you think the correlation could be only from the biological censoring of low values, and not be a “real association”?)

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 5 credit: Alberts et al. Molecular Biology of the Cell Biological digression: The hard lower bounds on intron length are because the intron has to fit around the “big” spliceosome machinery! It’s all carefully arranged to allow exons of any length, even quite small. Why? Could the spliceosome have evolved to require a minimum exon length, too? Are we seeing chance early history, or selection?

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 6 mu = mean([i1llen; i2llen],2) sig = cov(i1llen,i2llen) mu = sig = rsamp = mvnrnd(mu,sig,1000); plot(rsamp(:,1),rsamp(:,2),'or') hold off We can easily sample from the fitted multivariate Gaussian: By the way, if renormalized, the covariance matrix is the linear correlation matrix: r = sig./ sqrt(diag(sig) * diag(sig)') tval = sqrt(numel(iilen))*r r = tval = [rr p] = corrcoef(i1llen,i2llen) rr = p = statistical significance of the correlation in standard deviations (but note: uses CLT) Matlab has built-ins

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 7 How to generate multivariate normal deviates: So, just take: and you get a multivariate normal deviate!

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 8 A related, useful, Cholesky trick is to draw error ellipses (ellipsoids, …) So, locus of points at 1 standard deviation is If z is on the unit circle (sphere, …) then function [x y] = errorellipse(mu,sigma,stdev,n) L = chol(sigma,'lower'); circle = [cos(2*pi*(0:n)/n); sin(2*pi*(0:n)/n)].*stdev; ellipse = L*circle + repmat(mu,[1,n+1]); x = ellipse(1,:); y = ellipse(2,:); plot(i1llen,i2llen,'+b'); hold on [xx yy] = errorellipse(mu,sig,1,100); plot(xx,yy,'-r'); [xx yy] = errorellipse(mu,sig,2,100); plot(xx,yy,'-r')

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 9 plot(i1llen,i2llen,'+b') hold on plot(i1llen(randperm(numel(i1llen))),i2llen,'+r'); 1.The theorem that r is ~ normal (around zero) for independent distributions is a CLT and doesn’t depend on the individual distributions (as long as they have suitably convergent moments and the number of data points is large) 2.It’s clear for our example data if you plot even one random permutation: 3.In case of any lingering doubt, you could use brute force: do the permutation test and plot the histogram of r values. You’ll recover the CLT result. A few slides back, I put out the red herring that there could be a correlation r that wasn’t a “real association” but instead came from weirdness in the individual distributions. I was being a provocateur. It can’t be so: