
1 Applied statistics Usman Roshan

2 A few basic stats
Expected value of a random variable (examples: Bernoulli and Binomial)
Variance of a random variable (examples: Bernoulli and Binomial)
Correlation coefficient (the same as the Pearson correlation coefficient)
Formulas:
Covariance(X,Y) = E((X - μX)(Y - μY))
Correlation(X,Y) = Covariance(X,Y) / (σX σY)
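For reference, the standard moments of the two named distributions (the slide lists them without formulas): E[X] = p and Var(X) = p(1 - p) for a Bernoulli(p) variable, and E[X] = np and Var(X) = np(1 - p) for a Binomial(n, p) variable.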

3 Correlation between variables
Measures the linear association between two variables.
The correlation r lies between -1 and 1: a value of 1 means perfect positive correlation and -1 means perfect correlation in the other direction.
A function f(r) of the correlation has a t-distribution with n-2 degrees of freedom that can be used to obtain a p-value (see below).
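The slide shows f(r) only as an image; the standard form of this test statistic is

    t = r sqrt(n - 2) / sqrt(1 - r²)

which, under the null hypothesis of zero correlation, follows a t-distribution with n - 2 degrees of freedom.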

4 Pearson correlation coefficient
[Formula image from Wikipedia]
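The pictured formula is presumably the standard sample version of the coefficient:

    r = Σ (xi - x̄)(yi - ȳ) / sqrt( Σ (xi - x̄)² · Σ (yi - ȳ)² )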

5 Basic stats in R
Mean and variance calculation: define a vector and compute its mean and variance
Correlations: define two vectors and compute their correlation (see the sketch below)
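A minimal R sketch; the vectors here are made-up illustration data:

    x <- c(1, 2, 3, 4, 5)
    y <- c(2, 1, 4, 3, 5)
    mean(x)         # sample mean
    var(x)          # sample variance (divides by n-1)
    cor(x, y)       # Pearson correlation
    cor.test(x, y)  # correlation with t-based p-value (n-2 df)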

6 Statistical inference
P-value
Bayes rule
Posterior probability and likelihood
Bayesian decision theory
Bayesian inference under a Gaussian distribution
Chi-square test, Pearson correlation coefficient, t-test

7 P-values What is a p-value?
It is the probability of obtaining an estimate at least as extreme as yours, assuming the data come from some null distribution.
For example, if your estimate of the mean is 1, and under the null distribution the mean is 0 and normally distributed, the p-value of your estimate is the area under the normal curve for all values of the mean at least as large as your estimate.
A small p-value means the probability that the data came from the null distribution is small, and thus the null distribution can be rejected.
A large p-value supports the null distribution but may also support other distributions.

8 P-values from Gaussian distributions
[Figure courtesy of Wikipedia]

9 P-values from chi-square distributions
[Figure courtesy of Wikipedia]

10 Distributions in R
Binomial distribution in R: dbinom, pbinom
Gaussian (normal) distribution in R: pnorm
Calculating p-values in R: suppose the true mean is 0 and your estimated mean is 1. What is the p-value of your estimate?
Example problem: suppose your estimate of the NJ mean age is 30 from a sample of 100 people, but the true mean and standard deviation are 25 and 20. The central limit theorem says the sample mean is normally distributed, here with standard deviation 20/sqrt(100) = 2. We use this to determine the p-value of our estimate (see the sketch below).
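A minimal R sketch of both calculations; the first assumes the null distribution is N(0, 1), which the slide does not state explicitly:

    pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)  # ~0.159

    # NJ age example: under the central limit theorem the mean of a
    # sample of 100 is ~ Normal(25, 20/sqrt(100)) under the null
    pnorm(30, mean = 25, sd = 20 / sqrt(100), lower.tail = FALSE)  # ~0.0062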

11 Type 1 and type 2 errors
A type 1 error rejects a true null hypothesis (a false positive); a type 2 error fails to reject a false null hypothesis (a false negative).
[Figure courtesy of Wikipedia]

12 Bayes rule
Fundamental to statistical inference
Conditional probability
Posterior = (Likelihood * Prior) / Normalization
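In symbols, for a model M and data D:

    P(M | D) = P(D | M) P(M) / P(D)

where P(D | M) is the likelihood, P(M) is the prior, and P(D) is the normalization constant.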

13 Hypothesis testing
We can use Bayes rule to help make decisions.
An outcome or action is described by a model.
Given two models, we pick the one with the higher probability.
Coin toss example: use the likelihood to determine which coin generated the tosses.

14 Likelihood example
Consider a set of coin tosses produced by a coin with P(H) = p and P(T) = 1 - p.
We are given some tosses (training data): HTHHHTHHHTHTH.
Was the above sequence produced by a fair coin? What is the probability that a fair coin produced the above sequence of tosses?
What is the p-value of your sequence of tosses assuming the coin is fair? This is the same as asking: what is the probability that a fair coin generates 9 or more heads out of 13 tosses? Let's start with exactly 9. Solve it with R (see the sketch below).
Was the above sequence more likely to be produced by biased coin 1 (p = 0.85) or biased coin 2 (p = 0.75)?
Solution: calculate the likelihood (probability) of the data under each coin. Alternatively, we can ask which coin maximizes the likelihood.
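A minimal R sketch for these questions (the sequence above has 9 heads and 4 tails):

    dbinom(9, size = 13, prob = 0.5)                      # P(exactly 9 heads) ~ 0.087
    pbinom(8, size = 13, prob = 0.5, lower.tail = FALSE)  # P(9 or more heads) ~ 0.133
    0.85^9 * 0.15^4   # likelihood of the sequence under coin 1, ~1.2e-04
    0.75^9 * 0.25^4   # likelihood of the sequence under coin 2, ~2.9e-04

Coin 2 gives the higher likelihood, which makes sense: the observed proportion of heads, 9/13 ≈ 0.69, is closer to 0.75 than to 0.85.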

15 Maximum likelihood example
Consider a set of coin tosses produced by a coin with P(H) = p and P(T) = 1 - p.
We want to determine the probability P(H) = p of the coin that produces k heads and n - k tails.
We are given some tosses (training data): HTHHHTHHHTHTH.
Solution: form the log likelihood, differentiate with respect to p, set the derivative to 0, and solve for p (worked out below).
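Carrying out those steps for k heads in n tosses (the original slides show this as an image):

    L(p) = p^k (1 - p)^(n-k)
    log L(p) = k log p + (n - k) log(1 - p)
    d/dp log L(p) = k/p - (n - k)/(1 - p) = 0  =>  p = k/n

For the training data above, p = 9/13 ≈ 0.69.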

16 Maximum likelihood example
[Worked solution shown as an image in the original slides; see the derivation above]

17 Likelihood inference
Assume the data are generated by a Gaussian distribution whose mean and variance are unknown.

18 Gaussian models Assume that class likelihood is represented by a Gaussian distribution with parameters μ (mean) and σ (standard deviation) We find the model (in other words mean and variance) that maximize the likelihood (or equivalently the log likelihood). Suppose we are given training points x1,x2,…,xn1 from class C1. Assuming that each datapoint is drawn independently from C1 the sample log likelihood is

19 Gaussian models
The log likelihood is given by the formula above.
Setting its first derivatives with respect to μ1 and σ1 to 0 gives us the maximum likelihood estimates of μ1 and σ1 (denoted m1 and s1 respectively).
Similarly we determine m2 and s2 for class C2.
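Solving gives the usual estimates (pictured on the slide):

    m1 = (1/n1) Σi xi
    s1² = (1/n1) Σi (xi - m1)²

Note that the maximum likelihood variance divides by n1, not n1 - 1.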

20 Gaussian classification example
Consider one-dimensional data for two classes (SNP genotypes for case and control subjects).
Case (class C1): 1, 1, 2, 1, 0, 2
Control (class C2): 0, 1, 0, 0, 1, 1
Under the Gaussian assumption, the case and control classes are represented by Gaussian distributions with parameters (μ1, σ1) and (μ2, σ2) respectively.
The maximum likelihood estimates of the means are m1 = 7/6 ≈ 1.17 and m2 = 0.5.
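A minimal R sketch of these estimates (note that R's var() divides by n - 1, so the ML variance is computed directly):

    case    <- c(1, 1, 2, 1, 0, 2)
    control <- c(0, 1, 0, 0, 1, 1)
    m1 <- mean(case)               # 7/6 ~ 1.17
    m2 <- mean(control)            # 0.5
    v1 <- mean((case - m1)^2)      # ML variance 17/36 ~ 0.47
    v2 <- mean((control - m2)^2)   # ML variance 0.25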

21 Gaussian classification example
The estimates of the class variances follow the same recipe: s1² = 17/36 ≈ 0.47, and similarly s2² = 0.25.
Which class does x = 1 belong to? What about x = 0 and x = 2? (See the sketch below.)
What happens if the class variances are equal?
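A minimal sketch of the classification, assuming equal class priors so the class with the higher Gaussian density wins; the helper name is mine, and m1, v1, m2, v2 come from the sketch above:

    classify <- function(x) {
      d1 <- dnorm(x, mean = m1, sd = sqrt(v1))  # case density
      d2 <- dnorm(x, mean = m2, sd = sqrt(v2))  # control density
      if (d1 > d2) "case" else "control"
    }
    sapply(c(0, 1, 2), classify)  # -> "control" "case" "case"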

22 Multivariate Gaussian classification
Suppose each datapoint is an m-dimensional vector; in the previous example we would have m SNP genotypes instead of one.
The class likelihood is given below, where Σ1 is the class covariance matrix. Σ1 has dimension m x m, and its (i,j)th entry is the covariance of the ith and jth variables.
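The pictured likelihood is presumably the standard multivariate Gaussian density:

    p(x | C1) = (2π)^(-m/2) |Σ1|^(-1/2) exp( -(1/2) (x - μ1)^T Σ1^(-1) (x - μ1) )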

23 Multivariate Gaussian classification
The maximum likelihood estimates of μ1 and Σ1, and the class log likelihoods with the estimated parameters (ignoring constant terms), are given below.
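Presumably the standard forms, with m1 and S1 denoting the estimates:

    m1 = (1/n1) Σi xi
    S1 = (1/n1) Σi (xi - m1)(xi - m1)^T

    g1(x) = -(1/2) log |S1| - (1/2) (x - m1)^T S1^(-1) (x - m1)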

24 Naïve Bayes algorithm
If we assume that the variables are independent (no interaction between SNPs), then the off-diagonal terms of S are zero and the log likelihood (ignoring constant terms) becomes the sum shown below.
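With a diagonal covariance the class log likelihood reduces to a sum over the m variables:

    g1(x) = -Σj [ log s1j + (xj - m1j)² / (2 s1j²) ]

where m1j and s1j are the mean and standard deviation of the jth variable in class C1.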

25 Multivariate Gaussian classification
If S1 = S2 = S, then the class log likelihoods with estimated parameters (ignoring constant terms) take the form below: classification depends only on the distance to the class means.
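With a shared covariance S, the log |S| term is the same for every class and drops out, leaving the Mahalanobis distance to each class mean:

    gi(x) = -(1/2) (x - mi)^T S^(-1) (x - mi)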

26 Nearest means classifier
If we further assume all the variances sj to be equal, then (ignoring constant terms) we get the nearest means rule: assign x to the class whose mean mi minimizes the squared Euclidean distance ||x - mi||².

27 Gaussian classification example
Consider three SNP genotypes for case and control subjects.
Case (class C1): (1,2,0), (2,2,0), (2,2,0), (2,1,1), (0,2,1), (2,1,0)
Control (class C2): (0,1,2), (1,1,1), (1,0,2), (1,0,0), (0,0,2), (0,1,0)
Classify (1,2,1) and (0,0,1) with the nearest means classifier (see the sketch below).
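A minimal R sketch of the nearest means classifier on this data (the helper name is mine):

    C1 <- rbind(c(1,2,0), c(2,2,0), c(2,2,0), c(2,1,1), c(0,2,1), c(2,1,0))
    C2 <- rbind(c(0,1,2), c(1,1,1), c(1,0,2), c(1,0,0), c(0,0,2), c(0,1,0))
    m1 <- colMeans(C1)  # (1.5, 1.67, 0.33)
    m2 <- colMeans(C2)  # (0.5, 0.5, 1.17)
    nearest_mean <- function(x) {
      if (sum((x - m1)^2) < sum((x - m2)^2)) "C1" else "C2"
    }
    nearest_mean(c(1, 2, 1))  # -> "C1"
    nearest_mean(c(0, 0, 1))  # -> "C2"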

28 Chi-square test
Contingency table: we have two random variables:
Label (L): 0 or 1
Feature (F): categorical, here with values A and B
Null hypothesis: the two variables are independent of each other (unrelated).
Under independence, P(L,F) = P(L)P(F), where P(L=0) = (c1+c2)/n and P(F=A) = (c1+c3)/n.
Expected values: E(X1) = P(L=0) P(F=A) n, and similarly for X2, X3, and X4.
We can calculate the chi-square statistic for a given feature and the probability that it is independent of the label (using the p-value). We look up the chi-square value in the distribution with degrees of freedom = (cols-1)*(rows-1) to get the p-value.
Features with very small probabilities deviate significantly from the independence assumption and are therefore considered important.

              Feature=A                    Feature=B
Label=0       Observed=c1, Expected=X1     Observed=c2, Expected=X2
Label=1       Observed=c3, Expected=X3     Observed=c4, Expected=X4
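A minimal R sketch with hypothetical counts for c1..c4 (made up for illustration; rows are Label = 0, 1 and columns are Feature = A, B):

    tab <- matrix(c(20, 10,    # c1, c2
                     5, 15),   # c3, c4
                  nrow = 2, byrow = TRUE)
    chisq.test(tab, correct = FALSE)  # (2-1)*(2-1) = 1 df here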

