Basic Statistics: Concepts of Distribution, Correlation, Regression, Hypothesis Testing, etc.
P. Guhathakurta, Hydromet Division, IMD, Pune
DATA
Structure: raw data consist of observations (= individuals or cases) and variables (= the observations' attributes).
Levels of measurement (scales of measure):
Discrete variables: 1. Nominal scale, e.g., sex (male or female); 2. Ordinal scale, e.g., economic status (low, middle, high).
Continuous variables: 3. Interval scale, e.g., income ($100,000); 4. Ratio scale, e.g., height, weight.
TYPES OF DATA
Nominal variables allow for only qualitative classification. That is, they can be measured only in terms of whether the individual items belong to some distinctively different categories, but we cannot quantify or even rank order those categories. Typical examples are gender, race, color, city, etc.
Ordinal variables allow us to rank order the items we measure in terms of which has less and which has more of the quality represented by the variable, but they still do not allow us to say "how much more." A typical example of an ordinal variable is the socioeconomic status of families.
Interval variables allow us not only to rank order the items that are measured, but also to quantify and compare the sizes of the differences between them. For example, temperature, as measured in degrees Fahrenheit or Celsius, constitutes an interval scale.
Ratio variables are very similar to interval variables; in addition to all the properties of interval variables, they feature an identifiable absolute zero point, so they allow for statements such as "x is two times more than y." Typical examples are measures of time or space. When the variable equals 0.0, there is none of that variable. Variables like height, weight, and enzyme activity are ratio variables. Temperature, expressed in °F or °C, is not a ratio variable: a temperature of 0.0 on either of those scales does not mean 'no temperature'. However, temperature in kelvin is a ratio variable, as 0.0 K really does mean 'no temperature'.
Systematic and Random Errors
Error: defined as the difference between a calculated or observed value and the "true" value.
Blunders: usually apparent either as obviously incorrect data points or as results that are not reasonably close to the expected value. Easy to detect.
Systematic errors: errors that occur reproducibly, from faulty calibration of equipment or observer bias. Statistical analysis is generally not useful here; rather, corrections must be made based on the experimental conditions.
Random errors: errors that result from fluctuations in the observations. They require that experiments be repeated a sufficient number of times to establish the precision of the measurement.
Uncertainties
In most cases, we cannot know the "true" value unless there is an independent determination (i.e., a different measurement technique); we can only consider estimates of the error.
Discrepancy is the difference between two or more observations. This gives rise to uncertainty.
Probable error: indicates the magnitude of the error we estimate to have made in the measurements. It means that if we make a measurement, we "probably" won't be wrong by more than that amount.
Parent vs. Sample Populations
Parent population: the hypothetical probability distribution we would obtain if we were to make an infinite number of measurements of some variable or set of variables.
Sample population: the actual set of experimental observations or measurements of some variable or set of variables.
In general: (parent parameter) = lim_{N→∞} (sample parameter), i.e., the sample parameter approaches the parent parameter as the number of observations N goes to infinity.
Some univariate statistical terms:
mode: the value that occurs most frequently in a distribution (usually the highest point of the curve); a dataset may have more than one mode.
median: the value midway in the frequency distribution; half the area of the curve is to its right and half to its left.
mean: the arithmetic average, the sum of all observations divided by the number of observations; a poor measure of central tendency in skewed distributions.
range: a measure of dispersion, the maximum minus the minimum; when the max and min are unusual values, the range may be a misleading measure of dispersion.
A histogram is a useful graphical representation of the information content of a sample or parent population. Many statistical tests assume the values are normally distributed, which is not always the case! Examine the data prior to processing.
Deviations
The deviation d_i of any measurement x_i from the mean µ of the parent distribution is defined as the difference between x_i and µ: d_i = x_i − µ.
The average deviation α is defined as the average of the magnitudes of the deviations, which are given by the absolute values of the deviations: α = lim_{N→∞} (1/N) Σ_i |x_i − µ|.
variance: the average squared deviation of all possible observations from the mean (calculated from the sum of squares):
σ² = lim_{n→∞} (1/n) Σ_{i=1..n} (x_i − µ)²
standard deviation: the positive square root of the variance. A small std dev means the observations are clustered tightly around a central value; a large std dev means they are scattered widely about the mean. The sample standard deviation is
s = √[ Σ_{i=1..n} (x_i − x̄)² / (n − 1) ]
where µ is the parent mean, x̄ is the sample mean, x_i is an observed value, and n is the number of observations. The denominator decreases from n to n − 1 for the "sample" variance because the sample mean x̄, itself estimated from the data, is used in the calculation in place of µ.
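A minimal sketch (added for illustration, not part of the original slides; the data values are hypothetical) contrasting the population form (divide by n) with the sample form (divide by n − 1) of the variance:

```python
import numpy as np

x = np.array([4.1, 3.9, 4.3, 4.0, 4.2])   # hypothetical measurements

mean = x.mean()
var_population = ((x - mean) ** 2).sum() / len(x)    # divide by n
var_sample = ((x - mean) ** 2).sum() / (len(x) - 1)  # divide by n - 1

# NumPy exposes the same choice via ddof ("delta degrees of freedom"):
assert np.isclose(var_population, x.var(ddof=0))
assert np.isclose(var_sample, x.var(ddof=1))
print(var_sample ** 0.5)   # sample standard deviation s
```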
Sample Mean and Standard Deviation
Sample mean: x̄ = (1/N) Σ_{i=1..N} x_i.
If the parent mean µ were known, our best estimate of the standard deviation would come from the variance σ² ≈ (1/N) Σ (x_i − µ)². But we cannot know the true parent mean µ, so the best estimates of the sample variance and standard deviation are
s² = [1/(N − 1)] Σ_{i=1..N} (x_i − x̄)², with s its positive square root.
Sampling Distribution
Each estimate of the mean will be different. Treat these as a random sample of means and plot a histogram of the means. This is an estimate of the sampling distribution of the mean. The sampling distribution of any parameter can be obtained in a similar way.
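A small simulation (illustrative only, not in the original slides) of the sampling distribution of the mean; it uses the population values µ = 78.2 and σ = 9.4 that appear in the figure below:

```python
import numpy as np

rng = np.random.default_rng(0)
pop_mean, pop_sd = 78.2, 9.4             # population values from the next slide

for n in (5, 10, 100):
    # draw 50 samples of size n and record each sample's mean
    means = [rng.normal(pop_mean, pop_sd, size=n).mean() for _ in range(50)]
    # the spread of the means shrinks roughly like pop_sd / sqrt(n)
    print(n, np.std(means, ddof=1), pop_sd / np.sqrt(n))
```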
[Figure: Distribution of the mean. Histograms of the means of 50 samples each of size N = 5, N = 10, and N = 100, drawn from a population with µ = 78.2 and σ = 9.4; the distribution of the means narrows as N increases.]
Distribution of the Mean
BUT we don't need to take multiple samples! The standard error of the mean is SE = σ/√N; the SE of the mean is the SD of the distribution of the sample mean.
Distribution of the Sample Mean
The distribution of the sample mean is Normal regardless of the distribution of the sample (unless the sample is small or very skewed), so we can apply Normal theory to the sample mean as well: 95% of sample means lie within 1.96 SEs of the (unknown) true mean. This is the basis for a 95% confidence interval (CI): a 95% CI is an interval which on 95% of occasions includes the population mean.
Example: 57 measurements of the quantity (volume) of a sample, with sample mean 4.06 litres and standard deviation 0.67 litres.
Example: 95% of the population lie within mean ± 1.96 SD, i.e., within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38 litres.
Example: the standard error is 0.67/√57 ≈ 0.089, so for these data there is a 95% chance that the interval 4.06 ± 1.96 × 0.089 contains the true population mean, i.e., it lies between 3.89 and 4.23 litres. This is the 95% confidence interval for the mean.
Confidence Intervals The confidence interval (CI) measures uncertainty. The 95% confidence interval is the range of values within which we can be 95% sure that the true value lies for the whole of the population of patients from whom the study patients were selected. The CI narrows as the number of cases on which it is based increases.
Standard Deviations & Standard Errors
Thus the SE is the SD of the sampling distribution (of the mean, say): SE = SD/√N.
Use the SE to describe the precision of estimates (for example, confidence intervals).
Use the SD to describe the variability of samples, populations, or distributions (for example, reference ranges).
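A minimal sketch (not from the slides) that reproduces the volume example with these formulas; the values 4.06, 0.67, and 57 come from the example slides above:

```python
import math

mean, sd, n = 4.06, 0.67, 57    # sample mean, SD, and size from the example
se = sd / math.sqrt(n)          # standard error: SE = SD / sqrt(N)
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"SE = {se:.3f}, 95% CI = ({lo:.2f}, {hi:.2f})")  # SE = 0.089, CI = (3.89, 4.23)
```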
The t-distribution
When N is small, the estimate of the SD is particularly unreliable and the distribution of the sample mean is not Normal. The distribution is more variable, with longer tails, and its shape depends upon the sample size. This distribution is called the t-distribution.
[Figures: the t-distribution compared with the standard Normal N(0,1):
N = 2: t(1), 95% of the distribution lies within ± 12.7
N = 10: t(9), 95% lies within ± 2.26
N = 30: t(29), 95% lies within ± 2.04]
t-distribution
As N becomes larger, the t-distribution becomes more similar to the Normal distribution.
Degrees of freedom (DF) = sample size − 1; the DF is a measure of the amount of information contained in the data set.
Implications
Confidence interval for the mean:
Sample size < 30: use the t-distribution.
Sample size ≥ 30: use either the Normal or the t-distribution.
Note: stats packages will (generally) automatically use the correct distribution for confidence intervals, as in the sketch below.
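For illustration (not from the original slides; the data values are made up), a t-based 95% confidence interval computed with scipy, which applies the t critical value for the sample's degrees of freedom:

```python
import numpy as np
from scipy import stats

x = np.array([3.2, 4.1, 3.8, 4.4, 3.9, 4.0, 3.6])  # hypothetical small sample (N = 7)
mean, se = x.mean(), stats.sem(x)                   # sem = s / sqrt(N), using ddof = 1
ci = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=se)  # t with 6 DF
print(mean, ci)
```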
Probability and the Binomial Distribution
Coin toss experiment: the probability p of success (landing heads up) is not necessarily equal to the probability q = 1 − p of failure (landing tails up), because the coins may be lopsided! The probability for each of the combinations of x coins heads up and n − x coins tails up is equal to p^x q^(n−x). The binomial distribution gives the probability of observing x successes in n tries:
P_B(x; n, p) = [n! / (x!(n − x)!)] p^x q^(n−x)
The coefficients P_B(x, n, p) are closely related to the binomial theorem for the expansion of a power of a sum.
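A direct transcription of this formula into Python (the helper name binom_pmf is ours):

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n tries, P_B(x; n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# probability of exactly 5 heads in 10 tosses of a fair coin
print(binom_pmf(5, 10, 0.5))   # 0.24609375
```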
Mean and Variance: Binomial Distribution
The mean µ of the binomial distribution is evaluated by combining the definition of µ with the function that defines the probability, yielding µ = np: the average number of successes approaches the probability of success for each item, p, times the number of items, n. For the coin toss experiment with p = 1/2, half the coins should land heads up on average. If the probability for a single success p is equal to the probability for failure, p = q = 1/2, the final distribution is symmetric about the mean, and the mode and median equal the mean. The variance is σ² = np(1 − p) = npq.
Other Probability Distributions: Special Cases
Poisson distribution: an approximation to the binomial distribution for the special case when the average number of successes is very much smaller than the possible number, i.e., µ << n because p << 1. Important for the study of such phenomena as radioactive decay. The distribution is NOT necessarily symmetric! Data are usually bounded on one side and not the other. An advantage is that the distribution is specified by the single parameter µ; for a Poisson distribution the variance equals the mean, σ² = µ. [Figures: Poisson distributions with µ = 1.67 and µ = 10.0.]
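A sketch (with assumed numbers, not from the slides) showing the Poisson distribution with µ = np approximating the binomial when p << 1 and µ << n:

```python
from math import comb, exp, factorial

n, p = 1000, 0.002          # mu = n * p = 2, with p << 1 and mu << n
mu = n * p
for x in range(5):
    binom = comb(n, x) * p**x * (1 - p)**(n - x)   # exact binomial probability
    poisson = mu**x * exp(-mu) / factorial(x)      # Poisson approximation
    print(x, round(binom, 5), round(poisson, 5))   # the two columns nearly agree
```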
Gaussian or Normal Error Distribution
Gaussian distribution: the most important probability distribution in the statistical analysis of experimental data. Its functional form is relatively simple, and the resultant distribution is reasonable. Again, this is a special limiting case of the binomial distribution, where the number of possible different observations, n, becomes infinitely large, yielding np >> 1. The most probable estimate of the mean µ from a random sample of observations is the average of those observations! A tangent along the steepest portion of the probability curve intersects the curve at e^(−1/2) of its peak value and intersects the x axis at the points x = µ ± 2σ.
PROBABLE ERROR (P.E.): the absolute value of the deviation such that the probability that any random observation deviates from the mean by less than the P.E. is 1/2. Equivalently, a range within one probable error on either side of the mean includes 50% of the data values; this is 0.6745σ.
RELIABLE ERROR: a range within one reliable error on either side of the mean includes 90% of the data values; this is 1.6449σ.
Empirical Rule
For Gaussian or normal error distributions:
The total area underneath the curve is 1.00 (100%).
68.27% of observations lie within ± 1 std dev of the mean.
95.45% of observations lie within ± 2 std dev of the mean.
99.73% of observations lie within ± 3 std dev of the mean.
Variance, standard deviation, probable error, mean, and weighted root mean square error are commonly used statistical terms.
Gaussian Details, cont'd.
The probability function for the Gaussian distribution is defined as
p_G(x; µ, σ) = [1/(σ√(2π))] exp[−(x − µ)²/(2σ²)]
Z-score: if we know the population mean and population standard deviation, then for any value of x we can compute a z-score by subtracting the population mean and dividing the result by the population standard deviation:
z = (x − µ)/σ
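A short sketch (hypothetical µ and σ) of the z-score and the Gaussian areas quoted on the previous slide, using scipy's normal CDF:

```python
from scipy.stats import norm

mu, sigma = 100.0, 15.0              # hypothetical population parameters
x = 130.0
z = (x - mu) / sigma                 # z-score: 2.0
print(z, norm.cdf(z))                # ~0.977: fraction of the population below x
print(norm.cdf(2) - norm.cdf(-2))    # ~0.9545: within +/- 2 SD, as quoted above
```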
Analysis of Data: Inferential Analysis
Relationship analysis: correlation.
Correlation means linear association between two variables.
There are three types of correlation: positive, zero, and negative.
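A minimal sketch (hypothetical data) computing the Pearson correlation coefficient, the usual measure of the linear association described above; values near +1, 0, and −1 correspond to positive, zero, and negative correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r = np.corrcoef(x, y)[0, 1]               # Pearson correlation coefficient
print(r)                                  # close to +1: strong positive association
```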
Analysis of Data: Inferential Analysis
Regression (independent and dependent relationships among variables):
1. Number of independent variables.
Single/simple regression model: association of one independent variable with one dependent variable, e.g., Y = β₀ + β₁X₁ + e, where Y is the dependent variable, X₁ is the independent variable, e is the error, β₀ is the intercept, and β₁ is the slope of X₁.
Multiple regression model: association of two or more independent variables with one dependent variable, e.g., Y = β₀ + β₁X₁ + β₂X₂ + e.
2. Shape of the regression line: linear regression model or non-linear regression model.
Simple Linear Regression
Given a data set of (x, y) pairs, the problem is to find the particular straight line, ŷ = a + bx, minimizing the squared vertical distances between it and the data points. The circumflex ("hat") accent signifies that the equation specifies a predicted value of y. The differences between the line and the data points, called errors or residuals, are defined as e_i = y_i − ŷ(x_i), so that
y_i = ŷ(x_i) + e_i = a + bx_i + e_i
The true value of the predictand is the sum of the predicted value and the residual.
Method of Least Squares
The regression procedure chooses the line that produces the least error for predictions of y based on x. In order to minimize the sum of squared residuals, Σ e_i² = Σ (y_i − a − bx_i)², it is only necessary to set its derivatives with respect to the parameters a and b to zero and solve. These derivatives are
∂(Σ e_i²)/∂a = −2 Σ (y_i − a − bx_i) = 0
∂(Σ e_i²)/∂b = −2 Σ x_i (y_i − a − bx_i) = 0
Rearranging these two equations leads to the so-called normal equations,
Σ y_i = na + b Σ x_i
Σ x_i y_i = a Σ x_i + b Σ x_i²
Method of Least Squares
Finally, solving the normal equations for the regression parameters yields
b = [n Σ x_i y_i − (Σ x_i)(Σ y_i)] / [n Σ x_i² − (Σ x_i)²]
a = ȳ − b x̄
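A sketch (hypothetical data) implementing these closed-form expressions directly, with numpy.polyfit as a cross-check:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical predictand
n = len(x)

# slope and intercept from the normal-equation solution above
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a = y.mean() - b * x.mean()

b_chk, a_chk = np.polyfit(x, y, 1)        # polyfit returns (slope, intercept)
assert np.allclose([a, b], [a_chk, b_chk])
print(a, b)
```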
The term SST is an acronym for sum of squares, total, which has the mathematical meaning of the sum of squared deviations of the y values around their mean: SST = Σ (y_i − ȳ)². The term SSR stands for the regression sum of squares, or the sum of squared differences between the regression predictions and the sample mean of y: SSR = Σ (ŷ_i − ȳ)². Finally, SSE refers to the sum of squared differences between the residuals and their mean, which is zero, or the sum of squared errors: SSE = Σ e_i² = Σ (y_i − ŷ_i)². These three quantities satisfy SST = SSR + SSE.
Analysis-of-variance table
Output from regression analysis is often given in an ANOVA table (k = 1 for simple regression, i.e., a single predictor x):
Source      df          SS    MS
Regression  k           SSR   MSR = SSR/k
Residual    n − k − 1   SSE   MSE = SSE/(n − k − 1)
Total       n − 1       SST
The second usual measure of the fit of a regression is the coefficient of determination, or R². This can be computed from R² = SSR/SST = 1 − SSE/SST. The R² can be interpreted as the proportion of the variation of the predictand (proportional to SST) that is described or accounted for by the regression (SSR). For a perfect regression, SSR = SST and SSE = 0, so R² = 1. For a completely useless regression, SSR = 0 and SSE = SST, so that R² = 0. The third commonly used measure of the strength of the regression is the F ratio, F = MSR/MSE, generally given in the last column of the ANOVA table. The F ratio increases with the strength of the regression, since a strong relationship between x and y will produce a large MSR and a small MSE. Assuming that the residuals are independent and follow the same Gaussian distribution, and under the null hypothesis of no real linear relationship, the sampling distribution of the F ratio has a known parametric form.
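Continuing the least-squares sketch above (same hypothetical data), the sums of squares, R², and F ratio can be computed directly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k = len(x), 1                          # k = 1 predictor (simple regression)

b, a = np.polyfit(x, y, 1)                # slope, intercept
y_hat = a + b * x                         # regression predictions

sst = ((y - y.mean()) ** 2).sum()         # total sum of squares
ssr = ((y_hat - y.mean()) ** 2).sum()     # regression sum of squares
sse = ((y - y_hat) ** 2).sum()            # sum of squared errors; sst = ssr + sse

r2 = ssr / sst                            # coefficient of determination
f = (ssr / k) / (sse / (n - k - 1))       # F = MSR / MSE
print(r2, f)
```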
Terms Important to Remember
Population: all possible values.
Sample: a portion of the population.
Statistical inference: generalizing from a sample to a population with a calculated degree of certainty. Two forms of statistical inference: hypothesis testing and estimation.
Parameter: a characteristic of the population, e.g., the population mean µ.
Statistic: calculated from data in the sample, e.g., the sample mean x̄.
Distinctions Between Parameters and Statistics
Parameters describe the population; they are fixed (not random), usually unknown, and denoted by Greek letters (e.g., µ, σ). Statistics are calculated from the sample; they vary from sample to sample, are known once computed, and are denoted by Roman letters (e.g., x̄, s).
Hypothesis Testing: Preliminaries
A hypothesis is a statement that something is true.
Null hypothesis: the hypothesis to be tested; we use the symbol H₀ to represent the null hypothesis.
Alternative hypothesis: a hypothesis to be considered as an alternative to the null hypothesis; we use the symbol Hₐ to represent the alternative hypothesis. The alternative hypothesis is the one believed to be true, or what you are trying to prove is true.
Type I and Type II Errors
Decision \ true state of nature:   H₀ is true         H₀ is false
We reject H₀:                      Type I error       Correct decision
We fail to reject H₀:              Correct decision   Type II error
A Type I error is rejecting a true null hypothesis; a Type II error is failing to reject a false null hypothesis.
Alpha vs. Beta
α is the probability of a Type I error; β is the probability of a Type II error. The experimenters have the freedom to set the α-level for a particular hypothesis test; that level is called the level of significance for the test. Changing α can (and often does) affect the results of the test, i.e., whether you reject or fail to reject H₀.
Null Hypothesis vs. Alternative Hypothesis
Null hypothesis: a statement about the value of a population parameter; represented by H₀; always stated as an equality.
Alternative hypothesis: a statement about the value of a population parameter that must be true if the null hypothesis is false; represented by H₁; stated in one of three forms: >, <, or ≠.
Forming Conclusions
Every hypothesis test ends with the experimenters (you and I) either rejecting the null hypothesis or failing to reject the null hypothesis. As strange as it may seem, we never accept the null hypothesis: the best we can ever say about the null hypothesis is that we don't have enough evidence, based on a sample, to reject it!
Test of hypothesis for a population mean (two-tailed test and large sample)
1) Hypotheses: H₀: µ = µ₀ vs. Hₐ: µ ≠ µ₀
2) Test statistic (large-sample case): z = (x̄ − µ₀) / (σ/√n)
3) Critical value, rejection and acceptance regions: the bigger the absolute value of z, the stronger the case for rejecting the null hypothesis. The critical value depends on the significance level α. Rejection region: |z| > z_{α/2}.
Test of hypothesis for a population mean (one-tailed test and large sample)
1) Hypotheses: H₀: µ = µ₀ vs. Hₐ: µ > µ₀, or H₀: µ = µ₀ vs. Hₐ: µ < µ₀
2) Test statistic (large-sample case): z = (x̄ − µ₀) / (σ/√n)
3) Critical value, rejection and acceptance regions: rejection region z > z_α or z < −z_α, respectively. A sketch covering both cases follows.
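A sketch of both rejection rules in one helper (the function z_test and its argument names are ours, not from the slides); the sample numbers preview the worked example that follows:

```python
from scipy.stats import norm

def z_test(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Large-sample z-test; returns (z, reject H0?)."""
    z = (xbar - mu0) / (sigma / n**0.5)
    if tail == "two":
        return z, abs(z) > norm.ppf(1 - alpha / 2)   # |z| > z_{alpha/2}
    if tail == "right":
        return z, z > norm.ppf(1 - alpha)            # z > z_alpha
    return z, z < -norm.ppf(1 - alpha)               # left-tailed: z < -z_alpha

print(z_test(370.16, 350, 75, 25))   # (1.344, False): fail to reject H0
```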
Concepts of Hypothesis Testing…
Consider mean demand for computers during assembly lead time. Rather than estimate the mean demand, our operations manager wants to know whether the mean is different from 350 units. In other words, someone is claiming that the mean is 350 units and we want to check this claim out to see if it appears reasonable. We can rephrase this request into a test of the hypothesis H₀: µ = 350; thus, our research hypothesis becomes H₁: µ ≠ 350. Recall that the standard deviation [σ] was assumed to be 75, the sample size [n] was 25, and the sample mean [x̄] was calculated to be 370.16.
Concepts of Hypothesis Testing (2)…
The testing procedure begins with the assumption that the null hypothesis is true. Thus, until we have further statistical evidence, we will assume H₀: µ = 350 (assumed to be TRUE). The next step is to determine the sampling distribution of the sample mean assuming the true mean is 350: x̄ is normal with mean 350 and standard error 75/√25 = 15.
Is the Sample Mean in the Guts of the Sampling Distribution??
Three ways to determine this: first way
1. Unstandardized test statistic: is x̄ in the guts of the sampling distribution? That depends on what you define as the "guts" of the sampling distribution. If we define the guts as the center 95% of the distribution [this means α = 0.05], then the critical values that define the guts will be 1.96 standard deviations of x̄ on either side of the mean of the sampling distribution [350], or
UCV = 350 + 1.96 × 15 = 350 + 29.4 = 379.4
LCV = 350 − 1.96 × 15 = 350 − 29.4 = 320.6
[Figure: the unstandardized test statistic approach, showing the sampling distribution of x̄ centered at 350 with rejection regions below LCV = 320.6 and above UCV = 379.4.]
Three ways to determine this: second way
2. Standardized test statistic: since we defined the "guts" of the sampling distribution to be the center 95% [α = 0.05], if the z-score for the sample mean is greater than 1.96, we know that x̄ will be in the rejection region on the right side, and if the z-score is less than −1.96, x̄ will be in the rejection region on the left side.
Z = (x̄ − µ)/(σ/√n) = (370.16 − 350)/15 = 1.344
Is this z-score in the guts of the sampling distribution?
Three ways to determine this: third way
3. The p-value approach (which is generally used with a computer and statistical software): increase the "rejection region" until it "captures" the sample mean. For this example, since x̄ is to the right of the mean, calculate P(x̄ > 370.16) = P(Z > 1.344) = 0.0901. Since this is a two-tailed test, you must double this area for the p-value: p-value = 2 × 0.0901 = 0.1802. Since we defined the guts as the center 95% [α = 0.05], the rejection region is the other 5%. Since our sample mean is in the 18.02% region, it cannot be in our 5% rejection region [α = 0.05], so we fail to reject H₀.
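Checking the worked example with scipy (the slide rounds z to two decimals, giving a one-tail area of 0.0901; the unrounded value is about 0.090):

```python
from scipy.stats import norm

z = (370.16 - 350) / 15          # standardized test statistic, z = 1.344
p_one_tail = 1 - norm.cdf(z)     # P(Z > 1.344), about 0.090
p_value = 2 * p_one_tail         # two-tailed p-value, about 0.18 > alpha = 0.05
print(z, p_one_tail, p_value)    # so the sample mean is not in the rejection region
```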
[Figure: the p-value approach, showing the tail areas beyond z = ±1.344 that together make up the p-value of 0.1802.]
Hypothesis test for differences in mean
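The body of this slide was not captured in the transcript. As an illustrative sketch only (hypothetical data, and scipy's standard two-sample t-test rather than necessarily the slide's own method), a test for a difference between two means:

```python
import numpy as np
from scipy import stats

a = np.array([22.1, 23.4, 21.8, 24.0, 22.7])   # hypothetical group 1
b = np.array([20.9, 21.5, 22.0, 20.4, 21.1])   # hypothetical group 2
t, p = stats.ttest_ind(a, b)      # H0: the two population means are equal
print(t, p)                       # small p-value -> reject H0 at alpha = 0.05
```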