DATA ANALYSIS Module Code: CA660 Lecture Block 3.


1 DATA ANALYSIS Module Code: CA660 Lecture Block 3

2 Standard Statistical Distributions
Importance: they model practical applications; their mathematical properties are known; they are described by a few parameters, which have natural interpretations.
Bernoulli Distribution. Used to model a trial/experiment which gives rise to two outcomes: success/failure, male/female, 0/1, … Let p be the probability that the outcome is one and q = 1 − p the probability that the outcome is zero, i.e. P{X = 1} = p, P{X = 0} = 1 − p.
E[X] = p(1) + (1 − p)(0) = p
VAR[X] = p(1)² + (1 − p)(0)² − E[X]² = p(1 − p)

3 Standard distributions - Binomial
Binomial Distribution. Suppose we are interested in the number of successes X in n independent repetitions of a Bernoulli trial, where the probability of success in an individual trial is p. Then
P{X = k} = C(n,k) p^k (1 − p)^(n−k), k = 0, 1, …, n
E[X] = np, VAR[X] = np(1 − p). (Illustrative plot on the slide: n = 4, p = 0.2.)
This is the appropriate distribution to model, e.g., preferences expressed between two brands, or the number of recombinant gametes produced by a heterozygous parent for a 2-locus model. The extension for ≥ 3 loci (brands) is the multinomial.
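These formulas can be checked directly with a short Python sketch (standard library only; the values n = 4, p = 0.2 are the slide's illustrative ones):

```python
from math import comb

def binom_pmf(k, n, p):
    """P{X = k} = C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 4, 0.2
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# Moments computed from the p.m.f. should match np and np(1-p)
mean = sum(k * q for k, q in enumerate(pmf))                 # np = 0.8
var = sum(k**2 * q for k, q in enumerate(pmf)) - mean**2     # np(1-p) = 0.64
```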

4 Standard distributions - Poisson
Poisson Distribution. The Poisson distribution arises as a limiting case of the binomial, where n → ∞ and p → 0 in such a way that np → λ (constant):
P{X = k} = exp(−λ) λ^k / k!, k = 0, 1, 2, …
E[X] = VAR[X] = λ
The Poisson is used to model the number of occurrences of a phenomenon in a fixed period of time or space, e.g.:
- particles emitted by a radioactive source in a fixed direction in an interval ΔT
- people arriving in a queue in a fixed interval of time
- genomic mapping functions, e.g. cross-over as a random event
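The limiting relationship can be seen numerically: for large n and small p with np = λ fixed, the binomial probabilities approach the Poisson ones. A sketch (the values n = 1000, λ = 2 are chosen purely for illustration):

```python
from math import comb, exp, factorial

lam = 2.0
n = 1000
p = lam / n   # np = λ held fixed

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Largest discrepancy between the two p.m.f.s over the first few k values
max_gap = max(abs(poisson_pmf(k, lam) - binom_pmf(k, n, p)) for k in range(10))
```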

5 Other Standard examples: e.g. Hypergeometric, Exponential
Hypergeometric. Consider a population of M items, of which W are deemed to be successes. Let X be the number of successes that occur in a sample of size n, drawn without replacement from the finite population. Then
P{X = k} = C(W,k) C(M−W, n−k) / C(M,n), k = 0, 1, 2, …
E[X] = nW/M, VAR[X] = nW(M − W)(M − n) / {M²(M − 1)}
Exponential: a special case of the Gamma distribution with n = 1, used e.g. to model the inter-arrival time of customers or the time to arrival of the first customer in a simple queue, fragment lengths in genome mapping, etc. The p.d.f. is
f(x) = λ exp(−λx), x ≥ 0, λ > 0; = 0 otherwise.
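The hypergeometric p.m.f. and its mean can likewise be checked with a small sketch (M = 50, W = 5, n = 10 are illustrative values):

```python
from math import comb

def hypergeom_pmf(k, M, W, n):
    """P{X = k} when drawing n without replacement from M items, W of them successes."""
    return comb(W, k) * comb(M - W, n - k) / comb(M, n)

M, W, n = 50, 5, 10
pmf = [hypergeom_pmf(k, M, W, n) for k in range(min(W, n) + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
expected_mean = n * W / M   # E[X] = nW/M = 1.0 here
```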

6 Standard p.d.f.'s - Gaussian/Normal
A random variable X has a normal distribution with mean μ and standard deviation σ if it has density
f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), −∞ < x < ∞
with E[X] = μ and VAR[X] = σ². It arises naturally as the limiting distribution of the average of a set of independent, identically distributed random variables with finite variances. It plays a central role in sampling theory and is a good approximation to a large class of empirical distributions; the default assumption in many empirical studies is that each observation is approximately ~ N(μ, σ²).
Note: statistical tables of the Normal distribution are of great importance in analysing practical data sets. X is said to be a standardised Normal variable if μ = 0 and σ = 1.
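Standardisation is what lets a single table (or function) serve every Normal distribution. A sketch using Python's standard-library NormalDist (μ = 100, σ = 15 are illustrative parameters):

```python
from statistics import NormalDist

mu, sigma = 100, 15
X = NormalDist(mu, sigma)
Z = NormalDist(0, 1)        # standardised Normal

x = 120
z = (x - mu) / sigma        # standardise the observation
p_direct = X.cdf(x)         # P{X <= x}
p_standardised = Z.cdf(z)   # identical probability via the standard Normal
```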

7 Standard p.d.f.'s: Student's t-distribution
A random variable X has a t-distribution with ν degrees of freedom (t_ν) if it has density
f(x) ∝ (1 + x²/ν)^(−(ν+1)/2), −∞ < x < ∞; = 0 otherwise.
It is symmetrical about the origin, with E[X] = 0 and VAR[X] = ν/(ν − 2). For small ν, the t_ν distribution is very flat; for ν ≥ 25, the t_ν distribution ≈ the standard Normal curve.
If Z is a standard Normal variable, W has a χ²_ν distribution, and Z and W are independent, then the r.v. T = Z/√(W/ν) has the t_ν form. If x₁, x₂, …, x_n is a random sample from N(μ, σ²), and we define t = (x̄ − μ)/(s/√n), then t has a t-distribution with n − 1 d.o.f.

8 Chi-Square Distribution
A r.v. X has a Chi-square distribution with n degrees of freedom (χ²_n, n a positive integer) if it is a Gamma distribution with λ = 1/2 and shape parameter n/2, so its p.d.f. is
f(x) = x^(n/2 − 1) exp(−x/2) / (2^(n/2) Γ(n/2)), x > 0
E[X] = n; VAR[X] = 2n
Two important applications:
- If X₁, X₂, …, X_n is a sequence of independently distributed standardised Normal random variables, then the sum of squares X₁² + X₂² + … + X_n² has a χ² distribution with n degrees of freedom.
- If x₁, x₂, …, x_n is a random sample from N(μ, σ²), then (n − 1)s²/σ² has a χ² distribution with n − 1 d.o.f., and the r.v.'s x̄ and s² are independent.

9 F-Distribution
A r.v. X has an F distribution with m and n d.o.f. if it has density function (a ratio of gamma functions)
f(x) ∝ x^(m/2 − 1) (1 + (m/n)x)^(−(m+n)/2) for x > 0; = 0 otherwise.
For X and Y independent r.v.'s with X ~ χ²_m and Y ~ χ²_n, the ratio (X/m)/(Y/n) ~ F_(m,n).
One consequence: if x₁, x₂, …, x_m (m ≥ 2) is a random sample from N(μ₁, σ₁²), and y₁, y₂, …, y_n (n ≥ 2) a random sample from N(μ₂, σ₂²), then (s₁²/σ₁²)/(s₂²/σ₂²) ~ F_(m−1, n−1).

10 Sampling and Sampling Distributions – Extended Examples (refer to primer)
Central Limit Theorem: if X₁, X₂, …, X_n are a random sample of a r.v. X (mean μ, variance σ²), then, in the limit as n → ∞, the sampling distribution of the mean, standardised as
U (or Z) = (x̄ − μ)/(σ/√n)
has a standard Normal distribution, N(0,1). This gives probabilities (limits) for the sampling distribution for large n; U (or Z) is the standardised Normal deviate.
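The theorem can be illustrated by simulation: even when the parent population is far from Normal (here Uniform(0,1)), the standardised sample mean behaves like N(0,1). A sketch, with sample size and repetition count chosen for illustration:

```python
import random
from statistics import NormalDist, mean

random.seed(42)

# Parent population: Uniform(0,1); mu = 0.5, sigma^2 = 1/12
mu, sigma = 0.5, (1 / 12) ** 0.5
n, reps = 30, 20_000

def standardised_sample_mean():
    xbar = mean(random.random() for _ in range(n))
    return (xbar - mu) / (sigma / n ** 0.5)

us = [standardised_sample_mean() for _ in range(reps)]

# Fraction of U values at or below 1.645 should approach Phi(1.645) ~ 0.95
frac = sum(u <= 1.645 for u in us) / reps
```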

11 Large Sample Theory
In particular, P{U ≤ u} = Φ(u), the C.D.F. or D.F. In general, the closer the behaviour of the random variable X is to the Normal, the faster the approximation approaches U. Generally, n ≥ ~25 ⇒ "large sample" theory.

12 Attribute and Proportionate Sampling (recall primer)
Here the sample proportion and the sample mean are synonymous.
Probability statements: if X and Y are independent Binomially distributed r.v.'s with parameters (n, p) and (m, p) respectively, then X + Y ~ B(n + m, p). So Y = X₁ + X₂ + … + X_n ~ B(n, p) for IID X_i ~ B(1, p). Since we know μ_Y = np and σ_Y = √(npq), clearly the sample proportion p̂ = Y/n has mean p and standard error √(pq/n); this is the sampling distribution of a proportion.

13 Differences in Proportions
Can use χ²: contingency-table type set-up. Can also set up as a parallel to the difference estimate or test of 2 (independent) means, so for a 100(1 − α)% C.I.:
(p̂₁ − p̂₂) ± U_(α/2) × S.E.(p̂₁ − p̂₂)
Under H₀: P₁ − P₂ = 0, the S.E. can be written using the pooled proportion p̂ = (X + Y)/(n₁ + n₂), where X and Y are the numbers of successes:
S.E. = √(p̂(1 − p̂)(1/n₁ + 1/n₂)), for n₁, n₂ large.
For small samples, use t with n − 1 d.o.f. (2-sided).

14 C.L.T. and Approximations summary
General form of the theorem: for an infinite sequence of independent r.v.'s, with means and variances as before, the standardised sum/mean approximates U for n large enough. Note: there is no condition on the form of the distribution of the X's (the raw data).
Strictly, for approximations of discrete distributions, results can be improved by a correction for continuity, e.g. P{X ≤ k} ≈ Φ((k + 0.5 − np)/√(npq)).

15 Generalising the Sampling Distribution Concept (see primer)
For the sampling distribution of any statistic: a sample characteristic is an unbiased estimator of the parent population characteristic if the mean of the corresponding sampling distribution is equal to the parent characteristic. Likewise, the sample average (proportion) is an unbiased estimator of the parent average (proportion).
Sampling without replacement from a finite population gives the Hypergeometric distribution, with finite population correction (fpc) = √[(N − n)/(N − 1)], where N and n are the parent population and sample sizes respectively. The above applies to the variance also.

16 Examples
A large-scale 1980 survey in a country showed 30% of the adult population with a given classification. If this is still the current rate, what is the probability that, in a random sample of 1000, the number with this classification will be (a) < 280, (b) 316 or more?
Soln. Let X = no. of successes (with the trait) in the sample. For an expected proportion of 0.3 in the population, we suppose X ~ B(1000, 0.3). Since np = 300 and √(npq) = √210 = 14.49, the distribution of X is approximately Normal with mean 300 and s.d. 14.49.
(a) P{X < 280} = P{X ≤ 279}; (b) P{X ≥ 316}: both follow from standardising and using the Normal table.
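The two probabilities left blank on the slide can be filled in numerically (a sketch using the stdlib NormalDist; a continuity correction is applied since X is discrete):

```python
from statistics import NormalDist

n, p = 1000, 0.3
mu = n * p                       # 300
sd = (n * p * (1 - p)) ** 0.5    # sqrt(210) ~ 14.49
Z = NormalDist(0, 1)

# (a) P{X < 280} = P{X <= 279}; continuity correction -> 279.5
p_a = Z.cdf((279.5 - mu) / sd)   # roughly 0.08

# (b) P{X >= 316}; continuity correction -> 315.5
p_b = 1 - Z.cdf((315.5 - mu) / sd)   # roughly 0.14
```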

17 Examples
Auditors are checking whether a certain firm is overstating the value of its inventory items. They decide to randomly select 15 items. For each, they determine the recorded amount (R) and the audited (exact) amount (A), and hence the difference X = R − A, the variable of interest. Of particular interest is whether the average difference > €250.
Data: 170 350 310 220 500 420 560 230 270 380 200 250 430 450 210
So n = 15, x̄ = €330 and s = €121.5.
H₀: μ ≤ €250 vs H₁: μ > €250
Decision rule: reject H₀ if t = (x̄ − μ₀)/(s/√n) > t_(0.05, 14) = 1.761, where d.o.f. = n − 1 = 14.
Value from data: t = (330 − 250)/(121.5/√15) = 2.55. Since 2.55 > 1.761, reject H₀. Also, the p-value is the area to the right of 2.55; it is between 0.01 and 0.025 (so less than α = 0.05), so again reject H₀.
The data indicate that the firm is overstating the value of its inventory items by more than €250 on average.
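The sample statistics and the t-value above can be reproduced directly from the slide's data (a stdlib-only sketch):

```python
from math import sqrt
from statistics import mean, stdev

# Differences X = R - A for the 15 audited items (data from the slide)
x = [170, 350, 310, 220, 500, 420, 560, 230, 270, 380,
     200, 250, 430, 450, 210]

n = len(x)
xbar = mean(x)     # 330
s = stdev(x)       # ~121.5 (sample s.d., n-1 divisor)
mu0 = 250          # hypothesised mean difference

t = (xbar - mu0) / (s / sqrt(n))   # ~2.55
reject = t > 1.761                 # t_{0.05,14} from the t-table
```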

18 Examples contd.
Blood pressure readings before and after 6 months on medication were taken in women students (aged 25–35); sample of 15. Calculate (a) a 95% C.I. for the mean change in B.P.; (b) test at the 1% level of significance (α = 0.01) that the medication reduces B.P.
Data:
Subject:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
1st (x): 70 80 72 76 76 76 72 78 82 64 74 92 74 68 84
2nd (y): 68 72 62 70 58 66 68 52 64 72 74 60 74 72 74
d = x−y:  2  8 10  6 18 10  4 26 18 −8  0 32  0 −4 10
(a) d̄ = 8.8 and s_d = 10.98, so the 95% confidence limits are d̄ ± t_(0.025, 14) s_d/√n.

19 Contd.
The value of t_(0.025) is based on d.o.f. = 14; from the t-table, t_(0.025) = 2.145. So the 95% C.I. is 8.80 ± 2.145 × 10.98/√15, i.e. limits 8.80 ± 6.08, or (2.72, 14.88): we are 95% confident that there is a mean difference (reduction) in B.P. of between 2.72 and 14.88.
(b) The claim is that μ_d > 0, so we test H₀: μ_d = 0 vs H₁: μ_d > 0, using the t-statistic as before but right-tailed (one-sided only). Rejection region: for d.o.f. = 14, t_(0.01) = 2.624. The calculated value from our data, t = 8.8/(10.98/√15) = 3.10, is clearly in the rejection region, so H₀ is rejected in favour of H₁ at α = 0.01.
A reduction in B.P. after medication is strongly supported by the data.
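Both the confidence interval and the one-sided test can be verified from the slide's differences (a stdlib-only sketch):

```python
from math import sqrt
from statistics import mean, stdev

# Paired differences d = x - y (before minus after), from the slide
d = [2, 8, 10, 6, 18, 10, 4, 26, 18, -8, 0, 32, 0, -4, 10]

n = len(d)
dbar = mean(d)        # 8.8
s = stdev(d)          # ~10.98
se = s / sqrt(n)

# (a) 95% C.I. using t_{0.025,14} = 2.145
lo, hi = dbar - 2.145 * se, dbar + 2.145 * se   # ~(2.72, 14.88)

# (b) one-sided test of H0: mu_d = 0 vs H1: mu_d > 0; t_{0.01,14} = 2.624
t = dbar / se
reject = t > 2.624
```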

20 Examples
Rates of preference were recorded for product P1 among children in a given age group. Of 113 boys tested, 34 indicated a positive preference, while of 139 girls tested, 54 were positive. Is the evidence strong for a higher preference rate in girls?
H₀: p₁ = p₂ vs H₁: p₁ < p₂ (where p₁, p₂ are the proportions of boys and girls with positive preference respectively).
Soln. p̂₁ = 34/113 = 0.301, p̂₂ = 54/139 = 0.388, pooled p̂ = 88/252 = 0.349, and U = (p̂₁ − p̂₂)/√(p̂(1 − p̂)(1/113 + 1/139)) ≈ −1.44. Cannot reject H₀: the actual p-value = P{U ≤ −1.44} = 0.0749.
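The pooled two-proportion test statistic can be reproduced as follows (stdlib-only sketch; the slide rounds the statistic to −1.44 before reading its p-value from the table):

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 34, 113    # boys with positive preference
x2, n2 = 54, 139    # girls with positive preference

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)     # pooled proportion under H0

se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
u = (p1 - p2) / se                 # ~ -1.45 before rounding
p_value = NormalDist().cdf(u)      # one-sided p-value, ~0.073-0.075
cannot_reject = p_value > 0.05
```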

21 Developed Examples using Standard Distributions/Sampling Distributions
Lot Acceptance Sampling in SPC; the Binomial is frequently used. Suppose a shipment of 500 calculator chips arrives at an electronics firm; it is acceptable if a sample of size 10 has no more than one defective chip. What is the probability of accepting the lot if, in fact, (i) 10% (50 chips) are defective, (ii) 20% (100) are defective?
n = 10 trials, each with 2 outcomes: Success = defective, Failure = not defective. p = P{Success} = 0.10 (assumed constant for simplicity). X = no. of successes out of n trials = no. defective out of 10 sampled. The electronics firm will accept the shipment if X = 0 or 1.
(i) P{accept} = P{0 or 1} = P{0} + P{1} = P{X ≤ 1} (cumulative). From tables, with n = 10, p = 0.10: P{0} = 0.349, P{1} = 0.387, so P{accept} = 0.736, i.e. a 73.6% chance.
(ii) For p = 0.20: P{0} = 0.107, P{1} = 0.268, so P{accept} = 0.375, or a 37.5% chance.
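In place of the tables, the acceptance probabilities can be computed directly from the binomial p.m.f. (sketch):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10   # sample size; lot accepted if X = 0 or 1 defectives

# (i) 10% defective
p_accept_10 = binom_pmf(0, n, 0.10) + binom_pmf(1, n, 0.10)   # ~0.736
# (ii) 20% defective
p_accept_20 = binom_pmf(0, n, 0.20) + binom_pmf(1, n, 0.20)   # ~0.376
```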

22 Example contd.
Suppose we have a shipment of 50 chips, with a set-up otherwise as before – check for lot acceptance, still selecting a sample of size 10 and assuming 10% (5 chips) defective. Success and Failure as before.
Now, though, p = P{Success on 1st trial} = 5/50 = 0.1, but conditionally P{Success on 2nd trial} = 5/49 = 0.102 if the 1st pick is a failure (not defective), or P{Success on 2nd trial} = 4/49 = 0.082 if the 1st is defective (a success). This is the Hypergeometric situation.
Think of two sets in the shipment – one having 5 S's, the other 45 F's – and take 10 chips randomly from the two sections. If x are selected from the S set, then 10 − x must be selected from the F set, i.e. M = 50, W = 5, n = 10.
So P{1 S and 9 F's} = P{1} = C(5,1) C(45,9) / C(50,10) ≈ 0.43, and P{0} from the similar expression ≈ 0.31, c.f. the Binomial values above.
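The hypergeometric acceptance probability can be checked the same way (sketch):

```python
from math import comb

M, W, n = 50, 5, 10   # 50 chips, 5 defective, sample of 10

def hyper_pmf(k):
    return comb(W, k) * comb(M - W, n - k) / comb(M, n)

p0, p1 = hyper_pmf(0), hyper_pmf(1)   # ~0.31 and ~0.43
p_accept = p0 + p1                    # ~0.74, c.f. 0.736 for the binomial
```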

23 Example contd. Approximations: Poisson to Binomial
Suppose the shipment = 2500 chips and we want to test 100; accept the lot if the sample contains no more than one defective, assuming 5% defective. What is the probability of accepting the lot?
Note: n = 100, N = 2500; the ratio n/N = 0.04, i.e. < 5%, so we can avoid the work of the hypergeometric, as the situation is approximately Binomial with n = 100, p = 0.05. The Binomial random variable X here = no. of defective chips out of 100.
P{accept lot} = P{X ≤ 1} = P{0} + P{1} – a lot of work, and not tabulated.
Alternative: the Poisson approximation to the Binomial, which works well where n > 20 and np ≤ 7. With λ = np = 5, the probability from the Poisson table is close to the result for the Binomial.
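Computing both sides shows how close the approximation is here (sketch; exact binomial ≈ 0.037, Poisson ≈ 0.040):

```python
from math import comb, exp, factorial

n, p = 100, 0.05
lam = n * p   # λ = 5

# Exact binomial P{X <= 1}
binom = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (0, 1))
# Poisson approximation P{X <= 1} = e^(-λ)(1 + λ)
pois = sum(exp(-lam) * lam**k / factorial(k) for k in (0, 1))
```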

24 Example contd. Approximations: Normal to a discrete distribution
Suppose we still want to sample 100 chips, but 10% of chips are expected to be defective. The rule for the Poisson approximation to the Binomial is that n is large and p small, with np < 7. Now p = 0.10, so np = 10, and the Poisson is not a good approximation.
However, n is large and np = 10, n(1 − p) = 90, both > 5, so we can use the Normal approximation: X is a Binomial r.v. with mean np = 10 and s.d. √(npq) = 3.
So P{accept} = P{X ≤ 1} ≈ P{U ≤ (1.5 − 10)/3} = P{U ≤ −2.83}, which is very small: there is very little chance of accepting a lot with this many defectives.
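The tail probability can be evaluated explicitly (sketch with continuity correction; the result is on the order of 0.002):

```python
from statistics import NormalDist

n, p = 100, 0.10
mu = n * p                          # 10
sigma = (n * p * (1 - p)) ** 0.5    # 3

# P{accept} = P{X <= 1}; Normal approximation with continuity correction
p_accept = NormalDist().cdf((1.5 - mu) / sigma)
```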

