XIAO WU DATA ANALYSIS & BASIC STATISTICS.


Presentation on theme: "XIAO WU DATA ANALYSIS & BASIC STATISTICS."— Presentation transcript:

1 XIAO WU XIAO.WU@YALE.EDU DATA ANALYSIS & BASIC STATISTICS

2 PURPOSE OF THIS WORKSHOP Statistics as a useful tool to analyze results Basic terminology and most commonly used tests Exposure to more advanced statistical tools

3 WHY DO WE NEED STATISTICS?

4 Summary Classification Interpretation Pattern searching Abnormality identification Prediction Interpolation Extrapolation

5 SUMMARY http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/

6 SUMMARY Mean, median, mode Variance, standard deviation Max, min values and range Quartiles http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/

7 EXAMPLE Firm A Mean: $5,800 Firm B Mean: $5,000

8 EXAMPLE
Firm A: Mean $5,800; Median $4,000; SD $7,270; 3rd Quartile $4,000; 1st Quartile $500
Firm B: Mean $5,000; Median $5,000; SD $203; 3rd Quartile $5,175; 1st Quartile $4,825

9 EXAMPLE
#    Salary ($)
1    4650
2    4700
3    4750
4    4800
5    4850
6    4900
7    4950
8    5000
9    5050
10   5100
11   5150
12   5200
13   5250
14   5300
15   5350

#    Salary ($)
1    20000
2    4000
3
4    500
5
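The summary measures from slides 6 to 8 can be reproduced for the 15-salary table (whose mean of $5,000 matches Firm B). This is a minimal sketch using Python's standard `statistics` module; the choice of the population standard deviation and the "inclusive" quartile method are assumptions, since the slides do not say which conventions were used. With these conventions the standard deviation comes out near $216 rather than the $203 shown on slide 8, so the slide may have been based on slightly different data.

```python
import statistics

# The 15 salaries from slide 9: $4,650 to $5,350 in $50 steps.
salaries = list(range(4650, 5351, 50))

mean = statistics.mean(salaries)
median = statistics.median(salaries)
# pstdev treats the list as the whole population; stdev would treat it as a sample.
sd = statistics.pstdev(salaries)
# "inclusive" interpolates between data points, treating the list as the full population.
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
value_range = max(salaries) - min(salaries)

print(f"mean={mean} median={median} sd={sd:.0f}")
print(f"Q1={q1} Q3={q3} range={value_range}")
```

Comparing mean and median side by side is the point of the example: when the two differ sharply, as they do for Firm A, the mean alone is a misleading summary.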

10 CLASSIFICATION Identification of variable Independent vs. dependent Numeric vs. categorical Variable Categorical Nominal Ordinal Numeric Continuous Discrete

11 PATTERN SEARCHING Distribution of data Some commonly used distributions Uniform Binomial Poisson … Central limit theorem http://www.mathwave.com/img/art/graphs_pdf2.gif

12 UNIFORM Every outcome has an equal chance Example: Flipping a coin Rolling a die What if you need to flip multiple times?

13 BINOMIAL Two outcomes, probability p and 1 - p Multiple trials: n Example: Flipping a coin 100 times Germination of multiple seeds https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif
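As a rough illustration of the coin-flipping example, the binomial probability of exactly k successes in n trials is C(n, k) p^k (1 - p)^(n - k). The sketch below assumes a fair coin (p = 0.5), which the slide does not state explicitly.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 100, 0.5  # 100 flips; p = 0.5 assumes a fair coin

prob_exactly_50 = binomial_pmf(50, n, p)
prob_40_to_60 = sum(binomial_pmf(k, n, p) for k in range(40, 61))
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))  # sanity check: should be 1

print(f"P(exactly 50 heads) = {prob_exactly_50:.4f}")
print(f"P(40 to 60 heads)   = {prob_40_to_60:.4f}")
```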

14 POISSON Counts of rare, independent events Each event has a small probability p; the count is governed by the average rate (often written λ) Example: radioactive decay http://kaffee.50webs.com/Science/images/alpha_decay.gif
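For a Poisson process with average rate λ, the probability of observing exactly k events is λ^k e^(-λ) / k!. The sketch below uses a made-up rate of 3 decays per counting interval purely for illustration.

```python
from math import exp, factorial


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the average count is `rate`."""
    return rate**k * exp(-rate) / factorial(k)


rate = 3.0  # hypothetical: 3 decays per counting interval on average

for k in range(7):
    print(f"P({k} events) = {poisson_pmf(k, rate):.4f}")
```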

15 THE MOST IMPORTANT DISTRIBUTION

16 NORMAL DISTRIBUTION Central limit theorem The mean of many independent samples from almost any distribution (with finite variance) is approximately normally distributed Large sample size → sample mean is approximately normal Parameters: mean, standard deviation https://www.mathsisfun.com/data/images/normal-distrubution-large.gif
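The central limit theorem can be checked with a small simulation. This sketch draws from an exponential distribution (a deliberately skewed choice, not taken from the slides) and looks at how the means of repeated samples behave; the sample size of 50 and the 5,000 repetitions are arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(n: int) -> float:
    """Mean of n draws from an exponential distribution with mean 1."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])


n = 50
means = [sample_mean(n) for _ in range(5000)]

center = statistics.fmean(means)
spread = statistics.stdev(means)

# Theory: the sample means cluster around 1 with standard deviation 1 / sqrt(n).
print(f"mean of sample means: {center:.3f} (theory: 1.000)")
print(f"sd of sample means:   {spread:.3f} (theory: {1 / n ** 0.5:.3f})")
```

Even though individual draws are strongly skewed, a histogram of `means` would look close to a bell curve, which is what makes the normal distribution so widely applicable.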

17 PATTERN SEARCHING Hypothesis testing Difference between two populations Z-test or t-test? What does p-value mean? Family-wise error – Bonferroni correction More than two possibilities Chi square test Fisher’s exact test More than two variables ANOVA

18 EXAMPLE 1 SAT score is related to gender Null hypothesis Alternative hypothesis (3 possibilities) One or two tail? Z or T test? p=0.07, conclusion?
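A two-sample z-test for this kind of question can be sketched as follows. All group means, standard deviations, and sample sizes below are hypothetical and chosen only so that the two-tailed p-value lands near the 0.07 mentioned on the slide; a z-test also assumes large samples, otherwise a t-test would be the usual choice.

```python
from math import erfc, sqrt


def z_statistic(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """z for the difference between two independent group means."""
    return (mean_a - mean_b) / sqrt(sd_a**2 / n_a + sd_b**2 / n_b)


def two_tailed_p(z):
    """P(|Z| >= |z|): alternative hypothesis is 'the means differ'."""
    return erfc(abs(z) / sqrt(2))


def upper_tail_p(z):
    """P(Z >= z): alternative hypothesis is 'group A scores higher'."""
    return 0.5 * erfc(z / sqrt(2))


# Hypothetical summary data for the two groups.
z = z_statistic(mean_a=1060, mean_b=1040, sd_a=200, sd_b=200, n_a=650, n_b=650)
p_two = two_tailed_p(z)
p_one = upper_tail_p(z)

print(f"z = {z:.3f}, two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
```

At the conventional 0.05 level, a two-tailed p-value of about 0.07 means the null hypothesis is not rejected, while the corresponding one-tailed p-value is half as large. That is why the choice between one and two tails has to be made before looking at the data.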

19 EXAMPLE 2 Predictors of stroke Age Hypertension Gender …

20 EXAMPLE 3 Genome-wide association studies Scanning markers across the DNA of many people to find genetic variations associated with certain diseases
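Scanning many markers means running many tests, which is where the family-wise error and the Bonferroni correction from slide 17 come in: with m tests, each p-value is compared against alpha / m instead of alpha. The p-values below are made up for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values for five markers.
p_values = [0.0001, 0.004, 0.03, 0.2, 0.6]
flags = bonferroni_significant(p_values)

for p, significant in zip(p_values, flags):
    print(f"p = {p:<7} significant after correction: {significant}")
```

Note that 0.03 would pass an uncorrected 0.05 cutoff but fails the corrected threshold of 0.01. In a real genome-wide study the number of tests is far larger, so the corrected threshold becomes correspondingly stricter.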

21 PATTERN SEARCHING Hypothesis testing One variable Z-test or t-test? What does p-value mean? Family-wise error – Bonferroni correction Compare two categorical variables Chi square test Fisher’s exact test More than two variables ANOVA

22 CHI SQUARE Punnett Square A cross between two pea plants yields 880 plants, 639 green, 241 yellow Hypothesis: The green allele is dominant and both parents are heterozygous. http://www2.lv.psu.edu/jxm57/irp/chisquar.html

23 CHI SQUARE
     G             g
G    GG (green)    Gg (green)
g    Gg (green)    gg (yellow)
75% green, 25% yellow

24 CHI SQUARE
                           Green    Yellow
Observed (o)               639      241
Expected (e)               660      220
Deviation (d = o – e)      -21      21
Deviation squared (d^2)    441      441
d^2/e                      0.668    2.005
Sum                        2.673
Degrees of freedom: number of categories – 1 = 1
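The chi-square calculation for the pea-plant cross can be sketched in a few lines. The p-value shortcut via `erfc` is valid only for one degree of freedom; for other cases a statistics library would normally be used.

```python
from math import erfc, sqrt

observed = {"green": 639, "yellow": 241}
expected_ratio = {"green": 0.75, "yellow": 0.25}  # 3:1 ratio from the Punnett square

total = sum(observed.values())
expected = {color: total * ratio for color, ratio in expected_ratio.items()}

chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color] for color in observed
)
dof = len(observed) - 1

# For 1 degree of freedom, P(chi-square >= x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, dof = {dof}, p = {p_value:.3f}")
```

The statistic is below 3.841, the critical value for one degree of freedom at the 0.05 level, so the observed counts are consistent with the hypothesis of a dominant green allele and heterozygous parents.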

25 CHI SQUARE

26 PREDICTION Regression Linear regression Multiple linear regression Accuracy vs. simplicity Validation: leave-k-out http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAAAAAC08/DcHCtLWXv-U/s1600/actnactn+1.png
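Leave-k-out validation with k = 1 can be sketched for simple linear regression as follows: each point is held out in turn, the line is fitted to the remaining points, and the held-out point is predicted. The data are made up, and the helper names `fit_line` and `leave_one_out_mse` are illustrative rather than taken from any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_mse(xs, ys):
    """Mean squared prediction error when each point is held out in turn."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Made-up data lying close to y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

slope, intercept = fit_line(xs, ys)
loo_mse = leave_one_out_mse(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, LOO MSE = {loo_mse:.4f}")
```

Because the held-out point never influences the fit that predicts it, the error estimate reflects performance on unseen data. That is what allows a fair comparison between a simpler and a more complex model.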

27 EXAMPLE Use brain structural measurements to predict a subject’s performance on picture vocabulary test 144 total structural measurements 521 subjects First step: eliminate unnecessary variables All zeros? Highly correlated pairs Variables that do not correlate well with performance score
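The three screening steps listed above (drop all-zero variables, drop one of each highly correlated pair, drop variables weakly correlated with the score) might look like the sketch below. The cutoffs of 0.95 and 0.1 and the toy data are assumptions; the slides do not give the thresholds actually used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def screen_features(features, target, pair_cutoff=0.95, target_cutoff=0.1):
    """Return the names of features that survive all three screening steps."""
    kept = []
    for name, values in features.items():
        if all(v == 0 for v in values):
            continue  # step 1: all zeros
        if abs(pearson(values, target)) < target_cutoff:
            continue  # step 3: barely related to the score
        if any(abs(pearson(values, features[k])) > pair_cutoff for k in kept):
            continue  # step 2: nearly duplicates a feature already kept
        kept.append(name)
    return kept


# Toy data: six subjects, five hypothetical measurements.
score = [1, 2, 3, 4, 5, 6]
features = {
    "a": [1, 2, 3, 4, 5, 7],
    "a_copy": [2, 4, 6, 8, 10, 14],  # exactly 2 * a
    "zeros": [0, 0, 0, 0, 0, 0],
    "noise": [1, 0, -1, -1, 0, 1],   # uncorrelated with score
    "b": [2, 1, 4, 3, 6, 5],
}

kept = screen_features(features, score)
print(kept)
```

The pairwise step here is greedy: which member of a correlated pair survives depends on the order of the variables, so a real analysis might instead keep the one that correlates better with the score.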

28 EXAMPLE Run regression Validation: leave 1 out and leave 10 out Principal component analysis …

29 PREDICTION More complicated models: Bayesian approach Use prior knowledge to update prediction Diffusion weights Use local structure to predict neighboring values
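One of the simplest instances of "use prior knowledge to update prediction" is the Beta-Binomial model, where a Beta(a, b) prior on a success probability is updated by adding observed successes to a and failures to b. The prior and the counts below are hypothetical (for instance, seeds that did or did not germinate), and this is only one of many Bayesian models, not necessarily the one the speaker had in mind.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Expected success probability under a Beta(a, b) distribution."""
    return a / (a + b)


prior = (2.0, 2.0)  # hypothetical prior: centered on 50%, weakly held
posterior = update_beta(*prior, successes=8, failures=2)

print(f"prior mean     = {beta_mean(*prior):.3f}")
print(f"posterior mean = {beta_mean(*posterior):.3f}")
```

The posterior mean sits between the prior belief and the observed success rate of 80%, and moves closer to the data as more observations accumulate.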

30 STATISTICAL TOOLS Excel MATLAB R Minitab …

31 QUESTIONS?

32 MY OWN RESEARCH Cost-effectiveness analysis Mathematical modeling in medicine Simulate iterations rather than actual patients

33 RECENT RESULTS

34 RESULTS

35 GROUP EXERCISE

