1
XIAO WU XIAO.WU@YALE.EDU DATA ANALYSIS & BASIC STATISTICS
2
PURPOSE OF THIS WORKSHOP Statistics as a useful tool to analyze results Basic terminology and most commonly used tests Exposure to more advanced statistical tools
3
WHY DO WE NEED STATISTICS?
4
Summary Classification Interpretation Pattern searching Abnormality identification Prediction Interpolation Extrapolation
5
SUMMARY http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/
6
SUMMARY Mean, median, mode Variance, standard deviation Max, min values and range Quartiles http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/
7
EXAMPLE Firm A Mean: $5,800 Firm B Mean: $5,000
8
EXAMPLE Firm A: Mean $5,800; Median $4,000; SD $7,270; 3rd quartile $4,000; 1st quartile $500. Firm B: Mean $5,000; Median $5,000; SD $203; 3rd quartile $5,175; 1st quartile $4,825.
9
EXAMPLE Firm B salaries ($), employees 1 to 15: 4650, 4700, 4750, 4800, 4850, 4900, 4950, 5000, 5050, 5100, 5150, 5200, 5250, 5300, 5350. Firm A salaries ($), employees 1 to 5: 20000, 4000, 4000, 500, 500.
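As a rough illustration of how the summary figures for the two firms can be computed, the sketch below uses Python's standard `statistics` module. Firm B's list is the 15 salaries from the slide. Firm A's five salaries are an assumption, chosen to be consistent with the mean, median and quartiles reported for it. Using the population standard deviation (`pstdev`) and the inclusive quartile method is also an assumption about how the slide's numbers were obtained.

```python
import statistics

# Firm B: the 15 salaries from the slide (4650 to 5350 in steps of 50).
firm_b = list(range(4650, 5351, 50))
# Firm A: assumed values, consistent with the slide's summary figures.
firm_a = [20000, 4000, 4000, 500, 500]


def summarize(salaries):
    """Return mean, median, population SD and quartiles for a salary list."""
    q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
    return {
        "mean": statistics.mean(salaries),
        "median": statistics.median(salaries),
        "sd": statistics.pstdev(salaries),
        "q1": q1,
        "q3": q3,
    }


summary_a = summarize(firm_a)
summary_b = summarize(firm_b)
print("Firm A:", summary_a)
print("Firm B:", summary_b)
```

The two means are close, but the median and quartiles show that Firm A's mean is pulled up by a single large salary. Under these assumptions Firm B's SD comes out near $216 rather than $203, so the slide may have used different data or a different formula.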
10
CLASSIFICATION Identification of variable: Independent vs. dependent; Numeric vs. categorical. Variable types: Categorical (nominal, ordinal); Numeric (continuous, discrete).
11
PATTERN SEARCHING Distribution of data Some commonly used distributions Uniform Binomial Poisson … Central limit theorem http://www.mathwave.com/img/art/graphs_pdf2.gif
12
UNIFORM Every outcome has an equal chance Example: Flipping a coin, rolling a die What if you need to flip multiple times?
13
BINOMIAL Two outcomes, with probabilities p and 1 - p Multiple trials: n Example: Flipping a coin 100 times Germination of multiple seeds https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif
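The coin-flip example can be made concrete with the binomial probability formula, P(k) = C(n, k) p^k (1 - p)^(n - k). The following is a minimal sketch using only the standard library; the function name `binomial_pmf` is introduced here for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Fair coin flipped 100 times: chance of exactly 50 heads.
p_50_heads = binomial_pmf(50, 100, 0.5)
# The probabilities over all possible head counts add up to 1.
total = sum(binomial_pmf(k, 100, 0.5) for k in range(101))
print(p_50_heads, total)
```

Even the most likely outcome, exactly 50 heads, occurs only about 8% of the time.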
14
POISSON Counts of rare, independent events Described by a single parameter, the average rate (usually written λ) Example: radioactive decay http://kaffee.50webs.com/Science/images/alpha_decay.gif
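For a Poisson count with average rate λ, the probability of seeing exactly k events is λ^k e^(-λ) / k!. The sketch below evaluates it for a hypothetical rate of 3 decays per counting interval; the rate and the name `poisson_pmf` are assumptions for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count is `rate`."""
    return rate**k * exp(-rate) / factorial(k)


rate = 3.0  # hypothetical average number of decays per interval
probabilities = [poisson_pmf(k, rate) for k in range(10)]
for k, prob in enumerate(probabilities):
    print(k, round(prob, 4))
```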
15
THE MOST IMPORTANT DISTRIBUTION
16
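NORMAL DISTRIBUTION Central limit theorem: the mean of a large sample of independent observations is approximately normally distributed, whatever the underlying distribution (provided its variance is finite) Large sample size → sample mean approximately normal Parameters: mean, standard deviation https://www.mathsisfun.com/data/images/normal-distrubution-large.gif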
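A small simulation can illustrate the central limit theorem: means of repeated samples from a uniform distribution cluster around 0.5 with a spread close to the theoretical value sqrt(1/12) / sqrt(n). This is only a sketch; the sample size of 30, the 2000 repetitions and the seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(n):
    """Mean of n draws from a uniform distribution on [0, 1)."""
    return statistics.fmean(random.random() for _ in range(n))


n = 30
means = [sample_mean(n) for _ in range(2000)]

center = statistics.fmean(means)   # should be close to 0.5
spread = statistics.stdev(means)   # should be close to `theory`
theory = (1 / 12) ** 0.5 / n**0.5  # SD of a uniform(0, 1) mean of n draws
print(center, spread, theory)
```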
17
PATTERN SEARCHING Hypothesis testing. Difference between two populations: Z-test or t-test? What does a p-value mean? Family-wise error: Bonferroni correction. More than two possibilities: chi-square test, Fisher's exact test. More than two groups: ANOVA.
18
EXAMPLE 1 SAT score is related to gender Null hypothesis Alternative hypothesis (3 possibilities) One or two tail? Z or T test? p=0.07, conclusion?
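The "one or two tail?" question changes the conclusion for a borderline result such as p = 0.07. The sketch below computes both p-values for a z statistic using the standard normal distribution; the value z = 1.81 is hypothetical, picked because it gives a two-tailed p-value of about 0.07, and the 0.05 significance level is an assumed convention.

```python
from statistics import NormalDist


def two_sided_p(z):
    """Two-tailed p-value for a z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def one_sided_p(z):
    """Upper-tail p-value for a z statistic under the standard normal."""
    return 1 - NormalDist().cdf(z)


z = 1.81      # hypothetical test statistic
alpha = 0.05  # assumed significance level

p_two = two_sided_p(z)
p_one = one_sided_p(z)
reject_two = p_two < alpha
reject_one = p_one < alpha
print(p_two, reject_two, p_one, reject_one)
```

With a two-tailed alternative the null hypothesis is not rejected at the 0.05 level, while the one-tailed test on the same statistic would reject it, which is why the alternative hypothesis has to be fixed before looking at the data.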
19
EXAMPLE 2 Predictors of stroke Age Hypertension Gender …
20
EXAMPLE 3 Genome-wide association studies Scanning markers across the DNA of many people to find genetic variations associated with certain diseases
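Scanning many markers means running many hypothesis tests, which is where the family-wise error and the Bonferroni correction come in: each test is held to a threshold of α divided by the number of tests. A minimal sketch follows; the p-values and the count of one million markers are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps family-wise error <= alpha."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values remain significant after the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


p_values = [0.001, 0.012, 0.03, 0.2]  # hypothetical results of four tests
flags = significant_after_bonferroni(p_values)

# Hypothetical genome-wide scan with one million markers.
gwas_threshold = bonferroni_threshold(0.05, 1_000_000)
print(flags, gwas_threshold)
```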
21
PATTERN SEARCHING Hypothesis testing. One variable: Z-test or t-test? What does a p-value mean? Family-wise error: Bonferroni correction. Compare two categorical variables: chi-square test, Fisher's exact test. More than two groups: ANOVA.
22
CHI SQUARE Punnett Square A cross between two pea plants yields 880 plants, 639 green, 241 yellow Hypothesis: The green allele is dominant and both parents are heterozygous. http://www2.lv.psu.edu/jxm57/irp/chisquar.html
23
CHI SQUARE
     G             g
G    GG (green)    Gg (green)
g    Gg (green)    gg (yellow)
75% green, 25% yellow
24
CHI SQUARE
                          Green    Yellow
Observed (o)              639      241
Expected (e)              660      220
Deviation (d = o - e)     -21      21
Deviation squared (d^2)   441      441
d^2/e                     0.668    2.005
Sum: 2.673
Degrees of freedom: number of categories - 1 = 1
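The same calculation can be sketched in a few lines of Python. The counts and the 3:1 expected ratio come from the pea-plant example; the p-value line relies on the identity P(X > x) = erfc(sqrt(x / 2)), which holds only for one degree of freedom.

```python
from math import erfc, sqrt

observed = {"green": 639, "yellow": 241}
ratios = {"green": 0.75, "yellow": 0.25}  # expected under the hypothesis

total = sum(observed.values())
expected = {color: total * ratio for color, ratio in ratios.items()}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)
df = len(observed) - 1

# Valid for df = 1 only: upper-tail probability of the chi-square distribution.
p_value = erfc(sqrt(chi_sq / 2))
print(chi_sq, df, p_value)
```

The statistic is about 2.67, below the usual 5% critical value of 3.841 for one degree of freedom, so the observed counts do not contradict the 3:1 hypothesis.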
25
CHI SQUARE
26
PREDICTION Regression Linear regression Multiple linear regression Accuracy vs. simplicity Validation leave-k-out http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAAAAAC08/DcHCtLWXv-U/s1600/actnactn+1.png
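Simple linear regression with one predictor has a closed-form least-squares solution: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below applies it to a small made-up data set; the numbers and the name `fit_line` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical measurements

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```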
27
EXAMPLE Use brain structural measurements to predict a subject’s performance on picture vocabulary test 144 total structural measurements 521 subjects First step: eliminate unnecessary variables All zeros? Highly correlated pairs Variables that do not correlate well with performance score
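The first two elimination steps can be sketched as follows: drop columns that are all zeros, then drop one column from each highly correlated pair. The feature names, the tiny data set and the 0.95 cutoff are all hypothetical; the slide does not say which threshold was used, and the third step (correlation with the performance score) is not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def filter_features(features, max_pair_corr=0.95):
    """Drop all-zero columns, then the later column of each correlated pair."""
    kept = {}
    for name, values in features.items():
        if all(v == 0 for v in values):
            continue  # carries no information
        if any(abs(pearson(values, other)) > max_pair_corr
               for other in kept.values()):
            continue  # nearly duplicates a column already kept
        kept[name] = values
    return kept


features = {  # hypothetical measurements for four subjects
    "volume_a": [1.0, 2.0, 3.0, 4.0],
    "volume_b": [2.0, 4.1, 6.0, 8.0],  # almost a multiple of volume_a
    "thickness": [3.0, 1.0, 4.0, 1.5],
    "empty": [0.0, 0.0, 0.0, 0.0],
}
kept = filter_features(features)
print(list(kept))
```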
28
EXAMPLE Run regression Validation: leave 1 out and leave 10 out Principal component analysis …
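Leave-one-out validation fits the model on all subjects but one, predicts the held-out subject, and repeats for every subject; leave-10-out works the same way with blocks of ten. The sketch below uses a one-predictor line fit as a stand-in for the actual regression, and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_mse(xs, ys):
    """Mean squared prediction error over all held-out points."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        errors.append((ys[i] - prediction) ** 2)
    return sum(errors) / len(errors)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
exact_ys = [2 * x + 1 for x in xs]          # perfectly linear data
noisy_ys = [3.2, 4.7, 7.4, 8.8, 11.3, 12.6]  # hypothetical noisy scores

mse_exact = leave_one_out_mse(xs, exact_ys)
mse_noisy = leave_one_out_mse(xs, noisy_ys)
print(mse_exact, mse_noisy)
```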
29
PREDICTION More complicated models: Bayesian approach Use prior knowledge to update prediction Diffusion weights Use local structure to predict neighboring values
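One textbook way to see how prior knowledge updates a prediction is the Beta-binomial model: a Beta(α, β) prior on a success probability becomes Beta(α + successes, β + failures) after observing data. This is a swapped-in illustration, not the model used in the workshop, and the counts below are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (1, 1)  # uniform prior: no initial preference for any probability
posterior = update_beta(*prior, successes=7, failures=3)  # hypothetical data

prior_estimate = beta_mean(*prior)
posterior_estimate = beta_mean(*posterior)
print(prior_estimate, posterior, posterior_estimate)
```

The estimate moves from 0.5 under the prior to about 0.67 after seeing 7 successes in 10 trials, slightly below the raw proportion of 0.7 because the prior still carries some weight.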
30
STATISTICAL TOOLS Excel MATLAB R Minitab …
31
QUESTIONS?
32
MY OWN RESEARCH Cost-effectiveness analysis Mathematical modeling in medicine Simulate iterations rather than actual patients
33
RECENT RESULTS
34
RESULTS
35
GROUP EXERCISE