Get and Load Data Univariate EDA Bivariate EDA Filter Individuals

Slides:



Advertisements
Similar presentations
AP Statistics Course Review.
Advertisements

Test of (µ 1 – µ 2 ),  1 =  2, Populations Normal Test Statistic and df = n 1 + n 2 – 2 2– )1– 2 ( 2 1 )1– 1 ( 2 where ] 2 – 1 [–
DATA ANALYSIS I MKT525. Plan of analysis What decision must be made? What are research objectives? What do you have to know to reach those objectives?
BCOR 1020 Business Statistics Lecture 28 – May 1, 2008.
Examining Relationship of Variables  Response (dependent) variable - measures the outcome of a study.  Explanatory (Independent) variable - explains.
Statistics Intro Univariate Analysis Central Tendency Dispersion.
Statistical Analysis SC504/HS927 Spring Term 2008 Week 17 (25th January 2008): Analysing data.
STAT 101 Dr. Kari Lock Morgan Exam 2 Review.
Summary of Quantitative Analysis Neuman and Robson Ch. 11
Review for Exam 2 Some important themes from Chapters 6-9 Chap. 6. Significance Tests Chap. 7: Comparing Two Groups Chap. 8: Contingency Tables (Categorical.
Statistical hypothesis testing – Inferential statistics II. Testing for associations.
Measures of Regression and Prediction Intervals
Testing Group Difference
BOT3015L Data analysis and interpretation Presentation created by Jean Burns and Sarah Tso All photos from Raven et al. Biology of Plants except when otherwise.
© 2013 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 12 Describing Data.
APPENDIX B Data Preparation and Univariate Statistics How are computer used in data collection and analysis? How are collected data prepared for statistical.
+ Chapter 12: Inference for Regression Inference for Linear Regression.
Section 9-1: Inference for Slope and Correlation Section 9-3: Confidence and Prediction Intervals Visit the Maths Study Centre.
Univariate EDA. Quantitative Univariate EDASlide #2 Exploratory Data Analysis Univariate EDA – Describe the distribution –Distribution is concerned with.
Lecture 3 Topic - Descriptive Procedures Programs 3-4 LSB 4:1-4.4; 4:9:4:11; 8:1-8:5; 5:1-5.2.
Statistics: Unlocking the Power of Data Lock 5 Exam 2 Review STAT 101 Dr. Kari Lock Morgan 11/13/12 Review of Chapters 5-9.
Review Lecture 51 Tue, Dec 13, Chapter 1 Sections 1.1 – 1.4. Sections 1.1 – 1.4. Be familiar with the language and principles of hypothesis testing.
Chapter 8 Minitab Recipe Cards. Confidence intervals for the population mean Choose Basic Statistics from the Stat menu and 1- Sample t from the sub-menu.
SPSS Workshop Day 2 – Data Analysis. Outline Descriptive Statistics Types of data Graphical Summaries –For Categorical Variables –For Quantitative Variables.
Statistical Analysis using SPSS Dr.Shaikh Shaffi Ahamed Asst. Professor Dept. of Family & Community Medicine.
Copyright © 2015, 2012, and 2009 Pearson Education, Inc. 1 Chapter Correlation and Regression 9.
Elementary Analysis Richard LeGates URBS 492. Univariate Analysis Distributions –SPSS Command Statistics | Summarize | Frequencies Presents label, total.
Jump to first page Inferring Sample Findings to the Population and Testing for Differences.
Marginal Distribution Conditional Distribution. Side by Side Bar Graph Segmented Bar Graph Dotplot Stemplot Histogram.
Class Seven Turn In: Chapter 18: 32, 34, 36 Chapter 19: 26, 34, 44 Quiz 3 For Class Eight: Chapter 20: 18, 20, 24 Chapter 22: 34, 36 Read Chapters 23 &
Quantitative Methods in the Behavioral Sciences PSY 302
Introduction to Marketing Research
EHS 655 Lecture 4: Descriptive statistics, censored data
BINARY LOGISTIC REGRESSION
MATH-138 Elementary Statistics
26134 Business Statistics Autumn 2017
Bivariate Testing (ttests and proportion tests)
REGRESSION G&W p
Chapter 12 Simple Linear Regression and Correlation
Review 1. Describing variables.
CHAPTER 7 Linear Correlation & Regression Methods
Dr. Siti Nor Binti Yaacob
Hypothesis Testing Review
Bivariate Testing (ttests and proportion tests)
Description of Data (Summary and Variability measures)
Lecture 6 Comparing Proportions (II)
Data Management and Acquisition Tutorial
Data Analysis for Two-Way Tables
Two Way Frequency Tables
Quantitative Methods PSY302 Quiz 6 Confidence Intervals
CHAPTER 29: Multiple Regression*
AP Exam Review Chapters 1-10
Bivariate Testing (Chi Square)
Bivariate Testing (ttests and proportion tests)
Review of Hypothesis Testing
Bivariate Testing (Chi Square)
AP Stats Check In Where we’ve been… Chapter 7…Chapter 8…
Chapter 12 Simple Linear Regression and Correlation
Statistical Analysis using SPSS
STAT 312 Introduction Z-Tests and Confidence Intervals for a
Univariate Statistics
Data Analysis Module: Chi Square
Describing distributions with numbers
Statistics II: An Overview of Statistics
Factors (or Grouping Variables) Checking Model Assumptions
BUSINESS MARKET RESEARCH
Measures of Regression Prediction Interval
Finding Correlation Coefficient & Line of Best Fit
Introductory Statistics
Presentation transcript:

Get and Load Data Univariate EDA Bivariate EDA Filter Individuals > library(NCStats) > setwd("C:/aaaWork/Web/GitHub/NCMTH107") > dfcar <- read.csv("93cars.csv") > str(dfcar) 'data.frame': 93 obs. of 26 variables: $ Type : Factor w/ 6 levels "Compact","Large": 4 3 3 ... $ HMPG : int 31 25 26 26 30 31 28 25 27 25... $ Manual : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 ... $ Weight : int 2705 3560 3375 3405 3640 2880 3470 ... $ Domestic: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 ... ENTER RAW DATA: In Excel, enter variables in columns with variable names in the first row, each individual’s data in rows below that (do not use spaces or special characters). Save as “Comma Separated Values (*.CSV)” file in your local directory/folder. DATA PROVIDED BY PROFESSOR: Goto the MTH107 Resources webpage. Save “data” link (right-click) to your local directory/folder. LOAD THE EXTERNAL CSV FILE INTO R: Start script and save it in the same folder with the CSV file. Select the Session, Set Working Directory, To Source File Location menus. Copy resulting setwd() code to your script. Use read.csv() to load data in filename.csv into dfobj. Observe the structure of dfobj. Get and Load Data dfobj <- read.csv("filename.csv") str(dfobj) > ( freq1 <- xtabs(~Type,data=dfcar) ) Compact Large Midsize Small Sporty Van 16 11 22 21 14 9 > percTable(freq1,digits=1) Compact Large Midsize Small Sporty Van Sum 17.2 11.8 23.7 22.6 15.1 9.7 100.1 > barplot(freq1,xlab="Type of Car",ylab="Frequency") CATEGORICAL – Frequency table, percentage table, and bar chart for the cvar variable. Univariate EDA ( freq1 <- xtabs(~cvar,data=dfobj) ) percTable(freq1,digits=1) barplot(freq1,xlab="better cvar label", ylab="Frequency") > ( freq2 <- xtabs(~Domestic+Manual,data=dfcar) ) Manual Domestic No Yes No 6 39 Yes 26 22 > percTable(freq2,digits=1) Manual Domestic No Yes Sum No 6.5 41.9 48.4 Yes 28.0 23.7 51.7 Sum 34.5 65.6 100.1 > percTable(freq2,digits=1,margin=1) Manual Domestic No Yes Sum No 13.3 86.7 100.0 Yes 54.2 45.8 100.0 > percTable(freq2,digits=1,margin=2) Manual Domestic No Yes No 18.8 63.9 Yes 81.2 36.1 Sum 100.0 100.0 CATEGORICAL – Frequency and percentage tables for the cvarRow and cvarCol variables. Bivariate EDA ( freq2 <- xtabs(~cvarRow+cvarCol, data=dfobj) ) percTable(freq2,digits=1) # total/table % percTable(freq2,digits=1,margin=1) # row % percTable(freq2,digits=1,margin=2) # column % > hist(~HMPG,data=dfcar,xlab="Highway MPG") > Summarize(~HMPG,data=dfcar,digits=1) n mean sd min Q1 median Q3 max 93.0 29.1 5.3 20.0 26.0 28.0 31.0 50.0 QUANTITATIVE – Histogram and summary statistics (mean, median, SD, IQR, etc.) for the qvar variable. hist(~qvar,data=dfobj,xlab="better qvar label") Summarize(~qvar,data=dfobj,digits=#) > plot(HMPG~Weight,data=dfcar,pch=19,ylab="Highway MPG“, xlab="Weight (lbs)") > corr(~HMPG+Weight,data=dfcar,digits=3) [1] -0.811 QUANTITATIVE – Scatterplot and correlation coefficient (r) for the qvarY and qvarX variables. plot(qvarY~qvarX,data=dfobj, pch=19, ylab="better yvar label", xlab="better xvar label") corr(~qvarY+qvarX,data=dfobj,digits=3) > justSporty <- filterD(dfcar,Type=="Sporty") > noDomestic <- filterD(dfcar,Domestic!="Yes") > justHMPGgt30 <- filterD(dfcar,HMPG>30) > Sp_or_Sm <- filterD(dfcar,Type %in% c("Sporty","Small")) > Spry_n_gt30 <- filterD(dfcar,Type=="Sporty",HMPG>30) > justWTlteq3000 <- filterD(dfcar,Weight<=3000) > justNum17 <- dfcar[17,] > notNum17 <- dfcar[-17,] Individuals may be selected from the dfobj data.frame and put in the newdf data.frame according to a condition with where condition may be as follows var == value # equal to var != value # not equal to var > value # greater than var >= value # greater than or equal var %in% c("val", "val", "val") # in the list cond, cond # both conditions met with var replaced by a variable name and value replaced by a number or category name (if value is not a number then it must be put in quotes). Filter Individuals newdf <- filterD(dfobj,condition) > hist(HMPG~Domestic,data=dfcar,xlab="Highway MPG") > Summarize(HMPG~Domestic,data=dfcar,digits=1) Domestic n mean sd min Q1 median Q3 max 1 No 45 30.1 6.2 21 25 30 33 50 2 Yes 48 28.1 4.2 20 26 28 30 41 QUANTITATIVE BY GROUP – Histograms and summary statistics for qvar separated by groups in cvar. hist(qvar~cvar,data=dfobj,xlab="better qvar label") Summarize(qvar~cvar,data=dfobj,digits=#) > pairs(~HMPG+Weight+Cyl,data=dfcar,pch=21,bg="gray70") > corr(HMPG+Weight+Cyl,data=dfcar,digits=3, use="pairwise.complete.obs") QUANTITATIVE (ALL PAIRS) – Scatterplot and correlation coefficient (r) for all pairs of quantitative variables. pairs(~qvar1+qvar2+qvar3,data=dfobj, pch=21,bg="gray70") corr(~qvar1+qvar2+qvar3,data=dfobj,digits=3, use="pairwise.complet.obs")

Quantitative Hypothesis Tests Categorical Hypothesis Tests > distrib(50,mean=45,sd=10,lower.tail=FALSE) #forward-right > distrib(50,mean=45,sd=10) #forward-left > distrib(0.05,mean=45,sd=10,type="q") #rev-left > distrib(0.2,mean=45,sd=10,type="q",lower.tail=FALSE) #rev-rgt > distrib(50,mean=45,sd=10/sqrt(30)) #using SE > distrib(0.95,mean=45,sd=10/sqrt(30), type="q",lower.tail=FALSE) #using SE where val is a value of the quantitative variable (x) or an area (i.e., a percentage, but entered as a proportion). mnval is the population mean (m) sdval is the standard deviation (s) or error (SE) lower.tail=FALSE is included for “right-of” calculations type="q" is included for reverse calculations For SE use (where nval=sample size): Normal Distributions distrib(val,mean=mnval,sd=sdval,lower.tail=FALSE, type="q") sd=sdval/sqrt(nval) > z.test(dfcar$HMPG,mu=26,alt="greater",conf.level=0.95,sd=6) z= 4.9601, n= 93, Std. Dev= 6.000, Std. Dev of the sample mean = 0.622, p-value = 3.523e-07 alternative hypothesis: true mean is greater than 26 95 percent confidence interval: 28.06264 Inf sample estimates: mean of dfcar$HMPG 29.08602 > t.test(dfcar$HMPG,mu=26,alt="two.sided",conf.level=0.99) t = 5.5818, df = 92, p-value = 2.387e-07 alternative hypothesis: true mean is not equal to 26 99 percent confidence interval: 27.63178 30.54026 sample estimates: mean of x 29.08602 ONE SAMPLE Z-TEST AND T-TEST: qvar is the quantitative response variable in dfobj mu0 is the population mean in H0 HA is replaced with "two.sided" for a not equals, "less" for a less than, or "greater" for a greater than HA cnfval is the confidence level as a proportion (e.g., 0.95) sdval is the known population standard deviation (s) Quantitative Hypothesis Tests z.test(dfobj$qvar,mu=mu0,alt=HA, conf.level=cnfval,sd=sdval) t.test(dfobj$qvar,mu=mu0,alt=HA, conf.level=cnfval) > ( freq2 <- xtabs(~Domestic+Manual,data=dfcar) ) Manual Domestic No Yes No 6 39 Yes 26 22 > ( chi <- chisq.test(freq2,correct=FALSE) ) Pearson's Chi-squared test with freq2 X-squared = 17.1588, df = 1, p-value = 3.438e-05 > chi$expected Manual Domestic No Yes No 15.48387 29.51613 Yes 16.51613 31.48387 > percTable(freq2,margin=1,digits=1) Manual Domestic No Yes Sum No 13.3 86.7 100.0 Yes 54.2 45.8 100.0 (TWO SAMPLE) CHI-SQUARE TEST: Chi-square for two-way frequency table in obstbl (with the cvarResp categorical response variable in columns and the populations in cvarPop as rows). Follow-up Analyses: Extract expected values. Percentages of individuals in each level of the response variable for each population. Categorical Hypothesis Tests ( obstbl <- xtabs(~cvarPop+cvarResp,data=dfobj) ) ( chi <- chisq.test(obstbl,correct=FALSE) ) chi$expected percTable(obstbl,margin=1,digits=1) # row percent table > ( bfl <- lm(HMPG~Weight,data=dfcar) ) Coefficients: (Intercept) Weight 51.601365 -0.007327 > fitPlot(bfl,ylab="Highway MPG",xlab="Weight (lbs)") > rSquared(bfl) [1] 0.6571665 The best-fit line between the qvarResp response and qvarExpl explanatory variables. A visual of the best-fit line. The r2 value. Linear Regression ( bfl <- lm(qvarResp~qvarExpl,data=dfobj) ) rSquared(bfl) fitPlot(bfl,ylab="better Resp label",xlab="better Expl label") > levenesTest(HMPG~Manual,data=dfcar) Df F value Pr(>F) group 1 7.6663 0.006818 91 > t.test(HMPG~Manual,data=dfcar,alt="less",conf.level=0.99, var.equal=TRUE) t = -4.2183, df = 91, p-value = 2.904e-05 alt. hypothesis: true difference in means is less than 0 99 percent confidence interval: -Inf -1.980103 sample estimates: mean in group No mean in group Yes 26.12500 30.63934 TWO SAMPLE T-TEST: qvar is the quantitative response variable in dfobj cvar is the categorical variable that identifies the two groups mu0 is the population mean in H0 HA is replaced with "two.sided" for a not equals, "less" for a less than, or "greater" for a greater than HA cnfval is the confidence level as a proportion (e.g., 0.95) var.equal=TRUE if the popn variances are thought to be equal levenesTest(qvar~cvar,data=dfobj) t.test(qvar~cvar,data=dfobj,alt=HA , conf.level=cnfval, var.equal=TRUE) Quantitative Hypothesis Tests (ONE SAMPLE) GOODNESS-OF-FIT TEST: Goodness-of-fit for one-way frequency table in obstbl and expected values (or ratios) in exp.p. Follow-up Analyses: Extract expected values. Percentages of individuals in each level of the response variable. Categorical Hypothesis Tests ( obstbl <- c(lvl1=##,lvl2=##,lvl3=##) ) # if summarized data ( obstbl <- xtabs(~cvarResp,data=dfobj) ) # if raw data ( exp.p <- c(lvl1=##,lvl2=##,lvl3=## ) ) ( gof <- chisq.test(obstbl,p=exp.p,rescale.p=TRUE, correct=FALSE) ) gof$expected percTable(obstbl,digits=1) Prepared by Dr. Derek H. Ogle for Northland College’s MTH107 course Revised Dec-18