1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 1b, January 30, 2015 Introductory Statistics/ Refresher and Relevant software installation.


Admin info (keep/ print this slide) Class: ITWS-4963/ITWS 6965 Hours: 12:00pm-1:50pm Tuesday/ Friday Location: Lally 102 Instructor: Peter Fox Instructor contact: (do not leave a Contact hours: Monday** 3:00-4:00pm (or by appt) Contact location: Winslow 2120 (sometimes Lally 207A announced by ) TA: Jiaju Shen and help from James Ryan Web site: –Schedule, lectures, syllabus, reading, assignments, etc. – 2

Today: initial review of stats and terms that are important for this course. Then… check in on installation of application software, and get some data to read, explore, etc. 3

Definitions/ topics: Statistic; Statistics; Population and Samples; Sampling; Distributions and parameters; Central Tendencies; Frequency; Probability; Significance tests; Hypothesis (null and alternate); P-value; Density and cumulative distributions 4

Statistic and Statistics Statistic (not to be confused with Statistics) –Characteristic or measure obtained from a sample. Statistics –Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions. 5

Populations and samples A population is defined (“all” of the data) –We must be able to say, for every object, if it is in the population or not –We must be able, in principle, to find every individual of the population –Inferential statistics apply here: generalizing from samples to populations using probabilities, performing hypothesis testing, determining relationships between variables, and making predictions. A sample is a subset of a population (“some” of the data) –We must be able to say, for every object in the population, if it is in the sample or not (detecting “outliers”, “errors”, etc.) –Sampling is the process of selecting a sample from a population –Descriptive statistics apply here (especially distributions) 6

E.g. Election prediction: exit polls versus election results –Human versus cyber How is the “population” defined here? What is the sample, and how is it chosen? –What is described and how is that used to predict? –Are results categorized? (where from, M/F, age) What is the uncertainty? –It is reflected in the “sample distribution” –And controlled/ constrained by “sampling theory” 7

Sampling Types (basic) Random Sampling –Sampling in which the data is collected using chance methods or random numbers. Systematic Sampling –Sampling in which data is obtained by selecting every kth object. Convenience Sampling –Sampling in which data that is readily available is used. Stratified Sampling –Sampling in which the population is divided into groups (called strata) according to some characteristic. Each of these strata is then sampled using one of the other sampling techniques. Cluster Sampling –Sampling in which the population is divided into groups (usually geographically). Some of these groups are randomly selected, and then all of the elements in those groups are selected. 8
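Three of the sampling types above can be sketched in a few lines. This is only an illustration in Python's standard library, not the course's R code; the function names, the 1..100 population, and the odd/even strata are assumptions made up for the example.

```python
import random

def random_sample(population, n, seed=0):
    """Random sampling: pick n items by chance (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Systematic sampling: take every kth object, beginning at index `start`."""
    return population[start::k]

def stratified_sample(population, key, n_per_stratum, seed=0):
    """Stratified sampling: split into strata by `key`, then randomly sample each stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

# Hypothetical population: the integers 1..100, stratified into odd and even.
population = list(range(1, 101))
print(random_sample(population, 5))
print(systematic_sample(population, 10))
print(stratified_sample(population, lambda x: x % 2 == 0, 3))
```

Cluster sampling would follow the same pattern as the stratified version, except that whole groups are chosen at random and every member of a chosen group is kept.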

Random Numbers Can a computer generate a random number? Can you? Origin – to reduce selection bias! In R – many ways – see help on Random {base} and get familiar with set.seed 9
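The point of set.seed in R is reproducibility: a computer's "random" numbers are pseudo-random, so fixing the seed replays the same sequence. A minimal sketch of the same idea using Python's random.seed (the seed value 42 is arbitrary):

```python
import random

# Seed, draw three numbers, then re-seed and draw again.
random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)
second = [random.random() for _ in range(3)]

# The two runs are identical because the generator is deterministic given its seed.
print(first == second)
```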

Sampling Theory See Nyquist–Shannon – for time-series* Basically, if there are no frequencies greater than x, then you need to sample at a rate of at least 2x per time unit Not well known application: good, better, best –How many samples? 10
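A small sketch of what goes wrong below the 2x rate: a sine wave sampled too slowly produces exactly the same samples as a lower-frequency wave (aliasing). The 3 Hz signal and 4 Hz sampling rate are made-up values, and the helper names are hypothetical.

```python
import math

def nyquist_rate(max_freq_hz):
    """Rate implied by the sampling rule: twice the highest frequency present."""
    return 2 * max_freq_hz

def sample_sine(freq_hz, rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at times t = k / rate_hz for k = 0..n_samples-1."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n_samples)]

signal_hz = 3.0
too_slow_hz = 4.0  # below nyquist_rate(3.0) == 6.0

undersampled = sample_sine(signal_hz, too_slow_hz, 8)
alias = sample_sine(signal_hz - too_slow_hz, too_slow_hz, 8)  # a -1 Hz wave

# At 4 samples per second the 3 Hz and -1 Hz waves cannot be told apart.
indistinguishable = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(undersampled, alias)
)
print(nyquist_rate(signal_hz), indistinguishable)
```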

Minimum Sample Size Typical formula** is –N=(z * std deviation)^2/ (margin of error)^2 –May need to estimate std deviation –z is from confidence intervals (normal distribution) –Margin of error is your tolerance for being wrong –E.g. for elections ~7000 ! Based on 1% error and 95% confidence… 11
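The formula above can be evaluated directly. This sketch assumes a two-sided normal z and, for a poll proportion, the worst-case standard deviation of 0.5; both are assumptions, not stated on the slide. Under them, 1% error at 95% confidence gives roughly 9,600, while z ≈ 1.645 (two-sided 90%) gives roughly 6,800, which is closer to the ~7000 quoted.

```python
import math
from statistics import NormalDist

def min_sample_size(std_dev, margin_of_error, confidence=0.95):
    """N = (z * std_dev)^2 / margin_of_error^2, rounded up to a whole sample."""
    # Two-sided z for the requested confidence level (an assumption of this sketch).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

# Worst-case standard deviation for a proportion is 0.5 (p = 0.5).
n_95 = min_sample_size(std_dev=0.5, margin_of_error=0.01, confidence=0.95)
n_90 = min_sample_size(std_dev=0.5, margin_of_error=0.01, confidence=0.90)
print(n_95, n_90)
```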

Bias difference: between cyber and human data Election results and exit polls –What are examples of bias in election results? –In exit polls? 12

Distributions (ist.zip) Shape Character Parameter(s) –Mean –Standard deviation –Skewness –Etc. 13

Plotting these distributions Histograms and binning Getting used to log scales Going beyond 2-D More of this next week (in more detail) 14

In applications Scipy: R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html Matlab: Excel: HAH! 15

Heavy-tail distributions are probability distributions whose tails are not exponentially bounded. Common – long-tail… human v. cyber… [Figure: long-tail curve with two equal areas, labeled “Few that dominate” and “More that add up”] 16

Spatial example 17

Spatial roughness… 18

Central tendency – median, mean, mode 19

Significance Tests Confidence intervals allow you to accept or reject hypotheses… (critical region) – two-tailed test. –If the hypothesized value of the parameter lies within the confidence interval with a 1-alpha level of confidence, then the decision at an alpha level of significance is to fail to reject the null hypothesis, i.e. accept –If the hypothesized value of the parameter lies outside the confidence interval with a 1-alpha level of confidence, then the decision at an alpha level of significance is to reject the null hypothesis. 20
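The confidence-interval rule above can be sketched as follows. This version assumes a normal (z) interval for the mean because Python's standard library has no t quantiles; for a sample this small a t interval would normally be preferred. The data and function names are invented for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def z_confidence_interval(sample, confidence=0.95):
    """Two-sided (1 - alpha) confidence interval for the mean, normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = stdev(sample) / math.sqrt(len(sample))
    m = mean(sample)
    return m - z * se, m + z * se

def decide(sample, hypothesized_mean, alpha=0.05):
    """Fail to reject H0 if the hypothesized value lies inside the interval."""
    low, high = z_confidence_interval(sample, confidence=1 - alpha)
    if low <= hypothesized_mean <= high:
        return "fail to reject H0"
    return "reject H0"

# Hypothetical measurements centred near 10.
sample = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 10.0, 9.9]
print(z_confidence_interval(sample))
print(decide(sample, hypothesized_mean=10.0))
print(decide(sample, hypothesized_mean=10.5))
```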

Variability in normal distributions 21

F-test F = S_1^2 / S_2^2, where S_1^2 and S_2^2 are the sample variances. The more this ratio deviates from 1, the stronger the evidence for unequal population variances. 22
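A minimal sketch of the F ratio, using made-up data. It computes only the ratio of sample variances; turning that into a p-value needs the F distribution, which is not in Python's standard library and is left out here.

```python
from statistics import variance

def f_ratio(sample_1, sample_2):
    """F = (sample variance of sample_1) / (sample variance of sample_2)."""
    return variance(sample_1) / variance(sample_2)

# Hypothetical samples: b is a scaled by 2, so its variance is 4 times larger.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
f = f_ratio(a, b)
print(f)
```

A ratio far from 1 in either direction (here well below 1) points toward unequal variances.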

T-test 23

Note on Standard Error Versus standard deviation (i.e. from the mean) SE ~ SD/sqrt(sample size) So, as size increases SE << SD !! Big data 24
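The standard error of the mean is the standard deviation divided by the square root of the sample size, so it shrinks as the sample grows while the standard deviation stays roughly constant. A quick sketch with simulated standard-normal data (the seed and sample sizes are arbitrary choices):

```python
import math
import random
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: SD / sqrt(n)."""
    return stdev(sample) / math.sqrt(len(sample))

rng = random.Random(0)
small = [rng.gauss(0, 1) for _ in range(100)]
large = [rng.gauss(0, 1) for _ in range(10_000)]

# SD is about 1 for both samples; SE drops by about a factor of 10.
print(stdev(small), standard_error(small))
print(stdev(large), standard_error(large))
```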

Frequencies v. Probabilities Actual rate of occurrence in a sample or population – frequency Expected or estimated likelihood of a value or outcome – probability Coin toss – two outcomes (binomial) –p=0.5 25
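To make the distinction concrete, the sketch below simulates coin tosses: p = 0.5 is the probability, and the observed fraction of heads is the frequency, which drifts toward p as the number of tosses grows. The function name and seed are illustrative assumptions.

```python
import random

def heads_frequency(n_tosses, p=0.5, seed=1):
    """Observed fraction of heads in n_tosses simulated tosses with P(heads) = p."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p)
    return heads / n_tosses

# Frequency from a few tosses can be far from 0.5; from many tosses it is close.
for n in (10, 1_000, 100_000):
    print(n, heads_frequency(n))
```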

Ranges: z, Percentiles, Quartiles The standard score is obtained by subtracting the mean and dividing the difference by the standard deviation. The symbol is z, which is why it's also called a z-score. Percentiles (100 regions) –The kth percentile is the number which has k% of the values below it. The data must be ranked. Quartiles (4 regions) –The quartiles divide the data into 4 equal regions. –Note: The 2nd quartile is the same as the median. The 1st quartile is the 25th percentile, the 3rd quartile is the 75th percentile. 26
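These three measures can be computed with Python's statistics module. The data set is invented, and statistics.quantiles uses its default "exclusive" interpolation method, which is one of several conventions, so exact percentile values may differ slightly from other software.

```python
from statistics import mean, median, quantiles, stdev

def z_scores(values):
    """Standard scores: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data, already ranked.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15, 20]

z = z_scores(data)
q1, q2, q3 = quantiles(data, n=4)      # quartiles: three cut points, four regions
percentiles = quantiles(data, n=100)   # percentiles[k - 1] is the kth percentile

print(q1, q2, q3)
print(q2 == median(data))              # the 2nd quartile is the median
print(percentiles[24] == q1, percentiles[74] == q3)
```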

Hypothesis
1. Write the original claim and identify whether it is the null hypothesis or the alternative hypothesis.
2. Write the null and alternative hypothesis. Use the alternative hypothesis to identify the type of test.
3. Write down all information from the problem.
4. Find the critical value using the tables.
5. Compute the test statistic.
6. Make a decision to reject or fail to reject the null hypothesis. A picture showing the critical value and test statistic may be useful.
7. Write the conclusion.
27

Hypothesis What are you exploring? Regular data analytics features ~ well defined hypotheses –Big Data messes that up E.g. Stock market performance / trends versus unusual events (crash/ boom): –Populations versus samples – which is which? –Why? E.g. Election results are predictable from exit polls 28

Null and Alternate Hypotheses H0 - null H1 – alternate If a given claim contains equality, or a statement of no change from the given or accepted condition, then it is the null hypothesis, otherwise, if it represents change, it is the alternative hypothesis. It never snows in Troy in January Students will attend their scheduled classes 29

P-value One common way to evaluate significance, especially in R output –approaches hypothesis testing in a different manner. Instead of comparing z-scores or t-scores as in the classical approach, you're comparing probabilities, or areas. The level of significance (alpha) is the area in the critical region. That is, the area in the tails to the right or left of the critical values. 30

P-value The p-value is the area to the right or left of the test statistic. –If it is a two tail test, then look up the probability in one tail and double it. If the test statistic is in the critical region, then the p-value will be less than the level of significance. –It does not matter whether it is a left tail, right tail, or two tail test. This rule always holds. 31

Accept or Reject? Reject the null hypothesis if the p-value is less than the level of significance. You will fail to reject the null hypothesis if the p-value is greater than or equal to the level of significance. Typical significance 0.05 (!) 32
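The p-value rule can be sketched for a one-sample z-test, where the population standard deviation is taken as known. The numbers (sample mean 52, null mean 50, standard deviation 8, n = 64) are made up, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_p_value(sample_mean, null_mean, std_dev, n, tail="two"):
    """Return (z, p) for a one-sample z-test; tail is 'left', 'right' or 'two'."""
    z = (sample_mean - null_mean) / (std_dev / math.sqrt(n))
    std_normal = NormalDist()
    if tail == "left":
        p = std_normal.cdf(z)                      # area to the left of z
    elif tail == "right":
        p = 1 - std_normal.cdf(z)                  # area to the right of z
    else:
        p = 2 * (1 - std_normal.cdf(abs(z)))       # one tail, doubled
    return z, p

def decision(p_value, alpha=0.05):
    """Reject H0 only when the p-value is strictly less than alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

z, p = z_test_p_value(sample_mean=52.0, null_mean=50.0, std_dev=8.0, n=64)
print(z, p, decision(p))
```

With these inputs the test statistic is 2 standard errors from the null mean, so the two-tail p-value falls just under the typical 0.05 level.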

Probability Density 33

Cumulative… 34

Pause… 35

Gnu R – load this first: http://lib.stat.cmu.edu/R/CRAN/ R Studio – see R-intro.html in manuals –Manuals –Libraries – at the command line – library(), or select the packages tab, and check/ uncheck as needed 36

Files This is where the files for assignments and exercises will be placed 37

Exercises – getting data in Rstudio –read in csv file (two ways to do this) - GPW3_GRUMP_SummaryInformation_2010.csv –Read in excel file (directly or by csv convert) EPI_data.xls (2010EPI_data tab) –See if you can plot some variables –Anything in common between them? 38

If time or for fun… se_eqs.xls –Plot it –Fit it PRESSURE.xls –Plot it –Smooth it –Fit it … 39

No reading this week Complete the installs as best you can 40