Introductory Statistics/ Refresher

Introductory Statistics/ Refresher Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Group 1, Module 2, January 19, 2017

Definitions/topics: Statistic; Statistics; Population and samples; Sampling; Distributions and parameters; Central tendencies; Frequency; Probability; Significance tests; Hypothesis (null and alternate); P-value; Density and cumulative distributions

Statistic and Statistics. Statistic (not to be confused with Statistics): a characteristic or measure obtained from a sample. Statistics: a collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.

What is "statistics"? The term "statistics" has two common meanings, which we want to clearly separate: descriptive and inferential statistics. But to understand the difference between descriptive and inferential statistics, we must first be clear on the difference between populations and samples. Courtesy Marshall Ma (and prior sources)

Populations and samples. A population is a set of well-defined objects. We must be able to say, for every object, whether it is in the population or not. We must be able, in principle, to find every individual of the population. A geographic example of a population is all pixels in a multi-spectral satellite image. A sample is a subset of a population. We must be able to say, for every object in the population, whether it is in the sample or not. Sampling is the process of selecting a sample from a population. Continuing the example, a sample from this population could be a set of pixels from known ground truth points. Courtesy Marshall Ma (and prior sources)

Populations and samples. A population = “all” of the data, if you can get it (BIG Data). This is what is different about the methods you use. A sample = “some” of the data, and you may not know how representative it is. This is what limits analysis, and certainly the development of models. Courtesy Marshall Ma (and prior sources)

Another example: "statistics"? Descriptive: "The adjustments of 14 GPS control points for this orthorectification ranged from 3.63 to 8.36 m with an arithmetic mean of 5.145 m". Inferential: "The mean adjustment for any set of GPS points taken under specified conditions and used for orthorectification is no less than 4.3 and no more than 6.1 m; this statement has a 5% probability of being wrong". Courtesy Marshall Ma (and prior sources)

E.g. Election prediction. Exit polls versus election results. Human versus cyber. How is the “population” defined here? What is the sample, and how is it chosen? What is described, and how is that used to predict? Are results categorized (where from, M/F, age)? What is the uncertainty? It is reflected in the “sample distribution” and controlled/constrained by “sampling theory”.

Sampling Types (basic). Random sampling: sampling in which the data is collected using chance methods or random numbers. Systematic sampling: sampling in which data is obtained by selecting every kth object. Convenience sampling: sampling in which data that is readily available is used. Stratified sampling: sampling in which the population is divided into groups (called strata) according to some characteristic; each of these strata is then sampled using one of the other sampling techniques. Cluster sampling: sampling in which the population is divided into groups (usually geographically); some of these groups are randomly selected, and then all of the elements in those groups are selected.
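As a rough illustration of four of these sampling types, here is a minimal Python sketch. The population of 100 numbered objects, the two strata, and the five clusters of 20 are all hypothetical choices made for this example and do not come from the slides.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable
population = list(range(1, 101))  # hypothetical population of 100 numbered objects

# Random sampling: every object has the same chance of being selected.
random_sample = random.sample(population, 10)

# Systematic sampling: every kth object, starting from a random offset.
k = 10
start = random.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: divide into strata, then sample inside each stratum
# (here using random sampling within each one).
strata = {
    "low": [x for x in population if x <= 50],
    "high": [x for x in population if x > 50],
}
stratified_sample = [x for group in strata.values() for x in random.sample(group, 5)]

# Cluster sampling: choose whole groups at random and keep every element in them.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [x for cluster in random.sample(clusters, 2) for x in cluster]
```

Convenience sampling is left out because it has no selection rule to code: it simply uses whatever data is at hand.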

Random Numbers. Can a computer generate a random number? Can you? Origin: to reduce selection bias! In R there are many ways: see the help on Random {base} and get familiar with set.seed.
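The slide points to set.seed in R. The same idea can be sketched in Python with the standard random module (Python is used here only for illustration): seeding a pseudo-random generator makes its "random" sequence reproducible.

```python
import random

random.seed(42)  # fix the generator state
first = [random.random() for _ in range(3)]

random.seed(42)  # reset to the same state
second = [random.random() for _ in range(3)]

print(first == second)  # True: same seed, same sequence
```

This is why computer-generated numbers are called pseudo-random: they are fully determined by the seed, which is useful for repeatable analyses.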

Sampling Theory. See Nyquist–Shannon for time series*. Basically, if there are no frequencies greater than x, then you need to sample at a rate of more than 2x per time unit. Not well known application: good, better, best. How many samples?
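A small sketch of what goes wrong when the sampling rate is too low. The 9 Hz signal and the 10 Hz sampling rate are hypothetical numbers chosen for this example. Sampled that slowly, the 9 Hz sine gives exactly the same sample values as a sign-flipped 1 Hz sine (aliasing), so the two cannot be told apart.

```python
import math

f_signal = 9.0   # Hz, highest frequency present in the hypothetical signal
f_sample = 10.0  # Hz, deliberately below the required 2 * 9 = 18 Hz

times = [n / f_sample for n in range(10)]
undersampled = [math.sin(2 * math.pi * f_signal * t) for t in times]

# A 1 Hz sine with flipped sign, evaluated at the same sample times.
alias = [-math.sin(2 * math.pi * 1.0 * t) for t in times]

# Largest difference between the two sets of samples (zero up to rounding).
max_gap = max(abs(a - b) for a, b in zip(undersampled, alias))
```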

Minimum Sample Size. Typical formula** is N = (z × standard deviation)^2 / (margin of error)^2. You may need to estimate the standard deviation. z comes from confidence intervals (normal distribution). Margin of error is your tolerance for being wrong. E.g. for elections ~7000! Based on 1% error and 95% confidence…
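The formula can be sketched in Python as follows. The function name min_sample_size is made up for this example, and the standard deviation of 0.5 (the worst case for a yes/no proportion) is an assumption; the slide does not say which value it used.

```python
import math
from statistics import NormalDist

def min_sample_size(std_dev, margin_of_error, confidence=0.95):
    """N = (z * std_dev)^2 / margin_of_error^2, rounded up to a whole sample."""
    # Two-sided z value for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

# Assumed worst-case standard deviation of 0.5, 1% margin of error, 95% confidence.
n = min_sample_size(std_dev=0.5, margin_of_error=0.01, confidence=0.95)
```

Under these assumptions n comes out at about 9,600. A smaller estimated standard deviation or a lower confidence level gives a smaller n, so the result depends heavily on the values plugged in.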

Bias difference: between cyber and human data. 2012 (!not 2016) election results and exit polls. What are examples of bias in election results? In exit polls? http://www.mapsnworld.com/blog/wp-content/uploads/2012/11/election-results-map-usa-2012.jpg http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2012/11-6-12-7-35-pm-results/14234173-7-eng-US/11-6-12-7-35-pm-results_full_600.jpg

Distributions. http://www.quantitativeskills.com/sisa/rojo/alldist.zip Shape; Character; Parameter(s); Mean; Standard deviation; Skewness; Etc.

Plotting these distributions. Histograms and binning. Getting used to log scales. Going beyond 2-D. More of this next week (in more detail).

In applications Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html Matlab: http://www.mathworks.com/help/stats/_brn2irf.html Excel: see

Heavy-tail distributions are probability distributions whose tails are not exponentially bounded. Common: long-tail… human v. cyber… Few that dominate. More that add up. http://upload.wikimedia.org/wikipedia/commons/8/8a/Long_tail.svg Equal areas. http://en.wikipedia.org/wiki/Heavy-tailed_distribution

Spatial example

Spatial roughness…

Central tendency – median, mean, mode http://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Comparison_mean_median_mode.svg/300px-Comparison_mean_median_mode.svg.png

Significance Tests. Confidence intervals allow you to accept or reject hypotheses… (critical region) - two-tailed test. If the hypothesized value of the parameter lies within the confidence interval with a 1-alpha level of confidence, then the decision at an alpha level of significance is to fail to reject the null hypothesis (informally, to "accept" it; strictly, the data just give no grounds to reject it). If the hypothesized value of the parameter lies outside the confidence interval with a 1-alpha level of confidence, then the decision at an alpha level of significance is to reject the null hypothesis.
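A minimal sketch of this confidence-interval decision rule for a mean. The function name ci_test and the sample values are hypothetical. The interval uses the normal (z) approximation as a simplifying assumption; for a sample this small a t-based interval would usually be preferred.

```python
import math
from statistics import NormalDist, mean, stdev

def ci_test(sample, hypothesized_mean, alpha=0.05):
    """Two-tailed test: reject H0 if the hypothesized mean lies outside the CI."""
    center = mean(sample)
    se = stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
    z = NormalDist().inv_cdf(1 - alpha / 2)      # two-sided critical value
    low, high = center - z * se, center + z * se
    reject = not (low <= hypothesized_mean <= high)
    return (low, high), reject

# Hypothetical measurements.
sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9, 5.2]

interval, reject_5_0 = ci_test(sample, hypothesized_mean=5.0)  # inside: fail to reject
_, reject_5_5 = ci_test(sample, hypothesized_mean=5.5)         # outside: reject
```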

Variability in normal distributions http://www.socialresearchmethods.net/kb/Assets/images/stat_t2.gif

F-test. F = S1^2 / S2^2, where S1^2 and S2^2 are the sample variances. The more this ratio deviates from 1, the stronger the evidence for unequal population variances. http://www.statistics4u.info/fundstat_eng/img/gm_compvaria_tusche.png
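The ratio itself is easy to compute. The sketch below uses two made-up samples and the hypothetical helper name f_ratio. It stops at the statistic: turning F into a p-value needs the F distribution, which is not in the Python standard library (the stats packages listed earlier in the slides provide it).

```python
from statistics import variance

def f_ratio(sample_1, sample_2):
    """Ratio of the two sample variances (each with an n - 1 denominator)."""
    return variance(sample_1) / variance(sample_2)

# Hypothetical samples: a is much more spread out than b.
a = [10.0, 12.0, 9.0, 11.0, 13.0]   # sample variance 2.5
b = [10.0, 10.5, 9.5, 10.0, 10.0]   # sample variance 0.125

F = f_ratio(a, b)  # 2.5 / 0.125 = 20, far from 1
```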

T-test http://www.socialresearchmethods.net/kb/Assets/images/stat_t3.gif

Note on Standard Error. Versus standard deviation = SD (i.e. spread about the mean). SE = SD / sqrt(sample size). So, as sample size increases, SE << SD!! Big data.
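A short sketch of how the standard error of the mean, SE = SD / sqrt(n), shrinks as the sample grows while the standard deviation stays roughly the same. The simulated standard-normal data and the sample sizes are arbitrary choices for this example.

```python
import math
import random
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: sample SD divided by sqrt(sample size)."""
    return stdev(sample) / math.sqrt(len(sample))

random.seed(0)  # repeatable simulated data
small = [random.gauss(0, 1) for _ in range(100)]
large = [random.gauss(0, 1) for _ in range(10_000)]

sd_small, se_small = stdev(small), standard_error(small)
sd_large, se_large = stdev(large), standard_error(large)
# Both SDs stay near 1, while the SE drops by about a factor of 10
# when the sample is 100 times larger.
```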

Frequencies v. Probabilities. Frequency: the actual rate of occurrence in a sample or population. Probability: the expected or estimated likelihood of a value or outcome. Examples: coin toss, two outcomes (binomial), p=0.5 (of “heads”); Male/Female; which US state you live in.
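The coin-toss case can be simulated to show the difference: the probability is the fixed value p = 0.5, while the frequency is whatever a particular run of tosses produces. The number of tosses and the seed are arbitrary choices for this sketch.

```python
import random

random.seed(1)   # repeatable run
p_heads = 0.5    # probability: the expected likelihood of heads

# Simulate 10,000 tosses; True means heads.
tosses = [random.random() < p_heads for _ in range(10_000)]

# Frequency: the rate of heads actually observed in this sample.
frequency = sum(tosses) / len(tosses)
```

The observed frequency lands close to 0.5 but will generally not equal it exactly, and it changes from run to run if the seed changes.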

Ranges: z, Percentiles, Quartiles. The standard score is obtained by subtracting the mean and dividing the difference by the standard deviation. The symbol is z, which is why it's also called a z-score. Percentiles (100 regions): the kth percentile is the number which has k% of the values below it. The data must be ranked. Quartiles (4 regions): the quartiles divide the data into 4 equal regions. Note: the 2nd quartile is the same as the median. The 1st quartile is the 25th percentile; the 3rd quartile is the 75th percentile.
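These definitions can be checked on a small made-up data set with the Python standard library. Note that statistics.quantiles uses one particular interpolation convention (its default "exclusive" method); other software may give slightly different quartile values on the same data.

```python
from statistics import mean, median, pstdev, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values: mean 5, population SD 2

mu = mean(data)
sigma = pstdev(data)  # treating the data as the whole population
z_scores = [(x - mu) / sigma for x in data]  # standard scores

q1, q2, q3 = quantiles(data, n=4)      # quartile cut points
percentiles = quantiles(data, n=100)   # 99 cut points: 1st to 99th percentile

same_as_median = (q2 == median(data))     # 2nd quartile is the median
q1_is_25th = (percentiles[24] == q1)      # 1st quartile is the 25th percentile
q3_is_75th = (percentiles[74] == q3)      # 3rd quartile is the 75th percentile
```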

Hypothesis
1. Write the original claim and identify whether it is the null hypothesis or the alternative hypothesis.
2. Write the null and alternative hypotheses.
3. Use the alternative hypothesis to identify the type of test.
4. Write down all information from the problem.
5. Find the critical value using the tables.
6. Compute the test statistic.
7. Make a decision to reject or fail to reject the null hypothesis. A picture showing the critical value and test statistic may be useful.
8. Write the conclusion.
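The steps above can be sketched for one specific case: a two-tailed, one-sample z-test with a known population standard deviation. That choice of test, the function name z_test_critical, and all the numbers are assumptions made for this example. The critical value is computed from the normal distribution instead of being read from a table.

```python
import math
from statistics import NormalDist

def z_test_critical(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test using the critical-value approach.

    H0: population mean == mu0   (contains equality, so it is the null)
    H1: population mean != mu0   (two-tailed alternative)
    """
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # step 5: critical value
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))  # step 6: test statistic
    reject = abs(z) > critical                        # step 7: decision
    return z, critical, reject

# Hypothetical problem: claimed mean 50, known sigma 10, 100 observations
# with an observed sample mean of 52.
z, critical, reject = z_test_critical(sample_mean=52, mu0=50, sigma=10, n=100)
# z = 2.0 exceeds the critical value of about 1.96, so H0 is rejected at alpha = 0.05.
```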

Hypothesis. What are you exploring? Regular data analytics features ~ well defined hypotheses. Big Data messes that up. E.g. stock market performance/trends versus unusual events (crash/boom): populations versus samples – which is which? Why? E.g. election results are predictable from exit polls.

Null and Alternate Hypotheses. H0 – null; H1 – alternate. If a given claim contains equality, or a statement of no change from the given or accepted condition, then it is the null hypothesis; otherwise, if it represents change, it is the alternative hypothesis. Examples: It never snows in Troy in January. Students will attend their scheduled classes.

P-value. The p-value, one common way to evaluate significance (especially in R output), approaches hypothesis testing in a different manner. Instead of comparing z-scores or t-scores as in the classical approach, you're comparing probabilities, or areas. The level of significance (alpha) is the area in the critical region, that is, the area in the tails to the right or left of the critical values.

P-value The p-value is the area to the right or left of the test statistic. If it is a two tail test, then look up the probability in one tail and double it. If the test statistic is in the critical region, then the p-value will be less than the level of significance. It does not matter whether it is a left tail, right tail, or two tail test. This rule always holds.

Accept or Reject? Reject the null hypothesis if the p-value is less than the level of significance. You will fail to reject the null hypothesis if the p-value is greater than or equal to the level of significance. In English – you accept that there is no relation! Typical significance 0.05 (!)
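A minimal sketch of the p-value rule for a z statistic, assuming a two-tailed test against the standard normal distribution. The function name p_value_two_tailed and the example statistics are hypothetical. It follows the recipe in the slides: take the area in one tail beyond the test statistic and double it.

```python
from statistics import NormalDist

def p_value_two_tailed(z):
    """Area in one tail beyond |z|, doubled for a two-tailed test."""
    one_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * one_tail

alpha = 0.05  # typical level of significance

p_strong = p_value_two_tailed(2.0)  # about 0.0455
p_weak = p_value_two_tailed(1.0)    # about 0.317

reject_strong = p_strong < alpha    # True: reject the null hypothesis
reject_weak = p_weak < alpha        # False: fail to reject
```

A statistic of 2.0 lies beyond the two-tailed critical value of about 1.96, and its p-value is correspondingly below 0.05, so the p-value rule and the critical-region rule give the same decision.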

Probability Density

Cumulative…

Next? Additional review material at: http://libproxy.rpi.edu/login?url=http://library.books24x7.com/library.asp?%5EB&bookid=72929 http://libproxy.rpi.edu/login?url=http://library.books24x7.com/library.asp?%5EB&bookid=82386 Next “class” is Jan. 26 – introduction to R – boot camp, installs, examples (John Erickson, guest presenter) For Jan. 23 – RCS login required - http://mediasite.mms.rpi.edu/Mediasite5/Play/43c86ae1171443d8ae71be3841bafd841d (i.e. do not come to class but WATCH this before Jan. 30)