Chapter 4 Statistics.


4.1 – What is Statistics? Definition 4.1.1 Data are observed values of random variables. The field of statistics is a collection of methods for estimating distributions and parameters of random variables through the collection and analysis of data.

4.1 – What is Statistics? Definition 4.1.2 The population is the set of all objects of interest in a statistical study. A sample is a subset of the population. Definition 4.1.3 Data are information that has been collected. The field of statistics is a collection of methods for drawing conclusions about a population by collecting and analyzing data from a sample.

Types of Data Definition 4.1.4 A parameter is a number calculated using information from every member of a population. A statistic is calculated using information from a sample. Definition 4.1.5 Quantitative data consist of numbers. Qualitative data are nonnumeric information that can be separated into different categories.

Types of Data Definition 4.1.6 Discrete data are observed values of a discrete random variable. They are numbers that have a finite or countable set of values. Continuous data are observed values of a continuous random variable. They are numbers that can take any value within some range.

Levels of Measurement Definition 4.1.7 Data are at the nominal level of measurement if they consist of only names, labels, or categories. They cannot be ordered (such as smallest to largest) in a meaningful way. Data are at the ordinal level of measurement if they can be ordered in a meaningful way, but differences between data values cannot be calculated or are meaningless. Data are at the interval level of measurement if they can be ordered in a meaningful way and differences between data values are meaningful. Data are at the ratio level of measurement if they are at the interval level, ratios of data values are meaningful, and there is a meaningful zero starting point.

Types of Studies Definition 4.1.8 In an observational study, data are obtained in such a way that the members of the sample are not changed, modified, or altered in any way. In an experiment, something is done to the members of the sample and the resulting effects are recorded. The “something” that is done is called a treatment.

Types of Observational Studies Definition 4.1.9 In a cross-sectional study, data are collected at one specific point in time. In a retrospective study, data are collected from studies done in the past. In a prospective study, data are collected by observing a sample for some time into the future.

Blocks Definition 4.1.10 A block is a subset of the population with a similar characteristic. Different blocks of a population have different characteristics that may affect the variable of interest differently. A randomized block design is a type of experiment where: The population is divided into blocks. Members from each block are randomly chosen to receive the treatment.

Sampling Techniques Definition 4.1.11 A convenience sample is a sample that is very easy to get. A voluntary response sample is obtained when members of the sample decide whether to participate or not. A systematic sample is obtained by arranging the population in some order, then selecting a starting point, and then selecting every kth member (such as every 20th).

Sampling Techniques A cluster sample is obtained by dividing the population into subsets (or clusters) where the members of each cluster have a common characteristic, then randomly choosing some of the clusters, and surveying every member of the chosen clusters. A stratified sample is obtained by dividing the population into subsets and then randomly choosing some members from each of the subsets. A multistage sample is obtained by successively applying a variety of sampling techniques. At each stage the sample becomes smaller, and at the last stage, a cluster sample is chosen.

Random Samples Definition 4.1.12 A random sample is chosen in a way such that every individual member of the population has the same probability of being chosen. A simple random sample of size n is chosen in a way such that every group of size n has the same probability of being chosen.
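The definitions of simple random, systematic, and stratified samples can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered members, the two strata, and all function names are hypothetical and do not come from the slides.

```python
import random

def simple_random_sample(population, n, rng):
    """Every group of size n is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random starting point among the first k members, then take every kth member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Randomly choose n_per_stratum members from each subset (stratum)."""
    chosen = []
    for members in strata:
        chosen.extend(rng.sample(members, n_per_stratum))
    return chosen

rng = random.Random(0)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))             # hypothetical population: members numbered 1..100
strata = [population[:50], population[50:]]  # hypothetical split into two strata

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 20, rng)  # every 20th member
strat = stratified_sample(strata, 5, rng)
```

A cluster sample would differ from the stratified version in that whole subsets are chosen at random and every member of each chosen subset is surveyed.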

4.2 – Summarizing Data Example 4.2.3 Shown below are the waiting times of 30 customers at a supermarket check-out stand. [Table: relative frequency distribution]

Histograms The “shape” of a relative frequency histogram is an approximation of the graph of the p.d.f. (or p.m.f.) of the underlying random variable.

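Summary Statistics Definition 4.2.1 Let $\{x_1, x_2, \ldots, x_n\}$ be a set of quantitative data collected from a sample of the population. Mean of the data: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Variance of the data: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. Standard deviation of the data: $s = \sqrt{s^2}$. Range of the data: (max value) – (min value).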
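As a quick check of these summary statistics, the sketch below uses Python's standard `statistics` module. It assumes the usual sample variance with an $n - 1$ denominator, and the five data values are hypothetical, not taken from the slides.

```python
import statistics

data = [2.1, 0.5, 3.4, 1.8, 2.9]  # hypothetical sample, not from the slides

mean = statistics.mean(data)
variance = statistics.variance(data)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)      # square root of the sample variance
data_range = max(data) - min(data)    # (max value) - (min value)
```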

Example 4.2.4

Percentiles Definition 4.2.2 Let $p$ be a number between 0 and 1. The $(100p)$th percentile of a set of quantitative data is a number, denoted $\pi_p$, that is greater than $(100p)\%$ of the data values. The 25th, 50th, and 75th percentiles are called the first, second, and third quartiles and are denoted $p_1 = \pi_{0.25}$, $p_2 = \pi_{0.50}$, and $p_3 = \pi_{0.75}$, respectively. The 50th percentile is also called the median of the data and is denoted $m = p_2$. The mode of the data is the data value that occurs most frequently. The 5-number summary of a set of data consists of the minimum value, $p_1$, $p_2$, $p_3$, and the maximum value.

Calculating Percentiles Arrange the data in increasing order: $x_1 \le x_2 \le \cdots \le x_n$. Calculate $L = np$. If $L$ is not an integer, then round it up to the next larger integer and $\pi_p = x_L$. If $L$ is an integer, then $\pi_p = \frac{1}{2}(x_L + x_{L+1})$.
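The percentile rule above translates directly into code. The following is a minimal sketch; the function name `percentile` and the eight data values are illustrative assumptions, and the rounding of $L$ to nine decimals is only a guard against floating-point noise.

```python
import math

def percentile(data, p):
    """Return pi_p using the rule L = n*p: round up if L is not an integer, else average x_L and x_(L+1)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(data)               # x_1 <= x_2 <= ... <= x_n
    n = len(xs)
    L = round(n * p, 9)             # guard against floating-point noise in n*p
    if not L.is_integer():
        idx = math.ceil(L)          # round up to the next larger integer
        return xs[idx - 1]          # x_L (lists are 0-indexed)
    idx = int(L)
    return (xs[idx - 1] + xs[idx]) / 2  # average of x_L and x_(L+1)

data = [7, 2, 5, 1, 8, 3, 6, 4]     # hypothetical data, n = 8
first_quartile = percentile(data, 0.25)  # L = 2, an integer
median = percentile(data, 0.50)          # L = 4, an integer
p90 = percentile(data, 0.90)             # L = 7.2, rounded up to 8
```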

Example 4.2.5 Calculate the first quartile, $p_1 = \pi_{0.25}$. Calculate the median, $m = p_2 = \pi_{0.5}$.

Example 4.2.5 5-number summary: 0, 0.5, 1.8, 2.9, 7.3. [Figure: box plot]

4.3 – Sampling Distributions Definition 4.3.1 A random variable $\hat{\Theta}$ whose values are used to estimate the value of a parameter $\theta$ is called an estimator of $\theta$. A value of $\hat{\Theta}$, denoted $\hat{\theta}$, is called an estimate of $\theta$. An estimator $\hat{\Theta}$ is called an unbiased estimator of $\theta$ if $E(\hat{\Theta}) = \theta$. If this equation is not true, then $\hat{\Theta}$ is called a biased estimator.

Sample Proportion Suppose we want to know the proportion $p$ of a population who support a particular political candidate. $p$ is a parameter. We survey 735 voters and find 383 that support the candidate. The sample proportion is $\hat{p} = 383/735 \approx 0.521$. This is an estimate of $p$.

Sample Proportion Let $X$ denote the number who support the candidate in a sample of $n$. Define the random variable $\hat{P} = X/n$, called the “sample proportion”. $\hat{p}$ is an observed value of $\hat{P}$. $\hat{p}$ is an estimate of $p$. $\hat{P}$ is an estimator of $p$.

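Sampling Distribution of the Proportion Theorem 4.3.1 Let $X$ be $b(n, p)$. Then as $n \to \infty$ the distribution of the sample proportion $\hat{P} = X/n$ approaches a normal distribution. Meaning: $\hat{P}$ is approximately $N\left(p, \frac{p(1-p)}{n}\right)$ for $n$ “large enough”. “Large enough” means $np \ge 5$ and $n(1-p) \ge 5$.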

Example 4.3.3 By examining the spending habits of one particular consumer, a credit card company observes that during the course of normal transactions 37% of the charges exceed $150. Out of 50 charges made in one particular month, 27 exceeded $150. Does it appear that these charges were made in the course of normal transactions?

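Example 4.3.3 Sample proportion of charges that exceed \$150: $\hat{p} = 27/50 = 0.54$. Is this unusually large? Assume normal transactions, so that $p = 0.37$. Then $\hat{P}$ is approximately $N\left(0.37, \frac{0.37(0.63)}{50}\right)$, and $P(\hat{P} \ge 0.54) \approx P(Z \ge 2.49) \approx 0.006$. This probability is small (< 0.05), so we reject the assumption.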
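The tail probability in this example can be checked with the normal approximation from Theorem 4.3.1. The sketch below uses Python's standard `statistics.NormalDist`; the variable names are illustrative, and the normal approximation (rather than an exact binomial calculation) is the assumption being made.

```python
import math
from statistics import NormalDist

p = 0.37          # proportion of charges over $150 under normal transactions
n = 50            # charges in the month under review
p_hat = 27 / n    # observed sample proportion, 0.54

# Normal approximation: P-hat is approximately N(p, p(1 - p)/n)
sd = math.sqrt(p * (1 - p) / n)
approx = NormalDist(mu=p, sigma=sd)

# Probability of a sample proportion at least as large as the one observed
tail_prob = 1 - approx.cdf(p_hat)
unusual = tail_prob < 0.05
```

Both $np = 18.5$ and $n(1-p) = 31.5$ are at least 5, so the "large enough" condition holds here.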

Sample Mean Suppose we want to know the mean IQ score of all college students in the US, $\mu$. Estimate it with a sample mean $\bar{x}$. Let $X$ denote the IQ of a randomly selected student, so $E(X) = \mu$. $\bar{x}$ is an observed value of the sample mean $\bar{X}_n$. $\bar{x}$ is an estimate of $\mu$. $\bar{X}_n$ is an estimator of $\mu$.

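Sampling Distribution of the Mean By the Central Limit Theorem, for large $n$ the sample mean $\bar{X}_n$ is approximately $N\left(\mu, \frac{\sigma^2}{n}\right)$, where $\sigma^2 = Var(X)$.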
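A small simulation can illustrate the sampling distribution of the mean. The sketch below is not from the slides: it assumes a Uniform(0, 1) population (so $\mu = 0.5$ and $\sigma^2 = 1/12$), a sample size of 40, and 5000 repeated samples, all chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed for reproducibility
n = 40                    # size of each sample
repeats = 5000            # number of samples drawn

# Each entry is the mean of one sample of size n from Uniform(0, 1)
sample_means = [
    statistics.fmean(rng.random() for _ in range(n))
    for _ in range(repeats)
]

mu = 0.5                  # mean of Uniform(0, 1)
sigma2 = 1 / 12           # variance of Uniform(0, 1)

mean_of_means = statistics.fmean(sample_means)   # should be close to mu
var_of_means = statistics.variance(sample_means) # should be close to sigma2 / n
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is flat.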

Idea

4.4 – Confidence Intervals for a Proportion Definition 4.4.1 Let $Z$ be $N(0, 1)$ and $p$ be a number between 0 and 0.5. A critical z-value $z_p$ is a positive number such that $P(Z > z_p) = p$.

Practice

Practice

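Critical Values Let $\alpha$ be between 0 and 1. Then $p = \alpha/2$ is between 0 and 0.5, so that the critical z-value $z_{\alpha/2}$ is a positive number such that $P(Z > z_{\alpha/2}) = \alpha/2$, or equivalently $P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$.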
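Critical z-values can be computed rather than read from a table. The sketch below assumes the right-tail reading of the definition, $P(Z > z_p) = p$, and uses the inverse c.d.f. of the standard normal from Python's standard library; the function name `critical_z` is illustrative.

```python
from statistics import NormalDist

def critical_z(p):
    """Return the positive z_p with P(Z > z_p) = p, for 0 < p < 0.5."""
    if not 0 < p < 0.5:
        raise ValueError("p must be strictly between 0 and 0.5")
    # P(Z > z_p) = p is the same as P(Z <= z_p) = 1 - p
    return NormalDist().inv_cdf(1 - p)

z_90 = critical_z(0.05)    # alpha = 0.10, i.e. 90% confidence
z_95 = critical_z(0.025)   # alpha = 0.05, i.e. 95% confidence
z_99 = critical_z(0.005)   # alpha = 0.01, i.e. 99% confidence
```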

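Confidence Interval Definition 4.4.2 Let $0 < \alpha < 1$ and let $x$ be the number of successes in $n$ observed trials of a Bernoulli experiment with unknown probability of a success $p$. Define $\hat{p} = x/n$ and let $z_{\alpha/2}$ be a critical z-value. The interval $\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ is called a 100(1 − α)% confidence interval estimate for $p$.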

Confidence Interval Different forms

Requirements The sample must be random. The conditions for a binomial distribution must be satisfied (at least approximately). There are at least 5 successes and at least 5 failures observed in the n trials.

Example 4.4.2 Suppose 383 out of 735 surveyed voters support a particular political candidate. Calculate a 95% confidence interval estimate for the proportion of all voters who support the candidate. Define the population proportion being estimated: $p$ = the proportion of all voters who support the candidate. Calculate the sample proportion: $\hat{p} = 383/735 \approx 0.521$.

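Example 4.4.2 Find the critical value: $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$. Calculate the margin of error: $E = 1.96\sqrt{\frac{0.521(1 - 0.521)}{735}} \approx 0.036$. Calculate the confidence interval: $0.521 - 0.036 < p < 0.521 + 0.036$, that is, $0.485 < p < 0.557$.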
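The steps of this example can be reproduced with a short function. This is a sketch under the assumption that the interval is $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, which agrees with the bounds 0.485 and 0.557 quoted in the interpretation; the function name `proportion_ci` is illustrative.

```python
import math
from statistics import NormalDist

def proportion_ci(x, n, conf=0.95):
    """Confidence interval for a proportion from x successes in n trials."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)            # critical value z_(alpha/2)
    p_hat = x / n                                      # sample proportion
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)    # margin of error E
    return p_hat - margin, p_hat + margin

lower, upper = proportion_ci(383, 735, conf=0.95)
```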

Example 4.4.2 Correct interpretation Meaning We are 95% confident that the value of p is between 0.485 and 0.557. Meaning If we were to survey many different samples of voters and calculate the corresponding 95% confidence interval using the statistics from each sample, then about 95% of the intervals would contain the true value of p.

Incorrect Meaning There is a 95% chance that the actual value of p is between 0.485 and 0.557 Why is this incorrect? p is a number that we don’t know. It has a value. It is between 0.485 and 0.557 or not. There is no probability involved.

What does this mean? The confidence level refers to the process of constructing intervals, not to the intervals themselves. If we constructed many intervals, then about 95% of them would contain the true value of p.
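The "process" interpretation can be illustrated by simulation: draw many samples from a population whose true proportion is known, build an interval from each, and count how often the interval contains the true value. Everything below is a hypothetical setup (true proportion 0.52, sample size 400, 2000 repetitions) chosen only for illustration.

```python
import math
import random
from statistics import NormalDist

def proportion_ci(x, n, conf=0.95):
    """Interval p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = x / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

rng = random.Random(1)   # fixed seed for reproducibility
true_p = 0.52            # hypothetical true proportion, known only to the simulation
n = 400                  # hypothetical sample size
trials = 2000            # number of samples (and intervals) constructed

covered = 0
for _ in range(trials):
    # Number of supporters in one simulated sample of n voters
    x = sum(rng.random() < true_p for _ in range(n))
    lower, upper = proportion_ci(x, n)
    if lower < true_p < upper:
        covered += 1

coverage = covered / trials   # fraction of intervals containing true_p
```

Each individual interval either contains `true_p` or it does not; only the long-run fraction is close to 0.95.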

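4.5 – Confidence Intervals for a Mean Definition 4.5.1 Let $\bar{x}$ be the mean of a sample of size $n$ taken from a population with known variance $\sigma^2$ and unknown mean $\mu$. The interval $\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is called a 100(1 − α)% confidence interval estimate for $\mu$.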

Z-Interval Requirements The sample is random. The population variance $\sigma^2$ is known. The population is normally distributed or $n > 30$.

T-Interval Definition 4.5.2 Let $\bar{x}$ be the mean and $s^2$ be the variance of a sample of size $n$ taken from a population with unknown variance $\sigma^2$ and mean $\mu$, and let $t_{\alpha/2}$ be a critical Student-t value with $(n - 1)$ degrees of freedom. The interval $\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$ is called a 100(1 − α)% confidence interval estimate for $\mu$ when $\sigma^2$ is unknown.

T-Interval Requirements The sample is random. The population is normally distributed or n > 30.

Which Type of Interval? Suggestions If $n > 30$ or $\sigma^2$ is known, then use a Z-interval. If $\sigma^2$ is unknown, and the population is normally distributed (at least approximately), then use a T-interval. If $n \le 30$, $\sigma^2$ is unknown, and the population is not normally distributed, then see Chapter 7.

Example 4.5.3 A random sample of 15 “1-pound” packages of shredded cheddar cheese has a mean weight of $\bar{x} = 1.05$ lb. and standard deviation of $s = 0.02$ lb. Calculate a 99% confidence interval estimate for the mean weight of all such packages. Define the population mean being estimated: $\mu$ = the mean weight of all “1-pound” packages of shredded cheddar cheese.

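Example 4.5.3 Find the critical value: $\alpha = 0.01$ and $n = 15$, so $t_{\alpha/2} = t_{0.005} = 2.977$ with 14 degrees of freedom. Calculate the margin of error: $E = 2.977\cdot\frac{0.02}{\sqrt{15}} \approx 0.015$. Calculate the confidence interval: $1.05 - 0.015 < \mu < 1.05 + 0.015$, that is, $1.035 < \mu < 1.065$.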
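The T-interval arithmetic for this example can be sketched as follows. Python's standard library has no Student-t quantile function, so the critical value is passed in as a parameter; the value 2.977 used here is the tabulated $t_{0.005}$ for 14 degrees of freedom. The interval form $\bar{x} \pm t_{\alpha/2}\, s/\sqrt{n}$ and the function name `t_interval` are assumptions of this sketch.

```python
import math

def t_interval(x_bar, s, n, t_crit):
    """Interval x_bar +/- t_crit * s / sqrt(n); t_crit must have n - 1 degrees of freedom."""
    margin = t_crit * s / math.sqrt(n)   # margin of error E
    return x_bar - margin, x_bar + margin

# Cheese-package data from the example: n = 15, x_bar = 1.05 lb, s = 0.02 lb
# t_crit = 2.977 is the table value t_0.005 with 14 degrees of freedom
lower, upper = t_interval(x_bar=1.05, s=0.02, n=15, t_crit=2.977)
```

Since the whole interval lies above 1 lb, the data suggest the packages are, on average, slightly overfilled.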

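4.6 – Confidence Intervals for a Variance Definition 4.6.1 Let $s^2$ be the variance of a sample of size $n$ taken from a normally distributed population with unknown variance $\sigma^2$, and let $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ be critical $\chi^2$ values with $(n - 1)$ degrees of freedom, where the subscript is the area to the right of the critical value. The interval $\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$ is a 100(1 − α)% confidence interval estimate for $\sigma^2$.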

Confidence Intervals for a Variance Requirements The sample is random. The population is normally distributed.

Example 4.6.2 The proportions of butterfat in 20 batches of butter were measured. The resulting data have a sample variance of $s^2 = 0.001102$. Construct a 95% confidence interval estimate of the variance in the proportion of butterfat of all batches. Define the population variance being estimated: $\sigma^2$ = the variance in the proportion of butterfat of all batches of butter.

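Example 4.6.2 Find the critical values: $\alpha = 0.05$ and $n = 20$, so with 19 degrees of freedom $\chi^2_{0.975} = 8.907$ and $\chi^2_{0.025} = 32.852$ (subscripts are right-tail areas). Calculate the confidence interval: $\frac{19(0.001102)}{32.852} < \sigma^2 < \frac{19(0.001102)}{8.907}$, that is, $0.000637 < \sigma^2 < 0.002351$.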
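The variance interval for the butterfat data can be sketched in the same way. The critical values are passed in as parameters because the standard library has no $\chi^2$ quantile function; 8.907 and 32.852 are the tabulated values for 19 degrees of freedom with right-tail areas 0.975 and 0.025. The function name and argument names are illustrative.

```python
def variance_interval(s2, n, chi2_upper_tail, chi2_lower_tail):
    """Interval ((n-1)s^2 / chi2_(alpha/2), (n-1)s^2 / chi2_(1-alpha/2)).

    chi2_upper_tail: critical value with right-tail area alpha/2 (the larger one)
    chi2_lower_tail: critical value with right-tail area 1 - alpha/2 (the smaller one)
    """
    numerator = (n - 1) * s2
    return numerator / chi2_upper_tail, numerator / chi2_lower_tail

# Butterfat data from the example: n = 20, s^2 = 0.001102, 95% confidence
lower, upper = variance_interval(
    s2=0.001102, n=20, chi2_upper_tail=32.852, chi2_lower_tail=8.907
)
```

Note that the larger critical value produces the lower bound, because it appears in the denominator.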

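4.7 – Confidence Intervals for Differences Definition 4.7.1 Consider two populations with respective proportions $p_1$ and $p_2$. Let $n_1$ and $n_2$ be the sample sizes and $\hat{p}_1$ and $\hat{p}_2$ be the sample proportions. Then $(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ is a 100(1 − α)% confidence interval estimate for $p_1 - p_2$.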

2-Proportion Z-Interval Requirements Both samples are random and independent. Each sample contains at least 5 successes and 5 failures.
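A 2-proportion Z-interval can be sketched as below, assuming the interval $(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$. The counts (60 of 100 and 45 of 100) are hypothetical, and the function also checks the requirement of at least 5 successes and 5 failures in each sample.

```python
import math
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, conf=0.95):
    """Confidence interval for p1 - p2 from two independent samples."""
    for x, n in ((x1, n1), (x2, n2)):
        if x < 5 or n - x < 5:
            raise ValueError("each sample needs at least 5 successes and 5 failures")
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # critical value z_(alpha/2)
    p1, p2 = x1 / n1, x2 / n2                      # sample proportions
    margin = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - margin, diff + margin

# Hypothetical counts: 60 of 100 in sample 1, 45 of 100 in sample 2
lower, upper = two_proportion_ci(60, 100, 45, 100)
```

If the resulting interval excludes 0, the data suggest the two population proportions differ.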

2-Sample T-Interval If two populations are (approximately) normally distributed and their variances are unknown, then an approximate 100(1 − α)% confidence interval for the difference of their means $\mu_1 - \mu_2$, using data from two independent samples of the respective populations, takes one of two forms, depending on whether the population variances are assumed equal.

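Equal Variances $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, where $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ is the pooled standard deviation and $t_{\alpha/2}$ is a critical t-value with $n_1 + n_2 - 2$ degrees of freedom.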

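Non-equal Variances $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, where $t_{\alpha/2}$ is a critical t-value with $r$ degrees of freedom, where $r = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$. If $r$ is not an integer, then round it down to the nearest whole number.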

Requirements Both samples are random and independent. Both populations are normally distributed or both sample sizes are greater than 30.
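The two quantities that distinguish the equal-variance and non-equal-variance intervals are the pooled standard deviation and the degrees of freedom $r$. The sketch below assumes the standard pooled formula and the usual Welch-Satterthwaite expression for $r$; the sample standard deviations and sizes are hypothetical.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation s_p, used with n1 + n2 - 2 degrees of freedom."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def unequal_variance_df(s1, n1, s2, n2):
    """Degrees of freedom r for the non-equal-variance interval, rounded down."""
    a = s1**2 / n1
    b = s2**2 / n2
    r = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return math.floor(r)

# Hypothetical summary statistics for two independent samples
s1, n1 = 2.0, 10
s2, n2 = 3.0, 12

sp = pooled_sd(s1, n1, s2, n2)             # equal-variance case
r = unequal_variance_df(s1, n1, s2, n2)    # non-equal-variance case
```

With these numbers the equal-variance interval would use 20 degrees of freedom and the non-equal-variance interval 19, so the two approaches give similar but not identical critical values.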

4.8 – Sample Size Sample size for estimating a population proportion: $n = \frac{z_{\alpha/2}^2\,\hat{p}(1 - \hat{p})}{E^2}$, where $\hat{p}$ is an estimate of the population proportion and $E$ is the desired margin of error.

Mean Sample size for estimating a population mean: $n = \frac{z_{\alpha/2}^2\,\sigma^2}{E^2}$, where $\sigma^2$ is an estimate of the population variance and $E$ is the desired margin of error.
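Sample-size calculations follow from solving the margin-of-error formula for $n$. The sketch below assumes $n = z_{\alpha/2}^2\,\hat{p}(1-\hat{p})/E^2$ for a proportion and $n = z_{\alpha/2}^2\,\sigma^2/E^2$ for a mean, and rounds up to the next whole number, which is a common convention rather than something stated in the slides. The inputs are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_proportion(p_est, margin, conf=0.95):
    """Smallest n with margin of error at most `margin` when estimating a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(z**2 * p_est * (1 - p_est) / margin**2)

def sample_size_mean(sigma2_est, margin, conf=0.95):
    """Smallest n with margin of error at most `margin` when estimating a mean."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(z**2 * sigma2_est / margin**2)

# Hypothetical inputs: p_est = 0.5 with E = 0.03; sigma = 15 (variance 225) with E = 2
n_prop = sample_size_proportion(p_est=0.5, margin=0.03)
n_mean = sample_size_mean(sigma2_est=15**2, margin=2)
```

Using $\hat{p} = 0.5$ gives the largest possible value of $\hat{p}(1-\hat{p})$, so it is a conservative choice when no prior estimate is available.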

4.9 – Assessing Normality Constructing a Normal Quantile Plot Arrange the data values in increasing order: $x_1 \le x_2 \le \ldots \le x_n$. For each $k = 1, 2, \ldots, n$, define a probability $p_k$. Calculate $z_k = \Phi^{-1}(p_k)$ for each $k$, where $\Phi$ is the standard normal c.d.f.

Normal Quantile Plot Plot the points $(x_k, z_k)$. If the points form a straight-line pattern, then conclude that the population appears to be normal. If the points do not form a straight-line pattern, or exhibit some other type of non-linear pattern, then conclude that the population does not appear to be normal.
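The points of a normal quantile plot can be computed as sketched below. The formula for $p_k$ is not preserved in this transcript, so the code assumes $p_k = (k - 0.5)/n$, which is one common choice and may differ from the one on the original slide. The nine temperatures are hypothetical and are not the Lincoln, NE data.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return the points (x_k, z_k) of a normal quantile plot."""
    xs = sorted(data)                       # x_1 <= x_2 <= ... <= x_n
    n = len(xs)
    std_normal = NormalDist()               # standard normal, c.d.f. Phi
    points = []
    for k, x in enumerate(xs, start=1):
        p_k = (k - 0.5) / n                 # ASSUMED plotting position, not from the slides
        z_k = std_normal.inv_cdf(p_k)       # z_k = Phi^{-1}(p_k)
        points.append((x, z_k))
    return points

temps = [38.2, 41.5, 35.9, 40.1, 37.4, 39.3, 42.8, 36.6, 39.9]  # hypothetical values
points = normal_quantile_points(temps)
```

The list `points` can be handed to any plotting tool; the judgment of whether the pattern is roughly a straight line is then made by eye.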

Example 4.9.2 The second row of the table below gives the average daily temperatures in the month of November for the city of Lincoln, NE for nine different years (data collected by Brandon Metcalf, 2009). Determine if the population of all such temperatures is normally distributed.

Example 4.9.2 The points form roughly a straight line, so the population appears to be normal.

Straight Line Calculate the sample mean and standard deviation of the data, $\bar{x}$ and $s$. For each $k$, calculate the following quantity: $y_k = \frac{x_k - \bar{x}}{s}$. Plot the points $(x_k, y_k)$ on the quantile plot and connect them with a straight line.

Fuzzy Central Limit Theorem If the population is influenced by many small, random, unrelated effects, then the population may be normally distributed.