Daniela Stan Raicu School of CTI, DePaul University

CSC 323
Quarter: Spring 02/03
12/8/2018 Daniela Stan - CSC323

Outline
Chapter 2: Looking at Data
- Cautions about regression and correlation (slides from previous lecture)
Chapter 3: Producing Data
- Common terms: population, individual, sampling frame, sample, sample survey, census
- Sampling design
- Toward statistical inference
- Experimental design

Least-Squares Regression (cont.)
How is the least-squares regression line calculated?
$\hat{y} = a + bx$, with slope $b = r \, s_y / s_x$ and intercept $a = \bar{y} - b\bar{x}$
where $\hat{y}$ = predicted value, $r$ = correlation, $s_x, s_y$ = standard deviations, and $\bar{x}, \bar{y}$ = means.
Problem 2.52/page 152
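A minimal Python sketch of this calculation, assuming the usual textbook form of the line (slope b = r * s_y / s_x, intercept a = mean of y minus b times mean of x). The function name and the five data points are made up for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x, using b = r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    # Correlation r: sum of products of standardized values, divided by n - 1.
    r = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Illustrative data (not from the textbook problem).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y_hat = {a:.2f} + {b:.2f} x")  # y_hat = 2.20 + 0.60 x
```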

Coefficient of Determination ($R^2$)
Measures the usefulness of regression prediction. $R^2$ (or $r^2$, the square of the correlation) measures how much of the variation in the values of the response variable (y) is explained by the regression line.
Examples:
- r = 1: $R^2$ = 1: the regression line explains/captures all (100%) of the variation in y
- r = 0.7: $R^2$ = 0.49: the regression line explains almost half (49%) of the variation in y
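To make "variation explained" concrete, the sketch below computes the quantity two ways on a small made-up data set: as the square of the correlation, and as 1 minus (variation left in the residuals) / (total variation in y). For a least-squares line the two agree. The function name and data are illustrative assumptions.

```python
def r_squared_two_ways(xs, ys):
    """Return (r**2, 1 - SSE/SST) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y (SST)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Variation in y that the line does NOT explain (sum of squared residuals).
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return r ** 2, 1 - sse / syy

from_r, from_residuals = r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(from_r, 3), round(from_residuals, 3))  # 0.6 0.6
```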

Accuracy of the predictions
One possible measure of the accuracy of the regression predictions is the root mean square error (r.m.s. error). The r.m.s. error is defined as the square root of the average of the squared residuals:
$\text{r.m.s. error} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
In large data sets, the r.m.s. error is approximately equal to $\sqrt{1 - r^2}\; s_y$.
Problem 2.59/page 169
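The sketch below computes the r.m.s. error directly from the residuals and compares it with the shortcut sqrt(1 - r^2) times the standard deviation of y. It assumes that this is the shortcut the slide refers to. With the divisor-n standard deviation (`pstdev`) the two agree exactly; with the usual divisor n - 1 they agree only approximately, which is why the shortcut is stated for large data sets. The data are made up.

```python
from math import sqrt
from statistics import mean, pstdev

def rms_error_two_ways(xs, ys):
    """Return (direct r.m.s. error, sqrt(1 - r^2) * SD of y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    direct = sqrt(mean([e ** 2 for e in residuals]))
    r = sxy / sqrt(sxx * syy)
    shortcut = sqrt(1 - r ** 2) * pstdev(ys)  # pstdev uses divisor n
    return direct, shortcut

direct, shortcut = rms_error_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(direct, 4), round(shortcut, 4))  # 0.6928 0.6928
```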

A Caution: Confounding factor
A confounding factor is a variable that has an important effect on the relationship among the variables in a study but is not included in the study.
Example: The mathematics department of a large university must plan the timetable for the following year. Data are collected on the enrollment year, the number x of first-year students, and the number y of students enrolled in elementary math courses.
The fitted regression line has equation $\hat{y} = \ldots\, x$, with $R^2 = 0.694$.

A Caution: Influential Point
An observation is influential for the regression line if removing it would considerably change the fitted line. An influential point pulls the regression line towards itself.
[Figure: scatterplot showing the fitted regression line, the regression line obtained if the influential point is omitted, and the influential point/outlier.]
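A small illustration of this effect: the sketch fits the least-squares line twice, once on all points and once with a single far-out point removed. The data are invented for the demonstration (five points lying close to y = x plus one point far to the right), and the helper name `fit_line` is hypothetical.

```python
def fit_line(points):
    """Least-squares intercept and slope for a list of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = sxy / sxx
    return y_bar - b * x_bar, b

points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
influential = (15, 2.0)  # far from the others in x, and off their trend

a_all, b_all = fit_line(points + [influential])
a_without, b_without = fit_line(points)
print(f"with the point:    slope = {b_all:.3f}")
print(f"without the point: slope = {b_without:.3f}")
```

With these numbers, one point is enough to flatten a slope of about 1 to a slope close to 0.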

A Caution: Beware of Extrapolation
Extrapolation is the use of a regression line for prediction outside the range of values of the explanatory variable x that was used to obtain the line. Such predictions are often not accurate. See the example in the previous lecture.
Next slide: Chapter 3: Producing Data

Population
The way that the data are collected is very important. The amount and quality of useful information in a data set depend directly on how the data were gathered.
The entire group of individuals that we want information about is called the population.

Census

Samples and Sample Surveys
A sample is the part of the population that we actually examine in order to gather information.

Sampling Design
How do we choose the sample?
Sample design: the method used to choose the sample. The sample should be representative of the entire population, that is, it should not be biased.
Sources of bias:
- voluntary response sample: consists of people who choose themselves by responding to a general appeal.
- convenience sampling: selecting the individuals who are easiest to reach, for example choosing a sample of shoppers at a mall; this also tends to give biased data sets.

Sampling Design
The simplest way of getting an unbiased sample is to use a simple random sample.
A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
Steps (similar to experimental randomization):
1. Label all the individuals in the population.
2. Use a table of random digits to select a sample of the desired size.
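In software, the same two steps can be sketched with Python's standard `random` module, whose `sample` method draws without replacement so that every set of n individuals is equally likely. The population of 30 labeled students, the function name, and the seed are illustrative assumptions.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw an SRS of size n (without replacement) from a labeled population."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, n)

# Step 1: label the individuals. Step 2: let the software pick the labels.
students = [f"student_{i:02d}" for i in range(1, 31)]
print(simple_random_sample(students, 5, seed=323))
```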

Table of Random Digits
Experimenters use software to carry out randomization. Without software, they can use a table of random digits (Table B in the textbook).
A table of random digits is a list of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 that has the following properties:
1. The digit in any position in the list has the same chance of being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
2. The digits in different positions are independent, in the sense that the value of one has no influence on the value of any other.

How to randomize?
Randomization requires two steps:
1. Assign labels to the individuals:
- all labels should have the same length
- use the shortest possible labels: one digit for 9 or fewer individuals, two digits for 10 to 100 individuals, and so on.
2. Use Table B to select labels at random:
- you can read digits from Table B in any order (along a row, down a column, and so on)
Example: Problem 3.40
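The table-reading procedure can be mimicked in a short sketch: read a string of digits in fixed-width groups, and skip groups that are not valid labels or that repeat a label already chosen. As an assumption for this example, individuals are labeled 0 to N - 1 (for instance 00 to 29 for 30 individuals), and the digit string is just an illustrative sequence, not a quotation from the table.

```python
def select_labels(digits, population_size, n):
    """Read equal-width labels from a digit string until n distinct valid labels are found."""
    width = len(str(population_size - 1))  # shortest width that fits labels 0..N-1
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        # Skip groups that match no individual, and labels already selected.
        if label < population_size and label not in chosen:
            chosen.append(label)
            if len(chosen) == n:
                return chosen
    raise ValueError("ran out of digits before the sample was complete")

# Groups read: 19, 22, 39 (skip), 50 (skip), 34 (skip), 05, ...
print(select_labels("19223950340575628713", population_size=30, n=3))  # [19, 22, 5]
```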

Toward Statistical Inference
Statistical inference uses a fact about a sample to estimate the truth about the whole population.
A parameter p is a number that describes the population; it is a fixed number whose value we don't know.
A statistic is a number that describes a sample. The value of a statistic is known once we have taken a sample, but it can change from sample to sample.
A statistic is often used to estimate a parameter.

How good is the statistic?
The value of a statistic will vary from one sample to another; sampling variability is the variation of the values of the statistic in repeated random sampling.
If the statistic varies too much from sample to sample, the result of any one sample cannot be trusted.
A statistical inference is trustworthy if the statistic shows little variability across repeated samples of the same size.

How does the statistic vary with repeated samples?
To understand the variability of the statistic:
1. Take a large number of samples (simulation can be used to obtain the samples).
2. Calculate the statistic for each sample.
3. Make a histogram of the values of the statistic.
4. Examine the distribution displayed in the histogram:
- shape
- center
- spread
- outliers

The distribution of a statistic
Example: Suppose that 60% of all American adult residents find clothes shopping time-consuming and frustrating. The true value of the parameter we want to estimate is p = 0.6.
Suppose that we don't know the true value of the parameter and we take different samples in order to estimate the value of p:
- we take 1000 simple random samples (SRS), each of size 100, and we estimate the value of p from each sample

Example (cont.): If we choose another 1000 samples, each of size 2500, the figure below shows the variation in the estimates of p:
[Figure: histogram of the estimates of p from 1000 samples of size 2500.]
The sampling distribution of a statistic is the distribution of values taken by the statistic in all samples of the same size from the same population.

Interpretation of the sampling distribution
Shape: approximately normal distribution.
Center: the values are centered at 0.6; since the true value of the parameter is 0.6, the estimator of p (obtained from repeated SRS) is said to be unbiased (the mean of the statistic's values is equal to the true value of the parameter).
Spread: the values of the estimator (statistic) from samples of size 2500 are much less spread out than those from samples of size 100; statistics from larger samples have smaller spreads (less variability).
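The example with p = 0.6 can be reproduced by simulation. The sketch below assumes a population large enough that each sampled individual can be treated as an independent yes/no draw with probability 0.6. It takes 1000 samples of each size and reports the center (mean) and spread (standard deviation) of the 1000 estimates. The function name and seed are illustrative.

```python
import random
from statistics import mean, stdev

def simulate_p_hat(p, n, repetitions, rng):
    """Return the sample proportion from each of `repetitions` samples of size n."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(repetitions)]

rng = random.Random(323)
for n in (100, 2500):
    estimates = simulate_p_hat(0.6, n, 1000, rng)
    print(f"n = {n:4d}: center = {mean(estimates):.3f}, spread = {stdev(estimates):.4f}")
```

Both centers should land near 0.6, while the spread for n = 2500 should come out at roughly one fifth of the spread for n = 100.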

Managing Bias and Variability
Simple random sampling produces unbiased estimates of the value of a population parameter; therefore, use random sampling to reduce bias.
To reduce the variability of a statistic from an SRS, use a larger sample.
The variability of a statistic from a random sample does not depend on the size of the population, as long as the population is at least 100 times larger than the sample.
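The claim about population size can be checked with a small simulation, sketched here under made-up settings: samples of size 100 are drawn without replacement from a population of 10,000 and from one of 1,000,000, each with 60% "yes" individuals, and the spread of the sample proportion is compared. The function name and seed are illustrative.

```python
import random
from statistics import stdev

def p_hat_spread(population_size, n, p, repetitions, rng):
    """SD of the sample proportion over repeated SRSs drawn without replacement."""
    cutoff = round(p * population_size)  # individuals 0..cutoff-1 answer "yes"
    estimates = []
    for _ in range(repetitions):
        sample = rng.sample(range(population_size), n)
        estimates.append(sum(i < cutoff for i in sample) / n)
    return stdev(estimates)

rng = random.Random(323)
for size in (10_000, 1_000_000):
    spread = p_hat_spread(size, 100, 0.6, 1000, rng)
    print(f"population {size:>9,}: spread = {spread:.4f}")
```

The two spreads should be nearly the same (around 0.05), even though one population is 100 times larger than the other.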

Example on bias and variability
Problem 3.62: Label each distribution relative to the others as:
- Low bias, high variability
- High bias, low variability
- Low bias, low variability
- High bias, high variability

Bias and Variability
True value of the parameter = bull's eye on a target.
Bias and variability describe what happens when an archer fires many arrows at the target.

Probability Sampling Plans
- Stratified random sampling
- Multistage sampling
Stratified samples (Example: Problem 3.47): It is important to sample important groups within the population separately, then combine these samples.
Steps of the stratified sample design:
1. Divide the population into groups of similar individuals, called strata. Examples:
- female versus male
- urban, suburban, and rural
2. Choose a separate SRS in each stratum.
3. Combine the SRSs to form the full sample.
A multistage sample design selects successively smaller groups from the population in stages, resulting in a sample consisting of clusters of individuals; each stage may employ an SRS, a stratified sample, or another type of sample.
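The stratified design can be sketched in a few lines: draw a separate SRS inside each stratum, then combine them. The strata, their sizes, and the per-stratum sample sizes below are invented for illustration.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw an SRS of sizes[name] from each stratum and combine the results."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))  # separate SRS per stratum
    return sample

# Hypothetical population of 100 individuals split into three strata.
strata = {
    "urban": [f"U{i}" for i in range(50)],
    "suburban": [f"S{i}" for i in range(30)],
    "rural": [f"R{i}" for i in range(20)],
}
sizes = {"urban": 5, "suburban": 3, "rural": 2}
print(stratified_sample(strata, sizes, seed=323))
```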

Observation versus Experiment
An observational study observes individuals and measures variables of interest but does not attempt to influence the responses.
An experiment deliberately imposes some treatment on individuals in order to observe their responses.
Terms:
- Experimental units: the individuals on which the experiment is done; they are called subjects when the experimental units are human beings.
- Factors: the explanatory variables.
- Treatment: a combination of a specific value (often called a level) of each of the factors.

Design of Experiments
The design of an experiment refers to the choice of treatments and the manner in which the experimental units or subjects are assigned to the treatments.
The principles of statistical design:
1. Control: comparison of several treatments in the same environment is the simplest form of control.
2. Randomization: uses chance to assign experimental units to treatment groups that are similar (except for chance variation).
- Randomization and comparison together prevent bias (systematic favoritism in experiments).
3. Replication: repeating the treatments on many units reduces the role of chance variation in the results.
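A minimal sketch of the randomization principle: shuffle the experimental units by chance, then deal them out to the treatment groups in turn, which gives groups of (nearly) equal size. The unit names, the treatment names, and the function name are hypothetical.

```python
import random

def randomize(units, treatments, seed=None):
    """Randomly assign units to treatments in groups of nearly equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # chance, not the experimenter, decides the order
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

units = [f"subject_{i:02d}" for i in range(1, 21)]
for treatment, members in randomize(units, ["treatment", "control"], seed=323).items():
    print(treatment, members)
```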

Block design
A second form of control is to form blocks of experimental units that are similar in some way that is important to the response.
In a block design, the random assignment of units to treatments is carried out separately within each block.
Block designs can have blocks of any size; blocks allow separate conclusions to be drawn about each block.
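A block design can be sketched by running the random assignment separately inside each block. The blocks (here, two hypothetical groups of subjects), the treatment names, and the function name are assumptions made for the example.

```python
import random

def block_randomize(blocks, treatments, seed=None):
    """Randomly assign treatments within each block; return {unit: (block, treatment)}."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # a separate randomization for every block
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment

blocks = {
    "women": [f"W{i}" for i in range(1, 7)],
    "men": [f"M{i}" for i in range(1, 5)],
}
for unit, (block, treatment) in block_randomize(blocks, ["A", "B"], seed=323).items():
    print(unit, block, treatment)
```

Because the shuffle happens within each block, every block ends up with a balanced split between the treatments, so the treatments can be compared inside each block.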

Matched pairs designs
Matched pairs are a common form of blocking for comparing just two treatments.
There are two types of matched pairs designs:
- Each subject receives both treatments in a random order.
- The subjects are matched in pairs as closely as possible, and one subject in each pair receives each treatment.
Reading Assignment: Chapter 3
