Presentation is loading. Please wait.

Presentation is loading. Please wait.

2-1. Data Collection Data Vocabulary Data Vocabulary Level of Measurement Level of Measurement Time Series and Cross-sectional Data Time Series and Cross-sectional.

Similar presentations


Presentation on theme: "2-1. Data Collection Data Vocabulary Data Vocabulary Level of Measurement Level of Measurement Time Series and Cross-sectional Data Time Series and Cross-sectional."— Presentation transcript:

1 2-1

2 Data Collection Data Vocabulary Data Vocabulary Level of Measurement Level of Measurement Time Series and Cross-sectional Data Time Series and Cross-sectional Data Sampling Concepts Sampling Concepts Sampling Methods Sampling Methods Data Sources Data Sources Survey Research Survey Research Chapter 22 McGraw-Hill/Irwin© 2008 The McGraw-Hill Companies, Inc. All rights reserved.

3 2-3 Data Vocabulary Data is the plural form of the Latin datum (a “given” fact).Data is the plural form of the Latin datum (a “given” fact). In scientific research, data arise from experiments whose results are recorded systematically.In scientific research, data arise from experiments whose results are recorded systematically. Important decisions may depend on data.Important decisions may depend on data. In business, data usually arise from accounting transactions or management processes.In business, data usually arise from accounting transactions or management processes.

4 2-4 Data Vocabulary  Subjects, Variables, Data Sets We will refer to Data as plural and data set as a particular collection of data as a whole.We will refer to Data as plural and data set as a particular collection of data as a whole. Observation – each data value.Observation – each data value. Subject (or individual) – an item for study (e.g., an employee in your company).Subject (or individual) – an item for study (e.g., an employee in your company). Variable – a characteristic about the subject or individual (e.g., employee’s income).Variable – a characteristic about the subject or individual (e.g., employee’s income).

5 2-5 Data Vocabulary  Subjects, Variables, Data Sets Three types of data sets:Three types of data sets: Data Set Variables Typical Tasks UnivariateOne Histograms, descriptive statistics, frequency tallies BivariateTwo Scatter plots, correlations, simple regression Multivariate More than two Multiple regression, data mining, econometric modeling

6 2-6 Data Vocabulary  Subjects, Variables, Data Sets Consider the multivariate data set with 5 variables 8 subjects 5 x 8 = 40 observations

7 2-7 Data Vocabulary  Data Types A data set may have a mixture of data types.A data set may have a mixture of data types. Types of Data Attribute (qualitative) Numerical (quantitative) Verbal Label X = economics (your major) Coded X = 3 (i.e., economics) Discrete X = 2 (your siblings) Continuous X = 3.15 (your GPA)

8 2-8 Data Data Vocabulary  Attribute Data Also called categorical, nominal or qualitative data.Also called categorical, nominal or qualitative data. Values are described by words rather than numbers.Values are described by words rather than numbers. For example, - Automobile style (e.g., X = full, midsize, compact, subcompact). - Mutual fund (e.g., X = load, no-load).For example, - Automobile style (e.g., X = full, midsize, compact, subcompact). - Mutual fund (e.g., X = load, no-load).

9 2-9 Data Vocabulary  Data Coding Coding refers to using numbers to represent categories to facilitate statistical analysis.Coding refers to using numbers to represent categories to facilitate statistical analysis. Coding an attribute as a number does not make the data numerical.Coding an attribute as a number does not make the data numerical. For example, 1 = Bachelor’s, 2 = Master’s, 3 = DoctorateFor example, 1 = Bachelor’s, 2 = Master’s, 3 = Doctorate Rankings may exist, for example, 1 = Liberal, 2 = Moderate, 3 = ConservativeRankings may exist, for example, 1 = Liberal, 2 = Moderate, 3 = Conservative

10 2-10 Data Vocabulary  Binary Data A binary variable has only two values, 1 = presence, 0 = absence of a characteristic of interest (codes themselves are arbitrary).A binary variable has only two values, 1 = presence, 0 = absence of a characteristic of interest (codes themselves are arbitrary). For example, 1 = employed, 0 = not employed 1 = married, 0 = not married 1 = male, 0 = female 1 = female, 0 = maleFor example, 1 = employed, 0 = not employed 1 = married, 0 = not married 1 = male, 0 = female 1 = female, 0 = male The coding itself has no numerical value so binary variables are attribute data.The coding itself has no numerical value so binary variables are attribute data.

11 2-11 Binary Data

12 2-12 Data Vocabulary  Numerical Data Numerical or quantitative data arise from counting or some kind of mathematical operation.Numerical or quantitative data arise from counting or some kind of mathematical operation. For example, - Number of auto insurance claims filed in March (e.g., X = 114 claims). - Ratio of profit to sales for last quarter (e.g., X = 0.0447).For example, - Number of auto insurance claims filed in March (e.g., X = 114 claims). - Ratio of profit to sales for last quarter (e.g., X = 0.0447). Can be broken down into two types – discrete or continuous data.Can be broken down into two types – discrete or continuous data.

13 2-13 Data Vocabulary  Discrete Data A numerical variable with a countable number of values that can be represented by an integer (no fractional values).A numerical variable with a countable number of values that can be represented by an integer (no fractional values). For example, - Number of Medicaid patients (e.g., X = 2). - Number of takeoffs at O’Hare (e.g., X = 37).For example, - Number of Medicaid patients (e.g., X = 2). - Number of takeoffs at O’Hare (e.g., X = 37).

14 2-14 Data Vocabulary  Continuous Data A numerical variable that can have any value within an interval (e.g., length, weight, time, sales, price/earnings ratios).A numerical variable that can have any value within an interval (e.g., length, weight, time, sales, price/earnings ratios). Any continuous interval contains infinitely many possible values (e.g., 426 < X < 428).Any continuous interval contains infinitely many possible values (e.g., 426 < X < 428).

15 2-15 Data Vocabulary  Rounding Ambiguity is introduced when continuous data are rounded to whole numbers.Ambiguity is introduced when continuous data are rounded to whole numbers. Underlying measurement scale is continuous.Underlying measurement scale is continuous. Precision of measurement depends on instrument.Precision of measurement depends on instrument. Sometimes discrete data are treated as continuous when the range is very large (e.g., SAT scores) and small differences (e.g., 604 or 605) aren’t of much importance.Sometimes discrete data are treated as continuous when the range is very large (e.g., SAT scores) and small differences (e.g., 604 or 605) aren’t of much importance.

16 2-16 Level of Measurement  Likert Scales A special case of interval data frequently used in survey research.A special case of interval data frequently used in survey research. The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7).The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7). “College-bound high school students should be required to study a foreign language.” (check one)  StronglyAgreeSomewhatAgree Neither Agree Nor Disagree SomewhatDisagreeStronglyDisagree

17 2-17 Level of Measurement  Likert Scales A neutral midpoint (“Neither Agree Nor Disagree”) is allowed if an odd number of scale points is used or omitted to force the respondent to “lean” one way or the other.A neutral midpoint (“Neither Agree Nor Disagree”) is allowed if an odd number of scale points is used or omitted to force the respondent to “lean” one way or the other. Likert data are coded numerically (e.g., 1 to 5) but any equally spaced values will work.Likert data are coded numerically (e.g., 1 to 5) but any equally spaced values will work. Likert coding: 1 to 5 scale Likert coding: -2 to +2 scale 5 = Help a lot 4 = Help a little 3 = No effect 2 = Hurt a little 1 = Hurt a lot +2 = Help a lot +1 = Help a little 0 = No effect 0 = No effect  1 = Hurt a little  2 = Hurt a lot

18 2-18 Level of Measurement  Likert Scales Careful choice of verbal anchors results in measurable intervals (e.g., the distance from 1 to 2 is “the same” as the interval, say, from 3 to 4).Careful choice of verbal anchors results in measurable intervals (e.g., the distance from 1 to 2 is “the same” as the interval, say, from 3 to 4). Ratios are not meaningful (e.g., here 4 is not twice 2).Ratios are not meaningful (e.g., here 4 is not twice 2). Many statistical calculations can be performed (e.g., averages, correlations, etc.).Many statistical calculations can be performed (e.g., averages, correlations, etc.).

19 2-19 Level of Measurement  Likert Scales More variants of Likert scales:More variants of Likert scales: How would you rate your marketing instructor? (check one)  Terrible  Poor  Adequate  Good  Excellent How would you rate your marketing instructor? (check one) Very Bad  Very Good

20 2-20 Time Series and Cross-sectional Data  Time Series Data Each observation in the sample represents a different equally spaced point in time (e.g., years, months, days).Each observation in the sample represents a different equally spaced point in time (e.g., years, months, days). Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc.Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc. We are interested in trends and patterns over time (e.g., annual growth in consumer debit card use from 1999 to 2006).We are interested in trends and patterns over time (e.g., annual growth in consumer debit card use from 1999 to 2006).

21 2-21 Sampling Concepts  Sample or Census? A sample involves looking only at some items selected from the population.A sample involves looking only at some items selected from the population. A census is an examination of all items in a defined population.A census is an examination of all items in a defined population. - Mobility - Illegal immigrants - Budget constraints - Incomplete responses or nonresponses Why can’t the United States Census survey every person in the population?Why can’t the United States Census survey every person in the population?

22 2-22 Sampling Concepts Situations Where A Sample May Be Preferred: Infinite Population No census is possible if the population is infinite or of indefinite size (an assembly line can keep producing bolts, a doctor can keep seeing more patients). Destructive Testing The act of sampling may destroy or devalue the item (measuring battery life, testing auto crashworthiness, or testing aircraft turbofan engine life). Timely Results Sampling may yield more timely results than a census (checking wheat samples for moisture and protein content, checking peanut butter for aflatoxin contamination).

23 2-23 Sampling Concepts Situations Where A Sample May Be Preferred: Accuracy Sample estimates can be more accurate than a census. Instead of spreading limited resources thinly to attempt a census, our budget of time and money might be better spent to hire experienced staff, improve training of field interviewers, and improve data safeguards. Cost Even if it is feasible to take a census, the cost, either in time or money, may exceed our budget. Sensitive Information Some kinds of information are better captured by a well-designed sample, rather than attempting a census. Confidentiality may also be improved in a carefully-done sample.

24 2-24 Sampling Concepts Situations Where A Census May Be Preferred Small Population If the population is small, there is little reason to sample, for the effort of data collection may be only a small part of the total cost. Large Sample Size If the required sample size approaches the population size, we might as well go ahead and take a census. Legal Requirements Banks must count all the cash in bank teller drawers at the end of each business day. The U.S. Congress forbade sampling in the 2000 decennial population census. Database Exists If the data are on disk we can examine 100% of the cases. But auditing or validating data against physical records may raise the cost.

25 2-25 Sampling Concepts  Parameters and Statistics Statistics are computed from a sample of n items, chosen from a population of N items.Statistics are computed from a sample of n items, chosen from a population of N items. Statistics can be used as estimates of parameters found in the population.Statistics can be used as estimates of parameters found in the population. Symbols are used to represent population parameters and sample statistics.Symbols are used to represent population parameters and sample statistics.

26 2-26 Sampling Concepts  Parameters and Statistics Statistic Any measurement computed from a sample. Usually, the statistic is regarded as an estimate of a population parameter. Sample statistics are often (but not always) represented by Roman letters. Parameter or Statistic? Parameter Any measurement that describes an entire population. Usually, the parameter value is unknown since we rarely can observe the entire population. Parameters are often (but not always) represented by Greek letters.

27 2-27 Sampling Concepts  Parameters and Statistics The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative.The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative. The target population is the population we are interested in (e.g., U.S. gasoline prices).The target population is the population we are interested in (e.g., U.S. gasoline prices).  Target Population The sampling frame is the group from which we take the sample (e.g., 115,000 stations).The sampling frame is the group from which we take the sample (e.g., 115,000 stations). The frame should not differ from the target population.The frame should not differ from the target population.

28 2-28 Nn  Finite or Infinite? A population is finite if it has a definite size, even if its size is unknown.A population is finite if it has a definite size, even if its size is unknown. A population is infinite if it is of arbitrarily large size.A population is infinite if it is of arbitrarily large size. Rule of Thumb: A population may be treated as infinite when N is at least 20 times n (i.e., when N/n > 20)Rule of Thumb: A population may be treated as infinite when N is at least 20 times n (i.e., when N/n > 20) Sampling Concepts Here, N/n > 20

29 2-29 Sampling Methods Probability Samples Simple Random Sample Use random numbers to select items from a list (e.g., VISA cardholders). Systematic Sample Select every kth item from a list or sequence (e.g., restaurant customers). Stratified Sample Select randomly within defined strata (e.g., by age, occupation, gender). Cluster Sample Like stratified sampling except strata are geographical areas (e.g., zip codes).

30 2-30 Sampling Methods Nonprobability Samples Judgment Sample Use expert knowledge to choose “typical” items (e.g., which employees to interview). Convenience Sample Use a sample that happens to be available (e.g., ask co-worker opinions at lunch).

31 2-31 Sampling Methods  Simple Random Sample Every item in the population of N items has the same chance of being chosen in the sample of n items.Every item in the population of N items has the same chance of being chosen in the sample of n items. We rely on random numbers to select a name.We rely on random numbers to select a name. =RANDBETWEEN(1,48)

32 2-32 Sampling Methods  Random Number Tables A table of random digits used to select random numbers between 1 and N.A table of random digits used to select random numbers between 1 and N. Each digit 0 through 9 is equally likely to be chosen.Each digit 0 through 9 is equally likely to be chosen.  Setting Up a Rule For example, NilCo wants to award cash prizes to 10 of its 875 loyal customers.For example, NilCo wants to award cash prizes to 10 of its 875 loyal customers. To get 10 three-digit numbers between 001 and 875, we define any consistent rule for moving through the random number table.To get 10 three-digit numbers between 001 and 875, we define any consistent rule for moving through the random number table.

33 2-33 Sampling Methods  Setting Up a Rule Randomly point at the table to choose a starting point.Randomly point at the table to choose a starting point. Choose the first three digits of the selected five- digit block, move to the right one column, down one row, and repeat.Choose the first three digits of the selected five- digit block, move to the right one column, down one row, and repeat. When we reach the end of a line, wrap around to the other side of the table and continue.When we reach the end of a line, wrap around to the other side of the table and continue. Discard any number greater than 875 and any duplicates.Discard any number greater than 875 and any duplicates.

34 2-34 82134144586671654269319284624103052002603236725783 07139168297676811913424349196192934182291559502566 45056439393118843272113329949419348970769560528010 10244190935167863463855687003482811232614879463984 12940844345008720189580096697205764104213687564964 84438458284035328925119115350224640968809316668409 98681678717173564113901393346665312906557544430845 43290967531879949713392271595546167638530363319990 96893854108823322094306057902401791388398553194576 75403412270019216814470541681481349922640102829071 78064921115154176563690276771806499719381735412680 26246717469401993165967130331675912862091208157817 98766673129635821351864483182886113788686724306763 37895510551192944443159957293599631181908587731309 27988811635221225102617982867001358603547401518556 19216530084449819262121969394790162763371264626838 28078867296943824235352084895753529762974174154735 34455613639371168038759601632795716669642863465015 53510904127043845932578157514452472618174156242084 30658188948820897867307379498518235021783972866398 Table of 1,000 Random Digits Table of 1,000 Random Digits Start Here

35 2-35 Sampling Methods  With or Without Replacement If we allow duplicates when sampling, then we are sampling with replacement.If we allow duplicates when sampling, then we are sampling with replacement. Duplicates are unlikely when n is much smaller than N.Duplicates are unlikely when n is much smaller than N. If we do not allow duplicates when sampling, then we are sampling without replacement.If we do not allow duplicates when sampling, then we are sampling without replacement.

36 2-36 Sampling Methods  Systematic Sampling For example, starting at item 2, we sample every k = 4 items to obtain a sample of n = 20 items from a list of N = 78 items.For example, starting at item 2, we sample every k = 4 items to obtain a sample of n = 20 items from a list of N = 78 items. Note that N/n = 78/20  4.Note that N/n = 78/20  4. Sample by choosing every kth item from a list, starting from a randomly chosen entry on the list.Sample by choosing every kth item from a list, starting from a randomly chosen entry on the list.

37 2-37 Sampling Methods  Systematic Sampling A systematic sample of n items from a population of N items requires that periodicity k be approximately N/n.A systematic sample of n items from a population of N items requires that periodicity k be approximately N/n. Systematic sampling should yield acceptable results unless patterns in the population happen to recur at periodicity k.Systematic sampling should yield acceptable results unless patterns in the population happen to recur at periodicity k. Can be used with unlistable or infinite populations.Can be used with unlistable or infinite populations. Systematic samples are well-suited to linearly organized physical populations.

38 2-38 Sampling Methods  Systematic Sampling For example, out of 501 companies, we want to obtain a sample of 25. What should the periodicity k be?For example, out of 501 companies, we want to obtain a sample of 25. What should the periodicity k be? k = N/n = 501/25  20. So, we should choose every 20 th company from a random starting point.So, we should choose every 20 th company from a random starting point.

39 2-39 Sampling Methods  Stratified Sampling Utilizes prior information about the population. Applicable when the population can be divided into relatively homogeneous subgroups of known size (strata). A simple random sample of the desired size is taken within each stratum. For example, from a population containing 55% males and 45% females, randomly sample 120 males and 80 females (n = 200).

40 2-40 Sampling Methods  Stratified Sampling Or, take a random sample of the entire population and then combine individual strata estimates using appropriate weights.Or, take a random sample of the entire population and then combine individual strata estimates using appropriate weights. For a population with L strata, the population size N is the sum of the stratum sizes: N = N 1 + N 2 +... + N LFor a population with L strata, the population size N is the sum of the stratum sizes: N = N 1 + N 2 +... + N L The weight assigned to stratum j is w j = N j / nThe weight assigned to stratum j is w j = N j / n For example, take a random sample of n = 200 and then weight the responses for males by w M =.55 and for females by w F =.45.For example, take a random sample of n = 200 and then weight the responses for males by w M =.55 and for females by w F =.45.

41 2-41 Sampling Methods  Cluster Sample Strata consist of geographical regions.Strata consist of geographical regions. One-stage cluster sampling – sample consists of all elements in each of k randomly chosen subregions (clusters).One-stage cluster sampling – sample consists of all elements in each of k randomly chosen subregions (clusters). Two-stage cluster sampling, first choose k subregions (clusters), then choose a random sample of elements within each cluster.Two-stage cluster sampling, first choose k subregions (clusters), then choose a random sample of elements within each cluster.

42 2-42 Sampling Methods  Cluster Sample Cluster sampling is useful when - Population frame and stratum characteristics are not readily available - It is too expensive to obtain a simple or stratified sample - The cost of obtaining data increases sharply with distance - Some loss of reliability is acceptableCluster sampling is useful when - Population frame and stratum characteristics are not readily available - It is too expensive to obtain a simple or stratified sample - The cost of obtaining data increases sharply with distance - Some loss of reliability is acceptable

43 2-43 Sampling Methods  Judgment Sample A nonprobability sampling method that relies on the expertise of the sampler to choose items that are representative of the population.A nonprobability sampling method that relies on the expertise of the sampler to choose items that are representative of the population. Can be affected by subconscious bias (i.e., nonrandomness in the choice).Can be affected by subconscious bias (i.e., nonrandomness in the choice). Quota sampling is a special kind of judgment sampling, in which the interviewer chooses a certain number of people in each category.Quota sampling is a special kind of judgment sampling, in which the interviewer chooses a certain number of people in each category.

44 2-44 Sampling Methods  Convenience Sample Take advantage of whatever sample is available at that moment. A quick way to sample.Take advantage of whatever sample is available at that moment. A quick way to sample. Sample size depends on the inherent variability of the quantity being measured and on the desired precision of the estimate.Sample size depends on the inherent variability of the quantity being measured and on the desired precision of the estimate.  Sample Size


Download ppt "2-1. Data Collection Data Vocabulary Data Vocabulary Level of Measurement Level of Measurement Time Series and Cross-sectional Data Time Series and Cross-sectional."

Similar presentations


Ads by Google