Chapter 2: Describing Data: Graphs and Tables
Basic Concepts
Frequency Tables and Histograms
Bar and Pie Charts
Scatter Plots
Time Series Plots
Some information adapted from: Levine, Berenson and Stephan’s Statistics for Managers
Alok Srivastava
Basic Concepts in Data Analysis
Data, Information, and Knowledge
Populations and Samples
Variables and Observations
Types of Data: Categorical and Numerical
Types of Data: Cross-Sectional and Time Ordered
Data, Information, and Knowledge
Data are the building blocks of information. They are observations on entities (observation units). Variables are used to measure observations.
Information is processed data (organized, summarized, analyzed, and filtered) that has been made meaningful and relevant to the situation or phenomenon being understood.
Knowledge is the ability to apply information to decision situations. Meaning associated with information is knowledge: actionable information!
[Diagram labels: Processing, Analysis, Reports; Application, Meaning, Relevance]
Populations and Samples
Sample: a subset of the collection of all possible entities (observation units). Data on the sample are what is available (KNOWN). Statistics are used to describe samples. These can vary across samples.
Population: the collection of all possible entities (observation units). Data on the whole population are usually not available (UNKNOWN). Parameters are used to describe populations. These are constants for a population.
Statistical Inference
Statistical inference is the art and science of drawing inferences (conclusions) about a population of interest. It is the process by which characteristics (aspects) of a population come to be understood (known). Conclusions about the population are drawn (inferred) based on the knowledge gained from the sample.
A sample should be a good representation of the population.
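The contrast between a constant parameter and a varying statistic can be illustrated with a small simulation. This is only a sketch: the population of ages, the sample size of 50, and the random seed are all assumptions made up for the illustration, not data from the slides.

```python
import random
import statistics

# Hypothetical population: 10,000 ages, generated only for illustration.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.randint(18, 80) for _ in range(10_000)]

# Parameter: a constant for this population (usually unknown in practice).
population_mean = statistics.mean(population)

# Statistics: computed from samples, so they vary from sample to sample.
sample_means = [statistics.mean(rng.sample(population, 50)) for _ in range(5)]

print("population mean (parameter):", round(population_mean, 2))
print("sample means (statistics):  ", [round(m, 2) for m in sample_means])
```

Each sample mean is an estimate of the same population mean. In a real study only one sample is usually available, which is why the sample needs to represent the population well.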
Variables and Observations

Entity     Height (inches)   Weight (pounds)   Age (years)   Sex (category)
Person 1   67                170               33            Male
Person 2   61                120               38            Female
Person 3   72                220               62
*

Rows are OBSERVATIONS; columns are measurements (variables).

Variables are characteristics (aspects) of entities that differ from one entity to another. Observations on an entity are the values of these characteristics that have been measured. So, a dataset is a collection of observations on a group (sample) of entities.
Each row is an observation on a particular entity.
Each column is an aspect or characteristic of individual entities (measured as a variable).
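The row-and-column view of a dataset can be sketched in code as a list of records. The key names below are hypothetical, and the pairing of values with persons assumes the slide lists each column's values in Person 1, 2, 3 order.

```python
# Each dict is one observation (row); each key is a variable (column).
dataset = [
    {"entity": "Person 1", "height_in": 67, "weight_lb": 170, "age_yr": 33, "sex": "Male"},
    {"entity": "Person 2", "height_in": 61, "weight_lb": 120, "age_yr": 38, "sex": "Female"},
    # The slide does not show a sex value for Person 3, so it is left as None.
    {"entity": "Person 3", "height_in": 72, "weight_lb": 220, "age_yr": 62, "sex": None},
]

# Reading across a row gives the observation on one entity.
person_2 = dataset[1]

# Reading down a column gives every measured value of one variable.
heights = [row["height_in"] for row in dataset]

print(person_2)
print(heights)
```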
Types of Data: Categorical and Numerical
Numerical: We can do arithmetic on numerical data (age and salary). These data are actual measurements.
Categorical: Categorical data are qualitative.
Sometimes qualitative data are coded. For example, opinion can be coded 1-5 and arithmetic (calculations) can be performed. Such data are ordinal (they have an implied order).
State is a categorical variable and cannot be used for calculations. Such data are nominal.
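A short sketch of how the three kinds of variable might be handled in code. The responses and the 1-5 coding scheme are invented for illustration; only the variable types (age, opinion, state) come from the slide.

```python
from collections import Counter
import statistics

# Hypothetical survey responses, invented for illustration only.
ages = [33, 38, 62, 45]                                        # numerical
opinions = ["agree", "neutral", "strongly agree", "disagree"]  # ordinal
states = ["OH", "FL", "OH", "TX"]                              # nominal

# Assumed 1-5 coding scheme for the ordinal opinion scale.
OPINION_CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}
coded_opinions = [OPINION_CODES[o] for o in opinions]

mean_age = statistics.mean(ages)                # arithmetic on measurements
mean_opinion = statistics.mean(coded_opinions)  # arithmetic on the codes
state_counts = Counter(states)                  # nominal data: count, do not average

print(mean_age, mean_opinion, state_counts.most_common(1))
```

For the nominal variable the useful summary is a count per category, which is what a frequency table, bar chart, or pie chart displays.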
Types of Data: Cross-Sectional and Time Ordered
Questions
What was the absenteeism at Plant 1 in Jan. 1998?
Was the annual absenteeism the same for all plants?
Was absenteeism stable at Plant 1 during 1998?
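The three questions can be read as three different ways of slicing one table of plants by months. The sketch below uses invented absenteeism figures (the slide's data table is not available here), and the names `absenteeism` and `MONTHS` are hypothetical.

```python
# Hypothetical monthly absenteeism (days lost) for 1998; values are invented.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
absenteeism = {
    "Plant 1": [12, 10, 11, 9, 8, 7, 9, 10, 11, 12, 13, 14],
    "Plant 2": [15, 14, 16, 15, 13, 12, 14, 15, 16, 17, 15, 16],
}

# Single observation: one plant in one month.
plant1_jan = absenteeism["Plant 1"][MONTHS.index("Jan")]

# Cross-sectional view: compare plants over the same period (annual totals).
annual_totals = {plant: sum(values) for plant, values in absenteeism.items()}

# Time-ordered view: one plant followed month by month.
plant1_series = list(zip(MONTHS, absenteeism["Plant 1"]))

print(plant1_jan)
print(annual_totals)
print(plant1_series)
```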
Frequency Tables
A frequency table showing a classification of the AGE of attendees at an event.

Class              Frequency   Relative Frequency   Percentage
10 but under 20    3           .15                  15
20 but under 30    6           .30                  30
30 but under 40    5           .25                  25
40 but under 50    4           .20                  20
50 but under 60    2           .10                  10
Total              20          1                    100

Class is a range for the values of a variable.
Frequency is the number of observations associated with a class.
Relative Frequency is the proportion of observations (frequency) associated with a class.
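The relative frequency and percentage columns follow directly from the frequency column: divide each frequency by the total, then multiply by 100. A minimal sketch using the frequencies from the table (the variable names are illustrative):

```python
# Class labels and frequencies taken from the AGE frequency table.
classes = ["10 but under 20", "20 but under 30", "30 but under 40",
           "40 but under 50", "50 but under 60"]
frequencies = [3, 6, 5, 4, 2]

total = sum(frequencies)
relative = [f / total for f in frequencies]   # proportion of observations
percentages = [100 * r for r in relative]     # relative frequency x 100

for label, f, r, p in zip(classes, frequencies, relative, percentages):
    print(f"{label:<16} {f:>3} {r:>6.2f} {p:>5.0f}")
print(f"{'Total':<16} {total:>3} {sum(relative):>6.2f} {sum(percentages):>5.0f}")
```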
Frequency Histograms
A graphical display of the distribution of frequencies.
Developing Frequency Tables and Histograms
Sort Raw Data in Ascending Order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find Range: 58 - 12 = 46
Select Number of Classes: 5 (usually between 5 and 15)
Compute Class Interval (width): 10 (range/classes = 46/5 = 9.2, then round up)
Determine Class Boundaries (lower limits): 10, 20, 30, 40, 50 (the last class ends at 60)
Compute Class Midpoints: 15, 25, 35, 45, 55
Count Observations & Assign to Classes
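The steps above can be sketched in code, ending with a rough text histogram. The raw data and the choice of 5 classes come from the slide. The rule for picking the first lower limit (the multiple of the width at or below the minimum) is an assumption that happens to reproduce the slide's boundaries.

```python
import math

# Raw ages from the slide.
ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

data = sorted(ages)                           # sort ascending
data_range = data[-1] - data[0]               # 58 - 12 = 46
num_classes = 5                               # chosen by the analyst
width = math.ceil(data_range / num_classes)   # 9.2 rounded up to 10

# Assumption: start at the multiple of the width at or below the minimum.
start = (data[0] // width) * width            # 12 -> 10
lower_limits = [start + i * width for i in range(num_classes)]
midpoints = [lo + width / 2 for lo in lower_limits]

# Each class is "lo but under lo + width", so the upper limit is excluded.
counts = [sum(lo <= x < lo + width for x in data) for lo in lower_limits]

for lo, mid, count in zip(lower_limits, midpoints, counts):
    print(f"{lo} but under {lo + width} (midpoint {mid:g}): {'*' * count}")
```

Simply rounding 9.2 up to the next integer gives 10 here. With other data an analyst would typically round up to whatever convenient width makes the class limits easy to read.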
Displaying Categorical Data: Bar and Pie Charts

Investment Category   Amount (in thousands $)   Percentage
Stocks                46.5                      42.27
Bonds                 32                        29.09
CD                    15.5                      14.09
Savings               16                        14.55
Total                 110                       100

[Pie chart: Stocks 42%, Bonds 29%, Savings 15%, CD 14%]
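The percentage column, and therefore the size of each pie slice or bar, is each amount divided by the total. A minimal sketch with the investment amounts from the table, plus a rough text bar chart (the scale of one `#` per 2% is an arbitrary choice for display):

```python
# Amounts in thousands of dollars, from the investment table.
amounts = {"Stocks": 46.5, "Bonds": 32, "CD": 15.5, "Savings": 16}

total = sum(amounts.values())
percentages = {name: round(100 * value / total, 2)
               for name, value in amounts.items()}

# Text bar chart: one '#' per 2 percentage points (display choice only).
for name, pct in sorted(percentages.items(), key=lambda item: -item[1]):
    print(f"{name:<8} {pct:>6.2f}% {'#' * round(pct / 2)}")
```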
Displaying Categorical Bivariate Data: Contingency Tables and Side-by-Side Charts
[Figure: side-by-side chart]
Scatter Plot for Bivariate Numerical Data
Shows the relationship between two variables. Can one be used to predict the other?
Time-series and regression analysis are used to predict one variable’s value based on the other.
Correlation analysis is used to measure the strength of the linear relationship between two variables.
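As a sketch of what correlation analysis computes, the function below implements the standard Pearson correlation coefficient. The slides do not name a specific coefficient, so treating "correlation" as Pearson's r is an assumption. The example data are the heights and weights from the Variables and Observations slide, assuming the values are listed in Person 1, 2, 3 order.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: strength of the linear relationship, from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights = [67, 61, 72]      # inches
weights = [170, 120, 220]   # pounds
r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 indicates that the points in the scatter plot lie near a straight line. With only three observations the result is illustrative at best.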
Chapter Summary