Paf 203 Data Analysis and Modeling for Public Affairs Lecture on Statistics
Dedication of the Sanders and Smidt book: “To those who open this book with dismay.” Quote in a university student calendar: “If I had only one day left to live, I would live it in my statistics class… it would seem so much longer.” “It’s not the figures themselves, it’s what you do with them that matters.”
Learning points in our study of statistics: What is statistics? What is descriptive versus inferential statistics? What is a mean, median, mode? What is a standard deviation? Why is frequency distribution important in the data presentation and analysis? How do we present data in a histogram, pie chart, pictogram, bar charts? What is the difference between a population and a sample? What is a parameter? A statistic? When do we use a t-statistic versus a chi-square? Why do we need to know about regression analysis?
What is statistics? Statistics is the science of designing studies, gathering data, and then classifying, summarizing, interpreting, and presenting these data to explain and support the decisions that are reached. A population is the complete collection of measurements, objects, or individuals under study. A sample is a portion or subset taken from the population. A parameter is a number that describes a population characteristic. A statistic is a number that describes a sample characteristic.
What is descriptive versus inferential statistics? Descriptive statistics includes the procedures for collecting, classifying, summarizing, and presenting data. Charts, tables, and summary measures such as averages are used to describe the basic structure of the study subject. Inferential statistics is the process of arriving at a conclusion about a population parameter (which is usually an unknown quantity) on the basis of information obtained from a sample statistic (a known value).
Why do we want to know about statistics? You need a knowledge of statistics to help you describe and understand numerical relationships and make better decisions.
Example: Describing Relationships between Variables A college admissions officer needs to find an effective way of selecting student applicants. He/she designs a statistical study to see if there is a significant relationship between UPCAT scores and the grade point average achieved by freshmen at the school. If there is a strong relationship, high UPCAT scores will become an important criterion for acceptance. A public health official decides to see if there is any connection between inhaling the smoke produced by cigarette smokers and the incidence of asthma in young children. She applies statistical techniques to large amounts of data and reaches conclusions that will affect the health of large numbers of people.
Example: Aiding in Decision Making A personnel manager has noted that job applicants who score high on a manual dexterity test later tend to perform well in assembling a product, while those with low test scores tend to be less productive. By applying the statistical technique known as regression analysis, the manager can forecast how productive a new applicant will be on the job on the basis of how well he or she performs on the test.
Statistical Problem-Solving Methodology Identifying the problem or opportunity. Deciding the method of data collection. Collecting the data. Classifying and summarizing the data. Presenting and analyzing the data. Making the decision.
Descriptive Statistics The following array of data characterizes the ISPPS staff at the UPLB for the year 2004. Let’s use this data series to learn about descriptive statistics.
To present the data in a pictogram, we use symbols to represent a unit of measurement for each classification that we want to show. For example, to present a pictogram of the classification of ISPPS employees (faculty, REPS, administrative staff), we use the following: Pictogram of employee classification of ISPPS staff: REPS, Faculty, Admin. Each picture represents 2 persons.
Pictogram of Educational attainment of ISPPS staff PhD MS BS HS Each picture represents one person
We can also present data in terms of bar graphs. There are two types of bar graphs: the vertical and the horizontal bar graphs. Vertical bar graphs
Horizontal bar graphs
We can also use a pie chart to present our data. To derive the figures to be used in the pie chart, we first get the proportion of each class to the total, and then draw the pie chart, as follows:
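The proportion step can be sketched in a few lines of Python. The head counts below are hypothetical, chosen only for illustration (they are not the ISPPS figures), and the function name `pie_shares` is likewise an assumption. Each class's share of the total is converted to a slice angle out of 360 degrees.

```python
# Hypothetical head counts, for illustration only (not the ISPPS data).
counts = {"Faculty": 10, "REPS": 6, "Admin": 4}


def pie_shares(counts):
    """Return, per class, its proportion of the total and its slice angle in degrees."""
    total = sum(counts.values())
    return {label: (n / total, 360 * n / total) for label, n in counts.items()}


shares = pie_shares(counts)
for label, (proportion, angle) in shares.items():
    print(f"{label}: {proportion:.0%} of the total, {angle:.0f}-degree slice")
```

The slice angles always add up to 360 degrees, which is a quick check that the proportions were computed against the right total.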
Measures of Central Tendency: Central tendency means, in layman’s terms, an average. But there are several ways of computing this average. The three most common are the mean, the median, and the mode.
Mean The mean is the sum of the scores divided by the number of items. For example, if we have an array as follows: X: 0,5,3,9,8 The arithmetic mean would be: (0+5+3+9+8)/5 = 5. The formula for this is: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ where n = number of items in the sample and $X_i$ = the i-th value (here, age), i = 1…n. Mean age of ISPPS staff: 47
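The arithmetic for the example array can be checked with a minimal Python sketch. The function name `mean` is chosen here for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


x = [0, 5, 3, 9, 8]
print(mean(x))  # 5.0
```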
Median The median is the point that divides the array such that 50% of the cases fall below it and 50% fall above it. Example: Given the array 0,5,3,9,8, the median is the middle value of the array after the numbers have been ordered from lowest to highest or highest to lowest: 0,3,5,8,9. The median of this distribution is 5.
(cont.) Median If the array has an even number of values, such as 0,3,5,8,9,12, there is no single middle value, and the median is the average of the two middlemost values: 0,3,[5,8],9,12. Hence the median is the average of 5 and 8, which is 6.5. Now, what is the median age of the staff of the ISPPS?
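Both median cases, an odd and an even number of values, can be sketched in Python as follows. The function name `median` is an illustrative choice. Sorting comes first, because the middle position only has meaning in an ordered array.

```python
def median(values):
    """Middle value of the ordered array; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([0, 5, 3, 9, 8]))      # 5
print(median([0, 3, 5, 8, 9, 12]))  # 6.5
```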
Mode The mode is the most frequent value in the distribution. Since it is the most frequent value, it dispenses with the idea of a point of balance. In the following array, what is the mode? 1,2,5,1,3,5,1,9,1. The mode is 1 because it is the most frequent value in the distribution.
(cont.) Mode In the following array, what is the mode? 1,2,5,1,3,5,1,9,2,5,2. The modes are 1, 2, and 5 because each of these values appears three times, more often than any other value. A distribution can have more than one mode. What is the mode of the age distribution of the ISPPS staff?
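Tallying how often each value occurs makes the mode easy to verify. The sketch below is one way to do it in Python; the function name `modes` is an illustrative choice. It returns every value that ties for the highest count.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 5, 1, 3, 5, 1, 9, 1]))        # [1]
print(modes([1, 2, 5, 1, 3, 5, 1, 9, 2, 5, 2]))  # [1, 2, 5]
```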
Showing variability: It may also be useful to show, apart from typicality, the variability within a group. This is done by computing one or more measures of variability, also called measures of spread or measures of dispersion. There are different measures of variability, some of which are the following: the range, the average deviation, and the standard deviation.
Range The range is the difference between the highest value and the lowest value in the array. Again, given the array 0,3,5,9,8, the range is 9 (9-0). The more common way of expressing the range is to cite the lowest and the highest values. In the above example, the range would be R=[0,9].
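Both ways of reporting the range can be sketched in Python. The function name `value_range` is chosen for illustration. It returns the single difference and the (lowest, highest) pair.

```python
def value_range(values):
    """Return the range as a difference and as a (lowest, highest) pair."""
    low, high = min(values), max(values)
    return high - low, (low, high)


print(value_range([0, 3, 5, 9, 8]))  # (9, (0, 9))
```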
Average Deviation or the Mean Absolute Deviation The average deviation gives you a sense of how far the individual values are from the mean. It is not a commonly used measure for showing variability but will facilitate our learning of the very important measure, the standard deviation. It is the average of the numeric differences of each item from the mean, without regard to the algebraic sign. It is represented by the following formula: $AD = \frac{1}{n}\sum_{i=1}^{n} |X_i - \bar{X}|$, where $\bar{X}$ is the mean and n is the number of items.
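As a worked illustration, the average deviation of the earlier example array 0,5,3,9,8 (mean 5) can be computed with a short Python sketch. The function name `mean_absolute_deviation` is chosen for illustration. The absolute deviations are 5, 0, 2, 4, and 3, which sum to 14, so the average is 14/5.

```python
def mean_absolute_deviation(values):
    """Average distance of each value from the mean, ignoring the sign."""
    center = sum(values) / len(values)
    return sum(abs(v - center) for v in values) / len(values)


print(mean_absolute_deviation([0, 5, 3, 9, 8]))  # 2.8
```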
Standard deviation The most common measure of dispersion is the standard deviation. The standard deviation is the square root of the average of the squared deviations from the mean: $s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}$, where $\bar{X}$ is the mean and n is the number of items. Standard deviation of age: 8.14 Standard deviation of income: 6,174
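The definition can be followed step by step in Python. This sketch takes the definition as worded in the slide, dividing by n. Many textbooks and software packages divide by n - 1 when the data are a sample, so results may differ slightly from other tools. The function name `standard_deviation` is an illustrative choice.

```python
import math


def standard_deviation(values):
    """Square root of the average squared deviation from the mean (divides by n)."""
    n = len(values)
    center = sum(values) / n
    average_squared_deviation = sum((v - center) ** 2 for v in values) / n
    return math.sqrt(average_squared_deviation)


sd = standard_deviation([0, 5, 3, 9, 8])
print(f"{sd:.2f}")  # 3.29
```

For the array 0,5,3,9,8 the squared deviations are 25, 0, 4, 16, and 9; their average is 10.8, and its square root is about 3.29.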
Frequency Distribution We can summarize data using an interval scale. Decisions need to be made on how many categories will be used and where to establish cut-off points. There are no simple rules for doing this. A lot of the decision will depend on the purposes to be served by the classification. There are some guidelines that can be followed in constructing frequency distributions. If the data is given in whole numbers, then the end limits or what we call the class limits should be in whole numbers. If these are given to one decimal point, then the class limits should be to one decimal point. In other words, our class limits should follow the number of decimal points that the data follow.
(cont.) Frequency Distribution The size or the width of the interval should be some convenient number. Convenient numbers would be like 1, 5, 10, 20, 25, 50, 100. The class limits should also be convenient numbers. It makes no sense to have a class limit like 8.4-13.8. Avoid intervals so narrow that some categories have zero observations. As much as possible, use equal-size intervals. As much as possible, use closed intervals. You may use open intervals only when closed intervals would result in class frequencies of zero.
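These guidelines can be illustrated with a minimal Python sketch that tallies whole-number data into closed, equal-width classes. The ages below are hypothetical (they are not the ISPPS data), and the function name `frequency_distribution` and its parameters `start` and `width` are assumptions made for the example.

```python
def frequency_distribution(values, start, width):
    """Count whole-number values in closed, equal-width classes (e.g. 20-29, 30-39)."""
    counts = {}
    for v in values:
        # Lower class limit of the interval that contains v.
        lower = start + ((v - start) // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    # Label each class by its lower and upper limits, in ascending order.
    return {f"{lower}-{lower + width - 1}": counts[lower] for lower in sorted(counts)}


ages = [23, 27, 31, 35, 38, 42, 44, 47, 47, 51, 58]  # hypothetical values
table = frequency_distribution(ages, start=20, width=10)
for label, frequency in table.items():
    print(label, frequency)
```

A width of 10 and a starting limit of 20 are convenient numbers, and the class limits are whole numbers because the data are whole numbers. Classes with no observations do not appear in the result.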
Table 2. Frequency distribution of monthly incomes of ISPPS staff, 2004, College, Laguna
Histogram In a histogram, a bar can be used to represent each category. The height of the bar indicates its size. If the scale is nominal, the actual ordering of bars will not matter. For ordinal or interval scales, the bars are to be arranged in their proper order, giving a good visual indication of the frequency distribution.
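A rough text rendering shows the idea without any plotting library. The frequencies below are hypothetical, and the function name `text_histogram` is an illustrative choice. The bars run horizontally here, so the length of each bar, rather than its height, indicates the class frequency.

```python
def text_histogram(frequencies):
    """Draw one row of '#' marks per class; the bar length equals the frequency."""
    lines = []
    for label, count in frequencies.items():
        lines.append(f"{label:>7} | {'#' * count} ({count})")
    return "\n".join(lines)


# Hypothetical class frequencies, listed in their proper (ascending) order.
frequencies = {"20-29": 2, "30-39": 3, "40-49": 4, "50-59": 2}
print(text_histogram(frequencies))
```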
(cont.) Histogram Table 3. Frequency distribution of age of ISPPS staff, 2004, College, Laguna
(cont.) Histogram