Objectives (IPS chapter 1.1) Displaying distributions with graphs Labels/Variables Two types of variables Ways to chart categorical data Bar graphs Pie charts Ways to chart quantitative data Line graphs: time plots Scales matter Histograms Stemplots Stemplots versus histograms Interpreting histograms
Variables In a study, we collect information—data—from individuals. Individuals can be people, animals, plants, or any object of interest. A variable is a characteristic that varies among individuals in a population or in a sample (a subset of a population). Example: age, height, blood pressure, ethnicity, leaf length, first language The distribution of a variable tells us what values the variable takes and how often it takes these values.
Two types of variables Variables can be either quantitative (or numerical)… Something that can be counted or measured for each individual and then added, subtracted, averaged, etc. across individuals in the population. Example: How tall you are, your age, your blood cholesterol level, the number of credit cards you own … or categorical. Something that falls into one of several categories. What can be counted is the count or proportion of individuals in each category. Example: Your blood type (A, B, AB, O), your hair color, your ethnicity, whether you paid income tax last tax year or not
Cases, Labels, Variables, and Values Cases: are the objects described by a set of data. Cases may be customers, companies, subjects in a study, or other subjects. A label: a special variable used in some data sets to distinguish different cases. A variable is a characteristic of a case. Different cases can have different values for the variables. Example:(1) The cases are the individual students; (2) The first three (Student identification number, last name, first name) are labels. (3) Gender is a categorical variable; (4) Test 1 to Final are numerical variables.
Eg: How do you know if a variable is categorical or quantitative? Ask: What are the n individuals/units in the sample (of size “n”)? What is being recorded about those n individuals/units? Is that a number ( quantitative) or a statement ( categorical)? Categorical Each individual is assigned to one of several categories. Quantitative Each individual is attributed a numerical value. Label Each individual is assigned to one label. Individuals in sample DIAGNOSIS AGE AT DEATH Patient A Heart disease 56 Patient B Stroke 70 Patient C 75 Patient D Lung cancer 60 Patient E 80 Patient F Accident 73 Patient G Diabetes 69 HW 1.14(b), 1.16
Summary of variables Definition: quantitative data are discrete if the possible values are isolated points on the number line. quantitative data are continuous if the set of possible values forms an entire interval on the number line. Question: Are peoples’ heights continuous? What about ages? What about family size? (1) For categorical data: i) Frequency / Relative Frequency distribution; ii) Bar graph iii) Pie charts. (2) For numerical data: i) Frequency distribution; ii) Stemplot (stem-and-leaf plot); iii) Histogram. (3) For bivariate numerical data: Scatter plot
Ways to chart categorical data Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.) Bar graphs Each category is represented by a bar. Pie charts Peculiarity: The slices must represent the parts of one whole.
Example 1: For frequency distribution, the most important definition is relative frequency, which is defined by Example 1: For seven students with gender: M, F, M, M, F, M, F. (M=male; F=Female). Q1: What is frequency distribution? Q2: Find the relative frequency for Example 1. Q3: Make a bar graph and pie chart for Example 1. HW 1.21(a)
Example 2: For the following frequency distribution: Q1: Find the relative frequency. (Eg: in the format of 12.56%) Q2: Make a bar graph for Example 2. Q3: Make a pie chart for Example 2. Grade Freq Freshman 12 Sophomore 25 Junior 16 Senior 10 Others 1 Total 64 Relative Freq 12/64=18.75% 25/64=39.06% 16/64=25% 10/64=15.63% 1/64= 1.56% 1=100% HW 1.21(a)
Example 2: Students percentage (continue)
Example: Top 10 causes of death in the United States 2006 Rank Causes of death Counts % of top 10s % of total deaths 1 Heart disease 631,636 34% 26% 2 Cancer 559,888 30% 23% 3 Cerebrovascular 137,119 7% 6% 4 Chronic respiratory 124,583 5% 5 Accidents 121,599 6 Diabetes mellitus 72,449 4% 3% 7 Alzheimer’s disease 72,432 8 Flu and pneumonia 56,326 2% 9 Kidney disorders 45,344 10 Septicemia 34,234 1% All other causes 570,654 24% For each individual who died in the United States in 2006, we record what was the cause of death. The table above is a summary of that information.
Bar graphs Each category is represented by one bar. The bar’s height shows the count (or sometimes the percentage) for that particular category. Top 10 causes of deaths in the United States 2006 The number of individuals who died of an accident in 2006 is approximately 121,000. HW 1.21(b, c)
Pie charts Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents. Percent of people dying from top 10 causes of death in the United States in 2006 Another way to graphically illustrate the same categorical data is using a Pie Chart. Here is listed in order, and can see relative proportions as pieces of pie. Notice here that we have changed from the numbers of people dying to the percent of people dying To make a pie chart, typically use percentages, and they have to add up to one, or you won’t have the whole pie. ? HW 1.22(b)
Make sure your labels match the data. Make sure all percents add up to 100. Percent of deaths from top 10 causes The top pie chart is the one we have just been looking at. In the bottom one I have added deaths from all other causes - 24% in addition to the top 10. Adding this additional category changes the percentages on the original 10, so, for instance Heart disease was 34% of total before, now is a smaller percent, 26%, because we are looking at All deaths. Percent of deaths from all causes
Child poverty before and after government intervention—UNICEF, 2005 What does this chart tell you? The United States and Mexico have the highest rate of child poverty among OECD (Organization for Economic Cooperation and Development) nations (22% and 28% of under 18). Their governments do the least—through taxes and subsidies—to remedy the problem (size of orange bars and percent difference between orange/blue bars). Could you transform this bar graph to fit in 1 pie chart? In two pie charts? Why? The poverty line is defined as 50% of national median income.
R program for Students percentage example ## The route of your data files route<-"//bearsrv/classrooms/Math/cchen/STT215/LAB_Fall2011/" ## Read in the dataset dat=read.csv(paste(route,'In_Class_exercises1.csv',sep=''), header=TRUE) ## To plot pie graph pie(dat$Percent, labels=dat$Students) ## To plot pie graph with a title and make it colorful pie(dat$Percent, labels=dat$Students, main='Student percentage', col = rainbow(5)) ## To create a barplot barplot(dat$Percent, names.arg= dat$Students, main='Student percentage')
Overview: Ways to chart quantitative data Stemplots Also called a stem-and-leaf plot. Each observation is represented by a stem, consisting of all digits except the final one, which is the leaf. Histograms A histogram breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class. Line graphs: time plots A time plot of a variable plots each observation against the time at which it was measured.
Stem plots: Example 1 How to make a stemplot: Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit. Eg: how to find the stem and leaf for: 25, 135, 6. Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column. Write each leaf in the row to the right of its stem, in increasing order out from the stem. Dataset: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70. Q: Make a stem-leaf-plot for this dataset.
Stem plots: Example 2 How to make a stemplot: Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit. Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column. Write each leaf in the row to the right of its stem, in increasing order out from the stem. Dataset: 1, 3, 3, 12, 15, 17, 21, 25, 45, 49, 62, 67, 69. Q: Make a stem-leaf-plot for this dataset.
Stem and leaf Notes: To compare two related distributions, a back-to-back stem plot with common stems is useful. Stem-and-leaf plot works best for small numbers of observations that are all greater than 0. But it does not work well for large datasets. Stem-and-leaf plot display the actual values of the observations.
Stem and leaf Notes: Eg: Make a stemplot for the data: 115, 143, 162, 198, 267, 279, 302. Trim and also split stems. That means: trimming numbers means dropping the last digit. Eg: Original data 141, by dropping the last digit, it gives 14. Original data 255, by dropping the last digit, it gives 25. Splitting stems. Eg: your dataset are: 1, 7, 10, 11, 12, 13, 15, 16, 17, 18, 19, 21, 22, 23, 25, 26, 28, 29. Then by “splitting stem”, it gives: 0 | 1 0 | 7 1 | 0 1 2 3 1 | 5 6 7 8 9 2 | 1 2 3 2 | 5 6 8 9 Here “splitting stem” says, if you dataset is of median size, then even when some numbers share same stem, we separate them into two parts with same stem, one part with 0-4, and another part with 5-9.
Histogram A histogram breaks the range of values of a variable into classes and displays only the count or percentage of the observations that fall into each class. You can choose any convenient number of classes, but you should always choose classes of equal width. Table 1.3 Introduction to the Practice of Statistics, Sixth Edition © 2009 W.H. Freeman and Company
Steps to draw a histogram Step 1: Divide the range of the data into classes of equal width. (Be sure to specify the classes precisely so that each individual falls into exactly one class.) EX: IQ Scores 75<= IQ Scores < 85; 85<= IQ Scores < 95; 95<= IQ Scores < 105; 105<= IQ Scores < 115; 115<= IQ Scores < 125; 125<= IQ Scores < 135; 135<= IQ Scores < 145; 145<= IQ Scores < 155; Step 2: Count the number of individual in each class. The counts are called frequencies, and a table of frequencies for all class is a frequency table. Step 3: Draw the histogram. First, on the horizontal axis mark the scale for the variable whose distribution you are displaying. The vertical axis contains the scale of counts. Each bar represents a class. The base covers the class. The bar height is the class count. classes [75, 85) [85, 95) [95, 105) [105,115) [115,125) [125,135) [135,145) [145,155) counts 2 3 10 16 13 5 1
Q: How many percent of those chose fifth-grade students have IQ scores of 105 or less?
In Summary… Q: How many percent of those chose fifth-grade students have IQ scores of 105 or less? Important property of a density curve is that areas under the curve correspond to relative frequencies
Stemplots versus histograms Stemplots are quick and dirty histograms that can easily be done by hand, and therefore are very convenient for back of the envelope calculations. However, they are rarely found in scientific or laymen publications.
Stemplots versus histograms Examining distributions of Quantitative Variables Once a graph of the variable is made, we can begin to understand its distribution by looking at the following: look at the overall pattern in the graph and for striking deviations from that overall pattern. Peaks? Gaps? Symmetric? Skewed? describe the overall pattern of the distribution by talking about its shape, center, and spread (or variation). look for possible outliers in the distribution; i.e., those values of the variable that seem to fall outside the overall pattern you see. These features will be important for all types of graphs…
Contiunue 1. one or several peaks (called modes)? If unique major peak, then call unimodal. 2. Center/Midpoint: the value with roughly half the observations taking smaller values and half taking larger values. 3. symmetric or skewed to the right / left? Skewed to the right if the right tail (larger value) is much longer than the left tail (smaller value). 4. Outlier: a point that is clearly apart from the body of the data, not just the most extreme observation in a distribution. Complex, multimodal distribution Symmetric distribution Skewed distribution
Outliers An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them. The overall pattern is fairly symmetrical except for 2 states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population. A large gap in the distribution is typically a sign of an outlier. Imagine you are doing a study of health care in the 50 US states, and need to know how they differ in terms of their elderly population. This is a histogram of the number of states grouped by the percentage of their residents that are 65 or over. You can see there is one very small number and one very large number, with a gap between them and the rest of the distribution. Values that fall outside of the overall pattern are called outliers. They might be interesting, they might be mistakes - I get those in my data from typos in entering RNA sequence data into the computer. They might only indicate that you need more samples. Will be paying a lot of attention to them throughout class both for what we can learn about biology and also because they can cause trouble with your statistics. Guess which states they are (florida and alaska). Alaska Florida
Symmetric Skewed to the right Skewed to the left Outlier Double peaks Q: 1.37(p 28) HWQ: 1.25, 1.26, 1.27, 1.42
How to create a histogram It is an iterative process – try and try again. What bin size should you use? Not too many bins with either 0 or 1 counts Not overly summarized that you lose all the information Not so detailed that it is no longer summary rule of thumb: start with 5 to 10 bins Look at the distribution and refine your bins (There isn’t a unique or “perfect” solution)
Same data set Not summarized enough Too summarized
Scales matter How you stretch the axes and choose your scales can give a different impression. A picture is worth a thousand words, BUT There is nothing like hard numbers. Look at the scales.
Histogram of dry days in 1995 IMPORTANT NOTE: Your data are the way they are. Do not try to force them into a particular shape. It is a common misconception that if you have a large enough data set, the data will eventually turn out nice and symmetrical.
Line graphs: time plots In a time plot, time always goes on the horizontal, x axis. We describe time series by looking for an overall pattern and for striking deviations from that pattern. In a time series: A trend is a rise or fall that persist over time, despite small irregularities. A pattern that repeats itself at regular intervals of time is called seasonal variation.
Retail price of fresh oranges over time Time is on the horizontal, x axis. The variable of interest—here “retail price of fresh oranges”— goes on the vertical, y axis. This time plot shows a regular pattern of yearly variations. These are seasonal variations in fresh orange pricing most likely due to similar seasonal variations in the production of fresh oranges. There is also an overall upward trend in pricing over time. It could simply be reflecting inflation trends or a more fundamental change in this industry.
A time plot can be used to compare two or more data sets covering the same time period. The pattern over time for the number of flu diagnoses closely resembles that for the number of deaths from the flu, indicating that about 8% to 10% of the people diagnosed that year died shortly afterward from complications of the flu.
How to run StatCrunch Go to http://www.statcrunch.com/ and log in with your user name and password. From tool bar, click on MyStatCrunch: Then click on My Groups: Then click on “join a group” In the search box on the left panel, search for: STT215 Then you are able to find Now select and join our UNCW-STT215 group.
How to run StatCrunch Searchengines.xls (Categorical variable with summary of %). Ruthmcgwire.xls (Numerical variable) Beer.xls (Complex dataset)
How to run StatCrunch After you make a graph, now click on the “Options” icon. If you select “COPY”, then a new window Will pop up. Right-click the image to copy it. Now use “Ctr+Alt+v” to past the image. You can choose the bitmap format. Otherwise you can save the image and cut and paste later on.