Exploring Data Chapter 1 Displaying distributions with graphs Describing distributions with Numbers
Different types of graphs Categorical Data: ◦Use a Bar Graph Quantitative Data: ◦Dot plots ◦Stem plots ◦Histograms
Things to remember! Always, always, always plot your data! Don’t forget your SOCS: ◦S – Shape ◦O – Outliers ◦C – Center ◦S – Spread
Bar Graphs: used to plot categorical data. The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall in each category. Example 1 The radio audience rating service Arbitron places the country’s 13,838 radio stations into categories that describe the kind of programs they broadcast. Here is the distribution of station formats.
Format | Count of stations | Percent of stations
Adult Contemporary | |
Adult Standards | |
Contemporary hit | |
Country | |
News/Talk/Information | |
Oldies | |
Religious | |
Rock | |
Spanish language | |
Other format | |
Total | |
Dot Plots Use for quantitative data Small amounts of data
Example 2 The accompanying data on gender and birth weight (kg) of foals born to 15 thoroughbred mares appeared in the article “Suckling Behavior Does Not Measure Milk Intake in Horses” (Animal Behaviour (1999): ). Construct a dot plot of the birth weights by gender. Gender: F M M F F M F F M F M F M F F Weight:
Stemplots: Use with quantitative data Gives a quick picture of the shape of a distribution Shows symmetry, gaps, clusters, outliers Use for small data sets
The accompanying observations are maximum flow rates for 34 different shower heads evaluated in a Consumer Reports article (July 1990). Construct two stem plots (one without splitting and one with split stems) and describe the most prominent features of the displays
Back to Back Stem Plot Literacy rates in Islamic nations
Country | Female Percent | Male Percent
Algeria | 60 | 78
Bangladesh | 31 | 50
Egypt | 46 | 68
Iran | 71 | 85
Jordan | 86 | 96
Kazakhstan | 99 | 100
Lebanon | 82 | 95
Libya | 71 | 92
Malaysia | 85 | 92
Morocco | 38 | 68
Saudi Arabia | 70 | 84
Syria | 63 | 89
Tajikistan | 99 | 100
Tunisia | 63 | 83
Turkey | 78 | 94
Uzbekistan | 99 | 100
Yemen | 29 | 70
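A back-to-back stem plot for these literacy rates can be sketched in a few lines of Python. This is only an illustration, not the method used on the slide: the function names `leaves_by_stem` and `back_to_back_stemplot` are made up here, and the choice of tens digits as stems and ones digits as leaves is an assumption that suits two- and three-digit percentages.

```python
from collections import defaultdict

# Literacy rates (percent) transcribed from the table above, in the same country order.
female = [60, 31, 46, 71, 86, 99, 82, 71, 85, 38, 70, 63, 99, 63, 78, 99, 29]
male = [78, 50, 68, 85, 96, 100, 95, 92, 92, 68, 84, 89, 100, 83, 94, 100, 70]


def leaves_by_stem(values):
    """Group values by stem (value // 10); leaves are the ones digits in increasing order."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return groups


def back_to_back_stemplot(left, right):
    """Return the text lines of a back-to-back stem plot with one shared column of stems."""
    left_groups = leaves_by_stem(left)
    right_groups = leaves_by_stem(right)
    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    width = max(len(leaves) for leaves in left_groups.values())
    lines = []
    for stem in range(lowest, highest + 1):
        # Left-hand leaves are reversed so they increase outward from the stem.
        left_leaves = "".join(str(d) for d in reversed(left_groups.get(stem, [])))
        right_leaves = "".join(str(d) for d in right_groups.get(stem, []))
        lines.append(f"{left_leaves:>{width}} | {stem:>2} | {right_leaves}")
    return lines


# Female leaves on the left, male leaves on the right.
plot_lines = back_to_back_stemplot(female, male)
print("\n".join(plot_lines))
```

Every stem between the smallest and largest value is printed, even when it has no leaves, so that gaps in either distribution stay visible.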
Virginia Colleges | Tuition and fees ($)
Averett | 18430
Bluefield | 10615
Christendom | 14420
Christopher Newport | 12626
DeVry | 12710
Eastern Mennonite | 18220
Emory and Henry | 16690
Ferrum | 16870
George Mason | 15816
Hampton | 14996
Hampden-Sydney | 22944
Hollins | 21675
Liberty | 13150
Longwood | 12901
Lynchburg | 22885
Mary Baldwin | 19991
Marymount | 17090
Norfolk State | 14837
Old Dominion | 14688
Patrick Henry | 14645
Randolph-Macon | 22625
Randolph-Macon Women’s |
Richmond | 34850
Roanoke | 22109
Saint Paul’s | 9420
Shenandoah | 19240
Sweet Briar | 21080
University of Virginia | 22831
University of Virginia-Wise | 14152
Virginia Commonwealth | 17262
Virginia Intermont | 15200
Virginia Military Institute | 19991
Virginia State | 11462
Virginia Tech | 16530
Virginia Union | 12260
Washington and Lee | 25760
William and Mary | 21796
Histograms ◦Used for large sets of data ◦Break the range of values of a variable into classes and display only the count or percent of the observations that fall into each class ◦Divide the range of data into equal-width classes ◦Count the observations in each class (the ‘frequency’) ◦Draw bars to represent the classes: height = frequency ◦Bars should touch (unlike bar graphs)
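The counting step behind a histogram can be sketched as follows, using the Virginia tuition figures from the earlier slide (the one college with no listed value is left out). The class width of $5,000, the starting point of $5,000, and the helper name `class_counts` are assumptions chosen for illustration; other equal-width classes would work just as well.

```python
def class_counts(values, start, width, n_classes):
    """Count observations in equal-width classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = (v - start) // width
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} falls outside the chosen classes")
        counts[k] += 1
    return counts


# Tuition and fees ($) for the 36 colleges that have a listed value.
tuition = [
    18430, 10615, 14420, 12626, 12710, 18220, 16690, 16870, 15816, 14996,
    22944, 21675, 13150, 12901, 22885, 19991, 17090, 14837, 14688, 14645,
    22625, 34850, 22109, 9420, 19240, 21080, 22831, 14152, 17262, 15200,
    19991, 11462, 16530, 12260, 25760, 21796,
]

START, WIDTH, N_CLASSES = 5000, 5000, 6  # assumed class boundaries
counts = class_counts(tuition, START, WIDTH, N_CLASSES)

# Text version of the histogram: one row per class, bar length = frequency.
for k, count in enumerate(counts):
    low = START + k * WIDTH
    print(f"{low:>6}-{low + WIDTH - 1:<6} {'#' * count} ({count})")
```

Each observation lands in exactly one class because every class includes its lower boundary and excludes its upper one.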
You have probably heard that the distribution of scores on IQ tests follows a bell-shaped pattern. Let’s look at some actual IQ scores. Here are the scores of 60 fifth-grade students chosen at random from one school.
Distributions Look for the overall pattern and for striking deviations from that pattern. Describe the overall pattern by its shape, center, spread, and outliers. Outlier: an individual value that falls outside the overall pattern.
SHAPE Does the distribution have one major peak (unimodal) or more than one? Is the distribution approximately symmetric, or is it skewed in one direction? ◦Symmetric ◦Skewed right ◦Skewed left
Outliers Look for points that are clearly apart from the body of the data, not just the most extreme observations in a distribution. We will discuss a test used to identify outliers in the next section. You should look for an explanation for any outlier; sometimes an outlier is an error in recording the data. It is not a good idea to just delete or ignore outliers.
Relative Frequency: Histograms do a good job displaying the distribution of values of a quantitative variable, but they say little about the relative standing of an individual observation. To get that information, construct a relative cumulative frequency graph. Let’s look at the U.S. presidents example…
President | Age
Washington | 57
J. Adams | 61
Jefferson | 57
Madison | 57
Monroe | 58
J.Q. Adams | 57
Jackson | 61
Van Buren | 54
W.H. Harrison | 68
Tyler | 51
Polk | 49
Taylor | 64
Fillmore | 50
Pierce | 48
Buchanan | 65
Lincoln | 52
A. Johnson | 56
Grant | 46
Hayes | 54
Garfield | 49
Arthur | 51
Cleveland | 47
B. Harrison | 55
Cleveland | 55
McKinley | 54
T. Roosevelt | 42
Taft | 51
Wilson | 56
Harding | 55
Coolidge | 51
Hoover | 54
F.D. Roosevelt | 51
Truman | 60
Eisenhower | 61
Kennedy | 43
L.B. Johnson | 55
Nixon | 56
Ford | 61
Carter | 52
Reagan | 69
G.H.W. Bush | 64
Clinton | 46
G.W. Bush | 54
1. Decide on class intervals and make a frequency table. Add three columns: relative frequency, cumulative frequency, and relative cumulative frequency.
Class | Frequency | Relative frequency | Cumulative frequency | Relative cumulative frequency
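The table in step 1 can be filled in with a short Python sketch using the presidents' ages from the previous slide. The class intervals 40-44, 45-49, ..., 65-69 are an assumption made here for illustration; the slide leaves the choice of intervals open.

```python
from itertools import accumulate

# Ages from the presidents table, in the order listed.
ages = [
    57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65,
    52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51,
    54, 51, 60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54,
]

FIRST_LOWER = 40  # assumed lower limit of the first class
CLASS_WIDTH = 5   # assumed class width
N_CLASSES = 6     # classes 40-44 through 65-69

# Frequency: how many ages fall in each class.
frequencies = [0] * N_CLASSES
for age in ages:
    frequencies[(age - FIRST_LOWER) // CLASS_WIDTH] += 1

n = len(ages)
relative = [f / n for f in frequencies]                # share of all observations
cumulative = list(accumulate(frequencies))             # running total of frequencies
relative_cumulative = [c / n for c in cumulative]      # running total as a share

print("Class  Freq  Rel.freq  Cum.freq  Rel.cum.freq")
for k in range(N_CLASSES):
    low = FIRST_LOWER + k * CLASS_WIDTH
    print(f"{low}-{low + CLASS_WIDTH - 1}  {frequencies[k]:>4}  "
          f"{relative[k]:>8.1%}  {cumulative[k]:>8}  {relative_cumulative[k]:>12.1%}")
```

Plotting `relative_cumulative` against the upper limit of each class gives the relative cumulative frequency graph, from which the relative standing of any single age can be read off.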
Describing Distributions with Numbers
Two-seater Cars
Model | City | Highway
Acura NSX | 17 | 24
Audi TT Roadster | 20 | 28
BMW Z4 Roadster | 20 | 28
Cadillac XLR | 17 | 25
Chevrolet Corvette | 18 | 25
Dodge Viper | 12 | 20
Ferrari 360 Modena | 11 | 16
Ferrari Maranello | 10 | 16
Ford Thunderbird | 17 | 23
Honda Insight | 60 | 66
Lamborghini Gallardo | 9 | 15
Lamborghini Murcielago | 9 | 13
Lotus Esprit | 15 | 22
Maserati Spyder | 12 | 17
Mazda Miata | 22 | 28
Mercedes-Benz SL | |
Mercedes-Benz SL | |
Nissan 350Z | 20 | 26
Porsche Boxster | 20 | 29
Porsche Carrera | |
Toyota MR2 | 26 | 32
Minicompact Cars
Model | City | Highway
Aston Martin Vanquish | 12 | 19
Audi TT Coupe | 21 | 29
BMW 325 CI | 19 | 27
BMW 330 CI | 19 | 28
BMW M3 | 16 | 23
Jaguar XK8 | 18 | 26
Jaguar XKR | 16 | 23
Lexus SC | |
Mini Cooper | 25 | 32
Mitsubishi Eclipse | 23 | 31
Mitsubishi Spyder | 20 | 29
Porsche Cabriolet | 18 | 26
Porsche Turbo | |
Construct a Stem Plot—This will help you describe the shape! In order to interpret measures of center and spread you will need to think about the shape of the distribution.
Mean and Median ◦Mean: the average value ◦Median: the middle value
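In symbols, the two definitions can be written as below. The notation is an assumption, not something the slides use: the n observations are called x_1, ..., x_n, and x_(k) stands for the k-th smallest observation once the data are put in order.

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
M =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt]
\frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{if } n \text{ is even}
\end{cases}
```

With an even number of observations there is no single middle value, so the median is taken as the average of the two center observations.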
Looking at the data, are there any outliers? What happens to the mean if we remove the outlier? One weakness of the mean as a measure of center is that it is not resistant to outliers. The median is resistant to outliers.
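A quick way to see the difference is to compute both measures with and without the outlier. The sketch below uses the highway values of the two-seater cars that have a listed value (18 of the 21 models) and treats the Honda Insight's 66 as the outlier; that choice is a judgment made here from the table, not a result of a formal outlier test.

```python
from statistics import mean, median

# Highway values for the two-seater cars with a listed value, in table order.
highway = [24, 28, 28, 25, 25, 20, 16, 16, 23, 66, 15, 13, 22, 17, 28, 26, 29, 32]

# Drop the Honda Insight (66), treated here as the outlier.
without_outlier = [x for x in highway if x != 66]

mean_all, median_all = mean(highway), median(highway)
mean_trim, median_trim = mean(without_outlier), median(without_outlier)

print(f"With outlier:    mean = {mean_all:.2f}, median = {median_all}")
print(f"Without outlier: mean = {mean_trim:.2f}, median = {median_trim}")
print(f"Change in mean:   {mean_all - mean_trim:.2f}")
print(f"Change in median: {median_all - median_trim}")
```

Removing one car moves the mean by about 2.4 but the median by only 0.5, which is what "resistant" means in practice.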
Mean versus Median The mean and the median are the two most common measures of center. The mean and median of a symmetric distribution are close together. In a skewed distribution, the mean is farther out in the ‘tail’ than is the median.