Data Visualization Seminar NCDC, April 27 2011 Todd Pierce Module 5 Types of Graphs.



Best Practices Time Series (sources: Colin Ware and Stephen Kosslyn)

Time Series Graphs Most graphics show values changing over time – time gives us a context for understanding data – a random sample of 4000 newspaper graphics found that 75% of them had time series – time series are usually shown best by line graphs, but sometimes other graphs work better

Time Series Graphs Patterns – Trend: overall tendency of values to increase, decrease, or stay stable during a time period; trend lines can show this (but see later caveats) – Variability: average degree of change from one point in time to the next in a time period; but be careful, if the y scale is narrow or does not start at zero, variability may be overstated – Rate of change: percent difference between one value and the next; rates of change may be increasing faster than the raw data values would indicate

Time Series Graphs Patterns – Co-variation: changes in one time series are reflected as changes in another, either immediately or later; changes can be in the same or different directions; if changes are not immediate, we have leading or lagging indicators – Cycles: patterns that repeat at regular intervals (such as daily, weekly, or seasonally) rather than occurring once – Exceptions: values that fall far outside the norm

Time Series Graphs

Line Graphs: show how quantitative values have changed over a continuous time period; show pattern or shape of change over time; show exceptions – Lines make visible the sequential flow of values over time – Lines trace the connection from one value to the next – Lines show the extent and direction of change through slope – If we want to compare magnitudes of values at a point in time, we should add dots to the lines

Time Series Graphs Bar Graphs: emphasize individual values and allow for comparisons of specific values at points in time – Visual weight of bars and their separation makes us focus on individual values rather than the overall patterns Dot Plots: useful when sampling at irregular intervals – A line connecting sporadic values implies smooth transitions between values – More regular sampling might show different picture – Use dots instead of lines to avoid false conclusions

Time Series Graphs Box Plots: show distribution of values over time by showing the center (usually the median), min and max – see Distribution Analysis for more information Animated Scatterplots: show correlation analysis over time – such as Gapminder – see Correlation Analysis for more information – Great for telling a story, not so good for analysis – hard to track individual dots – Must be combined with trails to show patterns of change over time, and small multiples (trellis display) to compare patterns of changes for multiple items

Time Series Graphs Best Practices – Aggregating to different time intervals: combine data into different time spans (month, week, year, day) to see different patterns emerge – Viewing time periods in context: extend the time period – trends that look significant in a small time span may not be over longer periods – Grouping related time intervals: add vertical lines or shading on the time axis to show for example each quarter or when the weekends are
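As a rough illustration of aggregating to a different time interval, the sketch below sums dated observations into monthly totals. The function name `totals_by_month` and the sample data are made up for this example; a real analysis might aggregate by week, quarter, or year in the same way.

```python
from collections import defaultdict
from datetime import date

def totals_by_month(observations):
    """Sum (date, value) pairs into one total per (year, month)."""
    totals = defaultdict(float)
    for day, value in observations:
        totals[(day.year, day.month)] += value
    # Sort by (year, month) so the result reads in time order.
    return dict(sorted(totals.items()))

# Hypothetical daily readings
daily = [
    (date(2011, 1, 5), 2),
    (date(2011, 1, 20), 3),
    (date(2011, 2, 1), 4),
]
print(totals_by_month(daily))
```

Plotting the monthly totals next to the daily values is one way to check whether a pattern only appears at a particular level of aggregation.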

Time Series Graphs Best Practices – Using running averages to enhance perception of high level patterns: trend lines can mislead if they don’t take into account values just outside the time period; better to look at running averages of current value and a few previous values – this smoothing can reduce variability that throws off trend lines – Omitting missing values from a display: rather than have the line dip to zero, either skip the value (show a broken line) or show the line lighter or dashed; do not confuse a valid zero value with a missing value
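The running average and the missing-value advice can be sketched together. This is only one possible reading: it assumes a trailing window (the current value plus a few previous ones), uses `None` as a hypothetical marker for a missing observation, and leaves a gap in the output wherever the raw value is missing rather than dipping to zero.

```python
def running_average(values, window=3):
    """Trailing mean of the current value and up to window-1 previous values.

    None marks a missing observation. A missing point stays missing in the
    output (a gap in the line), and it is ignored, not counted as zero,
    when it falls inside another point's window.
    """
    smoothed = []
    for i, current in enumerate(values):
        if current is None:
            smoothed.append(None)
            continue
        recent = [v for v in values[max(0, i - window + 1): i + 1] if v is not None]
        smoothed.append(sum(recent) / len(recent))
    return smoothed

# Hypothetical monthly values with one missing month and one spike
monthly = [10, 12, None, 14, 40, 15]
print(running_average(monthly))
```

A wider window smooths more but also hides short-lived changes, so the window size is a judgment call.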

Time Series Graphs Best Practices – Optimizing a graph’s aspect ratio: change the aspect ratio to get a lumpy profile instead of a flat or spiky profile, to allow for optimal comparison of slopes – Using log scales and percentages to compare rates of change: variations in numerical magnitudes may hide true rates of change – use log scales, or percent change from previous value or from a baseline value, to see true rates of change – Overlapping time scales to compare cyclical patterns: instead of showing for example all three years in one line, show each year as a different line over the 12 months, to allow comparisons from year to year for a given month
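To see why log scales and percentages help, consider two made-up series that both double each period but differ in magnitude by a factor of 100. The helper names below are assumptions for this sketch, not part of any library.

```python
import math

def percent_change_from_previous(values):
    """Percent change of each value relative to the value just before it."""
    return [(curr - prev) * 100.0 / prev for prev, curr in zip(values, values[1:])]

def percent_of_baseline(values, baseline=None):
    """Each value as a percentage of a baseline (the first value by default)."""
    base = values[0] if baseline is None else baseline
    return [v * 100.0 / base for v in values]

def log10_steps(values):
    """Vertical distance between consecutive points on a log10 scale."""
    logs = [math.log10(v) for v in values]
    return [b - a for a, b in zip(logs, logs[1:])]

small_product = [10, 20, 40]        # hypothetical data
large_product = [1000, 2000, 4000]  # hypothetical data

print(percent_change_from_previous(small_product))
print(percent_change_from_previous(large_product))
print(log10_steps(small_product))
print(log10_steps(large_product))
```

On a linear scale the small series would look almost flat next to the large one, yet the percent changes and the log-scale steps are the same for both.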

Time Series Graphs Best Practices – Using cycle plots to examine trends and cycles together: compare cycles and see trends across multiple cycles – Shifting time to compare leading and lagging indicators: shift the time axis on one graph so it aligns with the other and see patterns – Stacking line graphs to compare multiple values: if multiple time series have very different units or scale ranges, put them in stacked line graphs with the same time axis

Time Series Graphs Best Practices – Expressing time as 0-100% to compare asynchronous processes: if activities have different start dates, set each start date to 0% and show later dates as a percentage of total activity time, to compare values at similar points in the total activity length – Maintaining consistency through time: must adjust for inflation in currency over time, and account for how information gathering or the definition of values changed over time
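Expressing time as 0-100% amounts to a simple rescaling. The sketch below assumes each activity's time points are given as day offsets in increasing order; `percent_of_duration` is a hypothetical helper name.

```python
def percent_of_duration(day_offsets):
    """Rescale elapsed time so the first point is 0% and the last is 100%."""
    start, end = day_offsets[0], day_offsets[-1]
    return [(d - start) * 100.0 / (end - start) for d in day_offsets]

# Two hypothetical projects with different start dates and lengths
project_a_days = [0, 15, 30]
project_b_days = [100, 150, 200]

print(percent_of_duration(project_a_days))
print(percent_of_duration(project_b_days))
```

Once both projects share the same 0-100% axis, their values can be compared at the same relative stage even though the calendar dates never overlap.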

Time Series Graphs Do’s and Don’t’s – Change salience of lines if needed to show relative importance. – Ensure crossing or nearby lines are discriminable. – If using points on lines, make points at least twice as thick as the lines. – Vary the lengths of dashes in dashed lines by at least a ratio of 2 to 1. – Use different, discriminable symbols for points on different lines.

Time Series Graphs Do’s and Don’t’s – Do not fill in the areas between two lines – it’s not an area graph. – In a mixed line and bar display, make one more salient and important. – Put labels of all lines in same part of graph (else it draws attention to certain lines – also less busy). – Put labels at end of lines (so labels and lines group with each other). – Label any critical data points explicitly rather than labeling all points.

Best Practices Part-to-Whole and Ranking Analysis (sources: Colin Ware and Stephen Kosslyn)

Part-to-Whole and Ranking Comparing parts to a whole and ranking them by value – for example the expenses of each department of a company as a % of total expenses, ranked in order

Part-to-Whole and Ranking Patterns – Uniform – all values roughly the same – Uniformly different – differences from one value to the next increase by roughly the same amount – Non-uniformly different – differences from one value to the next vary significantly

Part-to-Whole and Ranking Patterns – Increasingly different – differences from one value to the next increase – Decreasingly different – differences from one value to the next decrease – Alternating differences – differences from one value to the next begin small then shift to large and finally back to small – Exceptional – one or more values are very different from the rest

Part-to-Whole and Ranking

Part to whole is usually shown with pie charts – bad idea! – Makes us compare areas or angles, both of which humans do poorly – If pie uses a legend, eye must bounce between chart and legend – You can label pie wedges directly with name and % value – but this is no better than a table – why use a graph if we must resort to printed values to make sense of it?

Part-to-Whole and Ranking Acceptable Bad

Part-to-Whole and Ranking Bad

Part-to-Whole and Ranking Bad

Part-to-Whole and Ranking Acceptable?

Part-to-Whole and Ranking Instead, use a bar graph – One exception – if values cluster close together, the bar differences are small and hard to see – So narrow the scale (zoom in) so the differences look bigger – But then use a dot plot – dots or lines instead of bars – so we don’t misjudge the bar lengths

Part-to-Whole and Ranking Use a Pareto chart to show the cumulative contributions of each part to a whole – a line graph plus a bar chart shows how the parts sum to 100% – Pareto charts summarize and display the relative importance of the differences between groups of data – they distinguish the "vital few" from the "useful many."

Part-to-Whole and Ranking Vilfredo Pareto, a turn-of-the-century Italian economist, studied the distributions of wealth, finding that about 20% of people controlled about 80% of a society's wealth. This same distribution has been observed in other areas and has been termed the Pareto Principle or 80/20 rule.
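The numbers behind a Pareto chart are easy to compute: sort the parts from largest to smallest (the bars) and accumulate their shares of the total (the line). The sketch below uses invented department expenses, and `pareto_table` is a hypothetical helper name.

```python
def pareto_table(amounts):
    """Return (name, percent share, cumulative percent) rows, largest part first."""
    total = sum(amounts.values())
    rows = []
    running = 0.0
    for name, amount in sorted(amounts.items(), key=lambda kv: kv[1], reverse=True):
        share = amount * 100.0 / total
        running += share
        rows.append((name, share, running))
    return rows

# Hypothetical expenses by department
expenses = {"HR": 15, "Sales": 50, "Legal": 5, "IT": 30}
for name, share, cumulative in pareto_table(expenses):
    print(f"{name:6s} {share:5.1f}% {cumulative:6.1f}%")
```

The cumulative column is what the Pareto line plots; in this made-up data the top two departments already account for 80% of expenses.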

Part-to-Whole and Ranking

Best Practices – Grouping categorical values in ad hoc manner: group very small categories into one called ‘other’ or regroup similar categories into one master category for better analysis – Using Pareto charts with percentile scales: group values into percentile intervals (top 10%, next 10%, etc.) and use Pareto line – can lead to new insights – Using line graphs to view ranking changes through time: use line graphs to show changes in ranking (such as salesperson’s sales) over time – the lines show the relative ranking but not the actual values – inspired by bump charts from racing

Part-to-Whole and Ranking Best Practices – Re-expressing values to solve quantitative scaling problems: sometimes the small values on a bar chart are hard to see relative to the large values – so re-express the number using the square root, or a logarithm, if it reduces the range from highest to lowest; can also use an inverse scale (divide each value by the largest value or some other value such as a million)
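A quick way to judge whether a re-expression helps is to compare the ratio of the largest to the smallest value before and after. The data and the `spread_ratio` helper below are invented for this sketch.

```python
import math

def spread_ratio(values):
    """How many times larger the biggest value is than the smallest."""
    return max(values) / min(values)

# Hypothetical sales figures spanning several orders of magnitude
sales = [4, 100, 10000, 1000000]
sqrt_sales = [math.sqrt(v) for v in sales]
log_sales = [math.log10(v) for v in sales]

print(spread_ratio(sales))       # raw values
print(spread_ratio(sqrt_sales))  # after square root
print(spread_ratio(log_sales))   # after base-10 logarithm
```

The smaller the ratio, the easier it is to see the short bars next to the tall ones, at the cost of an axis that readers must interpret more carefully.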

Part-to-Whole and Ranking Do’s and Don’t’s: Bar Charts – Do not insist on minimizing ink. – Mark corresponding bars in same color or symbol for multiple parameters. – Arrange corresponding bars in same order for multiple parameters. – Ensure overlapping bars do not look like stacked bars – offset the bars. – Leave space between bar clusters for multiple parameters. – Do not extend bars beyond the end of the scale.

Part-to-Whole and Ranking Do’s and Don’t’s: Pie Charts – Draw radii from the center of the circle. – Explode a maximum of 25% of the wedges. – Arrange wedges in a simple increasing progression. – Place labels in wedges provided they can be easily read. – Place labels next to all wedges if they cannot fit inside wedges (otherwise reader will think ones outside wedge are more important).

Best Practices Deviation Analysis (sources: Colin Ware and Stephen Kosslyn)

Deviation Analysis Examining how a set of values deviate from a reference point (a budget, average, or price in time) – Usually use a bar graph with two bars per entity – the actual and expected, such as for a budget – However this makes user subtract values in head – Better to have the graph 0 line be the expected reference, and the bars show the amount over or under (the deviation)

Deviation Analysis Comparisons – Current target, future target – Same point in time in past – Immediately prior period – Standard or norm – Other items in same category or same market

Deviation Analysis

Best shown as bar or line graphs with reference line at 0 or 100% – If at 0, values expressed as positive and negative deviations in dollars or percents – If at 100%, values expressed as percentages of the reference value – Best to use a line graph when doing comparisons over time, from one period to the next; if comparing entities such as areas or companies, use a bar graph
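The two reference-line conventions correspond to two simple calculations. The sketch below, with invented regional figures and a hypothetical `deviations` helper, returns the deviation in original units, the deviation as a percent of the reference (for a 0 line), and the actual value as a percent of the reference (for a 100% line).

```python
def deviations(actual, expected):
    """Map each entity to (absolute deviation, percent deviation, percent of reference)."""
    rows = {}
    for name in actual:
        diff = actual[name] - expected[name]
        rows[name] = (
            diff,
            diff * 100.0 / expected[name],
            actual[name] * 100.0 / expected[name],
        )
    return rows

# Hypothetical actual spending versus budget
actual = {"East": 120, "West": 90}
budget = {"East": 100, "West": 100}
print(deviations(actual, budget))
```

Plotting the second column against a 0 line, or the third against a 100% line, spares the reader from subtracting two bars mentally.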

Deviation Analysis Best Practices – Expressing deviations as percentages: helps normalize multiple data sets to same units to allow for better comparison – works best if values are mostly <= 100% and nothing exceeds 500% – Comparing deviations to other points of reference: besides showing reference line, show other lines such as acceptable deviations from norm, or standard deviations from mean

Best Practices Distribution Analysis (sources: Colin Ware and Stephen Kosslyn)

Distribution Analysis Seeing how numerical values are distributed from low to high, and comparing how multiple sets of values are distributed “The median isn’t the message” (Stephen Jay Gould) – knowing the average or median value hides the full range of values – even knowing the max and min values hides the number of values at each numerical value in a range of data

Distribution Analysis Characteristics of distributions of values – Spread: the difference between the max and min values – the full range of values – Center: estimate of the middle of a set of values – the mean or median or average – Shape: where values are located in a spread – skewed to a side? Evenly distributed? Distribution summaries: – 3 value: low, median, high – 5 value: low, 25th percentile, median, 75th percentile, high
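The 3-value and 5-value summaries can be computed with Python's standard `statistics` module. Percentiles can be defined in several slightly different ways; this sketch assumes the "inclusive" method of `statistics.quantiles`, and the helper names and data are invented.

```python
import statistics

def three_value_summary(values):
    """Low, median, high."""
    return min(values), statistics.median(values), max(values)

def five_value_summary(values):
    """Low, 25th percentile, median, 75th percentile, high."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical measurements
data = [7, 1, 9, 3, 5, 2, 8, 4, 6]
print(three_value_summary(data))
print(five_value_summary(data))
```

Other percentile conventions can give slightly different quartiles on small data sets, so it helps to state which one was used.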

Distribution Analysis Patterns - Shape: – Curved or flat? – If curved, curved upward (bell curve) or downward (opposite of bell curve)? – If curved upward, one peak, two peaks (bi-modal), or more? – If single peaked, symmetrical or skewed left or right? – Concentrations? Noticeably high peaks, that may not be the absolute peak – Gaps? Areas of low or no values

Distribution Analysis Gaussian distribution

Distribution Analysis

Bimodal distribution for graduating lawyer salaries

Distribution Analysis Patterns - Outliers: – values way beyond the norm – good rule of thumb – take the distance between the 75th and 25th percentile values, multiply that by 1.5, and then subtract the result from the 25th percentile to mark the lower bound and add it to the 75th percentile to mark the upper bound
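The rule of thumb above can be written out directly. This sketch assumes the "inclusive" quartile method from Python's `statistics` module (other percentile definitions shift the bounds slightly); the function names and data are invented for illustration.

```python
import statistics

def outlier_bounds(values):
    """Lower and upper bounds: 1.5 times the 25th-to-75th percentile distance beyond each quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    midspread = q3 - q1
    return q1 - 1.5 * midspread, q3 + 1.5 * midspread

def find_outliers(values):
    """Values that fall below the lower bound or above the upper bound."""
    low, high = outlier_bounds(values)
    return [v for v in values if v < low or v > high]

# Hypothetical data with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(outlier_bounds(data))
print(find_outliers(data))
```

The rule only flags candidates; whether a flagged value is an error or a genuine extreme still needs a look at the data.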

Distribution Analysis Histograms: single distribution display – Bar graph with X axis showing value ‘bins’ like age groups, and y axis showing number of values falling in each bin – Bars touch to show continuous distribution between bins – Enhanced if you can show the 3 value or 5 value marks on the X axis – otherwise no good way to determine the center and spread, just the shape
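A minimal sketch of the counting behind a histogram, assuming bins of a fixed width that each include their lower edge and exclude their upper edge. The function name, its parameters, and the ages are made up for this example.

```python
def histogram_counts(values, start, width, bins):
    """Count values into `bins` intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts

# Hypothetical ages, grouped into 10-year bins from 0 to 49
ages = [3, 12, 15, 21, 22, 25, 29, 34, 38, 41]
print(histogram_counts(ages, start=0, width=10, bins=5))
```

Rerunning the count with a different `width` is a quick way to see how much the apparent shape depends on the bin size.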

Distribution Analysis

Box Plots: multiple distribution display – Box shows median and 25th/75th percentiles (midspread) – Whiskers show high and low values (spread) – Could also have whiskers stop at 5th/95th percentiles and show outliers as dots

Distribution Analysis

Best Practices – Keeping intervals consistent: each X axis bin should span an equal range of values; but it is OK to group outliers at one or both ends into one bin – Selecting the best interval: if bins are too large, patterns are lost and the graph is too general; if bins are too small, the graph is too jagged and patterns cannot be seen – Using measures that are resistant to outliers: certain measures such as the mean and the standard deviation can be greatly changed by the presence or absence of outliers; the median is very resistant to outliers and hence is preferred

Best Practices Correlation Analysis (sources: Colin Ware and Stephen Kosslyn)

Correlation Analysis Examining how numerical values relate to and affect one another; helps to track down causes – Does one value vary systematically with another value? – If so, in what manner, degree, direction, and why?

Correlation Analysis Correlation between two variables can mean – One variable causes another – Neither variable affects the other – instead both are caused by one or more other variables (spurious correlation – due to these lurking variables) – Neither variable affects the other – instead another variable connects them in causation – The apparent correlation is an error due to bad or insufficient data

Correlation Analysis Describing correlations – Direction: positive or negative (refers to slope on graph) – Strength: amount of grouping along the trend line – the stronger the grouping, the more likely the variables are related; if values are scattered the correlation is weak or not present – Shape: linear or curved (curvilinear)

Correlation Analysis Patterns - Shape – Linear or curved? If linear, an increase in one variable is matched by a constant, proportional change in the other variable; if curved, the amount of change varies along the curve – One direction or two? Does curve go up or down only, or both? – Logarithmic (values go up or down at ever decreasing rate of change) or exponential (values go up or down at ever increasing rate of change)?

Correlation Analysis Patterns - Shape – Curved upward or downward? Shaped like an S? – Concentrations? (can be due to overlapping distributions creating multiple clusters) – Gaps? (only useful to examine when there is a correlation) – Outliers? Values very far from the fit line showing the trend

Correlation Analysis

Statistical summaries of correlation – Linear correlation coefficient (r): direction and strength of correlation, from r=+1 (perfect positive) to r=-1 (perfect negative); the value of r that counts as significant differs from one analysis to another – Coefficient of determination (r²): strength of correlation – equal to r squared – so values range from 0 to 1; the value indicates the percent of variation in the dependent variable that can be attributed to the independent variable (from 0 to 100%). Visual displays on a graph are still needed because very different sets of data can have the same statistical values (see next slide)
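For reference, r and r squared can be computed from first principles. This is a minimal sketch with invented data; `pearson_r` is a name chosen for this example, and the formula is the standard sample correlation.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

# Hypothetical paired measurements
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 4), round(r ** 2, 4))
```

The same r can come from very differently shaped point clouds, which is why the number should always be read next to a scatterplot.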

Correlation Analysis from Few

Correlation Analysis Correlation displays – Scatterplots: use x and y axes to show two variables, then plot all the points – Scatterplot matrices: show all combinations of two variables from a set of multiple variables; let you see how multiple variables are related – Table lenses: horizontal bars (or dots) show values in a column; multiple columns show multiple variables; columns are compared to the left most column to see how values correlate

Correlation Analysis Best Practices – Optimizing aspect ratio and quantitative scales: make width and height of graph equal, and have axes go from just below lowest value to just above highest value of each variable – Removing fill color to reduce over plotting: just show outline to avoid overlaps – Comparing data to reference regions: shade the reference or normed region to see outliers – Visually distinguishing data sets when divided into groups: either through easily distinguished hues, or by symbols (best to use are circle, square, triangle, plus, and X)

Correlation Analysis Best Practices – Using trend lines to enhance perception of correlation’s shape, strength, and outliers: the line of best fit is the one for which the vertical distances of the points from the line, squared and then summed, give the smallest total; it can be linear or curved; the line shouldn’t match every point – look for the overall trend; it can be used (if r squared for the line is high) to estimate values for missing data points; use with caution for predicting values though – how do we know if we’re in the middle of an upward trend or just at the top of an S curve and about to go down?
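For the straight-line case, the best-fit line has a closed-form solution. The sketch below uses invented data and hypothetical helper names, and it also computes the sum of squared vertical distances so the fit can be compared against any other candidate line.

```python
def fit_line(xs, ys):
    """Slope and intercept of the straight line that minimizes squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_distances(xs, ys, slope, intercept):
    """Total of squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical paired measurements
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(sum_squared_distances(xs, ys, slope, intercept))
```

Any line drawn by eye can be passed to `sum_squared_distances` to confirm that it scores no better than the fitted one.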

Correlation Analysis Best Practices – Using multiple trend lines to see categorical differences: may be useful if multiple trends show up – Removing the rough to see the smooth more clearly: removing outliers can make graph more compact and show the trend (the smooth) better – Using trellis and crosstab displays: to reduce complexity and over-plotting – Using grid lines to enhance comparisons between scatterplots: helps focus on particular areas from one graph to the next by using lines as reference

Correlation Analysis Do’s and Don’t’s – Do not indicate overlapping points with different symbols – vary the size with number of points at given location. – Ensure error bars do not make less stable points (with longer bars) look bigger. – Ensure best fit lines are salient and distinguishable. – Do not fit a line by eye. – If using more than one best fit line, label each directly.

Next Module We are done with graphs and charts What about maps?