Data Visualization

"The commonality between science and art is in trying to see profoundly - to develop strategies of seeing and showing." (Edward Tufte)

Visualization skills
Humans are particularly skilled at processing visual information.
This is a natural capability, whereas reading is a learned skill.
Our ancestors were efficient visual processors who quickly detected threats and used this information to make effective decisions.

A graphical representation of Napoleon Bonaparte's invasion of, and subsequent retreat from, Russia in 1812. The graph shows the size of the army, its location, and the direction of its movement. The temperature during the retreat is plotted along the bottom of the figure. Drawn by Charles Joseph Minard in 1869, it is generally considered to be one of the finest graphs ever produced.

Wilkinson’s grammar of graphics
Data: A set of data operations that create variables from datasets (e.g., spreadsheets and databases such as Classic Models)
Trans: Variable transformations (converting data into a format suitable for the intended visualization)
Scale: Scale transformations (controlling how data values are mapped onto the visualization)

Wilkinson’s grammar of graphics (continued)
Coord: A coordinate system describing where things are located (e.g., longitude and latitude for maps, and x-axis and y-axis for graphs)
Element: A graph and its aesthetic attributes (e.g., a scatterplot of CO2 emissions against year)
Guide: One or more guides (e.g., axes and legends) that help the reader interpret what is plotted

ggvis
An implementation of the grammar of graphics in R
The grammar describes the structure of a graphic
A graphic is a mapping of data to a visual representation
ggvis http://had.co.nz/ggplot/

Data
Spreadsheet approach: use an existing spreadsheet or create a new one, then export it as a CSV file
Database approach: execute an SQL query

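As a minimal sketch of the spreadsheet route, the R code below writes a small CSV file and reads it back with read.csv. The temporary file is only a stand-in for a real spreadsheet export, and the names csv_file and population are illustrative. The three rows are taken from the world-population exercise table in these slides.

```r
# Stand-in for a CSV file exported from a spreadsheet (hypothetical file).
csv_file <- tempfile(fileext = ".csv")
writeLines(c("year,population",
             "1804,1",
             "1927,2",
             "1960,3"), csv_file)

# Read the exported file into a data frame; the first row holds the column names.
population <- read.csv(csv_file, header = TRUE)

# Inspect the column names and types before plotting.
str(population)
```

Checking the structure with str() before plotting is a cheap way to confirm that numeric columns were not read as text.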
Transformation
A transformation converts data into a format suitable for the intended visualization

    # TRANSFORMATION:
    url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt'
    carbon <- read.table(url, header=TRUE, sep=',')
    head(carbon)
    # compute a new column in carbon containing the relative change in CO2 since
    # pre-industrial periods, when the value was 280 ppm
    carbon$relCO2 <- (carbon$CO2-280)/280

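To make the arithmetic of the transformation concrete, here is a small sketch that applies the same formula to made-up values. The data frame carbon_demo and its numbers are hypothetical, chosen so the results are easy to check by hand; they are not taken from the carbon data file.

```r
# Hypothetical CO2 concentrations in ppm, chosen for easy arithmetic
# (not values from the carbon data file).
carbon_demo <- data.frame(year = c(1, 2, 3), CO2 = c(280, 350, 420))

# Relative change from the pre-industrial level of 280 ppm.
carbon_demo$relCO2 <- (carbon_demo$CO2 - 280) / 280

print(carbon_demo$relCO2)
# [1] 0.00 0.25 0.50
```

A value of 0.25 means the concentration is 25% above the pre-industrial level.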
Coord
A coordinate system describes where things are located
Most graphs are plotted on a two-dimensional (2D) grid with x (horizontal) and y (vertical) coordinates
The default coordinate system is Cartesian (histogram)

Element
An element is a graph and its aesthetic attributes
Build a graph by adding layers

    library(ggvis)
    library(readr)
    # ELEMENT: CO2 EMISSION BY YEAR
    carbon %>% ggvis(~year,~CO2) %>% layer_points(fill:='red')
    # use the pipe operator (%>%) to create a pipeline of commands
    # the code above reads like a recipe. It says:
    # 1. take the carbon data, then
    # 2. use the package ggvis to plot CO2 against year, and
    # 3. specify the plot to contain red points.

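The "recipe" reading of the pipe can be seen without any plotting. The sketch below compares a nested call with the equivalent pipeline; it assumes the magrittr package is installed (ggvis makes the same %>% operator available), and the variable names are illustrative.

```r
library(magrittr)  # provides %>%; ggvis makes the same operator available

values <- c(1, 4, 9)

# Nested call: read from the inside out.
nested <- sum(sqrt(values))

# Pipeline: read left to right, "take values, then sqrt, then sum".
piped <- values %>% sqrt() %>% sum()

print(piped)
# [1] 6
```

Both forms compute the same result; the pipeline simply passes the left-hand value as the first argument of the next function.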
Scale

    # SCALE: GOOD IDEA TO HAVE A ZERO POINT FOR THE Y-AXIS (DON'T DISTORT THE SLOPE!)
    carbon %>%
      ggvis(~year,~CO2) %>%
      layer_points(fill:='red') %>%
      scale_numeric('y',zero=TRUE)
    # perform steps 1-3 of the ELEMENT code, and then,
    # 4. force the y-axis scale to include zero.

Axes

    # AXES: HELP THE READER UNDERSTAND THE GRAPH
    carbon %>%
      ggvis(~year,~relCO2) %>%
      layer_lines(stroke:='blue') %>%
      scale_numeric('y',zero=TRUE) %>%
      add_axis('y', title = "Relative change in atmospheric CO2", title_offset=50) %>%
      add_axis('x', title ='Year', format = '####')
    # the code above says:
    # 1. take the carbon data, then
    # 2. use the package ggvis to plot relCO2 against year, then
    # 3. specify the plot to contain a continuous blue line, then
    # 4. force the y-axis scale to include zero, then
    # 5. add a title for the y-axis that is moved a bit to the left to improve readability, and
    # 6. add a title for the x-axis, specifying a format of 4 consecutive digits for displaying year on the x-axis

Guides
Axes and legends are both forms of guides
They help the viewer to understand a graphic

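The slides show axes but not legends, so here is a hedged sketch of a legend as a guide. It assumes ggvis is installed and uses the mtcars dataset that ships with R rather than any data from these slides; the titles are illustrative.

```r
library(ggvis)

# mtcars ships with R; cyl (number of cylinders) is mapped to fill color,
# so a legend is needed to decode the colors.
p <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points() %>%
  add_axis('x', title = 'Weight (1000 lbs)') %>%
  add_axis('y', title = 'Miles per gallon') %>%
  add_legend('fill', title = 'Cylinders')

# In an interactive session, print(p) displays the plot.
```

The axes explain position, while the legend explains color; without it the reader cannot tell which points belong to which group.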
Exercise
Create a point plot using the data in the following table. Add a title for both the x- and y-axes.

    Year    Population (billions)
    1804    1
    1927    2
    1960    3
    1974    4
    1987    5
    1999    6
    2012    7
    2027    8
    2046    9

Histogram

    # HISTOGRAM: USEFUL FOR SHOWING THE DISTRIBUTION OF VALUES IN A SINGLE COLUMN
    url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt'
    t <- read.table(url, header=TRUE, sep=',')
    t$C <- round((t$temperature - 32)*5/9,1)
    t %>%
      ggvis(~C) %>%
      layer_histograms(width = 2, fill:='cornflowerblue') %>%
      add_axis('x',title='Celsius') %>%
      add_axis('y',title='Frequency')
    # width refers to the size of the bin.
    # this means that the bin above the tick mark 10 contains all values in the range 9 to 11.
    # The code above says:
    # 1. store the url, then
    # 2. read the url content as table t, then
    # 3. create a new column in t that converts the Fahrenheit temperature to Celsius and rounds it to one decimal place, then
    # 4. take the t data, then
    # 5. use the package ggvis to plot Celsius temperature, then
    # 6. specify the plot to be a histogram with width 2 and color cornflowerblue, then
    # 7. add a title for the x-axis, and
    # 8. add a title for the y-axis.

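The conversion and the meaning of a bin width of 2 can be checked by hand in base R. The readings below are hypothetical, not values from the Central Park file, and the bin edges assume bins centered on the tick marks as the comment above describes; which edge of a bin is inclusive is also an assumption here.

```r
# Hypothetical Fahrenheit readings (not from the Central Park file).
fahrenheit <- c(32, 50, 51, 68)

# Same conversion as the histogram code: Fahrenheit to Celsius, one decimal place.
celsius <- round((fahrenheit - 32) * 5 / 9, 1)
print(celsius)
# [1]  0.0 10.0 10.6 20.0

# Bins of width 2 centered on even numbers: [-1,1), [1,3), ..., [19,21).
bins <- cut(celsius, breaks = seq(-1, 21, by = 2), right = FALSE)
counts <- table(bins)

# The bin above tick mark 10 covers 9 to 11 and holds two readings.
print(counts[["[9,11)"]])
# [1] 2
```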
Exercise
Create a histogram of CO2 using the carbon 1959-2011 data. Add a title for both the x- and y-axes.

    url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt'
    carbon <- read.table(url, header=TRUE, sep=',')

Bar graph

    # BAR GRAPH: USEFUL FOR GRAPHING CATEGORICAL DATA
    library(DBI)
    library(RMySQL)
    # if the error "in .local(drv, ...): cannot allocate a new connection: 16 connections already opened"
    # appears when connecting, first loop through the open connections and close them
    # by running the next two lines, then connect again. If there is no problem, skip them.
    #   cons <- dbListConnections(MySQL())
    #   for(con in cons) dbDisconnect(con)
    # set a driver
    m <- dbDriver("MySQL")
    # connect to the database
    conn <- dbConnect(m, user='student', password='student', host='wallaby.terry.uga.edu', dbname='ClassicModels')
    # query the database and create a data frame for use with R
    d <- dbGetQuery(conn, "select productLine from Products;")
    # plot the number of products in each product line by specifying the appropriate column name
    d %>%
      ggvis(~productLine) %>%
      layer_bars(fill:='chocolate') %>%
      add_axis('x',title='Product line') %>%
      add_axis('y',title='Count')
    # The code immediately above says:
    # 1. take the d data, then
    # 2. use the package ggvis to plot productLine, then
    # 3. specify the plot to be a bar graph with color chocolate, then
    # 4. add a title for the x-axis, and
    # 5. add a title for the y-axis.

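The bar heights are simply the number of rows in each category. The sketch below shows that count without a database connection; d_demo is a hypothetical stand-in for the query result with made-up rows, and it uses base R's barplot rather than ggvis so it runs without extra packages.

```r
# Hypothetical stand-in for the query result: one row per product.
d_demo <- data.frame(
  productLine = c("Classic Cars", "Motorcycles", "Classic Cars",
                  "Planes", "Classic Cars", "Motorcycles"),
  stringsAsFactors = FALSE
)

# Count the rows in each category; these counts are the bar heights.
counts <- table(d_demo$productLine)
print(counts)

# A quick base-R bar graph of the same counts.
barplot(counts, col = "chocolate", xlab = "Product line", ylab = "Count")
```

Inspecting the counts first makes it easy to confirm that the bar graph shows what you expect.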
Exercise Using Classic Models, create a bar graph to show how many offices each country has.