Lecture 8: Data Visualization with ggplot2

DATA 101: Making Predictions with Data

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

R includes at least three graphical systems: base graphics, lattice, and ggplot2.

Today we’ll take a look at ggplot2.

Data Visualization

  • Data visualization is the graphical representation of data to uncover insights, patterns, and trends.

  • Visualizations can take various forms, including charts, graphs, maps, and diagrams.

  • The primary goal is to communicate data clearly, aiding decision-making and storytelling.

  • To facilitate that, you want to reduce visual noise as much as possible.

Goals of Data Visualization

  • A stepping stone for developing models: helps to refine/determine parameters and test assumptions

Cholera Outbreak

The dot plot used by John Snow to illustrate the cluster of cholera cases around the pump on Broad Street. Image source: [Wikipedia (John Snow, 2023)]

John Snow

Minimize Noise

  • Colours: use sparingly; too many can be distracting, create false patterns, and detract from the message.

  • Overplotting: when multiple data points overlap to the extent that individual points cannot be distinguished.

  • Size: Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small.

  • Axis manipulation: don’t adjust the axes to zoom in on small differences.
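The effect of zooming in on an axis can be sketched in ggplot2. The weekly-hours values below are invented purely for illustration, and the object names (hours, p_zoomed, p_zero) are hypothetical.

```r
library(ggplot2)

# Made-up values for illustration only (not real survey data)
hours <- data.frame(
  country = c("A", "B", "C"),
  weekly_hours = c(41, 40, 39)
)

base_plot <- ggplot(hours, aes(x = country, y = weekly_hours)) +
  geom_col()

# Zoomed-in axis: small differences look dramatic
p_zoomed <- base_plot + coord_cartesian(ylim = c(38, 42))

# Default bar-chart axis: bars start at zero, differences shown in proportion
p_zero <- base_plot

p_zoomed
p_zero
```

Comparing the two plots side by side shows how the same three numbers can tell very different stories depending on where the axis starts.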

EU labour market

A bar chart created by the German economic development agency GTAI which boasts that German workers are more motivated and work more hours than do workers in other EU nations. Data source: Eurofound 2014

Do you think it is fair to say that Germans are more motivated and work more hours than do workers in other EU nations?

Graph redrawn

The redrawn graph with an axis going all the way to zero. Source of images: Misleading axes on graphs callingbulls***.org

How about now?

What is ggplot2

  • ggplot2 is a core tidyverse package, written by Hadley Wickham and others (view on CRAN)

  • ggplot2 implements the Grammar of Graphics and enables us to concisely describe the components of a graphic.

  • ggplot2 handles much of the formatting automatically, while still offering the buildable and customizable features of base graphics.

  • There is a lot to unpack with this graphic method and it may be helpful to keep a cheatsheet nearby.

ggplot2 vs base

  • We can think of the base plotting model as a blank canvas on which we can draw but not erase.

  • We may start with a plot (boxplot, stripchart, histogram, etc.).

  • Upon viewing, we might decide we want to superimpose a line (e.g. abline(), lines()), points (e.g. points()), or text (e.g. text(), axis(), title()).

  • In this way, we have a series of R commands which “build-up” our graphic until we are satisfied with it.
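As a rough sketch of this build-up style in base R, the following commands draw on the same canvas one after another. The choice of mtcars and of which point to highlight is only an illustrative assumption.

```r
# Start with a scatter plot on the "canvas"
plot(mtcars$mpg, mtcars$disp,
     xlab = "Miles per gallon", ylab = "Displacement (cu.in.)")

# Superimpose a least-squares line
fit <- lm(disp ~ mpg, data = mtcars)
abline(fit, col = "blue")

# Superimpose one highlighted point and a text label next to it
heaviest <- mtcars[which.max(mtcars$wt), ]
points(heaviest$mpg, heaviest$disp, pch = 19, col = "red")
text(heaviest$mpg, heaviest$disp, labels = rownames(heaviest), pos = 4)

# Add a title last
title(main = "Displacement vs. fuel economy")
```

Each call adds ink to the existing plot; nothing drawn earlier can be removed without starting over from plot().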

Advantages of ggplot2

  • Consistency and Clarity: produces clearer and more organized code, making it easier to understand and reproduce your plots.

  • Layered Approach: uses a layered approach where you add different components (geometric objects, statistical transformations, facets) to build a plot step by step.

  • Data-Driven Aesthetics: You can map data variables to aesthetics like color, size, and shape, creating dynamic visualizations where the plot adapts to changes in your data.

  • Faceting: provides built-in support for faceting, allowing you to split data into multiple subplots based on one or more categorical variables.

  • Community and Ecosystem: a large and active user community, which means that you can find ample resources, tutorials, and support.

  • Reproducibility: The structured nature of ggplot2 code and the fact that it’s based on R means that your plots can be part of reproducible workflows.

qplot

  • qplot() is a shortcut designed to be familiar if you’re used to base plot()
  • The basic syntax looks like this:
library(ggplot2)
qplot(x,y, ..., data)

Example: scatter plot

  • We can create a scatter plot with qplot() using the following code (output on next slide):
library(ggplot2)
n = 100
x = runif(n)
y = runif(n, min=0, max=5)
qplot(x,y)

As you may notice, plots produced in ggplot2 have a very distinct look from the ones made in base.
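Note that qplot() is deprecated as of ggplot2 3.4.0, so recent versions print a warning when it is called. A minimal sketch of the same scatter plot written with ggplot() is given here; the data frame name sim and the object name p_scatter are hypothetical.

```r
library(ggplot2)

n <- 100
x <- runif(n)
y <- runif(n, min = 0, max = 5)

# ggplot() works on data frames, so collect the vectors first
sim <- data.frame(x = x, y = y)

# Same scatter plot as qplot(x, y)
p_scatter <- ggplot(sim, aes(x = x, y = y)) +
  geom_point()
p_scatter
```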

Faceting

  • qplot() can create multi-panel plots using facets. We can specify our desired groups using y~x.
  • y~. stacks the panels in a single column, with one row for each unique level of y (i.e. the row faceting variable)
  • .~x places the panels in a single row, with one column for each unique level of x (i.e. the column faceting variable)
  • y~x forms a matrix of plots whose rows and columns represent the combinations of the levels of y and x

Facets

  • Facets are a powerful feature in data visualization that allow you to split a single plot into multiple smaller subplots based on one or more categorical variables.

  • Faceting enables you to compare and contrast different subsets of your data within the same visualization.

  • Faceting in qplot() accepts a formula

  • It uses facet_wrap() or facet_grid() depending on whether the formula is one- or two-sided
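The two faceting functions can also be called directly when building a plot with ggplot(). The sketch below assumes the mtcars data set; the object names are hypothetical.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = mpg, y = disp)) + geom_point()

# One-sided formula: one panel per level of cyl, wrapped into rows as needed
p_wrap <- base_plot + facet_wrap(~ cyl)

# Two-sided formula: rows are levels of am, columns are levels of cyl
p_grid <- base_plot + facet_grid(am ~ cyl)

p_wrap
p_grid
```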

Example: mtcars

Let’s plot side-by-side scatter plots of disp (Displacement (cu.in.)) against mpg (Miles/(US) gallon), one for each value of cyl (Number of cylinders: 4, 6, 8), in the mtcars data set.

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

side-by-side scatterplot

qplot(mpg, disp, data = mtcars, facets = .~cyl)

Row-wise histograms

qplot(mpg, data = mtcars, facets = cyl~.)

Comments

  • Notice that when we just specify one variable, qplot() plots a histogram rather than a scatter plot

  • When cyl appears on the left-hand side of the tilde the subplots are plotted row-wise (i.e. along the \(y\)-axis)

  • When cyl appears on the right-hand side of the tilde the subplots are plotted column-wise (i.e. along the \(x\)-axis)

  • We can facet by two discrete variables using the y~x formula

Faceting with two discrete variables

qplot(mpg, disp, data = mtcars, facets = am~cyl)

Colours

  • Rather than representing this information in distinct side-by-side plots, I may want to create a single plot and distinguish the groups of 4, 6, and 8 cylinder cars using colours or shapes.

  • In ggplot2 terms, we could change the colour aesthetic (later referred to as aes) according to the factor cyl.

  • ggplot2 will automatically pick the colours and display the legend.

Scatterplot with colour legend

qplot(mpg, disp, data = mtcars, color = factor(cyl))

A scatter plot for mpg vs. disp (displacement) where points are coloured according to cyl type

Scatterplot with shape legend

We could have distinguished our cylinders by shapes instead.

qplot(mpg, disp, data = mtcars, shape = factor(cyl))

Note: to control the size and aspect ratio of plots, use the fig.width and fig.height chunk options. The out.width chunk option is used to control the display width of the generated plot when it is embedded in the final document.
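One way to set these options for a whole R Markdown document is through knitr's opts_chunk$set() in a setup chunk. The particular values below are arbitrary assumptions chosen for illustration.

```r
library(knitr)

# Set defaults for all later chunks in an R Markdown document
opts_chunk$set(
  fig.width  = 6,     # width of the plotting device, in inches
  fig.height = 4,     # height of the plotting device, in inches
  out.width  = "70%"  # display width in the rendered document
)
```

The same options can instead be given in the header of an individual chunk when only one figure needs different dimensions.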

Stacked histogram

What might you expect the following code to produce?

qplot(mpg, data = mtcars, fill = factor(cyl))

Compare with row-wise histogram

Factors revisited

  • Notice how I coerced the cyl variable to a factor data type
  • Factors are extremely important in ggplot2, as they are used to categorize our data
  • Since factor levels are often cryptic, we should do our best to replace them with meaningful labels.
  • e.g. rather than the transmission (am) having levels 0 and 1, we should rename them "automatic" and "manual"
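A minimal sketch of this relabelling, assuming we work on a copy of mtcars (the name mtcars2 is hypothetical):

```r
library(ggplot2)

# Work on a copy so the built-in data set is left untouched
mtcars2 <- mtcars
mtcars2$am <- factor(mtcars2$am, levels = c(0, 1),
                     labels = c("automatic", "manual"))
mtcars2$cyl <- factor(mtcars2$cyl)

# The legend now shows "automatic"/"manual" rather than 0/1
p_am <- ggplot(mtcars2, aes(x = mpg, y = disp, color = am)) +
  geom_point()
p_am
```

Converting once in the data frame avoids having to wrap the variable in factor() in every plotting call.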

ggplot()

  • More commonly, we’ll be using the ggplot() function.
  • ggplot() is the workhorse function in ggplot2 and will be able to do a lot of things that qplot() can’t.
  • ggplot() is based on the Grammar of Graphics
  • Rather than specifying graphical features of our plot with arguments in a function, we will add them (literally by using +) to a ggplot object layer by layer.

Grammar of Graphics

  • The Grammar of Graphics is a foundational framework for creating data visualizations.

  • It provides a structured approach to visualizing data, emphasizing the importance of consistency and repeatability in creating graphics.

  • The concept was introduced by Leland Wilkinson and is implemented in various data visualization libraries, including ggplot2 in R.

Key components

Required:

  • Data: The dataset you want to visualize.
  • Aesthetic Mapping: How data variables are mapped to visual properties (e.g., x and y positions, colors, sizes).
  • Geometric Objects: The shapes and marks used to represent data points (e.g., points, lines, bars).

Optional:

  • Statistical Transformation: Optional data summarization or transformation (e.g., mean, median).
  • Facets: How data is split into subplots for easier comparison.
  • Coordinates: The system that specifies how data points are arranged in the plot (e.g., Cartesian, polar).
  • Theme: all non-data related markings

Basic Usage

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPING>))

Key Components:

  • data: The tidy dataset containing the variables you want to visualize.
  • <GEOM_FUNCTION>: The geometric function that defines the type of plot (e.g., geom_point, geom_bar).
  • <MAPPING>: Aesthetic mappings that specify how variables are visually represented

Basic Usage

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPING>)) +
  <FACET_FUNCTION>

Key Components:

  • data: The tidy dataset containing the variables you want to visualize.
  • <GEOM_FUNCTION>: The geometric function that defines the type of plot (e.g., geom_point, geom_bar).
  • <MAPPING>: Aesthetic mappings that specify how variables are visually represented
  • <FACET_FUNCTION>: uses facet_wrap() or facet_grid()
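As one possible way to fill in the template, the sketch below substitutes mtcars for the data, geom_point for the geom function, and facet_wrap(~ gear) for the facet function. These substitutions are arbitrary choices for illustration.

```r
library(ggplot2)

# <DATA> = mtcars, <GEOM_FUNCTION> = geom_point,
# <MAPPING> = x = mpg, y = disp, <FACET_FUNCTION> = facet_wrap(~ gear)
p_template <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = mpg, y = disp)) +
  facet_wrap(~ gear)
p_template
```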

Notation

  • aes: aesthetic attributes, i.e. how data are mapped (e.g. colour, shape, size, and more)
  • geoms: geometric objects (e.g. points, lines, bars, and more)
  • facets: for forming multi-panel plots; see faceting
  • stats: for statistical transformation (e.g. smoothing)
  • coordinate system (e.g. \(x\) and \(y\) axis)

General workflow

  1. Create a ggplot object

  2. Identify your data and basic aesthetics (identify \(x\) and \(y\) variables for example)

  3. Save this to an R object (which will be ggplot class); standard convention is to call this object g.

library(ggplot2)
g = ggplot(mtcars, aes(mpg, disp)) # this will not plot anything
class(g)
[1] "gg"     "ggplot"

Blank canvas

ggplot(mtcars)

Axis

ggplot(mtcars, aes(mpg, disp))

g

Let’s save this to an object:

# this will not plot anything
# mpg is on the x-axis
# disp is on the y-axis
g = ggplot(mtcars, aes(mpg, disp))

# need to call the ggplot
# object to plot/view
g

At minimum a ggplot requires: data, a geom function, and aes mapping.

Geometric markings

Now we will need to add geometric markings on this plot using some <GEOM_FUNCTION>. Examples include:

  • geom_point() creates geometric points
  • geom_bar() creates barplots
  • geom_boxplot() creates a boxplot
  • geom_histogram() creates a histogram
  • geom_density() creates a smoothed density estimate
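A short sketch of several of these geoms applied to mtcars follows. The choice of variables and of bins = 10 is an assumption made for illustration.

```r
library(ggplot2)

# One continuous variable: histogram and smoothed density
p_hist    <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p_density <- ggplot(mtcars, aes(x = mpg)) + geom_density()

# One discrete variable: bar plot of counts
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Discrete x, continuous y: one boxplot per group
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

p_hist
p_density
p_bar
p_box
```

Note that the geom determines which aesthetics are needed: the histogram, density, and bar plot only take x, while the boxplot takes both x and y.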

Adding layers

To add the geometric object layers we could write:

g + geom_point()

geom smooth

We can keep adding on layers, e.g. let’s add a smoothed line or curve to a scatter plot, helping to visualize trends or relationships in the data.

g + geom_point() + geom_smooth(method ="lm")
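With method = "lm" the added layer is a straight least-squares line with a confidence band. As a sketch of the alternatives, the code below compares it with a loess curve and switches the band off with se = FALSE; the explicit formula = y ~ x argument only silences the message ggplot2 otherwise prints.

```r
library(ggplot2)

g <- ggplot(mtcars, aes(mpg, disp))

# Straight least-squares line with its confidence band
p_lm <- g + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Flexible loess curve, confidence band switched off
p_loess <- g + geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p_lm
p_loess
```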

Themes

We can change the theme from gray to black and white.

g + geom_point() + theme_bw() 

Faceting with ggplot

g + geom_point() + facet_grid(.~cyl)

To create panels we need the faceting functions: facet_wrap() or facet_grid().

Changing defaults

We can override the default axis labels or legend keys using the following helper functions:

  • xlab(), ylab(), ggtitle(), and labs() to modify axis, legend, and plot labels

We can manage geom objects using arguments in the geom function, e.g.

geom_point(color = ___, size=____, alpha=_____)

where alpha controls the transparency.

Example

g + geom_point(color = "red")

Comment

  • When you specify color within a geom_*() function, you are setting a static, constant color for the entire layer.

  • When you specify color within the aes() function, you are mapping a variable to color, which makes the color a function of the data.

  • In ggplot2, you can specify the aesthetics (aes) at various levels of your plot creation including within individual geom_*() functions (layers) …
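The difference between setting and mapping can be sketched as follows. The third plot illustrates a common pitfall and is included only as a cautionary example; all object names are hypothetical.

```r
library(ggplot2)

# Global mapping: x and y are inherited by every layer
g <- ggplot(mtcars, aes(mpg, disp))

# Setting: constant colour, given outside aes()
p_set <- g + geom_point(color = "red")

# Mapping: colour depends on the data, given inside aes() for this layer only
p_map <- g + geom_point(aes(color = factor(cyl)))

# Pitfall: inside aes(), "red" is treated as a one-level variable,
# so the points get a default palette colour and a legend entry "red"
p_pitfall <- g + geom_point(aes(color = "red"))

p_set
p_map
p_pitfall
```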

Colour with factor variable

# g = ggplot(mtcars, aes(mpg, disp))    # aes can be given at this level
g + geom_point(aes(color=factor(cyl)))  # or this level

We can add colours and legends by specifying aesthetics.

Colour with continuous variable

# g = ggplot(mtcars, aes(mpg, disp))   
g + geom_point(aes(color=wt))  # continuous variable for colour

When our variable is continuous, ggplot2 uses a gradient colour scale instead.
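The default gradient can be replaced with scale_color_gradient(). The sketch below uses two arbitrarily chosen colours, so that lighter cars appear light blue and heavier cars dark blue.

```r
library(ggplot2)

g <- ggplot(mtcars, aes(mpg, disp))

# Continuous colour with a custom two-colour gradient
p_wt <- g + geom_point(aes(color = wt)) +
  scale_color_gradient(low = "lightblue", high = "darkblue")
p_wt
```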

Alpha transparency

p = g + geom_point(aes(color=factor(cyl)), alpha=0.4, size=5); p

We change the transparency of points using alpha (0 = fully transparent, 1 = opaque). Notice how coincident (i.e. overlapping) points are more obvious when the points are more transparent. size = 5 is used to make the points larger than the default size of 1.5.

Modify labels

p + labs(title="Old cars", x="Miles per Gallon", y ="Displacement",
         color = "Number of Cylinders") # changes legend title

Notice how labels were added on a separate line of code.
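As a sketch of the alternative, the same title and axis labels can be set with the individual helper functions ggtitle(), xlab(), and ylab(). The legend title is still set through labs() here.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point(aes(color = factor(cyl)), alpha = 0.4, size = 5)

# Same title and axis labels as the labs() call, one helper per label
p_labelled <- p +
  ggtitle("Old cars") +
  xlab("Miles per Gallon") +
  ylab("Displacement") +
  labs(color = "Number of Cylinders")  # legend title
p_labelled
```

Because labels are just another layer added with +, they can be changed later without touching the code that draws the points.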

Final remarks

  • ggplot2 is a very powerful and flexible tool for creating good looking graphics.
  • While the learning curve may be a little steeper than with base R, ggplot2 offers several advantages over base R plotting that make it a popular choice for data visualization.
  • This is only a short demonstration of the power of this package.
  • Explore the different features using the references provided.