Data 550: Data Visualization I

Lecture 4b: Comparing Distributions in R

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

  • Up until this point we have provided examples mostly in Altair with the understanding that ggplot has a similar counterpart.

  • Altair is relatively new, whereas ggplot2 is one of the most widely used and best documented packages in R, so ggplot2 has some functionality that Altair has yet to implement.

  • One such example is violin plots.

Learning Outcomes

  • Create density plots, box plots, and violin plots using ggplot2

Data

Below is the preprocessed movies data frame (to see how it was processed, see the accompanying ipynb).

# load the cleaned movies data; genres and countries are left as list-columns
library(rjson)
library(tidyverse)
movies <- fromJSON(file = 'data/lec-movies.json') %>%
    as_tibble() %>%
    unnest(-c(countries, genres))
    
head(movies)

Histogram

Let’s recall how to make a histogram.

ggplot(movies, aes(x = runtime)) +
    geom_histogram(color = 'white')

Density plot

Unlike Altair, ggplot has its own density mark, …

ggplot(movies, aes(x = runtime)) +
    geom_density(fill = 'grey', alpha = 0.7)

Unnesting the data

We need to unnest/explode on genres and countries.

free_genres <- movies %>% unnest(genres)
free_countries <- movies  %>%  unnest(countries)
free_both <- movies %>% unnest(genres) %>%  unnest(countries)
free_genres %>% 
  filter(title == "All Dogs Go to Heaven") %>% 
  select(genres, countries)
free_both %>% 
  filter(title ==  "All Dogs Go to Heaven") %>% 
  select(genres, countries)
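
To see what unnest() does to the number of rows, here is a minimal sketch on a made-up two-row tibble. The object names (toy, toy_genres, toy_both) and the film titles are hypothetical and are not part of the movies data.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: genres and countries are list-columns
toy <- tibble(
  title     = c("Film A", "Film B"),
  runtime   = c(90, 120),
  genres    = list(c("Drama", "Comedy"), "Action"),
  countries = list("Canada", c("Canada", "France"))
)

# One row per (title, genre): Film A is repeated once per genre
toy_genres <- toy %>% unnest(genres)

# One row per (title, genre, country) combination
toy_both <- toy %>% unnest(genres) %>% unnest(countries)

nrow(toy)         # 2
nrow(toy_genres)  # 3
nrow(toy_both)    # 4
```

Because each unnest() repeats the other columns, a film with several genres or countries contributes its runtime several times to any plot built from the unnested data.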

Layered Density Plot

ggplot(free_genres, aes(x = runtime,
        fill = genres,
        color = genres)) +
    geom_density(alpha = 0.6)

Notice how you can add the aesthetic rather than including it as an argument within ggplot():

ggplot(free_genres) +
    aes(x = runtime,
        fill = genres,
        color = genres) +
    geom_density(alpha = 0.6)

Faceting

ggplot(free_both) +
    aes(x = runtime, fill = genres, color = genres) +
    geom_density(alpha = 0.6) +
    facet_wrap(~countries)

Faceting (row and column)

ggplot(free_both) +
    aes(x = runtime, fill = genres, color = genres) +
    geom_density(alpha = 0.6, show.legend = FALSE) +
    facet_grid(countries ~ genres)

Boxplots

As in Altair, ggplot unsurprisingly has a boxplot geom, e.g.

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_boxplot()

Scaled Boxplots

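With varwidth = TRUE, the width of each box reflects the number of observations in its group (widths are proportional to the square root of the group size), e.g.

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_boxplot(varwidth = TRUE)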
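
A quick way to see the effect of varwidth is to use groups of very different sizes. The sketch below uses simulated data; the data frame sim and its group labels are made up for illustration and are not part of the movies data.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: three groups with 10, 100 and 1000 observations
sim <- data.frame(
  group = rep(c("small", "medium", "large"), times = c(10, 100, 1000)),
  value = rnorm(1110)
)

p <- ggplot(sim) +
  aes(x = value, y = group, fill = group) +
  geom_boxplot(varwidth = TRUE)  # box width now depends on group size
p
```

With these group sizes the boxes should come out visibly different in width, with the "small" group drawn thinnest.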

Violin Plots

The change from boxplot to violin is extremely simple; we only swap the geom:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin()

What are violin plots

  • Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.

  • Typically, violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots.

  • The function geom_violin() is used to produce a violin plot. By default it draws only the density outline, without the median or interquartile range markers.
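
One common way to get the median and interquartile range markers described above is to layer a narrow box plot on top of the violin. The following is a minimal sketch on simulated data; the data frame sim, its group labels, and the chosen width of 0.1 are illustrative assumptions, not part of the lecture's movies example.

```r
library(ggplot2)

set.seed(42)
# Hypothetical data standing in for a numeric variable split by category
sim <- data.frame(
  group = rep(c("a", "b"), each = 200),
  value = c(rnorm(200, mean = 100, sd = 15),
            rnorm(200, mean = 120, sd = 25))
)

p <- ggplot(sim) +
  aes(x = value, y = group, fill = group) +
  geom_violin(alpha = 0.6) +                 # kernel density outline
  geom_boxplot(width = 0.1, fill = "white",  # narrow box: median and IQR
               outlier.shape = NA)           # hide outlier points
p
```

The box plot is added after the violin so that it is drawn on top of it.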

Violin vs Faceted Density Plots

ggplot(free_both) +
    aes(x = runtime, 
        y = genres,
        fill = genres) +
    geom_violin()

ggplot(free_both) +
    aes(x = runtime, fill = genres, color = genres) +
    geom_density(alpha = 0.6) +
    facet_wrap(~genres, ncol = 1)

Faceted Boxplots

As with our density plots, we can also facet by country, e.g.

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_boxplot() +
    facet_wrap(~countries)

Faceted Violin Plots

To get the violin plots, we simply change the geom:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() +
    facet_wrap(~countries)

Layering Quantiles

We can layer the quantiles shown in the box plots onto the violins:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
    facet_wrap(~countries)

Comments

  • When possible, it is a good idea to have a look at where the individual data points are.

  • Of course, we could always layer the individual observations onto our plot (using geom_point(), for example).

  • However when we have a lot of data, this could be impossible to read.

  • For this we can use a categorical scatter plot, created with geom_jitter(), where the dots are spread (jittered) randomly along the non-value axis so that they don’t all overlap.
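
The idea of spreading points only along the non-value axis can be sketched as follows. This example uses simulated data (the data frame sim is made up) and passes position_jitter() to geom_point() so that a seed can be set. The particular width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(7)
# Hypothetical data: rounded values so that many points overlap exactly
sim <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = round(rnorm(100, mean = 100, sd = 10))
)

p <- ggplot(sim) +
  aes(x = value, y = group) +
  # width = 0 leaves the value (x) axis untouched;
  # height = 0.2 spreads points vertically within each category;
  # seed makes the random jitter reproducible between runs
  geom_point(position = position_jitter(width = 0, height = 0.2, seed = 1),
             alpha = 0.5)
p
```

Keeping width = 0 means the noise never changes the values being displayed, only the position within each category.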

Layering Points

We can layer the points onto the violin plots:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() + geom_point() +
    facet_wrap(~countries)

Jittering Data

“Jittering” adds a small amount of random noise to the location of each point.

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() + geom_jitter() +
    facet_wrap(~countries)

Order matters

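We can change the default jitter height and the order of the layers. Layers are drawn in the order they are added, so here the violins are drawn on top of the points.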

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_jitter(height = 0.2, alpha = 0.3) + geom_violin() + 
    facet_wrap(~countries)

Unfaceting

Rather than faceting, we could fill by countries:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = countries) + 
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))