Data Visualization - Data 550: Data Visualization I

Introduction

Up to this point we have provided examples mostly in Altair, with the understanding that ggplot has a similar counterpart for each.
Because Altair is relatively new, whereas ggplot2 is one of the most widely used and best-documented packages in R, ggplot2 offers some functionality that Altair has yet to implement.
One such example is violin plots.

Learning Outcomes

Create density plots, box plots, and violin plots using ggplot

Data

Below is the preprocessed movies data frame (see the accompanying ipynb for how it was processed).

# Load the cleaned movies data; unnest every column except the
# list columns countries and genres
library(rjson)
library(tidyverse)
movies <- fromJSON(file = 'data/lec-movies.json') %>%
    as_tibble() %>%
    unnest(-c(countries, genres))
    
head(movies)

Histogram

Let’s recall how to make a histogram.

ggplot(movies, aes(x = runtime)) +
    geom_histogram(color = 'white')
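
By default geom_histogram() uses 30 bins and prints a message suggesting that you pick a better value. A minimal sketch of setting the bin width explicitly is shown below; the toy data frame and its runtime values are invented for illustration and stand in for movies.

```r
library(ggplot2)

# Hypothetical stand-in for the movies data; the values are invented.
toy <- data.frame(
    runtime = c(81, 88, 90, 95, 97, 102, 105, 110, 118, 125, 140)
)

# binwidth = 10 groups runtimes into 10-minute bins instead of
# relying on the default of 30 bins.
p_hist <- ggplot(toy, aes(x = runtime)) +
    geom_histogram(binwidth = 10, color = 'white')

p_hist
```

The right bin width depends on the data, so it is worth trying a few values before settling on one.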

Density plot

Unlike Altair, ggplot has its own density mark, …

ggplot(movies, aes(x = runtime)) +
    geom_density(fill = 'grey', alpha = 0.7)
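
As a rough sketch of what sits behind the curve, base R's density() computes a kernel density estimate, and the adjust argument of geom_density() scales the bandwidth of that estimate. The toy values are invented, and the exact evaluation range used by ggplot may differ from the base R call, so treat this as an illustration rather than a reproduction of the plot above.

```r
library(ggplot2)

# Hypothetical stand-in for the movies data; the values are invented.
toy <- data.frame(
    runtime = c(81, 88, 90, 95, 97, 102, 105, 110, 118, 125, 140)
)

# Kernel density estimate with base R (Gaussian kernel by default).
kde <- density(toy$runtime)

# Approximate the area under the curve with a Riemann sum;
# a density should integrate to roughly 1.
area <- sum(kde$y) * diff(kde$x[1:2])

# adjust multiplies the bandwidth: values below 1 give a wigglier
# curve, values above 1 a smoother one.
p_dens <- ggplot(toy, aes(x = runtime)) +
    geom_density(fill = 'grey', alpha = 0.7, adjust = 0.5)

p_dens
```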

Unnesting the data

We need to unnest (explode) the genres and countries list columns so that each genre and each country gets its own row.

free_genres <- movies %>% unnest(genres)
free_countries <- movies  %>%  unnest(countries)
free_both <- movies %>% unnest(genres) %>%  unnest(countries)

free_genres %>% 
  filter(title == "All Dogs Go to Heaven") %>% 
  select(genres, countries)

free_both %>% 
  filter(title ==  "All Dogs Go to Heaven") %>% 
  select(genres, countries)
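
To see what unnesting does to the row count, here is a minimal sketch on a made-up two-movie tibble with the same list-column structure. The titles, genres, and countries are invented for illustration.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical data with list columns, mimicking the shape of movies.
toy <- tibble(
    title = c("Movie A", "Movie B"),
    genres = list(c("Drama", "Comedy"), "Action"),
    countries = list(c("Canada", "France"), "Japan")
)

# One row per (title, genre); countries is still a list column.
toy_genres <- toy %>% unnest(genres)

# One row per (title, genre, country) combination.
toy_both <- toy %>% unnest(genres) %>% unnest(countries)

nrow(toy_genres)  # 3
nrow(toy_both)    # 5
```

Note that a movie with two genres and two countries contributes four rows after both unnests, so such movies carry more weight in any plot built from the fully unnested data.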

Layered Density Plot

ggplot(free_genres, aes(x = runtime,
        fill = genres,
        color = genres)) +
    geom_density(alpha = 0.6)

Notice how you can add the aesthetic rather than including it as an argument within ggplot():

ggplot(free_genres) +
    aes(x = runtime,
        fill = genres,
        color = genres) +
    geom_density(alpha = 0.6)

Faceting

ggplot(free_both) +
    aes(x = runtime, fill = genres, color = genres) +
    geom_density(alpha = 0.6) +
    facet_wrap(~countries)

Faceting (row and column)

# show.legend is a layer argument; ggplot() itself ignores it
ggplot(free_both) +
    aes(x = runtime, fill = genres, color = genres) +
    geom_density(alpha = 0.6, show.legend = FALSE) +
    facet_grid(countries~genres)

Boxplots

As in Altair, ggplot unsurprisingly has a boxplot geom, e.g.

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_boxplot()

Scaled Boxplots

Setting varwidth = TRUE scales the width of each box in proportion to the square root of the number of observations in its group, e.g.

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_boxplot(varwidth = TRUE)

Violin Plots

Changing from a boxplot to a violin plot only requires swapping the geom:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin()

What are violin plots?

Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.
Typically, violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots.
The function geom_violin() is used to produce a violin plot.
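
On its own, geom_violin() draws only the density outline. One common way to add the median marker and interquartile-range box described above is to layer a narrow geom_boxplot() on top. A minimal sketch with simulated data follows; the toy data frame, the box width of 0.1, and the white fill are illustrative choices, not part of the lecture data.

```r
library(ggplot2)

# Simulated stand-in for free_both; the values are invented.
set.seed(1)
toy <- data.frame(
    genres = rep(c("Comedy", "Drama"), each = 50),
    runtime = c(rnorm(50, mean = 95, sd = 10),
                rnorm(50, mean = 115, sd = 15))
)

p_violin <- ggplot(toy) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() +
    # Narrow box inside each violin; the fill is overridden so the
    # box stands out against the coloured violin.
    geom_boxplot(width = 0.1, fill = 'white')

p_violin
```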

Violin vs Faceted Density Plots

ggplot(free_both) +
    aes(x = runtime, 
        y = genres,
        fill = genres) +
    geom_violin()

ggplot(free_both) +
    aes(x = runtime, fill = genres, color = genres) +
    geom_density(alpha = 0.6) +
    facet_wrap(~genres, ncol = 1)

Faceted Boxplots

As with our density plots, we can also facet by country, e.g.

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_boxplot() +
    facet_wrap(~countries)

Faceted Violin Plots

To get the violin plots, we simply change the geom:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() +
    facet_wrap(~countries)

Layering Quantiles

We can layer the quantiles shown in the box plots onto the violins:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
    facet_wrap(~countries)
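
To check the numbers behind those lines, the sample quartiles can be computed per genre with dplyr. This is a minimal sketch on invented data; note that draw_quantiles is documented as using quantiles of the density estimate, so the drawn lines may differ slightly from the sample quartiles computed here.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for free_both; the values are invented.
toy <- tibble(
    genres = rep(c("Comedy", "Drama"), each = 5),
    runtime = c(80, 90, 100, 110, 120,
                100, 110, 120, 130, 140)
)

# First quartile, median, and third quartile of runtime per genre.
quartiles <- toy %>%
    group_by(genres) %>%
    summarise(
        q25 = quantile(runtime, 0.25),
        median = median(runtime),
        q75 = quantile(runtime, 0.75)
    )

quartiles
```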

Comments

When possible, it is a good idea to look at where the individual data points are.
We could always layer another mark on top of our data (using geom_point(), for example).
However, when we have a lot of data, the overlapping points can be impossible to read.
In that case we can use a categorical scatter plot, via geom_jitter(), where the dots are spread (jittered) randomly along the non-value axis so that they don’t all overlap.

Layering Points

We can layer the points onto the violin plots:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() + geom_point() +
    facet_wrap(~countries)

Jittering Data

“Jittering” adds a small amount of random noise to the location of each point:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() + geom_jitter() +
    facet_wrap(~countries)

Order matters

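We can change the default jitter height and the order of the layers. Layers are drawn in the order they are added, so here the violins are drawn on top of the points: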

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_jitter(height = 0.2, alpha = 0.3) + geom_violin() + 
    facet_wrap(~countries)
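
Because jitter is random, the points land in slightly different places each time the plot is drawn, and by default the noise is applied in both directions. A sketch of making the jitter reproducible and restricting it to the categorical axis, using position_jitter() with its seed argument, is shown below. The data are simulated and the seed value of 42 is arbitrary.

```r
library(ggplot2)

# Simulated stand-in for free_both; the values are invented.
set.seed(1)
toy <- data.frame(
    genres = rep(c("Comedy", "Drama"), each = 30),
    runtime = c(rnorm(30, mean = 95, sd = 10),
                rnorm(30, mean = 115, sd = 15))
)

p_jitter <- ggplot(toy) +
    aes(x = runtime, y = genres, fill = genres) +
    geom_violin() +
    # width = 0 leaves runtime untouched; height = 0.2 spreads the
    # points along the genre axis only. seed fixes the random noise.
    geom_point(
        position = position_jitter(width = 0, height = 0.2, seed = 42),
        alpha = 0.3
    )

p_jitter
```

Keeping the noise off the runtime axis means the horizontal position of every point still shows its true value.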

Unfaceting

Rather than faceting, we could map countries to the fill colour:

ggplot(free_both) +
    aes(x = runtime, y = genres, fill = countries) + 
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))