Lecture 5: Data Wrangling

DATA 101: Making Prediction with Data

Dr. Irene Vrbik

University of British Columbia Okanagan

Outline

In today’s lecture we’ll be looking at dplyr functions.

Introduction

  • Most of the data sets we have dealt with so far have been in a condition that is readily usable in R.
  • More often than not, however, we will have to reconfigure our data into another format to make it appropriate for downstream analytics.
  • The general task of preparing your data for future analysis is common in this field and falls under the process of data wrangling.

What is data wrangling

Some aspects of data wrangling (aka data munging) include:

  • Collection: Gathering data from various sources
  • Cleaning: Identifying anomalies, missing values, and outliers.
  • Transformation: Converting data into a consistent format
  • Feature Engineering: Creating new features or variables
  • Data Reduction: Reducing the dimensionality of the data
  • Documentation: Documenting the steps taken during data wrangling for reproducibility and transparency.

Comment

  • Data wrangling is a crucial step in the data analysis process because the quality of the data greatly influences the accuracy and reliability of subsequent analyses

  • In fact, it has been claimed that 50 to 80 percent of the whole data analysis pipeline in a typical machine learning/deep learning project can be attributed to this process.

Notation

In this lecture we aim to go from:

\[\text{raw data} \to \text{tidy data}\]

For the purpose of this lecture, I will refer to “raw” data as how the data is first presented to us and “tidy” data as the cleaned/wrangled data that meets the following criteria:

  • each row is a single observation,
  • each variable is a single column, and
  • each cell contains a single value.

Fig 6.1 from R for data science: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.
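
To make these three rules concrete, here is a small illustration with made-up numbers (hypothetical data, not the Chicago data used later): the same measurements stored in an untidy “wide” layout versus a tidy layout.

# Untidy: one column per year, so each row holds two observations
untidy <- data.frame(city = c("Kelowna", "Victoria"),
                     `2022` = c(10, 12),
                     `2023` = c(11, 13),
                     check.names = FALSE)

# Tidy: each row is one (city, year) observation, each column is one
# variable, and each cell holds a single value
tidy <- data.frame(city  = c("Kelowna", "Kelowna", "Victoria", "Victoria"),
                   year  = c(2022, 2023, 2022, 2023),
                   value = c(10, 11, 12, 13))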

Data frames and help files

  • We have discussed how a data frame is a key data structure in statistics and in R.
  • Data frames represent tabular data that hold observations in the rows and variables in the columns.
  • Metadata in the form of column names is useful to describe each variable and provide a unique identifier.
  • Metadata in the form of help files, describing things like where the data was sourced from and what each variable means in more detail, can be very helpful.

Meta data

  • Metadata is often referred to as “data about data”
  • It provides information that describes and provides context for other data.
  • It helps users and systems understand the content, structure, and characteristics of the underlying data.
  • Metadata is essential for managing, organizing, and retrieving data efficiently (see the short example below).
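
For example, the built-in data sets that ship with R come with exactly this kind of metadata in their help files (using airquality purely as an illustration):

?airquality           # help file: what each variable means, units, and data source
str(airquality)       # structure: variable names, types, and dimensions
colnames(airquality)  # column names provide a unique identifier for each variable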

Most of the functions we use in R require that our data is in a tidy format.

dplyr

  • The dplyr package is a very helpful R package for managing tidy data frames.
  • It comes as part of the tidyverse ecosystem and is widely used in data analysis and data science workflows.
  • This package was developed by Hadley Wickham of the Posit team (note name change: RStudio is now Posit!).
  • While we can do much of what these functions perform using base R code, the dplyr versions will run much faster (as they call out to C++) and you may find them easier and more intuitive to work with.

dplyr functions

  • select() to return a subset of the columns
  • filter() to extract a subset of rows
  • arrange() to reorder rows of a data frame
  • rename() to rename variables in a data frame
  • mutate() to add new variables/columns
  • summarise() (or summarize()) to generate summary statistics
  • %>% (or the native pipe |>) is the pipe operator, used to connect multiple verb actions together into a pipeline (see the short sketch after this list)
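
As a quick preview of how these verbs fit together (a minimal sketch using the built-in mtcars data rather than the Chicago data used later):

library(dplyr)

mtcars %>%
  select(mpg, cyl, hp) %>%          # keep a subset of columns
  filter(hp > 100) %>%              # keep a subset of rows
  mutate(kpl = 0.425 * mpg) %>%     # add a new variable (km per litre)
  arrange(desc(kpl)) %>%            # reorder the rows
  group_by(cyl) %>%                 # group by number of cylinders
  summarise(mean_kpl = mean(kpl))   # one summary row per group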

Example

The original data can be downloaded from here

chicago <- readRDS("data/chicago.rds")
str(chicago)
'data.frame':   6940 obs. of  8 variables:
 $ city      : chr  "chic" "chic" "chic" "chic" ...
 $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
 $ date      : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num  34 NA 34.2 47 NA ...
 $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...

This data set deals with air pollution and weather variables from the city of Chicago, from 1987–2005.

Chicago Data set

chicago

Extracting Columns

If we want to extract the second to fourth columns, we’ve seen how we could use:

head(chicago[,2:4], 3)
  tmpd   dptp       date
1 31.5 31.500 1987-01-01
2 33.0 29.875 1987-01-02
3 33.0 27.375 1987-01-03

However, the colon operator : will not work with names:

chicago[,"tmpd":"date"] # produces as error

Base R select

If we know our relevant column names, we could either reference them directly or convert them to column indices:

head(chicago[,c("tmpd", "dptp", "date")],4) # this ok but long
  tmpd   dptp       date
1 31.5 31.500 1987-01-01
2 33.0 29.875 1987-01-02
3 33.0 27.375 1987-01-03
4 29.0 28.625 1987-01-04
ind <- match("tmpd", colnames(chicago)):match("date", colnames(chicago))
head(chicago[,ind],4)
  tmpd   dptp       date
1 31.5 31.500 1987-01-01
2 33.0 29.875 1987-01-02
3 33.0 27.375 1987-01-03
4 29.0 28.625 1987-01-04

dplyr select

An easier/faster approach uses the select() function from the dplyr package:

library(dplyr)
head(select(chicago, tmpd:date))
  tmpd   dptp       date
1 31.5 31.500 1987-01-01
2 33.0 29.875 1987-01-02
3 33.0 27.375 1987-01-03
4 29.0 28.625 1987-01-04
5 32.0 28.875 1987-01-05
6 40.0 35.125 1987-01-06

select

select() makes it easy to refer to variables by name and also by type:

  • e.g. where(is.numeric) selects all numeric columns.
  • e.g. a:f selects all columns from a to f (inclusive)

We could also use it to exclude certain columns/variables:

head(select(chicago, -(tmpd:date)))
  city pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic         NA   34.00000 4.250000  19.98810
2 chic         NA         NA 3.304348  23.19099
3 chic         NA   34.16667 3.333333  23.81548
4 chic         NA   47.00000 4.375000  30.43452
5 chic         NA         NA 4.750000  30.33333
6 chic         NA   48.00000 5.833333  25.77233
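
The helper functions mentioned above can also be used directly with the Chicago data; a short sketch (output omitted):

head(select(chicago, where(is.numeric)), 3)  # all numeric columns
head(select(chicago, starts_with("pm")), 3)  # columns whose names start with "pm"
head(select(chicago, ends_with("2")), 3)     # columns whose names end with "2"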

Extracting rows

  • Another handy function will allow us to extract certain rows from a data frame.
  • Say for instance we wanted to obtain all the data points from chicago for which the pm10tmean2 is not NA.
  • We already know a few ways to do this in base R:
this <- chicago[!is.na(chicago$pm10tmean2),]
that <- subset(chicago, !is.na(pm10tmean2))
all.equal(this, that)
[1] TRUE

filter

In the dplyr package this can be done using filter():

chic.f <- filter(chicago, !is.na(pm10tmean2))
all.equal(this, chic.f)
[1] "Attributes: < Component \"row.names\": Mean relative difference: 0.05941197 >"

We could also specify multiple conditions:

head(filter(chicago, !is.na(pm10tmean2) & tmpd > 80))
  city tmpd   dptp       date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic 83.0 68.500 1987-06-14         NA   43.66667 45.42188  25.99583
2 chic 80.5 62.750 1987-06-18         NA   90.00000 62.96966  43.81667
3 chic 80.5 69.250 1987-07-08         NA   38.00000 30.85168  23.01479
4 chic 80.5 67.625 1987-07-09         NA   31.00000 24.94792  20.15783
5 chic 84.5 73.625 1987-07-19         NA   72.00000 47.05208  13.37153
6 chic 81.5 67.625 1987-07-20         NA   65.42857 40.49896  18.45833
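
Note that filter() also accepts several conditions separated by commas (which are combined with a logical AND), and the between() helper can shorten range conditions; a sketch (output omitted):

# commas act like &: non-missing PM10 and a hot day
filter(chicago, !is.na(pm10tmean2), tmpd > 80)

# between(x, lo, hi) is shorthand for x >= lo & x <= hi
filter(chicago, between(tmpd, 70, 80))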

arrange

  • Sometimes we wish to reorder our rows in a way that preserves the corresponding order of the other columns.

  • We have seen a rather tedious way of performing this task using sort() and order()

  • The arrange() function from the dplyr package provides a simple way of reordering rows of a data frame according to an expression involving its variables/columns.

Ascending example

For instance, we might want to arrange our data by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.

head(chicago <- arrange(chicago, date))
  city tmpd   dptp       date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic 31.5 31.500 1987-01-01         NA   34.00000 4.250000  19.98810
2 chic 33.0 29.875 1987-01-02         NA         NA 3.304348  23.19099
3 chic 33.0 27.375 1987-01-03         NA   34.16667 3.333333  23.81548
4 chic 29.0 28.625 1987-01-04         NA   47.00000 4.375000  30.43452
5 chic 32.0 28.875 1987-01-05         NA         NA 4.750000  30.33333
6 chic 40.0 35.125 1987-01-06         NA   48.00000 5.833333  25.77233

Descending example

To change this to dates descending (so that the most recent observation appears first), we simply write:

head(chicago <- arrange(chicago, desc(date)))
  city tmpd dptp       date pm25tmean2 pm10tmean2  o3tmean2 no2tmean2
1 chic   35 30.1 2005-12-31   15.00000       23.5  2.531250  13.25000
2 chic   36 31.0 2005-12-30   15.05714       19.2  3.034420  22.80556
3 chic   35 29.4 2005-12-29    7.45000       23.5  6.794837  19.97222
4 chic   37 34.5 2005-12-28   17.75000       27.5  3.260417  19.28563
5 chic   40 33.6 2005-12-27   23.56000       27.0  4.468750  23.50000
6 chic   35 29.6 2005-12-26    8.40000        8.5 14.041667  16.81944
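
arrange() can also sort by more than one column, breaking ties in the first column using the second; for example (a sketch, output omitted):

# coldest days first; ties in temperature broken by most recent date first
head(arrange(chicago, tmpd, desc(date)))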

rename

  • We may want to rename columns in our data set to be more informative, easier to type (e.g. shorter), or to work better with functions in R. The general syntax is:
rename(iris, new_name = old_name)
  • We could easily achieve this using the rename() function from dplyr
lifeExp <- as.data.frame(state.x77)
lifeExp <- rename(lifeExp, Life.Exp = `Life Exp`,
                  HS.Grad = `HS Grad`)
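
Applied to the Chicago data, we might shorten the pollutant names in the same way (a sketch; chicago_short is just an illustrative name, and the remaining slides keep the original column names):

chicago_short <- rename(chicago, pm25 = pm25tmean2, pm10 = pm10tmean2,
                        o3 = o3tmean2, no2 = no2tmean2)
names(chicago_short)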

mutate

  • The mutate() function is used for transforming variables in a data frame.
  • Say we want to create a new variable that scales an existing variable in the data frame to have a mean of 0 and standard deviation of 1.
  • We can do this very easily using mutate() and the scale() function (see ?scale).

Scaled variable example

temp <- mutate(chicago, pm25scaled = scale(pm25tmean2))
temp

Notice how this data frame contains both the old variable pm25tmean2 and the scaled version pm25scaled.
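
For a single column, scale() amounts to subtracting the mean and dividing by the standard deviation (computed over the non-missing values); a minimal sketch of the same calculation done by hand:

temp2 <- mutate(chicago,
                pm25scaled = (pm25tmean2 - mean(pm25tmean2, na.rm = TRUE)) /
                             sd(pm25tmean2, na.rm = TRUE))
# same values as scale(); scale() additionally returns a one-column matrix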

Example (cont’d)

To avoid storing the same information twice we may want to remove pm25tmean2 (the unscaled version) from the data frame.

head(select(temp, -(pm25tmean2)))
  city tmpd dptp       date pm10tmean2  o3tmean2 no2tmean2 pm25scaled
1 chic   35 30.1 2005-12-31       23.5  2.531250  13.25000 -0.1415260
2 chic   36 31.0 2005-12-30       19.2  3.034420  22.80556 -0.1349561
3 chic   35 29.4 2005-12-29       23.5  6.794837  19.97222 -1.0095666
4 chic   37 34.5 2005-12-28       27.5  3.260417  19.28563  0.1746477
5 chic   40 33.6 2005-12-27       27.0  4.468750  23.50000  0.8426366
6 chic   35 29.6 2005-12-26        8.5 14.041667  16.81944 -0.9003429

transmute

Related to mutate() is the transmute() function. It creates a new data frame containing only the newly created variables. For instance,

newdf <- transmute(chicago, pm25scaled = scale(pm25tmean2),
      pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE))
head(newdf) 
  pm25scaled pm10detrend
1 -0.1415260  -10.395206
2 -0.1349561  -14.695206
3 -1.0095666  -10.395206
4  0.1746477   -6.395206
5  0.8426366   -6.895206
6 -0.9003429  -25.395206

group_by

  • The group_by() function is used together with summarize() to generate summary statistics within groups defined by a variable.
  • You can think of this function as splitting our data frame into groups according to some expression; summary statistics can then be computed within these subsets.
  • Those of you familiar with SQL will recognize this GROUP BY feature within the context of a relational database.

Average PM2.5 Example

Determine the average level of PM2.5 for hot days (tmpd>80) and cold days (tmpd<80).

# create a new variable which will indicate 
# if the day was hot or cold.
chicago <- mutate(chicago, tempcat = 
                    factor(tmpd >80, labels=c("cold","hot")))

# split the data into hot days and cold days:
hotcold <- group_by(chicago, tempcat)
class(hotcold)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Sidenote: Tibbles

hotcold is a grouped data frame, or tibble (tbl for short), which can now be passed to the summarize() function.

hotcold
# A tibble: 6,940 × 9
# Groups:   tempcat [3]
   city   tmpd  dptp date       pm25tmean2 pm10tmean2 o3tmean2 no2tmean2 tempcat
   <chr> <dbl> <dbl> <date>          <dbl>      <dbl>    <dbl>     <dbl> <fct>  
 1 chic     35  30.1 2005-12-31      15          23.5     2.53      13.2 cold   
 2 chic     36  31   2005-12-30      15.1        19.2     3.03      22.8 cold   
 3 chic     35  29.4 2005-12-29       7.45       23.5     6.79      20.0 cold   
 4 chic     37  34.5 2005-12-28      17.8        27.5     3.26      19.3 cold   
 5 chic     40  33.6 2005-12-27      23.6        27       4.47      23.5 cold   
 6 chic     35  29.6 2005-12-26       8.4         8.5    14.0       16.8 cold   
 7 chic     35  32.1 2005-12-25       6.7         8      14.4       13.8 cold   
 8 chic     37  35.2 2005-12-24      30.8        25.2     1.77      32.0 cold   
 9 chic     41  32.6 2005-12-23      32.9        34.5     6.91      29.1 cold   
10 chic     22  23.3 2005-12-22      36.6        42.5     5.39      33.7 cold   
# ℹ 6,930 more rows
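
If we ever need to check or remove the grouping, dplyr provides group_vars() and ungroup() (a short sketch):

group_vars(hotcold)  # which variable defines the groups ("tempcat")
ungroup(hotcold)     # drop the grouping and return a regular tibble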

summarize

The summarize() (or summarise()) function applies a summary statistic to the subsets created by group_by().

  • To determine the average level of PM2.5 for hot days (tmpd>80) and cold days (tmpd<80), we write:
summarize(hotcold, pm25tmean2 = mean(pm25tmean2, na.rm = TRUE))
# A tibble: 3 × 2
  tempcat pm25tmean2
  <fct>        <dbl>
1 cold          16.0
2 hot           26.5
3 <NA>          47.7

Example continued

If we also want to know the max ozone level and median nitrogen dioxide level for hot and cold days, we could write:

summarize(hotcold, pm25tmean2 = mean(pm25tmean2, na.rm = TRUE),
          o3 = max(o3tmean2),
          no2 = median(no2tmean2))
# A tibble: 3 × 4
  tempcat pm25tmean2    o3   no2
  <fct>        <dbl> <dbl> <dbl>
1 cold          16.0 66.6   24.5
2 hot           26.5 63.0   24.9
3 <NA>          47.7  9.42  37.4

Example continued

If we want those summary statistics for the various years we could use:

library(lubridate)
chicago <- mutate(chicago, year = year(date))
# look at the last 6 lines 
tail(summarize(group_by(chicago, year), 
               pm25tmean2 = mean(pm25tmean2, na.rm = TRUE),
               o3 = max(o3tmean2),
               no2 = median(no2tmean2)))
# A tibble: 6 × 4
   year pm25tmean2    o3   no2
  <dbl>      <dbl> <dbl> <dbl>
1  2000       16.9  55.8  23.5
2  2001       16.9  51.8  25.1
3  2002       15.3  54.9  22.7
4  2003       15.2  56.2  24.6
5  2004       14.6  44.5  23.4
6  2005       16.2  58.8  22.6

Piping

  • The dplyr package implements a special pipe operator %>%.
  • This operator allows you to “pipe” or “chain” a number of functions together.
  • As of version 4.1 of R, you can also use the native pipe |> (this does not require you to load any packages).
  • It is a powerful tool for simplifying and improving the readability of code, especially when working with data manipulation and transformation tasks; a short sketch of how the pipe works appears below.
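
The key idea is that the pipe takes the object on its left-hand side and passes it as the first argument of the function on its right-hand side; a minimal sketch:

# these three calls are equivalent
filter(chicago, tmpd > 80)
chicago %>% filter(tmpd > 80)
chicago |> filter(tmpd > 80)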

Example redone

chicago %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25tmean2, na.rm = TRUE),
               o3 = max(o3tmean2),
               no2 = median(no2tmean2)) %>%
  tail()
# A tibble: 6 × 4
   year  pm25    o3   no2
  <dbl> <dbl> <dbl> <dbl>
1  2000  16.9  55.8  23.5
2  2001  16.9  51.8  25.1
3  2002  15.3  54.9  22.7
4  2003  15.2  56.2  24.6
5  2004  14.6  44.5  23.4
6  2005  16.2  58.8  22.6

Example redone with native pipe

chicago |>
  mutate(year = year(date)) |>
  group_by(year) |>
  summarize(pm25 = mean(pm25tmean2, na.rm = TRUE),
               o3 = max(o3tmean2),
               no2 = median(no2tmean2)) |>
  tail()
# A tibble: 6 × 4
   year  pm25    o3   no2
  <dbl> <dbl> <dbl> <dbl>
1  2000  16.9  55.8  23.5
2  2001  16.9  51.8  25.1
3  2002  15.3  54.9  22.7
4  2003  15.2  56.2  24.6
5  2004  14.6  44.5  23.4
6  2005  16.2  58.8  22.6

No more temporary variables

The pipe operator is very handy for stringing together multiple functions in a sequence of operations and avoiding the assignment of temporary variables.

Returning to the question posed earlier (the average PM2.5 level on hot and cold days), we could do everything in a single piped expression.

Using temporary variables:

# using temporary variables:
chicago <- readRDS("data/chicago.rds")
chicago <- mutate(chicago, tempcat = factor(tmpd>80, 
                            labels=c("cold","hot")))
hotcold <- group_by(chicago, tempcat)
this <- summarize(hotcold, pm25tmean2 = mean(pm25tmean2, na.rm = TRUE))

Using pipe:

that <- chicago %>% 
  mutate(tempcat = factor(tmpd >80, labels=c("cold","hot"))) %>% 
  group_by(tempcat) %>%
  summarize(pm25tmean2 = mean(pm25tmean2, na.rm = TRUE))

Check that they’re the same:

all.equal(this, that)
[1] TRUE