# install.packages("devtools")
# devtools::install_github("rstudio/EDAWR")
library(EDAWR)
data(cases)
cases
DATA 101: Making Predictions with Data
University of British Columbia Okanagan
In today’s lecture we’ll be look
The goal of tidyr is to help you create “tidy” data wherein:
“Pivoting” data, in the context of data wrangling, refers to the process of reorganizing or restructuring a dataset from a long format to a wide format or vice versa.
This transformation involves changing the arrangement of the data to make it more suitable for downstream analysis.
There are typically two types of pivoting:
Wide data format is characterized by having many columns and fewer rows.
This format can make it easy to read and understand the data, especially when you have data with a small number of observations but a large number of variables.
However, wide data can be less suitable for certain types of analysis and visualization, particularly when you want to perform aggregations, comparisons, or create certain types of plots.
Long data format is characterized by having fewer columns and more rows. It’s often used to represent data with repeated measures or observations over time or categories.
In long data, variables are typically stacked into a single column, and an additional column is used to indicate the context or category to which each observation belongs.
This format is often preferred for data analysis, modeling, and certain types of visualizations, as it’s more amenable to aggregation and summarization.
ID | Year | level_1 | level_2 | level_3 |
---|---|---|---|---|
1 | 2010 | 10 | 20 | 30 |
2 | 2011 | 15 | 25 | 35 |
3 | 2012 | 12 | 22 | 32 |
Going from the table above to the table below is pivoting from wide to long.
ID | Year | Variable | Value |
---|---|---|---|
1 | 2010 | level_1 | 10 |
1 | 2010 | level_2 | 20 |
1 | 2010 | level_3 | 30 |
2 | 2011 | level_1 | 15 |
2 | 2011 | level_2 | 25 |
2 | 2011 | level_3 | 35 |
3 | 2012 | level_1 | 12 |
3 | 2012 | level_2 | 22 |
3 | 2012 | level_3 | 32 |
While the wide table is easy for humans to read, it is difficult to work with when performing analysis in R.
max()
The problem only gets worse if you would like to find the population of a given region for the latest year.
Furthermore, we don’t know what the numbers under each year actually represent.
Old functions:
gather()
spread()
New functions:
pivot_longer()
pivot_wider()
The gather() function is used to pivot data from a wide format into a long format.
This function works with key-value pairs.
Key: the column names in the original wide dataset that you want to stack or gather into a single column.
Value: the data in the cells corresponding to the key columns.
The general syntax for gather:
data: a data frame you want to clean
key: name of the new key column (character string)
value: name of the new value column (character string)
...: a selection of columns to collapse (e.g. 2:4)
If you look at the help file for gather(), you will notice that it says the lifecycle is superseded, which is just a softer version of deprecated.
A superseded function has a known better alternative, but the function itself is not going away.
A superseded function will not emit a warning (since there’s no risk if you keep using it), but the documentation will tell you what is recommended instead.
Development on gather() is complete, and for new code we recommend switching to pivot_longer(), which is easier to use, more featureful, and still under active development.
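A gather() call is equivalent, up to row order, to a pivot_longer() call in which names_to takes the place of key and values_to takes the place of value.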
pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
pivot_longer() is commonly needed to tidy raw datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis.
The inverse transformation is pivot_wider().
data: a data frame to pivot
cols: <tidy-select> columns to pivot into longer format
names_to: a character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols
values_to: a string specifying the name of the column to create from the data stored in cell values
...: additional arguments passed on to methods
pivot_longer(
  cases,
  cols = 2:4,
  names_to = "year",
  values_to = "n"
)
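As a rough sketch of the same kind of call on data that can be built inline, the wide table from earlier in these notes can be pivoted with pivot_longer() and, for comparison, with the superseded gather(). The object names wide, long_new and long_old are made up for this illustration, and pairing the two functions on this table is an assumption rather than the lecture's own example.

```r
library(tidyr)

# Wide table from earlier in the notes
wide <- data.frame(
  ID = 1:3,
  Year = c(2010, 2011, 2012),
  level_1 = c(10, 15, 12),
  level_2 = c(20, 25, 22),
  level_3 = c(30, 35, 32)
)

# Columns 3 to 5 hold the values; their names go into the Variable column
long_new <- pivot_longer(wide, cols = 3:5,
                         names_to = "Variable", values_to = "Value")

# Superseded form: key and value play the roles of names_to and values_to
long_old <- gather(wide, key = "Variable", value = "Value", 3:5)

# The two functions order their rows differently, so sort before comparing
long_new <- as.data.frame(long_new[order(long_new$ID, long_new$Variable), ])
long_old <- long_old[order(long_old$ID, long_old$Variable), ]
```

After sorting, both results hold the same nine ID, Year, Variable, Value rows as the long table shown earlier.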
Suppose we have observations spread across multiple rows rather than in a single row.
pivot_wider() is the opposite of pivot_longer(): it makes a dataset wider by increasing the number of columns and decreasing the number of rows.
pivot_longer() takes a set of columns and pivots them into two columns: one for variable names and one for values.
pivot_wider() takes key-value pairs and spreads them into multiple columns based on the unique values in the key column.
This data is not “tidy”:
Each observation (here, the population, commuter, and incorporated values for a region) is split across three rows.
Using data in this format—where two or more variables are mixed together in a single column—makes it harder to apply many usual tidyverse functions.
This function generally increases the number of columns (widens) and decreases the number of rows in a data set.
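A minimal sketch of this widening step, using the long ID/Year/Variable/Value table from earlier in these notes rather than the region data discussed above (which is not reproduced here). The object names long and wide are chosen for illustration only.

```r
library(tidyr)

# Long table from earlier in the notes: one row per ID/Year/Variable
long <- data.frame(
  ID = rep(1:3, each = 3),
  Year = rep(c(2010, 2011, 2012), each = 3),
  Variable = rep(c("level_1", "level_2", "level_3"), times = 3),
  Value = c(10, 20, 30, 15, 25, 35, 12, 22, 32)
)

# Each distinct entry of Variable becomes a column; Value fills the cells
wide <- pivot_wider(long, names_from = Variable, values_from = Value)
```

The nine rows collapse to three, one per ID and Year, with level_1, level_2 and level_3 as separate columns.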
data: a data frame to pivot
...: additional arguments passed on to methods
names_from / values_from: <tidy-select> a pair of arguments describing which column(s) to get the name of the output column from (names_from), and which column(s) to get the cell values from (values_from).
Another handy tidyr function is separate().
It is used to split a single column of data that contains multiple values separated by a delimiter into multiple columns.
You specify the delimiter or separator that separates the values within the original column as a regular expression or as numeric locations.
This function uses the following basic syntax:
where:
data: Name of the data frame
col: Name of the column to separate
into: Vector of names for the columns to be separated into
sep: The value to separate the column at
ex1 <- data.frame(player=c('A', 'A', 'B', 'B', 'C', 'C'),
                  year=c(1, 2, 1, 2, 1, 2),
                  stats=c('22-2', '29-3', '18-6', '11-8', '12-5', '19-2'))
ex1
The delimiter in this case is the hyphen -
Goal: separate the stats column into two new columns called “points” and “assists”.
- The new columns are of character type (<chr>).
- Because of the delimiter (-), R read these values in as character types, and by default separate() returns columns as character data types.
- Setting convert = TRUE in the separate() function converts these to the appropriate data type.
With conversion, the new columns are integers.
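The separation described above could look roughly like the following, reusing the ex1 data frame from these notes. The result names ex1_chr and ex1_int are made up for this sketch.

```r
library(tidyr)

ex1 <- data.frame(player = c("A", "A", "B", "B", "C", "C"),
                  year = c(1, 2, 1, 2, 1, 2),
                  stats = c("22-2", "29-3", "18-6", "11-8", "12-5", "19-2"))

# Default behaviour: split at the hyphen, new columns stay character
ex1_chr <- separate(ex1, col = stats, into = c("points", "assists"), sep = "-")

# convert = TRUE asks separate() to pick a suitable type for the new columns
ex1_int <- separate(ex1, col = stats, into = c("points", "assists"),
                    sep = "-", convert = TRUE)
```

Comparing str(ex1_chr) with str(ex1_int) shows the points and assists columns changing from character to integer.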
Last lecture we saw how we can combine summarize() and group_by() to summarize values for subgroups within a data set.
Now we’ll see how we can use the summarize() function across many columns.
To summarize statistics across many columns, we can use the summarize() and across() functions from the dplyr package.
The summarize(across(...)) function combination allows you to apply a summary function across multiple columns of a data frame or tibble.
To do this more efficiently, we can pair summarize() with across() and use a colon : to specify a range of columns we would like to perform the statistical summaries on.
df: a data frame or tibble
.cols: <tidy-select> columns to transform
.fns: a function name (e.g. mean) or a purrr-style lambda function
The max() summary function returns NA for two of the columns, because it was applied to columns that contain NAs.
Setting na.rm = TRUE tells max() to ignore the missing values.
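A small sketch of this behaviour, on a made-up tibble rather than the lecture's data set. The tibble df, its columns x, y and z, and the helper max_na_rm() are all hypothetical names introduced for illustration.

```r
library(dplyr)

# Hypothetical data: two of the three columns contain a missing value
df <- tibble(
  x = c(1, 5, 3),
  y = c(2, NA, 8),
  z = c(NA, 4, 6)
)

# max() returns NA for any column that contains an NA
with_na <- df %>% summarize(across(x:z, max))

# A user-defined helper that drops NAs before taking the maximum
max_na_rm <- function(v) max(v, na.rm = TRUE)
without_na <- df %>% summarize(across(x:z, max_na_rm))
```

Here with_na has NA under y and z, while without_na reports a maximum for every column.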
To avoid creating a user-defined function, we could instead create an anonymous function …
Anonymous functions, also known as lambda functions or inline functions, are used when you want to define a small, unnamed function for a specific task without formally creating a named function using function().
In purrr, you can create anonymous functions using the tilde (~) operator.
For unary (single-argument) functions, ~ .x + 1 is equivalent to function(.x) .x + 1.
Anonymous function:
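One way this might look inside summarize(across(...)), sketched on a made-up tibble. The names df, x, y and z are hypothetical, and the function() version is included only to show that both spellings give the same result.

```r
library(dplyr)

# Hypothetical data with missing values in two columns
df <- tibble(
  x = c(1, 5, 3),
  y = c(2, NA, 8),
  z = c(NA, 4, 6)
)

# purrr-style lambda: .x stands for the column currently being summarized
lambda_max <- df %>%
  summarize(across(x:z, ~ max(.x, na.rm = TRUE)))

# The same computation written with an unnamed function()
function_max <- df %>%
  summarize(across(x:z, function(v) max(v, na.rm = TRUE)))
```

Neither version needs a separately named helper; the function exists only inside the across() call.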
An alternative to summarize and across for applying a function to many columns is the map family of functions.
Let’s again redo the previous example, but using map with the max function this time.
More generally, the map() function in the purrr package is used for applying a function to each element of a list or vector and returning a new list.
The general syntax for map() is
.x: an object (a vector, data frame or list) that you want to iterate over
.f: the function you would like to apply to each element
⚠️ There is no argument to specify which columns to apply the function to; it will simply apply the function to each column (resp. element) of the dataframe (resp. list/vector)
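A minimal sketch of map() applied to every column of a data frame. The data frame df and its columns are hypothetical, made up so the example can run on its own.

```r
library(purrr)

# Hypothetical data frame; map() will visit every column in turn
df <- data.frame(
  x = c(1, 5, 3),
  y = c(2, NA, 8),
  z = c(NA, 4, 6)
)

# .x is the data frame (a list of columns), .f is a lambda applied to each one
col_max <- map(df, ~ max(.x, na.rm = TRUE))
```

To restrict the computation to some columns, one option is to subset the data frame first (for example df[c("x", "y")]) before passing it to map().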
You’ll notice that the output of the map() function is a list.
While we could convert this to a data frame, a simpler alternative is to use a different map() function; see ?map.
map function | Output |
---|---|
map | list |
map_lgl | logical vector |
map_int | integer vector |
map_dbl | double vector |
map_chr | character vector |
map_dfc | data frame, combining column-wise |
map_dfr | data frame, combining row-wise |
This returns a data frame rather than a list which is perhaps more desirable.
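To illustrate two of the variants in the table, here is a short sketch on a made-up data frame (df, x, y and z are hypothetical names). It assumes dplyr is installed, since map_dfc() relies on it to bind the results together.

```r
library(purrr)

# Hypothetical data frame with missing values in two columns
df <- data.frame(
  x = c(1, 5, 3),
  y = c(2, NA, 8),
  z = c(NA, 4, 6)
)

# map_dbl() returns a named double vector, one element per column
max_vec <- map_dbl(df, ~ max(.x, na.rm = TRUE))

# map_dfc() binds the per-column results side by side into a one-row data frame
max_df <- map_dfc(df, ~ max(.x, na.rm = TRUE))
```

Which variant to pick depends on what the next step expects: a plain vector is convenient for further arithmetic, while a data frame slots back into a dplyr pipeline.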
Sometimes we need to apply a function to many columns in a data frame.
Suppose we want to scale multiple columns (say the pm25tmean2 to the no2tmean2 column) from the chicago data set.
To accomplish such a task, we can use mutate paired with across.
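A sketch of the mutate() and across() pairing, on a tiny made-up stand-in for the chicago data: the tibble air and its values are hypothetical, and only the column names come from these notes. "Scale" is taken here to mean standardizing each column to mean 0 and standard deviation 1, which is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for a few columns of the chicago data set
air <- tibble(
  date = c("d1", "d2", "d3"),
  pm25tmean2 = c(10, 20, 30),
  pm10tmean2 = c(30, 40, 50),
  o3tmean2 = c(4, 6, 8),
  no2tmean2 = c(20, 25, 30)
)

# Standardize every column in the range pm25tmean2:no2tmean2, leaving date alone
scaled <- air %>%
  mutate(across(pm25tmean2:no2tmean2, ~ (.x - mean(.x)) / sd(.x)))
```

Unlike summarize(), mutate() keeps one row per record and overwrites the selected columns in place.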
Sometimes we need to apply a function across columns but within one row:
For instance, suppose we want to know the maximum value among pm25tmean2, pm10tmean2, o3tmean2 and no2tmean2 for each record in the chicago data set.
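One way to sketch this is with rowwise() and c_across() from dplyr. The tibble air and its values are hypothetical stand-ins for the chicago data, and the new column name largest is made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the four pollutant columns
air <- tibble(
  pm25tmean2 = c(10, 45, 30),
  pm10tmean2 = c(30, 40, 50),
  o3tmean2 = c(4, 6, 60),
  no2tmean2 = c(35, 25, 30)
)

# rowwise() makes each row its own group, so max() sees one row at a time;
# c_across() gathers the selected columns of that row into a single vector
row_max <- air %>%
  rowwise() %>%
  mutate(largest = max(c_across(pm25tmean2:no2tmean2))) %>%
  ungroup()
```

Calling ungroup() at the end returns an ordinary tibble so that later steps are not evaluated row by row.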
In addition to these tidyverse functions we also have a handy base R function called apply().
apply() is primarily used for applying a function to the rows or columns of a matrix or array. It is not designed for use with lists or other data structures.
You specify whether you want to apply the function to rows (MARGIN = 1), columns (MARGIN = 2), or both (MARGIN = c(1, 2)).
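The three MARGIN settings can be sketched on a small made-up matrix. The matrix m and the result names are hypothetical, introduced only for this illustration.

```r
# Hypothetical 3 x 3 matrix, filled row by row
m <- matrix(c(1, 5, 3,
              2, 9, 8,
              7, 4, 6), nrow = 3, byrow = TRUE)

# MARGIN = 1: apply max() to each row
row_max <- apply(m, MARGIN = 1, FUN = max)

# MARGIN = 2: apply max() to each column
col_max <- apply(m, MARGIN = 2, FUN = max)

# MARGIN = c(1, 2): apply the function to every individual cell
doubled <- apply(m, MARGIN = c(1, 2), FUN = function(v) v * 2)
```

With MARGIN = 1 or 2 the result here is a vector with one value per row or column; with c(1, 2) it keeps the shape of the original matrix.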
X: an array, including a matrix.
MARGIN: a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.
FUN: the function to be applied.
...: optional arguments to FUN.
simplify: a logical indicating whether results should be simplified if possible.
Function | Description |
---|---|
across() | allows you to apply function(s) to multiple columns |
filter() | subsets rows of a data frame |
group_by() | allows you to apply function(s) to groups of rows |
mutate() | adds or modifies columns in a data frame |
map() | general iteration function |
pivot_longer() | makes the data frame longer and narrower |
pivot_wider() | makes a data frame wider and decreases the number of rows |
rowwise() | applies functions across columns within one row |
separate() | splits up a character column into multiple columns |
select() | subsets columns of a data frame |
summarize() | calculates summaries of inputs |