Most of the data we have dealt with so far has arrived in a condition that is readily usable in R.
More often than not, however, we will have to reconfigure our data into another format to make it appropriate for downstream analysis.
This general task of preparing data for future analysis is common in practice and falls under the process of data wrangling.
What is data wrangling?
Some aspects of data wrangling (aka data munging) include:
Collection: Gathering data from various sources
Cleaning: Identifying anomalies, missing values, and outliers
Transformation: Converting data into a consistent format
Feature Engineering: Creating new features or variables
Data Reduction: Reducing the dimensionality of the data
Documentation: Documenting the steps taken during data wrangling for reproducibility and transparency.
Comment
Data wrangling is a crucial step in the data analysis process because the quality of the data greatly influences the accuracy and reliability of subsequent analyses.
In fact, it has been claimed that 50 to 80 percent of the time spent on a typical machine learning/deep learning project can be attributed to this process.
Notation
In this lecture we aim to go from:
\[\text{raw data}\to\text{tidy data}\]
For the purpose of this lecture, I will refer to “raw” data as how the data is first presented to us and “tidy” data as the cleaned/wrangled data that meets the following criteria:
each row is a single observation,
each variable is a single column, and
each single cell contains a single value
Fig 6.1 from R for data science: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.
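As a minimal sketch of what these rules mean in practice (using a hypothetical two-city temperature table, not the lecture's data), tidyr's pivot_longer() can reshape an untidy table, where values of a variable (year) are stored in column names, into a tidy one:

```r
library(tidyr)

# untidy: the variable "year" is spread across the column names
untidy <- data.frame(city = c("chic", "nyc"),
                     `1987` = c(31.5, 28.0),
                     `1988` = c(33.0, 30.5),
                     check.names = FALSE)

# tidy: one row per (city, year) observation, one value per cell
tidy <- pivot_longer(untidy, cols = -city,
                     names_to = "year", values_to = "tmpd")
tidy
```

After reshaping, each row is a single observation (one city in one year), satisfying all three rules above.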
Data frames and help files
We have discussed how a data frame is a key data structure in statistics and in R.
Data frames represent tabular data, holding observations in the rows and variables in the columns.
Metadata in the form of column names is useful to describe each variable and provide a unique identifier.
Metadata in the form of help files, which describe things like where the data was sourced from and what each variable means in more detail, can be very helpful.
Meta data
Metadata is often referred to as “data about data”
It provides information that describes and provides context for other data.
It helps users and systems understand the content, structure, and characteristics of the underlying data.
Metadata is essential for managing, organizing, and retrieving data efficiently.
Most of the functions we use in R require that our data is in a tidy format.
dplyr
The dplyr package is a very helpful package in R that will help us manage tidy data frames.
It comes as part of the tidyverse ecosystem and is widely used in data analysis and data science workflows.
This package was developed by Hadley Wickham of the Posit team (note name change: RStudio is now Posit!).
While we can accomplish much of what this package does using base R, these functions often run much quicker (as they call out to C++), and you may find them easier and more intuitive to work with.
str(chicago)
'data.frame': 6940 obs. of 8 variables:
$ city : chr "chic" "chic" "chic" "chic" ...
$ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
$ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
$ date : Date, format: "1987-01-01" "1987-01-02" ...
$ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
$ pm10tmean2: num 34 NA 34.2 47 NA ...
$ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
$ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
This data set contains air pollution and weather variables for the city of Chicago from 1987–2005.
Chicago Data set
chicago
Extracting Columns
If we want to extract the second through fourth columns, we’ve seen how we could use:
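A sketch of the base R approach (positional indexing; assuming columns 2–4 are tmpd, dptp, and date, as in the str() output above, and using a small stand-in data frame with hypothetical values):

```r
# small stand-in for the lecture's chicago data (hypothetical values)
chicago <- data.frame(city = "chic",
                      tmpd = c(31.5, 33.0),
                      dptp = c(31.5, 29.9),
                      date = as.Date(c("1987-01-01", "1987-01-02")))

# base R: extract the second through fourth columns by position
head(chicago[, 2:4])
```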
select() makes it easy to refer to variables based on their name and also by type:
e.g. where(is.numeric) selects all numeric columns.
e.g. a:f selects all columns from a to f (inclusive)
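A sketch of these helpers on a small stand-in for the chicago data (the values are hypothetical):

```r
library(dplyr)

# stand-in data frame with hypothetical values
chicago <- data.frame(city = "chic",
                      tmpd = c(31.5, 33.0),
                      dptp = c(31.5, 29.9),
                      date = as.Date(c("1987-01-01", "1987-01-02")),
                      pm10tmean2 = c(34, NA))

head(select(chicago, where(is.numeric)))  # all numeric columns
head(select(chicago, tmpd:date))          # columns tmpd through date, inclusive
head(select(chicago, ends_with("2")))     # columns whose names end in "2"
```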
We could also use it to exclude certain columns/variables:
head(select(chicago, -(tmpd:date)))
city pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic NA 34.00000 4.250000 19.98810
2 chic NA NA 3.304348 23.19099
3 chic NA 34.16667 3.333333 23.81548
4 chic NA 47.00000 4.375000 30.43452
5 chic NA NA 4.750000 30.33333
6 chic NA 48.00000 5.833333 25.77233
Extracting rows
Another handy function will allow us to extract certain rows from a data frame.
Say for instance we wanted to obtain all the data points from chicago for which the pm10tmean2 is not NA.
We already know a few ways to do this in base R:
this <- chicago[!is.na(chicago$pm10tmean2), ]
that <- subset(chicago, !is.na(pm10tmean2))
all.equal(this, that)
[1] TRUE
filter
In the dplyr package this can be done using filter().
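The exact call is not shown here, but a sketch consistent with the output below would combine two conditions (the tmpd > 80 condition is inferred from the printed rows, so treat it as an assumption; the stand-in data frame uses hypothetical values):

```r
library(dplyr)

# stand-in for the chicago data (hypothetical values)
chicago <- data.frame(tmpd = c(31.5, 83.0, 80.5),
                      pm10tmean2 = c(34.0, 43.7, NA))

# keep rows where pm10tmean2 is observed AND the day was hot
filter(chicago, !is.na(pm10tmean2) & tmpd > 80)
```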
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic 83.0 68.500 1987-06-14 NA 43.66667 45.42188 25.99583
2 chic 80.5 62.750 1987-06-18 NA 90.00000 62.96966 43.81667
3 chic 80.5 69.250 1987-07-08 NA 38.00000 30.85168 23.01479
4 chic 80.5 67.625 1987-07-09 NA 31.00000 24.94792 20.15783
5 chic 84.5 73.625 1987-07-19 NA 72.00000 47.05208 13.37153
6 chic 81.5 67.625 1987-07-20 NA 65.42857 40.49896 18.45833
arrange
Sometimes we wish to reorder our rows in a way that preserves the corresponding order of the other columns.
We have seen a rather tedious way of performing this task using sort() and order().
The arrange() function from the dplyr package provides a simple way of reordering rows of a data frame according to an expression involving its variables/columns.
Ascending example
For instance, we might want to arrange our data by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.
head(chicago <- arrange(chicago, date))
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
2 chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
3 chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548
4 chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452
5 chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333
6 chic 40.0 35.125 1987-01-06 NA 48.00000 5.833333 25.77233
Descending example
To change this to dates descending (so that the most recent observation appears first), we simply write:
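This uses dplyr's desc() helper, which sorts a variable in descending order (shown here as a sketch on a small stand-in data frame with hypothetical values):

```r
library(dplyr)

# stand-in for the chicago data (hypothetical values)
chicago <- data.frame(date = as.Date(c("1987-01-01", "1987-01-02", "1987-01-03")),
                      tmpd = c(31.5, 33.0, 33.0))

# most recent observation first
head(arrange(chicago, desc(date)))
```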
rename
We may want to rename columns in our data set to be more informative, easier to type (e.g. shorter), or to work better with functions in R. The general syntax:
rename(iris, new_name = old_name)
We could easily achieve this using the rename() function from dplyr
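For example (the new names here are my own choice, not from the lecture), we might give the dew point and PM2.5 columns clearer names:

```r
library(dplyr)

# stand-in columns from the chicago data (hypothetical values)
chicago <- data.frame(dptp = c(31.5, 29.9),
                      pm25tmean2 = c(NA, 20.1))

# new_name = old_name
chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
names(chicago)
```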
group_by
The group_by() function is used to generate summary statistics from the data within groups defined by a variable.
You can think of this function as splitting our data frame into groups according to some expression, then performing this summary statistic within these subsets.
Those of you familiar with SQL will recognize this GROUP BY feature within the context of a relational database.
Average PM2.5 Example
Determine the average level of PM2.5 for hot days (tmpd > 80) and cold days (tmpd < 80).
# create a new variable which will indicate
# if the day was hot or cold.
chicago <- mutate(chicago, tempcat = factor(tmpd > 80, labels = c("cold", "hot")))
# split the data into hot days and cold days:
hotcold <- group_by(chicago, tempcat)
class(hotcold)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
Sidenote: Tibbles
hotcold is a grouped data frame, or tibble (tbl for short), which can now be passed to the summarize() function.
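A sketch of that summarize() step (rebuilding the grouped frame on a small stand-in with hypothetical values, since the real call is not shown above):

```r
library(dplyr)

# stand-in for the chicago data (hypothetical values)
chicago <- data.frame(tmpd = c(31.5, 83.0, 85.0),
                      pm25tmean2 = c(NA, 20.1, 25.3),
                      o3tmean2 = c(4.25, 45.4, 50.0))

chicago <- mutate(chicago, tempcat = factor(tmpd > 80, labels = c("cold", "hot")))
hotcold <- group_by(chicago, tempcat)

# one summary row per temperature category
summarize(hotcold,
          pm25 = mean(pm25tmean2, na.rm = TRUE),
          o3   = max(o3tmean2, na.rm = TRUE))
```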
If we want those summary statistics for the various years we could use:
library(lubridate)
chicago <- mutate(chicago, year = year(date))
# look at the last 6 lines
tail(summarize(group_by(chicago, year),
               pm25tmean2 = mean(pm25tmean2, na.rm = TRUE),
               o3 = max(o3tmean2),
               no2 = median(no2tmean2)))
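The nested call above can be written more readably with R's pipe operator (here the base |>, available since R 4.1; a stand-in data frame with hypothetical values replaces the real chicago, and format() replaces lubridate::year() to keep the sketch dependency-light):

```r
library(dplyr)

# stand-in for the chicago data (hypothetical values)
chicago <- data.frame(date = as.Date(c("1987-01-01", "1987-06-14", "1988-03-02")),
                      pm25tmean2 = c(NA, 20.1, 25.3),
                      o3tmean2 = c(4.25, 45.4, 30.0),
                      no2tmean2 = c(20.0, 26.0, 22.0))

result <- chicago |>
  mutate(year = as.integer(format(date, "%Y"))) |>  # base-R stand-in for lubridate::year()
  group_by(year) |>
  summarize(pm25tmean2 = mean(pm25tmean2, na.rm = TRUE),
            o3 = max(o3tmean2),
            no2 = median(no2tmean2))
tail(result)
```

Each step reads left to right, which many find easier to follow than working outward from the innermost call.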