Lecture 4: Getting data into R

DATA 101: Making Prediction with Data

Dr. Irene Vrbik

University of British Columbia Okanagan

Outline

In today’s lecture we’ll be looking at:

File formats and File location
Functions for reading data into R: read.csv() and read_csv()
Working Directory
Dates and Times
tidyverse

File formats

In R, you can read data from various sources and formats.
Some common types include:
- CSV (comma-separated value) files,
- Excel spreadsheets,
- databases,
- web APIs, …
The method you choose depends on the type and location of your data.

File location

In addition to specifying the file format, you also need to specify the file location.
The file could live on your computer (local) or somewhere on the internet (remote).
The location of a remote file is provided by a URL.
The location of a local file is specified using either an absolute or a relative path¹.

Functions for Reading in Data

There are a few principal functions for reading data into R:

read.table(), read.csv() for reading tabular data
readLines(), for reading lines of a text file
source(), for reading in R code files
load(), for reading in saved workspaces
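
As a minimal sketch of the last three functions, the snippet below writes a few throwaway files to a temporary directory and reads them back. The file names and the objects `answer` and `scores` are made up for illustration.

```r
# All files live in a temporary directory so nothing is left behind.
tmp <- tempdir()

# readLines(): read a text file line by line into a character vector
txt_file <- file.path(tmp, "notes.txt")
writeLines(c("first line", "second line"), txt_file)
lines <- readLines(txt_file)

# source(): run the R code stored in a script file
script_file <- file.path(tmp, "script.R")
writeLines("answer <- 6 * 7", script_file)
source(script_file, local = TRUE)   # creates `answer` in the current environment

# load(): restore objects saved earlier with save()
rdata_file <- file.path(tmp, "objects.RData")
scores <- c(90, 85, 77)
save(scores, file = rdata_file)
rm(scores)
load(rdata_file)                    # `scores` is available again
```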

Tabular Data

Tabular data is a structured form of data organization where information is presented in rows and columns.

Rows: each row represents a single data record or observation.
Columns: each column represents an individual attribute, property, or variable.
Headers: the labels at the top of each column, describing the data within that column.
Cells: the individual data points where a row and column intersect. Each cell contains a single value representing a specific attribute of the corresponding record.

Reading in Tabular Data

Both read.table() and read.csv() functions are used to read tabular data from text files.

read.csv(): This function is specifically designed to read data from CSV (Comma-Separated Values) files¹.
read.table() is more general and can be used to read tabular data from various text file formats² (e.g. where data is separated by white space, or tabs)
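
To illustrate the relationship between the two functions, here is a small sketch that writes the same made-up data frame as a CSV file and as a tab-separated file, then reads both back. The data frame `df` and the temporary file names are assumptions for the example.

```r
# A small made-up data frame written out in two formats
df <- data.frame(id = 1:3, score = c(90.5, 85, 77.25))

csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
write.csv(df, csv_file, row.names = FALSE)
write.table(df, tsv_file, sep = "\t", row.names = FALSE)

# read.csv() assumes a header row and comma separators
from_csv <- read.csv(csv_file)

# read.table() has to be told about the header and the separator
from_tsv <- read.table(tsv_file, header = TRUE, sep = "\t")

# read.table() can also read the CSV file once given CSV-style arguments
from_csv2 <- read.table(csv_file, header = TRUE, sep = ",")

all.equal(from_csv, from_tsv)
all.equal(from_csv, from_csv2)
```

In other words, read.csv() behaves like read.table() with defaults chosen for CSV files.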

Reading CSV Files

To read data from a CSV file, you can use the read.csv() function. For example:

data <- read.csv("data.csv")

If the file name does not contain an absolute path, the file name is relative to the current working directory.

## Relative path:
# e.g., if deck.csv were in a folder called "data" in my working directory
deck <- read.csv("data/deck.csv")

## Absolute path:
deck <- read.csv("/Users/ivrbik/DATA101/data/deck.csv")
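
The difference between the two kinds of path can be checked with a self-contained sketch. It builds a throwaway project folder inside the temporary directory; the folder name `my_project` and the two-card `deck` are hypothetical stand-ins for the lecture data.

```r
# Set up a throwaway project folder containing data/deck.csv
project <- file.path(tempdir(), "my_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
deck <- data.frame(face = c("king", "queen"), suit = c("spades", "hearts"))
write.csv(deck, file.path(project, "data", "deck.csv"), row.names = FALSE)

# Temporarily make the project folder the working directory.
# setwd() returns the previous working directory, so it can be restored later.
old_wd <- setwd(project)

# Relative path: resolved against the current working directory
deck_rel <- read.csv("data/deck.csv")

# Absolute path: works no matter what the working directory is
abs_path <- normalizePath("data/deck.csv")
setwd(old_wd)                     # restore the original working directory
deck_abs <- read.csv(abs_path)

identical(deck_rel, deck_abs)
```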

Working Directory

The “working directory” is the directory (or folder) on your computer where R will look for and save files by default.
You can view the current working directory using getwd()

You can change the working directory using:

setwd("/Users/ivrbik/path/to/your/directory")

The working directory for an Rmd file is typically set to the directory where the Rmd file is located.
Relative paths within an Rmd document are interpreted relative to the directory where the Rmd file resides.

Working directories in Rmd files

When running R code interactively within an Rmd file (e.g., by executing code chunks in an R Markdown document within RStudio), the working directory may be different¹.
You can change your working directory in interactive mode using setwd()
⚠️ Using setwd() within an Rmd code chunk only works for that chunk; the working directory is restored after the chunk has been evaluated.

deck <- read.csv("data/deck.csv")
head(deck, 2) # only prints the first 2 rows

tail(deck, 2) # only prints the last 2 rows

Saving data

On the flip side, you can save data to a CSV file in R using the write.csv() function.
This function allows you to write the contents of a data frame (or a matrix) to a CSV file. The basic syntax is:

write.csv(df, file = "filename.csv")

df is the data frame you want to save to the CSV file.
file is the path and filename where you want to save the CSV file (don’t forget the .csv extension)

Comment

⚠️ Warning: By default, write.csv() includes an extra column containing the row names in the output CSV file. More likely than not, you will not want to save row names as they will often be read in by other software as data.

If you do not assign row names, the rows of the data frame are identified by numerical indices, starting from 1.
You can access the row names of a data frame using the row.names() function.
To leave row names out of the saved file, specify row.names = FALSE in write.csv().

write.csv(deck, file = "deck-with-rows.csv")
deck_rows <- read.csv("deck-with-rows.csv")
head(deck_rows, 3)

write.csv(deck, file = "deck-no-rows.csv", row.names = FALSE)
deck_no_rows <- read.csv("deck-no-rows.csv")
head(deck_no_rows, 3)
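
Since the lecture's deck.csv is not included here, the following sketch reproduces the effect with a made-up three-card `deck` and temporary files, comparing the column names after each round trip.

```r
deck <- data.frame(face = c("king", "queen", "jack"), value = c(13, 12, 11))

with_rows <- tempfile(fileext = ".csv")
no_rows   <- tempfile(fileext = ".csv")

write.csv(deck, file = with_rows)                    # default: row names are written
write.csv(deck, file = no_rows, row.names = FALSE)   # row names are omitted

names(read.csv(with_rows))   # gains an extra first column, which read.csv() names "X"
names(read.csv(no_rows))     # same columns as the original data frame
```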

Saving other data formats

In R, you can save data using various file formats depending on your needs. Examples include:

Save as a TSV File: TSV (Tab-Separated Values) is similar to CSV, but it uses tabs as the delimiter.

write.table(data, file = "data.tsv", sep = "\t", row.names = FALSE)

Save as an RDS File: To save R objects (data frames, lists, etc.) with their structure and attributes, you can use RDS (R Data Serialization) files.

saveRDS(data, file = "data.rds")
# Load the workspace from an .RData file
load("my_workspace.RData")
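
To see what "with their structure and attributes" means in practice, here is a minimal sketch comparing an RDS round trip with a CSV round trip. It uses readRDS(), the counterpart of saveRDS(); the data frame `df` is invented for the example.

```r
df <- data.frame(
  site = factor(c("north", "south", "north")),
  day  = as.Date(c("2023-09-18", "2023-09-19", "2023-09-20"))
)

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(df, file = rds_file)
write.csv(df, file = csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

sapply(from_rds, class)   # the factor and Date columns survive
sapply(from_csv, class)   # both columns come back as character
```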

Saving your workspace

In R, saving your workspace refers to saving the current state of your R session, including all loaded data, variables, functions, and other objects, to a file (extension “.RData”).
You can then easily resume your R session at a later time without having to reload data or re-run scripts.

Advantages:

Convenience: It allows you to pick up where you left off without needing to reload data or re-run code. This is particularly useful for long-running computations or projects.
Reproducibility: It helps ensure that your analysis or code can be reproduced with the exact same environment and data.

Disadvantages:

Large Files: If your workspace includes large datasets or objects, saving the entire workspace can result in large files, which may not be practical to share or store.
Dependency on Environment: Saving the workspace may lead to code dependencies on specific objects and environments, potentially causing issues when sharing or collaborating on code.

In some cases, it’s better to save specific data or results as separate files or scripts to maintain better control over your project’s reproducibility and file sizes.

Additionally, version control systems like Git are often used to manage code and project history in a more structured and collaborative manner.
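
One way to save only specific objects rather than the whole workspace is save() with named objects. The sketch below is an illustration under assumptions: the objects `model_input` and `threshold` are made up, and loading into a separate environment is just one possible way to avoid overwriting existing objects.

```r
model_input <- data.frame(x = 1:3, y = c(2, 4, 6))
threshold   <- 0.5

rdata_file <- tempfile(fileext = ".RData")

# save.image(rdata_file) would store every object in the workspace;
# save() stores only the objects that are named.
save(model_input, threshold, file = rdata_file)

# Loading into a fresh environment avoids overwriting objects
# that already exist in the current session.
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)

loaded_names          # names of the objects that were restored
restored$threshold
```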

Mixed Type Example

You will notice that R attempts to determine the data types of each column based on the content of the data.¹

Here’s how R determines data types:

Columns with numerical values are assigned the “numeric” data type.
Columns with text values (strings) are typically assigned the “character” data type.

Example

dat <- read.csv("data/example.csv")
head(dat)
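
Because data/example.csv is not reproduced here, the following sketch uses a small made-up CSV (passed through the text argument) to show the types read.csv() picks. The column names are hypothetical. Note that whole-number columns are stored as "integer", which is one kind of numeric data.

```r
csv_text <- "id,site,date,flag,note
1,A,2023-09-18,TRUE,first visit
2,B,2023-09-19,FALSE,follow-up
3,A,2023-09-20,TRUE,final visit"

# text = lets read.csv() parse a string instead of a file
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The date column is read as plain character, not as a Date
sapply(dat, class)
```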

Dates and Times

Base R has three date-time classes:

Dates are represented by the Date class.
Times are represented by the POSIXct or POSIXlt class.
- POSIXct: Represents date and time with a time zone.
- POSIXlt: Represents date and time with additional components.

Dates

Dates are represented by the Date class and can be coerced from a character string using the as.Date() function. This is a common way to end up with a Date object in R.

## Coerce a 'Date' object from character
x <- as.Date("1970-01-01")   
x

[1] "1970-01-01"

class(x)

[1] "Date"

x + 2

[1] "1970-01-03"

weekdays(x)

[1] "Thursday"

months(x)

[1] "January"
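
A few more things Date objects support, as a brief sketch: parsing a non-ISO string with an explicit format, subtracting dates, and generating sequences. The specific dates are arbitrary examples.

```r
# Strings that are not in YYYY-MM-DD form need a format specification
d1 <- as.Date("18/09/2023", format = "%d/%m/%Y")
d2 <- as.Date("2023-12-25")

d2 - d1                                  # a "difftime" object measured in days
as.numeric(d2 - d1)                      # the plain number of days
seq(d1, by = "week", length.out = 3)     # three dates, one week apart
format(d1, "%Y")                         # extract the year as a string
```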

Times

Times are represented by the POSIXct or the POSIXlt class.

POSIXct stores date and time values as a numeric value representing the number of seconds since the Unix epoch (January 1, 1970 at midnight UTC).

POSIXlt stores date and time values as a list of components (e.g., year, month, day, hour, minute, second).

Example

x <- Sys.Date()

(x)
(date_num <- as.POSIXct(x))
(date_list <- as.POSIXlt(x))

[1] "2023-09-20"
[1] "2023-09-20 UTC"
[1] "2023-09-20 UTC"

You can see the “raw” format of each object by stripping these variables of their class using the unclass() function:

unclass(x) # number of days since January 1, 1970.

[1] 19620

unclass(date_num) #  number of sec since Jan 1, 1970, at midnight UTC

[1] 1695168000
attr(,"tzone")
[1] "UTC"

names(unclass(date_list))

 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"  
 [9] "isdst"  "zone"   "gmtoff"
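
The practical difference between the two classes can be sketched as follows: a POSIXct value behaves like a single number, while a POSIXlt value lets you pull out components by name. The timestamp is an arbitrary example.

```r
ct <- as.POSIXct("2023-09-20 13:30:00", tz = "UTC")
lt <- as.POSIXlt(ct)

as.numeric(ct)     # seconds since 1970-01-01 00:00:00 UTC
lt$hour            # components are accessed by name
lt$min
lt$year + 1900     # years are stored as years since 1900
lt$mon + 1         # months are stored as 0-11
```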

Example with `colClasses`

dat <- read.csv("data/example.csv", colClasses = 
                  c("numeric", "factor", "Date", "logical", "character"))
head(dat)
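
The same idea can be tried without the lecture's file. The sketch below uses a made-up two-row CSV with hypothetical column names and checks the class of each column after specifying colClasses.

```r
csv_text <- "id,site,date,flag,note
1,A,2023-09-18,TRUE,first visit
2,B,2023-09-19,FALSE,follow-up"

# One class per column, in the order the columns appear in the file
dat <- read.csv(
  text = csv_text,
  colClasses = c("numeric", "factor", "Date", "logical", "character")
)
sapply(dat, class)
```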

tidyverse

The tidyverse is a collection of open-source R packages and tools for data science and data analysis.
It is designed to make data manipulation, visualization, and modeling more efficient, intuitive, and consistent in R.
The tidyverse philosophy is centered around the principles of “tidy data” and “tidy tools.”
The tidyverse is widely adopted in the R data science community and has become a standard toolkit for many data analysts and data scientists.

9 core packages in tidyverse

When you load the tidyverse package, you will notice that the following core packages are attached:

library(tidyverse)

✔ dplyr 1.1.2 ✔ readr 2.1.4 ✔ forcats 1.0.0
✔ stringr 1.5.0 ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0 ✔ purrr 1.0.2

dplyr: provides a set of functions for data manipulation and transformation, making it easier to filter, arrange, group, and summarize data.

ggplot2: is a highly customizable and powerful package for creating data visualizations and graphics. It follows the “grammar of graphics” philosophy.

tidyr: helps reshape data into a tidy format, where each variable is in a separate column, each observation is in a separate row, and each type of observational unit forms a table.

purrr: provides tools for functional programming and iteration. It is particularly useful for working with lists, vectors, and data frames in a consistent and functional way.

readr: offers fast and efficient tools for reading data into R from various formats, including CSV, TSV, and more. It also includes functions for specifying data types and handling missing values.

tibble: introduces the tibble data frame, an enhanced version of the traditional data frame in R. Tibbles have improved printing, subsetting, and handling of column types.

stringr: provides functions for working with strings and text data, making it easier to manipulate and clean text within data frames.

forcats: focuses on working with categorical (factor) variables, providing tools for reordering, recoding, and summarizing categorical data.

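broom: installed with the tidyverse but not one of the nine core packages attached by library(tidyverse), so it needs to be loaded separately. It simplifies the process of tidying up model output, making it easier to work with the results of statistical models, such as regression or clustering.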

lubridate: designed for working with date and time data, providing functions to parse, manipulate, and format dates and times.

tidy data

There are three interrelated rules that make a dataset tidy:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

Fig 6.1 from R for data science: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.
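
As a rough illustration of these rules, the sketch below takes a made-up table in which one variable (the year) is spread across column names and reshapes it with pivot_longer() from tidyr. The data and column names are invented for the example.

```r
library(tidyr)

# Untidy: the variable "year" is hidden in the column names
untidy <- data.frame(
  country = c("A", "B"),
  `2022`  = c(10, 20),
  `2023`  = c(11, 22),
  check.names = FALSE
)

# Tidy: one column per variable, one row per country-year observation
tidy <- pivot_longer(
  untidy,
  cols = c(`2022`, `2023`),
  names_to = "year",
  values_to = "cases"
)
tidy
```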

Conflicts

library(tidyverse)
tidyverse_conflicts()

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You’ll notice that when we load the tidyverse package, it will mask the filter() and lag() functions used in base¹ R.
This means that when you call filter() or lag(), the default will be to use the dplyr version of the function.
To use the stats version, you need to use its full name, e.g. stats::filter().
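
A small sketch of the masking in action, using dplyr directly and a made-up data frame `df`. The two functions named filter() do quite different things, which is why the namespace prefix matters.

```r
library(dplyr)

df <- data.frame(x = 1:5)

# After library(dplyr), a bare filter() call refers to dplyr::filter(),
# which keeps the rows of a data frame that satisfy a condition.
kept <- filter(df, x > 3)

# The masked function is still available through its namespace prefix.
# stats::filter() applies a linear filter, here a 3-point moving average.
smoothed <- stats::filter(1:5, rep(1/3, 3))

kept
smoothed
```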

readr

As we saw previously, the base R read.csv() function does not automatically detect date columns.
After loading the tidyverse package, we can call the read_csv() function from the readr package.
read_csv() is designed to play nicely with other tidyverse functions.
read_csv() is generally considered to be faster and more memory-efficient than read.csv(), especially when working with large datasets.

Example

dat <- read_csv("data/example.csv")
dat

While read_csv() did detect dates automatically, it does not treat character vectors as factors. This can be fixed by specifying the column type:

dat <- read_csv("data/example.csv", col_types = cols(site = "f"))
head(dat)
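
For readers without the lecture's example.csv, here is a self-contained sketch of the same two calls on a made-up file. It uses col_factor(), the long form of the "f" shortcut; the column names are hypothetical.

```r
library(readr)

csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "id,site,date",
  "1,A,2023-09-18",
  "2,B,2023-09-19"
), csv_file)

# Default guessing: ISO dates become Date, text stays character
guessed <- read_csv(csv_file)

# Explicit specification: read `site` as a factor, guess the rest
specified <- read_csv(csv_file, col_types = cols(site = col_factor()))

sapply(guessed, class)
sapply(specified, class)
```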

Tibble

While it is not obvious in the output within the slides, rather than producing a data frame like read.csv(), the read_csv() function produces a tibble.
As stated in vignette("tibble"),

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.

Features of tibbles

Tibbles are designed to produce more readable and informative output when displayed in the R console.
Character columns remain character columns, which can prevent unintended conversion to factors.
They allow non-standard variable names (i.e. your variables can start with a number and can contain spaces)
To create a tibble use tibble() (in a similar way to how we used data.frame())
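
Two of these features can be seen side by side in a short sketch that builds the same made-up table with tibble() and with data.frame().

```r
library(tibble)

# Non-syntactic names are allowed when wrapped in backticks
tb <- tibble(`2023 sales` = c(10, 20), region = c("east", "west"))
df <- data.frame(`2023 sales` = c(10, 20), region = c("east", "west"))

names(tb)   # the name is kept exactly as typed
names(df)   # data.frame() rewrites it into a syntactic name

# Single-column subsetting: a tibble stays a tibble,
# while a data frame drops down to a plain vector
class(tb[, "region"])
class(df[, "region"])
```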

lubridate

As its name suggests, lubridate will be very useful when we are working with dates.

(today_n <- 20230918)
class(today_n)

[1] 20230918
[1] "numeric"

ymd(today_n) 
class(ymd(today_n))

[1] "2023-09-18"
[1] "Date"

(today_t <- "9/18/2023")
class(today_t)

[1] "9/18/2023"
[1] "character"

mdy(today_t)
class(mdy(today_t))

[1] "2023-09-18"
[1] "Date"
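
The pattern generalizes: the parser's name spells out the order of the date components in the input. The sketch below also shows a few of lubridate's accessor functions; the input strings are arbitrary examples of the same date.

```r
library(lubridate)

# The function name gives the order of year, month, and day in the input
d1 <- ymd("2023-09-18")
d2 <- mdy("09/18/2023")
d3 <- dmy("18.09.2023")

d1 == d2 && d2 == d3       # all three parse to the same Date

# Accessor functions pull out individual components
year(d1)
month(d1)
day(d1)
wday(d1, week_start = 1)   # day of the week, counting Monday as 1
```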

Date example

today_s <- "This lecture is scheduled for September 18, 2023 at 1 pm PMT."
class(today_s)

[1] "character"

We can use lubridate to parse the date-time with

mdy_h(today_s, tz = "Canada/Pacific")

[1] "2023-09-18 13:00:00 PDT"

class(mdy_h(today_s, tz = "Canada/Pacific"))

[1] "POSIXct" "POSIXt"
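
Once a date-time carries a time zone, lubridate can convert it. The sketch below is an illustration only: it uses the zone name "America/Vancouver" (an assumption; the slide uses "Canada/Pacific") and contrasts with_tz(), which keeps the instant, with force_tz(), which keeps the clock reading.

```r
library(lubridate)

# Parse a local time in an explicit time zone
local_time <- ymd_hm("2023-09-18 13:00", tz = "America/Vancouver")

# with_tz() shows the same instant on a different clock
utc_time <- with_tz(local_time, "UTC")

# force_tz() keeps the clock reading but changes the instant
relabelled <- force_tz(local_time, "UTC")

hour(utc_time)
local_time == utc_time     # same moment in time
as.numeric(difftime(local_time, relabelled, units = "hours"))
```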

Projects

Creating a project in R helps you organize your work, manage files, and maintain a structured workflow.
RStudio provides a convenient way to create and manage R projects.
Details on how to use and create R projects can be found at https://bookdown.org/daniel_dauber_io/r4np_book/starting-your-r-projects.html (among other places, e.g. Posit).
While creating a project is not mandatory for creating Rmd files for your assignments, for example, I highly recommend utilizing them for your own data science workflow.
