Lecture 4: Getting data into R

DATA 101: Making Prediction with Data

Dr. Irene Vrbik

University of British Columbia Okanagan

Outline

In today’s lecture we’ll be looking at:

File formats

  • In R, you can read data from various sources and formats.

  • Some common types include:

    • CSV (comma-separated value) files,
    • Excel spreadsheets,
    • databases,
    • web APIs, …
  • The method you choose depends on the type and location of your data.

File location

  • In addition to specifying the file format, you also need to specify the file location.

  • The file could live on your computer (local) or somewhere on the internet (remote).

  • The location of a remote file is provided by a URL.

  • The location of a local file is specified using either an absolute or a relative path.

Functions for Reading in Data

There are a few principal functions for reading data into R. This lecture focuses on the functions for reading tabular data.

Tabular Data

Tabular data is a structured form of data organization where information is presented in rows and columns.

  • Rows: each row represents a single data record or observation.
  • Columns: each column represents an individual attribute, property, or variable.
  • Headers: the labels at the top of each column, describing the data within that column.
  • Cells: the individual data points where a row and a column intersect. Each cell contains a single value representing a specific attribute of the corresponding record.
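
The terms above can be sketched with a tiny data frame. The table below is made up for illustration (the object name cards and its values are hypothetical, not the lecture's data).

```r
# A small, made-up table: 3 rows (observations) and 3 columns (variables)
cards <- data.frame(
  face  = c("king", "queen", "jack"),
  suit  = c("spades", "spades", "hearts"),
  value = c(13, 12, 11)
)

names(cards)      # headers: the column labels
nrow(cards)       # number of rows (records)
ncol(cards)       # number of columns (variables)
cards[2, ]        # one row: a single observation
cards$suit        # one column: a single variable
cards[2, "suit"]  # one cell: row 2 of the "suit" column
```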

Reading in Tabular Data

Both read.table() and read.csv() functions are used to read tabular data from text files.

  • read.csv() is specifically designed to read data from CSV (comma-separated values) files.
  • read.table() is more general and can read tabular data from various text file formats (e.g. where values are separated by white space or tabs).
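
A rough sketch of how the two functions relate, using made-up data written to temporary files (the file contents and object names are assumptions for illustration only):

```r
# Write the same made-up table twice: once comma-separated, once tab-separated
csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")
writeLines(c("face,suit,value", "king,spades,13", "queen,hearts,12"), csv_file)
writeLines(c("face\tsuit\tvalue", "king\tspades\t13", "queen\thearts\t12"), tab_file)

# read.csv() assumes a header row and a comma separator
from_csv <- read.csv(csv_file)

# read.table() has to be told about the header and the separator
from_table <- read.table(csv_file, header = TRUE, sep = ",")
from_tabs  <- read.table(tab_file, header = TRUE, sep = "\t")

identical(from_csv, from_table)  # compare the two ways of reading the CSV file
all.equal(from_csv, from_tabs)   # compare the CSV table with the tab-separated one
```

In other words, read.csv() behaves like read.table() with defaults chosen for comma-separated files.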

Reading CSV Files

To read data from a CSV file, you can use the read.csv() function. For example:

data <- read.csv("data.csv")

If the file name does not contain an absolute path, the file name is relative to the current working directory.

## Relative path:
# e.g., if deck.csv was in a folder called "data" in my working directory
deck <- read.csv("data/deck.csv")

## Absolute path:
deck <- read.csv("/Users/ivrbik/DATA101/data/deck.csv")

Working Directory

  • The “working directory” is the directory (or folder) on your computer where R will look for and save files by default.

  • You can view the current working directory using getwd()

  • You can change the working directory using:

    setwd("/Users/ivrbik/path/to/your/directory")
  • The working directory for an Rmd file is typically set to the directory where the Rmd file is located.

  • Relative paths within an Rmd document are interpreted relative to the directory where the Rmd file resides.
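
To see how the working directory interacts with relative and absolute paths, here is a minimal sketch that works in a throwaway temporary folder. The folder layout and the small deck table are made up for illustration.

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()

# Build a throwaway folder with a "data" subfolder containing one CSV file
project_dir <- tempfile("wd-demo-")
dir.create(file.path(project_dir, "data"), recursive = TRUE)
write.csv(data.frame(face = c("king", "queen"), value = c(13, 12)),
          file = file.path(project_dir, "data", "deck.csv"),
          row.names = FALSE)

# An absolute path works no matter what the working directory is
deck_abs <- read.csv(file.path(project_dir, "data", "deck.csv"))

# A relative path is resolved against the current working directory
setwd(project_dir)
deck_rel <- read.csv("data/deck.csv")

# Restore the original working directory
setwd(old_wd)

identical(deck_abs, deck_rel)  # both paths point at the same file
```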

Working directories in Rmd files

  • When running R code interactively within an Rmd file (e.g., by executing code chunks in an R Markdown document within RStudio), the working directory may be different.

  • You can change your working directory in interactive mode using setwd()

  • ⚠️ Using setwd() within an Rmd file only works for the current code chunk; the working directory is restored after that chunk has been evaluated.

deck <- read.csv("data/deck.csv")
head(deck, 2) # only prints the first 2 rows
tail(deck, 2) # only prints the last 2 rows

Saving data

  • On the flip side, you can save data to a CSV file in R using the write.csv() function

  • This function allows you to write the contents of a data frame (or a matrix) to a CSV file. The basic syntax is:

write.csv(df, file = "filename.csv")
  • df is the data frame you want to save to the CSV file.
  • file is the path and filename where you want to save the CSV file (don’t forget the .csv extension)

Comment

⚠️ Warning: By default, write.csv() includes an extra column containing the row names in the output CSV file. More likely than not, you will not want to save row names as they will often be read in by other software as data.

  • If you do not assign row names, the rows of the data frame are identified by numerical indices, starting from 1.

  • You can access the row names of a data frame using the row.names() function.

  • To leave the row names out of the CSV file, specify row.names = FALSE in write.csv().

write.csv(deck, file = "deck-with-rows.csv")
deck_rows <- read.csv("deck-with-rows.csv")
head(deck_rows, 3)
write.csv(deck, file = "deck-no-rows.csv", row.names = FALSE)
deck_no_rows <- read.csv("deck-no-rows.csv")
head(deck_no_rows, 3)
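
For a version that runs without the lecture's data files, the sketch below uses a small made-up deck and temporary files, and looks only at the column names that come back.

```r
# A made-up stand-in for the lecture's deck data
deck <- data.frame(face = c("king", "queen", "jack"),
                   suit = c("spades", "spades", "hearts"))

row.names(deck)  # default row names are the indices, stored as text

with_rows <- tempfile(fileext = ".csv")
no_rows   <- tempfile(fileext = ".csv")

write.csv(deck, file = with_rows)                   # default: row names are written
write.csv(deck, file = no_rows, row.names = FALSE)  # row names are left out

names(read.csv(with_rows))  # the row names come back as an extra column named "X"
names(read.csv(no_rows))    # only the original columns
```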

Saving other data formats

In R, you can save data using various file formats depending on your needs. Examples include:

Save as a TSV File: TSV (Tab-Separated Values) is similar to CSV, but it uses tabs as the delimiter.

write.table(data, file = "data.tsv", sep = "\t", row.names = FALSE)
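
A small round-trip sketch, assuming a made-up data frame called scores: the file is written with tabs and read back with read.delim(), which uses a tab separator and a header row by default.

```r
scores <- data.frame(name = c("Ana", "Ben", "Cal"),
                     score = c(90.5, 82, 77.25))

tsv_file <- tempfile(fileext = ".tsv")
write.table(scores, file = tsv_file, sep = "\t", row.names = FALSE)

readLines(tsv_file)           # the raw text: fields separated by tab characters
back <- read.delim(tsv_file)  # read the tab-separated file back in
all.equal(scores, back)       # check that the table survived the round trip
```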

Save as an RDS File: To save R objects (data frames, lists, etc.) with their structure and attributes, you can use RDS (R Data Serialization) files.

saveRDS(data, file = "data.rds")
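
To illustrate what "with their structure and attributes" means, here is a minimal sketch comparing an RDS round trip with a CSV round trip. The visits data frame is made up, and the comments assume R 4.0 or later (where read.csv() does not convert strings to factors by default).

```r
visits <- data.frame(
  site = factor(c("north", "south", "north")),
  day  = as.Date(c("2023-09-18", "2023-09-19", "2023-09-20"))
)

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")
saveRDS(visits, file = rds_file)
write.csv(visits, file = csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)   # readRDS() is the counterpart of saveRDS()
from_csv <- read.csv(csv_file)

sapply(from_rds, class)  # the factor and Date classes are preserved
sapply(from_csv, class)  # the CSV version comes back as plain character columns
identical(visits, from_rds)
```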

Saving your workspace

  • In R, saving your workspace refers to saving the current state of your R session, including all loaded data, variables, functions, and other objects, to a file (extension “.RData”).

  • You can then easily resume your R session at a later time without having to reload data or re-run scripts.

# Load the workspace from an .RData file
load("my_workspace.RData")

Advantages:

  • Convenience

  • Reproducibility

Disadvantages:

  • Large Files

  • Environment dependency
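
A hedged sketch of saving and restoring objects through an .RData file. The objects here are made up, and the example uses save() on named objects rather than save.image() so that it does not depend on what else is in your session; loading into a separate environment is likewise a choice made for the example.

```r
x <- 1:10
note <- "made-up object for illustration"

rdata_file <- tempfile(fileext = ".RData")

# save() writes the named objects; save.image(file = rdata_file) would
# write everything in the global environment instead
save(x, note, file = rdata_file)

# Load into a fresh environment so nothing in the current session is overwritten
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)

loaded_names              # names of the objects that were restored
identical(restored$x, x)  # the restored copy matches the original
```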

Mixed Type Example

You will notice that R attempts to determine the data type of each column based on the content of the data.

Here’s how R determines data types:

  • Columns with numerical values are assigned the “numeric” data type (or “integer” when every value is a whole number).

  • Columns with text values (strings) are typically assigned the “character” data type.

Example

dat <- read.csv("data/example.csv")
head(dat)
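
Since data/example.csv is not reproduced here, the sketch below uses made-up CSV text with hypothetical column names to show what read.csv() guesses. The comments assume R 4.0 or later.

```r
# Made-up CSV content, supplied as text so the example is self-contained
csv_text <- "id,site,date,flag,note
1,A,2023-09-18,TRUE,first visit
2,B,2023-09-19,FALSE,second visit
3,A,2023-09-20,TRUE,third visit"

dat_guess <- read.csv(text = csv_text)
str(dat_guess)

# Whole numbers become integer, TRUE/FALSE becomes logical, and
# everything else (including the date column) stays character
sapply(dat_guess, class)
```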

Dates and Times

Base R has three date-time classes:

  • Dates are represented by the Date class.
  • Times are represented by the POSIXct or POSIXlt class.
    • POSIXct: stores a date-time as a single number, together with a time zone.
    • POSIXlt: stores a date-time as a list of named components.

Dates

Dates are represented by the Date class and can be coerced from a character string using the as.Date() function. This is a common way to end up with a Date object in R.

## Coerce a 'Date' object from character
x <- as.Date("1970-01-01")   
x
[1] "1970-01-01"
class(x)
[1] "Date"
x + 2
[1] "1970-01-03"
weekdays(x)
[1] "Thursday"
months(x)
[1] "January"
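
A few more things that can be done with Date objects, as a minimal sketch. The dates are arbitrary, and the format string is needed because the first one is not written in year-month-day order.

```r
# Strings that are not in ISO (year-month-day) order need a format
d1 <- as.Date("09/18/2023", format = "%m/%d/%Y")
d2 <- as.Date("2023-09-20")

d2 - d1                               # difference between two dates, in days
as.numeric(d2 - d1)                   # the same difference as a plain number
d1 < d2                               # dates can be compared
seq(d1, by = "day", length.out = 3)   # a run of consecutive dates
format(d1, "%d/%m/%Y")                # back to text, in a chosen layout
```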

Times

Times are represented by the POSIXct or the POSIXlt class.

  • POSIXct stores date and time values as a numeric value representing the number of seconds since the Unix epoch (January 1, 1970 at midnight UTC).
  • POSIXlt stores date and time values as a list of components (e.g., year, month, day, hour, minute, second).

Example

x <- Sys.Date()
(x)
(date_num <- as.POSIXct(x))
(date_list <- as.POSIXlt(x))
[1] "2023-09-20"
[1] "2023-09-20 UTC"
[1] "2023-09-20 UTC"

You can see the “raw” format of each date by stripping these variables of their class using the unclass() function:

unclass(x) # number of days since January 1, 1970.
[1] 19620
unclass(date_num) #  number of sec since Jan 1, 1970, at midnight UTC
[1] 1695168000
attr(,"tzone")
[1] "UTC"
names(unclass(date_list))
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"  
 [9] "isdst"  "zone"   "gmtoff"
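
The difference between the two time classes can also be sketched with a fixed date-time instead of today's date. The timestamp is arbitrary, and UTC is used so the result does not depend on the local time zone.

```r
t_ct <- as.POSIXct("2023-09-18 13:00:00", tz = "UTC")
t_lt <- as.POSIXlt(t_ct)

as.numeric(t_ct)   # POSIXct: seconds since 1970-01-01 00:00:00 UTC
t_ct + 3600        # arithmetic on POSIXct is in seconds (here, one hour later)

# POSIXlt: individual components can be pulled out by name
t_lt$hour          # hour of the day
t_lt$mday          # day of the month
t_lt$mon + 1       # months are stored as 0-11, so add 1
t_lt$year + 1900   # years are stored as years since 1900
```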

Example with colClasses

dat <- read.csv("data/example.csv", colClasses = 
                  c("numeric", "factor", "Date", "logical", "character"))
head(dat)
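
A self-contained sketch of the same idea, using made-up CSV text with hypothetical column names in place of data/example.csv:

```r
csv_text <- "id,site,date,flag,note
1,A,2023-09-18,TRUE,first visit
2,B,2023-09-19,FALSE,second visit"

# One class per column, in the order the columns appear in the file
dat_typed <- read.csv(text = csv_text,
                      colClasses = c("numeric", "factor", "Date",
                                     "logical", "character"))

sapply(dat_typed, class)             # the classes we asked for
levels(dat_typed$site)               # factor levels found in the data
dat_typed$date[2] - dat_typed$date[1]  # date arithmetic now works
```

Note that the "Date" class only works directly here when the dates are written in year-month-day order.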

tidyverse

  • The tidyverse is a collection of open-source R packages and tools for data science and data analysis.
  • It is designed to make data manipulation, visualization, and modeling more efficient, intuitive, and consistent in R.
  • The tidyverse philosophy is centered around the principles of “tidy data” and “tidy tools.”
  • The tidyverse is widely adopted in the R data science community and has become a standard toolkit for many data analysts and data scientists.

9 core packages in tidyverse

When you load the tidyverse package, you will notice a message listing the core packages that are attached, along with their versions:

library(tidyverse)

✔ dplyr 1.1.2 ✔ readr 2.1.4 ✔ forcats 1.0.0
✔ stringr 1.5.0 ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0 ✔ purrr 1.0.2

tidy data

There are three interrelated rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Fig 6.1 from R for Data Science: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.

Conflicts

library(tidyverse)
tidyverse_conflicts()
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  • You’ll notice that when we load the tidyverse package, dplyr masks the filter() and lag() functions from the stats package, which is part of base R.

  • This means that when you call filter() or lag(), the default will be to use the dplyr version of the function.

  • To use the stats version, you need to use its full name, e.g. stats::filter().
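
The two filter() functions do very different things, which the following sketch tries to show. It assumes the dplyr package is installed; the vector and data frame are made up.

```r
suppressPackageStartupMessages(library(dplyr))

x <- c(1, 2, 3, 4, 5)
df <- data.frame(x = x)

# dplyr's filter(): keep the rows where the condition holds
kept <- filter(df, x > 3)
kept
dplyr::filter(df, x > 3)   # the same call, written with its full name

# stats' filter(): a linear filter, here a 3-point moving average
smoothed <- stats::filter(x, rep(1/3, 3))
smoothed
```

Writing package::function() is a safe habit whenever a name is masked, since it does not depend on the order in which packages were loaded.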

readr

  • As we saw previously, the base R read.csv() function does not automatically detect date columns.

  • After loading the tidyverse package, we can call the read_csv() function from the readr package.

  • read_csv() is designed to play nicely with other tidyverse functions

  • read_csv() is generally considered to be faster and more memory-efficient than read.csv(), especially when working with large datasets.

Example

dat <- read_csv("data/example.csv")
dat

While read_csv() did detect the dates automatically, it does not treat character vectors as factors. A column can be read in as a factor by specifying its type with col_types:

dat <- read_csv("data/example.csv", col_types = cols(site = "f"))
head(dat)
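
A self-contained sketch of the same behaviour, assuming readr (version 2.0 or later) is installed. The CSV text and column names are made up; I() tells read_csv() that the string is literal data rather than a file path.

```r
library(readr)

csv_text <- "id,site,date,flag,note
1,A,2023-09-18,TRUE,first visit
2,B,2023-09-19,FALSE,second visit"

# Let readr guess every column type
dat_readr <- read_csv(I(csv_text), show_col_types = FALSE)
sapply(dat_readr, class)   # the date column is detected as Date

# Override the guess for one column: "f" is the shorthand for a factor
dat_factor <- read_csv(I(csv_text), col_types = cols(site = "f"))
class(dat_factor$site)
```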

Tibble

  • While it is not obvious in the output within the slides, rather than producing a data frame like read.csv(), the read_csv() function produces a tibble.

  • As stated in vignette("tibble"),

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.

Features of tibbles

  • Tibbles are designed to produce more readable and informative output when displayed in the R console.

  • Character columns remain character columns, which can prevent unintended conversion to factors.

  • They allow non-standard variable names (i.e. your variables can start with a number and can contain spaces)

  • To create a tibble use tibble() (in a similar way to how we used data.frame())
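
As a quick sketch of the naming difference, assuming the tibble package is installed (the column names and values are made up):

```r
library(tibble)

# Backticks let us type names that are not valid R identifiers
tb <- tibble(`2023 sales` = c(10, 20), region = c("east", "west"))
df <- data.frame(`2023 sales` = c(10, 20), region = c("east", "west"))

names(tb)   # the tibble keeps the name exactly as written
names(df)   # data.frame() rewrites it into a syntactically valid name
class(tb)   # a tibble is still a data frame, with extra classes on top
```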

lubridate

As its name suggests, lubridate will be very useful when we are working with dates.

(today_n <- 20230918)
class(today_n)
[1] 20230918
[1] "numeric"
ymd(today_n) 
class(ymd(today_n))
[1] "2023-09-18"
[1] "Date"
(today_t <- "9/18/2023")
class(today_t)
[1] "9/18/2023"
[1] "character"
mdy(today_t)
class(mdy(today_t))
[1] "2023-09-18"
[1] "Date"

Date example

today_s <- "This lecture is scheduled for September 18, 2023 at 1 pm PMT."
class(today_s)
[1] "character"

We can use lubridate to parse the date-time with mdy_h():

mdy_h(today_s, tz = "Canada/Pacific")
[1] "2023-09-18 13:00:00 PDT"
class(mdy_h(today_s, tz = "Canada/Pacific"))
[1] "POSIXct" "POSIXt" 
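
Once a date has been parsed, lubridate also provides helpers for pulling out components and doing arithmetic. The sketch below assumes lubridate is installed; the dates and the "America/Vancouver" time zone are chosen only for illustration.

```r
suppressPackageStartupMessages(library(lubridate))

d <- dmy("18/09/2023")   # day-month-year order this time
year(d)                  # extract the year
month(d)                 # extract the month number
day(d)                   # extract the day of the month
d + days(30)             # add 30 days

# Parse a date-time in UTC, then display it in another time zone
lecture_utc   <- ymd_hm("2023-09-18 13:00", tz = "UTC")
lecture_local <- with_tz(lecture_utc, "America/Vancouver")
lecture_local
```

with_tz() changes how the moment is displayed, not the moment itself, so lecture_utc and lecture_local refer to the same instant.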

Projects

  • Creating a project in R helps you organize your work, manage files, and maintain a structured workflow.

  • RStudio provides a convenient way to create and manage R projects.

  • Details on how to use and create R projects can be found at https://bookdown.org/daniel_dauber_io/r4np_book/starting-your-r-projects.html (among other places, e.g. Posit).

  • Creating a project is not mandatory (for example, you can create Rmd files for your assignments without one), but I highly recommend using projects for your own data science workflow.