DATA 101: Making Predictions with Data
University of British Columbia Okanagan
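In today’s lecture we’ll be looking at how to read data into R, how to write it back out, how R handles data types and dates, and how the tidyverse and R projects fit in.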
In R, you can read data from various sources and formats.
Some common types include:
The method you choose depends on the type and location of your data.
In addition to specifying the file format, you also need to specify the file location.
The file could live on your computer (local) or somewhere on the internet (remote).
The location of a remote file is provided by a URL.
The location of a local file is specified using either an absolute or a relative path.
There are a few principal functions for reading data into R:
read.table(), read.csv(): for reading tabular data
readLines(): for reading lines of a text file
source(): for reading in R code files
load(): for reading in saved workspaces
Tabular data is a structured form of data organization where information is presented in rows and columns.
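As a minimal sketch of two of the non-tabular readers listed above, the example below writes two small files to a temporary folder and reads them back with readLines() and source(). The file names (notes.txt, helper.R) and the function double_it() are made up for illustration; load() is not shown here because it needs a saved workspace file first.

```r
# Work in a temporary folder so nothing on your computer is overwritten.
tmp_dir <- tempdir()

# readLines(): read a text file line by line into a character vector.
txt_file <- file.path(tmp_dir, "notes.txt")        # hypothetical file name
writeLines(c("first line", "second line"), txt_file)
lines_read <- readLines(txt_file)
print(lines_read)

# source(): run the R code stored in a file.
code_file <- file.path(tmp_dir, "helper.R")        # hypothetical file name
writeLines("double_it <- function(x) 2 * x", code_file)
source(code_file)                                  # defines double_it()
print(double_it(21))
```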
Both read.table() and read.csv() are used to read tabular data from text files.
read.csv(): This function is specifically designed to read data from CSV (Comma-Separated Values) files.
read.table() is more general and can be used to read tabular data from various text file formats (e.g. where data is separated by white space or tabs).
To read data from a CSV file, you can use the read.csv() function. For example:
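The sketch below is one way to try both functions without needing a data file of your own: it first writes a tiny comma-separated file and a tiny space-separated file to temporary locations, then reads each one back. The data values and object names are invented for illustration.

```r
# A small CSV file (comma-separated) written to a temporary location.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ana,1.62", "Ben,1.80"), csv_file)
people_csv <- read.csv(csv_file)

# The same data separated by white space, read with the more general read.table().
txt_file <- tempfile(fileext = ".txt")
writeLines(c("name height", "Ana 1.62", "Ben 1.80"), txt_file)
people_txt <- read.table(txt_file, header = TRUE)   # header = TRUE: first line holds column names

print(people_csv)
print(all.equal(people_csv, people_txt))            # both routes give the same data frame
```

Note that read.table() needs header = TRUE here, whereas read.csv() assumes a header row by default.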
If the file name does not contain an absolute path, the file name is relative to the current working directory.
The “working directory” is the directory (or folder) on your computer where R will look for and save files by default.
You can view the current working directory using getwd().
You can change the working directory using setwd().
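A minimal sketch of how the working directory interacts with relative paths is given below. It assumes a throwaway folder called wd_demo inside R's temporary directory (a made-up name) and restores the original working directory at the end.

```r
old_wd <- getwd()                   # remember where we started
print(old_wd)

# Create a demo folder containing one small CSV file.
demo_dir <- file.path(tempdir(), "wd_demo")        # hypothetical folder name
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(demo_dir, "small.csv"))

setwd(demo_dir)                     # change the working directory
small <- read.csv("small.csv")      # relative path, resolved against demo_dir
setwd(old_wd)                       # restore the original working directory

print(small)
```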
The working directory for an Rmd file is typically set to the directory where the Rmd file is located.
Relative paths within an Rmd document are interpreted relative to the directory where the Rmd file resides.
When running R code interactively within an Rmd file (e.g., by executing code chunks in an R Markdown document within RStudio), the working directory may be different.
You can change your working directory in interactive mode using setwd().
⚠️ Using setwd() within an Rmd file only works for the current code chunk; the working directory is restored after that chunk has been evaluated.
On the flip side, you can save data to a CSV file in R using the write.csv() function.
This function allows you to write the contents of a data frame (or a matrix) to a CSV file. The basic syntax is write.csv(df, file), where:
df is the data frame you want to save to the CSV file.
file is the path and filename where you want to save the CSV file (don’t forget the .csv extension).
⚠️ Warning: By default, write.csv() includes an extra column containing the row names in the output CSV file. More likely than not, you will not want to save row names, as they will often be read in by other software as data.
If you do not assign row names, the rows of the data frame are identified by numerical indices, starting from 1.
You can access the row names of a data frame using the row.names() function.
Turn off the writing of row names by specifying row.names = FALSE in the call to write.csv().
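The effect of row.names = FALSE is easiest to see by writing the same data frame twice and comparing the files. The sketch below does this with an invented two-row data frame and temporary file names.

```r
df <- data.frame(id = c(10, 20), score = c(0.5, 0.9))   # made-up data
print(row.names(df))                                    # default row names "1" and "2"

with_names    <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

write.csv(df, file = with_names)                          # default: row names are written
write.csv(df, file = without_names, row.names = FALSE)    # row names are omitted

print(readLines(with_names))       # first field on each line is the row name
print(readLines(without_names))

# Reading the first file back shows the row names arriving as an extra column "X".
print(names(read.csv(with_names)))
print(names(read.csv(without_names)))
```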
In R, you can save data using various file formats depending on your needs. Examples include:
Save as a TSV File: TSV (Tab-Separated Values) is similar to CSV, but it uses tabs as the delimiter.
Save as an RDS File: To save R objects (data frames, lists, etc.) with their structure and attributes, you can use RDS (R Data Serialization) files.
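As a rough illustration of the difference between the two formats, the sketch below saves one invented data frame as a TSV file (using write.table() with a tab separator) and as an RDS file (using saveRDS()), then reads both back. It assumes R 4.0 or later, where text columns are read back as character by default.

```r
# Made-up data; city is stored as a factor on purpose.
df <- data.frame(city = c("Kelowna", "Vernon"),
                 temp_c = c(21.5, 19.0),
                 stringsAsFactors = TRUE)

# TSV: plain text with tabs as the delimiter.
tsv_file <- tempfile(fileext = ".tsv")
write.table(df, file = tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)
from_tsv <- read.delim(tsv_file)

# RDS: a single R object saved with its structure and attributes.
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

print(class(from_tsv$city))   # text round trip: the factor information is not stored
print(class(from_rds$city))   # RDS round trip: the factor is preserved
```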
In R, saving your workspace refers to saving the current state of your R session, including all loaded data, variables, functions, and other objects, to a file (extension “.RData”).
You can then easily resume your R session at a later time without having to reload data or re-run scripts.
Advantages:
Convenience
Reproducibility
Disadvantages:
Large Files
Environment dependency
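A small sketch of the save-and-restore cycle follows. To keep it self-contained it uses save() on two named objects rather than save.image(), which would write everything in the workspace, and it loads the file into a fresh environment so that existing objects are not overwritten. The object names are invented.

```r
x <- 1:5
model_note <- "fitted on 2023 data"            # made-up objects to save

rdata_file <- tempfile(fileext = ".RData")
save(x, model_note, file = rdata_file)         # save.image(file = ...) would save the whole workspace

# load() restores the objects under their original names and returns those names.
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
print(loaded_names)
print(get("x", envir = restored))
```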
You will notice that R attempts to determine the data type of each column based on the content of the data.
Here’s how R determines data types:
Columns with numerical values are assigned the “numeric” data type.
Columns with text values (strings) are typically assigned the “character” data type.
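To see this type detection in action, the sketch below reads a few invented records with read.csv() and inspects the class of each column. It assumes R 4.0 or later, where text is kept as character rather than converted to factor.

```r
# Made-up records supplied as text so that no file is needed.
csv_text <- "id,height,name,joined
1,1.62,Ana,2023-09-20
2,1.80,Ben,2023-09-21"

records <- read.csv(text = csv_text)
str(records)
print(sapply(records, class))
```

Whole numbers are stored as integer, a more specific numeric type, and the joined column is read as character even though it contains dates.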
Base R has three date-time classes: Date, POSIXct, and POSIXlt.
Dates are represented by the Date class and can be coerced from a character string using the as.Date() function. This is a common way to end up with a Date object in R.
Times are represented by the POSIXct or the POSIXlt class.
POSIXct stores date and time values as a numeric value representing the number of seconds since the Unix epoch (January 1, 1970 at midnight UTC).
POSIXlt stores date and time values as a list of components (e.g., year, month, day, hour, minute, second).
[1] "2023-09-20"
[1] "2023-09-20 UTC"
[1] "2023-09-20 UTC"
You can see the “raw” format of these dates by stripping the variables of their class using the unclass() function:
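The original code for this step is not shown, so the following is a sketch under the assumption that the three values were created from the string "2023-09-20" in the UTC time zone. The variable names d, ct and lt are chosen here for illustration.

```r
d  <- as.Date("2023-09-20")
ct <- as.POSIXct("2023-09-20", tz = "UTC")
lt <- as.POSIXlt("2023-09-20", tz = "UTC")

print(d)
print(ct)
print(lt)

# Date: number of days since 1970-01-01.
print(unclass(d))            # 19620

# POSIXct: number of seconds since 1970-01-01 00:00:00 UTC.
print(as.numeric(ct))        # 1695168000

# POSIXlt: a list of components.
parts <- unclass(lt)
print(parts$year)            # years since 1900, so 123
print(parts$mon)             # months are counted from 0, so September is 8
print(parts$mday)            # day of the month, 20
```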
colClasses
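The slide content for this heading is not shown, so the sketch below is only one plausible use of the colClasses argument of read.csv(): declaring a column as a Date up front instead of letting R read it as text. The data and column names are invented.

```r
csv_text <- "joined,height
2023-09-20,1.62
2023-09-21,1.80"

# Default behaviour: the date column is read as character.
default_read <- read.csv(text = csv_text)
print(class(default_read$joined))

# With colClasses, each named column is converted to the requested class.
typed_read <- read.csv(text = csv_text,
                       colClasses = c(joined = "Date", height = "numeric"))
print(class(typed_read$joined))
print(typed_read$joined[2] - typed_read$joined[1])   # date arithmetic now works
```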
There are three interrelated rules that make a dataset tidy: each variable is a column, each observation is a row, and each value is a cell.
When you load the tidyverse package you will notice the following startup message:
✔ dplyr 1.1.2 ✔ readr 2.1.4 ✔ forcats 1.0.0
✔ stringr 1.5.0 ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0 ✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You’ll notice that when we load the tidyverse package, it masks the filter() and lag() functions from the stats package in base R.
This means that when you call filter() or lag(), the default will be to use the dplyr version of the function.
To use the stats version, you need to use its full name, e.g. stats::filter().
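As a small illustration of the full name, the sketch below calls stats::filter() directly to compute a three-point moving average. Because the call is qualified with stats::, it behaves the same whether or not dplyr is attached. The input vector is invented.

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter; equal weights of 1/3 give a
# centred three-point moving average, with NA at both ends.
smoothed <- stats::filter(x, rep(1 / 3, 3))
print(smoothed)
```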
As we saw previously, the base R read.csv() function does not automatically detect date columns.
After loading the tidyverse package, we can call the read_csv() function from the readr package.
read_csv() is designed to play nicely with other tidyverse functions.
read_csv() is generally considered to be faster and more memory-efficient than read.csv(), especially when working with large datasets.
While read_csv() detects dates automatically, it does not treat character vectors as factors. Columns that should be factors therefore need to be specified or converted explicitly.
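The code used in the lecture for this step is not shown. One possible approach, sketched below, is to request a factor through the col_types argument of read_csv(). It assumes the readr package (version 2 or later) is installed; the file contents and column names are invented.

```r
library(readr)

# Write a small made-up CSV file to a temporary location.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("joined,species,mass",
             "2023-09-20,cat,4.2",
             "2023-09-21,dog,9.1",
             "2023-09-22,cat,3.9"), csv_file)

# Default guessing: dates are detected, text stays character.
auto <- read_csv(csv_file, show_col_types = FALSE)
print(sapply(auto, class))

# Ask for a factor explicitly; the other columns are still guessed.
with_factor <- read_csv(csv_file,
                        col_types = cols(species = col_factor()),
                        show_col_types = FALSE)
print(levels(with_factor$species))
```

An alternative is to convert the column after reading, for example with as.factor().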
While it is not obvious in the output within the slides, the read_csv() function produces a tibble rather than a data frame like read.csv().
As stated in vignette("tibble"):
Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.
Tibbles are designed to produce more readable and informative output when displayed in the R console.
Character columns remain character columns, which can prevent unintended conversion to factors.
They allow non-standard variable names (i.e. your variables can start with a number and can contain spaces).
To create a tibble, use tibble() (in a similar way to how we used data.frame()).
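The sketch below builds the same invented data with data.frame() and with tibble() to show how each treats a non-standard column name. It assumes the tibble package is installed.

```r
library(tibble)

# A column name that starts with a number and contains a space.
scores_df <- data.frame(`2023 score` = c(90, 85), name = c("Ana", "Ben"))
scores_tb <- tibble(`2023 score` = c(90, 85), name = c("Ana", "Ben"))

print(names(scores_df))   # data.frame() rewrites the name to "X2023.score"
print(names(scores_tb))   # tibble() keeps "2023 score" as written
print(scores_tb)          # printing also shows the type of each column
```

To refer to such a column later, wrap its name in backticks, e.g. scores_tb$`2023 score`.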
As its name suggests, lubridate will be very useful when we are working with dates.
[1] "character"
We can use lubridate to parse date-times that are stored as character strings.
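The parsing call used in the lecture is not shown, so the following is a sketch that assumes the date-times are written in year-month-day order, for which lubridate provides ymd() and ymd_hms(). It assumes the lubridate package is installed; the example string is invented.

```r
library(lubridate)

raw_time <- "2023-09-20 14:30:00"
print(class(raw_time))            # still just text

# ymd_hms() parses year-month-day hour:minute:second, using UTC by default.
parsed_time <- ymd_hms(raw_time)
print(class(parsed_time))
print(hour(parsed_time))          # components can now be extracted

# ymd() parses a date without a time and returns a Date object.
parsed_date <- ymd("2023-09-20")
print(class(parsed_date))
```

For other orderings lubridate offers similarly named functions such as mdy() and dmy().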
Creating a project in R helps you organize your work, manage files, and maintain a structured workflow.
RStudio provides a convenient way to create and manage R projects.
Details on how to use and create R projects can be found at https://bookdown.org/daniel_dauber_io/r4np_book/starting-your-r-projects.html (among other places, e.g. Posit).
While creating a project is not mandatory (for example, for the Rmd files you create for your assignments), I highly recommend using projects in your own data science workflow.