Lecture 2: Getting Familiar with R

DATA 101: Making Predictions with Data

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

Today we will be getting more familiar with R.

To save space, I will often place the output to the right of the code

2 + 3 # --> output

[1] 5

Recall that the HTML lecture notes include a copy-to-clipboard button for each code block.
If you run the same code in your console, it will look like this:

> 2 + 3 
[1] 5

Calculations

R will follow the order of operations

100 - (3 * (2 + 1))^2

[1] 19

100 - (3 * 2 + 1)^2

[1] 51

100 - 3 * (2 + 1)^2

[1] 73

Note the syntax errors below.

20 - [3 * (2 + 4)]
## Error: unexpected '[' in "20 - ["

20 - (3 x (2 + 4))
## Error: unexpected symbol in "20 - (3 x"
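Both errors above come from notation that R does not accept: only round brackets group expressions, and multiplication must be written with *. A minimal corrected sketch (the variable name result is only for illustration):

```r
# Only round brackets group expressions; * is the multiplication operator
result <- 20 - (3 * (2 + 4))
result   # 20 - 18 = 2
```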

R Objects

In R, an object is a fundamental data structure that represents a value or a collection of values.
Objects typically have a specific data type.
R provides various structures of objects, including vectors, matrices, data frames, lists, factors, scalars, and functions.
Objects are “assigned” to variable names using either the arrow (<-) or the equal sign (=) as the assignment operator.

four <- 4 # same as --->

four = 4

Naming variables

Variable names …

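cannot contain spaces or special characters other than an underscore (_) and a period (.),
can contain numbers,
must begin with a letter (or a period that is not followed by a number), and
cannot be a reserved word such as for, if, or TRUE.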

## allowed
data101
no_problem
camelCase
data.100

## Not allowed
4you
data-101
for

## not recommended 
print
var

⚠️ You should avoid using function names as variable names to avoid confusion.

Updating Variables

If we need to update a variable, we can use the same assignment operator to change its value.

int <- 3
num <- int + 6.1

You call the variable by name

num # or print(num)

[1] 9.1

print(num)

[1] 9.1

print(c(int, num))

[1] 3.0 9.1
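Updating a variable in terms of its own current value works the same way. A small sketch, using a made-up variable called count:

```r
count <- 10          # initial value
count <- count + 1   # the right-hand side is evaluated first, then re-assigned
count <- count * 2   # the old value is overwritten each time
print(count)         # [1] 22
```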

Data Types

Here is a list of some important data types used in R:

Integer (whole numbers)
Double (real numbers)
Character (text strings)
Logical (TRUE / FALSE, which behave like 1 / 0 in arithmetic)
Factor (categorical data)
NA (indicates a missing element)

Data types in R

int <- 2L         # integer
num <- 2          # numeric
my_str <- "2"     # string/character
char <- 'charlie' # double/single quotes
logical <- TRUE   # T/F note CAPITALS
missing <- NA     # Not Available

N.B. single or double quotes are allowed for strings.
There are NO quotes on TRUE/FALSE and NAs

The typeof() function returns the data type of a variable …

typeof(int)

[1] "integer"

typeof(num)

[1] "double"

typeof(my_str)

[1] "character"

typeof(logical)

[1] "logical"

typeof(missing)

[1] "logical"
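Besides typeof(), base R has is.*() functions that test for a particular type and return TRUE or FALSE. The sketch below re-creates three of the variables from above so it can be run on its own:

```r
int <- 2L
num <- 2
missing <- NA

is.integer(int)   # TRUE
is.integer(num)   # FALSE: 2 without the L suffix is stored as a double
is.numeric(int)   # TRUE: integers and doubles both count as numeric
is.na(missing)    # TRUE
int == num        # TRUE: the values are equal even though the types differ
```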

Structures

Here is a list of some important structures used in R:

Vectors
Scalars
Matrices
Data Frames
Lists
Factors
Functions

Vectors

Vectors are the simplest data structure in R.
The c() function can be used to build a vector which is simply a sequence of elements.

general_vec <- c(element_1, element_2, ..., element_n)

They store a sequence of values of the same data type, such as numeric, character, logical, or complex values.
That is, a vector cannot contain elements of different data types.

Vector examples

e.g. create a vector with the numbers 1 through 6

my_vec <- c(1,2,3,4,5,6) # this is a vector of length 6
length(my_vec)

[1] 6

A shorthand way of doing this would be to use the colon (:)

my_vec <- 1:6             # [1] 1 2 3 4 5 6
count_down <- 5:0         # [1] 5 4 3 2 1 0

It is common to use terms like “numeric vector,” “character vector,” or “logical vector” to emphasize the data type.

numbers <- c(1,4,-9, 5.5)                   # numeric vector
my_strs <- c("a", "bee", "see you later")   # character vector
log_vec <- c(TRUE, FALSE, FALSE, TRUE, F)   # logical vector

Vector Indexing

You can access elements of a vector using square brackets []

numbers <- c(3,4,-9, 5.5, 10)

# extract the FIRST element
numbers[1]

[1] 3

#  extract 2nd & 4th elements
numbers[c(2, 4)]

[1] 4.0 5.5

# extract all but the third
numbers[-3]

[1]  3.0  4.0  5.5 10.0

# get 1st-4th element (inclusive)
numbers[1:4]

[1]  3.0  4.0 -9.0  5.5

⚠️ It is worth pointing out that some programming languages (e.g. Python) start their index at 0 instead of 1
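A few more indexing patterns that follow from the rules above, shown as a sketch on the same numbers vector. Using length() to reach the last element and assigning into an index are standard base R, though they were not shown in the examples above:

```r
numbers <- c(3, 4, -9, 5.5, 10)

numbers[length(numbers)]   # last element: 10
numbers[-c(1, 2)]          # drop the first two: -9.0 5.5 10.0
numbers[2] <- 40           # indexing can also be used to replace a value
numbers                    # 3.0 40.0 -9.0 5.5 10.0
```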

Vector Indexing (cont’d)

log_vec
numbers

[1]  TRUE FALSE FALSE  TRUE FALSE
[1]  3.0  4.0 -9.0  5.5 10.0

What do you think this will do?

numbers[log_vec]

[1] 3.0 5.5

This will extract the elements from numbers for which the corresponding element in log_vec is equal to TRUE

⚠️ Warning: R will not produce a warning if the vectors are not of the same length! A shorter logical vector is recycled, and a longer one returns NA for the positions that do not exist.

numbers[c(TRUE,FALSE)]
numbers[rep(TRUE, 9)]

[1]  3 -9 10
[1]  3.0  4.0 -9.0  5.5 10.0   NA   NA   NA   NA
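In practice the logical vector is usually built from a condition rather than typed by hand. The sketch below assumes the comparison operator >, which returns one TRUE/FALSE per element; the name is_positive is made up for illustration:

```r
numbers <- c(3, 4, -9, 5.5, 10)

is_positive <- numbers > 0   # TRUE TRUE FALSE TRUE TRUE
numbers[is_positive]         # 3.0 4.0 5.5 10.0
numbers[numbers > 3.5]       # same idea in one step: 4.0 5.5 10.0
sum(is_positive)             # TRUE counts as 1, so this counts the matches: 4
```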

Scalars

A single value (a scalar) is technically a vector of length 1.

x <- 0 ; is.vector(x)

[1] TRUE

length(x)

[1] 1

typeof(x)

[1] "double"

I can combine vectors together using c()

longer_vec <- c(x,my_vec)
print(longer_vec)

[1] 0 1 2 3 4 5 6

Coercion

Where appropriate, you can convert objects of one data type to another using as. followed by the data type

x <- 1; as.character(x)

[1] "1"

as.logical(x) 
# 0 = FALSE, 1 = TRUE

[1] TRUE

If different data types are combined, R coerces them all to the most flexible type present (logical → integer → double → character).

v1 <- c(1, "hi!", TRUE)
v1; typeof(v1)

[1] "1"    "hi!"  "TRUE"
[1] "character"

v2 <- c(8,4,TRUE,3,FALSE)
v2; typeof(v2)

[1] 8 4 1 3 0
[1] "double"
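A few more explicit conversions with the as.*() family, as a sketch. Note that a conversion that does not make sense returns NA with a warning rather than an error:

```r
as.numeric("3.14")          # 3.14
as.integer(3.9)             # 3 (truncates toward zero, does not round)
as.numeric("three")         # NA, with a warning that NAs were introduced
as.logical(c(0, 1, 5))      # FALSE TRUE TRUE: any non-zero number is TRUE
sum(c(TRUE, FALSE, TRUE))   # 2: logicals are coerced to 1/0 in arithmetic
```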

Matrices

Matrices store values in a two-dimensional array.
One way of creating a matrix is to supply a vector to the matrix() function; see ?matrix for details.

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE)

💡 The parameter values specified in the documentation represent default values. These values are used by the function when the caller does not provide a specific argument value for that parameter.
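To see the defaults in action, one can call matrix() with no arguments and then override some of them by name. A minimal sketch (m0 and m1 are illustrative names):

```r
m0 <- matrix()                        # all defaults: data = NA, nrow = 1, ncol = 1
dim(m0)                               # 1 1
is.na(m0[1, 1])                       # TRUE

m1 <- matrix(0, nrow = 2, ncol = 3)   # override some defaults by name
dim(m1)                               # 2 3
```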

Matrix examples

my_vec

[1] 1 2 3 4 5 6

matrix(my_vec) # column vector

     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6

matrix(my_vec, ncol = 2)

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

matrix(my_vec,ncol = 2,byrow = T)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

# extra numbers get thrown away
matrix(my_vec, nrow = 2,ncol = 2)

     [,1] [,2]
[1,]    1    3
[2,]    2    4

# some numbers recycled
matrix(my_vec, nrow = 3,ncol = 5)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    1    4    1
[2,]    2    5    2    5    2
[3,]    3    6    3    6    3

# can work with other data types
matrix(my_strs)

     [,1]           
[1,] "a"            
[2,] "bee"          
[3,] "see you later"

💡 Notice how I can use T instead of TRUE. Unlike TRUE, T is an ordinary variable that can be overwritten, so TRUE is the safer choice.

How would you make a row matrix?

matrix(my_vec, nrow = 1)

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6

Matrix indexing

You can index matrices using square brackets [ ] to extract specific elements, rows, or columns from the matrix.

(mat <- matrix(my_vec, nrow = 2)) # ncol = 3 need not be specified

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Extract the element in the 1st row and third column

mat[1,3]

[1] 5

Extract the entire first row

mat[1,]

[1] 1 3 5

Extract the entire third column

mat[,3] # returns a 1D vector

[1] 5 6
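The same bracket notation accepts vectors of positions, so several rows or columns can be pulled out at once. The sketch below rebuilds mat so it runs on its own; the drop = FALSE argument is a standard option of [ that keeps the result as a matrix:

```r
my_vec <- 1:6
mat <- matrix(my_vec, nrow = 2)

dim(mat)                         # 2 3 (rows, columns)
mat[, c(1, 3)]                   # columns 1 and 3, still a 2 x 2 matrix
mat[2, 2:3]                      # row 2, columns 2 to 3: 4 6
col3 <- mat[, 3, drop = FALSE]   # drop = FALSE keeps the result a matrix
dim(col3)                        # 2 1
```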

Factors

In R, a factor is a data type used for categorical (aka nominal) data (e.g. marital status: “Married,” “Single,” “Divorced”).
That is, factors are used to represent data that can take on a limited, fixed set of distinct values or categories.
Notably, nominal categories do not have any inherent order or ranking among them.
While it may look like character data, note the differences in the following output.

Factor Examples

eyecolor <- c("brown", "hazel", "brown", "blue", "blue")  # char vec
marrital_status <- c(1, 2, 2, 3)                          # num vec

To convert the character vector above to a factor, use:

factor(eyecolor)

[1] brown hazel brown blue  blue 
Levels: blue brown hazel

Often, categorical data will be coded using numbers. These can be coerced to a factor using:

factor(marrital_status) # or

[1] 1 2 2 3
Levels: 1 2 3

#  1 = married, 2 = single, 3 = divorced
(marrital_status <- factor(marrital_status, levels = c(1,2,3), 
       labels = c("Married", "Single", "Divorced")))

[1] Married  Single   Single   Divorced
Levels: Married Single Divorced

Factor Examples (cont’d)

If we know there is another category that we didn’t happen to have in our vector, we can let R know of it using:

(more_eyecolors <- factor(eyecolor, 
                         levels = c("hazel", "blue", "brown","green")))

[1] brown hazel brown blue  blue 
Levels: hazel blue brown green

In addition, if we know that some categories might be duplicated using different labels we can merge them using:

x <- c("Man", "Male", "Man", "Lady", "Female")
(xf <- factor(x, levels = c("Male", "Man" , "Lady",   "Female"),
                 labels = c("Male", "Male", "Female", "Female")))

[1] Male   Male   Male   Female Female
Levels: Male Female
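Once a factor exists, a few base R functions help inspect it. This sketch rebuilds the eye colour factor (under the illustrative name eye_f) and shows its levels, the counts per level, and the integer codes stored underneath:

```r
eyecolor <- c("brown", "hazel", "brown", "blue", "blue")
eye_f <- factor(eyecolor, levels = c("hazel", "blue", "brown", "green"))

levels(eye_f)       # "hazel" "blue" "brown" "green"
nlevels(eye_f)      # 4
table(eye_f)        # counts per level; green appears with a count of 0
as.integer(eye_f)   # underlying integer codes: 3 1 3 2 2
```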

Lists

So far, all of the previous structures required that all the elements are of the same data type.
In R, a list is a versatile and flexible data structure that can hold elements of different data types, including other lists, vectors, matrices, data frames, and even functions.
Lists are commonly used to store and manage heterogeneous data or complex data structures.
Elements within the list can be named for convenience

List examples

numbers <- c(1,4,-9, 5.5)                   # numeric vector
my_strs <- c("a", "bee", "see you later")   # character vector

my_list <- list(numbers, c("Hello", "World!"), c(TRUE, FALSE))
named_list <- list(numbers = numbers, msg = c("Hello", "World!"),
                   logicals = c(TRUE, FALSE))

my_list

[[1]]
[1]  1.0  4.0 -9.0  5.5

[[2]]
[1] "Hello"  "World!"

[[3]]
[1]  TRUE FALSE

named_list

$numbers
[1]  1.0  4.0 -9.0  5.5

$msg
[1] "Hello"  "World!"

$logicals
[1]  TRUE FALSE

List indexing

You can access elements within a list using double square brackets [[ ]] or single square brackets [ ], specifying either the element’s position or its name. Double brackets return the element itself; single brackets return a list containing it.
In R, the dollar sign ($) operator is used to access elements within a list (or data frame) by their names.
In the examples here, each element of the list is a vector, which can then be indexed using single square brackets [ ] as discussed in Vector Indexing.

List Indexing Examples

The following are equivalent for returning the first element of the list, i.e. a vector.

named_list[[1]] 
named_list$numbers
named_list[['numbers']]
num_vec <- named_list[[1]]

[1]  1.0  4.0 -9.0  5.5
[1]  1.0  4.0 -9.0  5.5
[1]  1.0  4.0 -9.0  5.5

The resulting vector can then be indexed

num_vec[3]

[1] -9

Which is the same as:

named_list[[1]][3]

[1] -9
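The difference between single and double brackets is easy to miss, so here is a short sketch that rebuilds named_list and compares the two (sub_list and element are illustrative names):

```r
named_list <- list(numbers = c(1, 4, -9, 5.5),
                   msg = c("Hello", "World!"),
                   logicals = c(TRUE, FALSE))

sub_list <- named_list[1]     # single brackets: a list of length 1
element  <- named_list[[1]]   # double brackets: the numeric vector itself
class(sub_list)               # "list"
class(element)                # "numeric"
named_list$msg[2]             # "World!"
named_list[c("numbers", "msg")]   # single brackets can select several elements
```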

Data Frames

Data frames are the two-dimensional version of a list.
They provide an ideal way to store data sets which will most likely have data of different types.
A data frame in R is much like an Excel spreadsheet.
Like lists, data frames store a sequence of vectors.
Unlike lists, these vectors must all be the same length.

Visualizing Data Frames

[Figure: Data frames store data as a sequence of columns. Each column can be a different data type. Every column in a data frame must be the same length. Image source: Hands-On Programming with R.]

Data Frame Examples

To create a data frame we can supply data.frame() with any number of vectors, each separated with a comma.

(df <- data.frame(
  face = c('ace', 'two', 'six'),  
  suit=c('clubs','clubs','clubs'), 
  value = c(1, 2, 3)))

  face  suit value
1  ace clubs     1
2  two clubs     2
3  six clubs     3

In this example, we have created a data frame named df with three columns: face, suit, and value.
Each vector should have the same length, as each element corresponds to a row in the data frame.

Object types vs. class

Each data frame is a list with class data.frame.

typeof(df)

[1] "list"

class(df)

[1] "data.frame"

In R, both typeof() and class() are functions used to examine the characteristics of objects

typeof() is concerned with the internal representation of the object, focusing on the low-level data type.
class() is concerned with how the object behaves and how it interacts with other objects and functions in R

typeof(marrital_status)

[1] "integer"

class(marrital_status)

[1] "factor"
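To make the distinction concrete, the sketch below builds a small factor (f is an illustrative name) and strips its class with unclass() to reveal the integer codes that typeof() is reporting on:

```r
f <- factor(c("low", "high", "low"))

typeof(f)               # "integer": how the values are stored
class(f)                # "factor": how R treats the object
unclass(f)              # integer codes 2 1 2, plus a "levels" attribute
inherits(f, "factor")   # TRUE
```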

Data Frame Indexing

To access columns of the data frame you can use either square brackets [] or the $ operator.

Columns

df[,1] # same as

[1] "ace" "two" "six"

df$face

[1] "ace" "two" "six"

df[c(1,3)]

  face value
1  ace     1
2  two     2
3  six     3

Rows

df[1,]

  face  suit value
1  ace clubs     1

df[c(1,3),]

  face  suit value
1  ace clubs     1
3  six clubs     3
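Row and column indexing can be combined to reach single cells, filter rows, or add a column. The following is a sketch on the same card data; the column name double_value is hypothetical and only serves the example:

```r
df <- data.frame(face = c("ace", "two", "six"),
                 suit = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3))

df[2, "face"]        # row 2 of the face column: "two"
df$value[3]          # third entry of the value column: 3
df[df$value > 1, ]   # rows where value exceeds 1 (rows 2 and 3)
df$double_value <- df$value * 2   # hypothetical new column, added with $
dim(df)              # 3 4
```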

Viewing Data Frames

To invoke a spreadsheet-style viewer of your data in the Source panel of RStudio, you can execute:

View(df)

The str() function gives you a quick overview of your data

str(df)

'data.frame':   3 obs. of  3 variables:
 $ face : chr  "ace" "two" "six"
 $ suit : chr  "clubs" "clubs" "clubs"
 $ value: num  1 2 3

Functions

A function is a self-contained block of code that performs a specific task or set of tasks.
There are many functions available in R.
We have used a few functions already, e.g. c() and print().
To access the help file for any function, use the ? operator followed by the function name. For example:

?print

In(put)s and out(put)s of functions

Functions typically accept input values, called arguments or parameters, perform operations on them, and return a result.

Output of a function can either be printed to the console or assigned to a variable.
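As a small sketch of that difference: calling a function on its own prints the result, while assigning it stores the result silently for later use (biggest is an illustrative name):

```r
numbers <- c(3, 4, -9, 5.5, 10)

max(numbers)                 # printed to the console: [1] 10
biggest <- max(numbers)      # assigned: nothing is printed
biggest - min(numbers)       # reuse the stored value: 10 - (-9) = 19
round(3.14159, digits = 2)   # an argument passed by name: 3.14
```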

Function Examples

numbers <- c(3.0,  4.0, -9.0, 5.5, 10.0); max(numbers)

[1] 10

sort(numbers)

[1] -9.0  3.0  4.0  5.5 10.0

date() # no arguments

[1] "Tue Sep 19 08:40:57 2023"

R comes with more sophisticated functions like those that can randomly sample data from a Normal distribution.

rnorm(6) # samples random numbers from the standard normal distribution

[1]  0.5530039 -0.3726930 -0.1572766 -0.4543976 -0.5384378  0.6014636

User Defined Functions

Functions are designed to be reusable and modular.

👍 Rule of Thumb: if you have written the same code more than twice, consider writing a function.

Here’s the general syntax for creating your own function:

functionName <- function(parameters) {
  
  # code to execute on parameters
  
}

The same naming rules for variables apply to functions

E.g. create a function that counts the number of unique elements in a vector:

lu <- function(x){
  length(unique(x))   # unique() drops duplicates; length() counts what is left
}
first_names <- c("sara", "sara", "sara", "peter", "peter", "irene")
lu(first_names)

[1] 3
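Functions can also take more than one parameter, and a parameter can be given a default value, just like the built-in functions seen earlier. The helper below, value_range, is a hypothetical example rather than part of base R:

```r
# Hypothetical helper: spread between the largest and smallest value.
# na.rm defaults to FALSE, so missing values are kept unless the caller asks otherwise.
value_range <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

value_range(c(3, 4, -9, 5.5, 10))        # 19
value_range(c(1, NA, 5))                 # NA
value_range(c(1, NA, 5), na.rm = TRUE)   # 4
```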

Creating vectors using `seq()`

A common way to make vectors is to use the seq() function:

seq(from, to, by=, ...) # default

The first argument this function is expecting is from, the starting value of the sequence.
The second argument is to, the ending value of the sequence
The third argument is by which sets the increment of the sequence.
... indicates that there are alternative arguments that may be passed to this function. These are described in the Arguments section of the help file, which you can access using ?seq.

`seq()` Examples

If you arrange your arguments in the order R expects, you do not need to specify the argument name.

seq(from = 1, to = 9, by = 2)     # same as
seq(1, 9, 2)                      # [1] 1 3 5 7 9

If you rearrange them or use alternative (non-default) arguments, you need to explicitly provide the argument name.

seq(0, 50, length.out = 11)  # NOT the same as

 [1]  0  5 10 15 20 25 30 35 40 45 50

seq(0, 50, 11) # same as seq(0, 50, by = 11)

[1]  0 11 22 33 44
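A few further variations, as a sketch: a negative by counts downwards, named arguments can be given in any order, and seq_len() is a base R shortcut for 1 up to n:

```r
seq(10, 1, by = -3)             # 10 7 4 1
seq(0, 1, length.out = 5)       # 0.00 0.25 0.50 0.75 1.00
seq(to = 9, from = 1, by = 4)   # named arguments in any order: 1 5 9
seq_len(4)                      # 1 2 3 4
```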

Creating vectors using `rep()`

Another common way to make vectors is to use the rep() function, which replicates the values in x.

rep(x, ...) # see documentation for what can be used in ...

times: an integer-valued vector giving the (non-negative) number of times to repeat each element if of length length(x), or to repeat the whole vector if of length 1.
length.out: a non-negative integer. The desired length of the output vector.
each: a non-negative integer. Each element of x is repeated each times.

`rep()` Examples

Some useful examples from the help file:

rep(1:4, 2)

[1] 1 2 3 4 1 2 3 4

rep(1:4, each = 2)

[1] 1 1 2 2 3 3 4 4

rep(1:4, c(2,1,2,1))

[1] 1 1 2 3 3 4
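The remaining arguments can be sketched the same way. Here rep() is applied to a character vector, cut to a fixed length with length.out, and given each and times together:

```r
rep(c("a", "b"), times = 3)     # "a" "b" "a" "b" "a" "b"
rep(1:4, length.out = 6)        # 1 2 3 4 1 2
rep(1:2, each = 2, times = 2)   # 1 1 2 2 1 1 2 2
```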

💡 Don’t overlook the Examples at the end of a help file. Sometimes they are more useful than the descriptions.

Vector Functions

There are a number of useful functions you can apply to vectors. Just a few examples:

vec <- c(3,4,-9,5.5,5.5,10)
log(vec) # performs element-wise calculations

[1] 1.098612 1.386294      NaN 1.704748 1.704748 2.302585

log(abs(vec)) # abs for absolute values

[1] 1.098612 1.386294 2.197225 1.704748 1.704748 2.302585

summary(vec)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -9.000   3.250   4.750   3.167   5.500  10.000

unique(vec)

[1]  3.0  4.0 -9.0  5.5 10.0

var(vec) 
sd(vec)
mean(vec)
sum(vec^2)

[1] 41.26667
[1] 6.423914
[1] 3.166667
[1] 266.5
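Many of these summary functions return NA as soon as the vector contains a missing value. A sketch of how the na.rm argument changes that (vec_na is an illustrative vector, not one used above):

```r
vec_na <- c(3, 4, NA, 5.5)

mean(vec_na)                 # NA: one missing value makes the mean unknown
mean(vec_na, na.rm = TRUE)   # (3 + 4 + 5.5) / 3 = 4.166667
sum(is.na(vec_na))           # number of missing values: 1
vec_na * 2                   # element-wise, NA stays NA: 6 8 NA 11
```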

Importing Packages

Part of R’s popularity is due to its rich collection of packages.
A package is a collection of functions, data sets, and documentation bundled together into a single unit.
To use an R package, you first need to install it using install.packages().
You then load it into your R session using library().

# Install the ggplot2 package (only needed once)
install.packages("ggplot2")

# Load the ggplot2 package into the current R session
library(ggplot2)
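Because installation is only needed once, a script can check whether a package is already available before installing it. This is one possible pattern, sketched with the stats package (which ships with R) so it runs without downloading anything; in practice pkg would be set to something like "ggplot2":

```r
pkg <- "stats"   # assumption for the sketch: swap in "ggplot2" in practice

# requireNamespace() returns FALSE if the package is not installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```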