Lecture 3: Introduction to R Programming

DATA 101: Making Predictions with Data

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

In today’s lecture, we will go a bit deeper into programming by learning about operators, conditionals, conditional indexing and subsetting, missing data, and loops.

These concepts are fundamental for data manipulation and analysis in R.

Operators

  • R has several operators to perform tasks

  • We have already seen two:

    1. assignment operators (e.g. = and <-)
    2. arithmetic operators (e.g. +, -, *, /, ^, %%)
  • Other types of operators include comparison operators and logical operators

Comparison Operators

  • Comparison operators are used to determine whether a specific relationship exists between two values or expressions (e.g., equality, inequality, greater than).

  • Comparison operators return a logical value, which is either TRUE (if the comparison is true) or FALSE (if the comparison is false).

🤓 You can think of comparison statements as questions.
Q: is 3 < 4 (R input: 3 < 4)?
A: yes! (R output: TRUE)

Examples

Here is a list of some handy comparison operators:

  • Less than: <

  • Greater than: >

  • Less than or equal to: <=

  • Greater than or equal to: >=

  • Is equal to: ==

  • Is NOT equal to: !=

3 < 4
[1] TRUE
3 > 4
[1] FALSE
4 >= 4
[1] TRUE
3 <= 4
[1] TRUE
4 == 4
[1] TRUE
3 != 3
[1] FALSE

Logical Operators

  • Logical operators are used to manipulate and combine logical values (i.e., TRUE and FALSE).

  • Logical operators are typically used to create more complex conditions by combining the results of simpler conditions.

  • In programming, there are three common logical operators: AND (&), OR (|), and NOT (!).

# admitted to the bar?
age = 18; hasID = TRUE
(age >= 18 & hasID)
[1] TRUE
# satisfy pre-reqs: One of STAT205, STAT230
courses <- c("STAT230", "DATA101", "DATA301")
("STAT205" %in% courses || 
    "STAT230" %in% courses)
[1] TRUE
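
The NOT operator (!) is listed above but not demonstrated. As a minimal sketch, reusing the age and hasID values from the example (the names canEnter and underage are illustrative only, not from the lecture):

```r
age <- 18
hasID <- TRUE

canEnter <- age >= 18 & hasID   # both conditions hold, so TRUE
underage <- !(age >= 18)        # NOT flips TRUE to FALSE

print(canEnter)
print(underage)
print(!c(TRUE, FALSE, TRUE))    # ! also works element-wise on vectors
```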

Scalar vs Element-wise Operators

⚠️ Warning: “longform” versions of these operators exist (&& and ||) to provide flexibility in different use cases

  • & and | perform element-wise logical operations when applied to vectors, matrices, or arrays (may return a logical vector with length >1)

  • && and || are designed for scalar operations (will return a logical scalar)

Examples using AND

x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, TRUE, TRUE)

Scalar usage of AND

TRUE & TRUE   # same as TRUE && TRUE
TRUE & FALSE  # same as TRUE && FALSE
FALSE & FALSE # same as FALSE && FALSE
[1] TRUE
[1] FALSE
[1] FALSE

Using & with vectors (element-wise evaluation)

x & y 
[1] FALSE FALSE  TRUE
TRUE & y 
[1] FALSE  TRUE  TRUE

Examples using OR

x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, TRUE, TRUE)

Scalar usage of OR:

TRUE | TRUE   # same as TRUE || TRUE 
TRUE | FALSE  # same as TRUE || FALSE 
FALSE | FALSE # same as FALSE || FALSE 
[1] TRUE
[1] TRUE
[1] FALSE

Using | with vectors (element-wise evaluation)

x | y 
[1] TRUE TRUE TRUE
TRUE | y 
[1] TRUE TRUE TRUE

R 4.3.0 News

⚠️ Warning: If you’re using R 4.3.0 or higher, calling && or || with an LHS or RHS of length greater than one will produce an error (see R 4.3.0 NEWS).

In other words, the following will produce errors:

x && y      # Error since x and y are vectors
TRUE && y   # Error since y is a vector
x && TRUE   # Error since x is a vector

So && should only be used with scalar logicals:

TRUE && TRUE
[1] TRUE
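
When a single TRUE/FALSE answer is needed from vectors, one option is to reduce the element-wise result with all() or any(). This is a sketch using the x and y vectors from the earlier examples; all_both and any_both are illustrative names:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, TRUE, TRUE)

all_both <- all(x & y)   # are x and y both TRUE in every position?
any_both <- any(x & y)   # are x and y both TRUE in at least one position?

print(all_both)  # FALSE
print(any_both)  # TRUE
```

Both results have length one, so they are safe to use in an if condition.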

Conditionals

  • Conditional statements allow us to make decisions in R

  • Conditional statements allow the program to execute different code blocks or take different actions based on specific conditions.

  • Common conditional statements: if, else if, else

  • Just like a flow chart, conditional statements supply a sequence of steps, actions, or decisions in a process or system.

Flow chart

Source: geeksforgeeks.org

Pseudocode

# not to be run in R:

if score is greater than or equal to 80:
  set grade to A
else if score is greater than or equal to 68:
  set grade to B
else if score is greater than or equal to 55:
  set grade to C
else if score is greater than or equal to 50:
  set grade to D
else:
  set grade to F

Syntax

  • If you try to execute the previous pseudocode in R, you will get an error:
> if score is greater than or equal to 80:
Error: unexpected symbol in "if score"
  • That is because R expects a very specific syntax

  • Syntax refers to the specific rules and conventions that dictate how code is written and structured, e.g. case sensitivity, comments, assignment operators

if statements

An if statement allows you to execute different code blocks based on whether a specified condition is true or false. The basic syntax of an if statement in R is as follows:

if (condition) {
  # Code to be executed 
  # if condition is true
}
if (age >= 18 & hasID==TRUE) {
  print("admit to club")
}
[1] "admit to club"

💡 Tip: an if statement already tests whether its condition is TRUE, so writing hasID==TRUE is redundant; hasID on its own is enough


if (age >= 18 & hasID) {
  print("admit to club")
}
[1] "admit to club"

Components

  • if: This is the keyword that initiates the if statement.

  • condition: This is a logical expression that evaluates to either TRUE or FALSE. If the condition is TRUE, the code inside the curly braces {} will be executed; otherwise, it will be skipped.

  • {}: Curly braces enclose the code block that should be executed when the condition is TRUE. If you have only one statement to execute, the curly braces are optional, but it’s a good practice to include them for readability.

Keywords

In R, a keyword refers to a reserved word that has a predefined meaning and cannot be used as a variable or function name.

Here are some common keywords in R:

  • if: Used to start an if statement for conditional branching.

  • else: Used in conjunction with if to provide an alternative code block to execute when the condition is false.

  • else if: Not a separate keyword but the combination of else and if; used to specify additional conditions to check when the initial condition is false.

  • for: Used to create a loop that iterates over a sequence of values.

  • while: Used to create a loop that continues as long as a specified condition is true.

  • repeat: Used to create an indefinite loop that continues until explicitly stopped with a break statement.

  • function: Used to define a user-defined function in R.

  • return: Used within a function to specify the value to return from that function (strictly speaking, return() is a built-in function rather than a reserved word).

  • break: Used to exit a loop prematurely.

  • next: Used in a loop to skip the current iteration and move to the next iteration.

  • NULL: Represents the absence of a value (an empty object); missing data is represented by NA.

  • NA: Stands for “Not Available” and is used to represent missing or undefined values in R.

  • TRUE and FALSE: Represent the logical values for true and false, respectively.

  • Inf: Represents positive infinity.

  • NaN: Stands for “Not-a-Number” and represents undefined or unrepresentable numerical values.

if ... else statements

The if...else is used when you have a single condition, and you want to execute one block of code if the condition is true and another block if it’s false.

if (condition) {
  # statement1
} else {
  # statement2
}
if (age >= 18 & hasID) {
  print("admit")
} else { 
  print("deny ")
}

💡 Tip: you can write an if statement in R on a single line if the code block associated with the if statement contains only one statement

if (age >= 18 & hasID) print("admit") else print("deny ")
[1] "admit"

else if statements

else if is used when you have multiple conditions to check, and you want to evaluate them in sequence until one of them is true. When the first true condition is found, the associated block of code is executed, and the rest of the conditions are not evaluated.

if (condition) {
  # Code to execute if 
  # condition is TRUE
} else if (another_condition) {
  # Code to execute if 
  # another_condition is TRUE
} else {
  # Code to execute if 
  # no conditions are TRUE
}
fakeID = FALSE
looksOld  = FALSE
if (age >= 18 & hasID) {
  print("legit admit")
} else if (age < 18 & fakeID) {
  print("sneaky admit")
} else if (looksOld) {
  print("old admit")
} else {
  print("deny")
}
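
For comparison, the grade pseudocode from the earlier slide could be translated into valid R along these lines. This is a sketch: the value assigned to score is a made-up example, not taken from the lecture.

```r
score <- 72  # hypothetical score, chosen for illustration

if (score >= 80) {
  grade <- "A"
} else if (score >= 68) {
  grade <- "B"
} else if (score >= 55) {
  grade <- "C"
} else if (score >= 50) {
  grade <- "D"
} else {
  grade <- "F"
}

print(grade)  # "B", since 72 is below 80 but at least 68
```

Because the conditions are checked in order, each branch only needs a lower bound.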

Conditional Indexing

  • We can use these operators in some advanced indexing.
  • Last lecture we saw how to extract data from a vector using one or several indices (e.g. x[1], x[c(4,2)])
  • In practice, you often need to extract data that satisfy certain criteria.
  • To do this in one step, we use conditional selection.

Example

x = c("female","male","non-binary","female","male","male","female")
x=="female"
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
x[x=="female"]
[1] "female" "female" "female"
set.seed(2023)
# sample twelve numbers from 1--10
(y = sample(10, 12, replace=TRUE))
 [1]  5  9  8  3 10  2  1  1  1  1  5  8
# return the elements that are larger than 7
y[y>7]
[1]  9  8 10  8
# return the elements that are less than 7 and odd
y[y<7 & y%%2==1]
[1] 5 3 1 1 1 1 5
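
A logical vector can also be summarized rather than used as an index. In this sketch the values of y are typed in by hand from the output above (so it does not depend on the random seed), and big is an illustrative name:

```r
y <- c(5, 9, 8, 3, 10, 2, 1, 1, 1, 1, 5, 8)  # values copied from the sample above

big <- y > 7      # logical vector: TRUE where the element exceeds 7

which(big)        # positions of the TRUE values: 2 3 5 12
sum(big)          # number of TRUE values (TRUE counts as 1): 4
mean(big)         # proportion of TRUE values
```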

Subsetting

  • Instead of using operations to create logical vectors for indexing, we can also use subset()
subset(data, subset = condition)
  • This function returns subsets of vectors, matrices or data frames which meet conditions.
  • To show its utility, let’s consider a built-in data set called iris
  • To see a description type ?iris

Iris

  • This famous (Fisher’s or Anderson’s) iris data set gives the measurements (in cm) of the variables:

    • sepal length and width and
    • petal length and width, respectively,

for 50 flowers from each of 3 species of iris:

  1. setosa, 2. versicolor, and 3. virginica.

iris

Iris Indexing

Extract the rows which correspond to the setosa species

nrow(iris) # count the number of observations 
[1] 150
setosa = iris[iris$Species == "setosa",]
nrow(setosa)
[1] 50

Equivalently

setosa = subset(iris, Species=="setosa") 

Extract the setosa flowers with long (>5 cm) sepal length

longSepals = subset(iris, Species=="setosa" & Sepal.Length>5)
nrow(longSepals)
[1] 22
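
subset() can also keep only some columns through its select argument. A short sketch, with setosaSepals as an illustrative name:

```r
# keep only the two sepal columns for setosa flowers with sepal length > 5 cm
setosaSepals <- subset(iris,
                       Species == "setosa" & Sepal.Length > 5,
                       select = c(Sepal.Length, Sepal.Width))

dim(setosaSepals)   # same rows as before, but only 2 columns
head(setosaSepals)
```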

Transforming

  • The transform() function also provides a quick and easy way to transform the data.
  • For instance, if we want to add a new column which holds the log of the petal lengths, we could type:
dim(iris)
[1] 150   5
irisMore = transform(iris, logPL = log(Petal.Length))
dim(irisMore)
[1] 150   6

irisMore

Splitting

  • Another handy function is split().
  • split() divides the data into a list of groups (here, one data frame per species) according to a grouping variable
iSpecies = split(iris, iris$Species)
names(iSpecies)
[1] "setosa"     "versicolor" "virginica" 

Note that iSpecies$setosa creates the same subset as setosa defined on the earlier Iris Indexing slide. We can verify this using:

all.equal(iSpecies$setosa,setosa)
[1] TRUE
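
Once the data are split, the same calculation can be applied to every group. The following sketch uses sapply(), which has not been introduced in this lecture; meanPL is an illustrative name:

```r
iSpecies <- split(iris, iris$Species)

# apply the same summary to each data frame in the list
meanPL <- sapply(iSpecies, function(d) mean(d$Petal.Length))

print(meanPL)  # one mean petal length per species
```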

Order

  • A related function is order(), which returns the indices that arrange x in sorted order (so x[order(x)] is the sorted vector).
(o <- order(x))
[1] 1 4 7 2 5 6 3
x[o]
[1] "female"     "female"     "female"     "male"       "male"      
[6] "male"       "non-binary"
  • We can use order() to rearrange the rows of a data set to agree with a sorting of a particular column, for instance.

Example

Example: rearrange the rows of iris so that Petal.Length is sorted from smallest to largest:

o = order(iris$Petal.Length)
head(o)
[1] 23 14 15 36  3 17
irisSorted = iris[o,]
head(irisSorted)
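
To sort from largest to smallest instead, order() accepts a decreasing argument. A brief sketch, with oDesc and irisDesc as illustrative names:

```r
oDesc <- order(iris$Petal.Length, decreasing = TRUE)
irisDesc <- iris[oDesc, ]

head(irisDesc$Petal.Length)  # the largest petal lengths come first
```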

Comment

  • Note that order() can also take multiple sorting arguments
  • For instance, order(gender, age) in the example on the following slide gives a main division by gender, and within each group the rows are ordered by age.

Example

gender = c("female","male","female","male","male","female")
age = c(36, 24, 25, 40, 22, 23)
df = data.frame(gender=gender, age=age)
o = order(df$gender, df$age)
df[o,]

Missing Data

  • In R, missing values are represented as NA (Not Available).
  • NaN (Not a Number) is usually the product of some arithmetic operation and represents undefined results (e.g., 0/0 or the square root of a negative number).
  • We can check for these using is.na(), is.nan()

Example

y = -1:3  # fills elements 1--5
y[7] = 7  # element 6 is missing
y
[1] -1  0  1  2  3 NA  7
is.na(y)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
(sy = sqrt(y))   # take the square roots 
[1]      NaN 0.000000 1.000000 1.414214 1.732051       NA 2.645751
is.nan(sy)
[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
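
One detail worth checking is how the two functions treat each other's values. In this small sketch (v is an illustrative vector), is.na() flags both NA and NaN, while is.nan() flags only NaN:

```r
v <- c(1, NA, NaN)

is.na(v)    # FALSE  TRUE  TRUE   (NaN also counts as missing)
is.nan(v)   # FALSE FALSE  TRUE   (NA is not NaN)
```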

Infinite values

  • Arithmetic operations may result in infinity (or negative infinity)

  • This concept is represented in R using Inf and -Inf

  • We can check if a number is finite/infinite using is.finite()/is.infinite()

    • is.finite(NA)/is.infinite(NA) returns FALSE/FALSE
    • is.finite(NaN)/is.infinite(NaN) returns FALSE/FALSE
    • is.finite(Inf)/is.infinite(Inf) returns FALSE/TRUE

Example

y
[1] -1  0  1  2  3 NA  7
(ly = log(y))
[1]       NaN      -Inf 0.0000000 0.6931472 1.0986123        NA 1.9459101
is.finite(ly) # is.finite with NA/NaN/Inf all return FALSE
[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
is.infinite(ly)
[1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
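
A few one-line calculations illustrate where these special values come from (a sketch; the variable names are illustrative):

```r
posInf <- 1 / 0       # Inf
negInf <- -1 / 0      # -Inf
undef1 <- 0 / 0       # NaN
undef2 <- Inf - Inf   # NaN

c(posInf, negInf, undef1, undef2)
```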

Missing Values

  • An easy way to remove rows from a data set having missing values is:
newdata <- na.omit(mydata)
  • Many functions return NA when the data contain missing values, so we can remove them before the calculation:
mean(y)
[1] NA
mean(na.omit(y))
[1] 2

💡 Some functions have this feature built in as an argument option:

mean(y, na.rm = TRUE)
[1] 2
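
The same idea applies to whole rows of a data frame. In this sketch, mydata is a small made-up data frame (not from the lecture), and complete.cases() is used to show which rows na.omit() keeps:

```r
# hypothetical data: the second row has a missing score
mydata <- data.frame(id = 1:4, score = c(85, NA, 78, 90))

complete.cases(mydata)       # TRUE FALSE TRUE TRUE
newdata <- na.omit(mydata)   # drops any row containing an NA

nrow(newdata)                # 3
```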

NAs

  • It may happen that we would like to replace values that meet a certain condition with NA
# replace scores outside of allowable range with NA
student_scores <- c(85, 92, -54, 78, 90, 101, 67, 75, 88)
student_scores[student_scores>100 | student_scores<0] = NA
student_scores
[1] 85 92 NA 78 90 NA 67 75 88
  • On the flip side, we could easily replace NAs by some value
# replace NAs with 0s
student_scores[is.na(student_scores)] = 0

N.B. this common mistake:

> (70 < student_scores < 90)
Error: unexpected '<' 

Fix:

(70 < student_scores & 
   student_scores < 90)
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

Loops

  • Looping (AKA cycling or iterating) provides a way of repeating a set of statements multiple times until some condition is satisfied.

  • Each execution of the loop body is called an iteration

  • A for loop repeats statements a fixed number of times: once for each element of a vector or list.

  • A while loop repeats statements while a condition is true

  • A repeat loop repeats continuously until you explicitly exit it using the break statement.

for loop example

General Syntax

for (item in my_vector) {
  # Code to process each item in my_vector
}

Simple example:

# output not shown to save space
for (i in 1:5) {
  print(i) 
}

for loop combined with if statement

for (i in 1:5) {
  if (i%%2 == 0) { 
    # if i is even
    print(paste(i, "is even"))
  }
}
[1] "2 is even"
[1] "4 is even"

Loops with lists

my_list <- list("apple", "banana", "cherry")
for (fruit in my_list) {
  print(fruit)
}
[1] "apple"
[1] "banana"
[1] "cherry"
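
Loops are often used to build up a result step by step. As a sketch, the following adds up a made-up vector of scores by index; in practice sum(scores) gives the same answer in one call:

```r
scores <- c(85, 92, 78)  # hypothetical values
total <- 0

for (i in seq_along(scores)) {
  total <- total + scores[i]   # add the i-th score to the running total
}

print(total)  # 255
```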

while loop syntax/example

General syntax
while (condition) {
  # Code to be executed 
  # as long as the
  # condition is true
}

Example

count <- 1

while (count <= 5) {
  print(count)
  count <- count + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

⚠️ It’s important to be cautious with while loops to avoid infinite loops. Make sure that the condition eventually becomes false, or include logic within the loop to break out of it when needed.

Infinite Loops

Infinite loops are caused by an incorrect loop condition or by failing to update values within the loop so that the loop condition eventually becomes false.

n = 1
while (n <= 5){
  print(n)
}

Here we forgot to increase n. Hence we get an infinite loop (i.e. the code will print 1 forever, because n never changes and n <= 5 stays true).
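
A corrected version of this loop, as a minimal sketch, updates n inside the body so that the condition eventually becomes false:

```r
n <- 1
while (n <= 5) {
  print(n)
  n <- n + 1   # without this update the loop would never end
}
```

After the loop finishes, n is 6, which is the first value for which n <= 5 is false.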

repeat loops

repeat loops are typically used when your iterative task doesn’t have a predetermined stopping point.

General Syntax

repeat {
  # Code to be executed 
  # in each iteration
  
  if (condition) {
    break  # Exit the loop when the condition is met
  }
}

Example:

count <- 1
repeat {
  # Code to execute 
  # in each iteration
  if (count > 5) {
    # Exit the loop when 
    # count exceeds 5
    break
  }
  count <- count + 1
}

Comment

Notice that the condition we use to break a repeat loop is the opposite of the condition we had for our while loop

  • Remember: if the while condition is true, we continue to the next iteration
  • Remember: if the break condition is false, we continue to the next iteration
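
To make the comparison concrete, this sketch writes the same counting task both ways; count_while and count_repeat are illustrative names:

```r
# while: keep going as long as the condition is TRUE
count_while <- 1
while (count_while <= 5) {
  count_while <- count_while + 1
}

# repeat: stop as soon as the opposite condition is TRUE
count_repeat <- 1
repeat {
  if (count_repeat > 5) {
    break
  }
  count_repeat <- count_repeat + 1
}

c(count_while, count_repeat)  # both end at 6
```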

next

  • Another reserved word is next
  • Like break, next does not return a value; it merely transfers control within the loop.
  • A next statement is useful when we want to skip the current iteration of a loop without terminating it.
  • On encountering next, R proceeds to the next iteration of the loop (without executing any remaining statements in the current iteration).

Example

x <- c("apple", "ball","cat","dog","elephant","fish")
for (i in seq_along(x)){
  print(i)
  if (i%%2==0)
    next
  print(x[i])
} # x[i] won't be printed for even indices
[1] 1
[1] "apple"
[1] 2
[1] 3
[1] "cat"
[1] 4
[1] 5
[1] "elephant"
[1] 6
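
To contrast next with break, the following sketch uses both in one loop; kept is an illustrative name for the values that survive:

```r
kept <- c()
for (i in 1:6) {
  if (i %% 2 == 0) next   # skip even numbers, continue with the next i
  if (i > 4) break        # leave the loop entirely
  kept <- c(kept, i)
}

print(kept)  # 1 3
```

Here next skips 2 and 4, while break ends the loop when i reaches 5, so 5 and 6 are never processed.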