DATA 101: Making Prediction with Data
University of British Columbia Okanagan
Recall that code in the HTML lecture notes comes with a copy-to-clipboard button:
If you run the same code in your console, it will look like this:
> 2 + 3
[1] 5
R will follow the order of operations
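As a small illustration of the order of operations (the expressions are chosen here for illustration and are not from the original notes):

```r
2 + 3 * 4      # multiplication before addition: 14
(2 + 3) * 4    # parentheses are evaluated first: 20
2^3^2          # exponents group right to left, i.e. 2^(3^2): 512
-2^2           # ^ binds tighter than the unary minus: -4
```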
Note the syntax errors below.
In R, an object is a fundamental data structure that represents a value or a collection of values.
Objects typically have a specific data type.
R provides various structures of objects, including vectors, matrices, data frames, lists, factors, scalars, and functions.
Objects are “assigned” to variable names using either the equal sign (=) or the arrow (<-) as the assignment operator.
Variable names … (_) and period (.)
⚠️ You should avoid using function names as variable names to avoid confusion.
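A minimal sketch of assignment; the variable names below are hypothetical and chosen only to show that underscores and periods are permitted:

```r
x <- 5                     # arrow assignment
y = 10                     # the equal sign also assigns
total_score.v2 <- x + y    # underscores and periods are allowed in names
total_score.v2             # 15
```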
Here is a list of some important data types used in R:
N.B. single or double quotes are allowed for strings.
There are NO quotes on TRUE/FALSE and NAs.
The typeof() function returns the data type of a variable …
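For instance, typeof() could be tried on a few literal values (the values are illustrative):

```r
typeof(3.14)      # "double"
typeof(2L)        # "integer" (the L suffix marks an integer)
typeof("hello")   # "character"
typeof('hello')   # "character": single quotes work too
typeof(TRUE)      # "logical": no quotes around TRUE
typeof(NA)        # "logical"
```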
Here is a list of some important structures used in R:
Vectors are the simplest data structure in R.
The c() function can be used to build a vector, which is simply a sequence of elements.
They can store a sequence of values of the same data type, such as numeric, character, logical, or complex values.
That is, vectors cannot contain elements of different data types.
e.g. create a vector with the numbers 1 through 6
A shorthand way of doing this is to use the colon (:)
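Both approaches might look like this; the object names long_way and short_way are made up for the sketch:

```r
long_way  <- c(1, 2, 3, 4, 5, 6)   # spell out every element
short_way <- 1:6                   # colon shorthand
all(long_way == short_way)         # TRUE: the values are the same
typeof(long_way)                   # "double"
typeof(short_way)                  # "integer": the colon produces integers
```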
It is common to use terms like “numeric vector,” “character vector,” or “logical vector” to emphasize the data type.
You can access elements of a vector using square brackets []
⚠️ It is worth pointing out that some programming languages (e.g. Python) start their index at 0 instead of 1; R starts at 1.
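A short sketch of vector indexing, using a hypothetical vector letters_vec:

```r
letters_vec <- c("a", "b", "c", "d")
letters_vec[1]         # "a": the first element has index 1
letters_vec[c(2, 4)]   # "b" "d": several positions at once
letters_vec[2:3]       # "b" "c": a range of positions
letters_vec[-1]        # "b" "c" "d": a negative index drops that element
```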
What do you think this will do?
This will extract the elements from numbers for which the corresponding element in log_vec is equal to TRUE
⚠️ Warning: R will not produce a warning if the vectors are not of the same length!
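The names numbers and log_vec come from the text above, but the values below are assumed for illustration. The sketch also shows what happens silently when the lengths differ:

```r
numbers <- c(10, 20, 30, 40)
log_vec <- c(TRUE, FALSE, TRUE, FALSE)
numbers[log_vec]                             # 10 30

# A shorter logical vector is recycled without any warning:
numbers[c(TRUE, FALSE)]                      # 10 30

# A longer logical vector yields NA for the missing position, also silently:
numbers[c(TRUE, FALSE, TRUE, FALSE, TRUE)]   # 10 30 NA
```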
A single value is technically a vector of length 1.
Vectors can be combined together using c()
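A brief sketch with hypothetical vectors first and second:

```r
length(42)                        # 1: a single value is a vector of length 1
first  <- c(1, 2, 3)
second <- c(4, 5)
combined <- c(first, second, 6)   # vectors and single values combine alike
combined                          # 1 2 3 4 5 6
```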
Where appropriate, you can convert objects of one data type to another using as. followed by the data type
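A few explicit conversions, with illustrative values:

```r
as.numeric("3.14")    # 3.14
as.character(42)      # "42"
as.integer(TRUE)      # 1
as.logical("TRUE")    # TRUE
as.integer(3.9)       # 3: the decimal part is dropped, not rounded
```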
If different data types are combined together, R will try to guess the best data type to store them as.
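The following sketch (values chosen for illustration) shows which type R settles on when types are mixed inside c():

```r
typeof(c(1, TRUE))    # "double": TRUE is stored as 1
typeof(c(1L, 2.5))    # "double": the integer is promoted
typeof(c(1, "a"))     # "character": the number becomes a string
c(1, "a", TRUE)       # "1" "a" "TRUE"
```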
Matrices store values in a two-dimensional array
One way of creating a matrix is to supply a vector to the matrix() function; see ?matrix
💡 The parameter values specified in the documentation represent default values. These values are used by the function when the caller does not provide a specific argument value for that parameter.
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

     [,1] [,2]
[1,]    1    3
[2,]    2    4

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    1    4    1
[2,]    2    5    2    5    2
[3,]    3    6    3    6    3

     [,1]
[1,] "a"
[2,] "bee"
[3,] "see you later"
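The calls that produced the output above did not survive in these notes. The sketch below shows calls that would print matrices of the same shape and contents; treat them as a reconstruction, not necessarily the original code.

```r
matrix(1:6)                               # default: a single column (6 x 1)
matrix(1:6, ncol = 2)                     # filled column by column (3 x 2)
matrix(1:6, ncol = 2, byrow = TRUE)       # filled row by row
matrix(1:4, nrow = 2)                     # 2 x 2
# 15 cells but only 6 values: the values are recycled. R warns about the
# mismatch, so the warning is silenced here to keep the printout clean.
suppressWarnings(matrix(1:6, nrow = 3, ncol = 5))
matrix(c("a", "bee", "see you later"))    # a character matrix
```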
You can index matrices using square brackets [ ] to extract specific elements, rows, or columns from the matrix.
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Extract the element in the 1st row and third column
Extract the entire first row
Extract the entire third column
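Assuming the matrix shown above was built from 1:6 with two rows (the name m is hypothetical), the three extractions could be written as:

```r
m <- matrix(1:6, nrow = 2)
m[1, 3]   # element in row 1, column 3: 5
m[1, ]    # entire first row: 1 3 5
m[, 3]    # entire third column: 5 6
```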
In R, a factor is a data type used for categorical (aka nominal) data (e.g. marital status: “Married,” “Single,” “Divorced”).
That is, factors are used to represent data that can take on a limited, fixed set of distinct values or categories.
Notably these categories do not have any inherent order or ranking among them.
While it may look like character data, note the differences in the following output.
To convert the character vector above to a factor, use:
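A minimal sketch of the comparison and the conversion. The eye-colour values are an assumption, inferred from output shown later in these notes, and the names eye_chr and eye_fct are hypothetical.

```r
eye_chr <- c("brown", "hazel", "brown", "blue", "blue")
eye_fct <- factor(eye_chr)   # convert the character vector to a factor

eye_chr           # prints with quotes
eye_fct           # prints without quotes, followed by a Levels: line
class(eye_chr)    # "character"
class(eye_fct)    # "factor"
typeof(eye_fct)   # "integer": factors are stored as integer codes
```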
Ofttimes, categorical data will be codified using numbers. They can be coerced into a factor type using:
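One way this coercion might look, using the levels and labels arguments of factor(). The coding scheme (1 = Married, 2 = Single, 3 = Divorced) is invented for this sketch.

```r
status_codes <- c(1, 2, 1, 3)
status <- factor(status_codes,
                 levels = c(1, 2, 3),
                 labels = c("Married", "Single", "Divorced"))
as.character(status)   # "Married" "Single" "Married" "Divorced"
levels(status)         # "Married" "Single" "Divorced"
```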
If we know there is another category that we didn’t happen to have in our vector, we can let R know of it using:
[1] brown hazel brown blue blue
Levels: hazel blue brown green
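The following call would print the output shown above; the original code is missing, so this is a reconstruction in which the levels argument lists every known category, including the unobserved green.

```r
eye <- factor(c("brown", "hazel", "brown", "blue", "blue"),
              levels = c("hazel", "blue", "brown", "green"))
eye          # Levels: hazel blue brown green
table(eye)   # green appears with a count of 0
```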
In addition, if we know that some categories might be duplicated using different labels we can merge them using:
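One possible way to merge duplicated labels is to assign a named list to levels(). The data below, with inconsistent capitalisation, are made up for this sketch.

```r
eye_raw <- factor(c("brown", "Brown", "blue", "Blue", "brown"))
eye_merged <- eye_raw
# Each list name is a new level; its value lists the old labels to merge.
levels(eye_merged) <- list(brown = c("brown", "Brown"),
                           blue  = c("blue", "Blue"))
levels(eye_merged)         # "brown" "blue"
as.character(eye_merged)   # "brown" "brown" "blue" "blue" "brown"
```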
So far, all of the previous structures required that all the elements are of the same data type.
In R, a list is a versatile and flexible data structure that can hold elements of different data types, including other lists, vectors, matrices, data frames, and even functions.
Lists are commonly used to store and manage heterogeneous data or complex data structures.
Elements within the list can be named for convenience
You can access elements within a list using double square brackets [[ ]] or single square brackets [ ] and specify either the element’s position or its name.
In R, the dollar sign ($) operator is used to access elements within a list (or data frame) by their names.
When an element of the list is a vector, it can then be indexed using single square brackets [], as discussed in Vector Indexing.
The following are equivalent for returning the first element of the list, i.e. a vector.
[1] 1.0 4.0 -9.0 5.5
[1] 1.0 4.0 -9.0 5.5
[1] 1.0 4.0 -9.0 5.5
The resulting vector can then be indexed
Which is the same as:
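A sketch of these access patterns. The first element matches the vector printed above; the list name my_list and the remaining elements are assumptions.

```r
my_list <- list(numbers  = c(1, 4, -9, 5.5),
                greeting = "hello",
                flags    = c(TRUE, FALSE))

# Three equivalent ways to return the first element as a vector:
my_list[[1]]
my_list[["numbers"]]
my_list$numbers

class(my_list[1])    # "list": single brackets return a sub-list, not the vector

# The resulting vector can then be indexed:
my_list[[1]][3]      # -9
my_list$numbers[3]   # -9
```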
To create a data frame, we can supply data.frame() with any number of vectors, each separated by a comma.
  face  suit value
1  ace clubs     1
2  two clubs     2
3  six clubs     3
In this example, we have created a data frame named df with three columns: face, suit, and value.
Each vector should have the same length, as each element corresponds to a row in the data frame.
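A call along these lines would build the data frame described above; the column names and values are taken from the printed output.

```r
df <- data.frame(face  = c("ace", "two", "six"),
                 suit  = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3))
df
dim(df)     # 3 3: three rows, three columns
names(df)   # "face" "suit" "value"
```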
Each data frame is a list with class data.frame.
In R, both typeof() and class() are functions used to examine the characteristics of objects
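A small sketch contrasting the two functions; the objects are illustrative. typeof() reports how R stores the object internally, while class() reports how the object is treated.

```r
df <- data.frame(value = c(1, 2, 3))
typeof(df)      # "list": a data frame is stored as a list
class(df)       # "data.frame"
is.list(df)     # TRUE

f <- factor(c("a", "b"))
typeof(f)       # "integer"
class(f)        # "factor"
```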
To access columns of the data frame, you can use either square brackets [] or the $ operator
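A sketch of column access on a data frame like the one above. stringsAsFactors = FALSE is set explicitly here, as an assumption, so that the text columns stay character vectors on older versions of R too.

```r
df <- data.frame(face  = c("ace", "two", "six"),
                 suit  = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

df$value          # 1 2 3, as a vector
df[["value"]]     # same vector
df[, "value"]     # same vector
df["value"]       # single brackets with a name: a one-column data frame
df[2, "face"]     # "two": row 2 of the face column
```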
To invoke a spreadsheet-style viewer of your data in the Source panel of RStudio, you can execute View(df).
The str() function gives you a quick overview of your data
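For example, applied to a data frame like the one built earlier (rebuilt here so the sketch is self-contained), str() lists the number of observations and variables, then each column with its type and first few values:

```r
df <- data.frame(face  = c("ace", "two", "six"),
                 suit  = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3))
str(df)   # one header line, then one line per column
```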
Functions typically accept input values, called arguments or parameters, perform operations on them, and return a result.
Output of a function can either be printed to the console or assigned to a variable.
[1] 10
[1] -9.0 3.0 4.0 5.5 10.0
[1] "Tue Sep 19 08:40:57 2023"
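The code behind the output above is not shown in these notes. Calls such as the following would give output of that form; the vector x is an assumption chosen to be consistent with the printed values.

```r
x <- c(3, 4, -9, 5.5, 10)
max(x)                 # printed to the console: 10
sorted_x <- sort(x)    # assigned to a variable instead of printed
sorted_x               # -9.0 3.0 4.0 5.5 10.0
date()                 # the current date and time as a character string
```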
R comes with more sophisticated functions like those that can randomly sample data from a Normal distribution.
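As a brief sketch, rnorm() draws random values from a Normal distribution. The seed and parameter values below are arbitrary choices; set.seed() only makes the draws reproducible.

```r
set.seed(101)                         # arbitrary seed, for reproducibility
draws <- rnorm(5, mean = 0, sd = 1)   # five draws from a standard Normal
draws
length(draws)                         # 5
```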
Functions are designed to be reusable and modular.
👍 Rule of Thumb: if you have written the same code more than twice, consider writing a function
Here’s the general syntax for creating your own function:
functionName <- function(parameters) {
  # code to execute on parameters
}
The same naming rules for variables apply to functions
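A concrete instance of the general syntax above; the function fahrenheit_to_celsius() is a made-up example, not part of the original notes.

```r
# Convert temperatures from degrees Fahrenheit to degrees Celsius.
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9   # the last evaluated expression is returned
}

fahrenheit_to_celsius(212)          # 100
fahrenheit_to_celsius(c(32, 50))    # 0 10: works element-wise on vectors
```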
seq()
A common way to make vectors is to use the seq() function:
from, the starting value of the sequence.
to, the ending value of the sequence.
by, which sets the increment of the sequence.
... indicates that there are alternative arguments that may be passed to this function. These are described in the Arguments section of the help file, which you can access using ?seq
seq() Examples
If you arrange your arguments in the order R expects, you do not need to specify the argument name.
If you rearrange them or use alternative (non-default) arguments, you need to explicitly provide the argument name.
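A few seq() calls illustrating positional versus named arguments (the values are illustrative):

```r
seq(1, 10, 2)                    # positional: from, to, by -> 1 3 5 7 9
seq(by = 2, to = 10, from = 1)   # reordered, so names are required; same result
seq(0, 1, length.out = 5)        # alternative argument: 0.00 0.25 0.50 0.75 1.00
```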
rep()
A common way to make vectors is to use the rep() function, which replicates the values in x.
times, an integer-valued vector giving the (non-negative) number of times to repeat each element if of length length(x), or to repeat the whole vector if of length 1.
length.out, a non-negative integer. The desired length of the output vector.
each, a non-negative integer. Each element of x is repeated each times.
rep() Examples
Some useful examples from the help file:
[1] 1 2 3 4 1 2 3 4
[1] 1 1 2 2 3 3 4 4
[1] 1 1 2 3 3 4
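The following calls reproduce the three outputs above, plus one call using length.out:

```r
rep(1:4, 2)                          # whole vector twice: 1 2 3 4 1 2 3 4
rep(1:4, each = 2)                   # each element twice: 1 1 2 2 3 3 4 4
rep(1:4, c(2, 1, 2, 1))              # per-element counts: 1 1 2 3 3 4
rep(1:4, each = 2, length.out = 4)   # truncated to length 4: 1 1 2 2
```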
💡 Don’t overlook the Examples at the end of a help file. Sometimes they are more useful than the descriptions.
There are a number of useful functions you can apply to vectors. Just a few examples:
[1] 1.098612 1.386294 NaN 1.704748 1.704748 2.302585
[1] 1.098612 1.386294 2.197225 1.704748 1.704748 2.302585
Min. 1st Qu. Median Mean 3rd Qu. Max.
-9.000 3.250 4.750 3.167 5.500 10.000
[1] 3.0 4.0 -9.0 5.5 10.0
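The calls behind the output above are missing from these notes. The vector x below is inferred from the printed values, and the function calls are a plausible reconstruction, not the confirmed original.

```r
x <- c(3, 4, -9, 5.5, 5.5, 10)
log(x)        # NaN for the negative value; R also issues a warning
log(abs(x))   # taking absolute values first avoids the NaN
summary(x)    # minimum, quartiles, median, mean, maximum
unique(x)     # drops the duplicated 5.5: 3.0 4.0 -9.0 5.5 10.0
```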
Part of R’s popularity is due to its rich collection of packages.
A package is a collection of functions, data sets, and documentation bundled together into a single unit.
To use an R package, you first need to install it using install.packages()
Then load it into your R session using library()
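A minimal sketch of the two steps. The install line is commented out and uses a placeholder package name, since installing needs an internet connection and only has to be done once. The tools package is used for the library() step because it ships with R.

```r
# install.packages("somePackage")   # placeholder name; run once per machine

library(tools)                      # load an already-installed package
file_ext("lecture-notes.html")      # a function from tools: "html"
```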