Lecture 1: Introduction

DATA 101: Making Predictions with Data

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

Welcome to DATA 101!

Course calendar description:

DATA 101 (3) Making Predictions with Data Introduction to the techniques and software for handling real-world data. Topics include data cleaning, visualization, simulation, basic modelling, and prediction making. [3-1-0]

Instructors:

  • Please note that there are two sections of this course.

  • Section 1 (001) Dr. Irene Vrbik

  • Section 2 (002) Ladan Tazik

Lab Teaching Assistants (TAs)

  • Mahi Gangal
  • Jesse Ghashti

A little about me

  • I am currently a Tenure-track Assistant Professor of Teaching

  • I have taught a variety of courses (from introductory data science to graduate courses in statistics) at several institutions (Guelph, McGill, MDS Program)

  • I am currently the Data Science program advisor, as well as the articulation and curriculum representative

Where can you find me?

Office: SCI 104 email: irene.vrbik@ubc.ca

Websites: irene.quarto.pub, irene.vrbik.ok.ubc.ca

Educational Background

  1. McMaster University, BSc (Mathematics & Statistics)

  2. University of Guelph, MSc (Applied Statistics)

    Thesis: Using Individual-level Models to Model Spatio-temporal Combustion Dynamics. This involved modelling the spatio-temporal combustion dynamics of fire in a Bayesian framework. Supervisors: Rob Deardon and Zeng Feng.

  3. University of Guelph, PhD (Applied Statistics)

    Thesis: Non-Elliptical and Fractionally-Supervised Classification. This involved model-based classification with a particular emphasis on non-elliptical distributions. Supervisor: Paul D. McNicholas.

Experience

Postdoctoral Fellow at McGill University: Under the supervision of Dr. David Stephens, this work focused on the statistical and computational challenges associated with analyzing genetic data. It involved clustering and modeling HIV DNA sequences.

Postdoctoral Fellow at UBCO: Awarded by NSERC (Natural Sciences and Engineering Research Council of Canada), this research involved collaborations with faculty from several disciplines (e.g., Medical Physics, Biology, and Chemistry) and was supervised by Dr. Jason Loeppky.

Instructor at UBCO: A three-year contract position in the Department of Computer Science, Mathematics, Physics, and Statistics.

Research Interests

Statistics and Machine Learning in Curriculum Design

  • e.g. topic modeling in Data Science course calendars

Curricular Analytics: the systematic analysis and evaluation of educational curricula to gain insights into various aspects of curriculum design, delivery, and assessment.

  • e.g. metric calculation for various pathways, curriculum visualization, course recommendation systems

Tools for teaching, learning, and technology

  • e.g. PrairieLearn: an online problem-driven learning system for creating homework and tests

Course Syllabus

The course syllabus is a dynamic document which has been posted to our (shared) Canvas shell. Many administrative questions can be answered there.

To Do for next class

  1. Install R
  2. Install RStudio

Course Tools

We will be using Canvas for most course related material:

  • Grades
  • Assignments (downloading/submitting)
  • Course announcements/discussions
  • Supplementary files (e.g., data sets, code)

The two sections will share the same Canvas shell: canvas.ubc.ca/courses/126695. Let’s check it out!

Lecture Format

  • Lectures will be posted at irene.quarto.pub/data101.

  • Take time to learn how to navigate through the slides, how to annotate them, and how to export to PDF.


Programming Language

  • Any necessary coding will be done in R.
  • Relevant code will be posted to Canvas and embedded in the slides when necessary.
  • It is also recommended that you complete assignments using R Markdown in RStudio.

Clipboard code

A clipboard button appears when you hover over code:

head(iris)

Why R?

Pros

  • Exposure to R in the Statistics prerequisite course
  • Rich Ecosystem
  • Reproducibility
  • Textbook

Cons

  • Steep learning curve
  • Performance
  • Package Quality
  • Limited Industry Adoption

Lab Delivery

  • Labs will be held online.

  • Students must be enrolled in a lab (which cannot conflict with other courses)

  • TAs provide guidance on carrying out analyses in R for the techniques discussed in lecture.

  • Knowledge of commands and programming techniques will be evaluated throughout the course.

  • Follow the instructions carefully and practice skills by completing (and redoing!) labs and assignments

Textbook

We are using an open source textbook available free on the web: Data Science: A First Introduction by Tiffany Timbers, Trevor Campbell, and Melissa Lee.

While the above textbook is all you will need for the course, supplementary (optional) textbooks include:

  • A First Course in Statistical Programming with R, 3rd edition (2021), by W. Braun and D. Murdoch
  • Simple and Multiple Regression with R.

Lecture format

  • Slides will sometimes be supplemented with handwritten material.
  • Aside from doodling, substantial written material will be done digitally (on my iPad) and uploaded to Canvas.
  • Lectures may also include discussions which you will only gain access to by attending class.
  • You will not get the whole story by reading the slides!

Class Etiquette

  1. Please be respectful, especially to other students
  2. Please be present. Attendance will not be taken, but you are encouraged to come and learn together.
  3. Please restrict the use of electronic devices to course related material; other content could be distracting.
  4. Please be forgiving; instructors are people too, we will make mistakes.

Course Questions

In class

  • If you are stuck on a concept during lecture, please feel free to raise your hand and ask for clarification.
  • If you need help understanding something, chances are other students do too!
  • I will do my best to answer questions on the fly or organize a more thoughtful answer to be presented first thing next class or posted to Canvas.

Course Questions

Outside of class

Outside of class, I suggest asking course-related questions in the following order:

  1. Consult the course syllabus
  2. Post your question on the public forum on Canvas
  3. Come see me during student hours or visit your TA during lab (whichever comes first)
  4. Email me (weekdays are best)

Lecture 1: Introduction and Getting started with R

What is data science exactly?

Data science is the use of reproducible and auditable processes to obtain value (i.e., insight) from data. (your textbook’s definition)

  • reproducible: others should be able to re-run the analysis
  • auditable: readers should be able to see and understand all the steps in the analysis, as well as the history of how the analysis developed

What is DATA 101?

  • This course is an introduction to data science.
  • We will start with basic programming: how to tell a computer what to do.
  • Data scientists need to display data; we aim to teach you how to construct effective visualizations
  • Data science models are concerned with randomness: random errors in data, and models that include stochastic components; we will discuss methods for simulating random values with specified characteristics.
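As a small taste of the simulation topic mentioned above, here is a minimal sketch in base R. The mean of 170 and standard deviation of 10 are made-up values chosen only for illustration; they do not come from the course materials.

```r
# Fix the seed so the "random" draws are the same on every run (reproducibility).
set.seed(101)

# Simulate 1000 heights (cm) from a normal distribution with a
# hypothetical mean of 170 and standard deviation of 10.
heights <- rnorm(n = 1000, mean = 170, sd = 10)

# The sample summaries should land close to the values we specified.
mean(heights)
sd(heights)

# Simulate 10 rolls of a fair six-sided die.
rolls <- sample(1:6, size = 10, replace = TRUE)
rolls
```

Setting the seed ties simulation back to reproducibility: anyone who re-runs the script gets the same draws.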

High-level goals of this course:

  1. Learn how to use reproducible tools (RStudio) to do data analysis

  2. Learn how to identify and solve common problems in data science

  3. Do computations to aid in data analysis.

AIM: to provide a foundation for an understanding of how those applications work: what calculations they are doing, and how you could do them yourself.

Problems we will focus on:

  1. Classification: Predict a class/category for a new observation/measurement (e.g., cancerous or benign tumour)

  2. Regression: Predict a value for a new observation/measurement (e.g., 10 km race time for 20-year-old females with a BMI of 25)

  3. Clustering: Find previously unknown/unlabelled subgroups in your data (e.g., products commonly bought together on Amazon)

  4. Statistical Inference: Estimate an average or a proportion from a representative sample (group of people or units) and use that estimate to generalize to the broader population (e.g., the proportion of undergraduate students that own an iPhone)
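To make three of these problem types concrete, here is a rough sketch using only base R and its built-in data sets. It is not necessarily the approach or the tools the course will use, and the `owns_phone` vector is made-up survey data for illustration. Classification is left out because it usually relies on add-on packages.

```r
# Regression: predict stopping distance from speed using the built-in `cars` data.
fit <- lm(dist ~ speed, data = cars)
pred <- predict(fit, newdata = data.frame(speed = 15))
pred  # predicted stopping distance for a car travelling at speed 15

# Clustering: look for 3 subgroups in the iris measurements,
# ignoring the species labels.
set.seed(101)  # k-means starts from randomly chosen centres
groups <- kmeans(iris[, 1:4], centers = 3)
table(groups$cluster)  # how many flowers fall in each cluster

# Statistical inference: estimate a population proportion from a sample.
# `owns_phone` is a hypothetical sample of 8 yes/no survey answers.
owns_phone <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
mean(owns_phone)  # sample proportion: 5 of 8 answers are TRUE
```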

An overview of R

  • These lectures introduce R, originally developed as S, by John Chambers and others at Bell Laboratories in 1976, and implemented and made into an Open Source program by Robert Gentleman and Ross Ihaka in 1995.

  • There is nothing wrong with making errors when learning a programming language like R.

  • You learn from your mistakes, and there is no harm done.

3(2) + 4 # what will I return?
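For reference after you have tried it yourself: the line above produces an error, because R reads `3(2)` as an attempt to call `3` as a function rather than as a multiplication. The sketch below captures the error message with `tryCatch()` so that the script keeps running, then shows the version with an explicit `*`.

```r
# 3(2) asks R to call `3` as if it were a function, which fails.
# tryCatch() catches the error and hands back its message as text.
result <- tryCatch(
  3(2) + 4,
  error = function(e) conditionMessage(e)
)
result

# Writing the multiplication explicitly works.
3 * 2 + 4
```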

Installing R and RStudio

  • R can be downloaded for free from CRAN

  • While you can use R directly, R is much more commonly used with RStudio.

  • RStudio is an integrated development environment (IDE) that uses the R language.

  • You can download RStudio here

Updating

  • If you already have R and RStudio downloaded to your computer, now is a great time to update to make sure you have the most recent versions.

    • At the time of writing, the latest R version is 4.3.1 (2023-06-16), “Beagle Scouts”.
    • The latest RStudio version is 2023.06.2+561 (released 2023-08-30).
  • You can check your version of R using:

R.version.string # "R version 4.3.1 (2023-06-16)"
  • Check your RStudio version using Help > About RStudio
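If you would rather let R do the comparison, one possible check uses `getRversion()`, which returns the running version in a form that can be compared against a version string. The "4.3.1" below is simply the version named on this slide.

```r
# getRversion() returns the version of R that is currently running.
getRversion()

# TRUE if this installation is at least the version named on the slide.
getRversion() >= "4.3.1"
```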

R vs. RStudio

  • R is the underlying programming language while RStudio is an integrated development environment (IDE)

  • RStudio provides a more user-friendly interface and additional tools and features such as

    • code editor with syntax highlighting, code completion
    • project management tools
    • tools for version control (e.g., Git integration)
    • easy access to help documentation and package management.

RStudio Window

Panels

  • The Console is where you can execute code directly.

  • Apart from one-off calculations and help requests, you should instead draft and save your code to a file.

    • The most basic file format is an R script (with a .R extension)

  • The Source Panel is where you will see your R script(s)

  • You can execute lines of code in your script by pressing Cmd/Ctrl+R or Cmd/Ctrl+Enter, or by clicking the Run button.

Basics of R

R can be used as a glorified calculator.

Operation       Command   Example
addition        +         1 + 3
subtraction     -         1 - 3
multiplication  *         3*2
division        /         6/2
powers          ^         a^2
modulo          %%        6%%5
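The operators in the table can be tried directly in the Console. In this sketch, `a` is given the example value 3 so that `a^2` has something to work with; that value is an assumption made for illustration.

```r
a <- 3        # example value chosen for illustration

1 + 3         # addition
1 - 3         # subtraction
3 * 2         # multiplication
6 / 2         # division
a^2           # powers
6 %% 5        # modulo: the remainder after dividing 6 by 5

# The usual order of operations applies; use parentheses to change it.
2 + 3 * 4     # multiplication happens first
(2 + 3) * 4   # parentheses happen first
```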

Comments

  • Comments are used by the programmer to document and explain the code.

  • Comments are not executed by R and should give information about the code to the person reading.

  • Comment lines begin with a hashtag (#).

  • There is no multiline commenting in R.

  • There are keyboard shortcuts to commenting lines of code in your script (on a Mac it is Shift + Command + C).
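A short sketch of how comments look in practice; the variable name `total` is made up for illustration. Because R has no multiline comment syntax, each commented line needs its own hashtag.

```r
# A full-line comment: R ignores everything after the hashtag.
total <- 2 + 4  # a comment can also follow code on the same line

# To comment out several lines, start each one
# with its own hashtag, like this:
# total <- total + 100
total
```

The commented-out assignment never runs, so `total` keeps the value from the first assignment.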

Basic R code examples

Note that spaces do not matter

2      +           4   
[1] 6
2+4
[1] 6
# 4 + 5

Notes:

  • grayed out number indicates the start of input

  • [1] indicates the start of output

  • 4 + 5 is not executed because it is preceded by a hashtag.

Variables

  • Variables are assigned with <- or =.
x <- 4
y = 10
x + y 
[1] 14

These variables are now saved in your workspace, i.e. the environment in which you are currently doing your calculations. They can be displayed by typing their name.

y
[1] 10

Environment

See the saved objects in your workspace in the Environment tab.
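The same information is available from code. This sketch reuses the `x` and `y` from the Variables slide and relies on the base R functions `ls()`, `exists()`, and `rm()`.

```r
x <- 4
y <- 10

ls()          # names of the objects currently in the workspace
exists("x")   # TRUE: x has been assigned

rm(x)         # remove x from the workspace
ls()          # x is no longer listed
```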

Data sets

There are a number of data sets that are accessible already. To see the list of pre-loaded data, type the function

data() # output suppressed to save space
head(iris)
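Beyond `head()`, a few other base R functions give a quick first look at a data set. The sketch below applies them to the same built-in `iris` data.

```r
dim(iris)       # number of rows and columns
names(iris)     # column names
str(iris)       # compact description of each column
summary(iris)   # summary statistics for each column
tail(iris, 3)   # last 3 rows (head() shows the first 6 by default)
```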

Functions

  • There are a large number of base R functions at your disposal.

  • The real power of R, however, will come when we start using other packages (more on this in a later lecture)

  • To get help on any function in R, replace <function name> in either of the commands below

?<function name>
help(<function name>)
?head   # populates the help page in the help panel
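A few other base R helpers are useful when you only half-remember a function. The sketch below uses `mean` and `head` purely as examples.

```r
help("mean")       # same as ?mean
args(head)         # show the arguments a function accepts
example("mean")    # run the examples from the function's help page
apropos("^mean")   # list objects whose names start with "mean"
```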

Help!

More generally, if you found this introduction to R to be a tad overwhelming, DON’T WORRY! You will have plenty of time to get acquainted with R in lab.