DATA 101: Making Predictions with Data
University of British Columbia Okanagan
Welcome to DATA 101!
Course calendar description:
DATA 101 (3) Making Predictions with Data Introduction to the techniques and software for handling real-world data. Topics include data cleaning, visualization, simulation, basic modelling, and prediction making. [3-1-0]
Please note that there are two sections of this course.
Section 1 (001) Dr. Irene Vrbik
Section 2 (002) Ladan Tazik
I am currently a Tenure-track Assistant Professor of Teaching
I have taught a variety of courses (from introductory data science to graduate courses in statistics) at several institutions (Guelph, McGill, MDS Program)
I am currently the Data Science program advisor and the articulation and curriculum representative
Office: SCI 104 email: irene.vrbik@ubc.ca
Websites: irene.quarto.pub, irene.vrbik.ok.ubc.ca
McMaster University, BSc (Mathematics & Statistics)
University of Guelph, MSc (Applied Statistics)
Thesis: Using Individual-level Models to Model Spatio-temporal Combustion Dynamics. This involved modelling the spatio-temporal combustion dynamics of fire in a Bayesian framework. Supervisors: Rob Deardon and Zeng Feng.
University of Guelph, PhD (Applied Statistics)
Thesis: Non-Elliptical and Fractionally-Supervised Classification. This involved model-based classification with a particular emphasis on non-elliptical distributions. Supervisor: Paul D. McNicholas.
Postdoctoral Fellow at McGill University: Under the supervision of Dr. David Stephens, this work focused on the statistical and computational challenges associated with analyzing genetic data. It involved clustering and modelling HIV DNA sequences.
Postdoctoral Fellow at UBCO: Awarded by NSERC (Natural Sciences and Engineering Research Council of Canada), this research involved collaborations with faculty from several disciplines (e.g., Medical Physics, Biology, and Chemistry) and was supervised by Dr. Jason Loeppky.
Instructor at UBCO: a three-year contract position in the Department of Computer Science, Mathematics, Physics, and Statistics.
Statistics and Machine Learning in Curriculum Design
Curricular Analytics: the systematic analysis and evaluation of educational curricula to gain insights into various aspects of curriculum design, delivery, and assessment.
Tools for teaching, learning, and technology
The course syllabus is a dynamic document which has been posted to our (shared) Canvas shell. Many administrative questions can be answered there:
We will be using Canvas for most course-related material:
The two sections will share the same Canvas shell: canvas.ubc.ca/courses/126695. Let’s check it out!
Lectures will be posted at irene.quarto.pub/data101.
Take time to learn how to navigate through the slides, how to annotate them, and how to export to PDF.
A clipboard button appears when you hover over code.
Labs will be held online.
Students must be enrolled in a lab (which cannot conflict with other courses)
TAs provide guidance on carrying out analyses in R for the techniques discussed in lecture.
Knowledge of commands and programming techniques will be evaluated throughout the course.
Follow the instructions carefully and practice skills by completing (and redoing!) labs and assignments
We are using an open-source textbook available free on the web: Data Science: A First Introduction by Tiffany Timbers, Trevor Campbell, and Melissa Lee.
While the above textbook is all you will need for the course, supplementary (optional) textbooks include:
Outside of class, the general order in which I would suggest asking course-related questions is:
Data science is the use of reproducible and auditable processes to obtain value (i.e., insight) from data. (your textbook’s definition)
Learn how to use reproducible tools (RStudio) to do data analysis
Learn how to identify and solve common problems in data science
Doing computations to aid in data analysis.
AIM: to provide a foundation for understanding how those applications work: what calculations they are doing, and how you could do them yourself.
Classification: Predict a class/category for a new observation/measurement (e.g., cancerous or benign tumour)
Regression: Predict a value for a new observation/measurement (e.g., 10 km race time for 20-year-old females with a BMI of 25)
Clustering: Find previously unknown/unlabelled subgroups in your data (e.g., products commonly bought together on Amazon)
Statistical Inference: Estimate an average or a proportion from a representative sample (group of people or units) and use that estimate to generalize to the broader population (e.g., the proportion of undergraduate students who own an iPhone)
These lectures introduce R, an implementation of the S language. S was developed by John Chambers and others at Bell Laboratories starting in 1976; R was implemented by Robert Gentleman and Ross Ihaka and made into an open-source program in 1995.
There is nothing wrong with making errors when learning a programming language like R.
You learn from your mistakes, and there is no harm done.
If you already have R and RStudio downloaded to your computer, now is a great time to update to make sure you have the most recent versions.
You can check your version of R using:
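One way to do this, as a minimal sketch (the command shown on the original slide may differ), is to print the built-in `R.version.string`:

```r
# Print the version of R that is currently running
R.version.string

# The same information is stored piece by piece in the R.version list
R.version$major
R.version$minor
```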
R is the underlying programming language while RStudio is an integrated development environment (IDE)
RStudio provides a more user-friendly interface and additional tools and features such as
The Console is where you can execute code directly.
Apart from one-off calculations and help requests, you should instead be drafting and saving your code to a file.
The most basic file format is an R script (with the .R extension).
The Source Panel is where you will see your R script(s).
You can execute lines of code in your script by pressing Cmd/CTRL+R or Cmd/CTRL+Enter, or by pressing the Run button.
R can be used as a glorified calculator.
Operation | Command | Example |
---|---|---|
addition | + | 1 + 3 |
subtraction | - | 1 - 3 |
multiplication | * | 3*2 |
division | / | 6/2 |
powers | ^ | a^2 |
modulo | %% | 6%%5 |
Note that spaces do not matter
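The operations in the table can be tried directly in the Console. The following is a small sketch with illustrative numbers; the results are given in the trailing comments.

```r
1 + 3     # addition: 4
1 - 3     # subtraction: -2
3 * 2     # multiplication: 6
6 / 2     # division: 3
3^2       # powers: 9
6 %% 5    # modulo (remainder after division): 1

# Spacing does not change the result
3*2 == 3 * 2    # TRUE

# 4 + 5
```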
Notes:
The grayed-out number indicates the start of input.
[1] indicates the start of output (it is the index of the first value printed on that line).
4 + 5 is not executed because it is preceded by a hashtag.
Variables are assigned using <- or =. These variables are now saved in your workspace, i.e., the environment in which you are currently doing your calculations. They can be displayed by invoking their name.
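A short sketch of assignment and display; the names `x`, `y`, and `z` are made up for illustration:

```r
x <- 5        # assign the value 5 to x using <-
y = 3         # = also assigns at the top level
z <- x + y    # variables can be used in later calculations

x             # typing a name displays its value: 5
z             # 8

ls()          # list the names currently saved in the workspace
```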
There are a number of data sets that are already accessible. To see the list of pre-loaded data sets, call the function data().
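For instance, `mtcars` is one of the data sets that ships with R. It is used here only as an illustration and is not necessarily the one used in lecture.

```r
data()          # list the data sets that are available

head(mtcars)    # first six rows of the built-in mtcars data set
nrow(mtcars)    # number of rows: 32
```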
There are a large number of base R functions at your disposal.
The power of R will come however when we start calling to other packages (more on this in a later lecture)
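Before turning to packages, here is a brief sketch of a few base R functions. The vector `values` is hypothetical data invented for this example.

```r
values <- c(2, 4, 9)    # c() combines values into a vector

sqrt(16)                      # square root: 4
sum(values)                   # 15
mean(values)                  # 5
round(3.14159, digits = 2)    # 3.14
```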
To get help on any function in R, replace the <function name> in either of the commands below
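The two usual ways of asking for help are the `?` operator and the `help()` function. The sketch below uses `mean` in place of <function name> (the choice of `mean` is only an example):

```r
?mean           # open the help page for mean()
help("mean")    # the same request, written as a function call
```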
More generally, if you found this introduction to R to be a tad overwhelming, DON’T WORRY! You will have plenty of time to get acquainted with R in lab.
Comments
Comments are used by the programmer to document and explain the code.
Comments are not executed by R and should give information about the code to the person reading.
Comment lines begin with a hashtag (#).
There is no multiline commenting in R.
There are keyboard shortcuts to commenting lines of code in your script (on a Mac it is Shift + Command + C).
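A small sketch of commenting in practice; the variable names `radius` and `area` are invented for this example:

```r
# This whole line is a comment and is ignored by R
radius <- 2              # a comment can also follow code on the same line
area <- pi * radius^2    # area of a circle with the radius above

# There is no multiline comment syntax,
# so longer notes start every line with a hashtag.
area
```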