DATA 101: Making Prediction with Data
University of British Columbia Okanagan
Consider the data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian (Street, Wolberg, and Mangasarian 1993).
Task: use these tumor image measurements to predict whether a new tumor image (with unknown diagnosis) shows a benign or malignant tumor.
We don't really care how well our classifier predicts the observations in our training set; what matters is whether it generalizes to new observations.
If our classifier is able to make accurate predictions on data not seen during training, it implies that it has actually learned the relationship between the predictor variables and the response variable, as opposed to simply "memorizing" the labels of individual training observations.
But what if we don’t have “new” data?
The trick is to split the data into a training set and a testing set.
The training set (only) is used to build our classifier while the testing set is used to evaluate the classifier’s performance.
If our predictions for the testing set match the true labels, then we have some confidence that our classifier is generalizable and will perform well on new, unseen data.
How exactly can we assess how well our predictions match the true labels for the observations in the test set?
One way we can do this for classification problems is to calculate the prediction accuracy
Accuracy is simply the proportion of examples for which the classifier made the correct prediction:
\[\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}\]
Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with a single number, but by itself it does not tell the whole story.
In particular, it does not tell us anything about the kinds of mistakes the classifier makes.
In this example, the classifier can make two kinds of mistakes: predicting a malignant tumor when it is truly benign, or predicting a benign tumor when it is truly malignant. Clearly the second type of mistake (predicting benign for a truly malignant tumor) is more threatening.
|                     | Truly Malignant | Truly Benign |
|---------------------|-----------------|--------------|
| Predicted Malignant | 1               | 4            |
| Predicted Benign    | 3               | 57           |
Focusing more on one label than the other is common in classification problems.
|              | ➕     | ➖     |
|--------------|--------|--------|
| Predicted ➕ | ✅ TP  | ❌ FP  |
| Predicted ➖ | ❌ FN  | ✅ TN  |
When presented with a classification task you should think about which kinds of error are most important.
Two metrics often reported together with accuracy are precision and recall.
The four main components of a confusion matrix are:
True Positives (TP): The number of instances correctly predicted as the positive class
True Negatives (TN): The number of instances correctly predicted as the negative class
False Positives (FP): The number of instances incorrectly predicted as the positive class
False Negatives (FN): The number of instances incorrectly predicted as the negative class
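Written in terms of these counts, the standard definitions of precision and recall (consistent with the calculations below) are:
\[\begin{align} \text{precision} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}\\ \text{recall} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \end{align}\]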
Returning to this example, we have the following calculations for precision and recall:
\[\begin{align} \text{precision} &= \frac{1}{1 + 4} = 0.20\\ \text{recall} &= \frac{1}{1 + 3} = 0.25. \end{align}\]
So even with an accuracy of 89%, the precision and recall of the classifier were both relatively low.
Of course, most real classifiers fall somewhere in between these two extremes: a classifier that labels every tumor as malignant achieves perfect recall but poor precision, while one that labels a tumor malignant only when it is extremely certain achieves high precision but poor recall.
Despite what the visualization might suggest, we do not simply take the first, say, 60% of observations to be our training set and the remaining 40% to be the test set.
For one, data often come ordered in some way (e.g., by date), so taking the first chunk would introduce that ordering as a bias into our training set.
Rather than manually selecting observations, we split the data set randomly into training and test sets.
Pseudorandom, or pseudo-random, refers to a sequence of numbers or values that appears to be random but is generated by a deterministic process, typically a mathematical algorithm.
In other words, pseudorandom sequences are not truly random but are instead generated in a way that simulates randomness.
R’s random number generator produces a sequence of numbers that are completely determined by a seed value.
`set.seed`

In the R programming language, the `set.seed()` function is used to set the seed for the random number generator.
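A minimal sketch of how this is typically used; the seed value below is an arbitrary choice:

```r
# set the seed once, right after loading packages, so results are reproducible
set.seed(123)  # any integer works; 123 is just an illustrative value

# pseudorandom draws are now reproducible: rerunning from the same seed
# gives the same "random" numbers
sample(1:10, 3)
runif(2)
```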
Tidymodels is a collection of R packages and tools for modeling and machine learning that follows the principles of tidy data science.
It is a powerful and flexible framework for conducting end-to-end machine learning and predictive modeling in R while promoting best practices for data science and analysis.
We can use the `tidymodels` package not only to perform KNN classification, but also to assess how well our classification worked.
Let's work through an example of how to use tools from `tidymodels` to evaluate a classifier using the breast cancer data set; the CSV file can be downloaded here.
In line with our previous discussion, let’s load our required packages and set a seed to begin:
# load required packages
library(tidyverse)
library(tidymodels)

# load data
cancer <- read_csv("data/wdbc_unscaled.csv") |>
  # convert the character Class variable to the factor datatype
  mutate(Class = as_factor(Class)) |>
  # rename the factor values to be more readable
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
# create scatter plot of tumor cell concavity versus smoothness,
# labeling the points by diagnosis class
perim_concav <- cancer |>
  ggplot(aes(x = Smoothness, y = Concavity, color = Class)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis") +
  scale_color_manual(values = c("orange2", "steelblue2")) +
  theme(text = element_text(size = 12))

perim_concav
`initial_split`

The `initial_split()` function from tidymodels splits the data while applying two very important steps:

1. It shuffles the rows before splitting, so that any ordering present in the original data does not influence which observations end up in the training and testing sets.
2. It stratifies the split by the class label (via the `strata` argument), so that roughly the same proportion of each class ends up in both the training and testing sets.
Example: if our data comprise roughly 63% benign observations, and 37% malignant, initial_split
ensures that roughly 63% of the training data are benign, 37% of the training data are malignant, and the same proportions exist in the testing data.
?initial_split

| Argument | Description |
|----------|-------------|
| data     | A data frame. |
| prop     | The proportion of data to be retained for modeling/analysis. |
| strata   | A variable in data (single character or name) used to conduct stratified sampling. When not NULL, each resample is created within the stratification variable. |
`initial_split` for splitting

- specify `prop = 0.75` so that 75% of our original data set ends up in the training set (and the remaining 25% makes up our testing set)
- set the `strata` argument to the categorical label variable (here, `Class`) to ensure that the training and testing subsets contain the right proportions of each category of observation
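Putting this together, the split might be created as in the sketch below; the object name `cancer_split` is our own choice, while `cancer_train` and `cancer_test` match the objects used in the rest of the example:

```r
# split the data: 75% training, 25% testing, stratified by Class
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)

# extract the two subsets from the split object
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)

# inspect the training set
glimpse(cancer_train)
```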
Rows: 426
Columns: 12
$ ID <dbl> 8510426, 8510653, 8510824, 857373, 857810, 858477, 8…
$ Class <fct> Benign, Benign, Benign, Benign, Benign, Benign, Beni…
$ Radius <dbl> 13.540, 13.080, 9.504, 13.640, 13.050, 8.618, 10.170…
$ Texture <dbl> 14.36, 15.71, 12.44, 16.34, 19.31, 11.79, 14.88, 20.…
$ Perimeter <dbl> 87.46, 85.63, 60.34, 87.21, 82.61, 54.34, 64.55, 54.…
$ Area <dbl> 566.3, 520.0, 273.9, 571.8, 527.2, 224.5, 311.9, 221…
$ Smoothness <dbl> 0.09779, 0.10750, 0.10240, 0.07685, 0.08060, 0.09752…
$ Compactness <dbl> 0.08129, 0.12700, 0.06492, 0.06059, 0.03789, 0.05272…
$ Concavity <dbl> 0.066640, 0.045680, 0.029560, 0.018570, 0.000692, 0.…
$ Concave_points <dbl> 0.047810, 0.031100, 0.020760, 0.017230, 0.004167, 0.…
$ Symmetry <dbl> 0.1885, 0.1967, 0.1815, 0.1353, 0.1819, 0.1683, 0.27…
$ Fractal_dimension <dbl> 0.05766, 0.06811, 0.06905, 0.05953, 0.05501, 0.07187…
[1] 0.7486819
cancer_proportions <- cancer_train |>
  group_by(Class) |>
  summarize(n = n()) |>
  mutate(percent = 100 * n / nrow(cancer_train))

cancer_proportions
Rows: 143
Columns: 12
$ ID <dbl> 84501001, 846381, 84799002, 849014, 852763, 853401, …
$ Class <fct> Malignant, Malignant, Malignant, Malignant, Malignan…
$ Radius <dbl> 12.460, 15.850, 14.540, 19.810, 14.580, 18.630, 16.7…
$ Texture <dbl> 24.04, 23.95, 27.54, 22.15, 21.53, 25.11, 21.59, 18.…
$ Perimeter <dbl> 83.97, 103.70, 96.73, 130.00, 97.41, 124.80, 110.10,…
$ Area <dbl> 475.9, 782.7, 658.8, 1260.0, 644.8, 1088.0, 869.5, 5…
$ Smoothness <dbl> 0.11860, 0.08401, 0.11390, 0.09831, 0.10540, 0.10640…
$ Compactness <dbl> 0.23960, 0.10020, 0.15950, 0.10270, 0.18680, 0.18870…
$ Concavity <dbl> 0.22730, 0.09938, 0.16390, 0.14790, 0.14250, 0.23190…
$ Concave_points <dbl> 0.085430, 0.053640, 0.073640, 0.094980, 0.087830, 0.…
$ Symmetry <dbl> 0.2030, 0.1847, 0.2303, 0.1582, 0.2252, 0.2183, 0.18…
$ Fractal_dimension <dbl> 0.08243, 0.05338, 0.07077, 0.05395, 0.06924, 0.06197…
[1] 0.2513181
Malignant Benign
0.3706294 0.6293706
`recipe` from the `recipes` package

We create the recipe with the `cancer_train` data, specifying that:

- `Class` is the response, and
- `Smoothness` and `Concavity` are to be used as predictors.

?recipe
A recipe is a description of the steps to be applied to a data set in order to prepare it for data analysis.
| Argument | Description |
|----------|-------------|
| x, data  | A data frame or tibble of the template data set (see below). |
| ...      | Further arguments passed to or from other methods (not currently used). |
| formula  | A model formula of the form y ~ x1 + x2 + ... + xp, where y is your response variable and x1, x2, …, xp are your desired predictors. |
As discussed last class, KNN is sensitive to the scale of the predictors; hence, we standardize them…
A `recipe` for normalizing

- `step_scale` creates a scaling step that will normalize numeric data to have a standard deviation of one.
- `step_center` creates a centering step that will normalize numeric data to have a mean of zero.
- `all_predictors()` is used to select the variables these steps apply to.
- `step_normalize` performs both centering and scaling in a single recipe step; however, we will keep `step_scale` and `step_center` separate to emphasize conceptually that there are two steps happening (see the recipe sketch below).

Now that we have split our original data set into training and test sets, we can create our KNN classifier with only the training set.
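Putting the preprocessing together, a sketch of the recipe might look like the following; the object name `cancer_recipe` is chosen to match the workflow example later:

```r
# recipe: Class is the response; Smoothness and Concavity are the predictors,
# scaled to standard deviation one and centered to mean zero
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
```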
For now, we will just choose the number of nearest neighbors, \(k\), to be 3.
We use `Concavity` and `Smoothness` as the predictors (as specified in our recipe).
While we did this with the `knn` function from the `class` package last class, let's see how we can do it with the `tidymodels` functions…
- We use the `nearest_neighbor` function, specifying that we want to use K = 3 neighbors.
- The `weight_func` argument controls how neighbors vote when classifying a new observation.
- `weight_func = "rectangular"` specifies that each neighboring point should have the same weight when voting.
- Other choices, which weigh each neighbor's vote differently, can be found on the parsnip website.
- With the `set_engine` argument, we specify which package or system will be used for training the model; `kknn` is the R package we will use for performing KNN classification.
- We specify that this is a classification problem with the `set_mode` function.

In order to fit the model on the breast cancer data, we need to pass the model specification and the data set to the `fit` function.
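Concretely, the model specification might be written as in the sketch below; the object name `knn_spec` is chosen to match the workflow example that follows:

```r
# KNN model specification: 3 neighbors, equal ("rectangular") voting weights,
# trained with the kknn engine, for a classification problem
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")
```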
While we could use something like the following, this would not include the data preprocessing steps:
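For instance, a direct call to `fit` might look like the sketch below; the object name `knn_fit_raw` is just for illustration, and this approach would train the model on the raw predictors, skipping the scaling and centering defined in our recipe:

```r
# fit the model specification directly on a formula -- no recipe, so no preprocessing
knn_fit_raw <- fit(knn_spec, Class ~ Smoothness + Concavity, data = cancer_train)
```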
The tidymodels package collection also provides the workflow, a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
This framework combines data preprocessing steps, model specification, and model fitting into a single cohesive unit.
Workflows help you streamline and simplify the process of building, tuning, and evaluating predictive models while following the principles of tidy data science.
`workflow` example

knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer_train)

knn_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_scale()
• step_center()
── Model ───────────────────────────────────────────────────────────────────────
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3, data, 5), kernel = ~"rectangular")
Type of response variable: nominal
Minimal misclassification: 0.1150235
Best kernel: rectangular
Best k: 3
Now that we have a KNN classifier object, we can use it to predict the class labels for our test set.
We use `bind_cols` to add the column of predictions to the original test data, creating the `cancer_test_predictions` data frame.
The `Class` variable contains the true diagnoses, while the `.pred_class` column contains the predicted diagnoses from the classifier.
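A sketch of those two steps (predicting on the test set, then binding the predictions to it), using the objects defined earlier:

```r
# predict class labels for the test set, then attach the predictions
# (.pred_class) alongside the original test observations
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
  bind_cols(cancer_test)

cancer_test_predictions
```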
To assess how well the predictions match the true labels, we use the `metrics` function from tidymodels, specifying the `truth` and `estimate` arguments.
We look at the `.metric` column, since we are interested in the accuracy row.
The `.estimate` variable shows that the estimated accuracy of the classifier on the test data was 86%.
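A sketch of the accuracy computation; the `filter` step is one way to isolate the accuracy row:

```r
# compute standard metrics comparing true and predicted classes,
# then keep only the accuracy row
cancer_test_predictions |>
  metrics(truth = Class, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```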
We can also look at the confusion matrix for the classifier, using the `conf_mat` function:
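A sketch of that call, using the same `truth` and `estimate` arguments as above:

```r
# confusion matrix of true vs. predicted diagnoses on the test set
cancer_test_predictions |>
  conf_mat(truth = Class, estimate = .pred_class)
```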
Using our formulas from earlier, we see that the accuracy agrees with what R reported, and we can also compute the precision and recall of the classifier:
\[\begin{align} \mathrm{accuracy} &= \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{39+84}{39+84+6+14} = 0.86\\ \mathrm{precision} &= \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}} = \frac{39}{39 + 6} = 0.867\\ \mathrm{recall} &= \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}} = \frac{39}{39+14} = 0.736 \end{align}\]
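These quantities can also be computed directly with the `precision` and `recall` functions from yardstick; the sketch below assumes Malignant is the first level of the `Class` factor (otherwise the `event_level` argument would need to be set accordingly):

```r
# precision and recall, treating Malignant (the first factor level) as the positive class
cancer_test_predictions |>
  precision(truth = Class, estimate = .pred_class)

cancer_test_predictions |>
  recall(truth = Class, estimate = .pred_class)
```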
Comment
Many functions in tidymodels and the tidyverse use randomness.

At the beginning of every data analysis you do, right after loading packages, you should call the `set.seed()` function and pass it an integer of your choosing. If you do not explicitly "set a seed", your results will likely not be reproducible.

Avoid setting a seed many times throughout your analysis; otherwise, the randomness that R uses will not look as random as it should.