So far, we’ve concentrated on descriptive and exploratory data analysis.
At this stage of the course we seek to answer predictive questions about data.
Today we cover classification fundamentals: data preprocessing, and prediction using observed data.
Next class will focus on how to evaluate prediction accuracy and improve the classifier for optimal results.
Classification
Classification involves predicting the value of a categorical variable (AKA class or label) using one or more variables.
Classification is a form of supervised learning since we seek to learn from labelled data.
Generally, a classifier assigns an observation without a known class to a class on the basis of how similar it is to other observations for which we do know the class.
How Classification Works
Classification relies on numeric or categorical variables, often referred to as features or attributes and denoted by \(X\), of the data points.
The process involves training a classification model, using the labeled training data.
Once the model is trained, it can be used to make predictions about the class or category of new, unlabeled data points.
Types of Classification:
In binary classification, there are two possible classes or outcomes.
e.g., “Yes” or “No,” “Spam” or “Not Spam.”
Multiclass classification involves more than two classes.
e.g., “Cat,” “Dog,” “Bird,” or “Fish”
Oftentimes, these categories are coded as integers in our data set, e.g. 1 = male, 0 = female
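In R, such integer codes are usually stored as factors so that modelling functions treat them as categories rather than numbers. A small made-up example (the variable names are just for illustration):
sex_code <- c(1, 0, 1, 1, 0)   # hypothetical integer-coded labels
# convert to a factor with informative labels
sex <- factor(sex_code, levels = c(0, 1), labels = c("female", "male"))
sex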
Methods for Classification
While there are many possible methods for classification, in this course we will focus on the widely used K-nearest neighbours (KNN) algorithm.
In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more, …
Example: Simulation
Code
# for reproducibility
set.seed(4623)
# generate some random points
x1 <- runif(200, -10, 10)
x2 <- runif(200, -0.7, 1)
# assign class labels according to the x2 value
clas <- rep(NA, length(x2))
for(i in 1:length(x2)){
  clas[i] <- sample(1:2, size = 1, prob = c(max(0, x2[i]), min(1 - x2[i], 1)))
}
par(mar = c(5.1, 4.1, 0, 0))
plot(x1, x2, col = c("red", "blue")[clas], pch = 1)
Suppose we have a training set with two numeric features (\(X_1\) and \(X_2\)) and two possible classes: 1 (coloured in red) or 2 (coloured in blue)
Green point
Suppose we have a new observation (plotted in green) and we would like to predict if that point is 1 (red) or 2 (blue). What would you predict?
Purple point
Suppose we have a new observation (plotted in purple) and we would like to predict if that point is 1 (red) or 2 (blue). What would you predict?
Concept of KNN
KNN is a distance-based algorithm.
It works on the principle that data points that are close to each other in the feature space are likely to belong to the same class.
It measures the “closeness” or similarity between data points using a distance metric (most commonly Euclidean)
Euclidean Distance (2D)
The Euclidean distance between observations \(\boldsymbol p = (p_1, p_2)\) and \(\boldsymbol q = (q_1, q_2)\) is calculated as \[d(\boldsymbol p, \boldsymbol q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]
More generally, the Euclidean (or straight-line) distance between observations \(\boldsymbol x = (x_{1}, x_{2}, \dots, x_{p})\) and \(\boldsymbol y = (y_{1}, y_{2}, \dots, y_{p})\) is calculated as \[d(\boldsymbol x, \boldsymbol y) = \sqrt{\sum_{i=1}^p (x_{i} - y_{i})^2}\]
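For instance, with two made-up three-dimensional points we can compute this distance directly in base R (the built-in dist() function gives the same result):
x <- c(1, 2, 3)
y <- c(4, 6, 3)
sqrt(sum((x - y)^2))   # 5
dist(rbind(x, y))      # same distance via dist()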
Steps of KNN Classification
Given positive integer \(k\) (chosen by user) and observation \(\boldsymbol x\):
Nearest Neighbours: Identify the \(k\) closest points to \(\boldsymbol x\) in the training feature space. These are called the nearest neighbours; we denote their set by \(\mathcal N_0\)
Majority Voting: The class label that occurs most frequently among the \(k\)-nearest neighbors is assigned to \(\boldsymbol x\).
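To make these two steps concrete, here is a minimal from-scratch sketch in base R on made-up data (for illustration only; in practice we will use an existing implementation):
# toy training data: two features and a class label (1 or 2)
train_X <- cbind(x1 = c(0, 1, 2, 8, 9, 10),
                 x2 = c(0, 1, 0, 8, 9, 10))
train_y <- factor(c(1, 1, 1, 2, 2, 2))
# new observation to classify, and number of neighbours
new_x <- c(1, 1)
k <- 3
# Step 1: Euclidean distance from new_x to every training point
dists <- sqrt(rowSums(sweep(train_X, 2, new_x)^2))
# Step 2: majority vote among the k nearest neighbours
neighbours <- order(dists)[1:k]
names(which.max(table(train_y[neighbours])))   # predicted class: "1"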
R implementation
To demo KNN classification, we will use the knn() function from the class package
library("class")knn(train, test, cl, k)
train matrix or data frame of training set cases.
test matrix or data frame of test set cases.
cl factor of true classifications of training set
k number of neighbours considered.
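For instance, a small sketch on made-up data (the object names here are illustrative, not from our simulation):
library("class")
# made-up training set with two features and known classes
train <- data.frame(x1 = c(0, 1, 2, 8, 9, 10),
                    x2 = c(0, 1, 0, 8, 9, 10))
cl <- factor(c(1, 1, 1, 2, 2, 2))
# a new point whose class we want to predict
test <- data.frame(x1 = 1, x2 = 1)
knn(train, test, cl, k = 3)
# [1] 1
# Levels: 1 2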
k Nearest Neighbours Example
Let’s apply this algorithm to the simulated data considered previously.
As described here, the user needs to specify \(k\), the number of nearest neighbours to consider.
We will discuss the important issue of which \(k\) to use next lecture. For now, let’s try a few options and see how that affects our predictions.
More specifically, we will fit knn for \(k = 20, 15, 9, 3\) and \(1\)
Since 5 of the 9 nearest neighbours are blue and only 4 of the 9 nearest neighbours are red, the majority vote for the green point is now blue.
15 nearest neighbours
k = 15
tabk = dat.dist |>
  slice_min(dist_from_new, n = k) |>
  select(clas) |>
  as.vector() |>
  table()
names(tabk) = c("red", "blue")
tabk
red blue
6 9
Since 9 of the 15 nearest neighbours are blue and only 6 of the 15 nearest neighbours are red, the majority vote for the green point is blue when \(k\) = 15.
20 nearest neighbours
k = 20
tabk = dat.dist |>
  slice_min(dist_from_new, n = k) |>
  select(clas) |>
  as.vector() |>
  table()
names(tabk) = c("red", "blue")
tabk
red blue
6 14
Since 14 of the 20 nearest neighbours are blue and only 6 of the 20 nearest neighbours are red, the majority vote for the green point is blue when \(k\) = 20.
KNN predictions for Green point
g.3 = knn(dat, test = greenpt, cl = clas, k = 3)
g.3  # 1 = red
[1] 1
Levels: 1 2
g.9 = knn(dat, test = greenpt, cl = clas, k = 9)
g.9  # 2 = blue
[1] 2
Levels: 1 2
g.15 = knn(dat, test = greenpt, cl = clas, k = 15)
g.15  # 2 = blue
[1] 2
Levels: 1 2
g.20 = knn(dat, test = greenpt, cl = clas, k = 20)
g.20  # 2 = blue
[1] 2
Levels: 1 2
KNN predictions for Purple point
p.1 = knn(dat, test = purplept, cl = clas, k = 1)
p.1  # 1 = red
[1] 1
Levels: 1 2
p.3 = knn(dat, test = purplept, cl = clas, k = 3)
p.3  # 1 = red
[1] 1
Levels: 1 2
p.9 = knn(dat, test = purplept, cl = clas, k = 9)
p.9  # 1 = red
[1] 1
Levels: 1 2
p.15 = knn(dat, test = purplept, cl = clas, k = 15)
p.15  # 1 = red
[1] 1
Levels: 1 2
p.20 = knn(dat, test = purplept, cl = clas, k = 20)
p.20  # 1 = red
[1] 1
Levels: 1 2
Choosing k
A key component of a “good” KNN classifier is how we choose \(k\).
\(k\) is a user-defined input to our algorithm, which we can set anywhere from 1 to \(n\) (the number of points in our training set)
Question What would happen if \(k\) was set equal to \(n\)?
KNN Decision regions
KNN decision regions. Any new points in the shaded red areas will be classified as red. Any new points in the shaded blue areas will be classified as blue. Compare with scaled.
Comment
When using KNN classification, the scale of each variable (i.e., its size and range of values) matters.
Since the classifier predicts the class of a new observation by identifying the training observations nearest to it, any variables with a large scale will have a much larger effect than variables with a small scale.
But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions.
Scale Matters
Example Suppose we want to predict the type of job based on salary (in dollars) and years of education
When we compute the neighbor distances, a difference of $1000 is huge compared to a difference of 10 years of education.
But conceptually, a difference of 10 years of education is huge compared to a difference of $1000 in yearly salary!
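A quick back-of-the-envelope calculation (with made-up numbers) shows the problem: the salary difference completely dominates the distance.
personA <- c(salary = 40000, educ = 16)
personB <- c(salary = 41000, educ = 6)
# the $1000 salary gap swamps the 10-year education gap
sqrt(sum((personA - personB)^2))   # approximately 1000.05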
Center (sometimes) matters
In many other predictive models, the center of each variable (e.g., its mean) matters as well.
For example, changing a temperature variable from degrees Kelvin to degrees Celsius would shift the variable by 273
Although this doesn’t affect the KNN classification algorithm, this large shift can change the outcome of using many other predictive models.
Data preprocessing
For the reasons stated before, it is common practice to standardize our data.
For each observed value of the variable, we subtract the mean (the average, which quantifies the “central” value of a set of numbers) and divide by the standard deviation (a number quantifying how spread out values are).
Now the data is said to be standardized, and all features will have a mean of 0 and a standard deviation of 1.
Standardized Euclidean Distance
In other words, we scale and center our data to have mean 0 and variance 1 via \[z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}\] where \(\bar{x}_j\) and \(s_j\) are the mean and standard deviation of the \(j\)th feature.
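In R, this can be done with the built-in scale() function. A minimal sketch, assuming the simulated features x1 and x2 from earlier are still in the workspace:
# scale and center both features
dat_scaled <- scale(cbind(x1, x2))
# each column now has mean 0 and standard deviation 1
round(colMeans(dat_scaled), 10)
apply(dat_scaled, 2, sd)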
KNN decision regions. Any new points in the shaded red areas will be classified as red. Any new points in the shaded blue areas will be classified as blue. Compare with unscaled.
3 nearest neighbours Purple point
Since 2 of the 3 nearest neighbours are red and only 1 of the 3 nearest neighbours is blue, the majority vote for the purple point is red when \(k\) = 3.
9 nearest neighbours Purple point
Since 6 of the 9 nearest neighbours are blue and only 3 of the 9 nearest neighbours are red, the majority vote for the purple point is blue when \(k\) = 9.
15 nearest neighbours Purple point
Since 8 of the 15 nearest neighbours are blue and only 7 of the 15 nearest neighbours are red, the majority vote for the purple point is blue when \(k\) = 15.
20 nearest neighbours Purple point
Since 11 of the 20 nearest neighbours are red and only 9 of the 20 nearest neighbours are blue, the majority vote for the purple point is red when \(k\) = 20.
Balance
A potential issue for a classifier is class imbalance, i.e., when one label is much more common than another.
Since classifiers like KNN use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general
Class imbalance is actually quite a common and important problem, e.g., rare disease diagnosis, malicious email detection
Solution
Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this course.
For the present purposes, it will suffice to either rebalance the data by:
oversampling the underrepresented class.
undersampling the overrepresented class.
Simulation revisited
For instance we have more blue points than red points:
If we adopt option 2 on the previous slide (undersampling), we could create a data set with, say, 63 points in both the red class and the blue class and repeat the analysis.
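A minimal sketch of this undersampling step with dplyr, assuming the features and labels are combined in a data frame (here called dat_full, an assumed name) with a clas column and that the smaller (red) class has 63 points:
library(dplyr)
# dat_full is an assumed data frame holding x1, x2, and the clas labels
dat_balanced <- dat_full |>
  group_by(clas) |>
  slice_sample(n = 63) |>   # keep 63 points from each class
  ungroup()
table(dat_balanced$clas)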
Missing data
Handling missing data properly is very challenging and generally relies on expert knowledge about the data, setting, and how the data were collected.
In this course, we assume missing entries are just “randomly missing”.
To handle these observations, we’ll simply remove the rows containing NAs (with, say, the drop_na() function)
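A minimal sketch with tidyr, on a made-up data frame containing some NAs:
library(tidyr)
toy <- data.frame(x1 = c(1, 2, NA),
                  x2 = c(5, NA, 7),
                  clas = c(1, 2, 1))
# keep only the rows with no missing values
drop_na(toy)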
Warning
Missing entries can be informative since the absence of values may have some underlying significance.
e.g., survey participants from a marginalized group of people may be less likely to respond to certain kinds of questions if they fear that answering honestly will come with negative consequences.
If we simply throw away data with missing entries, we would bias our analysis by inadvertently removing many members of that group of respondents.
Final comments
We have gone through some very basic code in this lecture and dealt mostly with the concepts
It will be very important that you attend labs this week and get a handle on the tidymodels syntax for running KNN within the tidyverse.
It is also highly recommended that you go through the examples in Chapter 5 of your textbook.
Comment on scale
It may seem confusing visually why this point is the nearest neighbour to the green square.
It makes sense once you consider the scale of the x1 feature. The range of x1 on the x-axis is 20, whereas the scale of x2 on the y-axis is 1.5. Using the same scale we can see why …