KNN Machine Learning


To demonstrate that I can learn a new skill on my own, I decided to delve into a subject that was completely new to me: machine learning in R, applied to microbiology. Originally, I planned to study this site (Majchrowska 2021) and focus on image-based machine learning for identifying bacterial colonies. However, that proved too complicated for a beginner, so I started instead by following this guide (Willems 2018) on KNN-based machine learning in R.

The guide walks you through KNN-based machine learning using the built-in R dataset “iris”. KNN stands for “k-nearest neighbours”. In this technique, a model stores a training dataset and, for a new, to-be-classified data point, computes the distance from that point to every training point. It then finds the k nearest neighbours (if k = 5, the 5 closest training points) and assigns the new point to whichever class is most common among those neighbours. Below is an example of this.
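
To make the idea concrete, here is a minimal sketch of KNN in base R (illustrative only: the name knn_sketch and its arguments are my own, and the actual analysis later uses class::knn):

# Minimal KNN sketch: classify one new point by majority vote
# among its k nearest training points (Euclidean distance)
knn_sketch <- function(train, labels, point, k = 5) {
  # Distance from the new point to every row of the training matrix
  dists <- sqrt(colSums((t(train) - point)^2))
  # Labels of the k nearest neighbours
  nearest <- labels[order(dists)[1:k]]
  # Majority vote decides the predicted class
  names(which.max(table(nearest)))
}

# Example: predict the species of flower 1 from the other 149 flowers
knn_sketch(as.matrix(iris[-1, 1:4]), iris$Species[-1], unlist(iris[1, 1:4]))
## [1] "setosa"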

It is important to first understand your data, so to begin, I looked up some background information about what data iris contains.

Iris contains 5 columns: sepal width/length, petal width/length, and species. A “sepal” is a modified leaf that encases and protects the flower before it blooms. A “petal” is a modified leaf that surrounds the reproductive parts of the flower.

Images taken from Dodona, Universiteit Gent.
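
A quick look at the first rows confirms these five columns:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa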

We’ll perform some basic visualisation of the data in order to get a feel for what it represents.

library(tidyverse) # provides ggplot2 and the %>% pipe used throughout

iris %>% ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  theme_bw() +
  labs(
    title = "Correlation between sepal length and width",
    x = "Sepal length (cm)",
    y = "Sepal width (cm)"
  )

iris %>% ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  theme_bw() +
  labs(
    title = "Correlation between petal length and width",
    x = "Petal length (cm)",
    y = "Petal width (cm)"
  )

The petal plot shows a clear positive relationship between length and width; for the sepals the overall pattern is weaker, although within each species the points still trend upwards.
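
We can quantify this with Pearson’s correlation coefficient (the values below are for the built-in iris data; note how weak the pooled sepal correlation is, even though the within-species trend is positive):

cor(iris$Sepal.Length, iris$Sepal.Width)
## [1] -0.1175698

cor(iris$Petal.Length, iris$Petal.Width)
## [1] 0.9628654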

To further study the dataset, we’ll create a table of “Species”, the factor that we’d like to predict using our KNN model:

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

We can see that our dataset contains 50 samples of each species.

Finally, we’ll also have a look at the range of values in our dataset:

knitr::kable(summary(iris))
|   |Sepal.Length  |Sepal.Width   |Petal.Length  |Petal.Width   |Species       |
|:--|:-------------|:-------------|:-------------|:-------------|:-------------|
|   |Min.   :4.300 |Min.   :2.000 |Min.   :1.000 |Min.   :0.100 |setosa    :50 |
|   |1st Qu.:5.100 |1st Qu.:2.800 |1st Qu.:1.600 |1st Qu.:0.300 |versicolor:50 |
|   |Median :5.800 |Median :3.000 |Median :4.350 |Median :1.300 |virginica :50 |
|   |Mean   :5.843 |Mean   :3.057 |Mean   :3.758 |Mean   :1.199 |NA            |
|   |3rd Qu.:6.400 |3rd Qu.:3.300 |3rd Qu.:5.100 |3rd Qu.:1.800 |NA            |
|   |Max.   :7.900 |Max.   :4.400 |Max.   :6.900 |Max.   :2.500 |NA            |

The lowest value among the numeric variables is 0.1 (Petal.Width) and the highest is 7.9 (Sepal.Length), a difference of roughly a factor of 80, which is within acceptable limits. If the variables instead differed by a factor of more than about 1000, we would have to normalise the data first, for example with the min-max scaling x' = (x − min(x)) / (max(x) − min(x)) implemented by the function below.

normalize <- function(x) {
  num <- x - min(x)        # shift so the minimum becomes 0
  denom <- max(x) - min(x) # the total range of the variable
  return(num / denom)      # rescale to the interval [0, 1]
}

as.data.frame(lapply(iris[1:4], normalize)) %>% mutate(Species=iris$Species) %>% head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   0.22222222   0.6250000   0.06779661  0.04166667  setosa
## 2   0.16666667   0.4166667   0.06779661  0.04166667  setosa
## 3   0.11111111   0.5000000   0.05084746  0.04166667  setosa
## 4   0.08333333   0.4583333   0.08474576  0.04166667  setosa
## 5   0.19444444   0.6666667   0.06779661  0.04166667  setosa
## 6   0.30555556   0.7916667   0.11864407  0.12500000  setosa

In order to properly train the algorithm, we must separate our data into 2 groups: a “training” set and a “test” set. The training set usually takes about 2/3rds of the data, while the test set gets the other 1/3rd. It is important that every class is represented in the training data: if the algorithm were trained on 50 setosa, 50 versicolor and 0 virginica, it would be impossible for it to recognise virginica. Thus, we’ll use set.seed() and sample() to draw a random split.

set.seed(1234) # Fix the seed so the "random" split is reproducible

# Randomly assign each row to group 1 (~67%, training) or group 2 (~33%, testing)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

iris_training <- iris[ind == 1, 1:4] # Storing the training dataset
iris_testing  <- iris[ind == 2, 1:4] # Storing the testing dataset

# We have the training and testing datasets, but we don't have the labels
# (species) associated with them. We have to store these separately:

iris_trainlabels <- iris[ind == 1, 5]
iris_testlabels  <- iris[ind == 2, 5]
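
As a sanity check, we can tabulate the training labels to confirm that all three species are well represented (with this seed, the counts are the complements of the test-set row totals shown in the cross table further down):

table(iris_trainlabels)
## iris_trainlabels
##     setosa versicolor  virginica 
##         38         38         34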

With both the training set and the test set in place, we can now feed them into the KNN algorithm and predict the labels of the test set.

# class::knn classifies each test point by its k nearest training points
# (k defaults to 1 when not specified)
iris_pred <- class::knn(train = iris_training, test = iris_testing, cl = iris_trainlabels)

store <- iris_pred == iris_testlabels # TRUE where a prediction matches the true label

table(store)
## store
## FALSE  TRUE 
##     1    39
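
Expressed as a single accuracy figure:

mean(iris_pred == iris_testlabels)
## [1] 0.975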

This gives 39 correct matches and only 1 mismatch, an accuracy of 97.5%. To study where that single mismatch occurred, we’ll use the package called “gmodels”.

library(gmodels)

CrossTable(x = iris_testlabels, y=iris_pred)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40 
## 
##  
##                 | iris_pred 
## iris_testlabels |     setosa | versicolor |  virginica |  Row Total | 
## ----------------|------------|------------|------------|------------|
##          setosa |         12 |          0 |          0 |         12 | 
##                 |     19.600 |      3.900 |      4.500 |            | 
##                 |      1.000 |      0.000 |      0.000 |      0.300 | 
##                 |      1.000 |      0.000 |      0.000 |            | 
##                 |      0.300 |      0.000 |      0.000 |            | 
## ----------------|------------|------------|------------|------------|
##      versicolor |          0 |         12 |          0 |         12 | 
##                 |      3.600 |     16.823 |      4.500 |            | 
##                 |      0.000 |      1.000 |      0.000 |      0.300 | 
##                 |      0.000 |      0.923 |      0.000 |            | 
##                 |      0.000 |      0.300 |      0.000 |            | 
## ----------------|------------|------------|------------|------------|
##       virginica |          0 |          1 |         15 |         16 | 
##                 |      4.800 |      3.392 |     13.500 |            | 
##                 |      0.000 |      0.062 |      0.938 |      0.400 | 
##                 |      0.000 |      0.077 |      1.000 |            | 
##                 |      0.000 |      0.025 |      0.375 |            | 
## ----------------|------------|------------|------------|------------|
##    Column Total |         12 |         13 |         15 |         40 | 
##                 |      0.300 |      0.325 |      0.375 |            | 
## ----------------|------------|------------|------------|------------|
## 
## 

This table compares the true test labels with the KNN predictions: each row is the actual species (the Row Total column shows how many test flowers truly belong to that species), and each column is the predicted species (the Column Total row shows how many test flowers were predicted as that species). It tells us that the algorithm successfully identified all setosa and versicolor samples, but classified 1 “virginica” sample as “versicolor”.
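
For reference, the same confusion matrix can be produced in base R without gmodels (the names actual and predicted are just labels I added for readability):

table(actual = iris_testlabels, predicted = iris_pred)
##             predicted
## actual       setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         12         0
##   virginica       0          1        15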

This is machine learning in one of its most basic forms. Now, in order to test whether I’ve truly understood what this KNN guide teaches, I’ve performed the same workflow on another dataset: the “Glass” dataset from the mlbench package.

References

Majchrowska, Sylwia. 2021. “Recognition and Counting of Microorganisms on Petri Dishes.” Medium. https://towardsdatascience.com/recognition-and-counting-of-microorganisms-on-petri-dishes-53a499283dc8.
Willems, Karlijn. 2018. “Machine Learning in R for Beginners.” Datacamp. https://www.datacamp.com/tutorial/machine-learning-in-r.