KNN on the Glass dataset

 

On this page, what was previously learned about KNN models will be brought into practice using the “Glass” dataset. First, we’ll load the dataset and inspect it.

library(mlbench) # the Glass dataset ships with the mlbench package
library(dplyr)   # for %>% and mutate()
library(gmodels) # for CrossTable()

data(Glass)

Glass %>% head()
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
knitr::kable(summary(Glass))
|         |    RI |    Na |    Mg |    Al |    Si |      K |     Ca |    Ba |      Fe | Type |
|---------|-------|-------|-------|-------|-------|--------|--------|-------|---------|------|
| Min.    | 1.511 | 10.73 | 0.000 | 0.290 | 69.81 | 0.0000 |  5.430 | 0.000 | 0.00000 | 1:70 |
| 1st Qu. | 1.517 | 12.91 | 2.115 | 1.190 | 72.28 | 0.1225 |  8.240 | 0.000 | 0.00000 | 2:76 |
| Median  | 1.518 | 13.30 | 3.480 | 1.360 | 72.79 | 0.5550 |  8.600 | 0.000 | 0.00000 | 3:17 |
| Mean    | 1.518 | 13.41 | 2.685 | 1.445 | 72.65 | 0.4971 |  8.957 | 0.175 | 0.05701 | 5:13 |
| 3rd Qu. | 1.519 | 13.82 | 3.600 | 1.630 | 73.09 | 0.6100 |  9.172 | 0.000 | 0.10000 | 6: 9 |
| Max.    | 1.534 | 17.38 | 4.490 | 3.500 | 75.41 | 6.2100 | 16.190 | 3.150 | 0.51000 | 7:29 |

We can immediately see that Glass consists of 10 columns: 9 numeric predictors and 1 class label (Type). The predictor values range from about 0.29 to 75.41, roughly a factor-100 difference. A little on the large side, but still not high enough for normalisation to seem strictly necessary.
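
A quick way to verify those per-column ranges directly (a small check, output not shown in the original):

sapply(Glass[1:9], range) # min and max of each of the nine predictors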

We’ll use the same sample() technique as before to split the data into two groups: a training set and a test set.

set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints

glass_training<-Glass[ind_glass==1,1:9] # Separating the training and the testing datasets
glass_test<-Glass[ind_glass==2,1:9]

glass_training_labels<-Glass[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass[ind_glass==2,10]
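
As a quick sanity check (not part of the original output), we can count how many rows landed in each group; with prob = c(0.67, 0.33) we expect roughly a 2:1 split.

table(ind_glass) # rows per group; expect roughly 2:1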

With the data split, we can perform the machine learning. Note that class::knn() classifies with k = 1 by default, so each test point simply receives the class of its single nearest neighbour.

glass_pred<-class::knn(glass_training, glass_test, glass_training_labels) # Classify the test set; k defaults to 1

glass_result<-glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE  TRUE 
##    15    55
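Since glass_result is a logical vector, mean() gives the proportion of correct predictions directly; from the counts above that is 55/70 ≈ 0.786.

mean(glass_result) # proportion of correct predictions: 55/70
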
CrossTable(x = glass_test_labels, y=glass_pred)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  70 
## 
##  
##                   | glass_pred 
## glass_test_labels |         1 |         2 |         3 |         5 |         6 |         7 | Row Total | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 1 |        17 |         2 |         2 |         0 |         0 |         0 |        21 | 
##                   |    12.033 |     4.033 |     1.344 |     1.800 |     0.600 |     2.700 |           | 
##                   |     0.810 |     0.095 |     0.095 |     0.000 |     0.000 |     0.000 |     0.300 | 
##                   |     0.680 |     0.080 |     0.667 |     0.000 |     0.000 |     0.000 |           | 
##                   |     0.243 |     0.029 |     0.029 |     0.000 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 2 |         5 |        21 |         0 |         0 |         0 |         0 |        26 | 
##                   |     1.978 |    14.778 |     1.114 |     2.229 |     0.743 |     3.343 |           | 
##                   |     0.192 |     0.808 |     0.000 |     0.000 |     0.000 |     0.000 |     0.371 | 
##                   |     0.200 |     0.840 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                   |     0.071 |     0.300 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 3 |         3 |         0 |         1 |         0 |         0 |         0 |         4 | 
##                   |     1.729 |     1.429 |     4.005 |     0.343 |     0.114 |     0.514 |           | 
##                   |     0.750 |     0.000 |     0.250 |     0.000 |     0.000 |     0.000 |     0.057 | 
##                   |     0.120 |     0.000 |     0.333 |     0.000 |     0.000 |     0.000 |           | 
##                   |     0.043 |     0.000 |     0.014 |     0.000 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 5 |         0 |         0 |         0 |         6 |         0 |         0 |         6 | 
##                   |     2.143 |     2.143 |     0.257 |    58.514 |     0.171 |     0.771 |           | 
##                   |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.086 | 
##                   |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |           | 
##                   |     0.000 |     0.000 |     0.000 |     0.086 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 6 |         0 |         1 |         0 |         0 |         1 |         0 |         2 | 
##                   |     0.714 |     0.114 |     0.086 |     0.171 |    15.557 |     0.257 |           | 
##                   |     0.000 |     0.500 |     0.000 |     0.000 |     0.500 |     0.000 |     0.029 | 
##                   |     0.000 |     0.040 |     0.000 |     0.000 |     0.500 |     0.000 |           | 
##                   |     0.000 |     0.014 |     0.000 |     0.000 |     0.014 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 7 |         0 |         1 |         0 |         0 |         1 |         9 |        11 | 
##                   |     3.929 |     2.183 |     0.471 |     0.943 |     1.496 |    40.687 |           | 
##                   |     0.000 |     0.091 |     0.000 |     0.000 |     0.091 |     0.818 |     0.157 | 
##                   |     0.000 |     0.040 |     0.000 |     0.000 |     0.500 |     1.000 |           | 
##                   |     0.000 |     0.014 |     0.000 |     0.000 |     0.014 |     0.129 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##      Column Total |        25 |        25 |         3 |         6 |         2 |         9 |        70 | 
##                   |     0.357 |     0.357 |     0.043 |     0.086 |     0.029 |     0.129 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

Based on the table, we can conclude that the KNN model was less accurate on this Glass dataset than on the iris dataset: 55 of the 70 test samples (about 79%) were classified correctly. Of the 21 samples of glass type 1, only 17 were identified correctly; of the 26 samples of type 2, only 21. This pattern continues for the remaining types; only type 5 was identified perfectly.
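
One parameter we have left untouched is k itself. As a minimal sketch, assuming the training and test objects from the split above are still in scope, one could sweep over a few candidate values and compare test accuracies (results not shown here):

# Try k = 1 to 15 and record the test-set accuracy for each value.
# The seed is reset each time because knn() breaks distance ties at random.
accuracies <- sapply(1:15, function(k) {
  set.seed(4321)
  pred <- class::knn(glass_training, glass_test, glass_training_labels, k = k)
  mean(pred == glass_test_labels)
})
accuracies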

Purely to sate my own curiosity, I’ve also performed the analysis on normalised data, to see what impact normalisation has here.

normalize<-function(x){ # min-max scaling onto [0, 1]
  num<-x-min(x)         # shift so the minimum becomes 0
  denom<-max(x)-min(x)  # total spread of the values
  return(num/denom)
}
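
Applied to a single numeric vector, this maps the values onto [0, 1]; a quick check (output not shown in the original):

range(normalize(Glass$Na)) # should return 0 and 1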

set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints

# NB: min() and max() of a whole data frame are single global values, so this
# scales all nine columns by the same two constants (a uniform rescaling),
# not each column on its own scale.
Glass_norm<-normalize(Glass[1:9]) %>% mutate(Type=Glass$Type)

glass_training<-Glass_norm[ind_glass==1,1:9] # Separating the training and the testing datasets
glass_test<-Glass_norm[ind_glass==2,1:9]

glass_training_labels<-Glass_norm[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass_norm[ind_glass==2,10]

glass_pred<-class::knn(glass_training, glass_test, glass_training_labels) # Classify the test set; k defaults to 1

glass_result<-glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE  TRUE 
##    15    55
CrossTable(x = glass_test_labels, y=glass_pred)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  70 
## 
##  
##                   | glass_pred 
## glass_test_labels |         1 |         2 |         3 |         5 |         6 |         7 | Row Total | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 1 |        17 |         2 |         2 |         0 |         0 |         0 |        21 | 
##                   |    12.033 |     4.033 |     1.344 |     1.800 |     0.600 |     2.700 |           | 
##                   |     0.810 |     0.095 |     0.095 |     0.000 |     0.000 |     0.000 |     0.300 | 
##                   |     0.680 |     0.080 |     0.667 |     0.000 |     0.000 |     0.000 |           | 
##                   |     0.243 |     0.029 |     0.029 |     0.000 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 2 |         5 |        21 |         0 |         0 |         0 |         0 |        26 | 
##                   |     1.978 |    14.778 |     1.114 |     2.229 |     0.743 |     3.343 |           | 
##                   |     0.192 |     0.808 |     0.000 |     0.000 |     0.000 |     0.000 |     0.371 | 
##                   |     0.200 |     0.840 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                   |     0.071 |     0.300 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 3 |         3 |         0 |         1 |         0 |         0 |         0 |         4 | 
##                   |     1.729 |     1.429 |     4.005 |     0.343 |     0.114 |     0.514 |           | 
##                   |     0.750 |     0.000 |     0.250 |     0.000 |     0.000 |     0.000 |     0.057 | 
##                   |     0.120 |     0.000 |     0.333 |     0.000 |     0.000 |     0.000 |           | 
##                   |     0.043 |     0.000 |     0.014 |     0.000 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 5 |         0 |         0 |         0 |         6 |         0 |         0 |         6 | 
##                   |     2.143 |     2.143 |     0.257 |    58.514 |     0.171 |     0.771 |           | 
##                   |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.086 | 
##                   |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |           | 
##                   |     0.000 |     0.000 |     0.000 |     0.086 |     0.000 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 6 |         0 |         1 |         0 |         0 |         1 |         0 |         2 | 
##                   |     0.714 |     0.114 |     0.086 |     0.171 |    15.557 |     0.257 |           | 
##                   |     0.000 |     0.500 |     0.000 |     0.000 |     0.500 |     0.000 |     0.029 | 
##                   |     0.000 |     0.040 |     0.000 |     0.000 |     0.500 |     0.000 |           | 
##                   |     0.000 |     0.014 |     0.000 |     0.000 |     0.014 |     0.000 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 7 |         0 |         1 |         0 |         0 |         1 |         9 |        11 | 
##                   |     3.929 |     2.183 |     0.471 |     0.943 |     1.496 |    40.687 |           | 
##                   |     0.000 |     0.091 |     0.000 |     0.000 |     0.091 |     0.818 |     0.157 | 
##                   |     0.000 |     0.040 |     0.000 |     0.000 |     0.500 |     1.000 |           | 
##                   |     0.000 |     0.014 |     0.000 |     0.000 |     0.014 |     0.129 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##      Column Total |        25 |        25 |         3 |         6 |         2 |         9 |        70 | 
##                   |     0.357 |     0.357 |     0.043 |     0.086 |     0.029 |     0.129 |           | 
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

Conclusion: in this specific example, normalisation changes nothing at all. In hindsight, that is guaranteed by the code above: normalize() was applied to the whole data frame at once, and since min() and max() of a data frame are global values, every column was shifted and scaled by the same two constants. Such a uniform rescaling multiplies all pairwise distances by the same factor, so the nearest neighbours, and therefore the KNN predictions, are identical by construction. For normalisation to have any real effect here, it must be applied per column.
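
For completeness, here is a minimal sketch of per-column scaling, reusing the normalize() function from above. Rerunning the split and class::knn() on this version may well give different predictions; those results are not shown here.

# Per-column min-max scaling: every predictor gets its own min and max,
# so variables on large scales no longer dominate the distance calculation.
Glass_norm <- Glass[1:9] %>%
  lapply(normalize) %>%
  as.data.frame() %>%
  mutate(Type = Glass$Type)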

Now that we’ve studied and applied one form of machine learning in full, it’s time to look into another important aspect of machine learning, a step that comes before the training even starts: pre-processing.