On this page, what was previously learned about KNN models will be put into practice using the “Glass” dataset. First, we’ll load the dataset and inspect it.
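The chunks on this page assume the following packages are already attached (an assumption on my part, since the library calls themselves are not shown here):

library(mlbench)   # provides the Glass dataset
library(dplyr)     # %>% pipe and mutate()
library(class)     # knn() classifier
library(gmodels)   # CrossTable()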
data(Glass)
Glass %>% head()
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
knitr::kable(summary(Glass))

| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :1.511 | Min. :10.73 | Min. :0.000 | Min. :0.290 | Min. :69.81 | Min. :0.0000 | Min. : 5.430 | Min. :0.000 | Min. :0.00000 | 1:70 | |
| 1st Qu.:1.517 | 1st Qu.:12.91 | 1st Qu.:2.115 | 1st Qu.:1.190 | 1st Qu.:72.28 | 1st Qu.:0.1225 | 1st Qu.: 8.240 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 2:76 | |
| Median :1.518 | Median :13.30 | Median :3.480 | Median :1.360 | Median :72.79 | Median :0.5550 | Median : 8.600 | Median :0.000 | Median :0.00000 | 3:17 | |
| Mean :1.518 | Mean :13.41 | Mean :2.685 | Mean :1.445 | Mean :72.65 | Mean :0.4971 | Mean : 8.957 | Mean :0.175 | Mean :0.05701 | 5:13 | |
| 3rd Qu.:1.519 | 3rd Qu.:13.82 | 3rd Qu.:3.600 | 3rd Qu.:1.630 | 3rd Qu.:73.09 | 3rd Qu.:0.6100 | 3rd Qu.: 9.172 | 3rd Qu.:0.000 | 3rd Qu.:0.10000 | 6: 9 | |
| Max. :1.534 | Max. :17.38 | Max. :4.490 | Max. :3.500 | Max. :75.41 | Max. :6.2100 | Max. :16.190 | Max. :3.150 | Max. :0.51000 | 7:29 |
We can immediately see that Glass consists of 10 columns: 9 measured variables and 1 column identifying the glass type. The values range from roughly 0.29 to 75.41, about a factor-100 difference. A little on the large side, but still not so extreme that normalisation is strictly necessary.
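The per-column ranges can also be checked directly, for instance with:

sapply(Glass[1:9], range) # minimum and maximum of each of the nine numeric columns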
We’ll use the same “sample” technique as before to split the data into two groups: a training set and a test set.
set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints
glass_training<-Glass[ind_glass==1,1:9] # Separating the training and test datasets
glass_test<-Glass[ind_glass==2,1:9]
glass_training_labels<-Glass[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass[ind_glass==2,10]
With the dataset split and the labels stored, we can perform the machine learning.
glass_pred<-class::knn(glass_training, glass_test, glass_training_labels) # Performing the KNN classification on the test set
glass_result<-glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE TRUE
## 15 55
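Since glass_result is a logical vector, the overall accuracy can also be expressed as a single number (note that class::knn() uses k = 1 by default, so each observation is classified by its single nearest neighbour):

mean(glass_result) # 55 of 70 correct, roughly 0.79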
CrossTable(x = glass_test_labels, y=glass_pred)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 70
##
##
## | glass_pred
## glass_test_labels | 1 | 2 | 3 | 5 | 6 | 7 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 17 | 2 | 2 | 0 | 0 | 0 | 21 |
## | 12.033 | 4.033 | 1.344 | 1.800 | 0.600 | 2.700 | |
## | 0.810 | 0.095 | 0.095 | 0.000 | 0.000 | 0.000 | 0.300 |
## | 0.680 | 0.080 | 0.667 | 0.000 | 0.000 | 0.000 | |
## | 0.243 | 0.029 | 0.029 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 5 | 21 | 0 | 0 | 0 | 0 | 26 |
## | 1.978 | 14.778 | 1.114 | 2.229 | 0.743 | 3.343 | |
## | 0.192 | 0.808 | 0.000 | 0.000 | 0.000 | 0.000 | 0.371 |
## | 0.200 | 0.840 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.071 | 0.300 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 3 | 0 | 1 | 0 | 0 | 0 | 4 |
## | 1.729 | 1.429 | 4.005 | 0.343 | 0.114 | 0.514 | |
## | 0.750 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.057 |
## | 0.120 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | |
## | 0.043 | 0.000 | 0.014 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
## | 2.143 | 2.143 | 0.257 | 58.514 | 0.171 | 0.771 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.086 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.086 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 0 | 1 | 0 | 0 | 1 | 0 | 2 |
## | 0.714 | 0.114 | 0.086 | 0.171 | 15.557 | 0.257 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.500 | 0.000 | 0.029 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 0.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 0 | 1 | 0 | 0 | 1 | 9 | 11 |
## | 3.929 | 2.183 | 0.471 | 0.943 | 1.496 | 40.687 | |
## | 0.000 | 0.091 | 0.000 | 0.000 | 0.091 | 0.818 | 0.157 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 1.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 25 | 25 | 3 | 6 | 2 | 9 | 70 |
## | 0.357 | 0.357 | 0.043 | 0.086 | 0.029 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Based on the table, we can conclude that the KNN model is less accurate for the Glass dataset than it was for the iris dataset. Of the 21 samples of glass type 1, only 17 were classified correctly; of the 26 samples of glass type 2, only 21 were classified correctly. This pattern continues for all glass types; only glass type 5 was classified perfectly.
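The per-class accuracy can be read off more compactly from the confusion matrix, for instance like this (the fractions simply restate the row totals shown above):

conf_matrix <- table(actual = glass_test_labels, predicted = glass_pred)
diag(conf_matrix) / rowSums(conf_matrix) # fraction of each glass type classified correctly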
Purely to sate my own curiosity, I’ve also performed the analysis using normalised data, to see what the impact of normalisation would be on this data.
normalize<-function(x){
num<-x-min(x)
denom<-max(x)-min(x)
return(num/denom)
}
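# Example: normalize(c(1, 5, 10)) returns 0.0000000 0.4444444 1.0000000,
# i.e. the values are rescaled to the [0, 1] range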
set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints
Glass_norm<-normalize(Glass[1:9]) %>% mutate(Type=Glass$Type) # Note: this rescales with one global min and max for the whole data frame, not per column
glass_training<-Glass_norm[ind_glass==1,1:9] # Separating the training and test datasets
glass_test<-Glass_norm[ind_glass==2,1:9]
glass_training_labels<-Glass_norm[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass_norm[ind_glass==2,10]
glass_pred<-class::knn(glass_training, glass_test, glass_training_labels) # Performing the KNN classification on the test set
glass_result<-glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE TRUE
## 15 55
CrossTable(x = glass_test_labels, y=glass_pred)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 70
##
##
## | glass_pred
## glass_test_labels | 1 | 2 | 3 | 5 | 6 | 7 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 17 | 2 | 2 | 0 | 0 | 0 | 21 |
## | 12.033 | 4.033 | 1.344 | 1.800 | 0.600 | 2.700 | |
## | 0.810 | 0.095 | 0.095 | 0.000 | 0.000 | 0.000 | 0.300 |
## | 0.680 | 0.080 | 0.667 | 0.000 | 0.000 | 0.000 | |
## | 0.243 | 0.029 | 0.029 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 5 | 21 | 0 | 0 | 0 | 0 | 26 |
## | 1.978 | 14.778 | 1.114 | 2.229 | 0.743 | 3.343 | |
## | 0.192 | 0.808 | 0.000 | 0.000 | 0.000 | 0.000 | 0.371 |
## | 0.200 | 0.840 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.071 | 0.300 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 3 | 0 | 1 | 0 | 0 | 0 | 4 |
## | 1.729 | 1.429 | 4.005 | 0.343 | 0.114 | 0.514 | |
## | 0.750 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.057 |
## | 0.120 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | |
## | 0.043 | 0.000 | 0.014 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
## | 2.143 | 2.143 | 0.257 | 58.514 | 0.171 | 0.771 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.086 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.086 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 0 | 1 | 0 | 0 | 1 | 0 | 2 |
## | 0.714 | 0.114 | 0.086 | 0.171 | 15.557 | 0.257 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.500 | 0.000 | 0.029 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 0.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 0 | 1 | 0 | 0 | 1 | 9 | 11 |
## | 3.929 | 2.183 | 0.471 | 0.943 | 1.496 | 40.687 | |
## | 0.000 | 0.091 | 0.000 | 0.000 | 0.091 | 0.818 | 0.157 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 1.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 25 | 25 | 3 | 6 | 2 | 9 | 70 |
## | 0.357 | 0.357 | 0.043 | 0.086 | 0.029 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Conclusion: in this specific example, normalisation does not change anything at all. One contributing factor is that normalize() was applied to the data frame as a whole, so all nine columns were shifted and scaled by the same two constants; such a uniform rescaling leaves the ordering of the distances that KNN relies on, and therefore the predicted labels, unchanged.
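For comparison, the more conventional approach is to rescale each column separately. A minimal sketch of that (the object name Glass_norm_col is just illustrative, and its results are not reported here):

Glass_norm_col <- Glass[1:9] %>%
  lapply(normalize) %>%  # rescale each column to [0, 1] separately
  as.data.frame() %>%
  mutate(Type = Glass$Type)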
Now that we have thoroughly studied and applied one form of machine learning, it’s time to look into another important aspect of machine learning, a step that comes before the training even starts: pre-processing.