On this page, what was previously learned about KNN models will be put into practice using the “Glass” dataset. First, we’ll load the dataset and inspect it.
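The chunks on this page assume the following packages are already attached (an assumption on my part, since the library calls themselves are not shown here):

library(mlbench)   # provides the Glass dataset
library(dplyr)     # %>% pipe and mutate()
library(class)     # knn() classifier
library(gmodels)   # CrossTable()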
data(Glass)
Glass %>% head()
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
knitr::kable(summary(Glass))

| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :1.511 | Min. :10.73 | Min. :0.000 | Min. :0.290 | Min. :69.81 | Min. :0.0000 | Min. : 5.430 | Min. :0.000 | Min. :0.00000 | 1:70 | |
| 1st Qu.:1.517 | 1st Qu.:12.91 | 1st Qu.:2.115 | 1st Qu.:1.190 | 1st Qu.:72.28 | 1st Qu.:0.1225 | 1st Qu.: 8.240 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 2:76 | |
| Median :1.518 | Median :13.30 | Median :3.480 | Median :1.360 | Median :72.79 | Median :0.5550 | Median : 8.600 | Median :0.000 | Median :0.00000 | 3:17 | |
| Mean :1.518 | Mean :13.41 | Mean :2.685 | Mean :1.445 | Mean :72.65 | Mean :0.4971 | Mean : 8.957 | Mean :0.175 | Mean :0.05701 | 5:13 | |
| 3rd Qu.:1.519 | 3rd Qu.:13.82 | 3rd Qu.:3.600 | 3rd Qu.:1.630 | 3rd Qu.:73.09 | 3rd Qu.:0.6100 | 3rd Qu.: 9.172 | 3rd Qu.:0.000 | 3rd Qu.:0.10000 | 6: 9 | |
| Max. :1.534 | Max. :17.38 | Max. :4.490 | Max. :3.500 | Max. :75.41 | Max. :6.2100 | Max. :16.190 | Max. :3.150 | Max. :0.51000 | 7:29 |
We can immediately see that Glass consists of 10 columns: 9 measured variables and 1 column identifying the glass type. The values range from roughly 0.29 to 75.41, about a factor-100 difference. A little on the large side, but still not so extreme that normalisation is strictly necessary.
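The per-column ranges can also be checked directly, for instance with:

sapply(Glass[1:9], range) # minimum and maximum of each of the nine numeric columns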
We’ll use the same “sample” technique as before to split the data into two groups: a training set and a test set.
set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints
glass_training<-Glass[ind_glass==1,1:9] # Separating the training and test datasets
glass_test<-Glass[ind_glass==2,1:9]
glass_training_labels<-Glass[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass[ind_glass==2,10]
With the dataset split and the labels stored, we can perform the machine learning.
glass_pred<-class::knn(glass_training, glass_test, glass_training_labels) # Performing the KNN classification on the test set
glass_result<-glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE TRUE
## 15 55
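Since glass_result is a logical vector, the overall accuracy can also be expressed as a single number (note that class::knn() uses k = 1 by default, so each observation is classified by its single nearest neighbour):

mean(glass_result) # 55 of 70 correct, roughly 0.79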
CrossTable(x = glass_test_labels, y=glass_pred)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 70
##
##
## | glass_pred
## glass_test_labels | 1 | 2 | 3 | 5 | 6 | 7 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 17 | 2 | 2 | 0 | 0 | 0 | 21 |
## | 12.033 | 4.033 | 1.344 | 1.800 | 0.600 | 2.700 | |
## | 0.810 | 0.095 | 0.095 | 0.000 | 0.000 | 0.000 | 0.300 |
## | 0.680 | 0.080 | 0.667 | 0.000 | 0.000 | 0.000 | |
## | 0.243 | 0.029 | 0.029 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 5 | 21 | 0 | 0 | 0 | 0 | 26 |
## | 1.978 | 14.778 | 1.114 | 2.229 | 0.743 | 3.343 | |
## | 0.192 | 0.808 | 0.000 | 0.000 | 0.000 | 0.000 | 0.371 |
## | 0.200 | 0.840 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.071 | 0.300 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 3 | 0 | 1 | 0 | 0 | 0 | 4 |
## | 1.729 | 1.429 | 4.005 | 0.343 | 0.114 | 0.514 | |
## | 0.750 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.057 |
## | 0.120 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | |
## | 0.043 | 0.000 | 0.014 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
## | 2.143 | 2.143 | 0.257 | 58.514 | 0.171 | 0.771 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.086 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.086 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 0 | 1 | 0 | 0 | 1 | 0 | 2 |
## | 0.714 | 0.114 | 0.086 | 0.171 | 15.557 | 0.257 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.500 | 0.000 | 0.029 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 0.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 0 | 1 | 0 | 0 | 1 | 9 | 11 |
## | 3.929 | 2.183 | 0.471 | 0.943 | 1.496 | 40.687 | |
## | 0.000 | 0.091 | 0.000 | 0.000 | 0.091 | 0.818 | 0.157 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 1.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 25 | 25 | 3 | 6 | 2 | 9 | 70 |
## | 0.357 | 0.357 | 0.043 | 0.086 | 0.029 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Based on the table, we can conclude that the KNN model is less accurate for the Glass dataset than it was for the iris dataset. Of the 21 samples of glass type 1, only 17 were classified correctly; of the 26 samples of glass type 2, only 21 were classified correctly. This pattern continues for all glass types; only glass type 5 was classified perfectly.
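The per-class accuracy can be read off more compactly from the confusion matrix, for instance like this (the fractions simply restate the row totals shown above):

conf_matrix <- table(actual = glass_test_labels, predicted = glass_pred)
diag(conf_matrix) / rowSums(conf_matrix) # fraction of each glass type classified correctly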
Purely to sate my own curiosity, I’ve also performed the analysis using normalised data, to see what the impact of normalisation would be on this data.
normalize<-function(x){
num<-x-min(x)
denom<-max(x)-min(x)
return(num/denom)
}
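# Example: normalize(c(1, 5, 10)) returns 0.0000000 0.4444444 1.0000000,
# i.e. the values are rescaled to the [0, 1] range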
set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints
Glass_norm<-normalize(Glass[1:9]) %>% mutate(Type=Glass$Type) # Note: this rescales with one global min and max for the whole data frame, not per column
glass_training<-Glass_norm[ind_glass==1,1:9] # Separating the training and test datasets
glass_test<-Glass_norm[ind_glass==2,1:9]
glass_training_labels<-Glass_norm[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass_norm[ind_glass==2,10]
glass_pred<-class::knn(glass_training, glass_test, glass_training_labels) # Performing the KNN classification on the test set
glass_result<-glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE TRUE
## 15 55
CrossTable(x = glass_test_labels, y=glass_pred)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 70
##
##
## | glass_pred
## glass_test_labels | 1 | 2 | 3 | 5 | 6 | 7 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 17 | 2 | 2 | 0 | 0 | 0 | 21 |
## | 12.033 | 4.033 | 1.344 | 1.800 | 0.600 | 2.700 | |
## | 0.810 | 0.095 | 0.095 | 0.000 | 0.000 | 0.000 | 0.300 |
## | 0.680 | 0.080 | 0.667 | 0.000 | 0.000 | 0.000 | |
## | 0.243 | 0.029 | 0.029 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 5 | 21 | 0 | 0 | 0 | 0 | 26 |
## | 1.978 | 14.778 | 1.114 | 2.229 | 0.743 | 3.343 | |
## | 0.192 | 0.808 | 0.000 | 0.000 | 0.000 | 0.000 | 0.371 |
## | 0.200 | 0.840 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.071 | 0.300 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 3 | 0 | 1 | 0 | 0 | 0 | 4 |
## | 1.729 | 1.429 | 4.005 | 0.343 | 0.114 | 0.514 | |
## | 0.750 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.057 |
## | 0.120 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | |
## | 0.043 | 0.000 | 0.014 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
## | 2.143 | 2.143 | 0.257 | 58.514 | 0.171 | 0.771 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.086 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.086 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 0 | 1 | 0 | 0 | 1 | 0 | 2 |
## | 0.714 | 0.114 | 0.086 | 0.171 | 15.557 | 0.257 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.500 | 0.000 | 0.029 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 0.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 0 | 1 | 0 | 0 | 1 | 9 | 11 |
## | 3.929 | 2.183 | 0.471 | 0.943 | 1.496 | 40.687 | |
## | 0.000 | 0.091 | 0.000 | 0.000 | 0.091 | 0.818 | 0.157 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 1.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 25 | 25 | 3 | 6 | 2 | 9 | 70 |
## | 0.357 | 0.357 | 0.043 | 0.086 | 0.029 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Conclusion: in this specific example, normalisation does not change anything at all. One contributing factor is that normalize() was applied to the data frame as a whole, so all nine columns were shifted and scaled by the same two constants; such a uniform rescaling leaves the ordering of the distances that KNN relies on, and therefore the predicted labels, unchanged.
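For comparison, the more conventional approach is to rescale each column separately. A minimal sketch of that (the object name Glass_norm_col is just illustrative, and its results are not reported here):

Glass_norm_col <- Glass[1:9] %>%
  lapply(normalize) %>%  # rescale each column to [0, 1] separately
  as.data.frame() %>%
  mutate(Type = Glass$Type)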
Now that we have thoroughly studied and applied one form of machine learning, it’s time to look into another important aspect of machine learning, a step that comes before the training even starts: pre-processing.