On this page, what was previously learned about KNN models will be brought into practice using the “Glass” dataset. First, we’ll load the dataset and inspect it:
data(Glass)
Glass %>% head()
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
knitr::kable(summary(Glass))
| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|
| Min. :1.511 | Min. :10.73 | Min. :0.000 | Min. :0.290 | Min. :69.81 | Min. :0.0000 | Min. : 5.430 | Min. :0.000 | Min. :0.00000 | 1:70 |
| 1st Qu.:1.517 | 1st Qu.:12.91 | 1st Qu.:2.115 | 1st Qu.:1.190 | 1st Qu.:72.28 | 1st Qu.:0.1225 | 1st Qu.: 8.240 | 1st Qu.:0.000 | 1st Qu.:0.00000 | 2:76 |
| Median :1.518 | Median :13.30 | Median :3.480 | Median :1.360 | Median :72.79 | Median :0.5550 | Median : 8.600 | Median :0.000 | Median :0.00000 | 3:17 |
| Mean :1.518 | Mean :13.41 | Mean :2.685 | Mean :1.445 | Mean :72.65 | Mean :0.4971 | Mean : 8.957 | Mean :0.175 | Mean :0.05701 | 5:13 |
| 3rd Qu.:1.519 | 3rd Qu.:13.82 | 3rd Qu.:3.600 | 3rd Qu.:1.630 | 3rd Qu.:73.09 | 3rd Qu.:0.6100 | 3rd Qu.: 9.172 | 3rd Qu.:0.000 | 3rd Qu.:0.10000 | 6: 9 |
| Max. :1.534 | Max. :17.38 | Max. :4.490 | Max. :3.500 | Max. :75.41 | Max. :6.2100 | Max. :16.190 | Max. :3.150 | Max. :0.51000 | 7:29 |
We can immediately see that Glass consists of 10 columns: 9 measured variables and 1 identifying variable (Type). The values roughly span 0.29 to 75.41, about a factor 100 difference between the smallest and largest columns. A little on the large side, but still not high enough for normalisation to be necessary.
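For a quick check on that claim, the per-column spread can also be printed directly. A minimal sketch (assuming the Glass data frame is loaded as above, for instance via the mlbench package):

sapply(Glass[1:9], range) # minimum and maximum of every measured variable
sapply(Glass[1:9], function(x) diff(range(x))) # width of each column's range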
We’ll use the same “sample” technique as before to split the data into two groups: a training set and a test set.
set.seed(4321)
ind_glass <- sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) # Creating a random selection of datapoints

glass_training <- Glass[ind_glass==1, 1:9] # Separating the training and the testing datasets
glass_test <- Glass[ind_glass==2, 1:9]

glass_training_labels <- Glass[ind_glass==1, 10] # Storing the labels separately
glass_test_labels <- Glass[ind_glass==2, 10]
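As a small sanity check (not shown in the original run), the split proportions can be verified before continuing:

table(ind_glass) # number of rows assigned to each group
prop.table(table(ind_glass)) # should be roughly 0.67 / 0.33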
With the data split and the labels stored, we can perform the machine learning step.
glass_pred <- class::knn(glass_training, glass_test, glass_training_labels) # Performing the machine learning test

glass_result <- glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE TRUE
## 15 55
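The same result as a single accuracy figure (equivalent to the TRUE/FALSE counts above, 55 out of 70 correct, roughly 79%):

mean(glass_pred == glass_test_labels) # proportion of correct predictions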
CrossTable(x = glass_test_labels, y=glass_pred)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 70
##
##
## | glass_pred
## glass_test_labels | 1 | 2 | 3 | 5 | 6 | 7 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 17 | 2 | 2 | 0 | 0 | 0 | 21 |
## | 12.033 | 4.033 | 1.344 | 1.800 | 0.600 | 2.700 | |
## | 0.810 | 0.095 | 0.095 | 0.000 | 0.000 | 0.000 | 0.300 |
## | 0.680 | 0.080 | 0.667 | 0.000 | 0.000 | 0.000 | |
## | 0.243 | 0.029 | 0.029 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 5 | 21 | 0 | 0 | 0 | 0 | 26 |
## | 1.978 | 14.778 | 1.114 | 2.229 | 0.743 | 3.343 | |
## | 0.192 | 0.808 | 0.000 | 0.000 | 0.000 | 0.000 | 0.371 |
## | 0.200 | 0.840 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.071 | 0.300 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 3 | 0 | 1 | 0 | 0 | 0 | 4 |
## | 1.729 | 1.429 | 4.005 | 0.343 | 0.114 | 0.514 | |
## | 0.750 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.057 |
## | 0.120 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | |
## | 0.043 | 0.000 | 0.014 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
## | 2.143 | 2.143 | 0.257 | 58.514 | 0.171 | 0.771 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.086 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.086 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 0 | 1 | 0 | 0 | 1 | 0 | 2 |
## | 0.714 | 0.114 | 0.086 | 0.171 | 15.557 | 0.257 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.500 | 0.000 | 0.029 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 0.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 0 | 1 | 0 | 0 | 1 | 9 | 11 |
## | 3.929 | 2.183 | 0.471 | 0.943 | 1.496 | 40.687 | |
## | 0.000 | 0.091 | 0.000 | 0.000 | 0.091 | 0.818 | 0.157 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 1.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 25 | 25 | 3 | 6 | 2 | 9 | 70 |
## | 0.357 | 0.357 | 0.043 | 0.086 | 0.029 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Based on the table, we can conclude that the KNN model is less accurate on this Glass dataset than it was on the iris dataset. Of the 21 samples of glass type 1, only 17 were identified correctly, and of the 26 samples of type 2, only 21. This pattern continues for the other glass types; only type 5 was identified perfectly.
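Note that class::knn defaults to k = 1 neighbour, since no k was passed above. As a side sketch (not part of the original analysis, results not shown), one could check whether a different number of neighbours improves accuracy on this same split:

k_values <- c(1, 3, 5, 7, 9)
accuracies <- sapply(k_values, function(k) {
  pred <- class::knn(glass_training, glass_test, glass_training_labels, k = k)
  mean(pred == glass_test_labels) # proportion of correct predictions for this k
})
data.frame(k = k_values, accuracy = accuracies)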
Purely to sate my own curiosity, I’ve also performed the analysis using normalised data, to see what the impact of normalisation would be on this data.
normalize <- function(x){
  num <- x - min(x)
  denom <- max(x) - min(x)
  return(num/denom)
}

set.seed(4321)
ind_glass <- sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) # Creating a random selection of datapoints

Glass_norm <- normalize(Glass[1:9]) %>% mutate(Type=Glass$Type)

glass_training <- Glass_norm[ind_glass==1, 1:9] # Separating the training and the testing datasets
glass_test <- Glass_norm[ind_glass==2, 1:9]

glass_training_labels <- Glass_norm[ind_glass==1, 10] # Storing the labels separately
glass_test_labels <- Glass_norm[ind_glass==2, 10]

glass_pred <- class::knn(glass_training, glass_test, glass_training_labels) # Performing the machine learning test

glass_result <- glass_pred == glass_test_labels
table(glass_result)
## glass_result
## FALSE TRUE
## 15 55
CrossTable(x = glass_test_labels, y=glass_pred)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 70
##
##
## | glass_pred
## glass_test_labels | 1 | 2 | 3 | 5 | 6 | 7 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 17 | 2 | 2 | 0 | 0 | 0 | 21 |
## | 12.033 | 4.033 | 1.344 | 1.800 | 0.600 | 2.700 | |
## | 0.810 | 0.095 | 0.095 | 0.000 | 0.000 | 0.000 | 0.300 |
## | 0.680 | 0.080 | 0.667 | 0.000 | 0.000 | 0.000 | |
## | 0.243 | 0.029 | 0.029 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 5 | 21 | 0 | 0 | 0 | 0 | 26 |
## | 1.978 | 14.778 | 1.114 | 2.229 | 0.743 | 3.343 | |
## | 0.192 | 0.808 | 0.000 | 0.000 | 0.000 | 0.000 | 0.371 |
## | 0.200 | 0.840 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.071 | 0.300 | 0.000 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 3 | 0 | 1 | 0 | 0 | 0 | 4 |
## | 1.729 | 1.429 | 4.005 | 0.343 | 0.114 | 0.514 | |
## | 0.750 | 0.000 | 0.250 | 0.000 | 0.000 | 0.000 | 0.057 |
## | 0.120 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | |
## | 0.043 | 0.000 | 0.014 | 0.000 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
## | 2.143 | 2.143 | 0.257 | 58.514 | 0.171 | 0.771 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.086 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.086 | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 0 | 1 | 0 | 0 | 1 | 0 | 2 |
## | 0.714 | 0.114 | 0.086 | 0.171 | 15.557 | 0.257 | |
## | 0.000 | 0.500 | 0.000 | 0.000 | 0.500 | 0.000 | 0.029 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 0.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.000 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 0 | 1 | 0 | 0 | 1 | 9 | 11 |
## | 3.929 | 2.183 | 0.471 | 0.943 | 1.496 | 40.687 | |
## | 0.000 | 0.091 | 0.000 | 0.000 | 0.091 | 0.818 | 0.157 |
## | 0.000 | 0.040 | 0.000 | 0.000 | 0.500 | 1.000 | |
## | 0.000 | 0.014 | 0.000 | 0.000 | 0.014 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 25 | 25 | 3 | 6 | 2 | 9 | 70 |
## | 0.357 | 0.357 | 0.043 | 0.086 | 0.029 | 0.129 | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Conclusion: in this specific example, normalisation does not change the predictions at all. That is actually to be expected here: because normalize() is applied to Glass[1:9] as a whole, min() and max() are computed over all nine columns together, so every value is shifted and scaled by the same two constants. Such a uniform rescaling leaves the ordering of distances between points, and therefore the nearest neighbours, unchanged.
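To normalise each column on its own scale instead, the same function could be applied column-wise. A minimal sketch (not run here, so whether it would change the predictions remains to be checked):

Glass_norm_cols <- as.data.frame(lapply(Glass[1:9], normalize)) %>% mutate(Type = Glass$Type) # column-wise min-max scaling
summary(Glass_norm_cols$RI) # each numeric column now runs from 0 to 1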
Now that we’ve studied and applied one form of machine learning, it’s time to look into another important aspect of machine learning, a step that comes before the training even starts: preprocessing.