To study pre-processing, we'll use the package called "caret". caret, short for Classification And REgression Training, is a package containing functions that help create machine learning-based (or really, any form of predictive) models.
library(caret)
names(getModelInfo()) %>% head(50)
## [1] "ada" "AdaBag" "AdaBoost.M1" "adaboost" "amdai" "ANFIS" "avNNet" "awnb" "awtan"
## [10] "bag" "bagEarth" "bagEarthGCV" "bagFDA" "bagFDAGCV" "bam" "bartMachine" "bayesglm" "binda"
## [19] "blackboost" "blasso" "blassoAveraged" "bridge" "brnn" "BstLm" "bstSm" "bstTree" "C5.0"
## [28] "C5.0Cost" "C5.0Rules" "C5.0Tree" "cforest" "chaid" "CSimca" "ctree" "ctree2" "cubist"
## [37] "dda" "deepboost" "DENFIS" "dnn" "dwdLinear" "dwdPoly" "dwdRadial" "earth" "elm"
## [46] "enet" "evtree" "extraTrees" "fda" "FH.GBML"
caret contains a lot of predictive models: shown above are only the first 50 of the over 200 models available in caret. For now we'll continue to use a KNN model, though we'll study another model later too.
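If you want to check which tuning parameters caret exposes for a given model before training it, the modelLookup() function gives a quick summary. A small illustrative check, here for knn:

# List the tuning parameter(s) caret uses for the knn model
modelLookup("knn")

For knn this reports a single tuning parameter, k: the number of neighbours.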
First of all, we can use "createDataPartition" instead of "sample" to create a training and a test group more easily. Unlike a plain random sample, createDataPartition samples within each class, so the class proportions are preserved in both groups.
set.seed(1234)
index <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
iris_training2 <- iris[index, ]
iris_test2 <- iris[-index, ]
Now, with the training and test data separated, we can use caret's method of training a model and predicting with it.
model_knn <- train(iris_training2[, 1:4], iris_training2[, 5], method = 'knn')

# Then, we use the model to make a prediction
prediction <- predict(object = model_knn, iris_test2[, 1:4])

# We check whether the predictions are correct
pred <- prediction == iris_test2[, 5]
table(pred)
## pred
## FALSE TRUE
## 2 34
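Since pred is a logical vector, the proportion of correct predictions can be computed directly from it; a quick illustrative line:

# Fraction of test flowers predicted correctly (34 out of 36, roughly 94%)
mean(pred)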
And, because we're using caret, we can create a confusion matrix: this allows us to study the sensitivity and the specificity of our model much more easily.
#And we create a confusion matrix for the results
confusionMatrix(prediction,iris_test2[,5])
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 11 1
## virginica 0 1 11
##
## Overall Statistics
##
## Accuracy : 0.9444
## 95% CI : (0.8134, 0.9932)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.728e-14
##
## Kappa : 0.9167
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9167 0.9167
## Specificity 1.0000 0.9583 0.9583
## Pos Pred Value 1.0000 0.9167 0.9167
## Neg Pred Value 1.0000 0.9583 0.9583
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3056 0.3056
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9375 0.9375
We can see that the model identifies setosa perfectly, but struggles a little with versicolor (92% sensitivity) and virginica (92% sensitivity).
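These per-class statistics can also be pulled out programmatically. As a small sketch (cm is just a name chosen here), the object returned by confusionMatrix() stores them in its byClass component:

# Store the confusion matrix object and extract the per-class sensitivity
cm <- confusionMatrix(prediction, iris_test2[, 5])
cm$byClass[, "Sensitivity"]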
What's nice about training with caret is that it comes with built-in pre-processing methods. As the most basic examples of pre-processing, take centering and scaling.
Centering is a form of pre-processing where the mean of each variable is determined, and every value is then represented by its distance from that mean.
Scaling is a form of pre-processing where all variables are brought to the same scale. This makes sure the machine learning algorithm does not decide that one variable is more important than another purely because it has larger numbers.
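As a minimal sketch of what these two steps do, we can use caret's standalone preProcess() function (the object names here are just illustrative):

# Build a pre-processing recipe from the training data only
pp <- preProcess(iris_training2[, 1:4], method = c("center", "scale"))
iris_scaled <- predict(pp, iris_training2[, 1:4])

# For a single variable, this is equivalent to:
x <- iris_training2$Sepal.Length
x_scaled <- (x - mean(x)) / sd(x)

When train() is given the preProcess argument, it does this internally: the centering and scaling parameters are estimated on the training data and applied again to new data at prediction time.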
model_knn <- train(iris_training2[, 1:4], iris_training2[, 5], method = 'knn', preProcess = c("center", "scale"))

prediction <- predict(object = model_knn, iris_test2[, 1:4], type = 'raw')

pred <- prediction == iris_test2[, 5]
table(pred)
## pred
## FALSE TRUE
## 2 34
confusionMatrix(prediction, iris_test2[,5])
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 12 2
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 0.9444
## 95% CI : (0.8134, 0.9932)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.728e-14
##
## Kappa : 0.9167
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8333
## Specificity 1.0000 0.9167 1.0000
## Pos Pred Value 1.0000 0.8571 1.0000
## Neg Pred Value 1.0000 1.0000 0.9231
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.2778
## Detection Prevalence 0.3333 0.3889 0.2778
## Balanced Accuracy 1.0000 0.9583 0.9167
Centering and scaling the data changes which flowers are misclassified, but the overall accuracy stays at 94%, so the system is still not perfect: it removed the versicolor mis-identification, but added an extra virginica mis-identification.
To apply this newfound knowledge about pre-processing to another example, we'll once again use the Glass dataset.
library(mlbench)
data(Glass)
set.seed(4321)
ind_glass <- sample(2, nrow(Glass), replace = TRUE, prob = c(0.67, 0.33)) # Creating a random selection of datapoints

glass_training <- Glass[ind_glass == 1, 1:9] # Separating the training and the testing datasets
glass_test <- Glass[ind_glass == 2, 1:9]

glass_training_labels <- Glass[ind_glass == 1, 10] # Storing the labels separately
glass_test_labels <- Glass[ind_glass == 2, 10]

model_knn_glass <- train(glass_training, glass_training_labels, method = "knn", preProcess = c("center", "scale"))

prediction <- predict(object = model_knn_glass, glass_test, type = "raw")
confusionMatrix(prediction, glass_test_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 17 8 4 0 0 1
## 2 4 16 0 1 1 0
## 3 0 0 0 0 0 0
## 5 0 2 0 3 0 0
## 6 0 0 0 0 1 0
## 7 0 0 0 2 0 10
##
## Overall Statistics
##
## Accuracy : 0.6714
## 95% CI : (0.5488, 0.7791)
## No Information Rate : 0.3714
## P-Value [Acc > NIR] : 3.469e-07
##
## Kappa : 0.5444
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.8095 0.6154 0.00000 0.50000 0.50000 0.9091
## Specificity 0.7347 0.8636 1.00000 0.96875 1.00000 0.9661
## Pos Pred Value 0.5667 0.7273 NaN 0.60000 1.00000 0.8333
## Neg Pred Value 0.9000 0.7917 0.94286 0.95385 0.98551 0.9828
## Prevalence 0.3000 0.3714 0.05714 0.08571 0.02857 0.1571
## Detection Rate 0.2429 0.2286 0.00000 0.04286 0.01429 0.1429
## Detection Prevalence 0.4286 0.3143 0.00000 0.07143 0.01429 0.1714
## Balanced Accuracy 0.7721 0.7395 0.50000 0.73438 0.75000 0.9376
As you can see, pre-processing does not dramatically improve this model. In fact, if we compare the confusion matrix of this model to that of the original model set up with the class::knn command (shown below), you'll see that creating the model this way actually reduced its accuracy (67% versus 79%).
set.seed(4321)
ind_glass <- sample(2, nrow(Glass), replace = TRUE, prob = c(0.67, 0.33)) # Creating a random selection of datapoints

Glass_norm <- normalize(Glass[1:9]) %>% mutate(Type = Glass$Type)

glass_training <- Glass_norm[ind_glass == 1, 1:9] # Separating the training and the testing datasets
glass_test <- Glass_norm[ind_glass == 2, 1:9]

glass_training_labels <- Glass_norm[ind_glass == 1, 10] # Storing the labels separately
glass_test_labels <- Glass_norm[ind_glass == 2, 10]

glass_pred <- class::knn(glass_training, glass_test, glass_training_labels) # Performing the machine learning test
confusionMatrix(glass_test_labels, glass_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 17 2 2 0 0 0
## 2 5 21 0 0 0 0
## 3 3 0 1 0 0 0
## 5 0 0 0 6 0 0
## 6 0 1 0 0 1 0
## 7 0 1 0 0 1 9
##
## Overall Statistics
##
## Accuracy : 0.7857
## 95% CI : (0.6713, 0.8748)
## No Information Rate : 0.3571
## P-Value [Acc > NIR] : 2.852e-13
##
## Kappa : 0.7062
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.6800 0.8400 0.33333 1.00000 0.50000 1.0000
## Specificity 0.9111 0.8889 0.95522 1.00000 0.98529 0.9672
## Pos Pred Value 0.8095 0.8077 0.25000 1.00000 0.50000 0.8182
## Neg Pred Value 0.8367 0.9091 0.96970 1.00000 0.98529 1.0000
## Prevalence 0.3571 0.3571 0.04286 0.08571 0.02857 0.1286
## Detection Rate 0.2429 0.3000 0.01429 0.08571 0.01429 0.1286
## Detection Prevalence 0.3000 0.3714 0.05714 0.08571 0.02857 0.1571
## Balanced Accuracy 0.7956 0.8644 0.64428 1.00000 0.74265 0.9836
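To make the comparison concrete, we could extract the overall accuracies directly instead of reading them off the printed output. A minimal sketch, assuming we save the two confusionMatrix objects under the hypothetical names cm_caret and cm_class rather than printing them:

# Save the two confusion matrices instead of printing them
cm_caret <- confusionMatrix(prediction, glass_test_labels) # caret model with center/scale
cm_class <- confusionMatrix(glass_test_labels, glass_pred) # original class::knn model

# Compare the overall accuracies side by side
c(caret = cm_caret$overall["Accuracy"], class_knn = cm_class$overall["Accuracy"])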
Why this happens is something I've yet to figure out: perhaps the class::knn model already uses a form of built-in pre-processing. However, since the KNN model was used purely as an introduction to other machine learning processes, the question of why it loses accuracy under caret is one for another time. For now, we'll continue to the next machine learning technique, one significant in the field of microbiology: randomForest.