Pre-Processing, caret

For studying pre-processing, we’ll use the package “caret”. caret, short for Classification And REgression Training, is a package containing functions that help create machine-learning-based (or really, any form of predictive) models.

library(caret)
library(dplyr) # for the %>% pipe

names(getModelInfo()) %>% head(50)
##  [1] "ada"            "AdaBag"         "AdaBoost.M1"    "adaboost"       "amdai"          "ANFIS"          "avNNet"         "awnb"           "awtan"         
## [10] "bag"            "bagEarth"       "bagEarthGCV"    "bagFDA"         "bagFDAGCV"      "bam"            "bartMachine"    "bayesglm"       "binda"         
## [19] "blackboost"     "blasso"         "blassoAveraged" "bridge"         "brnn"           "BstLm"          "bstSm"          "bstTree"        "C5.0"          
## [28] "C5.0Cost"       "C5.0Rules"      "C5.0Tree"       "cforest"        "chaid"          "CSimca"         "ctree"          "ctree2"         "cubist"        
## [37] "dda"            "deepboost"      "DENFIS"         "dnn"            "dwdLinear"      "dwdPoly"        "dwdRadial"      "earth"          "elm"           
## [46] "enet"           "evtree"         "extraTrees"     "fda"            "FH.GBML"

Caret gives access to a lot of predictive models: shown above are only the first 50 of the over 200 models available in caret. For now we’ll continue to use a KNN model, though we’ll study another model later too.
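If you want to inspect one of these models more closely, `getModelInfo()` can also be filtered by name. A small sketch looking up the knn entry (`regex = FALSE` asks for an exact name match):

```r
library(caret)

# Exact-name lookup of the "knn" entry in caret's model list
knn_info <- getModelInfo("knn", regex = FALSE)[["knn"]]

knn_info$label      # the model's descriptive name
knn_info$parameters # the tuning parameters caret searches over (k for knn)
```

This is a handy way to see which tuning parameters `train` will optimise before you fit anything.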

First of all, we can use “createDataPartition” instead of “sample” to create a training and a test set more easily.

set.seed(1234)
index<-createDataPartition(iris$Species, p=0.75, list=FALSE)

iris_training2<-iris[index,]

iris_test2<-iris[-index,]
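Unlike a plain `sample`, `createDataPartition` draws a *stratified* split, so each species keeps the same share in the training and test sets. A quick check (repeating the split from above so the snippet runs on its own):

```r
library(caret)

# Same stratified 75/25 split as in the text
set.seed(1234)
index <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
iris_training2 <- iris[index, ]
iris_test2 <- iris[-index, ]

# Each species is split in the same proportion:
table(iris_training2$Species) # 38 of each species in training
table(iris_test2$Species)     # the remaining 12 of each in test
```

With a plain `sample` the per-class counts would vary from seed to seed; here they are fixed by the stratification.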

Now, with the training data separated, we can use the caret method of training a model and predicting with it.

#Train a KNN model on the four measurement columns, with species as the label
model_knn<-train(iris_training2[,1:4], iris_training2[,5], method='knn')

#Then, we use the model to make a prediction
prediction<-predict(object=model_knn, iris_test2[,1:4])

#We check which predictions are correct
pred<-prediction == iris_test2[,5]
table(pred)
## pred
## FALSE  TRUE 
##     2    34

And, because we’re using caret, we can create a confusion matrix: this makes it much easier to study the sensitivity and the specificity of our model.

#And we create a confusion matrix for the results
confusionMatrix(prediction,iris_test2[,5])
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         11         1
##   virginica       0          1        11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.8134, 0.9932)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.728e-14       
##                                           
##                   Kappa : 0.9167          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9167           0.9167
## Specificity                 1.0000            0.9583           0.9583
## Pos Pred Value              1.0000            0.9167           0.9167
## Neg Pred Value              1.0000            0.9583           0.9583
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3056           0.3056
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9375           0.9375

We can see that the model identifies setosa perfectly, but struggles a little with versicolor (92% sensitivity) and virginica (also 92% sensitivity).

What’s nice about caret training is that it has pre-processing methods built in. For the most basic examples of pre-processing, take centering and scaling.

  • Centering is a form of pre-processing where the mean of each variable is determined, and all values are then represented by their distance to that mean

  • Scaling is a form of pre-processing where all variables are brought to the same scale. This makes sure the machine learning algorithm does not decide that one variable is more important than another purely because it contains larger numbers.
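The two bullet points boil down to subtracting each column’s mean and dividing by its standard deviation. Base R’s `scale()` does exactly this, which gives a way to see what caret applies to each predictor under `preProcess = c("center", "scale")` (a sketch of the transformation itself, not of how caret stores it internally):

```r
# Centering and scaling by hand with base R's scale():
# subtract each column's mean, then divide by its standard deviation
x <- iris[, 1:4]
x_cs <- scale(x, center = TRUE, scale = TRUE)

round(colMeans(x_cs), 10) # all ~0 after centering
apply(x_cs, 2, sd)        # all exactly 1 after scaling
```

After this transformation a distance-based method such as KNN weighs all four measurements equally, instead of letting the centimetre-scale columns dominate.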

model_knn<-train(iris_training2[,1:4],iris_training2[,5],method='knn',preProcess = c("center", "scale"))

prediction<-predict(object=model_knn, iris_test2[,1:4], type='raw')

pred<-prediction == iris_test2[,5]
table(pred)
## pred
## FALSE  TRUE 
##     2    34
confusionMatrix(prediction, iris_test2[,5])
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         12         2
##   virginica       0          0        10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.8134, 0.9932)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.728e-14       
##                                           
##                   Kappa : 0.9167          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8333
## Specificity                 1.0000            0.9167           1.0000
## Pos Pred Value              1.0000            0.8571           1.0000
## Neg Pred Value              1.0000            1.0000           0.9231
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2778
## Detection Prevalence        0.3333            0.3889           0.2778
## Balanced Accuracy           1.0000            0.9583           0.9167

Centering and scaling the data leaves the overall accuracy unchanged here (still 94%), so it is not a perfect fix: it removed the versicolor mis-identification, but added a virginica mis-identification.
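caret also exposes this step as a standalone `preProcess()` object, which is useful when you want to learn the means and standard deviations on the training data only and then reuse them for new data. A minimal sketch, repeating the iris split from earlier so it runs on its own:

```r
library(caret)

set.seed(1234)
index <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
train_x <- iris[index, 1:4]
test_x  <- iris[-index, 1:4]

# Learn the centering/scaling parameters on the training data only...
pp <- preProcess(train_x, method = c("center", "scale"))

# ...then apply that same transformation to both sets
train_cs <- predict(pp, train_x)
test_cs  <- predict(pp, test_x)

round(colMeans(train_cs), 10) # training predictors now have mean ~0
```

Passing `preProcess = c("center", "scale")` to `train`, as above, does this same thing behind the scenes.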

To apply this newfound knowledge about pre-processing to another example, we’ll once again use the “Glass” dataset.

library(mlbench)
data(Glass)

set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints

glass_training<-Glass[ind_glass==1,1:9] # Separating the training and the testing datasets
glass_test<-Glass[ind_glass==2,1:9]

glass_training_labels<-Glass[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass[ind_glass==2,10]

model_knn_glass<-train(glass_training, glass_training_labels, method="knn", preProcess=c("center", "scale"))

prediction<-predict(object=model_knn_glass, glass_test, type="raw")

confusionMatrix(prediction, glass_test_labels)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 17  8  4  0  0  1
##          2  4 16  0  1  1  0
##          3  0  0  0  0  0  0
##          5  0  2  0  3  0  0
##          6  0  0  0  0  1  0
##          7  0  0  0  2  0 10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6714          
##                  95% CI : (0.5488, 0.7791)
##     No Information Rate : 0.3714          
##     P-Value [Acc > NIR] : 3.469e-07       
##                                           
##                   Kappa : 0.5444          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            0.8095   0.6154  0.00000  0.50000  0.50000   0.9091
## Specificity            0.7347   0.8636  1.00000  0.96875  1.00000   0.9661
## Pos Pred Value         0.5667   0.7273      NaN  0.60000  1.00000   0.8333
## Neg Pred Value         0.9000   0.7917  0.94286  0.95385  0.98551   0.9828
## Prevalence             0.3000   0.3714  0.05714  0.08571  0.02857   0.1571
## Detection Rate         0.2429   0.2286  0.00000  0.04286  0.01429   0.1429
## Detection Prevalence   0.4286   0.3143  0.00000  0.07143  0.01429   0.1714
## Balanced Accuracy      0.7721   0.7395  0.50000  0.73438  0.75000   0.9376

As you can see, pre-processing does not guarantee a better model. In fact, if we compare the confusion matrix of this model to the one from the original model built directly with the class::knn command (shown below), you’ll see that creating the model this way actually reduced its accuracy (67% versus 79%).

set.seed(4321)
ind_glass<-sample(2, nrow(Glass), replace=TRUE, prob=c(0.67,0.33)) #Creating a random selection of datapoints

Glass_norm<-normalize(Glass[1:9]) %>% mutate(Type=Glass$Type) # Normalising the predictors with the normalize() function from before

glass_training<-Glass_norm[ind_glass==1,1:9] # Separating the training and the testing datasets
glass_test<-Glass_norm[ind_glass==2,1:9]

glass_training_labels<-Glass_norm[ind_glass==1,10] # Storing the labels separately
glass_test_labels<-Glass_norm[ind_glass==2,10]

glass_pred<-class::knn(glass_training, glass_test, glass_training_labels) #Performing the machine learning test.

confusionMatrix(glass_test_labels, glass_pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 17  2  2  0  0  0
##          2  5 21  0  0  0  0
##          3  3  0  1  0  0  0
##          5  0  0  0  6  0  0
##          6  0  1  0  0  1  0
##          7  0  1  0  0  1  9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7857          
##                  95% CI : (0.6713, 0.8748)
##     No Information Rate : 0.3571          
##     P-Value [Acc > NIR] : 2.852e-13       
##                                           
##                   Kappa : 0.7062          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            0.6800   0.8400  0.33333  1.00000  0.50000   1.0000
## Specificity            0.9111   0.8889  0.95522  1.00000  0.98529   0.9672
## Pos Pred Value         0.8095   0.8077  0.25000  1.00000  0.50000   0.8182
## Neg Pred Value         0.8367   0.9091  0.96970  1.00000  0.98529   1.0000
## Prevalence             0.3571   0.3571  0.04286  0.08571  0.02857   0.1286
## Detection Rate         0.2429   0.3000  0.01429  0.08571  0.01429   0.1286
## Detection Prevalence   0.3000   0.3714  0.05714  0.08571  0.02857   0.1571
## Balanced Accuracy      0.7956   0.8644  0.64428  1.00000  0.74265   0.9836

Why this happens is something I’ve yet to figure out: perhaps the class::knn model already uses a form of built-in pre-processing. However, the KNN model was purely used as an introduction to other machine learning processes, so the question of why the KNN model loses accuracy under caret is one for another time. For now, we’ll continue to the next machine learning technique, one significant in the field of microbiology: randomForest.