Comparing different ML models

HR Analytics: Machine Learning to predict Attrition Rate

May 20, 2022

Abstracto

Throughout this analysis, I analyzed the IBM Employee Attrition dataset to explore the main causes of dropout in a company. First, through an exploratory analysis, I made sure the data was clean.

EDA

At this stage, I realized that the data had an imbalance problem between the classes of the dependent variable, to fix this problem I have expanded the sample of the lower class.

Oversampling Technique

Once the data was ready, I started modelling the decision tree, bagging, random forest, gradient boosting machine (GBM) and extreme gradient boosting machine (XGBM). For each technique, my workflow has been as follows:

Hyperparameters tuning of each model, with the train function of the caret package, using a loop when the function does not allow control of some parameters and using parallel computing over three processors cores.

library(doParallel)
library(tictoc)
registerDoParallel(makeCluster(3) -> cpu) 

nodesize_ <- c()
sampsize_ <- c()
Accuracy <- c()
Kappa <- c()
Auc <- c()

tic()
for (nodesize in c(20,40,60,80,100)) {
    for (sampsize in c(200,500,800,1200,1570)) {
bg <- train(data=upTrain,
                factor(attrition)~.,
                method="rf", trControl= control,
                #fijar mtry for bagging
                tuneGrid= expand.grid(mtry=c(51)), 
                ntree = 5000, 
                sampsize = sampsize, 
                nodesize = nodesize,
                #muestras con reemplazamiento
                replace = TRUE, 
                linout = FALSE) 

confusionMatrix <- confusionMatrix(
    bg$pred$pred, bg$pred$obs)

roc <- roc(response = bg$pred$obs,
                     predictor = bg$pred$Yes)

Acc_i <- confusionMatrix$overall[1]
Accuracy <- append(Accuracy, Acc_i)

K_i <- confusionMatrix$overall[2]
Kappa <- append(Kappa, K_i)

Auc_i <- roc$auc
Auc <- append(Auc, Auc_i)

nodesize_ <- append(nodesize_, nodesize)
sampsize_ <- append(sampsize_, sampsize)

dput("---------------")
dput(paste0("With nodesize= ", nodesize))
dput(paste0("With sampsize= ", sampsize))
dput(paste0("Accuracy: ", 
     round(confusionMatrix$overall[1], 3)))
print(roc$auc)
    }
}
toc()
stopCluster(cpu)

# Aggregate Metrics ----
bagging_results = cbind(
    data.frame(Accuracy, Kappa, Auc, 
               nodesize = nodesize_, sampsize=sampsize_))

#save(bagging_results, file="bagging_results.RData")

# Clear cache ----
rm(Acc_i, K_i, Auc_i, nodesize, roc, sampsize)
rm(Accuracy, Auc, Kappa, nodesize_, sampsize_)

Understand the behavior of the adjustment for the different parameters tuned through graphs.
- Also for bagging and random forest I have visualized the error of out-of-bag, to understand the walking as the interactions of the trees increased. Through this chart I have been able to reduce the complexity of my final model.

Some visualizations from the project

After choosing the best parameters, I have adjusted the final model, visualizing the confusion matrix and the most influential variables for each model.

Variable Importance

Once the tree techniques have been adjusted, I have again adjusted the models with repeated cross validation this time, to compare the respective failure rate and the area below the ROC curve, through a box and mustache graph including as a reference model a logistic regression.

Once I found the winning model, I decided to create a stacked bar graph to get the most influential variables according to the different models. From these, I have created a new set training/test, which was then used to implement further ML techniques:

Support Vector Machines, con kernel:
- Lineal
- Polynomial
- Radial
Bagging del SVM Lineal
Boosting
Stacking de:
- Gradient Boosting Machine
- SVM Radial
- XGBoost

Comparing models

Alessio Crisafulli Carpani

MSc Student in Statistical Sciences

We need the power of data and machine learning to tackle today’s needs efficiently