R is one of the most popular open source programming languages for predictive analytics. One of its upsides is the abundance of modeling choices provided by more than 10,000 user-created packages on the Comprehensive R Archive Network (CRAN). On the downside, package-specific syntax (a much bigger problem in R than in, e.g., Python) impedes the adoption of new models. The caret package attempts to streamline the process of creating predictive models by providing a uniform interface to various training and prediction functions. Caret’s data preparation, feature selection and model tuning functionality facilitates building and evaluating predictive models. This blog post focuses on model tuning and selection and shows how to tackle common model building challenges with caret.
Between And Within-Model Selection
There is no gold standard: No model beats every other model in every situation. As data scientists, we have to justify our modelling choices. We select between different models by training them and comparing their performance. The same reasoning applies to parameter choices that are not determined by the optimization process: We select within all models of a class by comparing the performance of models fitted with different parameter values.
With caret’s train function we can quickly estimate several different models. As an illustration, we create a predictive model to determine the defaults of credit card holders, based on personal information and their previous payment history.[1]
To select a model we create a small training set, estimate different models for several resamples, compare their performance and deploy the best one.
Data Preprocessing
Caret provides simple functions for creating balanced splits of the data, preserving the overall distribution of the dependent variable. Stratified random sampling is advisable if one class has a disproportionately small frequency compared to others.
library(caret)
set.seed(11)
# reading data and preparation
customer_data <- read.csv("credit_default_prepared.csv", sep = ";", stringsAsFactors = TRUE)
# data partition
rows_train <- createDataPartition(y = customer_data$DEFAULT,
p = 0.3, list = FALSE)
training <- customer_data[rows_train, ]
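To see what "preserving the overall distribution" means in practice, the following minimal sketch applies createDataPartition() to a made-up, imbalanced outcome vector and compares the class shares before and after the split. The vector y_toy and its 90/10 class split are assumptions for illustration only; they are not the credit data.

```r
library(caret)

set.seed(11)
# hypothetical imbalanced outcome: 900 non-defaults, 100 defaults
y_toy <- factor(c(rep("no", 900), rep("yes", 100)))

# stratified 30% sample: rows are drawn within each class
idx_toy <- createDataPartition(y = y_toy, p = 0.3, list = FALSE)

# class shares in the full vector and in the sampled part
share_full  <- prop.table(table(y_toy))
share_train <- prop.table(table(y_toy[idx_toy]))
print(rbind(full = share_full, train = share_train))
```

The share of the rare class in the sample should stay close to its share in the full data, which a simple random sample of a small data set does not guarantee.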
A simple but powerful model for binary variables is logistic regression, which we will assess against random forests and boosting. Hence, we will do a between-model selection in which we pit logistic regression against two different ensemble models. Random forests estimate and aggregate several classification trees on bootstrapped subsamples while decorrelating individual trees by means of sampling the set of included predictors. The original boosting algorithm aggregates classification trees too, but estimates them in a sequential fashion:
It starts with equal weights for each observation. In every iteration, the weights of misclassified observations increase and the ones of correctly classified observations decrease. In the next iteration, the updated weights are utilized to concentrate on the correct prediction of previously misclassified observations. The final prediction is obtained by weighting each tree’s prediction based on model accuracy.
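The reweighting step described above can be sketched in a few lines of base R. The post does not name the exact boosting variant, so this sketch assumes the classic AdaBoost update with labels coded as -1/1; the labels and predictions are made up for illustration.

```r
# One AdaBoost-style reweighting step on toy data (illustrative values)
y_true <- c( 1,  1, -1, -1,  1)   # true labels
y_pred <- c( 1, -1, -1, -1, -1)   # predictions of the current tree

# start with equal weights for each observation
w <- rep(1 / length(y_true), length(y_true))

miss  <- y_pred != y_true                 # misclassified observations
err   <- sum(w[miss])                     # weighted error of this tree
alpha <- 0.5 * log((1 - err) / err)       # weight of this tree in the final vote

# increase weights of misclassified, decrease weights of correct observations
w_new <- w * exp(alpha * ifelse(miss, 1, -1))
w_new <- w_new / sum(w_new)               # renormalise so the weights sum to 1

print(round(w_new, 4))
```

The next tree is then fitted with w_new, so it concentrates on the observations the current tree got wrong, and alpha determines how much the current tree counts in the final prediction.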
Both ensemble methods include tuning parameters we have to decide upon (within-model selection). For the random forest model, we will only consider the number of predictor variables sampled as candidates at each split (mtry) as a tuning parameter and search through parameter values 4 to 14. Table 1 depicts the tuning parameters for boosting:
Expanding The Search Grid
By default, train uses a minimal search grid of three values for each tuning parameter. With base R’s expand.grid() function we create the Cartesian product of the candidate parameter values and pass it to the train function via the tuneGrid argument:
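A minimal sketch of such grids could look as follows. The mtry range 4 to 14 is taken from the text. The four column names of the boosting grid are the tuning parameters caret exposes for method "gbm", but the candidate values shown here are assumptions chosen for illustration, not the values from the original table.

```r
# random forest: one tuning parameter, values 4 to 14 as stated in the text
grid_rf <- expand.grid(mtry = 4:14)

# boosting (gbm): candidate values are illustrative assumptions
grid_bo <- expand.grid(n.trees           = c(100, 300, 500),
                       interaction.depth = c(1, 3),
                       shrinkage         = c(0.01, 0.1),
                       n.minobsinnode    = 10)

nrow(grid_rf)  # number of random forest candidates
nrow(grid_bo)  # number of boosting candidates (product of the vector lengths)
```

Each row of a grid is one candidate model, and every candidate is refitted once per resample, so the grid size multiplies directly into computing time.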
One approach for model tuning is to fit models with different tuning parameter values to many resampled versions of the training set, estimate their performance and determine the final parameters based on a performance metric.
Choices for the resampling technique and performance measure can be set with the trainControl() function. For this illustration, we do a 5-fold cross-validation and measure model performance by the area under the ROC curve (the default setting in caret is the bootstrap and overall accuracy):
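A control object matching this description might be set up as in the sketch below. The object name ctrl matches the later train calls. Using classProbs = TRUE together with twoClassSummary is the usual way to make the "ROC" metric available in caret; it assumes that the outcome is a two-level factor whose levels are valid R names.

```r
library(caret)

# 5-fold cross-validation, evaluated by the area under the ROC curve
ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,                 # needed to compute ROC
                     summaryFunction = twoClassSummary) # reports ROC, Sens, Spec
```

Without these two settings, train would fall back to accuracy-based summaries and would not find a metric called "ROC".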
The code below shows how to train several different models with caret, substituting only the specific method and the previously specified parameter grid. Caret selects the optimal parameter choice and provides a summary with “print(object_name)”, where “object_name” is the name of the object returned by the train function:
# random forest
set.seed(11)
fit_rf <- train(DEFAULT ~ .,
data = training,
method = "rf",
ntree = 5,
metric = "ROC",
trControl = ctrl,
tuneGrid = grid_rf)
# boosting
set.seed(11)
fit_bo <- train(DEFAULT ~ .,
data = training,
method = "gbm",
metric = "ROC",
verbose = FALSE,
trControl = ctrl,
tuneGrid = grid_bo)
# performance plot random forest
plot(fit_rf, main = "Random forest ROC for several mtry values")
Plotting the train object displays a performance profile for the tuning parameters. As an example, figure 1 shows the area under the ROC curve of the random forest as a function of the number of randomly selected predictors. Choosing six predictors randomly at every split yields the model with the highest ROC value.
Building a good logistic regression model requires great care (e.g. with respect to feature and variable selection). Here we take an arbitrary shortcut by selecting the union of the five most important variables from our boosting and random forest models as predictors. In doing so, we introduce caret’s helper function varImp(), which provides the ranked importance of the independent variables:
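The selection step can be sketched as follows. Because the credit data is not available here, the sketch uses simulated data from caret’s twoClassSim() and two quick stand-in models (a logistic regression and a single tree via rpart) instead of the boosting and random forest fits. The helper top_vars() and all object names are hypothetical.

```r
library(caret)

set.seed(11)
toy <- twoClassSim(300)   # simulated two-class data, outcome column "Class"
ctrl_toy <- trainControl(method = "cv", number = 5)

# two stand-in models (assumption: replace with fit_rf and fit_bo in practice)
fit_a <- train(Class ~ ., data = toy, method = "glm",   trControl = ctrl_toy)
fit_b <- train(Class ~ ., data = toy, method = "rpart", trControl = ctrl_toy)

# hypothetical helper: names of the n most important predictors of a fit
top_vars <- function(fit, n = 5) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][seq_len(n)]
}

# union of the top five variables of both models
predictors_lr <- union(top_vars(fit_a), top_vars(fit_b))
print(predictors_lr)
```

The resulting character vector can be turned into a model formula with reformulate(predictors_lr, response = "Class") and passed to train with method = "glm". Note that for some model types varImp() returns one importance column per class rather than a single "Overall" column, in which case the helper would need adjusting.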
We have just estimated over 120 models and can now visually compare the distribution of their performance. The resamples() function lets us collect, analyse and visualise the resampling results from our data set. Figure 2 summarises the performance distribution for the optimal models over the different folds. Boosting and random forest outperform the ad-hoc logistic regression model. Both ensemble methods perform comparably, but the variance of the boosting algorithm appears to be smaller. For our case study, the average boosting model (ROC: 0.7807) slightly outperforms the average random forest model (ROC: 0.7802).
# Model comparison
comp <- resamples(list(Logistic = fit_lr, RandomForest = fit_rf, Boosting = fit_bo))
summary(comp)
# plot performance comparison
bwplot(comp, metric = "ROC", main = "Model Comparison: RF vs Boosting vs LR")
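For readers without access to the credit data, the comparison can be reproduced in miniature on simulated data. The sketch below is a stand-in under stated assumptions: twoClassSim() generates the data, and a logistic regression and a single tree replace the three models above. Setting the same seed before each train call makes both models use identical folds, which is what allows their fold-wise results to be compared.

```r
library(caret)

set.seed(11)
toy <- twoClassSim(300)   # simulated two-class data, outcome column "Class"
ctrl_toy <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

# same seed before each call, so both models are evaluated on the same folds
set.seed(11)
fit_glm  <- train(Class ~ ., data = toy, method = "glm",
                  metric = "ROC", trControl = ctrl_toy)
set.seed(11)
fit_tree <- train(Class ~ ., data = toy, method = "rpart",
                  metric = "ROC", trControl = ctrl_toy)

comp_toy <- resamples(list(GLM = fit_glm, Tree = fit_tree))

# mean ROC per model across the five folds
roc_cols <- grep("~ROC$", names(comp_toy$values), value = TRUE)
mean_roc <- colMeans(comp_toy$values[, roc_cols])
print(mean_roc)
```

Differences as small as the one between boosting and random forest reported above should be read against the spread across folds, which bwplot() shows directly.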
A Cautionary Note On Utilising Caret
The strongest argument for using caret is that it streamlines the model building process, so you can focus on more important modelling decisions: What kind of models could work for your problem at hand? What is the right performance measure? What is the right choice for a parameter space? Caret simplifies model building and selection significantly, but you still need to think:
Not just choose some models and parameter search spaces to plug into caret’s functions, but the right ones.
Even more importantly, get the feature engineering right, and tailor it to each of the models you want to use.
[1] The data source is a preprocessed data set from Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480. (http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). Contact XYC for the prepared version of the data set.