Variable Importance Measures in Tree-Based Classification Algorithms Applied to Decision Problems in Marketing
by Benjamin Stahl
Prof. Dr. Bernd Skiera
Machine learning algorithms such as decision trees and their ensemble counterpart, random forests, are gaining importance for binary classification tasks in many scientific and practical fields.
The growing popularity of random forests is driven primarily by their superior predictive performance in many applications and by their ease of use. Nevertheless, random forests are still rarely represented in the marketing literature, even though the few research papers that have applied the technique also demonstrated its superior performance over other machine learning algorithms and logistic regression. One possible explanation for this reluctance is that traditional approaches in marketing, such as RFM analysis or logistic regression, provide model outputs that make it easy to assess the influence of predictor variables on the prediction.
While decision trees generate simple if-then rules that are easy to interpret and understand, ensembles are usually considered black-box algorithms. However, techniques exist for ensemble models that estimate the most important predictor variables of a given model. These so-called variable importance measures can provide valuable insights into the black box, even though they say nothing about the nature of the relationship between the predictor variables and the predicted outcome. More specifically, variable importance measures do not indicate whether predictions of the target class are driven by rather large or rather small values of a predictor variable. One graphical way to overcome this limitation is the partial dependence plot, which allows the direction of a predictor variable's effect to be investigated. Partial dependence plots, however, cannot identify the most important predictor variables. Variable importance measures are therefore still needed to single out the most influential variables that are worth examining in more detail (Friedman, 2001, pp. 1220–1221; Greenwell, 2017, p. 421).
This provides the starting point for this thesis, which aims to enrich the marketing literature in several ways. First, a methodology named the classification improvement measure is developed; it consists of two metrics and combines the assessment of a predictor variable's importance in a random forest with the assessment of the direction of its effect. Moreover, even though most existing studies in the marketing literature examine random forests using churn data, little research has addressed response prediction for marketing campaigns with random forests. A further objective of this thesis is therefore to predict customers' response to a direct marketing campaign in banking using random forests with different hyperparameter combinations and to select a well-performing model. To demonstrate the superior predictive power of random forests, the models are also benchmarked against a logistic regression. The selected model is then investigated in more detail using the common variable importance measures to identify the most relevant predictor variables. Furthermore, a new contribution to the contemporary marketing literature is the application of partial dependence plots to explore the direction of variable effects in the model. Finally, the newly developed classification improvement measure is applied to the model, and its results are compared with those of the traditional methods.
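The model-selection and benchmarking step can be sketched as follows. Again, this is an illustrative Python example on synthetic data rather than the thesis's R analysis of the bank data; the hyperparameter grid and the AUC metric are assumptions for the sketch:

```python
# Hedged sketch of the benchmarking idea: fit random forests with a few
# hyperparameter combinations, keep the best by holdout AUC, and compare
# against a logistic regression baseline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1)

# Small illustrative grid over two common random-forest hyperparameters.
best_auc, best_rf = 0.0, None
for n_trees in (100, 300):
    for max_feat in ("sqrt", None):
        rf = RandomForestClassifier(n_estimators=n_trees,
                                    max_features=max_feat,
                                    random_state=1).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
        if auc > best_auc:
            best_auc, best_rf = auc, rf

# Logistic regression benchmark on the same holdout split.
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
lr_auc = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])
```

The best-performing forest would then be the model handed to the importance measures and partial dependence plots for closer inspection.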
The thesis is structured into seven chapters. Chapter 2 has a theoretical focus and outlines the concepts of decision trees and random forests to lay the foundation for the later applications. Chapter 3 presents the two most common variable importance measures for random forests, namely Gini importance and mean decrease in accuracy, and provides an overview of the main points of criticism discussed in the literature. Additionally, partial dependence plots and their extension, individual conditional expectation plots, are described as a graphical approach to illustrate how the input variables affect the outcome prediction. Chapter 4 emphasizes the importance of predictive modelling in the marketing context as well as its common use cases. It also summarizes the results of the most important academic papers that apply random forests to marketing-related questions. The novel classification improvement measure is outlined in detail in chapter 5 and is tested on a small simulated data set as well as on a publicly available data set. In chapter 6, random forests with different hyperparameter combinations are computed with the objective of finding a well-performing model for predicting a direct marketing campaign's success. For this analysis, anonymized customer data are provided by a German direct bank. Finally, chapter 7 provides a summary of the results.
The main part of the analysis is carried out using the statistical software R. The algorithm for calculating the classification improvement measure is implemented in SAS.