Churn Prediction Project

Data description

The dataset was released by CrowdAnalytix as part of its 2012 churn prediction challenge. It contains 3333 customer entries with 20 features each, from a telecom company whose real name was anonymized. We obtained a copy from Kaggle that was already split into an 80% training set and a 20% test set.

library(dplyr)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(scales)
library(performanceEstimation)
library(class)
library(PRROC)
library(caret)
library(fastDummies)
library(randomForest)
library(doParallel)
source("../functions.R") # Auxiliary functions script
train <- read.csv("../data/churn-bigml-80.csv")
test <- read.csv("../data/churn-bigml-20.csv")
str(train)
## 'data.frame':    2666 obs. of  20 variables:
##  $ State                 : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 50 40 ...
##  $ Account.length        : int  128 107 137 84 75 118 121 147 141 74 ...
##  $ Area.code             : int  415 415 415 408 415 510 510 415 415 415 ...
##  $ International.plan    : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 2 1 2 2 1 ...
##  $ Voice.mail.plan       : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 2 1 2 1 ...
##  $ Number.vmail.messages : int  25 26 0 0 0 0 24 0 37 0 ...
##  $ Total.day.minutes     : num  265 162 243 299 167 ...
##  $ Total.day.calls       : int  110 123 114 71 113 98 88 79 84 127 ...
##  $ Total.day.charge      : num  45.1 27.5 41.4 50.9 28.3 ...
##  $ Total.eve.minutes     : num  197.4 195.5 121.2 61.9 148.3 ...
##  $ Total.eve.calls       : int  99 103 110 88 122 101 108 94 111 148 ...
##  $ Total.eve.charge      : num  16.78 16.62 10.3 5.26 12.61 ...
##  $ Total.night.minutes   : num  245 254 163 197 187 ...
##  $ Total.night.calls     : int  91 103 104 89 121 118 118 96 97 94 ...
##  $ Total.night.charge    : num  11.01 11.45 7.32 8.86 8.41 ...
##  $ Total.intl.minutes    : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 11.2 9.1 ...
##  $ Total.intl.calls      : int  3 3 5 7 3 6 7 6 5 5 ...
##  $ Total.intl.charge     : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 3.02 2.46 ...
##  $ Customer.service.calls: int  1 1 0 2 3 0 3 0 0 0 ...
##  $ Churn                 : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...

The column names are self-explanatory: eve stands for evening, intl for international and vmail for voice mail. We'll transform the International.plan, Voice.mail.plan and Churn columns into binary variables, and map the State column into regions. No further manipulation is needed since the data is already well treated and there is no missing data.

train <- train %>% mutate(Churn=ifelse(Churn=="True",1,0),
                          International.plan=ifelse(International.plan=="Yes",1,0),
                          Voice.mail.plan=ifelse(Voice.mail.plan=="Yes",1,0))

test <- test %>% mutate(Churn=ifelse(Churn=="True",1,0),
                        International.plan=ifelse(International.plan=="Yes",1,0),
                        Voice.mail.plan=ifelse(Voice.mail.plan=="Yes",1,0))

data(state) # load built-in state dataset
regions <- data.frame(cbind(state.abb,as.character(state.region)))
names(regions) <- c("State","Region")

# state.abb has no DC entry, so map DC to neighboring VA before joining regions
train <- train %>% mutate(State=ifelse(State=="DC","VA",as.character(State))) %>%
  left_join(by="State",y=regions) %>% select(-State)
test <- test %>% mutate(State=ifelse(State=="DC","VA",as.character(State))) %>%
  left_join(by="State",y=regions) %>% select(-State)

sum(is.na(train))+sum(is.na(test))
## [1] 0

Visualizations

First of all let’s see the proportion of churners in our dataset.

train_churn <- c(sum(train$Churn)/nrow(train),1-sum(train$Churn)/nrow(train))
test_churn <- c(sum(test$Churn)/nrow(test),1-sum(test$Churn)/nrow(test))
colors <- c('#FFEB6E','#32FF6E')
par(mfrow=c(1,2))
pie(train_churn,label = round(100*train_churn,1),main="Train",col = colors)
pie(test_churn,label = round(100*test_churn,1),main="Test",col = colors)
legend("bottomleft", c("% Churners","% Non-churners"), cex = 0.8,fill = colors)

Both splits have around 14.4% churners. This class imbalance can bias the classifiers, so we may need to address it with data-level solutions such as resampling techniques.

Let's analyse the correlation between the continuous variables:

# Reference: http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
corr <- round(cor(train[,-c(3,4,19,20)]),2)
corr[lower.tri(corr)] <- NA
corr <- melt(corr,na.rm=T)
ggplot(data=corr,aes(x=Var2,y=Var1,fill = value)) + 
  geom_tile(color="white") + 
  scale_fill_gradient2(mid="#fddbc7",high="red",low="white",
                       space="Lab",name = "Correlation\n") + 
  labs(x="",y="") + theme_minimal() + 
  theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust = 1))+
  coord_fixed() + 
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 2.1)

As we can see above, the vast majority of the features are uncorrelated, but pairs like total minutes/total charge are perfectly correlated, which makes sense if users are charged by the minute. We chose to remove the charge variables to avoid multicollinearity.

train <- train %>% select(-Total.day.charge,-Total.eve.charge,
                          -Total.night.charge,-Total.intl.charge)
test <- test %>% select(-Total.day.charge,-Total.eve.charge,
                        -Total.night.charge,-Total.intl.charge)
# write.csv(train,"../data/train_cleaned.csv",row.names=FALSE)
# write.csv(test,"../data/test_cleaned.csv",row.names=FALSE)
names(train)
##  [1] "Account.length"         "Area.code"              "International.plan"    
##  [4] "Voice.mail.plan"        "Number.vmail.messages"  "Total.day.minutes"     
##  [7] "Total.day.calls"        "Total.eve.minutes"      "Total.eve.calls"       
## [10] "Total.night.minutes"    "Total.night.calls"      "Total.intl.minutes"    
## [13] "Total.intl.calls"       "Customer.service.calls" "Churn"                 
## [16] "Region"

Below we can see that churners make more calls to customer service than non-churners.

plot_var("Customer.service.calls",nbins=10)

Although the Northeast and South regions have more churners in absolute numbers, the churn percentage in every region is around 14%, the same as in the train set as a whole.

train %>% group_by(Region) %>% 
  summarize(churn_perc=sum(Churn/n())) %>%
  ggplot(aes(x=Region,y=churn_perc)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Churn % by Region")

Among churners, fewer customers have a voice mail plan (~16.75%) compared with the non-churners subscribed to this plan (~29.32%).

train  %>% mutate(Churn=as.factor(Churn)) %>% 
  group_by(Churn) %>% 
  summarise(has_vmail_plan=sum(Voice.mail.plan)/n()) %>%  ggplot() + geom_bar(aes(x=Churn,y=has_vmail_plan,fill=Churn),stat="identity") +
  scale_y_continuous(labels = scales::percent) + 
  labs(title="% of customers with voicemail plan",y="") + theme_minimal()

On the other hand, there are many more churners with an international plan than non-churners. A possible explanation is that international plan subscribers are finding better offers for this service from competitors.

train  %>% mutate(Churn=as.factor(Churn)) %>% 
  group_by(Churn) %>% 
  summarise(has_intl_plan=sum(International.plan)/n()) %>%  ggplot() + geom_bar(aes(x=Churn,y=has_intl_plan,fill=Churn),stat="identity") +
  scale_y_continuous(labels = scales::percent) + 
  labs(title="% of customers with International plan",y="") + theme_minimal()

plot_var("Total.day.minutes")

Evaluation Metrics

There are many metrics to evaluate classification algorithms. We list the ones we'll use, but in the churn prediction problem we'll have to pay special attention to specific ones. (TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative, N = total number of observations.)

  • Accuracy: (TP+TN)/N
  • Recall: TP/(TP+FN)
  • AUC: Area under ROC curve
  • Lift

If the model yields probabilities, we sort the test predictions by these probabilities in descending order and define the lift as the ratio between the percentage of positive entries among the top 10% of rows (\(\beta_{10\%}\)) and the percentage of churners (positives) in the entire test set (\(\beta_0\)) (Zhu, Baesens, and vanden Broucke 2017).

\[\text{top decile lift} = \dfrac{\beta_{10\%}}{\beta_0}\] This is a widely used performance metric for this kind of problem. A top-decile lift of 2 means the model places twice as many churners in the top 10% group as a random classifier would. We'll devote special attention to recall and top-decile lift because of the imbalance problem: a model that predicts 0 for every test observation would be ~85% accurate, but its recall would be zero and its lift undefined.
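As a concrete sketch, top-decile lift can be computed directly from scores and labels. The project's own lift() helper lives in functions.R; the function below is an illustrative stand-in, not its actual code:

```r
# Hypothetical top-decile lift: beta_q (churn rate in the top q fraction,
# ranked by score) divided by beta_0 (churn rate in the whole set).
top_decile_lift <- function(scores, y, quantile = 0.1) {
  n_top <- ceiling(quantile * length(scores))       # size of the top group
  top <- order(scores, decreasing = TRUE)[1:n_top]  # highest-scored entries
  mean(y[top]) / mean(y)                            # beta_10% / beta_0
}

# Toy check: all churners ranked at the top of a 10%-churn sample
scores <- c(runif(10, 0.8, 1), runif(90, 0, 0.5))   # first 10 scored highest
y      <- c(rep(1, 10), rep(0, 90))
top_decile_lift(scores, y)                          # 1 / 0.1 = 10, the maximum
```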

Resampling methods

Our reference (Zhu, Baesens, and vanden Broucke 2017) presents many methods to deal with class imbalance, from data-level solutions to adapted algorithms. We'll focus our tests on resampling the training set to create a rebalanced one.

  • (ROS) Random Oversampling: randomly replicates churner instances.
  • (RUS) Random Undersampling: randomly eliminates non-churner instances.
  • (SMOTE) Synthetic Minority Oversampling Technique (Chawla et al. 2002)

SMOTE oversamples the minority class by creating random linear interpolations between a given sample and its nearest neighbors.
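The interpolation step can be shown in a couple of lines (a toy sketch, not the performanceEstimation implementation; the nearest neighbor is assumed to be given rather than searched for):

```r
# One SMOTE-style synthetic sample: a random point on the segment
# between a minority observation and one of its nearest minority neighbors.
x        <- c(1, 2)                # minority sample (toy 2-feature example)
neighbor <- c(3, 4)                # one of its nearest minority neighbors
gap   <- runif(1)                  # random interpolation factor in [0, 1]
x_new <- x + gap * (neighbor - x)  # synthetic sample lies between the two
```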

See the ROS and RUS implementations in functions.R. SMOTE is implemented in the performanceEstimation library.
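For reference, minimal versions of such helpers might look as follows. This is a hedged sketch assuming a 0/1 (or two-level factor) target column; the actual functions.R code may differ:

```r
# ROS: replicate random churners until the classes are balanced.
ROS <- function(df, target) {
  minority <- df[df[[target]] == 1, ]
  majority <- df[df[[target]] == 0, ]
  extra <- minority[sample(nrow(minority),
                           nrow(majority) - nrow(minority),
                           replace = TRUE), ]
  rbind(df, extra)
}

# RUS: keep a random subset of non-churners the same size as the churners.
RUS <- function(df, target) {
  minority <- df[df[[target]] == 1, ]
  majority <- df[df[[target]] == 0, ]
  rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])
}
```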

Classification Methods

Logistic Regression

logistic_reg <- glm(Churn ~ ., data = train, family = "binomial")
summary(logistic_reg)
## 
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2237  -0.5144  -0.3415  -0.1988   3.2542  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.8861678  1.0499685  -7.511 5.87e-14 ***
## Account.length          0.0010275  0.0015737   0.653  0.51380    
## Area.code              -0.0006617  0.0014727  -0.449  0.65318    
## International.plan      2.0974910  0.1599706  13.112  < 2e-16 ***
## Voice.mail.plan        -2.0580107  0.6562002  -3.136  0.00171 ** 
## Number.vmail.messages   0.0384312  0.0206564   1.860  0.06282 .  
## Total.day.minutes       0.0126296  0.0012234  10.323  < 2e-16 ***
## Total.day.calls         0.0027757  0.0031141   0.891  0.37275    
## Total.eve.minutes       0.0057234  0.0012726   4.497 6.88e-06 ***
## Total.eve.calls        -0.0007386  0.0030890  -0.239  0.81104    
## Total.night.minutes     0.0028147  0.0012444   2.262  0.02370 *  
## Total.night.calls       0.0020757  0.0031753   0.654  0.51329    
## Total.intl.minutes      0.0994010  0.0228371   4.353 1.35e-05 ***
## Total.intl.calls       -0.1212521  0.0288705  -4.200 2.67e-05 ***
## Customer.service.calls  0.5067940  0.0442616  11.450  < 2e-16 ***
## RegionNortheast         0.3912728  0.1936314   2.021  0.04331 *  
## RegionSouth             0.1743858  0.1690461   1.032  0.30227    
## RegionWest              0.0793001  0.1857678   0.427  0.66947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2212.2  on 2665  degrees of freedom
## Residual deviance: 1726.1  on 2648  degrees of freedom
## AIC: 1762.1
## 
## Number of Fisher Scoring iterations: 6

The ROC curve:

roc_logistic <- roc.curve(scores.class0 = logit2prob(predict(logistic_reg,newdata = test)),weights.class0=test$Churn,curve=TRUE)
plot(roc_logistic,main = "ROC Curve - Logistic Regression") 
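A note on logit2prob, used above: predict.glm returns log-odds (the link scale) by default, and this helper from functions.R presumably applies the inverse logit. A minimal version, equivalent to calling predict with type = "response", would be:

```r
# Inverse logit: map log-odds to a probability in (0, 1).
logit2prob <- function(logit) 1 / (1 + exp(-logit))

logit2prob(0)   # log-odds of 0 correspond to a probability of 0.5
```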

By varying the 10% percentile in the lift definition we can also generate a lift curve, but for model comparison we'll use the decile version.

quantiles <- seq(0.1,1,length.out = 10)
lifts <- sapply(quantiles,function(x){return(lift(predict(logistic_reg,newdata =test),test$Churn,quantile=x))})
plot(quantiles,lifts,type='b',main="Lift Curve")

set.seed(123)

train <- train %>% mutate(Churn = as.factor(Churn))
Majority <- sum(train$Churn==0)
minority <- nrow(train)-Majority
resamples <- list("Original" = train,
                  "ROS" = ROS(train,"Churn"),
                  "RUS" = RUS(train,"Churn"),
                  "SMOTE" =  smote(Churn ~ ., train, perc.over = Majority/minority,perc.under=1))

#save(resamples,file="../data/resamples.rda")

final_results <- matrix(NA,0,4)
colnames(final_results) <- c("Accuracy","Recall","Lift","AUC")
for(i in c("Original","ROS","RUS","SMOTE")){
  data <-  resamples[i][[1]]
  model <- glm(Churn ~ ., data = data, family = "binomial")
  scores <- predict(model,newdata =test)
  ypred <- logit2prob(scores)>0.5
  suffix <- ifelse(i!="Original",paste("+",i),"")
  res <- evaluate(ypred,test$Churn,scores,name=paste("LogisticReg",suffix))
  final_results <- rbind(final_results,res)
}
final_results
##                      Accuracy    Recall     Lift       AUC
## LogisticReg         0.8545727 0.1789474 3.143755 0.8244387
## LogisticReg + ROS   0.7661169 0.7684211 3.143755 0.8303276
## LogisticReg + RUS   0.7631184 0.7368421 3.143755 0.8212919
## LogisticReg + SMOTE 0.7391304 0.8210526 3.353339 0.8292418

Although we lost accuracy with resampling, we got much higher recall, which is a crucial metric for this problem, as we stated before.
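The evaluate() helper also comes from functions.R. Below is a sketch of the four metrics it plausibly computes; the names match the table's column headers, but the exact implementation (in particular the rank-based Mann-Whitney AUC) is an assumption:

```r
# Illustrative stand-in for evaluate(): accuracy, recall, top-decile lift
# and a rank-based AUC from 0/1 predictions, true labels and scores.
evaluate_sketch <- function(ypred, y, scores) {
  acc    <- mean(ypred == y)
  recall <- sum(ypred == 1 & y == 1) / sum(y == 1)
  top    <- order(scores, decreasing = TRUE)[1:ceiling(0.1 * length(y))]
  lift   <- mean(y[top]) / mean(y)
  r  <- rank(scores)                # AUC via the Mann-Whitney statistic
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  c(Accuracy = acc, Recall = recall, Lift = lift, AUC = auc)
}
```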

Let’s review the coefficients of the SMOTE version of the regression, which was the best one in terms of recall and lift. We got more statistically significant features than before. In both cases the “strongest” predictor (in terms of z-value) was Customer.service.calls.

data <-  resamples["SMOTE"][[1]]
smote_logistic <- glm(Churn ~ ., data = data, family = "binomial")
summary(smote_logistic)
## 
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2055  -0.8085   0.3043   0.7966   2.7835  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.2338220  0.6458328 -11.201  < 2e-16 ***
## Account.length          0.0021405  0.0009932   2.155  0.03116 *  
## Area.code              -0.0006780  0.0008921  -0.760  0.44727    
## International.plan      2.5649186  0.1242593  20.642  < 2e-16 ***
## Voice.mail.plan        -2.1678067  0.3801978  -5.702 1.19e-08 ***
## Number.vmail.messages   0.0452635  0.0120397   3.760  0.00017 ***
## Total.day.minutes       0.0133785  0.0007205  18.568  < 2e-16 ***
## Total.day.calls         0.0051677  0.0019800   2.610  0.00905 ** 
## Total.eve.minutes       0.0072255  0.0008227   8.783  < 2e-16 ***
## Total.eve.calls        -0.0008711  0.0020065  -0.434  0.66420    
## Total.night.minutes     0.0032962  0.0008402   3.923 8.74e-05 ***
## Total.night.calls       0.0031070  0.0020616   1.507  0.13178    
## Total.intl.minutes      0.0767535  0.0146975   5.222 1.77e-07 ***
## Total.intl.calls       -0.0983638  0.0176142  -5.584 2.35e-08 ***
## Customer.service.calls  0.6687630  0.0298297  22.419  < 2e-16 ***
## RegionNortheast         0.4678842  0.1194881   3.916 9.01e-05 ***
## RegionSouth             0.2496615  0.1024976   2.436  0.01486 *  
## RegionWest              0.0861546  0.1143103   0.754  0.45103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5881.4  on 4267  degrees of freedom
## Residual deviance: 4259.8  on 4250  degrees of freedom
## AIC: 4295.8
## 
## Number of Fisher Scoring iterations: 5

K-Nearest-Neighbors

Let’s compare KNN performance across the resamples generated before. We’ll perform 5-fold cross-validation to select K, using recall (sensitivity) as the performance metric. We’ll use the undersampled dataset to avoid imbalanced folds.

#Code reference: https://rpubs.com/njvijay/16444
set.seed(123)
trControl <- trainControl(method  = "cv",
                          number  = 5,
                          classProbs=TRUE,
                          summaryFunction = twoClassSummary)
data <- dummy_cols(resamples["RUS"][[1]],select_columns = "Region") %>% select(-Region,-`Region_North Central`)

data <- data  %>%
  mutate(Churn = factor(as.numeric(Churn)-1,
                        labels = make.names(levels(Churn))))

cv.knn <- train(Churn ~ .,
                method     = "knn",
                tuneGrid   = expand.grid(k = 1:10),
                trControl  = trControl,
                preProcess = c("center","scale"),
                data = data)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
cv.knn
## k-Nearest Neighbors 
## 
## 776 samples
##  17 predictor
##   2 classes: 'X0', 'X1' 
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 621, 622, 620, 621, 620 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.6881618  0.7089244  0.6673993
##    2  0.7241891  0.6754246  0.6624043
##    3  0.7460747  0.7681985  0.6622045
##    4  0.7684068  0.7474192  0.6416250
##    5  0.7701596  0.7886114  0.6416250
##    6  0.7836965  0.7962704  0.6518815
##    7  0.7928432  0.8324342  0.6466200
##    8  0.8045158  0.8376623  0.6722611
##    9  0.8075667  0.8325008  0.6517816
##   10  0.8024933  0.8324009  0.6286047
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

Although the package suggests K=9 with the best AUC-ROC, we’ll use K=8, which got a higher sensitivity (recall). The dataset was standardized.

K=8
set.seed(123)
new_test <- dummy_cols(test,select_columns = "Region") %>% select(-Region,-`Region_North Central`)

for(i in c("Original","ROS","RUS","SMOTE")){
  X_train_scaled <-  dummy_cols(resamples[i][[1]],select_columns = "Region") %>% select(-Region,-`Region_North Central`,-Churn) %>% scale()
  y_train <- resamples[i][[1]]$Churn
  center <- attr(X_train_scaled,"scaled:center")
  scale <- attr(X_train_scaled,"scaled:scale")
  churn.knn = knn(train = X_train_scaled, 
                  test = scale(new_test[,-15],
                               center=center,
                               scale=scale),
                  cl = y_train,
                  k=K, prob=TRUE)
  scores <- attr(churn.knn,"prob")
  suffix <- ifelse(i!="Original",paste("+",i),"")
  res <- evaluate(churn.knn,test$Churn,scores,name=paste("KNN",suffix))
  final_results <- rbind(final_results,res)
}
final_results[5:8,]
##              Accuracy    Recall      Lift       AUC
## KNN         0.8710645 0.1789474 0.5239592 0.1962827
## KNN + ROS   0.6986507 0.6842105 0.8383346 0.4055576
## KNN + RUS   0.8065967 0.6210526 0.6287510 0.4448473
## KNN + SMOTE 0.7196402 0.7368421 1.1527101 0.5123666

Again, the model trained on the original dataset performed poorly in terms of recall/lift. In general KNN performed badly, but the SMOTE-resampled dataset yields acceptable metrics (AUC > 0.5, lift > 1).

Bagging & Random Forest

Let’s try bagging with default R parameters:

set.seed(123)
bag.churn <- randomForest(Churn~.,data=train,mtry=15,importance=TRUE)
ypred.bag <- predict(bag.churn,newdata = test,type="response")
probs.bag <- predict(bag.churn,newdata = test,type="prob")[,2]

evaluate(ypred.bag,test$Churn,probs.bag)
##   Accuracy    Recall     Lift       AUC
##  0.9595202 0.7684211 6.916261 0.9329039

Pretty good results in general! But recall is lower than the SMOTE logistic regression's. Now let's use the resampled training sets:

set.seed(123)
for(i in c("Original","ROS","RUS","SMOTE")){
  data <-  resamples[i][[1]]
  model <- randomForest(Churn~.,data=data,mtry=15)
  scores <- predict(model,newdata =test,type="prob")[,2]
  ypred <- predict(model,newdata = test,type="response")
  suffix <- ifelse(i!="Original",paste("+",i),"")
  res <- evaluate(ypred,test$Churn,scores,name=paste("Bagging",suffix))
  final_results <- rbind(final_results,res)
}
final_results[9:12,]
##                  Accuracy    Recall     Lift       AUC
## Bagging         0.9610195 0.7789474 6.916261 0.9303276
## Bagging + ROS   0.9505247 0.7578947 6.916261 0.9191940
## Bagging + RUS   0.9025487 0.8631579 5.868342 0.9165440
## Bagging + SMOTE 0.9265367 0.8526316 6.497093 0.9246964

As shown, RUS and SMOTE yielded better recall without much harm to the other metrics, especially SMOTE, which had a higher lift than RUS.

Let’s plot the SMOTE version ROC curve:

data <-  resamples["SMOTE"][[1]]
smote.bag <- randomForest(Churn~.,data=data,mtry=15)
scores.bag <- predict(smote.bag,newdata =test,type="prob")[,2]
roc.bag <- roc.curve(scores.class0 = scores.bag,weights.class0=test$Churn,curve=TRUE)
plot(roc.bag,main = "ROC curve: Bagging + SMOTE")

For Random Forest we'll tune the mtry parameter using the OOB error. The default value is \(\lfloor\sqrt{p}\rfloor\), which in our case is 3, but as shown below, with mtry = 6 we get a lower OOB error.

set.seed(123)
tune.rf.churn <- tuneRF(train[,-15], train[,15], mtryStart = 3, ntreeTry=500)
## mtry = 3  OOB error = 5.36% 
## Searching left ...
## mtry = 2     OOB error = 7.43% 
## -0.3846154 0.05 
## Searching right ...
## mtry = 6     OOB error = 4.65% 
## 0.1328671 0.05 
## mtry = 12    OOB error = 4.73% 
## -0.01612903 0.05

Let’s now tune the ntree parameter:

#Code reference: https://rpubs.com/phamdinhkhanh/389752
set.seed(123)
control <- trainControl(method = 'repeatedcv',
                        number = 5,
                        search = 'grid',
                        classProbs = T,
                        summaryFunction = twoClassSummary)
#create tunegrid
tunegrid <- expand.grid(.mtry = c(6))
modellist <- list()

data <- train  %>%
  mutate(Churn = factor(as.numeric(Churn)-1,
                        labels = make.names(levels(Churn))))
#train with different ntree parameters
for (ntree in c(500,1000,1500,2000,2500)){
  set.seed(123)
  fit <- train(Churn~.,
               data = data,
               method = 'rf',
               tuneGrid = tunegrid,
               trControl = control,
               ntree = ntree)
  key <- toString(ntree)
  modellist[[key]] <- fit
}

#Compare results
results <- resamples(modellist)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: 500, 1000, 1500, 2000, 2500 
## Number of resamples: 5 
## 
## ROC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 500  0.8895604 0.9028902 0.9070215 0.9082137 0.9075995 0.9339969    0
## 1000 0.8868836 0.9028200 0.9056229 0.9072137 0.9065030 0.9342390    0
## 1500 0.8861792 0.9018219 0.9072481 0.9077068 0.9080206 0.9352643    0
## 2000 0.8843477 0.9010628 0.9068404 0.9068026 0.9085629 0.9331995    0
## 2500 0.8847140 0.9008238 0.9054628 0.9065441 0.9087627 0.9329574    0
## 
## Sens 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## 500  0.9714912 0.9868421 0.9912088 0.9881540 0.9912281    1    0
## 1000 0.9714912 0.9868421 0.9890351 0.9881550 0.9934066    1    0
## 1500 0.9736842 0.9890351 0.9912281 0.9894708 0.9934066    1    0
## 2000 0.9736842 0.9890351 0.9890351 0.9890322 0.9934066    1    0
## 2500 0.9736842 0.9890351 0.9890351 0.9890322 0.9934066    1    0
## 
## Spec 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 500  0.6883117 0.6923077 0.7532468 0.7293373 0.7564103 0.7564103    0
## 1000 0.6666667 0.6883117 0.7435897 0.7216783 0.7435897 0.7662338    0
## 1500 0.6666667 0.6883117 0.7435897 0.7190809 0.7435897 0.7532468    0
## 2000 0.6666667 0.6883117 0.7435897 0.7190809 0.7435897 0.7532468    0
## 2500 0.6666667 0.6883117 0.7435897 0.7216450 0.7532468 0.7564103    0

Since the metrics weren't substantially affected by the number of trees, we'll use R's default value ntree = 500 to save computational resources.

set.seed(123)
for(i in c("Original","ROS","RUS","SMOTE")){
  data <-  resamples[i][[1]]
  model <- randomForest(Churn~.,data=data,mtry=6,ntree=500)
  scores <- predict(model,newdata =test,type="prob")[,2]
  ypred <- predict(model,newdata = test,type="response")
  suffix <- ifelse(i!="Original",paste("+",i),"")
  res <- evaluate(ypred,test$Churn,scores,name=paste("RForest",suffix))
  final_results <- rbind(final_results,res)
}
final_results[13:16,]
##                  Accuracy    Recall     Lift       AUC
## RForest         0.9565217 0.7368421 7.021053 0.9280732
## RForest + ROS   0.9655172 0.8105263 6.706677 0.9216231
## RForest + RUS   0.8935532 0.8526316 5.973134 0.9197092
## RForest + SMOTE 0.9325337 0.8421053 6.392302 0.9298123

Again, resampled training sets enhanced recall, but the original dataset got a higher lift.

Bagging-based ensembles

So far we've used preprocessed, resampled datasets to deal with imbalance in the training set, but we can also adapt existing algorithms to handle this problem directly from the original data. We'll now try three bagging variations: RUSBagging, ROSBagging and SMOTEBagging. The modification is that, instead of bootstrapping the entire dataset, for each bag one uses RUS, ROS or SMOTE to create a balanced sample. More detailed explanations can be found in Galar et al. (2012).
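The RUSBagging idea can be sketched in a few lines. This is a hypothetical illustration, not IRIC's bbaging(): rpart stands in for the base tree learner, the function name and arguments are invented, and the minority class is taken whole rather than bootstrapped:

```r
library(rpart)

# Each bag is a balanced sample: all churners plus an equally sized
# random draw of non-churners; a tree is fit on every bag.
rus_bagging <- function(df, target, numBag = 50) {
  minority <- df[df[[target]] == 1, ]
  majority <- df[df[[target]] == 0, ]
  fml <- as.formula(paste(target, "~ ."))
  lapply(seq_len(numBag), function(b) {
    bag <- rbind(minority,
                 majority[sample(nrow(majority), nrow(minority)), ])
    rpart(fml, data = bag, method = "class")
  })
}
# Prediction then majority-votes (or averages probabilities over) the trees.
```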

We'll use the R implementation from the IRIC package, based on the article (Zhu et al. 2019).

RUSBagging:

set.seed(123)
source("../../IRIC/R/Ensemble-based level/BalanceBagging.R")
rusbag <- bbaging(train[,-15],train[,15],numBag=500,type="RUSBagging",allowParallel=TRUE)
rusbag.pred <- predict(rusbag, test[,-15],type="class")
rusbag.prob <- predict(rusbag,test[,-15],type="probability")[,2]
final_results <- rbind(final_results,
                       evaluate(rusbag.pred,test$Churn,
                                rusbag.prob,name="RUSBagging"))
final_results[17,]
##  Accuracy    Recall      Lift       AUC 
## 0.8890555 0.8631579 6.3923016 0.9197184

ROSBagging:

set.seed(123)
rosbag <- bbaging(train[,-15],train[,15],numBag=500,type="ROSBagging",allowParallel=TRUE)
rosbag.pred <- predict(rosbag, test[,-15],type="class")
rosbag.prob <- predict(rosbag,test[,-15],type="probability")[,2]
final_results <- rbind(final_results,
                       evaluate(rosbag.pred,test$Churn,
                                rosbag.prob,name="ROSBagging"))
final_results[17:18,]
##             Accuracy    Recall     Lift       AUC
## RUSBagging 0.8890555 0.8631579 6.392302 0.9197184
## ROSBagging 0.8875562 0.8631579 6.287510 0.9085112

SMOTEBagging:

set.seed(123)
smotebag <- bbaging(train[,-15],train[,15],numBag=500,type="SMOTEBagging",iter=40)
smotebag.pred <- predict(smotebag, test[,-15],type="class")
smotebag.prob <- predict(smotebag,test[,-15],type="probability")[,2]

final_results <- rbind(final_results,
                       evaluate(smotebag.pred,test$Churn,
                                smotebag.prob,name="SMOTEBagging"))
final_results[c(9,17:19),]
##               Accuracy    Recall     Lift       AUC
## Bagging      0.9610195 0.7789474 6.916261 0.9303276
## RUSBagging   0.8890555 0.8631579 6.392302 0.9197184
## ROSBagging   0.8875562 0.8631579 6.287510 0.9085112
## SMOTEBagging 0.8845577 0.8526316 5.658759 0.9052356

All three methods yielded better recall than plain bagging, but among them RUSBagging got the highest lift (although plain bagging's was higher still). The other metrics were pretty close across the three models.

AdaBoost

The function bboost is implemented in this GitHub repo, based on the article Zhu et al. (2019). If you want to reproduce the following chunk, install the IRIC package following the repo instructions.

First we try classical AdaBoost:

set.seed(123)
source("../../IRIC/R/Ensemble-based level/BalanceBoost.R")
adaboost.churn <- bboost(train[,-15], train[,15], base = treeBoost, type = "AdaBoost")
adaboost.pred <- predict(adaboost.churn, test[,-15],type="class")
adaboost.prob <- predict(adaboost.churn,test[,-15],type="probability")[,2]
final_results <- rbind(final_results,
                       evaluate(adaboost.pred,test$Churn,
                                adaboost.prob,name="AdaBoost"))
final_results[20,]
##  Accuracy    Recall      Lift       AUC 
## 0.9505247 0.7157895 6.7066771 0.9269967

We got satisfactory accuracy, lift and AUC, but not so high a recall. Let's try the cost-sensitive AdaBoost algorithm (AdaC2), assigning higher weights to the minority class via the costRatio parameter. We'll tune this parameter using a validation set with 1/8 of the training size, respecting the same churn proportion.

set.seed(123)
val_size <- floor(nrow(train)/8)
churn_prop <- sum(train$Churn==1)/nrow(train)
val_churners <- sample(which(train$Churn==1),size=floor(churn_prop*val_size))
val_non_churners <- sample(which(train$Churn==0),size=val_size-length(val_churners))
val_index <- c(val_churners,val_non_churners)
validation <- train[val_index,]

costs <- seq(1,5,length.out = 21)
val_results <- matrix(NA,length(costs),5)
colnames(val_results) <- c("Cost","Accuracy","Recall","Lift","AUC")
val_results[,1] <- costs
for(i in 1:length(costs)){
  adac2.churn <- bboost(train[-val_index,-15], train[-val_index,15], base = treeBoost, type = "AdaC2",costRatio = costs[i])
  adac2.pred <- predict(adac2.churn, validation[,-15],type="class")
  adac2.prob <- predict(adac2.churn,validation[,-15],type="probability")[,2]
  val_results[i,2:5] <- evaluate(adac2.pred,validation$Churn,adac2.prob)
}
val_results[,1:3]
##       Cost  Accuracy    Recall
##  [1,]  1.0 0.9459459 0.6875000
##  [2,]  1.2 0.9309309 0.7500000
##  [3,]  1.4 0.9039039 0.8333333
##  [4,]  1.6 0.8408408 0.8125000
##  [5,]  1.8 0.7957958 0.8958333
##  [6,]  2.0 0.7357357 0.8333333
##  [7,]  2.2 0.6726727 0.8750000
##  [8,]  2.4 0.6366366 0.8958333
##  [9,]  2.6 0.5915916 0.9166667
## [10,]  2.8 0.5495495 0.8541667
## [11,]  3.0 0.5405405 0.9583333
## [12,]  3.2 0.4714715 0.9791667
## [13,]  3.4 0.3663664 0.9375000
## [14,]  3.6 0.4144144 1.0000000
## [15,]  3.8 0.4114114 0.9791667
## [16,]  4.0 0.3453453 0.9791667
## [17,]  4.2 0.2972973 0.9583333
## [18,]  4.4 0.3363363 0.9166667
## [19,]  4.6 0.2582583 0.9583333
## [20,]  4.8 0.3063063 0.9791667
## [21,]  5.0 0.2282282 1.0000000
val_results[,1:3] %>% as.data.frame() %>% ggplot(aes(x=Cost)) + geom_line(aes(y=Accuracy),linetype="dashed",color='red')+
  geom_point(aes(y=Accuracy),color="red")+
  geom_line(aes(y=Recall),linetype="dashed")+
  geom_point(aes(y=Recall)) +
  scale_x_continuous(breaks = costs) + labs(title = "Red=Accuracy, Black=Recall",y="")

Although we argued that recall is a crucial measure in our case, we don't want low accuracy, so 1.4 seems to be a good equilibrium point.

adac2.churn <- bboost(train[,-15], train[,15], base = treeBoost, type = "AdaC2",costRatio = 1.4)
adac2.pred <- predict(adac2.churn, test[,-15],type="class")
adac2.prob <- predict(adac2.churn,test[,-15],type="probability")[,2]

final_results <- rbind(final_results,
                       evaluate(adac2.pred,test$Churn,
                                adac2.prob,name="AdaC2"))
final_results[20:21,]
##           Accuracy    Recall     Lift       AUC
## AdaBoost 0.9505247 0.7157895 6.706677 0.9269967
## AdaC2    0.8815592 0.8315789 6.497093 0.9257269

At the cost of some accuracy, we got better recall without much harm to the other measures.

Final Results and comments

Let's see the whole table comparing all models and find the champion for each measure.

final_results
##                      Accuracy    Recall      Lift       AUC
## LogisticReg         0.8545727 0.1789474 3.1437549 0.8244387
## LogisticReg + ROS   0.7661169 0.7684211 3.1437549 0.8303276
## LogisticReg + RUS   0.7631184 0.7368421 3.1437549 0.8212919
## LogisticReg + SMOTE 0.7391304 0.8210526 3.3533386 0.8292418
## KNN                 0.8710645 0.1789474 0.5239592 0.1962827
## KNN + ROS           0.6986507 0.6842105 0.8383346 0.4055576
## KNN + RUS           0.8065967 0.6210526 0.6287510 0.4448473
## KNN + SMOTE         0.7196402 0.7368421 1.1527101 0.5123666
## Bagging             0.9610195 0.7789474 6.9162608 0.9303276
## Bagging + ROS       0.9505247 0.7578947 6.9162608 0.9191940
## Bagging + RUS       0.9025487 0.8631579 5.8683425 0.9165440
## Bagging + SMOTE     0.9265367 0.8526316 6.4970935 0.9246964
## RForest             0.9565217 0.7368421 7.0210526 0.9280732
## RForest + ROS       0.9655172 0.8105263 6.7066771 0.9216231
## RForest + RUS       0.8935532 0.8526316 5.9731343 0.9197092
## RForest + SMOTE     0.9325337 0.8421053 6.3923016 0.9298123
## RUSBagging          0.8890555 0.8631579 6.3923016 0.9197184
## ROSBagging          0.8875562 0.8631579 6.2875098 0.9085112
## SMOTEBagging        0.8845577 0.8526316 5.6587588 0.9052356
## AdaBoost            0.9505247 0.7157895 6.7066771 0.9269967
## AdaC2               0.8815592 0.8315789 6.4970935 0.9257269
apply(final_results,2,function(x){
  return(row.names(final_results)[which(x==max(x))])})
## $Accuracy
## [1] "RForest + ROS"
## 
## $Recall
## [1] "Bagging + RUS" "RUSBagging"    "ROSBagging"   
## 
## $Lift
## [1] "RForest "
## 
## $AUC
## [1] "Bagging "

Although we have these champions for each metric, excluding KNN and the basic logistic regression, all the models performed very well. A recall lower bound of 0.71 (AdaBoost) means we're identifying 71% of the real churners.

A lift lower bound of 3.1 (logistic + RUS/ROS) means we have 3.1 times more churners in the top decile than a random classifier would have, i.e. churners make up ~44% of that decile. Excluding logistic regression, all lifts were higher than 5, which corresponds to ~71% churners in the top decile. Our lift champion, Random Forest, almost reached the maximum possible lift, so using these models to target the top decile with a suitable retention offer would be a good way to reduce churn.
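The "maximum lift" mentioned above has a simple closed form: since \(\beta_{10\%}\) can be at most 1 (every top-decile customer a churner), lift is capped at \(1/\beta_0\). Assuming the test split has 667 observations with 95 churners (consistent with the ~14% churn rate reported earlier; these exact counts are an inference, not stated in the output):

```r
# Upper bound on top-decile lift: beta_10% <= 1, so lift <= 1 / beta_0.
beta0    <- 95 / 667      # assumed churners / observations in the test split
lift_max <- 1 / beta0
round(lift_max, 3)        # 7.021
```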

# library(xtable)
# print(xtable(round(final_results,4), type = "latex",digits=c(0,4,4,4,4)))

References

Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57. https://doi.org/10.1613/jair.953.
Galar, Mikel, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. 2012. “A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches.” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (4): 463–84. https://doi.org/10.1109/TSMCC.2011.2161285.
Zhu, Bing, Bart Baesens, and Seppe K. L. M. vanden Broucke. 2017. “An Empirical Comparison of Techniques for the Class Imbalance Problem in Churn Prediction.” Information Sciences 408: 84–99. https://doi.org/10.1016/j.ins.2017.04.015.
Zhu, Bing, Zihan Gao, Junkai Zhao, and Seppe K. L. M. vanden Broucke. 2019. “IRIC: An R Library for Binary Imbalanced Classification.” SoftwareX 10: 100341. https://doi.org/10.1016/j.softx.2019.100341.