Should I Stay or Should I Go?

WSDM — KKBox’s Churn Prediction Challenge

Mohammad Asad
Sep 16, 2020 · 9 min read
Churn Prediction

“There is only one boss. The customer. And he can fire everybody in the company from the chairman on down, simply by spending his money somewhere else.”
— Sam Walton

Advances in machine learning and artificial intelligence let us tackle complex problems with methods that are both more sophisticated and simpler to apply.

So why not take this approach to solve our churn prediction problem and save companies from heavy losses?

Table of Contents

  1. What is the Problem?
  2. What is Churn?
  3. Business Impact
  4. Problem Statement
  5. Client
  6. Business objective and Constraints
  7. Machine Learning Problem
  8. Existing Solution
  9. Data Overview
  10. Exploratory Data Analysis
  11. Data Preprocessing and Feature Engineering
  12. Modelling
  13. Results
  14. Final Submission score
  15. Future Work
  16. References

What is the Problem?

WSDM — KKBox’s Churn Prediction Challenge
Caution Spoiler Ahead: Our model is in the TOP 7 Percent :)

What is Churn?

Churn quantifies the number of customers who have left your brand by canceling their subscription or stopping paying for your services.

Business Impact

Since music streaming service providers are becoming more competitive day by day, one of the major problems these companies face is customer retention. A high churn rate is bad news for any business, as it costs five times as much to attract a new customer as it does to keep an existing one. A high customer churn rate will hit your company's finances hard.

Problem Statement

To predict whether a user will churn after their subscription expires. Specifically, we want to forecast whether a user makes a new service subscription transaction within 30 days after the current membership expiration date.

Client

KKBOX is Asia’s leading music streaming service, holding the world’s most comprehensive Asia-Pop music library with over 30 million tracks, supported by advertising and paid subscription.

Business objective and constraints

The company currently uses survival analysis techniques to determine the residual membership lifetime for each subscriber. By adopting different methods, KKBOX anticipates it will discover new insights into why users leave. Accurately predicting churn is critical to long-term success, since even slight variations in churn rate can drastically affect profits.
1. No strict latency constraint.
2. Minimize the error.
3. The model should produce a probabilistic output.

Machine Learning Problem

Description

The churn/renewal definition can be tricky due to KKBox’s subscription model. Since the majority of KKBox’s subscription length is 30 days, a lot of users re-subscribe every month. The key fields to determine churn/renewal are transaction date, membership expiration date, and is_cancel.
Note that the is_cancel field indicates whether a user actively cancels a subscription. Subscription cancellation does not imply the user has churned. A user may cancel service subscription due to a change of service plans or other reasons. The criteria of “churn” is no new valid service subscription within 30 days after the current membership expires.
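
To make this labelling rule concrete, here is a minimal sketch of how the churn flag could be derived from transactions.csv. This is my own illustration, not the competition's official labelling script:

import pandas as pd

transactions = pd.read_csv('transactions.csv')
for col in ['transaction_date', 'membership_expire_date']:
    transactions[col] = pd.to_datetime(transactions[col].astype(str), format='%Y%m%d')

# latest membership expiration per user (before the evaluation month)
last_expire = transactions.groupby('msno')['membership_expire_date'].max()

def is_churn(user_tx, expire_date):
    # 1 if no new valid subscription within 30 days after expiration, else 0
    window_end = expire_date + pd.Timedelta(days=30)
    renewed = ((user_tx['transaction_date'] > expire_date) &
               (user_tx['transaction_date'] <= window_end) &
               (user_tx['is_cancel'] == 0)).any()
    return 0 if renewed else 1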

The training and the test data are selected from users whose membership expires within a certain month. The train data consists of users whose subscription expires within February 2017, and the test data consists of users whose subscription expires within March 2017. This means we are looking at user churn or renewal roughly in March 2017 for the train set, and roughly in April 2017 for the test set. Train and test sets are split by the transaction date.

Posing as a Machine Learning Problem

Binary classification: is_churn is either 0 or 1.

Evaluation metric: Log Loss (KPI)
We will also keep track of the F1 score and the confusion matrix.

Log Loss
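
For reference, binary log loss is -(1/N) * sum(y * log(p) + (1 - y) * log(1 - p)), where p is the predicted churn probability. A quick way to compute it with scikit-learn (a minimal sketch with made-up numbers):

import numpy as np
from sklearn.metrics import log_loss

# hypothetical labels and predicted churn probabilities
y_true = np.array([0, 1, 0, 1])
y_prob = np.array([0.10, 0.85, 0.30, 0.60])
print(log_loss(y_true, y_prob))  # lower is better; a perfect model scores 0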

Existing Solutions

Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data

This paper explores the application of extreme gradient boosting (XGBoost) on a customer dataset with a wide variety of temporal features to create a highly-accurate customer churn model. In particular, it describes an effective method for handling temporally sensitive feature engineering. The proposed model was submitted in the WSDM Cup 2018 Churn Challenge and achieved first-place out of 575 teams.

Data Overview

Get the data from https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data

train.csv

the train set, containing the user ids and whether they have churned.

1. msno: user-id

2. is_churn: This is the target variable. Churn is defined as the user not continuing the subscription within 30 days of expiration. is_churn = 1 means churn, is_churn = 0 means renewal.

transactions.csv

transactions of users up until 2/28/2017.

1. msno: user-id

2. payment_method_id: payment method

3. payment_plan_days: length of membership plan in days

4. plan_list_price: in New Taiwan Dollar (NTD)

5. actual_amount_paid: in New Taiwan Dollar (NTD)

6. is_auto_renew

7. transaction_date: format %Y%m%d

8. membership_expire_date: format %Y%m%d

9. is_cancel: whether or not the user canceled the membership in this transaction.

user_logs.csv

daily user logs describing listening behaviors of a user. Data collected until 2/28/2017.

1. msno: user-id

2. date: format %Y%m%d

3. num_25: # of songs played less than 25% of the song length

4. num_50: # of songs played between 25% and 50% of the song length

5. num_75: # of songs played between 50% and 75% of the song length

6. num_985: # of songs played between 75% and 98.5% of the song length

7. num_100: # of songs played over 98.5% of the song length

8. num_unq: # of unique songs played

9. total_secs: total seconds played

members.csv

User information. Note that not every user in the dataset has a record here.

1. msno

2. city

3. bd: age

4. gender

5. registered_via: registration method

6. registration_init_time: format %Y%m%d

Exploratory Data Analysis

Exploratory Data Analysis is the very first and one of the most important steps in solving any data science case study. It gives valuable insights and information. Proper EDA reveals interesting characteristics of the data, which in turn influence our data preprocessing and model selection criteria.

Let us import some libraries and visualize the data.

Loading Data

import pandas as pd
import seaborn as sns

train_data = pd.read_csv('/home/asad_99rizvi/dataset/train_v2.csv')
train_data.head()
sns.countplot(train_data['is_churn'])

For a business, having fewer churned users is great, but it makes our dataset imbalanced.

Let us see some plots from other CSV also.

There are more teenagers in our dataset.
Some methods of registration are more favorable than others.
There are approximately the same number of males and females, and most of the values in the gender column are missing.

plt.figure(num=None, figsize=(16, 5), dpi=80, facecolor='w', edgecolor='k')
plt.plot(transaction_data['transaction_date'].value_counts().sort_index()[:500])

We can see an upward trend in the number of transactions every year. This type of graph shows a healthy growth of the company.

Youngsters are slightly more prone to churning.

These plots are just a glimpse of the EDA that I did for this case study;
you can find the complete EDA with a detailed explanation here.

Data Preprocessing and Feature Engineering

Since the dataset was quite large, I used some techniques to reduce the space occupied by the data.
The whole case study revolves around the features one can design. I created approximately 250 features, out of which I used 100 in my final model.

This function takes a DataFrame and adjusts the datatypes of its columns depending on their maximum and minimum values.
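
A minimal sketch of such a helper (the version in my notebook is slightly more elaborate; this one just downcasts numeric columns based on their min/max values):

import numpy as np
import pandas as pd

def reduce_memory(df):
    # Downcast each numeric column to the smallest dtype that can hold its values.
    for col in df.select_dtypes(include=[np.number]).columns:
        col_min, col_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            for dtype in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(dtype).min <= col_min and col_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            df[col] = df[col].astype(np.float32)  # float16 can lose too much precision
    return df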

Most of the features that I created were relative to each other.
Some of the features that made it into the final model are listed below (a short sketch of how a couple of them can be computed follows the list):

1. Feature: not_autorenew_cancel
Logical operation NOT auto-renew AND is cancel

2. Feature: difference
Discount given to the customer

3. Feature: is_auto_renew_change
Captures whether the user changed their auto_renew state

4. Feature: transaction_count
How many transactions the user had made in the past

5. Feature: duration
Average duration

6. Feature: sum_diff_last_2_month_25
Difference between the sum of songs played less than 25% of the song length in the last two months

7. Feature: sum_diff_last_2_month_50
Difference between the sum of songs played less than 50% of the song length in the last two months

8. Feature: sum_diff_last_2_month_num_unq
Difference between the sum of unique songs played in the last two months

9. Feature: average_amount_paid
The average amount paid by the user (theoretical)

10. Feature: average_amount_charged
The average amount charged to the user (actual)
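
As mentioned above, here is a short sketch of how a couple of these features can be computed (column names follow transactions.csv and the transaction_data DataFrame loaded earlier; this is an illustration, not the exact notebook code):

# difference: discount given to the customer
transaction_data['difference'] = (transaction_data['plan_list_price']
                                  - transaction_data['actual_amount_paid'])

# transaction_count: how many transactions each user has made in the past
transaction_count = transaction_data.groupby('msno').size().rename('transaction_count')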

I found one very interesting thing: some customers are paying more than what they are charged (rich guys 🤷‍♀️).

There were also some indicator features, i.e. binary flags based on information I found during EDA, for example:

  • People who register via method 7 or 9 are more likely to churn.
  • People who registered before 2014 are less likely to churn (maybe because they have been using the service for a long time).

is_cancel_change_flag = transaction.groupby('msno')['is_cancel'].max()
is_auto_renew_change_flag = transaction.groupby('msno')['is_auto_renew'].max()
member_data['city_flag'] = member_data['city'].apply(lambda x: 1 if (x == 1 or x == 13) else 0)
member_data['age_flag'] = member_data['bd'].apply(lambda x: 1 if (x >= 20 and x <= 24) else 0)

You can find the complete feature engineering with an explanation here.

Modelling

I experimented with various machine learning models, with hyperparameter tuning:

1. SGDClassifier hinge L2 Regularization

2. SGDClassifier hinge L1 Regularization

3. Logistic Regression l2 Regularization

4. Logistic Regression l1 Regularization

5. Random Forest

6. LGBM Classifier

7. XGBoost

8. AdaBoost

9. Ensembles

10. MLP

11. CNN

12. LSTM

Some helper functions to visualize and evaluate the models. These functions:
print the log loss, F1 score, and confusion matrix,
find which hyperparameters are best,
train the model with those hyperparameters, and
print feature importances.
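
A minimal version of such an evaluation helper might look like this (a sketch; the notebook versions also plot the confusion matrix and feature importances):

from sklearn.metrics import log_loss, f1_score, confusion_matrix

def evaluate(model, X, y, threshold=0.5):
    # Print log loss, F1 score and the confusion matrix for a fitted classifier.
    proba = model.predict_proba(X)[:, 1]
    preds = (proba >= threshold).astype(int)
    print('Log loss :', log_loss(y, proba))
    print('F1 score :', f1_score(y, preds))
    print('Confusion matrix:\n', confusion_matrix(y, preds))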

Feature importance of top 50 features

Feature importance by Logistic Regression

Among the classical machine learning models, XGBoost performed the best.
Therefore I created an ensemble of XGBoost models, and it performed noticeably better than the other classical models.

I split the data into three parts: one to train the ensemble, a second to train the meta-classifier, and a third to tune hyperparameters.

I trained 50 XGBoost models on bootstrapped samples and, using their predictions as features, trained another XGBoost model on top of them.
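
A rough sketch of that bagging idea (the hyperparameters and the X_base/y_base and X_meta/y_meta names for the first two splits are illustrative, and the arrays are assumed to be NumPy arrays; the real training loop lives in the notebook):

import numpy as np
from xgboost import XGBClassifier

n_models = 50
base_models, meta_inputs = [], []
for i in range(n_models):
    idx = np.random.choice(len(X_base), size=len(X_base), replace=True)  # bootstrap sample
    clf = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.1)
    clf.fit(X_base[idx], y_base[idx])
    base_models.append(clf)
    meta_inputs.append(clf.predict_proba(X_meta)[:, 1])

# meta-classifier trained on the base models' predictions for the second split
meta_model = XGBClassifier(n_estimators=100, max_depth=3)
meta_model.fit(np.column_stack(meta_inputs), y_meta)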

Now let us just move to our Best model i.e. CNN.

Convolutional Neural Network models were developed for image classification problems, where the model learns an internal representation of a two-dimensional input, in a process referred to as feature learning.
This same process can be harnessed on one-dimensional sequences of data.
The model learns to extract features from sequences of observations and how to map the internal features to target variables.

I wrote some custom callbacks so that I could keep track of the F1 score at every epoch, and I also used other callbacks such as EarlyStopping.
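
A minimal custom F1 callback along these lines, using tf.keras (a sketch; it assumes the held-out X_cv/y_cv arrays used later for validation):

from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback, EarlyStopping

class F1Callback(Callback):
    def __init__(self, X_val, y_val):
        super().__init__()
        self.X_val, self.y_val = X_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        # threshold the predicted probabilities and report F1 on the validation set
        preds = (self.model.predict(self.X_val) >= 0.5).astype(int).ravel()
        print(' - val_f1: %.4f' % f1_score(self.y_val, preds))

callback_list = [F1Callback(X_cv, y_cv),
                 EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)]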

I used Batch Normalization to reduce internal covariate shift and Dropout to avoid overfitting.
I also reshaped the data because a 1D CNN expects an extra channel dimension.

X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

Our model shows a very stable log loss and F1 score throughout the training phase.

model = Sequential()
model.add(Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(X_train.shape[1:])))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(Dropout(0.5))
model.add(MaxPool1D(pool_size=2))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, epochs=50, batch_size=2**16, validation_data=(X_cv, y_cv), verbose=1, callbacks=callback_list)

I trained the model for 50 epochs and found that the loss was still decreasing, so I trained it further, up to 80 epochs.

Results

Out of all the models built, the CNN, a non-linear model, gives the lowest log loss of 0.11665. However, Random Forest, also a non-linear model, did not perform on par with the linear models. The second-best performance after the CNN came from our ensemble model, with a log loss of 0.13424.

Note: These values are on the cross-validation dataset.

Final Submission Score

Future Work

Future work includes further parameter optimization of the XGBoost and CNN, stacking models, and further exploration of additional feature engineering not yet tested.

While some code snippets are included within the blog, for the full code you can check out this Jupyter notebook on Github. I hope you learned something new through this read!

You can also find and connect with me on LinkedIn and Github

References

1. https://www.kaggle.com/c/kkbox-churn-prediction-challenge

2. https://arxiv.org/ftp/arxiv/papers/1802/1802.03396.pdf

3. https://www.appier.com/blog/churn-prediction/

4. https://www.geeksforgeeks.org/python-working-with-date-and-time-using-pandas/

5. https://www.kaggle.com/c/kkbox-churn-prediction-challenge/discussion/46078

6. https://www.kaggle.com/jeru666/did-you-think-of-these-features

7. https://medium.com/analytics-vidhya/kaggle-top-4-solution-wsdm-kkboxs-churn-prediction-fc49104568d6
