Credit Card Users Churn Prediction¶
Problem Statement¶
The Thera bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the several types of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, international transaction fees, and others. Some fees are charged to every user regardless of usage, while others are charged only under specific circumstances.
Customers leaving the credit card service would result in a loss for the bank, so the bank wants to analyse customer data to identify customers who are likely to leave its credit card services, and the reasons for doing so, so that it can improve in those areas.
As a data scientist at Thera bank, you need to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards. You need to identify the best possible model that will give the required performance.
Dataset Description¶
- CLIENTNUM - Client number. Unique identifier for the customer holding the account
- Attrition_Flag - Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
- Customer_Age - Age in Years
- Gender - Gender of the account holder
- Dependent_count - Number of dependents
- Education_Level - Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
- Marital_Status - Marital Status of the account holder
- Income_Category - Annual Income Category of the account holder
- Card_Category - Type of Card
- Months_on_book - Period of relationship with the bank (in months)
- Total_Relationship_Count - Total no. of products held by the customer
- Months_Inactive_12_mon - No. of months inactive in the last 12 months
- Contacts_Count_12_mon - No. of Contacts in the last 12 months
- Credit_Limit - Credit Limit on the Credit Card
- Total_Revolving_Bal - Total Revolving Balance on the Credit Card
- Avg_Open_To_Buy - Open to Buy Credit Line (Average of last 12 months)
- Total_Amt_Chng_Q4_Q1 - Change in Transaction Amount (Q4 over Q1)
- Total_Trans_Amt - Total Transaction Amount (Last 12 months)
- Total_Trans_Ct - Total Transaction Count (Last 12 months)
- Total_Ct_Chng_Q4_Q1 - Change in Transaction Count (Q4 over Q1)
- Avg_Utilization_Ratio - Average Card Utilization Ratio
What is a Revolving Balance?¶
- If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance.
What is the Average Open to buy?¶
- 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.
What is the Average utilization Ratio?¶
- The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:
- ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
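As a quick sanity check (a small addition, not part of the original notebook), this identity can be verified directly once the dataset has been loaded into a DataFrame named data, as is done later in the notebook. Small deviations are expected because the columns are rounded 12-month averages.

import pandas as pd

# Assumes the dataset has already been read into a DataFrame named `data`,
# e.g. data = pd.read_csv("BankChurners.csv")
identity = data["Avg_Open_To_Buy"] / data["Credit_Limit"] + data["Avg_Utilization_Ratio"]

# Values should be close to 1 up to rounding in the source data
print(identity.describe())
print(((identity - 1).abs() < 1e-2).mean())  # share of rows where the identity holds approximately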
Importing the libraries and overview of the dataset¶
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    classification_report,
    precision_recall_curve
)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
# To build classification models
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For tuning the model
from sklearn.model_selection import GridSearchCV
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
Loading Data¶
# Connect to google
from google.colab import drive
drive.mount('/content/drive')
# Loading the dataset
churn = pd.read_csv("/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Hands-on Session:_Classification_and_Hypothesis_Testing/BankChurners.csv")
# Checking the number of rows and columns in the data
churn.shape
(10127, 21)
Observations:
- There are 10127 observations and 21 columns in the dataset.
Data Overview¶
# Let's create a copy of the data
data = churn.copy()
# Let's view the first 5 rows of the data
data.head()
CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.00000 | 777 | 11914.00000 | 1.33500 | 1144 | 42 | 1.62500 | 0.06100 |
1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.00000 | 864 | 7392.00000 | 1.54100 | 1291 | 33 | 3.71400 | 0.10500 |
2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.00000 | 0 | 3418.00000 | 2.59400 | 1887 | 20 | 2.33300 | 0.00000 |
3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.00000 | 2517 | 796.00000 | 1.40500 | 1171 | 20 | 2.33300 | 0.76000 |
4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.00000 | 0 | 4716.00000 | 2.17500 | 816 | 28 | 2.50000 | 0.00000 |
# Let's view the last 5 rows of the data
data.tail()
CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.00000 | 1851 | 2152.00000 | 0.70300 | 15476 | 117 | 0.85700 | 0.46200 |
10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.00000 | 2186 | 2091.00000 | 0.80400 | 8764 | 69 | 0.68300 | 0.51100 |
10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.00000 | 0 | 5409.00000 | 0.81900 | 10291 | 60 | 0.81800 | 0.00000 |
10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.00000 | 0 | 5281.00000 | 0.53500 | 8395 | 62 | 0.72200 | 0.00000 |
10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.00000 | 1961 | 8427.00000 | 0.70300 | 10294 | 61 | 0.64900 | 0.18900 |
Checking the info of the dataset¶
# Let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Observations:
- There are missing values in the Education_Level and Marital_Status.
- Six columns are of the object type, and the remaining columns are all numerical.
Checking for Duplicate values and Missing Values¶
# Let's check for duplicate values in the data
data.duplicated().sum()
0
# Let's check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
CLIENTNUM 0.00000 Attrition_Flag 0.00000 Customer_Age 0.00000 Gender 0.00000 Dependent_count 0.00000 Education_Level 15.00000 Marital_Status 7.40000 Income_Category 0.00000 Card_Category 0.00000 Months_on_book 0.00000 Total_Relationship_Count 0.00000 Months_Inactive_12_mon 0.00000 Contacts_Count_12_mon 0.00000 Credit_Limit 0.00000 Total_Revolving_Bal 0.00000 Avg_Open_To_Buy 0.00000 Total_Amt_Chng_Q4_Q1 0.00000 Total_Trans_Amt 0.00000 Total_Trans_Ct 0.00000 Total_Ct_Chng_Q4_Q1 0.00000 Avg_Utilization_Ratio 0.00000 dtype: float64
Observations:
- Education_Level has 15% missing values
- Marital_Status has about 7.4% missing values
Univariate analysis of numerical columns¶
# Let's view the statistical summary of the numerical columns in the data
data.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
CLIENTNUM | 10127.00000 | 739177606.33366 | 36903783.45023 | 708082083.00000 | 713036770.50000 | 717926358.00000 | 773143533.00000 | 828343083.00000 |
Customer_Age | 10127.00000 | 46.32596 | 8.01681 | 26.00000 | 41.00000 | 46.00000 | 52.00000 | 73.00000 |
Dependent_count | 10127.00000 | 2.34620 | 1.29891 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 5.00000 |
Months_on_book | 10127.00000 | 35.92841 | 7.98642 | 13.00000 | 31.00000 | 36.00000 | 40.00000 | 56.00000 |
Total_Relationship_Count | 10127.00000 | 3.81258 | 1.55441 | 1.00000 | 3.00000 | 4.00000 | 5.00000 | 6.00000 |
Months_Inactive_12_mon | 10127.00000 | 2.34117 | 1.01062 | 0.00000 | 2.00000 | 2.00000 | 3.00000 | 6.00000 |
Contacts_Count_12_mon | 10127.00000 | 2.45532 | 1.10623 | 0.00000 | 2.00000 | 2.00000 | 3.00000 | 6.00000 |
Credit_Limit | 10127.00000 | 8631.95370 | 9088.77665 | 1438.30000 | 2555.00000 | 4549.00000 | 11067.50000 | 34516.00000 |
Total_Revolving_Bal | 10127.00000 | 1162.81406 | 814.98734 | 0.00000 | 359.00000 | 1276.00000 | 1784.00000 | 2517.00000 |
Avg_Open_To_Buy | 10127.00000 | 7469.13964 | 9090.68532 | 3.00000 | 1324.50000 | 3474.00000 | 9859.00000 | 34516.00000 |
Total_Amt_Chng_Q4_Q1 | 10127.00000 | 0.75994 | 0.21921 | 0.00000 | 0.63100 | 0.73600 | 0.85900 | 3.39700 |
Total_Trans_Amt | 10127.00000 | 4404.08630 | 3397.12925 | 510.00000 | 2155.50000 | 3899.00000 | 4741.00000 | 18484.00000 |
Total_Trans_Ct | 10127.00000 | 64.85869 | 23.47257 | 10.00000 | 45.00000 | 67.00000 | 81.00000 | 139.00000 |
Total_Ct_Chng_Q4_Q1 | 10127.00000 | 0.71222 | 0.23809 | 0.00000 | 0.58200 | 0.70200 | 0.81800 | 3.71400 |
Avg_Utilization_Ratio | 10127.00000 | 0.27489 | 0.27569 | 0.00000 | 0.02300 | 0.17600 | 0.50300 | 0.99900 |
Observations:¶
- CLIENTNUM: It is a unique identifier for customers and can be dropped as it wouldn't add any information to our analysis.
- Customer_Age: Average age of customers is 46 years and it has a wide range from 26 to 73 years.
- Dependent_count: On average the customers in the data have 2 dependents and a maximum of 5 dependents.
- Months_on_book: All the customers of the bank have at least been with them for a year and 50% of the customers for at least 3 years.
- Total_Relationship_Count: All customers use at least one product of the bank, whereas 75% of customers use 5 or fewer products of the bank.
- Months_Inactive_12_mon: On average, customers were inactive for two months in the past 12 months. This suggests that the bank's customers are fairly active in their card usage; it would be interesting to see whether high inactivity leads to churn.
- Contacts_Count_12_mon: On average, the bank and its customers interacted twice in the past 12 months.
- Credit_Limit: There's a huge difference between the third quartile and the maximum value. The range of the credit limit is very wide, from 1438 to 34516; customers with high credit limits might be outliers.
- Total_Revolving_Bal: The average revolving balance of customers is 1162; there's not much difference between the third quartile and the maximum value.
- Avg_Open_To_Buy: The average amount that goes unused by customers is 7469. The range is very wide for this variable, and the extreme values (min and max) might be outliers.
- Total_Amt_Chng_Q4_Q1: For 75% of the customers, the transaction amount in Q4 was less than the transaction amount in Q1 (the third quartile is ~0.86, i.e. below 1).
- Total_Trans_Amt: The average transaction amount over the last 12 months is 4404; some customers spent as little as 510 while others transacted more than 18k.
- Total_Trans_Ct: On average, customers made about 65 transactions, while 75% of customers made 81 or fewer transactions.
- Total_Ct_Chng_Q4_Q1: For 75% of the customers the number of transactions in Q4 was less than the transactions made in Q1.
- Avg_Utilization_Ratio: On average customers used ~27% of the available credit amount of their card, with 75% of the customers utilizing 50% or less of their available credit amount.
Univariate analysis for categorical variables¶
data.describe(include=["object"]).T
count | unique | top | freq | |
---|---|---|---|---|
Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
Gender | 10127 | 2 | F | 5358 |
Education_Level | 8608 | 6 | Graduate | 3128 |
Marital_Status | 9378 | 3 | Married | 4687 |
Income_Category | 10127 | 6 | Less than $40K | 3561 |
Card_Category | 10127 | 4 | Blue | 9436 |
# Printing the sub-categories and their counts for each categorical column
for i in data.describe(include=["object"]).columns:
    print("Unique values in", i, "are :")
    print(data[i].value_counts())
    print("*" * 50)
Unique values in Attrition_Flag are : Attrition_Flag Existing Customer 8500 Attrited Customer 1627 Name: count, dtype: int64 ************************************************** Unique values in Gender are : Gender F 5358 M 4769 Name: count, dtype: int64 ************************************************** Unique values in Education_Level are : Education_Level Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: count, dtype: int64 ************************************************** Unique values in Marital_Status are : Marital_Status Married 4687 Single 3943 Divorced 748 Name: count, dtype: int64 ************************************************** Unique values in Income_Category are : Income_Category Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: count, dtype: int64 ************************************************** Unique values in Card_Category are : Card_Category Blue 9436 Silver 555 Gold 116 Platinum 20 Name: count, dtype: int64 **************************************************
Observations:
- Most of the records are for existing customers.
- Most of the bank's customers are female.
- Most customers are graduates.
- Most customers are married.
- Most customers are in the income group of less than $40K.
- Most customers have a Blue card.
- The 'abc' value in Income_Category can be considered and treated as a missing value.
Data Preprocessing¶
# CLIENTNUM consists of unique IDs for clients and hence will not add value to the modeling
data.drop(["CLIENTNUM"], axis=1, inplace=True)
## Encoding Existing and Attrited customers to 0 and 1 respectively, for analysis.
data["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
data["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)
Exploratory Data Analysis¶
Univariate analysis¶
Let's explore these variables in some more depth by observing their distributions
1.Customer_Age:¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (.15, .85)}, figsize=(10, 6))
sns.set_style("darkgrid")
# Add a graph in each part
sns.boxplot(data=data, x="Customer_Age", ax=ax_box)
sns.histplot(data=data, x="Customer_Age", kde=True, ax=ax_hist)
ax_box.set(xlabel='Customer Age')
plt.show()
Observations:
- Customer_Age is approximately normally distributed, with the mean and median at around 46 years.
- From the boxplot, we can see that there are a few outliers.
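Since the same boxplot-plus-histogram layout is repeated for every numerical column below, a small helper function (a sketch we are adding, using the same seaborn calls as above; the function name is our own) could reduce the repetition:

def boxplot_histogram(df, column, figsize=(10, 6)):
    """Draw a boxplot above a histogram (with KDE) for a numerical column."""
    fig, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={"height_ratios": (0.15, 0.85)}, figsize=figsize
    )
    sns.boxplot(data=df, x=column, ax=ax_box)
    sns.histplot(data=df, x=column, kde=True, ax=ax_hist)
    ax_box.set(xlabel="")
    ax_hist.set_xlabel(column)
    plt.show()

# Example: reproduces the Customer_Age plot above
boxplot_histogram(data, "Customer_Age")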
2.Months_on_book¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x="Months_on_book", ax=ax_box)
sns.histplot(data=data, x="Months_on_book", ax=ax_hist, stat='density')
sns.kdeplot(data=data, x="Months_on_book", ax=ax_hist, color='blue')
ax_box.set(xlabel='Months_on_book')
plt.show()
Observations:
- Most customers have been with the bank for about 3 years (the median is 36 months).
- From the boxplot, we can see that there are outliers on both sides of the whiskers.
3.Credit_Limit¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x="Credit_Limit", ax=ax_box)
sns.histplot(data=data, x="Credit_Limit", ax=ax_hist, stat='density')
sns.kdeplot(data=data, x="Credit_Limit", ax=ax_hist, color='blue')
ax_box.set(xlabel='Credit_Limit')
plt.show()
Observations:
- The distribution of the Credit_Limit is skewed to the right.
- There are quite a few customers at the maximum credit limit of about 35000.
- 50% of the bank's customers have a credit limit of less than 5000.
4.Total_Revolving_Bal¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x="Total_Revolving_Bal", ax=ax_box)
sns.histplot(data=data, x="Total_Revolving_Bal", ax=ax_hist, stat='density')
sns.kdeplot(data=data, x="Total_Revolving_Bal", ax=ax_hist, color='blue')
ax_box.set(xlabel='Total_Revolving_Bal')
plt.show()
Observations:
- Many customers pay their credit card dues in full and have a revolving balance of 0.
- There are also quite a few customers with a revolving balance close to the maximum of about 2500.
5.Avg_Open_To_Buy¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x="Avg_Open_To_Buy", ax=ax_box)
sns.histplot(data=data, x="Avg_Open_To_Buy", ax=ax_hist, stat='density')
sns.kdeplot(data=data, x="Avg_Open_To_Buy", ax=ax_hist, color='blue')
ax_box.set(xlabel='')
ax_hist.set_xlabel('Avg_Open_To_Buy')
plt.tight_layout()
plt.show()
Observations:
- The distribution of the Avg_Open_To_Buy column is right-skewed.
- Low values of Avg_Open_To_Buy can reflect either a low credit limit or heavy usage of the limit; as discussed further below, for most customers this is driven by low credit limits, while only a few customers (on the right tail) are left with a large unused credit amount.
6.Total_Trans_Ct¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x=data["Total_Trans_Ct"], ax=ax_box)
sns.histplot(data["Total_Trans_Ct"], ax=ax_hist, kde=True, stat="density") # Add kde=True and stat="density"
ax_box.set(xlabel='Total_Trans_Ct')
plt.show()
Observations:
- The distribution of Total_Trans_Ct shows two peaks, at around 40 and 80 transactions in a year, which indicates that customers typically used their credit cards roughly 3 to 7 times a month.
7.Total_Amt_Chng_Q4_Q1¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x=data["Total_Amt_Chng_Q4_Q1"], ax=ax_box)
sns.histplot(data["Total_Amt_Chng_Q4_Q1"], ax=ax_hist, kde=True, stat="density", bins=30) # Adjust the number of bins
ax_box.set(xlabel='Total_Amt_Chng_Q4_Q1')
plt.show()
Observations:
- The distribution of Total_Amt_Chng_Q4_Q1 looks normally distributed but there's a slight skew towards the right.
- From the boxplot, we can see that there are outliers on both sides of the whiskers.
8.Total_Trans_Amt¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x=data["Total_Trans_Amt"], ax=ax_box)
sns.histplot(data=data["Total_Trans_Amt"], ax=ax_hist, kde=True, stat='density')
ax_box.set(xlabel='Total_Trans_Amt')
plt.show()
Observations:
- The distribution of Total_Trans_Amt is skewed to the right.
- Data shows two peaks at total transaction amounts, one around 2500 and the other around the mean value of 4500.
- From the boxplot, we can see that there are outliers - customers with more than ~8000 total transaction amounts are being considered as outliers.
- It would be interesting to check whether the customers churning are the ones spending less with the card or the ones spending more. If it is the latter, the bank has a problem, as it is losing valuable customers.
9.Total_Ct_Chng_Q4_Q1¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x=data["Total_Ct_Chng_Q4_Q1"], ax=ax_box)
sns.histplot(data=data["Total_Ct_Chng_Q4_Q1"], ax=ax_hist, kde=True, stat='density')
ax_box.set(xlabel='Total_Ct_Chng_Q4_Q1')
plt.show()
Observations:
- The distribution of Total_Ct_Chng_Q4_Q1 looks normally distributed but there's a slight skew towards the right.
- From the boxplot, we can see that there are outliers on both sides of the whiskers.
10.Avg_Utilization_Ratio¶
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(data=data, x="Avg_Utilization_Ratio", ax=ax_box)
sns.histplot(data=data, x="Avg_Utilization_Ratio", ax=ax_hist, kde=True, stat='density')
sns.kdeplot(data=data, x="Avg_Utilization_Ratio", ax=ax_hist, color='blue')
ax_box.set(xlabel='Avg_Utilization_Ratio')
plt.show()
Observations:
- The distribution of Avg_Utilization_Ratio is skewed to the right.
- This distribution is not a positive sign for the bank as most of the customers are not utilizing their credit amount.
Credit limit, Average open to buy, and Average utilization ratio are all right-skewed.
Open to buy means how much credit a customer has left.
- Low values of Open to buy could represent either:
  - Customers have low credit limits, or
  - Customers are spending a lot, so they have less left open to buy.
Average utilization ratio = (1 - (open to buy / credit limit))
- Low values of the Average utilization ratio represent:
  - (Open to buy / credit limit) is nearly equal to 1 -> Open to buy is nearly equal to the credit limit -> customers are spending less using their credit cards.
Credit limit is also right-skewed, which means most customers have low credit limits.
Based on these three variables, we can conclude that the majority of customers have low credit limits and do not use their credit cards frequently. This justifies the right skewness of all three variables.
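The right skew claimed above can also be quantified directly (a small check we are adding; positive skewness values indicate a right skew):

# Positive values confirm the right (positive) skew of all three columns
print(data[["Credit_Limit", "Avg_Open_To_Buy", "Avg_Utilization_Ratio"]].skew())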
Bivariate Analysis¶
numeric_data = data.select_dtypes(include=['float64', 'int64'])
plt.figure(figsize=(15, 7))
sns.heatmap(numeric_data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Observations:
- Attrition_Flag shows a bit of a negative correlation with Total_Trans_Ct (total transactions) and Total_Trans_Amt (total transaction amount).
- There are strong positive correlations between the pairs Months_on_book and Customer_Age, Total_Revolving_Bal and Avg_Utilization_Ratio, and Total_Trans_Amt and Total_Trans_Ct.
- There are negative correlations between Total_Relationship_Count and both Total_Trans_Amt and Total_Trans_Ct, and between Avg_Utilization_Ratio and both Avg_Open_To_Buy and Credit_Limit.
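To read the relationship with the target more directly than from the full heatmap, the correlations with the encoded Attrition_Flag can be listed and sorted (a small addition using the same numeric_data frame defined above):

# Correlation of each numeric feature with the encoded target, sorted ascending
print(
    numeric_data.corr()["Attrition_Flag"]
    .drop("Attrition_Flag")
    .sort_values()
)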
Attrition_Flag vs Gender¶
(pd.crosstab(data['Gender'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- There's not much difference in attrition percentages for Males and Females.
- ~20% of both Males and Females attrite.
Attrition_Flag vs Marital_Status¶
(pd.crosstab(data['Marital_Status'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- There's not much difference in attrition percentages for Marital_Status.
- ~20% of Singles, Divorced attrite.
- Married customers attrite the least.
Attrition_Flag vs Education_Level¶
(pd.crosstab(data['Education_Level'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- Customers with higher education - Doctorates and Post Graduates are the ones most (~20% for both education levels) attriting.
Attrition_Flag vs Income_Category¶
(pd.crosstab(data['Income_Category'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- The customers from the two extreme income groups (earning less than $40K and earning $120K+) are the ones attriting the most.
Attrition_Flag vs Card_Category¶
(pd.crosstab(data['Card_Category'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- Attrition is highest (~35%) among customers with Platinum cards, followed by ~30% among customers with Gold cards.
- Customers with Platinum and Gold cards are our premium customers, and the highest attrition being among these customers is alarming, as they are the ones using the premium cards provided by the bank.
Attrition_Flag vs Contacts_Count_12_mon¶
(pd.crosstab(data['Contacts_Count_12_mon'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- Highest attrition is among the customers who interacted the most with the bank.
- This signifies that the bank is not able to resolve the problems faced by customers leading to attrition.
- A preliminary step to identify attriting customers would be to look out for customers who have reached out to them repeatedly.
Attrition_Flag vs Months_Inactive_12_mon¶
(pd.crosstab(data['Months_Inactive_12_mon'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- As inactivity increases, attrition also increases (over the 2-4 month range).
- The pattern at the extremes (0 and 6 months) is harder to interpret: customers who used the card most recently attrited the most, while those who were inactive for 6 months attrited less.
Attrition_Flag vs Total_Relationship_Count¶
(pd.crosstab(data['Total_Relationship_Count'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- Attrition is highest among the customers who are using 1 or 2 products offered by the bank - together they constitute ~55% of the attrition.
- Customers who use more than 3 products are the ones least attriting, such customers might be more financially stable and actively invest in different services provided by the bank.
Attrition_Flag vs Dependent_count¶
(pd.crosstab(data['Dependent_count'],data['Attrition_Flag'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
plt.ylabel('Percentage Attrition %')
Text(0, 0.5, 'Percentage Attrition %')
Observations:
- The more dependents a customer has, the higher the attrition; more responsibilities might lead to financial instability for such customers.
- Attrition is fairly low for customers with 0 or 1 dependents.
Let's find the percentage of outliers, in each column of the data, using IQR.¶
numeric_data = data.select_dtypes(include=['float64', 'int64'])
# Calculate quartiles and IQR
Q1 = numeric_data.quantile(0.25)
Q3 = numeric_data.quantile(0.75)
IQR = Q3 - Q1
# Find lower and upper bounds for outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
(
(data.select_dtypes(include=["float64", "int64"]) < lower)
| (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
Attrition_Flag 16.06596 Customer_Age 0.01975 Dependent_count 0.00000 Months_on_book 3.81159 Total_Relationship_Count 0.00000 Months_Inactive_12_mon 3.26849 Contacts_Count_12_mon 6.21112 Credit_Limit 9.71660 Total_Revolving_Bal 0.00000 Avg_Open_To_Buy 9.50923 Total_Amt_Chng_Q4_Q1 3.91034 Total_Trans_Amt 8.84764 Total_Trans_Ct 0.01975 Total_Ct_Chng_Q4_Q1 3.89059 Avg_Utilization_Ratio 0.00000 dtype: float64
Observations:
- After identifying outliers, we can decide whether to remove/treat them or not. It depends on one's approach; here we are not going to treat them, as there will be outliers in real-world scenarios (in age, total transaction amount, number of transactions, etc.), and we would want our model to learn the underlying pattern for such customers.
Missing value imputation¶
- We will first replace 'abc' values with 'np.nan' in Income_Category.
- We will impute missing values in all 3 columns using mode.
data1 = data.copy()
data1["Income_Category"].replace("abc", np.nan, inplace=True)
data1.isna().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 1112 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Separating the independent variables (X) and the dependent variable (Y)
X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"]
Splitting the data into 70% train and 30% test set
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.
# Splitting the data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)
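As a quick check (not part of the original flow), the class proportions can be compared to confirm that stratify=y preserved the class balance in both splits:

# Class proportions should be nearly identical across the full data, train, and test sets
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))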
Missing Values Imputation
# Columns to impute
reqd_col_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# Simple Imputer with Mode
imputer = SimpleImputer(strategy="most_frequent")
# Fit and transform the train data
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])
# Transform the test data
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute])
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Observations:
- All missing values have been treated.
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("*" * 30)
Gender F 3770 M 3318 Name: count, dtype: int64 ****************************** Education_Level Graduate 3247 High School 1425 Uneducated 1031 College 709 Post-Graduate 364 Doctorate 312 Name: count, dtype: int64 ****************************** Marital_Status Married 3815 Single 2771 Divorced 502 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 3273 $40K - $60K 1254 $80K - $120K 1084 $60K - $80K 974 $120K + 503 Name: count, dtype: int64 ****************************** Card_Category Blue 6621 Silver 375 Gold 78 Platinum 14 Name: count, dtype: int64 ******************************
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 30)
Gender F 3770 M 3318 Name: count, dtype: int64 ****************************** Education_Level Graduate 3247 High School 1425 Uneducated 1031 College 709 Post-Graduate 364 Doctorate 312 Name: count, dtype: int64 ****************************** Marital_Status Married 3815 Single 2771 Divorced 502 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 3273 $40K - $60K 1254 $80K - $120K 1084 $60K - $80K 974 $120K + 503 Name: count, dtype: int64 ****************************** Card_Category Blue 6621 Silver 375 Gold 78 Platinum 14 Name: count, dtype: int64 ******************************
Encoding categorical variables¶
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_test.shape)
(7088, 29) (3039, 29)
Observations:
- After encoding there are 29 columns.
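One caveat with calling pd.get_dummies separately on the train and test sets is that a category missing from one split would produce mismatched columns. Here the split is large enough that every category appears in both sets (both frames end up with 29 columns), but a defensive alignment step (a sketch we are adding, not in the original notebook) would look like this:

# Align the test columns to the training columns; any dummy column absent from
# the test split is added and filled with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)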
X_train.head()
Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4124 | 50 | 1 | 43 | 6 | 1 | 2 | 7985.00000 | 0 | 7985.00000 | 1.03200 | 3873 | 72 | 0.67400 | 0.00000 | False | False | True | False | False | False | True | False | False | False | False | True | False | False | False |
4686 | 50 | 0 | 36 | 3 | 3 | 2 | 5444.00000 | 2499 | 2945.00000 | 0.46800 | 4509 | 80 | 0.66700 | 0.45900 | True | False | True | False | False | False | False | False | False | True | False | False | False | False | False |
1276 | 26 | 0 | 13 | 6 | 3 | 4 | 1643.00000 | 1101 | 542.00000 | 0.71300 | 2152 | 50 | 0.47100 | 0.67000 | False | False | True | False | False | False | False | True | True | False | False | False | False | False | False |
6119 | 65 | 0 | 55 | 3 | 3 | 0 | 2022.00000 | 0 | 2022.00000 | 0.57900 | 4623 | 65 | 0.54800 | 0.00000 | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False |
2253 | 46 | 3 | 35 | 6 | 3 | 4 | 4930.00000 | 0 | 4930.00000 | 1.01900 | 3343 | 77 | 0.63800 | 0.00000 | True | False | True | False | False | False | False | True | False | False | True | False | False | False | False |
Model evaluation criterion¶
The model can make two types of wrong predictions:
- Predicting a customer will attrite and the customer doesn't attrite.
- Predicting a customer will not attrite and the customer attrites.
Which case is more important?
- Predicting that a customer will not attrite when the customer actually attrites, i.e. losing a valuable customer or asset.
How to reduce this loss i.e the need to reduce False Negatives?
- The bank would want Recall to be maximized: the greater the Recall, the higher the chances of minimizing false negatives. Hence, the focus should be on increasing Recall, i.e. minimizing the false negatives, or in other words correctly identifying the positive class (Class 1), so that the bank can retain its valuable customers by identifying those at risk of attrition.
Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
# Creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['Not Attrite', 'Attrite'], yticklabels=['Not Attrite', 'Attrite'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
Building the model¶
We will be building 4 different models:
- Logistic Regression
- Support Vector Machine(SVM)
- Decision Tree
- Random Forest
Logistic Regression Model¶
Logistic Regression is a supervised learning algorithm used for binary classification problems, i.e. where the dependent variable is categorical and has only two possible values. In logistic regression, we apply the sigmoid function to a linear combination of the features x to calculate the probability of the event y = 1:
P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))
# Fitting logistic regression model
lg = LogisticRegression()
lg.fit(X_train,y_train)
LogisticRegression()
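To tie the fitted model back to the sigmoid formula above, the class-1 probabilities can be reproduced by hand from the learned intercept and coefficients (a small check we are adding; the cast to float is needed because get_dummies produced boolean columns):

# Linear term: intercept + X · coefficients
z = lg.intercept_[0] + X_train.astype(float).values @ lg.coef_[0]

# The sigmoid of the linear term should match sklearn's predict_proba for class 1
manual_proba = 1 / (1 + np.exp(-z))
print(np.allclose(manual_proba, lg.predict_proba(X_train)[:, 1]))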
Let's check the model performance
# Checking the performance on the training data
y_pred_train = lg.predict(X_train)
metrics_score(y_train, y_pred_train)
precision recall f1-score support 0 0.90 0.96 0.93 5949 1 0.69 0.44 0.53 1139 accuracy 0.88 7088 macro avg 0.79 0.70 0.73 7088 weighted avg 0.87 0.88 0.87 7088
- The reported average includes the macro average which averages the unweighted mean per label, and the weighted average i.e. averaging the support-weighted mean per label.
- In classification, the class of interest is considered the positive class. Here, the class of interest is 1 i.e. identifying the customers who are at risk of attrition.
Reading the confusion matrix (clockwise):
True Negative (Actual=0, Predicted=0): Model predicts that a customer would not attrite and the customer does not attrite
False Positive (Actual=0, Predicted=1): Model predicts that a customer would attrite but the customer does not attrite
False Negative (Actual=1, Predicted=0): Model predicts that a customer would not attrite but the customer attrites
True Positive (Actual=1, Predicted=1): Model predicts that a customer would attrite and the customer attrites
# Checking the performance on the test dataset
y_pred_test = lg.predict(X_test)
metrics_score(y_test, y_pred_test)
precision recall f1-score support 0 0.90 0.96 0.93 2551 1 0.70 0.44 0.54 488 accuracy 0.88 3039 macro avg 0.80 0.70 0.74 3039 weighted avg 0.87 0.88 0.87 3039
Observations:
- We are getting an accuracy of around 88% on both the train and test datasets.
- However, the recall for class 1 is only around 44% on both the train and test datasets.
- As the recall is low, this model will not perform well in identifying the customers who have a high chance of leaving the bank, meaning it will not help in reducing the attrition rate.
- As we can see from the Confusion Matrix, this model fails to identify the majority of customers who will attrite.
Let's check the coefficients and find which variables are leading to attrition and which can help to reduce the attrition:
# Printing the coefficients of logistic regression
cols=X_train.columns
coef_lg=lg.coef_
pd.DataFrame(coef_lg,columns=cols).T.sort_values(by=0,ascending=False)
0 | |
---|---|
Contacts_Count_12_mon | 0.31531 |
Months_Inactive_12_mon | 0.24919 |
Dependent_count | 0.20002 |
Customer_Age | 0.11755 |
Income_Category_Less than $40K | 0.05269 |
Marital_Status_Single | 0.04579 |
Education_Level_Graduate | 0.01085 |
Education_Level_Post-Graduate | 0.00870 |
Education_Level_Doctorate | 0.00737 |
Education_Level_Uneducated | 0.00455 |
Card_Category_Silver | 0.00365 |
Card_Category_Gold | 0.00318 |
Income_Category_$40K - $60K | 0.00264 |
Card_Category_Platinum | 0.00175 |
Avg_Utilization_Ratio | 0.00137 |
Total_Trans_Amt | 0.00038 |
Avg_Open_To_Buy | 0.00034 |
Education_Level_High School | -0.00032 |
Credit_Limit | -0.00035 |
Total_Revolving_Bal | -0.00068 |
Income_Category_$80K - $120K | -0.00609 |
Total_Amt_Chng_Q4_Q1 | -0.01180 |
Income_Category_$60K - $80K | -0.01614 |
Marital_Status_Married | -0.01957 |
Gender_M | -0.03225 |
Total_Ct_Chng_Q4_Q1 | -0.05203 |
Months_on_book | -0.09103 |
Total_Trans_Ct | -0.09617 |
Total_Relationship_Count | -0.21321 |
Observations:
Features which positively affect on the attrition rate are:
- Contacts_Count_12_mon
- Months_Inactive_12_mon
- Dependent_count
- Customer_Age
- Income_Category_Less than $40K
- Marital_Status_Single
- Education_Level_Graduate
- Education_Level_Post-Graduate
- Education_Level_Doctorate
- Avg_Utilization_Ratio
Features which negatively affect on the attrition rate are:
- Total_Relationship_Count
- Total_Trans_Ct
- Months_on_book
- Total_Ct_Chng_Q4_Q1
- Marital_Status_Married
- Income_Category_$60K - $80K
- Total_Amt_Chng_Q4_Q1
Observations:
Based on the Logistic Regression model, Contacts_Count_12_mon is the most important feature in detecting whether a customer will attrite or not. The highest attrition is among the customers who interacted the most with the bank, which suggests that the bank is not able to resolve the problems faced by customers, leading to attrition.
This model also suggests that attrition depends on the customers' activity: as inactivity increases, attrition also increases.
Dependent_count is an important variable in predicting the attrition rate. The more dependents a customer has, the higher the attrition; more responsibilities might lead to financial instability for such customers.
The education level of customers also shows an interesting pattern: customers with higher education (Doctorates and Post-Graduates) are the ones attriting the most.
The customers in the income group earning less than $40K are the ones attriting the most.
Other features which appear to affect the chances of attrition are Marital Status and Avg_Utilization_Ratio.
The model also captures the inverse relation between Total_Relationship_Count and attrition, suggesting that customers who use more of the bank's products are the least likely to attrite; such customers might be more financially stable and actively invest in different services provided by the bank.
Customers who are doing more transactions with the bank have a lower chance of attrition; fewer transactions lead to higher attrition.
From Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1, it is clear that customers who didn't attrite showed less variability from Q4 to Q1 than the ones who attrited.
The coefficients of the logistic regression model give us the log of odds, which is hard to interpret in the real world. We can convert the log of odds into real odds by taking its exponential.
# Finding the odds
odds = np.exp(lg.coef_[0])
# Adding the odds to a dataframe and sorting the values
pd.DataFrame(odds, X_train.columns, columns=['odds']).sort_values(by='odds', ascending=False)
odds | |
---|---|
Contacts_Count_12_mon | 1.37069 |
Months_Inactive_12_mon | 1.28299 |
Dependent_count | 1.22143 |
Customer_Age | 1.12474 |
Income_Category_Less than $40K | 1.05410 |
Marital_Status_Single | 1.04686 |
Education_Level_Graduate | 1.01091 |
Education_Level_Post-Graduate | 1.00874 |
Education_Level_Doctorate | 1.00739 |
Education_Level_Uneducated | 1.00456 |
Card_Category_Silver | 1.00366 |
Card_Category_Gold | 1.00319 |
Income_Category_$40K - $60K | 1.00265 |
Card_Category_Platinum | 1.00176 |
Avg_Utilization_Ratio | 1.00137 |
Total_Trans_Amt | 1.00038 |
Avg_Open_To_Buy | 1.00034 |
Education_Level_High School | 0.99968 |
Credit_Limit | 0.99965 |
Total_Revolving_Bal | 0.99932 |
Income_Category_$80K - $120K | 0.99392 |
Total_Amt_Chng_Q4_Q1 | 0.98827 |
Income_Category_$60K - $80K | 0.98399 |
Marital_Status_Married | 0.98062 |
Gender_M | 0.96826 |
Total_Ct_Chng_Q4_Q1 | 0.94930 |
Months_on_book | 0.91299 |
Total_Trans_Ct | 0.90831 |
Total_Relationship_Count | 0.80799 |
Observations
- The odds of a customer who contacts the bank more often attriting are about 1.3 times the odds of one who does not, probably because the bank is not able to resolve the problems faced by customers, leading to attrition.
- The odds of an inactive customer attriting are about 1.2 times the odds of a customer who stays actively in touch with the bank.
- The odds of a customer with more dependents attriting are about 1.2 times the odds of a customer with fewer or no dependents.
Precision-Recall Curve for logistic regression¶
Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
# Predict_proba gives the probability of each observation belonging to each class
y_scores_lg=lg.predict_proba(X_train)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
# Calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds_lg)):
    if precisions_lg[i] == recalls_lg[i]:
        print(thresholds_lg[i])
0.34806960434560535
Observation:
- We can see that precision and recall are balanced for a threshold of about ~0.35.
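The exact-equality loop above happens to find a crossing point here, but in general the sampled precision and recall values may never be exactly equal at any threshold; a more robust variant (our own small addition) picks the threshold where the two are closest:

# Threshold at which |precision - recall| is smallest
idx = np.argmin(np.abs(precisions_lg[:-1] - recalls_lg[:-1]))
print(thresholds_lg[idx])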
Let's find out the performance of the model at this threshold
optimal_threshold=.35
y_pred_train = lg.predict_proba(X_train)
metrics_score(y_train, y_pred_train[:,1]>optimal_threshold)
precision recall f1-score support 0 0.92 0.93 0.92 5949 1 0.61 0.60 0.61 1139 accuracy 0.87 7088 macro avg 0.77 0.76 0.77 7088 weighted avg 0.87 0.87 0.87 7088
Observations:
- The model performance has improved. The recall has increased significantly for class 1.
Let's check the performance on the test data.
optimal_threshold1=.35
y_pred_test = lg.predict_proba(X_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold1)
precision recall f1-score support 0 0.92 0.93 0.93 2551 1 0.62 0.60 0.61 488 accuracy 0.88 3039 macro avg 0.77 0.77 0.77 3039 weighted avg 0.88 0.88 0.88 3039
Observation:
- The model is giving a similar performance on the test and train data i.e. the model is giving a generalized performance.
- The recall of the test data has increased significantly while at the same time, the precision has decreased slightly, which is to be expected while adjusting the threshold.
- The average recall and precision for the model are good but let's see if we can get better performance using other algorithms.
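Since the same adjust-the-threshold evaluation is repeated for the SVM below, a small helper (a convenience wrapper we are adding around the notebook's metrics_score function) keeps the pattern in one place:

def evaluate_at_threshold(model, X, y, threshold):
    """Evaluate a fitted probabilistic classifier at a custom decision threshold."""
    proba = model.predict_proba(X)[:, 1]
    metrics_score(y, proba > threshold)

# Example: reproduces the logistic regression results above
evaluate_at_threshold(lg, X_train, y_train, 0.35)
evaluate_at_threshold(lg, X_test, y_test, 0.35)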
Support Vector Machines¶
# To Speed-Up SVM training.
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)
Let's build the models using two of the widely used kernel functions:
- Linear Kernel
- RBF Kernel
Linear Kernel¶
# Fitting SVM
svm = SVC(kernel = 'linear') # Linear kernel or linear decision boundary
model = svm.fit(X = X_train_scaled, y = y_train)
y_pred_train_svm = model.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
precision recall f1-score support 0 0.92 0.97 0.95 5949 1 0.80 0.58 0.67 1139 accuracy 0.91 7088 macro avg 0.86 0.77 0.81 7088 weighted avg 0.90 0.91 0.90 7088
# Checking performance on the test data
y_pred_test_svm = model.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
precision recall f1-score support 0 0.92 0.97 0.94 2551 1 0.78 0.56 0.65 488 accuracy 0.90 3039 macro avg 0.85 0.76 0.80 3039 weighted avg 0.90 0.90 0.90 3039
- The SVM model with a linear kernel is not overfitting, as the accuracy is around 90% for both the train and test datasets.
- The recall of class 1 is only around 56-58%, which implies the model will miss many of the customers who are likely to attrite.
- The precision is quite good, so the customers the model does flag as likely to attrite are mostly true positives, which saves cost and effort for the bank.
RBF Kernel¶
svm_rbf=SVC(kernel='rbf',probability=True)
svm_rbf.fit(X_train_scaled,y_train)
y_scores_svm=svm_rbf.predict_proba(X_train_scaled) # Predict_proba gives the probability of each observation belonging to each class
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
# Calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds_svm)):
    if precisions_svm[i] == recalls_svm[i]:
        print(thresholds_svm[i])
0.3122015685746674
optimal_threshold1=0.31
y_pred_train = svm_rbf.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train[:,1]>optimal_threshold1)
precision recall f1-score support 0 0.96 0.96 0.96 5949 1 0.77 0.77 0.77 1139 accuracy 0.93 7088 macro avg 0.86 0.86 0.86 7088 weighted avg 0.93 0.93 0.93 7088
y_pred_test = svm_rbf.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold1)
precision recall f1-score support 0 0.94 0.93 0.94 2551 1 0.67 0.71 0.69 488 accuracy 0.90 3039 macro avg 0.80 0.82 0.81 3039 weighted avg 0.90 0.90 0.90 3039
Observations:
- At the optimal threshold of 0.31, the model performance has improved significantly. The recall for class 1 has improved from around 0.56 (linear kernel) to 0.77 on the train set, roughly a 20-percentage-point increase, and the model is giving reasonably generalized results.
- Moreover, this model uses the RBF kernel, so the model is performing well with a non-linear decision boundary.
- As the recall is good, this model will perform well in identifying those customers who have a high chance of leaving the bank, meaning it will eventually help in reducing the attrition rate.
Decision Tree¶
# Building decision tree model
model_dt= DecisionTreeClassifier(random_state=1,max_depth=8)
model_dt.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=8, random_state=1)
Let's check the model performance of decision tree
# Checking performance on the training dataset
pred_train_dt = model_dt.predict(X_train)
metrics_score(y_train, pred_train_dt)
precision recall f1-score support 0 0.98 0.99 0.99 5949 1 0.93 0.92 0.93 1139 accuracy 0.98 7088 macro avg 0.96 0.95 0.96 7088 weighted avg 0.98 0.98 0.98 7088
Observation:
- Model has performed very well on the training set.
- As we know, a decision tree will continue to grow and classify each data point correctly if no restrictions are applied, since the tree learns all the patterns in the training set.
Let's check the performance on test data to see if the model is overfitting.
pred_test_dt = model_dt.predict(X_test)
metrics_score(y_test, pred_test_dt)
precision recall f1-score support 0 0.97 0.96 0.97 2551 1 0.82 0.82 0.82 488 accuracy 0.94 3039 macro avg 0.89 0.89 0.89 3039 weighted avg 0.94 0.94 0.94 3039
Observation:
- The decision tree model is slightly overfitting the data here.
- We can tune the hyperparameters to increase the performance and reduce overfitting.
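GridSearchCV and StratifiedKFold were imported at the top but are not used in this section; a hedged sketch of how the tree's hyperparameters could be tuned for recall is shown below (the grid values are illustrative examples, not tuned recommendations):

# Illustrative parameter grid - values are examples only
dt_param_grid = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_leaf": [1, 5, 10, 25],
    "class_weight": [None, "balanced"],
}

dt_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    dt_param_grid,
    scoring="recall",  # recall of class 1 is the metric of interest here
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    n_jobs=-1,
)
dt_grid.fit(X_train, y_train)
print(dt_grid.best_params_, round(dt_grid.best_score_, 3))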
Let's visualize the decision tree and observe the decision rules:
features = list(X_train.columns)
plt.figure(figsize=(20,20))
from sklearn import tree
tree.plot_tree(model_dt,feature_names=features,max_depth =4, filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Importance of features in the tree building
feature_names = list(X_train.columns)
importances = model_dt.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observation:
- So, Total_Trans_Ct is the most important feature, followed by Total_Revolving_Bal and Total_Trans_Amt, which makes sense: customers who are doing more transactions with the bank have a lower chance of attrition.
- Total_Ct_Chng_Q4_Q1, Total_Relationship_Count, and Total_Amt_Chng_Q4_Q1 are also important factors.
Random Forest¶
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier(random_state = 1)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train_rf)
precision recall f1-score support 0 1.00 1.00 1.00 5949 1 1.00 1.00 1.00 1139 accuracy 1.00 7088 macro avg 1.00 1.00 1.00 7088 weighted avg 1.00 1.00 1.00 7088
Observation:
- For all the metrics in the training dataset, the Random Forest gives a 100% score.
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test_rf)
precision recall f1-score support 0 0.96 0.99 0.97 2551 1 0.92 0.79 0.85 488 accuracy 0.96 3039 macro avg 0.94 0.89 0.91 3039 weighted avg 0.95 0.96 0.95 3039
Observations:
- The Random Forest classifier seems to be overfitting the training data. The recall on the training data is 1, while the recall on the test data is only ~ 0.80 for class 1.
- Precision is high for the test data as well.
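cross_val_score was also imported earlier; as a quick addition, cross-validated recall on the training folds gives a less optimistic view of the random forest than the in-sample scores above:

# 5-fold cross-validated recall for the positive (attrited) class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
rf_cv_recall = cross_val_score(
    RandomForestClassifier(random_state=1), X_train, y_train,
    scoring="recall", cv=cv, n_jobs=-1,
)
print(round(rf_cv_recall.mean(), 3), round(rf_cv_recall.std(), 3))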
Let's check the feature importance of the Random Forest
importances = rf_estimator.feature_importances_
columns = X_train.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x = importance_df.Importance, y = importance_df.index)
<Axes: xlabel='Importance', ylabel='None'>
Observations:
- The Random Forest further verifies the result from the decision tree that the most important features are Total_Trans_Ct and Total_Trans_Amt.
Business Recommendations:¶
We have successfully built a predictive model that the bank can:
a) Deploy to identify customers at risk of attrition.
b) Use to determine the key factors driving attrition.
c) Leverage to take appropriate actions and develop better customer retention policies.
Key Factors Driving Attrition:¶
Total_Trans_Ct (Total Transaction Count)
Customers with fewer transactions in a year are more likely to churn. To encourage card usage, the bank can offer incentives such as cashback, special discounts on purchases, and other promotional benefits.

Total_Revolving_Bal (Total Revolving Balance)
Customers with lower revolving balances tend to attrite, likely because they have cleared their dues and opted out of the credit card service. After a customer clears their balance, the bank should request feedback to understand the reasons behind their decision.

Total_Trans_Amt (Total Transaction Amount)
A lower transaction count often results in a lower total transaction amount, ultimately increasing the risk of attrition. The bank can introduce offers on high-value purchases, benefiting both the customer and the bank.

Total_Relationship_Count
Attrition is highest among customers using only one or two bank products, accounting for approximately 55% of total attrition. The bank should investigate the issues customers face with these products. Improved customer support and greater transparency may help retain them.
Additional Insights:¶
Targeting Female Customers
Female customers tend to utilize their credit more, make frequent transactions, and spend higher amounts. However, their credit limits are relatively lower. Increasing their credit limits could be beneficial for both the bank and the customers.

Months_Inactive
The likelihood of attrition increases with inactivity. Customers inactive for 2-4 months contribute the most to attrition. The bank can engage them through automated messages highlighting their monthly activity, new offers, or services.

Customer Support and Attrition
The highest attrition rates are observed among customers who frequently contact the bank, indicating unresolved issues. Implementing a feedback system can help assess customer satisfaction with issue resolution. If dissatisfaction persists, the bank should take corrective action promptly.
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Hands-on Session:_Classification_and_Hypothesis_Testing/Credit_Card_Users_Churn_Prediction.ipynb"
[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Hands-on Session:_Classification_and_Hypothesis_Testing/Credit_Card_Users_Churn_Prediction.ipynb to html [NbConvertApp] WARNING | Alternative text is missing on 37 image(s). [NbConvertApp] Writing 2528924 bytes to /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Hands-on Session:_Classification_and_Hypothesis_Testing/Credit_Card_Users_Churn_Prediction.html