MLS Case Study: Unsupervised Learning¶
Welcome to the case study on Unsupervised Learning. We will be using the Credit Card Customer Data for this case study.
Problem Statement:¶
AllLife Bank wants to focus on its credit card customer base in the next financial year. They have been advised by their marketing research team that the penetration in the market can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that the customers perceive the support services of the bank poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and Head of Delivery both decide to reach out to the Data Science team for help.
Objective:¶
Identify different segments in the existing customer base based on their spending patterns as well as past interactions with the bank.
About the data:¶
The data is of various customers of a bank and includes their average credit limit, the total number of credit cards each customer holds, and the different channels through which the customer has contacted the bank with queries. These channels include visiting the bank, contacting it online, and calling a call centre.
- Sl_no - Customer Serial Number
- Customer Key - Customer identification
- Avg_Credit_Limit - Average credit limit (the currency is not specified; you can make an assumption around this)
- Total_Credit_Cards - Total number of credit cards
- Total_visits_bank - Total bank visits
- Total_visits_online - Total online visits
- Total_calls_made - Total calls made
Importing libraries and overview of the dataset¶
# Import all the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# Importing clustering algorithms
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
# scikit-learn-extra provides K-Medoids, which is not part of core scikit-learn
!pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings("ignore")
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loading data¶
data = pd.read_excel('Credit Card Customer Data.xlsx')
data.head()
| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 87073 | 100000 | 2 | 1 | 1 | 0 |
| 1 | 2 | 38414 | 50000 | 3 | 0 | 10 | 9 |
| 2 | 3 | 17341 | 50000 | 7 | 1 | 3 | 4 |
| 3 | 4 | 40496 | 30000 | 5 | 1 | 1 | 4 |
| 4 | 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |
Check the info of the data¶
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 655 entries, 0 to 654
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                655 non-null    int64
 1   Customer Key         655 non-null    int64
 2   Avg_Credit_Limit     655 non-null    int64
 3   Total_Credit_Cards   655 non-null    int64
 4   Total_visits_bank    655 non-null    int64
 5   Total_visits_online  655 non-null    int64
 6   Total_calls_made     655 non-null    int64
dtypes: int64(7)
memory usage: 35.9 KB
Observations:
- There are 655 observations and 7 columns in the dataset.
- All columns have 655 non-null values i.e. there are no missing values.
- All columns are of int64 data type.
There are no missing values. Let us now check the number of unique values in each column.
data.nunique()
Sl_No                  655
Customer Key           655
Avg_Credit_Limit       109
Total_Credit_Cards      10
Total_visits_bank        6
Total_visits_online     16
Total_calls_made        11
dtype: int64
- Customer Key, which is an identifier, has no repeated values. Identifiers carry no information for clustering, so we should treat this column accordingly (i.e. drop it) before applying any algorithm.
Data Preprocessing and Exploratory Data Analysis¶
First we need to drop the variables that are not required for our analysis.
data.drop(columns = ['Sl_No', 'Customer Key'], inplace = True)
Now that we have dropped the unnecessary columns, we can check for duplicates. Duplicates here mean customers with identical values across all features.
data[data.duplicated()]
| | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|
| 162 | 8000 | 2 | 0 | 3 | 4 |
| 175 | 6000 | 1 | 0 | 2 | 5 |
| 215 | 8000 | 4 | 0 | 4 | 7 |
| 295 | 10000 | 6 | 4 | 2 | 3 |
| 324 | 9000 | 4 | 5 | 0 | 4 |
| 360 | 18000 | 6 | 3 | 1 | 4 |
| 377 | 12000 | 6 | 5 | 2 | 1 |
| 384 | 8000 | 7 | 4 | 2 | 0 |
| 394 | 5000 | 4 | 5 | 0 | 1 |
| 452 | 47000 | 6 | 2 | 0 | 4 |
| 494 | 52000 | 4 | 2 | 1 | 2 |
We can drop these duplicated rows from the data.
data=data[~data.duplicated()]
data.shape
(644, 5)
- After removing the identifier columns and the duplicated rows, there are 644 unique observations and 5 columns in our data.
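Since dropping rows leaves gaps in the DataFrame index (the removed rows kept labels like 162 and 175), an optional housekeeping step is to reset it. A minimal sketch, assuming we want a clean positional index:

# Reset the index so row labels run 0..643 again after dropping duplicates
data = data.reset_index(drop=True)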
Summary Statistics¶
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Avg_Credit_Limit | 644.0 | 34543.478261 | 37428.704286 | 3000.0 | 11000.0 | 18000.0 | 48000.00 | 200000.0 |
| Total_Credit_Cards | 644.0 | 4.694099 | 2.175338 | 1.0 | 3.0 | 5.0 | 6.00 | 10.0 |
| Total_visits_bank | 644.0 | 2.395963 | 1.626964 | 0.0 | 1.0 | 2.0 | 4.00 | 5.0 |
| Total_visits_online | 644.0 | 2.624224 | 2.957728 | 0.0 | 1.0 | 2.0 | 4.00 | 15.0 |
| Total_calls_made | 644.0 | 3.608696 | 2.880025 | 0.0 | 1.0 | 3.0 | 5.25 | 10.0 |
Observations:
- The average credit limit has a wide range, from a minimum of 3K to a maximum of 200K.
- The mean of the average credit limit is approximately 34.5K, with a large standard deviation of 37.4K (a dispersion check follows below).
- The average number of credit cards per customer is approximately 5 (rounding to the nearest integer).
- On average, a customer makes 2 bank visits, 3 online visits, and 4 calls.
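One way to substantiate how dispersed Avg_Credit_Limit is relative to the other features is the coefficient of variation (standard deviation divided by mean, which is unitless and therefore comparable across features); a quick sketch:

# Coefficient of variation for each feature: higher means more dispersed relative to its mean
print((data.std() / data.mean()).round(2))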
Now let's explore each variable at hand. We will check the distribution and outliers for each variable in the data.¶
for col in data.columns:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()
Avg_Credit_Limit Skew : 2.19
Total_Credit_Cards Skew : 0.17
Total_visits_bank Skew : 0.15
Total_visits_online Skew : 2.21
Total_calls_made Skew : 0.65
Observation:
- The average credit limit and total online visits have right-skewed distributions with some outliers (see the sketch below for one common way to reduce such skew).
- The majority of customers have 4 credit cards, 2 bank visits, and 4 calls.
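As referenced above, a log transform is one common option for taming heavily right-skewed features. A minimal sketch, not used in the rest of this case study (which relies on z-scoring instead):

# np.log1p computes log(1 + x), which is safe for the zero counts in Total_visits_online
skewed_cols = ['Avg_Credit_Limit', 'Total_visits_online']
data_log = data.copy()
data_log[skewed_cols] = np.log1p(data_log[skewed_cols])
print(data_log[skewed_cols].skew().round(2))  # skew should be much closer to 0 than the raw values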
Now, let's check the correlation among different variables.
plt.figure(figsize=(8,6))
sns.heatmap(data.corr(), annot=True, fmt='0.2f')
plt.show()
Observation:
- Avg_Credit_Limit is positively correlated with Total_Credit_Cards and Total_visits_online, which makes sense: customers with higher limits tend to hold more cards and transact online more often.
- Avg_Credit_Limit is negatively correlated with Total_calls_made and Total_visits_bank.
- Total_visits_bank, Total_visits_online, and Total_calls_made are negatively correlated with each other, which suggests that the majority of customers use only one of these channels to contact the bank.
Scaling the data¶
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data_scaled.head()
| | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|
| 0 | 1.750192 | -1.239437 | -0.858684 | -0.549573 | -1.253982 |
| 1 | 0.413280 | -0.779381 | -1.473803 | 2.495669 | 1.873420 |
| 2 | 0.413280 | 1.060843 | -0.858684 | 0.127148 | 0.135974 |
| 3 | -0.121485 | 0.140731 | -0.858684 | -0.549573 | 0.135974 |
| 4 | 1.750192 | 0.600787 | -1.473803 | 3.172390 | -0.211515 |
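As a quick sanity check on the scaling, every column of the z-scored data should now have a mean of approximately 0 and a standard deviation of approximately 1:

# Verify the z-scoring: means ~0, standard deviations ~1
print(data_scaled.mean().round(2))
print(data_scaled.std().round(2))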
#Creating copy of the data to store labels from each algorithm
data_scaled_copy = data_scaled.copy(deep=True)
K-Means¶
Let us now fit the k-means algorithm on our scaled data and find the optimum number of clusters to use.
We will do this in 3 steps:
- Initialize a dictionary to store the SSE for each k
- Run for a range of Ks and store SSE for each run
- Plot the SSE vs K and find the elbow
# Step 1: initialize an empty dictionary to store the WCSS (SSE) for each k
WCSS = {}

# Step 2: iterate over a range of k values, fit K-Means on the scaled data,
# and store the inertia (within-cluster sum of squares) for each k
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data_scaled)
    WCSS[k] = kmeans.inertia_

# Step 3: plot WCSS vs k and look for the elbow
plt.figure()
plt.plot(list(WCSS.keys()), list(WCSS.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()
Observations:
- Looking at the plot, we can say that the elbow point is achieved at k=3.
- We will fit the k-means again with k=3 to get the labels.
# Empty dictionary to store the silhouette score for each value of k
sc = {}

# Iterate over a range of k values, fit K-Means on the scaled data,
# and store the silhouette score for each k (random_state fixed for reproducibility)
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(data_scaled)
    labels = kmeans.predict(data_scaled)
    sc[k] = silhouette_score(data_scaled, labels)

# Plot of silhouette score vs k (higher is better)
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.show()
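Rather than eyeballing the plot, the best k by silhouette score can also be read off programmatically:

# Pick the k whose silhouette score is highest
best_k = max(sc, key=sc.get)
print(f"k with the highest silhouette score: {best_k} (score = {sc[best_k]:.3f})")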
Fitting the K-Means algorithm on the scaled data with the number of clusters equal to 3, and storing the predictions as 'Labels' in the 'data_scaled_copy' and 'data' dataframes
kmeans = KMeans(n_clusters=3, max_iter= 1000, random_state=1)
kmeans.fit(data_scaled)
#Adding predicted labels to the original data and scaled data
data_scaled_copy['Labels'] = kmeans.predict(data_scaled)
data['Labels'] = kmeans.predict(data_scaled)
We have generated the labels with k-means. Let us look at the various features based on the labels.
Creating cluster profiles using the summary statistics and box plots below for each label¶
#Number of observations in each cluster
data.Labels.value_counts()
1    374
0    221
2     49
Name: Labels, dtype: int64
#Calculating summary statistics of the original data for each label
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()
df_kmeans = pd.concat([mean, median], axis=0)
df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T
| | group_0 Mean | group_1 Mean | group_2 Mean | group_0 Median | group_1 Median | group_2 Median |
|---|---|---|---|---|---|---|
| Avg_Credit_Limit | 12239.819005 | 33893.048128 | 140102.040816 | 12000.0 | 31500.0 | 145000.0 |
| Total_Credit_Cards | 2.411765 | 5.508021 | 8.775510 | 2.0 | 6.0 | 9.0 |
| Total_visits_bank | 0.945701 | 3.489305 | 0.591837 | 1.0 | 3.0 | 1.0 |
| Total_visits_online | 3.561086 | 0.975936 | 10.979592 | 4.0 | 1.0 | 11.0 |
| Total_calls_made | 6.891403 | 1.997326 | 1.102041 | 7.0 | 2.0 | 1.0 |
#Visualizing different features w.r.t K-means labels
data_scaled_copy.boxplot(by = 'Labels', layout = (1,5),figsize=(20,7))
plt.show()
Cluster Profiles:
- Cluster 0 - Low Credit Customers: This cluster represents customers with low spending who prefer to interact with the bank via calls, and therefore make fewer bank and online visits.
- Cluster 1 - Medium Credit Customers: This cluster represents customers who spend comparatively more than low credit customers and visit the bank more often, and therefore make fewer calls and online visits.
- Cluster 2 - High Credit Customers: This cluster represents customers with high spending who prefer to contact the bank through the online channel, and therefore make fewer calls and bank visits.
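As a quick numeric check on the final k=3 solution, we can compute its silhouette score on the scaled data (this uses the kmeans object fitted above):

# Silhouette score of the final 3-cluster solution; values closer to 1 indicate better-separated clusters
print(round(silhouette_score(data_scaled, kmeans.labels_), 3))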
K-Medoids¶
kmedo = KMedoids(n_clusters = 3, random_state=1)
kmedo.fit(data_scaled)
data_scaled_copy['kmedoLabels'] = kmedo.predict(data_scaled)
data['kmedoLabels'] = kmedo.predict(data_scaled)
#Number of observations in each cluster
data.kmedoLabels.value_counts()
2    289
0    222
1    133
Name: kmedoLabels, dtype: int64
Creating cluster profiles using the summary statistics and box plots below for each label, and comparing the clusters from the two algorithms - K-Means and K-Medoids
#Calculating summary statistics of the original data for each label
original_features = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made"]
mean = data.groupby('kmedoLabels').mean()
median = data.groupby('kmedoLabels').median()
df_kmedoids = pd.concat([mean, median], axis=0)
df_kmedoids.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmedoids[original_features].T
| | group_0 Mean | group_1 Mean | group_2 Mean | group_0 Median | group_1 Median | group_2 Median |
|---|---|---|---|---|---|---|
| Avg_Credit_Limit | 12216.216216 | 85052.631579 | 28449.826990 | 12000.0 | 68000.0 | 20000.0 |
| Total_Credit_Cards | 2.423423 | 7.030075 | 5.363322 | 2.0 | 7.0 | 5.0 |
| Total_visits_bank | 0.950450 | 1.691729 | 3.830450 | 1.0 | 2.0 | 4.0 |
| Total_visits_online | 3.554054 | 4.639098 | 0.982699 | 4.0 | 2.0 | 1.0 |
| Total_calls_made | 6.878378 | 1.969925 | 1.851211 | 7.0 | 2.0 | 2.0 |
#plotting boxplots with the new K-Medoids based labels
features_with_labels = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank", "Total_visits_online", "Total_calls_made", "kmedoLabels"]
data_scaled_copy[features_with_labels].boxplot(by='kmedoLabels', layout=(1, 5), figsize=(20, 7))
plt.show()
Let's compare the clusters from K-Means and K-Medoids
# Side-by-side comparison: for each feature, the K-Medoids column appears first, followed by K-Means
comparison = pd.concat([df_kmedoids, df_kmeans], axis=1)[original_features]
comparison
| | Avg_Credit_Limit (K-Medoids) | Avg_Credit_Limit (K-Means) | Total_Credit_Cards (K-Medoids) | Total_Credit_Cards (K-Means) | Total_visits_bank (K-Medoids) | Total_visits_bank (K-Means) | Total_visits_online (K-Medoids) | Total_visits_online (K-Means) | Total_calls_made (K-Medoids) | Total_calls_made (K-Means) |
|---|---|---|---|---|---|---|---|---|---|---|
| group_0 Mean | 12216.216216 | 12239.819005 | 2.423423 | 2.411765 | 0.950450 | 0.945701 | 3.554054 | 3.561086 | 6.878378 | 6.891403 |
| group_1 Mean | 85052.631579 | 33893.048128 | 7.030075 | 5.508021 | 1.691729 | 3.489305 | 4.639098 | 0.975936 | 1.969925 | 1.997326 |
| group_2 Mean | 28449.826990 | 140102.040816 | 5.363322 | 8.775510 | 3.830450 | 0.591837 | 0.982699 | 10.979592 | 1.851211 | 1.102041 |
| group_0 Median | 12000.000000 | 12000.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 4.000000 | 4.000000 | 7.000000 | 7.000000 |
| group_1 Median | 68000.000000 | 31500.000000 | 7.000000 | 6.000000 | 2.000000 | 3.000000 | 2.000000 | 1.000000 | 2.000000 | 2.000000 |
| group_2 Median | 20000.000000 | 145000.000000 | 5.000000 | 9.000000 | 4.000000 | 1.000000 | 1.000000 | 11.000000 | 2.000000 | 1.000000 |
Cluster Profiles:
- Cluster 0 - Low Credit Customers: This cluster represents customers with low spending who prefer to interact with the bank via calls, and therefore make fewer bank and online visits.
- Cluster 1 - High Credit Customers: This cluster represents customers with high spending who prefer to contact the bank through the online channel, and therefore make fewer calls and bank visits.
- Cluster 2 - Medium Credit Customers: This cluster represents customers who spend comparatively more than low credit customers and visit the bank more often, and therefore make fewer calls and online visits.
Comparing Clusters:
- The cluster sizes (counts of observations) from K-Medoids differ from the K-Means clusters. In K-Medoids, the observations are more evenly distributed among the clusters.
- The K-Medoids clusters are more compact, i.e. they have smaller ranges for all variables. For example, the range of the total number of credit cards is 4 to 7 for medium credit customers from K-Medoids, whereas from K-Means it is 2 to 7.
- K-Medoids is more robust to outliers, which helps to identify outliers within the clusters. For example, the total number of bank visits has an outlier for high credit customers from K-Medoids, whereas from K-Means we could not detect any outliers in this variable.
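To put a number on how much the two partitions agree, one option (not part of the original analysis) is the Adjusted Rand Index, which equals 1 for identical partitions regardless of how the cluster IDs are numbered; a minimal sketch using the label columns created above:

from sklearn.metrics import adjusted_rand_score

# ARI compares the two partitions independently of cluster numbering
ari = adjusted_rand_score(data['Labels'], data['kmedoLabels'])
print(f"Adjusted Rand Index (K-Means vs K-Medoids): {ari:.3f}")

# The cross-tabulation shows exactly where the two algorithms assign customers differently
print(pd.crosstab(data['Labels'], data['kmedoLabels'], rownames=['K-Means'], colnames=['K-Medoids']))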
Additional Content (Optional)¶
Gaussian Mixture¶
Let's create clusters using Gaussian Mixture Models
Finding the optimal k using the silhouette score for the GMM model
# Empty dictionary to store the silhouette score for each value of k
sc = {}

# Iterate over a range of k values, fit a Gaussian Mixture Model on the scaled data,
# and store the silhouette score for each k
for k in range(2, 10):
    GMM = GaussianMixture(n_components=k).fit(data_scaled)
    labels = GMM.predict(data_scaled)
    sc[k] = silhouette_score(data_scaled, labels)

# Plot of silhouette score vs k (higher is better)
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.show()
Fitting the GMM model
gmm = GaussianMixture(n_components = 3)
gmm.fit(data_scaled)
data_scaled_copy['GmmLabels'] = gmm.predict(data_scaled)
data['GmmLabels'] = gmm.predict(data_scaled)
#Number of observations in each cluster
data.GmmLabels.value_counts()
0    374
1    221
2     49
Name: GmmLabels, dtype: int64
Creating cluster profiles using the summary statistics and box plots below for each label, and comparing the clusters from the two algorithms - K-Means and Gaussian Mixture
#Calculating summary statistics of the original data for each label
original_features = ["Avg_Credit_Limit","Total_Credit_Cards","Total_visits_bank","Total_visits_online","Total_calls_made"]
mean = data.groupby('GmmLabels').mean()
median = data.groupby('GmmLabels').median()
df_gmm = pd.concat([mean, median], axis=0)
df_gmm.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_gmm[original_features].T
| | group_0 Mean | group_1 Mean | group_2 Mean | group_0 Median | group_1 Median | group_2 Median |
|---|---|---|---|---|---|---|
| Avg_Credit_Limit | 33893.048128 | 12239.819005 | 140102.040816 | 31500.0 | 12000.0 | 145000.0 |
| Total_Credit_Cards | 5.508021 | 2.411765 | 8.775510 | 6.0 | 2.0 | 9.0 |
| Total_visits_bank | 3.489305 | 0.945701 | 0.591837 | 3.0 | 1.0 | 1.0 |
| Total_visits_online | 0.975936 | 3.561086 | 10.979592 | 1.0 | 4.0 | 11.0 |
| Total_calls_made | 1.997326 | 6.891403 | 1.102041 | 2.0 | 7.0 | 1.0 |
# plotting boxplots with the new GMM based labels
features_with_labels = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank", "Total_visits_online", "Total_calls_made", "GmmLabels"]
data_scaled_copy[features_with_labels].boxplot(by='GmmLabels', layout=(1, 5), figsize=(20, 7))
plt.show()
Cluster Profiles:
- Cluster 0 - Medium Credit Customers: This cluster represents customers who spend comparatively more than low credit customers and visit the bank more often, and therefore make fewer calls and online visits.
- Cluster 1 - Low Credit Customers: This cluster represents customers with low spending who prefer to interact with the bank via calls, and therefore make fewer bank and online visits.
- Cluster 2 - High Credit Customers: This cluster represents customers with high spending who prefer to contact the bank through the online channel, and therefore make fewer calls and bank visits.
Comparing Clusters:
- The cluster profiles are the same as the K-Means clusters; only the cluster IDs are permuted (GMM cluster 0 matches K-Means cluster 1 and vice versa), as the cross-tabulation below confirms.
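The claim can be verified with a cross-tabulation of the two label sets; if the profiles really do coincide, each K-Means cluster should map almost entirely onto a single GMM component:

# Each row should concentrate in one column if the partitions match up to relabeling
print(pd.crosstab(data['Labels'], data['GmmLabels'], rownames=['K-Means'], colnames=['GMM']))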
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Three - Making Sense of Unstructured Data/Mentored Learning Session/MLS_Case_Study_Unsupervised_Learning.ipynb"