The EdTech industry has surged immensely over the past decade; according to one forecast, the online education market will be worth $286.62 billion by 2023, growing at a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has contributed greatly to this growth, expanding the industry beyond previous limits. With dominant features such as ease of information sharing, personalized learning experiences, and transparency of assessment, it is now often preferred to traditional education.
In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting a lot of new customers. Due to this rapid growth, many new companies have emerged in this industry. With the availability and ease of use of digital marketing resources, companies can reach out to a wider audience with their offerings. The customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, such as:
The company then nurtures these leads and tries to convert them to paid customers. For this, the representative from the organization connects with the lead on call or through email to share further details.
ExtraaLearn is an initial stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues faced by ExtraaLearn is to identify which of the leads are more likely to convert so that they can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.
Data Dictionary
last_activity: Last interaction between the lead and ExtraaLearn.
print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.
The objective of this analysis is to build a predictive model that can accurately identify potential leads who are most likely to convert into customers. By leveraging data related to user behavior, such as website activity, user demographics, and interaction history, we aim to develop a classification model that can assist the business in optimizing its marketing and sales strategies.
This project seeks to address the challenge of efficiently managing leads by predicting which users have the highest potential for conversion. By identifying key factors that influence lead conversion, the model will enable the business to focus its resources on high-value leads, improve customer acquisition efforts, and ultimately increase conversion rates.
Model evaluation criterion:
The model can make wrong predictions in two ways:
Predicting that a lead will not convert to a paid customer when, in reality, the lead would have converted.
Predicting that a lead will convert to a paid customer when, in reality, the lead would not have converted.
Which case is more important?
If we predict that a lead will not get converted and the lead would have converted, then the company will lose a potential customer.
If we predict that a lead will get converted and the lead doesn't get converted, the company might lose resources by nurturing false-positive cases.
Losing a potential customer is a greater loss for the organization.
How to reduce the losses?
The company would want recall to be maximized: the greater the recall score, the higher the chances of minimizing false negatives.
In this case, the false negative is predicting that a lead will not convert (0) when it would have converted (1).
To accomplish this we will need to build a model that maximizes recall for status = 1.
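As a sketch of how this criterion could later be wired into model selection, recall on the converted class can be wrapped as a scikit-learn scorer and passed to cross-validation or grid search (the scorer name below is illustrative):
# A minimal sketch: score models by recall on the converted class (status = 1)
from sklearn.metrics import make_scorer, recall_score
recall_on_converted = make_scorer(recall_score, pos_label=1)
# e.g. GridSearchCV(estimator, param_grid, scoring=recall_on_converted, cv=5)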
# Importing the basic libraries we will require for the project
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.patches as mpatches # For creating plot independent legends
# Libraries for statistical analysis
from scipy import stats
# Library for label encoding (for 3D plotting functions)
from sklearn.preprocessing import LabelEncoder
# Collinearity checks
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from sklearn.tree import (
DecisionTreeClassifier,
)
from sklearn.ensemble import (
RandomForestClassifier,
GradientBoostingClassifier,
)
# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import (
train_test_split,
GridSearchCV,
cross_val_score,
)
from sklearn.preprocessing import (
MinMaxScaler,
LabelEncoder,
OneHotEncoder,
StandardScaler,
)
from sklearn.impute import (
SimpleImputer,
)
# To get different metric scores
import sklearn.metrics as metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
classification_report,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
# Importing PCA
from sklearn.decomposition import PCA
# Importing class weights
from sklearn.utils.class_weight import compute_class_weight
# Importing Advanced Analysis Libraries
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')
# Comment formatting
from IPython.display import display, HTML
# Connect collab
from google.colab import drive
drive.mount('/content/drive')
# Load data from csv file
dataset = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/ExtraaLearn.csv')
# Make a working copy of the data
data = dataset.copy()
Mounted at /content/drive
def enhanced_histogram_boxplot(
data,
feature,
figsize=(12, 8),
kde=True,
bins=None,
box_color="violet",
mean_color="green",
median_color="black",
hist_palette="tab10", # Updated palette to 'tab10'
show_title=True,
custom_title=None
):
"""
Enhanced boxplot and histogram combined with outlier detection and statistical summary
Parameters:
data (pd.DataFrame): Input dataframe
feature (str): Column name of the dataframe to plot
figsize (tuple): Size of figure (default (12,8)) - Adjusted for removal of Q-Q plot
kde (bool): Whether to show the density curve (default True)
bins (int): Number of bins for histogram (default None)
box_color (str): Color of the boxplot (default "violet")
mean_color (str): Color of the mean line (default "green")
median_color (str): Color of the median line (default "black")
hist_palette (str): Color palette for the histogram (default "tab10")
show_title (bool): Whether to show the plot title (default True)
custom_title (str): Custom title for the plot (default None)
"""
if not isinstance(data, pd.DataFrame):
raise TypeError("data must be a pandas DataFrame")
if feature not in data.columns:
raise ValueError(f"Feature '{feature}' not found in the dataframe")
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2,
sharex=False,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
# Boxplot
sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color=box_color)
# Histogram
if bins:
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette=sns.color_palette(hist_palette))
else:
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, palette=sns.color_palette(hist_palette))
# Add mean and median lines
mean = data[feature].mean()
median = data[feature].median()
ax_hist2.axvline(mean, color=mean_color, linestyle="--", label="Mean")
ax_hist2.axvline(median, color=median_color, linestyle="-", label="Median")
# Add legend
ax_hist2.legend()
# Add labels
ax_hist2.set_xlabel(feature)
ax_hist2.set_ylabel("Count")
ax_box2.set_ylabel("")
if show_title:
if custom_title:
title = f"{custom_title} Distribution"
else:
title = f"{feature} Distribution"
plt.suptitle(title, fontsize=16)
plt.tight_layout()
# Calculate statistics
std = data[feature].std()
skew = data[feature].skew()
kurtosis = data[feature].kurtosis()
# Outlier detection
Q1 = data[feature].quantile(0.25)
Q3 = data[feature].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)][feature]
# Z-score outliers
z_scores = np.abs(stats.zscore(data[feature]))
z_score_outliers = data[feature][z_scores > 3]
# Print statistical summary
print(f"\nStatistical Summary for {feature}:")
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Standard Deviation: {std:.2f}")
print(f"Skewness: {skew:.2f}")
print(f"Kurtosis: {kurtosis:.2f}")
print(f"\nOutlier Analysis:")
print(f"IQR method - Number of outliers: {len(outliers)}")
print(f"IQR method - Percentage of outliers: {(len(outliers) / len(data[feature])) * 100:.2f}%")
print(f"IQR method - Outlier range: < {lower_bound:.2f} or > {upper_bound:.2f}")
print(f"Z-score method - Number of outliers (|z| > 3): {len(z_score_outliers)}")
print(f"Z-score method - Percentage of outliers: {(len(z_score_outliers) / len(data[feature])) * 100:.2f}%")
plt.show()
# Usage example:
# enhanced_histogram_boxplot(data, 'age')
# Define binary feature plotting function
def plot_binary_feature(
data,
feature,
figsize=(12, 7),
colors=['#0073a3', '#5e6d77'],
show_title=True,
custom_title=None):
"""
Bar plot and pie chart for binary features
Parameters:
data (pd.DataFrame): Input dataframe
feature (str): Column name of the dataframe to plot
figsize (tuple): Size of figure (default (12,7))
colors (list): List of two colors for the plots (default ['#0073a3', '#5e6d77'])
show_title (bool): Whether to show the plot title (default True)
custom_title (str): Custom title for the plot (default None)
"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
# Calculate the value counts and percentages
value_counts = data[feature].value_counts().sort_index()
percentages = value_counts / len(data) * 100
# Bar plot
sns.barplot(x=value_counts.index, y=value_counts.values, ax=ax1, palette=colors)
ax1.set_title('Bar Plot')
ax1.set_xlabel(feature)
ax1.set_ylabel('Count')
# Add percentage labels on the bars
for i, v in enumerate(value_counts.values):
ax1.text(i, v, f'{percentages[i]:.1f}%', ha='center', va='bottom')
# Pie chart
ax2.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', colors=colors, startangle=90)
ax2.set_title('Pie Chart')
if show_title:
title = f"{custom_title} Distribution" if custom_title else f"{feature} Distribution"
plt.suptitle(title, fontsize=16)
plt.tight_layout()
plt.show()
# Define categorical feature plotting function
def plot_categorical(data, feature, figsize=(12, 6), show_title=True, custom_title=None, top_n=None):
"""
Plot categorical variables with appropriate chart types based on the number of categories.
Includes value counts and percentages as a "legend" on the right side.
Parameters:
data (pd.DataFrame): Input dataframe
feature (str): Column name of the dataframe to plot
figsize (tuple): Size of figure (default (12,6))
show_title (bool): Whether to show the plot title (default True)
custom_title (str): Custom title for the plot (default None)
top_n (int): Number of top categories to show, others will be grouped as 'Other' (default None)
"""
if not isinstance(data, pd.DataFrame):
raise TypeError("data must be a pandas DataFrame")
if feature not in data.columns:
raise ValueError(f"Feature '{feature}' not found in the dataframe")
value_counts = data[feature].value_counts()
n_categories = len(value_counts)
if top_n and n_categories > top_n:
top_values = value_counts.nlargest(top_n)
other = pd.Series({'Other': value_counts.nsmallest(n_categories - top_n).sum()})
value_counts = pd.concat([top_values, other])
n_categories = top_n + 1
percentages = value_counts / len(data) * 100
# Determine the title
title = custom_title if custom_title else f"{feature} Distribution"
if n_categories == 2:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize, gridspec_kw={'width_ratios': [2, 1]})
# Bar plot
sns.barplot(x=value_counts.index, y=value_counts.values, ax=ax1)
ax1.set_title('Bar Plot')
ax1.set_ylabel('Count')
# Add percentage labels on the bars
for i, v in enumerate(value_counts.values):
ax1.text(i, v, f'{percentages[i]:.1f}%', ha='center', va='bottom')
# Pie chart
ax2.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', startangle=90)
ax2.set_title('Pie Chart')
else:
fig, (ax, ax_legend) = plt.subplots(1, 2, figsize=figsize, gridspec_kw={'width_ratios': [3, 1]})
# Horizontal bar plot
bars = ax.barh(value_counts.index, value_counts.values)
ax.set_title('Horizontal Bar Plot')
ax.set_xlabel('Count')
# Add percentage labels on the bars
for i, (value, name) in enumerate(zip(value_counts.values, value_counts.index)):
ax.text(value, i, f'{percentages[name]:.1f}%', va='center')
# Add value counts as "legend"
ax_legend.axis('off')
legend_text = "Value Counts:\n\n"
for index, value in value_counts.items():
legend_text += f"{index}: {value} ({percentages[index]:.1f}%)\n"
ax_legend.text(0, 0.9, legend_text, verticalalignment='top', wrap=True)
if show_title:
plt.suptitle(title, fontsize=16)
plt.tight_layout()
plt.show()
# Example usage:
# plot_categorical(data, 'current_occupation', custom_title="Occupation Distribution")
# plot_categorical(data, 'print_media_type1', custom_title="Print Media Type 1 Usage")
# plot_categorical(data, 'educational_channels', top_n=5, custom_title="Top 5 Educational Channels")
# Define scatter plot function
def plot_scatter(x_column, y_column, data, title=None, x_label=None, y_label=None, palette='tab10'):
"""
Creates a scatter plot for the specified x and y columns from the given dataset.
Parameters:
x_column (str): The name of the column for the x-axis.
y_column (str): The name of the column for the y-axis.
data (DataFrame): The pandas DataFrame containing the data.
title (str, optional): The title of the plot. Defaults to None.
x_label (str, optional): Label for the x-axis. Defaults to None.
y_label (str, optional): Label for the y-axis. Defaults to None.
Returns:
None: Displays the scatter plot.
"""
plt.figure(figsize=(10, 6))
sns.scatterplot(x=x_column, y=y_column, data=data, palette=palette)  # note: seaborn applies a palette only when a hue is provided
plt.title(title if title else f'Scatter Plot: {x_column} vs {y_column}')
plt.xlabel(x_label if x_label else x_column)
plt.ylabel(y_label if y_label else y_column)
plt.grid(True)
plt.show()
# Example of how to call the function
#plot_scatter('age', 'website_visits', data, title='Age vs Website Visits')
# Defining an abstracted function for box plot visualization
def plot_boxplot(x_column, y_column, data, title=None, x_label=None, y_label=None, palette='tab10', figsize=(10, 6)):
"""
Creates a box plot for the specified x and y columns from the given dataset.
Parameters:
x_column (str): The name of the column for the x-axis (typically categorical).
y_column (str): The name of the column for the y-axis (typically continuous).
data (DataFrame): The pandas DataFrame containing the data.
title (str, optional): The title of the plot. Defaults to None.
x_label (str, optional): Label for the x-axis. Defaults to None.
y_label (str, optional): Label for the y-axis. Defaults to None.
palette (str, optional): Color palette for the box plot. Defaults to 'tab10'.
figsize (tuple, optional): Figure size for the plot (width, height). Defaults to (10, 6).
Returns:
None: Displays the box plot.
"""
# Set the figure size based on the figsize parameter
plt.figure(figsize=figsize)
sns.boxplot(x=x_column, y=y_column, data=data, palette=palette)
plt.title(title if title else f'Box Plot: {x_column} vs {y_column}')
plt.xlabel(x_label if x_label else x_column)
plt.ylabel(y_label if y_label else y_column)
plt.grid(True)
plt.show()
# Example of how to call the function with a custom figure size
# plot_boxplot(x_column='status', y_column='age', data=dataset, title='Status vs Age', figsize=(15, 8))
# Defining a generic function for count plot visualization
def plot_countplot(x_column, hue_column, data, title=None, x_label=None, y_label=None, palette='tab10', hue_order=None):
"""
Creates a count plot for the specified x and hue columns from the given dataset.
Parameters:
x_column (str): The name of the column for the x-axis (typically categorical).
hue_column (str): The name of the column for hue (typically categorical).
data (DataFrame): The pandas DataFrame containing the data.
title (str, optional): The title of the plot. Defaults to None.
x_label (str, optional): Label for the x-axis. Defaults to None.
y_label (str, optional): Label for the y-axis. Defaults to None.
palette (str, optional): Color palette for the count plot. Defaults to 'tab10'.
hue_order (list, optional): The order of the hues to be plotted. Defaults to None.
Returns:
None: Displays the count plot.
"""
plt.figure(figsize=(10, 6))
sns.countplot(x=x_column, hue=hue_column, data=data, palette=palette, hue_order=hue_order)
plt.title(title if title else f'Count Plot: {x_column} vs {hue_column}')
plt.xlabel(x_label if x_label else x_column)
plt.ylabel(y_label if y_label else 'Count')
plt.grid(True)
plt.show()
# Example usage
# plot_countplot(x_column='status', hue_column='current_occupation', data=dataset, title='Status vs Current Occupation', hue_order=['Employed', 'Unemployed', 'Student'])
# Defining a generic function for creating 3D scatter plots
# Set Seaborn style to 'whitegrid'
sns.set(style="whitegrid")
def plot_3d_scatter_with_color(x_column, y_column, z_column, color_column, data, title=None, x_label=None, y_label=None, z_label=None, figsize=(10, 8)):
"""
Creates a 3D scatter plot for the specified x, y, and z columns from the given dataset.
Adds the 'color_column' to color the points based on a categorical variable.
"""
fig = plt.figure(figsize=figsize) # Pass figsize argument here
ax = fig.add_subplot(111, projection='3d')
# Scatter plot with color dimension based on the 'color_column'
p = ax.scatter(data[x_column], data[y_column], data[z_column], c=data[color_column], cmap='coolwarm', marker='o')
# Setting labels with padding
ax.set_xlabel(x_label if x_label else x_column, labelpad=20)
ax.set_ylabel(y_label if y_label else y_column, labelpad=20)
ax.set_zlabel(z_label if z_label else z_column, labelpad=20)
# Add color bar for reference
cbar = fig.colorbar(p, ax=ax) # Assign color bar object to cbar
cbar.set_label('Status') # Add this line
# Title
ax.set_title(title if title else f'3D Scatter Plot: {x_column}, {y_column}, {z_column}')
plt.tight_layout()
plt.show()
# Call the modified function with 'status' as the color dimension
#plot_3d_scatter_with_color('website_visits', 'time_spent_on_website', 'status', 'status', data, title='Website Visits, Time Spent on Website, and Status (with color dimension)')
# Creating metric function for classification model evaluation
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize=(8,5))
sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['Not Converted', 'Converted'], yticklabels=['Not Converted', 'Converted'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
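A hypothetical usage example for metrics_score, assuming a classifier named model has already been fitted and a held-out test set (X_test, y_test) exists:
# Example usage (hypothetical names, after fitting a model):
# metrics_score(y_test, model.predict(X_test))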
# returns the first 5 rows
data.head()
ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
Observations:
# returns the last 5 rows
data.tail()
ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4607 | EXT4608 | 35 | Unemployed | Mobile App | Medium | 15 | 360 | 2.170 | Phone Activity | No | No | No | Yes | No | 0 |
4608 | EXT4609 | 55 | Professional | Mobile App | Medium | 8 | 2327 | 5.393 | Email Activity | No | No | No | No | No | 0 |
4609 | EXT4610 | 58 | Professional | Website | High | 2 | 212 | 2.692 | Email Activity | No | No | No | No | No | 1 |
4610 | EXT4611 | 57 | Professional | Mobile App | Medium | 1 | 154 | 3.879 | Website Activity | Yes | No | No | No | No | 0 |
4611 | EXT4612 | 55 | Professional | Website | Medium | 4 | 2290 | 2.075 | Phone Activity | No | No | No | No | No | 0 |
# Determine the number of rows and columns by calling data.shape
print(data.shape[0])
print(data.shape[1])
4612 15
The dataset consists of 4612 rows and 15 columns: 14 features plus the target variable (status).
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4612 entries, 0 to 4611 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 4612 non-null object 1 age 4612 non-null int64 2 current_occupation 4612 non-null object 3 first_interaction 4612 non-null object 4 profile_completed 4612 non-null object 5 website_visits 4612 non-null int64 6 time_spent_on_website 4612 non-null int64 7 page_views_per_visit 4612 non-null float64 8 last_activity 4612 non-null object 9 print_media_type1 4612 non-null object 10 print_media_type2 4612 non-null object 11 digital_media 4612 non-null object 12 educational_channels 4612 non-null object 13 referral 4612 non-null object 14 status 4612 non-null int64 dtypes: float64(1), int64(4), object(10) memory usage: 540.6+ KB
Observations
The dataset is composed of numeric and categorical features.
Continuous Features:
age
website_visits
time_spent_on_website
page_views_per_visit
Categorical Features:
ID
current_occupation
first_interaction
profile_completed
last_activity
print_media_type1
print_media_type2
digital_media
educational_channels
referral
status (Dependent Variable)
Summary of Features:
Based on the initial inspection of the dataset, here's an evaluation of the columns, considering that our target variable is status:
ID: The ID column is a unique identifier for each record and does not provide predictive value for the target variable (status). It can be safely dropped.
current_occupation:
first_interaction:
profile_completed:
website_visits, time_spent_on_website, page_views_per_visit:
print_media_type1, print_media_type2, digital_media, educational_channels:
referral:
last_activity:
Converting categorical features from 'object' to the 'category' datatype helps reduce memory usage and simplifies the visual interpretation of data types, making it easier to manage and analyze the dataset efficiently.
# Changing datatypes of categorical features
categorical_features = ['current_occupation', 'first_interaction', 'profile_completed', 'last_activity',
'print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels',
'referral', 'status']
for feature in categorical_features:
data[feature] = data[feature].astype('category') # Step 1: Convert to categorical
data[feature] = data[feature].cat.codes # Step 2: Encode categories as integer codes
data[feature] = data[feature].astype('category') # Step 3: Re-convert back to categorical
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4612 entries, 0 to 4611 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 4612 non-null object 1 age 4612 non-null int64 2 current_occupation 4612 non-null category 3 first_interaction 4612 non-null category 4 profile_completed 4612 non-null category 5 website_visits 4612 non-null int64 6 time_spent_on_website 4612 non-null int64 7 page_views_per_visit 4612 non-null float64 8 last_activity 4612 non-null category 9 print_media_type1 4612 non-null category 10 print_media_type2 4612 non-null category 11 digital_media 4612 non-null category 12 educational_channels 4612 non-null category 13 referral 4612 non-null category 14 status 4612 non-null category dtypes: category(10), float64(1), int64(3), object(1) memory usage: 226.1+ KB
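Because the category labels above are replaced by integer codes, it can be helpful to keep the original code-to-label mapping (built from the unencoded dataset copy) for labeling plots later; a minimal sketch, with code_maps as an illustrative name:
# Minimal sketch: record the code -> label mapping from the original (unencoded) copy
code_maps = {
feature: dict(enumerate(dataset[feature].astype('category').cat.categories))
for feature in categorical_features
}
# e.g. code_maps['last_activity'] maps the codes 0, 1, 2 back to the original activity labels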
Questions
The objective of this analysis is to build a predictive model that can accurately identify potential leads who are most likely to convert into customers.
If we predict that a lead will not get converted and the lead would have converted, then the company will lose a potential customer.
If we predict that a lead will get converted and the lead doesn't get converted, the company might lose resources by nurturing false-positive cases.
Losing a potential customer is a greater loss for the organization.
To accomplish this we will need to build a model that maximizes recall for status = 1.
data.describe(include = "all").T
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
ID | 4612 | 4612 | EXT001 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
age | 4612.0 | NaN | NaN | NaN | 46.201214 | 13.161454 | 18.0 | 36.0 | 51.0 | 57.0 | 63.0 |
current_occupation | 4612.0 | 3.0 | 0.0 | 2616.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
first_interaction | 4612.0 | 2.0 | 1.0 | 2542.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
profile_completed | 4612.0 | 3.0 | 0.0 | 2264.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
website_visits | 4612.0 | NaN | NaN | NaN | 3.566782 | 2.829134 | 0.0 | 2.0 | 3.0 | 5.0 | 30.0 |
time_spent_on_website | 4612.0 | NaN | NaN | NaN | 724.011275 | 743.828683 | 0.0 | 148.75 | 376.0 | 1336.75 | 2537.0 |
page_views_per_visit | 4612.0 | NaN | NaN | NaN | 3.026126 | 1.968125 | 0.0 | 2.07775 | 2.792 | 3.75625 | 18.434 |
last_activity | 4612.0 | 3.0 | 0.0 | 2278.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
print_media_type1 | 4612.0 | 2.0 | 0.0 | 4115.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
print_media_type2 | 4612.0 | 2.0 | 0.0 | 4379.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
digital_media | 4612.0 | 2.0 | 0.0 | 4085.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
educational_channels | 4612.0 | 2.0 | 0.0 | 3907.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
referral | 4612.0 | 2.0 | 0.0 | 4519.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
status | 4612.0 | 2.0 | 0.0 | 3235.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Observations
Age
Current Occupation
First Interaction
Profile Completion
Website Visits
Time Spent on Website
Page Views per Visit
Last Activity
Media & Referral Channels
Status (Target Variable)
# Get nunique values for each column
nunique_df = data.nunique().reset_index()
nunique_df.columns = ['Feature', 'Unique_Count']
# Features to find unique values for (categorical only)
categorical_features = [
'current_occupation', 'first_interaction', 'profile_completed',
'last_activity', 'print_media_type1', 'print_media_type2',
'digital_media', 'educational_channels', 'referral', 'status'
]
# Create an empty list to store unique value strings
unique_value_list = []
# Loop over the categorical list to get all unique values and store them as formatted strings
for feature in categorical_features:
unique_values = data[feature].dropna().unique() # Ensure no NaN values are included
unique_values_sorted = sorted(unique_values, key=lambda x: (str(x).lower())) # Sort for consistency
formatted_list = ', '.join([str(value) for value in unique_values_sorted])
unique_value_list.append(formatted_list)
# Create a DataFrame with the features and their corresponding unique values
unique_values_df = pd.DataFrame({
'Feature': categorical_features,
'Unique_Values': unique_value_list
})
# Merge nunique_df with unique_values_df
merged_df = pd.merge(nunique_df, unique_values_df, on='Feature', how='left')
# Display the combined DataFrame
display(merged_df)
Feature | Unique_Count | Unique_Values | |
---|---|---|---|
0 | ID | 4612 | NaN |
1 | age | 46 | NaN |
2 | current_occupation | 3 | 0, 1, 2 |
3 | first_interaction | 2 | 0, 1 |
4 | profile_completed | 3 | 0, 1, 2 |
5 | website_visits | 27 | NaN |
6 | time_spent_on_website | 1623 | NaN |
7 | page_views_per_visit | 2414 | NaN |
8 | last_activity | 3 | 0, 1, 2 |
9 | print_media_type1 | 2 | 0, 1 |
10 | print_media_type2 | 2 | 0, 1 |
11 | digital_media | 2 | 0, 1 |
12 | educational_channels | 2 | 0, 1 |
13 | referral | 2 | 0, 1 |
14 | status | 2 | 0, 1 |
Observations:
ID: The ID is unique, ensuring that every record in the dataset represents a distinct lead.
Age: The age feature encompasses 46 unique values, indicating a diverse range of ages among the leads.
Current Occupation
First Interaction
Profile Completion
Website Visits: The website_visits feature exhibits 27 unique values, reflecting a varied number of website visits per lead.
Time Spent on Website: The time_spent_on_website feature has 1623 unique values, suggesting a wide range of engagement durations on the website.
Page Views per Visit: The page_views_per_visit feature contains 2414 unique values, indicating diverse browsing behaviors across different visits.
Last Activity
Print Media Type1
Print Media Type2
Digital Media
Educational Channels
Referral
Status
# Missing value check
pd.DataFrame(data={'% of Missing Values':round(data.isna().sum()/data.isna().count()*100,2)})
% of Missing Values | |
---|---|
ID | 0.0 |
age | 0.0 |
current_occupation | 0.0 |
first_interaction | 0.0 |
profile_completed | 0.0 |
website_visits | 0.0 |
time_spent_on_website | 0.0 |
page_views_per_visit | 0.0 |
last_activity | 0.0 |
print_media_type1 | 0.0 |
print_media_type2 | 0.0 |
digital_media | 0.0 |
educational_channels | 0.0 |
referral | 0.0 |
status | 0.0 |
There are no missing values in the dataset.
The dataset contains the following continuous variables:
# plot the distribution of the age feature
enhanced_histogram_boxplot(data, 'age', kde = True, bins = 24, custom_title = "Age")
Statistical Summary for age: Mean: 46.20 Median: 51.00 Standard Deviation: 13.16 Skewness: -0.72 Kurtosis: -0.80 Outlier Analysis: IQR method - Number of outliers: 0 IQR method - Percentage of outliers: 0.00% IQR method - Outlier range: < 4.50 or > 88.50 Z-score method - Number of outliers (|z| > 3): 0 Z-score method - Percentage of outliers: 0.00%
Observations:
Left-Skewed Distribution:
Mean vs. Median:
Boxplot Insights
Concentration of Older Leads:
Smaller Younger Population:
Age Variability:
Outliers:
# plot the distribution of the website_visits feature
enhanced_histogram_boxplot(data, "website_visits", kde = True, custom_title = "Website Visits")
Statistical Summary for website_visits: Mean: 3.57 Median: 3.00 Standard Deviation: 2.83 Skewness: 2.16 Kurtosis: 9.35 Outlier Analysis: IQR method - Number of outliers: 154 IQR method - Percentage of outliers: 3.34% IQR method - Outlier range: < -2.50 or > 9.50 Z-score method - Number of outliers (|z| > 3): 66 Z-score method - Percentage of outliers: 1.43%
Observations:
Right-Skewed Distribution:
Mean vs. Median:
Boxplot Insights:
Common Visit Patterns:
Outliers and Heavy Users:
# plot the distribution of the page_views_per_visit feature
enhanced_histogram_boxplot(data, "page_views_per_visit", kde = True, custom_title = "Page Views per Visit")
Statistical Summary for page_views_per_visit: Mean: 3.03 Median: 2.79 Standard Deviation: 1.97 Skewness: 1.27 Kurtosis: 4.22 Outlier Analysis: IQR method - Number of outliers: 257 IQR method - Percentage of outliers: 5.57% IQR method - Outlier range: < -0.44 or > 6.27 Z-score method - Number of outliers (|z| > 3): 40 Z-score method - Percentage of outliers: 0.87%
Observations:
Right-Skewed Distribution:
Mean vs. Median
Boxplot Insights:
Common Page View Patterns:
Outliers and Heavy Page Viewers:
# plot the distribution of the time_spent_on_website feature
enhanced_histogram_boxplot(data, "time_spent_on_website", kde = True, custom_title = "Time Spent on Website")
Statistical Summary for time_spent_on_website: Mean: 724.01 Median: 376.00 Standard Deviation: 743.83 Skewness: 0.95 Kurtosis: -0.58 Outlier Analysis: IQR method - Number of outliers: 0 IQR method - Percentage of outliers: 0.00% IQR method - Outlier range: < -1633.25 or > 3118.75 Z-score method - Number of outliers (|z| > 3): 0 Z-score method - Percentage of outliers: 0.00%
Observations:
Right-Skewed Distribution:
Mean vs. Median:
Boxplot Insights:
Common Time Spent Patterns:
Engaged Users:
# plot the distribution of the status feature
plot_binary_feature(data, "status", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Status")
Observations:
Imbalanced Target Variable (Status):
Class Imbalance:
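One option for handling this imbalance during modeling is to weight the minority class more heavily. A minimal sketch using compute_class_weight, assuming status has been encoded as 0/1 (variable names are illustrative):
# Minimal sketch: derive 'balanced' class weights to counter the roughly 70/30 split
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
y = data['status'].astype(int)
classes = np.array([0, 1])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
class_weight_dict = dict(zip(classes, weights))
# e.g. LogisticRegression(class_weight=class_weight_dict), or simply class_weight='balanced'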
The dataset contains the following categorical variables:
# plot the distribution of the current_occupation feature
plot_binary_feature(data, "current_occupation", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Status")
Observations:
Distribution of Occupations:
Imbalance in Occupation Groups:
# plot the distribution of the first_interaction feature
plot_binary_feature(data, "first_interaction", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="First Interaction")
Observations:
Distribution of First Interaction Channels:
Balanced Split Between Website and Mobile App:
# plot the distribution of the profile_completed feature
plot_binary_feature(data, "profile_completed", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Profile Completed")
Observations:
Distribution of Profile Completion Levels:
High Completion Rates:
# plot the distribution of the last_activity feature
plot_binary_feature(data, "last_activity", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Last Activity")
Observations:
Distribution of Last Activity Types:
Dominance of Email Activity:
# plot the distribution of the print_media_type1 feature
plot_binary_feature(data, "print_media_type1", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Print Media Type 1")
Observations:
Dominance of 'No' for Print Media Type 1:
Print Media Type 1 Underutilization:
# plot the distribution of the print_media_type2 feature
plot_binary_feature(data, "print_media_type2", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Print Media Type 2")
Observations:
Very Limited Use of Print Media Type 2:
Underutilization of Print Media Type 2:
# plot the distribution of the digital_media feature
plot_binary_feature(data, "digital_media", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Digital Media")
Observations:
Limited Exposure to Digital Media:
Underutilization of Digital Media:
# plot the distribution of the educational_channels feature
plot_binary_feature(data, "educational_channels", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Educational Channels")
Observations:
Limited Exposure to Educational Channels:
Underutilization of Educational Channels*:
# plot the distribution of the referral feature
plot_binary_feature(data, "referral", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Referral")
Observations:
Very Limited Use of Referral Channel:
Underutilization of Referrals:
Leads will have different expectations from the outcome of the course, and their current occupation may play a key role in getting them to participate in the program. How does current occupation affect lead status?
Conversion Rates
# Plot status counts
plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'status', data = data, palette='tab10')
# Annotating the exact count on the top of the bar for each category
for p in ax.patches:
ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()+ 0.35))
# Calculate conversion percentages
conversion_percentages = round(data['status'].value_counts(normalize=True) * 100, 2)
print(conversion_percentages)
status 0 70.14 1 29.86 Name: proportion, dtype: float64
Lead Status
# Group by 'current_occupation' and 'status' to get the counts for each occupation
occupation_counts = data.groupby(['current_occupation', 'status']).size().unstack()
# Define the correct x-axis labels
occupation_labels = ['Professional', 'Student', 'Unemployed']
# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))
# Plotting status 0 at the bottom and status 1 on top for each occupation with overridden x-axis labels
plt.bar(occupation_labels, occupation_counts[0], label='0', color='blue')
plt.bar(occupation_labels, occupation_counts[1],
bottom=occupation_counts[0], label='1', color='orange')
# Adding labels and title
plt.title('Lead Status by Current Occupation')
plt.xlabel('Current Occupation')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')
# Adding annotations for each bar (status 0 and status 1 counts)
for i, occupation in enumerate(occupation_labels):
plt.text(i, occupation_counts[0][i] / 2, f'{occupation_counts[0][i]}', ha='center', color='white')
plt.text(i, occupation_counts[0][i] + (occupation_counts[1][i] / 2), f'{occupation_counts[1][i]}', ha='center', color='black')
plt.tight_layout()
plt.show()
Observations:
Professionals:
Students:
Unemployed:
Key Insights:
The lead count is heavily skewed toward professionals, who represent the largest portion of the leads.
Students form the smallest group, which could affect the robustness of insights drawn from their data.
The unemployed group serves as a middle ground in terms of lead count but still shows a notable difference when compared to professionals.
We have now explored how current occupation affects lead status.
The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?
First Channels of Interaction
# Group the data by first_interaction and status, and calculate the counts
interaction_counts = data.groupby(['first_interaction', 'status']).size().unstack()
# Create a stacked bar plot
ax = interaction_counts.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Lead Status by First Interaction Channel')
plt.xlabel('First Interaction Channel')
plt.ylabel('Count of Leads')
# Override x-axis labels explicitly with desired labels
ax.set_xticks([0, 1]) # Set positions for the ticks (number of unique values in 'first_interaction')
ax.set_xticklabels(['Mobile App', 'Website'], rotation=45) # Manually set labels
# Add the legend with proper labels for lead status
plt.legend(title='Lead Status', labels=['0', '1'])
# Add annotations on top of each bar
for container in ax.containers:
ax.bar_label(container, label_type='center')
plt.tight_layout()
# Show the plot
plt.show()
Observations:
Mobile App Interaction:
Website Interaction:
Key Insights:
We have now explored how the initial channel of interaction affects lead status.
The company uses multiple modes to interact with prospects. Which way of interaction works best?
# Group by 'last_activity' and 'status' to get the counts for each mode of communication
interaction_counts = data.groupby(['last_activity', 'status']).size().unstack()
# Define the x-axis labels in the same order as the encoded categories (0 = Email, 1 = Phone, 2 = Website)
occupation_labels = ['Email Activity', 'Phone Activity', 'Website Activity']
# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))
# Plotting status 0 at the bottom and status 1 on top for each occupation with overridden x-axis labels
plt.bar(occupation_labels, interaction_counts[0], label='0', color='blue')
plt.bar(occupation_labels, interaction_counts[1],
bottom=interaction_counts[0], label='1', color='orange')
# Adding labels and title
plt.title('Lead Status by Mode of Communication')
plt.xlabel('Mode of Communication')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')
# Adding annotations for each bar (status 0 and status 1 counts)
for i, occupation in enumerate(occupation_labels):
plt.text(i, interaction_counts[0][i] / 2, f'{interaction_counts[0][i]}', ha='center', color='white')
plt.text(i, interaction_counts[0][i] + (interaction_counts[1][i] / 2), f'{interaction_counts[1][i]}', ha='center', color='black')
plt.tight_layout()
plt.show()
# Calculate conversion rates
conversion_rates = round(interaction_counts.apply(lambda row: row[1] / (row[0] + row[1]), axis=1) * 100, 2)
conversion_rates
0 | |
---|---|
last_activity | |
0 | 30.33 |
1 | 21.31 |
2 | 38.45 |
Observations:
Email Activity:
Phone Activity:
Website Activity:
Key Insights:
Email Activity has the highest total number of leads, with a 30.33% conversion rate.
Phone Activity has a lower conversion rate of 21.31%, making it the least effective communication method for conversion.
Website Activity has the highest conversion rate of 38.45%.
We have now explored how modes of communication affect lead status.
The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?
# Define lead sources
sources = ['print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral']
# Group by 'status'. Select the source columns and convert them to numeric type before summing.
media_sources = data.groupby('status')[sources].apply(lambda x: x.apply(pd.to_numeric, errors='coerce').sum()).T
# Define the correct x-axis labels
source_labels = ['Print Media Type 1', 'Print Media Type 2', 'Digital Media', 'Educational Channels', 'Referral']
# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))
# Plotting status 0 at the bottom and status 1 on top for each source with overridden x-axis labels
plt.bar(source_labels, media_sources[0], label='0', color='blue')
plt.bar(source_labels, media_sources[1],
bottom=media_sources[0], label='1', color='orange')
# Adding labels and title
plt.title('Lead Status by Mode of Communication')
plt.xlabel('Mode of Communication')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')
# Adding annotations for each bar (status 0 and status 1 counts)
for i, source in enumerate(source_labels):
plt.text(i, media_sources[0][i] / 2, f'{media_sources[0][i]}', ha='center', color='white')
plt.text(i, media_sources[0][i] + (media_sources[1][i] / 2), f'{media_sources[1][i]}', ha='center', color='black')
plt.tight_layout()
plt.show()
# Calculate conversion rates
conversion_rates = round(media_sources.apply(lambda row: row[1] / (row[0] + row[1]), axis=1) * 100, 2)
conversion_rates
0 | |
---|---|
print_media_type1 | 31.99 |
print_media_type2 | 32.19 |
digital_media | 31.88 |
educational_channels | 27.94 |
referral | 67.74 |
Observations:
Print and Digital Media Channels
Insight:
Educational Channels
Insight:
Referral
Referral: 67.74% conversion rate
Insight:
People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?
# Group by 'profile_completed' and 'status' to get the counts for each profile completion level
profile = data.groupby(['profile_completed', 'status']).size().unstack()
# Calculate conversion rates for sorting
# Assuming 'status' == 1 indicates a conversion
profile['conversion_rate'] = profile[1] / (profile[0] + profile[1])
# Sort the profile_completed levels by conversion_rate in descending order
profile_sorted = profile.sort_values(by='conversion_rate', ascending=False)
# Extract the sorted profile_completed labels
profile_labels = profile_sorted.index.tolist()
# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))
# Plotting status 0 at the bottom and status 1 on top for each profile_completed with sorted labels
plt.bar(profile_labels, profile_sorted[0], label='0', color='blue')
plt.bar(profile_labels, profile_sorted[1],
bottom=profile_sorted[0], label='1', color='orange')
# Adding labels and title
plt.title('Lead Conversion Status by Profile Completion Level')
plt.xlabel('Profile Completion Level')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')
# Adding annotations for each bar (status 0 and status 1 counts)
for i, label in enumerate(profile_labels):
plt.text(i, profile_sorted[0][i] / 2, f'{profile_sorted[0][i]}', ha='center', color='white')
plt.text(i, profile_sorted[0][i] + (profile_sorted[1][i] / 2), f'{profile_sorted[1][i]}', ha='center', color='black')
plt.tight_layout()
plt.show()
# Calculate conversion rates
conversion_rates = round(profile.apply(lambda row: row[1] / (row[0] + row[1]), axis=1) * 100, 2)
conversion_rates
0 | |
---|---|
profile_completed | |
0 | 41.78 |
1 | 7.48 |
2 | 18.88 |
Observations:
High Profile Completion
Insight:
Potential Factors Contributing to High Conversion:
Enhanced Personalization: Detailed profiles may enable more personalized experiences and targeted offerings, increasing user satisfaction and conversion chances.
Increased Commitment: The act of completing a profile may serve as a commitment device, making users more likely to follow through with conversions.
Trust and Credibility: Providing extensive personal information can indicate a higher level of trust in the platform, fostering a conducive environment for conversions.
Medium Profile Completion
Insight:
Potential Improvement Areas:
User Experience Enhancements: Simplifying the profile completion process or providing incentives for completing profiles can encourage users to provide more information.
Targeted Communication: Implementing targeted prompts or reminders can motivate users to complete their profiles, thereby potentially increasing conversion rates.
Value Proposition Clarification: Clearly communicating the benefits of completing a profile may encourage users to provide more detailed information.
Low Profile Completion
Insight:
Potential Challenges:
Perceived Invasiveness: Users might find extensive data collection intrusive, leading to incomplete profiles and reduced conversion likelihood.
User Drop-Offs: Lengthy or complicated profile forms can deter users from completing their profiles, resulting in lower engagement and conversions.
Lack of Immediate Incentives: Without clear, immediate benefits for completing profiles, users may not see the value in providing additional information.
Key Insights
Positive Correlation: There is a clear positive correlation between profile completeness and conversion rates. As profile completeness increases from low to high, conversion rates rise significantly from 7.48% to 41.78%.
Engagement Levels: Higher profile completion levels indicate greater user engagement and commitment, which are critical drivers of conversions.
Opportunity for Optimization: Users with medium and low profile completion levels present substantial opportunities for targeted interventions to enhance their engagement and conversion rates.
# Create a correlation matrix (numerical values only)
# Exclude non-numeric columns, but add the integer-encoded target so its correlations are shown
numeric_data = data.select_dtypes(include=np.number).copy()
numeric_data['status'] = data['status'].astype(int)
# Create heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(numeric_data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="vlag")
plt.title("Correlation Heatmap")
plt.show()
Observations:
Target feature correlations:
Correlation between status and age: 0.12
Correlation between status and website_visits: -0.01
Correlation between status and time_spent_on_website: 0.30
Correlation between status and page_views_per_visit: 0.00
Other feature correlations:
age has almost no correlation with website_visits (-0.01), time_spent_on_website (0.02), and page_views_per_visit (-0.04). These relationships are very weak and do not show any meaningful patterns.
website_visits has weak positive correlations with time_spent_on_website (0.06) and page_views_per_visit (0.07). Both relationships are minimal, showing only slight increases in time spent and page views as website visits increase.
time_spent_on_website has a weak positive correlation with page_views_per_visit (0.07), indicating a minimal relationship between the time spent on the website and the number of pages viewed per visit.
page_views_per_visit has very weak positive correlations with both website_visits (0.07) and time_spent_on_website (0.07), indicating minimal relationships.
# Generate boxplot 'status' vs 'age'
plot_boxplot(
x_column='status',
y_column='age',
data=dataset,
title='Status vs Age'
)
Observations:
Median Age:
Interquartile Range (IQR):
Whiskers (Range of Ages):
Outliers:
Interpretation:
Age Differences by Status: The median age for status 1 is higher than for status 0. However, the IQR for status 1 is more concentrated around the higher median, indicating less variability in the age distribution compared to status 0.
Potential Age Concentration: Individuals with status 1 have a tighter age distribution, whereas status 0 leads have a wider age range, suggesting greater variability.
# Generate count plot of 'status' vs 'current_occupation'
plot_countplot(
x_column='status',
hue_column='current_occupation',
data=dataset,
title='Lead Status vs Current Occupation'
)
Observations:
Professional:
Unemployed:
Students:
Interpretation:
# Generate count plot of 'status' vs 'first_interaction'
plot_countplot(
x_column='status',
hue_column='first_interaction',
data=dataset,
title='Lead Status vs First Interaction'
)
Observations:
Website Interaction:
Mobile App Interaction:
Interpretation:
# Generate count plot of 'status' vs 'profile_completed'
plot_countplot(
x_column='status',
hue_column='profile_completed',
data=dataset,
title='Lead Status vs Profile Completed'
)
Observations:
High Profile Completion:
Medium Profile Completion:
Low Profile Completion:
Interpretation:
# Generate boxplot with 'status' vs 'website_visits'
plot_boxplot(
x_column='status',
y_column='website_visits',
data=dataset,
title='Status vs Website Visits'
)
Observations:
Median Website Visits:
Interquartile Range (IQR):
Outliers:
Interpretation:
# Generate boxplot with 'status' vs 'time_spent_on_website' on the y-axis
plot_boxplot(
x_column='status',
y_column='time_spent_on_website',
data=dataset,
title='Status vs Time Spent on Website'
)
Observations:
Median Time Spent:
Interquartile Range (IQR):
Outliers:
Interpretation:
# Generate the boxplot with 'status' vs 'page_views_per_visit'
plot_boxplot(
x_column='status',
y_column='page_views_per_visit',
data=dataset,
title='Status vs Page Views per Visit'
)
Observations:
Median Page Views per Visit:
Interquartile Range (IQR):
Outliers:
Interpretation:
This suggests that page views per visit are quite similar between the two groups, though status 0 users tend to exhibit more variability.
# Generate count plot of 'status' vs 'last_activity'
plot_countplot(
x_column='status',
hue_column='last_activity',
data=dataset,
title='Lead Status vs Last Activity'
)
Observations:
Website Activity:
Email Activity:
Phone Activity:
Interpretation:
# Generate count plot of 'status' vs 'print_media_type1'
plot_countplot(
x_column='status',
hue_column='print_media_type1',
data=dataset,
title='Status vs Print Media Type 1',
hue_order=['Yes', 'No']
)
Observations:
Print Media Type 1 - Yes:
Print Media Type 1 - No:
Interpretation:
# Generate count plot of 'status' vs 'print_media_type2'
plot_countplot(
x_column='status',
hue_column='print_media_type2',
data=dataset,
title='Status vs Print Media Type 2',
hue_order=['Yes', 'No']
)
Observations:
Print Media Type 2 - Yes:
Print Media Type 2 - No:
Interpretation:
# Generate count plot of 'status' vs 'digital_media'
plot_countplot(
x_column='status',
hue_column='digital_media',
data=dataset,
title='Status vs Digital Media',
hue_order=['Yes', 'No']
)
Observations:
Digital Media - Yes:
Digital Media - No:
Interpretation:
# Generate count plot of 'status' vs 'educational_channels'
plot_countplot(
x_column='status',
hue_column='educational_channels',
data=dataset,
title='Status vs Educational Channels',
hue_order=['Yes', 'No']
)
Observations:
Educational Channels - Yes:
Educational Channels - No:
Interpretation:
# Generate count plot of 'status' vs 'referral'
plot_countplot(
x_column='status',
hue_column='referral',
data=dataset,
title='Status vs Referral',
hue_order=['Yes', 'No']
)
Observations:
Referral - Yes:
Referral - No:
Interpretation:
# plot 'age' vs 'website_visits'
# Create age bins for better grouping
dataset['age_group'] = pd.cut(dataset['age'], bins=[10, 20, 30, 40, 50, 60, 70], labels=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])
# Generate boxplot
plot_boxplot(
x_column='age_group',
y_column='website_visits',
data=dataset,
title='Website Visits by Age Group'
)
Observations:
Age and Website Visits:
Boxplot Insights:
Outliers:
Interpretation:
The data indicates that age does not impact how frequently users visit the website. Whether a user is in their 20s, 30s, 40s, or older, their website visit patterns are similar to those of users in other age groups.
The distribution of visits remains consistent across age groups, with no particular age group exhibiting notably different behavior in terms of visit frequency. The presence of outliers suggests that for all age groups, there are some highly active users, but their distribution across ages is consistent.
# Generate the boxplot for 'age' vs 'time_spent_on_website'
# Create age bins for better grouping
dataset['age_group'] = pd.cut(dataset['age'], bins=[10, 20, 30, 40, 50, 60, 70], labels=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])
# Generate boxplot
plot_boxplot(
x_column='age_group',
y_column='time_spent_on_website',
data=dataset,
title='Time Spent on Website by Age Group'
)
Observations:
Age and Time Spent on Website:
Meaning of the Correlation:
Distribution Across Age Groups:
Interpretation:
# Generate the boxplot for 'age' vs 'page_views_per_visit'
# Create age bins for better grouping
dataset['age_group'] = pd.cut(dataset['age'], bins=[10, 20, 30, 40, 50, 60, 70], labels=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])
# Generate boxplot
plot_boxplot(
x_column='age_group',
y_column='page_views_per_visit',
data=dataset,
title='Page Views per Visits by Age Group'
)
Observations:
Age and Page Views per Visit:
Boxplot Insights:
Outliers:
Interpretation:
The data suggests that age does not play a major role in determining how many pages a user views per visit. Users from all age groups exhibit similar behavior in terms of page views per visit, with no distinct patterns emerging as age increases or decreases.
Outliers in each age group represent users who view many more pages per visit than others, but these outliers are not concentrated in any one age group.
# Generate boxplot for 'website_visits' vs 'time_spent_on_website'
plot_boxplot(
x_column='website_visits',
y_column='time_spent_on_website',
data=dataset,
title='Website Visits by Time Spent on Website'
)
Observations:
Website visits and time spent on the website have a weak positive correlation (0.06).
The boxplot shows a slight upward trend, suggesting that users who visit the website more frequently tend to spend more time overall, but the relationship is not strong.
Most users fall within a moderate range of time spent on the website, regardless of how many visits they make.
Website Visits and Time Spent on the Website:
Boxplot Insights:
Outliers:
Interpretation:
Although a slight upward trend in time spent per visit exists, the weak correlation signifies that visit frequency alone is not a reliable predictor of the amount of time spent on the website.
Most users, irrespective of their visit frequency, tend to spend a moderate amount of time on the site, suggesting that other factors likely influence engagement more than just the number of visits.
The variability in user behavior, indicated by outliers, suggests that understanding user motivations or specific content may be crucial for enhancing engagement strategies.
# plot for 'website_visits' vs 'page_views_per_visit'
plot_boxplot(
figsize=(15, 10),
x_column='website_visits',
y_column='page_views_per_visit',
data=dataset,
title='Website Visits by Page Views per Visit'
)
Observations:
Website Visits and Page Views per Visit:
Outliers:
Interpretation:
While there may be a slight trend between visit frequency and the number of pages viewed per visit, the relationship appears weak.
The overlap in page views per visit across different visit counts shows that the majority of users, regardless of how many visits they make, view a similar number of pages per visit. This suggests that frequency of visits alone does not strongly predict page views per visit.
The presence of outliers indicates that some users deviate significantly from this pattern, further reinforcing that individual behaviors are diverse and not solely dependent on visit frequency.
# Perform multicollinearity check
# Select numeric columns for analysis
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
data_numeric = data[numeric_cols]
# Calculate VIF for each numeric variable
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(data):
    vif_data = pd.DataFrame()
    vif_data["Variable"] = data.columns
    vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    return vif_data

vif_results = calculate_vif(data_numeric)
print(vif_results)
Observations:
Variance Inflation Factors (VIF) Results:
Variable | VIF |
---|---|
age | 3.94 |
website_visits | 2.43 |
time_spent_on_website | 1.91 |
page_views_per_visit | 2.96 |
Multicollinearity Check:
Future Considerations:
# Check dataset for missing values
data.isnull().sum()
Column | Missing Values |
---|---|
ID | 0 |
age | 0 |
current_occupation | 0 |
first_interaction | 0 |
profile_completed | 0 |
website_visits | 0 |
time_spent_on_website | 0 |
page_views_per_visit | 0 |
last_activity | 0 |
print_media_type1 | 0 |
print_media_type2 | 0 |
digital_media | 0 |
educational_channels | 0 |
referral | 0 |
status | 0 |
There are no missing values.
# Check dataset for duplicated data
data.duplicated().sum()
0
There is no duplicated data.
# Get the feature names before making any changes
feature_names = data.columns.tolist()
print(feature_names)
['ID', 'age', 'current_occupation', 'first_interaction', 'profile_completed', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'last_activity', 'print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral', 'status']
# Drop the ID column
data.drop('ID', axis=1, inplace=True)
# Get the feature names after dropping the ID column
feature_names = data.columns.tolist()
print(feature_names)
['age', 'current_occupation', 'first_interaction', 'profile_completed', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'last_activity', 'print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral', 'status']
# Perform one hot encoding on categorical features
# Select only the categorical columns
categorical_cols = [
'current_occupation',
'first_interaction',
'profile_completed',
'last_activity',
'print_media_type1',
'print_media_type2',
'digital_media',
'educational_channels',
'referral'
]
# Apply one-hot encoding to categorical variables
# data_encoded = pd.get_dummies(data[categorical_cols], drop_first=True) # What happens to the decision tree if I dont drop the first item
data_encoded = pd.get_dummies(data[categorical_cols])
'''
The current_occupation_professional dummy is a key indicator of conversion, but it is exactly the column that would be removed if drop_first=True,
so I will keep all dummy columns as is and remove one of the less important ones during feature engineering.
Interestingly, this choice did not make a significant difference in the performance of a few of the models, notably the decision tree.
'''
# Convert boolean values (True/False) to integers (1/0)
data_encoded = data_encoded.astype(int)
# Combine the encoded categorical features with the original numerical columns
numerical_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'status']
data_encoded = pd.concat([data[numerical_cols], data_encoded], axis=1)
# Get the feature names after encoding
feature_names = data_encoded.columns.tolist()
print(feature_names)
['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'status', 'current_occupation_0', 'current_occupation_1', 'current_occupation_2', 'first_interaction_0', 'first_interaction_1', 'profile_completed_0', 'profile_completed_1', 'profile_completed_2', 'last_activity_0', 'last_activity_1', 'last_activity_2', 'print_media_type1_0', 'print_media_type1_1', 'print_media_type2_0', 'print_media_type2_1', 'digital_media_0', 'digital_media_1', 'educational_channels_0', 'educational_channels_1', 'referral_0', 'referral_1']
# count the number of values in feature_names
len(feature_names)
26
The dataframe now has 26 features after OHE.
Let's look at the shape to get an idea of how it has changed.
# calculate the change in the number of columns between data and data_encoded
columns_added = data_encoded.shape[1] - data.shape[1]
print(f"Number of columns added due to encoding: {columns_added}")
Number of columns added due to encoding: 12
Observations:
We previously identified several continuous features that have potential outliers:
There are two approaches to consider. The first involves performing our classifications with the outliers present and addressing them as needed. The second approach is to remove the outliers outright. For the purposes of this project, the outliers will be removed. If time permits, I will also conduct analyses with the outliers included and compare the performance of each model.
Let's regenerate the boxplots to aid with visualization.
# List of numerical features previously identified as having outliers
potential_outliers = ['website_visits', 'time_spent_on_website', 'page_views_per_visit']
# Set up the figure with the appropriate number of subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5)) # 1 row and 3 columns since we have 3 plots
# Create boxplots for each numerical column
for i, col in enumerate(potential_outliers):
sns.boxplot(x=data_encoded[col], ax=axes[i], palette='tab10') # Apply 'tab10' color scheme
axes[i].set_title(f'Boxplot of {col}')
# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
Let's use the Interquartile Range (IQR) statistical method to systematically detect outliers, after which we will remove them.
The IQR method identifies outliers as values falling below Q1 - 1.5 IQR or above Q3 + 1.5 IQR, where Q1 and Q3 are the 25th and 75th percentiles, respectively.
# Detect and remove outliers using the IQR method
def remove_outliers(df, columns):
for col in columns:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
return df
# Remove outliers
data_cleaned = remove_outliers(data_encoded, potential_outliers)
Let's regenerate the boxplots to visually verify outliers have been removed successfully.
# List of numerical features previously identified as having outliers
potential_outliers = ['website_visits', 'time_spent_on_website', 'page_views_per_visit']
# Set up the figure with the appropriate number of subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5)) # 1 row and 3 columns since we have 3 plots
# Create boxplots for each numerical column
for i, col in enumerate(potential_outliers):
sns.boxplot(x=data_cleaned[col], ax=axes[i], palette='tab10') # Apply 'tab10' color scheme
axes[i].set_title(f'Boxplot of {col}')
# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
# Calculate the number of rows removed
rows_removed = len(data_encoded) - len(data_cleaned)
rows_removed
404
We have removed 404 rows that contained extreme outlier values.
# returns the first 5 rows
data_cleaned.head()
age | website_visits | time_spent_on_website | page_views_per_visit | status | current_occupation_0 | current_occupation_1 | current_occupation_2 | first_interaction_0 | first_interaction_1 | ... | print_media_type1_0 | print_media_type1_1 | print_media_type2_0 | print_media_type2_1 | digital_media_0 | digital_media_1 | educational_channels_0 | educational_channels_1 | referral_0 | referral_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 57 | 7 | 1639 | 1.861 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
1 | 56 | 2 | 83 | 0.320 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
2 | 52 | 3 | 330 | 0.074 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
3 | 53 | 4 | 464 | 2.057 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
5 | 50 | 4 | 212 | 5.682 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
5 rows × 26 columns
# returns the last 5 rows
data_cleaned.tail()
age | website_visits | time_spent_on_website | page_views_per_visit | status | current_occupation_0 | current_occupation_1 | current_occupation_2 | first_interaction_0 | first_interaction_1 | ... | print_media_type1_0 | print_media_type1_1 | print_media_type2_0 | print_media_type2_1 | digital_media_0 | digital_media_1 | educational_channels_0 | educational_channels_1 | referral_0 | referral_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4606 | 58 | 7 | 210 | 3.598 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
4608 | 55 | 8 | 2327 | 5.393 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
4609 | 58 | 2 | 212 | 2.692 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
4610 | 57 | 1 | 154 | 3.879 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
4611 | 55 | 4 | 2290 | 2.075 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
5 rows × 26 columns
# Determine the number of rows and columns by calling data_cleaned.shape
print(f"{data_cleaned.shape[0]} {data_cleaned.shape[1]}")
4208 26
data_cleaned.describe(include = "all").T
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 4208.0 | NaN | NaN | NaN | 46.360741 | 13.079381 | 18.0 | 36.00000 | 51.0000 | 57.00000 | 63.000 |
website_visits | 4208.0 | NaN | NaN | NaN | 3.231464 | 2.157708 | 0.0 | 2.00000 | 3.0000 | 5.00000 | 9.000 |
time_spent_on_website | 4208.0 | NaN | NaN | NaN | 718.396150 | 742.652073 | 0.0 | 142.00000 | 375.0000 | 1310.75000 | 2537.000 |
page_views_per_visit | 4208.0 | NaN | NaN | NaN | 2.716066 | 1.490995 | 0.0 | 2.06075 | 2.2855 | 3.64325 | 6.266 |
status | 4208.0 | 2.0 | 0.0 | 2939.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
current_occupation_0 | 4208.0 | NaN | NaN | NaN | 0.568679 | 0.495320 | 0.0 | 0.00000 | 1.0000 | 1.00000 | 1.000 |
current_occupation_1 | 4208.0 | NaN | NaN | NaN | 0.115970 | 0.320226 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
current_occupation_2 | 4208.0 | NaN | NaN | NaN | 0.315352 | 0.464711 | 0.0 | 0.00000 | 0.0000 | 1.00000 | 1.000 |
first_interaction_0 | 4208.0 | NaN | NaN | NaN | 0.446530 | 0.497192 | 0.0 | 0.00000 | 0.0000 | 1.00000 | 1.000 |
first_interaction_1 | 4208.0 | NaN | NaN | NaN | 0.553470 | 0.497192 | 0.0 | 0.00000 | 1.0000 | 1.00000 | 1.000 |
profile_completed_0 | 4208.0 | NaN | NaN | NaN | 0.494297 | 0.500027 | 0.0 | 0.00000 | 0.0000 | 1.00000 | 1.000 |
profile_completed_1 | 4208.0 | NaN | NaN | NaN | 0.023051 | 0.150084 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
profile_completed_2 | 4208.0 | NaN | NaN | NaN | 0.482652 | 0.499758 | 0.0 | 0.00000 | 0.0000 | 1.00000 | 1.000 |
last_activity_0 | 4208.0 | NaN | NaN | NaN | 0.496673 | 0.500048 | 0.0 | 0.00000 | 0.0000 | 1.00000 | 1.000 |
last_activity_1 | 4208.0 | NaN | NaN | NaN | 0.263783 | 0.440736 | 0.0 | 0.00000 | 0.0000 | 1.00000 | 1.000 |
last_activity_2 | 4208.0 | NaN | NaN | NaN | 0.239544 | 0.426856 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
print_media_type1_0 | 4208.0 | NaN | NaN | NaN | 0.891635 | 0.310878 | 0.0 | 1.00000 | 1.0000 | 1.00000 | 1.000 |
print_media_type1_1 | 4208.0 | NaN | NaN | NaN | 0.108365 | 0.310878 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
print_media_type2_0 | 4208.0 | NaN | NaN | NaN | 0.948669 | 0.220698 | 0.0 | 1.00000 | 1.0000 | 1.00000 | 1.000 |
print_media_type2_1 | 4208.0 | NaN | NaN | NaN | 0.051331 | 0.220698 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
digital_media_0 | 4208.0 | NaN | NaN | NaN | 0.886882 | 0.316774 | 0.0 | 1.00000 | 1.0000 | 1.00000 | 1.000 |
digital_media_1 | 4208.0 | NaN | NaN | NaN | 0.113118 | 0.316774 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
educational_channels_0 | 4208.0 | NaN | NaN | NaN | 0.848859 | 0.358229 | 0.0 | 1.00000 | 1.0000 | 1.00000 | 1.000 |
educational_channels_1 | 4208.0 | NaN | NaN | NaN | 0.151141 | 0.358229 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
referral_0 | 4208.0 | NaN | NaN | NaN | 0.980513 | 0.138244 | 0.0 | 1.00000 | 1.0000 | 1.00000 | 1.000 |
referral_1 | 4208.0 | NaN | NaN | NaN | 0.019487 | 0.138244 | 0.0 | 0.00000 | 0.0000 | 0.00000 | 1.000 |
Age:
Website Visits:
Time Spent on Website:
Page Views per Visit:
Status (Target Variable):
Current Occupation:
First Interaction:
Profile Completion:
Last Activity:
Media & Referral Channels:
In summary, the major differences are the removal of outliers, the use of one-hot encoding, and a slightly more refined distribution of website visits and page views. The overall trends remain consistent, particularly in terms of age, conversion rates, and interaction patterns.
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
print(col)
enhanced_histogram_boxplot(data_cleaned, col, kde = True, custom_title = col) # Pass the column name as a string
age Statistical Summary for age: Mean: 46.36 Median: 51.00 Standard Deviation: 13.08 Skewness: -0.74 Kurtosis: -0.76 Outlier Analysis: IQR method - Number of outliers: 0 IQR method - Percentage of outliers: 0.00% IQR method - Outlier range: < 4.50 or > 88.50 Z-score method - Number of outliers (|z| > 3): 0 Z-score method - Percentage of outliers: 0.00%
website_visits Statistical Summary for website_visits: Mean: 3.23 Median: 3.00 Standard Deviation: 2.16 Skewness: 0.80 Kurtosis: -0.14 Outlier Analysis: IQR method - Number of outliers: 0 IQR method - Percentage of outliers: 0.00% IQR method - Outlier range: < -2.50 or > 9.50 Z-score method - Number of outliers (|z| > 3): 0 Z-score method - Percentage of outliers: 0.00%
time_spent_on_website Statistical Summary for time_spent_on_website: Mean: 718.40 Median: 375.00 Standard Deviation: 742.65 Skewness: 0.97 Kurtosis: -0.54 Outlier Analysis: IQR method - Number of outliers: 0 IQR method - Percentage of outliers: 0.00% IQR method - Outlier range: < -1611.12 or > 3063.88 Z-score method - Number of outliers (|z| > 3): 0 Z-score method - Percentage of outliers: 0.00%
page_views_per_visit Statistical Summary for page_views_per_visit: Mean: 2.72 Median: 2.29 Standard Deviation: 1.49 Skewness: 0.02 Kurtosis: -0.32 Outlier Analysis: IQR method - Number of outliers: 11 IQR method - Percentage of outliers: 0.26% IQR method - Outlier range: < -0.31 or > 6.02 Z-score method - Number of outliers (|z| > 3): 0 Z-score method - Percentage of outliers: 0.00%
Age:
Website Visits:
Time Spent on Website:
Page Views per Visit:
General Patterns and Insights:
# Original Categories (w/Manual Mapping)
original_categories = {
'current_occupation': ['Professional', 'Unemployed', 'Student'],
'first_interaction': ['Website', 'Mobile App'],
'profile_completed': ['Low', 'Medium', 'High'],
'last_activity': ['Email Activity', 'Phone Activity', 'Website Activity'],
'print_media_type1': ['No', 'Yes'],
'print_media_type2': ['No', 'Yes'],
'digital_media': ['No', 'Yes'],
'educational_channels': ['No', 'Yes'],
'referral': ['No', 'Yes']
# Add other variables and their categories here
}
# Initialize the mapping dictionary
ohe_mapping = {}
# Get the columns from the DataFrame
columns = data_cleaned.columns
for col in columns:
try:
variable, category_idx = col.rsplit('_', 1)
category_label = original_categories.get(variable, [])[int(category_idx)] if variable in original_categories else category_idx
except (ValueError, IndexError):
variable = col
category_label = None # Handle unexpected naming patterns
if variable not in ohe_mapping:
ohe_mapping[variable] = {}
ohe_mapping[variable][col] = category_label
# Iterate and print value counts with original labels
for variable, cols in ohe_mapping.items():
print(f"Variable: {variable}")
for ohe_col, category_label in cols.items():
print(f" Category: {category_label}")
print(data_cleaned[ohe_col].value_counts(normalize=True))
print('*' * 40)
print('=' * 60)
Variable: age Category: None age 57 0.084601 58 0.082700 56 0.072719 59 0.071055 60 0.052281 55 0.042063 32 0.039686 53 0.020437 50 0.019724 43 0.019487 48 0.019487 54 0.019487 51 0.019011 49 0.018774 46 0.018774 21 0.018298 52 0.018061 42 0.017823 24 0.017823 23 0.017348 45 0.017348 19 0.017110 47 0.017110 34 0.016873 44 0.016635 33 0.016160 20 0.015447 22 0.015209 41 0.015209 35 0.014971 18 0.014496 40 0.013783 38 0.013070 37 0.012595 36 0.012357 39 0.011644 63 0.010932 62 0.010456 30 0.009743 29 0.007842 61 0.007842 31 0.007842 28 0.005703 25 0.003802 26 0.003327 27 0.002852 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: website Category: visits website_visits 2 0.272338 1 0.171578 3 0.144487 4 0.110266 5 0.092681 6 0.063926 7 0.051331 0 0.041350 8 0.033983 9 0.018061 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: time_spent_on Category: website time_spent_on_website 0 0.041350 1 0.015922 65 0.004515 83 0.004040 76 0.004040 ... 2500 0.000238 1540 0.000238 1862 0.000238 1397 0.000238 2290 0.000238 Name: proportion, Length: 1541, dtype: float64 **************************************** ============================================================ Variable: page_views_per Category: visit page_views_per_visit 0.000 0.043013 2.168 0.003327 2.154 0.003089 2.192 0.002614 2.188 0.002376 ... 1.826 0.000238 4.954 0.000238 4.295 0.000238 5.577 0.000238 2.692 0.000238 Name: proportion, Length: 2126, dtype: float64 **************************************** ============================================================ Variable: status Category: None status 0 0.698432 1 0.301568 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: current_occupation Category: Professional current_occupation_0 1 0.568679 0 0.431321 Name: proportion, dtype: float64 **************************************** Category: Unemployed current_occupation_1 0 0.88403 1 0.11597 Name: proportion, dtype: float64 **************************************** Category: Student current_occupation_2 0 0.684648 1 0.315352 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: first_interaction Category: Website first_interaction_0 0 0.55347 1 0.44653 Name: proportion, dtype: float64 **************************************** Category: Mobile App first_interaction_1 1 0.55347 0 0.44653 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: profile_completed Category: Low profile_completed_0 0 0.505703 1 0.494297 Name: proportion, dtype: float64 **************************************** Category: Medium profile_completed_1 0 0.976949 1 0.023051 Name: proportion, dtype: float64 **************************************** Category: High profile_completed_2 0 0.517348 1 0.482652 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: last_activity Category: Email Activity last_activity_0 0 0.503327 1 0.496673 Name: proportion, dtype: float64 **************************************** Category: Phone Activity last_activity_1 0 0.736217 1 0.263783 Name: proportion, dtype: float64 
**************************************** Category: Website Activity last_activity_2 0 0.760456 1 0.239544 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: print_media_type1 Category: No print_media_type1_0 1 0.891635 0 0.108365 Name: proportion, dtype: float64 **************************************** Category: Yes print_media_type1_1 0 0.891635 1 0.108365 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: print_media_type2 Category: No print_media_type2_0 1 0.948669 0 0.051331 Name: proportion, dtype: float64 **************************************** Category: Yes print_media_type2_1 0 0.948669 1 0.051331 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: digital_media Category: No digital_media_0 1 0.886882 0 0.113118 Name: proportion, dtype: float64 **************************************** Category: Yes digital_media_1 0 0.886882 1 0.113118 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: educational_channels Category: No educational_channels_0 1 0.848859 0 0.151141 Name: proportion, dtype: float64 **************************************** Category: Yes educational_channels_1 0 0.848859 1 0.151141 Name: proportion, dtype: float64 **************************************** ============================================================ Variable: referral Category: No referral_0 1 0.980513 0 0.019487 Name: proportion, dtype: float64 **************************************** Category: Yes referral_1 0 0.980513 1 0.019487 Name: proportion, dtype: float64 **************************************** ============================================================
Current Occupation:
Observation: The dataset is predominantly made up of professionals, but there is still a large group of students, suggesting a mix of working and non-working individuals.
First Interaction:
Observation: The website is slightly more popular as a first point of interaction, though mobile app usage is not far behind, indicating a balanced distribution between these two channels.
Profile Completion:
Observation: There's a bimodal distribution between low and high profile completions, while the medium category is significantly underrepresented. This suggests that most users either leave their profiles incomplete or complete them fully, without stopping in between.
Last Activity:
Observation: Email is the most common form of recent activity, while phone and website activity are less frequent. This may reflect that email remains a dominant channel for engagement in this dataset.
Print Media Type 1:
Observation: Print media type 1 has low engagement, indicating that this channel may not be as effective or prevalent as digital media.
Print Media Type 2:
Observation: Even less engagement than print media type 1, suggesting print media in general may have minimal reach compared to other channels.
Digital Media:
Observation: The use of digital media is still relatively low, which could mean there's an opportunity for more engagement in this area or it could reflect that the audience primarily engages through other channels.
Educational Channels:
Observation: A small portion of the audience engages with educational channels, suggesting that this may not be a primary area of interest or need for the majority of users.
Referral:
Observation: The referral system is underutilized, indicating either a lack of promotion, limited effectiveness of the referral program, or that users are not highly motivated to refer others.
Key Insights:
Conclusion: The observations provide a clear view of user engagement and demographics within the dataset. The data suggests areas for potential growth, particularly in enhancing profile completion and leveraging digital media channels for better engagement.
# Create a correlation matrix (numerical values only)
# Exclude non-numeric columns
numeric_data = data_cleaned.select_dtypes(include=np.number)
# Create heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(numeric_data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="vlag")
plt.title("Correlation Heatmap")
plt.show()
Here are the verified observations based on the correlation analysis and heatmap results:
Focusing on Our Target Feature, status:
Status and Age (0.12): A weak positive correlation between status and age, indicating that as age increases, there is a slight tendency for users to have a higher status. This suggests that older individuals might have a different status compared to younger ones.
Status and Time Spent on Website (0.30): A moderate positive correlation between status and the time spent on the website. This indicates that users with a higher status tend to spend more time on the website, making it an important predictor for modeling.
Status and Page Views Per Visit (0.02): A near-zero correlation between status and page_views_per_visit, indicating that the number of pages viewed per visit does not have a strong linear relationship with status, making it unlikely to be a key predictor.
Status and Current Occupation - Student (-0.14): A weak negative correlation between status and being a student. This suggests that users who are students are somewhat less likely to have a higher status, indicating that student status may be inversely related to the target variable.
Status and Current Occupation - Unemployed (-0.05): A very weak negative correlation between status and being unemployed. This suggests that being unemployed has only a minor relationship with status.
Status and First Interaction Website (0.39): A moderate positive correlation between status and first interaction being through the website. This indicates that users who first interacted via the website are more likely to have a higher status, making this a valuable predictor for the model.
Status and Profile Completion - Low (0.01): A near-zero correlation between status and having low profile completion, suggesting that this feature is not significant for determining user status.
Status and Profile Completion - Medium (-0.11): A weak negative correlation between status and medium profile completion, suggesting that users with medium profile completion are slightly less likely to have a higher status.
Status and Last Activity - Phone Activity (-0.11): A weak negative correlation between status and phone activity being the last activity. This may indicate that users whose last activity was via phone tend to have a slightly lower status.
Status and Last Activity - Website Activity (0.05): A very weak positive correlation between status and website activity as the last interaction, suggesting little to no relationship with status.
Status and Digital Media Interaction (-0.01): A near-zero correlation between status and digital media interaction, indicating that this feature is not related to user status.
Status and Referral (0.12): A weak positive correlation between status and whether the user was referred. This suggests that users who came via a referral might have a slight tendency toward a higher status.
Summary of Key Takeaways:
Moderate Positive Correlations: First interaction through the website and time spent on the website show the strongest positive relationships with status, making them key variables for predicting user status.
Weak Positive Correlations: Age and referral are weakly positively correlated with status, suggesting that older users and those referred to the platform might have a slightly higher status.
Weak to Moderate Negative Correlations: Being a student, medium profile completion, and phone activity as the last activity are negatively correlated with status, indicating that these groups are somewhat less likely to have a higher status.
Near-Zero Correlations: Page views per visit, low profile completion, website activity as the last interaction, and digital media interaction show almost no correlation with status, suggesting they may not be useful predictors.
Conclusion:
The most promising predictors for status based on this correlation analysis are first interaction through the website and time spent on the website. Variables like age and referral may also provide some predictive power but will likely be weaker in impact. Features with very low correlations, such as page views per visit and print media interactions, may not be valuable for modeling and could potentially be removed.
Multicollinearity occurs when two or more independent variables (predictors) in a regression model are highly correlated with each other. This means that one predictor can be linearly predicted from the others with a high degree of accuracy.
Interpreting VIF:
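For reference (a standard definition, not taken from the analysis above): the VIF for a given predictor is obtained by regressing that predictor on all the other predictors and computing VIF_j = 1 / (1 − R_j²), where R_j² is the R-squared of that auxiliary regression. Commonly cited rules of thumb treat VIF values below 5 as low multicollinearity, values between 5 and 10 as moderate, and values above 10 as a sign of serious multicollinearity.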
# Perform multicollinearity check for numeric columns only
# Select numeric columns for VIF analysis
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
data_numeric = data[numeric_cols]
# Function to calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(data):
vif_data = pd.DataFrame()
vif_data["Variable"] = data.columns
vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
return vif_data
# Calculate VIF for numeric columns only
vif_results_numeric = calculate_vif(data_numeric)
print(vif_results_numeric)
Variance Inflation Factors (VIF) for the numeric variables
Variable | VIF |
---|---|
Age | 3.94 |
Website Visits | 2.43 |
Time Spent on Website | 1.91 |
Page Views per Visit | 2.96 |
Age (VIF = 3.94):
Website Visits (VIF = 2.43):
Time Spent on Website (VIF = 1.91):
Page Views per Visit (VIF = 2.96):
Key Insights:
The goal of this model is to:
Before training the model, let's choose the appropriate model evaluation criterion as per the problem at hand.
Model evaluation criterion
The model can make two types of wrong predictions:
Which case is more important?
How to reduce this loss i.e the need to reduce False Negatives?
# Separating the target variable and other variables
X = data_cleaned.drop('status', axis=1) # Create a copy of the data with 'status' removed
y = data_cleaned['status'] # Create a new variable with only 'status'
Splitting the data into 70% train and 30% test set
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Check the shape of both the train and test data
#Collect the data for the table
data_ = {
"Data Split": ["Training Set", "Test Set"],
"Shape": [X_train.shape, X_test.shape],
"Class Percentage": [
y_train.value_counts(normalize=True).to_dict(),
y_test.value_counts(normalize=True).to_dict()
]
}
# Create the table using pandas
shape = pd.DataFrame(data_)
# Display the table
print(shape)
Data Split Shape Class Percentage 0 Training Set (2945, 25) {0: 0.6937181663837012, 1: 0.3062818336162988} 1 Test Set (1263, 25) {0: 0.7094220110847189, 1: 0.2905779889152811}
Check the training and testing data for missing values. There is no reason to expect any missing values, as our dataset had none.
# Missing value check for X_train and X_test as a percentage
# Create DataFrames for % of missing values in each dataset
train_missing = pd.DataFrame({
'% of Missing Values (Train)': round(X_train.isna().sum() / X_train.isna().count() * 100, 2)
})
test_missing = pd.DataFrame({
'% of Missing Values (Test)': round(X_test.isna().sum() / X_test.isna().count() * 100, 2)
})
# Concatenate the two DataFrames to have a nice table comparing both
missing_values_table = pd.concat([train_missing, test_missing], axis=1)
# Display the table
missing_values_table
% of Missing Values (Train) | % of Missing Values (Test) | |
---|---|---|
age | 0.0 | 0.0 |
website_visits | 0.0 | 0.0 |
time_spent_on_website | 0.0 | 0.0 |
page_views_per_visit | 0.0 | 0.0 |
current_occupation_0 | 0.0 | 0.0 |
current_occupation_1 | 0.0 | 0.0 |
current_occupation_2 | 0.0 | 0.0 |
first_interaction_0 | 0.0 | 0.0 |
first_interaction_1 | 0.0 | 0.0 |
profile_completed_0 | 0.0 | 0.0 |
profile_completed_1 | 0.0 | 0.0 |
profile_completed_2 | 0.0 | 0.0 |
last_activity_0 | 0.0 | 0.0 |
last_activity_1 | 0.0 | 0.0 |
last_activity_2 | 0.0 | 0.0 |
print_media_type1_0 | 0.0 | 0.0 |
print_media_type1_1 | 0.0 | 0.0 |
print_media_type2_0 | 0.0 | 0.0 |
print_media_type2_1 | 0.0 | 0.0 |
digital_media_0 | 0.0 | 0.0 |
digital_media_1 | 0.0 | 0.0 |
educational_channels_0 | 0.0 | 0.0 |
educational_channels_1 | 0.0 | 0.0 |
referral_0 | 0.0 | 0.0 |
referral_1 | 0.0 | 0.0 |
Logistic regression benefits from scaled data, especially if some numeric features have a wide range of values.
# Standardize Numeric Features
# List of numeric columns that need to be scaled
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
# Identify non-numeric columns
non_numeric_cols = [col for col in X_train.columns if col not in numeric_cols]
# Initialize the scaler
scaler = StandardScaler()
# Scale the numeric columns in the training set
X_train_scaled_numeric = pd.DataFrame(
scaler.fit_transform(X_train[numeric_cols]),
columns=numeric_cols,
index=X_train.index # Retain original index
)
# Scale the numeric columns in the test set
X_test_scaled_numeric = pd.DataFrame(
scaler.transform(X_test[numeric_cols]),
columns=numeric_cols,
index=X_test.index # Retain original index
)
# Extract the non-numeric columns from the training set without resetting the index
X_train_non_numeric = X_train[non_numeric_cols]
# Extract the non-numeric columns from the test set without resetting the index
X_test_non_numeric = X_test[non_numeric_cols]
# Concatenate the scaled numeric columns with the non-numeric columns for the training set
X_train_final = pd.concat([X_train_scaled_numeric, X_train_non_numeric], axis=1)
# Concatenate the scaled numeric columns with the non-numeric columns for the test set
X_test_final = pd.concat([X_test_scaled_numeric, X_test_non_numeric], axis=1)
# Preserve the original column order
X_train_scaled = X_train_final[X_train.columns]
X_test_scaled = X_test_final[X_test.columns]
Build the Model
# Initialize and fit the logistic regression model
logreg = LogisticRegression(max_iter=1000,random_state=1)
logreg.fit(X_train_scaled, y_train)
LogisticRegression(max_iter=1000, random_state=1)
Train and Evaluate the Model
# Evaluate the Model on the Training Data
y_train_pred = logreg.predict(X_train_scaled)
# Evaluate performance on training data
metrics_score(y_train, y_train_pred)
precision recall f1-score support 0 0.86 0.90 0.88 2043 1 0.74 0.66 0.70 902 accuracy 0.82 2945 macro avg 0.80 0.78 0.79 2945 weighted avg 0.82 0.82 0.82 2945
Test the Model
# Evaluate the model on the test data
y_test_pred = logreg.predict(X_test_scaled)
# Evaluate performance on the test data
metrics_score(y_test, y_test_pred)
precision recall f1-score support 0 0.86 0.89 0.88 896 1 0.72 0.66 0.68 367 accuracy 0.82 1263 macro avg 0.79 0.77 0.78 1263 weighted avg 0.82 0.82 0.82 1263
Observations
Accuracy:
Class 0 (Not Converted):
Precision: 0.86 – Of all the cases predicted as not converted 86% were correct.
Recall: 0.90 – The model successfully identified 90% of all actual not converted cases.
F1-Score: 0.88 – This harmonic mean between precision and recall indicates strong performance in identifying not converted users.
Class 1 (Converted):
Precision: 0.74 – Of all the cases predicted as converted, 74% were correct.
Recall: 0.66 – The model correctly identified 66% of all actual converted cases.
F1-Score: 0.70 – This score suggests moderate performance in identifying converted users, but there is room for improvement.
Macro Average (Precision: 0.80, Recall: 0.78):
Weighted Average (Precision: 0.82, Recall: 0.82):
This average accounts for the class imbalance (since there are far more status = 0 users).
The results are consistent with the overall accuracy of 82%.
We previously scaled the data in preparation for logistic regression, so we can start by applying PCA.
# Apply PCA to the entire feature set after scaling (including numeric + other features)
pca = PCA(n_components=0.95) # Retain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# Print the number of components chosen by PCA
print(f"Number of components retained: {X_train_pca.shape[1]}")
Number of components retained: 12
PCA retained 12 principal components, which together capture 95% of the variance in the 25 input features. Note that these components are linear combinations of the original features rather than a subset of them.
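To make the 95% threshold concrete, the per-component and cumulative explained variance can be inspected directly; a minimal sketch, assuming the `pca` object fitted in the cell above is still in scope:

# Inspect how much variance each retained principal component explains,
# and confirm that the cumulative total reaches the 95% threshold
import numpy as np

explained = pca.explained_variance_ratio_   # variance share of each retained component
cumulative = np.cumsum(explained)           # running total across components

for i, (ev, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {ev:.3f} (cumulative: {cum:.3f})")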
Build the Model
# Initialize and fit the decision tree on the PCA-transformed training data
dt_model = DecisionTreeClassifier(random_state=1)
dt_model.fit(X_train_pca, y_train)
DecisionTreeClassifier(random_state=1)
Test and Evaluate the Model
# Generate predictions on the training data
y_train_pred_pca = dt_model.predict(X_train_pca)
# Evaluate the model's performance on the training data
train_accuracy_pca = accuracy_score(y_train, y_train_pred_pca)
metrics_score(y_train, y_train_pred_pca)
precision recall f1-score support 0 1.00 1.00 1.00 2043 1 1.00 1.00 1.00 902 accuracy 1.00 2945 macro avg 1.00 1.00 1.00 2945 weighted avg 1.00 1.00 1.00 2945
Test and Evaluate the Model
# Generate predictions on the test data
y_test_pred_pca = dt_model.predict(X_test_pca)
# Evaluate the model's performance on the test data
test_accuracy_pca = accuracy_score(y_test, y_test_pred_pca)
metrics_score(y_test, y_test_pred_pca)
precision recall f1-score support 0 0.85 0.83 0.84 896 1 0.60 0.64 0.62 367 accuracy 0.77 1263 macro avg 0.72 0.73 0.73 1263 weighted avg 0.78 0.77 0.77 1263
Observations
Overfitting on the Training Set:
Test Set Performance:
The accuracy on the test set is 77%, which is a notable drop from the perfect accuracy on the training set.
For Class 0 (the majority class), precision is 0.85, recall is 0.83, and the F1-score is 0.84, showing a decent but not exceptional performance on the majority class.
For Class 1 (the minority class), precision is 0.60, recall is 0.64, and the F1-score is 0.62, indicating that the model is struggling more with identifying and predicting the minority class. It misses about 36% of the actual positive instances and has a relatively high false positive rate for this class.
Effect of PCA:
The use of PCA (Principal Component Analysis) for dimensionality reduction likely helped reduce the number of features and potentially removed some irrelevant data, but the Decision Tree still appears to overfit the training data.
Although PCA can sometimes reduce overfitting by reducing complexity, in this case, the Decision Tree's inherent tendency to overfit (due to its flexibility) has not been mitigated, possibly due to the structure of the tree being too deep or not enough regularization applied.
Imbalance in Class Performance:
As with previous models, the imbalance in class distribution continues to affect performance. The model predicts Class 0 (the majority class) more accurately than Class 1 (the minority class), which is reflected in the F1-scores: 0.84 for Class 0 and 0.62 for Class 1.
This suggests that the Decision Tree has learned patterns for the majority class much better than the minority class, a common issue with imbalanced datasets.
Macro and Weighted Averages:
The macro average F1-score of 0.73 indicates that the model performs better on the majority class, as the average across both classes is lower than Class 0's individual score.
The weighted average F1-score of 0.77 reflects the model’s stronger performance on Class 0, weighted by the class distribution in the test set.
Conclusion:
The PCA-based Decision Tree shows clear signs of overfitting, as evidenced by perfect performance on the training set and a significant drop in performance on the test set. While Class 0 predictions remain decent, Class 1 predictions are relatively weak, and the overall performance is not optimal, particularly for the minority class. Addressing overfitting (e.g., through pruning the Decision Tree or using more regularization) and handling class imbalance could improve the model's generalization on unseen data.
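As a quick illustration of the pruning suggestion above, the same PCA-transformed features could be refit with a depth limit. This is only a sketch: the max_depth of 5 is an arbitrary illustrative choice, and metrics_score is the helper already defined earlier in this notebook.

# Refit the decision tree on the PCA-transformed features with a depth limit to curb overfitting
from sklearn.tree import DecisionTreeClassifier

dt_model_pca_pruned = DecisionTreeClassifier(random_state=1, max_depth=5)  # depth chosen for illustration
dt_model_pca_pruned.fit(X_train_pca, y_train)

# Compare training and test performance with the notebook's metrics helper
metrics_score(y_train, dt_model_pca_pruned.predict(X_train_pca))
metrics_score(y_test, dt_model_pca_pruned.predict(X_test_pca))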
Note: Since our target feature is binary, we will employ a decision tree classifier.
Build the Model
# Fitting the decision tree classifier on the training data
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)
DecisionTreeClassifier(random_state=1)
Train and Evaluate the Model
# Checking performance on the training data
y_pred_train = d_tree.predict(X_train)
# Report performance using the metrics_score helper
metrics_score(y_train, y_pred_train)
precision recall f1-score support 0 1.00 1.00 1.00 2043 1 1.00 1.00 1.00 902 accuracy 1.00 2945 macro avg 1.00 1.00 1.00 2945 weighted avg 1.00 1.00 1.00 2945
Observations:
Classification Metrics:
Precision, Recall, and F1-scores are all 1.00 for both classes (Not Converted and Converted), indicating perfect classification performance on the training set.
The accuracy is effectively 100%, as the model correctly classifies nearly every instance in the training data.
Confusion Matrix:
The confusion matrix shows that all 2043 instances of the "Not Converted" class were correctly predicted, and 901 of the 902 "Converted" instances were correctly predicted, with a single misclassification.
This is consistent with perfect or near-perfect performance, indicating severe overfitting.
Conclusions:
Overfitting : The perfect accuracy on the training set is a sure sign of overfitting, especially in decision trees, which tend to memorize the training data when no regularization (like pruning or max-depth constraints) is applied. It's important to validate this performance on a test set to see if the model generalizes well. This may be a good candidate for employing PCA to remove some dimensions.
Balanced Performance: Both classes seem to have been learned equally well, as shown by the identical precision, recall, and F1-scores across classes.
Test and Evaluate the Model
# Checking performance on the test data
y_pred_test = d_tree.predict(X_test)
# Report performance using the metrics_score helper
metrics_score(y_test, y_pred_test)
precision recall f1-score support 0 0.86 0.85 0.86 896 1 0.65 0.67 0.66 367 accuracy 0.80 1263 macro avg 0.76 0.76 0.76 1263 weighted avg 0.80 0.80 0.80 1263
Observations (with OHE drop_first = True):
Classification Metrics:
Accuracy: 79.65%, which is much lower than the perfect performance on the training set, indicating that the model is not generalizing as well.
Class 0 (Not Converted): Precision and recall are around 0.86, indicating relatively good performance.
Class 1 (Converted): Precision is 0.64, and recall is 0.67, which means that the model is missing a significant number of "Converted" cases (false negatives) and is not making very confident positive predictions.
Confusion Matrix:
Out of 896 "Not Converted" instances, 760 were correctly predicted, but 136 were misclassified as "Converted".
Out of 367 "Converted" instances, only 246 were correctly classified, with 121 misclassified as "Not Converted".
Conclusions:
The model generalizes moderately well on the test data, but it is clearly overfitting on the training set. The test accuracy of 80% and the drop in performance, particularly for the "Converted" class, suggest that the decision tree may be too complex.
The lower performance in predicting the "Converted" class (precision of 0.64 and recall of 0.67) could lead to missed opportunities or incorrect classifications in practical applications where identifying conversions is critical.
Observations (with OHE drop_first = False):
Test Performance Improvement:
The test accuracy remains at 80%, similar to the previous result. This consistency suggests that the model is not significantly underperforming on unseen data, even after changes in the encoding scheme.
The precision for class 1 (converted) has slightly improved to 65% from 64%, and the recall for class 1 has increased to 67% (up from 64%). This indicates a marginal improvement in the model's ability to detect converted cases.
Class Imbalance:
Confusion Matrix Insights:
In order to determine whether pruning is needed, we first must determine the importance of the features that went into the decision tree.
# Plot the feature importance
importances = d_tree.feature_importances_ # Use d_tree
columns = X.columns # Use columns from the DataFrame used with d_tree
importance_df = pd.DataFrame(importances, index=columns, columns=['Importance']).sort_values(by='Importance', ascending=False)
print(importance_df)
# Plot
plt.figure(figsize=(13, 13))
sns.barplot(data=importance_df, x='Importance', y=importance_df.index, palette='tab10')
plt.title('Feature Importance in Decision Tree')
plt.show()
Feature | Importance |
---|---|
time_spent_on_website | 0.286512 |
first_interaction_0 | 0.154206 |
profile_completed_0 | 0.134657 |
page_views_per_visit | 0.095233 |
age | 0.091493 |
current_occupation_0 | 0.066631 |
last_activity_1 | 0.047065 |
website_visits | 0.039448 |
last_activity_2 | 0.024113 |
digital_media_0 | 0.008002 |
current_occupation_2 | 0.007526 |
last_activity_0 | 0.005211 |
referral_0 | 0.005136 |
educational_channels_0 | 0.004923 |
profile_completed_1 | 0.004662 |
print_media_type1_0 | 0.004083 |
profile_completed_2 | 0.003841 |
digital_media_1 | 0.003714 |
print_media_type2_0 | 0.003433 |
educational_channels_1 | 0.003226 |
print_media_type1_1 | 0.002581 |
print_media_type2_1 | 0.002551 |
current_occupation_1 | 0.001505 |
referral_1 | 0.000248 |
first_interaction_1 | 0.000000 |
Observations:
Top Features:
The most important feature is time_spent_on_website, with a score of 0.2865. This suggests that the time users spend on the website plays the largest role in determining the model's predictions.
Next is first_interaction_Mobile App (0.1542), indicating that how users first interact with the mobile app influences the outcome.
This is followed by profile_completed_High (0.1347), meaning a high level of profile completion has a strong impact.
page_views_per_visit (0.0952) and age (0.0915) also play crucial roles in the predictions.
Lesser Importance:
current_occupation_Professional (0.0666) and last_activity_Phone Activity (0.0471) are still important but have less predictive power than the top features.
(0.0080) and profile_completed_Low
(0.0047), indicating these features play only a minimal role in the decision tree's predictions.Negligible or No Importance:
first_interaction_Website
(0.0000) and referral_Yes
(0.0002), have virtually no importance in this decision tree model, suggesting they do not contribute to the predictive power.Business Insights:
# Set the size of the plot
plt.figure(figsize=(20, 20))
# Plot the tree using plot_tree from Scikit-learn
tree.plot_tree(
d_tree,
feature_names=X.columns, # Column names for feature names
class_names=True, # Display class names
filled=True, # Color nodes by class
rounded=True, # Rounded boxes for nodes
proportion=False, # Not scaled to the proportion of samples at each node
fontsize=8, # Font size for labels
max_depth = 3 # Limit the depth of the tree to 3 levels
)
# Display the plot
plt.show()
Let's determine the optimal tree depth.
# Determine the optimum depth to prune
# Define the range of depths to test
depth_range = range(1, 21) # Test depths from 1 to 20
cv_scores = [] # List to store cross-validation scores
# Loop over the depths and perform cross-validation
for depth in depth_range:
# Create a decision tree with the current max_depth
pruned_tree = DecisionTreeClassifier(random_state=1, max_depth=depth)
# Perform cross-validation (5-fold cross-validation in this case)
scores = cross_val_score(pruned_tree, X_train, y_train, cv=5, scoring='accuracy') # You can change scoring to 'f1', 'precision', 'recall', etc.
# Append the mean score for this depth
cv_scores.append(np.mean(scores))
# Find the depth with the best cross-validation score
optimal_depth = depth_range[np.argmax(cv_scores)]
print(f"The optimal max_depth is: {optimal_depth}")
# Plot the results to visualize
plt.figure(figsize=(10, 6))
plt.plot(depth_range, cv_scores, marker='o')
plt.xlabel('Max Depth')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Cross-Validation Accuracy vs Max Depth')
plt.show()
The optimal max_depth is: 6
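Since the stated evaluation criterion prioritizes recall (minimizing false negatives), the same cross-validated search could instead be scored on recall for the converted class. The sketch below is illustrative only; it reuses the existing X_train/y_train split, and the depth it selects may differ from the accuracy-based result above.

# Repeat the depth search, scoring each candidate tree on recall instead of accuracy
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

recall_scores = []
for depth in range(1, 21):
    candidate_tree = DecisionTreeClassifier(random_state=1, max_depth=depth)
    scores = cross_val_score(candidate_tree, X_train, y_train, cv=5, scoring='recall')
    recall_scores.append(scores.mean())

best_depth_for_recall = range(1, 21)[np.argmax(recall_scores)]
print(f"max_depth with the best cross-validated recall: {best_depth_for_recall}")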
Build the Model.
# Build the model
pruned_tree_depth_max = DecisionTreeClassifier(random_state=1, max_depth=optimal_depth)
pruned_tree_depth_max.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=6, random_state=1)
Train and Evaluate the Model.
# Evaluate the pruned decision tree (optimal depth) on the training data
y_pred_train_depth_max = pruned_tree_depth_max.predict(X_train)
# Report performance using the metrics_score helper
metrics_score(y_train, y_pred_train_depth_max)
precision recall f1-score support 0 0.90 0.95 0.92 2043 1 0.87 0.75 0.80 902 accuracy 0.89 2945 macro avg 0.88 0.85 0.86 2945 weighted avg 0.89 0.89 0.89 2945
Test and Evaluate the Model
# Checking performance on the test data
y_pred_test_depth_max = pruned_tree_depth_max.predict(X_test)
# Report performance using the metrics_score helper
metrics_score(y_test, y_pred_test_depth_max)
precision recall f1-score support 0 0.86 0.92 0.89 896 1 0.77 0.63 0.69 367 accuracy 0.84 1263 macro avg 0.81 0.78 0.79 1263 weighted avg 0.83 0.84 0.83 1263
Comparison of Training and Testing Observations:
Overall Accuracy:
Class 0 (Not Converted):
Class 1 (Converted):
Class Imbalance:
Generalization:
Conclusion:
Let's examine what happens if we bracket the 'optimum' pruning depth determined above.
Let Depth = 5
Build and evaluate the model on the training data.
Build the Model
# Create a decision tree with pruning by limiting the depth
pruned_tree_depth_5 = DecisionTreeClassifier(random_state=1, max_depth=5)
pruned_tree_depth_5.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, random_state=1)
Train and Evaluate the Model
# Checking performance on the training data
y_pred_train_pruned_depth_5 = pruned_tree_depth_5.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_depth_5)
precision recall f1-score support 0 0.89 0.94 0.92 2043 1 0.84 0.75 0.79 902 accuracy 0.88 2945 macro avg 0.87 0.84 0.85 2945 weighted avg 0.88 0.88 0.88 2945
Test and Evaluate the Model.
# Checking performance on the test data
y_pred_test_pruned_depth_5 = pruned_tree_depth_5.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_depth_5)
precision recall f1-score support 0 0.89 0.92 0.90 896 1 0.79 0.71 0.75 367 accuracy 0.86 1263 macro avg 0.84 0.82 0.83 1263 weighted avg 0.86 0.86 0.86 1263
Observations
Reduced Overfitting:
Balanced Performance Across Train and Test Sets:
Class 0 (Majority Class) Performance:
For Class 0, both the training and test sets show consistently strong results:
On the training set, Class 0 has a precision of 0.89, recall of 0.94, and an F1-score of 0.92.
On the test set, Class 0 also performs well, with precision of 0.89, recall of 0.92, and an F1-score of 0.90.
These metrics reflect that the model is effectively classifying the majority class with high accuracy, precision, and recall.
Class 1 (Minority Class) Performance:
For Class 1, the performance is decent but not as strong as Class 0:
On the training set, Class 1 has a precision of 0.84, recall of 0.75, and an F1-score of 0.79.
On the test set, Class 1 has a precision of 0.79, recall of 0.71, and an F1-score of 0.75.
While the model is able to identify most of the minority class instances, the recall is slightly lower, meaning that about 29% of the positive instances in the test set are not being correctly identified. The precision remains fairly strong, indicating that when the model predicts Class 1, it is usually correct.
Macro and Weighted Averages:
The macro average F1-score on the test set is 0.83, which reflects the average performance across both classes, taking into account the lower recall for Class 1.
The weighted average F1-score on the test set is 0.86, showing that overall the model is performing well, with stronger performance on Class 0 but not neglecting Class 1 entirely.
Improvement Over Previous Models:
Compared to previous iterations, this pruned Decision Tree offers a much better balance between train and test performance, indicating less overfitting and more generalizability.
The drop in performance for the minority class (Class 1) is still present but is not as severe as in some previous models. The model still struggles somewhat with identifying all instances of Class 1, but precision and recall are much better balanced.
Conclusion:
Pruning the Decision Tree by limiting its depth to 5 has improved generalization, reducing overfitting and providing balanced performance between the training and test sets. The model handles Class 0 (the majority class) very well, while its performance on Class 1 (the minority class) is decent but could be further improved, particularly in recall. This pruned model strikes a good balance between complexity and predictive power.
Let Depth = 7
Build and evaluate the model on the training data.
Build the Model
# Create a decision tree with pruning by limiting the depth
pruned_tree_depth_7 = DecisionTreeClassifier(random_state=1, max_depth=7)
pruned_tree_depth_7.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=7, random_state=1)
Train and Evaluate the Model
# Check the performance after pruning
y_pred_train_pruned_depth_7 = pruned_tree_depth_7.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_depth_7)
precision recall f1-score support 0 0.92 0.95 0.93 2043 1 0.87 0.81 0.84 902 accuracy 0.90 2945 macro avg 0.89 0.88 0.89 2945 weighted avg 0.90 0.90 0.90 2945
Test and Evaluate the model
# Check the performance after pruning
y_pred_test_pruned_depth_7 = pruned_tree_depth_7.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_depth_7)
precision recall f1-score support 0 0.87 0.90 0.89 896 1 0.74 0.68 0.71 367 accuracy 0.84 1263 macro avg 0.81 0.79 0.80 1263 weighted avg 0.83 0.84 0.84 1263
Observations
Improved Training Set Performance:
With a depth of 7, the model achieves 90% accuracy on the training set, showing a stronger fit compared to the pruned model with depth 5.
Class 0 (majority class) has a precision of 0.92, recall of 0.95, and an F1-score of 0.93, indicating that the model predicts the majority class very well on the training data.
Class 1 (minority class) also performs better than with a shallower depth, with precision of 0.87, recall of 0.81, and an F1-score of 0.84. This shows an improvement in recall and overall performance for the minority class on the training data.
Test Set Performance:
On the test set, the accuracy is 84%, which is a reasonable drop from the training accuracy of 90%, indicating that some overfitting is present, but the model still generalizes fairly well.
Class 0 has a precision of 0.87, recall of 0.90, and an F1-score of 0.89, showing that the model continues to predict the majority class effectively.
Class 1 has a precision of 0.74, recall of 0.68, and an F1-score of 0.71. This is a slight drop in performance compared to the training set, especially in terms of recall, where the model is missing about 32% of the actual positive cases in the test data. Precision is still decent, but the recall remains a challenge.
Effect of Increased Depth:
Increasing the depth to 7 has improved the model’s performance on the training set, especially for Class 1 (the minority class). However, it comes with the trade-off of slightly decreased generalization to the test set, as the test accuracy is slightly lower than the training accuracy.
The gap in recall for Class 1 between the training set (0.81) and test set (0.68) suggests that the model is still somewhat overfitting, especially for the minority class.
Class Imbalance Effects:
Similar to previous models, the imbalance between Class 0 and Class 1 leads to better performance on the majority class (Class 0), while the minority class (Class 1) sees lower precision and recall.
The drop in recall for Class 1 on the test set indicates that the model struggles more to generalize its predictions for the minority class, potentially missing important patterns in the test data for this class.
Macro and Weighted Averages:
The macro average F1-score on the test set is 0.80, indicating that the performance on both classes is relatively balanced but slightly skewed in favor of Class 0.
The weighted average F1-score of 0.84 on the test set reflects the stronger performance on the majority class, which influences the overall score more due to the class imbalance.
Conclusion: Increasing the depth to 7 has improved the model’s fit on the training set, particularly for the minority class (Class 1). However, there is still a gap between training and test performance, particularly in terms of recall for Class 1, which indicates that the model is slightly overfitting to the training data. The model performs well overall, with solid results for Class 0, but the challenge of accurately predicting the minority class remains. A balance between depth and generalization could be further refined to enhance performance on unseen data.
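One way to search for that balance jointly, rather than varying one hyperparameter at a time, is a small grid search over depth and leaf-size constraints. The sketch below is illustrative only: the parameter ranges are arbitrary choices, and recall is used as the scoring metric in line with the evaluation criterion stated earlier.

# Joint search over tree depth and minimum leaf size, scored on recall
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [4, 5, 6, 7, 8],        # illustrative range around the depths tried above
    'min_samples_leaf': [1, 5, 10, 20],  # illustrative leaf-size constraints
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring='recall',
    cv=5,
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validated recall: {grid_search.best_score_:.3f}")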
Build the Model
# Create a decision tree with pruning by limiting minimum samples per leaf
pruned_tree_leaf = DecisionTreeClassifier(random_state=1, min_samples_leaf=5) # Set min_samples_leaf to 5 as an example
pruned_tree_leaf.fit(X_train, y_train)
DecisionTreeClassifier(min_samples_leaf=5, random_state=1)
Train and Evaluate the Model
# Check the performance after pruning
y_pred_train_pruned_leaf = pruned_tree_leaf.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_leaf)
precision recall f1-score support
0 0.94 0.95 0.95 2043
1 0.89 0.86 0.88 902
accuracy 0.93 2945
macro avg 0.92 0.91 0.91 2945
weighted avg 0.92 0.93 0.92 2945
Test and Evaluate the Model
# Check the performance after pruning
y_pred_test_pruned_leaf = pruned_tree_leaf.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_leaf)
precision recall f1-score support
0 0.86 0.86 0.86 896
1 0.66 0.67 0.66 367
accuracy 0.80 1263
macro avg 0.76 0.76 0.76 1263
weighted avg 0.80 0.80 0.80 1263
Observations
Training Set Performance:
The model performs very well on the training set, achieving an overall accuracy of 93%, indicating a good fit to the training data.
Class 0 (majority class) has a precision of 0.94, recall of 0.95, and an F1-score of 0.95, showing excellent performance in predicting the majority class.
Class 1 (minority class) also performs well on the training set, with a precision of 0.89, recall of 0.86, and an F1-score of 0.88. This suggests that the model is fairly successful in identifying the minority class, with a small trade-off in recall compared to Class 0.
Test Set Performance:
Class 0 (majority class) on the test set has a precision of 0.86, recall of 0.86, and an F1-score of 0.86, reflecting stable and consistent performance for the majority class, even on the test set.
Class 1 (minority class) shows a noticeable decline in performance on the test set compared to the training set. It has a precision of 0.66, recall of 0.67, and an F1-score of 0.66, which reflects the model's struggle to generalize well to the minority class. About 33% of the actual positive instances are not identified, and there are a higher number of false positives.
Macro and Weighted Averages:
The macro average F1-score on the test set is 0.76, which reflects a balanced view of the model's performance across both classes. However, this lower score indicates that the model struggles more with the minority class (Class 1).
The weighted average F1-score is 0.80, highlighting the influence of the majority class (Class 0) on the overall performance due to its larger support. The better performance for Class 0 positively skews the overall metric.
Effect of Minimum Samples per Leaf:
Using a minimum samples per leaf constraint forces the model to generalize more by preventing the creation of overly specific decision rules for small subsets of data. This is why the model performs well on the training set without completely overfitting.
However, the drop in performance for Class 1 on the test set suggests that while the model generalizes better for the majority class, it still struggles with the minority class, likely due to class imbalance and the limited training examples for Class 1.
Class Imbalance Challenge:
Conclusion:
The use of a minimum samples per leaf constraint leads to a well-performing model on the training set and reasonable generalization to the test set, especially for the majority class. However, the model struggles to generalize for the minority class (Class 1), particularly on the test set, resulting in a lower F1-score and recall for this class. Addressing the class imbalance or further refining the model (e.g., adjusting the minimum samples per leaf) could improve its performance for Class 1.
First, we need to determine the optimal value for min_samples_split.
# Import necessary libraries
from sklearn.model_selection import cross_val_score
import numpy as np
# Define the range of values for min_samples_split to test
split_range = range(2, 51, 2) # Testing even values from 2 to 50
cv_scores_split = [] # List to store cross-validation scores
# Loop over the different min_samples_split values
for split in split_range:
# Create a decision tree with the current min_samples_split value
pruned_tree_split = DecisionTreeClassifier(random_state=1, min_samples_split=split)
# Perform cross-validation (5-fold cross-validation in this case)
scores = cross_val_score(pruned_tree_split, X_train, y_train, cv=5, scoring='accuracy')
# Append the mean score for this split value
cv_scores_split.append(np.mean(scores))
# Find the min_samples_split value with the best cross-validation score
optimal_split = split_range[np.argmax(cv_scores_split)]
print(f"The optimal min_samples_split is: {optimal_split}")
# Plot the results to visualize
plt.figure(figsize=(10, 6))
plt.plot(split_range, cv_scores_split, marker='o')
plt.xlabel('Min Samples Split')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Cross-Validation Accuracy vs Min Samples Split')
plt.show()
The optimal min_samples_split is: 48
Build the Model
# Create a decision tree with pruning by limiting the minimum number of samples required to split a node
pruned_tree_split = DecisionTreeClassifier(random_state=1, min_samples_split=48)
pruned_tree_split.fit(X_train, y_train)
DecisionTreeClassifier(min_samples_split=48, random_state=1)
Train and Evaluate the Model
# Check the performance after pruning
y_pred_train_pruned_split = pruned_tree_split.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_split)
precision recall f1-score support
0 0.91 0.94 0.92 2043
1 0.85 0.79 0.82 902
accuracy 0.89 2945
macro avg 0.88 0.86 0.87 2945
weighted avg 0.89 0.89 0.89 2945
Test and Evaluate the Model
# Check the performance after pruning
y_pred_test_pruned_split = pruned_tree_split.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_split)
precision recall f1-score support
0 0.87 0.90 0.88 896
1 0.73 0.67 0.70 367
accuracy 0.83 1263
macro avg 0.80 0.79 0.79 1263
weighted avg 0.83 0.83 0.83 1263
Observations
Training Set Performance:
The model achieves an accuracy of 89% on the training set, indicating a strong fit, but with less risk of overfitting compared to prior models with more lenient splitting conditions.
Class 0 (majority class) performs very well with a precision of 0.91, recall of 0.94, and an F1-score of 0.92. This shows that the model is highly capable of correctly predicting the majority class.
Class 1 (minority class) shows a solid performance, with a precision of 0.85, recall of 0.79, and an F1-score of 0.82. The recall for Class 1 is slightly lower, indicating that the model misses some positive instances in the training data, but overall, it performs reasonably well in identifying and predicting the minority class.
Test Set Performance:
On the test set, the model achieves an accuracy of 83%, which is a slight drop from the training set accuracy but indicates a good generalization to unseen data.
Class 0 (majority class) maintains strong performance on the test set with a precision of 0.87, recall of 0.90, and an F1-score of 0.88, indicating the model continues to predict the majority class effectively and with high confidence.
Class 1 (minority class) has a precision of 0.73, recall of 0.67, and an F1-score of 0.70, which shows that the model is able to predict the minority class reasonably well but misses about 33% of the actual positive instances. The precision is relatively high, meaning that when the model predicts Class 1, it is often correct, but recall could be improved.
Effect of min_samples_split:
The choice of an optimal min_samples_split of 48 means that the model requires at least 48 samples in a node before splitting further. This encourages the model to create more generalized splits, preventing overly specific decision rules that could lead to overfitting.
The impact of this constraint is visible in the balanced performance across both the training and test sets. The model generalizes well, particularly for Class 0, while still achieving reasonable performance for Class 1.
Macro and Weighted Averages:
The macro average F1-score on the test set is 0.79, reflecting the overall performance across both classes. This score highlights that the model is more effective at predicting the majority class, as the minority class brings down the macro average.
The weighted average F1-score on the test set is 0.83, which indicates that, while the model performs better for Class 0 due to the class imbalance, it still delivers decent performance for the minority class (Class 1).
Class Imbalance Impact:
Conclusion:
The use of an optimal min_samples_split of 48 results in a balanced and well-generalized model. The model performs well on both the training and test sets, with Class 0 (majority class) showing strong performance and Class 1 (minority class) maintaining decent precision, though recall could be improved. The model avoids overfitting and generalizes reasonably well to unseen data, but addressing the class imbalance or further fine-tuning the model could enhance performance for the minority class.
# Create a results table for pruned decision trees
# List of pruning methods and their corresponding predictions
pruning_methods = [
("Max Depth (Optimal)", y_pred_test_depth_max),
("Max Depth (5)", y_pred_test_pruned_depth_5),
("Max Depth (7)", y_pred_test_pruned_depth_7),
("Min Samples Leaf (5)", y_pred_test_pruned_leaf),
("Min Samples Split (48)", y_pred_test_pruned_split)
]
# Initialize a list to store the results
results = []
# Loop through each pruning method and calculate the metrics
for method_name, y_pred in pruning_methods:
accuracy = round(accuracy_score(y_test, y_pred), 2)
precision_0 = round(precision_score(y_test, y_pred, pos_label=0), 2)
precision_1 = round(precision_score(y_test, y_pred, pos_label=1), 2)
recall_0 = round(recall_score(y_test, y_pred, pos_label=0), 2)
recall_1 = round(recall_score(y_test, y_pred, pos_label=1), 2)
f1_0 = round(f1_score(y_test, y_pred, pos_label=0), 2)
f1_1 = round(f1_score(y_test, y_pred, pos_label=1), 2)
# Append the results to the list
results.append([method_name, accuracy, precision_0, precision_1, recall_0, recall_1, f1_0, f1_1])
# Create a DataFrame for better visualization
pruned_results = pd.DataFrame(results, columns=[
"Pruning Method",
"Accuracy",
"Precision Status 0",
"Precision Status 1",
"Recall Status 0",
"Recall Status 1",
"F1-Score Status 0",
"F1-Score Status 1"
])
# Display the DataFrame
pruned_results
Pruning Method | Accuracy | Precision Status 0 | Precision Status 1 | Recall Status 0 | Recall Status 1 | F1-Score Status 0 | F1-Score Status 1 | |
---|---|---|---|---|---|---|---|---|
0 | Max Depth (Optimal) | 0.84 | 0.86 | 0.77 | 0.92 | 0.63 | 0.89 | 0.69 |
1 | Max Depth (5) | 0.86 | 0.89 | 0.79 | 0.92 | 0.71 | 0.90 | 0.75 |
2 | Max Depth (7) | 0.84 | 0.87 | 0.74 | 0.90 | 0.68 | 0.89 | 0.71 |
3 | Min Samples Leaf (5) | 0.80 | 0.86 | 0.66 | 0.86 | 0.67 | 0.86 | 0.66 |
4 | Min Samples Split (48) | 0.83 | 0.87 | 0.73 | 0.90 | 0.67 | 0.88 | 0.70 |
Conclusions:
Key Takeaways:
Max Depth (5) offers the best overall balance:
Accuracy: Highest at 86%.
Class 1 (minority class): Achieves precision of 0.79 and recall of 0.71, leading to the highest F1-score (0.75) for Class 1 across all pruning methods.
Class 0 (majority class): Maintains strong performance, with an F1-score of 0.90 and a recall of 0.92.
Max Depth (Optimal) strikes a reasonable balance:
Accuracy is slightly lower at 84%, but this method balances the performance between both classes.
Class 0 still maintains a high recall of 0.92, while Class 1 has a decent precision of 0.77 but lower recall at 0.63, leading to an overall F1-score of 0.69.
Max Depth (7) is more prone to overfitting:
Min Samples Leaf (5) results in the lowest performance:
Min Samples Split (48) provides a balanced generalization:
Accuracy is 83%, and Class 1 precision and recall are slightly better than the "min samples leaf" method, achieving 0.73 precision and 0.67 recall, but still not as strong as depth-based pruning.
Class 0 maintains good performance with precision of 0.87 and recall of 0.90, resulting in a solid F1-score of 0.88.
Final Conclusion:
We have seen there is an issue with data imbalance, and addressing it properly could lead to significantly improved results.
To validate this hypothesis, we need to examine the class distribution of the target variable.
# Check the distribution of the target variable
class_distribution = y_train.value_counts(normalize=True)
# Display the class distribution in percentage
class_distribution_percentage = class_distribution * 100
class_distribution_percentage
status | proportion |
---|---|
0 | 69.371817 |
1 | 30.628183 |
# Alternatively, we can use visualization to check the imbalance:
import matplotlib.pyplot as plt
# Plot the class distribution
class_distribution.plot(kind='bar', color=['skyblue', 'orange'])
plt.title('Status Distribution in y_train')
plt.xlabel('Status')
plt.ylabel('Frequency')
plt.show()
Calculating Class Weights
# Calculate class weights manually
classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, class_weights))
print(class_weight_dict)
{0: 0.7207537934410181, 1: 1.6324833702882484}
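These values follow scikit-learn's 'balanced' heuristic, n_samples / (n_classes * class_count); a quick check against the training counts (2043 and 902) reproduces them:
# Verify the 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
n_samples = len(y_train)            # 2945
counts = y_train.value_counts()     # 0: 2043, 1: 902
print({c: n_samples / (len(counts) * counts[c]) for c in counts.index})
# Expected: {0: 0.7207..., 1: 1.6324...}, matching compute_class_weight('balanced', ...)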
Build the Classifier
# Use the computed class weights in the decision tree
tuned_tree_tuned_rand = DecisionTreeClassifier(class_weight=class_weight_dict)
# Fit the model
tuned_tree_tuned_rand.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.7207537934410181, 1: 1.6324833702882484})
Train and Evaluate the Model
# Predict on the training set
y_pred_class_weight = tuned_tree_tuned_rand.predict(X_train)
metrics_score(y_train, y_pred_class_weight)
precision recall f1-score support
0 1.00 1.00 1.00 2043
1 1.00 1.00 1.00 902
accuracy 1.00 2945
macro avg 1.00 1.00 1.00 2945
weighted avg 1.00 1.00 1.00 2945
Test and Evaluate the Model
# Predict on the test set
y_pred_class_weight = tuned_tree_tuned_rand.predict(X_test)
metrics_score(y_test, y_pred_class_weight)
precision recall f1-score support
0 0.87 0.86 0.86 896
1 0.66 0.68 0.67 367
accuracy 0.81 1263
macro avg 0.76 0.77 0.77 1263
weighted avg 0.81 0.81 0.81 1263
No improvement to speak of; let's alter the weights a bit and try again.
# Set up the decision tree with class weights
dt_model = DecisionTreeClassifier()
# Define a grid of class weights to search through
param_grid = {'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 5}]}
# Perform a grid search to find the best class weights
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='f1_weighted')
grid_search.fit(X_train, y_train)
# Output the best parameters (weights) found
print("Best class weights:", grid_search.best_params_)
Best class weights: {'class_weight': {0: 1, 1: 2}}
Let's iterate over the candidate weights and evaluate each.
# Set up the decision tree with class weights
dt_model = DecisionTreeClassifier()
# Define a grid of class weights to search through
param_grid = {'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 5}]}
# Perform a grid search to find the best class weights
grid_search_class_weight = GridSearchCV(dt_model, param_grid, cv=5, scoring='f1_weighted', return_train_score=True)
grid_search_class_weight.fit(X_train, y_train)
# Iterate over the results and print evaluation metrics
for i, params in enumerate(grid_search_class_weight.cv_results_['params']):
print(f"\nEvaluation for class weights: {params}")
# Refit the model on the training set with the current parameters
dt_model_class_weight = DecisionTreeClassifier(class_weight=params['class_weight'])
dt_model_class_weight.fit(X_train, y_train)
# Predict on the test set
y_pred__class_weight = dt_model_class_weight.predict(X_test)
# Calculate accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_test, y_pred__class_weight)
report = classification_report(y_test, y_pred__class_weight, target_names=['Class 0', 'Class 1'])
# Output metrics for each set of weights
print(f"Accuracy: {accuracy:.4f}")
print(f"Classification Report:\n{report}")
# Output the best parameters (weights) found by GridSearchCV
print("\nBest class weights found by GridSearchCV:", grid_search.best_params_)
Evaluation for class weights: {'class_weight': {0: 1, 1: 2}}
Accuracy: 0.8124
Classification Report:
precision recall f1-score support
Class 0 0.87 0.87 0.87 896
Class 1 0.68 0.68 0.68 367
accuracy 0.81 1263
macro avg 0.77 0.77 0.77 1263
weighted avg 0.81 0.81 0.81 1263

Evaluation for class weights: {'class_weight': {0: 1, 1: 3}}
Accuracy: 0.7965
Classification Report:
precision recall f1-score support
Class 0 0.86 0.85 0.86 896
Class 1 0.65 0.66 0.65 367
accuracy 0.80 1263
macro avg 0.75 0.76 0.75 1263
weighted avg 0.80 0.80 0.80 1263

Evaluation for class weights: {'class_weight': {0: 1, 1: 5}}
Accuracy: 0.8052
Classification Report:
precision recall f1-score support
Class 0 0.87 0.85 0.86 896
Class 1 0.65 0.70 0.68 367
accuracy 0.81 1263
macro avg 0.76 0.77 0.77 1263
weighted avg 0.81 0.81 0.81 1263

Best class weights found by GridSearchCV: {'class_weight': {0: 1, 1: 2}}
Observations:
Here are the observations based on the results:
Impact of Class Weights: Changing the class weights between Class 0 and Class 1 has a noticeable effect on the classification metrics, particularly for Class 1, which is the minority class:
As the weight for Class 1 increases from 2 to 5, the recall for Class 1 moves only slightly (0.68 at weight 2, 0.66 at weight 3, 0.70 at weight 5).
The precision for Class 1 dips slightly as the weight grows (0.68 at weight 2 versus 0.65 at weights 3 and 5), meaning the proportion of correctly classified positives among all predicted positives does not improve.
Overall Accuracy: The overall accuracy remains relatively stable across all evaluations, ranging from roughly 79.7% to 81.2%. This indicates that changing the class weights does not drastically affect overall accuracy, though it shifts the balance of errors for Class 1 (the minority class).
F1-Score: The F1-score for Class 1 stays in the 0.65 to 0.68 range across all weight configurations. This balance of precision and recall suggests that the model's performance on the minority class is steady, with no significant gain in predictive power as the class weight for Class 1 increases.
GridSearchCV Selection: Because the heavier weights did not improve the weighted F1 score, GridSearchCV selected the mildest configuration, {'class_weight': {0: 1, 1: 2}}. Under its weighted-F1 scoring, this was the best overall balance for the dataset.
Class 0 Performance: The performance for Class 0 (majority class) remains strong and stable across all configurations, with precision, recall, and F1-scores in the 0.85 to 0.87 range.
Conclusion: Raising the class weight for Class 1 nudges its recall up slightly (to 0.70 at weight 5) at a small cost to precision and overall accuracy. GridSearchCV, scored on weighted F1, preferred the mildest weighting, which is reasonable when overall balance matters; if the priority is catching more converted leads, selecting on recall would be more appropriate.
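Since the stated objective is to maximize recall on converted leads, a variant of this search scored on recall rather than weighted F1 may be better aligned with that goal. A minimal sketch under that assumption:
# Sketch: repeat the class-weight search, but select on recall for the positive class
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
recall_search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 5}]},
    cv=5,
    scoring='recall')  # recall is computed for status = 1 by default
recall_search.fit(X_train, y_train)
print("Best class weights by recall:", recall_search.best_params_)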
Build the Model
# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=1)
# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
Train and Evaluate the Model
# Predict on the training set
y_pred_rf = rf_model.predict(X_train)
metrics_score(y_train, y_pred_rf)
precision recall f1-score support
0 1.00 1.00 1.00 2043
1 1.00 1.00 1.00 902
accuracy 1.00 2945
macro avg 1.00 1.00 1.00 2945
weighted avg 1.00 1.00 1.00 2945
Test and Evaluate the Model
# Predict on the test set
y_pred_rf = rf_model.predict(X_test)
metrics_score(y_test, y_pred_rf)
precision recall f1-score support
0 0.88 0.90 0.89 896
1 0.73 0.70 0.72 367
accuracy 0.84 1263
macro avg 0.81 0.80 0.80 1263
weighted avg 0.84 0.84 0.84 1263
Observations:
Overfitting on the Training Set:
The model performs perfectly on the training set with 100% precision, recall, and F1-scores for both Class 0 and Class 1. This indicates that the Random Forest model has likely overfitted to the training data, capturing even minor patterns and noise.
Overfitting is a common occurrence when a model becomes too complex and learns to fit the training data too well, at the expense of generalizing to unseen data.
Test Set Performance:
The performance on the test set is lower than on the training set, which is expected but indicates a significant difference between training and test set results, further supporting the overfitting concern.
Accuracy on the test set is 84%, which is fairly strong overall. However, the recall for Class 1 (the minority class) is 0.70, which shows that the model misses about 30% of the positive cases. The precision for Class 1 is 0.73, meaning that about 27% of the predicted positives are actually false positives.
Class 0 (the majority class) has good precision and recall (both around 0.90), which indicates that the model has learned to predict the majority class well.
Class Imbalance Effects:
As seen from the test results, Class 1 has weaker precision and recall compared to Class 0. This suggests that the class imbalance in the dataset (with Class 0 having more examples than Class 1) might be influencing the model's ability to generalize well for the minority class.
The model performs better on Class 0 than Class 1, and the gap between precision and recall for Class 1 is noticeable, showing that the model struggles slightly with predicting minority class instances.
Random Forest models typically do not need pruning in the same way that decision trees do, as they are designed to avoid overfitting by averaging the predictions of many decision trees. However, let's run the search to confirm whether constraining the individual trees helps here.
# Define the model
rf_model = RandomForestClassifier(random_state=1)
# Define a grid of hyperparameters for "pruning"
param_grid = {
'max_depth': [4, 6, 8, 10], # Pruning by limiting depth
'min_samples_split': [2, 5, 10], # Minimum samples to split
'min_samples_leaf': [1, 2, 5], # Minimum samples in each leaf
'max_features': ['sqrt', 'log2', None], # Number of features to consider for splitting
'n_estimators': [50, 100, 200], # Number of trees in the forest
}
# Perform a grid search to find the best hyperparameters
grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Output the best parameters (pruning configuration) found
print("Best parameters:", grid_search.best_params_)
Best parameters: {'max_depth': 6, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Build the Model
# Build the model
best_pruned_rf_model = grid_search.best_estimator_
best_pruned_rf_model.fit(X_train, y_train)
RandomForestClassifier(max_depth=6, max_features=None, min_samples_split=5, n_estimators=200, random_state=1)
Train and Evaluate the Model
# Train and evaluate the pruned random forest model
y_pred = best_pruned_rf_model.predict(X_train)
metrics_score(y_train, y_pred)
precision recall f1-score support
0 0.91 0.94 0.92 2043
1 0.85 0.79 0.82 902
accuracy 0.89 2945
macro avg 0.88 0.86 0.87 2945
weighted avg 0.89 0.89 0.89 2945
Test and Evaluate the Model
# Test and evaluate the pruned random forest model
y_pred = best_pruned_rf_model.predict(X_test)
metrics_score(y_test, y_pred)
precision recall f1-score support
0 0.89 0.91 0.90 896
1 0.78 0.74 0.76 367
accuracy 0.86 1263
macro avg 0.83 0.82 0.83 1263
weighted avg 0.86 0.86 0.86 1263
Observations:
Improved Generalization:
Training Set Performance:
For the training set, Class 0 (majority class) has a precision of 0.91 and a recall of 0.94, indicating strong performance in identifying the majority class.
For Class 1 (minority class), the model's precision is 0.85 and recall is 0.79, with an F1-score of 0.82. This shows that the pruned model is capable of identifying a good portion of the minority class without overfitting to the training data.
Test Set Performance:
On the test set, the overall accuracy is 86%, which is a slight improvement compared to the previous Random Forest model’s accuracy of 84%.
The precision for Class 1 (minority class) is 0.78, and the recall is 0.74, leading to an F1-score of 0.76. This is a decent performance, although slightly lower than Class 0, which is expected in imbalanced datasets. These metrics suggest that the model is identifying the minority class relatively well but still has room for improvement, especially in recall.
For Class 0 (majority class), precision is 0.89 and recall is 0.91, with an F1-score of 0.90. This indicates strong performance for the majority class.
Balanced Performance:
Effect of Pruning:
By limiting the tree depth (max_depth=6) and adjusting other hyperparameters like min_samples_split and n_estimators, the model has successfully reduced overfitting while maintaining good performance on the test set.
The pruning has also helped smooth out the variance between the training and test performance, making the model more robust when predicting on unseen data.
Conclusion:
The pruned Random Forest model shows significant improvement in generalization, with a more balanced performance between the training and test sets. The overall accuracy is strong at 86%, and while Class 0 performs better, Class 1 predictions (precision of 0.78 and recall of 0.74) are respectable for a moderately imbalanced dataset. This suggests that the pruning strategy worked well, reducing overfitting while preserving the model’s effectiveness.
Build the Model
# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=1)
# Define the hyperparameter grid
param_grid = {
'n_estimators': [100, 200], # Number of trees
'max_depth': [4, 6, 8, 10], # Limiting depth (pruning)
'min_samples_split': [2, 5, 10], # Control for tree growth
'min_samples_leaf': [1, 2, 4], # Minimum number of samples in a leaf node
'max_features': ['sqrt', 'log2'] # Features to consider when splitting
}
# Set up GridSearchCV for hypertuning
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)
# Fit the model
grid_search.fit(X_train, y_train)
# Best hyperparameters
print("Best parameters:", grid_search.best_params_)
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best parameters: {'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Train and Evaluate the Model
# Train and evaluate the pruned random forest model
y_pred = grid_search.best_estimator_.predict(X_train)
metrics_score(y_train, y_pred)
precision recall f1-score support
0 0.91 0.95 0.93 2043
1 0.86 0.79 0.82 902
accuracy 0.90 2945
macro avg 0.89 0.87 0.88 2945
weighted avg 0.90 0.90 0.90 2945
Test and Evaluate the Model
# Test and evaluate the pruned random forest model
y_pred = grid_search.best_estimator_.predict(X_test)
metrics_score(y_test, y_pred)
precision recall f1-score support
0 0.88 0.92 0.90 896
1 0.77 0.70 0.73 367
accuracy 0.85 1263
macro avg 0.83 0.81 0.82 1263
weighted avg 0.85 0.85 0.85 1263
Observations:
Training Set Performance:
The model achieves an accuracy of 90% on the training set, demonstrating a strong fit. The selected hyperparameters, particularly the optimal max_depth=8 and min_samples_split=5, allow the model to capture patterns in the data while minimizing overfitting.
Class 0 (majority class) performs exceptionally well, with a precision of 0.91, recall of 0.95, and an F1-score of 0.93. This indicates that the model is highly capable of correctly identifying the majority class with minimal false positives and negatives.
Class 1 (minority class) shows solid performance, with a precision of 0.86, recall of 0.79, and an F1-score of 0.82. While the precision for Class 1 is strong, the recall is slightly lower, meaning the model misses some positive instances in the training data. Overall, it performs well in identifying and predicting the minority class without overfitting.
Test Set Performance:
On the test set, the model achieves an accuracy of 85%, which is a slight drop from the training set but still indicates good generalization to unseen data.
Class 0 (majority class) maintains strong performance on the test set, with a precision of 0.88, recall of 0.92, and an F1-score of 0.90. This shows that the model continues to effectively predict the majority class, with a high degree of confidence.
Class 1 (minority class) has a precision of 0.77, recall of 0.70, and an F1-score of 0.73. While the precision for Class 1 remains strong, the recall is lower, indicating that about 30% of positive instances are missed in the test set. This suggests that the model could improve in identifying minority class instances in unseen data, though its overall performance is decent.
Effect of Hyperparameter Tuning:
The tuned parameters, particularly the max_depth of 8 and min_samples_split of 5, help create a well-balanced model that generalizes effectively. The constraint on tree depth prevents the model from overfitting to training data, while the max_features='sqrt' ensures that a subset of features is considered at each split, preventing overly complex decision boundaries.
The model successfully balances performance across the training and test sets, particularly for the majority class (Class 0), while achieving reasonable performance for the minority class (Class 1).
Macro and Weighted Averages:
The macro average F1-score on the test set is 0.82, reflecting the model's overall performance across both classes. This score highlights the stronger performance for Class 0, with Class 1 contributing to a slightly lower macro average.
The weighted average F1-score is 0.85, which indicates that while Class 0's performance dominates due to class imbalance, the model still delivers decent performance for the minority class (Class 1).
Class Imbalance Impact:
Conclusion:
The hypertuned Random Forest model performs well on both the training and test sets, showing strong generalization without overfitting. Class 0 (majority class) maintains excellent performance across both sets, while Class 1 (minority class) performs decently, though its recall could be improved. The hyperparameters chosen, including max_depth=8, max_features='sqrt', and min_samples_split=5, contribute to a well-balanced model that effectively handles the data. Addressing class imbalance (e.g., using class weights) could further improve performance for the minority class.
We have previously observed a data imbalance, with the target variable distributed in a ratio of 69.4% to 30.6%, where our key feature of interest is converted leads (status = 1).
There are several techniques available to address this imbalance. The methods evaluated below are class weighting, oversampling the minority class with SMOTE, undersampling the majority class, and hybrid resampling (SMOTE + ENN).
# Train a Random Forest model with class weight adjustment
rf_class_weight = RandomForestClassifier(class_weight='balanced', random_state=1, n_estimators=100)
rf_class_weight.fit(X_train, y_train)
# Predict on the test set
y_pred_class_weight = rf_class_weight.predict(X_test)
# Generate the classification report
class_weight_report = classification_report(y_test, y_pred_class_weight, output_dict=True)
class_weight_report_df = pd.DataFrame(class_weight_report).transpose()
# Display the classification report
print(class_weight_report_df)
precision recall f1-score support
0 0.880694 0.906250 0.893289 896.000000
1 0.753666 0.700272 0.725989 367.000000
accuracy 0.846397 0.846397 0.846397 0.846397
macro avg 0.817180 0.803261 0.809639 1263.000000
weighted avg 0.843782 0.846397 0.844675 1263.000000
pip install imblearn
Collecting imblearn Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes) Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.12.3) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.26.4) Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.13.1) Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.5.2) Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.4.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.5.0) Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB) Installing collected packages: imblearn Successfully installed imblearn-0.0
from imblearn.over_sampling import SMOTE
# Apply SMOTE to the training data to oversample the minority class
smote = SMOTE(random_state=1)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Train the Random Forest model on SMOTE-resampled data
rf_smote = RandomForestClassifier(random_state=1, n_estimators=100)
rf_smote.fit(X_train_smote, y_train_smote)
# Predict on the test set
y_pred_smote = rf_smote.predict(X_test)
# Generate the classification report
smote_report = classification_report(y_test, y_pred_smote, output_dict=True)
smote_report_df = pd.DataFrame(smote_report).transpose()
# Display the SMOTE classification report
print(smote_report_df)
precision recall f1-score support
0 0.888765 0.891741 0.890251 896.000000
1 0.733516 0.727520 0.730506 367.000000
accuracy 0.844022 0.844022 0.844022 0.844022
macro avg 0.811141 0.809631 0.810378 1263.000000
weighted avg 0.843653 0.844022 0.843832 1263.000000
from imblearn.under_sampling import RandomUnderSampler
# Apply undersampling to the training data
undersample = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = undersample.fit_resample(X_train, y_train)
# Train the Random Forest model on the undersampled data
rf_undersample = RandomForestClassifier(random_state=1, n_estimators=100)
rf_undersample.fit(X_train_under, y_train_under)
# Predict on the test set
y_pred_undersample = rf_undersample.predict(X_test)
# Generate the classification report
undersample_report = classification_report(y_test, y_pred_undersample, output_dict=True)
undersample_report_df = pd.DataFrame(undersample_report).transpose()
# Display the classification report
print(undersample_report_df)
precision recall f1-score support
0 0.923267 0.832589 0.875587 896.000000
1 0.670330 0.831063 0.742092 367.000000
accuracy 0.832146 0.832146 0.832146 0.832146
macro avg 0.796798 0.831826 0.808840 1263.000000
weighted avg 0.849769 0.832146 0.836796 1263.000000
from imblearn.combine import SMOTEENN
# Apply SMOTE + ENN (hybrid oversampling and undersampling) to the training data
smote_enn = SMOTEENN(random_state=1)
X_train_hybrid, y_train_hybrid = smote_enn.fit_resample(X_train, y_train)
# Train the Random Forest model on the hybrid resampled data
rf_hybrid = RandomForestClassifier(random_state=1, n_estimators=100)
rf_hybrid.fit(X_train_hybrid, y_train_hybrid)
# Predict on the test set
y_pred_hybrid = rf_hybrid.predict(X_test)
# Generate the classification report
hybrid_report = classification_report(y_test, y_pred_hybrid, output_dict=True)
hybrid_report_df = pd.DataFrame(hybrid_report).transpose()
# Display the classification report
print(hybrid_report_df)
precision recall f1-score support
0 0.923913 0.758929 0.833333 896.00000
1 0.590133 0.847411 0.695749 367.00000
accuracy 0.784640 0.784640 0.784640 0.78464
macro avg 0.757023 0.803170 0.764541 1263.00000
weighted avg 0.826924 0.784640 0.793354 1263.00000
# Generate a table for easy comparison of all methods used in the analysis
# Define the columns for the table
columns = ['Model', 'Dataset', 'Accuracy', 'Precision Class 0', 'Precision Class 1',
'Recall Class 0', 'Recall Class 1', 'F1-Score Class 0', 'F1-Score Class 1']
# Data for each model (Train and Test results for each model)
data = [
['Logistic Regression', 'Train', 0.82, 0.86, 0.74, 0.90, 0.66, 0.88, 0.70],
['Logistic Regression', 'Test', 0.82, 0.86, 0.72, 0.89, 0.66, 0.88, 0.68],
['PCA', 'Train', 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
['PCA', 'Test', 0.77, 0.85, 0.60, 0.83, 0.64, 0.84, 0.62],
['Decision Tree', 'Train', 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
['Decision Tree', 'Test', 0.80, 0.86, 0.65, 0.85, 0.67, 0.86, 0.66],
['Max Depth (Pruning)', 'Train', 0.89, 0.90, 0.87, 0.95, 0.75, 0.92, 0.80],
['Max Depth (Pruning)', 'Test', 0.84, 0.86, 0.77, 0.92, 0.63, 0.89, 0.69],
['Limited Depth (Depth=5)', 'Train', 0.88, 0.89, 0.84, 0.94, 0.75, 0.92, 0.79],
['Limited Depth (Depth=5)', 'Test', 0.86, 0.89, 0.79, 0.92, 0.71, 0.90, 0.75],
['Limited Depth (Depth=7)', 'Train', 0.90, 0.92, 0.87, 0.95, 0.81, 0.93, 0.84],
['Limited Depth (Depth=7)', 'Test', 0.84, 0.87, 0.74, 0.90, 0.68, 0.89, 0.71],
['Min Samples per Leaf', 'Train', 0.93, 0.94, 0.89, 0.95, 0.86, 0.95, 0.88],
['Min Samples per Leaf', 'Test', 0.80, 0.86, 0.66, 0.86, 0.67, 0.86, 0.66],
['Min Samples to Split', 'Train', 0.89, 0.91, 0.85, 0.94, 0.79, 0.92, 0.82],
['Min Samples to Split', 'Test', 0.83, 0.87, 0.73, 0.90, 0.67, 0.88, 0.70],
['Random Forest', 'Train', 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
['Random Forest', 'Test', 0.84, 0.88, 0.73, 0.90, 0.70, 0.89, 0.72],
['Pruned Random Forest', 'Train', 0.90, 0.91, 0.86, 0.95, 0.79, 0.93, 0.82],
['Pruned Random Forest', 'Test', 0.85, 0.88, 0.77, 0.92, 0.70, 0.90, 0.73],
['Random Forest (Tuned)', 'Train', 0.90, 0.91, 0.86, 0.95, 0.79, 0.93, 0.82],
['Random Forest (Tuned)', 'Test', 0.85, 0.88, 0.77, 0.92, 0.70, 0.90, 0.73]
]
# Create a DataFrame
model_performance_df = pd.DataFrame(data, columns=columns)
# Display the table
model_performance_df
Model | Dataset | Accuracy | Precision Class 0 | Precision Class 1 | Recall Class 0 | Recall Class 1 | F1-Score Class 0 | F1-Score Class 1 | |
---|---|---|---|---|---|---|---|---|---|
0 | Logistic Regression | Train | 0.82 | 0.86 | 0.74 | 0.90 | 0.66 | 0.88 | 0.70 |
1 | Logistic Regression | Test | 0.82 | 0.86 | 0.72 | 0.89 | 0.66 | 0.88 | 0.68 |
2 | PCA | Train | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
3 | PCA | Test | 0.77 | 0.85 | 0.60 | 0.83 | 0.64 | 0.84 | 0.62 |
4 | Decision Tree | Train | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
5 | Decision Tree | Test | 0.80 | 0.86 | 0.65 | 0.85 | 0.67 | 0.86 | 0.66 |
6 | Max Depth (Pruning) | Train | 0.89 | 0.90 | 0.87 | 0.95 | 0.75 | 0.92 | 0.80 |
7 | Max Depth (Pruning) | Test | 0.84 | 0.86 | 0.77 | 0.92 | 0.63 | 0.89 | 0.69 |
8 | Limited Depth (Depth=5) | Train | 0.88 | 0.89 | 0.84 | 0.94 | 0.75 | 0.92 | 0.79 |
9 | Limited Depth (Depth=5) | Test | 0.86 | 0.89 | 0.79 | 0.92 | 0.71 | 0.90 | 0.75 |
10 | Limited Depth (Depth=7) | Train | 0.90 | 0.92 | 0.87 | 0.95 | 0.81 | 0.93 | 0.84 |
11 | Limited Depth (Depth=7) | Test | 0.84 | 0.87 | 0.74 | 0.90 | 0.68 | 0.89 | 0.71 |
12 | Min Samples per Leaf | Train | 0.93 | 0.94 | 0.89 | 0.95 | 0.86 | 0.95 | 0.88 |
13 | Min Samples per Leaf | Test | 0.80 | 0.86 | 0.66 | 0.86 | 0.67 | 0.86 | 0.66 |
14 | Min Samples to Split | Train | 0.89 | 0.91 | 0.85 | 0.94 | 0.79 | 0.92 | 0.82 |
15 | Min Samples to Split | Test | 0.83 | 0.87 | 0.73 | 0.90 | 0.67 | 0.88 | 0.70 |
16 | Random Forest | Train | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
17 | Random Forest | Test | 0.84 | 0.88 | 0.73 | 0.90 | 0.70 | 0.89 | 0.72 |
18 | Pruned Random Forest | Train | 0.90 | 0.91 | 0.86 | 0.95 | 0.79 | 0.93 | 0.82 |
19 | Pruned Random Forest | Test | 0.85 | 0.88 | 0.77 | 0.92 | 0.70 | 0.90 | 0.73 |
20 | Random Forest (Tuned) | Train | 0.90 | 0.91 | 0.86 | 0.95 | 0.79 | 0.93 | 0.82 |
21 | Random Forest (Tuned) | Test | 0.85 | 0.88 | 0.77 | 0.92 | 0.70 | 0.90 | 0.73 |
Observations Based on Model Performance (before imbalance handling):
To determine the best method in terms of accuracy and recall for Class 1 (status = 1), let's evaluate the models:
Best Accuracy: Limited Depth (Depth=5) achieves the highest test-set accuracy at 0.86.
Highest Recall for Class 1: Limited Depth (Depth=5) also has the highest test-set recall for Class 1 at 0.71, with the Random Forest variants close behind at 0.70.
Conclusion: The Limited Depth (Depth=5) model provides the best balance between overall accuracy (0.86) and recall for Class 1 (0.71). It offers the most reliable performance for accurately identifying positive instances in the minority class while maintaining strong overall model accuracy.
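As a quick programmatic check on these takeaways, the comparison table built above can be ranked directly (a small sketch using the model_performance_df from the earlier cell):
# Rank the test-set rows by recall for Class 1, then by accuracy
test_rows = model_performance_df[model_performance_df['Dataset'] == 'Test']
print(test_rows.sort_values(by=['Recall Class 1', 'Accuracy'], ascending=False)
               [['Model', 'Accuracy', 'Recall Class 1', 'F1-Score Class 1']].head())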
import pandas as pd
# Define the columns for the table
columns = ['Method', 'Accuracy', 'Precision Class 0', 'Precision Class 1',
'Recall Class 0', 'Recall Class 1', 'F1-Score Class 0', 'F1-Score Class 1']
# Data for each method
data = [
['Class Weighting', 0.8464, 0.8807, 0.7537, 0.9063, 0.7003, 0.8933, 0.7260],
['SMOTE', 0.8440, 0.8888, 0.7335, 0.8917, 0.7275, 0.8903, 0.7305],
['Undersampling', 0.8321, 0.9233, 0.6703, 0.8326, 0.8311, 0.8756, 0.7421],
['Hybrid (SMOTE + Undersampling)', 0.7846, 0.9239, 0.5901, 0.7589, 0.8474, 0.8333, 0.6957]
]
# Create a DataFrame
imbalance_handling_df = pd.DataFrame(data, columns=columns)
# Display the table
imbalance_handling_df
Method | Accuracy | Precision Class 0 | Precision Class 1 | Recall Class 0 | Recall Class 1 | F1-Score Class 0 | F1-Score Class 1 | |
---|---|---|---|---|---|---|---|---|
0 | Class Weighting | 0.8464 | 0.8807 | 0.7537 | 0.9063 | 0.7003 | 0.8933 | 0.7260 |
1 | SMOTE | 0.8440 | 0.8888 | 0.7335 | 0.8917 | 0.7275 | 0.8903 | 0.7305 |
2 | Undersampling | 0.8321 | 0.9233 | 0.6703 | 0.8326 | 0.8311 | 0.8756 | 0.7421 |
3 | Hybrid (SMOTE + Undersampling) | 0.7846 | 0.9239 | 0.5901 | 0.7589 | 0.8474 | 0.8333 | 0.6957 |
Observations Based on Imbalance Handling Methods:
Using the criteria of accuracy and recall for Class 1 (status = 1):
Class Weighting:
Accuracy: 0.8464
Recall for Class 1: 0.7003
This method provides a strong balance between overall accuracy and recall for Class 1.
SMOTE:
Accuracy: 0.8440
Recall for Class 1: 0.7275
SMOTE achieves slightly lower accuracy than class weighting but improves recall for Class 1, making it more effective at capturing positive instances.
Undersampling:
Accuracy: 0.8321
Recall for Class 1: 0.8311
Undersampling gives the highest recall for Class 1 (0.8311), but overall accuracy drops slightly. It focuses more on improving minority class predictions.
Hybrid (SMOTE + Undersampling):
Accuracy: 0.7846
Recall for Class 1: 0.8474
This method has the highest recall for Class 1 (0.8474) but sacrifices accuracy. It is better suited if your focus is mainly on maximizing recall for Class 1.
Conclusion:
SMOTE provides the best balance between accuracy (0.8440) and recall for Class 1 (0.7275), making it a strong contender if you want to maintain good performance across both classes.
Undersampling and Hybrid methods perform better on recall for Class 1, but they result in lower overall accuracy.
To evaluate which model—Limited Depth (Depth=5) or SMOTE—is better suited to our goals (balancing overall accuracy with recall for Class 1), let's compare both models based on key performance metrics.
Metric | Limited Depth (Depth=5) |
---|---|
Accuracy | 0.86 |
Precision (Class 1) | 0.79 |
Recall (Class 1) | 0.71 |
F1-Score (Class 1) | 0.75 |
Metric | SMOTE |
---|---|
Accuracy | 0.8440 |
Precision (Class 1) | 0.7335 |
Recall (Class 1) | 0.7275 |
F1-Score (Class 1) | 0.7305 |
Observations:
Overall Accuracy:
Precision for Class 1:
Recall for Class 1:
F1-Score for Class 1:
Conclusion:
None of the models we have explored does a great job of classifying and predicting which leads are most likely to convert. Additional model exploration is necessary to find the optimal model; I suspect more advanced feature engineering would be beneficial.
In the analysis provided the Limited Depth (Depth=5) model is the better choice for our goals, as it maintains a good balance between accuracy and recall, with better overall precision and F1-Score for Class 1. This makes it more effective at minimizing false positives while still correctly identifying most positive instances.
Issues:
The current models have shown limitations in accurately predicting conversions, particularly in distinguishing the minority class (converted leads). In order to improve predictive accuracy, the plan is to leverage advanced machine learning techniques, more aggressive feature engineering, and additional hyperparameter tuning, which should yield a more effective predictive model.
The key challenge is to handle the data imbalance and effectively utilize features such as time spent on the website, interaction type, and profile completion to enhance the model's ability to predict lead conversions, while minimizing false negatives (missed conversion opportunities).
Time is a constraint on exploring these avenues, as the code below ran for many hours without completing.
from sklearn.preprocessing import PolynomialFeatures
# Generate interaction terms for existing features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Create a DataFrame with the new features
# Use get_feature_names_out() instead of get_feature_names()
X_train_poly_df = pd.DataFrame(X_train_poly, columns=poly.get_feature_names_out(X_train.columns))
X_test_poly_df = pd.DataFrame(X_test_poly, columns=poly.get_feature_names_out(X_test.columns))
# Display the new features
X_train_poly_df.head()
age | website_visits | time_spent_on_website | page_views_per_visit | current_occupation_0 | current_occupation_1 | current_occupation_2 | first_interaction_0 | first_interaction_1 | profile_completed_0 | ... | digital_media_1 educational_channels_0 | digital_media_1 educational_channels_1 | digital_media_1 referral_0 | digital_media_1 referral_1 | educational_channels_0 educational_channels_1 | educational_channels_0 referral_0 | educational_channels_0 referral_1 | educational_channels_1 referral_0 | educational_channels_1 referral_1 | referral_0 referral_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 48.0 | 1.0 | 266.0 | 5.812 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 55.0 | 3.0 | 984.0 | 2.970 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 33.0 | 2.0 | 2478.0 | 2.189 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 59.0 | 0.0 | 0.0 | 0.000 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 23.0 | 1.0 | 365.0 | 1.933 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 325 columns
# Defining models to be tested
models = {
'RandomForest': RandomForestClassifier(random_state=1)
}
# Simplified hyperparameter search space for each model
param_grids = {
'RandomForest': {
'n_estimators': [100, 200],
'max_depth': [5, 10],
'min_samples_split': [2, 5],
},
}
# Running RandomizedSearchCV for the model
for model_name, model in models.items():
param_grid = param_grids[model_name]
randomized_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=5, scoring='f1', cv=5, random_state=1)
randomized_search.fit(X_train_poly_df, y_train)
# Train the model
test_model = randomized_search.best_estimator_
test_model.fit(X_train_poly_df, y_train)
# Evaluate the best model on the test set
y_pred_test = test_model.predict(X_test_poly_df)
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.88 0.92 0.90 896
1 0.78 0.69 0.73 367
accuracy 0.85 1263
macro avg 0.83 0.81 0.82 1263
weighted avg 0.85 0.85 0.85 1263
# Defining models to be tested
models = {
'GradientBoosting': GradientBoostingClassifier(random_state=1)
}
# Simplified hyperparameter search space for each model
param_grids = {
'GradientBoosting': {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 5],
}
}
# Running RandomizedSearchCV for the model
for model_name, model in models.items():
param_grid = param_grids[model_name]
randomized_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=5, scoring='f1', cv=5, random_state=1)
randomized_search.fit(X_train_poly_df, y_train)
# Train the model
test_model = randomized_search.best_estimator_
test_model.fit(X_train_poly_df, y_train)
# Evaluate the best model on the test set
y_pred_test = test_model.predict(X_test_poly_df)
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.88 0.92 0.90 896
1 0.77 0.68 0.72 367
accuracy 0.85 1263
macro avg 0.82 0.80 0.81 1263
weighted avg 0.84 0.85 0.85 1263
# Calculate the class weights to balance the classes
class_weights = compute_class_weight(class_weight='balanced',
classes=np.unique(y_train),
y=y_train)
# Create a dictionary for class weights (0 for non-converted, 1 for converted)
weight_dict = {0: class_weights[0], 1: class_weights[1]}
print(f"Class Weights: {weight_dict}")
# Set the scale_pos_weight for XGBoost
scale_pos_weight = weight_dict[1] / weight_dict[0]
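# Note (assumption): with 'balanced' weights this ratio simplifies to n_negative / n_positive,
# which is the heuristic the XGBoost documentation suggests for scale_pos_weight on imbalanced data.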
# Initialize the XGBoost classifier with the calculated scale_pos_weight
xgb_model = XGBClassifier(random_state=1, scale_pos_weight=scale_pos_weight)
# Train the XGBoost model
xgb_model.fit(X_train_poly_df, y_train)
# Make predictions on the test set
y_pred_test = xgb_model.predict(X_test_poly_df)
metrics_score(y_test, y_pred_test)
Class Weights: {0: 0.7207537934410181, 1: 1.6324833702882484}
precision recall f1-score support
0 0.89 0.86 0.88 896
1 0.69 0.75 0.72 367
accuracy 0.83 1263
macro avg 0.79 0.80 0.80 1263
weighted avg 0.83 0.83 0.83 1263
from sklearn.metrics import classification_report, roc_curve
# Initialize and train the XGBoost model (without class weights for simplicity here)
xgb_model = XGBClassifier(random_state=1)
xgb_model.fit(X_train_poly_df, y_train)
# Predict probabilities instead of labels
y_pred_proba = xgb_model.predict_proba(X_test_poly_df)[:, 1] # We only need probabilities for class 1
# Tune the threshold - Here, we set the threshold lower than 0.5 to improve recall
threshold = 0.3 # Lowering the threshold to improve recall
y_pred_custom_threshold = (y_pred_proba >= threshold).astype(int)
# Evaluate the model's performance at the new threshold
metrics_score(y_test, y_pred_custom_threshold)
# Optionally, plot the ROC curve to analyze thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
precision recall f1-score support
0 0.91 0.85 0.88 896
1 0.68 0.80 0.73 367
accuracy 0.83 1263
macro avg 0.80 0.82 0.81 1263
weighted avg 0.84 0.83 0.84 1263
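The 0.3 cut-off above was chosen by hand; a small sketch (assuming y_pred_proba and y_test from the cell above) sweeps a few candidate thresholds to make the recall/precision trade-off explicit. Note that tuning the threshold on the test set risks optimistic estimates, so a separate validation split would be preferable.
# Sketch: sweep candidate thresholds and print the recall/precision trade-off for class 1
import numpy as np
from sklearn.metrics import precision_score, recall_score
for t in np.arange(0.20, 0.55, 0.05):
    preds = (y_pred_proba >= t).astype(int)
    print(f"threshold={t:.2f}  recall_1={recall_score(y_test, preds):.2f}  "
          f"precision_1={precision_score(y_test, preds):.2f}")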
from imblearn.under_sampling import RandomUnderSampler
# Apply Random Undersampling to balance the classes
undersampler = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = undersampler.fit_resample(X_train_poly_df, y_train)
# Train the XGBoost model on the undersampled data
xgb_model = XGBClassifier(random_state=1)
xgb_model.fit(X_train_under, y_train_under)
# Predict on the test set (Note: Test set is not undersampled, as we want to evaluate performance on the original data)
y_pred_test = xgb_model.predict(X_test_poly_df)
# Evaluate the model's performance using classification report
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.92 0.80 0.86 896
1 0.63 0.83 0.72 367
accuracy 0.81 1263
macro avg 0.78 0.82 0.79 1263
weighted avg 0.84 0.81 0.82 1263
from imblearn.combine import SMOTETomek
# Apply SMOTE and Random Undersampling (SMOTE + Tomek Links for cleaning)
smote_undersampler = SMOTETomek(random_state=1)
X_train_smote, y_train_smote = smote_undersampler.fit_resample(X_train_poly_df, y_train)
# Train the XGBoost model on the combined SMOTE + undersampled data
xgb_model = XGBClassifier(random_state=1)
xgb_model.fit(X_train_smote, y_train_smote)
# Predict on the test set (Note: Test set is not resampled)
y_pred_test = xgb_model.predict(X_test_poly_df)
# Evaluate the model's performance using classification report
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.88 0.88 0.88 896
1 0.71 0.72 0.71 367
accuracy 0.83 1263
macro avg 0.80 0.80 0.80 1263
weighted avg 0.83 0.83 0.83 1263
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report
# Define hyperparameter grid for XGBoost
param_grid = {
'n_estimators': [100, 200, 300], # Number of boosting rounds
'learning_rate': [0.01, 0.1, 0.2], # Step size at each boosting step
'max_depth': [3, 5, 7], # Maximum depth of a tree
'colsample_bytree': [0.7, 1.0], # Subsample ratio of columns when constructing each tree
'subsample': [0.8, 1.0], # Subsample ratio of the training instance
'gamma': [0, 0.1, 0.2] # Minimum loss reduction required to make a further partition on a leaf node
}
# Initialize the XGBoost classifier
xgb_model = XGBClassifier(random_state=1)
# Set up RandomizedSearchCV to optimize for recall
random_search = RandomizedSearchCV(
estimator=xgb_model,
param_distributions=param_grid,
n_iter=10, # Number of different parameter settings to try
scoring='recall', # Optimize for recall
cv=5, # 5-fold cross-validation
random_state=1,
verbose=1
)
# Fit the RandomizedSearchCV on the training data
random_search.fit(X_train_poly_df, y_train)
# Get the best model after hyperparameter tuning
xgb_model = random_search.best_estimator_
# Evaluate the best model on the test set
y_pred_test = xgb_model.predict(X_test_poly_df)
# Print classification report (to check recall performance)
metrics_score(y_test, y_pred_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
precision recall f1-score support
0 0.88 0.89 0.89 896
1 0.72 0.72 0.72 367
accuracy 0.84 1263
macro avg 0.80 0.80 0.80 1263
weighted avg 0.84 0.84 0.84 1263
Employing alternate techniques to boost recall on the minority class (status = 1), significant improvements were found using XGBoost with threshold tuning and polynomial features (recall of 0.80), but the largest gain came from majority-class undersampling with polynomial features (recall of 0.83).
Conclusion
The best model for our purposes is Majority Class Undersampling with Polynomial Features, whose test-set performance is summarized below:
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
0 | 0.92 | 0.80 | 0.86 | 896 |
1 | 0.63 | 0.83 | 0.72 | 367 |
Accuracy | | | 0.81 | 1263 |
Macro Avg | 0.78 | 0.82 | 0.79 | 1263 |
Weighted Avg | 0.84 | 0.81 | 0.82 | 1263 |
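For reproducibility, the selected approach can be consolidated into a single imblearn pipeline. This is a minimal sketch that assumes the original X_train/X_test splits and the metrics_score helper used throughout; because the steps above were run cell by cell, this consolidated run may differ slightly:
# Sketch: polynomial interaction features + majority-class undersampling + XGBoost in one pipeline
from sklearn.preprocessing import PolynomialFeatures
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier
selected_model = Pipeline(steps=[
    ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('undersample', RandomUnderSampler(random_state=1)),  # resampling applied only during fit
    ('xgb', XGBClassifier(random_state=1))
])
selected_model.fit(X_train, y_train)
metrics_score(y_test, selected_model.predict(X_test))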
# Extract feature importances from the pruned_tree_depth_5 model
importances = pruned_tree_depth_5.feature_importances_
feature_names = X_train.columns # Assuming X_train is your feature set
# Create a DataFrame to show the importance of each feature
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
# Display the top 10 most important features
print("Top 10 Important Features:")
print(feature_importance_df.head(10))
Top 10 Important Features:
Feature Importance
2 time_spent_on_website 0.267570
7 first_interaction_0 0.259239
9 profile_completed_0 0.221637
4 current_occupation_0 0.103402
13 last_activity_1 0.062585
1 website_visits 0.016099
5 current_occupation_1 0.015881
14 last_activity_2 0.014340
0 age 0.008964
10 profile_completed_1 0.007838
Time Spent on Website (Importance: 0.2676):
First Interaction Type (first_interaction_0) (Importance: 0.2592):
Profile Completion (profile_completed_0) (Importance: 0.2216):
Current Occupation (current_occupation_0) (Importance: 0.1034):
Last Activity (last_activity_1) (Importance: 0.0626):
Website Visits (Importance: 0.0161):
Other Factors:
Age (Importance: 0.0090): Although age is a factor, it is less significant compared to other behavioral characteristics.
Profile Completed (Importance: 0.0078): While completing a profile is crucial, its specific state or variation influences conversion less than other factors.
Current Occupation (Occupation Type 1) and Last Activity (last_activity_2) play minor roles but still contribute to the model’s decision-making process.
Time Spent on Website:
First Interaction Type:
Incomplete Profiles:
Lower Engagement in Last Activity:
Leads likely to convert:
Behavior: High time spent on the website, frequent interactions (especially meaningful ones in the early stages), and recent activity signal strong engagement.
Demographics: While less significant, certain occupations are more likely to convert, and slightly younger or middle-aged customers show higher interest.
Engagement: Completed profiles and frequent website visits are key indicators of conversion.
Leads unlikely to convert:
Behavior: Low time spent on the website, few interactions, and less engagement overall.
Demographics: Occupations that are less aligned with the product/service offering, and incomplete profiles.
Engagement: Minimal or no meaningful activity in recent times, indicating a loss of interest.
Feature Importance: The most important feature identified was time_spent_on_website. This indicates that customers who spend more time on the website are significantly more likely to convert.
Actionable Insight: Increase engagement by optimizing the website to encourage leads to spend more time exploring your offerings.
Actionable Recommendations:
Heatmaps & User Behavior Analysis: Implement tools like Google Analytics or Hotjar to create heatmaps and track session recordings. Analyze which pages customers are spending the most time on and which areas are being clicked the most. This will help you identify sections of your website that need improvement, such as adding clear Call to Actions (CTAs) on key pages or enhancing content for high-traffic areas.
Personalized Content Recommendations: Use machine learning algorithms or recommendation engines to show personalized product or content suggestions based on the user’s behavior (e.g., pages visited, items viewed). This increases the chances of keeping them engaged for longer.
Interactive Tools: Integrate interactive features like quizzes, product selectors, or price calculators. Interactive content keeps users engaged longer and gives them a reason to explore your website deeper.
Chatbots and Live Chat: Implement AI-driven chatbots or live chat features to engage with visitors in real-time. These tools can answer questions, provide product recommendations, and guide users through the buying process. Bots that trigger when visitors spend a long time on a particular page can nudge them toward conversion.
Feature Importance: The features first_interaction_0 and profile_completed_0 were significant predictors of conversion, indicating that the initial interaction and completion of profiles influence whether a lead converts.
Actionable Insight: Focus on optimizing the first interaction (email, landing page, or sign-up form) and encourage profile completion to improve conversion rates.
Actionable Recommendations:
First Interaction Personalization: Customize the first interaction with the lead based on the source (e.g., social media, email, ad campaign). Use dynamic content in your emails or landing pages that speaks directly to the lead's interests or needs. For instance, leads who arrive via an ad campaign might be more interested in seeing offers, while those from organic search might need more information.
Strong CTAs: Ensure your landing pages and forms have strong CTAs (e.g., “Get Your Free Quote,” “See a Demo”) that clearly guide the lead toward the next step. Test different CTA variations (A/B Testing) to determine which versions drive the highest engagement.
Gamification for Profile Completion: Encourage users to complete their profiles by gamifying the process. Show a progress bar that visually indicates how close they are to completing the profile and offer rewards (e.g., a discount or free content) when the profile reaches 100%. This not only encourages engagement but also provides you with more data to personalize your marketing efforts.
Onboarding Sequences: Use email onboarding sequences that gradually prompt leads to complete their profiles. For example, after they sign up, send follow-up emails offering tips, reminders, or incentives to complete their profile.
Feature Importance: The features current_occupation_0 and current_occupation_1 were identified as significant predictors. Certain occupations are more likely to convert, indicating that targeting different customer segments based on occupation could improve conversion rates.
Actionable Insight: Segment marketing efforts based on a lead’s occupation, offering personalized communication, offers, and services that resonate with their profession.
Actionable Recommendations:
Feature Importance: The features last_activity_1 and last_activity_2 were significant, indicating that recent activity is an important predictor of conversion. Leads who have interacted with your platform or website recently are more likely to convert.
Actionable Insight: Track user activity and implement retargeting campaigns to engage with users who recently interacted with your website.
Actionable Recommendations:
Real-Time Engagement: Implement real-time alerts or CRM triggers that notify your sales or marketing team when a lead has interacted with specific areas of your site (e.g., viewed a pricing page). Set up an automated follow-up sequence (via email or SMS) to re-engage the lead within a specific timeframe (e.g., 24 hours after the last interaction).
Behavioral Retargeting: Use behavioral retargeting ads to reach users who left your site without converting. These ads should reflect the user’s recent activity, such as a product they viewed or a service page they visited.
Automated Workflows: Create automated workflows that trigger emails or messages based on last activity. For example, if a lead viewed your product page but didn’t complete the purchase, an email with product benefits or an offer could nudge them toward conversion.
Urgency and Scarcity Tactics: Combine activity tracking with urgency tactics (e.g., “Offer valid for 48 hours”) to encourage leads to act quickly. When paired with retargeting, this can improve the likelihood of conversion.
Supplementary Action: Implement referral programs to leverage your existing customer base and generate high-quality leads. Word-of-mouth marketing is a proven method for converting leads who trust the referral source.
Actionable Recommendations:
Referral Rewards: Create a simple, easy-to-understand referral program that incentivizes both the referrer and the referred lead. For example, offer existing customers a discount, credit, or gift card when they successfully refer a new lead who converts.
Promote the Program: Make the referral program visible across all customer touchpoints, including your website, email campaigns, and during the checkout process. Send periodic reminder emails encouraging customers to refer friends and family.
Track and Optimize: Use analytics tools to track how many leads are coming through referrals, and adjust the incentive structure if necessary. Additionally, identify your top referrers and engage them with personalized rewards or exclusive offers.
Loyalty Programs: In addition to referrals, loyalty programs can encourage repeat business and referrals. Offer points or rewards for actions like repeat purchases, referrals, or social media engagement.
Feature Importance: The website_visits feature shows that repeated visits to the website correlate with conversion likelihood, but some leads still do not convert despite visiting the website multiple times.
Actionable Recommendations:
Behavior-Based Email Campaigns: Set up trigger-based email campaigns that automatically send targeted emails to leads who have visited your website multiple times but haven’t converted. Include content that addresses potential concerns or offers a personalized incentive (e.g., a limited-time discount).
Lead Scoring for Priority Follow-Up: Use lead scoring to prioritize follow-up with leads who have the highest number of website visits. Leads with high visit counts should receive more personalized outreach, such as a phone call or a customized offer (a small scoring sketch follows this list).
Retargeting Ads for High-Intent Leads: Implement retargeting campaigns specifically designed for leads who visited your website multiple times. These ads can highlight the product or service they were interested in and offer a limited-time promotion to push them toward conversion.
Live Chat Intervention: Implement a live chat feature that prompts when users visit the website for the second or third time without converting. The live chat can proactively offer assistance, answer questions, or provide a personalized recommendation to nudge them toward a decision.
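A rough sketch of the lead-scoring idea referenced above, using the tuned xgb_model and X_test_poly_df from earlier; it assumes the untransformed X_test frame still carries a website_visits column, so treat that column access as illustrative rather than exact.
import pandas as pd
# Hypothetical lead-scoring table: predicted conversion probability plus visit count
# for each test lead (assumes X_test and its 'website_visits' column are available)
lead_scores = pd.DataFrame({
    'conversion_probability': xgb_model.predict_proba(X_test_poly_df)[:, 1],
    'website_visits': X_test['website_visits'].values,
    'converted': y_test.values,
})
# Unconverted leads, ranked by predicted probability and then by visit count
follow_up = (
    lead_scores[lead_scores['converted'] == 0]
    .sort_values(['conversion_probability', 'website_visits'], ascending=False)
)
print(follow_up.head(10))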
import os
os.getcwd()
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/Learner+Notebook+-+Full+Code+Version+-+Potential+Customers+Prediction.ipynb"
[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/Learner+Notebook+-+Full+Code+Version+-+Potential+Customers+Prediction.ipynb to html
[NbConvertApp] Writing 5704577 bytes to /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/Learner+Notebook+-+Full+Code+Version+-+Potential+Customers+Prediction.html