Data Scientist Employee Attrition - Job Change of Data Scientists¶
Problem Statement¶
A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass the courses it conducts. Many people sign up for this training. The company wants to know which of these candidates genuinely want to work for the company after training and which are only looking for new employment, as this helps reduce the cost and time of training and improves the quality of course planning and the categorization of candidates. Information related to demographics, education, and experience is provided by candidates during signup and enrollment.
This dataset is designed to understand the factors that lead a person to leave their current job, and it is hence useful for HR research. By building a model that uses the current credentials, demographics, and work experience related data, you will predict the probability that a candidate is looking for a new job, as well as interpret the main factors that affect an employee's decision whether to continue or attrite.
Data Description¶
- Enrollee_id: Unique ID for candidate
- City: City code
- City_development_index: Development index of the city (scaled)
- Gender: Gender of candidate
- Relevent_experience: Relevant experience of candidate
- Enrolled_university: Type of University course enrolled if any
- Education_level: Education level of candidate
- Major_discipline: Education major discipline of candidate
- Experience: Candidate total experience in years
- Company_size: Number of employees in the current employer's company
- Company_type: Type of current employer
- Last_new_job: Difference in years between previous job and current job
- Training_hours: training hours completed
- Target: 0 – Not looking for job change, 1 – Looking for a job change
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Introductory Steps¶
Importing Libraries¶
pip install scikeras
Collecting scikeras Downloading scikeras-0.13.0-py3-none-any.whl.metadata (3.1 kB) Requirement already satisfied: keras>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from scikeras) (3.4.1) Requirement already satisfied: scikit-learn>=1.4.2 in /usr/local/lib/python3.10/dist-packages (from scikeras) (1.5.2) Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (1.4.0) Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (1.26.4) Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (13.8.1) Requirement already satisfied: namex in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.0.8) Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (3.11.0) Requirement already satisfied: optree in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.12.1) Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.4.1) Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (24.1) Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (1.13.1) Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (1.4.2) Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (3.5.0) Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from optree->keras>=3.2.0->scikeras) (4.12.2) Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.2.0->scikeras) (3.0.0) Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.2.0->scikeras) (2.18.0) Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras>=3.2.0->scikeras) (0.1.2) Downloading scikeras-0.13.0-py3-none-any.whl (26 kB) Installing collected packages: scikeras Successfully installed scikeras-0.13.0
pip install scikit-learn==1.2.2
Collecting scikit-learn==1.2.2 Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (1.26.4) Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (1.13.1) Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (1.4.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (3.5.0) Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.6/9.6 MB 25.2 MB/s eta 0:00:00 Installing collected packages: scikit-learn Attempting uninstall: scikit-learn Found existing installation: scikit-learn 1.5.2 Uninstalling scikit-learn-1.5.2: Successfully uninstalled scikit-learn-1.5.2 ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. scikeras 0.13.0 requires scikit-learn>=1.4.2, but you have scikit-learn 1.2.2 which is incompatible. Successfully installed scikit-learn-1.2.2
This notebook was executed after downgrading scikit-learn to version 1.2.2, so it is advisable to do the same to avoid potential compatibility issues. After installing scikit-learn, restart the session and then follow the steps below.
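A quick sanity check after restarting the runtime (a minimal sketch; it simply confirms the versions reported by the pip logs above):
# Confirm the pinned versions before proceeding
import sklearn
from importlib.metadata import version
print(sklearn.__version__)   # expected: 1.2.2 after the downgrade above
print(version("scikeras"))   # expected: 0.13.0 as installed above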
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn import model_selection
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import warnings
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dense, Input, Dropout,BatchNormalization
from scikeras.wrappers import KerasClassifier
import random
from tensorflow.keras import backend
random.seed(1)
np.random.seed(1)
tf.random.set_seed(1)
warnings.filterwarnings("ignore")
Loading the Data¶
# Loading the data. Fill the blank with the file path of the CSV file
#Data = pd.read_csv('_________')
dataset = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Six_-_Deep_Learning/Data_Scientist_Employee_Attrition/Data.csv')
Data = dataset.copy()
# Checking the number of rows and columns in the data
Data.shape
(19158, 14)
- The dataset has 19158 rows and 14 columns
Data Overview¶
# Let's view the first 5 rows of the data
Data.head()
Enrollee_id | City | City_development_index | Gender | Relevent_experience | Enrolled_university | Education_level | Major_discipline | Experience | Company_size | Company_type | Last_new_job | Training_hours | Target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8949 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1 |
1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0 |
2 | 11561 | city_21 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0 |
3 | 33241 | city_115 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1 |
4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0 |
# Let's view the last 5 rows of the data
Data.tail()
Enrollee_id | City | City_development_index | Gender | Relevent_experience | Enrolled_university | Education_level | Major_discipline | Experience | Company_size | Company_type | Last_new_job | Training_hours | Target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19153 | 7386 | city_173 | 0.878 | Male | No relevent experience | no_enrollment | Graduate | Humanities | 14 | NaN | NaN | 1 | 42 | 1 |
19154 | 31398 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 14 | NaN | NaN | 4 | 52 | 1 |
19155 | 24576 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | 50-99 | Pvt Ltd | 4 | 44 | 0 |
19156 | 5756 | city_65 | 0.802 | Male | Has relevent experience | no_enrollment | High School | NaN | <1 | 500-999 | Pvt Ltd | 2 | 97 | 0 |
19157 | 23834 | city_67 | 0.855 | NaN | No relevent experience | no_enrollment | Primary School | NaN | 2 | NaN | NaN | 1 | 127 | 0 |
# Let's check the datatypes of the columns in the dataset
Data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 19158 entries, 0 to 19157 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Enrollee_id 19158 non-null int64 1 City 19158 non-null object 2 City_development_index 19158 non-null float64 3 Gender 14650 non-null object 4 Relevent_experience 19158 non-null object 5 Enrolled_university 18772 non-null object 6 Education_level 18698 non-null object 7 Major_discipline 16345 non-null object 8 Experience 19093 non-null object 9 Company_size 13220 non-null object 10 Company_type 13018 non-null object 11 Last_new_job 18735 non-null object 12 Training_hours 19158 non-null int64 13 Target 19158 non-null int64 dtypes: float64(1), int64(3), object(10) memory usage: 2.0+ MB
- There are 19,158 observations and 14 columns in the data.
- 10 columns are of the object datatype and 4 columns are numerical.
# Let's check for duplicate values in the data
Data.duplicated().sum()
0
# Let's check for missing values in the data
round(Data.isnull().sum() / Data.isnull().count() * 100, 2)
0 | |
---|---|
Enrollee_id | 0.00 |
City | 0.00 |
City_development_index | 0.00 |
Gender | 23.53 |
Relevent_experience | 0.00 |
Enrolled_university | 2.01 |
Education_level | 2.40 |
Major_discipline | 14.68 |
Experience | 0.34 |
Company_size | 30.99 |
Company_type | 32.05 |
Last_new_job | 2.21 |
Training_hours | 0.00 |
Target | 0.00 |
Data["Target"].value_counts(1)
proportion | |
---|---|
Target | |
0 | 0.750652 |
1 | 0.249348 |
# Let's view the statistical summary of the numerical columns in the data
Data.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Enrollee_id | 19158.0 | 16875.358179 | 9616.292592 | 1.000 | 8554.25 | 16982.500 | 25169.75 | 33380.000 |
City_development_index | 19158.0 | 0.828848 | 0.123362 | 0.448 | 0.74 | 0.903 | 0.92 | 0.949 |
Training_hours | 19158.0 | 65.366896 | 60.058462 | 1.000 | 23.00 | 47.000 | 88.00 | 336.000 |
Target | 19158.0 | 0.249348 | 0.432647 | 0.000 | 0.00 | 0.000 | 0.00 | 1.000 |
Outside of the Enrollee_id (an ID column) and the Target variable, there are only two numerical columns in the dataset.
- The maximum number of training hours is 336, but at least 75% of employees had finished their training within 88 hours.
- 50% of the rows have a city development index between 0.903 and 0.949, so the dataset is weighted towards the higher side with respect to this column (see the quick check below).
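These skew-related observations can be quantified directly; a minimal sketch (the exact values depend on the data):
# Positive skew => right-skewed (Training_hours); negative skew => mass concentrated at higher values (City_development_index)
print(Data['Training_hours'].skew())
print(Data['City_development_index'].skew())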
# Let's check the number of unique values in each column
Data.nunique()
0 | |
---|---|
Enrollee_id | 19158 |
City | 123 |
City_development_index | 93 |
Gender | 3 |
Relevent_experience | 2 |
Enrolled_university | 3 |
Education_level | 5 |
Major_discipline | 6 |
Experience | 22 |
Company_size | 8 |
Company_type | 6 |
Last_new_job | 6 |
Training_hours | 241 |
Target | 2 |
- Each value of the column 'Enrollee_id' is a unique identifier for a candidate. Hence, we can drop this column, as it will not add any predictive power to the model.
- The 'City' column has 123 unique categories.
for i in Data.describe(include=["object"]).columns:
print("Unique values in", i, "are :")
print(Data[i].value_counts())
print("*" * 50)
Unique values in City are : City city_103 4355 city_21 2702 city_16 1533 city_114 1336 city_160 845 ... city_129 3 city_111 3 city_121 3 city_140 1 city_171 1 Name: count, Length: 123, dtype: int64 ************************************************** Unique values in Gender are : Gender Male 13221 Female 1238 Other 191 Name: count, dtype: int64 ************************************************** Unique values in Relevent_experience are : Relevent_experience Has relevent experience 13792 No relevent experience 5366 Name: count, dtype: int64 ************************************************** Unique values in Enrolled_university are : Enrolled_university no_enrollment 13817 Full time course 3757 Part time course 1198 Name: count, dtype: int64 ************************************************** Unique values in Education_level are : Education_level Graduate 11598 Masters 4361 High School 2017 Phd 414 Primary School 308 Name: count, dtype: int64 ************************************************** Unique values in Major_discipline are : Major_discipline STEM 14492 Humanities 669 Other 381 Business Degree 327 Arts 253 No Major 223 Name: count, dtype: int64 ************************************************** Unique values in Experience are : Experience >20 3286 5 1430 4 1403 3 1354 6 1216 2 1127 7 1028 10 985 9 980 8 802 15 686 11 664 14 586 1 549 <1 522 16 508 12 494 13 399 17 342 19 304 18 280 20 148 Name: count, dtype: int64 ************************************************** Unique values in Company_size are : Company_size 50-99 3083 100-500 2571 10000+ 2019 Oct-49 1471 1000-4999 1328 <10 1308 500-999 877 5000-9999 563 Name: count, dtype: int64 ************************************************** Unique values in Company_type are : Company_type Pvt Ltd 9817 Funded Startup 1001 Public Sector 955 Early Stage Startup 603 NGO 521 Other 121 Name: count, dtype: int64 ************************************************** Unique values in Last_new_job are : Last_new_job 1 8040 >4 3290 2 2900 never 2452 4 1029 3 1024 Name: count, dtype: int64 **************************************************
- The 'City' column has 123 unique categories, and the city with the highest number of employees is City 103.
- Over 90% of the employees in this dataset are males, so it is highly gender-skewed.
- Most of the employees (~70%) have relevant experience in Data Science.
- 70% of the employees did not enroll in any of the courses.
- Most of the employees have a Bachelor's (Graduate) level of education as their highest qualification (~62% of the total). About 23% hold a Master's degree, and only a small fraction hold a PhD (~2%).
- Almost all the employees have some previous work experience (relevant or non-relevant).
- ~75% of the employees are from private companies.
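The percentages quoted above can be reproduced from normalized value counts; a minimal sketch (missing values are excluded from the denominators):
# Share of each category per column, ignoring missing values
for col in ['Gender', 'Relevent_experience', 'Enrolled_university', 'Education_level', 'Company_type']:
    print(Data[col].value_counts(normalize=True).round(3))
    print('*' * 50)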
Data Pre-processing¶
# The ID column consists of unique IDs for candidates and hence will not add value to the modeling
Data.drop(columns="Enrollee_id", inplace=True)
EDA¶
Univariate Analysis¶
# Function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
Data.dtypes
0 | |
---|---|
City | object |
City_development_index | float64 |
Gender | object |
Relevent_experience | object |
Enrolled_university | object |
Education_level | object |
Major_discipline | object |
Experience | object |
Company_size | object |
Company_type | object |
Last_new_job | object |
Training_hours | int64 |
Target | int64 |
histogram_boxplot(Data, "City_development_index")
- From the above plot, we observe that many people come from cities with a development index greater than 0.9.
histogram_boxplot(Data, "Training_hours")
- From the plot, we observe that the distribution of training hours is right-skewed: the mean is around 65 hours and the median around 47, despite a maximum of over 300 hours. So most of the people in this dataset have undergone training for less than 100 hours.
# Function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(Data, "Gender")
- There are far more males in this dataset in comparison to females.
- Over 90% of this dataset is male, making it highly gender-skewed. This could be a limitation when deploying the model in the real world, since models trained on such imbalanced, people-related data may not generalize well across genders.
labeled_barplot(Data, "Relevent_experience")
- Most of the employees have relevant prior experience (~70%).
- 30% of the employees, however, have no relevant experience.
labeled_barplot(Data, "Enrolled_university")
- Most of the employees did not enroll in any of the courses.
- Approximately 20% of the employees have enrolled themselves in full-time courses.
- Only 6% have enrolled in part-time courses.
labeled_barplot(Data, "Education_level")
- Approximately 62% of employees have a Bachelor's (Graduate) level of education, but not more than that.
- Approx 23% of employees have a Master's degree as their highest level of education.
- Around 10% of employees have only a High School education, and very few (~1.6%) have only a Primary School education.
labeled_barplot(Data, "Major_discipline")
- Approximately 88% of employees have opted for STEM as their major discipline.
labeled_barplot(Data, "Experience")
- Approximately 17% of total employees have over 20 years of work experience.
labeled_barplot(Data, "Company_type")
- Approximately 75% of the total employees are from a private limited company, showing the skew of the profile towards the private sector.
print(Data.Target.value_counts())
labels = 'Looking for job change', 'Not Looking for job change'
sizes = [Data.Target[Data['Target']==1].count(),Data.Target[Data['Target']==0].count()]
explode = (0, 0.1)
fig1, ax1 = plt.subplots(figsize=(10, 8))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal')
plt.title("Proportion", size = 20)
plt.show()
Target 0 14381 1 4777 Name: count, dtype: int64
- This pie chart shows that the actual distribution of classes is itself imbalanced for the target variable.
- Only ~25% of the employees in this dataset are actually looking for a job change.
Hence, this dataset and problem statement represent an example of Imbalanced Classification, which has unique challenges in comparison to performing classification over balanced target variables.
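One immediate consequence is that plain accuracy is a weak yardstick here: a model that always predicts the majority class already looks deceptively good. A minimal sketch of that baseline:
# Baseline: always predict 'not looking for a job change' (the majority class)
majority_baseline_accuracy = (Data['Target'] == 0).mean()
print(round(majority_baseline_accuracy, 3))  # roughly 0.75 for this dataset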
Bivariate Analysis¶
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
distribution_plot_wrt_target(Data, "City_development_index", "Target")
- From the above plot, we observe that employees from cities with a development index above 0.9 are mostly not looking to switch jobs.
distribution_plot_wrt_target(Data, "Training_hours", "Target")
- We observe that the distribution of training hours is right-skewed for both classes of the target variable, and the boxplots show that the median training hours for both classes are around 50.
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(
loc="lower left",
frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
stacked_barplot(Data, "Gender", "Target")
Target 0 1 All Gender All 11262 3388 14650 Male 10209 3012 13221 Female 912 326 1238 Other 141 50 191 ------------------------------------------------------------------------------------------------------------------------
- From the above plot, the likelihood of an employee looking for a job change does not appear to depend on gender.
stacked_barplot(Data, "Relevent_experience", "Target")
Target 0 1 All Relevent_experience All 14381 4777 19158 Has relevent experience 10831 2961 13792 No relevent experience 3550 1816 5366 ------------------------------------------------------------------------------------------------------------------------
- From the above plot, we see that employees without relevant experience are more likely to be looking for a job change.
stacked_barplot(Data, "Enrolled_university", "Target")
Target 0 1 All Enrolled_university All 14118 4654 18772 no_enrollment 10896 2921 13817 Full time course 2326 1431 3757 Part time course 896 302 1198 ------------------------------------------------------------------------------------------------------------------------
- Employees enrolled in full-time university courses are the most likely to be looking for a job change.
stacked_barplot(Data, "Education_level", "Target")
Target 0 1 All Education_level All 14025 4673 18698 Graduate 8353 3245 11598 Masters 3426 935 4361 High School 1623 394 2017 Phd 356 58 414 Primary School 267 41 308 ------------------------------------------------------------------------------------------------------------------------
- Graduates are the most likely to be looking for a job change (~28%), followed by those with a Master's degree (~21%); the proportion drops further for PhD holders.
stacked_barplot(Data, "Major_discipline", "Target")
Target 0 1 All Major_discipline All 12117 4228 16345 STEM 10701 3791 14492 Humanities 528 141 669 Other 279 102 381 Business Degree 241 86 327 No Major 168 55 223 Arts 200 53 253 ------------------------------------------------------------------------------------------------------------------------
- Employees who took STEM or Business Degrees as their major discipline are slightly more likely to change their job.
stacked_barplot(Data, "Experience", "Target")
Target 0 1 All Experience All 14339 4754 19093 >20 2783 503 3286 3 876 478 1354 4 946 457 1403 5 1018 412 1430 2 753 374 1127 6 873 343 1216 7 725 303 1028 <1 285 237 522 1 316 233 549 9 767 213 980 10 778 207 985 8 607 195 802 11 513 151 664 15 572 114 686 14 479 107 586 12 402 92 494 13 322 77 399 16 436 72 508 17 285 57 342 19 251 53 304 18 237 43 280 20 115 33 148 ------------------------------------------------------------------------------------------------------------------------
- From the above plot, employees with only a few years of work experience (especially under 2 years) are the most likely to be looking for a job change.
stacked_barplot(Data, "Last_new_job", "Target")
Target 0 1 All Last_new_job All 14112 4623 18735 1 5915 2125 8040 never 1713 739 2452 2 2200 700 2900 >4 2690 600 3290 3 793 231 1024 4 801 228 1029 ------------------------------------------------------------------------------------------------------------------------
- Employees who have never switched their job before are the most likely to be looking for a job change.
### Dropping Company_size, Gender and City: the first two have large fractions of missing values and City has very high cardinality, so they will not add value to the modeling
Data.drop(['Company_size','Gender','City'], axis=1, inplace=True)
## Separating all the categorical columns for imputation
cat_col_df = Data.drop(['City_development_index','Training_hours','Target'], axis=1)
Missing Value Imputation¶
- We will impute the missing values in columns using their mode.
## Separating Independent and Dependent Columns
X = Data.drop(['Target'],axis=1)
Y = Data[['Target']]
Y.head()
Target | |
---|---|
0 | 1 |
1 | 0 |
2 | 0 |
3 | 1 |
4 | 0 |
# Splitting the dataset into the Training and Testing set.
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42,stratify = Y)
X_train.isnull().sum()
0 | |
---|---|
City_development_index | 0 |
Relevent_experience | 0 |
Enrolled_university | 317 |
Education_level | 362 |
Major_discipline | 2258 |
Experience | 50 |
Company_type | 4881 |
Last_new_job | 343 |
Training_hours | 0 |
imputer_mode = SimpleImputer(strategy="most_frequent")
X_train[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]] = imputer_mode.fit_transform(
X_train[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]])
X_test[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]] = imputer_mode.transform(
X_test[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]])
# Checking that no column has missing values in train and test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
City_development_index 0 Relevent_experience 0 Enrolled_university 0 Education_level 0 Major_discipline 0 Experience 0 Company_type 0 Last_new_job 0 Training_hours 0 dtype: int64 ------------------------------ City_development_index 0 Relevent_experience 0 Enrolled_university 0 Education_level 0 Major_discipline 0 Experience 0 Company_type 0 Last_new_job 0 Training_hours 0 dtype: int64
Encoding Categorical Columns¶
- We will be using the Label Encoding technique to encode the values of the categorical columns in this dataset.
from sklearn.preprocessing import LabelEncoder
# Fit one LabelEncoder per categorical column on the training set and reuse it
# on the test set so both share the same category-to-integer mapping
label_encoders = {}
for col in ['Relevent_experience', 'Enrolled_university', 'Education_level',
            'Major_discipline', 'Experience', 'Company_type', 'Last_new_job']:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
    label_encoders[col] = le
X_train.head()
City_development_index | Relevent_experience | Enrolled_university | Education_level | Major_discipline | Experience | Company_type | Last_new_job | Training_hours | |
---|---|---|---|---|---|---|---|---|---|
17855 | 0.624 | 0 | 2 | 0 | 5 | 1 | 5 | 0 | 90 |
17664 | 0.920 | 1 | 2 | 4 | 5 | 15 | 5 | 5 | 15 |
13404 | 0.896 | 0 | 2 | 0 | 5 | 3 | 2 | 4 | 36 |
13366 | 0.920 | 0 | 2 | 0 | 5 | 15 | 1 | 0 | 53 |
15670 | 0.855 | 0 | 0 | 0 | 5 | 15 | 5 | 0 | 158 |
y_train.head()
Target | |
---|---|
17855 | 0 |
17664 | 0 |
13404 | 0 |
13366 | 0 |
15670 | 1 |
###Checking the shape of train and test sets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(15326, 9) (3832, 9) (15326, 1) (3832, 1)
Model Building¶
A model can make wrong predictions in the following ways:¶
- Predicting an employee is looking for a job, when he/she is not looking for it.
- Predicting an employee is not looking for a job, when he/she is in fact looking for one.
Which case is more important?¶
Both cases are important for the purposes of this case study. Wrongly classifying an employee as likely to attrite might mean withholding opportunities from a deserving employee, which can reduce productivity and cause the company to lose a good employee, affecting the organization's growth. Conversely, investing in and giving greater responsibility to an employee who is actually looking to leave would lead to a financial loss for the company and might again hamper its growth.
How to reduce this loss, i.e., how do we reduce both False Negatives and False Positives?¶
Since both types of error are important to minimize, the company would want the F1 score to be maximized. Hence, the focus should be on increasing the F1 score rather than on Recall or Precision alone.
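As a refresher on what this metric balances, here is a minimal sketch (with made-up labels purely for illustration, not drawn from this dataset) showing that the F1 score is the harmonic mean of precision and recall:
from sklearn.metrics import f1_score, precision_score, recall_score
# Toy ground truth and predictions, used only to illustrate the metric
y_true_toy = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0, 1, 0]
p = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP)
r = recall_score(y_true_toy, y_pred_toy)     # TP / (TP + FN)
print(p, r, f1_score(y_true_toy, y_pred_toy))
print(2 * p * r / (p + r))  # same value: the harmonic mean of precision and recall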
Model 1¶
backend.clear_session()
# Fixing the seed for random number generators so that we can ensure we receive the same output every time
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
# Initializing the ANN
model = Sequential()
# The number of units in a hidden layer is a hyperparameter; here we start with 64 units in the first hidden layer.
# This adds the input layer (by specifying input dimension) AND the first hidden layer (units)
model.add(Dense(activation = 'relu', input_dim = 9, units=64))
# Adding the second hidden layer with 32 units
model.add(Dense(32, activation='relu'))
# Adding the output layer
# Notice that we do not need to specify input dim.
# We have an output of 1 node, which is the desired dimension of our output (looking for a job change or not)
# We use the sigmoid because we want probability outcomes
model.add(Dense(1, activation = 'sigmoid'))
# Create optimizer with default learning rate
# Compile the model
model.compile(optimizer='SGD', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 64) 640 dense_1 (Dense) (None, 32) 2080 dense_2 (Dense) (None, 1) 33 ================================================================= Total params: 2753 (10.75 KB) Trainable params: 2753 (10.75 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
history=model.fit(X_train, y_train,
validation_split=0.2,
epochs=50,
batch_size=32,verbose=1)
Epoch 1/50 384/384 [==============================] - 3s 4ms/step - loss: 0.6304 - accuracy: 0.7440 - val_loss: 0.6566 - val_accuracy: 0.7515 Epoch 2/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5711 - accuracy: 0.7504 - val_loss: 0.5643 - val_accuracy: 0.7515 Epoch 3/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5645 - accuracy: 0.7502 - val_loss: 0.5679 - val_accuracy: 0.7515 Epoch 4/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5637 - accuracy: 0.7499 - val_loss: 0.5634 - val_accuracy: 0.7515 Epoch 5/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5619 - accuracy: 0.7504 - val_loss: 0.6195 - val_accuracy: 0.6644 Epoch 6/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5606 - accuracy: 0.7502 - val_loss: 0.5782 - val_accuracy: 0.7515 Epoch 7/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5585 - accuracy: 0.7504 - val_loss: 0.5593 - val_accuracy: 0.7515 Epoch 8/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5571 - accuracy: 0.7502 - val_loss: 0.5677 - val_accuracy: 0.7515 Epoch 9/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5566 - accuracy: 0.7503 - val_loss: 0.5730 - val_accuracy: 0.7515 Epoch 10/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5568 - accuracy: 0.7502 - val_loss: 0.5540 - val_accuracy: 0.7515 Epoch 11/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5557 - accuracy: 0.7503 - val_loss: 0.5790 - val_accuracy: 0.7456 Epoch 12/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5534 - accuracy: 0.7502 - val_loss: 0.5762 - val_accuracy: 0.7515 Epoch 13/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5532 - accuracy: 0.7505 - val_loss: 0.5864 - val_accuracy: 0.7456 Epoch 14/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5538 - accuracy: 0.7503 - val_loss: 0.5517 - val_accuracy: 0.7515 Epoch 15/50 384/384 [==============================] - 2s 6ms/step - loss: 0.5529 - accuracy: 0.7503 - val_loss: 0.5553 - val_accuracy: 0.7515 Epoch 16/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5519 - accuracy: 0.7502 - val_loss: 0.5854 - val_accuracy: 0.7469 Epoch 17/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5523 - accuracy: 0.7499 - val_loss: 0.5673 - val_accuracy: 0.7515 Epoch 18/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5502 - accuracy: 0.7498 - val_loss: 0.5502 - val_accuracy: 0.7515 Epoch 19/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5501 - accuracy: 0.7498 - val_loss: 0.5570 - val_accuracy: 0.7515 Epoch 20/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5505 - accuracy: 0.7506 - val_loss: 0.5489 - val_accuracy: 0.7515 Epoch 21/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5507 - accuracy: 0.7498 - val_loss: 0.5606 - val_accuracy: 0.7515 Epoch 22/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5501 - accuracy: 0.7499 - val_loss: 0.5493 - val_accuracy: 0.7515 Epoch 23/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5492 - accuracy: 0.7504 - val_loss: 0.6562 - val_accuracy: 0.5753 Epoch 24/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5496 - accuracy: 0.7498 - val_loss: 0.5564 - val_accuracy: 0.7515 Epoch 25/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5498 - accuracy: 0.7499 - val_loss: 0.5640 - 
val_accuracy: 0.7515 Epoch 26/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5492 - accuracy: 0.7509 - val_loss: 0.5933 - val_accuracy: 0.7515 Epoch 27/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5491 - accuracy: 0.7501 - val_loss: 0.5554 - val_accuracy: 0.7534 Epoch 28/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5479 - accuracy: 0.7505 - val_loss: 0.5496 - val_accuracy: 0.7515 Epoch 29/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5489 - accuracy: 0.7500 - val_loss: 0.5566 - val_accuracy: 0.7505 Epoch 30/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5482 - accuracy: 0.7497 - val_loss: 0.5408 - val_accuracy: 0.7515 Epoch 31/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5482 - accuracy: 0.7506 - val_loss: 0.5466 - val_accuracy: 0.7511 Epoch 32/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5476 - accuracy: 0.7499 - val_loss: 0.5477 - val_accuracy: 0.7502 Epoch 33/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5464 - accuracy: 0.7507 - val_loss: 0.5988 - val_accuracy: 0.7365 Epoch 34/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5476 - accuracy: 0.7506 - val_loss: 0.5641 - val_accuracy: 0.7534 Epoch 35/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5467 - accuracy: 0.7498 - val_loss: 0.5461 - val_accuracy: 0.7515 Epoch 36/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5462 - accuracy: 0.7508 - val_loss: 0.5534 - val_accuracy: 0.7515 Epoch 37/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5456 - accuracy: 0.7501 - val_loss: 0.6286 - val_accuracy: 0.6458 Epoch 38/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5461 - accuracy: 0.7505 - val_loss: 0.5490 - val_accuracy: 0.7515 Epoch 39/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5457 - accuracy: 0.7510 - val_loss: 0.5594 - val_accuracy: 0.7515 Epoch 40/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5454 - accuracy: 0.7505 - val_loss: 0.5439 - val_accuracy: 0.7511 Epoch 41/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5456 - accuracy: 0.7502 - val_loss: 0.5483 - val_accuracy: 0.7515 Epoch 42/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5447 - accuracy: 0.7500 - val_loss: 0.5736 - val_accuracy: 0.7515 Epoch 43/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5453 - accuracy: 0.7505 - val_loss: 0.5876 - val_accuracy: 0.7515 Epoch 44/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5448 - accuracy: 0.7503 - val_loss: 0.5420 - val_accuracy: 0.7515 Epoch 45/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5454 - accuracy: 0.7502 - val_loss: 0.5398 - val_accuracy: 0.7511 Epoch 46/50 384/384 [==============================] - 1s 4ms/step - loss: 0.5448 - accuracy: 0.7506 - val_loss: 0.5451 - val_accuracy: 0.7515 Epoch 47/50 384/384 [==============================] - 2s 4ms/step - loss: 0.5440 - accuracy: 0.7511 - val_loss: 0.5393 - val_accuracy: 0.7508 Epoch 48/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5434 - accuracy: 0.7507 - val_loss: 0.5374 - val_accuracy: 0.7511 Epoch 49/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5447 - accuracy: 0.7507 - val_loss: 0.5410 - val_accuracy: 0.7508 Epoch 50/50 384/384 [==============================] - 2s 5ms/step - loss: 0.5439 - accuracy: 0.7505 
- val_loss: 0.5467 - val_accuracy: 0.7515
# Capturing learning history per epoch
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
# Plotting the training and validation loss at different epochs
plt.plot(hist['loss'])
plt.plot(hist['val_loss'])
plt.legend(("train" , "valid") , loc =0)
#Printing results
results = model.evaluate(X_test, y_test)
120/120 [==============================] - 0s 2ms/step - loss: 0.5455 - accuracy: 0.7508
There is noise in the loss behavior here. The loss fluctuates a lot during training, which slows convergence. These fluctuations are due to the nature of Stochastic Gradient Descent, which produces noisy parameter updates.
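One common way to damp such oscillations is to add momentum to SGD; a minimal, hedged sketch (the next model instead switches to the Adam optimizer):
import tensorflow as tf
# SGD with momentum averages recent gradients, smoothing the noisy updates described above
smoother_sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
# model.compile(optimizer=smoother_sgd, loss='binary_crossentropy', metrics=['accuracy'])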
Let's check the other metrics.
y_pred=model.predict(X_test)
y_pred = (y_pred > 0.5)
y_pred
120/120 [==============================] - 0s 2ms/step
array([[False], [False], [False], ..., [False], [False], [False]])
def make_confusion_matrix(cf,
group_names=None,
categories='auto',
count=True,
percent=True,
cbar=True,
xyticks=True,
xyplotlabels=True,
sum_stats=True,
figsize=None,
cmap='Blues',
title=None):
'''
This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.
Arguments
'''
# CODE TO GENERATE TEXT INSIDE EACH SQUARE
blanks = ['' for i in range(cf.size)]
if group_names and len(group_names)==cf.size:
group_labels = ["{}\n".format(value) for value in group_names]
else:
group_labels = blanks
if count:
group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
else:
group_counts = blanks
if percent:
group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
else:
group_percentages = blanks
box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]
box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])
# CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
if sum_stats:
#Accuracy is sum of diagonal divided by total observations
accuracy = np.trace(cf) / float(np.sum(cf))
# SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
if figsize==None:
#Get default figure size if not set
figsize = plt.rcParams.get('figure.figsize')
if xyticks==False:
#Do not show categories if xyticks is False
categories=False
# MAKE THE HEATMAP VISUALIZATION
plt.figure(figsize=figsize)
sns.heatmap(cf,annot=box_labels,fmt="",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)
if title:
plt.title(title)
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm,
group_names=labels,
categories=categories,
cmap='Blues')
Here, the model predicts essentially no positives at all: with 0.5 as the threshold, it classifies almost every employee as 'Not Changing Job', so both the False Positive and True Positive percentages are near 0%. As this is an imbalanced dataset, we should instead choose the decision threshold using the ROC curve.
#Accuracy as per the classification report
from sklearn import metrics
cr=metrics.classification_report(y_test,y_pred)
print(cr)
precision recall f1-score support 0 0.75 1.00 0.86 2877 1 0.00 0.00 0.00 955 accuracy 0.75 3832 macro avg 0.38 0.50 0.43 3832 weighted avg 0.56 0.75 0.64 3832
As you can see, the above model has good accuracy but a poor F1-score for the positive class. This is largely due to the imbalanced dataset. The False Negative rate is very high (virtually every employee who is looking for a job change is missed), and it should be considerably lower.
Imbalanced dataset: As you have seen in the EDA, this dataset is imbalanced, and it contains more examples that belong to the 0 class.
Decision Threshold: Due to the imbalanced dataset, we can use ROC-AUC to find the optimal threshold and use the same for prediction.
Let's try to change the optimizer, tune the decision threshold, increase the layers and configure some other hyperparameters accordingly, in order to improve the model's performance.
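Another lever for handling the imbalance, not used in the models below (which rely on threshold tuning instead), is to weight the minority class more heavily during training; a minimal sketch using the y_train defined above:
# Weight each class inversely to its frequency in the training labels (illustrative only)
counts = y_train['Target'].value_counts()
class_weight = {0: len(y_train) / (2 * counts.loc[0]), 1: len(y_train) / (2 * counts.loc[1])}
# history = model.fit(X_train, y_train, epochs=50, batch_size=32,
#                     validation_split=0.2, class_weight=class_weight)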
Model 2¶
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
model1 = Sequential()
#Adding the hidden and output layers
model1.add(Dense(256,activation='relu',kernel_initializer='he_uniform',input_dim = X_train.shape[1]))
model1.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model1.add(Dense(64,activation='relu',kernel_initializer='he_uniform'))
model1.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
model1.add(Dense(1, activation = 'sigmoid'))
#Compiling the ANN with Adam optimizer and binary cross entropy loss function
optimizer = tf.keras.optimizers.Adam(0.001)
model1.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
model1.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 256) 2560 dense_1 (Dense) (None, 128) 32896 dense_2 (Dense) (None, 64) 8256 dense_3 (Dense) (None, 32) 2080 dense_4 (Dense) (None, 1) 33 ================================================================= Total params: 45825 (179.00 KB) Trainable params: 45825 (179.00 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
history1 = model1.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50 192/192 [==============================] - 3s 5ms/step - loss: 0.8645 - accuracy: 0.6954 - val_loss: 0.5785 - val_accuracy: 0.7355 Epoch 2/50 192/192 [==============================] - 1s 4ms/step - loss: 0.6333 - accuracy: 0.7193 - val_loss: 0.5575 - val_accuracy: 0.7469 Epoch 3/50 192/192 [==============================] - 1s 4ms/step - loss: 0.6248 - accuracy: 0.7277 - val_loss: 0.6609 - val_accuracy: 0.7518 Epoch 4/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5867 - accuracy: 0.7365 - val_loss: 0.5417 - val_accuracy: 0.7518 Epoch 5/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5684 - accuracy: 0.7414 - val_loss: 0.6028 - val_accuracy: 0.7502 Epoch 6/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5612 - accuracy: 0.7440 - val_loss: 0.5552 - val_accuracy: 0.7531 Epoch 7/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5678 - accuracy: 0.7418 - val_loss: 0.5352 - val_accuracy: 0.7524 Epoch 8/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5477 - accuracy: 0.7486 - val_loss: 0.5888 - val_accuracy: 0.7133 Epoch 9/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5500 - accuracy: 0.7467 - val_loss: 0.5536 - val_accuracy: 0.7469 Epoch 10/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5519 - accuracy: 0.7459 - val_loss: 0.5435 - val_accuracy: 0.7462 Epoch 11/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5524 - accuracy: 0.7479 - val_loss: 0.5832 - val_accuracy: 0.7267 Epoch 12/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5432 - accuracy: 0.7495 - val_loss: 0.5810 - val_accuracy: 0.7492 Epoch 13/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5422 - accuracy: 0.7517 - val_loss: 0.5275 - val_accuracy: 0.7518 Epoch 14/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5358 - accuracy: 0.7506 - val_loss: 0.5385 - val_accuracy: 0.7482 Epoch 15/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5415 - accuracy: 0.7493 - val_loss: 0.5355 - val_accuracy: 0.7508 Epoch 16/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5400 - accuracy: 0.7518 - val_loss: 0.5309 - val_accuracy: 0.7508 Epoch 17/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5347 - accuracy: 0.7518 - val_loss: 0.5275 - val_accuracy: 0.7534 Epoch 18/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5391 - accuracy: 0.7490 - val_loss: 0.5269 - val_accuracy: 0.7511 Epoch 19/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5326 - accuracy: 0.7514 - val_loss: 0.5265 - val_accuracy: 0.7521 Epoch 20/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5383 - accuracy: 0.7513 - val_loss: 0.5343 - val_accuracy: 0.7508 Epoch 21/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5405 - accuracy: 0.7499 - val_loss: 0.5330 - val_accuracy: 0.7489 Epoch 22/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5343 - accuracy: 0.7514 - val_loss: 0.5240 - val_accuracy: 0.7498 Epoch 23/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5407 - accuracy: 0.7519 - val_loss: 0.5386 - val_accuracy: 0.7515 Epoch 24/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5303 - accuracy: 0.7522 - val_loss: 0.5197 - val_accuracy: 0.7515 Epoch 25/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5308 - accuracy: 0.7510 - val_loss: 0.5403 - 
val_accuracy: 0.7436 Epoch 26/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5238 - accuracy: 0.7532 - val_loss: 0.5160 - val_accuracy: 0.7551 Epoch 27/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5288 - accuracy: 0.7534 - val_loss: 0.5376 - val_accuracy: 0.7495 Epoch 28/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5244 - accuracy: 0.7544 - val_loss: 0.5237 - val_accuracy: 0.7495 Epoch 29/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5192 - accuracy: 0.7554 - val_loss: 0.5187 - val_accuracy: 0.7511 Epoch 30/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5191 - accuracy: 0.7554 - val_loss: 0.5211 - val_accuracy: 0.7580 Epoch 31/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5151 - accuracy: 0.7588 - val_loss: 0.5259 - val_accuracy: 0.7518 Epoch 32/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5190 - accuracy: 0.7562 - val_loss: 0.5129 - val_accuracy: 0.7524 Epoch 33/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5170 - accuracy: 0.7564 - val_loss: 0.5074 - val_accuracy: 0.7547 Epoch 34/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5310 - accuracy: 0.7529 - val_loss: 0.5229 - val_accuracy: 0.7511 Epoch 35/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5171 - accuracy: 0.7540 - val_loss: 0.5370 - val_accuracy: 0.7498 Epoch 36/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5190 - accuracy: 0.7574 - val_loss: 0.5282 - val_accuracy: 0.7515 Epoch 37/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5115 - accuracy: 0.7587 - val_loss: 0.5492 - val_accuracy: 0.7567 Epoch 38/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5138 - accuracy: 0.7564 - val_loss: 0.5206 - val_accuracy: 0.7505 Epoch 39/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5147 - accuracy: 0.7542 - val_loss: 0.5091 - val_accuracy: 0.7518 Epoch 40/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5100 - accuracy: 0.7569 - val_loss: 0.5161 - val_accuracy: 0.7557 Epoch 41/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5103 - accuracy: 0.7572 - val_loss: 0.5286 - val_accuracy: 0.7515 Epoch 42/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5137 - accuracy: 0.7560 - val_loss: 0.5233 - val_accuracy: 0.7560 Epoch 43/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5094 - accuracy: 0.7563 - val_loss: 0.5135 - val_accuracy: 0.7551 Epoch 44/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5084 - accuracy: 0.7613 - val_loss: 0.5451 - val_accuracy: 0.7534 Epoch 45/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5105 - accuracy: 0.7564 - val_loss: 0.5131 - val_accuracy: 0.7622 Epoch 46/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5162 - accuracy: 0.7583 - val_loss: 0.5211 - val_accuracy: 0.7599 Epoch 47/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5063 - accuracy: 0.7604 - val_loss: 0.5137 - val_accuracy: 0.7554 Epoch 48/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5101 - accuracy: 0.7591 - val_loss: 0.5139 - val_accuracy: 0.7590 Epoch 49/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5102 - accuracy: 0.7591 - val_loss: 0.5346 - val_accuracy: 0.7531 Epoch 50/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5043 - accuracy: 0.7615 
- val_loss: 0.5104 - val_accuracy: 0.7609
#Plotting Train Loss vs Validation Loss
plt.plot(history1.history['loss'])
plt.plot(history1.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
As we increased the depth of the neural network and changed the optimizer to Adam, we can see smoother loss curves for both train and validation.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat1 = model1.predict(X_test)
# keep probabilities for the positive outcome only
yhat1 = yhat1[:, 0]
# calculate roc curves
fpr, tpr, thresholds1 = roc_curve(y_test, yhat1)
# calculate the g-mean for each threshold
gmeans1 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans1)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds1[ix], gmeans1[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.219626, G-Mean=0.686
Let's tune the threshold using ROC-AUC
There are many ways we could locate the threshold with the optimal balance between false positive and true positive rates.
Firstly, the true positive rate is called the Sensitivity, and the complement of the false positive rate (1 − False Positive Rate) is called the Specificity.
Sensitivity = True Positive / (True Positive + False Negative)
Specificity = True Negative / (False Positive + True Negative)
Where:
Sensitivity = True Positive Rate
Specificity = 1 – False Positive Rate
The Geometric Mean or G-Mean is a metric for imbalanced classification that, if optimized, will seek a balance between the sensitivity and the specificity.
G-Mean = sqrt(Sensitivity * Specificity)
One approach is to compute the G-Mean for each threshold returned by roc_curve() and select the threshold with the largest G-Mean value, which is what the code above does.
#Predicting the results using best as a threshold
y_pred_e1=model1.predict(X_test)
y_pred_e1 = (y_pred_e1 > thresholds1[ix])
y_pred_e1
120/120 [==============================] - 0s 2ms/step
array([[False], [ True], [False], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm1=confusion_matrix(y_test, y_pred_e1)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm1,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr=metrics.classification_report(y_test,y_pred_e1)
print(cr)
precision recall f1-score support 0 0.86 0.71 0.78 2877 1 0.43 0.66 0.52 955 accuracy 0.70 3832 macro avg 0.65 0.69 0.65 3832 weighted avg 0.76 0.70 0.71 3832
With the deeper network, the Adam optimizer, and the tuned decision threshold, the macro F1 score has increased considerably (from 0.43 to 0.65).
Now let's try the Batch Normalization technique and check whether we can increase the F1 score further.
Model 3¶
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
model2 = Sequential()
model2.add(Dense(128,activation='relu',input_dim = X_train.shape[1]))
model2.add(BatchNormalization())
model2.add(Dense(64,activation='relu',kernel_initializer='he_uniform'))
model2.add(BatchNormalization())
model2.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
model2.add(Dense(1, activation = 'sigmoid'))
model2.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 128) 1280 batch_normalization (Batch (None, 128) 512 Normalization) dense_1 (Dense) (None, 64) 8256 batch_normalization_1 (Bat (None, 64) 256 chNormalization) dense_2 (Dense) (None, 32) 2080 dense_3 (Dense) (None, 1) 33 ================================================================= Total params: 12417 (48.50 KB) Trainable params: 12033 (47.00 KB) Non-trainable params: 384 (1.50 KB) _________________________________________________________________
optimizer = tf.keras.optimizers.Adam(0.001)
model2.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_2 = model2.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50 192/192 [==============================] - 5s 7ms/step - loss: 0.5859 - accuracy: 0.7278 - val_loss: 0.5599 - val_accuracy: 0.7515 Epoch 2/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5484 - accuracy: 0.7472 - val_loss: 0.5417 - val_accuracy: 0.7531 Epoch 3/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5325 - accuracy: 0.7512 - val_loss: 0.5247 - val_accuracy: 0.7547 Epoch 4/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5189 - accuracy: 0.7558 - val_loss: 0.5643 - val_accuracy: 0.7114 Epoch 5/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5116 - accuracy: 0.7582 - val_loss: 0.5220 - val_accuracy: 0.7674 Epoch 6/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5080 - accuracy: 0.7573 - val_loss: 0.5059 - val_accuracy: 0.7629 Epoch 7/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5026 - accuracy: 0.7632 - val_loss: 0.5113 - val_accuracy: 0.7567 Epoch 8/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5020 - accuracy: 0.7648 - val_loss: 0.5041 - val_accuracy: 0.7635 Epoch 9/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5021 - accuracy: 0.7657 - val_loss: 0.5022 - val_accuracy: 0.7596 Epoch 10/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5006 - accuracy: 0.7656 - val_loss: 0.5543 - val_accuracy: 0.7401 Epoch 11/50 192/192 [==============================] - 1s 8ms/step - loss: 0.4989 - accuracy: 0.7658 - val_loss: 0.5143 - val_accuracy: 0.7632 Epoch 12/50 192/192 [==============================] - 1s 7ms/step - loss: 0.4979 - accuracy: 0.7644 - val_loss: 0.5328 - val_accuracy: 0.7560 Epoch 13/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4965 - accuracy: 0.7619 - val_loss: 0.5328 - val_accuracy: 0.7603 Epoch 14/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4977 - accuracy: 0.7656 - val_loss: 0.5061 - val_accuracy: 0.7655 Epoch 15/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4956 - accuracy: 0.7661 - val_loss: 0.5337 - val_accuracy: 0.7482 Epoch 16/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4934 - accuracy: 0.7670 - val_loss: 0.5103 - val_accuracy: 0.7652 Epoch 17/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4944 - accuracy: 0.7664 - val_loss: 0.5074 - val_accuracy: 0.7665 Epoch 18/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4911 - accuracy: 0.7693 - val_loss: 0.5158 - val_accuracy: 0.7603 Epoch 19/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4926 - accuracy: 0.7677 - val_loss: 0.5053 - val_accuracy: 0.7632 Epoch 20/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4915 - accuracy: 0.7691 - val_loss: 0.4974 - val_accuracy: 0.7665 Epoch 21/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4898 - accuracy: 0.7690 - val_loss: 0.4980 - val_accuracy: 0.7609 Epoch 22/50 192/192 [==============================] - 1s 8ms/step - loss: 0.4904 - accuracy: 0.7692 - val_loss: 0.5135 - val_accuracy: 0.7606 Epoch 23/50 192/192 [==============================] - 1s 7ms/step - loss: 0.4911 - accuracy: 0.7721 - val_loss: 0.5017 - val_accuracy: 0.7570 Epoch 24/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4884 - accuracy: 0.7734 - val_loss: 0.5114 - val_accuracy: 0.7688 Epoch 25/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4877 - accuracy: 0.7699 - val_loss: 0.4956 - 
val_accuracy: 0.7599 Epoch 26/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4873 - accuracy: 0.7739 - val_loss: 0.5418 - val_accuracy: 0.7505 Epoch 27/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4872 - accuracy: 0.7734 - val_loss: 0.4981 - val_accuracy: 0.7753 Epoch 28/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4869 - accuracy: 0.7697 - val_loss: 0.5835 - val_accuracy: 0.7202 Epoch 29/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4871 - accuracy: 0.7693 - val_loss: 0.5027 - val_accuracy: 0.7639 Epoch 30/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4867 - accuracy: 0.7723 - val_loss: 0.4992 - val_accuracy: 0.7671 Epoch 31/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4850 - accuracy: 0.7726 - val_loss: 0.5193 - val_accuracy: 0.7622 Epoch 32/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4856 - accuracy: 0.7723 - val_loss: 0.5019 - val_accuracy: 0.7661 Epoch 33/50 192/192 [==============================] - 1s 7ms/step - loss: 0.4859 - accuracy: 0.7724 - val_loss: 0.5027 - val_accuracy: 0.7668 Epoch 34/50 192/192 [==============================] - 1s 8ms/step - loss: 0.4840 - accuracy: 0.7741 - val_loss: 0.4995 - val_accuracy: 0.7678 Epoch 35/50 192/192 [==============================] - 1s 8ms/step - loss: 0.4844 - accuracy: 0.7708 - val_loss: 0.5129 - val_accuracy: 0.7590 Epoch 36/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4834 - accuracy: 0.7754 - val_loss: 0.5039 - val_accuracy: 0.7707 Epoch 37/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4820 - accuracy: 0.7732 - val_loss: 0.4988 - val_accuracy: 0.7655 Epoch 38/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4827 - accuracy: 0.7759 - val_loss: 0.5014 - val_accuracy: 0.7658 Epoch 39/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4800 - accuracy: 0.7768 - val_loss: 0.5364 - val_accuracy: 0.7564 Epoch 40/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4825 - accuracy: 0.7700 - val_loss: 0.5179 - val_accuracy: 0.7551 Epoch 41/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4829 - accuracy: 0.7739 - val_loss: 0.5087 - val_accuracy: 0.7639 Epoch 42/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4801 - accuracy: 0.7761 - val_loss: 0.5285 - val_accuracy: 0.7609 Epoch 43/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4816 - accuracy: 0.7749 - val_loss: 0.5055 - val_accuracy: 0.7648 Epoch 44/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4800 - accuracy: 0.7740 - val_loss: 0.5154 - val_accuracy: 0.7551 Epoch 45/50 192/192 [==============================] - 2s 8ms/step - loss: 0.4811 - accuracy: 0.7756 - val_loss: 0.5968 - val_accuracy: 0.6905 Epoch 46/50 192/192 [==============================] - 2s 8ms/step - loss: 0.4796 - accuracy: 0.7768 - val_loss: 0.5083 - val_accuracy: 0.7616 Epoch 47/50 192/192 [==============================] - 1s 7ms/step - loss: 0.4800 - accuracy: 0.7732 - val_loss: 0.5088 - val_accuracy: 0.7645 Epoch 48/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4795 - accuracy: 0.7750 - val_loss: 0.5253 - val_accuracy: 0.7541 Epoch 49/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4789 - accuracy: 0.7774 - val_loss: 0.5032 - val_accuracy: 0.7613 Epoch 50/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4776 - accuracy: 0.7794 
- val_loss: 0.5017 - val_accuracy: 0.7678
#Plotting Train Loss vs Validation Loss
plt.plot(history_2.history['loss'])
plt.plot(history_2.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
Unfortunately, the above plot shows a lot of noise in training, and the model seems to have overfitted on the training data, as there is a significant gap between the train and validation performance.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat2 = model2.predict(X_test)
# keep probabilities for the positive outcome only
yhat2 = yhat2[:, 0]
# calculate roc curves
fpr, tpr, thresholds2 = roc_curve(y_test, yhat2)
# calculate the g-mean for each threshold
gmeans2 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans2)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds2[ix], gmeans2[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.228691, G-Mean=0.698
y_pred_e2=model2.predict(X_test)
y_pred_e2 = (y_pred_e2 > thresholds2[ix])
y_pred_e2
120/120 [==============================] - 0s 2ms/step
array([[False], [ True], [False], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm2=confusion_matrix(y_test, y_pred_e2)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm2,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr2=metrics.classification_report(y_test,y_pred_e2)
print(cr2)
              precision    recall  f1-score   support

           0       0.87      0.75      0.81      2877
           1       0.47      0.65      0.54       955

    accuracy                           0.73      3832
   macro avg       0.67      0.70      0.67      3832
weighted avg       0.77      0.73      0.74      3832
The train and validation curves still seem to show overfitting, despite the good F1 score.
Let's try the Dropout technique and check whether it can reduce the False Negative rate.
Model 4¶
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
model3 = Sequential()
model3.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
model3.add(Dropout(0.2))
model3.add(Dense(128,activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(64,activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(32,activation='relu'))
model3.add(Dense(1, activation = 'sigmoid'))
model3.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 256) 2560 dropout (Dropout) (None, 256) 0 dense_1 (Dense) (None, 128) 32896 dropout_1 (Dropout) (None, 128) 0 dense_2 (Dense) (None, 64) 8256 dropout_2 (Dropout) (None, 64) 0 dense_3 (Dense) (None, 32) 2080 dense_4 (Dense) (None, 1) 33 ================================================================= Total params: 45825 (179.00 KB) Trainable params: 45825 (179.00 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
optimizer = tf.keras.optimizers.Adam(0.001)
model3.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_3 = model3.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50 192/192 [==============================] - 3s 5ms/step - loss: 0.6245 - accuracy: 0.7349 - val_loss: 0.5638 - val_accuracy: 0.7515 Epoch 2/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5763 - accuracy: 0.7480 - val_loss: 0.5716 - val_accuracy: 0.7515 Epoch 3/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5708 - accuracy: 0.7498 - val_loss: 0.5578 - val_accuracy: 0.7515 Epoch 4/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5607 - accuracy: 0.7504 - val_loss: 0.5542 - val_accuracy: 0.7515 Epoch 5/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5575 - accuracy: 0.7502 - val_loss: 0.5534 - val_accuracy: 0.7515 Epoch 6/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5561 - accuracy: 0.7504 - val_loss: 0.5482 - val_accuracy: 0.7515 Epoch 7/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5524 - accuracy: 0.7504 - val_loss: 0.5420 - val_accuracy: 0.7515 Epoch 8/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5511 - accuracy: 0.7503 - val_loss: 0.5492 - val_accuracy: 0.7515 Epoch 9/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5484 - accuracy: 0.7505 - val_loss: 0.5377 - val_accuracy: 0.7515 Epoch 10/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5443 - accuracy: 0.7504 - val_loss: 0.5390 - val_accuracy: 0.7515 Epoch 11/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5424 - accuracy: 0.7503 - val_loss: 0.5482 - val_accuracy: 0.7515 Epoch 12/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5418 - accuracy: 0.7502 - val_loss: 0.5393 - val_accuracy: 0.7515 Epoch 13/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5420 - accuracy: 0.7504 - val_loss: 0.5325 - val_accuracy: 0.7515 Epoch 14/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5402 - accuracy: 0.7503 - val_loss: 0.5395 - val_accuracy: 0.7515 Epoch 15/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5391 - accuracy: 0.7504 - val_loss: 0.5308 - val_accuracy: 0.7515 Epoch 16/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5372 - accuracy: 0.7504 - val_loss: 0.5257 - val_accuracy: 0.7515 Epoch 17/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5320 - accuracy: 0.7500 - val_loss: 0.5254 - val_accuracy: 0.7515 Epoch 18/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5316 - accuracy: 0.7510 - val_loss: 0.5213 - val_accuracy: 0.7515 Epoch 19/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5298 - accuracy: 0.7502 - val_loss: 0.5237 - val_accuracy: 0.7524 Epoch 20/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5276 - accuracy: 0.7526 - val_loss: 0.5205 - val_accuracy: 0.7528 Epoch 21/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5211 - accuracy: 0.7510 - val_loss: 0.5208 - val_accuracy: 0.7626 Epoch 22/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5205 - accuracy: 0.7529 - val_loss: 0.5123 - val_accuracy: 0.7554 Epoch 23/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5222 - accuracy: 0.7562 - val_loss: 0.5088 - val_accuracy: 0.7596 Epoch 24/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5188 - accuracy: 0.7563 - val_loss: 0.5043 - val_accuracy: 0.7661 Epoch 25/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5145 - accuracy: 0.7553 - val_loss: 0.5007 - 
val_accuracy: 0.7733 Epoch 26/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5114 - accuracy: 0.7570 - val_loss: 0.5030 - val_accuracy: 0.7671 Epoch 27/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5114 - accuracy: 0.7595 - val_loss: 0.4977 - val_accuracy: 0.7704 Epoch 28/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5125 - accuracy: 0.7598 - val_loss: 0.5018 - val_accuracy: 0.7720 Epoch 29/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5141 - accuracy: 0.7572 - val_loss: 0.5052 - val_accuracy: 0.7655 Epoch 30/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5084 - accuracy: 0.7604 - val_loss: 0.5040 - val_accuracy: 0.7658 Epoch 31/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5063 - accuracy: 0.7620 - val_loss: 0.5007 - val_accuracy: 0.7616 Epoch 32/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5063 - accuracy: 0.7602 - val_loss: 0.5087 - val_accuracy: 0.7674 Epoch 33/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5049 - accuracy: 0.7631 - val_loss: 0.5160 - val_accuracy: 0.7697 Epoch 34/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5077 - accuracy: 0.7615 - val_loss: 0.5058 - val_accuracy: 0.7658 Epoch 35/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5030 - accuracy: 0.7619 - val_loss: 0.5005 - val_accuracy: 0.7665 Epoch 36/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4993 - accuracy: 0.7631 - val_loss: 0.5014 - val_accuracy: 0.7694 Epoch 37/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5032 - accuracy: 0.7624 - val_loss: 0.5223 - val_accuracy: 0.7528 Epoch 38/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5012 - accuracy: 0.7617 - val_loss: 0.5056 - val_accuracy: 0.7652 Epoch 39/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5001 - accuracy: 0.7660 - val_loss: 0.4976 - val_accuracy: 0.7733 Epoch 40/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5006 - accuracy: 0.7635 - val_loss: 0.4955 - val_accuracy: 0.7753 Epoch 41/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4984 - accuracy: 0.7636 - val_loss: 0.5039 - val_accuracy: 0.7642 Epoch 42/50 192/192 [==============================] - 1s 4ms/step - loss: 0.4985 - accuracy: 0.7663 - val_loss: 0.4957 - val_accuracy: 0.7701 Epoch 43/50 192/192 [==============================] - 1s 4ms/step - loss: 0.4984 - accuracy: 0.7662 - val_loss: 0.4946 - val_accuracy: 0.7694 Epoch 44/50 192/192 [==============================] - 1s 4ms/step - loss: 0.4988 - accuracy: 0.7686 - val_loss: 0.5082 - val_accuracy: 0.7626 Epoch 45/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4995 - accuracy: 0.7651 - val_loss: 0.5048 - val_accuracy: 0.7629 Epoch 46/50 192/192 [==============================] - 1s 6ms/step - loss: 0.4959 - accuracy: 0.7673 - val_loss: 0.5058 - val_accuracy: 0.7717 Epoch 47/50 192/192 [==============================] - 1s 7ms/step - loss: 0.4969 - accuracy: 0.7680 - val_loss: 0.4990 - val_accuracy: 0.7668 Epoch 48/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4978 - accuracy: 0.7660 - val_loss: 0.4958 - val_accuracy: 0.7769 Epoch 49/50 192/192 [==============================] - 1s 4ms/step - loss: 0.4951 - accuracy: 0.7680 - val_loss: 0.5011 - val_accuracy: 0.7658 Epoch 50/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4949 - accuracy: 0.7676 
- val_loss: 0.4969 - val_accuracy: 0.7727
#Plotting Train Loss vs Validation Loss
plt.plot(history_3.history['loss'])
plt.plot(history_3.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
From the above plot, we observe that both the train and validation curves are smooth.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat3 = model3.predict(X_test)
# keep probabilities for the positive outcome only
yhat3 = yhat3[:, 0]
# calculate roc curves
fpr, tpr, thresholds3 = roc_curve(y_test, yhat3)
# calculate the g-mean for each threshold
gmeans3 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans3)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds3[ix], gmeans3[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.223736, G-Mean=0.699
y_pred_e3=model3.predict(X_test)
y_pred_e3 = (y_pred_e3 > thresholds3[ix])
y_pred_e3
120/120 [==============================] - 0s 2ms/step
array([[False], [ True], [False], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(y_test, y_pred_e3)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm3,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr3=metrics.classification_report(y_test,y_pred_e3)
print(cr3)
              precision    recall  f1-score   support

           0       0.87      0.72      0.79      2877
           1       0.45      0.67      0.54       955

    accuracy                           0.71      3832
   macro avg       0.66      0.70      0.66      3832
weighted avg       0.76      0.71      0.73      3832
The Dropout technique helped reduce the loss on both the training and validation sets. The F1 score also looks reasonable, with a decrease in the False Negative rate.
Now, let's apply some of the hyperparameter optimization techniques we have learnt, such as RandomizedSearchCV, GridSearchCV, and Keras Tuner, to try to increase the F1 score of the model.
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
def create_model_v4():
np.random.seed(1337)
model = Sequential()
model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
model.add(Dropout(0.3))
#model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.2))
#model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
#model.add(Dropout(0.3))
model.add(Dense(32,activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#compile model
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
return model
We are using Random Search to optimize two hyperparameters - Batch Size & Learning Rate.
You can also optimize other hyperparameters, as mentioned above; a sketch of how to expose additional ones to the search follows the results below.
keras_estimator = KerasClassifier(build_fn=create_model_v4, optimizer="Adam", verbose=1)
# define the grid search parameters
learn_rate = [0.01, 0.1, 0.001]
batch_size = [32, 64, 128]
param_random = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)
kfold_splits = 3
random= RandomizedSearchCV(estimator=keras_estimator,
verbose=1,
cv=kfold_splits,
param_distributions=param_random,n_jobs=-1)
random_result = random.fit(X_train, y_train,validation_split=0.2,verbose=1)
# Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))
means = random_result.cv_results_['mean_test_score']
stds = random_result.cv_results_['std_test_score']
params = random_result.cv_results_['params']
Fitting 3 folds for each of 9 candidates, totalling 27 fits 384/384 [==============================] - 5s 7ms/step - loss: 0.6308 - accuracy: 0.7337 - val_loss: 0.5739 - val_accuracy: 0.7515 Best: 0.750620 using {'optimizer__learning_rate': 0.01, 'batch_size': 32}
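As noted above, hyperparameters beyond the batch size and learning rate can also be searched. The sketch below is an illustration only (not an original cell): it exposes two hypothetical arguments of the model-building function, dropout_rate and n_units, to the search via scikeras' model__ parameter routing, and leaves the model uncompiled so that scikeras compiles it with the loss and optimizer given to the wrapper. X_train and y_train are the arrays prepared earlier in the notebook.
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def create_model_tunable(dropout_rate=0.2, n_units=128):
    # Returned uncompiled on purpose: scikeras compiles it with the wrapper's loss/optimizer
    model = Sequential()
    model.add(Dense(n_units, activation='relu', input_dim=X_train.shape[1]))
    model.add(Dropout(dropout_rate))
    model.add(Dense(n_units // 2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model

clf = KerasClassifier(model=create_model_tunable, loss='binary_crossentropy',
                      optimizer='adam', verbose=0)
param_distributions = {
    'batch_size': [32, 64, 128],
    'optimizer__learning_rate': [0.01, 0.001],
    'model__dropout_rate': [0.2, 0.3, 0.5],   # routed to create_model_tunable
    'model__n_units': [64, 128, 256],         # routed to create_model_tunable
}
search = RandomizedSearchCV(clf, param_distributions, n_iter=5, cv=3, n_jobs=-1)
# search.fit(X_train, y_train)   # uses the same training data as the search above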
estimator_v4=create_model_v4()
estimator_v4.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_5 (Dense) (None, 256) 2560 dropout_3 (Dropout) (None, 256) 0 dense_6 (Dense) (None, 128) 32896 dropout_4 (Dropout) (None, 128) 0 dense_7 (Dense) (None, 64) 8256 dropout_5 (Dropout) (None, 64) 0 dense_8 (Dense) (None, 32) 2080 dense_9 (Dense) (None, 1) 33 ================================================================= Total params: 45825 (179.00 KB) Trainable params: 45825 (179.00 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
optimizer = tf.keras.optimizers.Adam()
estimator_v4.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_4=estimator_v4.fit(X_train, y_train, epochs=50, batch_size = 64, verbose=1,validation_split=0.2)
Epoch 1/50 192/192 [==============================] - 2s 5ms/step - loss: 0.6414 - accuracy: 0.7316 - val_loss: 0.5798 - val_accuracy: 0.7515 Epoch 2/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5813 - accuracy: 0.7496 - val_loss: 0.5842 - val_accuracy: 0.7515 Epoch 3/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5700 - accuracy: 0.7494 - val_loss: 0.5637 - val_accuracy: 0.7515 Epoch 4/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5643 - accuracy: 0.7502 - val_loss: 0.5583 - val_accuracy: 0.7515 Epoch 5/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5619 - accuracy: 0.7503 - val_loss: 0.5593 - val_accuracy: 0.7515 Epoch 6/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5574 - accuracy: 0.7503 - val_loss: 0.5598 - val_accuracy: 0.7515 Epoch 7/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5536 - accuracy: 0.7502 - val_loss: 0.5454 - val_accuracy: 0.7515 Epoch 8/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5525 - accuracy: 0.7505 - val_loss: 0.5472 - val_accuracy: 0.7515 Epoch 9/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5515 - accuracy: 0.7503 - val_loss: 0.5404 - val_accuracy: 0.7515 Epoch 10/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5482 - accuracy: 0.7504 - val_loss: 0.5384 - val_accuracy: 0.7515 Epoch 11/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5468 - accuracy: 0.7504 - val_loss: 0.5470 - val_accuracy: 0.7515 Epoch 12/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5433 - accuracy: 0.7504 - val_loss: 0.5352 - val_accuracy: 0.7515 Epoch 13/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5435 - accuracy: 0.7504 - val_loss: 0.5317 - val_accuracy: 0.7515 Epoch 14/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5438 - accuracy: 0.7503 - val_loss: 0.5449 - val_accuracy: 0.7515 Epoch 15/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5418 - accuracy: 0.7504 - val_loss: 0.5403 - val_accuracy: 0.7515 Epoch 16/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5392 - accuracy: 0.7504 - val_loss: 0.5295 - val_accuracy: 0.7515 Epoch 17/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5390 - accuracy: 0.7504 - val_loss: 0.5292 - val_accuracy: 0.7515 Epoch 18/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5360 - accuracy: 0.7504 - val_loss: 0.5278 - val_accuracy: 0.7515 Epoch 19/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5350 - accuracy: 0.7504 - val_loss: 0.5269 - val_accuracy: 0.7515 Epoch 20/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5348 - accuracy: 0.7502 - val_loss: 0.5229 - val_accuracy: 0.7515 Epoch 21/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5288 - accuracy: 0.7498 - val_loss: 0.5393 - val_accuracy: 0.7596 Epoch 22/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5282 - accuracy: 0.7522 - val_loss: 0.5195 - val_accuracy: 0.7518 Epoch 23/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5259 - accuracy: 0.7497 - val_loss: 0.5098 - val_accuracy: 0.7521 Epoch 24/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5231 - accuracy: 0.7514 - val_loss: 0.5060 - val_accuracy: 0.7580 Epoch 25/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5193 - accuracy: 0.7552 - val_loss: 0.5147 - 
val_accuracy: 0.7697 Epoch 26/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5176 - accuracy: 0.7557 - val_loss: 0.5076 - val_accuracy: 0.7596 Epoch 27/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5193 - accuracy: 0.7567 - val_loss: 0.5014 - val_accuracy: 0.7710 Epoch 28/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5215 - accuracy: 0.7565 - val_loss: 0.5038 - val_accuracy: 0.7626 Epoch 29/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5191 - accuracy: 0.7560 - val_loss: 0.5132 - val_accuracy: 0.7652 Epoch 30/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5154 - accuracy: 0.7525 - val_loss: 0.5060 - val_accuracy: 0.7652 Epoch 31/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5131 - accuracy: 0.7586 - val_loss: 0.5011 - val_accuracy: 0.7674 Epoch 32/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5109 - accuracy: 0.7613 - val_loss: 0.5035 - val_accuracy: 0.7701 Epoch 33/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5121 - accuracy: 0.7578 - val_loss: 0.5050 - val_accuracy: 0.7727 Epoch 34/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5109 - accuracy: 0.7613 - val_loss: 0.5095 - val_accuracy: 0.7665 Epoch 35/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5073 - accuracy: 0.7644 - val_loss: 0.5054 - val_accuracy: 0.7720 Epoch 36/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5099 - accuracy: 0.7621 - val_loss: 0.4966 - val_accuracy: 0.7697 Epoch 37/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5069 - accuracy: 0.7591 - val_loss: 0.5191 - val_accuracy: 0.7606 Epoch 38/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5068 - accuracy: 0.7610 - val_loss: 0.4989 - val_accuracy: 0.7701 Epoch 39/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5089 - accuracy: 0.7621 - val_loss: 0.4977 - val_accuracy: 0.7717 Epoch 40/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5032 - accuracy: 0.7585 - val_loss: 0.5053 - val_accuracy: 0.7655 Epoch 41/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5037 - accuracy: 0.7615 - val_loss: 0.5049 - val_accuracy: 0.7616 Epoch 42/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5081 - accuracy: 0.7632 - val_loss: 0.5035 - val_accuracy: 0.7674 Epoch 43/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5054 - accuracy: 0.7633 - val_loss: 0.5020 - val_accuracy: 0.7671 Epoch 44/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5042 - accuracy: 0.7628 - val_loss: 0.4991 - val_accuracy: 0.7701 Epoch 45/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5029 - accuracy: 0.7636 - val_loss: 0.4999 - val_accuracy: 0.7599 Epoch 46/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5019 - accuracy: 0.7637 - val_loss: 0.4988 - val_accuracy: 0.7704 Epoch 47/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5039 - accuracy: 0.7629 - val_loss: 0.5096 - val_accuracy: 0.7648 Epoch 48/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5053 - accuracy: 0.7642 - val_loss: 0.4981 - val_accuracy: 0.7697 Epoch 49/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5000 - accuracy: 0.7649 - val_loss: 0.4930 - val_accuracy: 0.7733 Epoch 50/50 192/192 [==============================] - 1s 5ms/step - loss: 0.4996 - accuracy: 0.7664 
- val_loss: 0.4963 - val_accuracy: 0.7674
#Plotting Train Loss vs Validation Loss
plt.plot(history_4.history['loss'])
plt.plot(history_4.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
From the above plot, we observe that there is noise in the training behavior of the model.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat4 = estimator_v4.predict(X_test)
# keep probabilities for the positive outcome only
yhat4 = yhat4[:, 0]
# calculate roc curves
fpr, tpr, thresholds4 = roc_curve(y_test, yhat4)
# calculate the g-mean for each threshold
gmeans4 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans4)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds4[ix], gmeans4[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.235663, G-Mean=0.707
y_pred_e4=estimator_v4.predict(X_test)
y_pred_e4 = (y_pred_e4 > thresholds4[ix])
y_pred_e4
120/120 [==============================] - 0s 2ms/step
array([[False], [ True], [False], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm4=confusion_matrix(y_test, y_pred_e4)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm4,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr4=metrics.classification_report(y_test,y_pred_e4)
print(cr4)
              precision    recall  f1-score   support

           0       0.87      0.76      0.81      2877
           1       0.48      0.66      0.55       955

    accuracy                           0.73      3832
   macro avg       0.67      0.71      0.68      3832
weighted avg       0.77      0.73      0.75      3832
Hyperparameter tuning is used here to try to improve the F1 score, though the exact score may differ from run to run.
Other hyperparameters can also be tuned to get better performance on the metrics.
Because Randomized Search only samples hyperparameter combinations rather than trying them all, it can miss the best configuration.
Let's use the more exhaustive Grid Search CV and see whether the F1 score increases.
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
def create_model_v5():
np.random.seed(1337)
model = Sequential()
model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
model.add(Dropout(0.3))
#model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.2))
#model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
#model.add(Dropout(0.3))
model.add(Dense(32,activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#compile model
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
return model
We're using Grid Search to optimize two hyperparameters - Batch Size & Learning Rate.
You can also optimize the other hyperparameters as mentioned above.
keras_estimator = KerasClassifier(build_fn=create_model_v4, optimizer="Adam", verbose=1)
# define the grid search parameters
learn_rate = [0.01, 0.1, 0.001]
batch_size = [32, 64, 128]
param_grid = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)
kfold_splits = 3
grid = GridSearchCV(estimator=keras_estimator,
verbose=1,
cv=kfold_splits,
param_grid=param_grid,n_jobs=-1)
import time
# store starting time
begin = time.time()
grid_result = grid.fit(X_train, y_train,validation_split=0.2,verbose=1)
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
time.sleep(1)
# store end time
end = time.time()
# total time taken
print(f"Total runtime of the program is {end - begin}")
Fitting 3 folds for each of 9 candidates, totalling 27 fits 384/384 [==============================] - 4s 6ms/step - loss: 0.6308 - accuracy: 0.7337 - val_loss: 0.5739 - val_accuracy: 0.7515 Best: 0.750620 using {'batch_size': 32, 'optimizer__learning_rate': 0.01} Total runtime of the program is 93.2235975265503
The best model has the following configuration (it may vary each time the code runs):
Result of Grid Search
{'batch_size': 64, 'optimizer__learning_rate': 0.01}
Let's create the final model with the above-mentioned configuration.
estimator_v5=create_model_v5()
estimator_v5.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_5 (Dense) (None, 256) 2560 dropout_3 (Dropout) (None, 256) 0 dense_6 (Dense) (None, 128) 32896 dropout_4 (Dropout) (None, 128) 0 dense_7 (Dense) (None, 64) 8256 dropout_5 (Dropout) (None, 64) 0 dense_8 (Dense) (None, 32) 2080 dense_9 (Dense) (None, 1) 33 ================================================================= Total params: 45825 (179.00 KB) Trainable params: 45825 (179.00 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
optimizer = tf.keras.optimizers.Adam(grid_result.best_params_['optimizer__learning_rate'])
estimator_v5.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_5=estimator_v5.fit(X_train, y_train, epochs=50, batch_size = 64, verbose=1,validation_split=0.2)
Epoch 1/50 192/192 [==============================] - 3s 5ms/step - loss: 0.6093 - accuracy: 0.7413 - val_loss: 0.5598 - val_accuracy: 0.7515 Epoch 2/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5624 - accuracy: 0.7503 - val_loss: 0.5598 - val_accuracy: 0.7515 Epoch 3/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5588 - accuracy: 0.7502 - val_loss: 0.5516 - val_accuracy: 0.7515 Epoch 4/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5555 - accuracy: 0.7503 - val_loss: 0.5450 - val_accuracy: 0.7515 Epoch 5/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5544 - accuracy: 0.7504 - val_loss: 0.5481 - val_accuracy: 0.7515 Epoch 6/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5573 - accuracy: 0.7503 - val_loss: 0.5630 - val_accuracy: 0.7515 Epoch 7/50 192/192 [==============================] - 1s 8ms/step - loss: 0.5553 - accuracy: 0.7503 - val_loss: 0.5488 - val_accuracy: 0.7515 Epoch 8/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5531 - accuracy: 0.7500 - val_loss: 0.5421 - val_accuracy: 0.7515 Epoch 9/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5478 - accuracy: 0.7495 - val_loss: 0.5449 - val_accuracy: 0.7515 Epoch 10/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5483 - accuracy: 0.7491 - val_loss: 0.5484 - val_accuracy: 0.7515 Epoch 11/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5496 - accuracy: 0.7506 - val_loss: 0.5499 - val_accuracy: 0.7515 Epoch 12/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5501 - accuracy: 0.7501 - val_loss: 0.5415 - val_accuracy: 0.7515 Epoch 13/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5483 - accuracy: 0.7495 - val_loss: 0.5361 - val_accuracy: 0.7515 Epoch 14/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5471 - accuracy: 0.7498 - val_loss: 0.5452 - val_accuracy: 0.7515 Epoch 15/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5487 - accuracy: 0.7503 - val_loss: 0.5406 - val_accuracy: 0.7518 Epoch 16/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5485 - accuracy: 0.7503 - val_loss: 0.5495 - val_accuracy: 0.7515 Epoch 17/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5453 - accuracy: 0.7503 - val_loss: 0.5353 - val_accuracy: 0.7515 Epoch 18/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5458 - accuracy: 0.7497 - val_loss: 0.5419 - val_accuracy: 0.7515 Epoch 19/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5484 - accuracy: 0.7495 - val_loss: 0.5386 - val_accuracy: 0.7515 Epoch 20/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5446 - accuracy: 0.7500 - val_loss: 0.5338 - val_accuracy: 0.7515 Epoch 21/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5430 - accuracy: 0.7497 - val_loss: 0.5317 - val_accuracy: 0.7515 Epoch 22/50 192/192 [==============================] - 1s 8ms/step - loss: 0.5456 - accuracy: 0.7507 - val_loss: 0.5344 - val_accuracy: 0.7515 Epoch 23/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5446 - accuracy: 0.7500 - val_loss: 0.5378 - val_accuracy: 0.7515 Epoch 24/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5451 - accuracy: 0.7498 - val_loss: 0.5320 - val_accuracy: 0.7515 Epoch 25/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5484 - accuracy: 0.7498 - val_loss: 0.5266 - 
val_accuracy: 0.7515 Epoch 26/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5421 - accuracy: 0.7489 - val_loss: 0.5257 - val_accuracy: 0.7515 Epoch 27/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5457 - accuracy: 0.7494 - val_loss: 0.5309 - val_accuracy: 0.7515 Epoch 28/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5501 - accuracy: 0.7500 - val_loss: 0.5359 - val_accuracy: 0.7515 Epoch 29/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5475 - accuracy: 0.7489 - val_loss: 0.5359 - val_accuracy: 0.7518 Epoch 30/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5420 - accuracy: 0.7499 - val_loss: 0.5281 - val_accuracy: 0.7515 Epoch 31/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5421 - accuracy: 0.7482 - val_loss: 0.5216 - val_accuracy: 0.7515 Epoch 32/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5416 - accuracy: 0.7507 - val_loss: 0.5475 - val_accuracy: 0.7515 Epoch 33/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5408 - accuracy: 0.7498 - val_loss: 0.5469 - val_accuracy: 0.7515 Epoch 34/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5472 - accuracy: 0.7492 - val_loss: 0.5219 - val_accuracy: 0.7511 Epoch 35/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5396 - accuracy: 0.7500 - val_loss: 0.5275 - val_accuracy: 0.7515 Epoch 36/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5470 - accuracy: 0.7507 - val_loss: 0.5505 - val_accuracy: 0.7515 Epoch 37/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5502 - accuracy: 0.7488 - val_loss: 0.5500 - val_accuracy: 0.7515 Epoch 38/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5497 - accuracy: 0.7505 - val_loss: 0.5362 - val_accuracy: 0.7518 Epoch 39/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5456 - accuracy: 0.7509 - val_loss: 0.5466 - val_accuracy: 0.7515 Epoch 40/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5469 - accuracy: 0.7515 - val_loss: 0.5229 - val_accuracy: 0.7515 Epoch 41/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5383 - accuracy: 0.7498 - val_loss: 0.5477 - val_accuracy: 0.7515 Epoch 42/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5475 - accuracy: 0.7510 - val_loss: 0.5384 - val_accuracy: 0.7511 Epoch 43/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5478 - accuracy: 0.7507 - val_loss: 0.5323 - val_accuracy: 0.7515 Epoch 44/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5454 - accuracy: 0.7511 - val_loss: 0.5328 - val_accuracy: 0.7515 Epoch 45/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5389 - accuracy: 0.7512 - val_loss: 0.5344 - val_accuracy: 0.7521 Epoch 46/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5471 - accuracy: 0.7502 - val_loss: 0.5398 - val_accuracy: 0.7511 Epoch 47/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5480 - accuracy: 0.7506 - val_loss: 0.5403 - val_accuracy: 0.7505 Epoch 48/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5425 - accuracy: 0.7501 - val_loss: 0.5454 - val_accuracy: 0.7515 Epoch 49/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5452 - accuracy: 0.7503 - val_loss: 0.5334 - val_accuracy: 0.7515 Epoch 50/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5436 - accuracy: 0.7517 
- val_loss: 0.5489 - val_accuracy: 0.7515
#Plotting Train Loss vs Validation Loss
plt.plot(history_5.history['loss'])
plt.plot(history_5.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
From the above plot, we observe that both the train and validation curves are smooth.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat5 = estimator_v5.predict(X_test)
# keep probabilities for the positive outcome only
yhat5 = yhat5[:, 0]
# calculate roc curves
fpr, tpr, thresholds5 = roc_curve(y_test, yhat5)
# calculate the g-mean for each threshold
gmeans5 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans5)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds5[ix], gmeans5[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.258756, G-Mean=0.509
y_pred_e5=estimator_v5.predict(X_test)
y_pred_e5 = (y_pred_e5 > thresholds5[ix])
y_pred_e5
120/120 [==============================] - 0s 2ms/step
array([[ True], [ True], [ True], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm5=confusion_matrix(y_test, y_pred_e5)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm5,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr5=metrics.classification_report(y_test,y_pred_e5)
print(cr5)
              precision    recall  f1-score   support

           0       0.77      0.70      0.73      2877
           1       0.29      0.37      0.32       955

    accuracy                           0.62      3832
   macro avg       0.53      0.53      0.53      3832
weighted avg       0.65      0.62      0.63      3832
Hyperparameter tuning with Grid Search has been used here to try to get a better F1 score, but the score might differ each time.
Other hyperparameters can also be tuned to get better metrics.
In this run, however, the F1 score of this model is noticeably lower than both the Randomized Search model and Model 4 (the Dropout model).
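To make the comparison concrete, the short sketch below (an addition, not an original cell) recomputes the macro F1 score of each set of thresholded predictions produced above; the labels in the dictionary are my own shorthand for the corresponding model variables.
from sklearn.metrics import f1_score

predictions = {
    'model1 (deeper network)': y_pred_e1,
    'model2 (BatchNorm)': y_pred_e2,
    'model3 (Dropout)': y_pred_e3,
    'estimator_v4 (Randomized Search)': y_pred_e4,
    'estimator_v5 (Grid Search)': y_pred_e5,
}
for name, preds in predictions.items():
    # ravel() flattens the (n, 1) boolean arrays produced by the cells above
    print(f"{name}: macro F1 = {f1_score(y_test, preds.ravel(), average='macro'):.3f}")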
Dask¶
- There is also another library called Dask, sometimes used in the industry to provide a performance boost to Hyperparameter Tuning due to its parallelized computing procedure.
- Dask also has the option of implementing Grid Search similar to the Grid Search in Scikit-learn.
You may install the Dask library from the Anaconda prompt using the command below:
- !pip install dask-ml --user
# Try below code to install dask in Google Colab
!pip install dask-ml
Requirement already satisfied: dask-ml in /usr/local/lib/python3.10/dist-packages (2023.3.24) Requirement already satisfied: dask[array,dataframe]>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (2023.8.1) Requirement already satisfied: distributed>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (2023.8.1) Requirement already satisfied: numba>=0.51.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (0.58.1) Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.23.5) Requirement already satisfied: pandas>=0.24.2 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.5.3) Requirement already satisfied: scikit-learn>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.2.2) Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.11.4) Requirement already satisfied: dask-glm>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (0.3.2) Requirement already satisfied: multipledispatch>=0.4.9 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.0.0) Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from dask-ml) (23.2) Requirement already satisfied: cloudpickle>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from dask-glm>=0.2.0->dask-ml) (2.2.1) Requirement already satisfied: sparse>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from dask-glm>=0.2.0->dask-ml) (0.15.1) Requirement already satisfied: click>=8.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (8.1.7) Requirement already satisfied: fsspec>=2021.09.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (2023.6.0) Requirement already satisfied: partd>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (1.4.1) Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (6.0.1) Requirement already satisfied: toolz>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (0.12.0) Requirement already satisfied: importlib-metadata>=4.13.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (7.0.1) Requirement already satisfied: jinja2>=2.10.3 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (3.1.3) Requirement already satisfied: locket>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (1.0.0) Requirement already satisfied: msgpack>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (1.0.7) Requirement already satisfied: psutil>=5.7.2 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (5.9.5) Requirement already satisfied: sortedcontainers>=2.0.5 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (2.4.0) Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (3.0.0) Requirement already satisfied: tornado>=6.0.4 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (6.3.2) Requirement already satisfied: urllib3>=1.24.3 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (2.0.7) Requirement already satisfied: zict>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (3.0.0) 
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.0->dask-ml) (0.41.1) Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.2->dask-ml) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.2->dask-ml) (2023.3.post1) Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.2.0->dask-ml) (1.3.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.2.0->dask-ml) (3.2.0) Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata>=4.13.0->dask[array,dataframe]>=2.4.0->dask-ml) (3.17.0) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2>=2.10.3->distributed>=2.4.0->dask-ml) (2.1.3) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas>=0.24.2->dask-ml) (1.16.0)
# importing library
from dask_ml.model_selection import GridSearchCV as DaskGridSearchCV
If you encounter an error while importing Dask, try running the cell again.
- Dask's grid search works the same way as the regular Grid Search.
- We just have to change the class from GridSearchCV to DaskGridSearchCV.
def create_model_v6():
np.random.seed(1337)
model = Sequential()
model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
model.add(Dropout(0.3))
#model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.2))
#model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
#model.add(Dropout(0.3))
model.add(Dense(32,activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#compile model
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
return model
keras_estimator = KerasClassifier(build_fn=create_model_v6, verbose=1)
# define the grid search parameters
learn_rate = [0.01, 0.1, 0.001]
batch_size = [32, 64, 128]
param_grid = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)
kfold_splits = 3
dask = DaskGridSearchCV(estimator=keras_estimator,
cv=kfold_splits,
param_grid=param_grid,n_jobs=-1)
import time
# store starting time
begin = time.time()
dask_result = dask.fit(X_train, y_train,validation_split=0.2,verbose=1)
# Summarize results
print("Best: %f using %s" % (dask_result.best_score_, dask_result.best_params_))
means = dask_result.cv_results_['mean_test_score']
stds = dask_result.cv_results_['std_test_score']
params = dask_result.cv_results_['params']
time.sleep(1)
# store end time
end = time.time()
# total time taken
print(f"Total runtime of the program is {end - begin}")
256/256 [==============================] - 7s 14ms/step - loss: 0.6593 - accuracy: 0.7196 - val_loss: 0.5841 - val_accuracy: 0.7529 256/256 [==============================] - 7s 14ms/step - loss: 0.7176 - accuracy: 0.7145 - val_loss: 0.5892 - val_accuracy: 0.7529 160/160 [==============================] - 1s 3ms/step 160/160 [==============================] - 1s 3ms/step 256/256 [==============================] - 5s 9ms/step - loss: 0.7053 - accuracy: 0.7071 - val_loss: 0.6040 - val_accuracy: 0.7534 256/256 [==============================] - 5s 8ms/step - loss: 0.6678 - accuracy: 0.7171 - val_loss: 0.5782 - val_accuracy: 0.7529 160/160 [==============================] - 0s 2ms/step 160/160 [==============================] - 1s 4ms/step 256/256 [==============================] - 6s 9ms/step - loss: 0.6752 - accuracy: 0.7170 - val_loss: 0.5914 - val_accuracy: 0.7529 160/160 [==============================] - 1s 3ms/step 256/256 [==============================] - 7s 10ms/step - loss: 0.6436 - accuracy: 0.7315 - val_loss: 0.5745 - val_accuracy: 0.7534 160/160 [==============================] - 1s 3ms/step 256/256 [==============================] - 5s 11ms/step - loss: 0.6765 - accuracy: 0.7112 - val_loss: 0.5897 - val_accuracy: 0.7529 160/160 [==============================] - 0s 2ms/step 256/256 [==============================] - 5s 10ms/step - loss: 0.6626 - accuracy: 0.7225 - val_loss: 0.6091 - val_accuracy: 0.7529 160/160 [==============================] - 1s 6ms/step 256/256 [==============================] - 5s 10ms/step - loss: 0.7261 - accuracy: 0.7096 - val_loss: 0.5780 - val_accuracy: 0.7534 160/160 [==============================] - 0s 3ms/step 128/128 [==============================] - 5s 10ms/step - loss: 0.7253 - accuracy: 0.6993 - val_loss: 0.5844 - val_accuracy: 0.7529 128/128 [==============================] - 5s 11ms/step - loss: 0.6936 - accuracy: 0.7100 - val_loss: 0.6266 - val_accuracy: 0.7485 80/80 [==============================] - 0s 4ms/step 80/80 [==============================] - 0s 2ms/step 128/128 [==============================] - 4s 11ms/step - loss: 0.6966 - accuracy: 0.7060 - val_loss: 0.5896 - val_accuracy: 0.7534 128/128 [==============================] - 4s 11ms/step - loss: 0.7000 - accuracy: 0.7131 - val_loss: 0.5938 - val_accuracy: 0.7529 80/80 [==============================] - 0s 2ms/step 80/80 [==============================] - 0s 4ms/step 128/128 [==============================] - 5s 11ms/step - loss: 0.7100 - accuracy: 0.7161 - val_loss: 0.5920 - val_accuracy: 0.7529 80/80 [==============================] - 0s 3ms/step 128/128 [==============================] - 5s 10ms/step - loss: 0.6674 - accuracy: 0.7191 - val_loss: 0.6174 - val_accuracy: 0.7534 80/80 [==============================] - 0s 4ms/step 128/128 [==============================] - 4s 9ms/step - loss: 0.7987 - accuracy: 0.6962 - val_loss: 0.5932 - val_accuracy: 0.7529 80/80 [==============================] - 0s 3ms/step 128/128 [==============================] - 4s 9ms/step - loss: 0.7365 - accuracy: 0.7010 - val_loss: 0.5929 - val_accuracy: 0.7529 80/80 [==============================] - 0s 5ms/step 128/128 [==============================] - 4s 13ms/step - loss: 0.6770 - accuracy: 0.7163 - val_loss: 0.6244 - val_accuracy: 0.7534 64/64 [==============================] - 4s 19ms/step - loss: 0.7037 - accuracy: 0.7044 - val_loss: 0.6116 - val_accuracy: 0.7529 80/80 [==============================] - 1s 4ms/step 40/40 [==============================] - 1s 6ms/step 64/64 
[==============================] - 4s 13ms/step - loss: 0.7867 - accuracy: 0.6869 - val_loss: 0.5894 - val_accuracy: 0.7529 64/64 [==============================] - 4s 14ms/step - loss: 0.8034 - accuracy: 0.6915 - val_loss: 0.6499 - val_accuracy: 0.6825 40/40 [==============================] - 0s 2ms/step 40/40 [==============================] - 0s 3ms/step 64/64 [==============================] - 3s 13ms/step - loss: 0.6962 - accuracy: 0.7050 - val_loss: 0.5872 - val_accuracy: 0.7529 40/40 [==============================] - 0s 4ms/step 64/64 [==============================] - 3s 11ms/step - loss: 1.1132 - accuracy: 0.6487 - val_loss: 0.5882 - val_accuracy: 0.7529 40/40 [==============================] - 1s 19ms/step 64/64 [==============================] - 3s 15ms/step - loss: 0.6810 - accuracy: 0.7094 - val_loss: 0.6400 - val_accuracy: 0.7534 40/40 [==============================] - 0s 5ms/step 64/64 [==============================] - 4s 18ms/step - loss: 0.6977 - accuracy: 0.7139 - val_loss: 0.6101 - val_accuracy: 0.7529 40/40 [==============================] - 0s 4ms/step 64/64 [==============================] - 4s 12ms/step - loss: 0.8207 - accuracy: 0.6831 - val_loss: 0.5936 - val_accuracy: 0.7529 40/40 [==============================] - 0s 3ms/step 64/64 [==============================] - 4s 11ms/step - loss: 0.7796 - accuracy: 0.6939 - val_loss: 0.6233 - val_accuracy: 0.7534 40/40 [==============================] - 0s 2ms/step 384/384 [==============================] - 3s 5ms/step - loss: 0.6440 - accuracy: 0.7283 - val_loss: 0.5697 - val_accuracy: 0.7515 Best: 0.750620 using {'batch_size': 32, 'optimizer__learning_rate': 0.01} Total runtime of the program is 86.2575147151947
Unfortunately, Dask took more time to run the model than Grid Search CV. This is because Dask needs certain conditions to perform well:
- The dataset should be large.
- The number and range of hyperparameters being tuned should be large; that is where Dask's parallel computation shows a significant improvement.
Since both the dataset and the hyperparameter search space were small in this example, Dask could not show a significant improvement.
We can also use another optimization technique - Keras Tuner.
## Install Keras Tuner
!pip install keras-tuner
Requirement already satisfied: keras-tuner in /usr/local/lib/python3.10/dist-packages (1.4.6) Requirement already satisfied: keras in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (2.15.0) Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (23.2) Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (2.31.0) Requirement already satisfied: kt-legacy in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (1.0.5) Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (3.6) Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (2.0.7) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (2023.11.17)
Keras Tuner¶
from tensorflow import keras
from tensorflow.keras import layers
from kerastuner.tuners import RandomSearch
# Clear any previous Keras state and fix the random seeds for reproducibility
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
Hyperparameters
- How many hidden layers should the model have?
- How many neurons should the model have in each hidden layer?
- What learning rate should the optimizer use?
def build_model(h):
model = keras.Sequential()
for i in range(h.Int('num_layers', 2, 10)):
model.add(layers.Dense(units=h.Int('units_' + str(i),
min_value=32,
max_value=256,
step=32),
activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(
optimizer=keras.optimizers.Adam(
h.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
loss='binary_crossentropy',
metrics=['accuracy'])
return model
Initialize a tuner (here, RandomSearch). The objective argument specifies the metric used to select the best models, max_trials sets the number of different hyperparameter configurations to try, and executions_per_trial controls how many times each configuration is trained so the scores can be averaged.
tuner = RandomSearch(
build_model,
objective='val_accuracy',
max_trials=5,
executions_per_trial=3,
project_name='Job_')
Reloading Tuner from ./Job_/tuner0.json
tuner.search_space_summary()
Search space summary Default search space size: 12 num_layers (Int) {'default': None, 'conditions': [], 'min_value': 2, 'max_value': 10, 'step': 1, 'sampling': 'linear'} units_0 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_1 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} learning_rate (Choice) {'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True} units_2 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_3 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_4 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_5 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_6 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_7 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_8 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_9 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
### Searching for the best model on X_train and y_train
tuner.search(X_train, y_train,
epochs=5,
validation_split = 0.2)
## Printing the best models with their hyperparameters
tuner.results_summary()
Results summary Results in ./Job_ Showing 10 best trials Objective(name="val_accuracy", direction="max") Trial 3 summary Hyperparameters: num_layers: 5 units_0: 32 units_1: 64 learning_rate: 0.01 units_2: 96 units_3: 256 units_4: 256 units_5: 160 units_6: 192 units_7: 224 units_8: 224 Score: 0.7516851425170898 Trial 0 summary Hyperparameters: num_layers: 9 units_0: 224 units_1: 96 learning_rate: 0.001 units_2: 32 units_3: 32 units_4: 32 units_5: 32 units_6: 32 units_7: 32 units_8: 32 Score: 0.7514677047729492 Trial 2 summary Hyperparameters: num_layers: 9 units_0: 192 units_1: 64 learning_rate: 0.001 units_2: 160 units_3: 32 units_4: 224 units_5: 32 units_6: 256 units_7: 96 units_8: 192 Score: 0.7514677047729492 Trial 1 summary Hyperparameters: num_layers: 5 units_0: 160 units_1: 160 learning_rate: 0.001 units_2: 224 units_3: 128 units_4: 224 units_5: 64 units_6: 160 units_7: 64 units_8: 32 Score: 0.7514677047729492 Trial 4 summary Hyperparameters: num_layers: 10 units_0: 128 units_1: 32 learning_rate: 0.0001 units_2: 160 units_3: 160 units_4: 160 units_5: 224 units_6: 96 units_7: 128 units_8: 96 units_9: 32 Score: 0.7514677047729492
Model 7¶
- Let's create a model using one of the best configurations suggested by Keras Tuner above.
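Note that instead of hard-coding the layer sizes, the winning configuration can also be pulled from the tuner programmatically; a minimal sketch using the tuner object created above:
# Retrieve the best hyperparameters found by RandomSearch and rebuild a model from them
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print(best_hp.values)                          # e.g. num_layers, units_0, ..., learning_rate
best_model = tuner.hypermodel.build(best_hp)   # untrained model with the best configuration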
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
model7 = Sequential()
model7.add(Dense(160,activation='relu',kernel_initializer='he_uniform',input_dim = X_train.shape[1]))
model7.add(Dense(160,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(1, activation = 'sigmoid'))
model7.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 160) 1600 dense_1 (Dense) (None, 160) 25760 dense_2 (Dense) (None, 224) 36064 dense_3 (Dense) (None, 128) 28800 dense_4 (Dense) (None, 224) 28896 dense_5 (Dense) (None, 1) 225 ================================================================= Total params: 121345 (474.00 KB) Trainable params: 121345 (474.00 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
optimizer = tf.keras.optimizers.Adam(0.001)
model7.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_7 = model7.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50 192/192 [==============================] - 3s 6ms/step - loss: 1.1225 - accuracy: 0.7018 - val_loss: 0.5539 - val_accuracy: 0.7495 Epoch 2/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5978 - accuracy: 0.7343 - val_loss: 0.5578 - val_accuracy: 0.7479 Epoch 3/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5746 - accuracy: 0.7434 - val_loss: 0.5542 - val_accuracy: 0.7515 Epoch 4/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5630 - accuracy: 0.7464 - val_loss: 0.5432 - val_accuracy: 0.7521 Epoch 5/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5655 - accuracy: 0.7437 - val_loss: 0.5516 - val_accuracy: 0.7515 Epoch 6/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5569 - accuracy: 0.7482 - val_loss: 0.5395 - val_accuracy: 0.7508 Epoch 7/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5542 - accuracy: 0.7462 - val_loss: 0.5374 - val_accuracy: 0.7511 Epoch 8/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5495 - accuracy: 0.7503 - val_loss: 0.5600 - val_accuracy: 0.7518 Epoch 9/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5518 - accuracy: 0.7497 - val_loss: 0.5385 - val_accuracy: 0.7502 Epoch 10/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5465 - accuracy: 0.7485 - val_loss: 0.5408 - val_accuracy: 0.7511 Epoch 11/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5534 - accuracy: 0.7467 - val_loss: 0.5846 - val_accuracy: 0.7257 Epoch 12/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5443 - accuracy: 0.7502 - val_loss: 0.5672 - val_accuracy: 0.7495 Epoch 13/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5455 - accuracy: 0.7524 - val_loss: 0.5340 - val_accuracy: 0.7508 Epoch 14/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5418 - accuracy: 0.7502 - val_loss: 0.5534 - val_accuracy: 0.7518 Epoch 15/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5444 - accuracy: 0.7498 - val_loss: 0.5480 - val_accuracy: 0.7515 Epoch 16/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5409 - accuracy: 0.7516 - val_loss: 0.5424 - val_accuracy: 0.7492 Epoch 17/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5373 - accuracy: 0.7511 - val_loss: 0.5501 - val_accuracy: 0.7515 Epoch 18/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5430 - accuracy: 0.7508 - val_loss: 0.5453 - val_accuracy: 0.7508 Epoch 19/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5356 - accuracy: 0.7524 - val_loss: 0.5365 - val_accuracy: 0.7511 Epoch 20/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5399 - accuracy: 0.7502 - val_loss: 0.5406 - val_accuracy: 0.7502 Epoch 21/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5378 - accuracy: 0.7511 - val_loss: 0.5323 - val_accuracy: 0.7544 Epoch 22/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5331 - accuracy: 0.7510 - val_loss: 0.5280 - val_accuracy: 0.7524 Epoch 23/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5387 - accuracy: 0.7498 - val_loss: 0.5293 - val_accuracy: 0.7502 Epoch 24/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5315 - accuracy: 0.7530 - val_loss: 0.5293 - val_accuracy: 0.7606 Epoch 25/50 192/192 [==============================] - 1s 8ms/step - loss: 0.5264 - accuracy: 0.7508 - val_loss: 0.5237 - 
val_accuracy: 0.7518 Epoch 26/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5262 - accuracy: 0.7502 - val_loss: 0.5344 - val_accuracy: 0.7495 Epoch 27/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5237 - accuracy: 0.7520 - val_loss: 0.5626 - val_accuracy: 0.7534 Epoch 28/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5270 - accuracy: 0.7498 - val_loss: 0.5218 - val_accuracy: 0.7508 Epoch 29/50 192/192 [==============================] - 1s 4ms/step - loss: 0.5219 - accuracy: 0.7526 - val_loss: 0.5211 - val_accuracy: 0.7518 Epoch 30/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5203 - accuracy: 0.7524 - val_loss: 0.5198 - val_accuracy: 0.7570 Epoch 31/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5185 - accuracy: 0.7560 - val_loss: 0.5277 - val_accuracy: 0.7573 Epoch 32/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5156 - accuracy: 0.7548 - val_loss: 0.5132 - val_accuracy: 0.7590 Epoch 33/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5194 - accuracy: 0.7573 - val_loss: 0.5290 - val_accuracy: 0.7541 Epoch 34/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5177 - accuracy: 0.7535 - val_loss: 0.5235 - val_accuracy: 0.7554 Epoch 35/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5102 - accuracy: 0.7560 - val_loss: 0.5377 - val_accuracy: 0.7599 Epoch 36/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5104 - accuracy: 0.7610 - val_loss: 0.5215 - val_accuracy: 0.7573 Epoch 37/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5112 - accuracy: 0.7524 - val_loss: 0.5206 - val_accuracy: 0.7590 Epoch 38/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5109 - accuracy: 0.7588 - val_loss: 0.5187 - val_accuracy: 0.7580 Epoch 39/50 192/192 [==============================] - 1s 7ms/step - loss: 0.5106 - accuracy: 0.7578 - val_loss: 0.5129 - val_accuracy: 0.7551 Epoch 40/50 192/192 [==============================] - 1s 6ms/step - loss: 0.5055 - accuracy: 0.7591 - val_loss: 0.5241 - val_accuracy: 0.7590 Epoch 41/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5065 - accuracy: 0.7595 - val_loss: 0.5358 - val_accuracy: 0.7551 Epoch 42/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5111 - accuracy: 0.7577 - val_loss: 0.5263 - val_accuracy: 0.7570 Epoch 43/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5047 - accuracy: 0.7623 - val_loss: 0.5209 - val_accuracy: 0.7590 Epoch 44/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5019 - accuracy: 0.7630 - val_loss: 0.5187 - val_accuracy: 0.7652 Epoch 45/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5043 - accuracy: 0.7619 - val_loss: 0.5164 - val_accuracy: 0.7603 Epoch 46/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5062 - accuracy: 0.7605 - val_loss: 0.5308 - val_accuracy: 0.7567 Epoch 47/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5040 - accuracy: 0.7617 - val_loss: 0.5312 - val_accuracy: 0.7648 Epoch 48/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5015 - accuracy: 0.7634 - val_loss: 0.5178 - val_accuracy: 0.7665 Epoch 49/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5001 - accuracy: 0.7630 - val_loss: 0.5166 - val_accuracy: 0.7632 Epoch 50/50 192/192 [==============================] - 1s 5ms/step - loss: 0.5025 - accuracy: 0.7604 
- val_loss: 0.5249 - val_accuracy: 0.7599
#Plotting Train Loss vs Validation Loss
plt.plot(history_7.history['loss'])
plt.plot(history_7.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
From the above plot, we observe that the train and validation curves are smooth.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat7 = model7.predict(X_test)
# keep probabilities for the positive outcome only
yhat7 = yhat7[:, 0]
# calculate roc curves
fpr, tpr, thresholds7 = roc_curve(y_test, yhat7)
# calculate the g-mean for each threshold
gmeans7 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans7)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds7[ix], gmeans7[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.232952, G-Mean=0.678
y_pred_e7=model7.predict(X_test)
y_pred_e7 = (y_pred_e7 > thresholds7[ix])
y_pred_e7
120/120 [==============================] - 0s 2ms/step
array([[False], [ True], [False], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm7=confusion_matrix(y_test, y_pred_e7)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm7,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr7=metrics.classification_report(y_test,y_pred_e7)
print(cr7)
precision recall f1-score support 0 0.86 0.70 0.77 2877 1 0.42 0.66 0.51 955 accuracy 0.69 3832 macro avg 0.64 0.68 0.64 3832 weighted avg 0.75 0.69 0.71 3832
After using the hyperparameters suggested by Keras Tuner, the F1 score has increased slightly, but the False Negative rate is higher than in the model built with the previous optimization technique.
Further, you can add Batch Normalization and Dropout to the model and check the F1 score.
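A minimal sketch of such a variant is shown below (the layer sizes are illustrative and follow model7; BatchNormalization and Dropout are inserted after the hidden layers):
from tensorflow.keras.layers import BatchNormalization, Dropout
model_bn = Sequential()
model_bn.add(Dense(160, activation='relu', kernel_initializer='he_uniform', input_dim=X_train.shape[1]))
model_bn.add(BatchNormalization())   # normalize the activations of the first hidden layer
model_bn.add(Dropout(0.3))           # randomly drop 30% of the units to regularize
model_bn.add(Dense(160, activation='relu', kernel_initializer='he_uniform'))
model_bn.add(BatchNormalization())
model_bn.add(Dropout(0.3))
model_bn.add(Dense(1, activation='sigmoid'))
model_bn.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(0.001), metrics=['accuracy'])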
- Let's try to apply SMOTE to balance this dataset and then apply hyperparameter tuning accordingly.
SMOTE + Keras Tuner¶
##Applying SMOTE on the training data
from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='not majority')
X_sm , y_sm = smote.fit_resample(X_train,y_train)
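To confirm that SMOTE balanced the classes, the class counts before and after resampling can be compared, for example:
# Quick check of the class balance before and after SMOTE
print('Before SMOTE:', np.bincount(np.asarray(y_train).astype(int)))
print('After SMOTE :', np.bincount(np.asarray(y_sm).astype(int)))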
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
def build_model_2(h):
model = keras.Sequential()
for i in range(h.Int('num_layers', 2, 10)):
model.add(layers.Dense(units=h.Int('units_' + str(i),
min_value=32,
max_value=256,
step=32),
activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(
optimizer=keras.optimizers.Adam(
h.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
loss='binary_crossentropy',
metrics=['accuracy'])
return model
tuner_2 = RandomSearch(
build_model_2,
objective='val_accuracy',
max_trials=5,
executions_per_trial=3,
project_name='Job_Switch')
Reloading Tuner from ./Job_Switch/tuner0.json
tuner_2.search_space_summary()
Search space summary Default search space size: 12 num_layers (Int) {'default': None, 'conditions': [], 'min_value': 2, 'max_value': 10, 'step': 1, 'sampling': 'linear'} units_0 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_1 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} learning_rate (Choice) {'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True} units_2 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_3 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_4 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_5 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_6 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_7 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_8 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'} units_9 (Int) {'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
tuner_2.search(X_sm, y_sm,
epochs=5,
validation_split = 0.2)
tuner_2.results_summary()
Results summary Results in ./Job_Switch Showing 10 best trials Objective(name="val_accuracy", direction="max") Trial 1 summary Hyperparameters: num_layers: 5 units_0: 160 units_1: 160 learning_rate: 0.001 units_2: 224 units_3: 128 units_4: 224 units_5: 64 units_6: 160 units_7: 64 units_8: 32 Score: 0.3924380640188853 Trial 2 summary Hyperparameters: num_layers: 9 units_0: 192 units_1: 64 learning_rate: 0.001 units_2: 160 units_3: 32 units_4: 224 units_5: 32 units_6: 256 units_7: 96 units_8: 192 Score: 0.3745472927888234 Trial 0 summary Hyperparameters: num_layers: 9 units_0: 224 units_1: 96 learning_rate: 0.001 units_2: 32 units_3: 32 units_4: 32 units_5: 32 units_6: 32 units_7: 32 units_8: 32 Score: 0.3596986730893453 Trial 4 summary Hyperparameters: num_layers: 10 units_0: 128 units_1: 32 learning_rate: 0.0001 units_2: 160 units_3: 160 units_4: 160 units_5: 224 units_6: 96 units_7: 128 units_8: 96 units_9: 32 Score: 0.27350427707036334 Trial 3 summary Hyperparameters: num_layers: 5 units_0: 32 units_1: 64 learning_rate: 0.01 units_2: 96 units_3: 256 units_4: 256 units_5: 160 units_6: 192 units_7: 224 units_8: 224 Score: 0.14906562368075052
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
model9 = Sequential()
model9.add(Dense(160,activation='relu',kernel_initializer='he_uniform',input_dim = X_train.shape[1]))
model9.add(Dense(160,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(1, activation = 'sigmoid'))
#Compiling the ANN with Adam optimizer and binary cross entropy loss function
optimizer = tf.keras.optimizers.Adam(0.001)
model9.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
model9.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 160) 1600 dense_1 (Dense) (None, 160) 25760 dense_2 (Dense) (None, 224) 36064 dense_3 (Dense) (None, 128) 28800 dense_4 (Dense) (None, 224) 28896 dense_5 (Dense) (None, 1) 225 ================================================================= Total params: 121345 (474.00 KB) Trainable params: 121345 (474.00 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
history_9 = model9.fit(X_sm,y_sm,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50 288/288 [==============================] - 4s 6ms/step - loss: 0.9404 - accuracy: 0.5938 - val_loss: 1.1380 - val_accuracy: 0.0674 Epoch 2/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6742 - accuracy: 0.6132 - val_loss: 0.8120 - val_accuracy: 0.3644 Epoch 3/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6514 - accuracy: 0.6346 - val_loss: 1.2521 - val_accuracy: 0.1171 Epoch 4/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6498 - accuracy: 0.6296 - val_loss: 0.6781 - val_accuracy: 0.6395 Epoch 5/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6404 - accuracy: 0.6382 - val_loss: 0.9168 - val_accuracy: 0.1921 Epoch 6/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6399 - accuracy: 0.6413 - val_loss: 0.9899 - val_accuracy: 0.1199 Epoch 7/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6361 - accuracy: 0.6459 - val_loss: 0.8910 - val_accuracy: 0.2925 Epoch 8/50 288/288 [==============================] - 2s 7ms/step - loss: 0.6282 - accuracy: 0.6488 - val_loss: 0.9776 - val_accuracy: 0.2119 Epoch 9/50 288/288 [==============================] - 2s 7ms/step - loss: 0.6317 - accuracy: 0.6475 - val_loss: 0.8264 - val_accuracy: 0.3086 Epoch 10/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6275 - accuracy: 0.6549 - val_loss: 0.9303 - val_accuracy: 0.2193 Epoch 11/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6230 - accuracy: 0.6574 - val_loss: 0.7799 - val_accuracy: 0.3679 Epoch 12/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6210 - accuracy: 0.6600 - val_loss: 0.6063 - val_accuracy: 0.7942 Epoch 13/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6181 - accuracy: 0.6626 - val_loss: 0.6872 - val_accuracy: 0.4748 Epoch 14/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6195 - accuracy: 0.6632 - val_loss: 0.9843 - val_accuracy: 0.2436 Epoch 15/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6132 - accuracy: 0.6656 - val_loss: 0.7729 - val_accuracy: 0.4411 Epoch 16/50 288/288 [==============================] - 1s 4ms/step - loss: 0.6304 - accuracy: 0.6586 - val_loss: 0.9492 - val_accuracy: 0.2338 Epoch 17/50 288/288 [==============================] - 2s 6ms/step - loss: 0.6123 - accuracy: 0.6666 - val_loss: 0.7792 - val_accuracy: 0.3329 Epoch 18/50 288/288 [==============================] - 2s 7ms/step - loss: 0.6102 - accuracy: 0.6684 - val_loss: 1.0104 - val_accuracy: 0.2262 Epoch 19/50 288/288 [==============================] - 2s 5ms/step - loss: 0.6018 - accuracy: 0.6793 - val_loss: 0.7396 - val_accuracy: 0.5758 Epoch 20/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6002 - accuracy: 0.6821 - val_loss: 0.9011 - val_accuracy: 0.3568 Epoch 21/50 288/288 [==============================] - 1s 5ms/step - loss: 0.6021 - accuracy: 0.6786 - val_loss: 0.6261 - val_accuracy: 0.7223 Epoch 22/50 288/288 [==============================] - 1s 4ms/step - loss: 0.5972 - accuracy: 0.6861 - val_loss: 0.9532 - val_accuracy: 0.3179 Epoch 23/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5929 - accuracy: 0.6904 - val_loss: 0.7367 - val_accuracy: 0.5619 Epoch 24/50 288/288 [==============================] - 1s 4ms/step - loss: 0.5914 - accuracy: 0.6868 - val_loss: 0.9272 - val_accuracy: 0.3314 Epoch 25/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5906 - accuracy: 0.6908 - val_loss: 1.1307 - 
val_accuracy: 0.2442 Epoch 26/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5920 - accuracy: 0.6919 - val_loss: 0.7521 - val_accuracy: 0.4837 Epoch 27/50 288/288 [==============================] - 2s 6ms/step - loss: 0.5881 - accuracy: 0.6935 - val_loss: 1.0727 - val_accuracy: 0.1841 Epoch 28/50 288/288 [==============================] - 2s 7ms/step - loss: 0.5877 - accuracy: 0.6917 - val_loss: 0.9807 - val_accuracy: 0.3096 Epoch 29/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5836 - accuracy: 0.6987 - val_loss: 1.0255 - val_accuracy: 0.2618 Epoch 30/50 288/288 [==============================] - 1s 4ms/step - loss: 0.5837 - accuracy: 0.6971 - val_loss: 0.7151 - val_accuracy: 0.4959 Epoch 31/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5816 - accuracy: 0.6995 - val_loss: 0.5556 - val_accuracy: 0.7555 Epoch 32/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5758 - accuracy: 0.7031 - val_loss: 0.6548 - val_accuracy: 0.6026 Epoch 33/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5773 - accuracy: 0.7029 - val_loss: 0.6953 - val_accuracy: 0.6189 Epoch 34/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5780 - accuracy: 0.7068 - val_loss: 0.7301 - val_accuracy: 0.6295 Epoch 35/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5760 - accuracy: 0.7056 - val_loss: 1.1270 - val_accuracy: 0.1925 Epoch 36/50 288/288 [==============================] - 2s 6ms/step - loss: 0.5762 - accuracy: 0.6980 - val_loss: 0.6591 - val_accuracy: 0.6202 Epoch 37/50 288/288 [==============================] - 2s 7ms/step - loss: 0.5766 - accuracy: 0.7053 - val_loss: 0.6824 - val_accuracy: 0.5556 Epoch 38/50 288/288 [==============================] - 2s 6ms/step - loss: 0.5735 - accuracy: 0.7020 - val_loss: 0.8294 - val_accuracy: 0.4641 Epoch 39/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5755 - accuracy: 0.7031 - val_loss: 0.7203 - val_accuracy: 0.5737 Epoch 40/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5697 - accuracy: 0.7089 - val_loss: 0.9359 - val_accuracy: 0.3507 Epoch 41/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5700 - accuracy: 0.7080 - val_loss: 0.7763 - val_accuracy: 0.5791 Epoch 42/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5635 - accuracy: 0.7146 - val_loss: 0.6702 - val_accuracy: 0.6415 Epoch 43/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5684 - accuracy: 0.7126 - val_loss: 0.8945 - val_accuracy: 0.4557 Epoch 44/50 288/288 [==============================] - 1s 4ms/step - loss: 0.5673 - accuracy: 0.7131 - val_loss: 0.7436 - val_accuracy: 0.5259 Epoch 45/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5649 - accuracy: 0.7097 - val_loss: 0.9512 - val_accuracy: 0.3316 Epoch 46/50 288/288 [==============================] - 2s 7ms/step - loss: 0.5626 - accuracy: 0.7160 - val_loss: 0.6197 - val_accuracy: 0.6491 Epoch 47/50 288/288 [==============================] - 2s 7ms/step - loss: 0.5670 - accuracy: 0.7099 - val_loss: 0.6066 - val_accuracy: 0.6632 Epoch 48/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5607 - accuracy: 0.7162 - val_loss: 0.7814 - val_accuracy: 0.4872 Epoch 49/50 288/288 [==============================] - 1s 5ms/step - loss: 0.5579 - accuracy: 0.7186 - val_loss: 0.6836 - val_accuracy: 0.6347 Epoch 50/50 288/288 [==============================] - 1s 4ms/step - loss: 0.5643 - accuracy: 0.7142 
- val_loss: 0.7058 - val_accuracy: 0.6452
#Plotting Train Loss vs Validation Loss
plt.plot(history_9.history['loss'])
plt.plot(history_9.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
From the above plot, we observe that the validation loss is very noisy, fluctuating heavily from epoch to epoch.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat9 = model9.predict(X_test)
# keep probabilities for the positive outcome only
yhat9 = yhat9[:, 0]
# calculate roc curves
fpr, tpr, thresholds9 = roc_curve(y_test, yhat9)
# calculate the g-mean for each threshold
gmeans9 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans9)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds9[ix], gmeans9[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.442253, G-Mean=0.679
y_pred_e9=model9.predict(X_test)
y_pred_e9 = (y_pred_e9 > thresholds9[ix])
y_pred_e9
120/120 [==============================] - 0s 2ms/step
array([[False], [ True], [False], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm9=confusion_matrix(y_test, y_pred_e9)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm9,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr9=metrics.classification_report(y_test,y_pred_e9)
print(cr9)
precision recall f1-score support 0 0.86 0.72 0.78 2877 1 0.43 0.64 0.51 955 accuracy 0.70 3832 macro avg 0.64 0.68 0.65 3832 weighted avg 0.75 0.70 0.71 3832
After applying the SMOTE technique to the data, the F1 score increased and the False Negative rate decreased; however, the train and validation loss curves suggest the model has overfit.
Let's use Grid Search CV and see if we can increase the model's performance on the metrics.
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
def create_model_v7():
np.random.seed(1337)
model = Sequential()
model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
model.add(Dropout(0.3))
#model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.2))
#model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
#model.add(Dropout(0.3))
model.add(Dense(32,activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#compile model
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
return model
keras_estimator = KerasClassifier(build_fn=create_model_v7, verbose=1)
# define the grid search parameters
batch_size = [32, 64, 128]
learn_rate = [0.001, 0.01, 0.1]
param_grid = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)
kfold_splits = 3
grid = GridSearchCV(estimator=keras_estimator,
verbose=1,
cv=kfold_splits,
param_grid=param_grid,n_jobs=-1)
grid_result = grid.fit(X_train, y_train,validation_split=0.2,verbose=1)
Fitting 3 folds for each of 9 candidates, totalling 27 fits 384/384 [==============================] - 3s 5ms/step - loss: 0.6308 - accuracy: 0.7337 - val_loss: 0.5739 - val_accuracy: 0.7515
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
Best: 0.750620 using {'batch_size': 32, 'optimizer__learning_rate': 0.01}
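The means, stds, and params collected above can also be listed per hyperparameter combination, for example:
# Print the mean and standard deviation of the CV score for each parameter combination
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))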
estimator_v7=create_model_v7()
estimator_v7.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_5 (Dense) (None, 256) 2560 dropout_3 (Dropout) (None, 256) 0 dense_6 (Dense) (None, 128) 32896 dropout_4 (Dropout) (None, 128) 0 dense_7 (Dense) (None, 64) 8256 dropout_5 (Dropout) (None, 64) 0 dense_8 (Dense) (None, 32) 2080 dense_9 (Dense) (None, 1) 33 ================================================================= Total params: 45825 (179.00 KB) Trainable params: 45825 (179.00 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
optimizer = tf.keras.optimizers.Adam()
estimator_v7.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_10 = estimator_v7.fit(X_sm, y_sm, epochs=50, batch_size=grid_result.best_params_['batch_size'], verbose=1, validation_split=0.2)
Epoch 1/50 576/576 [==============================] - 4s 5ms/step - loss: 0.7003 - accuracy: 0.6014 - val_loss: 0.8608 - val_accuracy: 0.0000e+00 Epoch 2/50 576/576 [==============================] - 4s 6ms/step - loss: 0.6665 - accuracy: 0.6224 - val_loss: 0.9427 - val_accuracy: 0.0000e+00 Epoch 3/50 576/576 [==============================] - 3s 5ms/step - loss: 0.6595 - accuracy: 0.6238 - val_loss: 0.8662 - val_accuracy: 0.0000e+00 Epoch 4/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6542 - accuracy: 0.6246 - val_loss: 0.8600 - val_accuracy: 0.0000e+00 Epoch 5/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6477 - accuracy: 0.6252 - val_loss: 0.8651 - val_accuracy: 0.0000e+00 Epoch 6/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6400 - accuracy: 0.6319 - val_loss: 0.9245 - val_accuracy: 0.1034 Epoch 7/50 576/576 [==============================] - 4s 7ms/step - loss: 0.6341 - accuracy: 0.6440 - val_loss: 0.8075 - val_accuracy: 0.2605 Epoch 8/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6329 - accuracy: 0.6465 - val_loss: 0.7977 - val_accuracy: 0.4218 Epoch 9/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6272 - accuracy: 0.6545 - val_loss: 0.8649 - val_accuracy: 0.3044 Epoch 10/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6224 - accuracy: 0.6621 - val_loss: 0.8300 - val_accuracy: 0.3629 Epoch 11/50 576/576 [==============================] - 3s 5ms/step - loss: 0.6196 - accuracy: 0.6655 - val_loss: 0.7618 - val_accuracy: 0.4270 Epoch 12/50 576/576 [==============================] - 4s 6ms/step - loss: 0.6144 - accuracy: 0.6707 - val_loss: 0.6803 - val_accuracy: 0.5374 Epoch 13/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6110 - accuracy: 0.6766 - val_loss: 0.6796 - val_accuracy: 0.6047 Epoch 14/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6085 - accuracy: 0.6791 - val_loss: 0.9325 - val_accuracy: 0.2827 Epoch 15/50 576/576 [==============================] - 3s 4ms/step - loss: 0.6046 - accuracy: 0.6838 - val_loss: 0.7305 - val_accuracy: 0.5113 Epoch 16/50 576/576 [==============================] - 3s 5ms/step - loss: 0.6021 - accuracy: 0.6843 - val_loss: 0.8137 - val_accuracy: 0.4405 Epoch 17/50 576/576 [==============================] - 3s 6ms/step - loss: 0.6018 - accuracy: 0.6842 - val_loss: 0.8529 - val_accuracy: 0.3822 Epoch 18/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5989 - accuracy: 0.6880 - val_loss: 0.8450 - val_accuracy: 0.4500 Epoch 19/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5967 - accuracy: 0.6895 - val_loss: 0.6954 - val_accuracy: 0.5828 Epoch 20/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5925 - accuracy: 0.6931 - val_loss: 0.7040 - val_accuracy: 0.5808 Epoch 21/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5937 - accuracy: 0.6934 - val_loss: 0.6715 - val_accuracy: 0.6241 Epoch 22/50 576/576 [==============================] - 3s 6ms/step - loss: 0.5930 - accuracy: 0.6937 - val_loss: 0.8507 - val_accuracy: 0.3870 Epoch 23/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5895 - accuracy: 0.6967 - val_loss: 0.7462 - val_accuracy: 0.5545 Epoch 24/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5889 - accuracy: 0.6962 - val_loss: 0.7266 - val_accuracy: 0.5487 Epoch 25/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5888 - accuracy: 0.6978 - 
val_loss: 0.7798 - val_accuracy: 0.5248 Epoch 26/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5889 - accuracy: 0.6979 - val_loss: 0.7571 - val_accuracy: 0.5109 Epoch 27/50 576/576 [==============================] - 3s 6ms/step - loss: 0.5880 - accuracy: 0.6981 - val_loss: 0.8523 - val_accuracy: 0.4072 Epoch 28/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5867 - accuracy: 0.6971 - val_loss: 0.7692 - val_accuracy: 0.5695 Epoch 29/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5845 - accuracy: 0.7052 - val_loss: 0.7931 - val_accuracy: 0.4913 Epoch 30/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5841 - accuracy: 0.7034 - val_loss: 0.6327 - val_accuracy: 0.6391 Epoch 31/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5852 - accuracy: 0.7017 - val_loss: 0.6510 - val_accuracy: 0.6295 Epoch 32/50 576/576 [==============================] - 3s 6ms/step - loss: 0.5851 - accuracy: 0.7000 - val_loss: 0.6746 - val_accuracy: 0.6269 Epoch 33/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5842 - accuracy: 0.6992 - val_loss: 0.6463 - val_accuracy: 0.6812 Epoch 34/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5829 - accuracy: 0.7013 - val_loss: 0.7065 - val_accuracy: 0.5945 Epoch 35/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5833 - accuracy: 0.7020 - val_loss: 0.7968 - val_accuracy: 0.4894 Epoch 36/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5828 - accuracy: 0.7027 - val_loss: 0.6168 - val_accuracy: 0.7051 Epoch 37/50 576/576 [==============================] - 3s 6ms/step - loss: 0.5832 - accuracy: 0.7010 - val_loss: 0.7796 - val_accuracy: 0.4426 Epoch 38/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5818 - accuracy: 0.7042 - val_loss: 0.6748 - val_accuracy: 0.6393 Epoch 39/50 576/576 [==============================] - 2s 4ms/step - loss: 0.5801 - accuracy: 0.7029 - val_loss: 0.7662 - val_accuracy: 0.5482 Epoch 40/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5815 - accuracy: 0.7038 - val_loss: 0.7658 - val_accuracy: 0.5522 Epoch 41/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5806 - accuracy: 0.7015 - val_loss: 0.7803 - val_accuracy: 0.5778 Epoch 42/50 576/576 [==============================] - 3s 6ms/step - loss: 0.5796 - accuracy: 0.7015 - val_loss: 0.7343 - val_accuracy: 0.5732 Epoch 43/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5778 - accuracy: 0.7046 - val_loss: 0.8265 - val_accuracy: 0.5133 Epoch 44/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5774 - accuracy: 0.7065 - val_loss: 0.7490 - val_accuracy: 0.5900 Epoch 45/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5772 - accuracy: 0.7049 - val_loss: 0.8295 - val_accuracy: 0.4846 Epoch 46/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5795 - accuracy: 0.7057 - val_loss: 0.6819 - val_accuracy: 0.6132 Epoch 47/50 576/576 [==============================] - 4s 6ms/step - loss: 0.5789 - accuracy: 0.7056 - val_loss: 0.6177 - val_accuracy: 0.7112 Epoch 48/50 576/576 [==============================] - 3s 5ms/step - loss: 0.5775 - accuracy: 0.7069 - val_loss: 0.7920 - val_accuracy: 0.4863 Epoch 49/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5777 - accuracy: 0.7054 - val_loss: 0.7217 - val_accuracy: 0.6236 Epoch 50/50 576/576 [==============================] - 3s 4ms/step - loss: 0.5767 
- accuracy: 0.7060 - val_loss: 0.6357 - val_accuracy: 0.6814
#Plotting Train Loss vs Validation Loss
plt.plot(history_10.history['loss'])
plt.plot(history_10.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
From the above plot, we observe that the validation loss is again very noisy.
Grid Search CV also does not seem to work well on the SMOTE-resampled data.
from sklearn.metrics import roc_curve
from matplotlib import pyplot
# predict probabilities
yhat10 = estimator_v7.predict(X_test)
# keep probabilities for the positive outcome only
yhat10 = yhat10[:, 0]
# calculate roc curves
fpr, tpr, thresholds10 = roc_curve(y_test, yhat10)
# calculate the g-mean for each threshold
gmeans10 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans10)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds10[ix], gmeans10[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step Best Threshold=0.497505, G-Mean=0.685
y_pred_e10=estimator_v7.predict(X_test)
y_pred_e10 = (y_pred_e10 > thresholds10[ix])
y_pred_e10
120/120 [==============================] - 0s 2ms/step
array([[False], [ True], [False], ..., [False], [False], [False]])
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm10=confusion_matrix(y_test, y_pred_e10)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm10,
group_names=labels,
categories=categories,
cmap='Blues')
#Accuracy as per the classification report
from sklearn import metrics
cr10=metrics.classification_report(y_test,y_pred_e10)
print(cr10)
precision recall f1-score support 0 0.86 0.72 0.79 2877 1 0.44 0.65 0.52 955 accuracy 0.71 3832 macro avg 0.65 0.69 0.65 3832 weighted avg 0.76 0.71 0.72 3832
Oversampling using SMOTE did not help improve the F1 score.
In this dataset, the SMOTE oversampling technique does not work well, as both models built on the resampled data overfit the training set.
So, our final model here can be Model 4, which uses the Dropout regularization technique and is trained on the original (imbalanced) dataset.
Suggested Areas of Improvement¶
Build any one classical Machine Learning model and use it to obtain the feature importance of the variables; the most important features can then inform the neural network model (a quick sketch follows below).
You can also try better feature engineering, for example by transforming heavily skewed variables if required.
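For instance, a tree-based model gives a quick ranking of the variables; a minimal sketch, assuming X_train and y_train are the same arrays used above (column names are taken from X_train if it is a DataFrame):
# Hedged sketch: rank the features with a Random Forest and inspect the top variables
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=pd.DataFrame(X_train).columns)
print(importances.sort_values(ascending=False).head(10))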
Business Recommendations¶
- The HR department of the company can deploy the final model from this exercise to identify, with a reasonable degree of accuracy, whether an employee is likely to switch jobs, which is easier and more time-efficient than other methods.
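If the final model is deployed, it can be persisted and reloaded with Keras's standard serialization; a minimal sketch, assuming the chosen network is available as a variable named model4 (the "Model 4" built earlier in the notebook) and using an illustrative file name:
# Hedged sketch: save the chosen network and reload it to score new candidates
model4.save('attrition_model.keras')                          # saves architecture + weights
loaded = tf.keras.models.load_model('attrition_model.keras')
probs = loaded.predict(X_test)                                # probability of looking for a job change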
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/Copy of FDS_Project_LearnerNotebook_FullCode.ipynb"