Classification and Hypothesis Testing Practice Project: Travel Package Purchase Prediction¶
Context¶
You are a Data Scientist for a tourism company named "Visit with Us". The Policy Maker of the company wants to establish a viable business model to expand the customer base. A viable business model is a central concept that helps you understand the existing ways of doing business and how those ways can be changed for the benefit of the tourism sector.
One of the ways to expand the customer base is to introduce a new offering of packages. Currently, the company offers 5 types of packages - Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that 18% of customers purchased a package. However, identifying potential customers was difficult because customers were contacted at random, without using the available information.
The company is now planning to launch a new product, the Wellness Tourism Package. Wellness Tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and support or increase one's sense of well-being. This time, the company wants to harness the available data on existing and potential customers to target the right customers.
As a Data Scientist at "Visit with Us", you have to analyze the customers' data and information to provide recommendations to the Policy Maker and build a model to predict which potential customers are going to purchase the newly introduced travel package. The model will be built to make predictions before a customer is contacted.
Objective¶
To build a model to predict which customer is potentially going to purchase the newly introduced travel package.
Data Description¶
- CustomerID: Unique customer ID
- ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
- Age: Age of customer
- TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
- CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3. It's the city the customer lives in.
- DurationOfPitch: Duration of the pitch by a salesperson to the customer
- Occupation: Occupation of customer
- Gender: Gender of customer
- NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
- NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
- ProductPitched: Product pitched by the salesperson
- PreferredPropertyStar: Preferred hotel property rating by customer
- MaritalStatus: Marital status of customer
- NumberOfTrips: Average number of trips in a year by customer
- Passport: The customer has a passport or not (0: No, 1: Yes)
- PitchSatisfactionScore: Sales pitch satisfaction score
- OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
- NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
- Designation: Designation of the customer in the current organization
- MonthlyIncome: Gross monthly income of the customer
Importing the libraries required¶
# Importing the basic libraries we will require for the project
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Importing the Machine Learning models we require from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Importing the other functions we may require from scikit-learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
# To get different metric scores
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, precision_recall_curve, roc_curve, make_scorer
# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')
# Connect to Google
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loading the dataset¶
# Loading the dataset - the sheet_name parameter is used if there are multiple tabs in the Excel file.
data=pd.read_excel("/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Practice_Project_-_Travel_Package_Purchase_Prediction/Tourism.xlsx",sheet_name='Tourism')
Overview of the dataset¶
View the first and last 5 rows of the dataset¶
Let's view the first few rows and last few rows of the dataset in order to understand its structure a little better.
We will use the head() and tail() methods from Pandas to do this.
data.head()
CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
data.tail()
CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
Understand the shape of the dataset¶
data.shape
(4888, 20)
- The dataset has 4888 rows and 20 columns.
Check the data types of the columns for the dataset¶
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null int64 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null object 4 CityTier 4888 non-null int64 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null object 7 Gender 4888 non-null object 8 NumberOfPersonVisiting 4888 non-null int64 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null object 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null object 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null int64 15 PitchSatisfactionScore 4888 non-null int64 16 OwnCar 4888 non-null int64 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null object 19 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(7), object(6) memory usage: 763.9+ KB
- We can see that 8 columns have fewer than 4,888 non-null values, i.e., these columns have missing values.
Check the percentage of missing values in each column¶
pd.DataFrame(data={'% of Missing Values':round(data.isna().sum()/data.isna().count()*100,2)}).sort_values(by='% of Missing Values',ascending=False)
% of Missing Values | |
---|---|
DurationOfPitch | 5.14 |
MonthlyIncome | 4.77 |
Age | 4.62 |
NumberOfTrips | 2.86 |
NumberOfChildrenVisiting | 1.35 |
NumberOfFollowups | 0.92 |
PreferredPropertyStar | 0.53 |
TypeofContact | 0.51 |
Designation | 0.00 |
OwnCar | 0.00 |
PitchSatisfactionScore | 0.00 |
Passport | 0.00 |
CustomerID | 0.00 |
MaritalStatus | 0.00 |
ProdTaken | 0.00 |
NumberOfPersonVisiting | 0.00 |
Gender | 0.00 |
Occupation | 0.00 |
CityTier | 0.00 |
ProductPitched | 0.00 |
- The DurationOfPitch column has 5.14% missing values out of the total observations.
- The MonthlyIncome column has 4.77% missing values out of the total observations.
- The Age column has 4.62% missing values out of the total observations.
- The NumberOfTrips column has 2.86% missing values out of the total observations.
- The NumberOfChildrenVisiting column has 1.35% missing values out of the total observations.
- The NumberOfFollowups column has 0.92% missing values out of the total observations.
- The PreferredPropertyStar column has 0.53% missing values out of the total observations.
- The TypeofContact column has 0.51% missing values out of the total observations.
- We will impute these values after we split the data into train and test sets.
Check the number of unique values in each column¶
data.nunique()
CustomerID 4888 ProdTaken 2 Age 44 TypeofContact 2 CityTier 3 DurationOfPitch 34 Occupation 4 Gender 3 NumberOfPersonVisiting 5 NumberOfFollowups 6 ProductPitched 5 PreferredPropertyStar 3 MaritalStatus 4 NumberOfTrips 12 Passport 2 PitchSatisfactionScore 5 OwnCar 2 NumberOfChildrenVisiting 4 Designation 5 MonthlyIncome 2475 dtype: int64
- We can drop the column - CustomerID as it is unique for each customer and will not add value to the model.
- Most of the variables are categorical, except Age, DurationOfPitch, MonthlyIncome, and NumberOfTrips.
Dropping the unique values column
# Dropping CustomerID column
data.drop(columns='CustomerID',inplace=True)
Question 1: Check the summary statistics of the dataset and write your observations (2 Marks)¶
Let's check the statistical summary of the data.
data.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |
Write your Answer here :
- Mean and median of age column are very close to each other i.e. approx 37 and 36 respectively.
- Duration of pitch has some outliers at the right end as the 75th percentile value is 20 and the max value is 127. We need to explore this further.
- It seems like monthly income has some outliers at both ends. We need to explore this further.
- The number of trips also has some outliers as the 75th percentile value is 4 and the max value is 22.
- We can see that the target variable - ProdTaken is imbalanced as most of the values are 0.
Check the count of each unique category in each of the categorical variables.¶
# Making a list of all categorical variables
cat_col=['TypeofContact', 'CityTier','Occupation', 'Gender', 'NumberOfPersonVisiting',
'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar',
'MaritalStatus', 'Passport', 'PitchSatisfactionScore',
'OwnCar', 'NumberOfChildrenVisiting', 'Designation']
# Printing the count of each unique value in each column
for column in cat_col:
print(data[column].value_counts())
print('-'*50)
Self Enquiry 3444 Company Invited 1419 Name: TypeofContact, dtype: int64 -------------------------------------------------- 1 3190 3 1500 2 198 Name: CityTier, dtype: int64 -------------------------------------------------- Salaried 2368 Small Business 2084 Large Business 434 Free Lancer 2 Name: Occupation, dtype: int64 -------------------------------------------------- Male 2916 Female 1817 Fe Male 155 Name: Gender, dtype: int64 -------------------------------------------------- 3 2402 2 1418 4 1026 1 39 5 3 Name: NumberOfPersonVisiting, dtype: int64 -------------------------------------------------- 4.0 2068 3.0 1466 5.0 768 2.0 229 1.0 176 6.0 136 Name: NumberOfFollowups, dtype: int64 -------------------------------------------------- Basic 1842 Deluxe 1732 Standard 742 Super Deluxe 342 King 230 Name: ProductPitched, dtype: int64 -------------------------------------------------- 3.0 2993 5.0 956 4.0 913 Name: PreferredPropertyStar, dtype: int64 -------------------------------------------------- Married 2340 Divorced 950 Single 916 Unmarried 682 Name: MaritalStatus, dtype: int64 -------------------------------------------------- 0 3466 1 1422 Name: Passport, dtype: int64 -------------------------------------------------- 3 1478 5 970 1 942 4 912 2 586 Name: PitchSatisfactionScore, dtype: int64 -------------------------------------------------- 1 3032 0 1856 Name: OwnCar, dtype: int64 -------------------------------------------------- 1.0 2080 2.0 1335 0.0 1082 3.0 325 Name: NumberOfChildrenVisiting, dtype: int64 -------------------------------------------------- Executive 1842 Manager 1732 Senior Manager 742 AVP 342 VP 230 Name: Designation, dtype: int64 --------------------------------------------------
- The Free Lancer category in the Occupation column has just 2 entries out of 4,888 observations.
- We can see that Gender has 3 unique values, including 'Fe Male' and 'Female'. This must be a data entry error; we should replace 'Fe Male' with 'Female'.
- NumberOfPersonVisiting equal to 5 has a count of only 3.
- The majority of the customers are married.
- The majority of the customers own a car.
# Replacing 'Fe Male' with 'Female'
data.Gender=data.Gender.replace('Fe Male', 'Female')
# Converting the data type of each categorical variable to 'category'
for column in cat_col:
data[column]=data[column].astype('category')
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ProdTaken 4888 non-null int64 1 Age 4662 non-null float64 2 TypeofContact 4863 non-null category 3 CityTier 4888 non-null category 4 DurationOfPitch 4637 non-null float64 5 Occupation 4888 non-null category 6 Gender 4888 non-null category 7 NumberOfPersonVisiting 4888 non-null category 8 NumberOfFollowups 4843 non-null category 9 ProductPitched 4888 non-null category 10 PreferredPropertyStar 4862 non-null category 11 MaritalStatus 4888 non-null category 12 NumberOfTrips 4748 non-null float64 13 Passport 4888 non-null category 14 PitchSatisfactionScore 4888 non-null category 15 OwnCar 4888 non-null category 16 NumberOfChildrenVisiting 4822 non-null category 17 Designation 4888 non-null category 18 MonthlyIncome 4655 non-null float64 dtypes: category(14), float64(4), int64(1) memory usage: 260.3 KB
df = data.copy()
Exploratory Data Analysis¶
Question 2: Univariate Analysis¶
Let's explore these variables in some more depth by observing their distributions.
We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.
# Defining the hist_box() function
def hist_box(data, col):
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12, 6))
# Adding a graph in each part
sns.boxplot(data=data, x=col, ax=ax_box, showmeans=True)
sns.histplot(data=data, x=col, kde=True, ax=ax_hist)
plt.show()
Question 2.1: Plot the histogram and box plot for the variable Age using the hist_box function provided and write your insights. (1 Mark)¶
hist_box(df, "Age")
Write your Answer here :
- Age distribution looks approximately normally distributed.
- The boxplot for the age column confirms that there are no outliers for this variable.
- Age can be an important variable while targeting customers for the tourism package. We will further explore this in bivariate analysis.
Question 2.2: Plot the histogram and box plot for the variable Duration of Pitch using the hist_box function provided and write your insights. (1 Mark)¶
hist_box(df, 'DurationOfPitch')
Write your Answer here :
- The distribution for the duration of pitch is right-skewed.
- The duration of the pitch for most of the customers is less than 20 minutes.
- There are some observations that can be considered as outliers as they are very far from the upper whisker in the boxplot. Let's check how many such extreme values are there.
df[df['DurationOfPitch']>40]
ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1434 | 0 | NaN | Company Invited | 3 | 126.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 18482.0 |
3878 | 0 | 53.0 | Company Invited | 3 | 127.0 | Salaried | Male | 3 | 4.0 | Basic | 3.0 | Married | 4.0 | 0 | 1 | 1 | 2.0 | Executive | 22160.0 |
- We can see that there are just two observations which can be considered as outliers.
Let's plot the histogram and box plot for the variable Monthly Income using the hist_box function.
hist_box(df, 'MonthlyIncome')
- The distribution for monthly income shows that most of the values lie between 20,000 and 40,000.
- Income is one of the important factors to consider while approaching a customer with a certain package. We can explore this further in bivariate analysis.
- There are some observations on the left and some observations on the right of the boxplot which can be considered as outliers. Let's check how many such extreme values are there.
df[(df.MonthlyIncome>40000) | (df.MonthlyIncome<12000)]
ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | Basic | NaN | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 95000.0 |
142 | 0 | 38.0 | Self Enquiry | 1 | 9.0 | Large Business | Female | 2 | 3.0 | Deluxe | 3.0 | Single | 4.0 | 1 | 5 | 0 | 0.0 | Manager | 1000.0 |
2482 | 0 | 37.0 | Self Enquiry | 1 | 12.0 | Salaried | Female | 3 | 5.0 | Basic | 5.0 | Divorced | 2.0 | 1 | 2 | 1 | 1.0 | Executive | 98678.0 |
2586 | 0 | 39.0 | Self Enquiry | 1 | 10.0 | Large Business | Female | 3 | 4.0 | Deluxe | 3.0 | Single | 5.0 | 1 | 5 | 0 | 1.0 | Manager | 4678.0 |
- There are just four such observations which can be considered as outliers.
Let's plot the histogram and box plot for the variable Number of Trips using the hist_box function.
hist_box(df,'NumberOfTrips')
- The distribution for the number of trips is right-skewed.
- The boxplot shows that the number of trips has some outliers at the right end. Let's check how many such extreme values there are.
df.NumberOfTrips.value_counts(normalize=True)
2.0 0.308340 3.0 0.227254 1.0 0.130581 4.0 0.100674 5.0 0.096462 6.0 0.067818 7.0 0.045914 8.0 0.022115 19.0 0.000211 21.0 0.000211 20.0 0.000211 22.0 0.000211 Name: NumberOfTrips, dtype: float64
- We can see that most customers, i.e., more than half, have taken 2 or 3 trips.
- As expected, the percentage of customers decreases as the number of trips increases.
- The percentage of customers with 19 or more trips is very low; we can consider these values as outliers.
- There are just four observations with 19 or more trips.
Removing these outliers from duration of pitch, monthly income, and number of trips.
# Dropping observations with duration of pitch greater than 40 minutes. There are just 2 such observations
df.drop(index=df[df.DurationOfPitch>40].index,inplace=True)
# Dropping observations with monthly income less than 12000 or greater than 40000. There are just 4 such observations
df.drop(index=df[(df.MonthlyIncome>40000) | (df.MonthlyIncome<12000)].index,inplace=True)
# Dropping observations with number of trips greater than 10. There are just 4 such observations
df.drop(index=df[df.NumberOfTrips>10].index,inplace=True)
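The cutoffs above were chosen by inspecting the boxplots. A more systematic alternative, shown here only as a sketch (the `iqr_bounds` helper is our own illustration, not part of the original notebook), is the standard 1.5×IQR whisker rule:

```python
import pandas as pd

def iqr_bounds(s: pd.Series):
    # Standard boxplot whisker limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy series with one clear outlier
s = pd.Series([5, 9, 13, 20, 127])
low, high = iqr_bounds(s)
outliers = s[(s < low) | (s > high)]
print(low, high, list(outliers))  # -7.5 36.5 [127]
```

Whether to drop, cap, or keep points outside these bounds is a modeling decision; this notebook drops only the handful of extreme values identified above.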
Let's understand the distribution of the categorical variables¶
Number of Person Visiting
sns.countplot(x = df['NumberOfPersonVisiting'])
plt.show()
df['NumberOfPersonVisiting'].value_counts(normalize=True)
3 0.491390 2 0.289873 4 0.210127 1 0.007995 5 0.000615 Name: NumberOfPersonVisiting, dtype: float64
- Most customers have 3 persons visiting with them. This may be because most people like to travel with family.
- As mentioned earlier, there are just 3 observations (0.1%) where 5 persons are visiting with the customer.
Occupation
sns.countplot(x = df['Occupation'])
plt.show()
df['Occupation'].value_counts(normalize=True)
Salaried 0.484215 Small Business 0.427224 Large Business 0.088151 Free Lancer 0.000410 Name: Occupation, dtype: float64
- The majority of customers, i.e., 91%, are either salaried or own a small business.
- As mentioned earlier, the Free Lancer category has only 2 observations.
City Tier
sns.countplot(x = df['CityTier'])
plt.show()
df['CityTier'].value_counts(normalize=True)
1 0.652317 3 0.307093 2 0.040590 Name: CityTier, dtype: float64
- Most of the customers i.e. approx 65% are from tier 1 cities. This can be because of better living standards and exposure as compared to tier 2 and tier 3 cities.
- Surprisingly, tier 3 cities have a much higher count than tier 2 cities. This can be because the company has less marketing in tier 2 cities.
Gender
sns.countplot(x = df['Gender'])
plt.show()
df['Gender'].value_counts(normalize=True)
Male 0.596556 Female 0.403444 Name: Gender, dtype: float64
- There are more male customers (approx 60%) than female customers (approx 40%).
- This might be because males do the booking/inquiry when traveling with females, which implies that males are the direct customers of the company.
Number of Follow ups
sns.countplot(x = df['NumberOfFollowups'])
plt.show()
df['NumberOfFollowups'].value_counts(normalize=True)
4.0 0.426857 3.0 0.302504 5.0 0.158701 2.0 0.047383 1.0 0.036416 6.0 0.028140 Name: NumberOfFollowups, dtype: float64
- We can see that the company usually follows up 3 or 4 times with its customers.
- We can explore this further and observe which number of follow-ups yields more customers who buy the product.
Product Pitched
sns.countplot(x = df['ProductPitched'])
plt.show()
df['ProductPitched'].value_counts(normalize=True)
Basic 0.376384 Deluxe 0.354244 Standard 0.152112 Super Deluxe 0.070111 King 0.047150 Name: ProductPitched, dtype: float64
- The company pitches Deluxe or Basic packages to their customers more than the other packages.
- This might be because the company makes more profit from the Deluxe or Basic packages, or because these packages are less expensive and hence preferred by the majority of customers.
Type of Contact
sns.countplot(x = df['TypeofContact'])
plt.show()
df['TypeofContact'].value_counts(normalize=True)
Self Enquiry 0.70884 Company Invited 0.29116 Name: TypeofContact, dtype: float64
- Approx 71% of customers reached out to the company first, i.e., self-inquiry.
- This shows the positive outreach of the company, as most inquiries are initiated from the customer's end.
Designation
sns.countplot(x = df['Designation'])
plt.show()
df['Designation'].value_counts(normalize=True)
Executive 0.376384 Manager 0.354244 Senior Manager 0.152112 AVP 0.070111 VP 0.047150 Name: Designation, dtype: float64
- Approx 73% of the customers are at the Executive or Manager level.
- We can see that the higher the position, the fewer the observations, which makes sense as executives/managers are more common than AVPs/VPs.
Product Taken
sns.countplot(x = df['ProdTaken'])
plt.show()
df['ProdTaken'].value_counts(normalize=True)
0 0.811808 1 0.188192 Name: ProdTaken, dtype: float64
- This plot shows that the distribution of classes in the target variable is imbalanced.
- We only have approx 19% of customers who have purchased the product.
Question 3: Bivariate Analysis¶
Question 3.1: Find and visualize the correlation matrix using a heatmap and write your observations from the plot. (2 Marks)¶
cols_list = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Write your Answer here :
- Number of trips and age have a weak positive correlation, which makes sense: as age increases, the number of trips is expected to increase slightly.
- Age and monthly income are positively correlated.
- ProdTaken has a weak negative correlation with age, i.e., as age increases, the probability of purchasing a package decreases.
- No other variables have a high correlation among them.
We will define a stacked_barplot() function to help analyze how the target variable varies across predictor categories.
# Defining the stacked_barplot() function
def stacked_barplot(data,predictor,target,figsize=(10,6)):
(pd.crosstab(data[predictor],data[target],normalize='index')*100).plot(kind='bar',figsize=figsize,stacked=True)
plt.legend(loc="lower right")
plt.ylabel(target)
Question 3.2: Plot the stacked barplot for the variable Marital Status against the target variable ProdTaken using the stacked_barplot function provided and write your insights. (1 Mark)¶
stacked_barplot(df, "MaritalStatus", "ProdTaken" )
Write your Answer here :
- Married people are the most common customers for the company, but this graph shows that the conversion rate is higher for single and unmarried customers than for married customers.
- The company can target single and unmarried customers more, and can tailor packages to these customers.
Question 3.3: Plot the stacked barplot for the variable ProductPitched against the target variable ProdTaken using the stacked_barplot function provided and write your insights. (1 Mark)¶
stacked_barplot(df, "ProductPitched", "ProdTaken" )
Write your Answer here :
- The conversion rate of customers is higher if the product pitched is Basic. This might be because the basic package is less expensive.
- We saw earlier that company pitches the deluxe package more than the standard package, but the standard package shows a higher conversion rate than the deluxe package. The company can pitch standard packages more often.
Let's plot the stacked barplot for the variable Passport against the target variable ProdTaken using the stacked_barplot function.
stacked_barplot(df, "Passport", "ProdTaken" )
- The conversion rate for customers with a passport is higher as compared to the customers without a passport.
- The company should customize more international packages to attract more such customers.
Let's plot the stacked barplot for the variable Designation against the target variable ProdTaken using the stacked_barplot function.
stacked_barplot(df, "Designation", "ProdTaken" )
- The conversion rate of executives is higher than other designations.
- Customers at VP and AVP positions have the least conversion rate.
Data Preparation for Modeling¶
Separating the independent variables (X) and the dependent variable (Y)
# Separating target variable and other variables
X=data.drop(columns='ProdTaken')
Y=data['ProdTaken']
As we aim to predict customers who are more likely to buy the product, we should drop the columns 'DurationOfPitch', 'NumberOfFollowups', 'ProductPitched', and 'PitchSatisfactionScore', as these columns would not be available at the time of prediction for new data.
# Dropping columns
X.drop(columns=['DurationOfPitch','NumberOfFollowups','ProductPitched','PitchSatisfactionScore'],inplace=True)
Splitting the data into a 70% train and 30% test set
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling to ensure that relative class frequencies are approximately preserved in the train and test sets.
# Splitting the data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.30,random_state=1,stratify=Y)
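To see what stratify does, here is a self-contained check on toy data (not the project dataset): the positive rate is preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives, 20 positives (20% positive rate)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
# Stratification keeps the positive rate at 20% in both splits
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```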
As we saw earlier, our data has missing values. We will impute missing values using median for continuous variables and mode for categorical variables. We will use SimpleImputer
to do this.
The SimpleImputer
provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located.
si1=SimpleImputer(strategy='median')
median_imputed_col=['Age','MonthlyIncome','NumberOfTrips']
# Fit and transform the train data
X_train[median_imputed_col]=si1.fit_transform(X_train[median_imputed_col])
#Transform the test data i.e. replace missing values with the median calculated using training data
X_test[median_imputed_col]=si1.transform(X_test[median_imputed_col])
si2=SimpleImputer(strategy='most_frequent')
mode_imputed_col=['TypeofContact','PreferredPropertyStar','NumberOfChildrenVisiting']
# Fit and transform the train data
X_train[mode_imputed_col]=si2.fit_transform(X_train[mode_imputed_col])
# Transform the test data i.e. replace missing values with the mode calculated using training data
X_test[mode_imputed_col]=si2.transform(X_test[mode_imputed_col])
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
Age 0 TypeofContact 0 CityTier 0 Occupation 0 Gender 0 NumberOfPersonVisiting 0 PreferredPropertyStar 0 MaritalStatus 0 NumberOfTrips 0 Passport 0 OwnCar 0 NumberOfChildrenVisiting 0 Designation 0 MonthlyIncome 0 dtype: int64 ------------------------------ Age 0 TypeofContact 0 CityTier 0 Occupation 0 Gender 0 NumberOfPersonVisiting 0 PreferredPropertyStar 0 MaritalStatus 0 NumberOfTrips 0 Passport 0 OwnCar 0 NumberOfChildrenVisiting 0 Designation 0 MonthlyIncome 0 dtype: int64
Let's create dummy variables for string type variables and convert other column types back to float.
# Converting data types of columns to float
for column in ['NumberOfPersonVisiting', 'Passport', 'OwnCar']:
X_train[column]=X_train[column].astype('float')
X_test[column]=X_test[column].astype('float')
# List of columns for which to create dummy variables
col_dummy=['TypeofContact', 'Occupation', 'Gender', 'MaritalStatus', 'Designation', 'CityTier']
# Encoding categorical variables
X_train=pd.get_dummies(X_train, columns=col_dummy, drop_first=True)
X_test=pd.get_dummies(X_test, columns=col_dummy, drop_first=True)
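One caveat worth noting (our own addition, not part of the assignment): calling `pd.get_dummies` separately on train and test can produce different columns if a category level is missing from the test set. A minimal sketch of the usual fix, reindexing the test dummies to the training columns (the `Designation` levels here are illustrative):

```python
import pandas as pd

# Toy example: the 'Manager' level appears in train but not in test
train = pd.DataFrame({'Designation': ['AVP', 'Executive', 'Manager']})
test = pd.DataFrame({'Designation': ['AVP', 'Executive']})

train_d = pd.get_dummies(train, columns=['Designation'], drop_first=True)
test_d = pd.get_dummies(test, columns=['Designation'], drop_first=True)

# Align the test set to the training columns, filling absent dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))  # True
```

In this project the split happens to produce identical columns, but aligning explicitly makes the pipeline robust to resampling.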
Model evaluation criterion:¶
The model can make wrong predictions in two ways:¶
- Predicting a customer will buy the product and the customer doesn't buy - Loss of resources
- Predicting a customer will not buy the product and the customer buys - Loss of opportunity
Which case is more important?¶
- Predicting that a customer will not buy the product when they actually would have bought it, i.e., losing a potential source of income because that customer will not be targeted by the marketing team when they should be.
How to reduce this loss, i.e., how to reduce False Negatives?¶
- The company wants Recall to be maximized: the greater the recall, the lower the chance of false negatives.
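As a quick refresher, recall is TP / (TP + FN): the fraction of actual buyers the model catches. A toy sketch (the labels here are made up for illustration):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Four actual buyers (1) and four non-buyers (0); the model catches two buyers
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# ravel() unpacks the 2x2 confusion matrix as (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 0.5
print(recall_score(y_true, y_pred))  # 0.5
```

Here two of the four actual buyers are missed (false negatives), so recall is 2 / (2 + 2) = 0.5.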
Building the model¶
We will be building 4 different models:
- Logistic Regression
- Support Vector Machine(SVM)
- Decision Tree
- Random Forest
Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
# Creating metric function
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize=(8,5))
sns.heatmap(cm, annot=True, fmt='.2f')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Question 4: Logistic Regression (6 Marks)¶
Question 4.1: Build a Logistic Regression model (Use the Scikit-learn library) (1 Mark)¶
# Fitting logistic regression model
lg = LogisticRegression()
lg.fit(X_train,y_train)
LogisticRegression()
Question 4.2: Check the performance of the model on train and test data (2 Marks)¶
# Checking the performance on the training data
y_pred_train = lg.predict(X_train)
metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.85      0.98      0.91      2777
           1       0.73      0.25      0.37       644

    accuracy                           0.84      3421
   macro avg       0.79      0.61      0.64      3421
weighted avg       0.83      0.84      0.81      3421
Write your Answer here:
- We have built a predictive model that the tourism company can use to identify customers likely to accept the new package, but with a recall of only 25% on the training data.
Let's check the performance on the test set¶
# Checking the performance on the test dataset
y_pred_test = lg.predict(X_test)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.85      0.98      0.91      1191
           1       0.69      0.23      0.34       276

    accuracy                           0.84      1467
   macro avg       0.77      0.60      0.62      1467
weighted avg       0.82      0.84      0.80      1467
Write your Answer here:
- With the default threshold of 0.5, the model gives a low recall but a decent precision score.
- We can't have both precision and recall high: increasing one typically reduces the other. This is called the precision/recall trade-off.
- So let's find an optimal threshold where we can balance both metrics.
Question 4.3: Find the optimal threshold for the model using the Precision-Recall Curve. (1 Mark)¶
The Precision-Recall curve summarizes the trade-off between recall (the true positive rate) and precision (the positive predictive value) for a predictive model across different probability thresholds.
Let's use the Precision-Recall curve and see if we can find a better threshold.
# Predict_proba gives the probability of each observation belonging to each class
y_scores_lg=lg.predict_proba(X_train)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
- We want to choose a threshold that has a high recall while also keeping precision reasonably high. High recall is necessary, but we also need to be careful not to lose too much precision. So a threshold value of 0.25 should be sufficient, because it gives good recall without a significant drop in precision.
Note: We are attempting to maximize recall because that is our metric of interest. If the F1 score were the metric of interest, we would instead find the threshold that balances precision and recall; in that case, the threshold value would be 0.30.
# Setting the optimal threshold
optimal_threshold = 0.25
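Rather than reading the threshold off the plot, it can also be chosen programmatically from `precision_recall_curve`. A sketch on synthetic scores, assuming an illustrative recall target of 70% (both the target and the score distributions are our assumptions, not part of the assignment):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic scores: positives tend to score higher than negatives
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(800, dtype=int), np.ones(200, dtype=int)])
scores = np.concatenate([rng.beta(2, 5, 800), rng.beta(5, 2, 200)])

precisions, recalls, thresholds = precision_recall_curve(y, scores)

# recalls[:-1] aligns with thresholds; pick the largest threshold
# that still achieves at least 70% recall (maximizing precision)
mask = recalls[:-1] >= 0.70
candidate = thresholds[mask].max()
print(round(candidate, 3))
```

Because recall only decreases as the threshold rises, the largest threshold meeting the recall target is the one with the best precision among the qualifying thresholds.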
Question 4.4: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)¶
# Checking the performance on the training data using the optimal threshold
y_pred_train = lg.predict_proba(X_train)
metrics_score(y_train, y_pred_train[:,1]>optimal_threshold)
              precision    recall  f1-score   support

           0       0.90      0.81      0.85      2777
           1       0.42      0.60      0.49       644

    accuracy                           0.77      3421
   macro avg       0.66      0.70      0.67      3421
weighted avg       0.81      0.77      0.78      3421
Write your Answer here :
- The model performance has improved compared to our initial model: the recall on the training data has increased from 25% to 60%.
Let's check the performance on the test set¶
y_pred_test = lg.predict_proba(X_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold)
              precision    recall  f1-score   support

           0       0.91      0.81      0.86      1191
           1       0.45      0.65      0.53       276

    accuracy                           0.78      1467
   macro avg       0.68      0.73      0.70      1467
weighted avg       0.82      0.78      0.80      1467
Write your Answer here :
- Using a threshold of 0.25, the model achieves a recall of 65% on the test data, up from 23% with the default threshold.
- Precision has dropped compared to the initial model, but with the optimal threshold the model provides a more balanced performance.
However, the model performance is still not good, so let's try building another model.
Question 5: Support Vector Machines (11 Marks)¶
Support vector machines are sensitive to feature scales, and training is faster on scaled data, so let's scale the features first.
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)
Let's build the models using two of the widely used kernel functions:
- Linear Kernel
- RBF Kernel
Question 5.1: Build a Support Vector Machine model using a linear kernel (1 Mark)¶
svm = SVC(kernel='linear',probability=True) # Linear kernel, i.e., a linear decision boundary
model = svm.fit(X= X_train_scaled, y = y_train)
Question 5.2: Check the performance of the model on train and test data (2 Marks)¶
y_pred_train_svm = model.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
              precision    recall  f1-score   support

           0       0.81      1.00      0.90      2777
           1       1.00      0.00      0.00       644

    accuracy                           0.81      3421
   macro avg       0.91      0.50      0.45      3421
weighted avg       0.85      0.81      0.73      3421
Write your Answer here :
- This model has completely failed to detect class 1: it predicted virtually every instance as class 0.
- The model has a class-1 recall score of 0.
Checking model performance on test set¶
print("Testing performance:")
y_pred_test_svm = model.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
Testing performance:
              precision    recall  f1-score   support

           0       0.81      1.00      0.90      1191
           1       1.00      0.00      0.01       276

    accuracy                           0.81      1467
   macro avg       0.91      0.50      0.45      1467
weighted avg       0.85      0.81      0.73      1467
Write your Answer here:
- As the dataset has an imbalanced class distribution, the model almost always predicts 0.
- So for the linear kernel, the 0.5 threshold doesn't seem to work. Let's find the optimal threshold and check whether the model performs better.
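An alternative to moving the threshold, noted here as a side experiment rather than part of the assignment, is to reweight the classes with `class_weight='balanced'`, which penalizes mistakes on the minority class more heavily. A sketch on synthetic imbalanced data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# Synthetic imbalanced data: ~20% positives, shifted away from the negatives
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(400, 2))
X1 = rng.normal(1.5, 1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 400 + [1] * 100)

plain = SVC(kernel='linear').fit(X, y)
weighted = SVC(kernel='linear', class_weight='balanced').fit(X, y)

# Reweighting typically trades some precision for higher minority-class recall
print(recall_score(y, plain.predict(X)), recall_score(y, weighted.predict(X)))
```

Reweighting shifts the decision boundary toward the majority class, which usually raises minority-class recall at the default 0.5 threshold.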
Question 5.3: Find the optimal threshold for the model using the Precision-Recall Curve. (1 Mark)¶
# Predict on train data
y_scores_svm=model.predict_proba(X_train_scaled)
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
- In this case the threshold value of 0.25 seems to be good as it has good recall and there isn't much drop in precision.
optimal_threshold_svm=0.25
Question 5.4: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)¶
print("Training performance:")
y_pred_train_svm = model.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1]>optimal_threshold_svm)
Training performance:
              precision    recall  f1-score   support

           0       0.89      0.82      0.85      2777
           1       0.42      0.57      0.48       644

    accuracy                           0.77      3421
   macro avg       0.65      0.69      0.67      3421
weighted avg       0.80      0.77      0.78      3421
Write your Answer here :
- The model performance has improved by selecting the optimal threshold of 0.25.
- The recall has increased from 0% to 57%.
y_pred_test = model.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.90      0.82      0.86      1191
           1       0.44      0.62      0.51       276

    accuracy                           0.78      1467
   macro avg       0.67      0.72      0.69      1467
weighted avg       0.81      0.78      0.79      1467
Write your Answer here :
- The SVM model with a linear kernel is not overfitting, as the accuracy is around 78% for both the train and test datasets.
- The model has a test recall of 62%, the highest among the models so far.
- At the optimal threshold of 0.25, the model performance has improved considerably: the class-1 F1 score has gone from 0.01 to 0.51.
Let's try a non-linear kernel and check whether it improves performance.
Question 5.5: Build a Support Vector Machines model using an RBF kernel (1 Mark)¶
svm_rbf=SVC(kernel='rbf',probability=True)
# Fit the model
svm_rbf.fit(X_train_scaled,y_train)
SVC(probability=True)
Question 5.6: Check the performance of the model on train and test data (2 Marks)¶
y_pred_train_svm = svm_rbf.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
              precision    recall  f1-score   support

           0       0.86      0.98      0.92      2777
           1       0.83      0.32      0.46       644

    accuracy                           0.86      3421
   macro avg       0.85      0.65      0.69      3421
weighted avg       0.86      0.86      0.83      3421
Write your Answer here :
- Compared to the baseline SVM model with a linear kernel, the RBF kernel slightly improves performance on the training data.
Checking model performance on test set¶
y_pred_test = svm_rbf.predict(X_test_scaled)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.85      0.98      0.91      1191
           1       0.79      0.26      0.39       276

    accuracy                           0.85      1467
   macro avg       0.82      0.62      0.65      1467
weighted avg       0.84      0.85      0.81      1467
Write your Answer here :
- Compared to the baseline SVM model with a linear kernel, the recall on the test data has increased from 0% to 26%.
# Predict on train data
y_scores_svm=svm_rbf.predict_proba(X_train_scaled)
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
optimal_threshold_svm=0.17
Question 5.7: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)¶
Checking model performance on training set¶
y_pred_train_svm = svm_rbf.predict_proba(X_train_scaled)  # use the RBF model here, not the linear one
metrics_score(y_train, y_pred_train_svm[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.92      0.60      0.73      2777
           1       0.31      0.78      0.44       644

    accuracy                           0.63      3421
   macro avg       0.62      0.69      0.59      3421
weighted avg       0.81      0.63      0.67      3421
Write your Answer here :
- The SVM model with the RBF kernel is performing better than the linear kernel.
- The model has achieved a recall score of 0.78, at the cost of a drop in precision.
- With a threshold of 0.17, the model gives a much better recall score than the initial model.
Checking model performance on test set¶
y_pred_test = svm_rbf.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.92      0.88      0.90      1191
           1       0.56      0.68      0.62       276

    accuracy                           0.84      1467
   macro avg       0.74      0.78      0.76      1467
weighted avg       0.85      0.84      0.85      1467
Write your Answer here :
- The recall score on the test data is 68%.
- At the optimal threshold of 0.17, the test recall has improved from 0.26 to 0.68.
- This is the best-performing model so far compared with the linear-kernel SVM and Logistic Regression, because it provides good recall without a large drop in precision.
Let's build some tree-based models and see if they can outperform the models above.
Question 6: Decision Trees (7 Marks)¶
Question 6.1: Build a Decision Tree Model (1 Mark)¶
model_dt = DecisionTreeClassifier(random_state=1)
model_dt.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
Question 6.2: Check the performance of the model on train and test data (2 Marks)¶
# Checking performance on the training dataset
pred_train_dt = model_dt.predict(X_train)
metrics_score(y_train, pred_train_dt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2777
           1       1.00      1.00      1.00       644

    accuracy                           1.00      3421
   macro avg       1.00      1.00      1.00      3421
weighted avg       1.00      1.00      1.00      3421
Write your Answer here :
- Zero errors on the training set: every sample has been classified correctly.
- The model has performed suspiciously well on the training set.
- As we know, a decision tree will keep growing until it classifies each training point correctly if no restrictions are applied, learning all the patterns (including noise) in the training set.
- Let's check the performance on the test data to see if the model is overfitting.
Checking model performance on test set¶
pred_test_dt = model_dt.predict(X_test)
metrics_score(y_test, pred_test_dt)
              precision    recall  f1-score   support

           0       0.93      0.92      0.92      1191
           1       0.66      0.68      0.67       276

    accuracy                           0.88      1467
   macro avg       0.79      0.80      0.80      1467
weighted avg       0.88      0.88      0.88      1467
Write your Answer here :
- The decision tree model is clearly overfitting. However, it still performs better on the test set than the Logistic Regression and SVM models.
- We will have to tune the decision tree to reduce the overfitting.
Question 6.3: Perform hyperparameter tuning for the decision tree model using GridSearchCV (1 Mark)¶
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(1,100,10),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, cv=5,scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=21, max_leaf_nodes=250, min_samples_split=10, random_state=1)
Question 6.4: Check the performance of the model on the train and test data using the tuned model (2 Mark)¶
Checking performance on the training set¶
# Checking performance on the training dataset
dt_tuned = estimator.predict(X_train)
metrics_score(y_train,dt_tuned)
              precision    recall  f1-score   support

           0       0.96      0.97      0.96      2777
           1       0.85      0.82      0.84       644

    accuracy                           0.94      3421
   macro avg       0.91      0.89      0.90      3421
weighted avg       0.94      0.94      0.94      3421
# Checking performance on the test dataset
y_pred_tuned = estimator.predict(X_test)
metrics_score(y_test,y_pred_tuned)
              precision    recall  f1-score   support

           0       0.90      0.91      0.91      1191
           1       0.61      0.58      0.59       276

    accuracy                           0.85      1467
   macro avg       0.75      0.75      0.75      1467
weighted avg       0.85      0.85      0.85      1467
Write your Answer here :
- The decision tree model with default parameters overfits the training data and does not generalize well.
- The tuned model provides a more generalized performance, with balanced precision and recall values.
- However, there is still some overfitting, and performance on the test data has not improved significantly.
Visualizing the Decision Tree¶
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
max_depth=4,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
Question 6.5: What are some important features based on the tuned decision tree? (1 Mark)¶
# Importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Write your Answer here :
We can see that the tree has become simpler and its rules are readable.
The model's performance has become more generalized.
We observe that the most important features are:
- Monthly Income
- Age
- Number of trips
Question 7: Random Forest (4 Marks)¶
Question 7.1: Build a Random Forest Model (1 Mark)¶
rf_estimator = RandomForestClassifier( random_state = 1)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
Question 7.2: Check the performance of the model on the train and test data (2 Marks)¶
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train_rf)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2777
           1       1.00      1.00      1.00       644

    accuracy                           1.00      3421
   macro avg       1.00      1.00      1.00      3421
weighted avg       1.00      1.00      1.00      3421
Write your Answer here :
- Zero errors on the training set: every sample has been classified correctly.
- As with the default decision tree, a perfect training score suggests the model may be overfitting.
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test_rf)
              precision    recall  f1-score   support

           0       0.89      0.99      0.94      1191
           1       0.91      0.47      0.62       276

    accuracy                           0.89      1467
   macro avg       0.90      0.73      0.78      1467
weighted avg       0.89      0.89      0.88      1467
Write your Answer here :
- The Random Forest classifier seems to be overfitting: perfect scores on train, much weaker recall on test.
- The test recall is 0.47, which is low compared to the other models.
- We could reduce overfitting and improve recall through hyperparameter tuning.
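A minimal tuning sketch along the lines of the decision-tree grid search, run here on synthetic stand-in data; the parameter grid is an illustrative assumption, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (~20% positives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)

rf = RandomForestClassifier(random_state=1)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 6, None],
    "min_samples_leaf": [1, 5],
}
# Score on recall, our metric of interest, as with the decision tree
grid = GridSearchCV(rf, param_grid, cv=3, scoring="recall", n_jobs=-1)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Constraining `max_depth` and `min_samples_leaf` limits how closely each tree can fit the training data, which is the usual lever against the overfitting seen above.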
Question 7.3: What are some important features based on the Random Forest? (1 Mark)¶
Let's check the feature importance of the Random Forest
importances = rf_estimator.feature_importances_
columns = X_train.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
sns.barplot(x = importance_df.Importance, y = importance_df.index, color="violet")
plt.show()
Write your Answer here :
- The Random Forest confirms the results from the decision tree: the most important features are monthly income, age, and number of trips.
- Monthly income is the most important feature: customers with a higher monthly income are more likely to accept the tour package.
- Age is also a key feature, probably because customers aged between 25 and 50 are the most likely to accept the newly introduced tour package.
Conclusion:¶
- The SVM with the RBF kernel has outperformed the other models and provided balanced metrics.
- We have been able to build a predictive model, with a recall score of 0.68 on the test data, that the tourism company can use to identify customers likely to accept the new package and to formulate marketing policies accordingly.
Question 8: Conclude ANY FOUR key takeaways for business recommendations (4 Marks)¶
Write your Answer here :
- Our analysis shows that very few customers have passports and they are more likely to purchase the travel package. The company should customize more international packages to attract more such customers.
- We have customers from tier 1 and tier 3 cities but very few from tier 2 cities. The company should expand its marketing strategies to increase the number of customers from tier 2 cities.
- We saw in our analysis that people with higher income or at high positions like AVP or VP are less likely to buy the product. The company can offer short-term travel packages and customize the package for higher-income customers with added luxuries to target such customers.
- When implementing a marketing strategy, external factors, such as the number of follow-ups, time of call, should also be carefully considered as our analysis shows that the customers who have been followed up more are the ones buying the package.
- After we identify a potential customer, the company should pitch packages as per the customer's monthly income, for example, do not pitch king packages to a customer with low income and such packages can be pitched more to the higher-income customers.
- We saw in our analysis that young and single people are more likely to buy the offered packages. The company can offer discounts or customize the package to attract more couples, families, and customers above 30 years of age.
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Practice_Project_-_Travel_Package_Purchase_Prediction/Practice_Project_Solution_Tourism_Package_Prediction.ipynb"