Predicting Retail Spending with Automated Feature Engineering in Featuretools¶
In this notebook, we will implement an automated feature engineering solution with Featuretools for a set of online retail purchases. The dataset (available from the UCI machine learning repository) is a series of time-stamped purchases and does not come with any labels, so we'll have to make our own prediction problem. After defining the problem, we can use automated feature engineering to build a set of features that are used for training a predictive model.
This set of retail spending data is great practice both for defining our own prediction problem (known as prediction engineering) and for using some of the time-based capabilities of Featuretools, notably cutoff times. Whenever we have time-series data, we need to be extra careful not to "leak labels" or use information from the future to predict a past event. This is a common pitfall in manual feature engineering: a system can work well in development but fail utterly when deployed because it was trained on invalid data. Fortunately, Featuretools will take care of the time issue for us, creating a rich set of features that obey the time restrictions.
Roadmap¶
Following is an outline for this notebook:
- Read in data, inspect, and clean
- Develop a prediction problem
- Create a dataframe of labels (what we want to predict) and cutoff times (the point before which all data used to predict a label must occur)
- Create an entityset and add entities
- Normalize the original table to develop new tables
- These new tables can be used for making features
- Run deep feature synthesis on the entityset to make features
- Use the cutoff times to make features using valid data for each label
- Use the features to train a machine learning model
- Measure performance of the model relative to an informed baseline
- Tune deep feature synthesis
- Specify custom primitives
- Adjust maximum depth of features
- Re-evaluate model
This problem is a great display of both the time and feature-creation capabilities of Featuretools. Also, we'll be able to use custom primitives to expand on our domain knowledge. Doing this problem by hand and ensuring we use only valid data for each label is a daunting task (as can be seen in the Manual Retail Spending notebook)!
!pip install featuretools
# Data manipulation
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline
# Automated feature engineering
import featuretools as ft
# Machine learning
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
Load in Raw Data¶
The raw data is a collection of purchases from an online retailer collected in 2010-2011. Each row in the original data represents one purchased product, with multiple purchases making up an order. There are a few issues with the data that will need to be addressed (as with most real-world datasets)!
This code loads in the data, converts the price in Pounds to dollars (based on the exchange rate on May 31, 2011), subsets the data to 2011, and creates a column representing the total of the purchase. The original data description can be found on the UCI Machine Learning Repository.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
data = pd.read_excel("Online Retail.xlsx")
data.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
data.shape
(541909, 8)
# Convert to dollars
data['UnitPrice'] = data['UnitPrice'] * 1.65
data['total'] = data['UnitPrice'] * data['Quantity']
# Restrict data to 2011
data_2011 = data[data['InvoiceDate'].dt.year == 2011]
data_2011.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | total | |
---|---|---|---|---|---|---|---|---|---|
42481 | 539993 | 22386 | JUMBO BAG PINK POLKADOT | 10 | 2011-01-04 10:00:00 | 3.2175 | 13313.0 | United Kingdom | 32.175 |
42482 | 539993 | 21499 | BLUE POLKADOT WRAP | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.325 |
42483 | 539993 | 21498 | RED RETROSPOT WRAP | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.325 |
42484 | 539993 | 22379 | RECYCLING BAG RETROSPOT | 5 | 2011-01-04 10:00:00 | 3.4650 | 13313.0 | United Kingdom | 17.325 |
42485 | 539993 | 20718 | RED RETROSPOT SHOPPER BAG | 10 | 2011-01-04 10:00:00 | 2.0625 | 13313.0 | United Kingdom | 20.625 |
data_2011.shape
(499428, 9)
data_2011.info()
<class 'pandas.core.frame.DataFrame'> Index: 499428 entries, 42481 to 541908 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 InvoiceNo 499428 non-null object 1 StockCode 499428 non-null object 2 Description 498099 non-null object 3 Quantity 499428 non-null int64 4 InvoiceDate 499428 non-null datetime64[ns] 5 UnitPrice 499428 non-null float64 6 CustomerID 379979 non-null float64 7 Country 499428 non-null object 8 total 499428 non-null float64 dtypes: datetime64[ns](1), float64(3), int64(1), object(4) memory usage: 38.1+ MB
Data Cleaning¶
There are a few issues we need to address with the data. First, we'll drop the duplicated rows, then we'll drop any rows that contain a `nan`. Finally, we can add a Boolean column indicating whether or not an order is a cancellation.
data_2011.duplicated().sum()
4768
- There are 4768 duplicate rows, which we will drop because they would bias the dataset.
data_2011.isnull().sum()
InvoiceNo 0 StockCode 0 Description 1329 Quantity 0 InvoiceDate 0 UnitPrice 0 CustomerID 119449 Country 0 total 0 dtype: int64
- There are 119,449 missing values in CustomerID, a column that is very difficult to impute because each value is a unique ID belonging to one customer. Therefore we will drop those rows too.
# drop the duplicates
data_2011 = data_2011.drop_duplicates()
# drop rows with any nulls (mostly missing CustomerID, plus some missing Description)
data_2011 = data_2011.dropna(axis=0)
data_2011['cancelled'] = data_2011['InvoiceNo'].str.startswith('C', na = False)
# Size has changed since we removed the duplicates and nulls
data_2011.shape
(375250, 10)
data_2011[data_2011.cancelled==True].head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | total | cancelled | |
---|---|---|---|---|---|---|---|---|---|---|
42557 | C540006 | 21306 | SET/4 DAISY MIRROR MAGNETS | -1 | 2011-01-04 10:48:00 | 3.4650 | 14606.0 | United Kingdom | -3.4650 | True |
42558 | C540006 | 84352 | SILVER CHRISTMAS TREE BAUBLE STAND | -1 | 2011-01-04 10:48:00 | 27.9675 | 14606.0 | United Kingdom | -27.9675 | True |
42559 | C540006 | 22423 | REGENCY CAKESTAND 3 TIER | -1 | 2011-01-04 10:48:00 | 21.0375 | 14606.0 | United Kingdom | -21.0375 | True |
42560 | C540007 | 21055 | TOOL BOX SOFT TOY | -6 | 2011-01-04 11:08:00 | 14.7675 | 15379.0 | United Kingdom | -88.6050 | True |
42561 | C540007 | 22274 | FELTCRAFT DOLL EMILY | -6 | 2011-01-04 11:08:00 | 4.8675 | 15379.0 | United Kingdom | -29.2050 | True |
- All of the cancelled orders have negative quantities, which means they cancel out with the corresponding purchase. We'll leave in the cancelled purchases because, if our goal (defined later in the prediction problem) is to predict the total amount purchased by a customer, we'll need to take their cancelled orders into account.
data_2011['cancelled'].value_counts(normalize =True)
cancelled False 0.978172 True 0.021828 Name: proportion, dtype: float64
- About 2% of the line items belong to cancelled invoices.
# Summary statistics for the cleaned data
data_2011.describe()
Quantity | InvoiceDate | UnitPrice | CustomerID | total | |
---|---|---|---|---|---|
count | 375250.000000 | 375250 | 375250.000000 | 375250.000000 | 375250.000000 |
mean | 12.252296 | 2011-07-25 10:03:49.841812480 | 5.762031 | 15265.136168 | 33.972397 |
min | -80995.000000 | 2011-01-04 10:00:00 | 0.000000 | 12346.000000 | -277974.840000 |
25% | 2.000000 | 2011-05-04 08:40:00 | 2.062500 | 13901.000000 | 7.012500 |
50% | 5.000000 | 2011-08-14 12:55:30 | 3.217500 | 15108.000000 | 19.305000 |
75% | 12.000000 | 2011-10-24 17:07:00 | 6.187500 | 16767.000000 | 32.670000 |
max | 80995.000000 | 2011-12-09 12:50:00 | 64300.500000 | 18287.000000 | 277974.840000 |
std | 258.180877 | NaN | 119.054102 | 1710.927381 | 733.803756 |
- We can see that most total purchase prices are less than $50 (note that this is for a single line item, not a whole order).
- The negative numbers represent cancelled orders. We can plot the purchase total by country (limited to only positive amounts and less than $1000) to see if there are differences between countries.
plt.figure(figsize = (20, 6))
sns.boxplot(x = 'Country', y = 'total', data = data_2011[(data_2011['total'] > 0) & (data_2011['total'] < 1000)]);
plt.title("Total Purchase Amount by Country");
plt.xticks(rotation = 90);
plt.figure(figsize = (20, 6))
sns.boxplot(x = 'Country', y = 'Quantity', data = data_2011[(data_2011['total'] > 0) & (data_2011['total'] < 1000)]);
plt.title("Purchased Quantity by Country");
plt.xticks(rotation = 90);
- Both the purchase total and the quantity are heavily skewed. This occurs often in real-world data and means this might be difficult to frame as a regression problem (predicting the actual spending amount).
- We might want to frame the problem as classification because the large purchase totals could throw off a machine learning algorithm. Our other option would be to remove the outliers, but given that these are probably legitimate, that does not seem like a responsible choice!
Skewness of Data¶
To see the extent of how skewed the data is, we can use an Empirical Cumulative Distribution Function (ECDF) plot.
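Before the plot, a quick numeric check of the skew; this is a small sketch using pandas' built-in sample skewness (values far from 0 indicate heavy skew):

# Quantify the skew of the purchase totals and quantities
# (Fisher-Pearson sample skewness; 0 would mean perfectly symmetric).
print('Skew of total: {:.2f}'.format(data_2011['total'].skew()))
print('Skew of Quantity: {:.2f}'.format(data_2011['Quantity'].skew()))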
def ecdf(values):
    """Return the sorted values and their empirical cumulative percentiles."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
plt.figure(figsize = (14, 4))
# Total
plt.subplot(121)
x, y = ecdf(data_2011.loc[data_2011['total'] > 0, 'total'])
plt.plot(x, y, marker = '.')
plt.xlabel('Total'); plt.ylabel('Percentile'); plt.title('ECDF of Purchase Total');
# Quantity
plt.subplot(122)
x, y = ecdf(data_2011.loc[data_2011['total'] > 0, 'Quantity'])
plt.plot(x, y, marker = '.')
plt.xlabel('Quantity'); plt.ylabel('Percentile'); plt.title('ECDF of Purchase Quantity');
- The majority of total purchases are less than $20, but there are a number of outliers.
- Rather than using a regression problem, we therefore might want to try classifying customers based on their spending per month.
Prediction Problem¶
The goal of machine learning is to predict some quantity (regression) or a label (classification).
Our data exploration showed that regression might not be the best approach because of the extreme outliers, so instead we can make the problem classification.
With this dataset, there are an unlimited number of prediction problems because there are no labels (unlike in many machine learning competitions).
Choosing a worthwhile quantity to predict therefore becomes critical. In most real-world situations, we could use a domain expert to frame a problem based on what they know is important in the field, and then it's our objective to make a set of labels and features based on that problem. This is known as prediction engineering.
We'll frame the problem as predicting whether or not a customer will spend more than 500 dollars in the next month.
This could be useful to a business because it will let them market more effectively to those customers who are likely to spend more. Moreover, an online retailer could advertise differently to customers based on their predicted class of spending.
Instead of picking just a single month for predictions, we can use each customer as a label multiple times. In other words, we not only predict whether a given customer will spend more than 500 dollars in May, but we also ask the same question in June, July, and so on.
The thing to note is that for each month, we can't use data from the future to predict the class of spending. Each month we can use information from any previous month which means that our predictions should get more accurate as we advance further in time through the data since we'll be able to use more information. Each label for a customer therefore has a different set of features because there is more or less data available to us depending on the month. Doing this by hand is very tedious and error-prone, but we'll see how Featuretools is able to handle the times associated with each label using cutoff times.
Making Labels¶
The function below takes in a start date and an end date (which we set to 30 days apart) and generates a dataframe of the labels, which depends on how much the customer spent in the period and the threshold. Our threshold will be $500 for this prediction problem.
For customers who appear in the data prior to the start date but do not make a purchase between the start and end date, we set their total to 0. If we simply did not include them in the labels, that would be cheating, because we have no way of knowing ahead of time that they will not spend anything in the next month.
data_2011.columns
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country', 'total', 'cancelled'], dtype='object')
import datetime as dt
def make_retail_cutoffs_total(start_date, end_date, threshold=500):
# Find customers who exist before start date
customer_pool = data_2011[data_2011['InvoiceDate'] < start_date]['CustomerID'].unique()
tmp = pd.DataFrame({'CustomerID': customer_pool})
# For customers in the customer pool, find their sum between the start and end date
totals = data_2011[data_2011['CustomerID'].isin(customer_pool) &
(data_2011['InvoiceDate'] > start_date) &
(data_2011['InvoiceDate'] < end_date)
].groupby('CustomerID')['total'].sum().reset_index()
# Merge with all the customer ids to record all customers who existed before start date
totals = totals.merge(tmp, on='CustomerID', how='right')
# Set the total for any customer who did not have a purchase in the timeframe equal to 0
totals['total'] = totals['total'].fillna(0)
# Label is based on the threshold
totals['label'] = (totals['total'] > threshold).astype(int)
# The cutoff time is the start date
totals['cutoff_time'] = pd.to_datetime(start_date)
totals = totals[['CustomerID', 'cutoff_time', 'total', 'label']]
return totals
may_spending = make_retail_cutoffs_total(dt.datetime(2011, 5, 1), dt.datetime(2011, 6, 1))
may_spending.head()
CustomerID | cutoff_time | total | label | |
---|---|---|---|---|
0 | 13313.0 | 2011-05-01 | 499.026 | 0 |
1 | 18097.0 | 2011-05-01 | 1318.977 | 1 |
2 | 16656.0 | 2011-05-01 | 1804.176 | 1 |
3 | 16875.0 | 2011-05-01 | 0.000 | 0 |
4 | 13094.0 | 2011-05-01 | 678.942 | 1 |
- For each customer who appeared in the data before May, we have a label for them for the month of May which is the sum of their spending in May converted to a binary label. When we make features for these labels, we can only use data from before May.
- The `cutoff_time` represents the point before which any data we use must occur, and the `label` is based on our threshold of $500.
may_spending['label'].value_counts().plot.bar();
plt.title('Label Distribution for May');
may_spending['label'].value_counts()
label 0 1689 1 483 Name: count, dtype: int64
- This is an imbalanced classification problem which means that we probably don't want to use accuracy as our metric.
Metrics¶
Instead of accuracy, we can measure performance in terms of:
- Precision: the percentage of customers predicted to spend more than $500 that actually did
- Recall : the percentage of customers that actually spent more than $500 that were correctly identified
- F1 score: the harmonic mean of precision and recall
- Receiver Operating Characteristic Area Under the Curve (ROC AUC): a 0 to 1 measure (with 1 being optimal) of a model's performance across a range of thresholds
We'll have to establish a baseline for these metrics (a little later) so we know whether machine learning is actually useful for this task.
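To make these metrics concrete, here is a tiny hand-made example using the Scikit-Learn metric functions imported above (the labels, predictions, and probabilities are invented purely for illustration):

# Toy illustration of the four metrics on 8 invented observations.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]
probs = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.3, 0.7]
print(precision_score(y_true, y_pred))  # 3 of 4 predicted positives are correct: 0.75
print(recall_score(y_true, y_pred))     # 3 of 4 actual positives are found: 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of 0.75 and 0.75: 0.75
print(roc_auc_score(y_true, probs))     # 15 of 16 positive/negative pairs ranked correctly: 0.9375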
Next we'll go ahead and make labels for the rest of the year. Keep in mind that this will generate one label for each customer for each month. We're used to thinking of a single label per customer, but since we have the data, we might as well use each customer as a training example as many times as possible (as long as we use valid data each time). A greater number of training observations should increase the predictive power of our model.
#making labels for each month
march_spending = make_retail_cutoffs_total('2011-03-01', '2011-04-01', 500)
april_spending = make_retail_cutoffs_total('2011-04-01', '2011-05-01', 500)
june_spending = make_retail_cutoffs_total('2011-06-01', '2011-07-01', 500)
july_spending = make_retail_cutoffs_total('2011-07-01', '2011-08-01', 500)
august_spending = make_retail_cutoffs_total('2011-08-01', '2011-09-01', 500)
september_spending = make_retail_cutoffs_total('2011-09-01', '2011-10-01', 500)
october_spending = make_retail_cutoffs_total('2011-10-01', '2011-11-01', 500)
november_spending = make_retail_cutoffs_total('2011-11-01', '2011-12-01', 500)
december_spending = make_retail_cutoffs_total('2011-12-01', '2012-01-01', 500)
labels = pd.concat([march_spending, april_spending, may_spending, june_spending, july_spending, august_spending,
september_spending, october_spending, november_spending, december_spending], axis = 0)
#labels.to_csv('../input/labels.csv')
print(labels.shape)
labels.label.value_counts(normalize=True)
(28133, 4)
label 0 0.829453 1 0.170547 Name: proportion, dtype: float64
- We have roughly 28,000 labels, with ~17% of them positive.
- The total is very skewed, and several customers record a negative total for some months (they had more cancellations than purchases), as the quick check below shows. By framing this as a classification problem, we don't have to worry about the outlying purchase totals throwing off our model.
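A quick check of those negative totals:

# Count customer-months where cancellations outweighed purchases,
# and look at the most extreme case.
print((labels['total'] < 0).sum())
print(labels['total'].min())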
Just to examine the data, we can plot the total spending distribution by month (with negative totals removed).
plot_labels = labels.copy()
plot_labels['month'] = plot_labels['cutoff_time'].dt.month
plt.figure(figsize = (12, 6))
sns.boxplot(x = 'month', y = 'total',
data = plot_labels[(plot_labels['total'] > 0) & (plot_labels['total'] < 1000)]);
plt.title('Customer Spending Distribution by Month');
Let's zoom in to one customer to make sure we understand the labels.
labels.loc[labels['CustomerID'] == 12347]
CustomerID | cutoff_time | total | label | |
---|---|---|---|---|
666 | 12347.0 | 2011-03-01 | 0.0000 | 0 |
666 | 12347.0 | 2011-04-01 | 1049.8125 | 1 |
666 | 12347.0 | 2011-05-01 | 0.0000 | 0 |
666 | 12347.0 | 2011-06-01 | 631.1580 | 1 |
666 | 12347.0 | 2011-07-01 | 0.0000 | 0 |
666 | 12347.0 | 2011-08-01 | 965.1015 | 1 |
666 | 12347.0 | 2011-09-01 | 0.0000 | 0 |
666 | 12347.0 | 2011-10-01 | 2135.6280 | 1 |
666 | 12347.0 | 2011-11-01 | 0.0000 | 0 |
666 | 12347.0 | 2011-12-01 | 370.9530 | 0 |
labels.loc[labels['CustomerID'] == 12347].set_index('cutoff_time')['total'].plot(figsize = (6, 4), linewidth = 3)
plt.xlabel('Date', size = 16);
plt.ylabel('Spending', size = 16);
plt.title('Monthly Spending for Customer', size = 20);
plt.xticks(size = 16); plt.yticks(size = 16);
- One customer, 10 different labels. It seems like it might be difficult to predict this customer's spending given her fluctuating total spending! We'll have to see if Featuretools is up to the task.
We now have our prediction problem all set. The next step is to start making features we can use in a machine learning model.
Featuretools Implementation¶
The first step in Featuretools is to create an `EntitySet`, which will hold all of our data and the relationships between the multiple tables (which we'll create shortly). Initially we'll add the entire `data` as an entity to the set. Since the data has a time index, we'll add that and specify the variable type of the product description.

The `time_index` represents the first time the information in that row is known. When we build features, Featuretools will use this `time_index` to filter data based on the cutoff time of the label. We can't use any purchases from after the cutoff time of the label to make features for that label.
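To make the time filtering concrete, here is a minimal hand-rolled sketch of the kind of filtering Featuretools performs for a single label (illustrative only, not the library's internal code):

# For a label with a May 1 cutoff, a feature for customer 12347 may only
# use purchases strictly before that date.
cutoff = pd.Timestamp('2011-05-01')
valid = data_2011[(data_2011['CustomerID'] == 12347) &
                  (data_2011['InvoiceDate'] < cutoff)]
# For example, the equivalent of a SUM(purchases.total) feature:
print(valid['total'].sum())

Featuretools repeats this filtering automatically for every label in the cutoff time dataframe.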
data_2011.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | total | cancelled | |
---|---|---|---|---|---|---|---|---|---|---|
42481 | 539993 | 22386 | JUMBO BAG PINK POLKADOT | 10 | 2011-01-04 10:00:00 | 3.2175 | 13313.0 | United Kingdom | 32.175 | False |
42482 | 539993 | 21499 | BLUE POLKADOT WRAP | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.325 | False |
42483 | 539993 | 21498 | RED RETROSPOT WRAP | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.325 | False |
42484 | 539993 | 22379 | RECYCLING BAG RETROSPOT | 5 | 2011-01-04 10:00:00 | 3.4650 | 13313.0 | United Kingdom | 17.325 | False |
42485 | 539993 | 20718 | RED RETROSPOT SHOPPER BAG | 10 | 2011-01-04 10:00:00 | 2.0625 | 13313.0 | United Kingdom | 20.625 | False |
es = ft.EntitySet(id="Online Retail Logs")
# Add the entire data table as an entity
es = es.add_dataframe(dataframe=data_2011,
dataframe_name="purchases",
index="purchases_index",
time_index='InvoiceDate')
es["purchases"]["Description"].metadata = {"logical_type": "NaturalLanguage"}
es["purchases"]
purchases_index | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | total | cancelled | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 539993 | 22386 | JUMBO BAG PINK POLKADOT | 10 | 2011-01-04 10:00:00 | 3.2175 | 13313.0 | United Kingdom | 32.1750 | False |
1 | 1 | 539993 | 21499 | BLUE POLKADOT WRAP | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.3250 | False |
2 | 2 | 539993 | 21498 | RED RETROSPOT WRAP | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.3250 | False |
3 | 3 | 539993 | 22379 | RECYCLING BAG RETROSPOT | 5 | 2011-01-04 10:00:00 | 3.4650 | 13313.0 | United Kingdom | 17.3250 | False |
4 | 4 | 539993 | 20718 | RED RETROSPOT SHOPPER BAG | 10 | 2011-01-04 10:00:00 | 2.0625 | 13313.0 | United Kingdom | 20.6250 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
375245 | 375245 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 1.4025 | 12680.0 | France | 16.8300 | False |
375246 | 375246 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 3.4650 | 12680.0 | France | 20.7900 | False |
375247 | 375247 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 6.8475 | 12680.0 | France | 27.3900 | False |
375248 | 375248 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 6.8475 | 12680.0 | France | 27.3900 | False |
375249 | 375249 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 8.1675 | 12680.0 | France | 24.5025 | False |
375250 rows × 11 columns
Normalizing Entities¶
In order to create new tables out of the original table, we can normalize this entity. Normalizing creates a new table with one unique row for every value of the variable we pass in, such as the customer or the product.

The code below creates a new `products` entity, where each row contains one product and the columns describe the product.
# create a new "products" entity
es.normalize_dataframe(new_dataframe_name = "products",
base_dataframe_name = "purchases",
index="StockCode",
additional_columns = ["Description"])
es['products'].head()
StockCode | Description | first_purchases_time | |
---|---|---|---|
22386 | 22386 | JUMBO BAG PINK POLKADOT | 2011-01-04 10:00:00 |
21499 | 21499 | BLUE POLKADOT WRAP | 2011-01-04 10:00:00 |
21498 | 21498 | RED RETROSPOT WRAP | 2011-01-04 10:00:00 |
22379 | 22379 | RECYCLING BAG RETROSPOT | 2011-01-04 10:00:00 |
20718 | 20718 | RED RETROSPOT SHOPPER BAG | 2011-01-04 10:00:00 |
The `first_purchases_time` column is created automatically because the `purchases` table has a time index. It represents the first time the product appears in the purchase data. Again, Featuretools will filter data from this table for each label so that we only build valid features.

We can use this table to create new features; the products table is a parent of the `purchases` table with the linking variable `StockCode`. For each product in `products`, there can be multiple purchases of that product in `purchases`.
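For intuition, an aggregation across this parent-child relationship is essentially a pandas groupby on the linking variable; here is a sketch of what a primitive such as MEAN would compute for each product:

# Hand-rolled equivalent of a MEAN(purchases.total) aggregation for the
# products parent table: group the child rows by the linking variable.
product_mean_total = data_2011.groupby('StockCode')['total'].mean()
product_mean_total.head()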
Additional Tables¶
We'll repeat the process to create tables for both the `customers` and the `orders`. `normalize_dataframe` automatically creates the relationships and time indexes, so we don't have to do that ourselves. If we want to include any additional variables in a new table, we can pass those in, but these variables must be unique to the object we are normalizing on. As an example, each order comes from one country, so we can include the country as an additional column when creating the `orders` table. However, the description is not unique to an order, so it should not be a variable that appears in the orders table.
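Before normalizing, a quick sanity check that these variables really are unique per order (each invoice should map to exactly one country and one cancellation flag):

# Each InvoiceNo should map to a single Country and a single cancelled
# value, so both of these should print 1.
print(data_2011.groupby('InvoiceNo')['Country'].nunique().max())
print(data_2011.groupby('InvoiceNo')['cancelled'].nunique().max())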
es['purchases']
purchases_index | InvoiceNo | StockCode | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | total | cancelled | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 539993 | 22386 | 10 | 2011-01-04 10:00:00 | 3.2175 | 13313.0 | United Kingdom | 32.1750 | False |
1 | 1 | 539993 | 21499 | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.3250 | False |
2 | 2 | 539993 | 21498 | 25 | 2011-01-04 10:00:00 | 0.6930 | 13313.0 | United Kingdom | 17.3250 | False |
3 | 3 | 539993 | 22379 | 5 | 2011-01-04 10:00:00 | 3.4650 | 13313.0 | United Kingdom | 17.3250 | False |
4 | 4 | 539993 | 20718 | 10 | 2011-01-04 10:00:00 | 2.0625 | 13313.0 | United Kingdom | 20.6250 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
375245 | 375245 | 581587 | 22613 | 12 | 2011-12-09 12:50:00 | 1.4025 | 12680.0 | France | 16.8300 | False |
375246 | 375246 | 581587 | 22899 | 6 | 2011-12-09 12:50:00 | 3.4650 | 12680.0 | France | 20.7900 | False |
375247 | 375247 | 581587 | 23254 | 4 | 2011-12-09 12:50:00 | 6.8475 | 12680.0 | France | 27.3900 | False |
375248 | 375248 | 581587 | 23255 | 4 | 2011-12-09 12:50:00 | 6.8475 | 12680.0 | France | 27.3900 | False |
375249 | 375249 | 581587 | 22138 | 3 | 2011-12-09 12:50:00 | 8.1675 | 12680.0 | France | 24.5025 | False |
375250 rows × 10 columns
# create a new "customers" entity based on the orders entity
es.normalize_dataframe(new_dataframe_name="customers",
base_dataframe_name="purchases",
index="CustomerID")
# create a new "orders" entity
es.normalize_dataframe(new_dataframe_name="orders",
base_dataframe_name="purchases",
index="InvoiceNo",
additional_columns=["Country", 'cancelled'])
es
Entityset: Online Retail Logs DataFrames: purchases [Rows: 375250, Columns: 8] products [Rows: 3612, Columns: 3] customers [Rows: 4244, Columns: 2] orders [Rows: 20482, Columns: 4] Relationships: purchases.StockCode -> products.StockCode purchases.CustomerID -> customers.CustomerID purchases.InvoiceNo -> orders.InvoiceNo
Deep Feature Synthesis¶
Now that our EntitySet
is defined with the proper relationships between tables, we can perform deep feature synthesis to generate 100s or 1000s of features. We can theoretically make features for any entity, but since our objective is to classify customer spending, we'll make features for each customer for each month.
Using cutoff times¶
To ensure the features are valid for each customer and month, we'll pass in the labels dataframe, which has the cutoff time for each customer for each month. Featuretools will make one row for each customer for each month, with the features for each month derived only from data prior to the cutoff time. This is extremely useful because it means we don't have to worry about using invalid data to make features. If we were doing this by hand, it would be very easy to accidentally build features for a label using information from the future, which is not allowed. When the entire dataset is available to us, we must prevent ourselves from using future data when building training features: a deployed model won't have access to data from the future, and we want models that do well in deployment.
The requirements of the cutoff time dataframe are that the first column must contain the ids corresponding to the index of the target entity and the second column must contain the cutoff times. Featuretools then takes care of the rest, making features for each month using only valid data.
The following call will generate features for each customer, resulting in a feature_matrix
where each row consists of one customer for one month corresponding to a label and each column is one feature.
labels.head()
CustomerID | cutoff_time | total | label | |
---|---|---|---|---|
0 | 13313.0 | 2011-03-01 | 0.000 | 0 |
1 | 18097.0 | 2011-03-01 | 0.000 | 0 |
2 | 16656.0 | 2011-03-01 | 589.248 | 1 |
3 | 16875.0 | 2011-03-01 | 0.000 | 0 |
4 | 13094.0 | 2011-03-01 | 115.434 | 0 |
labels.info()
<class 'pandas.core.frame.DataFrame'> Index: 28133 entries, 0 to 4196 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 28133 non-null float64 1 cutoff_time 28133 non-null datetime64[ns] 2 total 28133 non-null float64 3 label 28133 non-null int64 dtypes: datetime64[ns](1), float64(2), int64(1) memory usage: 1.1 MB
labels.rename(columns = {'cutoff_time':'time'}, inplace = True)
es
Entityset: Online Retail Logs DataFrames: purchases [Rows: 375250, Columns: 8] products [Rows: 3612, Columns: 3] customers [Rows: 4244, Columns: 2] orders [Rows: 20482, Columns: 4] Relationships: purchases.StockCode -> products.StockCode purchases.CustomerID -> customers.CustomerID purchases.InvoiceNo -> orders.InvoiceNo
feature_matrix, feature_names = ft.dfs(entityset=es, target_dataframe_name='customers',
cutoff_time = labels, verbose = 2,
cutoff_time_in_index = True,
chunk_size = len(labels), n_jobs = -1,
max_depth = 1)
feature_matrix.head()
Built 27 features
EntitySet scattered to 2 workers in 5 seconds
Elapsed: 00:38 | Progress: 100%|██████████
COUNT(purchases) | MAX(purchases.Quantity) | MAX(purchases.UnitPrice) | MAX(purchases.total) | MEAN(purchases.Quantity) | MEAN(purchases.UnitPrice) | MEAN(purchases.total) | MIN(purchases.Quantity) | MIN(purchases.UnitPrice) | MIN(purchases.total) | ... | STD(purchases.total) | SUM(purchases.Quantity) | SUM(purchases.UnitPrice) | SUM(purchases.total) | DAY(first_purchases_time) | MONTH(first_purchases_time) | WEEKDAY(first_purchases_time) | YEAR(first_purchases_time) | total | label | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CustomerID | time | |||||||||||||||||||||
13313.0 | 2011-03-01 | 17 | 25.0 | 7.0125 | 58.410 | 10.058824 | 3.692118 | 29.826176 | 4.0 | 0.6930 | 17.3250 | ... | 11.998083 | 171.0 | 62.7660 | 507.0450 | 4 | 1 | 1 | 2011 | 0.000 | 0 |
18097.0 | 2011-03-01 | 9 | 60.0 | 16.4175 | 143.550 | 30.000000 | 4.882167 | 87.021000 | 6.0 | 0.6930 | 33.2640 | ... | 41.359375 | 270.0 | 43.9395 | 783.1890 | 4 | 1 | 1 | 2011 | 0.000 | 0 |
16656.0 | 2011-03-01 | 14 | 216.0 | 14.0250 | 498.960 | 56.285714 | 5.079643 | 109.184036 | -12.0 | 0.6930 | -32.6700 | ... | 144.361356 | 788.0 | 71.1150 | 1528.5765 | 4 | 1 | 1 | 2011 | 589.248 | 1 |
16875.0 | 2011-03-01 | 52 | 24.0 | 27.9675 | 65.670 | 7.250000 | 7.433250 | 24.218192 | -8.0 | 0.4785 | -16.4175 | ... | 17.671397 | 377.0 | 386.5290 | 1259.3460 | 4 | 1 | 1 | 2011 | 0.000 | 0 |
13094.0 | 2011-03-01 | 3 | 72.0 | 1.7490 | 125.928 | 72.000000 | 1.749000 | 125.928000 | 72.0 | 1.7490 | 125.9280 | ... | 0.000000 | 216.0 | 5.2470 | 377.7840 | 4 | 1 | 1 | 2011 | 115.434 | 0 |
5 rows × 29 columns
We will want to drop the `total` and `label` columns before training because these were passed through from the `cutoff_time` data (this happens a little later). We can also remove the `MODE` of the StockCode and InvoiceNo; these are index variables and should not be used for creating features.
feature_matrix = feature_matrix.drop(columns = ['MODE(purchases.StockCode)', 'MODE(purchases.InvoiceNo)'])
feature_matrix.shape
(28133, 27)
Initially we did not generate very many features because we limited the `max_depth` to 1, which means aggregation primitives are not stacked on one another (a depth-2 feature would aggregate an aggregation, such as the mean over a customer's orders of each order's summed total).
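As a quick check, `ft.dfs` can return just the feature definitions without computing the matrix; here is a sketch of what a deeper run would build:

# Inspect feature definitions only (no computation) to see how the feature
# count grows when aggregations are allowed to stack once (max_depth = 2).
defs_depth2 = ft.dfs(entityset=es, target_dataframe_name='customers',
                     features_only=True, max_depth=2)
print(len(defs_depth2))
defs_depth2[:5]

Now let's take a look at some of the actual features. First, we'll zoom back in to a single one of the customers.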
feature_matrix.loc[12347, :].sample(10, axis = 1)
MEAN(purchases.total) | MIN(purchases.Quantity) | NUM_UNIQUE(purchases.InvoiceNo) | STD(purchases.UnitPrice) | MEAN(purchases.Quantity) | SUM(purchases.UnitPrice) | MIN(purchases.UnitPrice) | MIN(purchases.total) | total | MAX(purchases.total) | |
---|---|---|---|---|---|---|---|---|---|---|
time | ||||||||||
2011-03-01 | 27.048052 | 3.0 | 1 | 4.015815 | 10.862069 | 120.7305 | 0.6930 | 8.316 | 0.0000 | 63.1125 |
2011-04-01 | 27.048052 | 3.0 | 1 | 4.015815 | 10.862069 | 120.7305 | 0.6930 | 8.316 | 1049.8125 | 63.1125 |
2011-05-01 | 34.607660 | 3.0 | 2 | 4.088012 | 15.056604 | 223.5090 | 0.4125 | 8.316 | 0.0000 | 411.8400 |
2011-06-01 | 34.607660 | 3.0 | 2 | 4.088012 | 15.056604 | 223.5090 | 0.4125 | 8.316 | 631.1580 | 411.8400 |
2011-07-01 | 34.723437 | 2.0 | 3 | 3.954557 | 14.000000 | 311.9820 | 0.4125 | 8.316 | 0.0000 | 411.8400 |
2011-08-01 | 34.723437 | 2.0 | 3 | 3.954557 | 14.000000 | 311.9820 | 0.4125 | 8.316 | 965.1015 | 411.8400 |
2011-09-01 | 36.886726 | 2.0 | 4 | 4.128061 | 13.666667 | 424.5780 | 0.4125 | 8.316 | 0.0000 | 411.8400 |
2011-10-01 | 36.886726 | 2.0 | 4 | 4.128061 | 13.666667 | 424.5780 | 0.4125 | 8.316 | 2135.6280 | 411.8400 |
2011-11-01 | 39.757811 | 2.0 | 5 | 4.019489 | 13.907143 | 623.8320 | 0.4125 | 8.316 | 0.0000 | 411.8400 |
2011-12-01 | 39.757811 | 2.0 | 5 | 4.019489 | 13.907143 | 623.8320 | 0.4125 | 8.316 | 370.9530 | 411.8400 |
- We see that as we get deeper into the year, the numbers change for this customer because we are using more information to build the features. We would expect our predictions to get more accurate with time because we are incorporating more information.
- However, it's also possible that customer behavior changes over time and therefore using all the previous data might not actually be useful.
# feature_matrix.reset_index(inplace = True)
feature_matrix.groupby('time')['COUNT(purchases)'].mean().plot();
plt.title('Average Monthly Count of Purchases');
plt.ylabel('Purchases Per Customer');
This shows that as we progress through time, we have more purchases per customer to use for prediction.
feature_matrix.groupby('time')['SUM(purchases.Quantity)'].mean().plot();
plt.title('Average Monthly Sum of Purchased Products');
plt.ylabel('Total Purchased Products Per Customer');
Naturally, as we include more information, our forecasts should improve in accuracy. Therefore, if we are predicting purchases in November, we would expect better performance than when predicting purchases in June.
Correlations¶
As a first approximation of useful features, we can see if there are any significant correlations between the features and the total
. We'll one-hot encode the categorical features first.
feature_matrix = pd.get_dummies(feature_matrix).reset_index()
feature_matrix.shape
(28133, 3074)
corrs = feature_matrix.corr().sort_values('total')
corrs['total'].head()
MONTH(first_purchases_time)_3 -0.037102 MONTH(first_purchases_time)_4 -0.030643 MONTH(first_purchases_time)_5 -0.030183 WEEKDAY(first_purchases_time)_6 -0.026694 MIN(purchases.total) -0.026663 Name: total, dtype: float64
corrs['total'].dropna().tail()
label 0.343661 NUM_UNIQUE(purchases.InvoiceNo) 0.351811 SUM(purchases.Quantity) 0.566804 SUM(purchases.total) 0.630549 total 1.000000 Name: total, dtype: float64
A few of the features have a moderate positive correlation with the `total` (ignoring the `label` for now). The number and total of the purchases are clearly related to the total spending! Keep in mind that the features are built using only data from before the cutoff time.
g = sns.FacetGrid(feature_matrix[(feature_matrix['SUM(purchases.total)'] > 0) & (feature_matrix['SUM(purchases.total)'] < 1000)],
hue = 'label', height = 4, aspect = 3)
g.map(sns.kdeplot, 'SUM(purchases.total)')
g.add_legend();
plt.title('Distribution of Purchases Total by Label');
The sum of purchase totals prior to the month of the label is clearly higher for those customers who then went on to spend more than $500 in the next month. We would expect this to be a useful feature in modeling.
Preliminary modeling¶
We can now directly use this feature matrix for training and making predictions with a machine learning model. We'll predict one month at a time, each time training on all the previous observations. The testing features for each testing label are built using all data prior to the month of the testing label.
Model¶
For a model, we will use the RandomForestClassifier
as implemented in Scikit-Learn. We'll keep most of the hyperparameters at the default values but increase the number of trees to 1000. This is not an optimized model but should allow us to tell whether or not our solution is better than a baseline estimate.
model = RandomForestClassifier(n_estimators = 1000,
random_state = 50,
n_jobs = -1)
from sklearn.pipeline import Pipeline
The function below trains and tests for a single month. We pass in the month; the training data is subset to label-observation pairs from before the month, while the testing data comes from the month itself. This ensures that when making predictions for one month, we're only using data from before that month.
def predict_month(month, feature_matrix, return_probs = False):
"""Train and test a machine learning model using a feature set
for one month. Testing labels are from the month."""
feature_matrix['month'] = feature_matrix['time'].dt.month
# Subset labels
test_labels = feature_matrix.loc[feature_matrix['month'] == month, 'label']
train_labels = feature_matrix.loc[feature_matrix['month'] < month, 'label']
# Features
X_train = feature_matrix[feature_matrix['time'].dt.month < month].drop(columns = ['CustomerID', 'time',
'month', 'label', 'total'])
X_test = feature_matrix[feature_matrix['time'].dt.month == month].drop(columns = ['CustomerID', 'time',
'month', 'label', 'total'])
feature_names = list(X_train.columns)
# Impute and scale features
pipeline = Pipeline([('imputer', SimpleImputer(strategy = 'median')),
('scaler', MinMaxScaler())])
# Fit and transform training data
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
# Labels
y_train = np.array(train_labels).reshape((-1, ))
y_test = np.array(test_labels).reshape((-1, ))
print('Training on {} observations.'.format(len(X_train)))
print('Testing on {} observations.\n'.format(len(X_test)))
# Train
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
probs = model.predict_proba(X_test)[:, 1]
# Calculate metrics
p = precision_score(y_test, predictions)
r = recall_score(y_test, predictions)
f = f1_score(y_test, predictions)
auc = roc_auc_score(y_test, probs)
print(f'Precision: {round(p, 5)}')
print(f'Recall: {round(r, 5)}')
print(f'F1 Score: {round(f, 5)}')
print(f'ROC AUC: {round(auc, 5)}')
# Feature importances
fi = pd.DataFrame({'feature': feature_names, 'importance': model.feature_importances_})
if return_probs:
return fi, probs
return fi
june_fi = predict_month(6, feature_matrix)
Training on 5266 observations. Testing on 2489 observations. Precision: 0.58214 Recall: 0.38263 F1 Score: 0.46176 ROC AUC: 0.74479
def plot_feature_importances(df, n = 15, threshold = None):
"""Plots n most important features. Also plots the cumulative importance if
threshold is specified and prints the number of features needed to reach threshold cumulative importance.
Intended for use with any tree-based feature importances.
Args:
df (dataframe): Dataframe of feature importances. Columns must be "feature" and "importance".
n (int): Number of most important features to plot. Default is 15.
threshold (float): Threshold for cumulative importance plot. If not provided, no plot is made. Default is None.
Returns:
df (dataframe): Dataframe ordered by feature importances with a normalized column (sums to 1)
and a cumulative importance column
Note:
* Normalization in this case means sums to 1.
* Cumulative importance is calculated by summing features from most to least important
* A threshold of 0.9 will show the most important features needed to reach 90% of cumulative importance
"""
# Sort features with most important at the head
df = df.sort_values('importance', ascending = False).reset_index(drop = True)
# Normalize the feature importances to add up to one and calculate cumulative importance
df['importance_normalized'] = df['importance'] / df['importance'].sum()
df['cumulative_importance'] = np.cumsum(df['importance_normalized'])
plt.rcParams['font.size'] = 12
# Bar plot of n most important features
df.iloc[:n].plot.barh(y = 'importance_normalized',
x = 'feature', color = 'blue',
edgecolor = 'k', figsize = (12, 8),
legend = False)
plt.xlabel('Normalized Importance', size = 18); plt.ylabel('');
plt.title(f'Top {n} Most Important Features', size = 18)
plt.gca().invert_yaxis()
if threshold:
# Cumulative importance plot
plt.figure(figsize = (8, 6))
plt.plot(list(range(len(df))), df['cumulative_importance'], 'b-')
plt.xlabel('Number of Features', size = 16); plt.ylabel('Cumulative Importance', size = 16);
plt.title('Cumulative Feature Importance', size = 18);
# Number of features needed for threshold cumulative importance
# This is the index (will need to add 1 for the actual number)
importance_index = np.min(np.where(df['cumulative_importance'] > threshold))
# Add vertical line to plot
plt.vlines(importance_index + 1, ymin = 0, ymax = 1.05, linestyles = '--', colors = 'red')
plt.show();
print('{} features required for {:.0f}% of cumulative importance.'.format(importance_index + 1,
100 * threshold))
return df
We can plot the feature importances using the function. These should allow us to see what the model considers useful information for predicting spending.
norm_june_fi = plot_feature_importances(june_fi)
- The most important features are those that are most correlated with the total.
- This should give us confidence that our machine learning model is learning the important relationships and that the Featuretools features are useful for the problem.
- If we want to predict future spending next month, the best indicators are the customer's total spending to date and the total number of their purchases to date.
Comparison to Baseline¶
We calculated metrics for our model, but it's possible these numbers are no better than what we could achieve by guessing. The question to ask is how those numbers compare to an informed baseline. If our model can't beat a simple baseline, then we might want to question our approach or even whether machine learning is applicable to the problem.
For an informed baseline, let's use the amount the customer spent in the past month to predict how much they will spend in the next month. We can try this for July 2011. For the probability (used for the ROC AUC), we'll divide the previous month's total by the threshold (so spending of 0 corresponds to 0 probability) and then clip any values greater than 1 to 1.
labels.head()
CustomerID | time | total | label | |
---|---|---|---|---|
0 | 13313.0 | 2011-03-01 | 0.000 | 0 |
1 | 18097.0 | 2011-03-01 | 0.000 | 0 |
2 | 16656.0 | 2011-03-01 | 589.248 | 1 |
3 | 16875.0 | 2011-03-01 | 0.000 | 0 |
4 | 13094.0 | 2011-03-01 | 115.434 | 0 |
labels['month'] = labels['time'].dt.month
july_labels = labels[labels['month'] == 7]
june_labels = labels[labels['month'] == 6]
july_labels = july_labels.rename(columns = {'total': 'july_total'})
june_labels = june_labels.rename(columns = {'total': 'june_total'})
# Merge the current month with the previous
july_labels = july_labels.merge(june_labels[['CustomerID', 'june_total']], on = 'CustomerID', how = 'left')
july_labels['june_total'] = july_labels['june_total'].fillna(0)
july_labels['predicted_label'] = (july_labels['june_total'] > 500).astype(int)
july_labels['probability'] = july_labels['june_total'] / 500
# Set probabilities greater than 1 equal to 1
july_labels.loc[july_labels['probability'] > 1, 'probability'] = 1
july_labels.sample(10, random_state=50)
CustomerID | time | july_total | label | month | june_total | predicted_label | probability | |
---|---|---|---|---|---|---|---|---|
1647 | 14711.0 | 2011-07-01 | 0.0000 | 0 | 7 | 778.8165 | 1 | 1.00000 |
627 | 14849.0 | 2011-07-01 | 1342.9185 | 1 | 7 | 1624.8210 | 1 | 1.00000 |
2434 | 13213.0 | 2011-07-01 | 0.0000 | 0 | 7 | 221.2650 | 0 | 0.44253 |
2236 | 15258.0 | 2011-07-01 | 0.0000 | 0 | 7 | 507.3750 | 1 | 1.00000 |
1852 | 14747.0 | 2011-07-01 | 0.0000 | 0 | 7 | 0.0000 | 0 | 0.00000 |
1547 | 13946.0 | 2011-07-01 | 0.0000 | 0 | 7 | 0.0000 | 0 | 0.00000 |
241 | 13004.0 | 2011-07-01 | 508.9260 | 1 | 7 | 1121.6040 | 1 | 1.00000 |
437 | 13982.0 | 2011-07-01 | 0.0000 | 0 | 7 | 520.8060 | 1 | 1.00000 |
2043 | 14323.0 | 2011-07-01 | 0.0000 | 0 | 7 | -29.2050 | 0 | -0.05841 |
1972 | 17827.0 | 2011-07-01 | 0.0000 | 0 | 7 | 0.0000 | 0 | 0.00000 |
To test whether this is reasonable, we can find the correlation between the previous month's total and the current month's total.
july_labels['july_total'].corr(july_labels['june_total'])
0.6212795006643788
There is a moderate correlation between spending from one month to the next.
sns.lmplot(x = 'june_total', y = 'july_total', data = july_labels, fit_reg = False)
plt.title('July vs June Spending');
Let's look at the four performance metrics.
print('Precision: {:.5f}.'.format(precision_score(july_labels['label'], july_labels['predicted_label'])))
print('Recall: {:.5f}.'.format(recall_score(july_labels['label'], july_labels['predicted_label'])))
print('F1 Score: {:.5f}.'.format(f1_score(july_labels['label'], july_labels['predicted_label'])))
print('ROC AUC Score: {:.5f}.'.format(roc_auc_score(july_labels['label'], july_labels['probability'])))
Precision: 0.41784. Recall: 0.39644. F1 Score: 0.40686. ROC AUC Score: 0.64355.
We can now compare this performance to that from the model.
july_fi, july_probs = predict_month(7, feature_matrix, True)
Training on 7755 observations. Testing on 2752 observations. Precision: 0.55172 Recall: 0.32071 F1 Score: 0.40563 ROC AUC: 0.75266
For a classifier, the most important metric is the ROC AUC because it accounts for performance across all possible thresholds. We can then adjust the threshold to maximize recall, precision, or the F1 score depending on our preferences.
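As a sketch of that threshold adjustment, we can sweep candidate thresholds over the July probabilities computed above and report the one that maximizes the F1 score (the grid values are arbitrary):

# Sweep thresholds over July's predicted probabilities and pick the best F1.
y_true = feature_matrix.loc[feature_matrix['month'] == 7, 'label']
best_f1, best_t = max((f1_score(y_true, (july_probs > t).astype(int)), t)
                      for t in np.arange(0.05, 0.95, 0.05))
print('Best F1: {:.4f} at threshold {:.2f}'.format(best_f1, best_t))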
To make sure that our model is really outperforming the baseline, we can plot the Receiver Operating Characteristic Curve for the two sets of predictions.
# Calculate false positive rates and true positive rates
base_fpr, base_tpr, _ = roc_curve(july_labels['label'], july_labels['probability'])
model_fpr, model_tpr, _ = roc_curve(feature_matrix[feature_matrix['month'] == 7]['label'], july_probs)
plt.figure(figsize = (8, 6))
plt.rcParams['font.size'] = 16
# Plot both curves
plt.plot(base_fpr, base_tpr, 'b', label = 'baseline')
plt.plot(model_fpr, model_tpr, 'r', label = 'model')
plt.legend();
plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate'); plt.title('ROC Curves');
Based on the metrics and this plot, we can say that our model does indeed outperform the informed baseline guess. Machine learning, with the features from Featuretools, yields a better solution than a decent approximation.
If we were an online retailer currently using an informed guess for advertising, machine learning could make our advertising campaigns more efficient.
Let's take all of the code for comparing the baseline to the model and put it in a single function. First we need to calculate the informed baseline in a function.
def informed_baseline(month_number, threshold = 500):
"""Calculate an informed baseline for a given month.
The informed baseline is guessing the previous month's spending
for the next month. The probability is assessed by dividing
the previous month's total by the threshold and setting
any values greater than 1 to 1."""
# Subset to the months
month = labels[labels['month'] == month_number]
previous_month = labels[labels['month'] == (month_number - 1)]
previous_month = previous_month.rename(columns = {'total': 'previous_total'})
# Merge the current month with the previous month
month = month.merge(previous_month[['CustomerID', 'previous_total']], on = 'CustomerID', how = 'left')
# For customers who had no spending in the previous month, set their spending to 0
month['previous_total'] = month['previous_total'].fillna(0)
# Calculate a probability based on the previous months spending and the threshold
month['probability'] = month['previous_total'] / threshold
# Set probabilities greater than 1 equal to 1
month.loc[month['probability'] > 1, 'probability'] = 1
# Make the predicted label
month['prediction'] = (month['previous_total'] > threshold).astype(int)
# Calculate metrics
print('Precision: {:.5f}.'.format(precision_score(month['label'], month['prediction'])))
print('Recall: {:.5f}.'.format(recall_score(month['label'], month['prediction'])))
print('F1 Score: {:.5f}.'.format(f1_score(month['label'], month['prediction'])))
print('ROC AUC Score: {:.5f}.'.format(roc_auc_score(month['label'], month['probability'])))
return month
The next function compares a machine learning model trained on a set of features to the informed baseline.
def compare(month, feature_matrix):
"""Compare machine learning model to baseline performance.
Computes statistics and shows ROC curve."""
print('Baseline Performance')
baseline = informed_baseline(month)
print('\nModel Performance')
fi, probs = predict_month(month, feature_matrix, return_probs=True)
# Calculate false positive rates and true positive rates
base_fpr, base_tpr, _ = roc_curve(baseline['label'], baseline['probability'])
model_fpr, model_tpr, _ = roc_curve(feature_matrix[feature_matrix['month'] == month]['label'], probs)
plt.figure(figsize = (8, 6))
plt.rcParams['font.size'] = 16
# Plot both curves
plt.plot(base_fpr, base_tpr, 'b', label = 'baseline')
plt.plot(model_fpr, model_tpr, 'r', label = 'model')
plt.legend();
plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate'); plt.title('ROC Curves');
compare(6, feature_matrix)
Baseline Performance Precision: 0.41201. Recall: 0.46714. F1 Score: 0.43784. ROC AUC Score: 0.67322. Model Performance Training on 5266 observations. Testing on 2489 observations. Precision: 0.58214 Recall: 0.38263 F1 Score: 0.46176 ROC AUC: 0.74479
- We clearly see our model outperforming the baseline. In practice, we would optimize the model threshold for Precision/Recall/F1 Score.
Let's test December, when theoretically our model should do the best because it is training on the most data.
compare(12, feature_matrix)
Baseline Performance Precision: 0.21606. Recall: 0.57547. F1 Score: 0.31416. ROC AUC Score: 0.69498. Model Performance Training on 23936 observations. Testing on 4197 observations. Precision: 0.33038 Recall: 0.46855 F1 Score: 0.38752 ROC AUC: 0.79129
We get a slightly better ROC AUC. It's possible that the data does not stay consistent over the course of the year, so using more data might not necessarily create a better model. In other words, consumer behavior can shift (potentially due to seasonality), so the older data might not be relevant.
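One way to test that hypothesis is the `training_window` argument to `ft.dfs`, which restricts each label's features to a fixed window of data before its cutoff time; here is a minimal sketch (the 90-day window is an arbitrary choice):

# Rebuild the features using only the 90 days before each cutoff time,
# so stale behavior from early in the year is excluded.
fm_windowed, _ = ft.dfs(entityset=es, target_dataframe_name='customers',
                        cutoff_time=labels[['CustomerID', 'time', 'total', 'label']],
                        cutoff_time_in_index=True,
                        training_window='90 days',
                        max_depth=1)

We could then rerun the monthly comparison on `fm_windowed` to see whether a shorter memory helps.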
Feel free to check all of the months! It's clear that Featuretools and machine learning are able to outperform the baseline for this problem. Even without concentrating on the model, we were able to perform significantly better than a well-informed guess. With some minor hyperparameter tuning (using random search) we could achieve even better performance. We won't implement hyperparameter tuning here; it usually delivers a small performance gain relative to feature engineering (as can be seen in the other two projects in this repository).
Precision Recall Curve¶
Our model used the default threshold of 0.5 to make predictions. We can use the precision-recall curve to identify a better threshold for our model. For example, we may care more about identifying customers who will spend more than $500 than about false positives, so we would choose a threshold that yields high recall at the cost of lower precision. Conversely, we may want to limit false positives, even if that means missing some potential high-spending customers, so we would aim for higher precision. The ideal threshold for a classifier depends on the problem.
We use the ROC AUC to compare models across a range of thresholds, so we know which model is strictly better, but then we can use a precision recall curve to adjust the threshold for our business needs. This is usually a process done with the help of domain experts and validated in cross validation (or on a separate validation set of data).
def precision_recall(month, feature_matrix):
"Show the precision vs recall curve for a month"
# Find the probability
fi, probs = predict_month(month, feature_matrix,
return_probs = True)
# Calculate metrics across thresholds
precision, recall, t = precision_recall_curve(labels.loc[labels['time'].dt.month == month, 'label'],
probs)
# Plot the curve
plt.step(recall, precision, color='b', alpha=0.5,
where='post')
# Fill in the curve
plt.fill_between(recall, precision, step='post', alpha=0.5,
color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title("Precision vs. Recall Curve"); plt.show();
precision_recall(12, feature_matrix)
Training on 23936 observations. Testing on 4197 observations. Precision: 0.33038 Recall: 0.46855 F1 Score: 0.38752 ROC AUC: 0.79129
precision_recall(10, feature_matrix)
Training on 16631 observations. Testing on 3467 observations. Precision: 0.55786 Recall: 0.29841 F1 Score: 0.38883 ROC AUC: 0.72191
Depending on what we want to maximize, we can use these plots to select the appropriate threshold. If we want a recall around 80%, we would have to accept a precision near 25%.
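To turn that reading into a concrete threshold, we can search the `precision_recall_curve` output directly; here is a sketch for October (it recomputes the probabilities, so it retrains the model):

# Highest threshold that still achieves at least 80% recall for October,
# along with the precision we pay for it.
_, probs = predict_month(10, feature_matrix, return_probs=True)
y_true = feature_matrix.loc[feature_matrix['month'] == 10, 'label']
precision, recall, thresholds = precision_recall_curve(y_true, probs)
mask = recall[:-1] >= 0.80  # recall is non-increasing as the threshold rises
print('Threshold {:.3f}: precision {:.3f}, recall {:.3f}'.format(
    thresholds[mask][-1], precision[:-1][mask][-1], recall[:-1][mask][-1]))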
Below we generate probabilities for each month to plot the ROC curve compared to both the baseline and the manually engineered features. The graphs can be seen in the conclusions section. The code used to make the graphs is included in the Manual Retail Spending notebook.
%%capture
# Record probabilities for each month
for month in range(4, 13):
_, probs = predict_month(month, feature_matrix, return_probs=True)
temp_df = pd.DataFrame({'Automated': probs, 'month': month})
if month == 4:
probs_df = temp_df.copy()
else:
probs_df = pd.concat([probs_df, temp_df], ignore_index=True)
probs_df.head()
Automated | month | |
---|---|---|
0 | 0.037 | 4 |
1 | 0.080 | 4 |
2 | 0.799 | 4 |
3 | 0.176 | 4 |
4 | 0.209 | 4 |
probs_df.shape
(26851, 2)
probs_df['month'].unique()
array([ 4, 5, 6, 7, 8, 9, 10, 11, 12])
Conclusions¶
In this notebook we saw that automated feature engineering with Featuretools, used in a machine learning pipeline, results in better prediction of future customer behavior than an informed baseline guess. There are many advantages to using Featuretools in a time-series problem with multiple tables of data, such as we often encounter in real life. The key takeaways:
- Featuretools creates a rich set of relevant features from a set of related tables
- These features can deliver effective machine learning model performance
- Featuretools handles the issue of filtering data in a time-series problem
- Cutoff times for each label filter data
- All features for each label use only valid data
- Even using the default settings for Featuretools, we can create an effective machine learning model
- The Featuretools features and a random forest classifier outperformed an informed guess in terms of ROC AUC and F1 score
One could try manual feature engineering for this problem, but it would take significantly more time than with Featuretools, and it is unlikely to produce a better model.
Data science is about using the right methods to get the job done as efficiently as possible, and for this retail dataset, that method is Featuretools in a predictive modeling pipeline. The implementation with automated feature engineering is faster, safer, and just as interpretable as manual engineering work, and it delivers an effective machine learning model.
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Nine_Predictive_Engineering/Case_Studies/Automated_Retail_Spending/Automated_Retail_Spending.ipynb"