The number of restaurants in New York is growing steadily. Many students and busy professionals rely on restaurants because of their hectic lifestyles, and online food delivery gives them convenient access to good food from their favorite restaurants. The food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive direct online orders from customers. Once a restaurant confirms an order, the app assigns a company delivery person to pick it up. The delivery person uses the map to reach the restaurant and waits for the food package. Once the food package is handed over, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. After delivering the package, the delivery person confirms the drop-off in the app, and the customer can rate the order. The food aggregator earns money by collecting a fixed margin on each delivery order from the restaurants.
The food aggregator company has stored data on the orders placed by registered customers in its online portal. It wants to analyze this data to understand the demand for different restaurants, which will help enhance the customer experience. Suppose you are hired as a Data Scientist at this company, and the Data Science team has shared some key questions that need to be answered. Perform the data analysis to answer these questions and help the company improve its business.
The dataset contains several attributes related to each food order. The detailed data dictionary is given below.
# Import libraries for data manipulation
import numpy as np
import pandas as pd
# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Import libraries for statistical analysis
from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import linregress
# Perform math
import math
import warnings
warnings.filterwarnings('ignore')
# Suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Connect collab
from google.colab import drive
drive.mount('/content/drive')
# Load data from csv file
data = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/foodhub_order.csv')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# returns the first 5 rows
data.head()
 | order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time
---|---|---|---|---|---|---|---|---|---
0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | Not given | 25 | 20 |
1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | Not given | 25 | 23 |
2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.20 | Weekend | 3 | 25 | 15 |
4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |
# returns the last 5 rows
data.tail()
 | order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time
---|---|---|---|---|---|---|---|---|---
1893 | 1476701 | 292602 | Chipotle Mexican Grill $1.99 Delivery | Mexican | 22.31 | Weekend | 5 | 31 | 17 |
1894 | 1477421 | 397537 | The Smile | American | 12.18 | Weekend | 5 | 31 | 19 |
1895 | 1477819 | 35309 | Blue Ribbon Sushi | Japanese | 25.22 | Weekday | Not given | 31 | 24 |
1896 | 1477513 | 64151 | Jack's Wife Freda | Mediterranean | 12.18 | Weekday | 5 | 23 | 31 |
1897 | 1478056 | 120353 | Blue Ribbon Sushi | Japanese | 19.45 | Weekend | Not given | 28 | 24 |
The DataFrame has 9 columns, as described in the Data Dictionary. Each row corresponds to an order placed by a customer.
# Determine the number of rows and columns by calling data.shape
data.shape
(1898, 9)
# Use info() to print a concise summary of the DataFrame
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   order_id               1898 non-null   int64
 1   customer_id            1898 non-null   int64
 2   restaurant_name        1898 non-null   object
 3   cuisine_type           1898 non-null   object
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object
 6   rating                 1898 non-null   object
 7   food_preparation_time  1898 non-null   int64
 8   delivery_time          1898 non-null   int64
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB
Number of Entries: The DataFrame contains 1898 entries, ranging from index 0 to 1897.
Number of Columns: There are 9 columns in total.
Column Names and Types: order_id, customer_id, food_preparation_time, and delivery_time are int64; cost_of_the_order is float64; restaurant_name, cuisine_type, day_of_the_week, and rating are object.
Non-Null Counts: Each column has 1898 non-null entries, indicating that there are no missing values in any of the columns.
Memory Usage: The DataFrame uses approximately 133.6 KB of memory.
Data Types Distribution: float64(1), int64(4), object(4).
These observations indicate a well-formed DataFrame with complete data and a mix of numerical and categorical data types.
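One caveat worth checking: isnull() reports zero missing values, but the rating column stores the literal string 'Not given', which slips past that check. A minimal sketch of both checks, using a small synthetic frame in place of `data`:

```python
import pandas as pd

# Small stand-in for the real DataFrame (illustrative values only)
df = pd.DataFrame({
    'rating': ['Not given', '5', '3', 'Not given', '4'],
    'delivery_time': [20, 23, 28, 15, 24],
})

# isnull() sees no missing values...
print(df['rating'].isnull().sum())   # 0

# ...but counting the sentinel string reveals the hidden gaps
hidden = (df['rating'] == 'Not given').sum()
print(hidden)                        # 2
```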
# Check the statistical summary of the DataFrame
summary = data.describe()
# Minimum food preparation time
food_prep_min = data['food_preparation_time'].min()
# Average (mean) food preparation time
food_prep_avg = round(data['food_preparation_time'].mean(), 2)
# Maximum food preparation time
food_prep_max = data['food_preparation_time'].max()
# Format the summary using pandas Styler
summary_styled = summary.style.set_table_styles(
[{'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(**{'text-align': 'center'})
# Display the styled summary
display(summary_styled)
# Create a DataFrame for the food preparation times
prep_times_df = pd.DataFrame({
'Metric': ['Minimum', 'Average', 'Maximum'],
'Food Preparation Time (minutes)': [food_prep_min, food_prep_avg, food_prep_max]
})
print("\n")
# Format the preparation times DataFrame using pandas Styler
prep_times_styled = prep_times_df.style.set_table_styles(
[{'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(**{'text-align': 'center'})
# Display the styled preparation times DataFrame
display(prep_times_styled)
 | order_id | customer_id | cost_of_the_order | food_preparation_time | delivery_time
---|---|---|---|---|---
count | 1898.000000 | 1898.000000 | 1898.000000 | 1898.000000 | 1898.000000 |
mean | 1477495.500000 | 171168.478398 | 16.498851 | 27.371970 | 24.161749 |
std | 548.049724 | 113698.139743 | 7.483812 | 4.632481 | 4.972637 |
min | 1476547.000000 | 1311.000000 | 4.470000 | 20.000000 | 15.000000 |
25% | 1477021.250000 | 77787.750000 | 12.080000 | 23.000000 | 20.000000 |
50% | 1477495.500000 | 128600.000000 | 14.140000 | 27.000000 | 25.000000 |
75% | 1477969.750000 | 270525.000000 | 22.297500 | 31.000000 | 28.000000 |
max | 1478444.000000 | 405334.000000 | 35.410000 | 35.000000 | 33.000000 |
 | Metric | Food Preparation Time (minutes)
---|---|---
0 | Minimum | 20.000000 |
1 | Average | 27.370000 |
2 | Maximum | 35.000000 |
Based on the statistical summary of the food_preparation_time column, preparation times range from a minimum of 20 minutes to a maximum of 35 minutes, with a mean of 27.37 minutes, so all orders fall within a fairly narrow 15-minute band.
# Count the number of non-rated orders
# Find unique values for data['rating'] column
unique = data['rating'].unique()
print(unique)
# Non-rated orders are those where data['rating'] equals the string 'Not given'
non_rated_orders = data[data['rating'] == 'Not given'].shape[0]
print(f"Number of orders not rated: {non_rated_orders}")
['Not given' '5' '3' '4']
Number of orders not rated: 736
Upon analyzing the rating column, 736 of the 1898 orders (about 39%) are marked 'Not given'.
Interestingly, there are no ratings lower than 3.
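For numeric work on ratings (means, correlations), the 'Not given' sentinel must first be converted to NaN so pandas skips it automatically; a minimal sketch with synthetic values:

```python
import numpy as np
import pandas as pd

ratings = pd.Series(['Not given', '5', '3', '4', 'Not given'])

# Replace the sentinel with NaN and cast to float; NaN is ignored by mean()
numeric = ratings.replace('Not given', np.nan).astype(float)
print(numeric.mean())  # 4.0 — the mean of 5, 3 and 4
```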
# Examine the characteristics of each variable
# Analyze the distribution of customer_id
CID_Distribution = data['customer_id'].value_counts()
CID_max = CID_Distribution.max() # Most frequent customer
CID_min = CID_Distribution.min() # Least frequent customer(s)
CID_mean = CID_Distribution.mean() # Average # of visits by customer
# count the number of customers that visited only once and calculate %
CID_once = (CID_Distribution == 1).sum()
CID_once_percent = round((CID_once / CID_Distribution.shape[0]) * 100, 2)
# Plotting the distribution
plt.figure(figsize=(18, 6))
# Histogram of customer visit frequency
plt.subplot(1, 2, 1)
plt.hist(CID_Distribution, bins=range(1, CID_Distribution.max() + 1), edgecolor='black', color='orange')
plt.axvline(CID_mean, color='red', linestyle='dashed', linewidth=2, label=f'avg # of visits: {CID_mean:.2f}')
plt.title('Customer Visit Frequency')
plt.xlabel('Number of Visits')
plt.ylabel('Number of Customers')
plt.legend()
# Boxplot of customer visit frequency
plt.subplot(1, 2, 2)
plt.boxplot(CID_Distribution, vert=False)
plt.title('Customer Visit Frequency')
plt.xlabel('Number of Visits')
plt.tight_layout()
plt.show()
# Analyze the distribution of restaurant_name
unique_restaurants = data['restaurant_name'].nunique()
# Analyze the distribution of orders per restaurant
restaurant_order_distribution = data['restaurant_name'].value_counts()
most_orders = restaurant_order_distribution.max()
least_orders = restaurant_order_distribution.min()
average_orders = restaurant_order_distribution.mean()
# Names of the most and least popular restaurants
most_popular_restaurant = restaurant_order_distribution.idxmax()
least_popular_restaurant = restaurant_order_distribution.idxmin()
# Count the number of restaurants with only 1 order
restaurants_with_one_order = (restaurant_order_distribution == 1).sum()
# Get the top 50 restaurants by number of orders (more than this makes the x-axis labels illegible)
top_50_restaurants = restaurant_order_distribution.head(50)
# Visualize the distribution of orders for the top 50 restaurants
plt.figure(figsize=(18, 6))
top_50_restaurants.plot(kind='bar')
plt.title('Top 50 Restaurants by Number of Orders')
plt.xlabel('Restaurant Name')
plt.ylabel('Number of Orders')
# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontsize=8)
plt.show()
# Set display options for wider columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 300)
pd.set_option('display.max_colwidth', None)
# Analyze the distribution of cuisine types
cuisine_type_distribution = data['cuisine_type'].value_counts()
# Calculate the most and least popular cuisine types
most_popular_cuisine = cuisine_type_distribution.idxmax()
least_popular_cuisine = cuisine_type_distribution.idxmin()
# Visualize the distribution of cuisine types
plt.figure(figsize=(18, 6))
cuisine_type_distribution.plot(kind='bar')
plt.title('Distribution of Cuisine Types')
plt.xlabel('Cuisine Type')
plt.ylabel('Number of Orders')
plt.xticks(rotation=90, fontsize=8)
plt.show()
# Analyze the distribution of order costs
order_cost_distribution = data['cost_of_the_order']
# Calculate the statistics
max_cost = order_cost_distribution.max()
min_cost = order_cost_distribution.min()
mean_cost = order_cost_distribution.mean()
# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))
# Histogram of order costs with a vertical line for the mean cost
axes[0].hist(order_cost_distribution, bins=30, edgecolor='black')
axes[0].axvline(mean_cost, color='red', linestyle='dashed', linewidth=2, label=f'avg cost of the order: ${mean_cost:.2f}')
axes[0].set_title('Distribution of Order Costs')
axes[0].set_xlabel('Order Cost')
axes[0].set_ylabel('Number of Orders')
axes[0].legend()
# Box plot for the order costs
axes[1].boxplot(order_cost_distribution, vert=False)
axes[1].set_title('Box Plot of Order Costs')
axes[1].set_xlabel('Order Cost')
# Adjust layout
plt.tight_layout()
plt.show()
# Analyze the distribution of orders by the day of the week
day_of_week_distribution = data['day_of_the_week'].value_counts()
# Visualize the distribution of orders by the day of the week
plt.figure(figsize=(18, 6))
day_of_week_distribution.plot(kind='bar')
plt.title('Distribution of Orders by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Orders')
plt.xticks(rotation=0, fontsize=8)
plt.show()
# Analyze the distribution of ratings
rating_distribution = data['rating'].value_counts()
# Visualize the distribution of ratings
plt.figure(figsize=(18, 6))
rating_distribution.plot(kind='bar')
plt.title('Distribution of Customer Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Orders')
plt.xticks(rotation=0, fontsize=8)
plt.show()
# Analyze the distribution of food preparation times
prep_time_distribution = data['food_preparation_time']
# Calculate the statistics
max_prep_time = prep_time_distribution.max()
min_prep_time = prep_time_distribution.min()
mean_prep_time = prep_time_distribution.mean()
# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))
# Histogram of food preparation times with a vertical line for the mean time
axes[0].hist(prep_time_distribution, bins=30, edgecolor='black')
axes[0].axvline(mean_prep_time, color='red', linestyle='dashed', linewidth=2, label=f'avg prep time: {mean_prep_time:.2f} mins')
axes[0].set_title('Distribution of Food Preparation Times')
axes[0].set_xlabel('Preparation Time (minutes)')
axes[0].set_ylabel('Number of Orders')
axes[0].legend()
# Box plot for the food preparation times
axes[1].boxplot(prep_time_distribution, vert=False)
axes[1].set_title('Box Plot of Food Preparation Times')
axes[1].set_xlabel('Preparation Time (minutes)')
# Adjust layout
plt.tight_layout()
plt.show()
# Analyze the distribution of delivery times
delivery_time_distribution = data['delivery_time']
# Calculate the statistics
max_delivery_time = delivery_time_distribution.max()
min_delivery_time = delivery_time_distribution.min()
mean_delivery_time = delivery_time_distribution.mean()
# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))
# Histogram of delivery times with a vertical line for the mean time
axes[0].hist(delivery_time_distribution, bins=30, edgecolor='black')
axes[0].axvline(mean_delivery_time, color='red', linestyle='dashed', linewidth=2, label=f'avg delivery time: {mean_delivery_time:.2f} mins')
axes[0].set_title('Distribution of Delivery Times')
axes[0].set_xlabel('Delivery Time (minutes)')
axes[0].set_ylabel('Number of Orders')
axes[0].legend()
# Box plot for the delivery times
axes[1].boxplot(delivery_time_distribution, vert=False)
axes[1].set_title('Box Plot of Delivery Times')
axes[1].set_xlabel('Delivery Time (minutes)')
# Adjust layout
plt.tight_layout()
plt.show()
# Get the top 5 restaurants by number of orders
top_5_restaurants = restaurant_order_distribution.head(5)
#top_5_restaurants
# Convert the series to a DataFrame for better formatting
top_5_restaurants_df = top_5_restaurants.reset_index()
top_5_restaurants_df.columns = ['Restaurant Name', 'Number of Orders']
# Format the table using pandas Styler
top_5_restaurants_styled = top_5_restaurants_df.style.set_caption("Top 5 Restaurants by Number of Orders<br><br>").set_table_styles(
[{'selector': 'caption', 'props': [('caption-side', 'top'), ('font-size', '16px'), ('font-weight', 'bold'), ('text-align', 'center')]},
{'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(subset=['Restaurant Name'], **{'text-align': 'left'}).set_properties(subset=['Number of Orders'], **{'text-align': 'center'}).hide(axis="index")
# Display the styled table
top_5_restaurants_styled
Restaurant Name | Number of Orders |
---|---|
Shake Shack | 219 |
The Meatball Shop | 132 |
Blue Ribbon Sushi | 119 |
Blue Ribbon Fried Chicken | 96 |
Parm | 68 |
Dominance of Shake Shack:
Shake Shack leads by a wide margin with 219 orders, well ahead of The Meatball Shop's 132.
Significant Order Volumes:
Together the top 5 restaurants account for 634 orders, roughly a third of the 1898 orders in the dataset.
Diverse Cuisine Types:
The top 5 restaurants represent a variety of cuisines, including American fast food (Shake Shack), Italian (The Meatball Shop, Parm), and Japanese (Blue Ribbon Sushi). This variety reflects a broad range of customer preferences.
Distribution Drop-off:
There is a noticeable drop in order counts from the third to the fifth restaurant (119 down to 68), showing a significant difference in popularity even among the top restaurants.
# Filter the data for weekends
weekend_data = data[data['day_of_the_week'] == 'Weekend']
# Analyze the distribution of cuisine types on weekends
weekend_cuisine_distribution = weekend_data['cuisine_type'].value_counts()
# Get the most popular cuisine on weekends
most_popular_cuisine_weekend = weekend_cuisine_distribution.idxmax()
most_popular_cuisine_count = weekend_cuisine_distribution.max()
# Get the second most popular cuisine on weekends (for comparison)
second_most_popular_cuisine_weekend = weekend_cuisine_distribution.index[1]
second_most_popular_cuisine_count = weekend_cuisine_distribution.iloc[1]
# Create a DataFrame for the most popular weekend cuisine
most_popular_cuisine_df = pd.DataFrame({
'Cuisine Type': [most_popular_cuisine_weekend],
'Number of Orders': [most_popular_cuisine_count]
})
# Format the table using pandas Styler
most_popular_cuisine_styled = most_popular_cuisine_df.style.set_caption("Most Popular Cuisine on Weekends<br><br>").set_table_styles(
[{'selector': 'caption', 'props': [('caption-side', 'top'), ('font-size', '16px'), ('font-weight', 'bold'), ('text-align', 'center')]},
{'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(subset=['Cuisine Type'], **{'text-align': 'left'}).set_properties(subset=['Number of Orders'], **{'text-align': 'center'}).hide(axis="index")
# Display the styled table
most_popular_cuisine_styled
Cuisine Type | Number of Orders |
---|---|
American | 415 |
Most Popular Cuisine:
The most popular cuisine on weekends is American with 415 orders, comfortably ahead of the second most popular cuisine, Japanese, at 335 orders.
High Volume of Orders:
The number of orders for the most popular cuisine on weekends is significantly high, showing that weekends are a busy time for American cuisine.
Customer Preferences:
The popularity of American cuisine on weekends might be due to the types of meals associated with this cuisine, such as burgers, fries, and other fast food items, which are often considered comfort or leisure food.
# Calculate the percentage of orders that cost more than 20 dollars
total_orders = data.shape[0]
orders_above_20 = data[data['cost_of_the_order'] > 20].shape[0]
percentage_above_20 = (orders_above_20 / total_orders) * 100
# Print the result
print(f"Percentage of orders that cost more than 20 dollars: {percentage_above_20:.2f}%")
Percentage of orders that cost more than 20 dollars: 29.24%
The analysis reveals that a significant proportion (29.24%) of orders exceed $20, providing valuable insights into customer spending behavior and opportunities for revenue optimization. Restaurants can leverage this information to enhance their pricing strategies, marketing efforts, and overall customer experience.
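The brief says FoodHub collects a fixed margin on each order but does not state the rate, so any revenue figure here is an assumption. Purely for illustration, with hypothetical tiered rates (25% on orders above $20, 15% otherwise), per-order commission could be sketched as:

```python
import pandas as pd

costs = pd.Series([30.75, 12.08, 22.31, 11.59])  # sample order costs

# Hypothetical tiered commission; these rates are illustrative, not FoodHub's
def commission(cost, high_rate=0.25, low_rate=0.15, threshold=20):
    return cost * (high_rate if cost > threshold else low_rate)

revenue = costs.apply(commission).sum()
print(revenue)
```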
# Previously calculated mean_delivery_time
print("The mean order delivery time is %.2f minutes." % mean_delivery_time)
The mean order delivery time is 24.16 minutes.
The analysis reveals that the mean delivery time for orders is approximately 24.16 minutes.
This information is useful for understanding operational efficiency, setting customer expectations, and identifying areas for potential improvement in the delivery process.
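Since the mean alone hides the slow tail of deliveries, it can help to look at the median and an upper percentile as well; a sketch using a handful of synthetic delivery times in place of the real column:

```python
import pandas as pd

delivery = pd.Series([15, 20, 23, 24, 25, 28, 31, 33])  # illustrative minutes

print(delivery.median())       # the typical order
print(delivery.quantile(0.9))  # 90% of orders arrive within this time
```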
# Top three customers using the previously calculated Customer ID Distribution
top_3_customers = CID_Distribution.head(3).reset_index()
top_3_customers.columns = ['Customer ID', 'Number of Orders']
# Format the table using pandas Styler
top_3_customers_styled = top_3_customers.style.set_caption("Top 3 Customers by Number of Orders<br><br>").set_table_styles(
[{'selector': 'caption', 'props': [('caption-side', 'top'), ('font-size', '16px'), ('font-weight', 'bold'), ('text-align', 'center')]},
{'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(subset=['Customer ID'], **{'text-align': 'left'}).set_properties(subset=['Number of Orders'], **{'text-align': 'center'}).hide(axis="index")
# Display the styled table
top_3_customers_styled
Customer ID | Number of Orders |
---|---|
52832 | 13 |
47440 | 10 |
83287 | 9 |
The three most frequent customers placed 13, 10, and 9 orders respectively (customer IDs 52832, 47440, and 83287). These are the company's most engaged users and natural candidates for loyalty or retention offers.
# Correlation matrix of numerical variables
# Select only numerical columns
numerical_data = data.select_dtypes(include=['float64', 'int64'])
# Calculate the correlation matrix
corr_matrix = numerical_data.corr()
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='vlag')
plt.title('Correlation Matrix of Numerical Variables')
plt.show()
Relationship Between Food Preparation Time and Delivery Time:
The correlation between food_preparation_time and delivery_time is close to zero, indicating that kitchen preparation time and delivery performance vary largely independently of each other.
Cost of Order and Delivery Time:
The correlation between cost_of_the_order and delivery_time is also near zero, so higher-cost orders do not systematically take longer to deliver.
Cost of Order and Food Preparation Time:
The correlation between cost_of_the_order and food_preparation_time is weakly positive at best, suggesting that order cost has little bearing on preparation time.
Summary:
Customer Order Frequency
Repeat Customer Analysis
# Calculate the frequency of orders per customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']
# Style the table using pandas Styler
styled_table = order_frequency.style.background_gradient(cmap='viridis').set_caption("Frequency of Orders per Customer")
# Display the styled table
styled_table
# Plotting the frequency of orders per customer
plt.figure(figsize=(10, 6))
plt.hist(order_frequency['order_count'], bins=30, edgecolor='black')
plt.title('Frequency of Orders per Customer')
plt.xlabel('Number of Orders')
plt.ylabel('Number of Customers')
plt.show()
# Pearson correlation between 'cost_of_the_order' and 'delivery_time'
correlation_score = data['cost_of_the_order'].corr(data['delivery_time'])
print(f"Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': {correlation_score:.3f}")
Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': -0.030
Frequency of Orders per Customer
Distribution of Orders:
The histogram is heavily right-skewed: most customers placed only one or two orders, while a small number of customers ordered far more frequently.
Pearson Correlation
Cost of the Order vs. Delivery Time:
The Pearson correlation score between cost_of_the_order and delivery_time is approximately -0.030.
This very weak negative correlation indicates that there is no significant linear relationship between the cost of an order and its delivery time in this dataset.
This implies that delivery times are relatively consistent regardless of the order cost, suggesting other factors might be influencing delivery time.
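scipy's pearsonr (imported at the top of this notebook) also returns a p-value, which helps judge whether a coefficient as small as -0.030 is distinguishable from zero. A sketch on toy data constructed to have no linear trend:

```python
from scipy.stats import pearsonr

cost = [10.0, 15.0, 20.0, 25.0]   # illustrative order costs
delivery = [20, 25, 25, 20]       # no linear trend against cost

r, p = pearsonr(cost, delivery)
print(r, p)  # r is 0 for this construction; p near 1 means no evidence of a trend
```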
# Calculate the average order value per customer
average_order_value = data.groupby('customer_id')['cost_of_the_order'].mean().reset_index()
average_order_value.columns = ['customer_id', 'average_order_value']
# Style the table using pandas Styler
styled_average_order_value = average_order_value.style.background_gradient(cmap='viridis').set_caption("Average Order Value per Customer")
# Display the styled table
styled_average_order_value
# Plotting the average order value per customer
plt.figure(figsize=(10, 6))
plt.hist(average_order_value['average_order_value'], bins=30, edgecolor='black')
plt.title('Average Order Value per Customer')
plt.xlabel('Average Order Value')
plt.ylabel('Number of Customers')
plt.show()
Average Order Value per Customer
Distribution of Average Order Values:
The histogram shows the distribution of average order values per customer. Most customers have a lower average order value, with fewer customers having higher average order values.
This trend indicates that a majority of customers tend to place orders of smaller amounts, while a smaller group of customers contributes to higher-value orders.
Customer Spending Patterns:
Identifying the customers with higher average order values can help in tailoring marketing strategies to retain and engage these high-value customers.
Conversely, understanding why most customers place smaller orders might reveal opportunities for upselling or bundling products to increase the average order value.
# Calculate the percentage of repeat customers
total_customers = data['customer_id'].nunique()
repeat_customers = data['customer_id'].value_counts()[data['customer_id'].value_counts() > 1].count()
percentage_repeat_customers = (repeat_customers / total_customers) * 100
# Display the percentage of repeat customers
print(f"Percentage of Repeat Customers: {percentage_repeat_customers:.2f}%")
# Plotting the percentage of repeat customers
labels = ['Repeat Customers', 'One-time Customers']
sizes = [percentage_repeat_customers, 100 - percentage_repeat_customers]
colors = ['#ff9999','#66b3ff']
explode = (0.1, 0) # explode the first slice
plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Percentage of Repeat Customers')
plt.show()
Percentage of Repeat Customers: 34.67%
Repeat Customer Rate:
About 34.67% of customers placed more than one order, meaning roughly two out of three customers ordered only once. Converting these one-time customers into repeat customers is a clear growth opportunity.
# Identify repeat and one-time customers
repeat_customers_ids = data['customer_id'].value_counts()[data['customer_id'].value_counts() > 1].index
one_time_customers_ids = data['customer_id'].value_counts()[data['customer_id'].value_counts() == 1].index
# Separate data for repeat and one-time customers
repeat_customers_data = data[data['customer_id'].isin(repeat_customers_ids)]
one_time_customers_data = data[data['customer_id'].isin(one_time_customers_ids)]
# Calculate average rating and order cost for repeat customers
repeat_customers_avg_rating = repeat_customers_data[repeat_customers_data['rating'] != 'Not given']['rating'].astype(float).mean()
repeat_customers_avg_order_cost = repeat_customers_data['cost_of_the_order'].mean()
# Calculate average rating and order cost for one-time customers
one_time_customers_avg_rating = one_time_customers_data[one_time_customers_data['rating'] != 'Not given']['rating'].astype(float).mean()
one_time_customers_avg_order_cost = one_time_customers_data['cost_of_the_order'].mean()
# Display the results
print(f"Average Rating for Repeat Customers: {repeat_customers_avg_rating:.2f}")
print(f"Average Order Cost for Repeat Customers: ${repeat_customers_avg_order_cost:.2f}")
print(f"Average Rating for One-time Customers: {one_time_customers_avg_rating:.2f}")
print(f"Average Order Cost for One-time Customers: ${one_time_customers_avg_order_cost:.2f}")
# Plotting the results
categories = ['Repeat Customers', 'One-time Customers']
avg_ratings = [repeat_customers_avg_rating, one_time_customers_avg_rating]
avg_order_costs = [repeat_customers_avg_order_cost, one_time_customers_avg_order_cost]
fig, ax1 = plt.subplots(figsize=(10, 6))
color = 'tab:blue'
ax1.set_xlabel('Customer Type')
ax1.set_ylabel('Average Rating', color=color)
ax1.bar(categories, avg_ratings, color=color, alpha=0.6, label='Average Rating')
ax1.tick_params(axis='y', labelcolor=color)
ax2 = ax1.twinx()
color = 'tab:green'
ax2.set_ylabel('Average Order Cost ($)', color=color)
ax2.plot(categories, avg_order_costs, color=color, marker='o', linestyle='-', linewidth=2, markersize=8, label='Average Order Cost')
ax2.tick_params(axis='y', labelcolor=color)
fig.tight_layout()
plt.title('Average Rating and Order Cost for Repeat Customers vs One-time Customers')
fig.legend(loc='upper right', bbox_to_anchor=(0.9,0.9))
plt.show()
Average Rating for Repeat Customers: 4.34
Average Order Cost for Repeat Customers: $16.49
Average Rating for One-time Customers: 4.35
Average Order Cost for One-time Customers: $16.52
Average Rating:
Repeat and one-time customers rate their orders almost identically (4.34 vs 4.35), so satisfaction alone does not explain why some customers return.
Average Order Cost:
Average order cost is likewise nearly identical ($16.49 vs $16.52), indicating that repeat customers do not spend more per order than one-time customers.
Order Cost by Cuisine Type
Avg Cost by Day of the Week and Cuisine Type
Order Frequency by Weekday/Weekend and Cuisine Type
Order Cost by Preparation Time, Delivery Time and Total Time
Order Frequency by Order Cost
# Box Plot of Order Cost by Cuisine Type
plt.figure(figsize=(12, 6))
sns.boxplot(x='cuisine_type', y='cost_of_the_order', data=data)
plt.title('Box Plot of Order Cost by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Order Cost ($)')
plt.xticks(rotation=90)
plt.show()
# Encode the 'cuisine_type' column
data['cuisine_type_encoded'] = data['cuisine_type'].astype('category').cat.codes
# Calculate the Pearson correlation coefficient between encoded cuisine type and order cost
correlation_cost_cuisine = data['cuisine_type_encoded'].corr(data['cost_of_the_order'])
# Print the Pearson correlation coefficient
print(f"Pearson correlation coefficient between encoded cuisine type and order cost: {correlation_cost_cuisine:.2f}")
Pearson correlation coefficient between encoded cuisine type and order cost: 0.04
The very weak correlation between cuisine type and order cost indicates that, while there may be some variation in costs among different cuisines, the overall impact of cuisine type on order cost is minimal. Other factors are likely more significant in determining the cost of an order.
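A caveat on the method itself: Pearson correlation against arbitrary integer category codes depends on the (alphabetical) ordering of the cuisines, so a near-zero value is expected regardless of any real effect. An ordering-free sketch compares mean cost per cuisine directly, optionally with a one-way ANOVA via scipy's f_oneway (synthetic values stand in for the real columns):

```python
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in for the real columns
df = pd.DataFrame({
    'cuisine_type': ['American', 'American', 'Japanese', 'Japanese', 'Thai', 'Thai'],
    'cost_of_the_order': [12.0, 18.0, 14.0, 16.0, 20.0, 24.0],
})

# Ordering-free view: mean cost per cuisine
means = df.groupby('cuisine_type')['cost_of_the_order'].mean()
print(means)

# One-way ANOVA: do mean costs differ across cuisines?
groups = [g['cost_of_the_order'].to_numpy() for _, g in df.groupby('cuisine_type')]
stat, p = f_oneway(*groups)
print(f"F = {stat:.2f}, p = {p:.3f}")
```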
# Create a pivot table for average order cost by day of the week and cuisine type
pivot_table = data.pivot_table(values='cost_of_the_order', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')
# Visualize the pivot table as a heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(pivot_table, annot=True, cmap='Reds')
plt.title('Average Order Cost by Day of the Week and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Day of the Week')
plt.show()
Summary:
The heatmap shows that average order costs are broadly similar across cuisine types on both weekdays and weekends, indicating that neither cuisine type nor the weekday/weekend split has a strong effect on the cost of an order.
# Create a pivot table for order frequency by Weekday and Weekend and cuisine type
order_frequency = data.pivot_table(values='order_id', index='day_of_the_week', columns='cuisine_type', aggfunc='count')
# Visualize the heatmap with adjusted plot width to avoid using exponents
plt.figure(figsize=(14, 8))
sns.heatmap(order_frequency, annot=True, fmt='d', cmap='Reds')
plt.title('Heatmap of Order Frequency by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()
Higher Order Frequencies on Weekends:
The heatmap shows that the order frequencies are generally higher on Weekends compared to Weekdays across various cuisine types.
For example, American, Chinese, Italian and Japanese cuisines have appreciably higher order frequencies on Weekends.
Popular Cuisine Types:
American and Japanese cuisines have the highest order frequencies on both Weekdays and Weekends.
Chinese and Italian cuisines also show high order frequencies, especially on Weekends.
Consistent Popularity:
The leading cuisines (American, Japanese, Chinese, and Italian) hold the same top positions on both Weekdays and Weekends, so the popularity ranking stays consistent across the week.
Less Popular Cuisines:
Cuisines like French, Southern, Thai, and Vietnamese have lower order frequencies across both Weekdays and Weekends.
These cuisines might have a niche customer base or less overall demand.
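Raw counts can be complemented with shares of total volume, which make "niche" quantitative. A small sketch (the `orders` Series is illustrative; with the real dataset, use `data['cuisine_type']`):

```python
# value_counts(normalize=True) converts counts into each cuisine's share of all orders
import pandas as pd

orders = pd.Series(['American', 'American', 'Japanese', 'Chinese', 'American'],
                   name='cuisine_type')
shares = orders.value_counts(normalize=True)
print(shares)  # American 0.6, Japanese 0.2, Chinese 0.2
```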
Weekday vs Weekend Trends:
Across nearly all cuisines, Weekend order counts exceed Weekday counts, so weekend demand dominates overall volume.
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']
# Plotting Order Cost by Preparation Time
plt.figure(figsize=(10, 6))
plt.scatter(data['cost_of_the_order'], data['food_preparation_time'], alpha=0.6, edgecolors='w', linewidth=0.5)
plt.title('Order Cost by Preparation Time')
plt.xlabel('Order Cost ($)')
plt.ylabel('Preparation Time (minutes)')
plt.grid(True)
plt.show()
# Plotting Order Cost by Delivery Time
plt.figure(figsize=(10, 6))
plt.scatter(data['cost_of_the_order'], data['delivery_time'], alpha=0.6, edgecolors='w', linewidth=0.5)
plt.title('Order Cost by Delivery Time')
plt.xlabel('Order Cost ($)')
plt.ylabel('Delivery Time (minutes)')
plt.grid(True)
plt.show()
# Plotting Order Cost by Total Time
plt.figure(figsize=(10, 6))
plt.scatter(data['cost_of_the_order'], data['total_time'], alpha=0.6, edgecolors='w', linewidth=0.5)
plt.title('Order Cost by Total Time')
plt.xlabel('Order Cost ($)')
plt.ylabel('Total Time (minutes)')
plt.grid(True)
plt.show()
# Calculate Pearson correlation scores
pearson_cost_prep_time = data['cost_of_the_order'].corr(data['food_preparation_time'])
pearson_cost_delivery_time = data['cost_of_the_order'].corr(data['delivery_time'])
pearson_cost_total_time = data['cost_of_the_order'].corr(data['total_time'])
print(f"Pearson Correlation Score between 'cost_of_the_order' and 'food_preparation_time': {pearson_cost_prep_time:.3f}")
print(f"Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': {pearson_cost_delivery_time:.3f}")
print(f"Pearson Correlation Score between 'cost_of_the_order' and 'total_time': {pearson_cost_total_time:.3f}")
Pearson Correlation Score between 'cost_of_the_order' and 'food_preparation_time': 0.042
Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': -0.030
Pearson Correlation Score between 'cost_of_the_order' and 'total_time': 0.006
Pearson Correlation Scores
Order Cost and Preparation Time:
Pearson correlation score: 0.042
Interpretation: Very weak positive correlation, indicating almost no linear relationship between the cost of an order and the time it takes to prepare.
Order Cost and Delivery Time:
Pearson correlation score: -0.030
Interpretation: Very weak negative correlation, suggesting almost no linear relationship between the cost of an order and its delivery time.
Order Cost and Total Time:
Pearson correlation score: 0.006
Interpretation: Very weak positive correlation, indicating almost no linear relationship between the cost of an order and the total time (preparation + delivery).
Scatter Plots:
All three scatter plots (cost vs. preparation time, cost vs. delivery time, and cost vs. total time) show points scattered widely with no visible trend, consistent with the near-zero correlation scores.
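`scipy.stats.pearsonr` also returns a p-value alongside the coefficient, which helps confirm whether a near-zero score like 0.042 is distinguishable from no relationship at all. A sketch on synthetic, independent draws (the arrays stand in for the cost and preparation-time columns):

```python
# pearsonr returns (coefficient, p-value); independent draws should give r near 0
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
cost = rng.uniform(5, 35, size=200)    # stand-in for cost_of_the_order
prep = rng.integers(20, 35, size=200)  # stand-in for food_preparation_time

r, p = pearsonr(cost, prep)
print(f"r = {r:.3f}, p = {p:.3f}")
```

On the real columns, replace the synthetic arrays with `data['cost_of_the_order']` and `data['food_preparation_time']`; a large p-value there would confirm the coefficient is not statistically meaningful.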
from scipy.stats import linregress
# Calculate the frequency of orders per customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']
# Merge the order frequency with the original dataset to get order costs
merged_data = pd.merge(data, order_frequency, on='customer_id')
# Calculate the average order cost for different order frequencies
avg_cost_by_frequency = merged_data.groupby('order_count')['cost_of_the_order'].mean().reset_index()
avg_cost_by_frequency.columns = ['order_count', 'average_order_cost']
# Style the table using pandas Styler
styled_avg_cost_by_frequency = avg_cost_by_frequency.style.background_gradient(cmap='viridis').set_caption("Average Order Cost by Order Frequency")
# Display the styled table
styled_avg_cost_by_frequency
# Plotting the average order cost by order frequency with trendline
plt.figure(figsize=(10, 6))
plt.plot(avg_cost_by_frequency['order_count'], avg_cost_by_frequency['average_order_cost'], marker='o', linestyle='-', linewidth=2, markersize=8, label='Average Order Cost')
# Add a trendline
slope, intercept, r_value, p_value, std_err = linregress(avg_cost_by_frequency['order_count'], avg_cost_by_frequency['average_order_cost'])
plt.plot(avg_cost_by_frequency['order_count'], intercept + slope * avg_cost_by_frequency['order_count'], 'r', label=f'Trendline (slope={slope:.2f}, p-value={p_value:.2f})')
plt.title('Average Order Cost by Order Frequency')
plt.xlabel('Order Frequency')
plt.ylabel('Average Order Cost ($)')
plt.legend()
plt.grid(True)
plt.show()
# Display the results
#print(avg_cost_by_frequency)
# Pearson correlation for detailed analysis
correlation_score = avg_cost_by_frequency['order_count'].corr(avg_cost_by_frequency['average_order_cost'])
print(f"Pearson Correlation Score between 'order_count' and 'average_order_cost': {correlation_score:.3f}")
Pearson Correlation Score between 'order_count' and 'average_order_cost': 0.186
Pearson Correlation:
The score of 0.186 indicates a weak positive relationship between order frequency and average order cost.
Trendline:
The fitted trendline slopes slightly upward, consistent with the weak positive correlation.
Data Points:
Customers who place orders infrequently (1-4 times) have relatively stable average order costs, ranging from around $16 to $17.
Customers with higher order frequencies (5-13) show more variation in average order costs, with a notable spike for those placing 8 orders ($22.98).
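One caveat for the chart above: the average at each order frequency can rest on very few customers, so reporting the group size next to the mean helps judge how stable each point is. A minimal sketch with a toy frame (`df` and its values are illustrative):

```python
# Attach each customer's order count, then report mean cost AND group size per count
import pandas as pd

df = pd.DataFrame({'customer_id': [1, 1, 1, 2, 2, 3],
                   'cost_of_the_order': [10.0, 12.0, 14.0, 30.0, 32.0, 20.0]})

counts = df['customer_id'].value_counts().rename('order_count')
df = df.merge(counts, left_on='customer_id', right_index=True)

# 'count' shows how many orders back each average, exposing thin groups
summary = df.groupby('order_count')['cost_of_the_order'].agg(['mean', 'count'])
print(summary)
```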
Preparation Time, Delivery Time, Total Time by Day of the week
Order Frequency by Day of the Week
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']
# Group by day of the week and calculate average times
avg_times_by_day = data.groupby('day_of_the_week').agg({
'food_preparation_time': 'mean',
'delivery_time': 'mean',
'total_time': 'mean'
}).reset_index()
# Style the table using pandas Styler
styled_avg_times_by_day = avg_times_by_day.style.background_gradient(cmap='viridis').set_caption("Average Times by Day of the Week")
# Display the styled table
styled_avg_times_by_day
# Define the x-axis labels with weekday/weekend context
x_labels = ['Weekday', 'Weekend']
# Plotting the average times by day of the week side by side
fig, ax = plt.subplots(1, 3, figsize=(18, 6), sharey=True)
# Preparation Time
ax[0].bar(avg_times_by_day['day_of_the_week'], avg_times_by_day['food_preparation_time'], color='skyblue')
ax[0].set_title('Average Preparation Time')
#ax[0].set_xlabel('Day of the Week')
ax[0].set_ylabel('Time (minutes)')
ax[0].grid(True)
ax[0].set_xticks(range(len(avg_times_by_day['day_of_the_week'])))
ax[0].set_xticklabels(x_labels)
# Delivery Time
ax[1].bar(avg_times_by_day['day_of_the_week'], avg_times_by_day['delivery_time'], color='lightgreen')
ax[1].set_title('Average Delivery Time')
ax[1].set_xlabel('Day of the Week')
ax[1].grid(True)
ax[1].set_xticks(range(len(avg_times_by_day['day_of_the_week'])))
ax[1].set_xticklabels(x_labels)
# Total Time
ax[2].bar(avg_times_by_day['day_of_the_week'], avg_times_by_day['total_time'], color='salmon')
ax[2].set_title('Average Total Time')
#ax[2].set_xlabel('Day of the Week')
ax[2].grid(True)
ax[2].set_xticks(range(len(avg_times_by_day['day_of_the_week'])))
ax[2].set_xticklabels(x_labels)
plt.tight_layout()
plt.show()
Average Preparation Time:
Preparation times are consistent across weekdays and weekends, averaging roughly 27 minutes in both periods.
Average Delivery Time:
Delivery times differ more: weekday deliveries average around 28 minutes, versus roughly 22-23 minutes on weekends.
Average Total Time:
Total time, being the sum of preparation and delivery times, follows the delivery pattern: around 55-56 minutes on weekdays versus roughly 50 minutes on weekends.
# Calculate the order frequency by day of the week
# (sort by label so the bars line up with the Weekday/Weekend tick labels below;
# value_counts alone sorts by count, which would mislabel the bars)
order_frequency_by_day = data['day_of_the_week'].value_counts().sort_index().reset_index()
order_frequency_by_day.columns = ['day_of_the_week', 'order_count']
# Style the table using pandas Styler
styled_order_frequency_by_day = order_frequency_by_day.style.background_gradient(cmap='viridis').set_caption("Order Frequency by Day of the Week")
# Display the styled table
styled_order_frequency_by_day
# Define the x-axis labels with weekday/weekend context
x_labels = ['Weekday', 'Weekend']
# Plotting the order frequency by day of the week
plt.figure(figsize=(10, 6))
plt.bar(order_frequency_by_day['day_of_the_week'], order_frequency_by_day['order_count'], color='lightblue')
plt.title('Order Frequency by Day of the Week')
plt.xlabel('Day of the Week (Weekday/Weekend)')
plt.ylabel('Order Count')
plt.xticks(ticks=range(len(order_frequency_by_day['day_of_the_week'])), labels=x_labels)
plt.grid(True)
plt.show()
Order Frequency on Weekdays vs. Weekends:
Weekends:
Orders placed on weekends (Saturday and Sunday) are considerably more frequent than on weekdays, matching the weekend peaks in the cuisine-level heatmap.
Weekdays:
Order frequency on weekdays (Monday through Friday) is noticeably lower.
Data Distribution:
The weekend bar is substantially taller, indicating that weekend orders account for the clear majority of total volume.
Ratings by Cuisine Type
Rating Distribution by Order Cost
Rating Distribution by Total Wait Time (Prep Time + Delivery Time)
Ratings by day of the week
Order Frequency vs Ratings
# Plot of Customer Ratings by Cuisine Type
# Data Cleaning: Replace 'Not given' ratings with NaN and drop those rows
data['rating'] = data['rating'].replace('Not given', pd.NA).dropna().astype(int)
# Calculate the average ratings by cuisine type
average_ratings = data.groupby('cuisine_type')['rating'].mean().reset_index()
# Count the 'Not given' ratings by cuisine type
not_given_ratings = data['rating'].replace('Not given', pd.NA).isna().groupby(data['cuisine_type']).sum().reset_index(name='Not_given_count')
# Merge both dataframes
combined_data = pd.merge(average_ratings, not_given_ratings, on='cuisine_type')
# Plot the combined data
fig, ax1 = plt.subplots(figsize=(14, 8))
# Plot average ratings
ax1.set_xlabel('Cuisine Type')
ax1.set_ylabel('Average Rating', color='tab:blue')
ax1.bar(combined_data['cuisine_type'], combined_data['rating'], color='skyblue', label='Average Rating')
ax1.tick_params(axis='y', labelcolor='tab:blue')
ax1.set_xticks(range(len(combined_data['cuisine_type'])))
ax1.set_xticklabels(combined_data['cuisine_type'], rotation=45, ha='right')
# Create a secondary y-axis to plot 'Not given' counts
ax2 = ax1.twinx()
ax2.set_ylabel('Count of "Not given" Ratings', color='tab:red')
ax2.plot(combined_data['cuisine_type'], combined_data['Not_given_count'], color='salmon', marker='o', label='Not Given Count')
ax2.tick_params(axis='y', labelcolor='tab:red')
# Title and layout
plt.title('Average Customer Ratings and Count of "Not given" Ratings by Cuisine Type')
fig.tight_layout()
plt.show()
# Display the combined data table summary using pandas Styler
combined_data_styled = combined_data.style.set_caption("Summary of Average Ratings and Count of 'Not given' Ratings by Cuisine Type")
combined_data_styled
# Encode the 'cuisine_type' column using pandas' factorize
data['cuisine_type_encoded'] = pd.factorize(data['cuisine_type'])[0]
# Ensure no NaNs or infinite values in the columns
cleaned_data = data.dropna(subset=['cuisine_type_encoded', 'rating'])
# Calculate the Pearson correlation coefficient using scipy
pearson_corr_cuisine_rating, _ = pearsonr(cleaned_data['cuisine_type_encoded'], cleaned_data['rating'])
# Print the Pearson correlation coefficient rounded to two decimal places
print(f"The Pearson correlation coefficient between cuisine type and rating is {pearson_corr_cuisine_rating:.2f}")
# Calculate the average ratings by cuisine type
average_ratings = cleaned_data.groupby('cuisine_type')['rating'].mean().reset_index()
# Count the 'Not given' ratings by cuisine type
not_given_ratings = data['rating'].replace('Not given', pd.NA).isna().groupby(data['cuisine_type']).sum().reset_index(name='Not_given_count')
# Merge both dataframes
combined_data = pd.merge(average_ratings, not_given_ratings, on='cuisine_type')
# Sort the combined data by average rating in descending order
combined_data_sorted = combined_data.sort_values(by='rating', ascending=False)
# Display the sorted combined data table summary using pandas Styler
combined_data_sorted_styled = combined_data_sorted.style.set_caption("Summary of Average Ratings and Count of 'Not given' Ratings by Cuisine Type (Sorted by Rating)")
combined_data_sorted_styled
The Pearson correlation coefficient between cuisine type and rating is -0.01
| | cuisine_type | rating | Not_given_count |
|---|---|---|---|
| 11 | Spanish | 4.833333 | 6 |
| 12 | Thai | 4.666667 | 10 |
| 3 | Indian | 4.540000 | 23 |
| 8 | Mexican | 4.416667 | 29 |
| 5 | Japanese | 4.373626 | 197 |
| 4 | Italian | 4.360465 | 126 |
| 1 | Chinese | 4.338346 | 82 |
| 10 | Southern | 4.307692 | 4 |
| 2 | French | 4.300000 | 8 |
| 0 | American | 4.298913 | 216 |
| 9 | Middle Eastern | 4.235294 | 15 |
| 7 | Mediterranean | 4.218750 | 14 |
| 6 | Korean | 4.111111 | 4 |
| 13 | Vietnamese | 4.000000 | 2 |
The Pearson correlation coefficient between the encoded cuisine type and rating is -0.01, indicating a very weak negative correlation, suggesting that the type of cuisine has virtually no linear relationship with the ratings given by customers.
# Rating Distribution by Order Cost
# Original unreadable plot
#plt.figure(figsize=(12, 6))
#sns.boxplot(x='cost_of_the_order', y='rating', data=data)
#plt.title('Box Plot of Customer Ratings by Order Cost')
#plt.xlabel('Order Cost ($)')
#plt.ylabel('Customer Rating')
#plt.xticks(rotation=90)
#plt.show()
# To make this readable we will first need to bin the data and then create the plot
# Bin the order costs into categories
bin_labels = ['< $10', '$10 - $20', '$20 - $30', '$30 - $40', '$40 - $50', '> $50']
clean_data = data.copy()
clean_data['cost_bin'] = pd.cut(clean_data['cost_of_the_order'], bins=[0, 10, 20, 30, 40, 50, float('inf')], labels=bin_labels)
# Calculate the mean and median for each bin
mean_ratings = clean_data.groupby('cost_bin')['rating'].mean()
median_ratings = clean_data.groupby('cost_bin')['rating'].median()
# Plotting a box plot with binned order costs
plt.figure(figsize=(10, 6))
sns.boxplot(x='cost_bin', y='rating', data=clean_data)
# Add mean and median points
plt.plot(mean_ratings.index, mean_ratings.values, color='red', marker='o', linestyle='-', label='Mean')
plt.plot(median_ratings.index, median_ratings.values, color='blue', marker='x', linestyle='-', label='Median')
# Customize the plot
plt.title('Box Plot of Customer Ratings by Binned Order Cost')
plt.xlabel('Order Cost Bins')
plt.ylabel('Customer Rating')
plt.legend()
plt.show()
Median Ratings:
The median customer ratings for various order costs are generally high, often around 4 to 5 stars.
This suggests a consistent level of customer satisfaction across different order costs.
Spread of Ratings:
For most order costs, the ratings have a relatively narrow interquartile range (IQR), indicating that the majority of ratings are close to the median.
Some order costs show wider IQRs, indicating more variability in customer satisfaction.
Outliers:
There are noticeable outliers at both ends of the rating scale for several order costs, indicating that some customers have had significantly better or worse experiences compared to the majority.
These outliers are more frequent at lower order costs, suggesting variability in satisfaction for cheaper orders.
High Cost Orders:
Higher-cost orders show ratings clustered near the top of the scale, with fewer low outliers.
Low Cost Orders:
Lower-cost orders show more variability, with low-rating outliers appearing more often, consistent with the outlier pattern noted above.
# Rating Distribution by Total Wait Time (Prep Time + Delivery Time)
# Data Cleaning: treat 'Not given' ratings as missing (NaN); no-op if already cleaned above
data['rating'] = data['rating'].replace('Not given', np.nan).astype(float)
# Calculate the total wait time
data['total_wait_time'] = data['food_preparation_time'] + data['delivery_time']
# Ensure no NaNs or infinite values in the columns
cleaned_data = data.dropna(subset=['total_wait_time', 'rating'])
# Scatter Plot - this wasn't very intuitive to visualize this way
#plt.figure(figsize=(12, 6))
#sns.scatterplot(x='total_wait_time', y='rating', data=cleaned_data)
#plt.title('Rating Distribution by Total Wait Time (Prep Time + Delivery Time)')
#plt.xlabel('Total Wait Time (minutes)')
#plt.ylabel('Rating')
#plt.show()
# Hexbin Plot
plt.figure(figsize=(12, 6))
hb = plt.hexbin(cleaned_data['total_wait_time'], cleaned_data['rating'], gridsize=30, cmap='viridis', mincnt=1)
plt.colorbar(hb, label='Count')
plt.title('Rating Distribution by Total Wait Time (Prep Time + Delivery Time)')
plt.xlabel('Total Wait Time (minutes)')
plt.ylabel('Rating')
plt.show()
# Correlation Analysis
correlation, _ = pearsonr(cleaned_data['total_wait_time'], cleaned_data['rating'])
print(f"Pearson correlation coefficient between total wait time and rating: {correlation:.2f}")
Pearson correlation coefficient between total wait time and rating: -0.01
Pearson Correlation Coefficient:
The Pearson correlation coefficient between total wait time and rating is -0.01, indicating a very weak negative correlation. This suggests that there is virtually no linear relationship between total wait time and customer ratings.
Hexbin Plot:
The hexbin plot visually confirms the lack of any clear pattern between total wait time and ratings: ratings are spread across all wait times, with no concentration that would indicate a relationship.
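A single Pearson coefficient can also miss a non-linear or threshold effect; binning wait times and averaging ratings per bin is a quick check. A sketch with a toy frame (the values are illustrative — with the real data, bin `total_wait_time` from `cleaned_data`):

```python
# Bin wait times with pd.cut, then average ratings per bin to expose any threshold effect
import pandas as pd

df = pd.DataFrame({'total_wait_time': [35, 45, 55, 65, 75, 85],
                   'rating': [5, 5, 4, 4, 3, 3]})

wait_bins = pd.cut(df['total_wait_time'], bins=[30, 50, 70, 90])
mean_by_bin = df.groupby(wait_bins, observed=True)['rating'].mean()
print(mean_by_bin)  # toy data trends downward: 5.0, 4.0, 3.0
```

A flat profile across bins on the real data would reinforce the no-relationship conclusion; a visible drop in a high-wait bin would not.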
# Calculate average rating by day of the week
# First, exclude rows where rating is 'Not given'
filtered_data = data[data['rating'] != 'Not given'].copy()  # .copy() avoids SettingWithCopyWarning
filtered_data['rating'] = filtered_data['rating'].astype(float)
avg_ratings_by_day = filtered_data.groupby('day_of_the_week')['rating'].mean().reset_index()
avg_ratings_by_day.columns = ['day_of_the_week', 'average_rating']
# Style the table using pandas Styler
styled_avg_ratings_by_day = avg_ratings_by_day.style.background_gradient(cmap='viridis').set_caption("Average Ratings by Day of the Week")
# Display the styled table
styled_avg_ratings_by_day
# Define the x-axis labels with weekday/weekend context
x_labels = ['Weekday', 'Weekend']
# Plotting the average ratings by day of the week
plt.figure(figsize=(10, 6))
plt.bar(avg_ratings_by_day['day_of_the_week'], avg_ratings_by_day['average_rating'], color='lightcoral')
plt.title('Average Ratings by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Rating')
plt.xticks(ticks=range(len(avg_ratings_by_day['day_of_the_week'])), labels=x_labels)
plt.ylim(0, 5)
plt.grid(True)
plt.show()
Data and Visualization:
Average Ratings on Weekdays vs. Weekends:
Weekdays: The average rating on weekdays is slightly lower than on weekends.
Weekends: The average rating on weekends is slightly higher compared to weekdays.
Data Distribution:
The average rating on weekdays might hover around 4.3 to 4.4.
The average rating on weekends might hover around 4.4 to 4.5.
The average rating of roughly 4.35 on weekdays versus 4.45 on weekends reflects a slight improvement in customer satisfaction on weekends.
# Calculate order frequency for each customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']
# Merge the order frequency with the original dataset to get ratings
merged_data = pd.merge(data, order_frequency, on='customer_id')
# Filter out rows where rating is 'Not given'
filtered_merged_data = merged_data[merged_data['rating'] != 'Not given'].copy()  # .copy() avoids SettingWithCopyWarning
filtered_merged_data['rating'] = filtered_merged_data['rating'].astype(float)
# Calculate the average rating for different order frequencies
avg_rating_by_frequency = filtered_merged_data.groupby('order_count')['rating'].mean().reset_index()
avg_rating_by_frequency.columns = ['order_count', 'average_rating']
# Style the table using pandas Styler
styled_avg_rating_by_frequency = avg_rating_by_frequency.style.background_gradient(cmap='viridis').set_caption("Average Rating by Order Frequency")
# Display the styled table
styled_avg_rating_by_frequency
# Plotting the average rating by order frequency
plt.figure(figsize=(10, 6))
plt.plot(avg_rating_by_frequency['order_count'], avg_rating_by_frequency['average_rating'], marker='o', linestyle='-', color='blue')
plt.title('Average Rating by Order Frequency')
plt.xlabel('Order Frequency')
plt.ylabel('Average Rating')
plt.ylim(0, 5)
plt.grid(True)
plt.show()
General Trend:
The average ratings for customers with different order frequencies are generally high, ranging between 4.28 and 4.43.
There is no significant downward trend in ratings with increasing order frequency, indicating consistent customer satisfaction across different order frequencies.
Specific Patterns:
There are slight variations in ratings at different order frequencies, but these variations are not drastic.
Customers who placed eight orders have the highest average rating (4.43), while those with nine orders have the lowest average rating (4.28).
# Delivery time vs prep time
# Calculate the Pearson correlation coefficient
correlation = data['food_preparation_time'].corr(data['delivery_time'])
#Facet grid
g = sns.FacetGrid(data, col="cuisine_type", col_wrap=4, height=4)
g.map(sns.scatterplot, "food_preparation_time", "delivery_time", alpha=0.5)
g.set_titles("{col_name}")
g.set_axis_labels("Food Preparation Time (minutes)", "Delivery Time (minutes)")
plt.show()
# Print the result
print(f"Pearson correlation coefficient between food preparation time and delivery time: {correlation:.2f}")
Pearson correlation coefficient between food preparation time and delivery time: 0.01
The analysis reveals a very weak correlation between food preparation time and delivery time, suggesting that factors other than preparation time are more significant in determining delivery time.
The facet grid further illustrates the lack of a strong relationship across different cuisine types, highlighting variability and potential influences from other operational aspects.
# Box Plot of Delivery Time by Cuisine Type
plt.figure(figsize=(12, 6))
sns.boxplot(x='cuisine_type', y='delivery_time', data=data)
plt.title('Box Plot of Delivery Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()
Variation in Delivery Time:
There is noticeable variation in delivery times across different cuisine types.
Cuisine with Longest Median Delivery Time:
Some cuisines, such as Mediterranean, show longer median delivery times.
Cuisine with Shortest Median Delivery Time:
Other cuisines, like Japanese and Italian, tend to have shorter median delivery times.
Presence of Outliers:
Outliers are present in the delivery times for most cuisine types, indicating occasional significantly longer or shorter delivery times than the median.
Spread of Delivery Time:
The interquartile ranges (IQRs) differ among cuisines, with some cuisines showing a wide range of delivery times (e.g., American), while others have a more consistent delivery time (e.g., Mexican).
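The medians read off a box plot can be verified numerically with a groupby median. A minimal sketch (cuisines 'A' and 'B' and the times are illustrative; with the real data, group `data` by `cuisine_type`):

```python
# Compute per-group medians directly instead of eyeballing them from the box plot
import pandas as pd

df = pd.DataFrame({'cuisine_type': ['A', 'A', 'B', 'B', 'B'],
                   'delivery_time': [20, 30, 25, 27, 29]})

medians = df.groupby('cuisine_type')['delivery_time'].median().sort_values()
print(medians)  # A 25.0, B 27.0
```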
# Box Plot of Food Preparation Time by Cuisine Type
plt.figure(figsize=(12, 6))
sns.boxplot(x='cuisine_type', y='food_preparation_time', data=data)
plt.title('Box Plot of Food Preparation Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']
# Group by cuisine type and calculate average total time
avg_total_time_by_cuisine = data.groupby('cuisine_type')['total_time'].mean().reset_index()
avg_total_time_by_cuisine.columns = ['cuisine_type', 'average_total_time']
# Style the table using pandas Styler
styled_avg_total_time_by_cuisine = avg_total_time_by_cuisine.style.background_gradient(cmap='viridis').set_caption("Average Total Time by Cuisine Type")
# Display the styled table
styled_avg_total_time_by_cuisine
# Plotting the average total time by cuisine type
plt.figure(figsize=(12, 6))
sns.barplot(x='cuisine_type', y='average_total_time', data=avg_total_time_by_cuisine, palette='viridis')
plt.title('Average Total Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Total Time (minutes)')
plt.xticks(rotation=90)
plt.show()
Longest Average Total Time:
Italian Cuisine: One of the longest average total times, at roughly 52.6 minutes.
Shortest Average Total Time:
Korean Cuisine: The shortest average total time of all cuisines.
Other Notable Cuisines:
Japanese Cuisine: Moderate preparation and delivery times, placing it mid-range overall.
Indian Cuisine: A higher total time, approaching that of Italian cuisine.
Variability:
Average total times across cuisines fall within a fairly narrow band, so no cuisine is dramatically slower or faster end to end.
# Pivot Table of Average Delivery Time by Weekday/Weekend and Cuisine Type
# Create a pivot table for average delivery time by Weekday/Weekend and cuisine type
pivot_table_delivery_time = data.pivot_table(values='delivery_time', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')
# Visualize the pivot table as a heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_table_delivery_time, annot=True, cmap='Reds')
plt.title('Average Delivery Time by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()
pivot_table_delivery_time
| day_of_the_week | American | Chinese | French | Indian | Italian | Japanese | Korean | Mediterranean | Mexican | Middle Eastern | Southern | Spanish | Thai | Vietnamese |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weekday | 28.25 | 28.83 | 27.20 | 27.63 | 28.80 | 28.13 | 26.00 | 28.79 | 28.25 | 28.47 | 29.50 | 28.00 | 26.50 | 27.67 |
| Weekend | 22.54 | 22.27 | 24.62 | 22.35 | 22.71 | 22.52 | 20.00 | 21.31 | 22.64 | 21.75 | 20.73 | 23.00 | 22.27 | 25.00 |
Weekday vs. Weekend Delivery Times:
Cuisine Types with Longest Delivery Times on Weekdays:
Southern: Average delivery time is approximately 29.50 minutes, the longest of any cuisine.
Chinese: Average delivery time is approximately 28.83 minutes.
Italian: Average delivery time is approximately 28.80 minutes.
Mediterranean: Average delivery time is approximately 28.79 minutes.
Cuisine Types with Shortest Delivery Times on Weekdays:
Korean: Average delivery time is approximately 26.00 minutes.
Thai: Average delivery time is approximately 26.50 minutes.
Cuisine Types with Longest Delivery Times on Weekends:
Vietnamese: Average delivery time is approximately 25.00 minutes.
French: Average delivery time is approximately 24.62 minutes.
Cuisine Types with Shortest Delivery Times on Weekends:
Korean: Average delivery time is approximately 20.00 minutes.
Southern: Average delivery time is approximately 20.73 minutes.
Comparison Between Weekdays and Weekends:
Most cuisine types have a significant reduction in delivery times on weekends compared to weekdays.
The difference in delivery times between weekdays and weekends is more pronounced for certain cuisines like American, Chinese, and Italian.
# Create a pivot table for average preparation time by Weekday/Weekend and cuisine type
pivot_table_preparation_time = data.pivot_table(values='food_preparation_time', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')
# Visualize the pivot table as a heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_table_preparation_time, annot=True, cmap='Blues')
plt.title('Average Preparation Time by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()
pivot_table_preparation_time
| day_of_the_week | American | Chinese | French | Indian | Italian | Japanese | Korean | Mediterranean | Mexican | Middle Eastern | Southern | Spanish | Thai | Vietnamese |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weekday | 27.41 | 28.08 | 27.20 | 26.33 | 27.42 | 27.28 | 23.00 | 25.79 | 25.63 | 27.24 | 26.17 | 33.00 | 25.25 | 23.67 |
| Weekend | 27.45 | 27.33 | 26.77 | 27.49 | 27.51 | 27.60 | 25.91 | 27.53 | 27.23 | 26.38 | 28.36 | 26.36 | 27.87 | 27.25 |
Weekday vs. Weekend Preparation Times:
Weekdays: Preparation times are slightly lower for most cuisine types.
Weekends: Preparation times tend to be slightly higher for most cuisines, suggesting kitchens are busier on weekends.
Cuisine Types with Longest Preparation Times on Weekdays:
Spanish: Average preparation time is approximately 33.00 minutes.
Chinese: Average preparation time is approximately 28.08 minutes.
Italian: Average preparation time is approximately 27.42 minutes.
Cuisine Types with Shortest Preparation Times on Weekdays:
Korean: Average preparation time is approximately 23.00 minutes.
Vietnamese: Average preparation time is approximately 23.67 minutes.
Thai: Average preparation time is approximately 25.25 minutes.
Cuisine Types with Longest Preparation Times on Weekends:
Southern: Average preparation time is approximately 28.36 minutes.
Thai: Average preparation time is approximately 27.87 minutes.
Japanese: Average preparation time is approximately 27.60 minutes.
Cuisine Types with Shortest Preparation Times on Weekends:
Korean: Average preparation time is approximately 25.91 minutes.
Spanish: Average preparation time is approximately 26.36 minutes.
Middle Eastern: Average preparation time is approximately 26.38 minutes.
Comparison Between Weekdays and Weekends:
The difference in preparation times between weekdays and weekends is more pronounced for certain cuisines like Spanish and Southern.
Some cuisines have minimal differences in preparation times between weekdays and weekends, indicating consistent preparation times.
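The weekday-weekend gap per cuisine can be computed directly by subtracting the pivot table's rows; sorting the result ranks cuisines by how much their preparation time shifts. A sketch using two values from the table above (with the real data, use `pivot_table_preparation_time` in place of the toy `pivot`):

```python
# Row subtraction on a pivot table yields the per-cuisine weekday-weekend gap
import pandas as pd

# Toy pivot with two cuisines' preparation times taken from the table above
pivot = pd.DataFrame({'Spanish': [33.00, 26.36], 'Korean': [23.00, 25.91]},
                     index=['Weekday', 'Weekend'])

gap = pivot.loc['Weekday'] - pivot.loc['Weekend']
print(gap.sort_values(ascending=False))  # positive = slower on weekdays
```

A positive gap (e.g. Spanish) means weekday preparation is slower; a negative one (e.g. Korean) means weekends are slower.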
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']
# Create a pivot table for average total time by Weekday/Weekend and cuisine type
pivot_table_total_time = data.pivot_table(values='total_time', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')
# Visualize the pivot table as a heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_table_total_time, annot=True, cmap='Greens')
plt.title('Average Total Time by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()
pivot_table_total_time
| day_of_the_week | American | Chinese | French | Indian | Italian | Japanese | Korean | Mediterranean | Mexican | Middle Eastern | Southern | Spanish | Thai | Vietnamese |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weekday | 55.66 | 56.90 | 54.40 | 53.96 | 56.22 | 55.41 | 49.00 | 54.57 | 53.88 | 55.71 | 55.67 | 61.00 | 51.75 | 51.33 |
| Weekend | 50.00 | 49.60 | 51.38 | 49.84 | 50.22 | 50.12 | 45.91 | 48.84 | 49.87 | 48.13 | 49.09 | 49.36 | 50.13 | 52.25 |
Weekday vs. Weekend Total Times:
Weekdays: Total times are generally higher across most cuisine types compared to weekends.
Weekends: Total times are generally lower, indicating quicker overall preparation and delivery on weekends.
Cuisine Types with Longest Total Times on Weekdays:
Spanish: Average total time is approximately 61.00 minutes.
Chinese: Average total time is approximately 56.90 minutes.
Italian: Average total time is approximately 56.22 minutes.
Cuisine Types with Shortest Total Times on Weekdays:
Korean: Average total time is approximately 49.00 minutes.
Vietnamese: Average total time is approximately 51.33 minutes.
Thai: Average total time is approximately 51.75 minutes.
Cuisine Types with Longest Total Times on Weekends:
Vietnamese: Average total time is approximately 52.25 minutes.
French: Average total time is approximately 51.38 minutes.
Italian: Average total time is approximately 50.22 minutes.
Cuisine Types with Shortest Total Times on Weekends:
Korean: Average total time is approximately 45.91 minutes.
Middle Eastern: Average total time is approximately 48.13 minutes.
Mediterranean: Average total time is approximately 48.84 minutes.
Comparison Between Weekdays and Weekends:
The difference in total times between weekdays and weekends is more pronounced for certain cuisines like Spanish, Chinese, and Italian.
Some cuisines have minimal differences in total times between weekdays and weekends, indicating consistent preparation and delivery times.
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']
# Calculate order frequency for each customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']
# Merge the order frequency with the original dataset
merged_data = pd.merge(data, order_frequency, on='customer_id')
# Calculate the average preparation, delivery, and total time for different order frequencies
avg_times_by_frequency = merged_data.groupby('order_count').agg({
    'food_preparation_time': 'mean',
    'delivery_time': 'mean',
    'total_time': 'mean'
}).reset_index()
avg_times_by_frequency.columns = ['order_count', 'average_preparation_time', 'average_delivery_time', 'average_total_time']
# Plotting the average preparation, delivery, and total time by order frequency side by side
fig, ax = plt.subplots(1, 3, figsize=(18, 6), sharey=True)
# Preparation Time
sns.lineplot(ax=ax[0], x='order_count', y='average_preparation_time', data=avg_times_by_frequency, marker='o', color='skyblue', label='Preparation Time')
ax[0].set_title('Average Preparation Time by Order Frequency')
ax[0].set_xlabel('Order Frequency')
ax[0].set_ylabel('Average Time (minutes)')
ax[0].grid(True)
ax[0].legend()
# Delivery Time
sns.lineplot(ax=ax[1], x='order_count', y='average_delivery_time', data=avg_times_by_frequency, marker='o', color='lightgreen', label='Delivery Time')
ax[1].set_title('Average Delivery Time by Order Frequency')
ax[1].set_xlabel('Order Frequency')
ax[1].grid(True)
ax[1].legend()
# Total Time
sns.lineplot(ax=ax[2], x='order_count', y='average_total_time', data=avg_times_by_frequency, marker='o', color='salmon', label='Total Time')
ax[2].set_title('Average Total Time by Order Frequency')
ax[2].set_xlabel('Order Frequency')
ax[2].grid(True)
ax[2].legend()
plt.tight_layout()
plt.show()
Average Preparation Time:
Average Delivery Time:
Average Total Time:
Insights
General Trend:
Efficiency Peaks:
Inconsistencies:
The preparation time increases again for customers with 9 orders, suggesting possible inconsistencies or operational challenges for this group.
Consistent Delivery Times:
Delivery times remain relatively stable with minor fluctuations across different order frequencies.
# Calculate the order frequency by day of the week
order_frequency_by_day = data['day_of_the_week'].value_counts().reset_index()
order_frequency_by_day.columns = ['day_of_the_week', 'order_count']
# Style the table using pandas Styler
styled_order_frequency_by_day = order_frequency_by_day.style.background_gradient(cmap='viridis').set_caption("Order Frequency by Day of the Week")
# Display the styled table
styled_order_frequency_by_day
# Plotting the order frequency by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='order_count', data=order_frequency_by_day, palette='viridis')
plt.title('Order Frequency by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Order Count')
#plt.xticks(rotation=90)
plt.grid(True)
plt.show()
Order Frequency on Weekdays vs. Weekends:
Weekends: There is a significantly higher order frequency on weekends, with 1351 orders.
Weekdays: The order frequency on weekdays is lower, with 547 orders.
Higher Order Volumes on Weekends:
Weekends account for roughly 71% of all orders (1,351 of 1,898), so delivery capacity matters most on those days.
Lower Order Volumes on Weekdays:
Weekdays account for only about 29% of orders, leaving room to stimulate weekday demand.
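The weekend skew can be expressed as a share, using the counts printed above (1,351 weekend vs. 547 weekday orders):

```python
# Order counts reported by value_counts() above
weekend_orders, weekday_orders = 1351, 547
total_orders = weekend_orders + weekday_orders

weekend_share = weekend_orders / total_orders
print(f"Weekend share of all orders: {weekend_share:.1%}")  # roughly 71%
```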
# Convert 'rating' to numeric, replacing 'Not given' with NaN
data['rating'] = pd.to_numeric(data['rating'].replace('Not given', pd.NA), errors='coerce')
# Calculate the total wait time
data['total_wait_time'] = data['food_preparation_time'] + data['delivery_time']
# Identify numeric columns
import math  # needed below for math.ceil
numeric_columns = data.select_dtypes(include=[np.number]).columns
numeric_columns = numeric_columns.drop(['customer_id', 'order_id'])  # exclude identifier columns from the analysis
# Determine the number of rows and columns for the subplot grid
num_cols = 5
num_rows = math.ceil(len(numeric_columns) / num_cols)
# Set up the subplots
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, num_rows * 5))
colors = sns.color_palette("husl", len(numeric_columns)) # Generate a color palette
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Step 2 & 3: Visualize Outliers and Identify Outliers using IQR method for each numeric column
outlier_counts = {}
for i, col in enumerate(numeric_columns):
    sns.boxplot(x=data[col], ax=axes[i], color=colors[i])
    axes[i].set_title(f'Box Plot of {col}')
    axes[i].set_xlabel(col)
    # Calculate IQR
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    # Define outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Identify outliers
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    outlier_counts[col] = len(outliers)
    # Print outlier count
    print(f"Number of outliers detected in {col}: {len(outliers)}")
# Remove any empty subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])
# Adjust layout
plt.tight_layout()
plt.show()
# Print summary of outliers in each column
print("\nSummary of Outlier Counts:")
for col, count in outlier_counts.items():
    print(f"{col}: {count} outliers")
# Optional: Remove outliers (across all numeric columns)
for col in numeric_columns:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Remove outliers. Note: rows where the column is NaN (e.g. missing ratings)
    # fail both comparisons and are dropped as well, which is why the record
    # count falls even though zero outliers were detected.
    data = data[(data[col] >= lower_bound) & (data[col] <= upper_bound)]
# Summary of the cleaned data
print(f"\nData after removing outliers: {len(data)} records remaining.")
Number of outliers detected in cost_of_the_order: 0
Number of outliers detected in rating: 0
Number of outliers detected in food_preparation_time: 0
Number of outliers detected in delivery_time: 0
Number of outliers detected in total_wait_time: 0
Summary of Outlier Counts:
cost_of_the_order: 0 outliers
rating: 0 outliers
food_preparation_time: 0 outliers
delivery_time: 0 outliers
total_wait_time: 0 outliers
Data after removing outliers: 1162 records remaining.
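The IQR logic used above can be factored into a small reusable helper; a minimal sketch with a toy series (the function name and sample values are illustrative, not part of the original notebook):

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

s = pd.Series([10, 12, 11, 13, 12, 50])   # 50 is an obvious outlier
low, high = iqr_bounds(s)
print(s[(s < low) | (s > high)].tolist())  # -> [50]
```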
# Keep only rows with a numeric rating ('Not given' was already converted to NaN
# above, so comparing against the string would no longer filter anything)
filtered_data = data[data['rating'].notna()].copy()
# Calculate the rating count and average rating for each restaurant
restaurant_ratings = filtered_data.groupby('restaurant_name')['rating'].agg(['count', 'mean']).reset_index()
restaurant_ratings.columns = ['restaurant_name', 'rating_count', 'average_rating']
# Filter the restaurants that meet the criteria: rating count > 50 and average rating > 4
eligible_restaurants = restaurant_ratings[(restaurant_ratings['rating_count'] > 50) & (restaurant_ratings['average_rating'] > 4)]
# Style the table using pandas Styler with all column headers centered and data left-justified, except for rating_count data which is centered
styled_eligible_restaurants = (eligible_restaurants.style
    .set_properties(subset=['restaurant_name'], **{'text-align': 'left'})
    .set_properties(subset=['rating_count'], **{'text-align': 'center'})
    .set_table_styles([{
        'selector': 'th',
        'props': [('text-align', 'center')]
    }])
    .set_caption("Restaurants Eligible for Promotional Offer"))
# Display the styled table
styled_eligible_restaurants
 | restaurant_name | rating_count | average_rating
---|---|---|---
20 | Blue Ribbon Fried Chicken | 64 | 4.328125
21 | Blue Ribbon Sushi | 73 | 4.219178
136 | Shake Shack | 133 | 4.278195
153 | The Meatball Shop | 84 | 4.511905
# Define the function to calculate the commission based on order cost
def calculate_commission(cost):
    if cost > 20:
        return cost * 0.25
    elif cost > 5:
        return cost * 0.15
    else:
        return 0
# Apply the function to calculate the commission for each order
data['commission'] = data['cost_of_the_order'].apply(calculate_commission)
# Calculate the total revenue generated by the company
total_revenue = data['commission'].sum()
# Print the total revenue generated by the company with a formatted string, rounded to 2 decimal places
print(f"The total net revenue generated by the company across all orders, based on the given commission structure, is ${total_revenue:.2f}")
The total net revenue generated by the company across all orders, based on the given commission structure, is $6166.30
Revenue from Low-Cost Orders:
Orders costing $5 or less generate no commission under the current structure, so they contribute delivery workload without adding revenue.
The company should focus on increasing the average order value to maximize revenue.
Impact of Commission Structure:
Because the rate jumps from 15% to 25% above the $20 threshold, total revenue is driven disproportionately by higher-value orders.
Recommendations
Promote Higher-Value Items:
Increase Average Order Value:
Monitor and Adjust Commission Structure:
Customer Engagement:
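To support the "Monitor and Adjust Commission Structure" recommendation, it helps to parameterize the commission function so what-if scenarios are easy to run. A sketch; the rate arguments and the three sample order costs are hypothetical, while the defaults mirror the structure defined above:

```python
def commission(cost, high_rate=0.25, mid_rate=0.15, threshold_high=20, threshold_mid=5):
    # Defaults mirror the current structure: 25% above $20, 15% above $5, else 0
    if cost > threshold_high:
        return cost * high_rate
    elif cost > threshold_mid:
        return cost * mid_rate
    return 0.0

orders = [4.00, 12.00, 25.00]  # hypothetical order costs
baseline = sum(commission(c) for c in orders)
adjusted = sum(commission(c, mid_rate=0.18) for c in orders)  # raise the middle tier to 18%
print(f"baseline: ${baseline:.2f}, adjusted: ${adjusted:.2f}")
```

Replaying the full order history through variants of this function would show how sensitive total revenue is to each tier before any pricing change is rolled out.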
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']
# Calculate the percentage of orders that take more than 60 minutes to get delivered
total_orders = len(data)
orders_over_60_minutes = len(data[data['total_time'] > 60])
percentage_over_60_minutes = (orders_over_60_minutes / total_orders) * 100
# Print the percentage of orders that take more than 60 minutes
print(f"The percentage of orders that take more than 60 minutes to get delivered is {percentage_over_60_minutes:.2f}%")
The percentage of orders that take more than 60 minutes to get delivered is 10.54%
Percentage of Orders Over 60 Minutes:
Approximately 10.54% of orders take more than 60 minutes in total (preparation plus delivery).
Implications of Long Delivery Times:
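A natural follow-up is to see where the late orders concentrate, for example by splitting the over-60-minute orders by weekday vs. weekend. A minimal sketch on a hypothetical four-row sample; the real analysis would use the full `data` frame with the same column names:

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the FoodHub data
sample = pd.DataFrame({
    'day_of_the_week': ['Weekday', 'Weekday', 'Weekend', 'Weekend'],
    'total_time': [65, 45, 70, 50],
})

late = sample[sample['total_time'] > 60]
late_share_by_day = late['day_of_the_week'].value_counts(normalize=True)
print(late_share_by_day)
```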
# Calculate the mean delivery time for weekdays and weekends
mean_delivery_time_by_day = data.groupby('day_of_the_week')['delivery_time'].mean().reset_index()
# Separate weekdays and weekends
weekdays = mean_delivery_time_by_day[mean_delivery_time_by_day['day_of_the_week'] == 'Weekday']['delivery_time'].values[0]
weekends = mean_delivery_time_by_day[mean_delivery_time_by_day['day_of_the_week'] == 'Weekend']['delivery_time'].values[0]
# Print the mean delivery times
print(f"The mean delivery time on weekdays is {weekdays:.2f} minutes.")
print(f"The mean delivery time on weekends is {weekends:.2f} minutes.")
The mean delivery time on weekdays is 28.34 minutes.
The mean delivery time on weekends is 22.47 minutes.
Mean Delivery Time on Weekdays:
The mean delivery time on weekdays is approximately 28.34 minutes.
Mean Delivery Time on Weekends:
The mean delivery time on weekends is approximately 22.47 minutes.
Delivery times on weekdays are appreciably longer than on weekends, despite the generally higher order volume on weekends.
Improve Weekday Efficiency:
Maintain Weekend Efficiency:
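Since the notebook already imports `scipy.stats`, the weekday/weekend delivery-time gap can be checked for statistical significance with Welch's t-test. A sketch on hypothetical samples chosen to echo the observed means; on the real data the two lists would come from `data.loc[data['day_of_the_week'] == 'Weekday', 'delivery_time']` and the weekend equivalent:

```python
from scipy import stats

# Hypothetical delivery times in minutes (illustrative, not the real data)
weekday_times = [28, 30, 27, 29, 31, 26]
weekend_times = [22, 23, 21, 24, 22, 23]

# Welch's t-test: does not assume equal variances between the two groups
t_stat, p_value = stats.ttest_ind(weekday_times, weekend_times, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would support treating the weekday slowdown as a systematic operational issue rather than noise.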
Long Delivery Times:
Delivery Times on Weekdays vs. Weekends:
Highly Rated Restaurants:
Order Frequency by Day of the Week:
Optimize Delivery Efficiency on Weekdays:
Leverage High Customer Ratings:
Focus on Weekend Efficiency:
Enhance Customer Experience:
Improve Cuisine-Specific Offerings:
Incentivize Higher-Value Orders:
# Convert notebook to html
import os
os.getcwd()
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/FDS_Project_LearnerNotebook_FullCode.ipynb"
[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/FDS_Project_LearnerNotebook_FullCode.ipynb to html
[NbConvertApp] Writing 3452465 bytes to /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/FDS_Project_LearnerNotebook_FullCode.html