Hospital Length of Stay (LOS) Prediction¶
Context:¶
Hospital management is a vital area that gained a lot of attention during the COVID-19 pandemic. Inefficient distribution of resources like beds and ventilators can lead to serious complications. This risk can be mitigated by predicting the length of stay (LOS) of a patient at the time of admission. Once the LOS is estimated, the hospital can plan suitable treatment, resources, and staff to reduce the LOS and increase the chances of recovery. Rooms and beds can also be allocated accordingly.
HealthPlus hospital has been incurring heavy losses in revenue, and in lives, due to its inefficient management system. It has been unable to allocate equipment, beds, and hospital staff effectively. A system that could estimate the length of stay (LOS) of a patient could solve this problem to a great extent.
Objective:¶
As a Data Scientist, you have been hired by HealthPlus to analyze the data, find out which factors affect the LOS the most, and build a machine learning model that can predict the LOS of a patient using the data available at admission and after running a few tests. Also, derive useful insights and policies from the data that can help the hospital improve its healthcare infrastructure and revenue.
Data Dictionary:¶
The data contains information recorded at the time of admission of the patient. It only contains records of patients who were admitted to the hospital. The detailed data dictionary is given below:
- patientid: Patient ID
- Age: Range of age of the patient
- gender: Gender of the patient
- Type of Admission: Trauma, Emergency, or Urgent
- Severity of Illness: Extreme, Moderate, or Minor
- health_conditions: Any previous health conditions suffered by the patient
- Visitors with Patient: The number of visitors who accompany the patient
- Insurance: Does the patient have health insurance or not?
- Admission_Deposit: The deposit paid by the patient during admission
- Stay (in days): The number of days that the patient has stayed in the hospital. This is the target variable
- Available Extra Rooms in Hospital: The number of rooms available during admission
- Department: The department which will be treating the patient
- Ward_Facility_Code: The code of the ward facility in which the patient will be admitted
- doctor_name: The doctor who will be treating the patient
- staff_available: The number of staff who are not occupied at the moment in the ward
Approach to solve the problem:¶
- Import the necessary libraries
- Read the dataset and get an overview
- Perform exploratory data analysis: univariate and bivariate
- Preprocess the data, if needed
- Define the performance metric and build ML models
- Check for model assumptions
- Compare models and determine the best one
- Draw observations and business insights
Importing Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build models for prediction
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
# To encode categorical variables
from sklearn.preprocessing import LabelEncoder
# For tuning the model
from sklearn.model_selection import GridSearchCV
# To check model performance
from sklearn.metrics import make_scorer, mean_squared_error, r2_score, mean_absolute_error
from google.colab import drive
drive.mount('/content/Mydrive')
Mounted at /content/Mydrive
# Read the healthcare dataset file
data = pd.read_csv("healthcare_data.csv")
# Copying data to another variable to avoid any changes to original data
same_data = data.copy()
Data Overview¶
# View the first 5 rows of the dataset
data.head()
 | Available Extra Rooms in Hospital | Department | Ward_Facility_Code | doctor_name | staff_available | patientid | Age | gender | Type of Admission | Severity of Illness | health_conditions | Visitors with Patient | Insurance | Admission_Deposit | Stay (in days)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | gynecology | D | Dr Sophia | 0 | 33070 | 41-50 | Female | Trauma | Extreme | Diabetes | 4 | Yes | 2966.408696 | 8 |
1 | 4 | gynecology | B | Dr Sophia | 2 | 34808 | 31-40 | Female | Trauma | Minor | Heart disease | 2 | No | 3554.835677 | 9 |
2 | 2 | gynecology | B | Dr Sophia | 8 | 44577 | 21-30 | Female | Trauma | Extreme | Diabetes | 2 | Yes | 5624.733654 | 7 |
3 | 4 | gynecology | D | Dr Olivia | 7 | 3695 | 31-40 | Female | Urgent | Moderate | None | 4 | No | 4814.149231 | 8 |
4 | 2 | anesthesia | E | Dr Mark | 10 | 108956 | 71-80 | Male | Trauma | Moderate | Diabetes | 2 | No | 5169.269637 | 34 |
# View the last 5 rows of the dataset
data.tail()
 | Available Extra Rooms in Hospital | Department | Ward_Facility_Code | doctor_name | staff_available | patientid | Age | gender | Type of Admission | Severity of Illness | health_conditions | Visitors with Patient | Insurance | Admission_Deposit | Stay (in days)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
499995 | 4 | gynecology | F | Dr Sarah | 2 | 43001 | 11-20 | Female | Trauma | Minor | High Blood Pressure | 3 | No | 4105.795901 | 10 |
499996 | 13 | gynecology | F | Dr Olivia | 8 | 85601 | 31-40 | Female | Emergency | Moderate | Other | 2 | No | 4631.550257 | 11 |
499997 | 2 | gynecology | B | Dr Sarah | 3 | 22447 | 11-20 | Female | Emergency | Moderate | High Blood Pressure | 2 | No | 5456.930075 | 8 |
499998 | 2 | radiotherapy | A | Dr John | 1 | 29957 | 61-70 | Female | Trauma | Extreme | Diabetes | 2 | No | 4694.127772 | 23 |
499999 | 3 | gynecology | F | Dr Sophia | 3 | 45008 | 41-50 | Female | Trauma | Moderate | Heart disease | 4 | Yes | 4713.868519 | 10 |
# Understand the shape of the data
data.shape
(500000, 15)
- The dataset has 500,000 rows and 15 columns.
# Checking the info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 15 columns):
 #   Column                             Non-Null Count   Dtype
---  ------                             --------------   -----
 0   Available Extra Rooms in Hospital  500000 non-null  int64
 1   Department                         500000 non-null  object
 2   Ward_Facility_Code                 500000 non-null  object
 3   doctor_name                        500000 non-null  object
 4   staff_available                    500000 non-null  int64
 5   patientid                          500000 non-null  int64
 6   Age                                500000 non-null  object
 7   gender                             500000 non-null  object
 8   Type of Admission                  500000 non-null  object
 9   Severity of Illness                500000 non-null  object
 10  health_conditions                  500000 non-null  object
 11  Visitors with Patient              500000 non-null  int64
 12  Insurance                          500000 non-null  object
 13  Admission_Deposit                  500000 non-null  float64
 14  Stay (in days)                     500000 non-null  int64
dtypes: float64(1), int64(5), object(9)
memory usage: 57.2+ MB
Observations:
- Available Extra Rooms in Hospital, staff_available, patientid, Visitors with Patient, Admission_Deposit, and Stay (in days) are of numeric data type and the rest of the columns are of object data type.
- The number of non-null values is the same as the total number of entries in the data, i.e., there are no null values.
- The column patientid is an identifier for patients in the data. This column will not help with our analysis so we can drop it.
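As a quick sanity check (a minimal sketch using the data DataFrame loaded above), the absence of missing values can be confirmed directly:
# Count missing values in each column; all zeros matches the info() output above
data.isnull().sum()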
# To view patientid and the number of times they have been admitted to the hospital
data['patientid'].value_counts()
126719    21
125695    21
44572     21
126623    21
125625    19
          ..
37634      1
91436      1
118936     1
52366      1
105506     1
Name: patientid, Length: 126399, dtype: int64
Observation:
- The maximum number of times the same patient was admitted to the hospital is 21, and the minimum is 1. There are 126,399 unique patients in the data.
# Dropping patientid from the data as it is an identifier and will not add value to the analysis
data=data.drop(columns=["patientid"])
Data Preparation for Model Building¶
- Before we proceed to build a model, we'll have to encode the categorical features.
- Separate the independent variables and the dependent variable.
- We'll split the data into train and test sets to be able to evaluate the model that we train on the training data.
# Creating dummy variables for the categorical columns
# drop_first=True drops one dummy per feature to avoid redundant (perfectly collinear) variables
data = pd.get_dummies(
data,
columns = data.select_dtypes(include = ["object", "category"]).columns.tolist(),
drop_first = True,
)
# Check the data after handling categorical data
data
 | Available Extra Rooms in Hospital | staff_available | Visitors with Patient | Admission_Deposit | Stay (in days) | Department_anesthesia | Department_gynecology | Department_radiotherapy | Department_surgery | Ward_Facility_Code_B | Ward_Facility_Code_C | Ward_Facility_Code_D | Ward_Facility_Code_E | Ward_Facility_Code_F | doctor_name_Dr John | doctor_name_Dr Mark | doctor_name_Dr Nathan | doctor_name_Dr Olivia | doctor_name_Dr Sam | doctor_name_Dr Sarah | doctor_name_Dr Simon | doctor_name_Dr Sophia | Age_11-20 | Age_21-30 | Age_31-40 | Age_41-50 | Age_51-60 | Age_61-70 | Age_71-80 | Age_81-90 | Age_91-100 | gender_Male | gender_Other | Type of Admission_Trauma | Type of Admission_Urgent | Severity of Illness_Minor | Severity of Illness_Moderate | health_conditions_Diabetes | health_conditions_Heart disease | health_conditions_High Blood Pressure | health_conditions_None | health_conditions_Other | Insurance_Yes
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 0 | 4 | 2966.408696 | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 4 | 2 | 2 | 3554.835677 | 9 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 2 | 8 | 2 | 5624.733654 | 7 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 7 | 4 | 4814.149231 | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 2 | 10 | 2 | 5169.269637 | 34 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
499995 | 4 | 2 | 3 | 4105.795901 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
499996 | 13 | 8 | 2 | 4631.550257 | 11 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
499997 | 2 | 3 | 2 | 5456.930075 | 8 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
499998 | 2 | 1 | 2 | 4694.127772 | 23 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
499999 | 3 | 3 | 4 | 4713.868519 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
500000 rows × 43 columns
# Separating independent variables and the target variable
x = data.drop('Stay (in days)',axis=1)
y = data['Stay (in days)']
# Splitting the dataset into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True, random_state = 1)
# Checking the shape of the train and test data
print("Shape of Training set : ", x_train.shape)
print("Shape of test set : ", x_test.shape)
Shape of Training set :  (400000, 42)
Shape of test set :  (100000, 42)
Serialization¶
Serialization is the process of converting a data object (e.g., Python objects, models) into a format that allows us to store or transmit the data, and then recreating the object when needed through the reverse process, deserialization. We are going to discuss two such serialization tools used in Python - Pickle and Joblib.
Pickle¶
What is Pickle?
Pickle is a Python module that can be used to convert a Python object into a byte stream and vice versa. The byte stream can be stored in a file or transmitted over a network.
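For instance, a minimal in-memory round trip (a sketch; the payload dictionary is purely illustrative) looks like this:
import pickle

# Serialize a Python object to a byte stream and restore it
payload = {"model_name": "rf_baseline", "version": 1}  # illustrative object
blob = pickle.dumps(payload)     # object -> bytes
restored = pickle.loads(blob)    # bytes -> object
print(restored == payload)       # True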
Why is Pickle used?
Pickle is used to store Python objects in a format that can be easily retrieved and used later. This is useful when you need to save the state of your program, for example, when you want to save the trained model of a machine learning algorithm so that it can be used later to make predictions on new data.
Advantages of using Pickle:
- It is easy to use.
- It can handle almost any Python object, including custom classes and functions.
- The serialized data can be compressed, making it smaller in size and faster to transmit (see the sketch after this list).
- The deserialized object is guaranteed to have the same type and value as the original object.
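To illustrate the compression point above, the byte stream can be written through gzip (a minimal sketch; the object and file name are illustrative):
import gzip
import pickle

obj = list(range(1_000_000))  # any picklable Python object
# gzip.open returns a file-like object, so pickle can write straight through it
with gzip.open("obj.pkl.gz", "wb") as f:
    pickle.dump(obj, f)
with gzip.open("obj.pkl.gz", "rb") as f:
    obj_restored = pickle.load(f)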
Importing the library¶
import pickle
# Create a model with desired hyperparameters
model = RandomForestRegressor(n_estimators=120, max_depth=None, max_features=0.8, random_state=1)
model.fit(x_train, y_train)
RandomForestRegressor(max_features=0.8, n_estimators=120, random_state=1)
Saving the trained model using Pickle¶
with open('model.pkl', 'wb') as file:
pickle.dump(model, file)
The above code does two things:
- with open('model.pkl', 'wb') as file: opens a new file named model.pkl in write-binary mode. The with statement ensures that the file is closed properly after the data has been written to it.
- pickle.dump(model, file) writes the model object to the file object in binary format.
Loading the trained model using Pickle¶
with open('model.pkl', 'rb') as file:
loaded_model_pkl = pickle.load(file)
This code uses the open() function to open the model.pkl file in read-binary mode ('rb') and assigns the file object to the variable file.
Then, the pickle.load() method is called to load the saved model from the file file into the loaded_model_pkl variable.
To summarize, this code is loading the saved Random Forest Regression model stored in the model.pkl file using the pickle module, and assigning the loaded model to the loaded_model_pkl variable. This loaded model can be used to make predictions on new data.
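As a quick check (a sketch reusing the x_test and y_test splits created earlier), the reloaded model can be evaluated exactly like the original one:
# Predict on the held-out test set with the reloaded model
y_pred = loaded_model_pkl.predict(x_test)

# Compare predictions against the true LOS values
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))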
Joblib¶
Joblib is a Python library that provides tools for lightweight pipelining in Python, as well as utilities for parallel processing. In the context of machine learning, it is primarily used for efficient pickling of large NumPy arrays, as well as for persisting scikit-learn models.
Some advantages of using Joblib for model persistence include:
- Efficiency: Joblib is optimized for efficient processing of large NumPy arrays, making it a good choice for persisting large models and their associated data
- Parallel processing: Joblib can be used to easily parallelize operations across multiple cores, which can greatly speed up the training and evaluation of machine learning models (see the sketch after this list)
- Seamless integration with scikit-learn: Joblib is designed to work seamlessly with the popular scikit-learn machine learning library, making it a natural choice for persisting scikit-learn models
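A minimal sketch of that parallel-processing utility (independent of model persistence; the square worker function is illustrative):
from joblib import Parallel, delayed

def square(n):
    return n * n

# Run the function across 4 worker processes in parallel
results = Parallel(n_jobs=4)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]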
One disadvantage of using Joblib is that it is not as widely used or well-known as other serialization libraries like Pickle, which can make it more difficult to find resources or support if you encounter issues. Additionally, while Joblib is optimized for large NumPy arrays, it may not be the best choice for persisting other types of data or models.
Importing the library¶
import joblib
Saving the trained model using Joblib¶
joblib.dump(model, 'model.joblib')
This code is using the joblib library to save a trained machine learning model to disk in the file model.joblib.
The joblib.dump() function takes two arguments: the first is the trained machine learning model (model) that needs to be saved, and the second is the filename ('model.joblib') where the model will be saved.
The advantage of using Joblib over Pickle is that it is optimized for dealing with large NumPy arrays, which are common in machine learning. This means that Joblib is often faster and more efficient than Pickle for saving and loading machine learning models.
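joblib.dump() also accepts a compress argument (an integer from 0 to 9, or True); a minimal sketch, with an illustrative file name:
# Save the model with compression (level 3 is a common size/speed trade-off)
joblib.dump(model, 'model_compressed.joblib', compress=3)

# Joblib detects the compression automatically when loading
model_compressed = joblib.load('model_compressed.joblib')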
Loading the trained model using Joblib¶
loaded_model_joblib = joblib.load('model.joblib')
This code uses the joblib.load() function from the joblib library to load a trained machine learning model that was saved in a binary file format with the .joblib extension. The name of the file to be loaded is passed as an argument to the joblib.load() function.
Once the model is loaded from the file, it is stored in the variable loaded_model_joblib and can be used for making predictions on new data.
joblib.load(): This function is used to load a machine learning model that was saved using the joblib.dump() function
'model.joblib': This is the name of the file containing the saved model. It should be in the same directory as the notebook or script that is loading the model.
loaded_model_joblib: This is the variable where the loaded model will be stored. Once the model is loaded, it can be used to make predictions on new data.
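For example (a sketch reusing x_test and the pickle-loaded model from earlier; np is the NumPy alias imported at the top), one can confirm that both reloaded copies behave identically:
# Predictions from the two reloaded copies should match exactly
pred_joblib = loaded_model_joblib.predict(x_test)
pred_pickle = loaded_model_pkl.predict(x_test)
print(np.allclose(pred_joblib, pred_pickle))  # True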