Case Study - Challenger Launch¶
Importing the necessary libraries¶
In [ ]:
# Core Python libraries for numerical and DataFrame computations
import pandas as pd
import numpy as np
In [1]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loading the data¶
In [ ]:
data=pd.read_csv('challenger-data.csv')
Now let us look at the first five records of the data.
In [ ]:
data.head()
Out[ ]:
| | Observation | Y | X |
|---|---|---|---|
| 0 | 1 | 1 | 53 |
| 1 | 2 | 1 | 53 |
| 2 | 3 | 1 | 53 |
| 3 | 4 | 0 | 53 |
| 4 | 5 | 0 | 53 |
- X represents the temperature at the time of the rocket launch.
- Y indicates whether an O-ring failure occurred (1) or not (0) at that temperature.
Let's check the info of the data.
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Observation  120 non-null    int64
 1   Y            120 non-null    int64
 2   X            120 non-null    int64
dtypes: int64(3)
memory usage: 2.9 KB
- The data has 120 rows and 3 integer columns, with no missing values in any column.
In [ ]:
data.describe()
Out[ ]:
| | Observation | Y | X |
|---|---|---|---|
| count | 120.000000 | 120.000000 | 120.000000 |
| mean | 60.500000 | 0.083333 | 70.000000 |
| std | 34.785054 | 0.277544 | 7.100716 |
| min | 1.000000 | 0.000000 | 53.000000 |
| 25% | 30.750000 | 0.000000 | 67.000000 |
| 50% | 60.500000 | 0.000000 | 70.000000 |
| 75% | 90.250000 | 0.000000 | 75.250000 |
| max | 120.000000 | 1.000000 | 81.000000 |
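- The mean launch temperature in the data is 70 degrees Fahrenheit, with a range of 53 to 81.
- The mean of Y is 0.083, so about 8% of the 120 observations (10 of them) are O-ring failures.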
Visualizing the Data¶
In [ ]:
# Plot the number of O-ring failures observed at each temperature using Matplotlib.
from matplotlib import pyplot as plt
# Split the data into failures (Y == 1) and non-failures (Y == 0)
failures = data.loc[data.Y == 1]
no_failures = data.loc[data.Y == 0]
# Count the observations at each temperature
failures_freq = failures.X.value_counts()
no_failures_freq = no_failures.X.value_counts()
# Red points: failure count per temperature
plt.scatter(failures_freq.index, failures_freq, c='red', s=40)
# Blue points: temperatures with at least one non-failure, drawn at zero
plt.scatter(no_failures_freq.index, np.zeros(len(no_failures_freq)), c='blue', s=40)
plt.xlabel('X: Temperature')
plt.ylabel('Number of Failures')
plt.legend(['Failures', 'No failures'])
plt.show()
- At higher temperatures, O-ring failures are much less likely.
- Below 55 degrees, some observations show no O-ring failure while 3 show a failure, so the plot alone leaves doubt about whether a launch at low temperature is safe.
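The plot shows counts, which can hide how many observations were made at each temperature. A minimal sketch of a per-temperature failure rate follows. The helper name `failure_rate_by_temperature` and the `toy` frame are illustrative assumptions: only the five rows at 53 mirror `data.head()` above, and the rows at 70 are made up to keep the example self-contained.

```python
import pandas as pd

def failure_rate_by_temperature(df):
    """Return failures, observations and failure rate for each temperature X."""
    summary = df.groupby('X')['Y'].agg(failures='sum', observations='count')
    summary['failure_rate'] = summary['failures'] / summary['observations']
    return summary

# Toy frame: the 53-degree rows mirror data.head(); the 70-degree rows are hypothetical.
toy = pd.DataFrame({
    'Y': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    'X': [53, 53, 53, 53, 53, 70, 70, 70, 70, 70],
})
rates = failure_rate_by_temperature(toy)
print(rates)
```

In the notebook itself, `failure_rate_by_temperature(data)` would produce the same kind of summary for all 120 observations.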
Logistic Regression¶
In [ ]:
# You will need to have the following libraries installed before proceeding:
import statsmodels.formula.api as SM
# Build the model: logistic regression of failure (Y) on temperature (X)
model = SM.logit(formula='Y ~ X', data=data)
result = model.fit()
# Summarize the model
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.242411
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:                      Y   No. Observations:                  120
Model:                          Logit   Df Residuals:                      118
Method:                           MLE   Df Model:                            1
Date:                Tue, 27 Jul 2021   Pseudo R-squ.:                  0.1549
Time:                        20:25:44   Log-Likelihood:                -29.089
converged:                       True   LL-Null:                       -34.420
Covariance Type:            nonrobust   LLR p-value:                  0.001094
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.4049      3.041      2.435      0.015       1.445      13.365
X             -0.1466      0.047     -3.104      0.002      -0.239      -0.054
==============================================================================
- The summary of the fitted model reports the intercept, the coefficient for X, their standard errors and their p-values.
- The negative coefficient for X (-0.1466) means that each 1-degree drop in temperature raises the log-odds of an O-ring failure by about 0.147, which multiplies the odds of failure by about exp(0.1466) ≈ 1.16, a roughly 16% increase in the odds.
- The p-values for both the intercept (0.015) and X (0.002) are below 0.05, so both are statistically significant, and temperature does affect the chance of an O-ring failure.
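The logit model relates the probability p of a failure to temperature through log(p / (1 - p)) = Intercept + coef × X. As a rough sketch, the two coefficients printed in the summary above (7.4049 and -0.1466) can be plugged in by hand to get the odds multiplier per degree and the predicted probabilities at the minimum, mean and maximum temperatures reported by `data.describe()`. The function name `failure_probability` is an illustrative assumption; in the notebook, `result.predict` should give equivalent values directly from the fitted model.

```python
import math

# Coefficients copied from the Logit summary above
INTERCEPT = 7.4049
COEF_X = -0.1466

def failure_probability(temperature):
    """Predicted probability of an O-ring failure at the given temperature."""
    log_odds = INTERCEPT + COEF_X * temperature
    return 1.0 / (1.0 + math.exp(-log_odds))

# Factor by which the odds of failure grow for each one-degree drop in temperature
odds_ratio_per_degree_drop = math.exp(-COEF_X)
print(f"Odds multiplier per 1-degree drop: {odds_ratio_per_degree_drop:.3f}")

# Minimum, mean and maximum temperatures from data.describe()
for temperature in (53, 70, 81):
    print(f"{temperature} F: P(failure) = {failure_probability(temperature):.3f}")
```

With these coefficients, the predicted probability of failure is roughly 0.41 at 53 degrees, 0.05 at 70 degrees and 0.01 at 81 degrees, which matches the pattern seen in the plot.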
In [ ]:
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/Copy of FDS_Project_LearnerNotebook_FullCode.ipynb"