MLS Case Study PCA¶
Context:¶
In this case study, we will use the Education dataset, which contains information on educational institutions in the USA. The data has attributes such as the number of applications received, enrollments, faculty education, financial aspects, and the graduation rate of each institution.
Objective:¶
The objective is to reduce the number of features using a dimensionality reduction technique, Principal Component Analysis (PCA), and to extract insights from the resulting components.
Dataset:¶
The Education dataset contains information on various colleges in the USA. It contains the following information:
- Names: Names of various universities and colleges
- Apps: Number of applications received
- Accept: Number of applications accepted
- Enroll: Number of new students enrolled
- Top10perc: Percentage of new students from top 10% of Higher Secondary class
- Top25perc: Percentage of new students from top 25% of Higher Secondary class
- F_Undergrad: Number of full-time undergraduate students
- P_Undergrad: Number of part-time undergraduate students
- Outstate: Out-of-state tuition
- Room_Board: Cost of room and board
- Books: Estimated book costs for a student
- Personal: Estimated personal spending for a student
- PhD: Percentage of faculties with a Ph.D.
- Terminal: Percentage of faculties with terminal degree
- S_F_Ratio: Student/faculty ratio
- perc_alumni: Percentage of alumni who donate
- Expend: The instructional expenditure per student
- Grad_Rate: Graduation rate
Importing libraries and overview of the dataset¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to scale the data using z-score
from sklearn.preprocessing import StandardScaler
#Importing PCA
from sklearn.decomposition import PCA
Connect to Google Drive¶
# Connect to google
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loading data¶
data=pd.read_csv("Education_Post_12th_Standard.csv")
data.head()
 | Names | Apps | Accept | Enroll | Top10perc | Top25perc | F_Undergrad | P_Undergrad | Outstate | Room_Board | Books | Personal | PhD | Terminal | S_F_Ratio | perc_alumni | Expend | Grad_Rate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Abilene Christian University | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
1 | Adelphi University | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
2 | Adrian College | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
3 | Agnes Scott College | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
4 | Alaska Pacific University | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
Check the info of the data¶
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Names        777 non-null    object
 1   Apps         777 non-null    int64
 2   Accept       777 non-null    int64
 3   Enroll       777 non-null    int64
 4   Top10perc    777 non-null    int64
 5   Top25perc    777 non-null    int64
 6   F_Undergrad  777 non-null    int64
 7   P_Undergrad  777 non-null    int64
 8   Outstate     777 non-null    int64
 9   Room_Board   777 non-null    int64
 10  Books        777 non-null    int64
 11  Personal     777 non-null    int64
 12  PhD          777 non-null    int64
 13  Terminal     777 non-null    int64
 14  S_F_Ratio    777 non-null    float64
 15  perc_alumni  777 non-null    int64
 16  Expend       777 non-null    int64
 17  Grad_Rate    777 non-null    int64
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB
Observations:
- There are 777 observations and 18 columns in the dataset.
- All columns have 777 non-null values i.e. there are no missing values.
- All columns are numeric except the Names column which is of object data type.
Data Preprocessing and Exploratory Data Analysis¶
Check if all the college names are unique¶
data.Names.nunique()
777
Observations:
- All college names are unique
- Since all entries are unique, the Names column would not add value to our analysis. We can drop it.
#Dropping Names column
data.drop(columns="Names", inplace=True)
Summary Statistics¶
data.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Apps | 777.0 | 3001.638353 | 3870.201484 | 81.0 | 776.0 | 1558.0 | 3624.0 | 48094.0 |
Accept | 777.0 | 2018.804376 | 2451.113971 | 72.0 | 604.0 | 1110.0 | 2424.0 | 26330.0 |
Enroll | 777.0 | 779.972973 | 929.176190 | 35.0 | 242.0 | 434.0 | 902.0 | 6392.0 |
Top10perc | 777.0 | 27.558559 | 17.640364 | 1.0 | 15.0 | 23.0 | 35.0 | 96.0 |
Top25perc | 777.0 | 55.796654 | 19.804778 | 9.0 | 41.0 | 54.0 | 69.0 | 100.0 |
F_Undergrad | 777.0 | 3699.907336 | 4850.420531 | 139.0 | 992.0 | 1707.0 | 4005.0 | 31643.0 |
P_Undergrad | 777.0 | 855.298584 | 1522.431887 | 1.0 | 95.0 | 353.0 | 967.0 | 21836.0 |
Outstate | 777.0 | 10440.669241 | 4023.016484 | 2340.0 | 7320.0 | 9990.0 | 12925.0 | 21700.0 |
Room_Board | 777.0 | 4357.526384 | 1096.696416 | 1780.0 | 3597.0 | 4200.0 | 5050.0 | 8124.0 |
Books | 777.0 | 549.380952 | 165.105360 | 96.0 | 470.0 | 500.0 | 600.0 | 2340.0 |
Personal | 777.0 | 1340.642214 | 677.071454 | 250.0 | 850.0 | 1200.0 | 1700.0 | 6800.0 |
PhD | 777.0 | 72.660232 | 16.328155 | 8.0 | 62.0 | 75.0 | 85.0 | 103.0 |
Terminal | 777.0 | 79.702703 | 14.722359 | 24.0 | 71.0 | 82.0 | 92.0 | 100.0 |
S_F_Ratio | 777.0 | 14.089704 | 3.958349 | 2.5 | 11.5 | 13.6 | 16.5 | 39.8 |
perc_alumni | 777.0 | 22.743887 | 12.391801 | 0.0 | 13.0 | 21.0 | 31.0 | 64.0 |
Expend | 777.0 | 9660.171171 | 5221.768440 | 3186.0 | 6751.0 | 8377.0 | 10830.0 | 56233.0 |
Grad_Rate | 777.0 | 65.463320 | 17.177710 | 10.0 | 53.0 | 65.0 | 78.0 | 118.0 |
Observations:
- On average, the universities in this dataset receive approximately 3,000 applications, of which around 2,000 are accepted, and around 780 new students enroll.
- The standard deviation is very high for Apps, Accept, and Enroll, which reflects the wide variety of universities and colleges in the data.
- The average cost of room and board, books, and personal expenses is approximately 4,357, 550, and 1,350 dollars respectively.
- The average number of full-time undergraduate students is around 3,700, whereas the average number of part-time undergraduate students is much lower at around 850.
- PhD and Grad_Rate have maximum values greater than 100, which is not possible as these variables are percentages. Let's see how many such observations are there in the data.
data[(data.PhD>100) | (data.Grad_Rate>100)]
 | Apps | Accept | Enroll | Top10perc | Top25perc | F_Undergrad | P_Undergrad | Outstate | Room_Board | Books | Personal | PhD | Terminal | S_F_Ratio | perc_alumni | Expend | Grad_Rate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
95 | 3847 | 3433 | 527 | 9 | 35 | 1010 | 12 | 9384 | 4840 | 600 | 500 | 22 | 47 | 14.3 | 20 | 7697 | 118 |
582 | 529 | 481 | 243 | 22 | 47 | 1206 | 134 | 4860 | 3122 | 600 | 650 | 103 | 88 | 17.4 | 16 | 6415 | 43 |
There is just one such observation for each variable. We can cap these values at 100.
data.loc[582,"PhD"]=100
data.loc[95,"Grad_Rate"]=100
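As a quick check (a small addition, reusing the same DataFrame), we can confirm that both columns are now capped at 100:
# Verify that PhD and Grad_Rate no longer exceed 100 after capping
print(data[["PhD", "Grad_Rate"]].max())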
Let's check the distribution and outliers for each column in the data¶
cont_cols = list(data.columns)
for col in cont_cols:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()
Apps Skew : 3.72
Accept Skew : 3.42
Enroll Skew : 2.69
Top10perc Skew : 1.41
Top25perc Skew : 0.26
F_Undergrad Skew : 2.61
P_Undergrad Skew : 5.69
Outstate Skew : 0.51
Room_Board Skew : 0.48
Books Skew : 3.49
Personal Skew : 1.74
PhD Skew : -0.77
Terminal Skew : -0.82
S_F_Ratio Skew : 0.67
perc_alumni Skew : 0.61
Expend Skew : 3.46
Grad_Rate Skew : -0.14
Observations:
- The distribution plots show that Apps, Accept, Enroll, Top10perc, F_Undergrad, P_Undergrad, Books, Personal, and Expend are highly right skewed. The boxplots confirm that all of these variables have outliers.
- Top25perc is the only variable that does not have outliers.
- Outstate, Room_Board, S_F_Ratio, and perc_alumni seem to have a moderate right skew.
- PhD and Terminal are moderately left skewed.
Now, let's check the correlation among different variables.
plt.figure(figsize=(15,10))
sns.heatmap(data.corr(), annot=True, fmt='0.2f')
plt.show()
Observations:
- We can see high positive correlation among the following variables:
- Apps and Accept
- Apps and Enroll
- Apps and F_Undergrad
- Accept and Enroll
- Accept and F_Undergrad
- Enroll and F_Undergrad
- Top10perc and Top25perc
- PhD and Terminal
- We can see high negative correlation among the following variables:
- S_F_Ratio and Top10perc
- S_F_Ratio and Expend
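To make these observations less manual, the short sketch below (an addition, not part of the original notebook) lists all variable pairs whose absolute correlation exceeds an illustrative threshold of 0.7:
# List pairs of variables with |correlation| above a threshold
corr = data.corr()
threshold = 0.7  # chosen for illustration; adjust as needed
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep the upper triangle only
pairs = upper.stack().sort_values(key=abs, ascending=False)        # long format: (var1, var2) -> corr
print(pairs[pairs.abs() > threshold])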
Scaling the data¶
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data_scaled.head()
 | Apps | Accept | Enroll | Top10perc | Top25perc | F_Undergrad | P_Undergrad | Outstate | Room_Board | Books | Personal | PhD | Terminal | S_F_Ratio | perc_alumni | Expend | Grad_Rate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | -0.346882 | -0.321205 | -0.063509 | -0.258583 | -0.191827 | -0.168116 | -0.209207 | -0.746356 | -0.964905 | -0.602312 | 1.270045 | -0.162859 | -0.115729 | 1.013776 | -0.867574 | -0.501910 | -0.317993 |
1 | -0.210884 | -0.038703 | -0.288584 | -0.655656 | -1.353911 | -0.209788 | 0.244307 | 0.457496 | 1.909208 | 1.215880 | 0.235515 | -2.676529 | -3.378176 | -0.477704 | -0.544572 | 0.166110 | -0.551805 |
2 | -0.406866 | -0.376318 | -0.478121 | -0.315307 | -0.292878 | -0.549565 | -0.497090 | 0.201305 | -0.554317 | -0.905344 | -0.259582 | -1.205112 | -0.931341 | -0.300749 | 0.585935 | -0.177290 | -0.668710 |
3 | -0.668261 | -0.681682 | -0.692427 | 1.840231 | 1.677612 | -0.658079 | -0.520752 | 0.626633 | 0.996791 | -0.602312 | -0.688173 | 1.185939 | 1.175657 | -1.615274 | 1.151188 | 1.792851 | -0.376446 |
4 | -0.726176 | -0.764555 | -0.780735 | -0.655656 | -0.596031 | -0.711924 | 0.009005 | -0.716508 | -0.216723 | 1.518912 | 0.235515 | 0.204995 | -0.523535 | -0.553542 | -1.675079 | 0.241803 | -2.948375 |
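As a quick sanity check (a small addition to the original flow), each scaled column should now have a mean of approximately 0 and a standard deviation of approximately 1:
# StandardScaler uses the population standard deviation, hence ddof=0 here
print(data_scaled.mean().round(2))
print(data_scaled.std(ddof=0).round(2))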
Principal Component Analysis¶
#Defining the number of principal components to generate
n=data_scaled.shape[1]
#Finding principal components for the data
pca = PCA(n_components=n, random_state=1)
data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled))
#The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_
# visualize the Explained Individual Components
plt.figure(figsize = (10,10))
plt.plot(range(1,18), pca.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
Text(0, 0.5, 'Cumulative Explained Variance')
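A complementary view (a sketch reusing exp_var and n from the cells above) is a bar plot of the variance explained by each individual component, which makes the drop-off after the first few components easier to see:
# Bar plot of the individual explained variance ratio per component
plt.figure(figsize=(10, 5))
plt.bar(range(1, n + 1), exp_var)
plt.title("Individual Explained Variance by Component")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.show()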
# Find the least number of components that can explain more than 70% of the variance
cum_var = 0
for ix, i in enumerate(exp_var):
    cum_var = cum_var + i
    if cum_var > 0.70:
        print("Number of PCs that explain at least 70% variance: ", ix+1)
        break
Number of PCs that explain at least 70% variance: 4
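The same answer can be obtained without an explicit loop; the sketch below (an alternative, not the notebook's original approach) uses np.cumsum and np.argmax:
# Cumulative explained variance; argmax returns the first index where the condition holds
cumulative = np.cumsum(exp_var)
print("Number of PCs that explain at least 70% variance: ", np.argmax(cumulative > 0.70) + 1)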
Observations:
- Out of the 17 original features, we have reduced the dimensionality to 4 principal components, and these 4 components explain approximately 70% of the original variance.
- That is a reduction of about 76% in dimensionality, at the cost of losing about 30% of the variance.
Let us now look at these principal components as a linear combination of original features.
pc_comps = ['PC1','PC2','PC3','PC4']
data_pca = pd.DataFrame(np.round(pca.components_[:4,:],2),index=pc_comps,columns=data_scaled.columns)
data_pca.T
 | PC1 | PC2 | PC3 | PC4
---|---|---|---|---
Apps | 0.25 | 0.33 | -0.06 | 0.28 |
Accept | 0.21 | 0.37 | -0.10 | 0.27 |
Enroll | 0.18 | 0.40 | -0.08 | 0.16 |
Top10perc | 0.35 | -0.08 | 0.03 | -0.05 |
Top25perc | 0.34 | -0.04 | -0.02 | -0.11 |
F_Undergrad | 0.15 | 0.42 | -0.06 | 0.10 |
P_Undergrad | 0.03 | 0.32 | 0.14 | -0.16 |
Outstate | 0.29 | -0.25 | 0.05 | 0.13 |
Room_Board | 0.25 | -0.14 | 0.15 | 0.19 |
Books | 0.06 | 0.06 | 0.68 | 0.08 |
Personal | -0.04 | 0.22 | 0.50 | -0.24 |
PhD | 0.32 | 0.06 | -0.13 | -0.53 |
Terminal | 0.32 | 0.05 | -0.07 | -0.52 |
S_F_Ratio | -0.18 | 0.25 | -0.29 | -0.16 |
perc_alumni | 0.21 | -0.25 | -0.15 | 0.02 |
Expend | 0.32 | -0.13 | 0.23 | 0.08 |
Grad_Rate | 0.25 | -0.17 | -0.21 | 0.26 |
Observations:
- Each principal component is a linear combination of original features. For example, we can write the equation for PC1 in the following manner:
PC1 = 0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 * Top25perc + 0.15 * F_Undergrad + 0.03 * P_Undergrad + 0.29 * Outstate + 0.25 * Room_Board + 0.06 * Books - 0.04 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S_F_Ratio + 0.21 * perc_alumni + 0.32 * Expend + 0.25 * Grad_Rate
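We can verify this numerically (a small sketch added for illustration, using the unrounded loadings): the PC1 score of a row in data_pca1 equals the dot product of the first row of pca.components_ with the corresponding row of the scaled data.
# Manually compute the PC1 score of the first college and compare it with fit_transform's output
manual_pc1 = np.dot(pca.components_[0], data_scaled.iloc[0])
print(round(manual_pc1, 4), round(data_pca1.iloc[0, 0], 4))  # the two values should match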
From a business perspective, the first two principal components pick up around 58% of the variability in the data, which is a considerable amount of variation.
Another way to interpret the components is through their weights. For example, we can consider weights with an absolute value greater than 0.25 as significant and analyze each component accordingly.
NOTE: Decision regarding what value of weights is high or significant may vary from case to case. It depends on the problem at hand.
def color_high(val):
    if val < -0.25:  # you can decide any value as per your understanding
        return 'background: pink'
    elif val > 0.25:
        return 'background: skyblue'

data_pca.T.style.applymap(color_high)
 | PC1 | PC2 | PC3 | PC4
---|---|---|---|---
Apps | 0.250000 | 0.330000 | -0.060000 | 0.280000 |
Accept | 0.210000 | 0.370000 | -0.100000 | 0.270000 |
Enroll | 0.180000 | 0.400000 | -0.080000 | 0.160000 |
Top10perc | 0.350000 | -0.080000 | 0.030000 | -0.050000 |
Top25perc | 0.340000 | -0.040000 | -0.020000 | -0.110000 |
F_Undergrad | 0.150000 | 0.420000 | -0.060000 | 0.100000 |
P_Undergrad | 0.030000 | 0.320000 | 0.140000 | -0.160000 |
Outstate | 0.290000 | -0.250000 | 0.050000 | 0.130000 |
Room_Board | 0.250000 | -0.140000 | 0.150000 | 0.190000 |
Books | 0.060000 | 0.060000 | 0.680000 | 0.080000 |
Personal | -0.040000 | 0.220000 | 0.500000 | -0.240000 |
PhD | 0.320000 | 0.060000 | -0.130000 | -0.530000 |
Terminal | 0.320000 | 0.050000 | -0.070000 | -0.520000 |
S_F_Ratio | -0.180000 | 0.250000 | -0.290000 | -0.160000 |
perc_alumni | 0.210000 | -0.250000 | -0.150000 | 0.020000 |
Expend | 0.320000 | -0.130000 | 0.230000 | 0.080000 |
Grad_Rate | 0.250000 | -0.170000 | -0.210000 | 0.260000 |
Observations:
- The first principal component, PC1, is related to high values of student quality (Top10perc, Top25perc), out-of-state tuition (Outstate), faculty education (PhD and Terminal), and instructional expenditure per student (Expend). This principal component seems to capture attributes that generally define premier colleges, with high-quality students entering them and highly accomplished faculty teaching there. These colleges also seem to attract students from across the country who can afford the higher out-of-state tuition.
- The second principal component, PC2, is related to high values of the number of applications received (Apps), accepted (Accept), and enrolled (Enroll), and the number of full-time (F_Undergrad) and part-time (P_Undergrad) undergraduate students. This principal component seems to capture attributes that generally define non-premier colleges that are comparatively easier to get admission into.
- The third principal component, PC3, is related to financial aspects, i.e., personal spending (Personal) and the cost of books (Books) for a student. It is also associated with low values of the student/faculty ratio (S_F_Ratio).
- The fourth principal component, PC4, is related to low values of faculty education and high values of students' graduation rate. This principal component seems to capture attributes that define colleges which lack highly educated faculty (PhD and Terminal) but are comparatively easier to graduate from (Grad_Rate).
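The same interpretation can be pulled out programmatically; the sketch below (an addition for convenience) prints, for each component, the features whose absolute loading exceeds the 0.25 cutoff used above:
# For each of the first four PCs, list the features with |loading| > 0.25
for pc in pc_comps:
    row = data_pca.loc[pc]
    print(pc, ":", list(row[row.abs() > 0.25].index))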
We can also visualize the data in 2 dimensions using the first two principal components¶
plt.figure(figsize = (7,7))
sns.scatterplot(x=data_pca1[0],y=data_pca1[1])
plt.xlabel("PC1")
plt.ylabel("PC2")
Text(0, 0.5, 'PC2')
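To make the roughly 30% variance loss concrete, the sketch below (an addition, refitting PCA with only 4 components) reconstructs the scaled data from those 4 components and measures the mean squared reconstruction error, which closely matches the unexplained variance fraction:
# Refit PCA with 4 components and reconstruct the standardized data from them
pca_4 = PCA(n_components=4, random_state=1)
scores_4 = pca_4.fit_transform(data_scaled)
reconstructed = pca_4.inverse_transform(scores_4)
# Mean squared reconstruction error in the standardized feature space
mse = np.mean((data_scaled.values - reconstructed) ** 2)
print("Reconstruction MSE:", round(mse, 3))
print("Unexplained variance fraction:", round(1 - pca_4.explained_variance_ratio_.sum(), 3))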
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/Copy of FDS_Project_LearnerNotebook_FullCode.ipynb"