MLS Case Study PCA¶
Context:¶
In this case study, we will use the Education dataset, which contains information on educational institutions in the USA. The data has attributes such as the number of applications received, enrollments, faculty education, financial aspects, and the graduation rate of each institution.
Objective:¶
The objective is to reduce the number of features using a dimensionality reduction technique, Principal Component Analysis (PCA), and to extract insights from the resulting components.
Dataset:¶
The Education dataset contains information on various colleges in the USA. It contains the following information:
- Names: Names of various universities and colleges
- Apps: Number of applications received
- Accept: Number of applications accepted
- Enroll: Number of new students enrolled
- Top10perc: Percentage of new students from top 10% of Higher Secondary class
- Top25perc: Percentage of new students from top 25% of Higher Secondary class
- F_Undergrad: Number of full-time undergraduate students
- P_Undergrad: Number of part-time undergraduate students
- Outstate: Out-of-state tuition
- Room_Board: Cost of room and board
- Books: Estimated book costs for a student
- Personal: Estimated personal spending for a student
- PhD: Percentage of faculties with a Ph.D.
- Terminal: Percentage of faculties with terminal degree
- S_F_Ratio: Student/faculty ratio
- perc_alumni: Percentage of alumni who donate
- Expend: The instructional expenditure per student
- Grad_Rate: Graduation rate
Importing libraries and overview of the dataset¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to scale the data using z-score
from sklearn.preprocessing import StandardScaler
#Importing PCA
from sklearn.decomposition import PCA
Connect to Google Drive¶
# Connect to google
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loading data¶
data=pd.read_csv("Education_Post_12th_Standard.csv")
data.head()
 | Names | Apps | Accept | Enroll | Top10perc | Top25perc | F_Undergrad | P_Undergrad | Outstate | Room_Board | Books | Personal | PhD | Terminal | S_F_Ratio | perc_alumni | Expend | Grad_Rate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Abilene Christian University | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
1 | Adelphi University | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
2 | Adrian College | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
3 | Agnes Scott College | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
4 | Alaska Pacific University | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
Check the info of the data¶
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Names        777 non-null    object
 1   Apps         777 non-null    int64
 2   Accept       777 non-null    int64
 3   Enroll       777 non-null    int64
 4   Top10perc    777 non-null    int64
 5   Top25perc    777 non-null    int64
 6   F_Undergrad  777 non-null    int64
 7   P_Undergrad  777 non-null    int64
 8   Outstate     777 non-null    int64
 9   Room_Board   777 non-null    int64
 10  Books        777 non-null    int64
 11  Personal     777 non-null    int64
 12  PhD          777 non-null    int64
 13  Terminal     777 non-null    int64
 14  S_F_Ratio    777 non-null    float64
 15  perc_alumni  777 non-null    int64
 16  Expend       777 non-null    int64
 17  Grad_Rate    777 non-null    int64
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB
Observations:
- There are 777 observations and 18 columns in the dataset.
- All columns have 777 non-null values i.e. there are no missing values.
- All columns are numeric except the Names column which is of object data type.
Data Preprocessing and Exploratory Data Analysis¶
Check if all the college names are unique¶
data.Names.nunique()
777
Observations:
- All college names are unique
- Since all entries are unique, the Names column would not add value to our analysis. We can drop it.
#Dropping Names column
data.drop(columns="Names", inplace=True)
Summary Statistics¶
data.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Apps | 777.0 | 3001.638353 | 3870.201484 | 81.0 | 776.0 | 1558.0 | 3624.0 | 48094.0 |
Accept | 777.0 | 2018.804376 | 2451.113971 | 72.0 | 604.0 | 1110.0 | 2424.0 | 26330.0 |
Enroll | 777.0 | 779.972973 | 929.176190 | 35.0 | 242.0 | 434.0 | 902.0 | 6392.0 |
Top10perc | 777.0 | 27.558559 | 17.640364 | 1.0 | 15.0 | 23.0 | 35.0 | 96.0 |
Top25perc | 777.0 | 55.796654 | 19.804778 | 9.0 | 41.0 | 54.0 | 69.0 | 100.0 |
F_Undergrad | 777.0 | 3699.907336 | 4850.420531 | 139.0 | 992.0 | 1707.0 | 4005.0 | 31643.0 |
P_Undergrad | 777.0 | 855.298584 | 1522.431887 | 1.0 | 95.0 | 353.0 | 967.0 | 21836.0 |
Outstate | 777.0 | 10440.669241 | 4023.016484 | 2340.0 | 7320.0 | 9990.0 | 12925.0 | 21700.0 |
Room_Board | 777.0 | 4357.526384 | 1096.696416 | 1780.0 | 3597.0 | 4200.0 | 5050.0 | 8124.0 |
Books | 777.0 | 549.380952 | 165.105360 | 96.0 | 470.0 | 500.0 | 600.0 | 2340.0 |
Personal | 777.0 | 1340.642214 | 677.071454 | 250.0 | 850.0 | 1200.0 | 1700.0 | 6800.0 |
PhD | 777.0 | 72.660232 | 16.328155 | 8.0 | 62.0 | 75.0 | 85.0 | 103.0 |
Terminal | 777.0 | 79.702703 | 14.722359 | 24.0 | 71.0 | 82.0 | 92.0 | 100.0 |
S_F_Ratio | 777.0 | 14.089704 | 3.958349 | 2.5 | 11.5 | 13.6 | 16.5 | 39.8 |
perc_alumni | 777.0 | 22.743887 | 12.391801 | 0.0 | 13.0 | 21.0 | 31.0 | 64.0 |
Expend | 777.0 | 9660.171171 | 5221.768440 | 3186.0 | 6751.0 | 8377.0 | 10830.0 | 56233.0 |
Grad_Rate | 777.0 | 65.463320 | 17.177710 | 10.0 | 53.0 | 65.0 | 78.0 | 118.0 |
Observations:
- On average, the universities in this dataset receive approximately 3,000 applications, of which around 2,000 are accepted, and around 780 new students enroll.
- The standard deviation is very high for Apps, Accept, and Enroll, which reflects the wide variety of universities and colleges in the data.
- The average cost of room and board, books, and personal expenses is approximately 4,357, 550, and 1,350 dollars respectively.
- The average number of full-time undergraduate students is around 3,700, whereas the average number of part-time undergraduate students is much lower at around 850.
- PhD and Grad_Rate have maximum values greater than 100, which is not possible as these variables are percentages. Let's see how many such observations are there in the data.
data[(data.PhD>100) | (data.Grad_Rate>100)]
 | Apps | Accept | Enroll | Top10perc | Top25perc | F_Undergrad | P_Undergrad | Outstate | Room_Board | Books | Personal | PhD | Terminal | S_F_Ratio | perc_alumni | Expend | Grad_Rate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
95 | 3847 | 3433 | 527 | 9 | 35 | 1010 | 12 | 9384 | 4840 | 600 | 500 | 22 | 47 | 14.3 | 20 | 7697 | 118 |
582 | 529 | 481 | 243 | 22 | 47 | 1206 | 134 | 4860 | 3122 | 600 | 650 | 103 | 88 | 17.4 | 16 | 6415 | 43 |
There is just one such observation for each variable. We can cap these values at 100.
data.loc[582,"PhD"]=100
data.loc[95,"Grad_Rate"]=100
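As a quick check (a small addition, reusing the same DataFrame), we can confirm that both columns are now capped at 100:
# Verify that PhD and Grad_Rate no longer exceed 100 after capping
print(data[["PhD", "Grad_Rate"]].max())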
Let's check the distribution and outliers for each column in the data¶
cont_cols = list(data.columns)
for col in cont_cols:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()
Apps Skew : 3.72
Accept Skew : 3.42
Enroll Skew : 2.69
Top10perc Skew : 1.41
Top25perc Skew : 0.26
F_Undergrad Skew : 2.61
P_Undergrad Skew : 5.69
Outstate Skew : 0.51
Room_Board Skew : 0.48
Books Skew : 3.49
Personal Skew : 1.74
PhD Skew : -0.77
Terminal Skew : -0.82
S_F_Ratio Skew : 0.67
perc_alumni Skew : 0.61
Expend Skew : 3.46
Grad_Rate Skew : -0.14
Observations:
- The distribution plots show that Apps, Accept, Enroll, Top10perc, F_Undergrad, P_Undergrad, Books, Personal, and Expend are highly right skewed. The boxplots confirm that all of these variables have outliers.
- Top25perc is the only variable that does not have outliers.
- Outstate, Room_Board, S_F_Ratio, and perc_alumni seem to have a moderate right skew.
- PhD and Terminal are moderately left skewed.
Now, let's check the correlation among different variables.
plt.figure(figsize=(15,10))
sns.heatmap(data.corr(), annot=True, fmt='0.2f')
plt.show()
Observations:
- We can see high positive correlation among the following variables:
- Apps and Accept
- Apps and Enroll
- Apps and F_Undergrad
- Accept and Enroll
- Accept and F_Undergrad
- Enroll and F_Undergrad
- Top10perc and Top25perc
- PhD and Terminal
- We can see high negative correlation among the following variables:
- S_F_Ratio and Top10perc
- S_F_Ratio and Expend
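To make these observations less manual, the short sketch below (an addition, not part of the original notebook) lists all variable pairs whose absolute correlation exceeds an illustrative threshold of 0.7:
# List pairs of variables with |correlation| above a threshold
corr = data.corr()
threshold = 0.7  # chosen for illustration; adjust as needed
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep the upper triangle only
pairs = upper.stack().sort_values(key=abs, ascending=False)        # long format: (var1, var2) -> corr
print(pairs[pairs.abs() > threshold])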
Scaling the data¶
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data_scaled.head()
 | Apps | Accept | Enroll | Top10perc | Top25perc | F_Undergrad | P_Undergrad | Outstate | Room_Board | Books | Personal | PhD | Terminal | S_F_Ratio | perc_alumni | Expend | Grad_Rate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | -0.346882 | -0.321205 | -0.063509 | -0.258583 | -0.191827 | -0.168116 | -0.209207 | -0.746356 | -0.964905 | -0.602312 | 1.270045 | -0.162859 | -0.115729 | 1.013776 | -0.867574 | -0.501910 | -0.317993 |
1 | -0.210884 | -0.038703 | -0.288584 | -0.655656 | -1.353911 | -0.209788 | 0.244307 | 0.457496 | 1.909208 | 1.215880 | 0.235515 | -2.676529 | -3.378176 | -0.477704 | -0.544572 | 0.166110 | -0.551805 |
2 | -0.406866 | -0.376318 | -0.478121 | -0.315307 | -0.292878 | -0.549565 | -0.497090 | 0.201305 | -0.554317 | -0.905344 | -0.259582 | -1.205112 | -0.931341 | -0.300749 | 0.585935 | -0.177290 | -0.668710 |
3 | -0.668261 | -0.681682 | -0.692427 | 1.840231 | 1.677612 | -0.658079 | -0.520752 | 0.626633 | 0.996791 | -0.602312 | -0.688173 | 1.185939 | 1.175657 | -1.615274 | 1.151188 | 1.792851 | -0.376446 |
4 | -0.726176 | -0.764555 | -0.780735 | -0.655656 | -0.596031 | -0.711924 | 0.009005 | -0.716508 | -0.216723 | 1.518912 | 0.235515 | 0.204995 | -0.523535 | -0.553542 | -1.675079 | 0.241803 | -2.948375 |
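As a quick sanity check (a small addition to the original flow), each scaled column should now have a mean of approximately 0 and a standard deviation of approximately 1:
# StandardScaler uses the population standard deviation, hence ddof=0 here
print(data_scaled.mean().round(2))
print(data_scaled.std(ddof=0).round(2))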
Principal Component Analysis¶
#Defining the number of principal components to generate
n=data_scaled.shape[1]
#Finding principal components for the data
pca = PCA(n_components=n, random_state=1)
data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled))
#The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_
# visualize the Explained Individual Components
plt.figure(figsize = (10,10))
plt.plot(range(1,18), pca.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
Text(0, 0.5, 'Cumulative Explained Variance')
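A complementary view (a sketch reusing exp_var and n from the cells above) is a bar plot of the variance explained by each individual component, which makes the drop-off after the first few components easier to see:
# Bar plot of the individual explained variance ratio per component
plt.figure(figsize=(10, 5))
plt.bar(range(1, n + 1), exp_var)
plt.title("Individual Explained Variance by Component")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.show()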
# Find the least number of components that can explain more than 70% of the variance
cum_var = 0
for ix, i in enumerate(exp_var):
    cum_var = cum_var + i
    if cum_var > 0.70:
        print("Number of PCs that explain at least 70% variance: ", ix+1)
        break
Number of PCs that explain at least 70% variance: 4
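The same answer can be obtained without an explicit loop; the sketch below (an alternative, not the notebook's original approach) uses np.cumsum and np.argmax:
# Cumulative explained variance; argmax returns the first index where the condition holds
cumulative = np.cumsum(exp_var)
print("Number of PCs that explain at least 70% variance: ", np.argmax(cumulative > 0.70) + 1)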
Observations:
- Out of the 17 original features, we have reduced the dimensionality to 4 principal components, and these 4 components explain approximately 70% of the original variance.
- That is a reduction of about 76% in dimensionality, at the cost of losing about 30% of the variance.
Let us now look at these principal components as a linear combination of original features.
pc_comps = ['PC1','PC2','PC3','PC4']
data_pca = pd.DataFrame(np.round(pca.components_[:4,:],2),index=pc_comps,columns=data_scaled.columns)
data_pca.T
 | PC1 | PC2 | PC3 | PC4
---|---|---|---|---
Apps | 0.25 | 0.33 | -0.06 | 0.28 |
Accept | 0.21 | 0.37 | -0.10 | 0.27 |
Enroll | 0.18 | 0.40 | -0.08 | 0.16 |
Top10perc | 0.35 | -0.08 | 0.03 | -0.05 |
Top25perc | 0.34 | -0.04 | -0.02 | -0.11 |
F_Undergrad | 0.15 | 0.42 | -0.06 | 0.10 |
P_Undergrad | 0.03 | 0.32 | 0.14 | -0.16 |
Outstate | 0.29 | -0.25 | 0.05 | 0.13 |
Room_Board | 0.25 | -0.14 | 0.15 | 0.19 |
Books | 0.06 | 0.06 | 0.68 | 0.08 |
Personal | -0.04 | 0.22 | 0.50 | -0.24 |
PhD | 0.32 | 0.06 | -0.13 | -0.53 |
Terminal | 0.32 | 0.05 | -0.07 | -0.52 |
S_F_Ratio | -0.18 | 0.25 | -0.29 | -0.16 |
perc_alumni | 0.21 | -0.25 | -0.15 | 0.02 |
Expend | 0.32 | -0.13 | 0.23 | 0.08 |
Grad_Rate | 0.25 | -0.17 | -0.21 | 0.26 |
Observations:
- Each principal component is a linear combination of original features. For example, we can write the equation for PC1 in the following manner:
PC1 = 0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 * Top25perc + 0.15 * F_Undergrad + 0.03 * P_Undergrad + 0.29 * Outstate + 0.25 * Room_Board + 0.06 * Books - 0.04 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S_F_Ratio + 0.21 * perc_alumni + 0.32 * Expend + 0.25 * Grad_Rate
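We can verify this numerically (a small sketch added for illustration, using the unrounded loadings): the PC1 score of a row in data_pca1 equals the dot product of the first row of pca.components_ with the corresponding row of the scaled data.
# Manually compute the PC1 score of the first college and compare it with fit_transform's output
manual_pc1 = np.dot(pca.components_[0], data_scaled.iloc[0])
print(round(manual_pc1, 4), round(data_pca1.iloc[0, 0], 4))  # the two values should match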
From a business perspective, the first two principal components pick up around 58% of the variability in the data, which is a considerable amount of variation.
Another way to interpret the components is through their weights. For example, we can consider weights with an absolute value greater than 0.25 as significant and analyze each component accordingly.
NOTE: Decision regarding what value of weights is high or significant may vary from case to case. It depends on the problem at hand.
def color_high(val):
    if val < -0.25:  # you can decide any value as per your understanding
        return 'background: pink'
    elif val > 0.25:
        return 'background: skyblue'

data_pca.T.style.applymap(color_high)
 | PC1 | PC2 | PC3 | PC4
---|---|---|---|---
Apps | 0.250000 | 0.330000 | -0.060000 | 0.280000 |
Accept | 0.210000 | 0.370000 | -0.100000 | 0.270000 |
Enroll | 0.180000 | 0.400000 | -0.080000 | 0.160000 |
Top10perc | 0.350000 | -0.080000 | 0.030000 | -0.050000 |
Top25perc | 0.340000 | -0.040000 | -0.020000 | -0.110000 |
F_Undergrad | 0.150000 | 0.420000 | -0.060000 | 0.100000 |
P_Undergrad | 0.030000 | 0.320000 | 0.140000 | -0.160000 |
Outstate | 0.290000 | -0.250000 | 0.050000 | 0.130000 |
Room_Board | 0.250000 | -0.140000 | 0.150000 | 0.190000 |
Books | 0.060000 | 0.060000 | 0.680000 | 0.080000 |
Personal | -0.040000 | 0.220000 | 0.500000 | -0.240000 |
PhD | 0.320000 | 0.060000 | -0.130000 | -0.530000 |
Terminal | 0.320000 | 0.050000 | -0.070000 | -0.520000 |
S_F_Ratio | -0.180000 | 0.250000 | -0.290000 | -0.160000 |
perc_alumni | 0.210000 | -0.250000 | -0.150000 | 0.020000 |
Expend | 0.320000 | -0.130000 | 0.230000 | 0.080000 |
Grad_Rate | 0.250000 | -0.170000 | -0.210000 | 0.260000 |
Observations:
- The first principal component, PC1, is related to high values of student quality (Top10perc, Top25perc), out-of-state tuition (Outstate), faculty education (PhD and Terminal), and instructional expenditure per student (Expend). This principal component seems to capture attributes that generally define premier colleges, with high-quality students entering them and highly accomplished faculty teaching there. These colleges also seem to attract students from across the country who can afford the higher out-of-state tuition.
- The second principal component, PC2, is related to high values of the number of applications received (Apps), accepted (Accept), and enrolled (Enroll), and the number of full-time (F_Undergrad) and part-time (P_Undergrad) undergraduate students. This principal component seems to capture attributes that generally define non-premier colleges that are comparatively easier to get admission into.
- The third principal component, PC3, is related to financial aspects, i.e., personal spending (Personal) and the cost of books (Books) for a student. It is also associated with low values of the student/faculty ratio (S_F_Ratio).
- The fourth principal component, PC4, is related to low values of faculty education and high values of students' graduation rate. This principal component seems to capture attributes that define colleges which lack highly educated faculty (PhD and Terminal) but are comparatively easier to graduate from (Grad_Rate).
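The same interpretation can be pulled out programmatically; the sketch below (an addition for convenience) prints, for each component, the features whose absolute loading exceeds the 0.25 cutoff used above:
# For each of the first four PCs, list the features with |loading| > 0.25
for pc in pc_comps:
    row = data_pca.loc[pc]
    print(pc, ":", list(row[row.abs() > 0.25].index))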
We can also visualize the data in 2 dimensions using the first two principal components¶
plt.figure(figsize = (7,7))
sns.scatterplot(x=data_pca1[0],y=data_pca1[1])
plt.xlabel("PC1")
plt.ylabel("PC2")
Text(0, 0.5, 'PC2')
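To make the roughly 30% variance loss concrete, the sketch below (an addition, refitting PCA with only 4 components) reconstructs the scaled data from those 4 components and measures the mean squared reconstruction error, which closely matches the unexplained variance fraction:
# Refit PCA with 4 components and reconstruct the standardized data from them
pca_4 = PCA(n_components=4, random_state=1)
scores_4 = pca_4.fit_transform(data_scaled)
reconstructed = pca_4.inverse_transform(scores_4)
# Mean squared reconstruction error in the standardized feature space
mse = np.mean((data_scaled.values - reconstructed) ** 2)
print("Reconstruction MSE:", round(mse, 3))
print("Unexplained variance fraction:", round(1 - pca_4.explained_variance_ratio_.sum(), 3))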
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/Copy of FDS_Project_LearnerNotebook_FullCode.ipynb"