Practice Project - FIFA World Cup Analysis¶
Context¶
The FIFA World Cup, often simply called the World Cup, is an international association football competition contested by the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body. The championship has been contested every four years since the inaugural tournament in 1930, except in 1942 and 1946, when it was not held because of the Second World War. It is one of the most prestigious and important trophies in the sport of football.
Objective¶
A new football club named 'Brussels United FC' has just been inaugurated. As a member of this club, you have been assigned the task of carrying out an analysis of the World Cup data.
Data Dictionary¶
The World Cups dataset has the following information about every World Cup held up to 2014.
Year: Year in which the world cup was held
Country: Country where the world cup was held
Winner: Team that won the world cup
Runners-Up: Team that came second
Third: Team that came third
Fourth: Team that came fourth
GoalsScored: Total goals scored in the world cup
QualifiedTeams: Number of teams that qualified for the world cup
MatchesPlayed: Total matches played in the world cup
Attendance: Total attendance in the world cup
Q 1: Import the necessary libraries and briefly explain the use of each library¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from google.colab import drive
drive.mount('/content/drive')
Write your Answer here:¶
Ans 1:
Numpy:
NumPy is used for numerical computing. It is the fundamental package for array computing with Python.
Pandas:
Pandas is used to process data. It provides data structures and data manipulation tools designed for data cleaning and analysis.
matplotlib.pyplot
Matplotlib is a visualization library whose plotting interface was modeled on that of MATLAB. We only need one part of the library for plotting here, so we import its pyplot module, which provides that MATLAB-style plotting interface.
Seaborn
Seaborn is another visualization library, geared toward statistical graphics such as heat maps. It is built on top of matplotlib and closely integrated with Pandas data structures.
Q 2: Which library can be used to read the WorldCups dataset? Read the dataset.¶
fifa=pd.read_csv("/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FIFA/WorldCups.csv")
Write your Answer here:¶
Ans 2:
The Pandas library can be used to read the WorldCups dataset, using its read_csv() function.
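As a minimal sketch of what read_csv() does, the example below parses a few rows of CSV text from an in-memory string instead of the Drive path used above. The rows are copied from the dataset; the names csv_text and sample are illustrative only and are not part of the original notebook.

```python
import io

import pandas as pd

# A few rows in CSV form, copied from the WorldCups dataset.
# Only four of the ten columns are included to keep the sketch short.
csv_text = """Year,Country,Winner,GoalsScored
1930,Uruguay,Uruguay,70
1934,Italy,Italy,70
1938,France,Italy,84
"""

# read_csv accepts a file path or any file-like object,
# so StringIO stands in for the CSV file here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```

Pandas infers the column types while reading: Year and GoalsScored come back as integers, while the text columns are stored as object.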
Q3. Show the last 10 records of the dataset. How many columns are there?¶
fifa.tail(10)
| | Year | Country | Winner | Runners-Up | Third | Fourth | GoalsScored | QualifiedTeams | MatchesPlayed | Attendance |
---|---|---|---|---|---|---|---|---|---|---|
10 | 1978 | Argentina | Argentina | Netherlands | Brazil | Italy | 102 | 16 | 38 | 1.545.791 |
11 | 1982 | Spain | Italy | Germany FR | Poland | France | 146 | 24 | 52 | 2.109.723 |
12 | 1986 | Mexico | Argentina | Germany FR | France | Belgium | 132 | 24 | 52 | 2.394.031 |
13 | 1990 | Italy | Germany FR | Argentina | Italy | England | 115 | 24 | 52 | 2.516.215 |
14 | 1994 | USA | Brazil | Italy | Sweden | Bulgaria | 141 | 24 | 52 | 3.587.538 |
15 | 1998 | France | France | Brazil | Croatia | Netherlands | 171 | 32 | 64 | 2.785.100 |
16 | 2002 | Korea/Japan | Brazil | Germany | Turkey | Korea Republic | 161 | 32 | 64 | 2.705.197 |
17 | 2006 | Germany | Italy | France | Germany | Portugal | 147 | 32 | 64 | 3.359.439 |
18 | 2010 | South Africa | Spain | Netherlands | Germany | Uruguay | 145 | 32 | 64 | 3.178.856 |
19 | 2014 | Brazil | Germany | Argentina | Netherlands | Brazil | 171 | 32 | 64 | 3.386.810 |
Write your Answer here:¶
Ans 3:
There are 10 columns in the data
Q4. Show the first 10 records of the dataset.¶
fifa.head(10)
| | Year | Country | Winner | Runners-Up | Third | Fourth | GoalsScored | QualifiedTeams | MatchesPlayed | Attendance |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1930 | Uruguay | Uruguay | Argentina | USA | Yugoslavia | 70 | 13 | 18 | 590.549 |
1 | 1934 | Italy | Italy | Czechoslovakia | Germany | Austria | 70 | 16 | 17 | 363.000 |
2 | 1938 | France | Italy | Hungary | Brazil | Sweden | 84 | 15 | 18 | 375.700 |
3 | 1950 | Brazil | Uruguay | Brazil | Sweden | Spain | 88 | 13 | 22 | 1.045.246 |
4 | 1954 | Switzerland | Germany FR | Hungary | Austria | Uruguay | 140 | 16 | 26 | 768.607 |
5 | 1958 | Sweden | Brazil | Sweden | France | Germany FR | 126 | 16 | 35 | 819.810 |
6 | 1962 | Chile | Brazil | Czechoslovakia | Chile | Yugoslavia | 89 | 16 | 32 | 893.172 |
7 | 1966 | England | England | Germany FR | Portugal | Soviet Union | 89 | 16 | 32 | 1.563.135 |
8 | 1970 | Mexico | Brazil | Italy | Germany FR | Uruguay | 95 | 16 | 32 | 1.603.975 |
9 | 1974 | Germany | Germany FR | Netherlands | Poland | Brazil | 97 | 16 | 38 | 1.865.753 |
Q5. What do you understand by the dimension of the dataset? Find the dimension of the fifa dataframe.¶
fifa.shape
(20, 10)
Write your Answer here:¶
Ans 5:
The shape of the dataset is a tuple of 2 elements. The first element shows the number of rows in the data and the second element shows the number of columns in the data. Here, (20, 10) means 20 rows and 10 columns.
Q6. What do you understand by the size of the dataset? Find the size of the fifa dataframe.¶
fifa.size
200
Write your Answer here:¶
Ans 6:
The size of the dataset is the total number of elements in the data, i.e. the product of the number of rows and the number of columns. Here, 20 × 10 = 200.
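The relationship between shape and size can be checked on a small frame. This is a sketch on a hypothetical three-row frame named toy (values copied from the first records of the dataset), not on the full fifa dataframe.

```python
import pandas as pd

# A tiny stand-in for the fifa dataframe: 3 rows, 2 columns.
toy = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "GoalsScored": [70, 70, 84],
})

# shape is a (rows, columns) tuple; size is the total number of cells.
n_rows, n_cols = toy.shape
print(toy.shape)
print(toy.size)
print(toy.size == n_rows * n_cols)
```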
Q7. What are the data types of all the variables in the data set?¶
Hint: Use the info() function to get all the information about the dataset.
fifa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Year            20 non-null     int64
 1   Country         20 non-null     object
 2   Winner          20 non-null     object
 3   Runners-Up      20 non-null     object
 4   Third           20 non-null     object
 5   Fourth          20 non-null     object
 6   GoalsScored     20 non-null     int64
 7   QualifiedTeams  20 non-null     int64
 8   MatchesPlayed   20 non-null     int64
 9   Attendance      20 non-null     object
dtypes: int64(4), object(6)
memory usage: 1.7+ KB
Write your Answer here:¶
Ans 7:
- There are two different data types: int64 (numerical variables) and object (here, text values that represent categorical variables).
- There are 4 numerical columns: Year, GoalsScored, QualifiedTeams, MatchesPlayed.
- The rest of the columns are stored as object.
- Oddly, Attendance is stored as object rather than as a number. The records shown in Q3 and Q4 suggest why: its values use periods as thousands separators (for example, 1.545.791), so pandas reads them as strings.
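One possible way to make Attendance numeric is to strip the separators and cast the result to an integer type. The sketch below assumes every value follows the pattern seen in the displayed records (digits grouped by periods, no decimals); the frame sample is a hypothetical three-row excerpt, not the full dataset.

```python
import pandas as pd

# Three Attendance values copied from the records shown in this notebook.
# Assumption: the whole column uses "." purely as a thousands separator.
sample = pd.DataFrame({
    "Year": [1930, 1978, 1982],
    "Attendance": ["590.549", "1.545.791", "2.109.723"],
})

# regex=False treats "." as a literal character rather than a regex wildcard.
sample["Attendance"] = (
    sample["Attendance"].str.replace(".", "", regex=False).astype("int64")
)

print(sample)
print(sample.dtypes)
```

After the conversion, Attendance would appear in describe() and corr() alongside the other numerical columns.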
Q8. What do you mean by missing values? Are there any missing values in the fifa dataframe?¶
fifa.isnull().values.any()
False
Write your Answer here:¶
Ans 8:
A missing value is a cell in the dataset that is blank, i.e. the information for that cell is missing.
The output of the above code (False) implies that there are no missing values in the data.
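Because the fifa data has no missing values, the check above only ever returns False. The sketch below uses a hypothetical frame toy in which one Attendance value has been deliberately replaced with NaN, to show what the same check returns when data is missing and how to count missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the NaN is inserted on purpose for illustration
# and does not reflect the real dataset, which has no missing values.
toy = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "Attendance": [590549, np.nan, 375700],
})

# True if at least one cell anywhere in the frame is missing.
has_missing = toy.isnull().values.any()

# Number of missing cells in each column.
missing_per_column = toy.isnull().sum()

print(has_missing)
print(missing_per_column)
```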
Q9. What do summary statistics of data represent? Find the summary statistics for the numerical variables (Dtype is int64) in the fifa data.¶
fifa.describe()
| | Year | GoalsScored | QualifiedTeams | MatchesPlayed |
---|---|---|---|---|
count | 20.000000 | 20.000000 | 20.000000 | 20.000000 |
mean | 1974.800000 | 118.950000 | 21.250000 | 41.800000 |
std | 25.582889 | 32.972836 | 7.268352 | 17.218717 |
min | 1930.000000 | 70.000000 | 13.000000 | 17.000000 |
25% | 1957.000000 | 89.000000 | 16.000000 | 30.500000 |
50% | 1976.000000 | 120.500000 | 16.000000 | 38.000000 |
75% | 1995.000000 | 145.250000 | 26.000000 | 55.000000 |
max | 2014.000000 | 171.000000 | 32.000000 | 64.000000 |
Write your Answer here:¶
Ans 9:
- Summary statistics describe each numerical variable through its count, mean, standard deviation, minimum, quartiles, and maximum.
- The minimum and the maximum number of goals scored in world cups from 1930-2014 are 70 and 171, respectively.
- The average number of goals scored in a world cup is ~119.
- The number of qualified teams ranges from 13 to 32 and the number of matches played from 17 to 64. The records shown in Q3 and Q4 indicate that both have grown over the years, so the tournament has become bigger, which may reflect the growing popularity of the sport.
Q 10. Plot the distribution plot for the variable 'MatchesPlayed'. Write detailed observations from the plot.¶
sns.displot(fifa['MatchesPlayed'], kind='kde')
plt.show()
Write your Answer here:¶
Ans 10:
- The plot shows that most of the observations lie between 20 and 60 i.e. majority of world cups had 20 to 60 matches.
- The distribution looks fairly symmetric and there are two peaks in the plot around 25 and 60. A distribution with two peaks (modes) is called bimodal.
Q 11. Which country has won the world cup maximum times?¶
Hint: Use the value_counts() function. It returns an object containing counts of unique values, sorted in descending order so that the first element is the most frequently occurring value.
fifa['Winner'].value_counts()
Winner
Brazil        5
Italy         4
Germany FR    3
Uruguay       2
Argentina     2
England       1
France        1
Spain         1
Germany       1
Name: count, dtype: int64
Write your Answer here:¶
Ans 11:
Brazil has won the World Cup the most times: 5.
Q12. Find the mean, median, and mode of the variable 'QualifiedTeams'.¶
m1 = fifa['QualifiedTeams'].mean()
print(m1)
m2 = fifa['QualifiedTeams'].median()
print(m2)
m3 = fifa['QualifiedTeams'].mode()[0]
print(m3)
21.25
16.0
16
Write your Answer here:¶
Ans 12:
- The mean, median, and mode of the variable QualifiedTeams are 21.25, 16, and 16, respectively.
- The mean is greater than the median which implies that the distribution of QualifiedTeams might be skewed to the right.
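The right-skew suggested by the mean and median can be checked with the sample skewness, where a positive value points to a longer right tail. This is a sketch that rebuilds the QualifiedTeams column from the records displayed in this notebook; the variable name teams is illustrative.

```python
import pandas as pd

# QualifiedTeams for the 20 World Cups (1930 to 2014),
# copied from the records displayed in this notebook.
teams = pd.Series(
    [13, 16, 15, 13, 16, 16, 16, 16, 16, 16,
     16, 24, 24, 24, 24, 32, 32, 32, 32, 32],
    name="QualifiedTeams",
)

print(teams.mean())     # mean
print(teams.median())   # median
print(teams.mode()[0])  # most frequent value

# Series.skew() returns the sample skewness; positive means right-skewed.
print(teams.skew() > 0)
```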
Q13. How many countries are above the mean level of 'Qualified Teams'?¶
fifa[fifa['QualifiedTeams']>m1]['Country']
11           Spain
12          Mexico
13           Italy
14             USA
15          France
16     Korea/Japan
17         Germany
18    South Africa
19          Brazil
Name: Country, dtype: object
Write your Answer here:¶
Ans 13:
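9 World Cups had more than the mean number of qualified teams (21.25): every tournament from 1982 to 2014. The Country column lists the hosts of those tournaments, so the output shows host countries rather than qualified teams.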
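To count the rows directly rather than reading them off the output, the boolean mask can be summed, since True counts as 1. The sketch below rebuilds the Year and QualifiedTeams columns from the records displayed in this notebook; the names df, above_mean and years_above are illustrative.

```python
import pandas as pd

# Year and QualifiedTeams copied from the records displayed in this notebook.
df = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
             1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014],
    "QualifiedTeams": [13, 16, 15, 13, 16, 16, 16, 16, 16, 16,
                       16, 24, 24, 24, 24, 32, 32, 32, 32, 32],
})

# Boolean mask: True where the tournament had more teams than the mean.
above_mean = df["QualifiedTeams"] > df["QualifiedTeams"].mean()

# Summing the mask counts the True values.
n_above = int(above_mean.sum())
years_above = df.loc[above_mean, "Year"].tolist()

print(n_above)
print(years_above)
```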
Q14. What is the median of variables 'GoalsScored' & 'MatchesPlayed'?¶
# Median of goals scored per World Cup
GS_median = np.median(fifa['GoalsScored'])
print(GS_median)
# Median of matches played per World Cup
MP_median = np.median(fifa['MatchesPlayed'])
print(MP_median)
120.5
38.0
Write your Answer here:¶
Ans 14:
The median number of goals scored is 120.5 and the median number of matches played is 38.
Q15. Which country scored the minimum number of goals?¶
fifa[fifa['GoalsScored']==fifa['GoalsScored'].min()]['Country']
0    Uruguay
1      Italy
Name: Country, dtype: object
Write your Answer here:¶
Ans 15:
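There is a tie for the minimum number of goals scored. GoalsScored is the total for the whole tournament and Country is the host, so the result means that the World Cups hosted by Uruguay (1930) and Italy (1934) had the fewest goals, 70 each. It does not mean that the Uruguay and Italy teams scored the fewest goals.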
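Selecting Year and GoalsScored alongside Country makes it easier to see that each row describes a tournament rather than a team. This is a sketch on a hypothetical five-row excerpt named sample, with values copied from the first records of the dataset.

```python
import pandas as pd

# First five records of the dataset, limited to three columns.
sample = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950, 1954],
    "Country": ["Uruguay", "Italy", "France", "Brazil", "Switzerland"],
    "GoalsScored": [70, 70, 84, 88, 140],
})

# Keep every row that ties for the minimum, and show the year and
# the total next to the host so the result is not read as a team's goals.
is_min = sample["GoalsScored"] == sample["GoalsScored"].min()
lowest = sample.loc[is_min, ["Year", "Country", "GoalsScored"]]

print(lowest)
```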
Q16. Plot the pairplots of 'GoalsScored', 'QualifiedTeams', 'MatchesPlayed'.¶
sns.pairplot(fifa[['GoalsScored', 'QualifiedTeams', 'MatchesPlayed']])
plt.show()
Q17. Plot the scatterplot for variables 'Country' & 'Year'.¶
sns.scatterplot(x=fifa['Year'], y=fifa['Country'])
plt.show()
Q18. Plot a countplot for the variable 'Winner' to understand the number of times a country won the world cup between 1930 and 2014.¶
plt.figure(figsize=(10,6))
sns.countplot(x=fifa['Winner'])
plt.title('Number of World Cup wins by country, 1930 to 2014')
plt.xlabel('Country')
plt.ylabel('Frequency')
plt.show()
Write your Answer here:¶
Ans 18:
- As observed earlier, Brazil has won the World Cup the most times: 5.
- Notice that Germany has two entries, Germany FR and Germany. We can consider them the same team.
- Counting Germany this way, Italy and Germany share the second-highest number of titles: 4 each.
- Uruguay and Argentina have each won the World Cup twice.
- England, France, and Spain each have 1 World Cup title.
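If Germany FR and Germany are to be treated as one team, the labels can be merged before counting. The sketch below rebuilds the Winner column from the records displayed in this notebook; merging the two labels is an analysis choice, and the names winners and merged_counts are illustrative.

```python
import pandas as pd

# Winner for each World Cup from 1930 to 2014, in chronological order,
# copied from the records displayed in this notebook.
winners = pd.Series(
    ["Uruguay", "Italy", "Italy", "Uruguay", "Germany FR",
     "Brazil", "Brazil", "England", "Brazil", "Germany FR",
     "Argentina", "Italy", "Argentina", "Germany FR", "Brazil",
     "France", "Brazil", "Italy", "Spain", "Germany"],
    name="Winner",
)

# Map the older label onto the current one, then count titles per team.
merged_counts = winners.replace({"Germany FR": "Germany"}).value_counts()

print(merged_counts)
```

The same replacement could be applied to the Winner column before calling sns.countplot so that the chart shows a single Germany bar.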
Q 19. Show boxplot and calculate the interquartile range for the variable 'GoalsScored'¶
plt.boxplot(fifa['GoalsScored'])
plt.text(x=1.1,y=fifa['GoalsScored'].min(), s='min')
plt.text(x=1.1,y=fifa.GoalsScored.quantile(0.25), s='Q1')
plt.text(x=1.1,y=fifa['GoalsScored'].median(), s='median(Q2)')
plt.text(x=1.1,y=fifa.GoalsScored.quantile(0.75), s='Q3')
plt.text(x=1.1,y=fifa['GoalsScored'].max(), s='max')
plt.title('Boxplot of GoalsScored')
plt.ylabel('Goals')
plt.show()
Q1 = fifa.quantile(q = .25, numeric_only = True)
Q3 = fifa.quantile(q = .75, numeric_only = True)
IQR = Q3 - Q1
print(IQR)
Year              38.00
GoalsScored       56.25
QualifiedTeams    10.00
MatchesPlayed     24.50
dtype: float64
Write your Answer here:¶
Ans 19:
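- The boxplot shows that there are no outliers for the number of goals scored.
- The IQR for the number of goals scored is 56.25 (Q3 = 145.25, Q1 = 89), which is large relative to the median of 120.5. This indicates considerable variability in the number of goals scored across world cups, which can be expected because the number of matches played also varies widely.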
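The "no outliers" reading can be checked numerically with the usual boxplot convention, which flags points more than 1.5 × IQR beyond the quartiles. This is a sketch that rebuilds the GoalsScored column from the records displayed in this notebook; the names goals, lower_fence and upper_fence are illustrative.

```python
import pandas as pd

# GoalsScored for the 20 World Cups (1930 to 2014),
# copied from the records displayed in this notebook.
goals = pd.Series(
    [70, 70, 84, 88, 140, 126, 89, 89, 95, 97,
     102, 146, 132, 115, 141, 171, 161, 147, 145, 171],
    name="GoalsScored",
)

q1 = goals.quantile(0.25)
q3 = goals.quantile(0.75)
iqr = q3 - q1

# Boxplot whisker limits under the 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences would be drawn as an outlier.
outliers = goals[(goals < lower_fence) | (goals > upper_fence)]

print(iqr)
print(lower_fence, upper_fence)
print(len(outliers))
```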
Q 20. Find and visualize the correlation among numeric variables¶
corr_matrix = fifa.corr(numeric_only = True)
corr_matrix
| | Year | GoalsScored | QualifiedTeams | MatchesPlayed |
---|---|---|---|---|
Year | 1.000000 | 0.829886 | 0.895565 | 0.972473 |
GoalsScored | 0.829886 | 1.000000 | 0.866201 | 0.876201 |
QualifiedTeams | 0.895565 | 0.866201 | 1.000000 | 0.949164 |
MatchesPlayed | 0.972473 | 0.876201 | 0.949164 | 1.000000 |
sns.heatmap(corr_matrix, annot = True)
# display the plot
plt.show()
Write your Answer here:¶
Ans 20:
- Year has a strong positive correlation with all other numeric variables (0.83 to 0.97). This indicates that the number of goals scored, the number of qualified teams, and the number of matches played have been increasing over the years. The world cup has become a bigger event with time, which is consistent with its growing popularity.
- The number of goals scored is positively correlated with the number of qualified teams and the number of matches played which is understandable.
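The values in the correlation matrix are Pearson coefficients, which is the default method of corr(). As a sketch of what one entry means, the example below recomputes the coefficient between QualifiedTeams and MatchesPlayed from its definition and compares it with the pandas result. The two columns are rebuilt from the records displayed in this notebook; the names df, r_manual and r_pandas are illustrative.

```python
import numpy as np
import pandas as pd

# Two columns copied from the records displayed in this notebook (1930 to 2014).
df = pd.DataFrame({
    "QualifiedTeams": [13, 16, 15, 13, 16, 16, 16, 16, 16, 16,
                       16, 24, 24, 24, 24, 32, 32, 32, 32, 32],
    "MatchesPlayed": [18, 17, 18, 22, 26, 35, 32, 32, 32, 38,
                      38, 52, 52, 52, 52, 64, 64, 64, 64, 64],
})

x = df["QualifiedTeams"].to_numpy(dtype=float)
y = df["MatchesPlayed"].to_numpy(dtype=float)

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the sums of squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas.
r_pandas = df["QualifiedTeams"].corr(df["MatchesPlayed"])

print(round(r_manual, 6), round(r_pandas, 6))
```

A coefficient this close to 1 reflects how closely the number of matches follows the number of teams. It does not by itself say which one drives the other.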
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FIFA/Solution+Notebook+-+FIFA+World+Cup+Analysis.ipynb"