Olympic Data Analysis in Python

Introduction

The case study uses past Olympic data to identify patterns that shed light on how the Games and the athletes who compete in them have changed over time. The initial hypotheses concerned the relationship between athletes' height and weight, the trend in female participation, and the distribution of Body Mass Index (BMI).

In order to extract, clean, and analyze data, a combination of SQL and Python was used in the data analysis process. During the analysis, a number of technical difficulties were encountered, including inconsistent and missing data, a large dataset, and intricate SQL queries.

The analysis's findings supported the original theories by demonstrating a positive relationship between the height and weight of athletes and a long-term upward trend in female involvement. These findings offer insightful information on Olympic participation trends and patterns, emphasizing the value of taking into account a variety of variables in the research, including gender and physical attributes.

This case study illustrates how SQL and other data analysis tools can produce useful findings. It also demonstrates SQL extraction and manipulation skills in a real-world setting, which are essential for data analysts.

Initial Hypotheses

Our study is based on historical Olympic Games data. Our goal was to find patterns that would help us understand how the sports and the athletes who compete in them have changed over time.

Our initial hypotheses were:

Height-Weight Correlation: We expected a positive correlation between athletes' height and weight. This rests on the common observation that taller people tend to weigh more.
Trend in Female Participation: We expected the number of female athletes competing in the Olympics to have increased over the years.
Body Mass Index (BMI) Distribution: Given the physical demands of professional sports and the focus on fitness and health, we predicted that athletes' BMI would fall within the normal range.
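
To make the BMI hypothesis concrete, here is a minimal sketch of how BMI could be derived from height and weight. It assumes height is recorded in centimetres and weight in kilograms, and it uses the WHO adult "normal" band of 18.5 up to (but not including) 25. The `athletes` frame and the `normal_bmi` column are hypothetical names with toy values, not part of the original analysis.

```python
import pandas as pd

# Toy rows for illustration only; Height is assumed to be in cm, Weight in kg.
athletes = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "Height": [180.0, 170.0, 160.0],
    "Weight": [80.0, 60.0, 70.0],
})

# BMI = weight (kg) divided by the square of height (m).
height_m = athletes["Height"] / 100
athletes["BMI"] = athletes["Weight"] / height_m ** 2

# Flag athletes inside the WHO adult "normal" band: 18.5 <= BMI < 25.
athletes["normal_bmi"] = athletes["BMI"].between(18.5, 25, inclusive="left")

print(athletes)
```

The share of rows where `normal_bmi` is true would then be one simple way to test the hypothesis on the real data.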

Data Interpretation

To extract, clean, and analyze the data, our strategy combined Python and SQL, taking advantage of each language's strengths.

Data Extraction: To retrieve the pertinent information from the Olympic Games dataset, we employed SQL. This contained information about the athletes, the competitions they took part in, and their results.
Data Cleaning: SQL and Python were used to clean the data. This included making sure the data types were appropriate for our research, addressing missing values, and eliminating duplicates.
Data Analysis: We used Python's pandas library to analyze the data. This allowed us to carry out statistical analysis and efficient data manipulation. We used trend analysis, correlation analysis, and visualizations to evaluate our hypotheses.
Data Visualization: To better comprehend the data and find insights, we created visualizations using Python tools like matplotlib and seaborn.
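
The extraction step above might look roughly like the following sketch. The article does not show its SQL, so everything here is an assumption for illustration: an in-memory SQLite database stands in for the real data store, the table name `athlete_events` mirrors the CSV file name, and the rows are toy values.

```python
import sqlite3

import pandas as pd

# Hypothetical setup: an in-memory SQLite table standing in for the real dataset.
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "Sex": ["M", "M", "F"],
    "Height": [180.0, 170.0, None],
    "Weight": [80.0, 60.0, None],
    "Season": ["Summer", "Summer", "Summer"],
})
sample.to_sql("athlete_events", conn, index=False)

# Pull only the columns and rows the height-weight analysis needs.
query = """
SELECT ID, Name, Sex, Height, Weight
FROM athlete_events
WHERE Season = 'Summer'
  AND Height IS NOT NULL
  AND Weight IS NOT NULL
"""
extracted = pd.read_sql_query(query, conn)
conn.close()

print(extracted)
```

Filtering in SQL before loading into pandas keeps the in-memory frame small, which matters for a dataset of this size.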

During the analysis we ran into several technical obstacles that we had to overcome:

Missing Data: Some entries had missing values, mostly in the weight and height columns. Because these fields were essential to our investigation, we excluded the affected rows from the analyses that depended on them.
Inconsistent Data: We found several inconsistencies, including different naming conventions for the Olympic Games (e.g., "Summer" vs. "S"). We resolved this by standardizing the values.
Big Dataset: With more than 270,000 records, the dataset was large and strained our computing resources. We addressed this by writing efficient SQL queries that extracted only the data required for each analysis.
Complex Queries: Several analyses, such as determining the relationship between height and weight, required complex SQL queries. We broke these queries into smaller parts and tested each one separately before combining them.
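
The cleaning steps described above can be sketched in pandas as follows. This is an illustration under assumptions rather than the original cleaning code: the `raw` frame holds toy rows, and the `season_map` lookup (including the "W" entry) is hypothetical, since the article only mentions the "Summer" vs. "S" case.

```python
import pandas as pd

# Toy rows showing the three problems: a duplicate, an abbreviated label,
# and missing height/weight values.
raw = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Season": ["Summer", "Summer", "S", "Summer"],
    "Height": [180.0, 180.0, 170.0, None],
    "Weight": [80.0, 80.0, 60.0, None],
})

# Standardize abbreviated season labels (mapping assumed for illustration).
season_map = {"S": "Summer", "W": "Winter"}
clean = raw.copy()
clean["Season"] = clean["Season"].replace(season_map)

# Remove exact duplicate rows.
clean = clean.drop_duplicates()

# Drop rows with missing measurements only for analyses that need them,
# so other analyses can still use the full cleaned frame.
body = clean.dropna(subset=["Height", "Weight"])

print(clean)
print(body)
```

Keeping `clean` and `body` as separate frames reflects the choice described above: rows with missing height or weight are excluded only where those fields are essential.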

Source Code:

import pandas as pd
import numpy as np

# Load the athlete-level event records and the NOC-to-region lookup table
df = pd.read_csv('athlete_events.csv')
df1 = pd.read_csv('noc_regions.csv')
df

Output:

	ID	Name	Sex	Age	Height	Weight	NOC	City	Sport
0	1	A Dijiang	M	24.0	180.0	80.0	CHN	Barcelona	Basketball
1	2	A Lamusi	M	23.0	170.0	60.0	CHN	London	Judo
2	3	Gunnar Nielsen Aaby	M	24.0	NaN	NaN	DEN	Antwerpen	Football
3	4	Edgar Lindenau Aabye	M	34.0	NaN	NaN	DEN	Paris	Tug-Of-War

# Collect the distinct Games years in ascending order, with 'Overall' as the first option
year = df['Year'].unique().tolist()
year.sort()
year.insert(0, 'Overall')
year

Output:

['Overall',
 1896,
 1900,
 1904,
 1906,
 1908,
 1912,
 1920,
 1924,
 1928,
 2002,
 2004,
 2006,
 2008,
 2010,
 2012,
 2014 ]

# Keep Summer Games only, then attach region names through the NOC code
df = df[df['Season'] == 'Summer']
df = df.merge(df1, on='NOC', how='left')
df['region'].unique()

Output:

array(['China', 'Denmark', 'Netherlands', 'Finland', 'Norway', 'Romania',
       'Estonia', 'France', 'Morocco', 'Spain', 'Egypt', 'Iran',
       'Bulgaria', 'Italy', 'Chad', 'Azerbaijan', 'Sudan', 'Russia',
       'Argentina', 'Cuba', 'Belarus', 'Greece', 'Cameroon', 'Turkey',
       'Chile', 'Mexico', 'USA', 'Nicaragua', 'Hungary', 'Nigeria',
       'Algeria', 'Kuwait', 'Bahrain', 'Pakistan', 'Iraq', 'Syria',
       'Lebanon', 'Qatar', 'Malaysia', 'Germany', 'Canada', 'Ireland',
       'Australia', 'South Africa', 'Eritrea', 'Tanzania', 'Jordan',
       'Tunisia', 'Libya', 'Belgium', 'Djibouti', 'Palestine', 'Comoros',
       'Kazakhstan', 'Brunei', 'India', 'Saudi Arabia', 'Maldives',
       ...,
       'Virgin Islands, British', 'Mozambique', 'Virgin Islands, US',
       'Central African Republic', 'Madagascar', 'Bosnia and Herzegovina',
       'Guam', 'Cayman Islands', 'Slovakia', 'Barbados', 'Guinea-Bissau',
       'Timor-Leste', 'Democratic Republic of the Congo', 'Gabon',
       'San Marino', 'Laos', 'Botswana', 'South Korea', 'Cambodia',
       'North Korea', 'Solomon Islands', 'Senegal', 'Cape Verde',
       'Equatorial Guinea', 'Boliva', 'Antigua', 'Andorra', 'Zimbabwe',
       'Grenada', 'Saint Lucia', 'Micronesia', 'Myanmar', 'Malawi',
       'Zambia', 'Taiwan', 'Sao Tome and Principe', 'Macedonia',
       'Liechtenstein', 'Montenegro', 'Gambia', 'Cook Islands', 'Albania',
       'Swaziland', 'Burkina Faso', 'Burundi', 'Aruba', 'Nauru',
       'Vietnam', 'Bhutan', 'Marshall Islands', 'Kiribati'], dtype=object)

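# Build a medal tally for a given year and country ('Overall' means no filter).
# Assumes df already has numeric 'Gold', 'Silver', and 'Bronze' indicator columns
# (for example from one-hot encoding 'Medal'); the source does not show that step.
def fetch_medal_tally(df, year, country):
    # Team events list every team member, so count each medal once per event
    medal_df = df.drop_duplicates(subset=['Team','NOC','Games','Year','City','Sport','Event','Medal'])

    flag = 0
    if year == 'Overall' and country == 'Overall':
        temp_df = medal_df
    elif year == 'Overall' and country != 'Overall':
        flag = 1
        temp_df = medal_df[medal_df['region'] == country]
    elif year != 'Overall' and country == 'Overall':
        temp_df = medal_df[medal_df['Year'] == int(year)]
    else:
        temp_df = medal_df[(medal_df['Year'] == int(year)) & (medal_df['region'] == country)]

    # Select the medal columns before summing so non-numeric columns are never summed
    medals = ['Gold', 'Silver', 'Bronze']
    if flag == 1:
        # One country across all years: one row per year
        x = temp_df.groupby('Year')[medals].sum().sort_values('Gold').reset_index()
    else:
        # Otherwise: one row per region, most golds first
        x = temp_df.groupby('region')[medals].sum().sort_values('Gold', ascending=False).reset_index()

    x['total'] = x['Gold'] + x['Silver'] + x['Bronze']

    print(x)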
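
The function above relies on 'Gold', 'Silver', and 'Bronze' columns that the listing never creates. One plausible way to produce them is to one-hot encode the 'Medal' column; the sketch below shows that step on toy data together with the same group-and-sum pattern. The `events` frame and its values are hypothetical, and this is an assumed preprocessing step rather than the article's confirmed code.

```python
import pandas as pd

# Toy medal records; None marks a participation without a medal.
events = pd.DataFrame({
    "region": ["USA", "USA", "China", "China", "USA", "China"],
    "Year": [2012, 2012, 2012, 2016, 2016, 2016],
    "Medal": ["Gold", "Gold", "Gold", None, "Bronze", "Silver"],
})

# One-hot encode Medal: each medal type becomes a 0/1 column.
# Rows with no medal get 0 in all three columns.
medal_flags = pd.get_dummies(events["Medal"], dtype=int)
events = pd.concat([events, medal_flags], axis=1)

# Same aggregation pattern as the tally function: sum the flags per region.
tally = (
    events.groupby("region")[["Gold", "Silver", "Bronze"]]
    .sum()
    .sort_values("Gold", ascending=False)
    .reset_index()
)
tally["total"] = tally["Gold"] + tally["Silver"] + tally["Bronze"]

print(tally)
```

Note that `pd.get_dummies` only creates columns for medal types that actually occur, so on a very small subset one of the three columns could be missing.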
    
    
    
    
# Keep medal-winning rows and count each team medal once.
# Reassigning instead of using inplace=True on a slice avoids pandas' SettingWithCopyWarning.
temp_df = df.dropna(subset=['Medal'])
temp_df = temp_df.drop_duplicates(subset=['Team','NOC','Games','Year','City','Sport','Event','Medal'])
# Medals awarded per Games year (call inferred from the output that follows)
temp_df.groupby('Year').count()['Medal']

Output:

Year
1896    120
1900    300
1904    280
1906    224
1908    322
1912    316
1920    449
1924    391
1928    356
1932    370
1936    422
1948    439
2016    973
Name: Medal, dtype: int64

new_df=temp_df[temp_df['region']== 'USA']
new_df.groupby('Year').count()['Medal']

Output:

Year
1896     19
1900     54
1904    231
1906     23
1908     46
1912     63
1920     95
1924     99
2012    103
2016    121
Name: Medal, dtype: int64

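import plotly.express as px

# Medal count per year for the USA, reshaped into a two-column frame for plotting
new_df = temp_df[temp_df['region'] == 'USA']
final_df = new_df.groupby('Year').count()['Medal'].reset_index()
fig = px.line(final_df, x='Year', y='Medal')
fig.show()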

Output: a line chart of the USA medal count (Medal) against Year.

Observations:

In our more in-depth investigation, we concentrated on two primary topics: the relationship between an athlete's weight and height and the historical trend in female involvement. These are the results we found:

Height and Weight Correlation: We found a positive correlation (about 0.66) between an athlete's height and weight. This implies that taller athletes tend to be heavier, which is consistent with the general relationship between height and weight in the human body. Because each sport has its own physical demands, however, the strength of this correlation may vary from sport to sport.

Trend in Female Participation: Our data show that the percentage of female athletes has clearly increased over time. Female participation in the Olympic Games was very low in the early editions but has grown steadily. Women made up approximately 45% of the athletes at the 2016 Rio Olympics, which marks substantial progress toward gender equality at the Games.

These more in-depth observations show important patterns and connections and offer a more sophisticated interpretation of the data.
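
A correlation like the one reported for height and weight can be computed with pandas as in the sketch below. The `sample` values are toy data chosen for illustration, so the coefficient printed here will not match the 0.66 figure from the real dataset.

```python
import pandas as pd

# Toy measurements; one row has a missing height, as in the real data.
sample = pd.DataFrame({
    "Height": [160.0, 170.0, 180.0, 190.0, None],
    "Weight": [55.0, 72.0, 70.0, 90.0, 80.0],
})

# Keep only athletes with both measurements present.
pairs = sample.dropna(subset=["Height", "Weight"])

# Series.corr computes the Pearson correlation coefficient by default.
r = pairs["Height"].corr(pairs["Weight"])

print(round(r, 2))
```

Running the same calculation per sport, for example inside a `groupby('Sport')`, would be one way to check how the correlation differs between sports.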

The preliminary data already showed a notable rise in female participation over time. A graph of the proportion of female athletes at each Olympic Games edition gives a clearer picture of this trend.
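
The data behind such a graph could be prepared roughly as follows. This is a sketch on toy rows, assuming the 'ID', 'Sex', and 'Year' columns from athlete_events.csv; the `female_pct` column name is hypothetical.

```python
import pandas as pd

# Toy rows: athlete 1 appears twice in 1900 because they entered two events.
rows = pd.DataFrame({
    "ID": [1, 1, 2, 3, 4, 5, 6],
    "Sex": ["M", "M", "F", "M", "F", "F", "M"],
    "Year": [1900, 1900, 1900, 1900, 2016, 2016, 2016],
})

# Count each athlete once per Games, not once per event entered.
athletes = rows.drop_duplicates(subset=["ID", "Year"])

# Share of female athletes per year, as a percentage.
share = (
    athletes["Sex"].eq("F")
    .groupby(athletes["Year"])
    .mean()
    .mul(100)
    .rename("female_pct")
    .reset_index()
)

print(share)
```

The resulting frame has one row per year, so it can be passed to `px.line(share, x='Year', y='female_pct')` in the same way as the USA medal chart earlier.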

Summary

Several important conclusions have been drawn from our examination of the Olympic Games dataset:

Height and Weight Correlation: There is a moderately positive correlation between athletes' height and weight. This suggests that an athlete's physical attributes may influence how well suited they are to a particular sport. Coaches and trainers may take this into account when directing young athletes toward sports where their physical characteristics could give them an edge.
Growing Female Participation: The number of female athletes has increased noticeably. This is encouraging evidence that gender equality in sports is improving. More work remains, though, as female participation is still below that of men. Sports committees and organizations may concentrate on encouraging and supporting female athletes and on expanding the opportunities available to them.
Analysis of Body Mass Index (BMI): Analyzing the BMIs of athletes in different sports can help reveal the physical demands of each sport. Athletes and coaches may find this information useful in their training and preparation.
