Project 5: Exploring Fifa World Cup Data

This dataset (source) includes 44,066 results of international football matches starting from the very first official match in 1872 up to 2022. The matches range from FIFA World Cup to FIFI Wild Cup to regular friendly matches. The matches are strictly men’s full internationals and the data does not include Olympic Games or matches where at least one of the teams was the nation’s B-team, U-23 or a league select team.

Task 1: Import and prepare the dataset

  • Import the pandas package with the usual alias.

In [1]:

# Import the pandas package with the usual alias
import pandas as pd
  • Read "results.csv". Assign to results.
  • Convert the date column to a datetime.
  • Get the year component of the date column; store in a new column named year.

In [2]:

# Read results.csv. Assign to results.
results = pd.read_csv("results.csv")
results.head()
results.dtypes
# Convert the date column to a datetime
results['date'] = pd.to_datetime(results['date'])
results.dtypes
# Get the year component of date column; store in a new column named year 
results['year'] = results['date'].dt.year
results.head()

# See the result
results

Out[2]:

datehome_teamaway_teamhome_scoreaway_scoretournamentcitycountryneutralyear
01872-11-30ScotlandEngland00FriendlyGlasgowScotlandFalse1872
11873-03-08EnglandScotland42FriendlyLondonEnglandFalse1873
21874-03-07ScotlandEngland21FriendlyGlasgowScotlandFalse1874
31875-03-06EnglandScotland22FriendlyLondonEnglandFalse1875
41876-03-04ScotlandEngland30FriendlyGlasgowScotlandFalse1876
440612022-10-22Saudi ArabiaNorth Macedonia10FriendlyAbu DhabiUnited Arab EmiratesTrue2022
440622022-10-23QatarGuatemala20FriendlyMálagaSpainTrue2022
440632022-10-26Saudi ArabiaAlbania11FriendlyAbu DhabiUnited Arab EmiratesTrue2022
440642022-10-27QatarHonduras10FriendlyMarbellaSpainTrue2022
440652022-10-30Saudi ArabiaHonduras00FriendlyAbu DhabiUnited Arab EmiratesTrue2022

44066 rows × 10 columns

Task 2: Get the FIFA World Cup data

  • Using results, count the number of rows of each tournament value.
  • Convert the results to a DataFrame for nicer printing.

In [3]:

# Count the number of rows for each tournament; convert to DataFrame
results.value_counts('tournament')
#convert to Dataframe
results.value_counts('tournament').to_frame("num_maatches")

Out[3]:

num_maatches
tournament
Friendly17427
FIFA World Cup qualification7774
UEFA Euro qualification2593
African Cup of Nations qualification1932
FIFA World Cup900
AFF Championship qualification2
TIFOCO Tournament1
FIFA 75th Anniversary Cup1
Copa Confraternidad1
Real Madrid 75th Anniversary Cup1

139 rows × 1 columns

  • Query for the rows where tournament is equal to “FIFA World Cup”

In [4]:

# Query for the rows where tournament is equal to "FIFA World Cup"
world_cup_res= results.query("tournament=='FIFA World Cup'")

# See the results
world_cup_res

Out[4]:

datehome_teamaway_teamhome_scoreaway_scoretournamentcitycountryneutralyear
13111930-07-13BelgiumUnited States03FIFA World CupMontevideoUruguayTrue1930
13121930-07-13FranceMexico41FIFA World CupMontevideoUruguayTrue1930
13131930-07-14BrazilYugoslavia12FIFA World CupMontevideoUruguayTrue1930
13141930-07-14PeruRomania13FIFA World CupMontevideoUruguayTrue1930
13151930-07-15ArgentinaFrance10FIFA World CupMontevideoUruguayTrue1930
402932018-07-07RussiaCroatia22FIFA World CupSochiRussiaFalse2018
402942018-07-10FranceBelgium10FIFA World CupSaint PetersburgRussiaTrue2018
402952018-07-11CroatiaEngland21FIFA World CupMoscowRussiaTrue2018
402962018-07-14BelgiumEngland20FIFA World CupSaint PetersburgRussiaTrue2018
402972018-07-15FranceCroatia42FIFA World CupMoscowRussiaTrue2018

900 rows × 10 columns

Task 3: Your turn: How many matches in every world cup?

  • Using world_cup_res, count the number of rows of each year value.
  • Convert the results to a DataFrame for nicer printing.

In [5]:

# Count the number of rows for each year; convert to DataFrame

matches_per_year= world_cup_res.value_counts("year").to_frame("num_matches")

# See the results
matches_per_year

#additional
# size vs count - both are same NaNs are accounted in size

world_cup_res.groupby(by='year').size().to_frame("num_matches")

Out[5]:

yearnum_matches
193018
193417
193818
195022
195426
195835
196232
196632
197032
197438
197838
198252
198652
199052
199452
199864
200264
200664
201064
201464
201864
  • Import the plotly.express package using the alias px.

In [6]:

# Import the plotly express package using the alias px
import plotly.express as px
  • Using matches_per_year, draw a bar plot of num_matches.

The year is in the index and will automatically be used for the x-axis.

In [7]:

# Using matches_per_year, draw a bar plot of num_matches
px.bar(matches_per_year, y="num_matches")

Task 4: Which games had the highest goal difference?

  • Add a goal_difference column as the absolute value of the home score minus the away score.
  • Query for rows where the goal difference equals the maximum goal difference.

In [8]:

# Add a goal_difference column as the absolute value of the home score minus the away score
# Query for rows where the goal difference equals the maximum goal difference
world_cup_res\
.assign(goal_difference = lambda match: (match['home_score'] - match['away_score']).abs())\
.query('goal_difference==goal_difference.max()')

Out[8]:

datehome_teamaway_teamhome_scoreaway_scoretournamentcitycountryneutralyeargoal_difference
36671954-06-17HungarySouth Korea90FIFA World CupZürichSwitzerlandTrue19549
92081974-06-18YugoslaviaDR Congo90FIFA World CupGelsenkirchenGermanyTrue19749
125551982-06-15HungaryEl Salvador101FIFA World CupElcheSpainTrue19829

Task 5: Your turn: Which game had the highest total number of goals?

  • Add a total_goals column as the home score plus the away score.
  • Query for rows where the total goals equals the maximum total goals.

In [9]:

# Add a total_goals column as the  home score plus the away score
# Query for rows where the total goals equals the maximum total goals

world_cup_res\
	.assign(total_goals = lambda match : match["home_score"] +match["away_score"])\
.query('total_goals == total_goals.max()')

Out[9]:

datehome_teamaway_teamhome_scoreaway_scoretournamentcitycountryneutralyeartotal_goals
36801954-06-26SwitzerlandAustria57FIFA World CupLausanneSwitzerlandFalse195412

Task 6: Which country scored the most goals?

Step 1: Calculate the home goals by country

  • Using world_cup_res, get the home_team and home_score columns.
  • Rename as team and score.

In [10]:

# Get the home_team and home_score columns
# Rename as team and score
home_goals = world_cup_res.filter(['home_team','home_score'])\
	.rename(columns={'home_team':'home', 'home_score':'score'})


# See the result
home_goals

Out[10]:

homescore
1311Belgium0
1312France4
1313Brazil1
1314Peru1
1315Argentina1
40293Russia2
40294France1
40295Croatia2
40296Belgium2
40297France4

900 rows × 2 columns

Your turn: Step 2: Calculate the away goals by country

  • Using world_cup_res, get the away_team and away_score columns.
  • Rename as team and score.

In [11]:

# Get the away_team and away_score columns
# Rename as team and score

away_goals = world_cup_res.filter(['away_team','away_score'])\
	.rename(columns={'away_team':'team', 'away_score':'score'})
# See the result
away_goals

Out[11]:

teamscore
1311United States3
1312Mexico1
1313Yugoslavia2
1314Romania3
1315France0
40293Croatia2
40294Belgium0
40295England1
40296England0
40297Croatia2

900 rows × 2 columns

Step 3: Combine the home and away totals

  • Concatenate home_goals and away_goals.
  • Group by teamas_index set to False.
  • Calculate the total score.
  • Rename the score column to total_goals.
  • Sort the total goals so the country with the highest total shows on top.

In [12]:

# Concatenate home_goals and away_goals
# Group by team, as_index equal to False
# Get the total score
# Rename score to total_goals
# Sort by total_goals

total_goals_by_country=pd.concat([home_goals, away_goals])\
	.groupby(by='team', as_index=False)\
	.sum('score')\
	.rename(columns={"score":'total_goals'})\
	.sort_values("total_goals", ascending=False)


# See the result
total_goals_by_country

Out[12]:

teamtotal_goals
27Germany68
76Uruguay56
37Italy56
8Brazil52
66Spain50
19DR Congo0
34Indonesia0
11Canada0
13China PR0
70Trinidad and Tobago0

79 rows × 2 columns

Step 4: draw a map colored by number of goals

  • Draw a plotly choropleth map, colored by total_goals, showing the team on hover.

In [13]:

# Draw a plotly choropleth map
px.choropleth(total_goals_by_country, 
              locations="team",
              locationmode="country names",
              color="total_goals",
               hover_name="team")

Extra: Does playing close to home matter?

Import the data

  • Import the winners.csv file and add the winner for 2018 based on info you find online

In [14]:

# Import the data
# source = wikipedia 🙂
winner_2018 = pd.DataFrame({"year": [2018], "hosting_country": ["Russia"], "winning_country": ["France"]})
# source = https://www.kaggle.com/datasets/abecklas/fifa-world-cup
winners = pd.read_csv("winners.csv")[["Year", "Country", "Winner"]] \
			.replace('Germany FR', 'Germany', regex=False) \
			.rename(columns = {'Year': "year", 'Country': 'hosting_country', "Winner": "winning_country"}) \
			.append(winner_2018)

winners
/var/folders/q_/3r83m4b95gz1zx4qbfv56fmw0000gn/T/ipykernel_72195/4165530431.py:5: FutureWarning:

The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

Out[14]:

yearhosting_countrywinning_country
01930UruguayUruguay
11934ItalyItaly
21938FranceItaly
31950BrazilUruguay
41954SwitzerlandGermany
51958SwedenBrazil
61962ChileBrazil
71966EnglandEngland
81970MexicoBrazil
91974GermanyGermany
101978ArgentinaArgentina
111982SpainItaly
121986MexicoArgentina
131990ItalyGermany
141994USABrazil
151998FranceFrance
162002Korea/JapanBrazil
172006GermanyItaly
182010South AfricaSpain
192014BrazilGermany
02018RussiaFrance

Who had the most wins?

  • Do a grouped count by winning_country.

In [15]:

# Do a grouped count
winners.groupby('winning_country').size().sort_values()

Out[15]:

winning_country
England      1
Spain        1
Argentina    2
France       2
Uruguay      2
Germany      4
Italy        4
Brazil       5
dtype: int64

Add continent information

  • Query the World Nations integration to get a list of countries with their respective contintent

In [18]:

country_continents= pd.read_csv('countries_continents.csv')
country_continents.head()

Out[18]:

countrycontinent
0AfghanistanAsia
1AlbaniaEurope
2AlgeriaAfrica
3American SamoaOceania
4AndorraEurope
  • Using the list from the database, add two additional columns, winning_country and hosting_continent, to the data frame

In [19]:

# Do two merges to create the winning_continent and hosting_continent columns
extra_maps = pd.DataFrame({"country": ["England", "USA", "Korea/Japan", "Russia"], "continent": ["Europe", "North America", "Asia", "Europe"]})
country_continents_2 = country_continents.append(extra_maps)
winners_continents = winners \
	.merge(country_continents_2, how='left', left_on = 'winning_country', right_on = 'country') \
	.rename(columns = {'continent': 'winning_continent'}) \
	.drop(columns = 'country') \
	.merge(country_continents_2, how = 'left', left_on = 'hosting_country', right_on = 'country') \
	.rename(columns = {'continent': 'hosting_continent'}) \
	.drop(columns = 'country')

winners_continents
/var/folders/q_/3r83m4b95gz1zx4qbfv56fmw0000gn/T/ipykernel_72195/2585684502.py:3: FutureWarning:

The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

Out[19]:

yearhosting_countrywinning_countrywinning_continenthosting_continent
01930UruguayUruguaySouth AmericaSouth America
11934ItalyItalyEuropeEurope
21938FranceItalyEuropeEurope
31950BrazilUruguaySouth AmericaSouth America
41954SwitzerlandGermanyEuropeEurope
51958SwedenBrazilSouth AmericaEurope
61962ChileBrazilSouth AmericaSouth America
71966EnglandEnglandEuropeEurope
81970MexicoBrazilSouth AmericaNorth America
91974GermanyGermanyEuropeEurope
101978ArgentinaArgentinaSouth AmericaSouth America
111982SpainItalyEuropeEurope
121986MexicoArgentinaSouth AmericaNorth America
131990ItalyGermanyEuropeEurope
141994USABrazilSouth AmericaNorth America
151998FranceFranceEuropeEurope
162002Korea/JapanBrazilSouth AmericaAsia
172006GermanyItalyEuropeEurope
182010South AfricaSpainEuropeAfrica
192014BrazilGermanyEuropeSouth America
202018RussiaFranceEuropeEurope

Analyze South American wins

In [20]:

# How many SA wins in non-SA hosting places?
total_tournaments = winners_continents.shape[0]
total_sa_tournaments = winners_continents.query("hosting_continent == 'South America'").shape[0]
total_sa_wins = winners_continents.query("winning_continent == 'South America'").shape[0]
total_sa_wins_in_sa = winners_continents.query("winning_continent == 'South America' & hosting_continent == 'South America'").shape[0]

print(f"South American teams won {total_sa_wins} world cups in total")
print(f"{total_sa_wins_in_sa} of these were won in South America")
print(f"{total_sa_wins - total_sa_wins_in_sa} of these were won elsewhere")
print(f"{total_sa_wins_in_sa}/{total_sa_tournaments} of South America hosted world cups were won by South American teams.")
South American teams won 9 world cups in total
4 of these were won in South America
5 of these were won elsewhere
4/5 of South America hosted world cups were won by South American teams.

Analyze European wins

In [21]:

# How many Europe wins in non-Europe hosting places?
total_tournaments = winners_continents.shape[0]
total_eu_tournaments = winners_continents.query("hosting_continent == 'Europe'").shape[0]
total_eu_wins = winners_continents.query("winning_continent == 'Europe'").shape[0]
total_eu_wins_in_eu = winners_continents.query("winning_continent == 'Europe' & hosting_continent == 'Europe'").shape[0]

print(f"European teams won {total_eu_wins} world cups in total")
print(f"{total_eu_wins_in_eu} of these were won in Europe")
print(f"{total_eu_wins - total_eu_wins_in_eu} of these were won elsewhere")
print(f"{total_eu_wins_in_eu}/{total_eu_tournaments} of Europe hosted world cups were won by European teams.")
European teams won 12 world cups in total
10 of these were won in Europe
2 of these were won elsewhere
10/11 of Europe hosted world cups were won by European teams.

Overall, it’s hard to make statements about this because of the small sample size and the hosting continent class imbalance…

Leave a comment

Hello !!! I’m Ram Rallabandi

an AI consultant and data scientist with a proven track record of delivering high-impact AI solutions. My mission is simple: transform complex AI strategies into measurable business outcomes.

→ What I Do:
I specialize in leading the end-to-end lifecycle of AI products, from ideation to technical development to successful market launch. My expertise lies in bridging the gap between technical innovation and business value to ensure that every AI initiative aligns with organizational goals.

→ How I Add Value:
☑ Drive the adoption of responsible AI practices to future-proof businesses
☑ Build scalable, high-quality AI solutions that directly address customer needs
☑ Streamline processes to connect AI demand with delivery teams efficiently
☑ Analyze market trends to position data-driven solutions for maximum impact

Why It Works:
With a strong foundation in AI, machine learning, and data science, I provide strategic insights that empower organizations to leverage cutting-edge technologies for growth. I collaborate with technical and business teams to:
↳ Define product roadmaps
↳ Manage budgets, timelines, and scopes
↳ Foster cross-functional alignment for AI-driven innovation

My Passion:
I thrive on leading AI storytelling for executive leadership and fostering collaboration across diverse teams. With deep experience in Fintech and Payments, I’m dedicated to unlocking AI’s potential to deliver business transformation and create widely adopted solutions.

Let’s connect to explore how AI can transform your business.