Project 2: Analyzing TV Data using Python

1. TV, halftime shows, and the Big Game

Whether or not you like football, the Super Bowl is a spectacle. There’s a little something for everyone at your Super Bowl party. Drama in the form of blowouts, comebacks, and controversy for the sports fan. There are the ridiculously expensive ads, some hilarious, others gut-wrenching, thought-provoking, and weird. The half-time shows with the biggest musicians in the world, sometimes riding giant mechanical tigers or leaping from the roof of the stadium. It’s a show, baby. And in this notebook, we’re going to find out how some of the elements of this show interact with each other. After exploring and cleaning our data a little, we’re going to answer questions like:

  • What are the most extreme game outcomes?
  • How does the game affect television viewership?
  • How have viewership, TV ratings, and ad cost evolved over time?
  • Who are the most prolific musicians in terms of halftime show performances?

Left Shark Steals The Show. Katy Perry performing at halftime of Super Bowl XLIX. Photo by Huntley Paton. Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0).

The dataset we’ll use was scraped and polished from Wikipedia. It is made up of three CSV files, one with game data, one with TV data, and one with halftime musician data for all 52 Super Bowls through 2018. Let’s take a look, using display() instead of print() since its output is much prettier in Jupyter Notebooks.

In [1]:

# Import pandas
import pandas as pd

# Load the CSV data into DataFrames
super_bowls = pd.read_csv("datasets/super_bowls.csv")
tv = pd.read_csv("datasets/tv.csv")
halftime_musicians = pd.read_csv("datasets/halftime_musicians.csv")

# Display the first five rows of each DataFrame
display(super_bowls.head())
display(tv.head())
display(halftime_musicians.head())

 | date | super_bowl | venue | city | state | attendance | team_winner | winning_pts | qb_winner_1 | qb_winner_2 | coach_winner | team_loser | losing_pts | qb_loser_1 | qb_loser_2 | coach_loser | combined_pts | difference_pts
0 | 2018-02-04 | 52 | U.S. Bank Stadium | Minneapolis | Minnesota | 67612 | Philadelphia Eagles | 41 | Nick Foles | NaN | Doug Pederson | New England Patriots | 33 | Tom Brady | NaN | Bill Belichick | 74 | 8
1 | 2017-02-05 | 51 | NRG Stadium | Houston | Texas | 70807 | New England Patriots | 34 | Tom Brady | NaN | Bill Belichick | Atlanta Falcons | 28 | Matt Ryan | NaN | Dan Quinn | 62 | 6
2 | 2016-02-07 | 50 | Levi’s Stadium | Santa Clara | California | 71088 | Denver Broncos | 24 | Peyton Manning | NaN | Gary Kubiak | Carolina Panthers | 10 | Cam Newton | NaN | Ron Rivera | 34 | 14
3 | 2015-02-01 | 49 | University of Phoenix Stadium | Glendale | Arizona | 70288 | New England Patriots | 28 | Tom Brady | NaN | Bill Belichick | Seattle Seahawks | 24 | Russell Wilson | NaN | Pete Carroll | 52 | 4
4 | 2014-02-02 | 48 | MetLife Stadium | East Rutherford | New Jersey | 82529 | Seattle Seahawks | 43 | Russell Wilson | NaN | Pete Carroll | Denver Broncos | 8 | Peyton Manning | NaN | John Fox | 51 | 35

 | super_bowl | network | avg_us_viewers | total_us_viewers | rating_household | share_household | rating_18_49 | share_18_49 | ad_cost
0 | 52 | NBC | 103390000 | NaN | 43.1 | 68 | 33.4 | 78.0 | 5000000
1 | 51 | Fox | 111319000 | 172000000.0 | 45.3 | 73 | 37.1 | 79.0 | 5000000
2 | 50 | CBS | 111864000 | 167000000.0 | 46.6 | 72 | 37.7 | 79.0 | 5000000
3 | 49 | NBC | 114442000 | 168000000.0 | 47.5 | 71 | 39.1 | 79.0 | 4500000
4 | 48 | Fox | 112191000 | 167000000.0 | 46.7 | 69 | 39.3 | 77.0 | 4000000

 | super_bowl | musician | num_songs
0 | 52 | Justin Timberlake | 11.0
1 | 52 | University of Minnesota Marching Band | 1.0
2 | 51 | Lady Gaga | 7.0
3 | 50 | Coldplay | 6.0
4 | 50 | Beyoncé | 3.0

2. Taking note of dataset issues

For the Super Bowl game data, we can see the dataset appears whole except for missing values in the backup quarterback columns (qb_winner_2 and qb_loser_2), which makes sense given that most starting QBs in the Super Bowl (qb_winner_1 and qb_loser_1) play the entire game.

From the visual inspection of TV and halftime musicians data, there is only one missing value displayed, but I’ve got a hunch there are more. The Super Bowl goes all the way back to 1967, and the more granular columns (e.g. the number of songs for halftime musicians) probably weren’t tracked reliably over time. Wikipedia is great but not perfect.

An inspection of the .info() output for tv and halftime_musicians shows us that there are multiple columns with null values.

In [2]:

# Summary of the TV data to inspect
tv.info()

print('\n')

# Summary of the halftime musician data to inspect
halftime_musicians.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   super_bowl        53 non-null     int64  
 1   network           53 non-null     object 
 2   avg_us_viewers    53 non-null     int64  
 3   total_us_viewers  15 non-null     float64
 4   rating_household  53 non-null     float64
 5   share_household   53 non-null     int64  
 6   rating_18_49      15 non-null     float64
 7   share_18_49       6 non-null      float64
 8   ad_cost           53 non-null     int64  
dtypes: float64(4), int64(4), object(1)
memory usage: 3.9+ KB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   super_bowl  134 non-null    int64  
 1   musician    134 non-null    object 
 2   num_songs   88 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 3.3+ KB

3. Combined points distribution

For the TV data, the following columns have missing values and a lot of them:

  • total_us_viewers (number of U.S. viewers who watched at least some part of the broadcast)
  • rating_18_49 (average % of U.S. adults 18-49 who live in a household with a TV that were watching for the entire broadcast)
  • share_18_49 (average % of U.S. adults 18-49 who live in a household with a TV in use that were watching for the entire broadcast)

For the halftime musician data, there are missing numbers of songs performed (num_songs) for about a third of the performances.
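If you'd rather not read the null counts off .info() by eye, here is a minimal sketch of one way to tabulate them. The tv_sample frame is a hypothetical stand-in with illustrative values, not the real tv.csv; isna(), sum(), and mean() are standard pandas calls.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the TV data (illustrative values only)
tv_sample = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49],
    "total_us_viewers": [None, 172000000.0, 167000000.0, None],
    "share_18_49": [78.0, None, None, None],
})

# Number of missing values per column
missing_count = tv_sample.isna().sum()

# Fraction of missing values per column (mean of a boolean mask)
missing_share = tv_sample.isna().mean()

# Show both side by side
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Running the same two lines on tv and halftime_musicians should give the per-column picture described above without any manual subtraction.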

There are a lot of potential reasons for these missing values. Was the data ever tracked? Was it lost in history? Is the research effort to make this data whole worth it? Maybe. Watching every Super Bowl halftime show to get song counts would be pretty fun. But we don’t have the time to do that kind of stuff now! Let’s take note of where the dataset isn’t perfect and start uncovering some insights.

Let’s start by looking at combined points for each Super Bowl by visualizing the distribution. Let’s also pinpoint the Super Bowls with the highest and lowest scores.

In [3]:

# Import matplotlib and set plotting style
from matplotlib import pyplot as plt
%matplotlib inline
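# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')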

# Plot a histogram of combined points
super_bowls.hist("combined_pts")
plt.xlabel('Combined Points')
plt.ylabel('Number of Super Bowls')
plt.show()

# Display the Super Bowls with the highest and lowest combined scores
display(super_bowls[super_bowls['combined_pts'] > 70])
display(super_bowls[super_bowls["combined_pts"] < 25])

 | date | super_bowl | venue | city | state | attendance | team_winner | winning_pts | qb_winner_1 | qb_winner_2 | coach_winner | team_loser | losing_pts | qb_loser_1 | qb_loser_2 | coach_loser | combined_pts | difference_pts
0 | 2018-02-04 | 52 | U.S. Bank Stadium | Minneapolis | Minnesota | 67612 | Philadelphia Eagles | 41 | Nick Foles | NaN | Doug Pederson | New England Patriots | 33 | Tom Brady | NaN | Bill Belichick | 74 | 8
23 | 1995-01-29 | 29 | Joe Robbie Stadium | Miami Gardens | Florida | 74107 | San Francisco 49ers | 49 | Steve Young | NaN | George Seifert | San Diego Chargers | 26 | Stan Humphreys | NaN | Bobby Ross | 75 | 23

 | date | super_bowl | venue | city | state | attendance | team_winner | winning_pts | qb_winner_1 | qb_winner_2 | coach_winner | team_loser | losing_pts | qb_loser_1 | qb_loser_2 | coach_loser | combined_pts | difference_pts
43 | 1975-01-12 | 9 | Tulane Stadium | New Orleans | Louisiana | 80997 | Pittsburgh Steelers | 16 | Terry Bradshaw | NaN | Chuck Noll | Minnesota Vikings | 6 | Fran Tarkenton | NaN | Bud Grant | 22 | 10
45 | 1973-01-14 | 7 | Memorial Coliseum | Los Angeles | California | 90182 | Miami Dolphins | 14 | Bob Griese | NaN | Don Shula | Washington Redskins | 7 | Bill Kilmer | NaN | George Allen | 21 | 7
49 | 1969-01-12 | 3 | Orange Bowl | Miami | Florida | 75389 | New York Jets | 16 | Joe Namath | NaN | Weeb Ewbank | Baltimore Colts | 7 | Earl Morrall | Johnny Unitas | Don Shula | 23 | 9

4. Point difference distribution

Most combined scores are around 40-50 points, with the extremes being roughly equal distance away in opposite directions. Going up to the highest combined scores at 74 and 75, we find two games featuring dominant quarterback performances. One even happened recently in 2018’s Super Bowl LII where Tom Brady’s Patriots lost to Nick Foles’ underdog Eagles 41-33 for a combined score of 74.

Going down to the lowest combined scores, we have Super Bowl III and VII, which featured tough defenses that dominated. We also have Super Bowl IX in New Orleans in 1975, whose 16-6 score can be attributed to inclement weather. The field was slick from overnight rain, and it was cold at 46 °F (8 °C), making it hard for the Steelers and Vikings to do much offensively. This was the second-coldest Super Bowl ever and the last to be played in inclement weather for over 30 years. The NFL realized people like points, I guess.

UPDATE: In Super Bowl LIII in 2019, the Patriots and Rams broke the record for the lowest-scoring Super Bowl with a combined score of 16 points (13-3 for the Patriots).
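The cut-offs of 70 and 25 points used to pick out the extremes were chosen by eye from the histogram. As an alternative sketch, nlargest() and nsmallest() find the extremes without hard-coded thresholds. The games frame below is a small subset built from the tables above, and the choice of two and three rows is my assumption to mirror that output.

```python
import pandas as pd

# Small subset of Super Bowls and combined scores taken from the tables above
games = pd.DataFrame({
    "super_bowl": [52, 29, 9, 7, 3, 48],
    "combined_pts": [74, 75, 22, 21, 23, 51],
})

# Two highest-scoring and three lowest-scoring games, no thresholds needed
highest = games.nlargest(2, "combined_pts")
lowest = games.nsmallest(3, "combined_pts")

print(highest)
print(lowest)
```

The same calls on super_bowls would keep working as new games are added, whereas a fixed threshold may need revisiting (Super Bowl LIII's 16 points being a case in point).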

Let’s take a look at point difference now.

In [4]:

# Plot a histogram of point differences
plt.hist(super_bowls.difference_pts)
plt.xlabel('Point Difference')
plt.ylabel("Number of Super Bowls")
plt.show()


# Display the closest game(s) and biggest blowouts
closest_game = super_bowls.difference_pts.min()
# Treat a margin of 35 or more points as a blowout
biggest_games = 35

display(super_bowls[super_bowls["difference_pts"] == closest_game])
display(super_bowls[super_bowls["difference_pts"] >= biggest_games])

 | date | super_bowl | venue | city | state | attendance | team_winner | winning_pts | qb_winner_1 | qb_winner_2 | coach_winner | team_loser | losing_pts | qb_loser_1 | qb_loser_2 | coach_loser | combined_pts | difference_pts
27 | 1991-01-27 | 25 | Tampa Stadium | Tampa | Florida | 73813 | New York Giants | 20 | Jeff Hostetler | NaN | Bill Parcells | Buffalo Bills | 19 | Jim Kelly | NaN | Marv Levy | 39 | 1

 | date | super_bowl | venue | city | state | attendance | team_winner | winning_pts | qb_winner_1 | qb_winner_2 | coach_winner | team_loser | losing_pts | qb_loser_1 | qb_loser_2 | coach_loser | combined_pts | difference_pts
4 | 2014-02-02 | 48 | MetLife Stadium | East Rutherford | New Jersey | 82529 | Seattle Seahawks | 43 | Russell Wilson | NaN | Pete Carroll | Denver Broncos | 8 | Peyton Manning | NaN | John Fox | 51 | 35
25 | 1993-01-31 | 27 | Rose Bowl | Pasadena | California | 98374 | Dallas Cowboys | 52 | Troy Aikman | NaN | Jimmy Johnson | Buffalo Bills | 17 | Jim Kelly | Frank Reich | Marv Levy | 69 | 35
28 | 1990-01-28 | 24 | Louisiana Superdome | New Orleans | Louisiana | 72919 | San Francisco 49ers | 55 | Joe Montana | NaN | George Seifert | Denver Broncos | 10 | John Elway | NaN | Dan Reeves | 65 | 45
32 | 1986-01-26 | 20 | Louisiana Superdome | New Orleans | Louisiana | 73818 | Chicago Bears | 46 | Jim McMahon | NaN | Mike Ditka | New England Patriots | 10 | Tony Eason | Steve Grogan | Raymond Berry | 56 | 36

5. Do blowouts translate to lost viewers?

The vast majority of Super Bowls are close games. Makes sense. Both teams are likely to be deserving if they’ve made it this far. The closest game ever was when the Buffalo Bills lost to the New York Giants by 1 point in 1991, a game best remembered for Scott Norwood’s last-second missed field goal attempt that went wide right, kicking off four Bills Super Bowl losses in a row. Poor Scott. The biggest point discrepancy ever was 45 points (!), when Hall of Famer Joe Montana led the San Francisco 49ers to victory in 1990, one year before the closest game ever.

I remember watching the Seahawks crush the Broncos by 35 points (43-8) in 2014, which was a boring experience in my opinion. The game was never really close. I’m pretty sure we changed the channel at the end of the third quarter. Let’s combine our game and TV data to see if this is a universal phenomenon. Do large point differences translate to lost viewers? We can plot household share (average percentage of U.S. households with a TV in use that were watching for the entire broadcast) vs. point difference to find out.
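One thing to watch when joining: the TV data has 53 rows for 52 Super Bowls, which suggests one game appears twice. Here is a minimal sketch of checking for that before merging. The two small frames are hypothetical stand-ins with illustrative values, and validate="one_to_one" is an optional pandas merge argument that raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Hypothetical stand-ins: Super Bowl 1 appears twice in the TV data (two networks)
tv_sample = pd.DataFrame({
    "super_bowl": [1, 1, 2, 3],
    "network": ["CBS", "NBC", "CBS", "NBC"],
})
games_sample = pd.DataFrame({
    "super_bowl": [1, 2, 3],
    "difference_pts": [25, 19, 9],
})

# Flag every row whose super_bowl key occurs more than once
dupes = tv_sample["super_bowl"].duplicated(keep=False)
print(tv_sample[dupes])

# Drop the duplicated game, then merge and assert the join is one row per game
filtered = tv_sample[tv_sample["super_bowl"] > 1]
merged = pd.merge(filtered, games_sample, on="super_bowl", validate="one_to_one")
print(merged)
```

Without the filter, the same merge call would raise a MergeError, which is a handy guard against silently double-counting a game.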

In [5]:

# Join game and TV data, filtering out SB I because it was split over two networks
games_tv = pd.merge(tv[tv['super_bowl'] > 1], super_bowls, on='super_bowl')

# Import seaborn
import seaborn as sns

# Create a scatter plot with a linear regression model fit
sns.regplot(x="difference_pts", y="share_household", data=games_tv)

Out[5]:

<AxesSubplot:xlabel='difference_pts', ylabel='share_household'>

6. Viewership and the ad industry over time

The downward-sloping regression line and the 95% confidence interval for that regression suggest that bailing on the game if it is a blowout is common. Though it matches our intuition, we must take it with a grain of salt: the linear relationship in the data is weak, and our sample is small (51 games once Super Bowl I is filtered out).
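To put a number on "weak", one option is the Pearson correlation coefficient between the two columns. This is a minimal sketch using only the five most recent games shown in the tables above, so the value it prints is not representative of the full dataset; on the real data you would call the same method on games_tv.

```python
import pandas as pd

# Point difference and household share for Super Bowls 52 to 48 (from the tables above)
sample = pd.DataFrame({
    "difference_pts": [8, 6, 14, 4, 35],
    "share_household": [68, 73, 72, 71, 69],
})

# Pearson correlation: -1 is a perfect negative linear relationship, 0 is none
r = sample["difference_pts"].corr(sample["share_household"])
print(round(r, 2))
```

A value close to zero would back up the caution above; a strongly negative one would make the blowout story more convincing.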

Regardless of the score though, I bet most people stick it out for the halftime show, which is good news for the TV networks and advertisers. A 30-second spot costs a pretty $5 million now, but has it always been that way? And how have number of viewers and household ratings trended alongside ad cost? We can find out using line plots that share a “Super Bowl” x-axis.

In [6]:

# Create a figure with 3x1 subplot and activate the top subplot
plt.figure(figsize=(10,10))
plt.subplot(3, 1, 1)
plt.plot("super_bowl","avg_us_viewers",data=games_tv, color='#648FFF')
plt.title('Average Number of US Viewers')

# Activate the middle subplot
plt.subplot(3, 1, 2)
plt.plot("super_bowl", "rating_household", data=games_tv, color='#DC267F')
plt.title('Household Rating')

# Activate the bottom subplot
plt.subplot(3, 1, 3)
plt.plot("super_bowl","ad_cost", data=games_tv, color="#FFB000")
plt.title('Ad Cost')
plt.xlabel('SUPER BOWL')

# Improve the spacing between subplots
plt.tight_layout()
plt.show()

7. Halftime shows weren’t always this great

We can see viewers increased before ad costs did. Maybe the networks weren’t very data savvy and were slow to react? Makes sense since DataCamp didn’t exist back then.
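Because viewers and ad cost live on very different scales, one way to compare their growth is to index each series to 100 at a starting game. This is a sketch under that assumption, using only the five most recent games from the TV table above; the full games_tv data would show the longer-term lag.

```python
import pandas as pd

# Super Bowls 48 to 52, values taken from the TV table above
recent = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49, 48],
    "avg_us_viewers": [103390000, 111319000, 111864000, 114442000, 112191000],
    "ad_cost": [5000000, 5000000, 5000000, 4500000, 4000000],
}).sort_values("super_bowl").set_index("super_bowl")

# Rescale every column so that the earliest game in the sample equals 100
indexed = recent / recent.iloc[0] * 100
print(indexed.round(1))
```

In this small window, ad cost ends 25% above its starting level while average viewers end below theirs, so the two series clearly don't move in lockstep.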

Another hypothesis: maybe halftime shows weren’t that good in the earlier years? The modern spectacle of the Super Bowl has a lot to do with the cultural prestige of big halftime acts. I went down a YouTube rabbit hole and it turns out the old ones weren’t up to today’s standards. Some offenders:

  • Super Bowl XXVI in 1992: A Frosty The Snowman rap performed by children.
  • Super Bowl XXIII in 1989: An Elvis impersonator that did magic tricks and didn’t even sing one Elvis song.
  • Super Bowl XXI in 1987: Tap dancing ponies. (Okay, that’s pretty awesome actually.)

It turns out Michael Jackson’s Super Bowl XXVII performance, one of the most watched events in American TV history, was when the NFL realized the value of Super Bowl airtime and decided they needed to sign big name acts from then on out. The halftime shows before MJ indeed weren’t that impressive, which we can see by filtering our halftime_musicians data.

In [7]:

# Display all halftime musicians for Super Bowls up to and including Super Bowl XXVII
display(halftime_musicians[halftime_musicians["super_bowl"]<28])

 | super_bowl | musician | num_songs
80 | 27 | Michael Jackson | 5.0
81 | 26 | Gloria Estefan | 2.0
82 | 26 | University of Minnesota Marching Band | NaN
83 | 25 | New Kids on the Block | 2.0
84 | 24 | Pete Fountain | 1.0
85 | 24 | Doug Kershaw | 1.0
86 | 24 | Irma Thomas | 1.0
87 | 24 | Pride of Nicholls Marching Band | NaN
88 | 24 | The Human Jukebox | NaN
89 | 24 | Pride of Acadiana | NaN
90 | 23 | Elvis Presto | 7.0
91 | 22 | Chubby Checker | 2.0
92 | 22 | San Diego State University Marching Aztecs | NaN
93 | 22 | Spirit of Troy | NaN
94 | 21 | Grambling State University Tiger Marching Band | 8.0
95 | 21 | Spirit of Troy | 8.0
96 | 20 | Up with People | NaN
97 | 19 | Tops In Blue | NaN
98 | 18 | The University of Florida Fightin’ Gator March… | 7.0
99 | 18 | The Florida State University Marching Chiefs | 7.0
100 | 17 | Los Angeles Unified School District All City H… | NaN
101 | 16 | Up with People | NaN
102 | 15 | The Human Jukebox | NaN
103 | 15 | Helen O’Connell | NaN
104 | 14 | Up with People | NaN
105 | 14 | Grambling State University Tiger Marching Band | NaN
106 | 13 | Ken Hamilton | NaN
107 | 13 | Gramacks | NaN
108 | 12 | Tyler Junior College Apache Band | NaN
109 | 12 | Pete Fountain | NaN
110 | 12 | Al Hirt | NaN
111 | 11 | Los Angeles Unified School District All City H… | NaN
112 | 10 | Up with People | NaN
113 | 9 | Mercer Ellington | NaN
114 | 9 | Grambling State University Tiger Marching Band | NaN
115 | 8 | University of Texas Longhorn Band | NaN
116 | 8 | Judy Mallett | NaN
117 | 7 | University of Michigan Marching Band | NaN
118 | 7 | Woody Herman | NaN
119 | 7 | Andy Williams | NaN
120 | 6 | Ella Fitzgerald | NaN
121 | 6 | Carol Channing | NaN
122 | 6 | Al Hirt | NaN
123 | 6 | United States Air Force Academy Cadet Chorale | NaN
124 | 5 | Southeast Missouri State Marching Band | NaN
125 | 4 | Marguerite Piazza | NaN
126 | 4 | Doc Severinsen | NaN
127 | 4 | Al Hirt | NaN
128 | 4 | The Human Jukebox | NaN
129 | 3 | Florida A&M University Marching 100 Band | NaN
130 | 2 | Grambling State University Tiger Marching Band | NaN
131 | 1 | University of Arizona Symphonic Marching Band | NaN
132 | 1 | Grambling State University Tiger Marching Band | NaN
133 | 1 | Al Hirt | NaN

8. Who has the most halftime show appearances?

Lots of marching bands. American jazz clarinetist Pete Fountain. Miss Texas 1973 playing a violin. Nothing against those performers, they’re just simply not Beyoncé. To be fair, no one is.

Let’s see all of the musicians that have done more than one halftime show, including their performance counts.

In [8]:

# Count halftime show appearances for each musician and sort them from most to least
halftime_appearances = halftime_musicians.groupby('musician').count()['super_bowl'].reset_index()
halftime_appearances = halftime_appearances.sort_values('super_bowl', ascending=False)

# Display musicians with more than one halftime show appearance
display(halftime_appearances[halftime_appearances['super_bowl']>1])

 | musician | super_bowl
28 | Grambling State University Tiger Marching Band | 6
104 | Up with People | 4
1 | Al Hirt | 4
83 | The Human Jukebox | 3
76 | Spirit of Troy | 2
25 | Florida A&M University Marching 100 Band | 2
26 | Gloria Estefan | 2
102 | University of Minnesota Marching Band | 2
10 | Bruno Mars | 2
64 | Pete Fountain | 2
5 | Beyoncé | 2
36 | Justin Timberlake | 2
57 | Nelly | 2
44 | Los Angeles Unified School District All City H… | 2

9. Who performed the most songs in a halftime show?

The world famous Grambling State University Tiger Marching Band takes the crown with six appearances. Beyoncé, Justin Timberlake, Nelly, and Bruno Mars are the only post-Y2K musicians with multiple appearances (two each).
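The groupby-count-sort chain above works fine, but value_counts() gets to the same place in one step, since it counts rows per musician and sorts from most to least by default. A minimal sketch, using a handful of rows that appear in the tables in this notebook rather than the full halftime_musicians data:

```python
import pandas as pd

# A few (super_bowl, musician) rows taken from the tables in this notebook
sample = pd.DataFrame({
    "super_bowl": [52, 51, 50, 50, 49, 48, 47, 30],
    "musician": ["Justin Timberlake", "Lady Gaga", "Coldplay", "Beyoncé",
                 "Katy Perry", "Bruno Mars", "Beyoncé", "Diana Ross"],
})

# Count rows per musician, sorted from most to least appearances
appearances = sample["musician"].value_counts()

# Keep only musicians with more than one appearance in this sample
repeat = appearances[appearances > 1]
print(repeat)
```

On this subset only Beyoncé shows up twice; the full data gives the longer list shown above.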

From our previous inspections, the num_songs column has lots of missing values:

  • A lot of the marching bands don’t have num_songs entries.
  • For non-marching bands, missing data starts occurring at Super Bowl XX.

Let’s filter out marching bands by dropping musicians whose names contain the word “Marching” or the word “Spirit” (a common naming convention for marching bands is “Spirit of [something]”). Then we’ll filter for Super Bowls after Super Bowl XX to address the missing data issue and see who performed the most songs.
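The two name filters can also be written as a single pattern. Here is a minimal sketch, assuming a regular expression with "|" (meaning "or") passed to str.contains(); the short names Series is made up from musicians that appear in the tables above.

```python
import pandas as pd

# A few musician names taken from the tables above
names = pd.Series([
    "Justin Timberlake",
    "University of Minnesota Marching Band",
    "Spirit of Troy",
    "The Human Jukebox",
    "Up with People",
])

# True where the name contains either "Marching" or "Spirit"
is_band = names.str.contains("Marching|Spirit", regex=True)

# Keep everything the pattern did not flag
print(names[~is_band].tolist())
```

Note that The Human Jukebox, which is itself a marching band, slips through because its name contains neither word. That is why this filter only removes most marching bands, not all of them.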

In [9]:

# Filter out most marching bands
no_bands = halftime_musicians[~halftime_musicians["musician"].str.contains('Marching')]
no_bands = no_bands[~no_bands.musician.str.contains('Spirit')]

# Keep only Super Bowls after Super Bowl XX, where num_songs is tracked more reliably
no_bands = no_bands[no_bands['super_bowl'] > 20]

# Plot a histogram of number of songs per performance
most_songs = int(no_bands['num_songs'].max())
plt.hist(no_bands.num_songs.dropna(), bins=most_songs)
plt.xlabel("Number of Songs Per Halftime Show Performance")
plt.ylabel('Number of Musicians')
plt.show()

# Sort the non-band musicians by number of songs per appearance...
no_bands = no_bands.sort_values('num_songs', ascending=False)

# ...and display the top 15
display(no_bands.head(15))

 | super_bowl | musician | num_songs
0 | 52 | Justin Timberlake | 11.0
70 | 30 | Diana Ross | 10.0
10 | 49 | Katy Perry | 8.0
2 | 51 | Lady Gaga | 7.0
90 | 23 | Elvis Presto | 7.0
33 | 41 | Prince | 7.0
16 | 47 | Beyoncé | 7.0
14 | 48 | Bruno Mars | 6.0
3 | 50 | Coldplay | 6.0
25 | 45 | The Black Eyed Peas | 6.0
20 | 46 | Madonna | 5.0
30 | 44 | The Who | 5.0
80 | 27 | Michael Jackson | 5.0
64 | 32 | The Temptations | 4.0
36 | 39 | Paul McCartney | 4.0

10. Conclusion

So most non-band musicians do 1-3 songs per halftime show. It’s important to note that the duration of the halftime show is fixed (roughly 12 minutes) so songs per performance is more a measure of how many hit songs you have. JT went off in 2018, wow. 11 songs! Diana Ross comes in second with 10 in her medley in 1996.

In this notebook, we loaded, cleaned, then explored Super Bowl game, television, and halftime show data. We visualized the distributions of combined points, point differences, and halftime show performances using histograms. We used line plots to see how ad cost increases lagged behind viewership increases. And we discovered that blowouts do appear to lead to a drop in viewers.

This year’s Big Game will be here before you know it. Who do you think will win Super Bowl LVI?

UPDATE: Spoiler alert.

In [10]:

# 2021-2022 conference champions
bengals = 'Cincinnati Bengals'
rams = 'Los Angeles Rams'

# Who will win Super Bowl LVI?
super_bowl_LVI_winner = rams
print('The winner of Super Bowl LVI will be the', super_bowl_LVI_winner)
The winner of Super Bowl LVI will be the Los Angeles Rams
