Project 12: Optimizing Online Sports Retail Revenue

1. Counting missing values

Sports clothing and athleisure attire is a huge industry, worth approximately $193 billion in 2021 with a strong growth forecast over the next decade!

In this notebook, we play the role of a product analyst for an online sports clothing company. The company is specifically interested in how it can improve revenue. We will dive into product data such as pricing, reviews, descriptions, and ratings, as well as revenue and website traffic, to produce recommendations for its marketing and sales teams.

The database provided to us, sports, contains five tables, with product_id being the primary key for all of them:

info

columndata typedescription
product_namevarcharName of the product
product_idvarcharUnique ID for product
descriptionvarcharDescription of the product

finance

columndata typedescription
product_idvarcharUnique ID for product
listing_pricefloatListing price for product
sale_pricefloatPrice of the product when on sale
discountfloatDiscount, as a decimal, applied to the sale price
revenuefloatAmount of revenue generated by each product, in US dollars

reviews

columndata typedescription
product_namevarcharName of the product
product_idvarcharUnique ID for product
ratingfloatProduct rating, scored from 1.0 to 5.0
reviewsfloatNumber of reviews for the product

traffic

columndata typedescription
product_idvarcharUnique ID for product
last_visitedtimestampDate and time the product was last viewed on the website

brands

columndata typedescription
product_idvarcharUnique ID for product
brandvarcharBrand of the product

We will be dealing with missing data as well as numeric, string, and timestamp data types to draw insights about the products in the online store. Let’s start by finding out how complete the data is.

In [1]:

%%sql
postgresql:///sports

-- Count all columns as total_rows
-- Count the number of non-missing entries for description, listing_price, and last_visited
-- Join info, finance, and traffic

SELECT COUNT(*) AS total_rows, COUNT(description) AS count_description, COUNT(listing_price) AS count_listing_price, COUNT(last_visited) AS count_last_visited FROM info
LEFT JOIN finance USING(product_id)
LEFT JOIN traffic USING(product_id)
1 rows affected.

Out[1]:

total_rowscount_descriptioncount_listing_pricecount_last_visited
3179311731202928

2. Nike vs Adidas pricing

We can see the database contains 3,179 products in total. Of the columns we previewed, only one — last_visited — is missing more than five percent of its values. Now let’s turn our attention to pricing.

How do the price points of Nike and Adidas products differ? Answering this question can help us build a picture of the company’s stock range and customer market. We will run a query to produce a distribution of the listing_price and the count for each price, grouped by brand.

In [2]:

%%sql

-- Select the brand, listing_price as an integer, and a count of all products in finance 
-- Join brands to finance on product_id
-- Filter for products with a listing_price more than zero
-- Aggregate results by brand and listing_price, and sort the results by listing_price in descending order

SELECT brand, CAST(listing_price AS int), COUNT(product_id) FROM brands
INNER JOIN finance
USING(product_id)
WHERE listing_price >0
GROUP BY brand, listing_price
ORDER BY listing_price DESC
 * postgresql:///sports
77 rows affected.

Out[2]:

brandlisting_pricecount
Adidas3002
Adidas2804
Adidas2405
Adidas2308
Adidas22011
Adidas2008
Nike2001
Adidas1907
Nike1902
Adidas18034
Nike1804
Nike17014
Adidas17027
Nike16031
Adidas16028
Nike1506
Adidas15041
Adidas14036
Nike14012
Nike13012
Adidas13096
Nike12016
Adidas120115
Adidas11091
Nike11017
Nike10014
Adidas10072
Adidas962
Nike951
Nike9013
Adidas9089
Adidas867
Nike855
Adidas851
Nike8016
Adidas80322
Nike791
Adidas76149
Adidas751
Nike757
Adidas7087
Nike704
Adidas66102
Nike651
Adidas631
Adidas60211
Nike602
Adidas56174
Adidas552
Adidas5343
Adidas50183
Nike505
Nike481
Adidas4842
Adidas46163
Adidas451
Nike453
Adidas4351
Adidas4081
Nike401
Adidas3824
Adidas3625
Adidas3324
Nike302
Adidas3037
Adidas2838
Adidas2718
Adidas2528
Adidas231
Adidas208
Adidas184
Adidas164
Adidas1527
Adidas1327
Adidas121
Adidas1011
Adidas91

3. Labeling price ranges

It turns out there are 77 unique prices for the products in our database, which makes the output of our last query quite difficult to analyze.

Let’s build on our previous query by assigning labels to different price ranges, grouping by brand and label. We will also include the total revenue for each price range and brand.

In [3]:

%%sql

-- Select the brand, a count of all products in the finance table, and total revenue
-- Create four labels for products based on their price range, aliasing as price_category
-- Join brands to finance on product_id and filter out products missing a value for brand
-- Group results by brand and price_category, sort by total_revenue


SELECT brand, COUNT(*), SUM(revenue) AS total_revenue, 
(CASE 
    WHEN listing_price >128 THEN 'Elite'
    WHEN listing_price > 73 THEN 'Expensive'
    WHEN listing_price > 41 THEN 'Average'
    ELSE 'Budget'
END) AS price_category FROM brands
LEFT JOIN finance
USING(product_id)
WHERE brand IS NOT NULL
GROUP BY brand, price_category
ORDER BY total_revenue DESC;
 * postgresql:///sports
8 rows affected.

Out[3]:

brandcounttotal_revenueprice_category
Adidas8494626980.069999999Expensive
Adidas10603233661.060000001Average
Adidas3073014316.8299999987Elite
Adidas359651661.1200000002Budget
Nike357595341.0199999992Budget
Nike82128475.59000000003Elite
Nike9071843.15000000004Expensive
Nike166623.5Average

4. Average discount by brand

Interestingly, grouping products by brand and price range allows us to see that Adidas items generate more total revenue regardless of price category! Specifically, "Elite" Adidas products priced $129 or more typically generate the highest revenue, so the company can potentially increase revenue by shifting their stock to have a larger proportion of these products!

Note we have been looking at listing_price so far. The listing_price may not be the price that the product is ultimately sold for. To understand revenue better, let’s take a look at the discount, which is the percent reduction in the listing_price when the product is actually sold. We would like to know whether there is a difference in the amount of discount offered between brands, as this could be influencing revenue.

In [4]:

%%sql

-- Select brand and average_discount as a percentage
-- Join brands to finance on product_id
-- Aggregate by brand
-- Filter for products without missing values for brand
SELECT brand, AVG(discount)*100 AS average_discount FROM brands
INNER JOIN finance
USING(product_id)
WHERE brand IS NOT NULL
GROUP BY brand
 * postgresql:///sports
2 rows affected.

Out[4]:

brandaverage_discount
Adidas33.452427184465606
Nike0.0

5. Correlation between revenue and reviews

Strangely, no discount is offered on Nike products! In comparison, not only do Adidas products generate the most revenue, but these products are also heavily discounted!

To improve revenue further, the company could try to reduce the amount of discounts offered on Adidas products and monitor sales volume to see if it remains stable. Alternatively, it could try offering a small discount on Nike products. This would reduce average revenue for these products but may increase revenue overall if there is an increase in the volume of Nike products sold.

Now explore whether relationships exist between the columns in our database. We will check the strength and direction of a correlation between revenue and reviews.

In [5]:

%%sql

-- Calculate the correlation between reviews and revenue as review_revenue_corr
-- Join the reviews and finance tables on product_id

SELECT CORR(reviews, revenue) AS review_revenue_corr FROM reviews
INNER JOIN finance
USING(product_id)
 * postgresql:///sports
1 rows affected.

Out[5]:

review_revenue_corr
0.6518512283481301

6. Ratings and reviews by product description length

Interestingly, there is a strong positive correlation between revenue and reviews. This means, potentially, if we can get more reviews on the company’s website, it may increase sales of those items with a larger number of reviews.

Perhaps the length of a product’s description might influence a product’s rating and reviews — if so, the company can produce content guidelines for listing products on their website and test if this influences revenue. Let’s check this out!

In [6]:

%%sql

-- Calculate description_length
-- Convert rating to a numeric data type and calculate average_rating
-- Join info to reviews on product_id and group the results by description_length
-- Filter for products without missing values for description, and sort results by description_length

SELECT TRUNC(LENGTH(description),-2) AS description_length, ROUND(AVG(rating::numeric),2) AS average_rating FROM info 
LEFT JOIN reviews
USING(product_id)
WHERE description IS NOT NULL
GROUP BY description_length
ORDER BY description_length
 * postgresql:///sports
7 rows affected.

Out[6]:

description_lengthaverage_rating
01.87
1003.21
2003.27
3003.29
4003.32
5003.12
6003.65

7. Reviews by month and brand

Unfortunately, there doesn’t appear to be a clear pattern between the length of a product’s description and its rating.

As we know a correlation exists between reviews and revenue, one approach the company could take is to run experiments with different sales processes encouraging more reviews from customers about their purchases, such as by offering a small discount on future purchases.

Let’s take a look at the volume of reviews by month to see if there are any trends or gaps we can look to exploit.

In [7]:

%%sql

-- Select brand, month from last_visited, and a count of all products in reviews aliased as num_reviews
-- Join traffic with reviews and brands on product_id
-- Group by brand and month, filtering out missing values for brand and month
-- Order the results by brand and month

SELECT brand, DATE_PART('month',last_visited) AS month, COUNT(*) AS num_reviews FROM traffic
INNER JOIN reviews
USING(product_id)
INNER JOIN brands
USING(product_id)
WHERE (brand IS NOT NULL) AND (DATE_PART('month',last_visited) IS NOT NULL)
GROUP BY brand, month
ORDER BY brand, month
 * postgresql:///sports
24 rows affected.

Out[7]:

brandmonthnum_reviews
Adidas1.0253
Adidas2.0272
Adidas3.0269
Adidas4.0180
Adidas5.0172
Adidas6.0159
Adidas7.0170
Adidas8.0189
Adidas9.0181
Adidas10.0192
Adidas11.0150
Adidas12.0190
Nike1.052
Nike2.052
Nike3.055
Nike4.042
Nike5.041
Nike6.043
Nike7.037
Nike8.029
Nike9.028
Nike10.047
Nike11.038
Nike12.035

8. Footwear product performance

Looks like product reviews are highest in the first quarter of the calendar year, so there is scope to run experiments aiming to increase the volume of reviews in the other nine months!

So far, we have been primarily analyzing Adidas vs Nike products. Now, let’s switch our attention to the type of products being sold. As there are no labels for product type, we will create a Common Table Expression (CTE) that filters description for keywords, then use the results to find out how much of the company’s stock consists of footwear products and the median revenue generated by these items.

In [8]:

%%sql

-- Create the footwear CTE, containing description and revenue
-- Filter footwear for products with a description containing %shoe%, %trainer, or %foot%
-- Also filter for products that are not missing values for description
-- Calculate the number of products and median revenue for footwear products

WITH footwear AS (
    SELECT description, revenue FROM info
    INNER JOIN finance
    USING(product_id)
    WHERE (description ILIKE '%shoe%') OR 
    (description ILIKE  '%trainer%')OR
    (description ILIKE '%foot%') AND (description IS NOT NULL)
    )


SELECT COUNT(*) AS num_footwear_products, percentile_disc(.5) WITHIN GROUP (ORDER BY revenue) AS median_footwear_revenue FROM footwear
 * postgresql:///sports
1 rows affected.

Out[8]:

num_footwear_productsmedian_footwear_revenue
27003118.36

9. Clothing product performance

Recall from the first task that we found there are 3,117 products without missing values for description. Of those, 2,700 are footwear products, which accounts for around 85% of the company’s stock. They also generate a median revenue of over $3000 dollars!

This is interesting, but we have no point of reference for whether footwear’s median_revenue is good or bad compared to other products. So, for our final task, let’s examine how this differs to clothing products. We will re-use footwear, adding a filter afterward to count the number of products and median_revenue of products that are not in footwear.

In [9]:

%%sql

-- Copy the footwear CTE from the previous task
-- Calculate the number of products in info and median revenue from finance
-- Inner join info with finance on product_id
-- Filter the selection for products with a description not in footwear

WITH footwear AS (
    SELECT description, revenue FROM info
    INNER JOIN finance
    USING(product_id)
    WHERE (description NOT ILIKE '%shoe%') AND 
    (description NOT ILIKE  '%trainer%') AND
    (description NOT ILIKE '%foot%') AND (description IS NOT NULL)
    )


SELECT COUNT(*) AS num_clothing_products, percentile_disc(.5) WITHIN GROUP (ORDER BY revenue) AS median_clothing_revenue FROM footwear
 * postgresql:///sports
1 rows affected.

Out[9]:

num_clothing_productsmedian_clothing_revenue
417503.82

Leave a comment

I’m Ram Rallabandi

Book a 1:1 on meeting for quick career chat here

If you’re seeking career advice in Data Science, Digital Marketing, Growth Hacking here are some key points to consider:

1. Discover your area of interest
2. Learn the basics
3. Build a portfolio
4. Keep learning
5. Network
6. Be proactive

I would be happy to discuss the above topics with you during our meeting. While the time is short, we can touch upon the key points and provide you with some guidance on your career path in Data Science. Let’s make the most of our time together and cover as much ground as possible.

Let’s connect