Hopper Company Queries II: Leetcode Pandas

10 min readJan 20, 2025

In the ever-evolving world of ride-hailing platforms, understanding driver engagement is a cornerstone of operational success. How effectively are drivers participating in the ecosystem? Are they actively contributing to fulfilling ride requests? These are critical questions for platforms aiming to optimize their services and retain both drivers and customers.

In this blog, we tackle a fascinating data problem that delves into measuring the percentage of working drivers for each month of a given year. By working drivers, we mean those who actively contribute by accepting rides, while available drivers are those eligible to work based on their onboarding dates.

Through this analytical lens, we aim to explore:

How to define and measure driver engagement over time.
The importance of balancing the supply of active drivers with user demand.
The challenges of handling edge cases, such as months with no available drivers.

By analyzing driver engagement on a month-by-month basis, we uncover actionable insights that could influence resource allocation, performance optimization, and even policy decisions.

Let’s dive into this problem, combining Pandas and logic to create a solution that not only answers the question at hand but also highlights the power of data in driving informed decisions.

Fall 2024 at the University of Arkansas FayetteVille, AR. Photo Credit: **Tasnim and Kawsar**

Table: Drivers
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| driver_id   | int     |
| join_date   | date    |
+-------------+---------+
driver_id is the column with unique values for this table.
Each row of this table contains the driver's ID and the date they joined 
the Hopper company.

Table: Rides
+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| ride_id      | int     |
| user_id      | int     |
| requested_at | date    |
+--------------+---------+
ride_id is the column with unique values for this table.
Each row of this table contains the ID of a ride, the user's ID that 
requested it, and the day they requested it.
There may be some ride requests in this table that were not accepted.

Table: AcceptedRides
+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| ride_id       | int     |
| driver_id     | int     |
| ride_distance | int     |
| ride_duration | int     |
+---------------+---------+
ride_id is the column with unique values for this table.
Each row of this table contains some information about an accepted ride.
It is guaranteed that each accepted ride exists in the Rides table.

Write a solution to report the percentage of working drivers (working_percentage) for each month of 2020 where:

Note that if the number of available drivers during a month is zero, we consider the working_percentage to be 0. Return the result table ordered by month in ascending order, where month is the month's number (January is 1, February is 2, etc.). Round working_percentage to the nearest 2 decimal places. The result format is in the following example.

Input: 
Drivers table:
+-----------+------------+
| driver_id | join_date  |
+-----------+------------+
| 10        | 2019-12-10 |
| 8         | 2020-1-13  |
| 5         | 2020-2-16  |
| 7         | 2020-3-8   |
| 4         | 2020-5-17  |
| 1         | 2020-10-24 |
| 6         | 2021-1-5   |
+-----------+------------+
Rides table:
+---------+---------+--------------+
| ride_id | user_id | requested_at |
+---------+---------+--------------+
| 6       | 75      | 2019-12-9    |
| 1       | 54      | 2020-2-9     |
| 10      | 63      | 2020-3-4     |
| 19      | 39      | 2020-4-6     |
| 3       | 41      | 2020-6-3     |
| 13      | 52      | 2020-6-22    |
| 7       | 69      | 2020-7-16    |
| 17      | 70      | 2020-8-25    |
| 20      | 81      | 2020-11-2    |
| 5       | 57      | 2020-11-9    |
| 2       | 42      | 2020-12-9    |
| 11      | 68      | 2021-1-11    |
| 15      | 32      | 2021-1-17    |
| 12      | 11      | 2021-1-19    |
| 14      | 18      | 2021-1-27    |
+---------+---------+--------------+
AcceptedRides table:
+---------+-----------+---------------+---------------+
| ride_id | driver_id | ride_distance | ride_duration |
+---------+-----------+---------------+---------------+
| 10      | 10        | 63            | 38            |
| 13      | 10        | 73            | 96            |
| 7       | 8         | 100           | 28            |
| 17      | 7         | 119           | 68            |
| 20      | 1         | 121           | 92            |
| 5       | 7         | 42            | 101           |
| 2       | 4         | 6             | 38            |
| 11      | 8         | 37            | 43            |
| 15      | 8         | 108           | 82            |
| 12      | 8         | 38            | 34            |
| 14      | 1         | 90            | 74            |
+---------+-----------+---------------+---------------+
Output: 
+-------+--------------------+
| month | working_percentage |
+-------+--------------------+
| 1     | 0.00               |
| 2     | 0.00               |
| 3     | 25.00              |
| 4     | 0.00               |
| 5     | 0.00               |
| 6     | 20.00              |
| 7     | 20.00              |
| 8     | 20.00              |
| 9     | 0.00               |
| 10    | 0.00               |
| 11    | 33.33              |
| 12    | 16.67              |
+-------+--------------------+
Explanation: 
By the end of January --> two active drivers (10, 8) 
and no accepted rides. 
The percentage is 0%.
By the end of February --> three active drivers (10, 8, 5) 
and no accepted rides. 
The percentage is 0%.
By the end of March --> four active drivers (10, 8, 5, 7) 
and one accepted ride by driver (10). 
The percentage is (1 / 4) * 100 = 25%.
By the end of April --> four active drivers (10, 8, 5, 7) 
and no accepted rides. 
The percentage is 0%.
By the end of May --> five active drivers (10, 8, 5, 7, 4) 
and no accepted rides. 
The percentage is 0%.
By the end of June --> five active drivers (10, 8, 5, 7, 4) 
and one accepted ride by driver (10). 
The percentage is (1 / 5) * 100 = 20%.
By the end of July --> five active drivers (10, 8, 5, 7, 4) 
and one accepted ride by driver (8). 
The percentage is (1 / 5) * 100 = 20%.
By the end of August --> five active drivers (10, 8, 5, 7, 4) 
and one accepted ride by driver (7). 
The percentage is (1 / 5) * 100 = 20%.
By the end of September --> five active drivers (10, 8, 5, 7, 4) 
and no accepted rides. 
The percentage is 0%.
By the end of October --> six active drivers (10, 8, 5, 7, 4, 1) 
and no accepted rides. 
The percentage is 0%.
By the end of November --> six active drivers (10, 8, 5, 7, 4, 1) 
and two accepted rides by two different drivers (1, 7). 
The percentage is (2 / 6) * 100 = 33.33%.
By the end of December --> six active drivers (10, 8, 5, 7, 4, 1) 
and one accepted ride by driver (4). 
The percentage is (1 / 6) * 100 = 16.67%.

Solution:

import pandas as pd
import numpy as np

data = [[10, '2019-12-10'], 
        [8, '2020-1-13'], 
        [5, '2020-2-16'], 
        [7, '2020-3-8'], 
        [4, '2020-5-17'], 
        [1, '2020-10-24'], 
        [6, '2021-1-5']]
drivers = pd.DataFrame(
        data, 
        columns=['driver_id', 
                 'join_date']).astype({'driver_id':'Int64', 
                 'join_date':'datetime64[ns]'})
display(drivers)

data = [[6, 75, '2019-12-9'],
        [1, 54, '2020-2-9'], 
        [10, 63, '2020-3-4'], 
        [19, 39, '2020-4-6'], 
        [3, 41, '2020-6-3'], 
        [13, 52, '2020-6-22'], 
        [7, 69, '2020-7-16'], 
        [17, 70, '2020-8-25'], 
        [20, 81, '2020-11-2'], 
        [5, 57, '2020-11-9'], 
        [2, 42, '2020-12-9'], 
        [11, 68, '2021-1-11'], 
        [15, 32, '2021-1-17'], 
        [12, 11, '2021-1-19'], 
        [14, 18, '2021-1-27']]
rides = pd.DataFrame(
        data, 
        columns=['ride_id', 
                 'user_id', 
                 'requested_at']).astype({'ride_id':'Int64', 
                 'user_id':'Int64', 
                 'requested_at':'datetime64[ns]'})
display(rides)

data = [[10, 10, 63, 38], 
        [13, 10, 73, 96], 
        [7, 8, 100, 28], 
        [17, 7, 119, 68], 
        [20, 1, 121, 92], 
        [5, 7, 42, 101], 
        [2, 4, 6, 38], 
        [11, 8, 37, 43], 
        [15, 8, 108, 82], 
        [12, 8, 38, 34], 
        [14, 1, 90, 74]]
accepted_rides = pd.DataFrame(
        data, 
        columns=['ride_id', 
                 'driver_id', 
                 'ride_distance', 
                 'ride_duration']).astype({'ride_id':'Int64', 
                 'driver_id':'Int64', 
                 'ride_distance':'Int64', 
                 'ride_duration':'Int64'})
display(accepted_rides)

Step 1. Create a DataFrame for Months

Creates a DataFrame months with a single column month, containing numbers from 1 to 12, representing all months in a year.

months = pd.DataFrame({'month': range(1, 13)})
display(months.head())

Step 2. Filter Drivers by Joining Year (Up to 2020)

Filters the drivers DataFrame to include only rows where the join_date is in the year 2020 or earlier.

Step 3. Assign Driver Month

Creates a new column driver_month: If the driver joined in or before 2019, assigns 1 (represents January).
Otherwise, assign the month of the join_date.

drivers= drivers[drivers['join_date'].dt.year<=2020]
drivers['driver_month']=drivers['join_date'].apply(lambda x: 1 if x.year<=2019 else x.month)
display(drivers)

Step 4. Count Active Drivers by Month

Groups the drivers DataFrame by driver_month.
Counts the number of driver_id entries in each group, representing the count of active drivers per month.
Resets the index and names the resulting column active_drivers.

active_drivers=drivers.groupby('driver_month')[
    'driver_id'].count().reset_index(name='active_drivers')
display(active_drivers)

Step 5. Merge with Months and Fill in Missing Values

Merges the months DataFrame with the active_drivers DataFrame: Matches month in months with driver_month in active_drivers. Performs a left join, keeping all months from months even if there are no active drivers for a month.
Fills missing values (NaN) with 0 for months without active drivers.

df = months.merge(active_drivers,
                  left_on='month',
                  right_on='driver_month',
                  how='left').fillna(0)
display(df)

Step 6. Calculate Cumulative Active Drivers

Computes the cumulative sum of active_drivers across months, adding up the total active drivers month by month.

df['active_drivers']=df['active_drivers'].cumsum()
display(df)

Step 7. Filter Accepted Rides for 2020

Merges the rides DataFrame with accepted_rides using ride_id as the key, performing a right join.
Filters the resulting accept_rides DataFrame to include only rows where requested_at is in the year 2020.

accept_rides = rides.merge(accepted_rides,
                           how='right',
                           on='ride_id')
accept_rides=accept_rides[accept_rides['requested_at'].dt.year==2020]
display(accept_rides)

Step 8. Add Ride Request Month

Creates a new column month in accept_rides, extracting the month from the requested_at datetime field.

accept_rides['month']=accept_rides['requested_at'].dt.month
display(accept_rides)

Step 9. Remove Duplicate Month-Driver Combinations

Removes duplicate entries based on the combination of month and driver_id to ensure that each driver contributes only once per month.

accept_rides.drop_duplicates(['month', 'driver_id'], inplace=True)
display(accept_rides)

Step 10. Count Accepted Rides by Month

Groups the accept_rides DataFrame by month.
Counts the number of unique ride_id entries for each month.
Resets the index and names the resulting column accept_rides.

accept_rides=accept_rides.groupby('month')[
  'ride_id'].count().reset_index(name='accept_rides')
display(accept_rides)

Step 11. Merge Accepted Rides with the Main DataFrame

Merges the df DataFrame with the accept_rides DataFrame: Matches on the month column. Performs a left join to keep all months from df.
Fills missing values (NaN) with 0 for months without accepted rides.

df= df.merge(accept_rides,on='month',how='left').fillna(0)
display(df)

Step 12. Calculate Working Percentage

Computes the working_percentage for each month
If active_drivers is 0, assigns 0 (to avoid division by zero).
Otherwise, calculates the percentage of accepted rides relative to active drivers and rounds to two decimal places.

df['working_percentage'] = np.where(df['active_drivers']==0, 0, 
             (df['accept_rides']/df['active_drivers']*100).round(2))
display(df)

Step 13. Select Final Columns

Selects only the month and working_percentage columns for the final DataFrame.

df = df[['month','working_percentage']]
display(df)

Happy analyzing!

Follow me on LinkedIn Instagram Medium profile or subscribe to my email list for the latest blogs, and click the clap button to support writing. Feel free to ask questions in the comment section. Don’t forget to share the blog link with your friends or LinkedIn connections.

References:
[1] Leetcode Source: hopper-company-queries-ii
[2] Full code in GitHub

Hopper Company Queries I: Leetcode Pandas

The success of a ridesharing company like Hopper relies on its ability to balance driver availability with ride demand…

kawsar34.medium.com

Student Reports by Geography: Leetcode Pandas

Analyzing student distributions across different continents can provide valuable insights for educational institutions…