Leetcode Hard Pandas (Part 1)

15 min readFeb 24, 2024

Pandas is a must-know requirement for a Data Scientist. Today I will guide you on solving 5 Leetcode hard questions in pandas. I will show you how to use self-join, cross-join, inner join, and left-join, how to find rank, how to find the nth-largest data point, and how to use apply, repeat, and rolling functions in pandas. You’ll also learn how to use the lambda function, list comprehension, dataframe slicing and filtering.

Photo: Independence Pass, Colorado, USA. Credit: Tasnim and Kawsar

Problem 01: 185. Department Top Three Salaries

A company’s executives are interested in seeing who earns the most money in each of the company’s departments. A high earner in a department is an employee who has a salary in the top three unique salaries for that department. Write a solution to find the employees who are high earners in each of the departments. Return the result table in any order. The result format is in the following example.

Input: 
Employee table:
+----+-------+--------+--------------+
| id | name  | salary | departmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 85000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |
| 7  | Will  | 70000  | 1            |
+----+-------+--------+--------------+
Each row of this table indicates the ID, name, and salary of an employee. 
It also contains the ID of their department.

Department table:
+----+-------+
| id | name  |
+----+-------+
| 1  | IT    |
| 2  | Sales |
+----+-------+
Each row of this table indicates the ID of a department and its name.
Output: 
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Joe      | 85000  |
| IT         | Randy    | 85000  |
| IT         | Will     | 70000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |
+------------+----------+--------+
Explanation: 
In the IT department:
- Max earns the highest unique salary
- Both Randy and Joe earn the second-highest unique salary
- Will earns the third-highest unique salary

In the Sales department:
- Henry earns the highest salary
- Sam earns the second-highest salary
- There is no third-highest salary as there are only two employees

Pandas Input:

employee = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'name': ['Joe', 'Henry', 'Sam', 'Max', 'Janet', 'Randy', 'Will'],
    'salary': [85000, 80000, 60000, 90000, 69000, 85000, 70000],
    'departmentId': [1, 2, 2, 1, 1, 1, 1]})

department = pd.DataFrame({
    'id': [1, 2],
    'name': ['IT', 'Sales']})

Solution 1: pandas drop(), pandas rename(), pandas merge(), pandas rank(), pandas isin(), pandas groupby()

import pandas as pd

def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:

    # remove the id column from the employee dataframe 
    # because we don't need this column
    employee = employee.drop("id", axis=1)

    # rename the column names according to the required output
    employee = employee.rename(columns={"name": "Employee", 
                                        "salary": "Salary"})
    department = department.rename(columns={"id": "departmentId", 
                                            "name": "Department"})

    # left join employee with department based on the common departmentId
    df = employee.merge(department, how="left", on=["departmentId"])

    # create a new variable rank based on the top salaries for each department
    df["Rank"] = df.groupby(["Department"])["Salary"].rank(method="dense", 
                                                           ascending=False)
    
    # keep only the top 3 unique salaries
    df = df[df["Rank"].isin([1,2,3])]

    # keep only the required columns
    df = df[["Department", "Employee", "Salary"]]
    
    return df

Solution 2:

import pandas as pd

def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:

    # separate the rows containing the top 3 salaries for each department
    top_salary = employee[employee.groupby('departmentId')[
                 "salary"].rank(method='dense', 
                                ascending=False) <= 3]

    # inner join to get the department names
    df = top_salary.merge(department, 
                          how="inner",
                          left_on='departmentId', 
                          right_on='id')

    # rename and keep the required columns
    df = df.rename(columns = {'name_y': 'Department', 
                              'name_x': 'Employee', 
                              'salary': 'Salary'})
    df = df[['Department', 'Employee', 'Salary']]

    return df

Solution 3: pandas nlargest()

import pandas as pd

def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    
    # inner join employee with department 
    df = employee.merge(department, 
                        how="inner", 
                        left_on='departmentId', 
                        right_on='id')

    df = df.rename(columns = {'name_y': 'Department'})

    # keep only the unique salaries from each department
    df = df[['Department', 'departmentId', 'salary']].drop_duplicates()
    
    # find the top 3 unique salaries for each department
    top_salary = df.groupby(['Department', 
                   'departmentId'])['salary'].nlargest(3).reset_index()
    
    # find the top 3 salary earning employees
    df = top_salary.merge(employee, on=['departmentId', 'salary'])
    df = df.rename(columns = {'name': 'Employee', 
                              'salary': 'Salary'})
    df = df[['Department', 'Employee', 'Salary']]
    
    return df

Problem 02: 262. Trips and Users

The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day. Write a solution to find the cancellation rate of requests with unbanned users (both client and driver must not be banned) each day between "2013-10-01" and "2013-10-03". Round Cancellation Rate to two decimal points. Return the result table in any order. The result format is in the following example.

Input: 
The table holds all taxi trips. Each trip has a unique id.
Trips table:
+----+-----------+-----------+---------+---------------------+------------+
| id | client_id | driver_id | city_id | status              | request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
Users table:
+----------+--------+--------+
| users_id | banned | role   |
+----------+--------+--------+
| 1        | No     | client |
| 2        | Yes    | client |
| 3        | No     | client |
| 4        | No     | client |
| 10       | No     | driver |
| 11       | No     | driver |
| 12       | No     | driver |
| 13       | No     | driver |
+----------+--------+--------+
Output: 
+------------+-------------------+
| Day        | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 | 0.33              |
| 2013-10-02 | 0.00              |
| 2013-10-03 | 0.50              |
+------------+-------------------+
Explanation: 
On 2013-10-01:
  - There were 4 requests in total, 2 of which were canceled.
  - However, the request with Id=2 was made by a banned client (User_Id=2), 
so it is ignored in the calculation.
  - Hence there are 3 unbanned requests in total, 1 of which was canceled.
  - The Cancellation Rate is (1 / 3) = 0.33
On 2013-10-02:
  - There were 3 requests in total, 0 of which were canceled.
  - The request with Id=6 was made by a banned client, so it is ignored.
  - Hence there are 2 unbanned requests in total, 0 of which were canceled.
  - The Cancellation Rate is (0 / 2) = 0.00
On 2013-10-03:
  - There were 3 requests in total, 1 of which was canceled.
  - The request with Id=8 was made by a banned client, so it is ignored.
  - Hence there are 2 unbanned request in total, 1 of which were canceled.
  - The Cancellation Rate is (1 / 2) = 0.50

Pandas Input:

trips = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'client_id': [1, 2, 3, 4, 1, 2, 3, 2, 3, 4],
    'driver_id': [10, 11, 12, 13, 10, 11, 12, 12, 10, 13],
    'city_id': [1, 1, 6, 6, 1, 6, 6, 12, 12, 12],
    'status': ['completed', 'cancelled_by_driver', 'completed', 
               'cancelled_by_client', 'completed', 'completed', 
               'completed', 'completed', 'completed', 'cancelled_by_driver'],
    'request_at': ['2013-10-01', '2013-10-01', '2013-10-01', '2013-10-01', 
                   '2013-10-02', '2013-10-02', '2013-10-02', '2013-10-03', 
                   '2013-10-03', '2013-10-03']
})

trips['request_at'] = pd.to_datetime(trips['request_at'])

users = pd.DataFrame({
    'users_id': [1, 2, 3, 4, 
                 10, 11, 12, 13],
    'banned': ['No', 'Yes', 'No', 'No', 
               'No', 'No', 'No', 'No'],
    'role': ['client', 'client', 'client', 'client', 
             'driver', 'driver', 'driver', 'driver']
})

Solution: numpy where

import pandas as pd

def trips_and_users(trips: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:

    # if trips or users dataframe is empty, 
    # return the output dataframe with the required column headers
    if trips.empty or users.empty:
        return pd.DataFrame(columns=["Day", "Cancellation Rate"])

    # find the banned users
    users = users[users["banned"]=="No"]

    # create a list of unbanned users
    client_list = users[users["role"]=="client"]["users_id"].values.tolist()

    # convert the request_at str column to datetime
    trips["request_at"] = pd.to_datetime(trips["request_at"])

    # make sure request_at dates are between "2013-10-01" and "2013-10-03"
    trips = trips.set_index(["request_at"])
    trips = trips.loc["2013-10-01": "2013-10-03"]
    trips = trips.reset_index()

    # keep only the unbanned users
    trips = trips[trips["client_id"].isin(client_list)]

    # create a new column: 1 if status is completed, otherwise 0
    trips["status_num"] = np.where(trips["status"]=="completed", 0, 1)

    # for each date, find the number of completed rides 
    # and total number of rides requested
    df = trips.groupby(["request_at"]).agg({"status_num": "sum", 
                        "status": "count"}).reset_index()

    # find the Cancellation Rate
    df["Cancellation Rate"] = df["status_num"]/df["status"]

    # keep only the required columns for the output
    df = df.drop(["status_num", "status"], axis=1)
    df = df.rename(columns={"request_at": "Day"})

    return df.round(2)

Problem 03: 569. Median Employee Salary

Write a solution to find the rows that contain the median salary of each company. While calculating the median, when you sort the salaries of the company, break the ties by id. Return the result table in any order. The result format is in the following example.

Input: 
Employee table:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 1  | A       | 2341   |
| 2  | A       | 341    |
| 3  | A       | 15     |
| 4  | A       | 15314  |
| 5  | A       | 451    |
| 6  | A       | 513    |
| 7  | B       | 15     |
| 8  | B       | 13     |
| 9  | B       | 1154   |
| 10 | B       | 1345   |
| 11 | B       | 1221   |
| 12 | B       | 234    |
| 13 | C       | 2345   |
| 14 | C       | 2645   |
| 15 | C       | 2645   |
| 16 | C       | 2652   |
| 17 | C       | 65     |
+----+---------+--------+
Each row of this table indicates the company and the salary of one employee.

Output: 
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 5  | A       | 451    |
| 6  | A       | 513    |
| 12 | B       | 234    |
| 9  | B       | 1154   |
| 14 | C       | 2645   |
+----+---------+--------+
Explanation: 
For company A, the rows sorted are as follows:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 3  | A       | 15     |
| 2  | A       | 341    |
| 5  | A       | 451    | <-- median
| 6  | A       | 513    | <-- median
| 1  | A       | 2341   |
| 4  | A       | 15314  |
+----+---------+--------+
For company B, the rows sorted are as follows:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 8  | B       | 13     |
| 7  | B       | 15     |
| 12 | B       | 234    | <-- median
| 11 | B       | 1221   | <-- median
| 9  | B       | 1154   |
| 10 | B       | 1345   |
+----+---------+--------+
For company C, the rows sorted are as follows:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 17 | C       | 65     |
| 13 | C       | 2345   |
| 14 | C       | 2645   | <-- median
| 15 | C       | 2645   | 
| 16 | C       | 2652   |
+----+---------+--------+

Pandas input:

employee = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
    'company': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'salary': [2341, 341, 15, 15314, 451, 513, 15, 13, 1154, 1345, 1221, 234, 2345, 2645, 2645, 2652, 65]
})

Solution 1: pandas sort_values(), pandas apply(), pandas iloc[], python lambda function

import pandas as pd
def median_employee_salary(employee: pd.DataFrame) -> pd.DataFrame:

    # always sort in ascending order 
    # before finding median without built-in function
    # sort by 'id' to break ties
    employee = employee.sort_values(by=['salary', 'id'], 
                                    ascending=[True, True])

    # find median using lambda function
    # can't use built-in pandas median() because 
    # when the number of observations is even
    # it will take the average of the middle two values 
    # which doesn't fulfill the required criteria 
    employee = employee.groupby(
               'company').apply(lambda x: x.iloc[(len(x)-1)//2:len(x)//2+1])

    return employee

Solution 2: pandas cumcount(), pandas transform(), pandas loc[]

import pandas as pd
def median_employee_salary(employee: pd.DataFrame) -> pd.DataFrame:

    # create a new column rank based on the salary and id, 
    # group by each company
    employee['rank'] = employee.sort_values(['salary',
                       'id']).groupby('company').cumcount() + 1
    
    # create a new column count 
    # to find the number of observations of each company
    employee["count"] = employee.groupby(["company"])["id"].transform("count")

    # find the rows that contain the median salary of each company
    employee = employee.loc[(employee['rank'] >= employee['count'] / 2) 
                          & (employee['rank'] <= employee['count'] / 2 + 1)]

    # keep the required columns only
    employee = employee[['id', 'company', 'salary']]

    return employee

Problem 04: 571. Find Median Given Frequency of Numbers

The median is the value separating the higher half from the lower half of a data sample. Write a solution to report the median of all the numbers in the database after decompressing the Numbers table. Round the median to one decimal point. The result format is in the following example.

Input: 
Numbers table:
+-----+-----------+
| num | frequency |
+-----+-----------+
| 0   | 7         |
| 1   | 1         |
| 2   | 3         |
| 3   | 1         |
+-----+-----------+
Each row of this table shows the frequency of a number in the database.

Output: 
+--------+
| median |
+--------+
| 0.0    |
+--------+
Explanation: 
If we decompress the Numbers table, 
we will get [0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3], 
so the median is (0 + 0) / 2 = 0.

Pandas Input:

numbers = pd.DataFrame({'num': [0, 1, 2, 3],
                        'frequency': [7, 1, 3, 1]
})

Solution 1: python list comprehension, numpy median(), numpy round()

import pandas as pd
import numpy as np

def median_frequency(numbers: pd.DataFrame) -> pd.DataFrame:

    # create a list of lists to extract each num according to their frequency
    num_lists = []
    for row in range(numbers.shape[0]):
        num_lists.append([numbers["num"].values[row] for i in range(numbers["frequency"].values[row])])

    # convert the list of lists to a single list
    num_list = [item for sublist in num_lists for item in sublist]

    # find the median and round to one decimal point.
    median = np.round(np.median(num_list), 1)

    # return the dataframe according to the required output
    return pd.DataFrame({"median": [median]})

Solution 2: pandas iterrows()

import pandas as pd
import numpy as np

def median_frequency(numbers: pd.DataFrame) -> pd.DataFrame:

    # create a list to extract each num according to their frequency
    num_list=[]
    for index, row in numbers.iterrows():
        for i in range((row["frequency"])):
            num_list.append(row["num"])
    
    # find the median and round to one decimal point.
    median = np.round(np.median(num_list), 1)

    # return the dataframe according to the required output
    return pd.DataFrame({"median": [median]})

Solution 3: pandas repeat(), pandas to_frame()

import pandas as pd

def median_frequency(numbers: pd.DataFrame) -> pd.DataFrame:

    # create a new dataframe with a single column "num"
    # extract each num according to their frequency
    df = numbers['num'].repeat(numbers['frequency']).to_frame()

    # find the median and round to one decimal point
    df = df.median().to_frame('median').round(1)

    return df

Problem 05: 579 Find the Cumulative Salary of an Employee

Write a solution to calculate the cumulative salary summary for every employee in a single unified table. The cumulative salary summary for an employee can be calculated as follows:

For each month that the employee worked, sum up the salaries in that month and the previous two months. This is their 3-month sum for that month. If an employee did not work for the company in previous months, their effective salary for those months is 0.
Do not include the 3-month sum for the most recent month that the employee worked for in the summary.
Do not include the 3-month sum for any month the employee did not work.

Return the result table ordered by id in ascending order. In case of a tie, order it by month in descending order. The result format is in the following example.

Input: 
Employee table:
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 1  | 1     | 20     |
| 2  | 1     | 20     |
| 1  | 2     | 30     |
| 2  | 2     | 30     |
| 3  | 2     | 40     |
| 1  | 3     | 40     |
| 3  | 3     | 60     |
| 1  | 4     | 60     |
| 3  | 4     | 70     |
| 1  | 7     | 90     |
| 1  | 8     | 90     |
+----+-------+--------+
Each row in the table indicates the salary of an employee 
in one month during the year 2020.

Output: 
+----+-------+--------+
| id | month | Salary |
+----+-------+--------+
| 1  | 7     | 90     |
| 1  | 4     | 130    |
| 1  | 3     | 90     |
| 1  | 2     | 50     |
| 1  | 1     | 20     |
| 2  | 1     | 20     |
| 3  | 3     | 100    |
| 3  | 2     | 40     |
+----+-------+--------+
Explanation: 
Employee '1' has five salary records excluding their most recent month '8':
- 90 for month '7'.
- 60 for month '4'.
- 40 for month '3'.
- 30 for month '2'.
- 20 for month '1'.
So the cumulative salary summary for this employee is:
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 1  | 7     | 90     |  (90 + 0 + 0)
| 1  | 4     | 130    |  (60 + 40 + 30)
| 1  | 3     | 90     |  (40 + 30 + 20)
| 1  | 2     | 50     |  (30 + 20 + 0)
| 1  | 1     | 20     |  (20 + 0 + 0)
+----+-------+--------+
Note that the 3-month sum for month '7' is 90 
because they did not work during month '6' or month '5'.
Employee '2' only has one salary record (month '1') 
excluding their most recent month '2'.
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 2  | 1     | 20     |  (20 + 0 + 0)
+----+-------+--------+
Employee '3' has two salary records excluding their most recent month '4':
- 60 for month '3'.
- 40 for month '2'.
So the cumulative salary summary for this employee is:
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 3  | 3     | 100    |  (60 + 40 + 0)
| 3  | 2     | 40     |  (40 + 0 + 0)
+----+-------+--------+

Pandas Input:

employee = pd.DataFrame({
    'id': [1, 2, 1, 2, 3, 1, 3, 1, 3, 1, 1],
    'month': [1, 1, 2, 2, 2, 3, 3, 4, 4, 7, 8],
    'salary': [20, 20, 30, 30, 40, 40, 60, 60, 70, 90, 90]
})

Solution: pandas unique(), pandas rolling()

import pandas as pd

def cumulative_salary(employee: pd.DataFrame) -> pd.DataFrame:

    # create a dataframe with the unique employee ids
    df_id = pd.DataFrame({"id": employee["id"].unique()})

    # create a dataframe with all the months of a year
    df_month = pd.DataFrame({"month": [i+1 for i in range(12)]})

    # cross join employee id with every month
    df = df_id.merge(df_month, how="cross")

    # include salary for the available months
    df = df.merge(employee, how="left")

    # calculate 3-month salary
    df["Salary"] = df.groupby(["id"])["salary"].rolling(3, 
                   min_periods=1).sum().reset_index(0, drop=True)
    
    # remove the missing values and sort by id and month
    df = df.dropna().sort_values(by=["id", "month"], 
                                     ascending=[True, False])

    # create a new column rank to find the most recent month for each employee
    df["rank"] = df.groupby(["id"])["month"].rank(ascending=False)

    # remove the most recent month
    df = df[df["rank"] != 1] 

    # keep only the required columns for the output
    df = df[["id", "month", "Salary"]]

    return df

Solution 2:

import pandas as pd

def cumulative_salary(employee: pd.DataFrame) -> pd.DataFrame:
    
    # find the most recent month's index of each employee id
    most_recent = employee.groupby('id')['month'].idxmax()
    
    # remove the most recent month for each employee id
    df = employee[~employee.index.isin(most_recent)]
    
    # self-join based on the employee id
    df = df.merge(df, on='id')

    # for each employee id and month, keep only the most recent 3 months data 
    df = df.loc[lambda x: (x.month_x-x.month_y).isin([0,1,2]), :]

    # calculate month recent 3-month salary
    df = df.groupby(['id','month_x'])['salary_y'].sum().reset_index()
    
    # rename the column names according to the required output
    # sort the dataframe by employee id and month
    df = df.rename(columns={'month_x':'month',
                   'salary_y':'Salary'}).sort_values(by=['id', 'month'], 
                                                     ascending=[True, False])
    
    return df

If you learn something new today, please follow my Medium profile, subscribe to my email list for the latest blogs, and click the clap button to support writing. Feel free to discuss your thoughts on these questions in the comment section. Don’t forget to share the blog link with your friends or LinkedIn connections. If you want to connect with me on LinkedIn: my LinkedIn profile. Happy learning.

Leetcode: 30 Days of Pandas (Part 1 of 3)

LeetCode has launched a study plan “30 days of Pandas”, with 30 questions to practice your Pandas skills. Here are the…

kawsar34.medium.com

Leetcode 30 days of Pandas (Part 2 of 3)

LeetCode has launched a study plan “30 days of Pandas”, with 33 questions to practice your Pandas skills. I have…

kawsar34.medium.com

Leetcode 30 days of Pandas (Part 3 of 3)

LeetCode has launched a study plan “30 days of Pandas”, with 33 questions to practice your Pandas skills. I have…

kawsar34.medium.com

Leetcode Hard Pandas (Part 1)

Leetcode: 30 Days of Pandas (Part 1 of 3)

LeetCode has launched a study plan “30 days of Pandas”, with 30 questions to practice your Pandas skills. Here are the…

Leetcode 30 days of Pandas (Part 2 of 3)

LeetCode has launched a study plan “30 days of Pandas”, with 33 questions to practice your Pandas skills. I have…

Leetcode 30 days of Pandas (Part 3 of 3)

LeetCode has launched a study plan “30 days of Pandas”, with 33 questions to practice your Pandas skills. I have…

Written by Shahidullah Kawsar