Leetcode Hard Pandas (Part 1)

Shahidullah Kawsar
15 min readFeb 24, 2024

Pandas is a must-know requirement for a Data Scientist. Today I will guide you on solving 5 Leetcode hard questions in pandas. I will show you how to use self-join, cross-join, inner join, and left-join, how to find rank, how to find the nth-largest data point, and how to use apply, repeat, and rolling functions in pandas. You’ll also learn how to use the lambda function, list comprehension, dataframe slicing and filtering.

Photo: Independence Pass, Colorado, USA. Credit: Tasnim and Kawsar

Problem 01: 185. Department Top Three Salaries

A company’s executives are interested in seeing who earns the most money in each of the company’s departments. A high earner in a department is an employee who has a salary in the top three unique salaries for that department. Write a solution to find the employees who are high earners in each of the departments. Return the result table in any order. The result format is in the following example.

Input: 
Employee table:
+----+-------+--------+--------------+
| id | name | salary | departmentId |
+----+-------+--------+--------------+
| 1 | Joe | 85000 | 1 |
| 2 | Henry | 80000 | 2 |
| 3 | Sam | 60000 | 2 |
| 4 | Max | 90000 | 1 |
| 5 | Janet | 69000 | 1 |
| 6 | Randy | 85000 | 1 |
| 7 | Will | 70000 | 1 |
+----+-------+--------+--------------+
Each row of this table indicates the ID, name, and salary of an employee.
It also contains the ID of their department.

Department table:
+----+-------+
| id | name |
+----+-------+
| 1 | IT |
| 2 | Sales |
+----+-------+
Each row of this table indicates the ID of a department and its name.
Output:
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT | Max | 90000 |
| IT | Joe | 85000 |
| IT | Randy | 85000 |
| IT | Will | 70000 |
| Sales | Henry | 80000 |
| Sales | Sam | 60000 |
+------------+----------+--------+
Explanation:
In the IT department:
- Max earns the highest unique salary
- Both Randy and Joe earn the second-highest unique salary
- Will earns the third-highest unique salary

In the Sales department:
- Henry earns the highest salary
- Sam earns the second-highest salary
- There is no third-highest salary as there are only two employees

Pandas Input:

employee = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7],
'name': ['Joe', 'Henry', 'Sam', 'Max', 'Janet', 'Randy', 'Will'],
'salary': [85000, 80000, 60000, 90000, 69000, 85000, 70000],
'departmentId': [1, 2, 2, 1, 1, 1, 1]})

department = pd.DataFrame({
'id': [1, 2],
'name': ['IT', 'Sales']})

Solution 1: pandas drop(), pandas rename(), pandas merge(), pandas rank(), pandas isin(), pandas groupby()

import pandas as pd

def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:

# remove the id column from the employee dataframe
# because we don't need this column
employee = employee.drop("id", axis=1)

# rename the column names according to the required output
employee = employee.rename(columns={"name": "Employee",
"salary": "Salary"})
department = department.rename(columns={"id": "departmentId",
"name": "Department"})

# left join employee with department based on the common departmentId
df = employee.merge(department, how="left", on=["departmentId"])

# create a new variable rank based on the top salaries for each department
df["Rank"] = df.groupby(["Department"])["Salary"].rank(method="dense",
ascending=False)

# keep only the top 3 unique salaries
df = df[df["Rank"].isin([1,2,3])]

# keep only the required columns
df = df[["Department", "Employee", "Salary"]]

return df

Solution 2:

import pandas as pd

def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:

# separate the rows containing the top 3 salaries for each department
top_salary = employee[employee.groupby('departmentId')[
"salary"].rank(method='dense',
ascending=False) <= 3]

# inner join to get the department names
df = top_salary.merge(department,
how="inner",
left_on='departmentId',
right_on='id')

# rename and keep the required columns
df = df.rename(columns = {'name_y': 'Department',
'name_x': 'Employee',
'salary': 'Salary'})
df = df[['Department', 'Employee', 'Salary']]

return df

Solution 3: pandas nlargest()

import pandas as pd

def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:

# inner join employee with department
df = employee.merge(department,
how="inner",
left_on='departmentId',
right_on='id')

df = df.rename(columns = {'name_y': 'Department'})

# keep only the unique salaries from each department
df = df[['Department', 'departmentId', 'salary']].drop_duplicates()

# find the top 3 unique salaries for each department
top_salary = df.groupby(['Department',
'departmentId'])['salary'].nlargest(3).reset_index()

# find the top 3 salary earning employees
df = top_salary.merge(employee, on=['departmentId', 'salary'])
df = df.rename(columns = {'name': 'Employee',
'salary': 'Salary'})
df = df[['Department', 'Employee', 'Salary']]

return df

Problem 02: 262. Trips and Users

The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day. Write a solution to find the cancellation rate of requests with unbanned users (both client and driver must not be banned) each day between "2013-10-01" and "2013-10-03". Round Cancellation Rate to two decimal points. Return the result table in any order. The result format is in the following example.

Input: 
The table holds all taxi trips. Each trip has a unique id.
Trips table:
+----+-----------+-----------+---------+---------------------+------------+
| id | client_id | driver_id | city_id | status | request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1 | 1 | 10 | 1 | completed | 2013-10-01 |
| 2 | 2 | 11 | 1 | cancelled_by_driver | 2013-10-01 |
| 3 | 3 | 12 | 6 | completed | 2013-10-01 |
| 4 | 4 | 13 | 6 | cancelled_by_client | 2013-10-01 |
| 5 | 1 | 10 | 1 | completed | 2013-10-02 |
| 6 | 2 | 11 | 6 | completed | 2013-10-02 |
| 7 | 3 | 12 | 6 | completed | 2013-10-02 |
| 8 | 2 | 12 | 12 | completed | 2013-10-03 |
| 9 | 3 | 10 | 12 | completed | 2013-10-03 |
| 10 | 4 | 13 | 12 | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
Users table:
+----------+--------+--------+
| users_id | banned | role |
+----------+--------+--------+
| 1 | No | client |
| 2 | Yes | client |
| 3 | No | client |
| 4 | No | client |
| 10 | No | driver |
| 11 | No | driver |
| 12 | No | driver |
| 13 | No | driver |
+----------+--------+--------+
Output:
+------------+-------------------+
| Day | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 | 0.33 |
| 2013-10-02 | 0.00 |
| 2013-10-03 | 0.50 |
+------------+-------------------+
Explanation:
On 2013-10-01:
- There were 4 requests in total, 2 of which were canceled.
- However, the request with Id=2 was made by a banned client (User_Id=2),
so it is ignored in the calculation.
- Hence there are 3 unbanned requests in total, 1 of which was canceled.
- The Cancellation Rate is (1 / 3) = 0.33
On 2013-10-02:
- There were 3 requests in total, 0 of which were canceled.
- The request with Id=6 was made by a banned client, so it is ignored.
- Hence there are 2 unbanned requests in total, 0 of which were canceled.
- The Cancellation Rate is (0 / 2) = 0.00
On 2013-10-03:
- There were 3 requests in total, 1 of which was canceled.
- The request with Id=8 was made by a banned client, so it is ignored.
- Hence there are 2 unbanned request in total, 1 of which were canceled.
- The Cancellation Rate is (1 / 2) = 0.50

Pandas Input:

trips = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'client_id': [1, 2, 3, 4, 1, 2, 3, 2, 3, 4],
'driver_id': [10, 11, 12, 13, 10, 11, 12, 12, 10, 13],
'city_id': [1, 1, 6, 6, 1, 6, 6, 12, 12, 12],
'status': ['completed', 'cancelled_by_driver', 'completed',
'cancelled_by_client', 'completed', 'completed',
'completed', 'completed', 'completed', 'cancelled_by_driver'],
'request_at': ['2013-10-01', '2013-10-01', '2013-10-01', '2013-10-01',
'2013-10-02', '2013-10-02', '2013-10-02', '2013-10-03',
'2013-10-03', '2013-10-03']
})

trips['request_at'] = pd.to_datetime(trips['request_at'])

users = pd.DataFrame({
'users_id': [1, 2, 3, 4,
10, 11, 12, 13],
'banned': ['No', 'Yes', 'No', 'No',
'No', 'No', 'No', 'No'],
'role': ['client', 'client', 'client', 'client',
'driver', 'driver', 'driver', 'driver']
})

Solution: numpy where

import pandas as pd

def trips_and_users(trips: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:

# if trips or users dataframe is empty,
# return the output dataframe with the required column headers
if trips.empty or users.empty:
return pd.DataFrame(columns=["Day", "Cancellation Rate"])

# find the banned users
users = users[users["banned"]=="No"]

# create a list of unbanned users
client_list = users[users["role"]=="client"]["users_id"].values.tolist()

# convert the request_at str column to datetime
trips["request_at"] = pd.to_datetime(trips["request_at"])

# make sure request_at dates are between "2013-10-01" and "2013-10-03"
trips = trips.set_index(["request_at"])
trips = trips.loc["2013-10-01": "2013-10-03"]
trips = trips.reset_index()

# keep only the unbanned users
trips = trips[trips["client_id"].isin(client_list)]

# create a new column: 1 if status is completed, otherwise 0
trips["status_num"] = np.where(trips["status"]=="completed", 0, 1)

# for each date, find the number of completed rides
# and total number of rides requested
df = trips.groupby(["request_at"]).agg({"status_num": "sum",
"status": "count"}).reset_index()

# find the Cancellation Rate
df["Cancellation Rate"] = df["status_num"]/df["status"]

# keep only the required columns for the output
df = df.drop(["status_num", "status"], axis=1)
df = df.rename(columns={"request_at": "Day"})

return df.round(2)

Problem 03: 569. Median Employee Salary

Write a solution to find the rows that contain the median salary of each company. While calculating the median, when you sort the salaries of the company, break the ties by id. Return the result table in any order. The result format is in the following example.

Input: 
Employee table:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 1 | A | 2341 |
| 2 | A | 341 |
| 3 | A | 15 |
| 4 | A | 15314 |
| 5 | A | 451 |
| 6 | A | 513 |
| 7 | B | 15 |
| 8 | B | 13 |
| 9 | B | 1154 |
| 10 | B | 1345 |
| 11 | B | 1221 |
| 12 | B | 234 |
| 13 | C | 2345 |
| 14 | C | 2645 |
| 15 | C | 2645 |
| 16 | C | 2652 |
| 17 | C | 65 |
+----+---------+--------+
Each row of this table indicates the company and the salary of one employee.

Output:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 5 | A | 451 |
| 6 | A | 513 |
| 12 | B | 234 |
| 9 | B | 1154 |
| 14 | C | 2645 |
+----+---------+--------+
Explanation:
For company A, the rows sorted are as follows:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 3 | A | 15 |
| 2 | A | 341 |
| 5 | A | 451 | <-- median
| 6 | A | 513 | <-- median
| 1 | A | 2341 |
| 4 | A | 15314 |
+----+---------+--------+
For company B, the rows sorted are as follows:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 8 | B | 13 |
| 7 | B | 15 |
| 12 | B | 234 | <-- median
| 11 | B | 1221 | <-- median
| 9 | B | 1154 |
| 10 | B | 1345 |
+----+---------+--------+
For company C, the rows sorted are as follows:
+----+---------+--------+
| id | company | salary |
+----+---------+--------+
| 17 | C | 65 |
| 13 | C | 2345 |
| 14 | C | 2645 | <-- median
| 15 | C | 2645 |
| 16 | C | 2652 |
+----+---------+--------+

Pandas input:

employee = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
'company': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'salary': [2341, 341, 15, 15314, 451, 513, 15, 13, 1154, 1345, 1221, 234, 2345, 2645, 2645, 2652, 65]
})

Solution 1: pandas sort_values(), pandas apply(), pandas iloc[], python lambda function

import pandas as pd
def median_employee_salary(employee: pd.DataFrame) -> pd.DataFrame:

# always sort in ascending order
# before finding median without built-in function
# sort by 'id' to break ties
employee = employee.sort_values(by=['salary', 'id'],
ascending=[True, True])

# find median using lambda function
# can't use built-in pandas median() because
# when the number of observations is even
# it will take the average of the middle two values
# which doesn't fulfill the required criteria
employee = employee.groupby(
'company').apply(lambda x: x.iloc[(len(x)-1)//2:len(x)//2+1])

return employee

Solution 2: pandas cumcount(), pandas transform(), pandas loc[]

import pandas as pd
def median_employee_salary(employee: pd.DataFrame) -> pd.DataFrame:

# create a new column rank based on the salary and id,
# group by each company
employee['rank'] = employee.sort_values(['salary',
'id']).groupby('company').cumcount() + 1

# create a new column count
# to find the number of observations of each company
employee["count"] = employee.groupby(["company"])["id"].transform("count")

# find the rows that contain the median salary of each company
employee = employee.loc[(employee['rank'] >= employee['count'] / 2)
& (employee['rank'] <= employee['count'] / 2 + 1)]

# keep the required columns only
employee = employee[['id', 'company', 'salary']]

return employee

Problem 04: 571. Find Median Given Frequency of Numbers

The median is the value separating the higher half from the lower half of a data sample. Write a solution to report the median of all the numbers in the database after decompressing the Numbers table. Round the median to one decimal point. The result format is in the following example.

Input: 
Numbers table:
+-----+-----------+
| num | frequency |
+-----+-----------+
| 0 | 7 |
| 1 | 1 |
| 2 | 3 |
| 3 | 1 |
+-----+-----------+
Each row of this table shows the frequency of a number in the database.

Output:
+--------+
| median |
+--------+
| 0.0 |
+--------+
Explanation:
If we decompress the Numbers table,
we will get [0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3],
so the median is (0 + 0) / 2 = 0.

Pandas Input:

numbers = pd.DataFrame({'num': [0, 1, 2, 3],
'frequency': [7, 1, 3, 1]
})

Solution 1: python list comprehension, numpy median(), numpy round()

import pandas as pd
import numpy as np

def median_frequency(numbers: pd.DataFrame) -> pd.DataFrame:

# create a list of lists to extract each num according to their frequency
num_lists = []
for row in range(numbers.shape[0]):
num_lists.append([numbers["num"].values[row] for i in range(numbers["frequency"].values[row])])

# convert the list of lists to a single list
num_list = [item for sublist in num_lists for item in sublist]

# find the median and round to one decimal point.
median = np.round(np.median(num_list), 1)

# return the dataframe according to the required output
return pd.DataFrame({"median": [median]})

Solution 2: pandas iterrows()

import pandas as pd
import numpy as np

def median_frequency(numbers: pd.DataFrame) -> pd.DataFrame:

# create a list to extract each num according to their frequency
num_list=[]
for index, row in numbers.iterrows():
for i in range((row["frequency"])):
num_list.append(row["num"])

# find the median and round to one decimal point.
median = np.round(np.median(num_list), 1)

# return the dataframe according to the required output
return pd.DataFrame({"median": [median]})

Solution 3: pandas repeat(), pandas to_frame()

import pandas as pd

def median_frequency(numbers: pd.DataFrame) -> pd.DataFrame:

# create a new dataframe with a single column "num"
# extract each num according to their frequency
df = numbers['num'].repeat(numbers['frequency']).to_frame()

# find the median and round to one decimal point
df = df.median().to_frame('median').round(1)

return df

Problem 05: 579 Find the Cumulative Salary of an Employee

Write a solution to calculate the cumulative salary summary for every employee in a single unified table. The cumulative salary summary for an employee can be calculated as follows:

  • For each month that the employee worked, sum up the salaries in that month and the previous two months. This is their 3-month sum for that month. If an employee did not work for the company in previous months, their effective salary for those months is 0.
  • Do not include the 3-month sum for the most recent month that the employee worked for in the summary.
  • Do not include the 3-month sum for any month the employee did not work.

Return the result table ordered by id in ascending order. In case of a tie, order it by month in descending order. The result format is in the following example.

Input: 
Employee table:
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 1 | 1 | 20 |
| 2 | 1 | 20 |
| 1 | 2 | 30 |
| 2 | 2 | 30 |
| 3 | 2 | 40 |
| 1 | 3 | 40 |
| 3 | 3 | 60 |
| 1 | 4 | 60 |
| 3 | 4 | 70 |
| 1 | 7 | 90 |
| 1 | 8 | 90 |
+----+-------+--------+
Each row in the table indicates the salary of an employee
in one month during the year 2020.

Output:
+----+-------+--------+
| id | month | Salary |
+----+-------+--------+
| 1 | 7 | 90 |
| 1 | 4 | 130 |
| 1 | 3 | 90 |
| 1 | 2 | 50 |
| 1 | 1 | 20 |
| 2 | 1 | 20 |
| 3 | 3 | 100 |
| 3 | 2 | 40 |
+----+-------+--------+
Explanation:
Employee '1' has five salary records excluding their most recent month '8':
- 90 for month '7'.
- 60 for month '4'.
- 40 for month '3'.
- 30 for month '2'.
- 20 for month '1'.
So the cumulative salary summary for this employee is:
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 1 | 7 | 90 | (90 + 0 + 0)
| 1 | 4 | 130 | (60 + 40 + 30)
| 1 | 3 | 90 | (40 + 30 + 20)
| 1 | 2 | 50 | (30 + 20 + 0)
| 1 | 1 | 20 | (20 + 0 + 0)
+----+-------+--------+
Note that the 3-month sum for month '7' is 90
because they did not work during month '6' or month '5'.
Employee '2' only has one salary record (month '1')
excluding their most recent month '2'.
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 2 | 1 | 20 | (20 + 0 + 0)
+----+-------+--------+
Employee '3' has two salary records excluding their most recent month '4':
- 60 for month '3'.
- 40 for month '2'.
So the cumulative salary summary for this employee is:
+----+-------+--------+
| id | month | salary |
+----+-------+--------+
| 3 | 3 | 100 | (60 + 40 + 0)
| 3 | 2 | 40 | (40 + 0 + 0)
+----+-------+--------+

Pandas Input:

employee = pd.DataFrame({
'id': [1, 2, 1, 2, 3, 1, 3, 1, 3, 1, 1],
'month': [1, 1, 2, 2, 2, 3, 3, 4, 4, 7, 8],
'salary': [20, 20, 30, 30, 40, 40, 60, 60, 70, 90, 90]
})

Solution: pandas unique(), pandas rolling()

import pandas as pd

def cumulative_salary(employee: pd.DataFrame) -> pd.DataFrame:

# create a dataframe with the unique employee ids
df_id = pd.DataFrame({"id": employee["id"].unique()})

# create a dataframe with all the months of a year
df_month = pd.DataFrame({"month": [i+1 for i in range(12)]})

# cross join employee id with every month
df = df_id.merge(df_month, how="cross")

# include salary for the available months
df = df.merge(employee, how="left")

# calculate 3-month salary
df["Salary"] = df.groupby(["id"])["salary"].rolling(3,
min_periods=1).sum().reset_index(0, drop=True)

# remove the missing values and sort by id and month
df = df.dropna().sort_values(by=["id", "month"],
ascending=[True, False])

# create a new column rank to find the most recent month for each employee
df["rank"] = df.groupby(["id"])["month"].rank(ascending=False)

# remove the most recent month
df = df[df["rank"] != 1]

# keep only the required columns for the output
df = df[["id", "month", "Salary"]]

return df

Solution 2:

import pandas as pd

def cumulative_salary(employee: pd.DataFrame) -> pd.DataFrame:

# find the most recent month's index of each employee id
most_recent = employee.groupby('id')['month'].idxmax()

# remove the most recent month for each employee id
df = employee[~employee.index.isin(most_recent)]

# self-join based on the employee id
df = df.merge(df, on='id')

# for each employee id and month, keep only the most recent 3 months data
df = df.loc[lambda x: (x.month_x-x.month_y).isin([0,1,2]), :]

# calculate month recent 3-month salary
df = df.groupby(['id','month_x'])['salary_y'].sum().reset_index()

# rename the column names according to the required output
# sort the dataframe by employee id and month
df = df.rename(columns={'month_x':'month',
'salary_y':'Salary'}).sort_values(by=['id', 'month'],
ascending=[True, False])

return df

If you learn something new today, please follow my Medium profile, subscribe to my email list for the latest blogs, and click the clap button to support writing. Feel free to discuss your thoughts on these questions in the comment section. Don’t forget to share the blog link with your friends or LinkedIn connections. If you want to connect with me on LinkedIn: my LinkedIn profile. Happy learning.

--

--