Leetcode 30 days of Pandas (Part 2 of 3)

Shahidullah Kawsar
14 min readJan 24, 2024

LeetCode has launched a study plan “30 days of Pandas”, with 33 questions to practice your Pandas skills. I have already discussed problems 1 to 10 here. Today I will discuss problems 11 to 20 and their multiple solutions using pandas.

Problem 11: Leetcode medium 177. Nth Highest Salary

Table: Employee
+-------------+------+
| Column Name | Type |
+-------------+------+
| id | int |
| salary | int |
+-------------+------+
id is the primary key (column with unique values) for this table.
Each row of this table contains information about the salary of an employee.

Write a solution to find the nth highest salary from the Employee table. If there is no nth highest salary, return null. The result format is in the following example.

Example 1:
Input:
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
+----+--------+
n = 2
Output:
+------------------------+
| getNthHighestSalary(2) |
+------------------------+
| 200 |
+------------------------+

Example 2:
Input:
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1 | 100 |
+----+--------+
n = 2
Output:
+------------------------+
| getNthHighestSalary(2) |
+------------------------+
| null |
+------------------------+

Solution: pandas.DataFrame.rank

import pandas as pd

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:

# remove the duplicate rows based on the 'salary' column
df = employee.drop_duplicates(subset='salary')

# create a new column rank based on the 'salary' column
# highest salary will rank as 1
df['rank'] = df['salary'].rank(method='dense', ascending=False)

# filter the N-th rank 'salary'
df = df[df["rank"] == N][['salary']]

# if N-th rank is not present, return None
if df.shape[0] == 0:
return pd.DataFrame({'getNthHighestSalary('+str(N)+')': [None]})

# rename the column names according to the output data structure
df = df.rename(columns={'salary':'getNthHighestSalary('+str(N)+')'})

return df

Problem 12: Leetcode medium 176. Second Highest Salary

Table: Employee
+-------------+------+
| Column Name | Type |
+-------------+------+
| id | int |
| salary | int |
+-------------+------+
id is the primary key (column with unique values) for this table.
Each row of this table contains information about the salary of an employee.

Write a solution to find the second-highest salary from the Employee table. If there is no second-highest salary, return null (return None in Pandas). The result format is in the following example.

Example 1:
Input:
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
+----+--------+
Output:
+---------------------+
| SecondHighestSalary |
+---------------------+
| 200 |
+---------------------+

Example 2:
Input:
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1 | 100 |
+----+--------+
Output:
+---------------------+
| SecondHighestSalary |
+---------------------+
| null |
+---------------------+

Solution 1:

import pandas as pd

def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:

# find the unique salaries in a list
salary_list = employee["salary"].unique()

# if the length if the salary_list is less than or equal 1, return None
if len(salary_list)<=1:
return pd.DataFrame({"SecondHighestSalary": [None]})
else:
# by default 'sorted' is ascending
# second highest salary == second last element in the sorted list
return pd.DataFrame({"SecondHighestSalary": [sorted(salary_list)[-2]]})

Solution 2:

import pandas as pd

def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:

# drop any duplicate salaries
employee = employee.drop_duplicates(["salary"])

# if there are less than two unique salaries, return 'np.NaN'.
if len(employee["salary"].unique()) < 2:
return pd.DataFrame({"SecondHighestSalary": [np.NaN]})

# sort the table by "salary" in descending order.
employee = employee.sort_values("salary", ascending=False)

# drop the "id" column.
employee.drop("id", axis=1, inplace=True)

# rename the "salary" column.
employee.rename({"salary": "SecondHighestSalary"}, axis=1, inplace=True)

# return the second highest salary.
return employee.head(2).tail(1)

Problem 13: Leetcode medium 184. Department Highest Salary

Table: Employee
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| id | int |
| name | varchar |
| salary | int |
| departmentId | int |
+--------------+---------+
id is the primary key (column with unique values) for this table.
departmentId is a foreign key (reference columns) of the ID from the Department table.
Each row of this table indicates the ID, name, and salary of an employee. It also contains the ID of their department.

Table: Department
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| id | int |
| name | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table. It is guaranteed that department name is not NULL.
Each row of this table indicates the ID of a department and its name.

Write a solution to find employees with the highest salary in each department. Return the result table in any order. The result format is in the following example.

Input: 
Employee table:
+----+-------+--------+--------------+
| id | name | salary | departmentId |
+----+-------+--------+--------------+
| 1 | Joe | 70000 | 1 |
| 2 | Jim | 90000 | 1 |
| 3 | Henry | 80000 | 2 |
| 4 | Sam | 60000 | 2 |
| 5 | Max | 90000 | 1 |
+----+-------+--------+--------------+
Department table:
+----+-------+
| id | name |
+----+-------+
| 1 | IT |
| 2 | Sales |
+----+-------+
Output:
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT | Jim | 90000 |
| Sales | Henry | 80000 |
| IT | Max | 90000 |
+------------+----------+--------+
Explanation: Max and Jim both have the highest salary in the IT department and Henry has the highest salary in the Sales department.

Solution 1: pandas.DataFrame.merge, pandas.DataFrame.transform

import pandas as pd

def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:

# create a new column with the max salary for each department
employee["max_salary"] = employee.groupby("departmentId")["salary"].transform('max')

# filter the rows with department-wise highest salary
employee = employee[employee["salary"] == employee["max_salary"]]

# left join employee with department
employee = employee.merge(department,
how="left",
left_on="departmentId",
right_on="id")

# rename the column names
employee = employee.rename(columns={"name_x": "Employee",
"name_y": "Department",
"salary": "Salary"})

return employee[["Department", "Employee", "Salary"]]

Solution 2:

import pandas as pd

def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:

# merge tables and rename
df = employee.merge(department, left_on='departmentId', right_on='id', how='left')

# rename the column names
df.rename(columns={'name_x': 'Employee', 'name_y': 'Department', 'salary': 'Salary'}, inplace=True)

# select employees whose salary is equal to the department highest salary
max_salary = df.groupby('Department')['Salary'].transform('max')
df = df[df['Salary'] == max_salary]

return df[['Department', 'Employee', 'Salary']]

Problem 14: Leetcode medium 178. Rank Scores

Table: Scores
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| id | int |
| score | decimal |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains the score of a game. Score is a floating point value with two decimal places.

Write a solution to find the rank of the scores. The ranking should be calculated according to the following rules:

  • The scores should be ranked from the highest to the lowest.
  • If there is a tie between two scores, both should have the same ranking.
  • After a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no holes between ranks.

Return the result table ordered by score in descending order. The result format is in the following example.

Input: 
Scores table:
+----+-------+
| id | score |
+----+-------+
| 1 | 3.50 |
| 2 | 3.65 |
| 3 | 4.00 |
| 4 | 3.85 |
| 5 | 4.00 |
| 6 | 3.65 |
+----+-------+
Output:
+-------+------+
| score | rank |
+-------+------+
| 4.00 | 1 |
| 4.00 | 1 |
| 3.85 | 2 |
| 3.65 | 3 |
| 3.65 | 3 |
| 3.50 | 4 |
+-------+------+

Solution: pandas.DataFrame.rank

import pandas as pd

def order_scores(scores: pd.DataFrame) -> pd.DataFrame:

scores['rank'] = scores['score'].rank(method = 'dense', ascending = False)

return scores[['score','rank']].sort_values(by= 'score', ascending = False)

Problem 15: Leetcode Easy 196. Delete Duplicate Emails

Table: Person
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| id | int |
| email | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id. For Pandas users, please note that you are supposed to modify Person in place. After running your script, the answer shown is the Person table. The driver will first compile and run your code and then show the Person table. The final order of the Person table does not matter. The result format is in the following example.

Input: 
Person table:
+----+------------------+
| id | email |
+----+------------------+
| 1 | john@example.com |
| 2 | bob@example.com |
| 3 | john@example.com |
+----+------------------+
Output:
+----+------------------+
| id | email |
+----+------------------+
| 1 | john@example.com |
| 2 | bob@example.com |
+----+------------------+
Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.

Solution 1: pandas.DataFrame.drop_duplicates, pandas.DataFrame.sort_values

import pandas as pd

# Modify Person in place
def delete_duplicate_emails(person: pd.DataFrame) -> None:

# sort the dataframe based on the id column in ascending order
person.sort_values(by="id", ascending=True, inplace=True)

# remove the duplicate emails and keep the first email with smallest id
person.drop_duplicates(subset=["email"], keep="first", inplace=True)

Solution 2: pandas.DataFrame.groupby, pandas.DataFrame.drop

import pandas as pd

def delete_duplicate_emails(person: pd.DataFrame) -> None:

# for each email find the smallest id
min_id = person.groupby('email')['id'].transform('min')

# separate the rows except smallest id
removed_person = person[person['id'] != min_id]

# remove the rows except smallest id
person.drop(removed_person.index, inplace=True)

Problem 16: Leetcode Easy 1795. Rearrange Products Table

Table: Products
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| product_id | int |
| store1 | int |
| store2 | int |
| store3 | int |
+-------------+---------+
product_id is the primary key (column with unique values) for this table.
Each row in this table indicates the product's price in 3 different stores: store1, store2, and store3.
If the product is not available in a store, the price will be null in that store's column.

Write a solution to rearrange the Products table so that each row has (product_id, store, price). If a product is not available in a store, do not include a row with that product_id and store combination in the result table. Return the result table in any order. The result format is in the following example.

Input: 
Products table:
+------------+--------+--------+--------+
| product_id | store1 | store2 | store3 |
+------------+--------+--------+--------+
| 0 | 95 | 100 | 105 |
| 1 | 70 | null | 80 |
+------------+--------+--------+--------+
Output:
+------------+--------+-------+
| product_id | store | price |
+------------+--------+-------+
| 0 | store1 | 95 |
| 0 | store2 | 100 |
| 0 | store3 | 105 |
| 1 | store1 | 70 |
| 1 | store3 | 80 |
+------------+--------+-------+
Explanation:
Product 0 is available in all three stores with prices 95, 100, and 105 respectively.
Product 1 is available in store1 with price 70 and store3 with price 80. The product is not available in store2.

Solution: pandas.melt

import pandas as pd

def rearrange_products_table(products: pd.DataFrame) -> pd.DataFrame:

return pd.melt(products,
id_vars=["product_id"],
value_vars=["store1", "store2", "store3"],
var_name='store',
value_name='price').dropna()

Problem 17: Leetcode Easy 2082 The Number of Rich Customers

Table: Store
+-------------+------+
| Column Name | Type |
+-------------+------+
| bill_id | int |
| customer_id | int |
| amount | int |
+-------------+------+
bill_id is the primary key (column with unique values) for this table.
Each row contains information about the amount of one bill and the customer associated with it.

Write a solution to report the number of customers who had at least one bill with an amount strictly greater than 500. The result format is in the following example.

Input: 
Store table:
+---------+-------------+--------+
| bill_id | customer_id | amount |
+---------+-------------+--------+
| 6 | 1 | 549 |
| 8 | 1 | 834 |
| 4 | 2 | 394 |
| 11 | 3 | 657 |
| 13 | 3 | 257 |
+---------+-------------+--------+
Output:
+------------+
| rich_count |
+------------+
| 2 |
+------------+
Explanation:
Customer 1 has two bills with amounts strictly greater than 500.
Customer 2 does not have any bills with an amount strictly greater than 500.
Customer 3 has one bill with an amount strictly greater than 500.

Solution 1: pandas.unique

import pandas as pd

def count_rich_customers(store: pd.DataFrame) -> pd.DataFrame:

store = store[store["amount"]>500]

return pd.DataFrame({"rich_count": [len(store["customer_id"].unique())]})

Solution 2: pandas.DataFrame.nunique

import pandas as pd

def count_rich_customers(store: pd.DataFrame) -> pd.DataFrame:

rich = store[store['amount'] > 500]['customer_id'].nunique()

return pd.DataFrame({'rich_count': [rich]})

Problem 18: Leetcode Easy 1173. Immediate Food Delivery I

Table: Delivery
+-----------------------------+---------+
| Column Name | Type |
+-----------------------------+---------+
| delivery_id | int |
| customer_id | int |
| order_date | date |
| customer_pref_delivery_date | date |
+-----------------------------+---------+
delivery_id is the primary key (column with unique values) of this table.
The table holds information about food delivery to customers that make orders at some date and specify a preferred delivery date (on the same order date or after it).

If the customer’s preferred delivery date is the same as the order date, then the order is called immediate; otherwise, it is called scheduled. Write a solution to find the percentage of immediate orders in the table, rounded to 2 decimal places. The result format is in the following example.

Input: 
Delivery table:
+-------------+-------------+------------+-----------------------------+
| delivery_id | customer_id | order_date | customer_pref_delivery_date |
+-------------+-------------+------------+-----------------------------+
| 1 | 1 | 2019-08-01 | 2019-08-02 |
| 2 | 5 | 2019-08-02 | 2019-08-02 |
| 3 | 1 | 2019-08-11 | 2019-08-11 |
| 4 | 3 | 2019-08-24 | 2019-08-26 |
| 5 | 4 | 2019-08-21 | 2019-08-22 |
| 6 | 2 | 2019-08-11 | 2019-08-13 |
+-------------+-------------+------------+-----------------------------+
Output:
+----------------------+
| immediate_percentage |
+----------------------+
| 33.33 |
+----------------------+
Explanation: The orders with delivery id 2 and 3 are immediate while the others are scheduled.

Solution 1: numpy.where

import pandas as pd

def food_delivery(delivery: pd.DataFrame) -> pd.DataFrame:

# find the immediate orders
delivery["immediate"] = np.where(delivery["order_date"] == delivery["customer_pref_delivery_date"], 1, 0)

return pd.DataFrame({"immediate_percentage": [np.round((delivery["immediate"].mean()*100), 2)]})

Solution 2:

import pandas as pd

def food_delivery(delivery: pd.DataFrame) -> pd.DataFrame:
is_immediate = delivery['order_date'] == delivery['customer_pref_delivery_date']

# Count the number of immediate orders and the number of all orders.
valid_count = is_immediate.sum()
total_count = len(delivery)

# Round the percentage to 2 decimal places.
percentage = round(100 * valid_count / total_count, 2)

df = pd.DataFrame({'immediate_percentage': [percentage]})
return df

Problem 19: Leetcode Medium 1907. Count Salary Categories

Table: Accounts
+-------------+------+
| Column Name | Type |
+-------------+------+
| account_id | int |
| income | int |
+-------------+------+
account_id is the primary key (column with unique values) for this table.
Each row contains information about the monthly income for one bank account.

Write a solution to calculate the number of bank accounts for each salary category. The salary categories are:

  • "Low Salary": All the salaries are strictly less than $20000.
  • "Average Salary": All the salaries in the inclusive range [$20000, $50000].
  • "High Salary": All the salaries strictly greater than $50000.

The result table must contain all three categories. If there are no accounts in a category, return 0. Return the result table in any order. The result format is in the following example.

Input: 
Accounts table:
+------------+--------+
| account_id | income |
+------------+--------+
| 3 | 108939 |
| 2 | 12747 |
| 8 | 87709 |
| 6 | 91796 |
+------------+--------+
Output:
+----------------+----------------+
| category | accounts_count |
+----------------+----------------+
| Low Salary | 1 |
| Average Salary | 0 |
| High Salary | 3 |
+----------------+----------------+
Explanation:
Low Salary: Account 2.
Average Salary: No accounts.
High Salary: Accounts 3, 6, and 8.

Solution 1:

import pandas as pd
import numpy as np

def count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:

df = pd.DataFrame()

# create categories
df["category"] = ["Low Salary", "Average Salary", "High Salary"]

# count number of accounts in each category
df["accounts_count"] = [sum(accounts["income"]<20000),
sum((accounts["income"]>=20000) & (accounts["income"]<=50000)),
sum(accounts["income"]>50000)]

return df

Solution 2:

import pandas as pd

def count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:

low_count = (accounts['income'] < 20000).sum()
average_count = ((accounts['income'] >= 20000) & (accounts['income'] <= 50000)).sum()
high_count = (accounts['income'] > 50000).sum()

ans = pd.DataFrame({
'category': ['Low Salary', 'Average Salary', 'High Salary'],
'accounts_count': [low_count, average_count, high_count]
})

return ans

Problem 20: Leetcode Easy 1322 Ads Performance

Table: Ads
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| ad_id | int |
| user_id | int |
| action | enum |
+---------------+---------+
(ad_id, user_id) is the primary key (combination of columns with unique values) for this table.
Each row of this table contains the ID of an Ad, the ID of a user, and the action taken by this user regarding this Ad.
The action column is an ENUM (category) type of ('Clicked', 'Viewed', 'Ignored').

A company is running Ads and wants to calculate the performance of each Ad. Performance of the Ad is measured using Click-Through Rate (CTR) where:

Write a solution to find the ctr of each Ad. Round ctr to two decimal points. Return the result table ordered by ctr in descending order and ad_id in ascending order in case of a tie. The result format is in the following example.

Input: 
Ads table:
+-------+---------+---------+
| ad_id | user_id | action |
+-------+---------+---------+
| 1 | 1 | Clicked |
| 2 | 2 | Clicked |
| 3 | 3 | Viewed |
| 5 | 5 | Ignored |
| 1 | 7 | Ignored |
| 2 | 7 | Viewed |
| 3 | 5 | Clicked |
| 1 | 4 | Viewed |
| 2 | 11 | Viewed |
| 1 | 2 | Clicked |
+-------+---------+---------+
Output:
+-------+-------+
| ad_id | ctr |
+-------+-------+
| 1 | 66.67 |
| 3 | 50.00 |
| 2 | 33.33 |
| 5 | 0.00 |
+-------+-------+
Explanation:
for ad_id = 1, ctr = (2/(2+1)) * 100 = 66.67
for ad_id = 2, ctr = (1/(1+2)) * 100 = 33.33
for ad_id = 3, ctr = (1/(1+1)) * 100 = 50.00
for ad_id = 5, ctr = 0.00, Note that ad_id = 5 has no clicks or views.
Note that we do not care about Ignored Ads.

Solution:

import pandas as pd

def ads_performance(ads: pd.DataFrame) -> pd.DataFrame:
# Group by 'ad_id' and calculate the CTR for each group
ctr = ads.groupby('ad_id')['action'].apply(
lambda x: round(
(sum(x == 'Clicked') / (sum(x == 'Clicked') + sum(x == 'Viewed')) * 100) if (sum(x == 'Clicked') + sum(x == 'Viewed')) > 0 else 0.00,
2
)
).reset_index()

# Rename the column to 'ctr'
ctr.columns = ['ad_id', 'ctr']

# Sort the results by 'ctr' in descending order and by 'ad_id' in ascending order
result = ctr.sort_values(by=['ctr', 'ad_id'], ascending=[False, True])

return result
Source: Pandas AI: Data Analysis With Artificial Intelligence

--

--