6 MONTH ROADMAP FOR GETTING A JOB
6 month roadmap for preparing 7 key skills to get that dream job
There are 7 Key Skills :
SQL
DSA
SYSTEM DESIGN
CASING
MARKET SIZING
BEHAVIORAL QUESTIONS
DATA DESIGN
SQL :
There are 2 types of SQL interview questions :
Theoretical questions
Practical solving SQL queries
Practically solving SQL queries is still asked in interviews like Amazon Business Intelligence interviews or any Data Engineering interviews. This requires solving coding questions on a shared screen in a limited time and running the queries. It also involves ensuring all the test cases run successfully. With the competition so high, it is a given that you solve all the questions successfully. Else, most times you won’t be able to clear the interview. For instance, in 1 hour interview, I was asked questions on multiple joins, rank, window functions , where , having , order by clauses.
Before sharing concept specific questions, also sharing some of the resources where you can practice these questions :
Advanced SQL 50 : Needs premium
Most asked SQL questions : Link
Would ideally recommend using the above resources. Sharing the most asked concepts and questions which you can use to practice below. Note : I did use AI to generate some of the questions. I did curate the list of questions but I want to apologize in advance if there are any inaccuracies.
1. Window Functions
Window functions allow calculations across rows relative to the current row, without grouping rows.
Examples:
ROW_NUMBER()
,RANK()
,DENSE_RANK()
,LAG()
,LEAD()
Here are clear and structured SQL practice questions that you can directly use to practice each window function (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD()). These questions include realistic table and column names and are designed similarly to medium-level LeetCode or HackerRank challenges:
ROW_NUMBER() Questions:
Question 1: Top 3 Employees per Department
Given a table Employees
with columns:
EmployeeID
EmployeeName
Salary
DepartmentID
Write a query to find the top 3 highest-paid employees in each department.
Output columns: DepartmentID
, EmployeeName
, Salary
, and EmployeeRank
.
Question 2: Latest Order per Customer
Given a table Orders
with columns:
OrderID
CustomerID
OrderDate
Amount
Write a query to find the latest order (by date) for each customer.
Output columns: CustomerID
, OrderID
, OrderDate
, Amount
.
RANK() Questions:
Question 1: Rank Students by Exam Scores
Given a table Students
with columns:
StudentID
StudentName
ExamScore
Write a query to rank students based on their exam scores. Students with equal scores should have the same rank, and ranks should skip numbers after ties (1,1,3...).
Output columns: StudentID
, StudentName
, ExamScore
, Rank
.
Question 2: Product Sales Rank
Given a table Products
with columns:
ProductID
ProductName
Category
TotalSales
Write a query to rank products within each category based on their total sales, handling ties appropriately.
Output columns: Category
, ProductName
, TotalSales
, SalesRank
.
DENSE_RANK() Questions:
Question 1: Movie Ratings
Given a table Movies
with columns:
MovieID
MovieName
Rating
Write a query to assign ranks to movies based on their ratings. Movies with identical ratings must share the same rank without gaps in the ranking (1,1,2...).
Output columns: MovieID
, MovieName
, Rating
, Rank
.
Question 2: Customer Spending Rank
Given a table Transactions
with columns:
TransactionID
CustomerID
SpendAmount
Write a query to rank customers based on total spending, ensuring customers with equal spend have the same rank (dense ranking).
Output columns: CustomerID
, TotalSpend
, Rank
.
LAG() Questions:
Question 1: Monthly Revenue Comparison
Given a table MonthlyRevenue
with columns:
Month
(e.g., '2024-01', '2024-02', ...)Revenue
Write a query to show each month’s revenue alongside the previous month's revenue.
Output columns: Month
, Revenue
, PreviousMonthRevenue
.
Question 2: Employee Salary Changes
Given a table EmployeeSalaries
with columns:
EmployeeID
Salary
EffectiveDate
Write a query to list each employee’s salary and the salary they had previously (chronologically).
Output columns: EmployeeID
, EffectiveDate
, Salary
, PreviousSalary
.
LEAD() Questions:
Question 1: Next Flight Schedule
Given a table Flights
with columns:
FlightID
Airline
DepartureTime
Write a query to list each flight along with the next flight’s departure time for the same airline.
Output columns: FlightID
, Airline
, DepartureTime
, NextFlightDepartureTime
.
Question 2: Stock Price Next-Day Comparison
Given a table Stocks
with columns:
StockID
StockDate
ClosingPrice
Write a query to show the closing price of each stock alongside the next day's closing price.
Output columns: StockID
, StockDate
, ClosingPrice
, NextDayClosingPrice
.
Use these structured examples to practice each function, allowing you to master complex SQL concepts effectively.
2. Partitioning (PARTITION BY
)
Partitioning splits the data into subsets for analysis within window functions.
Question 1: Rank Employees by Salary Within Departments
Table: Employees
Columns: EmployeeID
, DepartmentID
, EmployeeName
, Salary
Task:
Assign a rank to each employee within their department based on their salary, with the highest paid ranked first.
Expected Output:DepartmentID
, EmployeeName
, Salary
, RankInDepartment
Question 2 : Calculate Average Order Value per Customer Over Time
Table: Orders
Columns: OrderID
, CustomerID
, OrderDate
, Amount
Task:
Calculate a rolling average order amount per customer, ordered by date.
Expected Output:CustomerID
, OrderDate
, Amount
, AverageToDate
Question 3: Find Customers with Increasing Spend Over Time
Table: Transactions
Columns: TransactionID
, CustomerID
, TransactionDate
, Amount
Task:
For each customer, compare the amount spent in each transaction to their previous transaction. Show only rows where spending increased.
Expected Output:CustomerID
, TransactionDate
, Amount
, PreviousAmount
, Increased
(Yes/No)
Question 4 : Flag First Purchase per Customer
Table: Purchases
Columns: PurchaseID
, CustomerID
, PurchaseDate
, Item
Task:
For the purchases table, flag the first purchase made by each customer chronologically.
Expected Output:CustomerID
, PurchaseDate
, Item
, IsFirstPurchase
(Yes/No)
3. Common Table Expressions (CTE)
A CTE is a temporary named query used within another query.
Question 1: Top 3 Customers per Month
Tables: Orders(OrderID, CustomerID, OrderDate, Amount)
Create a CTE to calculate total monthly spend per customer. Then use ROW_NUMBER()
in the CTE to return the top 3 spending customers per month
Question 2 : Sales Over Threshold
Table: Sales(SaleID, SaleDate, Region, Amount)
Create a CTE to calculate average daily sales per region, and return the days where the amount exceeded 150% of the region's daily average.
Question 3: Second Highest Transaction per Customer
Table: Transactions(TransactionID, CustomerID, Amount)
Use a CTE and ROW_NUMBER()
to return each customer’s second highest transaction amount.
Question 4 : Filtering with CTEs
Table: Employees(EmployeeID, DepartmentID, Salary)
Create a CTE to get employees with above-average salaries per department. Then return only those earning more than twice the average in their department.
4. Joins
Joins combine rows from two or more tables based on related columns.
Types:
INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN
Question 1: Customers Without Orders
Tables: Customers(CustomerID, Name)
, Orders(OrderID, CustomerID, OrderDate)
Use a LEFT JOIN
to find customers who never placed an order.
Question 2: Employee and Manager Names
Table: Employees(EmployeeID, Name, ManagerID)
Join the Employees
table to itself to get each employee along with their manager’s name.
Question 3: Products Not Sold
Tables: Products(ProductID, Name)
, OrderDetails(OrderID, ProductID)
Find all products that were never sold using a LEFT JOIN
and IS NULL
.
Question 4: Highest Sale per Region
Tables: Sales(SaleID, Region, Amount)
, Regions(RegionID, RegionName)
Join both tables and return the highest sale amount per region.
Question 5: Total Revenue per Category
Tables: Products(ProductID, CategoryID)
, Categories(CategoryID, Name)
, Sales(ProductID, Quantity, Price)
Join and calculate total revenue per product category.
Question 6: Students and Courses Enrolled
Tables: Students(StudentID, Name)
, Enrollments(StudentID, CourseID)
, Courses(CourseID, Title)
Return each student’s name and the list of courses they are enrolled in.
Question 7: Orders with Customer & Product Info
Tables: Orders(OrderID, CustomerID, OrderDate)
, OrderItems(OrderID, ProductID, Quantity)
, Products(ProductID, Name, Price)
Join all tables to get complete details for each order including customer and product names and total cost.
5. Subqueries
Subqueries are nested queries within another SQL statement.
Types :
IN, EXISTS
Question 1: Customers With Above-Average Spend
Table: Orders(CustomerID, Amount)
Find all customers whose total spend is above the average total spend of all customers.
Question 2: Second Highest Salary
Table: Employees(EmployeeID, Salary)
Write a query to return the second highest distinct salary using a subquery.
Question 3: Products with No Orders
Tables: Products(ProductID)
, OrderItems(ProductID)
Use a subquery to return products that do not appear in any orders.
Question 4: Highest Paid Employee per Department
Table: Employees(EmployeeID, DepartmentID, Salary)
Use a correlated subquery to return employees earning the highest salary within their department.
Question 5: Customers With More Than One Order
Table: Orders(CustomerID)
Use a subquery to return customers who placed more than one order.
Question 6: Products More Expensive Than Average
Table: Products(ProductID, Price)
Return all products that cost more than the average price of all products.
Question 7: Employees With Salaries in Top 10%
Table: Employees(EmployeeID, Salary)
Use a subquery to find employees whose salary is in the top 10% of all salaries.
Question 8: Customers Who Ordered All Products
Tables: Customers
, Orders
, Products
Return customers who have ordered every product at least once. Requires NOT EXISTS
and set logic.
Question 9: Cities With No Orders
Tables: Customers(City)
, Orders(CustomerID)
Find cities where no customers have placed an order.
Question 10: Departments With Only One Employee
Table: Employees(EmployeeID, DepartmentID)
Use a subquery to return departments that have only one employee.
6. Aggregations and Grouping
Aggregation groups data to calculate metrics such as sum, average, min, max, count.
Question 1: Total Revenue Per Product
Table: OrderItems(ProductID, Quantity, Price)
Group by ProductID
and calculate total revenue as Quantity * Price
.
Question 2: Average Order Value per Customer
Table: Orders(CustomerID, OrderID, Amount)
Group by CustomerID
and return their average order value.
Question 3: Number of Orders per Month
Table: Orders(OrderDate)
Group orders by month and count them.
Question 4: Highest-Spending Customer
Table: Orders(CustomerID, Amount)
Use GROUP BY
and ORDER BY
to find the customer who spent the most.
Question 5: Product Count per Category
Table: Products(ProductID, CategoryID)
Group by CategoryID
and count the number of products in each.
Question 6: Orders Per Customer Per Year
Table: Orders(CustomerID, OrderDate)
Group by CustomerID
and YEAR(OrderDate)
to get order counts per year.
Question 7: Daily Average Sales by Store
Table: Sales(StoreID, SaleDate, Amount)
Group by StoreID
and SaleDate
and get average daily sales.
Question 8: Most Popular Product Each Month
Tables: Orders(OrderID, OrderDate)
, OrderItems(OrderID, ProductID)
Group by month and ProductID
, and return the product with the highest order count per month.
Question 9: Customers with Highest Order Frequency
Table: Orders(CustomerID, OrderDate)
Group by CustomerID
and count the number of orders. Return the top 5.
Question 10: Categories with No Sales
Tables: Products(ProductID, CategoryID)
, OrderItems(ProductID)
Group products and orders to identify categories that had zero sales.
DSA
DSA questions can be boiled down to 16 patterns! Solve and master those patterns and the problems become easier to solve. What are those patterns.
DSA Patterns :
1. Two Pointers
2. Fast & Slow pointers
3. Sliding Window
4. Merge Intervals
5. Cyclic Sort
6. In-place Reversal of a LinkedList
7. Stack
8. Monotonic Stack
9. Hash Maps
10. Tree Breadth First Search
11. Tree Depth First Search
12. Graphs
13. Island (Matrix traversal)
14. Two Heaps
15. Subsets
16. Modified Binary Search
17. Bitwise XOR
18. Top 'K' Elements
19. K-way merge
20. Greedy Algorithms
21. 0/1 Knapsack (Dynamic Programming)
22. Backtracking
23. Trie
24. Topological Sort (Graph)
25. Union Find
26. Ordered Set
27. Multi-thread
28. Miscellaneous
14 WEEK ROADMAP : Problems have been picked from this link
WEEK 1 : Two Pointers and Fast & Slow pointers
Introduction
Pair with Target Sum (easy) LeetCode
Remove Duplicates (easy) LeetCode LeetCode LeetCode LeetCode LeetCode
Squaring a Sorted Array (easy) LeetCode
Triplet Sum to Zero (medium) LeetCode
Triplet Sum Close to Target (medium) LeetCode
Triplets with Smaller Sum (medium) LintCode
Subarrays with Product Less than a Target (medium) LeetCode
Dutch National Flag Problem (medium) CoderByte
Problem Challenge 1: Quadruple Sum to Target (medium) Leetcode
Problem Challenge 2: Comparing Strings containing Backspaces (medium) Leetcode
Problem Challenge 3: Minimum Window Sort (medium) Leetcode Ideserve
Introduction emre.me
LinkedList Cycle (easy) Leetcode
Start of LinkedList Cycle (medium) Leetcode
Happy Number (medium) Leetcode
Middle of the LinkedList (easy) Leetcode
Problem Challenge 1: Palindrome LinkedList (medium) Leetcode
Problem Challenge 2: Rearrange a LinkedList (medium) Leetcode
Problem Challenge 3: Cycle in a Circular Array (hard) Leetcode
WEEK 2: Sliding Window and Merge Intervals
Introduction
Maximum Sum Subarray of Size K (easy)
Smallest Subarray with a given sum (easy) Educative.io
Longest Substring with K Distinct Characters (medium) Educative.io
Fruits into Baskets (medium) LeetCode
No-repeat Substring (hard) LeetCode
Longest Substring with Same Letters after Replacement (hard) LeetCode
Longest Subarray with Ones after Replacement (hard) LeetCode
Problem Challenge 1: Permutation in a String (hard) Leetcode
Problem Challenge 2: String Anagrams (hard) Leetcode
Problem Challenge 3: Smallest Window containing Substring (hard) Leetcode
Problem Challenge 4: Words Concatenation (hard) Leetcode
Introduction Educative.io
Merge Intervals (medium) Educative.io
Insert Interval (medium) Educative.io
Intervals Intersection (medium) Educative.io
Conflicting Appointments (medium) Geeksforgeeks
Problem Challenge 1: Minimum Meeting Rooms (hard) Lintcode
Problem Challenge 2: Maximum CPU Load (hard) Geeksforgeeks
Problem Challenge 3: Employee Free Time (hard) CoderTrain
WEEK 3: Cyclic Sort and In-place reversal of Linked List
Introduction emre.me
Cyclic Sort (easy) Geeksforgeeks
Find the Missing Number (easy) Leetcode
Find all Missing Numbers (easy) Leetcode
Find the Duplicate Number (easy) Leetcode
Find all Duplicate Numbers (easy) Leetcode
Problem Challenge 1: Find the Corrupt Pair (easy) TheCodingSimplified
Problem Challenge 2: Find the Smallest Missing Positive Number (medium) Leetcode
Problem Challenge 3: Find the First K Missing Positive Numbers (hard) TheCodingSimplified
Introduction emre.me
Reverse a LinkedList (easy) Leetcode
Reverse a Sub-list (medium) Leetcode
Reverse every K-element Sub-list (medium) Leetcode
Problem Challenge 1: Reverse alternating K-element Sub-list (medium) Geeksforgeeks
Problem Challenge 2: Rotate a LinkedList (medium) Leetcode
WEEK 4: Stack and Monotonic Stack
Introduction to Stack (Operations, Implementation, Applications)
Balanced Parentheses Leetcode
Reverse a String
Decimal to Binary Conversion
Next Greater Element Leetcode - I Leetcode -II Leetcode - III (Hard)
Sorting a Stack
Simplify Path Leetcode
Introduction to Monotonic Stack
Next Greater Element (easy) Leetcode - I Leetcode -II Leetcode - III (Hard)
Daily Temperatures (easy) Leetcode
Remove Nodes From Linked List (easy) Leetcode
Remove All Adjacent Duplicates In String (easy) Leetcode
Remove All Adjacent Duplicates in String II (medium) Leetcode
Remove K Digits (hard) Leetcode
WEEK 5: Hash Maps and Tree : BFS
Introduction (Hashing, Hash Tables, Issues)
First Non-repeating Character (easy) Leetcode
Largest Unique Number (easy) Leetcode+
Maximum Number of Balloons (easy) Leetcode
Longest Palindrome(easy) Leetcode
Ransom Note (easy) Leetcode
Binary Tree Level Order Traversal (easy) Leetcode
Reverse Level Order Traversal (easy) Leetcode
Zigzag Traversal (medium) Leetcode
Level Averages in a Binary Tree (easy) Leetcode
Minimum Depth of a Binary Tree (easy) Leetcode
Maximum Depth of a Binary Tree (easy) Leetcode
Level Order Successor (easy) Geeksforgeeks
Connect Level Order Siblings (medium) Leetcode
Problem Challenge 1: Connect All Level Order Siblings (medium) Educative
Problem Challenge 2: Right View of a Binary Tree (easy) Leetcode
WEEK 6: Tree : DFS and Graph
Introduction
Binary Tree Path Sum (easy) Leetcode
All Paths for a Sum (medium) Leetcode
Sum of Path Numbers (medium) Leetcode
Path With Given Sequence (medium) Geeksforgeeks
Count Paths for a Sum (medium) Leetcode
Problem Challenge 1: Tree Diameter (medium) Leetcode
Problem Challenge 2: Path with Maximum Sum (hard) Leetcode
Introduction to Graph (Representations, Abstract Data Type (ADT))
Graph Traversal: Depth First Search(DFS)
Graph Traversal: Breadth First Search (BFS)
Find if Path Exists in Graph(easy) Leetcode
Number of Provinces (medium) Leetcode
Minimum Number of Vertices to Reach All Nodes(medium) Leetcode
WEEK 7: Island and Two Heaps
Introduction to Island Pattern
Number of Islands (easy) Leetcode
Biggest Island (easy)
Flood Fill (easy) Leetcode
Number of Closed Islands (easy) Leetcode
Find the Median of a Number Stream (medium) Leetcode
Sliding Window Median (hard) Leetcode
Maximize Capital (hard) Leetcode
*Maximum Sum Combinations (medium) InterviewBit
WEEK 8: Subsets and Modified Binary Search
Introduction Educative.io
Subsets (easy) Educative.io
Subsets With Duplicates (easy) Educative.io
Permutations (medium) Educative.io
Introduction Complete Pattern Theory and Solutions
Order-agnostic Binary Search (easy) Geeksforgeeks
Ceiling of a Number (medium) Geeksforgeeks-Ceil Geeksforgeeks-Floor
Next Letter (medium) Leetcode
Number Range (medium) Leetcode
Search in a Sorted Infinite Array (medium) Leetcode
Minimum Difference Element (medium): Find the floor & ceil take the difference, minimum would be the ans
Bitonic Array Maximum (easy) Geeksforgeeks
Problem Challenge 1: Search Bitonic Array (medium) Leetcode
Problem Challenge 2: Search in Rotated Array (medium) Leetcode
Problem Challenge 3: Rotation Count (medium) Geeksforgeeks
*Search a 2D Matrix (medium) Leetcode
*Minimum Number of Days to Make m Bouquets (medium) Leetcode
*Koko Eating Bananas (medium) Leetcode
*Capacity To Ship Packages Within D Days (medium) Leetcode
*Median of Two Sorted Arrays (hard) Leetcode
WEEK 9: Bitwise XOR and Top K Elements
Single Number (easy)
Two Single Numbers (medium)
Complement of Base 10 Number (medium)
Problem Challenge 1: Flip and Invert an Image (hard)
Top 'K' Numbers (easy) Solution
Kth Smallest Number (easy)
'K' Closest Points to the Origin (easy) Leetcode
Connect Ropes (easy)
Top 'K' Frequent Numbers (medium)
Frequency Sort (medium)
Kth Largest Number in a Stream (medium) Leetcode
WEEK 10: K-way merge and Greedy Sort
Merge K Sorted Lists (medium) Leetcode
Kth Smallest Number in M Sorted Lists (Medium) Geeksforgeeks
Kth Smallest Number in a Sorted Matrix (Hard) Educative.io
Smallest Number Range (Hard) Leetcode
Valid Palindrome II (easy) Leetcode
Maximum Length of Pair Chain (medium) Leetcode
Minimum Add to Make Parentheses Valid (medium) Leetcode
Remove Duplicate Letters (medium) Leetcode
Largest Palindromic Number (Medium) Leetcode
Removing Minimum and Maximum From Array (medium) Leetcode
WEEK 11: 0/1 Knapsack and BackTracking
0/1 Knapsack (medium) Geeksforgeeks
Equal Subset Sum Partition (medium) Leetcode
Subset Sum (medium) Geeksforgeeks
Minimum Subset Sum Difference (hard) Geeksforgeeks
Combination Sum (medium) Leetcode - I Leetcode - II Leetcode - III Leetcode - IV
Word Search (medium) Leetcode - I Leetcode - II (Hard)
Sudoku Solver (hard) Leetcode
Factor Combinations (medium) Leetcode+
Split a String Into the Max Number of Unique Substrings (medium) Leetcode
WEEK 12: Trie and Topological Sort
Implement Trie (Prefix Tree) (medium) Leetcode
Index Pairs of a String (easy) Leetcode+
Design Add and Search Words Data Structure (medium) Leetcode
Extra Characters in a String (medium) Leetcode
Search Suggestions System (medium) Leetcode
Topological Sort (medium) Youtube
Tasks Scheduling (medium) Leetcode-Similar
Tasks Scheduling Order (medium) Leetcode-Similar
All Tasks Scheduling Orders (hard) Leetcode-Similar
Alien Dictionary (hard) Leetcode
Problem Challenge 1: Reconstructing a Sequence (hard) Leetcode
Problem Challenge 2: Minimum Height Trees (hard) Leetcode
WEEK 13: Union Find , Ordered Set and Multi-thread
Redundant Connection (medium) Leetcode - I Leetcode - II (Hard)
Number of Provinces (medium) Leetcode
Is Graph Bipartite? (medium) Leetcode
Path With Minimum Effort (medium) Leetcode
Merge Similar Items (easy) Leetcode
132 Pattern (medium) Leetcode
My Calendar I (medium) Leetcode Leetcode - II Leetcode - III (Hard)
CREDITS :
I want to give a huge thanks to dipjul for providing the problems for each of these patterns. They also cover Blind75 and Neetcode 150 !
dipjul : link
SYSTEM DESIGN
If there is one thing I am perennially thankful for my 2 years in consulting at the beginning of my career, its the ability to find structure and framework to solve any problem. Obviously, there is always scope to improvise on the spot. But what a framework does is gives you tools to get started, get over that initial brain freeze and ability to work yourself into a solution. And , I never really thought that my consulting gig would come in super handy when I was preparing for system design questions. For instance, when I began preparing those questions the first time, I read “ Design Youtube” , I kind of felt lost. I was like how do you even tackle such questions. Secondly, I was like if these are the questions I have to prepare for , then I wasn’t sure if I would be able to clear the interviews. But as usual, applying brute force and consistency helped. I read one case study everyday and thought through the problems. Then next day, reproduced the answers to those problems. And checked where I went wrong. After refining those answers, over 20 days, I was able to create a structure where I was able to answer any question in such a way , that I would explain the system design clearly thinking through most concepts.
NOTE BEFORE GETTING STARTED :
This article doesn’t cover it but highly recommend drawing the system design to explain if possible. I used to use whimsical.io or any drawing board online to explain the solutions
There are plenty of resources that I leveraged which are super helpful. Sharing a few below :
Grokking the System Design Interview by Educative.io
System Design Interview by Alex Xu
Medium articles of companies like Netflix.
Youtube channels like ByteByteGo
FRAMEWORK :
Now, before starting any question, I keep three concepts in mind :
CLARIFICATION
ASSUMPTIONS
CONSTRAINTS
CLARIFICATION :
Here, make sure you clarify the problem for the following three main parts :
Scope
Scale
Latency
Scope of the problem :
Identify what use-cases you will solve for ? For instance, for designing youtube, I can think of following use-cases :
video upload
playback
and homepage/feed
There could be more like analytics, monetization ,etc but we want to keep it simple for now.
Scale :
Understand the scale for which you are solving for. This will allow you to understand the users and how you could support the system with high availability and fallback
How many users are expected—millions or billions?
How many concurrent users/viewers should the system support?
What’s the expected volume of video uploads per day?
Assumptions :
Here think about two things : Functional Assumptions and Non-functional Assumptions
Functional Assumptions :
Assuming use-cases , this system will solve for :
User Registration and Login
Video Playback
Video Metadata
Search
Non-functional Assumptions :
Users can stream videos in multiple resolutions.
The system supports eventual consistency for things like view count and likes.
Video processing (transcoding) happens asynchronously after upload.
We'll use cloud storage for videos (e.g., S3 or equivalent).
What does CAC (Clarification, Assumptions, Constraints ) do for me ? It helps me by buying time. I not only understood the problem but also bought time to find a solution. Secondly, it helped me actually think on a deeper level for my 7 step solution below which are factors I need to think about.
R D S L A C S
R : Requirement Clarification (Already done as a part of CAC)
D : Data Modeling
S : Scalability
L : Latency
A : Availability vs Consistency
C : Caching Strategy
S : Security and Rate Limiting
Requirement Clarification :
As a part of requirement clarification, we already asked questions about Constraints, Assumptions and Clarification. And that in itself helps you to collect requirements. After you have asked constraints, clarification and assumptions, finally summarize it.
For instance for designing youtube, I would summarize it by
Based on our discussion, I’ll assume our MVP includes user authentication, video upload, playback, a homepage feed, and basic search functionality.
Key constraints I’ll keep in mind include ensuring low latency playback, high throughput for uploads and streaming, scalable storage to handle petabytes of data, and regional fault tolerance for high availability. I’ll also aim for cost efficiency in storage and compute.
Data Modeling :
What are the data tables I would need :
Users(user_id, email, username, created_at)
Videos(video_id, user_id, title, description, upload_time, tags, status, view_count)
Views(video_id, user_id, timestamp)
Subscriptions(follower_id, followed_id)
What kind of database I would need ?
I would use relational DB for users/videos, NoSQL for comments, likes, etc.
Relational DB for users/videos because it has clear , well defined relationships, requires strong consistency and data integrity.
NoSQL for comments , likes because these are write-heavy workloads and high fan-out, which NoSQL handles better. Also, Comments can vary (text, replies, reactions). NoSQL allows flexible schemas without needing frequent migrations.
Scalability :
How will I handle user growth and uploads of videos ? How will I horizontally scale it ?
Shard metadata storage by video_id/user_id :
Instead of keeping all metadata (video titles, descriptions, tags, etc.) in one big database table, we divide (shard) it across multiple database instances based on some key like
video_id
oruser_id
.Why it helps: Prevents bottlenecks in a single database and allows parallel read/write operations across shards.
Partition user-generated content (uploads, views) by time or region
Store data like uploads, views, or comments in separate partitions based on timestamp (daily/monthly) or user location.
Why it helps: Limits the amount of data scanned in each query and enables more efficient storage and retrieval.
Use Kafka for ingesting video processing jobs
Kafka acts as a message broker, queuing up video upload events to be picked up by downstream services like transcoding, thumbnail generation, and metadata extraction.
Why it helps: Decouples services, smooths traffic spikes, ensures reliable and ordered processing of millions of uploads.
Latency :
Minimize the time it takes for users to load pages, see content, and start video playback.
Preload Personalized Homepage Feed
Precompute and cache homepage video recommendations for each user during off-peak hours or when the user is active.
Why it helps: Reduces delay when users open the app or site—recommendations appear instantly instead of being computed in real-time.
Use Async Processing for Uploads (Transcoding, Thumbnail Generation)
After upload, tasks like converting video to different formats or generating thumbnails are offloaded to background workers, so users aren't kept waiting.
Why it helps: Improves responsiveness—users get confirmation of upload quickly without waiting for processing to finish.
CDN Ensures Low Latency Video Streaming
Videos are distributed and served from Content Delivery Network (CDN) edge servers located close to users geographically.
Why it helps: Reduces round-trip latency and bandwidth usage by avoiding distant servers.
Availability vs Consistency
Ensure the system stays operational and responsive even when components fail. The goal is to degrade gracefully, not crash.
Replicate Services Across Multiple Zones
Deploy services (e.g., upload service, user service) in multiple geographic regions or availability zones.
Why it helps: If one zone goes down (e.g., due to a network outage), the system can route traffic to another zone.
Use Message Queues (Kafka/SQS) to Decouple Services
Place a buffer between services using a queue so if one service fails, the others can keep running.
Why it helps: Prevents cascading failures and allows retry logic.
Store Multiple Video Formats Redundantly in Cloud Storage
Videos are stored in multiple formats (480p, 720p, etc.) and duplicated across regions.
Why it helps: Guarantees availability and playback options even if some formats or locations are unavailable.
Caching Strategy :
Reduce load on databases and backend services, and improve response times for frequently accessed data.
CDN Caches Video Content at Edge Nodes
Content Delivery Network (CDN) stores copies of video files (e.g., MP4 in various resolutions) on servers geographically close to users.
Why it helps: Reduces latency and bandwidth usage by avoiding long-distance requests to central servers.
Redis Caches Frequently Accessed Metadata
Video Metadata
Titles, thumbnails, view counts (especially for popular videos)
Speeds up page loads and reduces database queries
Homepage Feed
Personalized or trending videos are precomputed and cached per user or segment
Ensures instant loading on app/web open
Trending/Popular Videos
High-traffic content is frequently requested; caching keeps it readily available
Security and Rate Limiting
Protect the platform from abuse, ensure data privacy, and maintain trust with users.
OAuth 2.0 for Authentication
Secure protocol for login using tokens.
Enables third-party access without sharing user credentials.
Rate Limiting via API Gateway
Protects against spamming or DDoS attacks by limiting requests per user or IP (e.g., 100 requests/minute).
Helps manage load and ensures fair usage.
So, with this strategy RDSLACS, we were able to think through 7 most important parameters and finally explain how we were able to solve the problem in a structured way.
CASING
I have had a varied career close to a decade. I wasn’t sure what career would be a right fit for me. And as they saying goes, when you don’t know what to do , DO consulting (just kidding , or maybe not :p) When I worked for a boutique consulting which focussed on tech and finance, that’s when I got to work on data engineering projects. This led me to work at a start-up as a solutions engineer with a start-up focussed on building data pipelines. Finally, working there gave me the perspective and direction to follow solutions engineering/architecting at PayPal. And all of this, wouldn’t have happened without consulting. Now, before you think that if you are an engineer, then consulting is not for you, then I would probably need to correct you. Because, from Big 4 (PWC, Deloitte, KPMG to Ernst & Young) to the Trinity (McKinsey , Bain and BCG), they all hire for tech consultants. And if there was ever a career to polish off and master soft skills, it was consulting. It teaches you diverse skills and more importantly, the ability to handle difficult use-cases in a short time and more importantly making a significant impact. Fun fact : Sundar Pichai , CEO of Google, worked as a consultant at McKinsey prior to joining Google.
Now, coming to interviewing , there are a lot of behavioral questions which will be asked. But the technical or strategy questions are the difficult ones. Unless you have studied for them.
There are two things I have learnt about consulting market sizing/casing questions over time :
They are meant to analyze your thinking. The interviewer is analyzing your thinking and strategy. As long as the answer is somewhere within a nearby ballpark, you are fine.
These questions in my opinion are to throw you off. Your ability to not be flustered and patiently work through the solution will determine your success
Market sizing (or Fermi estimation) questions test your ability to break down complex problems into logical, manageable steps. The most important things atleast for me in the beginning was to buy some time to think through. To do that , here’s what I would do. Explaining with the help of an example :
“How many tennis balls can fit in a plane ?”
Buying Time Strategy :
The best way to buy time is to :
Clarify the question :
This could be by clearing up the questions :
Are we talking about a commercial plane or a private jet ?
Making Assumptions :
Then the next section is to make and confirm assumptions :
Is the a tennis ball is ~7 cm in diameter ?
What would be the dimensions of the plan ? (e.g., 747’s length ≈ 70m, width ≈ 6m, height ≈ 4m).
Is it okay to use approximations for calculations ? For example after calculating area of plane and area of ball, round up to whole numbers for simpler calculations.
Apply constraints :
Consider unusable spaces (seats, cockpit, storage).
Assume that only 70% of the plane is usable.
By applying the above three principles which I remembered as CAC (Clarify , Assume , Constraints) , I would buy time but more importantly show strategic thinking. This helped me think through the problem, present structured approach to all of the questions asked.
Having clarified assumptions , then it was time to deep dive and solve.
If your calculations are fine , then at the end of the day, all you need to do is :
Result = Area of plane / Area of 1 ball.
Sharing result below :
• Tennis ball: Diameter ≈ 7 cm → Volume = \frac{4}{3} \pi r^3 ≈ 150 cm³
• Boeing 747 Interior Volume: Approx. 800 m³ (~800,000,000 cm³)
• Packing Efficiency: ~70% (stacking inefficiencies)
• Final Estimate: \frac{800,000,000 \times 0.7}{150} ≈ 3.7 million balls
⸻
2. Market Sizing Questions Cheat Sheet (Common Formulas & Assumptions)
A. Volume-Based Estimation (e.g., “How many tennis balls fit in a plane?”)
Formula:
Number of objects = Total size of another object / size of 1 object.
Common Assumptions:
• Tennis ball: Diameter ≈ 7 cm → Volume = \frac{4}{3} \pi r^3 ≈ 150 cm³
• Boeing 747 Interior Volume: Approx. 800 m³ (~800,000,000 cm³)
• Packing Efficiency: ~70% (stacking inefficiencies)
• Final Estimate: \frac{800,000,000 \times 0.7}{150} ≈ 3.7 million balls
⸻
B. Population-Based Estimation (e.g., “How many barbers are in New York City?”)
Formula:
\text{Total Demand} = \text{Population} \times \text{Service Frequency per Capita}
Common Assumptions:
• NYC Population: ~8M
• Haircuts per person per month: ~1 (Men), ~0.5 (Women) → Avg: ~0.75
• Total haircuts/month: 8M \times 0.75 = 6M haircuts/month
• Haircuts per barber per day: ~10 (200/month)
• Barbers Needed: \frac{6M}{200} ≈ 30,000 barbers
⸻
C. Consumption-Based Estimation (e.g., “How many pizzas are sold in the U.S. per year?”)
Formula:
\text{Annual Demand} = \text{Population} \times \text{Avg Consumption Per Capita}
Common Assumptions:
• U.S. Population: ~330M
• Avg pizzas/person/year: ~25
• Total pizzas/year: 330M \times 25 = 8.25 billion pizzas/year
⸻
D. Area-Based Estimation (e.g., “How many golf courses are in California?”)
Formula:
\text{Total Land Area} \div \text{Avg Land per Unit}
Common Assumptions:
• California land area: ~400,000 km²
• Avg land per golf course: ~1 km²
• Estimated golf courses: \frac{400,000}{1} ≈ 4,000 golf courses
⸻
Casing Questions :
Currently taking into 4 types :
Profits & Loss: Revenues:
Entering a new market
Pricing
Growth and Increasing Sales
Profits and Loss Revenue :
Scenario: A mid-sized online clothing retailer wants to analyze their profit and loss for the last two quarters.
Clarifying Questions:
Start off with clarifying questions : You need analyze profits and loss. This means, can you think about revenues and costs. So, accordingly the clarifying questions follow :
What are the primary revenue streams (e.g., men's wear, women's wear, accessories)?
What are the fixed vs. variable costs?
Has there been any marketing push or product launch recently?
Are we comparing quarters, or year-over-year?
Assumptions Made:
Revenue comes from 3 streams: Men’s (40%), Women’s (50%), Accessories (10%).
Costs include manufacturing, marketing, logistics, and platform fees.
Industry-standard gross margin is 60%.
Applying Constraints:
Limit analysis to Q3 and Q4 only.
Benchmark only against top 3 competitors in the U.S. online retail market.
Example Walkthrough:
Revenue in Q3 was $10M, and in Q4 it increased to $12M. But profit dropped from $2.5M to $1.5M.
On inspection, marketing costs rose from $1M to $3M in Q4. A deep dive shows an aggressive ad campaign underperformed.
Accessory revenue dropped by 50%, indicating potential inventory or pricing issues.
Benchmarking shows the client’s shipping cost per order is $2 higher than competitors due to inefficient last-mile delivery.
Insights: Unusual increase in marketing spend and high shipping costs are squeezing profit. Accessory line underperforming may suggest poor product-market fit.
Entering a New Market Example
Scenario: A U.S.-based edtech company wants to expand into Latin America via acquisition.
Clarifying Questions:
Think about the overarching goal here.
What is the goal: user growth, revenue, or diversification?
What companies are being considered for acquisition?
What’s the regulatory landscape?
Assumptions Made:
The company wants to acquire a mid-sized local edtech firm in Mexico.
Cultural fit is critical as the client values collaborative work culture.
Applying Constraints:
Must enter the market within 6 months.
Budget capped at $10M.
Example Walkthrough:
The client shortlists two Mexican firms: Firm A (good growth, weak brand) and Firm B (moderate growth, strong brand, but hierarchical culture).
Due diligence reveals Firm A uses similar technology and agile processes, ensuring faster integration.
Firm B would require a major cultural shift.
Financial modeling shows both firms will break even in 2 years, but Firm A will yield better long-term synergies.
Insights: Although Firm B has stronger brand recognition, Firm A is a better cultural fit and aligns with the company’s long-term agility goals.
Pricing Example
Scenario: A SaaS startup is launching a project management tool and needs to decide pricing.
Clarifying Questions:
What is the customer segment (e.g., freelancers, SMEs, enterprises)?
What are competitors charging?
What are our development and support costs?
Assumptions Made:
Targeting SMEs.
Direct competitors charge $10-$15/user/month.
Our cost per user is $5/month.
Applying Constraints:
Need to reach break-even in 12 months.
Cannot exceed $15/user/month due to market cap.
Example Walkthrough:
Competitive analysis shows our tool has superior features but lacks brand presence.
Cost-based pricing would suggest $7/user/month to maintain 40% margin.
Price-based costing shows SMEs are willing to pay $12 if onboarding is easy.
Final pricing decision: Offer $10/user/month, with a 3-month free trial to lower adoption barrier.
Insights: The pricing strategy balances cost recovery and market competitiveness while encouraging trial and word-of-mouth marketing.
4. Growing and Increasing Sales Example
Scenario: A boutique coffee brand wants to grow its online sales by 30% over the next year.
Clarifying Questions:
What are current sales channels and performance?
What’s the size of the target market?
How do our prices compare to competitors?
Assumptions Made:
Current annual revenue is $2M, mostly from online DTC (Direct to Consumer).
Industry CAGR is 5–7%.
Applying Constraints:
Marketing budget increase limited to 10%.
Maintain premium pricing; no discounting strategy.
Example Walkthrough:
Company invests in SEO and influencer partnerships instead of discounts.
Expands SKUs with a limited-edition single-origin series to create buzz.
Conducts a competitor analysis: client’s pricing is 20% above average, justified by quality and sustainable sourcing.
Uses email campaigns to target repeat customers and improve retention.
Insights: Focused growth via product line extension and customer engagement (rather than price cuts) allows premium branding and sales uplift.
Some more excellent use-cases to master this and use :
The article is a continued piece and I aim to add more updates as I get a chance to refine it. Hope this is helpful.
BEHAVIORAL QUESTIONS
Behavioral Interviews are grey areas. And they can be super tricky. Its important to answer them in a way that ensures you show your strengths. And also the humility and ability to learn. Demonstrating those qualities can help you get a job because this shows your mastery in soft skills
First things first. The pre-requisite. Answer each question in STAR format.
Also, preparing for each of Amazon’s 14 leadership principles is the best way to master behavioral questions and they cover every type of question.
I have listed 9 questions below and the aim is to provide template answers which you can modify.
What’s a professional mistake you’ve made that you haven’t included on your resume?
On my first job, I created an ETL pipeline for a data visualization project. While it handled the data I was given very well, because, it was my first project, I hadn’t encompassed for unstructured data in production. As a result, I didn’t have a staging environment to mirror production. As a result, when the code was pushed to production, it immediately ran into issues and had to be rolled back. There was 1 key mistake: lack of staging environment which would have allowed me to test the code before. This could have avoided the entire mishap in production. So, I learnt from this and ensured I tested the ETL pipeline in staging before the next launch, fixed the issues before eventually rolling it out successfully in production. This is something I have kept in mind during my tenure and always ensure that the code has a staging environment before going to production.
If I called your former manager, what do you think they’d say you need to work on?
At my current job, I have an eye for detail and this means sometimes, I spent to much time on certain topics without realizing the need to zoom out and thing about the bigger picture. It also led me to spend more time than usual. So, I had a conversation with manager about this and have been actively working on it. I use a priortization process called Eisen hower matrix to ensure my time is spent in a more valuable manner and that I can dive into details on only topics that need the attention. This also ensures, I have more expediency in my work.
Have you ever disagreed with someone ? How did you resolve it ?
In my consulting job, my manager put me on a contracting job to write queries. This was a great opportunity to make money for hours worked. However, I saw this as an opportunity to also pitch a product we were working with. My manager was initially hesitant to do this because we were already doing well with the opportunity. So, I created a demo, where I showed the clear benefits of using our product. This gave him confidence with my decision and we had a meeting with merchant who loved the demo as well and now we have a SaaS agreement making us more revenue while directly benefiting the client.
Tell me something that is not on your resume
This could be anything positive :
Work at an NGO
Any passion project you are pursuing.
Paint yourself in a positive light.
If you get another job offer which is better, what would you do?
I evaluate a job offer on four main values. :
Company Mission
Company Work Culture
Alignment of my work and experience with the role offered.
Growth opportunity.
Currently, your company fulfills all of those criterias which is why I am excited to work for this role. If there is a company that fulfills all the above which is incredibly hard, then at that time, I would need to evaluate it carefully. However, I am someone who invests in a company fully and as long as the core 4 values align, I would more likely continue working here.
Tell me about a time you had too many responsibilities. How did you prioritize?
I think having too many responsibilities and managing them is a part of the job. I priortize them by using Eisenhower matrix. If its urgent and important, do it first. If its urgent but not important, delegate if I can. If its important and not urgent, put it on my to do list. If its neither urgent nor important, I can put it on my roadmap to when it needs to be done.
What’s a decision you regret in your career, and what did you learn from it?
I think every decision is a learning. I don’t necessarily have regrets. I decided to pursue Product Management and while I did well in the role, i realized , its not for me.I also realized I thrived in roles which involved selling more than building. The experience in PM was exhausting but I am thankful for the opportunity because it made me realize what I was good at and gave me clear decision for my career.
Why are you looking for a change ?
I am currently saturated with my role and not learning. This opportunity :
offers me chance to learn and grow
aligns with my education and experience ensuring I can hit the ground running.
Am impressed with company culture and like the values it stands for.
Resources :
Most asked behavioral questions : Link
20 most commonly asked Behavioral questions : Link
DATA DESIGN
PySpark Cheatsheets
Data Design Architecture Explanation : Batch and Lambda Architecture
Data Design Interview Questions
PySpark Cheatsheets :
Palantir uses Pyspark a lot themselves. So, its not surprise that they created a Pyspark syntax cheatsheet. This extensively covers topics like common patterns, joins, functions, string operations, number operations, aggregate operations , etc.
If you are using PySpark for Data Science , here is another cheatsheet which has all the functions making is super easy for you to learn in terms of creating data frames, running SQL on spark and so on.
PySpark CheatSheet for Beginners : Link
Pyspark RDD Cheatsheet : Link
Data Design Architecture Explanation : Batch and Lambda Architecture
Batch Architecture Explained
Batch Architecture essentially is collecting, processing and storing data in groups or batches at scheduled intervals. Think about batch jobs that run at a cadence of week, daily or months. Its a traditional form of data pipeline and useful in dealing with large volumes of data that doesn’t require immediate processing.
Key Features of Batch Architecture
Scheduled Processing : The data is processed at a specific cadence like day, months, weeks, etc
High Throughput : The system is designed to handle large datasets in a very efficient manner
Latency Tolerant: Its useful for use-cases where delays are acceptable and urgency is not priority. Think generating daily reports.
Storage & Processing Separation: Data is first stored in a data lake or a warehouse and then it is processed.
Typical ETL process :
Information sources - Logs, API, Databases
Data Ingestion - Typically happens through AWS Data Pipeline, Apache Sqoop
Batch Processing Engine – Tools used such as Hadoop MapReduce, Apache Spark.
Storage Layer – Data lakes like HDFS or S3; warehouses like Snowflake or BigQuery.
Output – Downstream files, dashboards, reports.
Batch Processing Interview Questions and Answers:
What is batch processing, and how does it differ from stream process
Batch processing involves collecting and processing data in large, discrete chunks at scheduled intervals. Stream processing handles data in real time or near-real time as it arrives.
When do you choose Batch processing over Streaming ?
Choose batch processing when:
Real-time insights are not needed.
Large volumes of data must be processed efficiently.
Cost and simplicity are priorities.
Use cases include reporting, backups, and data warehousing.
How do you optimize batch jobs for performance?
Partitioning and bucketing
Caching intermediate results
Using efficient file formats like Parquet
Tuning Spark configurations (e.g., memory, parallelism)
How do you handle scaling in batch processing systems?
Use distributed computing frameworks (e.g., Spark)
Autoscale compute resources in the cloud
Optimize resource allocation per job type
What techniques do you use to manage large volumes of data in batch jobs?
Data partitioning
Filtering early in the pipeline
Efficient storage formats (e.g., ORC, Parquet)
Columnar processing
How would you design a nightly ETL job to process millions of records?
Use Spark for distributed ETL
Store raw data in S3
Schedule with Airflow
Monitor job metrics and set up retry policies
Load cleaned data into Redshift or Snowflake
How do you handle schema changes in batch data pipelines?
Use schema evolution in storage formats (e.g., Avro, Parquet)
Maintain schema registry
Version datasets and track changes
Validate schema compatibility before processing
Real-time Data Processing Interview Questions and Answers
What is real time data processing ?
Real time architecture is essentially processing data in real time and providing results in a given time frame often with immediate or low latency. It is used in fields where quick data processing is essential like finance, healthcare, IoT.
Key Features of Real-Time Architecture
Low Latency: Processes and responds to input almost instantly.
High Availability: Designed to be fault-tolerant and run continuously with minimal downtime.
Scalability: Can handle large and variable volumes of data without performance loss.
Concurrency: Supports simultaneous execution of multiple tasks or events.
Determinism: Guarantees a predictable response time to events.
Data Streaming: Handles continuous flow of data from sources like sensors or user inputs.
Event-Driven: Triggers processes based on real-time events or changes in state.
Real Time Processing Interview Questions and Answers:
What technologies are commonly used in real-time architecture?
Message brokers: Apache Kafka, RabbitMQ
Stream processing: Apache Flink, Apache Storm, Spark Streaming
Data stores: Redis, Cassandra, or InfluxDB for fast reads/writes
Monitoring: Prometheus, Grafana
How do you ensure high availability in real-time systems?
By implementing:Redundant components
Load balancing
Failover mechanisms
Monitoring and alerting
Distributed systems with graceful degradation
Explain the CAP theorem in the context of real-time systems.
In real-time systems, we often prioritize availability and partition tolerance over consistency. This ensures the system keeps running and responsive even if some nodes fail or are partitioned, though eventual consistency might be acceptable.
Explain windowing in stream processing.
Windowing allows real-time systems to group incoming data into time-based segments (e.g., 5-second windows) for aggregation or analysis. Common types: tumbling, sliding, session windows.
What’s the difference between stream processing and micro-batching?
Stream processing: Processes data event-by-event with near-zero delay.
Micro-batching: Groups events into tiny batches (milliseconds) before processing—e.g., Spark Streaming.
What’s the difference between real-time and near real-time?
Real-time: Response within milliseconds or seconds
Near real-time: Slight delay, often a few seconds to minutes, acceptable for less critical applications
How do you handle data consistency in distributed real-time systems?
Use eventual consistency models
Implement idempotent operations to avoid duplication
Apply exactly-once processing guarantees with tools like Kafka + Flink
Hope this is helpful!