A while ago, I shared a SQL database with over 800 questions for practice. Someone asked me, “Can you just provide 10 questions which I can study to clear SQL interviews?” What could I honestly say? Ten questions probably wouldn’t even clear the 10th standard board exams. And even if they somehow helped clear an interview, you would still need to learn everything once you started the job. Over the course of 8 years, I have learned this: interviews only test ability and skill. The job itself requires you to actually build applications, or else you may not be at the company for very long. So, while I couldn’t provide 10 questions per se, I could at least provide 6+ important concepts which would help you. And why restrict it to SQL? I have also included concepts for PySpark and data design, along with resources to study.
SQL Concepts:
Keys in SQL:
Keys play a vital role in establishing relationships and maintaining data integrity in SQL databases:
Primary Key: Serves as a unique identifier for each record in a table; no two rows can share the same primary key. Together with foreign keys, primary keys are how relational databases enforce relationships and minimize data duplication.
Foreign Key: Establishes a connection between two tables by referencing the primary key of another table. Foreign key constraints are how the database maintains referential integrity.
Unique Key: Guarantees that all entries in a column are distinct, while still allowing NULLs (some databases permit only a single NULL, others allow several).
Composite Key: A key made up of two or more columns that together uniquely identify a row. Composite keys are frequently used in complex schemas, and composite indexes over the same columns improve query performance.
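To make these concrete, here is a small sketch using SQLite via Python's sqlite3 (the schema is hypothetical); it shows a primary key, a unique key, a foreign key, and a composite key in one place:

```python
import sqlite3

# Hypothetical schema demonstrating the four key types above.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this on
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique row identifier
        email TEXT UNIQUE                  -- unique key: no duplicate emails
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER,
        line_no  INTEGER,
        customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
        PRIMARY KEY (order_id, line_no)    -- composite key: two columns together
    )""")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
try:
    # Violates the foreign key: customer 99 does not exist.
    conn.execute("INSERT INTO orders VALUES (10, 1, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The failed insert shows referential integrity in action: the database refuses a row pointing at a customer that isn't there.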
JOINs:
Joins are arguably the most critical concept and come up in every interview. They are the lifeblood of SQL, and while I could explain them myself, this resource does a terrific job of it.
So, just linking it here: https://medium.com/@johnnyJK/understanding-sql-joins-a-comprehensive-guide-88bab3457270
Subqueries in SQL:
A subquery is a query written inside another SQL query. It’s usually placed in the WHERE clause to help filter results using data from a different table.
Example: If you want to find customers who have placed orders, but the order details are in another table, you can use a subquery to first get the relevant CustomerID values from the Orders table, then use that list to filter the customers.
SELECT Name FROM Customers
WHERE CustomerID IN (
SELECT CustomerID
FROM Orders
);
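Since SQLite ships with Python, the query above can be run end to end; the table names match the example and the sample rows are made up:

```python
import sqlite3

# Run the subquery example against a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER, Name TEXT);
    CREATE TABLE Orders    (OrderID INTEGER, CustomerID INTEGER);
    INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO Orders    VALUES (100, 1), (101, 3);
""")
rows = conn.execute("""
    SELECT Name FROM Customers
    WHERE CustomerID IN (SELECT CustomerID FROM Orders)
""").fetchall()
print(rows)  # only the customers who appear in Orders
```

Bob has no order, so the subquery filters him out.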
Window Functions
Window functions allow calculations across rows relative to the current row, without grouping rows.
Examples:
ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD()
Below are clear, structured explanations you can use to practice each window function (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD()). The examples use realistic table and column names and are pitched at the level of medium LeetCode or HackerRank challenges.
RANK()
RANK() is a window function that assigns a rank to each row within a group based on a specific order. If two rows have the same value, they receive the same rank, and the next rank is skipped.
SELECT
Region,
Salesperson,
Revenue,
RANK() OVER (
PARTITION BY Region
ORDER BY Revenue DESC
) AS RankInRegion
FROM Sales;
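Here is a runnable version of that query, using SQLite (3.25+ for window functions) as a stand-in database with made-up sample rows:

```python
import sqlite3

# Demonstrates RANK() tie behavior, partitioned by region.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (Region TEXT, Salesperson TEXT, Revenue INTEGER);
    INSERT INTO Sales VALUES
        ('East', 'Ana', 500), ('East', 'Ben', 500), ('East', 'Cy', 300),
        ('West', 'Dee', 700);
""")
rows = conn.execute("""
    SELECT Region, Salesperson, Revenue,
           RANK() OVER (PARTITION BY Region ORDER BY Revenue DESC) AS RankInRegion
    FROM Sales
""").fetchall()
for row in rows:
    print(row)
# Ana and Ben tie on 500, so both get rank 1 and Cy gets rank 3 (2 is skipped).
```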
Partitioning (PARTITION BY)
PARTITION BY is used with window functions (like RANK()) to divide the data into separate groups, similar to GROUP BY, but it doesn’t collapse rows. Instead, it performs calculations within each partition.
SELECT
Department,
Employee,
Salary,
AVG(Salary) OVER (PARTITION BY Department) AS AvgSalaryInDept
FROM Employees;
PARTITION BY Department tells SQL to calculate the average salary separately for each department. The result does not collapse rows (like GROUP BY would), so you can still see individual employee data.
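This, too, is runnable with sqlite3 (sample rows are made up): the window query returns one row per employee, while the equivalent GROUP BY collapses each department to a single row:

```python
import sqlite3

# Contrast AVG(...) OVER (PARTITION BY ...) with plain GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (Department TEXT, Employee TEXT, Salary INTEGER);
    INSERT INTO Employees VALUES
        ('HR', 'Ana', 100), ('HR', 'Ben', 200), ('IT', 'Cy', 300);
""")
windowed = conn.execute("""
    SELECT Department, Employee, Salary,
           AVG(Salary) OVER (PARTITION BY Department) AS AvgSalaryInDept
    FROM Employees
""").fetchall()
grouped = conn.execute(
    "SELECT Department, AVG(Salary) FROM Employees GROUP BY Department"
).fetchall()
print(len(windowed), "rows with the window function")  # one row per employee
print(len(grouped), "rows with GROUP BY")              # one row per department
```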
Resources to study and practice SQL:
PYSPARK
1. RDDs vs. DataFrames
RDD (Resilient Distributed Dataset): Low-level, immutable distributed collection of objects.
DataFrame: Higher-level abstraction with schema and optimizations via Catalyst engine.
🔹 Example:
rdd = spark.sparkContext.parallelize([(1, 'Alice'), (2, 'Bob')])
df = rdd.toDF(['id', 'name'])
2. Transformations vs. Actions
Transformations are lazy (e.g., filter, map) and are only computed when an action (e.g., collect, count) is called.
🔹 Example:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd_filtered = rdd.filter(lambda x: x > 2)  # Transformation
print(rdd_filtered.collect())  # Action
In interviews, be ready to explain lazy evaluation and how it enables Spark’s optimizations.
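The snippet above needs Spark itself, but the idea of laziness can be mimicked in plain Python with a generator, which, like a transformation, does no work until something consumes it (the "action"). This analogy is mine, not Spark code:

```python
# Plain-Python analogy for lazy evaluation (not Spark itself):
# the generator below, like a Spark transformation, does nothing
# until an "action" consumes it.
def traced_filter(xs):
    for x in xs:
        print("evaluating", x)   # side effect makes evaluation visible
        if x > 2:
            yield x

lazy = traced_filter([1, 2, 3, 4])   # nothing printed yet: "transformation"
result = list(lazy)                  # evaluation happens here: "action"
print(result)
```

Nothing is printed when `lazy` is created; every "evaluating" line appears only when `list()` forces the work, which is exactly how Spark defers computation until `collect()` or `count()`.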
groupBy vs. groupByKey
groupBy lets you apply aggregation functions directly on the grouped data (efficient). groupByKey just groups the values, after which you must reduce them manually (less efficient).
🔹 Example:
from pyspark.sql.functions import avg
df.groupBy("department").agg(avg("salary"))
Joins in PySpark
This link provides a comprehensive overview of joins in PySpark.
Window Functions
Window functions in PySpark perform calculations across a set of rows that are related to the current row. They are used together with a Window specification (pyspark.sql.Window) that defines the partitioning and ordering of the rows the function is applied to.
This link provides a great overview of window functions in PySpark.
PySpark Repartitioning
When dealing with large datasets in PySpark, understanding how to manage data distribution is crucial. Two commonly used methods for this are repartition() and partitionBy(). While they might seem similar at first glance, they serve different purposes and are used in different contexts.
This link delves into the differences between the two methods, provides guidance on when to use each, and explores advanced strategies for controlling the number of output files when writing data.
Data Design Concepts
You can get most of these concepts from Grokking the System Design Interview, but I am still summarizing them below.
Proxy (Forward Proxy): Sits between client and server. Clients use it to access other servers.
Use case: Filtering or anonymizing web traffic (e.g., accessing a blocked website).
Reverse Proxy: Sits in front of servers and routes client requests to them.
Use case: Load balancing, caching, SSL termination.
WebSockets: A communication protocol enabling full-duplex, real-time interaction between client and server over a single persistent connection.
Use case: Chat apps, live notifications, online gaming.
Database Types:
Relational (SQL): Structured schema, ACID-compliant (e.g., PostgreSQL, MySQL).
NoSQL: Flexible schema for unstructured data (e.g., MongoDB, Redis).
NewSQL: Combines SQL-like structure with NoSQL scalability (e.g., CockroachDB).
Data Redundancy: Having duplicate data across the system.
Pros: Fault tolerance, faster access.
Cons: Inconsistencies, higher storage costs.
Managed via: normalization, replication control.
Database Architectures:
Monolithic: One large, centralized database.
Distributed: Data split across multiple locations (sharding, replication).
Federated: Multiple independent databases coordinated together.
Load Balancer: Distributes incoming traffic across multiple servers to maximize performance and reliability.
Types: Round-robin, least connections, IP hash.
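As a rough illustration (not production code), two of these strategies can be sketched in a few lines of Python; the server names are made up:

```python
from itertools import cycle
import zlib

servers = ["s1", "s2", "s3"]

# Round-robin: hand out servers in a fixed rotation.
rr = cycle(servers)
round_robin_picks = [next(rr) for _ in range(5)]
print(round_robin_picks)  # s1, s2, s3, s1, s2

# IP hash: a stable hash of the client IP always maps to the same
# server, which gives simple session stickiness.
def pick_by_ip(ip: str) -> str:
    return servers[zlib.crc32(ip.encode()) % len(servers)]

print(pick_by_ip("10.0.0.7") == pick_by_ip("10.0.0.7"))  # True: sticky
```

A real load balancer also tracks server health and connection counts; least-connections routing needs that live state, so it is omitted here.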
API Gateway: A single entry point that manages and routes API requests to backend services.
Functions: Authentication, rate limiting, request routing, caching.
Popular tools: Kong, AWS API Gateway, NGINX.
Bloom Filter: A probabilistic data structure that tests whether an element is in a set.
Fast, memory-efficient, but may return false positives.
Use case: Checking if a user already exists in a database.
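A minimal sketch of the idea (the bit-array size and hash count are arbitrary; real systems use tuned libraries instead):

```python
import hashlib

# Tiny Bloom filter sketch: k hash functions set k bits per item.
# Lookups may return false positives, but never false negatives.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by hashing the item with k different prefixes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True
print(bf.might_contain("bob@example.com"))    # almost certainly False
```

This is why the structure suits the "does this user already exist" check: a "no" is definitive, and only the occasional "maybe" needs a real database lookup.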
Idempotency: An operation is idempotent if repeating it has no additional effect.
Use case: A payment API should not charge twice if the same request is retried.
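One common way to achieve this is an idempotency key sent by the client; the sketch below uses an in-memory dict where a real API would use a database:

```python
# Idempotent request handling via an idempotency key
# (names and storage are hypothetical stand-ins).
processed = {}  # idempotency key -> result of the first attempt

def charge(idempotency_key, amount, ledger):
    if idempotency_key in processed:        # retry: return the cached result,
        return processed[idempotency_key]   # do NOT charge again
    ledger.append(amount)                   # the actual side effect
    processed[idempotency_key] = {"status": "charged", "amount": amount}
    return processed[idempotency_key]

ledger = []
charge("key-123", 50, ledger)
charge("key-123", 50, ledger)  # retried request with the same key
print(len(ledger))  # charged only once
```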
Microservices: Design principles for building loosely coupled, independently deployable services:
Single Responsibility Principle
Independent deployments
API-first communication (REST/gRPC)
Monitoring, scaling, and failover support
Caching: Temporarily storing data to improve speed and reduce load on backend systems.
Types: In-memory (Redis), browser, CDN.
Content Delivery Network (CDN)
A geographically distributed network of servers that deliver content (images, scripts, videos) faster by serving it from locations near the user.
Example: Cloudflare, Akamai.
Failover: Automatic switching to a backup system when the primary system fails.
Use case: High-availability systems like payment gateways or databases.
Heartbeat: Regular signals sent between systems to check that they are still alive.
Use case: Detecting failed services in a microservices architecture.
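The mechanism can be sketched in a few lines; the service name and timeout below are made up:

```python
# Heartbeat-based failure detection: a monitor marks a service dead
# if no heartbeat arrived within the timeout (values are illustrative).
last_beat = {}

def heartbeat(service, now):
    last_beat[service] = now

def is_alive(service, now, timeout=5.0):
    return service in last_beat and (now - last_beat[service]) <= timeout

heartbeat("payments", now=100.0)
print(is_alive("payments", now=103.0))  # True: last beat was 3s ago
print(is_alive("payments", now=110.0))  # False: 10s ago, past the timeout
```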
Caching Strategies:
Write-through: Data is written to the cache and the database simultaneously.
Write-back: Data written to cache first, later synced with DB.
Cache-aside (Lazy loading): App checks cache, then DB if needed.
Time-to-live (TTL): Sets expiry for cached data.
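Cache-aside with a TTL can be sketched as follows (a dict stands in for Redis, a function for the database; names are hypothetical):

```python
import time

# Cache-aside (lazy loading) with a TTL.
cache = {}     # key -> (value, expires_at)
db_reads = []  # track backend load for illustration

def db_get(key):
    db_reads.append(key)
    return f"value-for-{key}"

def get(key, ttl=60.0, now=None):
    now = time.time() if now is None else now
    hit = cache.get(key)
    if hit and hit[1] > now:          # fresh cache entry: serve it
        return hit[0]
    value = db_get(key)               # miss or expired: go to the database...
    cache[key] = (value, now + ttl)   # ...then populate the cache
    return value

get("user:1", now=0.0)    # miss    -> DB read
get("user:1", now=10.0)   # hit     -> no DB read
get("user:1", now=100.0)  # expired -> DB read again
print(db_reads)           # the backend was hit twice, not three times
```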
Hope this is helpful!