A while ago, I shared a SQL database with over 800 questions for practice. Someone asked me, “Can you just provide 10 questions which I can study to clear SQL interviews?” What could I honestly say? Ten questions probably wouldn’t even clear the 10th standard board exams. And even if they somehow helped clear an interview, you would still need to learn everything once you started the job. Over the course of 8 years, I have learned this: interviews only test ability and skill. The job itself requires you to actually build applications, or else you may not be at the company for very long. So, while I couldn’t provide 10 questions per se, I could at least provide 6+ important concepts which would help you. And why restrict it to SQL? I have also included concepts for PySpark and data design, along with resources to study.
SQL Concepts:
Keys in SQL:
Keys play a vital role in establishing relationships and maintaining data integrity in SQL databases:
Primary Key: Serves as a unique identifier for each record in a table; no two rows can share the same primary key. Together with foreign keys, primary keys are how relational databases enforce relationships and minimize data duplication.
Foreign Key: Establishes a connection between two tables by referencing the primary key of another table. Foreign key constraints are how the database maintains referential integrity.
Unique Key: Guarantees that all entries in a column are distinct, while still allowing NULLs (some databases permit only a single NULL, others allow several).
Composite Key: A key made up of two or more columns that together uniquely identify a row. Composite keys are frequently used in complex schemas, and composite indexes over the same columns improve query performance.
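To make these concrete, here is a small sketch using SQLite via Python's sqlite3 (the schema is hypothetical); it shows a primary key, a unique key, a foreign key, and a composite key in one place:

```python
import sqlite3

# Hypothetical schema demonstrating the four key types above.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this on
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique row identifier
        email TEXT UNIQUE                  -- unique key: no duplicate emails
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER,
        line_no  INTEGER,
        customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
        PRIMARY KEY (order_id, line_no)    -- composite key: two columns together
    )""")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
try:
    # Violates the foreign key: customer 99 does not exist.
    conn.execute("INSERT INTO orders VALUES (10, 1, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The failed insert shows referential integrity in action: the database refuses a row pointing at a customer that isn't there.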
JOINs:
Joins are arguably the most critical concept and come up in every interview. They are the lifeblood of SQL, and while I could explain them myself, this resource does a terrific job of it.
So, just linking it here: https://medium.com/@johnnyJK/understanding-sql-joins-a-comprehensive-guide-88bab3457270
Subqueries in SQL:
A subquery is a query written inside another SQL query. It’s usually placed in the WHERE clause to help filter results using data from a different table.
Example: If you want to find customers who have placed orders, but the order details are in another table, you can use a subquery to first get the relevant CustomerID values from the Orders table, then use that list to filter the customers.
SELECT Name FROM Customers
WHERE CustomerID IN (
SELECT CustomerID
FROM Orders
);
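Since SQLite ships with Python, the query above can be run end to end; the table names match the example and the sample rows are made up:

```python
import sqlite3

# Run the subquery example against a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER, Name TEXT);
    CREATE TABLE Orders    (OrderID INTEGER, CustomerID INTEGER);
    INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO Orders    VALUES (100, 1), (101, 3);
""")
rows = conn.execute("""
    SELECT Name FROM Customers
    WHERE CustomerID IN (SELECT CustomerID FROM Orders)
""").fetchall()
print(rows)  # only the customers who appear in Orders
```

Bob has no order, so the subquery filters him out.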
Window Functions
Window functions allow calculations across rows relative to the current row, without grouping rows.
Examples:
ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD()
Below are clear, structured explanations you can use to practice each window function (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD()). The examples use realistic table and column names and are pitched at the level of medium LeetCode or HackerRank challenges.
RANK()
RANK() is a window function that assigns a rank to each row within a group based on a specific order. If two rows have the same value, they receive the same rank, and the next rank is skipped.
SELECT
Region,
Salesperson,
Revenue,
RANK() OVER (
PARTITION BY Region
ORDER BY Revenue DESC
) AS RankInRegion
FROM Sales;
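Here is a runnable version of that query, using SQLite (3.25+ for window functions) as a stand-in database with made-up sample rows:

```python
import sqlite3

# Demonstrates RANK() tie behavior, partitioned by region.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (Region TEXT, Salesperson TEXT, Revenue INTEGER);
    INSERT INTO Sales VALUES
        ('East', 'Ana', 500), ('East', 'Ben', 500), ('East', 'Cy', 300),
        ('West', 'Dee', 700);
""")
rows = conn.execute("""
    SELECT Region, Salesperson, Revenue,
           RANK() OVER (PARTITION BY Region ORDER BY Revenue DESC) AS RankInRegion
    FROM Sales
""").fetchall()
for row in rows:
    print(row)
# Ana and Ben tie on 500, so both get rank 1 and Cy gets rank 3 (2 is skipped).
```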
Partitioning (PARTITION BY)
PARTITION BY is used with window functions (like RANK()) to divide the data into separate groups, similar to GROUP BY, but it doesn’t collapse rows. Instead, it performs calculations within each partition.
SELECT
Department,
Employee,
Salary,
AVG(Salary) OVER (PARTITION BY Department) AS AvgSalaryInDept
FROM Employees;
PARTITION BY Department tells SQL to calculate the average salary separately for each department. The result does not collapse rows (like GROUP BY would), so you can still see individual employee data.
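This, too, is runnable with sqlite3 (sample rows are made up): the window query returns one row per employee, while the equivalent GROUP BY collapses each department to a single row:

```python
import sqlite3

# Contrast AVG(...) OVER (PARTITION BY ...) with plain GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (Department TEXT, Employee TEXT, Salary INTEGER);
    INSERT INTO Employees VALUES
        ('HR', 'Ana', 100), ('HR', 'Ben', 200), ('IT', 'Cy', 300);
""")
windowed = conn.execute("""
    SELECT Department, Employee, Salary,
           AVG(Salary) OVER (PARTITION BY Department) AS AvgSalaryInDept
    FROM Employees
""").fetchall()
grouped = conn.execute(
    "SELECT Department, AVG(Salary) FROM Employees GROUP BY Department"
).fetchall()
print(len(windowed), "rows with the window function")  # one row per employee
print(len(grouped), "rows with GROUP BY")              # one row per department
```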
Resources to study and practice SQL:
PYSPARK
1. RDDs vs. DataFrames
RDD (Resilient Distributed Dataset): Low-level, immutable distributed collection of objects.
DataFrame: Higher-level abstraction with schema and optimizations via Catalyst engine.
🔹 Example:
rdd = spark.sparkContext.parallelize([(1, 'Alice'), (2, 'Bob')])
df = rdd.toDF(['id', 'name'])
2. Transformations vs. Actions
Transformations are lazy (e.g., filter, map) and are only computed when an action (e.g., collect, count) is called.
🔹 Example:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd_filtered = rdd.filter(lambda x: x > 2)  # Transformation
print(rdd_filtered.collect())  # Action
In interviews, be ready to explain lazy evaluation and how it enables Spark’s optimizations.
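The snippet above needs Spark itself, but the idea of laziness can be mimicked in plain Python with a generator, which, like a transformation, does no work until something consumes it (the "action"). This analogy is mine, not Spark code:

```python
# Plain-Python analogy for lazy evaluation (not Spark itself):
# the generator below, like a Spark transformation, does nothing
# until an "action" consumes it.
def traced_filter(xs):
    for x in xs:
        print("evaluating", x)   # side effect makes evaluation visible
        if x > 2:
            yield x

lazy = traced_filter([1, 2, 3, 4])   # nothing printed yet: "transformation"
result = list(lazy)                  # evaluation happens here: "action"
print(result)
```

Nothing is printed when `lazy` is created; every "evaluating" line appears only when `list()` forces the work, which is exactly how Spark defers computation until `collect()` or `count()`.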
groupBy vs. groupByKey
groupBy lets you apply aggregation functions directly on the grouped data (efficient). groupByKey just groups the values, after which you must reduce them manually (less efficient).
🔹 Example:
from pyspark.sql.functions import avg
df.groupBy("department").agg(avg("salary"))
Joins in PySpark
This link provides a comprehensive overview of joins in PySpark.
Window Functions
Window functions in PySpark perform calculations across a set of rows that are related to the current row. They are used together with a Window specification (pyspark.sql.Window) that defines the partitioning and ordering of the rows the function is applied to.
This link provides a great overview of window functions in PySpark.
PySpark Repartitioning
When dealing with large datasets in PySpark, understanding how to manage data distribution is crucial. Two commonly used methods for this are repartition() and partitionBy(). While they might seem similar at first glance, they serve different purposes and are used in different contexts.
This link delves into the differences between the two methods, provides guidance on when to use each, and explores advanced strategies for controlling the number of output files when writing data.
Data Design Concepts
You can get most of these concepts from Grokking the System Design Interview, but I am still summarizing them below.
Proxy (Forward Proxy): Sits between client and server. Clients use it to access other servers.
Use case: Filtering or anonymizing web traffic (e.g., accessing a blocked website).
Reverse Proxy: Sits in front of servers and routes client requests to them.
Use case: Load balancing, caching, SSL termination.
WebSockets: A communication protocol enabling full-duplex, real-time interaction between client and server over a single persistent connection.
Use case: Chat apps, live notifications, online gaming.
Database Types:
Relational (SQL): Structured schema, ACID-compliant (e.g., PostgreSQL, MySQL).
NoSQL: Flexible schema for unstructured data (e.g., MongoDB, Redis).
NewSQL: Combines SQL-like structure with NoSQL scalability (e.g., CockroachDB).
Data Redundancy: Having duplicate data across the system.
Pros: Fault tolerance, faster access.
Cons: Inconsistencies, higher storage costs.
Managed via: normalization, replication control.
Database Architectures:
Monolithic: One large, centralized database.
Distributed: Data split across multiple locations (sharding, replication).
Federated: Multiple independent databases coordinated together.
Load Balancer: Distributes incoming traffic across multiple servers to maximize performance and reliability.
Types: Round-robin, least connections, IP hash.
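As a rough illustration (not production code), two of these strategies can be sketched in a few lines of Python; the server names are made up:

```python
from itertools import cycle
import zlib

servers = ["s1", "s2", "s3"]

# Round-robin: hand out servers in a fixed rotation.
rr = cycle(servers)
round_robin_picks = [next(rr) for _ in range(5)]
print(round_robin_picks)  # s1, s2, s3, s1, s2

# IP hash: a stable hash of the client IP always maps to the same
# server, which gives simple session stickiness.
def pick_by_ip(ip: str) -> str:
    return servers[zlib.crc32(ip.encode()) % len(servers)]

print(pick_by_ip("10.0.0.7") == pick_by_ip("10.0.0.7"))  # True: sticky
```

A real load balancer also tracks server health and connection counts; least-connections routing needs that live state, so it is omitted here.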
API Gateway: A single entry point that manages and routes API requests to backend services.
Functions: Authentication, rate limiting, request routing, caching.
Popular tools: Kong, AWS API Gateway, NGINX.
Bloom Filter: A probabilistic data structure that tests whether an element is in a set.
Fast, memory-efficient, but may return false positives.
Use case: Checking if a user already exists in a database.
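A minimal sketch of the idea (the bit-array size and hash count are arbitrary; real systems use tuned libraries instead):

```python
import hashlib

# Tiny Bloom filter sketch: k hash functions set k bits per item.
# Lookups may return false positives, but never false negatives.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by hashing the item with k different prefixes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True
print(bf.might_contain("bob@example.com"))    # almost certainly False
```

This is why the structure suits the "does this user already exist" check: a "no" is definitive, and only the occasional "maybe" needs a real database lookup.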
Idempotency: An operation is idempotent if repeating it has no additional effect.
Use case: A payment API should not charge twice if the same request is retried.
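One common way to achieve this is an idempotency key sent by the client; the sketch below uses an in-memory dict where a real API would use a database:

```python
# Idempotent request handling via an idempotency key
# (names and storage are hypothetical stand-ins).
processed = {}  # idempotency key -> result of the first attempt

def charge(idempotency_key, amount, ledger):
    if idempotency_key in processed:        # retry: return the cached result,
        return processed[idempotency_key]   # do NOT charge again
    ledger.append(amount)                   # the actual side effect
    processed[idempotency_key] = {"status": "charged", "amount": amount}
    return processed[idempotency_key]

ledger = []
charge("key-123", 50, ledger)
charge("key-123", 50, ledger)  # retried request with the same key
print(len(ledger))  # charged only once
```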
Microservices: Design principles for building loosely coupled, independently deployable services:
Single Responsibility Principle
Independent deployments
API-first communication (REST/gRPC)
Monitoring, scaling, and failover support
Caching: Temporarily storing data to improve speed and reduce load on backend systems.
Types: In-memory (Redis), browser, CDN.
Content Delivery Network (CDN)
A geographically distributed network of servers that deliver content (images, scripts, videos) faster by serving it from locations near the user.
Example: Cloudflare, Akamai.
Failover: Automatic switching to a backup system when the primary system fails.
Use case: High-availability systems like payment gateways or databases.
Heartbeat: Regular signals sent between systems to check that they are still alive.
Use case: Detecting failed services in a microservices architecture.
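The mechanism can be sketched in a few lines; the service name and timeout below are made up:

```python
# Heartbeat-based failure detection: a monitor marks a service dead
# if no heartbeat arrived within the timeout (values are illustrative).
last_beat = {}

def heartbeat(service, now):
    last_beat[service] = now

def is_alive(service, now, timeout=5.0):
    return service in last_beat and (now - last_beat[service]) <= timeout

heartbeat("payments", now=100.0)
print(is_alive("payments", now=103.0))  # True: last beat was 3s ago
print(is_alive("payments", now=110.0))  # False: 10s ago, past the timeout
```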
Caching Strategies:
Write-through: Data is written to the cache and the database simultaneously.
Write-back: Data written to cache first, later synced with DB.
Cache-aside (Lazy loading): App checks cache, then DB if needed.
Time-to-live (TTL): Sets expiry for cached data.
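Cache-aside with a TTL can be sketched as follows (a dict stands in for Redis, a function for the database; names are hypothetical):

```python
import time

# Cache-aside (lazy loading) with a TTL.
cache = {}     # key -> (value, expires_at)
db_reads = []  # track backend load for illustration

def db_get(key):
    db_reads.append(key)
    return f"value-for-{key}"

def get(key, ttl=60.0, now=None):
    now = time.time() if now is None else now
    hit = cache.get(key)
    if hit and hit[1] > now:          # fresh cache entry: serve it
        return hit[0]
    value = db_get(key)               # miss or expired: go to the database...
    cache[key] = (value, now + ttl)   # ...then populate the cache
    return value

get("user:1", now=0.0)    # miss    -> DB read
get("user:1", now=10.0)   # hit     -> no DB read
get("user:1", now=100.0)  # expired -> DB read again
print(db_reads)           # the backend was hit twice, not three times
```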
Hope this is helpful!