PySpark CheatSheets and Data Design Interview Questions
In my first job as a consultant, I focused on tech projects in the wealth management and reinsurance sectors. One of those projects involved working on Foundry, a tool by Palantir that is essentially a one-stop shop for ETL processes. We sourced data into Foundry and then built ETL pipelines on top of it. Now, I didn’t know PySpark when I started, but I did know Python and SQL, so it was a bit easier for me to learn. Nevertheless, because of the demands of the project, I had to spend a lot of time learning and building on the fly, which is honestly never the best experience. And while it still worked for me, the real challenge would come when a new consultant had to be onboarded. This was a great moment for me because I had helped grow the account, but I also knew I had to set him up for success. So, to ensure the new consultant didn’t have a tough experience, I created a document with all the PySpark information I could gather, as well as data design architecture notes. And since I was refining those notes anyway, I added interview questions that could help you too. To summarize, here’s what this article entails:
PySpark Cheatsheets
Data Design Architecture Explanation: Batch and Lambda Architecture
Data Design Interview Questions
PySpark Cheatsheets:
Palantir uses PySpark heavily themselves, so it’s no surprise that they created a PySpark syntax cheatsheet. It extensively covers topics like common patterns, joins, functions, string operations, number operations, aggregate operations, etc.
If you are using PySpark for data science, here is another cheatsheet that covers all the core functions, making it super easy to learn things like creating data frames, running SQL on Spark, and so on.
PySpark Cheatsheet for Beginners: Link
PySpark RDD Cheatsheet: Link
Data Design Architecture Explanation: Batch and Lambda Architecture
Batch Architecture Explained
Batch architecture is essentially collecting, processing, and storing data in groups, or batches, at scheduled intervals. Think of batch jobs that run at a daily, weekly, or monthly cadence. It’s the traditional form of data pipeline and is useful for large volumes of data that don’t require immediate processing.
Key Features of Batch Architecture
Scheduled Processing: The data is processed at a specific cadence, such as daily, weekly, or monthly
High Throughput: The system is designed to process large datasets efficiently
Latency Tolerant: It’s useful for use cases where delays are acceptable and urgency is not a priority. Think generating daily reports.
Storage & Processing Separation: Data is first stored in a data lake or a warehouse and then it is processed.
Typical ETL process:
Information sources: Logs, APIs, Databases
Data Ingestion: Typically handled through tools like AWS Data Pipeline or Apache Sqoop
Batch Processing Engine: Hadoop MapReduce, Apache Spark
Storage Layer: Data lakes like HDFS or S3; warehouses like Snowflake or BigQuery
Output: Downstream files, dashboards, reports
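To make these stages concrete, here is a minimal PySpark sketch of a daily batch job; the bucket names, paths, and columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch-etl").getOrCreate()

# Ingest: read raw logs that landed in the data lake (hypothetical path)
raw = spark.read.json("s3://my-data-lake/raw/logs/2024-01-01/")

# Process: clean and aggregate the whole day's data in one batch
daily_summary = (
    raw.filter(F.col("status") == "OK")
       .groupBy("user_id")
       .agg(F.count("*").alias("events"),
            F.sum("amount").alias("total_amount"))
)

# Store: write the result back to the lake in a columnar format
daily_summary.write.mode("overwrite").parquet(
    "s3://my-data-lake/curated/daily_summary/2024-01-01/"
)
```

A scheduler such as Airflow or cron would then run this script at the chosen cadence.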
Batch Processing Interview Questions and Answers:
What is batch processing, and how does it differ from stream processing?
Batch processing involves collecting and processing data in large, discrete chunks at scheduled intervals. Stream processing handles data in real time or near-real time as it arrives.
When do you choose batch processing over streaming?
Choose batch processing when:
Real-time insights are not needed.
Large volumes of data must be processed efficiently.
Cost and simplicity are priorities.
Use cases include reporting, backups, and data warehousing.
How do you optimize batch jobs for performance?
Partitioning and bucketing
Caching intermediate results
Using efficient file formats like Parquet
Tuning Spark configurations (e.g., memory, parallelism)
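Here is a minimal sketch of a few of these levers in PySpark; the paths are hypothetical and the config values are illustrative, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Tune configs at session creation (values are illustrative)
spark = (
    SparkSession.builder
    .appName("optimized-batch-job")
    .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism
    .config("spark.executor.memory", "4g")          # executor memory
    .getOrCreate()
)

# Read an efficient columnar format (hypothetical path)
orders = spark.read.parquet("s3://my-bucket/orders/")

# Cache an intermediate result that several aggregations reuse
recent = orders.filter(F.col("order_date") >= "2024-01-01").cache()

by_region = recent.groupBy("region").count()
by_product = recent.groupBy("product_id").agg(F.sum("amount").alias("revenue"))
```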
How do you handle scaling in batch processing systems?
Use distributed computing frameworks (e.g., Spark)
Autoscale compute resources in the cloud
Optimize resource allocation per job type
What techniques do you use to manage large volumes of data in batch jobs?
Data partitioning
Filtering early in the pipeline
Efficient storage formats (e.g., ORC, Parquet)
Columnar processing
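A short PySpark sketch of early filtering, column pruning, and partitioned output, again assuming hypothetical paths and columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("large-volume-batch").getOrCreate()

# Filter early and select only the needed columns, so less data
# flows through the rest of the pipeline
events = (
    spark.read.parquet("s3://my-bucket/events/")         # hypothetical path
         .filter(F.col("event_date") == "2024-01-01")    # predicate pushdown
         .select("user_id", "event_type", "event_date")  # column pruning
)

# Partition the output on a low-cardinality column so downstream
# jobs can skip irrelevant partitions entirely
(events.write
       .partitionBy("event_date")
       .mode("overwrite")
       .parquet("s3://my-bucket/events_by_date/"))
```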
How would you design a nightly ETL job to process millions of records?
Use Spark for distributed ETL
Store raw data in S3
Schedule with Airflow
Monitor job metrics and set up retry policies
Load cleaned data into Redshift or Snowflake
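As a sketch of the orchestration side, here is what the Airflow DAG might look like, assuming the Spark provider package is installed; the script path and connection ID are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Retry policy so transient failures don't need manual reruns
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run nightly at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    # Submit the Spark job that reads raw data from S3, cleans it,
    # and loads the result into the warehouse
    run_etl = SparkSubmitOperator(
        task_id="run_spark_etl",
        application="/opt/jobs/nightly_etl.py",  # hypothetical script
        conn_id="spark_default",
    )
```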
How do you handle schema changes in batch data pipelines?
Use schema evolution in storage formats (e.g., Avro, Parquet)
Maintain schema registry
Version datasets and track changes
Validate schema compatibility before processing
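For example, Parquet supports merging compatible schemas at read time; a minimal PySpark sketch, with a hypothetical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# mergeSchema reconciles Parquet files written with different but
# compatible schemas, e.g. when a new optional column was added
# between batch runs
df = (
    spark.read
         .option("mergeSchema", "true")
         .parquet("s3://my-bucket/events/")  # hypothetical path
)
df.printSchema()  # union of all columns seen across the files
```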
Real-time Data Processing Interview Questions and Answers
What is real-time data processing?
Real-time processing means handling data as it arrives and returning results within a guaranteed time frame, often with immediate or very low latency. It is used in fields where quick data processing is essential, such as finance, healthcare, and IoT.
Key Features of Real-Time Architecture
Low Latency: Processes and responds to input almost instantly.
High Availability: Designed to be fault-tolerant and run continuously with minimal downtime.
Scalability: Can handle large and variable volumes of data without performance loss.
Concurrency: Supports simultaneous execution of multiple tasks or events.
Determinism: Guarantees a predictable response time to events.
Data Streaming: Handles continuous flow of data from sources like sensors or user inputs.
Event-Driven: Triggers processes based on real-time events or changes in state.
Real-Time Processing Interview Questions and Answers:
What technologies are commonly used in real-time architecture?
Message brokers: Apache Kafka, RabbitMQ
Stream processing: Apache Flink, Apache Storm, Spark Streaming
Data stores: Redis, Cassandra, or InfluxDB for fast reads/writes
Monitoring: Prometheus, Grafana
How do you ensure high availability in real-time systems?
By implementing:
Redundant components
Load balancing
Failover mechanisms
Monitoring and alerting
Distributed systems with graceful degradation
Explain the CAP theorem in the context of real-time systems.
In real-time systems, we often prioritize availability and partition tolerance over strong consistency. This keeps the system running and responsive even if some nodes fail or become partitioned, with eventual consistency often being acceptable.
Explain windowing in stream processing.
Windowing allows real-time systems to group incoming data into time-based segments (e.g., 5-second windows) for aggregation or analysis. Common types: tumbling, sliding, session windows.
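A minimal Spark Structured Streaming sketch of a tumbling window, assuming a hypothetical Kafka topic and that the Kafka connector package is available:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

# Read a stream of events from Kafka (servers and topic are hypothetical)
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load()
)

# Tumbling 5-second windows: each event falls into exactly one window
counts = (
    events.withWatermark("timestamp", "30 seconds")  # tolerate 30s lateness
          .groupBy(F.window("timestamp", "5 seconds"))
          .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
```

A sliding window would pass a slide duration as a third argument, e.g. F.window("timestamp", "10 seconds", "5 seconds").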
What’s the difference between stream processing and micro-batching?
Stream processing: Processes data event-by-event with near-zero delay.
Micro-batching: Groups events into tiny batches (typically hundreds of milliseconds to a few seconds) before processing, e.g., Spark Streaming.
What’s the difference between real-time and near real-time?
Real-time: Response within milliseconds or seconds
Near real-time: Slight delay, often a few seconds to minutes, acceptable for less critical applications
How do you handle data consistency in distributed real-time systems?
Use eventual consistency models
Implement idempotent operations to avoid duplication
Apply exactly-once processing guarantees with tools like Kafka + Flink
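The question mentions Kafka + Flink; as a sketch of the same idea in Spark Structured Streaming, checkpointing tracks processed offsets while an idempotent write keeps replayed micro-batches harmless. Paths and topic names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exactly-once-sketch").getOrCreate()

stream = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical
         .option("subscribe", "payments")
         .load()
)

def write_batch(batch_df, batch_id):
    # foreachBatch is at-least-once: a micro-batch can be replayed after a
    # failure, so the write itself must be idempotent. Writing to a path
    # derived from batch_id and overwriting makes replays harmless.
    (batch_df.write
             .mode("overwrite")
             .parquet(f"s3://my-bucket/payments/batch_id={batch_id}/"))

query = (
    stream.writeStream
          .foreachBatch(write_batch)
          .option("checkpointLocation", "s3://my-bucket/checkpoints/payments/")
          .start()
)
```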
Hope this is helpful!