
Batch Processing — Apache Spark

Process terabytes of data efficiently with distributed computing.

Apache Spark is the industry standard for large-scale batch data processing.

Core concepts:
- RDDs (Resilient Distributed Datasets) — low-level API
- DataFrame and Dataset API — high-level; Datasets add compile-time type safety (Scala/Java only)
- SparkSQL — run SQL queries on Spark DataFrames
- Lazy evaluation — transformations only build a plan; nothing executes until an action is called
- Transformations: filter, select, groupBy, join, withColumn
- Actions: collect, count, show, write

Performance tuning:
- Partitioning — too few or too many partitions hurt performance
- Broadcast join — ship a small table to every executor, avoiding a shuffle of the large table
- Caching (.cache(), .persist()) — reuse computed DataFrames
- Avoiding UDFs where native functions exist — Python UDFs run outside the JVM and miss Catalyst optimisations
- Salting — append a random component to skewed keys so hot partitions split evenly

Spark on Cloud:
- AWS EMR, Google Dataproc, Azure HDInsight — managed Spark clusters
- Databricks — premium Spark platform, Delta Lake, Unity Catalog, notebooks
- Apache Spark Structured Streaming — near real-time with micro-batches