Process terabytes of data efficiently with distributed computing.
Apache Spark is the industry standard for large-scale batch data processing.
Core concepts:
- RDDs (Resilient Distributed Datasets) — low-level API
- DataFrame and Dataset API — high-level; Datasets add compile-time type safety (Scala/Java only)
- Spark SQL — run SQL queries directly on Spark DataFrames
- Lazy evaluation — transformations are not executed until an action is called (see the sketch after this list)
- Transformations: filter, select, groupBy, join, withColumn
- Actions: collect, count, show, write
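To make lazy evaluation concrete, here is a minimal PySpark sketch: the transformations only build up a logical plan, and nothing runs until the `show` action. The input path and column names (`events.parquet`, `status`, `timestamp`, `country`) are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.read.parquet("events.parquet")  # hypothetical input; no computation runs yet

# Transformations: each call only extends the logical plan.
errors = (
    df.filter(F.col("status") == "error")
      .withColumn("day", F.to_date("timestamp"))
      .groupBy("country", "day")
      .count()
)

# Action: triggers Catalyst optimisation and actual execution.
errors.show(10)
```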
Performance tuning:
- Partitioning — too few partitions underutilise the cluster; too many add scheduling and shuffle overhead
- Broadcast join — ship a small table to every executor to avoid a shuffle
- Caching (.cache(), .persist()) — reuse computed DataFrames
- Avoiding UDFs where native functions exist (Python UDFs run outside the JVM, so rows must be serialised to a Python process and Catalyst cannot optimise them)
- Salting — append a random suffix to hot keys so skewed data spreads across partitions (these techniques are sketched after this list)
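The sketch below illustrates three of these techniques: a broadcast join, caching a reused DataFrame, and salting a skewed join key. The table names, columns (`facts`, `dims`, `dim_id`), and salt factor are assumptions for illustration, not a drop-in recipe.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()
facts = spark.read.parquet("facts.parquet")  # large table (hypothetical)
dims = spark.read.parquet("dims.parquet")    # small lookup table (hypothetical)

# Broadcast join: ship the small table to every executor, skipping the shuffle.
joined = facts.join(F.broadcast(dims), on="dim_id")

# Caching: materialise a DataFrame that several downstream actions reuse.
joined.cache()
joined.count()  # the first action populates the cache

# Salting: split one hot key into N artificial sub-keys so the join work
# for a skewed key spreads across N partitions instead of one.
N = 16
salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
replicated_dims = dims.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))
salted_join = salted_facts.join(replicated_dims, on=["dim_id", "salt"])
```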
Spark on Cloud:
- AWS EMR, Google Dataproc, Azure HDInsight — managed Spark clusters
- Databricks — commercial Spark platform from Spark's creators, with Delta Lake, Unity Catalog, and collaborative notebooks
- Apache Spark Structured Streaming — near-real-time processing via micro-batches (minimal sketch below)
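As a closing illustration, a minimal Structured Streaming sketch: it reads from a Kafka topic and processes records in micro-batches with the same DataFrame API as batch jobs. The broker address and topic name are hypothetical, and the Kafka source requires the separate spark-sql-kafka connector package.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Hypothetical Kafka source; needs the spark-sql-kafka-0-10 connector on the classpath.
stream = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         .load()
)

# Each micro-batch is processed with ordinary DataFrame transformations.
counts = (
    stream.selectExpr("CAST(value AS STRING) AS value")
          .groupBy("value")
          .count()
)

query = (
    counts.writeStream
          .outputMode("complete")                # rewrite full counts each micro-batch
          .format("console")
          .trigger(processingTime="10 seconds")  # micro-batch interval
          .start()
)
query.awaitTermination()
```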