Data Engineer

Build and maintain data infrastructure — pipelines, warehouses, lakes, streaming systems, and the tooling that powers analytics and ML at scale.

Programming Foundations

Python and SQL are the two most important languages for a data engineer.

Python for Data Engineering

File I/O, HTTP clients, parsing (JSON/CSV/Parquet/Avro), logging, error handling, type hints.
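A minimal sketch of several of these skills together (CSV parsing, type hints, logging, error handling); the record fields and data here are hypothetical:

```python
import csv
import io
import json
import logging
from typing import Iterator

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def parse_orders(raw_csv: str) -> Iterator[dict[str, object]]:
    """Yield validated order records, skipping (and logging) malformed rows."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    for i, row in enumerate(reader, start=1):
        try:
            yield {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        except (KeyError, ValueError) as exc:
            log.warning("skipping row %d: %s", i, exc)

raw = "order_id,amount\n1,9.99\n2,oops\n3,15.00\n"
orders = list(parse_orders(raw))
print(json.dumps(orders))  # two valid rows survive; the bad one is logged
```

Skipping bad rows with a warning (rather than crashing the whole job) is a common pipeline default; whether to drop, quarantine, or fail loudly depends on the dataset.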

Advanced SQL

Window functions, CTEs, recursive queries, EXPLAIN plans, query optimisation, partitioning.
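Window functions and CTEs can be tried locally with nothing but Python's built-in `sqlite3` (SQLite supports both since version 3.25); the table and data are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(region TEXT, day TEXT, amount REAL);
INSERT INTO sales VALUES
  ('east','2024-01-01',100),('east','2024-01-02',150),
  ('west','2024-01-01',200),('west','2024-01-02',50);
""")

# A CTE feeding a window function: running total per region
query = """
WITH daily AS (
    SELECT region, day, amount FROM sales
)
SELECT region, day,
       SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
FROM daily
ORDER BY region, day;
"""
rows = con.execute(query).fetchall()
for r in rows:
    print(r)
```

The `PARTITION BY` restarts the running sum for each region, which a plain `GROUP BY` cannot express without self-joins.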

Bash & Scripting

Shell scripts for automation, cron jobs, file manipulation, environment management.

Data Warehouses

Snowflake, BigQuery, Redshift — dimensional modelling (star/snowflake schema), slowly changing dimensions.

Data Lakes & Lakehouses

S3/GCS + Delta Lake/Apache Iceberg — ACID on object storage, time travel, schema evolution.

File Formats

Parquet (columnar), Avro (schema evolution), ORC, JSON, CSV — when to use each format.
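One concrete difference worth internalising: CSV is untyped text, while formats like JSON (and, with schemas, Parquet/Avro) preserve types. A stdlib-only round-trip makes this visible (Parquet/Avro themselves need third-party libraries such as `pyarrow`, so this sketch sticks to CSV vs JSON):

```python
import csv
import io
import json

records = [{"id": 1, "price": 9.99, "active": True}]

# JSON round-trip preserves int/float/bool types
json_back = json.loads(json.dumps(records))

# CSV round-trip returns every value as a string
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price", "active"])
writer.writeheader()
writer.writerows(records)
buf.seek(0)
csv_back = list(csv.DictReader(buf))

print(type(json_back[0]["id"]).__name__)  # int
print(type(csv_back[0]["id"]).__name__)   # str
```

This is why CSV-based pipelines need an explicit casting step, and why typed columnar formats are the default for warehouse and lake storage.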

Apache Airflow

DAGs, operators, sensors, hooks, connections, XComs, task dependencies, scheduling.
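The core idea behind Airflow's task dependencies is a topological ordering of a DAG. This is not Airflow code, just a pure-Python sketch (Kahn's algorithm, with a hypothetical extract/transform/load pipeline) of how a scheduler decides what can run:

```python
from collections import defaultdict, deque

# Hypothetical pipeline: task name -> list of upstream dependencies
dag = {
    "extract": [],
    "transform": ["extract"],
    "quality_check": ["transform"],
    "load": ["quality_check"],
    "notify": ["load"],
}

def run_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: a task runs only after all its upstreams finish."""
    indegree = {task: len(ups) for task, ups in deps.items()}
    downstream = defaultdict(list)
    for task, ups in deps.items():
        for up in ups:
            downstream[up].append(task)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a valid DAG")
    return order

print(run_order(dag))
```

In real Airflow you declare the same structure with `>>` between tasks, and the scheduler additionally handles retries, backfills, and parallelism.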

dbt (data build tool)

SQL models, ref(), sources, tests, seeds, snapshots, macros, dbt Cloud vs Core.

Optional Prefect / Dagster

Modern orchestrators with better developer experience and observability than Airflow.

PySpark Fundamentals

DataFrames, transformations, actions, SparkSQL, reading/writing Parquet, S3, JDBC.
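The transformation-vs-action split is the key mental model: transformations are lazy and only describe a plan; an action triggers execution. A toy stand-in built on Python generators (not PySpark itself, and the `MiniRDD` name is invented) illustrates the behaviour:

```python
from typing import Callable, Iterable

class MiniRDD:
    """Toy stand-in for a Spark DataFrame/RDD: transformations are lazy,
    actions trigger evaluation."""

    def __init__(self, data: Iterable[int]):
        self._data = data

    def map(self, fn: Callable[[int], int]) -> "MiniRDD":
        return MiniRDD(fn(x) for x in self._data)        # lazy: nothing runs yet

    def filter(self, pred: Callable[[int], bool]) -> "MiniRDD":
        return MiniRDD(x for x in self._data if pred(x))  # lazy

    def collect(self) -> list[int]:
        return list(self._data)                           # action: evaluates the chain

result = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10).collect()
print(result)  # [12, 14, 16, 18]
```

Real Spark goes further: the lazy plan is optimised by Catalyst before any executor touches data, which is why chaining many transformations is cheap.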

Spark Performance Tuning

Partitioning, broadcast join, caching, avoiding data skew, executor tuning.

Optional Databricks

Unity Catalog, Delta Live Tables, Auto Loader, Databricks Workflows, SQL Warehouse.

Streaming & Real-Time Data

Process events as they happen — from user clicks to financial transactions.

Apache Kafka

Topics, partitions, consumer groups, offsets, Kafka Streams, Kafka Connect, exactly-once semantics.
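Two Kafka concepts worth sketching in plain Python (this is a conceptual model, not the Kafka client API; the byte-sum partitioner stands in for Kafka's murmur2 hash): keyed messages always land in the same partition, and each consumer group tracks its own committed offset per partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Stand-in for Kafka's default key hashing (really murmur2 % partitions)
    return sum(key.encode()) % NUM_PARTITIONS

log = defaultdict(list)  # partition -> append-only message list
for key, value in [("user1", "click"), ("user2", "view"), ("user1", "buy")]:
    log[partition_for(key)].append((key, value))

# Same key -> same partition, so per-key ordering is preserved
p = partition_for("user1")
print([v for k, v in log[p] if k == "user1"])  # ['click', 'buy']

# Each consumer group keeps its own committed offset per partition
offsets = {("analytics", part): 0 for part in range(NUM_PARTITIONS)}
batch = log[p][offsets[("analytics", p)]:]   # fetch from the committed offset
offsets[("analytics", p)] += len(batch)      # commit after processing
print(offsets[("analytics", p)])
```

Committing *after* processing gives at-least-once delivery; exactly-once semantics require Kafka's transactional producer on top of this model.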

Optional Apache Flink

True stream processing, event time vs processing time, watermarks, state management, windowing.
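Watermarks are easier to grasp with a tiny worked model. The sketch below is not Flink; it is a hand-rolled tumbling-window counter where the watermark trails the highest event time seen by a fixed allowed lateness, and events behind the watermark are dropped (the window sizes and events are invented):

```python
from collections import defaultdict

WINDOW = 10          # tumbling window size, in event-time seconds
MAX_LATENESS = 5     # watermark lags the max seen event time by this much

windows = defaultdict(int)   # window start -> event count
watermark = float("-inf")
dropped = []

# (event_time, payload) pairs arriving out of order
events = [(1, "a"), (12, "b"), (3, "c"), (25, "d"), (4, "e")]

for ts, _ in events:
    watermark = max(watermark, ts - MAX_LATENESS)
    if ts < watermark:
        dropped.append(ts)                 # too late: its window is finalized
        continue
    windows[(ts // WINDOW) * WINDOW] += 1  # assign to its tumbling window

print(dict(windows), dropped)
```

Note that event 3 survives (the watermark is still low when it arrives) but event 4 does not, because event 25 has already advanced the watermark past it; this is exactly the event-time vs arrival-order distinction Flink manages with state and timers.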

Optional Spark Structured Streaming

Micro-batch streaming, triggers, watermarks, checkpointing, Kafka source/sink.

Dimensional Modelling

Star schema, fact/dimension tables, Kimball methodology, slowly changing dimensions (SCD 1/2/3).
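SCD Type 2 is the pattern most worth practising: instead of overwriting a changed attribute, you close the current row and open a new one, preserving history. A minimal sketch with `sqlite3` (the `dim_customer` table and dates are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customer(
    customer_id INTEGER, city TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER
)""")
con.execute(
    "INSERT INTO dim_customer VALUES (42, 'Berlin', '2023-01-01', '9999-12-31', 1)"
)

def scd2_update(customer_id: int, new_city: str, change_date: str) -> None:
    """Type-2 change: expire the current row, then insert the new version."""
    con.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    con.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_city, change_date),
    )

scd2_update(42, "Munich", "2024-06-01")
rows = con.execute(
    "SELECT city, is_current FROM dim_customer "
    "WHERE customer_id = 42 ORDER BY valid_from"
).fetchall()
print(rows)  # [('Berlin', 0), ('Munich', 1)]
```

SCD 1 would simply `UPDATE` the city in place (no history); SCD 3 keeps the previous value in an extra column. In production this merge logic usually lives in a dbt snapshot or a warehouse `MERGE` statement rather than hand-written SQL.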

dbt Modelling Layers

Staging → Intermediate → Mart pattern, ref(), sources, tests, documentation.

Cloud Data Platforms

Build end-to-end data infrastructure on AWS, GCP, or Azure.

AWS Data Stack

S3, Glue, Athena, Redshift, EMR, Kinesis, Lake Formation, Step Functions.

Optional GCP Data Stack

BigQuery, Dataflow, Pub/Sub, Cloud Composer (Airflow), Dataproc, Looker.

Optional Data Observability

Monte Carlo, Datafold, dbt tests, Great Expectations — data quality monitoring.
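The core idea shared by these tools is declaring expectations against data and counting violations. A stdlib-only sketch in that spirit (the expectation names and sample rows are invented, and real Great Expectations / dbt tests add persistence, docs, and alerting on top):

```python
from typing import Callable

rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]

# Declarative checks: each maps a rule name to a per-row predicate
expectations: dict[str, Callable[[dict], bool]] = {
    "email_not_null": lambda r: r["email"] is not None,
    "age_non_negative": lambda r: r["age"] >= 0,
}

# Count failures per rule instead of crashing on the first bad row
report = {
    name: sum(1 for r in rows if not check(r))
    for name, check in expectations.items()
}
print(report)  # {'email_not_null': 1, 'age_non_negative': 1}
```

A pipeline would typically run such checks after each load and fail, warn, or quarantine based on thresholds per rule.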