Data Engineer

Build and maintain data infrastructure — pipelines, warehouses, lakes, streaming systems, and the tooling that powers analytics and ML at scale.

Programming Foundations

Python and SQL are the two most important languages for a data engineer.

Python for Data Engineering

File I/O, HTTP clients, parsing (JSON/CSV/Parquet/Avro), logging, error handling, type hints.
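A minimal sketch of several of these skills together (CSV parsing, type hints, logging, error handling); the record fields and data here are hypothetical:

```python
import csv
import io
import json
import logging
from typing import Iterator

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def parse_orders(raw_csv: str) -> Iterator[dict[str, object]]:
    """Yield validated order records, skipping (and logging) malformed rows."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    for i, row in enumerate(reader, start=1):
        try:
            yield {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        except (KeyError, ValueError) as exc:
            log.warning("skipping row %d: %s", i, exc)

raw = "order_id,amount\n1,9.99\n2,oops\n3,15.00\n"
orders = list(parse_orders(raw))
print(json.dumps(orders))  # two valid rows survive; the bad one is logged
```

Skipping bad rows with a warning (rather than crashing the whole job) is a common pipeline default; whether to drop, quarantine, or fail loudly depends on the dataset.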

Advanced SQL

Window functions, CTEs, recursive queries, EXPLAIN plans, query optimisation, partitioning.
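Window functions and CTEs can be tried locally with nothing but Python's built-in `sqlite3` (SQLite supports both since version 3.25); the table and data are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(region TEXT, day TEXT, amount REAL);
INSERT INTO sales VALUES
  ('east','2024-01-01',100),('east','2024-01-02',150),
  ('west','2024-01-01',200),('west','2024-01-02',50);
""")

# A CTE feeding a window function: running total per region
query = """
WITH daily AS (
    SELECT region, day, amount FROM sales
)
SELECT region, day,
       SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
FROM daily
ORDER BY region, day;
"""
rows = con.execute(query).fetchall()
for r in rows:
    print(r)
```

The `PARTITION BY` restarts the running sum for each region, which a plain `GROUP BY` cannot express without self-joins.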

Bash & Scripting

Shell scripts for automation, cron jobs, file manipulation, environment management.

Data Warehouses

Snowflake, BigQuery, Redshift — dimensional modelling (star/snowflake schema), slowly changing dimensions.

Data Lakes & Lakehouses

S3/GCS + Delta Lake/Apache Iceberg — ACID on object storage, time travel, schema evolution.

File Formats

Parquet (columnar), Avro (schema evolution), ORC, JSON, CSV — when to use each format.
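One concrete difference worth internalising: CSV is untyped text, while formats like JSON (and, with schemas, Parquet/Avro) preserve types. A stdlib-only round-trip makes this visible (Parquet/Avro themselves need third-party libraries such as `pyarrow`, so this sketch sticks to CSV vs JSON):

```python
import csv
import io
import json

records = [{"id": 1, "price": 9.99, "active": True}]

# JSON round-trip preserves int/float/bool types
json_back = json.loads(json.dumps(records))

# CSV round-trip returns every value as a string
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price", "active"])
writer.writeheader()
writer.writerows(records)
buf.seek(0)
csv_back = list(csv.DictReader(buf))

print(type(json_back[0]["id"]).__name__)  # int
print(type(csv_back[0]["id"]).__name__)   # str
```

This is why CSV-based pipelines need an explicit casting step, and why typed columnar formats are the default for warehouse and lake storage.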

Apache Airflow

DAGs, operators, sensors, hooks, connections, XComs, task dependencies, scheduling.
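The core idea behind Airflow's task dependencies is a topological ordering of a DAG. This is not Airflow code, just a pure-Python sketch (Kahn's algorithm, with a hypothetical extract/transform/load pipeline) of how a scheduler decides what can run:

```python
from collections import defaultdict, deque

# Hypothetical pipeline: task name -> list of upstream dependencies
dag = {
    "extract": [],
    "transform": ["extract"],
    "quality_check": ["transform"],
    "load": ["quality_check"],
    "notify": ["load"],
}

def run_order(deps: dict[str, list[str]]) -> list[str]:
    """Kahn's algorithm: a task runs only after all its upstreams finish."""
    indegree = {task: len(ups) for task, ups in deps.items()}
    downstream = defaultdict(list)
    for task, ups in deps.items():
        for up in ups:
            downstream[up].append(task)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a valid DAG")
    return order

print(run_order(dag))
```

In real Airflow you declare the same structure with `>>` between tasks, and the scheduler additionally handles retries, backfills, and parallelism.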

dbt (data build tool)

SQL models, ref(), sources, tests, seeds, snapshots, macros, dbt Cloud vs Core.

Optional Prefect / Dagster

Modern orchestrators with better developer experience and observability than Airflow.

PySpark Fundamentals

DataFrames, transformations, actions, SparkSQL, reading/writing Parquet, S3, JDBC.
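The transformation-vs-action split is the key mental model: transformations are lazy and only describe a plan; an action triggers execution. A toy stand-in built on Python generators (not PySpark itself, and the `MiniRDD` name is invented) illustrates the behaviour:

```python
from typing import Callable, Iterable

class MiniRDD:
    """Toy stand-in for a Spark DataFrame/RDD: transformations are lazy,
    actions trigger evaluation."""

    def __init__(self, data: Iterable[int]):
        self._data = data

    def map(self, fn: Callable[[int], int]) -> "MiniRDD":
        return MiniRDD(fn(x) for x in self._data)        # lazy: nothing runs yet

    def filter(self, pred: Callable[[int], bool]) -> "MiniRDD":
        return MiniRDD(x for x in self._data if pred(x))  # lazy

    def collect(self) -> list[int]:
        return list(self._data)                           # action: evaluates the chain

result = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10).collect()
print(result)  # [12, 14, 16, 18]
```

Real Spark goes further: the lazy plan is optimised by Catalyst before any executor touches data, which is why chaining many transformations is cheap.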

Spark Performance Tuning

Partitioning, broadcast join, caching, avoiding data skew, executor tuning.

Optional Databricks

Unity Catalog, Delta Live Tables, Auto Loader, Databricks Workflows, SQL Warehouse.

Streaming & Real-Time Data

Process events as they happen — from user clicks to financial transactions.

Apache Kafka

Topics, partitions, consumer groups, offsets, Kafka Streams, Kafka Connect, exactly-once semantics.
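Two Kafka concepts worth sketching in plain Python (this is a conceptual model, not the Kafka client API; the byte-sum partitioner stands in for Kafka's murmur2 hash): keyed messages always land in the same partition, and each consumer group tracks its own committed offset per partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Stand-in for Kafka's default key hashing (really murmur2 % partitions)
    return sum(key.encode()) % NUM_PARTITIONS

log = defaultdict(list)  # partition -> append-only message list
for key, value in [("user1", "click"), ("user2", "view"), ("user1", "buy")]:
    log[partition_for(key)].append((key, value))

# Same key -> same partition, so per-key ordering is preserved
p = partition_for("user1")
print([v for k, v in log[p] if k == "user1"])  # ['click', 'buy']

# Each consumer group keeps its own committed offset per partition
offsets = {("analytics", part): 0 for part in range(NUM_PARTITIONS)}
batch = log[p][offsets[("analytics", p)]:]   # fetch from the committed offset
offsets[("analytics", p)] += len(batch)      # commit after processing
print(offsets[("analytics", p)])
```

Committing *after* processing gives at-least-once delivery; exactly-once semantics require Kafka's transactional producer on top of this model.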

Optional Apache Flink

True stream processing, event time vs processing time, watermarks, state management, windowing.
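Watermarks are easier to grasp with a tiny worked model. The sketch below is not Flink; it is a hand-rolled tumbling-window counter where the watermark trails the highest event time seen by a fixed allowed lateness, and events behind the watermark are dropped (the window sizes and events are invented):

```python
from collections import defaultdict

WINDOW = 10          # tumbling window size, in event-time seconds
MAX_LATENESS = 5     # watermark lags the max seen event time by this much

windows = defaultdict(int)   # window start -> event count
watermark = float("-inf")
dropped = []

# (event_time, payload) pairs arriving out of order
events = [(1, "a"), (12, "b"), (3, "c"), (25, "d"), (4, "e")]

for ts, _ in events:
    watermark = max(watermark, ts - MAX_LATENESS)
    if ts < watermark:
        dropped.append(ts)                 # too late: its window is finalized
        continue
    windows[(ts // WINDOW) * WINDOW] += 1  # assign to its tumbling window

print(dict(windows), dropped)
```

Note that event 3 survives (the watermark is still low when it arrives) but event 4 does not, because event 25 has already advanced the watermark past it; this is exactly the event-time vs arrival-order distinction Flink manages with state and timers.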

Optional Spark Structured Streaming

Micro-batch streaming, triggers, watermarks, checkpointing, Kafka source/sink.

Dimensional Modelling

Star schema, fact/dimension tables, Kimball methodology, slowly changing dimensions (SCD 1/2/3).
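SCD Type 2 is the pattern most worth practising: instead of overwriting a changed attribute, you close the current row and open a new one, preserving history. A minimal sketch with `sqlite3` (the `dim_customer` table and dates are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customer(
    customer_id INTEGER, city TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER
)""")
con.execute(
    "INSERT INTO dim_customer VALUES (42, 'Berlin', '2023-01-01', '9999-12-31', 1)"
)

def scd2_update(customer_id: int, new_city: str, change_date: str) -> None:
    """Type-2 change: expire the current row, then insert the new version."""
    con.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    con.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_city, change_date),
    )

scd2_update(42, "Munich", "2024-06-01")
rows = con.execute(
    "SELECT city, is_current FROM dim_customer "
    "WHERE customer_id = 42 ORDER BY valid_from"
).fetchall()
print(rows)  # [('Berlin', 0), ('Munich', 1)]
```

SCD 1 would simply `UPDATE` the city in place (no history); SCD 3 keeps the previous value in an extra column. In production this merge logic usually lives in a dbt snapshot or a warehouse `MERGE` statement rather than hand-written SQL.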

dbt Modelling Layers

Staging → Intermediate → Mart pattern, ref(), sources, tests, documentation.

Cloud Data Platforms

Build end-to-end data infrastructure on AWS, GCP, or Azure.

AWS Data Stack

S3, Glue, Athena, Redshift, EMR, Kinesis, Lake Formation, Step Functions.

Optional GCP Data Stack

BigQuery, Dataflow, Pub/Sub, Cloud Composer (Airflow), Dataproc, Looker.

Optional Data Observability

Monte Carlo, Datafold, dbt tests, Great Expectations — data quality monitoring.
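The core idea shared by these tools is declaring expectations against data and counting violations. A stdlib-only sketch in that spirit (the expectation names and sample rows are invented, and real Great Expectations / dbt tests add persistence, docs, and alerting on top):

```python
from typing import Callable

rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]

# Declarative checks: each maps a rule name to a per-row predicate
expectations: dict[str, Callable[[dict], bool]] = {
    "email_not_null": lambda r: r["email"] is not None,
    "age_non_negative": lambda r: r["age"] >= 0,
}

# Count failures per rule instead of crashing on the first bad row
report = {
    name: sum(1 for r in rows if not check(r))
    for name, check in expectations.items()
}
print(report)  # {'email_not_null': 1, 'age_non_negative': 1}
```

A pipeline would typically run such checks after each load and fail, warn, or quarantine based on thresholds per rule.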