Schedule, monitor, and manage complex multi-step data pipelines with proper dependency management.
Data pipelines move, transform, and load data between systems.
Apache Airflow:
- DAGs (Directed Acyclic Graphs) define pipeline structure
- Tasks: PythonOperator, BashOperator, SQL operators (e.g. SQLExecuteQueryOperator), Sensors
- Scheduling: cron expressions, presets (@daily), data intervals
- XComs for inter-task communication
- Connections and Variables for credentials and configuration
- Astronomer Cosmos for dbt integration
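The core idea behind a DAG is that the scheduler resolves dependency edges into an execution order in which every task runs after its upstream tasks. A minimal sketch of that ordering in plain Python using the standard library's graphlib (the task names and `deps` mapping here are hypothetical, not Airflow's API):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "notify": {"load"},
}

# Topological sort: dependencies always come before their dependents.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load', 'notify']
```

The "acyclic" requirement matters here: TopologicalSorter raises a CycleError if two tasks depend on each other, which is exactly why orchestrators reject cyclic pipelines.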
Alternatives:
- Prefect — more Pythonic, better error handling, Prefect Cloud
- Dagster — assets-first approach, better observability
- Mage.ai — modern alternative, notebook-style development
- dbt (data build tool) — SQL-based transformations, not an orchestrator
ELT vs ETL:
- ETL: Extract → Transform → Load (old approach, transform before loading)
- ELT: Extract → Load → Transform (modern: load raw data, transform in warehouse)
- ELT is preferred with powerful warehouses (BigQuery, Snowflake) and dbt
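The ELT pattern can be sketched with the standard library's sqlite3 standing in for a warehouse: raw rows are landed untouched first, and the transformation runs afterwards as SQL inside the warehouse, dbt-style (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for BigQuery/Snowflake

# Extract + Load: land the raw data without cleaning it.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 800, "refunded"), (3, 4300, "paid")],
)

# Transform: derive a clean model in SQL, inside the warehouse.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'paid'
""")

rows = conn.execute("SELECT id, amount_usd FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 12.5), (3, 43.0)]
```

Because the raw table is preserved, the transformation can be re-run or changed later without re-extracting from the source system, which is the main operational argument for ELT.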
Data quality:
- Great Expectations, Soda — data validation tests
- Schema registries for streaming data
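Tools like Great Expectations and Soda express quality checks as declarative expectations evaluated against a dataset. A minimal hand-rolled version of the same idea (the check names and the `run_checks` helper are illustrative, not either tool's API):

```python
# Each check is a (description, predicate-over-rows) pair.
checks = [
    ("no null ids", lambda rows: all(r["id"] is not None for r in rows)),
    ("amount is non-negative", lambda rows: all(r["amount"] >= 0 for r in rows)),
    ("ids are unique", lambda rows: len({r["id"] for r in rows}) == len(rows)),
]

def run_checks(rows, checks):
    """Return the descriptions of every failed check."""
    return [name for name, pred in checks if not pred(rows)]

data = [
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": -3.0},  # violates non-negativity
    {"id": 2, "amount": 7.0},   # duplicate id
]

failures = run_checks(data, checks)
print(failures)  # ['amount is non-negative', 'ids are unique']
```

In a real pipeline these checks run as a task between load and downstream consumption, failing the run (or quarantining rows) when expectations are violated.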