Back to Data Engineer
Detail

Databases & Storage

Choose the right storage for every problem — OLTP for transactions, OLAP for analytics, object stores for raw data.

Storage systems by use case:

OLTP (Online Transaction Processing) — operational databases:
- PostgreSQL, MySQL — ACID, row-oriented, transactional
- MongoDB — document model, flexible schema

OLAP (Online Analytical Processing) — analytical databases:
- Snowflake — cloud-native warehouse, near-zero management, time travel
- BigQuery — serverless, pay-per-query, excellent for Google Cloud
- Redshift — AWS columnar warehouse, good for large organisations on AWS
- ClickHouse — ultra-fast open-source OLAP, great for real-time analytics
- DuckDB — in-process OLAP, excellent for local analytics on Parquet

Data Lake Storage:
- AWS S3, GCS, Azure Blob — cheap, durable object storage
- Delta Lake, Apache Iceberg, Apache Hudi — table formats over object storage
- ACID transactions on data lakes
- Time travel, schema evolution, partition evolution
- Used with Spark, Trino, Flink

Key concepts:
- Row vs columnar storage (Parquet is columnar — great for analytics)
- Compression: Snappy, Zstd, GZIP trade-offs
- Partitioning and clustering for query performance