LEARN DATA ENGINEERING DEEP DIVE
Learn Data Engineering: From CSV to Cloud Data Warehouse
Goal: To deeply understand the world of data warehousing and business intelligence—from fundamental ETL principles and data modeling to building, orchestrating, and visualizing data with the modern data stack.
Why Learn Data Engineering?
Every modern company runs on data, but raw data is messy, scattered, and often useless. Data engineering is the discipline of building the systems that collect, transform, and store this data, turning it into a reliable, queryable, and valuable asset. It’s the hidden backbone of business intelligence, data science, and machine learning.
After completing these projects, you will:
- Understand the difference between a transactional database and an analytical data warehouse.
- Master the Extract, Load, Transform (ELT) workflow.
- Design and implement robust data models like star schemas.
- Use SQL for complex data transformation and analysis.
- Build and schedule automated data pipelines.
- Connect Business Intelligence (BI) tools to a warehouse to create insightful dashboards.
- Be proficient with foundational tools of the modern data stack like dbt, Airflow, and Snowflake/BigQuery/DuckDB.
Core Concept Analysis
The Modern Data Engineering Lifecycle
This flow shows how raw data is turned into business value.
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ OLTP Databases │ │ SaaS APIs │ │ Log Files │
│ (e.g., MySQL, │ │ (e.g., Salesforce,│ │ (e.g., Nginx, │
│ Postgres) │ │ Stripe, Hubspot) │ │ Mobile App) │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
└─────────────┬──────┴─────────────┬──────┘
▼
┌───────────────────┐
│ EXTRACT & LOAD │ (Tools: Fivetran, Airbyte, Custom Scripts)
│ (Get data AS-IS) │
└───────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ DATA WAREHOUSE (DWH) │
│ (e.g., Snowflake, BigQuery, Redshift, DuckDB) │
│ │
│ ┌─────────────────┐ ┌─────────────────────────────────┐ │
│ │ Raw Data │──────▶│ TRANSFORM (with dbt, SQL) │ │
│ │ (Exact Copies) │ │ (Clean, Join, Aggregate, Model) │ │
│ └─────────────────┘ └─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Modeled Data │ │
│ │ (Star Schemas) │ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────┐
│ BUSINESS │ (Tools: Looker, Tableau, Metabase)
│ INTELLIGENCE (BI) │
└────────────────────┘
│
▼
┌────────────────────────┐
│ Dashboards & Insights │
│ (e.g., Revenue Reports)│
└────────────────────────┘
Key Concepts Explained
1. OLTP vs. OLAP
Databases are optimized for different tasks. This is the most fundamental concept in data engineering.
| Aspect | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Run the business (e.g., process orders) | Analyze the business (e.g., find top products) |
| Workload | Many small, fast reads & writes | Fewer, complex queries over lots of data |
| Optimization | Row-oriented storage | Column-oriented storage |
| Schema | Normalized (3NF) to avoid redundancy | Denormalized (Star/Snowflake) for fast queries |
| Example | PostgreSQL, MySQL | Snowflake, BigQuery, DuckDB |
2. The Star Schema
The most common data modeling pattern for data warehouses. It’s designed for fast, simple queries.
┌──────────────────┐
│ dim_customers │
│------------------│
│ customer_key (PK)│
│ name │
│ address │
└──────────────────┘
▲
│
┌─────────────────┐ ┌───────────────────────────┐ ┌────────────────┐
│ dim_products │ │ fact_sales │ │ dim_date │
│-----------------│ │---------------------------│ │----------------│
│ product_key (PK)│ │ date_key (FK) │ │ date_key (PK) │
│ name ├─┤ product_key (FK) │─┤ full_date │
│ category │ │ customer_key (FK) │ │ month_name │
└─────────────────┘ │ store_key (FK) │ │ year │
│ ------------------------- │ │ day_of_week │
│ amount │ └────────────────┘
│ quantity │
└───────────────────────────┘
- Fact Table: Contains measurements of business events (the “facts,” like amount). It consists mostly of foreign keys to dimensions and the numeric facts themselves.
- Dimension Tables: Contain the context for the events (the “who, what, where, when,” like customer_name or product_category). A worked query against this schema follows.
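To make the payoff concrete, here is a minimal sketch of what querying a star schema looks like, using DuckDB from Python. The table and column names follow the diagram above; the file name `warehouse.duckdb` is an assumption for illustration.

```python
# A minimal sketch of querying a star schema with DuckDB from Python.
# Assumes a local file `warehouse.duckdb` already contains the tables from the
# diagram above (fact_sales, dim_products, dim_date); names are illustrative.
import duckdb

con = duckdb.connect("warehouse.duckdb")

# One narrow fact table joined to its dimensions answers a business question
# in a single, simple query: revenue by product category and year.
rows = con.execute("""
    SELECT
        p.category,
        d.year,
        SUM(f.amount) AS total_revenue
    FROM fact_sales AS f
    JOIN dim_products AS p ON f.product_key = p.product_key
    JOIN dim_date     AS d ON f.date_key    = d.date_key
    GROUP BY p.category, d.year
    ORDER BY total_revenue DESC
""").fetchall()

for category, year, revenue in rows:
    print(category, year, revenue)
```

Notice that every join follows a key from the fact table straight to a dimension; there are no long chains of joins as in a normalized OLTP schema.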
3. ELT (Extract, Load, Transform)
The modern paradigm for data pipelines.
- Extract: Pull raw data from the source.
- Load: Load the raw data directly into the data warehouse into a “staging” area.
- Transform: Use the power of the data warehouse itself (with SQL) to clean, join, and model the data into its final, queryable form. This is where tools like dbt shine.
This is preferred over the older ETL model because it’s more flexible—you always have the raw data available to re-transform if business logic changes.
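Here is ELT in miniature, sketched with DuckDB: the raw file is loaded as-is into a staging table, and the cleaning happens afterwards in SQL inside the warehouse. The file name and column names are assumptions for illustration.

```python
# ELT in miniature with DuckDB: load raw data AS-IS, then transform with SQL
# inside the warehouse. The file `raw_orders.csv` and its columns are illustrative.
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Extract + Load: copy the raw file into a staging table without cleaning it.
con.execute("CREATE SCHEMA IF NOT EXISTS staging")
con.execute("""
    CREATE OR REPLACE TABLE staging.raw_orders AS
    SELECT * FROM read_csv_auto('raw_orders.csv')
""")

# Transform: the warehouse does the heavy lifting in SQL. Because the raw
# table is untouched, this step can be re-run whenever the logic changes.
con.execute("CREATE SCHEMA IF NOT EXISTS analytics")
con.execute("""
    CREATE OR REPLACE TABLE analytics.daily_revenue AS
    SELECT CAST(order_date AS DATE) AS order_day,
           SUM(price * quantity)    AS revenue
    FROM staging.raw_orders
    GROUP BY 1
    ORDER BY 1
""")
```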
Project List
These projects will guide you through building your own data stack, piece by piece.
Project 1: Local Product Sales Analysis
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Data Transformation / Data Cleaning
- Software or Tool: Pandas library, CSV files
- Main Book: “Python for Data Analysis” by Wes McKinney
What you’ll build: A Python script that reads three separate CSV files (customers, products, orders), cleans them, joins them together, calculates total revenue per customer, and outputs a final customer_revenue.csv report.
Why it teaches data engineering: This is a microcosm of a transformation job. It forces you to think about data quality, how to join different entities (the core of a star schema), and how to aggregate data to create a useful metric. It is the ‘T’ in ELT, performed locally.
Core challenges you’ll face:
- Handling messy data → maps to dealing with missing values, incorrect data types (e.g., price as a string), and inconsistent formats
- Joining multiple data sources → maps to using `pandas.merge` to combine your files, similar to a SQL `JOIN`
- Aggregating data to create metrics → maps to using `groupby` and aggregate functions (`sum`, `count`) to answer a business question
- Structuring code for readability → maps to creating a repeatable script, not a one-off notebook jungle
Key Concepts:
- DataFrames: The core data structure in Pandas. “Python for Data Analysis” Chapter 5.
- Data Cleaning: Handling missing data (`dropna`, `fillna`). “Python for Data Analysis” Chapter 7.
- Combining Data: `merge` and `join` operations. “Python for Data Analysis” Chapter 8.
- Aggregation: `groupby`. “Python for Data Analysis” Chapter 10.
Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Basic Python syntax.
Real world outcome:
You run python process_sales.py and it generates a new CSV file. When you open customer_revenue.csv, you see a clean list of customer names and the total amount they have spent, sorted from highest to lowest.
Implementation Hints:
- Load each CSV into a separate Pandas DataFrame.
- Inspect each DataFrame using `.info()` and `.head()` to understand the data types and look for obvious issues.
- Clean each one. Convert date columns to `datetime` objects, price columns to `float`, etc.
- First, merge `orders` and `products` on `product_id`. Then, merge the result with `customers` on `customer_id`.
- Create a new `revenue` column by multiplying `price` and `quantity`.
- Use `.groupby('customer_name')['revenue'].sum()` to get the final metric (a minimal sketch follows these hints).
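A minimal sketch of the script these hints describe. The file names, column names, and join keys are assumptions based on the project description; adapt them to your own CSVs.

```python
# process_sales.py -- a minimal sketch of the transformation described above.
# File and column names are assumptions based on the project description.
import pandas as pd

customers = pd.read_csv("customers.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")

# Clean: enforce sensible types and drop rows that cannot be repaired.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
products["price"] = pd.to_numeric(products["price"], errors="coerce")
orders = orders.dropna(subset=["customer_id", "product_id", "quantity"])

# Join: orders -> products (for the price), then -> customers (for the name).
enriched = orders.merge(products, on="product_id", how="left")
enriched = enriched.merge(customers, on="customer_id", how="left")

# Aggregate: total revenue per customer, highest spenders first.
enriched["revenue"] = enriched["price"] * enriched["quantity"]
report = (
    enriched.groupby("customer_name")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
report.to_csv("customer_revenue.csv", index=False)
```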
Learning milestones:
- You can load and clean a single CSV → You understand data types and handling missing data.
- You can join two files correctly → You understand the concept of keys and joins.
- You can produce the final aggregated report → You have performed a full, local transformation.
- Your script is runnable and produces the same output every time → You have built a basic, repeatable pipeline.
Project 2: Web Server Log Processor
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Extraction / Parsing / Loading
- Software or Tool: Python, SQLite, Regex
- Main Book: “Fluent Python” by Luciano Ramalho
What you’ll build: A script that reads a raw Nginx or Apache log file line by line, parses each line using regular expressions to extract structured fields (IP address, timestamp, URL, status code, user agent), and loads the structured data into a local SQLite database.
Why it teaches data engineering: This project covers the ‘E’ (Extract) and ‘L’ (Load) from a common, semi-structured data source. You’ll learn how to impose a schema on messy text data and use a database for storage, which is a massive step up from CSV files.
Core challenges you’ll face:
- Parsing unstructured text → maps to writing a regular expression to reliably capture fields from a complex string
- Setting up a database schema → maps to writing `CREATE TABLE` SQL statements with appropriate data types
- Loading data into a database → maps to using a database library to execute `INSERT` statements efficiently
- Handling errors and bad lines → maps to making your script robust enough to skip malformed log entries without crashing
Key Concepts:
- Regular Expressions: “Automate the Boring Stuff with Python” Chapter 7.
- SQLite: A self-contained, serverless database engine. The Python `sqlite3` module documentation is excellent.
- Database Transactions: Using `connection.commit()` to save data in batches for efficiency.
Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Project 1, understanding of basic SQL (CREATE, INSERT).
Real world outcome:
You run python process_logs.py access.log. After it finishes, you can open the resulting logs.db file with a SQLite browser and see a clean, structured table of your log data. You can run SQL queries like SELECT status_code, COUNT(*) FROM logs GROUP BY status_code; to analyze the data.
Implementation Hints:
- Find a sample Nginx or Apache log file online.
- Use a site like regex101.com to interactively build and test your regular expression for parsing the log lines.
- Your Python script should first connect to the SQLite database and run a `CREATE TABLE IF NOT EXISTS ...` command.
- Read the log file line by line. For each line, try to match your regex.
- If it matches, extract the captured groups, clean them up (e.g., convert the timestamp string to a format SQLite understands), and execute an `INSERT` statement.
- Wrap your inserts in a transaction, committing every 1000 lines or so, for much better performance than committing after every line (a minimal sketch follows these hints).
Learning milestones:
- Your regex correctly parses a single log line → You understand how to extract structure from text.
- You can create a SQLite database and table → You understand basic DDL (Data Definition Language).
- You can insert one parsed line into the table → You can connect Python to a database.
- Your script processes a 10,000-line log file successfully → You’ve built a robust ETL script for a semi-structured source.
Project 3: Set Up a Data Warehouse and Model Data
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: SQL
- Alternative Programming Languages: Python for scripting if needed.
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Modeling / Data Warehousing
- Software or Tool: PostgreSQL or DuckDB, Docker
- Main Book: “The Data Warehouse Toolkit” by Ralph Kimball
What you’ll build: A proper, local data warehouse. You will install a real analytical database (PostgreSQL or DuckDB), design a star schema for the sales data from Project 1, write the SQL CREATE TABLE statements, and write the SQL INSERT ... SELECT ... statements to populate your fact_sales and dimension tables from the raw “staging” tables.
Why it teaches data engineering: This is the absolute core of data warehousing. You stop thinking about data in application-centric (normalized) tables and start thinking in analytical (denormalized) schemas. This project forces you to internalize the difference between OLTP and OLAP modeling.
Core challenges you’ll face:
- Designing a star schema → maps to identifying the central business process (facts) and the descriptive context (dimensions)
- Writing DDL for facts and dimensions → maps to choosing appropriate data types, primary keys, and foreign keys
- Populating dimensions from raw data → maps to using `SELECT DISTINCT ...` to create dimension tables like `dim_customer`
- Populating the fact table → maps to writing a complex `INSERT` statement that joins multiple staging tables to look up foreign keys
Key Concepts:
- Star Schema: “The Data Warehouse Toolkit” Chapter 2. This book is the bible of dimensional modeling.
- Fact vs. Dimension Tables: A deep dive into the purpose of each.
- Surrogate Keys: Using auto-incrementing integers as primary keys in your dimension tables, which is more stable than using natural keys from the source system.
- CTEs (Common Table Expressions): Using `WITH ... AS (...)` to structure and simplify your complex SQL transformation queries.
Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Project 1, understanding of SQL JOINs.
Real world outcome:
Using a SQL client (like DBeaver or psql), you can connect to your database. You will see two schemas: staging (containing the raw CSV data) and analytics (containing your clean fact_sales, dim_customer, dim_product tables). Queries against the analytics schema are simple and fast.
Implementation Hints:
- Use Docker to easily run PostgreSQL locally.
- Create two schemas: `staging` and `analytics`.
- Load the raw CSVs from Project 1 into the `staging` schema. Most SQL clients have a UI for this.
- First, create and populate your dimension tables. Example for `dim_customer`:

  ```sql
  CREATE TABLE analytics.dim_customer (
      customer_key  SERIAL PRIMARY KEY,
      customer_id   INT,
      customer_name VARCHAR
  );

  INSERT INTO analytics.dim_customer (customer_id, customer_name)
  SELECT DISTINCT customer_id, customer_name
  FROM staging.customers;
  ```

- Finally, create your fact table. The `INSERT` will need to join `staging.orders` with your new dimension tables to find the `customer_key` and `product_key` for each sale (a sketch of this step follows these hints).
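A sketch of that fact-table load, run from Python against PostgreSQL with `psycopg2`. The staging and dimension column names mirror Projects 1 and 3 but are assumptions about your data; adjust the join keys to match your schema.

```python
# A sketch of the fact-table load described above, executed from Python against
# PostgreSQL with psycopg2. Table and column names are assumptions that mirror
# the earlier hints.
import psycopg2

FACT_LOAD_SQL = """
INSERT INTO analytics.fact_sales
    (date_key, customer_key, product_key, amount, quantity)
SELECT
    d.date_key,
    c.customer_key,
    p.product_key,
    sp.price * o.quantity AS amount,
    o.quantity
FROM staging.orders AS o
JOIN staging.products       AS sp ON sp.product_id = o.product_id
JOIN analytics.dim_customer AS c  ON c.customer_id = o.customer_id
JOIN analytics.dim_product  AS p  ON p.product_id  = o.product_id
JOIN analytics.dim_date     AS d  ON d.full_date   = o.order_date::date;
"""

conn = psycopg2.connect(
    host="localhost", dbname="warehouse", user="postgres", password="postgres"
)
with conn, conn.cursor() as cur:   # commits on success, rolls back on error
    cur.execute(FACT_LOAD_SQL)
conn.close()
```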
Learning milestones:
- You have a running PostgreSQL/DuckDB instance with raw data loaded → You can manage a DWH database.
- You have created and populated your dimension tables → You understand how to model the “context” of your data.
- You have populated your fact table with correct foreign keys → You can connect facts to their dimensions.
- You can write a simple query joining facts and dims to get total sales by product category → You have successfully built and can use a star schema.
Project 4: Connect a BI Tool and Build a Dashboard
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: SQL (inside the BI tool)
- Alternative Programming Languages: N/A
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Business Intelligence / Data Visualization
- Software or Tool: Metabase or Apache Superset, Docker, the DWH from Project 3.
- Main Book: “Storytelling with Data” by Cole Nussbaumer Knaflic
What you’ll build: A business intelligence dashboard. You will install an open-source BI tool, connect it to your data warehouse, and build a dashboard with several charts: a line chart of sales over time, a bar chart of top-selling products, and a table of the highest-value customers.
Why it teaches data engineering: This project closes the loop and shows the value of all the previous steps. It forces you to think like an analyst. Is your data model easy to query? Are the column names clear? You’ll learn how to write queries that power visualizations and how to present data in a way that is clear, accurate, and actionable.
Core challenges you’ll face:
- Setting up a BI tool with Docker → maps to learning to use Docker and Docker Compose to run services locally
- Connecting the BI tool to your data warehouse → maps to understanding connection strings, users, and permissions
- Writing “questions” or queries in the BI tool → maps to translating a business question into a SQL query
- Choosing the right visualization → maps to knowing when to use a bar chart vs. a line chart vs. a pie chart (hint: almost never a pie chart)
Key Concepts:
- Data Visualization Principles: Choosing the right chart for the data. “Storytelling with Data” is a classic guide.
- Dashboards: A collection of visualizations that give a high-level overview of a business area.
- Metrics and KPIs: Defining what you want to measure (e.g., “Monthly Recurring Revenue” is a KPI).
Difficulty: Intermediate
Time estimate: Weekend
Prerequisites: Project 3.
Real world outcome:
You can navigate to http://localhost:3000 in your browser and see a professional-looking dashboard. You can interact with it, filter by date, and see the charts update. You can show this to someone and they will immediately understand the sales performance of your fictional company.
Implementation Hints:
- Both Metabase and Superset have official Docker images and `docker-compose.yml` files. Start there.
- When you first set up the BI tool, it will ask for your database connection details: host (`host.docker.internal` if your DB is running on your machine and the BI tool is in Docker), port, database name, user, and password.
- Start by asking simple questions. In Metabase, you can use the graphical query builder first to explore.
- “Sales over time” will require a `GROUP BY` on a date column. “Top-selling products” will require a `GROUP BY` on the product name and an `ORDER BY` on the sum of sales (example queries follow these hints).
- Pay attention to axis labels, titles, and colors to make your charts easy to understand.
Learning milestones:
- Your BI tool is running and connected to your DWH → You understand the basic architecture.
- You have built and saved your first chart (question) → You can write a query and visualize it.
- You have assembled a dashboard with multiple charts → You can create a cohesive analytical view.
- Your dashboard is clean, easy to read, and tells a story → You are thinking about the end-user and not just the data.
Project 5: Automate Transformations with dbt
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: SQL, Jinja
- Alternative Programming Languages: N/A
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Modern Data Transformation / Analytics Engineering
- Software or Tool: dbt (data build tool), your DWH from Project 3.
- Main Book: dbt Labs’ “dbt Fundamentals” course (free online).
What you’ll build: A dbt project that programmatically creates the star schema from Project 3. Instead of writing CREATE TABLE and INSERT statements yourself, you will write dbt models (which are just SELECT statements) that dbt uses to build the tables and views in your warehouse.
Why it teaches data engineering: This is how modern data transformation is done, period. dbt brings software engineering best practices (modularity, testing, documentation, version control) to the data transformation process. Learning dbt is one of the single most valuable skills for a data engineer today.
Core challenges you’ll face:
- Setting up a dbt project → maps to installing `dbt-core`, initializing a project, and connecting it to your warehouse profile
- Creating models from raw data → maps to writing `SELECT` statements in `.sql` files within your `models` directory
- Refactoring SQL using sources and refs → maps to using `{{ source(...) }}` and `{{ ref(...) }}` to build a dependency graph (DAG) of your models
- Adding tests and documentation → maps to creating `.yml` files to define tests (e.g., `not_null`, `unique`) and add descriptions to your models and columns
Key Concepts:
- Analytics Engineering: The discipline that focuses on the ‘T’ in ELT, a role that dbt largely created.
- Idempotency: The ability to run a process multiple times with the same result. dbt pipelines are idempotent.
- DAG (Directed Acyclic Graph): dbt builds a graph of your models to determine the correct order of execution.
- Jinja: A templating language used within dbt’s SQL to add logic like loops and if-statements to your queries.
Difficulty: Advanced
Time estimate: 1-2 weeks
Prerequisites: Project 3. Solid SQL skills are a must.
Real world outcome:
You run a single command, dbt run, from your terminal. dbt connects to your warehouse, drops the old analytics tables, and rebuilds your entire star schema from scratch based on the logic in your models. You can then run dbt test and get a confirmation that all your primary keys are unique and not null.
Implementation Hints:
- Follow the dbt “Getting Started” guide. It’s excellent.
- Structure your project with a `staging` folder for simple cleaning and renaming of source data, and a `marts` folder for your final `fact_sales` and `dim_*` models.
- A staging model for customers might look like this (`stg_customers.sql`):

  ```sql
  select
      customer_id as customer_source_id,
      name        as customer_name
  from {{ source('my_app', 'raw_customers') }}
  ```

- Your `fact_sales` model will then `{{ ref('stg_customers') }}` instead of using the raw source table. This creates the dependency.
- Add tests! In a `.yml` file, you can specify that the primary key of your dimension table must be unique. This is incredibly powerful.
Learning milestones:
- You have a dbt project connected to your DWH → You understand the setup and configuration.
- You can run `dbt run` and it successfully builds one model → You understand the basic workflow.
- You have rebuilt your entire star schema using interdependent models (`ref`) → You can build a data transformation DAG.
- You have added tests and documentation to your project → You are thinking like an analytics engineer, building reliable and maintainable data products.
Project 6: Orchestrate Your Pipeline
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: YAML (for some orchestrators)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Orchestration / Automation
- Software or Tool: Airflow, Dagster, Prefect, or Mage. Docker is essential.
- Main Book: “Data Pipelines with Apache Airflow” by Bas Harenslak and Julian de Ruiter.
What you’ll build: An automated, scheduled data pipeline. You will use a data orchestrator to define a workflow that first runs a script to fetch new raw data (e.g., the log processor from Project 2 or a new API script), and only upon its successful completion, runs your dbt run command to transform the data. This pipeline will be scheduled to run every night at midnight.
Why it teaches data engineering: Real-world data pipelines aren’t run manually from a laptop. They are automated, scheduled, and monitored. This project teaches you how to use the tools that manage this complexity, ensuring data is always fresh and reliable. It introduces concepts of dependency, scheduling, and monitoring.
Core challenges you’ll face:
- Setting up an orchestrator locally → maps to using Docker Compose to run the multiple services an orchestrator requires (scheduler, webserver, database)
- Defining a pipeline as a DAG → maps to writing Python code that defines tasks and their dependencies
- Executing shell commands from the orchestrator → maps to running your extraction script and `dbt run` as tasks in your DAG
- Scheduling and monitoring runs → maps to configuring a schedule (e.g., a cron string) and checking the UI to see if runs succeeded or failed
Key Concepts:
- DAG (Directed Acyclic Graph): A collection of tasks with dependencies, but no circular dependencies. This is the core abstraction for all orchestrators.
- Operators/Ops: The building blocks of a DAG, representing a single task (e.g., `BashOperator` in Airflow).
- Scheduling: Using cron expressions (`0 0 * * *` for midnight every day) to define when a pipeline should run.
- Sensors & Idempotency: More advanced concepts like waiting for a file to appear (Sensor) and ensuring your tasks can be re-run safely (Idempotency).
Difficulty: Advanced
Time estimate: 2-3 weeks
Prerequisites: Project 5. Familiarity with Docker is very helpful.
Real world outcome: You can see your pipeline DAG in the orchestrator’s web UI. You can trigger it manually and watch the tasks turn green as they complete in the correct order. The next morning, you can check the logs and see that it ran automatically overnight, and then check your BI dashboard (Project 4) to see it populated with new data.
Implementation Hints:
- Airflow is the most popular, but can be complex to set up. Dagster or Mage are often considered more beginner-friendly. All have great documentation.
- Your DAG will have at least two tasks: `extract_and_load` and `transform`.
- The `transform` task will be dependent on the `extract_and_load` task. In Airflow syntax, this is `extract_and_load >> transform`.
- The `extract_and_load` task will execute your Python script to get new data.
- The `transform` task will execute the `dbt run` command. You will need to make sure the dbt project is available to the orchestrator (e.g., by building it into a Docker image or mounting a volume). A minimal Airflow sketch follows these hints.
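A minimal Airflow sketch of that two-task pipeline. The file paths and schedule are assumptions; Dagster and Mage express the same idea with different syntax, but the shape (two tasks, one dependency, one schedule) is identical.

```python
# dags/elt_pipeline.py -- a minimal Airflow sketch of the two-task pipeline
# described above. Paths and the schedule are assumptions; adapt to your setup.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * *",  # every night at midnight
    catchup=False,
) as dag:
    extract_and_load = BashOperator(
        task_id="extract_and_load",
        bash_command="python /opt/pipelines/process_logs.py /data/access.log",
    )

    transform = BashOperator(
        task_id="transform",
        bash_command="cd /opt/dbt_project && dbt run",
    )

    # The transformation only runs if extraction and loading succeeded.
    extract_and_load >> transform
```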
Learning milestones:
- You have a running orchestrator instance → You can manage a complex local dev environment.
- You can define a simple two-task DAG and run it manually → You understand the core concepts of tasks and dependencies.
- Your orchestrator can successfully run your full data pipeline (extract + dbt) → You have integrated your entire workflow into the orchestrator.
- Your pipeline runs successfully on a schedule → You have built a fully automated data pipeline.
Project 7: Ingest Data from a REST API
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: API Integration / Data Extraction (E)
- Software or Tool: Python `requests` library, your DWH.
- Main Book: “API Design Patterns” by JJ Geewax
What you’ll build: A Python script that connects to a public REST API (e.g., a simple weather API, a public dataset API like the Open Brewery DB, or a GitHub API) and extracts data. This raw data (likely JSON) will then be loaded directly into a staging table in your data warehouse.
Why it teaches data engineering: Many critical business data sources are exposed through APIs. This project teaches you how to programmatically interact with these services, handle JSON data, and manage common challenges like API rate limits and pagination.
Core challenges you’ll face:
- Making authenticated API requests → maps to handling API keys, tokens, or other authentication methods
- Parsing nested JSON data → maps to navigating complex JSON structures to extract relevant fields
- Handling pagination → maps to making multiple API calls to retrieve all available data when results are split across pages
- Error handling (rate limits, network issues) → maps to implementing retry logic and gracefully managing API response codes
Key Concepts:
- REST APIs: Principles of API design (GET, POST, PUT, DELETE).
- JSON (JavaScript Object Notation): The primary data format for web APIs.
- Pagination: Strategies for fetching large datasets from APIs.
- API Rate Limiting: Understanding and respecting API usage policies to avoid being blocked.
Difficulty: Advanced
Time estimate: 1 week
Prerequisites: Project 2 (for loading into DB), Project 6 (for orchestration).
Real world outcome:
A new raw staging table in your data warehouse, e.g., staging.raw_weather_data, which contains the complete, unaltered JSON responses from your chosen API. This table can then be used by your dbt project for transformation.
Implementation Hints:
- Choose a public API that doesn’t require complex authentication for starters.
- Use the Python `requests` library. It makes API calls very straightforward.
- Before writing code, use a tool like Postman or `curl` to explore the API and understand its responses.
- Define the schema for your raw staging table. Often, you’ll store the entire JSON response as a `TEXT` or `JSONB` column initially, then parse it with SQL during transformation.
- Implement a loop for pagination, checking the API response for `next_page` links or similar indicators.
- Add `try`-`except` blocks to catch network errors and check the HTTP status code of the API response (a paginated-extraction sketch follows these hints).
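A sketch of paginated extraction with basic retry and back-off. The endpoint and its page-based pagination are assumptions modeled on the Open Brewery DB; check your chosen API’s documentation for the real parameters and limits.

```python
# A sketch of paginated extraction with basic error handling. The endpoint and
# its page-based pagination are assumptions modeled on the Open Brewery DB;
# adapt the parameters to your chosen API.
import json
import time

import requests

BASE_URL = "https://api.openbrewerydb.org/v1/breweries"

def fetch_all(per_page: int = 50, max_retries: int = 3) -> list[dict]:
    results, page = [], 1
    while True:
        for attempt in range(max_retries):
            try:
                resp = requests.get(
                    BASE_URL,
                    params={"page": page, "per_page": per_page},
                    timeout=10,
                )
                if resp.status_code == 429:   # rate limited: back off and retry
                    time.sleep(2 ** attempt)
                    continue
                resp.raise_for_status()
                break
            except requests.RequestException:
                time.sleep(2 ** attempt)
        else:
            raise RuntimeError(f"giving up on page {page}")

        batch = resp.json()
        if not batch:                          # empty page: we're done
            return results
        results.extend(batch)
        page += 1

if __name__ == "__main__":
    rows = fetch_all()
    # Store the raw JSON as-is; parsing happens later, in the warehouse.
    with open("raw_breweries.json", "w") as f:
        json.dump(rows, f)
    print(f"fetched {len(rows)} records")
```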
Learning milestones:
- You can successfully make a single API call and retrieve data → You understand basic API interaction.
- You can parse the JSON response and load key fields into a database → You can extract structured data from JSON.
- Your script handles pagination to fetch all data → You can deal with large datasets from APIs.
- Your script is robust to API errors and rate limits → You can build production-ready API ingestors.
Project 8: Implementing Slowly Changing Dimensions (SCD Type 2)
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: SQL (with dbt)
- Alternative Programming Languages: Python (for complex custom logic)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Dimensional Modeling / Data History
- Software or Tool: dbt, your DWH.
- Main Book: “The Data Warehouse Toolkit” by Ralph Kimball
What you’ll build: A dbt model that implements a Slowly Changing Dimension (SCD) of Type 2 for your dim_customer table. When a customer’s address changes, instead of updating the existing record, a new record is created for that customer, and the old record is marked as “inactive” with a valid_to date. This preserves historical context.
Why it teaches data engineering: This is a crucial concept for understanding how data warehouses accurately model history. Businesses often need to know “what was the customer’s address at the time of this purchase,” not just their current address. SCDs solve this problem, and dbt provides powerful tools to manage them.
Core challenges you’ll face:
- Understanding SCD Type 2 logic → maps to how to identify changes, mark old records, and insert new ones
- Using dbt snapshots → maps to configuring dbt to automatically manage the SCD logic based on a unique key and the `check_cols` to monitor
- Generating `valid_from` and `valid_to` timestamps → maps to understanding the `dbt_valid_from` and `dbt_valid_to` columns that dbt snapshots add automatically
- Querying SCD tables correctly → maps to learning how to join fact tables to SCD dimensions for a specific point in time
Key Concepts:
- Slowly Changing Dimensions (SCDs): Methods for handling changes to dimensional attributes over time. “The Data Warehouse Toolkit” Chapter 6.
- SCD Type 2: Preserves full history by adding new rows for changes.
- dbt Snapshots: A dbt feature designed specifically for managing SCDs. dbt Labs documentation on Snapshots.
- Point-in-Time Analysis: How SCDs enable accurate historical reporting.
Difficulty: Expert
Time estimate: 1-2 weeks
Prerequisites: Project 5 (dbt mastery).
Real world outcome:
You can run dbt snapshot, then simulate a customer’s address changing in your raw data. When you run dbt snapshot again, your dim_customer table will show two rows for that customer: one with the old address and a closed valid_to date, and one with the new address and an open-ended valid_to (dbt leaves dbt_valid_to as NULL for the current row; some teams coalesce it to a far-future date like 9999-12-31).
Implementation Hints:
- Start with a `staging.customers` table that you can easily modify (e.g., a CSV that you can edit and re-load).
- Define a dbt snapshot in your `snapshots` directory (e.g., `customer_snapshot.sql`):

  ```sql
  {% snapshot customer_snapshot %}

  {{
      config(
          target_schema='analytics',
          unique_key='customer_id',
          strategy='check',
          check_cols=['address']  -- Columns to monitor for changes
      )
  }}

  select * from {{ source('my_app', 'raw_customers') }}

  {% endsnapshot %}
  ```

- Run `dbt snapshot` multiple times. Between runs, manually change an address for one customer in your `raw_customers` table (or CSV) to simulate a change. Observe how the `customer_snapshot` table evolves.
- Experiment with querying your `fact_sales` table by joining it to `customer_snapshot` using the `dbt_valid_from` and `dbt_valid_to` dates to get the correct customer attributes at the time of the sale (a point-in-time join sketch follows these hints).
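A sketch of that point-in-time join, run from Python. It assumes dbt’s default snapshot columns (`dbt_valid_from` / `dbt_valid_to`) and, for simplicity, that the fact table carries the source `customer_id` and an `order_date`; adjust the keys to match your own model.

```python
# A sketch of the point-in-time join described in the last hint, run against
# PostgreSQL from Python. Assumes dbt's default snapshot columns and that the
# fact table carries customer_id and order_date (illustrative simplification).
import psycopg2

POINT_IN_TIME_SQL = """
SELECT
    f.order_date,
    f.amount,
    c.address AS address_at_time_of_sale
FROM analytics.fact_sales AS f
JOIN analytics.customer_snapshot AS c
  ON c.customer_id = f.customer_id
 AND f.order_date >= c.dbt_valid_from
 AND (f.order_date < c.dbt_valid_to OR c.dbt_valid_to IS NULL);
"""

conn = psycopg2.connect(
    host="localhost", dbname="warehouse", user="postgres", password="postgres"
)
with conn.cursor() as cur:
    cur.execute(POINT_IN_TIME_SQL)
    for row in cur.fetchmany(10):
        print(row)
conn.close()
```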
Learning milestones:
- You have configured and run your first dbt snapshot → You understand the basic setup.
- You can simulate a change in raw data and see a new row appear in the snapshot → You’ve observed SCD Type 2 in action.
- You can interpret the `dbt_valid_from` and `dbt_valid_to` columns → You understand how history is tracked.
- You can write a query that joins facts to the SCD dimension to get historical attributes → You can accurately perform point-in-time analysis.
Project 9: Build a “Reverse ETL” Sync
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 4: Expert
- Knowledge Area: Operational Analytics / API Write-back
- Software or Tool: Python `requests` library, your DWH, a mock API (or a simple service like Mailchimp/Hubspot if you have an account).
- Main Book: “Operationalizing Data: The Secret Sauce” (blog posts/articles are more relevant than a single book here)
What you’ll build: A Python script that queries a “materialized view” (or dbt model) in your data warehouse (e.g., analytics.high_value_customers). For each customer identified in this view, the script will make an API call to an external system (e.g., a CRM, an email marketing tool, or a custom mock service) to update their profile or trigger an action.
Why it teaches data engineering: Data warehouses traditionally feed BI tools. Reverse ETL closes the loop by pushing valuable insights back into operational systems. This means data isn’t just analyzed; it actively drives business processes (e.g., flag a customer as “VIP” in Salesforce, or add them to a marketing segment in Mailchimp).
Core challenges you’ll face:
- Querying the data warehouse from Python → maps to using `psycopg2` or another DB connector to fetch data
- Connecting to an external API (CRM, Marketing) → maps to handling various API authentication methods and data formats
- Handling API idempotency and errors → maps to designing your sync to be repeatable without creating duplicates and managing API failures
- Batching API requests → maps to making efficient API calls, possibly in batches, to avoid overwhelming the external system
Key Concepts:
- Reverse ETL: The process of moving data from a data warehouse back into operational systems.
- Operational Analytics: Using data from the warehouse to drive day-to-day business operations.
- API Design (for writing): Understanding how to create or update resources in an external system via its API.
- Materialized Views/Models: Pre-computed results in the DWH for faster querying by the Reverse ETL script.
Difficulty: Expert
Time estimate: 2 weeks
Prerequisites: Project 7 (API integration), Project 6 (orchestration for scheduling).
Real world outcome:
You run your script, and then check the external system. For example, if you’re syncing to a mock CRM, you’d see a customer’s is_vip field updated to true. If integrated with an email service, a segment of customers might be automatically enrolled in a special campaign.
Implementation Hints:
- Define your `analytics.high_value_customers` dbt model as a view or materialized view in your dbt project. This model should return `customer_id`, `email`, and any relevant flags (e.g., `is_vip`).
- Create a simple Python script (`reverse_etl.py`) that:
  - Connects to your PostgreSQL DWH.
  - Executes `SELECT * FROM analytics.high_value_customers;`.
  - Iterates through the results.
  - For each customer, constructs an API call to your mock (or real) external service.
  - Sends the API call using `requests`. (A minimal sketch of this script follows these hints.)
- If using a mock API, you can write a tiny Flask/FastAPI app that just logs incoming requests, or use a service like Mockoon.
- Integrate this script into your orchestrator (Project 6) as a new task that runs after your dbt transformation.
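A minimal sketch of `reverse_etl.py` as described above. The CRM endpoint is a hypothetical mock service (for example, a tiny Flask app that just logs incoming requests); column names assume the dbt model returns `customer_id`, `email`, and `is_vip`.

```python
# reverse_etl.py -- a sketch of the sync described above. The CRM endpoint is
# a hypothetical mock service; swap in your real system's API and auth.
import psycopg2
import requests

CRM_URL = "http://localhost:5001/customers/{customer_id}"  # mock CRM endpoint

conn = psycopg2.connect(
    host="localhost", dbname="warehouse", user="postgres", password="postgres"
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT customer_id, email, is_vip FROM analytics.high_value_customers"
    )
    customers = cur.fetchall()
conn.close()

synced = failed = 0
for customer_id, email, is_vip in customers:
    # PATCH the same fields every run so the sync is idempotent: re-running
    # it never creates duplicates, it just reasserts the current state.
    resp = requests.patch(
        CRM_URL.format(customer_id=customer_id),
        json={"email": email, "is_vip": bool(is_vip)},
        timeout=10,
    )
    if resp.ok:
        synced += 1
    else:
        failed += 1

print(f"synced {synced} customers, {failed} failures")
```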
Learning milestones:
- You can query your dbt model from a Python script → You can connect Python to your DWH.
- You can make successful API calls to an external system to update data → You’ve mastered write-back API integration.
- Your script handles errors and ensures data consistency → You are building robust operational pipelines.
- The entire Reverse ETL process is automated and scheduled → You have completed a full, closed-loop data lifecycle.
Project 10: Build a Full Local Data Stack with Docker Compose
- File: LEARN_DATA_ENGINEERING_DEEP_DIVE.md
- Main Programming Language: YAML (for Docker Compose), Python, SQL
- Alternative Programming Languages: N/A
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: DevOps for Data / Full Stack Data Engineering
- Software or Tool: Docker, Docker Compose, PostgreSQL/DuckDB, dbt, Airflow/Mage, Metabase/Superset.
- Main Book: “The Docker Book” by James Turnbull
What you’ll build: A complete, integrated local data platform using Docker Compose. This single setup will include:
- A PostgreSQL database (your raw staging layer and production DWH).
- Your `dbt` project (mounted as a volume).
- An orchestrator (Airflow or Mage) that runs your data pipelines.
- A BI tool (Metabase or Superset) connected to your DWH.
- A Python service to simulate an external data source or destination (e.g., a simple Flask app mimicking an API to ingest from, or a CRM to push data to).
Why it teaches data engineering: This is the ultimate capstone project. It forces you to combine all your knowledge into a single, cohesive, and runnable system. You’ll understand how all the pieces of a modern data stack fit together, from data ingestion to transformation, orchestration, and visualization. It’s the closest you’ll get to a “real” data platform environment on your laptop.
Core challenges you’ll face:
- Configuring Docker Compose for multiple services → maps to defining inter-service communication, volumes, and environment variables
- Ensuring services start in the correct order and are healthy → maps to using `depends_on`, healthchecks, and entrypoint scripts
- Managing network connectivity between containers → maps to understanding Docker’s internal networking and exposing ports correctly
- Integrating all the individual projects → maps to combining your Python scripts, dbt project, and BI dashboards into a single, automated workflow
Key Concepts:
- Containerization (Docker): Packaging applications and their dependencies into portable units.
- Orchestration (Docker Compose): Defining and running multi-container Docker applications.
- Infrastructure as Code (IaC): Defining your entire data platform setup in a `docker-compose.yml` file.
- Full Data Lifecycle Management: Overseeing data from source to insight and beyond.
Difficulty: Master
Time estimate: 1 month+
Prerequisites: All previous projects. Solid Docker fundamentals.
Real world outcome:
A single command, docker-compose up -d, brings up your entire data platform. You can then access your orchestrator’s UI to see your pipelines running, and your BI tool’s UI to interact with your dashboards, all powered by data flowing through your locally managed DWH.
Implementation Hints:
- Start with a minimal `docker-compose.yml` (e.g., just PostgreSQL). Add one service at a time, testing connectivity and functionality after each addition.
- Pay close attention to environment variables (`DB_HOST`, `DB_USER`, `DB_PASSWORD`) and make sure they are correctly passed to all services that need to connect to the database. Use `depends_on` to ensure your database starts before services that connect to it.
- Your orchestrator will run your Python scripts and dbt commands. Ensure the relevant codebases are mounted as volumes into the orchestrator’s container so it can access them.
- Consider creating a small Flask/FastAPI app that serves a dummy API for your API ingestion script (Project 7) to pull data from.
- For the Reverse ETL (Project 9), you could include another small Flask/FastAPI app that simulates an external CRM, with endpoints for updating customer profiles, and your Python script pushes data to it.
Learning milestones:
- You have successfully started a multi-service `docker-compose` stack → You understand basic container orchestration.
- All services can communicate with each other (e.g., BI tool connects to DB) → You’ve mastered inter-container networking.
- Your orchestrator successfully runs your full data pipeline within the stack → You have an integrated automated workflow.
- Your BI dashboards are automatically populated and up-to-date → You have built a fully functional, end-to-end data platform.
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Local Product Sales Analysis | Beginner | Weekend | Medium | Medium |
| 2. Web Server Log Processor | Intermediate | Weekend | Medium | Medium |
| 3. Set Up a Data Warehouse and Model Data | Intermediate | 1-2 weeks | High | High |
| 4. Connect a BI Tool and Build a Dashboard | Intermediate | Weekend | Medium | High |
| 5. Automate Transformations with dbt | Advanced | 1-2 weeks | High | High |
| 6. Orchestrate Your Pipeline | Advanced | 2-3 weeks | High | Medium |
| 7. Ingest Data from a REST API | Advanced | 1 week | High | Medium |
| 8. Implementing Slowly Changing Dimensions (SCD Type 2) | Expert | 1-2 weeks | Very High | Medium |
| 9. Build a “Reverse ETL” Sync | Expert | 2 weeks | Very High | High |
| 10. Build a Full Local Data Stack with Docker Compose | Master | 1 month+ | Extreme | Very High |
Recommendation
Based on the progressive learning path, I recommend you start with Project 1: Local Product Sales Analysis. It introduces fundamental data transformation concepts using Python and Pandas, which are excellent starting points before diving into databases and specialized tools. It’s approachable for beginners and lays the groundwork for more complex projects.
Final overall project
Build a Complete End-to-End Data Platform for an E-commerce Store with Alerting
This project combines all the concepts learned into a single, real-world scenario. You will build a comprehensive data platform that:
- Ingests data from simulated e-commerce operational databases (PostgreSQL/MySQL), external APIs (e.g., payment gateway, shipping provider), and web server logs.
- Loads this raw data into your staging area in a central data warehouse (PostgreSQL or DuckDB).
- Transforms and models the data using dbt into a star schema (fact tables for `orders`, `pageviews`, and `payments`, and dimension tables for `customers`, `products`, `dates`, and `campaigns`), including proper handling of Slowly Changing Dimensions (SCD Type 2) for customer information.
- Orchestrates the entire pipeline using Airflow or Mage, scheduling daily runs for data ingestion, transformation, and quality checks.
- Performs data quality checks within dbt (e.g., ensuring no duplicate orders, positive sales amounts).
- Visualizes key metrics (e.g., daily sales, top-selling products, customer lifetime value, website traffic, conversion rates) in a BI dashboard (Metabase/Superset).
- Implements Reverse ETL to push insights back into an operational system (e.g., flag “churn risk” customers in a mock CRM or send an alert to a Slack channel if daily sales drop below a threshold).
- Everything runs within a Docker Compose setup, simulating a production-like environment on your local machine.
This project will give you a holistic view of data engineering, from the lowest-level data manipulation to the highest-level business impact, all within a production-like infrastructure. You will be able to demonstrate a full end-to-end data solution.
Summary
| Project | Main Language | Difficulty |
|---|---|---|
| 1. Local Product Sales Analysis | Python | Beginner |
| 2. Web Server Log Processor | Python | Intermediate |
| 3. Set Up a Data Warehouse and Model Data | SQL | Intermediate |
| 4. Connect a BI Tool and Build a Dashboard | SQL | Intermediate |
| 5. Automate Transformations with dbt | SQL, Jinja | Advanced |
| 6. Orchestrate Your Pipeline | Python | Advanced |
| 7. Ingest Data from a REST API | Python | Advanced |
| 8. Implementing Slowly Changing Dimensions (SCD Type 2) | SQL (with dbt) | Expert |
| 9. Build a “Reverse ETL” Sync | Python | Expert |
| 10. Build a Full Local Data Stack with Docker Compose | YAML (for Docker Compose), Python, SQL | Master |