Project 3: Excel-to-Database ETL Pipeline
Build a system that extracts data from Excel files, transforms it, and loads it into a SQL database.
Project Overview
Your company receives client data in Excel spreadsheets from various sources. The data is inconsistent (different column names, date formats, missing values). You need to clean it, standardize it, and load it into a PostgreSQL database for analysis.
What You'll Build
- Read Excel files with inconsistent structures
- Standardize column names (e.g., "Customer Name" vs "Client" vs "customer_name")
- Clean and validate data (remove duplicates, handle missing values, validate emails)
- Transform data types (dates, phone numbers, zip codes)
- Load into a PostgreSQL database
- Generate a data quality report showing what was cleaned
Technologies
- pandas for data transformation
- openpyxl for Excel reading
- SQLAlchemy for database connection
- psycopg2 for PostgreSQL
- Data validation libraries
Learning Objectives
By completing this project, you will:
- Understand ETL architecture - Extract, Transform, Load patterns
- Master data cleaning - Handle messy real-world data
- Build column mapping systems - Standardize inconsistent schemas
- Use SQLAlchemy - Python's most popular database toolkit
- Implement data validation - Regex, type checking, business rules
- Generate data quality reports - Audit and logging best practices
- Design idempotent pipelines - Safe to run multiple times
Deep Theoretical Foundation
ETL vs ELT: Understanding the Paradigms
ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) represent two fundamental approaches to data integration.
ETL (Traditional):
  [SOURCE (Excel)] --> [TRANSFORM (Python)] --> [DATABASE (Clean)]
                              |
                       Heavy processing
                       happens BEFORE load
ELT (Modern Data Lake):
  [SOURCE (Excel)] --> [DATABASE (Raw)] --> [TRANSFORM (In-DB)]
                                                   |
                                          Transformation happens
                                          AFTER load (in database)
When to use ETL (this project):
- Data is messy and needs significant cleaning
- Source format varies widely
- You need to validate before storing
- Database should contain only clean data
When to use ELT:
- Loading large volumes quickly is priority
- Database is powerful enough for transformation
- You want to preserve raw data
- Multiple transformations needed for different uses
Data Quality Dimensions
Before you can clean data, you need to understand what "clean" means:
| Dimension | Definition | Example Problem | Validation |
|---|---|---|---|
| Completeness | All required data present | Missing email address | Check NOT NULL |
| Validity | Data conforms to format | Email without @ | Regex pattern |
| Accuracy | Data reflects reality | ZIP code doesn't exist | Lookup table |
| Consistency | Same thing same way | "CA" vs "California" | Standardization |
| Uniqueness | No unwanted duplicates | Same customer twice | Duplicate check |
| Timeliness | Data is current | Old phone number | Date range check |
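Several of these dimensions can be measured directly with pandas. A minimal sketch (the quality_profile helper and the idea of profiling one vendor file are illustrative assumptions, not part of the spec):

import pandas as pd

def quality_profile(df: pd.DataFrame) -> pd.DataFrame:
    # Completeness: share of non-null values per column
    completeness = df.notna().mean()
    # Uniqueness: share of distinct values per column
    uniqueness = df.nunique(dropna=True) / len(df)
    return pd.DataFrame({'completeness': completeness, 'uniqueness': uniqueness})

# Example usage (hypothetical file):
# print(quality_profile(pd.read_excel('client_data_vendor_A.xlsx')))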
Column Mapping Strategies
Real-world data comes with inconsistent column names. Here are strategies to handle this:
1. Exact Mapping Dictionary
column_map = {
    'Customer Name': 'customer_name',
    'customer name': 'customer_name',
    'client': 'customer_name',
    'Client Name': 'customer_name',
    'name': 'customer_name',
}
Pros: Simple, explicit. Cons: Must enumerate every variation.
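One way to make the exact map a little more forgiving is to normalize the incoming header (trim, collapse spaces, lowercase) before the lookup. A small sketch, assuming column_map as above and an already-loaded DataFrame df:

def normalize(col):
    return ' '.join(str(col).split()).lower()

# Key the lookup on normalized names so 'Customer Name ' and 'customer name' both match
normalized_map = {normalize(k): v for k, v in column_map.items()}
df = df.rename(columns=lambda c: normalized_map.get(normalize(c), c))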
2. Fuzzy Matching
from fuzzywuzzy import fuzz

def find_best_match(column, standard_columns, threshold=80):
    for standard in standard_columns:
        if fuzz.ratio(column.lower(), standard.lower()) > threshold:
            return standard
    return None
Pros: Handles minor variations automatically. Cons: Can make wrong matches; slower.
3. Semantic Matching
customer_synonyms = ['customer', 'client', 'account', 'buyer']
email_synonyms = ['email', 'e-mail', 'mail', 'contact']

def semantic_match(column):
    col_lower = column.lower()
    for synonym in customer_synonyms:
        if synonym in col_lower:
            return 'customer_name'
    for synonym in email_synonyms:
        if synonym in col_lower:
            return 'email'
    return None
Pros: Handles patterns like "Customer_Name" and "CUSTOMER". Cons: Can still miss edge cases.
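In practice the three strategies can be chained: exact lookup first, then semantic matching, then fuzzy matching as a last resort. A sketch combining the earlier examples (column_map, semantic_match, find_best_match); STANDARD_COLUMNS is an assumed list of target names:

STANDARD_COLUMNS = ['customer_name', 'email', 'phone', 'signup_date']

def resolve_column(column):
    # 1. Exact dictionary lookup
    if column in column_map:
        return column_map[column]
    # 2. Semantic (synonym/substring) match
    match = semantic_match(column)
    if match:
        return match
    # 3. Fuzzy match against the standard schema, or None if nothing fits
    return find_best_match(column, STANDARD_COLUMNS)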
SQLAlchemy Architecture
SQLAlchemy provides two main interfaces:
SQLAlchemy Architecture (layers, top to bottom):

ORM (Object Relational Mapper)

    class Customer(Base):
        __tablename__ = 'customers'
        id = Column(Integer, primary_key=True)
        name = Column(String(255))

    # Work with Python objects:
    customer = Customer(name='John')
    session.add(customer)
    session.commit()

        |
        v

Core (SQL Expression Language)

    customers = Table('customers', metadata,
        Column('id', Integer, primary_key=True),
        Column('name', String(255))
    )

    # Work with SQL constructs:
    stmt = customers.insert().values(name='John')
    conn.execute(stmt)

        |
        v

Engine (Connection Pool)

    engine = create_engine('postgresql://...')
    # Manages connections, transactions

        |
        v

[ PostgreSQL Database ]
For this project, use Core because:
- Better for bulk operations (thousands of inserts)
- Clearer mapping to SQL operations
- Easier to work with pandas DataFrames
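For quick experiments, pandas can also hand a DataFrame straight to the engine via to_sql; a sketch, where clean_df and the connection string are assumptions:

from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost/company_db')

# Appends rows to an existing table; for schema control and upserts,
# prefer the Core Table approach shown in the hints section.
clean_df.to_sql('customers', engine, if_exists='append', index=False)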
Data Type Mapping: Excel → Python → SQL
Understanding type transformations is crucial for data integrity:
Excel Cell Type       pandas dtype          PostgreSQL Type
----------------      ----------------      ----------------
"General" (text)  ->  object (str)      ->  VARCHAR(255)
"Number"          ->  float64           ->  NUMERIC or FLOAT
"Date"            ->  datetime64[ns]    ->  DATE or TIMESTAMP
"Boolean"         ->  bool              ->  BOOLEAN
"Currency"        ->  float64           ->  NUMERIC(12,2)
"Percentage"      ->  float64           ->  NUMERIC(5,4)

DANGER ZONES:

Excel: "02134" (text ZIP code)
  -> pandas: 2134 (int)    -- LOSES the leading zero!
  -> PostgreSQL: 2134      -- WRONG!
Solution: Force string type: pd.read_excel(..., dtype={'zip': str})

Excel: "1/2/24"
  -> pandas: Timestamp('2024-01-02') or Timestamp('2024-02-01')  -- AMBIGUOUS!
Solution: Explicit date parsing: pd.to_datetime(col, format='%m/%d/%y')
Duplicate Detection Strategies
"Duplicate" means different things in different contexts:
Strategy 1: Exact Match
Same value in every column:
['John Smith', 'john@email.com', '555-1234'] == ['John Smith', 'john@email.com', '555-1234']
=> Clearly a duplicate

Strategy 2: Key-Based Match
Same value in a unique identifier column:
Email 'john@email.com' appears twice (even if other fields differ)
=> Probably a duplicate (same person, updated info)

Strategy 3: Fuzzy Match
Similar but not identical:
'John Smith' vs 'Jon Smith'
'john@email.com' vs 'john.smith@email.com'
=> Possible duplicate (needs human review)
Best practice: Use key-based matching on email/phone, flag fuzzy matches for review.
The Core Question You're Answering
"How do I reliably move data from Excel's chaotic flexibility into a database's structured rigidity?"
Excel allows anything in any cell. Databases enforce types, constraints, and structure. You're learning to be the translator, handling the messy reality of real-world Excel data.
Project Specification
Input Requirements
Multiple Excel files with varying structures:
File 1: client_data_vendor_A.xlsx
Customer Name | Email Address | Phone | Join Date
John Smith | john@email.com | 5551234567 | 1/15/24
File 2: client_data_vendor_B.xlsx
client | email | phone_number | signup_date
Jane Doe | jane@mail.com | (555) 234-5678 | 2024-01-20
File 3: client_data_vendor_C.xlsx
name | contact_email | telephone | registration
Bob Wilson | bob.w@co.com | 555.345.6789 | Jan 25, 2024
Output Requirements
- PostgreSQL Table: customers
  - Columns: id (auto), customer_name, email, phone, signup_date, source_file, created_at
  - All data standardized and validated
  - No duplicates (based on email)
- Data Quality Report: data_quality_report_{date}.xlsx
  - Summary sheet with total records, success rate, issue counts
  - Transformation log showing what was cleaned
  - Flagged records needing manual review
  - Data quality metrics per column
Functional Requirements
- Handle different column name variations
- Standardize phone numbers to ###-###-#### format
- Validate email addresses with regex
- Parse multiple date formats
- Remove duplicate records (keep most recent)
- Log all transformations for audit
- Generate report even if no records loaded (show why)
- Safe to run multiple times (idempotent)
Solution Architecture
Component Design
ETL PIPELINE (three layers)

EXTRACT LAYER
    FileScanner (glob(), validate) -> ExcelReader (read_xl(), parse()) -> SchemaDetect (map_cols, detect())
        |
        v
TRANSFORM LAYER
    Standardize (names, phones, dates) -> Validate (email, phone, required) -> Deduplicate (find_dup, resolve, flag)
        |
        v
LOAD LAYER
    DBConnect (engine, schema, validate) -> BulkInsert (chunk(), insert(), commit()) -> ReportGen (summary, details, save())
Data Flow
Phase 1: EXTRACT

For each Excel file:
  1. Read with openpyxl/pandas
  2. Detect column names
  3. Map to standard schema
  4. Add source_file metadata
  5. Append to master DataFrame

Output: Raw combined DataFrame with original values
        |
        v
Phase 2: TRANSFORM

Step 2.1: Column Standardization
  - Rename columns to standard names
  - Drop unexpected columns (log them)

Step 2.2: Name Cleaning
  - Trim whitespace
  - Title case: "john smith" -> "John Smith"
  - Remove extra spaces

Step 2.3: Email Cleaning
  - Lowercase
  - Trim whitespace
  - Validate with regex
  - Flag invalid (don't remove)

Step 2.4: Phone Cleaning
  - Extract digits only
  - Format to ###-###-####
  - Handle 10-digit and 11-digit (leading 1) numbers
  - Flag invalid (don't remove)

Step 2.5: Date Parsing
  - Try multiple formats
  - Convert to standard YYYY-MM-DD
  - Flag unparseable

Step 2.6: Deduplication
  - Identify duplicates by email
  - Keep most recent record
  - Log removed duplicates

Output: Clean DataFrame + flagged records DataFrame
        |
        v
Phase 3: LOAD

Step 3.1: Database Connection
  - Create SQLAlchemy engine
  - Verify table schema exists (create if not)

Step 3.2: Upsert Strategy
  - For each record:
    - Check if email exists in database
    - If it exists: UPDATE (if newer) or SKIP
    - If not: INSERT
  - Use transactions for atomicity

Step 3.3: Bulk Insert (for new records)
  - Batch into chunks of 1000
  - Use executemany for efficiency
  - Commit after each batch

Step 3.4: Generate Report
  - Summary statistics
  - Transformation log
  - Flagged records for review
  - Save to Excel

Output: Populated database + quality report
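One way to implement Step 3.2 on PostgreSQL is the INSERT ... ON CONFLICT construct exposed by SQLAlchemy's postgresql dialect. A sketch, assuming the Core customers table defined in the hints section (with its unique constraint on email) and records as a list of column dicts:

from sqlalchemy.dialects.postgresql import insert as pg_insert

def upsert_batch(engine, customers, records):
    insert_stmt = pg_insert(customers).values(records)
    upsert_stmt = insert_stmt.on_conflict_do_update(
        index_elements=['email'],   # conflict target: the unique email column
        set_={
            'customer_name': insert_stmt.excluded.customer_name,
            'phone': insert_stmt.excluded.phone,
            'signup_date': insert_stmt.excluded.signup_date,
        },
    )
    with engine.begin() as conn:    # transaction: commits on success, rolls back on error
        conn.execute(upsert_stmt)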
Validation Pipeline
VALIDATION PIPELINE

Each record passes through validators in sequence:

    Record
      |
      v
    Required Fields ---FAIL---> flag: MISSING_REQUIRED
      | PASS
      v
    Email Validator ---FAIL---> flag: INVALID_EMAIL
      | PASS
      v
    Phone Validator ---FAIL---> flag: INVALID_PHONE
      | PASS
      v
    Date Validator ----FAIL---> flag: INVALID_DATE
      | PASS
      v
    VALID RECORD -> load to database
Phased Implementation Guide
Phase 1: Basic File Loading
Goal: Read all Excel files and combine into single DataFrame.
Tasks:
- Scan input directory for Excel files
- Read each file with pandas
- Log columns detected in each file
- Combine all DataFrames
- Add source file indicator column
Checkpoint: Single DataFrame containing all records from all files.
Common issues:
- Different encodings between files
- Files with multiple sheets (which to use?)
- Empty files or header-only files
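A minimal sketch of this phase, assuming the files live in an input/ directory and that only the first sheet of each workbook matters:

import glob
import pandas as pd

frames = []
for path in glob.glob('input/*.xlsx'):
    df = pd.read_excel(path, sheet_name=0)             # first sheet only (assumption)
    print(f'{path}: columns = {list(df.columns)}')     # log detected columns
    df['source_file'] = path                           # keep provenance for the report
    frames.append(df)

raw = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()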
Phase 2: Column Mapping
Goal: Standardize column names across all sources.
Tasks:
- Create column mapping dictionary
- Apply mappings to rename columns
- Handle unmapped columns (log and drop)
- Ensure all required columns exist
- Reorder columns to standard order
Checkpoint: All DataFrames have identical column structure.
Common issues:
- Columns that don't match any mapping
- Missing required columns in some files
- Extra columns that aren't needed
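A sketch of the rename, drop, and "ensure required columns exist" steps, assuming the column_mapping dictionary from the hints section and a combined DataFrame df; REQUIRED_COLUMNS is an assumed list:

REQUIRED_COLUMNS = ['customer_name', 'email', 'phone', 'signup_date']

df = df.rename(columns=column_mapping)

# Log and drop anything unexpected
unmapped = [c for c in df.columns if c not in REQUIRED_COLUMNS + ['source_file']]
if unmapped:
    print(f'Dropping unmapped columns: {unmapped}')
    df = df.drop(columns=unmapped)

# Add missing required columns as empty so later steps never hit a KeyError
for col in REQUIRED_COLUMNS:
    if col not in df.columns:
        df[col] = pd.NA

df = df[REQUIRED_COLUMNS + ['source_file']]   # standard column order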
Phase 3: Data Cleaning Functions
Goal: Implement individual cleaning functions.
Tasks:
- clean_name(value) - Title case, trim, no double spaces
- clean_email(value) - Lowercase, trim, validate format
- clean_phone(value) - Extract digits, format ###-###-####
- parse_date(value) - Try multiple formats, return standard date
Checkpoint: Each function works correctly with test inputs.
Common issues:
- Unicode characters in names (accents, etc.)
- International phone numbers
- Ambiguous date formats (1/2/24 = Jan 2 or Feb 1?)
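clean_phone and parse_date are sketched elsewhere in this guide; here is one possible clean_name, with the caveat that str.title() mangles names like 'McDonald' or 'van der Berg':

def clean_name(value):
    if pd.isna(value):
        return None
    # Collapse runs of whitespace, trim, then title-case
    name = ' '.join(str(value).split())
    return name.title() if name else None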
Phase 4: Validation and Flagging
Goal: Identify problematic records without removing them.
Tasks:
- Create validation status column
- Run email validation, set flag if invalid
- Run phone validation, set flag if invalid
- Run date validation, set flag if unparseable
- Check for required fields, flag if missing
Checkpoint: Every record has a validation status and specific flags.
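One way to attach flags without dropping rows is a semicolon-joined flags column. A sketch, assuming the is_valid_email and clean_phone helpers from the hints section and the standard column names:

def validation_flags(row):
    flags = []
    if pd.isna(row['customer_name']) or pd.isna(row['email']):
        flags.append('MISSING_REQUIRED')
    if not pd.isna(row['email']) and not is_valid_email(row['email']):
        flags.append('INVALID_EMAIL')
    if not pd.isna(row['phone']) and clean_phone(row['phone']) is None:
        flags.append('INVALID_PHONE')
    return ';'.join(flags)

df['validation_flags'] = df.apply(validation_flags, axis=1)
df['is_valid'] = df['validation_flags'] == ''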
Phase 5: Deduplication
Goal: Remove duplicate records intelligently.
Tasks:
- Identify duplicates based on email
- For each duplicate group, keep best record:
- Prefer records with more complete data
- Prefer most recent date
- Log all removed duplicates
- Store duplicate count for reporting
Checkpoint: No duplicate emails remain.
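A sketch of email-based deduplication that prefers the most complete and most recent record (column names assume the standard schema):

# Score each row by how many optional fields are filled, then sort so the
# "best" record per email comes first and keep only that one.
df['_completeness'] = df[['customer_name', 'phone', 'signup_date']].notna().sum(axis=1)

deduped = (
    df.sort_values(['_completeness', 'signup_date'], ascending=[False, False])
      .drop_duplicates(subset='email', keep='first')
      .drop(columns='_completeness')
)

duplicates_removed = len(df) - len(deduped)   # keep this count for the quality report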
Phase 6: Database Loading
Goal: Insert clean data into PostgreSQL.
Tasks:
- Create SQLAlchemy engine
- Define table schema
- Create table if not exists
- Insert records in batches
- Handle constraint violations gracefully
- Commit and verify row count
Checkpoint: Data appears in database with correct types.
Phase 7: Report Generation
Goal: Create comprehensive data quality report.
Tasks:
- Calculate summary statistics
- Create transformation log
- Export flagged records
- Generate charts (optional)
- Save to Excel
Checkpoint: Report provides complete audit trail.
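A sketch of the report writer using pandas' ExcelWriter; the sheet names and the shape of summary, transform_log, and flagged are assumptions:

from datetime import date
import pandas as pd

def write_report(summary: dict, transform_log: pd.DataFrame, flagged: pd.DataFrame) -> str:
    path = f'data_quality_report_{date.today()}.xlsx'
    with pd.ExcelWriter(path, engine='openpyxl') as writer:
        pd.DataFrame([summary]).to_excel(writer, sheet_name='Summary', index=False)
        transform_log.to_excel(writer, sheet_name='Transformations', index=False)
        flagged.to_excel(writer, sheet_name='Flagged Records', index=False)
    return path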
Testing Strategy
Unit Tests
def test_clean_phone_standard():
    assert clean_phone('5551234567') == '555-123-4567'

def test_clean_phone_with_formatting():
    assert clean_phone('(555) 123-4567') == '555-123-4567'

def test_clean_phone_with_country_code():
    assert clean_phone('15551234567') == '555-123-4567'

def test_clean_phone_invalid():
    assert clean_phone('12345') is None

def test_validate_email_valid():
    assert validate_email('john@email.com') == True

def test_validate_email_invalid():
    assert validate_email('john.email.com') == False
    assert validate_email('john@') == False

def test_parse_date_formats():
    assert parse_date('1/15/24') == date(2024, 1, 15)
    assert parse_date('2024-01-15') == date(2024, 1, 15)
    assert parse_date('Jan 15, 2024') == date(2024, 1, 15)
Integration Tests
def test_full_pipeline():
    # Create test Excel files
    create_test_files()

    # Run ETL
    result = run_etl('test_input/', 'test_db_connection')

    # Verify database
    with engine.connect() as conn:
        count = conn.execute(text("SELECT COUNT(*) FROM customers")).scalar()
    assert count > 0

    # Verify report was generated
    assert os.path.exists('data_quality_report.xlsx')
Database Tests
def test_database_schema():
    inspector = inspect(engine)
    columns = inspector.get_columns('customers')
    column_names = [c['name'] for c in columns]
    assert 'customer_name' in column_names
    assert 'email' in column_names
    assert 'phone' in column_names
    assert 'signup_date' in column_names

def test_unique_constraint():
    # Inserting the same email twice should fail
    with pytest.raises(IntegrityError):
        insert_customer(email='test@example.com')
        insert_customer(email='test@example.com')
Common Pitfalls and Debugging
Pitfall 1: ZIP Code Leading Zeros
Symptom: ZIP code "02134" becomes "2134"
Cause: pandas interprets as integer
Solution:
df = pd.read_excel(file, dtype={'zip_code': str})
Pitfall 2: Date Ambiguity
Symptom: "1/2/24" parsed as February 1 instead of January 2
Cause: Default locale assumption
Solution:
from datetime import datetime

def parse_date(value):
    formats = [
        '%m/%d/%y',   # US format first (if that's your source)
        '%d/%m/%y',   # European format
        '%Y-%m-%d',   # ISO format
        '%B %d, %Y',  # Month name format
    ]
    for fmt in formats:
        try:
            return datetime.strptime(str(value), fmt).date()
        except ValueError:
            continue
    return None
Pitfall 3: Unicode in Names
Symptom: Names like "José" cause encoding errors or show up as garbled characters
Cause: File encoding mismatch
Solution: Real .xlsx files carry their encoding internally, so pd.read_excel has no encoding argument and normally handles accents fine; the problem usually appears when the "spreadsheet" is actually a CSV export. In that case:
try:
    df = pd.read_csv(file, encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv(file, encoding='latin-1')
Pitfall 4: Database Connection Pool Exhaustion
Symptom: "too many connections" error
Cause: Not closing connections properly
Solution:
# Use a context manager
with engine.begin() as conn:
    conn.execute(stmt)
# Connection is automatically returned to the pool (and the transaction committed)
Pitfall 5: Batch Insert Fails Partially
Symptom: Some records inserted, some failed, hard to know which
Cause: Single transaction for entire batch
Solution:
# Insert one at a time with individual error handling
failed_records = []
for record in records:
    try:
        with engine.begin() as conn:
            conn.execute(insert_stmt, record)
    except Exception as e:
        failed_records.append((record, str(e)))
Extensions and Challenges
Extension 1: Incremental Loading
Only process new/changed records since last run.
Skills practiced: Change detection, timestamps, watermarking
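One simple watermarking approach is to persist the timestamp of the last successful run and skip files that have not changed since then. A sketch; the etl_state.json location is a hypothetical choice:

import json
import os
import time

STATE_FILE = 'etl_state.json'   # hypothetical watermark location

def load_watermark():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f).get('last_run', 0.0)
    return 0.0

def files_to_process(paths):
    watermark = load_watermark()
    return [p for p in paths if os.path.getmtime(p) > watermark]

def save_watermark():
    # Call only after the pipeline finishes successfully
    with open(STATE_FILE, 'w') as f:
        json.dump({'last_run': time.time()}, f)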
Extension 2: Parallel Processing
Process multiple files simultaneously.
Skills practiced: multiprocessing, thread safety, concurrent database access
Extension 3: Schema Evolution
Handle new columns appearing in source files.
Skills practiced: Dynamic schema detection, ALTER TABLE, migration patterns
Extension 4: Data Lineage
Track exactly where each database record came from.
Skills practiced: Metadata management, audit logging, provenance tracking
Extension 5: Automated Scheduling
Run the ETL pipeline automatically on schedule.
Skills practiced: cron, task scheduling, monitoring, alerting
Real-World Connections
Industry Applications
- Marketing: Customer data integration from multiple campaigns
- Sales: Lead consolidation from various sources
- HR: Employee data migration during acquisitions
- Finance: Transaction reconciliation from multiple systems
Career Relevance
This project demonstrates skills for:
- Data Engineer roles
- ETL Developer positions
- Database Administrator roles
- Business Intelligence Engineer
Portfolio Value
When presenting:
- Show the data quality improvement (before/after metrics)
- Demonstrate idempotency (safe to run multiple times)
- Discuss scalability considerations
- Highlight error handling and recovery
Interview Questions
Prepare to answer:
- "What's the difference between an ETL and an ELT pipeline?"
- "How would you handle a 10GB Excel file that doesn't fit in memory?"
- "What's a database transaction and why is it important in data loading?"
- "How do you ensure your ETL pipeline is idempotent?"
- "What's the CAP theorem and how does it relate to data consistency?"
- "How would you parallelize this pipeline to process files faster?"
Hints by Layer
Layer 1: Column Mapping
column_mapping = {
    'Customer Name': 'customer_name',
    'client': 'customer_name',
    'name': 'customer_name',
    'Email Address': 'email',
    'email': 'email',
    'contact_email': 'email',
}
df.rename(columns=column_mapping, inplace=True)
Layer 2: Email Validation
import re

email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

def is_valid_email(email):
    if pd.isna(email):
        return False
    return bool(re.match(email_pattern, str(email)))

df['email_valid'] = df['email'].apply(is_valid_email)
Layer 3: Phone Cleaning
def clean_phone(phone):
    if pd.isna(phone):
        return None
    digits = ''.join(filter(str.isdigit, str(phone)))
    if len(digits) == 11 and digits[0] == '1':
        digits = digits[1:]
    if len(digits) == 10:
        return f'{digits[:3]}-{digits[3:6]}-{digits[6:]}'
    return None

df['phone'] = df['phone'].apply(clean_phone)
Layer 4: Database Connection
from datetime import datetime
from sqlalchemy import create_engine, Table, Column, Integer, String, Date, DateTime, MetaData

engine = create_engine('postgresql://user:password@localhost/company_db')
metadata = MetaData()

customers = Table('customers', metadata,
    Column('id', Integer, primary_key=True),
    Column('customer_name', String(255)),
    Column('email', String(255), unique=True),
    Column('phone', String(20)),
    Column('signup_date', Date),
    Column('source_file', String(255)),
    Column('created_at', DateTime, default=datetime.utcnow)
)

metadata.create_all(engine)
Layer 5: Bulk Insert
records = df.to_dict('records')
chunk_size = 1000

for i in range(0, len(records), chunk_size):
    chunk = records[i:i+chunk_size]
    with engine.begin() as conn:
        conn.execute(customers.insert(), chunk)
Self-Assessment Checklist
Before considering complete:
- All Excel files successfully read
- Column names standardized across sources
- Names cleaned (title case, trimmed)
- Emails validated and flagged if invalid
- Phones formatted to ###-###-#### or flagged
- Dates parsed from multiple formats
- Duplicates removed (by email)
- Database table created with correct schema
- All valid records inserted
- Data quality report generated
- Report shows accurate statistics
- Flagged records exportable for review
- Pipeline is idempotent (safe to rerun)
- Errors handled gracefully
- Logging provides audit trail
Resources
| Topic | Resource |
|---|---|
| ETL fundamentals | โData Pipelines Pocket Referenceโ by James Densmore - Ch. 1-2 |
| pandas data cleaning | โPython for Data Analysisโ by Wes McKinney - Ch. 7 |
| Regular expressions | โPython for Data Analysisโ by Wes McKinney - Ch. 7.5 |
| SQL and database basics | โPractical SQLโ by Anthony DeBarros - Ch. 1-4 |
| SQLAlchemy fundamentals | โEssential SQLAlchemyโ by Jason Myers - Ch. 1-3 |
| Data validation strategies | โClean Codeโ by Robert C. Martin - Ch. 7-8 |