Project 3: Excel-to-Database ETL Pipeline

Build a system that extracts data from Excel files, transforms it, and loads it into a SQL database.


Project Overview

Your company receives client data in Excel spreadsheets from various sources. The data is inconsistent (different column names, date formats, missing values). You need to clean it, standardize it, and load it into a PostgreSQL database for analysis.

What You'll Build

  • Read Excel files with inconsistent structures
  • Standardize column names (e.g., "Customer Name" vs "Client" vs "customer_name")
  • Clean and validate data (remove duplicates, handle missing values, validate emails)
  • Transform data types (dates, phone numbers, zip codes)
  • Load into a PostgreSQL database
  • Generate a data quality report showing what was cleaned

Technologies

  • pandas for data transformation
  • openpyxl for Excel reading
  • SQLAlchemy for database connection
  • psycopg2 for PostgreSQL
  • Data validation libraries

Learning Objectives

By completing this project, you will:

  1. Understand ETL architecture - Extract, Transform, Load patterns
  2. Master data cleaning - Handle messy real-world data
  3. Build column mapping systems - Standardize inconsistent schemas
  4. Use SQLAlchemy - Python's most popular database toolkit
  5. Implement data validation - Regex, type checking, business rules
  6. Generate data quality reports - Audit and logging best practices
  7. Design idempotent pipelines - Safe to run multiple times

Deep Theoretical Foundation

ETL vs ELT: Understanding the Paradigms

ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) represent two fundamental approaches to data integration.

ETL (Traditional):
┌───────────┐    ┌─────────────┐    ┌───────────┐
│  SOURCE   │───►│  TRANSFORM  │───►│ DATABASE  │
│  (Excel)  │    │  (Python)   │    │  (Clean)  │
└───────────┘    └─────────────┘    └───────────┘
                        │
               Heavy processing
               happens BEFORE load

ELT (Modern Data Lake):
┌───────────┐    ┌───────────┐    ┌─────────────┐
│  SOURCE   │───►│ DATABASE  │───►│  TRANSFORM  │
│  (Excel)  │    │  (Raw)    │    │  (In-DB)    │
└───────────┘    └───────────┘    └─────────────┘
                                         │
                              Transformation happens
                              AFTER load (in database)

When to use ETL (this project):

  • Data is messy and needs significant cleaning
  • Source format varies widely
  • You need to validate before storing
  • Database should contain only clean data

When to use ELT:

  • Loading large volumes quickly is priority
  • Database is powerful enough for transformation
  • You want to preserve raw data
  • Multiple transformations needed for different uses

Data Quality Dimensions

Before you can clean data, you need to understand what "clean" means:

Dimension    | Definition                | Example Problem        | Validation
Completeness | All required data present | Missing email address  | Check NOT NULL
Validity     | Data conforms to format   | Email without @        | Regex pattern
Accuracy     | Data reflects reality     | ZIP code doesn't exist | Lookup table
Consistency  | Same thing, same way      | "CA" vs "California"   | Standardization
Uniqueness   | No unwanted duplicates    | Same customer twice    | Duplicate check
Timeliness   | Data is current           | Old phone number       | Date range check
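
Several of these dimensions can be measured directly with pandas. A minimal sketch, assuming the standardized column names used later in this project (customer_name, email) and the email regex from the hints section:

import pandas as pd

EMAIL_PATTERN = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

def quality_metrics(df):
    """Measure a few quality dimensions for the customers DataFrame."""
    return {
        # Completeness: fraction of rows with a non-null email
        'email_completeness': df['email'].notna().mean(),
        # Validity: fraction of emails matching the pattern
        'email_validity': df['email'].fillna('').astype(str).str.match(EMAIL_PATTERN).mean(),
        # Uniqueness: fraction of rows that are not repeat emails
        'email_uniqueness': 1 - df.duplicated(subset='email').mean(),
        'row_count': len(df),
    }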

Column Mapping Strategies

Real-world data comes with inconsistent column names. Here are strategies to handle this:

1. Exact Mapping Dictionary

column_map = {
    'Customer Name': 'customer_name',
    'customer name': 'customer_name',
    'client': 'customer_name',
    'Client Name': 'customer_name',
    'name': 'customer_name',
}

Pros: Simple, explicit. Cons: Must enumerate every variation.

2. Fuzzy Matching

from fuzzywuzzy import fuzz

def find_best_match(column, standard_columns, threshold=80):
    # Score every candidate and keep the highest match above the threshold
    best_score, best_match = 0, None
    for standard in standard_columns:
        score = fuzz.ratio(column.lower(), standard.lower())
        if score >= threshold and score > best_score:
            best_score, best_match = score, standard
    return best_match

Pros: Handles minor variations automatically. Cons: Can make wrong matches, and is slower.

3. Semantic Matching

customer_synonyms = ['customer', 'client', 'account', 'buyer']
email_synonyms = ['email', 'e-mail', 'mail', 'contact']

def semantic_match(column):
    col_lower = column.lower()
    for synonym in customer_synonyms:
        if synonym in col_lower:
            return 'customer_name'
    for synonym in email_synonyms:
        if synonym in col_lower:
            return 'email'
    return None

Pros: Handles patterns like "Customer_Name" and "CUSTOMER". Cons: Can still miss edge cases.

SQLAlchemy Architecture

SQLAlchemy provides two main interfaces:

┌─────────────────────────────────────────────┐
│       ORM (Object Relational Mapper)        │
└─────────────────────────────────────────────┘
  class Customer(Base):
      __tablename__ = 'customers'
      id = Column(Integer, primary_key=True)
      name = Column(String(255))

  # Work with Python objects:
  customer = Customer(name='John')
  session.add(customer)
  session.commit()
                      │
                      ▼
┌─────────────────────────────────────────────┐
│       Core (SQL Expression Language)        │
└─────────────────────────────────────────────┘
  customers = Table('customers', metadata,
      Column('id', Integer, primary_key=True),
      Column('name', String(255))
  )

  # Work with SQL constructs:
  stmt = customers.insert().values(name='John')
  conn.execute(stmt)
                      │
                      ▼
┌─────────────────────────────────────────────┐
│          Engine (Connection Pool)           │
└─────────────────────────────────────────────┘
  engine = create_engine('postgresql://...')
  # Manages connections, transactions
                      │
                      ▼
            [ PostgreSQL Database ]

For this project, use Core because:

  • Better for bulk operations (thousands of inserts)
  • Clearer mapping to SQL operations
  • Easier to work with pandas DataFrames
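
A minimal sketch of what the Core-plus-pandas workflow looks like (the connection URL and the two-column table are placeholders, not the project schema):

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine('postgresql+psycopg2://user:password@localhost/company_db')
metadata = MetaData()
customers = Table('customers', metadata,
    Column('id', Integer, primary_key=True),
    Column('customer_name', String(255)),
)
metadata.create_all(engine)

df = pd.DataFrame([{'customer_name': 'John Smith'}, {'customer_name': 'Jane Doe'}])

# A list of dicts from a DataFrame maps directly onto a Core executemany
with engine.begin() as conn:
    conn.execute(customers.insert(), df.to_dict('records'))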

Data Type Mapping: Excel → Python → SQL

Understanding type transformations is crucial for data integrity:

Excel Cell Type       pandas dtype           PostgreSQL Type
────────────────      ────────────           ────────────────
"General" (text)  →   object (str)      →    VARCHAR(255)
"Number"          →   float64           →    NUMERIC or FLOAT
"Date"            →   datetime64[ns]    →    DATE or TIMESTAMP
"Boolean"         →   bool              →    BOOLEAN
"Currency"        →   float64           →    NUMERIC(12,2)
"Percentage"      →   float64           →    NUMERIC(5,4)

DANGER ZONES:
───────────────────────────────────────────────────────────────
Excel: "02134" (text ZIP code)
   → pandas: 2134 (int) ← LOSES leading zero!
   → PostgreSQL: 2134 ← WRONG!

Solution: Force string type: pd.read_excel(..., dtype={'zip': str})

Excel: "1/2/24"
   → pandas: Timestamp('2024-01-02') or Timestamp('2024-02-01') ← AMBIGUOUS!

Solution: Explicit date parsing: pd.to_datetime(col, format='%m/%d/%y')
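
Both fixes can be applied at read time. A sketch, assuming illustrative file and column names:

import pandas as pd

# Keep ZIP codes and phones as text; parse dates with an explicit format
df = pd.read_excel('client_data.xlsx', dtype={'zip': str, 'phone': str})
df['signup_date'] = pd.to_datetime(df['signup_date'], format='%m/%d/%y', errors='coerce')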

Duplicate Detection Strategies

"Duplicate" means different things in different contexts:

Strategy 1: Exact Match
───────────────────────
Same value in every column:
['John Smith', 'john@email.com', '555-1234'] == ['John Smith', 'john@email.com', '555-1234']
✓ Clearly duplicate

Strategy 2: Key-Based Match
───────────────────────────
Same value in a unique identifier column:
Email 'john@email.com' appears twice (even if other fields differ)
✓ Probably duplicate (same person, updated info)

Strategy 3: Fuzzy Match
───────────────────────
Similar but not identical:
'John Smith' vs 'Jon Smith'
'john@email.com' vs 'john.smith@email.com'
? Possible duplicate (needs human review)

Best practice: Use key-based matching on email/phone, and flag fuzzy matches for human review (see the sketch below).
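
A sketch of that best practice with pandas and fuzzywuzzy (the 90% threshold is arbitrary, and the pairwise loop is O(n²), which is fine at this project's scale):

from itertools import combinations
from fuzzywuzzy import fuzz

# Key-based: exact duplicates on the email column
exact_dups = df[df.duplicated(subset='email', keep=False)]

# Fuzzy: flag similar-looking names for manual review
names = df['customer_name'].dropna().unique()
review_pairs = [
    (a, b) for a, b in combinations(names, 2)
    if fuzz.ratio(a.lower(), b.lower()) >= 90
]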


The Core Question You're Answering

"How do I reliably move data from Excel's chaotic flexibility into a database's structured rigidity?"

Excel allows anything in any cell. Databases enforce types, constraints, and structure. You're learning to be the translator, handling the messy reality of real-world Excel data.


Project Specification

Input Requirements

Multiple Excel files with varying structures:

File 1: client_data_vendor_A.xlsx

Customer Name | Email Address | Phone | Join Date
John Smith    | john@email.com| 5551234567 | 1/15/24

File 2: client_data_vendor_B.xlsx

client | email | phone_number | signup_date
Jane Doe | jane@mail.com | (555) 234-5678 | 2024-01-20

File 3: client_data_vendor_C.xlsx

name | contact_email | telephone | registration
Bob Wilson | bob.w@co.com | 555.345.6789 | Jan 25, 2024

Output Requirements

  1. PostgreSQL Table: customers
    • Columns: id (auto), customer_name, email, phone, signup_date, source_file, created_at
    • All data standardized and validated
    • No duplicates (based on email)
  2. Data Quality Report: data_quality_report_{date}.xlsx
    • Summary sheet with total records, success rate, issue counts
    • Transformation log showing what was cleaned
    • Flagged records needing manual review
    • Data quality metrics per column

Functional Requirements

  • Handle different column name variations
  • Standardize phone numbers to ###-###-#### format
  • Validate email addresses with regex
  • Parse multiple date formats
  • Remove duplicate records (keep most recent)
  • Log all transformations for audit
  • Generate report even if no records loaded (show why)
  • Safe to run multiple times (idempotent)

Solution Architecture

Component Design

ETL PIPELINE
────────────

EXTRACT LAYER
┌────────────┐   ┌────────────┐   ┌──────────────┐
│ FileScanner│──►│ ExcelReader│──►│ SchemaDetect │
│            │   │            │   │              │
│ • glob()   │   │ • read_xl()│   │ • map_cols   │
│ • validate │   │ • parse()  │   │ • detect()   │
└────────────┘   └────────────┘   └──────────────┘
                         │
                         ▼
TRANSFORM LAYER
┌────────────┐   ┌────────────┐   ┌──────────────┐
│ Standardize│──►│ Validate   │──►│ Deduplicate  │
│            │   │            │   │              │
│ • names    │   │ • email    │   │ • find_dup   │
│ • phones   │   │ • phone    │   │ • resolve    │
│ • dates    │   │ • required │   │ • flag       │
└────────────┘   └────────────┘   └──────────────┘
                         │
                         ▼
LOAD LAYER
┌────────────┐   ┌────────────┐   ┌──────────────┐
│ DBConnect  │──►│ BulkInsert │──►│ ReportGen    │
│            │   │            │   │              │
│ • engine   │   │ • chunk()  │   │ • summary    │
│ • schema   │   │ • insert() │   │ • details    │
│ • validate │   │ • commit() │   │ • save()     │
└────────────┘   └────────────┘   └──────────────┘

ETL Pipeline Architecture

Data Flow

Phase 1: EXTRACT
────────────────
For each Excel file:
  1. Read with openpyxl/pandas
  2. Detect column names
  3. Map to standard schema
  4. Add source_file metadata
  5. Append to master DataFrame

Output: Raw combined DataFrame with original values
                              │
                              ▼
Phase 2: TRANSFORM
──────────────────
Step 2.1: Column Standardization
  - Rename columns to standard names
  - Drop unexpected columns (log them)

Step 2.2: Name Cleaning
  - Trim whitespace
  - Title case: "john smith" → "John Smith"
  - Remove extra spaces

Step 2.3: Email Cleaning
  - Lowercase
  - Trim whitespace
  - Validate with regex
  - Flag invalid (don't remove)

Step 2.4: Phone Cleaning
  - Extract digits only
  - Format to ###-###-####
  - Handle 10-digit and 11-digit (leading 1) numbers
  - Flag invalid (don't remove)

Step 2.5: Date Parsing
  - Try multiple formats
  - Convert to standard YYYY-MM-DD
  - Flag unparseable

Step 2.6: Deduplication
  - Identify duplicates by email
  - Keep most recent record
  - Log removed duplicates

Output: Clean DataFrame + flagged records DataFrame
                              │
                              ▼
Phase 3: LOAD
─────────────
Step 3.1: Database Connection
  - Create SQLAlchemy engine
  - Verify table schema exists (create if not)

Step 3.2: Upsert Strategy (see the sketch after this diagram)
  - For each record:
    - Check if email exists in database
    - If exists: UPDATE (if newer) or SKIP
    - If not: INSERT
  - Use transactions for atomicity

Step 3.3: Bulk Insert (for new records)
  - Batch into chunks of 1000
  - Use executemany for efficiency
  - Commit after each batch

Step 3.4: Generate Report
  - Summary statistics
  - Transformation log
  - Flagged records for review
  - Save to Excel

Output: Populated database + quality report
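
For Step 3.2, PostgreSQL's ON CONFLICT clause (exposed through SQLAlchemy's postgresql dialect) gives an idempotent upsert in one statement. A sketch, assuming a unique constraint on email and the customers table defined in the hints:

from sqlalchemy.dialects.postgresql import insert

def upsert_customers(engine, customers, records):
    """Insert records; when an email already exists, update that row instead."""
    stmt = insert(customers).values(records)
    stmt = stmt.on_conflict_do_update(
        index_elements=['email'],
        set_={
            'customer_name': stmt.excluded.customer_name,
            'phone': stmt.excluded.phone,
            'signup_date': stmt.excluded.signup_date,
        },
    )
    with engine.begin() as conn:
        conn.execute(stmt)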

Validation Pipeline

Each record passes through validators:

┌────────────┐
│   Record   │
└─────┬──────┘
      │
      ▼
┌────────────┐  FAIL   ┌──────────────────────┐
│  Required  │────────►│ Flag:                │
│  Fields    │         │ MISSING_REQUIRED     │
└─────┬──────┘         └──────────────────────┘
      │ PASS
      ▼
┌────────────┐  FAIL   ┌──────────────────────┐
│   Email    │────────►│ Flag:                │
│  Validator │         │ INVALID_EMAIL        │
└─────┬──────┘         └──────────────────────┘
      │ PASS
      ▼
┌────────────┐
│   Phone    │
│  Validator │
└─────┬──────┘
      │
      ▼
┌────────────┐
│   Date     │
│  Validator │
└─────┬──────┘
      │
      ▼
┌────────────┐
│   VALID    │ ──► Load to database
│   RECORD   │
└────────────┘

Phased Implementation Guide

Phase 1: Basic File Loading

Goal: Read all Excel files and combine into single DataFrame.

Tasks:

  1. Scan input directory for Excel files
  2. Read each file with pandas
  3. Log columns detected in each file
  4. Combine all DataFrames
  5. Add source file indicator column

Checkpoint: Single DataFrame containing all records from all files.

Common issues:

  • Different encodings between files
  • Files with multiple sheets (which to use?)
  • Empty files or header-only files
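
A minimal sketch of this phase (the input directory and the use of only the first sheet are assumptions):

import glob
import pandas as pd

frames = []
for path in glob.glob('input/*.xlsx'):
    df = pd.read_excel(path, sheet_name=0)          # first sheet only
    print(f'{path}: detected columns {list(df.columns)}')
    df['source_file'] = path                        # keep provenance
    frames.append(df)

combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()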

Phase 2: Column Mapping

Goal: Standardize column names across all sources.

Tasks:

  1. Create column mapping dictionary
  2. Apply mappings to rename columns
  3. Handle unmapped columns (log and drop)
  4. Ensure all required columns exist
  5. Reorder columns to standard order

Checkpoint: All DataFrames have identical column structure.

Common issues:

  • Columns that don't match any mapping
  • Missing required columns in some files
  • Extra columns that aren't needed

Phase 3: Data Cleaning Functions

Goal: Implement individual cleaning functions.

Tasks:

  1. clean_name(value) - Title case, trim, no double spaces
  2. clean_email(value) - Lowercase, trim, validate format
  3. clean_phone(value) - Extract digits, format ###-###-####
  4. parse_date(value) - Try multiple formats, return standard

Checkpoint: Each function works correctly with test inputs.

Common issues:

  • Unicode characters in names (accents, etc.)
  • International phone numbers
  • Ambiguous date formats (1/2/24 = Jan 2 or Feb 1?)
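
clean_email, clean_phone, and parse_date are sketched in the hints and pitfalls sections below; clean_name is the simplest of the four. A sketch:

import pandas as pd

def clean_name(value):
    """Trim, collapse repeated whitespace, and title-case a name."""
    if pd.isna(value) or not str(value).strip():
        return None
    # Note: .title() is a simplification; it mangles names like 'McDonald'
    return ' '.join(str(value).split()).title()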

Phase 4: Validation and Flagging

Goal: Identify problematic records without removing them.

Tasks:

  1. Create validation status column
  2. Run email validation, set flag if invalid
  3. Run phone validation, set flag if invalid
  4. Run date validation, set flag if unparseable
  5. Check for required fields, flag if missing

Checkpoint: Every record has a validation status and specific flags.
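
One way to attach flags without dropping rows. A sketch that reuses is_valid_email from the hints section and assumes clean_phone/parse_date have already written nulls for bad values:

import pandas as pd

def add_validation_flags(df):
    """Record problems per row; flagged rows stay in the DataFrame."""
    df = df.copy()
    df['flags'] = [[] for _ in range(len(df))]

    for idx, row in df.iterrows():
        if pd.isna(row['customer_name']) or pd.isna(row['email']):
            df.at[idx, 'flags'].append('MISSING_REQUIRED')
        if not is_valid_email(row['email']):
            df.at[idx, 'flags'].append('INVALID_EMAIL')
        if pd.isna(row['phone']):
            df.at[idx, 'flags'].append('INVALID_PHONE')
        if pd.isna(row['signup_date']):
            df.at[idx, 'flags'].append('INVALID_DATE')

    df['validation_status'] = df['flags'].apply(lambda f: 'VALID' if not f else 'FLAGGED')
    return df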

Phase 5: Deduplication

Goal: Remove duplicate records intelligently.

Tasks:

  1. Identify duplicates based on email
  2. For each duplicate group, keep best record:
    • Prefer records with more complete data
    • Prefer most recent date
  3. Log all removed duplicates
  4. Store duplicate count for reporting

Checkpoint: No duplicate emails remain.
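
A sketch of "keep the best record": sort so the preferred row comes first, then drop the rest (completeness is approximated by counting non-null fields):

def deduplicate(df):
    """Keep one row per email: most complete first, then most recent signup_date."""
    df = df.copy()
    df['_completeness'] = df.notna().sum(axis=1)
    df = df.sort_values(['_completeness', 'signup_date'], ascending=False)
    removed = df[df.duplicated(subset='email', keep='first')]
    kept = df.drop_duplicates(subset='email', keep='first')
    return kept.drop(columns='_completeness'), removed.drop(columns='_completeness')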

Phase 6: Database Loading

Goal: Insert clean data into PostgreSQL.

Tasks:

  1. Create SQLAlchemy engine
  2. Define table schema
  3. Create table if not exists
  4. Insert records in batches
  5. Handle constraint violations gracefully
  6. Commit and verify row count

Checkpoint: Data appears in database with correct types.

Phase 7: Report Generation

Goal: Create comprehensive data quality report.

Tasks:

  1. Calculate summary statistics
  2. Create transformation log
  3. Export flagged records
  4. Generate charts (optional)
  5. Save to Excel

Checkpoint: Report provides complete audit trail.
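
A sketch of the multi-sheet report using pandas' ExcelWriter (the sheet names and the shape of the inputs are placeholders):

from datetime import date
import pandas as pd

def write_report(summary, transform_log, flagged):
    """summary: dict of stats; transform_log, flagged: DataFrames."""
    path = f'data_quality_report_{date.today()}.xlsx'
    with pd.ExcelWriter(path, engine='openpyxl') as writer:
        pd.DataFrame([summary]).to_excel(writer, sheet_name='Summary', index=False)
        transform_log.to_excel(writer, sheet_name='Transformation Log', index=False)
        flagged.to_excel(writer, sheet_name='Flagged Records', index=False)
    return path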


Testing Strategy

Unit Tests

from datetime import date

def test_clean_phone_standard():
    assert clean_phone('5551234567') == '555-123-4567'

def test_clean_phone_with_formatting():
    assert clean_phone('(555) 123-4567') == '555-123-4567'

def test_clean_phone_with_country_code():
    assert clean_phone('15551234567') == '555-123-4567'

def test_clean_phone_invalid():
    assert clean_phone('12345') is None

def test_validate_email_valid():
    assert validate_email('john@email.com') is True

def test_validate_email_invalid():
    assert validate_email('john.email.com') is False
    assert validate_email('john@') is False

def test_parse_date_formats():
    assert parse_date('1/15/24') == date(2024, 1, 15)
    assert parse_date('2024-01-15') == date(2024, 1, 15)
    assert parse_date('Jan 15, 2024') == date(2024, 1, 15)

Integration Tests

import os
from sqlalchemy import text

def test_full_pipeline():
    # Create test Excel files
    create_test_files()

    # Run ETL
    result = run_etl('test_input/', 'test_db_connection')

    # Verify database
    with engine.connect() as conn:
        count = conn.execute(text("SELECT COUNT(*) FROM customers")).scalar()
        assert count > 0

    # Verify report was generated
    assert os.path.exists('data_quality_report.xlsx')

Database Tests

import pytest
from sqlalchemy import inspect
from sqlalchemy.exc import IntegrityError

def test_database_schema():
    inspector = inspect(engine)
    columns = inspector.get_columns('customers')
    column_names = [c['name'] for c in columns]

    assert 'customer_name' in column_names
    assert 'email' in column_names
    assert 'phone' in column_names
    assert 'signup_date' in column_names

def test_unique_constraint():
    # Insert same email twice should fail
    with pytest.raises(IntegrityError):
        insert_customer(email='test@example.com')
        insert_customer(email='test@example.com')

Common Pitfalls and Debugging

Pitfall 1: ZIP Code Leading Zeros

Symptom: ZIP code "02134" becomes "2134"

Cause: pandas interprets as integer

Solution:

df = pd.read_excel(file, dtype={'zip_code': str})

Pitfall 2: Date Ambiguity

Symptom: "1/2/24" parsed as February 1 instead of January 2

Cause: Default locale assumption

Solution:

from datetime import datetime

def parse_date(value):
    formats = [
        '%m/%d/%y',    # US format first (if that's your source)
        '%d/%m/%y',    # European format
        '%Y-%m-%d',    # ISO format
        '%B %d, %Y',   # Month name format
    ]
    for fmt in formats:
        try:
            return datetime.strptime(str(value).strip(), fmt).date()
        except ValueError:
            continue
    return None

Pitfall 3: Unicode in Names

Symptom: Names like "José" cause encoding errors

Cause: Source file encoding mismatch (usually a CSV export rather than a true .xlsx file; .xlsx text is already Unicode, and pd.read_excel takes no encoding argument)

Solution:

try:
    df = pd.read_csv(file, encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv(file, encoding='latin-1')

Pitfall 4: Database Connection Pool Exhaustion

Symptom: "too many connections" error

Cause: Not closing connections properly

Solution:

# Use context manager
with engine.begin() as conn:
    conn.execute(stmt)
# Connection automatically closed

Pitfall 5: Batch Insert Fails Partially

Symptom: Some records inserted, some failed, hard to know which

Cause: Single transaction for entire batch

Solution:

# Insert one at a time with individual error handling
for record in records:
    try:
        with engine.begin() as conn:
            conn.execute(insert_stmt, record)
    except Exception as e:
        failed_records.append((record, str(e)))

Extensions and Challenges

Extension 1: Incremental Loading

Only process new/changed records since last run.

Skills practiced: Change detection, timestamps, watermarking
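
A file-level watermark is the simplest starting point. A sketch, assuming the watermark lives in a small JSON file and file modification times are trustworthy:

import glob
import json
import os

WATERMARK_FILE = 'last_run.json'   # hypothetical location

def new_files_since_last_run(pattern='input/*.xlsx'):
    """Return files modified since the previous run and advance the watermark."""
    last_run = 0.0
    if os.path.exists(WATERMARK_FILE):
        with open(WATERMARK_FILE) as fh:
            last_run = json.load(fh)['last_run']

    fresh = [p for p in glob.glob(pattern) if os.path.getmtime(p) > last_run]

    new_mark = max((os.path.getmtime(p) for p in fresh), default=last_run)
    with open(WATERMARK_FILE, 'w') as fh:
        json.dump({'last_run': new_mark}, fh)

    return fresh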

Extension 2: Parallel Processing

Process multiple files simultaneously.

Skills practiced: multiprocessing, thread safety, concurrent database access

Extension 3: Schema Evolution

Handle new columns appearing in source files.

Skills practiced: Dynamic schema detection, ALTER TABLE, migration patterns

Extension 4: Data Lineage

Track exactly where each database record came from.

Skills practiced: Metadata management, audit logging, provenance tracking

Extension 5: Automated Scheduling

Run the ETL pipeline automatically on schedule.

Skills practiced: cron, task scheduling, monitoring, alerting


Real-World Connections

Industry Applications

  • Marketing: Customer data integration from multiple campaigns
  • Sales: Lead consolidation from various sources
  • HR: Employee data migration during acquisitions
  • Finance: Transaction reconciliation from multiple systems

Career Relevance

This project demonstrates skills for:

  • Data Engineer roles
  • ETL Developer positions
  • Database Administrator roles
  • Business Intelligence Engineer

Portfolio Value

When presenting:

  1. Show the data quality improvement (before/after metrics)
  2. Demonstrate idempotency (safe to run multiple times)
  3. Discuss scalability considerations
  4. Highlight error handling and recovery

Interview Questions

Prepare to answer:

  1. "What's the difference between an ETL and an ELT pipeline?"

  2. "How would you handle a 10GB Excel file that doesn't fit in memory?"

  3. "What's a database transaction and why is it important in data loading?"

  4. "How do you ensure your ETL pipeline is idempotent?"

  5. "What's the CAP theorem and how does it relate to data consistency?"

  6. "How would you parallelize this pipeline to process files faster?"


Hints by Layer

Layer 1: Column Mapping

column_mapping = {
    'Customer Name': 'customer_name',
    'client': 'customer_name',
    'name': 'customer_name',
    'Email Address': 'email',
    'email': 'email',
    'contact_email': 'email',
}

df.rename(columns=column_mapping, inplace=True)

Layer 2: Email Validation

import re

email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

def is_valid_email(email):
    if pd.isna(email):
        return False
    return bool(re.match(email_pattern, str(email)))

df['email_valid'] = df['email'].apply(is_valid_email)

Layer 3: Phone Cleaning

def clean_phone(phone):
    if pd.isna(phone):
        return None
    digits = ''.join(filter(str.isdigit, str(phone)))
    if len(digits) == 11 and digits[0] == '1':
        digits = digits[1:]
    if len(digits) == 10:
        return f'{digits[:3]}-{digits[3:6]}-{digits[6:]}'
    return None

df['phone'] = df['phone'].apply(clean_phone)

Layer 4: Database Connection

from datetime import datetime
from sqlalchemy import (create_engine, Table, Column, Integer, String,
                        Date, DateTime, MetaData)

engine = create_engine('postgresql://user:password@localhost/company_db')
metadata = MetaData()

customers = Table('customers', metadata,
    Column('id', Integer, primary_key=True),
    Column('customer_name', String(255)),
    Column('email', String(255), unique=True),
    Column('phone', String(20)),
    Column('signup_date', Date),
    Column('source_file', String(255)),
    Column('created_at', DateTime, default=datetime.utcnow)
)

metadata.create_all(engine)

Layer 5: Bulk Insert

records = df.to_dict('records')

chunk_size = 1000
for i in range(0, len(records), chunk_size):
    chunk = records[i:i+chunk_size]
    with engine.begin() as conn:
        conn.execute(customers.insert(), chunk)

Self-Assessment Checklist

Before considering complete:

  • All Excel files successfully read
  • Column names standardized across sources
  • Names cleaned (title case, trimmed)
  • Emails validated and flagged if invalid
  • Phones formatted to ###-###-#### or flagged
  • Dates parsed from multiple formats
  • Duplicates removed (by email)
  • Database table created with correct schema
  • All valid records inserted
  • Data quality report generated
  • Report shows accurate statistics
  • Flagged records exportable for review
  • Pipeline is idempotent (safe to rerun)
  • Errors handled gracefully
  • Logging provides audit trail

Resources

Topic                      | Resource
ETL fundamentals           | "Data Pipelines Pocket Reference" by James Densmore - Ch. 1-2
pandas data cleaning       | "Python for Data Analysis" by Wes McKinney - Ch. 7
Regular expressions        | "Python for Data Analysis" by Wes McKinney - Ch. 7.5
SQL and database basics    | "Practical SQL" by Anthony DeBarros - Ch. 1-4
SQLAlchemy fundamentals    | "Essential SQLAlchemy" by Jason Myers - Ch. 1-3
Data validation strategies | "Clean Code" by Robert C. Martin - Ch. 7-8