Local Product Sales Analysis
Goal: Build a repeatable, deterministic transformation pipeline that cleans messy sales CSVs, joins multiple entities, and produces a trusted revenue report. You will learn how real-world data quality issues surface in seemingly simple datasets and how to design checks that prevent silent business errors. By the end, you will be able to defend your data joins, aggregation logic, and validation choices.
Context and Problem
- Real-world scenario: A small ecommerce company has three CSV exports (customers, products, orders) and needs a weekly revenue report for leadership.
- Stakeholders and constraints: Finance needs accurate totals; Marketing needs customer rankings; Engineering needs a repeatable script. Constraints include messy input data and a fixed delivery deadline.
- What happens if this system fails? Revenue is misreported, promotions are mis-targeted, and quarterly results can be wrong.
Real World Outcome
- You run a single command and get a deterministic report file.
- Example CLI transcript:
$ python process_sales.py --customers data/customers.csv --products data/products.csv --orders data/orders.csv --out reports/customer_revenue.csv
Loaded customers=1000 products=250 orders=12450
Cleaned customers=997 products=248 orders=12417
Joined rows=12417
Wrote reports/customer_revenue.csv (997 rows)
- Example output table (top 5):
customer_id,customer_name,total_revenue,order_count,last_order_date
C00091,Aline Ortiz,12842.50,19,2024-05-12
C00437,Samuel Howard,11998.20,17,2024-05-28
C00112,Keisha Park,11542.00,18,2024-06-01
C00980,Andre Varga,11012.75,15,2024-05-21
C00205,Priya Iyer,10987.90,16,2024-05-30
- Expected metrics:
- total_revenue sum equals sum(price * quantity) for all valid orders
- order_count equals number of valid orders per customer
- 0 rows with null customer_name or total_revenue
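A minimal sketch of what the CLI entry point behind the transcript above could look like (argparse plus pandas; only the loading step is shown here, with the cleaning, joining, and aggregation steps covered in the Implementation Plan below):
import argparse
import pandas as pd

def parse_args():
    # Flags mirror the command shown in the transcript above.
    parser = argparse.ArgumentParser(description='Build the weekly customer revenue report.')
    parser.add_argument('--customers', required=True)
    parser.add_argument('--products', required=True)
    parser.add_argument('--orders', required=True)
    parser.add_argument('--out', required=True)
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    customers = pd.read_csv(args.customers)
    products = pd.read_csv(args.products)
    orders = pd.read_csv(args.orders)
    print(f'Loaded customers={len(customers)} products={len(products)} orders={len(orders)}')
    # cleaning, joins, aggregation, and the final write follow (see Implementation Plan)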
Core Concepts
- Data cleaning: Handling missing values, invalid types, and inconsistent formatting.
- Entity joins: Relational joins across customer, product, and order entities.
- Aggregation: Grouping and summing to produce business metrics.
- Determinism: Producing the same output from the same inputs (stable sorting, fixed parsing rules).
- Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
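In this pipeline, determinism largely comes down to explicit parsing rules and a tiebreak key in every sort. A toy pandas illustration (the customer IDs and amounts below are made up for the example):
import pandas as pd

# Fixed parsing rule: one explicit date format instead of letting pandas guess.
dates = pd.to_datetime(pd.Series(['2024-05-12', '2024-06-01']), format='%Y-%m-%d')

# Stable ordering: total_revenue alone can tie, so customer_id breaks the tie
# and the same inputs always produce the same row order.
toy = pd.DataFrame({'customer_id': ['C002', 'C001', 'C003'],
                    'total_revenue': [100.0, 250.0, 100.0]})
toy = toy.sort_values(['total_revenue', 'customer_id'], ascending=[False, True])
print(toy)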
Architecture
+----------------+ +----------------+ +----------------+
| customers.csv | | products.csv | | orders.csv |
+-------+--------+ +-------+--------+ +-------+--------+
| | |
+----------+-----------+-----------+----------+
|
[clean + normalize]
|
[join tables]
|
[aggregate revenue]
|
+-----------------------------+
| reports/customer_revenue.csv|
+-----------------------------+
Data Model
- Input schemas:
- customers: customer_id, customer_name, email, signup_date
- products: product_id, product_name, category, price
- orders: order_id, customer_id, product_id, quantity, order_date
- Output schema:
- customer_revenue: customer_id, customer_name, total_revenue, order_count, last_order_date
- Example input records:
customers.csv
C00091,Aline Ortiz,aline@example.com,2023-02-14
products.csv
P001,Wireless Mouse,Accessories,25.50
orders.csv
O90001,C00091,P001,3,2024-05-12
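The input schemas above can be enforced before any cleaning begins. A minimal sketch of a required-column check; the REQUIRED_COLUMNS mapping and check_columns helper are illustrative names, not part of pandas:
import pandas as pd

# Required columns per input file, matching the schemas listed above.
REQUIRED_COLUMNS = {
    'customers': ['customer_id', 'customer_name', 'email', 'signup_date'],
    'products': ['product_id', 'product_name', 'category', 'price'],
    'orders': ['order_id', 'customer_id', 'product_id', 'quantity', 'order_date'],
}

def check_columns(name, df):
    # Fail fast with a clear message instead of producing a silently wrong report.
    missing = set(REQUIRED_COLUMNS[name]) - set(df.columns)
    if missing:
        raise ValueError(f'{name}.csv is missing required columns: {sorted(missing)}')

customers = pd.read_csv('data/customers.csv')
check_columns('customers', customers)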
Implementation Plan
- Load each CSV into a DataFrame and validate required columns.
- Normalize data types (dates, numeric prices, integer quantities).
- Clean records: drop or quarantine rows missing required keys.
- Join orders -> products -> customers using inner joins on IDs.
- Compute revenue = price * quantity per row.
- Aggregate by customer_id and customer_name (sum revenue, count orders, max order_date).
- Sort by total_revenue descending, then customer_id ascending for stability.
- Write CSV with a fixed column order and UTF-8 encoding.
Pseudocode sketch (pandas-style; the clean_* and aggregate helpers are defined elsewhere):
import pandas as pd

orders = pd.read_csv('data/orders.csv')
products = pd.read_csv('data/products.csv')
customers = pd.read_csv('data/customers.csv')

orders = clean_orders(orders)          # parse dates and quantities, quarantine rows missing keys
products = clean_products(products)    # parse prices, quarantine rows missing product_id
customers = clean_customers(customers) # quarantine rows missing customer_id

# Key-based inner merges (DataFrame.join is index-based in pandas, so use merge).
joined = orders.merge(products, on='product_id', how='inner')
joined = joined.merge(customers, on='customer_id', how='inner')
joined['revenue'] = joined['price'] * joined['quantity']

report = aggregate(joined)             # group, sum, count, sort deterministically
report.to_csv('reports/customer_revenue.csv', index=False, encoding='utf-8')
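The aggregate step above could look roughly like the following sketch, assuming joined carries customer_id, customer_name, order_id, order_date, and the computed revenue column:
def aggregate(joined):
    # One row per customer: sum revenue, count orders, take the latest order date.
    report = (joined
              .groupby(['customer_id', 'customer_name'], as_index=False)
              .agg(total_revenue=('revenue', 'sum'),
                   order_count=('order_id', 'count'),
                   last_order_date=('order_date', 'max')))
    # Deterministic ordering: revenue descending, customer_id as the tiebreak.
    report = report.sort_values(['total_revenue', 'customer_id'],
                                ascending=[False, True])
    # Fixed column order for the output file.
    return report[['customer_id', 'customer_name', 'total_revenue',
                   'order_count', 'last_order_date']]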
Validation and Data Quality
- Accuracy: Sample 10 orders and verify revenue math.
- Example check:
assert abs(joined['revenue'].sum() - (joined['price'] * joined['quantity']).sum()) < 0.01
- Completeness: No nulls in key fields.
- Example check:
report[report.customer_name.isnull()].shape[0] == 0
- Consistency: All product_id and customer_id referenced in orders exist in dimension tables.
orders.merge(customers, on='customer_id', how='left').customer_name.isnull().sum() == 0
- Timeliness: Orders must be within expected date range (e.g., 2024-01-01 to 2024-06-30).
orders[(orders.order_date < '2024-01-01') | (orders.order_date > '2024-06-30')].empty
- Traceability: Store a rejected_rows.csv with line numbers and reasons for each dropped row.
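One way to capture dropped rows for traceability is sketched below; the quarantine helper, its column names, and the output path are illustrative choices:
import pandas as pd

rejected = []  # accumulates one record per dropped row

def quarantine(df, bad_mask, source, reason):
    # Record each dropped row with its original CSV line number and a reason,
    # then return only the rows that passed the check.
    for idx in df[bad_mask].index:
        rejected.append({'source': source,
                         'line_number': int(idx) + 2,  # +2: header row plus 1-based numbering
                         'reason': reason})
    return df[~bad_mask]

orders = pd.read_csv('data/orders.csv')
orders = quarantine(orders, orders['customer_id'].isnull(), 'orders.csv', 'missing customer_id')

pd.DataFrame(rejected).to_csv('reports/rejected_rows.csv', index=False)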
Failure Modes and Debugging
- Mismatched IDs: Orders reference products that do not exist.
- Symptom: join row count drops unexpectedly. Fix by validating foreign keys before join.
- Type parsing errors: Prices stored as “$19.99” instead of numeric.
- Symptom: NaNs in price column. Fix by stripping currency symbols before conversion.
- Duplicate customers: Same email appears with different customer_id.
- Symptom: inflated revenue due to duplicate identities. Fix by de-duplicating with rules and logging conflicts.
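For the type-parsing and duplicate-customer failures above, two short cleaning sketches (the stripping pattern and the keep-earliest-signup rule are assumptions about how the team might resolve them, not requirements from the source data):
import pandas as pd

products = pd.read_csv('data/products.csv')
# Strip currency symbols and thousands separators before numeric conversion;
# anything still unparseable becomes NaN and can be quarantined instead of silently kept.
products['price'] = pd.to_numeric(
    products['price'].astype(str).str.replace(r'[$,]', '', regex=True),
    errors='coerce')

customers = pd.read_csv('data/customers.csv')
# One possible de-duplication rule: keep the earliest signup per email
# (conflicting rows should also be written to the rejected-rows log).
customers['signup_date'] = pd.to_datetime(customers['signup_date'])
customers = customers.sort_values('signup_date').drop_duplicates(subset='email', keep='first')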
Definition of Done
- Script runs from CLI with inputs and outputs described above.
- Report is deterministic (same output for same inputs).
- All validation checks pass and are logged.
- A rejected_rows.csv is produced with reasons and counts.
- Report matches manual revenue calculation on a 10-row sample.