
LEARN APACHE PARQUET DEEP DIVE

Learn Apache Parquet: From Zero to Data Master

Goal: Deeply understand the Apache Parquet file format—from its columnar layout and internal structure to advanced topics like encoding, compression, and predicate pushdown—by building tools to read, write, and analyze Parquet files.


Why Learn Apache Parquet?

Apache Parquet is one of the most important file formats in modern data engineering and analytics. It is a free, open-source columnar storage format that provides incredible efficiency for analytical queries (OLAP). Engines and services like Spark, DuckDB, Presto, and BigQuery are all optimized to use it. Understanding Parquet is understanding the foundation of high-performance big data processing.

While many developers use Parquet through a high-level library, few understand why it’s so fast. Learning its internals will allow you to design more efficient data pipelines, debug performance issues, and build better data-intensive applications.

After completing these projects, you will:

  • Understand the fundamental difference between row-based and column-based storage.
  • Be able to read a Parquet file’s metadata to understand its structure without reading the data.
  • Intelligently choose encodings and compression algorithms to optimize for size and speed.
  • Understand how query engines use Parquet’s statistics to achieve massive performance gains.
  • Handle complex, nested data structures efficiently.
  • Be capable of writing your own tools to interact with Parquet files at a low level.

Core Concept Analysis

Row-Oriented vs. Column-Oriented Storage

This is the most critical concept. A traditional database or CSV file stores data row by row. Parquet stores it column by column.

Row-Based (e.g., CSV):

# Each line is a self-contained record
1, "Alice", 34, "DE"
2, "Bob", 28, "US"
3, "Charlie", 45, "DE"

# To get all ages, you must read the entire file and skip data you don't need.

Column-Based (Parquet):

# Data is stored by column, leading to better compression and I/O for analytics.
# (Conceptual representation)
Column "ID":      [1, 2, 3]
Column "Name":    ["Alice", "Bob", "Charlie"]
Column "Age":     [34, 28, 45]
Column "Country": ["DE", "US", "DE"]

# To get all ages, you only need to read the "Age" column data.

This is perfect for analytical queries like AVG(Age), as the query engine can ignore the ID, Name, and Country data completely.

The Parquet File Layout

A Parquet file is a highly structured binary file.

┌──────────────────────────────────────────────┐
│ Magic Bytes ("PAR1")                          │
├──────────────────────────────────────────────┤
│ Row Group 1                                   │
│   ┌──────────────────────────────────────┐   │
│   │ Column Chunk 1 (e.g., Age)           │   │
│   │   ┌─────────┬─────────┬─────────┐    │   │
│   │   │ Page 1  │ Page 2  │   ...   │    │   │
│   │   └─────────┴─────────┴─────────┘    │   │
│   └──────────────────────────────────────┘   │
│   ┌──────────────────────────────────────┐   │
│   │ Column Chunk 2 (e.g., Country)       │   │
│   │   ┌─────────┬─────────┬─────────┐    │   │
│   │   │ Page 1  │ Page 2  │   ...   │    │   │
│   │   └─────────┴─────────┴─────────┘    │   │
│   └──────────────────────────────────────┘   │
├──────────────────────────────────────────────┤
│ Row Group 2                                   │
│   ┌──────────────────────────────────────┐   │
│   │ ...                                  │   │
│   └──────────────────────────────────────┘   │
├──────────────────────────────────────────────┤
│ File Footer                                   │
│   - Thrift FileMetaData:                      │
│     - Version                                 │
│     - Schema                                  │
│     - List of Row Groups                      │
│     - Pointers to each Column Chunk           │
│     - Column stats (min, max, null count)     │
│   - Footer length (4 bytes, little-endian)    │
│   - Magic Bytes ("PAR1")                      │
└──────────────────────────────────────────────┘

Key Concepts Explained

  1. File Footer (Metadata): The single most important part of the file. It’s at the end, so it can be written in one pass after all the data is processed. It contains the schema, the location of all data chunks, and vital statistics. A reader first reads the footer to get a complete map of the file.

  2. Row Group: A horizontal partitioning of the data into rows. A row group consists of a column chunk for each column. A typical row group size is 128MB to 1GB.

  3. Column Chunk: The data for a single column within one row group.

  4. Page: The smallest unit in a Parquet file, containing a subset of a column chunk’s data. Compression and encoding are applied at the page level. A typical page size is 1MB.

  5. Encodings: Techniques to reduce data size before compression.
    • Dictionary Encoding: For low-cardinality columns (like “Country”), a dictionary of unique values is created. The data pages then store only the integer indices, which is highly efficient.
    • Run-Length Encoding (RLE): Used in combination with dictionary encoding to store sequences of repeated values.
    • Plain Encoding: The fallback encoding; values are stored back to back in their raw form.
  6. Compression: After encoding, data pages are compressed using algorithms like Snappy (fast, good compression), Gzip (slower, better compression), or ZSTD (modern, often the best balance).

  7. Predicate Pushdown (Statistics): The footer metadata contains statistics (min, max, null count) for each column chunk. A smart query engine can use these stats to skip reading entire row groups. For a query WHERE age > 50, if a row group’s “age” statistics show min=20, max=45, the engine knows it doesn’t need to read that row group at all.
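
To make concepts 5 and 7 concrete, here is a purely conceptual Python sketch (not Parquet's actual byte layout): it dictionary-encodes a low-cardinality column and shows the kind of min/max check a reader performs before deciding to skip a row group.

    # Conceptual illustration only -- real Parquet stores these as binary pages.
    country = ["DE", "US", "DE", "DE", "US", "DE"]

    # Dictionary encoding: store each unique value once, then small integer indices.
    dictionary = sorted(set(country))                   # ['DE', 'US']
    indices = [dictionary.index(v) for v in country]    # [0, 1, 0, 0, 1, 0]

    # Predicate pushdown: check footer statistics before touching any data pages.
    row_group_stats = {"age": {"min": 20, "max": 45}}

    def can_skip(stats, column, op, value):
        """True if no row in this row group can satisfy 'column op value'."""
        s = stats[column]
        if op == ">":
            return s["max"] <= value
        if op == "<":
            return s["min"] >= value
        return False

    print(can_skip(row_group_stats, "age", ">", 50))    # True -> skip the row group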

Project List

The following 10 projects will build your intuition and expertise in Apache Parquet.


Project 1: CSV to Parquet Converter

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Rust, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Data Formats / ETL
  • Software or Tool: Apache Arrow (pyarrow), pandas
  • Main Book: “Python for Data Analysis” by Wes McKinney

What you’ll build: A command-line utility that reads a CSV file, infers a schema, and writes the data into an efficient Parquet file.

Why it teaches Parquet: This is the most common real-world task involving Parquet. It forces you to handle schema definition, data type mapping (string to int, etc.), and the basic write path using a high-level library like Arrow.

Core challenges you’ll face:

  • Reading CSV data efficiently → maps to using a library like pandas or a CSV reader
  • Handling schema inference → maps to letting the library guess data types vs. defining them explicitly
  • Using the Parquet write API → maps to pyarrow.parquet.write_table or similar
  • Handling data types and nulls → maps to ensuring what’s in the CSV is correctly represented in Parquet

Key Concepts:

  • Schema Definition: Apache Arrow documentation on Schema.
  • Writing Parquet files: pyarrow documentation on the ParquetWriter class.
  • DataFrames: “Python for Data Analysis” Ch. 5, on the core pandas structure.

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic Python and familiarity with the pandas library.

Real world outcome: A CLI tool that is a fundamental building block of any data pipeline.

# Convert a large CSV file to Parquet
$ python converter.py --input large_dataset.csv --output large_dataset.parquet
Successfully converted file.
Original size: 500 MB
Parquet size: 85 MB

Implementation Hints:

  1. Use argparse to create a simple command-line interface.
  2. Use pandas.read_csv() to load the CSV file into a DataFrame. This is the easiest way to handle type inference.
  3. Use pyarrow.Table.from_pandas(df) to convert the pandas DataFrame into an Arrow Table. Arrow is the in-memory format that works seamlessly with Parquet.
  4. Use pyarrow.parquet.write_table(table, output_path) to write the Arrow Table to a Parquet file.
  5. Add options to specify the compression codec (e.g., --compression snappy).
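
A minimal sketch that pulls these hints together (the flag names match the example output above; error handling and chunked reads are left out):

    import argparse
    import os

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    def main():
        parser = argparse.ArgumentParser(description="Convert a CSV file to Parquet.")
        parser.add_argument("--input", required=True)
        parser.add_argument("--output", required=True)
        parser.add_argument("--compression", default="snappy")
        args = parser.parse_args()

        df = pd.read_csv(args.input)          # type inference happens here
        table = pa.Table.from_pandas(df)      # pandas -> Arrow in-memory table
        pq.write_table(table, args.output, compression=args.compression)

        print("Successfully converted file.")
        print(f"Original size: {os.path.getsize(args.input) / 1e6:.1f} MB")
        print(f"Parquet size: {os.path.getsize(args.output) / 1e6:.1f} MB")

    if __name__ == "__main__":
        main()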

Learning milestones:

  1. Convert a simple CSV with all string columns → You understand the basic read/write path.
  2. Convert a CSV with numbers and dates → You see how schemas and data types are handled.
  3. Compare the output file size to the original CSV → You have a tangible “win” for using Parquet.
  4. Read the Parquet file back with pandas and verify the data is identical → You trust your conversion process.

Project 2: Parquet Metadata Inspector

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: File Formats / Diagnostics
  • Software or Tool: Apache Arrow (pyarrow)
  • Main Book: The official Apache Parquet specification document.

What you’ll build: A tool that reads a Parquet file and prints its metadata in a human-readable format, similar to parquet-tools meta.

Why it teaches Parquet: It forces you to look inside the “black box.” You’ll interact directly with the file’s footer, schema, row group information, and column statistics. This project builds the mental model of how a Parquet file is physically organized.

Core challenges you’ll face:

  • Reading only the file metadata → maps to using pq.read_metadata() to avoid reading any actual data
  • Navigating the metadata object → maps to understanding the hierarchy: file -> row group -> column chunk
  • Accessing column statistics → maps to finding the min/max/null_count values that are key for performance
  • Formatting the output clearly → maps to presenting complex nested information in a readable way

Key Concepts:

  • File Metadata Structure: The pyarrow.parquet.FileMetaData object documentation.
  • Column Statistics: The pyarrow.parquet.Statistics object documentation.
  • Thrift Data Model: The underlying data model for Parquet metadata (though abstracted by Arrow).

Difficulty: Intermediate Time estimate: Weekend Prerequisites: Project 1.

Real world outcome: A powerful diagnostic tool for inspecting any Parquet file.

$ python inspector.py my_data.parquet
File: my_data.parquet
Version: 1.0
Num Rows: 1,000,000
Num Row Groups: 4
---
Row Group 0: (250,000 rows)
  Column 'age':
    Encoding: PLAIN, DICTIONARY
    Compression: SNAPPY
    Stats: min=18, max=65, null_count=50
  Column 'country':
    Encoding: RLE_DICTIONARY
    Compression: SNAPPY
    Stats: min=DE, max=US, null_count=0
...

Implementation Hints:

  1. Use pyarrow.parquet.ParquetFile(file_path) to open the file.
  2. Get the metadata object via pf.metadata. This is a FileMetaData object.
  3. You can access metadata.num_rows, metadata.num_row_groups, and the schema.
  4. Iterate with for i in range(metadata.num_row_groups).
  5. Inside the loop, get the row group metadata object: rg = metadata.row_group(i).
  6. You can then access rg.num_rows and iterate through its columns.
  7. For each column, you can get its path, compression, encodings, and, most importantly, its statistics if they exist.
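
A minimal sketch of that metadata walk (statistics can be None if the writer did not record them, hence the guard):

    import sys

    import pyarrow.parquet as pq

    pf = pq.ParquetFile(sys.argv[1])
    md = pf.metadata  # FileMetaData, read from the footer -- no data pages touched

    print(f"Num Rows: {md.num_rows:,}")
    print(f"Num Row Groups: {md.num_row_groups}")

    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(f"Row Group {i}: ({rg.num_rows:,} rows)")
        for j in range(rg.num_columns):
            col = rg.column(j)
            print(f"  Column '{col.path_in_schema}':")
            print(f"    Encoding: {', '.join(col.encodings)}")
            print(f"    Compression: {col.compression}")
            stats = col.statistics
            if stats is not None and stats.has_min_max:
                print(f"    Stats: min={stats.min}, max={stats.max}, "
                      f"null_count={stats.null_count}")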

Learning milestones:

  1. Print the top-level file schema and number of rows → You’ve correctly read the footer.
  2. List all the row groups and their row counts → You understand horizontal partitioning.
  3. Display the compression and encodings for each column chunk → You see the physical properties.
  4. Print the min/max statistics for a numeric column → You understand the foundation of predicate pushdown.

Project 3: A “Parquet-cat” Tool with Column Projection

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Formats / CLI Tools
  • Software or Tool: Apache Arrow (pyarrow)
  • Main Book: The official pyarrow documentation.

What you’ll build: A utility that reads a Parquet file and prints its contents to the console, similar to cat, but with the crucial ability to select only specific columns to read.

Why it teaches Parquet: This project demonstrates the primary advantage of a columnar format: column projection. You’ll prove that reading two columns from a 100-column file is much faster than reading all 100 columns, as the tool only needs to touch the data for the selected columns.

Core challenges you’ll face:

  • Reading a Parquet file into a table → maps to using pq.read_table()
  • Implementing column selection → maps to using the columns argument in the read API to perform projection
  • Converting table data to a readable format → maps to iterating through the table or converting to pandas for printing
  • Handling large files without OOM errors → maps to reading the file in batches of row groups

Key Concepts:

  • Column Projection: The concept of only reading the columns required for a query.
  • Batched Reads: ParquetFile.iter_batches() in pyarrow for streaming data.
  • Arrow Tables and RecordBatches: The in-memory data structures pyarrow uses.

Difficulty: Intermediate Time estimate: Weekend Prerequisites: Project 1.

Real world outcome: A CLI tool that can quickly inspect the data in even very large Parquet files.

# Read all data from a file
$ python parquet_cat.py my_data.parquet

# Read only the 'name' and 'age' columns, which is much faster
$ python parquet_cat.py my_data.parquet --columns name age

Implementation Hints:

  1. Use argparse to accept a file path and an optional list of column names.
  2. The core of your logic will be pyarrow.parquet.read_table(file_path, columns=column_list).
  3. If column_list is None (not provided), all columns are read. If a list of names is provided, pyarrow will automatically perform the column projection at the read level.
  4. Once you have the Table object, you can convert it to a pandas DataFrame with table.to_pandas() and print it.
  5. Advanced version: To handle files that don’t fit in memory, open the file with ParquetFile and use iter_batches(columns=column_list). This will give you an iterator of RecordBatch objects, which you can process one at a time.
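
A minimal sketch of the simple, in-memory version (the batched variant from hint 5 would swap read_table for ParquetFile.iter_batches):

    import argparse

    import pyarrow.parquet as pq

    parser = argparse.ArgumentParser(description="Print a Parquet file's contents.")
    parser.add_argument("path")
    parser.add_argument("--columns", nargs="+", default=None)  # None = read all columns
    args = parser.parse_args()

    # Column projection happens inside read_table: only the requested
    # column chunks are actually read from disk.
    table = pq.read_table(args.path, columns=args.columns)
    print(table.to_pandas().to_string())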

Learning milestones:

  1. Cat a small file with all columns → Your basic read-and-print logic works.
  2. Select a single column and print it → You’ve successfully implemented column projection.
  3. Time the difference between reading 2 columns vs. 20 columns on a large file → You have concrete proof of the performance benefit.
  4. Implement batched reading for a file that is larger than RAM → You can now handle arbitrarily large datasets.

Project 4: Performance Showdown: Parquet vs. CSV

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, DuckDB CLI
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Benchmarking / Data Analytics
  • Software or Tool: Apache Arrow, DuckDB
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A benchmarking script that generates a large dataset (e.g., 1GB CSV), converts it to Parquet, and then runs identical analytical queries against both files to measure the dramatic difference in file size and query speed.

Why it teaches Parquet: It provides undeniable, quantitative proof of why Parquet is the industry standard. It makes the abstract concepts of columnar storage and predicate pushdown concrete by showing a 10-100x performance improvement on your own machine.

Core challenges you’ll face:

  • Generating a large, realistic dataset → maps to creating a script to produce a multi-million row CSV
  • Writing an analytical query → maps to performing a filtering and aggregation operation (e.g., WHERE...GROUP BY...)
  • Executing the query on CSV vs. Parquet → maps to using a query engine like DuckDB or writing pandas code for both paths
  • Measuring and presenting the results → maps to timing the operations and printing a clear comparison

Key Concepts:

  • Analytical (OLAP) vs. Transactional (OLTP) workloads.
  • I/O Optimization: Why reading less data from disk is the key to performance.
  • Predicate Pushdown: A query engine pushing a WHERE clause down to the storage layer.

Difficulty: Intermediate Time estimate: Weekend Prerequisites: Project 1.

Real world outcome: A script that produces a report that will convince any data engineer to use Parquet.

--- File Size ---
data.csv:      1024.0 MB
data.parquet:    95.3 MB (10.7x smaller)

--- Query: AVG(col_A) WHERE col_B > 500 ---
Time on CSV:      12.5 seconds
Time on Parquet:   0.2 seconds (62.5x faster)

Implementation Hints:

  1. Data Generation: Write a script to generate a CSV with ~10 columns and 5-10 million rows. Include numeric columns, string columns with low cardinality (good for dictionary encoding), and string columns with high cardinality.
  2. Conversion: Use your script from Project 1 to convert this CSV to Parquet.
  3. Querying: The easiest way to run the exact same query is to use a query engine that can read both formats. DuckDB is perfect for this.
    import duckdb
    import time

    # Time query on CSV (.fetchall() forces the lazy relation to actually execute)
    start = time.time()
    duckdb.query("SELECT avg(col_A) FROM 'data.csv' WHERE col_B > 500").fetchall()
    end = time.time()
    print(f"CSV time: {end - start}")

    # Time query on Parquet
    start = time.time()
    duckdb.query("SELECT avg(col_A) FROM 'data.parquet' WHERE col_B > 500").fetchall()
    end = time.time()
    print(f"Parquet time: {end - start}")

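For hint 1, one way to generate a suitable test CSV (the column names besides col_A and col_B are made up for illustration):

    import numpy as np
    import pandas as pd

    n = 5_000_000  # increase until the CSV reaches the size you want (~1 GB)
    df = pd.DataFrame({
        "col_A": np.random.normal(100, 15, n),                     # numeric, aggregated
        "col_B": np.random.randint(0, 1000, n),                    # numeric, filtered on
        "country": np.random.choice(["DE", "US", "FR", "JP"], n),  # low cardinality
        "session_id": [f"sess_{i}" for i in range(n)],             # high cardinality
    })
    df.to_csv("data.csv", index=False)
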
Learning milestones:

  1. Generate a 1GB+ CSV file → You can create a sufficiently large test dataset.
  2. Run a query on both file types → Your benchmarking setup works.
  3. See a >10x speedup on Parquet → You’ve proven the power of predicate pushdown and columnar reads.
  4. See a >5x reduction in file size → You’ve proven the power of columnar compression.

Project 5: Exploring Encodings and Compression

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Performance Tuning / File Formats
  • Software or Tool: Apache Arrow (pyarrow)
  • Main Book: The official Apache Parquet specification document.

What you’ll build: A script that takes a standard dataset and writes it to multiple Parquet files, experimenting with every major compression codec (snappy, gzip, zstd, none) and toggling dictionary encoding. It will then produce a report comparing file sizes and read speeds.

Why it teaches Parquet: It moves you from a user to a tuner. You’ll learn that the default settings are not always optimal and that understanding your data’s characteristics (like cardinality) allows you to make intelligent trade-offs between write speed, read speed, and file size.

Core challenges you’ll face:

  • Controlling write options in pyarrow → maps to using pyarrow.parquet.write_table with compression and use_dictionary arguments
  • Writing different versions for each column → maps to specifying per-column write options
  • Systematically benchmarking file size and read time → maps to scripting the creation and subsequent reading of many file variations
  • Analyzing the results → maps to understanding WHY zstd is good for size, WHY snappy is good for speed, and WHEN dictionary encoding helps

Key Concepts:

  • Compression Codecs: Snappy (fast), Gzip (slow, small), ZSTD (good balance).
  • Dictionary Encoding: The trade-off between build time and read performance/size.
  • Data Cardinality: How the number of unique values in a column affects encoding strategy.

Difficulty: Advanced Time estimate: Weekend Prerequisites: Project 1, Project 4.

Real world outcome: A table that gives you an expert’s intuition on how to configure Parquet files.

| Compression | Dictionary | File Size (MB) | Read Time (s) |
|-------------|------------|----------------|---------------|
| none        | True       | 150.2          | 0.5           |
| snappy      | True       | 85.3           | 0.7           |
| gzip        | True       | 60.1           | 2.1           |
| zstd        | True       | 58.9           | 0.9           |
| snappy      | False      | 120.5          | 0.6           |

Implementation Hints:

  1. Start with a pandas DataFrame or Arrow Table.
  2. Create lists of codecs (['snappy', 'gzip', 'zstd', 'none']) and dictionary settings ([True, False]).
  3. Loop through these options. Inside the loop, use pq.write_table():
    # Example for one iteration
    pq.write_table(
        table,
        f'output_{codec}_{use_dict}.parquet',
        compression=codec,
        use_dictionary=use_dict  # Global setting
        # Or you can specify per-column: use_dictionary=['col_A', 'col_B']
    )
    
  4. After creating all the files, loop through them again. For each file, record its size on disk (os.path.getsize) and time how long it takes to read it fully (pq.read_table(file)).
  5. Print the results in a formatted table.
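
A sketch of the benchmark loop from hints 2-5, assuming `table` is an Arrow Table already in memory:

    import os
    import time

    import pyarrow.parquet as pq

    results = []
    for codec in ["none", "snappy", "gzip", "zstd"]:
        for use_dict in [True, False]:
            path = f"output_{codec}_{use_dict}.parquet"
            pq.write_table(table, path, compression=codec, use_dictionary=use_dict)

            start = time.time()
            pq.read_table(path)    # full read: measures decoding/decompression cost
            elapsed = time.time() - start

            results.append((codec, use_dict, os.path.getsize(path) / 1e6, elapsed))

    print(f"{'Compression':<12} {'Dictionary':<11} {'Size (MB)':<10} {'Read Time (s)'}")
    for codec, use_dict, size_mb, elapsed in results:
        print(f"{codec:<12} {str(use_dict):<11} {size_mb:<10.1f} {elapsed:.2f}")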

Learning milestones:

  1. Write files with different compression codecs → You’ve mastered the compression write option.
  2. Disable dictionary encoding for a high-cardinality column and see the file size explode → You understand where dictionary encoding is most effective.
  3. Compare the slow read time of Gzip with the fast read time of Snappy → You understand the CPU vs. I/O trade-off.
  4. Produce a final comparison table → You have a personal reference guide for performance tuning.

Project 6: Handling Nested Data Structures

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Modeling / Nested Schemas
  • Software or Tool: Apache Arrow (pyarrow)
  • Main Book: “Learning Spark, 2nd Edition” by Jules S. Damji, et al.

What you’ll build: A script that reads complex, nested JSON data (like you’d get from a MongoDB export or a web API), preserves the nested structure in a Parquet file, and then executes queries against specific nested fields.

Why it teaches Parquet: It breaks you out of the “flat table” mindset. You’ll learn about Parquet’s rich type system (structs, lists, maps) and get a glimpse of the Dremel paper’s definition/repetition level encoding that makes this all possible.

Core challenges you’ll face:

  • Inferring a schema from nested JSON → maps to how libraries like pyarrow represent structs and lists
  • Writing nested data → maps to ensuring the library correctly handles the complex structure
  • Querying a nested field → maps to using dot-notation (a.b.c) to access fields within structs
  • Understanding definition/repetition levels (conceptually) → maps to why Parquet can efficiently store data where some nested fields are null or repeated

Key Concepts:

  • Nested Types: struct, list, and map types in Arrow/Parquet schemas.
  • Dot Notation: Standard way to access nested fields in query engines.
  • Definition/Repetition Levels: The low-level mechanism for encoding nested data. You won’t implement it, but you’ll see its effects.

Difficulty: Advanced Time estimate: Weekend Prerequisites: Project 1, familiarity with JSON.

Real world outcome: You can efficiently store and query complex document-oriented data in Parquet.

// Input JSON
{
  "user_id": 123,
  "profile": {
    "name": "Alice",
    "addresses": [
      { "type": "home", "zip": "10001" },
      { "type": "work", "zip": "10002" }
    ]
  }
}

// Your tool can convert this to Parquet and then run a query like:
// "Find all users with a 'home' address"

Implementation Hints:

  1. Start with a list of Python dictionaries representing your nested JSON.
  2. The quickest way to write this is to let pandas flatten the records, though note that json_normalize turns the nesting into dotted column names rather than preserving true nested types.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    data = [...]  # Your list of dicts
    df = pd.json_normalize(data) # Flattens the structure with dot notation columns
    table = pa.Table.from_pandas(df)
    pq.write_table(table, 'nested.parquet')
    
  3. The more advanced (and better) way is to define an Arrow Schema manually.
    schema = pa.schema([
        pa.field('user_id', pa.int64()),
        pa.field('profile', pa.struct([
            pa.field('name', pa.string()),
            pa.field('addresses', pa.list_(pa.struct([
                pa.field('type', pa.string()),
                pa.field('zip', pa.string())
            ])))
        ]))
    ])
    # Then create an Arrow Table and write it (see the sketch after this list).
    
  4. To query the data, read it back (pq.read_table) and use the query engine of your choice (pandas, DuckDB) which typically supports dot notation for accessing nested fields.
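
Completing hints 3 and 4, a minimal sketch that keeps the nesting intact and then filters on a nested field (pa.Table.from_pylist and DuckDB's dot notation on structs are assumed to be available in recent versions):

    import duckdb
    import pyarrow as pa
    import pyarrow.parquet as pq

    data = [
        {"user_id": 123,
         "profile": {"name": "Alice",
                     "addresses": [{"type": "home", "zip": "10001"},
                                   {"type": "work", "zip": "10002"}]}},
        {"user_id": 456,
         "profile": {"name": "Bob",
                     "addresses": [{"type": "home", "zip": "20001"}]}},
    ]

    schema = pa.schema([
        pa.field("user_id", pa.int64()),
        pa.field("profile", pa.struct([
            pa.field("name", pa.string()),
            pa.field("addresses", pa.list_(pa.struct([
                pa.field("type", pa.string()),
                pa.field("zip", pa.string()),
            ]))),
        ])),
    ])

    # Unlike pd.json_normalize, from_pylist preserves true nested types.
    table = pa.Table.from_pylist(data, schema=schema)
    pq.write_table(table, "nested.parquet")

    # Query a nested struct field with dot notation.
    print(duckdb.query(
        "SELECT user_id FROM 'nested.parquet' WHERE profile.name = 'Alice'"
    ).fetchall())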

Learning milestones:

  1. Write a Parquet file with a simple struct (e.g., one level of nesting) → You can handle basic nesting.
  2. Write a file containing a list of structs (like multiple addresses) → You understand how repeated nested records work.
  3. Read the file and successfully filter based on a deeply nested field → You can query your complex data.
  4. Inspect the file’s schema with your metadata tool from Project 2 → You can see how pyarrow represents nested types in the Parquet schema.



Project 7: Implement Predicate Pushdown Manually

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Query Optimization / Low-Level I/O
  • Software or Tool: Apache Arrow (pyarrow)
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A “smart” Parquet reader that takes a filter condition (e.g., age > 50), reads the file’s metadata, inspects the row group statistics, and explicitly skips reading row groups that cannot possibly contain matching data.

Why it teaches Parquet: This is the secret sauce of Parquet’s performance. By implementing this yourself, you’ll gain a deep, practical understanding of how query engines avoid massive amounts of I/O. You’re essentially building a small piece of a query optimizer.

Core challenges you’ll face:

  • Parsing a simple filter expression → maps to writing a small parser for expressions like “column op value”
  • Comparing the filter to row group statistics → maps to writing the logic: if filter.max < stats.min: skip
  • Reading specific row groups → maps to using the ParquetFile.read_row_group() method
  • Stitching the results together → maps to concatenating the data from only the relevant row groups

Key Concepts:

  • Predicate Pushdown: The core concept of pushing a filter down to the storage layer.
  • Row Group Statistics: The Statistics object in the Parquet metadata.
  • Selective I/O: The practice of only reading the bytes you absolutely need.

Difficulty: Expert Time estimate: 1-2 weeks Prerequisites: Project 2 (Metadata Inspector).

Real world outcome: A query tool that visually demonstrates which row groups are being skipped, making the abstract concept of predicate pushdown visible.

$ python smart_reader.py my_data.parquet --filter "age > 60"
Query: age > 60
Inspecting Row Group 0... stats(min=18, max=45). SKIPPING.
Inspecting Row Group 1... stats(min=30, max=55). SKIPPING.
Inspecting Row Group 2... stats(min=50, max=65). READING.
Inspecting Row Group 3... stats(min=62, max=88). READING.

Read 2 of 4 row groups. Total rows returned: 15,000

Implementation Hints:

  1. Create a large Parquet file with several row groups (you can control this with the row_group_size write option in pyarrow). Make sure the data has some order so the min/max stats are meaningful.
  2. Write a simple parser for your filter string. It just needs to extract the column name, operator (>, <, =), and value.
  3. Open the Parquet file using pq.ParquetFile.
  4. Get the metadata and loop through the row groups, just like in Project 2.
  5. Inside the loop, get the column statistics for the column in your filter.
  6. Implement the core logic:
    • If the filter is age > 60 and the stats for a row group are min=20, max=55, then you can safely skip this entire row group.
    • If the filter is age < 30 and the stats are min=40, max=80, you can also skip.
    • If the filter condition overlaps with the min/max range, you must read the row group.
  7. Keep a list of the indices of the row groups you need to read.
  8. After checking all row groups, iterate through your list of indices and use pf.read_row_group(i) to read only the necessary data. Concatenate the results.
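
A condensed sketch of hints 3-8 for a hard-coded '>' filter (the filter-string parser and the other operators are left as an exercise; rows inside the kept row groups still need the exact filter applied afterwards):

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_with_pushdown(path, column, value):
        """Read only the row groups whose statistics admit rows with column > value."""
        pf = pq.ParquetFile(path)
        md = pf.metadata

        keep = []
        for i in range(md.num_row_groups):
            rg = md.row_group(i)
            stats = None
            for j in range(rg.num_columns):
                if rg.column(j).path_in_schema == column:
                    stats = rg.column(j).statistics
                    break
            if stats is not None and stats.has_min_max and stats.max <= value:
                print(f"Row Group {i}... stats(min={stats.min}, max={stats.max}). SKIPPING.")
                continue
            print(f"Row Group {i}... READING.")
            keep.append(i)

        if not keep:
            return None  # no row group can contain a matching row
        return pa.concat_tables([pf.read_row_group(i) for i in keep])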

Learning milestones:

  1. You can read the min/max stats for a specific column in a specific row group → You’ve mastered metadata navigation.
  2. You correctly implement the logic to skip a row group based on a > filter → Your pushdown logic is working.
  3. You can read a single, specific row group by its index → You’ve learned how to do selective reads.
  4. Your tool successfully filters a large file and shows which row groups it skipped → You’ve built a mini-query optimizer.

Project 8: Write a Parquet File from Scratch

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python or Java
  • Alternative Programming Languages: C, Go, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 5: Master
  • Knowledge Area: Binary Formats / File Specification
  • Software or Tool: None. Just a hex editor to verify.
  • Main Book: The official Apache Parquet format specification on GitHub.

What you’ll build: A program that, without using any Parquet libraries, manually constructs the bytes for a minimal, valid Parquet file containing a few rows of data in a single column.

Why it teaches Parquet: This is the ultimate deep dive. You will be forced to read the official specification and understand every single byte: the magic numbers, the Thrift metadata structure, the page headers, the encodings. After this, Parquet will have no secrets. You will have achieved true mastery.

Core challenges you’ll face:

  • Understanding the Parquet format spec → maps to reading and interpreting highly technical documentation
  • Serializing metadata with Thrift → maps to finding a basic Thrift library or writing the serialization yourself for a minimal case
  • Implementing a simple encoding (like PLAIN) → maps to writing the bytes for your data values correctly
  • Calculating offsets and lengths → maps to manually tracking the position of every piece of data and writing those pointers into the footer
  • Getting it all right, byte for byte → maps to painstaking debugging with a hex editor

Key Concepts:

  • Thrift Serialization: The protocol used for Parquet’s metadata.
  • Page/Column/File Structure: The full, unabstracted layout of the file.
  • Magic Numbers: The PAR1 bytes that identify a Parquet file.

Difficulty: Master Time estimate: 1 month+ Prerequisites: All previous projects, strong skills in a low-level language, and immense patience.

Real world outcome: You produce a file, my_manual.parquet, that is perhaps only a few hundred bytes long, but it can be successfully read by standard tools like pyarrow or parquet-tools, proving you have correctly implemented the format specification.

Implementation Hints:

  1. Read the Spec: Start here: https://github.com/apache/parquet-format.
  2. Start from the End: A Parquet file is best written backwards. The footer is written last, once you know the locations of everything else.
  3. Minimal Viable File: Aim for the simplest possible file: one row group, one column chunk, one data page. The column can be simple integers.
  4. Metadata: The hardest part is the Thrift metadata. You will need to create the FileMetaData struct and serialize it. You might want to use a Thrift library for this part alone, or you can try to write the binary Thrift protocol by hand for the few fields you need.
  5. Structure:
    • Start with the magic bytes PAR1.
    • Write a data page (e.g., a few 32-bit integers).
    • Write a page header containing info about the data page (e.g., encoding, size).
    • Construct the FileMetaData Thrift object in memory. It should contain the schema and a RowGroup object that points to your column chunk, which in turn points to your data page.
    • Serialize the Thrift object.
    • Write the serialized Thrift data.
    • Write the length of the metadata (4 bytes, little-endian).
    • Write the final magic bytes PAR1.

Learning milestones:

  1. You write a file with only the PAR1 magic bytes at the start and end → You’ve created the file’s shell.
  2. You successfully serialize a minimal FileMetaData object using Thrift → You’ve conquered the hardest part.
  3. You write a single data page with PLAIN encoding → You’re writing actual data.
  4. A standard tool like pyarrow.read_table() successfully reads your file without error → You have achieved enlightenment.

Project 9: Row Group Size and Page Size Tuning

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Java/Spark
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Performance Tuning / Data Warehousing
  • Software or Tool: Apache Arrow (pyarrow)
  • Main Book: “Spark: The Definitive Guide” by Bill Chambers & Matei Zaharia

What you’ll build: A benchmarking suite that writes a large dataset multiple times with varying row group and page sizes, then runs different types of analytical queries against them to see how these settings impact performance.

Why it teaches Parquet: It teaches the crucial architectural trade-offs. You’ll learn that large row groups are great for streaming reads and compression, but their coarser min/max statistics make predicate pushdown less selective and hurt highly selective queries. You’ll see how page size impacts compression and random access. This is the knowledge that separates senior data engineers from junior ones.

Core challenges you’ll face:

  • Controlling row group and page size → maps to using write_table with row_group_size and data_page_size options
  • Designing representative queries → maps to creating a “wide” query (reads many columns) and a “narrow” query (reads few columns)
  • Scripting the benchmark → maps to automating the creation and querying of many file variants
  • Interpreting the results → maps to explaining why a 1GB row group is slow for a query that only needs one row

Key Concepts:

  • Row Group Size Trade-offs: Large groups = better compression, better for full scans. Small groups = better for selective queries. A common default is 128 MB.
  • Page Size Trade-offs: The unit of I/O and compression. Smaller pages can mean more granular access but worse compression.
  • OLAP Query Patterns: Understanding the difference between scanning a few columns for all rows vs. seeking to a few specific rows.

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Project 4.

Real world outcome: A report with graphs that clearly illustrates the performance impact of these key tuning parameters, giving you expert-level guidance for system design.

Implementation Hints:

  1. Create a large dataset (e.g., 10 million rows).
  2. Define a range of row_group_size values to test (e.g., 10000, 100000, 1000000).
  3. For each size, write the dataset to a new Parquet file using pq.write_table(table, file_path, row_group_size=size).
  4. Define two types of queries to run on each file:
    • Narrow Scan: SELECT avg(col_A) FROM file (reads one column).
    • Selective Query: SELECT * FROM file WHERE id = 12345 (needs to find one specific row).
  5. Time both queries against each of the generated files.
  6. Plot the results: row_group_size on the x-axis, and query time on the y-axis, with one line for each query type. You should see the selective query get slower as row group size increases, while the narrow scan might get slightly faster.
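
A sketch of hints 2-5, assuming `table` already holds the test data with columns id and col_A:

    import time

    import duckdb
    import pyarrow.parquet as pq

    for size in [10_000, 100_000, 1_000_000]:
        path = f"data_rg{size}.parquet"
        pq.write_table(table, path, row_group_size=size)  # row_group_size is in rows

        timings = {}
        queries = {
            "narrow":    f"SELECT avg(col_A) FROM '{path}'",
            "selective": f"SELECT * FROM '{path}' WHERE id = 12345",
        }
        for name, sql in queries.items():
            start = time.time()
            duckdb.query(sql).fetchall()   # fetchall() forces execution
            timings[name] = time.time() - start

        print(f"row_group_size={size:>9,}: "
              f"narrow scan={timings['narrow']:.3f}s, "
              f"selective query={timings['selective']:.3f}s")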

Learning milestones:

  1. You successfully write files with different row group sizes → You’ve mastered the write options.
  2. Your benchmark shows that small row groups are faster for highly selective “lookup” queries → You understand the downside of large row groups.
  3. Your benchmark shows that large row groups provide slightly better file compression → You understand the upside.
  4. You can articulate the recommended row group size for different use cases → You can now design efficient data layouts.

Project 10: Custom Data Exporter for an Application

  • File: LEARN_APACHE_PARQUET_DEEP_DIVE.md
  • Main Programming Language: Your application’s language (e.g., Python/Django, Ruby/Rails, Java/Spring)
  • Alternative Programming Languages: Any
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Application Integration / ETL
  • Software or Tool: Apache Arrow, a web framework or ORM.
  • Main Book: The documentation for your framework and pyarrow.

What you’ll build: Add an “Export to Parquet” feature to an existing application (e.g., a Django or Rails web app). This feature will query the application’s primary database (e.g., PostgreSQL), convert the data, and allow the user to download a Parquet file.

Why it teaches Parquet: It bridges the gap between theory and practice. You’ll face real-world data modeling challenges, mapping your application’s objects or database schema to a Parquet schema. It’s a highly practical project that mirrors a common business requirement.

Core challenges you’ll face:

  • Integrating a Parquet library into an existing application → maps to managing dependencies and environments
  • Querying data from a production database → maps to using an ORM or SQL to efficiently extract data
  • Handling schema mapping → maps to translating SQL data types (e.g., VARCHAR, TIMESTAMP) to Parquet/Arrow types
  • Streaming the response → maps to generating the Parquet file in memory and serving it as an HTTP download without saving to disk

Key Concepts:

  • ORM to DataFrame/Table: Converting data from your application’s object model.
  • Schema Synchronization: Ensuring your Parquet schema matches your database schema.
  • HTTP Content Disposition: How to trigger a file download in a web browser.

Difficulty: Intermediate Time estimate: Weekend Prerequisites: An existing application to modify, Project 1.

Real world outcome: A new, valuable feature in one of your existing projects, and a reusable pattern for exporting data from any application.

Implementation Hints:

  1. Add the pyarrow library (or its equivalent in your language) to your application’s dependencies.
  2. Create a new view or controller action for the export.
  3. Inside the action, use your application’s ORM to query the data you want to export. For large datasets, query in chunks to avoid high memory usage.
  4. The most common pattern is to convert each chunk of ORM objects into a pandas DataFrame.
  5. Convert the DataFrame to an Arrow Table.
  6. Use an in-memory buffer (io.BytesIO) to write the Parquet file to memory instead of disk.
    buffer = io.BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)
    
  7. Return an HttpResponse (or your framework’s equivalent) with the buffer’s content, setting the Content-Type to application/octet-stream and the Content-Disposition header to attachment; filename="export.parquet".
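
A sketch of hints 6 and 7 as a Django view (the Order model and its fields are placeholders; adapt the ORM query and response class to your framework):

    import io

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from django.http import HttpResponse

    def export_parquet(request):
        # Placeholder queryset: replace with your own model and, for large tables, chunked iteration.
        from myapp.models import Order
        rows = Order.objects.values("id", "customer", "total", "created_at")

        df = pd.DataFrame.from_records(rows)
        table = pa.Table.from_pandas(df)

        buffer = io.BytesIO()
        pq.write_table(table, buffer)   # write the Parquet file into memory, not to disk
        buffer.seek(0)

        response = HttpResponse(buffer.getvalue(), content_type="application/octet-stream")
        response["Content-Disposition"] = 'attachment; filename="export.parquet"'
        return response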

Learning milestones:

  1. You can export a simple table with 100 rows → The basic plumbing works.
  2. You correctly map database types like dates and decimals to Parquet types → You’re handling real-world data complexity.
  3. You can export a large table (100k+ rows) without crashing your server → You’ve implemented efficient chunking.
  4. A user can click a button in the UI and receive a valid Parquet file → You’ve shipped a complete feature.

Project Comparison Table

| Project                              | Difficulty   | Time      | Depth of Understanding    | Fun Factor |
|--------------------------------------|--------------|-----------|---------------------------|------------|
| 1. CSV to Parquet Converter          | Beginner     | Weekend   | Foundational              | 2/5        |
| 2. Parquet Metadata Inspector        | Intermediate | Weekend   | File Internals            | 4/5        |
| 3. “Parquet-cat” Tool                | Intermediate | Weekend   | Core Benefit (Projection) | 3/5        |
| 4. Performance Showdown              | Intermediate | Weekend   | Benchmarking              | 5/5        |
| 5. Exploring Encodings & Compression | Advanced     | Weekend   | Performance Tuning        | 4/5        |
| 6. Handling Nested Data              | Advanced     | Weekend   | Data Modeling             | 3/5        |
| 7. Manual Predicate Pushdown         | Expert       | 1-2 weeks | Query Optimization        | 5/5        |
| 8. Write Parquet from Scratch        | Master       | 1 month+  | Deep Internals            | 5/5        |
| 9. Row Group & Page Size Tuning      | Advanced     | 1-2 weeks | Performance Tuning        | 4/5        |
| 10. Custom Data Exporter             | Intermediate | Weekend   | Application               | 3/5        |

Recommendation

To gain a solid and practical understanding of Apache Parquet, I recommend the following path:

  1. Start with Project 1: CSV to Parquet Converter. It’s the essential first step and gives you a working tool immediately.
  2. Immediately follow up with Project 2: Parquet Metadata Inspector and Project 3: “Parquet-cat” Tool. These three projects form a “trilogy” that gives you the core vocabulary and mental model for all other projects.
  3. To be convinced of Parquet’s power, complete Project 4: Performance Showdown. The results will be so dramatic they will motivate you to learn more.
  4. From there, if you are focused on performance, tackle Projects 5, 7, and 9. If you are focused on becoming a file format expert, take the ultimate challenge of Project 8.

By the time you complete this path, you’ll have a deep and practical expertise in Parquet that is rare and highly valuable in the data engineering field.

Summary

  • Project 1: CSV to Parquet Converter: Python
  • Project 2: Parquet Metadata Inspector: Python
  • Project 3: A “Parquet-cat” Tool with Column Projection: Python
  • Project 4: Performance Showdown: Parquet vs. CSV: Python
  • Project 5: Exploring Encodings and Compression: Python
  • Project 6: Handling Nested Data Structures: Python
  • Project 7: Implement Predicate Pushdown Manually: Python
  • Project 8: Write a Parquet File from Scratch: Python or Java
  • Project 9: Row Group Size and Page Size Tuning: Python
  • Project 10: Custom Data Exporter for an Application: Your application’s language