Project 3: Data Miner

Build a regex-powered extraction tool that pulls structured signals (emails, IPs, tokens) from noisy text and outputs clean datasets.

Quick Reference

Difficulty: Advanced
Time Estimate: 1 week
Main Programming Language: Bash
Alternative Programming Languages: Python
Coolness Level: Level 4 - “data archaeologist”
Business Potential: High (data extraction, compliance)
Prerequisites: regex basics, pipelines, sorting
Key Topics: grep -o, ERE patterns, normalization, false positives

1. Learning Objectives

By completing this project, you will:

  1. Construct precise extraction regexes for emails and IPv4 addresses.
  2. Use grep -o to output only matched substrings.
  3. Normalize extracted data for deduplication and downstream analysis.
  4. Reduce false positives using boundaries and validation rules.
  5. Produce deterministic datasets and extraction reports.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Extraction-Oriented Regex (Leftmost-Longest, -o, and Field Boundaries)

Fundamentals

Extraction is different from filtering. Instead of selecting entire lines, you want the exact substring that matches a pattern. Grep supports this with -o, which prints only the match. In POSIX ERE, regex matching is leftmost-longest, meaning the engine selects the earliest match on the line and then expands it as far as possible. This affects how you design extraction patterns: if you use .*, it can swallow too much; if you omit boundaries, you will capture trailing punctuation. The right pattern balances strictness with coverage.

Extraction is also about intent. A filter asks “does this line contain something interesting?” while an extractor asks “what exact token do I want to collect?” This distinction forces you to think about token boundaries, normalization, and post-processing. If you do not define those boundaries carefully, the dataset you produce will be corrupted in subtle ways, which is worse than missing data because it looks valid but is wrong.

Deep Dive into the concept

Extraction regexes are about precision. A good extraction regex defines both the core token shape and its boundaries. For example, an email address has a local part, an @, and a domain with at least one dot. A naive regex like .+@.+\..+ will match too much, including spaces and trailing punctuation. A better extraction pattern uses character classes that exclude whitespace and punctuation: [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}. Even this is a compromise, because the full RFC syntax is complex, but it is sufficient for typical datasets.

Leftmost-longest matching means the engine will extend a match as long as the pattern allows. If your regex uses .+ for the domain, and the line contains multiple @ signs, the match may span across tokens. You avoid this by using a tighter character class for each subcomponent. The rule of thumb is: never use .* in extraction unless you have strict delimiters around it.
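
As a quick illustration, the loose pattern swallows the whole line as a single match, while tighter character classes yield two separate tokens:

echo 'a@b.com and c@d.com' | grep -o -E '.+@.+\..+'
# prints the entire line as one match
echo 'a@b.com and c@d.com' | grep -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
# prints a@b.com and c@d.com on separate lines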

The -o flag changes how grep outputs data. Instead of lines, you get each match as its own output line. This is powerful for building datasets because you can pipe the output directly into sort | uniq or into a file. However, -o will print multiple matches per line if the line contains multiple tokens. This is usually desired, but it affects counts. A report should explicitly mention that counts represent matches, not lines.
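
For example, these two counts usually differ on real data: the first counts matching lines, the second counts extracted tokens.

grep -c -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' dataset.txt          # lines with at least one email
grep -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' dataset.txt | wc -l  # total email matches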

Boundaries are critical. A loose pattern might match “bob@example.com,” including the trailing comma. You can prevent that with character classes that exclude punctuation at the edges, or with a word boundary like \b where appropriate, though \b is a GNU extension and POSIX grep has limited word-boundary semantics. In engines such as PCRE you could use lookarounds, but POSIX ERE does not support them. Instead, use a delimiter-based strategy: match tokens preceded by start-of-line or a non-token character. Because grep -o prints the whole match and cannot emit just a capture group, the practical fix is to post-process the output and trim punctuation at the token edges with sed or awk (note that tr -d '.,;:' deletes those characters everywhere in the token, including dots inside a domain, so anchored substitutions are safer).
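
A minimal sketch of that strategy: extract loosely around an @, then trim edge characters before stricter validation (the exact edge-character set below is an assumption; adjust it to your corpus).

grep -o -E '[^[:space:]]+@[^[:space:]]+' dataset.txt \
  | sed -E 's/^["(<,;:]+//; s/[">),;:.]+$//'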

IPv4 extraction has its own precision issues. The regex ([0-9]{1,3}\.){3}[0-9]{1,3} matches many invalid IPs like 999.999.999.999. If your dataset requires strict validation, you must add range checks after extraction. A common strategy is to extract using a simple regex and then validate with a small script that enforces 0-255 ranges. This two-step approach is easier to reason about and more maintainable than a monstrous regex.
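
One way to implement the two-step approach is a small Bash filter after extraction; this is a sketch, not the only reasonable design (the 10# prefix forces base-10 so octets with leading zeros do not trip octal parsing):

# Read candidate IPs on stdin; print only those whose octets are all 0-255.
validate_ipv4() {
  local ip a b c d
  while IFS= read -r ip; do
    IFS=. read -r a b c d <<< "$ip"
    if (( 10#$a <= 255 && 10#$b <= 255 && 10#$c <= 255 && 10#$d <= 255 )); then
      printf '%s\n' "$ip"
    fi
  done
}

grep -o -E '([0-9]{1,3}\.){3}[0-9]{1,3}' dataset.txt | validate_ipv4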

Finally, extraction should be deterministic. Run LC_ALL=C sort before uniq so the output order is stable. Document the extraction rules and version them if you plan to run the tool repeatedly. This turns a one-off script into a reusable data mining utility.
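
For example, a deterministic email pipeline (emails.txt matches the output name used in §3.5):

grep -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' dataset.txt \
  | LC_ALL=C sort | uniq > emails.txt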

Another layer of rigor is token boundary anchoring. If you extract emails from logs that include JSON, you may encounter tokens like "email":"alice@example.com" where the token is wrapped in quotes. A robust extractor should allow optional quotes at the edges but should not include them in the output. In POSIX ERE, you can do this with a pre-processing step that trims surrounding punctuation, or with a pipeline that uses sed to strip leading and trailing quotes. This is a practical compromise given the lack of lookarounds.
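
A sketch of that compromise: match the quoted form explicitly, then strip the quotes at the edges with sed (logs.json is a placeholder input name):

grep -o -E '"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"' logs.json \
  | sed -E 's/^"//; s/"$//'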

It is also helpful to think about the error profile of your extractor. There are two types of errors: false positives (bad tokens that should not be included) and false negatives (valid tokens that are missed). Your regex design and validation rules should explicitly decide which error is more acceptable. For compliance tasks, you may prefer higher recall (fewer false negatives), accepting that you will manually review a larger set. For strict data ingestion, you may prefer high precision to avoid polluted data. This trade-off should be documented in the report.

Extraction pipelines can also be enhanced by context reporting. Although your output is a list of tokens, you may want to keep a separate “evidence” file that records where each token was found (file name and line number). This allows later validation and auditing. You can implement this by running a second pass with grep -n and storing the full line, or by using awk to emit file:line:token triples. Even if you do not implement this in the first version, designing for it keeps your tool extensible.
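
With GNU grep, combining -H and -n with -o already emits file:line:match triples, which is enough for a simple evidence file (evidence.txt is a name chosen here, not mandated by the spec):

grep -H -n -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' fixtures/*.txt > evidence.txt
# each line looks like: fixtures/dataset.txt:12:alice@example.com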

Finally, be aware of performance implications. grep -o can produce enormous output on large corpora. If you are extracting multiple token types, you should run independent pipelines and stream to disk rather than holding everything in memory. Sorting large lists also requires disk. For large datasets, consider using sort -T to choose a temporary directory with enough space, and always report resource usage in the summary so users know the cost of the run.
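
For example, keep the pipeline streaming and point sort at a roomy temporary directory (big_corpus.txt, /data/tmp, and ip_counts.txt are placeholders):

grep -o -E '([0-9]{1,3}\.){3}[0-9]{1,3}' big_corpus.txt \
  | LC_ALL=C sort -T /data/tmp \
  | uniq -c | sort -rn > ip_counts.txt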

How this fits into the project

Your extractor uses grep -o for tokens and relies on precise regexes to avoid corrupted datasets.

Definitions & key terms

  • extraction regex: pattern designed to capture exact tokens rather than full lines.
  • leftmost-longest: POSIX regex rule for choosing the match.
  • boundary: pattern element that enforces token edges.
  • -o: grep flag to print only the matched substring.

Mental model diagram (ASCII)

Line: "user bob@example.com, ip=10.0.0.1"
Regex: [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
Match:      bob@example.com

How it works (step-by-step)

  1. Grep scans each line for the regex.
  2. For each match, -o prints the token only.
  3. Tokens are piped to normalization and dedup steps.
  4. Output becomes the dataset for analysis.
  5. Invariant: each output line is a single extracted token.
  6. Failure modes: overmatching due to loose patterns, or missing tokens due to overly strict boundaries.

Minimal concrete example

grep -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' dataset.txt

Common misconceptions

  • “-o gives you unique matches” -> False; it prints all matches.
  • “A single regex can validate all emails” -> False; the full RFC syntax is complex.
  • “.* is safe in extraction” -> False; it usually overmatches.

Check-your-understanding questions

  1. What does leftmost-longest mean in POSIX ERE?
  2. Why does -o change counting semantics?
  3. Why is .* risky in extraction regexes?
  4. How can you prevent trailing punctuation from being captured?

Check-your-understanding answers

  1. The engine picks the earliest match and makes it as long as possible.
  2. It outputs each match, not each line, so counts are per token.
  3. It can consume unrelated text and merge multiple tokens.
  4. Use tighter character classes or post-process to trim punctuation.

Real-world applications

  • Email extraction for compliance or marketing lists.
  • IP extraction for incident response.
  • Token extraction for auditing API keys.

Where you will apply it

  • In this project: see §3.2 (extraction requirements) and §5.8 (hints).

References

  • The Linux Command Line (Shotts), Chapter 19
  • man grep
  • POSIX regex specification

Key insights

Extraction requires precise token boundaries; loose patterns corrupt datasets.

Summary

A data miner is only as good as its extraction regex. Precision comes from boundaries, not from .*.

Homework/Exercises to practice the concept

  1. Write a regex to extract IPv4 addresses and test it on noisy text.
  2. Build a fixture with emails followed by punctuation and verify matches.
  3. Use grep -o and count how many tokens appear per line.

Solutions to the homework/exercises

  1. grep -o -E '([0-9]{1,3}\.){3}[0-9]{1,3}' file.txt
  2. Use grep -o and confirm that “alice@example.com,” is reduced to alice@example.com after trimming the trailing comma.
  3. grep -n -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' file | cut -d: -f1 | uniq -c   # matches per line

2.2 Normalization, Validation, and Deduplication

Fundamentals

Extraction produces raw tokens, but raw tokens are not a dataset. Normalization converts tokens into a consistent form (lowercase emails, trimmed punctuation) so that Alice@example.com and alice@example.com count as the same entity. Deduplication is performed with sort | uniq. Validation reduces false positives by filtering tokens that do not meet rules (such as invalid IP ranges). Without normalization and validation, your output may be misleading or unusable.

Normalization is a design decision. You must decide which differences matter and which do not. For example, should John.Doe@example.com be considered the same as john.doe@example.com? For most analytics, yes. For forensic traces, maybe not. Documenting this choice makes your output interpretable and helps others use your dataset responsibly.

Deep Dive into the concept

Normalization is the process of removing superficial differences that do not matter to the meaning of the token. For email addresses, case is usually irrelevant in the domain part, and often irrelevant in the local part for most providers. A reasonable normalization is to lowercase the entire address for counting purposes. For IPs, normalization might include stripping leading zeros (e.g., 010.001.002.003 to 10.1.2.3). For tokens like API keys, normalization might include trimming surrounding quotes.
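
Two normalization passes as sketches (the .raw and .norm file names are placeholders): lowercase emails with tr, and strip leading zeros from IPv4 octets with awk's numeric printf.

tr '[:upper:]' '[:lower:]' < emails.raw > emails.norm
awk -F. '{ printf "%d.%d.%d.%d\n", $1, $2, $3, $4 }' ips.raw > ips.norm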

Validation is a second step that enforces constraints that regex alone cannot express easily. Consider IPv4: a pure regex will match 999.999.999.999 even though it is invalid. You can handle this by post-processing each extracted IP with a small function that checks each octet is between 0 and 255. Similarly, for emails, you can enforce a maximum length or disallow consecutive dots. The point is not to reimplement RFCs, but to rule out obvious garbage.
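
Two such checks for emails, expressed as a single awk filter (the 254-character cap and the consecutive-dot rule are the choices documented here, not RFC-complete validation; file names are placeholders):

awk 'length($0) <= 254 && $0 !~ /\.\./ { print }' emails.norm > emails.valid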

Deduplication must be deterministic. sort | uniq is the standard approach, but it only works when the data is sorted. To keep output reproducible across machines, you should set LC_ALL=C. If you want counts, use uniq -c and then sort by count. This becomes part of your report.
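
For example, a frequency-ranked list of validated tokens (file names are placeholders):

LC_ALL=C sort emails.valid | uniq -c | sort -rn > email_counts.txt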

A subtlety is the order of operations. If you deduplicate before normalization, you will get inflated counts. Always normalize first, then deduplicate. Another subtlety is that grep -o will produce multiple tokens per line, which can lead to large intermediate output. To keep performance reasonable, you can stream through normalization and sorting rather than storing everything in memory.

Finally, a dataset should carry context. Your tool should produce not just emails.txt, but also a summary that describes how many tokens were extracted, how many were invalidated, and how many unique results remain. This tells the user how much noise was removed and provides confidence in the dataset quality.
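
A minimal summary writer, assuming intermediate files for raw and validated tokens exist alongside the final outputs (the .raw and .valid names are placeholders; the report wording follows §3.7):

{
  echo "Total email matches: $(wc -l < emails.raw)"
  echo "Unique emails: $(wc -l < emails.txt)"
  echo "Total ip matches: $(wc -l < ips.raw)"
  echo "Unique ips: $(wc -l < ips.txt)"
  echo "Invalid ips filtered: $(( $(wc -l < ips.raw) - $(wc -l < ips.valid) ))"
} > summary.txt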

There is also a subtle ordering issue in normalization pipelines. If you normalize after validation, you may validate the wrong form of the token. For example, an IP with leading zeros might fail a strict regex but could be normalized to a valid range. This is why the order is usually: extract -> normalize -> validate -> deduplicate. For emails, you might normalize case before validation to reduce false negatives. The ordering should be explicit in your design.

Validation rules should be explainable. If you decide that an email must contain at least one dot in the domain and a TLD of length >= 2, say so in the report. If you decide to drop private IP ranges or reserved blocks, document that decision. Without this, users will not know whether missing tokens were errors or deliberate filtering. A data miner should be transparent about its rules.

Deduplication is also more nuanced than uniq. If you want counts, you should use sort | uniq -c and then maintain the counts alongside the unique tokens. This gives you a frequency distribution that can be valuable for prioritization. For example, if one IP appears 10,000 times, it is likely more important than one that appears once. Your summary report should include top-N tokens by frequency to help users focus on the most significant results.

Finally, consider how to make the output deterministic and auditable. A good practice is to output not only the unique tokens but also a summary.txt that includes the tool version, extraction rules, validation rules, and counts. This turns the dataset into a reproducible artifact rather than a transient list.

How this fits into the project

The data miner uses normalization and validation to produce clean outputs and a summary report with counts of total vs unique tokens.

Definitions & key terms

  • normalization: transforming tokens into a consistent canonical form.
  • validation: filtering tokens that violate rules.
  • deduplication: removing duplicates, often via sort | uniq.
  • canonical form: the standard representation of a token.

Mental model diagram (ASCII)

raw matches -> normalize -> validate -> sort -> uniq -> dataset

How it works (step-by-step)

  1. Extract tokens with grep -o.
  2. Normalize tokens (lowercase, trim punctuation).
  3. Validate tokens (range checks, length checks).
  4. Sort and deduplicate.
  5. Write output files and summary counts.
  6. Invariant: normalization happens before validation and deduplication.
  7. Failure modes: validation rules too strict or normalization that collapses distinct tokens.

Minimal concrete example

grep -o -E '([0-9]{1,3}\.){3}[0-9]{1,3}' data.txt \
  | awk -F. '($1<=255 && $2<=255 && $3<=255 && $4<=255){print}' \
  | sort | uniq

Common misconceptions

  • “Regex is enough to validate IPs” -> False; range checks are needed.
  • “Deduplication before normalization is fine” -> False; it inflates counts.
  • “Lowercasing is always safe” -> Mostly, but document assumptions.

Check-your-understanding questions

  1. Why is normalization required before deduplication?
  2. How do you validate IPv4 ranges without complex regex?
  3. What does LC_ALL=C change in sorting?
  4. Why should a report include invalid token counts?

Check-your-understanding answers

  1. Otherwise Alice@example.com and alice@example.com are counted separately.
  2. Use a post-processing step to check each octet.
  3. It enforces a stable bytewise order.
  4. It shows how much noise was removed and builds trust.

Real-world applications

  • Compliance audits that must remove duplicate identifiers.
  • Threat intelligence pipelines that validate IP indicators.
  • Data cleaning for analytics.

Where you will apply it

  • In this project: see §3.5 (schema) and §3.7 (golden output).
  • Also used in: P07-stats-engine.md.

References

  • The Linux Command Line (Shotts), Chapter 20
  • man awk
  • man sort

Key insights

Extraction is only half the job; normalization and validation make the data trustworthy.

Summary

A usable dataset requires consistent canonical forms and explicit validation rules.

Homework/Exercises to practice the concept

  1. Normalize a list of emails to lowercase and deduplicate.
  2. Write an awk filter that removes invalid IPv4 addresses.
  3. Create a report that shows total vs unique counts.

Solutions to the homework/exercises

  1. tr 'A-Z' 'a-z' < emails.txt | sort | uniq
  2. awk -F. '($1<=255 && $2<=255 && $3<=255 && $4<=255){print}' ips.txt
  3. Print labeled counts, e.g. printf 'Total: %s\nUnique: %s\n' "$(wc -l < raw.txt)" "$(wc -l < unique.txt)".

3. Project Specification

3.1 What You Will Build

A CLI extractor that scans a text corpus, extracts emails and IPv4 addresses, normalizes and validates them, and outputs two datasets (emails.txt, ips.txt) plus a summary report.

3.2 Functional Requirements

  1. Email extraction: extract email-like tokens using ERE.
  2. IP extraction: extract IPv4 tokens and validate ranges.
  3. Normalization: lowercase emails and trim punctuation.
  4. Deduplication: output unique tokens.
  5. Summary report: total matches and unique counts.

3.3 Non-Functional Requirements

  • Performance: handle 1GB text corpus within reasonable time using streaming.
  • Reliability: output is deterministic across runs.
  • Usability: clear outputs and counts.

3.4 Example Usage / Output

$ ./data_miner.sh dataset.txt
[+] Extracting emails and IPs
[+] Found 284 unique emails
[+] Found 97 unique IP addresses

3.5 Data Formats / Schemas / Protocols

emails.txt: one email per line
ips.txt: one IPv4 per line
summary.txt: totals and unique counts

3.6 Edge Cases

  • Emails with trailing punctuation.
  • IPs with leading zeros.
  • Unicode text and mixed encodings.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./data_miner.sh ./fixtures/dataset.txt

3.7.2 Golden Path Demo (Deterministic)

Use a fixed dataset in ./fixtures/dataset.txt and a fixed timestamp in the report header.

3.7.3 Exact Terminal Transcript (CLI)

$ ./data_miner.sh ./fixtures/dataset.txt
[2026-01-01T12:00:00] DATASET=./fixtures/dataset.txt
[2026-01-01T12:00:00] EMAILS=emails.txt
[2026-01-01T12:00:00] IPS=ips.txt
[2026-01-01T12:00:00] SUMMARY=summary.txt
[2026-01-01T12:00:00] DONE

$ cat summary.txt
Total email matches: 12
Unique emails: 10
Total ip matches: 8
Unique ips: 7
Invalid ips filtered: 1

Failure demo (missing file):

$ ./data_miner.sh ./nope.txt
[2026-01-01T12:00:00] ERROR: file not found
EXIT_CODE=2

Exit codes:

  • 0: success
  • 1: partial success (invalid tokens filtered)
  • 2: invalid arguments or missing file

4. Solution Architecture

4.1 High-Level Design

text -> grep -o (emails) -> normalize -> validate -> sort/uniq -> emails.txt
text -> grep -o (ips)    -> normalize -> validate -> sort/uniq -> ips.txt
                                              \-> summary

4.2 Key Components

Component  | Responsibility        | Key Decisions
Extractor  | run grep -o patterns  | use ERE with boundaries
Normalizer | lowercase, trim       | tr and awk
Validator  | IPv4 range checks     | awk octet checks
Reporter   | counts and totals     | deterministic output

4.3 Data Structures (No Full Code)

counts: {emails_total, emails_unique, ips_total, ips_unique}

4.4 Algorithm Overview

Key Algorithm: Extraction Pipeline

  1. Extract tokens with grep -o.
  2. Normalize tokens.
  3. Validate tokens.
  4. Sort and deduplicate.
  5. Emit summary counts.

Complexity Analysis:

  • Time: O(n log n), dominated by sorting the n extracted tokens.
  • Space: O(n) for sorted token lists.

5. Implementation Guide

5.1 Development Environment Setup

# No extra dependencies required

5.2 Project Structure

project-root/
├── data_miner.sh
├── fixtures/
│   └── dataset.txt
└── README.md

5.3 The Core Question You’re Answering

“How do I reliably extract structured signals from noisy text?”

5.4 Concepts You Must Understand First

  1. Extraction regex and boundaries
  2. Normalization, validation, and deduplication

5.5 Questions to Guide Your Design

  1. What token formats are in scope (emails, IPs, UUIDs)?
  2. How strict should validation be?
  3. What normalization rules are acceptable?

5.6 Thinking Exercise

Design regexes for emails and IPs, then list two false positives you expect and how you will filter them.

5.7 The Interview Questions They’ll Ask

  1. “Why use grep -o instead of normal grep?”
  2. “How do you avoid false positives in extraction?”
  3. “Why is sorting required before uniq?”

5.8 Hints in Layers

Hint 1: Email extraction

grep -o -E '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' dataset.txt

Hint 2: IP extraction

grep -o -E '([0-9]{1,3}\.){3}[0-9]{1,3}' dataset.txt

Hint 3: Validation

grep -o -E '([0-9]{1,3}\.){3}[0-9]{1,3}' dataset.txt | awk -F. '($1<=255 && $2<=255 && $3<=255 && $4<=255)'
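
Hint 4: Putting it together

A minimal end-to-end sketch of data_miner.sh under the rules above; the output file names follow §3.5, but internals such as intermediate file names and messages are illustrative, not prescribed:

#!/usr/bin/env bash
set -euo pipefail

if [ $# -ne 1 ]; then echo "usage: $0 <dataset>" >&2; exit 2; fi
input="$1"
[ -f "$input" ] || { echo "ERROR: file not found" >&2; exit 2; }

email_re='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
ip_re='([0-9]{1,3}\.){3}[0-9]{1,3}'

# Emails: extract, lowercase, deduplicate deterministically.
grep -o -E "$email_re" "$input" | tr '[:upper:]' '[:lower:]' > emails.raw || true
LC_ALL=C sort emails.raw | uniq > emails.txt

# IPs: extract, validate octet ranges, deduplicate deterministically.
grep -o -E "$ip_re" "$input" > ips.raw || true
awk -F. '($1<=255 && $2<=255 && $3<=255 && $4<=255)' ips.raw > ips.valid
LC_ALL=C sort ips.valid | uniq > ips.txt

# Summary counts: totals are matches, uniques are deduplicated tokens.
{
  echo "Total email matches: $(wc -l < emails.raw)"
  echo "Unique emails: $(wc -l < emails.txt)"
  echo "Total ip matches: $(wc -l < ips.raw)"
  echo "Unique ips: $(wc -l < ips.txt)"
  echo "Invalid ips filtered: $(( $(wc -l < ips.raw) - $(wc -l < ips.valid) ))"
} > summary.txt

echo "[+] Found $(wc -l < emails.txt) unique emails"
echo "[+] Found $(wc -l < ips.txt) unique IP addresses"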

5.9 Books That Will Help

Topic           | Book                            | Chapter
Regex           | The Linux Command Line (Shotts) | Ch. 19
Text processing | The Linux Command Line (Shotts) | Ch. 20

5.10 Implementation Phases

Phase 1: Foundation (1-2 days)

Goals:

  • Build extraction regexes
  • Validate on fixtures

Tasks:

  1. Implement grep -o for emails and IPs.
  2. Create a fixture dataset with known matches.

Checkpoint: extraction outputs expected raw lists.

Phase 2: Core Functionality (2-3 days)

Goals:

  • Normalization and validation

Tasks:

  1. Implement lowercase normalization.
  2. Add IPv4 range validation.

Checkpoint: invalid tokens are filtered.

Phase 3: Polish & Edge Cases (1-2 days)

Goals:

  • Deterministic output and reporting

Tasks:

  1. Add sorting with LC_ALL=C.
  2. Generate summary counts.

Checkpoint: summary report matches expectations.

5.11 Key Implementation Decisions

Decision               | Options                   | Recommendation | Rationale
Email regex strictness | simple vs RFC             | simple         | maintainable and practical
IP validation          | regex only vs post-check  | post-check     | clarity and correctness
Output format          | text vs JSON              | text           | simple and interoperable

6. Testing Strategy

6.1 Test Categories

Category          | Purpose                 | Examples
Unit Tests        | regex accuracy          | known tokens and false positives
Integration Tests | end-to-end              | dataset fixture extraction
Edge Case Tests   | punctuation and spacing | emails with commas

6.2 Critical Test Cases

  1. Email with trailing comma: should be cleaned.
  2. Invalid IP: should be filtered out.
  3. Mixed case email: normalized to lowercase.

6.3 Test Data

fixtures/dataset.txt

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall      | Symptom                    | Solution
Overmatching | tokens include punctuation | tighten boundaries
Invalid IPs  | garbage in dataset         | add validation
Missing sort | duplicates remain          | sort before uniq

7.2 Debugging Strategies

  • Print intermediate outputs with tee.
  • Build a small fixture with known expected outputs.
  • Use diff to compare actual vs expected datasets.

7.3 Performance Traps

Large corpora can produce huge intermediate lists. Stream outputs and avoid storing raw data in memory.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add extraction for URLs.
  • Add a flag to output counts only.

8.2 Intermediate Extensions

  • Extract UUIDs and normalize formats.
  • Add a validation report with counts by error type.

8.3 Advanced Extensions

  • Add incremental extraction with a cache.
  • Produce a SQLite database for queries.

9. Real-World Connections

9.1 Industry Applications

  • Compliance data discovery.
  • Threat hunting for leaked credentials.
  • Building blocklists for security tools.

9.2 Related Tools

  • ripgrep: high-performance text search engine.
  • jq: structured data processor (useful for JSON logs).

9.3 Interview Relevance

  • Regex design for extraction.
  • Data normalization pipelines.
  • Handling false positives.

10. Resources

10.1 Essential Reading

  • The Linux Command Line (Shotts), Chapters 19-20

10.2 Video Resources

  • “Regex Extraction Patterns” (conference talk)
  • “Text Mining with Unix Tools” (YouTube)

10.3 Tools & Documentation

  • man grep
  • man awk

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain leftmost-longest matching.
  • I can explain why validation is needed after extraction.
  • I can explain normalization vs deduplication.

11.2 Implementation

  • Emails and IPs are extracted and deduplicated.
  • Invalid IPs are filtered.
  • Summary report is correct and deterministic.

11.3 Growth

  • I can identify a false positive and fix it.
  • I documented my extraction rules.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Extract emails and IPs into separate files.
  • Provide a summary with total and unique counts.

Full Completion:

  • Normalization and validation implemented.
  • Deterministic output across runs.

Excellence (Going Above & Beyond):

  • Additional token types supported with validation.
  • Dataset ready for ingestion into analytics tools.