Project 6: The Matrix Encoder (Linear Algebra for Reversible Transforms)
Build a deterministic matrix-based message encoder/decoder that demonstrates how linear transformations scramble and recover structured data.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2 (Intermediate) |
| Time Estimate | 10-18 hours |
| Main Language | Python |
| Alternatives | Julia, MATLAB/Octave, TypeScript |
| Key Topics | Matrix multiplication, invertibility, block encoding, modular arithmetic, numerical stability |
| Deliverable | CLI tool to encode plaintext into numeric blocks and decode back |
| Observable Outcome | Exact round-trip recovery of messages under valid key matrix |
Learning Objectives
By the end of this project, you should be able to:
- Map text symbols to numeric vectors with explicit encoding rules.
- Apply matrix multiplication as a reversible transformation pipeline.
- Validate key invertibility before encoding begins.
- Handle block sizing, padding, and message normalization deterministically.
- Decode using matrix inverse (real-valued or modular mode, depending on design).
- Identify failure modes caused by non-invertible keys or numeric precision.
- Build deterministic CLI transcripts for encode/decode regression tests.
- Explain why this project is educational cryptography, not production-grade security.
Theory Needed
1) Linear Transformations as Data Scramblers
A matrix can transform vectors in consistent, rule-based ways. If a plaintext block is represented as vector v, encoding can be:
c = K * v
Where K is your key matrix and c is the cipher vector.
If K is invertible, decoding is:
v = K^{-1} * c
This project turns abstract linear algebra into a concrete reversible pipeline.
ASCII intuition:
plaintext block v --(multiply by K)--> encoded block c
^ |
| v
+-----------(multiply by K^-1)---------+
2) Invertibility, Determinant, and Rank
A square matrix is invertible iff determinant is non-zero (equivalently full rank). If determinant is zero, information is lost: multiple inputs map to the same output and exact decode is impossible.
Practical engineering checks:
- matrix is square (
n x n) - determinant is not near zero (real-valued mode)
- for modular mode, determinant is coprime with modulus
Failure example:
- Rows are multiples of each other -> rank deficiency -> not invertible.
3) Block Encoding Design
Text usually has variable length, while matrix multiplication expects fixed vector dimension. So message must be chunked into blocks of size n.
Process:
- Normalize text (uppercase policy, allowed symbols).
- Map each symbol to number.
- Chunk into length
n. - Pad final block (e.g., with
_orX).
ASCII block pipeline:
"HELLO WORLD"
|
v
[H,E,L,L,O,_,W,O,R,L,D]
|
v
chunk n=3 -> [H,E,L] [L,O,_] [W,O,R] [L,D,_]
|
v
vectorize -> multiply by K block-by-block
4) Numeric Mode vs Modular Mode
You have two common educational approaches:
- Real-valued inverse mode:
- straightforward linear algebra
- but can suffer floating-point rounding
- Modular arithmetic mode (like Hill-cipher style):
- integer math under modulus
m - avoids floating drift
- requires modular inverse of determinant
- integer math under modulus
For high-school math learning, start with real-valued mode and optionally add modular mode extension.
5) Determinism and Input Contracts
Encoding systems fail when rules are ambiguous. Deterministic contracts are required:
- explicit alphabet order
- explicit symbol-to-number table
- explicit padding symbol
- explicit key and matrix dimension
Pseudo-logic:
normalize(text)
nums <- map_symbols(text)
blocks <- chunk_and_pad(nums, n)
encoded_blocks <- [K * b for b in blocks]
write encoded blocks in stable format
Project Spec
What You Will Build
A CLI tool with two commands:
encode: plaintext -> encoded numeric blocks.decode: encoded numeric blocks -> plaintext.
Core features:
- key matrix validation
- deterministic text normalization and padding
- block-by-block transform
- round-trip verification mode
Out of scope:
- secure key exchange
- modern cryptographic security claims
- binary file encryption pipeline
Functional Requirements
- Accept key matrix from file or inline argument.
- Validate shape and invertibility before processing.
- Normalize plaintext using documented alphabet policy.
- Encode/decode deterministically with stable output formatting.
- Return clear errors for invalid symbols, malformed blocks, and bad key matrices.
Non-Functional Requirements
- Correctness: encode/decode round-trip must match normalized input exactly.
- Reproducibility: same inputs produce same outputs byte-for-byte.
- Explainability: output should include intermediate metadata (block count, padding count).
Data Shapes (Conceptual)
AlphabetMap:
symbols: ordered list[str]
to_int: map[str -> int]
to_symbol: map[int -> str]
MatrixKey:
size: int
values: n x n numbers
inverse: n x n numbers (or modular inverse representation)
EncodedPayload:
key_id: string
mode: "real" | "mod"
block_size: int
pad_symbol: string
blocks: list[list[number]]
Real World Outcome
You should be able to run a deterministic encode/decode pair and compare your output with this transcript structure.
Deterministic CLI transcript:
$ python matrix_encoder.py \
encode \
--message "MEET AT NOON" \
--alphabet "A-Z_" \
--block-size 3 \
--key-file keys/key3x3.json \
--mode real
[INFO] mode=real block_size=3 key_id=key3x3-v1
[INFO] normalized_message=MEET_AT_NOON_
[INFO] symbol_count=12 pad_added=1
[INFO] blocks_plain=4
[ENCODED] [132, 89, 170]
[ENCODED] [141, 118, 162]
[ENCODED] [115, 76, 154]
[ENCODED] [ 88, 57, 103]
[INFO] wrote outputs/encoded_key3x3_v1.txt
$ python matrix_encoder.py decode --input outputs/encoded_key3x3_v1.txt --key-file keys/key3x3.json --mode real
[INFO] mode=real block_size=3 key_id=key3x3-v1
[INFO] decoded_blocks=4
[DECODED] MEET_AT_NOON_
[CHECK] round_trip=PASS
Failure transcript (invalid key):
$ python matrix_encoder.py encode --message "HELLO" --key-file keys/singular.json
[ERROR] key matrix is not invertible (determinant=0)
Success signal: normalized plaintext is recovered exactly in decode path.
Solution Architecture
System architecture:
+---------------------+
| CLI Layer |
+----------+----------+
|
v
+---------------------+ +----------------------+
| Input Normalizer |----->| Alphabet Mapper |
| text policy/padding | | symbols <-> integers |
+----------+----------+ +----------+-----------+
| |
v v
+---------------------+ +----------------------+
| Block Processor |----->| Matrix Engine |
| chunk/un-chunk | | multiply / inverse |
+----------+----------+ +----------+-----------+
| |
+-------------+--------------+
v
+-------------+
| Serializer |
+-------------+
Key Components
| Component | Responsibility | Design Decision |
|---|---|---|
KeyValidator |
Shape + invertibility checks | Fail-fast before any transform |
AlphabetCodec |
Symbol/integer mapping | Explicit ordered alphabet prevents ambiguity |
BlockCodec |
Chunk and padding policy | Separate layer keeps matrix engine generic |
MatrixEngine |
Encode/decode block transforms | Isolated for unit-testing arithmetic correctness |
PayloadIO |
Stable on-disk format | Deterministic serialization for regression tests |
Algorithm Overview
Encode path pseudocode:
function encode(message, key, alphabet, block_size):
ensure key is invertible
normalized <- normalize_message(message, alphabet)
ints <- map_symbols_to_ints(normalized)
blocks <- chunk_with_padding(ints, block_size)
encoded <- []
for block in blocks:
encoded.append(matrix_multiply(key, block))
return serialize(encoded, metadata)
Decode path pseudocode:
function decode(payload, key, alphabet):
ensure payload.block_size == key.size
inv <- inverse(key)
plain_blocks <- []
for c in payload.blocks:
plain_blocks.append(matrix_multiply(inv, c))
ints <- flatten(plain_blocks)
symbols <- map_ints_to_symbols(round_or_mod(ints))
return join(symbols)
Complexity:
- For
Bblocks andn x nkey:O(B * n^2)time (naive matrix-vector multiply),O(B * n)storage for blocks.
Tradeoff:
- Real mode is easier to understand but can require rounding safeguards.
- Modular mode is arithmetic-cleaner but conceptually heavier.
Implementation Guide
Phase 1: Contracts First
- Define alphabet and normalization policy.
- Define key matrix format and validation rules.
- Define payload format for encoded output.
Checkpoint: schema-level tests pass before arithmetic logic.
Phase 2: Arithmetic Core
- Implement matrix-vector multiply.
- Implement invertibility check.
- Implement inverse path (or modular inverse path if chosen).
Checkpoint: K^-1 * (K * v) == v for multiple test vectors.
Phase 3: Text and Block Pipeline
- Normalize plaintext.
- Convert symbols to numbers.
- Chunk/pad blocks and transform each block.
Checkpoint: known message round-trip matches normalized message exactly.
Phase 4: CLI and Reporting
- Add
encodeanddecodesubcommands. - Print deterministic metadata lines.
- Add clear failure messages for key and symbol errors.
Checkpoint: transcript-compatible output verified by snapshot tests.
Testing
Unit Tests
- Alphabet map reversibility (
symbol -> int -> symbol). - Invertibility detection for valid and singular matrices.
- Block split/merge with padding behavior.
- Matrix round-trip on sample vectors.
Integration Tests
- Full encode/decode CLI round-trip using fixed key.
- Payload file read/write stability test.
- Invalid input tests (bad symbol, malformed block, wrong key size).
Deterministic test transcript example:
$ pytest tests/test_matrix_encoder.py -q
.......
7 passed in 0.22s
$ python matrix_encoder.py encode --message "ALGEBRA" --block-size 3 --key-file keys/key3x3.json --mode real
[INFO] normalized_message=ALGEBRA__
[CHECKSUM] encoded_payload=1d4fa6c2
Numerical Stability Checks
- In real mode, assert decoded values are within tolerance before rounding.
- Add boundary tests where rounding could flip nearest symbol.
Pitfalls
| Symptom | Likely Cause | Fix |
|---|---|---|
| Decode output has wrong letters | Rounding drift in real mode | Use tolerance checks and deterministic rounding policy |
| Cannot decode at all | Singular key matrix | Reject key early with determinant/rank diagnostics |
| Random extra characters | Padding not removed/marked correctly | Include explicit pad_count metadata |
| Different outputs across runs | Non-deterministic serialization order | Sort keys and use stable formatting |
| Garbled symbols | Alphabet mismatch between encode/decode | Store alphabet ID in payload and validate |
Extensions
- Add modular arithmetic mode (
mod 27ormod 29) for integer-safe decode. - Support multi-key rotation by message block index.
- Add frequency-analysis experiment to show weakness of linear ciphers.
- Add “explain mode” that prints each matrix-vector step for learning.
- Add key-generation helper that enforces invertibility constraints.
Real-World Connections
- Data transforms: many compression and signal pipelines rely on reversible linear transforms.
- Error-correcting codes: linear algebra over finite fields underpins coding theory.
- Computer graphics: matrix transformations are core to rendering pipelines.
- Cryptography history: Hill-cipher style ideas show why modern crypto needs stronger primitives.
Resources
- Gilbert Strang, Linear Algebra and Its Applications.
- Sheldon Axler, Linear Algebra Done Right.
- David C. Lay et al., Linear Algebra and Its Applications.
- Douglas Stinson, Cryptography: Theory and Practice (historical cipher context).
Self-Assessment
- Why is non-zero determinant necessary for reversible decoding?
- What concrete problem does block padding solve?
- How does real-mode rounding error appear in decoded text?
- Why is explicit alphabet ordering part of correctness, not just formatting?
- What metadata must be stored so decoding is unambiguous?
- Why is this educational and not secure modern encryption?
- If two different keys decode the same payload plausibly, what does that imply?
- How would modular arithmetic change your validation rules?
Completion Criteria
- Encode/decode CLI supports deterministic round-trip for fixed key and input.
- Key validation rejects singular or malformed matrices before processing.
- Payload format is stable and includes enough metadata for safe decode.
- Tests cover mapping, block handling, arithmetic correctness, and CLI integration.
- Deterministic transcript run matches expected structure and outcomes.
- You can explain invertibility, block encoding, and failure modes clearly.