Project 14: Str Module - Low-Level String Operations

Build a string module with position-based indexing (1-based, with negative indices from end) that operates on substrings without allocation.

Quick Reference

Attribute Value
Difficulty Level 3 - Advanced
Time Estimate 4-5 days
Language C
Prerequisites Mem module, Assert module, Understanding of C strings
Key Topics String processing, Memory efficiency, Position semantics, Zero-copy operations

1. Learning Objectives

By completing this project, you will:

  1. Design safer string interfaces - Create APIs that prevent common C string errors
  2. Implement position-based indexing - Use 1-based indices with negative indices from end
  3. Master zero-copy operations - Return pointers into existing strings without allocation
  4. Handle string views - Understand the difference between owning and borrowing strings
  5. Bridge to C strings - Interoperate with standard null-terminated strings

Why Str Matters:

C strings are notoriously error-prone:

  • Buffer overflows from strcpy, sprintf
  • Off-by-one errors with 0-based indexing
  • O(n) length calculation with strlen
  • No substring support without copying
  • Null terminator confusion

Hanson’s Str module addresses these problems by:

  • Using 1-based, end-inclusive positions (like human counting)
  • Supporting negative indices (like Python)
  • Tracking length explicitly
  • Returning views (pointers) instead of copies
  • Clear ownership semantics

Real-World Applications:

  • Parsers: Efficient substring extraction during lexing/parsing
  • Protocol handlers: Extracting fields from packet payloads
  • Text editors: Manipulating ranges of text efficiently
  • Log analysis: Processing log entries without copying

2. Theoretical Foundation

2.1 Core Concepts

The Problems with C Strings

C String Problems
--------------------------------------------------------------------------------

Problem 1: Off-by-one errors with 0-based indexing

    char s[] = "Hello";  // indices 0,1,2,3,4
    s[5] = '!';          // Bug! Off the end
    char c = s[5];       // What's at index 5? The null terminator!

    // Which index is 'o'?
    // Answer: 4 (not 5, because we count from 0)
    // This catches EVERYONE sometimes

Problem 2: strlen is O(n)

    char *s = malloc(1000000);
    // ... fill with data ...
    int len = strlen(s);  // Walks ALL MILLION BYTES

    // In a loop?
    for (int i = 0; i < strlen(s); i++)  // O(n^2) disaster!
        process(s[i]);

Problem 3: No substring without copying

    char s[] = "Hello, World!";
    // I want just "World" - how?

    char sub[6];
    strncpy(sub, s + 7, 5);  // Copy 5 chars
    sub[5] = '\0';           // Don't forget null terminator!
    // Memory allocated, copy made, error-prone

Problem 4: Buffer overflows

    char buf[10];
    strcpy(buf, "This string is way too long");  // BOOM
    // No bounds checking, memory corruption

Problem 5: Null in data

    char data[] = {0x48, 0x00, 0x65, 0x6c};  // Binary data with null
    int len = strlen(data);  // Returns 1, not 4!
    // C strings can't contain null bytes

Position-Based Indexing

Hanson’s Str uses a different indexing scheme:

Str Position Semantics
--------------------------------------------------------------------------------

String: "Hello"
         H e l l o
Position: 1 2 3 4 5    (positive: count from start, 1-based)
         -5-4-3-2-1    (negative: count from end)
          0            (position 0 = after last character)

Key insight: Positions are between characters, not on them

          ↓1  ↓2  ↓3  ↓4  ↓5  ↓0 (end)
          | H | e | l | l | o |
          ↑-5 ↑-4 ↑-3 ↑-2 ↑-1 ↑

Str_sub(s, 1, 5)  → "Hello"  (positions 1 to 5)
Str_sub(s, 2, 4)  → "ell"    (positions 2 to 4)
Str_sub(s, -3, 0) → "llo"    (position -3 to end)
Str_sub(s, 1, -1) → "Hell"   (position 1 to second-to-last)

Why 1-based?
- Matches human counting: "first character", "fifth character"
- Position 0 has special meaning (end of string)
- Negative indices work more intuitively

Why negative indices?
- Access from end without knowing length
- s[-1] = last character (like Python)
- s[-2] = second to last
- Common operations become simpler

Comparison:
    C (0-based):           Str (1-based):
    s[0] = first char      Str_get(s, 1) = first char
    s[n-1] = last char     Str_get(s, -1) = last char
    s[n] = null term       position 0 = end marker

Zero-Copy Substring Operations

Zero-Copy Substring
--------------------------------------------------------------------------------

Traditional substring (copies):

    char original[] = "Hello, World!";
    char *sub = substring(original, 7, 5);
    // sub is NEWLY ALLOCATED, points to copy of "World"

    ┌─────────────────────┐
    │ H e l l o ,   W o r │ l d ! \0    original
    └─────────────────────┘

    ┌─────────────┐
    │ W o r l d \0│                      sub (NEW MEMORY)
    └─────────────┘

Str module (zero-copy):

    char original[] = "Hello, World!";
    int len;
    char *sub = Str_sub(original, 8, 12, &len);
    // sub points INTO original, len = 5

    ┌─────────────────────┐
    │ H e l l o ,   W o r │ l d ! \0    original
    └─────────────────────┘
                    ↑
                    sub (points here, len=5)

Benefits:
- No allocation needed
- No copying
- Faster
- Less memory

Gotchas:
- sub is NOT null-terminated
- sub is only valid while original exists
- Must track length separately
- Can't modify without affecting original

2.2 Why This Matters

Understanding string manipulation is fundamental because:

  1. Security: Buffer overflows are still a top vulnerability class
  2. Performance: String operations are often on the critical path
  3. Correctness: Off-by-one errors plague every codebase
  4. Memory: Unnecessary copies waste RAM and cache

Industry Impact:

  • Redis’s SDS (Simple Dynamic Strings) uses similar principles
  • Rust’s string slices are the same concept
  • Go’s strings track length separately from data
  • Every modern language has moved away from null-terminated strings

2.3 Historical Context

C strings have been problematic since the beginning:

  • 1970s: C inherits null-terminated strings from B and BCPL
  • 1988: Morris Worm exploits buffer overflow in fingerd
  • 1990s: Constant stream of buffer overflow vulnerabilities
  • 2000s: Languages like Python, Java, Go abandon null-termination
  • 2014: Heartbleed (OpenSSL) shows C string handling still dangerous
  • Today: Rust’s borrow checker + string slices provide safety

Hanson’s Contribution: The Str module (1996) anticipated modern string slice/view concepts by decades. The position semantics and zero-copy approach influenced later designs.

2.4 Common Misconceptions

Misconception 1: “Negative indices are confusing”

  • Reality: Once learned, they’re more intuitive for end-relative access. Python programmers use them constantly.

Misconception 2: “Zero-copy is dangerous because of lifetime issues”

  • Reality: The danger exists with any pointer in C. Clear documentation and discipline solve this.

Misconception 3: “1-based indexing is old-fashioned”

  • Reality: Many modern languages (Lua, Julia, MATLAB) use 1-based indexing. It matches human counting.

Misconception 4: “Str is just a wrapper around char*“

  • Reality: Str provides a fundamentally different abstraction - length-tracked, position-semantics strings.

3. Project Specification

3.1 What You Will Build

A complete Str module implementing Hanson’s CII interface:

// str.h - The Interface

#ifndef STR_INCLUDED
#define STR_INCLUDED

#include <stddef.h>

/* Substring extraction - ZERO COPY
 * Returns pointer into s, sets *len to substring length
 * i, j are positions (1-based, negative from end, 0 = end)
 */
extern char *Str_sub  (const char *s, int i, int j, int *len);
extern char *Str_dup  (const char *s, int i, int j, int alloc);

/* Character operations */
extern int   Str_chr  (const char *s, int i, int j, int c);
extern int   Str_rchr (const char *s, int i, int j, int c);
extern int   Str_upto (const char *s, int i, int j, const char *set);
extern int   Str_rupto(const char *s, int i, int j, const char *set);
extern int   Str_find (const char *s, int i, int j, const char *str);
extern int   Str_rfind(const char *s, int i, int j, const char *str);

/* Comparison */
extern int   Str_cmp  (const char *s1, int i1, int j1,
                       const char *s2, int i2, int j2);

/* Concatenation */
extern char *Str_cat  (const char *s1, int i1, int j1,
                       const char *s2, int i2, int j2);
extern char *Str_catv (const char *s, int i, int j, ...);

/* Mapping */
extern char *Str_map  (const char *s, int i, int j,
                       const char *from, const char *to);

/* String length */
extern int   Str_len  (const char *s, int i, int j);

/* Position conversion helpers */
extern int   Str_pos  (const char *s, int i);  /* Convert to index */

#undef T
#endif

3.2 Functional Requirements

  1. Position semantics: 1-based, negative from end, 0 = end
  2. Str_sub(s, i, j, &len): Return pointer to substring, set len
  3. Str_dup(s, i, j, alloc): Return allocated copy (alloc = include null terminator)
  4. Str_chr/rchr: Find character in range (forward/reverse)
  5. Str_upto/rupto: Find first char from set in range
  6. Str_find/rfind: Find substring in range
  7. Str_cmp: Compare substrings
  8. Str_cat/catv: Concatenate substrings (allocates result)
  9. Str_map: Character mapping (like tr command)
  10. Str_len: Return length of substring

3.3 Non-Functional Requirements

  1. Zero-copy where possible: Str_sub returns pointer, not copy
  2. Explicit length: Never rely on null termination in internal operations
  3. Bounds checking: Assert on invalid positions
  4. Memory safety: Use Mem module for allocations
  5. Interoperability: Work with standard null-terminated C strings

3.4 Example Usage / Output

$ ./str_test

# Test 1: Position-based indexing (1-based!)
s = "Hello, World!"
Str_sub(s, 1, 5, &len)"Hello" (len=5)     # positions 1-5
Str_sub(s, 8, 0, &len)"World!" (len=6)    # position 8 to end
Str_sub(s, -6, 0, &len)"World!" (len=6)   # -6 = 6th from end

# Test 2: Negative indices
Str_sub(s, -1, -1, &len)"!" (len=1)       # last character
Str_sub(s, 1, -2, &len)"Hello, World" (len=12)  # all but last

# Test 3: No allocation - returns view
int len;
char *sub = Str_sub(s, 1, 5, &len);
# sub points into s, no malloc!
# len = 5
# IMPORTANT: sub is NOT null-terminated!

# Test 4: Character operations
Str_chr(s, 1, 0, 'o') → 5      # first 'o' at position 5
Str_rchr(s, 1, 0, 'o') → 9     # last 'o' at position 9
Str_chr(s, 1, 0, 'z') → 0      # 'z' not found (returns 0)

# Test 5: Substring search
Str_find(s, 1, 0, "World") → 8  # "World" starts at position 8
Str_find(s, 1, 0, "world") → 0  # case-sensitive, not found

# Test 6: Comparison
Str_cmp(s, 1, 5, t, 1, 5) → 0   # compare "Hello" == "Hello"
Str_cmp("abc", 1, 0, "abd", 1, 0)-1  # "abc" < "abd"

# Test 7: Concatenation (allocates!)
char *result = Str_cat(s1, 1, 0, s2, 1, 0);
# result = s1 + s2, newly allocated
# caller must free

# Test 8: Character mapping
char *mapped = Str_map("Hello", 1, 0, "aeiou", "AEIOU");
# result = "HEllO" (vowels uppercased)

# Test 9: Allocated copy with null terminator
char *copy = Str_dup(s, 1, 5, 1);  # alloc=1 means null-terminate
# copy = "Hello\0" (6 bytes allocated)
# Can be used with printf, etc.

4. Solution Architecture

4.1 High-Level Design

Str Module Architecture
--------------------------------------------------------------------------------

                    ┌─────────────────────────────────────┐
                    │           str.h (Interface)           │
                    │                                       │
                    │  Position-based API:                  │
                    │  - Positions are 1-based             │
                    │  - Negative = from end               │
                    │  - 0 = after last char               │
                    │                                       │
                    │  Str_sub, Str_dup, Str_chr, etc.     │
                    └─────────────────┬─────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────────┐
                    │          str.c (Implementation)        │
                    │                                       │
                    │  Core helper: convert(s, i, j)        │
                    │  - Converts positions to indices      │
                    │  - Handles negative values            │
                    │  - Validates bounds                   │
                    │                                       │
                    │  Internal:                            │
                    │  - Works with (pointer, length)       │
                    │  - No null-termination assumptions    │
                    │                                       │
                    └───────────────────────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────────┐
                    │              Dependencies             │
                    │                                       │
                    │  mem.h ─────► ALLOC for Str_dup, etc  │
                    │  assert.h ──► position validation     │
                    └───────────────────────────────────────┘

4.2 Key Components

Position Conversion:

// Convert position i to 0-based index into string of length n
// Positions: 1 = first char, -1 = last char, 0 = past end
static int convert(int n, int i) {
    if (i == 0)
        return n;
    if (i < 0)
        return n + i + 1;  // -1 → n, -2 → n-1, etc.
    return i;               // Positive: i → i (1-based to 1-based)
}

// For a string "Hello" (n=5):
// convert(5, 1)  → 1  (first char)
// convert(5, 5)  → 5  (last char)
// convert(5, 0)  → 5  (past end, same as length)
// convert(5, -1) → 5  (last char)
// convert(5, -5) → 1  (first char)

4.3 Data Structures

Unlike most CII modules, Str doesn’t define a new opaque type. It operates on const char* strings directly, with lengths passed explicitly or derived.

The key insight is representing substrings as (pointer, length) pairs internally:

// Internal representation during processing
struct str_view {
    const char *data;  // Points into source string
    int len;           // Length of substring
};

// Example: substring "llo" of "Hello"
// data = original + 2   (points to 'l')
// len = 3

// This is NOT exposed to users - just used internally
// Users work with positions, not indices

4.4 Algorithm Overview

Position-to-Index Conversion:

convert(length, position):
    if position == 0:
        return length        # Past end
    if position < 0:
        return length + position + 1
    return position          # 1-based, keep as-is

Then adjust for 0-based C indexing:
    index = convert(length, position) - 1

Substring Extraction:

char *Str_sub(const char *s, int i, int j, int *len) {
    int n = strlen(s);

    // Convert positions to 1-based values
    i = convert(n, i);
    j = convert(n, j);

    // Validate
    assert(1 <= i && i <= n + 1);
    assert(1 <= j && j <= n + 1);
    assert(i <= j + 1);  // Empty string OK when i == j + 1

    // Set length
    if (len)
        *len = j - i + 1;

    // Return pointer to start (convert to 0-based index)
    return (char *)s + i - 1;
}

Character Search:

int Str_chr(const char *s, int i, int j, int c) {
    int n = strlen(s);
    i = convert(n, i);
    j = convert(n, j);

    assert(1 <= i && i <= n + 1);
    assert(1 <= j && j <= n + 1);
    assert(i <= j + 1);

    // Search from position i to j
    for (int k = i; k <= j; k++) {
        if (s[k-1] == c)  // k-1 for 0-based index
            return k;      // Return 1-based position
    }
    return 0;  // Not found
}

5. Implementation Guide

5.1 Development Environment Setup

# Required tools
gcc --version          # GCC with C11 support
make --version
valgrind --version

# Project structure
C_INTERFACES_AND_IMPLEMENTATIONS_MASTERY/
├── include/
│   ├── str.h
│   ├── mem.h
│   └── assert.h
├── src/
│   ├── str.c
│   ├── mem.c
│   └── assert.c
├── tests/
│   └── str_test.c
└── Makefile

# Compiler flags
CFLAGS = -Wall -Wextra -Wpedantic -std=c11 -g

5.2 Project Structure

str.h          Interface (public API)
str.c          Implementation
str_test.c     Comprehensive test suite

5.3 The Core Question You’re Answering

“How do you design a string interface that prevents the off-by-one and buffer overflow errors that plague C string code?”

The answer involves:

  • 1-based positions that match human counting
  • Explicit length tracking
  • Zero-copy where possible
  • Clear assertions on invalid access

5.4 Concepts You Must Understand First

Before implementing, ensure you can answer:

  1. C string basics: What is strlen actually doing? Why is it O(n)?
  2. Pointer arithmetic: What does s + 5 mean when s is char*?
  3. 0-based vs 1-based: Why do most languages use 0-based indexing?
  4. String views: What’s the difference between char* and a copied string?

5.5 Questions to Guide Your Design

  1. Validation: What happens if positions are out of range?
  2. Null termination: Does Str guarantee null-terminated results?
  3. Mutability: Can Str operations modify the source string?
  4. Integration: How do you convert between Str and char*?
  5. Empty strings: What do positions mean for an empty string?

5.6 Thinking Exercise

Trace through these operations on “Hello, World!” (13 chars):

String: H e l l o ,   W o r l  d  !
Index:  0 1 2 3 4 5 6 7 8 9 10 11 12  (0-based C index)
Pos:    1 2 3 4 5 6 7 8 9 10 11 12 13 (1-based position)
NegPos:-13-12... -6-5-4-3-2-1          (negative position)

Convert these positions to 0-based indices:

  • Position 1 → index ?
  • Position 8 → index ?
  • Position -1 → index ?
  • Position -6 → index ?
  • Position 0 → past end
Click to see solution ``` Position 1 → index 0 (H) Position 8 → index 7 (W) Position -1 → index 12 (!) Position -6 → index 7 (W) Position 0 → index 13 (past end, used as endpoint) Formula: - Positive i: index = i - 1 - Negative i: index = length + i - Zero: index = length For "Hello, World!" (n=13): - convert(13, 1) = 1, then index = 0 - convert(13, 8) = 8, then index = 7 - convert(13, -1) = 13 + (-1) + 1 = 13, then index = 12 - convert(13, -6) = 13 + (-6) + 1 = 8, then index = 7 - convert(13, 0) = 13, then index = 13 (past end) ```

5.7 Hints in Layers

Hint 1 - Starting Point (Conceptual Direction):

The key function is position conversion. Get this right first:

// Hanson's convert function (from CII)
// Returns 1-based position given possibly negative input
static int convert(int n, int i) {
    // n = string length
    // i = position (1-based, or negative, or 0)

    if (i == 0)
        return n + 1;  // Position 0 means "past end"
    else if (i < 0)
        return n + 1 + i;  // -1 → n, -2 → n-1, etc.
    else
        return i;  // Positive: keep as-is
}

Hint 2 - Next Level (More Specific Guidance):

Most functions follow this pattern:

int Str_something(const char *s, int i, int j, ...) {
    int n = strlen(s);  // Get length once

    // Convert positions
    i = convert(n, i);
    j = convert(n, j);

    // Validate: positions must be valid for string
    assert(1 <= i && i <= n + 1);
    assert(1 <= j && j <= n + 1);

    // For substring operations: i must not exceed j+1
    assert(i <= j + 1);  // i == j+1 means empty substring

    // Now work with 0-based indices: s[i-1] to s[j-1]
    // ...
}

Hint 3 - Technical Details (Approach/Pseudocode):

For Str_sub (zero-copy substring):

char *Str_sub(const char *s, int i, int j, int *len) {
    int n = strlen(s);
    i = convert(n, i);
    j = convert(n, j);

    assert(s);
    assert(1 <= i && i <= n + 1);
    assert(1 <= j && j <= n + 1);
    assert(i <= j + 1);

    if (len)
        *len = j - i + 1;  // Could be 0 if i > j

    // Return pointer into original string
    // i is 1-based, so s + i - 1 gives 0-based start
    return (char *)(s + i - 1);
}

For Str_dup (allocated copy):

char *Str_dup(const char *s, int i, int j, int alloc) {
    int len;
    char *sub = Str_sub(s, i, j, &len);

    // Allocate: len bytes + 1 for null if alloc is true
    char *copy = ALLOC(len + (alloc ? 1 : 0));
    memcpy(copy, sub, len);
    if (alloc)
        copy[len] = '\0';

    return copy;
}

Hint 4 - Tools/Debugging (Verification Methods):

Test edge cases carefully:

// Empty string
char *empty = "";
assert(Str_len(empty, 1, 0) == 0);  // Position 1 to end

// Single character
char *one = "X";
assert(Str_chr(one, 1, 0, 'X') == 1);   // Found at position 1
assert(Str_chr(one, 1, 0, 'Y') == 0);   // Not found

// Position 0 as endpoint
assert(Str_len("Hello", 1, 0) == 5);    // All chars
assert(Str_len("Hello", -1, 0) == 1);   // Just last char

// Negative positions
assert(Str_chr("Hello", -3, 0, 'l') == 4);  // Search last 3

5.8 The Interview Questions They’ll Ask

  1. “Design a string library that prevents buffer overflows. What’s your approach?”

    Expected Strong Answer: I’d track length separately from data (like Pascal strings, Rust’s &str, Go’s strings). Bounds checking on every access via assertions. Substring operations return views (pointer + length) rather than copies, eliminating allocation. Position-based API with validation prevents off-by-one errors. For output to C functions needing null-termination, provide explicit to_cstr() that allocates and null-terminates.

  2. “Implement substring search. What algorithm would you use and why?”

    Expected Strong Answer: For simple cases, naive O(nm) search is fine and has good cache behavior. For larger texts or repeated searches, I’d use Boyer-Moore (O(n/m) best case by skipping) or KMP (O(n+m) guaranteed). In practice, the naive algorithm with SIMD (like glibc’s memmem) is often fastest for typical string lengths. The choice depends on: pattern length, alphabet size, and whether pattern is reused.

  3. “Compare string views (Rust/C++) to null-terminated strings. Trade-offs?”

    Expected Strong Answer: String views store (pointer, length), null-terminated strings rely on sentinel. Views: O(1) length, can contain nulls, substring is free (just adjust pointer/length), but need explicit lifetime management. Null-terminated: compatible with C, self-describing size (sort of), but O(n) strlen, can’t contain nulls, substring requires copy or mutation. Modern languages overwhelmingly prefer views.

  4. “How would you implement Unicode support in a string library?”

    Expected Strong Answer: First, decide: UTF-8, UTF-16, or UTF-32 internally? UTF-8 is space-efficient and C-compatible but variable-width. Key insight: indices can mean bytes, code points, or grapheme clusters - each gives different results for position 5 in “cafe\u0301” (cafe with combining accent). I’d provide byte-indexed operations (fast) and optional grapheme-aware iteration. Rust does this with String (UTF-8 bytes) and .chars() iterator.

  5. “What’s the complexity of your substring operation? Does it allocate?”

    Expected Strong Answer: Str_sub is O(1) - just pointer arithmetic and validation. It does NOT allocate - returns a pointer into the original string. The caller receives (pointer, length) and must understand the returned string is not null-terminated and only valid while the source string exists. For a null-terminated copy, use Str_dup which is O(n) for the copy. This zero-copy approach is critical for parser performance.

5.9 Books That Will Help

Topic Book Chapter
Str implementation “C Interfaces and Implementations” by Hanson Ch. 15
String algorithms “Algorithms in C” by Sedgewick String processing
Safe string handling “Secure Coding in C and C++” by Seacord Ch. 2
Modern string design “Programming Rust” by Blandy & Orendorff String chapters
Efficient string operations “The Practice of Programming” by Kernighan & Pike Ch. 9

5.10 Implementation Phases

Phase 1: Position Conversion (Day 1)

  • Implement convert() helper
  • Test with positive, negative, and zero positions
  • Verify edge cases (empty string, single char)

Phase 2: Core Operations (Day 2)

  • Implement Str_sub (zero-copy substring)
  • Implement Str_dup (allocated copy)
  • Implement Str_len
  • Test thoroughly

Phase 3: Search Operations (Day 3)

  • Implement Str_chr (forward search for char)
  • Implement Str_rchr (reverse search)
  • Implement Str_find (substring search)
  • Implement Str_rfind

Phase 4: Advanced Operations (Day 4)

  • Implement Str_upto (search for char set)
  • Implement Str_cmp (substring comparison)
  • Implement Str_cat (concatenation)
  • Implement Str_map (character mapping)

Phase 5: Integration & Polish (Day 5)

  • Integration tests
  • Memory leak testing
  • Edge case coverage
  • Documentation

5.11 Key Implementation Decisions

  1. Position Conversion: Use Hanson’s convert() formula exactly - it’s been tested for decades

  2. Zero Return for Not Found: Str_chr returns 0 when not found (since 0 is never a valid position)

  3. Empty Substrings: Allow i > j to represent empty substring (length 0)

  4. Const Correctness: Functions take const char* but Str_sub returns non-const (C limitation)

  5. Allocation Policy: Str_sub never allocates; Str_dup, Str_cat always allocate


6. Testing Strategy

Unit Tests

void test_position_conversion(void) {
    // Test convert helper indirectly through Str_sub
    int len;

    // Positive positions
    char *s = "Hello";
    assert(Str_sub(s, 1, 1, &len) == s);  // First char
    assert(len == 1);

    assert(Str_sub(s, 1, 5, &len) == s);  // Whole string
    assert(len == 5);

    // Negative positions
    assert(Str_sub(s, -1, -1, &len) == s + 4);  // Last char
    assert(len == 1);

    assert(Str_sub(s, -5, -1, &len) == s);  // Whole string via negatives
    assert(len == 5);

    // Mixed
    assert(Str_sub(s, 1, -1, &len) == s);  // Start to last
    assert(len == 5);

    // Position 0 as end
    assert(Str_sub(s, 1, 0, &len) == s);
    assert(len == 5);

    printf("test_position_conversion: PASSED\n");
}

void test_chr_operations(void) {
    char *s = "Hello, World!";

    // Forward search
    assert(Str_chr(s, 1, 0, 'H') == 1);
    assert(Str_chr(s, 1, 0, 'o') == 5);   // First 'o'
    assert(Str_chr(s, 1, 0, 'z') == 0);   // Not found
    assert(Str_chr(s, 6, 0, 'o') == 9);   // 'o' in "World"

    // Reverse search
    assert(Str_rchr(s, 1, 0, 'o') == 9);  // Last 'o'
    assert(Str_rchr(s, 1, 5, 'o') == 5);  // Last 'o' in "Hello"

    printf("test_chr_operations: PASSED\n");
}

void test_substring_search(void) {
    char *s = "Hello, World!";

    assert(Str_find(s, 1, 0, "Hello") == 1);
    assert(Str_find(s, 1, 0, "World") == 8);
    assert(Str_find(s, 1, 0, "world") == 0);  // Case-sensitive
    assert(Str_find(s, 1, 0, "") == 1);       // Empty matches at start

    assert(Str_rfind(s, 1, 0, "o") == 9);     // Last 'o'

    printf("test_substring_search: PASSED\n");
}

Edge Cases

void test_edge_cases(void) {
    int len;

    // Empty string
    char *empty = "";
    assert(Str_len(empty, 1, 0) == 0);
    assert(Str_chr(empty, 1, 0, 'x') == 0);

    // Single character
    char *one = "X";
    assert(Str_len(one, 1, 0) == 1);
    assert(Str_chr(one, 1, 0, 'X') == 1);
    assert(Str_sub(one, 1, 1, &len) == one);
    assert(len == 1);

    // Empty substring (i > j after conversion)
    char *s = "Hello";
    assert(Str_len(s, 3, 2) == 0);  // Position 3 to 2 = empty

    // Boundary positions
    assert(Str_sub(s, 5, 5, &len) == s + 4);  // Just 'o'
    assert(len == 1);

    printf("test_edge_cases: PASSED\n");
}

Allocation Tests

void test_allocating_functions(void) {
    char *s = "Hello, World!";

    // Str_dup with null terminator
    char *copy = Str_dup(s, 1, 5, 1);  // "Hello\0"
    assert(strcmp(copy, "Hello") == 0);
    assert(strlen(copy) == 5);
    FREE(copy);

    // Str_cat
    char *cat = Str_cat(s, 1, 5, s, 8, 0);  // "Hello" + "World!"
    assert(strcmp(cat, "HelloWorld!") == 0);
    FREE(cat);

    // Str_map
    char *mapped = Str_map("Hello", 1, 0, "aeiou", "AEIOU");
    assert(strcmp(mapped, "HEllO") == 0);
    FREE(mapped);

    printf("test_allocating_functions: PASSED\n");
}

7. Common Pitfalls & Debugging

Problem 1: “Str_sub returns wrong pointer”

  • Symptom: Characters at wrong position
  • Why: Off-by-one in position-to-index conversion
  • Fix: Remember: position 1 = index 0, so s + i - 1
  • Test: Verify Str_sub(s, 1, 1, &len) returns pointer to first char

Problem 2: “Negative indices don’t work correctly”

  • Symptom: Position -1 gives wrong character
  • Why: Wrong formula for negative conversion
  • Fix: For length n and position i < 0: convert to n + i + 1 (1-based result)
  • Test: For “Hello”, position -1 should give ‘o’ (index 4)

Problem 3: “Str_find doesn’t find substring at start”

  • Symptom: Searching for “Hello” in “Hello World” returns 0
  • Why: Returning 0 instead of position 1
  • Fix: Found at start means position 1, not 0. Return 0 only for not-found.
  • Test: Str_find("abc", 1, 0, "a") should return 1

Problem 4: “Crash when using Str_sub result with printf”

  • Symptom: Garbage output or crash
  • Why: Str_sub returns non-null-terminated pointer
  • Fix: Use %.*s format or Str_dup with alloc=1
  • Test: printf("%.*s\n", len, Str_sub(s, i, j, &len))

Problem 5: “Position 0 doesn’t work as expected”

  • Symptom: Str_sub(s, 1, 0, &len) gives wrong length
  • Why: Position 0 should mean “end of string” (n+1), not index 0
  • Fix: In convert(), handle i==0 specially: return n+1
  • Test: For “Hello”, Str_sub(s, 1, 0, &len) should have len=5

8. Extensions & Challenges

8.1 Case-Insensitive Operations

Add Str_ichr, Str_ifind, Str_icmp variants that ignore case.

8.2 Pattern Matching

Add Str_match with glob patterns: * matches any sequence, ? matches any character.

8.3 Trim Operations

Add Str_ltrim, Str_rtrim, Str_trim that return positions of first/last non-whitespace.

8.4 Split/Join

Add Str_split(s, delim) returning array of (position, length) pairs.

8.5 UTF-8 Awareness

Add Str_uchr that finds Unicode code points, not just bytes.


9. Real-World Connections

Redis Simple Dynamic Strings (SDS)

// Redis uses similar principles
struct sdshdr {
    int len;        // Length of string
    int free;       // Free bytes in buffer
    char buf[];     // The data
};

// O(1) length, explicit tracking
// Hanson's Str predates SDS by ~10 years

Rust String Slices

// Rust's &str is exactly Str's philosophy
let s = "Hello, World!";
let hello: &str = &s[0..5];  // Zero-copy slice
// hello points into s, doesn't own memory
// Same concept as Str_sub returning pointer + length

Go Strings

// Go strings internally are (pointer, length)
type stringHeader struct {
    Data uintptr
    Len  int
}
// Substrings share backing array
// Same zero-copy principle

Protocol Parsers

// Parsing HTTP headers
// "Content-Length: 42\r\n"
int pos = Str_chr(line, 1, 0, ':');
char *name = Str_sub(line, 1, pos - 1, &name_len);
char *value = Str_sub(line, pos + 2, -3, &val_len);  // Skip ": " and "\r\n"
// Zero-copy extraction of fields

10. Resources

Primary References

  1. C Interfaces and Implementations by David Hanson, Chapter 15
    • Official CII source: https://github.com/drh/cii
  2. Secure Coding in C and C++ by Robert Seacord
    • Chapter on string handling security

Online Resources

  1. Redis SDS: https://redis.io/docs/reference/internals/sds/
  2. Rust String handling: https://doc.rust-lang.org/book/ch08-02-strings.html
  3. C string vulnerabilities: https://cwe.mitre.org/data/definitions/120.html

11. Self-Assessment Checklist

Before considering this project complete, verify:

  • Position conversion handles positive, negative, and zero correctly
  • Str_sub returns pointer into original (no allocation)
  • Str_dup allocates and optionally null-terminates
  • Str_chr finds first occurrence, returns position (not index)
  • Str_rchr finds last occurrence
  • Str_find finds substring, returns starting position
  • Str_cmp compares substrings correctly
  • Str_cat concatenates and allocates result
  • Str_map performs character translation
  • Empty strings handled correctly
  • Negative indices work as documented
  • Position 0 means “end of string”
  • No memory leaks
  • Assertions catch invalid positions

12. Submission / Completion Criteria

Your Str module implementation is complete when:

  1. All tests pass: Position semantics, edge cases, allocations
  2. Memory-safe: No leaks (Str_dup/cat return allocated memory)
  3. Documented: Header explains position semantics clearly
  4. Follows CII conventions: Uses Mem module, assertions
  5. Zero-copy verification: Str_sub demonstrably doesn’t allocate

Deliverables:

  • str.h - Interface with position semantics documentation
  • str.c - Implementation
  • str_test.c - Comprehensive test suite
  • Makefile - Build configuration
  • Brief writeup explaining: (1) position conversion formula, (2) zero-copy approach, (3) how you handle C string interop