Project 7: String Library from Scratch

A safe, feature-rich C string library with explicit lengths, bounds checking, and clear error semantics.

Quick Reference

Attribute Value
Difficulty Level 3 - Advanced
Time Estimate 1-2 weeks
Main Programming Language C
Alternative Programming Languages None
Coolness Level Level 3 - Genuinely Clever
Business Potential Level 1 - Resume Gold
Prerequisites Pointers, arrays, memory management basics
Key Topics String invariants, bounds checking, API design

1. Learning Objectives

By completing this project, you will:

  1. Design a string type with explicit length and capacity.
  2. Implement safe string operations that avoid buffer overflows.
  3. Provide consistent error handling and return conventions.
  4. Benchmark against standard str* functions for correctness and speed.
  5. Document edge cases and security pitfalls in string handling.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: C String Representation and Invariants

Fundamentals

Traditional C strings are arrays of characters terminated by a \0 byte. The \0 terminator is the only way standard string functions know where the string ends. This design is simple and efficient but dangerous: if the terminator is missing or misplaced, functions like strlen will read past the buffer, causing undefined behavior. To build a safer string library, you need explicit invariants: a string has a length, a capacity, and a guarantee that the buffer always contains a terminator at data[length]. These invariants must be maintained by every operation.

Deep Dive into the concept

A string invariant is a contract that every function in your library must preserve. For a “safe string” type, the typical invariant is: the buffer is allocated for capacity + 1 bytes, length <= capacity, and data[length] == '\0'. This ensures that the string is always a valid C string and that operations never read or write out of bounds. Without explicit length, every operation has to scan for the terminator, which is O(n) and can be unsafe. With explicit length, operations like concatenation and substring extraction can be O(1) for length checks and O(k) for actual copying.

This invariant also helps with correctness. Consider concatenation: you must ensure that the destination has enough capacity for the new length plus the terminator. If not, you can either return an error, grow the buffer, or truncate safely depending on your API design. In standard C, strcat assumes the buffer is large enough and will overflow if it is not. In a professional library, you must make the capacity explicit and enforce it.

C strings are also byte-oriented; they do not encode Unicode semantics. If you want to handle UTF-8 correctly, you must treat it as a byte sequence and be careful with “character” counts. For the scope of this project, we focus on byte strings, but you should document that length is in bytes, not characters. This clarity is part of professional API design.

In your library, you’ll provide constructors that initialize the string invariant, mutators that preserve it, and accessors that never expose invalid states. You will also include a debug mode that validates the invariant after each operation. This makes it easier to catch bugs early and prevents undefined behavior from leaking into your users’ code.

A professional library must also handle the case where strings may contain embedded null bytes. Standard C strings cannot represent this, so you must decide whether your library supports binary-safe strings (where length is authoritative) or text-only strings. If you support binary-safe strings, you must ensure that APIs never call standard functions like strlen on them, and you must document that data may contain a null byte before length. This affects serialization, comparison, and formatting. Another subtlety is ownership: does the string type own its buffer, or can it reference external memory (like a slice)? If you support non-owning views, you must make lifetime rules explicit to avoid dangling references. Including a separate sstring_view type with a pointer and length can clarify this. These design choices add depth and realism to your string library and mirror patterns used in modern C libraries.

To operationalize this concept in a real codebase, create a short checklist of invariants and a set of micro-experiments. Start with a minimal, deterministic test that isolates one rule or behavior, then vary a single parameter at a time (inputs, flags, platform, or data layout) and record the outcome. Keep a table of assumptions and validate them with assertions or static checks so violations are caught early. Whenever the concept touches the compiler or OS, capture tool output such as assembly, warnings, or system call traces and attach it to your lab notes. Finally, define explicit failure modes: what does a violation look like at runtime, and how would you detect it in logs or tests? This turns abstract theory into repeatable engineering practice and makes results comparable across machines and compiler versions.

How this fits on projects

Definitions & key terms

  • Null-terminated string: A char array ending with \0.
  • Invariant: A condition that must always hold for a data structure.
  • Capacity: Maximum size before reallocation.
  • Length: Current number of bytes in the string.

Mental model diagram (ASCII)

+------------------------------+
| data[0..length-1] | '\0' |   |
+------------------------------+
     length           terminator

How it works (step-by-step, with invariants and failure modes)

  1. Allocate capacity + 1 bytes.
  2. Set length and ensure data[length] = '\0'.
  3. On mutation, check capacity, update length, reset terminator.

Invariant: data[length] is always \0 and length <= capacity. Failure mode: Missing terminator leads to UB on strlen or printing.

Minimal concrete example

typedef struct {
    char *data;
    size_t length;
    size_t capacity;
} sstring_t;

Common misconceptions

  • “Null terminator is enough.” → It doesn’t prevent overflow.
  • “Length and capacity are the same.” → Capacity is space available, length is used.
  • “UTF-8 length equals character count.” → It is bytes, not characters.

Check-your-understanding questions

  1. Why does strlen risk UB if the terminator is missing?
  2. What invariant must hold for a safe string type?
  3. Why store length explicitly?
  4. What is the difference between length and capacity?
  5. How does the terminator help interoperability?

Check-your-understanding answers

  1. It keeps reading until it finds \0, possibly past the buffer.
  2. length <= capacity and data[length] == '\0'.
  3. It avoids repeated scans and enables safe bounds checks.
  4. Length is used bytes; capacity is allocated bytes.
  5. It lets the string be used with standard C APIs.

Real-world applications

  • Safe string handling in security-sensitive software.
  • Logging and telemetry systems requiring robust formatting.

Where you’ll apply it

References

  • “Effective C” — Seacord, string safety sections
  • “Secure Coding in C and C++” — Seacord

Key insights

A string library is only safe if every operation preserves its invariants.

Summary

C strings are simple but unsafe by default. A safe string type must carry explicit length and capacity and enforce the null-termination invariant.

Homework/Exercises to practice the concept

  1. Define invariants for a dynamic string type.
  2. Write a constructor that initializes an empty safe string.
  3. Implement an invariant checker for debugging.

Solutions to the homework/exercises

  1. length <= capacity, data[length] == '\0', data non-null.
  2. Allocate 1 byte, set length 0, terminator at data[0].
  3. Verify conditions and abort on failure.

Concept 2: Bounds Checking, Error Semantics, and API Design

Fundamentals

Safe string APIs must guard against buffer overflows, truncation, and undefined behavior. This requires explicit size parameters, clear return values, and consistent error semantics. Unlike strcpy, a safe API returns an error if the destination buffer is too small or grows the buffer automatically. Designing these semantics is as important as implementing the operations themselves, because your library’s safety depends on how users can misuse it.

Deep Dive into the concept

Bounds checking is straightforward in principle: before copying or concatenating, verify that the destination has enough capacity. The challenge is choosing how to react when it doesn’t. There are three common strategies: (1) return an error without modifying the destination, (2) truncate safely and report truncation, or (3) grow the buffer automatically. Each strategy has trade-offs. Returning an error is safest but can be inconvenient for users. Truncation is user-friendly but can hide data loss. Automatic growth is flexible but requires dynamic allocation and can fail at runtime.

Error semantics must be explicit and consistent. For example, a function sstring_append might return 0 on success, -1 on allocation failure, and -2 on invalid input. It must also document whether the destination is unchanged on error. Professional libraries often follow a “no partial writes on error” rule to preserve invariants. This requires careful staging: calculate required size, allocate if needed, then copy.

API design in C also relies on naming conventions and symmetry. If you provide sstring_init, you should also provide sstring_free. If you provide sstring_append, you should provide sstring_insert, sstring_replace, and sstring_substr. Consistent naming makes the library easier to learn. Additionally, you should provide functions that operate on raw C strings for interoperability, but clearly document that these functions validate inputs and may return errors if the input is not properly terminated within a given maximum length.

Another important aspect is constant-time operations when dealing with security-sensitive strings (like passwords or tokens). If your library includes sstring_equals, you might provide a constant-time variant to avoid timing side channels. This level of detail distinguishes a “safe” library from a “secure” one. In this project, you will design the API and implement core operations with explicit size checks, error codes, and invariant preservation. The goal is to show that safety is a property of both implementation and interface.

To operationalize this concept in a real codebase, create a short checklist of invariants and a set of micro-experiments. Start with a minimal, deterministic test that isolates one rule or behavior, then vary a single parameter at a time (inputs, flags, platform, or data layout) and record the outcome. Keep a table of assumptions and validate them with assertions or static checks so violations are caught early. Whenever the concept touches the compiler or OS, capture tool output such as assembly, warnings, or system call traces and attach it to your lab notes. Finally, define explicit failure modes: what does a violation look like at runtime, and how would you detect it in logs or tests? This turns abstract theory into repeatable engineering practice and makes results comparable across machines and compiler versions.

Another way to deepen understanding is to map the concept to a small decision table: list inputs, expected outcomes, and the assumptions that must hold. Create at least one negative test that violates an assumption and observe the failure mode, then document how you would detect it in production. Add a short trade-off note: what you gain by following the rule and what you pay in complexity or performance. Where possible, instrument the implementation with debug-only checks so violations are caught early without affecting release builds. If the concept admits multiple approaches, implement two and compare them; the act of measuring and documenting the difference is part of professional practice. This habit turns theoretical understanding into an engineering decision framework you can reuse across projects.

How this fits on projects

Definitions & key terms

  • Bounds checking: Verifying that operations stay within buffer limits.
  • Truncation: Cutting output to fit available space.
  • Error semantics: Rules about return codes and state after failure.
  • Constant-time: Execution time independent of input values.

Mental model diagram (ASCII)

append(src):
if (len + src_len > cap) -> error or grow
else copy -> update len -> set '\0'

How it works (step-by-step, with invariants and failure modes)

  1. Validate input pointers and lengths.
  2. Compute required size and check capacity.
  3. If insufficient, grow or return error.
  4. Perform copy and update length/terminator.

Invariant: Destination remains valid even if operation fails. Failure mode: Partial writes corrupt length or terminator.

Minimal concrete example

int sstring_append(sstring_t *dst, const char *src) {
    size_t n = strlen(src);
    if (dst->length + n > dst->capacity) return -1;
    memcpy(dst->data + dst->length, src, n + 1);
    dst->length += n;
    return 0;
}

Common misconceptions

  • “Returning size_t is enough.” → Error reporting becomes ambiguous.
  • “Truncation is safer than error.” → It can silently lose data.
  • “Safety is only about bounds.” → API semantics and error handling matter.

Check-your-understanding questions

  1. What should happen if append runs out of capacity?
  2. Why should functions avoid partial writes on error?
  3. When would truncation be acceptable?
  4. How can you design a constant-time comparison?
  5. Why is API symmetry important?

Check-your-understanding answers

  1. Return an error or grow the buffer, but keep invariants intact.
  2. Partial writes can corrupt invariants and user expectations.
  3. When data loss is acceptable and documented.
  4. Compare all bytes and aggregate differences without early exit.
  5. It makes the library predictable and easier to learn.

Real-world applications

  • Secure handling of user input in network services.
  • Safe logging and formatting in system daemons.

Where you’ll apply it

References

  • “CERT C Secure Coding Standard”
  • “Secure Coding in C and C++” — Seacord

Key insights

A safe library is defined as much by its API contracts as by its internal checks.

Summary

Bounds checking and clear error semantics are the difference between safe and unsafe string APIs. Thoughtful design prevents misuse and makes correctness enforceable.

Homework/Exercises to practice the concept

  1. Design error codes for a string library and document them.
  2. Implement a truncating append and compare with error-return version.
  3. Write a constant-time comparison function.

Solutions to the homework/exercises

  1. Use negative error codes, return 0 on success.
  2. Truncating version must preserve terminator and report truncation.
  3. XOR all bytes and check if result is zero.

3. Project Specification

3.1 What You Will Build

A safe string library (sstring) providing length-aware operations, dynamic growth, and consistent error handling, plus a CLI demo program that exercises the API and prints results.

3.2 Functional Requirements

  1. String type: Struct with data, length, capacity.
  2. Core operations: init, free, append, insert, replace, substring.
  3. Bounds safety: No operation writes out of bounds.
  4. Error reporting: Consistent error codes and no partial writes.
  5. Interoperability: Ability to convert to/from C strings.

3.3 Non-Functional Requirements

  • Performance: Append should be amortized O(1) with growth.
  • Reliability: Invariant checker passes after each operation in debug mode.
  • Usability: API is documented with examples.

3.4 Example Usage / Output

sstring_t s;
ss_init(&s, "hello");
ss_append(&s, " world");
printf("%s\n", s.data);

3.5 Data Formats / Schemas / Protocols

Error code enum:

typedef enum { SS_OK = 0, SS_EINVAL = -1, SS_ENOMEM = -2, SS_ETRUNC = 1 } ss_err_t;

3.6 Edge Cases

  • Appending empty strings.
  • Inserting at boundaries (0 or length).
  • Handling NULL inputs.

3.7 Real World Outcome

What you will see:

  1. A usable string library with documentation.
  2. A demo program showing safe operations.
  3. Test results validating no buffer overflows.

3.7.1 How to Run (Copy/Paste)

make
./sstring_demo

3.7.2 Golden Path Demo (Deterministic)

Append, insert, and substring on fixed strings and verify output.

3.7.3 If CLI: exact terminal transcript

$ ./sstring_demo
init: "hello"
append: "hello world"
substring(0,5): "hello"
Exit: 0

Failure demo (deterministic):

$ ./sstring_demo --inject-oom
ERROR: out of memory
Exit: 2

4. Solution Architecture

4.1 High-Level Design

+-------------------+
| sstring API        |
+---------+---------+
          |
          v
+-------------------+     +-------------------+
| core operations    | -->| growth policy     |
+-------------------+     +-------------------+

4.2 Key Components

| Component | Responsibility | Key Decisions | |———–|—————-|—————-| | Core operations | Append/insert/replace | Preserve invariants | | Growth policy | Expand capacity | Double strategy | | Debug checker | Validate invariants | Enabled in debug |

4.3 Data Structures (No Full Code)

typedef struct {
    char *data;
    size_t length;
    size_t capacity;
} sstring_t;

4.4 Algorithm Overview

  1. Validate input and required size.
  2. Grow capacity if needed.
  3. Perform copy/move operations.
  4. Update length and terminator.

Complexity Analysis:

  • Time: O(n) for most ops; append amortized O(1)
  • Space: O(n)

5. Implementation Guide

5.1 Development Environment Setup

clang -std=c23 -Wall -Wextra -Werror -fsanitize=address,undefined -g

5.2 Project Structure

sstring/
├── src/
│   ├── sstring.c
│   └── demo.c
├── include/
│   └── sstring.h
├── tests/
└── Makefile

5.3 The Core Question You’re Answering

“How can I design a string API that is safe by default and hard to misuse?”

5.4 Concepts You Must Understand First

  1. String invariants and termination.
  2. Bounds checking and error semantics.
  3. Buffer growth strategies.

5.5 Questions to Guide Your Design

  1. Should operations mutate on error or leave unchanged?
  2. What growth policy balances speed and memory overhead?
  3. How will you expose errors to callers?

5.6 Thinking Exercise

Write a safe append operation and identify all failure points.

5.7 The Interview Questions They’ll Ask

  1. Why are C strings unsafe by default?
  2. How would you design a safe string API?
  3. How do you avoid buffer overflows in C?

5.8 Hints in Layers

  • Hint 1: Implement a minimal init/append first.
  • Hint 2: Add an invariant checker macro.
  • Hint 3: Add growth and error semantics.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | Secure strings | “Secure Coding in C and C++” — Seacord | Ch. 3 |

5.10 Implementation Phases

Phase 1: Foundation (3-4 days)

  • Implement core struct and init/free.
  • Checkpoint: Basic append works.

Phase 2: Core Functionality (4-5 days)

  • Add insert/replace/substring.
  • Checkpoint: Invariants hold under tests.

Phase 3: Polish & Edge Cases (2-3 days)

  • Add growth policy and error handling.
  • Checkpoint: Fuzz tests pass without crashes.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Growth | fixed, doubling | doubling | Amortized O(1) append | | Error handling | return codes, abort | return codes | Safer for libraries |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———|———|———-| | Unit tests | Validate each operation | append, insert | | Integration tests | End-to-end usage | demo output | | Edge case tests | NULL inputs, empty strings | error handling |

6.2 Critical Test Cases

  1. Append causing growth.
  2. Insert at index 0 and at length.
  3. Replace with longer string.

6.3 Test Data

Input: "hello" + " world"
Expected: "hello world"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |——–|———|———-| | Forgetting terminator | Garbage output | Always set data[length] | | Partial write on error | Corrupted state | Use staging buffers | | Incorrect length updates | Off-by-one bugs | Centralize length math |

7.2 Debugging Strategies

  • Add invariant checks after every operation.
  • Use AddressSanitizer to catch out-of-bounds writes.

7.3 Performance Traps

Excessive reallocation can slow append; use doubling growth.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add sstring_trim and sstring_split.

8.2 Intermediate Extensions

  • Add UTF-8 aware helper functions.

8.3 Advanced Extensions

  • Add small-string optimization (SSO).

9. Real-World Connections

9.1 Industry Applications

  • Safe logging and parsing in system daemons.
  • Security-sensitive input handling.
  • bstrlib, sds (Redis strings).

9.3 Interview Relevance

  • String safety and API design questions are common.

10. Resources

10.1 Essential Reading

  • “Secure Coding in C and C++” — Seacord
  • CERT C Secure Coding Standard

10.2 Video Resources

  • Talks on memory safety and string handling

10.3 Tools & Documentation

  • AddressSanitizer, Valgrind

11. Self-Assessment Checklist

11.1 Understanding

  • I can state the invariants of my string type.
  • I can explain how bounds checks prevent overflows.
  • I can describe error semantics for each operation.

11.2 Implementation

  • All core operations pass unit tests.
  • Invariants hold after every operation.
  • Demo output matches expected values.

11.3 Growth

  • I can explain the trade-offs of truncation vs error.
  • I can extend the library with new operations.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Core string type and append/insert/replace implemented.
  • Safe bounds checking enforced.
  • Demo program runs with expected output.

Full Completion:

  • All minimum criteria plus:
  • Growth policy and invariant checks in debug builds.

Excellence (Going Above & Beyond):

  • Fuzz testing and constant-time comparison functions.