Project 2: Safe String Library

Build a small C string library that makes unsafe operations impossible by default through explicit length and capacity tracking.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate 10-16 hours
Main Programming Language C (Alternatives: C++, Rust)
Alternative Programming Languages C++, Rust, Zig
Coolness Level High
Business Potential Medium (safe C libraries, embedded tooling)
Prerequisites C strings, pointer arithmetic, basic testing
Key Topics C string internals, length/capacity invariants, bounds checking, API design

1. Learning Objectives

By completing this project, you will:

  1. Implement a string type with explicit length and capacity.
  2. Enforce invariants that prevent overflow and unterminated strings.
  3. Design error-return conventions that keep C APIs safe.
  4. Build a small, testable library with deterministic behaviors.
  5. Understand why classic C string APIs are unsafe by design.

2. All Theory Needed (Per-Concept Breakdown)

2.1 C String Representation and Null Terminators

Fundamentals

A C string is an array of bytes terminated by a \0 byte. The string’s length is not stored; it is derived by scanning memory until the terminator is found. This is the core reason functions like strlen and strcpy can be unsafe: they assume the \0 exists inside the allocated buffer. If it does not, they read or write past the end. A safe string library must still produce valid C strings for interoperability, but it must also track length explicitly to avoid repeated scans and to ensure that the terminator is always present. Understanding how C represents strings at the byte level is the foundation for building a safe alternative.

Deep Dive into the Concept

C strings predate modern memory safety practices. The design favors simplicity: any pointer to char can be treated as a string, and a terminator marks the end. This flexibility is powerful but dangerous. The length of a C string is not inherent; it is a derived property obtained by scanning memory. This means that every call to strlen has O(n) time, and every call to strcpy or strcat trusts that the destination has enough space. When the destination is too small, the copy continues past the buffer boundary, overwriting adjacent memory and potentially corrupting control data. This is the root of many security vulnerabilities.

A safe string library must provide a stricter model. It should store length and capacity in a struct and guarantee that the underlying data is always null-terminated. This requires carefully defined invariants: 0 <= len < cap, data[len] == '\0', and data[0..len-1] contains the string contents. The terminator must be part of the allocated capacity. That is why appending n bytes requires len + n + 1 <= cap. Many beginners forget the +1 for the terminator; your library must make that mistake impossible.

Another subtlety is that C strings are byte arrays, not character arrays. UTF-8 characters can span multiple bytes. If you treat len as “characters,” you will be wrong. Your library should define length in bytes, and any higher-level Unicode handling should be explicit and separate. This distinction is critical in real systems where multibyte encodings are common. The library’s job is to keep the bytes safe and well-formed, not to interpret them.

Interoperability is also central. The whole point of a safe string library in C is to keep using existing APIs safely. That means you must guarantee null termination at all times and expose char* accessors that are safe. But you must also prevent raw writes that could violate invariants. This is why many C libraries use opaque structs and provide function-based mutation. If you allow direct access to the buffer, you lose control over invariants.

Finally, the terminator itself can be treated as part of the safety contract. For example, an empty string must still have data[0] = '\0'. When you shrink or truncate, you must set the terminator at the new length. When you append, you must write the new bytes and then re-terminate. These rules are easy to get wrong if they are not centralized. A good library design makes the correct thing the default and the incorrect thing difficult.

How this fits on projects

This concept drives the core invariants in Section 3.2 and the API design in Section 5.11. It also informs edge cases in Section 3.6.

Definitions & key terms

  • Null terminator: The \0 byte that ends a C string.
  • Length: Number of bytes before the terminator.
  • Capacity: Total allocated bytes in the buffer.
  • Invariant: Condition that must always be true for correctness.

Mental model diagram (ASCII)

[data buffer]
+----+----+----+----+----+
| H  | e  | l  | l  | o  |
+----+----+----+----+----+
                     ^
                   len=5
cap=6 includes terminator

How it works (step-by-step, with invariants and failure modes)

  1. Allocate buffer with capacity cap.
  2. Set len=0 and data[0]='\0'.
  3. On append, check len + add + 1 <= cap.
  4. Copy bytes, update len, set terminator.

Invariant: data[len] == '\0' and len < cap.

Failure modes: Missing terminator, overflow, or treating bytes as characters.

Minimal concrete example

typedef struct {
    char *data;
    size_t len;
    size_t cap;
} sstr;

// invariant: data[len] == '\0'

Common misconceptions

  • strlen is cheap.” -> It is O(n) and unsafe without terminator.
  • “If it fits exactly, it’s fine.” -> You still need space for \0.
  • “Length equals characters.” -> Length is bytes in UTF-8.

Check-your-understanding questions

  1. Why must len + add + 1 <= cap hold for append?
  2. What happens if a string is missing a terminator?
  3. Why is strlen unsafe on untrusted input?

Check-your-understanding answers

  1. You must reserve one byte for \0.
  2. Functions scan past the buffer and read garbage memory.
  3. It may scan beyond allocated memory and crash or leak data.

Real-world applications

  • Safe wrappers in embedded systems.
  • Security-critical C libraries.
  • Core string types in OS or libc extensions.

Where you’ll apply it

References

  • “Effective C” (safe string handling)
  • K&R Ch. 5

Key insights

A safe string is not just bytes; it is bytes plus an invariant.

Summary

You now understand why C strings are unsafe by default and what invariants make them safe.

Homework/Exercises to practice the concept

  1. Write a function that appends a byte and re-terminates.
  2. Find a C bug caused by a missing \0 and explain it.

Solutions to the homework/exercises

  1. Check capacity, write byte, set terminator.
  2. Missing terminator causes strcpy to overflow.

2.2 Length/Capacity Invariants and Bounds Checking

Fundamentals

Safe string operations require tracking both length and capacity. Length is how many bytes of valid content exist; capacity is how many bytes were allocated, including the terminator. The invariant len < cap ensures there is always room for \0. Bounds checking is the enforcement mechanism: before any write, you must confirm that the target range fits inside capacity. This is the difference between “trusting the caller” (unsafe C APIs) and “defensive by default” (safe APIs). A safe string library makes bounds checking systematic, not optional.

Deep Dive into the Concept

In C, arrays are just contiguous memory. There is no built-in metadata to track how large a buffer is. If you write past the end, you overwrite whatever comes next. Many C bugs are off-by-one: the code checks len + add <= cap but forgets the terminator, or it copies n bytes without checking. The correct invariant is len + add + 1 <= cap. This must be enforced on every mutation, not just on initialization.

A safe string library should encode invariants into its API. For example, sstr_append should return an error if the operation would overflow. It should not silently truncate unless you define a specific truncation function. Invariants should also be checked in debug builds with assert, and you should provide a validation function to verify internal consistency. This makes bugs visible early.

Bounds checking can be implemented in layered functions. A low-level helper like sstr_can_fit(s, add) returns a boolean, while high-level operations call it and return a structured error. This enables unit tests to check error paths. The error type can be an enum like SSTR_EOVERFLOW, SSTR_EINVAL, SSTR_ENOMEM. These error codes should be deterministic and documented so users can handle them reliably.

Capacity management is another design choice. You can choose fixed-capacity strings for embedded systems or growable strings for general use. If you choose growable strings, you must define a reallocation policy: doubling capacity, growth by a fixed increment, or exact-fit. Doubling is common because it yields amortized O(1) append. But it must preserve invariants and handle allocation failure. If realloc fails, you must not lose the original buffer. This is why realloc should be assigned to a temporary pointer. The library should guarantee that even on failure, the string remains valid.

The bounds model also shapes how you handle slicing and copying. A safe library should allow creating a view or substring without copying but must ensure the view respects boundaries. If you provide a view type, it should carry its own length. If you provide a copy, it must be explicit. Avoid “hidden” or implicit truncation because it makes bugs invisible.

Finally, bounds checking is a performance trade-off. It adds a few branches, but it prevents catastrophic failures. In practice, the cost is negligible compared to the cost of a crash or vulnerability. Many high-performance C libraries (e.g., OpenBSD’s strlcpy) are designed around safe bounds. Your project’s goal is to internalize that cost-benefit trade-off and encode it into an API that is safe by default.

How this fits on projects

Bounds checking defines Section 3.2 Functional Requirements and informs testing in Section 6.2. It also drives error handling design in Section 5.11.

Definitions & key terms

  • Bounds checking: Verifying writes stay within capacity.
  • Invariant: A property that must always hold.
  • Amortized growth: Resizing strategy that keeps average cost low.
  • Overflow: Write beyond allocated memory.

Mental model diagram (ASCII)

len=5, cap=8
+----+----+----+----+----+----+----+----+
|  h |  e |  l |  l |  o | \0 |  ? |  ? |
+----+----+----+----+----+----+----+----+
          ^ len points here

How it works (step-by-step, with invariants and failure modes)

  1. Compute required = len + add + 1.
  2. If required > cap, return overflow error.
  3. Otherwise copy bytes, update len, write \0.

Invariant: len < cap must hold after every operation.

Failure modes: Missing +1 check, truncation without notice, realloc mishandling.

Minimal concrete example

if (s->len + add + 1 > s->cap) {
    return SSTR_EOVERFLOW;
}

Common misconceptions

  • “Truncation is fine.” -> It hides bugs and loses data silently.
  • “Realloc always succeeds.” -> It can return NULL and must be checked.
  • “Capacity == length.” -> Capacity includes unused space and \0.

Check-your-understanding questions

  1. Why is len < cap required even for empty strings?
  2. What is the downside of silent truncation?
  3. Why must realloc be assigned to a temporary variable?

Check-your-understanding answers

  1. Because the terminator must fit inside capacity.
  2. It hides bugs and data loss.
  3. On failure, realloc returns NULL and the old pointer is lost.

Real-world applications

  • Safe buffer handling in cryptographic code.
  • Embedded firmware where crashes are costly.
  • Robust text processing in system utilities.

Where you’ll apply it

References

  • “Effective C” (Defensive programming)
  • “Secure Coding in C and C++” (Seacord)

Key insights

Bounds checking is not optional; it is the core feature.

Summary

You understand how to encode and enforce safe length/capacity invariants.

Homework/Exercises to practice the concept

  1. Write a function that truncates safely with an explicit return code.
  2. Design a growth strategy and compute new capacity for a sequence of appends.

Solutions to the homework/exercises

  1. Check bounds, copy partial bytes, return SSTR_ETRUNC.
  2. Double capacity when required > cap to keep amortized O(1).

2.3 C API Design and Error Handling

Fundamentals

C has no exceptions, so error handling is explicit. A safe string library must return error codes and avoid implicit failure. Common patterns include returning int or an enum and using an output parameter for results. The API should be consistent: all mutating operations return a status code and leave the string in a valid state even on failure. You must decide how callers will detect errors and how errors are represented. A good API makes misuse difficult and success easy to check.

Deep Dive into the Concept

Designing a C API is an exercise in safety and clarity. You must decide whether functions return an error code, a pointer, or a size. If you return a pointer, you need a way to distinguish success from failure. That often means returning NULL on failure, but NULL can be a valid value for “empty string” unless you define otherwise. This ambiguity is why many safe libraries use explicit error codes. An enum such as SSTR_OK, SSTR_EOVERFLOW, SSTR_EINVAL, SSTR_ENOMEM provides clarity, and the caller can switch on it.

Another design decision is ownership. Who allocates and frees the string buffer? A safe library should clearly define lifecycle functions: sstr_init, sstr_free, and possibly sstr_reserve. If you allow stack-allocated strings, you need to support initializing with a caller-provided buffer. This is useful in embedded contexts but complicates your invariants. If you allow both heap and stack backing, you must store metadata about ownership to prevent double free. A simpler design is to always allocate on the heap and require sstr_free to release resources.

A safe API should also be explicit about mutability. If you provide a const char* accessor, you should not allow callers to mutate the buffer unless they also update the length and terminator. This is why many libraries provide sstr_cstr() for read-only access and sstr_mut() only in controlled contexts. If you provide direct mutable access, you should also provide a validation function so that callers can revalidate after changes.

Error handling should be deterministic. If an operation fails, the string should remain unchanged. This is called the “strong guarantee.” It prevents partial updates that leave the string in an inconsistent state. Implementing this sometimes means making a copy before modification or carefully ordering writes. For example, when appending, you should only update len after the copy succeeds. When resizing, you should allocate the new buffer first, and only swap pointers after allocation succeeds. If allocation fails, return SSTR_ENOMEM and leave the original buffer intact.

Finally, a safe library must be testable. Deterministic behavior is crucial for unit tests. Define error codes explicitly, document them, and ensure they do not change across versions. Provide functions like sstr_error_string() to map codes to messages. This makes the library usable in real projects and makes debugging easier for learners.

How this fits on projects

This concept shapes Section 3.2 Functional Requirements, Section 5.11 Key Decisions, and Section 6 Testing. It also defines the “failure demo” in Section 3.7.

Definitions & key terms

  • Error code: Explicit value representing failure type.
  • Strong guarantee: On failure, state is unchanged.
  • Ownership: Responsibility for allocating and freeing memory.
  • ABI stability: Consistent function signatures across versions.

Mental model diagram (ASCII)

Caller -> sstr_append() -> [check] -> [copy] -> [update len]
         | failure? -> return error, no state change

How it works (step-by-step, with invariants and failure modes)

  1. Validate inputs (non-null, sane sizes).
  2. Check capacity.
  3. Perform copy.
  4. Update length and terminator.
  5. Return status.

Invariant: On error, the string remains unchanged.

Failure modes: Partial updates, ambiguous return values, unowned buffers.

Minimal concrete example

typedef enum { SSTR_OK, SSTR_EOVERFLOW, SSTR_EINVAL, SSTR_ENOMEM } sstr_err;

sstr_err sstr_append(sstr *s, const char *src, size_t n);

Common misconceptions

  • “Returning NULL is enough.” -> It loses the error reason.
  • “Error codes are noisy.” -> They are explicit and testable.
  • “Invariants can be fixed later.” -> They must hold after every call.

Check-your-understanding questions

  1. Why is the strong guarantee important?
  2. Why should error codes be stable?
  3. What is ambiguous about returning char* from a mutating function?

Check-your-understanding answers

  1. It prevents partial updates and corruption.
  2. Tests and callers depend on consistent behavior.
  3. NULL could mean error or empty string depending on API.

Real-world applications

  • C libraries that must be safe under strict auditing.
  • Embedded firmware APIs with no exceptions.
  • Security-critical string handling.

Where you’ll apply it

References

  • “C Interfaces and Implementations” (API design)
  • “Effective C” (defensive programming)

Key insights

Safe C APIs are explicit, deterministic, and invariant-driven.

Summary

You can now design and implement C APIs that fail safely and predictably.

Homework/Exercises to practice the concept

  1. Design an error enum for string operations.
  2. Write a function signature for sstr_insert with explicit error handling.

Solutions to the homework/exercises

  1. Include overflow, invalid input, and allocation failure cases.
  2. sstr_err sstr_insert(sstr *s, size_t pos, const char *src, size_t n);

3. Project Specification

3.1 What You Will Build

A small library libsstr that provides:

  • A sstr struct with length/capacity.
  • Safe initialization, append, insert, truncate, and clear.
  • Error-returning APIs with deterministic behavior.
  • A demo CLI showing success and overflow failures.

Included: fixed-capacity and growable modes, unit tests, documentation. Excluded: Unicode grapheme handling, locale-aware operations.

3.2 Functional Requirements

  1. Initialization: sstr_init allocates capacity and sets len=0.
  2. Append: sstr_append appends bytes with bounds checks.
  3. Insert: insert at position with bounds checks.
  4. Truncate: reduces length and updates terminator.
  5. Clear: resets to empty with terminator.
  6. Grow: optional sstr_reserve to expand capacity.
  7. Errors: all mutators return explicit error codes.

3.3 Non-Functional Requirements

  • Performance: O(n) per append without growth; amortized O(1) with growth.
  • Reliability: invariants always preserved; no buffer overflow.
  • Usability: clear error messages and demo program.

3.4 Example Usage / Output

$ ./sstr_demo
Input: "hello"
Append: " world"
Result: "hello world" (len=11 cap=32)

Trying to append 100 bytes...
Error: SSTR_EOVERFLOW (requested 100, remaining 20)

3.5 Data Formats / Schemas / Protocols

Core Struct:

typedef struct {
    char *data;
    size_t len;
    size_t cap;
} sstr;

3.6 Edge Cases

  • Append exactly cap-1 bytes to empty string.
  • Insert at position 0 or at len.
  • Truncate to 0.
  • sstr_reserve fails (out of memory).
  • Null input pointers.

3.7 Real World Outcome

The finished library should be linked into a demo app that proves overflow safety and error reporting.

3.7.1 How to Run (Copy/Paste)

make
./sstr_demo

3.7.2 Golden Path Demo (Deterministic)

$ ./sstr_demo
Input: "hello"
Append: " world"
Result: "hello world" (len=11 cap=32)

Exit code: 0

3.7.3 If CLI: Exact Terminal Transcript

$ ./sstr_demo --overflow
Input: "hello"
Append: " world"
Result: "hello world" (len=11 cap=32)

Trying to append 100 bytes...
Error: SSTR_EOVERFLOW (requested 100, remaining 20)
Exit code: 1

4. Solution Architecture

4.1 High-Level Design

Caller -> libsstr API -> invariant checks -> buffer mutation -> result

4.2 Key Components

Component Responsibility Key Decisions
sstr struct store data, len, cap fixed vs growable
API functions safe operations return error codes
Demo CLI show success/failure deterministic output

4.3 Data Structures (No Full Code)

typedef struct {
    char *data;
    size_t len;
    size_t cap;
    int owned; /* 1 if heap-allocated */
} sstr;

4.4 Algorithm Overview

Key Algorithm: Safe Append

  1. Validate inputs.
  2. Check capacity with len + add + 1 <= cap.
  3. Copy bytes and update len.
  4. Write terminator.

Complexity Analysis:

  • Time: O(n) for copy, amortized O(1) with growth.
  • Space: O(cap).

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install build-essential

5.2 Project Structure

libsstr/
|-- include/
|   \-- sstr.h
|-- src/
|   \-- sstr.c
|-- tests/
|   \-- test_sstr.c
|-- demo/
|   \-- sstr_demo.c
\-- Makefile

5.3 The Core Question You’re Answering

“How can I design a C string API that makes unsafe operations impossible by default?”

5.4 Concepts You Must Understand First

  1. C string representation and terminators (see Section 2.1).
  2. Length/capacity invariants (see Section 2.2).
  3. Error handling in C APIs (see Section 2.3).

5.5 Questions to Guide Your Design

  1. Will strings be fixed-capacity or growable?
  2. How will you represent errors?
  3. How do you guarantee invariants after every operation?

5.6 Thinking Exercise

Given cap=8, len=0, should appending 8 bytes succeed? Why not? Write the invariant that answers this question.

5.7 The Interview Questions They’ll Ask

  1. Why is strcpy unsafe?
  2. What invariant keeps a safe string correct?
  3. How would you design a safe C string API?

5.8 Hints in Layers

Hint 1: Start with sstr_init and sstr_free. Hint 2: Implement sstr_append with a strict bounds check. Hint 3: Add sstr_insert and sstr_truncate. Hint 4: Add tests for overflow and null input.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | Safe C APIs | Effective C | Ch. 4 | | Pointer semantics | K&R | Ch. 5 | | Secure coding | Seacord | C safety chapters |

5.10 Implementation Phases

Phase 1: Core Struct + Init (2-3 hours)

Goals: define struct and init/free. Checkpoint: valgrind shows no leaks.

Phase 2: Mutators + Invariants (4-6 hours)

Goals: append, insert, truncate, clear. Checkpoint: unit tests for overflow pass.

Phase 3: Growth + Demo (3-4 hours)

Goals: optional reserve/growth, demo app. Checkpoint: deterministic demo output.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Growth strategy | fixed vs dynamic | dynamic with reserve | usability | | Error model | errno vs enum | enum | deterministic tests | | Ownership | caller-managed vs library-managed | library-managed | simpler invariants |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———-|———|———-| | Unit Tests | Check invariants | append overflow, insert bounds | | Integration Tests | Demo correctness | demo output matches expected | | Edge Case Tests | Null/empty inputs | append empty, truncate to 0 |

6.2 Critical Test Cases

  1. Append exactly cap-1 bytes succeeds, cap bytes fails.
  2. Insert at len behaves like append.
  3. sstr_reserve failure keeps original buffer intact.

6.3 Test Data

Input: "hello" -> append " world" -> "hello world"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |———|———|———-| | Forgot terminator | printf prints garbage | always set data[len]='\0' | | Off-by-one checks | overflow on exact fit | use +1 rule | | realloc misuse | lost pointer on failure | assign to temp |

7.2 Debugging Strategies

  • Run under AddressSanitizer and ensure no out-of-bounds writes.
  • Add sstr_validate() and assert invariants.

7.3 Performance Traps

Repeated strlen in loops; use len instead.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add sstr_trim for whitespace.
  • Add sstr_equal and sstr_compare.

8.2 Intermediate Extensions

  • Add formatting helper (sstr_printf).
  • Add sstr_split returning views.

8.3 Advanced Extensions

  • Add UTF-8 length validation.
  • Add allocator hooks for custom memory pools.

9. Real-World Connections

9.1 Industry Applications

  • Hardened C libraries and embedded firmware.
  • Safe APIs for legacy codebases.
  • OpenBSD strlcpy/strlcat: safe bounded string APIs.
  • libbsd: portability layer with safe string functions.

9.3 Interview Relevance

  • Explaining why C strings are unsafe and how to design safe APIs.

10. Resources

10.1 Essential Reading

  • “Effective C” (safe string handling)
  • “Secure Coding in C and C++” (bounds checking)

10.2 Video Resources

  • Talks on safe C APIs (security conferences)

10.3 Tools & Documentation

  • man 3 strlen, man 3 strncpy

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why C strings are unsafe by default.
  • I can articulate the length/capacity invariant.
  • I can explain error-handling choices in C.

11.2 Implementation

  • All mutators enforce bounds checks.
  • Overflow attempts return error codes.
  • Tests pass with ASan and Valgrind.

11.3 Growth

  • I can explain how my API prevents classic strcpy bugs.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • sstr_init, sstr_append, sstr_free implemented safely.
  • Demo program shows success and overflow error.

Full Completion:

  • Insert, truncate, and reserve functions.
  • Unit tests for edge cases.

Excellence (Going Above & Beyond):

  • UTF-8 validation or custom allocator support.
  • Benchmarks comparing to strcpy and strlcpy.