Project 6: Container Runtime with systemd Integration

Build a minimal container runtime that uses systemd transient units, cgroup delegation, and namespaces for isolation and resource control.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 5: Expert |
| Time Estimate | 4-8 weeks |
| Main Programming Language | C or Rust |
| Alternative Programming Languages | Go |
| Coolness Level | Level 5: Wow Factor |
| Business Potential | Level 4: Infrastructure Platform |
| Prerequisites | Linux namespaces, cgroups, process supervision, systemd APIs |
| Key Topics | transient units, cgroup delegation, namespaces, journald |

1. Learning Objectives

By completing this project, you will:

  1. Create transient units with systemd-run or D-Bus APIs.
  2. Delegate cgroup control to child processes safely.
  3. Set up namespaces (PID, mount, UTS, net) for isolation.
  4. Capture container logs via journald.
  5. Build a CLI with run, stop, list, and inspect.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Transient Units and systemd-run / D-Bus APIs

Fundamentals

A transient unit is created at runtime through systemd’s D-Bus API instead of a file on disk. systemd-run is the CLI wrapper that calls the API for you. Transient units are ideal for containers because each container is a dynamic process that should be supervised but not permanently defined. A transient scope unit attaches to an existing process, while a transient service starts a new process under systemd. For containers, scopes are often preferred because the container’s init process should control its own children. Understanding transient units lets you integrate your runtime with systemd’s control plane for supervision, logging, and resource limits.

Deep Dive into the Concept

The systemd Manager interface exposes StartTransientUnit, which takes a unit name, a mode (e.g., “replace”), and a list of properties. When called, systemd creates a unit in memory and starts it immediately. If you use systemd-run --scope, systemd creates a transient scope unit and moves the existing process into its cgroup. If you use systemd-run --unit=NAME with an ExecStart, systemd launches a transient service. The key difference is ownership: a service is launched by systemd, while a scope groups an already-running process.

For containers, a scope is often a better match. A container runtime typically creates namespaces and then executes the container init process. The runtime wants systemd to supervise that process and apply cgroup limits but does not want systemd to own the exec. In that case, you create the process, then create a scope and attach it. If you use a service, systemd becomes the parent of the container init process; this can be useful if you want systemd to manage process lifecycle entirely.

Properties passed to StartTransientUnit include Description, Slice, Delegate, MemoryMax, CPUQuota, and TasksMax. Each property maps to cgroup settings or unit metadata. The name of the unit should be deterministic, such as mycontainer-web.scope. This makes it easy to query logs and state with systemctl and journalctl. It also simplifies cleanup: stopping the unit stops the container.

D-Bus calls are asynchronous and return job objects. You can watch the job to determine when the unit is active. If you ignore the job result, you might report success when the unit failed to start. A good runtime should wait for the job to finish or poll the unit state after creation.
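
Putting the call together, here is a minimal sd-bus sketch (libsystemd) that attaches an already-forked child PID to a new scope with a memory cap and delegation. The unit name and limits are illustrative, and error handling is trimmed to the essentials:

/* Sketch: attach an existing child PID to a transient scope via sd-bus.
   Unit name and limits are illustrative. */
#include <systemd/sd-bus.h>
#include <sys/types.h>
#include <stdint.h>

int start_scope(pid_t pid) {
    sd_bus *bus = NULL;
    sd_bus_message *m = NULL, *reply = NULL;
    sd_bus_error err = SD_BUS_ERROR_NULL;
    int r = sd_bus_open_system(&bus);
    if (r < 0)
        return r;
    r = sd_bus_message_new_method_call(bus, &m,
            "org.freedesktop.systemd1", "/org/freedesktop/systemd1",
            "org.freedesktop.systemd1.Manager", "StartTransientUnit");
    if (r >= 0) {
        sd_bus_message_append(m, "ss", "mycontainer-demo.scope", "fail");
        sd_bus_message_open_container(m, 'a', "(sv)");  /* properties */
        sd_bus_message_append(m, "(sv)", "PIDs", "au", 1, (uint32_t) pid);
        sd_bus_message_append(m, "(sv)", "MemoryMax", "t",
                              (uint64_t) 128 * 1024 * 1024);
        sd_bus_message_append(m, "(sv)", "Delegate", "b", 1);
        sd_bus_message_close_container(m);
        sd_bus_message_append(m, "a(sa(sv))", 0);       /* no aux units */
        /* reply carries the job path; watch for its JobRemoved signal
           (or poll the unit's ActiveState) before reporting success */
        r = sd_bus_call(bus, m, 0, &err, &reply);
    }
    sd_bus_error_free(&err);
    sd_bus_message_unref(m);
    sd_bus_message_unref(reply);
    sd_bus_unref(bus);
    return r;
}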

Transient units integrate with journald: anything written to the journal by a process in the unit’s cgroup is tagged with _SYSTEMD_UNIT. For a transient service, systemd connects stdout/stderr to the journal automatically. For a scope, the process already exists and keeps its stdio, so your runtime must route output to the journal itself, for example with systemd-cat or sd_journal_stream_fd(3). Combined with systemd-cgls or systemctl status, you can inspect resource usage and logs as part of your runtime’s inspect command.
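
The snippet below sketches the logs side of this in C, reading entries for one unit with the sd-journal API from libsystemd; the unit name is illustrative:

/* Sketch: read a unit's journal entries via sd-journal (libsystemd). */
#include <systemd/sd-journal.h>
#include <stdio.h>

int main(void) {
    sd_journal *j;
    if (sd_journal_open(&j, SD_JOURNAL_LOCAL_ONLY) < 0)
        return 1;
    sd_journal_add_match(j, "_SYSTEMD_UNIT=mycontainer-demo.scope", 0);
    SD_JOURNAL_FOREACH(j) {
        const void *data;
        size_t len;
        /* data arrives as "MESSAGE=<text>" */
        if (sd_journal_get_data(j, "MESSAGE", &data, &len) >= 0)
            printf("%.*s\n", (int) len, (const char *) data);
    }
    sd_journal_close(j);
    return 0;
}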

Finally, consider version differences. Not all systemd versions expose the same properties. Your runtime should feature-detect properties and degrade gracefully. If a property is missing, you should inform the user rather than fail silently.

How this fits into the project

Transient units are the systemd integration layer of this runtime. You will use them in Section 3.2, Section 3.7, and Section 5.10 Phase 1.

Definitions & key terms

  • Transient unit -> Unit created at runtime via D-Bus.
  • Scope unit -> Groups an existing process under systemd supervision.
  • Service unit -> Process launched by systemd.
  • StartTransientUnit -> D-Bus method to create a transient unit.

Mental model diagram (ASCII)

CLI -> D-Bus StartTransientUnit -> systemd
                          |
                          +--> mycontainer-web.scope -> cgroup

How it works (step-by-step)

  1. CLI prepares container process and metadata.
  2. CLI calls StartTransientUnit with properties.
  3. systemd creates unit and cgroup.
  4. Unit becomes active and logs are tagged.

Invariants: Unit name is stable; properties map to cgroup settings.
Failure modes: property not supported, job failure, or naming collisions.

Minimal concrete example

systemd-run --unit=demo --scope --property=MemoryMax=128M /bin/sleep 100

Common misconceptions

  • “Transient units are invisible” -> They show up in systemctl and journald.
  • “Only services can be transient” -> Scopes are often better for containers.

Check-your-understanding questions

  1. Why choose a scope over a service for containers?
  2. What does StartTransientUnit return?
  3. How do you apply resource limits on creation?

Check-your-understanding answers

  1. A scope groups an already-running process and lets it manage children.
  2. The object path of the job systemd queues to start the unit.
  3. Pass properties like MemoryMax and CPUQuota.

Real-world applications

  • systemd-nspawn and podman.
  • One-off sandboxed jobs.

References

  • systemd D-Bus API documentation (org.freedesktop.systemd1).
  • systemd-run manual.

Key insights

Transient units let systemd supervise dynamic workloads without permanent files.

Summary

Use StartTransientUnit or systemd-run to attach containers to systemd’s control plane.

Homework/exercises to practice the concept

  1. Create a transient scope and inspect it with systemctl.
  2. Add MemoryMax and verify with systemd-cgls.
  3. Query logs for the unit in journald.

Solutions to the homework/exercises

  1. systemd-run --scope /bin/sleep 60, then systemctl status 'run-*.scope' while it runs.
  2. systemd-run --scope --property=MemoryMax=64M /bin/sleep 60, then verify with systemd-cgls.
  3. journalctl -u 'run-*.scope'.

Concept 2: cgroups v2 and Delegation

Fundamentals

cgroups v2 provide resource control and accounting. systemd uses cgroups to group services and enforce limits like CPU and memory. For a container runtime, you must delegate cgroup control so that the container can create its own sub-cgroups. This is done with Delegate=yes. Without delegation, the container’s init process cannot create cgroups and many container tools fail. Understanding cgroups v2 and delegation is essential to enforce resource limits while still allowing the container to manage its internal process tree.

Deep Dive into the Concept

cgroups v2 unifies controllers under one hierarchy. Each cgroup has files such as memory.max, cpu.max, pids.max, and io.max. systemd creates a cgroup for each unit and writes these limits based on unit properties. For example, MemoryMax=512M maps to memory.max, and CPUQuota=50% maps to cpu.max. If you create a transient unit, systemd automatically creates the cgroup in /sys/fs/cgroup under the appropriate slice.

Delegation is the mechanism that allows a process to manage sub-cgroups. With Delegate=yes, systemd sets permissions so that the container’s init process can create child cgroups and move processes into them. This is necessary for container runtimes that rely on cgroup subtrees for resource control. Without delegation, the container init will get permission errors when trying to manage cgroups, and resource limits may not be enforceable.
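
As a sketch, once the subtree is delegated the container init can create and populate child cgroups like this; the scope path is illustrative, and a real init would discover its own path from /proc/self/cgroup:

/* Sketch: inside a delegated scope, create a sub-cgroup and move
   ourselves into it. */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    const char *sub =
        "/sys/fs/cgroup/system.slice/mycontainer-demo.scope/workers";
    char procs[512];
    if (mkdir(sub, 0755) < 0 && errno != EEXIST) {
        perror("mkdir");  /* EACCES here usually means no Delegate=yes */
        return 1;
    }
    snprintf(procs, sizeof(procs), "%s/cgroup.procs", sub);
    FILE *f = fopen(procs, "w");
    if (!f) { perror("cgroup.procs"); return 1; }
    fprintf(f, "0\n");    /* writing 0 moves the calling process */
    fclose(f);
    return 0;
}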

Delegation must be done carefully. The parent cgroup should enforce the overall limits, and the container should only be able to manage its subtree. This protects the host and ensures limits are not bypassed. Your runtime should set limits at the unit level and then delegate. This mirrors how systemd-nspawn and other runtimes integrate with systemd.

Resource accounting is also important. You can read memory.current, cpu.stat, and pids.current to report usage. Your runtime’s inspect command should fetch these values to show live resource usage. This is one of the major advantages of integrating with systemd rather than building cgroups manually.
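
A sketch of that read path for inspect, assuming cgroups v2 mounted at /sys/fs/cgroup (the scope path is illustrative):

/* Sketch: read live memory usage from a unit's cgroup for `inspect`. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen(
        "/sys/fs/cgroup/system.slice/mycontainer-demo.scope/memory.current",
        "r");
    unsigned long long bytes = 0;
    if (!f) { perror("memory.current"); return 1; }
    if (fscanf(f, "%llu", &bytes) == 1)
        printf("memory.current: %llu bytes\n", bytes);
    fclose(f);
    return 0;
}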

Finally, cgroup version matters. Some systems still use cgroups v1. Delegation behaves differently there, and some properties may not exist. Your runtime should detect the cgroup version and either enforce v2 or provide clear error messages on v1-only systems.

How this fits into the project

This concept enables resource limits and safe delegation. You will use it in Section 3.2, Section 4.2, and Section 5.10 Phase 2.

Definitions & key terms

  • cgroup v2 -> Unified resource control hierarchy.
  • Delegate -> Allow a process to manage its cgroup subtree.
  • MemoryMax -> systemd property for memory limit.
  • CPUQuota -> systemd property for CPU quota.

Mental model diagram (ASCII)

/system.slice/mycontainer.scope
  | memory.max=512M
  +-- delegated subtree (container processes)

How it works (step-by-step)

  1. systemd creates a cgroup for the transient unit.
  2. It writes resource limits to cgroup files.
  3. Delegate=yes grants control of the subtree.
  4. Container init creates sub-cgroups for processes.

Invariants: Unit-level limits always apply; delegated subtree cannot exceed them.
Failure modes: delegation disabled, incorrect cgroup version, or missing controllers.

Minimal concrete example

systemd-run --scope --property=MemoryMax=512M --property=Delegate=yes /bin/sleep 100

Common misconceptions

  • “Cgroup v1 and v2 are the same” -> They differ significantly.
  • “Delegation gives full control” -> It only grants subtree management.

Check-your-understanding questions

  1. Why is Delegate=yes needed for containers?
  2. How do you set CPU limits in systemd?
  3. Where do you read current memory usage?
  4. Why apply limits at the unit level?

Check-your-understanding answers

  1. Containers need to create sub-cgroups for their processes.
  2. Use CPUQuota or CPUWeight.
  3. Read memory.current in the unit cgroup.
  4. It ensures limits cannot be bypassed by sub-cgroups.

Real-world applications

  • Container runtimes (podman, systemd-nspawn).
  • Service-level resource isolation.

References

  • systemd.resource-control documentation.
  • Linux cgroup v2 kernel docs.

Key insights

Delegate cgroup control, but keep unit-level limits enforced by systemd.

Summary

Cgroup delegation is the key to safe container resource control.

Homework/exercises to practice the concept

  1. Set a memory limit and verify memory.current.
  2. Set CPUQuota and run a busy loop.
  3. Enable delegation and create a sub-cgroup.

Solutions to the homework/exercises

  1. systemd-run --scope --property=MemoryMax=64M /bin/sleep 10 and inspect cgroup files.
  2. Use CPUQuota=20% and observe CPU usage.
  3. Run the unit with Delegate=yes, then mkdir a child directory inside its cgroup and write a PID into the child's cgroup.procs.

Concept 3: Namespaces and Container Init

Fundamentals

Namespaces isolate system resources. A container typically uses PID, mount, UTS, IPC, and network namespaces. The container init process runs as PID 1 inside the namespace and must reap zombies and handle signals correctly. Without a proper init, processes may leak and signals may not propagate, causing stuck containers. This concept explains how to create namespaces and why PID 1 behavior matters.

Deep Dive into the Concept

The PID namespace isolates process IDs, so the first process inside becomes PID 1. PID 1 has special responsibilities: it reaps orphaned children and handles signals differently. Many programs ignore SIGTERM if they are PID 1. This is why container runtimes often use a minimal init like tini. In your runtime, you can implement a tiny init or require that the container command behaves properly.

The mount namespace isolates filesystem mount points. A runtime typically mounts a root filesystem and uses pivot_root() or chroot() to make it the container root. It must mount /proc inside the container so that tools like ps work. If /proc is missing, many programs will fail. The UTS namespace isolates hostname, allowing each container to have its own hostname. The network namespace isolates interfaces; you can connect it to the host with a veth pair or run in host network mode for simplicity.

Creating namespaces can be done with clone() or unshare(). clone() allows you to create a new process in the new namespaces directly; unshare() changes the current process. A common pattern is to clone() a child with the desired namespaces and then exec the container init. The parent can remain in the host namespace and manage the child.
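
A minimal sketch of that clone() pattern follows; the flags, hostname, and fixed 1 MiB stack are illustrative, and a real runtime would also set up a rootfs before exec. Run it as root:

/* Sketch: clone a child into new PID, mount, and UTS namespaces and
   exec a shell as its init. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int child_main(void *arg) {
    (void) arg;
    sethostname("demo", 4);
    /* keep our mounts from propagating back to the host, then give
       the new PID namespace its own /proc */
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
    mount("proc", "/proc", "proc", 0, NULL);
    execl("/bin/sh", "sh", (char *) NULL);
    _exit(127);                       /* exec failed */
}

int main(void) {
    pid_t pid = clone(child_main, stack + sizeof(stack),
                      CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | SIGCHLD,
                      NULL);
    if (pid < 0)
        return 1;
    waitpid(pid, NULL, 0);            /* parent stays in host namespaces */
    return 0;
}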

Signal handling is critical. When you send SIGTERM to a container, it should reach PID 1 in the container and then propagate to children. If PID 1 ignores SIGTERM, the container will not stop cleanly. Your runtime should implement a shutdown sequence: send SIGTERM, wait for a timeout, then SIGKILL. This mirrors systemd behavior and prevents zombies.

Namespace setup must be deterministic. Use fixed mounts and a defined rootfs layout. Define what is inside /dev, /proc, and /sys. For a minimal project, you can bind-mount /proc and provide minimal /dev nodes. Document what is included and excluded so users know the isolation boundaries.

How this fits into the project

Namespaces turn your process into a real container. You will use them in Section 3.2, Section 4.4, and Section 5.10 Phase 3.

Definitions & key terms

  • Namespace -> Kernel feature that isolates a system resource.
  • PID 1 -> Special process that must reap zombies.
  • pivot_root -> Switch the root filesystem.
  • veth -> Virtual Ethernet pair for network namespaces.

Mental model diagram (ASCII)

host PID 1234 (runtime)
  |
  +-- clone/unshare -> container PID 1
          |
          +-- child processes (isolated)

How it works (step-by-step)

  1. Create namespaces with clone() or unshare().
  2. Mount root filesystem and /proc.
  3. Set hostname and configure network.
  4. Exec container init process.
  5. Supervisor monitors PID 1 and stops unit on exit.

Invariants: PID 1 reaps children; rootfs is isolated.
Failure modes: missing /proc, PID 1 ignoring signals, or incomplete mount setup.

Minimal concrete example

sudo unshare --pid --fork --mount-proc /bin/sh

Common misconceptions

  • “Containers are VMs” -> They share the host kernel.
  • “PID 1 is just another process” -> It has special signal semantics.

Check-your-understanding questions

  1. Why does PID 1 need to reap zombies?
  2. What does the mount namespace isolate?
  3. How do you stop a container cleanly?
  4. Why mount /proc inside the container?

Check-your-understanding answers

  1. Orphans are reparented to PID 1; without reaping, zombies accumulate.
  2. It isolates filesystem mount points.
  3. Send SIGTERM, wait, then SIGKILL if needed.
  4. Many tools rely on /proc for process info.

Real-world applications

  • Docker, runc, podman, systemd-nspawn.

References

  • “The Linux Programming Interface” (namespaces chapters).
  • man 2 clone, man 2 unshare.

Key insights

Containers are just processes with carefully constructed namespaces and cgroups.

Summary

Namespaces provide isolation; PID 1 behavior makes containers reliable.

Homework/exercises to practice the concept

  1. Run unshare with PID and mount namespaces.
  2. Mount /proc inside the namespace.
  3. Observe PID 1 behavior with a short-lived child process.

Solutions to the homework/exercises

  1. sudo unshare --pid --fork --mount /bin/sh.
  2. Inside the shell: mount -t proc proc /proc, then ps shows the shell as PID 1.
  3. Start a background child that exits and check for zombies with ps.

Concept 4: PID 1 Semantics and Init Responsibilities Inside Containers

Fundamentals

Every container needs a process that plays the role of PID 1 inside its namespace. PID 1 has special signal semantics: signals for which it has not installed handlers are ignored rather than given their default action, and it is responsible for reaping orphaned child processes. If you run a single application as PID 1 without an init-like wrapper, you may end up with zombies, missed signals, or unclean shutdowns. This is why container runtimes often include a tiny init process or offer an --init option. In a systemd-integrated runtime, you must decide whether systemd or your runtime is responsible for PID 1 behavior inside the container. Understanding PID 1 semantics is essential for correct lifecycle management and predictable shutdown.

Deep Dive into the Concept

In Linux, PID 1 is special. It is the ancestor of all processes in a namespace and receives orphaned children when their parents exit. It also has distinct signal handling: signals with default action to terminate are ignored unless PID 1 installs handlers. This means that if your container’s main process is PID 1 and it does not handle SIGTERM, it might not terminate gracefully, causing containers to hang during shutdown. Similarly, if it spawns children and does not reap them, you accumulate zombie processes. In a long-running container, this leads to resource leaks and unstable behavior.

Container runtimes typically solve this with one of two approaches. The first is to run a small init process (like tini or dumb-init) as PID 1 and then exec the application as a child. The init process installs signal handlers, forwards signals, and reaps zombies. The second approach is to use a full init system inside the container (e.g., systemd), which provides more complex service management. In this project, you are building a minimal runtime, so the common strategy is to include a tiny init or to ensure that your container entrypoint is capable of PID 1 duties.
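
To make the tiny-init approach concrete, here is a minimal sketch that forwards termination signals to the application and reaps every child; hardened inits like tini and dumb-init handle many more edge cases:

/* Sketch of a tiny PID 1: forward SIGTERM/SIGINT to the app and reap
   every child. */
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t app_pid = -1;

static void forward(int sig) {
    if (app_pid > 0)
        kill(app_pid, sig);           /* PID 1 must relay signals itself */
}

int main(int argc, char **argv) {
    if (argc < 2)
        return 2;
    signal(SIGTERM, forward);
    signal(SIGINT, forward);
    app_pid = fork();
    if (app_pid == 0) {
        execvp(argv[1], argv + 1);    /* the real application */
        _exit(127);
    }
    for (;;) {                        /* reap everything reparented to us */
        int status;
        pid_t pid = wait(&status);
        if (pid == app_pid)           /* propagate the app's exit status */
            exit(WIFEXITED(status) ? WEXITSTATUS(status)
                                   : 128 + WTERMSIG(status));
        if (pid < 0 && errno != EINTR)
            exit(1);                  /* no children left */
    }
}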

When integrating with systemd on the host, you must also think about the boundary between host supervision and container supervision. The host’s systemd unit tracks the container process; if that process does not exit cleanly, systemd may kill it after a timeout. But if the container’s PID 1 ignores SIGTERM, the host may be forced to send SIGKILL, leading to abrupt shutdown. To avoid this, your runtime should forward SIGTERM to the container and give it time to shut down. If you support stop commands, you should implement a two-phase shutdown: send SIGTERM, wait, then SIGKILL if needed.
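
A sketch of that two-phase stop from the runtime side, assuming the container init is our direct child in the host namespace and using an illustrative 10-second grace period:

/* Sketch: two-phase stop. SIGTERM, bounded wait, then SIGKILL. */
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int stop_container(pid_t init_pid) {
    kill(init_pid, SIGTERM);
    for (int i = 0; i < 100; i++) {                  /* poll ~10 s */
        if (waitpid(init_pid, NULL, WNOHANG) == init_pid)
            return 0;                                /* clean shutdown */
        usleep(100 * 1000);
    }
    kill(init_pid, SIGKILL);                         /* escalate */
    waitpid(init_pid, NULL, 0);
    return 1;                                        /* forced */
}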

PID 1 semantics also affect logging and exit codes. If the container’s PID 1 exits, the container is considered stopped. Your runtime should capture the exit code and surface it to operators; without it, they cannot tell whether a container failed or stopped cleanly. For transient services, systemd is the parent and records the status itself in the unit’s ExecMainStatus and Result properties. With a scope, your runtime remains the parent of the container init, so it must collect the status with waitpid() and report it through its own inspect output.

Another subtlety is signal forwarding for interactive containers. If the host receives SIGINT (Ctrl-C), the runtime should pass it into the container, typically to PID 1. But if PID 1 ignores SIGINT, the application will not stop. A robust runtime can support an option to run the application as a child of a tiny init that forwards signals. This is why container tools often have an --init flag.

Finally, PID namespaces change the meaning of PID values. Inside the container, PID 1 is the init process; outside, that process has a different PID. If you implement ps-like inspection or exec features, you must translate between host PIDs and container PIDs. For this project, you can keep it minimal by using host PIDs in the CLI, but you should document the difference and ensure that signal delivery targets the correct process in the host namespace.

How this fits into the project

This concept is critical for lifecycle management and clean shutdown. You will apply it in Section 3.2 (Functional Requirements: stop behavior), Section 4.2 (Key Components: init wrapper), and Section 5.10 (Implementation Phases: lifecycle management). It also influences the failure demo in Section 3.7.3.

Definitions & key terms

  • PID 1 -> The first process in a PID namespace with special signal semantics.
  • Zombie -> A terminated process not yet reaped by its parent.
  • Init process -> A minimal process that reaps children and forwards signals.
  • Signal forwarding -> Passing host signals into the container process.
  • Exit status propagation -> Mapping container exit codes to host supervisor results.

Mental model diagram (ASCII)

Host systemd -> container runtime -> PID namespace
                         |
                         +--> PID 1 (init) -> app process

How it works (step-by-step)

  1. Runtime creates a PID namespace and starts PID 1.
  2. PID 1 spawns the application process.
  3. PID 1 installs handlers and reaps children.
  4. On stop, runtime sends SIGTERM to PID 1.
  5. PID 1 forwards SIGTERM to the app and waits.
  6. If timeout expires, runtime sends SIGKILL.

Invariants: PID 1 must reap zombies and handle termination signals.
Failure modes: Unreaped zombies, stuck shutdown, or ignored signals.

Minimal concrete example

# Run with a tiny init so PID 1 forwards signals
mycontainer run --init alpine sh -c "sleep 1000"
# Ctrl-C should terminate cleanly

Common misconceptions

  • “PID 1 behaves like any other process” -> It ignores default signal actions.
  • “Zombies are harmless” -> They accumulate and waste process table entries.
  • “Host systemd can always kill the container” -> It can, but you lose graceful shutdown.

Check-your-understanding questions

  1. Why does PID 1 ignore default signal handlers?
  2. What happens if PID 1 never reaps children?
  3. How do you implement a graceful stop sequence?
  4. Why is exit status propagation important?

Check-your-understanding answers

  1. The kernel treats PID 1 specially; default actions are ignored.
  2. Zombies accumulate and the process table fills.
  3. Send SIGTERM, wait, then SIGKILL if needed.
  4. It lets the host supervisor report failure accurately.

Real-world applications

  • Container runtimes that use init wrappers like tini.
  • Long-running microservices that must shut down gracefully.
  • Systems where resource leaks are critical (embedded appliances).

References

  • signal(7) and wait(2) man pages.
  • tini and dumb-init documentation.
  • “The Linux Programming Interface” process lifecycle chapters.

Key insights

Containers need an init-like PID 1 or they will leak zombies and mishandle signals.

Summary

Handling PID 1 semantics ensures your runtime can start, stop, and supervise containers reliably.

Homework/exercises to practice the concept

  1. Run a container without an init and observe zombie processes.
  2. Add a tiny init process and verify SIGTERM is handled.
  3. Record exit statuses and map them to your runtime’s CLI output.

Solutions to the homework/exercises

  1. Spawn a child process that exits; check ps for zombies.
  2. Use --init and verify clean shutdown on SIGTERM.
  3. Capture the exit code and print it in mycontainer inspect.

3. Project Specification

3.1 What You Will Build

A minimal container runtime called mycontainer that:

  • Creates transient units for container processes.
  • Sets up namespaces and basic filesystem isolation.
  • Applies resource limits with cgroups.
  • Captures logs in journald.

Included: run/stop/list/inspect, cgroup limits, namespaces, journald logs.
Excluded: full OCI spec, image pulling, registry authentication.

3.2 Functional Requirements

  1. CLI: run, stop, list, inspect.
  2. Transient Unit: create mycontainer-<name>.scope.
  3. Namespaces: PID + mount required; UTS and net optional.
  4. Resource Limits: memory and CPU quotas.
  5. Logging: stdout/stderr captured in journald.
  6. Inspect: show cgroup usage stats.

3.3 Non-Functional Requirements

  • Performance: container start < 500ms for simple commands.
  • Reliability: clean stop and cleanup of cgroups.
  • Security: no root filesystem escapes.

3.4 Example Usage / Output

$ mycontainer run --name web --memory 512M --cpu 50% alpine sh
/ # echo hello
hello

3.5 Data Formats / Schemas / Protocols

CLI args:

--name <string>
--memory <size, e.g. 512M>
--cpu <percent, e.g. 50%>
--rootfs <path>

3.6 Edge Cases

  • Invalid rootfs path.
  • cgroup delegation not available.
  • Container init exits immediately.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

sudo ./mycontainer run --name demo --rootfs ./rootfs --memory 128M --cpu 20% /bin/sh

3.7.2 Golden Path Demo (Deterministic)

  • Use fixed rootfs directory and fixed name demo.

3.7.3 If CLI: exact terminal transcript

$ sudo ./mycontainer run --name demo --rootfs ./rootfs --memory 128M --cpu 20% /bin/sh
/ # echo hello
hello

Failure demo:

$ sudo ./mycontainer run --name demo --rootfs /missing
ERROR: rootfs not found: /missing
exit code: 4

Exit codes:

  • 0 success
  • 2 usage error
  • 4 rootfs missing
  • 5 cgroup delegation failure
  • 6 namespace setup failure

4. Solution Architecture

4.1 High-Level Design

CLI -> create namespaces -> fork/exec init -> StartTransientUnit (scope)
                    |
                    +--> cgroups + journald

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| CLI | Parse args and orchestrate | flags vs config file |
| Namespace Setup | PID/mount/UTS/net isolation | minimal namespace set |
| systemd Integration | Start transient scope | systemd-run vs D-Bus |
| Cgroup Controller | Apply limits | MemoryMax/CPUQuota |

4.3 Data Structures (No Full Code)

struct Container {
    char     name[64];
    char     rootfs[256];
    uint64_t memory_max_bytes;  /* an int would overflow above 2 GiB */
    int      cpu_quota_pct;
    pid_t    init_pid;          /* container init, host PID namespace */
};

4.4 Algorithm Overview

Key Algorithm: Run Container

  1. Parse CLI args.
  2. clone() the child into new namespaces; it blocks on a pipe (see the handshake sketch below).
  3. In the child: set up the rootfs and mount /proc.
  4. In the parent: start a transient scope for the child’s PID.
  5. Release the child, which execs the container init, so limits apply from the start.
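
A sketch of the handshake referenced in step 2, built on a pipe created before clone(); the function names are illustrative:

/* Sketch: parent/child handshake. The child blocks until the parent
   has created the scope, then execs. pipe(sync_pipe) runs before clone(). */
#include <unistd.h>

int sync_pipe[2];

void child_wait_and_exec(char **argv) {   /* child, after ns setup */
    char b;
    close(sync_pipe[1]);
    read(sync_pipe[0], &b, 1);        /* block until parent writes */
    execvp(argv[0], argv);
    _exit(127);
}

void parent_release(void) {           /* parent, after StartTransientUnit */
    write(sync_pipe[1], "x", 1);      /* let the child exec */
    close(sync_pipe[1]);
}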

Complexity Analysis:

  • Time: O(1) setup per container
  • Space: O(1) per container metadata

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y libsystemd-dev

5.2 Project Structure

mycontainer/
├── src/
│   ├── main.c
│   ├── namespaces.c
│   ├── systemd.c
│   └── cgroups.c
└── Makefile

5.3 The Core Question You’re Answering

“How can systemd supervise and resource-limit containers dynamically?”

5.4 Concepts You Must Understand First

  1. Transient units and D-Bus APIs.
  2. cgroups v2 and delegation.
  3. Namespaces and PID 1 behavior.

5.5 Questions to Guide Your Design

  1. How do you map container names to systemd unit names?
  2. What should happen if the init process exits immediately?
  3. How do you ensure logs are captured in journald?

5.6 Thinking Exercise

Write a step-by-step flow for mycontainer run showing where each systemd property is applied.

5.7 The Interview Questions They’ll Ask

  1. “Why use systemd-run for containers?”
  2. “What does Delegate=yes do?”
  3. “Why is PID 1 special inside a container?”

5.8 Hints in Layers

Hint 1: Start with systemd-run --scope manually.
Hint 2: Add resource limits via properties.
Hint 3: Add namespaces with unshare.
Hint 4: Add journald log queries.

5.9 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Namespaces | “The Linux Programming Interface” | namespaces chapters |
| Containers | “Linux Containers and Virtualization” | fundamentals |
| Resource control | “Operating Systems: Three Easy Pieces” | scheduling/limits |

5.10 Implementation Phases

Phase 1: systemd Integration (1-2 weeks)

Goals: create transient scopes with limits.
Checkpoint: systemctl status mycontainer-*.scope works.

Phase 2: Namespaces (2-3 weeks)

Goals: PID + mount namespaces and rootfs setup.
Checkpoint: container shows isolated PID and filesystem.

Phase 3: Polish and Observability (1-2 weeks)

Goals: journald logs, inspect stats.
Checkpoint: mycontainer inspect shows resource usage.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| systemd integration | systemd-run vs D-Bus | D-Bus | programmatic control |
| Namespace set | minimal vs full | PID+mount+UTS | usable isolation |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit Tests | arg parsing | invalid memory values |
| Integration Tests | systemd scopes | unit exists and stops |
| Edge Case Tests | missing rootfs | exit code 4 |

6.2 Critical Test Cases

  1. Container exits -> scope stops.
  2. Memory limit enforced (OOM kills).
  3. Delegation failure returns exit 5.

6.3 Test Data

rootfs: ./rootfs
command: /bin/sh

7. Common Pitfalls and Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| No delegate | cannot create cgroups | set Delegate=yes |
| Missing /proc | ps fails inside container | mount proc |
| PID 1 not reaping | zombies accumulate | use a tiny init |

7.2 Debugging Strategies

  • systemd-cgls to inspect cgroups.
  • journalctl -u 'mycontainer-*.scope' for logs (quote the glob so the shell passes it through).

7.3 Performance Traps

Creating namespaces and mounts for every container can be slow; cache rootfs if possible.


8. Extensions and Challenges

8.1 Beginner Extensions

  • Add exec command to run in existing container.
  • Add logs command.

8.2 Intermediate Extensions

  • Add cgroup IO limits.
  • Add network namespaces with veth.

8.3 Advanced Extensions

  • Support OCI bundle format.
  • Add image unpacking and layered filesystem.

9. Real-World Connections

9.1 Industry Applications

  • Minimal container runtimes for embedded devices.
  • Integration with systemd-based orchestration.
  • systemd-nspawn, podman, runc.

9.2 Interview Relevance

  • Discuss cgroups, namespaces, and systemd integration confidently.

10. Resources

10.1 Essential Reading

  • systemd D-Bus API docs.
  • Linux namespaces documentation.

10.2 Video Resources

  • Container internals talks (LXC, systemd-nspawn).

10.3 Tools and Documentation

  • systemd-run, systemd-cgls, journalctl.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain transient units and scopes.
  • I can explain cgroup delegation.
  • I can explain PID namespaces.

11.2 Implementation

  • Containers start and stop cleanly.
  • Resource limits are enforced.
  • Logs are captured in journald.

11.3 Growth

  • I can explain how my runtime differs from Docker.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • run and stop work for a simple container.
  • Transient scope created with limits.

Full Completion:

  • Namespaces and rootfs isolation implemented.
  • Logs and inspect command work.

Excellence (Going Above and Beyond):

  • OCI compatibility and network namespaces.