Project 4: CRI Runtime Observability with crictl and Events

Build an incident-grade workflow that correlates API, kubelet, and runtime data to diagnose pod lifecycle failures.

Quick Reference

Attribute | Value
Difficulty | Advanced
Time Estimate | 8-14 hours
Main Programming Language | Shell
Alternative Programming Languages | Go, Python
Coolness Level | Level 3 - Incident Ready
Business Potential | 2. Ops Differentiator
Prerequisites | Kubernetes pod lifecycle, node access, CRI basics
Key Topics | event timeline correlation, runtime diagnostics, failure taxonomy

1. Learning Objectives

  1. Correlate failures across control plane and node runtime.
  2. Build deterministic root-cause timelines.
  3. Separate scheduling, image, runtime, and app failure classes.
  4. Create a reusable pod-incident diagnostic playbook.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Pod Lifecycle and Failure Surfaces

Fundamentals

Pod status fields and events describe intent-to-runtime progression. Failures can happen before scheduling, before container start, or after process launch.

Deep Dive into the concept

A common anti-pattern is reading only workload logs and ignoring kubelet/runtime events. Lifecycle diagnosis should begin with a normalized timeline: object creation, scheduling decision, image resolution, runtime create/start, readiness progression. Distinguish ImagePullBackOff from CrashLoopBackOff: one is artifact/runtime bootstrap, the other is process lifecycle and restart policy. Event sequencing also reveals hidden retries and backoff behaviors.
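To make the ImagePullBackOff vs CrashLoopBackOff distinction mechanical, the event reason can be mapped to a failure stage. A minimal sketch, assuming a simplified taxonomy: the reason strings are real kubelet event reasons, but the function name and stage labels are illustrative, not a fixed API.

```shell
# Map a kubelet event reason to a coarse failure stage.
# Stage names ("scheduling", "image", ...) are an assumed taxonomy.
classify_reason() {
  case "$1" in
    FailedScheduling)              echo "scheduling" ;;
    ErrImagePull|ImagePullBackOff) echo "image" ;;           # artifact/runtime bootstrap
    FailedCreatePodSandBox)        echo "runtime" ;;
    BackOff|CrashLoopBackOff)      echo "app-restart" ;;     # process lifecycle + restart policy
    *)                             echo "unknown" ;;
  esac
}

classify_reason ImagePullBackOff   # -> image
classify_reason CrashLoopBackOff   # -> app-restart
```

A rule table like this is what makes classification deterministic across repeated runs.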

How this fits into the projects

  • Core for P04, supports P06 and P10 operations.

Definitions & key terms

  • Pending, Scheduled, Running, RestartCount, BackOff.

Mental model diagram

API intent -> scheduler -> kubelet -> runtime create/start -> readiness -> service endpoints

How it works

  1. Capture pod describe/events.
  2. Capture node runtime state.
  3. Align timestamps.
  4. Classify failure stage.

Invariants: every failure belongs to a stage. Failure modes: missing node logs, unsynchronized timestamps.
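Steps 2-3 (capture both sides, align timestamps) can be sketched as a merge of two evidence streams. The records below are fabricated sample data; in a real run they would come from kubectl events and crictl output. Because the timestamps are ISO-8601, a plain lexical sort orders them correctly.

```shell
# API-side evidence (sample data, not real cluster output).
cat > /tmp/api-events.txt <<'EOF'
2024-05-01T09:00:01Z api Scheduled
2024-05-01T09:00:02Z api Pulling
EOF

# Node/runtime-side evidence (also fabricated).
cat > /tmp/runtime-events.txt <<'EOF'
2024-05-01T09:00:03Z runtime PullImageFailed
2024-05-01T09:00:04Z runtime BackOff
EOF

# ISO-8601 timestamps sort correctly as plain strings,
# giving one unified, time-ordered timeline.
sort -k1,1 /tmp/api-events.txt /tmp/runtime-events.txt
```

The same merge works for any number of sources, which is why a shared timestamp format is the collector's key decision.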

Minimal concrete example

timeline:
09:00 Scheduled
09:00 Pulling image
09:01 Pull failed unauthorized
09:01 BackOff

Common misconceptions

  • “CrashLoopBackOff always means bad application code.” -> sometimes resource or config faults.

Check-your-understanding questions

  1. Why is timestamp alignment critical?
  2. What is the first signal of a pre-scheduling failure?

Check-your-understanding answers

  1. Misordered events cause false root cause claims.
  2. FailedScheduling events combined with a pod that never receives a node assignment.

2.2 CRI Boundary Observability

Fundamentals

CRI defines kubelet-runtime interactions. Node-level tooling exposes runtime state that API summaries can miss.

Deep Dive into the concept

Runtime incidents often appear as generic pod failures unless you inspect CRI-level state. Image availability, sandbox creation, container create/start errors, and low-level runtime messages are critical. The diagnostic workflow should pull both API-side and runtime-side signals and join them by pod UID/container ID. This dual view shrinks MTTR and prevents blind retries.
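Joining the two planes by pod UID can be sketched with the standard `join` tool. The UIDs and fields are fabricated; in practice the left side would come from the API server (e.g. kubectl output) and the right side from node-level runtime inspection.

```shell
# API-plane view: pod UID, name, phase (sample data, sorted by UID).
cat > /tmp/api.txt <<'EOF'
uid-111 web-7d8f9 Pending
uid-222 db-5c4a1 Running
EOF

# Runtime-plane view: pod UID, sandbox state, container state (sample data).
cat > /tmp/rt.txt <<'EOF'
uid-111 sandbox-created pull-failed
uid-222 sandbox-ready container-running
EOF

# join(1) requires sorted input; both files are sorted on field 1 (the UID),
# producing one line per pod with evidence from both planes.
join /tmp/api.txt /tmp/rt.txt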

How this fits into the projects

  • P04 core, reused in P08 and P10 incident playbooks.

Key insights

  • Control-plane and node-plane evidence must be combined for trustworthy diagnosis.

3. Project Specification

3.1 What You Will Build

A CLI workflow and checklist that:

  • ingests pod and node runtime evidence
  • produces a stage-classified failure timeline
  • suggests a first remediation action

3.2 Functional Requirements

  1. Collect pod metadata/events and node runtime state.
  2. Generate unified timeline.
  3. Classify failures into predefined categories.
  4. Output remediation hints tied to failure class.
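Requirement 4 amounts to a static class-to-hint mapping. A minimal sketch, where the class names and hint wording are assumptions matching the example transcript later in this spec, not a fixed interface:

```shell
# Map an assumed failure class to a first remediation hint.
remediation_for() {
  case "$1" in
    image-pull-auth) echo "verify imagePullSecret scope" ;;
    scheduling)      echo "check node selectors, taints, and capacity" ;;
    crash-loop)      echo "inspect container exit code and last logs" ;;
    *)               echo "collect more evidence before acting" ;;
  esac
}

remediation_for image-pull-auth   # -> verify imagePullSecret scope
```

Keeping hints in one table makes them easy to review and extend as the failure taxonomy grows.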

3.3 Non-Functional Requirements

  • Performance: diagnosis under 60 seconds for one pod.
  • Reliability: deterministic classification for repeated runs.
  • Usability: one command per incident target.

3.4 Example Usage / Output

$ ./cri-lab diagnose pod/web-7d8f9

3.5 Data Formats / Schemas / Protocols

  • event record schema (source, timestamp, phase)
  • diagnosis report schema (class, confidence, remediation)
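One possible concrete instance of each schema, emitted as JSON lines. The field names follow the bullets above (source, timestamp, phase / class, confidence, remediation); the values and the extra `reason` field are illustrative.

```shell
# Write one sample record per schema (illustrative values only).
cat > /tmp/schema-samples.jsonl <<'EOF'
{"source":"kubelet","timestamp":"2024-05-01T09:01:00Z","phase":"runtime-bootstrap","reason":"ErrImagePull"}
{"class":"image-pull-auth","confidence":"high","remediation":"verify imagePullSecret scope"}
EOF
cat /tmp/schema-samples.jsonl
```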

3.6 Edge Cases

  • node rotated before log retrieval
  • missing runtime permissions
  • multiple rapid restarts with overlapping events

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

$ ./cri-lab diagnose pod/web-7d8f9
$ ./cri-lab classify --last 20m

3.7.2 Golden Path Demo (Deterministic)

Inject one image auth failure and one crash loop scenario, then verify correct classification.

3.7.3 If CLI: exact terminal transcript

$ ./cri-lab diagnose pod/web-7d8f9
class: image-pull-auth
confidence: high
first_failure_stage: runtime-bootstrap
recommended_action: verify imagePullSecret scope

4. Solution Architecture

4.1 High-Level Design

evidence collector -> timeline normalizer -> classifier -> remediation reporter

4.2 Key Components

Component | Responsibility | Key Decisions
Collector | gather API + CRI data | consistent timestamp format
Normalizer | unify event model | stable phase taxonomy
Classifier | failure category mapping | deterministic rules first
Reporter | remediation guidance | concise, actionable output

4.4 Algorithm Overview

  1. Collect evidence.
  2. Sort by time and phase.
  3. Match known failure signatures.
  4. Produce report.
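The four steps above can be sketched end to end on canned evidence: collect (a heredoc stands in for real collectors), sort by time, match a known failure signature, and emit a report line. All data and signatures here are fabricated for illustration.

```shell
# Step 1: collect evidence (stub; a real collector would query the cluster).
evidence() {
  cat <<'EOF'
2024-05-01T09:01:10Z BackOff
2024-05-01T09:00:05Z Pulling
2024-05-01T09:01:00Z ErrImagePull
EOF
}

# Step 2: sort by time and phase (ISO-8601 sorts lexically).
timeline=$(evidence | sort -k1,1)

# Step 3-4: match a known failure signature and produce a report line.
if printf '%s\n' "$timeline" | grep -q 'ErrImagePull'; then
  echo "class: image-pull"
else
  echo "class: unknown"
fi
```

Each stage is a separate function or pipeline step, mirroring the collector -> normalizer -> classifier -> reporter design.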

5. Implementation Guide

5.3 The Core Question You’re Answering

“Where in the kubelet-runtime-control-plane chain did this pod fail, and what is the fastest corrective action?”

5.6 Milestones

  1. Evidence collection.
  2. Timeline generation.
  3. Rule-based classification.
  4. Failure simulation validation.

5.9 Definition of Done

  • Unified timeline generated for at least 3 failure classes.
  • Classification confidence and rationale displayed.
  • First remediation steps validated in lab scenarios.
  • Playbook exported for team usage.