Nagios Monitoring Mastery: From Installation to Enterprise-Scale Monitoring

Goal: Deeply understand infrastructure monitoring by mastering Nagios Core from the ground up. You will learn how Nagios checks work at the protocol level, how the scheduling engine orchestrates thousands of checks, how notifications flow through escalation chains, and how to extend Nagios with custom plugins. By completing these projects, you will understand why monitoring systems are architected the way they are, how to debug check failures, configure complex notification routing, implement auto-remediation with event handlers, and scale monitoring to enterprise environments. You will internalize the mental model that transforms raw check results into actionable operational intelligence.


Why Nagios Matters

Nagios is the grandfather of open-source infrastructure monitoring. First released in 1999 as “NetSaint,” it pioneered concepts that every modern monitoring tool still uses: host/service checks, state machines, notification escalations, and plugin architecture. Understanding Nagios deeply means understanding the foundations of monitoring itself.

Historical Context

Before Nagios, monitoring was either:

  • Expensive commercial tools (HP OpenView, Tivoli) that cost tens of thousands of dollars
  • Custom scripts that checked things ad-hoc with no centralized view
  • Nothing at all - operators discovered problems when users complained

Ethan Galstad created Nagios to solve a real problem: he needed to monitor a network of Linux servers and couldn’t afford commercial solutions. The result was a monitoring philosophy that remains relevant 25 years later:

  1. Plugins are external - Any script that returns 0/1/2/3 is a valid check
  2. Configuration is declarative - Define what you want to monitor, not how to monitor it
  3. State is explicit - OK, WARNING, CRITICAL, UNKNOWN with clear transitions
  4. Notifications are programmable - Send alerts however you want

Industry Adoption

While newer tools (Prometheus, Datadog, Zabbix) have emerged, Nagios remains foundational:

  • 30,000+ organizations actively use Nagios
  • 10,000+ plugins available in the Nagios Exchange
  • Industry standard for traditional infrastructure monitoring
  • Base for derivatives - Icinga and Naemon forked from Nagios; Shinken began as a Nagios-compatible rewrite in Python

Real-World Impact

Before Monitoring                    With Nagios
┌─────────────────────┐             ┌─────────────────────┐
│ User: "Site is down"│             │ Alert: Web server   │
│ Admin: "Let me SSH" │             │ response time > 5s  │
│ (30 minutes later)  │             │ (detected in 60s)   │
│ Admin: "Fixed it"   │             │ Event handler:      │
│ User: "Finally!"    │             │ restart_httpd.sh    │
└─────────────────────┘             │ (auto-fixed in 90s) │
                                    └─────────────────────┘
MTTR: 30+ minutes                   MTTR: 90 seconds

The Nagios Mental Model

Understanding Nagios requires internalizing its core architecture:

                    ┌─────────────────────────────────────────────┐
                    │              NAGIOS CORE                     │
                    │  ┌─────────────────────────────────────┐    │
                    │  │         SCHEDULING ENGINE           │    │
                    │  │  (when to run which checks)         │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │         CHECK EXECUTION             │    │
                    │  │  (run plugins, collect results)     │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │         STATE ENGINE                │    │
                    │  │  (OK→WARNING→CRITICAL transitions)  │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │      NOTIFICATION ENGINE            │    │
                    │  │  (who to notify, when, how)         │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │         EVENT HANDLERS              │    │
                    │  │  (auto-remediation scripts)         │    │
                    │  └─────────────────────────────────────┘    │
                    └─────────────────────────────────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            │                         │                         │
            ▼                         ▼                         ▼
    ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
    │   PLUGIN      │         │   NRPE        │         │   NSCA        │
    │ check_http    │         │ (Remote       │         │ (Passive      │
    │ check_ping    │         │  checks)      │         │  checks)      │
    │ check_disk    │         │               │         │               │
    └───────────────┘         └───────────────┘         └───────────────┘
            │                         │                         │
            ▼                         ▼                         ▼
    ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
    │ Local Host    │         │ Remote Host   │         │ Remote Host   │
    │ (Nagios runs  │         │ (NRPE daemon  │         │ (sends data   │
    │  checks here) │         │  runs checks) │         │  proactively) │
    └───────────────┘         └───────────────┘         └───────────────┘

Why Nagios Before Modern Tools?

You might wonder: “Why learn Nagios when Prometheus exists?”

The answer is foundational understanding:

| Concept | Nagios Teaches | Modern Equivalent |
|---|---|---|
| Check execution | Fork/exec model, exit codes | Prometheus scraping |
| State machines | Hard/soft states, flapping | Alertmanager grouping |
| Plugins | Standard interface (0/1/2/3) | Exporters |
| Notifications | Contact groups, escalations | PagerDuty integrations |
| Configuration | Object inheritance | Labels and selectors |
| Passive checks | Push model | Pushgateway |
| Remote execution | NRPE | Node exporter |

Learning Nagios deeply teaches you why these patterns exist, not just how to configure them in modern tools.


Prerequisites & Background Knowledge

Essential Prerequisites

Before starting, you should have:

  1. Linux command line proficiency
    • Navigate filesystems, edit files with vim/nano
    • Understand file permissions and ownership
    • Use systemd (systemctl start/stop/status)
    • Read log files with tail, grep, less
  2. Basic networking knowledge
    • TCP/IP fundamentals (ports, connections)
    • DNS resolution basics
    • HTTP request/response cycle
    • ICMP (ping) and how it works
  3. Shell scripting basics
    • Write simple bash scripts
    • Use variables, conditionals, loops
    • Understand exit codes
    • Parse command output

Helpful but Not Required

  • Previous experience with any monitoring tool
  • Web server configuration (Apache/Nginx)
  • SNMP protocol knowledge
  • Database administration basics

Self-Assessment Questions

Before starting, verify you can answer:

  1. “What happens when you run ping google.com?”
  2. “How do you check if a service is running on Linux?”
  3. “What does exit code 0 mean in a shell script?”
  4. “How do you configure a firewall rule in iptables or firewalld?”
  5. “What is the difference between TCP and UDP?”

If you struggle with these, spend time on Linux fundamentals first.

Development Environment Setup

Option 1: VirtualBox/Vagrant (Recommended)

┌─────────────────────────────────────────────────┐
│                   Your Laptop                    │
│  ┌────────────────┐    ┌────────────────┐       │
│  │ VM: nagios-srv │    │ VM: client-01  │       │
│  │ 192.168.56.10  │    │ 192.168.56.11  │       │
│  │ Nagios Core    │    │ NRPE Agent     │       │
│  │ Apache         │    │ Test services  │       │
│  └────────────────┘    └────────────────┘       │
│           │                    │                 │
│           └────────────────────┘                 │
│            Host-only network                     │
└─────────────────────────────────────────────────┘

Option 2: Cloud VMs (AWS/GCP/DigitalOcean)

  • 2 small VMs (t2.micro equivalent)
  • Security groups allowing ports 80, 443, 5666
  • SSH access configured

Option 3: Docker Containers

  • Good for quick testing
  • Less realistic for learning production patterns
  • Use docker-compose for multi-container setup

Time Investment Expectations

| Learning Path | Time | Projects |
|---|---|---|
| Quick overview | 2 weeks | 1-5 |
| Solid foundation | 4 weeks | 1-10 |
| Comprehensive mastery | 8 weeks | 1-18 |
| Enterprise expert | 12 weeks | All projects |

Reality Check

This guide does not provide:

  • Copy-paste configurations (you must understand each directive)
  • GUI-based shortcuts (we focus on configuration files)
  • Nagios XI (commercial version) - we use Nagios Core (open source)

You will:

  • Read man pages and documentation
  • Debug configuration errors
  • Write shell scripts
  • Break things and fix them

Core Concept Analysis

1. The Plugin Model

Nagios plugins are the heart of monitoring. They are simple executables that:

┌─────────────────────────────────────────────────────────┐
│                    PLUGIN CONTRACT                       │
├─────────────────────────────────────────────────────────┤
│  INPUT:                                                  │
│    - Command-line arguments (thresholds, targets)        │
│    - Environment variables (optional)                    │
│                                                          │
│  OUTPUT:                                                 │
│    - STDOUT: Human-readable message + performance data   │
│    - EXIT CODE: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN  │
│                                                          │
│  EXAMPLE:                                                │
│    $ ./check_disk -w 20% -c 10% -p /                    │
│    DISK OK - free space: / 45678 MB (72%)|/=17890MB     │
│    $ echo $?                                             │
│    0                                                     │
└─────────────────────────────────────────────────────────┘

Why this matters: Any language, any script - if it follows this contract, Nagios can use it. The contract is the entire extension interface, so a check can be written in whatever language suits the job.
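To make the contract concrete, here is a minimal sketch of a plugin written as a shell function. The name check_value and the "VALUE" output prefix are hypothetical and chosen only for illustration; the exit codes and the "message|perfdata" layout follow the contract above.

```shell
#!/usr/bin/env bash
# Hypothetical plugin sketch: check_value is an illustrative name,
# not one of the stock Nagios plugins.

# Exit codes defined by the plugin contract.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# check_value VALUE WARN CRIT
# Prints one status line with performance data and returns the exit code.
check_value() {
    local value="$1" warn="$2" crit="$3" arg

    # Anything that is not a non-negative integer is a usage error: UNKNOWN.
    for arg in "$value" "$warn" "$crit"; do
        case $arg in
            ''|*[!0-9]*)
                echo "VALUE UNKNOWN - usage: check_value VALUE WARN CRIT"
                return "$STATE_UNKNOWN"
                ;;
        esac
    done

    local perfdata="value=${value};${warn};${crit}"
    if [ "$value" -ge "$crit" ]; then
        echo "VALUE CRITICAL - ${value} >= ${crit}|${perfdata}"
        return "$STATE_CRITICAL"
    elif [ "$value" -ge "$warn" ]; then
        echo "VALUE WARNING - ${value} >= ${warn}|${perfdata}"
        return "$STATE_WARNING"
    fi
    echo "VALUE OK - ${value}|${perfdata}"
    return "$STATE_OK"
}

# As a standalone plugin file, the script would end with:
#   check_value "$@"; exit $?
```

Checking the arguments first matters: a plugin that cannot evaluate its input should report UNKNOWN rather than guess at OK or CRITICAL.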

2. Object Configuration Model

Nagios uses an object-oriented configuration model:

┌─────────────────────────────────────────────────────────┐
│                OBJECT HIERARCHY                          │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  timeperiod ──────┐                                     │
│                   │                                     │
│  command ─────────┼──► service ──► servicegroup         │
│                   │       │                              │
│  contact ─────────┼───────┼──► contactgroup             │
│                   │       │                              │
│  host ────────────┴───────┘                              │
│    │                                                     │
│    └──► hostgroup                                        │
│                                                          │
│  Templates can be inherited at any level                 │
│                                                          │
└─────────────────────────────────────────────────────────┘

Configuration Inheritance Example:

┌─────────────────────────────────────────────────────────┐
│                                                          │
│  define host {                                           │
│      name                    linux-server-template       │
│      check_period            24x7                        │
│      notification_period     24x7                        │
│      register                0    ; template only        │
│  }                                                       │
│                                                          │
│  define host {                                           │
│      use                     linux-server-template       │
│      host_name               web-server-01               │
│      address                 192.168.1.10                │
│  }                                                       │
│                                                          │
└─────────────────────────────────────────────────────────┘

3. State Machine

Nagios implements a sophisticated state machine for each host and service:

                        ┌─────────────────────────────────┐
                        │         STATE MACHINE           │
                        └─────────────────────────────────┘

    ┌───────────────┐                     ┌───────────────┐
    │   SOFT STATE  │  max_check_attempts │  HARD STATE   │
    │ (transitional)│ ─────────────────► │  (confirmed)  │
    └───────────────┘                     └───────────────┘

    Example: max_check_attempts = 3

    Check 1: CRITICAL  ──► SOFT CRITICAL (1/3)
    Check 2: CRITICAL  ──► SOFT CRITICAL (2/3)
    Check 3: CRITICAL  ──► HARD CRITICAL (3/3) ──► NOTIFY!

    Why soft states exist:
    - Prevent alert storms from transient issues
    - Allow recovery before notification
    - Reduce false positives

    ┌─────────────────────────────────────────────────────┐
    │              STATE TRANSITIONS                       │
    │                                                      │
    │   OK ◄─────────────────────────────────► WARNING    │
    │   ▲                                         ▲        │
    │   │                                         │        │
    │   │                                         │        │
    │   ▼                                         ▼        │
    │ UNKNOWN ◄─────────────────────────────► CRITICAL    │
    │                                                      │
    └─────────────────────────────────────────────────────┘
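The soft-to-hard promotion shown above can be sketched in a few lines of shell. This is a deliberately simplified model, not Nagios code: simulate_states is a hypothetical helper, every OK result is reported as HARD (soft recoveries are not modeled), and flapping is ignored.

```shell
#!/usr/bin/env bash
# Simplified sketch of soft/hard state tracking for a single service.
# simulate_states is a hypothetical helper, not part of Nagios.

# simulate_states MAX_CHECK_ATTEMPTS EXIT_CODE...
# Prints one line per check result: "<STATE> <SOFT|HARD> <attempt>/<max>".
simulate_states() {
    local max="$1" attempt=0 code name state_type
    shift
    for code in "$@"; do
        case $code in
            0) name=OK ;;
            1) name=WARNING ;;
            2) name=CRITICAL ;;
            *) name=UNKNOWN ;;
        esac
        if [ "$code" -eq 0 ]; then
            # Simplification: an OK result resets the attempt counter.
            attempt=0
            echo "OK HARD 1/${max}"
            continue
        fi
        # Count consecutive problem results, capped at max_check_attempts.
        if [ "$attempt" -lt "$max" ]; then
            attempt=$((attempt + 1))
        fi
        if [ "$attempt" -lt "$max" ]; then
            state_type=SOFT
        else
            state_type=HARD   # this is the point where notifications fire
        fi
        echo "${name} ${state_type} ${attempt}/${max}"
    done
}

# Three CRITICAL results with max_check_attempts = 3, then a recovery.
simulate_states 3 2 2 2 0
```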

4. Check Scheduling

Nagios must efficiently schedule checks across thousands of hosts/services:

┌─────────────────────────────────────────────────────────┐
│               CHECK SCHEDULING                           │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  check_interval = 5    (minutes between checks)          │
│  retry_interval = 1    (minutes between retries)         │
│                                                          │
│  Timeline (service goes CRITICAL at T=0):                │
│                                                          │
│  T=0   Check: CRITICAL  ──► SOFT(1/3), retry in 1 min   │
│  T=1   Check: CRITICAL  ──► SOFT(2/3), retry in 1 min   │
│  T=2   Check: CRITICAL  ──► HARD(3/3), NOTIFY!          │
│  T=7   Check: CRITICAL  ──► still HARD                  │
│  T=12  Check: OK        ──► RECOVERY, NOTIFY!           │
│  T=17  Check: OK        ──► normal interval resumes     │
│                                                          │
│  Scheduling optimization:                                │
│  - Interleaving: spread checks across time              │
│  - Parallelization: run multiple checks simultaneously   │
│  - Freshness: detect stale passive check results         │
│                                                          │
└─────────────────────────────────────────────────────────┘
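The timeline above follows one rule: after a SOFT problem result the next check is scheduled retry_interval minutes later, otherwise check_interval minutes later. A small sketch of that rule, assuming the state type of each result is already known (schedule_times is a hypothetical helper, and real scheduling also involves interleaving and latency):

```shell
#!/usr/bin/env bash
# Sketch of interval selection only; schedule_times is a hypothetical helper.

# schedule_times CHECK_INTERVAL RETRY_INTERVAL STATE_TYPE...
# STATE_TYPE is SOFT, HARD, or OK. Prints the minute at which each check runs.
schedule_times() {
    local check_interval="$1" retry_interval="$2" t=0 state_type
    shift 2
    for state_type in "$@"; do
        echo "T=${t} ${state_type}"
        if [ "$state_type" = SOFT ]; then
            # Unconfirmed problem: re-check quickly.
            t=$((t + retry_interval))
        else
            # Confirmed (HARD) or healthy (OK): normal interval.
            t=$((t + check_interval))
        fi
    done
}

# Reproduces the timeline in the box: checks at T=0, 1, 2, 7, 12, 17.
schedule_times 5 1 SOFT SOFT HARD HARD OK OK
```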

5. Notification Flow

Notifications follow a complex decision tree:

┌─────────────────────────────────────────────────────────┐
│               NOTIFICATION FLOW                          │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Check result: HARD CRITICAL                             │
│        │                                                 │
│        ▼                                                 │
│  ┌─────────────────┐                                    │
│  │ Notifications   │ NO                                 │
│  │ enabled?        ├────────► No notification           │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ Within time     │ NO                                 │
│  │ period?         ├────────► No notification           │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ Contact wants   │ NO                                 │
│  │ this state?     ├────────► No notification           │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ Escalation      │                                    │
│  │ applies?        │                                    │
│  └────────┬────────┘                                    │
│           │                                              │
│           ▼                                              │
│     SEND NOTIFICATION                                    │
│                                                          │
└─────────────────────────────────────────────────────────┘
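The first three gates of the decision tree can be sketched as a shell function. This is an illustration under stated assumptions, not the Nagios implementation: should_notify is a hypothetical name, the escalation step is left out, and the option letters (w, u, c, r) mirror the service notification options used in contact definitions.

```shell
#!/usr/bin/env bash
# Sketch of the notification filters; should_notify is a hypothetical helper.

# should_notify ENABLED IN_PERIOD CONTACT_OPTIONS STATE
#   ENABLED, IN_PERIOD : 1 or 0
#   CONTACT_OPTIONS    : comma-separated letters, e.g. "w,c,r"
#   STATE              : WARNING | CRITICAL | UNKNOWN | RECOVERY
# Prints the decision and returns 0 when a notification would be sent.
should_notify() {
    local enabled="$1" in_period="$2" options="$3" state="$4" letter

    if [ "$enabled" != 1 ]; then
        echo "skip: notifications disabled"
        return 1
    fi
    if [ "$in_period" != 1 ]; then
        echo "skip: outside notification period"
        return 1
    fi

    case $state in
        WARNING)  letter=w ;;
        CRITICAL) letter=c ;;
        UNKNOWN)  letter=u ;;
        RECOVERY) letter=r ;;
        *) echo "skip: unrecognized state ${state}"; return 1 ;;
    esac

    # Wrap the list in commas so that each letter matches as a whole item.
    case ",${options}," in
        *",${letter},"*)
            echo "send: ${state}"
            return 0
            ;;
    esac
    echo "skip: contact does not want ${state}"
    return 1
}
```

Because every gate returns early with a reason, the same function doubles as a debugging aid when a notification unexpectedly fails to arrive.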

6. Active vs Passive Checks

┌─────────────────────────────────────────────────────────┐
│             ACTIVE CHECKS (Pull Model)                   │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Nagios ────► Run plugin ────► Get result               │
│                                                          │
│  - Nagios controls timing                                │
│  - Nagios initiates connection                           │
│  - Good for: scheduled monitoring                        │
│                                                          │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│             PASSIVE CHECKS (Push Model)                  │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  External ────► Send result ────► Nagios accepts         │
│                                                          │
│  - External system controls timing                       │
│  - External system initiates connection                  │
│  - Good for: security events, async jobs,               │
│    firewalled hosts                                     │
│                                                          │
└─────────────────────────────────────────────────────────┘
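On the Nagios server itself, a passive result is submitted by writing a PROCESS_SERVICE_CHECK_RESULT line to the external command file. The sketch below only formats that line; format_passive_result is a hypothetical helper, the host and service names are examples, and the command file path shown in the comment is the default for a source install and may differ on your system.

```shell
#!/usr/bin/env bash
# Sketch: build an external command line for a passive service check result.
# format_passive_result is a hypothetical helper name.

# format_passive_result TIMESTAMP HOST SERVICE RETURN_CODE OUTPUT
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example usage (assumed default path, adjust to your installation):
#   format_passive_result "$(date +%s)" web-server-01 "Nightly Backup" 0 "Backup OK" \
#       > /usr/local/nagios/var/rw/nagios.cmd
format_passive_result 1700000000 web-server-01 "Nightly Backup" 0 "Backup OK"
```

The host and service must already be defined in the configuration with passive checks accepted; otherwise Nagios discards the result.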

Concept Summary Table

| Concept | What You Must Internalize |
|---|---|
| Plugin Model | Any executable returning 0/1/2/3 with STDOUT message is a valid check |
| Object Inheritance | Templates reduce duplication; use directive inherits all attributes |
| Soft/Hard States | Soft states prevent false alerts; hard states trigger notifications |
| Check Scheduling | check_interval for normal, retry_interval during problems |
| Notification Logic | Many conditions must be true before a notification sends |
| Active vs Passive | Active = Nagios pulls, Passive = external pushes via NSCA |
| NRPE | Runs local checks on remote hosts securely |
| Escalations | Route notifications based on problem duration |
| Event Handlers | Auto-remediation scripts triggered by state changes |
| Flapping | Detection of rapid state oscillation |

Deep Dive Reading by Concept

Nagios Architecture

| Concept | Resource |
|---|---|
| Overall architecture | Nagios Core Documentation - Chapter 5: Basic Concepts |
| Plugin development | Nagios Plugin Development Guidelines - nagios-plugins.org |
| Configuration structure | Nagios Core Documentation - Chapter 3: Configuration Overview |

Monitoring Fundamentals

| Concept | Book & Chapter |
|---|---|
| Monitoring philosophy | “The Practice of System and Network Administration” by Limoncelli - Ch. 22 |
| Alert design | “Site Reliability Engineering” by Beyer et al. - Ch. 6: Monitoring |
| On-call practices | “On-Call” by Jones - Chapters 1-3 |

Network Monitoring

| Concept | Book & Chapter |
|---|---|
| SNMP protocol | “Essential SNMP” by Mauro & Schmidt - Ch. 1-4 |
| Network monitoring | “Network Warrior” by Donahue - Ch. 24 |
| TCP/IP fundamentals | “TCP/IP Illustrated, Vol. 1” by Stevens - Ch. 1-5 |

Scripting and Automation

| Concept | Book & Chapter |
|---|---|
| Bash scripting | “Linux Command Line and Shell Scripting Bible” by Blum - Ch. 11-15 |
| Perl plugins | “Learning Perl” by Schwartz - Ch. 1-8 |
| Python scripting | “Automate the Boring Stuff” by Sweigart - Ch. 1-6 |

Essential Reading Order

  1. Week 1: Nagios Core Documentation Chapters 1-5
  2. Week 2: Plugin Development Guidelines + Essential SNMP Ch. 1-2
  3. Week 3: SRE Book Chapter 6 + Limoncelli Ch. 22
  4. Week 4: Shell scripting reference as needed

Quick Start Guide

If you feel overwhelmed, here’s your first 48 hours:

Day 1 (4 hours):

  1. Set up VirtualBox VM with CentOS or Ubuntu
  2. Complete Project 1 (Install from source)
  3. Access the web interface
  4. Understand the directory structure

Day 2 (4 hours):

  1. Complete Project 2 (Configuration structure)
  2. Add your first custom host
  3. Complete Project 3 (Local service monitoring)
  4. See your first alert

After this, you have a working Nagios system and understand the basics. Continue with projects 4-10 for a comprehensive foundation, then 11-18 for advanced topics.


Path 1: System Administrator (4 weeks)

Focus on practical monitoring of Linux/Windows infrastructure.

  • Projects: 1, 2, 3, 4, 5, 6, 8, 14
  • Outcome: Can monitor typical IT infrastructure with notifications

Path 2: DevOps Engineer (6 weeks)

Focus on automation, custom checks, and integration.

  • Projects: 1, 2, 3, 5, 6, 9, 10, 15, 18, 19
  • Outcome: Can integrate Nagios into CI/CD and create custom monitoring

Path 3: Monitoring Specialist (8 weeks)

Comprehensive coverage of all Nagios capabilities.

  • Projects: All, in order
  • Outcome: Enterprise-scale Nagios deployment and management


Project List


Project 1: Installing Nagios Core from Source

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Bash/Shell
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Linux System Administration
  • Software or Tool: GCC, Make, Apache HTTPD
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: A fully functional Nagios Core installation compiled from source code, running with Apache web server and accessible via browser.

Why it teaches the concept: Compiling from source forces you to understand dependencies, directory structure, and how Nagios components interact. Package managers hide these details.

Core challenges you’ll face:

  • Installing build dependencies - Understanding what libraries Nagios needs (GD, OpenSSL, etc.)
  • Configuring the build - Using ./configure with appropriate paths
  • Setting up Apache integration - CGI scripts, authentication, permissions
  • Creating the nagios user/group - Understanding security isolation

Key Concepts:

  • Source compilation workflow
  • CGI web applications
  • Apache authentication (htpasswd)
  • Systemd service management

  • Difficulty: Intermediate
  • Time estimate: 4-6 hours
  • Prerequisites: Linux command line, package management, text editing

Real World Outcome

After completing this project, you will have:

$ systemctl status nagios
● nagios.service - Nagios Core Monitoring System
     Active: active (running) since Mon 2024-01-15 10:23:45 UTC
     Main PID: 12345 (nagios)

$ curl -u nagiosadmin:password http://localhost/nagios/
# Returns Nagios web interface HTML

$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Nagios Core 4.4.x
Copyright (c) 2009-present Nagios Core Development Team
...
Total Warnings: 0
Total Errors:   0

The Core Question You’re Answering

“What are the minimal components needed for a monitoring system, and how do they connect?”

Concepts You Must Understand First

  1. What is a daemon process?
    • How does it differ from a regular process?
    • Why does Nagios run as a daemon?
  2. How does Apache serve dynamic content?
    • What is CGI?
    • Why does Nagios use CGI instead of a modern framework?
  3. What are file permissions and why do they matter?
    • Who should own the Nagios files?
    • What permissions are required for CGI execution?

Questions to Guide Your Design

Before installing:

  1. Where should Nagios binaries be installed? Why?
  2. What user account should Nagios run as? Why not root?
  3. How will you back up your configuration?
  4. How will you verify the installation succeeded?

Thinking Exercise

Before running make install, draw the directory structure you expect:

  • Where will the main daemon binary be?
  • Where will configuration files live?
  • Where will plugins be stored?
  • Where will log files go?

Verify your drawing against the actual installation.

The Interview Questions They’ll Ask

  1. “Why would you compile from source instead of using packages?”
  2. “How do you verify a Nagios configuration before applying it?”
  3. “What happens if the nagios user doesn’t have permission to execute plugins?”
  4. “How would you upgrade Nagios without losing configuration?”
  5. “What are the security implications of running CGI scripts?”

Hints in Layers

Hint 1 - Starting Point: Begin with a minimal Linux installation. CentOS/RHEL or Ubuntu LTS are recommended.

Hint 2 - Dependencies: You need (RHEL/CentOS package names; Debian/Ubuntu names differ): gcc, glibc, glibc-common, gd, gd-devel, openssl, openssl-devel, make, perl, wget

Hint 3 - Build Process:

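# General flow (not copy-paste - understand each step)
# Assumes the nagios user and the nagcmd group already exist
./configure --with-command-group=nagcmd   # detect dependencies, set the command group
make all                                  # compile the daemon and CGIs
make install                              # install binaries, CGIs, and HTML files
make install-init                         # install the init/service script
make install-commandmode                  # set permissions on the external command directory
make install-config                       # install sample configuration files
make install-webconf                      # install the Apache configuration for the web UI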

Hint 4 - Verification:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
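The verification step is easy to script. The sketch below reads the output of nagios -v on standard input and succeeds only when the summary reports zero warnings and zero errors; config_is_clean is a hypothetical helper, and it relies on the "Total Warnings:" and "Total Errors:" summary lines shown in the Real World Outcome above.

```shell
#!/usr/bin/env bash
# Sketch: gate a restart on a clean pre-flight check.
# config_is_clean is a hypothetical helper name.

# Reads `nagios -v` output on stdin; returns 0 only when both counts are 0.
config_is_clean() {
    local line warnings="" errors=""
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            "Total Warnings:"*)
                warnings=$(printf '%s' "${line#*:}" | tr -d '[:space:]') ;;
            "Total Errors:"*)
                errors=$(printf '%s' "${line#*:}" | tr -d '[:space:]') ;;
        esac
    done
    # Missing summary lines leave the variables empty, which also fails.
    [ "$warnings" = 0 ] && [ "$errors" = 0 ]
}

# Example usage (paths as used elsewhere in this guide):
#   /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | config_is_clean \
#       && echo "configuration is clean"
```

Treating warnings as failures is stricter than Nagios requires, since the daemon starts with warnings present. It is a design choice that keeps small problems from piling up.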

Books That Will Help

| Topic | Resource |
|---|---|
| Installation process | Nagios Core Beginner’s Guide - Chapter 2 |
| Apache configuration | Apache Cookbook - O’Reilly |
| Linux system admin | Linux Bible - Christopher Negus |

Common Pitfalls & Debugging

| Problem | Root Cause | Fix |
|---|---|---|
| “Permission denied” on CGI | Apache user not in nagcmd group | usermod -aG nagcmd apache (the Apache user is www-data on Debian/Ubuntu) |
| “Cannot open config file” | Wrong ownership | chown -R nagios:nagios /usr/local/nagios |
| Web interface shows blank | CGI not enabled | Enable mod_cgi in Apache |
| Daemon won’t start | Config error | Run nagios -v to validate |

Learning Milestones

  1. Basic: Nagios daemon starts and stays running
  2. Intermediate: Web interface accessible with authentication
  3. Advanced: Can modify nagios.cfg and reload without restart

Project 2: Understanding the Configuration File Structure

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration Language
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Configuration Management
  • Software or Tool: Text editor, nagios -v
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: A well-organized configuration directory with separate files for hosts, services, commands, contacts, and templates.

Why it teaches the concept: Understanding the configuration structure is essential for maintainable monitoring. Poor organization leads to configuration drift and errors.

Core challenges you’ll face:

  • Understanding cfg_file vs cfg_dir - How Nagios finds configuration
  • Object relationships - How hosts reference commands and templates
  • Template inheritance - Creating reusable configuration blocks
  • Configuration validation - Catching errors before restart

Key Concepts:

  • Object definition syntax
  • Directive inheritance
  • Configuration file inclusion
  • Macro expansion

  • Difficulty: Intermediate
  • Time estimate: 3-4 hours
  • Prerequisites: Project 1 completed

Real World Outcome

Your configuration directory will look like:

/usr/local/nagios/etc/
├── nagios.cfg              # Main configuration
├── cgi.cfg                 # Web interface settings
├── resource.cfg            # Sensitive variables ($USER1$)
├── objects/
│   ├── commands.cfg        # Check command definitions
│   ├── contacts.cfg        # Contact definitions
│   ├── timeperiods.cfg     # When to check/notify
│   ├── templates.cfg       # Reusable object templates
│   ├── hosts/
│   │   ├── web-servers.cfg
│   │   └── db-servers.cfg
│   └── services/
│       ├── linux-services.cfg
│       └── web-services.cfg
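One way to wire this layout into nagios.cfg is to list the single files with cfg_file and the per-category directories with cfg_dir. The sketch below creates the two subdirectories under a root you pass in and prints the matching directives; make_layout is a hypothetical helper, and the file names are taken from the tree above.

```shell
#!/usr/bin/env bash
# Sketch only: make_layout is a hypothetical helper, not part of Nagios.

# make_layout ROOT
# Creates ROOT/objects/hosts and ROOT/objects/services, then prints the
# nagios.cfg directives that would load this layout.
make_layout() {
    local root="$1" name
    mkdir -p "$root/objects/hosts" "$root/objects/services" || return 1

    # Single files are listed one by one with cfg_file.
    for name in commands contacts timeperiods templates; do
        printf 'cfg_file=%s/objects/%s.cfg\n' "$root" "$name"
    done
    # cfg_dir loads every .cfg file found in the directory.
    for name in hosts services; do
        printf 'cfg_dir=%s/objects/%s\n' "$root" "$name"
    done
}

# Example: try it in a scratch directory before touching /usr/local/nagios/etc.
#   make_layout "$(mktemp -d)"
```

Pointing cfg_dir at objects/ itself while also keeping the cfg_file lines would load the same files twice, so the sketch keeps the two mechanisms on separate paths.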

The Core Question You’re Answering

“How do I organize monitoring configuration so it scales to hundreds of hosts without becoming unmaintainable?”

Concepts You Must Understand First

  1. What is object inheritance in Nagios?
    • How does use work?
    • What is the difference between name and host_name?
  2. What are Nagios macros?
    • What is $HOSTADDRESS$?
    • Where is $USER1$ defined?
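Macro substitution is plain text replacement performed just before a command runs. The following sketch imitates it for two macros; replace_all and expand_macros are hypothetical helpers, and the command line and values are examples only (in a real installation $USER1$ comes from resource.cfg and $HOSTADDRESS$ from the host's address directive).

```shell
#!/usr/bin/env bash
# Sketch of macro expansion; both function names are hypothetical.

# replace_all TEXT NEEDLE REPLACEMENT
# Replaces every literal occurrence of NEEDLE in TEXT.
replace_all() {
    local text="$1" needle="$2" repl="$3" out=""
    while :; do
        case $text in
            *"$needle"*)
                out="${out}${text%%"$needle"*}${repl}"   # part before the match
                text="${text#*"$needle"}"                # continue after it
                ;;
            *) break ;;
        esac
    done
    printf '%s\n' "${out}${text}"
}

# expand_macros COMMAND_LINE USER1_VALUE HOSTADDRESS_VALUE
expand_macros() {
    local line="$1"
    line=$(replace_all "$line" '$USER1$' "$2")
    line=$(replace_all "$line" '$HOSTADDRESS$' "$3")
    printf '%s\n' "$line"
}

# Single quotes keep the shell from touching the $...$ macros.
expand_macros '$USER1$/check_ping -H $HOSTADDRESS$ -p 5' \
    /usr/local/nagios/libexec 192.168.1.10
```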

Questions to Guide Your Design

  1. How many hosts will you monitor eventually?
  2. Will hosts be grouped by location, function, or owner?
  3. Who maintains which parts of the configuration?
  4. How will you version control the configuration?

The Interview Questions They’ll Ask

  1. “How does Nagios macro substitution work?”
  2. “What’s the difference between a host template and a host definition?”
  3. “How would you organize configuration for 1000 hosts?”
  4. “How do you handle secrets in Nagios configuration?”
  5. “What happens if two configuration files define the same object?”

Learning Milestones

  1. Basic: Understand nagios.cfg and cfg_dir directive
  2. Intermediate: Create templates and use inheritance
  3. Advanced: Organize configuration for team collaboration

Project 3: Monitoring Local Services

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Service Monitoring
  • Software or Tool: Nagios Plugins
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Monitoring for essential local services: disk space, CPU load, memory usage, running processes, and swap usage on the Nagios server itself.

Why it teaches the concept: Local monitoring is the foundation. These same concepts apply to remote hosts via NRPE.

Core challenges you’ll face:

  • Understanding check commands - How plugins receive arguments
  • Setting thresholds - Choosing appropriate WARNING and CRITICAL values
  • Interpreting plugin output - Understanding performance data

Key Concepts:

  • Standard Nagios plugins (check_disk, check_load, check_procs)
  • Threshold syntax (-w and -c arguments)
  • Performance data format
  • Service check intervals

  • Difficulty: Beginner
  • Time estimate: 2-3 hours
  • Prerequisites: Projects 1-2 completed

Real World Outcome

# From the web interface or command line:
$ /usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
DISK OK - free space: / 45678 MB (75% inode=89%);| /=15234MB;51200;57600;0;64000

$ /usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6
OK - load average: 0.15, 0.10, 0.08|load1=0.150;5.000;10.000;0; load5=0.100;4.000;8.000;0; load15=0.080;3.000;6.000;0;

Web interface shows all services green (OK state).
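Everything after the pipe character in the plugin output is performance data, with each metric written as label=value;warn;crit;min;max. A rough parser makes the structure visible. parse_perfdata is a hypothetical helper, and it does not handle quoted labels that contain spaces.

```shell
#!/usr/bin/env bash
# Sketch of a performance data parser; parse_perfdata is a hypothetical helper.

# parse_perfdata PLUGIN_OUTPUT
# Prints one line per metric with its fields named.
parse_perfdata() {
    local output="$1" perf item label fields value warn crit min max
    case $output in
        *'|'*) perf=${output#*'|'} ;;   # keep only the part after the pipe
        *) return 0 ;;                  # no performance data present
    esac
    # Metrics are separated by whitespace, so plain word splitting suffices.
    for item in $perf; do
        label=${item%%=*}
        fields=${item#*=}
        IFS=';' read -r value warn crit min max <<EOF
$fields
EOF
        echo "${label} value=${value} warn=${warn} crit=${crit} min=${min} max=${max}"
    done
}

parse_perfdata 'DISK OK - free space: / 45678 MB (75% inode=89%);| /=15234MB;51200;57600;0;64000'
```

For the check_disk line above, the warn and crit fields (51200 and 57600) are the 20% and 10% free-space thresholds converted to used megabytes on a 64000 MB volume.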

The Core Question You’re Answering

“How do I translate ‘disk is almost full’ into a monitoring check that alerts at the right time?”

The Interview Questions They’ll Ask

  1. “How do you determine appropriate thresholds for disk space?”
  2. “What’s the difference between load average values 1, 5, and 15 minutes?”
  3. “How would you monitor a specific process?”
  4. “What does the performance data after the pipe (|) mean?”

Learning Milestones

  1. Basic: One service check working with correct thresholds
  2. Intermediate: All local services monitored with sensible thresholds
  3. Advanced: Performance data graphed over time

Project 4: Setting Up NRPE for Remote Monitoring

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Remote Monitoring, Networking
  • Software or Tool: NRPE (Nagios Remote Plugin Executor)
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: NRPE daemon on a remote host that executes local checks and returns results to the Nagios server.

Why it teaches the concept: NRPE is the standard way to collect metrics that can only be measured locally on a remote host (disk, CPU, processes).

Core challenges you’ll face:

  • NRPE installation on remote hosts - Compiling or using packages
  • Firewall configuration - Port 5666 must be accessible
  • NRPE allowed_hosts - Security configuration
  • Command definitions - Matching server and client
┌─────────────────────────────────────────────────────────┐
│                    NRPE ARCHITECTURE                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Nagios Server                    Remote Host            │
│  ┌──────────────┐                ┌──────────────┐       │
│  │              │                │              │       │
│  │ check_nrpe   │ ──TCP/5666──► │  nrpe daemon │       │
│  │ -H host      │                │              │       │
│  │ -c check_cmd │                │  ┌────────┐  │       │
│  │              │ ◄──result──── │  │ plugin │  │       │
│  │              │                │  └────────┘  │       │
│  └──────────────┘                └──────────────┘       │
│                                                          │
│  The command "check_load" on the server maps to          │
│  a command definition on the remote host that            │
│  runs /usr/local/nagios/libexec/check_load locally.     │
│                                                          │
└─────────────────────────────────────────────────────────┘
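
The name-to-command mapping in the diagram lives in `nrpe.cfg` on the remote host as `command[name]=command line` entries. The following Python sketch only models that lookup; the sample config text, the server address `192.168.56.10`, and the function name are assumptions for illustration.

```python
import re

# Matches lines such as:
#   command[check_load]=/path/to/check_load -w 5,4,3 -c 10,8,6
COMMAND_LINE = re.compile(r"^command\[(?P<name>[^\]]+)\]\s*=\s*(?P<cmdline>.+)$")


def parse_nrpe_commands(config_text):
    """Return {command_name: command_line} for each command[...] entry."""
    commands = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = COMMAND_LINE.match(line)
        if match:
            commands[match.group("name")] = match.group("cmdline").strip()
    return commands


# Hypothetical excerpt from nrpe.cfg on the remote host
SAMPLE_NRPE_CFG = """
# Only the Nagios server may connect (address is an assumption)
allowed_hosts=127.0.0.1,192.168.56.10
command[check_load]=/usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
"""

commands = parse_nrpe_commands(SAMPLE_NRPE_CFG)
print(commands["check_load"])
```

When the server runs `check_nrpe -H host -c check_load`, the daemon looks up `check_load` in this table and runs the associated command line. A name that is missing from the table cannot be executed, which is why server and client definitions have to match.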

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-3, second VM or host

Real World Outcome

# From Nagios server:
$ /usr/local/nagios/libexec/check_nrpe -H 192.168.56.11 -c check_load
OK - load average: 0.08, 0.12, 0.15|load1=0.080;5.000;10.000;0;

# Remote host is being monitored for local metrics

The Core Question You’re Answering

“How do I run local checks on remote hosts securely and efficiently?”

The Interview Questions They’ll Ask

  1. “Why use NRPE instead of SSH?”
  2. “How does NRPE handle authentication?”
  3. “What are the security implications of dont_blame_nrpe?”
  4. “How would you troubleshoot ‘CHECK_NRPE: Error - Could not complete SSL handshake’?”
  5. “Can NRPE pass arguments to commands? Should it?”

Learning Milestones

  1. Basic: NRPE daemon running on remote host
  2. Intermediate: Check commands working from server to client
  3. Advanced: Secure configuration with SSL and restricted hosts

Project 5: Creating Custom Check Scripts

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Bash, Python, or Perl
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Scripting, Plugin Development
  • Software or Tool: Any scripting language
  • Main Book: Nagios Plugin Development Guidelines

What you’ll build: Custom monitoring plugins for application-specific checks not covered by standard plugins.

Why it teaches the concept: The plugin model is Nagios’s greatest strength. Understanding it lets you monitor anything.

Core challenges you’ll face:

  • Plugin specification compliance - Exit codes, output format, performance data
  • Threshold handling - Parsing -w and -c arguments
  • Error handling - UNKNOWN state for errors
  • Timeout handling - Plugin shouldn’t hang
┌─────────────────────────────────────────────────────────┐
│                 PLUGIN SPECIFICATION                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Exit Codes:                                             │
│    0 = OK                                                │
│    1 = WARNING                                           │
│    2 = CRITICAL                                          │
│    3 = UNKNOWN                                           │
│                                                          │
│  Output Format:                                          │
│    STATUS - Message text|performance_data                │
│                                                          │
│  Example:                                                │
│    OK - Queue length is 5|queue_length=5;10;20;0;100    │
│                                                          │
│  Performance Data Format:                                │
│    label=value[UOM];warn;crit;min;max                   │
│                                                          │
│  UOM (Unit of Measure):                                  │
│    (none) = count, s = seconds, % = percentage          │
│    B/KB/MB/GB/TB = bytes, c = counter                   │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Intermediate Time estimate: 4-6 hours Prerequisites: Scripting skills, Projects 1-3

Real World Outcome

A custom plugin that checks something specific to your environment:

$ ./check_app_queue.sh -w 10 -c 20
OK - Queue length is 5|queue_length=5;10;20;0;100
$ echo $?
0

$ ./check_app_queue.sh -w 3 -c 10
WARNING - Queue length is 5|queue_length=5;3;10;0;100
$ echo $?
1
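
A minimal Python sketch of the logic behind such a plugin, kept separate from how the queue length is actually measured. The function name and the "alert at or above the threshold" comparison are assumptions; the standard threshold range syntax has its own inclusive/exclusive rules.

```python
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def check_queue(queue_length, warn, crit, maximum=100):
    """Return (exit_code, output_line) for a queue-length check."""
    if queue_length is None:
        # No measurement could be taken: report UNKNOWN, not CRITICAL.
        return 3, "UNKNOWN - Could not read queue length"
    if queue_length >= crit:
        code = 2
    elif queue_length >= warn:
        code = 1
    else:
        code = 0
    # Performance data: label=value;warn;crit;min;max
    perfdata = f"queue_length={queue_length};{warn};{crit};0;{maximum}"
    return code, f"{STATES[code]} - Queue length is {queue_length}|{perfdata}"


code, line = check_queue(5, warn=3, crit=10)
print(line)
print(code)
```

A real plugin would print the line and pass the code to `sys.exit()`, since Nagios reads the state from the process exit status rather than from the text.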

The Core Question You’re Answering

“How do I extend Nagios to monitor something it doesn’t monitor out of the box?”

The Interview Questions They’ll Ask

  1. “What makes a Nagios plugin valid?”
  2. “How do you handle timeouts in custom plugins?”
  3. “What should happen if a plugin can’t connect to what it’s checking?”
  4. “How do you test a plugin before deploying it?”
  5. “What is the performance data format and why does it matter?”

Learning Milestones

  1. Basic: Plugin returns correct exit codes
  2. Intermediate: Plugin parses threshold arguments correctly
  3. Advanced: Plugin includes performance data for graphing

Project 6: Host and Service Groups

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Configuration Organization
  • Software or Tool: Nagios Core
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Logical groupings of hosts and services for organized display and notification routing.

Why it teaches the concept: Groups are essential for managing monitoring at scale. They affect both display and notification behavior.

Core challenges you’ll face:

  • Defining meaningful groups - By location, function, owner, or criticality
  • Group membership methods - Via the members directive in a hostgroup or the hostgroups directive in a host
  • Services on groups - Applying a service to all hosts in a group with hostgroup_name
  • Notification via groups - Contact groups for different teams

Difficulty: Intermediate Time estimate: 2-3 hours Prerequisites: Projects 1-4, multiple hosts configured

Real World Outcome

# Web interface shows organized groups:

Hostgroups:
├── web-servers (3 hosts)
├── db-servers (2 hosts)
├── production (4 hosts)
└── development (1 host)

Servicegroups:
├── disk-checks (15 services)
├── http-checks (3 services)
└── database-checks (4 services)
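
To make the effect of attaching a service to a hostgroup concrete, here is a small Python model of the expansion. It is a conceptual sketch rather than how Nagios is implemented, and the host names and group memberships are hypothetical.

```python
def expand_services(hosts, services):
    """Return sorted (host_name, service_description) pairs.

    hosts:    {host_name: set of hostgroup names the host belongs to}
    services: list of (service_description, hostgroup_name) tuples
    """
    pairs = []
    for description, group in services:
        for host, groups in hosts.items():
            if group in groups:
                pairs.append((host, description))
    return sorted(pairs)


# Hypothetical inventory: a host may belong to several groups at once.
hosts = {
    "web-server-01": {"web-servers", "production"},
    "web-server-02": {"web-servers", "development"},
    "db-server-01": {"db-servers", "production"},
}
services = [("HTTP", "web-servers"), ("Disk /", "production")]

for host, description in expand_services(hosts, services):
    print(f"{host}: {description}")
```

One service definition per group replaces one definition per host, and a host in two groups picks up the services of both.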

The Core Question You’re Answering

“How do I organize monitoring so that operators can quickly find what they’re looking for?”

The Interview Questions They’ll Ask

  1. “What’s the difference between hostgroups and servicegroups?”
  2. “How do you assign a service to all hosts in a group?”
  3. “Can a host be in multiple groups? When is this useful?”
  4. “How do groups affect notification routing?”

Learning Milestones

  1. Basic: Create hostgroups and assign hosts
  2. Intermediate: Use hostgroup_name to apply services to groups
  3. Advanced: Servicegroups for cross-host service views

Project 7: Timeperiods and Scheduling

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Scheduling, Time-Based Logic
  • Software or Tool: Nagios Core
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Custom timeperiods for business hours, maintenance windows, and on-call schedules.

Why it teaches the concept: Real monitoring requires time-awareness: check less often at night, notify different people on weekends.

Core challenges you’ll face:

  • Timeperiod syntax - Date and time range formats
  • Exclusions - Excluding holidays from business hours
  • Overlapping periods - Combining multiple timeperiods
  • Check vs notification periods - Different uses for timeperiods

Difficulty: Intermediate Time estimate: 2-3 hours Prerequisites: Projects 1-4

Real World Outcome

# Example timeperiods:

define timeperiod {
    timeperiod_name     business_hours
    alias               Monday-Friday 9AM-5PM
    monday              09:00-17:00
    tuesday             09:00-17:00
    wednesday           09:00-17:00
    thursday            09:00-17:00
    friday              09:00-17:00
}

define timeperiod {
    timeperiod_name     non_business_hours
    alias               Outside Business Hours
    monday              00:00-09:00,17:00-24:00
    tuesday             00:00-09:00,17:00-24:00
    ...
    saturday            00:00-24:00
    sunday              00:00-24:00
}
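
The weekday ranges above can be read as a simple membership test. The Python sketch below models only weekday entries with comma-separated ranges; calendar exceptions and excluded timeperiods are not handled, and treating the end of a range as exclusive is an assumption.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight ('24:00' -> 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(timeperiod, moment):
    """Return True if `moment` falls inside one of that weekday's ranges."""
    ranges = timeperiod.get(WEEKDAYS[moment.weekday()])
    if not ranges:
        return False  # a weekday with no entry is entirely outside the period
    now = moment.hour * 60 + moment.minute
    for time_range in ranges.split(","):
        start, end = time_range.split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


business_hours = {day: "09:00-17:00" for day in WEEKDAYS[:5]}
non_business_hours = {day: "00:00-09:00,17:00-24:00" for day in WEEKDAYS[:5]}
non_business_hours.update(saturday="00:00-24:00", sunday="00:00-24:00")

monday_morning = datetime(2024, 1, 8, 10, 30)  # a Monday
print(in_timeperiod(business_hours, monday_morning))
print(in_timeperiod(non_business_hours, monday_morning))
```

Under this model the two periods are complementary: every minute of the week falls in exactly one of them.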

The Core Question You’re Answering

“How do I make monitoring behave differently based on time of day or week?”

The Interview Questions They’ll Ask

  1. “What’s the difference between check_period and notification_period?”
  2. “How do you handle maintenance windows?”
  3. “How do you exclude holidays from a timeperiod?”
  4. “What happens if a check runs outside its check_period?”

Learning Milestones

  1. Basic: Define business hours timeperiod
  2. Intermediate: Apply different timeperiods to checks and notifications
  3. Advanced: Handle holidays and maintenance windows

Project 8: Notification Commands (Email, Slack, SMS)

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Bash, Nagios Configuration
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Notifications, Integration
  • Software or Tool: sendmail/postfix, curl, custom scripts
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Multi-channel notifications that send alerts via email, Slack, and optionally SMS.

Why it teaches the concept: Notifications are useless if they don’t reach the right people in the right way.

Core challenges you’ll face:

  • Email configuration - SMTP setup, HTML vs plain text
  • Slack webhook integration - Formatting messages for Slack
  • SMS integration - Using services like Twilio
  • Command macros - Using Nagios macros in notification commands
┌─────────────────────────────────────────────────────────┐
│              NOTIFICATION FLOW                           │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  State Change (HARD CRITICAL)                            │
│        │                                                 │
│        ▼                                                 │
│  Contact Definition ────► notification_commands          │
│        │                     │                           │
│        │                     ▼                           │
│        │              notify-host-by-email               │
│        │              notify-host-by-slack               │
│        │                     │                           │
│        ▼                     ▼                           │
│  Contact gets email AND Slack message                    │
│                                                          │
│  Macros available:                                       │
│  $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$                │
│  $SERVICEOUTPUT$ $LONGDATETIME$ $CONTACTEMAIL$          │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, working email server or Slack workspace

Real World Outcome

# Email notification received:
Subject: ** PROBLEM: web-server-01/HTTP is CRITICAL **

***** Nagios *****
Notification Type: PROBLEM
Host: web-server-01
Service: HTTP
State: CRITICAL
Output: Connection refused

# Slack message in #alerts channel:
🔴 CRITICAL: web-server-01/HTTP
Connection refused
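
A Slack incoming webhook accepts a JSON body with a `text` field, so the Slack side of the notification command can be as small as the Python sketch below. The function name and the emoji mapping are illustrative assumptions. The arguments correspond to the `$HOSTNAME$`, `$SERVICEDESC$`, `$SERVICESTATE$`, and `$SERVICEOUTPUT$` macros, which Nagios would pass on the command line.

```python
import json

# Illustrative mapping from service state to a leading emoji
STATE_EMOJI = {"OK": "🟢", "WARNING": "🟡", "CRITICAL": "🔴", "UNKNOWN": "⚪"}


def build_slack_payload(hostname, servicedesc, servicestate, serviceoutput):
    """Build the JSON body for a Slack incoming-webhook POST."""
    emoji = STATE_EMOJI.get(servicestate, "⚪")
    text = f"{emoji} {servicestate}: {hostname}/{servicedesc}\n{serviceoutput}"
    return json.dumps({"text": text}, ensure_ascii=False)


print(build_slack_payload("web-server-01", "HTTP", "CRITICAL", "Connection refused"))
```

Keeping payload construction separate from the HTTP POST makes the formatting testable from the command line without sending a real alert.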

The Core Question You’re Answering

“How do I ensure alerts reach the right people through their preferred communication channel?”

The Interview Questions They’ll Ask

  1. “How do you prevent alert fatigue?”
  2. “What macros are available in notification commands?”
  3. “How would you add a new notification channel?”
  4. “How do you test notification commands without triggering real alerts?”
  5. “What’s the difference between host and service notification commands?”

Learning Milestones

  1. Basic: Email notifications working
  2. Intermediate: Slack webhook integration
  3. Advanced: Multiple channels per contact, formatted messages

Project 9: Contact Groups and Escalations

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Notification Routing
  • Software or Tool: Nagios Core
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Escalation chains that notify different teams based on how long a problem has persisted.

Why it teaches the concept: Real incidents need escalation paths. The junior on-call should be notified first, then managers if unacknowledged.

Core challenges you’ll face:

  • Contact group design - Organizing contacts by role and responsibility
  • Escalation timing - first_notification, last_notification, notification_interval
  • Escalation conditions - Which states trigger escalation
  • Acknowledgement handling - Stopping escalation when acked
┌─────────────────────────────────────────────────────────┐
│              ESCALATION TIMELINE                         │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Time:   0      5      10     15     20     25     30   │
│          │      │      │      │      │      │      │    │
│          ▼      ▼      ▼      ▼      ▼      ▼      ▼    │
│                                                          │
│  Notif:  1      2      3      4      5      6      7    │
│                                                          │
│  Level 1: on-call engineer (notifications 1-2)           │
│  Level 2: senior engineer (notifications 3-4)            │
│  Level 3: manager (notifications 5+)                     │
│                                                          │
│  If acknowledged at notification 3:                      │
│    - Further problem notifications are suppressed        │
│    - Recovery goes to contacts already notified          │
│                                                          │
└─────────────────────────────────────────────────────────┘
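
The timeline can be expressed as a lookup from notification number to contact group. The Python sketch below models `first_notification` and `last_notification` (with 0 meaning no upper limit); the group names are hypothetical and this is a simplified model, not Nagios's own code.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    contact_group: str
    first_notification: int
    last_notification: int  # 0 means "no upper limit"

    def applies_to(self, notification_number):
        if notification_number < self.first_notification:
            return False
        return (self.last_notification == 0
                or notification_number <= self.last_notification)


def recipients(escalations, notification_number, default_group):
    """Return the contact groups for a given notification number."""
    matched = [e.contact_group for e in escalations
               if e.applies_to(notification_number)]
    # When no escalation matches, the service's own contact group is used.
    return matched or [default_group]


ESCALATIONS = [
    Escalation("senior-engineers", first_notification=3, last_notification=4),
    Escalation("managers", first_notification=5, last_notification=0),
]

for number in (1, 3, 7):
    print(number, recipients(ESCALATIONS, number, "on-call"))
```

In this model a matching escalation replaces the default contacts instead of adding to them, so the on-call group has to be listed in an escalation explicitly if it should keep receiving notifications at higher levels.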

Difficulty: Advanced Time estimate: 3-4 hours Prerequisites: Projects 1-8

Real World Outcome

# First 10 minutes: on-call gets notified
# After 10 minutes: senior engineer also notified
# After 20 minutes: manager also notified

# When engineer acknowledges:
# - Further problem notifications (and therefore escalations) stop
# - When service recovers, previously notified contacts get the recovery

The Core Question You’re Answering

“How do I ensure problems get attention even if the first responder doesn’t react?”

The Interview Questions They’ll Ask

  1. “What is the difference between a service’s notification_interval and the notification_interval in an escalation definition?”
  2. “How do you prevent the same person from being notified by multiple escalation levels?”
  3. “What happens when an escalated problem is acknowledged?”
  4. “How would you implement ‘follow the sun’ on-call rotation?”

Learning Milestones

  1. Basic: Two-level escalation working
  2. Intermediate: Escalation stops on acknowledgement
  3. Advanced: Different escalation paths for different services

Project 10: Event Handlers for Auto-Remediation

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Bash
  • Difficulty: Level 4: Expert
  • Knowledge Area: Automation, Self-Healing Systems
  • Software or Tool: Shell scripts, SSH
  • Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Automated remediation scripts that attempt to fix problems before humans are notified.

Why it teaches the concept: The best alert is the one that never fires because the system fixed itself.

Core challenges you’ll face:

  • Event handler logic - Understanding when handlers trigger
  • State-based actions - Different actions for SOFT vs HARD states
  • Safe remediation - Ensuring handlers don’t make things worse
  • Logging and auditing - Recording what actions were taken
┌─────────────────────────────────────────────────────────┐
│              EVENT HANDLER FLOW                          │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Check Result                                            │
│       │                                                  │
│       ▼                                                  │
│  ┌─────────────────┐                                    │
│  │ Event handler   │                                    │
│  │ enabled?        ├──NO──► No action                   │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ State change?   │                                    │
│  │ (or first check)├──NO──► No action                   │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  Execute event_handler command                           │
│  with $SERVICESTATE$ $SERVICESTATETYPE$ args            │
│                                                          │
│  Example handler logic:                                  │
│  if SERVICESTATE=CRITICAL and STATETYPE=SOFT then       │
│      attempt_restart                                     │
│  elif SERVICESTATE=CRITICAL and STATETYPE=HARD then     │
│      # Give up, notification will handle it             │
│  fi                                                      │
│                                                          │
└─────────────────────────────────────────────────────────┘
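
The handler logic in the diagram, plus a guard against restart loops, might look like the Python sketch below. The `restarts_so_far` counter and the `max_restarts` limit are assumptions; a real handler would need to persist the counter somewhere, for example in a small state file.

```python
def decide_action(state, state_type, restarts_so_far, max_restarts=2):
    """Return 'restart' or 'none' for one event-handler invocation.

    state and state_type correspond to the $SERVICESTATE$ and
    $SERVICESTATETYPE$ macros passed to the handler.
    """
    if state != "CRITICAL":
        return "none"      # OK, WARNING, UNKNOWN: nothing to fix
    if state_type == "HARD":
        return "none"      # give up; the notification process takes over
    if restarts_so_far >= max_restarts:
        return "none"      # safety limit against restart loops
    return "restart"


print(decide_action("CRITICAL", "SOFT", restarts_so_far=0))
print(decide_action("CRITICAL", "HARD", restarts_so_far=1))
```

Returning a decision instead of restarting the service directly keeps the policy testable and makes it straightforward to log every decision for auditing.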

Difficulty: Expert Time estimate: 6-8 hours Prerequisites: Projects 1-4, sudo/SSH access to managed hosts

Real World Outcome

# Service goes CRITICAL (SOFT state):
Event handler: Attempting to restart httpd
Event handler: Service httpd restarted successfully

# Next check:
Service returned to OK state
# No notification ever sent!

# If restart fails:
# Service goes HARD CRITICAL
# Normal notification process triggers

The Core Question You’re Answering

“How do I automate the first steps of incident response to reduce human intervention?”

The Interview Questions They’ll Ask

  1. “Why are event handlers usually written to act during SOFT states rather than waiting for a HARD state?”
  2. “How do you prevent event handlers from making problems worse?”
  3. “How do you audit what event handlers did?”
  4. “What’s the difference between event handlers and notifications?”
  5. “How would you handle a service that keeps restarting in a loop?”

Learning Milestones

  1. Basic: Event handler triggers on state change
  2. Intermediate: Handler takes action only on SOFT CRITICAL
  3. Advanced: Handler logs actions and has safety limits

Project 11: Passive Checks and NSCA

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Bash, Nagios Configuration
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Push-Based Monitoring
  • Software or Tool: NSCA (Nagios Service Check Acceptor)
  • Main Book: Nagios Core Documentation

What you’ll build: Passive check infrastructure for monitoring behind firewalls or from external scripts/cron jobs.

Why it teaches the concept: Not everything can be polled. Passive checks enable monitoring of jobs, security events, and firewalled systems.

┌─────────────────────────────────────────────────────────┐
│            PASSIVE CHECK ARCHITECTURE                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Active Check (Pull):                                    │
│  Nagios ───────────────────────────────► Remote Host    │
│  (initiates connection every N minutes)                  │
│                                                          │
│  Passive Check (Push):                                   │
│  External Script ─────► NSCA ─────► Nagios External CMD │
│  (sends when ready)      │          │                    │
│                          │          ▼                    │
│                     Port 5667   nagios.cmd pipe          │
│                                                          │
│  Use cases:                                              │
│  - Cron job completion status                            │
│  - Security events from SIEM                             │
│  - Backup job results                                    │
│  - Hosts behind NAT/firewall                             │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4

Real World Outcome

# From external system:
$ echo -e "web-server-01\tBackup Status\t0\tBackup completed successfully" | \
    /usr/local/nagios/bin/send_nsca -H nagios-server -c send_nsca.cfg

# Nagios shows service "Backup Status" with OK state
# If no result received within freshness_threshold:
# Service goes CRITICAL - "No result received in 24 hours"
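
The same result can be expressed in two wire formats: the tab-separated line that `send_nsca` reads on standard input, and the `PROCESS_SERVICE_CHECK_RESULT` line written to the external command file. The Python sketch below shows both, plus a deliberately simplified freshness test; the function names are illustrative, and the real freshness logic in Nagios takes more factors into account.

```python
def nsca_line(host, service, return_code, output):
    """Format one service result as send_nsca expects it on stdin."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def external_command(timestamp, host, service, return_code, output):
    """Format the same result as a line for the external command file."""
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def is_stale(last_result_time, now, freshness_threshold):
    """True when the newest result is older than freshness_threshold seconds."""
    return (now - last_result_time) > freshness_threshold


print(nsca_line("web-server-01", "Backup Status", 0,
                "Backup completed successfully"), end="")
print(external_command(1700000000, "web-server-01", "Backup Status", 0,
                       "Backup completed successfully"), end="")
```

The host name and service description must match the Nagios object definitions exactly, otherwise the result is discarded.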

The Core Question You’re Answering

“How do I monitor things that can’t be actively polled?”

The Interview Questions They’ll Ask

  1. “What’s the difference between active and passive checks?”
  2. “How does freshness checking work?”
  3. “What happens if a passive check never sends a result?”
  4. “How do you secure NSCA communication?”
  5. “When would you use passive vs active checks?”

Learning Milestones

  1. Basic: NSCA server accepting results
  2. Intermediate: Passive service with freshness checking
  3. Advanced: Cron job sending results via send_nsca

Project 12: Performance Data and Graphing

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Perl, Nagios Configuration
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Metrics, Time-Series Data
  • Software or Tool: PNP4Nagios, RRDtool, or InfluxDB/Grafana
  • Main Book: “Nagios and Nagios Related Development” - Nagios Exchange

What you’ll build: Performance data collection and graphing for historical trend analysis.

Why it teaches the concept: Monitoring without graphing is incomplete. Graphs show trends, capacity planning data, and historical context.

┌─────────────────────────────────────────────────────────┐
│            PERFORMANCE DATA PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Plugin Output:                                          │
│  "DISK OK - 45GB free|/=55GB;80;90;0;100"               │
│                ▲                                         │
│                │ performance data                        │
│                ▼                                         │
│  process_performance_data_command                        │
│                │                                         │
│                ▼                                         │
│  ┌─────────────────────────────────────┐                │
│  │  PNP4Nagios / Graphite / InfluxDB   │                │
│  │         (stores time-series)         │                │
│  └─────────────────────────────────────┘                │
│                │                                         │
│                ▼                                         │
│  ┌─────────────────────────────────────┐                │
│  │      Graphs in Web Interface        │                │
│  │   (click on service → see graph)    │                │
│  └─────────────────────────────────────┘                │
│                                                          │
│  Performance data format:                                │
│  label=value[UOM];warn;crit;min;max                     │
│                                                          │
└─────────────────────────────────────────────────────────┘
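
Any graphing backend first has to split plugin output at the pipe and parse each `label=value[UOM];warn;crit;min;max` item. The Python sketch below does this with a regular expression; it is a minimal parser under the format shown above and does not try to cover every edge case of the plugin guidelines.

```python
import re

# label=value[UOM];warn;crit;min;max  (labels containing spaces are single-quoted)
PERF_ITEM = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<plain>[^\s=']+))="
    r"(?P<value>-?[\d.]+)(?P<uom>[^\s;]*)"
    r"(?:;(?P<warn>[^\s;]*))?(?:;(?P<crit>[^\s;]*))?"
    r"(?:;(?P<min>[^\s;]*))?(?:;(?P<max>[^\s;]*))?"
)


def parse_perfdata(plugin_output):
    """Split plugin output at the pipe and parse each perfdata item."""
    _, _, perf = plugin_output.partition("|")
    metrics = []
    for m in PERF_ITEM.finditer(perf):
        metrics.append({
            "label": m.group("quoted") or m.group("plain"),
            "value": float(m.group("value")),
            "uom": m.group("uom"),
            # Empty or missing fields are normalised to None
            "warn": m.group("warn") or None,
            "crit": m.group("crit") or None,
            "min": m.group("min") or None,
            "max": m.group("max") or None,
        })
    return metrics


for metric in parse_perfdata("DISK OK - 45GB free|/=55GB;80;90;0;100"):
    print(metric)
```

The human-readable text before the pipe is ignored entirely; only the structured part after it is stored as time-series data.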

Difficulty: Advanced Time estimate: 6-8 hours Prerequisites: Projects 1-5

Real World Outcome

Click on any service in Nagios web interface and see historical graphs:

  • CPU usage over last 24 hours
  • Disk space trend over last month
  • Response time percentiles

The Core Question You’re Answering

“How do I see trends over time instead of just current state?”

The Interview Questions They’ll Ask

  1. “How is performance data different from check output?”
  2. “What is RRDtool and how does it work?”
  3. “How would you alert on rates of change vs absolute values?”
  4. “What’s the storage overhead for performance data?”

Learning Milestones

  1. Basic: Performance data being written to files
  2. Intermediate: PNP4Nagios or similar showing graphs
  3. Advanced: Custom graphs with calculated metrics

Project 13: Nagios Plugins Development

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Python, Bash, or Perl
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Plugin Development
  • Software or Tool: Nagios Plugin API
  • Main Book: Nagios Plugin Development Guidelines

What you’ll build: A production-quality plugin with proper argument parsing, timeout handling, and performance data.

Why it teaches the concept: Plugins are how Nagios interfaces with the world. Understanding plugin development means you can monitor anything.

Core challenges you’ll face:

  • Argument parsing - Standard options like -H, -w, -c, -t, -v
  • Timeout handling - Plugin must not hang
  • Help output - -h and --help should be informative
  • Verbose mode - -v for debugging
  • Performance data - Proper formatting

Difficulty: Advanced Time estimate: 8-10 hours Prerequisites: Projects 1-5, programming skills

Real World Outcome

A plugin that follows the Nagios guidelines:

$ ./check_myapp --help
check_myapp - Check MyApp API health
Usage: check_myapp -H <host> [-p port] [-w warn] [-c crit] [-t timeout]

Options:
  -H, --hostname   Hostname or IP
  -p, --port       Port number (default: 8080)
  -w, --warning    Warning threshold (ms)
  -c, --critical   Critical threshold (ms)
  -t, --timeout    Timeout in seconds (default: 10)
  -v, --verbose    Verbose output
  -h, --help       This help message

$ ./check_myapp -H localhost -w 200 -c 500
MYAPP OK - Response time 45ms|response_time=45ms;200;500;0;
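
A Python skeleton for such a plugin could combine `argparse` for the option handling with `signal.alarm` for the timeout. This is a sketch under several assumptions: the default thresholds of 200 and 500 ms, the choice to report UNKNOWN on timeout, and the `probe` callable that stands in for the real MyApp request are all illustrative. `signal.alarm` is only available on Unix-like systems.

```python
import argparse
import signal

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def build_parser():
    # argparse supplies -h/--help itself; -H is a separate, case-sensitive option.
    parser = argparse.ArgumentParser(prog="check_myapp",
                                     description="Check MyApp API health")
    parser.add_argument("-H", "--hostname", required=True, help="Hostname or IP")
    parser.add_argument("-p", "--port", type=int, default=8080)
    parser.add_argument("-w", "--warning", type=int, default=200,
                        help="Warning threshold (ms); default is an assumption")
    parser.add_argument("-c", "--critical", type=int, default=500,
                        help="Critical threshold (ms); default is an assumption")
    parser.add_argument("-t", "--timeout", type=int, default=10)
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser


def format_result(response_ms, warn_ms, crit_ms):
    """Map a response time to (exit_code, output line with perfdata)."""
    if response_ms >= crit_ms:
        code = CRITICAL
    elif response_ms >= warn_ms:
        code = WARNING
    else:
        code = OK
    perfdata = f"response_time={response_ms}ms;{warn_ms};{crit_ms};0;"
    return code, f"MYAPP {LABELS[code]} - Response time {response_ms}ms|{perfdata}"


class PluginTimeout(Exception):
    pass


def _on_alarm(signum, frame):
    raise PluginTimeout()


def run_check(probe, args):
    """Run probe(args) under an alarm; map timeouts and errors to UNKNOWN."""
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(args.timeout)
    try:
        return format_result(probe(args), args.warning, args.critical)
    except PluginTimeout:
        return UNKNOWN, f"MYAPP UNKNOWN - Timed out after {args.timeout}s"
    except Exception as exc:  # any unexpected error must not crash the plugin
        return UNKNOWN, f"MYAPP UNKNOWN - {exc}"
    finally:
        signal.alarm(0)  # always cancel the pending alarm
```

A real plugin would finish by parsing `sys.argv`, calling `run_check` with a function that times the actual request, printing the output line, and passing the code to `sys.exit()`.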

The Core Question You’re Answering

“How do I create a monitoring plugin that is robust, user-friendly, and maintainable?”

The Interview Questions They’ll Ask

  1. “What are the Nagios plugin guidelines?”
  2. “How do you handle timeouts in plugins?”
  3. “What should happen if a plugin encounters an unexpected error?”
  4. “How do you test plugins before deployment?”
  5. “What libraries exist to simplify plugin development?”

Learning Milestones

  1. Basic: Plugin with correct exit codes
  2. Intermediate: Proper argument parsing and help output
  3. Advanced: Timeout handling, verbose mode, performance data

Project 14: Monitoring Windows Hosts

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Windows Monitoring
  • Software or Tool: NSClient++, check_nt, or WMI
  • Main Book: NSClient++ Documentation

What you’ll build: Windows host monitoring for disk, CPU, memory, services, and event logs.

Why it teaches the concept: Mixed environments require multi-platform monitoring. NSClient++ is the NRPE equivalent for Windows.

┌─────────────────────────────────────────────────────────┐
│            WINDOWS MONITORING OPTIONS                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Option 1: NSClient++ (Recommended)                      │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │         │ Windows Host │              │
│  │              │         │              │              │
│  │ check_nrpe   │◄───────►│ NSClient++   │              │
│  │ check_nt     │ TCP/12489│ (agent)     │              │
│  └──────────────┘         └──────────────┘              │
│                                                          │
│  Option 2: WMI (Agentless)                               │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │         │ Windows Host │              │
│  │              │         │              │              │
│  │ check_wmi    │◄───────►│ WMI Service  │              │
│  │ (uses wmic)  │ RPC/135 │ (built-in)   │              │
│  └──────────────┘         └──────────────┘              │
│                                                          │
│  Metrics available:                                      │
│  - CPU usage (CPULOAD)                                   │
│  - Memory usage (MEMUSE)                                 │
│  - Disk space (USEDDISKSPACE)                           │
│  - Service status (SERVICESTATE)                         │
│  - Process list (PROCSTATE)                              │
│  - Event log (CheckEventLog via check_nrpe)              │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, Windows server access

Real World Outcome

$ /usr/local/nagios/libexec/check_nt -H windows-server -p 12489 -v CPULOAD -l 5,80,90
CPU Load 12% (5 min average)|'5 min avg Load'=12%;80;90;0;100;

$ /usr/local/nagios/libexec/check_nt -H windows-server -p 12489 -v SERVICESTATE -l "Windows Update"
Windows Update: Started

The Core Question You’re Answering

“How do I monitor Windows systems from a Linux-based Nagios server?”

The Interview Questions They’ll Ask

  1. “What’s the difference between check_nt and check_nrpe for Windows?”
  2. “How do you monitor Windows services?”
  3. “How do you check Windows event logs?”
  4. “What are the security implications of WMI vs NSClient++?”

Learning Milestones

  1. Basic: NSClient++ installed and responding
  2. Intermediate: CPU, memory, disk monitored
  3. Advanced: Service status and event log monitoring

Project 15: Network Device Monitoring (SNMP)

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 4: Expert
  • Knowledge Area: Network Monitoring, SNMP
  • Software or Tool: check_snmp, snmpwalk
  • Main Book: “Essential SNMP” by Mauro & Schmidt

What you’ll build: Monitoring for routers, switches, and network appliances using SNMP.

Why it teaches the concept: Routers, switches, and appliances usually cannot run an agent such as NRPE. SNMP is the standard protocol for querying and managing network infrastructure.

┌─────────────────────────────────────────────────────────┐
│                    SNMP CONCEPTS                         │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  SNMP = Simple Network Management Protocol               │
│                                                          │
│  Key Components:                                         │
│  ┌──────────────┐    ┌──────────────┐                   │
│  │ SNMP Manager │───►│ SNMP Agent   │                   │
│  │ (Nagios)     │◄───│ (Device)     │                   │
│  └──────────────┘    └──────────────┘                   │
│         │                   │                            │
│         │ GET/SET           │ Response                   │
│         │                   │ Trap                       │
│                                                          │
│  OID (Object Identifier):                                │
│  .1.3.6.1.4.1.9.2.1.58.0 = Cisco CPU 5min (avgBusy5)     │
│  .1.3.6.1.2.1.2.2.1.8.1 = Interface 1 oper status       │
│                                                          │
│  MIB (Management Information Base):                      │
│  Human-readable mappings for OIDs                        │
│                                                          │
│  SNMP Versions:                                          │
│  v1: Community strings (insecure)                        │
│  v2c: Same security, better performance                  │
│  v3: Authentication + encryption (recommended)           │
│                                                          │
└─────────────────────────────────────────────────────────┘
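
An OID value only becomes a check once it is mapped to a Nagios state. As a hedged illustration, the sketch below translates the `ifOperStatus` enumeration defined in IF-MIB into plugin exit codes. The enumeration itself comes from the MIB; the severity policy (for example, treating `testing` as WARNING) and the function name are assumptions chosen for this example.

```python
import re

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ifOperStatus enumeration from IF-MIB (RFC 2863).
IF_OPER_STATUS = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}

# Assumed policy for this sketch; adjust to local conventions.
SEVERITY = {
    "up": OK,
    "testing": WARNING,
    "dormant": WARNING,
    "down": CRITICAL,
    "lowerLayerDown": CRITICAL,
}

def oper_status_to_state(raw):
    """Map a value like 'up(1)' or 2 to (exit_code, status_line)."""
    match = re.search(r"(\d+)\)?\s*$", str(raw))
    code = int(match.group(1)) if match else None
    name = IF_OPER_STATUS.get(code)
    if name is None:
        return UNKNOWN, f"UNKNOWN - unrecognized ifOperStatus {raw!r}"
    state = SEVERITY.get(name, UNKNOWN)
    label = ("OK", "WARNING", "CRITICAL", "UNKNOWN")[state]
    return state, f"{label} - interface is {name}({code})"
```

A wrapper plugin would print the status line and exit with the returned code.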

Difficulty: Expert | Time estimate: 6-8 hours | Prerequisites: Projects 1-4, network device with SNMP enabled

Real World Outcome

# Check interface status:
$ check_snmp -H router.example.com -C public -o IF-MIB::ifOperStatus.1
SNMP OK - up(1)|

# Check CPU utilization on Cisco device:
$ check_snmp -H router.example.com -C public \
    -o .1.3.6.1.4.1.9.2.1.58.0 -w 80 -c 90
SNMP OK - 45|iso.3.6.1.4.1.9.2.1.58.0=45;80;90

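# Monitor interface bandwidth (third-party check_snmp_int.pl plugin):
# -n selects the interface by name, -k enables the bandwidth check,
# -u interprets thresholds as a percentage of interface speed.
$ check_snmp_int.pl -H switch.example.com -C public -n "Gi0/1" -k -u -w 80,80 -c 90,90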
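
Bandwidth is not read directly over SNMP. It is derived from two samples of an octet counter (`ifInOctets`/`ifOutOctets`, or the 64-bit `ifHCInOctets`/`ifHCOutOctets`) divided by the polling interval and the interface speed. A minimal sketch of that arithmetic, assuming the caller has already fetched the counters and the speed in bits per second (the function name and parameters are illustrative):

```python
def utilization_percent(octets_prev, octets_now, interval_s, if_speed_bps,
                        counter_bits=64):
    """Percent utilization between two octet-counter samples.

    counter_bits is 32 for ifInOctets/ifOutOctets and 64 for the
    ifHC* counters; it is used to correct a single counter wrap.
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    delta = octets_now - octets_prev
    if delta < 0:
        # The counter wrapped past its maximum between the two samples.
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / interval_s
    return bits_per_second * 100.0 / if_speed_bps
```

Note that a device reboot also resets the counters and looks like a wrap, so a production plugin would typically compare `sysUpTime` between samples and discard the interval if it went backwards.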

The Core Question You’re Answering

“How do I monitor network infrastructure that doesn’t run standard OS agents?”

The Interview Questions They’ll Ask

  1. “What is an OID and how do you find the right one?”
  2. “What are the security differences between SNMP v2c and v3?”
  3. “How do you monitor interface bandwidth?”
  4. “What is an SNMP trap and how is it different from polling?”
  5. “How do you monitor a device you don’t have the MIB for?”

Learning Milestones

  1. Basic: SNMP GET working from Nagios
  2. Intermediate: Interface status and bandwidth monitored
  3. Advanced: SNMP v3 with authentication, trap receiver

Project 16: Log File Monitoring

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Bash, Perl
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Log Analysis
  • Software or Tool: check_log, custom scripts
  • Main Book: Nagios Core Documentation

What you’ll build: Monitoring that detects error patterns in application log files and alerts.

Why it teaches the concept: Logs often contain warnings before metrics show problems. Log monitoring catches issues early.

Core challenges you’ll face:

  • Log rotation - Handling log files that rotate
  • State tracking - Only alerting on new entries
  • Pattern matching - Defining what constitutes an error
  • Performance - Scanning large logs efficiently

┌─────────────────────────────────────────────────────────┐
│            LOG MONITORING APPROACHES                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Approach 1: Scan for patterns (active check)            │
│  ┌──────────────────────────────────────────────┐       │
│  │ check_log -F /var/log/app.log -O /tmp/seek   │       │
│  │           -q "ERROR|FATAL"                    │       │
│  │                                               │       │
│  │ • Diffs against last-run copy (-O)           │       │
│  │ • Only scans new entries                      │       │
│  │ • Must handle rotation                        │       │
│  └──────────────────────────────────────────────┘       │
│                                                          │
│  Approach 2: Real-time tail + NSCA (passive check)       │
│  ┌──────────────────────────────────────────────┐       │
│  │ tail -F /var/log/app.log | while read line   │       │
│  │ do                                            │       │
│  │   if [[ $line =~ ERROR ]]; then               │       │
│  │     send_nsca "host" "Log Errors" 2 "$line"  │       │
│  │   fi                                          │       │
│  │ done                                          │       │
│  └──────────────────────────────────────────────┘       │
│                                                          │
│  Approach 3: External log shipper → Nagios               │
│  (Filebeat/Logstash → parse → alert → NSCA)             │
│                                                          │
└─────────────────────────────────────────────────────────┘
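
The state-tracking and rotation challenges in Approach 1 can be illustrated with a byte-offset scanner. This is a sketch of one possible mechanism, not how the stock `check_log` script works (that script compares the log against a saved copy). The state-file format and the function name `scan_new_lines` are assumptions made for the example.

```python
import json
import os
import re

def scan_new_lines(log_path, state_path, pattern):
    """Return lines matching pattern that were appended since the last call."""
    st = os.stat(log_path)
    try:
        with open(state_path) as fh:
            state = json.load(fh)
    except (OSError, ValueError):
        # First run or unreadable state: start from the beginning.
        state = {"inode": st.st_ino, "offset": 0}

    # Rotation (new inode) or truncation (file shorter than saved offset):
    # the old offset is meaningless, so restart at the top of the new file.
    if state["inode"] != st.st_ino or st.st_size < state["offset"]:
        state = {"inode": st.st_ino, "offset": 0}

    with open(log_path, "rb") as fh:
        fh.seek(state["offset"])
        data = fh.read()
        state["offset"] = fh.tell()

    with open(state_path, "w") as fh:
        json.dump(state, fh)

    regex = re.compile(pattern)
    lines = data.decode("utf-8", errors="replace").splitlines()
    return [line for line in lines if regex.search(line)]
```

A plugin built on this would exit CRITICAL when the returned list is non-empty. Lines written to the old file between the last scan and a rename-style rotation are missed by this sketch; handling that requires also reading the tail of the rotated file.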

Difficulty: Advanced | Time estimate: 4-6 hours | Prerequisites: Projects 1-4, 11 (for passive approach)

Real World Outcome

# Active check approach:
$ /usr/local/nagios/libexec/check_log -F /var/log/myapp.log \
    -O /tmp/myapp_log.seek -q "ERROR"
Log check ok - 0 pattern matches found

# (After an error is logged)
$ /usr/local/nagios/libexec/check_log -F /var/log/myapp.log \
    -O /tmp/myapp_log.seek -q "ERROR"
(1) < 2024-01-15 10:23:45 ERROR: Database connection failed

The Core Question You’re Answering

“How do I detect problems from log messages before they become service failures?”

The Interview Questions They’ll Ask

  1. “How do you handle log rotation in monitoring?”
  2. “What’s the difference between active and passive log monitoring?”
  3. “How do you avoid duplicate alerts for the same error?”
  4. “How do you monitor logs across many servers?”
  5. “How would you alert on an absence of expected log entries?”

Learning Milestones

  1. Basic: check_log detecting patterns
  2. Intermediate: State tracking across check runs
  3. Advanced: Log rotation handling, passive check integration
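
For the passive-check milestone, note that `send_nsca` does not take the result as arguments. It reads one tab-separated line per result from standard input: host name, service description, return code, plugin output. A small sketch of building that line, with whitespace in the message flattened so an embedded tab or newline cannot break the framing (the helper name is illustrative):

```python
def nsca_service_line(host, service, code, output):
    """Format one passive service result for send_nsca's standard input."""
    if code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1, 2 or 3 (UNKNOWN)")
    # Collapse tabs/newlines in the message; they are field/record separators.
    clean = " ".join(str(output).split())
    return f"{host}\t{service}\t{code}\t{clean}\n"

# Hypothetical usage (server name and config path are placeholders):
#   import subprocess
#   subprocess.run(
#       ["send_nsca", "-H", "nagios.example.com", "-c", "/etc/send_nsca.cfg"],
#       input=nsca_service_line("web01", "Log Errors", 2, "ERROR seen"),
#       text=True, check=True)
```

The host and service names must match an existing service definition with passive checks enabled, otherwise Nagios discards the result.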

Project 17: Custom Dashboards and Status Pages

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: PHP, HTML, API calls
  • Difficulty: Level 3: Advanced
  • Knowledge Area: UI/UX, Web Development
  • Software or Tool: Nagios CGIs, Thruk, Grafana
  • Main Book: Nagios Core Documentation - CGI Reference

What you’ll build: Custom status views and dashboards beyond the default Nagios interface.

Why it teaches the concept: Different stakeholders need different views. Operations needs details; management needs summaries.

Core challenges you’ll face:

  • Status API - Reading from status.dat or NDO
  • Custom CGIs - Building on top of Nagios CGIs
  • Alternative UIs - Thruk, Nagstamon, mobile apps
  • Data visualization - Meaningful dashboards

┌─────────────────────────────────────────────────────────┐
│            DASHBOARD ARCHITECTURE                        │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Data Sources:                                           │
│  ┌──────────────────┐                                   │
│  │ status.dat       │ ◄─ File-based, fast, limited     │
│  │ nagios.cmd       │ ◄─ Command pipe for control       │
│  │ NDO/Livestatus   │ ◄─ Database/socket, full history │
│  └──────────────────┘                                   │
│           │                                              │
│           ▼                                              │
│  ┌──────────────────────────────────────────────┐       │
│  │              Dashboard Options               │       │
│  ├──────────────────────────────────────────────┤       │
│  │ • Default Nagios CGIs (basic)                │       │
│  │ • Thruk (modern web UI)                      │       │
│  │ • Grafana + Nagios datasource                │       │
│  │ • Custom dashboards (TV displays)            │       │
│  │ • Desktop/mobile apps (Nagstamon, etc.)      │       │
│  └──────────────────────────────────────────────┘       │
│                                                          │
└─────────────────────────────────────────────────────────┘
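
Of the data sources above, `status.dat` is the simplest to read programmatically: it is a sequence of `type { key=value ... }` blocks. A minimal parsing sketch, assuming the whole file fits in memory (function names are illustrative; a real dashboard would also cache results and tolerate reading the file while Nagios rewrites it):

```python
def parse_status_dat(text):
    """Parse status.dat content into a list of dicts, one per block."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("{"):
            # Block header such as "hoststatus {" or "servicestatus {".
            current = {"_type": line[:-1].strip()}
        elif line == "}":
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = line.partition("=")
            current[key] = value
    return blocks

def current_problems(blocks):
    """Return (host, service, state) for every service that is not OK."""
    return [
        (b["host_name"], b["service_description"], int(b["current_state"]))
        for b in blocks
        if b["_type"] == "servicestatus" and int(b.get("current_state", 0)) != 0
    ]
```

The output of `current_problems` is enough to drive a simple NOC wallboard. History and trends are not in `status.dat` and need NDO, Livestatus, or the log archives.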

Difficulty: Advanced | Time estimate: 6-8 hours | Prerequisites: Projects 1-4, web development basics

Real World Outcome

  • TV dashboard in NOC showing current problems
  • Management summary page with SLA compliance
  • Mobile app notifications on phone
  • Custom status pages for specific teams

The Core Question You’re Answering

“How do I present monitoring data in a way that’s useful for different audiences?”

The Interview Questions They’ll Ask

  1. “How do you read current status programmatically?”
  2. “What’s the difference between status.dat and NDO?”
  3. “How would you build a public status page?”
  4. “How do you secure API access to Nagios data?”

Learning Milestones

  1. Basic: Install Thruk or alternative UI
  2. Intermediate: Custom status page for specific hosts
  3. Advanced: Grafana dashboards with historical data

Project 18: High Availability Setup

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Nagios Configuration
  • Difficulty: Level 5: Master
  • Knowledge Area: High Availability, Clustering
  • Software or Tool: Pacemaker/Corosync, DRBD, or Active/Standby
  • Main Book: “Pro Linux High Availability Clustering” by Sander van Vugt

What you’ll build: Redundant Nagios infrastructure that survives the failure of any single component.

Why it teaches the concept: Your monitoring system must be more reliable than what it monitors. HA is essential for production.

┌─────────────────────────────────────────────────────────┐
│          HIGH AVAILABILITY ARCHITECTURES                 │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Option 1: Active/Standby with Pacemaker                 │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │ ───────►│ Nagios       │              │
│  │ Primary      │ DRBD    │ Standby      │              │
│  │ (Active)     │         │ (Passive)    │              │
│  └──────────────┘         └──────────────┘              │
│         │                                                │
│         ▼                                                │
│    Virtual IP (failover)                                 │
│                                                          │
│  Option 2: Distributed Monitoring                        │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │ ◄──────►│ Nagios       │              │
│  │ Region A     │ NSCA    │ Region B     │              │
│  │ (monitors A) │         │ (monitors B) │              │
│  └──────────────┘         └──────────────┘              │
│         │                        │                       │
│         └────────────────────────┘                       │
│                    │                                     │
│                    ▼                                     │
│           Central Dashboard                              │
│                                                          │
│  Option 3: Load Balanced Pollers                         │
│  ┌──────────────┐                                       │
│  │ Central      │ ◄── Check results from                │
│  │ Nagios       │     multiple poller nodes             │
│  └──────────────┘                                       │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Master | Time estimate: 12-16 hours | Prerequisites: Projects 1-10, Linux clustering knowledge

Real World Outcome

# Primary server fails:
$ crm status
Node nagios-01: OFFLINE
Node nagios-02: online

Resources:
 Resource Group: nagios-group
     VIP        (ocf::heartbeat:IPaddr2):    Started nagios-02
     Nagios     (ocf::heartbeat:nagios):     Started nagios-02

# Nagios resumes on nagios-02 once the VIP and Nagios resources have started
# Alert: Monitoring failover occurred

The Core Question You’re Answering

“How do I ensure monitoring is available even when monitoring servers fail?”

The Interview Questions They’ll Ask

  1. “How do you handle split-brain in a Nagios cluster?”
  2. “What data needs to be replicated between nodes?”
  3. “How do you test failover without affecting production?”
  4. “What’s the recovery time objective for your monitoring?”
  5. “How do you monitor your monitoring system?”
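
One common answer to the last question is a freshness check: a running Nagios rewrites its status file periodically, so a stale file suggests the daemon has stalled or died. The sketch below is one way to express that as plugin logic. The 120-second threshold and the function name are assumptions, and the check only helps if it runs from somewhere other than the node being watched (for example, from the peer node via NRPE).

```python
import os
import time

def check_status_freshness(path, max_age_s=120, now=None):
    """Return (exit_code, message) based on the age of the status file."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as a failed daemon.
        return 2, f"CRITICAL - {path} not found"
    if age > max_age_s:
        return 2, f"CRITICAL - {path} last updated {age:.0f}s ago"
    return 0, f"OK - {path} updated {age:.0f}s ago"
```

In a plugin script, print the message and pass the code to `sys.exit`.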

Learning Milestones

  1. Basic: Active/standby configuration documented
  2. Intermediate: Automated failover working
  3. Advanced: Zero-downtime maintenance procedures

Project 19: Integration with External Tools

  • File: NAGIOS_MONITORING_MASTERY.md
  • Main Programming Language: Various (API integrations)
  • Difficulty: Level 4: Expert
  • Knowledge Area: Integration, APIs
  • Software or Tool: PagerDuty, ServiceNow, Ansible, Terraform
  • Main Book: Tool-specific documentation

What you’ll build: Integrations between Nagios and incident management, CMDB, and automation tools.

Why it teaches the concept: Nagios doesn’t exist in isolation. Integration with the broader toolchain is essential.

┌─────────────────────────────────────────────────────────┐
│          NAGIOS INTEGRATION LANDSCAPE                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Incident Management:                                    │
│  ┌──────────────┐                                       │
│  │   Nagios     │──Notification──►│ PagerDuty │         │
│  │              │                 │ OpsGenie  │         │
│  │              │                 │ VictorOps │         │
│  └──────────────┘                                       │
│                                                          │
│  Ticketing/ITSM:                                         │
│  ┌──────────────┐                                       │
│  │   Nagios     │──Create Ticket─►│ ServiceNow│         │
│  │              │                 │ Jira      │         │
│  │              │                 │ RT        │         │
│  └──────────────┘                                       │
│                                                          │
│  Configuration Management:                               │
│  ┌──────────────┐                                       │
│  │  Ansible/    │──Generate──►│ Nagios Config│          │
│  │  Puppet/     │  Config    │              │          │
│  │  Terraform   │            │              │          │
│  └──────────────┘            └──────────────┘          │
│                                                          │
│  Logging & Metrics:                                      │
│  ┌──────────────┐                                       │
│  │   Nagios     │──perfdata──►│ Graphite/   │          │
│  │              │             │ InfluxDB    │          │
│  │              │──alerts────►│ ELK Stack   │          │
│  └──────────────┘                                       │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Expert | Time estimate: 8-12 hours | Prerequisites: Projects 1-8, familiarity with target tools

Real World Outcome

  • Alerts automatically create PagerDuty incidents
  • Incident tickets created in ServiceNow
  • Nagios configuration generated from Terraform
  • Performance data visualized in Grafana
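
As an illustration of the first outcome, a notification command can translate Nagios macros (`$HOSTNAME$`, `$SERVICEDESC$`, `$SERVICESTATE$`, `$SERVICEOUTPUT$`) into a PagerDuty Events API v2 event. The sketch below only builds the JSON body and leaves the HTTP POST to the caller. The state-to-severity mapping and the `host/service` dedup key are assumptions for this example, not a PagerDuty or Nagios convention.

```python
import json

# Assumed mapping from Nagios service states to Events API v2 severities.
_SEVERITY = {
    "CRITICAL": "critical",
    "WARNING": "warning",
    "UNKNOWN": "error",
    "OK": "info",
}

def pagerduty_event(routing_key, host, service, state, output):
    """Build an Events API v2 body from Nagios notification macros."""
    return {
        "routing_key": routing_key,
        # A recovery resolves the incident opened by the earlier problem.
        "event_action": "resolve" if state == "OK" else "trigger",
        # Same key for problem and recovery so the events are correlated.
        "dedup_key": f"{host}/{service}",
        "payload": {
            "summary": f"{service} on {host} is {state}: {output}"[:1024],
            "source": host,
            "severity": _SEVERITY.get(state, "error"),
        },
    }

if __name__ == "__main__":
    # "EXAMPLE_KEY" is a placeholder, not a real integration key.
    print(json.dumps(pagerduty_event("EXAMPLE_KEY", "web01", "HTTP",
                                     "CRITICAL", "Connection refused")))
```

The resulting JSON would be POSTed to `https://events.pagerduty.com/v2/enqueue`. Keeping payload construction separate from transport makes the mapping easy to unit-test without network access.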

The Core Question You’re Answering

“How do I make Nagios work with my existing operational tools?”

The Interview Questions They’ll Ask

  1. “How would you integrate Nagios with your incident management tool?”
  2. “How do you auto-generate Nagios configuration from your CMDB?”
  3. “How do you correlate Nagios alerts with logs in ELK?”
  4. “What’s the best approach for Nagios + Prometheus coexistence?”

Learning Milestones

  1. Basic: One external integration working
  2. Intermediate: Bidirectional integration (e.g., an acknowledgement in PagerDuty is reflected in Nagios)
  3. Advanced: Configuration-as-code pipeline
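
A configuration-as-code pipeline usually reduces to rendering object definitions from an inventory and validating them before reload. A minimal sketch of the rendering step follows. The inventory shape is hypothetical, and `linux-server` is assumed to be a host template that exists in your configuration (it ships in the sample `templates.cfg`).

```python
def render_host(name, address, template="linux-server", hostgroups=()):
    """Render one Nagios host definition as text."""
    lines = [
        "define host {",
        f"    use         {template}",
        f"    host_name   {name}",
        f"    address     {address}",
    ]
    if hostgroups:
        lines.append(f"    hostgroups  {','.join(hostgroups)}")
    lines.append("}")
    return "\n".join(lines) + "\n"

def render_inventory(hosts):
    """Render all hosts, sorted by name so generated files diff cleanly."""
    ordered = sorted(hosts, key=lambda h: h["name"])
    return "\n".join(render_host(**h) for h in ordered)

if __name__ == "__main__":
    inventory = [  # hypothetical data, e.g. exported from a CMDB
        {"name": "web02", "address": "10.0.0.6", "hostgroups": ("web",)},
        {"name": "web01", "address": "10.0.0.5", "hostgroups": ("web", "prod")},
    ]
    print(render_inventory(inventory))
```

The pipeline would write this output to a file under the configuration directory, run `nagios -v` against the main config file, and reload only if verification passes.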

Project Comparison Table

Project                 Difficulty    Time    Key Skill       Real-World Value
1. Install from Source  Intermediate  4-6h    Linux admin     Foundation
2. Config Structure     Intermediate  3-4h    Configuration   Organization
3. Local Services       Beginner      2-3h    Plugin usage    Basic monitoring
4. NRPE Remote          Advanced      4-6h    Networking      Remote monitoring
5. Custom Scripts       Intermediate  4-6h    Scripting       Extensibility
6. Host/Service Groups  Intermediate  2-3h    Configuration   Scalability
7. Timeperiods          Intermediate  2-3h    Scheduling      Operations
8. Notifications        Advanced      4-6h    Integration     Alerting
9. Escalations          Advanced      3-4h    Workflow        On-call
10. Event Handlers      Expert        6-8h    Automation      Self-healing
11. Passive/NSCA        Advanced      4-6h    Push model      Special cases
12. Performance Data    Advanced      6-8h    Metrics         Trending
13. Plugin Development  Advanced      8-10h   Programming     Extension
14. Windows Hosts       Advanced      4-6h    Cross-platform  Enterprise
15. SNMP Devices        Expert        6-8h    Network         Infrastructure
16. Log Monitoring      Advanced      4-6h    Log analysis    Proactive
17. Dashboards          Advanced      6-8h    Visualization   UX
18. High Availability   Master        12-16h  Clustering      Reliability
19. Integration         Expert        8-12h   APIs            Ecosystem

Summary

#   Project               Core Skill            Prerequisites
1   Install from Source   Linux compilation     Linux basics
2   Config Structure      Object model          Project 1
3   Local Services        Plugin usage          Project 2
4   NRPE Remote           Remote execution      Project 3
5   Custom Scripts        Plugin development    Scripting
6   Host/Service Groups   Organization          Project 4
7   Timeperiods           Scheduling            Project 4
8   Notifications         Multi-channel alerts  Project 4
9   Escalations           Workflow routing      Project 8
10  Event Handlers        Auto-remediation      Project 4
11  Passive/NSCA          Push monitoring       Project 4
12  Performance Data      Metrics graphing      Project 5
13  Plugin Development    Production plugins    Project 5
14  Windows Hosts         Cross-platform        Project 4
15  SNMP Devices          Network monitoring    Project 4
16  Log Monitoring        Log analysis          Project 5
17  Dashboards            Visualization         Project 4
18  High Availability     Clustering            Projects 1-10
19  Integration           APIs                  Projects 1-8

What You’ll Achieve

After completing this learning path, you will be able to:

  1. Install and configure Nagios Core from source with confidence
  2. Design monitoring for complex infrastructure (Linux, Windows, network devices)
  3. Write custom plugins that follow best practices
  4. Configure notifications with escalations and on-call rotations
  5. Implement auto-remediation with event handlers
  6. Build dashboards for different stakeholders
  7. Scale Nagios with high availability configurations
  8. Integrate with incident management and automation tools

Most importantly, you will understand the principles behind monitoring that apply to any tool - not just Nagios.


Additional Resources

Official Documentation

  • Nagios Core Documentation: https://assets.nagios.com/downloads/nagioscore/docs/
  • Nagios Plugins Guidelines: https://nagios-plugins.org/doc/guidelines.html
  • NRPE Documentation: https://github.com/NagiosEnterprises/nrpe

Community Resources

  • Nagios Exchange: https://exchange.nagios.org/ (thousands of plugins)
  • Nagios Library: https://library.nagios.com/
  • Monitoring Portal: https://www.monitoring-portal.org/

Books

  • “Nagios Core Beginner’s Guide” by Eric Loyd - Packt Publishing
  • “Nagios: System and Network Monitoring” by Wolfgang Barth - No Starch Press
  • “Essential SNMP” by Douglas Mauro & Kevin Schmidt - O’Reilly

Related Tools

  • Prometheus - Modern pull-based monitoring with PromQL
  • Grafana - Advanced visualization
  • Icinga - Nagios fork with modern features
  • Zabbix - All-in-one monitoring platform
  • OpenTelemetry - Observability framework

Last updated: 2025-01-01 Projects: 19 | Difficulty range: Beginner to Master | Estimated total time: 100-140 hours