Nagios Monitoring Mastery: From Installation to Enterprise-Scale Monitoring

Goal: Deeply understand infrastructure monitoring by mastering Nagios Core from the ground up. You will learn how Nagios checks work at the protocol level, how the scheduling engine orchestrates thousands of checks, how notifications flow through escalation chains, and how to extend Nagios with custom plugins. By completing these projects, you will understand why monitoring systems are architected the way they are, how to debug check failures, configure complex notification routing, implement auto-remediation with event handlers, and scale monitoring to enterprise environments. You will internalize the mental model that transforms raw check results into actionable operational intelligence.

Why Nagios Matters

Nagios is the grandfather of open-source infrastructure monitoring. First released in 1999 as “NetSaint,” it pioneered concepts that every modern monitoring tool still uses: host/service checks, state machines, notification escalations, and plugin architecture. Understanding Nagios deeply means understanding the foundations of monitoring itself.

Historical Context

Before Nagios, monitoring was either:

Expensive commercial tools (HP OpenView, Tivoli) that cost tens of thousands of dollars
Custom scripts that checked things ad-hoc with no centralized view
Nothing at all - operators discovered problems when users complained

Ethan Galstad created Nagios to solve a real problem: he needed to monitor a network of Linux servers and couldn’t afford commercial solutions. The result was a monitoring philosophy that remains relevant 25 years later:

Plugins are external - Any script that returns 0/1/2/3 is a valid check
Configuration is declarative - Define what you want to monitor, not how to monitor it
State is explicit - OK, WARNING, CRITICAL, UNKNOWN with clear transitions
Notifications are programmable - Send alerts however you want

Industry Adoption

While newer tools (Prometheus, Datadog, Zabbix) have emerged, Nagios remains foundational:

30,000+ organizations actively use Nagios
10,000+ plugins available in the Nagios Exchange
Industry standard for traditional infrastructure monitoring
Base for derivatives - Icinga and Naemon forked from Nagios; Shinken reimplemented its design in Python

Real-World Impact

Before Monitoring                    With Nagios
┌─────────────────────┐             ┌─────────────────────┐
│ User: "Site is down"│             │ Alert: Web server   │
│ Admin: "Let me SSH" │             │ response time > 5s  │
│ (30 minutes later)  │             │ (detected in 60s)   │
│ Admin: "Fixed it"   │             │ Event handler:      │
│ User: "Finally!"    │             │ restart_httpd.sh    │
└─────────────────────┘             │ (auto-fixed in 90s) │
                                    └─────────────────────┘
MTTR: 30+ minutes                   MTTR: 90 seconds

The Nagios Mental Model

Understanding Nagios requires internalizing its core architecture:

                    ┌─────────────────────────────────────────────┐
                    │              NAGIOS CORE                     │
                    │  ┌─────────────────────────────────────┐    │
                    │  │         SCHEDULING ENGINE           │    │
                    │  │  (when to run which checks)         │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │         CHECK EXECUTION             │    │
                    │  │  (run plugins, collect results)     │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │         STATE ENGINE                │    │
                    │  │  (OK→WARNING→CRITICAL transitions)  │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │      NOTIFICATION ENGINE            │    │
                    │  │  (who to notify, when, how)         │    │
                    │  └──────────────┬──────────────────────┘    │
                    │                 │                            │
                    │  ┌──────────────▼──────────────────────┐    │
                    │  │         EVENT HANDLERS              │    │
                    │  │  (auto-remediation scripts)         │    │
                    │  └─────────────────────────────────────┘    │
                    └─────────────────────────────────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            │                         │                         │
            ▼                         ▼                         ▼
    ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
    │   PLUGIN      │         │   NRPE        │         │   NSCA        │
    │ check_http    │         │ (Remote       │         │ (Passive      │
    │ check_ping    │         │  checks)      │         │  checks)      │
    │ check_disk    │         │               │         │               │
    └───────────────┘         └───────────────┘         └───────────────┘
            │                         │                         │
            ▼                         ▼                         ▼
    ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
    │ Local Host    │         │ Remote Host   │         │ Remote Host   │
    │ (Nagios runs  │         │ (NRPE daemon  │         │ (sends data   │
    │  checks here) │         │  runs checks) │         │  proactively) │
    └───────────────┘         └───────────────┘         └───────────────┘

Why Nagios Before Modern Tools?

You might wonder: “Why learn Nagios when Prometheus exists?”

The answer is foundational understanding:

Concept	Nagios Teaches	Modern Equivalent
Check execution	Fork/exec model, exit codes	Prometheus scraping
State machines	Hard/soft states, flapping	Alertmanager grouping
Plugins	Standard interface (0/1/2/3)	Exporters
Notifications	Contact groups, escalations	PagerDuty integrations
Configuration	Object inheritance	Labels and selectors
Passive checks	Push model	Pushgateway
Remote execution	NRPE	Node exporter

Learning Nagios deeply teaches you why these patterns exist, not just how to configure them in modern tools.

Prerequisites & Background Knowledge

Essential Prerequisites

Before starting, you should have:

Linux command line proficiency
- Navigate filesystems, edit files with vim/nano
- Understand file permissions and ownership
- Use systemd (systemctl start/stop/status)
- Read log files with tail, grep, less
Basic networking knowledge
- TCP/IP fundamentals (ports, connections)
- DNS resolution basics
- HTTP request/response cycle
- ICMP (ping) and how it works
Shell scripting basics
- Write simple bash scripts
- Use variables, conditionals, loops
- Understand exit codes
- Parse command output

Helpful but Not Required

Previous experience with any monitoring tool
Web server configuration (Apache/Nginx)
SNMP protocol knowledge
Database administration basics

Self-Assessment Questions

Before starting, verify you can answer:

“What happens when you run ping google.com?”
“How do you check if a service is running on Linux?”
“What does exit code 0 mean in a shell script?”
“How do you configure a firewall rule in iptables or firewalld?”
“What is the difference between TCP and UDP?”

If you struggle with these, spend time on Linux fundamentals first.

Development Environment Setup

Option 1: VirtualBox/Vagrant (Recommended)

┌─────────────────────────────────────────────────┐
│                   Your Laptop                    │
│  ┌────────────────┐    ┌────────────────┐       │
│  │ VM: nagios-srv │    │ VM: client-01  │       │
│  │ 192.168.56.10  │    │ 192.168.56.11  │       │
│  │ Nagios Core    │    │ NRPE Agent     │       │
│  │ Apache         │    │ Test services  │       │
│  └────────────────┘    └────────────────┘       │
│           │                    │                 │
│           └────────────────────┘                 │
│            Host-only network                     │
└─────────────────────────────────────────────────┘

Option 2: Cloud VMs (AWS/GCP/DigitalOcean)

2 small VMs (t2.micro equivalent)
Security groups allowing ports 80, 443, 5666
SSH access configured

Option 3: Docker Containers

Good for quick testing
Less realistic for learning production patterns
Use docker-compose for multi-container setup

Time Investment Expectations

Learning Path	Time	Projects
Quick overview	2 weeks	1-5
Solid foundation	4 weeks	1-10
Comprehensive mastery	8 weeks	1-18
Enterprise expert	12 weeks	All projects

Reality Check

This guide does not provide:

Copy-paste configurations (you must understand each directive)
GUI-based shortcuts (we focus on configuration files)
Nagios XI (commercial version) - we use Nagios Core (open source)

You will:

Read man pages and documentation
Debug configuration errors
Write shell scripts
Break things and fix them

Core Concept Analysis

1. The Plugin Model

Nagios plugins are the heart of monitoring. They are simple executables that follow this contract:

┌─────────────────────────────────────────────────────────┐
│                    PLUGIN CONTRACT                       │
├─────────────────────────────────────────────────────────┤
│  INPUT:                                                  │
│    - Command-line arguments (thresholds, targets)        │
│    - Environment variables (optional)                    │
│                                                          │
│  OUTPUT:                                                 │
│    - STDOUT: Human-readable message + performance data   │
│    - EXIT CODE: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN  │
│                                                          │
│  EXAMPLE:                                                │
│    $ ./check_disk -w 20% -c 10% -p /                    │
│    DISK OK - free space: / 45678 MB (72%)|/=17890MB     │
│    $ echo $?                                             │
│    0                                                     │
└─────────────────────────────────────────────────────────┘

Why this matters: Any language, any script - if it follows this contract, Nagios can use it. This is the ultimate in extensibility.
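
The contract above can be sketched as a small shell function. This is a minimal illustration, not one of the standard plugins: the name check_percent_used, its three arguments, and the "usage" performance-data label are hypothetical.

```shell
#!/usr/bin/env bash
# Minimal plugin sketch: compare a "percent used" value against thresholds.
# check_percent_used and the "usage" label are hypothetical names.

check_percent_used() {
    local value="$1" warn="$2" crit="$3" arg

    # Input the plugin cannot interpret is its own problem: report UNKNOWN (3).
    for arg in "$value" "$warn" "$crit"; do
        case "$arg" in
            ''|*[!0-9]*) echo "USAGE UNKNOWN - expected three whole numbers"; return 3 ;;
        esac
    done

    # Text after the pipe is performance data: label=value;warn;crit;min;max
    local perfdata="usage=${value}%;${warn};${crit};0;100"
    if [ "$value" -ge "$crit" ]; then
        echo "USAGE CRITICAL - ${value}% used|${perfdata}"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "USAGE WARNING - ${value}% used|${perfdata}"; return 1
    fi
    echo "USAGE OK - ${value}% used|${perfdata}"; return 0
}

# A standalone plugin file would end with: check_percent_used "$@"; exit $?
check_percent_used 72 80 90
echo "exit code: $?"
```

Nagios reads only the first line of output and the exit status, so the same function could wrap any measurement that can be reduced to a number.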

2. Object Configuration Model

Nagios uses an object-oriented configuration model:

┌─────────────────────────────────────────────────────────┐
│                OBJECT HIERARCHY                          │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  timeperiod ──────┐                                     │
│                   │                                     │
│  command ─────────┼──► service ──► servicegroup         │
│                   │       │                              │
│  contact ─────────┼───────┼──► contactgroup             │
│                   │       │                              │
│  host ────────────┴───────┘                              │
│    │                                                     │
│    └──► hostgroup                                        │
│                                                          │
│  Templates can be inherited at any level                 │
│                                                          │
└─────────────────────────────────────────────────────────┘

Configuration Inheritance Example:

┌─────────────────────────────────────────────────────────┐
│                                                          │
│  define host {                                           │
│      name                    linux-server-template       │
│      check_period            24x7                        │
│      notification_period     24x7                        │
│      register                0    ; template only        │
│  }                                                       │
│                                                          │
│  define host {                                           │
│      use                     linux-server-template       │
│      host_name               web-server-01               │
│      address                 192.168.1.10                │
│  }                                                       │
│                                                          │
└─────────────────────────────────────────────────────────┘

3. State Machine

Nagios implements a sophisticated state machine for each host and service:

                        ┌─────────────────────────────────┐
                        │         STATE MACHINE           │
                        └─────────────────────────────────┘

    ┌───────────────┐                     ┌───────────────┐
    │   SOFT STATE  │  max_check_attempts │  HARD STATE   │
    │ (transitional)│ ─────────────────► │  (confirmed)  │
    └───────────────┘                     └───────────────┘

    Example: max_check_attempts = 3

    Check 1: CRITICAL  ──► SOFT CRITICAL (1/3)
    Check 2: CRITICAL  ──► SOFT CRITICAL (2/3)
    Check 3: CRITICAL  ──► HARD CRITICAL (3/3) ──► NOTIFY!

    Why soft states exist:
    - Prevent alert storms from transient issues
    - Allow recovery before notification
    - Reduce false positives

    ┌─────────────────────────────────────────────────────┐
    │              STATE TRANSITIONS                       │
    │                                                      │
    │   OK ◄─────────────────────────────────► WARNING    │
    │   ▲                                         ▲        │
    │   │                                         │        │
    │   │                                         │        │
    │   ▼                                         ▼        │
    │ UNKNOWN ◄─────────────────────────────► CRITICAL    │
    │                                                      │
    └─────────────────────────────────────────────────────┘
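
The soft-to-hard progression can be simulated in a few lines of shell. This is a deliberately simplified sketch: it only distinguishes OK from non-OK results and ignores flapping, host state, and changes between different non-OK states. The function name process_result and the message wording are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified soft/hard state tracking for a single service.
# Assumption: only OK vs non-OK is modeled.

max_check_attempts=3
attempt=0          # consecutive non-OK results counted so far
state_type=HARD    # a healthy service sits in a HARD OK state

process_result() {
    local result="$1"

    if [ "$result" = "OK" ]; then
        if [ "$attempt" -ge "$max_check_attempts" ]; then
            echo "OK -> HARD OK (recovery, notify)"
        elif [ "$attempt" -gt 0 ]; then
            echo "OK -> SOFT recovery (no notification)"
        else
            echo "OK -> HARD OK"
        fi
        attempt=0; state_type=HARD
        return 0
    fi

    # Already confirmed as a hard problem: nothing changes.
    if [ "$state_type" = "HARD" ] && [ "$attempt" -ge "$max_check_attempts" ]; then
        echo "$result -> HARD $result (unchanged)"
        return 0
    fi

    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_check_attempts" ]; then
        state_type=HARD
        echo "$result -> HARD $result ($attempt/$max_check_attempts), notify"
    else
        state_type=SOFT
        echo "$result -> SOFT $result ($attempt/$max_check_attempts)"
    fi
}

for r in CRITICAL CRITICAL CRITICAL CRITICAL OK; do
    process_result "$r"
done
```

Feeding it CRITICAL, CRITICAL, OK instead shows the case soft states exist for: the problem clears before the third attempt and no notification is produced.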

4. Check Scheduling

Nagios must efficiently schedule checks across thousands of hosts/services:

┌─────────────────────────────────────────────────────────┐
│               CHECK SCHEDULING                           │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  check_interval = 5    (minutes between checks)          │
│  retry_interval = 1    (minutes between retries)         │
│                                                          │
│  Timeline (service goes CRITICAL at T=0):                │
│                                                          │
│  T=0   Check: CRITICAL  ──► SOFT(1/3), retry in 1 min   │
│  T=1   Check: CRITICAL  ──► SOFT(2/3), retry in 1 min   │
│  T=2   Check: CRITICAL  ──► HARD(3/3), NOTIFY!          │
│  T=7   Check: CRITICAL  ──► still HARD                  │
│  T=12  Check: OK        ──► RECOVERY, NOTIFY!           │
│  T=17  Check: OK        ──► normal interval resumes     │
│                                                          │
│  Scheduling optimization:                                │
│  - Interleaving: spread checks across time              │
│  - Parallelization: run multiple checks simultaneously   │
│  - Freshness: detect stale passive check results         │
│                                                          │
└─────────────────────────────────────────────────────────┘
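
The timeline reduces to simple arithmetic, sketched below with two hypothetical helper functions. The sketch assumes intervals are in minutes (the default interval length of 60 seconds) and ignores plugin execution time and scheduling latency.

```shell
#!/usr/bin/env bash
# Minutes between the first failed check and the HARD state:
# the first failure counts as attempt 1, each retry adds retry_interval.
minutes_to_hard_state() {
    local max_check_attempts="$1" retry_interval="$2"
    echo $(( (max_check_attempts - 1) * retry_interval ))
}

# Worst case measured from the start of the outage: the failure can begin
# right after a successful check, so up to one check_interval passes first.
worst_case_detection() {
    local check_interval="$1" max_check_attempts="$2" retry_interval="$3"
    echo $(( check_interval + (max_check_attempts - 1) * retry_interval ))
}

minutes_to_hard_state 3 1      # prints 2, matching T=2 in the timeline
worst_case_detection 5 3 1     # prints 7
```

Lowering retry_interval shortens the confirmation phase, while lowering check_interval shortens the wait before the first failure is seen at all.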

5. Notification Flow

Notifications follow a complex decision tree:

┌─────────────────────────────────────────────────────────┐
│               NOTIFICATION FLOW                          │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Check result: HARD CRITICAL                             │
│        │                                                 │
│        ▼                                                 │
│  ┌─────────────────┐                                    │
│  │ Notifications   │ NO                                 │
│  │ enabled?        ├────────► No notification           │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ Within time     │ NO                                 │
│  │ period?         ├────────► No notification           │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ Contact wants   │ NO                                 │
│  │ this state?     ├────────► No notification           │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ Escalation      │                                    │
│  │ applies?        │                                    │
│  └────────┬────────┘                                    │
│           │                                              │
│           ▼                                              │
│     SEND NOTIFICATION                                    │
│                                                          │
└─────────────────────────────────────────────────────────┘
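
The decision tree can be written as a chain of early exits. The sketch below covers only the filters drawn above; real Nagios also considers downtime, acknowledgements, flapping, notification intervals, and escalations. The function should_notify and its argument order are hypothetical; the option letters (w, u, c, r) follow the service_notification_options directive.

```shell
#!/usr/bin/env bash
# Decide whether one contact is notified about one service result.
#   $1 state type (SOFT|HARD)        $2 notifications enabled (0|1)
#   $3 inside notification period (0|1)
#   $4 service state (OK|WARNING|CRITICAL|UNKNOWN)
#   $5 contact's notification options, e.g. "w,c,r"
should_notify() {
    local state_type="$1" enabled="$2" in_period="$3" state="$4" options="$5" letter

    [ "$state_type" = "HARD" ] || { echo "no: soft state"; return 1; }
    [ "$enabled" = "1" ]       || { echo "no: notifications disabled"; return 1; }
    [ "$in_period" = "1" ]     || { echo "no: outside notification period"; return 1; }

    case "$state" in
        WARNING)  letter=w ;;
        CRITICAL) letter=c ;;
        UNKNOWN)  letter=u ;;
        OK)       letter=r ;;   # recovery
        *) echo "no: unrecognized state"; return 1 ;;
    esac

    # Wrap both sides in commas so "c" cannot match inside another token.
    case ",$options," in
        *",$letter,"*) echo "yes: send notification"; return 0 ;;
    esac
    echo "no: contact filters out $state"
    return 1
}

should_notify HARD 1 1 CRITICAL "w,c,r"
should_notify HARD 1 1 UNKNOWN "w,c,r" || true
```

Each filter returns as soon as it fails, which is also why a missing notification is usually debugged by walking the same list from top to bottom.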

6. Active vs Passive Checks

┌─────────────────────────────────────────────────────────┐
│             ACTIVE CHECKS (Pull Model)                   │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Nagios ────► Run plugin ────► Get result               │
│                                                          │
│  - Nagios controls timing                                │
│  - Nagios initiates connection                           │
│  - Good for: scheduled monitoring                        │
│                                                          │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│             PASSIVE CHECKS (Push Model)                  │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  External ────► Send result ────► Nagios accepts         │
│                                                          │
│  - External system controls timing                       │
│  - External system initiates connection                  │
│  - Good for: security events, async jobs,               │
│    firewalled hosts                                     │
│                                                          │
└─────────────────────────────────────────────────────────┘
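
On the Nagios server itself, a passive result is one line written to the external command file. The sketch below formats such a line with the PROCESS_SERVICE_CHECK_RESULT external command. The helper names, the NAGIOS_CMD_FILE variable, and the "Nightly Backup" service are hypothetical, and the default path is only the usual location for source installs (see command_file in nagios.cfg).

```shell
#!/usr/bin/env bash
# Format one passive service result in external-command syntax:
# [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
format_passive_result() {
    local host="$1" service="$2" code="$3" output="$4"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$host" "$service" "$code" "$output"
}

# Append the result to the command file. Assumed default path; override
# with NAGIOS_CMD_FILE. Writing to the real pipe blocks if Nagios is down.
submit_passive_result() {
    local command_file="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"
    format_passive_result "$@" >> "$command_file"
}

# Preview only; nothing is written to the command file here.
format_passive_result web-server-01 "Nightly Backup" 0 "Backup OK - finished in 312s"
```

For the result to be accepted, external commands must be enabled in nagios.cfg and the service must allow passive checks. Remote hosts cannot reach the command file directly, which is the gap NSCA fills.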

Concept Summary Table

Concept	What You Must Internalize
Plugin Model	Any executable returning 0/1/2/3 with STDOUT message is a valid check
Object Inheritance	Templates reduce duplication; `use` directive inherits all attributes
Soft/Hard States	Soft states prevent false alerts; hard states trigger notifications
Check Scheduling	check_interval for normal, retry_interval during problems
Notification Logic	Many conditions must be true before a notification sends
Active vs Passive	Active = Nagios pulls; Passive = an external system pushes results (via NSCA or the external command file)
NRPE	Runs local checks on remote hosts securely
Escalations	Route notifications based on problem duration
Event Handlers	Auto-remediation scripts triggered by state changes
Flapping	Detection of rapid state oscillation

Deep Dive Reading by Concept

Nagios Architecture

Concept	Resource
Overall architecture	Nagios Core Documentation - Chapter 5: Basic Concepts
Plugin development	Nagios Plugin Development Guidelines - nagios-plugins.org
Configuration structure	Nagios Core Documentation - Chapter 3: Configuration Overview

Monitoring Fundamentals

Concept	Book & Chapter
Monitoring philosophy	“The Practice of System and Network Administration” by Limoncelli - Ch. 22
Alert design	“Site Reliability Engineering” by Beyer et al. - Ch. 6: Monitoring
On-call practices	“On-Call” by Jones - Chapters 1-3

Network Monitoring

Concept	Book & Chapter
SNMP protocol	“Essential SNMP” by Mauro & Schmidt - Ch. 1-4
Network monitoring	“Network Warrior” by Donahue - Ch. 24
TCP/IP fundamentals	“TCP/IP Illustrated, Vol. 1” by Stevens - Ch. 1-5

Scripting and Automation

Concept	Book & Chapter
Bash scripting	“Linux Command Line and Shell Scripting Bible” by Blum - Ch. 11-15
Perl plugins	“Learning Perl” by Schwartz - Ch. 1-8
Python scripting	“Automate the Boring Stuff” by Sweigart - Ch. 1-6

Essential Reading Order

Week 1: Nagios Core Documentation Chapters 1-5
Week 2: Plugin Development Guidelines + Essential SNMP Ch. 1-2
Week 3: SRE Book Chapter 6 + Limoncelli Ch. 22
Week 4: Shell scripting reference as needed

Quick Start Guide

If you feel overwhelmed, here’s your first 48 hours:

Day 1 (4 hours):

Set up VirtualBox VM with CentOS or Ubuntu
Complete Project 1 (Install from source)
Access the web interface
Understand the directory structure

Day 2 (4 hours):

Complete Project 2 (Configuration structure)
Add your first custom host
Complete Project 3 (Local service monitoring)
See your first alert

After this, you have a working Nagios system and understand the basics. Continue with Projects 4-10 for a comprehensive foundation and Projects 11-18 for advanced topics.

Recommended Learning Paths

Path 1: System Administrator (4 weeks)

Focus on practical monitoring of Linux/Windows infrastructure.

Projects: 1, 2, 3, 4, 5, 6, 8, 14
Outcome: Can monitor typical IT infrastructure with notifications

Path 2: DevOps Engineer (6 weeks)

Focus on automation, custom checks, and integration.

Projects: 1, 2, 3, 5, 6, 9, 10, 15, 18, 19
Outcome: Can integrate Nagios into CI/CD and create custom monitoring

Path 3: Monitoring Specialist (8 weeks)

Comprehensive coverage of all Nagios capabilities.

Projects: All, in order
Outcome: Enterprise-scale Nagios deployment and management

Project List

Project 1: Installing Nagios Core from Source

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Bash/Shell
Difficulty: Level 2: Intermediate
Knowledge Area: Linux System Administration
Software or Tool: GCC, Make, Apache HTTPD
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: A fully functional Nagios Core installation compiled from source code, running with Apache web server and accessible via browser.

Why it teaches the concept: Compiling from source forces you to understand dependencies, directory structure, and how Nagios components interact. Package managers hide these details.

Core challenges you’ll face:

Installing build dependencies - Understanding what libraries Nagios needs (GD, OpenSSL, etc.)
Configuring the build - Using ./configure with appropriate paths
Setting up Apache integration - CGI scripts, authentication, permissions
Creating the nagios user/group - Understanding security isolation

Key Concepts:

Source compilation workflow
CGI web applications
Apache authentication (htpasswd)
Systemd service management

Difficulty: Intermediate
Time estimate: 4-6 hours
Prerequisites: Linux command line, package management, text editing

Real World Outcome

After completing this project, you will have:

$ systemctl status nagios
● nagios.service - Nagios Core Monitoring System
     Active: active (running) since Mon 2024-01-15 10:23:45 UTC
     Main PID: 12345 (nagios)

$ curl -u nagiosadmin:password http://localhost/nagios/
# Returns Nagios web interface HTML

$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Nagios Core 4.4.x
Copyright (c) 2009-present Nagios Core Development Team
...
Total Warnings: 0
Total Errors:   0

The Core Question You’re Answering

“What are the minimal components needed for a monitoring system, and how do they connect?”

Concepts You Must Understand First

What is a daemon process?
- How does it differ from a regular process?
- Why does Nagios run as a daemon?
How does Apache serve dynamic content?
- What is CGI?
- Why does Nagios use CGI instead of a modern framework?
What are file permissions and why do they matter?
- Who should own the Nagios files?
- What permissions are required for CGI execution?

Questions to Guide Your Design

Before installing:

Where should Nagios binaries be installed? Why?
What user account should Nagios run as? Why not root?
How will you back up your configuration?
How will you verify the installation succeeded?

Thinking Exercise

Before running make install, draw the directory structure you expect:

Where will the main daemon binary be?
Where will configuration files live?
Where will plugins be stored?
Where will log files go?

Verify your drawing against the actual installation.

The Interview Questions They’ll Ask

“Why would you compile from source instead of using packages?”
“How do you verify a Nagios configuration before applying it?”
“What happens if the nagios user doesn’t have permission to execute plugins?”
“How would you upgrade Nagios without losing configuration?”
“What are the security implications of running CGI scripts?”

Hints in Layers

Hint 1 - Starting Point: Begin with a minimal Linux installation. CentOS/RHEL or Ubuntu LTS are recommended.

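Hint 2 - Dependencies: You need: gcc, glibc, glibc-common, gd, gd-devel, openssl, openssl-devel, make, perl, wget (RHEL/CentOS package names; on Ubuntu the development libraries are packaged as libgd-dev and libssl-dev)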

Hint 3 - Build Process:

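# General flow (not copy-paste - understand each step)
# Assumes the nagios user and the nagcmd group already exist
./configure --with-command-group=nagcmd   # detect libraries, set the command group
make all                                  # compile the daemon and CGIs
make install                              # main program, CGIs, HTML files
make install-init                         # init script / service unit
make install-commandmode                  # permissions on the external command directory
make install-config                       # sample configuration files
make install-webconf                      # Apache configuration for the web interface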

Hint 4 - Verification:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
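
The same verification command is useful as a gate before every reload. The sketch below reloads only when validation exits successfully. The paths are the source-install defaults used throughout this guide, while the function name safe_reload and the reload command are assumptions that depend on how the service was installed.

```shell
#!/usr/bin/env bash
# Validate the configuration and reload only if validation passes.
# All three settings can be overridden from the environment.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
RELOAD_CMD="${RELOAD_CMD:-systemctl reload nagios}"   # assumed reload command

safe_reload() {
    # nagios -v exits non-zero when the pre-flight check finds errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" > /dev/null; then
        $RELOAD_CMD
    else
        echo "Validation failed; run '$NAGIOS_BIN -v $NAGIOS_CFG' for details" >&2
        return 1
    fi
}
```

Calling safe_reload after each configuration edit keeps a broken file from ever reaching the running daemon.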

Books That Will Help

Topic	Resource
Installation process	Nagios Core Beginner’s Guide - Chapter 2
Apache configuration	Apache Cookbook - O’Reilly
Linux system admin	Linux Bible - Christopher Negus

Common Pitfalls & Debugging

Problem	Root Cause	Fix
“Permission denied” on CGI	Apache user not in nagcmd group	`usermod -aG nagcmd apache` (the user is www-data on Debian/Ubuntu)
“Cannot open config file”	Wrong ownership	`chown -R nagios:nagios /usr/local/nagios`
Web interface shows blank	CGI not enabled	Enable mod_cgi in Apache
Daemon won’t start	Config error	Run `nagios -v` to validate

Learning Milestones

Basic: Nagios daemon starts and stays running
Intermediate: Web interface accessible with authentication
Advanced: Can modify nagios.cfg and reload without restart

Project 2: Understanding the Configuration File Structure

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration Language
Difficulty: Level 2: Intermediate
Knowledge Area: Configuration Management
Software or Tool: Text editor, nagios -v
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: A well-organized configuration directory with separate files for hosts, services, commands, contacts, and templates.

Why it teaches the concept: Understanding the configuration structure is essential for maintainable monitoring. Poor organization leads to configuration drift and errors.

Core challenges you’ll face:

Understanding cfg_file vs cfg_dir - How Nagios finds configuration
Object relationships - How hosts reference commands and templates
Template inheritance - Creating reusable configuration blocks
Configuration validation - Catching errors before restart

Key Concepts:

Object definition syntax
Directive inheritance
Configuration file inclusion
Macro expansion

Difficulty: Intermediate
Time estimate: 3-4 hours
Prerequisites: Project 1 completed

Real World Outcome

Your configuration directory will look like:

/usr/local/nagios/etc/
├── nagios.cfg              # Main configuration
├── cgi.cfg                 # Web interface settings
├── resource.cfg            # Sensitive variables ($USER1$)
├── objects/
│   ├── commands.cfg        # Check command definitions
│   ├── contacts.cfg        # Contact definitions
│   ├── timeperiods.cfg     # When to check/notify
│   ├── templates.cfg       # Reusable object templates
│   ├── hosts/
│   │   ├── web-servers.cfg
│   │   └── db-servers.cfg
│   └── services/
│       ├── linux-services.cfg
│       └── web-services.cfg

The Core Question You’re Answering

“How do I organize monitoring configuration so it scales to hundreds of hosts without becoming unmaintainable?”

Concepts You Must Understand First

What is object inheritance in Nagios?
- How does the use directive work?
- What is the difference between name and host_name?
What are Nagios macros?
- What is $HOSTADDRESS$ ?
- Where is $USER1$ defined?
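
Macro substitution is plain text replacement performed just before a command runs. The sketch below imitates it with sed for two macros. Nagios does this internally, so expand_macros is purely illustrative, and the command line and values are sample data ($USER1$ conventionally holds the plugin directory and is set in resource.cfg).

```shell
#!/usr/bin/env bash
# Illustrative expansion of $USER1$ and $HOSTADDRESS$ in a command line.
# Limitation of this sketch: values must not contain "|" or "&".
expand_macros() {
    local command_line="$1" user1="$2" host_address="$3"
    printf '%s\n' "$command_line" |
        sed -e 's|\$USER1\$|'"$user1"'|g' \
            -e 's|\$HOSTADDRESS\$|'"$host_address"'|g'
}

# Single quotes keep the shell from touching the macros before sed sees them.
expand_macros '$USER1$/check_ping -H $HOSTADDRESS$ -w 100.0,20% -c 500.0,60%' \
    /usr/local/nagios/libexec 192.168.1.10
```

The output is the literal command Nagios would execute for that host, which is why running the expanded line by hand as the nagios user is a common first debugging step.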

Questions to Guide Your Design

How many hosts will you monitor eventually?
Will hosts be grouped by location, function, or owner?
Who maintains which parts of the configuration?
How will you version control the configuration?

The Interview Questions They’ll Ask

“How does Nagios macro substitution work?”
“What’s the difference between a host template and a host definition?”
“How would you organize configuration for 1000 hosts?”
“How do you handle secrets in Nagios configuration?”
“What happens if two configuration files define the same object?”

Learning Milestones

Basic: Understand nagios.cfg and cfg_dir directive
Intermediate: Create templates and use inheritance
Advanced: Organize configuration for team collaboration

Project 3: Monitoring Local Services

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 1: Beginner
Knowledge Area: Service Monitoring
Software or Tool: Nagios Plugins
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Monitoring for essential local services: disk space, CPU load, memory usage, running processes, and swap usage on the Nagios server itself.

Why it teaches the concept: Local monitoring is the foundation. These same concepts apply to remote hosts via NRPE.

Core challenges you’ll face:

Understanding check commands - How plugins receive arguments
Setting thresholds - Choosing appropriate WARNING and CRITICAL values
Interpreting plugin output - Understanding performance data

Key Concepts:

Standard Nagios plugins (check_disk, check_load, check_procs)
Threshold syntax (-w and -c arguments)
Performance data format
Service check intervals

Difficulty: Beginner
Time estimate: 2-3 hours
Prerequisites: Projects 1-2 completed

Real World Outcome

# From the web interface or command line:
$ /usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
DISK OK - free space: / 45678 MB (75% inode=89%);| /=15234MB;51200;57600;0;64000

$ /usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6
OK - load average: 0.15, 0.10, 0.08|load1=0.150;5.000;10.000;0; load5=0.100;4.000;8.000;0; load15=0.080;3.000;6.000;0;

Web interface shows all services green (OK state).
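
Everything after the pipe in the plugin output above is performance data, with each item shaped as label=value[unit];warn;crit;min;max. The sketch below splits one item into its fields. The function parse_perfdata_item is hypothetical and does not handle quoted labels that contain spaces or "=".

```shell
#!/usr/bin/env bash
# Split one performance-data item: label=value[UOM];warn;crit;min;max
parse_perfdata_item() {
    local item="$1"
    local label="${item%%=*}"      # text before the first "="
    local rest="${item#*=}"        # text after the first "="
    local value_uom warn crit min max value uom

    # Trailing fields may be empty, as in "load1=0.150;5.000;10.000;0;".
    IFS=';' read -r value_uom warn crit min max <<EOF
$rest
EOF

    # Separate the number from its unit of measurement (MB, %, s, ...).
    value=$(printf '%s' "$value_uom" | sed 's/[^0-9.+-].*$//')
    uom="${value_uom#"$value"}"

    echo "label=$label value=$value uom=$uom warn=$warn crit=$crit min=$min max=$max"
}

parse_perfdata_item "load1=0.150;5.000;10.000;0;"
parse_perfdata_item "/=15234MB;51200;57600;0;64000"
```

Graphing add-ons rely on exactly this structure, so a plugin that emits malformed performance data still alerts correctly but produces no graphs.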

The Core Question You’re Answering

“How do I translate ‘disk is almost full’ into a monitoring check that alerts at the right time?”

The Interview Questions They’ll Ask

“How do you determine appropriate thresholds for disk space?”
“What’s the difference between load average values 1, 5, and 15 minutes?”
“How would you monitor a specific process?”
“What does the performance data after the pipe (|) mean?”

Learning Milestones

Basic: One service check working with correct thresholds
Intermediate: All local services monitored with sensible thresholds
Advanced: Performance data graphed over time

Project 4: Setting Up NRPE for Remote Monitoring

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 3: Advanced
Knowledge Area: Remote Monitoring, Networking
Software or Tool: NRPE (Nagios Remote Plugin Executor)
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: NRPE daemon on a remote host that executes local checks and returns results to the Nagios server.

Why it teaches the concept: NRPE is the standard way to collect metrics that can only be measured locally on a remote host (disk, CPU, processes).

Core challenges you’ll face:

NRPE installation on remote hosts - Compiling or using packages
Firewall configuration - Port 5666 must be accessible
NRPE allowed_hosts - Security configuration
Command definitions - Matching server and client

┌─────────────────────────────────────────────────────────┐
│                    NRPE ARCHITECTURE                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Nagios Server                    Remote Host            │
│  ┌──────────────┐                ┌──────────────┐       │
│  │              │                │              │       │
│  │ check_nrpe   │ ──TCP/5666──► │  nrpe daemon │       │
│  │ -H host      │                │              │       │
│  │ -c check_cmd │                │  ┌────────┐  │       │
│  │              │ ◄──result──── │  │ plugin │  │       │
│  │              │                │  └────────┘  │       │
│  └──────────────┘                └──────────────┘       │
│                                                          │
│  The command "check_load" on the server maps to          │
│  a command definition on the remote host that            │
│  runs /usr/local/nagios/libexec/check_load locally.     │
│                                                          │
└─────────────────────────────────────────────────────────┘
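The diagram describes a name lookup: the server sends only a command name, and the remote host decides what that name runs. The sketch below imitates the lookup by parsing `command[name]=...` lines, which is the form NRPE uses in nrpe.cfg. The sample config text, the server address 192.168.56.10, and the function name `parse_nrpe_commands` are assumptions for illustration.

```python
import re

# Matches lines of the form: command[check_load]=/path/to/plugin args
COMMAND_LINE = re.compile(r"^command\[(?P<name>[^\]]+)\]=(?P<cmdline>.+)$")

def parse_nrpe_commands(cfg_text):
    """Collect command definitions from nrpe.cfg-style text."""
    commands = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = COMMAND_LINE.match(line)
        if match:
            commands[match.group("name")] = match.group("cmdline")
    return commands

# Hypothetical excerpt of nrpe.cfg on the remote host.
SAMPLE_CFG = """
allowed_hosts=127.0.0.1,192.168.56.10
command[check_load]=/usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
"""

commands = parse_nrpe_commands(SAMPLE_CFG)
requested = "check_load"  # the name sent by: check_nrpe -H <host> -c check_load
print(commands.get(requested, "not defined on the remote host"))
```

Because thresholds live in the remote definition, changing them means editing the remote host's config, not the server's.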

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-3, second VM or host

Real World Outcome

# From Nagios server:
$ /usr/local/nagios/libexec/check_nrpe -H 192.168.56.11 -c check_load
OK - load average: 0.08, 0.12, 0.15|load1=0.080;5.000;10.000;0;

# Remote host is being monitored for local metrics

The Core Question You’re Answering

“How do I run local checks on remote hosts securely and efficiently?”

The Interview Questions They’ll Ask

“Why use NRPE instead of SSH?”
“How does NRPE handle authentication?”
“What are the security implications of dont_blame_nrpe?”
“How would you troubleshoot ‘CHECK_NRPE: Error - Could not complete SSL handshake’?”
“Can NRPE pass arguments to commands? Should it?”

Learning Milestones

Basic: NRPE daemon running on remote host
Intermediate: Check commands working from server to client
Advanced: Secure configuration with SSL and restricted hosts

Project 5: Creating Custom Check Scripts

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Bash, Python, or Perl
Difficulty: Level 2: Intermediate
Knowledge Area: Scripting, Plugin Development
Software or Tool: Any scripting language
Main Book: Nagios Plugin Development Guidelines

What you’ll build: Custom monitoring plugins for application-specific checks not covered by standard plugins.

Why it teaches the concept: The plugin model is Nagios’s greatest strength. Understanding it lets you monitor anything.

Core challenges you’ll face:

Plugin specification compliance - Exit codes, output format, performance data
Threshold handling - Parsing -w and -c arguments
Error handling - UNKNOWN state for errors
Timeout handling - Plugin shouldn’t hang

┌─────────────────────────────────────────────────────────┐
│                 PLUGIN SPECIFICATION                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Exit Codes:                                             │
│    0 = OK                                                │
│    1 = WARNING                                           │
│    2 = CRITICAL                                          │
│    3 = UNKNOWN                                           │
│                                                          │
│  Output Format:                                          │
│    STATUS - Message text|performance_data                │
│                                                          │
│  Example:                                                │
│    OK - Queue length is 5|queue_length=5;10;20;0;100    │
│                                                          │
│  Performance Data Format:                                │
│    label=value[UOM];warn;crit;min;max                   │
│                                                          │
│  UOM (Unit of Measure):                                  │
│    (none) = count, s = seconds, % = percentage          │
│    B/KB/MB/GB/TB = bytes, c = counter                   │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Intermediate Time estimate: 4-6 hours Prerequisites: Scripting skills, Projects 1-3

Real World Outcome

A custom plugin that checks something specific to your environment:

$ ./check_app_queue.sh -w 10 -c 20
OK - Queue length is 5|queue_length=5;10;20;0;100
$ echo $?
0

$ ./check_app_queue.sh -w 3 -c 10
WARNING - Queue length is 5|queue_length=5;3;10;0;100
$ echo $?
1
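The two runs above could come from a plugin along the following lines. This is a minimal Python sketch, not the check_app_queue.sh script itself: `read_queue_length` is a hypothetical stand-in for whatever measures the queue, the maximum of 100 is taken from the sample output, and treating a value equal to the threshold as a breach is an assumption.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def evaluate(queue_length, warn, crit, maximum=100):
    """Map a measurement to (exit code, output line with perfdata)."""
    if queue_length >= crit:      # assumption: equal counts as a breach
        state = CRITICAL
    elif queue_length >= warn:
        state = WARNING
    else:
        state = OK
    perfdata = "queue_length=%d;%d;%d;0;%d" % (queue_length, warn, crit, maximum)
    return state, "%s - Queue length is %d|%s" % (LABELS[state], queue_length, perfdata)

def main(argv, read_queue_length=lambda: 5):
    """Parse -w/-c, run the check, print one line, return the exit code."""
    try:
        warn = int(argv[argv.index("-w") + 1])
        crit = int(argv[argv.index("-c") + 1])
        state, line = evaluate(read_queue_length(), warn, crit)
    except (ValueError, IndexError) as exc:
        # Bad or missing arguments are reported as UNKNOWN, not CRITICAL.
        state, line = UNKNOWN, "UNKNOWN - %s" % exc
    print(line)
    return state

# A real script would end with: sys.exit(main(sys.argv[1:]))
main(["-w", "10", "-c", "20"])
main(["-w", "3", "-c", "10"])
```

Keeping `evaluate` free of I/O makes the threshold logic testable without a running application.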

The Core Question You’re Answering

“How do I extend Nagios to monitor something it doesn’t monitor out of the box?”

The Interview Questions They’ll Ask

“What makes a Nagios plugin valid?”
“How do you handle timeouts in custom plugins?”
“What should happen if a plugin can’t connect to what it’s checking?”
“How do you test a plugin before deploying it?”
“What is the performance data format and why does it matter?”

Learning Milestones

Basic: Plugin returns correct exit codes
Intermediate: Plugin parses threshold arguments correctly
Advanced: Plugin includes performance data for graphing

Project 6: Host and Service Groups

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 2: Intermediate
Knowledge Area: Configuration Organization
Software or Tool: Nagios Core
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Logical groupings of hosts and services for organized display and notification routing.

Why it teaches the concept: Groups are essential for managing monitoring at scale. They affect both display and notification behavior.

Core challenges you’ll face:

Defining meaningful groups - By location, function, owner, or criticality
Group membership methods - Via the members directive in the hostgroup definition or the hostgroups directive in the host definition
Services on groups - Applying a service to all hosts in a group with hostgroup_name
Notification via groups - Contact groups for different teams

Difficulty: Intermediate Time estimate: 2-3 hours Prerequisites: Projects 1-4, multiple hosts configured

Real World Outcome

# Web interface shows organized groups:

Hostgroups:
├── web-servers (3 hosts)
├── db-servers (2 hosts)
├── production (4 hosts)
└── development (1 host)

Servicegroups:
├── disk-checks (15 services)
├── http-checks (3 services)
└── database-checks (4 services)

The Core Question You’re Answering

“How do I organize monitoring so that operators can quickly find what they’re looking for?”

The Interview Questions They’ll Ask

“What’s the difference between hostgroups and servicegroups?”
“How do you assign a service to all hosts in a group?”
“Can a host be in multiple groups? When is this useful?”
“How do groups affect notification routing?”

Learning Milestones

Basic: Create hostgroups and assign hosts
Intermediate: Use hostgroup_name to apply services to groups
Advanced: Servicegroups for cross-host service views

Project 7: Timeperiods and Scheduling

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 2: Intermediate
Knowledge Area: Scheduling, Time-Based Logic
Software or Tool: Nagios Core
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Custom timeperiods for business hours, maintenance windows, and on-call schedules.

Why it teaches the concept: Real monitoring requires time-awareness: check less often at night, notify different people on weekends.

Core challenges you’ll face:

Timeperiod syntax - Date and time range formats
Exclusions - Excluding holidays from business hours
Overlapping periods - Combining multiple timeperiods
Check vs notification periods - Different uses for timeperiods

Difficulty: Intermediate Time estimate: 2-3 hours Prerequisites: Projects 1-4

Real World Outcome

# Example timeperiods:

define timeperiod {
    timeperiod_name     business_hours
    alias               Monday-Friday 9AM-5PM
    monday              09:00-17:00
    tuesday             09:00-17:00
    wednesday           09:00-17:00
    thursday            09:00-17:00
    friday              09:00-17:00
}

define timeperiod {
    timeperiod_name     non_business_hours
    alias               Outside Business Hours
    monday              00:00-09:00,17:00-24:00
    tuesday             00:00-09:00,17:00-24:00
    ...
    saturday            00:00-24:00
    sunday              00:00-24:00
}
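To reason about which moments a definition covers, it can help to evaluate it outside Nagios. The Python sketch below checks a timestamp against weekday ranges written in the same HH:MM-HH:MM form. It is an illustration, not how Nagios itself evaluates timeperiods: the function names are hypothetical, the end of each range is treated as exclusive by assumption, and `PARTIAL_NON_BUSINESS` copies only the Monday and Saturday lines shown above.

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

# business_hours from the definition above: Monday to Friday, 09:00-17:00.
BUSINESS_HOURS = {day: "09:00-17:00" for day in DAYS[:5]}

# Only the two days reproduced here; the other days are left out on purpose.
PARTIAL_NON_BUSINESS = {
    "monday": "00:00-09:00,17:00-24:00",
    "saturday": "00:00-24:00",
}

def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight; '24:00' becomes 1440."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def in_timeperiod(period, moment):
    """Return True if moment falls inside one of that weekday's ranges."""
    ranges = period.get(DAYS[moment.weekday()])  # weekday(): Monday is 0
    if ranges is None:
        return False  # a day with no entry is never inside the period
    now = moment.hour * 60 + moment.minute
    for item in ranges.split(","):
        start, end = item.split("-")
        if to_minutes(start) <= now < to_minutes(end):  # end exclusive (assumption)
            return True
    return False

print(in_timeperiod(BUSINESS_HOURS, datetime(2024, 1, 1, 10, 0)))  # a Monday morning
print(in_timeperiod(BUSINESS_HOURS, datetime(2024, 1, 6, 10, 0)))  # a Saturday morning
```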

The Core Question You’re Answering

“How do I make monitoring behave differently based on time of day or week?”

The Interview Questions They’ll Ask

“What’s the difference between check_period and notification_period?”
“How do you handle maintenance windows?”
“How do you exclude holidays from a timeperiod?”
“What happens if a check runs outside its check_period?”

Learning Milestones

Basic: Define business hours timeperiod
Intermediate: Apply different timeperiods to checks and notifications
Advanced: Handle holidays and maintenance windows

Project 8: Notification Commands (Email, Slack, SMS)

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Bash, Nagios Configuration
Difficulty: Level 3: Advanced
Knowledge Area: Notifications, Integration
Software or Tool: sendmail/postfix, curl, custom scripts
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Multi-channel notifications that send alerts via email, Slack, and optionally SMS.

Why it teaches the concept: Notifications are useless if they don’t reach the right people in the right way.

Core challenges you’ll face:

Email configuration - SMTP setup, HTML vs plain text
Slack webhook integration - Formatting messages for Slack
SMS integration - Using services like Twilio
Command macros - Using Nagios macros in notification commands

┌─────────────────────────────────────────────────────────┐
│              NOTIFICATION FLOW                           │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  State Change (HARD CRITICAL)                            │
│        │                                                 │
│        ▼                                                 │
│  Contact Definition ────► notification_commands          │
│        │                     │                           │
│        │                     ▼                           │
│        │              notify-host-by-email               │
│        │              notify-host-by-slack               │
│        │                     │                           │
│        ▼                     ▼                           │
│  Contact gets email AND Slack message                    │
│                                                          │
│  Macros available:                                       │
│  $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$                │
│  $SERVICEOUTPUT$ $LONGDATETIME$ $CONTACTEMAIL$          │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, working email server or Slack workspace

Real World Outcome

# Email notification received:
Subject: ** PROBLEM: web-server-01/HTTP is CRITICAL **

***** Nagios *****
Notification Type: PROBLEM
Host: web-server-01
Service: HTTP
State: CRITICAL
Output: Connection refused

# Slack message in #alerts channel:
🔴 CRITICAL: web-server-01/HTTP
Connection refused
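Nagios expands macros such as $HOSTNAME$ before it runs a notification command. The Python sketch below imitates that expansion so a message template can be previewed without triggering a real alert, then wraps the result in the JSON body that Slack incoming webhooks accept (`{"text": ...}`). The template, the function names, and the choice to blank out unknown macros are assumptions. Nothing is sent over the network.

```python
import json
import re

def render(template, macros):
    """Replace $NAME$ tokens with values; unknown macros become empty (assumption)."""
    return re.sub(r"\$([A-Z0-9_]+)\$", lambda m: macros.get(m.group(1), ""), template)

def slack_payload(macros):
    """Build the JSON body for a Slack incoming webhook."""
    text = render("$SERVICESTATE$: $HOSTNAME$/$SERVICEDESC$\n$SERVICEOUTPUT$", macros)
    return json.dumps({"text": text})

# Values matching the sample alert above.
macros = {
    "HOSTNAME": "web-server-01",
    "SERVICEDESC": "HTTP",
    "SERVICESTATE": "CRITICAL",
    "SERVICEOUTPUT": "Connection refused",
}
print(slack_payload(macros))
```

In a real notification command the rendered payload would typically be posted to the webhook URL with curl, one of the tools listed for this project.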

The Core Question You’re Answering

“How do I ensure alerts reach the right people through their preferred communication channel?”

The Interview Questions They’ll Ask

“How do you prevent alert fatigue?”
“What macros are available in notification commands?”
“How would you add a new notification channel?”
“How do you test notification commands without triggering real alerts?”
“What’s the difference between host and service notification commands?”

Learning Milestones

Basic: Email notifications working
Intermediate: Slack webhook integration
Advanced: Multiple channels per contact, formatted messages

Project 9: Contact Groups and Escalations

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 3: Advanced
Knowledge Area: Notification Routing
Software or Tool: Nagios Core
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Escalation chains that notify different teams based on how long a problem has persisted.

Why it teaches the concept: Real incidents need escalation paths. The junior on-call should be notified first, then managers if the problem remains unacknowledged.

Core challenges you’ll face:

Contact group design - Organizing contacts by role and responsibility
Escalation timing - first_notification, last_notification, notification_interval
Escalation conditions - Which states trigger escalation
Acknowledgement handling - Stopping escalation when acked

┌─────────────────────────────────────────────────────────┐
│              ESCALATION TIMELINE                         │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Time:   0      5      10     15     20     25     30   │
│          │      │      │      │      │      │      │    │
│          ▼      ▼      ▼      ▼      ▼      ▼      ▼    │
│                                                          │
│  Notif:  1      2      3      4      5      6      7    │
│                                                          │
│  Level 1: on-call engineer (notifications 1-2)           │
│  Level 2: senior engineer (notifications 3-4)            │
│  Level 3: manager (notifications 5+)                     │
│                                                          │
│  If acknowledged at notification 3:                      │
│    - Escalation stops                                    │
│    - Recovery goes to the contacts already notified      │
│                                                          │
└─────────────────────────────────────────────────────────┘
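The timeline above amounts to a lookup from notification number to contact groups. Below is a minimal Python sketch of that lookup, using the first_notification and last_notification ranges from the diagram. The group names are hypothetical, a last_notification of 0 is taken to mean "no upper bound", and matching escalations are assumed to replace the default group rather than add to it.

```python
# Levels 2 and 3 from the timeline; level 1 is the default contact group.
ESCALATIONS = [
    {"first_notification": 3, "last_notification": 4, "contact_groups": ["senior-engineers"]},
    {"first_notification": 5, "last_notification": 0, "contact_groups": ["managers"]},
]
DEFAULT_GROUPS = ["on-call"]

def groups_for(notification_number, escalations=ESCALATIONS, default=DEFAULT_GROUPS):
    """Return the contact groups that receive the given notification."""
    matched = []
    for esc in escalations:
        first = esc["first_notification"]
        last = esc["last_notification"]
        if notification_number >= first and (last == 0 or notification_number <= last):
            matched.extend(esc["contact_groups"])
    # With no matching escalation, the default contacts are used.
    return matched or list(default)

for number in range(1, 8):
    print(number, groups_for(number))
```

To keep the on-call engineer in the loop at higher levels, the on-call group would also be listed in each escalation's contact groups.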

Difficulty: Advanced Time estimate: 3-4 hours Prerequisites: Projects 1-8

Real World Outcome

# First 10 minutes: on-call gets notified
# After 10 minutes: senior engineer also notified
# After 20 minutes: manager also notified

# When engineer acknowledges:
# - Escalation stops
# - When service recovers, the contacts already notified receive the recovery notification

The Core Question You’re Answering

“How do I ensure problems get attention even if the first responder doesn’t react?”

The Interview Questions They’ll Ask

“What is the difference between a service’s notification_interval and the notification_interval set in an escalation?”
“How do you prevent the same person from being notified by multiple escalation levels?”
“What happens when an escalated problem is acknowledged?”
“How would you implement ‘follow the sun’ on-call rotation?”

Learning Milestones

Basic: Two-level escalation working
Intermediate: Escalation stops on acknowledgement
Advanced: Different escalation paths for different services

Project 10: Event Handlers for Auto-Remediation

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Bash
Difficulty: Level 4: Expert
Knowledge Area: Automation, Self-Healing Systems
Software or Tool: Shell scripts, SSH
Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd

What you’ll build: Automated remediation scripts that attempt to fix problems before humans are notified.

Why it teaches the concept: The best alert is the one that never fires because the system fixed itself.

Core challenges you’ll face:

Event handler logic - Understanding when handlers trigger
State-based actions - Different actions for SOFT vs HARD states
Safe remediation - Ensuring handlers don’t make things worse
Logging and auditing - Recording what actions were taken

┌─────────────────────────────────────────────────────────┐
│              EVENT HANDLER FLOW                          │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Check Result                                            │
│       │                                                  │
│       ▼                                                  │
│  ┌─────────────────┐                                    │
│  │ Event handler   │                                    │
│  │ enabled?        ├──NO──► No action                   │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  ┌─────────────────┐                                    │
│  │ State change?   │                                    │
│  │ (or first check)├──NO──► No action                   │
│  └────────┬────────┘                                    │
│           │ YES                                          │
│           ▼                                              │
│  Execute event_handler command                           │
│  with $SERVICESTATE$ $SERVICESTATETYPE$ args            │
│                                                          │
│  Example handler logic:                                  │
│  if SERVICESTATE=CRITICAL and STATETYPE=SOFT then       │
│      attempt_restart                                     │
│  elif SERVICESTATE=CRITICAL and STATETYPE=HARD then     │
│      # Give up, notification will handle it             │
│  fi                                                      │
│                                                          │
└─────────────────────────────────────────────────────────┘
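The handler logic in the diagram can be written as a small decision function. The Python sketch below follows the same rule (act on SOFT CRITICAL, stand down on HARD CRITICAL) and adds a restart cap and an audit log. The names `decide` and `handle`, the default cap of 2, and the log format are assumptions. The restart itself is left out.

```python
def decide(state, state_type, restarts_so_far, max_restarts=2):
    """Return the action a handler would take for one state change."""
    if state != "CRITICAL":
        return "none"
    if state_type == "HARD":
        return "none"      # give up; the notification process takes over
    if restarts_so_far >= max_restarts:
        return "skip"      # safety limit to avoid a restart loop
    return "restart"

def handle(state, state_type, restarts_so_far, log):
    """Decide on an action and record it for later auditing."""
    action = decide(state, state_type, restarts_so_far)
    log.append("%s/%s -> %s" % (state, state_type, action))
    return action

audit_log = []
handle("CRITICAL", "SOFT", 0, audit_log)
handle("CRITICAL", "HARD", 1, audit_log)
print("\n".join(audit_log))
```

The two arguments correspond to the $SERVICESTATE$ and $SERVICESTATETYPE$ macros passed to the event_handler command.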

Difficulty: Expert Time estimate: 6-8 hours Prerequisites: Projects 1-4, sudo/SSH access to managed hosts

Real World Outcome

# Service goes CRITICAL (SOFT state):
Event handler: Attempting to restart httpd
Event handler: Service httpd restarted successfully

# Next check:
Service returned to OK state
# No notification ever sent!

# If restart fails:
# Service goes HARD CRITICAL
# Normal notification process triggers

The Core Question You’re Answering

“How do I automate the first steps of incident response to reduce human intervention?”

The Interview Questions They’ll Ask

“Why do event handlers usually attempt fixes during SOFT states, before notifications go out?”
“How do you prevent event handlers from making problems worse?”
“How do you audit what event handlers did?”
“What’s the difference between event handlers and notifications?”
“How would you handle a service that keeps restarting in a loop?”

Learning Milestones

Basic: Event handler triggers on state change
Intermediate: Handler takes action only on SOFT CRITICAL
Advanced: Handler logs actions and has safety limits

Project 11: Passive Checks and NSCA

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Bash, Nagios Configuration
Difficulty: Level 3: Advanced
Knowledge Area: Push-Based Monitoring
Software or Tool: NSCA (Nagios Service Check Acceptor)
Main Book: Nagios Core Documentation

What you’ll build: Passive check infrastructure for monitoring behind firewalls or from external scripts/cron jobs.

Why it teaches the concept: Not everything can be polled. Passive checks enable monitoring of jobs, security events, and firewalled systems.

┌─────────────────────────────────────────────────────────┐
│            PASSIVE CHECK ARCHITECTURE                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Active Check (Pull):                                    │
│  Nagios ───────────────────────────────► Remote Host    │
│  (initiates connection every N minutes)                  │
│                                                          │
│  Passive Check (Push):                                   │
│  External Script ─────► NSCA ─────► Nagios External CMD │
│  (sends when ready)      │          │                    │
│                          │          ▼                    │
│                     Port 5667   nagios.cmd pipe          │
│                                                          │
│  Use cases:                                              │
│  - Cron job completion status                            │
│  - Security events from SIEM                             │
│  - Backup job results                                    │
│  - Hosts behind NAT/firewall                             │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4

Real World Outcome

# From external system:
$ echo -e "web-server-01\tBackup Status\t0\tBackup completed successfully" | \
    /usr/local/nagios/bin/send_nsca -H nagios-server -c send_nsca.cfg

# Nagios shows service "Backup Status" with OK state
# If no result received within freshness_threshold:
# Service goes CRITICAL - "No result received in 24 hours"
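A passive result is a short delimited record. The Python sketch below builds the tab-separated line piped to send_nsca above, the matching PROCESS_SERVICE_CHECK_RESULT external command for the command pipe, and a simple staleness test in the spirit of freshness_threshold. The function names are hypothetical, and `is_stale` is a simplification of Nagios's freshness logic, not a copy of it.

```python
import time

def nsca_line(host, service, return_code, output):
    """Tab-separated record in the form piped to send_nsca."""
    return "\t".join([host, service, str(return_code), output]) + "\n"

def external_command(host, service, return_code, output, now=None):
    """Equivalent external command line for the Nagios command pipe."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)

def is_stale(last_result_time, freshness_threshold, now):
    """True when no result has arrived within the threshold (in seconds)."""
    return (now - last_result_time) > freshness_threshold

print(nsca_line("web-server-01", "Backup Status", 0, "Backup completed successfully"), end="")
print(external_command("web-server-01", "Backup Status", 0, "Backup completed successfully"), end="")
```

The host name and service description must match the Nagios object definitions exactly, since the result is matched to a service by those two fields.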

The Core Question You’re Answering

“How do I monitor things that can’t be actively polled?”

The Interview Questions They’ll Ask

“What’s the difference between active and passive checks?”
“How does freshness checking work?”
“What happens if a passive check never sends a result?”
“How do you secure NSCA communication?”
“When would you use passive vs active checks?”

Learning Milestones

Basic: NSCA server accepting results
Intermediate: Passive service with freshness checking
Advanced: Cron job sending results via send_nsca

Project 12: Performance Data and Graphing

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Perl, Nagios Configuration
Difficulty: Level 3: Advanced
Knowledge Area: Metrics, Time-Series Data
Software or Tool: PNP4Nagios, RRDtool, or InfluxDB/Grafana
Main Book: “Nagios and Nagios Related Development” - Nagios Exchange

What you’ll build: Performance data collection and graphing for historical trend analysis.

Why it teaches the concept: Monitoring without graphing is incomplete. Graphs show trends, capacity planning data, and historical context.

┌─────────────────────────────────────────────────────────┐
│            PERFORMANCE DATA PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Plugin Output:                                          │
│  "DISK OK - 45GB free|/=55GB;80;90;0;100"               │
│                ▲                                         │
│                │ performance data                        │
│                ▼                                         │
│  service_perfdata_command (process_performance_data=1)   │
│                │                                         │
│                ▼                                         │
│  ┌─────────────────────────────────────┐                │
│  │  PNP4Nagios / Graphite / InfluxDB   │                │
│  │         (stores time-series)         │                │
│  └─────────────────────────────────────┘                │
│                │                                         │
│                ▼                                         │
│  ┌─────────────────────────────────────┐                │
│  │      Graphs in Web Interface        │                │
│  │   (click on service → see graph)    │                │
│  └─────────────────────────────────────┘                │
│                                                          │
│  Performance data format:                                │
│  label=value[UOM];warn;crit;min;max                     │
│                                                          │
└─────────────────────────────────────────────────────────┘
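Every tool in this pipeline first has to split the plugin output at the pipe and break each metric into its fields. The following Python sketch parses the label=value[UOM];warn;crit;min;max format shown above. The regular expression and the function name `parse_perfdata` are assumptions. The sketch handles only simple numeric values and single-quoted labels, and it keeps thresholds as strings.

```python
import re

# label (optionally single-quoted), numeric value, optional unit, then ;-fields
ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)=(?P<value>-?[\d.]+)(?P<uom>[^;\s]*)(?P<rest>(?:;[^;\s]*)*)"
)

def parse_perfdata(plugin_output):
    """Return {label: fields} for the text after the first pipe character."""
    if "|" not in plugin_output:
        return {}
    _, perf = plugin_output.split("|", 1)
    metrics = {}
    for match in ITEM.finditer(perf):
        fields = match.group("rest").split(";")[1:]
        fields += [""] * (4 - len(fields))   # pad omitted trailing fields
        warn, crit, minimum, maximum = fields[:4]
        metrics[match.group("label").strip("'")] = {
            "value": float(match.group("value")),
            "uom": match.group("uom"),
            "warn": warn, "crit": crit, "min": minimum, "max": maximum,
        }
    return metrics

print(parse_perfdata("DISK OK - 45GB free|/=55GB;80;90;0;100"))
```

Omitted fields come back as empty strings, so a storage backend can distinguish "no maximum given" from a maximum of zero.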

Difficulty: Advanced Time estimate: 6-8 hours Prerequisites: Projects 1-5

Real World Outcome

Click on any service in Nagios web interface and see historical graphs:

CPU usage over last 24 hours
Disk space trend over last month
Response time percentiles

The Core Question You’re Answering

“How do I see trends over time instead of just current state?”

The Interview Questions They’ll Ask

“How is performance data different from check output?”
“What is RRDtool and how does it work?”
“How would you alert on rates of change vs absolute values?”
“What’s the storage overhead for performance data?”

Learning Milestones

Basic: Performance data being written to files
Intermediate: PNP4Nagios or similar showing graphs
Advanced: Custom graphs with calculated metrics

Project 13: Nagios Plugins Development

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Python, Bash, or Perl
Difficulty: Level 3: Advanced
Knowledge Area: Plugin Development
Software or Tool: Nagios Plugin API
Main Book: Nagios Plugin Development Guidelines

What you’ll build: A production-quality plugin with proper argument parsing, timeout handling, and performance data.

Why it teaches the concept: Plugins are how Nagios interfaces with the world. Understanding plugin development means you can monitor anything.

Core challenges you’ll face:

Argument parsing - Standard options like -H, -w, -c, -t, -v
Timeout handling - Plugin must not hang
Help output - -h and --help should be informative
Verbose mode - -v for debugging
Performance data - Proper formatting

Difficulty: Advanced Time estimate: 8-10 hours Prerequisites: Projects 1-5, programming skills

Real World Outcome

A plugin that follows the Nagios guidelines:

$ ./check_myapp --help
check_myapp - Check MyApp API health
Usage: check_myapp -H <host> [-p port] [-w warn] [-c crit] [-t timeout]

Options:
  -H, --hostname   Hostname or IP
  -p, --port       Port number (default: 8080)
  -w, --warning    Warning threshold (ms)
  -c, --critical   Critical threshold (ms)
  -t, --timeout    Timeout in seconds (default: 10)
  -v, --verbose    Verbose output
  -h, --help       This help message

$ ./check_myapp -H localhost -w 200 -c 500
MYAPP OK - Response time 45ms|response_time=45ms;200;500;0;
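A skeleton for such a plugin might look like the following Python sketch, which uses argparse for the options listed in the help text. It is an outline under assumptions, not the check_myapp implementation: `probe` is a hypothetical callable that measures response time in milliseconds and honors the timeout itself, any exception (including a timeout) is reported as UNKNOWN, and a value equal to a threshold counts as a breach.

```python
import argparse

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}

def build_parser():
    """Options mirroring the help text above; -h/--help comes from argparse."""
    parser = argparse.ArgumentParser(prog="check_myapp", description="Check MyApp API health")
    parser.add_argument("-H", "--hostname", required=True, help="Hostname or IP")
    parser.add_argument("-p", "--port", type=int, default=8080, help="Port number")
    parser.add_argument("-w", "--warning", type=int, help="Warning threshold (ms)")
    parser.add_argument("-c", "--critical", type=int, help="Critical threshold (ms)")
    parser.add_argument("-t", "--timeout", type=int, default=10, help="Timeout in seconds")
    parser.add_argument("-v", "--verbose", action="count", default=0, help="Verbose output")
    return parser

def fmt(value):
    """Render an optional threshold for perfdata; missing becomes empty."""
    return "" if value is None else str(value)

def run(args, probe):
    """Call probe(host, port, timeout) and map the result to (code, line)."""
    try:
        elapsed = probe(args.hostname, args.port, args.timeout)
    except Exception as exc:  # assumption: every failure maps to UNKNOWN
        return UNKNOWN, "MYAPP UNKNOWN - %s" % exc
    if args.critical is not None and elapsed >= args.critical:
        state = CRITICAL
    elif args.warning is not None and elapsed >= args.warning:
        state = WARNING
    else:
        state = OK
    perfdata = "response_time=%dms;%s;%s;0;" % (elapsed, fmt(args.warning), fmt(args.critical))
    return state, "MYAPP %s - Response time %dms|%s" % (LABELS[state], elapsed, perfdata)

args = build_parser().parse_args(["-H", "localhost", "-w", "200", "-c", "500"])
code, line = run(args, lambda host, port, timeout: 45)  # stub probe for the demo
print(line)
```

Passing the probe in as a parameter keeps the argument and threshold handling testable without a live service.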

The Core Question You’re Answering

“How do I create a monitoring plugin that is robust, user-friendly, and maintainable?”

The Interview Questions They’ll Ask

“What are the Nagios plugin guidelines?”
“How do you handle timeouts in plugins?”
“What should happen if a plugin encounters an unexpected error?”
“How do you test plugins before deployment?”
“What libraries exist to simplify plugin development?”

Learning Milestones

Basic: Plugin with correct exit codes
Intermediate: Proper argument parsing and help output
Advanced: Timeout handling, verbose mode, performance data

Project 14: Monitoring Windows Hosts

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 3: Advanced
Knowledge Area: Windows Monitoring
Software or Tool: NSClient++, check_nt, or WMI
Main Book: NSClient++ Documentation

What you’ll build: Windows host monitoring for disk, CPU, memory, services, and event logs.

Why it teaches the concept: Mixed environments require multi-platform monitoring. NSClient++ is the NRPE equivalent for Windows.

┌─────────────────────────────────────────────────────────┐
│            WINDOWS MONITORING OPTIONS                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Option 1: NSClient++ (Recommended)                      │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │         │ Windows Host │              │
│  │              │         │              │              │
│  │ check_nrpe   │◄───────►│ NSClient++   │              │
│  │ check_nt     │ TCP/12489│ (agent)     │              │
│  └──────────────┘         └──────────────┘              │
│                                                          │
│  Option 2: WMI (Agentless)                               │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │         │ Windows Host │              │
│  │              │         │              │              │
│  │ check_wmi    │◄───────►│ WMI Service  │              │
│  │ (uses wmic)  │ RPC/135 │ (built-in)   │              │
│  └──────────────┘         └──────────────┘              │
│                                                          │
│  Metrics available:                                      │
│  - CPU usage (CPULOAD)                                   │
│  - Memory usage (MEMUSE)                                 │
│  - Disk space (USEDDISKSPACE)                           │
│  - Service status (SERVICESTATE)                         │
│  - Process list (PROCSTATE)                              │
│  - Event log entries (CheckEventLog via check_nrpe)      │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, Windows server access

Real World Outcome

$ /usr/local/nagios/libexec/check_nt -H windows-server -p 12489 -v CPULOAD -l 5,80,90
CPU Load 12% (5 min average)|'5 min avg Load'=12%;80;90;0;100;

$ /usr/local/nagios/libexec/check_nt -H windows-server -v SERVICESTATE -l "Windows Update"
Windows Update: Started

The Core Question You’re Answering

“How do I monitor Windows systems from a Linux-based Nagios server?”

The Interview Questions They’ll Ask

“What’s the difference between check_nt and check_nrpe for Windows?”
“How do you monitor Windows services?”
“How do you check Windows event logs?”
“What are the security implications of WMI vs NSClient++?”

Learning Milestones

Basic: NSClient++ installed and responding
Intermediate: CPU, memory, disk monitored
Advanced: Service status and event log monitoring

Project 15: Network Device Monitoring (SNMP)

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 4: Expert
Knowledge Area: Network Monitoring, SNMP
Software or Tool: check_snmp, snmpwalk
Main Book: “Essential SNMP” by Mauro & Schmidt

What you’ll build: Monitoring for routers, switches, and network appliances using SNMP.

Why it teaches the concept: Network devices are different from servers. SNMP is the standard protocol for managing network infrastructure.

┌─────────────────────────────────────────────────────────┐
│                    SNMP CONCEPTS                         │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  SNMP = Simple Network Management Protocol               │
│                                                          │
│  Key Components:                                         │
│  ┌──────────────┐    ┌──────────────┐                   │
│  │ SNMP Manager │───►│ SNMP Agent   │                   │
│  │ (Nagios)     │◄───│ (Device)     │                   │
│  └──────────────┘    └──────────────┘                   │
│         │                   │                            │
│         │ GET/SET           │ Response                   │
│         │                   │ Trap                       │
│                                                          │
│  OID (Object Identifier):                                │
│  .1.3.6.1.4.1.9.2.1.58.0 = Cisco CPU 5min              │
│  .1.3.6.1.2.1.2.2.1.8.1 = Interface 1 oper status       │
│                                                          │
│  MIB (Management Information Base):                      │
│  Human-readable mappings for OIDs                        │
│                                                          │
│  SNMP Versions:                                          │
│  v1: Community strings (insecure)                        │
│  v2c: Same security; adds GETBULK and 64-bit counters   │
│  v3: Authentication + encryption (recommended)           │
│                                                          │
└─────────────────────────────────────────────────────────┘
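
To make the OID and status concepts concrete, here is a minimal Python sketch; it is not part of any Nagios plugin. It extracts the interface index from an instance OID under the IF-MIB ifOperStatus column, then turns the returned integer into a plugin-style result. The status names come from IF-MIB. The function names and the choice of which states count as WARNING or CRITICAL are assumptions made for illustration.

```python
from typing import Tuple

# IF-MIB::ifOperStatus column; the trailing number of an instance OID is the ifIndex.
IF_OPER_STATUS_COLUMN = "1.3.6.1.2.1.2.2.1.8"

# Enumeration defined by IF-MIB for ifOperStatus.
IF_OPER_STATUS_NAMES = {
    1: "up",
    2: "down",
    3: "testing",
    4: "unknown",
    5: "dormant",
    6: "notPresent",
    7: "lowerLayerDown",
}

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def interface_index(oid: str, column: str = IF_OPER_STATUS_COLUMN) -> int:
    """Return the ifIndex from an instance OID such as .1.3.6.1.2.1.2.2.1.8.1."""
    normalized = oid.lstrip(".")
    prefix = column + "."
    if not normalized.startswith(prefix):
        raise ValueError(f"{oid} is not an instance of {column}")
    return int(normalized[len(prefix):])


def oper_status_to_nagios(value: int) -> Tuple[int, str]:
    """Map an ifOperStatus integer to (exit_code, output). Severity mapping is an assumption."""
    name = IF_OPER_STATUS_NAMES.get(value)
    if name is None or value == 4:
        return UNKNOWN, f"SNMP UNKNOWN - ifOperStatus {value}"
    label = f"{name}({value})"
    if value == 1:
        return OK, f"SNMP OK - {label}"
    if value in (3, 5):
        return WARNING, f"SNMP WARNING - {label}"
    return CRITICAL, f"SNMP CRITICAL - {label}"
```

For example, `interface_index(".1.3.6.1.2.1.2.2.1.8.1")` returns `1`, and `oper_status_to_nagios(2)` yields a CRITICAL result for a down interface.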

Difficulty: Expert Time estimate: 6-8 hours Prerequisites: Projects 1-4, network device with SNMP enabled

Real World Outcome

# Check interface status:
$ check_snmp -H router.example.com -C public -o IF-MIB::ifOperStatus.1
SNMP OK - up(1)|

# Check CPU utilization on Cisco device:
$ check_snmp -H router.example.com -C public \
    -o .1.3.6.1.4.1.9.2.1.58.0 -w 80 -c 90
SNMP OK - 45|iso.3.6.1.4.1.9.2.1.58.0=45;80;90

# Monitor interface bandwidth (flags as in the Manubulon check_snmp_int.pl; confirm with --help for your version):
$ check_snmp_int.pl -H switch.example.com -C public -n "Gi0/1" -k -u -w 80,80 -c 90,90
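
Bandwidth plugins work from octet counters rather than a ready-made rate, so the arithmetic is worth seeing once. The sketch below assumes two samples of an octet counter such as IF-MIB's ifHCInOctets (64-bit) or ifInOctets (32-bit), taken a known number of seconds apart. The function names are hypothetical, and it handles a single counter wrap only; a device reboot that resets the counter would need separate detection.

```python
def counter_delta(previous: int, current: int, counter_bits: int = 64) -> int:
    """Difference between two counter samples, allowing for one wrap-around."""
    if current >= previous:
        return current - previous
    # The counter wrapped past its maximum and restarted from zero.
    return (1 << counter_bits) - previous + current


def utilization_percent(
    prev_octets: int,
    curr_octets: int,
    interval_seconds: float,
    if_speed_bps: int,
    counter_bits: int = 64,
) -> float:
    """Average link utilization over the interval, as a percentage of if_speed_bps."""
    if interval_seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    bits = counter_delta(prev_octets, curr_octets, counter_bits) * 8
    return 100.0 * bits / (interval_seconds * if_speed_bps)
```

On a 1 Gbit/s link, 3,750,000,000 octets in 60 seconds works out to 50% utilization. On fast links a 32-bit counter can wrap more than once between polls, which is one reason to prefer the 64-bit counters available from SNMP v2c onward.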

The Core Question You’re Answering

“How do I monitor network infrastructure that doesn’t run standard OS agents?”

The Interview Questions They’ll Ask

“What is an OID and how do you find the right one?”
“What are the security differences between SNMP v2c and v3?”
“How do you monitor interface bandwidth?”
“What is an SNMP trap and how is it different from polling?”
“How do you monitor a device you don’t have the MIB for?”

Learning Milestones

Basic: SNMP GET working from Nagios
Intermediate: Interface status and bandwidth monitored
Advanced: SNMP v3 with authentication, trap receiver

Project 16: Log File Monitoring

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Bash, Perl
Difficulty: Level 3: Advanced
Knowledge Area: Log Analysis
Software or Tool: check_log, custom scripts
Main Book: Nagios Core Documentation

What you’ll build: Monitoring that detects error patterns in application log files and alerts.

Why it teaches the concept: Logs often contain warnings before metrics show problems. Log monitoring catches issues early.

Core challenges you’ll face:

Log rotation - Handling log files that rotate
State tracking - Only alerting on new entries
Pattern matching - Defining what constitutes an error
Performance - Scanning large logs efficiently

┌─────────────────────────────────────────────────────────┐
│            LOG MONITORING APPROACHES                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Approach 1: Scan for patterns (active check)            │
│  ┌──────────────────────────────────────────────┐       │
│  │ check_log -F /var/log/app.log -O /tmp/seek   │       │
│  │           -q "ERROR|FATAL"                   │       │
│  │                                              │       │
│  │ • Keeps a copy of the last-seen log (-O)     │       │
│  │ • Diffs against it, so only new lines match  │       │
│  │ • Must handle rotation                       │       │
│  └──────────────────────────────────────────────┘       │
│                                                          │
│  Approach 2: Real-time tail + NSCA (passive check)       │
│  ┌──────────────────────────────────────────────┐       │
│  │ tail -F /var/log/app.log | while read line   │       │
│  │ do                                            │       │
│  │   if [[ $line =~ ERROR ]]; then               │       │
│  │     send_nsca "host" "Log Errors" 2 "$line"  │       │
│  │   fi                                          │       │
│  │ done                                          │       │
│  └──────────────────────────────────────────────┘       │
│                                                          │
│  Approach 3: External log shipper → Nagios               │
│  (Filebeat/Logstash → parse → alert → NSCA)             │
│                                                          │
└─────────────────────────────────────────────────────────┘
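
The state-tracking and rotation challenges above can be sketched in a few lines of Python. This is not how the stock check_log works. It illustrates a different technique: storing a byte offset plus the file's inode, and starting over when either shows that the file was rotated or truncated. The function name and the JSON state-file layout are assumptions.

```python
import json
import os
import re


def scan_new_lines(log_path, state_path, pattern):
    """Return lines appended to log_path since the previous call that match pattern."""
    try:
        with open(state_path) as fh:
            state = json.load(fh)
    except (FileNotFoundError, ValueError):
        state = {"inode": None, "offset": 0}  # first run: scan from the start

    stat = os.stat(log_path)
    offset = state["offset"]
    # A new inode means the file was replaced; a smaller size means it was truncated.
    if state["inode"] != stat.st_ino or stat.st_size < offset:
        offset = 0

    regex = re.compile(pattern)
    matches = []
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        for raw in fh:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next run
            offset += len(raw)
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            if regex.search(line):
                matches.append(line)

    with open(state_path, "w") as fh:
        json.dump({"inode": stat.st_ino, "offset": offset}, fh)
    return matches
```

A plugin built on this would exit 2 when the returned list is non-empty and 0 otherwise. One limitation to keep in mind: lines written to the old file between the last check and the rotation are never seen, so a production version would also read the tail of the rotated file.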

Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, 11 (for passive approach)

Real World Outcome

# Active check approach:
$ /usr/local/nagios/libexec/check_log -F /var/log/myapp.log \
    -O /tmp/myapp_log.seek -q "ERROR"
Log check ok - 0 pattern matches found

# (After an error is logged)
$ /usr/local/nagios/libexec/check_log -F /var/log/myapp.log \
    -O /tmp/myapp_log.seek -q "ERROR"
(1) < 2024-01-15 10:23:45 ERROR: Database connection failed

The Core Question You’re Answering

“How do I detect problems from log messages before they become service failures?”

The Interview Questions They’ll Ask

“How do you handle log rotation in monitoring?”
“What’s the difference between active and passive log monitoring?”
“How do you avoid duplicate alerts for the same error?”
“How do you monitor logs across many servers?”
“How would you alert on an absence of expected log entries?”

Learning Milestones

Basic: check_log detecting patterns
Intermediate: State tracking across check runs
Advanced: Log rotation handling, passive check integration
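
For the passive approach, send_nsca reads check results from standard input, one per line, as tab-separated host, service, return code, and output. The Python sketch below builds such a line and pipes it to the client. The binary and config paths are assumptions based on a source install, and the helper names are hypothetical.

```python
import subprocess


def nsca_line(host: str, service: str, return_code: int, output: str) -> str:
    """Format one passive service check result for send_nsca's standard input."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1, 2 or 3")
    # Tabs and newlines delimit fields and records, so collapse them in the output.
    clean = " ".join(output.split())
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


def submit(
    line: str,
    nsca_server: str,
    binary: str = "/usr/local/nagios/bin/send_nsca",      # assumed install path
    config: str = "/usr/local/nagios/etc/send_nsca.cfg",  # assumed config path
) -> None:
    """Pipe one result line to send_nsca; raises CalledProcessError on failure."""
    subprocess.run(
        [binary, "-H", nsca_server, "-c", config],
        input=line.encode("utf-8"),
        check=True,
    )
```

The host and service names must match objects defined on the Nagios server with passive checks enabled, otherwise the result is discarded.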

Project 17: Custom Dashboards and Status Pages

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: PHP, HTML, API calls
Difficulty: Level 3: Advanced
Knowledge Area: UI/UX, Web Development
Software or Tool: Nagios CGIs, Thruk, Grafana
Main Book: Nagios Core Documentation - CGI Reference

What you’ll build: Custom status views and dashboards beyond the default Nagios interface.

Why it teaches the concept: Different stakeholders need different views. Operations needs details; management needs summaries.

Core challenges you’ll face:

Status API - Reading from status.dat or NDO
Custom CGIs - Building on top of Nagios CGIs
Alternative UIs - Thruk, Nagstamon, mobile apps
Data visualization - Meaningful dashboards

┌─────────────────────────────────────────────────────────┐
│            DASHBOARD ARCHITECTURE                        │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Data Sources:                                           │
│  ┌──────────────────┐                                   │
│  │ status.dat       │ ◄─ File-based, fast, limited     │
│  │ nagios.cmd       │ ◄─ Command pipe for control       │
│  │ NDO/Livestatus   │ ◄─ Database/socket, full history │
│  └──────────────────┘                                   │
│           │                                              │
│           ▼                                              │
│  ┌──────────────────────────────────────────────┐       │
│  │              Dashboard Options               │       │
│  ├──────────────────────────────────────────────┤       │
│  │ • Default Nagios CGIs (basic)                │       │
│  │ • Thruk (modern web UI)                      │       │
│  │ • Grafana + Nagios datasource                │       │
│  │ • Custom dashboards (TV displays)            │       │
│  │ • Desktop/mobile apps (Nagstamon, etc.)      │       │
│  └──────────────────────────────────────────────┘       │
│                                                          │
└─────────────────────────────────────────────────────────┘
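
As a starting point for the file-based data source, the sketch below parses the `type { key=value }` blocks that status.dat contains and lists services that are not OK. It is a minimal reader written for illustration, with hypothetical function names. It assumes the whole file fits in memory and that no field value is a lone closing brace.

```python
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_status_dat(text):
    """Return a list of (block_type, fields) tuples from status.dat content."""
    blocks = []
    block_type, fields = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if block_type is None:
            if line.endswith("{"):
                block_type = line[:-1].strip()  # e.g. "servicestatus"
                fields = {}
        elif line == "}":
            blocks.append((block_type, fields))
            block_type, fields = None, None
        elif "=" in line:
            key, value = line.split("=", 1)  # values may contain '=' themselves
            fields[key] = value
    return blocks


def service_problems(blocks):
    """Return (host, service, state_name) for every service not in the OK state."""
    problems = []
    for block_type, fields in blocks:
        if block_type == "servicestatus" and fields.get("current_state") != "0":
            state = STATE_NAMES.get(int(fields["current_state"]), "UNKNOWN")
            problems.append((fields["host_name"], fields["service_description"], state))
    return problems
```

A TV dashboard could call `service_problems(parse_status_dat(open(path).read()))` on a timer. Because Nagios rewrites status.dat periodically, the view is only as fresh as the last write.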

Difficulty: Advanced Time estimate: 6-8 hours Prerequisites: Projects 1-4, web development basics

Real World Outcome

TV dashboard in NOC showing current problems
Management summary page with SLA compliance
Mobile app notifications on phone
Custom status pages for specific teams

The Core Question You’re Answering

“How do I present monitoring data in a way that’s useful for different audiences?”

The Interview Questions They’ll Ask

“How do you read current status programmatically?”
“What’s the difference between status.dat and NDO?”
“How would you build a public status page?”
“How do you secure API access to Nagios data?”
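
One way to read current status programmatically is a Livestatus query over its Unix socket. The sketch below assumes the third-party MK Livestatus broker module is loaded; it is not part of Nagios Core, and the socket path shown is an assumption that depends on how the module was configured. The helper names are hypothetical.

```python
import json
import socket


def build_query(table, columns, filters=()):
    """Build a Livestatus query; the trailing blank line marks the end of the request."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {expr}" for expr in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def run_query(query, socket_path="/usr/local/nagios/var/rw/live"):  # assumed path
    """Send the query to the Livestatus socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

For example, `build_query("services", ["host_name", "description", "state"], ["state != 0"])` asks for every service that is not OK. Access control here is the socket's file permissions, which matters for the question about securing API access.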

Learning Milestones

Basic: Install Thruk or alternative UI
Intermediate: Custom status page for specific hosts
Advanced: Grafana dashboards with historical data

Project 18: High Availability Setup

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Nagios Configuration
Difficulty: Level 5: Master
Knowledge Area: High Availability, Clustering
Software or Tool: Pacemaker/Corosync, DRBD, or Active/Standby
Main Book: “Pro Linux High Availability Clustering” by Sander van Vugt

What you’ll build: Redundant Nagios infrastructure that survives the failure of any single component.

Why it teaches the concept: Your monitoring system must be more reliable than what it monitors. HA is essential for production.

┌─────────────────────────────────────────────────────────┐
│          HIGH AVAILABILITY ARCHITECTURES                 │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Option 1: Active/Standby with Pacemaker                 │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │ ───────►│ Nagios       │              │
│  │ Primary      │ DRBD    │ Standby      │              │
│  │ (Active)     │         │ (Passive)    │              │
│  └──────────────┘         └──────────────┘              │
│         │                                                │
│         ▼                                                │
│    Virtual IP (failover)                                 │
│                                                          │
│  Option 2: Distributed Monitoring                        │
│  ┌──────────────┐         ┌──────────────┐              │
│  │ Nagios       │ ◄──────►│ Nagios       │              │
│  │ Region A     │ NSCA    │ Region B     │              │
│  │ (monitors A) │         │ (monitors B) │              │
│  └──────────────┘         └──────────────┘              │
│         │                        │                       │
│         └────────────────────────┘                       │
│                    │                                     │
│                    ▼                                     │
│           Central Dashboard                              │
│                                                          │
│  Option 3: Load Balanced Pollers                         │
│  ┌──────────────┐                                       │
│  │ Central      │ ◄── Check results from                │
│  │ Nagios       │     multiple poller nodes             │
│  └──────────────┘                                       │
│                                                          │
└─────────────────────────────────────────────────────────┘

Difficulty: Master Time estimate: 12-16 hours Prerequisites: Projects 1-10, Linux clustering knowledge

Real World Outcome

# Primary server fails:
$ crm status
Node nagios-01: OFFLINE
Node nagios-02: online

Resources:
 Resource Group: nagios-group
     VIP        (ocf::heartbeat:IPaddr2):    Started nagios-02
     Nagios     (ocf::heartbeat:nagios):     Started nagios-02

# Nagios resumes on nagios-02 after a brief failover gap
# Alert: Monitoring failover occurred

The Core Question You’re Answering

“How do I ensure monitoring is available even when monitoring servers fail?”

The Interview Questions They’ll Ask

“How do you handle split-brain in a Nagios cluster?”
“What data needs to be replicated between nodes?”
“How do you test failover without affecting production?”
“What’s the recovery time objective for your monitoring?”
“How do you monitor your monitoring system?”
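
The split-brain question comes down to a majority rule: a partition may run Nagios only if it holds more than half of the cluster's votes. The sketch below is a simplified model of that rule and not Pacemaker's or Corosync's actual implementation; the function names are hypothetical.

```python
def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """True when this partition holds a strict majority of all votes."""
    return 2 * votes_seen > total_votes


def partitions_allowed_to_run(partition_sizes, total_votes):
    """Given partition sizes after a network split, return those that may run Nagios."""
    return [size for size in partition_sizes if has_quorum(size, total_votes)]
```

With two nodes split one and one, `partitions_allowed_to_run([1, 1], 2)` returns an empty list: neither side may run, which is safe but unavailable. Adding a third vote (a third node or a quorum device) lets exactly one side continue, and fencing guarantees that the losing side has really stopped.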

Learning Milestones

Basic: Active/standby configuration documented
Intermediate: Automated failover working
Advanced: Zero-downtime maintenance procedures

Project 19: Integration with External Tools

File: NAGIOS_MONITORING_MASTERY.md
Main Programming Language: Various (API integrations)
Difficulty: Level 4: Expert
Knowledge Area: Integration, APIs
Software or Tool: PagerDuty, ServiceNow, Ansible, Terraform
Main Book: Tool-specific documentation

What you’ll build: Integrations between Nagios and incident management, CMDB, and automation tools.

Why it teaches the concept: Nagios doesn’t exist in isolation. Integration with the broader toolchain is essential.

┌─────────────────────────────────────────────────────────┐
│          NAGIOS INTEGRATION LANDSCAPE                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Incident Management:                                    │
│  ┌──────────────┐                                       │
│  │   Nagios     │──Notification──►│ PagerDuty │         │
│  │              │                 │ OpsGenie  │         │
│  │              │                 │ VictorOps │         │
│  └──────────────┘                                       │
│                                                          │
│  Ticketing/ITSM:                                         │
│  ┌──────────────┐                                       │
│  │   Nagios     │──Create Ticket─►│ ServiceNow│         │
│  │              │                 │ Jira      │         │
│  │              │                 │ RT        │         │
│  └──────────────┘                                       │
│                                                          │
│  Configuration Management:                               │
│  ┌──────────────┐                                       │
│  │  Ansible/    │──Generate──►│ Nagios Config│          │
│  │  Puppet/     │  Config    │              │          │
│  │  Terraform   │            │              │          │
│  └──────────────┘            └──────────────┘          │
│                                                          │
│  Logging & Metrics:                                      │
│  ┌──────────────┐                                       │
│  │   Nagios     │──perfdata──►│ Graphite/   │          │
│  │              │             │ InfluxDB    │          │
│  │              │──alerts────►│ ELK Stack   │          │
│  └──────────────┘                                       │
│                                                          │
└─────────────────────────────────────────────────────────┘
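
The perfdata arrow in the diagram can be illustrated with a small converter. Plugin performance data follows the `label=value[UOM];warn;crit;min;max` layout, and Graphite's plaintext protocol expects `path value timestamp` lines. The Python sketch below parses the first and emits the second. The metric naming scheme and function names are assumptions, and items whose value is not numeric are skipped.

```python
import re

# label (optionally single-quoted), '=', numeric value, optional unit of measure
PERF_ITEM = re.compile(r"(?:'([^']+)'|([^\s=';]+))=(-?\d+(?:\.\d+)?)([^;\s]*)")


def parse_perfdata(perfdata):
    """Return [(label, value, unit), ...] from a plugin perfdata string."""
    items = []
    for match in PERF_ITEM.finditer(perfdata):
        label = match.group(1) or match.group(2)
        items.append((label, float(match.group(3)), match.group(4)))
    return items


def _sanitize(part):
    """Graphite uses dots as path separators, so replace them and other odd characters."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", part)


def graphite_lines(host, service, perfdata, timestamp, prefix="nagios"):
    """Render perfdata as Graphite plaintext lines (naming scheme is an assumption)."""
    lines = []
    for label, value, _unit in parse_perfdata(perfdata):
        path = ".".join(_sanitize(p) for p in (prefix, host, service, label))
        lines.append(f"{path} {value} {int(timestamp)}")
    return lines
```

Each returned line, followed by a newline, can be written to a Carbon listener's plaintext port. Thresholds and min/max are ignored here because Graphite stores only the value.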

Difficulty: Expert Time estimate: 8-12 hours Prerequisites: Projects 1-8, familiarity with target tools

Real World Outcome

Alerts automatically create PagerDuty incidents
Incident tickets created in ServiceNow
Nagios configuration generated from Terraform
Performance data visualized in Grafana
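
As an illustration of the first outcome, the sketch below maps Nagios notification macros onto a PagerDuty Events API v2 event. The field names follow that API; the dedup-key scheme, the severity mapping, and the function names are assumptions. The notification command is assumed to pass `$NOTIFICATIONTYPE$`, `$HOSTNAME$`, `$SERVICEDESC$`, `$SERVICESTATE$`, and `$SERVICEOUTPUT$` as arguments.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

SEVERITY = {"CRITICAL": "critical", "WARNING": "warning", "UNKNOWN": "error"}


def pagerduty_event(routing_key, notification_type, host, service, state, output):
    """Build an Events API v2 body from Nagios notification macro values."""
    if notification_type == "RECOVERY":
        action = "resolve"
    elif notification_type == "ACKNOWLEDGEMENT":
        action = "acknowledge"
    else:
        action = "trigger"
    return {
        "routing_key": routing_key,
        "event_action": action,
        # Same key for problem and recovery so the incident is resolved, not duplicated.
        "dedup_key": f"nagios/{host}/{service}",
        "payload": {
            "summary": f"{host}/{service} is {state}: {output}"[:1024],
            "source": host,
            "severity": SEVERITY.get(state, "info"),
        },
    }


def send_event(event, url=EVENTS_URL, timeout=10):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the dedup key stable per host and service is what lets a RECOVERY notification close the incident that the PROBLEM notification opened.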

The Core Question You’re Answering

“How do I make Nagios work with my existing operational tools?”

The Interview Questions They’ll Ask

“How would you integrate Nagios with your incident management tool?”
“How do you auto-generate Nagios configuration from your CMDB?”
“How do you correlate Nagios alerts with logs in ELK?”
“What’s the best approach for Nagios + Prometheus coexistence?”

Learning Milestones

Basic: One external integration working
Intermediate: Bidirectional integration (e.g., acknowledge in PagerDuty reflects in Nagios)
Advanced: Configuration-as-code pipeline
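
A configuration-as-code pipeline usually ends in a step that renders object definitions from an inventory. The sketch below shows that step in Python, assuming an inventory of plain dictionaries and the `linux-server` template from the sample configuration. The function names and the inventory shape are hypothetical.

```python
import re

SAFE_VALUE = re.compile(r"[A-Za-z0-9_.:-]+")


def render_host(name, address, template="linux-server", hostgroups=()):
    """Render one Nagios host object definition as text."""
    for value in (name, address, template, *hostgroups):
        # Reject whitespace, newlines and ';' so inventory data cannot inject directives.
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value in inventory: {value!r}")
    lines = [
        "define host {",
        f"    use        {template}",
        f"    host_name  {name}",
        f"    address    {address}",
    ]
    if hostgroups:
        lines.append("    hostgroups " + ",".join(hostgroups))
    lines.append("}")
    return "\n".join(lines) + "\n"


def render_inventory(hosts):
    """Render all hosts sorted by name so regenerated files produce stable diffs."""
    ordered = sorted(hosts, key=lambda host: host["name"])
    return "\n".join(render_host(**host) for host in ordered)
```

In a pipeline, the rendered file would be written to a directory included via `cfg_dir`, verified with `nagios -v` against the main configuration file, and only then followed by a reload.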

Project Comparison Table

Project	Difficulty	Time	Key Skill	Real-World Value
1. Install from Source	Intermediate	4-6h	Linux admin	Foundation
2. Config Structure	Intermediate	3-4h	Configuration	Organization
3. Local Services	Beginner	2-3h	Plugin usage	Basic monitoring
4. NRPE Remote	Advanced	4-6h	Networking	Remote monitoring
5. Custom Scripts	Intermediate	4-6h	Scripting	Extensibility
6. Host/Service Groups	Intermediate	2-3h	Configuration	Scalability
7. Timeperiods	Intermediate	2-3h	Scheduling	Operations
8. Notifications	Advanced	4-6h	Integration	Alerting
9. Escalations	Advanced	3-4h	Workflow	On-call
10. Event Handlers	Expert	6-8h	Automation	Self-healing
11. Passive/NSCA	Advanced	4-6h	Push model	Special cases
12. Performance Data	Advanced	6-8h	Metrics	Trending
13. Plugin Development	Advanced	8-10h	Programming	Extension
14. Windows Hosts	Advanced	4-6h	Cross-platform	Enterprise
15. SNMP Devices	Expert	6-8h	Network	Infrastructure
16. Log Monitoring	Advanced	4-6h	Log analysis	Proactive
17. Dashboards	Advanced	6-8h	Visualization	UX
18. High Availability	Master	12-16h	Clustering	Reliability
19. Integration	Expert	8-12h	APIs	Ecosystem

Summary

#	Project	Core Skill	Prerequisites
1	Install from Source	Linux compilation	Linux basics
2	Config Structure	Object model	Project 1
3	Local Services	Plugin usage	Project 2
4	NRPE Remote	Remote execution	Project 3
5	Custom Scripts	Plugin development	Scripting
6	Host/Service Groups	Organization	Project 4
7	Timeperiods	Scheduling	Project 4
8	Notifications	Multi-channel alerts	Project 4
9	Escalations	Workflow routing	Project 8
10	Event Handlers	Auto-remediation	Project 4
11	Passive/NSCA	Push monitoring	Project 4
12	Performance Data	Metrics graphing	Project 5
13	Plugin Development	Production plugins	Project 5
14	Windows Hosts	Cross-platform	Project 4
15	SNMP Devices	Network monitoring	Project 4
16	Log Monitoring	Log analysis	Project 5
17	Dashboards	Visualization	Project 4
18	High Availability	Clustering	Projects 1-10
19	Integration	APIs	Projects 1-8

What You’ll Achieve

After completing this learning path, you will be able to:

Install and configure Nagios Core from source with confidence
Design monitoring for complex infrastructure (Linux, Windows, network devices)
Write custom plugins that follow best practices
Configure notifications with escalations and on-call rotations
Implement auto-remediation with event handlers
Build dashboards for different stakeholders
Scale Nagios with high availability configurations
Integrate with incident management and automation tools

Most importantly, you will understand the principles behind monitoring that apply to any tool - not just Nagios.

Additional Resources

Official Documentation

Nagios Core Documentation: https://assets.nagios.com/downloads/nagioscore/docs/
Nagios Plugins Guidelines: https://nagios-plugins.org/doc/guidelines.html
NRPE Documentation: https://github.com/NagiosEnterprises/nrpe

Community Resources

Nagios Exchange: https://exchange.nagios.org/ (thousands of plugins)
Nagios Library: https://library.nagios.com/
Monitoring Portal: https://www.monitoring-portal.org/

Books

“Nagios Core Beginner’s Guide” by Eric Loyd - Packt Publishing
“Nagios: System and Network Monitoring” by Wolfgang Barth - No Starch Press
“Essential SNMP” by Douglas Mauro & Kevin Schmidt - O’Reilly

Related Tools

Prometheus - Modern pull-based monitoring with PromQL
Grafana - Advanced visualization
Icinga - Nagios fork with modern features
Zabbix - All-in-one monitoring platform
OpenTelemetry - Observability framework

Last updated: 2025-01-01 | Projects: 19 | Difficulty range: Beginner to Master | Estimated total time: 90-130 hours