Nagios Monitoring Mastery: From Installation to Enterprise-Scale Monitoring
Goal: Deeply understand infrastructure monitoring by mastering Nagios Core from the ground up. You will learn how Nagios checks work at the protocol level, how the scheduling engine orchestrates thousands of checks, how notifications flow through escalation chains, and how to extend Nagios with custom plugins. By completing these projects, you will understand why monitoring systems are architected the way they are, how to debug check failures, configure complex notification routing, implement auto-remediation with event handlers, and scale monitoring to enterprise environments. You will internalize the mental model that transforms raw check results into actionable operational intelligence.
Why Nagios Matters
Nagios is the grandfather of open-source infrastructure monitoring. First released in 1999 as “NetSaint,” it pioneered concepts that every modern monitoring tool still uses: host/service checks, state machines, notification escalations, and plugin architecture. Understanding Nagios deeply means understanding the foundations of monitoring itself.
Historical Context
Before Nagios, monitoring usually meant one of three things:
- Expensive commercial tools (HP OpenView, Tivoli) that cost tens of thousands of dollars
- Custom scripts that checked things ad-hoc with no centralized view
- Nothing at all - operators discovered problems when users complained
Ethan Galstad created Nagios to solve a real problem: he needed to monitor a network of Linux servers and couldn’t afford commercial solutions. The result was a monitoring philosophy that remains relevant 25 years later:
- Plugins are external - Any script that returns 0/1/2/3 is a valid check
- Configuration is declarative - Define what you want to monitor, not how to monitor it
- State is explicit - OK, WARNING, CRITICAL, UNKNOWN with clear transitions
- Notifications are programmable - Send alerts however you want
Industry Adoption
While newer tools (Prometheus, Datadog, Zabbix) have emerged, Nagios remains foundational:
- 30,000+ organizations actively use Nagios
- 10,000+ plugins available in the Nagios Exchange
- Industry standard for traditional infrastructure monitoring
- Base for derivatives - Icinga and Naemon began as forks of Nagios; Shinken reimplemented the core in Python while staying compatible with Nagios configuration
Real-World Impact
   Before Monitoring             With Nagios
┌─────────────────────┐    ┌─────────────────────┐
│ User: "Site is down"│    │ Alert: Web server   │
│ Admin: "Let me SSH" │    │ response time > 5s  │
│ (30 minutes later)  │    │ (detected in 60s)   │
│ Admin: "Fixed it"   │    │ Event handler:      │
│ User: "Finally!"    │    │ restart_httpd.sh    │
└─────────────────────┘    │ (auto-fixed in 90s) │
                           └─────────────────────┘
MTTR: 30+ minutes          MTTR: 90 seconds
The Nagios Mental Model
Understanding Nagios requires internalizing its core architecture:
┌─────────────────────────────────────────────┐
│ NAGIOS CORE │
│ ┌─────────────────────────────────────┐ │
│ │ SCHEDULING ENGINE │ │
│ │ (when to run which checks) │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────▼──────────────────────┐ │
│ │ CHECK EXECUTION │ │
│ │ (run plugins, collect results) │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────▼──────────────────────┐ │
│ │ STATE ENGINE │ │
│ │ (OK→WARNING→CRITICAL transitions) │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────▼──────────────────────┐ │
│ │ NOTIFICATION ENGINE │ │
│ │ (who to notify, when, how) │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ┌──────────────▼──────────────────────┐ │
│ │ EVENT HANDLERS │ │
│ │ (auto-remediation scripts) │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
│
┌─────────────────────────┼─────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ PLUGIN │ │ NRPE │ │ NSCA │
│ check_http │ │ (Remote │ │ (Passive │
│ check_ping │ │ checks) │ │ checks) │
│ check_disk │ │ │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Local Host │ │ Remote Host │ │ Remote Host │
│ (Nagios runs │ │ (NRPE daemon │ │ (sends data │
│ checks here) │ │ runs checks) │ │ proactively) │
└───────────────┘ └───────────────┘ └───────────────┘
Why Nagios Before Modern Tools?
You might wonder: “Why learn Nagios when Prometheus exists?”
The answer is foundational understanding:
| Concept | Nagios Teaches | Modern Equivalent |
|---|---|---|
| Check execution | Fork/exec model, exit codes | Prometheus scraping |
| State machines | Hard/soft states, flapping | Alertmanager grouping |
| Plugins | Standard interface (0/1/2/3) | Exporters |
| Notifications | Contact groups, escalations | PagerDuty integrations |
| Configuration | Object inheritance | Labels and selectors |
| Passive checks | Push model | Pushgateway |
| Remote execution | NRPE | Node exporter |
Learning Nagios deeply teaches you why these patterns exist, not just how to configure them in modern tools.
Prerequisites & Background Knowledge
Essential Prerequisites
Before starting, you should have:
- Linux command line proficiency
- Navigate filesystems, edit files with vim/nano
- Understand file permissions and ownership
- Use systemd (systemctl start/stop/status)
- Read log files with tail, grep, less
- Basic networking knowledge
- TCP/IP fundamentals (ports, connections)
- DNS resolution basics
- HTTP request/response cycle
- ICMP (ping) and how it works
- Shell scripting basics
- Write simple bash scripts
- Use variables, conditionals, loops
- Understand exit codes
- Parse command output
Helpful but Not Required
- Previous experience with any monitoring tool
- Web server configuration (Apache/Nginx)
- SNMP protocol knowledge
- Database administration basics
Self-Assessment Questions
Before starting, verify you can answer:
- “What happens when you run ping google.com?”
- “How do you check if a service is running on Linux?”
- “What does exit code 0 mean in a shell script?”
- “How do you configure a firewall rule in iptables or firewalld?”
- “What is the difference between TCP and UDP?”
If you struggle with these, spend time on Linux fundamentals first.
Development Environment Setup
Option 1: VirtualBox/Vagrant (Recommended)
┌─────────────────────────────────────────────────┐
│ Your Laptop │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ VM: nagios-srv │ │ VM: client-01 │ │
│ │ 192.168.56.10 │ │ 192.168.56.11 │ │
│ │ Nagios Core │ │ NRPE Agent │ │
│ │ Apache │ │ Test services │ │
│ └────────────────┘ └────────────────┘ │
│ │ │ │
│ └────────────────────┘ │
│ Host-only network │
└─────────────────────────────────────────────────┘
Option 2: Cloud VMs (AWS/GCP/DigitalOcean)
- 2 small VMs (t2.micro equivalent)
- Security groups allowing ports 80, 443, 5666
- SSH access configured
Option 3: Docker Containers
- Good for quick testing
- Less realistic for learning production patterns
- Use docker-compose for multi-container setup
Time Investment Expectations
| Learning Path | Time | Projects |
|---|---|---|
| Quick overview | 2 weeks | 1-5 |
| Solid foundation | 4 weeks | 1-10 |
| Comprehensive mastery | 8 weeks | 1-18 |
| Enterprise expert | 12 weeks | All projects |
Reality Check
This guide does not provide:
- Copy-paste configurations (you must understand each directive)
- GUI-based shortcuts (we focus on configuration files)
- Nagios XI (commercial version) - we use Nagios Core (open source)
You will:
- Read man pages and documentation
- Debug configuration errors
- Write shell scripts
- Break things and fix them
Core Concept Analysis
1. The Plugin Model
Nagios plugins are the heart of monitoring. They are simple executables that:
┌─────────────────────────────────────────────────────────┐
│ PLUGIN CONTRACT │
├─────────────────────────────────────────────────────────┤
│ INPUT: │
│ - Command-line arguments (thresholds, targets) │
│ - Environment variables (optional) │
│ │
│ OUTPUT: │
│ - STDOUT: Human-readable message + performance data │
│ - EXIT CODE: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN │
│ │
│ EXAMPLE: │
│ $ ./check_disk -w 20% -c 10% -p / │
│ DISK OK - free space: / 45678 MB (72%)|/=17890MB │
│ $ echo $? │
│ 0 │
└─────────────────────────────────────────────────────────┘
Why this matters: Any language, any script - if it follows this contract, Nagios can use it. This is the ultimate in extensibility.
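The contract above can be exercised from a plain shell, without Nagios installed. The sketch below is only an illustration: `state_name` and `run_check` are hypothetical helper names, and labeling out-of-range exit codes as UNKNOWN is a choice made for this sketch rather than a statement about how Nagios Core treats them.

```shell
#!/bin/sh
# Map a plugin exit code to a state name.
# Assumption for this sketch: anything outside 0-3 is labeled UNKNOWN.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Run any command as if it were a plugin: keep the first line of STDOUT
# and translate the exit code into a state name.
# The body runs in a subshell, so its variables do not leak to the caller.
run_check() (
    output=$("$@" 2>/dev/null)
    code=$?
    first_line=$(printf '%s\n' "$output" | head -n 1)
    echo "$(state_name "$code"): $first_line"
)

# An inline one-line "plugin": prints a message with perfdata and exits 1.
run_check sh -c 'echo "Queue is filling up|queue=8"; exit 1'
```

Swapping the inline command for a real plugin path (for example one under /usr/local/nagios/libexec) lets you see how any executable is interpreted under the same rules.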
2. Object Configuration Model
Nagios uses an object-oriented configuration model:
┌─────────────────────────────────────────────────────────┐
│ OBJECT HIERARCHY │
├─────────────────────────────────────────────────────────┤
│ │
│ timeperiod ──────┐ │
│ │ │
│ command ─────────┼──► service ──► servicegroup │
│ │ │ │
│ contact ─────────┼───────┼──► contactgroup │
│ │ │ │
│ host ────────────┴───────┘ │
│ │ │
│ └──► hostgroup │
│ │
│ Templates can be inherited at any level │
│ │
└─────────────────────────────────────────────────────────┘
Configuration Inheritance Example:
┌─────────────────────────────────────────────────────────┐
│ │
│ define host { │
│ name linux-server-template │
│ check_period 24x7 │
│ notification_period 24x7 │
│ register 0 ; template only │
│ } │
│ │
│ define host { │
│ use linux-server-template │
│ host_name web-server-01 │
│ address 192.168.1.10 │
│ } │
│ │
└─────────────────────────────────────────────────────────┘
3. State Machine
Nagios implements a sophisticated state machine for each host and service:
┌─────────────────────────────────┐
│ STATE MACHINE │
└─────────────────────────────────┘
┌───────────────┐ ┌───────────────┐
│ SOFT STATE │ max_check_attempts │ HARD STATE │
│ (transitional)│ ─────────────────► │ (confirmed) │
└───────────────┘ └───────────────┘
Example: max_check_attempts = 3
Check 1: CRITICAL ──► SOFT CRITICAL (1/3)
Check 2: CRITICAL ──► SOFT CRITICAL (2/3)
Check 3: CRITICAL ──► HARD CRITICAL (3/3) ──► NOTIFY!
Why soft states exist:
- Prevent alert storms from transient issues
- Allow recovery before notification
- Reduce false positives
┌─────────────────────────────────────────────────────┐
│ STATE TRANSITIONS │
│ │
│ OK ◄─────────────────────────────────► WARNING │
│ ▲ ▲ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ UNKNOWN ◄─────────────────────────────► CRITICAL │
│ │
└─────────────────────────────────────────────────────┘
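To make the soft-to-hard promotion concrete, here is a small simulation of the rule shown above. It is a simplified model under stated assumptions: `simulate_states` is a hypothetical helper, every non-OK result simply increments the attempt counter, and flapping, host state, and repeat notifications are ignored.

```shell
#!/bin/sh
# Usage: simulate_states MAX_CHECK_ATTEMPTS RESULT [RESULT...]
# Prints how each result would be classified in this simplified model.
simulate_states() (
    max=$1
    shift
    attempt=0   # consecutive non-OK results counted so far
    for result in "$@"; do
        if [ "$result" = "OK" ]; then
            if [ "$attempt" -ge "$max" ]; then
                echo "OK: hard recovery, notify"
            elif [ "$attempt" -gt 0 ]; then
                echo "OK: soft recovery, no notification"
            else
                echo "OK: still OK"
            fi
            attempt=0
        elif [ "$attempt" -ge "$max" ]; then
            # Already in a hard problem state; nothing changes
            echo "$result: HARD $result ($max/$max), no state change"
        else
            attempt=$((attempt + 1))
            if [ "$attempt" -eq "$max" ]; then
                echo "$result: HARD $result ($attempt/$max), notify"
            else
                echo "$result: SOFT $result ($attempt/$max)"
            fi
        fi
    done
)

# Reproduces the max_check_attempts = 3 example, followed by a recovery.
simulate_states 3 CRITICAL CRITICAL CRITICAL CRITICAL OK
```

Running it with a sequence such as `CRITICAL OK` shows why soft states matter: the problem clears before the third attempt, so no notification is produced.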
4. Check Scheduling
Nagios must efficiently schedule checks across thousands of hosts/services:
┌─────────────────────────────────────────────────────────┐
│ CHECK SCHEDULING │
├─────────────────────────────────────────────────────────┤
│ │
│ check_interval = 5 (minutes between checks) │
│ retry_interval = 1 (minutes between retries) │
│ │
│ Timeline (service goes CRITICAL at T=0): │
│ │
│ T=0 Check: CRITICAL ──► SOFT(1/3), retry in 1 min │
│ T=1 Check: CRITICAL ──► SOFT(2/3), retry in 1 min │
│ T=2 Check: CRITICAL ──► HARD(3/3), NOTIFY! │
│ T=7 Check: CRITICAL ──► still HARD │
│ T=12 Check: OK ──► RECOVERY, NOTIFY! │
│ T=17 Check: OK ──► normal interval resumes │
│ │
│ Scheduling optimization: │
│ - Interleaving: spread checks across time │
│ - Parallelization: run multiple checks simultaneously │
│ - Freshness: detect stale passive check results │
│ │
└─────────────────────────────────────────────────────────┘
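The timeline above follows from one rule: recheck at retry_interval while the problem is still soft, and at check_interval otherwise. The sketch below applies that rule to a list of results. `schedule_timeline` is a hypothetical helper, and real scheduling also involves interleaving and check latency, which this model leaves out.

```shell
#!/bin/sh
# Usage: schedule_timeline MAX_ATTEMPTS CHECK_INTERVAL RETRY_INTERVAL RESULT...
# Prints the minute at which each check runs and the delay to the next one.
schedule_timeline() (
    max=$1
    check_interval=$2
    retry_interval=$3
    shift 3
    t=0
    attempt=0
    for result in "$@"; do
        if [ "$result" = "OK" ]; then
            attempt=0
        elif [ "$attempt" -lt "$max" ]; then
            attempt=$((attempt + 1))
        fi
        # Soft problem state: some failures seen, but fewer than max attempts
        if [ "$attempt" -gt 0 ] && [ "$attempt" -lt "$max" ]; then
            delay=$retry_interval
        else
            delay=$check_interval
        fi
        echo "T=$t $result (next check in $delay min)"
        t=$((t + delay))
    done
)

# Same parameters as the box above: 3 attempts, 5-minute checks, 1-minute retries.
schedule_timeline 3 5 1 CRITICAL CRITICAL CRITICAL CRITICAL OK OK
```

With these inputs the checks land at T=0, 1, 2, 7, 12, and 17, matching the timeline in the box.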
5. Notification Flow
Notifications follow a complex decision tree:
┌─────────────────────────────────────────────────────────┐
│ NOTIFICATION FLOW │
├─────────────────────────────────────────────────────────┤
│ │
│ Check result: HARD CRITICAL │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Notifications │ NO │
│ │ enabled? ├────────► No notification │
│ └────────┬────────┘ │
│ │ YES │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Within time │ NO │
│ │ period? ├────────► No notification │
│ └────────┬────────┘ │
│ │ YES │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Contact wants │ NO │
│ │ this state? ├────────► No notification │
│ └────────┬────────┘ │
│ │ YES │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Escalation │ │
│ │ applies? │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ SEND NOTIFICATION │
│ │
└─────────────────────────────────────────────────────────┘
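The first three filters in the flowchart can be read as a chain of early exits. The sketch below mirrors that order. `should_notify` is a hypothetical helper that takes yes/no answers; the directive names in the comments are the Nagios settings that usually drive each answer. Escalations are not modeled, since they change who is notified rather than whether a notification is sent.

```shell
#!/bin/sh
# Usage: should_notify ENABLED IN_PERIOD CONTACT_WANTS   (each "yes" or "no")
# Returns 0 and prints "send notification" only if every filter passes.
should_notify() {
    enabled=$1
    in_period=$2
    contact_wants=$3
    # Filter 1: notifications_enabled on the host or service
    if [ "$enabled" != yes ]; then
        echo "blocked: notifications disabled"
        return 1
    fi
    # Filter 2: notification_period on the host or service
    if [ "$in_period" != yes ]; then
        echo "blocked: outside notification period"
        return 1
    fi
    # Filter 3: service_notification_options on the contact (w,u,c,r,...)
    if [ "$contact_wants" != yes ]; then
        echo "blocked: contact does not want this state"
        return 1
    fi
    echo "send notification"
}

should_notify yes no yes || true
should_notify yes yes yes
```

The point of the ordering is that a single "no" anywhere in the chain is enough to suppress the alert, which is the first thing to check when an expected notification never arrives.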
6. Active vs Passive Checks
┌─────────────────────────────────────────────────────────┐
│ ACTIVE CHECKS (Pull Model) │
├─────────────────────────────────────────────────────────┤
│ │
│ Nagios ────► Run plugin ────► Get result │
│ │
│ - Nagios controls timing │
│ - Nagios initiates connection │
│ - Good for: scheduled monitoring │
│ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ PASSIVE CHECKS (Push Model) │
├─────────────────────────────────────────────────────────┤
│ │
│ External ────► Send result ────► Nagios accepts │
│ │
│ - External system controls timing │
│ - External system initiates connection │
│ - Good for: security events, async jobs, firewalled hosts │
│ │
└─────────────────────────────────────────────────────────┘
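On the Nagios server itself, a passive result arrives as a line written to the external command file. The sketch below only formats such a line using the PROCESS_SERVICE_CHECK_RESULT external command. `passive_result_line` is a hypothetical helper, the host and service names are invented, and the command file path in the comment is the usual location for a source install, which may differ on your system.

```shell
#!/bin/sh
# Usage: passive_result_line TIMESTAMP HOST SERVICE RETURN_CODE OUTPUT
# Prints one external-command line for a passive service check result.
passive_result_line() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical host and service, for illustration only.
passive_result_line "$(date +%s)" backup-01 "Nightly Backup" 0 "OK - backup finished"

# To submit for real, write the line to the external command file, e.g.:
#   passive_result_line "$(date +%s)" backup-01 "Nightly Backup" 0 "OK" \
#       > /usr/local/nagios/var/rw/nagios.cmd
```

The host and service named in the line must already be defined in the configuration with passive checks accepted, otherwise Nagios discards the result.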
Concept Summary Table
| Concept | What You Must Internalize |
|---|---|
| Plugin Model | Any executable returning 0/1/2/3 with STDOUT message is a valid check |
| Object Inheritance | Templates reduce duplication; use directive inherits all attributes |
| Soft/Hard States | Soft states prevent false alerts; hard states trigger notifications |
| Check Scheduling | check_interval for normal, retry_interval during problems |
| Notification Logic | Many conditions must be true before a notification sends |
| Active vs Passive | Active = Nagios pulls, Passive = external pushes via NSCA |
| NRPE | Runs local checks on remote hosts securely |
| Escalations | Route notifications based on problem duration |
| Event Handlers | Auto-remediation scripts triggered by state changes |
| Flapping | Detection of rapid state oscillation |
Deep Dive Reading by Concept
Nagios Architecture
| Concept | Resource |
|---|---|
| Overall architecture | Nagios Core Documentation - Chapter 5: Basic Concepts |
| Plugin development | Nagios Plugin Development Guidelines - nagios-plugins.org |
| Configuration structure | Nagios Core Documentation - Chapter 3: Configuration Overview |
Monitoring Fundamentals
| Concept | Book & Chapter |
|---|---|
| Monitoring philosophy | “The Practice of System and Network Administration” by Limoncelli - Ch. 22 |
| Alert design | “Site Reliability Engineering” by Beyer et al. - Ch. 6: Monitoring |
| On-call practices | “On-Call” by Jones - Chapters 1-3 |
Network Monitoring
| Concept | Book & Chapter |
|---|---|
| SNMP protocol | “Essential SNMP” by Mauro & Schmidt - Ch. 1-4 |
| Network monitoring | “Network Warrior” by Donahue - Ch. 24 |
| TCP/IP fundamentals | “TCP/IP Illustrated, Vol. 1” by Stevens - Ch. 1-5 |
Scripting and Automation
| Concept | Book & Chapter |
|---|---|
| Bash scripting | “Linux Command Line and Shell Scripting Bible” by Blum - Ch. 11-15 |
| Perl plugins | “Learning Perl” by Schwartz - Ch. 1-8 |
| Python scripting | “Automate the Boring Stuff” by Sweigart - Ch. 1-6 |
Essential Reading Order
- Week 1: Nagios Core Documentation Chapters 1-5
- Week 2: Plugin Development Guidelines + Essential SNMP Ch. 1-2
- Week 3: SRE Book Chapter 6 + Limoncelli Ch. 22
- Week 4: Shell scripting reference as needed
Quick Start Guide
If you feel overwhelmed, here’s your first 48 hours:
Day 1 (4 hours):
- Set up VirtualBox VM with CentOS or Ubuntu
- Complete Project 1 (Install from source)
- Access the web interface
- Understand the directory structure
Day 2 (4 hours):
- Complete Project 2 (Configuration structure)
- Add your first custom host
- Complete Project 3 (Local service monitoring)
- See your first alert
After this, you have a working Nagios system and understand the basics. Continue with projects 4-10 for a comprehensive foundation, then 11-18 for advanced topics.
Recommended Learning Paths
Path 1: System Administrator (4 weeks)
Focus on practical monitoring of Linux/Windows infrastructure.
- Projects: 1, 2, 3, 4, 5, 6, 8, 14
- Outcome: Can monitor typical IT infrastructure with notifications
Path 2: DevOps Engineer (6 weeks)
Focus on automation, custom checks, and integration.
- Projects: 1, 2, 3, 5, 6, 9, 10, 15, 18, 19
- Outcome: Can integrate Nagios into CI/CD and create custom monitoring
Path 3: Monitoring Specialist (8 weeks)
Comprehensive coverage of all Nagios capabilities.
- Projects: All, in order
- Outcome: Enterprise-scale Nagios deployment and management
Project List
Project 1: Installing Nagios Core from Source
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Bash/Shell
- Difficulty: Level 2: Intermediate
- Knowledge Area: Linux System Administration
- Software or Tool: GCC, Make, Apache HTTPD
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: A fully functional Nagios Core installation compiled from source code, running with Apache web server and accessible via browser.
Why it teaches the concept: Compiling from source forces you to understand dependencies, directory structure, and how Nagios components interact. Package managers hide these details.
Core challenges you’ll face:
- Installing build dependencies - Understanding what libraries Nagios needs (GD, OpenSSL, etc.)
- Configuring the build - Using ./configure with appropriate paths
- Setting up Apache integration - CGI scripts, authentication, permissions
- Creating the nagios user/group - Understanding security isolation
Key Concepts:
- Source compilation workflow
- CGI web applications
- Apache authentication (htpasswd)
- Systemd service management
- Difficulty: Intermediate
- Time estimate: 4-6 hours
- Prerequisites: Linux command line, package management, text editing
Real World Outcome
After completing this project, you will have:
$ systemctl status nagios
● nagios.service - Nagios Core Monitoring System
Active: active (running) since Mon 2024-01-15 10:23:45 UTC
Main PID: 12345 (nagios)
$ curl -u nagiosadmin:password http://localhost/nagios/
# Returns Nagios web interface HTML
$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Nagios Core 4.4.x
Copyright (c) 2009-present Nagios Core Development Team
...
Total Warnings: 0
Total Errors: 0
The Core Question You’re Answering
“What are the minimal components needed for a monitoring system, and how do they connect?”
Concepts You Must Understand First
- What is a daemon process?
- How does it differ from a regular process?
- Why does Nagios run as a daemon?
- How does Apache serve dynamic content?
- What is CGI?
- Why does Nagios use CGI instead of a modern framework?
- What are file permissions and why do they matter?
- Who should own the Nagios files?
- What permissions are required for CGI execution?
Questions to Guide Your Design
Before installing:
- Where should Nagios binaries be installed? Why?
- What user account should Nagios run as? Why not root?
- How will you back up your configuration?
- How will you verify the installation succeeded?
Thinking Exercise
Before running make install, draw the directory structure you expect:
- Where will the main daemon binary be?
- Where will configuration files live?
- Where will plugins be stored?
- Where will log files go?
Verify your drawing against the actual installation.
The Interview Questions They’ll Ask
- “Why would you compile from source instead of using packages?”
- “How do you verify a Nagios configuration before applying it?”
- “What happens if the nagios user doesn’t have permission to execute plugins?”
- “How would you upgrade Nagios without losing configuration?”
- “What are the security implications of running CGI scripts?”
Hints in Layers
Hint 1 - Starting Point: Begin with a minimal Linux installation. CentOS/RHEL or Ubuntu LTS are recommended.
Hint 2 - Dependencies: You need: gcc, glibc, glibc-common, gd, gd-devel, openssl, openssl-devel, make, perl, wget
Hint 3 - Build Process:
# General flow (not copy-paste - understand each step)
./configure --with-command-group=nagcmd
make all
make install
make install-init
make install-commandmode
make install-config
make install-webconf
Hint 4 - Verification:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
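A common habit that builds on this check is to reload only when validation passes. The sketch below is one way to wire that up. `validate_and_reload` is a hypothetical helper, the two paths are the source-install defaults used in this guide, and the reload command assumes a systemd unit named nagios that supports reload.

```shell
#!/bin/sh
# Defaults can be overridden from the environment.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
# Assumption: the service is managed by systemd under the name "nagios".
RELOAD_CMD="${RELOAD_CMD:-systemctl reload nagios}"

# Validate the configuration; reload only if validation succeeds.
validate_and_reload() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        echo "config valid, reloading"
        $RELOAD_CMD
    else
        echo "config invalid, not reloading"
        return 1
    fi
}

# Call validate_and_reload after editing configuration files.
```

When validation fails, rerun the nagios -v command by hand to read the full error output, which this helper deliberately hides.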
Books That Will Help
| Topic | Resource |
|---|---|
| Installation process | Nagios Core Beginner’s Guide - Chapter 2 |
| Apache configuration | Apache Cookbook - O’Reilly |
| Linux system admin | Linux Bible - Christopher Negus |
Common Pitfalls & Debugging
| Problem | Root Cause | Fix |
|---|---|---|
| “Permission denied” on CGI | Apache user not in nagcmd group | usermod -aG nagcmd apache (the user is www-data on Debian/Ubuntu) |
| “Cannot open config file” | Wrong ownership | chown -R nagios:nagios /usr/local/nagios |
| Web interface shows blank | CGI not enabled | Enable mod_cgi in Apache |
| Daemon won’t start | Config error | Run nagios -v to validate |
Learning Milestones
- Basic: Nagios daemon starts and stays running
- Intermediate: Web interface accessible with authentication
- Advanced: Can modify nagios.cfg and reload without restart
Project 2: Understanding the Configuration File Structure
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration Language
- Difficulty: Level 2: Intermediate
- Knowledge Area: Configuration Management
- Software or Tool: Text editor, nagios -v
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: A well-organized configuration directory with separate files for hosts, services, commands, contacts, and templates.
Why it teaches the concept: Understanding the configuration structure is essential for maintainable monitoring. Poor organization leads to configuration drift and errors.
Core challenges you’ll face:
- Understanding cfg_file vs cfg_dir - How Nagios finds configuration
- Object relationships - How hosts reference commands and templates
- Template inheritance - Creating reusable configuration blocks
- Configuration validation - Catching errors before restart
Key Concepts:
- Object definition syntax
- Directive inheritance
- Configuration file inclusion
- Macro expansion
- Difficulty: Intermediate
- Time estimate: 3-4 hours
- Prerequisites: Project 1 completed
Real World Outcome
Your configuration directory will look like:
/usr/local/nagios/etc/
├── nagios.cfg # Main configuration
├── cgi.cfg # Web interface settings
├── resource.cfg # Sensitive variables ($USER1$)
├── objects/
│ ├── commands.cfg # Check command definitions
│ ├── contacts.cfg # Contact definitions
│ ├── timeperiods.cfg # When to check/notify
│ ├── templates.cfg # Reusable object templates
│ ├── hosts/
│ │ ├── web-servers.cfg
│ │ └── db-servers.cfg
│ └── services/
│ ├── linux-services.cfg
│ └── web-services.cfg
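A layout like this only takes effect once nagios.cfg points at it. The excerpt below is a sketch of what those include lines might look like for the tree above, assuming the source-install prefix /usr/local/nagios used throughout this guide.

```cfg
# nagios.cfg (excerpt)
# cfg_file loads a single file.
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# cfg_dir loads every .cfg file in the directory, including subdirectories.
cfg_dir=/usr/local/nagios/etc/objects/hosts
cfg_dir=/usr/local/nagios/etc/objects/services

# User macros such as $USER1$ live in a separate resource file.
resource_file=/usr/local/nagios/etc/resource.cfg
```

Using cfg_dir for hosts and services means a new file dropped into those directories is picked up on the next reload without touching nagios.cfg.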
The Core Question You’re Answering
“How do I organize monitoring configuration so it scales to hundreds of hosts without becoming unmaintainable?”
Concepts You Must Understand First
- What is object inheritance in Nagios?
- How does the use directive work?
- What is the difference between name and host_name?
- What are Nagios macros?
- What is $HOSTADDRESS$?
- Where is $USER1$ defined?
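To see where each macro gets its value, it helps to follow one check from definition to command line. The fragment below is a sketch modeled on the sample configuration that ships with Nagios Core. The host web-server-01 and its address come from the inheritance example earlier in this guide, and the generic-service template is assumed to exist.

```cfg
# resource.cfg: user macros, kept out of the web-readable object files
$USER1$=/usr/local/nagios/libexec

# commands.cfg: $ARGn$ values are filled in from the service's check_command
define command {
    command_name  check_ping
    command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}

# A service using the command; "!" separates the command name and arguments
define service {
    use                  generic-service
    host_name            web-server-01
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}

# At check time the command line expands to:
#   /usr/local/nagios/libexec/check_ping -H 192.168.1.10 -w 100.0,20% -c 500.0,60% -p 5
```

$HOSTADDRESS$ comes from the address directive of the host being checked, which is why one command definition can serve every host.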
Questions to Guide Your Design
- How many hosts will you monitor eventually?
- Will hosts be grouped by location, function, or owner?
- Who maintains which parts of the configuration?
- How will you version control the configuration?
The Interview Questions They’ll Ask
- “How does Nagios macro substitution work?”
- “What’s the difference between a host template and a host definition?”
- “How would you organize configuration for 1000 hosts?”
- “How do you handle secrets in Nagios configuration?”
- “What happens if two configuration files define the same object?”
Learning Milestones
- Basic: Understand nagios.cfg and cfg_dir directive
- Intermediate: Create templates and use inheritance
- Advanced: Organize configuration for team collaboration
Project 3: Monitoring Local Services
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 1: Beginner
- Knowledge Area: Service Monitoring
- Software or Tool: Nagios Plugins
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: Monitoring for essential local services: disk space, CPU load, memory usage, running processes, and swap usage on the Nagios server itself.
Why it teaches the concept: Local monitoring is the foundation. These same concepts apply to remote hosts via NRPE.
Core challenges you’ll face:
- Understanding check commands - How plugins receive arguments
- Setting thresholds - Choosing appropriate WARNING and CRITICAL values
- Interpreting plugin output - Understanding performance data
Key Concepts:
- Standard Nagios plugins (check_disk, check_load, check_procs)
- Threshold syntax (-w and -c arguments)
- Performance data format
- Service check intervals
- Difficulty: Beginner
- Time estimate: 2-3 hours
- Prerequisites: Projects 1-2 completed
Real World Outcome
# From the web interface or command line:
$ /usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
DISK OK - free space: / 45678 MB (75% inode=89%);| /=15234MB;51200;57600;0;64000
$ /usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6
OK - load average: 0.15, 0.10, 0.08|load1=0.150;5.000;10.000;0; load5=0.100;4.000;8.000;0; load15=0.080;3.000;6.000;0;
Web interface shows all services green (OK state).
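The command-line check above becomes a monitored service once it is wrapped in a command and a service definition. The fragment below is a sketch along the lines of the sample localhost configuration. It assumes the generic-service template is present and that interval_length is left at its default of 60 seconds, so intervals are in minutes.

```cfg
define command {
    command_name  check_local_disk
    command_line  $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}

define service {
    use                  generic-service
    host_name            localhost
    service_description  Root Partition
    check_command        check_local_disk!20%!10%!/   ; warn below 20% free, critical below 10%
    check_interval       5    ; minutes between checks while OK
    retry_interval       1    ; minutes between rechecks during a soft problem
    max_check_attempts   3    ; failed checks needed before the state turns hard
}
```

Note that check_disk thresholds describe free space, so the warning value is the larger of the two.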
The Core Question You’re Answering
“How do I translate ‘disk is almost full’ into a monitoring check that alerts at the right time?”
The Interview Questions They’ll Ask
- “How do you determine appropriate thresholds for disk space?”
- “What’s the difference between load average values 1, 5, and 15 minutes?”
- “How would you monitor a specific process?”
- “What does the performance data after the pipe (|) mean?”
Learning Milestones
- Basic: One service check working with correct thresholds
- Intermediate: All local services monitored with sensible thresholds
- Advanced: Performance data graphed over time
Project 4: Setting Up NRPE for Remote Monitoring
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 3: Advanced
- Knowledge Area: Remote Monitoring, Networking
- Software or Tool: NRPE (Nagios Remote Plugin Executor)
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: NRPE daemon on a remote host that executes local checks and returns results to the Nagios server.
Why it teaches the concept: NRPE is the standard way to monitor internal metrics on remote hosts that require local access (disk, CPU, processes).
Core challenges you’ll face:
- NRPE installation on remote hosts - Compiling or using packages
- Firewall configuration - Port 5666 must be accessible
- NRPE allowed_hosts - Security configuration
- Command definitions - Matching server and client
┌─────────────────────────────────────────────────────────┐
│ NRPE ARCHITECTURE │
├─────────────────────────────────────────────────────────┤
│ │
│ Nagios Server Remote Host │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ │ │ │ │
│ │ check_nrpe │ ──TCP/5666──► │ nrpe daemon │ │
│ │ -H host │ │ │ │
│ │ -c check_cmd │ │ ┌────────┐ │ │
│ │ │ ◄──result──── │ │ plugin │ │ │
│ │ │ │ └────────┘ │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ The command "check_load" on the server maps to │
│ a command definition on the remote host that │
│ runs /usr/local/nagios/libexec/check_load locally. │
│ │
└─────────────────────────────────────────────────────────┘
- Difficulty: Advanced
- Time estimate: 4-6 hours
- Prerequisites: Projects 1-3, second VM or host
Real World Outcome
# From Nagios server:
$ /usr/local/nagios/libexec/check_nrpe -H 192.168.56.11 -c check_load
OK - load average: 0.08, 0.12, 0.15|load1=0.080;5.000;10.000;0;
# Remote host is being monitored for local metrics
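Both sides of the NRPE exchange need matching configuration. The fragments below sketch the minimum for the check_load example. The IP address is the Nagios server from the lab setup earlier in this guide, client-01 is the lab client VM, and the generic-service template is assumed to exist.

```cfg
# --- Remote host: nrpe.cfg (excerpt) ---
# Only these addresses may connect to the NRPE daemon.
allowed_hosts=127.0.0.1,192.168.56.10

# 0 = refuse command arguments sent by the server (the safer setting).
dont_blame_nrpe=0

# Maps the name the server asks for to a local plugin invocation.
command[check_load]=/usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6

# --- Nagios server: object configuration ---
define command {
    command_name  check_nrpe
    command_line  $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

define service {
    use                  generic-service
    host_name            client-01
    service_description  CPU Load
    check_command        check_nrpe!check_load
}
```

With arguments disabled, thresholds live in nrpe.cfg on each remote host, which trades central control for a smaller attack surface.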
The Core Question You’re Answering
“How do I run local checks on remote hosts securely and efficiently?”
The Interview Questions They’ll Ask
- “Why use NRPE instead of SSH?”
- “How does NRPE handle authentication?”
- “What are the security implications of dont_blame_nrpe?”
- “How would you troubleshoot ‘CHECK_NRPE: Error - Could not complete SSL handshake’?”
- “Can NRPE pass arguments to commands? Should it?”
Learning Milestones
- Basic: NRPE daemon running on remote host
- Intermediate: Check commands working from server to client
- Advanced: Secure configuration with SSL and restricted hosts
Project 5: Creating Custom Check Scripts
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Bash, Python, or Perl
- Difficulty: Level 2: Intermediate
- Knowledge Area: Scripting, Plugin Development
- Software or Tool: Any scripting language
- Main Book: Nagios Plugin Development Guidelines
What you’ll build: Custom monitoring plugins for application-specific checks not covered by standard plugins.
Why it teaches the concept: The plugin model is Nagios’s greatest strength. Understanding it lets you monitor anything.
Core challenges you’ll face:
- Plugin specification compliance - Exit codes, output format, performance data
- Threshold handling - Parsing -w and -c arguments
- Error handling - UNKNOWN state for errors
- Timeout handling - Plugin shouldn’t hang
┌─────────────────────────────────────────────────────────┐
│ PLUGIN SPECIFICATION │
├─────────────────────────────────────────────────────────┤
│ │
│ Exit Codes: │
│ 0 = OK │
│ 1 = WARNING │
│ 2 = CRITICAL │
│ 3 = UNKNOWN │
│ │
│ Output Format: │
│ STATUS - Message text|performance_data │
│ │
│ Example: │
│ OK - Queue length is 5|queue_length=5;10;20;0;100 │
│ │
│ Performance Data Format: │
│ label=value[UOM];warn;crit;min;max │
│ │
│ UOM (Unit of Measure): │
│ (none) = count, s = seconds, % = percentage │
│ B/KB/MB/GB/TB = bytes, c = counter │
│ │
└─────────────────────────────────────────────────────────┘
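The performance data format is easiest to understand by taking one item apart. The sketch below splits a single item into its fields. `parse_perfdata` is a hypothetical helper, and it assumes one item with an unquoted label, which is narrower than the full format allows.

```shell
#!/bin/sh
# Usage: parse_perfdata 'label=value[UOM];warn;crit;min;max'
# Prints each field by name. Runs in a subshell so IFS changes stay local.
parse_perfdata() (
    item=$1
    label=${item%%=*}     # text before the first "="
    rest=${item#*=}       # text after the first "="
    set -f                # no filename expansion while splitting
    IFS=';'
    set -- $rest          # split on ";" into positional parameters
    value_uom=$1
    warn=$2
    crit=$3
    min=$4
    max=$5
    # Separate the numeric part from the unit of measure
    value=$(printf '%s' "$value_uom" | sed 's/[^0-9.-].*$//')
    uom=${value_uom#"$value"}
    echo "label=$label value=$value uom=$uom warn=$warn crit=$crit min=$min max=$max"
)

parse_perfdata "queue_length=5;10;20;0;100"
parse_perfdata "/=15234MB;51200;57600;0;64000"
```

Graphing add-ons rely on this structure, so a plugin that emits malformed performance data may alert correctly and still produce no graphs.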
- Difficulty: Intermediate
- Time estimate: 4-6 hours
- Prerequisites: Scripting skills, Projects 1-3
Real World Outcome
A custom plugin that checks something specific to your environment:
$ ./check_app_queue.sh -w 10 -c 20
OK - Queue length is 5|queue_length=5;10;20;0;100
$ echo $?
0
$ ./check_app_queue.sh -w 3 -c 10
WARNING - Queue length is 5|queue_length=5;3;10;0;100
$ echo $?
1
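One possible shape for a plugin that produces the output above is sketched here as a shell function. `get_queue_length` is a hypothetical stand-in that always reports 5, and the maximum of 100 in the performance data is taken from the example output. A real plugin would query the application and would also need its own timeout handling.

```shell
#!/bin/sh
# Hypothetical data source: replace with a real query of your application.
get_queue_length() { echo 5; }

# Usage: check_app_queue -w WARN -c CRIT
# The body runs in a subshell, so "exit N" sets the function's return code.
check_app_queue() (
    OPTIND=1
    warn=""
    crit=""
    while getopts "w:c:" opt; do
        case "$opt" in
            w) warn=$OPTARG ;;
            c) crit=$OPTARG ;;
            *) echo "UNKNOWN - usage: check_app_queue -w WARN -c CRIT"; exit 3 ;;
        esac
    done
    if [ -z "$warn" ] || [ -z "$crit" ]; then
        echo "UNKNOWN - both -w and -c are required"
        exit 3
    fi
    # If the value cannot be read, say so instead of guessing a state
    len=$(get_queue_length) || { echo "UNKNOWN - could not read queue length"; exit 3; }
    perf="queue_length=$len;$warn;$crit;0;100"
    if [ "$len" -ge "$crit" ]; then
        echo "CRITICAL - Queue length is $len|$perf"
        exit 2
    elif [ "$len" -ge "$warn" ]; then
        echo "WARNING - Queue length is $len|$perf"
        exit 1
    fi
    echo "OK - Queue length is $len|$perf"
    exit 0
)

check_app_queue -w 10 -c 20
echo "exit code: $?"
check_app_queue -w 3 -c 10
echo "exit code: $?"
```

Checking the critical threshold before the warning threshold matters: a value above both must report CRITICAL, not WARNING.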
The Core Question You’re Answering
“How do I extend Nagios to monitor something it doesn’t monitor out of the box?”
The Interview Questions They’ll Ask
- “What makes a Nagios plugin valid?”
- “How do you handle timeouts in custom plugins?”
- “What should happen if a plugin can’t connect to what it’s checking?”
- “How do you test a plugin before deploying it?”
- “What is the performance data format and why does it matter?”
Learning Milestones
- Basic: Plugin returns correct exit codes
- Intermediate: Plugin parses threshold arguments correctly
- Advanced: Plugin includes performance data for graphing
Project 6: Host and Service Groups
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 2: Intermediate
- Knowledge Area: Configuration Organization
- Software or Tool: Nagios Core
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: Logical groupings of hosts and services for organized display and notification routing.
Why it teaches the concept: Groups are essential for managing monitoring at scale. They affect both display and notification behavior.
Core challenges you’ll face:
- Defining meaningful groups - By location, function, owner, or criticality
- Group membership methods - Via the members directive in the hostgroup or the hostgroups directive in the host
- Applying services to groups - Using hostgroup_name in a service definition to cover all hosts in a group
- Notification via groups - Contact groups for different teams
- Difficulty: Intermediate
- Time estimate: 2-3 hours
- Prerequisites: Projects 1-4, multiple hosts configured
Real World Outcome
# Web interface shows organized groups:
Hostgroups:
├── web-servers (3 hosts)
├── db-servers (2 hosts)
├── production (4 hosts)
└── development (1 host)
Servicegroups:
├── disk-checks (15 services)
├── http-checks (3 services)
└── database-checks (4 services)
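Groups like these come from a handful of object definitions. The fragment below sketches one hostgroup, one servicegroup, and a service applied to the whole hostgroup. The host names are hypothetical, and the generic-service template and check_http command are assumed to come from the sample configuration.

```cfg
define hostgroup {
    hostgroup_name  web-servers
    alias           Web Servers
    members         web-server-01,web-server-02,web-server-03
}

define servicegroup {
    servicegroup_name  http-checks
    alias              HTTP Checks
}

# One definition creates the service on every member of the hostgroup
define service {
    use                  generic-service
    hostgroup_name       web-servers
    service_description  HTTP
    check_command        check_http
    servicegroups        http-checks
}
```

Adding a fourth web server then means adding it to the hostgroup only; the HTTP service and its servicegroup membership follow automatically.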
The Core Question You’re Answering
“How do I organize monitoring so that operators can quickly find what they’re looking for?”
The Interview Questions They’ll Ask
- “What’s the difference between hostgroups and servicegroups?”
- “How do you assign a service to all hosts in a group?”
- “Can a host be in multiple groups? When is this useful?”
- “How do groups affect notification routing?”
Learning Milestones
- Basic: Create hostgroups and assign hosts
- Intermediate: Use hostgroup_name to apply services to groups
- Advanced: Servicegroups for cross-host service views
Project 7: Timeperiods and Scheduling
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 2: Intermediate
- Knowledge Area: Scheduling, Time-Based Logic
- Software or Tool: Nagios Core
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: Custom timeperiods for business hours, maintenance windows, and on-call schedules.
Why it teaches the concept: Real monitoring requires time-awareness: check less often at night, notify different people on weekends.
Core challenges you’ll face:
- Timeperiod syntax - Date and time range formats
- Exclusions - Excluding holidays from business hours
- Overlapping periods - Combining multiple timeperiods
- Check vs notification periods - Different uses for timeperiods
Difficulty: Intermediate Time estimate: 2-3 hours Prerequisites: Projects 1-4
Real World Outcome
# Example timeperiods:
define timeperiod {
timeperiod_name business_hours
alias Monday-Friday 9AM-5PM
monday 09:00-17:00
tuesday 09:00-17:00
wednesday 09:00-17:00
thursday 09:00-17:00
friday 09:00-17:00
}
define timeperiod {
timeperiod_name non_business_hours
alias Outside Business Hours
monday 00:00-09:00,17:00-24:00
tuesday 00:00-09:00,17:00-24:00
...
saturday 00:00-24:00
sunday 00:00-24:00
}
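To reason about how such a definition is evaluated, it can help to model it outside Nagios. This Python sketch is a simplified model only: it handles weekday ranges, ignores exceptions and the `exclude` directive, and treats the end of each range as exclusive, which is an assumption of the sketch. `in_timeperiod` and `BUSINESS_HOURS` are illustrative names.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Simplified model of the business_hours timeperiod:
# weekday name -> comma-separated HH:MM-HH:MM ranges.
BUSINESS_HOURS = {day: "09:00-17:00" for day in WEEKDAYS[:5]}


def _minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight ('24:00' -> 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(period, when):
    """Return True if datetime `when` falls inside one of the day's ranges."""
    ranges = period.get(WEEKDAYS[when.weekday()])
    if not ranges:
        return False  # a day with no entry is entirely outside the period
    now = when.hour * 60 + when.minute
    for item in ranges.split(","):
        start, end = item.strip().split("-")
        if _minutes(start) <= now < _minutes(end):
            return True
    return False


print(in_timeperiod(BUSINESS_HOURS, datetime(2024, 1, 15, 10, 30)))  # a Monday
print(in_timeperiod(BUSINESS_HOURS, datetime(2024, 1, 13, 10, 30)))  # a Saturday
```

The same function can be pointed at a `check_period` or a `notification_period`; the difference between the two lies in what Nagios does with the answer, not in how the period is evaluated.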
The Core Question You’re Answering
“How do I make monitoring behave differently based on time of day or week?”
The Interview Questions They’ll Ask
- “What’s the difference between check_period and notification_period?”
- “How do you handle maintenance windows?”
- “How do you exclude holidays from a timeperiod?”
- “What happens if a check runs outside its check_period?”
Learning Milestones
- Basic: Define business hours timeperiod
- Intermediate: Apply different timeperiods to checks and notifications
- Advanced: Handle holidays and maintenance windows
Project 8: Notification Commands (Email, Slack, SMS)
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Bash, Nagios Configuration
- Difficulty: Level 3: Advanced
- Knowledge Area: Notifications, Integration
- Software or Tool: sendmail/postfix, curl, custom scripts
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: Multi-channel notifications that send alerts via email, Slack, and optionally SMS.
Why it teaches the concept: Notifications are useless if they don’t reach the right people in the right way.
Core challenges you’ll face:
- Email configuration - SMTP setup, HTML vs plain text
- Slack webhook integration - Formatting messages for Slack
- SMS integration - Using services like Twilio
- Command macros - Using Nagios macros in notification commands
┌─────────────────────────────────────────────────────────┐
│ NOTIFICATION FLOW │
├─────────────────────────────────────────────────────────┤
│ │
│ State Change (HARD CRITICAL) │
│ │ │
│ ▼ │
│ Contact Definition ────► notification_commands │
│ │ │ │
│ │ ▼ │
│ │ notify-host-by-email │
│ │ notify-host-by-slack │
│ │ │ │
│ ▼ ▼ │
│ Contact gets email AND Slack message │
│ │
│ Macros available: │
│ $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ │
│ $SERVICEOUTPUT$ $LONGDATETIME$ $CONTACTEMAIL$ │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, working email server or Slack workspace
Real World Outcome
# Email notification received:
Subject: ** PROBLEM: web-server-01/HTTP is CRITICAL **
***** Nagios *****
Notification Type: PROBLEM
Host: web-server-01
Service: HTTP
State: CRITICAL
Output: Connection refused
# Slack message in #alerts channel:
🔴 CRITICAL: web-server-01/HTTP
Connection refused
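A Slack message like the one above could be produced by a small script that a notification command calls with macro values as arguments. The sketch below assumes a Slack incoming webhook that accepts a JSON body with a `text` field; the function names and the state-to-emoji mapping are choices made for this example, and the webhook URL is whatever Slack issued for your channel.

```python
import json
import urllib.request

# Emoji per state is a presentation choice for this sketch, not a Nagios convention.
STATE_ICONS = {"OK": "🟢", "WARNING": "🟡", "CRITICAL": "🔴", "UNKNOWN": "⚪"}


def build_slack_payload(host, service, state, output):
    """Build the JSON body for a Slack incoming webhook from macro values."""
    icon = STATE_ICONS.get(state, "⚪")
    text = f"{icon} {state}: {host}/{service}\n{output}"
    return json.dumps({"text": text}).encode("utf-8")


def send_to_slack(webhook_url, payload, timeout=10):
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Values that Nagios would pass in as $HOSTNAME$, $SERVICEDESC$,
# $SERVICESTATE$ and $SERVICEOUTPUT$ arguments.
body = build_slack_payload("web-server-01", "HTTP", "CRITICAL", "Connection refused")
print(body.decode("utf-8"))
```

Keeping payload construction separate from the HTTP call means the formatting can be tested without sending a real alert, which is one answer to the testing question in the interview list.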
The Core Question You’re Answering
“How do I ensure alerts reach the right people through their preferred communication channel?”
The Interview Questions They’ll Ask
- “How do you prevent alert fatigue?”
- “What macros are available in notification commands?”
- “How would you add a new notification channel?”
- “How do you test notification commands without triggering real alerts?”
- “What’s the difference between host and service notification commands?”
Learning Milestones
- Basic: Email notifications working
- Intermediate: Slack webhook integration
- Advanced: Multiple channels per contact, formatted messages
Project 9: Contact Groups and Escalations
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 3: Advanced
- Knowledge Area: Notification Routing
- Software or Tool: Nagios Core
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: Escalation chains that notify different teams based on how long a problem has persisted.
Why it teaches the concept: Real incidents need escalation paths. The junior on-call should be notified first, then managers if unacknowledged.
Core challenges you’ll face:
- Contact group design - Organizing contacts by role and responsibility
- Escalation timing - first_notification, last_notification, notification_interval
- Escalation conditions - Which states trigger escalation
- Acknowledgement handling - Stopping escalation when acked
┌─────────────────────────────────────────────────────────┐
│ ESCALATION TIMELINE │
├─────────────────────────────────────────────────────────┤
│ │
│ Time: 0 5 10 15 20 25 30 │
│ │ │ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ ▼ ▼ │
│ │
│ Notif: 1 2 3 4 5 6 7 │
│ │
│ Level 1: on-call engineer (notifications 1-2) │
│ Level 2: senior engineer (notifications 3-4) │
│ Level 3: manager (notifications 5+) │
│ │
│ If acknowledged at notification 3: │
│ - Escalation stops │
│ - Recovery notification goes to the last-notified contacts │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Advanced Time estimate: 3-4 hours Prerequisites: Projects 1-8
Real World Outcome
# First 10 minutes: on-call gets notified
# After 10 minutes: senior engineer also notified
# After 20 minutes: manager also notified
# When engineer acknowledges:
# - Further problem notifications stop, so escalation stops
# - When service recovers, the contacts who received the most recent problem notification are notified
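The selection of who receives notification number N can be sketched as a range match on `first_notification` and `last_notification`. This Python model is an illustration under assumptions: the group names are hypothetical, and the ladder mirrors the timeline in this project rather than any shipped configuration.

```python
# Hypothetical escalation ladder mirroring the timeline in this project.
# last_notification = 0 means "no upper bound", as in Nagios escalations.
ESCALATIONS = [
    {"first_notification": 3, "last_notification": 4,
     "contact_groups": ["senior-engineers"]},
    {"first_notification": 5, "last_notification": 0,
     "contact_groups": ["managers"]},
]
DEFAULT_GROUPS = ["on-call"]


def groups_for_notification(number, escalations=ESCALATIONS, default=DEFAULT_GROUPS):
    """Return the contact groups that receive notification `number`."""
    matched = []
    for esc in escalations:
        last = esc["last_notification"]
        if number >= esc["first_notification"] and (last == 0 or number <= last):
            matched.extend(esc["contact_groups"])
    # When no escalation matches, the service's own contact groups are used.
    return matched or list(default)


for n in range(1, 8):
    print(n, groups_for_notification(n))
```

Note that in this model a matching escalation replaces the default group. To keep the on-call engineer in the loop at higher levels, list that group in each escalation as well.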
The Core Question You’re Answering
“How do I ensure problems get attention even if the first responder doesn’t react?”
The Interview Questions They’ll Ask
- “What is the difference between a service’s notification_interval and the notification_interval set in an escalation?”
- “How do you prevent the same person from being notified by multiple escalation levels?”
- “What happens when an escalated problem is acknowledged?”
- “How would you implement ‘follow the sun’ on-call rotation?”
Learning Milestones
- Basic: Two-level escalation working
- Intermediate: Escalation stops on acknowledgement
- Advanced: Different escalation paths for different services
Project 10: Event Handlers for Auto-Remediation
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Bash
- Difficulty: Level 4: Expert
- Knowledge Area: Automation, Self-Healing Systems
- Software or Tool: Shell scripts, SSH
- Main Book: “Nagios Core Beginner’s Guide” by Eric Loyd
What you’ll build: Automated remediation scripts that attempt to fix problems before humans are notified.
Why it teaches the concept: The best alert is the one that never fires because the system fixed itself.
Core challenges you’ll face:
- Event handler logic - Understanding when handlers trigger
- State-based actions - Different actions for SOFT vs HARD states
- Safe remediation - Ensuring handlers don’t make things worse
- Logging and auditing - Recording what actions were taken
┌─────────────────────────────────────────────────────────┐
│ EVENT HANDLER FLOW │
├─────────────────────────────────────────────────────────┤
│ │
│ Check Result │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Event handler │ │
│ │ enabled? ├──NO──► No action │
│ └────────┬────────┘ │
│ │ YES │
│ ▼ │
│ ┌─────────────────┐ │
│ │ State change? │ │
│ │ (or SOFT retry) ├──NO──► No action │
│ └────────┬────────┘ │
│ │ YES │
│ ▼ │
│ Execute event_handler command │
│ with $SERVICESTATE$ $SERVICESTATETYPE$ args │
│ │
│ Example handler logic: │
│ if SERVICESTATE=CRITICAL and STATETYPE=SOFT then │
│ attempt_restart │
│ elif SERVICESTATE=CRITICAL and STATETYPE=HARD then │
│ # Give up, notification will handle it │
│ fi │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Expert Time estimate: 6-8 hours Prerequisites: Projects 1-4, sudo/SSH access to managed hosts
Real World Outcome
# Service goes CRITICAL (SOFT state):
Event handler: Attempting to restart httpd
Event handler: Service httpd restarted successfully
# Next check:
Service returned to OK state
# No notification ever sent!
# If restart fails:
# Service goes HARD CRITICAL
# Normal notification process triggers
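The decision logic of such a handler is small enough to isolate and test on its own. The sketch below assumes the handler is called with `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` as arguments; `decide_action` and the `MAX_RESTART_ATTEMPT` limit are hypothetical names and values, and the actual restart and audit logging are deliberately left out.

```python
MAX_RESTART_ATTEMPT = 3  # safety limit chosen for this sketch


def decide_action(state, state_type, attempt):
    """Map service state, state type and attempt number to an action."""
    if state != "CRITICAL":
        return "none"      # OK/WARNING/UNKNOWN: nothing to remediate
    if state_type == "SOFT" and attempt <= MAX_RESTART_ATTEMPT:
        return "restart"   # try to fix it before the state goes HARD
    return "none"          # HARD (or too many tries): leave it to notifications


def main(argv):
    """Entry point: argv holds state, state type and attempt, in that order."""
    state, state_type, attempt = argv[0], argv[1], int(argv[2])
    action = decide_action(state, state_type, attempt)
    # A real handler would append to an audit log and perform the restart here.
    print(f"event handler: state={state} type={state_type} "
          f"attempt={attempt} action={action}")
    return 0


# Example invocation with the values Nagios would pass as arguments.
main(["CRITICAL", "SOFT", "1"])
```

Separating the decision from the side effect makes it straightforward to add further guards, such as a cooldown between restarts, without touching the code that talks to the managed host.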
The Core Question You’re Answering
“How do I automate the first steps of incident response to reduce human intervention?”
The Interview Questions They’ll Ask
- “Why should event handlers only run on SOFT states?”
- “How do you prevent event handlers from making problems worse?”
- “How do you audit what event handlers did?”
- “What’s the difference between event handlers and notifications?”
- “How would you handle a service that keeps restarting in a loop?”
Learning Milestones
- Basic: Event handler triggers on state change
- Intermediate: Handler takes action only on SOFT CRITICAL
- Advanced: Handler logs actions and has safety limits
Project 11: Passive Checks and NSCA
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Bash, Nagios Configuration
- Difficulty: Level 3: Advanced
- Knowledge Area: Push-Based Monitoring
- Software or Tool: NSCA (Nagios Service Check Acceptor)
- Main Book: Nagios Core Documentation
What you’ll build: Passive check infrastructure for monitoring behind firewalls or from external scripts/cron jobs.
Why it teaches the concept: Not everything can be polled. Passive checks enable monitoring of jobs, security events, and firewalled systems.
┌─────────────────────────────────────────────────────────┐
│ PASSIVE CHECK ARCHITECTURE │
├─────────────────────────────────────────────────────────┤
│ │
│ Active Check (Pull): │
│ Nagios ───────────────────────────────► Remote Host │
│ (initiates connection every N minutes) │
│ │
│ Passive Check (Push): │
│ External Script ─────► NSCA ─────► Nagios External CMD │
│ (sends when ready) │ │ │
│ │ ▼ │
│ Port 5667 nagios.cmd pipe │
│ │
│ Use cases: │
│ - Cron job completion status │
│ - Security events from SIEM │
│ - Backup job results │
│ - Hosts behind NAT/firewall │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4
Real World Outcome
# From external system:
$ echo -e "web-server-01\tBackup Status\t0\tBackup completed successfully" | \
/usr/local/nagios/bin/send_nsca -H nagios-server -c send_nsca.cfg
# Nagios shows service "Backup Status" with OK state
# If no result received within freshness_threshold:
# Service goes CRITICAL - "No result received in 24 hours"
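The two wire formats involved are simple text lines. As a rough Python sketch: `nsca_line` builds the tab-delimited record that `send_nsca` reads on stdin, `external_command` builds the equivalent `PROCESS_SERVICE_CHECK_RESULT` line for the command pipe, and `is_stale` models a freshness test in seconds. All three function names are illustrative, and the freshness model is a simplification of what Nagios does internally.

```python
import time


def nsca_line(host, service, return_code, output):
    """Format one result the way send_nsca expects it on stdin (tab-delimited)."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def external_command(host, service, return_code, output, now=None):
    """Format the same result as a PROCESS_SERVICE_CHECK_RESULT command."""
    timestamp = int(now if now is not None else time.time())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def is_stale(last_result_time, freshness_threshold, now=None):
    """True when the last passive result is older than the threshold (seconds)."""
    current = now if now is not None else time.time()
    return (current - last_result_time) > freshness_threshold


print(repr(nsca_line("web-server-01", "Backup Status", 0,
                     "Backup completed successfully")))
print(external_command("web-server-01", "Backup Status", 0,
                       "Backup completed successfully", now=1700000000), end="")
```

A cron job would typically pipe the output of `nsca_line` into `send_nsca`, while a script running on the Nagios server itself could write the external command line directly to the command pipe.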
The Core Question You’re Answering
“How do I monitor things that can’t be actively polled?”
The Interview Questions They’ll Ask
- “What’s the difference between active and passive checks?”
- “How does freshness checking work?”
- “What happens if a passive check never sends a result?”
- “How do you secure NSCA communication?”
- “When would you use passive vs active checks?”
Learning Milestones
- Basic: NSCA server accepting results
- Intermediate: Passive service with freshness checking
- Advanced: Cron job sending results via send_nsca
Project 12: Performance Data and Graphing
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Perl, Nagios Configuration
- Difficulty: Level 3: Advanced
- Knowledge Area: Metrics, Time-Series Data
- Software or Tool: PNP4Nagios, RRDtool, or InfluxDB/Grafana
- Main Book: “Nagios and Nagios Related Development” - Nagios Exchange
What you’ll build: Performance data collection and graphing for historical trend analysis.
Why it teaches the concept: Monitoring without graphing is incomplete. Graphs show trends, capacity planning data, and historical context.
┌─────────────────────────────────────────────────────────┐
│ PERFORMANCE DATA PIPELINE │
├─────────────────────────────────────────────────────────┤
│ │
│ Plugin Output: │
│ "DISK OK - 45GB free|/=55GB;80;90;0;100" │
│ ▲ │
│ │ performance data │
│ ▼ │
│ service_perfdata_command (process_performance_data=1) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ PNP4Nagios / Graphite / InfluxDB │ │
│ │ (stores time-series) │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Graphs in Web Interface │ │
│ │ (click on service → see graph) │ │
│ └─────────────────────────────────────┘ │
│ │
│ Performance data format: │
│ label=value[UOM];warn;crit;min;max │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Advanced Time estimate: 6-8 hours Prerequisites: Projects 1-5
Real World Outcome
Click on any service in Nagios web interface and see historical graphs:
- CPU usage over last 24 hours
- Disk space trend over last month
- Response time percentiles
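Before any graphing tool can store a metric, the performance data string has to be split into its fields. The following Python sketch parses the `label=value[UOM];warn;crit;min;max` format; it is a simplified parser written for illustration (thresholds are kept as strings, and unusual labels may not be handled), and `parse_perfdata` is a hypothetical name.

```python
import re

# One metric: optional single-quoted label, '=', numeric value, optional unit,
# then up to four ';'-separated fields (warn, crit, min, max).
_METRIC = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<bare>[^\s=']+))="
    r"(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[A-Za-z%]*)"
    r"(?P<rest>(?:;[^;\s]*){0,4})"
)


def parse_perfdata(plugin_output):
    """Split plugin output at '|' and parse the metrics that follow it."""
    _, _, perf = plugin_output.partition("|")
    metrics = []
    for match in _METRIC.finditer(perf):
        fields = match.group("rest").split(";")[1:]
        fields += [""] * (4 - len(fields))  # pad missing trailing fields
        warn, crit, minimum, maximum = fields
        metrics.append({
            "label": match.group("quoted") or match.group("bare"),
            "value": float(match.group("value")),
            "uom": match.group("uom"),
            "warn": warn, "crit": crit, "min": minimum, "max": maximum,
        })
    return metrics


print(parse_perfdata("DISK OK - 45GB free|/=55GB;80;90;0;100"))
```

Everything before the `|` is the human-readable check output; only the part after it is structured, which is why the two are handled differently by Nagios and by graphing add-ons.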
The Core Question You’re Answering
“How do I see trends over time instead of just current state?”
The Interview Questions They’ll Ask
- “How is performance data different from check output?”
- “What is RRDtool and how does it work?”
- “How would you alert on rates of change vs absolute values?”
- “What’s the storage overhead for performance data?”
Learning Milestones
- Basic: Performance data being written to files
- Intermediate: PNP4Nagios or similar showing graphs
- Advanced: Custom graphs with calculated metrics
Project 13: Nagios Plugins Development
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Python, Bash, or Perl
- Difficulty: Level 3: Advanced
- Knowledge Area: Plugin Development
- Software or Tool: Nagios Plugin API
- Main Book: Nagios Plugin Development Guidelines
What you’ll build: A production-quality plugin with proper argument parsing, timeout handling, and performance data.
Why it teaches the concept: Plugins are how Nagios interfaces with the world. Understanding plugin development means you can monitor anything.
Core challenges you’ll face:
- Argument parsing - Standard options like -H, -w, -c, -t, -v
- Timeout handling - Plugin must not hang
- Help output - -h and --help should be informative
- Verbose mode - -v for debugging
- Performance data - Proper formatting
Difficulty: Advanced Time estimate: 8-10 hours Prerequisites: Projects 1-5, programming skills
Real World Outcome
A plugin that follows the Nagios guidelines:
$ ./check_myapp --help
check_myapp - Check MyApp API health
Usage: check_myapp -H <host> [-p port] [-w warn] [-c crit] [-t timeout]
Options:
-H, --hostname Hostname or IP
-p, --port Port number (default: 8080)
-w, --warning Warning threshold (ms)
-c, --critical Critical threshold (ms)
-t, --timeout Timeout in seconds (default: 10)
-v, --verbose Verbose output
-h, --help This help message
$ ./check_myapp -H localhost -w 200 -c 500
MYAPP OK - Response time 45ms|response_time=45ms;200;500;0;
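A skeleton for a plugin with this interface might look like the following Python sketch. It is not a reference implementation: the TCP connect in `measure_connect_ms` merely stands in for a real MyApp health request, the thresholds are simple "higher is worse" values rather than full Nagios range syntax, and all function names are illustrative.

```python
import argparse
import socket
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(elapsed_ms, warn_ms, crit_ms):
    """Compare a measurement against simple 'higher is worse' thresholds."""
    if elapsed_ms >= crit_ms:
        return CRITICAL
    if elapsed_ms >= warn_ms:
        return WARNING
    return OK


def format_output(state, elapsed_ms, warn_ms, crit_ms):
    """One line of text, then '|' and the performance data."""
    return (f"MYAPP {STATE_NAMES[state]} - Response time {elapsed_ms:.0f}ms"
            f"|response_time={elapsed_ms:.0f}ms;{warn_ms};{crit_ms};0;")


def measure_connect_ms(host, port, timeout):
    """Time a TCP connect; a stand-in for a real health request."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0


def main(argv=None):
    """Parse arguments, run the check, print one line, return the exit code."""
    parser = argparse.ArgumentParser(prog="check_myapp",
                                     description="Check MyApp API health")
    parser.add_argument("-H", "--hostname", required=True)
    parser.add_argument("-p", "--port", type=int, default=8080)
    parser.add_argument("-w", "--warning", type=int, default=200)
    parser.add_argument("-c", "--critical", type=int, default=500)
    parser.add_argument("-t", "--timeout", type=int, default=10)
    args = parser.parse_args(argv)
    try:
        elapsed = measure_connect_ms(args.hostname, args.port, args.timeout)
    except OSError as exc:  # refused, unreachable, or timed out
        print(f"MYAPP CRITICAL - {exc}")
        return CRITICAL
    state = evaluate(elapsed, args.warning, args.critical)
    print(format_output(state, elapsed, args.warning, args.critical))
    return state
```

In a real script the last line would be something like `raise SystemExit(main())`, so that the return value becomes the process exit code. One caveat of this sketch: argparse exits with status 2 on a usage error, whereas a plugin would normally report UNKNOWN (3) in that case.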
The Core Question You’re Answering
“How do I create a monitoring plugin that is robust, user-friendly, and maintainable?”
The Interview Questions They’ll Ask
- “What are the Nagios plugin guidelines?”
- “How do you handle timeouts in plugins?”
- “What should happen if a plugin encounters an unexpected error?”
- “How do you test plugins before deployment?”
- “What libraries exist to simplify plugin development?”
Learning Milestones
- Basic: Plugin with correct exit codes
- Intermediate: Proper argument parsing and help output
- Advanced: Timeout handling, verbose mode, performance data
Project 14: Monitoring Windows Hosts
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 3: Advanced
- Knowledge Area: Windows Monitoring
- Software or Tool: NSClient++, check_nt, or WMI
- Main Book: NSClient++ Documentation
What you’ll build: Windows host monitoring for disk, CPU, memory, services, and event logs.
Why it teaches the concept: Mixed environments require multi-platform monitoring. NSClient++ is the NRPE equivalent for Windows.
┌─────────────────────────────────────────────────────────┐
│ WINDOWS MONITORING OPTIONS │
├─────────────────────────────────────────────────────────┤
│ │
│ Option 1: NSClient++ (Recommended) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Nagios │ │ Windows Host │ │
│ │ │ │ │ │
│ │ check_nrpe │◄───────►│ NSClient++ │ │
│ │ check_nt │ TCP/12489│ (agent) │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Option 2: WMI (Agentless) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Nagios │ │ Windows Host │ │
│ │ │ │ │ │
│ │ check_wmi │◄───────►│ WMI Service │ │
│ │ (uses wmic) │ RPC/135 │ (built-in) │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Metrics available: │
│ - CPU usage (CPULOAD) │
│ - Memory usage (MEMUSE) │
│ - Disk space (USEDDISKSPACE) │
│ - Service status (SERVICESTATE) │
│ - Process state (PROCSTATE) │
│ - Event log entries (via check_nrpe, not check_nt) │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, Windows server access
Real World Outcome
$ /usr/local/nagios/libexec/check_nt -H windows-server -p 12489 -v CPULOAD -l 5,80,90
CPU Load 12% (5 min average)|'5 min avg Load'=12%;80;90;0;100;
$ /usr/local/nagios/libexec/check_nt -H windows-server -p 12489 -v SERVICESTATE -l "Windows Update"
Windows Update: Started
The Core Question You’re Answering
“How do I monitor Windows systems from a Linux-based Nagios server?”
The Interview Questions They’ll Ask
- “What’s the difference between check_nt and check_nrpe for Windows?”
- “How do you monitor Windows services?”
- “How do you check Windows event logs?”
- “What are the security implications of WMI vs NSClient++?”
Learning Milestones
- Basic: NSClient++ installed and responding
- Intermediate: CPU, memory, disk monitored
- Advanced: Service status and event log monitoring
Project 15: Network Device Monitoring (SNMP)
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 4: Expert
- Knowledge Area: Network Monitoring, SNMP
- Software or Tool: check_snmp, snmpwalk
- Main Book: “Essential SNMP” by Mauro & Schmidt
What you’ll build: Monitoring for routers, switches, and network appliances using SNMP.
Why it teaches the concept: Network devices are different from servers. SNMP is the standard protocol for managing network infrastructure.
┌─────────────────────────────────────────────────────────┐
│ SNMP CONCEPTS │
├─────────────────────────────────────────────────────────┤
│ │
│ SNMP = Simple Network Management Protocol │
│ │
│ Key Components: │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ SNMP Manager │───►│ SNMP Agent │ │
│ │ (Nagios) │◄───│ (Device) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ GET/SET │ Response │
│ │ │ Trap │
│ │
│ OID (Object Identifier): │
│ .1.3.6.1.4.1.9.2.1.58.0 = Cisco CPU 5min │
│ .1.3.6.1.2.1.2.2.1.8.1 = Interface 1 oper status │
│ │
│ MIB (Management Information Base): │
│ Human-readable mappings for OIDs │
│ │
│ SNMP Versions: │
│ v1: Community strings (insecure) │
│ v2c: Same community-string security, adds GETBULK and 64-bit counters │
│ v3: Authentication + encryption (recommended) │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Expert Time estimate: 6-8 hours Prerequisites: Projects 1-4, network device with SNMP enabled
Real World Outcome
# Check interface status:
$ check_snmp -H router.example.com -C public -o IF-MIB::ifOperStatus.1
SNMP OK - up(1)|
# Check CPU utilization on Cisco device:
$ check_snmp -H router.example.com -C public \
-o .1.3.6.1.4.1.9.2.1.58.0 -w 80 -c 90
SNMP OK - 45|iso.3.6.1.4.1.9.2.1.58.0=45;80;90
# Monitor interface bandwidth:
$ check_snmp_int.pl -H switch.example.com -C public -i "Gi0/1" -w 80,80 -c 90,90
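Interface bandwidth is not read directly from the device; it is derived from two samples of an octet counter taken some time apart. The sketch below shows the arithmetic in Python, assuming 32-bit counters such as ifInOctets and at most one wrap between samples. The sample values and function names are hypothetical.

```python
COUNTER32_MAX = 2 ** 32  # 32-bit octet counters wrap at this value


def counter_delta(previous, current, modulus=COUNTER32_MAX):
    """Difference between two counter samples, allowing for one wrap."""
    if current >= previous:
        return current - previous
    return modulus - previous + current


def utilization_percent(prev_octets, curr_octets, interval_seconds, link_bps):
    """Percent of link capacity used between two samples."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits / (interval_seconds * link_bps)


# Two hypothetical ifInOctets samples taken 60 seconds apart on a 1 Gbit/s link.
print(round(utilization_percent(1_000_000, 601_000_000, 60, 1_000_000_000), 2))
```

On fast links a 32-bit counter can wrap more than once between polls, which this calculation cannot detect; that is the usual reason to prefer the 64-bit counters available from SNMP v2c onward.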
The Core Question You’re Answering
“How do I monitor network infrastructure that doesn’t run standard OS agents?”
The Interview Questions They’ll Ask
- “What is an OID and how do you find the right one?”
- “What are the security differences between SNMP v2c and v3?”
- “How do you monitor interface bandwidth?”
- “What is an SNMP trap and how is it different from polling?”
- “How do you monitor a device you don’t have the MIB for?”
Learning Milestones
- Basic: SNMP GET working from Nagios
- Intermediate: Interface status and bandwidth monitored
- Advanced: SNMP v3 with authentication, trap receiver
Project 16: Log File Monitoring
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Bash, Perl
- Difficulty: Level 3: Advanced
- Knowledge Area: Log Analysis
- Software or Tool: check_log, custom scripts
- Main Book: Nagios Core Documentation
What you’ll build: Monitoring that detects error patterns in application log files and alerts.
Why it teaches the concept: Logs often contain warnings before metrics show problems. Log monitoring catches issues early.
Core challenges you’ll face:
- Log rotation - Handling log files that rotate
- State tracking - Only alerting on new entries
- Pattern matching - Defining what constitutes an error
- Performance - Scanning large logs efficiently
┌─────────────────────────────────────────────────────────┐
│ LOG MONITORING APPROACHES │
├─────────────────────────────────────────────────────────┤
│ │
│ Approach 1: Scan for patterns (active check) │
│ ┌──────────────────────────────────────────────┐ │
│ │ check_log -F /var/log/app.log -O /tmp/seek │ │
│ │ -q "ERROR|FATAL" │ │
│ │ │ │
│ │ • Remembers last position (seek file) │ │
│ │ • Only scans new entries │ │
│ │ • Must handle rotation │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ Approach 2: Real-time tail + NSCA (passive check) │
│ ┌──────────────────────────────────────────────┐ │
│ │ tail -F /var/log/app.log | while read line │ │
│ │ do │ │
│ │ if [[ $line =~ ERROR ]]; then │ │
│ │ send_nsca "host" "Log Errors" 2 "$line" │ │
│ │ fi │ │
│ │ done │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ Approach 3: External log shipper → Nagios │
│ (Filebeat/Logstash → parse → alert → NSCA) │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Advanced Time estimate: 4-6 hours Prerequisites: Projects 1-4, 11 (for passive approach)
Real World Outcome
# Active check approach:
$ /usr/local/nagios/libexec/check_log -F /var/log/myapp.log \
-O /tmp/myapp_log.seek -q "ERROR"
LOG OK - No matches found
# (After an error is logged)
$ /usr/local/nagios/libexec/check_log -F /var/log/myapp.log \
-O /tmp/myapp_log.seek -q "ERROR"
(1): 2024-01-15 10:23:45 ERROR: Database connection failed
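The seek-file technique behind this behaviour can be sketched in a few lines of Python. This is an illustration of the idea, not the implementation of `check_log`: it stores a byte offset between runs, scans only what was appended, and restarts from zero when the file has shrunk. The function names and the output wording are assumptions of the sketch.

```python
import os
import re


def scan_new_lines(log_path, seek_path, pattern):
    """Return lines matching `pattern` that were appended since the last run."""
    offset = 0
    if os.path.exists(seek_path):
        with open(seek_path) as handle:
            offset = int(handle.read().strip() or 0)

    # A file smaller than the saved offset was rotated or truncated:
    # start again from the beginning of the new file.
    if os.path.getsize(log_path) < offset:
        offset = 0

    regex = re.compile(pattern)
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
        new_offset = log.tell()

    with open(seek_path, "w") as handle:
        handle.write(str(new_offset))

    lines = data.decode("utf-8", errors="replace").splitlines()
    return [line for line in lines if regex.search(line)]


def check(log_path, seek_path, pattern):
    """Map the scan result to a Nagios-style (exit_code, message) pair."""
    matches = scan_new_lines(log_path, seek_path, pattern)
    if matches:
        return 2, f"LOG CRITICAL - {len(matches)} new match(es): {matches[-1]}"
    return 0, "LOG OK - No matches found"
```

The size comparison only catches rotation when the new file is smaller than the old offset. A more careful version would also record the inode of the file and reset the offset when it changes.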
The Core Question You’re Answering
“How do I detect problems from log messages before they become service failures?”
The Interview Questions They’ll Ask
- “How do you handle log rotation in monitoring?”
- “What’s the difference between active and passive log monitoring?”
- “How do you avoid duplicate alerts for the same error?”
- “How do you monitor logs across many servers?”
- “How would you alert on an absence of expected log entries?”
Learning Milestones
- Basic: check_log detecting patterns
- Intermediate: State tracking across check runs
- Advanced: Log rotation handling, passive check integration
Project 17: Custom Dashboards and Status Pages
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: PHP, HTML, API calls
- Difficulty: Level 3: Advanced
- Knowledge Area: UI/UX, Web Development
- Software or Tool: Nagios CGIs, Thruk, Grafana
- Main Book: Nagios Core Documentation - CGI Reference
What you’ll build: Custom status views and dashboards beyond the default Nagios interface.
Why it teaches the concept: Different stakeholders need different views. Operations needs details; management needs summaries.
Core challenges you’ll face:
- Status API - Reading from status.dat or NDO
- Custom CGIs - Building on top of Nagios CGIs
- Alternative UIs - Thruk, Nagstamon, mobile apps
- Data visualization - Meaningful dashboards
┌─────────────────────────────────────────────────────────┐
│ DASHBOARD ARCHITECTURE │
├─────────────────────────────────────────────────────────┤
│ │
│ Data Sources: │
│ ┌──────────────────┐ │
│ │ status.dat │ ◄─ File-based, fast, limited │
│ │ nagios.cmd │ ◄─ Command pipe for control │
│ │ NDO/Livestatus │ ◄─ Database/socket, full history │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Dashboard Options │ │
│ ├──────────────────────────────────────────────┤ │
│ │ • Default Nagios CGIs (basic) │ │
│ │ • Thruk (modern web UI) │ │
│ │ • Grafana + Nagios datasource │ │
│ │ • Custom dashboards (TV displays) │ │
│ │ • Desktop and mobile clients (Nagstamon, etc.) │ │
│ └──────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Advanced Time estimate: 6-8 hours Prerequisites: Projects 1-4, web development basics
Real World Outcome
- TV dashboard in NOC showing current problems
- Management summary page with SLA compliance
- Mobile app notifications on phone
- Custom status pages for specific teams
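A custom status page usually starts by reading the current state. The Python sketch below parses text in the block layout used by status.dat (`type {`, `key=value` lines, closing `}`) and lists services that are not OK. The embedded sample is made up for illustration, only a handful of fields are shown, and the exact set of fields in a real file may differ between Nagios versions.

```python
def parse_status_dat(text):
    """Parse status.dat-style text into a list of (block_type, fields) tuples."""
    blocks = []
    block_type, fields = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if block_type is None and line.endswith("{"):
            block_type, fields = line[:-1].strip(), {}
        elif block_type is not None and line == "}":
            blocks.append((block_type, fields))
            block_type, fields = None, None
        elif block_type is not None and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value
    return blocks


def current_problems(blocks):
    """Return (host, service, state) for services that are not OK."""
    return [
        (f["host_name"], f["service_description"], int(f["current_state"]))
        for kind, f in blocks
        if kind == "servicestatus" and f.get("current_state", "0") != "0"
    ]


SAMPLE = """
servicestatus {
    host_name=web-server-01
    service_description=HTTP
    current_state=2
    plugin_output=Connection refused
    }
servicestatus {
    host_name=db-server-01
    service_description=Disk
    current_state=0
    }
"""

print(current_problems(parse_status_dat(SAMPLE)))
```

Reading the file this way gives a point-in-time snapshot only. For history or for queries across many hosts, a socket or database source such as Livestatus or NDO is the better fit.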
The Core Question You’re Answering
“How do I present monitoring data in a way that’s useful for different audiences?”
The Interview Questions They’ll Ask
- “How do you read current status programmatically?”
- “What’s the difference between status.dat and NDO?”
- “How would you build a public status page?”
- “How do you secure API access to Nagios data?”
Learning Milestones
- Basic: Install Thruk or alternative UI
- Intermediate: Custom status page for specific hosts
- Advanced: Grafana dashboards with historical data
Project 18: High Availability Setup
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Nagios Configuration
- Difficulty: Level 5: Master
- Knowledge Area: High Availability, Clustering
- Software or Tool: Pacemaker/Corosync, DRBD, or Active/Standby
- Main Book: “Pro Linux High Availability Clustering” by Sander van Vugt
What you’ll build: Redundant Nagios infrastructure that survives the failure of any single component.
Why it teaches the concept: Your monitoring system must be more reliable than what it monitors. HA is essential for production.
┌─────────────────────────────────────────────────────────┐
│ HIGH AVAILABILITY ARCHITECTURES │
├─────────────────────────────────────────────────────────┤
│ │
│ Option 1: Active/Standby with Pacemaker │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Nagios │ ───────►│ Nagios │ │
│ │ Primary │ DRBD │ Standby │ │
│ │ (Active) │ │ (Passive) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ Virtual IP (failover) │
│ │
│ Option 2: Distributed Monitoring │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Nagios │ ◄──────►│ Nagios │ │
│ │ Region A │ NSCA │ Region B │ │
│ │ (monitors A) │ │ (monitors B) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ └────────────────────────┘ │
│ │ │
│ ▼ │
│ Central Dashboard │
│ │
│ Option 3: Load Balanced Pollers │
│ ┌──────────────┐ │
│ │ Central │ ◄── Check results from │
│ │ Nagios │ multiple poller nodes │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Master Time estimate: 12-16 hours Prerequisites: Projects 1-10, Linux clustering knowledge
Real World Outcome
# Primary server fails:
$ crm status
Node nagios-01: OFFLINE
Node nagios-02: online
Resources:
Resource Group: nagios-group
VIP (ocf::heartbeat:IPaddr2): Started nagios-02
Nagios (ocf::heartbeat:nagios): Started nagios-02
# Nagios resumes on nagios-02 after a brief failover gap
# Alert: Monitoring failover occurred
The Core Question You’re Answering
“How do I ensure monitoring is available even when monitoring servers fail?”
The Interview Questions They’ll Ask
- “How do you handle split-brain in a Nagios cluster?”
- “What data needs to be replicated between nodes?”
- “How do you test failover without affecting production?”
- “What’s the recovery time objective for your monitoring?”
- “How do you monitor your monitoring system?”
Learning Milestones
- Basic: Active/standby configuration documented
- Intermediate: Automated failover working
- Advanced: Zero-downtime maintenance procedures
Project 19: Integration with External Tools
- File: NAGIOS_MONITORING_MASTERY.md
- Main Programming Language: Various (API integrations)
- Difficulty: Level 4: Expert
- Knowledge Area: Integration, APIs
- Software or Tool: PagerDuty, ServiceNow, Ansible, Terraform
- Main Book: Tool-specific documentation
What you’ll build: Integrations between Nagios and incident management, CMDB, and automation tools.
Why it teaches the concept: Nagios doesn’t exist in isolation. Integration with the broader toolchain is essential.
┌─────────────────────────────────────────────────────────┐
│ NAGIOS INTEGRATION LANDSCAPE │
├─────────────────────────────────────────────────────────┤
│ │
│ Incident Management: │
│ ┌──────────────┐ │
│ │ Nagios │──Notification──►│ PagerDuty │ │
│ │ │ │ OpsGenie │ │
│ │ │ │ VictorOps │ │
│ └──────────────┘ │
│ │
│ Ticketing/ITSM: │
│ ┌──────────────┐ │
│ │ Nagios │──Create Ticket─►│ ServiceNow│ │
│ │ │ │ Jira │ │
│ │ │ │ RT │ │
│ └──────────────┘ │
│ │
│ Configuration Management: │
│ ┌──────────────┐ │
│ │ Ansible/ │──Generate──►│ Nagios Config│ │
│ │ Puppet/ │ Config │ │ │
│ │ Terraform │ │ │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Logging & Metrics: │
│ ┌──────────────┐ │
│ │ Nagios │──perfdata──►│ Graphite/ │ │
│ │ │ │ InfluxDB │ │
│ │ │──alerts────►│ ELK Stack │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Difficulty: Expert Time estimate: 8-12 hours Prerequisites: Projects 1-8, familiarity with target tools
Real World Outcome
- Alerts automatically create PagerDuty incidents
- Incident tickets created in ServiceNow
- Nagios configuration generated from Terraform
- Performance data visualized in Grafana
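For the incident-management case, most of the work is mapping Nagios macros onto the event format of the receiving tool. The Python sketch below builds a body in the style of the PagerDuty Events API v2; the field names follow that API as commonly documented, but treat the mapping tables, the `dedup_key` scheme and the placeholder routing key as assumptions to verify against the vendor documentation.

```python
import json

# Assumed mapping from $NOTIFICATIONTYPE$ to an Events API v2 action.
ACTIONS = {"PROBLEM": "trigger", "ACKNOWLEDGEMENT": "acknowledge",
           "RECOVERY": "resolve"}
# Assumed mapping from $SERVICESTATE$ to an event severity.
SEVERITIES = {"CRITICAL": "critical", "WARNING": "warning",
              "UNKNOWN": "error", "OK": "info"}


def build_event(routing_key, notification_type, host, service, state, output):
    """Build an Events API v2 style body from Nagios macro values."""
    return {
        "routing_key": routing_key,
        "event_action": ACTIONS.get(notification_type, "trigger"),
        # The same key for PROBLEM and RECOVERY lets the receiver pair them up.
        "dedup_key": f"nagios/{host}/{service}",
        "payload": {
            "summary": f"{service} on {host} is {state}: {output}",
            "source": host,
            "severity": SEVERITIES.get(state, "error"),
        },
    }


event = build_event("EXAMPLE_ROUTING_KEY", "PROBLEM", "web-server-01", "HTTP",
                    "CRITICAL", "Connection refused")
print(json.dumps(event, indent=2))
```

Deriving the deduplication key from host and service means a later RECOVERY notification resolves the incident that the PROBLEM notification opened, instead of creating a second one.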
The Core Question You’re Answering
“How do I make Nagios work with my existing operational tools?”
The Interview Questions They’ll Ask
- “How would you integrate Nagios with your incident management tool?”
- “How do you auto-generate Nagios configuration from your CMDB?”
- “How do you correlate Nagios alerts with logs in ELK?”
- “What’s the best approach for Nagios + Prometheus coexistence?”
Learning Milestones
- Basic: One external integration working
- Intermediate: Bidirectional integration (e.g., an acknowledgement in PagerDuty is reflected in Nagios)
- Advanced: Configuration-as-code pipeline
Project Comparison Table
| Project | Difficulty | Time | Key Skill | Real-World Value |
|---|---|---|---|---|
| 1. Install from Source | Intermediate | 4-6h | Linux admin | Foundation |
| 2. Config Structure | Intermediate | 3-4h | Configuration | Organization |
| 3. Local Services | Beginner | 2-3h | Plugin usage | Basic monitoring |
| 4. NRPE Remote | Advanced | 4-6h | Networking | Remote monitoring |
| 5. Custom Scripts | Intermediate | 4-6h | Scripting | Extensibility |
| 6. Host/Service Groups | Intermediate | 2-3h | Configuration | Scalability |
| 7. Timeperiods | Intermediate | 2-3h | Scheduling | Operations |
| 8. Notifications | Advanced | 4-6h | Integration | Alerting |
| 9. Escalations | Advanced | 3-4h | Workflow | On-call |
| 10. Event Handlers | Expert | 6-8h | Automation | Self-healing |
| 11. Passive/NSCA | Advanced | 4-6h | Push model | Special cases |
| 12. Performance Data | Advanced | 6-8h | Metrics | Trending |
| 13. Plugin Development | Advanced | 8-10h | Programming | Extension |
| 14. Windows Hosts | Advanced | 4-6h | Cross-platform | Enterprise |
| 15. SNMP Devices | Expert | 6-8h | Network | Infrastructure |
| 16. Log Monitoring | Advanced | 4-6h | Log analysis | Proactive |
| 17. Dashboards | Advanced | 6-8h | Visualization | UX |
| 18. High Availability | Master | 12-16h | Clustering | Reliability |
| 19. Integration | Expert | 8-12h | APIs | Ecosystem |
Summary
| # | Project | Core Skill | Prerequisites |
|---|---|---|---|
| 1 | Install from Source | Linux compilation | Linux basics |
| 2 | Config Structure | Object model | Project 1 |
| 3 | Local Services | Plugin usage | Project 2 |
| 4 | NRPE Remote | Remote execution | Project 3 |
| 5 | Custom Scripts | Plugin development | Scripting |
| 6 | Host/Service Groups | Organization | Project 4 |
| 7 | Timeperiods | Scheduling | Project 4 |
| 8 | Notifications | Multi-channel alerts | Project 4 |
| 9 | Escalations | Workflow routing | Project 8 |
| 10 | Event Handlers | Auto-remediation | Project 4 |
| 11 | Passive/NSCA | Push monitoring | Project 4 |
| 12 | Performance Data | Metrics graphing | Project 5 |
| 13 | Plugin Development | Production plugins | Project 5 |
| 14 | Windows Hosts | Cross-platform | Project 4 |
| 15 | SNMP Devices | Network monitoring | Project 4 |
| 16 | Log Monitoring | Log analysis | Project 5 |
| 17 | Dashboards | Visualization | Project 4 |
| 18 | High Availability | Clustering | Projects 1-10 |
| 19 | Integration | APIs | Projects 1-8 |
What You’ll Achieve
After completing this learning path, you will be able to:
- Install and configure Nagios Core from source with confidence
- Design monitoring for complex infrastructure (Linux, Windows, network devices)
- Write custom plugins that follow best practices
- Configure notifications with escalations and on-call rotations
- Implement auto-remediation with event handlers
- Build dashboards for different stakeholders
- Scale Nagios with high availability configurations
- Integrate with incident management and automation tools
Most importantly, you will understand the principles behind monitoring that apply to any tool - not just Nagios.
Additional Resources
Official Documentation
- Nagios Core Documentation: https://assets.nagios.com/downloads/nagioscore/docs/
- Nagios Plugins Guidelines: https://nagios-plugins.org/doc/guidelines.html
- NRPE Documentation: https://github.com/NagiosEnterprises/nrpe
Community Resources
- Nagios Exchange: https://exchange.nagios.org/ (thousands of plugins)
- Nagios Library: https://library.nagios.com/
- Monitoring Portal: https://www.monitoring-portal.org/
Books
- “Nagios Core Beginner’s Guide” by Eric Loyd - Packt Publishing
- “Nagios: System and Network Monitoring” by Wolfgang Barth - No Starch Press
- “Essential SNMP” by Douglas Mauro & Kevin Schmidt - O’Reilly
Related Topics to Explore Next
- Prometheus - Modern pull-based monitoring with PromQL
- Grafana - Advanced visualization
- Icinga - Began as a Nagios fork; Icinga 2 is a ground-up rewrite with modern features
- Zabbix - All-in-one monitoring platform
- OpenTelemetry - Observability framework
Last updated: 2025-01-01 | Projects: 19 | Difficulty range: Beginner to Master | Estimated total time: 100-140 hours