Learn the Linux Command Line: From Novice to Power User
Goal: Build deep, durable command-line mastery, not just command memorization. You will understand how the shell parses and executes what you type, how data flows through streams and pipelines, and how the Linux filesystem, permissions, and process model fit together. By the end, you will design reliable pipelines, automate real maintenance tasks, and debug command-line failures with evidence-based workflows. You will also ship a small portfolio of CLI tools (scripts + reports) that demonstrate practical Linux skills you can show to employers or use in your own systems.
Introduction
The Linux command line is a text-based interface for controlling a Linux system using the shell and a small ecosystem of composable tools. It is the most universal interface for servers, development environments, and automation because it is scriptable, precise, and stable across machines.
What you will build (by the end of this guide):
- A website status monitoring script with clear error handling and output reports
- A log analysis pipeline that converts raw access logs into actionable insights
- A scheduled backup system with timestamps and retention patterns
- A system resource monitor that logs CPU, memory, and disk metrics into CSV
- A media organizer that classifies files by metadata into clean hierarchies
Scope (what is included):
- Bash shell basics and safe scripting patterns
- Filesystem navigation, metadata, permissions, and file discovery
- Streams, redirection, pipelines, and text processing toolchains
- Process monitoring, signals, and job control
- Automation and scheduling with cron
- CLI networking with curl and DNS basics
- Archiving, compression, and backup workflows
Out of scope (for this guide):
- Full Linux system administration (systemd services, kernel tuning, package management)
- Advanced networking (firewalls, routing, BPF, VPNs)
- Container orchestration and cloud-specific tools
The Big Picture (Mental Model)
You type a command
|
v
[Shell parses + expands] --> tokens, quotes, variables
|
v
[Commands execute] --> processes with stdin/stdout/stderr
|
v
[Streams flow] --> pipes + redirection + filters
|
v
[Filesystem state] --> files, permissions, metadata
|
v
[Automation] --> scripts + cron + logs
Key Terms You Will See Everywhere
- Shell: The program that interprets your commands (bash, sh, zsh).
- Pipeline: A chain of commands where stdout of one feeds stdin of the next.
- File descriptor (fd): A numeric handle for IO streams (0, 1, 2).
- Exit status: Numeric result of a command (0 = success, non-zero = error).
- Globbing: Filename pattern expansion (e.g., *.log).
- Inode: Metadata record for a file (permissions, owner, timestamps).
How to Use This Guide
- Read the primer first: The Theory Primer is your mini-book. It builds mental models that make the projects intuitive.
- Build in layers: Each project has a core version and optional extensions. First get it working, then make it robust.
- Keep evidence: Save command output, logs, and small notes. These are proof of understanding and help debugging.
- Use the definitions: Every project references key concepts. If you get stuck, return to the concept chapter.
Suggested cadence:
- 2 to 3 sessions per week
- 90 minutes per session
- 1 project every 1 to 2 weeks
Prerequisites & Background Knowledge
Before starting these projects, you should have foundational understanding in these areas:
Essential Prerequisites (Must Have)
Programming Skills:
- Comfort with variables, basic if/else logic, and loops (in any language)
- Ability to read and interpret error messages
Linux Basics:
- Can open a terminal and navigate with pwd, ls, cd
- Can create/edit text files with a terminal editor or GUI
- Understand the idea of files vs directories
Recommended reading:
- The Linux Command Line, 2nd Edition by William E. Shotts - Ch. 1-10
- How Linux Works, 3rd Edition by Brian Ward - Ch. 1-3
Helpful But Not Required
Regular expressions
- You will learn them in Project 2
Networking basics
- HTTP status codes and DNS basics show up in Project 1
Self-Assessment Questions
- Can you explain the difference between a file and a directory?
- Can you describe what a pipeline a | b | c does in words?
- Can you explain what chmod 644 file.txt means?
- Can you read a log line and identify the status code?
- Do you know what an exit code is and where to find it?
If you answered “no” to questions 1-3, spend 1 week on the recommended reading before continuing.
Development Environment Setup
Required tools:
- Linux machine or VM (Ubuntu 22.04+, Debian 12+, Fedora 39+)
- A POSIX shell (bash recommended)
- Standard coreutils: ls, cat, grep, sort, uniq, awk, sed
Recommended tools:
- curl for HTTP checks
- tar for archives
- cron (or cronie) for scheduling
- exiftool for metadata in Project 5
- rsync for reliable backups (optional)
Testing your setup:
uname -a
bash --version
which awk sed grep sort uniq curl tar
Time Investment
- Projects 1-2 (Foundations): 1 to 2 weeks total
- Projects 3-4 (Automation + Monitoring): 2 to 3 weeks total
- Project 5 (Advanced file workflows): 1 to 2 weeks
- Total sprint: ~1 to 2 months part-time
Important Reality Check
Command-line mastery is about progressive refinement:
- First pass: get correct output with simple commands
- Second pass: make it robust (quotes, edge cases)
- Third pass: make it fast and reliable (pipes, checks, logs)
- Fourth pass: make it reusable (scripts, functions, config)
This is normal. The CLI is a power tool, and real mastery happens in layers.
Big Picture / Mental Model
User Intent
|
v
Shell Parsing
- tokenization
- expansion
- quoting
|
v
Execution
- processes + exit codes
- stdin/stdout/stderr
|
v
Dataflow
- pipes + redirection
- filters (grep/awk/sort)
|
v
State
- filesystem + permissions
- logs + backups
|
v
Automation
- scripts + scheduling
Theory Primer
This is the mini-book. Each chapter is a concept cluster with deep explanations, diagrams, examples, and exercises.
Chapter 1: The Shell as a Language (Parsing, Expansion, Exit Status)
Fundamentals
The shell is a language interpreter, not just a program launcher. It reads a line of text, splits it into tokens, expands variables and globs, and then executes programs with the final arguments. This is why quoting matters: the shell can change the number and content of arguments before your program ever runs. In bash, expansions happen in a defined order, and quoting controls which of those expansions are allowed. The shell also tracks exit status for every command and makes it available via $?, which is the foundation of conditional logic in scripts. Once you internalize that the shell is a compiler + runtime, not just a terminal, you stop debugging the wrong layer and start controlling the real behavior of your scripts. The CLI becomes predictable, not mysterious.
Deep Dive into the Concept
When you press Enter, bash performs a sequence of steps before executing anything. It first tokenizes the input into words and operators, then performs expansions in a specific order: brace expansion, tilde expansion, parameter and variable expansion, arithmetic expansion, and command substitution (left-to-right). After that it performs word splitting, filename expansion (globbing), and finally quote removal. That ordering is critical: for example, echo "$HOME" expands $HOME but does not split words; echo $HOME can split on spaces if $HOME has them. Knowing the order lets you predict when a variable might be expanded into multiple words, when a wildcard will match filenames, and when quotes will be removed. Bash documents this order explicitly in its reference manual.
Tokenization is influenced by operators like |, >, &&, and ;, so a single line can become multiple commands with different control flow. Then redirections are applied after expansions but before execution. The order of redirections matters: cmd >out 2>&1 sends both stdout and stderr to out, while cmd 2>&1 >out sends only stdout to the file and leaves stderr on the terminal, because stderr was duplicated before stdout was redirected. Bash documents this left-to-right redirection order in its redirections section.
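The redirection-order rule is easy to see for yourself with a command that writes to stderr. A minimal sketch (the filenames are arbitrary):
# ls of a missing path writes its error to stderr
ls /nonexistent >out.txt 2>&1    # stdout to out.txt, then stderr duplicated to the same place
cat out.txt                      # contains the error message
ls /nonexistent 2>&1 >out2.txt   # stderr duplicated to the terminal first, then stdout redirected
cat out2.txt                     # empty: only stdout went to the file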
The shell also performs command lookup in a defined order: function, builtin, then external command found on PATH. This explains why type ls might show an alias or function instead of /bin/ls. Understanding lookup order prevents confusion in scripts and helps you avoid aliases in non-interactive contexts. Exit status is another core element: a zero exit code means success, and non-zero means error. By default, the exit status of a pipeline is the exit status of the last command. Bash offers set -o pipefail to change that: with pipefail, the pipeline returns the last non-zero status or zero if all succeed. This is documented in the set builtin section of the bash manual.
Quoting is where most shell bugs hide. Single quotes preserve literal text. Double quotes allow parameter expansion but prevent word splitting and globbing. Unquoted variables are split on $IFS and then globbed; this is why rm $file is dangerous if $file contains spaces or wildcard characters. Shell expansions and word splitting are described precisely in the bash manual.
Finally, be cautious with eval. It takes a string and re-parses it as shell code, which can turn user data into commands and create injection vulnerabilities. Prefer arrays, case statements, and explicit parsing instead.
Another subtlety is how the shell handles arrays and positional parameters. "$@" expands to each argument as its own word, preserving spaces safely, while "$*" joins all arguments into a single word. This distinction matters when you write wrappers or pass through arguments. Arrays are often the safest way to build command arguments without accidental word splitting: args=(--flag "$value") and then cmd "${args[@]}". The IFS variable also affects how read splits input; read -r prevents backslash escapes from being interpreted. For consistent output, prefer printf over echo, because echo’s handling of escape sequences can vary between shells. These small details are the difference between a script that works in your terminal and one that works everywhere.
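The "$@" vs "$*" distinction is clearest in a tiny wrapper script. A minimal sketch (the script name args-demo.sh is hypothetical); try it with ./args-demo.sh "one two" three:
#!/usr/bin/env bash
# args-demo.sh: show how "$@" preserves word boundaries and "$*" joins them
printf 'Using "$@":\n'
for arg in "$@"; do printf '  [%s]\n' "$arg"; done   # one line per argument
printf 'Using "$*":\n'
for arg in "$*"; do printf '  [%s]\n' "$arg"; done   # one line: all arguments joined
# Arrays keep words intact when building command arguments
args=(-l -- "file with spaces.txt")
printf 'array element: [%s]\n' "${args[@]}"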
How This Fits in Projects
- Project 1 depends on correct quoting, command substitution, and exit codes.
- Project 2 depends on reliable pipelines and pipefail awareness.
- Projects 3-5 depend on safe expansion and predictable argument handling.
Definitions & Key Terms
- Tokenization: Splitting input into words and operators.
- Expansion: Substituting variables, globs, and command output.
- Word splitting: Splitting expanded text into words using $IFS.
- Globbing: Filename expansion using *, ?, [abc].
- Exit status: Numeric result of a command (0 = success).
Mental Model Diagram
Raw Input Line
|
v
Tokenize -> Expand -> Split -> Glob -> Redirections -> Execute
|
v
Final argv[] passed to program
How It Works (Step-by-Step)
- Read input line and split into tokens (words/operators).
- Perform expansions in the documented order.
- Split expanded text into words using $IFS (unless quoted).
- Expand globs into filenames.
- Apply redirections (left-to-right).
- Execute builtins or external programs.
Invariants: expansion happens before execution; quoting blocks word splitting. Failure modes: unquoted variables, unintended globbing, pipeline errors hidden by default behavior.
Minimal Concrete Example
name="Ada Lovelace"
# Wrong: splits into two words
printf "Hello %s\n" $name
# Correct
printf "Hello %s\n" "$name"
Common Misconceptions
- “The program sees the glob” -> The shell expands it first.
- “Single quotes expand $VAR” -> They do not.
- “Pipelines fail if any command fails” -> Not unless pipefail is enabled.
Check-Your-Understanding Questions
- Why does echo $file behave differently from echo "$file"?
- What is the order of expansions in bash?
- Why is for f in $(ls) fragile?
Check-Your-Understanding Answers
- Unquoted variables undergo word splitting and globbing; quoted variables do not.
- Brace, tilde, parameter/variable, arithmetic, command substitution, word splitting, filename expansion, quote removal.
- $(ls) splits on whitespace and breaks filenames with spaces.
Real-World Applications
- Writing safe deployment scripts
- Automating file cleanup without deleting the wrong files
- Building pipelines that handle unpredictable input
Where You’ll Apply It
- Project 1, Project 2, Project 3, Project 4, Project 5
References
- Bash Reference Manual: Shell Expansions (order of expansion) - https://www.gnu.org/s/bash/manual/html_node/Shell-Expansions.html
- Bash Reference Manual: Word Splitting (IFS behavior) - https://www.gnu.org/software/bash/manual/html_node/Word-Splitting.html
- Bash Reference Manual: Redirections (left-to-right processing) - https://www.gnu.org/s/bash/manual/html_node/Redirections.html
- Bash Reference Manual: The set builtin (pipefail) - https://www.gnu.org/s/bash/manual/html_node/The-Set-Builtin.html
Key Insight
Shell safety is mostly about controlling expansion and quoting so the shell cannot surprise you.
Summary
You are not just typing commands; you are writing a small program. Expansion order, quoting, and exit status decide whether that program is correct.
Homework/Exercises
- Write a one-liner that lists all .log files in your home directory, even if they have spaces.
- Demonstrate the difference between single and double quotes with a variable.
- Show the difference in pipeline exit status with and without pipefail.
Solutions
find "$HOME" -name "*.log" -printname="Ada"; echo '$name'; echo "$name"false | true; echo $?; set -o pipefail; false | true; echo $?
Chapter 2: Filesystem Hierarchy, Metadata, and Permissions
Fundamentals
Linux organizes everything into a single directory tree rooted at /. Files are identified by paths, and each file has metadata: owner, group, permissions, timestamps, and an inode number. Understanding the filesystem hierarchy, such as /etc for configuration and /var for variable data, makes navigation and automation predictable across systems. Permissions are the default security model: read, write, and execute for user, group, and others. Directories behave differently from files: execute on a directory means you can access entries inside. File discovery and management rely on this model, which is why tools like find, stat, chmod, and chown show up in real maintenance scripts. The Filesystem Hierarchy Standard (FHS) documents the expected layout for Linux systems and explains why data lives where it does.
Deep Dive into the Concept
The Linux filesystem is a single rooted tree. Devices and partitions are attached to that tree at mount points. The root / is a logical entry point, not a physical disk. The Filesystem Hierarchy Standard explains the intended locations for system and application data: /etc for configuration, /var for variable data like logs, /usr for userland programs and read-only data, /home for user directories, and /srv for service data. These conventions allow scripts to be portable across distributions. FHS emphasizes predictability: software and administrators can find data in expected places because the layout is standardized.
Every file is represented by an inode, which stores metadata and pointers to data blocks. The filename is just a directory entry mapping a name to an inode. This is why hard links are possible: multiple directory entries can point to the same inode. Deleting a filename does not delete the data until the last link is removed. Symbolic links are different: they store a path to another file. If the target path moves, the symlink breaks. Knowing the difference matters when you back up or move data.
Permissions are represented as three triplets: user, group, others. Each triplet has read (r), write (w), and execute (x). For directories, read lets you list entries, write lets you create or delete entries, and execute lets you access entries and traverse the directory. This is why a directory with read but no execute is often useless: you can list names but cannot access them. Permissions are often represented in octal, where each digit is a sum of read (4), write (2), and execute (1). For example, 644 means read/write for owner and read-only for group and others. GNU coreutils documents how numeric modes map to permission bits.
Special permission bits add nuance: setuid (4xxx) and setgid (2xxx) cause executables to run with the file owner’s or group’s privileges. The sticky bit (1xxx) on directories prevents users from deleting files they do not own, which is why /tmp is sticky. The umask defines default permissions for new files by removing permission bits from the default mode. This matters for automation because new files in scripts inherit the umask. Permissions are enforced by the kernel on every access, and you can inspect them with ls -l or stat.
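A short scratch-directory session makes the octal notation, umask, and setgid bit concrete. A minimal sketch (run it in a throwaway directory):
umask            # show the current mask (022 is common)
umask 027        # new files: no permissions for others
touch report.txt
ls -l report.txt                    # -rw-r----- : 666 minus the 027 mask
stat -c '%a %U:%G %n' report.txt    # numeric mode, owner, group
mkdir shared && chmod 2775 shared   # setgid directory: new files inherit the group
ls -ld shared                       # drwxrwsr-x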
Linux also supports Access Control Lists (ACLs) and extended attributes (xattrs), which provide finer-grained permissions beyond the user/group/other model. While not used in every script, they appear frequently in enterprise systems and can explain confusing permission behavior. If ls -l shows a + at the end of the mode string, ACLs exist.
For diagnostics, stat is often more reliable than ls -l because it reports raw numeric modes, timestamps, and inode numbers in a consistent format. Scripts that operate on large directory trees should also be mindful of mount boundaries; find -xdev keeps traversal on a single filesystem so you do not accidentally cross into mounted backups or network filesystems. When you automate backups or cleanup, verifying filesystem type (df -T) can prevent surprises on special filesystems.
Time metadata also matters. Every inode tracks at least three timestamps: modification time (mtime), status change time (ctime), and access time (atime). When you build backup or cleanup scripts, you must choose which timestamp you care about. find can filter by time (-mtime, -ctime, -atime), type, size, and permissions. Combined with -exec or xargs, it becomes a full automation engine.
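A few concrete find invocations show how time, size, and safe batching combine; the paths are examples only:
# Files under /var/log modified in the last 7 days
find /var/log -type f -mtime -7 -print
# Files larger than 100 MB, listed in batches via -exec ... +
find /var/log -type f -size +100M -exec ls -lh {} +
# NUL-safe handling of arbitrary filenames (see Chapter 8)
find "$HOME/tmp" -type f -name '*.bak' -print0 | xargs -0 -r ls -l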
How This Fits in Projects
- Project 3 uses filesystem hierarchy to decide what to back up.
- Project 5 relies on metadata and safe permissions when moving files.
- All projects rely on safe path handling and permission awareness.
Definitions & Key Terms
- FHS: Filesystem Hierarchy Standard, describing expected layout.
- Inode: Metadata record for a file, separate from the filename.
- Hard link: Another name pointing to the same inode.
- Symlink: A file containing a path to another file.
- Umask: Mask that removes permission bits from defaults.
Mental Model Diagram
/path/name ---> directory entry ---> inode ---> data blocks
| |
v v
permissions timestamps
How It Works (Step-by-Step)
- Path is resolved from / through directories.
- Each directory entry maps name -> inode.
- Inode stores permissions, owner, size, timestamps.
- Kernel checks permissions on each access.
Invariants: permissions gate access; names are not the data. Failure modes: wrong ownership, missing execute bit on directories, symlink targets missing.
Minimal Concrete Example
# Make a directory private to the user
mkdir -p secret
chmod 700 secret
# Verify
ls -ld secret
Common Misconceptions
- “chmod 644 on a directory is fine” -> Without execute, you cannot access entries.
- “Deleting a filename deletes data” -> Not if other hard links exist.
- “Symlinks have their own permissions” -> Usually ignored by the kernel.
Check-Your-Understanding Questions
- Why does a directory need the execute bit?
- What does permission mode 750 mean?
- Why can two filenames refer to the same inode?
Check-Your-Understanding Answers
- Execute allows traversal and accessing entries within the directory.
- Owner: rwx (7), group: r-x (5), others: --- (0).
- Hard links are multiple directory entries pointing to one inode.
Real-World Applications
- Backup scripts choosing /etc and /var correctly
- Secure file drops with permissions and ownership
- Safe cleanup based on timestamps
Where You’ll Apply It
- Project 3, Project 5
References
- Filesystem Hierarchy Standard (FHS) 3.0 specification - https://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.html
- GNU coreutils: numeric mode bits for permissions - https://www.gnu.org/s/coreutils/manual/html_node/Numeric-Modes.html
Key Insight
The filesystem is a predictable data model: names map to inodes, and permissions guard every access.
Summary
Once you understand FHS, inodes, and permissions, file automation becomes safe and portable.
Homework/Exercises
- Create a directory where only your user can read/write/execute.
- Create a file and a hard link; delete one and confirm the data remains.
- Use find to list files modified in the last 7 days in /var/log.
Solutions
- mkdir private && chmod 700 private
- echo hi > a; ln a b; rm a; cat b
- find /var/log -type f -mtime -7 -print
Chapter 3: Streams, Redirection, and Pipelines
Fundamentals
Every process in Linux has three default streams: stdin (0), stdout (1), and stderr (2). The shell connects these streams to your terminal by default, but you can redirect them to files or other commands. This is the heart of the UNIX philosophy: write small tools that read from stdin and write to stdout so they can be composed into pipelines. Redirection lets you control where data goes, and pipes let you connect processes. Understanding file descriptors and redirection order is essential for building reliable pipelines and for debugging commands that appear to “lose” errors. Bash documents redirection behavior and order explicitly.
Deep Dive into the Concept
A pipeline connects the stdout of one process to the stdin of another using |. This creates a streaming dataflow between processes, allowing each command to do one transformation. Redirection operators (>, >>, <, 2>, 2>&1) change where streams go. For example, cmd >out redirects stdout to a file, while cmd 2>err redirects stderr. The order of redirections matters, which is why cmd >out 2>&1 behaves differently from cmd 2>&1 >out. Bash processes redirections left to right, and the manual shows examples illustrating this exact behavior.
Pipelines by default return the exit status of the last command. This can hide failures earlier in the pipeline. Bash offers set -o pipefail to change the pipeline status to the last non-zero exit code, making errors visible. This is especially important for multi-step log processing or backup pipelines where a failure in the middle should stop the pipeline. The set builtin documentation defines pipefail.
Redirections can also duplicate file descriptors. 2>&1 means “send stderr to the same place as stdout.” You can also open custom file descriptors with exec or use {var} syntax in bash to allocate one. Bash also supports special files like /dev/stdin, /dev/stdout, and /dev/stderr as documented in the redirections section.
Pipes are not just strings; they are kernel buffers. If a downstream command exits early (like head), the upstream command can receive SIGPIPE. This is normal and is one reason why you must interpret pipeline errors carefully. Many tools quietly handle SIGPIPE, but some do not. Using set -o pipefail plus explicit error checks in scripts gives you evidence for failures without panic.
Finally, understand that each command in a pipeline runs in its own process. Variable assignments in one pipeline stage do not change the parent shell. This is why cmd | while read line; do ...; done does not usually persist variable changes outside the loop unless you use process substitution or a here-string.
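The subshell behavior, and the usual workaround, can be demonstrated in a few lines. A minimal sketch:
count=0
printf 'a\nb\nc\n' | while read -r line; do count=$((count + 1)); done
echo "$count"   # still 0: the loop ran in a subshell
count=0
while read -r line; do count=$((count + 1)); done < <(printf 'a\nb\nc\n')
echo "$count"   # 3: process substitution keeps the loop in the current shell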
Here-documents and here-strings are also part of the stream toolkit. A here-document (<<EOF) feeds a block of text into stdin, while a here-string (<<< "text") feeds a single string. These are often safer and clearer than echo piped into a command, especially when the text contains special characters. Another powerful tool is tee, which splits a stream so you can both save output to a file and pass it along to the next command. This is invaluable for debugging pipelines because you can capture intermediate output without breaking the flow. Process substitution (<(cmd)) turns command output into a temporary file descriptor, letting commands that expect filenames read from command output directly. For example, diff <(sort a) <(sort b) compares sorted outputs without creating intermediate files.
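One short example of each keeps these tools memorable. A minimal sketch (the file names are illustrative):
# Here-string: feed a single string to stdin
grep -c 'o' <<< "hello world"
# tee: save an intermediate stage while the pipeline continues
awk '{print $1}' access.log | tee ips.txt | sort -u | wc -l
# Process substitution: compare two command outputs without temp files
diff <(sort a.txt) <(sort b.txt)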
Finally, be aware of buffering. Some commands buffer output when stdout is not a terminal, which can make pipelines appear to hang or delay output. Tools like stdbuf or command-specific flags (for example, grep --line-buffered) can change buffering behavior. Understanding buffering helps when you build real-time monitoring pipelines.
How This Fits in Projects
- Project 2 is a classic pipeline-based log analysis.
- Project 4 depends on redirection to build CSV logs.
- Project 5 uses safe pipelines to handle filenames with spaces.
Definitions & Key Terms
- stdin: Standard input (file descriptor 0).
- stdout: Standard output (file descriptor 1).
- stderr: Standard error (file descriptor 2).
- Redirection: Changing where a stream reads or writes.
- Pipeline: Passing stdout of one command to stdin of another.
Mental Model Diagram
[cmd1] --stdout--> |pipe| --stdin--> [cmd2] --stdout--> |pipe| --> [cmd3]
|stderr |stderr |stderr
v v v
terminal or file terminal or file terminal or file
How It Works (Step-by-Step)
- Shell creates pipe(s) and forks processes.
- Each process inherits file descriptors.
- Redirections are applied left to right.
- Data flows through the pipe buffer.
- Exit statuses are collected; pipeline status is decided.
Invariants: pipes connect stdout to stdin; stderr is separate unless redirected. Failure modes: missing pipefail, wrong redirection order, hidden errors.
Minimal Concrete Example
# Count unique IPs from a log
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -n 5
Common Misconceptions
- “stderr flows through the pipe” -> Not unless you redirect it.
- “Pipelines stop on the first failure” -> They run to completion; pipefail only changes status.
- “Variables set in pipelines persist” -> They do not in most shells.
Check-Your-Understanding Questions
- Why does cmd 2>&1 >out differ from cmd >out 2>&1?
- What does set -o pipefail change?
- Why might a pipeline hide a failure in the middle?
Check-Your-Understanding Answers
- Redirections are processed left to right; the order determines where stderr points.
- It makes the pipeline return the last non-zero exit status.
- By default, the pipeline exit status is only the last command.
Real-World Applications
- Building reliable log processing pipelines
- Capturing errors to a file for later analysis
- Streaming backups to compression tools
Where You’ll Apply It
- Project 2, Project 4, Project 5
References
- Bash Reference Manual: Redirections (order of evaluation) - https://www.gnu.org/s/bash/manual/html_node/Redirections.html
- Bash Reference Manual: Pipelines - https://www.gnu.org/software/bash/manual/html_node/Pipelines.html
- Bash Reference Manual: The set builtin (pipefail) - https://www.gnu.org/s/bash/manual/html_node/The-Set-Builtin.html
Key Insight
Streams are the Linux glue. If you control redirection and pipeline status, you control dataflow.
Summary
Pipelines and redirections let you build powerful workflows from small tools, but only if you manage stdout, stderr, and exit codes intentionally.
Homework/Exercises
- Redirect stdout and stderr to separate files for a failing command.
- Build a pipeline that counts unique error codes in a log file.
- Show how pipefail changes the exit status for a failing pipeline.
Solutions
- cmd >out.log 2>err.log
- grep "ERROR" app.log | awk '{print $5}' | sort | uniq -c | sort -nr
- false | true; echo $?; set -o pipefail; false | true; echo $?
Chapter 4: Text Processing and Regular Expressions
Fundamentals
Most CLI work is text processing. Logs, CSVs, configuration files, and command output are all text streams. Tools like grep, sed, awk, sort, and uniq let you filter, transform, and aggregate data without writing a full program. Regular expressions are the pattern language that powers many of these tools. GNU grep documents three regex dialects: basic (BRE), extended (ERE), and PCRE when available. Understanding the difference between simple text matching and regex matching is key to building correct pipelines.
Text processing is not just about patterns; it is about structure. You must know how a line is formatted, which fields matter, and what delimiters are used. Locale settings also affect sorting and character classes, so for consistent results and better performance many scripts use LC_ALL=C before sort. A good habit is to start with head to inspect data, then build your pipeline one stage at a time.
Deep Dive into the Concept
A good mental model is that text processing is a pipeline of transformations. grep filters lines that match a pattern. sed transforms lines using substitution or deletion rules. awk treats each line as a record and splits it into fields, giving you programmable access to columns. sort orders lines, and uniq collapses duplicates. This ecosystem is designed to be composable, so each tool should read from stdin and write to stdout.
Regular expressions describe sets of strings using operators like . (any character), * (zero or more), + (one or more), ? (optional), character classes like [0-9], and anchors ^ and $. GNU grep defines how these operators work and how basic vs extended regex differ. For example, in basic regex, + is literal unless escaped, while in extended regex + is a repetition operator. The GNU grep manual documents these differences.
awk is a small programming language. It splits each line into fields based on a field separator stored in FS. By default, whitespace separates fields, but you can define your own separator with -F or with BEGIN { FS = ":" }. The GNU awk manual explains that FS is a regular expression and controls how records are split.
sed is a stream editor. It processes input line by line and applies a script, often for substitutions like s/old/new/. By default it prints all lines, but -n plus p lets you print only specific lines. The GNU sed manual explains how -n and -i behave.
There is also nuance in how tools treat records. awk uses RS (record separator) to define what a “line” means and OFS (output field separator) to control output formatting. This means you can parse non-line-oriented data by changing RS, or build CSV output by setting OFS=\",\" and using printf for precise formatting. sort has both lexical (sort) and numeric (sort -n) modes, and can sort on specific keys (-k). This matters when you rank data: lexical sort will put 100 before 20 unless you use numeric sort. If you need stable sorting, sort -s keeps input order for equal keys. Finally, many tools follow POSIX regex rules, but GNU tools add extensions. grep -E uses extended regex syntax, while grep -P uses PCRE where available. Knowing which mode you are in prevents pattern bugs that are very hard to detect in long pipelines.
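A small example shows why keyed numeric sorting and explicit output separators matter. A sketch, assuming a whitespace-separated file sizes.txt with a name and a byte count per line:
# Lexical sort puts 100 before 20; -k2,2 -n sorts by the numeric second column
sort -k2,2 -n sizes.txt
# OFS builds clean CSV output from whitespace-separated input
awk 'BEGIN { OFS = "," } { print $1, $2 }' sizes.txt > sizes.csv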
Beyond the big five tools, small utilities like cut, tr, paste, join, comm, and column can simplify pipelines. cut extracts delimited fields quickly, tr transforms character sets, and column -t formats output into tables. sort -u can replace sort | uniq, and wc -l gives quick counts. These utilities are often faster than a custom script and help keep pipelines clear and readable.
Text processing pipelines often rely on sort before uniq -c, because uniq only counts consecutive duplicates. This is why the canonical pattern is sort | uniq -c | sort -nr. If you know the structure of your log format, you can extract fields by position ($1, $9) and then aggregate. If the log format varies, you must defend against malformed lines (e.g., NF >= 9 in awk). This is one of the most common log analysis failure modes.
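Defending against malformed lines can be a simple field-count guard in awk. A sketch, assuming a combined-format access log where the status code is field 9:
# Count status codes; divert lines with fewer than 9 fields for later inspection
awk 'NF < 9 { print > "malformed.log"; next } { print $9 }' access.log |
  sort | uniq -c | sort -nr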
How This Fits in Projects
- Project 2 is built entirely on regex, awk fields, sort, and uniq.
- Project 4 extracts numeric fields from command output into CSV.
Definitions & Key Terms
- Regular expression (regex): Pattern describing a set of strings.
- Record: A line of input (awk default).
- Field: A column within a record (awk $1, $2, …).
- FS: Field separator in awk; can be a regex.
- Stream editor: A tool like sed that transforms lines.
Mental Model Diagram
input.log
|
v
[grep] -> filter lines
|
v
[awk] -> extract fields
|
v
[sort] -> order values
|
v
[uniq] -> count duplicates
How It Works (Step-by-Step)
- grep selects lines that match your pattern.
- awk splits each line into fields and extracts the ones you need.
- sort groups identical values together.
- uniq -c counts grouped duplicates.
- sort -nr ranks results by frequency.
Invariants: uniq only counts adjacent duplicates; awk fields depend on FS.
Failure modes: wrong regex dialect, unexpected log format, missing sort before uniq.
Minimal Concrete Example
# Top 5 most common HTTP status codes
awk '{print $9}' access.log | sort | uniq -c | sort -nr | head -n 5
Common Misconceptions
- “grep uses the same regex everywhere” -> Dialects differ across tools.
- “uniq counts duplicates anywhere” -> It only counts consecutive duplicates.
- “awk fields are always space-separated” -> FS can be a regex.
Check-Your-Understanding Questions
- Why must sort come before uniq -c?
- How do you change the field separator in awk?
- Why can a regex like + fail in basic regex?
Check-Your-Understanding Answers
- uniq counts only adjacent duplicates, so sorting groups them first.
- Use awk -F ':' or BEGIN { FS = ":" }.
- In BRE, + is literal unless escaped; ERE treats it as repetition.
Real-World Applications
- Log analysis and incident response
- Data cleaning before import into databases
- Generating reports from large text files
Where You’ll Apply It
- Project 2, Project 4
References
- GNU grep manual: regex structure and dialects - https://www.gnu.org/software/grep/manual/grep.html
- GNU awk manual: field separators (FS) - https://www.gnu.org/software/gawk/manual/html_node/Field-Separators.html
- GNU sed manual: overview and stream editing - https://www.gnu.org/software/sed/manual/html_node/Overview.html
Key Insight
Text processing is a pipeline of transformations. When you control regex, fields, and sorting, you can extract any signal from noisy logs.
Summary
CLI mastery is data mastery. Learn the regex and field models, and logs become structured data.
Homework/Exercises
- Extract the top 5 URLs from an access log.
- Replace all IPs in a log with X.X.X.X using sed.
- Use awk to print only lines with status 500.
Solutions
- awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -n 5
- sed -E 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/X.X.X.X/g' access.log
- awk '$9 == 500 {print $0}' access.log
Chapter 5: Processes, Signals, and System Introspection
Fundamentals
Every command you run becomes a process. Processes have IDs (PIDs), consume CPU and memory, and can be inspected or controlled. Tools like ps, top, and kill let you inspect and manage them. Linux exposes many system metrics through the /proc filesystem, which is a pseudo-filesystem containing live kernel data. Understanding /proc gives you a direct view into CPU usage, memory stats, and load averages, which is crucial for monitoring scripts. The proc man pages document the structure of /proc, and specific files like /proc/stat, /proc/meminfo, and /proc/loadavg show how metrics are computed.
Process behavior also depends on scheduling and state. A process can be running, sleeping, or waiting on IO, and this state affects responsiveness and load. Job control in the shell lets you suspend and resume processes, which is essential when you are multitasking in a terminal.
Deep Dive into the Concept
A process is a running instance of a program with its own memory space, file descriptors, and environment. The kernel schedules processes onto CPU cores. The process state can be running, sleeping, or waiting for IO. Load average is not CPU usage; it measures the average number of runnable or uninterruptible tasks in the run queue over 1, 5, and 15 minutes. Linux documents this in /proc/loadavg: the first three fields are load averages, and they correspond to the same values shown by uptime or top.
CPU usage metrics come from /proc/stat, which lists cumulative time spent by CPUs in different states (user, system, idle, iowait, etc.) measured in USER_HZ units. The proc_stat man page documents these fields and explains that they are cumulative since boot. To compute CPU usage, you sample /proc/stat twice and calculate deltas between the samples. This is why tools like top show a moving percentage.
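The two-sample delta can be sketched in a few lines of shell, assuming the standard field order on the aggregate "cpu" line documented in proc_stat(5) (steal and guest fields are ignored here):
# Read the first line of /proc/stat twice and compute busy% from the deltas
read -r _ u1 n1 s1 i1 w1 irq1 sirq1 rest1 < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 irq2 sirq2 rest2 < /proc/stat
busy=$(( (u2 + n2 + s2 + irq2 + sirq2) - (u1 + n1 + s1 + irq1 + sirq1) ))
idle=$(( (i2 + w2) - (i1 + w1) ))
echo "CPU busy: $(( 100 * busy / (busy + idle) ))%"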
Memory usage is exposed via /proc/meminfo. It is a structured text file where each line is a key and value, used by tools like free to report memory usage. The proc_meminfo man page describes how MemTotal, MemFree, MemAvailable, and other fields are defined.
Signals are another key mechanism. A signal is a notification sent to a process. SIGTERM asks a process to exit gracefully, while SIGKILL forces termination. SIGINT is what you send with Ctrl+C. Scripts can trap signals to clean up temporary files. Job control in interactive shells (background, foreground, jobs, fg, bg) relies on signals like SIGTSTP.
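A common use of trap is cleaning up temporary files no matter how a script ends. A minimal sketch:
#!/usr/bin/env bash
# Remove the temp file on normal exit; exit explicitly on Ctrl+C or SIGTERM so the EXIT trap runs
tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT
trap 'exit 130' INT
trap 'exit 143' TERM
echo "working in $tmpfile"
sleep 30   # stand-in for real work; interrupt with Ctrl+C to test the cleanup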
Understanding process lifetime is important for monitoring. A process that dies still has a PID until the parent collects its exit status; such a process is a zombie. Scripts that spawn background jobs should wait on them to avoid zombie accumulation.
For deeper inspection, ps can show state (STAT), CPU usage (%CPU), memory usage (%MEM), and command lines. top and htop provide a live view, but for scripts you often want snapshot tools like ps -eo pid,ppid,stat,pcpu,pmem,comm. If you need IO-focused insight, iostat or vmstat can help, but even without them you can infer IO pressure from high iowait in /proc/stat and elevated load averages. These details let you build monitors that explain not just that a system is busy, but why it is busy.
When sampling metrics in a script, use consistent intervals and record timestamps so you can correlate spikes with events. Even simple snapshots become powerful when you can compare them over time.
Process priority also matters. The kernel scheduler considers a process’s nice value, which can be adjusted with nice and renice. Increasing the nice value makes a process run with lower priority, which is useful for background jobs like backups. You can also test whether a process exists without sending a signal by using kill -0 <pid>, which performs permission checks but does not terminate the process. Finally, context switches and IO wait time explain why CPU usage and load average are not interchangeable: high load can be caused by IO-bound processes even when CPU usage appears low.
How This Fits in Projects
- Project 4 measures CPU, memory, and disk usage by reading system metrics.
- Project 3 and 5 rely on signals and process control for long-running tasks.
Definitions & Key Terms
- PID: Process ID.
- Load average: Average runnable or uninterruptible tasks in run queue.
- /proc: Pseudo-filesystem exposing kernel data.
- Signal: Kernel message to a process (e.g., SIGTERM).
- Zombie: Process that exited but has not been waited on.
Mental Model Diagram
Program -> Process -> Scheduler -> CPU
| | |
v v v
/proc signals metrics
How It Works (Step-by-Step)
- Shell forks and execs a program, creating a process.
- Kernel schedules process on CPU cores.
- Kernel updates /proc stats as the process runs.
- Tools like ps, top, free read those stats.
- Signals can control process behavior.
Invariants: /proc is live kernel data; load average is about queue length, not CPU. Failure modes: misinterpreting load average, parsing /proc incorrectly, failing to wait on background jobs.
Minimal Concrete Example
# Read load average directly
cat /proc/loadavg
# Sample CPU usage fields
head -n 1 /proc/stat
Common Misconceptions
- “Load average is CPU percent” -> It is queue length, not percent.
- “Kill always means force” -> SIGTERM is polite; SIGKILL is force.
- “Zombie processes consume CPU” -> They do not, but they occupy PID entries.
Check-Your-Understanding Questions
- What does /proc/loadavg represent?
- Why do you need two samples of /proc/stat to compute CPU usage?
- When should you use SIGTERM vs SIGKILL?
Check-Your-Understanding Answers
- Load average is the average number of runnable or uninterruptible tasks.
- The values are cumulative since boot; you need deltas to compute usage.
- SIGTERM for graceful shutdown; SIGKILL only when a process is stuck.
Real-World Applications
- Building custom resource monitors
- Investigating performance issues
- Managing long-running scripts and jobs
Where You’ll Apply It
- Project 4
References
- proc(5): overview of /proc filesystem - https://man7.org/linux/man-pages/man5/proc.5.html
- proc_stat(5): CPU time fields in /proc/stat - https://man7.org/linux/man-pages/man5/proc_stat.5.html
- proc_meminfo(5): memory stats used by free - https://man7.org/linux/man-pages/man5/proc_meminfo.5.html
- proc_loadavg(5): load average definition - https://man7.org/linux/man-pages/man5/proc_loadavg.5.html
Key Insight
Linux exposes its internal state as text. If you can parse /proc, you can measure anything.
Summary
Processes, signals, and /proc are the foundation of system monitoring and automation.
Homework/Exercises
- Write a script that prints load average every 5 seconds.
- Parse /proc/meminfo to extract MemAvailable.
- Start a background job, then use ps to find it and kill it gracefully.
Solutions
- while true; do cat /proc/loadavg; sleep 5; done
- awk '/MemAvailable/ {print $2 $3}' /proc/meminfo
- sleep 1000 & pid=$!; ps -p $pid; kill $pid
Chapter 6: Automation, Scheduling, and Reliability
Fundamentals
Automation is about running the right command at the right time, consistently. Shell scripts turn command sequences into reusable programs. Scheduling tools like cron run scripts on a timetable, making backups, health checks, and log rotations possible without human intervention. The crontab file format defines minute, hour, day, month, and weekday fields, and is documented in the crontab(5) manual page.
Reliable automation requires more than scheduling. Scripts should validate inputs, log output, and exit with meaningful codes so other systems can react. A good automation mindset is to treat each script like a tiny service: it has inputs, outputs, and a predictable failure mode.
Deep Dive into the Concept
A reliable script is idempotent (safe to run multiple times), explicit about its environment, and records evidence of success or failure. Shell scripts inherit environment variables like PATH, but cron runs with a minimal environment. This is why scripts that work in a terminal often fail under cron. To make scripts robust, set PATH explicitly, use absolute paths, and log output to a file. Cron format is strict: a schedule line has five time fields (minute, hour, day of month, month, day of week) followed by the command. The man page documents ranges, lists, and step values like */5 for every 5 minutes.
A common reliability pattern is to fail fast with clear errors. Use set -e cautiously because it can exit on non-critical errors; prefer explicit error handling and exit codes. Always check that input files exist, that output directories are writable, and that external commands are available. Redirect logs (>>) and include timestamps. Consider a --dry-run mode for destructive operations.
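Those habits can be folded into a small skeleton that any cron-driven script can follow. A sketch only; the script name, paths, and exit codes are placeholders:
#!/usr/bin/env bash
# backup-skeleton.sh: explicit PATH, input checks, timestamped logging, meaningful exit codes
set -u                                   # fail fast on undefined variables
PATH=/usr/local/bin:/usr/bin:/bin
SRC="/etc"
DEST="/var/backups"
LOG="/var/log/backup-skeleton.log"
log() { printf '%s %s\n' "$(date '+%F %T')" "$*" >> "$LOG"; }
[ -d "$SRC" ]  || { log "ERROR: source $SRC missing"; exit 2; }
[ -w "$DEST" ] || { log "ERROR: cannot write to $DEST"; exit 3; }
command -v tar >/dev/null || { log "ERROR: tar not found"; exit 4; }
if tar -czf "$DEST/etc-$(date +%F).tar.gz" "$SRC" 2>>"$LOG"; then
    log "OK: backup created"
else
    log "ERROR: tar failed"
    exit 1
fi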
Idempotency matters. A backup script should not overwrite previous backups; instead it should create timestamped files and optionally prune old ones. A log analyzer should treat input as read-only. A media organizer should move files only after ensuring the destination path exists and is safe.
For reproducibility, write scripts so they can be run manually and by cron. Provide a --help flag, validate arguments, and exit with meaningful non-zero codes on failure.
Scheduling introduces concurrency problems. If a cron job takes longer than its interval, you can get overlapping runs. Use lock files or flock to ensure only one instance runs at a time. Environment variables in cron can surprise you; the crontab format allows setting SHELL, PATH, and MAILTO, and these control how your script runs and where output goes. Even without advanced tooling, you can make scripts safe by writing to a dedicated log file, capturing both stdout and stderr, and using timestamps so you can reconstruct what happened during a failure window. You can also use trap to clean up temporary files when a script exits or is interrupted.
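The flock pattern that prevents overlapping cron runs fits in a few lines. A minimal sketch (the lock file path is arbitrary):
#!/usr/bin/env bash
# Open a lock file on fd 9 and exit if another instance already holds the lock
exec 9>/tmp/myjob.lock
if ! flock -n 9; then
    echo "another instance is running; exiting" >&2
    exit 0
fi
# ... long-running work goes here ...
sleep 60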
Another reliability pattern is separating configuration from code. Use a config file or environment variables to control paths, retention days, or endpoints, so the script remains stable while behavior changes. Consider setting a safe umask for files created by automation, and use set -u to fail fast on undefined variables. If a script produces output on every run, plan for log rotation or size limits; otherwise logs can grow until disk fills. The logger command can send messages to syslog, which centralizes logs in many systems. Finally, test automation with set -x tracing and with shellcheck if available, then run it manually before scheduling it in cron.
For one-off jobs, the at command can schedule a single run instead of a recurring cron entry. You should also think about notification: if a job fails, who sees it? Redirecting stderr to a monitored log or sending mail on non-zero exit status keeps automation from failing silently.
How This Fits in Projects
- Project 3 uses cron to schedule backups.
- Project 1 and 4 benefit from logging and idempotent design.
Definitions & Key Terms
- Cron: Scheduler for time-based job execution.
- Idempotent: Safe to run multiple times without changing outcome.
- PATH: Environment variable controlling command lookup.
- Dry run: Execute a script in preview mode without changes.
Mental Model Diagram
Script -> Validation -> Work -> Logging -> Exit status
^ |
| v
Cron scheduler ------------> repeat
How It Works (Step-by-Step)
- Cron reads a schedule from crontab.
- At the scheduled time, cron executes the command.
- The script runs with a minimal environment.
- Exit status is recorded; output can be mailed or logged.
Invariants: cron does not load your interactive shell environment. Failure modes: missing PATH, relative paths, silent failures without logs.
Minimal Concrete Example
# Run backup script every day at 01:00
0 1 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1
Common Misconceptions
- “Cron uses my shell profile” -> It does not.
- “Scripts are reliable if they work once” -> Reliability is about repeated runs.
- “Exit codes are optional” -> Automation depends on them.
Check-Your-Understanding Questions
- What fields make up a crontab schedule?
- Why do scripts fail under cron but work manually?
- What does idempotent mean in scripting?
Check-Your-Understanding Answers
- Minute, hour, day of month, month, day of week.
- Cron runs with a minimal environment and shorter PATH.
- Running the script multiple times yields the same result.
Real-World Applications
- Nightly backups
- Scheduled log rotation or cleanup
- Continuous system health checks
Where You’ll Apply It
- Project 3, Project 1, Project 4
References
- crontab(5) format and schedule syntax - https://man7.org/linux/man-pages/man5/crontab.5.html
Key Insight
Automation is mostly about predictability: environment, logging, and idempotency.
Summary
A good script is a reliable system component: explicit inputs, stable outputs, and a schedule.
Homework/Exercises
- Add a cron entry that runs a script every 10 minutes.
- Modify a script to log both stdout and stderr to a file.
- Add a --dry-run flag to a script and implement it.
Solutions
- */10 * * * * /usr/local/bin/task.sh
- ./task.sh >> /var/log/task.log 2>&1
- Use a flag to skip destructive actions and only print what would happen.
Chapter 7: Networking and HTTP for the CLI
Fundamentals
Many CLI tasks involve networks: checking websites, pulling data, or testing connectivity. At a minimum, you need to understand DNS (name to IP resolution), HTTP status codes, and how tools like curl report errors. DNS is defined in RFC 1034, which explains how domain names map to IP addresses. HTTP status codes are defined in RFC 9110, which describes status code classes and their meaning. curl exposes both HTTP status codes and its own exit codes for network errors; the curl documentation lists exit codes and their meanings.
Networking failures are layered. DNS can fail, TCP can fail, TLS can fail, and HTTP can fail. A good CLI script distinguishes these layers so you can tell whether the remote server is down or your network path is broken. Simple tools like dig, nslookup, and ping can help isolate DNS or reachability issues before you even use curl.
Deep Dive into the Concept
DNS is the system that translates human-readable domain names into IP addresses. A client queries DNS for an A or AAAA record to resolve a name. RFC 1034 describes the DNS conceptual model and the idea that names map to records in a distributed database. When DNS fails, curl cannot connect and typically returns an error code like 6 (could not resolve host). The curl project documents exit codes and notes that non-zero return codes indicate errors.
HTTP status codes are 3-digit values grouped by class: 1xx informational, 2xx success, 3xx redirection, 4xx client errors, 5xx server errors. RFC 9110 defines these classes and notes that clients must understand the class of a status code even if they do not recognize the exact code. In monitoring scripts, you should decide which classes count as “success”. For example, 200 is clearly OK, 301 might be fine if you accept redirects, but 404 and 500 should be errors.
curl is the go-to CLI tool for HTTP. It can print the HTTP response code using --write-out with %{http_code}; the man page documents http_code in the write-out variables. You can also set timeouts (--connect-timeout and --max-time) to prevent a script from hanging. The combination of HTTP status codes and curl exit codes lets you distinguish between “server replied with error” and “network failed entirely.” This distinction is critical in monitoring.
HTTP itself is a request/response protocol with methods (GET, HEAD, POST, etc.), headers, and bodies. A simple status check often uses HEAD or GET requests; curl -I fetches headers only and is faster when you do not need the body. Monitoring scripts often use --fail so that HTTP 400/500 responses cause a non-zero exit code, and --retry for transient errors. Be careful with redirects: you may want -L to follow them or you may want to treat them as failures if you expect a specific URL. Also consider user agent strings; some servers return different responses depending on user agent, and specifying a stable user agent improves consistency. Finally, connection reuse and DNS caching can affect timing measurements, so when you measure performance, decide whether you are timing cold starts or warm connections.
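A monitoring-oriented check might combine several of these options. A sketch; the URL and timeout values are placeholders:
# Status-only check with bounded time, limited retries, and redirect following
status=$(curl -sS -o /dev/null -w '%{http_code}' \
              --connect-timeout 5 --max-time 10 \
              --retry 2 --retry-connrefused \
              -L "https://example.com")
exit_code=$?
echo "http=$status curl_exit=$exit_code"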
Finally, be mindful of TLS. HTTPS uses TLS for encryption, and certificate errors should be treated differently from HTTP errors. curl exit codes differentiate connection failures from TLS errors, which is useful for diagnosis.
When debugging, curl can show detailed timing with --write-out variables like time_namelookup, time_connect, and time_starttransfer. These let you separate DNS latency from server latency. You can also use --resolve to force a hostname to a specific IP, which is useful when testing a new server before DNS changes. For strict scripts, -sS keeps output quiet but still shows errors, and --retry with --retry-connrefused can smooth over transient network glitches. These options turn a basic curl command into a diagnostic tool. When measuring uptime, normalize for retries and record both status and exit code in logs.
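As an illustration, the same check can print a latency breakdown using the documented --write-out timing variables:
curl -sS -o /dev/null \
     -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
     https://example.com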
How This Fits in Projects
- Project 1 is built on DNS resolution, HTTP status codes, and curl exit codes.
Definitions & Key Terms
- DNS: Domain Name System; maps names to IPs.
- HTTP status code: 3-digit response code from server.
- Exit code: curl-specific numeric error result.
- Timeout: Maximum time to wait for a connection or response.
Mental Model Diagram
URL -> DNS lookup -> TCP/TLS -> HTTP request -> status code + body
| | |
v v v
curl exit curl exit HTTP status
How It Works (Step-by-Step)
curlresolves the domain via DNS.- It opens a TCP connection (and TLS if HTTPS).
- It sends an HTTP request and receives a response.
- It returns an HTTP status code and an exit code.
Invariants: HTTP status codes are separate from curl exit codes. Failure modes: DNS failure, timeout, TLS errors, HTTP errors.
Minimal Concrete Example
curl -s -o /dev/null -w "%{http_code}\n" https://example.com
Common Misconceptions
- “HTTP 404 means the server is down” -> It means the resource is not found; the server is up.
- “curl exit code equals HTTP status” -> They are separate systems.
- “DNS failure is the same as HTTP failure” -> DNS happens before HTTP.
Check-Your-Understanding Questions
- What is the difference between curl exit code 6 and HTTP 404?
- Why might a script treat 301 differently from 404?
- What does RFC 9110 say about status code classes?
Check-Your-Understanding Answers
- Exit code 6 is DNS resolution failure; HTTP 404 is a server response.
- 301 indicates redirect; 404 indicates missing resource.
- Clients must understand the class of a status code by its first digit.
Real-World Applications
- Website monitoring
- API health checks
- Automated retry logic for transient failures
Where You’ll Apply It
- Project 1
References
- RFC 1034: DNS concepts and facilities - https://www.rfc-editor.org/rfc/rfc1034.html
- RFC 9110: HTTP status code classes - https://www.rfc-editor.org/rfc/rfc9110
- curl man page: exit codes - https://curl.se/docs/manpage.html
- curl man page: --write-out and the http_code variable - https://curl.se/docs/manpage.html
Key Insight
Network automation is about interpreting two signals: HTTP status codes and curl exit codes.
Summary
CLI networking is reliable when you separate DNS, transport, and HTTP failures.
Homework/Exercises
- Write a one-liner that prints HTTP status and curl exit code.
- Test a URL that times out and see which exit code you get.
- Modify a curl command to follow redirects and observe the status change.
Solutions
- curl -s -o /dev/null -w "%{http_code}\n" https://example.com; echo $?
- curl --connect-timeout 1 --max-time 2 https://10.255.255.1; echo $?
- curl -L -s -o /dev/null -w "%{http_code}\n" http://example.com
Chapter 8: Archiving, Backup, and File Discovery
Fundamentals
Backup and file organization workflows depend on two core abilities: safely finding files and reliably packaging or copying them. find searches directory trees and can output results in a way that safely handles spaces or special characters. GNU findutils documents -print0 and xargs -0 as the safe way to handle arbitrary filenames. tar is the standard archiving tool; GNU tar documentation explains how to create and compress archives. rsync is a powerful file sync tool, and the man page documents flags like -a (archive), --delete, and --dry-run. When working with media files, metadata tools like ExifTool can extract dates like CreateDate.
Deep Dive into the Concept
find traverses a directory tree and evaluates an expression for each file. This expression can filter by name, type, size, or time. By default, find outputs filenames separated by newlines, which breaks if a filename contains a newline. GNU findutils provides -print0, which outputs a NUL-delimited list. xargs -0 consumes that list safely. The findutils manual explicitly recommends -print0 with xargs -0 for safe filename handling.
When you need to back up or move multiple files, tar lets you package them into a single archive, preserving metadata such as permissions and timestamps. GNU tar supports compression formats like gzip (-z) and xz (-J). The tar manual documents the -c (create) operation and compression options. This is why a common backup command is tar czf backup.tar.gz <dir>.
rsync provides incremental copy and synchronization. With -a it preserves permissions, timestamps, and symlinks, and with --delete it can remove files that no longer exist at the source. The rsync man page documents these options and their behavior. A safe practice is to run rsync --dry-run before actual deletion to preview changes.
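A typical preview-then-sync flow looks like the sketch below; the source and destination paths are placeholders:
# Preview what would change (including deletions) without touching anything
rsync -a --delete --dry-run -v "$HOME/Projects/" /backup/projects/
# Run the real sync only after the preview looks right
rsync -a --delete "$HOME/Projects/" /backup/projects/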
For media organization, metadata is critical. ExifTool can extract CreateDate from EXIF metadata using tags like -CreateDate. The ExifTool documentation describes tag extraction and how tags are specified. When metadata is missing, you can fall back to filesystem timestamps like mtime.
Retention policies are part of backup design. A simple policy might keep the last 7 backups and delete older ones with find -mtime +7 -delete. This is safe as long as your filenames are predictable and you validate the target directory.
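A retention step stays safe with an explicit directory check before anything is deleted. A sketch, assuming archives named backup-*.tar.gz in a /backups directory:
BACKUP_DIR="/backups"
# Refuse to prune if the directory is missing (protects against deleting elsewhere)
[ -d "$BACKUP_DIR" ] || { echo "missing $BACKUP_DIR" >&2; exit 1; }
# Delete archives older than 7 days, matching only the expected name pattern
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'backup-*.tar.gz' -mtime +7 -print -delete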
Verification is the often-missed step. After creating an archive, you can list its contents with tar -tf to confirm what was captured, or compute checksums with sha256sum for later integrity checks. For rsync, --checksum can verify file contents, and --partial can keep partially transferred files if a transfer is interrupted. When moving files, consider whether mv is safe or whether you should cp first and delete only after verification. Duplicate filenames are common in media folders, so a robust organizer either preserves directory structure, adds suffixes, or stores duplicates in a separate folder. These defensive choices protect against silent data loss, which is the biggest risk in bulk file workflows.
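For instance, a minimal verification sketch (archive and checksum file names are illustrative):

# Confirm what the archive captured, then record a checksum for later integrity checks
tar -tzf projects.tar.gz | head
sha256sum projects.tar.gz >> checksums.sha256
# Later, verify nothing has changed:
sha256sum -c checksums.sha256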
Tar also supports exclude patterns (--exclude='*.tmp') so you can keep transient files out of backups. For large datasets, consider incremental backups with --listed-incremental, which records state between runs. When you combine find -print0 with tar --null --files-from -, you avoid filename parsing bugs and can archive an exact, filtered file list. For organization scripts, find -printf can output custom fields like %TY-%Tm-%Td to build paths without external parsing. These options make your backup and organization scripts more precise and easier to audit. You can also write a manifest file that lists every archived or moved file, then compare counts before and after to ensure no files were skipped. A simple wc -l on the manifest is often enough to catch mistakes.
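One way to combine these options into an auditable archive run (names and the exclude pattern are illustrative; note that filenames containing newlines would skew the manifest's line count):

# Build a NUL-safe, filtered file list, archive exactly that list, and cross-check counts
find src/ -type f ! -name '*.tmp' -print0 > filelist.nul
tr '\0' '\n' < filelist.nul > manifest.txt
tar --null -T filelist.nul -czf filtered.tar.gz
wc -l < manifest.txt
tar -tzf filtered.tar.gz | wc -l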
How This Fits in Projects
- Project 3 uses tar, optional rsync, and retention policies.
- Project 5 uses find -print0, metadata extraction, and safe file moves.
Definitions & Key Terms
- Archive: A single file containing many files and their metadata.
- -print0 / -0: NUL-delimited file list for safe filenames.
- rsync -a: Archive mode preserving metadata.
- CreateDate: EXIF metadata tag for media creation time.
Mental Model Diagram
[find] --NUL--> [xargs] --> [tar/rsync/mv]
|
v
metadata (EXIF) -> destination paths
How It Works (Step-by-Step)
- find discovers files with filters.
- Output is passed safely with -print0 and xargs -0.
- Files are archived with tar or synced with rsync.
- Metadata guides destination structure (e.g., YYYY/MM).
- Retention deletes old archives.
Invariants: NUL delimiters are safest; tar preserves metadata; rsync can delete if asked. Failure modes: unsafe filename handling, overwriting files, incorrect retention path.
Minimal Concrete Example
# Safe find and tar archive
find "$HOME/Projects" -type f -print0 | tar --null -T - -czf projects.tar.gz
Common Misconceptions
- “xargs is always safe” -> Not without -print0 and -0.
- “tar is only for tapes” -> It is a general archive tool.
- “rsync is the same as cp” -> rsync can delete and sync incrementally.
Check-Your-Understanding Questions
- Why is -print0 safer than default output?
- What does tar -czf do?
- Why use rsync --dry-run?
Check-Your-Understanding Answers
- Newlines can appear in filenames; NUL cannot.
- Create a gzipped tar archive.
- It previews changes and deletions before they happen.
Real-World Applications
- Nightly backups with retention
- Bulk file organization by metadata
- Safe cleanup of old files
Where You’ll Apply It
- Project 3, Project 5
References
- GNU findutils manual: safe file name handling (-print0, -0) - https://www.gnu.org/software/findutils/manual/html_node/find_html/Safe-File-Name-Handling.html
- GNU tar manual: compression options - https://www.gnu.org/software/tar/manual/html_section/Compression.html
- rsync man page: archive mode, delete, dry-run - https://man7.org/linux/man-pages/man1/rsync.1.html
- ExifTool documentation: tag extraction (-CreateDate) - https://exiftool.org/exiftool_pod2.html
Key Insight
Safe file workflows require two guarantees: safe filename handling and explicit metadata policy.
Summary
Backups and organization are just controlled file discovery plus careful copying/archiving.
Homework/Exercises
- Build a find command that selects JPGs modified in the last 30 days.
- Create a tar archive with gzip compression of a directory.
- Run an rsync dry run between two folders and review the output.
Solutions
find . -type f -iname "*.jpg" -mtime -30 -print
tar -czf photos.tar.gz ./Photos
rsync -av --dry-run src/ dest/
Glossary
- ACL: Access Control List; fine-grained permissions beyond user/group/other.
- Archive: A single file containing multiple files and metadata.
- Exit status: Numeric result of a command (0 = success).
- FHS: Filesystem Hierarchy Standard for directory layout.
- Inode: Metadata record for a file.
- Load average: Average number of runnable or uninterruptible tasks, reported over 1, 5, and 15 minutes.
- Pipeline: Command chain where stdout feeds stdin.
- Redirection: Changing where stdin/stdout/stderr go.
- Regex: Pattern describing a set of strings.
Why Linux CLI Matters
The Modern Problem It Solves
Modern infrastructure is distributed, automated, and headless. Most servers you touch have no GUI. The CLI is the universal interface that works over SSH, inside containers, and in automation pipelines. If you can reason about shells, pipelines, and permissions, you can debug production incidents faster, build reliable automation, and transfer skills across distributions.
Real-world impact (recent statistics):
- Linux powers the majority of websites with known OS: W3Techs reports Linux at 59.7% of websites with known operating systems (Dec 28, 2025). Source: https://w3techs.com/technologies/comparison/os-linux%2Cos-windows
- Android dominates mobile OS market share: StatCounter reports Android at 71.94% worldwide mobile OS share (Nov 2025). Source: https://gs.statcounter.com/os-market-share/mobile/-/2024
- Android is Linux-based: Android is an open source, Linux-based software stack. Source: https://developer.android.com/guide/platform/index.html
These numbers show why Linux CLI skills matter: you are operating the dominant server OS and the kernel that underpins the dominant mobile platform.
OLD APPROACH NEW APPROACH
┌───────────────────────┐ ┌──────────────────────────┐
│ Clickable GUIs only │ │ Scriptable CLI workflows │
│ Manual, slow changes │ │ Automated, repeatable │
│ One machine at a time │ │ Fleet-wide automation │
└───────────────────────┘ └──────────────────────────┘
Context & Evolution
The Linux CLI follows Unix design principles: small tools, composable pipelines, and plain text interfaces. These ideas outlast GUI trends and have proven resilient in cloud, DevOps, and automation workflows.
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Shell Language & Expansion | How parsing, quoting, and expansions determine command behavior and safety. |
| Filesystem & Permissions | How paths, inodes, and permissions control access and portability. |
| Streams & Pipelines | How data flows through stdin/stdout/stderr and how redirection order matters. |
| Text Processing & Regex | How to extract signal from logs with grep, awk, sed, sort, uniq. |
| Processes & System Introspection | How to interpret /proc, load average, CPU and memory metrics. |
| Automation & Scheduling | How cron works, environment differences, idempotent scripts. |
| Networking & HTTP | How DNS and HTTP status codes relate to curl exit codes. |
| Archiving & File Discovery | How to safely find files, archive, sync, and organize with metadata. |
Project-to-Concept Map
| Project | What It Builds | Primer Chapters It Uses |
|---|---|---|
| Project 1: Website Status Checker | Network monitor script | Ch. 1, 3, 6, 7 |
| Project 2: Log File Analyzer | Ranked log insights pipeline | Ch. 1, 3, 4 |
| Project 3: Automated Backup Script | Scheduled backups with retention | Ch. 2, 3, 6, 8 |
| Project 4: System Resource Monitor | CSV metric logger | Ch. 3, 4, 5 |
| Project 5: Find and Organize Media | Metadata-driven organizer | Ch. 2, 3, 8 |
Deep Dive Reading by Concept
Shell and CLI Foundations
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Shell parsing and quoting | The Linux Command Line by William E. Shotts - Ch. 7 | Prevents accidental word splitting and unsafe scripts. |
| Shell scripting basics | Shell Programming in Unix, Linux and OS X by Kochan & Wood - Ch. 1-5 | Foundation for robust CLI automation. |
| Effective shell habits | Effective Shell by Dave Kerr - Ch. 1-6 | Real-world shell best practices. |
Filesystems and Permissions
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Filesystem layout | How Linux Works by Brian Ward - Ch. 2 | Understand what to back up and where data lives. |
| Permissions and ownership | The Linux Command Line - Ch. 9 | Safe access and security boundaries. |
Text Processing and Pipelines
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Redirection and pipelines | The Linux Command Line - Ch. 6 | Core dataflow model for CLI. |
| Text processing tools | The Linux Command Line - Ch. 19-20 | grep, sed, awk, and regex mastery. |
| Practical pipelines | Wicked Cool Shell Scripts by Dave Taylor - Ch. 1-3 | Real CLI workflows and patterns. |
Automation and Monitoring
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Scheduling and cron | How Linux Works - Ch. 7 | Reliable time-based automation. |
| Process monitoring | The Linux Command Line - Ch. 10 | Understanding system state. |
Quick Start
Your first 48 hours:
Day 1 (4 hours):
- Read Chapter 1 and Chapter 3 (Shell language + streams).
- Skim Chapter 7 (Networking basics).
- Start Project 1 and get it to print a status for one URL.
- Do not worry about formatting or summaries yet.
Day 2 (4 hours):
- Add error categories to Project 1 (DNS vs HTTP vs timeout).
- Add a summary line and non-zero exit code on failure.
- Read the “Core Question” section for Project 1 and update your script.
End of Weekend: You can explain how a shell command becomes a process, how exit codes work, and how to detect HTTP vs DNS failures.
Recommended Learning Paths
Path 1: The Newcomer (Recommended Start)
Best for: learners new to Linux and CLI tools
- Project 1: Website Status Checker - small scope, immediate feedback
- Project 2: Log File Analyzer - classic pipeline practice
- Project 3: Automated Backup Script - automation and scheduling
- Project 4 and 5 as advanced practice
Path 2: The Data Analyst
Best for: people who want to process logs and data
- Project 2: Log File Analyzer
- Project 4: System Resource Monitor
- Project 1: Website Status Checker
- Projects 3 and 5 later
Path 3: The Ops Engineer
Best for: automation and reliability goals
- Project 3: Automated Backup Script
- Project 1: Website Status Checker
- Project 4: System Resource Monitor
- Project 2 and 5 later
Path 4: The Completionist
Best for: full CLI mastery
- Week 1: Project 1 + Project 2
- Week 2-3: Project 3 + Project 4
- Week 4: Project 5 and polish all scripts
Success Metrics
You have mastered this guide when you can:
- Explain shell expansion order and demonstrate safe quoting
- Build a multi-stage pipeline and verify each stage independently
- Diagnose a failing cron job by reading logs and checking PATH
- Interpret /proc metrics and explain load average vs CPU percent
- Write scripts that produce correct, repeatable outputs and exit codes
- Provide a small portfolio: 5 scripts with documentation and sample output
Appendix: CLI Safety & Debugging Checklist
- Always quote variables: "$var"
- Use set -o pipefail for critical pipelines
- Log stdout and stderr for scheduled tasks
- Use -print0 and xargs -0 for safe filenames
- Prefer printf over echo for predictable output
- Use absolute paths in cron scripts
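These habits can be baked into a reusable script header; a minimal sketch (the exact options are a common convention, and the log path is a placeholder):

#!/usr/bin/env bash
# Fail fast on errors, unset variables, and failed pipeline stages
set -euo pipefail
# Cron-safe PATH instead of relying on the interactive environment
PATH=/usr/local/bin:/usr/bin:/bin
# Capture stdout and stderr so scheduled runs leave evidence
exec >>"/var/tmp/myscript.log" 2>&1
printf '[%s] starting\n' "$(date -Is)"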
Appendix: Quick Reference
Common directories (FHS):
- /etc configuration
- /var variable data (logs, spools)
- /usr userland programs
- /home user directories
Permission bits (octal):
- 4 = read, 2 = write, 1 = execute
- 7 = rwx, 6 = rw-, 5 = r-x
Crontab format:
# m h dom mon dow command
*/5 * * * * /path/to/script.sh
Project Overview Table
| Project | Difficulty | Outcome | Key Tools |
|---|---|---|---|
| Project 1: Website Status Checker | Beginner | HTTP status report + summary | curl, bash |
| Project 2: Log File Analyzer | Intermediate | Top IPs/URLs report | awk, sort, uniq |
| Project 3: Automated Backup Script | Intermediate | Timestamped backups + cron | tar, cron, find |
| Project 4: System Resource Monitor | Intermediate | CSV metrics log | ps, free, df |
| Project 5: Find and Organize Media | Advanced | Media organized by date | find, exiftool |
Project List
Project 1: Website Status Checker
- Main Programming Language: Shell (Bash)
- Alternative Programming Languages: N/A
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Shell Scripting / Networking
- Software or Tool: curl, grep, awk
- Main Book: “The Linux Command Line, 2nd Edition” by William E. Shotts
What you’ll build: A script that reads a file of URLs, checks each URL, and prints a clean report with status code, timing, and error category. The script returns a non-zero exit code if any URL is unhealthy.
Why it teaches the fundamentals: It combines arguments, loops, exit codes, quoting, and HTTP basics in a small but complete tool.
Core challenges you’ll face:
- Reading a file line-by-line -> while read -r loops
- Detecting HTTP status -> curl -o /dev/null -w '%{http_code}'
- Handling timeouts -> --connect-timeout and --max-time
- Handling empty lines/comments -> grep -v '^#'
Real World Outcome
What you will see:
- A clean table of URL results
- Separate error categories: DNS, timeout, HTTP error
- A summary count for quick overview
- A non-zero exit code when any failure occurs
Command Line Outcome Example:
$ cat urls.txt
https://example.com
https://example.com/does-not-exist
https://this-domain-should-not-exist.tld
$ ./check_sites.sh urls.txt --connect-timeout 5 --max-time 10
[OK] 200 https://example.com (0.112s)
[FAIL] 404 https://example.com/does-not-exist (0.087s)
[ERR] DNS https://this-domain-should-not-exist.tld
Summary: ok=1 fail=1 err=1
$ echo $?
1
The Core Question You’re Answering
“How do I reliably detect when a website is down versus when my network is failing?”
This project forces you to distinguish server errors from client/network errors and build error handling into automation.
Concepts You Must Understand First
- Shell expansion and quoting
- What expands first in a command line?
- Why are URLs with ? or & dangerous without quotes?
- Book Reference: The Linux Command Line, Ch. 7
- Exit codes
- What does $? represent?
- Why is zero success and non-zero failure?
- Book Reference: The Linux Command Line, Ch. 10
- HTTP status codes
- What does 200, 301, 404, 500 mean?
- What is the difference between a redirect and an error?
- Book Reference: The Linux Command Line, Ch. 16
- DNS basics
- What happens before an HTTP request is sent?
- Why does a DNS error mean no HTTP status at all?
- Book Reference: Computer Networks, Ch. 1-2
Questions to Guide Your Design
- Input handling
- How will you ignore empty lines or comments?
- What should happen if the file is missing?
- Error handling
- How will you detect DNS errors vs HTTP errors?
- Should you retry on timeouts? How many times?
- Output formatting
- Will your output be parseable by another script?
- Will you include timing information?
- Exit status
- What exit code should the script return if any URL fails?
Thinking Exercise
The Failure Matrix
Imagine four URLs:
- A valid URL that returns 200
- A valid URL that returns 404
- A valid domain that times out
- An invalid domain that fails DNS
Sketch a table of the output you want for each. Decide how to categorize them.
The Interview Questions They’ll Ask
- “Why do you need to quote URLs in a shell script?”
- “How do you detect if curl failed versus the server returning 404?”
- “How would you add retries without spamming the server?”
- “What exit code should your script return if any URL fails?”
- “How would you schedule this to run every 5 minutes?”
Hints in Layers
Hint 1: Starting Point
Use a while read -r loop and skip empty lines.
while read -r url; do
  [ -z "$url" ] && continue
  case "$url" in \#*) continue ;; esac
  # ...
done < "$1"
Hint 2: Status Code Use curl to print only status code:
code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
Hint 3: Error Detection Check curl exit code to distinguish network errors:
curl -s --connect-timeout 5 --max-time 10 -o /dev/null -w "%{http_code}" "$url"
rc=$?
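Building on the exit code, a hedged sketch that maps the most common curl codes into this project's categories (per the curl man page, 6 = could not resolve host, 7 = failed to connect, 28 = timeout; the labels themselves are just an example):

# Classify the result using the HTTP code on success and the curl exit code on failure
case "$rc" in
  0)  status="OK" ;;        # transport worked; inspect $code for 4xx/5xx responses
  6)  status="DNS" ;;       # could not resolve host
  7)  status="CONN" ;;      # failed to connect to host
  28) status="TIMEOUT" ;;   # --connect-timeout or --max-time exceeded
  *)  status="ERR" ;;       # any other curl failure
esac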
Hint 4: Output Formatting
Use printf for aligned columns:
printf "[%-4s] %3s %-40s (%ss)\n" "$status" "$code" "$url" "$time"
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Shell basics | The Linux Command Line | Ch. 1-7 |
| Exit status | The Linux Command Line | Ch. 10 |
| Networking basics | Computer Networks | Ch. 1-2 |
| Practical scripts | Wicked Cool Shell Scripts | Ch. 1-2 |
Common Pitfalls & Debugging
Problem 1: “All URLs show 000”
- Why: curl could not connect or resolve DNS
- Fix: add --connect-timeout and print curl exit codes
- Quick test: curl -I https://example.com
Problem 2: “URLs with ? or & break the script”
- Why: unquoted variables are expanded by the shell
- Fix: always quote variables
- Quick test: echo "$url"
Problem 3: “Script works manually but fails in cron”
- Why: PATH is minimal in cron
- Fix: use full path to curl or set PATH in script
- Quick test: which curl
Problem 4: “Timeouts hang forever”
- Why: no max time set
- Fix: add --connect-timeout and --max-time
- Verification: run against a blackholed IP
Definition of Done
- Script accepts a file of URLs as input
- Handles empty lines and comments
- Distinguishes HTTP errors from network errors
- Outputs a clear summary count
- Exits non-zero if any URL fails
- Logs or prints timing information per URL
Project 2: Log File Analyzer
- Main Programming Language: Shell (Bash)
- Alternative Programming Languages: N/A
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Text Processing
- Software or Tool: awk, sort, uniq, grep
- Main Book: “Wicked Cool Shell Scripts, 2nd Edition” by Dave Taylor
What you’ll build: A one-line pipeline or short script that extracts the top 10 most common IPs and URLs from an access log, with optional filters by status code or date.
Why it teaches the fundamentals: This is the canonical UNIX pipeline. It forces you to think in stages and validate each stage of data transformation.
Core challenges you’ll face:
- Parsing log formats -> understanding fields
- Filtering with regex -> grep or awk patterns
- Aggregation -> sort + uniq -c
- Handling malformed lines -> awk 'NF >= 9'
Real World Outcome
$ ./analyze_log.sh access.log --status 404
Top 10 IPs for status 404:
52 203.0.113.9
41 198.51.100.12
30 192.0.2.55
Top 10 URLs for status 404:
18 /robots.txt
12 /favicon.ico
9 /admin
The Core Question You’re Answering
“How do I convert raw log noise into a ranked summary that reveals real behavior?”
Concepts You Must Understand First
- Pipelines and redirection
- Why does each tool expect stdin/stdout?
- Book Reference: The Linux Command Line, Ch. 6
- Regular expressions
- What does grep actually match?
- Book Reference: The Linux Command Line, Ch. 19
- Field extraction with awk
- How do fields map to columns?
- Book Reference: The Linux Command Line, Ch. 20
- Sorting and aggregation
- Why must you sort before uniq -c?
- Book Reference: Wicked Cool Shell Scripts, Ch. 1
Questions to Guide Your Design
- What log format are you assuming (Apache/Nginx combined)?
- Which field has the client IP? Which has the status code?
- How do you handle malformed lines?
- Should you allow filtering by status, date, or method?
Thinking Exercise
Given this line:
203.0.113.9 - - [21/Dec/2025:15:00:00 +0000] "GET /index.html HTTP/1.1" 404 1234
Which awk fields correspond to IP and status? Why?
The Interview Questions They’ll Ask
- “Why must sort come before uniq -c?”
- “How would you filter by status code?”
- “How would you make the pipeline faster for huge logs?”
- “How do you guard against malformed lines?”
- “How would you extract only the URL path?”
Hints in Layers
Hint 1: Start Simple Extract the first column and count.
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -n 10
Hint 2: Filter by Status In combined logs, status is field 9:
awk '$9 == 404 {print $1}' access.log | sort | uniq -c | sort -nr | head -n 10
Hint 3: Handle Malformed Lines Require at least 9 fields:
awk 'NF >= 9 && $9 == 404 {print $1}' access.log
Hint 4: Extract URLs The URL is field 7 in common log format:
awk 'NF >= 9 {print $7}' access.log | sort | uniq -c | sort -nr | head -n 10
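If you want to assemble the hints into a script, one possible shape is sketched below; the positional status argument is a simplification of the --status flag shown in the example output:

#!/usr/bin/env bash
# Usage: ./analyze_log.sh access.log [status]
log="$1"
status="${2:-}"
echo "Top 10 IPs${status:+ for status $status}:"
awk -v s="$status" 'NF >= 9 && (s == "" || $9 == s) {print $1}' "$log" \
  | sort | uniq -c | sort -nr | head -n 10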
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Redirection and pipes | The Linux Command Line | Ch. 6 |
| Regex | The Linux Command Line | Ch. 19 |
| Text processing | The Linux Command Line | Ch. 20 |
| Practical pipelines | Wicked Cool Shell Scripts | Ch. 1-2 |
Common Pitfalls & Debugging
Problem 1: “uniq gives wrong counts”
- Why: input is not sorted
- Fix: add sort before uniq -c
- Quick test: sort | uniq -c
Problem 2: “status field is wrong”
- Why: log format differs from expected
- Fix: print field counts with awk '{print NF, $0}'
Problem 3: “awk prints blank lines”
- Why: malformed log entries
- Fix: add an NF >= 9 guard
Definition of Done
- Script outputs top 10 IPs
- Optional status filter works
- Handles malformed lines without crashing
- Pipeline stages can be verified independently
- Can output top URLs as well as top IPs
Project 3: Automated Backup Script
- Main Programming Language: Shell (Bash)
- Alternative Programming Languages: N/A
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Automation / File Management
- Software or Tool: tar, gzip, date, cron, rsync (optional)
- Main Book: “How Linux Works, 3rd Edition” by Brian Ward
What you’ll build: A script that creates timestamped archives, stores them in a backup directory, and optionally syncs to a remote host. It will be scheduled nightly with cron.
Why it teaches the fundamentals: It combines filesystem knowledge, scripting, archiving, scheduling, and retention policies in a real automation task.
Core challenges you’ll face:
- Timestamped naming -> date command substitution
- Archiving -> tar -czf
- Retention -> find -mtime +N -delete
- Scheduling -> cron format and PATH issues
Real World Outcome
$ ./backup.sh ~/Projects /mnt/backups/projects
[2025-12-31T01:00:00] Backing up /home/user/Projects
Archive: /mnt/backups/projects/backup-2025-12-31_01-00-00.tar.gz
OK
$ ls /mnt/backups/projects | tail -n 3
backup-2025-12-29_01-00-00.tar.gz
backup-2025-12-30_01-00-00.tar.gz
backup-2025-12-31_01-00-00.tar.gz
$ ./backup.sh --dry-run ~/Projects /mnt/backups/projects
DRY RUN: would create /mnt/backups/projects/backup-2025-12-31_01-00-00.tar.gz
DRY RUN: would delete backups older than 7 days
The Core Question You’re Answering
“How do I create repeatable backups that I can schedule and trust?”
Concepts You Must Understand First
- Filesystem hierarchy
- Which directories matter for backups?
- Book Reference: How Linux Works, Ch. 2
- Command substitution and variables
- How does $(date ...) embed timestamps?
- Book Reference: The Linux Command Line, Ch. 11
- Cron scheduling
- Why does cron need absolute paths?
- Book Reference: How Linux Works, Ch. 7
- Archiving
- What does tar preserve and why is that important?
- Book Reference: The Linux Command Line, Ch. 18
Questions to Guide Your Design
- How will you handle missing source or destination directories?
- How will you avoid overwriting backups?
- Should you add retention (delete old backups)?
- Do you need remote sync? If yes, where and how?
Thinking Exercise
You have 30 GB of data and only 50 GB of backup storage. How do you design retention? (Hint: keep last 7, delete older.)
The Interview Questions They’ll Ask
- “Why do you use tar instead of copying files directly?”
- “How do you prevent overwriting backups?”
- “What happens if the backup destination is full?”
- “Why might cron fail even when the script works manually?”
- “How would you add offsite backup using rsync?”
Hints in Layers
Hint 1: Start Simple Create a timestamped tarball.
ts=$(date +"%Y-%m-%d_%H-%M-%S")
file="backup-$ts.tar.gz"
tar -czf "$dest/$file" "$src"
Hint 2: Add Safety Check that source exists.
[ -d "$src" ] || { echo "Missing source" >&2; exit 1; }
Hint 3: Add Retention Delete backups older than 7 days.
find "$dest" -name "backup-*.tar.gz" -mtime +7 -delete
Hint 4: Add Logging Append to a log file with timestamps.
printf "[%s] Backup complete\n" "$(date -Is)" >> "$dest/backup.log"
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Filesystem hierarchy | How Linux Works | Ch. 2 |
| Archiving and backup | The Linux Command Line | Ch. 18 |
| Scheduling | How Linux Works | Ch. 7 |
| Scripts in practice | Effective Shell | Ch. 7-9 |
Common Pitfalls & Debugging
Problem 1: “cron job runs but no backup appears”
- Why: cron uses a minimal PATH
- Fix: use absolute paths in the script
- Quick test: which tar; which date
Problem 2: “backup file is empty”
- Why: source path is wrong
- Fix: log source path before running tar
Problem 3: “disk fills up”
- Why: no retention policy
- Fix: delete old backups with find -mtime +N -delete
Problem 4: “tar overwrote previous backup”
- Why: filename not unique
- Fix: include timestamp in filename
Definition of Done
- Backup script accepts source and destination arguments
- Creates timestamped archive
- Fails with clear errors if source/dest is invalid
- Cron job runs at scheduled time
- Old backups are cleaned up
- Optional dry-run mode works
Project 4: System Resource Monitor
- Main Programming Language: Shell (Bash)
- Alternative Programming Languages: N/A
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Process Management / Monitoring
- Software or Tool: ps, free, df, awk, sleep
- Main Book: “The Linux Command Line, 2nd Edition” by William E. Shotts
What you’ll build: A script that collects CPU, memory, and disk usage every N seconds and writes CSV lines with timestamps.
Why it teaches the fundamentals: It forces you to parse system metrics, build reliable logging, and understand what the numbers actually mean.
Core challenges you’ll face:
- Parsing /proc or command output -> awk extraction
- Sampling intervals -> sleep loops
- CSV formatting -> consistent headers and output
Real World Outcome
$ ./monitor.sh 5
Monitoring system... press Ctrl+C to stop
$ head -n 5 resource_log.csv
timestamp,mem_used_percent,cpu_load_1m,disk_used_percent
2025-12-31T12:00:00,35,0.12,45
2025-12-31T12:00:05,36,0.18,45
2025-12-31T12:00:10,36,0.22,45
The Core Question You’re Answering
“How can I continuously measure system health using standard CLI tools?”
Concepts You Must Understand First
- Process monitoring
- What is load vs CPU usage?
- Book Reference: The Linux Command Line, Ch. 10
- Text extraction
- How do you extract a number from command output?
- Book Reference: The Linux Command Line, Ch. 20
- Redirection
- How do you append lines to a file?
- Book Reference: The Linux Command Line, Ch. 6
- /proc metrics
- What does /proc/loadavg mean?
- Book Reference: How Linux Works, Ch. 4
Questions to Guide Your Design
- How often should metrics be sampled?
- What CSV format will you use?
- How do you handle a missing tool (free)?
- Should load average or CPU percent be used?
Thinking Exercise
If CPU spikes for 1 second, will a 10-second sampling interval detect it? What does that imply?
The Interview Questions They’ll Ask
- “Where do free and df get their data?”
- “Why might CPU usage reported by top differ from load average?”
- “How would you make this script resilient to temporary command failures?”
- “Why do you need a CSV header?”
Hints in Layers
Hint 1: Extract memory usage
mem=$(free | awk '/Mem/ {printf "%.0f", $3/$2*100}')
Hint 2: Load average from /proc
load=$(awk '{print $1}' /proc/loadavg)
Hint 3: Disk usage
disk=$(df / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
Hint 4: Append to CSV
printf "%s,%s,%s,%s\n" "$ts" "$mem" "$load" "$disk" >> resource_log.csv
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Processes | The Linux Command Line | Ch. 10 |
| Text processing | The Linux Command Line | Ch. 20 |
| System basics | How Linux Works | Ch. 4 |
Common Pitfalls & Debugging
Problem 1: “CPU percentage is always 0”
- Why: parsing wrong field or using the wrong metric
- Fix: inspect top -bn1 output
- Quick test: top -bn1 | head -n 5
Problem 2: “CSV has broken lines”
- Why: newline embedded in variables
- Fix: ensure values are trimmed with awk/printf
Problem 3: “Header appears multiple times”
- Why: script writes header every loop
- Fix: check if file exists before writing header
Definition of Done
- Script logs timestamped CSV rows
- Sampling interval configurable
- CSV header appears exactly once
- Works for at least 1 hour without errors
- Values are numeric and consistently formatted
Project 5: Find and Organize Photo/Video Files
- Main Programming Language: Shell (Bash)
- Alternative Programming Languages: N/A
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: File Management / Automation
- Software or Tool: find, xargs, exiftool
- Main Book: “Effective Shell” by Dave Kerr
What you’ll build: A script that finds media files, extracts creation date metadata, and moves them into YYYY/MM/ directories. It must handle spaces and missing metadata gracefully.
Why it teaches the fundamentals: It combines safe file discovery, metadata extraction, and careful file operations in a real-world workflow.
Core challenges you’ll face:
- Safe filename handling -> -print0 and xargs -0
- Metadata extraction -> exiftool -CreateDate
- Conflict handling -> avoid overwriting files
Real World Outcome
$ ./organize_media.sh ~/Pictures ~/Pictures/organized
Moved: IMG_1234.jpg -> 2025/12/IMG_1234.jpg
Moved: trip.mov -> 2024/06/trip.mov
Skipped: oldscan.png (no metadata)
$ tree ~/Pictures/organized | head
organized
├── 2024
│ └── 06
│ └── trip.mov
└── 2025
└── 12
└── IMG_1234.jpg
The Core Question You’re Answering
“How can I safely refactor a chaotic directory into a clean structure without losing data?”
Concepts You Must Understand First
- Filesystem metadata
- What is creation time vs modification time?
- Book Reference: How Linux Works, Ch. 2
- find and xargs
- How do you safely handle filenames with spaces?
- Book Reference: The Linux Command Line, Ch. 17
- Shell scripting safety
- Why use -print0 and -0?
- Book Reference: The Linux Command Line, Ch. 24
- Exif metadata
- What is CreateDate and when is it missing?
- Book Reference: Effective Shell, Ch. 8
Questions to Guide Your Design
- What file extensions will you include?
- What should happen when metadata is missing?
- How will you avoid overwriting existing files?
- Should the script support dry-run mode?
Thinking Exercise
If a file has metadata for 2012 but its filename suggests 2019, which do you trust? Why?
The Interview Questions They’ll Ask
- “Why use -print0 with find and xargs?”
- “How do you avoid losing files when moving them?”
- “How would you make this tool support a dry-run mode?”
- “What is your fallback when EXIF metadata is missing?”
Hints in Layers
Hint 1: Find files safely
find "$src" -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.mov" -o -iname "*.mp4" \) -print0
Hint 2: Extract metadata
date=$(exiftool -d "%Y/%m" -CreateDate -s -s -s "$file")
Hint 3: Move safely
mkdir -p "$dest/$date"
cp -n "$file" "$dest/$date/"
Hint 4: Handle missing metadata
[ -z "$date" ] && date=$(date -r "$file" +"%Y/%m")
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Finding files | The Linux Command Line | Ch. 17 |
| Shell scripting | The Linux Command Line | Ch. 24 |
| Filesystem basics | How Linux Works | Ch. 2 |
| Robust scripts | Effective Shell | Ch. 6-9 |
Common Pitfalls & Debugging
Problem 1: “Files with spaces break”
- Why: using plain xargs without null delimiters
- Fix: find ... -print0 | xargs -0
Problem 2: “Metadata missing”
- Why: file lacks EXIF data
- Fix: fallback to stat mtime
Problem 3: “Files overwritten”
- Why: duplicate filenames in same month
- Fix: use mv -n or add a suffix if the name exists
Problem 4: “Wrong dates for videos”
- Why: CreateDate tag missing or not populated
- Fix: try other tags or fallback to filesystem time
Definition of Done
- Handles spaces and weird characters
- Creates YYYY/MM directories
- Skips or logs files without metadata
- Never overwrites existing files
- Supports dry-run mode
Summary of Projects
| Project | Key Commands | Difficulty |
|---|---|---|
| Project 1: Website Status Checker | curl, while read | Beginner |
| Project 2: Log File Analyzer | awk, sort, uniq | Intermediate |
| Project 3: Automated Backup Script | tar, cron, date | Intermediate |
| Project 4: System Resource Monitor | ps, free, df | Intermediate |
| Project 5: Find and Organize Media | find, xargs, exiftool | Advanced |