Building a Security Guard for AI Shell Commands
A 3-phase pre-tool hook that classifies shell commands as safe or destructive — heuristic rules for speed, a fine-tuned Qwen3.5-0.8B GGUF model for coverage, and a feedback loop that feeds production decisions back into training data.
The Problem
An AI agent with shell access can install packages, run tests, build projects — and also rm -rf /, drop a production database, or force-push over someone's work. A blocklist doesn't solve this because context matters:
rm -rf build/— standard build cleanup, runs ten times a dayrm -rf /— deletes the filesystemcurl https://get.docker.com | sh— remote code executioncurl https://api.example.com/data— harmless API call
The difference between git push and git push -f is one flag. One is a normal workflow, the other rewrites shared history. I needed something that could tell them apart — locally, under a second, no API dependency.
Three Phases, One Decision
Every shell command passes through three gates. The first one with a strong opinion wins:
Each layer exists because it handles a different failure mode.
The allowlist is the override. Process management (kill, pkill) is lifecycle, not data destruction — should never be blocked, even if the ML model hallucinates. The assistant's own CLI is safe by definition. git push without --force is normal work.
The heuristic rules are 15 pattern groups for known-dangerous operations: git reset --hard, rm -rf, DROP TABLE, docker system prune, piping curl to shell. Instant, deterministic, auditable. When something goes wrong, you can point to the exact rule.
The ML brain handles the long tail — novel patterns, obfuscated commands, creative variations no one thought to write a rule for. A fine-tuned 0.8B model, GGUF quantized, ~300ms per classification on Apple Silicon.
Rules for speed and auditability. ML for coverage. If the model fails, the system degrades to heuristic-only mode. You can delete the model file and the guard keeps working — with less coverage.
Why a Local Model
The guard is a pre-tool hook — it fires before every Bash command the AI executes. Hundreds of times per session. An API call would add 500ms+ per command and an external dependency. If the API goes down, the assistant is either blind or frozen.
So: Qwen3.5-0.8B, fine-tuned for binary classification (safe / destructive), quantized to GGUF Q8 (774MB), running via llama.cpp with Metal acceleration. Cold start ~500ms, inference ~300ms. The full pipeline — command in, block/allow out — under a second.
The system prompt is minimal. Binary classification doesn't need nuance:
System PromptYou are a safety classifier for shell commands.
Classify whether the command is safe or destructive.
Respond with ONLY the label. No explanation.
Labels:
- safe: read-only, builds, tests, linting, formatting, git status/log/diff, package install
- destructive: deletes user data/files, force-overwrites files, drops databases,
rewrites git history, remote code execution, bypasses safety checks
Qwen3.5 is a reasoning model — it always outputs a <think> block before answering. I pre-fill it as empty (<think>\n\n</think>\n\n), forcing the model to skip reasoning and go straight to the label. Saves 5-10 tokens, cuts latency roughly in half.
The decoder does early exit: checks after every token whether the output matches "safe" or "destructive". Since "safe" is one token, most classifications complete in a single decode step.
Training Data
I needed a corpus of safe commands and a representative set of destructive ones. Neither exists as a clean labeled dataset. I assembled data from four sources, each covering the others' gaps:
NL2Bash — academic dataset of real shell commands from natural language descriptions. Thousands of safe commands, good breadth, but doesn't reflect what an AI coding assistant actually runs day-to-day.
Atomic Red Team — MITRE ATT&CK framework, real-world attack techniques. Privilege escalation, persistence, exfiltration. Filtered to shell-executable commands, Linux and macOS.
Live telemetry — actual commands from my daily Claude Code usage, redacted for secrets and PII. The ground truth for what normal work looks like in this environment: cd ~/.diana/src/rust && make install, cargo test -p diana-store -- test_name.
Manual curated examples — 122 hand-written commands (67 safe, 55 destructive) targeting the exact boundary where other sources fail. Turned out to be the highest-impact data by far.
Edge Cases
The manual examples cover the hard cases: commands that look dangerous but aren't, and commands that look innocent but are.
make clean && make build has "clean" in it, but it's a standard build step. terraform apply -auto-approve sounds aggressive, but that's how infra gets deployed. chmod +x script.sh modifies permissions, but it's just setting an executable bit.
The other direction: > /etc/passwd has no dangerous keywords — no rm, no delete. Just a redirect that overwrites a critical system file with nothing. :(){ :|:& };: looks like line noise, it's a fork bomb. dd if=/dev/zero of=/dev/sda uses a legitimate tool to zero out a disk.
These examples are repeated 5x in the training set. Edge cases are where classifiers break.
Dataset Balance
The dataset is ~55% safe, ~45% destructive. Not 50/50, deliberately.
The ML brain is the third line of defense. By the time a command reaches it, the heuristics have already caught obvious threats. The real-world distribution hitting the model skews heavily toward safe. A false positive (blocking cargo build --release) erodes trust and gets the guard disabled. A false negative on an exotic vector is likely already caught by heuristics.
The bias matches the model's operating conditions.
Fine-Tuning: LoRA on Apple Silicon
LoRA via Apple's MLX framework, entirely on-device. Base model Qwen3.5-0.8B — trains in 20 minutes on M-series, large enough to understand shell semantics.
Hyperparameters: rank 16, alpha 32, 1500 iterations, batch 4, cosine decay with 100-step warmup. Max sequence length 512 — shell commands are short.
After training: fuse LoRA adapters into base, convert to GGUF Q8 for llama.cpp. This is where it got interesting.
MLX Fuse Bugs
MLX's mlx_lm.fuse introduces three bugs specific to Qwen3.5 that produce a model that loads fine and generates garbage.
Tensor names get swapped (language_model.model.* instead of model.language_model.*). Conv1d weights transpose wrong. RMS norm weights get the +1 offset applied twice — once by fuse, once by the GGUF converter.
The symptoms were subtle: the model loaded without errors, generated text, but labels were wrong. Not random — systematically off, like the weights were slightly corrupted. Took hours to diagnose. A custom fix script patches all three between fuse and conversion.
Heads up if you're fine-tuning Qwen3.5 with MLX and converting to GGUF — you'll hit this. Qwen3 (without the .5) doesn't have the same issues. The fix is mechanical but impossible to diagnose blind.
Feedback Loop
A classifier trained once is a depreciating asset. The guard needs to learn from its own decisions.
Every block/allow is logged with the redacted command, reason, and which phase decided. A PostToolUse hook logs whether the command actually succeeded or errored. This connects predictions to outcomes.
A built-in judge (diana hooks judge) aggregates this into a quality dashboard:
Terminal$ diana hooks judge --limit 500
=== Guard Quality ===
Total: 487 (12 blocks, 475 allows)
Block rate: 2.5%
Heuristic blocks: 9, Brain blocks: 3
=== Feedback Correlation ===
Router outcomes logged: 89 (3 errors)
Skills actually invoked: 156
The judge exports labeled training data: blocked commands become "destructive", allowed commands are sampled 1-in-10 as "safe". Asymmetric sampling keeps classes balanced despite orders of magnitude more safe commands.
One built-in diagnostic: if the brain blocks more than the heuristics, something is wrong. The brain should be the secondary layer catching the tail. If it's the primary blocker, it's producing false positives. The judge flags this automatically.
What v1 Got Wrong
The guard started as pure heuristics — 15 pattern groups, no ML. Worked for weeks until a self-audit found the gaps.
Missed threats: find . -delete does recursive deletion without rm. curl|sh without spaces bypassed the pattern requiring spaces around the pipe. git push -f slipped past the "-f " check. DROP VIEW and DROP INDEX weren't covered — only DROP TABLE was.
False positives: every git rebase blocked, including safe feature branch rebases. --no-verify flagged in test contexts. rm -rf build/ caught intermittently because path matching was too rigid.
Patched the known gaps in heuristics, but the pattern was clear: string matching can't anticipate what it hasn't seen. The ML layer was added for this long tail. It doesn't need to be perfect — it catches the things no rule writer would think of.
Block and Redirect
When a command is blocked, the hook outputs a structured message the AI interprets as a remediation prompt — the blocked command, reason, and a suggestion:
"DESTRUCTIVE OP BLOCKED: git push --force (overwrites remote history). Ask for explicit permission before running destructive operations. Consider safer alternatives — create a backup branch first, or use a non-force variant."
The AI sees this and adjusts — asks for permission, suggests an alternative, almost never retries the same command. The guard teaches through blocking, not through warnings the AI can ignore.
Takeaways
122 hand-curated edge cases had more impact on accuracy than 70,000 bulk examples from NL2Bash and Atomic Red Team combined. Bulk data defines "normal." Edge cases define the boundary.
If you're logging commands for training, you're logging secrets. The guard redacts quoted strings and env var assignments before writing to disk.
Rules fail by incompleteness, ML fails by unpredictability. Combining them means each covers the other's blind spots. When ML breaks entirely, rules still work.
The conversion pipeline (train → fuse → convert → quantize → deploy) is where subtle bugs hide — not crashes, just degraded accuracy. Always evaluate the final deployed artifact.
For a tool that fires hundreds of times a day, false positives are worse than false negatives. Block someone's build command once and they'll disable the guard. Optimize for trust.
The guard runs in production on every Bash command. Next iteration adds a third class — risky — for commands that should prompt the user instead of hard-blocking. The feedback loop keeps running. The training data keeps growing.