What We Found
The Silent Trap: exit(1) Is Not exit(2)
We discovered that 3 of our PreToolUse hooks were using sys.exit(1) to signal violations. They printed error messages and looked like they were working, but exit(1) is a non-blocking hook error: the tool call proceeds, and stderr is shown only in verbose mode. Only exit(2) actually blocks.
This matters because exit(1) is the conventional Unix exit code for a generic error, and developers writing enforcement hooks will naturally reach for it. The hook still runs and produces output, so nothing signals a failure. The enforcement gap is completely silent.
The hooks were behaving exactly as written: running, printing error output, exiting non-zero. Everything looked correct. The only way to discover the problem was to observe that the tool calls the hooks were supposed to block were still proceeding. Three enforcement hooks in a production pipeline — a Class A gate, a README-before-code check, and a test-before-stack check — had been silently non-enforcing from the moment they were deployed.
```python
# This looks like enforcement — it is NOT
sys.exit(1)  # Non-blocking. Tool call proceeds. Violation slips through.

# This is actual enforcement
sys.exit(2)  # Blocking. Tool call prevented. Stderr fed to Claude as context.
```
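To make the exit-code semantics concrete, here is a minimal sketch of what a blocking PreToolUse hook can look like. The rule, the `kb/class_a/` path, and the event field names used in the demonstration are illustrative assumptions, not our production code; the exit-code behavior is the documented one.

```python
import json
import sys

def decide(event: dict) -> tuple[int, str]:
    """Map a PreToolUse event to (exit_code, stderr_message).
    Exit code 2 blocks the tool call; 0 allows it; 1 would NOT block."""
    tool = event.get("tool_name", "")
    path = event.get("tool_input", {}).get("file_path", "")
    # Hypothetical rule: direct writes into the Class A store are forbidden.
    if tool in ("Write", "Edit") and path.startswith("kb/class_a/"):
        return 2, "BLOCKED: Class A writes must go through the librarian gate."
    return 0, ""

# In the real hook, the event arrives as JSON on stdin:
#   code, msg = decide(json.load(sys.stdin))
#   if msg: print(msg, file=sys.stderr)
#   sys.exit(code)

# Demonstration with a hypothetical event:
code, msg = decide({"tool_name": "Write",
                    "tool_input": {"file_path": "kb/class_a/entry-001.md"}})
print(code, msg)  # → 2 BLOCKED: Class A writes must go through the librarian gate.
```

Keeping the decision logic in a pure function like `decide` also makes the hook testable without spawning the CLI.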
If your enforcement hook uses exit(1), it is silently not enforcing. The hook runs, produces output, exits non-zero — and the tool call proceeds anyway.
The Claude Code hooks reference does document this behavior correctly. But it doesn't warn that the intuitive choice — the one any developer with Unix experience will reach for — is the wrong one for enforcement. That gap is what we filed.
What We Filed
Contribution 1: Comment on #37550 — Hook Architecture Evidence
We contributed quantified production evidence to an existing discussion about hook enforcement. The core question being discussed was whether CLAUDE.md rules or PreToolUse hooks were more reliable for enforcing pipeline governance. We had the data.
We audited which of the 7 HARD RULES in our KB pipeline were actually enforced versus behavioral-only. The result: 2 of 7 were enforced via PreToolUse hooks with exit(2). The remaining 5 of 7 were CLAUDE.md-only — and all 5 were observed to be violated at least once during autonomous operation.
The reason is structural. CLAUDE.md is context window text. It raises salience, increases compliance probability, and is the right tool for guidance. But it operates inside the model's reasoning loop. PreToolUse hooks with exit(2) operate outside the model's reasoning loop entirely — they block at the OS level before the tool call completes. The model doesn't get a chance to reason its way around them.
Contribution 2: New Issue — exit(1) Documentation Gap
We filed a new issue requesting a prominent warning that exit(1) is silently non-blocking in PreToolUse hook enforcement contexts.
Three of our own hooks were silently failing. We are not casual users — we had read the hooks documentation carefully and had built a 10-hook enforcement pipeline with explicit governance rules. We still made this mistake because the documentation doesn't warn against it clearly enough. The behavior is technically documented; the footgun is not called out.
The request: add a prominent callout near the exit code behavior documentation explicitly stating that exit(1) will not block tool calls and that exit(2) is required for enforcement. Ideally with a warning pattern example that mirrors how developers naturally write enforcement code.
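One shape such a pattern example could take is a pair of tiny helpers that make the blocking intent explicit, so exit(1) never creeps in by habit. These helper names are our own suggestion, not part of any Claude Code API:

```python
import sys

def block(reason: str) -> None:
    """Deny the pending tool call. exit(2) is the blocking exit code;
    exit(1) would print the reason and still let the tool call proceed."""
    print(reason, file=sys.stderr)  # with exit(2), stderr is fed back to Claude
    sys.exit(2)

def allow() -> None:
    """Approve the pending tool call."""
    sys.exit(0)
```

Routing every enforcement decision through `block()` and `allow()` leaves exactly one place in the codebase where the exit code can be wrong.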
Contribution 3: New Issue — 8 Undocumented CLI Flags
Running claude --help on v2.1.86 revealed flags not present in the official documentation. We catalogued them with exact flag names, observed behavior, and platform context (Windows 11, bash shell).
Undocumented flags found: --brief, --debug-file, --file, --replay-user-messages, --mcp-debug (deprecated). Undocumented subcommands: claude doctor, claude install, claude setup-token. We also documented observed --bare flag behavior that exceeds what the docs describe — in bare mode, authentication is restricted to ANTHROPIC_API_KEY only, but skills still resolve, which is a non-obvious interaction worth noting explicitly.
Each flag was documented with its exact help text as it appears in claude --help output on the tested version, making the contribution directly verifiable.
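The comparison step can be mechanized. A rough sketch of the audit, assuming long-form flags follow the usual `--kebab-case` shape; the sample help text below is an illustrative stand-in, since a real run would capture the output of `claude --help` itself:

```python
import re

def extract_flags(help_text: str) -> set[str]:
    """Pull long-form flags (--foo, --foo-bar) out of CLI help output."""
    return set(re.findall(r"--[a-z][a-z0-9-]*", help_text))

def undocumented(help_text: str, documented: set[str]) -> set[str]:
    """Flags present in the help output but absent from the docs."""
    return extract_flags(help_text) - documented

# Illustrative stand-in; a real audit would use
# subprocess.run(["claude", "--help"], capture_output=True, text=True).stdout
sample_help = """
  --brief                  ...
  --debug-file <path>      ...
  --verbose                ...
"""
documented_flags = {"--verbose"}
print(sorted(undocumented(sample_help, documented_flags)))  # → ['--brief', '--debug-file']
```

Re-running the same script after each CLI release turns a one-off finding into a regression check on the documentation.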
Why Class A Evidence Matters for Open Source Contributions
Every claim in these contributions was backed by a specific Class A knowledge base entry — executed commands with observed output, not speculation. This is what makes contributions credible.
The difference between a useful issue and a noise issue is specificity: specific version (v2.1.86), specific platform (Windows 11), specific hooks with specific exit codes at specific line numbers, specific observed behavior versus expected behavior. When you file an issue with "I discovered this while operating a 10-hook enforcement pipeline on version 2.1.86 and here's the exact line in my code where the exit code was wrong" — that's a different signal than a general report.
The Class A trust tier in the AWACS knowledge pipeline exists precisely because of this. Class A entries are validated operational truth: a command was run, output was observed, the entry records exactly what happened. You can trace any claim back to a specific command log entry with a timestamp. That audit trail is what turns operational experience into credible evidence.
None of these contributions required speculation or inference. The exit code behavior was discovered by observing that tool calls proceeded when they shouldn't have, tracing the hooks to find the sys.exit(1) calls, changing them to sys.exit(2), and confirming the blocking behavior. The CLI flags were discovered by running claude --help and comparing output to documentation. Executed commands, observed results, documented evidence.
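The tracing step generalizes to a quick audit script. A sketch, where the regex and the sample hook source are ours for illustration:

```python
import re

def find_silent_exits(source: str) -> list[int]:
    """Return 1-based line numbers where a hook calls sys.exit(1),
    the non-blocking exit code that silently lets tool calls through."""
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if re.search(r"sys\.exit\(\s*1\s*\)", line)]

sample_hook = (
    "print('VIOLATION: write preceded README', file=sys.stderr)\n"
    "sys.exit(1)  # looks right, does not block\n"
)
print(find_silent_exits(sample_hook))  # → [2]
```

Running this across a hooks directory is how a silently non-enforcing gate gets caught before an observed violation does it for you.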
The Layered Architecture That Works
CLAUDE.md isn't useless — it's the wrong tool for hard enforcement but the right tool for behavioral guidance. The pattern that emerged from building and operating this pipeline is a three-layer governance architecture.
The first layer: CLAUDE.md rules raise salience and increase compliance probability for rules that can't be mechanically enforced. The 5 pipeline rules we couldn't wrap in hooks still benefit from their presence in CLAUDE.md. They're violated less often with the rules present. But "less often" isn't "never," and for must-not-fail rules, less often is not good enough.
The second layer: PreToolUse hooks with exit(2) for must-not-fail rules. These operate outside the model's reasoning loop. The model cannot reason its way around a hook that blocks at the OS level. This is where the critical pipeline invariants live: Class A writes require a prior candidate file and a librarian decision. No hook bypass, no exception, no context where the rule is relaxed.
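A sketch of what that second-layer invariant could look like inside such a hook. The entry-id scheme and the two-set representation are hypothetical simplifications of the real pipeline state:

```python
import sys

def class_a_write_allowed(entry_id: str,
                          candidates: set[str],
                          decisions: set[str]) -> bool:
    """The invariant: a Class A write requires a prior candidate file
    and a recorded librarian decision for the same entry."""
    return entry_id in candidates and entry_id in decisions

def gate(entry_id: str, candidates: set[str], decisions: set[str]) -> None:
    """Exit 2 (block) unless the invariant holds; otherwise fall through."""
    if not class_a_write_allowed(entry_id, candidates, decisions):
        print(f"BLOCKED: Class A write '{entry_id}' lacks a candidate "
              "or librarian decision.", file=sys.stderr)
        sys.exit(2)  # outside the reasoning loop: the write never happens
```

Because the check runs in a separate process and blocks via the exit code, there is no prompt, rule phrasing, or context-window state that can relax it.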
The third layer: human-in-the-loop for high-risk operations. Some operations are consequential enough that no automated gate is sufficient on its own. The pipeline flags these explicitly and routes them for confirmation before proceeding.
The silent enforcement trap — hooks that look like they're working but aren't — collapses the second layer without any visible signal. That's the actual risk. Not that the hooks are hard to write, but that they can fail in a way that produces no error, no warning, and no indication that anything is wrong until you observe a violation that should have been blocked.