A chat mode for GitHub Copilot CLI
Your agent should prove its work. This one does.
Anvil verifies every change before you see it - builds, tests, lints, then has other AI models try to break it. You get proof, not promises.
copilot plugin install burkeholland/anvil
Includes the Context7 MCP server for up-to-date library docs.
Recommended: Add the frontend-design skill for beautiful UI generation: npx skills add anthropics/skills/frontend-design
The problem with AI agents today is they lie to you.
They write code and say "Done!" They claim the build passes without running it. They hallucinate test results. They never push back on a bad idea. And when something breaks in production, it's your fault - not theirs.
We got tired of it.
Adversarial Review
Every change gets reviewed by up to 3 different AI models - GPT, Gemini, Claude - each trying to find bugs, security holes, and logic errors. They disagree with each other. That's the point.
SQL Verification Ledger
Every build, test, and lint result is INSERTed into a SQLite table. The evidence bundle at the end is a SELECT, not prose. If the INSERT didn't happen, the verification didn't happen.
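To make the idea concrete, here is a minimal sketch of such a ledger using Python's built-in `sqlite3` module. The table name, column names, the `record`, `evidence`, and `missing_checks` helpers, and the `npm test` command are all assumptions for illustration; Anvil's actual schema is not shown on this page.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions,
# not Anvil's actual ledger.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger (
        id         INTEGER PRIMARY KEY,
        phase      TEXT NOT NULL,     -- 'baseline', 'after', or 'review'
        check_name TEXT NOT NULL,     -- e.g. 'build', 'tests', 'lint'
        command    TEXT,
        exit_code  INTEGER NOT NULL,
        detail     TEXT
    )
    """
)


def record(phase, check_name, command, exit_code, detail):
    """Insert one verification result into the ledger."""
    conn.execute(
        "INSERT INTO ledger (phase, check_name, command, exit_code, detail) "
        "VALUES (?, ?, ?, ?, ?)",
        (phase, check_name, command, exit_code, detail),
    )
    conn.commit()


def evidence():
    """The evidence is a SELECT over recorded rows, not free-form prose."""
    return conn.execute(
        "SELECT phase, check_name, exit_code, detail FROM ledger ORDER BY id"
    ).fetchall()


def missing_checks(required):
    """Return required (phase, check) pairs that have no row in the ledger."""
    recorded = {(phase, check) for phase, check, _, _ in evidence()}
    return [item for item in required if item not in recorded]


record("baseline", "build", "npm run build", 0, "ok")
record("after", "build", "npm run build", 0, "ok")
record("after", "tests", "npm test", 0, "249 passed")  # 'npm test' is assumed

# A check with no row counts as a check that never ran.
print(missing_checks([("after", "build"), ("after", "lint")]))
```

The point of the design is that a missing row is detectable: the report can only contain what was actually inserted.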
Baseline Snapshots
Before touching anything, Anvil records your project's state - build, diagnostics, tests. After changes, it compares. Regressions are caught before you see the code, not after you ship it.
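The before/after comparison can be sketched in a few lines of Python. This assumes each snapshot is a simple mapping from check name to exit code; the `find_regressions` helper and that data shape are illustrative, not Anvil's actual implementation.

```python
def find_regressions(baseline, after):
    """Return checks that passed in the baseline but fail or are missing afterwards."""
    regressions = []
    for check, base_code in baseline.items():
        if base_code != 0:
            continue  # already failing before the change, so not a new regression
        if after.get(check) != 0:
            regressions.append(check)
    return regressions


# Hypothetical snapshots: check name -> exit code (0 means success).
baseline = {"build": 0, "tests": 0, "diagnostics": 0}
after = {"build": 0, "tests": 1, "diagnostics": 0}

print(find_regressions(baseline, after))  # ['tests']
```

Checks that were already failing in the baseline are skipped on purpose, so a change is only blamed for what it newly broke.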
Pushback
Anvil is a senior engineer, not an order taker. Bad idea? Tech debt? Dangerous edge case? It tells you before writing a single line. You can override it. But you'll know.
Session Memory
Broke a file last session? Anvil remembers. Learned your build command? Won't ask again. Context that actually persists across conversations.
Git Autopilot
Before writing code, Anvil checks your git state - stashes uncommitted work, creates a branch off main so you're never editing trunk directly. After verification passes, it commits with a clear message and gives you a one-line rollback command.
The Evidence Bundle.
Every task ends with this - a structured report generated from SQL, not written by the model. Baseline vs. after. Command outputs. Exit codes. Reviewer verdicts. Confidence level. Rollback command.
| Phase | Check | Result | Detail |
|---|---|---|---|
| baseline | build | ✓ | npm run build |
| baseline | tests | ✓ | 247 passed |
| after | build | ✓ | npm run build |
| after | tests | ✓ | 249 passed (+2 new) |
| after | diagnostics | ✓ | 0 errors, 0 warnings |
| review | gpt-5.3-codex | ✓ | No issues |
| review | gemini-3-pro | ✓ | No issues |
| review | claude-opus | ✓ | No issues |
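
A report like this can be produced mechanically from recorded rows rather than written by the model. The sketch below assumes each row is a `(phase, check, exit_code, detail)` tuple and that an exit code of 0 means success; the `render_bundle` helper and the row shape are hypothetical.

```python
def render_bundle(rows):
    """Render (phase, check, exit_code, detail) rows as a Markdown table."""
    lines = ["| Phase | Check | Result | Detail |", "|---|---|---|---|"]
    for phase, check, exit_code, detail in rows:
        mark = "✓" if exit_code == 0 else "✗"  # assumed: exit code 0 means pass
        lines.append(f"| {phase} | {check} | {mark} | {detail} |")
    return "\n".join(lines)


# Hypothetical rows, as they might come back from a SELECT on the ledger.
rows = [
    ("baseline", "tests", 0, "247 passed"),
    ("after", "tests", 0, "249 passed (+2 new)"),
]
print(render_bundle(rows))
```

Because the result column is derived from the exit code, the report cannot claim a pass that the recorded data does not support.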
Every task runs through the same loop.
- Understand & Recall: Parses your request. Checks session history for past work on these files, including what broke.
- Survey & Pushback: Searches for existing patterns to reuse. Tells you if the approach is wrong before writing code.
- Baseline: Snapshots build, tests, diagnostics into the SQL ledger. This is the "before" photo.
- Implement: Makes changes following your codebase patterns. Extends what's there instead of inventing new abstractions.
- The Forge: Build, type check, lint, test - all recorded. Then up to 3 AI models attack the code. Issues get fixed before you see anything.
- Evidence & Commit: Structured report from SQL. Auto-commit with rollback command. Done.
One command.
copilot plugin install burkeholland/anvil
Then run copilot and select the Anvil agent.
Recommended: npx skills add anthropics/skills/frontend-design