From Friday Hack to Deployable Agent
500+ commits later, here's what I learned.
You've built an AI agent that works. It calls APIs, follows instructions, and nails your cherry-picked example. Then someone asks a slightly different question, and it gets stuck on the same step until it gives up or hallucinates.
It started as a Friday hack: a small MCP server, a handful of tools, and that dopamine hit when the LLM successfully called an API for the first time. I showed it to colleagues. They'd nod and ask: "Can it also debug this crashing webhook?" And it would fail. Every conversation revealed another gap. The goal shifted from flashy demos to the tedious stuff that actually ships, such as configuring extensions or analyzing failing webhooks.
This post is about the gap between "toy demo" and "something that survives user testing." I've been building an agent for Rossum, but the lessons apply everywhere. They're about sub-agents, dynamic tool loading, prompt regression testing, and the dozen decisions that separate a demo from a deployable system.
If you're building agents with tool calling, Model Context Protocol (MCP) servers, or multi-model orchestration, this is the stuff I wish I'd known earlier. The biggest surprise? When I upgraded to a smarter model, my prompts got 70% shorter and the agent got noticeably more capable.
Making an agent work once is easy. Making it work reliably for the next person's question? That's the actual job.
1. Let Type Hints Define Your Tools
(PR #56) My first tool layer looked like many backend systems: handler classes inheriting from a base, scattered across 10 files, each with its own registration boilerplate. Adding a new tool meant touching multiple files and trying to remember where the logic lived.
I replaced it with FastMCP, a declarative framework where a single decorator defines the entire tool contract:
    # Before: manual Tool definition with hand-written JSON schema
    class QueuesHandler(BaseHandler):
        @classmethod
        def get_tool_definitions(cls) -> list[Tool]:
            return [
                Tool(
                    name="list_queues",
                    description="List all queues with optional filters.",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "workspace_id": {"type": ["integer", "null"]},
                            "name": {"type": ["string", "null"]},
                        },
                    },
                ),
                ...
            ]
    # After: type hints ARE the schema
    @mcp.tool(description="List all queues with optional filters.")
    async def list_queues(
        workspace_id: int | None = None,
        name: str | None = None,
    ) -> list[Queue]:
        ...
The type hints become the JSON schema the LLM sees. The description tells the agent when to use it. FastMCP turns a Python signature into an API contract automatically.
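For context, here's roughly what the full wiring of such a server can look like. This is a minimal sketch, assuming the `fastmcp` package's `FastMCP` class; the `Queue` model and the `_fetch_queues` helper are illustrative stand-ins, not the real Rossum client code.

```python
# Sketch of a complete FastMCP server (Queue and _fetch_queues are illustrative stand-ins).
from fastmcp import FastMCP
from pydantic import BaseModel

mcp = FastMCP("rossum-tools")

class Queue(BaseModel):
    id: int
    name: str
    workspace_id: int | None = None

async def _fetch_queues(**filters) -> list[dict]:
    # Stand-in for the real Rossum API call.
    return [{"id": 1, "name": "Invoices", "workspace_id": 42}]

@mcp.tool(description="List all queues with optional filters.")
async def list_queues(workspace_id: int | None = None, name: str | None = None) -> list[Queue]:
    # FastMCP derives the JSON schema the LLM sees from this signature.
    raw = await _fetch_queues(workspace_id=workspace_id, name=name)
    return [Queue(**item) for item in raw]

if __name__ == "__main__":
    mcp.run()
```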
2. From Smolagents to the Claude Agent SDK
(PR #57) My initial stack used smolagents (Hugging Face's library for building agents in a few lines of code) and LiteLLM (a unified interface for swapping between OpenAI, Anthropic, and other providers). My first working prototype took an afternoon.
But "afternoon prototype" and "production-ready" live in different zip codes. As requirements grew, I needed lower-level control:
- Native async/await for cleaner streaming to the FastAPI frontend
- Direct tool calling: smolagents defaults to "code-agent" mode, where the LLM writes Python to invoke tools. That's powerful for complex orchestration, but my use case was simpler: wrap API calls and validate inputs. When the model generates code, you're debugging Python. When it emits structured JSON that maps to function signatures, you're validating data. For straightforward API wrappers, the latter is easier to constrain and test.
- Full visibility into system prompts for debugging and iteration
I migrated to the Claude Agent SDK with direct Bedrock integration (AWS's managed Claude service). The SDK handles streaming, tool schemas, and multi-turn context natively. Along the way, I deleted 1,500+ lines of custom file and analysis tools—things like Mermaid diagram generation—that modern Claude models now handle out of the box. More capabilities, less code.
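As a rough sketch of what the new loop looks like, assuming the SDK's `query` helper and `ClaudeAgentOptions`: the option fields and the Bedrock environment toggle shown here are my assumptions about the setup, not lifted from the PR.

```python
# Rough sketch of the Claude Agent SDK loop; option fields and the Bedrock env
# toggle are assumptions -- check the SDK docs for the exact surface.
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query

async def main() -> None:
    # Bedrock routing is typically enabled via environment configuration
    # (e.g. CLAUDE_CODE_USE_BEDROCK=1), not in code.
    options = ClaudeAgentOptions(
        system_prompt="You are a Rossum support agent.",
        max_turns=10,
    )
    # query() streams messages (assistant text, tool calls, results) as they
    # arrive, which is what makes forwarding them to the frontend straightforward.
    async for message in query(prompt="Why is hook 123 failing?", options=options):
        print(message)

asyncio.run(main())
```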
3. Embed the Agent's UI Where the Work Happens
(PR #58) The agent started as a CLI script. It worked great for engineers, but for a support engineer, solution architect, or account manager, "clone the repo and configure your environment variables" isn't a feature. It's a roadblock.
So I built a dedicated chat app (using Streamlit). It lowered the barrier, but introduced new friction: context-switching. Users had to jump between tabs, re-authenticate, and manually bridge the gap between the agent and the Rossum platform.
What actually worked was embedding the chat directly into the Rossum UI as a native panel. The agent now inherits the user's context and permissions. No more switching tabs, no re-authentication. The technical implementation (a FastAPI backend) was the easy part. Proximity to the user's workflow matters more than feature completeness.
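For a sense of the backend shape, here's a bare-bones sketch. The endpoint, request fields, and `run_agent` are hypothetical stand-ins; the point is a streaming response that rides on the user's existing Rossum credentials instead of a separate login.

```python
# Hypothetical sketch of the FastAPI backend behind the embedded panel;
# run_agent() is a stand-in for the actual agent call.
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    # The embedded panel forwards the user's Rossum token, so the agent
    # inherits their permissions instead of asking them to re-authenticate.
    rossum_token: str

async def run_agent(message: str, token: str) -> AsyncIterator[str]:
    # Stand-in for the real agent loop; yields text chunks as they stream in.
    yield "Checking your queues..."

@app.post("/chat")
async def chat(req: ChatRequest) -> StreamingResponse:
    async def event_stream() -> AsyncIterator[str]:
        async for chunk in run_agent(req.message, req.rossum_token):
            yield f"data: {chunk}\n\n"  # server-sent events framing
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```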
4. Sub-Agents for Context Isolation
(PR #107, Sub-Agents) Early on, I treated the agent like a single specialist locked in a room with every tool I could provide. I assumed that because Rossum is a specialized domain, one well-prompted agent could handle the entire API surface.
The problem was context pollution. Rossum workflows use large schemas with tens to hundreds of fields, which quickly overwhelm a context window.
Complex tasks require messy iteration: retrying a failing webhook, patching a schema multiple times, digging through documentation. In a monolithic architecture, every failed attempt and verbose API error accumulates in the conversation. The noise drowns out the signal.
The solution: delegation with isolation.
Sub-agents are separate LLM calls with their own conversation history and a restricted toolset. They iterate, fail, and retry in a sandbox, then return only the result to the main agent. The main agent never sees the 50 lines of API errors it took to get there.
I chose sub-agents for tasks that either required many retries (like debugging webhooks) or produced verbose intermediate outputs (like knowledge base searches). The primary conversation stays clean.
Implementation: sub-agents are exposed as regular tools. Calling debug_hook spawns a separate LLM session with its own history, restricted toolset, and focused system prompt. An abstract base class handles the iteration loop; concrete implementations define their tools and execution logic. The 15 iterations of trial-and-error stay quarantined.
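A condensed sketch of that pattern follows; the class, the tool names, and the dummy `step()` are illustrative, not the actual implementation.

```python
# Sketch of the sub-agent pattern; class names, tool names, and the dummy step()
# are illustrative, not the real implementation.
from abc import ABC, abstractmethod

class SubAgent(ABC):
    """An isolated LLM session with a restricted toolset; only the result escapes."""

    max_iterations: int = 15

    @abstractmethod
    def allowed_tools(self) -> list[str]:
        """The narrow toolset this sub-agent may call."""

    @abstractmethod
    def system_prompt(self) -> str:
        """A focused prompt describing this sub-agent's single job."""

    @abstractmethod
    async def step(self, task: str, history: list[dict]) -> tuple[str | None, dict]:
        """Run one LLM turn; return (final_answer or None, turn_record)."""

    async def run(self, task: str) -> str:
        history: list[dict] = []           # private history, never shown to the main agent
        for _ in range(self.max_iterations):
            answer, record = await self.step(task, history)
            history.append(record)         # failed attempts and verbose API errors stay here
            if answer is not None:
                return answer              # only the distilled result leaves the sandbox
        return "Sub-agent hit the iteration limit without a conclusive result."

class HookDebugger(SubAgent):
    """Exposed to the main agent as the debug_hook tool."""

    def allowed_tools(self) -> list[str]:
        return ["get_hook", "get_hook_logs", "test_hook_payload"]   # hypothetical tool names

    def system_prompt(self) -> str:
        return "Debug the given hook. Return the root cause and the fixed code."

    async def step(self, task: str, history: list[dict]) -> tuple[str | None, dict]:
        # Stand-in for one LLM turn against the restricted toolset.
        return "Root cause: unhandled missing field. Fixed code attached.", {"task": task}
```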
5. Skills: Load Knowledge Just-in-Time
(PR #73, Agent Skills) Initially, the agent's "brain" was a monolithic system prompt. I kept adding rules, edge cases, and IMPORTANT: blocks in all caps. It felt productive until it became brittle, unreadable, and expensive. Loading every format specification and workflow upfront meant I was paying for tokens the agent didn't even need for 90% of tasks.
I pivoted to a Skills Framework: show only what's necessary. Rossum's documentation spans hundreds of pages; loading it upfront would consume half the context window. So the system loads expertise when it becomes relevant.
    skills/
    ├── hook-debugging.md       # Debugging serverless hook issues
    ├── organization-setup.md   # New customer setup with regional templates
    ├── rossum-deployment.md    # Safe deployment via sandbox with diffs
    ├── schema-patching.md      # Adding and updating individual fields
    ├── schema-pruning.md       # Tree-based view for removing unused fields
    └── ui-settings.md          # Configuring annotation list columns
The system operates in two distinct stages to optimize the context window:
- Metadata (always loaded): short descriptions of each skill. These act as "triggers" so the agent knows what expertise is available.
- Instructions (loaded on demand): the full procedural knowledge (examples, patterns, and constraints), injected only when the agent identifies a matching task.
If a user asks "Why is my webhook failing?", the agent identifies the hook-debugging metadata and pulls in the full instructions. Knowledge for UI settings and schema patching stays dormant.
Adding a capability is now a standard software task: create a file, version it, ship it. Product managers draft skills in Markdown; engineers review them like code.
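A rough sketch of the two-stage loading: it assumes each skill file opens with a one-line summary, which is my simplification rather than the real format, and `load_skill` stands in for however the tool is actually exposed.

```python
# Sketch of two-stage skill loading; assumes each skills/*.md file opens with
# a one-line summary (a simplification of the real format).
from pathlib import Path

SKILLS_DIR = Path("skills")

def skill_metadata() -> str:
    """Stage 1: short per-skill descriptions, always present in the system prompt."""
    lines = []
    for path in sorted(SKILLS_DIR.glob("*.md")):
        summary = (path.read_text().splitlines() or ["(no summary)"])[0].lstrip("# ")
        lines.append(f"- {path.stem}: {summary}")
    return "Available skills (load one with load_skill):\n" + "\n".join(lines)

def load_skill(name: str) -> str:
    """Stage 2: exposed as a tool; injects the full instructions only when relevant."""
    path = SKILLS_DIR / f"{name}.md"
    if not path.exists():
        available = ", ".join(p.stem for p in SKILLS_DIR.glob("*.md"))
        return f"Unknown skill '{name}'. Available: {available}"
    return path.read_text()
```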
6. Dynamic Tool Loading Cuts Context Costs
(PR #108) As the toolkit expanded, I hit the "Agent Tax": every tool costs tokens even if never called. Loading 50+ MCP tools upfront burned 8,000 tokens before the user typed a word.
To solve this, I implemented a Two-Phase Dynamic Loading strategy:
- Phase 1: Keyword heuristics. When a message arrived, a simple regex scan in Python checked the user's prompt for domain-specific keywords (e.g., "webhook," "schema," "queue"), bypassing the LLM and relying on old-school pattern matching. The system pre-loaded the corresponding tool categories to eliminate "cold-start" delays. The agent started the conversation with the right tools already in hand.
- Phase 2: Just-in-time discovery. For complex or ambiguous requests, the agent used two "meta-tools" to explore the rest of the library:
  - list_tool_categories(): show me the available tool groups.
  - get_tool_category(name): load the specific tools for this group.
"List my queues" now loads only queue-related tools (~800 tokens) instead of the full 8,000-token library. 10x reduction in baseline context.
The surprise was quality. With fewer tools to choose from, the model stopped second-guessing itself. Removing noise from the toolset was as effective as better prompting.
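For reference, here's a rough sketch of both phases side by side. The keyword map and `register_tools()` are illustrative stand-ins; only the two meta-tool names match the ones above.

```python
# Sketch of two-phase tool loading; the keyword map and register_tools() are
# illustrative stand-ins.
import re

# Phase 1: old-school keyword heuristics -- no LLM call required.
CATEGORY_KEYWORDS = {
    "hooks": re.compile(r"\b(web)?hooks?\b|\bextension", re.IGNORECASE),
    "schemas": re.compile(r"\bschemas?\b|\bfields?\b", re.IGNORECASE),
    "queues": re.compile(r"\bqueues?\b|\bworkspaces?\b", re.IGNORECASE),
}

def preload_categories(user_message: str) -> list[str]:
    """Pick tool categories to register before the first LLM turn."""
    return [name for name, rx in CATEGORY_KEYWORDS.items() if rx.search(user_message)]

# Phase 2: meta-tools the agent itself can call for anything the heuristics missed.
TOOL_CATEGORIES: dict[str, list] = {"hooks": [], "schemas": [], "queues": []}  # category -> tool defs

def register_tools(tools: list) -> None:
    """Stand-in for adding tool definitions to the live agent session."""

def list_tool_categories() -> list[str]:
    """Meta-tool: show the agent which tool groups exist (cheap, always loaded)."""
    return sorted(TOOL_CATEGORIES)

def get_tool_category(name: str) -> str:
    """Meta-tool: load a category's tools into the current session on demand."""
    register_tools(TOOL_CATEGORIES[name])
    return f"Loaded {len(TOOL_CATEGORIES[name])} tools from '{name}'."
```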
7. Smarter Models Need Simpler Prompts
(PR #99, Opus 4.5 Migration Plugin) Switching models usually feels like a simple configuration change. But moving from Sonnet to Claude Opus 4.5 taught me that better models require better, and often shorter, instructions. I had to admit that my prompt had become an over-engineered legal document.
The initial migration was rough. Opus fell into the "tool-calling inferno": redundant loops where the model over-analyzed simple tasks, burning tokens without progress. I realized the problem wasn't the model; it was my "legacy" prompts. Building with less capable models had trained me to write exhaustive, hand-holding instructions. Because Opus 4.5 is more literal and responsive, my CRITICAL: and MUST: blocks were being treated as absolute commands to act, even when unnecessary.
The fix: stop writing procedures, start writing goals.
    # Before: 104 lines of step-by-step handholding

    ### Step 4: Hook Code Debugging with Opus
    **MANDATORY**: When debugging Python hook code, you MUST use `debug_hook`.

    **CRITICAL: Investigate ALL Issues**
    - DO NOT stop at the first issue found
    - The Opus sub-agent will exhaustively analyze for ALL issues
    - Continue investigating even after fixing one error

    ### Step 5: Trust and Apply Opus Results
    **CRITICAL:**
    - When Opus returns fixed code, you MUST trust its findings
    - DO NOT second-guess or re-analyze what Opus has investigated
    - DO NOT simplify or modify the fixed code Opus provides

    # After: 31 lines of goals + constraints

    # Hook Debugging Skill
    **Goal**: Identify and fix hook issues.

    | Tool | Purpose |
    |------|---------|
    | `search_knowledge_base` | **USE FIRST** - contains extension configs |
    | `debug_hook(hook_id, annotation_id)` | Spawns sub-agent for code analysis |

    **Constraints**:
    - Always search knowledge base first
    - Use `debug_hook` for Python code -- do not analyze yourself
    - Trust `debug_hook` results -- do not re-analyze or modify
The hook-debugging skill shrank from 104 lines to 31. I stopped telling Opus how to think and started telling it what the desired outcome looks like.
By softening the aggressive language and letting the model infer the best path from the available constraints, I eliminated the tool loops. Counter-intuitively, the model became more predictable.
8. Regression Testing: Prompts Need Tests Too
(PR #82) Refactoring a prompt can change system behavior as much as changing business logic. Every "improvement" is a potential regression.
Consider a typical week: Monday, you tweak the system prompt for better clarity. Tuesday, you add a new validation tool. Wednesday, you "optimize" a tool description. By Friday, the agent is hallucinating parameters or skipping critical steps. Without a way to measure precisely what changed, you're debugging by intuition.
Intuition doesn't scale, so I built an Evaluation Framework. It runs the agent through realistic scenarios and validates behavior:
    RegressionTestCase(
        name="setup_invoice_queue_with_validation",
        prompt="Set up Invoice queue with three specific business validation checks...",
        tool_expectation=ToolExpectation(
            expected_tools=[
                "create_queue_from_template",
                "search_knowledge_base",
                "create_hook_from_template",
            ],
            mode=ToolMatchMode.SUBSET,
        ),
        token_budget=TokenBudget(
            min_total_tokens=60000,  # Guard against "lazy" agent shortcuts
            max_total_tokens=90000,  # Detect infinite loops early
        ),
        success_criteria=SuccessCriteria(
            require_subagent=True,
            max_steps=6,
            custom_checks=[BUSINESS_VALIDATION_HOOK_CHECK],
        ),
    )
By tracking tool sequences and token budgets, I caught subtle regressions that a simple "check the output" test would miss:
- The silent skip: After refactoring a skill prompt, the agent stopped calling search_knowledge_base before debugging hooks. The output looked fine, but the agent was guessing instead of consulting documentation. The test caught the missing tool in the expected sequence.
- The infinite loop: A "clarified" tool description caused the agent to call get_schema repeatedly, convinced each response was incomplete. Token budget alerts flagged it before I burned through $50 in a single test run.
Neither failure was obvious from reading the output. The agent completed its task; it just took a worse path. Without quantitative checks, I'd have shipped regressions disguised as improvements.
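To make the checks concrete, here is a hypothetical sketch of the validation step. `AgentRun` and `evaluate()` are stand-ins for whatever the harness actually records; only the field names follow the RegressionTestCase shown above.

```python
# Hypothetical sketch of applying a test case's checks to a recorded run;
# AgentRun and evaluate() are stand-ins, field names follow RegressionTestCase above.
from dataclasses import dataclass

@dataclass
class AgentRun:
    tools_called: list[str]
    total_tokens: int
    steps: int
    used_subagent: bool

def evaluate(case, run: AgentRun) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    failures = []
    # Tool expectation: in SUBSET mode every expected tool must appear, order-agnostic.
    missing = set(case.tool_expectation.expected_tools) - set(run.tools_called)
    if missing:
        failures.append(f"missing tool calls: {sorted(missing)}")   # catches the silent skip
    # Token budget: too cheap means a lazy shortcut, too expensive means a loop.
    if run.total_tokens < case.token_budget.min_total_tokens:
        failures.append(f"suspiciously cheap run: {run.total_tokens} tokens")
    if run.total_tokens > case.token_budget.max_total_tokens:
        failures.append(f"possible tool loop: {run.total_tokens} tokens")   # catches get_schema loops
    if case.success_criteria.require_subagent and not run.used_subagent:
        failures.append("expected a sub-agent delegation, none happened")
    if run.steps > case.success_criteria.max_steps:
        failures.append(f"took {run.steps} steps (max {case.success_criteria.max_steps})")
    return failures
```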
9. Orchestrate Specialists, Don't Replicate Them
(Commit 59e0881, Formula Fields in Rossum) Section 4 covered sub-agents for context isolation. This section goes further: instead of spawning another instance of the same frontier model, the agent hands off work to an existing production system.
Rossum uses Formula Fields for computing derived values during document extraction. Formulas are written in TxScript, a language based on Python but extended with Rossum-specific functions, variables, and field references. Rossum already has a production Formula Field Copilot: an LLM prompted to translate natural language into precise TxScript.
So why not have Opus generate formulas directly? Since TxScript is Python-based, Opus can write syntactically valid code. But "valid Python" isn't the same as "correct TxScript".
TxScript has nuanced conventions: field access patterns, required vs. non-required handling, multivalue iteration, related object access. I could incorporate this as a skill, but why maintain a parallel implementation when the Formula Field Copilot already exists?
So I just called the existing Copilot.
- The user asks for a business rule in plain English: "Add a 'Net Terms' field that computes Due Date minus Issue Date and categorizes it as Net 15, Net 30, or Outstanding."
- Opus interprets the intent, fetches the current schema, and derives a structured hint for the Formula Field Copilot.
- The Formula Field Copilot generates the precise extraction formula in Rossum's syntax.
- Opus validates the formula against the schema and applies it via MCP tools.
This mirrors Section 4's sub-agent pattern, but sub-agents delegate for context isolation; here, delegation is for capability. Opus reasons and orchestrates; the Copilot writes formulas. Trying to make Opus do both would mean worse results at higher cost.
It's LLMs all the way down. The frontier model reasons; the production-tested Copilot executes. Neither tries to be something it isn't.
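For illustration, here's what the handoff could look like as a single tool. The Copilot endpoint, payload shape, and function name are my assumptions, not Rossum's actual API.

```python
# Sketch of the handoff as one tool; the Copilot endpoint, payload, and helper
# name are assumptions, not Rossum's actual API.
import httpx

async def generate_formula(schema_id: int, hint: str, token: str) -> str:
    """Delegate formula generation to the production Copilot and return the TxScript."""
    async with httpx.AsyncClient(base_url="https://example.rossum.app") as client:   # placeholder URL
        # 1. The structured hint derived by Opus goes to the existing Formula Field Copilot.
        resp = await client.post(
            "/api/internal/formula-copilot",   # hypothetical endpoint
            json={"schema_id": schema_id, "hint": hint},
            headers={"Authorization": f"Bearer {token}"},
        )
        resp.raise_for_status()
        formula = resp.json()["formula"]   # TxScript written by the specialist model

    # 2. Opus only validates the result against the current schema (do the referenced
    #    field IDs exist?) and applies it via the regular schema-patching MCP tools --
    #    it never writes TxScript itself.
    return formula
```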
What's Next
Remember that 70% prompt reduction from the intro? It wasn't a one-time win. It was the pattern.
Every improvement in this post followed the same arc: I added complexity to solve a problem, then removed most of it once I understood what mattered. The system got more capable as it got smaller.
The gap between "works in a demo" and "works for real users" isn't closed by better models. It's closed by subtraction: fewer tools and fewer instructions.
Next: broader internal rollout at Rossum. More users, more edge cases. I expect to delete as much code in the next 500 commits as I write.
If you're building agents and want to compare notes, or tell me where I'm wrong, find me on GitHub or X. The lessons are in the diffs. Most are deletions.
All PRs are in the rossum-agents repository. Open the diffs to see the trade-offs and the backtracking.