***An update to "***[**I Built 11 Coordinated Notion Agents. Here's What Actually Matters.**](https://www.reddit.com/r/Notion/comments/1rex1ze/i_built_11_coordinated_notion_agents_heres_what/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)***"***

Three weeks ago I published a writeup on an 11-agent system. The reaction was incredible, and the questions I received made me realize I'd underexplained a few things. I've also made several modifications since then, so the initial article is already partially out of date.

Quick context for anyone who didn't read the original: my Notion workspace is the operating system, and 16 custom agents handle email triage, GitHub sync, client reporting, time auditing, and the daily briefing. They run on schedules, write structured outputs to shared databases, and coordinate with each other through defined contracts.

So: updated system, updated lessons. Same format. The thesis up front this time: **the value of this system is almost entirely in the failure handling, not the success path.** Build the failure path first. The success path takes care of itself.

# The conceptual shift

My original system had eleven agents. The new one has sixteen. The difference isn't the agent count; it's that the fleet now **operates itself**. Fleet Ops Agent monitors every other agent, writes failure records to a Dead Letters database, and updates a System Control Plane I can check in under 30 seconds. Before this, I had a collection of agents. Now I have a fleet with observability.

Before Fleet Ops existed, a failed agent was invisible until the absence of output became noticeable, which might take days. Now: Fleet Ops writes a Dead Letter, the System Control Plane shows ❌, Morning Briefing flags it, and I open the Agent Debug Playbook for a structured troubleshooting process. Failures surface in under 24 hours with enough context to diagnose quickly.
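The cadence check at the heart of Fleet Ops is simple enough to sketch. This is a minimal, hypothetical Python analogue of the monitoring loop — `AgentRecord`, `check_fleet`, and the Dead Letter dict shape are illustrative names, not the actual system:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical sketch of a Fleet Ops cadence check. The real system
# writes to Notion databases; here the "control plane" is just a dict.

@dataclass
class AgentRecord:
    name: str
    expected_cadence: timedelta      # how often the agent should run
    last_successful_run: datetime    # timestamp of the last clean digest

def check_fleet(agents, now):
    """Return (status_board, dead_letters) for a one-page control plane."""
    status_board, dead_letters = {}, []
    for a in agents:
        overdue = now - a.last_successful_run > a.expected_cadence
        status_board[a.name] = "❌" if overdue else "✅"
        if overdue:
            # A Dead Letter carries enough context to diagnose quickly.
            dead_letters.append({
                "agent": a.name,
                "detected_at": now.isoformat(),
                "last_success": a.last_successful_run.isoformat(),
                "reason": "missed expected cadence",
            })
    return status_board, dead_letters
```

The point isn't the code, it's the shape: one cheap scheduled pass over last-run timestamps turns "I'll notice when output goes missing" into a board you can read in 30 seconds.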
The other additions reflect operational maturity, not feature creep:

**Response Drafter** handles follow-up emails that Inbox Manager flags as needing a reply. It reads the Follow-Up Tracker, deduplicates against existing entries, loads the email thread from Notion Mail, and drafts a reply that is stored in Notion and never auto-sent. The guardrail matters: an agent that sends email autonomously is a different risk category than one that stages a draft for review.

**Drift Watcher** runs a weekly comparison of each agent's current instructions against a canonical snapshot. Instruction drift is real and quiet: you edit something for a quick fix and forget you changed it. Drift Watcher catches it before a behavioral change becomes a production issue.

**Client Briefing Agent** produces pre-meeting summaries per client: recent tasks, open GitHub items, time log against budget, and any escalations from Inbox Manager. It replaces a manual lookup I was doing before every client call.

The original agents are still running. Most have been revised. The architecture is the same; the instrumentation is not.

# The Workers SDK is the actual unlock

This wasn't in the original writeup because the Workers SDK was released just the day before. But since the alpha release I've jumped in with both feet.

Sarah Sachs, AI lead at Notion, recently wrote that architecture decisions can swing costs by 3×, more than model choice alone. The current fleet demonstrates this: the Label Registry bypass, Worker-gated writes, and upstream status checks combine to reduce per-run costs by an estimated 40-50% versus naive full-reasoning approaches.

The Workers SDK isn't about using fewer tokens. It's about building infrastructure that makes token costs an implementation detail, not the value proposition. I've deployed 21 server-side worker tools: `write-agent-digest`, `check-upstream-status`, `create-handoff-marker`, `update-dead-letter`, and 17 others. These aren't nice-to-haves.
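To make the idea concrete, here's a minimal Python analogue of what a worker-gated write does. The actual tools are server-side and this is not SDK code; the field names come from my digest schema, everything else is illustrative:

```python
# Hypothetical sketch of the validation a tool like `write-agent-digest`
# performs at call time. Not the actual Workers SDK implementation.
from datetime import datetime

ALLOWED_STATUS = {"ok", "warn", "error", "skip"}

REQUIRED_FIELDS = {
    "agent_name": str,
    "status_value": str,
    "run_time_chicago": str,   # must parse as an ISO timestamp
    "summary": str,
    "flagged_items": list,
    "actions_taken": list,
}

def write_agent_digest(payload: dict) -> dict:
    """Reject malformed digests at the write boundary, so downstream
    readers never have to guess at the format."""
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected):
            raise ValueError(f"digest rejected: bad or missing '{field}'")
    if payload["status_value"] not in ALLOWED_STATUS:
        raise ValueError("digest rejected: invalid status_value")
    datetime.fromisoformat(payload["run_time_chicago"])  # raises if not ISO
    return payload  # in the real system: persisted to the digest page
```

An agent calling this either produces a digest every reader can parse, or fails loudly at call time. That failure mode is the feature.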
They are the enforcement layer for everything that matters about how the fleet operates. Without Workers, agents can read and write Notion pages however they want. That sounds fine until you have sixteen of them doing it, and then you have sixteen agents with slightly different digest formats, slightly different status line conventions, and no way to guarantee that downstream readers can parse what upstream producers wrote.

Workers solve this by making the write path a validated function call. The agent doesn't write a digest directly. It calls `write-agent-digest` with a structured payload, and the tool enforces the schema. Here's what `write-agent-digest` enforces:

```json
{
  "agent_name": "string",
  "status_value": "ok | warn | error | skip",
  "run_time_chicago": "ISO timestamp",
  "summary": "string",
  "flagged_items": "array",
  "actions_taken": "array"
}
```

An agent can't write a malformed digest because the tool validates the schema at call time. Morning Briefing can parse any digest the same way because they all came from the same function.

The credit implication is also real. Worker calls are cheaper than full reasoning passes. When an agent can call `check-upstream-status` to gate its own run instead of loading and reasoning over a digest page, the per-run cost drops measurably. This is routing-before-reasoning applied at the tool level, and when a tool call costs \~2 credits and a full reasoning pass costs \~15, it matters at scale.

**One constraint worth naming:** unlike Notion's internal teams, which can switch between frontier labs at will, I'm operating within Notion's model abstraction. My optionality lies in task architecture, not model selection: choosing which tasks get reasoning passes versus deterministic routing, rather than which model serves them. Template Freshness Watcher is suspended not because a cheaper model could handle it, but because the task doesn't justify *any* model cost yet.

# What I got wrong in the original

**I discussed conventions.
I should've discussed contracts.** The original article talks about "shared data pages between agents" and "a contract between agents: you publish a known artifact, I read the artifact, neither of us cares about the other's internals." That's right as a principle. What I didn't have was enforcement. Conventions drift. Contracts don't, if you enforce them. An agent gets updated, the digest format changes slightly, and the downstream reader starts failing silently because it's reading a field that no longer exists. The Workers SDK is what turns a convention into a contract: you can't write a malformed digest if the write function won't accept a malformed payload.

Governance first, then scaling. I did it in the other order. The fleet worked well enough at 11 agents that I kept adding more before the data contract layer was solid. This created rework. The right sequence: establish the contracts, enforce them through tooling, then add agents into a system that can handle them.

**I underbuilt observability.** The original had heartbeat records: every agent writes a minimal "nothing to report" digest so Morning Briefing can distinguish a clean run from a missing run. That's necessary. It's not sufficient. What was missing was a central registry. Without Fleet Ops and the System Control Plane, knowing which agents ran successfully in the last 24 hours required reading individual digest pages. That's 10+ pages. It doesn't scale, and it degrades to "I'll check when something breaks," which is not observability.

The System Control Plane now shows: expected cadence per agent, last successful run, current status (✅ / ⚠️ / ❌), and any open Dead Letters. One page. This is what I should've built at agent 3, not agent 14.

**Per-email triggers were a mistake.** Removing them was the single highest-impact credit optimization I've made. Both email agents had per-message triggers as a supplement to the batch windows. In theory, faster response time.
In practice, double-processing on some messages: the trigger fired, then the batch window also picked up the same email. The redundancy wasn't caught immediately because both runs produced valid-looking output. I consolidated to three batch windows for Inbox Manager (7am, noon, 5pm) and two for Personal Ops (8am, 6pm). Nothing fell through. The per-email triggers were solving a latency problem I don't actually have; this is a task management system, not a real-time response system.

# Credit architecture is architecture

Paid credits start May 4, at $10 per 1,000 credits. I'm treating May as a forcing function: if an agent can't justify its credit cost against the time it saves or the risk it catches, it either gets redesigned or suspended. Every design decision in the current system has a credit rationale, with rough estimates:

* **Label Registry bypass**: \~60-65% of email volume routes deterministically, with no reasoning pass. This saves \~40-50 credits per batch window. At 3 runs/day × 30 days for Inbox Manager alone, that's 3,600–4,500 credits/month, or $36–45 saved on one agent.
* **Notion Mail auto-labeling**: pushes the bypass rate toward 70-80% on well-labeled batches. Proactive labeling upstream means the Label Registry lookup is a direct key-value hit, not inference.
* **Signal pre-scan in Morning Briefing**: `check-upstream-status` calls (\~2 credits each) before loading full digest pages (\~15 credits each). Only flagged items get full reads; clean digests cost a fraction of a full read.
* **Worker-gated writes**: structured tool calls instead of free-form page edits. Enforcement and cost reduction in the same change.
* **Docs Librarian archival**: agent digest pages older than 90 days get archived, keeping query surfaces clean and response sizes manageable. Smaller pages = cheaper reads.
* **Fleet Ops gating**: agents that detect a failed upstream producer abort early rather than running a full analysis on bad data.
A wasted reasoning pass on stale input costs the same as a productive one.

**Template Freshness Watcher** is already suspended. It'll stay that way until I have a use case that justifies the runs.

# The inter-agent communication problem, revisited

The original rule: Inbox Manager and Personal Ops Manager can exchange exactly one @mention per direction. Without that limit, it becomes a loop. I tested it. It becomes a loop.

What I didn't describe is the downstream accumulation problem: what happens when neither agent resolves the ambiguous item, it goes into Needs Review, and I'm traveling for a week?

Current state machine:

https://preview.redd.it/471n4jquhpog1.jpg?width=280&format=pjpg&auto=webp&s=3fc2761aaacc4d9a17f3e82efc037de146c10e6b

Two re-escalations maximum. After day 6, the item moves to Needs Manual Review, a permanent section that doesn't generate re-escalations. This prevents unbounded queue growth during any absence longer than a few days.

**The lesson:** every inter-agent communication path needs both a rate limit and a terminal state. The rate limit prevents loops. The terminal state prevents accumulation. You need both.

# What's still unsolved

**Calendar event deduplication.** Both email agents operate on calendar events. There's no deterministic way to identify that a client lunch mentioned in two different email threads is the same calendar event. Notion's calendar integration doesn't expose event IDs pre-acceptance, so matching has to happen on fuzzy string similarity (subject + time window). The engineering cost of a probabilistic matcher isn't justified yet. Current state: tolerate occasional duplicates and catch them in the Follow-Up Tracker's deduplication pass. Not satisfying.

**Label Registry maintenance.** The registry accumulates Pending Review rows faster than I process them. It's self-updating in a soft way: when an agent encounters an unhandled label, it writes a Pending Review row with a suggested rule.
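The soft self-update amounts to a deterministic lookup with a pending-review fallback. A minimal sketch, with `registry` and `pending_review` standing in for the Notion databases (all names here are illustrative, not the production schema):

```python
# Hypothetical sketch of the Label Registry bypass with its
# Pending Review fallback. Dicts/lists stand in for Notion databases.
def route_email(label: str, registry: dict, pending_review: list):
    """Deterministic key-value routing. Unhandled labels become
    Pending Review rows instead of triggering a reasoning pass."""
    if label in registry:
        return registry[label]  # direct hit: no model cost at all
    # Unknown label: queue it for human review, once per label.
    if all(row["label"] != label for row in pending_review):
        pending_review.append({
            "label": label,
            "status": "Pending Review",
            "suggested_rule": f"proposed route for '{label}'",  # placeholder
        })
    return None  # caller falls through to the (expensive) reasoning path
```

The fallback is what makes the registry self-updating, and also what makes the review queue grow: every `None` return is a row someone eventually has to triage.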
Good behavior, but the review queue needs its own caretaker. Docs Librarian is supposed to take this over. That handoff hasn't shipped yet.

**Automated output validation.** The Workers SDK enforces schema at write time, but there's no regression detection when I edit prompt instructions. A prompt change that silently breaks a downstream reader would surface in Morning Briefing eventually, and "eventually" is not a testing strategy. What's needed: a validation tool that runs after prompt edits and checks that a sample digest still parses correctly end-to-end. This is the next real infrastructure item.

# Architecture diagram

Here's the current state. Solid lines are routine data flows: digests, status updates, shared data snapshots. Dotted lines are exception paths: failures, spot-check triggers, escalations. The Workers SDK enforces write contracts on all major producer paths.

https://preview.redd.it/fzpve4urhpog1.jpg?width=1822&format=pjpg&auto=webp&s=e320946c670f3b35e68eebee4a4833faea6e26ec

**TL;DR:** Went from 11 to 16 agents. The real upgrade wasn't more agents; it was Fleet Ops (observability), the Workers SDK (enforced data contracts), and eliminating per-email triggers. Build the failure path first. Contracts > conventions. Observability should be agent 3, not agent 14.

Happy to go deeper on any of this, especially the Workers SDK architecture or the credit cost breakdown.