research brief

Agent Visibility v1 Pressure Test

2026-03-15

Executive summary

The proposed design is directionally right, but it is still more of a collaboration metaphor than an operating system. The good news: the metaphor is strong. The combination of a chief-of-staff orchestrator, specialist agents, Discord as the conversation surface, Mission Control as the stateful ops console, and memory as a handbook/wiki is much closer to current best practice than the usual "let a swarm of agents freestyle in one giant shared chat" nonsense.

The bad news: v1 is weak exactly where real agent systems usually fail, which is not in ideation, but in control surfaces. The current framing underspecifies state schemas, artifact contracts, checkpointing, evaluators, task boundaries, and memory write permissions. Without those, you get performative visibility instead of real visibility: lots of chat, thin traceability, poor replay, weak debugging, and agents stepping on each other’s context.

My bottom line:

If you want one sentence: The design has the right org chart, but not yet the right data model.

Best-practice comparison table

| Design area | Proposed v1 | Best-practice principle from current agent systems | Assessment | What to change |
| --- | --- | --- | --- | --- |
| Delegation input | Briefing packets before delegation | Good agents perform better with explicit task framing, starting points, constraints, and expected outputs. Devin guidance strongly emphasizes saying *how*, not just *what*, and telling the agent where to start. OpenAI and Anthropic both push clear instructions and scoped workflows. | Strong | Make packets structured, not prose-only: objective, success criteria, constraints, source context, prior decisions, artifacts to inspect, output schema, escalation rules. |
| Conversation vs state | Discord for conversation, Mission Control for state | Separation of ephemeral collaboration from durable system state is correct. LangGraph and AutoGen both treat conversation as one stream among many, with persistent state/checkpoints elsewhere. | Strong direction, weak implementation | Mission Control must store canonical task/run/artifact state. Discord should never be the source of truth for task status. |
| Agent reporting contract | Kickoff, blocker, completion summary, confidence | This matches human-in-the-loop checkpointing and traceability guidance from Anthropic, OpenAI, LangGraph, and Cognition. But confidence alone is weak unless grounded in evidence and verifiable outcomes. | Good start | Add required fields: assumptions, evidence links, artifact refs, next action, requested decision type, failure mode, and whether confidence is calibrated from an eval rubric or just self-report. |
| Recommendation quality bar | Require a recommendation, not raw dump | Strong match with product best practice. Humans want decision support, not agent exhaust. Anthropic’s evaluator-optimizer pattern and practitioner guidance both favor critique and refinement over first-pass output. | Strong | Define a rubric: recommendation, options considered, evidence, tradeoffs, uncertainty, reversibility, and explicit ask from Pete. |
| Durable artifacts + decision log + daily digest | Persist outputs and summarize decisions | Very aligned. Durable artifacts and decision logs are essential for replay, continuity, and memory curation. Hamel’s trace-first eval guidance supports this. | Strong | Add lineage: each artifact should link to task, run, source inputs, and resulting decision. Daily digest should summarize deltas, not restate everything. |
| Vinny as orchestrator/coherence layer | One orchestrator sits above specialists | Strongly aligned with orchestrator-worker and manager patterns from Anthropic and OpenAI. Also fits practical limits of specialization and tool overload. | Strong, with caveats | Give Vinny explicit powers and limits: routing, packet assembly, status synthesis, arbitration, and memory curation. Do not let Vinny become a dumping ground for all execution. |
| Memory architecture | Memory as handbook/wiki | Directionally correct, but underspecified. Best practice is layered memory: short-term run state, durable records, curated long-term memory, external source retrieval. Shared memory without write policy becomes corruption. | Weakest area | Split memory into: run scratchpad, task state, artifact store, decision log, curated handbook/wiki, and user profile/preferences. Restrict writes by layer. |
| Observability | Implicitly via Discord + summaries | Modern agent systems need traces, spans, checkpoints, and replay. Hamel explicitly defines traces as the full chain of actions/messages/tool calls/retrievals. AutoGen and LangGraph emphasize observability and control. | Insufficient | Add run timeline, tool-call log, artifact lineage, state transitions, and replayable checkpoints. Human-readable summary is not enough. |
| Human-in-the-loop | Agents report blockers and completion | Good instinct, but too thin. Anthropic and LangGraph both emphasize pause/resume at checkpoints, especially before risky actions or when ambiguity is high. | Partial | Define checkpoint classes: approval required, ambiguity detected, conflicting evidence, side-effecting action, low confidence, timeout, dependency missing. |
| Task decomposition | Vinny delegates to Wendy/David | Good if tasks are separable. Anthropic and OpenAI both recommend starting with a single agent and adding specialization only when prompts/tools become too complex. | Reasonable | Use explicit decomposition rules: split by toolset, domain, or time horizon, not by vibe. Keep cross-agent handoffs rare and structured. |
| Reliability and evals | Confidence reporting, implied summaries | Current best practice says evals are non-optional. Hamel is blunt here: traces, error analysis, and pass/fail evals are the foundation. Self-reported confidence is not an eval. | Weak | Add offline eval sets, online spot review, agent scorecards, and task-specific pass/fail checks. Track calibration over time. |
| Shared state vs isolation | Shared office metaphor | Shared context helps coordination, but uncontrolled shared memory harms reliability. AWS/LangGraph discussion is explicit that multi-agent systems need careful synchronization and memory hierarchies. | Needs design discipline | Default to isolated workspaces per run/task. Share only curated artifacts and canonical state, not full scratchpads. |

Where our design is right

1) The orchestrator-specialist model is the right default

This is the most solid part of the design.

Anthropic’s orchestrator-workers pattern is a direct match for Vinny delegating to Wendy and David. OpenAI’s manager pattern and handoff model also map cleanly here. The practical reason is simple: specialization reduces tool overload and prompt overload. One agent trying to research, plan, implement, QA, and communicate all in one loop is how systems get sloppy.

The v1 org chart has a useful asymmetry:

That is sane. It respects the fact that tool access, prompt style, and success criteria differ by domain.

2) Discord as the office, Mission Control as the ops console is conceptually correct

This separation matches how serious systems are built.

Conversation is not state. Chat is where clarification, lightweight coordination, and human judgment happen. State belongs in a structured system. LangGraph’s state/checkpoint model and AutoGen’s event-driven architecture both point the same way: you need a persistent layer for system coordination that is not reducible to free-form chat.

So the premise is right:

That stack makes sense.

3) Briefing packets before delegation is exactly the right instinct

This aligns with both agent research and the lived experience of coding agents.

Cognition’s Devin guidance stresses that the agent should be told where to start, how the work should be done, and what constraints matter. OpenAI advises explicit instructions and clear routine steps. Anthropic recommends decomposition and targeted workflows instead of blind autonomy.

A good packet does three things:

This is one of the highest-leverage ideas in the whole design.

4) Durable artifacts and decision logs are far more important than more chat

This is another strong design choice.

Agent systems fail because their work vanishes into transcripts. Durable artifacts fix that. If Wendy produces a research memo, David produces an implementation plan, and Vinny records the resulting decision, you now have something replayable, referenceable, and auditable.

That is much closer to how strong human teams operate too.

5) A reporting contract is better than open-ended status spam

A contract of kickoff, blocker, completion summary, and confidence is a useful start because it imposes rhythm and makes hidden work visible. This is especially important for asynchronous agent work.

The key strength is not the specific fields. It is the existence of a contract. Once there is a contract, you can evaluate adherence, improve it, and automate around it.

Where our design is weak

1) Memory is still described like a metaphor, not an architecture

"Memory as handbook/wiki" sounds nice, but it is too vague to build a reliable multi-agent system.

Current best practice separates memory into layers because each layer has different failure modes:

If all of that gets flattened into "memory," you will get contamination fast. Agents will overfit to stale notes, write junk into supposedly durable memory, and retrieve too much irrelevant context.

This is where v1 feels undercooked.

2) Mission Control currently looks like a task UI, not an agent runtime

Based on the local project structure, Mission Control today appears to be a local-first shell with projects, a kanban board, some seed data, and a UI framing of work lanes. That is not nothing, but it is not enough.

What is missing for real agent operations:

A kanban board is useful, but it is downstream of the actual agent system. It cannot substitute for it.

3) Confidence reporting is too easy to fake

Self-reported confidence is not reliable. Models are notoriously bad at calibrated confidence in open-ended tasks, and practitioners have learned this the hard way.

Confidence needs to be grounded in something concrete, such as:

Without that, the confidence field becomes vibes in a necktie.

4) The human-in-the-loop design is still event-based, not policy-based

Right now the logic seems to be: agents report blockers or completion. That is better than silence, but it is not enough.

You need explicit pause policies:

Anthropic’s checkpointing logic, LangGraph’s interrupt/resume model, and OpenAI’s guardrails/handoffs all point in this direction.

5) Too much shared context will create cross-agent contamination

The office metaphor is helpful socially, but dangerous operationally.

In real multi-agent systems, a lot of errors come from polluted context windows and weak state boundaries. AWS/LangGraph discussion explicitly calls out context synchronization and memory hierarchy as hard problems in multi-agent systems.

If Wendy and David can both read and write the same broad shared memory, you will get:

Default shared state is not a feature. It is a liability.

6) There is no explicit eval loop yet

Hamel Husain’s work is very clear here: if you are not reviewing traces, running pass/fail evals, and building quality judgments around actual failures, you are mostly guessing.

v1 has a reporting contract but not an evaluation system.

That means you will struggle to answer basic questions like:

Without evals, you cannot improve systematically.

What to change in v2

1) Replace the vague "task" model with a run-based operating model

Mission Control should model at least these first-class objects:

Core objects

This is the real backbone. Kanban can sit on top of it later.
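As one illustration of what "first-class objects" could mean here, the sketch below models the Request/Task/Run/Artifact/Decision objects that the examples later in this brief refer to. Every field name is an assumption; only the object names and lineage links come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    artifact_id: str
    task_id: str
    run_id: str
    source_inputs: list[str] = field(default_factory=list)  # lineage: what it was built from

@dataclass
class Run:
    run_id: str
    task_id: str
    agent: str                                  # e.g. "wendy", "david"
    checkpoints: list[str] = field(default_factory=list)
    artifact_ids: list[str] = field(default_factory=list)

@dataclass
class Task:
    task_id: str
    request_id: str                             # the originating Request from Pete
    status: str = "open"                        # canonical status lives here, not in Discord
    run_ids: list[str] = field(default_factory=list)

@dataclass
class Decision:
    decision_id: str
    task_id: str
    artifact_ids: list[str]                     # evidence the decision rests on
    summary: str
```

Note the direction of the links: every artifact knows its run and task, and every decision knows its evidence. That is what makes replay and audit possible later.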

2) Make briefing packets structured and machine-checkable

A packet should not be a paragraph. It should be a schema.

Proposed packet schema

This mirrors Devin-style "where to start" guidance and OpenAI/Anthropic instruction discipline.
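A minimal sketch of what "machine-checkable" could mean, using the packet fields suggested in the comparison table. The field names and the validator are illustrative assumptions, not a spec.

```python
REQUIRED_PACKET_FIELDS = [
    "objective",            # what done looks like, in one sentence
    "success_criteria",     # checkable pass/fail conditions
    "constraints",          # budget, scope, tools allowed
    "source_context",       # links/ids of inputs the agent should start from
    "prior_decisions",      # decision-log ids that bound the solution space
    "artifacts_to_inspect", # existing artifacts the agent must read first
    "output_schema",        # what the completion report must contain
    "escalation_rules",     # when to open a checkpoint instead of guessing
]

def validate_packet(packet: dict) -> list[str]:
    """Return the missing fields; an empty list means the packet is well-formed."""
    return [f for f in REQUIRED_PACKET_FIELDS if not packet.get(f)]
```

Once a validator like this exists, Vinny can refuse to delegate an incomplete packet, which is the cheapest quality gate in the whole system.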

3) Create a strict agent reporting contract

Current fields are not enough.

Kickoff report

Blocker report

Completion report

Confidence schema

Instead of a raw number alone, require:
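One possible shape for enforcing this: a completion-report validator in which confidence must carry its grounding. Every field name below is an assumption drawn from the comparison table, not the v1 contract.

```python
REQUIRED_REPORT_FIELDS = (
    "summary", "assumptions", "evidence_links", "artifact_refs",
    "next_action", "requested_decision_type", "failure_mode", "confidence",
)

def validate_completion_report(report: dict) -> list[str]:
    """Return contract violations; an empty list means the report passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_REPORT_FIELDS if f not in report]
    conf = report.get("confidence")
    if conf is not None:
        # A bare number is vibes; require the score's grounding alongside it.
        if not isinstance(conf, dict) or "score" not in conf or "source" not in conf:
            problems.append("confidence must carry a score and a source, not a raw number")
        elif conf["source"] not in ("eval_rubric", "self_report"):
            problems.append("confidence.source must be eval_rubric or self_report")
    return problems
```

The `source` field is the load-bearing part: it lets you later compare calibration between rubric-derived and self-reported confidence.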

4) Use layered memory with write permissions

This is the biggest architecture fix.

  1. Scratchpad memory
  2. Operational state
  3. Artifact repository
  4. Decision log
  5. Curated handbook/wiki
  6. External sources index

The key policy: specialist agents should not write directly to curated memory by default. They write artifacts. Vinny curates upward.
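The write policy can be as simple as a per-layer allowlist. A sketch, assuming agents are identified by role; the layer names mirror the list above, but the specific grants are illustrative:

```python
# Write access is the point of layered memory, not the storage tech behind it.
WRITE_POLICY = {
    "scratchpad":    {"wendy", "david", "vinny"},  # private per-run; freely writable
    "operational":   {"vinny"},                    # canonical task/run state
    "artifacts":     {"wendy", "david", "vinny"},  # append-only outputs
    "decision_log":  {"vinny"},                    # only the orchestrator records decisions
    "handbook":      {"vinny"},                    # curated; specialists never write directly
    "sources_index": {"vinny"},
}

def can_write(agent: str, layer: str) -> bool:
    """Check whether an agent may write to a memory layer."""
    return agent in WRITE_POLICY.get(layer, set())
```

Specialists write artifacts; everything above the artifact layer goes through Vinny, which is exactly the "curates upward" rule in code form.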

5) Default to isolated workspaces, shared artifacts

For most tasks:

This preserves specialization while limiting contamination. It also makes debugging possible because you can inspect the exact input packet, run outputs, and checkpoints for one execution.

6) Add observability that supports debugging, not just reassurance

A useful Mission Control should answer:

Minimum observability views

This is consistent with Hamel’s trace definition and AutoGen/LangGraph emphasis on observability and control.

7) Introduce evals before scaling the agent team

Before adding more agents, add a quality loop.

Minimum viable eval program

Hamel’s advice is the right starting point here: begin with error analysis, not fancy infra. But once Mission Control exists, it should make that review easy.
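"Pass/fail checks" can start embarrassingly small: a rubric applied to completed runs, plus a pass rate per agent. The checks and field names below are illustrative, assuming completion reports are stored as plain dicts:

```python
def eval_research_memo(report: dict) -> dict:
    """Task-specific pass/fail checks for a research deliverable (checks are illustrative)."""
    checks = {
        "has_recommendation": bool(report.get("recommendation")),
        "cites_evidence": len(report.get("evidence_links", [])) >= 2,
        "states_uncertainty": bool(report.get("uncertainty")),
    }
    return {"checks": checks, "passed": all(checks.values())}

def scorecard(results: list[dict]) -> float:
    """Pass rate across a batch of eval results; track this per agent over time."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0
```

Even this much gives you a trend line per agent, which is what "track calibration over time" requires at minimum.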

8) Keep specialization narrow and use handoffs sparingly

A common anti-pattern in multi-agent systems is over-splitting.

Do not create many agents because it feels elegant. Create a new specialist only when one of these is true:

Otherwise, keep it in Vinny or one specialist.

9) Add decision hygiene

The design is right to emphasize recommendations, but recommendations need a repeatable shape.

Recommendation quality bar

Every recommendation should answer:

This keeps agents from dumping research and calling it strategy.

Mission Control recommendation

Recommendation: salvage concepts but rebuild

Not "keep as-is." Not full scorched-earth either.

Why not keep as-is

The current Mission Control appears to be a local-first product shell with projects and task/status views. Useful start. Wrong level of abstraction for agent operations.

If you keep it as-is and bolt agents onto it, you will likely end up with:

That will feel productive for a week or two, then turn into opaque chaos.

Why not totally nuke and replace

Because the product framing is good:

There is conceptual value here. Don’t throw out the intent.

What to salvage

What to rebuild

Rebuild Mission Control around these primitives:

The UI can still show kanban and project views, but those should be projections of richer underlying state, not the state model itself.
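"Projections of richer underlying state" means the board is computed from canonical state, never stored as state itself. A sketch, assuming each task record carries a canonical status field as in the data-model discussion above:

```python
from collections import defaultdict

def kanban_view(tasks: list[dict]) -> dict[str, list[str]]:
    """Derive board columns from canonical task state; the UI never writes back here."""
    columns: dict[str, list[str]] = defaultdict(list)
    for t in tasks:
        columns[t["status"]].append(t["task_id"])
    return dict(columns)
```

Because the board is a pure function of task state, there is nothing for Discord chatter or UI drag-and-drop to corrupt: moving a card means mutating the task record, and the column recomputes.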

Practical migration path

  1. Keep the existing UI shell.
  2. Add a new backend data model for runs/artifacts/decisions/checkpoints.
  3. Make task board views derive from canonical run/task state.
  4. Add artifact and decision panels before adding more workflow automation.
  5. Add trace/eval views once a few workflows are live.

That is the least painful path.

Three concrete examples of how the improved design would work in practice

Example 1: Research delegation with real traceability

Scenario: Pete asks, "Should I pursue this startup opportunity?"

v1 likely flow

v2 improved flow

  1. Vinny creates a Request and decomposes it into a Task: company assessment.
  2. Mission Control generates a briefing packet for Wendy with:
  3. Wendy runs in an isolated workspace.
  4. Mission Control logs:
  5. Wendy completes with:
  6. Vinny reviews, records a Decision: pursue / pass / network first.
  7. Daily digest references only the final artifact and decision, not the entire scratch process.

Why this is better: replayable, reviewable, less context leakage, better memory hygiene.

Example 2: Engineering work with explicit checkpoints

Scenario: Pete asks David to improve a workflow or ship a Mission Control feature.

v1 likely flow

v2 improved flow

  1. Vinny assembles packet:
  2. David kickoff report states:
  3. Midway, David hits ambiguity. Mission Control opens a checkpoint tagged needs-product-decision.
  4. Pete answers one specific question in Discord.
  5. David resumes from checkpoint, finishes implementation, attaches:
  6. Eval record logs pass/fail for acceptance criteria.

Why this is better: fewer silent wrong turns, cleaner human intervention, better accountability than a generic "I’m working on it."

Example 3: Daily digest that is actually useful

Scenario: Multiple agents worked across the day.

v1 likely flow

v2 improved flow

Mission Control generates a digest from structured records:

Each line links back to the underlying task, run, and artifact.
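Generating such a digest from structured records rather than chat can be very simple: collect the day's deltas and carry the ids needed to link back. All record shapes and field names below are assumptions for illustration:

```python
def daily_digest(decisions: list[dict], artifacts: list[dict], day: str) -> list[str]:
    """One line per delta, each carrying the ids needed to link back to source records."""
    lines = []
    for d in decisions:
        if d["date"] == day:
            lines.append(f"Decision {d['decision_id']} on task {d['task_id']}: {d['summary']}")
    for a in artifacts:
        if a["date"] == day:
            lines.append(f"New artifact {a['artifact_id']} from run {a['run_id']}")
    return lines
```

Because each line is built from a record id, the digest is navigable, and because it only includes the day's deltas, it never restates the whole backlog.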

Why this is better: the digest becomes an executive control surface, not a compressed transcript.

Final verdict

The proposed design is not wrong. In fact, it is more grounded than most agent setups because it already separates roles, conversation, state, and memory conceptually.

But v1 still stops one layer too early. It defines the org chart and the surfaces, not the runtime model. That is the gap.

So the recommendation is:

Sources and key ideas used