Agent Visibility v1 Pressure Test

Executive summary

The proposed design is directionally right, but it is still more of a collaboration metaphor than an operating system. The good news: the metaphor is strong. The combination of a chief-of-staff orchestrator, specialist agents, Discord as the conversation surface, Mission Control as the stateful ops console, and memory as a handbook/wiki is much closer to current best practice than the usual "let a swarm of agents freestyle in one giant shared chat" nonsense.

The bad news: v1 is weak exactly where real agent systems usually fail, which is not in ideation, but in control surfaces. The current framing underspecifies state schemas, artifact contracts, checkpointing, evaluators, task boundaries, and memory write permissions. Without those, you get performative visibility instead of real visibility: lots of chat, thin traceability, poor replay, weak debugging, and agents stepping on each other’s context.

My bottom line:

  • Keep the core architecture idea. It aligns with Anthropic, OpenAI, LangGraph, AutoGen, and strong practitioner guidance.
  • Do not keep Mission Control as-is. It looks like an early UI shell and task board, not a trustworthy agent operations layer.
  • Recommendation: salvage concepts but rebuild. Keep the product intent, discard the assumption that a kanban board plus chat transcript equals agent state.
  • v2 should be built around runs, tasks, artifacts, decisions, checkpoints, and evals, with explicit contracts for who can read/write which memory layers.

If you want one sentence: The design has the right org chart, but not yet the right data model.

Best-practice comparison table

| Design area | Proposed v1 | Best-practice principle from current agent systems | Assessment | What to change |
|---|---|---|---|---|
| Delegation input | Briefing packets before delegation | Good agents perform better with explicit task framing, starting points, constraints, and expected outputs. Devin guidance strongly emphasizes saying *how*, not just *what*, and telling the agent where to start. OpenAI and Anthropic both push clear instructions and scoped workflows. | Strong | Make packets structured, not prose-only: objective, success criteria, constraints, source context, prior decisions, artifacts to inspect, output schema, escalation rules. |
| Conversation vs state | Discord for conversation, Mission Control for state | Separation of ephemeral collaboration from durable system state is correct. LangGraph and AutoGen both treat conversation as one stream among many, with persistent state/checkpoints elsewhere. | Strong direction, weak implementation | Mission Control must store canonical task/run/artifact state. Discord should never be the source of truth for task status. |
| Agent reporting contract | Kickoff, blocker, completion summary, confidence | This matches human-in-the-loop checkpointing and traceability guidance from Anthropic, OpenAI, LangGraph, and Cognition. But confidence alone is weak unless grounded in evidence and verifiable outcomes. | Good start | Add required fields: assumptions, evidence links, artifact refs, next action, requested decision type, failure mode, and whether confidence is calibrated from an eval rubric or just self-report. |
| Recommendation quality bar | Require a recommendation, not raw dump | Strong match with product best practice. Humans want decision support, not agent exhaust. Anthropic’s evaluator-optimizer pattern and practitioner guidance both favor critique and refinement over first-pass output. | Strong | Define a rubric: recommendation, options considered, evidence, tradeoffs, uncertainty, reversibility, and explicit ask from Pete. |
| Durable artifacts + decision log + daily digest | Persist outputs and summarize decisions | Very aligned. Durable artifacts and decision logs are essential for replay, continuity, and memory curation. Hamel’s trace-first eval guidance supports this. | Strong | Add lineage: each artifact should link to task, run, source inputs, and resulting decision. Daily digest should summarize deltas, not restate everything. |
| Vinny as orchestrator/coherence layer | One orchestrator sits above specialists | Strongly aligned with orchestrator-worker and manager patterns from Anthropic and OpenAI. Also fits practical limits of specialization and tool overload. | Strong, with caveats | Give Vinny explicit powers and limits: routing, packet assembly, status synthesis, arbitration, and memory curation. Do not let Vinny become a dumping ground for all execution. |
| Memory architecture | Memory as handbook/wiki | Directionally correct, but underspecified. Best practice is layered memory: short-term run state, durable records, curated long-term memory, external source retrieval. Shared memory without write policy becomes corruption. | Weakest area | Split memory into: run scratchpad, task state, artifact store, decision log, curated handbook/wiki, and user profile/preferences. Restrict writes by layer. |
| Observability | Implicitly via Discord + summaries | Modern agent systems need traces, spans, checkpoints, and replay. Hamel explicitly defines traces as the full chain of actions/messages/tool calls/retrievals. AutoGen and LangGraph emphasize observability and control. | Insufficient | Add run timeline, tool-call log, artifact lineage, state transitions, and replayable checkpoints. Human-readable summary is not enough. |
| Human-in-the-loop | Agents report blockers and completion | Good instinct, but too thin. Anthropic and LangGraph both emphasize pause/resume at checkpoints, especially before risky actions or when ambiguity is high. | Partial | Define checkpoint classes: approval required, ambiguity detected, conflicting evidence, side-effecting action, low confidence, timeout, dependency missing. |
| Task decomposition | Vinny delegates to Wendy/David | Good if tasks are separable. Anthropic and OpenAI both recommend starting with a single agent and adding specialization only when prompts/tools become too complex. | Reasonable | Use explicit decomposition rules: split by toolset, domain, or time horizon, not by vibe. Keep cross-agent handoffs rare and structured. |
| Reliability and evals | Confidence reporting, implied summaries | Current best practice says evals are non-optional. Hamel is blunt here: traces, error analysis, and pass/fail evals are the foundation. Self-reported confidence is not an eval. | Weak | Add offline eval sets, online spot review, agent scorecards, and task-specific pass/fail checks. Track calibration over time. |
| Shared state vs isolation | Shared office metaphor | Shared context helps coordination, but uncontrolled shared memory harms reliability. AWS/LangGraph discussion is explicit that multi-agent systems need careful synchronization and memory hierarchies. | Needs design discipline | Default to isolated workspaces per run/task. Share only curated artifacts and canonical state, not full scratchpads. |

Where our design is right

1) The orchestrator-specialist model is the right default

This is the most solid part of the design.

Anthropic’s orchestrator-workers pattern is a direct match for Vinny delegating to Wendy and David. OpenAI’s manager pattern and handoff model also map cleanly here. The practical reason is simple: specialization reduces tool overload and prompt overload. One agent trying to research, plan, implement, QA, and communicate all in one loop is how systems get sloppy.

The v1 org chart has a useful asymmetry:

  • Vinny handles intake, prioritization, routing, synthesis, and final coherence.
  • Wendy handles research-heavy ambiguity.
  • David handles implementation-heavy ambiguity.

That is sane. It respects the fact that tool access, prompt style, and success criteria differ by domain.

2) Discord as the office, Mission Control as the ops console is conceptually correct

This separation matches how serious systems are built.

Conversation is not state. Chat is where clarification, lightweight coordination, and human judgment happen. State belongs in a structured system. LangGraph’s state/checkpoint model and AutoGen’s event-driven architecture both point the same way: you need a persistent layer for system coordination that is not reducible to free-form chat.

So the premise is right:

  • Discord = coordination surface
  • Mission Control = operational source of truth
  • Memory/wiki = curated institutional memory

That stack makes sense.

3) Briefing packets before delegation is exactly the right instinct

This aligns with both agent research and the lived experience of coding agents.

Cognition’s Devin guidance stresses that the agent should be told where to start, how the work should be done, and what constraints matter. OpenAI advises explicit instructions and clear routine steps. Anthropic recommends decomposition and targeted workflows instead of blind autonomy.

A good packet does three things:

  • lowers ambiguity,
  • reduces wasted exploration,
  • improves reviewability after the fact.

This is one of the highest-leverage ideas in the whole design.

4) Durable artifacts and decision logs are far more important than more chat

This is another strong design choice.

Agent systems fail because their work vanishes into transcripts. Durable artifacts fix that. If Wendy produces a research memo, David produces an implementation plan, and Vinny records the resulting decision, you now have something replayable, referenceable, and auditable.

That is much closer to how strong human teams operate too.

5) A reporting contract is better than open-ended status spam

Kickoff, blocker, completion summary, and confidence make a useful start because they impose rhythm and make hidden work visible. This is especially important for asynchronous agent work.

The key strength is not the specific fields. It is the existence of a contract. Once there is a contract, you can evaluate adherence, improve it, and automate around it.

Where our design is weak

1) Memory is still described like a metaphor, not an architecture

"Memory as handbook/wiki" sounds nice, but it is too vague to build a reliable multi-agent system.

Current best practice separates memory into layers because each layer has different failure modes:

  • Run memory: temporary scratchpad for a single run.
  • Task state: structured status, ownership, due dates, blockers, linked artifacts.
  • Artifact store: outputs, drafts, notes, plans, research docs.
  • Decision log: what was decided, by whom, based on what evidence.
  • Curated long-term memory: stable preferences, playbooks, SOPs, durable facts.
  • External knowledge retrieval: source documents, web results, repo files, tickets.

If all of that gets flattened into "memory," you will get contamination fast. Agents will overfit to stale notes, write junk into supposedly durable memory, and retrieve too much irrelevant context.

This is where v1 feels undercooked.

2) Mission Control currently looks like a task UI, not an agent runtime

Based on the local project structure, Mission Control today appears to be a local-first shell with projects, a kanban board, some seed data, and a UI framing of work lanes. That is not nothing, but it is not enough.

What is missing for real agent operations:

  • run objects
  • state transitions
  • checkpoint/resume
  • artifact lineage
  • decision records tied to runs
  • evaluation results
  • confidence calibration history
  • event log / trace view
  • permissions around who can write what

A kanban board is useful, but it is downstream of the actual agent system. It cannot substitute for it.

3) Confidence reporting is too easy to fake

Self-reported confidence is not reliable. Models are notoriously poorly calibrated on open-ended tasks, and practitioners have learned this the hard way.

Confidence needs to be grounded in something concrete, such as:

  • evidence sufficiency,
  • source quality,
  • test pass rate,
  • rubric score,
  • agreement across checks,
  • whether contradictions remain unresolved.

Without that, the confidence field becomes vibes in a necktie.

4) The human-in-the-loop design is still event-based, not policy-based

Right now the logic seems to be: agents report blockers or completion. That is better than silence, but it is not enough.

You need explicit pause policies:

  • pause before side effects,
  • pause when evidence conflicts,
  • pause when requirements are underspecified,
  • pause when confidence falls below threshold,
  • pause after a time budget or iteration budget,
  • pause when touching shared memory or changing plans materially.

Anthropic’s checkpointing logic, LangGraph’s interrupt/resume model, and OpenAI’s guardrails/handoffs all point in this direction.
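
A sketch of what policy-based pausing could look like as explicit, testable rules rather than agent judgment calls; the signal names and thresholds below are assumptions for illustration, not a finished policy:

```typescript
// Illustrative sketch: pause rules evaluated by the runtime, not left to the agent's discretion.
interface RunSignals {
  aboutToCauseSideEffect: boolean;     // e.g. external write, message sent, destructive edit
  evidenceConflicts: boolean;
  requirementsUnderspecified: boolean;
  confidence: number;                  // 0..1, from the agent's confidence schema
  minutesElapsed: number;
  iterations: number;
  touchesSharedMemoryOrPlan: boolean;  // material plan change or write to shared memory
}

// Returns a checkpoint reason if the run should pause, or null to keep going.
function shouldPause(s: RunSignals): string | null {
  if (s.aboutToCauseSideEffect) return "side_effect";
  if (s.evidenceConflicts) return "conflicting_evidence";
  if (s.requirementsUnderspecified) return "ambiguity";
  if (s.confidence < 0.5) return "low_confidence";
  if (s.minutesElapsed > 60 || s.iterations > 20) return "budget_exceeded";
  if (s.touchesSharedMemoryOrPlan) return "plan_or_memory_change";
  return null;
}
```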

5) Too much shared context will create cross-agent contamination

The office metaphor is helpful socially, but dangerous operationally.

In real multi-agent systems, a lot of errors come from polluted context windows and weak state boundaries. AWS/LangGraph discussion explicitly calls out context synchronization and memory hierarchy as hard problems in multi-agent systems.

If Wendy and David can both read and write the same broad shared memory, you will get:

  • irrelevant retrieval,
  • stale assumptions persisting across tasks,
  • duplicate or conflicting work,
  • accidental overwrites of institutional memory.

Default shared state is not a feature. It is a liability.

6) There is no explicit eval loop yet

Hamel Husain’s work is very clear here: if you are not reviewing traces, running pass/fail evals, and building quality judgments around actual failures, you are mostly guessing.

v1 has a reporting contract but not an evaluation system.

That means you will struggle to answer basic questions like:

  • Is Wendy’s research actually helping decisions?
  • Does David’s completion quality improve with packets?
  • Which failure modes recur?
  • Is Vinny routing correctly?
  • Are confidence scores calibrated or decorative?

Without evals, you cannot improve systematically.

What to change in v2

1) Replace the vague "task" model with a run-based operating model

Mission Control should model at least these first-class objects:

Core objects

  • Request: the human ask or trigger.
  • Task: a scoped work unit derived from a request.
  • Run: one execution attempt of a task by a given agent.
  • Checkpoint: a pause requiring human review or policy handling.
  • Artifact: any durable output file, note, plan, summary, code diff, or research memo.
  • Decision: a recorded judgment with rationale and links to artifacts.
  • Eval: pass/fail or scored assessment of an output or run.

This is the real backbone. Kanban can sit on top of it later.
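
To make that concrete, here is a minimal sketch of how these objects could be modeled. The field names, enums, and relationships are illustrative assumptions, not a finalized schema:

```typescript
// Illustrative data model only; names and enums are assumptions.
type AgentId = "vinny" | "wendy" | "david";

interface Request {
  id: string;
  requestedBy: string;          // the human ask or trigger
  description: string;
  createdAt: string;            // ISO timestamp
}

interface Task {
  id: string;
  requestId: string;            // every task derives from a request
  objective: string;
  assignedAgent: AgentId;
  status: "queued" | "running" | "blocked" | "done" | "cancelled";
}

interface Run {
  id: string;
  taskId: string;               // one execution attempt of a task
  agent: AgentId;
  startedAt: string;
  endedAt?: string;
  outcome?: "completed" | "failed" | "interrupted";
}

interface Checkpoint {
  id: string;
  runId: string;
  reason: "approval_required" | "ambiguity" | "conflicting_evidence"
        | "side_effect" | "low_confidence" | "timeout";
  resolvedBy?: string;          // human or orchestrator who unblocked it
}

interface Artifact {
  id: string;
  runId: string;                // lineage: which run produced it
  kind: "memo" | "plan" | "diff" | "note" | "summary";
  uri: string;                  // where the durable output lives
  version: number;              // immutable versions preferred
}

interface Decision {
  id: string;
  taskId: string;
  decidedBy: string;
  rationale: string;
  artifactIds: string[];        // evidence the decision rests on
}

interface Eval {
  id: string;
  runId: string;
  rubric: string;
  result: "pass" | "fail";
  score?: number;               // optional scored assessment
  notes?: string;
}
```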

2) Make briefing packets structured and machine-checkable

A packet should not be a paragraph. It should be a schema.

Proposed packet schema

  • task_id
  • requester
  • assigned_agent
  • objective
  • why_this_matters
  • definition_of_done
  • constraints
  • relevant_sources
  • prior_decisions
  • recommended_starting_points
  • deliverables_required
  • output_format
  • time_budget
  • escalation_policy

This mirrors Devin-style "where to start" guidance and OpenAI/Anthropic instruction discipline.
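
As a sketch of what "structured and machine-checkable" could mean in practice, assuming the field names above; the validation rules are illustrative, not prescriptive:

```typescript
// Sketch of the packet as a schema rather than prose; names mirror the list above.
interface BriefingPacket {
  task_id: string;
  requester: string;
  assigned_agent: string;
  objective: string;
  why_this_matters: string;
  definition_of_done: string[];          // checkable criteria, not a paragraph
  constraints: string[];
  relevant_sources: string[];            // links/paths the agent should read first
  prior_decisions: string[];             // decision IDs that bound this task
  recommended_starting_points: string[];
  deliverables_required: string[];
  output_format: string;                 // e.g. "one-page memo", "code diff + test notes"
  time_budget_minutes?: number;
  escalation_policy: string;             // when to stop and open a checkpoint
}

// Machine-checkable: refuse to delegate if required fields are missing or empty.
function validatePacket(p: BriefingPacket): string[] {
  const problems: string[] = [];
  if (!p.objective.trim()) problems.push("objective is empty");
  if (p.definition_of_done.length === 0) problems.push("no definition_of_done");
  if (p.relevant_sources.length === 0) problems.push("no relevant_sources");
  if (!p.escalation_policy.trim()) problems.push("no escalation_policy");
  return problems;                       // empty array means the packet can be sent
}
```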

3) Create a strict agent reporting contract

Current fields are not enough.

Kickoff report

  • interpretation of task
  • planned approach
  • assumptions
  • expected deliverables
  • estimated time/effort
  • first checkpoint if any

Blocker report

  • blocker type: missing info / dependency / contradiction / tool failure / policy stop
  • what was attempted
  • what evidence exists
  • what decision is needed from Pete or Vinny
  • options if no response

Completion report

  • final recommendation
  • evidence summary
  • linked artifacts
  • unresolved uncertainties
  • confidence score with basis
  • next best action
  • eval results if available

Confidence schema

Instead of a raw number alone, require:

  • confidence_level
  • confidence_basis: tested / source-backed / inferred / weakly inferred
  • failure_if_wrong
  • reversibility_of_decision
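
A sketch of how the completion report and confidence schema might look as types, assuming the field names above; the exact shapes are illustrative:

```typescript
// Illustrative shapes only; field names follow the lists above.
type ConfidenceBasis = "tested" | "source_backed" | "inferred" | "weakly_inferred";

interface Confidence {
  confidence_level: number;              // e.g. 0..1, only meaningful alongside a basis
  confidence_basis: ConfidenceBasis;
  failure_if_wrong: string;              // what breaks if this turns out to be wrong
  reversibility_of_decision: "reversible" | "costly" | "irreversible";
}

interface CompletionReport {
  run_id: string;
  final_recommendation: string;
  evidence_summary: string;
  linked_artifact_ids: string[];
  unresolved_uncertainties: string[];
  confidence: Confidence;
  next_best_action: string;
  eval_result_ids?: string[];            // present once evals exist
}
```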

4) Use layered memory with write permissions

This is the biggest architecture fix.

  1. Scratchpad memory
  • per-run only
  • writable by executing agent
  • never treated as durable truth
  2. Operational state
  • task/run/checkpoint metadata
  • writable by system and orchestrator
  • canonical source for status
  3. Artifact repository
  • reports, packets, summaries, code plans, drafts
  • writable by assigned agent
  • immutable versions preferred
  4. Decision log
  • concise records of decisions made
  • writable by orchestrator after review
  5. Curated handbook/wiki
  • stable SOPs, user preferences, reusable playbooks
  • write-restricted, curated only
  6. External sources index
  • linked docs, tickets, web pages, repo files
  • retrieval layer, not memory layer

The key policy: specialist agents should not write directly to curated memory by default. They write artifacts. Vinny curates upward.
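
One way to encode that policy is a simple write-permission map keyed by memory layer; the layer and role names below are assumptions for illustration:

```typescript
// Sketch of a write-permission policy by memory layer.
type MemoryLayer =
  | "scratchpad"        // per-run, never durable truth
  | "operational_state" // task/run/checkpoint metadata
  | "artifacts"         // versioned outputs
  | "decision_log"
  | "handbook"          // curated SOPs and preferences
  | "external_sources"; // retrieval only

type Role = "specialist" | "orchestrator" | "system" | "human";

const WRITE_POLICY: Record<MemoryLayer, Role[]> = {
  scratchpad: ["specialist"],                  // only the executing agent, only for its own run
  operational_state: ["system", "orchestrator"],
  artifacts: ["specialist", "orchestrator"],   // immutable versions preferred
  decision_log: ["orchestrator", "human"],     // written after review
  handbook: ["human", "orchestrator"],         // curated upward, no direct specialist writes
  external_sources: [],                        // read/retrieve only, never written by agents
};

function canWrite(role: Role, layer: MemoryLayer): boolean {
  return WRITE_POLICY[layer].includes(role);
}
```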

5) Default to isolated workspaces, shared artifacts

For most tasks:

  • each agent gets its own run context,
  • each run has isolated scratch state,
  • only approved artifacts and structured task state are shared.

This preserves specialization while limiting contamination. It also makes debugging possible because you can inspect the exact input packet, run outputs, and checkpoints for one execution.
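
A sketch of what "isolated workspace, shared artifacts" could look like when assembling a run's context, reusing the types sketched earlier; the names are illustrative:

```typescript
// Sketch: a run's context is built from the packet plus curated, shared layers only.
// Nothing from another agent's scratchpad is ever included.
interface RunContext {
  packet: BriefingPacket;              // the structured delegation input
  taskState: Task;                     // canonical status for this task only
  sharedArtifacts: Artifact[];         // approved outputs explicitly linked to this task
  handbookExcerpts: string[];          // curated, read-only long-term memory
  scratchpad: Record<string, unknown>; // starts empty for every run
}

function buildRunContext(
  packet: BriefingPacket,
  taskState: Task,
  approvedArtifacts: Artifact[],
  handbookExcerpts: string[],
): RunContext {
  return {
    packet,
    taskState,
    sharedArtifacts: approvedArtifacts,  // only what was deliberately shared
    handbookExcerpts,
    scratchpad: {},                      // isolation: no carry-over between runs
  };
}
```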

6) Add observability that supports debugging, not just reassurance

A useful Mission Control should answer:

  • What is running right now?
  • What happened in this run?
  • Which tools and sources were used?
  • Where did the recommendation come from?
  • What changed since yesterday?
  • Why did the agent stop?
  • What failed repeatedly this week?

Minimum observability views

  • run timeline
  • state transitions
  • tool/action log
  • artifacts list with lineage
  • checkpoint queue
  • decision log
  • eval outcomes
  • agent scorecards by task type

This is consistent with Hamel’s trace definition and AutoGen/LangGraph emphasis on observability and control.
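
A minimal sketch of the underlying record, assuming each run emits an ordered stream of events; event types and fields are assumptions chosen to answer the questions above:

```typescript
// Sketch of a single trace event; a run timeline is just the ordered list of these.
interface TraceEvent {
  runId: string;
  seq: number;                                   // ordering within the run
  at: string;                                    // ISO timestamp
  type: "state_transition" | "tool_call" | "retrieval"
      | "artifact_written" | "checkpoint_opened" | "checkpoint_resolved";
  detail: {
    tool?: string;                               // which tool/action was used
    inputSummary?: string;
    outputRef?: string;                          // artifact or source reference
    fromState?: string;
    toState?: string;
  };
}

// "What happened in this run?" becomes a query, not an archaeology project.
function runTimeline(events: TraceEvent[], runId: string): TraceEvent[] {
  return events.filter(e => e.runId === runId).sort((a, b) => a.seq - b.seq);
}
```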

7) Introduce evals before scaling the agent team

Before adding more agents, add a quality loop.

Minimum viable eval program

  • Weekly review of 20 to 50 sampled runs
  • Pass/fail rubric by task type
  • Error taxonomy: routing failure, bad packet, missing source, hallucinated claim, premature completion, weak recommendation, wrong escalation, stale memory retrieval
  • Calibration check: compare confidence vs actual quality
  • One owner of quality judgment per workflow

Hamel’s ordering is right here: start with error analysis, not fancy infra. But once Mission Control exists, it should make that review easy.
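
A sketch of the records a weekly review could produce, plus a crude calibration check; the error taxonomy mirrors the list above, and the math is deliberately simple:

```typescript
// Sketch of an eval record produced during weekly run review.
type ErrorTag =
  | "routing_failure" | "bad_packet" | "missing_source" | "hallucinated_claim"
  | "premature_completion" | "weak_recommendation" | "wrong_escalation" | "stale_memory";

interface EvalRecord {
  runId: string;
  rubric: string;                 // task-type-specific pass/fail rubric
  pass: boolean;
  errors: ErrorTag[];
  reviewer: string;               // the single owner of quality judgment
  reportedConfidence: number;     // what the agent claimed (0..1)
}

// Rough calibration check: average claimed confidence vs actual pass rate.
// A positive gap means the agents are overconfident on average.
function calibrationGap(records: EvalRecord[]): number {
  if (records.length === 0) return 0;
  const avgConfidence =
    records.reduce((sum, r) => sum + r.reportedConfidence, 0) / records.length;
  const passRate = records.filter(r => r.pass).length / records.length;
  return avgConfidence - passRate;
}
```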

8) Keep specialization narrow and use handoffs sparingly

A common anti-pattern in multi-agent systems is over-splitting.

Do not create many agents because it feels elegant. Create a new specialist only when one of these is true:

  • toolset is distinct,
  • evaluation rubric is distinct,
  • memory needs are distinct,
  • handoff output can be tightly structured.

Otherwise, keep it in Vinny or one specialist.

9) Add decision hygiene

The design is right to emphasize recommendations, but recommendations need a repeatable shape.

Recommendation quality bar

Every recommendation should answer:

  • what should be done,
  • why,
  • what evidence supports it,
  • what alternatives were considered,
  • what could make it wrong,
  • what the next action is,
  • whether the decision is reversible.

This keeps agents from dumping research and calling it strategy.
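
If recommendations are to have a repeatable shape, they can be typed like any other artifact. A minimal sketch, with assumed field names mirroring the list above:

```typescript
// Sketch of a recommendation record; a completion without these fields is research, not strategy.
interface Recommendation {
  what: string;                         // what should be done
  why: string;
  evidence: string[];                   // artifact or source references
  alternativesConsidered: string[];
  whatCouldMakeItWrong: string[];
  nextAction: string;
  reversible: boolean;
}
```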

Mission Control recommendation

Recommendation: salvage concepts but rebuild

Not "keep as-is." Not full scorched-earth either.

Why not keep as-is

The current Mission Control appears to be a local-first product shell with projects and task/status views. Useful start. Wrong level of abstraction for agent operations.

If you keep it as-is and bolt agents onto it, you will likely end up with:

  • chat transcripts doing too much work,
  • brittle task status updates,
  • no reliable replay/debugging,
  • no artifact lineage,
  • no eval loop,
  • no trustworthy explanation of what agents actually did.

That will feel productive for a week or two, then turn into opaque chaos.

Why not totally nuke and replace

Because the product framing is good:

  • office metaphor works,
  • projects and tasks still matter,
  • local-first posture is sensible,
  • split between communication and system state is right.

There is conceptual value here. Don’t throw out the intent.

What to salvage

  • the role framing: Vinny/Wendy/David
  • Discord as collaboration surface
  • project and task concepts
  • local-first, human-readable artifacts
  • the instinct toward decision logs and digests

What to rebuild

Rebuild Mission Control around these primitives:

  • requests
  • tasks
  • runs
  • checkpoints
  • artifacts
  • decisions
  • evals
  • memory layers

The UI can still show kanban and project views, but those should be projections of richer underlying state, not the state model itself.
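
A sketch of what "kanban as a projection" could mean, reusing the Task and Checkpoint types from the core object sketch; the column names and grouping rules are assumptions:

```typescript
// Sketch: the board column is derived from canonical state, never stored separately.
type BoardColumn = "Backlog" | "In progress" | "Waiting on Pete" | "Done";

function columnFor(task: Task, openCheckpoints: Checkpoint[]): BoardColumn {
  // openCheckpoints: unresolved checkpoints belonging to this task's runs.
  const blockedOnHuman = openCheckpoints.some(
    c => c.reason === "approval_required" || c.reason === "ambiguity",
  );
  if (task.status === "done") return "Done";
  if (blockedOnHuman || task.status === "blocked") return "Waiting on Pete";
  if (task.status === "running") return "In progress";
  return "Backlog";
}
```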

Practical migration path

  1. Keep the existing UI shell.
  2. Add a new backend data model for runs/artifacts/decisions/checkpoints.
  3. Make task board views derive from canonical run/task state.
  4. Add artifact and decision panels before adding more workflow automation.
  5. Add trace/eval views once a few workflows are live.

That is the least painful path.

3 concrete examples of how the improved design would work in practice

Example 1: Research delegation with real traceability

Scenario: Pete asks, "Should I pursue this startup opportunity?"

v1 likely flow

  • Vinny pings Wendy in Discord.
  • Wendy researches and posts a summary.
  • Vinny replies with a recommendation.
  • Some useful context is lost in the thread.

v2 improved flow

  1. Vinny creates a Request and decomposes it into a Task: company assessment.
  2. Mission Control generates a briefing packet for Wendy with:
  • company name
  • evaluation rubric
  • relevant prior preferences from curated memory
  • required deliverable: one-page opportunity brief
  3. Wendy runs in an isolated workspace.
  4. Mission Control logs:
  • sources fetched
  • artifact versions created
  • blocker checkpoints if evidence conflicts
  5. Wendy completes with:
  • recommendation
  • evidence links
  • confidence basis
  • unresolved questions
  6. Vinny reviews, records a Decision: pursue / pass / network first.
  7. Daily digest references only the final artifact and decision, not the entire scratch process.

Why this is better: replayable, reviewable, less context leakage, better memory hygiene.

Example 2: Engineering work with explicit checkpoints

Scenario: Pete asks David to improve a workflow or ship a Mission Control feature.

v1 likely flow

  • David gets a loose prompt.
  • He works for a while.
  • He reports done or blocked.
  • Pete has to reverse-engineer what happened.

v2 improved flow

  1. Vinny assembles packet:
  • feature objective
  • definition of done
  • starting files/components
  • constraints
  • test expectations
  • checkpoint policy for schema changes or destructive edits
  2. David's kickoff report states:
  • intended implementation path
  • assumptions
  • first checkpoint after repo inspection
  3. Midway, David hits ambiguity. Mission Control opens a checkpoint tagged needs-product-decision.
  4. Pete answers one specific question in Discord.
  5. David resumes from checkpoint, finishes implementation, attaches:
  • diff summary
  • tests run
  • screenshots or artifacts
  • residual risk
  6. Eval record logs pass/fail for acceptance criteria.

Why this is better: fewer silent wrong turns, cleaner human intervention, better accountability than a generic "I’m working on it."

Example 3: Daily digest that is actually useful

Scenario: Multiple agents worked across the day.

v1 likely flow

  • Digest is a chat summary of everything said.

v2 improved flow

Mission Control generates a digest from structured records:

  • New decisions made
  • Tasks completed
  • Tasks blocked awaiting Pete
  • High-value artifacts produced
  • Confidence warnings / low-confidence recommendations
  • What should happen tomorrow

Each line links back to the underlying task, run, and artifact.
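
A sketch of the digest as a render over structured records rather than a chat summary, reusing the core object types sketched earlier; the section names mirror the list above:

```typescript
// Sketch: the digest is generated from canonical records, not from transcripts.
interface DailyDigest {
  date: string;
  newDecisions: Decision[];
  completedTasks: Task[];
  blockedAwaitingPete: Task[];
  highValueArtifacts: Artifact[];
  lowConfidenceWarnings: { taskId: string; basis: string }[];
  proposedTomorrow: string[];
}

// Every rendered line carries links back to the underlying records.
function renderDecisionLine(d: Decision): string {
  return `Decision ${d.id} (task ${d.taskId}): ${d.rationale} [artifacts: ${d.artifactIds.join(", ")}]`;
}
```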

Why this is better: the digest becomes an executive control surface, not a compressed transcript.

Final verdict

The proposed design is not wrong. In fact, it is more grounded than most agent setups because it already separates roles, conversation, state, and memory conceptually.

But v1 still stops one layer too early. It defines the org chart and the surfaces, not the runtime model. That is the gap.

So the recommendation is:

  • Keep the framing
  • Rebuild the state model
  • Add runs, artifacts, decisions, checkpoints, and evals before scaling agent autonomy
  • Salvage Mission Control’s concepts, but rebuild it as an agent ops layer rather than a prettier task board

Sources and key ideas used

  • Anthropic, "Building effective agents": prefer simple composable patterns; workflows before full autonomy; orchestrator-worker and evaluator-optimizer patterns; checkpoints and environmental feedback matter.
  • OpenAI, "A practical guide to building agents" and Agents resources: clear instructions, tools, routines, manager/handoff patterns, guardrails, evals before optimization.
  • Chip Huyen, "Agents": tools and planning define agent capability; compound mistakes make multi-step systems fragile; context construction and failure modes matter.
  • Hamel Husain, evals writing: traces are the complete record; start with error analysis; pass/fail evals over fuzzy ratings; evaluation is core development work, not optional polish.
  • Microsoft AutoGen: actor/event-driven orchestration, observability, modularity, state management, and memory are necessary for scalable multi-agent systems.
  • LangGraph / AWS Bedrock multi-agent guidance: graph/state/checkpoint model, central persistence, human interrupt/resume, memory hierarchy, and context synchronization challenges.
  • Cognition / Devin guidance: specify how, not just what; tell the agent where to start; use clear packets and checkpoints; review bottlenecks become the next problem after generation.