AI Architecture Decision Frameworks for Mid-Market DTC
Executive summary
Pete should build for deterministic workflows first, hybrids second, full agents last. The practical default for a $50M to $500M DTC brand is: keep business logic explicit, use LLMs only at high-variance judgment steps, and treat agents as a thin planning layer over auditable tools and schemas, not as the system itself. That matches Anthropic's guidance that workflows are better for well-defined tasks while agents are better when the number of steps cannot be hardcoded in advance (Anthropic). It also lines up with Aaron Levie's public argument that SaaS is not disappearing, it is becoming agent-native and interoperable, with agents coordinating across systems rather than replacing business systems of record (X, X).
What to do this quarter:
- Ship quick wins as schema-first automations with strong logs, evals, and replayability.
- Put an orchestration layer you own between models and SaaS vendors.
- Preserve four assets across every build: prompts, structured I/O schemas, eval sets, and business process docs.
- Use vendor-native AI only when it is already good enough and already embedded in the operator workflow, especially Shopify Sidekick and Gorgias AI Agent for narrow commerce and support use cases (Shopify, Gorgias).
- Assume most DTC automations will not need an “agent-native” rewrite within 12 months. Customer support and analytics copilots will move sooner; core ops, finance, and cross-system automations still favor APIs and workflows for 24 to 36 months or longer (McKinsey, Ramp AI Index, Workday).
1) Workflow vs. agent decision framework
Production rule of thumb
The useful split is not “old automation vs. new AI.” It is fixed-path execution vs. variable-path reasoning. Lenny’s taxonomy and Anthropic’s “workflows vs agents” framing both converge on that point: if you can specify the path, do it; if the path must be discovered at runtime, agent behavior becomes useful (Lenny’s Newsletter, Anthropic). Simon Willison makes the same operational point from the opposite angle: tool-using systems are powerful, but you should earn complexity only when simpler prompt chains stop working (Simon Willison).
Decision table
| Task shape | Input variability | Tool path known in advance? | Risk of silent failure | Recommendation | Why |
|---|---|---|---|---|---|
| Deterministic back-office action, e.g. sync order tags, push NetSuite status, trigger Klaviyo segment | Low | Yes | High | Deterministic workflow | Lowest latency, lowest cost, easiest audit trail, easiest rollback (Anthropic, Slack Web API, NetSuite REST) |
| High-volume content transformation, e.g. classify reviews, draft replies, summarize CX tickets | Medium | Mostly | Medium | Hybrid workflow | Put LLM only at extraction/classification/generation step, keep routing and writes deterministic (Anthropic, Hamel Husain) |
| Investigation, diagnosis, research, exception handling | High | No | Medium | Agent with bounded tools | Open-ended search, variable tool order, dynamic stopping rules, but must be sandboxed and observed (Anthropic, Chip Huyen) |
| Customer-facing action with money, compliance, refunds, or data mutation | Medium to High | Partly | Very high | Hybrid with approval gates | Let models propose, but never let them silently finalize high-risk actions without deterministic checks or human approval (Anthropic, Gorgias Actions) |
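The decision table can be read as a small, ordered rule set. The sketch below is one possible encoding, not a definitive one: the `TaskProfile` fields, the string values, and the rule order are assumptions made for illustration, and the `high_risk_write` flag stands in for the "money, compliance, refunds, or data mutation" row.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(str, Enum):
    WORKFLOW = "deterministic workflow"
    HYBRID = "hybrid workflow"
    AGENT = "agent with bounded tools"
    HYBRID_GATED = "hybrid with approval gates"


@dataclass(frozen=True)
class TaskProfile:
    # Hypothetical fields mirroring the table columns.
    input_variability: str  # "low" | "medium" | "high"
    path_known: str         # "yes" | "mostly" | "partly" | "no"
    high_risk_write: bool   # customer-facing money, compliance, refund, or data mutation


def recommend_pattern(task: TaskProfile) -> Pattern:
    # High-risk customer-facing writes get approval gates regardless of other traits.
    if task.high_risk_write:
        return Pattern.HYBRID_GATED
    # The tool path cannot be specified in advance: use a bounded agent.
    if task.path_known == "no":
        return Pattern.AGENT
    # Known path and low input variance: a plain workflow is enough.
    if task.path_known == "yes" and task.input_variability == "low":
        return Pattern.WORKFLOW
    # Everything else: LLM at the judgment step, deterministic routing and writes.
    return Pattern.HYBRID
```

Checking the risk flag first is a deliberate choice in this sketch: it keeps a high-variance refund flow from being classified as a free-running agent.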
Real-world examples by quadrant
| Example | Pattern | Why this is the right quadrant |
|---|---|---|
| Update Slack channel, create ticket, and post order delay notice when carrier webhook says “exception” | Deterministic workflow | Input schema and branch logic are known, and Slack/write APIs are explicit (Slack) |
| Parse return-policy docs, classify intent from incoming ticket, then route to a fixed macro or human | Hybrid workflow | LLM handles messy language; routing and execution stay explicit (Anthropic, Gorgias) |
| Investigate why subscription churn spiked across Recharge, Klaviyo, and Triple Whale, then propose hypotheses | Agent | Variable evidence sources and sequence of tool calls cannot be fully predetermined (Recharge, Triple Whale) |
| Draft merchandising recommendations from reviews, survey comments, and ticket summaries, then push approved changes into Shopify | Hybrid with human checkpoint | Open-ended reasoning is helpful, but actual catalog updates should remain explicit (Shopify Sidekick, Yotpo UGC API) |
| Automatically cancel, refund, and modify subscriptions based on free-form customer chat with no approval | Anti-pattern | High write risk, policy risk, and silent failure risk. Use structured intents and approval thresholds instead (Gorgias) |
| Research competitor promos, analyze reviews, propose bundle experiments, and create a planning memo | Agent | Multi-step, unstructured, judgment-heavy, best treated as a bounded research agent task (Chip Huyen) |
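The hybrid row above (classify intent, then route to a fixed macro or a human) can be sketched as follows. This is a minimal illustration under stated assumptions: the intent names, macro names, and `CONFIDENCE_FLOOR` are hypothetical, and the classifier is injected as a plain callable so that any model call can sit behind it.

```python
from typing import Callable, Tuple

# Hypothetical intent-to-macro table; routing lives in code, not in the prompt.
MACROS = {
    "return_request": "macro_return_instructions",
    "order_status": "macro_tracking_link",
}
CONFIDENCE_FLOOR = 0.8  # hypothetical threshold below which a human takes over

# A classifier returns (intent, confidence); in practice this would wrap an LLM call.
Classifier = Callable[[str], Tuple[str, float]]


def route_ticket(text: str, classify: Classifier) -> dict:
    """The model handles messy language; the routing decision stays deterministic."""
    intent, confidence = classify(text)
    macro = MACROS.get(intent)
    # Unknown intents and low-confidence classifications both fall back to a human.
    if macro is None or confidence < CONFIDENCE_FLOOR:
        return {"route": "human", "intent": intent}
    return {"route": "macro", "macro": macro, "intent": intent}
```

Because the classifier is a parameter, the same routing code can be exercised against a stub in tests and against a real model in production.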
Failure modes and prevention
Deterministic workflows fail by becoming brittle spaghetti. Symptoms: too many branches, duplicated logic by tool, and hidden assumptions in Zapier-style glue. Prevention: central schemas, reusable steps, idempotent writes, and replayable event logs (Anthropic).
Agents fail by appearing impressive in demos while degrading silently in production. Hamel Husain’s eval guidance is blunt: if you are not checking outputs against representative datasets, you do not know if the system is improving or regressing (Hamel Husain). Shreya Shankar has made the same case in public talks and posts on production evals: accuracy is contextual, and failure often hides in long-tail cases, not averages (Shreya Shankar).
Hybrid systems fail when the boundary is wrong. Common anti-patterns are letting the model choose tools when the next tool is obvious, letting it write to systems of record without structured validation, and embedding business policy inside prompts instead of code or declarative rules (Anthropic, Simon Willison).
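Two of the prevention items named for deterministic workflows, idempotent writes and replayable event logs, can be illustrated together. The sketch below is in-memory only and assumes a content-derived idempotency key; the class and function names are hypothetical, and a real build would use a durable store and, where the vendor API supports one, its own idempotency mechanism.

```python
import hashlib
import json
from typing import Callable


def idempotency_key(event: dict) -> str:
    """Stable key derived from event content, so retries map to the same write."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EventLog:
    """Append-only in-memory log; a stand-in for a durable event store."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def append(self, event: dict) -> None:
        self.events.append(event)


class IdempotentWriter:
    """Wraps a side-effecting write so the same event is applied at most once."""

    def __init__(self, write: Callable[[dict], None]) -> None:
        self._write = write
        self._seen: set[str] = set()

    def apply(self, event: dict) -> bool:
        key = idempotency_key(event)
        if key in self._seen:
            return False  # duplicate delivery or replay: skip the write
        self._write(event)
        self._seen.add(key)
        return True


def replay(log: EventLog, writer: IdempotentWriter) -> int:
    """Re-run every logged event; returns how many writes were actually performed."""
    return sum(writer.apply(event) for event in log.events)
```

With this shape, replaying the whole log after an outage is safe: events that already produced a write are skipped rather than applied twice.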
2) Architectural principles for forward-compatible builds
10 rules
| Principle | Right way | Wrong way |
|---|---|---|
| 1. Keep systems of record authoritative | Shopify, NetSuite, Klaviyo stay source-of-truth; the AI layer reads and proposes changes through APIs | Store critical customer/order state only inside prompt context or vector stores (NetSuite REST, Shopify Admin) |
| 2. Own the orchestration layer | Put a thin service between model calls and vendor APIs so prompts, retries, auth, and logging are portable | Hard-wire business process inside one vendor’s chatbot builder (Anthropic) |
| 3. Schema-first everything | Use typed inputs, typed outputs, and explicit tool contracts | Parse prose out of model responses and hope regex saves you (Jason Liu) |
| 4. Separate reasoning from execution | LLM proposes action, deterministic service validates and executes | Let the same prompt both decide and mutate external state with no check |
| 5. Prefer event logs over chat transcripts | Store orders, tickets, actions, decisions, and outcomes as structured events | Assume the conversational history is enough to debug or re-run |
| 6. Make prompts replaceable | Version prompts as artifacts, tied to datasets and evals | Leave prompts hidden inside automation builders or support macros |
| 7. Build for human override | Include approve/reject/escalate states in returns, refunds, pricing, and finance flows | Assume “fully autonomous” is the goal for sensitive workflows |
| 8. Treat MCP as an interface, not a strategy | Use MCP where it simplifies tool access, but keep vendor APIs and schemas behind your own contract | Rewrite the architecture around MCP hype before vendor support stabilizes (Model Context Protocol servers) |
| 9. Minimize vendor-specific memory | Keep reusable context in your datastore, not trapped in one assistant product | Train every vendor-native agent separately on the same policies and knowledge base |
| 10. Instrument from day one | Trace prompts, tool calls, latency, cost, outputs, and user corrections | Add observability after stakeholders complain about weird behavior (Langfuse, Braintrust, Helicone) |
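Principles 3, 4, and 7 combine naturally: the model emits a typed proposal, a deterministic service validates it, and a rule decides whether to execute or hold for a human. The following is a sketch under assumptions, not a reference implementation: the `Proposal` fields, `ALLOWED_ACTIONS`, and `AUTO_APPROVE_LIMIT` are hypothetical, and it uses only the standard library where a production build might use a schema-validation library.

```python
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"refund", "escalate", "no_action"}  # hypothetical action set
AUTO_APPROVE_LIMIT = 50.0                              # hypothetical threshold, in dollars


@dataclass(frozen=True)
class Proposal:
    action: str
    order_id: str
    amount: float
    reason: str


def parse_proposal(raw_model_output: str) -> Proposal:
    """Accept only output that matches the contract; never parse prose."""
    data = json.loads(raw_model_output)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != {"action", "order_id", "amount", "reason"}:
        raise ValueError("output does not match the proposal schema")
    if not isinstance(data["action"], str) or data["action"] not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    amount = data["amount"]
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount < 0:
        raise ValueError("amount must be a non-negative number")
    return Proposal(data["action"], str(data["order_id"]), float(amount), str(data["reason"]))


def decide(proposal: Proposal) -> str:
    """Deterministic gate: returns 'execute', 'needs_approval', or 'skip'."""
    if proposal.action == "no_action":
        return "skip"
    if proposal.action == "escalate":
        return "needs_approval"
    return "execute" if proposal.amount <= AUTO_APPROVE_LIMIT else "needs_approval"
```

The approval threshold lives in code rather than in the prompt, so changing policy does not require re-testing model behavior.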
Strongest signal from Levie and peers
The strongest public signal is not “agents replace SaaS.” It is “SaaS becomes the secure substrate and agents become the interaction layer.” Levie’s X posts repeatedly emphasize multi-agent interoperability across enterprise systems like Salesforce, Box, Stripe, ServiceNow, and Workday, plus a consumption layer on top of seat-based software (X, X). Box’s 2025 AI platform launch also leaned hard into open, interoperable agents over enterprise content rather than a closed assistant (Business Wire). Shopify’s trajectory is similar but more commerce-specific: Sidekick can take action in admin, and the AI Toolkit exposes docs, schemas, and a Dev MCP server for developers (Shopify Sidekick, Shopify AI Toolkit). Workday’s Illuminate announcements show the same pattern in finance and HR: domain-specific agents, not general autonomous swarms (Workday).
3) DTC SaaS readiness matrix
Scoring: 1 = weak, 5 = strong. “MCP” reflects publicly verified official or community availability reviewed on 2026-04-17.
| Vendor | API maturity | MCP availability | Agent-native roadmap credibility | Why this score |
|---|---|---|---|---|
| NetSuite | 4 | 1 | 2 | Mature REST and metadata browser, but little public MCP momentum and weak public agent-native signal for mid-market DTC execution (NetSuite REST, REST API Browser) |
| Shopify | 5 | 5 | 5 | Strong APIs, Sidekick shipped in admin, official AI Toolkit and Dev MCP server (Shopify Sidekick, Shopify AI Toolkit) |
| Slack | 5 | 3 | 3 | Excellent Web API, solid for tool actions, but agent-native value is mostly platform-level rather than DTC-specific workflow intelligence (Slack Web API, MCP servers repo) |
| Klaviyo | 5 | 5 | 4 | Mature versioned APIs plus official MCP server docs. Strong developer posture, though agent-native operator experience is earlier than Shopify’s (Klaviyo API Docs, Klaviyo MCP) |
| Gorgias | 4 | 2 | 5 | Good APIs and one of the clearest shipped AI-agent products in DTC support, but public MCP evidence is limited (Gorgias AI Agent, Gorgias AI actions) |
| Recharge | 4 | 2 | 2 | Solid API surface for subscription workflows; public MCP support is limited and agent-native roadmap is less visible (Recharge Dev Docs) |
| Attentive | 4 | 1 | 3 | Real developer docs and integration surface, but public MCP and agent-native execution signal are still modest (Attentive Docs, Attentive Developers) |
| Rebuy | 4 | 1 | 3 | Good commerce APIs and AI-flavored recommendations, but limited public MCP and narrower system-of-record role (Rebuy Developer Hub, Recommended endpoint) |
| Tapcart | 3 | 1 | 2 | Real developer platform and CLI, but limited evidence of agent-native operations beyond platform features (Tapcart Dev Docs, Tapcart CLI) |
| Triple Whale | 3 | 1 | 4 | Data-in positioning and visible AI-agent marketing, but public developer/API evidence is weaker than core systems of record (Triple Whale Data Platform, Triple Whale AI Agents) |
| Postscript | 4 | 1 | 2 | Partner APIs are real, but public MCP and agent platform signals remain limited (Postscript Developers, Getting Started) |
| Yotpo | 4 | 1 | 2 | Usable APIs across UGC and app developer surfaces, but limited public MCP and limited credible agent-native execution signal (Yotpo UGC API, Yotpo App Developer API) |
Interpretation for Pete
For a mid-market DTC stack, Shopify, Klaviyo, Slack, Gorgias, and Recharge are the most practical starting points for workflow-plus-LLM builds today. Shopify and Klaviyo are ahead on developer accessibility and MCP-adjacent posture. Gorgias is ahead on shipped customer-service agent behavior. NetSuite remains important, but it is more likely to be a tightly controlled API target than the place to experiment with autonomy.
4) Timeline calibration
Forecast by category
| Category | Table-stakes horizon for agent-native behavior | Confidence | Why |
|---|---|---|---|
| Customer service | 12 to 24 months | Medium-high | Gorgias and peers already ship brand-trained support agents with actions and measurable support metrics (Gorgias) |
| Analytics / insight copilots | 12 to 24 months | Medium | Strong demand and better tolerance for advisory outputs; signals from Triple Whale, Shopify, and Box all point this way (Triple Whale, Shopify Sidekick, Business Wire) |
| Content / merchandising assistance | 12 to 18 months | High | Human-in-the-loop content generation is already normal and low-risk relative to finance or refunds (Shopify Magic) |
| Cross-system ops automation | 24 to 36 months | Medium | Reliability, auditability, and vendor interoperability still favor workflows; MCP helps, but enterprise agent governance is still maturing (Anthropic, MCP servers repo) |
| Finance / ERP mutations | 36+ months | Medium-high | Workday and enterprise finance agents are still staged rollouts; systems-of-record remain heavily controlled (Workday) |
Evidence behind the forecast
Enterprise adoption is real, but uneven. McKinsey’s latest State of AI reports sustained enterprise deployment growth, especially in marketing, service operations, and software, not uniform autonomous transformation across all functions (McKinsey). Ramp’s AI Index shows software spend concentrating into a handful of AI leaders rather than broad-based, mature standardization across every function (Ramp AI Index). Salesforce, Workday, Microsoft, and Box are all launching agent products, but their own announcements still emphasize governance, low-code builders, and phased rollout, which is a clue that the market is early, not settled (Salesforce, Microsoft Copilot Studio, Workday, Business Wire).
Rewrite horizon for today’s quick wins
- Do not assume a Zapier automation built now must become MCP-native in 12 months. Many trigger-action workflows may never need a full agent rewrite.
- Expect rewrite pressure sooner when the work needs variable tool order, dynamic retrieval, negotiation across systems, or operator-style judgment.
- Expect much longer life for automations that are mostly rules plus a single LLM classification or drafting step.
5) Migration pattern catalog
| Pattern | What it is | Preserved asset | When to use |
|---|---|---|---|
| Prompt-in-a-box | Isolate prompt + model settings behind a callable service | Prompts, version history, eval bindings | When a no-code flow currently embeds prompt text in builder fields |
| Typed handoff | Replace free-text between steps with JSON schemas | Schemas, downstream compatibility | When moving from workflow chain to tool-calling agent |
| Approval sandwich | Agent proposes, deterministic validator checks, human or rule gate approves | Business rules, risk controls | Refunds, discounts, subscription changes |
| Event-sourced agent | Log each observation, tool call, output, and decision as events | Replayability, debugging, eval data | Any multi-step agent or analyst copilot |
| Tool facade | Create your own stable tool names over vendor APIs or MCP servers | Portability across vendor churn | When vendor APIs are unstable or MCP support is emerging |
| Retrieval split-brain | Separate durable business data retrieval from ephemeral conversational memory | Knowledge portability, security | When using multiple assistants across vendors |
| Eval-first refactor | Build regression dataset before changing architecture | Quality baseline | When converting a “working enough” workflow into an agent |
| Shadow mode agent | Run agent in parallel and compare to production workflow before cutover | Safe migration path | High-volume support, triage, analytics |
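As one illustration from the catalog, the tool facade pattern can be sketched as a registry that maps tool names you own to swappable vendor adapters. Everything here is hypothetical: the `ToolFacade` class, the `orders.get` tool name, and both adapters, which return canned data where real ones would call a vendor REST API or an MCP server.

```python
from typing import Callable, Dict

ToolFn = Callable[[dict], dict]


class ToolFacade:
    """Stable, owned tool names mapped to swappable vendor adapters."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        # Re-registering a name swaps the backend without changing the contract.
        self._tools[name] = fn

    def call(self, name: str, args: dict) -> dict:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](args)


# Hypothetical adapters that return canned data instead of calling a vendor.
def get_order_via_rest(args: dict) -> dict:
    return {"order_id": args["order_id"], "source": "rest"}


def get_order_via_mcp(args: dict) -> dict:
    return {"order_id": args["order_id"], "source": "mcp"}


facade = ToolFacade()
facade.register("orders.get", get_order_via_rest)
```

Prompts, workflows, and eval sets reference `orders.get` only, so moving that tool from a REST adapter to an MCP-backed one is a single `register` call rather than a rewrite.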
Wait-for-vendor vs build-now
Waiting for the vendor wins when the capability is already embedded in daily operator workflow, has a clear admin surface, and does not need much cross-system customization. Shopify Sidekick and Gorgias AI Agent are the best examples in this stack (Shopify, Gorgias).
Building custom now wins when the workflow spans multiple tools, needs your own approval policy, or depends on data normalization across systems. That is most of the interesting mid-market DTC opportunity.
6) Evaluation and observability stack
Practitioner norm
The serious pattern is now consistent: tracing + datasets + offline evals + online feedback. Anthropic recommends building simple systems first and only adding complexity when evals prove the need (Anthropic). Hamel Husain argues you need task-specific evals and a habit of testing changes against real examples, not vibes (Hamel Husain).
Practical stack
| Layer | Tools | Why it matters |
|---|---|---|
| Tracing | Langfuse, Helicone | Prompt, latency, token, and tool-call visibility |
| Eval datasets | Braintrust, custom JSONL sets | Regression checks before prompt/model/tool changes |
| Structured outputs | Model-native schema enforcement, Pydantic-style validation, tool schemas | Eliminates a large class of brittle parsing failures (Jason Liu) |
| Human feedback | Inline thumbs-up/down, correction capture, QA review queue | Turns production traffic into eval data |
| Replay / diffing | Event logs and saved traces | Lets you compare models, prompts, vendors, and tool configs over time |
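The tracing layer in the table can be approximated before any vendor is chosen. The sketch below is a stand-in and does not show how Langfuse or Helicone work: `TRACE_EVENTS`, `traced`, and `classify_ticket` are hypothetical names, and the "model call" is a keyword rule so the example runs offline.

```python
import functools
import time
import uuid
from typing import Callable

TRACE_EVENTS: list[dict] = []  # stand-in for a tracing backend


def traced(step: str) -> Callable:
    """Record one structured event per call: inputs, output or error, and latency."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            event = {"trace_id": str(uuid.uuid4()), "step": step,
                     "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                event["status"] = "ok"
                event["output"] = result
                return result
            except Exception as exc:
                event["status"] = "error"
                event["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so every call leaves a trace.
                event["latency_ms"] = (time.perf_counter() - start) * 1000
                TRACE_EVENTS.append(event)
        return wrapper
    return decorator


@traced("classify_ticket")
def classify_ticket(text: str) -> str:
    # Placeholder for a model call; a keyword rule keeps the sketch runnable offline.
    return "refund_request" if "refund" in text.lower() else "other"
```

Because each event is structured data instead of a chat transcript, the same records can later feed replay, diffing, and eval datasets.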
Minimum bar for Pete’s client work
- Trace every model call.
- Keep a 25 to 100 example eval set per workflow.
- Validate every structured output before any write action.
- Separate “looks good in demo” from “passes regression set.”
- Record human overrides as labeled data.
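The minimum bar above implies a small regression harness: a JSONL eval set, a pass-rate check, and a way to fold human overrides back into the set. The following is a minimal sketch assuming exact-match scoring and a hypothetical `input`/`expected` record shape; real workflows often need task-specific graders instead of equality.

```python
import json
from typing import Callable, Iterable


def load_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def run_regression(examples: list[dict],
                   predict: Callable[[str], str],
                   min_pass_rate: float) -> dict:
    """Score a predictor against the eval set using exact match."""
    if not examples:
        raise ValueError("eval set is empty")
    failures = [ex for ex in examples if predict(ex["input"]) != ex["expected"]]
    pass_rate = 1 - len(failures) / len(examples)
    return {
        "pass_rate": pass_rate,
        "passed": pass_rate >= min_pass_rate,
        "failures": failures,
    }


def override_to_example(model_input: str, human_label: str) -> str:
    """Turn a human override into one JSONL line for the eval set."""
    return json.dumps({"input": model_input, "expected": human_label})
```

Running `run_regression` before every prompt, model, or tool change is what separates "passes regression set" from "looks good in demo"; the `failures` list shows which examples regressed.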
7) Watch list
People
- Aaron Levie
- Simon Willison
- Hamel Husain
- Chip Huyen
- Shreya Shankar
- Eugene Yan
- Jason Liu
- Ethan Mollick
- Linus Lee
Repos / standards / publications
- Model Context Protocol servers repo
- Anthropic: Building effective agents
- Shopify AI Toolkit
- Klaviyo MCP server docs
- McKinsey State of AI
- Ramp AI Index
Bottom line
For mid-market DTC in 2026, the winning posture is not “go all in on agents.” It is to ship workflows that can accept agentic reasoning later. Build deterministic rails, preserve schemas and evals, and place autonomy only where task variability justifies it. The best client outcome this year is not maximum autonomy. It is durable leverage without a rebuild.
Speculation clearly labeled
- Speculation: MCP will likely become a standard integration layer for developer-facing SaaS faster than it becomes the primary runtime contract for all business automations. Public momentum is strong, but operational standardization across the full DTC vendor stack is not there yet (MCP servers repo).
- Speculation: Shopify and Klaviyo are the most likely in this stack to become “agent-access hubs” for mid-market DTC workflows over the next 12 to 24 months because both already expose strong developer surfaces and public MCP-adjacent tooling (Shopify AI Toolkit, Klaviyo MCP).