AI Architecture Decision Frameworks for Mid-Market DTC

Research Brief | Draft | Created Apr 17, 2026

Executive summary

Pete should build deterministic workflows first, hybrids second, and full agents last. The practical default for a $50M to $500M DTC brand is to keep business logic explicit, use LLMs only at high-variance judgment steps, and treat agents as a thin planning layer over auditable tools and schemas, not as the system itself. That matches Anthropic's guidance that workflows are better for well-defined tasks while agents are better when the number of steps cannot be hardcoded in advance (Anthropic). It also lines up with Aaron Levie's public argument that SaaS is not disappearing; it is becoming agent-native and interoperable, with agents coordinating across systems rather than replacing business systems of record (X, X).

What to do this quarter:

Ship quick wins as schema-first automations with strong logs, evals, and replayability.
Put an orchestration layer you own between models and SaaS vendors.
Preserve four assets across every build: prompts, structured I/O schemas, eval sets, and business process docs.
Use vendor-native AI only when it is already good enough and already embedded in the operator workflow, especially Shopify Sidekick and Gorgias AI Agent for narrow commerce and support use cases (Shopify, Gorgias).
Assume most DTC “agent-native” rewrites will not be required within 12 months. Customer support and analytics copilots will move sooner; core ops, finance, and cross-system automations still favor APIs and workflows for 24 to 36 months or longer (McKinsey, Ramp AI Index, Workday).

1) Workflow vs. agent decision framework

Production rule of thumb

The useful split is not “old automation vs. new AI.” It is fixed-path execution vs. variable-path reasoning. Lenny’s taxonomy and Anthropic’s “workflows vs agents” framing both converge on that point: if you can specify the path, do it; if the path must be discovered at runtime, agent behavior becomes useful (Lenny’s Newsletter, Anthropic). Simon Willison makes the same operational point from the opposite angle: tool-using systems are powerful, but you should earn complexity only when simpler prompt chains stop working (Simon Willison).

Decision table

Task shape	Input variability	Tool path known in advance?	Risk of silent failure	Recommendation	Why
Deterministic back-office action, e.g. sync order tags, push NetSuite status, trigger Klaviyo segment	Low	Yes	High	Deterministic workflow	Lowest latency, lowest cost, easiest audit trail, easiest rollback (Anthropic, Slack Web API, NetSuite REST)
High-volume content transformation, e.g. classify reviews, draft replies, summarize CX tickets	Medium	Mostly	Medium	Hybrid workflow	Put LLM only at extraction/classification/generation step, keep routing and writes deterministic (Anthropic, Hamel Husain)
Investigation, diagnosis, research, exception handling	High	No	Medium	Agent with bounded tools	Open-ended search, variable tool order, dynamic stopping rules, but must be sandboxed and observed (Anthropic, Chip Huyen)
Customer-facing action with money, compliance, refunds, or data mutation	Medium to High	Partly	Very high	Hybrid with approval gates	Let models propose, but never let them silently finalize high-risk actions without deterministic checks or human approval (Anthropic, Gorgias Actions)
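
The decision table can be read as a small decision function. The following is a minimal sketch, not a rule set from any cited source: the `Task` fields and the `high_risk_write` flag are hypothetical names chosen to mirror the table's columns, and the ordering of the checks is an assumption about which row should win when several apply.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(str, Enum):
    WORKFLOW = "deterministic workflow"
    HYBRID = "hybrid workflow"
    AGENT = "agent with bounded tools"
    HYBRID_GATED = "hybrid with approval gates"


@dataclass(frozen=True)
class Task:
    # Hypothetical descriptors mirroring the table's columns.
    input_variability: str  # "low", "medium", or "high"
    path_known: str         # "yes", "mostly", "partly", or "no"
    high_risk_write: bool   # model-driven money, compliance, refund, or data mutation


def recommend(task: Task) -> Pattern:
    # Assumption: risky model-driven writes get approval gates before anything else.
    if task.high_risk_write:
        return Pattern.HYBRID_GATED
    # The path must be discovered at runtime, so agent behavior becomes useful.
    if task.path_known == "no":
        return Pattern.AGENT
    # Fixed path and low variability: no model is needed.
    if task.path_known == "yes" and task.input_variability == "low":
        return Pattern.WORKFLOW
    # Otherwise keep routing explicit and use the LLM only at the judgment step.
    return Pattern.HYBRID
```

For example, `recommend(Task("high", "no", False))` lands in the agent row, while any task flagged `high_risk_write` is routed to the gated hybrid regardless of its other traits.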

Real-world examples by quadrant

Example	Pattern	Why this is the right quadrant
Update Slack channel, create ticket, and post order delay notice when carrier webhook says “exception”	Deterministic workflow	Input schema and branch logic are known, and Slack/write APIs are explicit (Slack)
Parse return-policy docs, classify intent from incoming ticket, then route to a fixed macro or human	Hybrid workflow	LLM handles messy language; routing and execution stay explicit (Anthropic, Gorgias)
Investigate why subscription churn spiked across Recharge, Klaviyo, and Triple Whale, then propose hypotheses	Agent	Variable evidence sources and sequence of tool calls cannot be fully predetermined (Recharge, Triple Whale)
Draft merchandising recommendations from reviews, survey comments, and ticket summaries, then push approved changes into Shopify	Hybrid with human checkpoint	Open-ended reasoning is helpful, but actual catalog updates should remain explicit (Shopify Sidekick, Yotpo UGC API)
Automatically cancel, refund, and modify subscriptions based on free-form customer chat with no approval	Anti-pattern	High write risk, policy risk, and silent failure risk. Use structured intents and approval thresholds instead (Gorgias)
Research competitor promos, analyze reviews, propose bundle experiments, and create a planning memo	Agent	Multi-step, unstructured, judgment-heavy, best treated as a bounded research agent task (Chip Huyen)
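
The hybrid ticket-routing example can be sketched as follows. This is an illustration under stated assumptions: the intent labels, macro names, and the 0.8 confidence floor are hypothetical, and `fake_classifier` is a stand-in for a model call, not any vendor's API.

```python
from typing import Callable

# Fixed routing table: intent label -> macro name. All names are hypothetical.
MACROS = {
    "return_request": "macro_return_instructions",
    "order_status": "macro_tracking_link",
}

CONFIDENCE_FLOOR = 0.8  # assumed threshold; tune it against an eval set


def route_ticket(text: str, classify: Callable[[str], tuple[str, float]]) -> str:
    """Return a macro name or 'human'. `classify` wraps the LLM call."""
    intent, confidence = classify(text)
    # Unknown labels and low-confidence results escalate rather than guess.
    if intent not in MACROS or confidence < CONFIDENCE_FLOOR:
        return "human"
    return MACROS[intent]


def fake_classifier(text: str) -> tuple[str, float]:
    # Stand-in for a model call so the sketch runs offline.
    if "return" in text.lower():
        return ("return_request", 0.93)
    return ("other", 0.40)
```

The model only produces a label and a score. Which macro fires, and when a human takes over, stays in ordinary code that can be audited and changed without touching a prompt.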

Failure modes and prevention

Deterministic workflows fail by becoming brittle spaghetti. Symptoms: too many branches, duplicated logic by tool, and hidden assumptions in Zapier-style glue. Prevention: central schemas, reusable steps, idempotent writes, and replayable event logs (Anthropic).
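
Idempotent writes and a replayable event log can be sketched in a few lines. This is a minimal in-memory illustration, assuming a hypothetical order-tagging event shape (`order_id`, `tag`); a real build would use a durable store and whatever idempotency mechanism the target API provides.

```python
import hashlib
import json


class EventLog:
    """In-memory stand-in for a durable, append-only event store."""

    def __init__(self) -> None:
        self.events: list[dict] = []
        self._seen: set[str] = set()

    @staticmethod
    def idempotency_key(event: dict) -> str:
        # Same payload gives the same key, regardless of dict key order.
        canonical = json.dumps(event, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def append(self, event: dict) -> bool:
        """Record the event once; return False for a duplicate delivery."""
        key = self.idempotency_key(event)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.events.append(event)
        return True


def apply_tag(state: dict, event: dict) -> None:
    # Set semantics make re-applying the same event harmless.
    state.setdefault(event["order_id"], set()).add(event["tag"])


def replay(log: EventLog) -> dict:
    """Rebuild state from scratch by re-running every logged event."""
    state: dict = {}
    for event in log.events:
        apply_tag(state, event)
    return state
```

A webhook delivered twice produces one logged event, and `replay` reconstructs the tag state from the log alone, which is what makes debugging and re-running possible later.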

Agents fail by appearing impressive in demos while degrading silently in production. Hamel Husain’s eval guidance is blunt: if you are not checking outputs against representative datasets, you do not know if the system is improving or regressing (Hamel Husain). Shreya Shankar has made the same case in public talks and posts on production evals: accuracy is contextual, and failure often hides in long-tail cases, not averages (Shreya Shankar).

Hybrid systems fail when the boundary is wrong. Common anti-patterns are letting the model choose tools when the next tool is obvious, letting it write to systems of record without structured validation, and embedding business policy inside prompts instead of code or declarative rules (Anthropic, Simon Willison).

2) Architectural principles for forward-compatible builds

10 rules

Principle	Right way	Wrong way
1. Keep systems of record authoritative	Shopify, NetSuite, Klaviyo stay source-of-truth; the AI layer reads and proposes changes through APIs	Store critical customer/order state only inside prompt context or vector stores (NetSuite REST, Shopify Admin)
2. Own the orchestration layer	Put a thin service between model calls and vendor APIs so prompts, retries, auth, and logging are portable	Hard-wire business process inside one vendor’s chatbot builder (Anthropic)
3. Schema-first everything	Use typed inputs, typed outputs, and explicit tool contracts	Parse prose out of model responses and hope regex saves you (Jason Liu)
4. Separate reasoning from execution	LLM proposes action, deterministic service validates and executes	Let the same prompt both decide and mutate external state with no check
5. Prefer event logs over chat transcripts	Store orders, tickets, actions, decisions, and outcomes as structured events	Assume the conversational history is enough to debug or re-run
6. Make prompts replaceable	Version prompts as artifacts, tied to datasets and evals	Leave prompts hidden inside automation builders or support macros
7. Build for human override	Include approve/reject/escalate states in returns, refunds, pricing, and finance flows	Assume “fully autonomous” is the goal for sensitive workflows
8. Treat MCP as an interface, not a strategy	Use MCP where it simplifies tool access, but keep vendor APIs and schemas behind your own contract	Rewrite the architecture around MCP hype before vendor support stabilizes (Model Context Protocol servers)
9. Minimize vendor-specific memory	Keep reusable context in your datastore, not trapped in one assistant product	Train every vendor-native agent separately on the same policies and knowledge base
10. Instrument from day one	Trace prompts, tool calls, latency, cost, outputs, and user corrections	Add observability after stakeholders complain about weird behavior (Langfuse, Braintrust, Helicone)
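
Rules 3 and 4 combine naturally: the model emits a structured proposal, and a deterministic service validates it before anything executes. The sketch below uses only the standard library; the action names and the three-field proposal shape are hypothetical, and a production build might use a validation library or model-native schema enforcement instead.

```python
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"add_order_tag", "send_macro"}  # hypothetical tool names


@dataclass(frozen=True)
class ProposedAction:
    action: str
    order_id: str
    argument: str


def parse_proposal(raw: str) -> ProposedAction:
    """Turn model output into a typed object or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    expected = {"action", "order_id", "argument"}
    if set(data) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    if not all(isinstance(data[key], str) for key in expected):
        raise ValueError("all fields must be strings")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action']}")
    return ProposedAction(**data)


def execute(proposal: ProposedAction, audit_log: list) -> None:
    # The deterministic side: only validated proposals reach this point.
    # Here the "write" is just an audit entry; a real service would call an API.
    audit_log.append((proposal.action, proposal.order_id, proposal.argument))
```

Anything that is not valid JSON, has extra or missing keys, or names an action outside the allowlist is rejected before `execute` is ever called.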

Strongest signal from Levie and peers

The strongest public signal is not “agents replace SaaS.” It is “SaaS becomes the secure substrate and agents become the interaction layer.” Levie’s X posts repeatedly emphasize multi-agent interoperability across enterprise systems like Salesforce, Box, Stripe, ServiceNow, and Workday, plus a consumption layer on top of seat-based software (X, X). Box’s 2025 AI platform launch also leaned hard into open, interoperable agents over enterprise content rather than a closed assistant (Business Wire). Shopify’s trajectory is similar but more commerce-specific: Sidekick can take action in admin, and the AI Toolkit exposes docs, schemas, and a Dev MCP server for developers (Shopify Sidekick, Shopify AI Toolkit). Workday’s Illuminate announcements show the same pattern in finance and HR: domain-specific agents, not general autonomous swarms (Workday).

3) DTC SaaS readiness matrix

Scoring: 1 = weak, 5 = strong. “MCP” reflects publicly verified official or community availability reviewed on 2026-04-17.

Vendor	API maturity	MCP availability	Agent-native roadmap credibility	Why this score
NetSuite	4	1	2	Mature REST and metadata browser, but little public MCP momentum and weak public agent-native signal for mid-market DTC execution (NetSuite REST, REST API Browser)
Shopify	5	5	5	Strong APIs, Sidekick shipped in admin, official AI Toolkit and Dev MCP server (Shopify Sidekick, Shopify AI Toolkit)
Slack	5	3	3	Excellent Web API, solid for tool actions, but agent-native value is mostly platform-level rather than DTC-specific workflow intelligence (Slack Web API, MCP servers repo)
Klaviyo	5	5	4	Mature versioned APIs plus official MCP server docs. Strong developer posture, though agent-native operator experience is earlier than Shopify’s (Klaviyo API Docs, Klaviyo MCP)
Gorgias	4	2	5	Good APIs and one of the clearest shipped AI-agent products in DTC support, but public MCP evidence is limited (Gorgias AI Agent, Gorgias AI actions)
Recharge	4	2	2	Solid API surface for subscription workflows; public MCP support is limited and agent-native roadmap is less visible (Recharge Dev Docs)
Attentive	4	1	3	Real developer docs and integration surface, but public MCP and agent-native execution signal are still modest (Attentive Docs, Attentive Developers)
Rebuy	4	1	3	Good commerce APIs and AI-flavored recommendations, but limited public MCP and narrower system-of-record role (Rebuy Developer Hub, Recommended endpoint)
Tapcart	3	1	2	Real developer platform and CLI, but limited evidence of agent-native operations beyond platform features (Tapcart Dev Docs, Tapcart CLI)
Triple Whale	3	1	4	Data-in positioning and visible AI-agent marketing, but public developer/API evidence is weaker than core systems of record (Triple Whale Data Platform, Triple Whale AI Agents)
Postscript	4	1	2	Partner APIs are real, but public MCP and agent platform signals remain limited (Postscript Developers, Getting Started)
Yotpo	4	1	2	Usable APIs across UGC and app developer surfaces, but limited public MCP and limited credible agent-native execution signal (Yotpo UGC API, Yotpo App Developer API)

Interpretation for Pete

For a mid-market DTC stack, Shopify, Klaviyo, Slack, Gorgias, and Recharge are the most practical starting points for workflow-plus-LLM builds today. Shopify and Klaviyo are ahead on developer accessibility and MCP-adjacent posture. Gorgias is ahead on shipped customer-service agent behavior. NetSuite remains important, but it is more likely to be a tightly controlled API target than the place to experiment with autonomy.

4) Timeline calibration

Forecast by category

Category	Table-stakes horizon for agent-native behavior	Confidence	Why
Customer service	12 to 24 months	Medium-high	Gorgias and peers already ship brand-trained support agents with actions and measurable support metrics (Gorgias)
Analytics / insight copilots	12 to 24 months	Medium	Strong demand and better tolerance for advisory outputs; Triple Whale, Shopify, and Box-type signals point this way (Triple Whale, Shopify Sidekick, Business Wire)
Content / merchandising assistance	12 to 18 months	High	Human-in-the-loop content generation is already normal and low-risk relative to finance or refunds (Shopify Magic)
Cross-system ops automation	24 to 36 months	Medium	Reliability, auditability, and vendor interoperability still favor workflows; MCP helps, but enterprise agent governance is still maturing (Anthropic, MCP servers repo)
Finance / ERP mutations	36+ months	Medium-high	Workday and enterprise finance agents are still staged rollouts; systems-of-record remain heavily controlled (Workday)

Evidence behind the forecast

Enterprise adoption is real, but uneven. McKinsey’s latest State of AI reports sustained enterprise deployment growth, especially in marketing, service operations, and software, not uniform autonomous transformation across all functions (McKinsey). Ramp’s AI Index shows software spend concentrating into a handful of AI leaders rather than broad-based, mature standardization across every function (Ramp AI Index). Salesforce, Workday, Microsoft, and Box are all launching agent products, but their own announcements still emphasize governance, low-code builders, and phased rollout, which is a clue that the market is early, not settled (Salesforce, Microsoft Copilot Studio, Workday, Business Wire).

Rewrite horizon for today’s quick wins

Do not assume a Zapier automation built now must become MCP-native in 12 months. Many trigger-action workflows may never need a full agent rewrite.
Expect rewrite pressure sooner when the work needs variable tool order, dynamic retrieval, negotiation across systems, or operator-style judgment.
Expect much longer life for automations that are mostly rules plus a single LLM classification or drafting step.

5) Migration pattern catalog

Pattern	What it is	Preserved asset	When to use
Prompt-in-a-box	Isolate prompt + model settings behind a callable service	Prompts, version history, eval bindings	When a no-code flow currently embeds prompt text in builder fields
Typed handoff	Replace free-text between steps with JSON schemas	Schemas, downstream compatibility	When moving from workflow chain to tool-calling agent
Approval sandwich	Agent proposes, deterministic validator checks, human or rule gate approves	Business rules, risk controls	Refunds, discounts, subscription changes
Event-sourced agent	Log each observation, tool call, output, and decision as events	Replayability, debugging, eval data	Any multi-step agent or analyst copilot
Tool facade	Create your own stable tool names over vendor APIs or MCP servers	Portability across vendor churn	When vendor APIs are unstable or MCP support is emerging
Retrieval split-brain	Separate durable business data retrieval from ephemeral conversational memory	Knowledge portability, security	When using multiple assistants across vendors
Eval-first refactor	Build regression dataset before changing architecture	Quality baseline	When converting a “working enough” workflow into an agent
Shadow mode agent	Run agent in parallel and compare to production workflow before cutover	Safe migration path	High-volume support, triage, analytics
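
As one concrete instance from the catalog, the approval sandwich for refunds can be sketched as a pure function. The `RefundProposal` shape and the 25.00 auto-approve limit are assumptions for illustration only; the actual policy values belong to the client.

```python
from dataclasses import dataclass
from decimal import Decimal

# Assumed policy value for illustration only.
AUTO_APPROVE_LIMIT = Decimal("25.00")


@dataclass(frozen=True)
class RefundProposal:
    order_id: str
    amount: Decimal
    order_total: Decimal


def gate(proposal: RefundProposal) -> str:
    """Return 'reject', 'approve', or 'escalate' for an agent-proposed refund."""
    # Deterministic validation: impossible amounts never reach a human queue.
    if proposal.amount <= 0 or proposal.amount > proposal.order_total:
        return "reject"
    # Rule gate: small refunds pass automatically.
    if proposal.amount <= AUTO_APPROVE_LIMIT:
        return "approve"
    # Everything else waits for a person.
    return "escalate"
```

Because the gate is plain code with no model call, the business rule survives any later change of model, prompt, or agent framework.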

Wait-for-vendor vs build-now

Wait-for-vendor wins when the capability is already embedded in the daily operator workflow, has a clear admin surface, and does not need much cross-system customization. Shopify Sidekick and Gorgias AI Agent are the best examples in this stack (Shopify, Gorgias).

Build-now wins when the workflow spans multiple tools, needs your own approval policy, or depends on data normalization across systems. That describes most of the interesting mid-market DTC opportunity.

6) Evaluation and observability stack

Practitioner norm

The serious pattern is now consistent: tracing + datasets + offline evals + online feedback. Anthropic recommends building simple systems first and only adding complexity when evals prove the need (Anthropic). Hamel Husain argues you need task-specific evals and a habit of testing changes against real examples, not vibes (Hamel Husain).

Practical stack

Layer	Tools	Why it matters
Tracing	Langfuse, Helicone	Prompt, latency, token, and tool-call visibility
Eval datasets	Braintrust, custom JSONL sets	Regression checks before prompt/model/tool changes
Structured outputs	Model-native schema enforcement, Pydantic-style validation, tool schemas	Eliminates a large class of brittle parsing failures (Jason Liu)
Human feedback	Inline thumbs-up/down, correction capture, QA review queue	Turns production traffic into eval data
Replay / diffing	Event logs and saved traces	Lets you compare models, prompts, vendors, and tool configs over time
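
The tracing layer can start as a thin decorator before any vendor tool is adopted. The sketch below is a standard-library stand-in and does not use the Langfuse or Helicone APIs; the `TRACES` list, the `traced` decorator, and `draft_reply` are hypothetical names.

```python
import functools
import time

TRACES: list[dict] = []  # stand-in for a tracing backend


def traced(step_name: str):
    """Decorator that records one trace row per call, including failures."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            row = {"step": step_name, "args": args, "kwargs": kwargs}
            try:
                row["output"] = fn(*args, **kwargs)
                row["ok"] = True
                return row["output"]
            except Exception as exc:
                row["ok"] = False
                row["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so every call leaves a row.
                row["latency_s"] = time.perf_counter() - start
                TRACES.append(row)
        return wrapper
    return decorator


@traced("draft_reply")
def draft_reply(ticket_text: str) -> str:
    # Placeholder for a model call.
    return f"Thanks for reaching out about: {ticket_text}"
```

Token counts and cost would be added to the row where the real model client exposes them. The saved rows double as raw material for the replay and diffing layer.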

Minimum bar for Pete’s client work

Trace every model call.
Keep a 25 to 100 example eval set per workflow.
Validate every structured output before any write action.
Separate “looks good in demo” from “passes regression set.”
Record human overrides as labeled data.
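
A regression set and runner for the bar above can be very small. This sketch assumes a hypothetical JSONL row shape (`input`, `expected`) and uses three inline rows in place of a real 25 to 100 example file; `keyword_baseline` is a toy predictor standing in for the workflow under test.

```python
import json

# Three inline rows stand in for a 25 to 100 example JSONL file.
EVAL_JSONL = """\
{"input": "I want to send these shoes back", "expected": "return_request"}
{"input": "Where is my package?", "expected": "order_status"}
{"input": "Do you sell gift cards?", "expected": "other"}
"""


def load_cases(jsonl_text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def run_eval(predict, cases: list[dict]) -> dict:
    """Score `predict` against labeled cases and keep the failures for review."""
    failures = []
    for case in cases:
        got = predict(case["input"])
        if got != case["expected"]:
            failures.append({**case, "got": got})
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


def keyword_baseline(text: str) -> str:
    # Toy predictor; swap in the real prompt or workflow under test.
    lowered = text.lower()
    if "back" in lowered or "return" in lowered:
        return "return_request"
    if "package" in lowered or "order" in lowered:
        return "order_status"
    return "other"
```

Running `run_eval` before and after every prompt, model, or tool change gives a pass rate to compare and a list of failing rows to read. Human overrides captured in production can be appended to the same file as new labeled rows.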

7) Watch list

People

Repos / standards / publications

Bottom line

For mid-market DTC in 2026, the winning posture is not “go all in on agents.” It is to ship workflows that can accept agentic reasoning later. Build deterministic rails, preserve schemas and evals, and place autonomy only where task variability justifies it. The best client outcome this year is not maximum autonomy. It is durable leverage without a rebuild.

Speculation clearly labeled

Speculation: MCP will likely become a standard integration layer for developer-facing SaaS faster than it becomes the primary runtime contract for all business automations. Public momentum is strong, but operational standardization across the full DTC vendor stack is not there yet (MCP servers repo).
Speculation: Shopify and Klaviyo are the most likely in this stack to become “agent-access hubs” for mid-market DTC workflows over the next 12 to 24 months because both already expose strong developer surfaces and public MCP-adjacent tooling (Shopify AI Toolkit, Klaviyo MCP).