[01]
Multi-Agent Reporting for E-commerce
Drew AI · 2025 · LLM · LangGraph · MCP
multi-agent · prompt eng · llm · e-commerce

When I joined Drew AI, the first instinct on the team was to build one LLM that could answer any e-commerce question.

That was wrong.

Marketers don't ask "any question." A Meta Ads question needs Meta context. A Google Ads question needs GAQL. A GA4 question needs event-stream understanding. One model trying to do all three becomes mediocre at all three.

So I split the brain.

  • Three dedicated agents, one per data source.
  • Each with its own prompt engineering — optimised for what marketers actually skim, not what looks impressive in a pitch.
  • A router that decides which agent touches the query, and when to chain them.
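
Here's the shape of that router as a minimal LangGraph sketch. The node bodies and the keyword heuristic are placeholders standing in for the real agents and routing model, not Drew AI's production code:

```python
from typing import Literal, TypedDict

from langgraph.graph import END, START, StateGraph


class State(TypedDict):
    query: str
    answer: str


def route(state: State) -> Literal["meta_ads", "google_ads", "ga4"]:
    # Toy keyword heuristic standing in for the real routing model.
    q = state["query"].lower()
    if "gaql" in q or "google" in q:
        return "google_ads"
    if "ga4" in q or "event" in q:
        return "ga4"
    return "meta_ads"  # default to the platform with active spend


def meta_ads(state: State) -> dict:
    return {"answer": f"[meta-ads specialist] {state['query']}"}


def google_ads(state: State) -> dict:
    return {"answer": f"[google-ads specialist] {state['query']}"}


def ga4(state: State) -> dict:
    return {"answer": f"[ga4 specialist] {state['query']}"}


graph = StateGraph(State)
for name, fn in [("meta_ads", meta_ads), ("google_ads", google_ads), ("ga4", ga4)]:
    graph.add_node(name, fn)
    graph.add_edge(name, END)
graph.add_conditional_edges(START, route)  # router picks exactly one specialist
app = graph.compile()

print(app.invoke({"query": "Why did my GA4 purchase events drop last week?"}))
```

Chaining is the same pattern with extra edges between specialists; the point is that each node keeps its own prompt, tools, and context window.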

The result is a reporting layer that feels like talking to three specialists who happen to share a desk — not one generalist trying to remember everything.

The tradeoff is real: more system complexity, more eval surface area, more chances for the agents to disagree. But the answers are sharper. And "sharp" is what gets Shopify founders to come back tomorrow.

In agentic systems, depth beats breadth far earlier than you think.
Query distribution across agents: Meta Ads 45% · Google Ads 30% · GA4 25%
Directional · agent routing weighted to the platform with active spend
Stack: LangGraph · OpenAI/Anthropic APIs · custom router · MCP · Shopify · Meta Ads API · Google Ads (GAQL) · GA4
[02]
LLM Eval Infrastructure with LangSmith
Drew AI · 2025 · LangSmith · evals
evals · langsmith · hallucination detection · tooling

You can't ship an AI product without evals. Everyone says this. Almost no one does it.

I built ours in LangSmith with four tiers:

  • Data accuracy — does the agent's number match the source?
  • Hallucination detection — is it inventing campaigns that don't exist?
  • Recommendation quality — would a real growth marketer act on this?
  • Latency — slow truth is a bad product.
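
As a sketch of what one tier looks like, a data-accuracy check can be a plain LangSmith custom evaluator run against the golden dataset. The dataset name, the agent stand-in, and the output keys here are hypothetical:

```python
from langsmith.evaluation import evaluate


def data_accuracy(run, example) -> dict:
    # Tier 1: does the agent's number match the known-good answer?
    got = (run.outputs or {}).get("value")
    want = example.outputs["value"]
    score = 1.0 if got is not None and abs(got - want) < 1e-6 else 0.0
    return {"key": "data_accuracy", "score": score}


def my_agent(inputs: dict) -> dict:
    # Stand-in for the real reporting agent under test.
    return {"value": 42.0}


results = evaluate(
    my_agent,
    data="golden-shop-queries",   # curated set of real queries + answers
    evaluators=[data_accuracy],   # one evaluator per tier in practice
)
```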

Each tier has its own evaluator, its own threshold, its own alert. We catch regressions before users do.

The unsexy truth: most of the work isn't in the model. It's in the eval data — the curated set of real shop queries with known-good answers — and the discipline to update it as the product evolves.

  • Without evals, every model upgrade is a coin flip.
  • With evals, every upgrade has a scorecard.
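
The scorecard itself can be a dumb gate. A hypothetical CI check against the 80% threshold shown below:

```python
# Hypothetical CI gate: block the release if any tier breaches its threshold.
THRESHOLD = 0.80

pass_rates = {"data_accuracy": 0.96, "hallucination": 0.92,
              "rec_quality": 0.88, "latency": 0.94}

breached = {tier: rate for tier, rate in pass_rates.items() if rate < THRESHOLD}
if breached:
    raise SystemExit(f"Eval regression, blocking release: {breached}")
```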

Building eval infrastructure first felt like a detour. It wasn't. It's the only reason we can move fast now.
Eval tier pass rates · last release: data accuracy 96% · hallucination 92% · rec quality 88% · latency 94% (threshold = 80%, alert if breached)
Directional · 4-tier eval system in LangSmith, run on every release
Stack: LangSmith · custom evaluators · curated golden dataset · CI integration · alerting
[03]
Designing for Trust — Confidence Scoring + HITL
Drew AI · 2025 · UX · AI safety
human-in-the-loop · confidence scoring · trust · ux

E-commerce founders have been burned by "AI-powered insights" before. So when Drew AI says "your CAC is up 18% this week," the next question is always: should I trust this?

We built three things for that:

  • Confidence scores attached to every recommendation, surfaced in plain English — not 0.87 floats.
  • Source citations — the exact data slice the answer came from, one click away.
  • Human-in-the-loop checkpoints on anything that touches budget or audience.
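
A minimal sketch of how those pieces fit together. The band labels and the 0.6 HITL cutoff mirror the distribution note below; the function names are illustrative:

```python
def confidence_label(score: float) -> str:
    # Surface plain English, not 0.87 floats.
    if score >= 0.80:
        return "High confidence"
    if score >= 0.60:
        return "Confident"
    if score >= 0.40:
        return "Mixed signals — treat as directional"
    return "Low confidence — needs review"


def needs_human_review(score: float, touches_budget_or_audience: bool) -> bool:
    # Budget/audience actions always get a checkpoint, regardless of score.
    return touches_budget_or_audience or score < 0.60
```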

It's slower than fully autonomous. That's the point.

A wrong autonomous decision burns trust permanently. A reviewed decision builds it.

The hardest part wasn't engineering — it was UX. Confidence has to feel native, not bolted on. Founders shouldn't need a tutorial to read it.

Trust isn't a feature. It's the substrate.
Confidence score distribution · last 1,000 answers (binned 0–20 through 80–100, confidence %)
Directional · ~78% of answers ship with high confidence (60+); low-confidence ones route to HITL
What I optimise for: auditability · explainability · founder-readability · zero "magic" steps
[04]
The MCP Bet — Drew AI across the D2C stack
Drew AI · 2025 · MCP · integrations
mcp · slack · notion · integrations

When Anthropic shipped MCP, I made a call: every Drew AI integration would be MCP-first.

The reasoning was simple. A D2C founder doesn't live in one tool. They live across Slack, Notion, Linear, their ad platforms, their analytics dashboards.

If Drew AI sits inside any one of those, it's a feature.
If it sits across all of them via a shared protocol, it's the connective layer.

So Drew AI now talks to Slack, Notion, Linear, Google Ads, and Meta Ads through MCP servers — same agent brain, different surfaces.

  • Ask a question in Slack, get the answer threaded back.
  • Drew AI files an insights brief in Notion every Monday.
  • Anomalies open as Linear tickets automatically.
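
Each surface is just a thin MCP server over the same brain. A minimal sketch using the official Python SDK's FastMCP; the tool and its body are hypothetical:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("drew-ai")


@mcp.tool()
def weekly_brief(shop_id: str) -> str:
    """Build the Monday insights brief for a shop (stand-in logic)."""
    return f"Insights brief for {shop_id}: CAC, ROAS, spend anomalies"


if __name__ == "__main__":
    mcp.run()  # stdio transport; Slack/Notion/Linear clients connect over MCP
```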

The bet was that MCP would become the standard. So far, it's playing out.

Where you embed an AI product matters as much as what the product does.
Drew AI MCP topology · 5 surfaces, 1 brain: Slack · Notion · Linear · Meta Ads · Google Ads
MCP-first integrations · same agent brain across every surface a D2C operator lives in
Stack: Anthropic MCP · custom MCP servers · Slack · Notion · Linear · Google Ads · Meta Ads
[05]
A/B/n Testing at $100M Scale
American Express · 2020–2025 · experimentation · stats
experimentation · causal inference · scale · growth

At American Express, the question wasn't "should we run experiments." It was: how do we run them at the scale of 6+ international markets without drowning in noise?

The system I led handled 30+ tests a year, each tied to revenue lines that mattered.

A few things that compound at scale:

  • Decision velocity is more valuable than statistical purity past a point. Cutting cycle time by ~1.5 weeks compounds across 30 experiments — that's 45 weeks back into the roadmap.
  • Cohort-targeted tests beat generic variants by 18–25%. Personalisation isn't a tactic; it's the experiment unit.
  • The experiments that move the needle aren't the ones that look exciting in a deck. They're the ones that fix the cracked tile in the funnel everyone walked past.
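
The readout behind each of those 30 tests is standard machinery. A minimal two-proportion z-test sketch — the standard calculation the uplift framework builds on, not the Amex framework itself:

```python
from statistics import NormalDist


def lift_readout(conv_a: int, n_a: int, conv_b: int, n_b: int) -> dict:
    # Two-proportion z-test: control (a) vs variant (b).
    pa, pb = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (pb - pa) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return {"lift": (pb - pa) / pa, "z": z, "p_value": p_value}


# e.g. a "small" ~1.5% relative lift at scale (made-up numbers):
print(lift_readout(conv_a=5_000, n_a=100_000, conv_b=5_075, n_b=100_000))
```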

Referral conversions improved ~15%. Revenue impact crossed $100M.

Ship more, faster, smarter — in that order.
Lift distribution across 30 experiments at Amex (buckets from <0% to >20%, N = 30)
Directional · most wins are small (0–5%); the system makes them compound
Stack: Adobe Target · Adobe Analytics · SQL · Python · custom uplift framework · cohort design
[06]
ML Anomaly Detection for Revenue Pages
American Express · 2022 · ML · time-series
anomaly detection · time-series · alerting · ml

Picture this: a deploy goes wrong on a revenue-critical page at 11pm on a Friday. Traffic drops 40%. Nobody notices till Monday morning. By then, the analysts are calculating the loss in millions.

That actually happened. So I built the thing that would have caught it.

The model learns the normal traffic shape per page, per market, per hour. It flags drops the moment they break baseline — not the next morning, not the next dashboard refresh.
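
In sketch form: a robust per-(page, market, hour-of-week) baseline with a MAD band. The column names and the sensitivity constant k are assumptions for illustration, not the production model:

```python
import pandas as pd


def fit_baselines(df: pd.DataFrame) -> pd.DataFrame:
    # df columns: ts (hourly timestamp), page, market, visits
    df = df.assign(how=df["ts"].dt.dayofweek * 24 + df["ts"].dt.hour)
    grouped = df.groupby(["page", "market", "how"])["visits"]
    base = grouped.median().rename("median").to_frame()
    # Median absolute deviation: robust to the spikes we *don't* want to learn.
    base["mad"] = grouped.apply(lambda s: (s - s.median()).abs().median())
    return base.reset_index()


def is_drop(visits: float, median: float, mad: float, k: float = 4.0) -> bool:
    # k tunes sensitivity: too low = alert fatigue, too high = missed drops.
    return visits < median - k * max(mad, 1.0)
```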

  • Sensitivity tuning was harder than the model itself. Too sensitive: alert fatigue. Too lax: you miss the thing you built it for.
  • Routing alerts to the right team within 5 minutes was the actual win — the model is the easy part.

The detector caught regressions in the first month that would have slipped past humans. Pays for itself in one save.

Anomaly detection is 20% model, 80% operations.
Traffic baseline + anomaly detection · sample window: per-page, per-market, per-hour baseline; alert fired in <5 min
Illustrative · the model learns the normal traffic shape and flags drops the moment they break it
Stack: Python · statistical baselines · seasonality decomposition · custom alert routing · cross-market deployment
[07]
SEO at 150K URLs
American Express · 2021–2024 · SEO · scale ops
seo · technical seo · crawlability · enterprise

Most SEO conversations happen at the level of "let's optimise this landing page."

At Amex, the surface was 150,000+ URLs. Different math.

The work split three ways:

  • Technical SEO audits at scale — crawlability, sitemap hygiene, redirect chains. Boring, high-leverage.
  • URL health checks — orphan pages, broken canonicals, indexation gaps. The kind of issues no one notices until traffic disappears.
  • Keyword relevance optimisation — partnering with content and engineering to rebuild discoverability for the URLs that actually drive revenue.

What I learned: at this scale, you don't fix SEO page by page. You fix it pattern by pattern. One template change cascades to thousands of URLs.
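
Concretely, the triage step looks like collapsing crawl output into template buckets so one fix targets a rule, not a page. The regexes here are hypothetical examples of such rules:

```python
import re
from collections import Counter

# Hypothetical template rules; each regex stands for one page template.
TEMPLATES = [
    ("card-detail", re.compile(r"^/credit-cards/[^/]+/$")),
    ("offer-page", re.compile(r"^/offers/[^/]+/[^/]+$")),
]


def bucket(url_path: str) -> str:
    for name, pattern in TEMPLATES:
        if pattern.match(url_path):
            return name
    return "other"


def triage(crawl_rows: list[tuple[str, str]]) -> list:
    # crawl_rows: (url_path, issue) pairs, e.g. ("/offers/x/y", "broken-canonical")
    counts = Counter((bucket(path), issue) for path, issue in crawl_rows)
    # Top (template, issue) pairs are the rules to fix first: one fix, many URLs.
    return counts.most_common()
```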

The right unit of work is the rule, not the page.
URL health distribution · 150K+ URLs at start of audit: healthy 62% · orphan 21% · broken 15% · redirect issues ~2%; post-fix → ~92% healthy after pattern-level fixes (template & rule changes)
Directional · the right unit of work was the rule, not the page
Stack: Screaming Frog · custom crawlers · BigQuery · Search Console API · Adobe Analytics
[08]
Theory of Constraints at a Steel Plant
Jindal Steel · 2014 · ops · supply chain
supply chain · throughput · operations · first job

My first analytics job wasn't in a tech company. It was on the floor of a steel plant.

The plant was running below capacity, with inventory piling up. The accepted explanation was "demand variability."

I read Goldratt's Theory of Constraints and realised the plant didn't have a demand problem — it had a bottleneck problem.

  • We mapped the production flow.
  • Found the constraint (a specific furnace stage).
  • Subordinated everything else to that constraint.
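
The underlying arithmetic is blunt: system throughput equals the constraint's rate, so everything upstream gets gated to it. A toy illustration with made-up stage rates:

```python
# Made-up stage rates (tonnes/hour); the furnace is the constraint.
rates = {"melt_shop": 140, "furnace": 90, "caster": 120, "finishing": 150}

constraint = min(rates, key=rates.get)   # -> "furnace"
throughput = rates[constraint]           # the system cannot exceed 90 t/hr

# Subordination (drum-buffer-rope): release material at the constraint's pace,
# so upstream stages stop producing inventory the furnace can't absorb.
release_rate = throughput
print(constraint, throughput, release_rate)
```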

Inventory dropped ~17%. On-time-in-full (OTIF) delivery rates improved.

Most "demand" problems are physics problems in disguise.

This was the project that turned me from a mechanical engineer who knew analytics into an analyst who happened to know how machines work. The order changed the rest of my career.

Inventory before/after Theory of Constraints · Jindal 2014: before 100 (indexed) → after 83 (−17%) · driver: bottleneck-first scheduling (furnace stage)
Directional · the demand "problem" was a physics problem in disguise
Stack: Excel · production-flow modelling · Goldratt's Theory of Constraints · plant-floor observation