Between 2021 and 2025, while running group IT at a diversified Omani conglomerate, I built an AI Factory that deployed 55+ autonomous agents into production across 12 business functions, reached 300% ROI on transformation spend, and reduced operational cost by 35%. The detailed version is the AI Factory case study. This is the playbook version, the how, written the way I wish someone had written it for me in 2021.
If you’re reading this because your own AI programme has stalled at pilot, take the single most important number from the 2025 MIT study and paste it above your desk: 95% of generative AI pilots never reach production. This is not a model problem. It is an operating-discipline problem. The 5% that ship do the same unsexy things in the same order.
Here is that order.
Phase 0, the framing
Before you touch a model, a vendor, or a slide deck, make one decision:
Are you running AI projects, or are you running an AI factory?
Projects have charters, launch dates, and handovers. Agents deployed under a project model degrade the moment the consultants leave. Factories have platforms, backlogs, and standing teams. Agents deployed under a factory model compound.
I wrote about why this framing matters in Why AI Factories Beat AI Projects. The short version: if your AI budget is a project line, you will end up in the 95%. If it’s an operating line, you have a shot at the 5%.
Everything that follows assumes the factory framing.
Phase 1, process inventory (the moat everyone skips)
This is where most AI programmes die. Not at model selection. Not at integration. Here.
Before we wrote a single prompt at the conglomerate, we mapped 250+ business processes across 12 functions and scored every one on three axes: automation potential, governance risk, and marginal ROI. This took six months. Nobody wanted to do it. The exec sponsor asked twice whether we could skip it and “learn by doing”. We couldn’t, and we didn’t.
What the inventory actually contains
For every process:
- Trigger: what starts this process? (email in, form submitted, daily schedule, exception from another system)
- Inputs: what data does it consume, and from where?
- Steps: the sequence of human actions, with system touches noted
- Outputs: what leaves this process, and to where?
- Exceptions: what can go wrong, how often, and what’s the recovery path?
- Volume: how often this runs per day/week/month
- Frequency of change: is this process stable or does it shift monthly?
- Governance sensitivity: does it touch financials, PII, safety, or compliance?
Then the scoring:
- Automation potential (1–5): how much of this can plausibly run autonomously today?
- Risk (1–5): what’s the cost if an agent gets this wrong?
- Marginal ROI: hours saved per run × run frequency × fully-loaded hourly cost, minus the estimated cost to build and run the agent.
The output is a priority list. Not a roadmap, a backlog. You ship from the top.
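To make the record and the scoring concrete, here is a minimal sketch in Python. The field names, the weighting, and the priority formula are illustrative assumptions for this essay, not the exact schema we used.

```python
from dataclasses import dataclass


@dataclass
class ProcessRecord:
    # Inventory fields (trimmed to what the scoring needs)
    name: str
    trigger: str                     # e.g. "email in", "daily schedule"
    runs_per_year: int               # volume, annualised
    hours_saved_per_run: float       # human effort an agent would remove
    fully_loaded_hourly_cost: float  # of the people doing it today
    # Scoring axes (1-5)
    automation_potential: int
    risk: int
    # Estimated annual cost to build and run the agent
    agent_cost_per_year: float

    def marginal_roi(self) -> float:
        """Hours saved × frequency × fully-loaded cost, minus agent cost."""
        gross = (self.hours_saved_per_run
                 * self.runs_per_year
                 * self.fully_loaded_hourly_cost)
        return gross - self.agent_cost_per_year

    def priority(self) -> float:
        """One illustrative ranking: reward ROI and automation potential, penalise risk."""
        return self.marginal_roi() * self.automation_potential / self.risk


def build_backlog(records: list[ProcessRecord]) -> list[ProcessRecord]:
    """The backlog is the inventory sorted by priority; you ship from the top."""
    return sorted(records, key=lambda r: r.priority(), reverse=True)
```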
Why this phase is the moat
Three reasons.
First, you cannot automate what you cannot see. Teams that skip the inventory end up automating whatever the loudest stakeholder asked for, which is rarely what moves the numbers. Processes we thought were the obvious wins turned out to run four times a year at low volume, while processes nobody talked about ran 300 times a day.
Second, the inventory is where governance starts. When you score every process for risk, you are building the map that your audit team will eventually want anyway. Getting it done now, before any agent touches production, saves you a compliance retrofit later. (Given the EU AI Act’s 2 August 2026 deadline, this is now a deadline problem, not a nice-to-have.)
Third, it aligns the org. The process of doing the inventory forces function heads to describe their own work. Half of them discover things they didn’t know about their own operations. That’s cultural work disguised as analysis, and it’s load-bearing for the next phase.
The bill
Roughly 6 months, 3 people (one AI lead, two process analysts), for an organisation of ~6,000 staff across 12 functions. You can compress it by going narrower (2–3 functions first) and expanding later. You cannot skip it.
Phase 2, the platform (not a vendor)
Once you have the inventory, you need somewhere to build. Resist the temptation to equate “platform” with “a big vendor licence”. A platform is the collection of primitives your factory will use, repeatedly, for every agent. It should:
- Be model-agnostic. The model layer changes every six weeks. Your platform should not.
- Be self-hostable. Data sovereignty, cost control, and sanity. Especially in regulated industries.
- Be version-controllable. Prompts, flows, configurations, all in git.
- Have an observability story on day one. You cannot debug what you cannot see.
Here is the shape we landed on. Your stack will differ by constraint; the categories will not.
| Layer | What we used | Why |
|---|---|---|
| Orchestration | n8n (self-hosted) | Visual + code, self-hostable, huge node library, fast iteration |
| LLM plumbing | LangChain + Python | Mature, composable, model-agnostic |
| Retrieval | Pinecone + FAISS | Managed where it made sense, self-hosted where it didn’t |
| Integration | Custom MCP servers + existing APIs | Portable, governed, see the MCP handbook |
| Analytics | Power BI | It’s where the business already was |
| Observability | OpenTelemetry + a custom dashboard | Per-agent metrics, cost, latency, error rates |
| Governance | Internal framework aligned to ISO 9001 | Already certified, already audited |
The entire platform took about three months to stand up to production quality, with two engineers. It stayed stable for the four years of the programme: we swapped models, added tools, and onboarded new agent types without rebuilding it. That's the whole point.
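To show what "model-agnostic, with an observability story on day one" looks like at the code level, here is a minimal sketch: a thin internal interface every agent calls, wrapped with OpenTelemetry attributes for latency and cost. The interface and wrapper are illustrative assumptions, not our actual internal API.

```python
import time
from typing import Protocol

from opentelemetry import trace  # no-op tracer unless an SDK/exporter is configured

tracer = trace.get_tracer("ai_factory.platform")


class LLMClient(Protocol):
    """The only model surface agents may call. Providers plug in behind it."""

    def complete(self, prompt: str, **kwargs) -> str: ...


class MeteredClient:
    """Wraps any provider client with per-call tracing: agent, latency, cost estimate."""

    def __init__(self, inner: LLMClient, agent_name: str, cost_per_call_usd: float):
        self.inner = inner
        self.agent_name = agent_name
        self.cost_per_call_usd = cost_per_call_usd

    def complete(self, prompt: str, **kwargs) -> str:
        with tracer.start_as_current_span("llm.complete") as span:
            span.set_attribute("agent.name", self.agent_name)
            start = time.monotonic()
            result = self.inner.complete(prompt, **kwargs)
            span.set_attribute("llm.latency_ms", (time.monotonic() - start) * 1000)
            span.set_attribute("llm.cost_estimate_usd", self.cost_per_call_usd)
            return result
```

Swapping the model layer means swapping the client behind the interface; agent code, dashboards, and cost envelopes stay where they are.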
Phase 3, governance before the first agent ships
At this point you’ll be itching to build an agent. Resist for two more weeks. Get the governance model in place first.
Human-in-the-loop by default. Every agent is HITL at launch. You decommission the human step only when you have production evidence that the agent is reliable enough not to need it. Even then, most agents stay HITL on the high-risk paths forever.
Named owner per agent. Every production agent has a named human accountable for it. Not a team. A name. The owner gets the alerts, answers for the failures, and signs off changes. This sounds like overhead; it’s the cheapest incident-prevention mechanism you will ever implement.
Responsible-AI policy. One document, signed by the exec sponsor. It covers: what data agents can touch, what they can do autonomously, what requires approval, how they log, how they escalate, and what the rollback path is. We aligned ours to ISO 9001 because we were already certified, but ISO 42001, NIST AI RMF, or a custom framework work equally well. Pick one. Document it. Train on it.
Escalation paths. When an agent is uncertain, and it will be, where does the decision go? We built a Slack-first escalation with a 15-minute SLA for any agent call-out during business hours. Agents that couldn't find a human fell through to a safe default (usually: don't act, log, notify).
Audit trail. Every agent call, every argument, every result, every human approval, retained for the period your compliance team needs. Non-negotiable. When an auditor asks in 2027 whether an agent made a particular decision on a particular day, the answer is a query, not a panic.
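Put together, escalation and audit are a small amount of code. Here is a minimal sketch assuming a generic notify_owner callback in place of our Slack integration; the structure is illustrative, not the production implementation.

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("agent.audit")  # routed to retained, queryable storage


def record(agent: str, event: str, payload: dict) -> None:
    """Every call, argument, result, and human approval passes through here."""
    audit_log.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "event": event,
        "payload": payload,
    }))


def act_or_escalate(agent: str, decision: dict, confident: bool, notify_owner) -> dict:
    """HITL default: uncertain calls go to the named owner, otherwise the safe default."""
    record(agent, "proposed_decision", decision)
    if confident:
        record(agent, "auto_approved", decision)
        return decision
    if notify_owner(agent, decision):  # e.g. a Slack ping with a 15-minute SLA
        record(agent, "escalated_to_owner", decision)
        return {"status": "pending_human_approval"}
    # No human reachable: don't act, log, notify.
    record(agent, "safe_default_applied", decision)
    return {"status": "no_action_taken"}
```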
Phase 4, deployment standard (the part that lets you scale)
Here’s the trap. You ship your first agent, let’s say it auto-matches invoices to purchase orders, and the finance team loves it. The next function wants one. Then the next.
If every agent looks different, the factory collapses under its own weight. By agent #15 you’ll have fifteen bespoke deployment paths, fifteen integration patterns, fifteen observability stories. We watched other organisations hit this wall.
The fix is a deployment standard. Every agent ships the same way:
- Defined inputs. Typed schemas, documented.
- Monitored outputs. Known format, known constraints, validated.
- Fallback behaviour. What the agent does when it’s uncertain, when a tool call fails, when latency blows out.
- Named owner. Already covered.
- SLO. Latency, accuracy, cost. Dashboarded.
- Cost envelope. Per-agent budget. Alerts at 50%, 80%, 100%.
- Rollback path. How we revert to the pre-agent state, verified.
- Runbook. Written for the on-call engineer, not the author.
The first agent under the standard takes longer than a one-off build. Every subsequent agent is faster. By agent #20 you’re deploying a new one every two to three weeks with a single engineer. That’s what “factory throughput” actually means.
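One way to keep the standard enforceable rather than aspirational is to express it as a manifest that every agent must ship with, and have the deployment pipeline reject agents that lack one. The fields below mirror the list above; the structure itself is a sketch, not our exact schema.

```python
from dataclasses import dataclass


@dataclass
class AgentManifest:
    """Shipped alongside every production agent; the deployment pipeline validates it."""

    name: str
    owner: str                        # a named human, not a team
    input_schema: str                 # reference to the typed input schema
    output_schema: str                # known, validated output format
    fallback: str                     # behaviour on uncertainty, tool failure, latency blow-out
    slo_latency_ms: int
    slo_accuracy: float
    monthly_cost_envelope_usd: float
    rollback_path: str                # how to revert to the pre-agent state, verified
    runbook_url: str                  # written for the on-call engineer, not the author
    cost_alert_thresholds: tuple = (0.5, 0.8, 1.0)  # alert at 50%, 80%, 100% of envelope

    def crossed_alerts(self, spend_to_date_usd: float) -> list[str]:
        """Which cost-envelope thresholds the current month's spend has crossed."""
        return [f"{int(t * 100)}%" for t in self.cost_alert_thresholds
                if spend_to_date_usd >= t * self.monthly_cost_envelope_usd]
```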
Phase 5, the first cohort (not a pilot)
You are not running a pilot. “Pilot” is project language. You are shipping a cohort.
Pick five to ten processes from the top of the backlog. Criteria:
- High volume, so the ROI is visible early.
- Lower risk, so a failure is recoverable.
- Different functions, so you prove the pattern works broadly.
- Dependency-light, so one agent’s delay doesn’t block another.
At the conglomerate, the first cohort was:
- OCR + bank reconciliation for Finance
- Resume triage for HR
- First-line ticket classification for IT
- Lead qualification for Sales
- Procurement price-benchmarking for Supply Chain
Five agents. Three months. All shipped to production under HITL, under the deployment standard, under governance. The cohort’s purpose is not just the business value (though that’s real). It’s to prove the throughput model works.
When the cohort is live, measure two things religiously:
- Business metric. Hours saved × cost. Error rate. Cycle time. Whatever the owner function already cares about.
- Factory metric. Throughput, agents shipped per engineer per month. This is your rate of compounding.
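A worked example of the two metrics, with invented numbers purely to show the arithmetic:

```python
# Business metric for one agent (illustrative numbers)
hours_saved_per_run = 0.25           # 15 minutes of human effort removed per run
runs_per_month = 1_200
fully_loaded_hourly_cost = 40.0      # USD
monthly_value_usd = hours_saved_per_run * runs_per_month * fully_loaded_hourly_cost
# -> 12,000 USD/month before the agent's own running cost

# Factory metric: throughput, the rate of compounding
agents_shipped_this_quarter = 9
engineer_months_spent = 6
throughput = agents_shipped_this_quarter / engineer_months_spent
# -> 1.5 agents per engineer-month
```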
Phase 6, scale-out (where the numbers compound)
After the first cohort, you widen. Add functions. Add agent types. The process-inventory backlog is already sized; just work it.
This is where the 300% ROI number actually shows up. A few things that happened in this phase surprised me:
The marginal cost per agent drops sharply. Agent #30 took one-fifth the engineering time of agent #3. Not because the engineers got better (they were already good). Because the platform, the deployment standard, and the integration patterns were reusable.
The governance overhead also compounds, downward. The first few agents each required a governance review. By the time we were shipping batch #5, governance was a 30-minute review against the existing template.
The adoption pattern is non-linear. Functions watch each other. Once Finance had three agents live and visibly contributing, HR and Procurement asked to be next. Once HR had five, Operations followed. The hardest adoption is the first function; after that, it's demand management.
Unit economics get strange fast. Some agents ran at effectively zero marginal cost (e.g., scheduled overnight reconciliation agents). Others cost more than a cheap human in API fees (e.g., high-volume conversational agents before we optimised model selection). You want a cost envelope per agent or you will discover this the hard way on the monthly bill.
Phase 7, handover (where most consultants disappear)
If you’re running this internally, the handover is to ops. If you’re engaging an outside operator, which, to be clear, is how I run AI Factory engagements now, the handover is to the in-house team and the success criteria are:
- In-house lead running the steering cadence.
- In-house engineer(s) capable of shipping a new agent without external help.
- Governance model operating without escalation to the original author.
- Runbooks maintained by the team, not the outside operator.
- Backlog owned by the function heads, managed by the factory lead.
When these are true, the external role becomes advisory, available for architecture calls, exec-level escalations, annual reviews. The factory continues compounding without you. That’s the goal.
The things that didn’t work
In the interest of not writing a puff piece, here are three things we tried that didn’t work.
Going straight to autonomous without HITL first. We tried this once, with a finance-classification agent we thought was low-risk. It wasn't. We rolled back within a week and put HITL back in. Three months of HITL data later, we decommissioned the human step and it has been fine ever since. Lesson: HITL is not training wheels. It is the evaluation substrate.
Picking a tooling-first strategy. We went to vendor demos early and came back with a short-list of “end-to-end AI platforms”. Every one of them would have locked us into a stack we’d have to migrate off within 18 months. We walked back to the model-agnostic platform approach. Lesson: if your AI strategy starts with a vendor decision, it is upside-down.
Skipping the backlog for the shiny use case. Once, exec asked for a use case the backlog ranked at #42. We built it anyway. It delivered. It also crowded out three higher-ROI agents that quarter. Lesson: the backlog is the point. Exec requests go onto the backlog and get re-scored like anything else, with exec veto rights explicitly documented.
The numbers, one more time
From this engagement, measured not projected:
- 55+ autonomous agents in production across Finance, HR, Sales & CX, Supply Chain, Inventory, Procurement, Quality, IT, Marketing, and cross-functional ops.
- 30+ intelligent automations and 50+ AI-powered workflows shipped.
- 300% ROI on transformation investments.
- 35% reduction in operational cost.
- 40% reduction in response times via conversational AI.
- 250+ business processes mapped.
- 165 SOPs documented.
- Zero major governance incidents during the programme.
These are outcomes of the playbook. They are not a ceiling. Your organisation will have different constraints, different starting maturity, different regulatory context. The playbook still works.
The one-line version
If someone stops you in a corridor and asks how to actually deploy AI agents at scale, the one-line answer is:
Map the work before you touch the model, build a platform you’ll still be running in three years, ship your first cohort under a governance model your audit team already approves, and then let the backlog pull you.
Everything in this essay is a footnote to that sentence.
If you’re running an AI programme that has stalled, or building one that hasn’t started, the fastest conversation is a 30-minute call. We’ll cover where you are, what the constraint actually is, and whether a Factory Readiness engagement is the right next step, or whether something lighter (a fractional CAIO, an MCP build, a scoped process-discovery sprint) makes more sense.