What Tasks Can an AI Agent Actually Handle?, Amjid Ali

The shortest honest answer to “what tasks can an AI agent actually handle?” is: the work that is structured and repeatable, follows rules an experienced human could write on one page, runs at high volume, and tolerates a 5–15% error rate because every output gets human review.

Everything else is a vendor demo or a research project.

That sentence sounds dismissive. It isn’t. The set of work that fits those five conditions is enormous, runs across every department in a typical mid-market business, and is the reason the agents we shipped at the Oman conglomerate retired about 35% of the operational cost base across 12 functions (the case study). This post is the practical map. The fit-test, then a department-by-department walkthrough, then the tasks where the answer is still “leave it to a human”.

The five-part fit test

Before any task list is useful, you need to know how to score a task on your own desk. Run any candidate task through these five questions. Three or more “yes” answers means you have a probable agent fit. Two or fewer means you don’t. Don’t argue with the test, and don’t let a vendor convince you a “no” is a “yes” with the right prompt.

1. Is the input structured, or can it be made structured cheaply? Forms, CSVs, fixed-format emails, database rows, structured PDFs. If the input is “whatever a customer feels like sending us today”, the agent can still help, but the failure rate climbs sharply. Unstructured input is solvable in 2026, but it’s a different cost class.

2. Are the rules writable on one page? If the most experienced person doing this work can write the SOP, the agent can probably learn it. If the rules live in someone’s head and surface as “I just know”, the agent will fail in surprising ways and you’ll spend more on eval and oversight than the automation is worth.

3. Is the volume high enough to justify the build? A task done 1,000 times a month is a fit. A task done 5 times a quarter is not, regardless of how clever the agent is. Build cost has to amortise. The crossover point in 2026 is roughly 10 hours/week of human time on the same recurring task.

4. Can a human review or override the agent’s output before it lands? The most reliable production agents are not “agent in a black box”. They’re “agent proposes, human approves on the 5–15% of cases that flag uncertainty”. If your workflow can’t tolerate a review gate, the agent has to be right about 99.9% of the time, and most agents are not.

5. Are the consequences of being wrong recoverable? A miscoded receipt is recoverable. A wrongly-sent payment to a fraudulent vendor is not. The work where errors are catastrophic and irreversible needs heavier control: stronger guardrails, human-in-the-loop on every output, auditable logs. Often it means a smaller agent scope and more human oversight.
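The five questions reduce to a very small scoring function. The sketch below is a minimal illustration, not a tool from the case study: the class name `FitAnswers`, its field names, and both helper functions are hypothetical. The only rule it encodes is the one stated above, that three or more “yes” answers means a probable fit.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class FitAnswers:
    """One yes/no answer per fit-test question (field names are illustrative)."""
    structured_input: bool       # Q1: input is structured, or cheap to structure
    rules_fit_one_page: bool     # Q2: an expert could write the SOP on one page
    high_volume: bool            # Q3: enough volume to amortise the build
    human_review_possible: bool  # Q4: a human can review or override the output
    errors_recoverable: bool     # Q5: being wrong is recoverable


def fit_score(answers: FitAnswers) -> int:
    """Count the 'yes' answers (True counts as 1)."""
    return sum(astuple(answers))


def is_probable_fit(answers: FitAnswers) -> bool:
    """Three or more 'yes' answers means a probable agent fit."""
    return fit_score(answers) >= 3


# Example: structured, rule-based, high-volume work with no review gate.
invoice_coding = FitAnswers(True, True, True, False, True)
quarterly_board_memo = FitAnswers(False, False, False, True, True)
```

Here `is_probable_fit(invoice_coding)` passes with four “yes” answers, while `quarterly_board_memo` scores two and fails. The score is a screening aid only: a task that passes on count but fails question 5 still needs the heavier controls described above.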

Now, the tasks themselves.

Finance and accounting

This is the deepest, most-proven fit in any business. Finance has structured inputs, rule-based logic, high volume, and a culture that already does sign-off and reconciliation. Most finance teams I work with can put 30–50% of their transactional load on agents inside three months once they’re set up properly.

High-fit tasks an agent can run today:

Invoice capture, coding, and three-way matching against POs and goods-receipt notes.
Bank reconciliation across multiple bank feeds and sub-ledgers.
Expense classification and policy compliance checks.
Month-end accruals and journal preparation against templates.
AR follow-up: dunning emails, payment promise tracking, escalation triggers.
Cash-flow forecasting against historical patterns and current commitments.
Vendor master cleansing: deduping, ABN validation, tax-rate checks.
BAS preparation drafts (final review still by a registered tax agent).
Audit-pack assembly: pulling the right schedules, reconciliations, and supporting documents.
Management report drafting from finance data plus narrative analysis.
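To make the first item in that list concrete, here is a minimal sketch of a three-way match on a single invoice line. It is an assumption-heavy illustration rather than how any particular ERP does it: the `Line` type, the `three_way_match` function, and the 2% price tolerance are all hypothetical. The point is that the check is a handful of explicit rules, and anything that fails a rule becomes an exception for a human to review.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    """One line of an invoice or purchase order (illustrative shape)."""
    sku: str
    quantity: int
    unit_price: Decimal


def three_way_match(
    invoice: Line,
    po: Line,
    received_qty: int,
    price_tolerance: Decimal = Decimal("0.02"),  # assumed 2% tolerance
) -> list[str]:
    """Compare invoice vs PO vs goods received; return a list of exceptions.

    An empty list means the line matched and can be coded automatically.
    """
    exceptions = []
    if invoice.sku != po.sku:
        exceptions.append("SKU does not match purchase order")
    if invoice.quantity > po.quantity:
        exceptions.append("invoiced quantity exceeds purchase order")
    if invoice.quantity > received_qty:
        exceptions.append("invoiced quantity exceeds goods received")
    if abs(invoice.unit_price - po.unit_price) > po.unit_price * price_tolerance:
        exceptions.append("unit price outside tolerance")
    return exceptions


po_line = Line("A1", 10, Decimal("5.00"))
clean = three_way_match(Line("A1", 10, Decimal("5.00")), po_line, received_qty=10)
over_billed = three_way_match(Line("A1", 12, Decimal("5.00")), po_line, received_qty=10)
```

In this example `clean` is an empty list, and `over_billed` carries two exceptions (quantity above the PO and above the goods received), which is the kind of case that would go to a reviewer rather than straight to payment.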

Where it still breaks:

Complex tax positions involving judgement about deductibility or transfer pricing. Get an agent to surface the question. Don’t get it to make the call.
Fraud detection in genuinely novel patterns. Rule-based agents catch known patterns. The new ones still slip past until a human notices.

For the full deep-dive on AI in finance, see AI for CFO in 2026.

Customer support and service

The most public success story of agentic AI. Most consumer-facing brands of any scale now run a tier-1 agent. The honest 2026 numbers: 50–80% deflection rate on tickets, sub-30-second response times, and 24/7 coverage at a fraction of the equivalent human cost.

High-fit tasks:

Order status, shipping, returns, and cancellation queries against fixed policy.
Account self-service: password resets, plan changes, billing queries.
FAQ-style support across product knowledge bases.
Triage and routing of complex queries to the right specialist with full context handoff.
Sentiment analysis and escalation flagging on incoming tickets.
Post-resolution survey collection and CSAT analysis.
Knowledge-base maintenance: spotting gaps, drafting new articles from resolved tickets.

Where it still breaks:

High-emotion situations (bereavement, legal threats, regulator-grade complaints). Route to a human; do not have the agent attempt empathy.
Fluid pricing and discount negotiation that genuinely needs human authority.
Complex multi-system troubleshooting where context lives across 5+ tools the agent doesn’t have access to.
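The routing logic implied by these lists can be sketched in a few lines. Everything named here is an assumption made for illustration: the `DraftReply` type, the keyword list, the 0.85 review threshold, and the idea that the agent reports a confidence score between 0 and 1. A production system would use a proper classifier rather than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical trigger terms for high-emotion or regulator-grade tickets.
HIGH_EMOTION_TERMS = (
    "bereavement",
    "passed away",
    "lawyer",
    "legal action",
    "ombudsman",
    "regulator",
)


@dataclass(frozen=True)
class DraftReply:
    ticket_text: str
    reply: str
    confidence: float  # assumed self-reported score in [0, 1]


def route(draft: DraftReply, review_threshold: float = 0.85) -> str:
    """Decide where a drafted reply goes before anything reaches the customer."""
    text = draft.ticket_text.lower()
    # High-emotion tickets skip the agent entirely, whatever its confidence.
    if any(term in text for term in HIGH_EMOTION_TERMS):
        return "human"
    # Uncertain drafts go through the review gate: agent proposes, human approves.
    if draft.confidence < review_threshold:
        return "human_review"
    return "auto_send"


order_status = DraftReply("Where is my order?", "It ships tomorrow.", 0.97)
legal_threat = DraftReply("My lawyer will be in touch.", "Sorry to hear that.", 0.99)
unclear = DraftReply("The app does something odd on login.", "Try reinstalling.", 0.60)
```

With these inputs, `route(order_status)` returns `"auto_send"`, `route(legal_threat)` returns `"human"` despite the high confidence, and `route(unclear)` returns `"human_review"`. The ordering of the checks is the design choice that matters: the escalation rule runs before the confidence check, so a confident agent cannot talk its way past it.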

Sales and marketing

A maturing fit. Strong agent ROI on the volumes-and-rules side, much weaker on creative judgement and relationship work. The pattern that consistently pays back: agent does the prep, human does the relationship.

High-fit tasks:

Lead qualification against an ICP rubric, including web enrichment.
Lead scoring and routing across CRM rules.
Outbound email sequence drafting and personalisation against a structured prospect profile.
Inbound query triage and meeting booking.
Sales-call note-taking, summary generation, and CRM logging.
Competitive intelligence: monitoring named accounts, surfacing changes, drafting battlecards.
Content repurposing: turning long-form into short-form across formats.
SEO and metadata work: keyword research, content gap analysis, on-page audits.
RFP response drafting against a curated answer library.

Where it still breaks:

Genuine creative direction (the headline that makes the campaign). Agents are competent at on-brand variation. They are mediocre at category-defining new work.
High-touch enterprise relationship sales. The agent can prepare the brief, not run the room.

For an honest read on how this plays out, see the AI sales agent piece.

Operations and supply chain

Mid-fit, very dependent on data quality. The work pays back where ERPs are reasonably clean and integrations work. It struggles where data lives in spreadsheets, emails, and tribal knowledge.

High-fit tasks:

Inventory reorder calculations against demand forecasts and lead times.
Purchase requisition routing and approval workflow management.
Vendor performance monitoring and supplier scorecard generation.
Logistics tracking and exception alerting.
Production scheduling against constraints (with humans approving the final plan).
Quality-defect classification from images or structured reports.
Compliance documentation across multi-site operations.
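The inventory reorder item at the top of that list is a good example of why clean data matters more than a clever model. The sketch below uses the standard textbook reorder-point formula (average daily demand times lead time, plus safety stock). The function names and the sample numbers are hypothetical, and it assumes demand and lead time are already reliable figures from the ERP.

```python
def reorder_point(avg_daily_demand: float, lead_time_days: float, safety_stock: float) -> float:
    """Stock level at which a new order should be raised.

    Textbook formula: expected demand during the lead time, plus a safety buffer.
    """
    return avg_daily_demand * lead_time_days + safety_stock


def needs_reorder(on_hand: float, on_order: float, point: float) -> bool:
    """True when the inventory position (on hand + already ordered) is at or below the point."""
    return on_hand + on_order <= point


# Illustrative numbers: 20 units/day demand, 7-day lead time, 30 units of safety stock.
point = reorder_point(avg_daily_demand=20, lead_time_days=7, safety_stock=30)
should_order = needs_reorder(on_hand=150, on_order=0, point=point)
```

Here `point` works out to 170 units, so a position of 150 triggers a reorder. If the demand figure comes from a stale spreadsheet rather than the ERP, the arithmetic is still correct and the answer is still wrong, which is the data-quality dependency described at the start of this section.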

Where it still breaks:

Crisis-response coordination across suppliers. The volume of judgement and political work in a real supply-chain crisis is beyond agents in 2026.
Anything that requires negotiating new commercial terms.

Human resources

Strong fit on transactional HR, weaker on people-management and culture work. The agents that survive in HR have careful guardrails on bias, privacy, and EU AI Act compliance (recruitment is a “high-risk” category).

High-fit tasks:

CV screening against a structured rubric (with human final-cut and bias monitoring).
Job description drafting and inclusivity-language checks.
Interview scheduling and coordination.
Onboarding logistics: equipment requests, system access provisioning, document collection.
Policy lookups and HR FAQ across the employee handbook.
Pulse-survey analysis and theme extraction from open-text responses.
Learning-content recommendation based on role and capability gap.

Where it still breaks:

Performance conversations and people decisions. Always human.
Cultural-fit interpretation. Agents amplify the bias in the training data; this is a regulated area in most jurisdictions for good reason.
Workforce planning that requires reading the room (which exec is about to leave, which team is about to revolt).

IT and engineering

A specialist’s deep-fit area, and where most engineering teams are already running agents whether they admit it or not. The data on developer productivity uplift is now solid: GitHub’s 2024–2025 research and McKinsey’s developer surveys both put productivity gains in the 25–55% range on well-implemented setups.

High-fit tasks:

Code generation and refactoring against tests and style guides.
Test generation, especially unit tests against existing functions.
Code review on style, common bugs, and security pattern violations.
Documentation generation from code.
Incident triage: pattern-matching against past incidents, drafting initial diagnosis.
Log analysis and anomaly detection.
IT helpdesk first-line: password resets, software requests, common troubleshooting.
Infrastructure-as-code drafting and validation.

Where it still breaks:

Architecture decisions at scale. Agents propose; humans architect.
Critical-system production debugging where context spans years of decisions.
Anything that requires deep judgement about non-functional requirements (security posture, regulatory tradeoffs, multi-year cost of ownership).

For the full take, see AI-assisted development from Copilot experiments to production-grade engineering.

Legal, risk, and compliance

A growing fit, with strong guardrails. Legal teams are conservative for excellent reasons. The pattern that works: agent does the prep work, lawyer does the sign-off and the judgement calls.

High-fit tasks:

Contract review against clause checklists.
Compliance documentation against regulatory frameworks (ISO 27001, ISO 9001, ISO 42001, EU AI Act, Australia’s Privacy Act).
Regulatory horizon-scanning: monitoring named regulators and surfacing relevant changes.
Policy-document drafting from templates.
Document discovery and review at scale.
Privacy-impact assessment drafting.

Where it still breaks:

Final legal opinions. The agent prepares; the lawyer signs.
Litigation strategy. The judgement load is too high.
Anything where a regulator or court will ask “who decided this and on what authority”. The answer can’t be “the agent”.

The “leave it to humans” list

Across every department, the same five categories belong with humans, regardless of model improvements:

Senior strategic decisions. Hiring at executive level, M&A calls, fundamental product direction, crisis communication, capital allocation above a material threshold.
Genuine novelty. Anything that isn’t in the model’s training data because it hasn’t happened anywhere yet.
Relationship-grade work. Senior client relationships, board management, regulator-facing accountability, internal political work.
Care and ethics work that requires moral weight. Layoff decisions, performance management, end-of-life clinical conversations, child-safety calls, anything where being “technically correct but cold” causes lasting damage.
Anything where a human signature is the legal product. Audited accounts, signed contracts, regulated medical or legal opinions, accreditations.

If you want a sharper framing, see “what is an AI agent: the operator primer” and “process inventory: the moat nobody maps”.

How to choose your first task

If your organisation hasn’t shipped a production agent yet, the goal is not to find the “best” task. It’s to find the task with the highest probability of building organisational confidence at the lowest probability of stalling. The criteria:

Volume above 10 hours per week. Below this, the build doesn’t amortise.
Single-team scope. Cross-team agents are technically possible and politically nuclear. Save them for project five.
A friendly internal owner. Someone who will champion adoption, defend the agent in performance reviews, and feed the eval loop.
Tolerable error rate. Pick a task where a 5–15% error rate doesn’t break a regulator, a customer, or a balance sheet.
Visible to the executive sponsor. The first agent’s job is to prove the model works. It needs to be one the CEO can talk about in a board meeting.

That filter typically points to: AP invoice processing in finance, tier-1 ticket deflection in support, or lead qualification in sales. Pick whichever has the cleanest data and the friendliest owner.
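The five criteria above work as a hard filter: a candidate has to pass all of them. A minimal sketch, assuming you have already listed candidate tasks with an honest yes/no against each criterion (the `Candidate` type, its field names, and the sample data are hypothetical):

```python
from dataclasses import dataclass

MIN_HOURS_PER_WEEK = 10  # below this, the build doesn't amortise


@dataclass(frozen=True)
class Candidate:
    name: str
    hours_per_week: float
    single_team: bool
    friendly_owner: bool
    error_tolerant: bool       # a 5-15% error rate breaks nothing important
    visible_to_sponsor: bool


def shortlist(candidates: list[Candidate]) -> list[str]:
    """Return the names of tasks that pass every first-task criterion."""
    return [
        c.name
        for c in candidates
        if c.hours_per_week >= MIN_HOURS_PER_WEEK
        and c.single_team
        and c.friendly_owner
        and c.error_tolerant
        and c.visible_to_sponsor
    ]


candidates = [
    Candidate("AP invoice processing", 25, True, True, True, True),
    Candidate("Quarterly board pack", 2, True, True, True, True),
    Candidate("Cross-team order-to-cash", 40, False, True, True, True),
]
first_tasks = shortlist(candidates)
```

Only "AP invoice processing" survives this example: the board pack fails on volume and the order-to-cash workflow fails on single-team scope. If more than one task passes, the tie-breakers are the ones named above, the cleanest data and the friendliest owner.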

The one-paragraph summary

AI agents in 2026 reliably handle structured, rule-based, high-volume work in finance, support, sales operations, document review, internal automation, and engineering productivity, where the error rate is tolerable because humans review the exceptions. They are not yet (and may not ever be) the right tool for senior judgement, novel decisions, relationship-grade work, ethical sign-offs, or anything where a human signature is the legal product. Run any candidate task through the five-question fit test, pick the smallest agent that solves a real dollar problem, build the oversight loop, and ship. Compound from there.

The capability is real. The limits are real too. Knowing the difference is the entire game.

If you want a structured walk-through of your specific function with a fit-test scored against your real processes, book a 30-minute scoping call. For the deeper read on how the 55 agents got picked and shipped, see how we deployed 55 AI agents in production.
