AI is moving from “interesting experiment” to something that sits inside products and tech stacks, and across entire operating models. That shift changes how we measure and evaluate AI systems, and it pushes teams to prove their AI is valuable, safe, and scalable.

The uncomfortable truth is that most organisations aren’t “bad at AI metrics”; they’re not really doing them at all. There’s no benchmark for quality, no shared definition of what “good” looks like, and no feedback loop once the system is live. The result is familiar: teams roll something out, it sort of works, and then everyone argues about quality, risks, and ROI after users already have AI in their hands.

This matters whether AI is customer-facing or internal. The principle is the same: you can’t manage what you don’t measure, and you can’t measure what you haven’t defined.

Two paths, one measurement problem

Organisations typically engage with AI in two distinct ways. They sound similar, but they behave very differently in the real world.

  • 1) Building AI products

    AI is embedded into something you ship: a support chatbot, a recommendation experience, an automated decision step, a co-pilot inside a workflow. Here, you own the behaviour end-to-end. You’re accountable for outcomes, failure modes, and the way quality changes over time. Even if you don’t own the underlying AI model, the key here is that you are embedding AI into an experience used by your customers.

  • 2) Adopting AI tooling

    AI is bought and rolled into operations: Copilot for developers, AI-assisted support tooling, meeting summarisation, analysis, content generation, industry platforms, etc. You control the context (data, access, workflows, policies, etc.) of how and where people interact with AI, and you are still accountable for the impact (efficiency gains, increased revenue, better experiences, etc.) you deliver to your business.

The measurement focus differs:

  • Building is about product quality at scale (trust, outcomes, safety, unit economics)
  • Adopting is about workflow fit (productivity gains, error rates, data handling, and “are we introducing a new class of operational risk?”)

Both paths require rigorous measurement, but the stakes and focus differ:

Building AI products

  • External risk (customer harm, reputation)
  • Continuous monitoring at scale
  • Regulatory compliance for outputs
  • Product-market fit questions

Adopting AI tooling

  • Internal risk (efficiency loss, data leakage)
  • Usage patterns and productivity gains
  • Governance for access and data handling
  • Workflow-fit questions


But while the focus differs, most organisations stumble for the same reason: they measure activity, not impact.

They ask:

  • ❌ How many people used the AI tool?
  • ❌ How many prompts did we run?
  • ❌ How many hours did it “save” (based on a survey)?
  • ❌ How many teams have access?

When the better questions are:

  • ✅ Which workflows improved, and by how much? (cycle time, throughput, rework)
  • ✅ Where did it make things worse? (error rate, escalation, compliance flags)
  • ✅ What is the cost per successful outcome? (not cost per query; a quick sketch follows below)
  • ✅ What’s the failure mode, and how quickly do we detect and contain it?
  • ✅ What’s “good enough” for this use case, and are we still above that threshold?

And ultimately:

  • ✅ Did we solve a real problem using AI?

If any of these are hard to answer, that’s a signal you don’t yet have a measurement system, only a deployment.
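To make “cost per successful outcome” concrete, here is a minimal sketch in Python. Every figure in it (query volume, per-query cost, number of successful outcomes) is an illustrative assumption; the point is simply that the denominator is successful outcomes, not raw usage.

```python
# Minimal sketch: cost per successful outcome vs cost per query.
# All figures are illustrative assumptions, not benchmarks.

monthly_queries = 40_000        # total AI interactions in the period (assumed)
cost_per_query = 0.04           # blended model + infrastructure cost per query (assumed)
successful_outcomes = 9_500     # e.g. tickets resolved without human rework (assumed)

total_cost = monthly_queries * cost_per_query

cost_per_query_view = total_cost / monthly_queries        # looks cheap
cost_per_outcome_view = total_cost / successful_outcomes  # the number that matters

print(f"Cost per query:              £{cost_per_query_view:.2f}")
print(f"Cost per successful outcome: £{cost_per_outcome_view:.2f}")
```

The same total spend looks very different through the two lenses, which is why cost per query alone can make a struggling system look healthy.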


What businesses gain by measuring AI properly

A shared definition of “good enough”.

Traditional software is mostly deterministic. Press button A, output B happens. Quality can be treated as a binary: pass/fail. In this way, it is easy to understand and test your digital experience end-to-end.
AI is different; it is probabilistic. Press button A and you might get output A, B, C, or Z. You can’t measure “the” experience in the same way. You measure an aggregate of experiences, and you decide what distribution of outcomes is acceptable for the context.

The practical consequence is simple: “good enough” has to be defined deliberately, cross-functionally, before momentum takes over. Cross-functional teams need to sit together and agree what return on investment they expect from building or adopting AI, what their quality expectations are, what the non-negotiables are from a risk perspective, and how they will drive value rather than AI for AI’s sake.

When organisations deploy AI to solve a real problem, backed by cross-functional buy-in and realistic expectations, they can make braver, clearer decisions as they progress on their AI journey.

Braver, clearer decisions

Once quality is defined, decisions stop being vague negotiations. Teams can make concrete calls:

  • Which model is appropriate for this domain?
  • What guardrails are non-negotiable?
  • When does the system escalate to a human?
  • What’s the go/no-go threshold for a pilot?
  • What trade-off is acceptable between cost and quality?

This is where metrics become a decision system, not a reporting exercise.

To put this into practice, imagine you could improve support quality by 50%, but it would cost 3x your current budget. Would you do it? What if quality improved 25% at 2x cost? Or 10% improvement at 1.5x cost? At what point does the value justify the investment? By playing out these hypotheticals, you viscerally establish what your organisational values are, and in turn, use these to make decisions on how and where to deploy AI.
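One way to make those hypotheticals less abstract is to put rough numbers on them. Here is a minimal sketch, assuming you can estimate your current budget and the monetary value of a one-point quality improvement in your context; both figures are placeholders, not benchmarks.

```python
# Minimal sketch: weighing quality gains against cost multiples.
# current_budget and value_per_quality_point are illustrative assumptions.

current_budget = 100_000          # annual cost of the current support operation (assumed)
value_per_quality_point = 5_000   # estimated value of each +1% of support quality (assumed)

options = [
    {"quality_gain_pct": 50, "cost_multiplier": 3.0},
    {"quality_gain_pct": 25, "cost_multiplier": 2.0},
    {"quality_gain_pct": 10, "cost_multiplier": 1.5},
]

for opt in options:
    extra_cost = current_budget * (opt["cost_multiplier"] - 1)
    extra_value = opt["quality_gain_pct"] * value_per_quality_point
    print(f"+{opt['quality_gain_pct']}% quality at {opt['cost_multiplier']}x cost: "
          f"net £{extra_value - extra_cost:,.0f}")
```

Under these assumptions, the 50% and 25% options create value and the 10% option only breaks even; with your own numbers the break-even point will sit somewhere else, and finding it is exactly the conversation this exercise is meant to force.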

Without measurement, every AI decision becomes a negotiation based on gut feel. With it, you have evidence to support your strategy.

Why AI metrics are different from “normal” product metrics

The key difference isn’t that AI has “more metrics”. It’s that AI changes the nature of quality.

  • Deterministic systems give you repeatability
  • Probabilistic systems give you variability

So, AI measurement needs to do two jobs at once:

  • 1) Define quality in business terms (value and acceptable risk)

    What outcomes matter? What level of variability is acceptable? What’s the cost if it’s wrong?

  • 2) Track quality in operational terms (how the system behaves in production, over time)

    Is it staying within acceptable bounds? Where is it drifting? When do we intervene?

A useful mental model: measurement is the operating system that connects value and safety.

This applies whether you’re:

  • Building AI products: You need measurement to ship with confidence and maintain quality at scale
  • Adopting AI tooling: You need measurement to govern usage, track productivity gains, and prevent risks like data leakage or hallucinated outputs being treated as fact

That has two implications for leaders:

  • You need to define quality upfront, cross-functionally (product, engineering, risk, legal, ops, the teams actually using the tools)
  • You need measurement built into the lifecycle loop, not bolted on at the end

In both cases, you’re introducing a probabilistic system into workflows that expect consistency. Measurement is how you manage that tension.


What to measure: guardrails vs outcomes

A lot of AI measurement content starts with a glossary (accuracy, precision, recall). Those can be useful, but they’re rarely the best starting point for business decisions. In practice, two layers cover most needs:

Layer 1: Always-on guardrails

These are the “keep it safe and sane” metrics that apply across most AI use cases (a rough sketch of automating them follows the list):

  • Faithfulness: does it invent facts, or stay grounded?
  • Safety: can it produce harmful, biased, or inappropriate outputs?
  • Latency: is it fast enough for the workflow?
  • Cost: does it scale economically, or drift into a cost blowout?
  • Usefulness: do users signal it helped (thumbs up/down, short surveys, completion proxy, £/$ improvement)?
  • Escalation behaviour: does it hand off cleanly when it should?
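Here is that rough sketch: a minimal example assuming you already log per-interaction scores for faithfulness, safety, latency, and cost. The field names and threshold values are illustrative assumptions, not recommendations.

```python
# Minimal sketch: always-on guardrail checks over a batch of logged AI interactions.
# Field names and thresholds are illustrative assumptions.

from statistics import mean

GUARDRAILS = {
    "faithfulness": {"metric": lambda logs: mean(l["faithfulness"] for l in logs), "min": 0.90},
    "safety":       {"metric": lambda logs: mean(l["safety"] for l in logs),       "min": 0.99},
    "latency_p50":  {"metric": lambda logs: sorted(l["latency_s"] for l in logs)[len(logs) // 2], "max": 2.0},
    "cost_per_interaction": {"metric": lambda logs: mean(l["cost"] for l in logs), "max": 0.05},
}

def check_guardrails(logs: list[dict]) -> list[str]:
    """Return a list of guardrail breaches for this batch of interactions."""
    breaches = []
    for name, rule in GUARDRAILS.items():
        value = rule["metric"](logs)
        if "min" in rule and value < rule["min"]:
            breaches.append(f"{name} below threshold: {value:.2f} < {rule['min']}")
        if "max" in rule and value > rule["max"]:
            breaches.append(f"{name} above threshold: {value:.2f} > {rule['max']}")
    return breaches
```

In practice these checks would run on a schedule against production logs and feed alerts or a dashboard, rather than being a one-off script.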

Layer 2: Domain outcomes

This is where value actually shows up, and it depends on the job-to-be-done. For example:

  • Customer support: time to first resolution, first contact resolution, escalation rate, CSAT
  • Search/discovery: % of searches where a relevant result is clicked in the top N results
  • Internal productivity: time saved on a workflow, completion rates, adoption and retention

A helpful rule of thumb is that accuracy is rarely “the” metric. A better business framing is: what’s the cost of being wrong in this context? That question naturally forces teams to talk about user harm, reputational risk, compliance risk, operational drag, and real economics.

Make measurement part of the lifecycle loop

The biggest shift organisations need to make is treating evaluation as part of collaborative delivery, not a one-off “model validation step” completed by engineering.

Discovery: define quality before you build or buy

Start with the questions that create alignment:

• What does quality mean here: speed, cost, hallucination rate, time to value?
• What failure modes matter most (reputational, legal, user harm, operational)?
• What is the baseline today (human performance, current cost, current pain)?
• What does “good enough” look like for a pilot?

This is a “Minimum Viable Quality” approach: the minimum bar you need to clear to ship safe, responsible value to users.
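One lightweight way to capture the output of that discovery conversation is a shared, written quality definition the whole team signs off. A minimal sketch follows; every metric name, baseline, and threshold in it is an illustrative assumption to be replaced by your own.

```python
# Minimal sketch: a "Minimum Viable Quality" definition agreed in discovery.
# Metric names, baselines, and thresholds are illustrative assumptions.

minimum_viable_quality = {
    "use_case": "support assistant pilot",
    "baseline": {
        "human_first_contact_resolution": 0.62,
        "avg_handle_time_minutes": 11.0,
    },
    "pilot_thresholds": {                  # go/no-go bar for the pilot
        "faithfulness": 0.90,              # grounded answers, no invented policy detail
        "escalation_handoff_success": 0.95,
        "csat_delta_vs_baseline": 0.0,     # must not make the experience worse
        "cost_per_resolved_ticket": 1.50,
    },
    "non_negotiables": [
        "no personal data sent to third-party models",
        "clear human escalation path in every conversation",
    ],
}
```

Because it is written down, the same definition can drive the pilot’s go/no-go decision later and seed the thresholds you monitor in production.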

Build/configure: iterate against the definition

Whether you are building a system or configuring a vendor tool, the loop looks similar:

• test outputs against your metrics,
• improve the experience (UX, prompts, retrieval, policies, escalation),
• track movement towards thresholds.

The key is keeping measurement tied to the user experience and business outcome, not just internal scores.

Launch: stagger it, then validate reality

AI behaves differently with real users and real data. A pilot or beta rollout is where assumptions meet reality:

• does it create the value you predicted?
• does latency hold under real usage?
• does cost behave, or spike?
• are failure modes acceptable given the escalation path?

In an initial pilot, you should be able to prove the hypothesis that “building or adopting this AI technology will bring about X value”. Once you have proven that value, you can gradually add more users, testing the hypothesis with a larger population and layering in checks for cost, latency, and safety at scale.

Run: monitor drift and close the loop

Drift in AI and machine learning experiences is normal. Put simply, it is the tendency for system quality to shift up or down over time. Maybe the support chatbot that found the right knowledge base article 80% of the time in 2025 can only find it 60% of the time in 2026. Drift can be internal (usage patterns, data) or external (seasonality, environment, society, regulation, expectations). It can even be competitive: what was “good enough” becomes table stakes.

The mistake leaders make is treating AI experiences like their deterministic, traditional counterparts. Teams need a way to monitor whether quality is drifting, and when signals move towards unacceptable levels, you need a clean path back into the build loop. For higher-risk use cases, it’s also sensible to treat “turn it off” as a planned capability (feature flags, safe fallbacks, incident runbooks).
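As a rough sketch of what “monitor and close the loop” can look like, assume you track a weekly quality score (such as the knowledge base retrieval hit rate above) and can fall back to the non-AI path behind a feature flag. The thresholds and the actions below are assumptions for illustration, not a prescription.

```python
# Minimal sketch: drift detection against an agreed quality threshold,
# with an explicit "turn it off" fallback. All values are illustrative.

QUALITY_THRESHOLD = 0.75      # agreed "good enough" level, e.g. retrieval hit rate
KILL_SWITCH_THRESHOLD = 0.60  # below this, fall back to the non-AI path

def review_weekly_quality(weekly_scores: list[float]) -> str:
    """Decide what to do based on recent weekly quality scores (oldest first)."""
    latest = weekly_scores[-1]
    trend = latest - weekly_scores[0]

    if latest < KILL_SWITCH_THRESHOLD:
        return "disable feature flag, route users to fallback, open an incident"
    if latest < QUALITY_THRESHOLD:
        return "raise an alert and feed findings back into the build loop"
    if trend < -0.05:
        return "quality drifting down: schedule an investigation"
    return "within acceptable bounds: keep monitoring"

# Example: a chatbot whose hit rate has slipped from 0.80 towards 0.68
print(review_weekly_quality([0.80, 0.78, 0.74, 0.68]))
```

The point is not the specific numbers but that the response to drift is decided in advance, including the option to switch the experience off cleanly.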

The other side of ROI: managing mistakes

Yes, ROI matters. You can and should estimate value up front. In fact, all AI experiences should be well-specced to capture value from the get-go.

But with AI products and tools, the downside risk is higher and less predictable because the system is non-deterministic. The fallout can be:

  • reputational
  • user harm
  • legal exposure
  • operational chaos
  • cost blowouts

So, ROI for AI is not just “value captured”. It’s also risk avoided, and the maturity of your measurement loop is what turns that into something you can manage.

  • For AI products: One viral example of your chatbot saying something offensive can erase months of careful brand-building
  • For AI tooling: One instance of sensitive data leaking through a poorly-governed AI assistant can trigger regulatory action

Measurement is what gives you early warning signals before small problems become existential ones.

Governance and regulation: measurement is evidence

Even if you don’t lead with compliance, you’ll eventually be asked: why did the system do that, how often does it happen, what are you doing about it?

Data protection rules already touch automated decision-making. The UK ICO guidance on Article 22 of the UK GDPR covers restrictions and rights related to solely automated decisions with legal or similarly significant effects. The EU AI Act is now law with phased implementation, creating risk-based obligations that are strictest for high-risk AI systems.

Measuring your AI stack allows you to build a better-quality experience for customers and staff and, importantly, protects you against governance and regulatory risks. It helps you demonstrate control, manage risk, and improve continuously.

Whether you’re building AI into your product or adopting AI tools across your organisation, the audit trail starts with measurement. If you can’t show what the system did, why it did it, and what you’re doing to improve it, you’re exposed.

Where we focus

We work with organisations across both paths:

  • Building AI products:

    defining your AI strategy, finding the right problems to solve, prioritising use cases, selecting approaches, designing evaluation loops, and helping you to ship safely

  • Adopting AI tooling:

    governance, rollout strategy, productivity measurement, and risk controls

  • Most organisations need both:

    clarity on what should be built for advantage vs adopted for efficiency, and measurement frameworks that work across both realities

The through-line is co-creation rooted in real problems, not technology for its own sake:

  • What are you trying to achieve?
  • Where does AI genuinely create value vs where is it just shiny?
  • What quality thresholds matter for your users, operations, and regulators?
  • How can ROI be estimated upfront and tracked continuously?

A practical starting point is to pick one use case and run a short cross-functional session: define the outcomes, guardrails, go/no-go thresholds, escalation path, and monitoring cadence.

That one step tends to unlock the rest of the delivery, whether you’re building something new or adopting something proven.

Measure what matters. Discover how to track AI success and turn insights into impact. Learn more.