4 December 2025

AI is a LIAR... Sometimes

If you judge the state of AI by LinkedIn posts, all of your competitors are already running autonomous swarms of agentic super-workers - orchestrating quarterly OKRs, rewriting strategy decks, and paying each other with x402 - all while my WHOOP politely informs me I had a “sub-optimal” sleep last night.

Truth is, though, a lot of businesses out there are still wrestling with a very normal question: “How do we get this chatbot to stop making stuff up?”

And don’t get me wrong - chatbots are so last year. A cool guy like me? I only spin up self-optimising agent swarms with polysemantic memory routing and quantum-ready decision loops. Still, chatbots are a useful pattern in applied AI. They help triage support queries, surface internal knowledge, guide users through complex workflows, and act as the front door to systems nobody wants to navigate manually.

Only problem is, they keep being wrong... sometimes.

Now, obviously - plenty of things feed into that: generative models are intrinsically non-deterministic, system prompts shape behaviour in ways that are basically sorcery, and users will always type something that instantly collapses whatever neat ontology you thought you’d built.

Even after all that, there’s the knowledge base (i.e. the documents and data your bot is supposed to know about). Real users don’t phrase things like your documentation. They don’t follow the structure of your ontology. And they absolutely don’t care how your retrieval pipeline works.

This is where building AI chatbots gets a bit more interesting. Actually, it’s where all the effort goes: understanding how the model performs across the entire knowledge base, how it changes across iterations and updates, and how it responds to the questions people will actually ask, not the questions you hope they ask.

Let’s assume you’ve done the groundwork: you’ve deployed on your preferred cloud stack, connected the model to a knowledge base, chosen an interface (Slack, Teams, some embeddable widget), and aligned on a system prompt - ideally one shaped collaboratively with user research so it reflects real user needs, not just engineering assumptions. Now comes the question:

How do you know it works?

Not “works for me” or “works when an SME asks a carefully phrased question,” but works reliably with varied terminology, inconsistent phrasing, and domain-specific scenarios.

Automated Evaluation Engines

Once you accept that intuition isn’t enough, you need a way to test your assistant systematically, especially if your knowledge base is large, complex, or regulatory in nature.

The goal is to confidently answer: “Does the model perform well across the full breadth of what it’s meant to know?”

The good news is: automating the process of asking questions and evaluating the responses is a mostly solved problem. There are mature frameworks you can drop in easily - open-source tools like Ragas, DeepEval, and TruLens, or hosted platforms like LangSmith. And if you’re working with a major cloud provider - AWS, for example - you’ll find options like Bedrock Evaluations, which gives you a managed way to run systematic tests and measure correctness, grounding, coherence, and more.
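
To give a flavour of what these frameworks automate, here’s a minimal sketch using Ragas. The dataset columns, metric choices and the example row are illustrative (and shift a little between Ragas versions), so treat it as a starting point rather than a recipe:

```python
# Minimal sketch: scoring question/answer/context rows with Ragas.
# Column names and metrics follow Ragas's classic dataset API and may differ
# between versions; the judge LLM needs credentials (OpenAI by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_rows = {
    "question": ["What is the notice period for cancelling a policy?"],
    "answer": ["Policies can be cancelled with 30 days' written notice."],
    "contexts": [["Section 4.2: customers may cancel with 30 days' written notice."]],
    "ground_truth": ["30 days' written notice, per Section 4.2."],
}

result = evaluate(
    Dataset.from_dict(eval_rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores you can track from one release to the next
```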

But here’s the catch: every evaluation framework on earth is only as good as the data you feed it. One of the key activities we build into our delivery projects is producing an evaluation dataset that actually reflects the shape and ambiguity of the knowledge base.

Ideally, you’d source as much of this as possible from real users - but for large or complex domains, real questions are often sparse and unevenly distributed. It’s a bit like the concept of code coverage in unit testing: you want to evaluate as many of the concepts, edge cases and relationships the model may need to reason over as possible.

One option for addressing this is to generate synthetic evaluation data - a reliable, repeatable test suite of question-answer pairs grounded directly in the knowledge base. You can build a tool to do this pretty quickly; a typical approach works in three stages (there’s a rough sketch after the list):

1. Topic extraction: Retrieve-and-generate workflows scan the corpus to surface the concepts, relationships, rules, and exceptions the model is likely to reason over.

2. Question generation: From those topics, generate domain-shaped questions that mimic the kind of variations, ambiguities, and edge cases real users naturally introduce.

3. Grounded answer generation: Answers are produced strictly from the knowledge base, i.e. citation-supported content from authoritative source material, ensuring the evaluation focuses on reasoning quality rather than hallucinated content.
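
Here’s a rough sketch of those three stages. It uses the OpenAI SDK purely as a stand-in for whatever model client you prefer; the prompts, model name and chunk handling are all placeholder assumptions:

```python
# Sketch of a three-stage synthetic evaluation-data generator.
# The model name, prompts and chunk handling are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder - use whatever model your stack provides

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def extract_topics(chunk: str) -> list[str]:
    # Stage 1: surface concepts, rules, relationships and exceptions from a chunk.
    raw = ask(
        "List the key concepts, rules, relationships and exceptions in this text "
        f"as a JSON array of short strings:\n\n{chunk}"
    )
    return json.loads(raw)

def generate_questions(topic: str, chunk: str) -> list[str]:
    # Stage 2: produce user-shaped questions, including ambiguous phrasings.
    raw = ask(
        f"Write 3 questions a real user might ask about '{topic}', using varied, "
        f"imprecise phrasing, answerable from this text. Return a JSON array:\n\n{chunk}"
    )
    return json.loads(raw)

def generate_grounded_answer(question: str, chunk: str) -> str:
    # Stage 3: answer strictly from the source material, with a citation.
    return ask(
        "Answer the question using ONLY the source text below. Quote or cite the "
        f"relevant passage. If the text does not answer it, say so.\n\n"
        f"Question: {question}\n\nSource:\n{chunk}"
    )

def build_eval_set(chunks: list[str]) -> list[dict]:
    dataset = []
    for chunk in chunks:
        for topic in extract_topics(chunk):
            for q in generate_questions(topic, chunk):
                dataset.append({
                    "question": q,
                    "ground_truth": generate_grounded_answer(q, chunk),
                    "contexts": [chunk],
                    "source": "synthetic",
                })
    return dataset
```

In practice you’d validate the JSON the model returns, de-duplicate near-identical questions, and sample chunks to keep the run affordable - but the three functions map one-to-one onto the stages above.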

An important step is to combine this with real questions from real users. Over time, you can shift the balance from synthetic towards real data as you gather more feedback from users, but even at the start it gives you a hybrid dataset representing:

  • broad coverage across the entire knowledge base
  • real-world phrasing and ambiguity
  • repeatability for regression testing
  • objective metrics for correctness, grounding, coherence and citation quality

With each prompt change, retrieval adjustment, chunking update, or new content source, you can now quantify whether the assistant gets better, worse - or simply different.
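
Once every run emits the same set of metrics, the regression check itself is simple - something along these lines, where the metric names, baseline scores and tolerance are all placeholders:

```python
# Compare a candidate evaluation run against a stored baseline and flag regressions.
# Metric names, baseline values and the tolerance are illustrative - tune them to your suite.
BASELINE = {"faithfulness": 0.91, "answer_relevancy": 0.84, "citation_quality": 0.88}
TOLERANCE = 0.02  # ignore movement below this level as noise

def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, str]:
    verdicts = {}
    for metric, old in baseline.items():
        new = candidate[metric]
        if new < old - TOLERANCE:
            verdicts[metric] = f"regressed ({old:.2f} -> {new:.2f})"
        elif new > old + TOLERANCE:
            verdicts[metric] = f"improved ({old:.2f} -> {new:.2f})"
        else:
            verdicts[metric] = "unchanged"
    return verdicts

print(compare_runs(BASELINE, {"faithfulness": 0.87, "answer_relevancy": 0.87, "citation_quality": 0.88}))
# {'faithfulness': 'regressed (0.91 -> 0.87)', 'answer_relevancy': 'improved (0.84 -> 0.87)', 'citation_quality': 'unchanged'}
```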

Ambiguity Analysis

Even with strong evaluation data, there’s another problem that quietly undermines many chatbot deployments: the domain’s own language isn’t as consistent as people assume.

In real domains, terminology drifts. Different documents use the same phrase in different ways. Definitions evolve. Exceptions accumulate. Two nearly identical terms may refer to entirely different concepts. And none of this is visible until you try to make an LLM reason over it.

This creates a subtle but painful failure mode: the assistant retrieves the “correct” piece of content, but the meaning of the term inside it is not the meaning the user intended.

One of the ways we recently tackled this in a delivery project was by building a tool for ambiguity analysis to expose these issues before they turned into user-facing errors.

The tool works by (a rough sketch follows the list):

  • Clustering similar words/terms in the knowledge base
  • Analysing how each term is used in different contexts
  • Identifying inconsistencies and overloaded terminology
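
Here is a sketch of the second and third steps - the embedding model, the two-cluster assumption and the silhouette threshold are illustrative choices, not the exact implementation we used:

```python
# Sketch: flag terms whose usage contexts split into well-separated clusters,
# which suggests the term is overloaded across the knowledge base.
# Embedding model, cluster count and threshold are illustrative assumptions.
import re
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

model = SentenceTransformer("all-MiniLM-L6-v2")

def usage_sentences(term: str, documents: list[str]) -> list[str]:
    # Collect every sentence in the corpus that mentions the term.
    sentences = []
    for doc in documents:
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            if term.lower() in sent.lower():
                sentences.append(sent)
    return sentences

def looks_overloaded(term: str, documents: list[str], threshold: float = 0.35) -> bool:
    contexts = usage_sentences(term, documents)
    if len(contexts) < 6:
        return False  # not enough evidence either way
    embeddings = model.encode(contexts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
    # A high silhouette score means the two usage clusters are well separated,
    # i.e. the same term is being used in clearly different contexts.
    return silhouette_score(embeddings, labels) > threshold

# Usage (hypothetical names): flagged = [t for t in candidate_terms if looks_overloaded(t, corpus_docs)]
```

Anything flagged this way is a candidate for SME review rather than an automatic verdict - the point is to narrow hundreds of terms down to the handful worth a human look.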

If a term carries multiple meanings depending on the source, scenario, or regulatory framing, the tool surfaces it. Depending on the results, this may mean:

  • refining chunking
  • adjusting retrieval
  • adding prompt-level disambiguation
  • adding clarifying documentation
  • routing certain queries for SME oversight

This exercise not only helped us improve chatbot accuracy; it also gives teams a new kind of visibility: a map of where the domain’s own language is unstable.

In environments where correctness matters, this is super useful. Ambiguity shows up to the end user as “the bot got it wrong,” but the real culprit may be the structure of the domain lexicon itself. Once you can see the ambiguity, you can design around it.

Beyond the Chat Window

What you quickly realise when building AI chatbots is that the effort has very little to do with the chatbot itself. The real work sits around it: evaluation pipelines, retrieval tuning, guardrails, governance, and constant loops of user insight feeding back into the design.

In practice, high-quality AI assistants are built through a multi-disciplinary feedback loop:

  • User research reveals how people really use things, what they misunderstand, and where the domain is messy.
  • Governance defines the boundaries - what the system must do, must avoid, and how it should communicate uncertainty.
  • Engineering translates those discoveries into retrieval adjustments, prompt refinements, model evaluations, and safety controls.

Every iteration forces all three streams back into conversation with each other. A real user interaction reveals a behaviour you didn’t anticipate → engineering traces the underlying cause → governance updates boundaries or expectations → the system prompt, retrieval, or guardrails shift → evaluation measures the impact → and the cycle repeats.

In my experience, most of the time spent building AI chatbots or assistants goes into the quality of their outputs. That’s why tools like automated evaluation engines, synthetic data & ambiguity analysis can make a real impact - they give teams the visibility and leverage they need to build systems that behave consistently.

So yes - AI is a liar sometimes. A probabilistic model operating over inconsistent human language always will be. But that’s not an argument for resignation; it just means you treat the system according to its risk profile and then you test, monitor, and improve it.

Once you treat accuracy as a system property rather than a model quirk, the whole thing shifts. Uncertainty becomes measurable, drift becomes detectable, behaviour becomes improvable. You stop hoping your chatbot behaves, and you start engineering it to.