RAG · AI agents · latency · cost optimization

Two-path retrieval — when not to ask the LLM

Zachary Jackson

A flow diagram showing two retrieval paths converging on a single response

The wrong default

If you build a chatbot today, the default architecture goes like this: every user message hits an LLM, the LLM decides what to do, the LLM generates a response, and the bill goes to your OpenAI account. It works. It's also expensive and slow for the boring 80% of questions.

At a festival we built a chatbot for, the most-asked question all weekend was "what time is the gate open?" Six words. Same answer for fifty thousand people. There is no universe in which that question should burn 1,200 tokens and 1.5 seconds of latency per turn.

The pattern below is what we shipped — and it's why the cost line on that weekend looked nothing like what a pure-LLM rollout would have produced.

The pattern

                       User message
                            │
                            ▼
                      Keyword scorer
                            │
                  ┌─────────┴─────────┐
                  │                   │
            score ≥ 18?         score < 18?
                  │                   │
                  ▼                   ▼
           Static answer         LLM call
            (free, <10ms)    (Assistants API +
                              vector store,
                              paid, ~1.5s)

Two paths. The fast path handles the common case for free. The LLM path handles everything else — composition, weird phrasings, follow-ups, things the FAQ didn't anticipate. A confidence threshold routes between them.
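The routing step itself fits in a few lines of Python. Treat this as a minimal sketch under stated assumptions, not the code we shipped: `Reply`, `route`, `score_faq`, and `ask_llm` are hypothetical names, the two callables stand in for whatever scorer and LLM client you use, and the `>=` comparison follows the diagram above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

THRESHOLD = 18  # deployment default from this post; tune it for your own data


@dataclass
class Reply:
    text: str
    path: str     # "static" or "llm"
    score: float  # best keyword score, kept so both paths can be logged


def route(
    message: str,
    score_faq: Callable[[str], Tuple[float, Optional[str]]],
    ask_llm: Callable[[str], str],
    threshold: float = THRESHOLD,
) -> Reply:
    """Answer from the FAQ when the scorer is confident, else call the LLM.

    score_faq and ask_llm are hypothetical stand-ins: the first returns
    (best_score, static_answer_or_None), the second wraps your LLM client.
    """
    score, answer = score_faq(message)
    if answer is not None and score >= threshold:
        return Reply(answer, "static", score)     # fast path, no tokens billed
    return Reply(ask_llm(message), "llm", score)  # paid, slower path
```

Passing the scorer and the LLM call in as arguments keeps the router testable with stubs, and returning the score on both paths makes it easy to log later.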

The keyword scorer

The fast path is a small synonym-expanded scoring function. For every FAQ row in the knowledge base, score the user message:

  • Phrase matches (bigrams, trigrams) get the highest weight
  • Exact word matches in the question or answer get a medium weight
  • Normalized matches (stemmed, plural-stripped) catch "tickets" vs. "ticket"
  • Synonyms expand from a small hand-tuned dictionary — "playing" → ["performing", "appearing"], "buy" → ["purchase", "get"]
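The four signals above can be sketched as one scoring function. This is an illustration, not the scorer from our deployment: the weights are invented for the example (only their ordering follows the list), the plural stripping is deliberately crude, and `score_row` and `best_match` are hypothetical names. Because the weights are made up, the threshold of 18 is not calibrated against them.

```python
import re
from typing import Dict, List, Set, Tuple

# Illustrative weights; the real values are not given in this post.
PHRASE_WEIGHT = 10
EXACT_WEIGHT = 4
NORMALIZED_WEIGHT = 3
SYNONYM_WEIGHT = 2
SYNONYMS: Dict[str, List[str]] = {
    "playing": ["performing", "appearing"],
    "buy": ["purchase", "get"],
}


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(word: str) -> str:
    """Crude plural stripping; a real stemmer would do better."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def ngrams(words: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def score_row(message: str, question: str, answer: str) -> int:
    msg, q, a = tokens(message), tokens(question), tokens(answer)
    row_words = set(q) | set(a)
    row_norm = {normalize(w) for w in row_words}
    score = 0
    # Shared bigrams and trigrams carry the most weight.
    for n in (2, 3):
        row_phrases = ngrams(q, n) | ngrams(a, n)
        score += PHRASE_WEIGHT * len(ngrams(msg, n) & row_phrases)
    # Each distinct message word scores once, at the strongest tier it hits.
    for word in set(msg):
        if word in row_words:
            score += EXACT_WEIGHT
        elif normalize(word) in row_norm:
            score += NORMALIZED_WEIGHT
        elif any(s in row_words for s in SYNONYMS.get(word, [])):
            score += SYNONYM_WEIGHT
    return score


def best_match(message: str, faq: List[Tuple[str, str]]) -> Tuple[int, str]:
    """Return (score, answer) for the top (question, answer) row; faq must be non-empty."""
    return max(((score_row(message, q, a), a) for q, a in faq), key=lambda p: p[0])
```

One thing this sketch does not do is drop stop words, so "the" and "is" earn points on almost every row. You would probably want to filter those, or weight them down, before trusting any threshold.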

If the top match scores at or above the threshold (default 18 in our deployment), you respond with the static answer for that row. Done. No tokens billed. Sub-10-millisecond response.

If nothing reaches the threshold, fall through to the LLM.

Why a hand-tuned dictionary instead of embeddings on the fast path?

Using embeddings on the fast path is tempting: same shape, more semantic recall. We tried it. Two problems killed the idea for our use case:

  1. Embedding the user query still costs a network round-trip to OpenAI. That makes the fast path not actually fast.
  2. Embeddings on short user messages are noisy. "When?" doesn't have enough signal to embed meaningfully without context. Keyword + synonym was more predictable in production.

The LLM path does use embeddings (the Assistants API vector store handles them). But that path is reserved for queries that need actual reasoning, not "where do I park?"

The confidence threshold is your knob

The threshold is what you tune. Too low and you'll route ambiguous queries to a static answer that doesn't quite fit — the chatbot looks brittle. Too high and everything falls through to the LLM, defeating the point.

We started at 15 and ended up at 18 after a weekend of logs. The right number depends on:

  • How varied your knowledge base is (more rows → more chance of false-positive matches → higher threshold)
  • How forgiving your users are of "close but not exact" answers (B2B users want precision, festival-goers want speed)
  • Whether you can A/B route a percentage to LLM and compare resolution quality
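For the A/B idea in the last bullet, one common approach is to hash a stable identifier into a bucket so the same conversation always lands on the same side. The sketch below is an assumption about how you might do it, not a description of our setup: `in_llm_holdout`, the `conversation_id` parameter, and the 5% default are all hypothetical.

```python
import hashlib


def in_llm_holdout(conversation_id: str, percent: int = 5) -> bool:
    """Deterministically place roughly `percent`% of conversations in the holdout.

    Conversations in the holdout go to the LLM even when the keyword score
    clears the threshold, so you can compare both answers on the same traffic.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return bucket < percent
```

Hashing instead of calling a random number generator means a user does not flip between paths mid-conversation, and you can recompute the assignment later from the logs alone.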

You should log scores even on the routed-to-LLM path so you can iterate on the threshold against real traffic.
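Once the scores are logged, replaying them against candidate thresholds is cheap. Here is a minimal sketch, assuming you log one JSON line per message; the field names, `log_record`, and `fast_path_share` are hypothetical and not part of any kit.

```python
import json
import time
from typing import Dict, Iterable, List


def log_record(message: str, score: float, path: str) -> str:
    """Serialize one routing decision as a JSON line (field names are illustrative)."""
    return json.dumps(
        {"ts": time.time(), "message": message, "score": score, "path": path}
    )


def fast_path_share(
    scores: Iterable[float], thresholds: Iterable[float]
) -> Dict[float, float]:
    """Fraction of logged messages that would take the static path at each threshold."""
    logged: List[float] = list(scores)
    if not logged:
        return {t: 0.0 for t in thresholds}
    return {t: sum(s >= t for s in logged) / len(logged) for t in thresholds}
```

For example, `fast_path_share([10, 15, 18, 25], [15, 18])` returns `{15: 0.75, 18: 0.5}`. Note that this only tells you how much traffic each threshold would divert. Whether those static answers were actually right still needs a quality signal, such as a human review of sampled messages.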

Open-source kit

The full pattern — Python scripts to build the knowledge base from a CSV, two n8n Code nodes for the routing logic, and a minimal example workflow — is open-sourced as rag-chatbot-n8n-kit. Clone it, drop in your own CSV of FAQs, and you have a working two-path bot in a few minutes.

The takeaway

The cheapest, fastest LLM call is the one you don't make. Most production chatbot deployments would benefit from a fast path layered in front of the LLM — not because the LLM is bad at the easy questions, but because it's too expensive an answer for them.

Build a thin keyword scorer. Tune the threshold against real traffic. Save the LLM for the hard questions where it earns its keep.
