RAG

RAG, short for Retrieval-Augmented Generation, is an AI architecture that combines retrieval of relevant information with answer generation. In other words, the model does not respond only from what it learned during training. It first retrieves relevant material from external sources and only then generates an answer based on those retrieved sources.

At first glance, RAG can look like ordinary document search with AI layered on top. The user asks a question, the system finds something and the language model answers. In reality, it is a more important architectural pattern than that. It addresses one of the main limitations of large language models – their limited access to current, internal or precisely verifiable information.

RAG is a way to connect a language model to external knowledge sources. The system first retrieves relevant parts of documents, databases or a knowledge base, then inserts those materials into the model’s context. The model answers using those supplied sources instead of relying only on its general training.

What RAG really means in practice

A large language model can work very well with text. It can answer questions, write articles, summarise documents, translate, explain difficult concepts or draft emails. But that does not mean it automatically has access to all current and specific information.

A model does not know a company’s internal documentation unless that material is explicitly supplied to it. It may not know the latest price list, the current commercial terms, the newest version of a contract, specific product data or the contents of an internal knowledge base. If it answers without those materials, the answer may be generic, outdated or inaccurate.

RAG solves that problem by inserting a retrieval layer between the user query and the final answer. That layer finds relevant information and the model uses it as evidence.

In practice, the flow can look like this:

  • the user asks how to handle a complaint about damaged goods,
  • the system finds the relevant section of the complaints policy,
  • it selects a few relevant passages,
  • it inserts them into the model’s context,
  • the model turns them into a clear answer,
  • ideally, the user can also see the source behind the answer.

Without RAG, the model might answer only in general terms. With RAG, it has a chance to answer according to a specific source document.
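That flow can be sketched in a few lines of Python. Everything named here – `index.search`, `llm.generate`, the prompt wording – is a hypothetical placeholder standing in for whatever retrieval index and language model the system actually uses; it is a minimal sketch of the pattern, not a specific product’s API.

```python
def answer_with_rag(question: str, index, llm, top_k: int = 4) -> dict:
    # 1. Retrieval: find the most relevant passages for this question.
    passages = index.search(question, top_k=top_k)        # hypothetical index API

    # 2. Augmentation: put the question and the retrieved evidence into one prompt.
    context = "\n\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    prompt = (
        "Answer only from the context below. "
        "If the answer is not there, say it cannot be verified.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generation: the model answers from the question plus the evidence.
    answer = llm.generate(prompt)                         # hypothetical model call

    # Ideally the user also sees which sources the answer is based on.
    return {"answer": answer, "sources": [p["source"] for p in passages]}
```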

Why RAG emerged

Large language models have one major strength: they are very good at generating language. At the same time, they have several limitations.

A model may have outdated knowledge because its training happened at a particular point in time. It may not know private company documents. It may not know which version of a document is the valid one. And if it lacks evidence, it may try to fill in missing parts of the answer on its own.

This is exactly where RAG becomes useful. It does not try to teach everything directly into the model. Instead, it supplies the model with relevant information at the time of the query.

RAG is not a method that retrains the model. It is a way of giving the model the right supporting materials for a specific question. The model then generates an answer not only from its general knowledge, but also from the additional retrieved context.

What the acronym RAG means

RAG stands for Retrieval-Augmented Generation. Each part of the name describes one important stage.

  • Retrieval – the system finds relevant information from external sources such as documents, a knowledge base, a database, an internal wiki or the web.
  • Augmented – the retrieved information is added to the query as extended context for the model.
  • Generation – the language model generates an answer from the user’s question and the supplied context.

Put simply: RAG first retrieves, then augments the context, and only after that generates the answer.

How RAG works step by step

RAG can be understood as a chain of connected stages. Some happen during data preparation, others happen only when the user asks a question.

  1. Sources are prepared – the system needs somewhere to search. These sources may be PDFs, internal articles, manuals, knowledge base entries, technical documentation, product data, contracts, databases or website content. If the sources are low quality, outdated or duplicated, the RAG output will be worse.
  2. Documents are split into smaller parts – a long document is usually not handled as one whole unit. It is split into smaller chunks so that retrieval can find the relevant passage, not just the entire file.
  3. The content is stored in a retrieval index – to search efficiently, the system builds an index over the content. That index may rely on keywords, semantic similarity, metadata or a combination of methods.
  4. Texts are often converted into embeddings – an embedding is a numerical representation of meaning. It helps the system compare the semantic similarity of the query and the document even if they do not share the same wording.
  5. The user asks a question – at this point the system does not immediately send the question to the model for free-form generation. It first tries to retrieve relevant evidence.
  6. Retrieval finds the most relevant passages – the retrieval layer compares the query with indexed content and selects the pieces that are most relevant. These may be paragraphs, rows from a table, parts of technical documentation or other structured records.
  7. The retrieved passages are inserted into the prompt – the system creates an augmented input for the model. It usually contains the original user question, the retrieved passages and often an instruction telling the model to answer only from those sources.
  8. The model generates the answer – the language model receives both the question and the supporting context, then generates an answer from them. This is the generation stage of RAG.
  9. The answer may include sources – a well-designed RAG system makes it possible to trace where the answer came from. That matters especially in companies, law, technical support, healthcare, finance and other settings where a fluent answer alone is not enough.

RAG is not just “a model over documents”. It does not mean you upload a pile of files into AI and the model automatically finds the right answer by itself. For RAG to work, the documents first need to be prepared in a form that supports retrieval. The text has to be extracted, cleaned and split into smaller segments, because the model usually does not need an entire long document for one answer. It needs the relevant passage.

These smaller segments are then stored in a retrieval index. You can think of the index as a technical catalogue built over the documents – the system stores information that later helps it find the right passage quickly. That may include keywords, metadata, source references or numerical meaning representations such as embeddings.

When the user asks a question, the system does not have to scan everything from beginning to end. It uses the index to select only the parts most relevant to the query. These passages are then added to the model’s prompt as context. Only after that does the model generate the answer.
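The preparation stage can be sketched in the same spirit. The chunk size, the overlap and the metadata fields below are arbitrary assumptions chosen for illustration, and `embed` stands in for whatever embedding model the system uses.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character windows (one simple chunking strategy)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def build_index(documents: list[dict], embed) -> list[dict]:
    """Turn raw documents into indexed chunks with embeddings and basic metadata."""
    index = []
    for doc in documents:
        for i, chunk in enumerate(chunk_text(doc["text"])):
            index.append({
                "text": chunk,
                "embedding": embed(chunk),            # numerical representation of meaning
                "source": doc["title"],
                "chunk_id": i,
                "valid_from": doc.get("valid_from"),  # metadata later helps pick the current version
            })
    return index
```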

RAG versus an ordinary chatbot

The difference between a standard chatbot and a RAG-based chatbot becomes clear when the user is not asking about general knowledge, but about a specific rule, document or current internal information.

Imagine an e-commerce support scenario. A customer asks:

“My order arrived damaged. How should I file a complaint?”

A standard chatbot without RAG may answer only in general terms. It mainly relies on what the model learned during training and on the current conversation. So it may respond with something like:

“You should usually take photos of the damage, contact the seller and wait for further instructions.”

That answer may sound reasonable, but it may not reflect the specific rules of the given company. It does not know whether the company requires photos of the outer packaging, what the reporting deadline is, whether a form must be filled in, whether the item should be returned immediately or whether the customer should wait for collection instructions.

A RAG chatbot works differently. It first searches the company’s complaint policy, damaged-shipment procedure or internal help article. Only then does it answer according to the retrieved source.

It might therefore answer like this:

“According to the damaged shipment procedure, you should first photograph both the damaged product and the outer packaging, include the order number in the complaint form, and submit the complaint within 48 hours of delivery. Do not send the item back until customer support contacts you.”

The key difference is not just that the second answer sounds more specific. The main difference is the evidence behind it. A standard chatbot answers generically. A RAG chatbot answers according to a concrete source the system retrieved first.

RAG versus fine-tuning

RAG is often confused with fine-tuning. Fine-tuning means training the model further on a specific dataset or task. RAG means the model stays more general, but receives relevant external material at answer time.

The difference is fundamental.

  • Fine-tuning changes the behaviour of the model or adapts it to a certain task style.
  • RAG gives the model information at the moment the question is asked.

If a company updates its pricing, commercial terms, technical documentation or internal rules, RAG is often more practical. You update the source documents and the index. You do not need to retrain the model each time.

Fine-tuning is more useful when you want to change answer style, output format, specialised behaviour or recurring task handling. RAG is more useful when knowledge changes often and accuracy depends on specific sources.

RAG versus a long context window

Some modern models support very long context windows. That does not mean RAG stops being useful.

A long context window allows the model to accept more text at once. RAG solves a different problem: selecting the few passages that are actually relevant from a much larger set of documents.

If a company has thousands of documents, it is not practical to send all of them to the model for every query. That would be expensive, slow and often less accurate. RAG acts as a filter that selects only the relevant parts.

Practically speaking:

  • a long context window gives the model a larger working space,
  • RAG helps decide what should go into that space,
  • a good system often uses both – enough context capacity and good retrieval.

RAG and retrieval

Retrieval is the first and often the most important part of RAG. If the system retrieves the wrong material, the model may generate the wrong answer from it. If it retrieves too little, the answer may be incomplete. If it retrieves too much irrelevant text, the answer may become confused.

Retrieval determines what the model actually sees.

That is why a RAG system often has to address questions such as:

  • how to split documents into chunks,
  • how to build the index,
  • whether to search by keywords, meaning or both,
  • how to handle dates and document versions,
  • how to rank results by relevance,
  • how many passages to place into context,
  • how to cite sources,
  • how to recognise when no good source was found.

RAG and embeddings

Embeddings matter especially in semantic retrieval. They allow the system to search by meaning, not only by shared wording.

Example:

  • the user asks: “How should we handle a late payment?”
  • the document is titled: “Procedure for overdue invoice payment”,
  • the wording is different,
  • but the semantic meaning matches.

Pure keyword retrieval may miss this connection. Semantic retrieval using embeddings has a better chance to find it.

That does not mean embeddings are always better than keywords. For exact identifiers such as order numbers, VAT IDs, SKUs or specific product names, keyword search may be more precise. This is why hybrid retrieval is common in practice.
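The late-payment example can be made concrete with a cosine-similarity comparison between embedding vectors. The `embed` calls in the comment are placeholders for a real embedding model; the point is only that the query and the document title are compared as vectors of meaning rather than as strings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors, from -1 (opposite) to 1 (same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical usage with some embedding model `embed`:
# query_vec = embed("How should we handle a late payment?")
# doc_vec   = embed("Procedure for overdue invoice payment")
# cosine_similarity(query_vec, doc_vec)   # high despite little shared wording
```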

RAG and tokens

Tokens matter in RAG because the retrieved passages are inserted into the model’s context, and that context has limited capacity.

If retrieval returns ten long passages, they consume many tokens. The model then has less space for the answer itself and may also become distracted by non-essential context.

A good RAG system therefore does not only ask whether something relevant was found. It also asks how much of that material should actually be passed to the model.

Example:

  • a weak RAG pipeline sends the model a long document where only one paragraph is relevant,
  • a better RAG pipeline sends only that paragraph plus a few surrounding lines,
  • the model receives clearer evidence and can answer more accurately and more efficiently.
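A minimal sketch of that selection step, assuming each retrieved passage already carries a relevance score; the word-based token count and the budget value are crude assumptions, not real tokenizer output.

```python
def fit_to_budget(passages: list[dict], max_tokens: int = 1500) -> list[dict]:
    """Keep the highest-ranked passages that still fit into the context budget."""
    selected, used = [], 0
    for p in sorted(passages, key=lambda p: p["score"], reverse=True):
        cost = len(p["text"].split())     # rough stand-in for a real token count
        if used + cost > max_tokens:
            break
        selected.append(p)
        used += cost
    return selected
```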

RAG and prompt engineering

RAG is not separate from prompt engineering. The retrieved passages are usually assembled into the prompt that the model receives.

So it is not enough just to find a document. The system also has to present it to the model correctly.

The prompt may include instructions such as:

  • answer only from the supplied context,
  • if the answer is not present in the supplied sources, say that it cannot be verified,
  • do not add assumptions,
  • separate facts from interpretation,
  • cite the supporting passage you rely on.

Without such instructions, the model may generalise the retrieved material badly or add information that is not present in the source.
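Assembled into a prompt, those instructions might look roughly like the template below. The exact wording, numbering and delimiters are assumptions; the point is that the evidence and the rules for using it travel to the model together.

```python
def build_rag_prompt(question: str, passages: list[dict]) -> str:
    """Combine the user question, the retrieved passages and the usage rules into one prompt."""
    context = "\n\n".join(
        f"[{i + 1}] ({p['source']}) {p['text']}" for i, p in enumerate(passages)
    )
    return (
        "Answer the question using only the numbered context passages below.\n"
        "If the answer is not in the passages, say it cannot be verified from the sources.\n"
        "Do not add assumptions, and cite the passage numbers you rely on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```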

RAG and grounding

Grounding means anchoring the answer in concrete sources. RAG is one of the main ways to achieve grounding in AI systems.

Without grounding, a model may answer fluently, but it is unclear where the information comes from. With grounding, the answer is tied to a specific document, record or passage.

This matters especially for:

  • company rules,
  • technical documentation,
  • customer support,
  • legal and contractual materials,
  • product data,
  • internal knowledge bases,
  • content where the source needs to be verifiable.

RAG does not guarantee grounding automatically. If the wrong source is retrieved, or the model misreads the source, the answer may still be wrong.

What RAG can improve

RAG is most useful when the model needs to work with specific, current or private information.

It can improve things such as:

  • answers based on specific source documents,
  • use of more current information than the model had during training,
  • reduced need to retrain the model whenever documentation changes,
  • better traceability of answers through source references,
  • AI assistants over internal company knowledge bases,
  • answers aligned with internal company rules,
  • lower risk of unsupported answers because the model has evidence to work from.

The word “lower” matters. RAG reduces the risk of hallucination. It does not eliminate it completely.

Where RAG is used

RAG is mainly used in systems where AI is expected to answer based on specific content.

  • Customer support – an assistant can answer from help articles, complaint rules, manuals, product documentation or commercial terms.
  • Internal company assistants – employees can ask about internal procedures, policies, processes or documentation and receive answers grounded in company material.
  • Technical documentation – RAG is useful for manuals, API documentation, service procedures, technical sheets or internal engineering knowledge bases.
  • E-commerce – it can work with product parameters, availability, compatibility, instructions, complaints or customer FAQ content.
  • Legal and contract documents – it can help locate relevant clauses in contracts, commercial terms or internal policies, although legal interpretation still requires human review.
  • Editorial and content work – it can help work with source materials, previous articles, internal notes or structured knowledge archives.

Why RAG is not just search

RAG is sometimes simplified as “search plus AI”. That is not entirely wrong, but it is incomplete.

Search returns documents or links that a person still has to read manually. RAG goes further. It retrieves relevant passages, inserts them into the model’s context, and the model turns them into an answer.

The difference is:

  • Search – finds documents that the user must read.
  • RAG – finds relevant information and the model turns it into an answer.

That is useful, but it also introduces risk. As soon as the model starts reformulating the information, it can simplify, omit or connect things badly. That is why source traceability is so important.

Types of retrieval used in RAG

RAG can use multiple retrieval strategies, often in combination.

Keyword retrieval

Keyword retrieval looks for exact or similar wording. It works well for names, codes, numbers, SKUs, VAT IDs, product names or exact formulations.

It is fast and easy to explain. It is weaker when the user asks in different words from the source document.

Semantic retrieval

Semantic retrieval looks for similarity of meaning. It usually relies on embeddings and vector comparison.

It works well for natural queries such as:

  • “What should we do if a shipment is damaged?”
  • “How do we handle late payment?”
  • “Where is it written who approves exceptions?”

The query and the document do not need to share the same words if their meaning is related.

Hybrid retrieval

Hybrid retrieval combines keyword search and semantic retrieval. In practice, it is often the best option because it covers both exact identifiers and natural-language queries.
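One common way to combine the two is weighted score fusion: each passage gets a keyword score and a semantic score, and the final ranking mixes them. The `keyword_score` and `semantic_score` functions and the equal weights below are assumptions for illustration; in practice the keyword side is often BM25 and the weights are tuned.

```python
def hybrid_rank(query: str, passages: list[dict],
                keyword_score, semantic_score,
                w_keyword: float = 0.5, w_semantic: float = 0.5) -> list[dict]:
    """Rank passages by a weighted mix of keyword and semantic relevance."""
    for p in passages:
        p["score"] = (w_keyword * keyword_score(query, p["text"])
                      + w_semantic * semantic_score(query, p["embedding"]))
    return sorted(passages, key=lambda p: p["score"], reverse=True)
```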

Metadata-based retrieval

Metadata means supporting information about a document, such as date, version, author, type, department, language, validity or permission level.

Metadata is highly important in RAG. It helps decide which document is current, valid and visible to the specific user.
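A minimal sketch of metadata-aware filtering, assuming each indexed chunk carries fields such as `language`, `valid_from` and `valid_to`; the field names are invented for illustration.

```python
from datetime import date

def filter_by_metadata(chunks: list[dict], today: date, language: str) -> list[dict]:
    """Keep only chunks that are currently valid and written in the requested language."""
    return [
        c for c in chunks
        if c.get("language") == language
        and (c.get("valid_from") is None or c["valid_from"] <= today)
        and (c.get("valid_to") is None or today <= c["valid_to"])
    ]
```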

RAG and permissions

In company settings, security is critical. A RAG system must not show users information they are not allowed to access.

It is therefore not enough just to retrieve a relevant document. The system also has to consider who is asking and what they are allowed to see.

Without proper permission handling, RAG may accidentally surface sensitive information from HR, finance, contracts, internal strategy documents or non-public commercial data.

A secure RAG setup should address:

  • who is asking,
  • which documents that user may access,
  • whether permissions are enforced during retrieval, not only after generation,
  • whether the answer could leak information from restricted sources,
  • whether used sources can be audited afterwards.
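A sketch of the key idea – enforcing access during retrieval rather than after generation. The access model here (a set of group labels on each chunk and on each user) is deliberately simplified, and `index.search` is again a hypothetical API.

```python
def retrieve_with_permissions(query: str, index, user_groups: set[str], top_k: int = 4) -> list[dict]:
    """Return only passages the asking user is allowed to see."""
    candidates = index.search(query, top_k=top_k * 5)   # hypothetical index API, over-fetch a little
    allowed = [c for c in candidates if c["allowed_groups"] & user_groups]
    return allowed[:top_k]
```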

RAG and data freshness

One of the biggest advantages of RAG is that it can work with fresher data. If documentation, pricing or rules change, you do not need to retrain the whole model. You can update the source or the index instead.

But that only works if the system is designed well.

You still have to handle:

  • how often the documents are updated,
  • how old versions are removed,
  • how the valid version is identified,
  • how duplicates are handled,
  • how quickly changes appear in the index,
  • whether the model might still answer from stale evidence.

RAG over a chaotic folder full of old documents will not be reliable. It will simply surface the existing chaos more efficiently.
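One small piece of that design is making sure retrieval only ever sees the newest version of each document. A sketch, assuming each chunk records a `doc_id` and a `modified` timestamp (both invented field names):

```python
def keep_latest_versions(chunks: list[dict]) -> list[dict]:
    """For each document, keep only the chunks belonging to its most recent version."""
    latest = {}
    for c in chunks:
        if c["doc_id"] not in latest or c["modified"] > latest[c["doc_id"]]:
            latest[c["doc_id"]] = c["modified"]
    return [c for c in chunks if c["modified"] == latest[c["doc_id"]]]
```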

RAG and hallucinations

RAG is often described as a way to reduce hallucinations in AI. That is only partly true.

RAG can lower the risk that the model answers with no evidence at all. If it has relevant source documents, it has something to anchor the answer to.

But hallucinations can still happen inside a RAG system.

For example, when:

  • retrieval finds the wrong document,
  • the system retrieves an old version,
  • the model incorrectly combines two unrelated passages,
  • the answer is not present in the sources but the model tries to fill it in anyway,
  • the prompt does not instruct the model to admit uncertainty,
  • the sources contradict each other.

RAG is not a guarantee of truth. It is a way to make answers more source-based.

IMPORTANT! RAG does not remove hallucinations automatically. If the system retrieves the wrong evidence or the model interprets the retrieved material badly, the final answer can still be wrong. That is why source quality, permissions, document versioning and output review still matter.

What a badly designed RAG system looks like

A badly designed RAG system often looks impressive in a demo but fails in real use. It may answer a few showcase questions convincingly, yet return outdated, irrelevant or inaccurate material on more difficult queries.

Typical problems include:

  • outdated documents in the index,
  • multiple conflicting versions of the same information,
  • documents split into badly sized chunks,
  • headings separated from the paragraphs they define,
  • no proper use of metadata,
  • retrieval that returns too much noise,
  • no source citations,
  • no permission handling,
  • a model that answers even when it should say that the evidence is insufficient.

So RAG is not created simply by attaching documents to AI. The quality of the whole design matters.

How to recognise a good RAG system

A good RAG system should answer accurately, traceably and with awareness of its own limits.

It should be able to:

  • retrieve the right document or passage,
  • prioritise current and valid sources,
  • respect user permissions,
  • return the sources behind the answer,
  • avoid answering without evidence when the answer is not in the sources,
  • separate sourced facts from interpretation,
  • avoid overwhelming the model with irrelevant context,
  • perform well on real user queries, not only in demos.

You do not recognise good RAG because the answer sounds nice. You recognise it because the answer is grounded in the right source.

How RAG is tested

Testing RAG is not just testing the language model. You need to test the full chain.

It makes sense to evaluate, for example:

  • whether the system retrieved the correct sources,
  • whether the correct sources appeared near the top results,
  • whether the answer truly reflects the retrieved material,
  • whether the model added unsupported information,
  • whether citations match the actual supporting text,
  • whether the system handles queries whose answer is not present in the sources,
  • whether restricted documents are never shown to unauthorised users.

In company deployment, it is useful to build a test set of real queries, not only queries that make the system look good.
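Part of that test set can be scored automatically. The sketch below measures a simple retrieval recall: for each test query, did at least one of the expected sources appear in the top results? The structure of the test cases is an assumption, and `index.search` is a hypothetical API.

```python
def retrieval_recall(test_set: list[dict], index, top_k: int = 5) -> float:
    """Fraction of test queries for which at least one expected source was retrieved."""
    hits = 0
    for case in test_set:
        retrieved = {p["source"] for p in index.search(case["query"], top_k=top_k)}
        if retrieved & set(case["expected_sources"]):
            hits += 1
    return hits / len(test_set) if test_set else 0.0
```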

Where RAG makes the most sense

RAG has the highest value where there is a concrete body of knowledge that people query frequently.

Typical examples include:

  • internal company documentation,
  • product databases,
  • manuals and instructions,
  • technical documentation,
  • complaint rules and commercial terms,
  • legal documents,
  • customer knowledge bases,
  • sales enablement materials,
  • company wikis,
  • archives of articles or expert texts.

RAG makes less sense where there are no good sources, where precision is not important or where the task is purely creative and does not need to rely on documents.

RAG and multimodal models

RAG does not have to work only with text. In multimodal systems, retrieval can also target images, PDF pages, charts, screenshots, audio or video.

Example:

  • the user uploads a product photo,
  • the system retrieves similar items from a product database,
  • the model compares the visual input with text and product data,
  • it answers what product it most likely is.

Multimodal RAG is more complex than text-only RAG because it must work with several types of representation. In document workflows, it may combine OCR, page layout, text passages, images and metadata.

RAG in company environments

In a company, RAG is not only a technical feature. It is a way of working with knowledge.

If a company has messy documents, RAG will not fix that problem. It will expose it more quickly. The system may efficiently retrieve the wrong document, an old version or internally contradictory information.

Before deploying RAG, it therefore makes sense to define:

  • which sources should be included,
  • who owns those sources,
  • how the valid version is identified,
  • how documents are updated,
  • how permissions are set,
  • how errors will be measured,
  • who reviews high-risk answers,
  • what should happen if no good answer is found.

RAG is not a shortcut to knowledge hygiene. It is a layer that can make good knowledge systems much more usable. It does not rescue a bad knowledge base.

RAG and cost

RAG can reduce cost compared with sending full documents to the model for every query. At the same time, it adds its own costs for indexing, embeddings, retrieval, storage, updates and infrastructure.

The cost is influenced mainly by:

  • document volume,
  • update frequency,
  • chunk size,
  • the number of retrieved passages inserted into the prompt,
  • the price of the embedding model,
  • the price of the language model,
  • the number of user queries,
  • the requirement for citations and auditability,
  • the security and permissions layer.

That is why production RAG systems are designed not only for answer quality, but also for speed, cost and operational sustainability.
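A rough back-of-the-envelope estimate makes those factors concrete. Every number below – token counts, per-token prices, query volume – is a made-up assumption used only to show the shape of the calculation.

```python
# Assumed inputs: replace with real figures for your own models and traffic.
queries_per_month   = 50_000
prompt_tokens       = 2_000    # question + retrieved passages + instructions
output_tokens       = 300
input_price_per_1k  = 0.0005   # hypothetical price per 1,000 input tokens
output_price_per_1k = 0.0015   # hypothetical price per 1,000 output tokens

cost_per_query = (prompt_tokens / 1000) * input_price_per_1k \
               + (output_tokens / 1000) * output_price_per_1k
monthly_generation_cost = queries_per_month * cost_per_query
# Embedding, indexing, storage and update costs come on top of this.
```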

RAG and AI agents

RAG often acts as the knowledge layer for AI agents. An agent may not only answer a question, but also take another step – for example create a support ticket, prepare an email, locate a document, compare contracts or suggest a follow-up action.

RAG gives the agent the evidence it needs to make those decisions.

Example:

  • the agent receives a customer question,
  • RAG retrieves the relevant rules and prior documentation,
  • the model drafts the answer,
  • the agent opens the case in the helpdesk,
  • a human checks the output.

With agents, security matters even more. If the system can not only answer but also act, its boundaries must be clearly defined.

Common mistakes in RAG design

The most common mistake is thinking that RAG appears automatically once documents are “uploaded into AI”. That is only the beginning.

Frequent design problems include:

  • Low-quality sources – the system retrieves from old, duplicated or incomplete documents.
  • Poor chunking – the relevant passage is separated from its heading, definition, table or exception.
  • Missing metadata – the system does not know which version is valid.
  • Too much context – the model receives too much text and loses focus.
  • Too little context – the model gets too little evidence and starts guessing.
  • No citations – the user cannot see where the answer comes from.
  • Ignored permissions – the system may expose sensitive data to the wrong user.
  • No proper evaluation – the system is judged by impressions instead of real queries and correct sources.

When RAG will not help

RAG is not the right solution for everything. It will not help where there are no good sources. It will not help where the task is purely creative and does not need grounding in documents. It will also not solve situations where the sources themselves contradict each other and no priority rule has been defined.

RAG will not automatically repair badly written documentation either. If the internal knowledge base is confusing, outdated or full of vague language, the model may simply generate confusing answers from it.

In those cases, the first step is to improve the sources, not merely add an AI layer.

What a good RAG prompt should look like

A good RAG prompt should tell the model clearly how to use the retrieved evidence.

It may include instructions such as:

  • answer only from the supplied context,
  • if the answer is not present in the context, say that it cannot be verified from the sources,
  • do not add information that is not in the sources,
  • if the sources conflict, point out the conflict,
  • say which part of the document supports the answer,
  • phrase the answer clearly for the intended audience.

This matters because the model will not always know by itself when it should refrain from answering and when it has enough evidence.

Why RAG matters outside technical teams

RAG is not only a term for developers. It matters to marketers, sales teams, editors, product managers, customer support, HR, legal teams and company leadership as well.

It helps explain why some AI tools answer in general terms while others answer according to concrete documents. It also shows why the quality of an AI assistant depends not only on the model, but also on the sources available to it and on how well it can retrieve them.

Anyone who understands RAG understands why it is not enough to say “we will connect AI to our documents”. What matters is how the documents are prepared, chunked, indexed, retrieved, cited and controlled.

Risks and unknowns

These risks matter because they show where the analysis has limits and what can change the outcome.

  • Poor source quality – if the documents are outdated, duplicated or wrong, RAG will use poor evidence. Mitigation: clean the knowledge base regularly and clearly mark valid versions.
  • Weak retrieval – the system may retrieve a passage that only looks superficially relevant. Mitigation: test retrieval on real queries and combine keyword, semantic and metadata-based retrieval.
  • Bad chunking – if documents are split badly, the model may receive an answer fragment without the necessary context. Mitigation: split by headings, meaning units and document structure.
  • Old document versions – the system may answer from a document that is no longer valid. Mitigation: use metadata, validity dates and source-priority rules.
  • Hallucinations over retrieved evidence – the model may still combine or expand the retrieved material incorrectly. Mitigation: require citations, restrict the answer to supplied evidence and test outputs carefully.
  • Permission violations – the system may retrieve a sensitive document for a user who should not see it. Mitigation: enforce access control during retrieval, not only at display time.
  • Too much context – the model may receive too many loosely related passages and answer in a confused way. Mitigation: use reranking and limit the context to genuinely relevant passages.
  • Conflicting sources – different documents may say different things. Mitigation: define source-priority rules and instruct the model to flag conflicts.
  • Overtrust in fluent answers – users may treat the answer as true just because it sounds confident. Mitigation: keep human review for important decisions.
  • Operational cost – indexing, embeddings, retrieval and generation may cost more than expected. Mitigation: monitor token usage, optimise retrieval and track real operating costs.

Related terms

  • Retrieval – the process of finding and loading relevant information that is supplied to the model as evidence.
  • Large language model (LLM) – the model that processes and generates text based on input, context and learned language patterns.
  • Token – the basic unit of text or input that language models work with.
  • Prompt engineering – the design of model instructions so the output follows the required goal, format and constraints.
  • Embedding – a numerical representation of content that helps compare the semantic similarity of queries and documents.
  • Vector database – a database designed to store embeddings and retrieve similar vectors efficiently.
  • Chunking – splitting a long document into smaller parts that are easier to retrieve and insert into model context.
  • Hybrid retrieval – a combination of keyword retrieval and semantic retrieval.
  • Reranking – an additional relevance-sorting step performed before the retrieved passages are sent to the model.
  • Grounding – anchoring the model’s answer in concrete sources or data.
  • Knowledge base – a set of documents, rules, guides or data that a RAG system can retrieve from.
  • Machine learning – the broader AI field in which models learn patterns from data.
  • Deep learning – a subset of machine learning based on multi-layer neural networks.
