Building a RAG Chatbot application
Using Claude, LangChain, Hugging Face and ChromaDB

As a Design Systems lead for a large multi-national corporation, I spend significant time ensuring web components meet accessibility requirements and answering accessibility questions from multiple teams across various channels, questions that often go beyond the scope of our components. The process typically involves understanding the problem a team is having, searching W3C documentation, formulating answers, and tailoring responses to specific use cases. It is a time-consuming cycle that repeats across numerous teams.

With the advent of advanced language models like Anthropic's Claude, and with corporations starting to adopt these tools, I wondered how feasible it would be to create a specialized chatbot that provides accurate, cited answers from a specific, local knowledge base, reduces hallucinations and cuts down on repetitive research while maintaining reliability.
Why not just use one of the well-known LLM Chatbots?
I chose Retrieval-Augmented Generation (RAG) because of citation accuracy issues I encountered when using common LLM chatbots such as ChatGPT.[1] Even after creating custom GPTs and providing the relevant links to the W3C Web Accessibility Initiative Standards & Guidelines, the LLM would always try to give an answer, even if it wasn't quite right.

My biggest concern, however, was that it would not always correctly cite its sources. The citations would appear in quotes and would be pretty close, but pretty close is not a citation: it's a paraphrase. LLMs are very good at paraphrasing and summarising complex ideas and information. While summaries can be useful, they are not ideal in a professional context where accuracy and verifiability are paramount. Inaccurate citations can lead to misinformation and misinterpretation of guidelines, and ultimately undermine the credibility of the information provided.

What is RAG?
Retrieval-Augmented Generation (RAG) combines a retrieval system with a generative model to produce accurate, grounded responses. To understand the RAG process, it is first useful to visualise, in a simplistic way, a plain LLM call, shown below (inspired by a visual from Benjamin Clavié [2]).
In the image above, we can see that there are only three points of information: the input from the user, the LLM processing that input (which is determined by its pre-training data), and the output text generated by the LLM.

What a RAG system does is Augment the middle step by introducing new context from an external source and providing a way to Retrieve that data and make sense of it. This additional context helps the LLM produce more accurate and relevant outputs. IBM's Ivan Belcic and Cole Stryker summarise what RAG is very succinctly:
> RAG augments a natural language processing (NLP) model by connecting it to an organization's proprietary database. [3]

We can add to the simplified visualisation by showing the additional context that is being added into the original LLM call, below.

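To make that middle step more concrete in code terms, the difference between a plain call and a retrieval-augmented call can be sketched roughly as follows; `llm` and `vector_db` are placeholder names for illustration only, not objects from the project:

```python
question = "What is the minimum target size for Level AA?"

# Plain LLM call: the answer depends only on the prompt and the model's pre-training data.
answer = llm(question)

# Retrieval-augmented call: relevant chunks are fetched from an external store
# and injected into the prompt as additional context before the model is called.
chunks = vector_db.similarity_search(question, k=3)
context = "\n\n".join(chunk.page_content for chunk in chunks)
answer = llm(f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}")
```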
While it's very useful to visualise RAG workflows in a simplistic way to understand how external context can augment an LLM's output, how can RAG be set up in practice?

For this particular project, the RAG process looks like this:
- Prepare the data and create a local database of that data
- Set up a method to query that database for relevant information
- Craft the response with the help of a 3rd party LLM

The technical workflow and logic for this project were inspired by this excellent YouTube tutorial from @pixegami.
Preparing the data
Ideally you want to work with data stored as markdown (.md), but any text format will do. As I wanted to generate markdown files for the following URL: www.w3.org/WAI/standards-guidelines/ as well as all related URLs, I leveraged an online tool from an independent developer: HTML-to-Markdown. It appears that this tool has not been updated since, but there are more manual ways to convert URLs to markdown.[4]

After downloading the single markdown file of all the crawled URLs, I used the mdsplit python library to split the docs into separate .md files. These files are written to subdirectories representing the document's structure. Once I have my data in markdown format, I place the separated files into the project's data/ folder.
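If you would rather not pull in another dependency, the core idea behind this splitting step can be sketched in plain Python. This is not how mdsplit works internally, just an illustration of splitting one large markdown file into one file per top-level heading (the file and folder names are made up for the example):

```python
from pathlib import Path

def split_markdown(source: Path, out_dir: Path) -> None:
    """Split a single markdown file into one .md file per top-level heading."""
    out_dir.mkdir(parents=True, exist_ok=True)
    current_name, current_lines = "intro", []

    for line in source.read_text(encoding="utf-8").splitlines():
        if line.startswith("# "):  # a new top-level section starts here
            if current_lines:
                (out_dir / f"{current_name}.md").write_text("\n".join(current_lines), encoding="utf-8")
            current_name = line[2:].strip().replace(" ", "-").replace("/", "-")
            current_lines = []
        current_lines.append(line)

    if current_lines:  # write out the final section
        (out_dir / f"{current_name}.md").write_text("\n".join(current_lines), encoding="utf-8")

# Example (hypothetical paths):
# split_markdown(Path("wai-standards-guidelines.md"), Path("data"))
```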
Indexing / Building the Database
By running the following command: `python create_database.py`, a new folder is created within the project called chroma/ that contains a searchable database of our documents.

How it works under the hood (a rough sketch follows the list):
- Reads my documents from the data/ folder
- Breaks them into chunks
- Converts each chunk to vectors using HuggingFace embeddings
- Stores everything in ChromaDB
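A minimal sketch of that indexing step, assuming a recent LangChain setup with the community packages and sentence-transformers installed; the exact import paths vary between LangChain versions, and the embedding model name below is illustrative rather than copied from my script:

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

DATA_PATH = "data"      # folder containing the split .md files
CHROMA_PATH = "chroma"  # folder the persistent vector store is written to

def build_database() -> None:
    # 1. Read the markdown documents from data/
    documents = DirectoryLoader(DATA_PATH, glob="**/*.md", loader_cls=TextLoader).load()

    # 2. Break them into overlapping chunks
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(documents)

    # 3. Embed each chunk with a HuggingFace model, 4. store the vectors in ChromaDB
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)

if __name__ == "__main__":
    build_database()
```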
Retrieval and Generation
Whenever we run the `query_data.py` command, the script performs the following steps (a rough sketch follows the list):
- Converts the query into a vector using the same HuggingFace embedding model that was used within the `create_database.py` script
- Compares the query vector to all vectors in ChromaDB and finds the most relevant chunks (along with their metadata)
- Passes the retrieved chunks and the original query to the LLM (in my case: Claude)
- Prints Claude's answer along with the source documents that were used
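A minimal sketch of that retrieval-and-generation step, assuming the langchain-anthropic integration is installed and an Anthropic API key is set in the environment; the model name and prompt wording here are placeholders, not the exact values from my script:

```python
import sys

from langchain_anthropic import ChatAnthropic
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

CHROMA_PATH = "chroma"

def answer(query: str) -> None:
    # Re-open the persisted database with the same embedding model used at index time
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embeddings)

    # Find the most relevant chunks (and their metadata) for the query
    results = db.similarity_search_with_relevance_scores(query, k=5)
    context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)

    # Pass the retrieved chunks and the original query to Claude
    llm = ChatAnthropic(model="claude-3-5-sonnet-latest")
    prompt = f"Answer the question based only on the following context:\n\n{context}\n\nQuestion: {query}"
    response = llm.invoke(prompt)

    # Print the answer and the source documents that were used
    sources = [doc.metadata.get("source") for doc, _score in results]
    print(response.content)
    print("Sources:", sources)

if __name__ == "__main__":
    answer(sys.argv[1])
```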
Working example
I can query my local docs via a python command in the terminal:
`python query_data.py "what is the minimum target size for level AA?"`

I receive the following output:
Response: Evidence:
- "[SOURCE: Understanding SC 2.5.8: Target Size (Minimum) (Level AA)] "The size of the target for pointer inputs is at least 24 by 24 CSS pixels"
- "[SOURCE: Understanding SC 2.5.8: Target Size (Minimum) (Level AA)] "For a target to be "at least 24 by 24 CSS pixels", it must be conceptually possible to draw a solid 24 by 24 CSS pixel square, aligned to the horizontal and vertical axis such that the square is completely within the target"
Answer:
The minimum target size for Level AA is at least 24 by 24 CSS pixels. This means it must be possible to draw a solid 24 by 24 CSS pixel square completely within the target area, aligned to the horizontal and vertical axis.
**Success Criterion 2.5.8 (Target Size Minimum) Details:**
- **Goal:** Make controls easier to activate
- **What to do:** Ensure targets meet a minimum size or have sufficient spacing around them
- **Why it's important:** Some people with physical impairments cannot click small buttons that are close together
Sources: ['data/-Understanding-SC-2.5.8target-size-minimum.md', 'data/-Understanding-SC-2.5.5httpsw3.orgTRWCAG21target-size-enhanced-.standalone-resource__type-of-guidanceTarget-Size-Level-AAA.md', 'data/-Understanding-SC-2.5.8target-size-minimum.md', 'data/-Understanding-SC-2.5.8target-size-minimum.md', 'data/-Understanding-SC-2.5.8target-size-minimum.md', 'data/-Understanding-SC-2.5.5httpsw3.orgTRWCAG21target-size-enhanced-.standalone-resource__type-of-guidanceTarget-Size-Level-AAA.md', 'data/-Understanding-SC-2.5.8target-size-minimum.md', 'data/-Understanding-SC-2.5.8target-size-minimum.md', 'data/-Understanding-SC-2.5.8target-size-minimum.md', 'data/-Understanding-SC-2.5.8target-size-minimum.md']

Why is the output structured this way? It is because of how I structured my `PROMPT_TEMPLATE` variable. I wanted the output to return the information in a specific format, with particular emphasis on the source that was used to generate the answer along with a verbatim quote from the chunk that was referenced. These are outlined in lines 46 - 53 in the image below:
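To give a feel for what such a template can look like, a sketch along these lines (an approximation, not the exact wording of my template) would produce a similar Evidence/Answer/Sources structure:

```python
from langchain.prompts import ChatPromptTemplate

# Illustrative only: an approximation of the kind of template used, not the exact text.
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

First list your Evidence: for every claim, give the source document title in the form
[SOURCE: <document title>] followed by a verbatim quote from that chunk.
Then give the Answer to the question: {question}
"""

prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
message = prompt.format(context="...retrieved chunks...", question="what is the minimum target size for level AA?")
```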
Information Retrieval (IR) Evaluation
Information Retrieval (IR) evaluation is a crucial step in assessing the effectiveness of the retrieval component in a RAG system. It involves measuring how well the system retrieves relevant documents or chunks of information from the database in response to a given query. Common metrics used in IR evaluation include Precision, Recall, and F1 score.

As Nandan Thakur explains in his talk Modern IR Evaluation in the RAG Era, automated information retrieval is not a new concept or practice.[5] For a deep dive into the history and different approaches to IR evals, I recommend watching Nandan's talk. For practical and concrete approaches to evaluating the RAG application, LangChain has documentation on Evaluating RAG Applications.
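Those metrics are straightforward to compute by hand for a small test set, assuming you maintain a hand-labelled mapping from each test question to the document files you consider relevant (the file names below are placeholders):

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Precision, recall and F1 for one query, given retrieved and gold-standard sources."""
    hits = len([doc for doc in retrieved if doc in relevant])
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example with made-up sources: one of the three retrieved files is actually relevant.
print(retrieval_metrics(retrieved=["a.md", "b.md", "c.md"], relevant={"a.md", "d.md"}))
# {'precision': 0.333..., 'recall': 0.5, 'f1': 0.4}
```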
Chunking strategy

It is a good idea to have a list of crucial questions that you can test the RAG application with, in order to check whether the answers you are receiving are expected or not. One important factor that influences the quality of the answers is how you chunk your documents when building the database. Initially I used a `chunk_size` of 300 with an overlap of 100, but I found that this often resulted in answers that were too brief or lacked sufficient context. By increasing the chunk size to 1000 with an overlap of 200, I was able to capture more context from each document, leading to more comprehensive and accurate responses from the LLM.

```python
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # Increased from 300 to capture more context
        chunk_overlap=200,  # Increased overlap to preserve context across chunks
        length_function=len,
        ...
    )
```

What's next?
This personal project definitely felt out of my comfort zone initially in terms of the logic behind each Python script, but thanks to the use of LLMs and the great content already out there on the web from skilled folks, I have been able to get the foundations in place. The next phase is to build a simple UI to interface with the RAG agent, as right now the only way is via the command line, which is not the most user-friendly.

GitHub links
I am a proponent of sharing code where possible, so if you would like to explore the code for this project further, please visit my GitHub repository: rag-app-claude

Resources
1. Zhang M, Zhao T. Citation Accuracy Challenges Posed by Large Language Models. JMIR Med Educ. 2025.
2. Husain H, et al. Beyond Naive RAG: Practical Advanced Methods. 2025.
3. Belcic I, Stryker C. RAG vs. Fine-Tuning. IBM. Accessed November 20, 2025.
4. Pandoc or html2text.
5. Salton G, Lesk ME. The SMART automatic document retrieval systems—an illustration. Commun. ACM 8, 6 (June 1965), 391–398.