Retrieval-Augmented Generation for Sitecore Development with RAGFlow
At Velir, one way AI assists our development is with Copilot, Microsoft's Large Language Model (LLM) service offering. However, some technologies, like Sitecore, use common languages in very specific ways, which makes it challenging for general-purpose LLMs to craft components or scripts in C#, XML, or JavaScript exactly as Sitecore expects. Ancillary tools compound the problem: Sitecore PowerShell Extensions is easily confused with standard PowerShell syntax, and products like Content Hub reference specific JavaScript libraries, so accurate answers require domain-specific examples.
Our Solution
To help our developers get the right answers, we've started using Retrieval-Augmented Generation (RAG). Whenever you work with LLMs, all the information you provide with your question is called “context.” Context can include instructions, such as which language or language version to use, and it might also contain examples of the kind of solution you expect. The better the context you provide, the better the answer you'll receive. Most RAG solutions use a vector database to index your domain-specific knowledge and then search it for additional context based on the prompt you provide. There is no single standard RAG solution, but we prefer RAGFlow with the Infinity database for our local development needs.
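To make the retrieval idea concrete, here's a minimal, self-contained sketch of what a RAG lookup does conceptually: store documents as vectors, embed the question the same way, pull back the closest match, and prepend it to the prompt as context. The embed() function below is a purely hypothetical stand-in; a real system uses an embedding model and a vector database like Infinity.

```python
# Minimal conceptual sketch of Retrieval-Augmented Generation (illustration only).
# The embed() function is a hypothetical stand-in for a real embedding model.
import math
import re

def embed(text: str) -> list[float]:
    """Toy 'embedding': counts of a few Sitecore-related keywords (stand-in only)."""
    keywords = ["rendering", "template", "powershell", "item", "publish"]
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(k)) for k in keywords]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# 1. Index: domain-specific snippets stored alongside their vectors.
documents = [
    "Use Get-Item -Path 'master:/content/home' in Sitecore PowerShell Extensions.",
    "A JSON rendering in headless Sitecore maps to a React or Next.js component.",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieve: embed the question and pick the closest stored snippet.
question = "How do I get an item with Sitecore PowerShell?"
query_vec = embed(question)
best_doc, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))

# 3. Augment: send the retrieved snippet to the LLM as extra context.
prompt = f"Context:\n{best_doc}\n\nQuestion: {question}"
print(prompt)
```

The real pipeline does the same three steps, just with learned embeddings and a proper database instead of keyword counts.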
What is RAGFlow?
RAGFlow is an open-source host environment for LLM systems that enables you to build chat tools and agent workflows around many document types. It also boasts a high-performing database and a sophisticated retrieval process for finding the best-matching results. The hard part is finding, or creating, well-formatted documentation for it to ingest. In our case, Sitecore documentation provides the rich detail we need to populate our database of code examples and explanations, which our LLMs can then use to craft answers with improved precision.
Additional Benefits of RAGFlow
Using RAGFlow offers more benefits than just providing rich custom context. Running your own RAG host gives you more control of the process, and if you also self-host the LLM through tools like vLLM or Ollama, you control the entire pipeline. This lowers the risk of leaking confidential information and means you don't have to track token usage. Conversely, you'll have to consider the hardware resources available on the host system. A virtual machine may provide more opportunities to scale resources to match demand, centralize security, and keep model storage off local machines. However, newer developer laptops have enough dedicated GPU memory for smaller models of around 4-14 billion parameters (roughly 2–10GB), which allows flexibility for a faster rollout.
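As a rough back-of-the-envelope check (an illustration only, not a RAGFlow feature), you can estimate how much memory a quantized model needs from its parameter count and the bits stored per weight:

```python
# Rough VRAM estimate for a quantized model: parameters * bits-per-weight / 8,
# plus some overhead for context and activations. Figures are approximations.
def model_memory_gb(params_billions: float, bits_per_weight: int = 4,
                    overhead: float = 1.2) -> float:
    raw_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return raw_gb * overhead

for size in (4, 8, 14):
    print(f"{size}B params @ 4-bit ~= {model_memory_gb(size):.1f} GB")
# Prints roughly 2.4, 4.8, and 8.4 GB -- in line with the 2-10GB range above.
```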
Getting Started with RAGFlow
Setting up and running RAGFlow has only a few requirements. You'll need access to an LLM server, either externally hosted, like OpenAI, or internally hosted, like Ollama. The next requirement is setting up the RAGFlow server itself and configuring the default models. It's a relatively quick process, but be aware that it requires Linux containers, which may interfere with your other local development systems.
One optional but recommended step is to set the Infinity database as the document engine in the Docker .env file (for example, DOC_ENGINE=${DOC_ENGINE:-infinity}), because several of the features core to RAGFlow's value depend on it. Once running, RAGFlow is a web application that requires registration, but it's all local and self-contained.
The final ingredient is the documents themselves. What you include depends on the problem you're trying to solve, but RAGFlow can work with common text formats (PDF, DOC, TXT, MD), tables (CSV, XLSX), pictures (JPG, PNG, TIF, GIF), and slides (PPT, PPTX). One thing to note: you may find it easier to convert documents to Markdown so that code samples are explicitly fenced and easier to identify.
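As one possible approach, a small batch-conversion script can handle that step before upload. This is a sketch that assumes pandoc and the pypandoc wrapper are installed; the folder names are placeholders.

```python
# A minimal sketch of batch-converting source documents to Markdown before upload.
# Assumes pandoc is installed along with the pypandoc wrapper (pip install pypandoc).
from pathlib import Path

import pypandoc

SOURCE_DIR = Path("docs/raw")       # original DOCX exports (placeholder path)
OUTPUT_DIR = Path("docs/markdown")  # Markdown ready for RAGFlow (placeholder path)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for source in SOURCE_DIR.glob("*.docx"):
    target = OUTPUT_DIR / (source.stem + ".md")
    # "gfm" (GitHub-flavored Markdown) keeps code samples in fenced blocks,
    # which makes them easy to identify during chunking later.
    pypandoc.convert_file(str(source), "gfm", outputfile=str(target))
    print(f"Converted {source.name} -> {target.name}")
```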
Parsing Documents
Once you've configured the application, you can create a Knowledge Base for each target knowledge area. Knowledge Bases are differentiated by how their documents are indexed, but also by what those documents are associated with. In the case of Sitecore, we focus on specific products and versions; for example, the documentation for Sitecore 10.4 differs from the documentation for 9.0, and we target XM Cloud and Content Hub separately. This allows you to build a chat that includes or excludes Knowledge Bases depending on the project's needs.
Moving into the configuration, the first option to set is the parser, the algorithm used to locate the structure in your content. There are two options available. The Naive parser is the default and is used for most text file types. The DeepDoc parser is generally for PDFs containing images; it takes more time and compute to parse, so factor that into your parsing strategy.
The next setting is the Embedding Model, which converts documents into the vectors stored in the Infinity database. We're currently using Nomic Embed Text because of its wide context length and auditability. Note that this model doesn't need to match the chat model, since retrieved results are converted back to text before being passed as context between systems. If you're not using locally hosted models, consider where tokens get used; this is one of those places, and the number of tokens consumed depends on the settings you choose, since some options call the LLMs more heavily.
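For a sense of what the embedding step does under the hood, the sketch below calls a locally hosted Nomic Embed Text model through Ollama's HTTP API. RAGFlow handles this internally, so this is illustration only; it assumes Ollama is running locally with the nomic-embed-text model pulled.

```python
# Illustration of the embedding step, assuming a local Ollama server with the
# nomic-embed-text model available (ollama pull nomic-embed-text).
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed(text: str) -> list[float]:
    """Convert a chunk of text into a vector via the local embedding model."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["embedding"]

vector = embed("Sitecore PowerShell Extensions script to publish an item")
print(len(vector))  # dimensionality of the embedding vector
```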
Once you've decided how to embed your data, you need to choose how it's broken up into chunks. This assumes each document contains more than one insight, so each chunk can be measured more closely while maintaining its link to the original document. Each chunking option supports only specific file types, and given our use of Markdown files, we used General chunking.
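To illustrate the idea (this is not RAGFlow's exact algorithm), a heading-based chunker for Markdown might look like the sketch below. It keeps fenced code blocks intact so Sitecore code samples aren't split mid-example.

```python
# Rough sketch of heading-based chunking for Markdown (not RAGFlow's exact algorithm).
def chunk_markdown(text: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    in_code_fence = False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_code_fence = not in_code_fence
        # Start a new chunk at each heading, unless we're inside a code fence.
        if line.startswith("#") and not in_code_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

sample = (
    "# SPE Basics\nUse Get-Item.\n\n"
    "```powershell\nGet-Item -Path 'master:/content'\n```\n\n"
    "# Renderings\nJSON renderings map to components."
)
for chunk in chunk_markdown(sample):
    print(chunk, "\n---")
```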
The next few settings require extra consideration because they offer a trade-off between performance and accuracy. Adding a small number of Auto-Keywords and Auto-Questions to the metadata helped retrieval accuracy, as did enabling RAPTOR. These will increase parsing time and/or consume more tokens, but it's worth testing whether your documents benefit from them. What we aren't yet using is the Knowledge Graph: it consumes much more compute, or tokens, and you may find it unnecessary. The best way to find the right settings is to parse a single document, test it with Retrieval Testing, and tune the system from there to speed up responses.
Now you're ready to upload files through RAGFlow's File Management tool and then link each folder to a Knowledge Base. Uploading here, rather than directly into a Knowledge Base, lets you remove and recreate a Knowledge Base without deleting its files along with it. After uploading your documents, you're ready to start parsing. Along with the earlier considerations about parsing performance, factor in document size and the availability of GPU resources. RAGFlow offers a Docker Compose file designed to use GPUs to distribute the compute; parsing large documents, or using the Knowledge Graph or Re-Rank models in a performant way, will require it. You'll want to balance how much time parsing takes, how much compute you can afford, and the system's accuracy. Parsing is a one-time cost, so you can manage it by increasing the specs only while you're parsing and dialing them back for general use.
Domain-Specific Chat
After you've completed parsing, the data is ready and the chat interface can be configured. Keyword Analysis is worth considering because it makes clear hits on topics, helping to narrow the search. The Knowledge Base setting lets you select which Knowledge Bases the RAG search will use to match your project's needs. The System Prompt helps frame the dialogue persona, so use it to apply guidelines and requirements for responses.
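For a sense of how a system prompt frames the persona, here's a sketch of the kind of message structure RAGFlow assembles on your behalf. It assumes a local Ollama server with a chat model such as qwen3 pulled; the prompt text and retrieved snippet are placeholders.

```python
# Illustration of how a system prompt frames the persona; RAGFlow builds this for you.
# Assumes a local Ollama server with a chat model such as qwen3 available.
import requests

SYSTEM_PROMPT = (
    "You are a Sitecore 10.4 development assistant. Answer only from the provided "
    "context, use Sitecore PowerShell Extensions syntax for scripts, and say so "
    "when the context does not cover the question."
)

retrieved_context = "Use Get-Item -Path 'master:/content/home' to load an item."
question = "How do I load the Home item in SPE?"

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3",
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
        ],
    },
    timeout=120,
)
print(response.json()["message"]["content"])
```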
If you activate Reasoning, the Knowledge Graph, or a Re-Rank model, the system will make multiple round trips to the LLM before responding, which can consume tokens or compute at a much higher rate. The Re-Rank model can be left blank, which is the default; retrieval then uses a traditional function that combines weighted keywords with vector cosine similarity. This is where some of the deeper complexity comes in, but at a cost that makes it worth working slowly toward an effective system instead of starting with many features enabled.
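As an approximation of that default scoring idea (not RAGFlow's exact formula), a hybrid score blends a keyword match with the vector similarity, weighted by a tunable factor:

```python
# Approximate sketch of hybrid scoring: weighted keyword matching blended with
# vector cosine similarity (an illustration, not RAGFlow's exact formula).
def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms that also appear in the chunk."""
    terms = set(query.lower().split())
    hits = sum(1 for term in terms if term in chunk.lower())
    return hits / len(terms) if terms else 0.0

def hybrid_score(query: str, chunk: str, cosine_similarity: float,
                 keyword_weight: float = 0.3) -> float:
    # Higher keyword_weight favors exact topic hits; lower favors semantic similarity.
    return (keyword_weight * keyword_score(query, chunk)
            + (1 - keyword_weight) * cosine_similarity)

print(hybrid_score("sitecore powershell publish item",
                   "Publish-Item publishes a Sitecore item via PowerShell.",
                   cosine_similarity=0.82))
```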
The last option to highlight is the retrieval model itself. We've tried several models and initially succeeded with the qwen3 and deepseek-v3 models. Choosing the right model is a matter of testing and results. A larger model will generally improve comprehension, but the model type should depend on the use case: a generalized model can provide better general answers, while a tools or thinking model may excel at iteration and communicating with APIs. Make the LLM selection in an informed way based on the needs and constraints of each situation.
Let’s Build Smarter AI Solutions Together
Whether you're looking to improve your internal development workflows or securely integrate AI into your digital ecosystem, we can help your developers use RAG search options or show you how it can also transform your site search.
Our team has experience customizing AI for complex platforms like Sitecore, and we're ready to help you unlock AI’s full potential for your organization. Contact us today to discover how Velir can bring smarter, more efficient solutions to your team.