RAG - Retrieval Augmented Generation

This article guides you through implementing a Retrieval-Augmented Generation (RAG) pipeline in Python using an OpenAI LLM in combination with a Weaviate vector database and an OpenAI embedding model, all managed through orq.ai.

Retrieval-Augmented Generation, or RAG for short, is a technique that provides LLMs with additional information from an external knowledge source. This allows them to generate more accurate and contextual answers while reducing hallucinations.

In this guide, we show you how to implement a RAG pipeline with our Python SDK using an OpenAI LLM in combination with a Weaviate vector database and an OpenAI embedding model. LangChain is used for orchestration, while orq.ai generates the responses and logs additional data.


Make sure you have installed the required Python packages:

  • langchain for orchestration
  • openai for the embedding model and LLM
  • weaviate-client for the vector database
  • orquesta-sdk for LLMOps with orq.ai

pip install orquesta-sdk langchain openai weaviate-client

Grab your OpenAI API key.


Enable models in the Model Garden

Orq.ai allows you to pick and enable the models of your choice and work with them. Enabling a model is easy: navigate to the Model Garden and toggle on the model of your choice.

Collect and load data

The raw text document is available in LangChain’s GitHub repository.

import requests
from langchain.document_loaders import TextLoader

# Download the raw text document and save it locally
url = "https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt"
res = requests.get(url)
with open("state_of_the_union.txt", "w") as f:
    f.write(res.text)

loader = TextLoader("./state_of_the_union.txt")
documents = loader.load()

Chunk your documents

LangChain has many built-in text splitters for this purpose. For this example, you can use CharacterTextSplitter with a chunk_size of about 1000 and a chunk_overlap of 0; setting a non-zero chunk_overlap makes adjacent chunks share some text, which helps preserve continuity between them.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(documents)
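The splitter above operates on LangChain Document objects, but the underlying idea is plain fixed-size chunking. A minimal sketch in pure Python (no LangChain), with small sizes chosen purely for illustration:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into fixed-size chunks; the last chunk_overlap characters
    of each chunk are repeated at the start of the next one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# Each chunk starts step = 4 - 1 = 3 characters after the previous one.
```

With chunk_size=1000 and chunk_overlap=0, as in the example above, the chunks simply tile the document back to back.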

Embed and store the chunks

To enable semantic search across the text chunks, you need to generate a vector embedding for each chunk and then store the chunks together with their embeddings in the vector database.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
import weaviate
from weaviate.embedded import EmbeddedOptions

# Start an embedded (in-process) Weaviate instance
client = weaviate.Client(
    embedded_options=EmbeddedOptions()
)

# Embed the chunks with OpenAI and store them in Weaviate
vectorstore = Weaviate.from_documents(
    client=client,
    documents=chunks,
    embedding=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY),
    by_text=False
)
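Under the hood, "semantic similarity" between the query and the stored chunks comes down to comparing their embedding vectors, typically with cosine similarity. A toy sketch in plain Python with made-up three-dimensional vectors (real OpenAI embeddings have far more dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 0.0]
chunk_vectors = {
    "chunk_a": [0.9, 0.1, 0.0],  # points roughly the same way as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# The retriever returns the chunk whose vector is closest to the query's
best = max(chunk_vectors, key=lambda k: cosine_similarity(query_vec, chunk_vectors[k]))
```

The vector database does exactly this comparison, just at scale and with indexing so it doesn't have to scan every chunk.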

Step 1: Retrieve

The vector database is now populated. Define it as the retriever component, which fetches the additional context based on the semantic similarity between the user query and the embedded chunks.

retriever = vectorstore.as_retriever()
docs = vectorstore.similarity_search("What did the president say about Justice Breyer")

Step 2: Augment

Create a client instance for orq.ai. Note that you can instantiate as many client instances as necessary with the Orquesta class. You can find your API key in your workspace: https://my.orquesta.dev/<workspace-name>/settings/develop

from orquesta_sdk import Orquesta, OrquestaClientOptions

api_key = "ORQUESTA_API_KEY"

options = OrquestaClientOptions(
    api_key=api_key
)

client = Orquesta(options)

Prepare a Deployment in orq.ai and set up the primary model, fallback model, number of retries, and the prompt itself with variables. Whatever information comes out of the retrieval step needs to be attached as a variable when you call a Deployment in orq.ai. An example is shown below:
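Conceptually, the augmentation step just substitutes the retrieved text into the prompt's variables before the model is called. A minimal sketch with a hypothetical template (the Deployment handles this templating for you; this only illustrates the idea):

```python
# Hypothetical prompt template with two variables, as you might define
# in a Deployment: {context} and {question}
PROMPT_TEMPLATE = (
    "Answer the question based only on the following context:\n"
    "{context}\n\n"
    "Question: {question}"
)

def augment(template: str, context: str, question: str) -> str:
    """Fill the prompt template with retrieved context and the user question."""
    return template.format(context=context, question=question)

prompt = augment(
    PROMPT_TEMPLATE,
    context="The president praised Justice Breyer for his service.",
    question="What did the president say about Justice Breyer",
)
```

When you invoke the Deployment, passing the retrieved chunk as the context variable achieves the same substitution server-side.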

Request a variant by right-clicking on the row and generating the code snippet.

Invoke the orq.ai Deployment. For the context variable, we set it to the similarity search result, chaining together the retriever and the prompt.

deployment = client.deployments.invoke(
    key="rag_deployment",  # hypothetical key; use your own Deployment key
    context={
        "environments": [],
        "locale": []
    },
    inputs={
        "context": docs[0].page_content,
        "question": "What did the president say about Justice Breyer"
    }
)

Step 3: Generate

Your LLM response is generated from orq.ai using the selected model from the Deployment, and you can print it out.
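The invoke call returns an object that exposes the generated text through a choices list, similar in shape to an OpenAI chat completion. The sketch below uses a mock object standing in for the real Deployment response, so the access pattern is clear without an API call (the attribute names are assumed from that OpenAI-style shape):

```python
from types import SimpleNamespace

# Mock standing in for the object returned by client.deployments.invoke(...)
deployment = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(
                content="The president thanked Justice Breyer for his service."
            )
        )
    ]
)

# Access and print the generated answer
answer = deployment.choices[0].message.content
print(answer)
```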


Logging score and metadata for Deployments

After a successful query, orq.ai will generate a log with the evaluation result. You can add metadata and a score to the Deployment log by using the add_metrics() method.

deployment.add_metrics(
    feedback={"score": 100},
    metadata={
        "custom": "custom_metadata",
        "chain_id": "ad1231xsdaABw"
    }
)

You can also fetch the deployment configuration using orq.ai as a prompt management system. Read more: orq.ai as Prompt Manager

config = client.deployments.get_config(
    key="rag_deployment",  # hypothetical key; use your own Deployment key
    context={
        "environments": [],
        "locale": []
    },
    inputs={
        "context": docs[0].page_content,
        "question": "What did the president say about Justice Breyer"
    }
)

deployment_config = config.to_dict()

Finally, head over to your Deployment in the orq.ai dashboard and click on Logs; there you will see your LLM response and other information about the LLM interaction.

You can find the notebook link here and test it out.


If you run into "ssl.SSLCertVerificationError"

You can run into this error on macOS (OS X) because Python 3.6 or higher ships without any certificates installed, so it cannot validate SSL connections. This means you will get errors whenever you try to connect to an https:// website.

Simply run pip install certifi in your terminal and restart your computer.