RAG Similarity Search with ChromaDB
ChromaDB is a database tool that runs locally to create and manage vector stores, which support tasks such as similarity search in Large-Language Model (LLM) applications. This tutorial covers how to set up a vector store using training data from the Gekko Optimization Suite and explores its application in Retrieval-Augmented Generation (RAG) for LLMs.
The first step is to install the necessary libraries, pandas and ChromaDB. You can do this using pip:
pip install chromadb pandas
The next step is to import the modules and read train.jsonl from GitHub.
import chromadb
import pandas as pd
# read Gekko LLM training data
url='https://raw.githubusercontent.com'
path='/BYU-PRISM/GEKKO/master/docs/llm/train.jsonl'
qa = pd.read_json(url+path,lines=True)
The train.jsonl file contains hundreds of questions and answers about Gekko. It provides context for the Gekko Support Agent that assists with questions about modeling and optimization in Python. Each row of train.jsonl is added to three lists required to build the vector store: documents holds the question and answer text, metadatas holds a unique ID name, and ids holds a unique integer identifier stored as a string.
documents = []
metadatas = []
ids = []
for i in range(len(qa)):
    s = f"### Question: {qa['question'].iloc[i]} ### Answer: {qa['answer'].iloc[i]}"
    documents.append(s)
    metadatas.append({'qid':f'qid_{i}'})
    ids.append(str(i))
The script reads training data from the Gekko Optimization Suite, processes it, and uses ChromaDB to create a vector store. This vector store is fundamental in building systems that can efficiently perform similarity searches, crucial in applications like RAG for Large-Language Models.
cc = chromadb.Client()
collection = cc.create_collection(name='mydb')
collection.add(documents=documents,metadatas=metadatas,ids=ids)
The vector database is stored in memory and is regenerated every time the program runs. For large document sets, this can take significant time and it may be desirable to store the vector database on a local drive. Use the following code to create a local sqlite3 database. The try block deletes any existing collection named mydb so that it can be created again.
from chromadb.config import Settings
st = Settings(anonymized_telemetry=False)
cc = chromadb.PersistentClient(path='chroma',settings=st)
try:
    cc.delete_collection('mydb')
except Exception:
    pass
collection = cc.create_collection(name='mydb')
The final step is to execute a test query to ensure the vector store is functioning correctly. The query uses a k-Nearest Neighbors search to determine the closest 5 matches to the query text.
results = collection.query(
    query_texts=['What are you trained to do?'],
    n_results=5,include=['distances','documents'])
print(results)
Review the responses and the distance metric to determine how close each document is in similarity to query_texts. A smaller distance indicates a closer match.
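To make the k-Nearest Neighbors step concrete, the sketch below ranks a few hand-made vectors by squared Euclidean distance to a query vector. It is an illustration only: the 3-dimensional vectors and the squared_l2 and knn helpers are hypothetical, and squared Euclidean distance is assumed as the metric. ChromaDB itself converts text to embeddings with a model and searches them with an index rather than with this brute-force loop.

```python
# Brute-force k-nearest-neighbor search on toy vectors (illustration only).
# The vectors and helper functions are hypothetical stand-ins for the
# embeddings and index that ChromaDB manages internally.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.2, 0.0]

neighbors = knn(query, vectors, k=2)
for i, d in neighbors:
    print(f'vector {i}: distance {d:.3f}')
```

The two vectors that point in nearly the same direction as the query are returned first, which is the same ranking idea behind the n_results=5 query above.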
import chromadb
import pandas as pd
# read Gekko LLM training data
url='https://raw.githubusercontent.com'
path='/BYU-PRISM/GEKKO/master/docs/llm/train.jsonl'
qa = pd.read_json(url+path,lines=True)
documents = []
metadatas = []
ids = []
for i in range(len(qa)):
    s = f"### Question: {qa['question'].iloc[i]} ### Answer: {qa['answer'].iloc[i]}"
    documents.append(s)
    metadatas.append({'qid':f'qid_{i}'})
    ids.append(str(i))
# in memory
cc = chromadb.Client()
collection = cc.create_collection(name='mydb')
# on local drive
#from chromadb.config import Settings
#st = Settings(anonymized_telemetry=False)
#cc = chromadb.PersistentClient(path='chroma',settings=st)
#try:
#    cc.delete_collection('mydb')
#except Exception:
#    pass
#collection = cc.create_collection(name='mydb')
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
results = collection.query(
    query_texts=['What are you trained to do?'],
    n_results=5,
    include=['distances','documents'])
print(results)
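The printed results dictionary is nested, with one inner list per entry in query_texts, so it can be hard to read directly. The sketch below flattens one query into rows of ID, distance, and document text. The top_matches helper is hypothetical, and the sample dictionary is placeholder data that only mimics the nesting of a query result. It is not real output from the Gekko vector store.

```python
def top_matches(results, query_index=0):
    """Return (id, distance, document) tuples for one query, closest first."""
    ids = results['ids'][query_index]
    distances = results['distances'][query_index]
    documents = results['documents'][query_index]
    rows = list(zip(ids, distances, documents))
    rows.sort(key=lambda row: row[1])  # smallest distance first
    return rows

# Placeholder data with the same nesting as a query result (not real output)
sample = {
    'ids': [['12', '3']],
    'distances': [[0.8, 1.1]],
    'documents': [['### Question: Q12 ### Answer: A12',
                   '### Question: Q3 ### Answer: A3']],
}

for doc_id, dist, doc in top_matches(sample):
    print(f'id={doc_id} distance={dist:.2f} {doc[:40]}')
```

With the script above, calling top_matches(results) in place of print(results) should list the five closest question-answer pairs in order.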
Application in RAG with Large-Language Models
Once the vector store is set up, it can be used in Retrieval-Augmented Generation (RAG) models, particularly with Large-Language Models. RAG models leverage external knowledge sources to generate more informed and accurate responses.
from gekko import support
a = support.agent()
a.ask("Can you optimize the Rosenbrock function?")
The snippet above uses the Gekko vector store and RAG to provide context to the LLM. This support agent runs in the cloud, but it can also be set up to run locally. Combining ChromaDB retrieval with the generative capabilities of LLMs grounds responses in relevant source material, which can improve AI applications in natural language understanding and generation.
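The retrieval step feeds the generation step by placing the closest documents in the prompt ahead of the user question. The sketch below shows one possible way to assemble such a prompt. The build_prompt helper, the prompt wording, and the placeholder documents are assumptions for illustration and do not describe how the Gekko support agent is implemented.

```python
def build_prompt(question, documents, max_docs=3):
    """Combine retrieved documents and the user question into one prompt."""
    context = '\n\n'.join(documents[:max_docs])  # keep only the closest matches
    return (
        'Use the context to answer the question.\n\n'
        f'Context:\n{context}\n\n'
        f'Question: {question}\nAnswer:'
    )

# Placeholder documents standing in for results['documents'][0]
retrieved = [
    '### Question: (retrieved question 1) ### Answer: (retrieved answer 1)',
    '### Question: (retrieved question 2) ### Answer: (retrieved answer 2)',
    '### Question: (retrieved question 3) ### Answer: (retrieved answer 3)',
    '### Question: (retrieved question 4) ### Answer: (retrieved answer 4)',
]

prompt = build_prompt('Can you optimize the Rosenbrock function?', retrieved)
print(prompt)
```

The resulting string would then be sent to an LLM of your choice. Limiting the context with max_docs keeps the prompt short when many documents are retrieved.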
Activity: Generate Q+A Similarity Search
This activity encourages you to explore similarity search by creating your own set of questions and answers. Choose a topic you are passionate about, and generate at least 10 question-answer pairs. Once done, you'll build a vector database with these pairs and perform a similarity search using ChromaDB. This hands-on experience helps you understand the practical applications of similarity search in natural language processing.
Use the JSONL template to generate at least 10 questions and answers based on a topic of your interest and save the file as mydb.jsonl.
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
{"question":"","answer":""}
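Before building the vector database, it can help to check that every line of the file is valid JSON with both fields filled in. The sketch below is one way to do that with the standard json module. The check_jsonl helper and the three example lines are hypothetical.

```python
import json

def check_jsonl(lines):
    """Return a list of problems found in JSONL question-answer lines."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f'line {n}: invalid JSON ({err.msg})')
            continue
        if not isinstance(row, dict):
            problems.append(f'line {n}: expected a JSON object')
            continue
        for key in ('question', 'answer'):
            if not str(row.get(key, '')).strip():
                problems.append(f'line {n}: empty or missing "{key}"')
    return problems

# Hypothetical lines: one complete, one left blank, one with a missing brace
example = [
    '{"question":"What is a vector store?","answer":"A database of embeddings."}',
    '{"question":"","answer":""}',
    '{"question":"Missing brace","answer":"oops"',
]

problems = check_jsonl(example)
for p in problems:
    print(p)
```

To check your own file, pass the open file instead of the example list, such as check_jsonl(open('mydb.jsonl')). An empty list means every line has a question and an answer.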
Build the vector database and perform a similarity search using the mydb.jsonl file instead of the Gekko Q+A.
import chromadb
import pandas as pd
# read training data
path='mydb.jsonl'
qa = pd.read_json(path,lines=True)
documents = []
metadatas = []
ids = []
for i in range(len(qa)):
    s = f"### Question: {qa['question'].iloc[i]} ### Answer: {qa['answer'].iloc[i]}"
    documents.append(s)
    metadatas.append({'qid':f'qid_{i}'})
    ids.append(str(i))
# in memory
cc = chromadb.Client()
collection = cc.create_collection(name='mydb')
collection.add(documents=documents,metadatas=metadatas,ids=ids)
results = collection.query(
    query_texts=['Question to test similarity search.'],
    n_results=5,include=['distances','documents'])
print(results)
results = collection.query(
    query_texts=['Another question to test similarity search.'],
    n_results=5,include=['distances','documents'])
print(results)
Test the similarity search with several questions and check that the documents with the smallest distances are the ones most closely related to query_texts.