December 18, 2023

AI in 5: RAG with PDFs

Introduction to Retrieval-Augmented Generation (RAG) for Large Language Models

We've created a module for this tutorial. You can follow these directions to create your own module using the Clarifai template, or just use this module itself on Clarifai Portal.

The advent of large language models (LLMs) like GPT-3 and GPT-4 has revolutionized the field of artificial intelligence. These models are proficient in generating human-like text, answering questions, and even creating content that is persuasive and coherent. However, LLMs are not without their shortcomings; they often draw on outdated or incorrect information embedded in their training data and can produce inconsistent responses. This gap between potential and reliability is where RAG comes into play.

RAG is an innovative AI framework designed to augment the capabilities of LLMs by grounding them in accurate and up-to-date external knowledge bases. RAG enriches the generative process of LLMs by retrieving relevant facts and data in order to provide responses that are not only convincing but also informed by the latest information. RAG can both enhance the quality of responses and provide transparency into the generative process, thereby fostering trust and credibility in AI-powered applications.

RAG operates on a multi-step procedure that refines the conventional LLM output. It starts with data organization, converting large volumes of text into smaller, more digestible chunks. These chunks are represented as vectors, which serve as unique digital addresses for that specific information. Upon receiving a query, RAG probes its database of vectors to identify the most pertinent information chunks, which it then furnishes as context to the LLM. This process is akin to providing reference material prior to soliciting an answer, but it is handled behind the scenes.

RAG presents an enriched prompt to the LLM, which is now equipped with current and relevant facts, to generate a response. This reply is not just a result of statistical word associations within the model, but a more grounded and informed piece of text that aligns with the input query. The retrieval and generation happen invisibly, handing end-users an answer that is at once precise, verifiable, and complete.
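
To make the flow concrete, here is a minimal sketch of the retrieve-then-generate loop in Python. It is purely illustrative: vector_store and llm are placeholders for whatever store and model you use (the rest of this tutorial builds the same flow with Clarifai and LangChain), and the objects follow LangChain-style interfaces.

# Illustrative only: a bare-bones RAG loop with placeholder components
def answer_with_rag(question, vector_store, llm, k=3):
    # 1. Retrieval: find the chunks whose vectors best match the query
    relevant_chunks = vector_store.similarity_search(question, k=k)
    # 2. Augmentation: prepend the retrieved text to the question as context
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
    prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    # 3. Generation: the LLM produces an answer grounded in the retrieved facts
    return llm(prompt)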

This short tutorial walks through an example implementation of RAG using the Streamlit, LangChain, and Clarifai libraries, showcasing how developers can build systems that leverage the strengths of LLMs while mitigating their limitations using RAG.

Again, you can follow these directions to create your own module using the Clarifai template, or just use this module itself on Clarifai Portal to get going in less than 5 minutes!

Let's take a look at the steps involved and how they're accomplished.

Data Organization

Before you can use RAG, you need to organize your data into manageable pieces that the AI can refer to later. The following segment of code is for breaking down PDF documents into smaller text chunks, which are then used by the embedding model to create vector representations.

Code Explanation:

The load_chunk_pdf function takes the uploaded PDF files, writes them to a temporary directory, and reads them into memory with PyPDFLoader. Using a CharacterTextSplitter, it then splits the text from these documents into chunks of 1000 characters with no overlap.

# 1. Data Organization: chunk documents
@st.cache_resource(ttl="1h")
def load_chunk_pdf(uploaded_files):
    # Read documents
    documents = []
    temp_dir = tempfile.TemporaryDirectory()
    for file in uploaded_files:
        temp_filepath = os.path.join(temp_dir.name, file.name)
        with open(temp_filepath, "wb") as f:
            f.write(file.getvalue())
        loader = PyPDFLoader(temp_filepath)
        documents.extend(loader.load())

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunked_documents = text_splitter.split_documents(documents)
    return chunked_documents
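
For a sense of what the function returns, each element of chunked_documents is a LangChain Document carrying the chunk text plus some metadata. A hypothetical inspection might look like this (the values are made up):

chunks = load_chunk_pdf(uploaded_files)
print(chunks[0].page_content)   # first chunk of text, up to 1000 characters
print(chunks[0].metadata)       # e.g. {"source": "/tmp/.../report.pdf", "page": 0}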

Vector Creation

Once you have your documents chunked, you need to convert these chunks into vectors (embeddings), a numerical form that the AI can search and compare efficiently.
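
To build intuition for what these vectors do, here is a toy example with made-up three-dimensional vectors standing in for real embeddings (which typically have hundreds or thousands of dimensions). The query is matched to the chunk whose vector points in the most similar direction, which is the essence of what the vector store does at retrieval time:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings; a real embedding model produces these from the chunk text
chunk_vectors = {
    "chunk about refund policy": [0.9, 0.1, 0.0],
    "chunk about shipping times": [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.05]  # made-up embedding of "How do refunds work?"

best_chunk = max(chunk_vectors, key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best_chunk)  # -> "chunk about refund policy"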

Code Explanation:

The vectorstore function is responsible for creating a vector database using Clarifai. It takes the user's credentials and the chunked documents, then uses Clarifai's service to embed and store the document vectors; number_of_docs=3 sets how many chunks are returned for each query.

# Create vector store on Clarifai for use in step 2
def vectorstore(USER_ID, APP_ID, docs, CLARIFAI_PAT):
    clarifai_vector_db = Clarifai.from_documents(
        user_id=USER_ID,
        app_id=APP_ID,
        documents=docs,
        pat=CLARIFAI_PAT,
        number_of_docs=3,
    )
    return clarifai_vector_db

Setting up the Q&A Model

After organizing the data into vectors, you need to set up the Q&A model that will use RAG with the prepared document vectors.

Code Explanation:

The QandA function sets up a RetrievalQA object using LangChain and Clarifai. This is where the Clarifai-hosted LLM is instantiated and the RAG chain is initialized with the "stuff" chain type (explained briefly after the snippet).

def QandA(CLARIFAI_PAT, clarifai_vector_db):
    from langchain.llms import Clarifai
    USER_ID = "openai"
    APP_ID = "chat-completion"
    MODEL_ID = "GPT-4"

    # LLM to use (GPT-4 hosted on Clarifai)
    clarifai_llm = Clarifai(
        pat=CLARIFAI_PAT, user_id=USER_ID, app_id=APP_ID, model_id=MODEL_ID)

    # "stuff" chain: retrieved chunks are combined and prepended to the prompt
    qa = RetrievalQA.from_chain_type(
        llm=clarifai_llm,
        chain_type="stuff",
        retriever=clarifai_vector_db.as_retriever()
    )
    return qa
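
A note on the "stuff" chain type: it simply concatenates ("stuffs") all of the retrieved chunks into a single prompt ahead of the question, so the LLM sees everything at once. This works well as long as the combined chunks fit within the model's context window; LangChain offers other chain types (such as "map_reduce" and "refine") for cases where they do not.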

User Interface and Interaction

Here, we create a user interface where users can input their questions. The input and credentials are gathered, and the response is generated upon user request.

Code Explanation:

This is the main function that uses Streamlit to create a user interface. Users can input their Clarifai credentials, upload documents, and ask questions. The function handles reading in the documents, creating the vector store, and then running the Q&A model to generate answers to the user's questions.

def main():
    user_question = st.text_input("Ask a question to the GPT-4 model about your documents and click on get the response")

    with st.sidebar:
        st.subheader("Add your Clarifai PAT, USER ID, APP ID along with the documents")
        CLARIFAI_PAT = st.text_input("Clarifai PAT", type="password")
        USER_ID = st.text_input("Clarifai user id")
        APP_ID = st.text_input("Clarifai app id")
        uploaded_files = st.file_uploader(
            "Upload your PDFs here", accept_multiple_files=True)

    if not (CLARIFAI_PAT and USER_ID and APP_ID and uploaded_files):
        st.info("Please add your Clarifai PAT, USER_ID, APP_ID and upload files to continue.")
    elif st.button("Get the response"):
        with st.spinner("Processing"):
            docs = load_chunk_pdf(uploaded_files)
            clarifai_vector_db = vectorstore(USER_ID, APP_ID, docs, CLARIFAI_PAT)
            conversation = QandA(CLARIFAI_PAT, clarifai_vector_db)
            response = conversation.run(user_question)
            st.write(response)


if __name__ == '__main__':
    main()

 

The last snippet is the entry point to the application: the Streamlit user interface is executed when the script is run directly. It orchestrates the entire RAG process, from user input to displaying the generated answer.

Putting it all together

Here is the full code for the module. You can see its GitHub repo here, and also use it yourself as a module on the Clarifai platform.

import streamlit as st
import tempfile
import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Clarifai
from langchain.chains import RetrievalQA
from clarifai.modules.css import ClarifaiStreamlitCSS

st.set_page_config(page_title="Chat with Documents", page_icon="🦜")
st.title("🦜 RAG with Clarifai and Langchain")
ClarifaiStreamlitCSS.insert_default_css(st)


# 1. Data Organization: chunk documents
@st.cache_resource(ttl="1h")
def load_chunk_pdf(uploaded_files):
    # Read documents
    documents = []
    temp_dir = tempfile.TemporaryDirectory()
    for file in uploaded_files:
        temp_filepath = os.path.join(temp_dir.name, file.name)
        with open(temp_filepath, "wb") as f:
            f.write(file.getvalue())
        loader = PyPDFLoader(temp_filepath)
        documents.extend(loader.load())

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunked_documents = text_splitter.split_documents(documents)
    return chunked_documents


# Create vector store on Clarifai for use in step 2
def vectorstore(USER_ID, APP_ID, docs, CLARIFAI_PAT):
    clarifai_vector_db = Clarifai.from_documents(
        user_id=USER_ID,
        app_id=APP_ID,
        documents=docs,
        pat=CLARIFAI_PAT,
        number_of_docs=3,
    )
    return clarifai_vector_db


def QandA(CLARIFAI_PAT, clarifai_vector_db):
    from langchain.llms import Clarifai
    USER_ID = "openai"
    APP_ID = "chat-completion"
    MODEL_ID = "GPT-4"

    # LLM to use (set to GPT-4 above)
    clarifai_llm = Clarifai(
        pat=CLARIFAI_PAT, user_id=USER_ID, app_id=APP_ID, model_id=MODEL_ID)

    # Type of Langchain chain to use, the "stuff" chain which combines chunks retrieved
    # and prepends them all to the prompt
    qa = RetrievalQA.from_chain_type(
        llm=clarifai_llm,
        chain_type="stuff",
        retriever=clarifai_vector_db.as_retriever()
    )
    return qa


def main():
    user_question = st.text_input("Ask a question to the GPT-4 model about your documents and click on get the response")

    with st.sidebar:
        st.subheader("Add your Clarifai PAT, USER ID, APP ID along with the documents")

        # Get the USER_ID, APP_ID, Clarifai PAT
        CLARIFAI_PAT = st.text_input("Clarifai PAT", type="password")
        USER_ID = st.text_input("Clarifai user id")
        APP_ID = st.text_input("Clarifai app id")
        uploaded_files = st.file_uploader(
            "Upload your PDFs here", accept_multiple_files=True)

    if not (CLARIFAI_PAT and USER_ID and APP_ID and uploaded_files):
        st.info("Please add your Clarifai PAT, USER_ID, APP_ID and upload files to continue.")
    elif st.button("Get the response"):
        with st.spinner("Processing"):
            # process pdfs
            docs = load_chunk_pdf(uploaded_files)

            # create a vector store
            clarifai_vector_db = vectorstore(USER_ID, APP_ID, docs, CLARIFAI_PAT)

            # 2. Vector Creation: create Q&A chain
            conversation = QandA(CLARIFAI_PAT, clarifai_vector_db)

            # 3. Querying: ask the question to the GPT-4 model based on the documents
            # This step also combines 4. retrieval and 5. prepending the context
            response = conversation.run(user_question)
            st.write(response)


if __name__ == '__main__':
    main()
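
If you would rather run the app locally than use it as a module on the Clarifai platform, save the script (for example as app.py, any filename works), install the dependencies (streamlit, langchain, clarifai, and pypdf), and launch it with streamlit run app.py.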