Document Summarization
Summarization is a fundamental building block of many LLM tasks. It is the process of condensing a longer text document into a shorter version that captures the most important points or information. This is particularly useful in research, where digesting vast amounts of literature quickly is beneficial, and in business, for summarizing reports, emails, and other communications.
3 Levels Of Summarization: Novice to Expert
In many use cases, we want to condense a large body of text into a concise set of points.
Several summarization methods are available, and they can be grouped into levels according to the length and complexity of the text.
We will explore three primary methods for summarization, starting at Novice and ending at Expert. These aren't the only options; feel free to make up your own.
3 Levels Of Summarization:
- Sentence and Paragraph Summarization - Basic Prompt
- Page-level Summarization - Map Reduce
- Summarize an entire book - Best Representation Vectors
Level 1: Basic Prompt - Summarize a couple of paragraphs
Use this level when you have a few paragraphs and want a one-off summary.
You can simply submit the text through one of the following workflows, and it will return a condensed version:
- Summarisation in Paragraph - For summarizing text into a paragraph.
- Summarisation in Bullet points - For summarizing text into bullet points.
This method isn't scalable; it is practical only when the input text is short enough to fit into the model's context window. For those cases, it is the perfect Level 1!
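As a minimal sketch of what a Level 1 request might look like, the snippet below only builds the prompt text for a one-off summary. The function name `build_summary_prompt` and the prompt wording are illustrative assumptions; they are not the prompts used by the workflows listed above.

```python
def build_summary_prompt(text: str, style: str = "paragraph") -> str:
    """Build a one-off summarization prompt (hypothetical wording)."""
    if style == "paragraph":
        instruction = "Summarize the following text in one short paragraph."
    elif style == "bullets":
        instruction = "Summarize the following text as a short list of bullet points."
    else:
        raise ValueError(f"unknown style: {style!r}")
    # Keep the instruction and the input on separate, labeled lines
    # so the model can tell them apart.
    return "\n".join([instruction, "", "TEXT:", text.strip(), "", "SUMMARY:"])


sample = "Summarization condenses a longer document into a shorter version."
prompt = build_summary_prompt(sample, style="bullets")
print(prompt)
```

The resulting string could then be sent to any LLM in a single call, which is what makes this level simple but limited to inputs that fit in the context window.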
Level 2: Summarize multiple pages
This method addresses the challenge of summarizing texts that span multiple pages and may exceed the token limit of an LLM.
Token limits won't always be a problem, but it is good to know how to handle them if you run into one.
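One rough way to decide whether a text needs the Level 2 treatment is to estimate its token count before sending it. The sketch below assumes about four characters per token, which is a rule of thumb for English text rather than an exact tokenizer count. The 8,000-token limit and the 1,000-token reserve are placeholder values, not the limits of any particular model, and both function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; a real tokenizer will give different numbers."""
    return int(len(text) / chars_per_token) + 1


def needs_chunking(text: str, context_limit: int = 8000, reserve: int = 1000) -> bool:
    """Return True if the text likely exceeds the context window once room
    for the prompt and the generated summary (reserve) is set aside."""
    return estimate_tokens(text) > context_limit - reserve


short_text = "word " * 200     # 1,000 characters
long_text = "word " * 20000    # 100,000 characters
print(needs_chunking(short_text))
print(needs_chunking(long_text))
```

If the check returns True, splitting the text and summarizing it in stages is the safer route.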
Using LangChain with Clarifai
This can be implemented easily with the LangChain framework and Clarifai.
The LangChain and Clarifai integration facilitates the summarization process by making Clarifai's LLMs, embeddings, and vector storage available within the LangChain ecosystem. Numerous example notebooks demonstrate how LangChain works with Clarifai's capabilities.
Import Dependencies
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
Initialize and Split Documents
# Path to the essay to summarize
history2_essay = '../Data/history2.txt'
with open(history2_essay, 'r') as file:
    essay = file.read()
# Split on paragraph and line breaks into chunks of up to 10,000 characters with a 500-character overlap
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
docs = text_splitter.create_documents([essay])
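To make `chunk_size` and `chunk_overlap` concrete, here is a simplified character-based splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on the listed separators); it only illustrates how consecutive chunks share an overlapping region. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(alphabet, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.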
Initialize the Clarifai LLM class from LangChain
from langchain.llms import Clarifai
# Use the model URL
MODEL_URL = "https://clarifai.com/mistralai/completion/models/mistral-7B-OpenOrca"
# Pass the URL as a keyword argument
llm = Clarifai(model_url=MODEL_URL)
We summarize the chunks using the map_reduce chain type.
Let's use LangChain's load_summarize_chain to do the map-reduce for us. We first need to initialize our chain.
The "map_reduce" chain type in LangChain handles the token limit: it first generates a summary of each smaller chunk (each of which fits within the token limit) and then summarizes those summaries into a final summary.
from langchain.chains.summarize import load_summarize_chain
summary_chain = load_summarize_chain(llm=llm, chain_type='map_reduce')
output = summary_chain.run(docs)
print(output)
The output is a single summary of the contents of all the chunks.
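The map-reduce pattern itself is independent of any framework. The sketch below is not LangChain's implementation; it accepts any `summarize` callable (in practice, a call to an LLM), applies it to each chunk, and then applies it once more to the joined partial summaries. The `first_sentence` function is a toy stand-in used only so the example runs without a model, and the sample chunks are made-up text.

```python
from typing import Callable, Sequence


def map_reduce_summarize(chunks: Sequence[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the combined partial summaries."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    # Map step: one independent summary per chunk.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: a single summary over all partial summaries.
    return summarize("\n".join(partial_summaries))


def first_sentence(text: str) -> str:
    """Toy stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


sample_chunks = [
    "The expedition began in spring. Supplies were plentiful.",
    "A storm hit the upper camps. Several teams turned back.",
]
print(map_reduce_summarize(sample_chunks, first_sentence))
```

Note that the summarizer is called once per chunk plus once for the final pass, so cost grows linearly with the number of chunks.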
Summarize in Bullet Points
This summary is a great start, but bullet points are often easier to scan, so we want the final output in bullet-point form.
To get bullet points, we will use custom prompts that instruct the model how to summarize.
map_prompt = """
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
combine_prompt = """
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm,
                                     chain_type='map_reduce',
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template)
output = summary_chain.run(docs)
print(output)
Level 3: Summarize an entire book
This method summarizes an entire book using the Best Representation Vectors method.
Goal: Chunk your book, then get embeddings of the chunks. Pick a subset of chunks that represents a holistic but diverse view of the book. Put another way: is there a way to pick the top 10 passages that best describe the book?
Once we have the chunks that represent the book, we can summarize those chunks and hopefully get a pretty good summary.
The Best Representation Vectors (BRV) Steps:
- 1. Load your book into a single text file
- 2. Split your text into large-ish chunks
- 3. Embed your chunks to get vectors
- 4. Cluster the vectors to see which are similar to each other and likely talk about the same parts of the book
- 5. Pick embeddings that represent the cluster the most (method: closest to each cluster centroid)
- 6. Summarize the documents that these embeddings represent
Another way to phrase this process: "Which ~10 documents from this book represent most of the meaning?" Then build a summary from those.
Note: There will be a bit of information loss, but show me a summary of a whole book that doesn't have information loss ;)
Import Dependencies
# Loaders
from langchain.schema import Document
# Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Model and prompt templates
from langchain.llms import Clarifai
from langchain.prompts import PromptTemplate
# Embedding Support
from langchain_community.embeddings import ClarifaiEmbeddings
# Summarizer we'll use for Map Reduce
from langchain.chains.summarize import load_summarize_chain
# Data Science
import numpy as np
from sklearn.cluster import KMeans
Text Processing and Split Documents
Load the book as a single body of text, then split it into large-ish chunks.
from langchain.document_loaders import PyPDFLoader
# Load the book
loader = PyPDFLoader("../data/IntoThinAirBook.pdf")
pages = loader.load()
# Cut out the opening and closing parts
pages = pages[26:277]
# Combine the pages, and replace the tabs with spaces
text = ""
for page in pages:
    text += page.page_content
text = text.replace('\t', ' ')
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000)
docs = text_splitter.create_documents([text])
Embed Documents
Embed chunks to get vectors
EXAMPLE_URL = "https://clarifai.com/clarifai/main/models/BAAI-bge-base-en-v15"
embeddings = ClarifaiEmbeddings(model_url=EXAMPLE_URL)
# One vector per chunk, in the same order as docs
vectors = embeddings.embed_documents([x.page_content for x in docs])
Cluster Embeddings
Cluster the vectors to see which are similar to each other and likely talk about the same parts of the book
# Choose the number of clusters, this can be adjusted based on the book's content.
num_clusters = 11
# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
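For intuition about what the clustering step does, here is a minimal pure-Python version of the k-means assign/update loop on toy 2-D points. It is only a sketch: scikit-learn's `KMeans` uses a more careful initialization (k-means++ by default) and convergence checks, whereas this version simply seeds the centroids with the first `k` points. The function name `toy_kmeans` and the sample points are made up for illustration.

```python
import math


def toy_kmeans(points, k, iterations=20):
    """Minimal k-means: alternate between assigning points and moving centroids."""
    centroids = [list(p) for p in points[:k]]  # naive seeding with the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = toy_kmeans(points, k=2)
print(labels)
print(centroids)
```

With real embeddings the points have hundreds of dimensions instead of two, but the idea is the same: chunks that discuss similar parts of the book end up with the same label.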
Choose Embeddings that Represent the Complete Book
Pick the embeddings that best represent each cluster (method: closest to each cluster centroid)
# Find the closest embeddings to the centroids
# Create an empty list that will hold the closest points
closest_indices = []
# Loop through the clusters
for i in range(num_clusters):
    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    # Find the list position of the closest one (argmin returns the index of the smallest distance)
    closest_index = np.argmin(distances)
    # Append that position to the closest indices list
    closest_indices.append(closest_index)
# Sort the indices so the selected chunks stay in book order
selected_indices = sorted(closest_indices)
# Then get the docs that the selected vectors represent
selected_docs = [docs[doc] for doc in selected_indices]
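The selection step can be checked by hand on a tiny example. The sketch below uses only the standard library and made-up 2-D vectors and centroids; the function name `closest_to_centroids` is hypothetical. For each centroid it finds the index of the nearest vector by Euclidean distance, then sorts the indices so the chosen chunks keep their original order.

```python
import math


def closest_to_centroids(vectors, centroids):
    """For each centroid, return the index of the nearest vector, sorted ascending."""
    indices = []
    for centroid in centroids:
        # Euclidean distance from every vector to this centroid.
        distances = [math.dist(v, centroid) for v in vectors]
        # Position of the smallest distance (same role as np.argmin).
        indices.append(distances.index(min(distances)))
    return sorted(indices)


toy_vectors = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0), (0.4, 0.6)]
toy_centroids = [(9.6, 9.6), (0.5, 0.5)]
selected = closest_to_centroids(toy_vectors, toy_centroids)
print(selected)
```

Here the first centroid is nearest to vector 3 and the second to vector 4, so those two positions would be the representative chunks.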
Initialize the Clarifai LLM class from LangChain
from langchain.llms import Clarifai
# Use the model URL
MODEL_URL = "https://clarifai.com/openai/chat-completion/models/GPT-4"
# Pass the URL as a keyword argument
llm = Clarifai(model_url=MODEL_URL)
Summarize the documents that these embeddings represent
map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.
```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
map_chain = load_summarize_chain(llm=llm,
                                 chain_type="stuff",
                                 prompt=map_prompt_template)
# Make an empty list to hold the summaries
summary_list = []
# Loop through the selected docs
for doc in selected_docs:
    # Get a summary of the chunk
    chunk_summary = map_chain.run([doc])
    # Append that summary to the list
    summary_list.append(chunk_summary)
Summary of the summaries
Now that we have our list of summaries, let's get a summary of the summaries.
combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.
```{text}```
VERBOSE SUMMARY:
"""
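combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
reduce_chain = load_summarize_chain(llm=llm,
                                    chain_type="stuff",
                                    prompt=combine_prompt_template)
# Join the chunk summaries into a single Document for the final pass
summaries = Document(page_content="\n".join(summary_list))
output = reduce_chain.run([summaries])
print(output)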
Each method offers a tailored approach to summarization for a different length and complexity of text, so you can choose the most appropriate technique for your specific needs.
- Description: An app template for document summarization that supports 3 levels, starting with Novice and ending with Expert
- Last Updated: Mar 18, 2024
- Default Language: en