Document Summarization

Summarization is a fundamental building block of many LLM tasks. Summarization refers to the process of condensing a longer text document into a shorter version that captures the most important points or information. This can be particularly useful in fields like research, where quickly digesting vast amounts of literature is beneficial, or in business, for summarizing reports, emails, and other communications.

3 Levels Of Summarization: Novice to Expert

In many use cases, we want to summarize a large body of text into a concise set of points.

Several summarization methods are available for condensing extensive text into a more digestible format. They can be grouped into levels based on the length and complexity of the text.

We will explore 3 primary methods for summarization that start with Novice and end with Expert. These aren't the only options; feel free to make up your own.

3 Levels Of Summarization:

  1. Sentence and Paragraph Summarization - Basic Prompt
  2. Page-level Summarization - Map Reduce
  3. Summarize an entire book - Best Representation Vectors

Level 1: Basic Prompt - Summarize a couple of paragraphs

Use this level when you have a few paragraphs and want a one-off summary.

You can simply submit the text for summarization through the following workflow, and it will return a condensed version:

This method isn't scalable. It is only practical when the input text is short enough to fit into the model's context window, which makes it the perfect Level 1.
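
As a rough illustration of the Level 1 idea, the sketch below builds a one-off summarization prompt and checks whether it is likely to fit in a context window. The helper names, the four-characters-per-token rule of thumb, and the 4,096-token window are all assumptions made for this example; they are not part of Clarifai or LangChain, and real tokenizers and limits vary by model.

```python
# Hypothetical numbers: replace them with values for the model you actually use.
CHARS_PER_TOKEN = 4           # assumption: rough rule of thumb for English text
CONTEXT_WINDOW_TOKENS = 4096  # assumption: illustrative context window size


def build_summary_prompt(text: str) -> str:
    """Wrap the input text in a one-off summarization instruction."""
    return f"Please provide a summary of the following text.\n\nTEXT:\n{text}\n\nSUMMARY:"


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_context_window(prompt: str, reserved_for_output: int = 512) -> bool:
    """Check whether the prompt still leaves room for the model's reply."""
    return estimate_tokens(prompt) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


prompt = build_summary_prompt("Clarifai apps organize models, workflows, and inputs.")
print(fits_context_window(prompt))
```

If the check fails, the text is probably too long for a single prompt, and one of the later levels is a better fit.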

Level 2: Summarize multiple pages

This method addresses the challenge of summarizing texts spanning multiple pages, which may exceed the token limits of LLMs.

If you have multiple pages to summarize, you might run into a token limit. Token limits won't always be a problem, but it is good to know how to handle them if you run into one.

Using LangChain with Clarifai

This can be implemented easily with the LangChain framework and Clarifai.

LangChain and Clarifai integration facilitates the summarization process by leveraging LLMs, embeddings, and vector storage within the LangChain ecosystem. Numerous example notebooks demonstrate the seamless integration of LangChain with Clarifai's capabilities.

Import Dependencies

# PromptTemplate is used later for the custom bullet-point prompts
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter

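Initialize and Split Documents

history2_essay = '../Data/history2.txt'

# Read the whole essay into a single string
with open(history2_essay, 'r') as file:
    essay = file.read()

# Split on paragraph and line breaks into chunks of up to 10,000 characters,
# with 500 characters of overlap between neighboring chunks
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
docs = text_splitter.create_documents([essay])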
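
To make the roles of chunk_size and chunk_overlap more concrete, here is a simplified sketch that splits a string into fixed-size character windows. It is only an illustration: the function name `split_with_overlap` is hypothetical, and RecursiveCharacterTextSplitter behaves differently because it prefers to break on the given separators rather than at fixed offsets.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is how overlap helps preserve context across chunk boundaries.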

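Initialize the Clarifai LLM class from LangChain

from langchain.llms import Clarifai

# Use the model URL
MODEL_URL = "https://clarifai.com/mistralai/completion/models/mistral-7B-OpenOrca"

# Pass the URL by keyword; authentication is typically read from the CLARIFAI_PAT environment variable
llm = Clarifai(model_url=MODEL_URL)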

We use the map_reduce chain type to summarize the chunks.

Let's use LangChain's load_summarize_chain to do the map-reduce for us. We first need to initialize our chain.

The "map_reduce" chain type in LangChain first generates a summary of each smaller chunk (each of which fits within the token limit) and then produces a summary of those summaries.
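
The map and reduce steps described above can be sketched in plain Python. This is a conceptual illustration only, not LangChain's implementation: `map_reduce_summarize` is a hypothetical helper, and `first_sentence` is a toy stand-in for an LLM call so that the example runs without any model.

```python
from typing import Callable


def map_reduce_summarize(chunks: list[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunk_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = "\n".join(chunk_summaries)
    return summarize(combined)                                # reduce step


def first_sentence(text: str) -> str:
    """Toy stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


chunks = [
    "Rome was founded early. It grew slowly.",
    "The empire expanded. Trade flourished.",
]
result = map_reduce_summarize(chunks, first_sentence)
print(result)
```

In a real pipeline, the `summarize` argument would wrap a call to the LLM, and each call would only ever see text that fits within the token limit.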

from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=llm, chain_type='map_reduce')

output = summary_chain.run(docs)
print(output)

The output is a single summary that covers the contents of all the chunks.

Summarize in Bullet Points

This summary is a great start, but bullet points are often easier to read and understand, so we want the final output in bullet-point form.

To get bullet points, we will use custom prompts that instruct the model how to summarize.

map_prompt = """
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

combine_prompt = """
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

summary_chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
)

output = summary_chain.run(docs)
print(output)

Level 3: Summarize an entire book

This method summarizes an entire book using the Best Representation Vectors method.

Goal: Chunk your book, then get embeddings of the chunks. Pick a subset of chunks that represents a holistic but diverse view of the book. Put another way: is there a way to pick the top 10 passages that best describe the book?

Once we have the chunks that represent the book, we can summarize those chunks and hopefully get a pretty good summary.

The Best Representation Vectors (BRV) Steps:

  1. Load your book into a single text file
  2. Split your text into large-ish chunks
  3. Embed your chunks to get vectors
  4. Cluster the vectors to see which are similar to each other and likely talk about the same parts of the book
  5. Pick embeddings that represent the cluster the most (method: closest to each cluster centroid)
  6. Summarize the documents that these embeddings represent

Another way to phrase this process is, "Which ~10 documents from this book represent most of the meaning?" Then build a summary from those.

Note: There will be a bit of information loss, but show me a summary of a whole book that doesn't have information loss ;)

Import Dependencies

# Loaders
from langchain.schema import Document

# Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Model
from langchain.llms import Clarifai

# Embedding Support
from langchain_community.embeddings import ClarifaiEmbeddings

# Summarizer we'll use for Map Reduce, plus prompt templates for the custom prompts
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

# Data Science
import numpy as np
from sklearn.cluster import KMeans

Text Processing and Split Documents

Load the book, combine its pages into a single text, and then split the text into large-ish chunks.

from langchain.document_loaders import PyPDFLoader

# Load the book
loader = PyPDFLoader("../data/IntoThinAirBook.pdf")
pages = loader.load()

# Cut out the opening and closing parts
pages = pages[26:277]

# Combine the pages, and replace the tabs with spaces
text = ""

for page in pages:
    text += page.page_content
    
text = text.replace('\t', ' ')

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000)

docs = text_splitter.create_documents([text])

Embed Documents

Embed chunks to get vectors

EXAMPLE_URL = "https://clarifai.com/clarifai/main/models/BAAI-bge-base-en-v15"
embeddings = ClarifaiEmbeddings(model_url=EXAMPLE_URL)

# One embedding vector per chunk
vectors = embeddings.embed_documents([x.page_content for x in docs])

Cluster Embeddings

Cluster the vectors to see which are similar to each other and likely talk about the same parts of the book

# Choose the number of clusters, this can be adjusted based on the book's content.
num_clusters = 11

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
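
For intuition about what the clustering step does, here is a simplified, plain-Python version of the basic k-means loop (Lloyd's algorithm) on toy 2-D points. It is a teaching sketch under stated assumptions, not a replacement for scikit-learn's KMeans, which uses a more careful initialization and additional refinements. The function name and the toy data are made up for this example.

```python
import math
import random


def simple_kmeans(points, k, iterations=20, seed=42):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Two well-separated groups of toy "embeddings".
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
centroids, labels = simple_kmeans(points, k=2)
print(labels)
```

Points that end up with the same label are treated as covering similar content, which is the property the next step relies on. Note that the number of clusters cannot exceed the number of chunks.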

Choose Embeddings that Represent the Complete Book

Pick embeddings that represent the cluster the most (method: closest to each cluster centroid)

# Find the closest embeddings to the centroids

# Create an empty list that will hold closest points
closest_indices = []

# Loop through the number of clusters
for i in range(num_clusters):
    
    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    
    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)
    
    # Append that position to your closest indices list
    closest_indices.append(closest_index)

# Sort the indices so the selected chunks stay in book order
selected_indices = sorted(closest_indices)

# Then get the docs that the selected vectors represent
selected_docs = [docs[doc] for doc in selected_indices]
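
The closest-to-centroid selection can be checked on a small toy example. The sketch below uses only the standard library; the function name, vectors, and centroids are invented for illustration and stand in for the real embeddings and the centers found by k-means.

```python
import math


def closest_to_centroids(vectors, centroids):
    """For each centroid, find the index of the nearest vector; return them sorted."""
    indices = []
    for centroid in centroids:
        distances = [math.dist(vector, centroid) for vector in vectors]
        indices.append(distances.index(min(distances)))
    return sorted(indices)


# Toy 2-D "embeddings" and two hypothetical cluster centers.
vectors = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0], [4.0, 4.0]]
centroids = [[0.4, 0.4], [9.6, 9.6]]
selected = closest_to_centroids(vectors, centroids)
print(selected)  # [0, 3]
```

Two centroids can occasionally share the same nearest vector, so the result may contain duplicates; removing them before summarizing is a reasonable design choice.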

Initialize the Clarifai LLM class from LangChain

from langchain.llms import Clarifai

# Use the model URL
MODEL_URL = "https://clarifai.com/openai/chat-completion/models/GPT-4"

# Pass the URL by keyword
llm = Clarifai(model_url=MODEL_URL)

Summarize the documents that these embeddings represent

map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

map_chain = load_summarize_chain(
    llm=llm,
    chain_type="stuff",
    prompt=map_prompt_template,
)

# Make an empty list to hold summaries
summary_list = []

# Loop through the selected docs
for doc in selected_docs:

    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])

    # Append that summary to your list
    summary_list.append(chunk_summary)
Summary of the summaries

Now that we have our list of summaries, let's get a summary of the summaries.

combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.

```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

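reduce_chain = load_summarize_chain(
    llm=llm,
    chain_type="stuff",
    prompt=combine_prompt_template,
)

# Join the chunk summaries and wrap them in a single Document for the chain
summaries = Document(page_content="\n".join(summary_list))

output = reduce_chain.run([summaries])
print(output)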

Each method offers a tailored approach to summarization, catering to different lengths and complexities of text, so you can choose the most appropriate technique based on your specific needs.

  • Description
    An app template for document summarization that supports 3 levels, from Novice to Expert
  • Base Workflow
  • Last Updated
    Mar 18, 2024
  • Default Language
    en