App Overview

App: blip (salesforce)
Datasets: 0, Models: 6, Workflows: 0, Modules: 0

Introduction

The BLIP model is trained to generate a caption based on the content of an image. The model completes this task using a novel ML technique known as Vision-Language Pre-training (VLP). The BLIP model stands out from other VLP architectures as it excels in both understanding and generation tasks.

In the following example, the model generates a caption that describes the image's content:

Man in fruit loops example
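
For readers who want to try this kind of captioning in code, here is a minimal sketch using the Hugging Face `transformers` library. This is an assumption for illustration, not necessarily how this app serves the model: the checkpoint name `Salesforce/blip-image-captioning-base` and the helper names `load_blip` and `caption_image` are choices made for this example.

```python
def load_blip(checkpoint="Salesforce/blip-image-captioning-base"):
    """Load a BLIP captioning processor and model (assumed checkpoint name).

    The weights are downloaded on first use, so this needs network access.
    """
    # Imported lazily so caption_image() can be used with any compatible objects.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def caption_image(image, processor, model, max_new_tokens=30):
    """Generate one caption for a single RGB image (for example a PIL image)."""
    # The processor resizes and normalizes the image into model inputs.
    inputs = processor(images=image, return_tensors="pt")
    # generate() decodes caption token ids autoregressively.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Convert the first (and only) sequence of token ids back to text.
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

A typical call would load the pair once with `load_blip()` and then pass an image opened with `PIL.Image.open(path).convert("RGB")` to `caption_image`. Keeping loading separate from captioning means the model is not reloaded for every image.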

BLIP Image Captioner

This model is intended for generating image metadata, such as captions, which can improve SEO. Automatic image captioning can also reduce human workload and the subjectivity of manually written descriptions.
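
As one illustration of the metadata use case, a generated caption could be placed in an image's `alt` attribute. The sketch below is hypothetical: the function name `img_tag_with_alt` and the 125-character limit are assumptions made for this example, not part of BLIP or Clarifai.

```python
from html import escape


def img_tag_with_alt(src, caption, max_len=125):
    """Build an <img> tag whose alt text is the caption, truncated if too long.

    max_len is an arbitrary limit chosen for this sketch.
    """
    # Collapse runs of whitespace so the alt text stays on one line.
    alt = " ".join(caption.split())
    if len(alt) > max_len:
        alt = alt[: max_len - 3].rstrip() + "..."
    # Escape quotes and angle brackets so the caption cannot break the markup.
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'
```

Escaping matters here because captions are model output and may contain characters such as quotes that would otherwise end the attribute early.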

BLIP is not well suited to domain-specific images, such as medical images, for which it may not generate accurate captions.

More info:

Original repository: GitHub
Interactive Demo: Google Colab

Paper

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
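
The caption bootstrapping step described in the abstract can be sketched as a simple data-cleaning loop. This is a toy illustration under stated assumptions, not the authors' implementation: `captioner` and `is_match` are hypothetical callables standing in for the trained captioner and filter models, and applying the filter to both the web caption and the synthetic caption is a design choice of this sketch.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Return a cleaned list of image-text pairs.

    For each image, the noisy web caption and a synthetic caption from
    `captioner` are both offered to the filter `is_match`; only accepted
    captions are kept, and exact duplicates are dropped.
    """
    cleaned: List[Pair] = []
    seen = set()
    for image_id, web_caption in web_pairs:
        candidates = [web_caption, captioner(image_id)]
        for text in candidates:
            pair = (image_id, text)
            # Keep a caption only if the filter judges it to match the image.
            if pair not in seen and is_match(image_id, text):
                seen.add(pair)
                cleaned.append(pair)
    return cleaned
```

In practice both callables would wrap neural models; here they can be any functions, which makes the control flow easy to check with small hand-written inputs.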

Description: A collection of BLIP image-to-text models and embedding models with various parameter sizes
Base Workflow:
Last Updated: Aug 13, 2024
Default Language: en
Models: Image To Text, Text Embedder, Visual Embedder, Multimodal Embedder