
image-embedder-blip

BLIP-based image embedding model

Notes

Introduction

The primary use case of the Salesforce BLIP model is generating a caption that describes the content of an image. The model accomplishes this using a machine learning technique known as Vision-Language Pre-training (VLP). BLIP stands out from other VLP architectures because it excels at both understanding and generation tasks.

This model's output is an embedding: a numerical vector that represents the input image in a 768-dimensional space. The embedding is computed by BLIP's image encoder, a vision transformer, and the vectors of visually similar images lie close to each other in that 768-dimensional space.
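As a concrete illustration, the following is a minimal sketch of how such an embedding could be computed with the Hugging Face transformers implementation of BLIP. The checkpoint name Salesforce/blip-image-captioning-base, the local file example.jpg, and the use of the vision encoder's [CLS] token as the 768-dimensional embedding are assumptions for this example and may not match the exact setup behind this model.

```python
# Minimal sketch: extract a 768-dimensional BLIP image embedding.
# Assumes the Hugging Face `transformers` BLIP implementation and the
# "Salesforce/blip-image-captioning-base" checkpoint; loading it into
# BlipModel may warn about unused text weights, which is fine here
# because only the vision encoder is used.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipModel

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipModel.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    vision_outputs = model.vision_model(pixel_values=inputs["pixel_values"])
    # Take the [CLS] token of the vision transformer as the image embedding
    # (768 dimensions for the base model).
    embedding = vision_outputs.last_hidden_state[:, 0, :]

print(embedding.shape)  # torch.Size([1, 768])
```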

BLIP – Image Embedder

Since this model's output is an embedding (a numerical vector representation) of the input image, one powerful use case is comparing the embeddings of two or more images: the more alike the input images are, the more similar their embeddings will be. This allows images to be filtered, indexed, ranked, and organized by visual similarity, as shown in the sketch below. The embeddings can also serve as input features for transfer learning tasks.
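For example, embeddings can be compared with cosine similarity and used to rank a gallery of images against a query image. The sketch below uses randomly generated placeholder vectors and hypothetical file names purely to illustrate the ranking logic; real embeddings would come from the BLIP image encoder as in the previous snippet.

```python
# Minimal sketch: rank images by cosine similarity of their embeddings.
# The vectors here are random placeholders standing in for BLIP embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 768-dimensional embeddings for a query image and a small gallery.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
gallery = {name: rng.normal(size=768) for name in ["cat.jpg", "dog.jpg", "car.jpg"]}

# Rank gallery images by similarity to the query; visually similar images
# would score closer to 1.0.
ranking = sorted(gallery.items(),
                 key=lambda kv: cosine_similarity(query, kv[1]),
                 reverse=True)
for name, emb in ranking:
    print(name, round(cosine_similarity(query, emb), 3))
```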

Salesforce – BLIP

More info:

Paper

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.

  • ID
  • Name
    BLIP
  • Model Type ID
    Visual Embedder
  • Description
    BLIP-based image embedding model
  • Last Updated
    Oct 17, 2024
  • Privacy
    PUBLIC
  • License