text-embedder-blip
Notes
Introduction
Contrary to other versions of the BLIP model, the input of this model is a string of text rather than an image. The model's output is an embedding: a numerical vector that represents the input text in a 768-dimensional space, computed by BLIP's text encoder. The vectors of similar texts will be close to each other in this 768-dimensional space.
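As a rough illustration, the sketch below shows one way such an embedding might be obtained using the Hugging Face transformers BLIP implementation. The checkpoint name, the `get_text_features` call, and the exact output dimensionality are assumptions for illustration, not details taken from this model card.

```python
# Minimal sketch (assumed Hugging Face transformers API): embed a string of
# text with BLIP. The checkpoint name and the dimensionality of the returned
# vector are assumptions -- adapt them to your own deployment.
import torch
from transformers import BlipModel, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipModel.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

# Tokenize a string of text; no image input is needed for the text branch.
inputs = processor(
    text=["a child playing with a golden retriever"],
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    embedding = model.get_text_features(**inputs)  # shape: (1, hidden_dim)

print(embedding.shape)
```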
BLIP – Text Embedder
Since this model's output is an embedding (a numerical vector representation) of the input text, one powerful use case is comparing the embeddings of two blocks of text: the more alike the inputs are, the closer their embeddings will be. This allows for filtering, indexing, ranking, and organizing blocks of text according to the similarity of their content, and the approach is likely to carry over well to transfer learning tasks.
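For instance, once embeddings have been computed (one vector per block of text), ranking candidates against a query reduces to cosine similarity. The sketch below is a self-contained, hypothetical example using NumPy; the function name and the random placeholder vectors are illustrative only, standing in for real BLIP embeddings.

```python
# Hypothetical ranking sketch: given a query embedding and a matrix of
# candidate text embeddings (one per row), rank candidates by cosine
# similarity. All names here are illustrative, not part of the model API.
import numpy as np

def rank_by_similarity(query_emb: np.ndarray, corpus_embs: np.ndarray):
    """Return candidate indices sorted from most to least similar."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # (n_candidates,)
    order = np.argsort(-sims)         # indices in descending similarity
    return order, sims[order]

# Toy usage with random 768-dimensional vectors standing in for embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
corpus = rng.normal(size=(5, 768))
indices, scores = rank_by_similarity(query, corpus)
print(indices, scores)
```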
Salesforce – BLIP
More info:
- Original repository: GitHub
- Interactive Demo: Google Colab
Paper
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
Abstract
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
- ID
- Name: BLIP
- Model Type ID: Text Embedder
- Description: BLIP-based text embedding model.
- Last Updated: Oct 17, 2024
- Privacy: PUBLIC
- License