general-english-image-caption-clip

Image caption with CLIP encoding as prefix

Input

Submit an image for a response.

Output

A text caption describing the submitted image.

Notes

Source: https://github.com/rmokady/CLIP_prefix_caption

We use the CLIP model, which was pretrained on an extremely large collection of images and can therefore produce semantic encodings for arbitrary images without additional supervision. To generate meaningful sentences, we fine-tune a pretrained language model (GPT-2), which has proven successful on other natural-language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions: a simple mapping network projects the raw encoding into the language model's embedding space, and the language model is then fine-tuned to generate a valid caption. We also present a variant that uses a transformer architecture for the mapping network and avoids fine-tuning GPT-2 altogether. Even so, this lighter model achieves results comparable to the state of the art on the nocaps dataset.
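The prefix idea above can be sketched in a few lines: a mapping network turns one CLIP image embedding into a short sequence of prefix vectors in the language model's embedding space, and the caption's token embeddings are concatenated after that prefix. The sketch below uses NumPy with random weights in place of trained ones; the dimensions (512 for CLIP ViT-B/32, 768 for GPT-2) match the real models, but the hidden size, prefix length of 10, and caption length of 20 are illustrative assumptions, not the repository's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM = 512    # CLIP ViT-B/32 image embedding size
GPT2_DIM = 768    # GPT-2 token embedding size
PREFIX_LEN = 10   # number of prefix tokens (an assumed, tunable choice)

# Toy MLP mapping network: one CLIP embedding -> PREFIX_LEN prefix vectors.
# Weights are random here; in the real model they are learned.
HIDDEN = (CLIP_DIM + GPT2_DIM * PREFIX_LEN) // 2
W1 = rng.standard_normal((CLIP_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, GPT2_DIM * PREFIX_LEN)) * 0.02

def map_to_prefix(clip_embedding):
    """Project a CLIP image embedding to a sequence of GPT-2-sized prefix vectors."""
    h = np.tanh(clip_embedding @ W1)              # hidden layer
    return (h @ W2).reshape(PREFIX_LEN, GPT2_DIM)

# Stand-in for a real CLIP image embedding.
clip_embedding = rng.standard_normal(CLIP_DIM)
prefix = map_to_prefix(clip_embedding)            # shape (10, 768)

# During training, caption token embeddings are appended after the prefix
# and the whole sequence is fed to the language model.
caption_embeddings = rng.standard_normal((20, GPT2_DIM))  # 20 caption tokens
lm_input = np.concatenate([prefix, caption_embeddings], axis=0)
print(lm_input.shape)  # (30, 768)
```

At inference time there are no caption tokens yet: the language model receives only the prefix and decodes the caption token by token.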

  • ID
  • Model Type ID
    Image To Text
  • Input Type
    image
  • Output Type
    text
  • Description
    Image caption with CLIP encoding as prefix
  • Last Updated
    Oct 25, 2024
  • Privacy
    PUBLIC
  • Toolkit
  • License