This is a multilingual version of the OpenAI CLIP-ViT-B32 model. It maps text (in 50+ languages) and images into a common dense vector space in which an image and its matching text lie close together. The model can be used for image search (users search through a large collection of images with a text query) and for multilingual zero-shot image classification (image labels are defined as text).
This model is used in the Universal-Multilingual workflow. Set that workflow as your app's base workflow to enable vector search across different languages.
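
To make the shared vector space concrete, here is a minimal sketch of how image search and zero-shot classification reduce to ranking by cosine similarity. The three-dimensional vectors, file names, and the helper `rank_by_similarity` are hypothetical and chosen only for illustration; real embeddings would come from the model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for image embeddings (assumption: not real model output).
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Image search: a toy stand-in for the embedding of a text query,
# e.g. the German phrase "ein Hund im Park".
query_embedding = [0.8, 0.2, 0.1]
search_results = rank_by_similarity(query_embedding, image_embeddings)

# Zero-shot classification: labels are text in any supported language,
# embedded into the same space, and the closest label wins.
label_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "Strand": [0.0, 1.0, 0.0],
    "ciudad": [0.0, 0.0, 1.0],
}
predicted_label = rank_by_similarity(
    image_embeddings["beach.jpg"], label_embeddings
)[0][0]

print([name for name, _ in search_results])
print(predicted_label)
```

Because text in every supported language lands in the same space as the images, the ranking step does not need to know which language the query or the labels were written in.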