February 14, 2020

Generate Image Descriptions with Natural Language Processing


As they say, “A picture is worth a thousand words.” On the other hand, a few words accompanying a picture can add significant meaning to what you see in it. Clarifai's Predict API lets you obtain a set of categories for a submitted photo programmatically, which makes it possible to build applications that process images intelligently.

 

From descriptive tags (concepts) to meaningful descriptions

Obtaining descriptive tags (such as fruit, health, grape, etc.) for a photo can be quite useful when, for example, you need to organize a messy digital photo library and sort pictures that fall into more than one category. Sometimes, however, you need to take it one step further and obtain a short description of what is shown in the photo.

As an example of where a generated photo description might be used, consider a chatbot designed to maintain conversations on different topics. A smart chatbot is supposed to "understand" not only text messages but also images submitted by the user, and to react to a submitted image in a meaningful, human-like manner. Since humans don't usually speak in separate words but weave them together into sentences, you'll need a mechanism for generating a relevant sentence from the set of separate words (descriptive tags) that you have for your photo. This is where Natural Language Processing (NLP) comes into play.

In fact, automatically generating appropriate image descriptions from a set of given tags is useful beyond chatbots. Suppose you're a travel blogger who has tons of new photos to publish every week, but you have absolutely no time to write a description for each.

 

A concrete example

Let's look at a concrete example to understand how this might work. Look at the photo below:

[Image: breakfast]

The list of tags generated by Clarifai's Predict API for this photo might include the following:

coffee, breakfast, drink, dawn, cup, espresso …
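Before any language processing happens, the tags usually need to be collected into a plain list. The following is a minimal sketch, assuming each predicted concept carries a name and a confidence value; the `concepts` list is a hand-written stand-in rather than real API output, the confidence values are invented for illustration, and `top_tags` is a hypothetical helper.

```python
# Hypothetical stand-in for the concepts predicted for the breakfast photo.
# The names come from the tag list above; the confidence values are invented.
concepts = [
    {"name": "coffee", "value": 0.99},
    {"name": "breakfast", "value": 0.98},
    {"name": "drink", "value": 0.97},
    {"name": "dawn", "value": 0.95},
    {"name": "cup", "value": 0.94},
    {"name": "espresso", "value": 0.93},
]


def top_tags(concepts, min_value=0.9):
    """Return concept names ordered by confidence, dropping weak predictions."""
    ranked = sorted(concepts, key=lambda c: c["value"], reverse=True)
    return [c["name"] for c in ranked if c["value"] >= min_value]


tags = top_tags(concepts)
```

Raising `min_value` trades coverage for precision: fewer tags survive, but the ones that do are less likely to mislead the description step.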

A description generated from the above tags might look something like this:

A cup of coffee for breakfast to wake up.

For us humans, a task of that kind is a breeze, but how can you teach a machine to do it? To understand how this might work, you will need to get a little familiar with dependency grammar, a word-based grammar that focuses on syntactic relations between individual words in a phrase or a sentence. We'll take a brief look at it in the next section, while continuing with the above example.

 

Employing syntactic dependencies in a sentence

At first glance, generating a meaningful phrase from a set of individual tags programmatically may seem like magic. The truth is that it can be quite challenging, but it is not impossible. Tools like spaCy, a leading open source library for NLP in Python, automatically determine the syntactic dependencies in a sentence being processed, allowing your script to find the syntactic neighbors of a word.
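To make "syntactic neighbors" concrete, here is a minimal sketch in plain Python. The parse below is annotated by hand for illustration and is not parser output; the `Token` type and the `syntactic_neighbors` helper are hypothetical names. In spaCy, the equivalent information is exposed on each token through `token.head`, `token.children`, and `token.dep_`.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the syntactic head; the root points to itself
    dep: str   # dependency label


# Hand-annotated parse of "A cup of coffee for breakfast" (an assumption).
SENTENCE = [
    Token("A", 1, "det"),
    Token("cup", 1, "ROOT"),
    Token("of", 1, "prep"),
    Token("coffee", 2, "pobj"),
    Token("for", 1, "prep"),
    Token("breakfast", 4, "pobj"),
]


def syntactic_neighbors(sentence, index):
    """Return the head (if any) and the children of the token at `index`."""
    token = sentence[index]
    neighbors = []
    if token.head != index:  # the root has no head other than itself
        neighbors.append(sentence[token.head].text)
    neighbors.extend(
        t.text for i, t in enumerate(sentence) if t.head == index and i != index
    )
    return neighbors


print(syntactic_neighbors(SENTENCE, 1))  # neighbors of "cup"
print(syntactic_neighbors(SENTENCE, 3))  # neighbors of "coffee"
```

Note that "cup" and "coffee" are not direct neighbors in this parse: they are linked through the preposition "of", which is why a search for related tags has to follow such two-step paths as well.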

With a large number of sentences available for analysis, you might search, for example, for phrases that include one or more words from the tag list you have for the photo, where these words are syntactic neighbors. The phrases can then be merged into a larger phrase, thus forming a meaningful utterance. This concept is illustrated in the figure below.

[Figure: proposition_opbject]
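The search-and-merge idea can be sketched as follows. This is one possible reading of the approach, not a definitive implementation: it assumes that two tags count as related when they are linked through a preposition, the two corpus sentences are invented and annotated by hand, and `tag_phrases`, `merge_phrases`, and `describe` are hypothetical helper names.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the syntactic head; the root points to itself
    dep: str   # dependency label


# A tiny hand-annotated corpus (an assumption, not parser output).
CORPUS = [
    # "She drank a cup of coffee"
    [Token("She", 1, "nsubj"), Token("drank", 1, "ROOT"), Token("a", 3, "det"),
     Token("cup", 1, "dobj"), Token("of", 3, "prep"), Token("coffee", 4, "pobj")],
    # "Coffee for breakfast is common"
    [Token("Coffee", 3, "nsubj"), Token("for", 0, "prep"),
     Token("breakfast", 1, "pobj"), Token("is", 3, "ROOT"),
     Token("common", 3, "acomp")],
]

TAGS = {"coffee", "breakfast", "drink", "dawn", "cup", "espresso"}


def tag_phrases(sentence, tags):
    """Yield (head, preposition, object) triples whose head and object are tags."""
    for obj in sentence:
        if obj.dep != "pobj":
            continue
        prep = sentence[obj.head]
        if prep.dep != "prep":
            continue
        head = sentence[prep.head]
        if head.text.lower() in tags and obj.text.lower() in tags:
            yield (head.text.lower(), prep.text.lower(), obj.text.lower())


def merge_phrases(phrases):
    """Chain phrases in which one ends with the word the next begins with."""
    phrases = list(phrases)
    if not phrases:
        return ""
    starts = {p[0]: p for p in phrases}
    ends = {p[-1] for p in phrases}
    # Start from a phrase whose first word does not end any other phrase.
    current = next((p for p in phrases if p[0] not in ends), phrases[0])
    words, seen = list(current), {current}
    while words[-1] in starts and starts[words[-1]] not in seen:
        following = starts[words[-1]]
        seen.add(following)
        words.extend(following[1:])
    return " ".join(words)


def describe(corpus, tags):
    """Collect tag phrases from every sentence and merge them into one phrase."""
    found = []
    for sentence in corpus:
        for phrase in tag_phrases(sentence, tags):
            if phrase not in found:
                found.append(phrase)
    return merge_phrases(found)


print(describe(CORPUS, TAGS))  # prints: cup of coffee for breakfast
```

With only two sentences the result is trivially determined. On a real corpus, many candidate phrases would compete, so some form of ranking (for example, by how often a phrase occurs) would be needed before merging.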

As you have no doubt realized, this approach needs a very large body of text to work well across a variety of photos. Here it is appropriate to recall a proverbial saying from the early 19th century: "Providence is always on the side of the big battalions." So you might wonder where to get that much text for this kind of analysis. One option is to build a collection of texts (a corpus) from Wikipedia articles, using a Wikipedia database dump.
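As a rough illustration of the first step in building such a corpus, the sketch below streams article titles and texts out of dump-style XML using only the standard library. The `SAMPLE` document is a tiny hand-written stand-in that imitates the page/title/revision/text layout of a dump, and `iter_articles` is a hypothetical helper. Real dumps contain wikitext markup that would still need cleaning, which this sketch does not attempt.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written stand-in imitating the layout of a Wikipedia XML dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Coffee</title>
    <revision><text>Coffee is a brewed drink.</text></revision>
  </page>
  <page>
    <title>Breakfast</title>
    <revision><text>Breakfast is the first meal of the day.</text></revision>
  </page>
</mediawiki>"""


def iter_articles(stream):
    """Yield (title, text) pairs one page at a time, without loading the whole file."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's subtree


for title, text in iter_articles(io.BytesIO(SAMPLE)):
    print(title, "->", text)
```

Streaming with `iterparse` matters here because full dumps are far too large to parse into memory at once; for a real dump you would pass an opened (and decompressed) file object instead of `io.BytesIO(SAMPLE)`.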

 

Conclusion

In this post, you looked at a simplified example of how Computer Vision and Natural Language Processing can be used together to generate image descriptions. If you're interested in more detailed explanations of this and many other NLP-related topics, you might want to check out this book: Natural Language Processing Using Python.