Welcome to our fourth blog post! Today we'll be looking at a potential model for our use-case, the Show and Tell model.

The Show and Tell model is a neural image caption generator- It is a deep neural network that learns how to describe the content of input images.

e.g.

Show and Tell Examples

 

The Show and Tell model can be broken down into two blocks: the encoder, and the decoder. The encoder is a CNN, which takes an image, performs convolutional operations on it, and outputs a vectorized representation of the input. This vector is then given to a natural language processing model, which converts it into a sentence in a language of your choice (The original paper uses English).

Show and Tell model

 

 

The encoder network is using the Inception v3 image recognition model pre-trained on the ILSVRC-2012-CLS image classification dataset. It is a deep convolutional neural network. It takes a 299x299x3 image as input, and gives 8x8x2048 output. It has 1000 output classes.

The decoder is a long short-term memory (LSTM) network. This type of network is commonly used for sequence modeling tasks such as language modeling and machine translation. In the Show and Tell model, the LSTM network is trained as a language model conditioned on the image encoding.

 

The Show and Tell repository can be found here. It has a deeper explanation of the model as well as instructions on how to download, train and run the model.

We trained the model over the COCO dataset for 1 million iterations before stopping. Here are the results on a few images:

 

Now, we are going to try some finetuning as well as increasing the number of iterations in the training sequence. We are positive that the captioning can only improve from here, and will keep you posted.

 

Thank you for reading, the next one will be up in no time!