The CVPR 2019 paper presented by PhD student Ali Biten at this year’s annual Computer Vision gathering in Long Beach, USA, has delved into the possibility of giving AI systems the ability to generate interpretations of images by using newspaper images with a caption. Results show that the field of image captioning, although full of new application possibilities, is still a hard nut to crack.
CVC researchers have been working on how to improve current image captioning systems, which automatically generate a text describing the content of pictures and photographs. As stated by Ali Biten, first author of the paper, “current systems are highly ineffective, performing merely at a descriptive level, essentially enumerating the objects in the scene and their relations”.
The paper, presented at this year’s Computer Vision and Pattern Recognition conference, was produced within the framework of two projects led by senior CVC researchers: the aBSINTHE project, led by Dr. Marçal Rossinyol, and the READS project, led by Dr. Dimosthenis Karatzas. The paper sets a new milestone towards effective context-driven, entity-aware captioning of news images. It proposes a novel captioning method that leverages contextual information to produce image captions that both describe and interpret the scene.
“We have proposed an end-to-end architecture in two phases that allows us to dynamically extend the output dictionary to out-of-vocabulary named entities which keep popping up in news articles,” states Biten. “That is, proper names, locations, dates or even prices; words that would not be compiled in your everyday pocket dictionary”. Furthermore, they have produced the GoodNews dataset, the largest news image captioning database yet, with more than 466,000 image-caption pairs, along with the corresponding metadata.
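The two-phase idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that the first phase outputs a template caption containing typed placeholders and that the second phase fills them with named entities found in the accompanying article. The tag names, the `fill_template` helper and the example entities are hypothetical and are not taken from the authors' code.

```python
import re

# Hypothetical placeholder tags; the real system's tag set may differ.
PLACEHOLDER = re.compile(r"\b(PERSON|ORG|GPE|DATE)_")


def fill_template(template, article_entities):
    """Replace each typed placeholder with the next unused entity of that type.

    article_entities maps a tag (e.g. "PERSON") to the entities found in the
    article, in the order in which they should be tried.
    """
    # Copy the lists so the caller's data is not modified.
    remaining = {tag: list(names) for tag, names in article_entities.items()}

    def substitute(match):
        tag = match.group(1)
        candidates = remaining.get(tag, [])
        # Keep the placeholder if the article offers no entity of this type.
        return candidates.pop(0) if candidates else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


# Invented template and entities, for illustration only.
template = "PERSON_ speaks to reporters in GPE_ on DATE_"
entities = {"PERSON": ["Jane Doe"], "GPE": ["Barcelona"]}
caption = fill_template(template, entities)
print(caption)
```

Because the entities come from the article rather than from a fixed vocabulary, a sketch like this can output names the captioning model never saw during training, which is the point Biten makes about out-of-vocabulary words.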
Information systems trained to see
Now, how do you teach a computer to understand an image? Dr. Dimosthenis Karatzas, co-author of the paper and Associate Director of the Computer Vision Center, shows a broad smile when asked. “That is a good question”, he says, which, of course, means “this is going to be a very long explanation”. Let us say the computer can “see” with cameras. Just as humans capture images with their eyes and process them with their brains, cameras take the pictures and computers then analyse them. For a computer, an image is just a set of pixels, which is the visual data that our laptops are supposed to understand.
“You have to tell the computer what it is supposed to see in each image”, explains Dr. Karatzas, “the technical term for that is to annotate”. “Image datasets are crucial; depending if you have a robust one or a poor one, the neural network will learn in an accurate way or will not learn very well at all”. When a neural network fails to learn properly, we end up with systems with clear biases. BBC’s article on ‘Racist AI’ compiles a set of features on the topic.
Furthermore, deep learning has revolutionized the way in which computer engineers teach information systems. Computer vision is the discipline that has both boosted and benefited the most from this revolutionary technique. With the use of neural networks, computers are now deciding what to extract from images in order to understand them.
However, deep learning has a downside; these refined neural networks are incredibly data hungry, needing a huge quantity of images in order to learn effectively. That’s why these new methods are performing really well in facial biometric applications, for example, and so poorly in medical imaging, where obtaining images is a cumbersome process.
“In our case, the important problem is not the lack of data, but the lack of an evaluation method: how do you know if a generated caption is ‘correct’ or not? With newspaper images, the only way we actually have is to compare it to what the journalist wrote”, states Biten.
This method, as pointed out by Dr. Karatzas, is “highly restrictive” as “different humans would also give very different captions for the same image”. In the paper, they also evaluated performance by asking human evaluators to judge whether the captions were plausible or not. The result: humans could not tell which caption was artificially generated and which was written by a human in 53% of the cases.
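A small sketch shows why comparing against a single journalist's caption is so restrictive. It uses a simple word-overlap F1 score, chosen here only for illustration and not necessarily one of the metrics the authors used. The example captions are invented.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 score over the word overlap between a candidate and a reference caption."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection: a shared word counts as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented captions: one reference, one exact copy, one reasonable paraphrase.
reference = "protesters gather outside the parliament building"
exact_copy = "protesters gather outside the parliament building"
paraphrase = "a crowd demonstrates in front of parliament"

print(unigram_f1(exact_copy, reference))
print(unigram_f1(paraphrase, reference))
```

The paraphrase describes the same scene, yet it shares only one word with the reference and scores far lower than the exact copy. A human judge would likely accept both, which is why the human evaluation complements the automatic comparison.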
Therefore, the delivery of a dataset such as the one proposed in the current paper (GoodNews) is a huge step forward. What’s more, the paper also contributes a model that produces contextualised captions by distributing its attention between the image and the context.
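How a model might distribute its attention between the image and the context can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' architecture: it assumes image regions and article sentences are already encoded as small feature vectors, and it scores them jointly against a single query vector with a softmax. The `attend` helper and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, image_feats, text_feats):
    """Weight image regions and article sentences jointly against one query."""
    feats = image_feats + text_feats
    weights = softmax([dot(query, f) for f in feats])
    image_share = sum(weights[:len(image_feats)])
    text_share = sum(weights[len(image_feats):])
    return weights, image_share, text_share


# Toy features: two image regions and two article sentences (invented values).
query = [1.0, 0.0]
image_feats = [[0.2, 0.9], [0.1, 0.5]]
text_feats = [[1.5, 0.1], [0.3, 0.3]]

weights, image_share, text_share = attend(query, image_feats, text_feats)
print(image_share, text_share)
```

With these toy values the query aligns best with the first sentence, so most of the attention goes to the article text rather than to the image regions.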
Neural networks, algorithms that think (or, to be more accurate, process)
A neural network is a set of algorithms that perform a given task, in this case image captioning. Algorithms need an input (the image), a set of instructions (plenty of maths) and an output: a concept, an answer to what we have asked. In Ali Biten’s paper, the goal was for neural networks to describe a vast dataset of newspaper images by interpreting their semantic content. This means the network can not only describe what it sees but, in the future, will also be able to relate it to images in other articles containing similar, but different, concepts.
“Let me give you an example”, says Dr. Rossinyol, principal investigator of the aBSINTHE project, funded by the BBVA Foundation. “I might be looking for images of the employment crisis that hit Spain back in 2008. A normal system will retrieve images that have been classified under the concept ‘crisis’ and will give us pictures of people in the streets queuing to get into the state’s job centre, or student strikes asking for better pay rates”. “But”, he adds, “it won’t give you other images such as evictions, and they were, sadly, very common during 2008. Any person, when thinking of the 2008 crisis in Spain, will most definitely remember images of people being evicted from their homes. Well, we need the neural network to make that association too. Train it to be able to relate these sets of pictures”.
“We understand scenes by building models and employing them to compose stories that explain our perceptual observations”, states Ali Biten. “This capacity of humans is associated with intelligent behaviour”. However, he continues, “Computers can at best perform at the description level and fail to integrate any prior world knowledge in the captions that they produce”. The task tackled in this CVPR study is challenging, but the work has brought scientists a step closer to producing image captions that offer plausible interpretations of scenes by integrating contextual information.
Currently available image captioning datasets are not fit for developing captioning models with the characteristics previously mentioned. “Current systems provide generic, dry, repetitive and non-contextualized captions”, states Biten. For the task at hand, Biten and colleagues decided to use images illustrating newspaper articles. The reason: the descriptions of the pictures provided by journalists and the contextual information (the texts and accompanying articles) are easily accessible and can be collected with reasonable effort.
“Newspapers are an excellent domain for moving towards human-like captions, as they provide readily available contextual information that can be modelled and exploited”, explains Dr. Rossinyol. To this end, CVC researchers decided to put together GoodNews.
“Remember when we talked about the importance of annotating?” asks Dr. Karatzas whilst talking about the article. “Well, news image pictures already give us that annotation, without an extra cost or effort on our part”.
“We haven’t solved the issue. That is not what we have proposed here. We have presented a new captioning method that aims to take us a step closer to producing captions that offer a plausible interpretation of the scene, and applied it to the particular case of news image captioning”, summarizes Ali Biten. As CVC researchers see it, they have advanced the field of image captioning within computer vision a little further, whilst releasing the largest news image captioning dataset to date. GoodNews will help foster the science of computer vision by providing researchers worldwide with a useful, highly reliable and smart tool. It’s open source too! If that isn’t good news, what is?
A. Biten, L. Gómez, M. Rusiñol, D. Karatzas (2019): Good News, Everyone! Context driven entity-aware captioning for news images
Project funded by Fundación BBVA