CVC News

Defined by the looks: when text meets visual information

Credit: Freepik

The CVC project ‘Beyond Word spotting: visual context in support of open vocabulary scene text recognition’, led by Dr Dimosthenis Karatzas and Dr Andrew Bagdanov, has won a 2016 Google Research Award. It aims to give computers improved reading abilities by teaching them to read text in images while taking into account the visual information those images contain.

Google Research Awards are highly prestigious awards that support projects in artificial intelligence and machine perception, among other topics, by financing a year of work in these areas. According to data provided by Google, in this edition it received more than 876 applications from over 44 countries and 300 universities and ultimately funded only 143 projects, mainly focused on machine learning, machine perception, networks and systems.

The proposed project has a clear objective: to give computers a way to comprehend text and visual context jointly within the same image. Currently, computers can recognise text on the one hand and visual information on the other, but only separately, and they don’t always do so correctly. That’s why researchers are combining modalities (text and visual information in this case). Dr Karatzas and Dr Bagdanov want to convert both domains to a common language and so give machines the ability to analyse both elements together, helping computers to recognise the images they are presented with more accurately and efficiently. Textual information then acts as context for interpreting visual information, and vice versa.

“Let’s take an example”, explains Dr Karatzas. “If you see the image of a yellow post box, you can easily guess that what is written on top is ‘POST’ or ‘MAIL’. In this case, the visual content provides context for recognising the text in the image. Similarly, if you see a shop front, the textual content of the shop sign above can provide useful context for visual understanding”. In fact, Google researchers recently discovered that a classifier trained to distinguish different businesses in images ends up learning how to read, as reading is a key way to perform this task.

More specifically, Dr Karatzas and Dr Bagdanov set out two challenges in their Google Award proposal. The first is to generate contextualised dictionaries based only on visual scene information. What does this mean? We are talking here about actual dictionaries (with their words and meanings). When faced with an image (such as Figure 1), the computer can, by analysing the visual information, choose the contextualised dictionary that will help it match the word featured (‘trattoria’).

A traditional trattoria image
Figure 1

Imagine that the computer has to work with the Oxford Dictionary, with more than 220,000 words. It will be difficult for it to find a match, as it will have to sift through thousands of similarly patterned words. But imagine now that the computer, by analysing the visual information present in the image (and ignoring the word ‘trattoria’), knows that what it is looking at is a restaurant. How? Because there are tables with typical patterned tablecloths, people and families seated with what seems to be food and drinks, a waiter, and so on. It will then go to a sub-dictionary titled ‘Restaurants’ (which would contain all the words related to this topic) and will have much less trouble finding ‘trattoria’.
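The idea can be illustrated with a few lines of Python. This is only a toy sketch and not the researchers' method: the word lists, the scene labels and the `recognise` function are all invented for illustration, and fuzzy string matching from the standard library stands in for a real text recogniser. The scene label is assumed to come from a separate visual classifier.

```python
from difflib import get_close_matches

# Hypothetical general-purpose word list (a real one would hold
# hundreds of thousands of entries).
GENERAL_DICTIONARY = ["past", "pest", "post", "mail", "tractor",
                      "traitor", "trattoria", "restaurant", "menu"]

# Hypothetical sub-dictionaries, one per visual scene category.
CONTEXT_DICTIONARIES = {
    "restaurant": ["trattoria", "pizzeria", "menu", "restaurant", "cafe"],
    "post_box": ["post", "mail", "letters"],
}


def recognise(noisy_reading, scene_label=None):
    """Return the dictionary word closest to a noisy character reading.

    If the scene label is known, only the matching sub-dictionary is
    searched; otherwise the whole general dictionary is used.
    """
    dictionary = CONTEXT_DICTIONARIES.get(scene_label, GENERAL_DICTIONARY)
    matches = get_close_matches(noisy_reading, dictionary, n=1, cutoff=0.0)
    return matches[0] if matches else None


# Without visual context, the misreading "pest" is accepted as a valid word.
print(recognise("pest"))                                    # pest
# Knowing the scene shows a post box steers the match to "post".
print(recognise("pest", scene_label="post_box"))            # post
# A slightly misread shop sign is resolved within the restaurant list.
print(recognise("trattorla", scene_label="restaurant"))     # trattoria
```

In this sketch the visual context does two things: it shrinks the list of candidate words, and it rules out words that are valid in general but implausible in the scene.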

Projects such as this will help, in the near future, to make computers more effective at comprehending the scenes they are presented with, in both video and photography, in real time. Adequate image analysis will not only help improve popular applications such as Street View or Google Maps, but can also be highly useful for surveillance, localisation and monitoring in outdoor settings. It will most certainly make our daily lives easier: helping blind people who cannot read the text around them, tourists who want to translate words in the street, or drivers who can rely on cars that understand street signs.




Image credit: Created by Mrsiraphol –

Alexandra Canet
