The CVC project ‘Beyond Word spotting: visual context in support of open vocabulary scene text recognition’, led by Dr Dimosthenis Karatzas and Dr Andrew Bagdanov, has won a Google Research Award 2016. The project aims to improve computers’ reading abilities by teaching them to read text in images while taking into account the visual information contained in the same image.
Google Research Awards are highly prestigious awards that finance a year of work on projects related to artificial intelligence and machine perception. According to data provided by Google, this year’s edition received more than 876 applications from over 44 countries and 300 universities, and funded a total of 143 projects, mainly focused on machine learning, machine perception, networks and systems.
The project proposed by Dr. Karatzas and Dr. Bagdanov aims to give computers a way to comprehend text and visual context jointly within the same image. To this day, computers recognise text on the one hand and visual information on the other, separately. However, these separate systems don’t always perform optimally, which is why researchers are combining modalities (text and visual information in this case). Dr. Karatzas and Dr. Bagdanov want to convert both domains to a common language and so give machines the ability to analyse both elements jointly, helping computers to recognise the images they are presented with more accurately and efficiently. Textual information then acts as context for interpreting visual information, and vice versa.
“Let’s take an example”, explains Dr Karatzas. “If you see the image of a yellow post box you can easily guess that what is written on top is ‘POST’ or ‘MAIL’; in this case, the visual content provides context for recognising the text in the image. Similarly, if you see a shop front, the textual content of the shop sign above can provide useful context for visual understanding”. In fact, Google researchers recently discovered that a classifier trained to distinguish different businesses in images ends up learning how to read, as reading is a key way to perform this task.
More specifically, Dr. Karatzas and Dr. Bagdanov set out two challenges in their Google Award proposal. The first is to generate contextualised dictionaries based only on visual scene information. What does this mean? We are talking here about actual dictionaries (with their words and meanings). When faced with an image (such as figure 1), the computer can analyse the visual information and choose the contextualised dictionary that will help it match the word featured (‘trattoria’).
Imagine that the computer only has an Oxford dictionary with more than 220,000 words. It will be difficult for it to find a match, as it will have to sift through thousands of similarly patterned words. But imagine instead that the computer, by analysing the visual information present in the image (and ignoring the word ‘trattoria’), knows that what it is looking at is a restaurant. How? Because there are tables with typical patterned tablecloths, people and families seated with what seems to be food and drinks, a waiter, and so on. It will then go to a sub-dictionary titled ‘Restaurants’ (containing all the words related to this topic) and will have much less trouble finding ‘trattoria’.
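The idea described above can be illustrated with a minimal Python sketch. It is not the researchers’ method: the scene labels, the word lists and the function name recognise_word are all hypothetical, the scene label is assumed to come from some separate visual classifier, and a simple string-similarity match from Python’s standard difflib module stands in for a real text recogniser.

```python
from difflib import get_close_matches

# Hypothetical sub-dictionaries keyed by scene type (illustrative words only).
CONTEXT_LEXICONS = {
    "restaurant": ["trattoria", "pizzeria", "menu", "bistro", "cafe"],
    "post_office": ["post", "mail", "parcel", "stamps"],
}

# Stand-in for a large general dictionary: every contextual word plus
# a few similar-looking distractors.
FULL_LEXICON = sorted(
    {word for words in CONTEXT_LEXICONS.values() for word in words}
    | {"tractor", "traitor", "territory"}
)


def recognise_word(noisy_reading, scene_label=None, cutoff=0.6):
    """Return the lexicon word closest to a noisy reading, or None.

    scene_label is assumed to be produced by a separate visual scene
    classifier; when it is missing or unknown, the full lexicon is used.
    """
    lexicon = CONTEXT_LEXICONS.get(scene_label, FULL_LEXICON)
    matches = get_close_matches(noisy_reading.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # A badly read sign, with and without visual context.
    print(recognise_word("trator"))
    print(recognise_word("trator", scene_label="restaurant"))
```

In this toy setting, the badly read string ‘trator’ is closer to distractors such as ‘tractor’ or ‘traitor’ than to ‘trattoria’ when the whole word list is searched, while restricting the search to the hypothetical ‘restaurant’ list leaves ‘trattoria’ as the best match.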
Projects such as this one will help, in the near future, to make computers more effective at comprehending the scenes they are presented with, both in video and in photography, in real time. An adequate analysis of images will not only help improve popular applications such as Street View or Google Maps, but can also be highly useful for surveillance, localization and monitoring in outdoor settings. It will most certainly make our daily lives easier, helping blind people who cannot read, tourists who want to translate words in the street, or drivers who can rely on cars that understand street signs.
Image credit: Created by Mrsiraphol – Freepik.com