Document Visual Question Answering: more than understanding a document

CVC researchers, in collaboration with the Centre for Visual Information Technology (CVIT), IIITH, and with support from an Amazon AWS Machine Learning Research Award, have been organising a series of projects and challenges to advance document understanding. The long-term aim is to push for a paradigm shift in the document analysis and recognition community and to come up with new ways of doing things.

Written information is all around us, and performing most everyday tasks requires us to read and understand this kind of information. But although computers have proven to be good with digital text, they still struggle when it comes to reading information from an image, such as a scanned document or a picture of a sign in the street. In these cases, computers cannot readily access the information. That is why, for instance, spammers embed spam text in images in order to circumvent spam filters.

Focusing on the case of documents, there has been a strong research effort on improving machine understanding of these invaluable sources of information. Document image analysis is one of the oldest application fields of artificial intelligence, and it combines two important cognitive aspects: vision and language understanding. However, until now, most efforts have focused on creating models that extract information from images in a ‘bottom-up’ approach. This is the case of the well-known Optical Character Recognition (OCR) engines, which are very useful for recognising text in any form (typed, printed or handwritten), detecting tables and diagrams, or extracting handwritten fields from pre-defined forms, but which are disconnected from the final use of this information.

Written communication requires much more than extracting and interpreting the textual content. Documents usually include all kinds of visual elements, such as symbols, marks, separators, diagrams, connectors drawn between elements, page structure, forms, the different colors and fonts used, highlighting, and so on. This sort of non-textual communication can provide the information necessary to understand the document in its global context.

In short, these kinds of models focus only on the ‘conversion’ of documents into digital formats, rather than really ‘understanding’ the message contained in them. In addition, they are designed to work offline, with no interaction with humans required.

Document Visual Question Answering: A Higher Understanding Beyond Recognition

With support from an Amazon AWS Machine Learning Research Award, researchers from the Computer Vision Center (CVC) and the Centre for Visual Information Technology (CVIT), IIITH, started a research collaboration to go further in the field of document understanding.

Known as Document Visual Question Answering (DocVQA), the research focuses on initiating a dialogue with different forms of written text, such as a document, a book, an annual report or a comic strip, and on guiding machines to understand human requests and respond to them appropriately, eventually in real time.

“More than a set of challenges and datasets, the long-term aim is to push for a change of paradigm in the document analysis and recognition community, and hopefully to come up with new ways of doing things, with methods that condition the information extraction on the high-level task defined by the user in the form of a natural language question, maintaining a human-friendly interface”, explains Dr. Dimosthenis Karatzas, CVC Associate Director and principal investigator of this project.

DocVQA Challenge series

The DocVQA Challenge series was born as a result of the first year of work in this research. To date, the researchers have set up three challenges, looking at increasingly difficult facets of the problem. “We started by defining a large-scale, broad dataset of 12,000 documents along with 50,000 question-and-answer pairs. Then we moved to asking questions over a set of documents, a whole collection. Finally, we are currently working on a very challenging case: infographics, where textual information is intrinsically linked with graphical elements to create complex layouts that tell a story based on data”, states Dr. Karatzas.
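
To illustrate what this kind of dataset typically looks like, the short sketch below shows a hypothetical annotation record pairing a document image with a question and its accepted answers. The field names and structure are assumptions made for illustration only and do not reflect the official DocVQA annotation format.

# Hypothetical sketch of a single DocVQA-style annotation record.
# Field names and structure are illustrative assumptions, not the
# official DocVQA dataset schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DocVQARecord:
    question_id: int   # unique identifier for the question
    image_path: str    # path to the scanned document image
    question: str      # natural language question about the document
    answers: List[str] = field(default_factory=list)  # accepted ground-truth answers

# Example record: a question asked over a scanned invoice page.
example = DocVQARecord(
    question_id=1,
    image_path="documents/invoice_page_01.png",
    question="What is the total amount listed in the invoice?",
    answers=["$1,250.00", "1,250.00"],
)

print(example.question, "->", example.answers[0])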

Furthermore, the DocVQA web portal is quickly becoming the de facto benchmark for this task, and researchers use it daily to evaluate new ideas, models and methods: “To date, we have evaluated around 1,300 submissions to the first two challenges, out of which more than 60 have been made public by their authors and feature in the ranking tables”, points out Dr. Dimosthenis Karatzas.

First Workshop On DocVQA At ICDAR 2021

In the context of the 16th International Conference on Document Analysis and Recognition (ICDAR 2021), the researchers involved in this project will organise the first workshop on DocVQA. This workshop aims to create a space to discuss the DocVQA paradigm and the results of the ICDAR 2021 long-term challenge on DocVQA. DocVQA 2021 comes after the successful organisation of the Document Visual Question Answering (DocVQA) challenge as part of the “Text and Documents in the Deep Learning Era” Workshop at CVPR 2020. The workshop will be held on September 6th and will feature top speakers such as Amanpreet Singh (Facebook), Dr. Yijuan Lu (Microsoft), and Dr. Brian Price (Adobe).