Digitus II: Releasing the content locked in manuscripts

If all the historical documents stored in Catalan archives were laid out in a straight line starting in Barcelona, the line would reach Paris. The distance between the two cities is about 840 km, which matches the length of the physical holdings, ranging from the 9th century to the present, that are stored, preserved and managed by the 323 historical archives of Catalonia, according to statistics from the Department of Culture of the Catalan Government. That amounts to a vast number of handwritten documents, which are an undeniable legacy of the history of our society.

These documents include all kinds of historical resources, such as censuses, marriage licences, civil registries and literary works, among others, which can provide interesting and valuable information about past times. Unfortunately, these resources have been locked away, and almost forgotten, in archives for many years.

Nevertheless, archives have made huge efforts over the last decade to digitize these cultural assets. In 2011 it was estimated that more than 200 terabytes of digital documentation were available through online portals. Despite this important advance, CVC researcher Dr. Marçal Rossinyol explained that the contents are not as accessible as they seem: “Although there is a huge amount of digitized documents, they consist of images that are only accessible through manual page browsing or metadata searching. That means that it is extremely difficult to manage them when you want to search for specific content”.

To solve this problem, CVC researchers started a project with the main aim of creating a search engine for handwritten texts. The project, led by Dr. Rossinyol, is called Digitus II and is still at a low technology readiness level. It received funding from the European Regional Development Fund and, with this grant, the team wants to carry out two main tasks: “Firstly, we want to improve the method of handwritten text recognition and, secondly, we want to analyse the market in order to discover how to make this technology economically viable”, stated Dr. Rossinyol.

The method used by the researchers combines handwritten text recognition with information retrieval techniques: “Optical Character Recognition (OCR) works satisfactorily with typed texts, but this is not the case with handwritten documents because of the variability in handwriting styles”, clarified Dr. Rossinyol. For this reason, they have opted to add an intelligent search layer so that the system can match similar words even when they have been recognized imperfectly. With this, they aim to obtain an efficient, scalable and affordable way of automatically transforming handwritten document images into searchable content.
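To make the idea of a tolerant search layer concrete, here is a minimal sketch in Python. It is not the Digitus II implementation, which the article does not describe in detail. It assumes a hypothetical recognizer has already produced a list of words per page, some of them misread, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever similarity measure the real system applies. The page identifiers, the sample words and the 0.75 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] describing how alike two words are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query: str, pages: dict, threshold: float = 0.75) -> list:
    """Return (page_id, word, score) hits whose recognized word resembles the query.

    The threshold is an illustrative assumption: lowering it tolerates more
    recognition errors but also returns more false matches.
    """
    hits = []
    for page_id, words in pages.items():
        for word in words:
            score = similarity(query, word)
            if score >= threshold:
                hits.append((page_id, word, score))
    # Best matches first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical recognizer output. On page_002 the letter "m" has been
# misread as "rn", a typical confusion in handwritten text.
pages = {
    "page_001": ["marriage", "licence", "Barcelona"],
    "page_002": ["rnarriage", "register", "1742"],
    "page_003": ["census", "parish", "Girona"],
}

hits = search("marriage", pages)
for page_id, word, score in hits:
    print(page_id, word, round(score, 2))
```

An exact-match search for "marriage" would miss page_002 entirely, whereas the similarity threshold still retrieves it, ranked below the exact hit. A production system would also need an index rather than a linear scan to stay scalable over millions of page images.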

This approach is a strong alternative to manual transcription, which is the method that has been used so far to make these documents searchable. In reality, however, it would be impossible to transcribe all the documents manually: it would take too many years and too many human and economic resources. Because of this, the majority of documents are not transcribed, as they are not considered important enough to justify so many resources. As Dr. Rossinyol said: “If all these documents can be processed automatically, despite all the limitations they may experience, it can be a good way to open them to society. If this is not done, it is impossible for people to access this information”.

This project has been co-funded by the European Union through its European Regional Development Fund programme.