Place: Large Lecture Room, CVC
Affiliation: Dipartimento di Sistemi e Informatica, University of Florence, Italy
In recent years, interest in e-book readers has grown, following the growth in sales of electronic books. Two main document formats are accepted by most devices: PDF and ePub. The PDF format is widely used to share documents because it is readable across platforms. However, it is not ideal for comfortable reading on small screens. By contrast, the ePub format is reflowable and well suited to e-book readers.
In this talk we analyze the challenges and opportunities that these devices and document formats present for Document Analysis research. In particular, we first describe the main features of dedicated e-book readers and of the ePub format. Afterwards, we analyze a system that we developed for the conversion of PDF books to ePub. In this system we invert the text formatting applied during pagination. To this end, layout analysis techniques are applied at the book level in order to identify the book’s table of contents and the main functional areas of the book, such as chapters, paragraphs, and notes.
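As a rough illustration of what inverting pagination formatting can involve, the sketch below merges hard-wrapped lines back into paragraphs and rejoins words hyphenated at line ends. It is not the system presented in the talk: the function name `reflow`, the blank-line paragraph rule, and the lowercase-continuation heuristic for hyphens are all assumptions made for this example.

```python
def reflow(lines):
    """Merge hard-wrapped lines into paragraphs (illustrative sketch only).

    Assumptions, not taken from the talk:
    - a blank line marks a paragraph boundary;
    - a trailing hyphen followed by a lowercase letter on the next line
      is a word split by the paginator and is rejoined without the hyphen.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the paragraph collected so far, if any.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-") and line[0].islower():
            # Rejoin a word that was hyphenated across a line break.
            current = current[:-1] + line
        elif current:
            # Ordinary line wrap: join with a single space.
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    page = ["The pagi-", "nation step breaks", "lines.", "",
            "A new para-", "graph starts here."]
    for paragraph in reflow(page):
        print(paragraph)
```

A real converter would need further evidence, such as font, indentation, and position on the page, to separate paragraphs, headings, and notes; this heuristic also wrongly merges genuinely hyphenated compounds that happen to break at a line end.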
In the last part we will address ongoing research on the conversion of scientific and technical documents, which is more difficult. In particular, the presence of mathematical equations, tables, and illustrations in multi-column layouts requires the integration of document analysis techniques with information extraction algorithms. We describe the features of a system designed to perform this conversion for technical papers.