The Challenge is set up around three tasks:
- Text Localization, where the objective is to obtain a rough estimation of the text areas in the image, in terms of bounding boxes that correspond to parts of text (words or text lines).
- Text Segmentation, where the objective is the pixel level separation of text from the background.
- Word Recognition, where the locations (bounding boxes) of words in the image are assumed to be known and the corresponding text transcriptions are sought.
A training set of 420 images (containing 3583 words) is provided through the downloads section. The training set is common for all three competitions, although different ground truth data is provided for each of the tasks. In a similar fashion, a test set of 102 images (containing 918 words) will be used for evaluation.
All images are provided as PNG files and the text files are ASCII files with CR/LF new line endings.
For the text localization task we provide bounding boxes of words for each of the images. The ground truth is given as separate text files (one per image) where each line specifies the coordinates of one word's bounding box and its transcription in a comma separated format (see Figure 1).
For each image in the training set, a separate ASCII text file is provided, following the naming convention:
The text files are comma separated files, where each line corresponds to one word in the image and gives its bounding box coordinates and its transcription in the format:
left, top, right, bottom, "transcription"
Please note that the escape character (\) is used for double quotes and backslashes (see for example img_4 in Figure 1).
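As an illustration of the format above, the following is a minimal Python sketch of a parser for one ground truth line. It assumes that the four coordinates are integers and that the only escapes are a backslash before a double quote or another backslash; the function names `parse_gt_line` and `unescape` are hypothetical and not part of any official toolkit.

```python
import re

# One ground truth line: four coordinates, then a double-quoted transcription.
# Assumption: coordinates are integers; trailing CR/LF is tolerated by \s*.
LINE_RE = re.compile(
    r'\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*"(.*)"\s*$'
)


def unescape(text):
    """Undo the backslash escaping: \\" becomes " and \\\\ becomes \\."""
    return re.sub(r'\\(.)', r'\1', text)


def parse_gt_line(line):
    """Return (left, top, right, bottom, transcription) for one line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised ground truth line: {line!r}")
    left, top, right, bottom = (int(match.group(i)) for i in range(1, 5))
    return left, top, right, bottom, unescape(match.group(5))
```

Matching the transcription greedily up to the last double quote on the line keeps commas and escaped quotes inside the transcription intact, which a plain split on commas would not.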
The authors will be required to automatically localise the text in the images and return bounding boxes. The results will have to be submitted in separate text files for each image, with each line corresponding to a bounding box (comma separated values) as per the above format. A single compressed (zip or rar) file should be submitted containing all the result files. In the case that your method fails to produce any results for an image, you can either include an empty result file or no file at all.
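A submission archive of this kind could be assembled as in the sketch below. This is only an illustration under stated assumptions: each result line is written as the four coordinates alone, lines end with CR/LF like the ground truth files, and the result file names are chosen by the caller (the names used in any call are placeholders, since the required naming convention is not reproduced here). The helpers `format_box` and `build_submission` are hypothetical.

```python
import io
import zipfile


def format_box(box):
    """Format one bounding box as a comma separated line (no line ending)."""
    left, top, right, bottom = box
    return f"{left}, {top}, {right}, {bottom}"


def build_submission(results):
    """Pack per-image results into a zip archive and return its bytes.

    `results` maps a result file name (caller-supplied placeholder) to a
    list of (left, top, right, bottom) tuples. An empty list produces an
    empty result file, which is one of the two permitted options when a
    method finds nothing in an image.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, boxes in results.items():
            body = "".join(format_box(box) + "\r\n" for box in boxes)
            archive.writestr(file_name, body)
    return buffer.getvalue()
```

The returned bytes can be written to disk with an ordinary binary file write to obtain the single compressed file to upload.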
The evaluation of the results will be based on the algorithm of Wolf and Jolion, which in turn is an improvement on the algorithms used in the robust reading competitions in previous ICDAR instalments.
For the text segmentation task, the ground truth data is provided in the form of bi-level PNG images following the naming convention:
In the ground truth images, white pixels should be interpreted as text pixels, while black pixels as background (see Figure 2).
The authors will be asked to automatically segment the test images and submit their segmentation result as a series of bi-level images, following the same format. A single compressed (zip or rar) file should be submitted containing all the result files. In the case that your method fails to produce any results for an image, you can either include an empty result file or no file at all.
Evaluation will be primarily based on the methodology proposed by the organisers in the paper by Clavelli et al., while a typical precision / recall measurement will also be provided for consistency, in the same fashion as Ntirogiannis et al.
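For orientation, a plain pixel-level precision / recall measurement can be sketched as follows. This is not the primary evaluation methodology and is not claimed to match the organisers' exact implementation; it assumes both masks are already loaded as equally sized nested lists in which a truthy value marks a text (white) pixel. The function name `pixel_precision_recall` is hypothetical.

```python
def pixel_precision_recall(result, truth):
    """Return (precision, recall) of a bi-level result against ground truth.

    Both arguments are lists of rows; a truthy entry is a text pixel.
    Assumption: the masks have identical dimensions.
    """
    true_pos = false_pos = false_neg = 0
    for result_row, truth_row in zip(result, truth):
        for result_pixel, truth_pixel in zip(result_row, truth_row):
            if result_pixel and truth_pixel:
                true_pos += 1          # text correctly marked as text
            elif result_pixel:
                false_pos += 1         # background marked as text
            elif truth_pixel:
                false_neg += 1         # text missed by the method
    # Guard against empty denominators (e.g. an all-background result).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Precision is the fraction of pixels labelled as text that really are text, and recall is the fraction of ground truth text pixels that were found.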
For the word recognition task, we provide all the words in our dataset with 3 characters or more (3583 words) in separate image files, along with the corresponding ground truth transcription (see Figure 2 for examples). The transcription of all words is provided in a SINGLE text file for the whole collection. Each line in the ground truth file has the following format:
[image name], "transcription"
An example is given in Figure 3. Please note that the escape character (\) is used for double quotes and backslashes (see for example the transcriptions of 15.png and 20.png in Figure 3).
For testing we will provide the images of about 400 words and we will ask for the transcription of each image. A single transcription per image will be requested. The authors should return all result transcriptions in a single text file of the same format as the ground truth.
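Producing a result file in the same format mainly requires applying the escaping rule in the right order. The sketch below is one possible way to do it; it assumes CR/LF line endings as in the ground truth files, and the helpers `escape` and `format_results` as well as the sample transcriptions used with them are made up for illustration.

```python
def escape(text):
    """Escape backslashes and double quotes with a backslash.

    Backslashes are handled first so that the backslashes added in front
    of double quotes are not escaped a second time.
    """
    return text.replace("\\", "\\\\").replace('"', '\\"')


def format_results(transcriptions):
    """Render a mapping {image name: transcription} as result file text."""
    lines = [
        f'{name}, "{escape(text)}"' for name, text in transcriptions.items()
    ]
    # Assumption: CR/LF line endings, matching the provided text files.
    return "".join(line + "\r\n" for line in lines)
```

For example, a recognised string containing a double quote is written with `\"` inside the surrounding quotes, so that the line can still be parsed unambiguously.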
For the evaluation we will calculate the edit distance between each submitted transcription and the corresponding ground truth transcription. Equal weights will be set for all edit operations. The best performing method will be the one with the smallest total edit distance.
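The measure described here corresponds to the standard Levenshtein distance with unit cost for insertion, deletion and substitution. The following sketch shows how a total score could be computed; it assumes a case-sensitive comparison and treats a missing result as an empty string, neither of which is stated above. The names `edit_distance` and `total_edit_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for every edit operation."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def total_edit_distance(results, truth):
    """Sum the distances over all ground truth words.

    Both arguments map image names to transcriptions. Assumption: an image
    with no submitted result is scored against the empty string.
    """
    return sum(
        edit_distance(results.get(name, ""), text)
        for name, text in truth.items()
    )
```

Keeping only two rows of the dynamic programming table is sufficient because each cell depends only on the current and the previous row.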
Note that words are cut-out with a frame of 4 pixels around them (instead of the tight bounding box), in order to preserve the immediate context. This is usual practice to facilitate processing (see for example the MNIST character dataset).
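The relation between a tight bounding box and the cut-out region can be sketched as below. The clamping to the image borders and the inclusive pixel coordinates are assumptions made for the illustration; how words lying closer than 4 pixels to the image edge were actually handled is not specified. The function name `padded_box` is hypothetical.

```python
def padded_box(left, top, right, bottom, image_width, image_height, frame=4):
    """Expand a tight box by `frame` pixels on every side.

    Assumptions: coordinates are inclusive pixel indices, and the expanded
    box is clamped so that it never leaves the image.
    """
    return (
        max(left - frame, 0),
        max(top - frame, 0),
        min(right + frame, image_width - 1),
        min(bottom + frame, image_height - 1),
    )
```

Under these assumptions, the original tight box can be recovered from a cut-out by removing the same frame, except where clamping was applied.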
- C. Wolf and J.M. Jolion, "Object Count / Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms", International Journal on Document Analysis and Recognition, vol. 8, no. 4, pp. 280-296, 2006.
- A. Clavelli, D. Karatzas, and J. Llados, "A Framework for the Assessment of Text Extraction Algorithms on Complex Colour Images", in Proceedings of the 9th IAPR Workshop on Document Analysis Systems, Boston, MA, 2010, pp. 19-28.
- K. Ntirogiannis, B. Gatos, and I. Pratikakis, "An Objective Methodology for Document Image Binarization Techniques", in Proceedings of the 8th International Workshop on Document Analysis Systems, Nara, Japan, 2008, pp. 217-224.