Color Document Image Binarization

Sun 21 May 2017

This was my fourth-year engineering research project, and it is detailed in my undergraduate thesis (pdf).

A paper detailing this work was accepted for publication at the International Conference on Signal Processing and Integrated Networks (SPIN), 2016. It can be viewed here: SPIN 2016

Introduction

Image binarization is the process of converting a grayscale or color document image into a binary (two-level) image in such a way that the desired details of the original image are preserved.

What we aimed to achieve: A survey of existing solutions showed that they tend to be tailored to specific document types and to work less well on input image types for which they were not designed. We aimed to devise a solution that performs more consistently than existing solutions across a variety of input image types.

Why it is an important problem: Most document analysis tasks, such as the classification of text from non-text, the segmentation of characters, words and lines, and recognition, require that the input image be binary. The performance of these higher-level tasks therefore depends on the quality of the initial binarization, which makes binarization an important task.

Scope

The project involved the study of document image binarization schemes, the implementation of binarization schemes from the existing literature, the development of a binarization method for printed color document images, and the evaluation of the proposed scheme using standard evaluation metrics as well as a system-level evaluation using the Tesseract OCR engine.

Solutions Implemented

Existing solutions: From the literature, we implemented the methods of Kuo et al. and Su et al. In addition to these more complex solutions, we implemented the simpler classic methods of Niblack, Sauvola and Wolf.
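To give a feel for the classic local methods, here is a minimal sketch of the Niblack and Sauvola thresholds in plain Python. It is not the C/OpenCV code from this project. The window radius, the values k = -0.2 and k = 0.5, the dynamic range R = 128 and all function names are assumptions taken from commonly quoted defaults, and the image is a plain list of lists of gray levels.

```python
import math


def local_stats(image, row, col, radius):
    """Mean and standard deviation of the window centred on (row, col).

    The window is clipped at the image border.
    """
    rows = range(max(0, row - radius), min(len(image), row + radius + 1))
    cols = range(max(0, col - radius), min(len(image[0]), col + radius + 1))
    values = [image[r][c] for r in rows for c in cols]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def niblack_threshold(mean, std, k=-0.2):
    # Niblack: T = m + k * s (k = -0.2 is a commonly quoted default).
    return mean + k * std


def sauvola_threshold(mean, std, k=0.5, dynamic_range=128.0):
    # Sauvola: T = m * (1 + k * (s / R - 1)), with R the dynamic range of s.
    return mean * (1.0 + k * (std / dynamic_range - 1.0))


def binarize(image, threshold_fn, radius=1):
    """Return a 0/1 image where 1 marks pixels darker than the local threshold."""
    result = []
    for r, line in enumerate(image):
        out_row = []
        for c, pixel in enumerate(line):
            mean, std = local_stats(image, r, c, radius)
            out_row.append(1 if pixel < threshold_fn(mean, std) else 0)
        result.append(out_row)
    return result


# Hypothetical 3x5 patch: a single dark "ink" pixel on a bright background.
sample = [
    [200, 200, 200, 200, 200],
    [200, 200, 40, 200, 200],
    [200, 200, 200, 200, 200],
]
sauvola_result = binarize(sample, sauvola_threshold)
niblack_result = binarize(sample, niblack_threshold)
```

On this toy patch both thresholds mark only the dark pixel as foreground. On real documents the two behave quite differently in flat background regions, where Sauvola's threshold drops well below the local mean and Niblack's stays close to it.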

Proposed solution: The proposed method first reduces the number of colors using the mean shift algorithm, applied at different scales of the image. This is followed by an enhanced version of Niblack’s thresholding method (detailed in the thesis), which we proposed for the binarization of printed color document images with complex backgrounds and of degraded historical document images. The binarization is followed by an optional morphological post-processing step.
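The color reduction idea can be illustrated with a toy mean shift in color space. This is only a sketch under simplifying assumptions: it uses a flat kernel, ignores pixel positions and the multi-scale aspect described above, and is not the implementation from the thesis. The function name, the bandwidth and the iteration count are hypothetical.

```python
import math


def mean_shift_colors(colors, bandwidth, iterations=10):
    """Move each colour towards the mean of the original colours near it.

    Flat-kernel mean shift in colour space only: on every iteration a point
    is replaced by the mean of all original colours within `bandwidth`.
    """
    originals = [tuple(float(ch) for ch in color) for color in colors]
    points = list(originals)
    for _ in range(iterations):
        shifted = []
        for point in points:
            neighbours = [c for c in originals if math.dist(point, c) <= bandwidth]
            if not neighbours:
                # Guard against floating-point edge cases: keep the point as-is.
                shifted.append(point)
                continue
            shifted.append(
                tuple(sum(channel) / len(neighbours) for channel in zip(*neighbours))
            )
        points = shifted
    return points


# Hypothetical palette: three dark "ink" shades and two bright "paper" shades.
palette = [
    (10, 10, 10),
    (12, 12, 12),
    (14, 14, 14),
    (200, 200, 200),
    (204, 204, 204),
]
reduced = mean_shift_colors(palette, bandwidth=20.0)
```

After the shift, the five input colors collapse onto two modes, one per cluster, which is the kind of simplification that makes the later thresholding step easier.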

You can examine some of the results of the proposed solution here: Binarization Results Album

Implementation

The binarization schemes were implemented mostly in C (using the OpenCV library). The evaluation code consisted of MATLAB, Python and Batch scripts, together with a standard evaluation tool provided by the International Conference on Document Analysis and Recognition (ICDAR).

Evaluations

The proposed solution was evaluated using standard metrics from the Document Image Binarization Contest (DIBCO) competitions conducted as part of ICDAR. On average, the proposed solution was found to perform at least on par with, or better than, the state of the art. In addition, the proposed method performed more consistently than the state of the art in all instances of our evaluation.
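As an illustration of this kind of pixel-level evaluation, here is a minimal sketch of F-measure and PSNR, two metrics commonly reported in DIBCO. Whether these were among the exact metrics used in this project is an assumption, and the actual evaluation used the ICDAR tool rather than this code. Images are lists of lists of 0/1 values with 1 meaning text.

```python
import math


def f_measure(predicted, truth):
    """F-measure in percent over foreground (value 1) pixels."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


def psnr(predicted, truth):
    """PSNR in dB for 0/1 images, where the largest pixel difference is 1."""
    total = sum(len(row) for row in truth)
    squared_error = sum(
        (p - t) ** 2
        for pred_row, truth_row in zip(predicted, truth)
        for p, t in zip(pred_row, truth_row)
    )
    mse = squared_error / total
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(1.0 / mse)


# Hypothetical 2x4 ground truth and binarization output.
ground_truth = [[1, 1, 0, 0], [0, 0, 0, 0]]
output = [[1, 0, 0, 0], [0, 0, 1, 0]]
score = f_measure(output, ground_truth)
quality = psnr(output, ground_truth)
```

In this toy case the output has one true positive, one false positive and one false negative, so precision and recall are both 0.5.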

The proposed solution was also evaluated as part of a whole OCR system and was found to raise overall character recognition accuracy by 1.94% compared to Sauvola’s thresholding method, which also aims to binarize degraded document images and document images with complex backgrounds.
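For readers unfamiliar with system-level OCR evaluation, one common way to score character recognition accuracy is through the edit distance between the OCR output and a reference transcription. The sketch below shows that definition. It is an assumption that it matches how the figure above was computed, and the function names and example strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ref_char
                    current[j - 1] + 1,      # insert hyp_char
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Character accuracy in percent relative to the reference length."""
    if not reference:
        return 100.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 100.0 * (len(reference) - errors) / len(reference))


# Hypothetical reference text and OCR output with one misread character.
accuracy = character_accuracy("binarization", "binarizatlon")
```

Comparing this score for the same pages binarized with two different methods gives the kind of accuracy difference reported above.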

Conclusions

The one concrete conclusion we arrived at in the course of this study is that a single binarization scheme that succeeds on all kinds of input images may be an unreasonable goal. However, it is an achievable goal to reduce the variability in the performance of a given binarization scheme on a more closely related set of input document images.