My Notes on AI - Document Analysis

Document Analysis and Recognition

Document analysis and recognition aims to convert document images from image-based form into text-based form, which enables functionality such as content retrieval, content search and indexing, and document compression. A general framework of document analysis and recognition is presented in the following figure.

Document analysis and recognition is one of the foundational services for building an enterprise knowledge repository and document archive system. In this task I studied various image processing techniques that are crucial to the success of an Optical Character Recognition (OCR) system, and learned to build an image processing pipeline that feeds the transformed image data to Tesseract in order to extract text from images and digitize them.

Document Image Skewness Detection

Document image skew detection is an important part of the image pre-processing pipeline that runs before the data is fed to the OCR engine, because correcting skew improves the accuracy of the OCR output. The Hough transform is a technique that helps detect skew in document images. It maps points from the xy image coordinate system to a polar (rho, theta) parameter space, where points lying on the same line vote for the same location, so that lines and their angles can be identified.
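To make the mapping concrete, here is a minimal sketch of the normal-form line equation rho = x*cos(theta) + y*sin(theta) that the Hough transform is usually built on. The function name hough_rho and the sample points are illustrative assumptions, not part of the original task.

```python
import math


def hough_rho(x, y, theta_deg):
    """Return rho for the line through (x, y) whose normal is at theta_deg.

    Uses the normal form rho = x*cos(theta) + y*sin(theta).
    """
    theta = math.radians(theta_deg)
    return x * math.cos(theta) + y * math.sin(theta)


# Three collinear points on the horizontal line y = 5 (hypothetical data).
points = [(0, 5), (10, 5), (20, 5)]

# At theta = 90 degrees every point yields the same rho, so a single
# accumulator cell (rho = 5, theta = 90) would collect all three votes.
rhos_at_90 = [round(hough_rho(x, y, 90), 6) for x, y in points]

# At a wrong angle the same points scatter over different rho values.
rhos_at_45 = [round(hough_rho(x, y, 45), 6) for x, y in points]

print(rhos_at_90)
print(rhos_at_45)
```

The angle at which the votes pile up into one cell is the angle of the line's normal, which is how a dominant text-line direction can be read off the accumulator.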

This task investigates document skew estimation using the Hough transform. The document image is first loaded and binarised using a fixed threshold value. The negative of the binarised image is then computed, connected components are extracted from the negative image, and candidate points are selected on it using one of several selection strategies. All pixels that are not candidate points are then removed from the negative image, and the Hough transform is applied to what remains. This returns all the angles detected by the Hough transform. In the last step the image is deskewed using the angles calculated in the previous step.
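The steps above can be sketched roughly as follows, using only NumPy. This is a simplified illustration under stated assumptions, not the notebook's implementation: the function name estimate_skew_angle and its parameters are hypothetical, the connected-component step is skipped so that every ink pixel is treated as a candidate point, and the winning angle is chosen as the one whose Hough accumulator column is most sharply peaked (largest sum of squared votes).

```python
import numpy as np


def estimate_skew_angle(gray, threshold=128, angle_range=15.0, angle_step=0.5):
    """Estimate the skew of a grayscale page image, in degrees.

    A positive result means text lines run downward to the right in image
    coordinates (y grows downward). All parameter names are illustrative.
    """
    # Binarise with a fixed threshold and take the negative: ink becomes True.
    negative = np.asarray(gray) < threshold

    # Assumption: every ink pixel is a candidate point (no component filtering).
    ys, xs = np.nonzero(negative)
    if xs.size == 0:
        return 0.0

    # Text lines are close to horizontal, so their normals are close to 90 deg.
    skews = np.arange(-angle_range, angle_range + angle_step, angle_step)
    thetas = np.deg2rad(90.0 + skews)

    # |rho| never exceeds the image diagonal; offset by it to keep bins >= 0.
    diag = int(np.ceil(np.hypot(*negative.shape)))

    best_skew, best_score = 0.0, -1.0
    for skew, theta in zip(skews, thetas):
        # One Hough accumulator column: votes per rho bin at this theta.
        rho = np.round(xs * np.cos(theta) + ys * np.sin(theta)).astype(int)
        votes = np.bincount(rho + diag, minlength=2 * diag + 1)
        # Sharply peaked columns mean the points line up at this angle.
        score = float(np.sum(votes.astype(np.float64) ** 2))
        if score > best_score:
            best_skew, best_score = float(skew), score
    return best_skew
```

Restricting the search to a small range around horizontal keeps the accumulator cheap; the rotation that actually deskews the page would be done with an imaging library once the angle is known.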

Text Recognition using pytesseract

pytesseract is a Python wrapper for Google's Tesseract-OCR engine. This library can be used to extract text regions from an image and recognise their text in different languages. You can find more details of this library at https://pypi.org/project/pytesseract/
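As a minimal usage sketch, text can be read from an image with pytesseract.image_to_string. The helper name recognise_text is an assumption made for illustration, and the example presumes that the Tesseract binary, the pytesseract and Pillow packages, and the requested language data are installed.

```python
def recognise_text(image_path, lang="eng"):
    """Return the text Tesseract recognises in the image at image_path.

    lang is a Tesseract language code such as "eng"; several languages can
    be combined with "+", for example "eng+deu".
    """
    # Imported lazily so this file still loads where Tesseract is not installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling recognise_text("page.png") on a scanned page (the file name is only a placeholder) would return its text as a single string.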

In this task I created a complete document recognition system that takes a document image as input, deskews it, recognises its text, and returns a PDF of the input document.
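One possible shape for such a system is sketched below. It is not the notebook's code: the function name document_to_pdf is hypothetical, and estimate_skew stands for any caller-supplied function that takes a grayscale Pillow image and returns the angle, in degrees, by which the page should be rotated counter-clockwise. The PDF is produced with pytesseract.image_to_pdf_or_hocr, which requires a working Tesseract installation.

```python
def document_to_pdf(image_path, pdf_path, estimate_skew):
    """Deskew a document image, run OCR on it, and write a searchable PDF.

    estimate_skew is a caller-supplied callable (an assumption of this
    sketch) returning the counter-clockwise correction angle in degrees.
    """
    # Imported lazily so this file still loads where Tesseract is not installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        gray = image.convert("L")  # 8-bit grayscale copy of the page

    # Rotate by the estimated angle; newly exposed corners are filled white.
    angle = estimate_skew(gray)
    deskewed = gray.rotate(angle, expand=True, fillcolor=255)

    # Tesseract returns the PDF (page image plus text layer) as bytes.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(deskewed, extension="pdf")
    with open(pdf_path, "wb") as pdf_file:
        pdf_file.write(pdf_bytes)
    return pdf_path
```

Passing the skew estimator in as an argument keeps the OCR step independent of how the angle is obtained, so a Hough-based estimator or any other method can be plugged in.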

Link to Notebook