Java OCR

What is Tesseract OCR?

Tesseract is an optical character recognition (OCR) engine originally developed at HP Laboratories between 1985 and 1994 and released as open source in 2005. Since 2006 its development has been sponsored by Google. Tesseract supports Unicode (UTF-8) and can recognize more than 100 languages "out of the box", so it can be used to build scanning software for many languages. Tesseract 4 added a new neural network (LSTM) based OCR engine that focuses on line recognition, while still supporting the legacy Tesseract engine, which works by recognizing character patterns.

With the rapid advances in AI and machine learning, the demand for robust image processing has grown. Tess4J, a Java wrapper for the Tesseract API, enables us to perform such processing in Java.

How does OCR work?

Tesseract OCR is available for download on all the major operating systems, such as Windows, macOS and Linux. To understand how OCR works, consider the following steps in sequential order:

Pre-process the image data, for example by converting to grayscale, smoothing, de-skewing and filtering.
Detect lines, words and characters.
Generate a ranked list of candidate characters based on trained data. (The setDatapath() method is used to set the path to this trained data.)
Post-process the recognized characters: select the best candidates based on the confidence from the previous step and on language data. Language data includes a dictionary, grammar rules, etc.
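
The first step, pre-processing, can be illustrated with plain Java before any OCR library is involved. The following is a minimal sketch of grayscale conversion only, not what Tesseract does internally. The class name GrayscalePreprocessor and the method toGray are hypothetical, and the luminance weights are the common ITU-R BT.601 values, chosen here as an assumption.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

public class GrayscalePreprocessor {

    // Returns a grayscale copy of the source image.
    // Each output sample is a weighted sum of the red, green and blue
    // channels (BT.601 weights, an assumption made for this sketch).
    public static BufferedImage toGray(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                long luminance = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                // Clamp defensively to the valid 0..255 sample range
                raster.setSample(x, y, 0, (int) Math.max(0, Math.min(255, luminance)));
            }
        }
        return gray;
    }

    public static void main(String[] args) {
        // A tiny synthetic image stands in for a scanned page
        BufferedImage colour = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        colour.setRGB(0, 0, 0xFFFFFF); // white
        colour.setRGB(1, 0, 0xFF0000); // pure red
        BufferedImage gray = toGray(colour);
        System.out.println("white -> " + gray.getRaster().getSample(0, 0, 0));
        System.out.println("red   -> " + gray.getRaster().getSample(1, 0, 0));
    }
}
```

The resulting BufferedImage could then be passed to an OCR call that accepts an image buffer, or written to disk with ImageIO.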

How to use Tesseract OCR?

In order to use Tesseract OCR in Java, follow the steps given below:

Download the Tess4J API.
Extract the files from the downloaded archive.
Open any IDE and create a new project.
Link the jar file to your project. The jar is located at this path: "..\Tess4J-3.4.8-src\Tess4J\dist".

Once the jar has been linked to the project, the Tesseract engine is ready to use.

Performing OCR on clear images

Now that we have linked the jar file, we can start coding. The following code reads an image file, performs OCR on it and displays the recognized text on the console.

OCR.java

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OCR {
    public static void main(String[] args) {
        // creating an object of class Tesseract
        Tesseract tesseract = new Tesseract();
        try {
            // path of the tessdata folder inside the extracted Tess4J folder
            tesseract.setDatapath("D:/Tess4J/tessdata");
            // specifying the image that has to be read
            String text = tesseract.doOCR(new File("image.jpg"));
            // printing the text recognized in the image
            System.out.print(text);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}

Input:

image.jpg

Output:

Sometimes, this simply isn't possible. Sometimes, we wish to automate a task of rewriting text from an image with our own hands.

Reading an Unclear Image Using OCR

Note that the image used above has a very high resolution and a consistent font, but this is not the case most of the time. More often we get an unclear or distorted image, and thus a distorted output. To deal with this, we need to pre-process the image, a step known as image processing.

Tesseract works best when the foreground text is cleanly segmented from the background. In practice, it can be very challenging to ensure good separation. There are various reasons why Tesseract may not produce good-quality output when the image has an unclear or distorted background. In such cases, we need to know how the image should be processed.
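
One common way to separate text from background is to binarize the image, turning every pixel either black or white. The sketch below uses Otsu's method to pick a global threshold. It is an illustration only and is not part of the example program in this article. The class name Binarizer and its methods are hypothetical, and the BT.601 luminance weights are an assumption.

```java
import java.awt.image.BufferedImage;

public class Binarizer {

    // Luminance (0..255) of a packed RGB pixel, using BT.601 weights
    public static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (int) Math.min(255, Math.round(0.299 * r + 0.587 * g + 0.114 * b));
    }

    // Otsu's method: choose the threshold that maximizes the
    // between-class variance of the "dark" and "light" pixel groups
    public static int otsuThreshold(BufferedImage img) {
        int[] hist = new int[256];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                hist[luminance(img.getRGB(x, y))]++;
            }
        }
        long total = (long) img.getWidth() * img.getHeight();
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }
        long sumDark = 0;
        long countDark = 0;
        double bestVariance = -1;
        int threshold = 0;
        for (int t = 0; t < 256; t++) {
            countDark += hist[t];
            sumDark += (long) t * hist[t];
            if (countDark == 0) {
                continue; // no dark pixels yet
            }
            long countLight = total - countDark;
            if (countLight == 0) {
                break; // every pixel is already in the dark group
            }
            double meanDark = (double) sumDark / countDark;
            double meanLight = (double) (sumAll - sumDark) / countLight;
            double diff = meanDark - meanLight;
            double variance = (double) countDark * countLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                threshold = t;
            }
        }
        return threshold;
    }

    // Pixels at or below the threshold become black, the rest white
    public static BufferedImage binarize(BufferedImage img) {
        int threshold = otsuThreshold(img);
        BufferedImage out = new BufferedImage(
                img.getWidth(), img.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                boolean dark = luminance(img.getRGB(x, y)) <= threshold;
                out.setRGB(x, y, dark ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two dark "text" pixels and two light "background" pixels
        BufferedImage img = new BufferedImage(4, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0x202020);
        img.setRGB(1, 0, 0x303030);
        img.setRGB(2, 0, 0xE0E0E0);
        img.setRGB(3, 0, 0xF0F0F0);
        System.out.println("threshold = " + otsuThreshold(img));
    }
}
```

A single global threshold works when lighting is even across the page. Images with shadows or gradients usually need a locally adaptive threshold instead, which this sketch does not cover.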

Here, we will create a small program that enlarges the image and then adjusts its brightness and contrast according to the RGB value sampled at its centre.

The example below is sample code showing how an image can be adjusted based on its RGB content before OCR is performed.

ReadingImage.java

// importing all the required packages
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.RescaleOp;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class ReadingImage {
    public static void processImg(BufferedImage inputImage, float scaleFactor, float offset)
            throws IOException, TesseractException {
        // Create an image buffer of the same type as the input image
        // to hold the resized image
        BufferedImage outputImage = new BufferedImage(1050, 1024, inputImage.getType());
        // Get a 2D graphics context for drawing on the buffer
        Graphics2D grp = outputImage.createGraphics();
        // Draw the input image scaled to 1050 x 1024, starting at (0, 0);
        // null is passed as the ImageObserver
        grp.drawImage(inputImage, 0, 0, 1050, 1024, null);
        grp.dispose();
        // RescaleOp multiplies every colour sample by scaleFactor and adds
        // offset, which adjusts the contrast and brightness of the image
        RescaleOp rescaleOutput = new RescaleOp(scaleFactor, offset, null);
        // Apply the rescaling and write the result to a .jpg file
        BufferedImage finalOutputimage = rescaleOutput.filter(outputImage, null);
        ImageIO.write(finalOutputimage, "jpg",
                new File("C:/Users/Yukta Malhotra/Desktop/pico.jpg"));
        // Create an instance of the Tesseract class
        // that will be used to perform OCR
        Tesseract tesseractInstance = new Tesseract();
        tesseractInstance.setDatapath("C:/Users/Yashneet/Desktop/Tess4J/tessdata");
        // Perform OCR on the processed image
        // and store the result in the 'str' string
        String str = tesseractInstance.doOCR(finalOutputimage);
        System.out.println(str);
    }
    public static void main(String[] args) throws Exception {
        File f = new File("C:/Users/Yashneet/Desktop/pic3.jpg");
        BufferedImage inputImage = ImageIO.read(f);
        // Sample the packed ARGB value of the pixel at the centre of the image
        double d = inputImage.getRGB(inputImage.getWidth() / 2,
                inputImage.getHeight() / 2);
        // Compare the sampled value against fixed ranges and choose the
        // scaling values that RescaleOp will use later on.
        // Note: a sample below -1.4211511E7 matches no range, so no OCR is run.
        if (d >= -1.4211511E7 && d < -7254228) {
            processImg(inputImage, 3f, -10f);
        } else if (d >= -7254228 && d < -2171170) {
            processImg(inputImage, 1.455f, -47f);
        } else if (d >= -2171170 && d < -1907998) {
            processImg(inputImage, 1.35f, -10f);
        } else if (d >= -1907998 && d < -257) {
            processImg(inputImage, 1.19f, 0.5f);
        } else if (d >= -257 && d < -1) {
            processImg(inputImage, 1f, 0.5f);
        } else if (d >= -1 && d < 2) {
            processImg(inputImage, 1f, 0.35f);
        }
    }
}

Input:

Output:

Time taken to search elements keep increasing as the number of elements were increased.
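
The negative thresholds in main can look arbitrary. They are negative because getRGB returns a packed ARGB int, and an opaque pixel has its top (alpha) byte set, which makes the signed value negative. The sketch below unpacks such a value and models the arithmetic that the scale factor and offset apply to each colour sample. The class name PixelMath and its methods are hypothetical, and truncating the result to an integer is an assumption of this sketch rather than a statement about how RescaleOp rounds.

```java
import java.util.Arrays;

public class PixelMath {

    // Splits a packed ARGB int into {alpha, red, green, blue}, each 0..255
    public static int[] unpack(int argb) {
        return new int[] {
            (argb >>> 24) & 0xFF,
            (argb >> 16) & 0xFF,
            (argb >> 8) & 0xFF,
            argb & 0xFF
        };
    }

    // Models the per-sample rescaling: sample * scaleFactor + offset,
    // clamped to the 0..255 range of an 8-bit channel
    public static int rescale(int sample, float scaleFactor, float offset) {
        float value = sample * scaleFactor + offset;
        return (int) Math.max(0f, Math.min(255f, value));
    }

    public static void main(String[] args) {
        // One of the thresholds used in main, shown as colour channels
        System.out.println(Arrays.toString(unpack(-7254228)));
        // Opaque white is -1 when read as a signed int
        System.out.println(Arrays.toString(unpack(-1)));
        // A scale factor above 1 raises contrast; bright samples saturate
        System.out.println(rescale(50, 3f, -10f));
        System.out.println(rescale(100, 3f, -10f));
    }
}
```

Seen this way, each branch in main selects a stronger contrast boost for darker centre pixels and a milder one for lighter pixels.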

Advantages

The advantages of OCR are as follows:

It increases the efficiency of office work.
The ability to quickly search content is very useful, especially in an office environment that deals with high-volume scanning or high-volume document entry.
OCR quickly digitizes a document while keeping its content unchanged, which saves time.
Productivity increases as employees no longer spend time on manual data entry and can work faster and more efficiently.

Disadvantages

The disadvantages of OCR are as follows:

OCR is limited to the languages for which it has trained data.
A lot of effort is required to create training data for different languages and to integrate it.
More work is also needed on image processing, as it is the most important factor in OCR performance.
Even after all this work, no OCR engine can provide 100% accuracy; unrecognized characters still have to be resolved with additional machine learning methods or corrected by hand.
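
Part of the manual repair work can sometimes be automated by comparing each recognized word against a dictionary. The sketch below replaces a word with its closest dictionary entry, measured by Levenshtein edit distance. It is one possible approach shown under simple assumptions, not something Tesseract or Tess4J provides in this form. The class name OcrWordRepair, its methods and the sample dictionary are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class OcrWordRepair {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn a into b
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary word closest to the input, or the input
    // itself if no word is within maxDistance edits
    public static String closestWord(String word, List<String> dictionary, int maxDistance) {
        String best = word;
        int bestDistance = maxDistance + 1;
        for (String candidate : dictionary) {
            int distance = editDistance(word, candidate);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("document", "scanner", "image");
        // "rn" misread in place of "m" is a typical OCR confusion
        System.out.println(closestWord("docurnent", dictionary, 2));
    }
}
```

A small maxDistance keeps the repair conservative, so words that are simply absent from the dictionary are left unchanged rather than replaced with an unrelated entry.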
