Java OCR

What is Tesseract OCR?

Tesseract is an optical character recognition (OCR) engine originally developed at HP Laboratories between 1985 and 1994 and released as open source in 2005. From 2006 its development was sponsored by Google. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box", so it can be used to build scanning software for different languages. Tesseract 4 added a new neural-network (LSTM) OCR engine that focuses on line recognition, while still supporting the legacy Tesseract engine, which works by recognizing character patterns.

With the rapid advancement in AI and machine learning, there is a growing need for robust image processing. Tesseract, through the Tess4J wrapper, lets us perform such processing in Java.

How does OCR work?

Tesseract OCR is available for download on all major operating systems, such as Windows, macOS and Linux. To understand how OCR works, consider the following steps in order:

  1. Pre-process the image data, for example: convert to greyscale, smooth, de-skew, filter.
  2. Detect lines, words and letters.
  3. Generate a ranked list of candidate characters based on trained data. (In Tess4J, the setDatapath() method is used to set the path to the trained data.)
  4. Post-process the recognized characters: select the best candidates based on the confidence from the previous step combined with language data. Language data includes a dictionary, grammar rules, etc.
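
Steps 3 and 4 can be pictured with a toy sketch. It is not Tesseract's actual algorithm: the class `CandidateSelection`, the method `bestWord`, the confidence values and the one-word dictionary are all hypothetical, chosen only to show how language data can override the raw classifier ranking.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CandidateSelection {

    // Returns the dictionary word with the highest combined confidence, or the
    // highest-confidence string overall if no combination is in the dictionary.
    // Each map holds the candidate characters for one position and their confidence.
    static String bestWord(List<Map<Character, Double>> positions, Set<String> dictionary) {
        Map<String, Double> scored = new LinkedHashMap<>();
        expand(positions, 0, "", 1.0, scored);

        String best = null;
        double bestScore = -1.0;
        boolean bestInDictionary = false;
        for (Map.Entry<String, Double> entry : scored.entrySet()) {
            boolean inDictionary = dictionary.contains(entry.getKey());
            // A dictionary word always beats a non-word; otherwise compare scores.
            boolean better = (inDictionary && !bestInDictionary)
                    || (inDictionary == bestInDictionary && entry.getValue() > bestScore);
            if (better) {
                best = entry.getKey();
                bestScore = entry.getValue();
                bestInDictionary = inDictionary;
            }
        }
        return best;
    }

    // Enumerates every combination of candidates; the score of a combination
    // is the product of the per-character confidences.
    private static void expand(List<Map<Character, Double>> positions, int index,
                               String prefix, double score, Map<String, Double> out) {
        if (index == positions.size()) {
            out.put(prefix, score);
            return;
        }
        for (Map.Entry<Character, Double> c : positions.get(index).entrySet()) {
            expand(positions, index + 1, prefix + c.getKey(), score * c.getValue(), out);
        }
    }

    public static void main(String[] args) {
        // Hypothetical classifier output: the middle glyph looks slightly more like '4' than 'a'.
        List<Map<Character, Double>> positions = List.of(
                Map.of('c', 0.90),
                Map.of('4', 0.55, 'a', 0.45),
                Map.of('t', 0.95));
        System.out.println(bestWord(positions, Set.of("cat")));
        System.out.println(bestWord(positions, Set.of()));
    }
}
```

With the dictionary, the lower-confidence reading that forms a real word is chosen; without it, the raw ranking decides.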

How to use Tesseract OCR?

In order to use Tesseract OCR in Java, follow the steps given below:

  1. Download the Tess4J API.
  2. Extract the files from the downloaded archive.
  3. Open any IDE and create a new project.
  4. Link the Tess4J jar file to your project.
  5. The jar file can be found under the path "..\Tess4J-3.4.8-src\Tess4J\dist".

Once the jar has been linked to the project, the Tesseract engine is ready to use.
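
One way to confirm that the jar is really visible to the project is to ask the class loader for Tess4J's main class. The sketch below is an optional sanity check, not part of Tess4J itself; the helper class `Tess4JSetupCheck` and its method `isOnClasspath` are hypothetical names, while `net.sourceforge.tess4j.Tesseract` is the class that Tess4J ships.

```java
public class Tess4JSetupCheck {

    // Fully qualified name of the Tess4J entry-point class.
    static final String TESSERACT_CLASS = "net.sourceforge.tess4j.Tesseract";

    // Returns true if the named class can be located by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers.
            Class.forName(className, false, Tess4JSetupCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isOnClasspath(TESSERACT_CLASS)) {
            System.out.println("Tess4J found on the classpath.");
        } else {
            System.out.println("Tess4J not found; check that the jar is linked to the project.");
        }
    }
}
```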

Performing OCR on clear images

Now that we have linked the jar file, we can start coding. The following code reads an image file, performs OCR on it, and displays the recognized text on the console.

OCR.java

Input:

image.jpg

[Image: Java OCR]

Output:

Sometimes, this simply isn't possible. Sometimes, we wish to automate a task of rewriting text from an image with our own hands.

Reading an Unclear Image Using OCR

Note that the image used above has a very high resolution and a consistent font, but this is not true in most cases. More often we get an unclear or distorted image, and therefore distorted output. To deal with this, we need to process the image before recognition, a step called image processing.

Tesseract works best when the foreground text is cleanly segmented from the background. In practice, it can be very challenging to ensure good separation. There are various reasons why Tesseract may produce poor output when the image has an unclear or distorted background. In such cases, we need to know how the image should be processed.

Here, we will write a small routine that scans the RGB content of the image, converts it to greyscale, and then zooms (scales) the image.

The following sample code shows how an image can be converted to greyscale based on its RGB content.

ReadingImage.java
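
A minimal sketch of the preprocessing described above (greyscale conversion followed by a zoom), using only the JDK's `BufferedImage` and `Graphics2D`. The class name `GreyscaleZoomSketch`, the method names, the luma weights, the bicubic interpolation and the zoom factor of 2 are assumptions made for illustration, not the original listing. With Tess4J linked, the prepared image could then be handed to `doOCR`, which also accepts a `BufferedImage`.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class GreyscaleZoomSketch {

    // Converts every pixel to a grey level computed from its R, G and B values.
    // The 0.299/0.587/0.114 weights are the common luma weights (an assumption;
    // a plain average of the three channels would also work).
    static BufferedImage toGreyscale(BufferedImage src) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int grey = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                out.setRGB(x, y, (grey << 16) | (grey << 8) | grey);
            }
        }
        return out;
    }

    // Scales the image by the given factor using bicubic interpolation.
    static BufferedImage zoom(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Usage: java GreyscaleZoomSketch <input image> <output.png>");
            return;
        }
        BufferedImage input = ImageIO.read(new File(args[0]));
        if (input == null) {
            System.out.println("Could not decode " + args[0]);
            return;
        }
        // 2.0 is an arbitrary zoom factor chosen for this sketch.
        BufferedImage prepared = zoom(toGreyscale(input), 2.0);
        ImageIO.write(prepared, "png", new File(args[1]));
        // With Tess4J on the classpath, `prepared` could be passed to Tesseract.doOCR(BufferedImage).
    }
}
```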

Input:

[Image: Java OCR]

Output:

Time taken to search elements keep increasing as the number of elements were increased.

Advantages

The advantages of OCR are as follows:

  1. It increases the efficiency of office work.
  2. The ability to quickly search for content is very useful, especially in an office environment that deals with high-volume scanning or high-volume document entry.
  3. OCR is fast and keeps the content of the document unchanged, which saves time.
  4. Productivity improves because employees no longer spend time on manual data entry and can work faster and more efficiently.

Disadvantages

The disadvantages of OCR are as follows:

  1. OCR is limited to the languages it has been trained to recognize.
  2. Considerable effort is required to create trained data for different languages and to integrate it.
  3. Additional work is needed on image processing, as it is the most important factor in OCR performance.
  4. Even after all this work, no OCR engine can provide 100% accuracy; unrecognized characters still have to be inferred from their neighbouring characters using machine learning methods or corrected manually.





