Tika Parser API

Parser is the Tika interface for extracting content and metadata from any type of document. It is a key component of Tika and belongs to the org.apache.tika.parser package. It provides a parse() method with the following signature:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException

It takes four arguments: InputStream, ContentHandler, Metadata and ParseContext objects. The purpose of each is described below.


The arguments are as follows.

Argument                Description
InputStream stream      The document is read from this input stream.
ContentHandler handler  The handler that receives the content of the document.
Metadata metadata       A multi-valued metadata container.
ParseContext context    Used to pass context information to Tika parsers.
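
To make the role of the handler argument more concrete, here is a minimal sketch of a ContentHandler that collects character data. ContentHandler is the standard SAX interface (org.xml.sax.ContentHandler) shipped with the JDK, and Tika parsers report extracted content to it as XHTML SAX events. In this sketch the JDK's own SAX parser stands in for a Tika parser as the source of events; the class name TextCollector and the sample markup are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers all character data passed to it.
// DefaultHandler is the JDK's no-op implementation of ContentHandler.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called by the event source for each run of text in the document.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Feeds the handler with SAX events. The JDK XML parser is only a
    // stand-in here; a Tika parser would drive the handler the same way.
    static String collect(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        // prints "Hello Tika"
        System.out.println(collect("<html><body><p>Hello Tika</p></body></html>"));
    }
}
```

In practice you rarely need to write such a handler yourself, because Tika ships ready-made ones such as BodyContentHandler in the org.apache.tika.sax package.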

Tika also provides the AutoDetectParser class, which automatically figures out what kind of content a file has and then calls the appropriate parser.

Apart from this, Tika provides many format-specific parser classes, each of which parses documents of one particular type. Some of them are listed in the following table.

Parser                 Package                            Description
AppleSingleFileParser  org.apache.tika.parser.apple       Parses AppleSingle files.
ClassParser            org.apache.tika.parser.asm         Parses Java class files.
AudioParser            org.apache.tika.parser.audio       Parses audio files.
MidiParser             org.apache.tika.parser.audio       Parses MIDI files.
Pkcs7Parser            org.apache.tika.parser.crypto      Parses PKCS7 data.
TSDParser              org.apache.tika.parser.crypto      Parses TSD files.
DWGParser              org.apache.tika.parser.dwg         Parses DWG files.
EnviHeaderParser       org.apache.tika.parser.envi        Parses ENVI header files.
EpubParser             org.apache.tika.parser.epub        Parses EPUB files.
ExecutableParser       org.apache.tika.parser.executable  Parses executable files.
HtmlParser             org.apache.tika.parser.html        Parses HTML files.
ImageParser            org.apache.tika.parser.image       Parses image files.
WebPParser             org.apache.tika.parser.image       Parses WebP images.
IptcAnpaParser         org.apache.tika.parser.iptc        Parses IPTC ANPA files.
JpegParser             org.apache.tika.parser.jpeg        Parses JPEG images.
DBFParser              org.apache.tika.parser.dbf         Parses DBF files.
Mp3Parser              org.apache.tika.parser.mp3         Parses MP3 files.
MP4Parser              org.apache.tika.parser.mp4         Parses MP4 files.
PDFParser              org.apache.tika.parser.pdf         Parses PDF files.
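
When the document type is known in advance, one of these parsers can be called directly instead of going through detection. The following is a minimal sketch using HtmlParser on an in-memory string. It assumes a Tika release in which org.apache.tika.parser.html.HtmlParser is available in the package listed above and that the Tika core and parser libraries are on the classpath; the class name HtmlParserSketch and the sample markup are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class, not part of Tika.
class HtmlParserSketch {

    // Parses an HTML string with HtmlParser and returns the text of its body.
    static String bodyText(String html) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new ByteArrayInputStream(bytes)) {
            // Same four arguments as described for Parser.parse().
            new HtmlParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(bodyText("<html><body><p>Hello from HtmlParser</p></body></html>"));
    }
}
```

The other parsers in the table are used the same way, since they all implement the Parser interface.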

Tika Parser Example

In this example, we use AutoDetectParser, which detects the document type automatically and then parses its content and metadata.
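
A minimal sketch of such a program is given here. It assumes that the Tika core and parser libraries are on the classpath and that a text file named hello.txt exists in the working directory; the class name TikaParserExample and the helper method extractText are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class, not part of Tika.
class TikaParserExample {

    // Parses the file with AutoDetectParser, fills the given Metadata
    // object and returns the extracted body text.
    static String extractText(Path file, Metadata metadata)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        try (InputStream stream = Files.newInputStream(file)) {
            parser.parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        String text = extractText(Paths.get("hello.txt"), metadata);

        // Extracted content of the document.
        System.out.println(text);

        // Metadata is multi-valued, so each name maps to an array of values.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + String.join(", ", metadata.getValues(name)));
        }
    }
}
```

The main method also prints the metadata entries after the content; their names and values depend on the Tika version in use.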

Output:

The following is the content of the hello.txt file after extraction.

Hello Welcome to Javatpoint