Tika Parser API

Parser is the Tika interface for extracting content and metadata from a document of any supported type. It is a key component of Tika and lives in the org.apache.tika.parser package. The interface declares a parse() method with the following signature:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException

The method takes four arguments: an InputStream, a ContentHandler, a Metadata object and a ParseContext object. The purpose of each of the four arguments is described below.

Argument | Description |
---|---|
InputStream stream | The document is read from this input stream. |
ContentHandler handler | A SAX interface (org.xml.sax.ContentHandler) that handles the content of the document. |
Metadata metadata | A multi-valued metadata container. |
ParseContext context | Used to pass context information to Tika parsers. |
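
To make the role of the handler argument concrete, here is a minimal sketch of a ContentHandler built only on the JDK's SAX classes. The class name TextCollector and its demo() method are hypothetical, and the SAX events are fired by hand here to stand in for the events a Tika parser would emit while reading a document.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects every run of character data it receives.
public class TextCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called once per chunk of text found in the document.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    // Fires a few SAX events by hand, standing in for what a parser would emit.
    public static String demo() {
        TextCollector handler = new TextCollector();
        char[] chars = "Hello".toCharArray();
        try {
            handler.startDocument();
            handler.startElement(XHTML, "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "Hello"
    }
}
```

An instance of a class like this could be passed as the handler argument of parse(). Tika also ships ready-made handlers, such as BodyContentHandler in the org.apache.tika.sax package.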

Tika also provides the AutoDetectParser class, which automatically determines what kind of content a file has and then calls the appropriate parser. Apart from this, Tika supplies various format-specific parser classes, each of which parses documents of one particular type. See the following table.

Parser | Package | Description |
---|---|---|
AppleSingleFileParser | org.apache.tika.parser.apple | Parses AppleSingle files. |
ClassParser | org.apache.tika.parser.asm | Parses Java class files. |
AudioParser | org.apache.tika.parser.audio | Parses audio files. |
MidiParser | org.apache.tika.parser.audio | Parses MIDI files. |
Pkcs7Parser | org.apache.tika.parser.crypto | Parses PKCS#7 data. |
TSDParser | org.apache.tika.parser.crypto | Parses TSD files. |
DWGParser | org.apache.tika.parser.dwg | Parses DWG drawings. |
EnviHeaderParser | org.apache.tika.parser.envi | Parses ENVI header files. |
EpubParser | org.apache.tika.parser.epub | Parses EPUB files. |
ExecutableParser | org.apache.tika.parser.executable | Parses executable files. |
HtmlParser | org.apache.tika.parser.html | Parses HTML files. |
ImageParser | org.apache.tika.parser.image | Parses image files. |
WebPParser | org.apache.tika.parser.image | Parses WebP images. |
IptcAnpaParser | org.apache.tika.parser.iptc | Parses IPTC ANPA files. |
JpegParser | org.apache.tika.parser.jpeg | Parses JPEG images. |
DBFParser | org.apache.tika.parser.dbf | Parses DBF files. |
Mp3Parser | org.apache.tika.parser.mp3 | Parses MP3 files. |
MP4Parser | org.apache.tika.parser.mp4 | Parses MP4 files. |
PDFParser | org.apache.tika.parser.pdf | Parses PDF files. |

Tika Parser Example

In this example, we use AutoDetectParser, which detects the document type automatically and then parses the content and metadata.

Output:

Following is the content of the hello.txt file after extraction.

Hello Welcome to Javatpoint
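
A minimal sketch of a program along these lines is given here. It assumes that the Tika core and parser libraries are on the classpath and that a hello.txt file exists in the working directory. The class name AutoDetectExample and the helper method extract() are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class AutoDetectExample {

    // Hypothetical helper: parses the stream with AutoDetectParser, returns the
    // extracted body text and fills the supplied Metadata object as a side effect.
    public static String extract(InputStream stream, Metadata metadata)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();                // detects the type, then delegates
        BodyContentHandler handler = new BodyContentHandler(); // collects the body text
        ParseContext context = new ParseContext();             // no extra context needed here
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        // Assumption: hello.txt is present in the working directory.
        try (InputStream stream = Files.newInputStream(Paths.get("hello.txt"))) {
            System.out.println(extract(stream, metadata));
        }
        // Print every metadata name together with its first value.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```

Because every parser listed in the table implements the same Parser interface, a specific class such as PDFParser could be substituted for AutoDetectParser when the document type is already known.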