Python Tutorial

Python is a flexible, high-level programming language with a wide range of libraries and a syntax that is easy to read and write. Its use continues to grow in areas such as Machine Learning, Web Development, Cybersecurity, and Application Development, which makes it a popular choice among programmers, engineers, and developers.

In this tutorial, we will work with PDF files using the Python programming language. PDF, short for Portable Document Format, is a file format for documents containing text, tables, images, and more. It is typically used when a document should look the same wherever it is opened and needs to be easily shared or printed. Adobe released the PDF format in 1993 in order to present documents, including formatted text and images, in a way that is independent of software applications, operating systems, and hardware.

The tutorial is divided into several parts that together cover the most common PDF handling and processing tasks in Python.

So, let's get started.

Some Famous Python PDF Libraries

A large variety of third-party Python libraries can be used to manipulate PDF files. Some well-known libraries for working with PDFs are:

PDFMiner,
PyPDF4,
PyPDF2,
ReportLab,
PyMuPDF,

and a lot more.

While many packages exist for different PDF operations in Python, this tutorial covers only a few of them, such as PDFMiner, PyPDF2, PyMuPDF, and reportlab. PyPDF2 is one of the most widely used Python modules for working with PDFs; it is easy to use and offers a variety of features. For text extraction, however, the PDFMiner package is generally more precise and dependable, since it was designed specifically for extracting text from PDF files. Depending on the task, one package can be more efficient than another, so this tutorial picks a library for each task based on its ease of use and reliability.

Text Extraction from PDFs using Python

PDFs are composed of various kinds of content, such as text, tables, images, and forms. A PDF is a graphical representation of data: it records the exact location of each piece of content on a display or a sheet of paper. However, it does not specify a logical structure such as sentences or paragraphs, and it cannot adapt its layout when the size of the display changes. The PDFMiner package does this work for the user by analyzing the layout and inferring where text and other content belong.
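To make the idea of layout analysis concrete, here is a minimal sketch of how characters that are known only by their page coordinates could be grouped into lines of text. This is a toy illustration, not PDFMiner's actual algorithm; the Glyph type, the group_into_lines helper, and the y_tolerance parameter are hypothetical names introduced only for this example.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """A single character and the page coordinates of its origin (hypothetical type)."""
    char: str
    x: float
    y: float


def group_into_lines(glyphs: List[Glyph], y_tolerance: float = 2.0) -> List[str]:
    """Group glyphs with nearly equal y-coordinates into lines of text.

    Lines are returned from the top of the page to the bottom (in PDF
    coordinates, y grows upward), and the characters inside each line
    are ordered from left to right.
    """
    lines: List[List[Glyph]] = []
    # Visit the glyphs from the top of the page downward.
    for glyph in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - glyph.y) <= y_tolerance:
            # Close enough to the current line: same line.
            lines[-1].append(glyph)
        else:
            # Too far below the current line: start a new one.
            lines.append([glyph])
    return ["".join(g.char for g in sorted(line, key=lambda g: g.x)) for line in lines]


# The glyphs are stored out of reading order, as they may be in a PDF.
glyphs = [
    Glyph("i", 20, 700), Glyph("H", 10, 700),
    Glyph("K", 20, 680.5), Glyph("O", 10, 680),
]
print(group_into_lines(glyphs))  # ['Hi', 'OK']
```

A real layout analyzer also has to handle columns, varying font sizes, and word spacing, which is why a dedicated library is preferable to hand-written grouping.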

PDFMiner is a robust library for operations such as extracting text from PDF files. Thus, in the following section, we will demonstrate the usage of PDFMiner for text extraction.

First of all, we have to install the PDFMiner package.

Installing the PDFMiner Package

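We can install the PDFMiner package using the following command. The imports used in this tutorial match pdfminer.six, the community-maintained version of PDFMiner:

Syntax:

$ pip install pdfminer.six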

Once the installation is complete, we can move on to the main part: extracting text using the PDFMiner library.

Let us consider the following example demonstrating text extraction with the help of PDFMiner in Python.

Example:

from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
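# in-memory buffer that collects the extracted text
O_string = StringIO()
with open('my_file.pdf', 'rb') as input_file:
    # parse the file and build the document object
    my_parser = PDFParser(input_file)
    my_doc = PDFDocument(my_parser)
    # the resource manager stores shared resources such as fonts and images
    rsrcmgr = PDFResourceManager()
    # the device converts the interpreted page contents to text
    my_device = TextConverter(rsrcmgr, O_string, laparams=LAParams())
    my_interpreter = PDFPageInterpreter(rsrcmgr, my_device)
    # process the document page by page
    for my_page in PDFPage.create_pages(my_doc):
        my_interpreter.process_page(my_page)
# print everything that was written to the buffer
print(O_string.getvalue())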

Output:

A Simple PDF File 
 This is a small demonstration .pdf file - 
 just for use in the Virtual Mechanics tutorials. More text. And more 
 text. And more text. And more text. And more text. 
 And more text. And more text. And more text. And more text. And more
 text. And more text. Boring, zzzzz. And more text. And more text. And
 more text. And more text. And more text. And more text. And more text.
 And more text. And more text.
 And more text. And more text. And more text. And more text. And more
 text. And more text. And more text. Even more. Continued on page 2 ...
 Simple PDF File 2
 ...continued from page 1. Yet more text. And more text. And more text.
 And more text. And more text. And more text. And more text. And more
 text. Oh, how boring typing this stuff. But not as boring as watching
 paint dry. And more text. And more text. And more text. And more text.
 Boring.  More, a little more text. The end, and just as well.

Explanation:

In the above snippet of code, we have imported the StringIO class from the io module and the required classes from the PDFMiner package. We created a StringIO object and used the with statement to open the PDF file from the directory. As per the PDFMiner documentation, PDFPageInterpreter is used to process page contents, while PDFResourceManager is used to store shared resources such as fonts or images. PDFPage represents a single page, and PDFPage.create_pages() lets us go through the document page by page. LAParams holds the parameters for layout analysis, which groups characters into text lines and text boxes and locates images and figures. With the help of these, the TextConverter class converts the PDF content into text and writes it to the StringIO object.

We provide "my_file.pdf" as the PDF file to be analyzed with the help of the PDFMiner module. The text of each page is extracted by calling the process_page() method of the interpreter.

At last, the print() call prints the extracted text collected in the StringIO object. So, in this manner, text can be extracted from a PDF file using the PDFMiner library.
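For simple cases, pdfminer.six also provides a high-level helper, extract_text, in the pdfminer.high_level module, which wraps the parser, resource manager, and interpreter steps of the example above in a single call. The following is a minimal sketch, assuming pdfminer.six is installed; the wrapper name pdf_to_text is hypothetical and exists only for illustration.

```python
def pdf_to_text(pdf_path: str) -> str:
    """Return the text of a PDF file as one string (hypothetical wrapper)."""
    # Imported inside the function so that this module can still be loaded
    # on a machine where pdfminer.six has not been installed yet.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


# Example usage (assumes my_file.pdf exists in the working directory):
# print(pdf_to_text("my_file.pdf"))
```

The lower-level classes remain useful when we need control over individual pages or over the layout parameters.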

Image Extraction from PDFs using Python

Whenever we want to extract images from a PDF, we can utilize PyMuPDF. This library is imported under the name fitz, and it makes image extraction from a PDF file easier. The example also uses the Pillow library (imported as PIL) to save the images. Before starting to work with the modules directly, let us install the required libraries.

Installing the PyMuPDF Package

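We can install the PyMuPDF and Pillow packages using the following commands. There is no need to install a separate fitz package: the package of that name on PyPI is unrelated to PyMuPDF and can break the import.

Syntax:

$ pip install pymupdf
$ pip install Pillow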

Once the installation is complete, we can move on to the main part: extracting images using the PyMuPDF library.

Let us consider the following example demonstrating the extraction of images in Python.

Example:

# PyMuPDF is imported under the name fitz
import fitz
import io
from PIL import Image

# path to our input file
my_file = "file2.pdf"
# open the input PDF file with PyMuPDF
my_pdf = fitz.open(my_file)
for page_num in range(len(my_pdf)):
    cur_page = my_pdf[page_num]
    # list of images referenced by the current page
    image_list = cur_page.get_images()
    print(f"[+] Found a total of {len(image_list)} images in page {page_num}")
    for image_num, image in enumerate(image_list):
        # get the XREF of the image
        xref = image[0]
        # extract the image bytes
        cur_image = my_pdf.extract_image(xref)
        img_bytes = cur_image["image"]
        # get the image extension
        img_ext = cur_image["ext"]
        # load it to PIL
        pil_image = Image.open(io.BytesIO(img_bytes))
        # save it to local disk
        pil_image.save(f"page{page_num + 1}_img{image_num}.{img_ext}")

Output:

[+] Found a total of 2 images in page 0
[+] Found a total of 2 images in page 1

Explanation:

In the above snippet of code, we have imported the required modules. We have then loaded the PDF file using the fitz module. Then we go page by page and find the list of images. We have then converted the image bytes in the PDFs to actual images and saved them locally. Thus, in this manner, we have extracted the images out of a PDF file.
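Because the dictionary returned for each image already holds encoded image bytes under the "image" key and a file extension under the "ext" key, the bytes can also be written to disk directly, without decoding them through PIL. The following is a minimal sketch of that idea; the helper name save_image_bytes and its out_dir parameter are hypothetical, and the sample dictionary is placeholder data standing in for a real extraction result.

```python
import tempfile
from pathlib import Path


def save_image_bytes(image_info: dict, page_num: int, image_num: int,
                     out_dir: str = ".") -> Path:
    """Write already-encoded image bytes to disk and return the file path.

    image_info is expected to have the same "image" and "ext" keys that
    the extraction example above reads.
    """
    file_name = f"page{page_num + 1}_img{image_num}.{image_info['ext']}"
    target = Path(out_dir) / file_name
    target.write_bytes(image_info["image"])
    return target


# Placeholder data, not a real image; used only to show the call.
sample = {"image": b"placeholder bytes", "ext": "png"}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_image_bytes(sample, page_num=0, image_num=0, out_dir=tmp_dir)
    print(saved.name)  # page1_img0.png
```

Skipping PIL avoids re-encoding the image, while going through PIL is still useful when the images need to be resized or converted to another format.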

Tables Extraction from PDFs using Python

Extracting tables from PDF files is somewhat easier than extracting images and text. A third-party library known as camelot can be used to extract tables. So, before we start implementing the code, we first need to install the library.

Installing the camelot library

We can install the camelot module using the following command with the pip installer. Note that the package is published on PyPI under the name camelot-py:

Syntax:

$ pip install camelot-py

Once the installation is done, let us head on to extracting tables from the PDF files in Python.

Example:

import camelot
# reading the pdf file
my_tables = camelot.read_pdf("my_table.pdf")
print(my_tables[0].df)

Explanation:

In the above snippet of code, we have imported the camelot library. We have then extracted the tables from the PDF file using the read_pdf() function of the camelot library and stored the result in a variable; the returned object behaves like a list of tables. At last, we have printed the first extracted table using its index along with the df attribute, which holds the table as a pandas DataFrame. Hence, we have successfully extracted tables from PDF files.
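By default, read_pdf() only looks at the first page. The following minimal sketch collects the DataFrames of every table found on a chosen set of pages, assuming camelot is installed; the helper name extract_tables is hypothetical, while the pages argument is camelot's own way of selecting pages (for example "1", "1,3,4", or "all").

```python
from typing import List


def extract_tables(pdf_path: str, pages: str = "1") -> List:
    """Return the df attribute of every table found on the given pages."""
    # Imported inside the function so that this module can still be loaded
    # on a machine where camelot has not been installed yet.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [table.df for table in tables]


# Example usage (assumes my_table.pdf exists in the working directory):
# for df in extract_tables("my_table.pdf", pages="all"):
#     print(df)
```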

Extracting URLs from PDFs using Python

Extracting URLs is another handy task that Python makes easy. A third-party library known as "pdfx" is commonly used to extract URLs from a PDF file. We could also use libraries such as PDFMiner or PyPDF2 to extract the text and then apply regular expressions to find the URLs. However, that procedure is longer and more tedious. Thus, in order to keep the code short, we will be using the pdfx library to extract URLs from a PDF file.
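For comparison, the regular-expression approach can be sketched as follows. It works on text that has already been extracted from the PDF. The pattern is deliberately simple and will miss some valid URLs, and the names URL_PATTERN and find_urls are hypothetical, introduced only for this example.

```python
import re
from typing import List

# Deliberately simple pattern: an http or https scheme followed by any
# characters that are not whitespace, quotes, or closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def find_urls(text: str) -> List[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    # Strip punctuation that commonly trails a URL at the end of a sentence.
    cleaned = (match.rstrip(".,;:") for match in URL_PATTERN.findall(text))
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(cleaned))


sample = (
    "See https://www.javatpoint.com/python-tutorial for the basics, "
    "and (https://www.javatpoint.com/) for the home page. "
    "Again: https://www.javatpoint.com/python-tutorial."
)
print(find_urls(sample))
# ['https://www.javatpoint.com/python-tutorial', 'https://www.javatpoint.com/']
```

Text extraction can also split a long URL across two lines, which a pattern like this cannot repair. That is one reason a dedicated library is more convenient.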

Installing the pdfx library

We can install the pdfx library using the following command with the pip installer:

Syntax:

$ pip install pdfx

Once the installation is complete, let us consider the following example to understand the extraction of URLs from PDFs.

Example:

import pdfx
# reading the PDF File
my_pdf = pdfx.PDFx("sample-url.pdf")
# get list of URLS
print(my_pdf.get_references_as_dict())

Output:

{'url': ['https://www.javatpoint.com/python-pass', 'https://www.javatpoint.com/python-tutorial', 'https://www.javatpoint.com/python-seaborn-library', 'https://www.javatpoint.com/', 'https://www.javatpoint.com/chatbot-in-python', 'https://www.javatpoint.com/python-if-else']}

Explanation:

In the above snippet of code, we have imported the pdfx library. We have then created a PDFx object to read the PDF file from the directory. Finally, we have called its get_references_as_dict() method to get all the URLs available in the input PDF file in the form of a dictionary.

Pages Extraction from PDFs as an Image using Python

In this section, we will look at extracting the pages of a PDF file in the form of images. In order to accomplish the task, we will need another short and simple library known as pdf2image. This library is usually utilized when we want to convert the pages of a PDF file into images.

Let us begin by installing the library.

Installing the pdf2image library

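We can use the following command with the pip installer to install the pdf2image library. pdf2image is a wrapper around the Poppler utilities, so Poppler must also be installed on the system:

Syntax:

$ pip install pdf2image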

Once the installation is completed, let us consider the following example to understand the working of the pdf2image library.

Example:

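from pdf2image import convert_from_path
# convert every page of the PDF to an image at 120 DPI
my_pages = convert_from_path("my_file.pdf", 120)
# iterating through pages, numbering the output files from 1
for n, page in enumerate(my_pages, start=1):
    page.save(f"output{n}.jpg", "JPEG")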

Explanation:

In the above snippet of code, we have imported the convert_from_path function from the pdf2image library. We have then called it with the path of the PDF file and the value 120. This value is the DPI, or dots per inch. The higher the value, the sharper and larger the resulting image. Finally, we iterate through the pages and save each one as a JPEG image.
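The effect of the DPI value on image size can be estimated with simple arithmetic: the size in pixels is the page size in inches multiplied by the DPI. The following sketch assumes a US Letter page of 8.5 x 11 inches; the helper name pixel_size is hypothetical, and the size actually produced for a given PDF depends on its page dimensions and may differ by a pixel because of rounding.

```python
from typing import Tuple


def pixel_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Estimate the rendered image size in pixels for a page given in inches."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter page (8.5 x 11 inches) at two different DPI values
print(pixel_size(8.5, 11, 120))  # (1020, 1320)
print(pixel_size(8.5, 11, 200))  # (1700, 2200)
```

Since both dimensions scale with the DPI, doubling the DPI roughly quadruples the number of pixels, and with it the memory needed for each page.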
