Scraping HTML Tables with Pandas and BeautifulSoup

Extracting information from the web is an essential skill for analysts, researchers, and data enthusiasts in today's data-driven world. HTML tables, found on many websites, are a good source of structured data, and they often hold valuable records waiting to be collected. The Python libraries Pandas and BeautifulSoup make this task manageable.

Pandas is a powerful data manipulation library that makes working with structured data easier. Combined with the web scraping library BeautifulSoup, it gives Python developers an effective toolkit for quickly extracting data from HTML.

This article briefly describes the libraries used for scraping HTML tables, along with implementation steps and examples.

What is Web Scraping?

Web scraping is the process of extracting data from websites. A scraper locates the useful data through the page's HTML tags. There are numerous approaches to web scraping, and many Python libraries support it, including Selenium and BeautifulSoup. Scraped data is a major source of input for models and algorithms, which makes web scraping very useful when building machine learning models. It also supports natural language processing (NLP) work that analyses customers' behaviour and preferences in order to make recommendations for a smooth experience. The scraped data is typically saved to a local file, where it can be further cleaned and analysed. BeautifulSoup, requests, and Selenium are among the most common and widely used Python libraries for web scraping.

What are HTML tables?

HTML tables are an important part of web development because they present data in an orderly, structured way. Intersecting rows and columns form the cells that hold the data. Websites often use HTML tables to present large amounts of data in tabular form so that users can understand and analyse the content more easily. For web developers, HTML tables are a useful tool whatever the complexity of what they are building, from a basic contact form to an advanced data visualisation tool.

Why HTML Tables?

Structured data on the web is very often presented in HTML tables. They are used for many purposes, from financial reports and statistical analyses to sports scores and weather forecasts. Extracting data from these tables by hand can be tedious and time-consuming. Web scraping offers a practical solution by automating the process.

What is BeautifulSoup?

BeautifulSoup is a popular Python package for web scraping. It provides a quick and straightforward way to extract data from HTML and XML documents. It is usually used together with the requests library, which fetches the page; BeautifulSoup then parses the markup so that the relevant data can be pulled out of the HTML tags. Its user-friendly interface has made BeautifulSoup a go-to tool for web scraping.

Pandas with BeautifulSoup

Pandas is a well-known Python library that offers a wide range of data manipulation functions built around simple data structures. By combining Pandas with the BeautifulSoup library, users can parse HTML and XML documents found on websites and extract useful information from them. Together, these tools make web scraping efficient for data gathering and analysis.

Process of Scraping HTML tables using Pandas and BeautifulSoup

Prerequisites:

Python libraries used:

Requests: It sends HTTP requests to the website to fetch the page. It can be installed using the pip command:

pip install requests

BeautifulSoup: It parses HTML and XML documents. This can be installed using:

pip install beautifulsoup4

Pandas: This library provides simple structures for data manipulation. It can be installed with this command in the command prompt or any terminal:

pip install pandas

There are two ways to scrape HTML tables:

Using only Pandas library
Using only BeautifulSoup library

Using Pandas Library

The pandas library offers the pandas.read_html() function, which extracts data from HTML tables on a web page. Internally it relies on a parser such as lxml, or BeautifulSoup with html5lib, and it returns the tables found on the page as a list of DataFrames.

Syntax of pandas.read_html( ) function

data = pd.read_html(url)

data = list of DataFrames, one per table found
pd = alias under which the pandas library is imported
url = link to the website
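
As a minimal sketch of the call described above, the snippet below feeds read_html() a small hand-written HTML string instead of a live URL, so it runs offline. The table contents and variable names are invented for illustration, and the sketch assumes a parser backend for pandas (lxml, or BeautifulSoup with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup standing in for a downloaded page.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
  <tr><td>Linus</td><td>85</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
first = tables[0]

print(len(tables))
print(first)
```

Because the result is always a list, even a page with a single table has to be indexed (tables[0]) before the DataFrame can be used.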

Implementing scraping HTML tables using pandas.read_html() function

Here is a detailed step-by-step guide explaining how to use pandas.read_html() function to scrape HTML tables from any website:

Step 1: Importing Necessary Libraries

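from io import StringIO

import pandas as pd
import requests

The code imports the requests and pandas libraries, plus StringIO from the standard library. The requests library fetches the page at the website URL, pandas parses its tables with the read_html() function, and StringIO wraps the downloaded HTML text so that read_html() accepts it (recent pandas versions deprecate passing a raw HTML string).

Step 2: Reading HTML tables using the website URL

url = 'https://www.javatpoint.com/sql-add-drop-update-column-operation'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)

# Wrap the HTML text in StringIO; read_html returns a list of DataFrames
data = pd.read_html(StringIO(response.text))
print(data)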

Here, the URL is defined and a headers dictionary with a browser-like User-Agent is built, because some websites reject requests that do not look like they come from a browser. The get() function requests the URL with these headers.
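
To see what the headers argument contributes without contacting any website, a request can be prepared and inspected before it is sent. This is only an illustrative sketch: the URL and the User-Agent string below are hypothetical placeholders, not values from this article.

```python
import requests

# Hypothetical URL and User-Agent string, used only for illustration.
url = "https://example.com/tables"
headers = {"User-Agent": "Mozilla/5.0 (compatible; table-scraper-demo)"}

# Build the request without sending it, to inspect what would go out.
prepared = requests.Request("GET", url, headers=headers).prepare()

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

In a real call it is usually worth passing a timeout to requests.get() and calling response.raise_for_status() afterwards, so that a slow or failed request does not go unnoticed.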

Step 3: Exploring Tables

for table in data:
    print(table.head())

This loop iterates over the tables and prints the first five rows of each.

Getting a particular table

df = data[2]
df

Any desired table can be selected this way: indexing the list returned by read_html() picks a table by position. Here, index 2 selects the third table on the page.
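
A positional index breaks if the page gains or loses a table. One possible alternative, sketched below on invented sample markup, is the match parameter of read_html(), which keeps only the tables containing text that matches the given pattern. The table contents here are assumptions made for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables.
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
</table>
"""

# Keep only tables whose text matches the pattern "Score".
scores = pd.read_html(StringIO(html), match="Score")
score_table = scores[0]

print(len(scores))
print(score_table)
```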

Step 4: Saving the table

df.to_csv('table.csv')

The to_csv() function saves the selected table to a CSV file.
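
By default to_csv() also writes the DataFrame's index as an extra first column. The sketch below, which uses an invented DataFrame and a temporary directory rather than the scraped table, shows how index=False leaves that column out and how the file can be read back with read_csv().

```python
import os
import tempfile

import pandas as pd

# Hypothetical table standing in for the scraped DataFrame.
df = pd.DataFrame({"Name": ["Ada", "Linus"], "Score": [90, 85]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.csv")
    # index=False omits the row index from the file.
    df.to_csv(path, index=False)
    # Read the file back to check what was written.
    restored = pd.read_csv(path)

print(restored)
```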

Here is the complete code of how to scrape html tables using the pandas library:

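from io import StringIO

import pandas as pd
import requests

# Fetch the page with a browser-like User-Agent header
url = 'https://www.javatpoint.com/sql-add-drop-update-column-operation'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)

# Parse every table on the page into a list of DataFrames
data = pd.read_html(StringIO(response.text))
print(data)

# Preview the first rows of each table
for table in data:
    print(table.head())

# Select the third table and save it to a CSV file
df = data[2]
print(df)
df.to_csv('table.csv')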

The ability of pandas to scrape HTML tables makes data extraction and analysis straightforward. Pandas can also be used together with BeautifulSoup when more of a web page than its tables has to be scraped.

While scraping with pandas is simple, it is not well suited to element-wise scraping of HTML tables. When only a few elements or a portion of a table is required, pandas alone is not enough. The BeautifulSoup library can be used for this.

Using BeautifulSoup library

Let's start with the scraping of HTML tables using the BeautifulSoup library.

First, the basic structure of HTML needs to be understood. HTML, or HyperText Markup Language, consists of various tags such as table, heading, and body.

The structure of the HTML table is:

<table>
    <tr>
        <th>...</th>
        <th>...</th>
        <th>...</th>
    </tr>
    <tr>
        <td>...</td>
        <td>...</td>
        <td>...</td>
    </tr>
    <tr>
        <td>...</td>
        <td>...</td>
        <td>...</td>
    </tr>
    <!-- more rows -->
</table>

The <table> tag is used to add tables to web pages. Within the table, the <tr> tag creates each individual table row, while the <th> tag is used for table header cells. The <td> tag stands for table data and holds the content of each cell.
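
The roles of these tags can be seen by parsing a small table with BeautifulSoup. The markup below is an invented sample, and the sketch uses Python's built-in html.parser so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table: one header row and two data rows.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
  <tr><td>Linus</td><td>85</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <tr> is one row, whether it holds headers or data.
rows = soup.find_all("tr")

# <th> cells carry the column headers, <td> cells carry the data.
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
first_record = [td.get_text(strip=True) for td in rows[1].find_all("td")]

print(len(rows))
print(column_names)
print(first_record)
```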

Implementation of Scraping HTML Tables using BeautifulSoup Library

Here is a detailed step-by-step guide explaining how to use BeautifulSoup to scrape HTML tables from any website:

Step 1: Importing Required Libraries

import pandas as pd 
import requests 
from bs4 import BeautifulSoup

First, the required libraries are imported. The requests library fetches the page at the website's URL, BeautifulSoup parses the HTML, and pandas later stores the extracted table in a DataFrame.

Step 2: Fetching the Webpage

url = '/any url/'  # placeholder: replace with the address of the page to scrape
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)

In this step, the get() function requests the web page, sending the headers along with the request.

Step 3: Parsing the HTML tags

soup = BeautifulSoup(response.text, 'html.parser')

A BeautifulSoup object is created with the html.parser parser, which parses the HTML tags and exposes the structure of the web page.

Step 4: Finding the table on the web page

data = soup.find('table')

The find() function returns the first table on the web page. To collect every table, find_all('table') can be used instead.
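
The difference between the two search functions matters here: find() returns only the first matching element (or None if there is none), while find_all() returns every match. The sketch below illustrates this on invented markup; the id values are assumptions made for the example, since real pages name their tables differently or not at all.

```python
from bs4 import BeautifulSoup

# Hypothetical page containing two tables with made-up id attributes.
html = """
<table id="cities"><tr><th>City</th></tr><tr><td>Oslo</td></tr></table>
<table id="scores"><tr><th>Name</th></tr><tr><td>Ada</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

first_table = soup.find("table")                 # first <table> only
all_tables = soup.find_all("table")              # every <table> on the page
scores_table = soup.find("table", id="scores")   # a specific table by id
missing = soup.find("table", id="no-such-id")    # no match gives None

print(first_table["id"], len(all_tables), scores_table["id"], missing)
```

Checking the result for None before using it avoids an AttributeError on pages that turn out to have no matching table.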

Step 5: Extracting the data from the table

table = []
# Each <tr> element is one row of the table
for tr in data.find_all('tr'):
    row = []
    # Header cells (<th>) and data cells (<td>) are read alike
    for cell in tr.find_all(['th', 'td']):
        row.append(cell.get_text().strip())
    table.append(row)

A for loop retrieves the table's data. The rows are found through the tr tag, and every cell inside each row is visited through the th and td tags to extract its text. Each cell's text is added to the row with the append() function, and each completed row is appended to the table, which ends up as a list of rows.
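
The same loop can be wrapped in a small helper so that it is easy to try on sample markup. The function name table_to_rows and the sample table are hypothetical, and skipping rows without cells is a choice made for this sketch, not something the steps above require.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return the text of a parsed <table> as a list of rows."""
    rows = []
    for tr in table.find_all("tr"):
        # Read header (<th>) and data (<td>) cells, trimming whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


# Hypothetical sample table with padded text and one empty row.
html = """
<table>
  <tr><th> Name </th><th> Score </th></tr>
  <tr></tr>
  <tr><td> Ada </td><td> 90 </td></tr>
  <tr><td> Linus </td><td> 85 </td></tr>
</table>
"""

parsed = BeautifulSoup(html, "html.parser").find("table")
extracted = table_to_rows(parsed)
print(extracted)
```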

Step 6: Saving the data into a dataframe

df = pd.DataFrame(table[1:], columns = table[0])  
df

The table data can be stored in a DataFrame, or written to a CSV file, using the pandas library. The complete code for scraping a table with BeautifulSoup is given below.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Fetch the page; '/any url/' is a placeholder for the real address
url = '/any url/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)

# Parse the HTML and take the first table on the page
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find('table')

# Collect the text of every cell, row by row
table = []
for tr in data.find_all('tr'):
    row = []
    for cell in tr.find_all(['th', 'td']):
        row.append(cell.get_text().strip())
    table.append(row)

# The first row holds the column names; the rest are data rows
df = pd.DataFrame(table[1:], columns=table[0])
print(df)
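
The complete code above assumes that the page contains a table. One way to make that assumption explicit is sketched below as a hypothetical helper, html_to_dataframe, which takes the HTML text (for example response.text) and raises an error when no usable table is found. The sample markup is invented so that the sketch runs offline.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_to_dataframe(html):
    """Convert the first <table> in an HTML string to a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found in the HTML")

    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    if not rows:
        raise ValueError("the table contains no cells")

    # Treat the first row as the column names, as in the steps above.
    return pd.DataFrame(rows[1:], columns=rows[0])


# Hypothetical sample page.
sample_html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
  <tr><td>Linus</td><td>85</td></tr>
</table>
"""

result = html_to_dataframe(sample_html)
print(result)
```

All values extracted this way are strings, so numeric columns may still need converting, for example with pd.to_numeric().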

Conclusion:

Users can extract valuable data from the web by scraping HTML tables with the help of Pandas and BeautifulSoup. This pair of libraries offers a flexible and efficient solution for automating data collection, analysis, and research. Once data analysts learn how to use these tools, they can draw on the web's data to support their projects and make well-informed decisions.
