Python BeautifulSoup - find_all Class

You can use various methods when using Beautiful Soup (BS) to find elements in Python based on class. Let's investigate them:

What is BeautifulSoup or bs4?

A Python library called Beautiful Soup is dedicated to parsing XML and HTML documents. It makes information extraction from web pages easier, which makes it a useful tool for data mining, content extraction, and web scraping. Here are some essential details regarding Beautiful Soup:

Purpose

By building a parse tree of Python objects, Beautiful Soup lets you navigate and search HTML and XML documents. From this tree structure, you can then extract the data you need.
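As a rough illustration of navigating the parse tree, the sketch below parses a small, made-up HTML string (the markup and variable names are assumptions for this example only) and walks the tree with attribute-style access:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = "<html><head><title>Demo</title></head><body><p class='text'>Hello <b>world</b></p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access walks down the tree to the first matching tag
title_text = soup.title.string      # text inside <title>
first_p = soup.body.p               # first <p> inside <body>
bold_parent = soup.b.parent.name    # name of the tag that contains <b>

print(title_text)
print(first_p.get_text())
print(bold_parent)
```

Each step returns another object from the same tree, so navigation calls can be chained freely.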

Usage

It is frequently used for web scraping: fetching a webpage's HTML content and extracting valuable data from it. Typical uses include:

Extracting data from websites.
Scraping content for analysis or research.
Automating repetitive tasks that involve web data.

Features

The nested HTML data is parsed by Beautiful Soup, which then builds a structured tree to represent the document.
You can search for particular elements, attributes, or text by navigating this parse tree.
It offers ways to extract data by class, tag, or other criteria.
You can add, remove, or change elements in the parse tree.
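The last feature in the list, modifying the parse tree, could look something like the following sketch. The markup and the example.com link are placeholders invented for this illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<div><p class="old">First</p><span>Remove me</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Change: rewrite an attribute and the text of an existing tag
p = soup.find("p")
p["class"] = "new"
p.string = "Updated"

# Remove: delete the <span> from the tree entirely
soup.find("span").decompose()

# Add: create a new tag and append it to the <div>
new_tag = soup.new_tag("a", href="https://example.com")
new_tag.string = "link"
soup.div.append(new_tag)

result = str(soup)
print(result)
```

The changes affect only the in-memory tree; the original HTML string is left untouched.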

How to Install BeautifulSoup?

To install Beautiful Soup, run the following command in the terminal:

pip install beautifulsoup4

How to Install requests?

Requests makes it easy to send HTTP/1.1 requests. Python does not include this module by default, so enter the following command in the terminal to install it:

pip install requests

Using class_ and find_all():

The desired class name can be passed as the class_ keyword argument to the find_all() method. The trailing underscore is needed because class is a reserved word in Python. This is the suggested method for Beautiful Soup 4.1.2 and later:

Example

from bs4 import BeautifulSoup

# HTML
html_content = """
<html>
    <head>
        <title>Welcome to javaTpoint</title>
    </head>
    <body>
        <p class="text"><b>javaTpoint</b></p>
        <!-- Other elements with different classes -->
    </body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
class_name = "text"
elements_with_class = soup.find_all(class_=class_name)

for element in elements_with_class:
    print(element.text)

Output:

javaTpoint

Code Explanation

The code above uses the BeautifulSoup library to parse the HTML stored in the html_content variable, which contains a p tag with the class attribute set to "text" wrapping the phrase "javaTpoint". BeautifulSoup builds a parse tree from this content, and find_all(class_=class_name) returns every element whose class attribute includes "text". The for loop then goes through the returned elements and prints the text of each one, which in this instance is "javaTpoint".

Using CSS Selectors

Another approach is going with CSS (Cascading Style Sheets) selectors. CSS selectors can be used to locate elements with particular classes. For example:

Code

# CSS selector
elements_with_class = soup.select(".text")

for element in elements_with_class:
    print(element.text)

Output:

javaTpoint

Code Explanation

The code above uses the select() method to retrieve all elements with the class "text" through a CSS selector. After the HTML content has been parsed with BeautifulSoup, the selector .text is applied to find every element with the class "text". Lastly, the loop iterates through the selected elements and prints their text, "javaTpoint".
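CSS selectors can also combine a class with a tag name or an ancestor. The following sketch uses made-up markup (the id "main" and the sample text are assumptions) to show a few common combinations, along with select_one() for fetching a single match:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<div id="main">
  <p class="text">first</p>
  <span class="text">second</span>
</div>
<p class="text">third</p>
"""
soup = BeautifulSoup(html, "html.parser")

# tag.class: only <p> tags carrying the class "text"
paragraphs = [el.get_text() for el in soup.select("p.text")]

# Descendant selector: elements with class "text" inside #main
inside_main = [el.get_text() for el in soup.select("#main .text")]

# select_one() returns the first match, or None if nothing matches
first = soup.select_one(".text")

print(paragraphs)
print(inside_main)
print(first.get_text())
```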

Using a List of Classes

You can give find_all() a list of class names to search for elements that have any of the listed classes:

Code

class_list = ["text", "other-class"]
elements_with_classes = soup.find_all(class_=class_list)

for element in elements_with_classes:
    print(element.text)

Output:

javaTpoint

Explanation

In the above example, passing a list of classes to the find_all() method searches for all elements that have either the class "text" or the class "other-class". The loop then iterates through the selected elements and prints the text of each one.
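The distinction between "any of these classes" and "all of these classes" is easy to miss. As a small sketch on invented markup, a list passed to class_ matches elements with at least one of the classes, while a chained CSS selector such as .text.other-class requires both:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<p class="text">only text</p>
<p class="other-class">only other</p>
<p class="text other-class">both</p>
<p class="unrelated">neither</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A list means "any of": one matching class is enough
any_of = [p.get_text() for p in soup.find_all(class_=["text", "other-class"])]

# A chained class selector means "all of": both classes must be present
all_of = [p.get_text() for p in soup.select(".text.other-class")]

print(any_of)
print(all_of)
```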

Finding the Class from an HTML Document

First, create an HTML document and import the module. Then parse the content with BeautifulSoup and search the data by class name.

Example

html_doc = """<html><head><title>Welcome to javaTpoint</title></head> 
<body> 
<p class="title"><b>JTP</b></p> 


<p class="body">javaTpoint provides all the tutorials online, including all the technologies like Java, Python, C, C++, etc.
</body> 
"""
from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')

# Search by class name; find() returns only the first match
print(soup.find(class_="body"))

Output:

<p class="body">javaTpoint provides all the tutorials online, including all the technologies like Java, Python, C, C++, etc.
</p>

Code Explanation

In the above code, the HTML document is stored in the variable html_doc and contains elements such as <title>, <p>, and <b>, some of them with classes. The BeautifulSoup module is imported and used to parse the document. The find() method then searches for an element with the class "body" and returns the first occurrence, here the second paragraph. Note that the parser adds the missing closing </p> tag in the output.
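To make the difference between find() and find_all() concrete, here is a minimal sketch on invented markup. It also shows that find() returns None when nothing matches, which is worth guarding against before reading attributes of the result:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<p class="title">JTP</p><p class="body">one</p><p class="body">two</p>'
soup = BeautifulSoup(html, "html.parser")

first_body = soup.find(class_="body")       # first match only
all_bodies = soup.find_all(class_="body")   # every match, as a list
missing = soup.find(class_="footer")        # no match, so this is None

# Guard against None before calling methods on the result
missing_text = missing.get_text() if missing is not None else ""

print(first_body.get_text())
print(len(all_bodies))
print(missing)
```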

Program to Find All Classes in a URL

Import the modules, send a request to the URL with requests, and pass the response content to the BeautifulSoup() constructor. Then iterate over all tags and collect their class names.

Code

# Import Module 
from bs4 import BeautifulSoup 
import requests 

# Website URL 
URL = 'https://www.javatpoint.com/'

# class list set 
class_list = set() 

# Page content from Website URL 
page = requests.get( URL ) 

# parse html content 
soup = BeautifulSoup( page.content , 'html.parser') 

# get all tags 
tags = {tag.name for tag in soup.find_all()} 

# iterate all tags 
for tag in tags: 

	# find all element of tag 
	for i in soup.find_all( tag ): 

		# if tag has attribute of class 
		if i.has_attr( "class" ): 

			if len( i['class'] ) != 0: 
				class_list.add(" ".join( i['class'])) 

print( class_list ) 

Output:

{'homecol2', 'ddsmoothmenu', 'points', 'right1024', 'adPushupAds', '__cf_email__', 'hrhomebox', 'homecontent', 'column4', 'lazyload', 'footer1', 'header', 'footer2', 'onlycontent', 'gra1', 'h3', 'h2', 'firsthomecontent', 'headermobile'}

Code Explanation

The above code uses the BeautifulSoup library to scrape content from 'https://www.javatpoint.com/'. First, it imports the necessary modules, BeautifulSoup and requests. The script then fetches the content of the webpage with the requests module and parses it. Next, it collects the names of all HTML tags present in the page and iterates through the elements of each tag to find those that have a class attribute. For each such element, it joins the element's classes into a single string and adds that string to a set called class_list. Finally, it prints the set of unique class values found in the webpage's HTML content. The exact output depends on the page's content at the time of the request.
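The same idea can be tried without a network connection. The sketch below is a variation, not the program above: it runs on an inline, made-up HTML string, visits every tag once with find_all(True), and stores each class name separately rather than joining an element's classes into one string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html = """
<div class="header main">
  <p class="text">a</p>
  <p class="text note">b</p>
  <span>no class here</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

class_names = set()
# find_all(True) matches every tag in the document
for tag in soup.find_all(True):
    # class is a multi-valued attribute, so its value is a list of names;
    # tags without a class attribute contribute an empty list
    class_names.update(tag.get("class", []))

print(sorted(class_names))
```

Storing individual names makes it easier to see which classes exist, while the joined form used in the program above preserves which classes appear together on one element.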

find_all Using Regex

Regular expressions are also supported by the find_all() method: pass a compiled pattern in place of a tag name. For instance, to find all tags whose names begin with the letter b:

Code

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# --> body
# --> b

Output:

body
b

Code Explanation

In the code above, the regular expression module (re) is used together with BeautifulSoup to find all HTML tags whose names start with the letter "b" in the parsed soup object. The pattern ^b is tested against the name of each tag in the document. The program prints the name of every matching tag, here body and b, each on its own line.

Alternatively, to find all tags whose names contain the letter t, pass an unanchored pattern to find_all():

Code

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# --> html
# --> title

Output:

html
title

Code Explanation

In the above program, the re module is used with BeautifulSoup to find HTML tags whose names contain the letter "t". Because the pattern t has no anchor, it matches anywhere in the tag name. The loop prints the names of the tags that meet this criterion, here html and title, each on a new line.
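A compiled pattern can be passed to class_ as well, which ties regex matching back to searching by class. The following is a small sketch on invented markup; the class names are assumptions chosen for the example:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<p class="text-large">big</p>
<p class="text-small">small</p>
<p class="caption">cap</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Match <p> tags that have a class starting with "text-"
pattern = re.compile(r"^text-")
matches = [p.get_text() for p in soup.find_all("p", class_=pattern)]

print(matches)
```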

find_all Through Custom Functions

For more complex queries, you can pass a function to the find_all() method. The function receives a tag and returns True if the tag should be included in the results. Here is an example:

Code

def custom_selector(tag):
	# To return "span" tags with a class name of "target_span"
	return tag.name == "span" and tag.has_attr("class") and "target_span" in tag.get("class")

soup.find_all(custom_selector)

Output:

[]

Code Explanation

In the above code, we define a selector function named custom_selector(). It returns True when a tag is a span and its class attribute contains "target_span". The find_all() method calls this function on every tag in the parsed HTML and returns the tags for which it returns True. The output is an empty list here because the parsed document contains no such span.
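To see a custom function return a non-empty result, it can be run against markup that does contain a matching element. The sketch below uses made-up HTML with one matching span and several near misses; the selector is a slightly shortened equivalent of the function above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one real match and three near misses
html = """
<span class="target_span">match</span>
<span class="other">wrong class</span>
<div class="target_span">wrong tag</div>
<span>no class</span>
"""
soup = BeautifulSoup(html, "html.parser")

def custom_selector(tag):
    # True only for <span> tags whose class list contains "target_span";
    # get() with a default avoids an error on tags without a class
    return tag.name == "span" and "target_span" in tag.get("class", [])

found = soup.find_all(custom_selector)
texts = [tag.get_text() for tag in found]

print(texts)
```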

Conclusion

Beautiful Soup is a popular Python library for parsing HTML and XML documents, which simplifies web scraping. With it, you can locate elements by class using the find_all() method with the class_ argument, CSS selectors through select(), regular expressions, or custom functions, and then extract the relevant data.
