
PySpark UDF of MapType

What is PySpark DataFrame?

A DataFrame in PySpark is a distributed collection of data organized into named columns, similar to a table in relational SQL. We can create a PySpark DataFrame with different functions available on a SparkSession.
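For instance, a small DataFrame can be built from a list of tuples with createDataFrame (a minimal sketch; the app name and column names are illustrative):

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession, the entry point for DataFrame creation
spark = SparkSession.builder.appName("example").getOrCreate()

# Build a small DataFrame from a list of tuples with explicit column names
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.show()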

PySpark MapType

MapType in PySpark is the data type for dictionary-like values: a map object that stores key-value pairs. It is built from three components: a key type (a data type), a value type (a data type), and valueContainsNull (a Boolean).

PySpark lets you extend Spark DataFrames with custom logic through user-defined functions (UDFs). UDF support for primitive data types is straightforward, but dealing with complex structures such as MapType with mixed value types requires a more tailored approach.

Syntax of PySpark MapType
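The constructor lives in pyspark.sql.types. A minimal sketch (the variable name is illustrative):

from pyspark.sql.types import MapType, StringType, IntegerType

# General form: MapType(keyType, valueType, valueContainsNull=True)
# For example, a map from string keys to integer values:
fruit_map = MapType(StringType(), IntegerType(), valueContainsNull=True)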

where keyType is the data type of the keys in the map (keys are not allowed to be null),

valueType is the data type of the values in the map, and

valueContainsNull is a Boolean flag indicating whether the values may contain nulls.

PySpark UDF of MapType

The pyspark.sql.functions module provides the udf() function, which turns a custom Python function into a UDF. It takes two arguments: the function and its return type.

Syntax of the UDF function in PySpark
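A minimal sketch of the call (the return type defaults to StringType if omitted):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# General form: udf(f, returnType)
# f          - the Python function to wrap
# returnType - the Spark data type the function returns (StringType by default)
square_udf = udf(lambda x: x * x, IntegerType())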

A MapType column stores a map or dictionary-like data structure that links keys to values. It is a set of key-value pairs, where the keys and values can each have their own data type.
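For instance, a DataFrame with a MapType column can be declared with an explicit schema (a sketch; the column names and data are illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

schema = StructType([
    StructField("name", StringType()),
    StructField("scores", MapType(StringType(), IntegerType())),
])

# Each row carries a dictionary of string keys and integer values
df = spark.createDataFrame([("Alice", {"math": 90, "physics": 85})], schema)
df.printSchema()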

PySpark UDFs, also called Spark UDFs or user-defined functions, allow us to define custom functions or operations according to our needs. This lets us implement behaviour that is not covered by Spark's built-in functions.

Spark UDFs are flexible because they can be written in several languages, such as Scala, Java, Python, or R. UDFs in PySpark are executed row by row, which adds serialization overhead compared with Spark's built-in functions.

The architecture of PySpark UDF

Creating and running a Spark UDF from Python involves a few steps:

  • The Python function is serialized and distributed to the worker nodes.
  • Spark starts a Python process on each worker node and sends it the data.
  • The function is executed row by row inside that Python process.
  • Once the Python side finishes, the results are returned to Spark.

Registering UDF in PySpark

Let's create a UDF in PySpark using different methods.

Firstly, we will import all the necessary libraries, including the udf function from pyspark.sql.functions:
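A typical import block for the examples that follow might look like this (a sketch, assuming the standard pyspark.sql modules):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, MapType

# Entry point for building DataFrames
spark = SparkSession.builder.appName("udf_maptype").getOrCreate()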

Then, we will make a DataFrame containing integers:
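One way to build it is spark.range, which yields a single non-nullable id column, consistent with the schema printed below (a sketch):

# A single-column DataFrame holding the integers 1 through 10
df = spark.range(1, 11)
df.printSchema()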

Output:

root
 |-- id: long (nullable = false)

Now, we will create a UDF to calculate the square of an integer. We will create it in two different ways:

1. We will create the UDF with the help of the @udf decorator. This is the simplest way to define a UDF.

Code:
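A sketch of the decorator approach, reusing the df built above. The function name becomes the UDF's name, which is why the output column is labelled square_integer(id):

# The @udf decorator turns the function itself into a usable UDF
@udf(returnType=IntegerType())
def square_integer(x):
    return x * x

df.select(square_integer(df.id)).show(5)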

Output:

+------------------+
|square_integer(id)|
+------------------+
|                 1|
|                 4|
|                 9|
|                16|
|                25|
+------------------+
only showing top 5 rows

Explanation:

We have printed the squares of the integers using this UDF. We applied the @udf decorator, with IntegerType as the return type, to a function that returns the square of its input. Then, using the show function, we printed the result.

2. We will create the UDF with the udf() method, passing the function and its return type as parameters.

Code:
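A sketch of the explicit udf() call; the wrapped function's name again determines the output column label:

# Define a plain Python function first
def square_integer(x):
    return x * x

# Wrap it with udf(), passing the function and its return type
square_udf = udf(square_integer, IntegerType())

df.select(square_udf(df.id)).show(5)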

Output:

+------------------+
|square_integer(id)|
+------------------+
|                 1|
|                 4|
|                 9|
|                16|
|                25|
+------------------+
only showing top 5 rows

Explanation:

We have printed the squares of the integers using this UDF. First, we defined a function that returns the square of an integer. Then, we wrapped it with the udf() function, passing IntegerType as the return type. Finally, using the show function, we printed the result.

Now, we will work with map values using different DataFrame functions.

Filter and Access the Map Values

In this example, we will access the map values using getItem(), which retrieves the value stored under a given key, and filter the rows using the filter() method. Here, we will make a DataFrame containing different fruits and their counts.

Code:
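A sketch that reproduces the output below; the second row is included only so that filter() has something to drop:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, MapType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("fruit_count", MapType(StringType(), IntegerType())),
])
data = [(1, {"Orange": 3, "Apple": 2}), (2, {"Banana": 5})]
df = spark.createDataFrame(data, schema)

# getItem("Apple") pulls the Apple count out of the map;
# filter() keeps only the rows that actually have an Apple entry
result = (df.withColumn("Apple_count", df.fruit_count.getItem("Apple"))
            .filter(df.fruit_count.getItem("Apple").isNotNull()))
result.show(truncate=False)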

Output:

+---+-------------------------+-----------+
|id |fruit_count              |Apple_count|
+---+-------------------------+-----------+
|1  |{Orange -> 3, Apple -> 2}|2          |
+---+-------------------------+-----------+

Explanation:

We pulled the Apple count out of the map column, filtered out the rows without an Apple entry, and printed the fruits with their counts.






