Window Functions in Pandas

Introduction

The language also provides Pandas, an official data manipulation tool in Python, which includes a number of functions for data analysis and mela. There is something as powerful as window functions that can be implemented in a panda. Window functions are the operations that enable you to perform calculations in a range of rows, giving some information about the trends and patterns. The purpose of this article is to combine information about window functions in the Pandas library, to describe their types and examples of use, and to indicate their application.

What are the Window Functions?

Window functions work on a set of table rows related to the current row and calculate something on those rows. They are similar to aggregate functions. However, unlike most aggregate functions, window functions do not consolidate the result of numerous rows into a single row value. They retain the same layout of the row but with extra summary information shown as well.

The analytics tools in window functions are most commonly used to compute measures related to running totals, moving averages and various other cumulative parameters within a particular window of data.

Types of Window Functions

Pandas offers several types of window functions, including:

Rolling Windows: Create a moving average or moving window over those statistics.
Expanding Windows: Compute marginals that are aggregated over cumulative statistics that grow over time.
Exponential Moving Windows: Compute the Exponential Weighted Moving Average on a moving window.
GroupBy Windows: This should be done in groups, preferably by performing window calculations.

Setting Up the Environment

Before showing examples, let's prepare the context and load the libraries and a small example dataset into a variable.

 
import pandas as pd
import numpy as np
# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)
print(df)   

Output:

 
    date  sales
0 2021-01-01     15
1 2021-01-02     20
2 2021-01-03     21
3 2021-01-04     18
4 2021-01-05     30
5 2021-01-06     35
6 2021-01-07     40
7 2021-01-08     45
8 2021-01-09     50
9 2021-01-10     55

Explanation

Start by importing the Pandas library and give it an alias, Pandas, as 'pd'.
Import the pandas and NumPy as 'pd' and 'np', respectively, for numerical computations.
Create a dictionary named 'data' with two keys:
'date:' Creates a variety of dates from January 1st to January 10th, 2021.
'sales:' It contains a list of sales that have been made for each date.
Import the data into a Pandas DataFrame named df, but first, the dictionary data should be flattened flat.
Show the DataFrame 'df', which will illustrate the data in the source in the form of a table.

Rolling Windows

Running or Rolling windows is used to apply a funtion that calculates data over a sliding window of a given size. Another feature that plays into this is the 'rolling()' method in Pandas.

1. Calculating a Rolling Mean

A rolling mean takes the average of some number of observations and 'rolls' it over time, thereby masking short-term variations while revealing long-term trends or oscillations.

 
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)
Df['rolling_mean'] = df['sales'].rolling(window=3).mean()
Print(df)   

Output:

 
date  sales  rolling_mean
0 2021-01-01     15           NaN
1 2021-01-02     20           NaN
2 2021-01-03     21     18.666667
3 2021-01-04     18     19.666667
4 2021-01-05     30     23.000000
5 2021-01-06     35     27.666667
6 2021-01-07     40     35.000000
7 2021-01-08     45     40.000000
8 2021-01-09     50     45.000000
9 2021-01-10     55     50.000000

Explanation

Whenever working with a data frame, import the Pandas library under the name' PD' as part of the work.
Bring in the NumPy library under the alias 'np' for numerical computations.
Create a dictionary named 'data' with two keys:
'date: Returns a selection of dates that start from 1. 01. 2021 and end on 10. 01. 2021.
'sales: It gives a sales report that lists companies for each date.
Transform the dictionary data to a Pandas DataFrame data object referred to as 'df'.
Now, it is required to find the average of the top 20 values of the sales column with an expansion of 3 and name the new column as 'rolling_mean'.
Show the contents of the DataFrame 'df' to see the original data and the newly created 'rolling_mean' column.
The last-produced DataFrame will contain the original 'date and sales' columns along with the newly created rolling_mean column for each row. The rolling mean is the midpoint of the current value and the previous two values of sales. Thus, for the first two rows, the rolling mean will be equal to NaN, which means that there are not enough prior values to compute a mean.

2. Rolling Sum

You can also calculate a rolling sum, which gives a sum of the values in the window.

 
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)

df['rolling_mean']=df['sales'].rolling(window=3).mean()
df['rolling_sum']=df['sales'].rolling(window=3).sum()
print(df)   

Output:

 
 date  sales  rolling_sum
0 2021-01-01     15          NaN
1 2021-01-02     20          NaN
2 2021-01-03     21         56.0
3 2021-01-04     18         59.0
4 2021-01-05     30         69.0
5 2021-01-06     35         83.0
6 2021-01-07     40        105.0
7 2021-01-08     45        120.0
8 2021-01-09     50        135.0
9 2021-01-10     55        150.0

Explanation

Use Pandas as an alias 'pd' to handle data.
Import the last library in the list, named 'NumPy', and assign the alias 'np' to the numeric operations.
Create a dictionary named 'data' with two keys: Create a dictionary named 'data' with two keys:
'date: Create a set of dates starting from January 1, 2021, and include 10 days in the set.
'sales: Provide a list of the following sales and details for each date :
Transform the gathered dictionary data into a form of a Pandas DataFrame called 'df'.
For the current pivot table, make a new column named 'rolling_mean' and use a rolling mean function with a window size of 3 on the column 'sales'.
In this step, the DataFrame df was created to print the original data and the newly created column 'rolling_mean'.
Defining the size of the window to be 3, calculate the rolling sum of the sales column and then assign it to a new column with the 'rolling_sum' name.
To show the original data along with the 'rolling_mean' and 'r'olling_sum' columns, we can print the DataFrame 'df' once more.
The last shape of the DataFrame will demonstrate the date and sales columns from the initial DataFrame as well as the new 'rolling_mean' and 'rolling_sum' calculated based on the particular row. The values of the rolling mean and rolling sum are calculated over a window of 3 rows, and the result of such computation is displayed in the subsequent rows. For the first two rows with respect to these columns, the values will be 'NaN' as there are no earlier values that would enable one to compute them.

Rolling Standard Deviation

To evaluate the scatter of values in the window applied in the previous step, you may use the rolling standard deviation.

 
import numpy as np
import pandas as pd

# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)

df['rolling_std']=df['sales'].rolling(window=3).std()
print(df)   

Output:

 
date  sales  rolling_std
0 2021-01-01     15          NaN
1 2021-01-02     20          NaN
2 2021-01-03     21     3.214550
3 2021-01-04     18     1.527525
4 2021-01-05     30     6.244998
5 2021-01-06     35     8.736895
6 2021-01-07     40     5.000000
7 2021-01-08     45     5.000000
8 2021-01-09     50     5.000000
9 2021-01-10     55     5.000000

Explanation

The ` NumPy' library is imported under 'np' to perform numerical operations.
Create a dictionary named 'data' with two keys:
'date: Create a set of dates starting from the beginning of the year, 01. 01. 2021, for 10 days.
'sales: Demand for the sales and provide the number to be sold every day starting from monday to Sunday.
Transform the dictionary data into a Pandas DataFrame, which will be named 'df'.
Using the sales column, use the rolling function and assign it a window size of 3. Then, calculate the rolling column's standard deviation and store it in "rolling_std."
It is recommended that the DataFrame df be printed to revisit the original data as well as the introduced 'rolling_std' column.
Reprint the DataFrame 'df' once more (it may seem unnecessary given the second print statement; however, it prints the same DataFrame).
The last DataFrame will list the original date and sales for each row and have a new name column called 'rolling_std'. Rolling standard deviation is computed in a moving window of three rows. For the first two rows, rolling_std will be 'NaN' because there are no earlier values to calculate the standard deviation.

Expanding Windows

Growing windows start at index I and contain all the elements from I to J. This functionality is available in Pandas with the help of the expanding() method.

1. Calculating an Expanding Mean

An expanding mean offers the running total average from the initial row to the present row.

 
import numpy as np

# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)

df['expanding_mean']=df['sales'].expanding().mean()
print(df)    

Output:

 
date  sales  expanding_mean
0 2021-01-01     15       15.000000
1 2021-01-02     20       17.500000
2 2021-01-03     21       18.666667
3 2021-01-04     18       18.500000
4 2021-01-05     30       20.800000
5 2021-01-06     35       23.166667
6 2021-01-07     40       25.571429
7 2021-01-08     45       28.000000
8 2021-01-09     50       30.444444
9 2021-01-10     55       32.900000

Explanation

Import the library of NumPy as 'np' for operations of numbers.
Create a dictionary named data with two keys:
'date: Using the Nth root, create 10 date values beginning from Jan 1st, 2021.
'sales: Give possible values of sales figures for each date.
Transform the obtained dictionary data into the Pandas DataFrame called 'df'.
Find the expanding mean of the sales column and put it in a new column labeled expanding mean. The expanding mean, on the other hand, is computed by adding up the previous rows to the present row.
You can print out the DataFrame df to see the original data and the newly created 'expanding_mean' column.
The last DataFrame depicts the original date and sales columns, along with the Expanding mean calculated for each column. The increasing mean gives a running total average from the beginning of the table to the specified row.

2. Expanding Sum

Just like you do in expanding mean, you can also determine expanding sum.

 
import numpy as np
# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)
df['expanding_sum']=df['sales'].expanding().sum()
print(df)   

Output:

 
date  sales  expanding_sum
0 2021-01-01     15           15.0
1 2021-01-02     20           35.0
2 2021-01-03     21           56.0
3 2021-01-04     18           74.0
4 2021-01-05     30          104.0
5 2021-01-06     35          139.0
6 2021-01-07     40          179.0
7 2021-01-08     45          224.0
8 2021-01-09     50          274.0
9 2021-01-10     55          329.0

Explanation

Allow us to import the library NumPy, which is referred to as 'np' for numerical calculations.
Create a dictionary named 'data' with two keys:
'date: Using 'pd. date_range(start='2021-01-01', periods=10, freq='D')', create a sequence of dates from January 1, 2021, to January 10, 2021.
'sales: Give the array of sales for each date: [15, 20, 21, 18, 30, 35, 40, 45, 50, 55].
Using pd. DataFrame (data) transforms the dictionary data into a pandas data frame named 'df'.
Apply the 'expanding()' function on the 'sales' column as it computes the cumulative sums up to the particular row and saves it into the new variable called 'expanding_sum'.
Before adding a rolling sum of 'grp', print the DataFrame df to see the original data as well as the new 'expanding_sum' column.
The last DataFrame includes date and sales columns and 'expanding_sum' as a measure modeled by applying the expanding function for each row. The count shows the total sum of the sales values from the first up to the present row, as the cell is formatted by adding the prior sales amounts. The 'expanding_sum' column indicates that, by the time we get to each row of the dataset, it holds the sum of all the sales values going back to the beginning of the dataset.

Exponential Moving Windows

Exponential moving windows make use of weights that experience a decrease in an exponential fashion over time. The one employed for this is 'ewm()', which is found in Pandas.

This is preferred since it transposes the use of an exponential moving average. Remember that we already identified two approaches to measuring the volatility in stock prices.

1. Exponential Moving Mean

The exponential moving average is the perfect method when it comes to offering more weight to the current observation over the previous ones.

 
import numpy as np
# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)
df['ewm_mean'] = df['sales'].ewm(span=3, adjust=False).mean()
print(df)   

Output:

 
date  sales   ewm_mean
0 2021-01-01     15  15.000000
1 2021-01-02     20  17.500000
2 2021-01-03     21  19.250000
3 2021-01-04     18  18.625000
4 2021-01-05     30  24.312500
5 2021-01-06     35  29.656250
6 2021-01-07     40  34.828125
7 2021-01-08     45  39.914062
8 2021-01-09     50  44.957031
9 2021-01-10     55  49.978516

Explanation

Import pandas and numpy libraries.
Create a dictionary named 'data' with two keys. Add the dictionary 'data' with two entries:
'date: Applying the 'pd. date_range()' function, arrange a certain number of dates from the beginning of the year 2021 up to the 10th day = pd. date_range(start = '2021-01-01', periods = 10, freq = 'D'.
'sales: For each date, here was the list of the sales 15, 20, 21, 18, 30, 35, 40, 45, 50, 55. Transform the dictionary data into a tabular form by converting it to a Python DataFrame. Name the converted data 'df' using 'pd. DataFrame(data)'.
I have proceeded to the pane continuation of working and computed the 'new ()' for the sales as the exponentially weighted mean for the sales column with the parameters span=3 and adjust= False, and the resultant data was stored in a new column as ewm_mean.
To have the original data along with the new 'ewm_mean' column printed, use the following code: For the original data and the new dataframe containing the 'ewm_mean' column to be printed, the following code needs to be used:
In fact, EWMA involves calculating data in a way that offers more weight to the current period data and less weight to the prior period data. This is particularly advantageous for picking the seeming responsive regimes that appear to respond more to the current changes operating in the time series.

2. Exponential Moving Standard Deviation

You can, however, calculate an exponential moving standard deviation. Certain modifications or adaptations can be made to the basic formula above when calculating an exponential moving standard deviation.

 
import numpy as np
# Creating a sample DataFrame
data = {
    'date':pd.date_range(start='2021-01-01',periods=10,freq='D'),
    'sales':[15,20,21,18,30,35,40,45,50,55]
}
df=pd.DataFrame(data)
df['ewm_mean']=df['sales'].ewm(span=3,adjust=False).mean()
df['ewm_std']=df['sales'].ewm(span=3,adjust=False).std()
print(df)   

Output:

 
date  sales   ewm_std
0 2021-01-01     15       NaN
1 2021-01-02     20  3.535534
2 2021-01-03     21  3.146427
3 2021-01-04     18  2.304240
4 2021-01-05     30  7.164865
5 2021-01-06     35  8.274524
6 2021-01-07     40  8.622124
7 2021-01-08     45  8.715860
8 2021-01-09     50  8.725208
9 2021-01-10     55  8.711331

Explanation

Import the numpy library as np for functional calculations.
Create a dictionary named 'data' with two keys:
'date: By the use of the tool pd, it is possible to generate the following dates starting from the 1st of January 2021 up to ten different dates: Date generator pd. date_range(start =' 2021-01-01 ', periods = 10, freq = 'D').
'sales: Therefore, compute a list containing the future values of sales figures for each date based on the predecessor data: [15, 20, 21, 18, 30, 35, 40, 45, 50, 55].
If the dictionary data should be rearranged into a form suitable for analysis in a Pandas DataFrame, the code would be 'df = pd. DataFrame(data)'.
In the next step, use the span keyword argument and set adjust to False to calculate the exponentially weighted mean of the data to the 'sales' column using the 'ewm()' method, renaming the new column 'ewm_mean'.
To do this, calculate the exponentially weighted standard deviation of the sales column using the 'ewm()' method with a span of 3 and 'adjust=False' and store the results in the 'ewm_std' variable.
If you print the DataFrame 'df,' you will be able to see the original data along with the two new columns 'ewm_mean' and 'ewm_std'.

GroupBy Windows

GroupBy windows enable users to make window calculations per group. This is especially useful when working with time-relevant data, which requires some sort of grouping based on a categorical variable such as the customers' ID, product type, etc.

Example: Rolling Mean within groups.

Now, imagine a DataFrame with a 'category' column and find the rolling mean for each category.

 
data={
    'date':pd.date_range(start='2021-01-01',periods=12,freq='D'),
    'category': ['X',' X',' X',' X','Y', 'Y','Y','Y','Z','Z','Z','Z'],
    'sales':[15,20, 21, 18, 30, 35, 40, 45, 50, 55, 60, 65]
}
df=pd.DataFrame(data)

df['grouped_rolling_mean']=df.groupby('category')['sales'].rolling(window=2).mean().reset_index(level=0,drop=True)
print(df)   

Output:

 
date category  sales  grouped_rolling_mean
0  2021-01-01        X     15                   NaN
1  2021-01-02        X     20                  17.5
2  2021-01-03        X     21                  20.5
3  2021-01-04        X     18                  19.5
4  2021-01-05        Y     30                   NaN
5  2021-01-06        Y     35                  32.5
6  2021-01-07        Y     40                  37.5
7  2021-01-08        Y     45                  42.5
8  2021-01-09        Z     50                   NaN
9  2021-01-10        Z     55                  52.5
10 2021-01-11        Z     60                  57.5
11 2021-01-12        Z     65                  62.5

Explanation

Create a dictionary named 'data' with three keys:
'date:Using the function 'pd. date_range' with the parameters start= '2021-01-01' and periods=12, establishing the frequency as 'D'. Creates a list of dates starting from 12 days 2021-01-01
'category: Categorize the sales data and use the names of categories 'X, Y and Z' four times.
'sales: Create a list of sales figures, which corresponds to date (in the format of dd. MM. yyyy) and category: [15, 20, 21, 18, 30, 35, 40, 45, 50, 55, 60, 65].
Store the dictionary data in a Pandas' DataFrame named df by 'pd. DataFrame(data)'.
To which operations are performed on each of the categories but not across them, it is necessary to group the DataFrame by 'category' using 'df. group by ('category')'.
In the rolling(window=2) function, use the sales column and group it by the category over which the rolling mean will be carried out.
In the merge step, use 'reset_index(level=0, drop=True)" to remove the extra index generated by the group and make the result set match the original DataFrame's index.
Store the calculated grouped rolling mean in a new column as 'grouped_rolling_mean'.
To see the original data with the added 'grouped_rolling_mean' column, print the DataFrame df.
The resulting DataFrame, which will be built, will consist of four columns, which are the original date the record belonged to, the category to which the record was assigned, the retail sales and the new column 'grouped_rolling_mean'. As for the rolling mean, this is calculated category-wise; that is, for each category, the sales data of that particular category is taken into consideration. Thus, for the first row of each category, the value of the rolling mean will be NaN since there are no previous values from which to calculate the mean.

Practical Applications

Financial Analysis
- Moving Averages: Rolling averages have extensive applications in the analysis of stock markets. For instance, a 50-day or 200-day moving average gives analysts a feel for the general trend of the price of a stock or a particular index since prices fluctuate daily. This aids in arriving at buy or sell signals.
- Volatility Measurement: Rolling standard deviation is applied when a user wants to determine the amount of fluctuations in the price of stocks or returns for a certain period. This assists in determining the risk level that is related to a specific investment.
Sales and Revenue Forecasting
- Trend Analysis: Business organizations use the rolling average to analyze sales figures over a given period of time. For example, a retail company with a daily sales figure may use the 7-day or 30-day rolling mean, which will, in turn, facilitate proper inventory control and demand forecasting.
- Performance Tracking: Applications of window functions are extended to the cumulative sales or revenue, giving a clue to the business amounts at certain intervals of time.
Quality Control in Manufacturing
- Process Monitoring: Rolling mean and rolling standard deviation are used to control manufacturing processes through statistical process control (SPC) charts. When these functions are applied in real production, companies can monitor changes in process performance and prevent major problems.
- Defect Rate Analysis: Growing functions can include the aggregate defective ratio history; this info will assist in indicating whether there is an improvement or some deterioration in the quality of the product.
Time Series Forecasting
- Smoothing Time Series Data: In time series analysis, operating averages can be derived by rolling and expanding window functions to assist in activities such as smoothing data, which is important in modeling and forecasting. Subsequently, the extracting approach provides clean data that improves the identification of significant patterns useful for computational prediction.
- Lagged Features Creation: Rolling windows can give rise to lags that are used as inputs or predictors in time series analyses. These features encode history and seasonally, which can help boost the model's performance.
Sports Analytics
- Player Performance: It is common to apply rolling averages to analyze an athlete's performance over a number of games or matches. For example, in cricket, a rolling average of the player's scores for the last five or ten innings would be beneficial to examine the batting form and arrive at selection decisions.
- Team Strategy: Growing sums and means can follow the aggregate team performance indicators, which is useful for a coach or an analyst to create strategies concerning game outcomes depending on shifts in team performance.
Customer Behavior Analysis
- Churn Prediction: Businesses may leverage moving average indexes of clients' activity, such as logins and/or sales, to determine possible churn. If a customer's activity has decreased substantially compared to their rolling average, they may churn.
- Lifetime Value Calculation: Growing totals can be applied to obtain the lifetime consumer value, which can aid in categorizing clients and promotional strategies.

Best Practices and Considerations

Choose the Appropriate Window Size: Choose the window size that minimizes the influence of noise while allowing the trend to be preserved. A window drawn with a small thickness is suitable for short-span data changes, while a larger window will best demonstrate long-range trends.

Handle Missing Data: Before applying window functions, make sure that missing data is dealt with correctly. To avoid any disruptions in data continuity, one can employ methods such as forward or backward filling.
Consider Data Frequency: Choose the window size relative to the data frequency (daily data - daily window, weekly data - weekly window, etc.). When parallelism is not followed, the conclusions are not correct.
Group Data When Necessary: In the case of categorical data, one has to use group by (), which makes it possible to perform window functions within the groups considered while maintaining logical comparisons.
Optimize Performance: Always use efficient window functions. If you are working with large data, first reduce it using options like sampling or data aggregation.
Visualize Results: You also want to plot to account for the values and verify that the window functions are picking up the right patterns and features.

Conclusion

These window functions in Pandas offer an amazing panel for calculating time-based and time-related aggregations. They help analysts and data scientists analyze their data and find patterns and relationships, which makes them useful in companies in finance, sales and customer analysis.

Finally, in this article, we have briefly examined different types of window functions, such as rolling windows, expanding windows, exponentially moving windows, and grouping by windows. We have discussed real-life scenarios of these functions with examples such as warning about window functions in Pandas will increase your data processing abilities and decision-making.

Next TopicWord2vec and fasttext word embedding with gensim

← prev next →