Pandas dataframe.ffill() in Python

Introduction

Data handling and programming are inseparable from data science and other analytical work. Python's growing collection of libraries makes it a powerful language for tasks like data processing, an area where Pandas is the standard choice. Pandas is a versatile package that offers many useful functions and methods for data manipulation, including DataFrame.ffill(), which is particularly useful for replacing missing values. In this article, we will look at the details of the 'ffill()' method in Pandas and learn how to use it on real problems through examples.

What is the DataFrame.ffill() Method?

The 'ffill()' method of the Pandas DataFrame fills missing values forward. It performs a 'forward fill' operation: the last observed non-missing value in a column is copied into the missing cells that follow it, until the next non-missing value is reached. This approach is best suited to time series or other sequential data, where missing values can arise because a sensor is out of order, the data is hard to collect, or readings are only available occasionally.

Syntax:

  • axis: The axis along which missing values are filled. By default (axis=0), values are filled down each column; axis=1 fills across each row.
  • inplace: If True, modifies the DataFrame in place and returns None. If False (the default), the original is left unchanged and a new DataFrame with the missing values filled is returned.
  • limit: Limits the number of consecutive NaN values filled. If provided, forward fill will stop after the specified number of NaN values are filled.
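
The call takes the general form df.ffill(axis=..., inplace=..., limit=...). The sketch below illustrates each of the three parameters above on a small, made-up DataFrame; the data and the variable names (filled_down, filled_across, filled_limited, df_copy) are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only to illustrate the parameters.
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [np.nan, 2.0, np.nan, np.nan],
})

# Default (axis=0): each NaN takes the last valid value above it in its column.
filled_down = df.ffill()

# axis=1: each NaN takes the last valid value to its left in the same row.
filled_across = df.ffill(axis=1)

# limit=1: fill at most one consecutive NaN after each valid value.
filled_limited = df.ffill(limit=1)

# inplace=True: the object itself is modified instead of a copy being returned.
df_copy = df.copy()
df_copy.ffill(inplace=True)

print(filled_down)
print(filled_across)
print(filled_limited)
print(df_copy)
```

With limit=1, the second of the two consecutive gaps in column 'A' stays NaN, because only one missing value may be filled after each valid one.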

Understanding Forward Fill

Before we go deeper into the ffill() method, let's understand the concept of forward fill using a simple table. Suppose we have a DataFrame with missing values as follows:

Example:
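
A minimal sketch that builds such a DataFrame, using the names 'data' and 'df' that the explanation in this section refers to:

```python
import numpy as np
import pandas as pd

# Two columns whose gaps are marked with np.nan.
data = {
    "A": [1, np.nan, 3, np.nan, 5],
    "B": [np.nan, 2, np.nan, 4, 5],
}

# Build the DataFrame from the dictionary and display it.
df = pd.DataFrame(data)
print(df)
```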

Output:

     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  NaN
3  NaN  4.0
4  5.0  5.0

Explanation:

  • The code imports the Pandas library as 'pd' and the NumPy library as 'np'.
  • It builds a dictionary named 'data' with the keys 'A' and 'B', each mapped to a list containing numeric values and NaN.
  • The dictionary is then used to create a Pandas DataFrame named 'df' with two columns ('A' and 'B').
  • Column 'A' holds the values [1, NaN, 3, NaN, 5].
  • Column 'B' holds the values [NaN, 2, NaN, 4, 5].
  • NaN values mark entries that are missing or undefined in the DataFrame.

Now we will apply the ffill() method to the above DataFrame:
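
A minimal sketch of this step, assuming the same 'data' dictionary as before and the result name 'df_ffilled' used in the explanation:

```python
import numpy as np
import pandas as pd

data = {
    "A": [1, np.nan, 3, np.nan, 5],
    "B": [np.nan, 2, np.nan, 4, 5],
}
df = pd.DataFrame(data)

print("Before Applying ffill Method")
print(df)

# Forward fill: every NaN is replaced by the last valid value above it.
# The original df is left unchanged; the result is a new DataFrame.
df_ffilled = df.ffill()

print("After Applying ffill Method")
print(df_ffilled)
```

The first entry of column 'B' remains NaN because there is no earlier value in that column to carry forward.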

Output:

Before Applying ffill Method
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  NaN
3  NaN  4.0
4  5.0  5.0
After Applying ffill Method
     A    B
0  1.0  NaN
1  1.0  2.0
2  3.0  2.0
3  3.0  4.0
4  5.0  5.0

Explanation:

  • The code begins by importing NumPy as 'np' and Pandas as 'pd'.
  • It defines a dictionary 'data' with two keys ('A' and 'B'), each linked to a list of numeric values and NaN (which represents missing values).
  • From this dictionary, it creates the Pandas DataFrame 'df' with two columns ('A' and 'B').
  • The original DataFrame 'df' is printed first to show its contents before filling.
  • The 'ffill()' method is then applied to 'df'. It carries the last valid value in each column forward into the missing cells below it. The first row of column 'B' stays NaN because no earlier value exists to copy.
  • The result is stored in 'df_ffilled', and a print statement shows this DataFrame with the missing values forward filled.
  • This code illustrates the simplest use of the 'ffill()' method for filling missing data in a Pandas DataFrame.

In the next section, let's explore practical examples of the dataframe.ffill() method.

Practical Examples

Now, let's examine how the 'ffill()' method behaves on datasets that resemble real-world data.

Example 1: Time Series Data

Missing readings are common in time series datasets. Let's create a sample DataFrame representing daily temperature readings:

Example:
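
A minimal sketch of such a DataFrame, using the names 'dates', 'data' and 'ts_df' that the explanation in this example refers to:

```python
import numpy as np
import pandas as pd

# Five consecutive daily timestamps starting on 2024-01-01.
dates = pd.date_range("2024-01-01", periods=5)

# Temperature readings; np.nan marks the days with no reading.
data = {"Temperature": [20.5, np.nan, 22.3, np.nan, 19.8]}

# Use the dates as the index so each row is tied to a day.
ts_df = pd.DataFrame(data, index=dates)
print(ts_df)
```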

Output:

            Temperature
2024-01-01         20.5
2024-01-02          NaN
2024-01-03         22.3
2024-01-04          NaN
2024-01-05         19.8

Explanation:

  • This code snippet demonstrates creating a Pandas DataFrame to represent time series data:
  • The Pandas library is imported as 'pd' and NumPy as 'np'.
  • The 'pd.date_range()' function generates five consecutive daily dates, from '2024-01-01' to '2024-01-05', and the result is assigned to the 'dates' variable.
  • A dictionary named 'data' is defined, with 'Temperature' as a key mapped to a list of numeric readings and NaN for the missing ones.
  • The 'data' dictionary and the 'dates' index are used together to create a Pandas DataFrame named 'ts_df', in which 'Temperature' is the column header.
  • The DataFrame 'ts_df' is printed to show the time series data. Each temperature value appears next to its date, and missing readings are shown as 'NaN'.
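
Forward filling these readings could look like the sketch below. The result name 'ts_df_ffilled' is a hypothetical choice made here, following the naming pattern of the other examples.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5)
data = {"Temperature": [20.5, np.nan, 22.3, np.nan, 19.8]}
ts_df = pd.DataFrame(data, index=dates)

# Each missing day takes the reading of the most recent day that has one.
ts_df_ffilled = ts_df.ffill()
print(ts_df_ffilled)
```

Here the missing readings for 2024-01-02 and 2024-01-04 become 20.5 and 22.3, the values recorded on the preceding days.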

Example 2: Financial Data

Suppose a firm records a set of daily stock prices. We'll create a sample DataFrame with missing values to simulate this scenario:
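
A minimal sketch of this DataFrame, using the names 'dates', 'data' and 'stock_df' that the explanation in this example refers to:

```python
import numpy as np
import pandas as pd

# Five consecutive daily timestamps starting on 2024-01-01.
dates = pd.date_range("2024-01-01", periods=5)

# Daily prices; np.nan marks days with no recorded price.
data = {"Stock_Price": [100.0, np.nan, 102.5, np.nan, 99.8]}

stock_df = pd.DataFrame(data, index=dates)
print(stock_df)
```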

Output:

            Stock_Price
2024-01-01        100.0
2024-01-02          NaN
2024-01-03        102.5
2024-01-04          NaN
2024-01-05         99.8

Explanation:

  • The code starts by importing the Pandas and NumPy libraries, abbreviated as 'pd' and 'np', respectively.
  • The 'pd.date_range()' function generates five consecutive daily dates starting from '2024-01-01' and assigns them to a variable called 'dates'.
  • A dictionary named 'data' is defined with the key 'Stock_Price' mapped to a list of numeric values and 'NaN', which marks the days with no recorded price.
  • From 'data' and 'dates', we construct a Pandas DataFrame called 'stock_df', which represents the stock price over time, with 'Stock_Price' as the column label.
  • The DataFrame 'stock_df' is printed, showing the stock prices and their dates, with missing entries marked as 'NaN'.

Now, we will apply the ffill() method to the above example:
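
A minimal sketch of this step, assuming the same stock price data and the result name 'stock_df_ffilled' used in the explanation:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5)
data = {"Stock_Price": [100.0, np.nan, 102.5, np.nan, 99.8]}
stock_df = pd.DataFrame(data, index=dates)
print(stock_df)

# Carry the last known price forward into the days with no recorded price.
stock_df_ffilled = stock_df.ffill()

print("After Applying ffill() method to stock price:")
print(stock_df_ffilled)
```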

Output:

            Stock_Price
2024-01-01        100.0
2024-01-02          NaN
2024-01-03        102.5
2024-01-04          NaN
2024-01-05         99.8
After Applying ffill() method to stock price:
            Stock_Price
2024-01-01        100.0
2024-01-02        100.0
2024-01-03        102.5
2024-01-04        102.5
2024-01-05         99.8

Explanation:

  • This code snippet demonstrates applying the 'ffill()' method to a DataFrame to forward fill missing values:
  • Using 'pd.date_range()' with '2024-01-01' as the starting date and 5 periods, a range of dates is produced and assigned to 'dates'.
  • A dictionary named 'data' is defined with the key 'Stock_Price', mapped to a list of numeric values and 'NaN' for the missing data.
  • From 'data' and 'dates', a DataFrame named 'stock_df' is created to represent stock prices over time, with 'Stock_Price' as the column heading.
  • The original DataFrame 'stock_df' is printed, showing the stock prices with their dates, including the missing values, which are indicated as 'NaN'.
  • The ffill() method fills the gaps in 'stock_df' by carrying the last known price forward, and the result is stored in the 'stock_df_ffilled' DataFrame.
  • Finally, 'stock_df_ffilled' is printed. It is the filled DataFrame, in which the missing stock prices have been replaced using the forward fill method.

Conclusion

In this article, we looked at the 'ffill()' method of the Pandas DataFrame, a useful tool for filling gaps in sequential data with the most recent known value from the same column. We covered its syntax and worked through several practical examples using time series and financial data to show its role in data pre-processing. Data analysts and scientists commonly use 'ffill()' to handle missing values. Doing so makes datasets more complete, which can improve their reliability and, in turn, the accuracy of the analysis built on them.