Identify Patterns in Time-Series Data with Overlapping Window Techniques Using Pandas

Pandas includes a wide collection of tools for working with dates, times, and time-indexed data, as you might expect given that it was built with financial modelling in mind.
Finding patterns in time-series data with overlapping window techniques means using statistical methods to smooth and analyse trends over particular periods. With rolling computations such as moving averages and rolling standard deviations, analysts can uncover variability and cyclical behaviour in the data. These methods help reduce noise, highlight underlying trends, and flag potential anomalies: a rolling standard deviation reveals volatility, for example, while a rolling mean shows the general direction of the data. Because the windows overlap, they offer a finer-grained view of the series, which enables more precise pattern detection and continuous monitoring.

In this section we will go over how to work with each of these kinds of date/time data in Pandas. This brief piece is meant to serve as a general overview of how you, as a user, should approach working with time series; it is by no means a comprehensive guide to all of the time series tools available in Python or Pandas. Before diving deeper into the capabilities offered by Pandas, let's take a quick look at some other Python tools for working with dates and times, and then walk through a few short examples of using Pandas with time series data.

Dates and Times in Python

The Python community offers numerous representations of dates, times, deltas, and timespans. Although Pandas' time series features are typically the most useful for data science applications, it helps to understand how they relate to other Python packages.

Native Python dates and times: datetime and dateutil

The datetime module ships with Python and contains the fundamental objects for working with dates and times. Used together with the third-party dateutil module, it lets you quickly perform a variety of helpful operations on dates and times. For instance, you can use the datetime type to build a date manually, or use dateutil to parse one from a flexibly formatted string; both produce datetime.datetime(2016, 7, 14, 0, 0). Once you have a datetime object, you can print the day of the week:

Output: 'Thursday'

The last line uses one of the standard string format codes ("%A") for printing dates; you can learn more about these codes in the strftime section of Python's datetime documentation. The online documentation for dateutil describes further helpful date utilities. A related package worth knowing about is pytz, which provides tools for working with time zones, the most migraine-inducing component of time series data.

The power of datetime and dateutil lies in their flexibility and simple syntax: these objects are a powerful combination, and you can use them to perform nearly any operation you might want. Where they falter is when you wish to work with large arrays of dates and times: just as lists of Python numerical values are poor compared to NumPy-style typed numerical arrays, lists of Python datetime objects are poor compared to typed arrays of encoded dates.
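Before turning to typed arrays, here is a minimal sketch of the datetime and dateutil examples above, assuming the 14 July 2016 date shown in the outputs:

from datetime import datetime
from dateutil import parser

# Build a date manually with the built-in datetime type
date = datetime(year=2016, month=7, day=14)      # datetime.datetime(2016, 7, 14, 0, 0)

# Parse the same date from a flexibly formatted string
date = parser.parse("14th of July, 2016")        # datetime.datetime(2016, 7, 14, 0, 0)

# Print the day of the week using the "%A" format code
print(date.strftime('%A'))                       # 'Thursday'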
Typed Arrays of Times: NumPy's datetime64

In response to the shortcomings of Python's datetime format, the NumPy team added a set of native time series data types to NumPy. Because the datetime64 dtype encodes dates as 64-bit integers, arrays of dates can be represented very compactly. The datetime64 type requires input in a very specific format:

Output: array(datetime.date(2016, 7, 14), dtype='datetime64[D]')

Once the date is in this form, however, we can quickly perform vectorized operations on it:

Output: array(['2016-07-14', '2016-07-15', '2016-07-16', '2016-07-17', '2016-07-18', '2016-07-19', '2016-07-20', '2016-07-21', '2016-07-22', '2016-07-23', '2016-07-24', '2016-07-25'], dtype='datetime64[D]')

One detail of the datetime64 and timedelta64 objects is that they are built on a fundamental time unit. Because the datetime64 object is limited to 64-bit precision, the range of encodable times is 2^64 times this fundamental unit. In other words, datetime64 imposes a trade-off between time resolution and maximum time span. For example, if you want a time precision of one nanosecond, you only have enough room to encode a range of 2^64 nanoseconds, which is just under 600 years. NumPy will infer the required unit from the input; for instance, here is a day-based datetime:

Output: numpy.datetime64('2016-07-14')

Here is a minute-based datetime:

Output: numpy.datetime64('2016-07-14T12:00')

(In older versions of NumPy, the displayed time would also be localised to the time zone of the machine executing the code; recent versions treat datetime64 values as time zone-naive.) You can force any desired fundamental unit using one of the many available format codes; here, we'll force a nanosecond-based time:

Output: numpy.datetime64('2016-07-14T12:59:59.500000000')
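A minimal sketch of these datetime64 constructions, taking the 2016-07-14 date and times from the outputs above:

import numpy as np

# Encode a date as a 64-bit integer via the datetime64 dtype
date = np.array('2016-07-14', dtype=np.datetime64)

# Vectorized arithmetic: the next twelve days in one operation
date + np.arange(12)

# NumPy infers the unit from the input: day-based, minute-based,
# and nanosecond-based (the last forced with an explicit unit code)
np.datetime64('2016-07-14')
np.datetime64('2016-07-14 12:00')
np.datetime64('2016-07-14 12:59:59.50', 'ns')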
datetime64[ns] is a useful default for the kinds of data we encounter in the real world, since it can encode a reasonable range of modern dates with suitably fine precision. Finally, note that although the datetime64 data type addresses some of the shortcomings of the built-in Python datetime type, it lacks many of the convenient functions and methods offered by datetime and dateutil. The datetime64 documentation for NumPy has further details.

Dates and Times in Pandas: Best of Both Worlds

Building on all the tools just discussed, Pandas provides a Timestamp object that combines the ease of use of datetime and dateutil with the efficient storage and vectorized interface of numpy.datetime64. From a collection of these Timestamp objects, Pandas can construct a DatetimeIndex, which can be used to index data in a Series or DataFrame; we will see several examples of this below. For instance, we can repeat the earlier demonstration using Pandas tools: parsing a flexibly formatted string date yields Timestamp('2016-07-14 00:00:00'), and the "%A" format code then gives the day of the week, 'Thursday'. Furthermore, we can perform NumPy-style vectorized operations directly on this same object:

Output: DatetimeIndex(['2016-07-14', '2016-07-15', '2016-07-16', '2016-07-17', '2016-07-18', '2016-07-19', '2016-07-20', '2016-07-21', '2016-07-22', '2016-07-23', '2016-07-24', '2016-07-25'], dtype='datetime64[ns]', freq=None)

Pandas Time Series: Indexing by Time

The real power of the Pandas time series tools appears when you begin to index data by timestamps. As an illustration, we can construct a Series object with time-indexed data:

Output:
2015-07-14    0
2015-08-14    1
2016-07-14    2
2016-08-14    3
dtype: int64

Now that we have this data in a Series, we can make use of any of the Series indexing patterns covered in earlier sections, passing values that can be coerced into dates:

Output:
2015-07-14    0
2015-08-14    1
2016-07-14    2
dtype: int64

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all the data from that year:

Output:
2016-07-14    2
2016-08-14    3
dtype: int64

We will see further examples of the convenience of dates-as-indices later on.
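A sketch of the Pandas equivalents shown above, assuming the same dates as in the outputs:

import numpy as np
import pandas as pd

# Parse a flexibly formatted string into a Timestamp and get the weekday
date = pd.to_datetime("14th of July, 2016")   # Timestamp('2016-07-14 00:00:00')
date.strftime('%A')                           # 'Thursday'

# NumPy-style vectorized operation directly on the Timestamp
date + pd.to_timedelta(np.arange(12), 'D')

# A Series whose index is a DatetimeIndex
index = pd.DatetimeIndex(['2015-07-14', '2015-08-14',
                          '2016-07-14', '2016-08-14'])
data = pd.Series([0, 1, 2, 3], index=index)

# Slice with date strings, and select a whole year at once
data['2015-07-14':'2016-07-14']
data['2016']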
First, though, let's take a closer look at the available time series data structures.

Pandas Time Series Data Structures

This section introduces the fundamental Pandas data structures for working with time series data: for timestamps there is the Timestamp type with its associated DatetimeIndex; for time periods, the Period type and PeriodIndex; and for time deltas or durations, the Timedelta type and TimedeltaIndex.

The most fundamental of these date/time objects are the Timestamp and DatetimeIndex objects. Although these class objects can be used directly, it is more common to use the pd.to_datetime() function, which can parse a wide variety of formats:

Output: DatetimeIndex(['2015-07-03', '2016-07-14', '2015-07-06', '2015-07-07', '2015-07-08'], dtype='datetime64[ns]', freq=None)

Any DatetimeIndex can be converted to a PeriodIndex with the to_period() method, given a frequency code; here we'll use 'D' to indicate daily frequency:

Output: PeriodIndex(['2015-07-03', '2016-07-14', '2015-07-06', '2015-07-07', '2015-07-08'], dtype='int64', freq='D')

A TimedeltaIndex is created, for example, when one date is subtracted from another:

Output: TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)

Regular Sequences: pd.date_range()

Pandas offers several functions that make the construction of regular date sequences convenient: pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for durations. pd.date_range() accepts a start date, an end date, and an optional frequency code, and produces a regular sequence of dates; the frequency is one day by default. You can change the spacing with the freq argument (which defaults to 'D'). Here, for instance, we'll build a range of hourly timestamps:

Output: DatetimeIndex(['2016-07-13 00:00:00', '2016-07-13 01:00:00', '2016-07-13 02:00:00', '2016-07-13 03:00:00', '2016-07-13 04:00:00', '2016-07-13 05:00:00', '2016-07-13 06:00:00', '2016-07-13 07:00:00'], dtype='datetime64[ns]', freq='H')

Here are some monthly periods:

Output: PeriodIndex(['2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12', '2017-01', '2017-02'], dtype='int64', freq='M')

And a sequence of durations increasing by an hour:

Output: TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00', '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00', '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00', '0 days 09:00:00'], dtype='timedelta64[ns]', freq='H')
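A sketch of these constructions, using the dates that appear in the outputs above:

import pandas as pd
from datetime import datetime

# pd.to_datetime() parses many different formats into one DatetimeIndex
dates = pd.to_datetime([datetime(2015, 7, 3), '14th of July, 2016',
                        '2015-Jul-6', '07-07-2015', '20150708'])

dates.to_period('D')      # convert to a PeriodIndex with daily frequency
dates - dates[0]          # subtraction yields a TimedeltaIndex

# Regular sequences: daily by default, or any frequency via freq=
pd.date_range('2016-07-13', '2016-07-20')           # one value per day
pd.date_range('2016-07-13', periods=8, freq='H')    # eight hourly timestamps
pd.period_range('2016-07', periods=8, freq='M')     # eight monthly periods
pd.timedelta_range(0, periods=10, freq='H')          # ten hourly durations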
Frequencies and Offsets

Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we used the 'D' (day) and 'H' (hour) codes above, such codes can be used to specify any desired frequency spacing. The main codes include 'D' (calendar day), 'B' (business day), 'W' (weekly), 'M' (month end), 'Q' (quarter end), 'A' (year end), 'H' (hours), 'T' (minutes), and 'S' (seconds).

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. Adding an 'S' suffix to any of these causes them to be marked at the beginning instead, for example 'MS' (month start), 'QS' (quarter start), and 'AS' (year start).
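For example, a small illustrative comparison (the dates here are arbitrary):

import pandas as pd

pd.date_range('2016-07-01', periods=4, freq='M')    # month ends: Jul 31, Aug 31, Sep 30, Oct 31
pd.date_range('2016-07-01', periods=4, freq='MS')   # month starts: Jul 1, Aug 1, Sep 1, Oct 1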
Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix, for example 'Q-JAN' (quarters in a year ending in January) or 'A-FEB' (years ending in February).
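For example (the anchor months here are arbitrary illustrations):

import pandas as pd

pd.date_range('2016-01-01', periods=4, freq='Q-JAN')  # quarter ends: Jan 31, Apr 30, Jul 31, Oct 31
pd.date_range('2016-01-01', periods=2, freq='A-FEB')  # year ends: 2016-02-29, 2017-02-28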
Similarly, the split point of the weekly frequency can be changed by adding a three-letter weekday code, for example 'W-SUN' (the default), 'W-MON', or 'W-WED'. On top of this, codes can be combined with numbers to specify other frequencies. For instance, for a frequency of 2 hours and 30 minutes, we can combine the hour ('H') and minute ('T') codes as follows:

Output: TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00', '0 days 07:30:00', '0 days 10:00:00', '0 days 12:30:00', '0 days 15:00:00', '0 days 17:30:00', '0 days 20:00:00'], dtype='timedelta64[ns]', freq='150T')

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the pd.tseries.offsets module.
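A short sketch of the weekday anchor, the combined code, and an offset object (the specific choices of 'W-WED' and BDay are illustrative):

import pandas as pd
from pandas.tseries.offsets import BDay

# Weekly frequency split on Wednesdays rather than the default Sunday
pd.date_range('2016-07-13', periods=4, freq='W-WED')

# Combine codes and numbers: a period of 2 hours and 30 minutes
pd.timedelta_range(0, periods=9, freq='2H30T')

# The codes correspond to offset objects in pd.tseries.offsets,
# which can also be passed directly, e.g. a business-day frequency
pd.date_range('2016-07-13', periods=5, freq=BDay())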
Resampling, Shifting, and Windowing

The ability to use dates and times as indices to intuitively organise and access data is a key component of the Pandas time series tools. The general benefits of indexed data (automatic alignment during operations, straightforward data slicing and access, and so on) certainly still apply, and Pandas provides several additional time series-specific operations. We will look at a few of those here, using stock price data as an example. Because Pandas was developed largely in a finance context, it includes some very specific tools for financial data. For example, the companion pandas-datareader package (installable via conda install pandas-datareader) can import financial data from a number of sources, including Yahoo Finance and Google Finance. Here we will load Google's closing price history:

Output:
            Open      High       Low     Close  Adj Close     Volume
Date
2004-08-19  2.490664  2.591785  2.390042  2.499133  2.496292  897427216
2004-08-20  2.515820  2.716817  2.503118  2.697639  2.694573  458857488
2004-08-23  2.758411  2.826406  2.716070  2.724787  2.721690  366857939
2004-08-24  2.770615  2.779581  2.579581  2.611960  2.608991  306396159
2004-08-25  2.614201  2.689918  2.587302  2.640104  2.637103  184645512

For simplicity, we'll use just the closing price.

Resampling and Converting Frequencies

One common need for time series data is resampling at a different frequency. This can be done using the resample() method, or the much simpler asfreq() method. The primary difference between the two is that resample() is fundamentally a data aggregation, while asfreq() is fundamentally a data selection. Looking at the Google closing price, let's compare what the two return when we down-sample the data, resampling at the end of the business year. Notice the difference: at each point, resample reports the average over the previous year, whereas asfreq gives the value at the end of the year.

For up-sampling, resample() and asfreq() are largely equivalent, though resample has many more options available. In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values. Just as with the pd.fillna() function mentioned previously, asfreq() accepts a method argument to specify how the imputed values are filled. Here, we will up-sample the business-day data to a daily frequency (that is, including weekends): with the default arguments, non-business days are left as NA values, while the forward-filling and backward-filling options fill those gaps from the preceding or following value, respectively.
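A sketch of these steps; note that it assumes the Yahoo Finance source used by pandas-datareader is still reachable, which may not be the case:

from pandas_datareader import data

# Load Google's price history (assumes the Yahoo Finance source is available)
goog = data.DataReader('GOOG', data_source='yahoo', start='2004', end='2016')
close = goog['Close']                  # keep only the closing price

# Down-sampling: 'BA' marks the end of each business year
close.resample('BA').mean()            # average over each preceding year
close.asfreq('BA')                     # value at the last business day of each year

# Up-sampling the business-day data to daily frequency, filling the new slots
close.asfreq('D')                      # weekends left as NA
close.asfreq('D', method='ffill')      # carry the last value forward
close.asfreq('D', method='bfill')      # pull the next value backward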
Time-Shifts

Another common time series-specific operation is shifting data in time. Pandas provides two closely related methods for this: shift() and tshift(). The main difference between them is that shift() shifts the data, while tshift() shifts the index. In both cases, the shift is specified in multiples of the frequency. Here we will both shift() and tshift() by 900 days. We can see that shift(900) shifts the data by 900 days, pushing some of it off the end of the graph (and leaving NA values at the other end), whereas tshift(900) shifts the index values by 900 days.

A common context for this kind of shift is computing differences over time. For example, we can use shifted values to compute the one-year return on investment for Google stock over the course of the dataset. This helps us see the overall trend in Google stock: so far, the most profitable times to invest in Google have been (unsurprisingly, in retrospect) shortly after its IPO and at the height of the 2009 recession.

Rolling Windows

Rolling statistics are a third type of time series-specific operation implemented by Pandas. These can be accomplished via the rolling() attribute of Series and DataFrame objects, which returns a view similar to what we saw with the groupby operation (see Aggregation and Grouping). This rolling view makes a number of aggregation operations available by default. For example, here is the one-year centred rolling mean and standard deviation of the Google stock prices. As with group-by operations, the aggregate() and apply() methods can be used for custom rolling computations.
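Finally, here is a self-contained sketch of the time-shift and rolling-window operations described above. Because the stock download may not be reproducible, it uses a stand-in daily price series; the real, forward-filled closing prices could be substituted if available:

import numpy as np
import pandas as pd

# Stand-in daily price series (substitute the real closing prices if available)
idx = pd.date_range('2004-08-19', '2016-12-30', freq='D')
price = pd.Series(np.linspace(2.5, 800, len(idx)), index=idx)

# Time shifts: shift() moves the data, leaving NA values at one end;
# shifting with freq='D' moves the index instead, which is the role tshift()
# plays in older pandas versions (it has since been removed in favour of
# shift(..., freq=...))
price.shift(900)
price.shift(900, freq='D')

# One-year return on investment computed from shifted values
roi = 100 * (price.shift(-365, freq='D') / price - 1)

# Rolling windows: a one-year centred window over the prices
rolling = price.rolling(365, center=True)
rolling.mean()                               # overall trend of the series
rolling.std()                                # rolling volatility

# Custom rolling computations via apply()
rolling.apply(lambda x: x.max() - x.min())   # range within each window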