Identify Patterns in Time-Series Data with Overlapping Window Techniques Using Pandas

Pandas includes a wide collection of tools for working with dates, times, and time-indexed data, as you might expect given that it was built with financial modelling in mind.
Finding patterns in time-series data with overlapping window techniques means using statistical methods to smooth and analyse trends over particular periods. With rolling computations such as moving averages and rolling standard deviations, analysts can uncover variability and cyclical behaviour in the data. These methods help reduce noise, highlight underlying trends, and flag potential anomalies: a rolling standard deviation reveals volatility, for example, while a rolling mean shows the general direction of the data. Because the windows overlap, they offer a finer-grained view of the series, which enables more precise pattern detection and continuous monitoring.

In this section we will go over how to work with each of these kinds of date/time data in Pandas. This brief piece is meant to serve as a general overview of how you, as a user, should approach working with time series; it is by no means a comprehensive guide to all of the time series tools available in Python or Pandas. Before diving deeper into the capabilities offered by Pandas, let's take a quick look at some other Python tools for working with dates and times, and then walk through a few short examples of using Pandas with time series data.

Dates and Times in Python

The Python community offers numerous representations of dates, times, deltas, and timespans. Although Pandas' time series features are typically the most useful for data science applications, it helps to understand how they relate to other Python packages.

Native Python dates and times: datetime and dateutil

The datetime module ships with Python and contains the fundamental objects for working with dates and times. Used together with the third-party dateutil module, it lets you quickly perform a variety of helpful operations on dates and times. For instance, you can use the datetime type to build a date manually, or use dateutil to parse one from a flexibly formatted string; both produce datetime.datetime(2016, 7, 14, 0, 0). Once you have a datetime object, you can print the day of the week:

Output: 'Thursday'

The last line uses one of the standard string format codes ("%A") for printing dates; you can learn more about these codes in the strftime section of Python's datetime documentation. The online documentation for dateutil describes further helpful date utilities. A related package worth knowing about is pytz, which provides tools for working with time zones, the most migraine-inducing component of time series data.

The power of datetime and dateutil lies in their flexibility and simple syntax: these objects are a powerful combination, and you can use them to perform nearly any operation you might want. Where they falter is when you wish to work with large arrays of dates and times: just as lists of Python numerical values are poor compared to NumPy-style typed numerical arrays, lists of Python datetime objects are poor compared to typed arrays of encoded dates.
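Before turning to typed arrays, here is a minimal sketch of the datetime and dateutil examples above, assuming the 14 July 2016 date shown in the outputs:

from datetime import datetime
from dateutil import parser

# Build a date manually with the built-in datetime type
date = datetime(year=2016, month=7, day=14)      # datetime.datetime(2016, 7, 14, 0, 0)

# Parse the same date from a flexibly formatted string
date = parser.parse("14th of July, 2016")        # datetime.datetime(2016, 7, 14, 0, 0)

# Print the day of the week using the "%A" format code
print(date.strftime('%A'))                       # 'Thursday'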
Typed Arrays of Times: NumPy's datetime64

In response to the shortcomings of Python's datetime format, the NumPy team added a set of native time series data types to NumPy. Because the datetime64 dtype encodes dates as 64-bit integers, arrays of dates can be represented very compactly. The datetime64 type requires input in a very specific format:

Output: array(datetime.date(2016, 7, 14), dtype='datetime64[D]')

Once the date is in this form, however, we can quickly perform vectorized operations on it:

Output: array(['2016-07-14', '2016-07-15', '2016-07-16', '2016-07-17', '2016-07-18', '2016-07-19', '2016-07-20', '2016-07-21', '2016-07-22', '2016-07-23', '2016-07-24', '2016-07-25'], dtype='datetime64[D]')

One detail of the datetime64 and timedelta64 objects is that they are built on a fundamental time unit. Because the datetime64 object is limited to 64-bit precision, the range of encodable times is 2^64 times this fundamental unit. In other words, datetime64 imposes a trade-off between time resolution and maximum time span. For example, if you want a time precision of one nanosecond, you only have enough room to encode a range of 2^64 nanoseconds, which is just under 600 years. NumPy will infer the required unit from the input; for instance, here is a day-based datetime:

Output: numpy.datetime64('2016-07-14')

Here is a minute-based datetime:

Output: numpy.datetime64('2016-07-14T12:00')

(In older versions of NumPy, the displayed time would also be localised to the time zone of the machine executing the code; recent versions treat datetime64 values as time zone-naive.) You can force any desired fundamental unit using one of the many available format codes; here, we'll force a nanosecond-based time:

Output: numpy.datetime64('2016-07-14T12:59:59.500000000')
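A minimal sketch of these datetime64 constructions, taking the 2016-07-14 date and times from the outputs above:

import numpy as np

# Encode a date as a 64-bit integer via the datetime64 dtype
date = np.array('2016-07-14', dtype=np.datetime64)

# Vectorized arithmetic: the next twelve days in one operation
date + np.arange(12)

# NumPy infers the unit from the input: day-based, minute-based,
# and nanosecond-based (the last forced with an explicit unit code)
np.datetime64('2016-07-14')
np.datetime64('2016-07-14 12:00')
np.datetime64('2016-07-14 12:59:59.50', 'ns')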
datetime64[ns] is a useful default for the kinds of data we encounter in the real world, since it can encode a reasonable range of modern dates with suitably fine precision. Finally, note that although the datetime64 data type addresses some of the shortcomings of the built-in Python datetime type, it lacks many of the convenient functions and methods offered by datetime and dateutil. The datetime64 documentation for NumPy has further details.

Dates and Times in Pandas: Best of Both Worlds

Building on all the tools just discussed, Pandas provides a Timestamp object that combines the ease of use of datetime and dateutil with the efficient storage and vectorized interface of numpy.datetime64. From a collection of these Timestamp objects, Pandas can construct a DatetimeIndex, which can be used to index data in a Series or DataFrame; we will see several examples of this below. For instance, we can repeat the earlier demonstration using Pandas tools: parsing a flexibly formatted string date yields Timestamp('2016-07-14 00:00:00'), and the "%A" format code then gives the day of the week, 'Thursday'. Furthermore, we can perform NumPy-style vectorized operations directly on this same object:

Output: DatetimeIndex(['2016-07-14', '2016-07-15', '2016-07-16', '2016-07-17', '2016-07-18', '2016-07-19', '2016-07-20', '2016-07-21', '2016-07-22', '2016-07-23', '2016-07-24', '2016-07-25'], dtype='datetime64[ns]', freq=None)

Pandas Time Series: Indexing by Time

The real power of the Pandas time series tools appears when you begin to index data by timestamps. As an illustration, we can construct a Series object with time-indexed data:

Output:
2015-07-14    0
2015-08-14    1
2016-07-14    2
2016-08-14    3
dtype: int64

Now that we have this data in a Series, we can make use of any of the Series indexing patterns covered in earlier sections, passing values that can be coerced into dates:

Output:
2015-07-14    0
2015-08-14    1
2016-07-14    2
dtype: int64

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all the data from that year:

Output:
2016-07-14    2
2016-08-14    3
dtype: int64

We will see further examples of the convenience of dates-as-indices later on.
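A sketch of the Pandas equivalents shown above, assuming the same dates as in the outputs:

import numpy as np
import pandas as pd

# Parse a flexibly formatted string into a Timestamp and get the weekday
date = pd.to_datetime("14th of July, 2016")   # Timestamp('2016-07-14 00:00:00')
date.strftime('%A')                           # 'Thursday'

# NumPy-style vectorized operation directly on the Timestamp
date + pd.to_timedelta(np.arange(12), 'D')

# A Series whose index is a DatetimeIndex
index = pd.DatetimeIndex(['2015-07-14', '2015-08-14',
                          '2016-07-14', '2016-08-14'])
data = pd.Series([0, 1, 2, 3], index=index)

# Slice with date strings, and select a whole year at once
data['2015-07-14':'2016-07-14']
data['2016']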
First, though, let's take a closer look at the available time series data structures.

Pandas Time Series Data Structures

This section introduces the fundamental Pandas data structures for working with time series data: for timestamps there is the Timestamp type with its associated DatetimeIndex; for time periods, the Period type and PeriodIndex; and for time deltas or durations, the Timedelta type and TimedeltaIndex.

The most fundamental of these date/time objects are the Timestamp and DatetimeIndex objects. Although these class objects can be used directly, it is more common to use the pd.to_datetime() function, which can parse a wide variety of formats:

Output: DatetimeIndex(['2015-07-03', '2016-07-14', '2015-07-06', '2015-07-07', '2015-07-08'], dtype='datetime64[ns]', freq=None)

Any DatetimeIndex can be converted to a PeriodIndex with the to_period() method, given a frequency code; here we'll use 'D' to indicate daily frequency:

Output: PeriodIndex(['2015-07-03', '2016-07-14', '2015-07-06', '2015-07-07', '2015-07-08'], dtype='int64', freq='D')

A TimedeltaIndex is created, for example, when one date is subtracted from another:

Output: TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)

Regular Sequences: pd.date_range()

Pandas offers several functions that make the construction of regular date sequences convenient: pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for durations. pd.date_range() accepts a start date, an end date, and an optional frequency code, and produces a regular sequence of dates; the frequency is one day by default. You can change the spacing with the freq argument (which defaults to 'D'). Here, for instance, we'll build a range of hourly timestamps:

Output: DatetimeIndex(['2016-07-13 00:00:00', '2016-07-13 01:00:00', '2016-07-13 02:00:00', '2016-07-13 03:00:00', '2016-07-13 04:00:00', '2016-07-13 05:00:00', '2016-07-13 06:00:00', '2016-07-13 07:00:00'], dtype='datetime64[ns]', freq='H')

Here are some monthly periods:

Output: PeriodIndex(['2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12', '2017-01', '2017-02'], dtype='int64', freq='M')

And a sequence of durations increasing by an hour:

Output: TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00', '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00', '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00', '0 days 09:00:00'], dtype='timedelta64[ns]', freq='H')
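A sketch of these constructions, using the dates that appear in the outputs above:

import pandas as pd
from datetime import datetime

# pd.to_datetime() parses many different formats into one DatetimeIndex
dates = pd.to_datetime([datetime(2015, 7, 3), '14th of July, 2016',
                        '2015-Jul-6', '07-07-2015', '20150708'])

dates.to_period('D')      # convert to a PeriodIndex with daily frequency
dates - dates[0]          # subtraction yields a TimedeltaIndex

# Regular sequences: daily by default, or any frequency via freq=
pd.date_range('2016-07-13', '2016-07-20')           # one value per day
pd.date_range('2016-07-13', periods=8, freq='H')    # eight hourly timestamps
pd.period_range('2016-07', periods=8, freq='M')     # eight monthly periods
pd.timedelta_range(0, periods=10, freq='H')          # ten hourly durations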
Frequencies and Offsets

Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we used the 'D' (day) and 'H' (hour) codes above, such codes can be used to specify any desired frequency spacing. The main codes include 'D' (calendar day), 'B' (business day), 'W' (weekly), 'M' (month end), 'Q' (quarter end), 'A' (year end), 'H' (hours), 'T' (minutes), and 'S' (seconds).

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. Adding an 'S' suffix to any of these causes them to be marked at the beginning instead, for example 'MS' (month start), 'QS' (quarter start), and 'AS' (year start).
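For example, a small illustrative comparison (the dates here are arbitrary):

import pandas as pd

pd.date_range('2016-07-01', periods=4, freq='M')    # month ends: Jul 31, Aug 31, Sep 30, Oct 31
pd.date_range('2016-07-01', periods=4, freq='MS')   # month starts: Jul 1, Aug 1, Sep 1, Oct 1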
Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix, for example 'Q-JAN' (quarters in a year ending in January) or 'A-FEB' (years ending in February).
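For example (the anchor months here are arbitrary illustrations):

import pandas as pd

pd.date_range('2016-01-01', periods=4, freq='Q-JAN')  # quarter ends: Jan 31, Apr 30, Jul 31, Oct 31
pd.date_range('2016-01-01', periods=2, freq='A-FEB')  # year ends: 2016-02-29, 2017-02-28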
Similarly, the split point of the weekly frequency can be changed by adding a three-letter weekday code, for example 'W-SUN' (the default), 'W-MON', or 'W-WED'. On top of this, codes can be combined with numbers to specify other frequencies. For instance, for a frequency of 2 hours and 30 minutes, we can combine the hour ('H') and minute ('T') codes as follows:

Output: TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00', '0 days 07:30:00', '0 days 10:00:00', '0 days 12:30:00', '0 days 15:00:00', '0 days 17:30:00', '0 days 20:00:00'], dtype='timedelta64[ns]', freq='150T')

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the pd.tseries.offsets module.
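A short sketch of the weekday anchor, the combined code, and an offset object (the specific choices of 'W-WED' and BDay are illustrative):

import pandas as pd
from pandas.tseries.offsets import BDay

# Weekly frequency split on Wednesdays rather than the default Sunday
pd.date_range('2016-07-13', periods=4, freq='W-WED')

# Combine codes and numbers: a period of 2 hours and 30 minutes
pd.timedelta_range(0, periods=9, freq='2H30T')

# The codes correspond to offset objects in pd.tseries.offsets,
# which can also be passed directly, e.g. a business-day frequency
pd.date_range('2016-07-13', periods=5, freq=BDay())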
Resampling, Shifting, and Windowing

The ability to use dates and times as indices to intuitively organise and access data is a key component of the Pandas time series tools. The general benefits of indexed data (automatic alignment during operations, straightforward data slicing and access, and so on) certainly still apply, and Pandas provides several additional time series-specific operations. We will look at a few of those here, using stock price data as an example. Because Pandas was developed largely in a finance context, it includes some very specific tools for financial data. For example, the companion pandas-datareader package (installable via conda install pandas-datareader) can import financial data from a number of sources, including Yahoo Finance and Google Finance. Here we will load Google's closing price history:

Output:
            Open      High       Low     Close  Adj Close     Volume
Date
2004-08-19  2.490664  2.591785  2.390042  2.499133  2.496292  897427216
2004-08-20  2.515820  2.716817  2.503118  2.697639  2.694573  458857488
2004-08-23  2.758411  2.826406  2.716070  2.724787  2.721690  366857939
2004-08-24  2.770615  2.779581  2.579581  2.611960  2.608991  306396159
2004-08-25  2.614201  2.689918  2.587302  2.640104  2.637103  184645512

For simplicity, we'll use just the closing price.

Resampling and Converting Frequencies

One common need for time series data is resampling at a different frequency. This can be done using the resample() method, or the much simpler asfreq() method. The primary difference between the two is that resample() is fundamentally a data aggregation, while asfreq() is fundamentally a data selection. Looking at the Google closing price, let's compare what the two return when we down-sample the data, resampling at the end of the business year. Notice the difference: at each point, resample reports the average over the previous year, whereas asfreq gives the value at the end of the year.

For up-sampling, resample() and asfreq() are largely equivalent, though resample has many more options available. In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values. Just as with the pd.fillna() function mentioned previously, asfreq() accepts a method argument to specify how the imputed values are filled. Here, we will up-sample the business-day data to a daily frequency (that is, including weekends): with the default arguments, non-business days are left as NA values, while the forward-filling and backward-filling options fill those gaps from the preceding or following value, respectively.
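A sketch of these steps; note that it assumes the Yahoo Finance source used by pandas-datareader is still reachable, which may not be the case:

from pandas_datareader import data

# Load Google's price history (assumes the Yahoo Finance source is available)
goog = data.DataReader('GOOG', data_source='yahoo', start='2004', end='2016')
close = goog['Close']                  # keep only the closing price

# Down-sampling: 'BA' marks the end of each business year
close.resample('BA').mean()            # average over each preceding year
close.asfreq('BA')                     # value at the last business day of each year

# Up-sampling the business-day data to daily frequency, filling the new slots
close.asfreq('D')                      # weekends left as NA
close.asfreq('D', method='ffill')      # carry the last value forward
close.asfreq('D', method='bfill')      # pull the next value backward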
Time-Shifts

Another common time series-specific operation is shifting data in time. Pandas provides two closely related methods for this: shift() and tshift(). The main difference between them is that shift() shifts the data, while tshift() shifts the index. In both cases, the shift is specified in multiples of the frequency. Here we will both shift() and tshift() by 900 days. We can see that shift(900) shifts the data by 900 days, pushing some of it off the end of the graph (and leaving NA values at the other end), whereas tshift(900) shifts the index values by 900 days.

A common context for this kind of shift is computing differences over time. For example, we can use shifted values to compute the one-year return on investment for Google stock over the course of the dataset. This helps us see the overall trend in Google stock: so far, the most profitable times to invest in Google have been (unsurprisingly, in retrospect) shortly after its IPO and at the height of the 2009 recession.

Rolling Windows

Rolling statistics are a third type of time series-specific operation implemented by Pandas. These can be accomplished via the rolling() attribute of Series and DataFrame objects, which returns a view similar to what we saw with the groupby operation (see Aggregation and Grouping). This rolling view makes a number of aggregation operations available by default. For example, here is the one-year centred rolling mean and standard deviation of the Google stock prices. As with group-by operations, the aggregate() and apply() methods can be used for custom rolling computations.
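Finally, here is a self-contained sketch of the time-shift and rolling-window operations described above. Because the stock download may not be reproducible, it uses a stand-in daily price series; the real, forward-filled closing prices could be substituted if available:

import numpy as np
import pandas as pd

# Stand-in daily price series (substitute the real closing prices if available)
idx = pd.date_range('2004-08-19', '2016-12-30', freq='D')
price = pd.Series(np.linspace(2.5, 800, len(idx)), index=idx)

# Time shifts: shift() moves the data, leaving NA values at one end;
# shifting with freq='D' moves the index instead, which is the role tshift()
# plays in older pandas versions (it has since been removed in favour of
# shift(..., freq=...))
price.shift(900)
price.shift(900, freq='D')

# One-year return on investment computed from shifted values
roi = 100 * (price.shift(-365, freq='D') / price - 1)

# Rolling windows: a one-year centred window over the prices
rolling = price.rolling(365, center=True)
rolling.mean()                               # overall trend of the series
rolling.std()                                # rolling volatility

# Custom rolling computations via apply()
rolling.apply(lambda x: x.max() - x.min())   # range within each window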