Notes from Advances in Financial Machine Learning: Data Bars and Sampling

Machine Learning
Python
Quantitative Finance
Chapter Notes
Author: M. L. De Prado
Author

Louis Becker

Published

March 8, 2023

Introduction

This is the first in a series of blog posts summarising chapters from Advances in Financial Machine Learning by Marcos Lopez de Prado. In this learning expedition I try to summarise Lopez de Prado’s explanations of financial data structures. This post focuses on data bars and how to construct the various types of bars. I add data and code to make these concepts as practical and actionable as possible.

Bars, What are They?

Most ML algorithms assume a table representation of the extracted data. Rows from these tables are often referred to as “bars”. Two types of bars exist in this setting, namely standard bars and information-driven bars.

As a quick sidenote, the code written here was inspired by this Github repository. To extract bars, we use some Python and R code on the S&P 500 Value Index. See the Data section for more information on the underlying data. Let us import the relevant libraries and load the data:

# Python
import pandas as pd

df = pd.read_parquet('path/to/clean_IVE_tickbidask.parq')

# R
library(arrow)
library(data.table)
library(janitor)
library(tidyverse)
library(xts)

df <- read_parquet("path/to/clean_IVE_tickbidask.parq")

Standard Bars

The purpose of so-called “standard” bar methods is to transform a series of observations that arrive at irregular frequency (often referred to as “inhomogeneous series”) into a homogeneous series derived from regular sampling. Lopez de Prado names time bars, tick bars, volume bars and dollar bars as examples of standard bars.

Time Bars

Time bars are obtained by sampling information at fixed time intervals. Although they are the most popular bar type, time bars should be avoided where possible because markets do not process information at a constant rate. Time bars therefore risk oversampling information during low-activity periods and undersampling information during high-activity periods. Another argument the author makes against time bars is that time-sampled financial series exhibit poor statistical properties, such as serial correlation, heteroscedasticity and non-normality of returns.

Examples of information that constitute time bars:

  • Timestamp
  • Volume-weighted average price (VWAP)
  • Open price
  • Close price
  • High price
  • Low price
  • Volume traded

Here is an example of time bars sampled per minute:

df_minute_bar = df['price'].resample('min').ohlc().dropna()
df_minute_bar.tail()
                         open      high       low     close
datetime                                                   
2023-03-22 15:56:00  145.7950  145.9143  145.7950  145.9143
2023-03-22 15:57:00  145.7986  145.8200  145.7986  145.8200
2023-03-22 15:58:00  145.8000  145.8500  145.7200  145.7300
2023-03-22 15:59:00  145.7200  145.7300  145.3900  145.3900
2023-03-22 16:00:00  145.4600  145.5000  145.4600  145.5000
# in order to replicate the Python code
df_minute_bar <- df %>%
    data.table() %>%
    select(datetime, price) %>% 
    to.minutes() %>%
    clean_names()

# need to investigate why Python and R yield different time stamps; likely a time-zone issue
tail(df_minute_bar)
                        open     high      low    close
2023-03-22 17:55:40 145.9800 145.9900 145.6600 145.7400
2023-03-22 17:56:42 145.7950 145.9143 145.7950 145.9143
2023-03-22 17:57:51 145.7986 145.8200 145.7986 145.8200
2023-03-22 17:58:54 145.8000 145.8500 145.7200 145.7300
2023-03-22 17:59:56 145.7200 145.7300 145.3900 145.3900
2023-03-22 18:00:01 145.4600 145.5000 145.4600 145.5000
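
The OHLC resample above covers only four of the fields listed earlier. Here is a minimal sketch of a fuller time-bar builder that also computes volume and VWAP, assuming tick data indexed by datetime with the same `price` and `vol` columns used in this post (the helper name `time_bars` is my own):

```python
import pandas as pd

def time_bars(ticks: pd.DataFrame, freq: str = "min") -> pd.DataFrame:
    """Build time bars (OHLC, volume, VWAP) from tick data with a DatetimeIndex."""
    tmp = ticks.assign(dv=ticks["price"] * ticks["vol"])  # dollar value per tick
    g = tmp.resample(freq)
    bars = g["price"].ohlc()
    bars["vol"] = g["vol"].sum()
    # VWAP = sum(price * volume) / sum(volume) within each bar
    bars["vwap"] = g["dv"].sum() / bars["vol"]
    return bars.dropna()
```

The final `dropna()` drops intervals with no ticks, which would otherwise appear as empty rows in the homogeneous grid.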

Tick Bars

With tick bars, the idea is to sample the variables of interest (see above) each time a predefined number of transactions, or ticks, occurs. This synchronizes sampling with a proxy of information arrival (the speed at which ticks originate). Sampling as a function of trading activity yields returns that are closer to independent and identically distributed (\(iid\)) normal. This is important, because many statistical methods rely on the assumption that observations are drawn from an \(iid\) Gaussian process.

Implementing a tick bar approach is straightforward. Simply choose a predefined number of transactions or ticks, and sample.

# define threshold, t
t = 10

# select bar every t rows
tick_bars = df.iloc[::t]
tick_bars.tail()
                      price     bid     ask  vol  dollar_vol
datetime                                                    
2023-03-22 15:58:19  145.80  145.79  145.79  144     20995.2
2023-03-22 15:58:48  145.72  145.69  145.74  100     14572.0
2023-03-22 15:59:01  145.70  145.70  145.73  141     20543.7
2023-03-22 15:59:22  145.60  145.60  145.64  505     73528.0
2023-03-22 16:00:01  145.50  145.41  145.51  100     14550.0
# define threshold, t
t <- 10

# select bar every t rows
tick_bars <- df[seq(1, nrow(df), t),]
tail(tick_bars)
# A tibble: 6 × 6
  datetime            price   bid   ask   vol dollar_vol
  <dttm>              <dbl> <dbl> <dbl> <int>      <dbl>
1 2023-03-22 17:58:07  146.  146.  146.   129     18807.
2 2023-03-22 17:58:19  146.  146.  146.   144     20995.
3 2023-03-22 17:58:48  146.  146.  146.   100     14572 
4 2023-03-22 17:59:01  146.  146.  146.   141     20544.
5 2023-03-22 17:59:22  146.  146.  146.   505     73528 
6 2023-03-22 18:00:01  146.  145.  146.   100     14550 

Care should be taken to cater for outliers when constructing tick bars. Many exchanges, for example, carry out an auction at the open and an auction at the close. This means that for a period of time, the order book accumulates bids and offers without matching them. When the auction concludes, a large trade is published at the clearing price, for an outsized amount. This auction trade could be the equivalent of thousands of ticks, even though it is reported as one tick.

A drawback of tick bars is that order fragmentation introduces some arbitrariness in the number of ticks. Suppose an order of size 5 is sitting on the offer. Buying 5 lots is recorded as one tick. If instead five orders of size 1 are on offer, our single buy is recorded as five separate transactions.

Volume Bars

Volume bars circumvent the above-mentioned problem with tick bars by sampling every time a predefined number of the security’s units (shares, futures contracts, etc.) has been exchanged. Prices could be sampled every time 500 units of a share are exchanged, regardless of the number of ticks involved. Volume data is widely available these days, and returns sampled by volume have been shown to be closer to an \(iid\) Gaussian distribution than returns sampled by ticks. Moreover, volume bars tend to fit market microstructure theories well, and this format of sampling provides a more convenient structure for related analysis.

Here is a practical example of how it could be coded:

# define threshold, t
t = 1000

alt = df.reset_index()    

idx = []
sampled_vol = []
cum_vol = 0

for i, v in alt.vol.items():
    cum_vol = cum_vol + v 
    if cum_vol >= t:
        idx.append(i)
        sampled_vol.append(cum_vol)
        cum_vol = 0

df_volume_bar = alt.loc[idx].copy()  # .copy() avoids SettingWithCopyWarning
df_volume_bar['cum_vol'] = sampled_vol
df_volume_bar = df_volume_bar.set_index('datetime')

df_volume_bar.tail()
                      price     bid     ask    vol  dollar_vol  cum_vol
datetime                                                               
2023-03-22 15:59:22  145.60  145.60  145.64    505    73528.00   1315.0
2023-03-22 15:59:54  145.47  145.40  145.47    481    69971.07   1047.0
2023-03-22 15:59:56  145.39  145.40  145.45   1027   149315.53   1127.0
2023-03-22 15:59:56  145.39  145.40  145.45   1800   261702.00   1800.0
2023-03-22 16:00:00  145.46  145.37  145.46  55922  8134414.12  56022.0
# Define threshold, t
t <- 1000  

idx <- vector(mode = "numeric", length = 200000)
sampled_vol <- vector(mode = "numeric", length = 200000)
cum_vol <- 0
j <- 1

for (i in 1:nrow(df)) {
    cum_vol <- cum_vol + df$vol[i]

    if (cum_vol >= t) {
        idx[j] <- i
        sampled_vol[j] <- cum_vol
        cum_vol <- 0
        j <- j + 1
    }
}

idx <- idx[idx != 0]
sampled_vol <- sampled_vol[sampled_vol != 0]

df_volume_bar <- df %>%
    rowid_to_column() %>%
    inner_join(tibble(rowid=idx, cum_vol = sampled_vol), by = "rowid") %>%
    select(-rowid)

tail(df_volume_bar)
# A tibble: 6 × 7
  datetime            price   bid   ask   vol dollar_vol cum_vol
  <dttm>              <dbl> <dbl> <dbl> <int>      <dbl>   <dbl>
1 2023-03-22 17:59:11  146.  146.  146.  1400    203980     1400
2 2023-03-22 17:59:22  146.  146.  146.   505     73528     1315
3 2023-03-22 17:59:54  145.  145.  145.   481     69971.    1047
4 2023-03-22 17:59:56  145.  145.  145.  1027    149316.    1127
5 2023-03-22 17:59:56  145.  145.  145.  1800    261702     1800
6 2023-03-22 18:00:00  145.  145.  145. 55922   8134414.   56022

Dollar Bars

Dollar bars generically refer to the act of sampling an observation every time a predefined market value is exchanged (in a particular currency). It allows for sampling bars in terms of dollar (currency) value exchanged, rather than ticks or volume. This approach is particularly useful when the analysis involves significant price fluctuations. Dollar bars also allow for the number of units (e.g. shares or futures contracts) traded to be a function of the value exchanged.

With some assets, the number of time and volume bars for a given bar size can fluctuate quite wildly over certain periods. In contrast, an advantage of sampling dollar bars is that, when using a fixed size, this approach can reduce the range and speed of variation in the number of bars. Another useful feature of dollar bars is that they tend to be more robust in the face of corporate actions. The number of shares outstanding often changes multiple times over the course of a security’s life because of corporate actions. Still, you may want to sample dollar bars where the size of the bar is not kept constant over time. Instead, the bar size could be adjusted dynamically as a function of the free-floating market capitalization of a company (in the case of stocks), or the outstanding amount of issued debt (in the case of fixed-income securities).

# define threshold, t
t = 100000

alt = df.reset_index()   

idx = []
sampled_dvol = []
cum_dvol = 0

for i, dv in alt.dollar_vol.items():
    cum_dvol = cum_dvol + dv 

    if cum_dvol >= t:
        idx.append(i)
        sampled_dvol.append(cum_dvol)
        cum_dvol = 0 

df_dollar_bar = alt.loc[idx].copy()  # .copy() avoids SettingWithCopyWarning
df_dollar_bar['cum_dollar_vol'] = [round(elem, 2) for elem in sampled_dvol]  # keep numeric, rounded to cents
df_dollar_bar = df_dollar_bar.set_index('datetime')

df_dollar_bar.head()
                       price    bid    ask   vol  dollar_vol cum_dollar_vol
datetime                                                                   
2009-09-28 09:32:06  50.7800  50.76  50.78   500  25390.0000      118655.98
2009-09-28 09:33:54  50.8200  50.80  50.82   100   5082.0000      101525.62
2009-09-28 09:37:33  50.8299  50.80  50.83   166   8437.7634      105001.56
2009-09-28 09:41:53  50.8400  50.83  50.84   200  10168.0000      108836.07
2009-09-28 09:44:09  50.9100  50.91  50.92  1100  56001.0000      101767.00
# Define threshold, t
t <- 100000

idx <- vector(mode = "numeric", length = 200000)
sampled_dvol <- vector(mode = "numeric", length = 200000)
cum_dvol <- 0
j <- 1

for (i in 1:nrow(df)) {
    cum_dvol <- cum_dvol + df$dollar_vol[i]

    if (cum_dvol >= t) {
        idx[j] <- i
        sampled_dvol[j] <- cum_dvol
        cum_dvol <- 0
        j <- j + 1
    }
}

idx <- idx[idx != 0]
sampled_dvol <- sampled_dvol[sampled_dvol != 0]

df_dollar_bar <- df %>%
    rowid_to_column() %>%
    inner_join(tibble(rowid=idx, cum_dvol = sampled_dvol), by = "rowid") %>%
    select(-rowid)
    
tail(df_dollar_bar)
# A tibble: 6 × 7
  datetime            price   bid   ask   vol dollar_vol cum_dvol
  <dttm>              <dbl> <dbl> <dbl> <int>      <dbl>    <dbl>
1 2023-03-22 17:59:21  146.  146.  146.   155     22570.  117972.
2 2023-03-22 17:59:45  146.  146.  146.   290     42202.  115730.
3 2023-03-22 17:59:54  145.  145.  145.   481     69971.  110121.
4 2023-03-22 17:59:56  145.  145.  145.  1027    149316.  163856.
5 2023-03-22 17:59:56  145.  145.  145.  1800    261702   261702 
6 2023-03-22 18:00:00  145.  145.  145. 55922   8134414. 8148953.

Information-driven Bars

The purpose of information-driven bars is to sample more frequently when new information arrives in the market. Market microstructure theories confer special importance to the persistence of imbalanced signed volumes, as that phenomenon is associated with the presence of informed traders. By synchronising sampling with the arrival of informed traders, we may be able to make decisions before prices reach a new equilibrium level. The following are examples of information-driven bars:

Tick Imbalance Bars

The idea behind tick imbalance bars (TIBs) is to sample bars whenever tick imbalances exceed our expectations. Consider a sequence of ticks \(\{(p_t, v_t)\}_{t=1,...,T}\), where \(p_t\) is the price and \(v_t\) the volume of tick \(t\). We define a “tick rule” that delineates a sequence \(\{b_t\}_{t=1,...,T}\) where

\[ b_t = \begin{cases} b_{t-1} & \text{if $\Delta p_t = 0$}\\ \frac{|\Delta p_t|}{\Delta p_t} & \text{if $\Delta p_t \neq 0$}\\ \end{cases} \]

with \(b_t \in \{-1,1\}\), and the boundary condition \(b_0\) set to match the terminal value \(b_T\) from the immediately preceding bar. In other words, we want to determine the tick index \(T\) at which the accumulation of signed ticks exceeds a given threshold. To determine \(T\):

  1. Define the tick imbalance at time \(T\) as \[ \theta_T = \sum^T_{t=1}b_t \]
  2. Compute the expected value of \(\theta_T\) at the beginning of the bar: \[ E_0[\theta_T] = E_0[T](P[b_t = 1] - P[b_t = -1]) \]
    • Where:
      • \(E_0[T]\) is the expected size of the tick bar
      • \(P[b_t = 1]\) is the unconditional probability that a tick is classified as a buy
      • \(P[b_t = -1]\) is the unconditional probability that a tick is classified as a sell
      • The two unconditional probabilities sum to 1, implying that \(E_0[\theta_T] = E_0[T](2P[b_t = 1] - 1)\)
    • \(E_0[T]\) is estimated as an exponentially weighted moving average of \(T\) values from prior bars
    • \((2P[b_t = 1] - 1)\) is estimated as an exponentially weighted moving average of \(b_t\) values from prior bars
  3. Define a TIB as a \(T^*\)-contiguous subset of ticks such that the following condition is met: \[ T^* = \arg \min_T \{|\theta_T|\geq E_0[T]|2P[b_t = 1]-1| \} \]
    • Where the size of the expected imbalance is implied by \(|2P[b_t = 1]-1|\)
    • When \(\theta_T\) is more imbalanced than expected, a low \(T\) will satisfy these conditions
    • With this approach, TIBs are produced more frequently in the presence of informed trading (asymmetric information that triggers one-sided trading)
    • TIBs can be seen as buckets of trades containing equal amounts of information, regardless of the volumes, prices or ticks traded.
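
The steps above can be sketched in a few lines of Python. The seed values `expected_T` and `expected_b` and the EWMA decay `alpha` are illustrative assumptions (the book does not prescribe them), as is the boundary condition \(b_0 = +1\):

```python
import numpy as np

def tick_rule(prices):
    """b_t: carry the previous sign forward when the price is unchanged.
    b_0 = +1 is an assumption (the book carries it over from the prior bar)."""
    b = np.ones(len(prices))
    for t in range(1, len(prices)):
        dp = prices[t] - prices[t - 1]
        b[t] = b[t - 1] if dp == 0 else np.sign(dp)
    return b

def tick_imbalance_bars(prices, expected_T=100, expected_b=0.5, alpha=0.1):
    """Return indices of ticks that close a TIB.
    expected_T seeds E0[T]; expected_b seeds the EWMA of b_t, i.e. 2P[b_t=1]-1."""
    b = tick_rule(prices)
    bar_ends, theta, count = [], 0.0, 0
    for t in range(len(b)):
        theta += b[t]
        count += 1
        if abs(theta) >= expected_T * abs(expected_b):  # |theta_T| >= E0[T]|2P[b_t=1]-1|
            bar_ends.append(t)
            # update both expectations with the bar that just closed
            expected_T = alpha * count + (1 - alpha) * expected_T
            expected_b = alpha * (theta / count) + (1 - alpha) * expected_b
            theta, count = 0.0, 0
    return bar_ends
```

Note that in balanced markets the EWMA of \(b_t\) can drift toward zero, which shrinks the threshold and can produce very small bars; practical implementations often impose a floor on the expected imbalance.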

Volume/Dollar Imbalance Bars

Volume imbalance bars (VIBs) and dollar imbalance bars (DIBs) extend the concept behind TIBs: these bars are sampled when volume or dollar imbalances diverge from our expectations. As with TIBs, the notions of a tick rule and a boundary condition \(b_0\) apply here. In the case of VIBs and DIBs:

  1. Define imbalance at time \(T\) as \[ \theta_T = \sum^T_{t=1}b_t v_t \]

    where \(v_t\) is either the number of securities traded (VIB) or the dollar amount exchanged (DIB).

  2. Compute the expected value of \(\theta_T\) at the beginning of the bar:

    \[ \begin{aligned} E_0[\theta_T] & = E_0 \left[\sum^T_{t|b_t = 1}v_t \right] - E_0 \left[\sum^T_{t|b_t = -1}v_t \right] \\ E_0[\theta_T] & = E_0[T](P[b_t = 1]E_0[v_t|b_t = 1] - P[b_t = -1]E_0[v_t|b_t = -1]) \end{aligned} \]

    If we denote \[ \begin{aligned} v^+ & = P[b_t = 1]E_0[v_t|b_t = 1] \\ v^- & = P[b_t = -1]E_0[v_t|b_t = -1] \end{aligned} \] such that \[ E_0[T]^{-1}E_0\left[\sum_t v_t \right] = E_0[v_t] = v^+ + v^- \] then \[ E_0[\theta_T] = E_0[T](v^+ - v^-) = E_0[T](2v^+ - E_0[v_t]) \]

    In practice, we can estimate \(E_0[T]\) as an exponentially weighted moving average of \(T\) values from prior bars, and \((2v^+ − E_0[v_t])\) as an exponentially weighted moving average of \(b_tv_t\) values from prior bars.

  3. Define VIB or DIB as a \(T^∗\)-contiguous subset of ticks such that the following condition is met:

    \[ T^* = \arg \min_T\{|\theta_T| \geq E_0[T]|2v^+ - E_0[v_t]|\} \]

    • Where the size of the expected imbalance is implied by \(|2v^+ - E_0[v_t]|\)
    • When \(\theta_T\) is more imbalanced than expected, a low \(T\) will satisfy these conditions
    • This approach addresses concerns regarding tick fragmentation and outliers
    • It also addresses the issue of corporate actions, because the above procedure does not rely on a constant bar size; the bar size is adjusted dynamically
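
The same skeleton extends to VIBs and DIBs by weighting each signed tick with \(v_t\). Since \(2v^+ - E_0[v_t] = v^+ - v^- = E_0[b_tv_t]\), the sketch below tracks a single EWMA of \(b_tv_t\); the seed values are again illustrative assumptions, as is \(b_0 = +1\):

```python
import numpy as np

def imbalance_bars(prices, values, expected_T=100, expected_bv=1.0, alpha=0.1):
    """Indices of ticks closing a VIB (values = volumes) or DIB (values = price * volume)."""
    prices = np.asarray(prices, dtype=float)
    values = np.asarray(values, dtype=float)
    # tick rule: sign of the price change, carried forward when unchanged; b_0 assumed +1
    b = np.sign(np.diff(prices, prepend=prices[0]))
    for t in range(len(b)):
        if b[t] == 0:
            b[t] = b[t - 1] if t > 0 else 1.0
    bar_ends, theta, count = [], 0.0, 0
    for t in range(len(b)):
        theta += b[t] * values[t]
        count += 1
        # |theta_T| >= E0[T]|2v+ - E0[v_t]|, with 2v+ - E0[v_t] tracked as an EWMA of b_t v_t
        if abs(theta) >= expected_T * abs(expected_bv):
            bar_ends.append(t)
            expected_T = alpha * count + (1 - alpha) * expected_T
            expected_bv = alpha * (theta / count) + (1 - alpha) * expected_bv
            theta, count = 0.0, 0
    return bar_ends
```

Passing the `vol` column as `values` gives VIBs; passing `price * vol` gives DIBs.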

Tick Runs Bars

TIBs, VIBs, and DIBs monitor order flow imbalance, as measured in terms of ticks, volumes, and dollar values exchanged. Large traders will sweep the order book, use iceberg orders, or slice a parent order into multiple children, all of which leave a trace of runs in the \(\{b_t\}_{t=1,...,T}\) sequence. For this reason, it can be useful to monitor the sequence of buys in the overall volume, and take samples when that sequence diverges from our expectations.

  1. Define the length of the current run as \[ \theta_T = \max \left\{ \sum^T_{t|b_t=1}b_t, -\sum^T_{t|b_t=-1}b_t \right\} \]

  2. Compute expected value of \(\theta_T\) at the beginning of the bar \[ E_0[\theta_T] = E_0[T]\max\{P[b_t=1], 1-P[b_t=1]\} \]

    • \(E_0\) is estimated as an exponentially weighted moving average of \(T\) values from prior bars
    • \(P[b_t=1]\) is estimated as an exponentially weighted moving average of the proportion of buy ticks from prior bars
  3. Define a tick runs bar (TRB) as a \(T^∗\)-contiguous subset of ticks such that the following condition is met: \[ T^* = \arg \min_T\{\theta_T \geq E_0[T]\max\{P[b_t=1], 1-P[b_t=1]\}\} \]

    • Where the expected count of ticks from runs is implied by \(\max\{P[b_t=1], 1-P[b_t=1]\}\)
    • When \(\theta_T\) exhibits more runs than expected, a low \(T\) will satisfy these conditions
    • In this definition of runs we allow for sequence breaks. Instead of measuring the length of the longest sequence, we count the number of ticks on each side without offsetting them (no netting of the imbalance). In the context of forming bars, this turns out to be a more useful definition than measuring sequence lengths.
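
A sketch of the procedure, taking the tick-rule signs \(b_t\) as given; `expected_T` and `p_buy` seed the EWMAs and are illustrative assumptions, not values from the book:

```python
def tick_runs_bars(b, expected_T=100, p_buy=0.5, alpha=0.1):
    """Indices of ticks closing a tick runs bar, given tick-rule signs b (+1/-1)."""
    bar_ends, buys, sells, count = [], 0, 0, 0
    for t, bt in enumerate(b):
        count += 1
        if bt == 1:
            buys += 1
        else:
            sells += 1
        # theta_T = max(# buy ticks, # sell ticks) within the current bar
        if max(buys, sells) >= expected_T * max(p_buy, 1 - p_buy):
            bar_ends.append(t)
            # update E0[T] and P[b_t = 1] with the bar that just closed
            expected_T = alpha * count + (1 - alpha) * expected_T
            p_buy = alpha * (buys / count) + (1 - alpha) * p_buy
            buys, sells, count = 0, 0, 0
    return bar_ends
```

Notice how a sustained one-sided sequence keeps pushing `p_buy` up, so the threshold adapts to persistent runs rather than firing at a fixed count.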

Volume/Dollar Runs Bars

Volume runs bars (VRBs) and dollar runs bars (DRBs) extend the definition of runs to volumes and dollars exchanged, respectively. The intuition is that we wish to sample bars whenever the volumes or dollars traded by one side exceed our expectation for a bar.

  1. Define the volumes or dollars associated with a run as \[ \theta_T = \max \left\{ \sum^T_{t|b_t=1}b_tv_t, -\sum^T_{t|b_t=-1}b_tv_t \right\} \] where \(v_t\) is either the number of securities traded (VRB) or the dollar amount exchanged (DRB).

  2. Compute the expected value of \(\theta_T\) at the beginning of the bar: \[ E_0[\theta_T] = E_0[T]\max\{P[b_t = 1]E_0[v_t|b_t=1], (1-P[b_t = 1])E_0[v_t|b_t=-1]\} \]

    • \(E_0\) is estimated as an exponentially weighted moving average of \(T\) values from prior bars
    • \(P[b_t=1]\) is estimated as an exponentially weighted moving average of the proportion of buy ticks from prior bars
    • \(E_0[v_t|b_t=1]\) is estimated as an exponentially weighted moving average of the buy volumes from prior bars, and \(E_0[v_t|b_t=-1]\) as an exponentially weighted moving average of the sell volumes from prior bars
  3. Define a volume runs bar (VRB) as a \(T^∗\)-contiguous subset of ticks such that the following condition is met: \[ T^* = \arg \min_T \{\theta_T \geq E_0[T]\max \{P[b_t=1]E_0[v_t|b_t=1], (1-P[b_t = 1])E_0[v_t|b_t = -1]\}\} \]

    • where the expected volume from runs is implied by \(\max\{P[b_t=1]E_0[v_t|b_t=1], (1-P[b_t = 1])E_0[v_t|b_t = -1]\}\)
    • when \(\theta_T\) exhibits more runs than expected, or the volume from runs is greater than expected, a low \(T\) will satisfy these conditions

Sampling Features

It is useful to think about sampling strategies when applying machine-learning algorithms in finance for two reasons: First, several ML algorithms do not scale well with sample size (e.g., SVMs). Second, ML algorithms achieve highest accuracy when they attempt to learn from relevant examples. We will try to look at some palatable ways of sampling bars to produce a features matrix with relevant training examples.

The CUSUM Filter

The CUSUM filter is a quality-control method, designed to detect a shift in the mean value of a measured quantity away from a target value. If we consider a set of independently and identically distributed (\(iid\)) observations \(\{y_t\}_{t=1,...,T}\), we can define the cumulative sums as

\[ S_t = \max\{0, S_{t-1} + y_t - E_{t-1}[y_t]\} \]

with boundary condition \(S_0 = 0\). This procedure would recommend an action at the first \(t\) satisfying \(S_t \geq h\), for some threshold \(h\) (referred to as the filter size). It implies that \(S_t = 0\) whenever \(y_t \leq E_{t-1}[y_t] - S_{t-1}\). This zero floor means that we will skip some downward deviations that otherwise would make \(S_t\) negative. The reason is that the filter is set up to identify a sequence of upside divergences from any reset level zero.

The threshold is activated when

\[ S_t \geq h \iff \exists \tau \in [1,t] | \sum^t_{i=\tau}(y_i - E_{i-1}[y_i]) \geq h \]

In other words, the threshold is activated (\(S_t \geq h\)) if and only if there exists a time \(\tau\) between 1 and \(t\) such that the sum of the differences between each observation \(y_i\) and its expectation \(E_{i-1}[y_i]\) is at least the threshold \(h\).

We will sample a bar at time \(t\) if and only if \(S_t \geq h\), at which point \(S_t\) is reset. Let us examine an implementation where \(E_{t-1}[y_t] = y_{t-1}\).
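
The book also presents a symmetric version of the filter, which tracks upward and downward cumulative deviations simultaneously. Here is a sketch of that symmetric filter with \(E_{t-1}[y_t] = y_{t-1}\), so each increment is simply \(\Delta y_t\) (the function name `cusum_filter` is my own):

```python
import pandas as pd

def cusum_filter(series: pd.Series, h: float) -> pd.DatetimeIndex:
    """Symmetric CUSUM filter: sample whenever the cumulative upward or
    downward deviation from the previous value reaches the threshold h."""
    t_events, s_pos, s_neg = [], 0.0, 0.0
    diff = series.diff().dropna()
    for ts, dy in diff.items():
        s_pos = max(0.0, s_pos + dy)  # run of upside divergences
        s_neg = min(0.0, s_neg + dy)  # run of downside divergences
        if s_neg <= -h:
            s_neg = 0.0
            t_events.append(ts)
        elif s_pos >= h:
            s_pos = 0.0
            t_events.append(ts)
    return pd.DatetimeIndex(t_events)
```

Applied to a price series, the returned timestamps can then serve as the sampling index for the features matrix.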

Data

To me, the theory of financial machine learning is of little use if I cannot apply it to some actual data. Combining code and data with the mathematical concepts covered in the book is exactly what I attempt to do in this post and the posts to follow. Since this is a learning exercise for me, I shamelessly sample best practice and ideas from what others have done and incorporate what makes sense to me into solving these problems. Anyway, the tick data comes from Kibot. Read more on the data, and how it was processed, in this post. Here is a quick peek at what it looks like:


Data Sample

                      price     bid     ask    vol  dollar_vol
datetime                                                      
2009-09-28 09:30:00   50.79   50.70   50.79    100     5079.00
2009-09-28 09:30:00   50.71   50.70   50.79    638    32352.98
2009-09-28 09:31:32   50.75   50.75   50.76    100     5075.00
2009-09-28 09:31:33   50.75   50.72   50.75    100     5075.00
2009-09-28 09:31:50   50.75   50.73   50.76    300    15225.00
...                     ...     ...     ...    ...         ...
2023-03-22 15:59:56  145.39  145.40  145.45   1027   149315.53
2023-03-22 15:59:56  145.39  145.40  145.45   1800   261702.00
2023-03-22 15:59:56  145.39  145.40  145.45    100    14539.00
2023-03-22 16:00:00  145.46  145.37  145.46  55922  8134414.12
2023-03-22 16:00:01  145.50  145.41  145.51    100    14550.00

[2521941 rows x 5 columns]