```
import pandas as pd

df = pd.read_parquet('path/to/clean_IVE_tickbidask.parq')
```

# Notes from Advances in Financial Machine Learning: Data Bars and Sampling

## Introduction

This is my first of a series of blog posts summarising chapters from Advances in Financial Machine Learning by Marcos Lopez de Prado. In this learning expedition I try to summarise Lopez de Prado’s explanations of financial data structures. This post focuses on data bars and how to construct the various types of bars. I add data and code to make these concepts as practical and actionable as possible.

## Bars, What are They?

Most ML algorithms assume a table representation of the extracted data. Rows from these tables are often referred to as “bars”. Two types of bars exist in this setting, namely standard bars and information-driven bars.

As a quick sidenote, the code written here was inspired by this Github repository. To extract bars, we use some Python and R code on the S&P 500 Value Index. See the Data section for more information on the underlying data. Let us import the relevant libraries:

```
library(arrow)
library(data.table)
library(janitor)
library(tidyverse)
library(xts)

df <- read_parquet("path/to/clean_IVE_tickbidask.parq")
```

### Standard Bars

The purpose of so-called “standard” bar methods is to transform a series of observations that arrive at irregular frequency (often referred to as “inhomogeneous series”) into a homogeneous series derived from regular sampling. Lopez de Prado names time bars, tick bars, volume bars and dollar bars as examples of standard bars.

#### Time Bars

Time bars are obtained by sampling information at fixed time intervals. Although they are the most popular, time bars should be avoided when possible because markets do not process information at a constant rate over time. This implies that time bars run the risk of oversampling information during low-activity periods and undersampling information during high-activity periods. Another argument the author makes against using time bars is that time-sampled financial series exhibit poor statistical properties such as serial correlation, heteroscedasticity and non-normality of returns.

Examples of information that constitute time bars:

- Timestamp
- Volume-weighted average price (VWAP)
- Open price
- Close price
- High price
- Low price
- Volume traded

Here is an example of time bars sampled per minute:

```
df_minute_bar = df['price'].resample('min').ohlc().dropna()
df_minute_bar.tail()
```

```
open high low close
datetime
2023-03-22 15:56:00 145.7950 145.9143 145.7950 145.9143
2023-03-22 15:57:00 145.7986 145.8200 145.7986 145.8200
2023-03-22 15:58:00 145.8000 145.8500 145.7200 145.7300
2023-03-22 15:59:00 145.7200 145.7300 145.3900 145.3900
2023-03-22 16:00:00 145.4600 145.5000 145.4600 145.5000
```

```
# in order to replicate the python code
df_minute_bar <- df %>%
  data.table() %>%
  select(datetime, price) %>%
  to.minutes() %>%
  clean_names()

# need to investigate why python and R yield different time stamps; likely due to locale
tail(df_minute_bar)
```

```
open high low close
2023-03-22 17:55:40 145.9800 145.9900 145.6600 145.7400
2023-03-22 17:56:42 145.7950 145.9143 145.7950 145.9143
2023-03-22 17:57:51 145.7986 145.8200 145.7986 145.8200
2023-03-22 17:58:54 145.8000 145.8500 145.7200 145.7300
2023-03-22 17:59:56 145.7200 145.7300 145.3900 145.3900
2023-03-22 18:00:01 145.4600 145.5000 145.4600 145.5000
```

#### Tick Bars

With tick bars, the idea is to sample the variables of interest (see above) each time a predefined number of transactions, or ticks, occurs. This allows us to synchronize sampling with a proxy of information arrival (the speed at which ticks originate). Sampling as a function of trading activity in this way yields returns that are closer to independent and identically distributed (\(iid\)) normal. This is important, because many statistical methods rely on the assumption that observations are drawn from an \(iid\) Gaussian process.

Implementing a tick bar approach is straightforward. Simply choose a predefined number of transactions or ticks, and sample.

```
# define threshold, t
t = 10

# select bar every t rows
tick_bars = df.iloc[::t]
tick_bars.tail()
```

```
price bid ask vol dollar_vol
datetime
2023-03-22 15:58:19 145.80 145.79 145.79 144 20995.2
2023-03-22 15:58:48 145.72 145.69 145.74 100 14572.0
2023-03-22 15:59:01 145.70 145.70 145.73 141 20543.7
2023-03-22 15:59:22 145.60 145.60 145.64 505 73528.0
2023-03-22 16:00:01 145.50 145.41 145.51 100 14550.0
```

```
# define threshold, t
t <- 10

# select bar every t rows
tick_bars <- df[seq(1, nrow(df), t), ]
tail(tick_bars)
```

```
# A tibble: 6 × 6
datetime price bid ask vol dollar_vol
<dttm> <dbl> <dbl> <dbl> <int> <dbl>
1 2023-03-22 17:58:07 146. 146. 146. 129 18807.
2 2023-03-22 17:58:19 146. 146. 146. 144 20995.
3 2023-03-22 17:58:48 146. 146. 146. 100 14572
4 2023-03-22 17:59:01 146. 146. 146. 141 20544.
5 2023-03-22 17:59:22 146. 146. 146. 505 73528
6 2023-03-22 18:00:01 146. 145. 146. 100 14550
```

Care should be taken to cater for outliers when constructing tick bars. Many exchanges, for example, carry out an auction at the open and an auction at the close. This means that for a period of time, the order book accumulates bids and offers without matching them. When the auction concludes, a large trade is published at the clearing price, for an outsized amount. This auction trade could be the equivalent of thousands of ticks, even though it is reported as one tick.

A drawback of tick bars is that order fragmentation introduces some arbitrariness in the number of ticks. Suppose a lot order of size, say, 5 is sitting on the offer. Buying 5 lots will be recorded as one tick. If instead there are 5 orders of size 1 on offer, our one buy will be recorded as 5 separate transactions.

#### Volume Bars

Volume bars circumvent the above-mentioned problem with tick bars by sampling every time a predefined amount of the security’s units (shares, futures contracts, etc.) has been exchanged. Prices could be sampled every time 500 units of a share are exchanged, regardless of the number of ticks involved. Volume data is widely published these days, and sampling returns by volume has been shown to produce distributions closer to \(iid\) Gaussian than tick bars do. Moreover, volume bars tend to fit market microstructure theories well, and this format of sampling provides a convenient structure for related analysis.

Here is a practical example of how it could be coded:

```
# define threshold, t
t = 1000

alt = df.reset_index()

idx = []
sampled_vol = []
cum_vol = 0

for i, v in alt.vol.items():
    cum_vol = cum_vol + v
    if cum_vol >= t:
        idx.append(i)
        sampled_vol.append(cum_vol)
        cum_vol = 0

df_volume_bar = alt.loc[idx]
df_volume_bar.loc[idx, 'cum_vol'] = sampled_vol
df_volume_bar = df_volume_bar.set_index('datetime')

df_volume_bar.tail()
```

```
price bid ask vol dollar_vol cum_vol
datetime
2023-03-22 15:59:22 145.60 145.60 145.64 505 73528.00 1315.0
2023-03-22 15:59:54 145.47 145.40 145.47 481 69971.07 1047.0
2023-03-22 15:59:56 145.39 145.40 145.45 1027 149315.53 1127.0
2023-03-22 15:59:56 145.39 145.40 145.45 1800 261702.00 1800.0
2023-03-22 16:00:00 145.46 145.37 145.46 55922 8134414.12 56022.0
```

```
# Define threshold, t
t <- 1000

idx <- vector(mode = "numeric", length = 200000)
sampled_vol <- vector(mode = "numeric", length = 200000)
cum_vol <- 0
j <- 1

for (i in 1:nrow(df)) {
  cum_vol <- cum_vol + df$vol[i]

  if (cum_vol >= t) {
    idx[j] <- i
    sampled_vol[j] <- cum_vol
    cum_vol <- 0
    j <- j + 1
  }
}

idx <- idx[idx != 0]
sampled_vol <- sampled_vol[sampled_vol != 0]

df_volume_bar <- df %>%
  rowid_to_column() %>%
  inner_join(tibble(rowid = idx, cum_vol = sampled_vol), by = "rowid") %>%
  select(-rowid)

tail(df_volume_bar)
```

```
# A tibble: 6 × 7
datetime price bid ask vol dollar_vol cum_vol
<dttm> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 2023-03-22 17:59:11 146. 146. 146. 1400 203980 1400
2 2023-03-22 17:59:22 146. 146. 146. 505 73528 1315
3 2023-03-22 17:59:54 145. 145. 145. 481 69971. 1047
4 2023-03-22 17:59:56 145. 145. 145. 1027 149316. 1127
5 2023-03-22 17:59:56 145. 145. 145. 1800 261702 1800
6 2023-03-22 18:00:00 145. 145. 145. 55922 8134414. 56022
```

#### Dollar Bars

Dollar bars generically refer to the act of sampling an observation every time a predefined market value is exchanged (in a particular currency). It allows for sampling bars in terms of dollar (currency) value exchanged, rather than ticks or volume. This approach is particularly useful when the analysis involves significant price fluctuations. Dollar bars also allow for the number of units (e.g. shares or futures contracts) traded to be a function of the value exchanged.

With some assets, the number of time and volume bars for a given bar size can fluctuate quite wildly over certain periods. In contrast, an advantage of sampling dollar bars is that, for a fixed bar size, this approach can reduce the range and speed of variation in the number of bars. Another useful feature of dollar bars is that they tend to be more robust in the face of corporate actions. The number of shares outstanding often changes multiple times over the course of a security’s life because of corporate actions. Still, you may want to sample dollar bars where the size of the bar is not kept constant over time. Instead, the bar size could be adjusted dynamically as a function of the free-floating market capitalization of a company (in the case of stocks), or the outstanding amount of issued debt (in the case of fixed-income securities).
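To make the dynamic sizing idea concrete, here is a hypothetical sketch. The function `dynamic_dollar_threshold` and its `window` and `bars_per_day` parameters are my own illustrative choices, not from the book: each day's dollar-bar threshold is set to a fraction of a rolling average of recent daily dollar volume, so that bar size tracks shifts in trading activity rather than staying fixed.

```python
import pandas as pd

def dynamic_dollar_threshold(dollar_vol, window=20, bars_per_day=50):
    """Hypothetical dynamic threshold: target roughly `bars_per_day` bars
    per day by dividing a rolling mean of daily dollar volume traded.
    `dollar_vol` is a Series of per-tick dollar values with a DatetimeIndex."""
    daily = dollar_vol.resample('1D').sum()   # total dollars traded per day
    daily = daily[daily > 0]                  # drop non-trading days
    avg = daily.rolling(window, min_periods=1).mean()
    return avg / bars_per_day                 # per-day dollar-bar threshold
```

The same idea could use free-float market cap instead of trailing dollar volume as the scaling quantity; trailing volume is simply the easiest proxy available in tick data.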

```
# define threshold, t
t = 100000

alt = df.reset_index()

idx = []
sampled_dvol = []
cum_dvol = 0

for i, dv in alt.dollar_vol.items():
    cum_dvol = cum_dvol + dv

    if cum_dvol >= t:
        idx.append(i)
        sampled_dvol.append(cum_dvol)
        cum_dvol = 0

df_dollar_bar = alt.loc[idx]
df_dollar_bar.loc[idx, 'cum_dollar_vol'] = ['%.2f' % elem for elem in sampled_dvol]
df_dollar_bar = df_dollar_bar.set_index('datetime')

df_dollar_bar.head()
```

```
price bid ask vol dollar_vol cum_dollar_vol
datetime
2009-09-28 09:32:06 50.7800 50.76 50.78 500 25390.0000 118655.98
2009-09-28 09:33:54 50.8200 50.80 50.82 100 5082.0000 101525.62
2009-09-28 09:37:33 50.8299 50.80 50.83 166 8437.7634 105001.56
2009-09-28 09:41:53 50.8400 50.83 50.84 200 10168.0000 108836.07
2009-09-28 09:44:09 50.9100 50.91 50.92 1100 56001.0000 101767.00
```

```
# Define threshold, t
t <- 100000

idx <- vector(mode = "numeric", length = 200000)
sampled_dvol <- vector(mode = "numeric", length = 200000)
cum_dvol <- 0
j <- 1

for (i in 1:nrow(df)) {
  cum_dvol <- cum_dvol + df$dollar_vol[i]

  if (cum_dvol >= t) {
    idx[j] <- i
    sampled_dvol[j] <- cum_dvol
    cum_dvol <- 0
    j <- j + 1
  }
}

idx <- idx[idx != 0]
sampled_dvol <- sampled_dvol[sampled_dvol != 0]

df_dollar_bar <- df %>%
  rowid_to_column() %>%
  inner_join(tibble(rowid = idx, cum_dvol = sampled_dvol), by = "rowid") %>%
  select(-rowid)

tail(df_dollar_bar)
```

```
# A tibble: 6 × 7
datetime price bid ask vol dollar_vol cum_dvol
<dttm> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 2023-03-22 17:59:21 146. 146. 146. 155 22570. 117972.
2 2023-03-22 17:59:45 146. 146. 146. 290 42202. 115730.
3 2023-03-22 17:59:54 145. 145. 145. 481 69971. 110121.
4 2023-03-22 17:59:56 145. 145. 145. 1027 149316. 163856.
5 2023-03-22 17:59:56 145. 145. 145. 1800 261702 261702
6 2023-03-22 18:00:00 145. 145. 145. 55922 8134414. 8148953.
```

### Information-driven Bars

The purpose of information-driven bars is to sample more frequently when new information arrives at the market. Market microstructure theories confer special importance to the persistence of imbalanced signed volumes, as that phenomenon is associated with the presence of informed traders. By synchronising sampling with the arrival of informed traders, we may be able to make decisions before prices reach a new equilibrium level. The following are examples of information-driven bars:

#### Tick Imbalance Bars

The idea behind tick imbalance bars (TIBs) is to sample bars whenever tick imbalances exceed our expectations. Consider a sequence of ticks \(\{(p_t, v_t)\}_{t=1,...,T}\), where price \(p_t\) and volume \(v_t\) together constitute tick \(t\). We define a “tick rule” that delineates a sequence \(\{b_t\}_{t=1,...,T}\) where

\[ b_t = \begin{cases} b_{t-1} & \text{if $\Delta p_t = 0$}\\ \frac{|\Delta p_t|}{\Delta p_t} & \text{if $\Delta p_t \neq 0$}\\ \end{cases} \]

with \(b_t \in \{-1,1\}\) and the boundary condition, \(b_0\) is set to match the terminal value \(b_T\) from the immediately preceding bar. In other words, we want to determine the tick index, \(T\), such that the accumulation of signed ticks exceeds a given threshold. To determine \(T\):

- Define the tick imbalance at time \(T\) as \[ \theta_T = \sum^T_{t=1}b_t \]
- Compute the expected value of \(\theta_T\) at the beginning of the bar: \[
  E_0[\theta_T] = E_0[T](P[b_t = 1] - P[b_t = -1])
  \] where:
  - \(E_0[T]\) is the expected size of the tick bar
  - \(P[b_t = 1]\) is the unconditional probability that a tick is classified as a *buy*
  - \(P[b_t = -1]\) is the unconditional probability that a tick is classified as a *sell*
  - The two unconditional probabilities sum to 1, implying that \(E_0[\theta_T] = E_0[T](2P[b_t = 1] - 1)\)
- In practice:
  - \(E_0[T]\) is estimated as an exponentially weighted moving average of \(T\) values from prior bars
  - \((2P[b_t = 1] - 1)\) is estimated as an exponentially weighted moving average of \(b_t\) values from prior bars
- Define a TIB as a \(T^*\)-contiguous subset of ticks such that the following condition is met: \[
  T^* = \arg \min_T \{|\theta_T|\geq E_0[T]|2P[b_t = 1]-1| \}
  \]
  - Where the size of the expected imbalance is implied by \(|2P[b_t = 1]-1|\)
  - When \(\theta_T\) is more imbalanced than expected, a low \(T\) will satisfy these conditions
- With this approach, TIBs are produced more frequently in the presence of informed trading (asymmetric information that triggers one-sided trading)
- TIBs can be seen as buckets of trades containing equal amounts of information, regardless of the volumes, prices or ticks traded.
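To make the procedure concrete, here is a minimal Python sketch of TIB sampling. The function names, the EWMA decay `alpha`, the warm-up guess `expected_ticks`, and the small floor on the expected imbalance (to avoid a zero threshold before the first bar closes) are all my own illustrative choices, not prescriptions from the book:

```python
import numpy as np
import pandas as pd

def tick_rule(prices):
    """Classify each tick as buy (+1) or sell (-1) via the tick rule;
    when the price is unchanged, carry the previous sign forward."""
    b = np.sign(np.diff(prices, prepend=prices[0]))
    # forward-fill zero signs; the very first tick defaults to +1
    return pd.Series(b).replace(0, np.nan).ffill().fillna(1).to_numpy()

def tick_imbalance_bars(prices, expected_ticks=20, alpha=0.5):
    """Emit a bar boundary whenever |theta_T| >= E0[T] * |2P[b=1] - 1|,
    updating both expectations as EWMAs over completed bars."""
    b = tick_rule(np.asarray(prices, dtype=float))
    e_T = float(expected_ticks)  # EWMA of bar sizes, E0[T]
    e_b = 0.0                    # EWMA of b_t, an estimate of 2P[b=1] - 1
    theta, n, bar_ends = 0.0, 0, []
    for i, bt in enumerate(b):
        theta += bt
        n += 1
        if abs(theta) >= e_T * max(abs(e_b), 1e-4):
            bar_ends.append(i)           # close the bar at tick i
            e_T = alpha * n + (1 - alpha) * e_T
            e_b = alpha * (theta / n) + (1 - alpha) * e_b
            theta, n = 0.0, 0
    return bar_ends
```

Note that with `e_b` initialised to zero, the first threshold is tiny and the first bar closes almost immediately; how to warm up the expectations is a judgment call in any imbalance-bar implementation.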

#### Volume/Dollar Imbalance Bars

Volume imbalance bars (VIBs) and dollar imbalance bars (DIBs) extend the concept behind TIBs in that these bars are sampled when volume or dollar imbalances diverge from our expectations. As with TIBs, the notions of a tick rule and a boundary condition \(b_0\) apply here. In the case of VIBs and DIBs:

Define imbalance at time \(T\) as \[ \theta_T = \sum^T_{t=1}b_t v_t \]

where \(v_t\) is either the number of securities traded (VIB) or the dollar amount exchanged (DIB).

Compute the expected value of \(\theta_T\) at the beginning of the bar:

\[ \begin{aligned} E_0[\theta_T] & = E_0 \left[\sum^T_{t|b_t = 1}v_t \right] - E_0 \left[\sum^T_{t|b_t = -1}v_t \right] \\ E_0[\theta_T] & = E_0[T](P[b_t = 1]E_0[v_t|b_t = 1] - P[b_t = -1]E_0[v_t|b_t = -1]) \end{aligned} \]

If we denote \[ \begin{aligned} v^+ & = P[b_t = 1]E_0[v_t|b_t = 1] \\ v^- & = P[b_t = -1]E_0[v_t|b_t = -1] \end{aligned} \] such that \[ E_0[T]^{-1}E_0\left[\sum_t v_t \right] = E_0[v_t] = v^+ + v^- \] then \[ E_0[\theta_T] = E_0[T](v^+ - v^-) = E_0[T](2v^+ - E_0[v_t]) \]

In practice, we can estimate \(E_0[T]\) as an exponentially weighted moving average of \(T\) values from prior bars, and \((2v^+ − E_0[v_t])\) as an exponentially weighted moving average of \(b_tv_t\) values from prior bars.

Define VIB or DIB as a \(T^∗\)-contiguous subset of ticks such that the following condition is met:

\[ T^* = \arg \min_T\{|\theta_T| \geq E_0[T]|2v^+ - E_0[v_t]|\} \]

- Where the size of the expected imbalance is implied by \(|2v^+ - E_0[v_t]|\)
- When \(\theta_T\) is more imbalanced than expected, a low \(T\) will satisfy these conditions
- This approach addresses concerns regarding tick fragmentation and outliers
- It also addresses the issue of corporate actions, because the above procedure does not rely on a constant bar size; the bar size is adjusted dynamically
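The sampling loop generalises directly from the TIB case: weight each signed tick by \(v_t\). Below is a hedged sketch with my own illustrative names and EWMA choices; passing share volumes as `v` yields VIBs, dollar values yields DIBs, and a vector of ones recovers TIBs:

```python
def imbalance_bars(b, v, expected_ticks=20, alpha=0.5):
    """Sample a bar boundary when |theta_T| = |sum(b_t * v_t)| exceeds
    E0[T] * |2v+ - E0[v_t]|, both estimated as EWMAs over prior bars.
    `b` is a sequence of +1/-1 tick classifications, `v` the sizes."""
    e_T = float(expected_ticks)   # EWMA of bar sizes, E0[T]
    e_bv = 0.0                    # EWMA of b_t * v_t, estimating 2v+ - E0[v_t]
    theta, n, bar_ends = 0.0, 0, []
    for i, (bt, vt) in enumerate(zip(b, v)):
        theta += bt * vt
        n += 1
        if abs(theta) >= e_T * max(abs(e_bv), 1e-4):
            bar_ends.append(i)
            e_T = alpha * n + (1 - alpha) * e_T
            e_bv = alpha * (theta / n) + (1 - alpha) * e_bv
            theta, n = 0.0, 0
    return bar_ends
```

As with the TIB sketch, the floor on the expected imbalance and the warm-up values are practical necessities the book leaves to the implementer.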

#### Tick Runs Bars

TIBs, VIBs, and DIBs monitor order flow imbalance, as measured in terms of ticks, volumes, and dollar values exchanged. Large traders will sweep the order book, use iceberg orders, or slice a parent order into multiple children, all of which leave a trace of runs in the \(\{b_t\}_{t=1,...,T}\) sequence. For this reason, it can be useful to monitor the *sequence* of buys in the overall volume, and take samples when that sequence diverges from our expectations.

Define the length of the current run as \[ \theta_T = \max \left\{ \sum^T_{t|b_t=1}b_t, -\sum^T_{t|b_t=-1}b_t \right\} \]

Compute expected value of \(\theta_T\) at the beginning of the bar \[ E_0[\theta_T] = E_0[T]\max\{P[b_t=1], 1-P[b_t=1]\} \]

- \(E_0\) is estimated as an exponentially weighted moving average of \(T\) values from prior bars
- \(P[b_t=1]\) is estimated as an exponentially weighted moving average of the proportion of buy ticks from prior bars

Define a tick runs bar (TRB) as a \(T^∗\)-contiguous subset of ticks such that the following condition is met: \[ T^* = \arg \min_T\{|\theta_T| \geq E_0[T]\max\{P[b_t=1], 1-P[b_t=1]\}\} \]

- Where the expected count of ticks from runs is implied by \(\max\{P[b_t=1], 1-P[b_t=1]\}\)
- When \(\theta_T\) exhibits more runs than expected, a low \(T\) will satisfy these conditions
- In this definition of runs we allow for sequence breaks. Instead of measuring the length of the longest sequence, we count the number of ticks on each side, without offsetting them (no netting of imbalances). In the context of forming bars, this turns out to be a more useful definition than measuring sequence lengths.

#### Volume/Dollar Runs Bars

Volume runs bars (VRBs) and dollar runs bars (DRBs) extend the definition of runs to volumes and dollars exchanged, respectively. The intuition is that we wish to sample bars whenever the volumes or dollars traded by one side exceed our expectation for a bar.

Define the volumes or dollars associated with a run as \[ \theta_T = \max \left\{ \sum^T_{t|b_t=1}b_tv_t, -\sum^T_{t|b_t=-1}b_tv_t \right\} \] where \(v_t\) is either the number of securities traded (VRB) or the dollar amount exchanged (DRB).

Compute the expected value of \(\theta_T\) at the beginning of the bar: \[ E_0[\theta_T] = E_0[T]\max\{P[b_t = 1]E_0[v_t|b_t=1], (1-P[b_t = 1])E_0[v_t|b_t=-1]\} \]

- \(E_0\) is estimated as an exponentially weighted moving average of \(T\) values from prior bars
- \(P[b_t=1]\) is estimated as an exponentially weighted moving average of the proportion of buy ticks from prior bars
- \(E_0[v_t|b_t=1]\) is estimated as an exponentially weighted moving average of the buy volumes from prior bars, and \(E_0[v_t|b_t=-1]\) as an exponentially weighted moving average of the sell volumes from prior bars

Define a volume runs bar (VRB) as a \(T^∗\)-contiguous subset of ticks such that the following condition is met: \[ T^* = \arg \min_T \{\theta_T \geq E_0[T]\max\{P[b_t=1]E_0[v_t|b_t=1], (1-P[b_t = 1])E_0[v_t|b_t = -1]\}\} \]

- where the expected volume from runs is implied by \(\max\{P[b_t=1]E_0[v_t|b_t=1], (1-P[b_t = 1])E_0[v_t|b_t = -1]\}\)
- when \(\theta_T\) exhibits more runs than expected, or the volume from runs is greater than expected, a low \(T\) will satisfy these conditions

## Sampling Features

It is useful to think about sampling strategies when applying machine-learning algorithms in finance for two reasons: First, several ML algorithms do not scale well with sample size (e.g., SVMs). Second, ML algorithms achieve highest accuracy when they attempt to learn from relevant examples. We will try to look at some palatable ways of sampling bars to produce a features matrix with relevant training examples.

### The CUSUM Filter

The CUSUM filter is a quality-control method, designed to detect a shift in the mean value of a measured quantity away from a target value. If we consider a set of independently and identically distributed (\(iid\)) observations \(\{y_t\}_{t=1,...,T}\), we can define the cumulative sums as

\[ S_t = \max\{0, S_{t-1} + y_t - E_{t-1}[y_t]\} \]

with boundary condition \(S_0 = 0\). This procedure would recommend an action at the first \(t\) satisfying \(S_t \geq h\), for some threshold \(h\) (referred to as the filter size). It implies that \(S_t = 0\) whenever \(y_t \leq E_{t-1}[y_t] - S_{t-1}\). This zero floor means that we will skip some downward deviations that would otherwise make \(S_t\) negative. The reason is that the filter is set up to identify a sequence of upside divergences from any reset level of zero.

The threshold is activated when

\[ S_t \geq h \iff \exists \tau \in [1,t] \mid \sum^t_{i=\tau}(y_i - E_{i-1}[y_i]) \geq h \]

In other words, the threshold is activated (\(S_t \geq h\)) if and only if there exists a time \(\tau\) between 1 and \(t\) such that the sum of the deviations of each observation from its expected value is at least the threshold \(h\).

We will sample a bar at \(t\) if and only if \(S_t \geq h\), at which point \(S_t\) is reset. Let’s examine an implementation where \(E_{t-1}[y_t] = y_{t-1}\).
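The derivation above develops the one-sided (upside) filter; in practice one usually runs a symmetric version that also catches downside run-ups. Here is a hedged sketch of that symmetric variant with \(E_{t-1}[y_t] = y_{t-1}\), so each increment reduces to the price change; the function name and interface are my own:

```python
import pandas as pd

def cusum_filter(prices, h):
    """Symmetric CUSUM filter with E_{t-1}[y_t] = y_{t-1}: accumulate
    price changes and record an event whenever either running sum
    breaches the threshold h, resetting that sum to zero."""
    events, s_pos, s_neg = [], 0.0, 0.0
    for ts, dy in prices.diff().dropna().items():
        s_pos = max(0.0, s_pos + dy)   # upside run, floored at zero
        s_neg = min(0.0, s_neg + dy)   # downside run, capped at zero
        if s_pos >= h:
            s_pos = 0.0
            events.append(ts)
        elif s_neg <= -h:
            s_neg = 0.0
            events.append(ts)
    return events
```

Applied to a price series with a datetime index, the returned labels are the timestamps at which we would sample a bar; the choice of \(h\) controls how large a cumulative move must be before it counts as an event.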

## Data

To me, any theory on financial machine learning is useless if I cannot apply the techniques to actual data. Combining code and data with the mathematical concepts covered in the book is exactly what I attempt to do in this post and the posts to follow. Since this is a learning exercise for me, I shamelessly sample best practice and ideas from what others have done and incorporate what makes sense to me into solving these problems. Anyway, the tick data comes from Kibot. Read more on the data, and how it was processed, in this post. Here is a quick peek at what it looks like:

#### Data Sample

```
price bid ask vol dollar_vol
datetime
2009-09-28 09:30:00 50.79 50.70 50.79 100 5079.00
2009-09-28 09:30:00 50.71 50.70 50.79 638 32352.98
2009-09-28 09:31:32 50.75 50.75 50.76 100 5075.00
2009-09-28 09:31:33 50.75 50.72 50.75 100 5075.00
2009-09-28 09:31:50 50.75 50.73 50.76 300 15225.00
... ... ... ... ... ...
2023-03-22 15:59:56 145.39 145.40 145.45 1027 149315.53
2023-03-22 15:59:56 145.39 145.40 145.45 1800 261702.00
2023-03-22 15:59:56 145.39 145.40 145.45 100 14539.00
2023-03-22 16:00:00 145.46 145.37 145.46 55922 8134414.12
2023-03-22 16:00:01 145.50 145.41 145.51 100 14550.00
[2521941 rows x 5 columns]
```