Notes from Advances in Financial Machine Learning: Labelling Financial Data - Work in Progress

Machine Learning
Quantitative Finance
Chapter Notes
Author: M. L. De Prado

Louis Becker


March 25, 2023


This is the second of a series of blog posts summarising chapters from Advances in Financial Machine Learning by Marcos Lopez de Prado. These notes are concerned with ways to label financial data for machine learning applications. I augment the code with my own approaches.

In this context, labeling means defining the dependent variable \(y_t\) in such a way that it can be modeled efficiently and produces meaningful insights.

The Fixed-Time Horizon Method

Consider a features matrix \(X\) with \(I\) rows, \(\{X_i\}_{i=1,...,I}\), drawn from some bars with index \(t=1,...,T\) where \(I<T\). This sample is produced in such a way that an observation \(X_i\) is assigned a label \(y_i \in \{-1,0,1\}\) where

\[ y_i = \begin{cases} -1 & \text{if $r_{t_{i,0},t_{i,0}+h}<-\tau$}\\ 0 & \text{if $|r_{t_{i,0},t_{i,0}+h}|\leq \tau$}\\ 1 & \text{if $r_{t_{i,0},t_{i,0}+h}>\tau$} \end{cases} \]

  • \(\tau\) is a predefined constant threshold
  • \(t_{i,0}\) is the index of the bar immediately after \(X_i\) occurs
  • \(t_{i,0}+h\) is the index of the \(h^{th}\) bar after \(t_{i,0}\)
  • \(r_{t_{i,0},t_{i,0}+h}\) is the price return over a bar horizon \(h\): \[r_{t_{i,0},t_{i,0}+h} = \frac{P_{t_{i,0}+h}}{P_{t_{i,0}}}-1\]
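
The labeling rule above can be sketched in a few lines of pandas. This is my own minimal illustration rather than code from the book: the function name fixed_horizon_labels and its arguments are hypothetical, and for simplicity it treats every bar as a possible \(t_{i,0}\) instead of sampling a subset of \(I\) observations.

```python
import pandas as pd

def fixed_horizon_labels(close: pd.Series, h: int, tau: float) -> pd.Series:
    """Label each bar -1, 0 or 1 from the return over the next h bars."""
    # Forward return r = P_{t+h} / P_t - 1; it is NaN for the last h bars.
    fwd_ret = close.shift(-h) / close - 1
    labels = pd.Series(0, index=close.index, dtype="int64")
    labels[fwd_ret > tau] = 1
    labels[fwd_ret < -tau] = -1
    # Drop the bars that have no price h bars ahead.
    return labels[fwd_ret.notna()]

# Toy prices, one-bar horizon, 1.5% threshold.
close = pd.Series([100.0, 101.0, 99.0, 103.0, 102.0])
labels = fixed_horizon_labels(close, h=1, tau=0.015)
print(labels.tolist())
```

Note that the same tau is applied to every bar, which is exactly the weakness discussed next.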

Because the literature almost always works with time bars, \(h\) implies a fixed-time horizon. The author argues that there are two reasons to avoid labeling observations according to a fixed threshold on time bars:

  1. Time bars do not exhibit good statistical properties.
  2. The same threshold \(\tau\) is applied regardless of the observed volatility.

There are two proposed alternatives to the fixed \(h\) approach:

1. Computing Dynamic Thresholds

The first alternative is to label data using a varying threshold, \(\sigma_{t_{i,0}}\). We estimate \(\sigma_{t_{i,0}}\) using an exponentially weighted standard deviation of returns. Importantly, we want to be able to set profit-taking and stop-loss limits that are a function of the risks involved in a bet. Without these limits, the threshold \(\tau\) could be set too high or too low, since in practice positions are regulated by risk management, margin calls, stop-loss limits, etc.

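import pandas as pd

def getDailyVol(close, span0=100):

  # daily vol, reindexed to close (close: price Series with a sorted DatetimeIndex)
  # position of the first bar at or after (timestamp - 1 day), for every timestamp
  df0 = close.index.searchsorted(close.index - pd.Timedelta(days=1))
  df0 = df0[df0 > 0]  # drop timestamps with no bar more than a day earlier
  # map each remaining timestamp to the last bar before that one-day-earlier point
  df0 = pd.Series(close.index[df0 - 1], index=close.index[close.shape[0] - df0.shape[0]:])
  df0 = close.loc[df0.index]/close.loc[df0.values].values - 1 # daily returns
  df0 = df0.ewm(span=span0).std() # exponentially weighted standard deviation

  return df0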
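
To show how a volatility estimate like the one returned by getDailyVol can replace the constant \(\tau\), here is a minimal sketch of my own. It is a simplification and not the book's method: the function dynamic_threshold_labels and the width multiplier are hypothetical, and volatility is estimated from bar-to-bar returns rather than one-day lookback returns so that the block is self-contained.

```python
import numpy as np
import pandas as pd

def dynamic_threshold_labels(close, h=1, span=100, width=1.0):
    """Fixed-horizon labels with a per-bar threshold of width * EWM volatility."""
    # Assumption: bar-to-bar returns stand in for the daily returns used above.
    sigma = close.pct_change().ewm(span=span).std()
    fwd_ret = close.shift(-h) / close - 1
    labels = pd.Series(0, index=close.index, dtype="int64")
    labels[fwd_ret > width * sigma] = 1
    labels[fwd_ret < -width * sigma] = -1
    # Keep only bars with both a volatility estimate and a forward return.
    return labels[sigma.notna() & fwd_ret.notna()]

# Synthetic prices for illustration only.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 200))))
labels = dynamic_threshold_labels(prices, h=5, span=20)
print(labels.value_counts())
```

The threshold now widens in volatile periods and tightens in calm ones, so a label of 1 or -1 means the move was large relative to recent risk.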

2. Use Volume or Dollar Bars

Use volume bars or dollar bars, whose volatilities should be closer to homoscedastic (constant).
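
As a rough illustration of how volume bars differ from time bars, the sketch below groups ticks so that each bar holds roughly the same traded volume. It is my own simplified version: the function volume_bars and the column names price and volume are assumptions, and a tick that crosses the volume threshold is simply placed in the next bar rather than split.

```python
import pandas as pd

def volume_bars(ticks: pd.DataFrame, bar_volume: float) -> pd.DataFrame:
    """Aggregate ticks into bars of roughly bar_volume units each."""
    # The bar id increases each time cumulative volume passes a multiple of bar_volume.
    bar_id = (ticks["volume"].cumsum() // bar_volume).astype(int).rename("bar_id")
    grouped = ticks.groupby(bar_id)
    return pd.DataFrame({
        "open": grouped["price"].first(),
        "high": grouped["price"].max(),
        "low": grouped["price"].min(),
        "close": grouped["price"].last(),
        "volume": grouped["volume"].sum(),
    })

# Toy ticks for illustration only.
ticks = pd.DataFrame({
    "price": [10.0, 11.0, 12.0, 11.0, 13.0, 12.0],
    "volume": [50, 60, 40, 70, 30, 50],
})
bars = volume_bars(ticks, bar_volume=100)
print(bars)
```

Dollar bars follow the same pattern with the cumulative sum taken over price times volume instead of volume.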

Finally, a last argument against the fixed-time horizon method is that it ignores the path prices actually follow. The author argues that every investment strategy has stop-loss limits, and that it is “unrealistic to build a strategy that profits from positions that would have been stopped out by the exchange.”

Computing Dynamic Thresholds

The Triple-Barrier Method