Why ML → Finance is Hard (3 / 4)

The problem of non-independent samples

Posted July 30, 2020

Following on from the prior post, I want to discuss the problem of sample independence. Many machine learning models in finance deal with time-series data, where samples used in training may be close together in time and therefore not independent of one another. Very few features in finance avoid lookback periods; most evaluate a prior window:

almost all technical indicators (SMA being the most basic example)
distribution based signals
decomposition based signals
traditional signal processing
volatility: rolling stdev, GARCH, instantaneous vol, etc.
Hawkes processes for estimation of variance or buy/sell imbalance
regime models
…

The lookback period means that a feature at time t overlaps (and so is not i.i.d.) with the previous k features, where k is the lookback window size. The features at t+1 .. t+k, later in the sequence, overlap with it as well. In general, if a feature has lookback length k (for example a k-period MA), each sample shares information with roughly k prior samples and k future samples, or about 2k samples in total.
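To make the overlap concrete, here is a minimal sketch in Python. The function name `shared_observations` and the values of k and t are illustrative assumptions, not taken from the original post. It counts how many raw observations two k-period windows have in common, assuming a window ending at index i covers i-k+1 .. i. Under that convention, k-1 neighbours on each side share at least one observation with a given sample, which is where the "roughly 2k" figure comes from.

```python
def shared_observations(t, s, k):
    """Raw observations shared by the k-period windows ending at t and at s.

    Assumes a window ending at index i covers i-k+1 .. i inclusive.
    """
    return max(0, k - abs(t - s))


k = 20   # hypothetical lookback length
t = 100  # hypothetical sample index

# All other samples whose window shares at least one observation with sample t
neighbours = [s for s in range(t - 2 * k, t + 2 * k + 1)
              if s != t and shared_observations(t, s, k) > 0]

print(len(neighbours))                   # 2 * (k - 1) = 38
print(shared_observations(t, t - 1, k))  # 19 of 20 observations in common
print(shared_observations(t, t + k, k))  # 0: the windows no longer touch
```

Adjacent samples share almost their entire window, so treating them as independent draws overstates how much information the training set really contains.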

Some of the best supervised machine learning approaches employ sampling during the process of training, for example:

deep learning models
random forest
genetic algorithms
…

When samples lack inter-sample independence (i.e. are not i.i.d. across time), the machine learning model is often able to exploit the lookahead bias this introduces, and overfits in training. Find some discussion of this problem here.

Discussion

In our earlier random forest toy example we achieved perfect accuracy during training. This was mostly due to the lack of independence across samples, where the model learned from the lookahead bias due to the overlapping ROC windows across samples. This overfitting resulted in poor out of sample (OOS) performance.

Applying the same approach to a deep learning model (such as an LSTM) will achieve similar results, with very high in-sample performance and poor OOS performance.

Random forest does allow for a number of constraints which can reduce the impact of non-independent samples by:

decreasing the # of samples used in fitting each tree in the forest
- If the # of random samples (drawn with replacement) per tree is reduced from N (the total # of samples) to M < N, the probability that a given sample overlaps with another sample in the same tree drops from nearly 100% to something smaller; how much smaller depends on M and on the lookback length.
reducing the complexity of each tree (for example tree depth)
- this just reduces the number of rules that can exploit the lookahead bias.

These only partially side-step the issue. A better solution would be to remove the non-independence entirely.
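To put rough numbers on the first constraint (drawing M < N samples per tree), the sketch below estimates by simulation the fraction of bootstrap draws that land within k-1 rows of another draw in the same tree. It uses only the Python standard library; `overlap_fraction`, N = 4000 and k = 20 are hypothetical choices for illustration, not values from the toy example.

```python
import random


def overlap_fraction(n_rows, n_draws, k, rng):
    """Fraction of bootstrap draws lying within k-1 rows of another draw.

    Draws are made with replacement, so an exact duplicate also counts.
    """
    draws = sorted(rng.randrange(n_rows) for _ in range(n_draws))
    overlapping = 0
    for i, idx in enumerate(draws):
        near_prev = i > 0 and idx - draws[i - 1] < k
        near_next = i + 1 < len(draws) and draws[i + 1] - idx < k
        if near_prev or near_next:
            overlapping += 1
    return overlapping / len(draws)


rng = random.Random(0)
N, k = 4000, 20  # hypothetical sample count and lookback length

full = overlap_fraction(N, N, k, rng)          # M = N draws per tree
small = overlap_fraction(N, N // 100, k, rng)  # M = N / 100 draws per tree
print(f"M=N: {full:.2f}  M=N/100: {small:.2f}")
```

With M = N essentially every draw has an overlapping neighbour; M has to shrink a long way before overlap becomes rare, and by then each tree sees very little data. In stock scikit-learn these two constraints correspond to the `max_samples` and `max_depth` arguments of `RandomForestClassifier`.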

A solution for Random Forest

I tend to use Bayesian models and random forest when applying supervised learning, as they are often more appropriate for my feature sets than deep learning or alternative approaches. To adjust for non-independence, I modified scikit-learn’s RandomForestClassifier and RandomForestRegressor algorithms to address this problem.

The change was as follows: I adjusted the random forest classifier and regressor to accept a user-defined sampling function. This lets the user account for samples that are not i.i.d. relative to each other, so that each tree is fitted on independent samples. Find the modified sklearn library here.

Recall that our prior model had perfect accuracy in training (a sure sign of overfitting):

Predicted \ Actual	Positive	Negative	Precision
Positive	1723	0	100%
Negative	0	2057	100%

and an out of sample outcome with 47% precision (more losing trades than winning):

Predicted \ Actual	Positive	Negative	Precision
Positive	242	269	47%
Negative	526	587	53%

By introducing a sampling that avoids overlapping samples:

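import pandas as pd
# NOTE: this import must resolve to the modified scikit-learn build described above;
# stock scikit-learn does not accept a sampling_function argument.
from sklearn.ensemble import RandomForestClassifier

def select_stride(random, nrows, samples, skip=20):
    # `random` is expected to be a numpy-style RandomState (randint(low, high, size))
    offset = random.randint(0, skip)  # random phase in [0, skip)
    raw = pd.Series(random.randint(0, nrows // skip, samples))  # block index per draw
    # scale the block index by the stride so distinct draws sit at least `skip` rows apart
    indices = raw.apply(lambda x: min(x * skip + offset, nrows - 1))
    return indices

clf = RandomForestClassifier(
    n_estimators=500, random_state=1, n_jobs=-1,
    sampling_function=select_stride)

# training, testing and features come from the toy example in the prior post
model = clf.fit(training[features], training.label)
pred_train = model.predict(training[features])
pred_test = model.predict(testing[features])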

we get the following confusion matrices. For in-sample:

Predicted \ Actual	Positive	Negative	Precision
Positive	806	867	48%
Negative	917	1190	56%

and out-of-sample, with 51% precision (up from 47%):

Predicted \ Actual	Positive	Negative	Precision
Positive	260	250	51%
Negative	508	606	54%
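For reference, the headline figures can be recomputed from the two out-of-sample matrices. The sketch below assumes that rows are predictions and columns are actual labels, a reading inferred from the 47% precision quoted earlier (242 / (242 + 269)); the helper name `precision_and_accuracy` is illustrative.

```python
def precision_and_accuracy(matrix):
    """matrix = [[TP, FP], [FN, TN]]: rows are predictions, columns are actual labels."""
    (tp, fp), (fn, tn) = matrix
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, accuracy


oos_before = [[242, 269], [526, 587]]  # default bootstrap sampling
oos_after = [[260, 250], [508, 606]]   # stride sampling

for name, m in [("before", oos_before), ("after", oos_after)]:
    p, a = precision_and_accuracy(m)
    print(f"{name}: precision={p:.1%} accuracy={a:.1%}")
# before: precision=47.4% accuracy=51.0%
# after: precision=51.0% accuracy=53.3%
```

Under this reading, both precision on positive predictions and overall accuracy improve by a few percentage points out of sample.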

Our feature set and labels in the toy example were not brilliant, so I did not expect a workable strategy. That said, by removing the sampling overlap we:

reduced training model overfit (as evidenced above by in-sample accuracy falling well below 100%)
improved out-of-sample performance
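As a sanity check on the stride idea, here is a standalone sketch using only the Python standard library. It is not the modified scikit-learn code: `stride_sample` is a hypothetical helper, and the row count of 3780 is simply the size of the in-sample confusion matrix above. It draws indices with replacement from a grid spaced `skip` rows apart and confirms that any two distinct draws are at least `skip` rows apart.

```python
import random


def stride_sample(rng, n_rows, n_samples, skip=20):
    """Draw row indices (with replacement) from a grid spaced `skip` rows apart."""
    offset = rng.randrange(skip)  # random phase, redrawn on every call
    n_slots = n_rows // skip      # number of grid points that fit in the data
    return [rng.randrange(n_slots) * skip + offset for _ in range(n_samples)]


rng = random.Random(1)
indices = stride_sample(rng, n_rows=3780, n_samples=3780, skip=20)

distinct = sorted(set(indices))
min_gap = min(b - a for a, b in zip(distinct, distinct[1:]))
print(len(distinct), min_gap)
```

In this sketch a single call yields at most n_rows // skip distinct rows, so if `skip` is at least the lookback length, no two distinct rows in one draw share any part of their lookback window. Calling it once per tree with a fresh offset lets different trees cover different rows.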

Conclusions

Temporal non-independence of samples can degrade ML models substantially, causing bias and overfit
Training algorithms need to be modified to remove the impact of non-independent samples