# Why ML → Finance is Hard (3 / 4)

## The problem of non-independent samples

Following on from the prior post, I want to discuss the problem of **sample independence**.
Many machine learning models in finance deal with timeseries data, where samples used in training may be **close together in
time and not independent of one another**. Very few features in finance do not make use of lookback periods. Most
features evaluate prior windows:

- almost all technical indicators (SMA being the most basic example)
- distribution based signals
- decomposition based signals
- traditional signal processing
- volatility: rolling stdev, GARCH, instantaneous vol, etc.
- Hawkes processes for estimation of variance or buy/sell imbalance
- regime models
- …

The lookback period used in these features causes a feature at time `t` to **overlap** with (and so not be i.i.d. with respect to) the **last
k features**, where k is the lookback window size of one’s features. The features at t+1 .. t+k, later in the sequence, overlap
with it as well. In general, if a given feature has lookback length k (for example a k-period MA), each sample shares information
with roughly k prior samples and k future samples, or about 2k samples in total.
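To make the overlap concrete, here is a minimal sketch (standard library only) that builds a k-period SMA over noise that is i.i.d. by construction and measures how correlated the resulting feature is with itself at different lags. The helper names `sma` and `lag_correlation`, the seed, and the choice of k = 20 are illustrative assumptions, not part of the original toy example.

```python
import random

def sma(values, k):
    """Simple moving average over the trailing k values (defined from index k-1)."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= k:
            running -= values[i - k]   # drop the value that left the window
        if i >= k - 1:
            out.append(running / k)
    return out

def lag_correlation(series, lag):
    """Pearson correlation between series[t] and series[t + lag]."""
    a, b = series[:-lag], series[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)
k = 20                                                  # assumed lookback length
noise = [rng.gauss(0.0, 1.0) for _ in range(20_000)]    # i.i.d. inputs
feature = sma(noise, k)

corr_adjacent = lag_correlation(feature, 1)   # windows share k - 1 observations
corr_disjoint = lag_correlation(feature, k)   # windows share no observations
print(f"lag 1: {corr_adjacent:.2f}, lag {k}: {corr_disjoint:.2f}")
```

Even though the inputs are independent, adjacent feature values are strongly correlated: for a k-period window the lag-1 correlation should come out near (k-1)/k, and it only falls to around zero (up to sampling noise) once the windows are k steps apart and no longer overlap.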

Some of the best supervised machine learning approaches employ sampling during the process of training, for example:

- deep learning models
- random forest
- genetic algorithms
- …

When samples lack inter-sample independence (i.e. are not i.i.d. temporally), the machine learning model is often **able to exploit the lookahead
bias** introduced, overfitting in training. Find some discussion of this problem here.

## Discussion

In our earlier random forest toy example we achieved perfect accuracy during training. This was mostly due to the lack of
independence across samples, where the **model learned from the lookahead bias** due to the overlapping ROC windows across
samples. This overfitting resulted in poor out of sample (OOS) performance.

Applying the same to a deep learning model (such as an LSTM) will achieve similar results, with very high in-sample performance and poor OOS performance.

Random forest does allow for a number of constraints which can **reduce the impact of non-independent samples**:

- **decreasing the # of samples used in fitting each tree in the forest**: if the # of random samples (with replacement) per tree is reduced from N (the total # of samples) to M < N, the probability of a given sample overlapping with another within the same tree is reduced from nearly 100% to a smaller probability of overlap.
- **reducing the complexity of each tree** (for example tree depth): this reduces the number of rules that can exploit the lookahead bias.
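The effect of shrinking the per-tree sample count can be estimated with a small simulation. This is a sketch under stated assumptions: N = 3780 is the size of the training set in the toy example, k = 20 is an assumed lookback length, and `overlap_fraction` is a hypothetical helper that counts how many distinct bootstrapped rows have another bootstrapped row fewer than k steps away.

```python
import random

def overlap_fraction(n_rows, n_draws, k, rng):
    """Fraction of distinct bootstrapped rows that have another bootstrapped
    row fewer than k steps away (i.e. overlapping lookback windows)."""
    # Draw with replacement, then keep the distinct row indices in order.
    drawn = sorted({rng.randrange(n_rows) for _ in range(n_draws)})
    overlapping = 0
    for i, idx in enumerate(drawn):
        near_prev = i > 0 and idx - drawn[i - 1] < k
        near_next = i + 1 < len(drawn) and drawn[i + 1] - idx < k
        overlapping += near_prev or near_next
    return overlapping / len(drawn)

rng = random.Random(0)
n_rows, k = 3780, 20          # assumed training size and lookback length
fractions = {
    m: overlap_fraction(n_rows, m, k, rng)
    for m in (n_rows, n_rows // 10, n_rows // 100)
}
for m, frac in fractions.items():
    print(f"M = {m:>4}: {frac:.0%} of sampled rows overlap a neighbour")
```

With M = N essentially every sampled row overlaps a neighbour, and the fraction only drops meaningfully once M is a small fraction of N, which is why this is a partial remedy at best. In stock scikit-learn (0.22 and later) the per-tree sample count is set with `max_samples` (with `bootstrap=True`) and the tree depth with `max_depth`.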

These only partially side-step the issue. A better solution would be to remove the non-independence entirely.

## A solution for Random Forest

I tend to use Bayesian models and random forest when applying supervised learning, as they are often more appropriate for my
feature sets than deep learning or alternative approaches. To adjust for non-independence, I modified scikit-learn’s `RandomForestClassifier` and `RandomForestRegressor` algorithms.

The change was as follows: I adjusted the random forest classifier and regressor to accept a user-defined sampling function. This allows the user to account for samples that are not i.i.d. relative to each other, ensuring that each tree contains independent samples. Find the modified sklearn library here.

Recall that our prior model had perfect accuracy in training (a sure sign of overfitting):

| / | Positive | Negative | Accuracy |
|---|---|---|---|
| Positive | 1723 | 0 | 100% |
| Negative | 0 | 2057 | 100% |

and an out of sample outcome with 47% precision (more losing trades than winning):

| / | Positive | Negative | Accuracy |
|---|---|---|---|
| Positive | 242 | 269 | 47% |
| Negative | 526 | 587 | 53% |

By introducing a sampling that avoids overlapping samples:

```
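import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # the modified build described above

def select_stride(random, nrows, samples, skip=20):
    # Draw bootstrap indices on a grid with stride `skip`, shifted by a random
    # offset, so distinct indices within a tree are at least `skip` rows apart.
    offset = random.randint(0, skip)
    raw = pd.Series(random.randint(0, nrows // skip, samples))
    # Scale by `skip` (without this the indices would be adjacent rows).
    indices = raw.apply(lambda x: min(x * skip + offset, nrows - 1))
    return indices

# `sampling_function` exists only in the modified RandomForestClassifier
clf = RandomForestClassifier(
    n_estimators=500, random_state=1, n_jobs=-1,
    sampling_function=select_stride)
model = clf.fit(training[features], training.label)
pred_train = model.predict(training[features])
pred_test = model.predict(testing[features])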
```

we get the following confusion matrices. For in-sample:

| / | Positive | Negative | Accuracy |
|---|---|---|---|
| Positive | 806 | 867 | 48% |
| Negative | 917 | 1190 | 56% |

and out-of-sample with 51% precision:

| / | Positive | Negative | Accuracy |
|---|---|---|---|
| Positive | 260 | 250 | 51% |
| Negative | 508 | 606 | 54% |

Our feature set and labels in the toy example were not brilliant, so I did not expect a workable strategy. That said, by removing the sampling overlap we:

- reduced training model overfit (as evidenced above with the accuracy < 100%)
- improved out-of-sample performance
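The comparison can be checked directly from the confusion matrices above. The sketch below assumes that rows are predicted labels and columns are actual labels, which is the reading under which the quoted 47% is a precision. The helper `precision_and_accuracy` is illustrative only.

```python
def precision_and_accuracy(matrix):
    """matrix = [[TP, FP], [FN, TN]], assuming rows are predictions
    and columns are actual labels."""
    (tp, fp), (fn, tn) = matrix
    precision = tp / (tp + fp)                  # winning share of predicted positives
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that were right
    return precision, accuracy

# Out-of-sample confusion matrices taken from the tables above
oos_before = [[242, 269], [526, 587]]   # standard bootstrap
oos_after = [[260, 250], [508, 606]]    # strided sampling

prec_before, acc_before = precision_and_accuracy(oos_before)
prec_after, acc_after = precision_and_accuracy(oos_after)

print(f"precision: {prec_before:.0%} -> {prec_after:.0%}")
print(f"accuracy:  {acc_before:.0%} -> {acc_after:.0%}")
```

On these numbers out-of-sample precision moves from about 47% to 51% and overall accuracy from about 51% to 53%: a modest improvement, in line with the weak features and labels of the toy example.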

## Conclusions

- Temporal non-independence of samples can degrade ML models substantially, causing bias and overfit
- Training algorithms need to be modified to remove the impact of non-independent samples