Problem Description

The aim of this project was to develop an end-to-end machine learning (ML) system in order to better understand ML system design, a vital skill for ML engineers. Initially the scale and scope of the project were small, but as time progressed its complexity and size grew, and with that growth I had to learn new skills to keep up with the technical demands.

Skills learnt: Docker, Amazon Web Services (ECS, S3), DevOps practices, Java, Kafka, Redis

Skills developed: time series analysis and feature construction, feature selection

Data architecture, hosted on AWS on a high-availability ECS cluster:

[architecture diagram]

Task

The task was to utilise data science techniques to gain insights from financial time series data, and ultimately to build ML models that exploit those insights to produce predictive trading signals to buy or sell a particular cryptocurrency pair.

The reason I focused on the cryptocurrency market is that tick and orderbook data for traditional financial instruments is expensive, whereas most crypto exchanges allow access to their market feeds via API.

Two data pipelines are containerised: one collects data for offline analysis and ML, and the other is a low-latency system that predicts trading signals via online ML.

Data Collection

Market data was collected from 15 cryptocurrency exchanges. I forked an open-source project called cryptofeed (https://github.com/bmoscon/cryptofeed), which handles the data feeds for most exchanges.

Building on top of that code and using the existing Redis backend, the pipeline batches the data every minute (process time) and saves it as Parquet files to my private S3 bucket using the AWS API.
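
A minimal sketch of this batching step follows. The Redis key, message format and S3 bucket name are hypothetical illustrations, not the actual cryptofeed backend configuration:

import io
import json

import boto3
import pandas as pd
import redis

# Hypothetical names, for illustration only.
REDIS_KEY = "trades-BTC-USDT"
S3_BUCKET = "my-private-market-data"

r = redis.Redis(host="localhost", port=6379)
s3 = boto3.client("s3")

def flush_batch():
    # Atomically drain everything buffered since the last one-minute flush.
    pipe = r.pipeline()
    pipe.lrange(REDIS_KEY, 0, -1)
    pipe.delete(REDIS_KEY)
    raw, _ = pipe.execute()
    if not raw:
        return
    # Each buffered message is assumed to be a JSON-encoded trade.
    df = pd.DataFrame(json.loads(m) for m in raw)
    # Serialise to Parquet in memory and upload to S3.
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    key = f"trades/BTC_USDT/{pd.Timestamp.utcnow():%Y%m%d%H%M}.parquet"
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=buf.getvalue())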

Example tick data from the BTC_USDT pair [figure 1]

index price volume signed_tick
2021-02-01 00:00:00.235000 33084.6 0.060084 -1

Example L2 orderbook data from the BTC_USDT pair [figure 2]

timestamp receipt_timestamp delta side price size
2021-10-20 14:09:55.026000128 2021-10-20 14:12:25.604629504 True bid 65961.2 0.00042

For the L2 data, the pipeline is configured to take a snapshot of the first 100 levels of the orderbook once every day, just after midnight. The rest of the updates are changes made to the original snapshot.
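
A minimal sketch of replaying one day of L2 rows (figure 2 layout), assuming the common feed convention that a size of zero removes a price level:

def rebuild_book(updates):
    # Rows with delta == False belong to the midnight snapshot of the
    # first 100 levels; rows with delta == True modify it in place.
    book = {"bid": {}, "ask": {}}
    for row in updates.itertuples():
        levels = book[row.side]
        if row.delta and row.size == 0:
            levels.pop(row.price, None)  # assumption: size 0 deletes the level
        else:
            levels[row.price] = row.size
    return book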

Time series analysis

Non-independently and identically distributed (non-IID) financial data

The algorithms and techniques used were from Lopez de Prado (2018).

A sampling technique was used to construct 'information-driven bars'. The idea is to sample more frequently from the tick feed (figure 1) when new information arrives at the market; each arrival of sufficient new information triggers the construction of an information bar, which represents one row. The columns are the same as normal time-based bar features: open, high, low and close.

Example information-driven bar, as explained by Lopez de Prado (2018) [figure 3]

date_time open high low close
2021-02-03 00:00:00.134000 1511.93 1512.11 1511.93 1512.11
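
Lopez de Prado (2018) describes several information-driven bar types (tick, volume, dollar and imbalance bars). As a simplified illustration of the idea, the sketch below builds dollar bars from the figure 1 tick data, closing a bar each time a fixed amount of dollar volume has traded; the bars actually used in the project may be more sophisticated:

import pandas as pd

def dollar_bars(ticks, dollar_threshold):
    # Close a bar whenever cumulative traded dollar volume
    # (price * volume) crosses dollar_threshold.
    bars, prices, cum_dollars = [], [], 0.0
    for t, row in ticks.iterrows():
        prices.append(row["price"])
        cum_dollars += row["price"] * row["volume"]
        if cum_dollars >= dollar_threshold:
            bars.append({"date_time": t,
                         "open": prices[0], "high": max(prices),
                         "low": min(prices), "close": prices[-1]})
            prices, cum_dollars = [], 0.0
    return pd.DataFrame(bars).set_index("date_time")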

This data set can be thought of as the main data set, used to construct other features or enriched with supplementary orderbook data.

The reason for sampling when new information arrives is that financial time series are not independently and identically distributed (IID), while machine learning models assume the data is IID. Sampling this way makes the data closer to IID.

Stationarity and memory trade-off

To build an ML model that makes predictions from historical financial time series, the data must be made stationary, as supervised learning models require stationary features. However, differencing reduces the memory of the series. The idea is to fractionally differentiate the price series in order to make the data stationary while preserving as much memory as possible.
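
A minimal sketch of fixed-width-window fractional differentiation, following Lopez de Prado (2018, ch. 5); the frac_diff_ffd used later in this section may differ in detail. The weights follow the recursion w_k = -w_{k-1} * (d - k + 1) / k and are truncated once they fall below a threshold:

import numpy as np
import pandas as pd

def get_weights_ffd(d, thresh):
    # Compute weights until they shrink below thresh (fixed-width window).
    w, k = [1.0], 1
    while True:
        w_k = -w[-1] * (d - k + 1) / k
        if abs(w_k) < thresh:
            break
        w.append(w_k)
        k += 1
    return np.array(w[::-1]).reshape(-1, 1)

def frac_diff_ffd(series, diff_amt, thresh=1e-5):
    # Apply the fixed-width-window fractionally differentiated filter
    # to each column of a DataFrame.
    w = get_weights_ffd(diff_amt, thresh)
    width = len(w) - 1
    out = {}
    for name in series.columns:
        col = series[[name]].ffill().dropna()
        df_ = pd.Series(index=col.index, dtype=float)
        for i in range(width, col.shape[0]):
            loc0, loc1 = col.index[i - width], col.index[i]
            df_[loc1] = np.dot(w.T, col.loc[loc0:loc1])[0, 0]
        out[name] = df_
    return pd.concat(out, axis=1)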

I used the Augmented Dickey-Fuller test to analyse the stationarity and memory trade-off.

The Augmented Dickey-Fuller test is a type of statistical test called a unit root test.

Where:

  • Null Hypothesis (H0): if it fails to be rejected, the test suggests the time series has a unit root, meaning it is non-stationary and has some time-dependent structure.
  • Alternate Hypothesis (H1): if the null hypothesis is rejected, the test suggests the time series does not have a unit root, meaning it is stationary.

The results are interpreted using the p-value method:

  • p-value > 0.05: fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
  • p-value <= 0.05: reject the null hypothesis (H0); the data does not have a unit root and is stationary.

import numpy as np
from statsmodels.tsa.stattools import adfuller

# Log-transform the close prices from figure 3, then take the
# cumulative sum before fractionally differentiating.
close_ffd = close.apply(np.log).cumsum().to_frame()

# frac_diff_ffd is the fixed-width-window fractional differentiation
# sketched above.
close_ffd = frac_diff_ffd(series=close_ffd,
                          diff_amt=1.9999981, thresh=0.000001).dropna()

# adfuller expects a one-dimensional series, so pass the single column.
p_val = adfuller(close_ffd.iloc[:, 0], maxlag=1, regression='c', autolag=None)[1]

print(p_val)

#0.04600717533264174

In the code above I log-transformed the close price series from figure 3, then applied the cumulative sum transformation; this helps preserve the memory of the data. In the example above the p-value is below 0.05, therefore the series is stationary.

CUSUM filter

Lopez de Prado (2018) states that event-based sampling should be implemented using a CUSUM filter. Market prices tend to fluctuate during market events (a big sell-off, news, etc.), and the CUSUM filter is designed to detect a shift in the mean of the prices, which such events cause. The machine learning model will then only learn from data sampled where the mean has shifted, i.e. around market events.
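
A minimal sketch of the symmetric CUSUM filter, following Lopez de Prado (2018, Snippet 2.4); the sampling threshold is a free parameter:

import numpy as np
import pandas as pd

def cusum_filter(close, threshold):
    # Sample an event whenever the cumulative positive or negative
    # deviation of log returns exceeds the threshold.
    events, s_pos, s_neg = [], 0.0, 0.0
    diff = np.log(close).diff().dropna()
    for t, r in diff.items():
        s_pos = max(0.0, s_pos + r)
        s_neg = min(0.0, s_neg + r)
        if s_neg < -threshold:
            s_neg = 0.0
            events.append(t)
        elif s_pos > threshold:
            s_pos = 0.0
            events.append(t)
    return pd.DatetimeIndex(events)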

Feature selection


References

Lopez de Prado, M. (2018). Advances in financial machine learning. John Wiley & Sons.