Machine Learning And Automated Trading For Fun And Losses

tl;dr automated / algorithmic trading relies on finding the right pattern to “gamble on”. Instead of trying to find it by construction, I tried to uncover it by optimizing existing trading agents using machine learning.

Algorithmic trading

Algorithmic trading is one kind of automated trading where profits depend on implementing the right strategy to buy and sell a security. Unlike High Frequency Trading, it doesn’t rely on the speed of execution.

Basically those trading agents have access to the same level of information as human traders and can do the same kind of operations.

Here, the goal was not to implement trading agents but to treat them as black boxes and to predict their performance from the context in which they were trading.

If successful, this could have been used to only allow those agents to trade when they were in a favorable context, thus boosting their profitability.

Forex

Nowadays things like Quantopian make it very easy for someone versed in software to use their skills to trade online without having a Goldman Sachs-like infrastructure.

In 2007, though, that was not the case. Two of my friends introduced me to Metatrader, a platform available at many brokers that allowed users to trade on the Forex, the foreign currency exchange. A language called MQL is embedded in the platform and lets users program trading agents or assistants with testing accounts or real money.

Watching your program trade is a very thrilling thing:

Metatrader

Forex has other advantages:

The market does not close during the week: fewer side effects.
The volume is huge: the impact of external events is not as important as for trading other things.
The fees are hidden in the prices and have no economies of scale. If you are trading in most European countries, you need a lot of cash to invest; otherwise you either cannot diversify your portfolio or get killed by the flat fees applied below a certain amount. Fees on Forex are very simple: if you are selling something at $2000, the buying counterpart will see a $2002 price, the difference being there to pay the broker. This simplifies a lot of thinking.
The leverage can be extremely high, which enables a controlled risk level and also gives some scammer-like companies the opportunity to produce extremely cheesy ads about single mothers becoming millionaires by investing their last penny. The fees being proportional to the size of the transaction, brokers tend to advertise this leverage a lot more than it is actually useful.
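
To make the fee model concrete, here is a minimal sketch of the spread cost, reusing the $2000 / $2002 quote from the list above. The function names, the position size and the 1:100 leverage are made up for illustration.

```python
def spread_cost(bid, ask, units=1.0):
    """Amount kept by the broker when `units` are bought at ask and sold back at bid."""
    return (ask - bid) * units


def margin_required(position_value, leverage):
    """Cash needed to hold a position of the given value at the given leverage."""
    return position_value / leverage


bid, ask = 2000.0, 2002.0
cost_one = spread_cost(bid, ask)               # 2.0 for a single unit
cost_fifty = spread_cost(bid, ask, units=50)   # 100.0: the fee scales with the size
margin = margin_required(ask * 50, leverage=100)  # 1001.0 of cash for a 100100.0 position

print(cost_one, cost_fifty, margin)
```

The fee depends only on the size of the position, not on the cash actually committed, which is one way to read why brokers push leverage so hard.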

Technical analysis

The underlying idea behind technical analysis is that the past behavior of a market can explain (even partly) its future behavior / variations.

Technical analysis ranges from statistics like Bollinger Bands (bands drawn a few standard deviations around a moving average) to more exotic things like RSI or looking for patterns in Japanese candlestick charts.
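
As an illustration of how simple such an indicator is, here is a minimal sketch of Bollinger Bands. The 20 period window and the width of 2 standard deviations are commonly used defaults, not values taken from this project, and the prices are made up.

```python
from statistics import mean, pstdev


def bollinger_bands(closes, period=20, width=2.0):
    """Return (lower, middle, upper) computed over the last `period` closing prices."""
    if len(closes) < period:
        raise ValueError("not enough prices for the requested period")
    window = closes[-period:]
    middle = mean(window)        # simple moving average
    deviation = pstdev(window)   # population standard deviation of the window
    return middle - width * deviation, middle, middle + width * deviation


closes = [10.0, 12.0, 11.0, 13.0, 14.0]  # made-up closing prices
lower, middle, upper = bollinger_bands(closes, period=5)
print(lower, middle, upper)
```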

Illustration: “Candlestick chart scheme 01-en” by Probe-meteo.com.

Candlestick

If such a tool / method were working, the simple fact that everybody would be using it would reduce the edge to zero.

Initially, price data is composed of five attributes, the last four being used to draw Japanese candlesticks:

Volume, during the period
High, highest price during the period
Low, lowest price during the period
Open, price at the beginning of the period
Close, price at the end of the period

Technical indicators let you generate a huge number of virtual attributes by combining the other ones.
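
A minimal sketch of what "virtual attributes" can look like, starting from the five raw attributes listed above. The `Candle` class and the derived fields are hypothetical examples, not the indicators used in this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candle:
    volume: float
    high: float
    low: float
    open: float
    close: float


def derived_attributes(candle):
    """Build a few hypothetical virtual attributes from a single candle."""
    body = candle.close - candle.open
    full_range = candle.high - candle.low
    return {
        "body": body,
        "range": full_range,
        "is_bullish": body > 0,
        # Share of the period's range covered by the body (0.0 for a flat candle).
        "body_to_range": body / full_range if full_range else 0.0,
    }


candle = Candle(volume=1200.0, high=12.0, low=8.0, open=9.0, close=11.0)
attributes = derived_attributes(candle)
print(attributes)
```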

Our approach was only based on technical analysis indicators.

Process

Our process was the following:

Capturing every technical indicator

over the last 20 timeframes and at 3 different time scales (minute, hour and day) when the monitored agent was opening a trade.

Trade data storage

Profit & loss information along with the data captured at the opening of the position.

Initially this was done using a PHP web service and a MySQL database with a star-like schema, but we quickly switched to simply storing trades in raw files.

Preprocessing / Feature engineering

where we generated extra technical indicators as well as preprocessed fields to ease the learning.

Offline learning

and model tuning.

Online scoring

at the moment of the trade opening: if the scoring predicted a losing trade, the agent was not allowed to open it.
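
The capture and scoring steps above can be sketched as follows. This is only an outline under assumptions: the field naming (`scale.indicator.offset`), the shape of `history` and the `predict_win` callable are hypothetical, and the original capture ran inside Metatrader rather than in Python.

```python
TIME_SCALES = ("minute", "hour", "day")
LOOKBACK = 20  # number of past timeframes kept per indicator


def capture_snapshot(history):
    """Flatten {scale: {indicator: [latest, previous, ...]}} into one flat row."""
    row = {}
    for scale in TIME_SCALES:
        for name, values in history[scale].items():
            for offset, value in enumerate(values[:LOOKBACK]):
                row[f"{scale}.{name}.{offset}"] = value
    return row


def allow_trade(history, predict_win):
    """Let the agent open the trade only when the model predicts a winner."""
    return bool(predict_win(capture_snapshot(history)))


# Hypothetical data: one indicator with 25 past values per time scale.
history = {scale: {"SMA9": [float(i) for i in range(25)]} for scale in TIME_SCALES}
snapshot = capture_snapshot(history)
print(len(snapshot))  # 60 fields: 3 scales x 1 indicator x 20 timeframes

# Stand-in for the learned model: any callable taking a snapshot.
print(allow_trade(history, lambda row: row["hour.SMA9.0"] < row["hour.SMA9.1"]))  # True
```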

Learned: Optimize later

Overall performance was not a primary concern initially, since learning was happening offline. While experimenting, though, the preprocessing turned out to be painfully slow.

A very quick first optimization involved a profiler. Many of our fields were generated from their names / keys, e.g. SMA9_-_High__-__SMA26_-_High/StdDev30-1 indicated a subtraction of two other fields, followed by a division. A 30% reduction of the overall processing time was possible by simply caching the processed keys. I would never have been able to anticipate that this would be the main bottleneck.

A second batch of optimizations parallelized the processing of trades using ZeroMQ. This was a bit more time consuming to implement and gave a second reduction of 30% of what remained.

Together, those two optimizations cut the processing time by about 50% (0.7 × 0.7 = 0.49 of the original time remained).
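
Here is a minimal sketch of the caching idea, assuming a simplified key grammar `A__-__B/C` modelled on the example key above. The parser and the sample values are hypothetical; the original code is not shown in this post.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_key(key):
    """Parse 'A__-__B/C' into (minuend, subtrahend, divisor), once per distinct key."""
    numerator, _, divisor = key.partition("/")
    minuend, _, subtrahend = numerator.partition("__-__")
    return minuend, subtrahend or None, divisor or None


def compute(key, row):
    """Evaluate a generated field on one trade, reusing the cached parse of its key."""
    minuend, subtrahend, divisor = parse_key(key)
    value = row[minuend]
    if subtrahend is not None:
        value -= row[subtrahend]
    if divisor is not None:
        value /= row[divisor]
    return value


key = "SMA9_-_High__-__SMA26_-_High/StdDev30-1"
row = {"SMA9_-_High": 12.0, "SMA26_-_High": 10.0, "StdDev30-1": 4.0}  # made-up values
first = compute(key, row)   # parses the key
second = compute(key, row)  # served from the cache
print(first, parse_key.cache_info().hits)
```

The same few thousand keys are evaluated for every trade, so the parse is paid once per key instead of once per key and per trade.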

Learned: Storing data

The initial plan of storing everything in a MySQL database sounded good, but there was a table storing every metric for every trade: trade_metric (id, trade_id, metric_id, value). A few different trading agents, a few tens of thousands of trades per agent, more than one hundred metrics, 20 timeframes, 3 time scales.

That table grew steadily to a billion entries. At that size you need to pay attention to how you structure things: which engine, which columns have indexes. At some point a simple SELECT count(*) FROM trade_metric; was taking more than a minute… Partitioning helped a lot, removing foreign key checks and related indexes helped a bit.

I was using MQL4 at that time. I heard that the newer versions offered some improvements, but using or developing external libraries was a real pain. Skipping PHP, MySQL and the HTTP handling on the MQL side was really a plus, and it came down to one simple solution: writing one CSV file per trade.
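
A quick back-of-the-envelope computation shows how fast such a table grows. Only the 20 timeframes and the 3 time scales come from the text; the other figures are assumed values picked to match "a few", "a few tens of thousands" and "more than one hundred".

```python
agents = 5                 # assumed: "a few different trading agents"
trades_per_agent = 30_000  # assumed: "a few tens of thousands"
metrics = 110              # assumed: "more than one hundred metrics"
timeframes = 20
time_scales = 3

rows = agents * trades_per_agent * metrics * timeframes * time_scales
print(f"{rows:,}")  # 990,000,000
```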
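
A minimal sketch of the "one CSV file per trade" layout, written in Python for illustration only (the original files were written from MQL4). The file naming and the two-column `name,value` layout are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def write_trade(directory, trade_id, profit, snapshot):
    """Write one trade as a two-column CSV file: field name, value."""
    path = Path(directory) / f"trade_{trade_id}.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["profit", profit])
        for name, value in sorted(snapshot.items()):
            writer.writerow([name, value])
    return path


def read_trades(directory):
    """Load every trade file of a directory back into a list of dicts."""
    trades = []
    for path in sorted(Path(directory).glob("trade_*.csv")):
        with path.open(newline="") as handle:
            trades.append({name: float(value) for name, value in csv.reader(handle)})
    return trades


with tempfile.TemporaryDirectory() as directory:
    write_trade(directory, 1, 12.5, {"SMA9": 1.31, "StdDev30": 0.02})
    write_trade(directory, 2, -4.0, {"SMA9": 1.29, "StdDev30": 0.03})
    loaded = read_trades(directory)

print(loaded[0])  # {'profit': 12.5, 'SMA9': 1.31, 'StdDev30': 0.02}
```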

Learned: Reusing models

Weka

At the time I began working on this project scikit-learn was not as widely used as it is today.

I was using Weka to experiment with models and for benchmarking. Then came the time to reuse the learned model, and for that Weka was not very convenient, especially when you are using it from the GUI.

A few models were displaying their inner data though, making it possible to reuse what had been learned. You guessed it: “decision trees”. With the data out of decision trees, it was easy to build a Python scorer on top of it.
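
Such a scorer can be very small. The sketch below assumes the tree has been transcribed by hand into nested tuples; the representation, the feature names and the split values are all hypothetical.

```python
# A tree is either a leaf (the predicted class) or a tuple:
# (feature name, split value, subtree if value <= split, subtree if value > split)
TREE = (
    "rsi14", 30.0,
    True,
    ("sma_diff", 0.0, False, True),
)


def score(tree, row):
    """Walk the tree until a leaf is reached and return its class."""
    while isinstance(tree, tuple):
        feature, split, left, right = tree
        tree = left if row[feature] <= split else right
    return tree


print(score(TREE, {"rsi14": 25.0, "sma_diff": -1.0}))  # True
print(score(TREE, {"rsi14": 50.0, "sma_diff": -1.0}))  # False
print(score(TREE, {"rsi14": 50.0, "sma_diff": 1.0}))   # True
```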

SAS Enterprise Miner

Part of this was done during my studies, so I had student access to SAS and their mining tool, Enterprise Miner. I used it a bit to benchmark against Weka. SAS was presenting significantly better results, but the processing was clearly too fast. Way too fast. With similar models and the same testing and validation datasets, SAS claimed to have a model in less than one minute, when it took days in Weka. It also claimed to reach impressive performance compared to Weka.

Trying to use the SAS model in production was not a great experience either. The buttons were there, and I spent three full days trying to find documentation and make them work in either C or Java. Fail.

I would be more than ready to admit that I did not know that software well enough, but I do not think that is the case: I had some experience with it, and I have been teaching some classes around it.

Learned: Domain knowledge & feature engineering

The fact that I somehow got stuck using decision trees led to a lot of thinking about how tree induction works. Simply keep in mind that those algorithms create each new node from a feature and a split value, chosen so that the resulting branches contain subsets of the data in which the target class is ordered in a “better” way than without that split.

Neither of the two things mentioned below managed to make results any better. I have only made successful use of them once, working on smart watch data.

It is funny to consider that if your model is “wrong enough” you could still use it by inverting its output, but it never went wrong enough to cover the broker fees.
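
To make “better” concrete, here is a minimal sketch of how one split can be evaluated. Gini impurity is used as one common criterion; this post does not say which criterion the tools actually used, and the data is made up.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_gain(values, labels, threshold):
    """Impurity removed by splitting the data on `value <= threshold`."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


def best_split(values, labels):
    """Try each observed value as a threshold and keep the most useful one."""
    return max(set(values), key=lambda t: split_gain(values, labels, t))


values = [1.0, 2.0, 3.0, 4.0]
labels = [False, False, True, True]
threshold = best_split(values, labels)
print(threshold, split_gain(values, labels, threshold))  # 2.0 0.5
```

Induction repeats this search over every feature, keeps the best (feature, threshold) pair as a node, then recurses into each branch.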

Including domain knowledge

At the beginning, my agents were taking long (buying) and short (selling) positions and those were processed the same way in the same dataset, there was just a feature trade_type telling buy or sell.

Just consider a made up problem:

A binary class C to predict: true or false
Two binary features A and B whose values give a perfect predictor of C, C = A XOR B.
A thousand other random features N_i, just noise that will prevent the learning algorithm from testing the right features mentioned above.

There is no doubt that XOR can be easily represented through a decision tree.

def score(data):
    if data.a:
        if data.b:
            return False
        else:
            return True
    else:
        if data.b:
            return True
        else:
            return False

Though when thinking about the inner workings of tree induction, there is no reason for A or B to get selected more than any of the random features, for neither of them separates the data properly on its own, and the learned tree will probably fail to separate the data correctly. The learning only adds one decision node at a time, while the tree above has three.

In our practical problem, it means that if I had one feature x always predicting whether a buying trade would win, the learning algorithm might still have been unsuccessful, because it would not have been able to combine a decision on x and trade_type at the same time.

So I chose to include some domain knowledge by processing the different “trade types” separately. In the end I analyzed only long positions.
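
The argument can be checked numerically on the XOR truth table. The sketch below uses information gain (entropy reduction) as the split criterion, which is an assumption about the learner; the precomputed `a_xor_b` column is a hypothetical engineered feature added for comparison.

```python
from itertools import product
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def information_gain(rows, feature):
    """Entropy of C removed by splitting the rows on a boolean feature."""
    remainder = 0.0
    for value in (False, True):
        subset = [row["c"] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row["c"] for row in rows]) - remainder


# All four combinations of A and B, with C = A XOR B.
rows = [{"a": a, "b": b, "c": a != b, "a_xor_b": a != b}
        for a, b in product((False, True), repeat=2)]

gain_a = information_gain(rows, "a")
gain_b = information_gain(rows, "b")
gain_combined = information_gain(rows, "a_xor_b")
print(gain_a, gain_b, gain_combined)  # 0.0 0.0 1.0
```

Taken alone, A and B look exactly as useless as pure noise to a greedy learner, while a single precomputed combination carries all the information.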

Storing comparisons

Every technical indicator ships with a set of rules that are “supposed to make you rich”. One example could be: “if the average over 7 days crosses the average over 21 days to go above it, then the trend is switching to bullish: you should buy.”

The rule is simple again:

def score(data):
    below_before = (data.average21[1] - data.average7[1]) > 0
    above_after = (data.average21[0] - data.average7[0]) < 0

    if below_before: # represented as decision trees again
        if above_after:
            return True
        else:
            return False
    else:
        return False

But this time, encoding a difference in a decision tree is impossible. The closest thing possible would be to use thresholds:

def below_before(data):
    if data.average21[1] > 1: # Any arbitrary value
        if data.average7[1] < 1: # Same arbitrary value
            return True
        else:
            return False

    # Here we can't tell, unless we tell an infinite number of thresholds
    # as arbitrary values

Similarly, some technical analysis rules are considered valid only if the volume / volatility is high enough.

So the preprocessing listed elements according to their types and specificity and then generated fields for the possible comparisons and ratios.

For example the computed value SMA9_-_High__-__SMA26_-_High/StdDev30-1 stored the difference of two moving averages divided by a standard deviation. In the end we were generating about 1900 features (some of them of course highly correlated).
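
A minimal sketch of that kind of generation, assuming the fields have already been grouped by type. The function, its arguments and the sample values are hypothetical; only the key format follows the example above.

```python
from itertools import combinations


def comparison_features(row, averages, deviations):
    """Differences of every pair of averages, each also scaled by every deviation."""
    features = {}
    for first, second in combinations(averages, 2):
        difference = row[first] - row[second]
        features[f"{first}__-__{second}"] = difference
        for deviation in deviations:
            if row[deviation]:  # skip the ratio when the deviation is zero
                features[f"{first}__-__{second}/{deviation}"] = difference / row[deviation]
    return features


row = {"SMA9_-_High": 12.0, "SMA26_-_High": 10.0, "StdDev30-1": 4.0}  # made-up values
features = comparison_features(
    row,
    averages=["SMA9_-_High", "SMA26_-_High"],
    deviations=["StdDev30-1"],
)
print(features)
```

Once the difference exists as its own field, "one average is above the other" becomes a single threshold at 0, which a decision tree can express in one node.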

Conclusion, AKA “Are you rich now?”

The very naive formulation of the idea behind this work was that if something could be made of technical analysis, machine learning should be able to discover it. My application of machine learning has not uncovered anything, but more domain knowledge might have helped. If I had to work on this again I would try to embed NLP and analyze news feeds, but accessing a history for those is something much more complex.

We got some models showing good performance, though none of those performances were reproducible. You could learn on 2004 trades, test on 2005 with an incredible accuracy but not be able to do it again the next year. Interestingly enough that kind of phenomenon was occurring more often with raw data than with preprocessed and enriched data. Enriched data consistently produced models with bad accuracy; this consistency may mean that something positive was coming out of the feature engineering.

This work was done between 2008 and 2012 and presented at a Data Science Luxembourg event in 2015.
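
The year-over-year check described above can be sketched as a walk-forward evaluation. Everything here is illustrative: the trades are made up, and `rule` is a fixed stand-in for a model that would normally be learned on each training year.

```python
def walk_forward_splits(trades):
    """Yield (train_year, test_year, train, test), each year tested on the next one."""
    years = sorted({trade["year"] for trade in trades})
    for train_year, test_year in zip(years, years[1:]):
        train = [t for t in trades if t["year"] == train_year]
        test = [t for t in trades if t["year"] == test_year]
        yield train_year, test_year, train, test


def accuracy(predict, trades):
    """Share of trades whose outcome the predictor gets right."""
    return sum(predict(t) == t["won"] for t in trades) / len(trades)


# Made-up trades: the pattern "x > 0 wins" holds in 2004 and 2005 but not in 2006.
trades = [
    {"year": 2004, "x": 1.0, "won": True},
    {"year": 2004, "x": -1.0, "won": False},
    {"year": 2005, "x": 2.0, "won": True},
    {"year": 2005, "x": -2.0, "won": False},
    {"year": 2006, "x": 1.5, "won": False},
    {"year": 2006, "x": -0.5, "won": True},
]


def rule(trade):
    """Stand-in for a model learned on the training year."""
    return trade["x"] > 0


results = [(train_year, test_year, accuracy(rule, test))
           for train_year, test_year, _train, test in walk_forward_splits(trades)]
print(results)  # [(2004, 2005, 1.0), (2005, 2006, 0.0)]
```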