GitHub - LiamConnell/deep-algotrading at 5600fb6fc26085630334892eabeae5e07e32010a


# A tour through TensorFlow with financial data

I present several models ranging in complexity from simple regression to LSTM and policy networks. The series can be used as an educational resource for tensorflow or deep learning, a reference aid, or a source of ideas on how to apply deep learning techniques to problems that are outside of the usual deep learning fields (vision, natural language).

Not all of the examples will work. Some of them are far too simple to even be considered viable trading strategies and are only presented for educational purposes. Others, in the notebook form I present, have not been trained for the proper amount of time. Perhaps with a bit of rented GPU time they will be more promising, and I leave that as an exercise for the reader (who wants to make a lot of money). Hopefully this project inspires some to try using deep learning techniques for some more interesting problems. Contact me if you are interested in learning more or if you have suggestions for additions or improvements.

The algorithms increase in complexity and introduce new concepts as they progress:

Simple Regression: Here we regress the prices from the last 100 days to the next day's price, training W and b in the equation y = Wx + b, where y is the next day's price, x is a vector of dimension 100, W is a 100x1 matrix and b is a 1x1 matrix. We run the gradient descent algorithm to minimize the mean squared error between the predicted price and the actual next day's price. Congratulations, you passed high school stats. But hopefully this simple and naive example helps demonstrate the idea of a tensor graph, as well as showing a great example of extreme overfitting.
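To make the setup concrete, here is a minimal NumPy sketch of the same regression rather than the notebook's TensorFlow graph. The helper names (`make_windows`, `fit_linear`), the synthetic random-walk prices, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

def make_windows(prices, lookback):
    """Build (X, y): each row of X holds `lookback` consecutive prices,
    and y holds the price on the day that follows them."""
    rows = [prices[i:i + lookback] for i in range(len(prices) - lookback)]
    X = np.array(rows)
    y = prices[lookback:].reshape(-1, 1)
    return X, y

def fit_linear(X, y, lr=1e-3, steps=500):
    """Plain gradient descent on the mean squared error of y_hat = X @ W + b."""
    n, d = X.shape
    W = np.zeros((d, 1))   # 100x1 when lookback is 100
    b = np.zeros((1, 1))   # 1x1, as in the text
    losses = []
    for _ in range(steps):
        err = X @ W + b - y                  # (n, 1) residuals
        losses.append(float(np.mean(err ** 2)))
        W -= lr * (2.0 / n) * (X.T @ err)    # d(MSE)/dW
        b -= lr * 2.0 * err.mean()           # d(MSE)/db
    return W, b, losses

# Synthetic prices (an assumption): a random walk starting near 1.0.
rng = np.random.default_rng(0)
prices = 1.0 + np.cumsum(rng.normal(0.0, 0.01, size=300))

X, y = make_windows(prices, lookback=100)
W, b, losses = fit_linear(X, y)
print(losses[0], losses[-1])
```

With 101 trainable numbers and only 200 training rows, a model like this can fit the training window far better than it will fit any new data, which is the overfitting the text warns about.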
Simple Regression on Multiple Symbols: Things get a little more interesting as soon as we introduce more than one symbol. What is the best way to model our eventual investment strategy, our policy, if you will? We start to realize that our model only vaguely implies a policy (investment actions) by predicting the actual movement in price. The implied policy is simple: buy if the predicted price movement is positive, sell if it is negative. But that doesn't sound realistic at all. How much do we buy? And will optimizing this, even if we are very careful to avoid overfitting, even produce results that align with our goals? We haven't actually defined our goals explicitly, but for those who are not familiar with investment metrics, common goals include:
- maximize risk adjusted return (like the Sharpe ratio)
- consistency of returns over time
- low market exposure
- long/short equity

If markets were easy to figure out and we could accurately predict the next day's return then it wouldn't matter. Our implied policy would fit with some goals (not long/short equity though) and the strategy would be viable. The reality is that our model cannot accurately predict this, nor will our strategy ever be perfect. Our best case scenario is always winning slightly more than losing. When operating on these margins it is much more important that we consider the policy explicitly, thus moving to 'Policy Based' deep learning.
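The implied policy and one of the goals above can be written down in a few lines. This is only a sketch: the function names are hypothetical, the numbers are made up, positions are equally weighted across symbols, and the Sharpe ratio assumes a zero risk-free rate and 252 trading days per year.

```python
import numpy as np

def implied_positions(predicted_returns):
    """Implied policy: +1 (buy) where the predicted move is positive,
    -1 (sell) where it is negative, 0 where it is exactly zero."""
    return np.sign(predicted_returns)

def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    return np.sqrt(periods_per_year) * np.mean(daily_returns) / np.std(daily_returns, ddof=1)

# Rows are days, columns are symbols (made-up numbers).
predicted = np.array([[0.010, -0.020], [-0.005, 0.000], [0.003, 0.004]])
actual = np.array([[0.020, 0.010], [0.010, -0.010], [-0.010, 0.020]])

positions = implied_positions(predicted)
daily = (positions * actual).mean(axis=1)   # equal-weighted portfolio return per day
sharpe = sharpe_ratio(daily)
print(positions, daily, sharpe)
```

Note that nothing in the regression's loss refers to `sharpe` or to position sizes. That gap between what is optimized and what we care about is the motivation for training the policy directly.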

Policy Gradient Training: Our policy will remain simple. We will choose a position, long/neutral/short, for each symbol in our portfolio. But now, instead of letting our estimation of the future return inform our decision, we train our network to choose the best position. Thus, instead of having an implied policy, it is explicit and trained directly. Even though the policy is simple in this case, training it is a bit more involved. I did my best to interpret Andrej Karpathy's excellent article on Reinforcement Learning when writing this code. It might be worth reading his explanation, but I'll do my best to summarize what I did.

For each symbol in our portfolio, we sample (or argmax if the math is too hard) the probability distribution of our three position buckets to get our policy decision (a position: long, short, or neutral). We multiply our position by the target value to get a daily return for the symbol. Then we combine the symbols to get a full daily return. We can also get other metrics like the total return and Sharpe ratio, since we are actually feeding this through as a batch (more on that later). As Karpathy points out, we are only interested in the gradients of the positions we sampled, so we select the appropriate columns from the output and combine them into a new tensor.

So now we have a tensor with the regression's probability for the chosen (sampled) action for each symbol and each day. We also have a few performance metrics like daily and total return to choose from, but they're not differentiable because we sampled the probability, so we can't just "gradient descent maximize" the profit...unfortunately. Instead, we find the cross entropy between the first table (the probabilities we chose/sampled) and an all-ones tensor of the same shape. With a target of one, each entry is just the negative log of the sampled probability. We get a table of cross entropies of the same size (number of symbols by batch size). This is basically equivalent to asking: how do I do MORE of what I'm already doing, for every decision that I made? Now we don't necessarily want MORE of what we're doing, but the opposite of it is definitely LESS of it, which is useful. We multiply that tensor by our fitness function (the daily or aggregate return) and we use the gradient descent optimizer to minimize the cost. So you see? If the fitness function is negative, it will train the weights of the regression to NOT do what it just did. It's a pretty cool idea and it can be applied to a lot of problems that are much more interesting. In the notebook I give some examples of the different fitness functions you can apply, which I think is better explained by seeing it.
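The recipe above can be sketched in NumPy with the gradient written out by hand. This is not the notebook's TensorFlow code: the function names, the toy data (returns equal to the sign of a feature, so a linear policy can learn them), the learning rate, and the choice of per-symbol daily return as the fitness are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy_probs(W, X, n_symbols):
    """Linear 'regression' that outputs short/neutral/long probabilities
    for every day and symbol: shape (days, symbols, 3)."""
    logits = (X @ W).reshape(len(X), n_symbols, 3)
    return softmax(logits)

def policy_gradient_step(W, X, targets, rng, lr=1.0):
    n, n_symbols = targets.shape
    probs = policy_probs(W, X, n_symbols)
    # Sample one action per day and symbol (0 = short, 1 = neutral, 2 = long).
    u = rng.random((n, n_symbols, 1))
    actions = np.minimum((u > probs.cumsum(axis=-1)).sum(axis=-1), 2)
    positions = actions - 1                  # -1, 0, +1
    fitness = positions * targets            # return of each sampled decision
    onehot = np.eye(3)[actions]
    # Cost = mean(-log p_chosen * fitness). The gradient of -log p_chosen
    # with respect to the logits is (probs - onehot).
    grad_logits = fitness[..., None] * (probs - onehot) / fitness.size
    W = W - lr * (X.T @ grad_logits.reshape(n, -1))
    return W, float(fitness.mean())

def greedy_return(W, X, targets):
    """Average return when we take the argmax position instead of sampling."""
    positions = policy_probs(W, X, targets.shape[1]).argmax(axis=-1) - 1
    return float((positions * targets).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # 200 days, 4 toy features
targets = np.sign(X[:, :2])          # 2 symbols, toy returns of +1 or -1
W = np.zeros((4, 2 * 3))
for _ in range(300):
    W, sampled_avg = policy_gradient_step(W, X, targets, rng)
final = greedy_return(W, X, targets)
print(final)
```

Decisions with negative fitness flip the sign of the update, which is the "do LESS of that" behavior described above. Decisions with zero fitness (here, every neutral position) contribute nothing.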

Stochastic Gradient Descent: As you saw in the notebook, the policy gradient doesn't train very well when we are grading it on the return over the entire dataset, but it trains very well when it uses each day's return or the position on each symbol every day. This makes sense: if we just take the total return over several years and it's slightly positive, then we tell our machine to do more of that. That will do almost nothing, since so many of those decisions were actually losing money. The problem, as we have it set up now, needs to be broken down into smaller units. Fortunately there is some mathematical proof that that is legal and even faster. Score!

Stochastic Gradient Descent is basically just breaking your data into smaller batches and doing gradient descent on each one. It will have slightly less accurate gradients with respect to the entire dataset's cost function, but since you are able to iterate faster with smaller batches you can run way more of them. There might even be more advantages to SGD that I'm not even mentioning, so why don't you read the Wikipedia or arXiv on it instead of taking my information? Or just use it since it works. If you're going on a Wikipedia learning binge you might as well also learn about Momentum and Adagrad. They exist and they are more efficient but they're only really useful for people doing much bigger projects. If you are working on a huge project and your two-bucks-an-hour AWS GPU instance is too slow, then you should definitely be using them (and not be reading this introductory tutorial).
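As a rough illustration, here is minibatch SGD applied to a small linear regression in NumPy. The names `minibatches` and `sgd_linear`, the batch size, and the noise-free toy data are assumptions. The sketch also shuffles rows every epoch, which is a common choice but not necessarily what the notebooks do with their day-ordered batches.

```python
import numpy as np

def minibatches(n_rows, batch_size, rng):
    """Yield index arrays that cover all rows once, in shuffled order."""
    order = rng.permutation(n_rows)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

def sgd_linear(X, y, batch_size=32, lr=0.05, epochs=20, seed=0):
    """Fit y_hat = X @ W + b, taking one gradient step per minibatch."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], 1))
    b = 0.0
    for _ in range(epochs):
        for idx in minibatches(len(X), batch_size, rng):
            err = X[idx] @ W + b - y[idx]              # residuals on this batch only
            W -= lr * 2.0 * (X[idx].T @ err) / len(idx)
            b -= lr * 2.0 * float(err.mean())
    return W, b

# Toy data with known coefficients, so the result is easy to check.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
true_W = np.array([[1.0], [-2.0], [0.5]])
y = X @ true_W + 0.3

W, b = sgd_linear(X, y)
print(W.ravel(), b)
```

Each step only sees 32 rows, so its gradient is a noisy estimate of the full-data gradient, but 20 passes over the data give 320 updates instead of 20.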

Multi Sampling: Since we are sampling the policy, we can sample repeatedly to get a better estimate of the gradient. Karpathy's article summarizes the math behind this nicely and this paper is worth reading. The concept is intuitive and simple, but getting the math to work out and the tensor graph in order is very involved. One realizes that a mastery of numpy and a solid understanding of linear algebra are very important to tensorflow once the problems get...deeper, I guess is the word.

Multi sampling adds a useful computational kick that lets the network train much more efficiently. The results are already impressive. Using batches of fewer than 75 days and only training on the total return over that timeframe, we are able to "overfit" our network. Keep in mind that all we are doing is telling the network to do more of what it is doing when it does well, and less when it does poorly! Sure, we are still far away from having anything that works out of sample, but that is because we are still using linear regression.
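A tiny example shows why repeated sampling helps. For a single decision with three actions, the sketch below compares the sampled gradient estimate with the exact gradient of the expected reward with respect to the logits. The probabilities, rewards, and function names are made up for illustration, and the gradient is written in the ascent direction (toward higher reward) rather than as a cost to minimize.

```python
import numpy as np

def sampled_gradient(probs, rewards, n_samples, rng):
    """Average the per-sample estimate reward * (onehot(action) - probs)."""
    actions = rng.choice(len(probs), size=n_samples, p=probs)
    onehot = np.eye(len(probs))[actions]
    return (rewards[actions][:, None] * (onehot - probs)).mean(axis=0)

def exact_gradient(probs, rewards):
    """Gradient of sum_a probs[a] * rewards[a] with respect to the logits."""
    return probs * (rewards - probs @ rewards)

probs = np.array([0.2, 0.3, 0.5])      # short, neutral, long
rewards = np.array([-1.0, 0.0, 1.0])   # reward of each position on an up day

rng = np.random.default_rng(0)
exact = exact_gradient(probs, rewards)
one_sample = sampled_gradient(probs, rewards, 1, rng)
many_samples = sampled_gradient(probs, rewards, 20000, rng)

err_one = np.abs(one_sample - exact).max()
err_many = np.abs(many_samples - exact).max()
print(exact, err_one, err_many)
```

A single sample points in a usable direction only on average. Averaging many samples of the same decision moves the estimate toward `exact`, so each training step carries less noise.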

By now you are probably either wondering "does this guy even know what deep learning is? I haven't seen a single neural network!" or you completely forgot we were still using the same linear regression that 16 year olds learn in math class. Well, we'll get to neural networks next, but I wanted to talk about other things first to show how much tensorflow can be used for before neural networks even get mentioned, and to show how much important math exists in deep learning that has nothing to do with neural networks. Tensorflow makes neural nets so easy that you barely even notice that they're part of the picture, and it's definitely not worth getting bogged down by their math if you don't have a solid understanding of the math behind cross entropy, policy gradients and the like. They probably even distract from where the true difficulty is. Maybe I'll try to get a regression to play pong so that everyone shuts up about neural networks and starts talking about policy learning...

Neural Networks: So we finally get to it. Here's the same thing with a neural network. It's the simplest kind of net that there is but it is still very powerful. Way more powerful than our puny regression ever was because it has nonlinearities (ReLU layers) between the other layers (which are basically just regressions by themselves). A sequence of linear regressions is still obviously linear (I hope you didn't sleep through Linear Algebra class! I owe all my LA skills to Comrade Otto Bretscher of Colby College whose class I did sleep through but whose text book is worth its weight in gold). But if we put a nonlinearity between the layers, then our net can, in principle, approximate almost any function. That's what gets people excited about neural networks, because they can hold enormous amounts of information when trained well. Fortunately, we just learned a bunch of cool ways to train them in steps 1-5. Now putting in the network is very easy. I really changed nothing except the Variable, and the equation really is basically still y = Wx + b.
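The claim that stacked linear layers are still linear is easy to check numerically. In this NumPy sketch the layer sizes and random weights are arbitrary assumptions; the point is only that two linear layers collapse into one, while a ReLU in between prevents that.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 100))                        # 5 examples, 100 inputs each
W1, b1 = rng.normal(size=(100, 16)), rng.normal(size=(1, 16))
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=(1, 1))

def relu(z):
    return np.maximum(z, 0.0)

# Two linear layers in a row...
stacked = (x @ W1 + b1) @ W2 + b2
# ...equal one linear layer with combined weights and bias.
collapsed = x @ (W1 @ W2) + (b1 @ W2 + b2)

# With a ReLU between the layers, no single W and b reproduces the output in general.
network = relu(x @ W1 + b1) @ W2 + b2

print(np.allclose(stacked, collapsed), np.allclose(network, stacked))
```

Each layer is still of the form Wx + b, which is why swapping the regression for a network changes so little in the code.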

The reason I introduced the networks so late is because they can be a bit difficult to tune. Choosing the right size of your network and the right training step can be difficult, and sometimes it is helpful to start out simple until you have all the bells and whistles in place.

In steps 3-5, we spent a lot of time figuring out tricks to do with the training step, which is a widely researched area at the moment and is probably more relevant to algorithmic trading than anything else. Now we are starting to demonstrate some of the techniques used in the prediction engine (regression before, neural network now). I believe this is a much more researched area and TensorFlow is better equipped for it. Many people describe the types of neural networks that we will learn as cells or legos. You don't need to think that much about how it works as long as you know what it does. If you noticed, that's what I did with the neural network. There is a lot more to learn and it's worth learning, but when you're actually building with it, you don't think about ReLU layers as much as input/output and a black box in the middle. Or at least I do...there are a bunch of people in image processing who look inside networks and do very cool things.

LSTM: My favorite neural network, and a true stepping stone into real deep learning, is the long short-term memory network, or LSTM. Colah wrote an incredibly clear explanation of LSTM and there is really no substitute for reading his post. To describe the setup as briefly as possible, you input the data one timestep at a time to the LSTM cell. At each timestep the cell not only receives the new input, but it also receives the last timestep's output and what is called the cell state, a vector that carries information about what happened in the past. Within the cell you have trained gates (basically small neural nets) that decide, based on the three inputs, what to forget from the past cell state, what to remember (or add) to the new state, and what to output this timestep. It is a very powerful tool and fascinating in how effective it is.
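For readers who want to see the gates spelled out, here is a single LSTM step in NumPy. It is a sketch of the common formulation in which the gates read the new input and the previous output, while the cell state enters through the forget and update equations. The weights are random and untrained, and all names and sizes are assumptions; in TensorFlow the cell is provided for you.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One timestep. W has shape (n_input + n_hidden, 4 * n_hidden)."""
    z = np.concatenate([x, h_prev], axis=-1) @ W + b
    f, i, o, g = np.split(z, 4, axis=-1)
    f = sigmoid(f)            # forget gate: what to drop from the old cell state
    i = sigmoid(i)            # input gate: how much new information to add
    o = sigmoid(o)            # output gate: what to reveal this timestep
    g = np.tanh(g)            # candidate values to write into the cell state
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new output
    return h, c

n_input, n_hidden, batch, n_steps = 5, 4, 3, 10
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n_input + n_hidden, 4 * n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros((batch, n_hidden))
c = np.zeros((batch, n_hidden))
sequence = rng.normal(size=(n_steps, batch, n_input))
for x_t in sequence:          # feed the data one timestep at a time
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The loop is the whole trick: `h` and `c` are carried from one timestep to the next, so the output at the last step can depend on everything the cell chose to remember.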

I am now pretty far into this series and I have a pretty good idea of where it will go from here. These are the problems that I must tackle in no particular order:

- new policies such as
  - long/short equality among two symbols and more
  - spread trading (if that is different from above)
  - minimize correlation / new risk measure that is appropriate for a large number of symbols
- migrating to AWS and using GPU computing power
- ensembling large numbers of strategies that are generated with the same code
  - policy grads find local maxima so no reason not to use that to my advantage
- REDUCE OVERFITTING, use techniques like dropout to avoid overfitting enormously
- testing suite to be able to test objectively if the strategies are viable
- convolution nets, especially among larger groups of symbols where we can expect that some patterns are fractal
- turning it into a more formal project or web app

And of course we can start moving to other sources of data:

- text
- scraping
- games

Stay tuned for some articles that I will write about the algorithms used here and a discussion of the difficulties of using these techniques for algorithmic trading development.
