Deep Q-Network for beginners Etsuji Nakai Cloud Solutions Architect at Google 2016/08/01 ver1.0 Google confidential | Do not distribute
$ who am i ▪ Etsuji Nakai Cloud Solutions Architect at Google Twitter @enakai00
Typical application of DQN
▪ Learning “the optimal operations to achieve the best score” from the screen images of video games.
– In theory, you can learn it (without knowing the rules of the game) by collecting data consisting of “on what screen, with which operation, how the score will change, and what the next screen will be.”
– This is analogous to constructing an algorithm for the game of Go by collecting data consisting of “on what board position, where you put the next stone, and how your advantage will change.”
https://www.youtube.com/watch?v=r3pb-ZDEKVg
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Theoretical framework of DQN
▪ Suppose that you have all the quartets (s, a, r, s') for any pair (s, a), meaning “with the current state s and the action a, you will receive a reward (score) r and the next state will be s'.”
– This corresponds to the data “on what screen, with which operation, how the score will change, and what the next screen will be.”
– It is impractical to collect data for all possible pairs (s, a), but suppose that you have enough of them to train the model to a certain level.
– Note that, mathematically, r and s' are functions of the pair (s, a).
▪ You may naively think that the following rule gives the optimal action for the current state s.
⇒ Choose the action a which maximizes the immediate reward r.
– But this doesn't necessarily lead to the best outcome. In the case of Breakout, it is better to hit blocks near the side walls even though it may take a little longer.
▪ In a nutshell, you have to figure out the action a which maximizes the long term rewards.
Let’s imagine the magical “Q” function
▪ First, we define the total rewards as below.
$$ R = \sum_{n=0}^{\infty} \gamma^n r(s_n, a_n) $$
– s_n and a_n represent the state and action at the n-th step. γ is a number a little smaller than 1 (around 0.9), introduced to keep the sum from becoming infinite.
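To make the discounted sum concrete, here is a minimal Python sketch. The function name `discounted_return` and the reward lists are made up for illustration, and the infinite sum is truncated to a finite list of rewards.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**n * r_n over a finite list of per-step rewards."""
    total = 0.0
    for n, r in enumerate(rewards):
        total += (gamma ** n) * r
    return total

# With a reward of 1 at every step, the sum stays close to 1 / (1 - gamma) = 10
# no matter how many steps are added.
long_run = discounted_return([1.0] * 1000)

# A larger reward that arrives later can still beat a small immediate one.
immediate = discounted_return([1.0, 0.0, 0.0])
delayed = discounted_return([0.0, 0.0, 5.0])
print(long_run, immediate, delayed)
```

The second comparison is the Breakout point in miniature: the sequence with no immediate reward ends up with the larger total.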
▪ Now suppose that we have a convenient magical function Q(s, a) as below although we don’t know how to calculate it at all. – Q(s, a) = “The total rewards you will receive when you choose the next action a, and keep choosing the optimal actions afterwards.”
▪ Once you have the function Q(s, a) , you can choose the optimal action at state s with the following rule.
⇒ Choose an action a which maximizes the total rewards if you keep choosing the optimal actions afterwards.
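As a rough illustration of this rule, suppose the values of Q were somehow available as a lookup table. The table contents, state name, and action names below are invented for the example.

```python
def best_action(q, state, actions):
    """Return the action a that maximizes q[(state, a)]."""
    return max(actions, key=lambda a: q[(state, a)])

# Hypothetical Q values for a single state "s0".
q_table = {
    ("s0", "left"): 1.2,
    ("s0", "right"): 3.4,
    ("s0", "stay"): 0.5,
}

chosen = best_action(q_table, "s0", ["left", "right", "stay"])
print(chosen)
```

In a real game the table would be far too large to store, which is why the later slides replace it with a parameterized function.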
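The black magic of “recursive definition”
▪ Although we are not sure how to calculate Q(s, a), we can say that it satisfies the following “Q-equation”, where r(s, a) and s' are the reward and the next state for the pair (s, a), and γ is the number around 0.9 used in the definition of the total rewards.
$$ Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a') $$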
– See the next slide for the mathematical proof.
Proof of the Q-equation
– Suppose that the chain of states and actions is as below when you start from the state s_0 and keep choosing the best actions.
– From the definition of Q(s, a), the following equation (1) holds.
– Now suppose that the initial state is s_1 instead of s_0, and you keep choosing the best actions. The chain will be as below.
– Again, from the definition of Q(s, a), the following equation (2) holds.
– Rearrange (1) as below, and substitute (2).
– Considering the following relations, it’s equivalent to the Q-equation.
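The equations for this proof did not survive in the text. The block below is a reconstruction from the definitions of the total rewards and of Q(s, a), not a copy of the original slide. It assumes that a_0 is the action chosen first and that a_1, a_2, ... are the best actions in the states that follow.

```latex
\begin{align*}
& (s_0, a_0) \to (s_1, a_1) \to (s_2, a_2) \to \cdots \\
Q(s_0, a_0) &= r(s_0, a_0) + \gamma\, r(s_1, a_1) + \gamma^2 r(s_2, a_2) + \cdots \tag{1} \\
Q(s_1, a_1) &= r(s_1, a_1) + \gamma\, r(s_2, a_2) + \cdots \tag{2} \\
Q(s_0, a_0) &= r(s_0, a_0) + \gamma \left( r(s_1, a_1) + \gamma\, r(s_2, a_2) + \cdots \right)
             = r(s_0, a_0) + \gamma\, Q(s_1, a_1) \\
s_1 &= s'(s_0, a_0), \qquad Q(s_1, a_1) = \max_{a'} Q(s_1, a') \\
Q(s_0, a_0) &= r(s_0, a_0) + \gamma \max_{a'} Q(s', a')
\end{align*}
```

The second-to-last line holds because a_1 is, by assumption, the best action at s_1. The last line is the Q-equation with s = s_0 and a = a_0.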
Approximate “Q” function using Q-equation
▪ Prepare some function with adjustable parameters. By adjusting them, you may find a function which satisfies the Q-equation.
– If you succeed, you now have the “Q” function!
– Strictly speaking, the Q-equation is a necessary condition, not a sufficient one. However, under some assumptions, it has been proved to be sufficient.
▪ Here are the steps to adjust the parameters.
– 1. Let D be the set of all the quartets (s, a, r, s') you have.
– 2. Prepare an initial candidate of the “Q” function as Q(s, a | w).
– 3. Calculate the following error function E(w) using all (or a part of) the data in D. This is the sum of squared differences between the LHS and RHS of the Q-equation.
$$ E(w) = \sum_{(s, a, r, s') \in D} \left( r + \gamma \max_{a'} Q(s', a' \mid w) - Q(s, a \mid w) \right)^2 $$
– 4. Adjust the parameter w so that E(w) becomes smaller. Then, go back to 3.
▪ After repeating 3. and 4., if E(w) becomes small enough, you have an approximate version of the “Q” function.
– It is better to use a more complicated candidate Q(s, a | w), so that you can get a better approximation.
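Steps 1 to 4 can be sketched on a toy problem. Everything here is an assumption made for illustration: the two-state data set D, the simplest possible candidate (one parameter w[(s, a)] per pair), the learning rate, and the update rule. The update treats the right-hand side of the Q-equation as a fixed target while adjusting w, which is a common simplification and not necessarily the exact procedure the slides have in mind.

```python
GAMMA = 0.9
ACTIONS = [0, 1]

# Step 1: hypothetical data set D of quartets (s, a, r, s_next).
D = [
    (0, 0, 1.0, 0),  # small immediate reward, stay in state 0
    (0, 1, 0.0, 1),  # no reward now, move to state 1
    (1, 0, 5.0, 0),  # large reward, back to state 0
    (1, 1, 0.0, 1),  # no reward, stay in state 1
]

def q(w, s, a):
    """Candidate Q(s, a | w): one adjustable parameter per (s, a) pair."""
    return w[(s, a)]

def target(w, r, s_next):
    """Right-hand side of the Q-equation for one quartet."""
    return r + GAMMA * max(q(w, s_next, a2) for a2 in ACTIONS)

def error(w):
    """Step 3: sum of squared differences between RHS and LHS over D."""
    return sum((target(w, r, s2) - q(w, s, a)) ** 2 for s, a, r, s2 in D)

def train(sweeps=1000, lr=0.5):
    # Step 2: initial candidate with all parameters set to zero.
    w = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    for _ in range(sweeps):
        for s, a, r, s2 in D:
            # Step 4: move Q(s, a | w) toward the target, then repeat.
            w[(s, a)] += lr * (target(w, r, s2) - q(w, s, a))
    return w

w = train()
final_error = error(w)
best_at_state_0 = max(ACTIONS, key=lambda a: q(w, 0, a))
print(final_error, best_at_state_0, w)
```

In this toy setting the learned values prefer action 1 in state 0, even though its immediate reward is zero, because it leads to the larger reward in state 1.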
What is “the more complicated function”? ▪ Yes!
The Deep Neural Network!
▪ “Deep Q-Network” is essentially a multi-layer neural network used as the candidate for the Q-function.
By the way, what is a neural network?
▪ Roughly speaking, it’s just a combination of multiple simple functions resulting in a highly complex function.
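A small sketch of that idea: each unit below is a weighted sum followed by a sigmoid, and three of them combined compute XOR, which no single unit of this kind can do. The weights are hand-picked for the illustration rather than learned, and the function names are hypothetical.

```python
import math

def linear(weights, bias, xs):
    """Weighted sum of the inputs plus a bias."""
    return sum(w * x for w, x in zip(weights, xs)) + bias

def sigmoid(z):
    """Smooth step function squashing z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tiny_network(x1, x2):
    # Hidden layer: two simple units with hand-picked weights.
    h1 = sigmoid(linear([20.0, 20.0], -10.0, [x1, x2]))   # roughly "x1 OR x2"
    h2 = sigmoid(linear([-20.0, -20.0], 30.0, [x1, x2]))  # roughly "NOT (x1 AND x2)"
    # Output layer: one more simple unit combining the hidden units.
    return sigmoid(linear([20.0, 20.0], -30.0, [h1, h2])) # roughly "h1 AND h2"

outputs = [round(tiny_network(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

A deep network follows the same pattern with many more units and layers, and its weights are the adjustable parameters w.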
How do you collect the quartets?
▪ How do you collect the quartets (s, a, r, s') in the real world?
– Basically, just keep playing the game with random actions.
– In theory, if you keep playing for an infinite time, you would encounter all the possible states.
▪ But in reality, the probability of reaching some states with random actions is quite small. To compensate for this, you can take the following strategy.
– Once you have collected some amount of data, train the Q-function using this data.
– After that, play the game by mixing random actions and the (presumably) best actions calculated from the current Q-function.
– When you have collected some additional data, train the Q-function again.
– Through this cycle, you can make the Q-function better, and collect more data including states which are unreachable with only random actions.
▪ Why don’t you play using only the Q-function, without random actions?
– It doesn’t work. By collecting all kinds of states, even with random actions, the model can learn “how to gain long term rewards through some non-rewarding states.”
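The mixing of random and best actions can be sketched as follows. The toy environment `step`, the mixing probability `epsilon`, and the deliberately poor early estimate `q_guess` are all assumptions made for the illustration. This way of mixing is often called an epsilon-greedy strategy, a name the slides themselves do not use.

```python
import random

ACTIONS = [0, 1]

def step(s, a):
    """Hypothetical toy environment: returns (reward, next_state)."""
    table = {
        (0, 0): (1.0, 0),
        (0, 1): (0.0, 1),
        (1, 0): (5.0, 0),
        (1, 1): (0.0, 1),
    }
    return table[(s, a)]

def choose_action(q, s, epsilon, rng):
    """With probability epsilon act randomly, otherwise follow the current Q."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((s, a), 0.0))

def collect(q, n_steps, epsilon, rng, start_state=0):
    """Play n_steps and record the quartets (s, a, r, s_next)."""
    data = []
    s = start_state
    for _ in range(n_steps):
        a = choose_action(q, s, epsilon, rng)
        r, s_next = step(s, a)
        data.append((s, a, r, s_next))
        s = s_next
    return data

rng = random.Random(0)
q_guess = {(0, 0): 1.0}  # early estimate that only knows about the small reward

greedy_only = collect(q_guess, 200, epsilon=0.0, rng=rng)
mixed = collect(q_guess, 200, epsilon=0.3, rng=rng)

print({s for s, _, _, _ in greedy_only})
print({s for s, _, _, _ in mixed})
```

Playing with the Q-function alone never leaves state 0 here, so the data never contains the larger reward available from state 1. The mixed run passes through the non-rewarding move and records it, which gives the next round of training something to learn from.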