This page reproduces the content of http://www.slideshare.net/enakai/deep-qnetwork-for-beginners.

Uploaded 2016/07/19 in Technology

Introduction to the Deep Q-Network theory.

- Deep Q-Network for beginners

Etsuji Nakai

Cloud Solutions Architect at Google

2016/08/01 ver1.0

Google confidential | Do not distribute

- Typical application of DQN

▪ Learning “the optimal operations to achieve the best score” from the screen images of video games.

– In theory, you can learn this (without knowing the rules of the game) by collecting data consisting of “on what screen, with which operation, how the score will change, and what the next screen will be.”

– This is analogous to constructing an algorithm for the game of Go by collecting data consisting of “on what board position, where you place the next stone, and how your advantage will change.”

https://www.youtube.com/watch?v=r3pb-ZDEKVg

https://www.youtube.com/watch?v=V1eYniJ0Rnk

- Theoretical framework of DQN

▪ Suppose that you have all the quartets (s, a, r, s') for any pair (s, a), meaning “with the current state s and the action a, you will receive a reward (score) r and the next state will be s'.”

– This corresponds to the data “on what screen, with which operation, how the score will change, and what the next screen will be.”

– It is impractical to collect data for all possible pairs (s, a), but suppose that you have enough of them to train the model to a certain level.

– Note that, mathematically, r and s' are functions of the pair (s, a).
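
As a minimal sketch of how such quartets might be stored, the snippet below uses a hypothetical `Transition` tuple and a made-up three-state toy game. Neither the names nor the numbers come from the slides; they only illustrate that r and s' can be looked up from the pair (s, a).

```python
from typing import Dict, List, NamedTuple, Tuple


class Transition(NamedTuple):
    """One quartet (s, a, r, s'). Field names are illustrative assumptions."""
    state: int
    action: str
    reward: float
    next_state: int


# Hypothetical toy data: states are integers, actions are "left" / "right".
data: List[Transition] = [
    Transition(0, "right", 0.0, 1),
    Transition(1, "right", 1.0, 2),
    Transition(1, "left", 0.0, 0),
]

# Because r and s' are functions of (s, a), the data can be indexed by (s, a).
lookup: Dict[Tuple[int, str], Tuple[float, int]] = {
    (t.state, t.action): (t.reward, t.next_state) for t in data
}

print(lookup[(1, "right")])
```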

▪ You may naively think that the following gives the optimal action given the current state s.

⇒ Choose the action a which maximizes the immediate reward r.

– But this doesn't necessarily result in the best scenario. In the case of Breakout, it is better to hit the blocks near the side walls even though it may take a little longer.

▪ In a nutshell, you have to figure out the action a which maximizes the long-term rewards.

- Let’s imagine the magical “Q” function

▪ First, we define the total rewards as below.

$$R = \sum_{n=0}^{\infty} \gamma^n \, r(s_n, a_n)$$

– $s_n$ and $a_n$ represent the state and action at the n-th step. $\gamma$ is a number slightly smaller than 1 (around 0.9), introduced to prevent the sum from becoming infinite.
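
The discounted sum can be sketched in a few lines of Python. The function name `total_rewards` and the argument name `gamma` (for the factor around 0.9) are assumptions made for this illustration, and the reward list is finite, so this only approximates the infinite sum.

```python
from typing import Sequence


def total_rewards(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Return the sum over n of gamma**n * rewards[n] for a finite reward list."""
    total = 0.0
    for n, r in enumerate(rewards):
        total += (gamma ** n) * r
    return total


# With a constant reward of 1 per step, the sum stays bounded and
# approaches 1 / (1 - gamma) = 10 instead of growing without limit.
print(total_rewards([1.0] * 1000))
```

With `gamma = 1` the same list would sum to 1000 and keep growing with its length, which is the divergence the factor is meant to avoid.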

▪ Now suppose that we have a convenient magical function Q(s, a) as below, although we don’t know how to calculate it at all.

– Q(s, a) = “The total rewards you will receive when you choose the action a at the state s, and keep choosing the optimal actions afterwards.”

▪ Once you have the function Q(s, a), you can choose the optimal action at state s with the following rule.

⇒ Choose the action a which maximizes Q(s, a), that is, the total rewards you receive if you keep choosing the optimal actions afterwards.

- The black magic of “recursive definition”

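▪ Although we are not sure how we could calculate Q(s, a), we can say that it satisfies the following “Q-equation”, where r and s' are the reward and the next state determined by (s, a), and $\gamma$ is the factor around 0.9 used in the total rewards.

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

– See the next slide for the mathematical proof.

- Proof of the Q-equation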

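– Suppose that the chain of states and actions is as below when you start from the state $s_0$ with an action $a_0$ and keep choosing the best actions afterwards. Let $r_n = r(s_n, a_n)$ be the reward received at the n-th step.

$$(s_0, a_0) \to (s_1, a_1) \to (s_2, a_2) \to \cdots$$

– From the definition of Q(s, a), the following equation holds.

$$Q(s_0, a_0) = r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots \qquad (1)$$

– Now suppose that the initial state is $s_1$ instead of $s_0$, and you keep choosing the best actions. The chain will be as below.

$$(s_1, a_1) \to (s_2, a_2) \to (s_3, a_3) \to \cdots$$

– Again, from the definition of Q(s, a), the following equation holds.

$$Q(s_1, a_1) = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \qquad (2)$$

– Rearrange (1) as below, and substitute (2).

$$Q(s_0, a_0) = r_0 + \gamma \left( r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \right) = r_0 + \gamma \, Q(s_1, a_1)$$

– Considering the following relations, it’s equivalent to the Q-equation with $s = s_0$, $a = a_0$ and $r = r_0$.

$$s_1 = s', \qquad Q(s_1, a_1) = \max_{a'} Q(s_1, a') \quad \text{(because } a_1 \text{ is the best action at } s_1\text{)}$$

- Approximate “Q” function using Q-equation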

▪ Prepare some function with adjustable parameters. By adjusting them, you may find a function which satisfies the Q-equation.

– If you succeed, you now have the “Q” function!

– Strictly speaking, the Q-equation is a necessary condition, not a sufficient one. However, under some assumptions, it has been proved to be sufficient.

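▪ Here are the steps to adjust the parameters.

– 1. Let D be the set of all the quartets (s, a, r, s') you have.

– 2. Prepare an initial candidate for the “Q” function as Q(s, a | w).

– 3. Calculate the following error function E(w) using all (or a part) of the data in D. This is the sum of squared differences between the LHS and RHS of the Q-equation, where $\gamma$ is the factor around 0.9 used in the total rewards.

$$E(w) = \sum_{(s, a, r, s') \in D} \left\{ Q(s, a \mid w) - \left( r + \gamma \max_{a'} Q(s', a' \mid w) \right) \right\}^2$$

– 4. Adjust the parameter w so that E(w) becomes smaller. Then, go back to 3.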
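
The four steps can be sketched as follows, under several assumptions that are not in the slides: the game is a made-up two-state toy, the candidate Q(s, a | w) is just a table with one parameter per pair (s, a) rather than a neural network, and step 4 is a plain gradient step that treats the RHS of the Q-equation as a fixed target. The names `D`, `w`, `GAMMA` and `LEARNING_RATE` are chosen for this illustration.

```python
GAMMA = 0.9
ACTIONS = (0, 1)

# Step 1: the data set D of quartets (s, a, r, s') for a made-up two-state game.
D = [
    (0, 0, 0.0, 0),
    (0, 1, 0.0, 1),
    (1, 0, 1.0, 0),
    (1, 1, 0.0, 1),
]

# Step 2: the initial candidate Q(s, a | w); here w is one number per (s, a).
w = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}


def q_equation_residuals(w):
    """LHS minus RHS of the Q-equation for every quartet in D."""
    return [
        w[(s, a)] - (r + GAMMA * max(w[(s_next, b)] for b in ACTIONS))
        for (s, a, r, s_next) in D
    ]


def error(w):
    """Step 3: E(w), the sum of squared residuals."""
    return sum(d * d for d in q_equation_residuals(w))


LEARNING_RATE = 0.25
for _ in range(500):
    residuals = q_equation_residuals(w)
    # Step 4: gradient step on E(w), treating the RHS as a fixed target (assumption).
    for (s, a, _r, _s_next), d in zip(D, residuals):
        w[(s, a)] -= LEARNING_RATE * 2.0 * d

print(error(w))
print({s: max(ACTIONS, key=lambda a: w[(s, a)]) for s in (0, 1)})
```

In this toy game the learned table prefers moving from state 0 to state 1 even though that move gives no immediate reward, because the reward of 1 is only available from state 1.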

▪ After repeating steps 3 and 4, if E(w) becomes small enough, you have an approximate version of the “Q” function.

– It is better to use a more complicated candidate Q(s, a | w), so that you get a better approximation.

- What is “the more complicated function”?

▪ Yes! The Deep Neural Network!

▪ “Deep Q-Network” is essentially a multi-layer neural network used as the candidate for the Q-function.

- By the way, what is a Neural Network?

▪ Roughly speaking, it’s just a combination of multiple simple functions resulting in a highly complex function.

- How do you collect the quartets?

▪ How do you collect the quartets (s, a, r, s') in the real world?

– Basically, just keep playing the game with random actions.

– In theory, if you keep playing for an infinite time, you would encounter all the possible states.
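
Random play can be sketched as below. The game is a hypothetical one invented for this illustration (positions 0 to 3 on a line, with a score when the player walks off the right end), and the names `step` and `collect_random` are assumptions, not part of the slides.

```python
import random


def step(state: int, action: int):
    """Hypothetical toy game: action 0 moves left, action 1 moves right.

    Reaching position 4 gives reward 1 and restarts the game at position 0.
    Returns the pair (reward, next_state).
    """
    position = max(0, state - 1) if action == 0 else state + 1
    if position == 4:
        return 1.0, 0
    return 0.0, position


def collect_random(num_steps: int, seed: int = 0):
    """Play with uniformly random actions and record every quartet (s, a, r, s')."""
    rng = random.Random(seed)
    data = []
    state = 0
    for _ in range(num_steps):
        action = rng.choice((0, 1))
        reward, next_state = step(state, action)
        data.append((state, action, reward, next_state))
        state = next_state
    return data


data = collect_random(1000)
visited_pairs = {(s, a) for (s, a, _r, _s_next) in data}
print(len(data), len(visited_pairs))
```

Counting the distinct pairs in `visited_pairs` gives a rough idea of how much of the game random play has covered so far.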

▪ But in reality, the probability of reaching some states with random actions is quite small. To compensate for this, you can take the following strategy.

– Once you have collected some amount of data, train the Q-function using this data.

– After that, play the game by mixing random actions and the (presumably) best actions calculated from the current Q-function.

– When you have collected some additional data, train the Q-function again.

– Through this cycle, you can make the Q-function better, and collect more data, including states which are unreachable with only random actions.
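
A rough sketch of this collect-and-train cycle follows. It reuses the same hypothetical toy game (positions 0 to 3), stores the Q-function as a plain table instead of a neural network, and mixes random and best actions with a fixed probability `epsilon`. All names and numbers here are assumptions chosen for illustration.

```python
import random

GAMMA = 0.9
ACTIONS = (0, 1)
STATES = range(4)


def step(state, action):
    """Hypothetical toy game: positions 0..3, reaching 4 scores 1 and restarts."""
    position = max(0, state - 1) if action == 0 else state + 1
    return (1.0, 0) if position == 4 else (0.0, position)


def choose_action(q, state, epsilon, rng):
    """With probability epsilon act randomly, otherwise take the current best action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(q, data, sweeps=200, lr=0.5):
    """Move Q(s, a) toward r + GAMMA * max Q(s', a') on the collected quartets."""
    for _ in range(sweeps):
        for (s, a, r, s_next) in data:
            target = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += lr * (target - q[(s, a)])


def run(cycles=5, steps_per_cycle=200, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    data, state = [], 0
    for cycle in range(cycles):
        # The first cycle is purely random; later cycles mix in the learned actions.
        eps = 1.0 if cycle == 0 else epsilon
        for _ in range(steps_per_cycle):
            action = choose_action(q, state, eps, rng)
            reward, next_state = step(state, action)
            data.append((state, action, reward, next_state))
            state = next_state
        train(q, data)
    return q, data


q, data = run()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
print(policy)
```

Keeping `epsilon` above zero in the later cycles is what lets the data continue to include actions the current Q-function would not choose by itself.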

▪ Why not play using only the Q-function, without random actions?

– It doesn’t work. By collecting all kinds of states, even with random actions, the model can learn “how to gain long-term rewards through some non-rewarding states.”

- Thank you!