This page reproduces the contents of http://www.slideshare.net/naotoyoshida9887/pfn-spring-internship-final-report-autonomous-drive-by-deep-rl.


Uploaded 2016/05/27 in Technology


This is the final report for the spring internship 2016 at Preferred Networks. gym_torcs is released in my GitHub account: https://github.com/ugo-nama-kun/gym_torcs


- Driving in TORCS with Deep Deterministic Policy Gradient

Final Report

Naoto Yoshida

- About Me

● Ph.D. student from Tohoku University
● My hobbies:
  ○ Reading books
  ○ TBA

● NEWS:
  ○ My conference paper on the reward function was accepted!
    ■ SCIS&ISIS 2016 @ Hokkaido

- Outline

● TORCS and Deep Reinforcement Learning
● DDPG: An Overview
● In Toy Domains
● In TORCS Domain
● Conclusion / Impressions

- TORCS and Deep Reinforcement Learning

TORCS: The Open Racing Car Simulator

● Open source
● Realistic (?) dynamics simulation of the car environment

- Deep Reinforcement Learning

● Reinforcement Learning + Deep Learning
  ○ From pixels to actions
    ■ General game play in the ATARI domain
    ■ Car driver
    ■ (Go expert)
● Deep reinforcement learning in a continuous action domain: DDPG
  ○ Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." ICLR 2016

Vision-based car agent in TORCS:
steering + accel/brake = 2-dim continuous actions

- DDPG: An Overview
- Reinforcement Learning

The agent sends an action a to the environment; the environment returns a state s and a reward r.

GOAL: maximization of the expected return

- Reinforcement Learning

In practice there is an interface: the agent's raw output u is converted to the action a, and the environment's raw input x is converted to the state s, together with the reward r.

GOAL: maximization of the expected return

- Deterministic Policy Gradient

Silver, David, et al. "Deterministic policy gradient algorithms." ICML 2014.
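The slide's own formulas are images in this export; in standard notation, the objective, critic loss, and actor update it refers to are the following (reconstructed from the DPG/DDPG papers; the symbols here are ours, with Q', μ' denoting the target networks used in DDPG):

```latex
% Objective: value of the deterministic policy \mu_\theta
J(\theta) = \mathbb{E}_{s}\!\left[ Q^{\mu}(s, \mu_\theta(s)) \right]

% Policy evaluation (critic loss): squared Bellman error
L(w) = \mathbb{E}\!\left[ \big( r + \gamma\, Q'_{w'}(s', \mu'_{\theta'}(s')) - Q_w(s, a) \big)^2 \right]

% Policy improvement (actor update): deterministic policy gradient
\nabla_\theta J \approx \mathbb{E}\!\left[ \nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}\; \nabla_\theta \mu_\theta(s) \right]
```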

● Formal objective function: maximization of the true action value of the policy
● Policy evaluation: approximate the objective via the Bellman equation w.r.t. the deterministic policy → the loss for the critic
● Policy improvement: improve the objective by following the update direction of the actor

- Deep Deterministic Policy Gradient

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." ICLR 2016
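One iteration of the loop sketched on this slide can be written out with linear stand-ins for the actor and critic (all names, sizes, and the linear parametrization are illustrative, not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(0)
s_dim, a_dim = 3, 1
gamma, lr, tau = 0.99, 1e-2, 1e-3   # discount, learning rate, soft-update rate

# Linear stand-ins for the actor mu(s) and the critic Q(s, a).
actor_w  = rng.normal(size=(a_dim, s_dim))
critic_w = rng.normal(size=(s_dim + a_dim,))
actor_t, critic_t = actor_w.copy(), critic_w.copy()   # target networks

def mu(w, s):   return w @ s                          # deterministic policy
def q(w, s, a): return w @ np.concatenate([s, a])     # action value

# A single (s, a, r, s') transition standing in for a replay minibatch.
s, a = rng.normal(size=s_dim), rng.normal(size=a_dim)
r, s2 = 1.0, rng.normal(size=s_dim)

# 1) Critic update: regress Q(s, a) toward the TD target built from target nets.
y = r + gamma * q(critic_t, s2, mu(actor_t, s2))
td_err = q(critic_w, s, a) - y
critic_w = critic_w - lr * td_err * np.concatenate([s, a])  # grad of 0.5*td_err**2

# 2) Actor update: ascend dQ/da * dmu/dtheta (the deterministic policy gradient).
dq_da = critic_w[s_dim:]                 # for a linear critic, dQ/da is constant
actor_w = actor_w + lr * np.outer(dq_da, s)

# 3) Soft update of the target networks.
actor_t  = tau * actor_w  + (1 - tau) * actor_t
critic_t = tau * critic_w + (1 - tau) * critic_t
```

In the real algorithm the same three updates run once per environment step, over a minibatch sampled from the replay buffer rather than a single transition.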

The DDPG training loop: after initialization, the RL agent (DDPG) repeatedly
1. samples an interaction with TORCS (sends action a, receives state s and reward r),
2. updates the critic from a replay minibatch,
3. updates the actor from a replay minibatch,
4. updates the target networks.

- Deep Architecture of DDPG

Three-step observation
Simultaneous training of two deep convolutional networks

- SDE

Exploration: Ornstein–Uhlenbeck process

● Gaussian noise with moments
  ○ θ, σ: parameters
  ○ dt: time difference
  ○ μ: mean (= 0)
● Stochastic differential equation: dx_t = θ(μ − x_t) dt + σ dW_t, where W_t is a Wiener process
● The exact solution for a discrete time step is Gaussian

- OU process: Example

θ = 0.15, σ = 0.2, μ = 0
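The example above can be reproduced with a short Euler–Maruyama simulation (a discretization sketch; the slide uses the exact discrete-time solution, but for dt = 1 the behavior is the same in spirit):

```python
import numpy as np

def ou_path(theta=0.15, sigma=0.2, mu=0.0, dt=1.0, n_steps=1000, seed=0):
    """Simulate an Ornstein-Uhlenbeck process with the Euler-Maruyama scheme:
    x[t+1] = x[t] + theta * (mu - x[t]) * dt + sigma * sqrt(dt) * N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        x[t] = (x[t - 1]
                + theta * (mu - x[t - 1]) * dt
                + sigma * np.sqrt(dt) * rng.standard_normal())
    return x

noise = ou_path()  # temporally correlated, mean-reverting noise
```

In DDPG this noise is added to the actor's output at every step; its temporal correlation gives smoother exploration than i.i.d. Gaussian samples.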

(Plot: sample paths of the OU process vs. i.i.d. Gaussian noise)

- DDPG in Toy Domains

https://gym.openai.com/

- Toy Problem: Pendulum Swingup

● Classical RL benchmark task
  ○ Nonlinear control
  ○ Action: torque
  ○ State:
  ○ Reward:

From "Reinforcement Learning in Continuous Time and Space", Kenji Doya, 2000

- Results

(Plot: score vs. number of episodes)

- Results: SVG(0)

(Plot: score vs. number of episodes)

- Toy Problem 2: Cart-pole Balancing

● Another classical benchmark task
  ○ Action: horizontal force
  ○ State:
  ○ Reward (other definitions are possible):
    ■ +1 (while the angle stays inside the valid area)
    ■ 0 (episode terminal)

- Results: non-convergent behavior :(

(Plot: score vs. total steps, with the successful-score level marked)

The rllab implementation worked well: https://rllab.readthedocs.io

- DDPG in TORCS Domain

Note:
Red line: parameters confirmed by the DDPG authors / taken from the DDPG paper
Blue line: estimated / hand-tuned parameters

- Track: Michigan Speedway

● Used in the DDPG paper
● This track actually exists! www.mispeedway.com

- TORCS: Low-dimensional observation

● TORCS supports low-dimensional sensor outputs for AI agents
  ○ "Track" sensor (a laser range sensor)
  ○ "Opponent" sensor
  ○ Speed sensor
  ○ Fuel, damage, wheel spin speed, etc.
● Track + speed sensors as the observation
● Network: shallow network

● Action: steering (1 dim)
● Reward:
  ○ If the car crashes or runs off the course, it gets a penalty of −1
  ○ otherwise

(Diagram: the angle between the car axis and the track axis)

- Result: Reasonable behavior
- TORCS: Vision inputs

●

Two deep convolutional neural networks

○

Convolution:

■

1st layer:

32 filters, 5x5 kernel, stride 2, paddling 2

■

2nd, 3rd layer:

32 filters, 3x3 kernel, stride 2, paddling 1

■

Full-connection: 200 hidden nodes - VTORCS-RL-color

● Visual TORCS
  ○ TORCS for vision-based AI agents
    ■ The original TORCS does not have a vision API!
    ■ vtorcs:
      ● Koutník et al., "Evolving deep unsupervised convolutional networks for vision-based reinforcement learning." ACM, 2014.
  ○ Monochrome images from the TORCS server
    ■ Modified for color vision → vtorcs-RL-color
  ○ Restart bug
    ■ Solved with the help of my mentors' substantial suggestions!

- Result: Still not a good result...
- What was the cause of the failure?

● DDPG implementation?
  ○ It worked correctly, at least in the toy domains.
    ■ Approximation of the value functions → OK
      ● However, policy improvement failed in the end.
    ■ The default exploration strategy is problematic in the TORCS environment
      ● This setting may be intended for general tasks
      ● Higher-order exploration in POMDPs is required
● TORCS environment?
  ○ Still several unknown environment parameters
    ■ Reward → OK (checked by the DDPG author)
    ■ Episode terminal condition → still various possibilities (from the DDPG paper)

- gym-torcs
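gym_torcs exposes TORCS through the familiar Gym-style reset/step loop. A self-contained sketch of that interface shape (DummyTorcs and its dynamics are illustrative placeholders, not the real TORCS bindings):

```python
import numpy as np

class DummyTorcs:
    """Stand-in environment with a Gym-like reset/step API (illustrative only)."""
    def __init__(self, obs_dim=29, act_dim=1, horizon=100):
        self.obs_dim, self.act_dim, self.horizon = obs_dim, act_dim, horizon
    def reset(self):
        self.t = 0
        return np.zeros(self.obs_dim)
    def step(self, action):
        self.t += 1
        obs = np.random.default_rng(self.t).normal(size=self.obs_dim)
        reward = float(-np.abs(action).sum())   # placeholder reward
        done = self.t >= self.horizon
        return obs, reward, done, {}

env = DummyTorcs()
obs, total, done = env.reset(), 0.0, False
while not done:
    action = np.zeros(env.act_dim)              # a real agent would act on obs
    obs, reward, done, info = env.step(action)
    total += reward
```

With the real package the loop is the same, with the env constructed from gym_torcs (a TorcsEnv class); the exact constructor arguments should be checked against the repository README.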

・A TORCS environment with an OpenAI Gym-like interface

- Impressions

● On DDPG
  ○ Learning continuous control is a tough problem :(
    ■ Difficulty of the policy update in DDPG
    ■ The DDPG author recommended the async method "twice" (^ ^;)
● Throughout this PFN internship:
  ○ Weakness: coding
    ■ Thank you, Fujita-san and Kusumoto-san!
    ■ I learned about many weaknesses of my coding style
  ○ Advantage: reinforcement learning theory
    ■ ...and its branched algorithms, related topics, and the relationship between RL and inference
    ■ For deep RL, Fujita-san is an authority in Japan :)

- Update after the PFI seminar
- Cart-pole Balancing

● DDPG could learn a successful policy
  ○ Still unstable after several successful trials

- Success in the Half-Cheetah Experiment

● We could run a successful experiment with hyperparameters identical to those used for cart-pole.

(Plot: 300-step total reward vs. episode)

- Keys in DDPG / deep RL

● Normalization of the environment
  ○ Preprocessing is known to be very important for deep learning.
    ■ This is also true in deep RL.
    ■ Scaling the inputs (and possibly the actions and rewards) will help the agent learn.
● Possible normalizations:
  ○ Simple normalization helps: x_norm = (x - mean_x) / std_x
  ○ The mean and standard deviation are obtained during the initial exploration.
  ○ Other normalizations such as ZCA/PCA whitening may also help.

● The epsilon parameter in Adam / RMSprop can be a large value
  ○ 0.1, 0.01, 0.001... We still need hand-tuning / grid search...
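The simple normalization recipe above can be sketched as follows (names are illustrative); the statistics come from an initial exploration phase and are then frozen:

```python
import numpy as np

class ObservationNormalizer:
    """Normalize observations with mean/std collected during initial exploration."""
    def __init__(self):
        self.buffer = []
        self.mean, self.std = 0.0, 1.0
    def collect(self, obs):
        self.buffer.append(np.asarray(obs, dtype=float))
    def freeze(self):
        data = np.stack(self.buffer)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + 1e-8   # avoid division by zero
    def __call__(self, obs):
        return (np.asarray(obs, dtype=float) - self.mean) / self.std

# Collect statistics during (random) initial exploration, then freeze them.
norm = ObservationNormalizer()
rng = np.random.default_rng(0)
for _ in range(1000):
    norm.collect(5.0 + 2.0 * rng.standard_normal(4))   # fake raw observations
norm.freeze()

x = norm(5.0 + 2.0 * rng.standard_normal(4))  # roughly zero-mean, unit-scale now
```

The same (x - mean) / std transform can be applied to actions and rewards; for the optimizer-epsilon point, the value is simply passed to the optimizer constructor and grid-searched.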