This page reproduces the content of http://www.slideshare.net/SessionsEvents/jeff-johnson-research-engineer-facebook-at-mlconf-nyc


By MLconf. Uploaded 2015/03/27 in Technology.


Hacking GPUs for Deep Learning: GPUs have revolutionized machine learning in recent years, and have made both massive and deep multi-layer neural networks feasible. However, misunderstandings on why they seem to be winning persist. Many of deep learning’s workloads are in fact “too small” for GPUs, and require significantly different approaches to take full advantage of their power. There are many differences between traditional high-performance computing workloads, long the domain of GPUs, and those used in deep learning. This talk will cover these issues by looking into various quirks of GPUs, how they are exploited (or not) in current model architectures, and how Facebook AI Research is approaching deep learning programming through our recent work.

- Hacking GPUs for Deep Learning

MLConf New York

Jeff Johnson

Facebook AI Research

jhj@fb.com

- Deep (convolutional) Neural Networks

Revolution in machine learning

Convolution: since 1980s. Deep: flops since 2000s

Avoid feature engineering

▪ With enough data, let network discover feature representations

▪ Can work even for NLP. No word segmentation, use raw character data.

- 2D Convolutional Nets (images)

LeCun, Bottou, Bengio and Haffner, 1998

Krizhevsky, Sutskever and Hinton, 2012

- 2D Convolutional Nets

Network architecture => ImageNet 1000-class top-5 error:

  AlexNet            ~15%
  OverFeat           ~13%
  ZeilerNet          ~11%
  Oxford-VGG         ~7%
  GoogLeNet          ~6%, ~4.5%
  PReLU (MSR)        ~4.9%
  Human performance  3-5%

Progress towards smaller kernels and deeper nets

- 3D Convolutional Nets (videos)

C3D (Tran et al., 2014)

DeepVideo (Karpathy et al., 2014)

- 1D Convolutional Nets (text, sequences)

Zhang and LeCun, 2015

Collobert et al., 2011

- RNNs and LSTMs (text, sequences)

Mikolov, 2014

Graves, Mohamed and Hinton, 2013

- Deep Neural Networks

Supervised learning. Unsupervised ???

Train with back-propagation/SGD variants

Strong scaling is unsolved

▪ Distributed parameter space exploration (e.g., Hogwild!; Niu et al. 2011)

▪ Distributed hyperparameter space exploration (e.g., Bayesian optimization; Snoek et al. 2012)

- Characteristics
- Deep nets are flop eaters

Convolutions are expensive

Pointwise calculations (log/exp, ReLU, */+, ...)

Neighborhood reductions (pooling, convolution)

Scaling network parameters => increased learning capacity; overfitting => more training data (real or synthetic), regularization required

- Deep nets are bandwidth eaters

More parameters = more memory, data to exchange

Barrier to cross-machine parallelism

▪ periodic exchanges, compression, quantization

Increase reuse of memory while local?

▪ interspersed reductions are resistant to fusion of computations

▪ generalized programming language problem

- Deep nets are latency sensitive

Serial dependency of training

fprop => bprop => fprop => ...

Serial dependency of multi-layer networks

layer 1 => layer 2 => layer 3 => ...

Multiple path-dependent networks (RNNs, multi-layer LSTMs)

- Deep nets are also small?

Deeper = smaller feature planes, more of them

input R^m => expand to R^n => non-lin => reduce to R^k

Problems are tiny in HPC terms

4096×4096 FFT, FE/PDE on massive grids, ...

NLP tasks can be sparse

Setup/kernel launch latency on GPU can dominate compute

- The tools
- Vector processors

SIMD: Single Instruction, Multiple Data

Serial processor with ability to operate on more than one piece of data concurrently

Cray-1 (1976)

- Vector processors

Hard to use: instructions only operate on 4, 8, 16, ... pieces of data at a time.

Boundary/alignment effects. Great if your vectors are large, but...

float* a = ...; // is this aligned (a % 16 == 0)?
float* b = ...; // is this aligned (b % 16 == 0)?
for (i = 0; i < 18; ++i) { // how to handle [16, 17]?
  b[i] += a[i]; // SIMD this?!? masking/loop epilogue
}
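One standard answer, as a minimal sketch (assuming a and b are 16-byte aligned and SSE intrinsics are available; add18 is an illustrative name): run a 4-wide vector body as far as it cleanly fits, then handle the leftover elements [16, 17] in a scalar epilogue.

#include <immintrin.h>

// Vector body + scalar epilogue for the 18-element loop above.
void add18(const float* a, float* b) {
  int i = 0;
  for (; i + 4 <= 18; i += 4) {             // 4 floats per step: [0, 16)
    __m128 va = _mm_load_ps(a + i);         // aligned load (a % 16 == 0)
    __m128 vb = _mm_load_ps(b + i);
    _mm_store_ps(b + i, _mm_add_ps(va, vb));
  }
  for (; i < 18; ++i) {                     // scalar epilogue: [16, 18)
    b[i] += a[i];
  }
}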

- “Vector cores”?

SIMD variant: NVIDIA calls it “SIMT”

Lots of simple cores: CM-1 (1983)

Hide latency through many threads + switching: Tera MTA (1995)

“Pixel/vertex shaders” in 2000s: GPUs => GPGPU

- GPU versus CPU

GPUs represent a different form of vector programming (“vector cores”)

▪ 32-wide vector of threads (“warp”; see the warp-sum sketch below)

Sufficiently optimized CPU code can be on par with GPU perf (Tflop range with AVX2/512; exploit multi-level caches, deep pipelines, prefetch, ...)

Vector programming: easier with GPUs than CPUs

Sweetspot is different from GPU codes
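To make the 32-wide warp concrete, a minimal sketch (assuming a CUDA toolkit with warp shuffle; warpSum is an illustrative name): each lane holds one float, and lanes combine values through register shuffles with no shared memory.

// Sum one float per lane across a full 32-lane warp; after the loop,
// lane 0 holds the total.
__device__ float warpSum(float v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_down_sync(0xffffffffu, v, offset);
  return v;
}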

- Parallelization + vectorization

Serial nature of commonly used CPU programming languages sometimes hides opportunities

Auto-vectorizing/parallelizing compilers + DSLs can’t yet compete with expert hand-rolled code

▪ DSLs like Halide (Ragan-Kelley et al. 2013) show promise but need a few more generations

Sprinkling in pragmas (OpenMP) doesn’t cut it

- Who wins

flops: CPU ✔ (vectorize: AVX2/512 gives Tflop range); GPU ✔ (Tesla K40: 2880 fp32 ALU pipelines)

main memory b/w: CPU ✖ (Xeon Phi improves); GPU ✔

latency: CPU ✔ (high clock, reordering; caches are large and work if you obey them); GPU ✖ (threads slow, non-smem caches irrelevant, CPU -> GPU control overhead)

boundary effects, small/irregular sizes: CPU ✔✖ (branches easy, vectorization hard); GPU ✖ (warp divergence, load imbalance)

parallel programming model: CPU ✖ (vectorization hard, perf black box); GPU ✔✖ (CUDA is very different, domain knowledge)

- Tool + problem = solution?
- Dive into 2D Convolutional Nets

Somewhat computationally expensive

O(b × f × f’ × n² × k²)

1st layer AlexNet:

▪ 13.493 Gflop (1 flop here = fp32 multiply-add)

▪ 77.2 Mbyte in, 63.7 Mbyte out (fp32)

▪ Perfect caching + reuse, 175 flop/byte in

▪ No caching + reuse, 0.125 flop/byte in (worked numbers below)
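A quick check of those numbers, as a sketch (shapes assumed from AlexNet layer 1: b=128, f=3, f'=96, 224×224 fp32 input, 55×55 output, k=11; 1 flop = 1 fp32 multiply-add):

#include <stdio.h>

int main(void) {
  double flops    = 128.0 * 3 * 96 * 55 * 55 * 11 * 11; // b*f*f'*n_out^2*k^2
  double in_bytes = 128.0 * 3 * 224 * 224 * 4;          // fp32 input
  printf("%.3f Gflop\n", flops / 1e9);                  // ~13.493
  printf("%.0f flop/byte in (perfect reuse)\n", flops / in_bytes); // ~175
  // With no reuse, every multiply-add loads one input value and one
  // weight from memory (8 bytes of fp32): 1/8 = 0.125 flop/byte.
  return 0;
}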

- The problem

Programmable caches (shared memory, registers, ...) not large enough for perfect reuse

Space of all possible square 2D convolution problems is 5/6-dimensional:

  Parameter                                     Size
  minibatch size (b)                            128
  input feature maps (f)                        3
  output feature maps (f’)                      96
  input feature size (n x n)                    224
  convolution kernel size (k x k)               11
  convolution kernel stride (S x S) (optional)  4

- Converting

Space of all possible matrix multiplications is 3-dimensional (A(N×M) × B(M×P) = C(N×P))

NVIDIA, Intel, others have put lots of effort into optimizing many parts of this space

▪ Rephrase convolution as a matrix multiplication! (see the im2col sketch below)

▪ NVIDIA’s cuDNN
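The usual lowering behind that rephrasing, as a minimal sketch (single image, no padding, row-major layout assumed; not cuDNN's actual code): copy each k×k×f input patch into one column, after which convolving with f' filters is a single (f' × f·k·k) by (f·k·k × n_out²) matrix multiplication.

// im2col: input is [f][n][n], col is [f*k*k][n_out*n_out].
void im2col(const float* in, int f, int n, int k, int stride, float* col) {
  int n_out = (n - k) / stride + 1;
  for (int c = 0; c < f; ++c)
    for (int ky = 0; ky < k; ++ky)
      for (int kx = 0; kx < k; ++kx) {
        int row = (c * k + ky) * k + kx;
        for (int y = 0; y < n_out; ++y)
          for (int x = 0; x < n_out; ++x)
            col[(row * n_out + y) * n_out + x] =
                in[(c * n + y * stride + ky) * n + x * stride + kx];
      }
}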

- But:

Sgemm originally optimized for large problems

13x13 * 3x3 is a small convolution. Unrolled 192 times, it might be enough to feed the GPU

Large convolutions are intractable?

Small feature maps/convolutions = boundary effects, bad for GPUs

- Facebook AI Research work

2D convolution via FFT

Fast convolutional nets with fbfft: A GPU Performance Evaluation (Vasilache, Johnson et al., 2015 ICLR conference track oral)

Convolution => pointwise × in Fourier basis

Choice of basis is wide open! Power-of-two (2^i) sizes give great perf

O(b × f × f’ × n² × k²) => O(b × f × f’ × n² + (b×f + f×f’ + b×f’) × n² log n)

▪ For kernels >= 5x5, faster than cuDNN
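To make “pointwise × in Fourier basis” concrete, a minimal sketch (illustrative shapes and names, not fbfft's actual code): after transforming inputs and filters, each output spectral bin is a complex multiply-accumulate over the input feature maps.

#include <cuComplex.h>

// One thread per spectral bin; accumulate in[c][i] * filt[c][i] over the
// f input feature maps for a single (image, output-map) pair.
__global__ void fourierPointwise(const cuComplex* in,   // [f][bins]
                                 const cuComplex* filt, // [f][bins]
                                 cuComplex* out,        // [bins]
                                 int f, int bins) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= bins) return;
  cuComplex acc = make_cuComplex(0.0f, 0.0f);
  for (int c = 0; c < f; ++c)
    acc = cuCaddf(acc, cuCmulf(in[c * bins + i], filt[c * bins + i]));
  out[i] = acc;
}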

- fbfft

cuFFT optimized for large FFT sizes

fbfft: smaller data, fit in registers, focus on warp

- Data layout

Different problem sizes => different data layout

▪ cudaconv: DHWB (optimal for large b)

▪ deeper layers: HWBD/BHWD (many feature maps)

▪ b=1 faster convergence?

▪ b=128 better compute utilization

Smaller problems: exploit different layout/batching (see the indexing sketch below)

▪ fbcunn 1D convolution
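What a layout change means in code, as a minimal sketch (illustrative index helpers for a b × h × w × d activation tensor; BHWD and DHWB as named above): the innermost dimension decides which consecutive threads touch consecutive addresses, i.e., which loads coalesce.

// BHWD: d innermost, so threads that vary d read contiguously.
__device__ __forceinline__ int idxBHWD(int b, int y, int x, int d,
                                       int H, int W, int D) {
  return ((b * H + y) * W + x) * D + d;
}

// DHWB: b innermost, so threads that vary the batch index read
// contiguously (good for large minibatches, as in cudaconv).
__device__ __forceinline__ int idxDHWB(int b, int y, int x, int d,
                                       int B, int H, int W) {
  return ((d * H + y) * W + x) * B + b;
}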

- Latency hiding: what holds you back?

▪ Compute bound? (math)

▪ Memory b/w bound? (streaming)

▪ Memory latency bound? (sparse)

Almost all “deep learning” algorithms are b/w bound on GPU. Low math intensity!

cuBLAS: Sgemm b/w bound, Dgemm compute bound

- Kernel fusion: CPU vs GPU

Reduces memory b/w pressure

Exploits cache locality and register reuse

CPU: fusion not necessary

Kernel tiling + interleaving works due to caches

GPU: fusion necessary

Tiling + interleaving doesn’t work: smem not persistent, caches too small/irrelevant (see the fused-kernel sketch below)
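A minimal sketch of why fusion pays on the GPU (illustrative kernel, not from any library): bias add and ReLU fused into one kernel read and write the tensor once, where two separate kernels would round-trip the intermediate through global memory.

// Fused bias + ReLU over n contiguous fp32 values with `features`
// channels innermost: one global-memory read and one write per element.
__global__ void biasReluFused(float* x, const float* bias,
                              int n, int features) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i] + bias[i % features];
    x[i] = v > 0.0f ? v : 0.0f;
  }
}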

- Kernel fusion

CUDA kernel = hard optimization boundary on GPU

Loop interchange, lifting, better fusion on CPU

CUDA: parallelization layer not visible to optimizer. Auto-tuning desired. HW-specific non-linear tradeoffs

Scripting languages are a further barrier to fusion on both CPU and GPU (Torch)

- Kernel fusion

Torch: transposition is a common operation

▪ size (80, 40) stride (40, 1) => size (40, 80) stride (1, 40)

▪ Old approach: transpose in memory, perform work, copy back

▪ New approach: rewrite kernel to handle transpositions; optimize if non-transposed (see the sketch below)
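What “rewrite the kernel to handle transpositions” can look like, as a minimal sketch (illustrative pointwise op, not Torch's actual kernel): the kernel walks the view's sizes and strides directly, so the transposed size (40, 80) stride (1, 40) view above needs no copy; a non-transposed path with unit column stride could still be special-cased for coalescing.

// Scale a 2-D strided view in place; works for transposed views too.
__global__ void scaleStrided(float* data, int rows, int cols,
                             int rowStride, int colStride, float alpha) {
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  if (r < rows && c < cols)
    data[r * rowStride + c * colStride] *= alpha;
}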

Runtime fusion (CUDA 7.0, Theano)

- Exploiting parallelism
- end