On the benchmark of Chainer
July 2, 2016, Chainer Meetup #3 @ Dwango Seminar Room
Preferred Networks Inc. Kenta Oono firstname.lastname@example.org
Self Introduction
• Kenta Oono (twitter: @delta2323_)
  – Bio: MSc @ MathSci, Univ. of Tokyo → 2012.4 PFI → 2014.10 PFN
  – Role: Bio/Healthcare project, Chainer dev. team, etc.
  – blog: http://delta2323.github.io
• Recent activity
  – Study meetups (NIPS2014, ICML2015, NIPS2015)
  – Several articles and talks on Deep Learning
  – July 21: ICML2016 reading group @ Dwango Seminar Room
What is a Benchmark?
• Metrics that evaluate the performance of frameworks
  – elapsed time, memory consumption, ease of use, etc.
• Related to, but different from, profiling
  – Profiling needs finer-grained information about the framework, possibly at the cost of performance
  – Benchmarking measures the overall behavior of the framework
• For framework developers:
  – provides suggestions for further enhancement of the framework
  – provides an objective comparison with other frameworks
• For framework users:
  – helps them choose the framework that best satisfies their needs
Example: convnet-benchmarks
• Author: Soumith Chintala (Facebook AI Research)
• Measures latencies of convolutional neural networks
• Provides an objective comparison across various frameworks
• Metric
  – Elapsed time of forward and backward propagation
• Architectures
  – AlexNet-OWT / Overfeat / VGG-A / GoogLeNet
  – Single 2D convolution layers of various sizes
• Frameworks
  – Torch, neon, TensorFlow, fbfft (Torch), Chainer, cuda-convnet2, Caffe, CL-nn, Caffe-CL GreenTea, etc.
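As a concrete illustration of this metric (a minimal sketch, not convnet-benchmarks' actual harness), timing forward plus backward propagation of a single convolution layer in Chainer could look like the following; the layer sizes and batch shape are assumptions:

  import cupy
  import chainer
  import chainer.links as L

  # Assumed setup: one 2D convolution layer, AlexNet-like sizes
  conv = L.Convolution2D(3, 96, ksize=11, stride=4)
  conv.to_gpu()
  x = chainer.Variable(cupy.random.randn(128, 3, 224, 224).astype(cupy.float32))

  start = cupy.cuda.Event()
  end = cupy.cuda.Event()
  start.record()
  y = conv(x)                        # forward propagation
  y.grad = cupy.ones_like(y.data)
  y.backward()                       # backward propagation
  end.record()
  end.synchronize()                  # wait until all recorded work finishes
  print(cupy.cuda.get_elapsed_time(start, end), 'ms')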
Basics of measurement of kernel execution
• We cannot measure GPU execution time with CPU clocks, because the launch of kernels is asynchronous!

  clock_t start, end;
  start = clock();
  // launch kernel
  end = clock();
  elapsed_time = end - start;

[Diagram: CPU/GPU timeline — the CPU calls clock, kicks the kernel, and calls clock again; the kernel still executes on the GPU after the second clock call.]
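The same pitfall appears in Python. A minimal CuPy sketch (the matrix multiplication standing in for an arbitrary kernel is an assumption) shows the wall-clock timer stopping before the kernel has finished:

  import time
  import cupy

  a = cupy.random.randn(4096, 4096).astype(cupy.float32)

  start = time.time()
  b = cupy.dot(a, a)                  # kernel is only enqueued here (asynchronous)
  naive = time.time() - start

  cupy.cuda.Device().synchronize()    # wait for the kernel to actually finish
  total = time.time() - start
  print(naive, total)                 # naive is much smaller than total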
Basics of measurement of kernel execution
• We can measure the kernel execution time by inserting two events at the start and end of the launch.

  float elapsed = 0;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, 0);
  // launch the kernel
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);

[Diagram: the CPU records the start Event, kicks the kernel, records the stop Event, and synchronizes; the GPU executes the kernel between the two Events.]
Measurement of single Chainer Function execution

  start = cupy.cuda.Event()
  end = cupy.cuda.Event()
  start.record()
  y = F.f(x)  # forward prop
  end.record()
  end.synchronize()
  cupy.cuda.get_elapsed_time(start, end)

• Suppose the GPU implementation of F.f consists of a Python part and a single GPU kernel.
• The elapsed time corresponds to the kernel execution time in this case.

[Diagram: the CPU records the start Event, runs the Python part of F.f, kicks the kernel, records the end Event, and synchronizes; the GPU executes the kernel between the two Events.]
Measurement of single Chainer Function execution (same code as the previous slide)
• Suppose instead that
  – no other kernels are waiting in the kernel queue,
  – the Python overhead is large, and
  – the kernel is light.
• get_elapsed_time then equals the whole execution time, including the Python code.

[Diagram: the kernel finishes before the end Event is recorded, so the two Events bracket the Python execution.]
Measurement of single Chainer Function execution (same code as the previous slide)
• In general, the elapsed time between the two events differs from what we measured in the two previous situations.
• What we really measure depends on
  – the status of the waiting queue, and
  – the amounts of Python code and kernel work.

[Diagram: the start Event waits behind an earlier kernel in the queue, so the measured interval mixes queueing, Python, and kernel time.]
Synchronization before the start Event

  start = cupy.cuda.Event()
  end = cupy.cuda.Event()
  start.record()
  start.synchronize()
  y = F.f(x)  # forward prop
  end.record()
  end.synchronize()
  cupy.cuda.get_elapsed_time(start, end)

• This ensures that the start Event is recorded right before the execution of the Python code.
• But the timing of the end Event is still undetermined.

[Diagram: synchronizing on the start Event drains the queue before F.f runs; the end Event may still be delayed behind the kernel.]
Measurement of multi-layered NNs
• Should we insert synchronization points before all function executions?
• But doing so exposes Python code that would have been hidden by kernel execution if it were not for the synchronization.
• I guess this is the reason why convnet-benchmarks offers architectures that consist of a single convolution layer.

[Diagram: two stacked function executions; the second function's Python code would be hidden by the first kernel's execution if we did not measure elapsed times.]
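A minimal sketch of this per-layer approach with synchronization before every layer (the layer list, input, and helper name are assumptions for illustration):

  import cupy

  def measure_layers(layers, x):
      # Time each layer separately, synchronizing before every start Event.
      # Note: this synchronization exposes Python overhead that would
      # otherwise overlap with the previous layer's kernel execution.
      laps = []
      for layer in layers:
          start = cupy.cuda.Event()
          end = cupy.cuda.Event()
          start.record()
          start.synchronize()   # drain the queue before this layer runs
          x = layer(x)
          end.record()
          end.synchronize()
          laps.append(cupy.cuda.get_elapsed_time(start, end))  # milliseconds
      return x, laps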
Tentative solution (Timer class: PR #1249)
• Offers start and stop methods for measuring lap times.
• Three patterns for synchronization before measurement, selected by the blocking_method argument:
  – block_every_time: synchronizes at every start event
  – block_first_time: synchronizes only at the first start event
  – non_block: does not synchronize at the start of measurement
• When we get the total time, the Timer class implicitly calls the synchronize method.
• The synchronize method synchronizes all Events inserted by start and stop and calculates lap times lazily.
• Once synchronize is invoked, the timer CANNOT accumulate lap times until it is reset.
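Based on the description above, usage might look like the following sketch; the method names (including total_time) are assumptions drawn from the slide's wording, not necessarily the API merged in PR #1249:

  # Hypothetical usage of the proposed Timer class
  timer = Timer(blocking_method='block_every_time')

  for i in range(10):
      timer.start()           # records a start Event (synchronized in this mode)
      y = F.f(x)              # forward prop to be measured
      timer.stop()            # records a stop Event

  total = timer.total_time()  # implicitly synchronizes and sums lap times lazily
  timer.reset()               # required before accumulating new laps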
DeepMark, authored by Soumith Chintala
• Comparison with convnet-benchmarks
  – Not only image recognition but also various use cases
  – Relatively newer architectures are employed
  – Multi-GPU evaluation will be supported (planned)
• Many details of the specification are under discussion.
• Architectures (planned)
  – Images: InceptionV3-batchnorm / AlexNet-OWT / VGG-D / ResNet-50
  – Video: C3D
  – Audio: DeepSpeech2 / MSR's 5-layer FC
  – Text: Small RNN LSTM / Large RNN LSTM
• Chainer support (delta2323/chainer-deepmark)
  – Not all features are supported (see issues for details)
Conclusion
• Measurement of the elapsed time of multi-layered NNs has many things to be considered.
• We will participate in DeepMark, a general-purpose deep learning benchmark.
• Many criteria are to be measured:
  – Elapsed time ← Today's topic
  – Memory consumption
  – etc.

We are hiring!