Preparing for TSUBAME3.0 and Towards Exascale in Green Computing and Convergence with Big Data
Satoshi Matsuoka, Professor, Global Scientific Information and Computing (GSIC) Center, Tokyo Institute of Technology; Fellow, Association for Computing Machinery (ACM)
ICSC 2014, Shanghai, China, 2014-05-07
Performance Comparison of CPU vs. GPU
[Figure: bar charts of Peak Performance [GFLOPS] (axis 0 to 1750) and Memory Bandwidth [GByte/s] (axis 0 to 200), CPU vs. GPU]
x5-6 socket-to-socket advantage in both compute and memory bandwidth, at the same power (200W GPU vs. 200W CPU+memory+NW+…)
NEC Confidential
TSUBAME2.0
TSUBAME2.0 Nov. 1, 2010 “The Greenest Production Supercomputer in the World” TSUBAME 2.0 New Development 32nm 40nm >400GB/s Mem BW 80Gbps NW BW ~1KW max >1.6TB/s Mem BW >12TB/s Mem BW 35KW Max >600TB/s Mem BW 220Tbps NW Bisection BW 1.4MW Max
TSUBAME2.0 Compute Node
Total Perf 2.4PFlops; Mem: ~100TB; SSD: ~200TB
Thin Node: HP SL390G7 (Developed for TSUBAME 2.0), productized as HP ProLiant SL390 G7
• Infiniband QDR x2 (80Gbps)
• GPU: NVIDIA Fermi M2050 x 3, 515GFlops, 3GByte memory / GPU
• CPU: Intel Westmere-EP 2.93GHz x2 (12 cores/node)
• Multi I/O chips, 72 PCI-e (16 x 4 + 4 x 2) lanes --- 3 GPUs + 2 IB QDR
• Memory: 54, 96 GB DDR3-1333
• SSD: 60GBx2, 120GBx2
Per node: 1.6 Tflops, 400GB/s Mem BW, 80Gbps NW, ~1KW max
2010: TSUBAME2.0 as No.1 in Japan > All Other Japanese Centers on the Top500 COMBINED 2.3 PetaFlops Total 2.4 Petaflops #4 Top500, Nov. 2010
TSUBAME2.0 Very Green… “Greenest Production Supercomputer in the World” the Green 500 (#3 overall) Nov. 2010, June 2011 (#4 Top500 Nov. 2010) << 3 times more power efficient than a laptop!
TSUBAME Wins Awards… ACM Gordon Bell Prize 2011 2.0 Petaflops Dendrite Simulation Special Achievements in Scalability and Time-to-Solution “Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer”
TSUBAME Three Key Application Areas “Of High National Interest and Societal Benefit to the Japanese Taxpayers”
1. Safety/Disaster & Environment
2. Medical & Pharmaceutical
3. Manufacturing & Materials
Academic + 150 Private Industry Usages
Plus Co-Design for general IT Industry and Ecosystem impact (IDC, Big Data, etc.)
Lattice-Boltzmann-LES with Coherent-structure SGS model [Onodera & Aoki 2013]
Coherent-structure Smagorinsky model: the model parameter is locally determined by the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε).
◎ Turbulent flow around a complex object
◎ Large-scale parallel computation
LBM: DrivAer (BMW-Audi; Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München), 3,000 x 1,500 x 1,500, Re = 1,000,000, 60 km/h
＊ Number of grid points: 3,623,878,656 (3,072 × 1,536 × 768)
＊ Grid resolution: 4.2mm (13m x 6.5m x 3.25m)
＊ Number of GPUs: 288 (96 Nodes)
Drug discovery with Astellas Pharma: specific remedies for tropical diseases such as dengue fever. Accelerate in-silico screening and data mining
Discovery of Dengue-Human Interactome w/GPU Docking
Dengue virus enzymes (protein name: structure, PDB ID):
• Protease: 3U1I
• Methyltransferase: 1R6A
• Polymerase: 3VWS
• Helicase: 2JLR
Human proteins: structures were collected from the public database PDB using the following criteria:
• >25 residues
• X-ray resolution better than 3.25 Å
• No mutation
#Structures (PDB-chains): 30,544
#Proteins (UniProt IDs): 3,353
4 × 30,544 = 122,176 dockings
June 15, 2013
TSUBAME2.0⇒2.5 Thin Node Upgrade
Thin Node: HP SL390G7 (Developed for TSUBAME 2.0, Modified for 2.5), productized as HP ProLiant SL390s, modified for TSUBAME2.5
• Infiniband QDR x2 (80Gbps)
• GPU: NVIDIA Kepler K20X x 3, 1310GFlops, 6GByte Mem (per GPU)
• CPU: Intel Westmere-EP 2.93GHz x2
• Multi I/O chips, 72 PCI-e (16 x 4 + 4 x 2) lanes --- 3 GPUs + 2 IB QDR
• Memory: 54, 96 GB DDR3-1333
• SSD: 60GBx2, 120GBx2
Per node: Peak Perf. 4.08 Tflops, ~800GB/s Mem BW, 80Gbps NW, ~1KW max
GPU upgrade: NVIDIA Fermi M2050 1039/515 GFlops ⇒ NVIDIA Kepler K20X 3950/1310 GFlops
TSUBAME2.0 => 2.5 Changes
• Doubled~Tripled performance
– 2.4(DFP)/4.8(SFP) Petaflops => 5.76(x2.4)/17.1(x3.6)
– Preliminary results: ~2.7PF Linpack (x2.25), ~3.4PF Dendrite GB app (x1.7)
• Bigger and higher bandwidth GPU memory
– 3GB => 6GB per GPU, 150GB/s => 250GB/s
• Higher reliability
– Resolved minor HW bug; compute node fail-stop occurrences expected to decrease by up to 50%
• Lower Power
– Observing 10~20% drop in power/energy (tentative)
• Better programmability w/new GPU features
– Dynamic tasks, HyperQ, CPU/GPU shared memory
• Prolongs TSUBAME2 lifetime by 1~2 years
– TSUBAME 3.0 2016H1
Phase‐field simulation for Dendritic Solidification [Shimokawabe, Aoki et. al.] Gordon Bell 2011 Winner Weak scaling on TSUBAME (Single precision) Mesh size（1GPU+4 CPU cores）:4096 x 162 x 130 TSUBAME 2.5 3.444 PFlops (3,968 GPUs+15,872 CPU cores) 4,096 x 5,022 x 16,640 TSUBAME 2.0 2.000 PFlops (4,000 GPUs+16,000 CPU cores) 4,096 x 6,480 x 13,000 Developing lightweight strengthening material by controlling microstructure Low‐carbon society • Peta‐Scale phase‐field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials. • 2011 ACM Gordon Bell Prize Special Achievements in Scalability and Time‐to‐Solution
Peta‐scale stencil application : A Large‐scale LES Wind Simulation using Lattice Boltzmann Method [Onodera, Aoki] Weak scalability in single precision (N = 192 x 256 x 256) Number of GPUs Performance [TFlops] ▲ TSUBAME 2.5 (overlap) ● TSUBAME 2.0 (overlap) TSUBAME 2.5 1142 TFlops (3968 GPUs) 288 GFlops / GPU x 1.93 TSUBAME 2.0 149 TFlops (1000 GPUs) 149 GFlops / GPU Large-scale Wind Simulation for a 10km x 10km Area in Metropolitan Tokyo 10,080 x 10,240 x 512 (4,032 GPUs) The above peta‐scale simulations were executed as the TSUBAME Grand Challenge Program, Category A in 2012 fall. • The LES wind simulation for the area 10km × 10km with 1‐m resolution has never been done before in the world. • We achieved 1.14 PFLOPS using 3968 GPUs on the TSUBAME 2.5 supercomputer.
Docking Performance on TSUBAME2.5
Docking calculations for 352 pairs; computation speed ratio [vs. 1 CPU core]:
• 1 CPU core: 1.00
• 12 CPU cores: 8.93
• 12 CPU cores + 3 GPUs (Tesla M2050, TSUBAME 2.0): 37.11
• 12 CPU cores + 3 GPUs (Tesla K20Xm, TSUBAME 2.5): 83.94
2013: TSUBAME2.5 No.1 in Japan* in Single Precision FP, 17 Petaflops (*but not in Linpack) ~= K Computer 11.4 Petaflops SFP/DFP Total 17.1 Petaflops SFP 5.76 Petaflops DFP All University Centers COMBINED 9 Petaflops SFP
Japan’s High Performance Computing Infrastructure Hokkaido U. Tohoku U. U. Tsukuba 1PF HPCI: a nation-wide HPC infrastructure - Supercomputers ~25 PFlops - National Storage 22 PB HDDs + 60PB Tape - Research Network (SINET4), 40+10GBps - SSO (HPCI-ID), Distributed FS (Gfarm), HPCI Allocation Kyoto U. Osaka U. Kyushu U 1.7PF Nagoya U. • Supercomputers 1.1PF • HPCI Storage (12PB) NII • Management of SINET &Single sign-on JAMSTEC Tokyo Tech TSUBAME2.5 5.7 Petaflops U. Tokyo 3 (HPCI) Tokyo Tech. Riken AICS • “K” computer 11Petaflops • HPCI Storage (10PB)
Focused Research Towards Tsubame 3.0 and Beyond towards Exa
• New memory systems – Pushing the envelope of low power vs. capacity, Communication and Synchronization Reducing Algorithms (CSRA)
• Post Petascale Networks – HW, Topology, Routing Algorithms, Placement…
• Green Computing: Ultra Power Efficient HPC
• Scientific “Extreme” Big Data – Ultra Fast I/O, Hadoop Acceleration, Large Graphs
• Fault Tolerance – Group-based Hierarchical Checkpointing, Fault Prediction, Hybrid Algorithms
• Post Petascale Programming – OpenACC and other many-core programming substrates, Task Parallel
• Scalable Algorithms for Many Core – Communication and Synchronization Reducing Algorithms (CSRA)
TSUBAME-KFC by GSIC, Tokyo Institute of Technology NEC, NVIDIA, Green Revolution Cooling, SUPERMICRO
How do we sustain x1000?
FLOPS/W improvement (Top500):
Process Shrink x100 × Many-Core GPU Usage x5 × DVFS & Other LP SW x1.4 × Efficient Cooling x1.4 = x1000 !!!
ULP-HPC Project 2007-12; Ultra Green Supercomputing Project 2011-15
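As a quick check of the arithmetic, the four factors quoted on this slide multiply out to just under the x1000 target:

```latex
100 \times 5 \times 1.4 \times 1.4 = 980 \approx 1000
```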
TSUBAME-KFC (Kepler Fluid Cooling) A TSUBAME3.0 prototype system with advanced cooling for next-gen. supercomputers. 40 compute nodes are oil-submerged 1200 liters of oil (~1 ton) Single Node 5.26 TFLOPS DFP System (40 nodes) 210.61 TFLOPS DFP 630TFlops SFP Peak Performance
Oil
ExxonMobil SpectraSyn Polyalphaolefins (PAO), grades 4 / 6 / 8:
• Kinematic viscosity @40 C: 19 cSt / 31 cSt / 48 cSt
• Specific gravity @15.6 C: 0.820 / 0.827 / 0.833
• Flash point (open cup): 220 C / 246 C / 260 C
• Pour point: -66 C / -57 C / -48 C
The flash point of the oil must be higher than 250 degrees C; otherwise it is a hazardous material under the Fire Defense Law in Japan. Still, the officer at the fire station requested us to follow the safety regulations for hazardous material: sufficient clearance around the oil, etc.
[Photo: Fire Station at Den-en Chofu]
Compute Node
NEC LX 1U-4GPU Server, 104Re-1G
• 2X Intel Xeon E5-2620 v2 Processor (Ivy Bridge EP, 2.1GHz, 6 core)
• 4X NVIDIA Tesla K20X GPU
• 1X Mellanox FDR InfiniBand HCA
Software: CentOS 6.4 64bit; Intel Compiler, GCC, CUDA 5.5; OpenMPI 1.7.2
Modifications:
(1) Replace thermal grease with thermal sheet
(2) Removed twelve cooling fans
(3) Update firmware of power unit to operate with cooling fans stopped
CarnotJet system
[Diagram: oil inlet and oil outlet; node layout with GPU0, GPU1, GPU2, GPU3, CPU0, CPU1]
The cold oil jet entrains the warmer oil around it, which increases the flow.
Optimizations for efficiency
Lower performance leads to higher efficiency
• Tuning of HPL parameters
– Especially block size (NB) and process grid (P & Q)
• Adjusting GPU clock and voltage
– Available GPU clocks (MHz): 614 (best), 640, 666, 705, 732 (default), 758, 784
Advantages of the hardware configuration
• GPU:CPU ratio = 2:1
• Low-power Ivy Bridge CPU (this also lowers the performance)
• Cooling system: no cooling fans, low temperature
Green500 submission
[Figure: scatter plot of Power Efficiency (GFLOPS/Watt) vs. Performance (TFLOPS) for many LINPACK runs; marked points: Fastest Run, Greenest Run; reference lines: #500 in Nov. 2013, Jun. 2014 line?]
Too many LINPACK runs with different parameters…
#1 in Green 500 List (Nov. 2013) • 1st achievement as Japanese supercomputer • World leading Graph500 and HPCG power efficiency • TSUBAME 2.5 is also ranked #6 TSUBAME‐KFC TSUBAME 2.5
PUE (Power Usage Effectiveness) (= Total power / power for computer system)
[Figure: stacked bars of Power (kW, axis 0 to 40), air cooling vs. TSUBAME-KFC; components: compute node, network, air conditioner, oil pump, water pump, cooling tower fan]
• Current PUE = 1.09 (1.001 c.f. on air-cooling)
• Best PUE = 1.29 in air cooling (TSUBAME2.0)
• Power for cooling is basically constant; the water pump in particular draws too much.
• Cooling power: Oil Pump (60%) 0.53 kW; Water Pump 2.40 kW; Cooling Tower Fan 1.40 kW; Total 4.33 kW
• When compute nodes are fully loaded (>42kW), PUE is lower than 1.1.
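The PUE definition above can be sketched in a few lines. This is a simplified illustration, not the measurement method used for TSUBAME-KFC: it assumes the cooling overhead is the constant 4.33 kW from the slide and that everything else counts as computer-system power. The function names and the 48 kW sample load are hypothetical.

```python
# Cooling power figures (kW) as quoted on the slide.
COOLING_KW = {"oil pump": 0.53, "water pump": 2.40, "cooling tower fan": 1.40}


def pue(it_load_kw, cooling_kw=COOLING_KW):
    """PUE = total power / power for the computer system (assumed constant cooling)."""
    overhead = sum(cooling_kw.values())
    return (it_load_kw + overhead) / it_load_kw


def break_even_load_kw(target_pue, cooling_kw=COOLING_KW):
    """IT load at which this simplified model reaches the target PUE."""
    return sum(cooling_kw.values()) / (target_pue - 1.0)


if __name__ == "__main__":
    print(f"PUE at a hypothetical 48 kW load: {pue(48.0):.3f}")
    print(f"Load needed for PUE = 1.1: {break_even_load_kw(1.1):.1f} kW")
```

Because the overhead is roughly constant, PUE falls as the compute load rises, which is the point the slide makes about fully loaded nodes.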
Sensors, sensors, and more sensors…
Columns: Component, Type: provided by; delay; freq.; resolution
• Compute node, Power: Panasonic data logger; 0 sec.; 1 sec.; 0.1 W
• Network, Power: 0 sec.; 1 sec.; 0.1 W
• Cooling tower, Power: 0 sec.; 1 sec.; 0.1 W
• Pump (water), Power: 0 sec.; 1 sec.; 0.1 W
• Outdoor air, Temp.: 10 sec.; 10 sec.; 0.1 deg. C
• Outdoor air, Humi.: 10 sec.; 10 sec.; 0.1 %
• Indoor air, Temp.: 10 sec.; 10 sec.; 0.1 deg. C
• CPU, GPU, Temp.: IPMI on nodes; <1 sec.; <1 sec.; 1 deg. C
• Oil, Temp.: GRC controller box; 20 sec.; 20 sec.; 0.1 deg. C
• Pump, Speed: resolution 2%
• Pump, Power: resolution 0.1 W
• Water, Temp.: resolution 0.1 deg. C
Hardware Maintenance Test
SSD upgrades in March 2014 for TSUBAME3.0 “Big Data” Configuration
(1) Lift up the server: 2 min.
(2) Server maintenance: 18 min.
– Open server chassis
– Remove GPU0 and GPU1
– Install SATA cables
– Restore GPU0 and GPU1
– Close server chassis
– Install SSDs at front side
(3) Return the server: 2 min.
Almost normal operation except for the gloves
“Cost Model” in the works… c.f. Tsubame 2.5
Power Efficiency of GB Dendrite Simulation since 2006
[Figure: Power Efficiency (MFlops/Watt, log scale from 1 to 100,000) vs. Year (2006H1 to 2016H1); systems: TSUBAME1.0 (Opteron CPU, #7 Top500), TSUBAME1.2 (S1070 GPU), TSUBAME2.0 (M2050 GPU, #4 Top500), Test Server (K10 GPU), TSUBAME-KFC (K20X GPU), TSUBAME3 Estimate; plotted values: 13, 842, 1501, 3800, 5550, 15750 (TSUBAME3 estimate)]
x1,210 in 10 years! (x1000 – 10 years trend line; 2017~2018 crossover)
Tokyo Tech. Billion-Way Resiliency Project (2011-2015) (JSPS Grant-in-Aid & JST-ANR)
• Collaboration with ANL (Franck Cappello, FTI), LLNL (Bronis de Supinski, SCR), Hideyuki Jitsumoto (U-Tokyo)…
• More precise system fault model and associated cost model of recovery and optimization
• Aggressive architectural, systems, and algorithmic improvements
– Use of localized flash/NVM for ultra fast checkpoints and recovery
– Advanced coding and clustering algorithms for reliability against multiple failures
– Combining coordinated & uncoordinated checkpoints
– Overlapping transfers in the checkpoint storage hierarchy for quick recovery
– Power optimized checkpoints
• Better monitoring and micro-recovery
Multi-level Asynchronous C/R Model
• Compute checkpoint/restart “Efficiency” for C/R strategy comparison
– Efficiency: fraction of time an application spends only in computation at the optimal checkpoint interval
– Efficiency = ideal runtime / expected runtime
– ideal runtime: no failure and no checkpoint
– expected runtime: computed by the model
• Input: for each level i = 1...N
– Li: Checkpoint latency
– Oi: Checkpoint overhead
– Ri: Restart time
• Output: “Efficiency”
• Model notation [Diagram: state transitions with “No failure” and “Failure” edges, each labeled with a probability and a duration]
– t: interval
– c_k: k-level checkpoint time; r_k: k-level recovery time
– p0(T): no failure for T seconds; t0(T): expected time when p0(T)
– pi(T): i-level failure for T seconds; ti(T): expected time when pi(T)
– Edge labels: p0(t + c_k), t0(t + c_k), pi(t + c_k), ti(t + c_k), p0(r_k), t0(r_k), pi(r_k), ti(r_k)
Source: Sato, K., Maruyama, N., Mohror, K., Moody, A., Gamblin, T., de Supinski, B. R. and Matsuoka, S.: Design and Modeling of a Non-Blocking Checkpointing System (SC12)
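To make the notion of "efficiency at the optimal checkpoint interval" concrete, here is a much simpler stand-in than the multi-level asynchronous model on this slide: a single-level, blocking checkpoint model using Daly's expected-runtime formula with exponentially distributed failures. It is not the SC12 model, and all parameter values below (60 s checkpoint, 120 s restart, one-day MTBF) are hypothetical.

```python
import math


def efficiency(tau, delta, restart, mtbf):
    """Ideal runtime / expected runtime for one compute segment.

    tau:     compute time between checkpoints (s)
    delta:   checkpoint cost (s)
    restart: restart cost (s)
    mtbf:    mean time between failures (s)
    Expected segment time follows Daly's single-level formula (an assumption
    swapped in for the slide's multi-level model).
    """
    expected = mtbf * math.exp(restart / mtbf) * (math.exp((tau + delta) / mtbf) - 1.0)
    return tau / expected


def best_interval(delta, restart, mtbf, candidates):
    """Pick the candidate interval that maximizes efficiency."""
    return max(candidates, key=lambda tau: efficiency(tau, delta, restart, mtbf))


if __name__ == "__main__":
    tau = best_interval(60.0, 120.0, 86400.0, range(60, 20001, 60))
    print(f"best interval ~{tau} s, efficiency {efficiency(tau, 60.0, 120.0, 86400.0):.3f}")
```

Shrinking the MTBF or raising the checkpoint cost in this sketch lowers the achievable efficiency, which is the trade-off the multi-level model quantifies per storage level.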
Efficiency with Increasing Failure Rates and Checkpoint Costs
• Assuming message logging overhead is 0
• The burst buffer system always achieves a higher efficiency ⇒ Stores checkpoints on fewer nodes
• All systems work equally well up to x10 => TSUBAME4.0 (2020) can go exascale
• Uncoordinated checkpointing: 70% efficiency on systems two orders of magnitude larger (if logging overhead is 0) ⇒ Partial restart exploits the bandwidth of both burst buffers and the PFS
Future Scalable Algorithms: Dealing with Deep Memory Hierarchy with Comm. and Synch. Reducing Algorithms (CSRA)
• GPU stencil computation is successful in speed, but not in domain sizes, which are limited by GPU device memory size
• Typical configuration: GPU cores, L2$ 768KB, Dev mem 3GB (150GB/s), Host memory 54GB (8GB/s), Secondary storage
• Using multi-GPUs helps, but host memory size is even larger. On TSUBAME2.0 with 4k GPUs, 1.4k nodes:
– Total GPU memory: 12TB
– Total host memory: 82TB
What if domain sizes exceed GPU memory?
• A naïve method is hand-coded “swapping out”
[Figure: 3D 7-point stencil on an M2050 GPU; regions: within dev mem vs. suffering from “swapping out”]
• 97% performance degradation!!! PCIe is the problem
• “Communication Avoiding” and “Locality Improvement” are keys
Temporal Blocking (TB)
• Performs multiple (s-step) updates on a small block before proceeding to the next block [Kowarschik 04] [Datta 08]
[Figure: s-step updates at once; Step 1 to Step 4 along simulated time]
• With “simple” temporal blocking, redundant computation is introduced due to data dependency with neighbors
• TB has mainly been used for cache optimization (s = 2~8)
• We use temporal blocking to reduce PCIe communication (s = 30~300)
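The idea can be illustrated with a minimal sketch on a 1D 3-point stencil in plain Python. This is not the authors' GPU implementation: each block is loaded once with a halo of `s` cells per side (standing in for one transfer over PCIe), advanced `s` steps locally with redundant computation in the halo, and only the block interior is written back. The function names and the averaging stencil are assumptions made for illustration.

```python
def step(u):
    """One update of a 3-point averaging stencil; the two end cells are held fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def naive(u, steps):
    """Reference: sweep the whole domain once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def temporal_blocked(u, steps, block, s):
    """Advance the domain `steps` steps, s steps at a time per block."""
    n = len(u)
    done = 0
    while done < steps:
        k = min(s, steps - done)
        out = u[:]
        for a in range(0, n, block):
            b = min(a + block, n)
            # Load the block plus a halo of k cells per side (clipped to the domain).
            lo, hi = max(0, a - k), min(n, b + k)
            local = u[lo:hi]
            # k local steps: halo cells go stale one layer per step, the block stays valid.
            for _ in range(k):
                local = step(local)
            out[a:b] = local[a - lo:b - lo]
        u = out
        done += k
    return u


if __name__ == "__main__":
    u0 = [float((i * 7) % 11) for i in range(64)]
    assert temporal_blocked(u0, 10, block=16, s=4) == naive(u0, 10)
    print("temporal blocking matches the naive sweep")
```

The halo width grows with `s`, so a larger `s` means fewer transfers per simulated step but more redundant work, which is the trade-off behind the hand-tuned values of s on the following slide.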
Single GPU Performance 3D 7point stencil on a M2050 GPU ‘S’ is hand‐tuned for each plot Auto‐tuning is important • With optimized TB, 27x larger domain size is successfully used with ~30% overhead!!!
Multiple GPU
• 2D-1D decomposition [Jin, Endo, Matsuoka 13]
– XY decomposition for multiple GPUs
– Z decomposition for blocking
[Figure: weak scalability, Speed (TFlops, axis 0 to 25) vs. # of GPUs (0 to 300); 6.3GB per GPU (> 3GB)]
With 256 GPUs on 256 nodes (T2.0):
• 21TFlops
• 13TB/s
• 1.6TB (> 3GB x 256)
• Within ~120kW
Current status/future plans about TB • Temporal blocking, one of communication avoiding methods, supports larger domain efficiently • Programming cost is the issue – Complex loop structure, complex border handling – Currently only 7point stencil is evaluated – How about the real climate codes, including NICAM? • Plans – To make Physis DSL (RIKEN Maruyama) to output temporal blocked codes • Physis: A forall‐type DSL for stencil computations (SC11) – To integrate HHRT, GPU oversubscription runtime (Endo) • Swapping between GPU memory and host memory
Extreme Big Data (EBD): Next Generation Big Data Infrastructure Technologies Towards Yottabyte/Year (2013H2-2018H1)
Principal Investigator: Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology / JST CREST
Convergence of HPC and Big Data
• The current “Big Data” are not really that Big…
– Typical definition: “Mining people’s privacy data to make money”
• But “Extreme Big Data” will change everything
– “Breaking down of Silos” (Rajeeb Hazra, Intel VP of Technical Computing)
• Already happening in Science & Engineering due to the Open Data movement
• More complex analysis algorithms: O(n log n), O(m x n), …
• Fundamental to next gen IT Infrastructure
– Clouds hosting convergent machines
Future “Extreme Big Data”
- NOT mining Tbytes Silo Data
- Peta~Zettabytes of Data
- Ultra High-BW Data Stream
- Highly Unstructured, Irregular
- Complex correlations between data from multiple sources
- Extreme Capacity, Bandwidth, Compute All Required
We will have tons of unknown genes
Metagenome analysis
• Directly sequencing uncultured microbiomes obtained from the target environment and analyzing the sequence data
– Finding novel genes from unculturable microorganisms
– Elucidating the composition of species/genes of environments
Examples of microbiomes: human body, sea, gut microbiome, soil
[Slide Courtesy Yutaka Akiyama @ Tokyo Tech.]
Results from Akiyama group @ Tokyo Tech
Ultra high-sensitive “big data” metagenome sequence analysis of human oral microbiome
- Required > 1 million node*hour product on K-computer
- World’s most sensitive sequence analysis (based on amino acid similarity matrix)
- Discovered at least three microbiome clusters with functional differences (integrated 422 experiment samples taken from 9 different oral parts)
[Figure: Metabolic Pathway Map; sample sites: inside of the dentition, outside of the dentition, dental plaque]
572.8 M Reads / hour; 82,944 nodes (663,552 cores), K-computer (2012)
Extremely “Big” Graphs
• Large scale graphs in various fields
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
– Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC
Graph500 “Big Data” Benchmark Kronecker graph BSP Problem A: 0.57, B: 0.19 C: 0.19, D: 0.05 November 15, 2010 Graph 500 Takes Aim at a New Kind of HPC Richard Murphy (Sandia NL => Micron) “ I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list.” Reality: Top500 Supercomputers Dominate No Cloud IDCs at all TSUBAME2.0 #3(Nov.2011) #4(Jun.2012)
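The A/B/C/D values on this slide are the quadrant probabilities of the Kronecker (R-MAT style) edge generator. The sketch below shows how one edge is sampled bit by bit from those probabilities. It is a simplified illustration, not the Graph500 reference generator (which also permutes vertex labels and shuffles edges); the edge factor of 16 comes from the Graph500 specification rather than from the slide, and the function name is hypothetical.

```python
import random


def kronecker_edges(scale, edge_factor=16, probs=(0.57, 0.19, 0.19, 0.05), seed=0):
    """Sample edge_factor * 2**scale edges over 2**scale vertices."""
    rng = random.Random(seed)
    a, b, c, _d = probs
    edges = []
    for _ in range(edge_factor * (1 << scale)):
        src = dst = 0
        # Each bit of (src, dst) picks one quadrant of the adjacency matrix.
        for _bit in range(scale):
            r = rng.random()
            if r < a:
                src_bit, dst_bit = 0, 0
            elif r < a + b:
                src_bit, dst_bit = 0, 1
            elif r < a + b + c:
                src_bit, dst_bit = 1, 0
            else:
                src_bit, dst_bit = 1, 1
            src = (src << 1) | src_bit
            dst = (dst << 1) | dst_bit
        edges.append((src, dst))
    return edges


if __name__ == "__main__":
    edges = kronecker_edges(scale=10)
    low_half = sum(1 for s, _ in edges if s < 512)
    print(f"{len(edges)} edges, {low_half / len(edges):.0%} start in the low half of the ID space")
```

Because A dominates, edges concentrate on low vertex IDs, which produces the skewed degree distribution that makes the benchmark hard to parallelize.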
Top Supercomputers vs. Global IDC
• Tianhe2 (#1 2013), Guangzhou, China: 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeon; 18,000 nodes, >3 million CPU cores; 54 Petaflops (10^16); 0.8 Petabyte memory; 20 MW power; ??? racks, ??? m2
• K Computer (#1 2011-12), Riken-AICS: Fujitsu SPARC64 VIIIfx “Venus” CPU; 88,000 nodes, 800,000 CPU cores; ~11 Petaflops (10^16); 1.4 Petabyte memory; 13 MW power; 864 racks, 3000 m2
• IBM BlueGene/Q “Sequoia” (#1 2012), Lawrence Livermore National Lab: IBM PowerPC System-On-Chip; 98,000 nodes, 1.57 million cores; ~20 Petaflops; 1.6 Petabytes; 8 MW; 96 racks
• DARPA study: 2020 Exaflop (10^18), 100 million ~ 1 billion cores
• C.f. Amazon ~= 500,000 nodes, ~5 million cores
A Major Northern Japanese Cloud Datacenter (2013)
• The Internet, via 2 x Juniper MX480 (10GbE, LACP)
• 2 zone switches (Virtual Chassis): 2 x Juniper EX8208 (10GbE)
• Each zone (700 nodes): 2 x Juniper EX4200
• 8 zones, total 5600 nodes; injection 1Gbps/node; bisection 160 Gbps
Supercomputer: Tokyo Tech. Tsubame 2.0, #4 Top500 (2010)
• ~1500 nodes, compute & storage
• Full-bisection multi-rail optical network
• Advanced silicon photonics: 40G single CMOS die, 1490nm DFB, 100km fiber
• Injection 80Gbps/node; bisection 220 Tbps
>> x1000!
But what does “220Tbps” mean?
Global IP Traffic, 2011-2016 (Source: Cisco), by type, in PB per month (average bitrate in Tbps):
• Year: 2011 / 2012 / 2013 / 2014 / 2015 / 2016 / CAGR 2011-2016
• Fixed Internet: 23,288 (71.9) / 32,990 (101.8) / 40,587 (125.3) / 50,888 (157.1) / 64,349 (198.6) / 81,347 (251.1) / 28%
• Managed IP: 6,849 (21.1) / 9,199 (28.4) / 11,846 (36.6) / 13,925 (43.0) / 16,085 (49.6) / 18,131 (56.0) / 21%
• Mobile data: 597 (1.8) / 1,252 (3.9) / 2,379 (7.3) / 4,215 (13.0) / 6,896 (21.3) / 10,804 (33.3) / 78%
• Total IP traffic: 30,734 (94.9) / 43,441 (134.1) / 54,812 (169.2) / 69,028 (213.0) / 87,331 (269.5) / 110,282 (340.4) / 29%
The TSUBAME2.0 network has TWICE the capacity of the global Internet, which is used by 2.1 billion users
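The Tbps figures in the table follow from the monthly volumes by a unit conversion, sketched below. The 30-day month and decimal prefixes (1 PB = 10^15 bytes) are assumptions chosen because they reproduce the quoted bitrates; the function name is hypothetical.

```python
def pb_per_month_to_tbps(pb_per_month, days_per_month=30):
    """Convert a monthly traffic volume in petabytes to an average bitrate in Tbps."""
    bits = pb_per_month * 1e15 * 8           # petabytes -> bits
    seconds = days_per_month * 24 * 3600     # assumed 30-day month
    return bits / seconds / 1e12             # bits/s -> Tbps


if __name__ == "__main__":
    total_2011 = pb_per_month_to_tbps(30734)
    print(f"Total IP traffic 2011: {total_2011:.1f} Tbps")
    print(f"TSUBAME2.0 bisection (220 Tbps) / 2011 total: x{220 / total_2011:.1f}")
```

The last line is the comparison behind the slide's claim: 220 Tbps of bisection bandwidth against the 2011 global average.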
Conventional networks were here (slide originally from Dan Reed @ MS -> U-Iowa)
• Historical hierarchical IDC networks
– 10Gbps – 1Gbps consolidation
– (Mostly) driven by economics
– (Partially) driven by Internet workloads
– Performance limited by incoming North-South traffic
– An incoming request may create x10 messages, but only x3~4 East-West internal traffic
– Incoming 100Gbps => 400Gbps bisection sufficient
• Now moving to “flat” E-W
– Big Data and HPC-like workloads (e.g. Facebook)
• Still, Cisco 2016 prediction: 6.6 Zetta (10^21) bytes/year => 1.4 Petabps (similar to TSUBAME3.0)
[Diagram: Internet -> BR -> AR (Layer 3) -> LB / S (Layer 2), with N-S and E-W traffic directions]
Key: • BR (L3 Border Router) • AR (L3 Access Router) • S (L2 Switch) • LB (Load Balancer)
IDC “Cloud” Technologies Nowhere near Sufficient
• HPC: x1000 in 10 years, CAGR ~= 100%
• IDC: x30 in 10 years, CAGR ~= 30-40%; server unit sales flat (replacement demand)
Source: Assessing trends over time in performance, costs, and energy use for servers, Intel, 2009.
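The relation between a ten-year growth factor and the compound annual growth rate (CAGR) quoted above can be checked with a short sketch; the function name is hypothetical.

```python
def cagr(growth_factor, years):
    """Compound annual growth rate implied by an overall growth factor."""
    return growth_factor ** (1.0 / years) - 1.0


if __name__ == "__main__":
    print(f"HPC, x1000 in 10 years: {cagr(1000, 10):.1%} per year")
    print(f"IDC, x30 in 10 years:   {cagr(30, 10):.1%} per year")
```

Doubling every year compounds to roughly x1000 in a decade, which is why a CAGR near 100% and "x1000 in 10 years" are the same statement.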
JST-CREST Extreme Big Data Research Scheme (2013-2018)
• Future non-silo Extreme Big Data apps (co-design): Large Scale Metagenomics; Massive Sensors and Data Assimilation in Weather Prediction; Ultra Large Scale Graphs and Social Infrastructures
• EBD System Software incl. EBD Object System (co-design): EBD Bag, EBD KVS, Cartesian Plane, Graph Store
• Exascale Big Data HPC: Convergent Architecture (Phases 1~4) with large-capacity NVM and high-bisection NW
– [Diagram: PCB with TSV interposer carrying a high-powered main CPU, low-power CPUs, DRAM and NVM/Flash stacks]
– 2Tbps HBM, 4~6 HBM channels; 1.5TB/s DRAM & NVM BW; 30PB/s I/O BW possible; 1 Yottabyte / year
• Supercomputers: compute & batch-oriented
• Cloud IDC: very low BW & efficiency
Cartesian Plane KVS 100,000 Times Fold EBD “Convergent” System Overview EBD Performance Modeling KVS KVS Tasks 5-1~5-3 Task 6 Task 3 EBD Programming System Graph Store EBD Application Co-Design and Validation & Evaluation Ultra High BW & Low Latency NVM Ultra High BW & Low Latency NW Processor-in-memory 3D stacking Large Scale Genomic Correlation Data Assimilation in Large Scale Sensors and Exascale Atmospherics Large Scale Graphs and Social Infrastructure Apps TSUBAME 3.0 TSUBAME 2.0/2.5 EBD “converged” Real-Time Resource Scheduling Task 4 EBD Distributed Object Store on 100,000 NVM Extreme Compute and Data Nodes Task 2 EBD Bag EBD KVS Ultra Parallel & Low Power I/O EBD “Convergent” Supercomputer ~10TB/s⇒~100TB/s⇒~10PB/s Task 1
Preliminary I/O Evaluation on GPU and NVRAM for TSUBAME3.0 (2016)
How to design local storage for next-gen supercomputers?
- Designed a local I/O prototype using 16 mSATA SSDs (mSATA SSDs -> RAID card -> motherboard)
・Capacity: 4TB
・Read bandwidth: 8 GB/s, ~320K IOPS (3 μsec)
[Figure: I/O performance of multiple mSATA SSDs; Bandwidth [MB/s] (0 to 9000) vs. # mSATAs (0 to 20); series: Raw mSATA 4KB, RAID0 1MB, RAID0 64KB] ~8 GB/s from 16 mSATA SSDs (RAID0 enabled)
[Figure: I/O performance from GPU to multiple mSATA SSDs; Throughput [GB/s] (0 to 3.5) vs. Matrix Size [GB] (0.274 to 140); series: Raw 8 mSATA, 8 mSATA RAID0 (1MB), 8 mSATA RAID0 (64KB)] ~4 GB/s from 8 mSATA SSDs to GPU
EBD-I/O (Many-core I/O)
Tsubame 4: 2020‐ DRAM+NVM+CPU with 3D/2.5D Die Stacking ‐The Ultimate Convergence of BD and EC‐PCB NVM/Flash NVM/Flash NVM/Flash DRAM DRAM Low Power CPU High Powered Main CPU TSV Interposer DRAM NVM/Flash NVM/Flash NVM/Flash DRAM DRAM DRAM Low Power CPU 2Tbps HBM 4~6HBM Channels 2TB/s DRAM & NVM BW 30PB/s I/O BW Possible 1 Yottabyte / Year Direct Chip‐Chip Interconnect with planar VCSEL optics
EBD System Software (Matsuoka Group) Large Scale Genomic Correlation Performance Modeling for EBD Apps Data Assimilation in Large Scale Sensors and Exascale Atmospherics Large Scale Graphs and Social Infrastructure Apps Interactive Scheduler for EBD-based Analysis Algorithm Kernels on EBD Indexing Sort Matching Graph Search EBD Programming Model System-level Programming for EBD Object Application-level Programming on EBD EBD-I/O (Many-core I/O) Clustering GPUfs NetCDF HDF5
A scalable MapReduce-based large scale graph processing algorithm using multi-GPU [CCGrid 2013]
How to utilize multi-GPU for large-scale graph processing?
- Implemented a graph processing algorithm on multi-GPU
- Confirmed speedup on multi-GPU using PageRank application
[Figure: performance evaluation on TSUBAME 2.0 (SCALE 27, 128 nodes); KEdges / Sec (log scale, higher is better) for PEGASUS, MarsCPU, MarsGPU-3]
186.6x speedup over Hadoop
High Performance Sorting Fast algorithms: Distribution vs Comparison-based MSD radix sort variable length / long keys high efficiency on small alphabets Efficient implementation GPUs are good at counting numbers Computational Genomics (A,C,G,T) Scalability N log(N) classical sorts (quick, merge etc) LSD radix sort (THRUST) short length / fixed-length keys integer sorts apple apricot banana kiwi don't have to examine all characters Bitonic sort Comparison of keys Map-Reduce Hadoop easy to use but not that efficient Hybrid approaches/ Best to be found Good for GPU nodes Balancing IO / computation – Algorithm Kernels on EBD
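As an illustration of the MSD radix sort branch of this slide, here is a minimal CPU sketch for variable-length keys over the small genomic alphabet (A, C, G, T). It is not the GPU implementation discussed in the talk; the function name and the sample reads are hypothetical. Keys are distributed into one bucket per character, most significant position first, so a key is only examined as deep as needed to separate it from its neighbors.

```python
def msd_radix_sort(keys, alphabet="ACGT", depth=0):
    """Sort strings over `alphabet` by distribution, most significant character first."""
    if len(keys) <= 1:
        return list(keys)
    rank = {ch: i for i, ch in enumerate(alphabet)}
    finished = []                          # keys that end at this depth sort first
    buckets = [[] for _ in alphabet]       # one bucket per character (counting step)
    for key in keys:
        if len(key) == depth:
            finished.append(key)
        else:
            buckets[rank[key[depth]]].append(key)
    out = finished
    for bucket in buckets:
        # Recurse only inside each bucket; singleton buckets stop immediately.
        out.extend(msd_radix_sort(bucket, alphabet, depth + 1))
    return out


if __name__ == "__main__":
    reads = ["GATTACA", "ACGT", "AC", "TTGA", "GATT", "CCG", "ACGT", "A"]
    print(msd_radix_sort(reads))
```

The small alphabet keeps the bucket count low, which is why the slide pairs MSD radix sort with computational genomics and with GPUs that are good at counting.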
Large Scale BFS Using NVRAM (EBD Algorithm Kernels)
1. Introduction
• Large scale graph processing in various domains: DRAM resource requirements have increased
• Spread of flash devices
– Pros: price per bit, energy consumption
– Cons: latency, throughput
• Using NVRAM for large scale graph processing has the potential for minimal performance degradation
2. Hybrid-BFS
• Switches between two approaches, top-down and bottom-up, based on the number of frontier vertices (nfrontier), the number of all vertices (nall), and parameters α, β
3. Proposal
① Offload data with small accesses to NVRAM
② BFS with reading data from NVRAM
4. Evaluation
• Offloading the top-down graph: we could halve the size of DRAM [128GB -> 64GB] at Scale 27
[Figure: GTEPS (0 to 6) vs. switching parameter (α = 1.E+04 to 1.E+07, β = 10α or 0.1α) for DRAM only, DRAM + ioDrive2, DRAM + Intel SSD; labeled values: 5.2 GTEPS, 4.1 GTEPS (79.4%), 2.8 GTEPS (52.9%)]
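The hybrid BFS in item 2 can be sketched as follows for an in-memory, undirected graph. The slide names the inputs of the switching rule (frontier size, total vertex count, α, β) but not the rule itself, so the thresholds used here, `nall / alpha` and `nall / beta`, are an assumption, as are the function name and default parameter values. NVRAM offloading is not modeled.

```python
def hybrid_bfs(adj, source, alpha=4.0, beta=8.0):
    """Level-synchronous BFS that switches between top-down and bottom-up steps.

    adj: adjacency lists of an undirected graph (adj[v] lists the neighbors of v).
    Returns the BFS level of every vertex (-1 if unreachable).
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    bottom_up = False
    while frontier:
        # Assumed switching rule: large frontiers favor bottom-up, small ones top-down.
        if not bottom_up and len(frontier) > n / alpha:
            bottom_up = True
        elif bottom_up and len(frontier) < n / beta:
            bottom_up = False
        nxt = []
        if bottom_up:
            # Every unvisited vertex looks for any one neighbor in the frontier.
            for v in range(n):
                if level[v] == -1:
                    for u in adj[v]:
                        if level[u] == depth:
                            level[v] = depth + 1
                            nxt.append(v)
                            break
        else:
            # Every frontier vertex pushes to its unvisited neighbors.
            for u in frontier:
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = depth + 1
                        nxt.append(v)
        frontier = nxt
        depth += 1
    return level


if __name__ == "__main__":
    ring = [[(i - 1) % 8, (i + 1) % 8] for i in range(8)]
    print(hybrid_bfs(ring, 0))
```

The early `break` in the bottom-up step is what leaves many edges untouched, the property that the NVRAM offloading work exploits.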
5. Current Work (EBD Algorithm Kernels)
• In the bottom-up approach, all an unvisited vertex has to do is find one edge that is connected to a frontier vertex, so a lot of edges are never accessed
• Each vertex allocates only a few edges to DRAM
[Figure: edges of vertices v0, v1, v2, indexed by vertex ID, split between DRAM and NVRAM]
• Simulation: reduced bottom-up graph (BG), Scale 27
[Figure: in-DRAM BG size relative to full BG size (0% to 40%) vs. max number of edges allocated in DRAM per vertex (2, 4, 8, 16, 32; outgoing edges from Vn)]
• Only 20~30% of accesses go to NVRAM
6. Related Work and Summary
● Pearce et al.: 1 TB DRAM and 12 TB NVRAM (Fusion-io ioDrive), 52 MTEPS [Scale 36: 69G vertices, 1100G edges]
● We could halve the size of DRAM with 20.6% performance degradation (4.1 GTEPS) [Scale 27: 130M vertices, 2.1G edges]
Roger Pearce, Maya Gokhale, Nancy M. Amato, "Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory", 2013 IEEE 27th International Symposium on Parallel and Distributed Processing
Tokyo Institute of Technology’s EBD-RH5885v2 (Huawei RH5885v2) is ranked No.4 in the Big Data category of the Green Graph 500 Ranking of Supercomputers with 4.35 MTEPS/W on Scale 30, on the second Green Graph 500 list published at the Supercomputing Conference (SC), November 19, 2013. Congratulations from the Green Graph 500 Chair. Hardware: Quad-Socket Xeon Server + 4TB PCI-e SSD
Twitter network (Application of Graph500 Benchmark)
Follow-ship network 2009: 41 million vertices and 1.47 billion edges; an (i, j)-edge connects User i and User j
Frontier size in BFS with source as User 21,804,357 (Lv: frontier size, freq. %, cum. freq. %):
• 0: 1, 0.00, 0.00
• 1: 7, 0.00, 0.00
• 2: 6,188, 0.01, 0.01
• 3: 510,515, 1.23, 1.24
• 4: 29,526,508, 70.89, 72.13
• 5: 11,314,238, 27.16, 99.29
• 6: 282,456, 0.68, 99.97
• 7: 11,536, 0.03, 100.00
• 8: 673, 0.00, 100.00
• 9: 68, 0.00, 100.00
• 10: 19, 0.00, 100.00
• 11: 10, 0.00, 100.00
• 12: 5, 0.00, 100.00
• 13: 2, 0.00, 100.00
• 14: 2, 0.00, 100.00
• 15: 2, 0.00, 100.00
• Total: 41,652,230, 100.00
Our NUMA-optimized BFS on a 4-way Xeon system: 3.67 GTEPS ⇒ 400 millisec!
Six degrees of separation
R&D of EBD Distributed Object Store (co-PI: Osamu Tatebe, U-Tsukuba)
• Key design issues for scaled-out IOPS and I/O bandwidth
– Scalable distributed MDS (1M IOPS)
– High-performance local object store
– Efficient parallel access (100 TB/s) and parallel query
[Diagram: nodes, each with apps, CPU, memory, NVRAM, a distributed MDS and an object store, connected by a high-performance interconnect; labels: parallel access and parallel query; scalable distributed MDS; high-performance object store for NVRAM/flash]
EBD high and low‐radix NWs(NII G) EBD non-uniform access Low latency write/read ～10μs for 4KB Extreme Big Data Flow EBD interconnect Low-jitter topology w/ random shortcuts TCP/IP bypassing direct comm. to flash Our seeds technology: Rand Topology[ISCA12] Cabling Layout[HPCA13] Deadlock‐free routing[IEEE Trans.12] Virtual routing method[IPDPS09] Typical Data Centers ‐Poor scalability ‐ 1GbE + 10GbE ‐ TCP/IP basis K Computer Supercomputers ‐ Dedicated to neighboring and uniform access
API co‐design for complicated I/O requirements （co‐PI：Yutaka Akiyama, Tokyo Tech) 300% increase per year O(n) Meas. data O(m) Reference Database O(m n) calculation Correlation, Similarity search Metagenome sciences Important issues: 1) Allocation of big Measured data 2) Allocation of big References (DB) Simple batch of BLASTX software 0.18 M Reads / hour 144core Xeon Cluster (2010) 572.8 M Reads / hour 82944node K‐computer (2012) GHOST‐MP OpenMP / MPI load‐balancing data dispatcher
EBD Driven Planetary-Scale Social Analytics Infrastructure
• Applications: Disaster Management; Transportation, Evacuation, Logistics; Social Network; Energy / Power Saving
• Data sources -> EBD Object Store: 10 Tbps (streaming data including satellite images)
• Analysis: Large-Scale Graph Analysis (graph partitioning; Centrality/BC/BFS/RWR/Clustering, etc.); Data Assimilation; Co-design
• EBD-Driven Social Simulation: “Billion-Scale” and 10-Fold Real Time Discrete Event Simulation
• EBD-Driven Social Analytics Supercomputer with 25 PFLOPS, 10PB (DRAM) and 511 PB (Flash), 1 Petabit/s (Comm.)
• Grand Challenge Problem Size: 2^42 vertices (4.4 trillion vertices, 1.1 PB memory); 7 billion human beings on the planet with a 3 billion-level road network
• Log data generation speed = 700 Tbps; total log size per 1 simulation = 2.2 PB
Fail-Safe EBD Workflow and Geometrical Search in Big Data Assimilation (co-PI: Takemasa Miyoshi, Riken AICS)
EBD Data Assimilation System in Weather Forecast (proposed simultaneously to Prof. Tanaka’s CREST)
• Weather observation data keep flowing in every 30 s; in case of hardware failure, it is difficult to catch up once delayed
• Observation sources: Himawari, 500MB/2.5min; Phased Array Radar, 1GB/30sec/2 radars
• Observation processing: A-1. Quality Control, A-2. Data Processing; B-1. Quality Control, B-2. Data Processing
• Repeat every 30 sec.:
① 30-sec Ensemble Forecast Simulations, 2 PFLOP: Ensemble Forecasts, 200GB of simulation data
② Ensemble Data Assimilation, 2 PFLOP: Ensemble Analyses, 200GB of simulation data; Analysis Data 2GB
③ 30-min Forecast Simulation, 1.2 PFLOP: 30-min Forecast, 2GB
• 4-dimensional Ensemble Kalman Filter (4D-LETKF): enables processing multiple steps at one time; observations at multiple times are treated simultaneously
• Highly reliable system that can catch up in case of delay (e.g. after a failed simulation)
Domestic Collaborations • JST CREST Post Petascale (Akinori Yonezawa) – Katsuki Fujisawa(Chuo Univ): “Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers” – Toshio Endo(Tokyo Tech.) “Software Technology that Deals with Deeper Memory Hierarchy in Post-petascale Era” – Osamu Tatebe: “System Software for Post Petascale Data Intensive Science” • TSUBAME3.0 – R&D collaborations ongoing with vendors: CPU/GPU, system, network, storage, SW tools • JST CREST “Big Data” x 2 --- All teams – Provide R&D Platform as TSUBAME3.0 + our SW, similar to Info-Plosion common platform scheme
Summary • TSUBAME1.0->2.0->2.5->3.0->… – Tsubame 2.5 Number 1 in Japan, 17 Petaflops SFP – Template for future supercomputers and IDC machines • TSUBAME3.0 Early 2016 – New supercomputing leadership – Tremendous power efficiency, extreme big data, extreme high reliability,… • Lots of background R&D for TSUBAME3.0 and towards Exascale – Green Computing: ULP-HPC & TSUBAME-KFC – Extreme Big Data – Convergence of HPC and IDC! – Exascale Resilience – Programming with Millions of Cores – … • Please stay tuned!