Some thoughts about energy efficient application execution on NEC LX Series compute clusters

G. Wellein, G. Hager, J. Treibig, M. Wittmann
Erlangen Regional Computing Center & Department of Computer Science
Friedrich-Alexander-University Erlangen-Nuremberg, Germany
November 19, 2013, hpc@rrze.uni-erlangen.de

Erlangen Regional Computing Center (RRZE)
RRZE: Regional HPC service provider and HPC research center
Map labels: Hannover; Berlin; FZ Jülich (JuQueen, 5 PF/s); Erlangen; HLRS Stuttgart (Hermit, 1 PF); LRZ München (SuperMUC, 3 PF)

Erlangen Regional Computing Center
A broad range of users: Biology, Chemistry, CFD, Material Science, Physics - Medicine, Economics, ...
A broad range of clusters:
- LINUX (NEC): 560 nodes (234 TF/s), installation: 2013
- LINUX (NEC): 500 nodes (64 TF/s), installation: 2010
- LINUX (others): 300 nodes (2007-2011)
- WINDOWS (other): 16 nodes (2009)
Installation of a new LINUX cluster every 3 years:
- Decision based on benchmarks from users
- Production nodes: CPU only (benchmark commitments for applications on GPGPU / Phi cards ...)
- Budget: ~2.5-3 Million USD

NEC LX-Cluster@RRZE: Dedicated to Emmy Noether
- #210 in TOP500 as of Nov. 2013
- 191.5 TF/s LINPACK (CPU only)
- LINPACK efficiency: 97.1 % of 197.1 TF/s peak (based on 2.2 GHz)
"Emmy" cluster, 234 TF/s peak:
- 560 compute nodes
- 2x Intel Xeon E5-2660v2 (10 core "Ivy Bridge" @ 2.2 GHz)
- 64 GB DDR3 RAM
- 6 GPGPU nodes: 2x NVIDIA K20c
- 6 Phi nodes: 2x Intel Xeon Phi
- 4 mixed nodes: 1x K20c + 1x Phi
- QDR Infiniband
- No local disks

HPC research objectives
SC13 Tutorial: The Practitioner's Cookbook for Good Parallel Performance on Multi- and ManyCore Systems. Presenter(s): G. Wellein, G. Hager, J. Treibig
SC13 Poster: Pattern-Driven Node-Level Performance Engineering. Author(s): J. Treibig, G. Hager, G. Wellein. See you there at 5:15-7:00 today!
- Performance Engineering for multi-/manycore architectures
- Efficient programming on hybrid parallel systems
- Fault Tolerance
- Multicore tooling
- Application: Sparse matrix schemes and Lattice Boltzmann methods
SC13 Tutorial: Hybrid MPI and OpenMP Parallel Programming. Presenter(s): G. Jost, R. Rabenseifner, G. Hager
SC13 Doctoral Showcase: A Unified Sparse Matrix Format for Heterogeneous Systems. Presenter: M. Kreutzer. Don't miss it Thursday afternoon

Energy efficient application execution
Best energy efficiency? There are so many parameters to consider!
- Clock speed?
- Code variants?
- SMT?
- Cores per chip?

What kind of application do you run?
Consider scalability within a single multicore processor chip:
- "LINPACK type": limiting factor is core execution
- "STREAM type": limiting factor is saturation (bandwidth)
Change clock speed: 1.5 X, 0.6 X

Simple model for Energy to solution: Clock speeds and core counts (1)
Performance using t cores at clock speed of f:
  $P(f, t) = \min\left( \frac{f}{f_0} \times P_0 \times t,\; P_{\max} \right)$
  $f_0$: Baseline clock speed
  $P_0$ ($P_{\max}$): Baseline single core (max. chip) performance
Power consumption for running t cores at clock speed of f:
  $W(f, t) = W_0 + \left( W_1 \times f + W_2 \times f^2 \right) \times t$
  $W_0$: Baseline power (memory, IO, network, ...)
  $W_0$, $W_1$, $W_2$: Determined by benchmarks
For Intel SNB: $W_2 = 1$ W/GHz^2; $W_0 = 32$ W for chip; $W_0 = 73$ W per socket for whole system

Simple model for Energy to solution: Clock speeds and core counts (2)
Energy to solution if running t cores at clock speed of f:
  $E(f, t) = \frac{W(f, t)}{P(f, t)} = \frac{W_0 + \left( W_1 \times f + W_2 \times f^2 \right) \times t}{\min\left( \frac{f}{f_0} \times P_0 \times t,\; P_{\max} \right)}$
Code optimization increases $P_0$ and / or $P_{\max}$ and proportionally reduces E
LINPACK type apps: Use all cores at clock speed of $f_{opt} = \sqrt{\frac{W_0}{t \times W_2}}$
STREAM type apps: Minimum energy at saturation point.

Energy to Solution
Plot parameters: $W_0 = 73$ W, $W_2 = 1$ W/GHz^2; curves: base, opt; panels: LINPACK type, STREAM type
LINPACK type: Use all cores and high clock speed!
Plot annotations: "= 2 GHz", "= 3 GHz"
STREAM type: Run all cores at a clock speed which still saturates performance

Energy to Solution: A different way of presentation
Energy vs. Performance
"Isoline" of constant energy delay product ($E \times \Delta t$)

A real world example: Lattice Boltzmann CFD solver
- "STREAM type" code
- Different levels of optimization ($P_0$): scalar, SSE, AVX code
- Not included in model: Bandwidth degradation with lower clock speed (2.7 GHz to 1.2 GHz)

A real world example: Lattice Boltzmann CFD solver
Realistic model for LBM performance (plot panels: MODEL, MEASUREMENT)
Optimal point of operation: 1.2 GHz with AVX code at saturation point (7 cores)

A real world example: Lattice Boltzmann CFD solver
Be aware! Lowering clock speed may lower MPI bandwidth between nodes!
IMB sendrecv between two nodes (FDR IB)
Using all cores, network bandwidth may drop by 40%!

Lessons to learn
- Code optimization is a must!
- LINPACK-type codes: Run as fast as possible
- STREAM-type code: Run at saturation point of lowest clock speed which saturates
- Check degradation of main memory bandwidth and interconnect bandwidth
Things to consider at system administration level:
- Allow users to specify clock speeds (simple modification in Prolog NEC)
- Install LIKWID toolkit (http://code.google.com/p/likwid/): allows users to measure power and energy consumption (likwid-powermeter)
- Works well with NEC software stack

LIKWID toolbox: small, flexible and easy-to-use tools
- likwid-topology
- likwid-pin
- likwid-bench
- likwid-perfctr
- likwid-powermeter
- likwid-mpirun

References
- An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level. Submitted.
  Preprint: arXiv:1304.7664
- Exploring performance and power properties of modern multicore chips via simple machine models. Accepted for publication in CCPE. http://arXiv.org/abs/1208.2908

Thank you!

Question: Name 2 hardware properties which may depend on clock speed (besides clock speed and peak performance)?
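
Worked example: the simple energy-to-solution model from the slides (performance P(f, t), power W(f, t), energy E = W / P, and the optimal clock speed f_opt for LINPACK-type codes) can be explored numerically. The following is a minimal sketch, not the authors' code. W0 = 73 W and W2 = 1 W/GHz^2 are taken from the slides; W1 = 0, the baseline clock f0 = 2.7 GHz, P0 = 1.0, Pmax = 3.0, the 8-core chip, and the frequency list are illustrative assumptions. The helper name `best_operating_point` is hypothetical.

```python
import math

def performance(f, t, f0, p0, p_max):
    """P(f, t) = min((f / f0) * P0 * t, Pmax); f and f0 in GHz."""
    return min(f / f0 * p0 * t, p_max)

def power(f, t, w0, w1, w2):
    """W(f, t) = W0 + (W1 * f + W2 * f**2) * t; watts, with f in GHz."""
    return w0 + (w1 * f + w2 * f ** 2) * t

def energy_to_solution(f, t, f0, p0, p_max, w0, w1, w2):
    """E(f, t) = W(f, t) / P(f, t): energy per unit of work."""
    return power(f, t, w0, w1, w2) / performance(f, t, f0, p0, p_max)

def f_opt(t, w0, w2):
    """Energy-optimal clock for non-saturating (LINPACK-type) codes.

    Follows from dE/df = 0 when P is proportional to f * t.
    """
    return math.sqrt(w0 / (w2 * t))

def best_operating_point(freqs, max_cores, **model):
    """Scan all (f, t) pairs and return (E, f, t) with minimal energy."""
    return min(
        (energy_to_solution(f, t, **model), f, t)
        for f in freqs
        for t in range(1, max_cores + 1)
    )

# W0 and W2 from the slides; everything else is an illustrative assumption.
W0, W1, W2 = 73.0, 0.0, 1.0
CORES = 8

# LINPACK type: no saturation, so the optimum is f_opt with all cores.
print("f_opt [GHz], whole-system W0:", f_opt(CORES, W0, W2))

# STREAM type: performance saturates at an assumed Pmax = 3 * P0.
stream_model = dict(f0=2.7, p0=1.0, p_max=3.0, w0=W0, w1=W1, w2=W2)
energy, freq, cores = best_operating_point(
    [1.2, 1.6, 2.0, 2.4, 2.7], CORES, **stream_model
)
print("STREAM-type optimum: f =", freq, "GHz, cores =", cores)
```

With a large baseline power W0, the LINPACK-type optimum lands at a high clock speed, while the saturating case favors the lowest clock speed that still reaches Pmax. The sketch ignores the memory and network bandwidth degradation at low clock speeds that the slides warn about.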