Some thoughts about energy efficient application execution on NEC

Some thoughts about energy efficient
application execution on NEC LX Series
compute clusters
G. Wellein, G. Hager, J. Treibig, M. Wittmann
Erlangen Regional Computing Center & Department of Computer Science
Friedrich-Alexander-University Erlangen-Nuremberg
Germany
Erlangen Regional Computing Center (RRZE)
RRZE: Regional HPC service provider and HPC research center
[Map of Germany: Hannover, Berlin, FZ Jülich (JuQueen: 5 PF/s), Erlangen, HLRS-Stuttgart (Hermit: 1 PF), LRZ-München (SuperMUC: 3 PF)]
November 19, 2013
hpc@rrze.uni-erlangen.de
Erlangen Regional Computing Center
 A broad range of users: Biology, Chemistry, CFD, Material
Science, Physics – Medicine, Economics,…
 A broad range of clusters:
 LINUX (NEC): 560 nodes (234 TF/s), installation: 2013
 LINUX (NEC): 500 nodes (64 TF/s), installation: 2010
 LINUX (others): 300 nodes (2007 – 2011)
 WINDOWS (other): 16 nodes (2009)
 Installation of a new LINUX cluster every 3 years:
 Decision based on benchmarks from users
 Production nodes: CPU only
(benchmark commitments for applications on GPGPU / Phi cards …)
 Budget: ~2.5 – 3 Million USD
NEC LX-Cluster@RRZE: Dedicated to Emmy Noether
#210 in TOP500 as of Nov. 2013
191.5 TF/s LINPACK (CPU only)
LINPACK efficiency: 97.1 % of 197.1 TF/s peak (based on 2.2 GHz)
“Emmy” cluster – 234 TF/s peak
560 compute nodes
 2x Intel Xeon E5-2660v2 (10 core “Ivy Bridge” @ 2.2 GHz)
 64 GB DDR3 RAM
 6 GPGPU nodes: 2x NVIDIA K20c
 6 Phi nodes: 2x Intel Xeon Phi
 4 mixed nodes: 1x K20c + 1x Phi
QDR Infiniband
no local disks
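The CPU-only peak and the LINPACK efficiency quoted above can be checked with simple arithmetic. A minimal Python sketch, assuming 8 double-precision floating-point operations per cycle and core (AVX on “Ivy Bridge”; this value is not stated on the slide):

```python
# Back-of-the-envelope check of the CPU-only peak and the LINPACK efficiency.
# FLOPS_PER_CYCLE is an assumption (AVX: 4 adds + 4 multiplies per cycle);
# all other numbers are taken from the slide.
NODES = 560
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 10
CLOCK_GHZ = 2.2
FLOPS_PER_CYCLE = 8  # assumed, double precision


def cpu_peak_tflops() -> float:
    """Peak performance of all CPU cores in TF/s."""
    cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
    return cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0


def linpack_efficiency(measured_tflops: float) -> float:
    """Measured LINPACK performance as a fraction of the CPU-only peak."""
    return measured_tflops / cpu_peak_tflops()


if __name__ == "__main__":
    print(f"CPU peak: {cpu_peak_tflops():.1f} TF/s")
    print(f"LINPACK efficiency: {100 * linpack_efficiency(191.5):.1f} %")
```

The 234 TF/s figure for the whole cluster is larger than this CPU-only value, presumably because it also counts the accelerator nodes.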
HPC-Research objectives
 Performance Engineering for multi-/manycore architectures
 Efficient programming on hybrid parallel systems
 Fault Tolerance
 Multicore tooling
 Application: Sparse matrix schemes and Lattice Boltzmann methods
SC13 Tutorial: The Practitioner's Cookbook for Good Parallel Performance on Multi- and ManyCore Systems
Presenter(s): G. Wellein, G. Hager, J. Treibig
SC13 Poster: Pattern-Driven Node-Level Performance Engineering
Author(s): J. Treibig, G. Hager, G. Wellein
See you there at 5:15-7:00 today!
SC13 Tutorial: Hybrid MPI and OpenMP Parallel Programming
Presenter(s): G. Jost, R. Rabenseifner, G. Hager
SC13 Doctoral Showcase: A Unified Sparse Matrix Format for Heterogeneous Systems
Presenter: M. Kreutzer
Don’t miss it Thursday afternoon
Energy efficient application execution
Best energy efficiency?
There are so many parameters to consider!
 Clock speed?
 Code variants
 SMT?
 Cores per chip?
What kind of application do you run?
Consider scalability within a single multicore processor chip
 “LINPACK type”: Limiting factor: Core execution
 “STREAM type”: Limiting factor: Saturation (bandwidth)
[Figure: effect of changing the clock speed (1.5x and 0.6x) on both application types]
Simple model for Energy to solution:
Clock speeds and core counts (1)
Performance using t cores at clock speed of f
$P(f,t) = \min\left(\frac{f}{f_0} \times P_0 \times t,\ P_{\max}\right)$
$f_0$: Baseline clock speed
$P_0$ ($P_{\max}$): Baseline single core (max. chip) performance
Power consumption for running t cores at clock speed of f
$W(f,t) = W_0 + (W_1 \times f + W_2 \times f^2) \times t$
$W_0$: Baseline power (memory, IO, network…)
$W_0$, $W_1$, $W_2$: Determined by benchmarks
For Intel SNB:
$W_2 = 1\ \mathrm{W/GHz^2}$
$W_0 = 32$ W for chip
$W_0 = 73$ W per socket for whole system
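The two model equations translate directly into code. A minimal Python sketch: the function names are illustrative, W0 and W2 are the Intel SNB values from the slide, and W1 = 0 and all performance parameters in the example are hypothetical placeholders.

```python
# Sketch of the performance and power model.
# Clock speeds in GHz, power in W; P0 and Pmax share one arbitrary unit.

def performance(f, t, f0, p0, p_max):
    """P(f, t) = min(f/f0 * P0 * t, Pmax)."""
    return min(f / f0 * p0 * t, p_max)


def power(f, t, w0, w1, w2):
    """W(f, t) = W0 + (W1*f + W2*f^2) * t."""
    return w0 + (w1 * f + w2 * f**2) * t


if __name__ == "__main__":
    F0, P0, P_MAX = 2.7, 1.0, 4.0  # hypothetical values
    W0, W1, W2 = 73.0, 0.0, 1.0    # W0, W2 from the slide; W1 = 0 assumed
    for t in range(1, 9):
        p = performance(2.7, t, F0, P0, P_MAX)
        w = power(2.7, t, W0, W1, W2)
        print(f"t={t}: P={p:.2f}, W={w:.1f} W")
```

With these placeholder values the performance stops growing at four cores while the power keeps rising with every additional core, which is the “STREAM type” behavior.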
Simple model for Energy to solution:
Clock speeds and core counts (2)
Energy to solution if running t cores at clock speed of f
$E(f,t) = \frac{W(f,t)}{P(f,t)} = \frac{W_0 + (W_1 \times f + W_2 \times f^2) \times t}{\min\left(\frac{f}{f_0} \times P_0 \times t,\ P_{\max}\right)}$
Code optimization increases $P_0$ and/or $P_{\max}$ and proportionally reduces E
LINPACK type apps: Use all cores at clock speed of $f_{\mathrm{opt}} = \sqrt{\frac{W_0}{t \times W_2}}$
STREAM type apps: Minimum energy at saturation point.
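The optimal clock speed for “LINPACK type” codes follows from minimizing E with respect to f in the non-saturated regime, where $P(f,t) = \frac{f}{f_0} P_0 t$. A short derivation sketch using the symbols of the model:

```latex
\begin{align*}
E(f,t) &= \frac{W_0 + (W_1 f + W_2 f^2)\,t}{\frac{f}{f_0} P_0\, t}
        = \frac{f_0}{P_0\, t}\left(\frac{W_0}{f} + W_1 t + W_2 f\, t\right) \\
\frac{\partial E}{\partial f} &= \frac{f_0}{P_0\, t}\left(-\frac{W_0}{f^2} + W_2\, t\right) = 0
\quad\Longrightarrow\quad
f_{\mathrm{opt}} = \sqrt{\frac{W_0}{t\, W_2}}
\end{align*}
```

The linear term $W_1$ drops out of the optimum. At a fixed clock speed, E also decreases with growing t in this regime, because the baseline power $W_0$ is shared among more cores.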
Energy to Solution
$W_0 = 73$ W
$W_2 = 1\ \mathrm{W/GHz^2}$
$f_{\mathrm{base}} = 2$ GHz
$f_{\mathrm{opt}} = 3$ GHz
 LINPACK type: Use all cores and high clock speed!
 STREAM type: Run all cores at clock speed which still saturates performance
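The optimal clock speed of about 3 GHz can be reproduced from the model parameters. A minimal Python sketch, assuming t = 8 cores per chip and W1 = 0 (neither value is stated on the slide); the brute-force scan only serves as a cross-check of the closed-form expression.

```python
import math


def f_opt(w0, w2, t):
    """Clock speed (GHz) minimizing energy to solution for non-saturating code."""
    return math.sqrt(w0 / (t * w2))


def energy(f, t, w0, w1, w2, f0=1.0, p0=1.0):
    """Energy per unit of work, E = W(f, t) / P(f, t), without saturation."""
    return (w0 + (w1 * f + w2 * f**2) * t) / (f / f0 * p0 * t)


if __name__ == "__main__":
    W0, W2 = 73.0, 1.0  # slide values (W, W/GHz^2)
    T = 8               # assumed core count, not stated on the slide
    print(f"f_opt = {f_opt(W0, W2, T):.2f} GHz")
    # Scan clock speeds from 1.00 to 4.00 GHz in steps of 0.01 GHz (W1 = 0 assumed).
    grid = [1.0 + 0.01 * i for i in range(301)]
    best = min(grid, key=lambda f: energy(f, T, W0, 0.0, W2))
    print(f"scan minimum at {best:.2f} GHz")
```

Since such a clock speed is at the upper end of what the chip offers, the practical advice for this application type is to run at a high clock speed.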
Energy to Solution: A different way of presentation
Energy vs. Performance
“Isoline” of constant energy-delay product ($E \times \Delta t$)
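For a fixed amount of work N, the runtime is Δt = N / P, so a constant energy-delay product means E = (E × Δt) × P / N. In an energy vs. performance plot, such an isoline is therefore a straight line through the origin. A minimal Python sketch; the function names and all numbers are hypothetical.

```python
def runtime(work, perf):
    """Time to solution for a fixed amount of work at a given performance."""
    return work / perf


def edp(energy, work, perf):
    """Energy-delay product E * dt."""
    return energy * runtime(work, perf)


def energy_on_isoline(edp_value, work, perf):
    """Energy that yields the given EDP at a given performance: E = EDP * P / work."""
    return edp_value * perf / work


if __name__ == "__main__":
    WORK = 100.0  # hypothetical amount of work
    EDP = 50.0    # hypothetical energy-delay product
    for perf in (1.0, 2.0, 4.0):
        e = energy_on_isoline(EDP, WORK, perf)
        print(f"P={perf}: E={e}, EDP={edp(e, WORK, perf)}")
```

Operating points below a given isoline have a lower energy-delay product than points on it.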
A real world example:
Lattice Boltzmann CFD solver
 “STREAM type” code
 Different levels of optimization ($P_0$): scalar, SSE, AVX code
 Not included in model: Bandwidth degradation with lower clock speed (2.7 GHz → 1.2 GHz)
A real world example:
Lattice Boltzmann CFD solver
Realistic model for LBM performance
[Figure: two panels, MODEL and MEASUREMENT]
Optimal point of operation: 1.2 GHz with AVX code at saturation point (7 cores)
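In the simple model, the saturation point is the smallest core count t for which (f / f0) × P0 × t reaches Pmax, and the optimal point of operation can be found by scanning all combinations of clock speed and core count. A minimal Python sketch: the function names and all performance parameters are hypothetical, W1 = 0 is assumed, and the bandwidth degradation at lower clock speeds is ignored, so this does not reproduce the measurement.

```python
import math


def saturation_cores(f, f0, p0, p_max):
    """Smallest core count t with f/f0 * P0 * t >= Pmax."""
    return math.ceil(p_max / (f / f0 * p0))


def best_operating_point(freqs, cores_per_chip, f0, p0, p_max, w0, w1, w2):
    """Scan all (f, t) pairs and return the one with minimum energy per unit of work."""
    def energy(f, t):
        perf = min(f / f0 * p0 * t, p_max)
        return (w0 + (w1 * f + w2 * f**2) * t) / perf

    candidates = [(f, t) for f in freqs for t in range(1, cores_per_chip + 1)]
    return min(candidates, key=lambda ft: energy(*ft))


if __name__ == "__main__":
    FREQS = [1.2, 1.6, 2.0, 2.4, 2.7]  # GHz, hypothetical set of clock speeds
    F0, P0, P_MAX = 2.7, 1.0, 3.0      # hypothetical performance parameters
    W0, W1, W2 = 73.0, 0.0, 1.0        # W0, W2 from the model slides; W1 = 0 assumed
    for f in FREQS:
        print(f"{f} GHz saturates at {saturation_cores(f, F0, P0, P_MAX)} cores")
    print(best_operating_point(FREQS, 8, F0, P0, P_MAX, W0, W1, W2))
```

With these placeholder inputs the scan selects the lowest clock speed at its saturation point, in line with the rule for “STREAM type” codes.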
A real world example:
Lattice Boltzmann CFD solver
 Be aware! Lowering clock speed may lower MPI bandwidth between nodes!
 IMB sendrecv between two nodes (FDR IB)
 Using all cores, network bandwidth may drop by 40%!
Lessons to learn
 Code optimization is a must!
 LINPACK-type codes: run as fast as possible
 STREAM-type codes:
 Run at the saturation point of the lowest clock speed which still saturates performance
 Check degradation of main memory bandwidth
 Check degradation of interconnect bandwidth
 Things to consider at system administration level:
 Allow users to specify clock speeds (simple modification in Prolog → NEC)
 Install LIKWID toolkit (http://code.google.com/p/likwid/) – allows
users to measure power and energy consumption (likwid-powermeter)
 Works well with NEC software stack
 LIKWID toolbox: small, flexible and easy-to-use tools
 likwid-topology
 likwid-pin
 likwid-bench
 likwid-perfctr
 likwid-powermeter
 likwid-mpirun
 References
 An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level. Submitted. Preprint: arXiv:1304.7664
 Exploring performance and power properties of modern multicore chips via simple machine models. Accepted for publication in CCPE. http://arXiv.org/abs/1208.2908
Thank you!
 Question: Name 2 hardware properties which may depend on clock speed (besides clock speed and peak performance)?