Piz Daint: a modern research infrastructure for computational science

Transcription

Thomas C. Schulthess
NorduGrid 2014, Helsinki, May 20, 2014
source: A. Fichtner, ETH Zurich
Data from many stations and earthquakes
source: A. Fichtner, ETH Zurich
Very large simulations allow inverting large data sets to
generate high-resolution models of the earth's mantle
[Figure: shear-wave velocity (vs) maps at 20 km and 70 km depth, with colour scales of 2.80 to 3.55 km/s and 4.05 to 4.60 km/s respectively]
source: A. Fichtner, ETH Zurich
The density of water and ice
Ice floats!
first early science result on “Piz Daint” (Joost VandeVondele, Jan. 2014)
MP2 liquid water and MP2 ice: does ice float? Yes! High-accuracy quantum simulations produce correct results for water.
Accurate estimate of the density of liquid water: the importance of van der Waals interactions!
Jacob's ladder in DFT:
Jacob's ladder of chemical accuracy: a hierarchy of models for XC (J. P. Perdew)

Full many-body Schrödinger equation:
$(H - E)\,\Psi(x_1, \ldots, x_N) = 0$
$H = \sum_{i=1}^{N} \left( -\frac{\hbar^2}{2m} \nabla_i^2 + v(x_i) \right) + \frac{1}{2} \sum_{i \neq j = 1}^{N} w(x_i, x_j), \qquad w(x_i, x_j) = \frac{e^2}{|\vec{r}_i - \vec{r}_j|}, \qquad x = \{\vec{r}, \sigma\}$

Each rung in the model includes more ingredients:
● increases accuracy
● increases computational demands.

Rung 5) Virtual orbitals: double hybrid (RPA / MP2 / GW)
Rung 4) Occupied orbitals: hyper / hybrid (HF exchange)
Rung 3) Kinetic energy density: meta (tau)
Rung 2) Gradients of the density: GGA
Rung 1) Density: LDA

Kohn-Sham equation with Local Density Approximation:
$\left( -\frac{\hbar^2}{2m} \nabla^2 + v_{\mathrm{LDA}}(\vec{r}) \right) \varphi_i(\vec{r}) = \epsilon_i\, \varphi_i(\vec{r})$

Ice floats with new "rung 5" simulations: MP2-based simulations with CP2K (VandeVondele & Hutter).
Ice didn't float with previous simulations using "rung 4" hybrid functionals.
(Systems @CSCS: Daint, Rosa, Palü)
The challenge of water

Modelling interactions between water molecules requires quantum simulations with extreme accuracy.
Weak interactions dominate properties! Hydrogen-bonding most prominent.

Energy scales:
total energy: -76.438… a.u.
hydrogen bonds: ~0.0001 a.u.
Roughly 99.9999% accuracy (or error cancellation) needed!
source: J. VandeVondele, ETH Zurich
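As a quick check of these numbers (a worked ratio, not on the slide), the hydrogen-bond scale relative to the total energy gives the required relative accuracy:

$\frac{0.0001\ \text{a.u.}}{76.438\ \text{a.u.}} \approx 1.3 \times 10^{-6}$

That is about one part in a million, consistent with the quoted ~99.9999% accuracy (or error cancellation).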
Pillars of the scientific method
Mathematics / Simulation
(1) Synthesis of models and data: recognising characteristic
features of complex systems with calculations of limited
accuracy (e.g. inverse problems)
(2) Solving theoretical problems with high precision:
complex structures emerge from simple rules (natural
laws), more accurate predictions from “beautiful” theory
(in the Penrose sense)
Theory (models)
Experiment (data)
Pillars of the scientific method

Note the changing role of high-performance computing: HPC is now an essential tool for science, used by all scientists (for better or worse), rather than being limited to the domain of applied mathematics and providing numerical solutions to theoretical problems only few understand.
CSCS' new flagship system "Piz Daint" is one of
Europe's most powerful petascale supercomputers
Presently the world’s most energy efficient petascale supercomputer!
Hierarchical setup of Cray's Cascade architecture

Blade: 8 Xeon sockets organized as 4 dual-socket nodes plus a single Aries ASIC (4 NICs and a 48-port high-radix router). Each Aries NIC provides an independent PCI-Express Gen3 x16 host interface to one node; each node has 8 DDR3 memory channels with a single DIMM each, for up to 128 GB per node.
Chassis: 16 four-node blades, one Aries per blade; the chassis backplane provides all-to-all connectivity between the blades.
Electrical group: 2 cabinets with 6 chassis (384 nodes); each blade provides 5 electrical connectors used to connect to the other chassis within the group (black links).
System: each Aries also provides 10 global links, taken to connectors on the chassis backplane; the global (blue) links connect Dragonfly groups together and are active optical cables in a large system. Example shown: a Cascade system with 8 electrical groups (16 cabinets).

Source: G. Faanes et al., SC'12 proceedings (2012)
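To make the hierarchy concrete, here is a minimal sketch (plain C++, not from the talk) that simply multiplies out the counts quoted above: 4 nodes per blade, 16 blades per chassis, 6 chassis per two-cabinet group, and the 8-group example system.

#include <cstdio>

int main() {
    // Counts taken from the Cascade description above.
    const int nodes_per_blade    = 4;   // four dual-socket nodes share one Aries
    const int blades_per_chassis = 16;  // all-to-all chassis backplane
    const int chassis_per_group  = 6;   // two-cabinet electrical group
    const int groups             = 8;   // example system from the slide (16 cabinets)

    const int nodes_per_chassis = nodes_per_blade * blades_per_chassis;   // 64
    const int nodes_per_group   = nodes_per_chassis * chassis_per_group;  // 384
    const int nodes_per_system  = nodes_per_group * groups;               // 3072

    std::printf("nodes per chassis: %d\n", nodes_per_chassis);
    std::printf("nodes per group:   %d\n", nodes_per_group);
    std::printf("nodes per system:  %d (%d groups)\n", nodes_per_system, groups);
    return 0;
}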
Regular multi-core vs. hybrid multi-core blades

Initial multi-core blade: 4 nodes, each configured with
> 2 Intel SandyBridge CPUs
> 32 GB DDR3-1600 RAM
Peak performance of blade: 1.3 TFlops

Final hybrid CPU-GPU blade: 4 nodes, each configured with
> 1 Intel SandyBridge CPU
> 32 GB DDR3-1600 RAM
> 1 NVIDIA K20X GPU (GK110) with 6 GB GDDR5 memory
Peak performance of blade: 5.9 TFlops

In both cases the four nodes attach through NIC0 to NIC3 to the blade's 48-port Aries router.
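A rough sanity check of the quoted blade peaks, assuming per-device peak figures that are not stated on the slide (roughly 0.166 TFlops for an 8-core 2.6 GHz SandyBridge Xeon and roughly 1.31 TFlops double precision for a K20X):

$4 \times 2 \times 0.166 \approx 1.3\ \text{TFlops}, \qquad 4 \times (0.166 + 1.31) \approx 5.9\ \text{TFlops}$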
A brief history of “Piz Daint”
• Installed 12 cabinets with dual-SandyBridge nodes in Oct./Nov. 2012
  • ~50% of final system size to test network at scale (lessons learned from Gemini & XE6)
• Rigorous evaluation of three node architectures based on 9 applications
  • dual-Xeon vs. Xeon/Xeon-Phi vs. Xeon/Kepler
  • joint study with Cray from Dec. 2011 through Nov. 2012
• Five applications were used for system design
  • CP2K, COSMO, SPECFEM-3D, GROMACS, Quantum ESPRESSO
  • CP2K & COSMO-OPCODE were co-developed with the system
• Moving to 28 cabinets with hybrid CPU-GPU nodes in Oct./Nov. 2013
  • Upgrade performance goals: hybrid Cray XC30 has to beat the regular XC30 by 1.5x
  • Accepted in Dec. 2013
• Early Science with fully performant hybrid nodes: January through March 2014
source: David Leutwyler
Speedup of the full COSMO-2 production problem
(apples to apples with 33h forecast of MeteoSwiss)

[Bar chart comparing the current production code and the new HP2C-funded code on Monte Rosa (Cray XE6, Nov. 2011), Tödi (Cray XK7, Nov. 2012), Piz Daint (Cray XC30, Nov. 2012) and Piz Daint (Cray XC30 hybrid GPU, Nov. 2013); speedups shown range from 1x (baseline) up to 3.36x.]
Energy to solution (kWh / ensemble member)

[Bar chart comparing the current production code and the new HP2C-funded code on the Cray XE6 (Nov. 2011), Cray XK7 (Nov. 2012), Cray XC30 (Nov. 2012) and Cray XC30 hybrid GPU (Nov. 2013); the vertical axis runs from 1.5 to 6.0 kWh per ensemble member, with relative improvements of up to 6.89x.]
COSMO: current and new (HP2C developed) code
main (current / Fortran): dynamics (Fortran) and physics (Fortran), on top of MPI and the system.

main (new / Fortran): dynamics rewritten in C++ on a stencil library with x86 and GPU backends; physics (Fortran) with OpenMP / OpenACC; Shared Infrastructure for boundary conditions & halo exchange through a Generic Comm. Library, on top of MPI (or whatever) and the system.
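For illustration, a minimal sketch of the kind of horizontal stencil the rewritten dynamics expresses through the stencil library. This is plain C++ over a flat array, not the library's actual API (which the talk does not show); the Laplacian formula matches the lap(i,j,k) snippet on the following slides.

#include <cstdio>
#include <cstddef>
#include <vector>

// 3D field stored as a flat array with (i,j,k) indexing, i fastest.
struct Field {
    std::size_t ni, nj, nk;
    std::vector<double> v;
    Field(std::size_t ni_, std::size_t nj_, std::size_t nk_)
        : ni(ni_), nj(nj_), nk(nk_), v(ni_ * nj_ * nk_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return v[(k * nj + j) * ni + i];
    }
    double operator()(std::size_t i, std::size_t j, std::size_t k) const {
        return v[(k * nj + j) * ni + i];
    }
};

// Horizontal 5-point Laplacian applied to all interior points of every level k.
// A stencil library would generate equivalent loops (or a GPU kernel) from the
// same single-point definition for its different backends.
void laplacian(const Field& data, Field& lap) {
    for (std::size_t k = 0; k < data.nk; ++k)
        for (std::size_t j = 1; j + 1 < data.nj; ++j)
            for (std::size_t i = 1; i + 1 < data.ni; ++i)
                lap(i, j, k) = -4.0 * data(i, j, k) +
                               data(i + 1, j, k) + data(i - 1, j, k) +
                               data(i, j + 1, k) + data(i, j - 1, k);
}

int main() {
    Field data(8, 8, 4), lap(8, 8, 4);
    data(4, 4, 0) = 1.0;                                   // a single spike
    laplacian(data, lap);
    std::printf("lap at spike:     %f\n", lap(4, 4, 0));   // -4.0
    std::printf("lap beside spike: %f\n", lap(5, 4, 0));   //  1.0
    return 0;
}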
Physical model (velocities, pressure, temperature, water, turbulence)
Mathematical description
Discretization / algorithm
Code / implementation, e.g.
  lap(i,j,k) = -4.0 * data(i,j,k) +
               data(i+1,j,k) + data(i-1,j,k) +
               data(i,j+1,k) + data(i,j-1,k);
Code compilation
A given supercomputer

Domain science (incl. applied mathematics) spans the upper layers; computer engineering (& computer science) the lower layers.

"Port" serial code to supercomputers:
> vectorize
> parallelize
> petascaling
> exascaling
> ...
Physical model (velocities, pressure, temperature, water, turbulence)
Mathematical description
Discretization / algorithm
Code / implementation (stencil example as above)
Code compilation
A given supercomputer / Architectural options / design

Domain science (incl. applied mathematics) spans the upper layers; computer engineering (& computer science) the lower layers.

"Port" serial code to supercomputers:
> vectorize
> parallelize
> petascaling
> exascaling
> ...
Physical model (velocities, pressure, temperature, water, turbulence)
Mathematical description
Discretization / algorithm
Code / implementation (stencil example as above)
Code compilation
Architectural options / design

Domain science (incl. applied mathematics) and computer engineering (& computer science) are connected through Tools & Libraries, the choice of optimal algorithm, and auto tuning.
Model development based on Python or equivalent dynamic language
Physical model (velocities, pressure, temperature, water, turbulence)
Mathematical description
Discretization / algorithm
Code / implementation (stencil example as above)
Code compilation
Architectural options / design

Domain science and computer engineering (& computer science) meet in co-design, connected through Tools & Libraries, the choice of optimal algorithm, and auto tuning.
COSMO in five years: current and new (2019) code

main (current / Fortran): physics (Fortran) and dynamics (Fortran), on top of MPI and the system.

main (new / "Python"): physics ("Python"); dynamics (C++ or "Python"); some tools (Fortran / C++), other tools and grid tools sitting on multiple backends (BE1, BE…); Shared Infrastructure for boundary conditions & halo exchange through a Generic Comm. Library, on top of MPI (or whatever) and the system.
Thank you!