Piz Daint: a modern research infrastructure for computational science
Transcription
Slide 1: Piz Daint: a modern research infrastructure for computational science. Thomas C. Schulthess. NorduGrid 2014, Helsinki, May 20, 2014.

Slide 2: (figure) Source: A. Fichtner, ETH Zurich.

Slide 3: Data from many stations and earthquakes. Source: A. Fichtner, ETH Zurich.

Slide 4: Very large simulations allow inverting large data sets to generate a high-resolution model of the Earth's mantle. Figure: maps of the shear-wave velocity vs (km/s) at 20 km depth (colour scale 3.55–4.60 km/s) and at 70 km depth (colour scale 2.80–4.05 km/s). Source: A. Fichtner, ETH Zurich.

Slide 5: The density of water and ice: ice floats! First early-science result on "Piz Daint" (Joost VandeVondele, Jan. 2014). MP2 liquid vs. MP2 ice: does ice float? Yes! High-accuracy quantum simulations produce correct results for water: an accurate estimate of the density of liquid water, and the importance of van der Waals interactions.

Slide 6: Jacob's ladder in DFT: a hierarchy of models for XC, the "Jacob's ladder of chemical accuracy" (J. Perdew). The full many-body Schrödinger equation is $(H - E)\,\Psi(x_1,\dots,x_N) = 0$ with
$$H = -\sum_{i=1}^{N} \frac{\hbar^2}{2m}\nabla_i^2 + \sum_{i=1}^{N} v(x_i) + \frac{1}{2}\sum_{i\neq j} w(x_i, x_j), \qquad w(x_i, x_j) = \frac{e^2}{|\vec r_i - \vec r_j|}, \qquad x = \{\vec r, \sigma\}.$$
Each rung in the ladder adds more ingredients to the XC model, increasing accuracy but also computational demands: Rung 1: the density (LDA); Rung 2: gradients of the density (GGA); Rung 3: kinetic-energy density (meta: tau); Rung 4: occupied orbitals (hyper / hybrid: HF exchange); Rung 5: virtual orbitals (double hybrids: RPA / MP2 / GW). At the bottom sits the Kohn-Sham equation with the local density approximation,
$$\Bigl(-\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{LDA}}(\vec r)\Bigr)\,\varphi_i(\vec r) = \epsilon_i\,\varphi_i(\vec r).$$
Ice didn't float with previous "rung 4" simulations using hybrid functionals; it does float with the new "rung 5" MP2-based simulations with CP2K (VandeVondele & Hutter), run on Daint, Rosa and Palü @ CSCS.

Slide 7: The challenge of water. Modelling the interactions between water molecules requires quantum simulations of extreme accuracy: weak interactions dominate the properties, hydrogen bonding being the most prominent. Energy scales: total energy −76.438… a.u.; hydrogen bonds ~0.0001 a.u., which sets the needed accuracy. Roughly 99.9999% accuracy (or error cancellation) is required! Source: J. VandeVondele, ETH Zurich.
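As a quick check of the accuracy requirement quoted on slide 7 (using only the two numbers given there), the hydrogen-bond energy scale relative to the total energy is
$$\frac{10^{-4}\ \mathrm{a.u.}}{76.438\ \mathrm{a.u.}} \approx 1.3\times 10^{-6},$$
i.e. about one part in a million, which is where the 99.9999% figure comes from.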
Slide 8: Pillars of the scientific method: theory (models), experiment (data), and mathematics / simulation. (1) Synthesis of models and data: recognising characteristic features of complex systems with calculations of limited accuracy (e.g. inverse problems). (2) Solving theoretical problems with high precision: complex structures emerge from simple rules (natural laws), and more accurate predictions come from "beautiful" theory (in the Penrose sense).

Slide 9: The same diagram, with a note on the changing role of high-performance computing: HPC is now an essential tool for science, used by all scientists (for better or worse), rather than being limited to the domain of applied mathematics and to providing numerical solutions to theoretical problems that only a few understand.

Slide 10: CSCS' new flagship system "Piz Daint" is one of Europe's most powerful petascale supercomputers, and presently the world's most energy-efficient petascale supercomputer!
Slide 11: Hierarchical setup of Cray's Cascade architecture. Source: G. Faanes et al., SC'12 proceedings (2012). The slide reproduces figures and text from the Cascade paper: a Cascade blade carries four dual-socket Xeon nodes and a single Aries system-on-a-chip that provides the network connectivity for all four nodes; each of the four Aries NICs offers an independent PCI-Express Gen3 x16 host interface, and each node has a pair of Sandy Bridge or Ivy Bridge processors, eight DDR3 memory channels and up to 128 GB of memory. A chassis holds 16 such blades, and the chassis backplane provides all-to-all electrical connectivity between them. Six chassis in two cabinets are cabled together electrically into an "electrical group" (the Black dimension), and the groups are connected to one another by optical global links (the Blue dimension), forming a Dragonfly network; each Aries provides 10 global links, and the number of inter-group cables can be reduced in systems that do not need full global bandwidth, which also makes a system easier to upgrade. The example pictured is a Cascade system with 8 electrical groups (16 cabinets).
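To make the packaging hierarchy above easier to follow, here is a minimal sketch that tallies node counts per level using only the figures quoted from the Cascade paper (4 nodes per blade, 16 blades per chassis, 3 chassis per cabinet, 2 cabinets per electrical group). It is an illustration, not Cray's configuration tool, and it ignores the compute slots that are occupied by I/O blades in a real installation.

    #include <cstdio>

    // Tally node slots per packaging level of the Cascade/Dragonfly hierarchy
    // described above (figures from the slide / SC'12 paper; illustrative only).
    int main() {
        const int nodes_per_blade     = 4;  // four nodes share one Aries router chip
        const int blades_per_chassis  = 16; // backplane: all-to-all within a chassis
        const int chassis_per_cabinet = 3;
        const int cabinets_per_group  = 2;  // an "electrical group" (Black dimension)

        const int nodes_per_group = nodes_per_blade * blades_per_chassis
                                  * chassis_per_cabinet * cabinets_per_group; // 384

        // Example: the 28-cabinet configuration mentioned later corresponds to 14 groups.
        const int cabinets = 28;
        const int groups   = cabinets / cabinets_per_group;

        std::printf("node slots per electrical group: %d\n", nodes_per_group);
        std::printf("node slots in %d cabinets: %d\n", cabinets, groups * nodes_per_group);
        return 0;
    }

This gives an upper bound on compute nodes; the installed compute-node count is lower wherever blades carry I/O nodes instead.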
Slide 12: Regular multi-core vs. hybrid multi-core blades. The initial multi-core blade has 4 nodes, each configured with 2 Intel SandyBridge CPUs and 32 GB of DDR3-1600 RAM; peak performance of the blade: 1.3 TFlops. The final hybrid CPU-GPU blade has 4 nodes, each configured with 1 Intel SandyBridge CPU, 32 GB of DDR3-1600 RAM and 1 NVIDIA K20X GPU with 6 GB of GDDR5 memory; peak performance of the blade: 5.9 TFlops. In both designs the four nodes attach to the Aries NICs (NIC0–NIC3) and its 48-port router.
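The two peak numbers are consistent with the usual per-device peaks. As a back-of-the-envelope check (assuming 8-core SandyBridge sockets at 2.6 GHz, i.e. about 0.166 TFlops double precision per socket, and about 1.31 TFlops double precision for a K20X — figures that are not stated on the slide):
$$4 \times 2 \times 0.166\ \mathrm{TFlops} \approx 1.3\ \mathrm{TFlops} \quad\text{(multi-core blade)},$$
$$4 \times (0.166 + 1.31)\ \mathrm{TFlops} \approx 5.9\ \mathrm{TFlops} \quad\text{(hybrid blade)}.$$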
Slide 13: A brief history of "Piz Daint":
• Installed 12 cabinets with dual-SandyBridge nodes in Oct./Nov. 2012 — roughly 50% of the final system size, to test the network at scale (lessons learned from Gemini & the XE6).
• Rigorous evaluation of three node architectures — dual-Xeon vs. Xeon/Xeon-Phi vs. Xeon/Kepler — based on 9 applications, in a joint study with Cray from Dec. 2011 through Nov. 2012.
• Five applications were used for the system design: CP2K, COSMO, SPECFEM-3D, GROMACS and Quantum ESPRESSO; CP2K & COSMO-OPCODE were co-developed with the system.
• Upgrade to 28 cabinets with hybrid CPU-GPU nodes in Oct./Nov. 2013.
• Performance goal: the hybrid Cray XC30 has to beat the regular XC30 by 1.5x. Accepted in Dec. 2013.
• Early science with fully hybrid nodes: January through March 2014.

Slide 14: (image only).

Slide 15: (image) Source: David Leutwyler.

Slide 16: Speedup of the full COSMO-2 production problem (apples to apples with the 33 h forecast of MeteoSwiss). Bar chart comparing the current production code and the new HP2C-funded code on Monte Rosa, Cray XE6 (Nov. 2011); Tödi, Cray XK7 (Nov. 2012); Piz Daint, Cray XC30 (Nov. 2012); and Piz Daint, Cray XC30 hybrid (GPU) (Nov. 2013). The speedup factors shown range from 1.35x to 3.36x over the 1x baseline (factors on the chart: 1.35x, 1.4x, 1.49x, 1.67x, 1.77x, 2.5x, 3.36x).

Slide 17: Energy to solution (kWh per ensemble member) for the same four systems and the same two codes, on a chart scale of 1.5 to 6.0 kWh; the improvement factors shown are 1.41x, 1.49x, 1.75x, 2.51x, 2.64x, 3.93x and 6.89x.

Slide 18: COSMO: current and new (HP2C-developed) code. The current code consists of main, dynamics and physics, all in Fortran, on top of MPI and the system. In the new code, main stays in Fortran; the dynamics is rewritten in C++ on top of a stencil library with x86 and GPU backends; the physics remains Fortran with OpenMP / OpenACC; both sit on a shared infrastructure, with boundary conditions & halo exchange handled by a generic communication library over MPI (or whatever the system provides).

Slide 19: A stack running from the physical model (velocities, pressure, temperature, water, turbulence) through the mathematical description, the discretization / algorithm, the code / implementation — illustrated by the stencil line lap(i,j,k) = -4.0 * data(i,j,k) + data(i+1,j,k) + data(i-1,j,k) + data(i,j+1,k) + data(i,j-1,k); — and code compilation, down to a given supercomputer. The upper part is domain science (incl. applied mathematics), the lower part computer engineering (& computer science). The traditional approach "ports" serial code to supercomputers: vectorize, parallelize, petascaling, exascaling, …

Slide 20: The same stack, with "a given supercomputer" replaced by architectural options / design at the bottom.

Slide 21: The same stack, now with tools & libraries, optimal algorithms and auto-tuning inserted between the implementation and the architectural options / design.

Slide 22: The full co-design picture: model development based on Python or an equivalent dynamic language on the domain-science side, connected to the architectural options / design through tools & libraries, optimal algorithms and auto-tuning — co-design between domain science (incl. applied mathematics) and computer engineering (& computer science).
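The lap(i,j,k) line that recurs on slides 19–22 is a horizontal five-point Laplacian, the running example for the stencil library. A minimal, self-contained version of that update written as a plain C++ loop nest could look as follows; the Field helper, the array sizes and the loop-based formulation are illustrative assumptions, not the actual stencil-library code used in the COSMO dynamics.

    #include <cstddef>
    #include <vector>

    // Illustrative flat 3D field with (i,j,k) indexing.
    struct Field {
        std::size_t ni, nj, nk;
        std::vector<double> v;
        Field(std::size_t ni_, std::size_t nj_, std::size_t nk_)
            : ni(ni_), nj(nj_), nk(nk_), v(ni_ * nj_ * nk_, 0.0) {}
        double& operator()(std::size_t i, std::size_t j, std::size_t k) {
            return v[(i * nj + j) * nk + k];
        }
        double operator()(std::size_t i, std::size_t j, std::size_t k) const {
            return v[(i * nj + j) * nk + k];
        }
    };

    // Horizontal 5-point Laplacian, the operation shown on the slides:
    // lap(i,j,k) = -4*data(i,j,k) + data(i+1,j,k) + data(i-1,j,k)
    //                             + data(i,j+1,k) + data(i,j-1,k)
    void laplacian(const Field& data, Field& lap) {
        for (std::size_t i = 1; i + 1 < data.ni; ++i)
            for (std::size_t j = 1; j + 1 < data.nj; ++j)
                for (std::size_t k = 0; k < data.nk; ++k)
                    lap(i, j, k) = -4.0 * data(i, j, k)
                                 + data(i + 1, j, k) + data(i - 1, j, k)
                                 + data(i, j + 1, k) + data(i, j - 1, k);
    }

    int main() {
        Field data(16, 16, 8), lap(16, 16, 8);
        data(8, 8, 0) = 1.0;      // a single spike in the interior
        laplacian(data, lap);     // lap(8,8,0) is now -4.0, neighbours are 1.0
        return 0;
    }

The point of a stencil library (or DSL) in the rewritten dynamics is that the stencil is expressed once, as on the slide, while the library generates the loop order, tiling and memory layout appropriate for the x86 backend or for the GPU backend.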
Slide 23: COSMO in five years: current and new (2019) code. The current code: main, physics and dynamics all in Fortran, on MPI and the system. The new (2019) code: main and physics in "Python", dynamics in C++ or "Python", together with some tools (Fortran / C++), other tools and grid tools, sitting on a shared infrastructure with multiple backends (BE1, BE…); boundary conditions & halo exchange go through a generic communication library over MPI (or whatever the system provides).

Slide 24: Thank you!