Engineering Breakthroughs at NCSA
Transcription
Engineering Breakthroughs at NCSA
4th International Industrial Supercomputing Workshop, Amsterdam, 23-24 October 2013
Seid Koric, Senior Technical Lead - Private Sector Program at NCSA
Adjunct Professor, Mechanical Science and Engineering Dept., University of Illinois
http://www.ncsa.illinois.edu
koric@illinois.edu

Who uses HPC?
Answer: anyone whose problem cannot fit on a PC or workstation, and/or would take a very long time to run on a PC.
• Molecular Science and Materials Engineering
• Geoscience
• Weather & Climate
• Health/Life Science
• Astronomy
• Finance Modeling

What makes HPC so "High Performance"?
Answer: parallelism: doing many things (computing) at the same time, i.e., a set of independent processors working cooperatively to solve a single problem. (Source: CCT/LSU)

Scalable Speedup (Supercomputing 101)
• Speed-up (Sp) = wallclock on 1 core / wallclock on N cores
• Speed-up reveals the benefit of solving problems in parallel
• Every problem has a "sweet spot"; it depends on the parallel implementation and the problem size
• Real Sp is smaller than theoretical Sp due to: serial portions of the code, load imbalance between CPUs, network latency and bandwidth, specifics of the parallel implementation in the code, I/O, etc.

Think Big!
"It is amazing what one can do these days on a dual-core laptop computer. Nevertheless, the appetite for more speed and memory, if anything, is increasing. There always seems to be some calculations that one wants to do that exceeds available resources. It makes one think that computers have and will always come in one size and one speed: too small and too slow. This will be the case despite supercomputers becoming the size of football fields!"
Tom Hughes, 2001, President of the International Association for Computational Mechanics (IACM)

The Industrial Software Challenge
• Performance: a single core reaches about 10-50% of theoretical peak
• Scalability: promising, but still heavily behind scientific peta-scale applications
• Granularity: communication (slow) vs. computation (fast)
• Load balancing: mapping tasks to cores to promote an equal amount of work
• Serial code portions (remember good old Amdahl's law; see the sketch below)
• I/O strategies: parallel I/O vs. one file per core
• Licensing model for HPC from commercial software vendors (aka ISVs)
• New programming models and accelerators: hybrid (MPI/OpenMP, MPI/OpenACC), GPGPU (CUDA, OpenCL, OpenACC), Xeon Phi (OpenCL)
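To make the Amdahl's-law point concrete, here is a minimal C sketch (not from the talk; the parallel fraction p = 0.95 is purely illustrative) that evaluates the theoretical speedup S(n) = 1 / ((1 - p) + p/n) for core counts like those used in the benchmarks later in the talk:

```c
#include <stdio.h>

/* Amdahl's law: theoretical speedup on n cores for a code whose
 * parallel fraction is p (0 <= p <= 1). The serial remainder (1-p)
 * caps scalability no matter how many cores are added.           */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    const double p = 0.95;                      /* illustrative parallel fraction */
    const int cores[] = {16, 32, 64, 128, 256, 512};
    const int ncases = (int)(sizeof cores / sizeof cores[0]);

    for (int i = 0; i < ncases; ++i)
        printf("%4d cores -> speedup %.1fx\n", cores[i], amdahl_speedup(p, cores[i]));
    return 0;
}
```

Even with 95% of the work parallelized, the serial remainder caps the speedup near 1/(1-p) = 20x, which is why the "sweet spot" appears well before the largest core counts.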
Blue Waters - Sustained Petascale System
• Cray system & storage cabinets: 300
• Compute nodes: 25,000
• Storage bandwidth: 1.2 TB/s
• System memory: 1.5 Petabytes
• Memory per core: 4 GB
• Gemini interconnect topology: 3D Torus
• Usable storage: 400 Petabytes
• Peak performance: ~13 Petaflops
• Number of AMD processors: 50,000
• Number of AMD x86 core modules: 400,000
• Number of NVIDIA GPUs: 4,200

iForge - NCSA's Premier HPC Resource for Industry
• x86 cores: 2,048 (Platform 1), 576 (Platform 2)
• CPU type: "Sandy Bridge" (Platform 1), "Abu Dhabi" (Platform 2)
• Clock: 3.2 GHz (Platform 1), 3.4 GHz (Platform 2)
• Cores/node: 16 (Platform 1), 32 (Platform 2)
• Memory/node: 128 GB, 1600 MHz (Platform 1); 256 GB, 1600 MHz (Platform 2)
• Global RAMdisk: 1.5 Terabytes
• Total memory: 21 Terabytes
• Storage: 700 Terabytes
• File system: GPFS
• Interconnect: 40 Gigabit QDR InfiniBand
• MPI: Platform, Intel, MVAPICH2, OpenMP
• Operating system: Red Hat Enterprise Linux 6.4

Massively Parallel Linear Solvers in Implicit FEA
• Implicit FEA codes spend 70-80% of their time solving large systems of linear equations, Ax=b, where A is sparse, i.e., most coefficients are zero
• A wide range of applications: finite element solid mechanics, computational fluid dynamics, reservoir simulation, circuit design, linear programming, etc.

FE Model with Global Stiffness Matrix
[Figure: finite element model and the sparsity pattern of its global stiffness matrix]

Problem Specification (Matrices)
• Originate either from in-house industrial and academic codes or from a commercial FE code solving real-world engineering problems, mostly with unstructured automatic meshes
• Mostly SPD with N = 1-80M, NNZ = 120-1600M
• Condition numbers 10^3-10^8

Problem Specification (Solvers)
• WSMP: direct solver from IBM/Watson, based on the multifrontal algorithm, hybrid (MPI & p-threads), symmetric and nonsymmetric
• SuperLU: direct solver developed by LBNL, LU decomposition, MPI, nonsymmetric
• MUMPS: direct solver funded by CEC ESPRIT IV, multifrontal algorithm, MPI, symmetric and nonsymmetric
• Hypre: iterative solver, LLNL, Conjugate Gradient with AMG, IC, and SAI (Sparse Approximate Inverse) preconditioners, MPI, symmetric
• PETSc: iterative solver, ANL, Conjugate Gradients (CG), Bi-Conjugate Gradient Stabilized (BCGS), and Conjugate Residual (CR) with Bjacobi, ASM (Additive Schwarz), and AMG (Multi-Grid) preconditioners, MPI, symmetric and nonsymmetric (a minimal KSP usage sketch follows the charts below)
• Commercial FEA codes (NDA)

Solver Work in Progress (iForge)
Matrix 1M: SPD, N=1.5M, NNZ=63.6M, COND=6.9E4
[Chart: solution time in seconds (lower = better) on 16-256 cores for CG/Bjacobi, BCGS/Bjacobi, BCGS/ASM, and CR/Bjacobi (PETSc, Rconv=1.E-5), PCG/ParaSails (Hypre, Rconv=1.E-5), and the direct solvers MUMPS (SPD), WSMP (SPD), and SuperLU (unsymmetric)]

10x Larger Problem
Matrix 20M: SPD, N=20.05M, NNZ=827.49M, COND=~1.E7
[Chart: solution time in seconds (lower = better) on 16-512 cores for CR/Bjacobi (PETSc, Rconv=1.0E-5), PCG/ParaSails (Hypre), and the direct solvers WSMP (SPD) and MUMPS (SPD)]
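The PETSc entries in the two charts above pair a Krylov method with a preconditioner through the KSP interface. The following minimal C sketch, assuming a recent PETSc installation, shows the CG/Bjacobi combination at the Rconv = 1.E-5 tolerance on a toy 1D Laplacian; it illustrates the interface only and is not the benchmark code itself.

```c
#include <petscksp.h>

/* Minimal PETSc sketch: solve a toy SPD tridiagonal system with
 * CG + block-Jacobi, the "CG/Bjacobi" combination from the chart.
 * Run with, e.g.:  mpiexec -n 4 ./ex_cg_bjacobi                   */
int main(int argc, char **argv)
{
    Mat A; Vec x, b; KSP ksp; PC pc;
    PetscInt i, rstart, rend, n = 1000;

    PetscInitialize(&argc, &argv, NULL, NULL);

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &rstart, &rend);
    for (i = rstart; i < rend; i++) {            /* 1D Laplacian stencil */
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);                      /* Conjugate Gradients         */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);                    /* block-Jacobi preconditioner */
    KSPSetTolerances(ksp, 1.e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
    KSPSetFromOptions(ksp);                      /* allow run-time overrides    */
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}
```

Because of KSPSetFromOptions, the same executable can be switched to the other chart entries at run time, e.g. -ksp_type bcgs -pc_type asm.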
The Largest Linear System Solved with a Direct Solver
N>40M, NNZ>1600M, Cond=4.E7
[Charts: 40M-matrix solution time in seconds (lower = better) and speedup (higher = better) on 16-512 cores for PETSc, Hypre, and WSMP]

WSMP Performance on iForge
Watson Sparse Matrix Package, hybrid (MPI/Pthreads) symmetric solver, N=2.8M, NNZ=107M
[Chart: sparse factorization performance in TFlop/s (higher = better) on 128-960 threads, X5690/Westmere vs. E5-2670/Sandy Bridge]

ISV Implicit FEA Benchmark on iForge
ABAQUS model:
• Number of elements: 2,274,403
• Number of nodes: 12,190,073
• Number of DOFs: >30M
ABAQUS analysis job:
• Solver: Direct Sparse
• Cluster: iForge
• Number of cores used: 24-196
• Wall clock reduced from 7 hours to 1 hour
[Chart: wall clock time in seconds vs. number of cores]

Explicit FEA: LS-Dyna on Blue Waters
• NCSA/PSP, the hardware vendor (Cray), the ISV (LSTC), and a PSP partner (NDA) all working together!
• Real geometry, loads, and BCs; a highly nonlinear transient dynamic problem with difficult contact conditions
• The MPP Dyna solver is fully ported and optimized for the Cray Linux Environment, taking full advantage of the Gemini interconnect

LS-Dyna Breakthrough on Blue Waters
26.5M nodes, 80M DOFs
[Chart: wall clock time in hours (lower = better) on 512-8192 CPU cores for iForge (MPI), Blue Waters (MPI), and Blue Waters (Hybrid)]

Reaching Scalability on 10,000 Cores
LS-Dyna parallel scalability, Molding-10m-8x model (>70M nodes, >41M elements)
[Chart: wall clock time in hours (lower = better) on 512-10,240 cores, Blue Waters Cray XE6 vs. iForge Intel SB]
Highest known scaling of any ISV FEA code to date!

Typical MPP-Dyna Profiling
As scaling increases, performance becomes increasingly determined by communication!
[Figure: computing vs. communication share of runtime at 64 cores and at 512 cores]

LS-DYNA Work in Progress
• Benchmarking even larger real-world problems
• Memory management is becoming a serious issue for DP (decomposition, distribution, MPMD, etc.)
• The hybrid (MPI/OpenMP) solver uses less memory and less communication (the generic pattern is sketched below)
• Load balance in contact and rigid body algorithms
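As a concrete illustration of that hybrid bullet, here is a generic MPI/OpenMP sketch in C. It is not LS-DYNA source code, and the work loop is purely illustrative: each MPI rank owns one domain partition and spawns OpenMP threads inside it, so a node needs fewer partitions, less replicated memory, and fewer messages than with one MPI rank per core.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Generic hybrid MPI/OpenMP pattern: ranks communicate, threads compute. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Ask for an MPI library that tolerates threaded ranks. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* Threads share this rank's memory and split its part of the work. */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + (double)(i + rank));

    /* Only ranks take part in communication: one reduction per rank,
     * not one per core.                                               */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, sum = %f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

Launched as, say, 2 ranks of 16 threads on a 32-core node instead of 32 single-threaded ranks, the collective involves 16x fewer MPI participants per node.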
Star-CCM+ Breakthrough on Blue Waters
Source: NCSA Private Sector Partner "B" (confidential)
• Code/Version: Star-CCM+ 7.6.9
• Physics: transient, turbulent, single-phase compressible flow
• Mesh size: 21.4 million unstructured polyhedral cells
• Complexity: very complicated geometry, high-resolution mesh
• Complex real-life production case: a highly complex CFD case both in terms of the mesh and the physics involved

CD-adapco Star-CCM+ Case from "Partner B"
[Chart: iterations per simulation hour (higher = better) vs. CPU cores from 128 to 2048 on iForge and Blue Waters; scaling with InfiniBand levels off at 256 cores]
Highest known scaling of Star-CCM+ to date… and we broke the code!

The Future of HPC
A view from 11/2010

Future of HPC: GPGPU Computing?

OpenACC: Lowering Barriers to GPU Programming

Minimize Data Movement!
The name of the game in GPGPU.
[Figure: CPU and GPU memories connected by the PCI bus]

OpenACC Example: Solving the Laplace (Heat) 2D Equation with FDM
Iteratively converges to the correct value (temperature) by computing new values at each point from the average of the neighboring points:
\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = 0
T_{i,j}^{k+1} = \frac{T_{i-1,j}^{k} + T_{i+1,j}^{k} + T_{i,j-1}^{k} + T_{i,j+1}^{k}}{4}
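A condensed C/OpenACC version of this Jacobi iteration is sketched below; it is not the exact benchmark code, and the grid size and tolerance are illustrative. It keeps both temperature arrays resident on the GPU across the whole iteration loop (the "minimize data movement" point) and uses the flattened a[i*ncol+j] indexing mentioned on the "OpenACC - Today and Tomorrow" slide further on.

```c
#include <math.h>
#include <stdlib.h>

#define N 4096          /* illustrative grid size, matching the 4096x4096 benchmark */

/* Jacobi sweep for the 2D Laplace equation with OpenACC.
 * T and Tnew are flattened 1D arrays indexed as T[i*N + j]. */
void laplace2d(double *restrict T, double *restrict Tnew, int max_iter, double tol)
{
    double err = 1.0;
    int iter = 0;

    /* Keep both grids on the GPU for the whole iteration loop,
     * so only the scalar error crosses the PCI bus each sweep. */
    #pragma acc data copy(T[0:N*N]) create(Tnew[0:N*N])
    while (err > tol && iter < max_iter) {
        err = 0.0;

        #pragma acc parallel loop reduction(max:err) collapse(2)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                Tnew[i*N + j] = 0.25 * (T[(i-1)*N + j] + T[(i+1)*N + j]
                                      + T[i*N + (j-1)] + T[i*N + (j+1)]);
                err = fmax(err, fabs(Tnew[i*N + j] - T[i*N + j]));
            }

        #pragma acc parallel loop collapse(2)
        for (int i = 1; i < N - 1; i++)     /* copy interior back for the next sweep */
            for (int j = 1; j < N - 1; j++)
                T[i*N + j] = Tnew[i*N + j];

        iter++;
    }
}

int main(void)
{
    double *T    = calloc((size_t)N * N, sizeof(double));
    double *Tnew = calloc((size_t)N * N, sizeof(double));
    for (int j = 0; j < N; j++) T[j] = 100.0;   /* hot top boundary row */
    laplace2d(T, Tnew, 1000, 1.0e-4);
    free(T); free(Tnew);
    return 0;
}
```

Built with an OpenACC compiler (e.g. the PGI compilers of that era with -acc), the loops are offloaded to the GPU; without the flag the pragmas are ignored and the same source runs as plain serial C.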
Laplace 2D: Single-Node Performance, OpenACC vs. OpenMP
Laplace 2D (4096x4096)
[Chart: wall clock in seconds (lower = better) for CPU only (1 OpenMP thread), CPU only (6 OpenMP threads), and GPU (OpenACC) on Blue Waters XK7 (Interlagos/Kepler) and KIDS (Westmere/Fermi); 14x speedup with OpenACC]

Multinode Performance: Hybrid Laplace 2D Solvers
Distributed 2D Laplace hybrid solvers, 8196x8196 grid size
[Chart: speedup with respect to a serial CPU run (higher = better) on 1-16 nodes for XE6 (MPI+OpenMP) and XK7 (MPI+OpenACC), with the corresponding MPI ranks and OMP/OACC threads]

OpenACC - Today and Tomorrow
• OpenACC compilers are still in development (had to use a[i*ncol+j] instead of a[i][j], etc.)
• GPU (CUDA)-aware MPI: passing device (GPU) buffer pointers to MPI directly instead of staging GPU buffers on the host (CPU)
• GPU/CPU load balancing: distribute the domain unequally and let the GPU work on the largest chunk while CPU threads work on smaller chunks to keep the other CPU cores on a node busy
• OpenACC programming for multiple GPUs attached to a CPU node
• OpenACC merging with the OpenMP standard, with Xeon Phi support?

Multinode GPU Acceleration
Abaqus/Standard 6.11, Cluster Compatibility Mode, S4B benchmark (5.23M DOFs)
[Chart: parallel speedup with respect to a serial CPU run (higher = better) on 0.5-6 nodes, Cray XE6 (CPU only) vs. Cray XK7 (CPU+GPU)]

NDEMC Public-Private Partnership
• US OEMs have gained a competitive edge through the use of high performance computing (HPC) with modeling, simulation, and analysis (MS&A).
• The US Council on Competitiveness recognized that small and medium-sized enterprises (SMEs) are not able to take advantage of HPC.
• In the fall of 2011 a pilot program was started in the Midwestern supply base.

NDEMC: Multiphysics Simulation of a CAC
Objective: study the fatigue life of a charge air cooler (CAC) subject to thermal stresses for the NDEMC project.
Description: a three-step, sequentially coupled simulation (15M nodes)
(1) CFD analysis of turbulent fluid flow through the CAC, coupled with advective heat transfer, provides thermal boundary conditions for the FEA.
(2) Thermo-mechanical FEA provides the transient thermal stresses in the solid part during the thermal cycle for the fatigue analysis.
(3) A fatigue model uses the history of thermal stresses to estimate the cycle life at critical points.

XSEDE ECSS Project
3D Study of Elastic-Plastic Transition and Fractal Patterns of a 1-Million-Grain Cube of Grade 316 Steel (2010-2012)
(M. Ostoja-Starzewski, Jun Li, S. Koric, A. Saharan, Philosophical Magazine, 2012)
• Largest nonhomogeneous FEA simulations to date
• Each of the 1 million elements (grains) has a different material property
• The fractal dimension can be used to estimate the level of plasticity for damage assessment of various structures
• Now aiming at (much) larger simulations on Blue Waters with ParaFEM!

Continuous Casting Consortium at UIUC, Steel Dynamics Inc., NCSA
• Molten steel freezes against the water-cooled walls of a copper mold to form a solid shell.
• Initial solidification occurs at the meniscus and is responsible for the surface quality of the final product.
• Thermal strains arise due to volume changes caused by temperature changes and phase transformations.
• Inelastic strains develop due to both strain-rate-independent plasticity and time-dependent creep.
• Superheat flux comes from turbulent fluid flow mixing in the liquid pool.
• Ferrostatic pressure pushes against the shell, causing it to bulge outwards.
• Mold distortion and mold taper (the slant of the mold walls to compensate for shell shrinkage) affect the mold shape and the interfacial gap size.
Objective: a multiphysics approach simulating all three phenomena (fluid flow, heat transfer, and stress)

Thermo-Mechanical Model
Breakout shell thickness comparison between model and plant data (Hibbeler, Koric, Thomas, Xu, Spangler, CCC-UIUC, SD Inc., 2009)
Mismatch due to uneven superheat distribution!

Power of Multiphysics (Thermo-Mechanical-CFD Model)
Less superheat at the WF (Koric, Hibbeler, Liu, Thomas, CCC-UIUC, 2011)
The HPC Innovation Excellence Award 2011
[Figure: shell thickness vs. distance from meniscus, 0-0.6 m]

Special Thanks
• Prof. Martin Ostoja-Starzewski (MechSE, UIUC)
• Prof. Brian G. Thomas and CCC
• Dr. Ahmed Taha (NCSA)
• Cray
• 2 PSP partner companies (NDA)
• NDEMC
• LSTC
• IBM Watson (Dr. Anshul Gupta)
• Simulia (Dassault Systèmes)
• Blue Waters Team