Maximum Performance Computing for Exascale Applications
Oskar Mencer, July 2012

Challenges
Scientific Computing is a small market with a large impact on society: Medicine, Earth Science, Physics, Chemistry, BioChemistry, ...

Efficiency
§ What is the maximum amount of computation per Watt we could get?
§ Exascale: ExaBytes at Exaflops
§ Operational Costs: Exa$ and ExaWatts?

Microprocessors
§ ISCA makes computer architecture research boring
§ Intel-ISA dominance
§ Von Neumann Architecture
§ IEEE Floating Point abstraction
§ If performance depends on data movement, Amdahl's Law does not apply.

Parallel Programming
§ David May: Compilers improve 2x in >= 10 years (but SW efficiency halves every 18 months)
§ Parallel Programming is HARD
§ Reading parallel programs is "impossible"

Limits of Computation
Objective: Maximum Performance Computing (MPC). What is the fastest we can compute desired results?
Conjecture: Data movement is the real limit on computation.

Maximum Performance Computing (MPC)
Less Data Movement = Less Data * Less Movement
The journey will take us through:
1. Information Theory: Kolmogorov Complexity
2. Optimisation via Kahneman and Von Neumann
3. Real World Dataflow Implications and Results

Kolmogorov Complexity (K)
Definition (Kolmogorov): "If a description of string s, d(s), is of minimal length, [...] it is called a minimal description of s. Then the length of d(s), [...] is the Kolmogorov complexity of s, written K(s), where K(s) = |d(s)|."
Of course K(s) depends heavily on the language L used to describe actions in K (e.g. Java, Esperanto, an executable file, etc.).
Kolmogorov, A.N. (1965). "Three Approaches to the Quantitative Definition of Information". Problems Inform. Transmission 1 (1): 1-7.

A Maximum Performance Computing Theorem
For a computational task f, computing the result r given inputs i, i.e. task f: r = f(i), or i → f → r.
Assuming infinite capacity to compute and remember inside box f, the time T to compute task f depends on moving the data in and out of the box. Thus, for a machine f with infinite memory and infinitely fast arithmetic, the Kolmogorov complexity K(i + r) defines the fastest way to compute task f.

SABR model:
dF_t = σ_t F_t^β dW_t
dσ_t = α σ_t dZ_t
<dW, dZ> = ρ dt
We integrate in time (Euler in the log-forward, Milstein in the volatility):
ln F_{t+1} = ln F_t − ½ (σ_t exp((β−1) ln F_t))² dt + σ_t exp((β−1) ln F_t) ΔW_t
σ_{t+1} = σ_t + α σ_t ΔZ_t + ½ (α σ_t)(α)(ΔZ_t² − dt)
[Diagram: dataflow logic feeding back into the state (σ, F)]
The representation K(σ, F) of the state (σ, F) is critical!
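To make the discretisation above concrete, here is a minimal C sketch of one simulated path using the log-forward Euler / volatility Milstein update from the slide. The parameter values, path length and Box-Muller normal generator are illustrative assumptions, not part of the original material, and this is not Maxeler's dataflow implementation.

/* Minimal sketch of the SABR update: Euler in ln F, Milstein in sigma.
 * Parameters and the RNG are illustrative assumptions. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define PI 3.14159265358979323846

/* crude standard-normal sample via Box-Muller (illustration only) */
static double randn(void) {
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    double F0 = 100.0, sigma = 0.2;              /* state: forward F and volatility sigma */
    double alpha = 0.3, beta = 0.5, rho = -0.4;  /* SABR parameters (assumed values) */
    double dt = 1.0 / 252.0, sqdt = sqrt(dt);
    double lnF = log(F0);

    for (int t = 0; t < 252; t++) {
        /* correlated Brownian increments with <dW, dZ> = rho dt */
        double z1 = randn(), z2 = randn();
        double dW = sqdt * z1;
        double dZ = sqdt * (rho * z1 + sqrt(1.0 - rho * rho) * z2);

        double s = sigma * exp((beta - 1.0) * lnF);   /* sigma_t * F_t^(beta-1) */

        /* Euler step in the log-forward */
        lnF += -0.5 * s * s * dt + s * dW;

        /* Milstein step in the volatility */
        sigma += alpha * sigma * dZ
               + 0.5 * (alpha * sigma) * alpha * (dZ * dZ - dt);
    }
    printf("F_T = %f, sigma_T = %f\n", exp(lnF), sigma);
    return 0;
}

Written this way, the slide's point stands out: per time step only the state (σ, F) and two random increments move, so how that state is represented bounds how fast the loop can run.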
MPC – Bad News
1. Real computers do not have either infinite memory or infinitely fast arithmetic units.
2. Kolmogorov Theorem: K is not a computable function.

MPC – Good News
Today's arithmetic units are fast enough. So in practice, Kolmogorov Complexity => MPC depends on the Representation of the Problem.

Euclid's Elements, representing a² + b² = c²

17 × 24 = ?

Thinking Fast and Slow
Daniel Kahneman, Nobel Prize in Economics, 2002
Back to 17 × 24. Kahneman splits thinking into:
System 1: fast, hard to control ... 400
System 2: slow, easier to control ... 408

Remembering Fast and Slow
John von Neumann, 1946: "We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding, but which is less quickly accessible."

Consider Computation and Memory Together
Computing f(x) in the range [a, b] with error |E| ≤ 2⁻ⁿ:
Table — § uniform vs non-uniform § number of table entries
Table + Arithmetic (+, −, ×, ÷) — § how many coefficients
Arithmetic (+, −, ×, ÷) — § polynomial or rational approximation § continued fractions § multi-partite tables
The underlying hardware/technology changes the optimum.
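As a concrete illustration of the trade-off on the slide above, here is a small C sketch. The choices f(x) = eˣ on [0, 1], a 257-entry uniform table and one linear-interpolation multiply-add are assumptions for illustration: more table entries mean less arithmetic per evaluation, more coefficients mean a smaller table, and the target |E| ≤ 2⁻ⁿ fixes how much of each is needed.

/* Sketch of f(x) ~ table lookup + linear interpolation on [a, b].
 * f = exp and the table size are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

#define ENTRIES 257                  /* table size: the memory vs arithmetic knob */
static const double A = 0.0, B = 1.0;
static double table[ENTRIES];

static void build_table(void) {
    for (int i = 0; i < ENTRIES; i++)
        table[i] = exp(A + (B - A) * i / (ENTRIES - 1));
}

/* one lookup plus one multiply-add; the error shrinks as the table grows
 * (or as higher-order coefficients are added instead of more entries) */
static double approx_exp(double x) {
    double pos = (x - A) / (B - A) * (ENTRIES - 1);
    int i = (int)pos;
    if (i >= ENTRIES - 1) i = ENTRIES - 2;
    double frac = pos - i;
    return table[i] + frac * (table[i + 1] - table[i]);
}

int main(void) {
    build_table();
    double max_err = 0.0;
    for (double x = A; x <= B; x += 1e-4) {
        double e = fabs(approx_exp(x) - exp(x));
        if (e > max_err) max_err = e;
    }
    printf("max |E| = %g (about 2^%.1f)\n", max_err, log2(max_err));
    return 0;
}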
MPC in Practice
Tradeoff Representation, Memory and Arithmetic

From Theory to Practice
Optimise Whole Programs, Customise Architecture: Method, Iteration, Processor, Discretisation, Storage, Bit Level, Representation, Customise Numerics

Example: data flow graph generated by MaxCompiler
4866 static dataflow cores in 1 chip

Mission Impossible?
Maximum Performance Computing (MPC)
Less Data Movement = Less Data + Less Movement
The journey will take us through:
1. Information Theory: Kolmogorov Complexity
2. Optimisation via Kahneman and Von Neumann
3. Real World Dataflow Implications and Results

8 Maxeler DFEs replacing 1,900 Intel CPU cores
Presented by ENI at the Annual SEG Conference, 2010.
[Chart: Equivalent CPU cores (compared to 32 3GHz x86 cores parallelized using MPI) vs. number of MAX2 cards (1, 4, 8), at 15Hz, 30Hz, 45Hz and 70Hz peak frequency]
100kWatts of Intel cores => 1kWatt of Maxeler Dataflow Engines

Example: Sparse Matrix Computations
O. Lindtjorn et al, HotChips 2010
Given matrix A and vector b, find vector x in Ax = b.
DOES NOT SCALE BEYOND SIX x86 CPU CORES. MAXELER SOLUTION: 20-40x in 1U.
[Chart: Speedup per 1U Node vs. Compression Ratio for matrices GREE0A and 1new01]
Domain Specific Address and Data Encoding (*Patent Pending)

Example: JP Morgan Derivatives Pricing
O. Mencer, S. Weston, Journal on Concurrency and Computation, July 2011.
• Compute value and risk of complex credit derivatives
• Moving the overnight run to real-time intra-day
• Reported speedup: 220-270x (8 hours => 2 minutes)
• 2011: American Finance Technology Award for Most Cutting Edge IT Initiative

Validated Maximum Performance Computing
Customers comparing 1 box from Maxeler (in a deployed system) with 1 box from Intel:
Seismic: App1 19x, App2 25x
Finance: App1 32x, App2 29x
Fluid Flow: 30x
Weather: 60x
Sensor Trace Processing: App1 22x, App2 22x
Imaging / Preprocessing: App1 26x, App2 30x

Optimise Whole Programs with Finite Resources
[Diagram: SYSTEM 1 (x86 cores) with Low Latency Memory; SYSTEM 2 (flexible memory + logic) with High Throughput Memory; balance computation and memory]

The Ideal System 2 is a Production Line
[Same diagram: SYSTEM 1 (x86 cores) with Low Latency Memory; SYSTEM 2 (flexible memory + logic) with High Throughput Memory; balance computation and memory]

MPC-X1000
1U dataflow cloud providing dynamically scalable compute capability over Infiniband
• 8 Vectis dataflow engines (DFEs)
• 192GB of DFE RAM
• Dynamic allocation of DFEs to conventional CPU servers – zero-copy RDMA between CPUs and DFEs over Infiniband
• Equivalent performance to 40-60 x86 servers

Datacenter Qualified Dataflow Solutions
Integrated Engines (Cards), 1U nodes, Racks, MaxelerOS, MaxCompiler:
The Dataflow Appliance: dense compute with 8 DFEs, 384GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections
High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
MaxWorkstation: desktop dataflow development system
MaxRack: 10, 20 or 40 node rack systems integrating compute, networking & storage
MaxCloud: hosted, on-demand, scalable accelerated compute
Dataflow Engines: 48GB DDR3, high-speed connectivity and dense configurable logic

Architecture Model
[Diagram: Host application on the CPU uses SLiC and MaxelerOS; the interconnect links the CPU and its memory to the dataflow engine, where a Manager coordinates Kernels and on-card Memory]

Programming with MaxCompiler
Application code in C / C++ / Fortran calls the dataflow engine through SLiC; kernels and managers are written in MaxJ and built with MaxCompiler.

MaxCompiler Development Process (1): the original CPU code
CPU Code (.c):
int *x, *y;
for (int i = 0; i < DATA_SIZE; i++)
    y[i] = x[i] * x[i] + 30;
i.e. y_i = x_i × x_i + 30

MaxCompiler Development Process (2): CPU code, Manager and Kernel
The CPU loop is replaced by a call into the generated SLiC interface; the Manager routes both streams over PCI Express.
CPU Code (.c):
#include "MaxSLiCInterface.h"
#include "Calc.max"
int *x, *y;
Calc(x, y, DATA_SIZE);
Manager (.java):
Manager m = new Manager("Calc");
Kernel k = new MyKernel();
m.setKernel(k);
m.setIO(link("x", PCIE), link("y", PCIE));
m.addMode(modeDefault());
m.build();
MyKernel (.java):
HWVar x = io.input("x", hwInt(32));
HWVar result = x * x + 30;
io.output("y", result, hwInt(32));

MaxCompiler Development Process (3): writing results to on-card memory
Here the host opens the device explicitly and the output stream y goes to DFE DRAM (DRAM_LINEAR1D) instead of back over PCI Express; the kernel is unchanged.
Host Code (.c):
#include "MaxSLiCInterface.h"
#include "Calc.max"
int *x, *y;
device = max_open_device(maxfile, "/dev/maxeler0");
Calc(x, DATA_SIZE);
Manager (.java):
Manager m = new Manager();
Kernel k = new MyKernel();
m.setKernel(k);
m.setIO(link("x", PCIE), link("y", DRAM_LINEAR1D));
m.addMode(modeDefault());
m.build();
MyKernel (.java):
HWVar x = io.input("x", hwInt(32));
HWVar result = x * x + 30;
io.output("y", result, hwInt(32));
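To show how the pieces of the development process fit together on the host side, here is a minimal sketch, assuming the SLiC interface generated from the Calc kernel exposes the streaming call Calc(x, y, DATA_SIZE) exactly as on the slide. It needs the MaxCompiler-generated Calc.max and the SLiC library to build, so it is a sketch under those assumptions rather than a standalone program.

/* Host-code sketch for the x*x + 30 example, assuming the generated
 * SLiC function Calc(x, y, DATA_SIZE) streams x to the DFE over PCIe
 * and streams y back. */
#include <stdio.h>
#include <stdlib.h>
#include "MaxSLiCInterface.h"
#include "Calc.max"

#define DATA_SIZE 1024

int main(void) {
    int *x = malloc(DATA_SIZE * sizeof(int));
    int *y = malloc(DATA_SIZE * sizeof(int));
    for (int i = 0; i < DATA_SIZE; i++)
        x[i] = i;

    /* run the dataflow kernel: y[i] = x[i] * x[i] + 30 */
    Calc(x, y, DATA_SIZE);

    /* check against the original CPU loop from the slide */
    for (int i = 0; i < DATA_SIZE; i++) {
        if (y[i] != x[i] * x[i] + 30) {
            printf("mismatch at %d\n", i);
            return 1;
        }
    }
    printf("all %d results match\n", DATA_SIZE);
    free(x);
    free(y);
    return 0;
}

Note that the MyKernel MaxJ code is identical in both Manager variants above; only the link() settings change, which is what lets the same kernel either stream y back over PCI Express or write it to on-card DRAM.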