Xeon Phi - Intel Developer Zone
Parallel Computing and Intel® Xeon Phi™ Coprocessors
Warren He, Ph.D. (何万青)
Wanqing.he@intel.com
Technical Computing Solution Architect
Datacenter & Connected System Group

Transition from Multicore to Manycore

Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo

Scaling workloads for Intel® Xeon Phi™ coprocessors
[Figure: performance as a function of fraction parallel and % vector. Vectorize, parallelize, then scale to manycore.]
* Theoretical acceleration of a highly parallel processor over an Intel® Xeon® parallel processor

Intel® Xeon Phi™ Coprocessor
[Block diagram: vector IA cores, each with a coherent cache, joined by an interprocessor network; memory and I/O interfaces (PCIe, GDDR5); fixed function logic.]

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order. Knights Corner and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

All SKUs, pricing and features are PRELIMINARY and are subject to change without notice

Silicon
  Cores:                              57-61
  Max Freq:                           1.05-1.25 GHz
  Double Precision Peak Performance:  1003-1220 GFLOPS
GDDR5
  Devices:                            24-32
  Memory Capacity:                    6-8GB
  Memory Channels:                    12-16
  Peak Memory BW:                     240-352 GB/s
Form Factors:  Refer to picture: passive, active, dense form factor (DFF), no thermal solution (NTS)
Total Cache:   28.5-30.5MB
Board TDP:     225-300 Watts
PCIe:          PCIe Gen 2 x16, I/O speeds up to: KNC→Host: 7.0 GB/s; Host→KNC: 6.7 GB/s
Misc.
• Product codename: Knights Corner
• Based on 22nm process
• ECC enabled
• Turbo not currently available
Power/Mgmt
• Future node manager support
• IMPI including Intel Coprocessor Communications Link
Tools
• Intel C++/C/FORTRAN compilers, Intel Math Kernel Library
• Debuggers, Performance and Correctness Analysis Tools
• OpenMP, MPI, OFED messaging infrastructure (Linux only), OpenCL
• Programming Models: Offload, Native and mixed Offload+Native
OS
• RHEL 6.x, SuSE 11.1, Microsoft Windows 8 Server, Win 7, Win 8

Intel® Many Integrated Core Architecture: Processor Numbering Overview
Example: Intel® Xeon Phi™ coprocessor 7 1 10 P
• Brand (Intel® Xeon Phi™ coprocessor): designates the family of products
• Performance Shelf (7): indication of relative pricing and performance within generation
  – 7 Best Performance
  – 5 Best Performance/Watt
  – 3 Best Value
• Generation (1)
  – 1 Knights Corner
  – 2 Knights Landing
• SKU Number (10)
• Suffix (P): where applicable, may denote form factor, thermal solution, usage model
  – P Passive Thermal Solution
  – A Active Thermal Solution
  – X No Thermal Solution
  – D Dense Form Factor

Software Development Ecosystem(1) for Intel® Xeon Phi™ Coprocessor (Open Source and Commercial)
• Compilers, Run environments: gcc (kernel build only, not for applications), python*, Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* 3.2.5 (Beta) compiler, PGI*, PGAS GPI (Fraunhofer ITWM*), ISPC
• Debugger: gdb, Intel Debugger, RogueWave* TotalView* 8.9, Allinea* DDT 3.3
• Libraries: TBB(2), MPICH2 1.5, FFTW, NetCDF, NAG*, Intel® Math Kernel Library, Intel® MPI Library, Intel® OpenMP*, Intel® Cilk™ Plus (in Intel compilers), MAGMA*, Accelereyes* ArrayFire 2.0 (Beta), Boost C++ Libraries 1.47+
• Profiling & Analysis Tools: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Intel® Advisor XE, TAU* – ParaTools* 2.21.4
• Virtualization: ScaleMP* vSMP Foundation 5.0, Xen* 4.1.2+
• Cluster, Workload Management, and Manageability Tools: Altair* PBS Professional 11.2, Adaptive* Computing Moab 7.2, Bright* Cluster Manager 6.1 (Beta), ET International* SWARM (Beta), IBM* Platform Computing* {LSF 8.3, HPC 3.2 and PCM 3.2}, MPICH2, Univa* Grid Engine 8.1.3
(1) These are all announced as of November 2012. Intel has said there are more actively being developed but are not yet announced.
(2) Commercial support of Intel TBB available from Intel.

Technical Compute Target Market Segments & Applications
Sub-Segment: Applications/Workloads (Intel® Xeon Phi™ Candidates)
• Public Sector: HPL, HPCC, NPB, LAMMPS, QCD
• Energy (including Oil & Gas): RTM (Reverse Time Migration), WEM (Wave Equation Migration)
• Climate Modeling & Weather Simulation: WRF, HOMME
• Financial Analysis: Monte Carlo, Black-Scholes, Binomial model, Heston model
• Life Sciences (Molecular Dynamics, Gene Sequencing, BioChemistry): LAMMPS, NAMD, AMBER, HMMER, BLAST, QCD
• Manufacturing CAD/CAM/CAE/CFD/EDA: Implicit, Explicit Solvers
• Digital Content Creation: Ray Tracing, Animation, Effects
• Software Ecosystem: Tools, Middleware
Mix of ISV and End User developm

Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo

Intel® Xeon Phi™ coprocessor codenamed “Knights Corner” Block Diagram
• Distributed tag directory enables faster snooping
• High-speed bi-directional ring interconnect
• Cores: lightweight, power efficient, in-order, support 4 hardware threads
• Vector Processing Unit (VPU) on each core, 32 512-bit SIMD registers

Speeds and Feeds
Per-core caches reference info:
  Type          Size    Ways  Set conflict  Location
  L1I (instr)   32KB    4     8KB           on-core
  L1D (data)    32KB    8     4KB           on-core
  L2 (unified)  512KB   8     64KB          connected via core/ring interface
Memory:
• 8 memory controllers, each supporting 2 32-bit channels
• GDDR5 channels theoretical peak of 5.5GT/s
• Practical peak BW limited by ring and other factors

Speeds and Feeds, Part 2
Per-core TLBs reference info:
  Type            Entries  Page Size            Coverage
  L1 Instruction  32       4KB                  128KB
  L1 Data         64       4KB                  256KB
                  32       64KB                 2MB
                  8        2MB                  16MB
  L2              64       4KB, 64KB, or 2MB    Up to 128MB
• Linux support for 64K pages on Intel64 not yet available

Knights Corner Core
Intel Confidential
• X86 specific logic < 2% of core + L2 area
• 64-bit
• 512-wide
• In Order
• 4 Threads per Core
• VPU: integer, SP, DP; 3-operand, 16-instruction
• Specialized Instructions (new encoding)
• HW transcendentals
• Fully Coherent L2
• Hardware Prefetching
• L1 Cache [Data and Instruction]: 32 KB per core
• L2 cache: 512 KB Slice per Core – Fast Access to Local Copy
[Core diagram: instruction decode, scalar unit with scalar registers, vector unit with vector registers, L1 cache, L2 cache, interprocessor network.]

Knights Corner Silicon Core Architecture
• Improved ring interconnect delivering 3x bandwidth over Aubrey Isle silicon ring

Vector Processing Unit
[Pipeline diagram: scalar stages PPF, PF, D0, D1, D2, E, WB; vector stages D2, E, VC1, VC2, V1-V4, WB; load (LD) and store (ST) paths; VPU RF (3R, 1W); Mask RF; EMU; Scatter Gather.]
• Vector ALUs: 16 wide x 32 bit, 8 wide x 64 bit
• Fused Multiply Add
Peak Performance = Clock Freq * # of Cores * 8 Lanes (DP) * 2 (FMA) FLOPS/Cycle
e.g. 1.091 GHz * 61 Cores * 8 Lanes * 2 = 1064.8 GFLOPS

Wider vectorization opportunity
Illustrated with the number of 32-bit data processed by one “packed” instruction:
• Intel® Pentium® processor (1993)
• MMX™ (1997)
• Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008)
• Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013)
• Intel® Xeon Phi™ Coprocessor (in 2012)
Effective use of ever increasing number of vector elements and new instruction sets is vital to Intel’s success.
Multi-Cores and Many-Cores on top of this!!
1/7/2013

Knights Corner PCI Express Card Block Diagram
[Figure: card block diagram.]

Intel® Xeon Phi™ Coprocessors Software Architecture (MPSS)

Intel® Xeon Phi™ Coprocessors: They’re So Much More
General purpose IA Hardware leads to less idle time for your investment.
It’s a supercomputer on a chip.
[Figure: capability ladder comparing the Intel® Xeon Phi™ Coprocessor with restrictive architectures (GPU, ASIC, FPGA). Rungs: Operate as a compute node; Run a full OS; Program to MPI; Run x86 code; Run offloaded code; Run restricted code; Custom HW Acceleration.]
Restrictive architectures limit the ability for applications to use arbitrary nested parallelism, function calls and threading models.
Source: Intel Estimates

MIC Software Stack
[Figure: HOST and MIC stacks, side by side. Each has Offload Tools/Applications on top; User Mode: Offload Library, MYO, COI; Linux Kernel Mode: SCIF. The two sides are joined by PCI-E Communications HW.]

Spectrum of Execution Models of Heterogeneous Computing on Xeon and Xeon Phi Coprocessor
[Figure: spectrum from CPU-Centric (Intel® Xeon Processor, multi-core) to Intel® MIC-Centric (Intel® Many Integrated Core, many-core), connected over PCIe. Models shown: Multi-core Hosted, Offload, Symmetric, Reverse Offload, Many-Core Hosted. Workload labels: general purpose serial and parallel computing; codes with highly-parallel phases; codes with balanced needs; codes with serial phases; highly-parallel codes. Each panel shows where Main(), Foo() and MPI_*() run.]
Productive Programming Models Across the Spectrum
Legend: Supported with Intel Tools; Not Currently Supported by Language Ext. Offload for Intel Tools

MPSS Setup Step by Step

Tools for both OEMs and End Users:
• micsmc – A control panel application for system management and viewing configuration status information specific to SMC operations on one or more MIC cards in a single platform.
• micflash – A command line utility to update both the MIC SPI Flash and the SMC Firmware as well as grab various information about the flash from either the ROM on the MIC card or from a ROM file.
• micinfo – A command line utility that allows the OEM and the end user to get various information (version data, memory info, bus speeds, etc.) from one or more MIC cards in a single platform.
• hwcfg – A command line utility that can be used to view and configure various hardware aspects of one or more MIC cards in a single platform.
Tools for OEMs only:
• micdiag – A command line utility that performs various diagnostics on one or more MIC cards in a single platform.
• micerrinj – A command line utility that allows the OEMs to inject one or more ECC errors (correctable, uncorrectable or fatal) into the GDDR on a MIC card to test platform error handling capabilities.
• KncPtuGen – A power/thermal command line utility that can be used to put various loads on one or more individual cores on a MIC card to simulate compromised performance on specific cores.
• micprun (OEM Workloads) – A command line utility that provides an environment for executing various performance workloads (provided by Intel) that allow the OEM to test their platform’s performance over time and compare to Intel’s “Gold” configuration in the factory.
Intel® Many Integrated Core Architecture Marketing

‘micsmc’ – MIC SMC Control Panel
• What does it do?
  – A graphical display of SMC status items, such as temperature, memory usage and core utilization for one or multiple MIC cards on a single platform.
  – Command Line Parameters available for scripting purposes.
• What does it look like?
[Screenshots: ‘micsmc’ Card View Windows: Utilization View, Core Histogram View, Historical Utilization View.]

‘micprun’ – MIC Performance Tool
• What can it do?
  – This is actually a wrapper that can be used to execute the OEM workloads.
  – Neatly wraps up many of the parameters specific to the workload in addition to setting up workload comparisons to previous executions and even to Intel “Gold” system configurations.
• What does it look like?
    ./micprun [-k workload] [-p workloadParams] [-c category] [-x method] [-v level] [-o output] [-t tag] [-r pickle]
  – Workload (-k for kernel) specifies which workload to run... SGEMM, DGEMM, etc.
  – Categories (-c) include “optimal”, “scaling”, “test”.
  – Methods (-x) include “native”, “scif”, “coi” and “myo”.
  – Levels (-v) specify the level of detail in the output.
  – Output (-o) specifies to create a “pickle” file for future comparisons (use Tag (-t) to specify custom name).
  – Pickle (-r) is used to compare the current workload execution to a previously saved workload execution and execute the current run exactly the same as the specified pickle file.

‘micprun’ continued…
• Show me some examples!
  – Run all of the native workloads with optimal parameters:
      ./micprun
      ./micprun -k all -c optimal -x native -v 0
  – Run SGEMM in scaling mode, and save to a pickle file:
      ./micprun -k sgemm -c scaling -o . -t mySgemm
  – Run a comparison to the above example and create a graph:
      ./micprun -r micp_run_mySgemm -v 2
  – View the SGEMM workload parameters:
      ./micprun -k sgemm --help
  – Run SGEMM on only 4 cores:
      ./micprun -k sgemm -p '--num_core 4'

Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo

Considerations for Application Development on MIC
[Figure: considerations for a scientific parallel application, grouped under Workload, Algorithm, Implementation and Communication. Items shown: Dataset Size, Grid Shape, Data Access Patterns, Load Balancing, SPMD, Fork/Join, Loop Structure.]

Considerations in Selecting Workload Size
Applicability of Amdahl's law:
• Speedup = (s+p)/(s+(p/N)) ..(1), where s is the serial fraction, p is the parallel fraction, and N is the number of parallel processors
• Using single processor runtime to normalize, s+p=1:
• Speedup = 1/(s+(p/N))
• Indicates that the serial portion ‘s’ will dominate in massively parallel hardware like Intel® Xeon Phi™, limiting the advantage of the many-core hardware
Ref: “Scientific Parallel Computing”, L. Ridgway Scott et al.

Scaled vs. Fixed Size Speedup
Amdahl’s Law Implication
Amdahl’s Law considered harmful in designing parallel architecture.
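The fixed-size speedup in equation (1) can be evaluated directly. The following is a minimal C sketch of the normalized form, Speedup = 1/(s + p/N). The function names and the sample serial fractions are illustrative assumptions and do not come from the slides.

```c
#include <stdio.h>

/* Fixed-size (Amdahl) speedup with s + p = 1.
 * s: serial fraction in [0, 1]; n: number of parallel processors (n >= 1).
 * Hypothetical helper name, used only for illustration. */
double amdahl_speedup(double s, int n)
{
    double p = 1.0 - s;                 /* parallel fraction */
    return 1.0 / (s + p / (double)n);
}

/* Print the speedup for a few illustrative serial fractions on n processors. */
void print_amdahl_table(int n)
{
    const double fractions[] = { 0.0, 0.01, 0.05, 0.10 };
    for (int i = 0; i < 4; i++) {
        printf("s = %.2f  N = %d  speedup = %.2f\n",
               fractions[i], n, amdahl_speedup(fractions[i], n));
    }
}
```

Calling print_amdahl_table(61), with 61 being the top core count in the specification table, shows how a few percent of serial work caps the benefit: with s = 0.10 the speedup stays below 10 no matter how many cores are added.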
Gustafson’s[1] Observation:
• For scientific computing problems, problem size increases with parallel computing power
• The parallel component of the problem scales with problem size
• Ask “how long would a given parallel program have taken on a serial processor?”
• Scaled Speedup = (s’+p’N)/(s’+p’)
• where s’, p’ represent serial and parallel time spent on the parallel system, s’+p’=1
• Scaled Speedup = s’ + (1-s’)*N = N + (1-N) s’ ..(2)
When speedup is measured by scaling problem size, the serial fraction tends to shrink as more processors are added.

Gustafson’s Law
[Figure: scaled speedup.]
Fig: John L Gustafson, “Reevaluating Amdahl’s Law”

Intel® Xeon Phi™ Coprocessors Workload Suitability
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?
If application scales with threads and vectors or memory BW → Intel® Xeon Phi™ Coprocessors
[Figure: performance versus threads, showing the multi-core peak and the many-core peak. Representative example workload for illustration purposes only.]

A Picture Speaks a Thousand Words

The Keys to Taking Advantage of MIC
• Choose the right Xeon-centric or MIC-centric model for your application
• Vectorize your application
  – Use Intel’s vectorizing compiler
• Parallelize your application
  – With MPI (or other multiprocess model)
  – With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)
• Consider standard optimization techniques such as tiling, overlapping computation and communication, etc.

Intel® Xeon Phi™ SW Imperatives
• In order for the Intel® Xeon Phi™ program to succeed, think differently and guide the programmers to think differently.
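The vectorize and parallelize keys listed above can be combined in a single loop. The following C sketch is an illustration under stated assumptions: the function name scale_add and its parameters are hypothetical, and OpenMP is just one of the threading options the slide lists. Threads split the iteration space, and each thread's unit-stride chunk is simple enough for a vectorizing compiler to turn into SIMD code. If OpenMP is not enabled at compile time, the pragma is ignored and the loop runs serially with the same result.

```c
/* y[i] = a * x[i] + y[i] for i in [0, n).
 * Hypothetical example kernel, not taken from the slides.
 * The restrict qualifiers promise the compiler that x and y do not overlap,
 * which removes the aliasing question that can block vectorization. */
void scale_add(float *restrict y, const float *restrict x, float a, int n)
{
    /* Thread-level parallelism: iterations are divided among threads. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* Unit-stride loads and stores: a candidate for auto-vectorization. */
        y[i] = a * x[i] + y[i];
    }
}
```

A caller would pass two non-overlapping arrays of length n, for example scale_add(y, x, 2.0f, n).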
[Figure: speed versus time for two porting strategies. Naïve port / longer optimization: losing strategy. Intelligent port / shorter optimization: winning strategy.]
Intelligent application porting to Intel® Xeon Phi™ platforms requires parallelization and vectorization.

Algorithm and Data Structure
One needs to consider:
• Type of Parallel Algorithm to use for the problem
  – Look carefully at every sequential part of the algorithm
• Memory Access Patterns
  – Input and output data set
• Communication Between Threads
• Work division between threads/processes
  – Load Balancing
  – Granularity

Parallelism Pattern For Xeon Phi™

Memory Access Patterns
• Algorithm determines the data access patterns.
• Data Access Patterns can be categorized into:
  – Arrays/Linear Data Structures
    – Conducive to low latency and high bandwidth
    – Look out for: Unaligned access
  – Random Access
    – Try to avoid. Needs prefetching if possible
    – Not many threads to hide latency compared to Tesla implementation
  – Recursive (Tree)
• Data Decomposition:
  – Geometric Decomposition
    – Regular
    – Irregular
  – Recursive Decomposition

Vectorization methodologies
• Auto vectorization may be improved by additional pragmas like #pragma IVDEP
• A more aggressive pragma is #pragma SIMD. This pragma uses less strict heuristics and may produce wrong code, so results must be carefully checked.
• Experts may use compiler intrinsics. The caveat is the loss of portability. A portable layer between intrinsics and the application may help.

Compiler has to assume the worst case the language/flag allow.
    float *A;
    void vectorize()
    {
      int i;
      for (i = 0; i < 102400; i++) {
        A[i] *= 2.0f;
      }
    }
Loop Body:
• Load of A
• Load of A[i]
• Multiply with 2.0f
• Store of A[i]
Q: Will the store modify A?
A: Maybe
“NO” is needed to make vectorization legal.
3) Recompile with -ansi-alias
• icc -vec-report1 -ansi-alias test1.c
  – test1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
4) Change “float *A” to “float *restrict A”.
• icc -vec-report1 test1a.c
  – test1a.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
5) Add “#pragma ivdep” to the loop.
• icc -vec-report1 test1b.c
  – test1b.c(5): (col. 3) remark: LOOP WAS VECTORIZED.

Writing Explicit Vector Code with Intel® Cilk™ Plus
Baseline scalar loop:
    float *A;
    void vectorize()
    {
      for (int i = 0; i < 102400; i++)
      {
        A[i] *= 2.0f;
      }
    }
8) Using SIMD Pragma
    float *A;
    void vectorize()
    {
    #pragma simd vectorlength(4)
      for (int i = 0; i < 102400; i++)
      {
        A[i] *= 2.0f;
      }
    }
9) Using Array Notation
    float *A;
    void vectorize()
    {
      A[0:102400] *= 2.0f;
    }
10) Using Elemental Function
    float *A;
    __declspec(noinline)
    __declspec(vector(uniform(p), linear(i)))
    void mul(float *p, int i) { p[i] *= 2.0f; }
    void vectorize()
    {
      for (int i = 0; i < 102400; i++)
      {
        mul(A, i);
      }
    }

Memory Accesses and Alignment
SIZE: 64B for Xeon Phi™, 32B for AVX1/2, 16B for SSE4.2 and below
• Memory Access Patterns
  – Unit-stride: A[i], A[i+1], A[i+2], A[i+3], …
  – Strided (special form of gather/scatter): A[2*i], A[2*(i+1)], A[2*(i+2)], …
  – Gather/Scatter: A[B[i]], A[B[i+1]], A[B[i+2]], …
• Alignment Optimization
  – Aligned unit-stride: Addr % SIZE == 0
  – Misaligned unit-stride: Addr % SIZE != 0
  – Alignment unknown unit-stride: Addr % SIZE == ???
If you write in Cilk™ Plus Array Notation, access patterns are obvious to the eye: unit-stride means A[:] or A[lb:#elems]. This helps you think more clearly.

Memory Access Patterns
    float A[1000];
    struct BB { float x, y, z; } B[1000];
    int C[1000];
    for (i = 0; i < n; i++) {
      a) A[i]                       // unit-stride
      b) A[2*i]                     // strided
      c) B[i].x                     // strided
      d) j = C[i]                   // unit-stride
         A[j]                       // gather/scatter
      e) A[C[i]]                    // same as d)
      f) B[i].x + B[i].y + B[i].z   // 3 strided loads
    }
• Unit-Stride
  – Good, with one exception.
  – Out-of-order cores won’t store-forward masked (unit-stride) store.
  – Avoid storing under the mask (unit-stride or not), but unnecessary stores can also create bandwidth problem. Ouch.
• Strided, Gather/Scatter
  – Perf varies on micro-arch and the actual index patterns.
  – Big latency is exposed if you have these on the critical path
  – F90 Arrays tend to be strided if not carefully used. See this article. Use F2008 “contiguous” attribute.
    http://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization

Data Alignment Optimization
    float *A, *B, *C, *D;
    A[0:n] = B[0:n] * C[0:n] + D[0:n];
• Explain compiler’s assumption on data alignment and how that relates to the generated code.
• Identify different ways to optimize the data alignment of this code.
1) Understand the baseline assumption
    icc -xSSE4.2 -vec-report1 -restrict test2.c
    ifort -xSSE4.2 -vec-report1 ftest2.f
    icc -mmic -vec-report6 test2.c
    ifort -mmic -vec-report6 ftest2.f
   Examine the ASM code.
2) Add “vector aligned”
    icc -xSSE4.2 -vec-report1 test2a.c
    ifort -xSSE4.2 -vec-report1 ftest2a.f
    icc -mmic -vec-report6 test2a.c
    ifort -mmic -vec-report6 ftest2a.f
   Examine the ASM code.
3) Compile with -opt-assume-safe-padding
    icc -mmic -vec-report6 -opt-assume-safe-padding test2.c
    ifort -mmic -vec-report6 -opt-assume-safe-padding ftest2.f
   Examine the ASM code.

Align the Data at Allocation and Tell that to the Compiler at the Use
• C/C++
  – Align at Data Allocation
      __declspec(align(n, [off]))
      _mm_malloc(size, n)
  – Inform the Compiler at Use Sites
      #pragma vector aligned
      __assume_aligned(ptr, n);
      __assume(var % x == 0);
• FORTRAN
  – Align at Data Allocation
      !DIR$ attributes align:n :: var
      -align arrayNbyte (N=16,32,64)
  – Inform the Compiler at Use Sites
      !DIR$ vector aligned
      __assume_aligned(ptr, n);
Reading Material: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization

Array of Struct is BAD
    struct node {
      float x, y, z;
    };
    struct node NODES[1024];
    float dist[1024];
    for (i = 0; i < 1024; i += 16) {
      float x[16], y[16], z[16], d[16];
      x[:] = NODES[i:16].x;
      y[:] = NODES[i:16].y;
      z[:] = NODES[i:16].z;
      d[:] = sqrtf(x[:]*x[:] + y[:]*y[:] + z[:]*z[:]);
      dist[i:16] = d[:];
    }
• Use Struct of Arrays
    struct node1 {
      float x[1024], y[1024], z[1024];
    };
    struct node1 NODES1;
• Use Tiled Array of Structs
    struct node2 {
      float x[16], y[16], z[16];
    };
    struct node2 NODES2[];

Example in Optimizing LBM for MIC
• Updates $f_i(x, t)$
  – Particle distribution function of particles in pos x at time t traveling in direction $e_i$
    $f_i(x + e_i\,dt,\; t + dt) = f_i(x, t) + \Omega_i(f(x, t))$
    $\Omega_i(f(x, t)) = \frac{1}{\tau}\,(f_i^0 - f_i)$
• D3Q19 model
  – Each cell has |ei| = 19 values
  – Uses 19 values at curr node, does ~12 flops/direction (220 flops total) and writes 19 neighbors
  – Has special case computation at boundaries (flag array used here)

NX = NY = NZ = 128
    DO k = 1, NZ
      DO j = 1, NY
    !DIR$ IVDEP
        DO i = 1, NX
          m = i + NXG*( j + NYG*k )
          phin = phi(m)
          rhon = g0(m+now)  + g1(m+now)  + g2(m+now)  + g3(m+now)  + g4(m+now)  &
               + g5(m+now)  + g6(m+now)  + g7(m+now)  + g8(m+now)  + g9(m+now)  &
               + g10(m+now) + g11(m+now) + g12(m+now) + g13(m+now) + g14(m+now) &
               + g15(m+now) + g16(m+now) + g17(m+now) + g18(m+now)
          sFx = muPhin*gradPhiX(m)
          sFy = muPhin*gradPhiY(m)
          sFz = muPhin*gradPhiZ(m)
          ux = ( g1(m+now)  - g2(m+now)  + g7(m+now)  - g8(m+now)  + g9(m+now)  &
               - g10(m+now) + g11(m+now) - g12(m+now) + g13(m+now) - g14(m+now) &
               + 0.5D0*sFx )*invRhon
          uy = ( g3(m+now)  - g4(m+now)  + g7(m+now)  - g8(m+now)  - g9(m+now)  &
               + g10(m+now) + g15(m+now) - g16(m+now) + g17(m+now) - g18(m+now) &
               + 0.5D0*sFy )*invRhon
          uz = ( g5(m+now)  - g6(m+now)  + g11(m+now) - g12(m+now) - g13(m+now) &
               + g14(m+now) + g15(m+now) - g16(m+now) - g17(m+now) + g18(m+now) &
               + 0.5D0*sFz )*invRhon
          f0(m+nxt) = invTauPhi1*f0(m+now) + Af0 + invTauPhi*phin
          f1(m+now) = invTauPhi1*f1(m+now) + Af1 + Cf1*ux
          f2(m+now) = invTauPhi1*f2(m+now) + Af1 - Cf1*ux
          f3(m+now) = invTauPhi1*f3(m+now) + Af1 + Cf1*uy
          f4(m+now) = invTauPhi1*f4(m+now) + Af1 - Cf1*uy
          f5(m+now) = invTauPhi1*f5(m+now) + Af1 + Cf1*uz
          f6(m+now) = invTauPhi1*f6(m+now) + Af1 - Cf1*uz
          g0(m+nxt)                  = invTauRhoOne*g0(m+now)  + ...  !! g0'nxt(i  , j  , k  ) = g0(i,j,k)
          g1(m+nxt + 1)              = invTauRhoOne*g1(m+now)  + ...  !! g1'nxt(i+1, j  , k  ) = g1(i,j,k)
          g2(m+nxt - 1)              = invTauRhoOne*g2(m+now)  + ...  !! g2'nxt(i-1, j  , k  ) = g2(i,j,k)
          g3(m+nxt + NXG)            = invTauRhoOne*g3(m+now)  + ...  !! g3'nxt(i  , j+1, k  ) = g3(i,j,k)
          g4(m+nxt - NXG)            = invTauRhoOne*g4(m+now)  + ...  !! g4'nxt(i  , j-1, k  ) = g4(i,j,k)
          g5(m+nxt + NXG*NYG)        = invTauRhoOne*g5(m+now)  + ...  !! g5'nxt(i  , j  , k+1) = g5(i,j,k)
          g6(m+nxt - NXG*NYG)        = invTauRhoOne*g6(m+now)  + ...  !! g6'nxt(i  , j  , k-1) = g6(i,j,k)
          g7(m+nxt + 1 + NXG)        = invTauRhoOne*g7(m+now)  + ...  !! g7'nxt(i+1, j+1, k  ) = g7(i,j,k)
          g8(m+nxt - 1 - NXG)        = invTauRhoOne*g8(m+now)  + ...  !! g8'nxt(i-1, j-1, k  ) = g8(i,j,k)
          g9(m+nxt + 1 - NXG)        = invTauRhoOne*g9(m+now)  + ...  !! g9'nxt(i+1, j-1, k  ) = g9(i,j,k)
          g10(m+nxt - 1 + NXG)       = invTauRhoOne*g10(m+now) + ...  !! g10'nxt(i-1, j+1, k  ) = g10(i,j,k)
          g11(m+nxt + 1 + NXG*NYG)   = invTauRhoOne*g11(m+now) + ...  !! g11'nxt(i+1, j  , k+1) = g11(i,j,k)
          g12(m+nxt - 1 - NXG*NYG)   = invTauRhoOne*g12(m+now) + ...  !! g12'nxt(i-1, j  , k-1) = g12(i,j,k)
          g13(m+nxt + 1 - NXG*NYG)   = invTauRhoOne*g13(m+now) + ...  !! g13'nxt(i+1, j  , k-1) = g13(i,j,k)
          g14(m+nxt - 1 + NXG*NYG)   = invTauRhoOne*g14(m+now) + ...  !! g14'nxt(i-1, j  , k+1) = g14(i,j,k)
          g15(m+nxt + NXG + NXG*NYG) = invTauRhoOne*g15(m+now) + ...  !! g15'nxt(i  , j+1, k+1) = g15(i,j,k)
          g16(m+nxt - NXG - NXG*NYG) = invTauRhoOne*g16(m+now) + ...  !! g16'nxt(i  , j-1, k-1) = g16(i,j,k)
          g17(m+nxt + NXG - NXG*NYG) = invTauRhoOne*g17(m+now) + ...  !! g17'nxt(i  , j+1, k-1) = g17(i,j,k)
          g18(m+nxt - NXG + NXG*NYG) = invTauRhoOne*g18(m+now) + ...  !! g18'nxt(i  , j-1, k+1) = g18(i,j,k)
        END DO
      END DO
    END DO
Key challenges:
- 49 streams of data (some unaligned)
  – 19 read streams g* (blue)
  – 19 write streams g* (orange)
  – 7 read/write streams f* (green)
  – 4 read streams (red) (phi, gradPhiX, gradPhiY, gradPhiZ)
- Prefetch / TLB associative issues
- Compiler not vectorizing some hot loops.

Initial Results + Basic Optimizations
• Straight porting:
  – KNC was 1/10 the performance of a 2-socket SNB-EP system
• Basic optimizations improve both CPU and MIC performance:
  – CPU Perf improved by 8x
  – KNC Perf improved by 150x
  – Optimizations:
    – compiler vector directives
    – 2M pages
    – loop fusion
    – change in OMP work distribution
[Chart: Speedup. Reference Code: SNB-EP 1, KNC 0.11. Moderate Optimized Code (Pragmas, Compiler, etc): SNB-EP 8, KNC 16.]
Data measured by Tanmay Gangwani, Shreeniwas Sapre and Sangeeta Bhattacharya on 2x8 SNB-EP 2.2GHz, KNC-ES2

Advanced Optimizations
• Include modifications to data structure and tiling to make better use of vector units, caches and available memory bandwidth
  – AOS to SOA transformation helps improve SIMD utilization by avoiding gather / scatter operations
  – 3.5D blocking to take advantage of large coherent caches
  – SW Prefetching for MIC
• Results:
  – CPU and MIC both further improve by 9x
Similarity in Intel® Architectures makes optimizations (basic + advanced) work for both Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor.
Ref: Anthony Nguyen et al., “3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs,” SC 2010.

Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo

ScaleMP
• vSMP Foundation aggregates multiple x86 systems into a single, virtual x86 system
  – SCALE their application compute, memory, and I/O resources
• Simple to move an application to Intel® Xeon Phi™ coprocessor based systems using ScaleMP software
• Zero porting effort
Video: ScaleMP to Support Intel Xeon Phi Coprocessors ... ScaleMP CEO Shai Fultheim describes how vSMP Foundation supports the new Intel Xeon Phi coprocessor. (From ISC’12 – Hamburg)
Intel demo booth #: ScaleMP – SMP VM for Xeon & Xeon Phi

Programming Intel® MIC-based Systems
[Figure: programming models for clusters, from CPU-Centric to Intel® MIC-Centric: CPU-hosted, Offload, Symmetric, Reverse Offload, MIC-hosted. Each panel shows where data and MPI ranks reside on the CPU and MIC of each node, with nodes connected by a network.]

LBM running in Xeon cluster with Xeon Phi co-processor
[Figure: homogeneous nodes connected by a network. CPUs communicate with MPI, and each CPU offloads data to its MIC.]

LES_MIC: Load balance between CPU & MIC
• LES CPU+MIC simultaneous computing design
• Relative speedup with different problem size
Copyright ©2012 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Sponsors of Tomorrow., and the Intel Sponsors of Tomorrow. logo are trademarks of Intel Corporation in the U.S. and other countries.
Demo#: LES_MIC

Recommended reading: 4 books

Intel® Xeon Phi™ Coprocessor Developer site: http://software.intel.com/mic-developer
• Architecture, Setup and Programming resources
• Self-guided training
• Case Studies
• Information on Tools and Ecosystem
• Support through Community Forum