Sebastian Klose - Robotics and Embedded Systems
Transcription
HETEROGENEOUS PARALLEL COMPUTING WITH OPENCL
Sebastian Klose, kloses@in.tum.de
27.05.2013
Montag, 27. Mai 13

AGENDA
• Motivation
• OpenCL Overview
• Hardware
• OpenCL Examples on Altera FPGAs

MOTIVATION

TREND OF PROGRAMMABLE HARDWARE
[Figure 3, "Recent Trend of Programmable and Parallel Technologies", from "Brief Overview of the OpenCL Standard", page 3: single-core CPUs and DSPs; multicore CPUs and DSPs; coarse-grained massively parallel processor arrays; fine-grained massively parallel arrays (FPGAs)]

From "Brief Overview of the OpenCL Standard":

Considering the need for creating parallel programs for the emerging multicore era, it was recognized that there needs to be a standard model for creating programs that will execute across all of these quite different devices. The lack of a standard that is portable across these different programmable technologies has plagued programmers. In the summer of 2008, Apple submitted a proposal for an OpenCL (Open Computing Language) draft specification to The Khronos Group in an effort to create a cross-platform parallel programming standard. The Khronos Group consists of a consortium of industry members such as Apple, IBM, Intel, AMD, NVIDIA, Altera, and many others. This group has been responsible for defining the OpenCL 1.0, 1.1, and 1.2 specifications. The OpenCL standard allows for the implementation of parallel algorithms that can be ported from platform to platform with minimal recoding. The language is based on the C programming language and contains extensions that allow for the specification of parallelism.

In addition to providing a portable model, the OpenCL standard inherently offers the ability to describe parallel algorithms to be implemented on FPGAs at a much higher level of abstraction than hardware description languages (HDLs) such as VHDL or Verilog. Although many high-level synthesis tools exist for gaining this higher level of abstraction, they have all suffered from the same fundamental problem: these tools would attempt to take in a sequential C program and produce a parallel HDL implementation. The difficulty was not so much in the creation of an HDL implementation, but rather in the extraction of thread-level parallelism that would allow the FPGA implementation to achieve high performance. With FPGAs being on the furthest extreme of the parallel spectrum, any failure to extract maximum parallelism is more crippling than on other devices. The OpenCL standard solves many of these problems by allowing the programmer to explicitly specify and control parallelism. The OpenCL standard more naturally matches the highly parallel nature of FPGAs than do sequential programs described in pure C.

OpenCL applications consist of two parts. The OpenCL host program is a pure software routine written in standard C/C++ that runs on any sort of microprocessor.

OPENCL GOALS
• Khronos standard for unified programming of heterogeneous hardware (GPU, CPU, DSP, embedded systems, FPGAs, ...)
• initiated by Apple (2008); OpenCL 1.0 released in Dec. 2008
• explicit declaration of parallelism
• portability/code reuse: same code for different hardware
• ease of programmability of parallel hardware
AVAILABILITY
SDK | Hardware | OpenCL Version
Intel OpenCL SDK | Core i3/i5/i7 CPU & integrated Intel HD GPU | 1.2
AMD APP SDK | AMD GPUs, x86 CPUs | 1.2
Apple | CPUs & GPUs | 1.2
Altera | selected boards | 1.0
Qualcomm | Qualcomm CPU & GPU | 1.1
ARM | Mali T604 GPU | 1.1
...

OPENCL OVERVIEW
[Figure 2.1 from the OpenCL specification: UML class diagram relating Platform, DeviceID, Context, CommandQueue, Event, Program, Kernel, MemObject (abstract, with Buffer and Image subclasses), and Sampler]

OPEN COMPUTING LANGUAGE
• Host (C, C++, ...) controls one or more compute devices (GPU, CPU, accelerator)
• Platform Layer API: query/select devices of the host; initialize compute devices
• Runtime API: "compile" kernel code; execute compiled kernels on device(s); exchange data between host and compute devices; synchronize several devices
• OpenCL C Language (Kernel): C99 subset + extensions; built-in functions (sin, cos, cross, dot, ...)

PLATFORM MODEL
• one host
• one or more compute devices
• compute devices are composed of one or more compute units
• compute units are divided into one or more processing elements
[Figure: host attached to compute devices; each compute device contains compute units made up of processing elements (PEs)]

EXAMPLE: NVIDIA KEPLER GK110
A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110; for example, some products may deploy 13 or 14 SMXs. Key features of the architecture include:
• the new SMX processor architecture
• an enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• hardware support throughout the design to enable new programming model capabilities
• in OpenCL terms: 15 SMX = compute units; 1 SMX = 192 processing elements
[Figure: Kepler GK110 full-chip block diagram]
Source: http://www.nvidia.de/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

EXECUTION MODEL
• Kernel: unit of executable code; data parallel or task parallel
• Program: collection of kernels and functions (think of a dynamic library)
• Command Queue: the host application queues kernels & data transfers, in order or out of order
• Work-Item: execution of a kernel by a single processing element (think of a thread)
• Work-Group: collection of work-items that execute on a single compute unit (think of a CPU core)

SPECIFY PARALLELISM
• say our "global" problem consists of 1024x1024 tasks, which can be executed in parallel
• therefore we have 1024x1024 work-items
• group several work-items into local work-groups

MEMORY MODEL
[Figure: each work-item has private memory; each work-group has local memory; constant and global memory are shared by all work-items on the device; host memory resides on the host]

EXECUTION MODEL
• the host application submits work to the compute devices via command queues
• context: the environment within which work-items execute; includes devices, memories and command queues
[Figure: host CPU managing a context that contains a queue for a CPU device and a queue for a GPU device]

SYNCHRONIZATION
• Events: synchronize kernel executions between different queues in the same context
• Barriers: synchronize kernels within a queue

SIMPLE KERNEL EXAMPLE

Sequential C version:

    void inc( float* a, float b, int N )
    {
        for( int i = 0; i < N; ++i ) {
            a[ i ] = a[ i ] + b;
        }
    }

    void main( void )
    {
        ...
        inc( a, b, N );
    }

OpenCL version:

    kernel void inc( global float* a, float b )
    {
        int i = get_global_id( 0 );
        a[ i ] = a[ i ] + b;
    }

    // host code
    void main( void )
    {
        ...
        clEnqueueNDRangeKernel( ..., &N, ... );
    }

• inside the kernel you (e.g.) program the execution of the inner part of a loop
• built-in kernels can be used to exploit special hardware units
OPENCL ON ALTERA FPGAS

ALTERA'S EXECUTION MODEL
[Figure: a host CPU running the user program and the OpenCL runtime, attached via PCIe to an FPGA containing several accelerators; alternatively an FPGA SoC, where an embedded CPU runs the user program and the OpenCL runtime next to the accelerators]

ALTERA OPENCL COMPILATION
• the OpenCL host program & kernels are compiled separately: the ACL compiler turns the kernels into an FPGA bitstream (and produces a throughput and area report), while a standard C compiler turns the host program into an x86 binary
• host and FPGA communicate via PCIe

DIFFERENCES TO OTHER OPENCL IMPLEMENTATIONS
• in contrast to CPU/GPU, specialized datapaths are generated
• for each kernel, custom hardware is created
• logic is organized in functional units, based on operation, and linked together to form the dedicated datapath required to implement the specific kernel
• execution of multiple work-groups in parallel in a custom fashion
• pipeline parallelism

ALTERA LIMITATIONS
• OpenCL version 1.0 (nearly complete)
• #contexts = 2; #program_objects/ctxt = 3; #concurrently running kernels = 3; #arguments/kernel = 128
• #cmd_queues/ctxt = 3; #outstanding events/ctxt around 100; #kernels/device = 16; #work_items/work_group = 256
• max. 256 MB of host-side memory (CL_MEM_ALLOC_HOST_PTR)
• max. 2 GB DDR as device global memory
• no support for out-of-order execution of command queues
• no image read and write functions (atm), no explicit memory fences, no floating point exceptions
• memory: global memory visible to all work-items (e.g. DDR memory on the FPGA); local memory visible to a work-group (e.g. on-chip memory); private memory visible to a work-item (e.g. registers)

BENEFITS
• shorter design cycles & time to market
• performance and performance per watt
• explicit parallelism
• "portable"
[Chart, "Performance per Watt in Million Terms/J" (scale 0 to 90): Terasic DE4: 83.6; Xeon W3690: 15.1; Tesla C2075 GPU: 15.9]