OpenACC (PGI Compiler)
LRZ, 27.4.2015, Dr. Volker Weinberg, weinberg@lrz.de
Intel MIC & GPU Programming Workshop, LRZ 2015

OpenACC
● http://www.openacc-standard.org/
● A CAPS, Cray, Nvidia and PGI initiative
● AMD and PathScale joined the OpenACC Standards Group in 2014
● Open standard:
  OpenACC 1.0 spec (Nov. 2011): http://www.openacc.org/sites/default/files/OpenACC.1.0_0.pdf
  OpenACC 2.0a spec (Aug. 2013): http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf
● OpenACC 1.0 is quite similar to OpenMP 4.0
● OpenACC shares many features with the former PGI accelerator directives and the spirit of the CAPS HMPP compiler
● Quick Reference Guide: http://www.openacc.org/sites/default/files/213462%2010_OpenACC_API_QRG_HiRes.pdf

OpenACC
● The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.
● The directives and programming model defined in the OpenACC API document allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown. All of these details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes.
● The programming model allows the programmer to augment the information available to the compilers, including specification of data local to an accelerator, guidance on the mapping of loops onto an accelerator, and similar performance-related details.

OpenACC
● OpenACC compilers:
  PGI Compiler, The Portland Group, http://www.pgroup.com/
    Support for NVIDIA & AMD GPUs
    Extension of the x86 PGI compiler suite
  CAPS Compilers, CAPS Enterprise (until 2014)
    Support for NVIDIA & AMD GPUs and Intel Xeon Phi
    Source-to-source compilers
  Cray Compiler
    Only for Cray systems

Accelerator Block Diagram
[Figure: accelerator block diagram; no textual content in the transcription]

Offload Execution Model
● Host:
  Executes most of the program
  Allocates memory on the accelerator device
  Initiates data copies from host memory to accelerator memory
  Sends the kernel code to the accelerator
  Waits for kernel completion
  Initiates the data copy from the accelerator back to the host memory
  Deallocates memory
● Accelerator:
  Only compute-intensive regions should be executed on the accelerator
  Executes kernels, one after the other
  May concurrently transfer data between host and accelerator

OpenACC Execution Model
● The OpenACC execution model has three levels: gang, worker and vector.
● The model target architecture is a collection of processing elements (PEs), where each PE is multithreaded, and each thread on the PE can execute vector instructions.
● For an NVIDIA GPU, the PEs might map to the streaming multiprocessors, multithreading might map to warps, and the vector dimension might map to the threads within a warp. The gang dimension would map across the PEs, the worker dimension across the multithreading dimension within a PE, and the vector dimension to the vector instructions.
● Mapping is compiler-dependent!
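To make the three levels concrete, the sketch below requests gang, worker and vector parallelism explicitly for a simple vector-scaling loop. This is only a minimal sketch: the directive syntax is introduced on the following slides, the function and variable names are illustrative, and the worker/vector widths are arbitrary; without them the compiler chooses its own mapping.

    /* Minimal sketch: explicit gang/worker/vector mapping for one loop.
       The widths (4 workers, vector length 128) are illustrative only. */
    void scale(int n, float a, float *restrict x, float *restrict y)
    {
        int i;
    #pragma acc kernels loop gang worker(4) vector(128)
        for (i = 0; i < n; ++i)
            y[i] = a * x[i];
    }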
OpenACC Execution Model (cont.)
● There is no support for any synchronization between gangs, since current accelerators typically do not support synchronization across PEs.
● A program should try to map parallelism that shares data to workers within the same gang, since those workers will be executed by the same PE and will share resources (such as data caches) that make access to the shared data more efficient.

OpenACC Programming Model
● Main OpenACC constructs:
  Parallel Construct
  Kernels Construct
  Data Construct
  Loop Construct
● Runtime Library Routines

OpenACC Syntax
● C:
  #pragma acc directive-name [clause [, clause] …]
  {
      …   // offload code
  }
● Fortran:
  !$acc directive-name [clause [, clause] …]
      ! offload code
  !$acc end directive-name

Kernels Directive
● An accelerator kernels construct surrounds loops to be executed on the accelerator, typically as a sequence of kernel operations.
● Typically every loop will become a distinct kernel.
● The number of gangs and workers can be different for each kernel.

Kernels Directive
● C:
  #pragma acc kernels [clause [, clause] …]
  {
      for (i = 0; i < n; i++) { … }   // 1st kernel
      for (j = 0; j < n; j++) { … }   // 2nd kernel
  }
● Fortran:
  !$acc kernels [clause [, clause] …]
      DO i = 1, n    ! 1st kernel
         …
      END DO
      DO j = 1, n    ! 2nd kernel
         …
      END DO
  !$acc end kernels

Important Kernels Clauses
  if(condition): When condition is true, the kernels region executes on the accelerator; otherwise on the host.
  async(expression): The kernels region executes asynchronously with the host.

Important Data Clauses
  copy(list): Allocates data on the accelerator and copies data: host ↔ accelerator
  copyin(list): Allocates data on the accelerator and copies data: host → accelerator
  copyout(list): Allocates data on the accelerator and copies data: host ← accelerator
  create(list): Allocates data on the accelerator but does not copy data: host ≠ accelerator
  present(list): Does not allocate data, but uses data already allocated on the accelerator
  present_or_copy/copyin/copyout/create(list): If data is already present, that data is used; otherwise behaves like copy/copyin/…
● Can be used on parallel constructs, kernels constructs, data constructs and others.
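The Data Construct listed among the main constructs is only characterised here by its clauses, so the following is a hedged sketch of a data region that combines the clauses above to keep arrays resident on the accelerator across two kernels regions. Function and array names are illustrative; in C, subarrays are written as array[start:length].

    /* Sketch: one data region spanning two kernels, so b stays on the
       accelerator and a is transferred to the device only once. */
    void step(int n, float *restrict a, float *restrict b, float *restrict c)
    {
        int i;
    #pragma acc data copyin(a[0:n]) create(b[0:n]) copyout(c[0:n])
        {
    #pragma acc kernels loop
            for (i = 0; i < n; ++i)
                b[i] = 2.0f * a[i];      /* 1st kernel */

    #pragma acc kernels loop
            for (i = 0; i < n; ++i)
                c[i] = b[i] + a[i];      /* 2nd kernel */
        }   /* c is copied back to the host here */
    }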
Parallel Directive
● An accelerator parallel construct launches a number of gangs executing in parallel, where each gang may support multiple workers, each with vector or SIMD operations.
● The number of gangs and workers remains constant for the parallel region.
● One worker in each gang begins executing the code in the region.

Parallel Directive
● C:
  #pragma acc parallel [clause [, clause] …]
  {
      parallel region
  }
● Fortran:
  !$acc parallel [clause [, clause] …]
      parallel region
  !$acc end parallel

Important Parallel Clauses
  if(condition): When condition is true, the parallel region executes on the accelerator; otherwise on the host.
  async(expression): The parallel region executes asynchronously with the host.
  num_gangs(n): Controls how many gangs are created.
  num_workers(n): Controls how many workers are created in each gang.
  vector_length(n): Controls the vector length of each worker.
  private(list): A copy of each variable in list is allocated for each gang.
  firstprivate(list): Same as private, but the data is initialised with the value from the host.
  reduction(operator:list): Allows reduction operations.

Data Clauses specific to the Data Region
  if(condition): When condition is false, no data is allocated or moved to/from the accelerator.
  async(expression): Data movement between host and accelerator occurs asynchronously with the host.

Loop Directive
● A loop directive applies to the immediately following loop or nested loops, and describes the type of accelerator parallelism to use to execute the iterations of the loop.
● C:
  #pragma acc loop [clause [, clause] …]
● Fortran:
  !$acc loop [clause [, clause] …]
● Can also be combined: "#pragma acc kernels loop"

Loop Clauses
  collapse(n): Applies the directive to the following n tightly nested loops.
  seq: Executes this loop sequentially on the accelerator.
  private(list): A copy of each variable in list is created for each iteration of the loop.
  gang[(num)]: Use at most num gangs.
  worker[(num)]: Use at most num workers of a gang.
  vector[(length)]: Executes the iterations of the loop in SIMD vector mode with at most the given vector length.
  Note: the num/length arguments are only possible for loops in kernels regions, not within a parallel region.

Runtime Library Routines
● Prototypes or interfaces for the runtime library routines, along with datatypes and enumeration types, are available as follows:
● C:
  #include "openacc.h"
● Fortran:
  use openacc
  or
  #include "openacc_lib.h"

OpenACC 1.0 Runtime Library Routines
  acc_get_num_devices(devicetype): Returns the number of accelerator devices of type devicetype.
  acc_set_device_type(devicetype): Sets the accelerator device type to use for this host thread.
  acc_get_device_type(): Returns the accelerator device type that is being used by this host thread.
  acc_set_device_num(devicenum, devicetype): Sets the device number to use for this host thread.
  acc_get_device_num(devicetype): Returns the accelerator device number that is being used by this host thread.
  acc_async_test(expression): Returns nonzero or .TRUE. if all asynchronous activities associated with expression have been completed; otherwise returns zero or .FALSE.

OpenACC 1.0 Runtime Library Routines (cont.)
  acc_async_test_all(): Returns nonzero or .TRUE. if all asynchronous activities have been completed; otherwise returns zero or .FALSE.
  acc_async_wait(expression): Waits until all asynchronous activities associated with expression have been completed.
  acc_init(devicetype): Initializes the runtime system and sets the accelerator device type to use for this host thread.
  acc_shutdown(devicetype): Disconnects this host thread from the accelerator device.
  acc_on_device(devicetype): In a parallel or kernels region, this is used to take different execution paths depending on whether the program is running on an accelerator or on the host.
  acc_malloc(size_t): Returns the address of memory allocated on the accelerator device.
  acc_free(void*): Frees memory allocated by acc_malloc.
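To connect the async clause with the routines above, the following is a minimal sketch of an asynchronous offload that can overlap with host work. The async id 1, the function name and the data clauses are illustrative.

    #include <openacc.h>

    /* Sketch: launch the kernels region asynchronously and wait for it
       later with acc_async_wait() from the table above. */
    void scale_async(int n, float *restrict x, float *restrict y)
    {
        int i;
    #pragma acc kernels loop async(1) copyin(x[0:n]) copyout(y[0:n])
        for (i = 0; i < n; ++i)
            y[i] = 2.0f * x[i];

        /* ... independent host work could run here ... */

        acc_async_wait(1);   /* block until activity 1, including its copies, is done */
    }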
acc_device_t
/lrz/sys/compilers/pgi/14/linux86-64/14.1/include/openacc.h:

    typedef enum {
        acc_device_none          = 0,
        acc_device_default       = 1,
        acc_device_host          = 2,
        acc_device_not_host      = 3,
        acc_device_nvidia        = 4,
        acc_device_radeon        = 5,
        acc_device_xeonphi       = 6,
        acc_device_pgi_opencl    = 7,
        acc_device_nvidia_opencl = 8,
        acc_device_opencl        = 9
    } acc_device_t;
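A hedged sketch of how these acc_device_t values can be combined with the runtime routines from the previous slides to select a device before the first compute region. Device number 0 and the fallback logic are illustrative.

    #include <stdio.h>
    #include <openacc.h>

    int main(void)
    {
        int ndev = acc_get_num_devices(acc_device_nvidia);
        printf("%d NVIDIA device(s) found\n", ndev);

        if (ndev > 0) {
            acc_set_device_type(acc_device_nvidia);
            acc_set_device_num(0, acc_device_nvidia);  /* use the first GPU */
            acc_init(acc_device_nvidia);               /* pay the startup cost here */
        } else {
            acc_set_device_type(acc_device_host);      /* fall back to host execution */
        }
        /* ... compute regions would follow here ... */
        return 0;
    }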
Using PGI Compilers
● At LRZ: 5 floating licenses of the PGI Compiler Suite
  https://www.lrz.de/services/software/programmierung/pgi_lic/
● PGI OpenACC documentation:
  Resources: http://www.pgroup.com/resources/accel.htm
  Getting started guide: http://www.pgroup.com/doc/openACC_gs.pdf
  C: http://www.pgroup.com/lit/articles/insider/v4n1a1b.htm
  Fortran: http://www.pgroup.com/lit/articles/insider/v4n1a1a.htm

Running Programs on the GPU
● Login to the GPU cluster:
  ssh lxlogin_gpu
● Allocate 1 GPU:
  salloc --gres=gpu:1 --reservation=gpu_course
● Load the PGI modules:
  module unload ccomp fortran
  module load ccomp/pgi/13.10
● Compile:
  pgcc -acc -Minfo=accel file.c
● Run interactively:
  srun --gres=gpu:1 ./a.out
  or: export RUN="srun --gres=gpu:1"; $RUN ./a.out

Getting Info about the Host CPU
lu65fok@lxa195:~> pgcpuid
vendor id       : AuthenticAMD
model name      : AMD Opteron(tm) Processor 6128 HE
cores           : 8
cpu family      : 16
model           : 9
stepping        : 1
processor count : 8
clflush size    : 8
L2 cache size   : 512KB
L3 cache size   : 10MB
flags           : abm apic cflush cmov cx8 de fpu fxsr fxsropt ht lm mca mce
flags           : mmx mmx-amd monitor msr mas mtrr nx pae pat pge pse pseg36
flags           : sep sse sse2 sse3 sse4a cx16 popcnt syscall tsc vme 3dnow
flags           : 3dnowext
type            : -tp istanbul-64

Getting Info about the GPU: pgaccelinfo
lu65fok@lxa195:~> $RUN pgaccelinfo
CUDA Driver Version: 5000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
Device Number: 0
Device Name: Tesla X2070
Device Revision Number: 2.0
Global Memory Size: 5636554752
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1548 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 6314891 microseconds
Current free memory: 5570027520
Upload time (4MB): 1494 microseconds ( 945 ms pinned)
Download time: 1410 microseconds (1065 ms pinned)
Upload bandwidth: 2807 MB/sec (4438 MB/sec pinned)
Download bandwidth: 2974 MB/sec (3938 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc20

Building Programs
● C:
  pgcc -acc -Minfo=accel file.c -o file
● Fortran:
  pgfortran -acc -Minfo=accel file.f90 -o file

Compiler Command Line Options
● The old PGI Accelerator directives are still supported, but they are not conformant with OpenACC.
● Better use -acc=strict or -acc=verystrict.
● pgcc -help:
  -acc[=[no]autopar|strict|verystrict]
                      Enable OpenACC directives
      [no]autopar     Enable (default) or disable loop autoparallelization within acc parallel
      strict          Issue warnings for non-OpenACC accelerator directives
      verystrict      Fail with an error for any non-OpenACC accelerator directive

Compiler Command Line Options
  -ta=nvidia:{nofma|[no]flushz|keep|noL1|noL1cache|maxregcount:<n>|[no]rdc|tesla|cc1x|fermi|cc2x|kepler|cc3x|fastmath|cuda5.0|cuda5.5}|radeon:{keep|tahiti|apu|buffercount:<n>}|host
                      Choose target accelerator
      nvidia          Select NVIDIA accelerator target
      nofma           Don't generate fused mul-add instructions
      [no]flushz      Enable flush-to-zero mode on the GPU
      keep            Keep kernel files
      noL1            Don't use the L1 hardware data cache to cache global variables
      noL1cache       Don't use the L1 hardware data cache to cache global variables
      maxregcount:<n> Set maximum number of registers to use on the GPU
      [no]rdc         Generate relocatable device code; disables cc1x and cuda4.2
      tesla           Compile for Tesla architecture
      cc1x            Compile for compute capability 1.x
      fermi           Compile for Fermi architecture
      cc2x            Compile for compute capability 2.x
      kepler          Compile for Kepler architecture
      cc3x            Compile for compute capability 3.x

Compiler Command Line Options (cont.)
      fastmath        Use fast math library
      cuda5.0         Use CUDA 5.0 Toolkit compatibility
      cuda5.5         Use CUDA 5.5 Toolkit compatibility
      radeon          Select AMD Radeon GPU accelerator target
      keep            Keep kernel source files
      tahiti          Compile for Radeon Tahiti architecture
      apu             Compile for Radeon APU architecture
      buffercount:<n> Set max number of device buffers used by OpenCL kernel
      host            Compile for the host, i.e., no accelerator target

C Example
    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>

    int main( int argc, char* argv[] )
    {
        int n;                /* size of the vector */
        float *restrict a;    /* the vector */
        float *restrict r;    /* the results */
        float *restrict e;    /* expected results */
        int i;

        if( argc > 1 ) n = atoi( argv[1] );
        else           n = 100000;
        if( n <= 0 )   n = 100000;

        a = (float*)malloc(n*sizeof(float));
        r = (float*)malloc(n*sizeof(float));
        e = (float*)malloc(n*sizeof(float));
        for( i = 0; i < n; ++i ) a[i] = (float)(i+1);

        #pragma acc kernels
        {
            for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
        }

        /* compute on the host to compare */
        for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;

        /* check the results */
        for( i = 0; i < n; ++i ) assert( r[i] == e[i] );
        printf( "%d iterations completed\n", n );
        return 0;
    }

Compiling a C Program
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> pgcc -acc -Minfo=accel c1.c -o c1
main:
     23, Generating present_or_copyout(r[0:n])   ! Mind: r[start-index:nelements]
         Generating present_or_copyin(a[0:n])
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     25, Loop is parallelizable
         Accelerator kernel generated
         25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Executing a Program
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> $RUN ./c1
100000 iterations completed
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> export ACC_NOTIFY=1
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> $RUN ./c1
launch CUDA kernel file=/home/hpc/pr28fa/lu65fok/openacc/pgi_tutorial/v1n1a1/c1.c function=main line=25 device=0 grid=782 block=128
100000 iterations completed
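For small vectors the launch and transfer overhead seen above can outweigh any speedup. The following is a hedged sketch using the if() clause from the earlier clause tables to keep such runs on the host; the threshold value and names are purely illustrative. With ACC_NOTIFY=1 one can then check whether a kernel was actually launched.

    /* Sketch: offload only when the problem is large enough. */
    void scale_cond(int n, float *restrict x, float *restrict y)
    {
        int i;
    #pragma acc kernels loop if(n > 100000) copyin(x[0:n]) copyout(y[0:n])
        for (i = 0; i < n; ++i)
            y[i] = 2.0f * x[i];
    }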
Compiling a Fortran Program
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> pgfortran -acc -Minfo=accel f1.f90
main:
     21, Generating present_or_copyin(a(1:n))
         Generating present_or_copyout(r(1:n))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     22, Loop is parallelizable
         Accelerator kernel generated
         22, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

Fortran Example
    program main
       integer :: n                           ! size of the vector
       real, dimension(:), allocatable :: a   ! the vector
       real, dimension(:), allocatable :: r   ! the results
       real, dimension(:), allocatable :: e   ! expected results
       integer :: i
       character(10) :: arg1

       if( iargc() .gt. 0 )then
          call getarg( 1, arg1 )
          read(arg1,'(i10)') n
       else
          n = 100000
       endif
       if( n .le. 0 ) n = 100000

       allocate(a(n))
       allocate(r(n))
       allocate(e(n))
       do i = 1,n
          a(i) = i*2.0
       enddo

       !$acc kernels
       do i = 1,n
          r(i) = a(i) * 2.0
       enddo
       !$acc end kernels

       do i = 1,n
          e(i) = a(i) * 2.0
       enddo

       ! check the results
       do i = 1,n
          if( r(i) .ne. e(i) )then
             print *, i, r(i), e(i)
             stop 'error found'
          endif
       enddo
       print *, n, 'iterations completed'
    end program

PGI Unified Binary
● PGI Unified Binary for multiple accelerator types:
  Compile the PGI Unified Binary:
    pgcc -ta=nvidia,host    (default for -acc)
  Run the PGI Unified Binary:
    export ACC_DEVICE=nvidia; $RUN ./a.out
    export ACC_DEVICE_NUM=1;  $RUN ./a.out
    export ACC_DEVICE=host;   ./a.out
● PGI Unified Binary for multiple processor types:
  pgcc -tp=nehalem,sandybridge

Insight into the Code
● To get insight into the generated CUDA code use:
  pgcc -acc -Mcuda=keepgpu -Minfo=accel c1.c
  The file c1.n001.gpu will contain the CUDA code.
● Insight into the host code:
  pgcc -S -acc -Minfo=accel c1.c
  Shows the calls of the __pgi_uacc_* routines.

Insight into __pgi_uacc_* Calls
● pgcc -S -acc -Minfo=accel c1.c
    call atoi
    call malloc
    call malloc
    call malloc
    call __pgi_uacc_begin
    call __pgi_uacc_enter
    call __pgi_uacc_dataona
    call __pgi_uacc_dataona
    call __pgi_uacc_datadone
    call __pgi_uacc_launch
    call __pgi_uacc_dataoffa
    call __pgi_uacc_dataoffa
    call __pgi_uacc_datadone
    call __pgi_uacc_noversion
    call __pgi_uacc_end
    call __assert_fail
    call printf

Generated CUDA Code (c1.n001.gpu)
    #include "cuda_runtime.h"
    #include "pgi_cuda_runtime.h"
    extern "C" __global__ __launch_bounds__(128) void
    main_25_gpu(
        int tc2,
        signed char* p3,
        signed char* p4)
    {
        int _i_1;
        unsigned int _ui_1;
        float _r_1;
        unsigned int e30;
        int j39;
        int j38;
        int j37;
        unsigned int e37;
        e30 = ((int)gridDim.x)*(128);
        e37 = (e30)*(4);
        if( ((0)>=(tc2))) goto _BB_6;
        _ui_1 = ((int)gridDim.x)*(128);
        j37 = ((tc2)-((int)(_ui_1)))+((int)(_ui_1));
        j38 = 0;
        j39 = 0;
    _BB_8: ;
        if( (((j39)-(tc2))>=0)) goto _BB_9;
        if( ((((((int)((int)threadIdx.x))-(tc2))+((int)(((int)blockIdx.x)*(128))))+(j39))>=0)) goto _BB_9;
        _i_1 = ((int)((((int)threadIdx.x)+(((int)blockIdx.x)*(128)))*(4)))+(j38);
        _r_1 = (*( float*)((p3)+((long long)(_i_1))));
        *( float*)((p4)+((long long)(_i_1))) = _r_1+_r_1;
    _BB_9: ;
        _ui_1 = ((int)gridDim.x)*(128);
        j37 = (j37)+(-((int)(_ui_1)));
        j38 = (j38)+((int)(e37));
        j39 = (j39)+((int)(_ui_1));
        if( ((j37)>0)) goto _BB_8;
    _BB_6: ;
    }

Lab: OpenACC
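As a starting point for checking results in the lab, here is a minimal illustrative sketch that uses acc_on_device() from the runtime-routines table to verify at run time whether a compute region really executed on the accelerator, e.g. when experimenting with the unified binary and ACC_DEVICE=host. The program structure is an assumption, not part of the lab handout.

    #include <stdio.h>
    #include <openacc.h>

    int main(void)
    {
        int on_acc = 0;

        /* acc_on_device() may be called inside a compute region to test
           where the code is executing. */
    #pragma acc kernels copyout(on_acc)
        {
            on_acc = acc_on_device(acc_device_not_host);
        }

        printf("compute region ran on %s\n", on_acc ? "the accelerator" : "the host");
        return 0;
    }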