Atomic Operations
Transcription
Slide 1: Atomic Operations
May 6, 2014 | Andrew V. Adinetz

Slide 2: Application: Histogram
- Assign each input element to a bin: how many elements fall into each bin?
- a_i: input data, 1 <= i <= n
- h_j = #{ a_i : bin(a_i) = j }
- Often, bin(a_i) = a_i

Slide 3: Example Usage: Histogram Equalization
(Example images in the original slide.)

Slide 4: Histogram on CPU

A histogram of byte values: initialize the bins with zeros, then accumulate the histogram counters.

    void histo_cpu(int *histo, const unsigned char *data, int n, int nbins) {
      // init bins with zeros
      for(int j = 0; j < nbins; j++)
        histo[j] = 0;
      // accumulate histogram counters
      for(int i = 0; i < n; i++)
        histo[data[i]]++;
    }

Slide 5: "Histogram on GPU"

    __global__ void histo_kernel(int *histo, const unsigned char *data, int n) {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if(i < n) {
        histo[data[i]]++;   // this does not work!
      }
    }

Slide 6: Why Doesn't It Work?

The increment compiles to a non-atomic update made of multiple hardware instructions:

    // PTX for histo[data[i]]++;
    // ...
    ld.global.u32 %r7, [%rd9];    // load
    add.s32       %r9, %r7, 1;    // add
    st.global.u32 [%rd9], %r9;    // store

Slide 7: Why Doesn't It Work?

histo is all zeroes initially; two threads both read the input value 1 and update histo[1]:

    time      thread 1                thread 2
    t1        load  (histo[1] == 0)   load  (histo[1] == 0)
    t2 > t1   add 1 (result == 1)     add 1 (result == 1)
    t3 > t2   store (histo[1] = 1)    store (histo[1] = 1)

histo[1] must be 2, but is 1.

Slide 8: Atomic Operations

Safe updates in a multi-threaded environment:
- instructions for atomic read-modify-write
  - atomicity guaranteed by hardware
  - atomicity scope = single instruction
- for algorithms where multiple threads can write the same memory location
- work on global and shared memory
- the update is visible to all GPU threads (for atomic or volatile reads)

Slide 9: Atomic Operations (API)

atomicOp(T *addr, T val)
- addr: shared- or global-memory address
- val: second operand
- returns the old value (before the update)
- *addr Op= val is performed atomically
- T = int, unsigned int, unsigned long long
- Op = Add, Sub, And, Or, Xor, Min, Max, Inc, Dec, Exch
- for T = float: Op = Add, Exch
- example: atomicAdd(&counter, 1);

Slide 10: Atomic Inc/Dec

Increment/decrement with wrap-around at values other than T_MAX/T_MIN (old is the value of *addr before the update):
- atomicInc(T *addr, T val): *addr = old >= val ? 0 : old + 1
- atomicDec(T *addr, T val): *addr = (old == 0 || old > val) ? val : old - 1
- slower than atomicAdd / atomicSub

Slide 11: Availability
- CC 1.1: integer, global memory, 32-bit
- CC 1.2: integer, shared memory 32-bit, global memory 64-bit
- CC 2.x (Fermi): floating-point add, integer shared memory 64-bit
- CC 3.5 (Kepler): integer 64-bit And, Or, Xor, Min, Max

Slide 12: Histogram with Atomics on GPU

    __global__ void histo_kernel(int *histo, const unsigned char *data, int n) {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if(i < n) {
        atomicAdd(&histo[data[i]], 1);
      }
    }
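Not part of the original slides: a minimal host-side launch sketch for the kernel above, assuming 256 bins (byte values) and device buffers named d_histo and d_data; the launch configuration and names are illustrative assumptions.

    // Hypothetical host-side launch of the atomic histogram kernel (sketch).
    // d_histo, d_data and the block size are assumptions for illustration.
    void run_histo(int *d_histo, const unsigned char *d_data, int n) {
      // clear the 256 bins on the device before accumulation
      cudaMemset(d_histo, 0, 256 * sizeof(int));
      int bs = 256;
      int nblocks = (n + bs - 1) / bs;   // enough blocks to cover all n elements
      histo_kernel<<<nblocks, bs>>>(d_histo, d_data, n);
      cudaDeviceSynchronize();           // wait for the kernel (check errors in real code)
    }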
Slide 13: Histogram Performance

Global atomics are much improved on Kepler: they are now resolved in the L2 cache. (Performance plot in the original slide.)

Slide 14: Shared Memory Atomics
- Faster than global-memory atomics
  - important on Fermi, less so on Kepler
- Fewer conflicts
  - fewer threads accessing them (1 block)
  - several copies of the address (one per thread block)
- Usage:
  - don't forget __syncthreads()
  - process several elements per thread

Slide 15

Per-block histogram in shared memory, merged into the global histogram at the end:

    #define PER_THREAD 32
    #define NCLASSES 256
    #define BS 256

    __global__ void histo_kernel(int *histo, const unsigned char *data, int n) {
      // shared memory histogram storage
      __shared__ int lhisto[NCLASSES];
      for(int i = threadIdx.x; i < NCLASSES; i += blockDim.x)
        lhisto[i] = 0;
      __syncthreads();
      // compute per-block histogram
      int istart = blockIdx.x * (BS * PER_THREAD) + threadIdx.x;
      int iend = min(istart + BS * PER_THREAD, n);
      for(int i = istart; i < iend; i += BS)
        atomicAdd(&lhisto[data[i]], 1);
      __syncthreads();
      // update global memory histogram
      for(int i = threadIdx.x; i < NCLASSES; i += blockDim.x)
        atomicAdd(&histo[i], lhisto[i]);
    }

Slide 16: Multiple Elements per Thread

Otherwise the global atomics of the final merge are not amortized. (Performance plot in the original slide.)

Slide 17: Filtering

Copy only the elements satisfying a predicate.

On the host:

    nres = 0;
    for(int i = 0; i < n; i++) {
      if(data[i] > 0)
        res[nres++] = data[i];
    }

On the GPU:

    __global__ void filter_k(int *res, int *nres, const int *data, int n) {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if(i >= n)
        return;
      if(data[i] > 0)
        res[atomicAdd(nres, 1)] = data[i];
    }

What is the problem?

Slide 18: Simple Filtering (K20X)
- too many atomics
- high degree of conflict
(Performance plot in the original slide.)

Slide 19: Filtering Aggregation
- Shared-memory atomics
  - same number of atomics
  - still a high degree of conflicts
- Warp-aggregated increment
  - select a leader
  - the leader performs the atomic operation
  - the leader broadcasts the result
  - each thread computes its own position
  - up to 32x fewer atomics
- Combine shared memory and warp aggregation

Slide 20: Warp Intrinsics (before CC 3.0)
- Reduction + sync across the warp, for active threads only
- CC 1.2 (__any/__all), CC 2.0 (__ballot)
- int __any(int v): non-zero iff v is non-zero on any active thread
- int __all(int v): non-zero iff v is non-zero on all active threads
- unsigned __ballot(int v): mask whose bit i is non-zero iff v is non-zero for lane i

Slide 21: Bit Intrinsics
- unsigned __brev(unsigned v): reverse the bits
- int __clz(int v): number of consecutive high-order zero bits
- int __ffs(int v): position of the least significant set bit, counting from 1; __ffs(0) = 0
- int __popc(unsigned v): number of bits set to 1
- all intrinsics exist with an ll suffix (e.g., __ffsll, __popcll) for 64-bit integers

Slide 22: Warp Shuffle (CC 3.0+)
- Intra-warp "collective operation"
- int __shfl(int var, int lid): read the value of var from lane lid (0 .. warpSize - 1)
  - lane lid must also call __shfl()
- other shuffle intrinsics are available

    // read a value
    int v = a[i];
    // get the value from the lane to the right
    int v_right = __shfl(v, (threadIdx.x + 1) % warpSize);

Slide 23: Intra-Warp Broadcast

    #define WARP_SZ 32
    #define MAX_NWARPS 32

    __device__ int lane_id(void) { return threadIdx.x % WARP_SZ; }
    __device__ int warp_id(void) { return threadIdx.x / WARP_SZ; }

    // broadcast v from the leader lane to all lanes of the warp
    __device__ int warp_bcast(int v, int leader) {
    #if __CUDA_ARCH__ >= 300
      return __shfl(v, leader);
    #else
      // pre-Kepler fallback: go through shared memory (warp-synchronous)
      volatile __shared__ int vs[MAX_NWARPS];
      if(lane_id() == leader)
        vs[warp_id()] = v;
      return vs[warp_id()];
    #endif
    }

Slide 24: Warp-Aggregated Increment

    // warp-aggregated increment: one atomic per warp instead of one per thread
    __device__ int atomicAggInc(int *p) {
      int mask = __ballot(1);
      // select the leader (lowest active lane)
      int leader = __ffs(mask) - 1;
      // the leader increments the counter for the whole warp
      int res;
      if(lane_id() == leader)
        res = atomicAdd(p, __popc(mask));
      // broadcast the leader's result
      res = warp_bcast(res, leader);
      // each thread adds its rank among the active lanes
      return res + __popc(mask & ((1 << lane_id()) - 1));
    } // atomicAggInc

    // ...
    if(data[i] > 0)
      res[atomicAggInc(nres)] = data[i];
    // ...
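Not part of the original slides: a minimal complete filter kernel that uses atomicAggInc as a drop-in replacement for atomicAdd, assuming the helpers from slides 23 and 24 (lane_id, warp_bcast, atomicAggInc) are in scope.

    // Hypothetical warp-aggregated filter kernel (sketch; assumes lane_id(),
    // warp_bcast() and atomicAggInc() from the previous slides are defined).
    __global__ void filter_agg_k(int *res, int *nres, const int *data, int n) {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if(i >= n)
        return;
      if(data[i] > 0)
        res[atomicAggInc(nres)] = data[i];  // one global atomic per warp
    }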
Slide 25: Warp Aggregation and Shared Memory
- Warp aggregation
  - self-contained
  - can be placed almost anywhere (even in deeply nested ifs)
  - very specific use cases
- Combining with shared memory: same code

Slide 26: Warp-Aggregated Atomics (K20X)

17x faster; filtering on K20X is cheap. (Performance plot in the original slide.)

Slide 27: Warp-Aggregated Atomics (M2070)

55x faster; on Fermi, shared memory plus aggregation is best, and filtering is not as cheap. (Performance plot in the original slide.)

Slide 28: Atomic Exch/CAS
- Useful for synchronization primitives and list algorithms
- atomicExch(T *addr, T val)
  - *addr = val, returns the old value
  - are we the first to write the new value?
- atomicCAS(T *addr, T cmp, T val)
  - *addr = old == cmp ? val : old, returns the old value
  - compare-and-swap: has *addr changed in between?

Slide 29: Atomics for Other Types with atomicCAS
- double, 32-bit complex, etc.; either 32- or 64-bit
- slow, but somewhat faster than a critical section

    __device__ double atomicAdd(double *address, double val) {
      unsigned long long *address_as_ull = (unsigned long long *)address;
      unsigned long long old = *address_as_ull, assumed;
      do {
        assumed = old;
        // retry until no other thread has changed *address in between
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
      } while(assumed != old);
      return __longlong_as_double(old);
    } // atomicAdd

Slide 30: Atomics in OpenCL
- Extensions and built-ins
  - OpenCL 1.2: 32-bit built-in, 64-bit as extensions
  - separate extensions for global/local, 32/64-bit, base/extended
- atom[ic]_op(volatile qual T *p, T val)
  - T = int, unsigned, long, unsigned long (long and unsigned long are 64-bit)
  - op = add, sub, xchg, min, max, and, or, xor
  - op = xchg also supports T = float
  - qual = global, local
- atomic_cmpxchg(volatile qual T *p, T cmp, T val) corresponds to CUDA atomicCAS

Slide 31: Questions?

Slide 32: Exercise 1
- Compute the histogram of an "image" = 3 histograms, one per color channel
- /home/gpu/Atomics/exercises/histo/ (task: task/, solution: solution/)
- TODO:
  - write a working kernel with global atomics
  - call the kernel correctly

Slide 33: Exercise 2
- Partition the array into values > 0 and <= 0 = 2 filters done simultaneously
- /home/gpu/Atomics/exercises/partition/ (task: task/, solution: solution/)
- TODO:
  - write a working kernel with global atomics
  - optional: use warp aggregation for better performance

Slide 34: Atomics with Sub-Word Integers
- char, short, and even non-power-of-two numbers of bits
  - the value must fit entirely into a single 4-byte word
- can save shared memory, but beware of overflows

    __device__ void atomicAdd(short *addr, short val) {
      // address of the 4-byte word containing *addr, and the bit offset inside it
      size_t up = (size_t)addr, upi = up / sizeof(int) * sizeof(int);
      int sh = (int)(up - upi) * 8;
      int *pi = (int*)upi;
      // add the value shifted to the position of the short inside the int
      atomicAdd(pi, (int)val << sh);
    }
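Not part of the original slides: a minimal sketch of how the sub-word atomicAdd above might be used to halve the shared-memory footprint of a per-block histogram. It assumes the atomicAdd(short*, short) overload from slide 34 is in scope, a little-endian byte order, 256 classes as on slide 15, and that the 16-bit counters do not overflow.

    // Hypothetical per-block histogram with 16-bit shared-memory counters (sketch).
    __global__ void histo_short_k(int *histo, const unsigned char *data, int n) {
      __shared__ short lhisto[256];            // half the shared memory of int bins
      for(int i = threadIdx.x; i < 256; i += blockDim.x)
        lhisto[i] = 0;
      __syncthreads();
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if(i < n)
        atomicAdd(&lhisto[data[i]], (short)1); // sub-word atomicAdd from slide 34
      __syncthreads();
      // merge the per-block counters into the global histogram
      for(int j = threadIdx.x; j < 256; j += blockDim.x)
        atomicAdd(&histo[j], (int)lhisto[j]);
    }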
Slide 35: "GPU Implementation" of a Critical Section (Mutex)
- 0 = free, 1 = locked
- the first thread to write 1 locks the mutex

    // enter the critical section (lock)
    while(atomicExch(&lock, 1));
    // do some useful work
    __threadfence();
    // leave the critical section (unlock)
    atomicExch(&lock, 0);

- warp threads execute in lock-step
- => this does not work if multiple threads in the same warp try to lock: the lane that acquires the lock waits for its siblings at re-convergence, while the other lanes keep spinning on a lock that is never released

Slide 36: Correct Mutex Implementation

The loop executes synchronously across the warp, so there is no deadlock:

    // try to lock (locks[d] is the lock protecting element d)
    int want_lock = 1;
    while(__any(want_lock)) {
      if(want_lock && !atomicExch(&locks[d], 1)) {
        // do useful work
        __threadfence();
        // unlock
        atomicExch(&locks[d], 0);
        want_lock = 0;
      }
    } // while(any thread still wants to lock)

Slide 37: Performance on K20X: Multiple Approaches

"Artificial atomics" and mutexes/locks are very slow. (Performance plot in the original slide.)
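Not part of the original slides: a minimal sketch showing one way the mutex pattern from slide 36 might be embedded in a kernel, here protecting a per-bin accumulation of doubles (a case without a native atomicAdd on pre-CC-6.0 devices). The names locks, bins, key, val and the choice of 256 bins are illustrative assumptions.

    // Hypothetical kernel using the warp-safe lock pattern to guard a per-bin update.
    // locks[] must be zero-initialized; one lock per bin (sketch, not tuned).
    __global__ void locked_accum_k(double *bins, int *locks, const int *key,
                                   const double *val, int n) {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if(i >= n)
        return;
      int d = key[i] % 256;            // bin (and lock) index for this element
      int want_lock = 1;
      while(__any(want_lock)) {        // loop executes warp-synchronously
        if(want_lock && !atomicExch(&locks[d], 1)) {
          bins[d] += val[i];           // critical section: plain update is safe here
          __threadfence();             // make the update visible before releasing
          atomicExch(&locks[d], 0);    // unlock
          want_lock = 0;
        }
      }
    }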