Objectives: Multicore Processors and Threading
Transcription
Objectives: Multicore Processors and Threading

Ref: [Tanenbaum, sect 8.1], [O'H&Bryant, sect 1.7.2, 1.9.1, 12.1, 12.3-4]

- to understand the basic concepts of threading
- to grasp the main ideas of the shared memory programming model
- to understand how multiple cores on a chip operate
- to appreciate how they arose
- to understand the idea of hardware threading

(some figures in these slides are from Magee & Kramer, Concurrency, Wiley; Lin & Snyder, Principles of Parallel Programming, Pearson; Chapman et al., Using OpenMP)

Processes and Threads

- processes may have one or more threads of execution within them
  - these all share the same address space and manipulate the same memory areas
- (OS-style) processes: the Operating System view (courtesy Magee & Kramer)
  - a (heavyweight) process in an operating system is represented by its descriptor, code, data and the state of the machine registers
  - to support multiple (lightweight) threads of control, it has multiple stacks, one for each thread
  - the (specific) state at any time of a running thread includes its stack plus the values in the registers of the CPU it runs on

The Thread Life-cycle

- an overview of the life-cycle of a thread as state transitions
- normally, the operating system manages these
- a specific system call creates a new thread, e.g. Solaris lwp_create()
  - this will call some function, and the thread terminates when it exits that function
  - once terminated, it cannot be restarted

Shared Memory Multiprocessors

- a number of processors (CPUs) connected to a globally addressable memory
  - through a bus ([Lin&Snyder, fig 2.3]) or, better,
  - memory is organized into modules, all connected by an interconnect
  - at any time, different threads (or processes) can run in parallel on different CPUs
- need caches! the memory consistency problem is now exacerbated!
- consider programs (threads) on processors 0 and 1 attempting to acquire a lock
  - this is needed when one wishes to update a shared data structure
  - hardware must support this via some atomic instructions, e.g. (SPARC):

                                ! %o0 has the address of the lock
            mov    0xff, %o1    ! 0xff is the value for acquiring the lock
        loop:                   ! loop to acquire lock
            brnz   %o1, loop    ! exit if %o1 = 0 (lock value was 0)
            ldstub [%o0], %o1   ! atomic swap of %o1 and the value at the lock
            ...                 ! safely update shared data structure
            stb    %g0, [%o0]   ! release lock

The Shared Memory Programming Model

- uses a fork-join variant of the threaded programming model [Chap&Jost&derPas, fig 2.1]
  - the team of threads executes a parallel region
  - for parallel speedup, each thread needs to be allocated to a different processor
  - in the low-level threads library (pthreads), can only fork or join 1 thread at a time (e.g. race.c)
- threads communicate via global variables in common memory sections (e.g. static data, heap); they have private stacks for thread-local variables
- the main synchronization mechanisms are locks and barriers:

        pthread_mutex_init(&mutex1, NULL);
        ...
        pthread_mutex_lock(&mutex1);      // involves a busy-wait loop
        /* mutually exclusive access to shared resource */
        pthread_mutex_unlock(&mutex1);
        ...
        pbarrier();                       // wait until other threads reach the same point
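To make this fork-join style concrete, here is a minimal sketch of a complete pthreads program, assuming an illustrative 4-thread team and a shared counter: it forks the threads one at a time, protects the shared variable with a mutex, and joins the threads before printing the result.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4                       /* illustrative team size */

    long counter = 0;                        /* shared: lives in the static data section */
    pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg) {                  /* each thread executes this function */
        long id = (long) arg;                /* thread-local: lives on this thread's stack */
        pthread_mutex_lock(&mutex1);         /* mutually exclusive access to counter */
        counter += id;
        pthread_mutex_unlock(&mutex1);
        return NULL;                         /* thread terminates when the function exits */
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)  /* 'fork' one thread at a time */
            pthread_create(&tid[i], NULL, work, (void *) i);
        for (int i = 0; i < NTHREADS; i++)   /* 'join' one thread at a time */
            pthread_join(tid[i], NULL);
        printf("counter = %ld\n", counter);  /* 0 + 1 + 2 + 3 = 6 */
        return 0;
    }

Compiled with something like cc -pthread, the mutex guarantees the same final value of counter on every run; removing the lock/unlock pair would reintroduce a data race on counter.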
Overview of the OpenMP Programming Model

- the idea is to add directives to ordinary C, C++ or Fortran code
- example: matrix-vector multiply y ← y + Ax ([Chap&Jost&derPas, fig 3.9])

        double A[N][N], x[N], y[N];
        int i, j;
        ...
        #pragma omp parallel for private(j)
        for (i = 0; i < N; i++)            // each thread updates a segment of y[]
            for (j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];

        // alternately:
        for (i = 0; i < N; i++) {
            double s = y[i];
            #pragma omp parallel for reduction(+:s)
            for (j = 0; j < N; j++)        // each thread computes a partial sum
                s += A[i][j] * x[j];       // in its own version of s, which are
            y[i] = s;                      // later summed into the global s
        }

- more generally, directives apply over regions: #pragma omp parallel { ... }
- within a parallel region, a barrier may be inserted (#pragma omp barrier)
- mutual exclusion via #pragma omp critical { ... }
- for more details, see the OpenMP web site or the LLNL Tutorial (a small self-contained sketch combining these directives is given at the end of these notes)

What is Multicore?

- also known as chip multiprocessing (CMP): multiple shared memory processors ('cores' = CPUs) on a single chip
- each core has its own register set and operates on a common main memory
  - why? because we can! also because we must!
  - can run multiple applications (processes) in parallel
  - can run a single (threaded) application in parallel; herein lies the challenge!
- (large-scale) parallelism is now cheap and mainstream
- memory (data access) is an increasing consideration!

Advent of Multicore

- caused by the end of Dennard scaling and the end of further improvements from pipelining / superscalar techniques
- for a long time, Moore's Law permitted an exponential increase in clock speed at constant power density (Dennard scaling), as well as in the number of transistors per chip
- extrapolating the exponential power density increase over 1985–2000 indicates we are at the limit!
- a 2000 Intel chip was equivalent to a hotplate; the extrapolation ⇒ a rocket nozzle by 2010!
- dissipated power is given by P ∝ V²f ∝ f³, where V is the supply voltage and f is the clock frequency (speed); the second proportionality holds because V must scale roughly with f, so doubling the clock speed costs roughly 8× the power

Hardware Threading

- for each core, have a number of 'virtual CPUs', each with its own register set
- the core's control unit can flexibly select instructions ready to execute in each of these
- the operating system sees each as a separate CPU
- we can thus hide the effects of cache misses
- this tends to be effective on memory-intensive workloads with high degrees of concurrency (e.g. database and web servers)
- for other applications, there is still some limited benefit (e.g. a 25% speedup for 4-way hardware threading); on the other hand, it is a 'cheap' technique in terms of extra hardware
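To illustrate the point that the operating system sees each virtual CPU as a separate CPU, the small sketch below, assuming the widely supported (though not strictly POSIX-standard) _SC_NPROCESSORS_ONLN query, prints the number of processors currently online; on a hardware-threaded chip this counts every virtual CPU, not just the physical cores.

    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        /* the OS reports one 'processor' per virtual CPU (hardware thread) */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        printf("online CPUs (including hardware threads): %ld\n", ncpus);
        return 0;
    }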
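Finally, as promised in the OpenMP overview, here is a small self-contained sketch combining a parallel region, a worksharing loop with a reduction, a critical section and the implicit barriers; the array and its contents are arbitrary illustrative choices.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000                            /* arbitrary problem size */

    int main(void) {
        static double x[N];
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        #pragma omp parallel                  /* a team of threads executes this region */
        {
            /* iterations are shared among the team; each thread accumulates a
               private partial sum, and the partial sums are combined into sum */
            #pragma omp for reduction(+:sum)
            for (int i = 0; i < N; i++)
                sum += x[i];
            /* (implicit barrier at the end of the omp for construct) */

            #pragma omp critical              /* mutual exclusion: one thread at a time */
            printf("thread %d of %d done\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                     /* implicit barrier at the end of the region */

        printf("sum = %.1f\n", sum);          /* expect 1000.0 */
        return 0;
    }

Compiled with something like cc -fopenmp; if compiled without OpenMP support the pragmas are ignored and the program simply runs sequentially.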