Arquitectura de Computadores II
6. Multi-Processing
6.1. Introduction
2004/2005

Paulo Marques
Departamento de Eng. Informática, Universidade de Coimbra
pmarques@dei.uc.pt

Important Note
- The presentation of this part of the course is largely based on an international course taught at DEI in September 2003, "Cluster Computing and Parallel Programming". The original slides can be found at: http://eden.dei.uc.pt/~pmarques/courses/best2003/pmarques_best.pdf
- Besides those materials, the main sources are Chapter 6 of [CAQA] and Chapter 9 of "Computer Organization and Design"

Motivation
I have a program that takes 7 days to execute, which is far too long for practical use. How do I make it run in 1 day?
- Work smarter! (i.e. find better algorithms)
- Work faster! (i.e. buy a faster processor/memory/machine)
- Work harder! (i.e. add more processors!!!)

Motivation
We are interested in the last approach: add more processors! (We don't care about being too smart or spending too much $$$ on bigger, faster machines!)
Why?
- It may not be feasible to find better algorithms
- Normally, faster, bigger machines are very expensive
- There are lots of computers available in any institution (especially at night)
- There are computer centers from which you can buy parallel machine time
- Adding more processors enables you not only to run things faster, but to run bigger problems

Motivation
"Adding more processors enables you not only to run things faster, but to run bigger problems"?!
- "9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months"
- This is called the Gustafson-Barsis law (informally)
- What the Gustafson-Barsis law tells us is that when the size of the problem grows, there is normally more parallelism available

6. Multi-Processing
6.2. Machine Architecture

von Neumann Architecture
- Based on the fetch-decode-execute cycle
- The computer executes a single sequence of instructions that act on data. Both program and data are stored in memory.
[Figure: a flow of instructions (A, B, C) acting on data]

Flynn's Taxonomy
- Classifies computers according to:
  - The number of execution flows
  - The number of data flows
- The four classes (number of execution flows by number of data flows):
  - SISD: Single-Instruction Single-Data
  - SIMD: Single-Instruction Multiple-Data
  - MISD: Multiple-Instruction Single-Data
  - MIMD: Multiple-Instruction Multiple-Data

Single Instruction, Single Data (SISD)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Most PCs, single CPU workstations, …

Single Instruction, Multiple Data (SIMD)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing
- Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions
[Figure: ADD V3, V1, V2 with vector registers V1 = (1, 3, 4, 5, 21, 3, 3, 5), V2 = (32, 43, 2, 46, 87, 65, 43, 32) and the result V3]

The Connection Machine 2 (SIMD)
- The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors that work in parallel.

Multiple Instruction, Single Data (MISD)
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
  - the Data Flow Architecture

Multiple Instruction, Multiple Data (MIMD)
- Currently, the most common type of parallel computer
- Multiple Instruction: every processor may be executing a different instruction stream
- Multiple Data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, computer clusters, multi-processor SMP machines (including some types of PCs)

IBM BlueGene/L DD2
- Department of Energy's Lawrence Livermore National Laboratory (California, USA)
- Currently the fastest machine on earth (70 TFLOPS)
- Some facts:
  - 32768x 700 MHz PowerPC 440 CPUs (dual processors)
  - 512 MB RAM per node, total = 16 TByte of RAM
  - 3D torus network; 300 MB/sec per node

IBM BlueGene/L DD2
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node Board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR

What about Memory?
- The interface between CPUs and memory in parallel machines is of crucial importance
- The bottleneck on the bus, many times between memory and CPU, is known as the von Neumann bottleneck
- It limits how fast a machine can operate: relationship between computation/communication

Communication in Parallel Machines
- Programs act on data.
- Quite important: how do processors access each other's data?
[Figure: Message Passing Model (each CPU has its own memory, and the CPUs are connected by a network) and Shared Memory Model (several CPUs connected to a single memory)]

Shared Memory
- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space
- Multiple processors can operate independently but share the same memory resources
- Changes in a memory location made by one processor are visible to all other processors
- Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA

Shared Memory (2)
[Figure: UMA (Uniform Memory Access), a single 4-processor machine in which four CPUs share one memory; NUMA (Non-Uniform Memory Access), a 3-processor NUMA machine in which each CPU has its own memory and the memories are linked by a fast memory interconnect]

Uniform Memory Access (UMA)
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- Very hard to scale

Non-Uniform Memory Access (NUMA)
- Often made by physically linking two or more SMPs. One SMP can directly access the memory of another SMP.
- Not all processors have equal access time to all memories
- Sometimes called DSM (Distributed Shared Memory)

UMA and NUMA
- Advantages
  - User-friendly programming perspective to memory
  - Data sharing between tasks is both fast and uniform due to the proximity of memory and CPUs
  - More scalable than SMPs
- Disadvantages
  - Lack of scalability between memory and CPUs
  - Programmer responsibility for synchronization constructs that ensure "correct" access to global memory
  - Expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors
- Example: the Power Mac G5 features 2 PowerPC 970/G5 processors that share a common central memory (up to 8 GByte)
- Example: SGI Origin 3900, with 16 R14000A processors per brick, each brick with 32 GBytes of RAM, and 12.8 GB/s aggregated memory bandwidth (scales up to 512 processors and 1 TByte of memory)

Distributed Memory (DM)
- Processors have their own local memory. Memory addresses in one processor do not map to another processor (no global address space)
- Because each processor has its own local memory, cache coherency does not apply
- Requires a communication network to connect inter-processor memory
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated
- Synchronization between tasks is the programmer's responsibility
- Very scalable
- Cost effective: use of off-the-shelf processors and networking
- Slower than UMA and NUMA machines

Distributed Memory
[Figure: three computers, each with its own CPU and memory, joined by a network interconnect; TITAN@DEI, a PC cluster interconnected by a FastEthernet network]

Hybrid Architectures
- Today, most systems are a hybrid featuring shared distributed memory.
  - Each node has several processors that share a central memory
  - A fast switch interconnects the several nodes
  - In some cases the interconnect allows for the mapping of memory among nodes; in most cases it gives a message passing interface
[Figure: four nodes, each with four CPUs sharing one memory, joined by a fast network interconnect]

ASCI White at the Lawrence Livermore National Laboratory
- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors/node)
- Each node has 16 GB of memory
- A total of 512 nodes, interconnected by a 2 GB/sec node-to-node network
- The 512 nodes feature a total of 8192 processors (512 x 16) and a total of 8192 GB of memory (512 x 16 GB)
- It currently operates at 13.8 TFLOPS

Summary
- CC-UMA
  - Examples: SMPs, Sun Vexx, SGI Challenge, IBM Power3
  - Programming: MPI, Threads, OpenMP, Shmem
  - Scalability: <10 processors
  - Drawbacks: limited memory bandwidth; hard to scale
  - Software availability: great
- CC-NUMA
  - Examples: SGI Origin, HP Exemplar, IBM Power4
  - Programming: MPI, Threads, OpenMP, Shmem
  - Scalability: <1000 processors
  - Drawbacks: new architecture; point-to-point communication
  - Software availability: great
- Distributed/Hybrid
  - Examples: Cray T3E, IBM SP2
  - Programming: MPI
  - Scalability: ~1000 processors
  - Drawbacks: costly system administration; programming is hard to develop and maintain
  - Software availability: limited

Summary (2)
[Figure: plot of top 500 supercomputer sites over a decade]

6. Multi-Processing
6.3. Programming Models and Challenges

Warning
- We will now introduce the main ways in which you can program a parallel machine.
- Don't worry if you don't immediately visualize all the primitives that the APIs provide. We will cover that later.
- In summary: DON'T PANIC!

The main programming models…
- A programming model abstracts the programmer from the hardware implementation
- The programmer sees the whole machine as a big virtual computer which runs several tasks at the same time
- The main models in current use are:
  - Shared Memory
  - Message Passing
  - Data parallel / Parallel Programming Languages
- Note that this classification is not all-inclusive. There are hybrid approaches, and some of the models overlap (e.g. data parallel with shared memory/message passing)
- For now, you just have to understand the main ideas behind each paradigm.

Shared Memory Model
[Figure: processes or threads A, B, C and D all accessing a globally accessible (shared) memory that holds the following variables]
double matrix_A[N];
double matrix_B[N];
double result[N];

Shared Memory Model
- Independently of the hardware, each program sees a global address space
- Several tasks execute at the same time and read and write from/to the same virtual memory
- Locks and semaphores may be used to control access to the shared memory
- An advantage of this model is that there is no notion of data "ownership". Thus, there is no need to explicitly specify the communication of data between tasks. Program development can often be simplified
- An important disadvantage is that it becomes more difficult to understand and manage data locality. Performance can be seriously affected.

Shared Memory Modes
- There are two major shared memory models:
  - All tasks have access to all the address space (typical in UMA machines running several threads)
  - Each task has its own address space. Most of the address space is private. A certain zone is visible across all tasks.
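The shared memory model described above (several tasks reading and writing the same memory, with a lock controlling access) can be sketched as follows. This is only an illustration using Python's standard `threading` module; the names `N_THREADS`, `INCREMENTS`, `worker` and `run_shared_counter` are hypothetical and do not come from the course material.

```python
import threading

N_THREADS = 4        # hypothetical number of tasks
INCREMENTS = 10_000  # hypothetical amount of work per task

counter = 0                      # lives in memory visible to every thread
counter_lock = threading.Lock()  # controls access to the shared variable


def worker() -> None:
    """Update the shared variable; no data is sent between tasks explicitly."""
    global counter
    for _ in range(INCREMENTS):
        with counter_lock:  # only one thread may update the counter at a time
            counter += 1


def run_shared_counter() -> int:
    """Start the threads, wait for all of them, and return the shared total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_shared_counter())
```

Note that there is no notion of data ownership here: every thread touches `counter` directly, and the programmer is responsible for the synchronization. In CPython the threads take turns rather than running on several processors at once, so the sketch illustrates the programming model rather than a speedup.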
Of the two shared memory modes, the one in which each task has its own, mostly private address space with a zone visible across all tasks is typical of DSM machines running different processes.
[Figure: memory layouts for tasks A, B and C; caption: shared memory (all the tasks share the same address space)]

Shared Memory Model – Closely Coupled Implementations
- On shared memory platforms, the compiler translates user program variables into global memory addresses
- Typically a thread model is used for developing the applications
  - POSIX Threads
  - OpenMP
- There are also some parallel programming languages that offer a global memory model, although data and tasks are distributed
- For DSM machines, no standard exists, although there are some proprietary implementations

Shared Memory – Thread Model
- A single process can have multiple threads of execution
- Each thread can be scheduled on a different processor, taking advantage of the hardware
- All threads share the same address space
- From a programming perspective, thread implementations commonly comprise:
  - A library of subroutines that are called from within parallel code
  - A set of compiler directives embedded in either serial or parallel source code
- Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP

POSIX Threads
- Library based; requires parallel coding
- Specified by the IEEE POSIX 1003.1c standard (1995), also known as PThreads
- C language
- Most hardware vendors now offer PThreads
- Very explicit parallelism; requires significant programmer attention to detail

OpenMP
- Compiler directive based; can use serial code
- Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released October 28, 1997.
- The C/C++ API was released in late 1998
- Portable / multi-platform, including Unix and Windows NT platforms
- Available in C/C++ and Fortran implementations
- Can be very easy and simple to use; provides for "incremental parallelism"
- No free compilers available

Message Passing Model
- The programmer must send and receive messages explicitly

Message Passing Model
- A set of tasks that use their own local memory during computation
- Tasks exchange data through communications by sending and receiving messages
- Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines
- Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

Message Passing Implementations
- Message passing is generally implemented as libraries which the programmer calls
- A variety of message passing libraries have been available since the 1980s
  - These implementations differed substantially from each other, making it difficult for programmers to develop portable applications
- In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations

MPI – The Message Passing Interface
- Part 1 of the Message Passing Interface (MPI), the core, was released in 1994
  - Part 2 (MPI-2), the extensions, was released in 1996
  - Freely available on the web: http://www.mpi-forum.org/docs/docs.html
- MPI is now the "de facto" industry standard for message passing
  - Nevertheless, most systems do not implement the full specification, especially MPI-2
- For shared memory architectures, MPI implementations usually don't use a network for task communications
  - Typically a set of devices is provided: some for network communication, some for shared memory. In most cases, they can coexist.
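The idea that every send must have a matching receive can be sketched with a small simulation. The code below is not MPI: it uses Python threads and `queue.Queue` objects as stand-ins for tasks and communication channels, and the names `send`, `receive`, `worker` and `message_passing_sum` are hypothetical helpers chosen for this illustration. Each worker only touches the data it received in a message.

```python
import queue
import threading


def send(channel: queue.Queue, message) -> None:
    """Explicitly hand a message to another task."""
    channel.put(message)


def receive(channel: queue.Queue):
    """Block until the matching send has delivered a message."""
    return channel.get()


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    local_data = receive(inbox)    # matches the master's send
    send(outbox, sum(local_data))  # matched by a receive in the master


def message_passing_sum(data, n_workers: int = 2):
    """Distribute data to workers by messages and collect partial sums."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(inbox, results))
        for inbox in inboxes
    ]
    for w in workers:
        w.start()
    for rank, inbox in enumerate(inboxes):
        send(inbox, data[rank::n_workers])  # explicit data distribution
    total = sum(receive(results) for _ in range(n_workers))
    for w in workers:
        w.join()
    return total


if __name__ == "__main__":
    print(message_passing_sum(list(range(10)), n_workers=3))
```

In a real message passing library the tasks would be separate processes, possibly on different machines, but the cooperative pattern is the same: the master performs one send per worker, and each worker performs exactly one receive and one send.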
Data Parallel Model
- Typically a set of tasks performs the same operations on different parts of a big array

Data Parallel Model
- The data parallel model demonstrates the following characteristics:
  - Most of the parallel work focuses on performing operations on a data set
  - The data set is organized into a common structure, such as an array or cube
  - A set of tasks works collectively on the same data structure; however, each task works on a different partition of the same data structure
  - Tasks perform the same operation on their partition of work, for example, "add 4 to every array element"
- On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task

Data Parallel Programming
- Typically accomplished by writing a program with data parallel constructs:
  - calls to a data parallel subroutine library
  - compiler directives
- In most cases, parallel compilers are used:
  - High Performance Fortran (HPF): extensions to Fortran 90 to support data parallel programming
  - Compiler directives: allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms

Summary
- Middleware for parallel programming:
  - Shared memory: all the tasks (threads or processes) see a global address space. They read and write directly from memory and synchronize explicitly.
  - Message passing: the tasks have private memory. For exchanging information, they send and receive data through a network. There is always a send() and receive() primitive.
  - Data parallel: the tasks work on different parts of a big array. Typically accomplished by using a parallel compiler which allows data distribution to be specified.
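The data parallel idea, with every task performing the same operation ("add 4 to every array element") on its own partition, can be sketched as below. This is a hand-written illustration using Python's standard `concurrent.futures` module, not a parallel compiler such as HPF; `partition`, `add_four` and `data_parallel_add_four` are hypothetical names, and the contiguous block distribution is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_tasks):
    """Split data into n_tasks contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def add_four(chunk):
    """The single operation that every task applies to its own partition."""
    return [x + 4 for x in chunk]


def data_parallel_add_four(data, n_tasks=4):
    """Apply add_four to each partition and reassemble the full array."""
    chunks = partition(data, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(add_four, chunks))  # map keeps chunk order
    return [x for chunk in results for x in chunk]


if __name__ == "__main__":
    print(data_parallel_add_four(list(range(10)), n_tasks=3))
```

A parallel compiler would derive the partitioning from directives written by the programmer; here it is spelled out by hand so that the distribution of data is visible.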
In distributed memory (DM) implementations of the data parallel model, the compiler converts the program into calls to a message passing library that distributes the data to all the processes. All message passing is done invisibly to the programmer.

Final Considerations…
Beware of Amdahl's Law!

Load Balancing
- Load balancing is always a factor to consider when developing a parallel application.
  - Granularity too big: poor load balancing
  - Granularity too small: too much communication
- The ratio computation/communication is of crucial importance!
[Figure: timeline of Task 1, Task 2 and Task 3 showing periods of work and wait]

Amdahl's Law
- The speedup depends on the amount of code that cannot be parallelized:

$$\mathrm{speedup}(n, s) = \frac{T}{T \cdot s + \frac{T \cdot (1 - s)}{n}} = \frac{1}{s + \frac{1 - s}{n}}$$

  - n: number of processors
  - s: percentage of code that cannot be made parallel
  - T: time it takes to run the code serially

Amdahl's Law – The Bad News!
[Figure: Speedup vs. Percentage of Non-Parallel Code; x-axis: Number of Processors (1 to 30); y-axis: Speedup; curves for 0% (linear speedup), 5%, 10% and 20% non-parallel code]

Efficiency Using 30 Processors

$$\mathrm{efficiency}(n, s) = \frac{\mathrm{speedup}(n, s)}{n}$$

[Figure: Efficiency Using 30 Processors; x-axis: Percentage of Non-Parallel Code; y-axis: Efficiency (%)]

What Is That "s" Anyway?
- Three slides ago: "s: percentage of code that cannot be made parallel"
- Actually, it's worse than that: it is the percentage of time that cannot be executed in parallel.
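As a quick numerical check of the Amdahl's Law formula above, the sketch below evaluates speedup(n, s) and efficiency(n, s) for the non-parallel fractions shown in the plots. The function names mirror the slide notation; the particular values of n and s in the loop are illustrative choices.

```python
def speedup(n: int, s: float) -> float:
    """Amdahl's Law: speedup on n processors with non-parallel fraction s."""
    return 1.0 / (s + (1.0 - s) / n)


def efficiency(n: int, s: float) -> float:
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup(n, s) / n


if __name__ == "__main__":
    n = 30  # number of processors used in the efficiency plot
    for s in (0.0, 0.05, 0.10, 0.20):
        print(f"s = {s:4.0%}  speedup = {speedup(n, s):6.2f}  "
              f"efficiency = {efficiency(n, s):6.1%}")
```

With s = 0 the speedup is linear (30 on 30 processors). With s = 0.05 the denominator is 0.05 + 0.95/30 = 49/600, so the speedup on 30 processors is 600/49, roughly 12.2, and the efficiency is about 41%.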
The non-parallel time s can be:
- Time spent communicating
- Time spent waiting for/sending jobs
- Time spent waiting for the completion of other processes
- Time spent calling the middleware for parallel programming
Remember: if s is even as small as 0.05, the maximum speedup is only 20

Maximum Speedup

$$\mathrm{speedup}(n, s) = \frac{1}{s + \frac{1 - s}{n}}$$

If you have ∞ processors, the term (1 − s)/n will be 0, so the maximum possible speedup is 1/s

non-parallel (s)    maximum speedup
0%                  ∞ (linear speedup)
5%                  20
10%                 10
20%                 5
25%                 4

On the Positive Side…
- You can run bigger problems
- You can run several simultaneous jobs (you have more parallelism available)
- Gustafson-Barsis with no equations: "9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months"

6. Multi-Processing
6.4. Hardware

[Figure: Cache Coherence Problem (UMA)]

[Figure: Maintaining Coherence: Snooping]

Snooping
- Block reads and writes
  - Multiple copies of a block are not a problem while there are only reads
  - However, when there is a write, a processor must have exclusive access to the block it wants to write
  - When processors perform a read, they must also always get the most recent value of the block in question
  - In snooping protocols, the hardware has to locate all the caches that hold a copy of the block whenever there is a write.
  - There are then two possible approaches:
    - Invalidate all the caches that hold that block (write-invalidate)
    - Update all the caches that hold that block

[Figure: Snooping Protocol (Example)]

Cache Coherence Problem (NUMA)
- The snooping approach does not scale to machines with tens/hundreds of processors (NUMA)
- In that case another type of protocol is used, based on Directories
  - A Directory is a centralized location that keeps information about who holds each block

5. Multi-Processing
5.4. Recent Aspects and Examples

Trends
- At this moment it is becoming extremely hard to scale processors in terms of individual performance and clock rate
  - The future is MULTI-PROCESSING!!!
- Intel, like other manufacturers, is introducing Simultaneous Multi-Threading (SMT), called HyperThreading in its terminology
  - A potentially reasonable performance gain (at most 30%) at the cost of a small number of extra transistors (5%)
  - Beware: it can lead to worse performance!
  - It prepares programmers for concurrent programming!!! :-) (the general opinion is that HyperThreading served only that purpose)
- Dual-core processors (two processors on the same die and/or package) will be commonplace in the next 2 to 3 years
- Multi-processor servers (SMP, Symmetrical Multi-Processing) are already commonplace
- Clusters are already commonplace

[Figure: Advertisements...]

How does HyperThreading work (1)?
[Figure: a "normal" superscalar processor and a dual processor (SMP)]

How does HyperThreading work (2)?
[Figure: Time-sliced Multithreaded CPU (Super-Threaded CPU) and Hyper-Threaded CPU]

Motivations for using Simultaneous Multi-Threading (SMT)
- There are normally more functional units available than those actually being used
  - Limitations on the size of basic blocks and/or on the available instruction-level parallelism (ILP)
- Current computers are constantly executing more than one program/thread
  - There is independent work available to do. It is just not in the same thread!
- One of the aspects in which this approach is very useful is hiding the unavoidable latencies of memory accesses or branch mispredictions
  - E.g. a thread that has to read data from memory may wait a long time for the data to arrive. At those times, with SMT, another thread can execute.

Implementation (Basic Idea)
- Replicate the processor front-end and everything that is visible in terms of the ISA (Instruction Set Architecture)
  - e.g. registers, program counters, etc.
  - In this way, one physical processor becomes two processors
- Some resources are partitioned (e.g. instruction issue queues) and others are shared (e.g. reorder buffers)

To finish... An example of a cluster!
- The GOOGLE cluster
  - Has to serve 1000 queries/second, with each query taking no more than 0.5 s!
  - 8 billion pages indexed (8,058,044,651 as of 1 May 2005)
  - Technique used to maintain the index: Inverted Tables (see TC/BD)
  - All pages are revisited monthly
- Machines of the GOOGLE cluster
  - "Cheap" PCs with Intel processors and 256 MB of RAM
  - About 6,000 processors and 12,000 disks (1 PByte of space, two disks per machine)
  - Red Hat Linux
  - 2 sites in California and 2 in Virginia
  - Network connection
    - Each site has an OC48 (2.5 Gbps) connection to the Internet
    - Between each pair of sites there is an OC12 (622 Mbps) backup link

Super-fast machines??
[Figure: 40 PCs/rack, 40 racks]

Reading material
- Computer Architecture: A Quantitative Approach, 3rd Ed.
  - Sections 6.1, 6.3, 6.5 (briefly), 6.9, 6.15
- Alternatively (or in addition), the material is very well explained in Chapter 9 of Computer Organization and Design, 3rd Ed., D. Patterson & J. Hennessy, Morgan Kaufmann, ISBN 1-55860-604-1, August 2004
  - In particular, the description of the Google cluster was taken from there. The only material not covered was Section 9.6
  - This chapter of the book will be placed online on the course site, available only to authenticated users
- Jon Stokes, "Introduction to Multithreading, Superthreading and Hyperthreading", in Ars Technica, October 2003, http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars