Arquitectura de Computadores II (Computer Architecture II)
6. Multiprocessing
6.1. Introduction
2004/2005

Paulo Marques
Departamento de Eng. Informática
Universidade de Coimbra
pmarques@dei.uc.pt

Important Note
- The presentation of this part of the material is largely based on an international course taught at DEI in September 2003 on “Cluster Computing and Parallel Programming”. The original slides can be found at:
  http://eden.dei.uc.pt/~pmarques/courses/best2003/pmarques_best.pdf
- In addition to those materials, the main sources are Chapter 6 of [CAQA] and Chapter 9 of “Computer Organization and Design”

Motivation
I have a program that takes 7 days to execute, which is far too long for practical use. How do I make it run in 1 day?
- Work smarter! (i.e. find better algorithms)
- Work faster! (i.e. buy a faster processor/memory/machine)
- Work harder! (i.e. add more processors!!!)

Motivation
- We are interested in the last approach: add more processors! (We don't care about being too smart or spending too much $$$ on bigger, faster machines!)
- Why?
  - It may not be feasible to find better algorithms
  - Normally, faster, bigger machines are very expensive
  - There are lots of computers available in any institution (especially at night)
  - There are computer centers from which you can buy parallel machine time
  - Adding more processors enables you not only to run things faster, but to run bigger problems

Motivation
- “Adding more processors enables you not only to run things faster, but to run bigger problems”?!
  - “9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months”
  - This is (informally) called the Gustafson-Barsis law
- What the Gustafson-Barsis law tells us is that when the size of the problem grows, there is normally more parallelism available

6. Multiprocessing
6.2. Machine Architecture

von Neumann Architecture
- Based on the fetch-decode-execute cycle
- The computer executes a single sequence of instructions that act on data. Both program and data are stored in memory.
[Figure: a flow of instructions A, B, C acting on data]

Flynn's Taxonomy
- Classifies computers according to:
  - The number of execution flows
  - The number of data flows

Number of execution flows | Single data flow | Multiple data flows
Single | SISD (Single-Instruction Single-Data) | SIMD (Single-Instruction Multiple-Data)
Multiple | MISD (Multiple-Instruction Single-Data) | MIMD (Multiple-Instruction Multiple-Data)

Single Instruction, Single Data (SISD)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Most PCs, single CPU workstations, …

Single Instruction, Multiple Data (SIMD)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing
- Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions
[Figure: ADD V3, V1, V2: vectors V1 and V2 are added element by element into V3]

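The SIMD idea of one instruction acting on many data elements can be sketched in a few lines of Python. This is only an illustration: the vector contents are hypothetical (the values in the original figure did not survive extraction), and the "lanes" are visited one after another, whereas SIMD hardware would process them in the same clock cycle.

```python
# Sketch of "ADD V3, V1, V2": the same operation applied to every element.
# The vector contents are hypothetical, not taken from the slide.

def simd_add(v1, v2):
    """Apply the same ADD to each pair of elements (one lane per element)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Real SIMD hardware executes all lanes in the same clock cycle;
    # this loop only emulates the result.
    return [a + b for a, b in zip(v1, v2)]

V1 = [1, 3, 4, 5]
V2 = [2, 3, 3, 5]
V3 = simd_add(V1, V2)
print(V3)  # [3, 6, 7, 10]
```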
The Connection Machine 2 (SIMD)
- The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors that work in parallel.

Multiple Instruction, Single Data (MISD)
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
  - the Data Flow Architecture

Multiple Instruction, Multiple Data (MIMD)
- Currently, the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, computer clusters, multi-processor SMP machines (including some types of PCs)

IBM BlueGene/L DD2
- Department of Energy's Lawrence Livermore National Laboratory (California, USA)
- Currently the fastest machine on earth (70 TFLOPS)
Some Facts
- 32768x 700MHz PowerPC440 CPUs (Dual Processors)
- 512MB RAM per node, total = 16TByte of RAM
- 3D Torus Network; 300MB/sec per node.

IBM BlueGene/L DD2

Level | Composition | Peak performance | Memory
Chip | 2 processors | 2.8/5.6 GF/s | 4 MB
Compute Card | 2 chips, 2x1x1 | 5.6/11.2 GF/s | 0.5 GB DDR
Node Board | 32 chips, 4x4x2 (16 Compute Cards) | 90/180 GF/s | 8 GB DDR
Cabinet | 32 Node boards, 8x8x16 | 2.9/5.7 TF/s | 256 GB DDR
System | 64 cabinets, 64x32x32 | 180/360 TF/s | 16 TB DDR

What about Memory?
- The interface between CPUs and memory in parallel machines is of crucial importance
- The bottleneck on the bus, which often sits between memory and CPU, is known as the von Neumann bottleneck
- It limits how fast a machine can operate: relationship between computation/communication

Communication in Parallel Machines
- Programs act on data.
- Quite important: how do processors access each other's data?
[Figure: Message Passing Model: CPUs, each with its own memory, connected by a network. Shared Memory Model: several CPUs connected to the same memory]

Shared Memory
- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space
- Multiple processors can operate independently but share the same memory resources
- Changes in a memory location made by one processor are visible to all other processors
- Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA

Shared Memory (2)
[Figure: UMA (Uniform Memory Access): a single 4-processor machine in which four CPUs share one memory. NUMA (Non-Uniform Memory Access): a 3-processor NUMA machine in which each CPU has its own memory and the memories are joined by a fast memory interconnect]

Uniform Memory Access (UMA)
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- Very hard to scale

Non-Uniform Memory Access (NUMA)
- Often made by physically linking two or more SMPs. One SMP can directly access memory of another SMP.
- Not all processors have equal access time to all memories
- Sometimes called DSM – Distributed Shared Memory
- More scalable than SMPs

UMA and NUMA
- Advantages
  - User-friendly programming perspective to memory
  - Data sharing between tasks is both fast and uniform due to the proximity of memory and CPUs
- Disadvantages
  - Lack of scalability between memory and CPUs
  - Programmer responsibility for synchronization constructs that ensure "correct" access of global memory
  - Expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors
[Figure caption: The Power Mac G5 features 2 PowerPC 970/G5 processors that share a common central memory (up to 8 GByte)]

[Figure caption: SGI Origin 3900: 16 R14000A processors per brick, each brick with 32 GBytes of RAM; 12.8 GB/s aggregated memory bandwidth (scales up to 512 processors and 1 TByte of memory)]

Distributed Memory (DM)
- Processors have their own local memory.
- Memory addresses in one processor do not map to another processor (no global address space)
- Because each processor has its own local memory, cache coherency does not apply
- Requires a communication network to connect inter-processor memory
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated.
- Synchronization between tasks is the programmer's responsibility
- Very scalable
- Cost effective: use of off-the-shelf processors and networking
- Slower than UMA and NUMA machines

Distributed Memory
[Figure: three computers, each with its own CPU and memory, joined by a network interconnect]
[Figure caption: TITAN@DEI, a PC cluster interconnected by FastEthernet]

Hybrid Architectures
- Today, most systems are hybrids that combine shared and distributed memory.
  - Each node has several processors that share a central memory
  - A fast switch interconnects the several nodes
  - In some cases the interconnect allows for the mapping of memory among nodes; in most cases it gives a message passing interface
[Figure: four nodes, each with four CPUs sharing one memory, joined by a fast network interconnect]

ASCI White at the Lawrence Livermore National Laboratory
- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors/node)
- Each node has 16GB of memory
- A total of 512 nodes, interconnected by a 2GB/sec node-to-node network
- The 512 nodes feature a total of 8192 processors, having a total of 8192 GB of memory
- It currently operates at 13.8 TFLOPS

Summary

Architecture | CC-UMA | CC-NUMA | Distributed/Hybrid
Examples | SMPs; Sun Vexx; SGI Challenge; IBM Power3 | SGI Origin; HP Exemplar; IBM Power4 | Cray T3E; IBM SP2
Programming | MPI; Threads; OpenMP; Shmem | MPI; Threads; OpenMP; Shmem | MPI
Scalability | <10 processors | <1000 processors | ~1000 processors
Drawbacks | Limited mem bw; Hard to scale | New architecture; Point-to-point communication | Costly system administration; Programming is hard to develop and maintain
Software Availability | Great | Great | Limited

Summary (2)
[Figure: plot of top 500 supercomputer sites over a decade]

6. Multiprocessing
6.3. Programming Models and Challenges

Warning
- We will now introduce the main ways in which you can program a parallel machine.
- Don't worry if you don't immediately visualize all the primitives that the APIs provide. We will cover that later.
- For now, you just have to understand the main ideas behind each paradigm.
- In summary: DON'T PANIC!

The main programming models…
- A programming model abstracts the programmer from the hardware implementation
- The programmer sees the whole machine as a big virtual computer which runs several tasks at the same time
- The main models in current use are:
  - Shared Memory
  - Message Passing
  - Data parallel / Parallel Programming Languages
- Note that this classification is not all-inclusive. There are hybrid approaches and some of the models overlap (e.g. data parallel with shared memory/message passing)

Shared Memory Model
[Figure: processes or threads A, B, C and D all access globally accessible (shared) memory holding double matrix_A[N], double matrix_B[N] and double result[N]]

Shared Memory Model
- Independently of the hardware, each program sees a global address space
- Several tasks execute at the same time and read and write from/to the same virtual memory
- Locks and semaphores may be used to control access to the shared memory
- An advantage of this model is that there is no notion of data “ownership”. Thus, there is no need to explicitly specify the communication of data between tasks. Program development can often be simplified
- An important disadvantage is that it becomes more difficult to understand and manage data locality. Performance can be seriously affected.

Shared Memory Modes
- There are two major shared memory models:
  - All tasks have access to all the address space (typical in UMA machines running several threads)
  - Each task has its own address space. Most of the address space is private. A certain zone is visible across all tasks. (typical in DSM machines running different processes)
[Figure: left, tasks A, B and C inside a single memory (all the tasks share the same address space); right, tasks A and B each with their own memory plus a shared memory zone visible to both]

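The use of locks to control access to shared memory can be sketched with Python's standard `threading` module. This is a minimal illustration, not part of the original slides: the thread count and iteration count are hypothetical, and in CPython the threads share one address space but do not necessarily run on separate processors.

```python
import threading

N_THREADS = 4          # hypothetical number of tasks
INCREMENTS = 10_000    # hypothetical amount of work per task

counter = 0                 # lives in the address space shared by all threads
lock = threading.Lock()     # controls access to the shared variable

def worker():
    global counter
    for _ in range(INCREMENTS):
        # Without the lock, the read-modify-write below could interleave
        # with another thread and lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

No data is ever sent between the tasks: every thread simply reads and writes `counter`, which is the convenience (and the synchronization burden) the slide describes.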
Shared Memory Model – Closely Coupled Implementations
- On shared memory platforms, the compiler translates user program variables into global memory addresses
- Typically a thread model is used for developing the applications
  - POSIX Threads
  - OpenMP
- There are also some parallel programming languages that offer a global memory model, although data and tasks are distributed
- For DSM machines, no standard exists, although there are some proprietary implementations

Shared Memory – Thread Model
- A single process can have multiple threads of execution
- Each thread can be scheduled on a different processor, taking advantage of the hardware
- All threads share the same address space
- From a programming perspective, thread implementations commonly comprise:
  - A library of subroutines that are called from within parallel code
  - A set of compiler directives embedded in either serial or parallel source code
- Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP

POSIX Threads
- Library based; requires parallel coding
- Specified by the IEEE POSIX 1003.1c standard (1995), also known as PThreads
- C Language
- Most hardware vendors now offer PThreads
- Very explicit parallelism; requires significant programmer attention to detail

OpenMP
- Compiler directive based; can use serial code
- Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released October 28, 1997. The C/C++ API was released in late 1998
- Portable / multi-platform, including Unix and Windows NT platforms
- Available in C/C++ and Fortran implementations
- Can be very easy and simple to use: provides for “incremental parallelism”
- No free compilers available

Message Passing Model
- The programmer must send and receive messages explicitly

Message Passing Model
- A set of tasks that use their own local memory during computation.
  - Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.
- Tasks exchange data through communications by sending and receiving messages
- Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

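The matching send/receive pattern can be sketched with `multiprocessing.Pipe` from the Python standard library. This is an illustration under stated assumptions: the two tasks are threads standing in for separate processes or machines, the `worker` function and the message contents are hypothetical, and the pipe plays the role of the network.

```python
import threading
from multiprocessing import Pipe

def worker(conn):
    # This task only sees the data that arrives in a message.
    numbers = conn.recv()          # matches the send() in the main task
    conn.send(sum(numbers))        # reply with the result
    conn.close()

main_end, worker_end = Pipe()      # two connected endpoints
task = threading.Thread(target=worker, args=(worker_end,))
task.start()

main_end.send([1, 2, 3, 4])        # explicit send ...
total = main_end.recv()            # ... and the matching explicit receive
task.join()

print(total)  # 10
```

Each `send` copies the data through the pipe rather than sharing a reference, which mirrors the "own local memory" property of the model.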
Message Passing Implementations
- Message Passing is generally implemented as libraries which the programmer calls
- A variety of message passing libraries have been available since the 1980s
  - These implementations differed substantially from each other, making it difficult for programmers to develop portable applications
- In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations

MPI – The Message Passing Interface
- Part 1 of the Message Passing Interface (MPI), the core, was released in 1994.
  - Part 2 (MPI-2), the extensions, was released in 1996.
  - Freely available on the web: http://www.mpi-forum.org/docs/docs.html
- MPI is now the “de facto” industry standard for message passing
  - Nevertheless, most systems do not implement the full specification, especially MPI-2
- For shared memory architectures, MPI implementations usually don't use a network for task communications
  - Typically a set of devices is provided: some for network communication, some for shared memory. In most cases, they can coexist.

Data Parallel Model
- Typically a set of tasks performs the same operations on different parts of a big array

Data Parallel Model
- The data parallel model demonstrates the following characteristics:
  - Most of the parallel work focuses on performing operations on a data set
  - The data set is organized into a common structure, such as an array or cube
  - A set of tasks works collectively on the same data structure; however, each task works on a different partition of the same data structure
  - Tasks perform the same operation on their partition of work, for example, “add 4 to every array element”
- On shared memory architectures, all tasks may have access to the data structure through global memory.
- On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task

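The “add 4 to every array element” example can be sketched as follows. This is only an illustration: the array contents, the number of tasks and the helper names `partition` and `add_four` are hypothetical, and a thread pool stands in for the set of tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_tasks):
    """Split data into n_tasks contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def add_four(chunk):
    """The single operation every task applies to its own partition."""
    return [x + 4 for x in chunk]

data = list(range(10))           # hypothetical "big array"
chunks = partition(data, 3)      # hypothetical number of tasks
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(add_four, chunks))   # map preserves chunk order

result = [x for chunk in results for x in chunk]
print(result)  # [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```

On a distributed memory machine the chunks would live in the local memory of each task, and the final gathering step would be done with messages.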
Data Parallel Programming
- Typically accomplished by writing a program with data parallel constructs
  - calls to a data parallel subroutine library
  - compiler directives
- In most cases, parallel compilers are used:
  - High Performance Fortran (HPF): Extensions to Fortran 90 to support data parallel programming
  - Compiler Directives: Allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms
- DM implementations have the compiler convert the program into calls to a message passing library to distribute the data to all the processes.
  - All message passing is done invisibly to the programmer

Summary
- Middleware for parallel programming:
  - Shared memory: all the tasks (threads or processes) see a global address space. They read and write directly from memory and synchronize explicitly.
  - Message passing: the tasks have private memory. For exchanging information, they send and receive data through a network. There is always a send() and receive() primitive.
  - Data parallel: the tasks work on different parts of a big array. Typically accomplished by using a parallel compiler which allows data distribution to be specified.

Final Considerations…
Beware of Amdahl's Law!

Load Balancing
- Load balancing is always a factor to consider when developing a parallel application.
  - Too big granularity: poor load balancing
  - Too small granularity: too much communication
- The ratio computation/communication is of crucial importance!
[Figure: timeline of Task 1, Task 2 and Task 3, showing periods of work and periods of waiting over time]

Amdahl's Law
- The speedup depends on the amount of code that cannot be parallelized:

$$\mathrm{speedup}(n, s) = \frac{T}{T \cdot s + \frac{T \cdot (1 - s)}{n}} = \frac{1}{s + \frac{1 - s}{n}}$$

  - n: number of processors
  - s: percentage of code that cannot be made parallel
  - T: time it takes to run the code serially

Amdahl's Law – The Bad News!
[Figure: Speedup vs. Percentage of Non-Parallel Code. x-axis: Number of Processors (1 to 30); y-axis: Speedup (0 to 30); curves for 0% (linear speedup), 5%, 10% and 20% non-parallel code]

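The speedup formula can be evaluated directly. The sketch below is a straightforward transcription of speedup(n, s) = 1 / (s + (1 - s)/n); the function name and the choice of 30 processors (taken from the range of the plot) are illustrative.

```python
def speedup(n, s):
    """Amdahl's Law: speedup on n processors when a fraction s is not parallel."""
    if n < 1 or not 0.0 <= s <= 1.0:
        raise ValueError("need n >= 1 and 0 <= s <= 1")
    return 1.0 / (s + (1.0 - s) / n)

# Non-parallel fractions matching the curves of the plot.
for s in (0.0, 0.05, 0.10, 0.20):
    print(f"s = {s:4.0%}: speedup on 30 processors = {speedup(30, s):5.2f}")
```

With s = 0 the speedup equals the number of processors (linear speedup); any non-zero s bends the curve away from that line.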
Efficiency Using 30 Processors

$$\mathrm{efficiency}(n, s) = \frac{\mathrm{speedup}(n, s)}{n}$$

[Figure: Efficiency (%) (0 to 100) vs. Percentage of Non-Parallel Code (0%, 5%, 10%, 20%)]

What Is That “s” Anyway?
- Three slides ago…
  - “s: percentage of code that cannot be made parallel”
- Actually, it's worse than that: it's the percentage of time that cannot be executed in parallel. It can be:
  - Time spent communicating
  - Time spent waiting for/sending jobs
  - Time spent waiting for the completion of other processes
  - Time spent calling the middleware for parallel programming
- Remember…
  - if s is even as small as 0.05, the maximum speedup is only 20

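To make the point about “s” concrete, the sketch below derives s from a breakdown of where the time goes and then computes the resulting bound and the efficiency on 30 processors. All timings are hypothetical, and the sketch assumes, for simplicity, that the compute time is perfectly parallelizable while every other category is not.

```python
def serial_fraction(compute, communicate, wait_jobs, wait_others, middleware):
    """Fraction of the total time that cannot be executed in parallel."""
    overhead = communicate + wait_jobs + wait_others + middleware
    return overhead / (compute + overhead)

def speedup(n, s):
    """Amdahl's Law."""
    return 1.0 / (s + (1.0 - s) / n)

def efficiency(n, s):
    """Speedup per processor."""
    return speedup(n, s) / n

# Hypothetical timings, in seconds.
s = serial_fraction(compute=95.0, communicate=2.0, wait_jobs=1.0,
                    wait_others=1.5, middleware=0.5)
max_speedup = 1.0 / s

print(f"s = {s:.2f}, maximum speedup = {max_speedup:.0f}")
print(f"efficiency on 30 processors = {efficiency(30, s):.1%}")
```

Even though only 5 of the 100 seconds are spent outside computation, the 30 processors are used at well under half of their capacity.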
Maximum Speedup

$$\mathrm{speedup}(n, s) = \frac{1}{s + \frac{1 - s}{n}}$$

- If you have ∞ processors, the term (1 − s)/n will be 0, so the maximum possible speedup is 1/s

non-parallel (s) | maximum speedup
0% | ∞ (linear speedup)
5% | 20
10% | 10
20% | 5
25% | 4

On the Positive Side…
- You can run bigger problems
- You can run several simultaneous jobs (you have more parallelism available)
- Gustafson-Barsis with no equations: “9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months”

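The slides deliberately state Gustafson-Barsis without equations. For readers who want one, the commonly cited form of the scaled speedup is s + (1 − s)·n, where s is the non-parallel fraction of the time measured on the parallel run (not on the serial run, as in Amdahl's Law). The sketch below compares the two under that assumption; the values n = 9 and s = 0.05 are hypothetical.

```python
def amdahl_speedup(n, s):
    """Fixed problem size: s is the non-parallel fraction of the serial run."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson_speedup(n, s):
    """Scaled problem size: s is the non-parallel fraction of the parallel run."""
    return s + (1.0 - s) * n

n, s = 9, 0.05   # hypothetical values
fixed = amdahl_speedup(n, s)
scaled = gustafson_speedup(n, s)
print(f"fixed-size speedup:  {fixed:.2f}")
print(f"scaled-size speedup: {scaled:.2f}")
```

The scaled figure grows almost linearly with n because the problem grows with the machine, which is the "9 babies in 9 months" reading.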
6. Multiprocessing
6.4. Hardware

The Cache Coherence Problem (UMA)

Maintaining Coherence: Snooping

Snooping
- Block reads and writes
  - Multiple copies of a block are not a problem as long as there are only reads
  - However, when there is a write, a processor must have exclusive access to the block it wants to write
  - When processors perform a read, they must also always get the most recent value of the block in question
- In snooping protocols, the hardware has to locate all the caches that hold a copy of the block when a write occurs. There are then two possible approaches:
  - Invalidate all caches that hold that block (write-invalidate)
  - Update all caches that hold that block

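The write-invalidate approach can be sketched as a toy simulation. This is an illustration only, not a description of any real protocol's state machine: the `Bus` and `Cache` classes are hypothetical, memory is updated on every write (a write-through assumption), and the bus broadcast is a plain loop.

```python
class Bus:
    """Shared bus that every cache snoops; it also fronts main memory."""
    def __init__(self):
        self.memory = {}      # block address -> value
        self.caches = []

    def broadcast_invalidate(self, addr, writer):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)   # drop the stale copy, if any

class Cache:
    def __init__(self, bus):
        self.lines = {}       # block address -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        # Gain exclusive access by invalidating all other copies first.
        self.bus.broadcast_invalidate(addr, writer=self)
        self.lines[addr] = value
        self.bus.memory[addr] = value         # write-through (assumption)

bus = Bus()
c0, c1 = Cache(bus), Cache(bus)
bus.memory[0x10] = 1
c0.read(0x10)
c1.read(0x10)                      # both caches now hold a copy
c0.write(0x10, 42)                 # c1's copy is invalidated
stale_copy_dropped = 0x10 not in c1.lines
value_seen_by_c1 = c1.read(0x10)   # miss, so the latest value is fetched
print(stale_copy_dropped, value_seen_by_c1)  # True 42
```

The alternative approach in the list above would replace the invalidation with an update of every other cached copy.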
Snooping Protocol (Example)

The Cache Coherence Problem (NUMA)
- The snooping approach does not scale to machines with tens/hundreds of processors (NUMA)
- In that case a different kind of protocol is used: directory-based protocols
  - A directory is a centralized location that keeps information about who holds each block

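The role of the directory can be sketched as a table from block address to the set of nodes holding a copy. This is a minimal illustration: the `Directory` class, its method names and the node numbers are hypothetical, and real directory protocols track more state per block.

```python
class Directory:
    """Centralized record of which nodes hold a copy of each block."""
    def __init__(self):
        self.sharers = {}     # block address -> set of node ids

    def record_read(self, addr, node):
        self.sharers.setdefault(addr, set()).add(node)

    def record_write(self, addr, node):
        """Return the nodes to invalidate; the writer becomes the sole holder."""
        to_invalidate = self.sharers.get(addr, set()) - {node}
        self.sharers[addr] = {node}
        return to_invalidate

directory = Directory()
for node in (0, 3, 7):
    directory.record_read(0x40, node)

targets = directory.record_write(0x40, node=3)
print(sorted(targets))  # [0, 7]
```

Only the nodes listed by the directory receive an invalidation message, so no broadcast to every cache is needed, which is why this approach scales better than snooping.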
6. Multiprocessing
6.5. Recent Aspects and Examples

Trends
- At this point it is becoming extremely hard to scale processors in terms of individual performance and clock rate
- The future is MULTIPROCESSING!!!
- Intel, like other manufacturers, is introducing Simultaneous Multi-Threading (SMT), which in its terminology is called HyperThreading
  - A potentially reasonable performance gain (max = 30%) at the cost of a small number of extra transistors (5%)
  - Caution: it can lead to worse performance!
  - It prepares programmers for concurrent programming!!! :-) (the general opinion is that HyperThreading served only that purpose)
- Dual-cores (two processors on the same die and/or package) will be commonplace in the next 2-3 years
- Multi-processor servers (SMP – Symmetric Multiprocessing) are now commonplace
- Clusters are now commonplace

Announcements...

How does HyperThreading work (1)?
[Figure: a “normal” superscalar processor compared with a dual processor (SMP)]

How does HyperThreading work (2)?
[Figure: a time-sliced multithreaded CPU (Super-Threaded CPU) compared with a Hyper-Threaded CPU]

Motivations for using Simultaneous Multi-Threading (SMT)
- Normally there are more functional units available than those being used
  - Limitations on the size of basic blocks and/or on the available instruction-level parallelism (ILP)
- Today's computers are constantly running more than one program/thread
  - There is independent work available to do. It just isn't in the same thread!
- One of the aspects where this approach is very useful is hiding unavoidable memory access latencies or branch mispredictions
  - E.g. a thread that has to read data from memory may wait a long time until the data arrives. At those moments, with SMT, another thread can execute.

Implementation (Basic Idea)
- Replicate the processor front-end and everything that is visible at the ISA (Instruction Set Architecture) level
  - e.g. registers, program counters, etc.
  - This way, one physical processor becomes two processors
- Some resources are partitioned (e.g. instruction issue queues) and others are shared (e.g. reorder buffers)

To finish... An example of a cluster!

The GOOGLE Cluster
- It has to serve 1000 queries/second, with each query taking no more than 0.5 s!
- 8 billion pages indexed (8,058,044,651 as of 1 May 2005)
  - Technique used to maintain the index: inverted tables (see TC/BD)
  - All pages are revisited monthly
- GOOGLE cluster machines
  - “Cheap” PCs with Intel processors and 256MB RAM
  - About 6,000 processors, 12,000 disks (1 PByte of space, two disks per machine)
  - Red Hat Linux
  - 2 sites in California and 2 in Virginia
- Network connection
  - Each site has an OC48 (2.5 Gbps) connection to the Internet
  - Between each pair of sites there is an OC12 (622 Mbps) backup link

Super-fast machines??
- 40 PCs/rack
- 40 racks

Reading Material
- Computer Architecture: A Quantitative Approach, 3rd Ed.
  - Sections 6.1, 6.3, 6.5 (briefly), 6.9, 6.15
- Alternatively (or additionally), the material is explained quite well in Chapter 9 of:
  - Computer Organization and Design, 3rd Ed., D. Patterson & J. Hennessy, Morgan Kaufmann, ISBN 1-55860-604-1, August 2004
  - In particular, the description of the Google cluster was taken from there. The only material not covered was Section 9.6
  - This chapter of the book will be placed online on the course site, available only to authenticated users
- Jon Stokes, “Introduction to Multithreading, Superthreading and Hyperthreading”, in Ars Technica, October 2003
  http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars