From Multicores to Manycores Processors: Challenging Programming Issues with the MPPA/Kalray
Transcription
From Multicores to Manycores Processors: Challenging Programming Issues with the MPPA/Kalray
Márcio Castro (1), Emilio Francesquini (2,3), Thomas Messi Nguélé (4), Jean-François Mehaut (2,5)
(1) Federal University of Rio Grande do Sul  (2) University of Grenoble  (3) University of São Paulo  (4) University of Yaoundé  (5) CEA - DRT
JLPC Workshop - November 2013

Outline
1 Introduction
2 Platforms
3 Case study: the TSP problem
4 Adapting the TSP for manycores
5 Computing and energy performance results
6 Conclusions

Introduction
- Demand for higher processor performance led to increases in clock frequency
- Turning point: power consumption changed the course of development of new processors
- Trend in parallel computing: the number of cores per die continues to increase, towards hundreds or even thousands of cores
- Different execution and programming models:
  - Light-weight manycore processors: autonomous cores, POSIX threads, data and task parallelism
  - GPUs: SIMD model, CUDA and OpenCL

Introduction (continued)
- Energy efficiency is already a primary concern
- Mont-Blanc project (http://montblanc-project.eu): develop a full energy-efficient HPC system using low-power, commercially available embedded processors
- What we have been seeing: performance and energy efficiency of numerical kernels on multicores
- What is missing?
  1 Few works on embedded and manycore processors
  2 What about irregular applications?
Introduction: our goals
- Analyze the computing and energy performance of multicore and manycore processors
- Consider an irregular application as a case study: the Traveling Salesman Problem (TSP)
- Consider a new manycore chip (MPPA-256) alongside general-purpose (Intel Sandy Bridge) and embedded (ARM) multicore processors

Platforms: general-purpose processors
We considered 4 platforms in this study, mixing general-purpose and embedded processors.
- Xeon E5: an Intel Xeon E5-4640 Sandy Bridge-EP processor chip, with 8 CPU cores (16 threads with Hyper-Threading enabled) running at 2.40 GHz
- Altix UV 2000: a NUMA platform composed of 24 Xeon E5 processors interconnected by NUMAlink 6
[Figure 1: a simplified view of the Altix UV 2000 — 24 NUMA nodes (Node 0 to Node 23), each holding one Intel Xeon E5 (8 cores / 16 hardware threads, per-core L2 caches, a shared L3) and 32 GB of main memory.]

Platforms: embedded processors
- Carma: a development kit from SECO that features a quad-core NVIDIA Tegra 3 running at 1.3 GHz
- MPPA-256: a single-chip manycore processor developed by Kalray that integrates 256 user cores and 32 system cores running at 400 MHz

Kalray
- French startup founded in 2008; headquarters in Grenoble (Montbonnot) and Paris (Orsay), with offices in Japan and the US (California)
- Disruptive technology based on research from CEA-LETI (hardware), INRIA (compilers) and CEA-LIST (data-flow languages)
- Multi-purpose, massively parallel, low-power processors and system software
- 55 employees, of which 45 in research and development
- Fabless company: manufacturing is delegated to state-of-the-art foundries (TSMC, 28 nm)
Kalray MPPA-256
Inside the chip:
- 256 cores running at 400 MHz, organised as 16 compute clusters with 16 PEs (processing elements) each
- The PEs of a cluster share 2 MB of memory
- There is no cache coherence protocol inside the clusters; both PEs and RMs (resource managers) have private 2-way instruction and data caches
- Communication between clusters goes through a Network-on-Chip (NoC): one for data (D-NoC) and one for control (C-NoC)
- 4 I/O subsystems, 2 of them connected to external memory (PCIe, DDR)
[Figure: block diagram of the MPPA-256, showing a compute cluster (PE0-PE15, 2 MB shared memory, RM, D-NoC and C-NoC routers) and the four I/O subsystems.]

Execution model (a minimal threading sketch follows below):
- A master process runs on an RM of an I/O subsystem
- The master process spawns slave processes, one per compute cluster
- Each slave process runs on PE0 and may create up to 15 threads, one for each remaining PE; the threads share the 2 MB of memory within the cluster
- Communications are done through remote writes; data travel through the NoC
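To make the intra-cluster part of this model concrete, here is a minimal sketch of the fan-out performed by a slave process, assuming only the POSIX-threads support available inside a compute cluster. This is not the authors' code: the cluster-level spawning of slaves and the NoC communication go through Kalray's toolchain and are not shown, and the names worker() and N_PES are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define N_PES 15                 /* PE1..PE15; PE0 runs the slave's main thread */

    /* Work executed on each PE. All threads of the cluster share the same
     * 2 MB of local memory, so plain pointers can be passed around. */
    static void *worker(void *arg)
    {
        int pe = *(int *)arg;
        printf("hello from PE %d\n", pe);
        return NULL;
    }

    int main(void)                   /* the slave process, running on PE0 */
    {
        pthread_t thread[N_PES];
        int pe_id[N_PES];

        for (int i = 0; i < N_PES; i++) {
            pe_id[i] = i + 1;
            pthread_create(&thread[i], NULL, worker, &pe_id[i]);
        }
        for (int i = 0; i < N_PES; i++)
            pthread_join(thread[i], NULL);
        return 0;
    }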
Case study: the Travelling Salesman Problem (TSP)
Definition: find a shortest possible path that passes through n cities, visiting each city only once, and returns to the city of origin.
Representation: a complete undirected graph in which nodes are cities and edges carry the distances (costs). Example with n = 4: the possible paths starting from city 1 are 1-2-3-4-1, 1-2-4-3-1, 1-3-2-4-1, 1-3-4-2-1, ...

TSP: sequential algorithm
- Recursive method, depth-first traversal
- Branch and bound: does not explore paths that are already longer than the current shortest one
The sequential version is based on the branch-and-bound method using brute force. It takes as input the number of cities and a cost matrix, and outputs the minimum path length:

    global min_path

    procedure tsp_solve(last_city, current_cost, cities)
      if cities = ∅
        then return current_cost
      for each i in cities do
        new_cost ← current_cost + costs[last_city, i]
        if new_cost < min_path
          then new_min ← tsp_solve(i, new_cost, cities \ {i})
               atomic_update_if_less(min_path, new_min)

    main
      min_path ← ∞
      tsp_solve(1, 0, {2, 3, ..., n_cities})
      output(min_path)

(Figure 4: sequential version of the TSP.)

This algorithm does a depth-first search looking for the shortest path and has complexity O(n!). It does not explore paths that are already known to be longer than the best path found so far, therefore discarding fruitless branches (the shaded edges in Figure 3b are those the algorithm does not follow, since a solution including them would be more costly than the one already found). This simple pruning technique greatly reduces the explored search space.
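A compact C rendering of the pseudocode above, using a bitmask for the set of remaining cities. The cost-matrix values are made up for illustration (the 4-city example of the slides is not fully legible in the transcription), and, as in the pseudocode, the edge back to the origin city is not added when a complete path is reached.

    #include <stdio.h>

    #define N   4
    #define INF 1000000000

    /* Hypothetical symmetric cost matrix for a 4-city example. */
    static const int costs[N][N] = {
        {  0, 12, 65, 10 },
        { 12,  0, 10,  5 },
        { 65, 10,  0,  8 },
        { 10,  5,  8,  0 },
    };

    static int min_path = INF;

    /* Depth-first branch and bound; 'visited' is a bitmask of visited cities. */
    static void tsp_solve(int last_city, int current_cost, unsigned visited)
    {
        if (visited == (1u << N) - 1) {          /* all cities visited */
            if (current_cost < min_path)
                min_path = current_cost;
            return;
        }
        for (int i = 0; i < N; i++) {
            if (visited & (1u << i))
                continue;
            int new_cost = current_cost + costs[last_city][i];
            if (new_cost < min_path)             /* prune longer partial paths */
                tsp_solve(i, new_cost, visited | (1u << i));
        }
    }

    int main(void)
    {
        tsp_solve(0, 0, 1u);                     /* start from city 0 */
        printf("min_path = %d\n", min_path);
        return 0;
    }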
TSP: sequential algorithm (example)
The slides then step through the traversal on the 4-city example, showing current_cost and min_path after each move. The first complete tour found sets min_path = 92; backtracking then finds a cheaper tour and lowers min_path to 35; from that point on, partial paths whose cost already exceeds 35 (for instance one reaching current_cost = 73) are pruned without being expanded.

Irregular behavior: the pruning approach introduces irregularities into the search space!

TSP: multithreaded algorithm
- Generates tasks sequentially at the beginning
- Enqueues the tasks in a centralized task queue
- Threads atomically dequeue tasks and call tsp_solve()

    global queue, min_path

    procedure generate_tasks(n_hops, last_city, current_cost, cities)
      if n_hops = max_hops
        then task ← (last_city, current_cost, cities)
             enqueue_task(queue, task)
        else for each i in cities do
               if last_city = none
                 then last_cost ← 0
                 else last_cost ← costs[last_city, i]
               new_cost ← current_cost + last_cost
               generate_tasks(n_hops + 1, i, new_cost, cities \ {i})

    procedure do_work()
      while queue ≠ ∅ do
        (last_city, current_cost, cities) ← atomic_dequeue(queue)
        tsp_solve(last_city, current_cost, cities)

    main
      min_path ← ∞
      generate_tasks(0, none, 0, {1, 2, ..., n_cities})
      for i ← 1 to n_threads do spawn_thread(do_work())
      wait_every_child_thread()
      output(min_path)

(Figure 5: multi-threaded version of the TSP.)
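A possible pthreads rendering of this scheme, self-contained for illustration: tasks are path prefixes of a fixed length (MAX_HOPS) generated sequentially into a centralized queue, and worker threads pop them under a mutex and run the branch and bound from that point. The constants (N, MAX_HOPS, NTHREADS) and the dummy cost matrix are illustrative, not taken from the study.

    #include <pthread.h>
    #include <stdio.h>

    #define N         10
    #define INF       1000000000
    #define MAX_HOPS  2                      /* a task = a 2-city path prefix */
    #define MAX_TASKS 4096
    #define NTHREADS  4

    typedef struct { int last_city; int cost; unsigned visited; } task_t;

    static int costs[N][N];
    static task_t queue[MAX_TASKS];
    static int queue_len = 0, queue_head = 0;
    static int min_path = INF;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Depth-first branch and bound from a partial path. The pruning test reads
     * min_path without the lock: a slightly stale bound only delays pruning,
     * while the authoritative update is done under the lock. */
    static void tsp_solve(int last_city, int cost, unsigned visited)
    {
        if (visited == (1u << N) - 1) {
            pthread_mutex_lock(&lock);       /* atomic "update if less" */
            if (cost < min_path)
                min_path = cost;
            pthread_mutex_unlock(&lock);
            return;
        }
        for (int i = 0; i < N; i++) {
            if (visited & (1u << i))
                continue;
            int c = cost + costs[last_city][i];
            if (c < min_path)
                tsp_solve(i, c, visited | (1u << i));
        }
    }

    /* Enumerate all path prefixes of MAX_HOPS cities and enqueue them as tasks. */
    static void generate_tasks(int n_hops, int last_city, int cost, unsigned visited)
    {
        if (n_hops == MAX_HOPS) {
            queue[queue_len++] = (task_t){ last_city, cost, visited };
            return;
        }
        for (int i = 0; i < N; i++) {
            if (visited & (1u << i))
                continue;
            int c = cost + (last_city < 0 ? 0 : costs[last_city][i]);
            generate_tasks(n_hops + 1, i, c, visited | (1u << i));
        }
    }

    /* Worker threads dequeue tasks atomically until the queue is empty. */
    static void *do_work(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            if (queue_head == queue_len) { pthread_mutex_unlock(&lock); break; }
            task_t t = queue[queue_head++];
            pthread_mutex_unlock(&lock);
            tsp_solve(t.last_city, t.cost, t.visited);
        }
        return NULL;
    }

    int main(void)
    {
        /* Fill costs[][] with symmetric dummy distances. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                costs[i][j] = (i == j) ? 0 : 1 + (i + j) * 7 % 50;

        generate_tasks(0, -1, 0, 0u);        /* sequential task generation */

        pthread_t th[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&th[i], NULL, do_work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);

        printf("min_path = %d\n", min_path);
        return 0;
    }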
TSP: distributed algorithm
- Peers run the multithreaded algorithm
- A master peer enqueues partitions (sets of tasks) in the peers' local task queues
- Peers broadcast new shortest paths when they are found
- The master peer sends partitions of decreasing size at each request, to reduce the imbalance between the peers at the end of the execution
[Figure: the master peer serves partitions to peers 1..n on request; peers exchange new min_path values and run the multithreaded algorithm locally.]

Adapting the TSP for manycores
Xeon E5 and Carma:
- Multithreaded algorithm
- The shared variable min_path stores the shortest path and can be updated by all threads (locks needed)
Altix UV 2000:
- Distributed algorithm
- Broadcasts need no explicit communication (locks and condition variables needed)
- Thread and data affinity are used to reduce NUMA penalties
MPPA-256:
- Distributed algorithm
- Communications between the master and the peers are done with remote writes
- Absence of cache coherence: worker threads inside a peer might use a stale value of min_path, causing a performance loss
- We used platform-specific instructions to bypass the cache when reading/writing min_path

Computing and energy performance results: measurement methodology
Metrics:
- Time-to-solution: the time to reach a solution for a given problem
- Energy-to-solution: the amount of energy spent to reach a solution for a given problem
Power and energy measurements (a small computation sketch follows below):
- Xeon E5 and Altix UV 2000: energy sensors (hardware counters)
- MPPA-256: energy sensors
- Carma (Tegra 3): power consumption taken from the specification

    Platform         Power (W)
    Xeon E5          68.6
    Altix UV 2000    1,418.4
    Carma            5.88
    MPPA-256         8.26
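How energy-to-solution is obtained deserves a short note. Where energy sensors exist the value can be read directly; where only a nominal power figure or a series of power samples is available, it has to be derived from power and time. The sketch below shows the two obvious derivations; it illustrates the metric and is not the authors' measurement code. The numbers in main() are placeholders, except for the nominal power values quoted in the table above.

    #include <stdio.h>

    /* Energy-to-solution from a constant (nominal) power figure,
     * e.g. when only the power specification of the board is known. */
    static double energy_from_spec(double power_watts, double time_s)
    {
        return power_watts * time_s;                 /* joules */
    }

    /* Energy-to-solution from timestamped power samples
     * (trapezoidal integration of power over time). */
    static double energy_from_samples(const double *t, const double *p, int n)
    {
        double e = 0.0;
        for (int i = 1; i < n; i++)
            e += 0.5 * (p[i - 1] + p[i]) * (t[i] - t[i - 1]);
        return e;
    }

    int main(void)
    {
        /* 5.88 W is Carma's nominal power; the 100 s time is made up. */
        printf("spec-based estimate:   %.1f J\n", energy_from_spec(5.88, 100.0));

        double t[] = { 0.0, 1.0, 2.0, 3.0 };         /* seconds */
        double p[] = { 8.1, 8.3, 8.2, 8.26 };        /* watts   */
        printf("sample-based estimate: %.2f J\n", energy_from_samples(t, p, 4));
        return 0;
    }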
Chip-to-chip comparison: performance and energy
[Figure 6: time- and energy-to-solution comparison between the multicore and manycore processors — Xeon E5 (8 cores + HT), MPPA-256 (256 cores) and Carma (4 cores).]
- Although the Xeon E5 has 8 physical cores, Hyper-Threading (HT) doubles the number of logical cores, allowing 16 threads to execute concurrently; HT was beneficial in our case, improving the performance of the multithreaded TSP.
- Time-to-solution: as expected, the TSP on Carma presented the highest execution times among all processors, being 8.6x slower than Xeon E5. Surprisingly, MPPA-256 presented the best execution time among all processors, executing the TSP about 1.6x faster than Xeon E5.
- Results on MPPA-256: roughly 1.6x faster than Xeon E5, and roughly 10x less energy than Carma (Tegra 3).

Energy-to-solution: MPPA-256 vs. Altix UV 2000
- Comparison of an MPPA-256 single cluster against an Altix UV 2000 single node, then varying the number of clusters/nodes
- Peers: NUMA nodes on the Altix UV 2000, compute clusters on the MPPA-256; the number of peers is varied while fixing the number of threads per processor/cluster to 16. A medium problem size was used to compare the energy-to-solution from 2 to 12 peers and a larger one for more than 12 peers, due to time constraints.
- Single node/cluster power consumption: a Xeon E5 node of the Altix UV 2000 ranges from 27 W (1 thread) up to 68.6 W (16 threads); an MPPA-256 cluster ranges from 3.7 W (1 thread) up to 4 W (16 threads)
- The broadcast communication between clusters was implemented using asynchronous messages over the D-NoC
[Figure 7: energy-to-solution comparison between Altix UV 2000 and MPPA-256 for three problem sizes — (a) small, 16 cities; (b) medium, 18 cities; (c) large, 20 cities — as a function of the number of threads (single node/cluster) and of the number of peers.]
- MPPA-256 achieved a much better energy-to-solution when the number of peers is varied: from 2 to 12 peers it spends 1.6x to 8.3x less energy than the Altix UV 2000, and from 13 to 16 peers about 11.3x less energy.
- When comparing a single Altix UV 2000 processor against a single MPPA-256 cluster, however, the Altix UV 2000 outperformed the MPPA-256.

Conclusions
For a fairly parallelizable application such as the TSP, MPPA-256 can be very competitive:
- Better performance than Xeon E5 (about 1.6x)
- Better energy efficiency than Tegra 3 (about 9.8x)
However...
- It demands non-trivial source code adaptations so that applications can efficiently use the whole chip
- The absence of a coherent cache considerably increases the implementation complexity

Future works
- Assess how well the performance of this processor fares on applications with heavier communication patterns; work in progress: we are adapting a seismic wave propagation simulator developed by BRGM (France) to the MPPA-256
- Compare the computing and energy performance of MPPA-256 with other low-power processors, such as mobile versions of Intel's Sandy Bridge processors, and investigate other manycores such as Tilera's TILE-Gx
- Frameworks for heterogeneous platforms; work in progress: support for OpenCL on MPPA-256
- JLPC: potential collaborations with Wen-Mei Hwu (UIUC) and John Stratton (MCW-UIUC)

Questions?