From Multicores to Manycores Processors: Challenging Programming Issues with the MPPA/Kalray
Transcription
From Multicores to Manycores Processors: Challenging Programming Issues with the MPPA/Kalray
Márcio Castro (1), Emilio Francesquini (2,3), Thomas Messi Nguélé (4), Jean-François Mehaut (2,5)
(1) Federal University of Rio Grande do Sul  (2) University of Grenoble  (3) University of São Paulo  (4) University of Yaoundé  (5) CEA - DRT
JLPC Workshop - November 2013

Outline
1 Introduction
2 Platforms
3 Case study: the TSP problem
4 Adapting the TSP for manycores
5 Computing and energy performance results
6 Conclusions

Introduction
- Demand for higher processor performance led to increases in clock frequency
- Turning point: power consumption changed the course of development of new processors
- Trend in parallel computing: the number of cores per die continues to increase, towards hundreds or even thousands of cores
- Different execution and programming models:
  - Light-weight manycore processors: autonomous cores, POSIX threads, data and task parallelism
  - GPUs: SIMD model, CUDA and OpenCL

Introduction (continued)
- Energy efficiency is already a primary concern
- Mont-Blanc project (http://montblanc-project.eu): develop a full energy-efficient HPC system using low-power, commercially available embedded processors
- What we have been seeing: performance and energy efficiency of numerical kernels on multicores
- What is missing?
  1 Few works on embedded and manycore processors
  2 What about irregular applications?
Introduction: our goals
- Analyze the computing and energy performance of multicore and manycore processors
- Consider an irregular application as a case study: the Traveling Salesman Problem (TSP)
- Consider a new manycore chip (MPPA-256) alongside general-purpose (Intel Sandy Bridge) and embedded (ARM) multicore processors

Platforms: general-purpose processors
We considered 4 platforms in this study, mixing general-purpose and embedded processors.
- Xeon E5: an Intel Xeon E5-4640 Sandy Bridge-EP processor chip, with 8 CPU cores (16 threads with Hyper-Threading enabled) running at 2.40 GHz
- Altix UV 2000: a NUMA platform composed of 24 Xeon E5 processors interconnected by NUMAlink 6
[Figure 1: a simplified view of the Altix UV 2000 — 24 NUMA nodes (Node 0 to Node 23), each holding one Intel Xeon E5 (8 cores / 16 hardware threads, per-core L2 caches, a shared L3) and 32 GB of main memory.]

Platforms: embedded processors
- Carma: a development kit from SECO that features a quad-core NVIDIA Tegra 3 running at 1.3 GHz
- MPPA-256: a single-chip manycore processor developed by Kalray that integrates 256 user cores and 32 system cores running at 400 MHz

Kalray
- French startup founded in 2008; headquarters in Grenoble (Montbonnot) and Paris (Orsay), with offices in Japan and the US (California)
- Disruptive technology based on research from CEA-LETI (hardware), INRIA (compilers) and CEA-LIST (data-flow languages)
- Multi-purpose, massively parallel, low-power processors and system software
- 55 employees, of which 45 in research and development
- Fabless company: manufacturing is delegated to state-of-the-art foundries (TSMC, 28 nm)
Kalray MPPA-256
Inside the chip:
- 256 cores running at 400 MHz, organised as 16 compute clusters with 16 PEs (processing elements) each
- The PEs of a cluster share 2 MB of memory
- There is no cache coherence protocol inside the clusters; both PEs and RMs (resource managers) have private 2-way instruction and data caches
- Communication between clusters goes through a Network-on-Chip (NoC): one for data (D-NoC) and one for control (C-NoC)
- 4 I/O subsystems, 2 of them connected to external memory (PCIe, DDR)
[Figure: block diagram of the MPPA-256, showing a compute cluster (PE0-PE15, 2 MB shared memory, RM, D-NoC and C-NoC routers) and the four I/O subsystems.]

Execution model (a minimal threading sketch follows below):
- A master process runs on an RM of an I/O subsystem
- The master process spawns slave processes, one per compute cluster
- Each slave process runs on PE0 and may create up to 15 threads, one for each remaining PE; the threads share the 2 MB of memory within the cluster
- Communications are done through remote writes; data travel through the NoC
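To make the intra-cluster part of this model concrete, here is a minimal sketch of the fan-out performed by a slave process, assuming only the POSIX-threads support available inside a compute cluster. This is not the authors' code: the cluster-level spawning of slaves and the NoC communication go through Kalray's toolchain and are not shown, and the names worker() and N_PES are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define N_PES 15                 /* PE1..PE15; PE0 runs the slave's main thread */

    /* Work executed on each PE. All threads of the cluster share the same
     * 2 MB of local memory, so plain pointers can be passed around. */
    static void *worker(void *arg)
    {
        int pe = *(int *)arg;
        printf("hello from PE %d\n", pe);
        return NULL;
    }

    int main(void)                   /* the slave process, running on PE0 */
    {
        pthread_t thread[N_PES];
        int pe_id[N_PES];

        for (int i = 0; i < N_PES; i++) {
            pe_id[i] = i + 1;
            pthread_create(&thread[i], NULL, worker, &pe_id[i]);
        }
        for (int i = 0; i < N_PES; i++)
            pthread_join(thread[i], NULL);
        return 0;
    }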
Case study: the Travelling Salesman Problem (TSP)
Definition: find a shortest possible path that passes through n cities, visiting each city only once, and returns to the city of origin.
Representation: a complete undirected graph in which nodes are cities and edges carry the distances (costs). Example with n = 4: the possible paths starting from city 1 are 1-2-3-4-1, 1-2-4-3-1, 1-3-2-4-1, 1-3-4-2-1, ...

TSP: sequential algorithm
- Recursive method, depth-first traversal
- Branch and bound: does not explore paths that are already longer than the current shortest one
The sequential version is based on the branch-and-bound method using brute force. It takes as input the number of cities and a cost matrix, and outputs the minimum path length:

    global min_path

    procedure tsp_solve(last_city, current_cost, cities)
      if cities = ∅
        then return current_cost
      for each i in cities do
        new_cost ← current_cost + costs[last_city, i]
        if new_cost < min_path
          then new_min ← tsp_solve(i, new_cost, cities \ {i})
               atomic_update_if_less(min_path, new_min)

    main
      min_path ← ∞
      tsp_solve(1, 0, {2, 3, ..., n_cities})
      output(min_path)

(Figure 4: sequential version of the TSP.)

This algorithm does a depth-first search looking for the shortest path and has complexity O(n!). It does not explore paths that are already known to be longer than the best path found so far, therefore discarding fruitless branches (the shaded edges in Figure 3b are those the algorithm does not follow, since a solution including them would be more costly than the one already found). This simple pruning technique greatly reduces the explored search space.
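A compact C rendering of the pseudocode above, using a bitmask for the set of remaining cities. The cost-matrix values are made up for illustration (the 4-city example of the slides is not fully legible in the transcription), and, as in the pseudocode, the edge back to the origin city is not added when a complete path is reached.

    #include <stdio.h>

    #define N   4
    #define INF 1000000000

    /* Hypothetical symmetric cost matrix for a 4-city example. */
    static const int costs[N][N] = {
        {  0, 12, 65, 10 },
        { 12,  0, 10,  5 },
        { 65, 10,  0,  8 },
        { 10,  5,  8,  0 },
    };

    static int min_path = INF;

    /* Depth-first branch and bound; 'visited' is a bitmask of visited cities. */
    static void tsp_solve(int last_city, int current_cost, unsigned visited)
    {
        if (visited == (1u << N) - 1) {          /* all cities visited */
            if (current_cost < min_path)
                min_path = current_cost;
            return;
        }
        for (int i = 0; i < N; i++) {
            if (visited & (1u << i))
                continue;
            int new_cost = current_cost + costs[last_city][i];
            if (new_cost < min_path)             /* prune longer partial paths */
                tsp_solve(i, new_cost, visited | (1u << i));
        }
    }

    int main(void)
    {
        tsp_solve(0, 0, 1u);                     /* start from city 0 */
        printf("min_path = %d\n", min_path);
        return 0;
    }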
TSP: sequential algorithm (example)
The slides then step through the traversal on the 4-city example, showing current_cost and min_path after each move. The first complete tour found sets min_path = 92; backtracking then finds a cheaper tour and lowers min_path to 35; from that point on, partial paths whose cost already exceeds 35 (for instance one reaching current_cost = 73) are pruned without being expanded.

Irregular behavior: the pruning approach introduces irregularities into the search space!

TSP: multithreaded algorithm
- Generates tasks sequentially at the beginning
- Enqueues the tasks in a centralized task queue
- Threads atomically dequeue tasks and call tsp_solve()

    global queue, min_path

    procedure generate_tasks(n_hops, last_city, current_cost, cities)
      if n_hops = max_hops
        then task ← (last_city, current_cost, cities)
             enqueue_task(queue, task)
        else for each i in cities do
               if last_city = none
                 then last_cost ← 0
                 else last_cost ← costs[last_city, i]
               new_cost ← current_cost + last_cost
               generate_tasks(n_hops + 1, i, new_cost, cities \ {i})

    procedure do_work()
      while queue ≠ ∅ do
        (last_city, current_cost, cities) ← atomic_dequeue(queue)
        tsp_solve(last_city, current_cost, cities)

    main
      min_path ← ∞
      generate_tasks(0, none, 0, {1, 2, ..., n_cities})
      for i ← 1 to n_threads do spawn_thread(do_work())
      wait_every_child_thread()
      output(min_path)

(Figure 5: multi-threaded version of the TSP.)
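A possible pthreads rendering of this scheme, self-contained for illustration: tasks are path prefixes of a fixed length (MAX_HOPS) generated sequentially into a centralized queue, and worker threads pop them under a mutex and run the branch and bound from that point. The constants (N, MAX_HOPS, NTHREADS) and the dummy cost matrix are illustrative, not taken from the study.

    #include <pthread.h>
    #include <stdio.h>

    #define N         10
    #define INF       1000000000
    #define MAX_HOPS  2                      /* a task = a 2-city path prefix */
    #define MAX_TASKS 4096
    #define NTHREADS  4

    typedef struct { int last_city; int cost; unsigned visited; } task_t;

    static int costs[N][N];
    static task_t queue[MAX_TASKS];
    static int queue_len = 0, queue_head = 0;
    static int min_path = INF;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Depth-first branch and bound from a partial path. The pruning test reads
     * min_path without the lock: a slightly stale bound only delays pruning,
     * while the authoritative update is done under the lock. */
    static void tsp_solve(int last_city, int cost, unsigned visited)
    {
        if (visited == (1u << N) - 1) {
            pthread_mutex_lock(&lock);       /* atomic "update if less" */
            if (cost < min_path)
                min_path = cost;
            pthread_mutex_unlock(&lock);
            return;
        }
        for (int i = 0; i < N; i++) {
            if (visited & (1u << i))
                continue;
            int c = cost + costs[last_city][i];
            if (c < min_path)
                tsp_solve(i, c, visited | (1u << i));
        }
    }

    /* Enumerate all path prefixes of MAX_HOPS cities and enqueue them as tasks. */
    static void generate_tasks(int n_hops, int last_city, int cost, unsigned visited)
    {
        if (n_hops == MAX_HOPS) {
            queue[queue_len++] = (task_t){ last_city, cost, visited };
            return;
        }
        for (int i = 0; i < N; i++) {
            if (visited & (1u << i))
                continue;
            int c = cost + (last_city < 0 ? 0 : costs[last_city][i]);
            generate_tasks(n_hops + 1, i, c, visited | (1u << i));
        }
    }

    /* Worker threads dequeue tasks atomically until the queue is empty. */
    static void *do_work(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            if (queue_head == queue_len) { pthread_mutex_unlock(&lock); break; }
            task_t t = queue[queue_head++];
            pthread_mutex_unlock(&lock);
            tsp_solve(t.last_city, t.cost, t.visited);
        }
        return NULL;
    }

    int main(void)
    {
        /* Fill costs[][] with symmetric dummy distances. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                costs[i][j] = (i == j) ? 0 : 1 + (i + j) * 7 % 50;

        generate_tasks(0, -1, 0, 0u);        /* sequential task generation */

        pthread_t th[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&th[i], NULL, do_work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);

        printf("min_path = %d\n", min_path);
        return 0;
    }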
TSP: distributed algorithm
- Peers run the multithreaded algorithm
- A master peer enqueues partitions (sets of tasks) in the peers' local task queues
- Peers broadcast new shortest paths when they are found
- The master peer sends partitions of decreasing size at each request, to reduce the imbalance between the peers at the end of the execution
[Figure: the master peer serves partitions to peers 1..n on request; peers exchange new min_path values and run the multithreaded algorithm locally.]

Adapting the TSP for manycores
Xeon E5 and Carma:
- Multithreaded algorithm
- The shared variable min_path stores the shortest path and can be updated by all threads (locks needed)
Altix UV 2000:
- Distributed algorithm
- Broadcasts need no explicit communication (locks and condition variables needed)
- Thread and data affinity are used to reduce NUMA penalties
MPPA-256:
- Distributed algorithm
- Communications between the master and the peers are done with remote writes
- Absence of cache coherence: worker threads inside a peer might use a stale value of min_path, causing a performance loss
- We used platform-specific instructions to bypass the cache when reading/writing min_path

Computing and energy performance results: measurement methodology
Metrics:
- Time-to-solution: the time to reach a solution for a given problem
- Energy-to-solution: the amount of energy spent to reach a solution for a given problem
Power and energy measurements (a small computation sketch follows below):
- Xeon E5 and Altix UV 2000: energy sensors (hardware counters)
- MPPA-256: energy sensors
- Carma (Tegra 3): power consumption taken from the specification

    Platform         Power (W)
    Xeon E5          68.6
    Altix UV 2000    1,418.4
    Carma            5.88
    MPPA-256         8.26
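How energy-to-solution is obtained deserves a short note. Where energy sensors exist the value can be read directly; where only a nominal power figure or a series of power samples is available, it has to be derived from power and time. The sketch below shows the two obvious derivations; it illustrates the metric and is not the authors' measurement code. The numbers in main() are placeholders, except for the nominal power values quoted in the table above.

    #include <stdio.h>

    /* Energy-to-solution from a constant (nominal) power figure,
     * e.g. when only the power specification of the board is known. */
    static double energy_from_spec(double power_watts, double time_s)
    {
        return power_watts * time_s;                 /* joules */
    }

    /* Energy-to-solution from timestamped power samples
     * (trapezoidal integration of power over time). */
    static double energy_from_samples(const double *t, const double *p, int n)
    {
        double e = 0.0;
        for (int i = 1; i < n; i++)
            e += 0.5 * (p[i - 1] + p[i]) * (t[i] - t[i - 1]);
        return e;
    }

    int main(void)
    {
        /* 5.88 W is Carma's nominal power; the 100 s time is made up. */
        printf("spec-based estimate:   %.1f J\n", energy_from_spec(5.88, 100.0));

        double t[] = { 0.0, 1.0, 2.0, 3.0 };         /* seconds */
        double p[] = { 8.1, 8.3, 8.2, 8.26 };        /* watts   */
        printf("sample-based estimate: %.2f J\n", energy_from_samples(t, p, 4));
        return 0;
    }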
Chip-to-chip comparison: performance and energy
[Figure 6: time- and energy-to-solution comparison between the multicore and manycore processors — Xeon E5 (8 cores + HT), MPPA-256 (256 cores) and Carma (4 cores).]
- Although the Xeon E5 has 8 physical cores, Hyper-Threading (HT) doubles the number of logical cores, allowing 16 threads to execute concurrently; HT was beneficial in our case, improving the performance of the multithreaded TSP.
- Time-to-solution: as expected, the TSP on Carma presented the highest execution times among all processors, being 8.6x slower than Xeon E5. Surprisingly, MPPA-256 presented the best execution time among all processors, executing the TSP about 1.6x faster than Xeon E5.
- Results on MPPA-256: roughly 1.6x faster than Xeon E5, and roughly 10x less energy than Carma (Tegra 3).

Energy-to-solution: MPPA-256 vs. Altix UV 2000
- Comparison of an MPPA-256 single cluster against an Altix UV 2000 single node, then varying the number of clusters/nodes
- Peers: NUMA nodes on the Altix UV 2000, compute clusters on the MPPA-256; the number of peers is varied while fixing the number of threads per processor/cluster to 16. A medium problem size was used to compare the energy-to-solution from 2 to 12 peers and a larger one for more than 12 peers, due to time constraints.
- Single node/cluster power consumption: a Xeon E5 node of the Altix UV 2000 ranges from 27 W (1 thread) up to 68.6 W (16 threads); an MPPA-256 cluster ranges from 3.7 W (1 thread) up to 4 W (16 threads)
- The broadcast communication between clusters was implemented using asynchronous messages over the D-NoC
[Figure 7: energy-to-solution comparison between Altix UV 2000 and MPPA-256 for three problem sizes — (a) small, 16 cities; (b) medium, 18 cities; (c) large, 20 cities — as a function of the number of threads (single node/cluster) and of the number of peers.]
- MPPA-256 achieved a much better energy-to-solution when the number of peers is varied: from 2 to 12 peers it spends 1.6x to 8.3x less energy than the Altix UV 2000, and from 13 to 16 peers about 11.3x less energy.
- When comparing a single Altix UV 2000 processor against a single MPPA-256 cluster, however, the Altix UV 2000 outperformed the MPPA-256.

Conclusions
For a fairly parallelizable application such as the TSP, MPPA-256 can be very competitive:
- Better performance than Xeon E5 (about 1.6x)
- Better energy efficiency than Tegra 3 (about 9.8x)
However...
- It demands non-trivial source code adaptations so that applications can efficiently use the whole chip
- The absence of a coherent cache considerably increases the implementation complexity

Future works
- Assess how well the performance of this processor fares on applications with heavier communication patterns; work in progress: we are adapting a seismic wave propagation simulator developed by BRGM (France) to the MPPA-256
- Compare the computing and energy performance of MPPA-256 with other low-power processors, such as mobile versions of Intel's Sandy Bridge processors, and investigate other manycores such as Tilera's TILE-Gx
- Frameworks for heterogeneous platforms; work in progress: support for OpenCL on MPPA-256
- JLPC: potential collaborations with Wen-Mei Hwu (UIUC) and John Stratton (MCW-UIUC)

Questions?