LoHAM - Peter Heywood
Transcription
Alternative GPU friendly assignment algorithms
Paul Richmond and Peter Heywood
Department of Computer Science, The University of Sheffield

Graphics Processing Units (GPUs)

Context: GPU Performance
• Serial Computing: 1 core, ~40 GigaFLOPS
• Parallel Computing: 16 cores, 384 GigaFLOPS
• Accelerated Computing: 4992 cores, 8.74 TeraFLOPS
• 6 hours CPU time vs. 1 minute GPU time
[Figure: bar chart of performance in TFlops (0.0 to 10.0), 1 CPU core (~40 GigaFLOPS) vs. GPU with 4992 cores (8.74 TeraFLOPS)]

Accelerators
• Much of the functionality of CPUs is unused for HPC
  • Complex pipelines, branch prediction, out-of-order execution, etc.
• Ideally for HPC we want: simple, low power and highly parallel cores

An accelerated system
• Co-processor, not a CPU replacement
[Diagram: CPU with DRAM and I/O, connected over PCIe to a GPU/Accelerator with GDRAM and I/O]

Thinking Parallel
• Hardware considerations
  • High memory latency (PCI-e)
  • Huge numbers of processing cores
• Algorithmic changes required
  • High level of parallelism required
  • Data parallel != task parallel
"If your problem is not parallel then think again"

Amdahl's Law
$$ S = \frac{1}{(1 - P) + \frac{P}{N}} $$
where S is the speedup, P is the proportion of the program that can be parallelised and N is the number of processors.
• Speedup of a program is limited by the proportion that can be parallelised
• Addition of processing cores gives diminishing returns
[Figure: Speedup (S) against Number of Processors (N) for P = 25%, P = 50%, P = 90% and P = 95%]

SATALL Optimisation

Profile the Application
• 11 hour runtime
• Function A
  • 97.4% of runtime
  • 2000 calls
• Hardware
  • Intel Core i7-4770K 3.50 GHz
  • 16GB DDR3
  • Nvidia GeForce Titan X
[Figure: Time per function for Largest Network (LoHAM), Time (s) by Function: A 97.4%, B 0.3%, C 0.3%, D 0.1%, E 0.1%, F 0.1%]

Function A
• Input
  • Network (directed weighted graph)
  • Origin-Destination Matrix
• Output
  • Traffic flow per edge
• 2 Distinct Steps
  1. Single Source Shortest Path (SSSP): All-or-Nothing Path for each origin in the O-D matrix
  2. Flow Accumulation: apply the OD value for each trip to each link on the route
[Figure: Distribution of Runtime (LoHAM), share of Flow and SSSP in the Serial version]

Single Source Shortest Path
For a single Origin Vertex (Centroid), find the route to each Destination Vertex with the Lowest Cumulative Weight (Cost)

Serial SSSP Algorithm
• D'Esopo-Pape (1974)
• Maintains a priority queue of vertices to explore
• Highly serial
• Not a data-parallel algorithm
• We must change the algorithm to match the hardware
Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.

Parallel SSSP Algorithm
• Bellman-Ford Algorithm (1956)
• Poor serial performance & time complexity
• Performs significantly more work
• Highly parallel
• Suitable for GPU acceleration
Bellman, Richard. On a routing problem. 1956.
[Figure: Total Edges Considered, Number of Edges for Bellman-Ford vs. D'Esopo-Pape]

Implementation
• A2 - Naïve Bellman-Ford using CUDA
• Up to 369x slower
• Striped bars continue off the scale
  • Derby 36.5s
  • CLoHAM 1777.2s
  • LoHAM 5712.6s
[Figure: Time to compute all required SSSP results per model, Time (seconds) by Optimisation (Serial, A2) for Derby, CLoHAM and LoHAM]

Implementation
• Followed iterative cycle of performance optimisations
• A3 - Early Termination
• A4 - Node Frontier
• A8 - Multiple origins concurrently
  • SSSP for each Origin in the OD matrix
• A10 - Improved load balancing
  • Cooperative Thread Array
• A11 - Improved array access
[Figure: Time to compute all required SSSP results per model, Time (seconds) by Optimisation (Serial, A2, A3, A4, A8, A10, A11) for Derby, CLoHAM and LoHAM]

Limiting Factor (Function A)
[Figures: Distribution of Runtime (CLoHAM) and Distribution of Runtime (LoHAM), share of Flow and SSSP in the Serial version]

Limiting Factor (Function A)
• Limiting factor has now changed
• Need to parallelise Flow Accumulation
[Figures: Distribution of Runtime (CLoHAM) and Distribution of Runtime (LoHAM), share of Flow and SSSP for Serial and A11]

Flow Accumulation
• Shortest Path + OD = Flow-per-link
• For each origin-destination pair, trace the route from the destination to the origin, increasing the flow value for each link visited
• Parallel problem
• But requires synchronised access to a shared data structure for all trips (atomic operations)

O-D matrix:
      0   1   2   …
  0   0   1   2   …
  1   2   0   5   …
  2   3   4   0   …
  …   …   …   …   …

Flow per link:
  Link   0   1   2   3   …
  Flow   9   8   6   9   …

Flow Accumulation
Problem:
• A12 - lots of atomic operations serialise the execution
[Figure: Time to compute Flow values per link, Time (Seconds) for Serial and A12, for Derby, CLoHAM and LoHAM]

Flow Accumulation
Problem:
• A12 - lots of atomic operations serialise the execution
Solutions:
• A15 - Reduce number of atomic operations
  • Solve in batches using parallel reduction
• A14 - Use fast hardware-supported single precision atomics
  • Minimise loss of precision using multiple 32-bit summations
  • 0.000022% total error
[Figure: Time to compute Flow values per link, Time (Seconds) for Serial, A12, A15 (double) and A14 (single), for Derby, CLoHAM and LoHAM]

Integrated Results

Relative Assignment Runtime Performance vs Serial
• Serial
  • LoHAM: 12h 12m
• Double precision
  • LoHAM: 35m 22s
• Single precision
  • Reduced loss of precision
  • LoHAM: 17m 32s
• Hardware:
  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

Assignment Speedup relative to Serial:
  Version        Derby   CLoHAM   LoHAM
  Serial          1.00     1.00    1.00
  Multicore       3.42     5.51    6.81
  A15             3.73    11.24   20.70
  A14 (single)    6.14    18.24   41.77

Relative Assignment Runtime Performance vs Multicore
• Multicore
  • LoHAM: 1h 47m
• Double precision
  • LoHAM: 35m 22s
• Single precision
  • Reduced loss of precision
  • LoHAM: 17m 32s
• Hardware:
  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

Assignment Speedup relative to Multicore:
  Version        Derby   CLoHAM   LoHAM
  Serial          0.29     0.18    0.15
  Multicore       1.00     1.00    1.00
  A15             1.09     2.04    3.04
  A14 (single)    1.79     3.31    6.14

GPU Computing at UoS

Expertise at Sheffield
• Specialists in GPU Computing and performance optimisation
• Complex Systems Simulations via FLAME and FLAME GPU
• Visual Simulation, Computer Graphics and Virtual Reality
• Training and Education for GPU Computing

Thank You

Largest Model (LoHAM) results:
  Version                  Runtime    Speedup vs Serial   Speedup vs Multicore
  Serial                   12:12:24    1.00               0.15
  Multicore                01:47:36    6.81               1.00
  A15 (double precision)   00:35:22   20.70               3.04
  A14 (single precision)   00:17:32   41.77               6.14

Paul Richmond
p.richmond@sheffield.ac.uk
paulrichmond.shef.ac.uk

Peter Heywood
p.heywood@sheffield.ac.uk
ptheywood.uk

Backup Slides

Benchmark Models
• 3 Benchmark networks
• Range of sizes
  • Small to very large
• Up to 12 hour runtime

  Model     Vertices (Nodes)   Edges (Links)   O-D trips
  Derby                 2700           25385   547²
  CLoHAM               15179          132600   2548²
  LoHAM                18427          192711   5194²

[Figure: Benchmark model performance, Time (s) by Version (Serial, Salford Multicore) for Derby, CLoHAM and LoHAM]

Edges considered per algorithm
[Figures: Edges Considered Per Iteration (Number of Edges against Iteration) and Total Edges Considered (Number of Edges), for Bellman-Ford and D'Esopo-Pape]

Vertex Frontier (A4)
• Only vertices which were updated in the previous iteration can result in an update
• Far fewer threads launched per iteration
  • Up to 2500 instead of 18427 per iteration
[Figure: Frontier size for each element for source 1, Frontier Size against Iteration for Derby, CLoHAM and LoHAM]

Multiple Concurrent Origins (A8)
[Figures: Frontier size for each element for source 1 and Frontier Size for all concurrent sources, Frontier Size against Iteration for Derby, CLoHAM and LoHAM]

Atomic Contention
• Atomic operations are guaranteed to occur
• Atomic contention
  • Multiple threads atomically modify the same address
  • Serialised!
• atomicAdd(double) not implemented in hardware
  • Not yet
• Solutions
  1. Algorithmic change to minimise atomic contention
  2. Single precision
[Figure: Atomic Contention per Iteration, for CLoHAM and LoHAM]

Raw Performance
• Hardware:
  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X
[Figure: Assignment runtime per algorithm, Time (seconds) for Serial, Multicore, A8, A10, A11, A15, A13 (single) and A14 (single), for Derby, CLoHAM and LoHAM]
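
Sketch: frontier-based Bellman-Ford (A3 + A4)

The Early Termination (A3) and Vertex Frontier (A4) ideas from the slides can be illustrated with a small serial Python sketch. It is not the authors' CUDA implementation: the function name `frontier_bellman_ford`, the edge-list input and the adjacency-list layout are assumptions made for illustration, and link costs are assumed to be non-negative. Each pass of the outer loop stands in for one GPU kernel launch, and each frontier vertex stands in for one thread.

```python
import math


def frontier_bellman_ford(num_vertices, edges, origin):
    """Shortest paths from `origin` over `edges`, a list of (u, v, cost).

    Hypothetical helper for illustration; assumes non-negative costs.
    Returns (dist, pred), where pred[v] is the previous vertex on the
    cheapest route to v, or -1 if v is the origin or unreachable.
    """
    # Outgoing links per vertex.
    out_edges = [[] for _ in range(num_vertices)]
    for u, v, cost in edges:
        out_edges[u].append((v, cost))

    dist = [math.inf] * num_vertices
    pred = [-1] * num_vertices
    dist[origin] = 0.0

    # A4: only vertices updated in the previous iteration are processed.
    frontier = {origin}

    # A3: stop as soon as an iteration updates nothing.
    while frontier:
        next_frontier = set()
        for u in frontier:  # on a GPU: one thread per frontier vertex
            for v, cost in out_edges[u]:
                candidate = dist[u] + cost
                if candidate < dist[v]:  # on a GPU this would need an atomic min
                    dist[v] = candidate
                    pred[v] = u
                    next_frontier.add(v)
        frontier = next_frontier
    return dist, pred


if __name__ == "__main__":
    links = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0), (2, 3, 5.0)]
    print(frontier_bellman_ford(4, links, origin=0))
```

The frontier keeps the per-iteration work proportional to the number of recently updated vertices rather than to the whole network, which is the effect the A4 slide reports (up to 2500 instead of 18427 per iteration).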
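
Sketch: flow accumulation for one origin

The Flow Accumulation step (trace each route from the destination back to the origin and add the OD value to every link visited) can be sketched in serial Python as follows. The function name `accumulate_flow`, the predecessor array `pred` and the `link_id` lookup are hypothetical names chosen for illustration, not taken from the presented code. The marked line is the update that many concurrent trips would contend for on a GPU, which is where the atomic operations discussed in the slides come in.

```python
def accumulate_flow(pred, link_id, od_row, origin, flow):
    """Add the demand of every trip from `origin` to each link on its route.

    pred[v]  : previous vertex on the shortest path from origin to v (-1 if none)
    link_id  : dict mapping a link (u, v) to its index in `flow`
    od_row   : od_row[d] is the O-D value from origin to destination d
    flow     : per-link totals, updated in place and returned
    """
    for dest, demand in enumerate(od_row):
        # Skip the origin itself, empty trips and unreachable destinations.
        if dest == origin or demand == 0 or pred[dest] == -1:
            continue
        v = dest
        while v != origin:
            u = pred[v]
            flow[link_id[(u, v)]] += demand  # on a GPU: an atomic add
            v = u
    return flow


if __name__ == "__main__":
    # Shortest-path tree from origin 0: 0 -> 2 -> 1 -> 3.
    pred = [-1, 2, 0, 1]
    link_id = {(0, 1): 0, (0, 2): 1, (2, 1): 2, (1, 3): 3, (2, 3): 4}
    od_row = [0, 2, 5, 3]
    print(accumulate_flow(pred, link_id, od_row, 0, [0] * 5))
```

Links near the origin are shared by many routes, so they receive the most updates; that is one way to see why unrestricted atomic adds serialise and why the slides look at reducing their number.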