Dynamic Intelligent Kernel Assignment in Heterogeneous MultiGPU Systems
Category: Clusters & GPU Management - CG02
Poster: P4200
Contact: Joao Gazolla, gazolla@ic.uff.br
Joao Gazolla, Esteban Clua
Universidade Federal Fluminense, Niteroi, Rio de Janeiro, Brazil
gabrielgazolla@gmail.com

Basic Preliminary Idea
What is this poster about?
This poster presents the initial concept of a research project on Dynamic Intelligent Kernel Assignment in Heterogeneous MultiGPU Systems. Given an application built on the StarPU framework, our scheduler selects custom scheduling policies and maps kernels to the corresponding devices seamlessly, with the goal of minimizing execution time.

Motivation
Graphics Processing Units (GPUs) have evolved rapidly over the last decade, becoming faster, gaining more cores, and adopting a more sophisticated unified architecture. Every year, many GPU models with distinct specifications are released. Despite this rapid evolution, there is still a gap in exploiting this computing power, especially in a grid of computers with heterogeneous devices on each node. It is difficult for an application to detect the grid architecture and use all of these devices' resources intelligently. In addition, the way kernels are assigned and mapped to devices can make a large difference in execution time, depending on each device's specifications and on whether the device is busy. A good mapping also minimizes the idle time of the devices.

Introduction
To reach our objective, we use a third-party library, StarPU, a unified runtime system for heterogeneous multicore architectures. It provides a unified view of the computational resources, CPUs and GPUs, and supports using them simultaneously.
It also takes care of mapping and executing tasks efficiently while transparently handling low-level issues such as data transfers. The core of StarPU is its runtime support library, which is responsible for scheduling application-provided tasks on heterogeneous machines.

www.ic.uff.br

Objective
StarPU's runtime and programming-language extensions support a task-based programming model and allow developers to implement custom scheduling policies in a portable fashion. We will test several scheduling policies to determine which are the most efficient.

Given an application with multiple kernels, the kernels are added to a universal queue, and each one is mapped to a device. The schedule is established using on-the-fly information from the NVIDIA Management Library (NVML). There will also be a calibration phase, in which the kernels are tested and an initial correspondence between kernels and devices is created. The programmer does not need to know how many nodes or devices are in the grid. This is transparent and entirely the responsibility of the middleware under development. We expect the system to improve speedups, since each kernel will be assigned to the most appropriate GPU, reducing idle time and making execution faster.

Future Work
We are currently testing simple proofs of concept to confirm that building this system is viable. Our next step is to implement the concept on a single machine with multiple devices and then extend it to a grid of computers.

[Figure labels: StarPU; A.I.; Load Balancing; Scheduling Policies; Kernel 01 to Kernel 10]

Conclusion
There is a lot of room for improvement in this research area, but we need to carefully investigate issues such as memory consistency, scheduling policies, and communications.
If we can build this system and reduce execution time, large speedups should be achievable with very little porting effort, and the approach has the potential to become the next level of unified architectures.

Acknowledgements
This project is financially supported by CAPES-Brazil and CNPq-Brazil.

[Figure labels: Computer 01, Computer 02, Computer 03; One Device, Two Devices, Four Devices; Memory Consistency]

Some References…
C. Augonnet, S. Thibault, R. Namyst, and P.A. Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In Proceedings of the 15th Euro-Par Conference, Delft, The Netherlands, August 2009.
M. Joselli, M. Z., E. C., A. M., A. C., R. L., L. V., B. F., M. d. O., and C. P. Automatic Dynamic Task Distribution between CPU and GPU for Real-Time Systems. In CSE’08, pages 48–55, 2008.
...
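
Illustrative sketch
The poster's Objective text describes a universal kernel queue, a calibration phase, and a mapping of kernels to devices. It does not specify a concrete policy, so the following is only a minimal standalone sketch of one possible policy (greedy earliest finish time). It does not use StarPU or NVML: the `Device`, `calibrate`, `schedule`, and `measure` names are hypothetical, the timings are made-up numbers, and the `busy_until` field merely stands in for the device-load information that the real system would obtain from NVML.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    busy_until: float = 0.0  # time at which the device becomes idle (stand-in for NVML load data)
    assigned: list = field(default_factory=list)


def calibrate(kernels, devices, measure):
    """Calibration phase: record an estimated run time for every (kernel, device) pair."""
    return {(k, d.name): measure(k, d.name) for k in kernels for d in devices}


def schedule(kernels, devices, cost):
    """Greedy policy (an assumption): give each queued kernel to the device that finishes it earliest."""
    queue = deque(kernels)  # the "universal queue"
    while queue:
        kernel = queue.popleft()
        best = min(devices, key=lambda d: d.busy_until + cost[(kernel, d.name)])
        best.busy_until += cost[(kernel, best.name)]
        best.assigned.append(kernel)
    return max(d.busy_until for d in devices)  # total execution time (makespan)


# Hypothetical timings: the "slow" device takes twice as long as the "fast" one.
BASE_TIME = {"k1": 4.0, "k2": 2.0, "k3": 2.0, "k4": 1.0}


def measure(kernel, device_name):
    return BASE_TIME[kernel] * (1.0 if device_name == "fast" else 2.0)


devices = [Device("fast"), Device("slow")]
kernels = ["k1", "k2", "k3", "k4"]
cost = calibrate(kernels, devices, measure)
makespan = schedule(kernels, devices, cost)
for d in devices:
    print(d.name, d.assigned, d.busy_until)
print("makespan:", makespan)
```

With these made-up timings, running every kernel on the fast device alone would take 9 time units, while the greedy mapping finishes in 6 by also keeping the slow device busy. A real scheduling policy would replace `measure` with timings collected during calibration and would refresh device load at run time.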