Dynamic Intelligent Kernel Assignment in Heterogeneous MultiGPU Systems

Category: Clusters & GPU Management - CG02
Poster
P4200
Contact: Joao Gazolla, gazolla@ic.uff.br
Joao Gazolla, Esteban Clua
Universidade Federal Fluminense, Niteroi, Rio de Janeiro, Brazil.
gabrielgazolla@gmail.com
Basic Preliminary Idea
What is this poster about?
This poster presents the initial concept of our research
on Dynamic Intelligent Kernel Assignment in Heterogeneous
MultiGPU Systems. Given an application written with the
StarPU framework, our scheduler selects among custom
scheduling policies and maps each kernel to a suitable
device transparently, with the goal of minimizing
execution time.
Motivation
Graphics Processing Units (GPUs) have evolved rapidly
over the last decade, becoming faster, gaining more cores
and adopting more sophisticated unified architectures.
Every year many GPU models with distinct specifications
reach the market. Despite this rapid evolution, there is
still a gap in taking advantage of this computing power,
particularly on a grid of computers with heterogeneous
devices on each node. It is difficult for an application
to detect the grid architecture and use the resources of
all these devices intelligently. In addition, the way
kernels are assigned and mapped to devices can make a
large difference in execution time, depending on the
specifications of each device and on whether it is busy.
A good mapping also minimizes the idle time of the
devices.
Introduction
To reach our objective we use a third-party library,
StarPU, a unified runtime system for heterogeneous
multicore architectures. It provides a unified view of
the computational resources, CPUs and GPUs,
simultaneously. It also takes care of efficiently mapping
and executing tasks while transparently handling
low-level issues such as data transfers. The core of
StarPU is its run-time support library, which is
responsible for scheduling application-provided tasks on
heterogeneous machines.
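The task model described above can be illustrated with a
small Python stand-in. StarPU calls a task body that has
one implementation per kind of processing unit a codelet;
the sketch below mimics only that idea. The class names,
function names and worker list are hypothetical, this is
not the StarPU API, and data transfers are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Impl = Callable[[List[float]], List[float]]


@dataclass
class Codelet:
    """A task body with one implementation per kind of processing unit."""
    name: str
    impls: Dict[str, Impl]  # keys such as "cpu" or "gpu" (hypothetical)


@dataclass
class Worker:
    """A processing unit known to the toy runtime."""
    name: str
    kind: str  # "cpu" or "gpu"


def run_task(codelet: Codelet, data: List[float],
             workers: List[Worker]) -> Tuple[str, List[float]]:
    """Run the codelet on the first worker whose kind it supports."""
    for worker in workers:
        impl = codelet.impls.get(worker.kind)
        if impl is not None:
            return worker.name, impl(data)
    raise RuntimeError(f"no worker can run codelet {codelet.name!r}")


def scale_cpu(xs: List[float]) -> List[float]:
    return [2.0 * x for x in xs]


def scale_gpu(xs: List[float]) -> List[float]:
    # Stand-in for a GPU kernel launch; plain Python so the sketch runs anywhere.
    return [2.0 * x for x in xs]


WORKERS = [Worker("gpu0", "gpu"), Worker("cpu0", "cpu")]
scale = Codelet("scale", {"cpu": scale_cpu, "gpu": scale_gpu})
cpu_only = Codelet("cpu_only", {"cpu": scale_cpu})
```

Here the application only submits `scale` or `cpu_only`;
the choice of worker is left to the runtime, which is the
point where a smarter scheduling policy can be plugged in.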
www.ic.uff.br
Objective
StarPU's run-time and programming language extensions
support a task-based programming model and allow
developers to implement custom scheduling policies in a
portable fashion. We will test several scheduling
policies to determine which ones are most efficient.
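As a rough illustration of how scheduling policies could
be compared, the sketch below runs the same kernel list
under two interchangeable policies and reports the
resulting makespan (the time at which the last device
finishes). The cost table, device names and both policies
are assumptions made for this example; they are not
measurements or policies taken from the poster.

```python
from typing import Callable, Dict, List

# Hypothetical cost of each kernel on each device, in milliseconds.
COST: Dict[str, Dict[str, float]] = {
    "k1": {"gpu_fast": 2.0, "gpu_slow": 6.0},
    "k2": {"gpu_fast": 3.0, "gpu_slow": 9.0},
    "k3": {"gpu_fast": 1.0, "gpu_slow": 2.0},
    "k4": {"gpu_fast": 4.0, "gpu_slow": 12.0},
}
DEVICES: List[str] = ["gpu_fast", "gpu_slow"]
KERNELS: List[str] = ["k1", "k2", "k3", "k4"]

# A policy receives the kernel and the time at which each device becomes free.
Policy = Callable[[str, Dict[str, float]], str]


def round_robin() -> Policy:
    """Build a policy that cycles through the devices, ignoring costs."""
    state = {"next": 0}

    def pick(kernel: str, ready_at: Dict[str, float]) -> str:
        device = DEVICES[state["next"] % len(DEVICES)]
        state["next"] += 1
        return device

    return pick


def earliest_finish(kernel: str, ready_at: Dict[str, float]) -> str:
    """Pick the device on which this kernel would finish soonest."""
    return min(DEVICES, key=lambda d: ready_at[d] + COST[kernel][d])


def makespan(kernels: List[str], policy: Policy) -> float:
    """Simulate the assignment and return when the last device finishes."""
    ready_at = {d: 0.0 for d in DEVICES}
    for kernel in kernels:
        device = policy(kernel, ready_at)
        ready_at[device] += COST[kernel][device]
    return max(ready_at.values())


if __name__ == "__main__":
    print("round robin    :", makespan(KERNELS, round_robin()))
    print("earliest finish:", makespan(KERNELS, earliest_finish))
```

With these invented numbers the cost-aware policy finishes
in 9 ms against 21 ms for round robin, which shows why the
choice of policy matters on heterogeneous devices.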
Future Work
We are currently testing simple proofs of concept to
confirm the viability of building this system. Our next
step is to implement the concept on a single machine
with multiple devices and then extend the idea to a
grid of computers.
[Figure: Kernel 01 to Kernel 10; StarPU; A.I.; Load Balancing]
Scheduling Policies
Given a piece of software with multiple kernels, the
kernels are added to a universal queue, from which each
one is mapped to a device. On-the-fly information from
the NVIDIA Management Library (NVML) is used to establish
the schedule. There is also a calibration phase, in which
the kernels are tested and an initial correspondence
between kernels and devices is created. The programmer
does not need to know how many nodes or devices are on
the grid: this is transparent and entirely the
responsibility of the middleware under development. We
expect the system to improve speedups, since each kernel
will be assigned to the most appropriate GPU, reducing
idle time and making executions faster.
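The calibration phase and the universal queue could be
sketched as follows. This is a minimal sketch under stated
assumptions, not the middleware itself: `measure` is a
hypothetical callback that would time one kernel on one
device, `utilization` is a plain dictionary standing in
for the per-device load that NVML can report (for example
through nvmlDeviceGetUtilizationRates), and the 90 percent
busy threshold is arbitrary.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Table = Dict[str, Dict[str, float]]


def calibrate(kernels: List[str], devices: List[str],
              measure: Callable[[str, str], float]) -> Table:
    """Calibration phase: record the measured time of every kernel on every device."""
    return {k: {d: measure(k, d) for d in devices} for k in kernels}


def assign(queue: Deque[str], table: Table,
           utilization: Dict[str, int],
           busy_threshold: int = 90) -> Dict[str, str]:
    """Drain the queue, mapping each kernel to the fastest non-busy device."""
    mapping: Dict[str, str] = {}
    while queue:
        kernel = queue.popleft()
        idle = [d for d, load in utilization.items() if load < busy_threshold]
        # If every device is busy, fall back to considering all of them.
        candidates = idle or list(utilization)
        mapping[kernel] = min(candidates, key=lambda d: table[kernel][d])
    return mapping


# Invented timings (seconds) used in place of real measurements.
FAKE_TIMES = {
    ("k1", "gpu0"): 1.0, ("k1", "gpu1"): 3.0,
    ("k2", "gpu0"): 2.0, ("k2", "gpu1"): 5.0,
}


def fake_measure(kernel: str, device: str) -> float:
    return FAKE_TIMES[(kernel, device)]


TABLE = calibrate(["k1", "k2"], ["gpu0", "gpu1"], fake_measure)
```

With `utilization = {"gpu0": 95, "gpu1": 10}` both kernels
go to gpu1 even though gpu0 is faster on paper, because
gpu0 is reported as busy. A fuller version would refresh
the utilization between assignments and account for work
already queued on each device.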
Conclusion
There is a lot of room for improvement in this research
area, but we need to carefully investigate issues such
as memory consistency, scheduling policies and
communications. If we are able to build the system and
reduce execution time, there will be room for large
speedups with very little porting effort, with the
potential to become the next level of unified
architectures.
Acknowledgements
This project is financially supported by CAPES-Brazil
and CNPq-Brazil.
[Figure labels: Computer 01, Computer 02, Computer 03; One Device, Two Devices, Four Devices; Memory Consistency]