Sebastian Klose - Robotics and Embedded Systems

Transcription

Sebastian Klose - Robotics and Embedded Systems
HETEROGENEOUS PARALLEL
COMPUTING WITH OPENCL
Sebastian Klose
kloses@in.tum.de
27.05.2013
Montag, 27. Mai 13
AGENDA
• Motivation
• OpenCL
Overview
• Hardware
• OpenCL
Montag, 27. Mai 13
Examples
on Altera FPGAs
MOTIVATION
Montag, 27. Mai 13
TREND OF PROGRAMMABLE
HARDWARE
Brief Overview of the OpenCL Standard
Page 3
Figure 3. Recent Trend of Programmable and Parallel Technologies
Brief Overview of the OpenCL Standard
Page 3
Figure 3. Recent Trend of Programmable and Parallel Technologies
CPUs
Single Cores
Brief Overview of the OpenCL Standard
DSPs
Multicores
Coarse-Grained
CPUs and DSPs
Multicores
Arrays
Coarse-Grained
Page 3
Massively Parallel
Processor Arrays
FPGAs
Fine-Grained
Massively
Parallel Arrays
FPGAs
Considering the need for creating parallel programs for the emerging multicore era, it
Figure 3. Recent Trend of Programmable and Parallel Technologies
that there needs to
be a standard modelFPGAs
for creating programs that
CPUs
DSPswas recognized
Multicores
Arrays
will execute across all of these quite different devices. The lack of a standard that is
Multicores
Coarse-Grained
Fine-Grained
portable across these different
programmable
technologies
has plagued
Brief Overview of the OpenCL Standard
Page 3
Coarse-Grained
Massively
Parallel
Massively
programmers. In the summer of 2008, Apple submitted
a proposal for an OpenCL
CPUs and DSPs
Processor Arrays
Parallel Arrays
(Open Computing Language) draft specification to The Khronos Group in an effort to
create
cross-platform
parallel
programming
standard.multicore
The Khronos
Group consists
Considering the
needafor
creating parallel
programs
for the emerging
era, it
Figure 3. Recent Trend of Programmable and Parallel Technologies
a consortium
ofto
industry
members
suchfor
as creating
Apple, IBM,
Intel, that
AMD, NVIDIA,
was recognizedofthat
there needs
be a standard
model
programs
Altera,
many
others.
This group
hasThe
been
responsible
for defining
will execute across
all and
of these
quite
different
devices.
lack
of a standard
that is the OpenCL
1.0,
1.1,
and
1.2
specifications.
The
OpenCL
standard
allows
for
the implementation of
portableMulticores
across these different
programmable
technologies
CPUs
DSPs
Arrays
FPGAs has plagued
Brief Overview of the OpenCL Standard
Page 3
parallel
algorithms
that
can
be
ported
from
platform
to
platform
with minimal
programmers. In the summer of 2008, Apple submitted a proposal for an OpenCL
Single Cores
Multicores
Coarse-Grained
Fine-Grained
recoding.
The
language
is
based
on
C
programming
language
and
contains
extensions
(Open Computing Language) draft specification to The Khronos Group in an effort
to
Coarse-Grained
Massively
Parallel
Massively
that allow
for theprogramming
specification of
parallelism.
create a cross-platform
parallel
standard.
The Khronos Group consists
Figure 3. Recent Trend of Programmable andCPUs
Parallel
andTechnologies
DSPs
Processor Arrays
Parallel Arrays
of a consortium
of
industry
members
such
as
Apple,
IBM,
Intel,
AMD,standard
NVIDIA,inherently offers the
In addition to providing a portable model, the
OpenCL
Altera,
and
many
others.
This
group
has
been
responsible
for
defining
the
Considering the need for creating ability
paralleltoprograms
for the emerging
multicore
era, it
describe parallel
algorithms
to be implemented
on OpenCL
FPGAs, at a much higher
1.0,
1.1,
and
1.2
specifications.
The
OpenCL
standard
allows
for
the
implementation
ofas VHDL or
that there needs Arrays
tolevel
be aofstandard
model
forhardware
creating programs
that
CPUs
DSPs was recognized
Multicores
FPGAs
abstraction
than
description
languages (HDLs) such
parallel
algorithms
that
can
be
ported
from
platform
to
platform
with
minimal
will execute across all of these quite
different
devices.
The high-level
lack of a standard
that
is exist for gaining this higher level of
Verilog.
Although
many
synthesis
tools
Single Cores
Multicores
Coarse-Grained
Fine-Grained
recoding.
The
language
is
based
on
C
programming
language
and
contains extensions
portable across these
different
programmable
technologies
has
plagued
abstraction, they have all suffered from the same fundamental
problem. These tools
Coarse-Grained
Massively
Massively
that
allow for
the
specification
of
parallelism.
programmers. In the
summer
ofParallel
2008,
Apple
submitted
a
proposal
for
an
OpenCL
would
attempt
to
take
in
a
sequential
C
program
and
produce
a
parallel HDL
CPUs and DSPs
Processor Arrays
Parallel Arrays
(Open Computing Language)
specification
to The
Themodel,
Khronos
inso
an
effortin
toinherently
implementation.
difficulty
was
not
much
the creation
of a the
HDL
In additiondraft
to providing
a portable
theGroup
OpenCL
standard
offers
cross-platform
parallel
programming
standard.
The
Khronos
Group
consists
Considering thecreate
need afor
creating parallel
programs
for
the
emerging
multicore
era,
it
implementation,
but
rather
in
the
extraction
of
thread-level
parallelism
ability to describe parallel algorithms to be implemented on FPGAs, at a much higher that would
of
a consortium
industry
members
such
Apple,
IBM,
Intel, that
AMD,
NVIDIA,
that
there needsofto
be a of
standard
model
foras
creating
programs
allow
the
FPGA
implementation
to achieve
high performance.
With
CPUs
DSPs was recognized
Multicores
Arrays
FPGAs
level
abstraction
than
hardware
description
languages
(HDLs)
such as VHDL
orFPGAs being on
Altera,
and
many
others.
This
group
has
been
responsible
for
defining
the
OpenCL
will execute across
all of
these
quite
different
devices.
The
lack
of
a
standard
that
is
the
furthest
extreme
of
the
parallel
spectrum,
any
failure
to
extract
maximum
Verilog. Although many high-level synthesis tools exist for gaining this higher level
of
Single Cores
Multicores
Coarse-Grained
Fine-Grained
1.1,
and 1.2 programmable
specifications.
The
OpenCL
standard
allows
for
the
implementation
of
portable across 1.0,
these
different
technologies
has
plagued
parallelism
is more crippling
than on
other devices.
The OpenCL
standard solves
abstraction, they
have all suffered
from the same
fundamental
problem.
These tools
Coarse-Grained
Massively
Massively
parallel
algorithms
thatApple
can
besubmitted
ported
from
platform
to an
platform
withand
minimal
programmers. In
the summer
ofParallel
2008,
proposal
for
OpenCL
many
ofain
these
problems
allowing
theproduce
programmer
to explicitly
would
attempt
toParallel
take
a sequential
Cby
program
a parallel
HDL specify and control
CPUs and DSPs
Processor
Arrays
Arrays
recoding.
The
language
is
based
on
C
programming
language
and
contains
extensions
(Open Computing Language) draftimplementation.
specification parallelism.
to The
in so
an
effort in
to
TheGroup
OpenCL
standard
more
naturallyofmatches
The Khronos
difficulty
was not
much
the creation
a HDL the highly-parallel nature
thatparallel
allow
for
theprogramming
specification
of parallelism.
cross-platform
parallel
standard.
Thethan
Khronos
of FPGAs
sequential
programs
described
in purethat
C. would
Considering thecreate
needafor
creating
programs
for the emerging
era,
itGroup consists
implementation,
butmulticore
rather
in do
the
extraction
of thread-level
parallelism
a consortium
of
industry
members
such
as
Apple,
IBM,
Intel,
AMD,
NVIDIA,
was recognizedofthat
there needs
to
be
a
standard
model
for
creating
programs
that
allow
the
FPGA
implementation
to
achieve
high
performance.
With
FPGAs
being on
In addition to providing a portable model, the OpenCL standard inherently offers the
Altera,
and
many
others.
This
group
has
been
responsible
for
defining
the
OpenCL
will execute across
all of
these
quite
different
devices.
The
lack
of
a
standard
that
is
the furthest
extreme
parallel spectrum,
anyatfailure
to higher
extract maximum
ability to describe parallel
algorithms
to of
bethe
implemented
on FPGAs,
a much
Brief
Overview
ofallows
the
OpenCL
Standard
1.1,different
and 1.2 level
specifications.
The
OpenCL
standard
for
the
implementation
of as
portable across1.0,
these
programmable
technologies
has
plagued
parallelism
is more
crippling
than
on other
devices.
The
OpenCL
of abstraction
than hardware
description
languages
(HDLs)
such
VHDL
or standard solves
parallel
algorithms
thatApple
can besubmitted
ported
from
platform
to an
platform
with
minimal
programmers. In
the summer
of
2008,
a proposal
for
OpenCL
many
of
these
problems
by
allowing
the
programmer
to
explicitly
controlis a pure
Verilog.
Although
many
high-level
synthesis
tools
exist
for
gaining
this
higher
level
of specify
OpenCL applications consist of two parts. The OpenCL
hostand
program
recoding.
The language
is based
CThe
programming
language
and
contains
extensions
(Open Computing
Language)
draft
specification
to
Group
in
an
effort
to
parallelism.
The
OpenCL
standard
more
naturally
matches
the
highly-parallel
nature
abstraction,
they on
have
all Khronos
suffered
from
the
same
fundamental
problem.
These
tools
software routine written in standard C/C++ that runs on any sort
of microprocessor.
that allow
for theprogramming
specification of
parallelism.
create a cross-platform
parallel
standard.
Thethan
Khronos
Group consists
of FPGAs
do sequential
programs described in pure C.
Single Cores
Multicores
DSPs
CPUs
Montag, 27. Mai 13
GPUs
OPENCL GOALS
• Khronos
standard for unified programming of heterogeneous
hardware (GPU, CPU, DSP, Embedded Systems, FPGAs, ...)
• Initiated
• explicit
by Apple (2008) - Dec. 2008 OpenCL 1.0 Release
declaration of parallelism
• portability/code
• ease
Montag, 27. Mai 13
reuse: same code for different hardware
programmability of parallel hardware
AVAILABILITY
SDK
Intel OpenCL SDK
AMP APP SDK
Hardware
Core i3/i5/i7
CPU & Integrated Intel HD GPU
AMD GPUs
X86 CPUs
1.2
1.2
Apple
CPUs & GPUs
1.2
Altera
selected boards
1.0
Qualcomm
Qualcomm CPU & GPU
1.1
ARM
Mali T604 GPU
1.1
...
Montag, 27. Mai 13
Version
Platform
1
1
*
CommandQueue
Event
0..1 *
*
*
1
1
*
DeviceID
1..*
*
0..1
1
*
1
*
*
Program
*
Context
1
*
MemObject
*
*
{abstract}
Sampler
1
*
Kernel
1
Buffer
Image
*
Figure 2.1 - OpenCL UML Class Diagram
OPENCL OVERVIEW
Last Revision Date: 11/14/12
Montag, 27. Mai 13
Page 21
OPEN COMPUTING
LANGUAGE
OpenCL
GPU
Host
(C, C++, ...)
Platform Layer
API
•query/select devices of host
•initialize compute devices
Montag, 27. Mai 13
Runtime API
CPU
•“compile” kernel code
•execute compiled kernels on device(s)
•exchange data between host and
Accelerator
compute devices
•synchronize several devices
OpenCL C Language
(Kernel)
•C99 subset + extensions
•built-in functions
•(sin, cos, cross, dot, ...)
PLATFORM MODEL
•
compute devices are
composed of one or more
compute units
•
compute units are divided
into one or more processing
elements
Montag, 27. Mai 13
Compute
Device
Compute
Device
Compute Unit
one or more compute
devices
Compute Unit
•
Compute Unit
one host
Compute Unit
•
Host
PE
PE
PE
PE
PE
PE
PE
PE
PE
PE
PE
PE
PE
PE
PE
PE
A full Kepler GK110 implementation includes 15 SMX units and six 64 bit memory controllers. Different
products will use different configurations of GK110. For example, some products may deploy 13 or 14
SMXs.
EXAMPLE: NVIDIA KEPLER
GK110
Key features of the architecture that will be discussed below in more depth include:
The new SMX processor architecture
An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at
each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O
implementation.
Hardware support throughout the design to enable new programming model capabilities
•15 SMX = Compute Units
•1SMX = 192 Processing Elements
Kepler GK110 Full chip block diagram
Source: http://www.nvidia.de/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
Montag, 27. Mai 13
EXECUTION MODEL
•
Kernel
unit of executable code - data parallel or task parallel
•
Program
collection of kernels and functions (think of dynamic library)
•
Command Queue
host application queues kernels & data transfers
in order or out of order
•
Work-Item
execution of a kernel by a single processing element (think of thread)
•
Work-Group
collection of work-items that execute on a single compute unit (think of cpu core)
Montag, 27. Mai 13
SPECIFY PARALLELISM
• say
our “global” problem
consists of 1024x1024 tasks,
which can be executed in
parallel
• therefore
we have 1024x1024
work-items
• group
several work-items into
local workgroups
Montag, 27. Mai 13
MEMORY MODEL
private
private
Work Item
Work Item
Constant
...
private
Work Item
...
private
private
Work Item
Work Item
...
Local
Local
Workgroup
Workgroup
private
Work Item
Global
Device
Host Memory
Montag, 27. Mai 13
Host
EXECUTION MODEL
• host
application submits work
to the compute devices via
command queues
Montag, 27. Mai 13
CPU
Context
Queue
environment within
which work-items executes
includes: devices, memories
and command queues
Queue
• context:
GPU
SYNCRONIZATION
• Events:
synchronize kernel executions between different queues in the
same context
• Barriers
synchronize kernels within a queue
Montag, 27. Mai 13
SIMPLE KERNEL EXAMPLE
void inc( float* a, float b, int N )
{
for( int i = 0; i < N; ++i )
{
a[ i ] = a[ i ] + b;
}
}
void main( void ) {
...
inc( a, b, N );
}
kernel
void inc( global float* a, float b )
{
int i = get_global_id( 0 );
a[ i ] = a[ i ] + b;
}
// host code
void main( void ) {
...
clEnqueueNDRangeKernel( ..., &N, ... );
}
the kernel you (e.g.) program the execution of the
inner part of a loop
• inside
• built-in
Montag, 27. Mai 13
kernels can be used to exploit special hardware units
+
OPENCL ON ALTERA FPGAS
Montag, 27. Mai 13
ALTERAS EXECUTION MODEL
Host CPU
FPGA
User Program
Accelerator
Accelerator
Accelerator
PCIe
OpenCL Runtime
Embedded CPU
User Program
OpenCL Runtime
Montag, 27. Mai 13
FPGA SOC
Accelerator
Accelerator
Accelerator
ALTERA OPENCL
COMPILATION
OpenCL
Host Program & Kernels
Produces throughput
and area report
ACL Compiler
Std. C Compiler
FPGA bitstream
X86 Binary
PCIe
Montag, 27. Mai 13
DIFFERENCES TO OTHER
OPENCL IMPLEMENTATIONS
•
in contrast to CPU/GPU, specialized datapaths are generated
•
for each kernel, custom hardware is created
•
logic is organized in functional units, based on operation and linked
together to form the dedicated datapath required to implement the
special kernel
•
execution of multiple workgroups in parallel in a custom fashion
•
pipeline parallelism
Montag, 27. Mai 13
ALTERA LIMITATIONS
•
OpenCL Version 1.0 (nearly completely)
•
#contexts: 2
#program_objects/ctxt = 3
#concurrently running kernels=3
#arguments / kernel = 128
•
max 256MB of host-side memory (CL_MEM_ALLOC_HOST_PTR)
•
max 2GB DDR as device global memory
•
No support for out-of order execution of command queues
•
no Image read and write function (atm), no explicit memory fences, no floating point exceptions
•
Memory:
#cmd_queues/ctxt = 3;
#outstanding events/ctxt around 100
#kernels / device = 16
#work_items / work_group = 256
•
global memory visible to all work-items (e.g. DDR Memory on FPGA)
•
local memory visible to a work-group (e.g. on-chip memory)
•
private memory visible to a work-item (e.g. registers)
Montag, 27. Mai 13
BENEFITS
• shorter
design cycles & time
to market
83,6
• performance
• performance
• explicit
per watt
Terasic DE4
Xeon W3690
Tesla C2075 GPU
15,1
parallelism
• “portable”
Montag, 27. Mai 13
15,9
0
23
45
68
90
Performance per Watt in Million Terms/J