Leveraging Silicon-Photonic NoC for Designing Scalable GPUs
Amir Kavyan Ziabari†, José L. Abellán‡, Rafael Ubal†, Chao Chen§, Ajay Joshi\, David Kaeli†

† Electrical and Computer Engineering Dept., Northeastern University ({aziabari,ubal,kaeli}@ece.neu.edu)
‡ Computer Science Dept., Universidad Católica San Antonio de Murcia (jlabellan@ucam.edu)
§ Digital Networking Group Dept., Freescale Semiconductor, Inc. (chen9810@gmail.com)
\ Electrical and Computer Engineering Dept., Boston University (joshi@bu.edu)
ABSTRACT

Silicon-photonic link technology promises to satisfy the growing need for high-bandwidth, low-latency, and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been extensively studied for future manycore systems, their use in massively-threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC that connects the different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD SDK on the Multi2sim GPU simulator shows that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems.

To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is an AMD Southern Islands GPU with 32 compute units (CUs), and we compare this design to our proposed hybrid silicon-photonic NoC. Our proposed photonic hybrid NoC increases performance by up to 6× (2.7× on average) and reduces the energy-delay2 product (ED2P) by up to 99% (79% on average) as compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology, since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC can improve performance by up to 1.9× (43% on average) and achieve up to a 62% reduction in ED2P (3% on average) in comparison to the best-performing mesh design.
Categories and Subject Descriptors

B.4.3 [Input/Output and Data Communications]: Interconnections (subsystems) – fiber optics, physical structures, topology; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures – Single-Instruction Stream, Multiple-Data Processors (SIMD)
ICS'15, June 8–11, 2015, Newport Beach, CA, USA.
Copyright 2015 ACM 978-1-4503-3559-1/15/06 ...$15.00.
http://dx.doi.org/10.1145/2751205.2751229
Keywords
Network-on-Chip, Photonics Technology, GPUs
1. INTRODUCTION
A little over a decade ago, GPUs were fixed-function processors built around a pipeline that was dedicated to rendering 3-D graphics. In the past decade, as the potential
for GPUs to provide massive compute parallelism became
apparent, the software community developed programming
environments to leverage these massively-parallel architectures. Vendors facilitated this move toward massive parallelism by incorporating programmable graphics pipelines.
Parallel programming frameworks such as OpenCL [12] were
introduced, and tools were developed to explore GPU architectures. NVIDIA and AMD, two of the leading graphics
vendors today, tailor their GPU designs for general purpose
high-performance computing by providing higher compute
throughput and memory bandwidth [3, 4]. Contemporary
GPUs are composed of tens of compute units (CUs). CUs
are composed of separate scalar and vector pipelines designed to execute a massive number of concurrent threads
(e.g., 2048 threads on the AMD Radeon HD 7970) in a single
instruction, multiple data (SIMD) fashion [20].
Multi-threaded applications running on multi-core CPUs
exhibit coherence traffic and data sharing which leads to
significant L1-to-L1 traffic (between cores) on the interconnection network. GPU applications, on the other hand, are
executed as thread blocks (workgroups) on multiple compute units. There is limited communication between different workgroups. Since GPU applications generally exhibit
little L1 temporal locality [9], the communication between
the L1 and L2 caches becomes the main source of traffic on
the GPU's on-chip interconnection network. As the number of CUs increases in each future generation of GPU systems, latency in the on-chip interconnection network becomes a major performance bottleneck on the GPU [6].

Figure 1: Potential improvements for various workloads running on a 32-CU GPU, with an ideal crossbar NoC against an electrical NoC (execution time in ms and offered bandwidth in Gb/s).
To evaluate the potential benefits of employing a low-latency Network-on-Chip (NoC) for GPUs, we evaluate a 32-compute-unit GPU system (see a more detailed description in Section 3). We compare a system with an electrical Multi-Write-Single-Read crossbar network against a system deploying an ideal NoC, with a fixed 3-cycle latency between source and destination nodes. To evaluate the GPU system performance when using these NoCs, we utilize applications from the AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK) [1]. The ideal NoC offers up to 90% (51% on average) reduction in execution time, and up to 10.3× (3.38× on average) higher bandwidth (see Figure 1). These results clearly motivate the need to explore novel link technologies and architectures to design low-latency, high-bandwidth NoCs that can significantly improve performance in current and future GPUs.
Silicon-photonic link technology has been proposed as a
potential replacement for electrical link technology for the
design of NoC architectures for future manycore systems [10,
16, 17, 27, 34]. Silicon-photonic links have the potential to
provide significantly lower latency and an order of magnitude higher bandwidth density for global communication in
comparison to electrical links. We take advantage of photonic link technology in designing a NoC for GPU systems.
The main contributions of the paper are as follows:
• We explore the design of Multiple-Write, Single-Read
(MWSR), and Single-Write, Multiple-Read (SWMR)
electrical and photonic crossbar NoCs for communication between L1 and L2 of a GPU with 32 CUs.
• We propose the design of a GPU-specific hybrid NoC
with reduced channel count based on the traffic patterns observed in the communications between L1 and
L2 cache units in the GPU systems. We compare the
electrical and silicon-photonic implementations of this
hybrid NoC architecture in a GPU with 32 CUs running standard GPU benchmarks.
• We evaluate the scalability of a silicon-photonic NoC
on a larger GPU system with 128 CUs, comparing our
proposed hybrid photonic crossbar against a competitive electrical 2D-mesh NoC.
Figure 2: Photonic Link Components – Two point-to-point photonic links implemented with Wavelength-Division Multiplexing (WDM).
2. SILICON-PHOTONIC TECHNOLOGY
Figure 2 shows a generic silicon-photonic channel with
two links multiplexed onto the same waveguide. These two
silicon-photonic links are powered using a laser source. The
output of the laser source is coupled into the planar waveguides using vertical grating couplers. Each light wave is
modulated by the respective ring modulator that is controlled by a modulator driver. During this modulation step,
the data is converted from the electrical domain to the photonic domain. The modulated light waves propagate along
the waveguide and can pass through zero or more ring filters. On the receiver side, the ring filter whose resonant
wavelength matches with the wavelength of the light wave
“drops” the light wave onto a photodetector. The resulting
photodetector current is sensed by an electrical receiver. At
this stage, data is converted back into the electrical domain
from the photonic domain.
Each Electrical-to-Optical (E/O) and Optical-to-Electrical (O/E) conversion stage introduces a 1-cycle latency. The
time-of-flight for the data on the photonic link is 1 cycle.
The serialization latency, on the other hand, depends on the
bandwidth of the photonic link. Contention and queuing
delays are other contributing factors to the overall latency
of transmissions. We consider all these forms of latency in
our simulation framework.
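To make these latency components concrete, the following minimal sketch (our own illustration, not code from the simulator) adds up the per-message latency of a photonic transfer using the cycle counts described above; the queuing term is left as a caller-supplied placeholder since it is workload dependent.

```python
import math

def photonic_transfer_cycles(message_bits, link_bits_per_cycle, queuing_cycles=0):
    """Estimate the cycles for one message over a silicon-photonic link.

    Components, as described above:
      1 cycle  E/O conversion at the modulator
      1 cycle  time-of-flight along the waveguide
      1 cycle  O/E conversion at the receiver
      extra serialization cycles when the message exceeds the link width
      queuing/contention cycles supplied by the caller (workload dependent)
    """
    serialization = math.ceil(message_bits / link_bits_per_cycle) - 1
    return 1 + 1 + 1 + serialization + queuing_cycles

# Example: a 72-byte packet on a 72-byte-wide photonic bus, no contention.
print(photonic_transfer_cycles(72 * 8, 72 * 8))  # -> 3 cycles
```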
For our silicon-photonic NoCs, we consider monolithically-integrated links [13, 25, 26]. Monolithic integration of silicon-photonic links uses the existing layers in the process technology to design photonic devices. Working prototypes of links designed using monolithic integration have been presented in previous work [13, 25, 26]. We consider a next-generation link design [16] that uses double-ring filters with a 4 THz free-spectral range. For our analysis, we project the E-O-E conversion cost, thermal tuning, and photonic device losses for this next-generation link using the measurement results in previous works [13, 25, 26]. This design supports up to 128
wavelengths (λ) modulated at 10 Gbps on each waveguide
(64λ in each direction, which are interleaved to alleviate
filter roll-off requirements and crosstalk). A non-linearity
limit of 30 mW at 1 dB loss is assumed for the waveguides.
The waveguides are single mode and have a pitch of 4 µm
to minimize the crosstalk between neighboring waveguides.
The modulator ring and filter ring diameters are ∼10 µm
(further details in Tables 4 and 5). The silicon-photonic
links are driven by an off-chip laser source.
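As a back-of-the-envelope check on these link parameters, the raw bandwidth of a single waveguide follows directly from the wavelength count and the per-wavelength modulation rate (our own arithmetic, not a figure quoted from [16]):

```python
wavelengths_per_waveguide = 128   # 64 lambda in each direction
gbps_per_wavelength = 10          # modulation rate per wavelength

total_gbps = wavelengths_per_waveguide * gbps_per_wavelength
per_direction_gbps = (wavelengths_per_waveguide // 2) * gbps_per_wavelength

print(total_gbps, "Gb/s raw per waveguide")      # 1280 Gb/s
print(per_direction_gbps, "Gb/s per direction")  # 640 Gb/s
```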
3. TARGET GPU SYSTEMS

3.1 32-CU GPU Architecture
Table 1: AMD Radeon HD 7970 GPU specification.

Fabrication process: 28 nm
Clock Frequency: 925 MHz
Compute Units: 32
MSHR per Compute Unit: 16
SIMD Width: 16
Threads per Core: 256
Wavefront Size: 64
Memory System:
  Size of L1 Vector Cache: 16KB
  Size of L1 Scalar Cache: 16KB
  L2 Caches/Mem. Cntrls.: 6
  Block Size: 64B
  Size of L2 Cache: 128KB
  Memory Page Size: 4KB
  LDS Size: 64KB

To explore the NoC design in the memory hierarchy of a GPU, we leverage an existing simulation model of a high-performance GPU. We target an AMD Radeon HD 7970,
a state-of-the-art GPU of AMD’s Southern Islands family.
CUs are clocked at 925 MHz, are manufactured in a 28 nm
CMOS technology node, and are tailored for general purpose
workloads [3]. The die size for this GPU is reported to be
352 mm2 and its maximum power draw is 250 watts [21].
The layout of the AMD Radeon HD 7970 GPU chip is
shown in Figure 3a [22]. Table 1 outlines the details of the
different GPU components. The Radeon HD 7970 has 32
CUs. Each CU has a shared front-end and multiple execution units, where instructions are classified into computational categories and scheduled to the appropriate special-purpose execution unit. Four of the execution units are SIMD units, formed of 16 SIMD lanes, which run integer and floating-point operations in parallel.
Each CU is equipped with an L1 vector data cache unit
(vL1), known as a vector cache, which provides the data for
vector-memory operations. The vector cache is 16KB and
4-way set associative, with a 64B line size. The vector cache
is coherent with the L2 and other cache units in the GPU
system. The vector cache has a write-through, write-allocate
design. Scalar operations (i.e., computations shared by all
threads running as one SIMD unit) use a dedicated set of
L1 scalar data cache units (sL1s), known as scalar caches,
shared by 4 CUs. The scalar cache units are also connected
to L2 cache units through the NoC [3].
The L2 cache in the Southern Islands architecture is physically partitioned into banks that are coupled with separate
memory controllers. Similar to the L2 cache units, memory controllers are partitioned into different memory address spaces. Each L2 cache unit is associated with only one
memory controller. Sequential 4KB memory pages are distributed across L2 banks, which reduces the load on any single memory controller while maximizing spatial locality [3].
Two types of messages are transmitted between L1 and L2 cache units in the memory hierarchy. The first type is an 8-byte control message. Cache requests such as reads, writes, and invalidations generate and transmit control messages, which contain information about the destination cache unit. The second type of message is a cache line, which has a size of 64 bytes. Since a cache line does not contain any information about the type of request or the unit that should receive it, it must always follow a control message that carries this information.
3.2 128-CU GPU Architecture
We consider a forward-looking GPU design that uses the
32-CU Radeon HD 7970 GPU as a building block, and increases the number of CUs. In this new architecture, we
quadruple the number of CUs to 128 while keeping the same
CU design as in the 32-CU GPU for individual CUs. The
128-CU GPU also quadruples the number of memory components in the HD 7970 GPU chip: 128 L1 vector caches, 32 L1 scalar caches, and 24 shared L2 caches. We assume that the 128-CU GPU is designed using 14 nm CMOS technology, which results in a reasonable floorplan area of 402.2 mm2.

Figure 3: Physical layouts of the electrical NoCs for the two GPU chips: (a) crossbar for the 32-CU GPU (28 nm); (b) 4x12 2D-mesh for the 128-CU GPU (14 nm).
4. NOC DESIGN FOR GPU SYSTEMS
In this section, we discuss the electrical and photonic NoC
designs used in our evaluation of the 32-CU GPU (described
in Section 3.1), and the scaled-up 128-CU GPU (described
in Section 3.2).
4.1 Electrical NoC Design for 32-CU GPU
Although not every detail of the interconnect used in the
AMD Radeon HD 7970 chip has been publicly disclosed,
online resources suggest that the crossbar topology is in
fact used between the L1 and L2 caches [3, 23]. Therefore,
we model a crossbar NoC for the baseline 32-CU GPU to
provide communication between its 40 L1 caches and 6 L2
caches.
The crossbar is a low-diameter high-radix topology that
supports a strict non-blocking connectivity, and can provide
high throughput. The crossbar network topology can be
implemented as a multi-bus topology, where each node in
the network has a dedicated bus. Different buses of the
crossbar topology can be routed around the GPU chip to
form a U-shaped layout (Figure 3a). We consider two main
designs for the bus crossbar – a Single-Write-Multiple-Read
(SWMR) design and a Multiple-Write-Single-Read (MWSR)
design.
In the SWMR crossbar, each transmitting node has a designated channel on which it transfers messages, and all receiving nodes are able to listen to all sending channels. Every node decodes the header of each control message on the channel and decides whether to receive the rest of the message; only the destination node receives the cache line that directly follows the control message. Received messages are buffered in the destination node's input buffer, and the destination node applies Round-Robin (RR) arbitration among messages it receives from multiple transmitting nodes at the same time. In the MWSR crossbar, each transmitting node must acquire access to the receiver's dedicated bus to transmit a message; access is granted in RR fashion among the nodes that have a message to transmit. The channels in both designs use credit-based flow control for buffer management, with acknowledgments piggybacked on control messages traveling in the reverse direction.
Each L1 and L2 cache unit in the GPU system acts as
a node in the network. For example, in the design of the
SWMR crossbar NoC in the 32-CU GPU there are 40 dedicated SWMR buses (32 for vector caches and 8 for scalar
caches) required for L1 units, and 6 dedicated SWMR buses
for L2 units to send information to all the other units.
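As an illustration of the MWSR arbitration described above, the sketch below grants a shared receive bus to competing L1 senders in round-robin order; it is a simplified model of the policy, not the hardware or simulator implementation.

```python
class MWSRBus:
    """One MWSR bus: many writers, a single reader, round-robin grants."""

    def __init__(self, num_senders):
        self.num_senders = num_senders
        self.last_granted = -1  # index of the sender granted most recently

    def arbitrate(self, requests):
        """requests[i] is True if sender i has a message to transmit.

        Returns the index of the granted sender, or None if no requests.
        Grants rotate starting from the sender after the last winner.
        """
        for offset in range(1, self.num_senders + 1):
            candidate = (self.last_granted + offset) % self.num_senders
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None

# Example: 40 L1 caches competing for one L2's bus in the 32-CU GPU.
bus = MWSRBus(num_senders=40)
print(bus.arbitrate([False] * 39 + [True]))  # -> 39
```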
4.2 Electrical NoC Design for 128-CU GPUs
As the number of CUs in a GPU grows, the number of
L1 and L2 units that need to communicate also grows. This
means that additional SWMR/MWSR buses are required in
the crossbar to support this communication. A larger number of buses, implemented using electrical link technology,
translates to large die area and higher power dissipation.
Moreover, in an MWSR crossbar, the transmitting units need to acquire access to one of the buses via arbitration, and if a node holds on to the bus for a large number of cycles (due to serialization and transmission delays), other nodes experience very long wait times. A large number of SWMR buses imposes pressure on the receiving cache units, which have a limited number of ports. Therefore, for
GPUs with large CU counts, we propose to use an electrical
2D-mesh network. The mesh is a low-radix high-diameter
topology that uses decentralized flow control. An electrical
mesh network is easy to design from a hardware perspective
due to its short wires and low-radix routers. While a mesh
avoids the long arbitration delays present within a crossbar,
it has higher zero-load latency, as each packet has to traverse
multiple routers (hops) to reach its destination.
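For reference, the zero-load latency trade-off described above can be summarized with the standard first-order decomposition (a textbook expression, not a formula from this paper), where H is the hop count, t_r the per-router delay, D/v the total wire propagation delay, and L/b the serialization delay of an L-bit packet on a channel with bandwidth b:

T_0 = H \cdot t_r + \frac{D}{v} + \frac{L}{b}

A crossbar minimizes H at the cost of long global channels and centralized arbitration, while a mesh keeps channels short but pays a larger H.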
A typical 2D-mesh is constructed using radix-5 routers connected to neighboring routers and to sender/receiver end-nodes.
The main drawback of using a 2D-mesh with a large number
of nodes is the high hop count, which results in longer latencies. To alleviate this issue, we propose using a concentrated
2D-mesh, where each router has a larger radix. As shown in
Figure 3b, for the 128-CU GPU, we use routers with radix
7 and radix 8.
Table 2: Number of global buses required for the topologies in the 32- and 128-CU GPUs – HYB = the proposed hybrid, DOWN = L1-to-L2 traffic, and UP = L2-to-L1 traffic.

Design   32-CU DOWN   32-CU UP   128-CU DOWN   128-CU UP
MWSR     6            40         24            160
SWMR     40           6          160           24
HYB      6            6          24            24

In our baseline 2D-mesh designs for a 128-CU GPU, a single router is connected to L1 vector and scalar cache units, depending on the location of the router on the layout. While
various configurations are possible for placement of L2 units
[6], we chose to connect L2 units to the mesh through the
routers placed along the periphery of the mesh to follow
a similar layout to the AMD Radeon 7970 for our 128-CU
GPUs. In the AMD Radeon 7970 the L2 banks are placed
on the periphery of the die [22].
The mesh design is modeled using state-of-the-art single-cycle routers [28], each having room for 8 cache lines in the input buffers and no virtual channels. The routers use a standard matrix-based crossbar that implements round-robin arbitration. We use an X-Y routing scheme with credit-based flow control.
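The sketch below illustrates the X-Y (dimension-ordered) routing rule the mesh routers follow and the hop count it implies; the coordinates and grid size are illustrative, and the code is our own rather than the simulator's.

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a 2D mesh.

    src and dst are (x, y) router coordinates. The packet first travels
    along X until the column matches, then along Y. Returns the list of
    routers visited, including source and destination.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # route in X first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then route in Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# Example: corner-to-corner path on a 4x12 concentrated mesh.
path = xy_route((0, 0), (11, 3))
print(len(path) - 1)  # 14 hops
```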
4.3 Hybrid NoC Design for GPU
As mentioned earlier, there are considerable drawbacks in
terms of area and power dissipation with both the SWMR
and MWSR crossbars, due to the large number of global
channels. To address this problem, we design a NoC with
a significantly lower number of channels, by taking advantage of the asymmetry in the L1-to-L2 versus L2-to-L1 network traffic, and by analyzing the communication between
L1 units.
We propose to use SWMR buses for L2-to-L1 communication and MWSR buses for L1-to-L2 communication in our
hybrid design. The choice of a SWMR bus in the L2-to-L1
network is easy to justify. The data from L2 cache units to
L1 cache units is latency sensitive since this data is needed
for the work-items to start their execution. In SWMR, there
is no arbitration latency on the transmitting side (the L2 cache).
Additionally, the number of buses is low, equal to the number of L2 caches.
Using MWSR buses in the L1-to-L2 direction requires only a few buses (equal to the number of L2 caches), at the expense of introducing additional arbitration latency, since the hardware must deal with the worst case where all L1 caches attempt to access the same bus simultaneously. However, this has minimal impact on the overall performance because the latency
of L1-to-L2 traffic is typically not critical as the GPU programming model enforces that no other CU should be stalled
awaiting completion of a write-back transfer. At the same
time, GPU workloads exhibit light traffic for L1-to-L2 accesses. Our analysis of the GPU applications in the AMD
APP SDK benchmark reveals that on average, 80.5% of the
traffic generated from L1-to-L2 is due to control messages
(8-byte messages). This means that any bus at least 8 B wide incurs zero serialization latency for 80% of the messages from L1 to L2. Thus, scaling down the L1-to-L2 network causes no noticeable increase in contention.
Table 3: The main elements in the photonic crossbars. The channel width is 72 bytes. WG = waveguides, MD = modulators, FL = filters, TL = MD+FL. The columns are for the 32-CU GPU unless stated otherwise. HYB128 uses concentration through access points (APs) to reduce the number of rings (4 vL1s and 1 sL1 share an AP).

        MWSR    SWMR    HYB     HYB128
WG      63      63      26      186
MD      27840   2668    14268   45936
FL      2668    27840   14268   45936
TL      30508   30508   28536   91872

A crossbar topology provides point-to-point connections between all the L1 and L2 units. Therefore, any L1-to-L1 communication has a dedicated path. As mentioned in Section 1, GPU applications generate very little L1-to-L1 communication, even though modern GPU designs provide coherence between L1 units. We evaluated
many memory-intensive applications in the AMD APP SDK
benchmark suite in order to calculate the amount of traffic
between L1 units. As expected, all applications in the suite
exhibit close to zero transactions between L1 units, except
for two applications conv and dct (for these two applications,
16% and 12% of all the traffic on the network are L1-to-L1
transactions, respectively – see Table 6 for further details on
applications).
In our hybrid design, to reduce the power consumption of the NoC, we therefore also remove the direct connections between L1 units. Any L1-to-L1 transaction is replaced with two transactions: one from L1 to L2 and a second from L2 back to L1. This change also avoids any arbitration latency that the small amount of L1-to-L1 communication might otherwise introduce.
Overall, the hybrid design reduces the number of buses in
the crossbar for the 32-CU GPU from 46 to 12. Similarly,
using the hybrid design reduces the number of buses from
184 to 48 for the 128-CU GPU (see Table 2 for details).
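The bus counts in Table 2 follow directly from the number of L1 and L2 endpoints; the short sketch below reproduces them from the endpoint counts given in Section 3 (our own illustration of the counting).

```python
def bus_counts(num_l1, num_l2):
    """Global bus counts for the three crossbar organizations (Table 2).

    MWSR: one bus per reader; SWMR: one bus per writer.
    The hybrid uses MWSR buses downstream (L1->L2) and SWMR upstream (L2->L1).
    """
    return {
        "MWSR": {"down": num_l2, "up": num_l1},   # readers: L2s down, L1s up
        "SWMR": {"down": num_l1, "up": num_l2},   # writers: L1s down, L2s up
        "HYB":  {"down": num_l2, "up": num_l2},
    }

print(bus_counts(num_l1=40, num_l2=6))    # 32-CU GPU: 46/46/12 buses in total
print(bus_counts(num_l1=160, num_l2=24))  # 128-CU GPU: 184/184/48 in total
```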
Figure 4: Physical layout of the silicon-photonic NoCs for the two GPU chips: (a) crossbar for the 32-CU GPU (28 nm); (b) 32×24 crossbar for the 128-CU GPU (14 nm). We use 4 µm-pitch waveguides and 10 µm-diameter rings.

4.4 Photonic NoC Designs
The length-independent, data-dependent energy of silicon-photonic links makes this technology well suited for designing low-diameter, high-radix topologies such as a crossbar, Clos, or butterfly. In this work, we consider a crossbar, since it provides strictly non-blocking connectivity and is easy to program. A photonic crossbar also provides lower latency than an electrical crossbar. We use silicon-photonic link technology to design a crossbar NoC for the GPU, as discussed in Section 4.1.
The photonic MWSR implementation uses an optical token channel with fast-forwarding, which provides the same
RR arbitration as the electrical crossbars [35]. The photonic
SWMR implementation avoids the need for arbitration. At
every transmission, the source node sends the control message, and its head flit is decoded by all the receiving nodes.
All nodes except the destination node turn off their receiver
ring detector after the decode process [27]. Similar to our
electrical hybrid NoC, our photonic hybrid design also utilizes SWMR buses for L2-to-L1 and MWSR for L1-to-L2
communications.
Table 3 shows the number of photonic components required for the different crossbars for the maximum bus width
(72 bytes). If we compare the three topologies for the 32-CU
GPU, our proposed hybrid scheme requires fewer waveguides
due to the lower number of buses (see Table 2) and a correspondingly lower number of total rings (modulators and
filters), which results in lower NoC power. As a result, we
will only consider this hybrid design for the photonic NoC
for the larger 128-CU GPU. To further minimize static NoC power, the hybrid crossbar utilizes concentration through Access Points (APs). In particular, 4 vector L1 caches and 1 scalar L1 cache use the same AP to access their corresponding crossbar bus on a time-division multiplexing basis (see
Figure 4b).
Photonic topologies can be mapped onto the U-shaped
and serpentine physical layouts illustrated in Figure 4a and
Figure 4b, respectively, for the two GPUs under study. Note
that the Figures show the GPU chip dimensions. As explained in Section 3, to obtain the 128-CU floorplan area,
we scale down the dimensions of the 32-CU chip components (CUs, vL1s, sL1s, L2s and MCs) from 28 nm (the
technology node of the 32-CU GPU) to 14 nm technology
node. Monolithic integration of a photonic crossbar would
slightly increase the area of the GPU chip. We use 4 µm-pitch waveguides and 10 µm-diameter rings. When comparing the total area of the GPU chips with the electrical NoCs and the photonic NoCs, we obtain floorplan areas of 390.0 mm2 vs. 390.8 mm2 (32-CU GPU) and 402.2 mm2 vs. 410.0 mm2 (128-CU GPU), respectively. Photonic components do not scale down with the technology node (contrary to CMOS transistors), and that is the main reason for this increase in area. The area overhead in the 128-CU GPU is 7.8 mm2, roughly the area of 2 CUs (a CU occupies about 4 mm2); increasing the number of CUs from 128 to 130 would result in a very small performance improvement.

Table 4: Energy projections for photonic links based on [13, 25, 26] – Tx = modulator driver circuits, Rx = receiver circuits, dynamic energy = data-traffic-dependent energy, fixed energy = clock and leakage. We consider 20% (conservative projection) and 30% (aggressive projection) laser efficiency.

Circuit       Data-dependent energy   Fixed energy   Thermal tuning
Tx (fJ/bit)   20                      5              16
Rx (fJ/bit)   20                      5              16

Table 5: Projected/measured optical loss per component [13, 25, 26]. We consider -17 dBm (conservative projection) and -20 dBm (aggressive projection) for the photodetector sensitivity.

Device                      Loss (dB)    Device            Loss (dB)
Optical Fiber (per cm)      5e-6         Coupler           1
Non-linearity (at 30 mW)    1            Splitter          0.2
Modulator Insertion         1            Filter through    1e-3
Waveguide crossing          0.05         Filter drop       1.5
Waveguide (per cm)          2            Photodetector     0.1
Energy and loss projections for photonic crossbars are presented in Tables 4 and 5, respectively [13,25,26]. We use conservative choices for the 32-CU GPU and aggressive choices
for the 128-CU GPU. Conservative and aggressive designs
employ different parameters for the photodetector sensitivity and laser efficiency (-17 dBm vs. -20 dBm in the former,
and 20% vs. 30% in the latter, respectively).
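To show how these projections translate into laser power, the sketch below works through a generic dB budget: the laser must deliver enough optical power to overcome the path loss and still meet the photodetector sensitivity, and the wall-plug efficiency converts that into electrical power. The component counts in the example path are made up for illustration; only the per-component losses come from Table 5.

```python
def laser_power_per_wavelength_mw(total_loss_db, sensitivity_dbm, laser_efficiency):
    """Electrical laser power (mW) needed for one wavelength.

    The optical power reaching the photodetector must be at least
    `sensitivity_dbm`, so the laser output must be sensitivity + total path
    loss (in dB). Dividing by the wall-plug efficiency gives electrical power.
    """
    required_output_dbm = sensitivity_dbm + total_loss_db
    optical_mw = 10 ** (required_output_dbm / 10.0)
    return optical_mw / laser_efficiency

# Illustrative loss path built from Table 5 values; the counts (2 cm of
# waveguide, 100 filter pass-bys, etc.) are hypothetical, not the paper's layout.
loss_db = 1.0 + 2 * 2.0 + 1.0 + 100 * 1e-3 + 1.5 + 0.1

print(laser_power_per_wavelength_mw(loss_db, sensitivity_dbm=-17, laser_efficiency=0.20))
print(laser_power_per_wavelength_mw(loss_db, sensitivity_dbm=-20, laser_efficiency=0.30))
```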
5. EVALUATION METHODOLOGY
We conducted experiments to evaluate the potential for
silicon-photonic NoC in GPUs using the Multi2sim 4.2 simulation framework [31]. The simulator has a configurable
model for commercial AMD and NVIDIA GPU architectures. Multi2sim integrates an emulator for the AMD Southern Islands instruction set. The emulator provides traces of
instructions at the ISA level, which are fed to an architectural simulation featuring a detailed timing model of the
GPU pipelines and memory system.
We have extended Multi2sim to include models for both
electrical and photonic buses, as well as packet processing
and flitting support for NoCs. To validate the updated
Multi2sim simulation framework, we leveraged the standalone network simulation mode of Multi2sim that can inject
random traffic in various network topologies. We compared
and verified the reported latencies and performance numbers
in our cycle-based model against previous research [7, 11].
The applications evaluated in this work are taken from the
AMD APP SDK [1], which AMD has provided to highlight
efficient use of the AMD Southern Islands family of GPUs.
For each application, we can change program inputs to spec-
ify the workload intensity. We chose large input sizes for our
benchmarks to show how current GPUs are unable to support the growth in data set sizes we expect to see in future
workloads. We made sure that the selected subset of applications includes a diversity of workload features and varying
intensities for creating a range of traffic behaviors. Table 6
lists the set of applications selected from this benchmark
suite, and includes a brief description for each application.
The power for the electrical network is estimated using a
detailed transistor-level circuit model. The power dissipated
by the network is calculated based on the physical layout,
the flow control mechanism and network traffic workloads.
For the 32-CU GPU, the wires in the crossbar are designed
in the global metal layers using pipelining and repeater insertion in 28 nm technology [2]. All inter-router channels in the 2D mesh are implemented in semi-global metal layers with standard repeater wires. We use the 14 nm technology node [2] for the 128-CU GPU systems. The power
dissipated in the SRAM array and crossbar of the routers is
calculated using the methodology described in [19] and [36],
respectively. We use the photonic technology described in
Sections 2 and 4.3 to calculate laser power, Tx/Rx power
and thermal tuning power in the NoC.
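The sketch below shows one way to combine the four photonic power components reported in the results (laser, thermal tuning, fixed, and traffic-dependent Tx/Rx energy) and how the ED2P metric folds power and execution time together; the numbers are hypothetical and the code is only a conceptual illustration of the bookkeeping, not our evaluation scripts.

```python
def photonic_noc_power_w(laser_w, thermal_w, fixed_w, dynamic_j_per_bit, traffic_bps):
    """Total photonic NoC power: static components plus traffic-dependent Tx/Rx energy."""
    dynamic_w = dynamic_j_per_bit * traffic_bps
    return laser_w + thermal_w + fixed_w + dynamic_w

def ed2p(power_w, exec_time_s):
    """Energy-delay^2 product: energy * delay^2 = (P * T) * T^2."""
    energy_j = power_w * exec_time_s
    return energy_j * exec_time_s ** 2

# Hypothetical values, only to show how the metric combines power and runtime.
p = photonic_noc_power_w(laser_w=2.0, thermal_w=1.0, fixed_w=0.5,
                         dynamic_j_per_bit=40e-15, traffic_bps=1e12)
print(p, ed2p(p, exec_time_s=0.010))
```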
6. EXPERIMENTAL RESULTS

6.1 Electrical vs. Photonic Designs of Crossbar Topologies for a 32-CU GPU
As described in Section 3.1, a 64B cache line must immediately follow its 8B control message. This means the biggest message in the network is 72B. For any channel with a width smaller than 72B (e.g., 32B), the control message is transferred in a separate cycle first, and the cache line is packetized and transferred immediately in the following cycles (2 cycles for 32B). In our evaluation, we considered MWSR and SWMR electrical buses with channel widths of 16, 32 (divisors of 64), and 72B (packing the control message and the cache line into one message) for the 32-CU GPU system. For the photonic NoC, due to the bandwidth density advantage of silicon-photonic link technology versus electrical link technology, we considered SWMR and MWSR buses with a channel width of 72B.
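The cycle counts quoted above follow from a simple serialization rule, sketched below for the three electrical channel widths we evaluate (our own illustration of the packetization, assuming one channel-width flit per cycle).

```python
import math

def cycles_per_message(channel_bytes, payload_bytes=64, control_bytes=8):
    """Cycles to send one request on a bus of the given width.

    A control message always travels first; if the channel is at least
    72 B wide, the control message and the cache line are packed into a
    single transfer, otherwise the 64 B line follows in extra cycles.
    """
    if channel_bytes >= control_bytes + payload_bytes:
        return 1                                   # 72 B packed transfer
    control_cycles = 1
    data_cycles = math.ceil(payload_bytes / channel_bytes)
    return control_cycles + data_cycles

for width in (16, 32, 72):
    print(width, cycles_per_message(width))        # 16 -> 5, 32 -> 3, 72 -> 1
```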
In Figure 5a, we compare the electrical and photonic MWSR
and SWMR crossbars in terms of performance (top plot)
and network traffic (bottom plot). To calculate the performance improvement (i.e., speed up) for each application,
we divided the application execution time when using the
MWSR electrical crossbar NoC with a 32B channel width
(E-MWSR-32) by the execution time for the other NoC designs. As shown in the corresponding plot, the photonic
SWMR crossbar (P-SWMR-72) achieves an average 2.60×
speedup, and the photonic MWSR crossbar (P-MWSR-72)
achieves an average 2.64× speedup compared to E-MWSR-32. A similar performance improvement is observed if we compare P-MWSR-72 and P-SWMR-72 with E-SWMR-72. The magnitude of the speedups directly correlates with the network bandwidth required by the applications (see the bottom plot in Figure 5a). Applications with a higher number of memory transactions (i.e., larger offered bandwidth) exhibit larger benefits (e.g., 6.3× for mtwist and 5.2× for conv). The
benefits we observe are a result of the low latency nature of
photonic NoCs. As can be seen in the figure, increasing the
channel bandwidth of the electrical crossbars results in very
small performance improvements in memory-intensive applications (such as conv), since the transmission latency of the electrical channel (4 cycles on average) masks any benefit from reducing the serialization latency by increasing the bandwidth.

Table 6: The workloads selected from the AMD APP SDK.

BS: Finds the position of a given element in a sorted array of size 1048576.
CONV: Convolution filtering on each element of an input matrix of 4096 × 4096 with a blur mask of 5 × 5.
DCT: Discrete Cosine Transform on an input matrix of size 8192 × 8192.
DWTHAAR: One-dimensional Haar wavelet transform on a one-dimensional matrix of size 8388608.
LARGSCAN (LS): Performs a scan on a large default array of 134217728 elements.
MTWIST: SIMD-oriented Fast Mersenne Twister (SFMT) generates 4194304 random numbers and uses Box-Muller to convert these numbers to Gaussian random numbers.
RED: Performs reduction by dividing an array of 33554432 elements into blocks, calculating the sum of each block, and then calculating the sum of the block sums.
RG: Performs a recursive Gaussian filter on an image of size 1536 × 1536.
SOBEL: Performs the Sobel edge detection algorithm on an image of size 1536 × 1536.
URNG: Generates uniform noise on an input image of size 1536 × 1536.

Figure 5: Evaluation of a current 32-CU GPU with electrical and photonic MWSR and SWMR crossbar NoCs: (a) speedup and offered bandwidth results for a 32-CU GPU with different crossbars; (b) breakdown of the total power and energy-delay2 product of the NoCs in a 32-CU GPU. Ticks on the x-axis follow the pattern T-X-N, where T refers to the type of technology (electrical = E; photonic = P), X is the type of crossbar (MWSR or SWMR), and N refers to the channel width in bytes. For electrical NoCs, three different link bandwidths (from left to right: 16, 32 and 72) are considered. The performance speedup and ED2P are normalized to an electrical MWSR crossbar NoC with a 32-byte channel width (E-MWSR-32).
Compute intensive applications (such as urng), and applications with a small number of thread blocks (such as bs),
can store their entire dataset in L1 data caches (generating very few memory accesses to the L2 banks). Therefore,
these applications do not benefit from high bandwidth NoC
designs since the NoC is not utilized.
Figure 5b compares the E-MWSR, E-SWMR, P-MWSR
and P-SWMR crossbars using NoC power (see top plot) and
the energy-delay2 -product (ED2 P) metrics (see bottom plot)
across the different benchmarks. For all benchmarks, the
static power of electrical crossbars is dominant and increases
from 2.1 Watts to 19 Watts when the channel width is increased from 8 to 72 bytes. A 72B link requires 72 × 8
wires, and each wire is 46.71 mm long. We have 46 buses in
the network. This means 26,496 wires are required. If we
assume 4 segments comprise a wire, running at a frequency
of 1 GHz, the wire consumes roughly 180 fJ/bit in 28 nm
technology. This leads to roughly 19 W of power dissipation for the NoC, which is too high; this is why commercial GPUs do not use electrical networks with 72B-wide links.
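Interpreting the 180 fJ/bit figure as a per-segment energy with every wire driven each cycle, the quoted 19 W can be reproduced as follows (our own arithmetic based on the numbers above):

```python
buses = 46
wires_per_bus = 72 * 8                 # 72-byte channel = 576 wires per bus
total_wires = buses * wires_per_bus    # 26,496 wires
segments_per_wire = 4
energy_per_bit_per_segment = 180e-15   # 180 fJ/bit in 28 nm
toggle_rate_hz = 1e9                   # 1 GHz, one bit per wire per cycle

power_w = total_wires * segments_per_wire * energy_per_bit_per_segment * toggle_rate_hz
print(total_wires, round(power_w, 1))  # 26496 wires, ~19.1 W
```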
The P-MWSR-72 and P-SWMR-72 crossbars consume 29%
lower power on average as compared to E-MWSR-72 and
E-SWMR-72, but they consume 32% more power on average as compared to E-MWSR-32 and E-SWMR-32. Note
that the static power values for both the P-MWSR and P-SWMR crossbars (fixed, thermal and laser components) are
very similar. The reason for their similar power dissipation
is that both P-MWSR and P-SWMR crossbars utilize the
same number of photonic rings and waveguides (see Table 3).
To reduce this large fraction of static power, a NoC layout
using concentration can be adopted. For instance, access
points can be used to attach 4 vL1s and 1 sL1 through a
single link to the bus. By adding access points, we can reduce the total number of rings by a factor of 4.78 (30,508
vs. 6,380) and the number of waveguides by 3.5 (63 vs. 18).
As explained in Section 4.4, we apply this technique to reduce NoC power for the scaled-up GPU system (128 CUs) with
a hybrid NoC.
In any case, the bottom plot of Figure 5b still shows
that the photonic NoCs achieve the lowest ED2 P for all the
benchmarks, except urng. The urng application is insensitive to NoC channel bandwidths, so the electrical crossbar
NoC with the smallest channel width reports the best ED2 P,
as its power is lower than that of its photonic counterparts, and its performance equals the performance of the crossbar with the largest bus bandwidth. Note that the ED2P is significantly improved when considering our photonic NoCs (the average reduction across all benchmarks in terms of ED2P is 66.5% for P-SWMR and 67.7% for P-MWSR, when compared to E-MWSR-32).

Figure 6: Evaluation of a 32-CU GPU with electrical and photonic hybrid NoCs: (a) speedup and offered bandwidth of benchmarks running on 32-CU GPU systems with a hybrid NoC topology; (b) breakdown of the total power and energy-delay2 product of the hybrid NoCs in a 32-CU GPU. The speedup and ED2P are normalized to E-MWSR-32 (not shown for clarity). Labels on the x-axis follow the pattern T-N, where T refers to the type of technology (electrical = E; photonic = P) and N refers to the channel width in bytes (8, 16, 32 and 72).
6.2 Photonic vs. Electrical Design of a Hybrid Topology for 32-CU GPUs
Figure 6 presents results of the electrical and photonic
implementations of our proposed hybrid design (see Section 4.3). By comparing Figures 6 and 5 (results in both
Figures are normalized to E-MWSR-32), we can observe
that our electrical implementation of the hybrid design, E-HYB-72, does not show any performance improvement over
its crossbar counterparts, E-MWSR-32 and E-MWSR-72.
But this implementation reduces the power consumption
of the electrical crossbar on average by 68% (due to less
hardware), while providing the same performance. On the
other hand, our photonic hybrid design, P-HYB-72 provides
higher speedup than E-MWSR-32 and E-MWSR-72 (2.7×
for both). Increasing the bandwidth of the electrical crossbar did not affect the GPU's performance due to the high electrical
link latency, as shown in Section 6.1.
The P-HYB-72 consumes 51% less power than both E-MWSR-72 and E-SWMR-72. It also has a marginally higher power dissipation (2%) than E-MWSR-32 and E-SWMR-32. This makes our hybrid design a very good contender
compared to any electrical crossbar design.
P-HYB-72 achieves higher performance than P-MWSR-72 (1.05×) and P-SWMR-72 (1.03×) on average by slightly reducing the latency (see Section 4.3 for details), and achieves 32% and 33% lower power dissipation on average by using fewer hardware resources. Therefore, P-HYB-72 achieves a 34% and 30% reduction in ED2P in comparison to P-MWSR-72 and P-SWMR-72, respectively.
The comparison between the photonic and electrical hybrid designs (Figure 6a), in terms of performance (top plot) and network bandwidth (bottom plot), reveals the clear benefits of using low-latency photonic technology. P-HYB-72 exhibits, on average, a 2.5× speedup and offers 0.6 Tbit/s more bandwidth versus the E-HYB with the highest bandwidth (E-HYB-72).
Figure 6b compares the electrical hybrid (E-HYB) and
photonic hybrid (P-HYB) designs in terms of power (top
plot) and the ED2 P metric (bottom plot). P-HYB-72 reports 51% higher power than the E-HYB-72 NoC due to
higher static power consumption (laser, thermal tuning and
fixed power). One solution is to adopt a run-time management mechanism based on the workload on the compute units.
Using run-time management, we can deactivate photonic
links when they are not being used [8]. We leave run-time
management mechanisms as an area for future work.
Nonetheless, when analyzing the ED2P metric in the bottom plot of Figure 6b, we can see that our P-HYB-72 produces, on average, 82% lower ED2P as compared to the E-HYB-72 NoC.
6.3 Electrical Mesh vs. Photonic Hybrid Designs for 128-CU GPUs
In this section, we consider future GPUs with 128 CUs,
to study the scalability of electrical and photonic NoC designs presented in Section 4. For comparison, we consider
an electrical mesh design and photonic hybrid design with
channel widths of 16, 32 and 72 bytes. Here, all comparison metrics are normalized to an electrical 2D-mesh with
16-byte channel widths (E-MESH-16).
Figure 7a compares the performance (top plot) and bandwidth (bottom plot) of the E-MESH design with our P-HYB, for the 128-CU GPU, assuming the same set of channel widths as in the previous evaluation. The performance speedup reported for P-HYB-72 is 82% better than the speedup for E-MESH-72 (with a maximum 3.43× speedup for mtwist). The
E-MESH-72 offers increased bandwidth up to 2.81 Tbit/s
(dct), whereas P-HYB-72 achieves up to 7.28 Tbit/s bandwidth (mtwist).
Figure 7b compares E-MESH and P-HYB in terms of
power (top plot) and ED2 P (bottom plot). P-HYB NoCs
generally consume more power than an E-MESH counterpart. This leads to 19% higher ED2 P for P-HYB-72, in
comparison to E-MESH-72.
We can reduce the channel bandwidth to reduce the power dissipation of the P-HYB. Reducing the P-HYB channel bandwidth from 72 bytes to 32 and 16 bytes reduces this power dissipation by 39% and 60%, respectively.
The average speedup observed for all the applications in our
study for P-HYB-32 is 43%, and for P-HYB-16 is 17%, when
compared to E-MESH-72. By reducing the channel width to 32B or 16B, we reduce the power consumption, and therefore the ED2P, of our hybrid designs. As a result, the ED2P for P-HYB-32 and P-HYB-16 is reduced by 3% (for both) when compared to E-MESH-72, which means both P-HYB-32 and P-HYB-16 are marginally better than the 2D-mesh.

One important feature of the hybrid design is its effect on memory-intensive applications. Applications that heavily utilize the memory hierarchy (such as conv, dct, dwthaar, mtwist, largescan and sobel) favor the photonic hybrid design. For these memory-intensive applications, P-HYB-16 enjoys, on average, a 26% performance speedup and a 13% reduction, on average, in ED2P as compared to E-MESH-72.

Since these applications scale well with the number of GPU units, P-HYB-32 and P-HYB-72 provide significant reductions in ED2P against E-MESH-72 (34% and 40%, respectively). The reported speedups for P-HYB-32 and P-HYB-72 for these memory-intensive applications are, on average, 1.6× and 2.2× in comparison to E-MESH-72, respectively.

Our results clearly show that for future GPU systems that will execute memory-intensive workloads, a photonic hybrid design provides the best ED2P solution.

Figure 7: Evaluation of 128-CU GPUs with an electrical 2D-mesh NoC and a photonic hybrid NoC: (a) speedup and bandwidth of benchmarks running on a 128-CU GPU system with varying hybrid NoCs; (b) breakdown of the total power and energy-delay2 product of the GPU's NoC. Labels along the x-axis follow the pattern T-N, where T refers to the type of technology and topology (electrical 2D-mesh = E; photonic hybrid = P) and N refers to the channel width in bytes (16, 32 and 72). Speedup and ED2P results are normalized to E-16.

7. RELATED WORK

A large amount of work has been done in the area of NoC design for manycore architectures, with the goal of providing energy-efficient on-chip communication. The maturity of electrical NoCs for manycore systems is evidenced by the availability of commercial designs (the 80-tile, sub-100W TeraFLOPS processor introduced by Vangal et al. [33], or Tilera's 64-core TILE64 chip [37]). On the photonic NoC front, there are no working prototypes, but researchers have explored the entire spectrum of network topologies – from low-radix high-diameter mesh/torus topologies [10, 17, 30] to medium-radix medium-diameter butterfly/crossbar topologies [15, 16, 27] to high-radix low-diameter bus/crossbar topologies [29, 34] to multilayer topologies [24, 32, 38].

The area of GPU NoCs has not been widely explored. Bakhoda et al. [6] exploit the many-to-few traffic patterns in manycore accelerators by alternating full routers in congested areas with half routers. In a related work [5], the same authors evaluate GPU performance degradation due to NoC router latencies. Both [5] and our work corroborate the motivation for low-latency networks to mitigate their impact on GPU performance.

Lee et al. [18] identify a novel trade-off in CPU-GPU heterogeneous systems concerning the NoC design. CPUs run highly latency-sensitive threads, while coexisting GPUs demand high bandwidth. They thoroughly survey the impact of primary network design parameters on CPU-GPU system performance, including routing algorithms, cache partitioning, arbitration policies, link heterogeneity, and node placement.

In [14], Goswami et al. explore a 3D-stacked GPU microarchitecture that uses an optical on-chip crossbar to connect shader cores and memory controllers in the GPU memory hierarchy. The main difference between our work and [14] is that we present our own tailored monolithically-integrated photonic NoCs for communication between the L1 and L2 caches and evaluate them against different electrical designs for current and future scaled-up GPUs.
8. CONCLUSIONS
In this paper, we combine our knowledge of silicon-photonic link technology and GPU architecture to present a GPU-specific photonic hybrid NoC (used for communication between L1 and L2) that is more energy efficient than the
electrical NoC. Our proposed hybrid design uses MWSR for
L1-to-L2 communication and SWMR for L2-to-L1 communication. Our simulation-based analysis shows that applications that are bandwidth sensitive can take advantage of a
photonic hybrid NoC to achieve better performance, while
achieving an energy-delay2 value that is lower than the traditional electrical NoC. In the AMD Southern Islands GPU
chip with 32 CUs, our proposed photonic hybrid NoC increases application performance by up to 6× (2.7× on average) while reducing ED2 P by up to 99% (79% on average). We also evaluated the scalability of the photonic hybrid NoC to a GPU system with 128 CUs. For the 128-CU
GPU system running memory intensive applications, we can
achieve up to 3.43× (2.2× on average) performance speedup,
while reducing ED2P by up to 99% (82% on average), compared to an electrical mesh NoC. Moving forward, we plan to
explore techniques for run-time power management of photonic NoCs to extend this improvement to all applications.
Here, depending on the application requirements, the photonic NoC bandwidth (and hence the photonic NoC power)
will be appropriately scaled up/down to achieve energy efficient operation in the photonic NoC, as well as in the system
as a whole for the entire spectrum of applications.
9. ACKNOWLEDGMENT
This work was supported in part by DARPA Contract No.
W911NF-12-1-0211 and NSF CISE grant CNS-1319501.
10. REFERENCES
[1] AMD Accelerated Parallel Processing (APP) Software
Development Kit (SDK).
http://developer.amd.com/sdks/amdappsdk/.
[2] Predictive Technology Model. http://ptm.asu.edu/.
[3] AMD Graphics Cores Next (GCN) Architecture, June
2012. White paper.
[4] NVIDIA’s Next Generation CUDA Compute Architecture:
Kepler GK110, 2012.
http://www.nvidia.com/content/PDF/kepler/NVIDIAKepler-GK110-Architecture-Whitepaper.pdf.
[5] A. Bakhoda et al. Analyzing CUDA Workloads Using a
Detailed GPU Simulator. In Proc. Int’l Symposium on
Performance Analysis of Systems and Software, April 2009.
[6] A. Bakhoda, J. Kim, and T. M. Aamodt. On-Chip Network
Design Considerations for Compute Accelerators. In Proc.
of the 19th Int’l Conference on Parallel Architectures and
Compilation Techniques, Sept. 2010.
[7] C. Batten et al. Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics. In Proc. of the 16th IEEE Symposium on High Performance Interconnects (HOTI), 2008.
[8] C. Chen and A. Joshi. Runtime Management of Laser Power in Silicon-Photonic Multibus NoC Architecture. IEEE Journal of Selected Topics in Quantum Electronics, 19(2):338–350, 2013.
[9] X. Chen et al. Adaptive Cache Management for Energy-Efficient GPU Computing. In Proc. of the 47th Int'l Symposium on Microarchitecture, Dec. 2014.
[10] M. J. Cianchetti, J. C. Kerekes, and D. H. Albonesi.
Phastlane: A Rapid Transit Optical Routing Network.
SIGARCH Computer Architecture News, 37(3), June 2009.
[11] W. Dally. Virtual-Channel Flow Control. IEEE
Transactions on Parallel and Distributed Systems, 3(2),
March 1992.
[12] B. R. Gaster, L. W. Howes, D. R. Kaeli, P. Mistry, and
D. Schaa. Heterogeneous Computing with OpenCL Revised OpenCL 1.2 Edition, volume 2. Morgan Kaufmann,
2013.
[13] M. Georgas et al. A Monolithically-Integrated Optical
Receiver in Standard 45-nm SOI. IEEE Journal of
Solid-State Circuits, 47, July 2012.
[14] N. Goswami, Z. Li, R. Shankar, and T. Li. Exploring silicon
nanophotonics in throughput architecture. Design & Test,
IEEE, 31(5):18–27, 2014.
[15] H. Gu, J. Xu, and W. Zhang. A low-power fat tree-based
optical network-on-chip for multiprocessor system-on-chip.
In Proceedings of the Conference on Design, Automation
and Test in Europe, DATE ’09, pages 3–8, 2009.
[16] A. Joshi et al. Silicon-Photonic Clos Networks for Global
On-Chip Communication. In 3rd AMC/IEEE Int’l
Symposium on Networks on Chip, May 2009.
[17] N. Kirman and J. F. Martínez. A Power-efficient
All-Optical On-Chip Interconnect Using Wavelength-Based
Oblivious Routing. In Proc. of the 15th Int’l Conference on
Architectural Support for Programming Languages and
Operating Systems, Mar. 2010.
[18] J. Lee et al. Design Space Exploration of On-chip Ring
Interconnection for a CPU-GPU Architecture. Journal of
Parallel and Distributed Computing, 73(12), Dec. 2012.
[19] X. Liang, K. Turgay, and D. Brooks. Architectural power
models for sram and cam structures based on hybrid
analytical/empirical techniques. In Proc. of the Int’l
Conference on Computer Aided Design, 2007.
[20] E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and
Computing Architecture. IEEE Micro, 28(2), March 2008.
[21] L. Mah. The AMD GCN Architecture, a Crash Course.
AMD Fusion Developer Summit, 2013.
[22] M. Mantor. AMD HD7970 Graphics Core Next (GCN) Architecture. In Hot Chips: A Symposium on High Performance Chips, 2012.
[23] M. Mantor and M. Houston. AMD Graphics Core Next:
Low-Power High-Performance Graphics and Parallel
Compute. AMD Fusion Developer Summit, 2011.
[24] R. Morris, A. Kodi, and A. Louri. Dynamic Reconfiguration
of 3D Photonic Networks-on-Chip for Maximizing
Performance and Improving Fault Tolerance. In Proc. of
the 45th Int’l Symposium on Microarchitecture, Dec. 2012.
[25] B. Moss et al. A 1.23 pJ/b 2.5 Gb/s Monolithically Integrated Optical Carrier-Injection Ring Modulator and All-Digital Driver Circuit in Commercial 45 nm SOI. In Proc. of the IEEE Int'l Solid-State Circuits Conference (ISSCC), pages 126–127, Feb. 2013.
[26] J. S. Orcutt et al. Nanophotonic Integration in State-of-the-Art CMOS Foundries. Optics Express, 19(3):2335–2346, Jan. 2011.
[27] Y. Pan et al. Firefly: Illuminating Future Network-on-chip
with Nanophotonics. SIGARCH Computuer Architecture
News, 37(3), June 2009.
[28] S. Park et al. Approaching the Theoretical Limits of a Mesh
NoC with a 16-Node Chip Prototype in 45nm SOI. In Proc.
of the 49th Design Automation Conference, June 2012.
[29] J. Psota et al. ATAC: Improving Performance and
Programmability with On-Chip Optical Networks. In Proc.
Int’l Symposium on Circuits and Systems, 2010.
[30] A. Shacham, K. Bergman, and L. P. Carloni. On the design
of a photonic network-on-chip. In Proceedings of the First
International Symposium on Networks-on-Chip, NOCS ’07,
pages 53–64, 2007.
[31] R. Ubal et al. Multi2Sim: A Simulation Framework for
CPU-GPU Computing. In Proc. of the 21st Int’l
Conference on Parallel Architectures and Compilation
Techniques, Sept. 2012.
[32] A. N. Udipi et al. Combining Memory and a Controller
with Photonics Through 3D-Stacking to Enable Scalable
and Energy-Efficient Systems. In Proc. of the 38th Int’l
Symposium on Computer Architecture, June 2011.
[33] S. R. Vangal et al. An 80-Tile Sub-100W TeraFLOPS
Processor in 65nm CMOS. IEEE Journal of Solid-State
Circuits, 43(1), Jan. 2008.
[34] D. Vantrease et al. Corona: System Implications of
Emerging Nanophotonic Technology. In Proc. of the 35th
Int’l Symposium on Computer Architecture, June 2008.
[35] D. Vantrease et al. Light speed arbitration and flow control
for nanophotonic interconnects. In Microarchitecture, 2009.
MICRO-42. 42nd Annual IEEE/ACM International
Symposium on, pages 304–315. IEEE, 2009.
[36] H. Wang, L.-S. Peh, and S. Malik. Power-Driven Design of
Router Microarchitectures in On-Chip Networks. In Proc.
of the 36th Int’l Symposium on Microarchitecture, 2003.
[37] D. Wentzlaff et al. On-Chip Interconnection Architecture of
the Tile Processor. IEEE Micro, 27(5), Sept. 2007.
[38] X. Zhang and A. Louri. A Multilayer Nanophotonic
Interconnection Network for On-Chip Many-Core
Communications. In Proc. of the 47th Design Automation
Conference, June 2010.