Leveraging Silicon-Photonic NoC for Designing Scalable GPUs

Amir Kavyan Ziabari†, José L. Abellán‡, Rafael Ubal†, Chao Chen§, Ajay Joshi¶, David Kaeli†
† Electrical and Computer Engineering Dept., Northeastern University: {aziabari,ubal,kaeli}@ece.neu.edu
‡ Computer Science Dept., Universidad Católica San Antonio de Murcia: jlabellan@ucam.edu
§ Digital Networking Group Dept., Freescale Semiconductor, Inc.: chen9810@gmail.com
¶ Electrical and Computer Engineering Dept., Boston University: joshi@bu.edu

ABSTRACT
Silicon-photonic link technology promises to satisfy the growing need for high-bandwidth, low-latency, and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been studied extensively for future manycore systems, their use in massively-threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC that connects the different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD APP SDK on the Multi2Sim GPU simulator shows that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems. To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is an AMD Southern Islands GPU with 32 compute units (CUs), which we compare against our proposed hybrid silicon-photonic NoC. The proposed photonic hybrid NoC increases performance by up to 6× (2.7× on average) and reduces the energy-delay² product (ED²P) by up to 99% (79% on average) compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology, since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC improves performance by up to 1.9× (43% on average) and reduces ED²P by up to 62% (3% on average) compared to the best-performing mesh design.

Categories and Subject Descriptors
B.4.3 [Input/Output and Data Communications]: Interconnections (subsystems): fiber optics, physical structures, topology; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures: Single-Instruction Stream, Multiple-Data Stream Processors (SIMD)

Keywords
Network-on-Chip, Photonics Technology, GPUs

ICS'15, June 8–11, 2015, Newport Beach, CA, USA. Copyright 2015 ACM 978-1-4503-3559-1/15/06. http://dx.doi.org/10.1145/2751205.2751229

1. INTRODUCTION
A little over a decade ago, GPUs were fixed-function processors built around a pipeline dedicated to rendering 3-D graphics.
In the past decade, as the potential for GPUs to provide massive compute parallelism became apparent, the software community developed programming environments to leverage these massively-parallel architectures. Vendors facilitated this move toward massive parallelism by incorporating programmable graphics pipelines. Parallel programming frameworks such as OpenCL [12] were introduced, and tools were developed to explore GPU architectures. NVIDIA and AMD, two of the leading graphics vendors today, tailor their GPU designs for general-purpose high-performance computing by providing higher compute throughput and memory bandwidth [3, 4]. Contemporary GPUs are composed of tens of compute units (CUs). CUs are composed of separate scalar and vector pipelines designed to execute a massive number of concurrent threads (e.g., 2048 threads on the AMD Radeon HD 7970) in a single-instruction, multiple-data (SIMD) fashion [20].

Multi-threaded applications running on multi-core CPUs exhibit coherence traffic and data sharing, which leads to significant L1-to-L1 traffic (between cores) on the interconnection network. GPU applications, on the other hand, are executed as thread blocks (workgroups) on multiple compute units, with limited communication between different workgroups. Since GPU applications generally exhibit little L1 temporal locality [9], the communication between the L1 and L2 caches becomes the main source of traffic on the GPU's on-chip interconnection network. As the number of CUs increases in each future generation of GPU systems, latency in the on-chip interconnection network becomes a major performance bottleneck on the GPU [6].

Figure 1: Potential improvements for various workloads running on a 32-CU GPU, with an ideal crossbar NoC against an electrical NoC (execution time in ms and bandwidth in Gb/s).

To evaluate the potential benefits of employing a low-latency Network-on-Chip (NoC) for GPUs, we evaluate a 32-compute-unit GPU system (see a more detailed description in Section 3). We compare a system with an electrical Multiple-Write-Single-Read crossbar network against a system deploying an ideal NoC with a fixed 3-cycle latency between source and destination nodes. To evaluate the GPU system performance when using these NoCs, we utilize applications from the AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK) [1]. The ideal NoC offers up to a 90% (51% on average) reduction in execution time, and up to 10.3× (3.38× on average) higher bandwidth (see Figure 1). These results clearly motivate the need to explore novel link technologies and architectures to design low-latency, high-bandwidth NoCs that can significantly improve performance in current and future GPUs.

Silicon-photonic link technology has been proposed as a potential replacement for electrical link technology in the design of NoC architectures for future manycore systems [10, 16, 17, 27, 34]. Silicon-photonic links have the potential to provide significantly lower latency and an order of magnitude higher bandwidth density for global communication in comparison to electrical links. We take advantage of photonic link technology in designing a NoC for GPU systems.
The main contributions of the paper are as follows:

• We explore the design of Multiple-Write, Single-Read (MWSR) and Single-Write, Multiple-Read (SWMR) electrical and photonic crossbar NoCs for communication between the L1 and L2 caches of a GPU with 32 CUs.

• We propose the design of a GPU-specific hybrid NoC with a reduced channel count, based on the traffic patterns observed in the communication between L1 and L2 cache units in GPU systems. We compare the electrical and silicon-photonic implementations of this hybrid NoC architecture in a GPU with 32 CUs running standard GPU benchmarks.

• We evaluate the scalability of a silicon-photonic NoC on a larger GPU system with 128 CUs, comparing our proposed hybrid photonic crossbar against a competitive electrical 2D-mesh NoC.

Figure 2: Photonic link components: two point-to-point photonic links implemented with Wavelength-Division Multiplexing (WDM). The figure shows the laser source, coupler and taper, ring modulators and their drivers, the waveguide, ring filters (at the λ1 and λ2 resonances), photodetectors, and receivers.

2. SILICON-PHOTONIC TECHNOLOGY
Figure 2 shows a generic silicon-photonic channel with two links multiplexed onto the same waveguide. These two silicon-photonic links are powered by a laser source. The output of the laser source is coupled into the planar waveguides using vertical grating couplers. Each light wave is modulated by the respective ring modulator, which is controlled by a modulator driver. During this modulation step, the data is converted from the electrical domain to the photonic domain. The modulated light waves propagate along the waveguide and can pass through zero or more ring filters. On the receiver side, the ring filter whose resonant wavelength matches the wavelength of the light wave "drops" the light wave onto a photodetector. The resulting photodetector current is sensed by an electrical receiver. At this stage, the data is converted back from the photonic domain into the electrical domain.

Each Electrical-to-Optical (E/O) and Optical-to-Electrical (O/E) conversion stage introduces a 1-cycle latency. The time-of-flight for the data on the photonic link is 1 cycle. The serialization latency, on the other hand, depends on the bandwidth of the photonic link. Contention and queuing delays are other contributing factors to the overall latency of transmissions. We consider all these forms of latency in our simulation framework.

For our silicon-photonic NoCs, we consider monolithically-integrated links [13, 25, 26]. Monolithic integration of silicon-photonic links uses the existing layers in the process technology to design photonic devices. Working prototypes of links designed using monolithic integration have been presented in previous work [13, 25, 26]. We consider a next-generation link design [16] that uses double-ring filters with a 4 THz free-spectral range. For our analysis, we project the E-O-E conversion cost, thermal tuning, and photonic device losses for this next-generation link using the measurement results reported in previous work [13, 25, 26]. This design supports up to 128 wavelengths (λ) modulated at 10 Gbps on each waveguide (64λ in each direction, which are interleaved to alleviate filter roll-off requirements and crosstalk). A non-linearity limit of 30 mW at 1 dB loss is assumed for the waveguides. The waveguides are single mode and have a pitch of 4 µm to minimize crosstalk between neighboring waveguides. The modulator ring and filter ring diameters are ∼10 µm (further details in Tables 4 and 5). The silicon-photonic links are driven by an off-chip laser source.
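To make this latency accounting concrete, the following minimal sketch composes the per-message latency of a photonic bus transaction from the components just described (1-cycle E/O conversion, 1-cycle time of flight, 1-cycle O/E conversion, plus width-dependent serialization). It is an illustrative model only; the queuing/contention term is an input supplied by the caller, and the exact composition used inside our simulator is not implied by this sketch.

```python
import math

def photonic_message_latency_cycles(message_bytes: int,
                                    channel_width_bytes: int,
                                    queuing_cycles: int = 0) -> int:
    """Latency of one message over a photonic bus, in NoC clock cycles.

    Components follow Section 2: one cycle for E/O conversion, one cycle
    of time of flight on the waveguide, one cycle for O/E conversion,
    plus serialization when the message is wider than the channel.
    Contention/queuing is an input supplied by the caller, not modeled.
    """
    e_to_o = 1
    time_of_flight = 1
    o_to_e = 1
    flits = math.ceil(message_bytes / channel_width_bytes)
    serialization = flits - 1          # extra cycles beyond the first flit
    return e_to_o + time_of_flight + o_to_e + serialization + queuing_cycles

# A message that fits in one channel-width flit sees a 3-cycle zero-load latency.
assert photonic_message_latency_cycles(72, 72) == 3
assert photonic_message_latency_cycles(72, 32) == 5   # two extra serialization cycles
```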
3. TARGET GPU SYSTEMS

3.1 32-CU GPU Architecture
To explore the NoC design in the memory hierarchy of a GPU, we leverage an existing simulation model of a high-performance GPU. We target an AMD Radeon HD 7970, a state-of-the-art GPU of AMD's Southern Islands family. The CUs are clocked at 925 MHz, are manufactured in a 28 nm CMOS technology node, and are tailored for general-purpose workloads [3]. The die size for this GPU is reported to be 352 mm² and its maximum power draw is 250 W [21]. The layout of the AMD Radeon HD 7970 GPU chip is shown in Figure 3a [22]. Table 1 outlines the details of the different GPU components.

Table 1: AMD Radeon HD 7970 GPU specification.
Processor Cores:
  Fabrication process: 28 nm
  Clock Frequency: 925 MHz
  Compute Units: 32
  MSHR per Compute Unit: 16
  SIMD Width: 16
  Threads per Core: 256
  Wavefront Size: 64
Memory System:
  Size of L1 Vector Cache: 16 KB
  Size of L1 Scalar Cache: 16 KB
  L2 Caches / Mem. Cntrls.: 6
  Block Size: 64 B
  Size of L2 Cache: 128 KB
  Memory Page Size: 4 KB
  LDS Size: 64 KB

The Radeon HD 7970 has 32 CUs. Each CU has a shared front-end and multiple execution units, where instructions are classified into computational categories and scheduled appropriately to special-purpose execution units. Four of the execution units are SIMD units, each formed of 16 SIMD lanes, which run integer and floating-point operations in parallel. Each CU is equipped with an L1 vector data cache unit (vL1), known as a vector cache, which provides the data for vector-memory operations. The vector cache is 16 KB and 4-way set associative, with a 64 B line size. The vector cache is coherent with the L2 and the other cache units in the GPU system, and has a write-through, write-allocate design. Scalar operations (i.e., computations shared by all threads running as one SIMD unit) use a dedicated set of L1 scalar data cache units (sL1s), known as scalar caches, each shared by 4 CUs. The scalar cache units are also connected to the L2 cache units through the NoC [3].

The L2 cache in the Southern Islands architecture is physically partitioned into banks that are coupled with separate memory controllers. Like the L2 cache units, the memory controllers are partitioned across different memory address spaces, and each L2 cache unit is associated with exactly one memory controller. Sequential 4 KB memory pages are distributed across the L2 banks, which reduces the load on any single memory controller while maximizing spatial locality [3].

Two types of messages are transmitted between the L1 and L2 cache units in the memory hierarchy. The first type is an 8-byte control message: cache requests such as reads, writes, and invalidations generate and transmit control messages, which contain information about the destination cache unit. The second type is a 64-byte cache line. Since a cache line does not contain any information about the request type or the unit that should receive it, it must always follow a control message that carries this information.

3.2 128-CU GPU Architecture
We consider a forward-looking GPU design that uses the 32-CU Radeon HD 7970 GPU as a building block and increases the number of CUs. In this new architecture, we quadruple the number of CUs to 128 while keeping the same CU design as in the 32-CU GPU. The 128-CU GPU also quadruples the number of memory components in the HD 7970 GPU chip: 128 L1 vector caches, 32 L1 scalar caches, and 24 shared L2 caches. We assume that the 128-CU GPU is designed using 14 nm CMOS technology, which results in a reasonable floorplan area of 402.2 mm².
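To make the memory-side organization of Section 3.1 concrete, the sketch below pairs the two message types with a page-granularity bank-interleaving function. The modulo mapping of 4 KB pages to L2 banks is our own reading of "sequential 4 KB memory pages are distributed across L2 banks"; the exact hardware mapping is not publicly documented, so treat it as an assumption (the default of 6 banks matches the 32-CU GPU, 24 would be used for the 128-CU configuration).

```python
PAGE_SIZE_BYTES = 4 * 1024     # sequential 4 KB pages are spread over L2 banks
CONTROL_MSG_BYTES = 8          # carries request type and destination unit
CACHE_LINE_BYTES = 64          # data payload; always follows a control message

def l2_bank_for_address(addr: int, num_l2_banks: int = 6) -> int:
    """Hypothetical page-interleaved mapping of an address to an L2 bank.

    Assumes a simple modulo over 4 KB pages; the hardware's actual hash
    is not publicly documented.
    """
    return (addr // PAGE_SIZE_BYTES) % num_l2_banks

def noc_messages_for_request(carries_data: bool) -> list:
    """NoC messages generated by one L1/L2 transaction.

    Every request sends an 8-byte control message; a 64-byte cache line
    is appended only when data actually moves (e.g., a fill or write-back),
    and it must immediately follow its control message because the line
    itself carries no routing information.
    """
    messages = [CONTROL_MSG_BYTES]
    if carries_data:
        messages.append(CACHE_LINE_BYTES)
    return messages

# Two consecutive 4 KB pages map to different L2 banks / memory controllers.
assert l2_bank_for_address(0x0000) != l2_bank_for_address(0x1000)
```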
Figure 3: Physical layouts of the electrical NoCs for the two GPU chips: (a) crossbar for the 32-CU GPU (28 nm), and (b) 4×12 2D-mesh for the 128-CU GPU (14 nm).

4. NOC DESIGN FOR GPU SYSTEMS
In this section, we discuss the electrical and photonic NoC designs used in our evaluation of the 32-CU GPU (described in Section 3.1) and the scaled-up 128-CU GPU (described in Section 3.2).

4.1 Electrical NoC Design for 32-CU GPU
Although not every detail of the interconnect used in the AMD Radeon HD 7970 chip has been publicly disclosed, online resources suggest that a crossbar topology is in fact used between the L1 and L2 caches [3, 23]. Therefore, we model a crossbar NoC for the baseline 32-CU GPU to provide communication between its 40 L1 caches and 6 L2 caches. The crossbar is a low-diameter, high-radix topology that supports strictly non-blocking connectivity and can provide high throughput. The crossbar topology can be implemented as a multi-bus topology, where each node in the network has a dedicated bus. The different buses of the crossbar can be routed around the GPU chip to form a U-shaped layout (Figure 3a).

We consider two main designs for the bus crossbar: a Single-Write-Multiple-Read (SWMR) design and a Multiple-Write-Single-Read (MWSR) design. In the SWMR crossbar, each transmitting node has a designated channel on which to transfer messages, and all receiving nodes are able to listen to all sending channels. Every node decodes the header of each control message on the channel and decides whether to receive the rest of the message; only the destination node receives the cache line that directly follows the control message. A received message is buffered in the destination node's input buffer, and the destination node applies Round-Robin (RR) arbitration between messages it receives from multiple transmitting nodes at the same time. In the MWSR crossbar, each transmitting node needs to acquire access to the receiver's dedicated bus to transmit a message, and access is granted in an RR fashion among the nodes that have a message to transmit. The channels in both designs use credit-based flow control for buffer management, with acknowledgments in the reverse direction piggybacked on control messages. Each L1 and L2 cache unit in the GPU system acts as a node in the network. For example, the SWMR crossbar NoC in the 32-CU GPU requires 40 dedicated SWMR buses for the L1 units (32 for vector caches and 8 for scalar caches) and 6 dedicated SWMR buses for the L2 units to send information to all the other units.

4.2 Electrical NoC Design for 128-CU GPUs
As the number of CUs in a GPU grows, the number of L1 and L2 units that need to communicate also grows, which means that additional SWMR/MWSR buses are required in the crossbar to support this communication.
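The number of global buses each organization needs (summarized in Table 2) follows directly from these node counts: an MWSR crossbar needs one bus per reader, an SWMR crossbar one bus per writer, and the hybrid design introduced in Section 4.3 combines MWSR buses downstream with SWMR buses upstream. A small sketch that reproduces those counts:

```python
def global_bus_count(num_l1: int, num_l2: int) -> dict:
    """Global buses required by each crossbar organization (cf. Table 2).

    DOWN is L1-to-L2 traffic and UP is L2-to-L1 traffic.  An MWSR bus is
    needed per reader and an SWMR bus per writer; the hybrid uses MWSR
    buses downstream and SWMR buses upstream.
    """
    return {
        "MWSR": {"DOWN": num_l2, "UP": num_l1},   # one bus per reading node
        "SWMR": {"DOWN": num_l1, "UP": num_l2},   # one bus per writing node
        "HYB":  {"DOWN": num_l2, "UP": num_l2},   # MWSR down + SWMR up
    }

# 32-CU GPU: 40 L1 caches (32 vector + 8 scalar) and 6 L2 banks.
gpu32 = global_bus_count(num_l1=40, num_l2=6)
assert sum(gpu32["HYB"].values()) == 12      # hybrid, vs. 46 for a full crossbar
# 128-CU GPU: 160 L1 caches (128 vector + 32 scalar) and 24 L2 banks.
gpu128 = global_bus_count(num_l1=160, num_l2=24)
assert sum(gpu128["HYB"].values()) == 48     # hybrid, vs. 184 for a full crossbar
```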
Table 2: Number of global buses required for the topologies in the 32- and 128-CU GPUs. HYB = the proposed hybrid, DOWN = L1-to-L2 traffic, UP = L2-to-L1 traffic.

Design    32-CU GPU         128-CU GPU
          DOWN    UP        DOWN    UP
MWSR        6     40          24    160
SWMR       40      6         160     24
HYB         6      6          24     24

A larger number of buses, implemented using electrical link technology, translates into a larger die area and higher power dissipation. Moreover, in an MWSR crossbar the transmitting units need to acquire access to one of the buses via arbitration, and if a node holds on to a bus for a large number of cycles (due to serialization and transmission delays), the other nodes experience very long wait times. A large number of SWMR buses, in turn, puts pressure on the receiving cache units, which have a limited number of ports. Therefore, for GPUs with large CU counts, we propose to use an electrical 2D-mesh network.

The mesh is a low-radix, high-diameter topology that uses decentralized flow control. An electrical mesh network is easy to design from a hardware perspective due to its short wires and low-radix routers. While a mesh avoids the long arbitration delays present within a crossbar, it has a higher zero-load latency, as each packet has to traverse multiple routers (hops) to reach its destination. A typical 2D-mesh is constructed using radix-5 routers connected to neighboring routers and to sender/receiver end-nodes. The main drawback of using a 2D-mesh with a large number of nodes is the high hop count, which results in longer latencies. To alleviate this issue, we propose using a concentrated 2D-mesh, where each router has a larger radix. As shown in Figure 3b, for the 128-CU GPU we use routers with radix 7 and radix 8. In our baseline 2D-mesh design for the 128-CU GPU, a single router is connected to multiple L1 vector and scalar cache units, depending on the location of the router in the layout. While various configurations are possible for the placement of the L2 units [6], we chose to connect the L2 units to the mesh through the routers placed along the periphery of the mesh, following a layout similar to the AMD Radeon HD 7970, in which the L2 banks are placed on the periphery of the die [22]. The mesh is modeled using state-of-the-art single-cycle routers [28], each having room for 8 cache lines in its input buffers and no virtual channels. The routers use a standard matrix-based crossbar that implements round-robin arbitration. We use X-Y routing with credit-based flow control.

4.3 Hybrid NoC Design for GPU
As mentioned earlier, both the SWMR and MWSR crossbars have considerable drawbacks in terms of area and power dissipation due to their large number of global channels. To address this problem, we design a NoC with a significantly lower number of channels by taking advantage of the asymmetry between the L1-to-L2 and L2-to-L1 network traffic, and by analyzing the communication between L1 units. In our hybrid design, we propose to use SWMR buses for L2-to-L1 communication and MWSR buses for L1-to-L2 communication.

The choice of an SWMR bus for the L2-to-L1 network is easy to justify. The data returning from the L2 cache units to the L1 cache units is latency sensitive, since this data is needed for the work-items to start their execution. With SWMR, there is no arbitration latency on the transmitting side (the L2 cache, which owns its bus). Additionally, the number of buses is low, equal to the number of L2 caches.
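Both bus flavors ultimately rely on round-robin arbitration, at the receiving node for SWMR and at the bus grant for MWSR. The sketch below is a minimal software model of such an arbiter, included only to make the policy explicit; it is not the router RTL modeled in our simulator.

```python
class RoundRobinArbiter:
    """Grant at most one requester per cycle, rotating priority after each grant."""

    def __init__(self, num_requesters: int):
        self.num_requesters = num_requesters
        self.last_granted = num_requesters - 1   # so index 0 has priority first

    def grant(self, requests):
        """Return the index of the granted requester, or None if nobody requests."""
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_granted + offset) % self.num_requesters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None

arbiter = RoundRobinArbiter(4)
assert arbiter.grant([True, False, True, False]) == 0   # priority starts at node 0
assert arbiter.grant([True, False, True, False]) == 2   # then rotates past the winner
```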
Using MWSR buses in the L1-to-L2 direction requires few buses (equal to the number of L2 caches), at the expense of introducing additional arbitration latency, since the hardware must handle the worst case in which all L1 caches attempt to access the same bus simultaneously. However, this has minimal impact on overall performance, because the latency of L1-to-L2 traffic is typically not critical: the GPU programming model ensures that no other CU is stalled awaiting the completion of a write-back transfer. At the same time, GPU workloads exhibit light traffic in the L1-to-L2 direction. Our analysis of the GPU applications in the AMD APP SDK reveals that, on average, 80.5% of the traffic generated from L1 to L2 consists of control messages (8-byte messages). This means that any bus wider than 8B incurs a 0-cycle serialization latency for 80% of the messages from L1 to L2. Thus, scaling down the L1-to-L2 network causes no noticeable increase in contention.

A crossbar topology provides point-to-point connections between all the L1 and L2 units; therefore, any L1-to-L1 communication has a dedicated path. As mentioned in Section 1, GPU applications generate a very small amount of L1-to-L1 communication, even though modern GPU designs provide coherence between L1 units. We evaluated the memory-intensive applications in the AMD APP SDK benchmark suite to quantify the amount of traffic between L1 units. As expected, all applications in the suite exhibit close to zero transactions between L1 units, except for conv and dct (for these two applications, 16% and 12% of all the traffic on the network consists of L1-to-L1 transactions, respectively; see Table 6 for further details on the applications). Therefore, to reduce the power consumption of the NoC, our hybrid design also removes the connections between L1 units: any L1-to-L1 transaction is replaced with two transactions, one from L1 to L2 and a second from L2 to L1. This change also removes the arbitration latency that the small amount of L1-to-L1 communication would otherwise introduce. Overall, the hybrid design reduces the number of buses in the crossbar for the 32-CU GPU from 46 to 12. Similarly, the hybrid design reduces the number of buses from 184 to 48 for the 128-CU GPU (see Table 2 for details).

Figure 4: Physical layouts of the silicon-photonic NoCs for the two GPU chips: (a) crossbar for the 32-CU GPU (28 nm), and (b) 32×24 crossbar for the 128-CU GPU (14 nm). We use 4 µm-pitch waveguides and 10 µm-diameter rings.

4.4 Photonic NoC Designs
The data-dependent energy of silicon-photonic links is independent of link length, which makes this technology well suited for designing low-diameter, high-radix topologies such as a crossbar, Clos, or butterfly. In this work, we consider the crossbar, since it provides strictly non-blocking connectivity and is easy to program.

Table 3: The main elements in the photonic crossbars. The channel width is 72 bytes. WG = waveguides, MD = modulators, FL = filters, TL = MD+FL. All rows refer to the 32-CU GPU except HYB128, which is the hybrid crossbar for the 128-CU GPU; HYB128 uses concentration through access points (APs) to reduce the number of rings (4 vL1s and 1 sL1 share an AP).

            WG       MD       FL       TL
SWMR        63     2668    27840    30508
MWSR        63    27840     2668    30508
HYB         26    14268    14268    28536
HYB128     186    45936    45936    91872
A photonic crossbar provides relatively lower latency than an electrical crossbar. We therefore use silicon-photonic link technology to design a crossbar NoC for the GPU, as discussed in Section 4.1. The photonic MWSR implementation uses an optical token channel with fast-forwarding, which provides the same RR arbitration as the electrical crossbars [35]. The photonic SWMR implementation avoids the need for arbitration: at every transmission, the source node sends the control message, whose head flit is decoded by all the receiving nodes, and all nodes except the destination turn off their receiver ring detectors after the decode [27]. Like our electrical hybrid NoC, our photonic hybrid design also uses SWMR buses for L2-to-L1 and MWSR buses for L1-to-L2 communication.

Table 3 shows the number of photonic components required for the different crossbars at the maximum bus width (72 bytes). Comparing the three topologies for the 32-CU GPU, our proposed hybrid scheme requires fewer waveguides, due to its lower number of buses (see Table 2), and a correspondingly lower total number of rings (modulators and filters), which results in lower NoC power. As a result, we only consider this hybrid design for the photonic NoC in the larger 128-CU GPU. To further minimize static NoC power, the hybrid crossbar for the 128-CU GPU uses concentration through Access Points (APs): 4 vector L1 caches and 1 shared scalar L1 cache use the same AP to access their corresponding crossbar bus on a time-division multiplexing basis (see Figure 4b).

The photonic topologies are mapped onto the U-shaped and serpentine physical layouts illustrated in Figures 4a and 4b, respectively, for the two GPUs under study. Note that the figures show the GPU chip dimensions. As explained in Section 3, to obtain the 128-CU floorplan area we scale down the dimensions of the 32-CU chip components (CUs, vL1s, sL1s, L2s, and MCs) from the 28 nm technology node of the 32-CU GPU to the 14 nm node. Monolithic integration of a photonic crossbar slightly increases the area of the GPU chip. We use 4 µm-pitch waveguides and 10 µm-diameter rings. Comparing the total area of the GPU chips with the electrical NoCs and with the photonic NoCs, we obtain floorplan areas of 390.0 mm² vs. 390.8 mm² (32-CU GPU) and 402.2 mm² vs. 410.0 mm² (128-CU GPU), respectively. Photonic components do not scale with the technology node (unlike CMOS transistors), which is the main reason for this increase in area. The area overhead in the 128-CU GPU is 7.8 mm², roughly the area of two CUs (each CU occupies about 4 mm²); however, increasing the number of CUs from 128 to 130 would yield only a very small performance improvement.

Energy and loss projections for the photonic crossbars are presented in Tables 4 and 5, respectively [13, 25, 26]. We use conservative projections for the 32-CU GPU and aggressive projections for the 128-CU GPU; the two differ in the assumed photodetector sensitivity (-17 dBm vs. -20 dBm) and laser efficiency (20% vs. 30%).

Table 4: Energy projections for photonic links, based on [13, 25, 26]. Tx = modulator driver circuits, Rx = receiver circuits; dynamic energy is data-traffic dependent, fixed energy covers clock and leakage. We consider 20% (conservative projection) and 30% (aggressive projection) laser efficiency.

                          Tx (fJ/bit)    Rx (fJ/bit)
Data-dependent energy         20             20
Fixed energy                   5              5
Thermal tuning                16             16

Table 5: Projected/measured optical loss per component [13, 25, 26]. We consider -17 dBm (conservative projection) and -20 dBm (aggressive projection) for the photodetector sensitivity.

Device                      Loss (dB)    Device             Loss (dB)
Optical fiber (per cm)      5e-6         Coupler            1
Non-linearity (at 30 mW)    1            Splitter           0.2
Modulator insertion         1            Filter (through)   1e-3
Waveguide crossing          0.05         Filter (drop)      1.5
Waveguide (per cm)          2            Photodetector      0.1
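To illustrate how the projections in Tables 4 and 5 feed the photonic power model, the sketch below derives the electrical laser power needed per wavelength from an assumed loss path. The particular path length and number of through-rings are placeholders chosen for the example, not values taken from the Figure 4 layouts.

```python
def laser_power_mw_per_wavelength(path_loss_db: float,
                                  detector_sensitivity_dbm: float = -17.0,
                                  laser_efficiency: float = 0.20) -> float:
    """Electrical laser power needed to light one wavelength, in mW.

    The optical power launched into the chip must still exceed the
    photodetector sensitivity after the accumulated path loss; dividing
    by the laser's wall-plug efficiency gives the electrical power drawn.
    Defaults are the conservative projection (-17 dBm, 20%); the
    aggressive projection uses -20 dBm and 30%.
    """
    required_output_dbm = detector_sensitivity_dbm + path_loss_db
    required_output_mw = 10 ** (required_output_dbm / 10.0)
    return required_output_mw / laser_efficiency

def example_path_loss_db(waveguide_cm: float, rings_passed: int) -> float:
    """One illustrative loss budget assembled from Table 5 (assumed path)."""
    coupler = 1.0                      # fiber-to-chip coupler
    modulator_insertion = 1.0
    waveguide = 2.0 * waveguide_cm     # 2 dB per cm of on-chip waveguide
    through_rings = 1e-3 * rings_passed
    filter_drop = 1.5                  # the receiving filter's drop loss
    photodetector = 0.1
    return (coupler + modulator_insertion + waveguide +
            through_rings + filter_drop + photodetector)

# Example: a 4 cm waveguide path passing 500 idle rings along the way.
loss_db = example_path_loss_db(waveguide_cm=4.0, rings_passed=500)
print(f"{loss_db:.1f} dB loss -> {laser_power_mw_per_wavelength(loss_db):.2f} mW/wavelength")
```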
5. EVALUATION METHODOLOGY
We conducted experiments to evaluate the potential of silicon-photonic NoCs in GPUs using the Multi2Sim 4.2 simulation framework [31]. The simulator provides configurable models of commercial AMD and NVIDIA GPU architectures. Multi2Sim integrates an emulator for the AMD Southern Islands instruction set; the emulator produces ISA-level instruction traces, which are fed to an architectural simulator featuring a detailed timing model of the GPU pipelines and memory system. We have extended Multi2Sim to include models for both electrical and photonic buses, as well as packet processing and flit-level support for NoCs. To validate the updated simulation framework, we leveraged Multi2Sim's standalone network simulation mode, which can inject random traffic into various network topologies, and compared the latencies and performance numbers reported by our cycle-based model against previous research [7, 11].

The applications evaluated in this work are taken from the AMD APP SDK [1], which AMD provides to highlight efficient use of the AMD Southern Islands family of GPUs. For each application, we can change the program inputs to specify the workload intensity. We chose large input sizes for our benchmarks to show how current GPUs are unable to support the growth in data set sizes we expect to see in future workloads. We made sure that the selected subset of applications includes a diversity of workload features and varying intensities, creating a range of traffic behaviors. Table 6 lists the set of applications selected from this benchmark suite, with a brief description of each application.

The power of the electrical network is estimated using a detailed transistor-level circuit model; the power dissipated by the network is calculated based on the physical layout, the flow-control mechanism, and the network traffic. For the 32-CU GPU, the wires in the crossbar are designed in the global metal layers using pipelining and repeater insertion in 28 nm technology [2]. All inter-router channels in the 2D-mesh are implemented in semi-global metal layers with standard repeater wires, and we use the 14 nm technology node [2] for the 128-CU GPU systems. The power dissipated in the SRAM arrays and crossbars of the routers is calculated using the methodologies described in [19] and [36], respectively. We use the photonic technology parameters described in Sections 2 and 4.3 to calculate the laser power, Tx/Rx power, and thermal tuning power of the photonic NoCs.

6. EXPERIMENTAL RESULTS

6.1 Electrical vs. Photonic Designs of Crossbar Topologies for a 32-CU GPU
As described in Section 3.1, a 64B cache line must immediately follow its 8B control message, so the largest message in the network is 72B. For any channel narrower than 72B (e.g., 32B), the control message is transferred in a separate cycle first, and the cache line is then packetized and transferred immediately in the following cycles (2 cycles for a 32B channel).
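A minimal sketch of this packetization rule, assuming (as described above) that the control message occupies its own cycle whenever the channel is narrower than the 72B packed message:

```python
import math

CONTROL_BYTES = 8       # control message (request type + destination)
CACHE_LINE_BYTES = 64   # data payload that must follow its control message

def bus_cycles_per_transaction(channel_width_bytes: int,
                               carries_data: bool = True) -> int:
    """Bus cycles needed to push one transaction onto a channel.

    A channel of at least 72 B packs the control message and the cache
    line into a single transfer; a narrower channel sends the control
    message in its own cycle and then packetizes the cache line into
    channel-width flits.
    """
    if not carries_data:
        return 1                                   # control message only
    if channel_width_bytes >= CONTROL_BYTES + CACHE_LINE_BYTES:
        return 1                                   # one packed 72 B message
    return 1 + math.ceil(CACHE_LINE_BYTES / channel_width_bytes)

assert bus_cycles_per_transaction(72) == 1   # packed control + cache line
assert bus_cycles_per_transaction(32) == 3   # 1 control cycle + 2 data cycles
assert bus_cycles_per_transaction(16) == 5   # 1 control cycle + 4 data cycles
```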
In our evaluation, we considered MWSR and SWMR electrical buses with channel widths of 16B, 32B (divisors of 64), and 72B (packing the control message and the cache line into one message) for the 32-CU GPU system. For the photonic NoC, given the bandwidth density advantage of silicon-photonic link technology over electrical link technology, we considered SWMR and MWSR buses with a channel width of 72B.

In Figure 5a, we compare the electrical and photonic MWSR and SWMR crossbars in terms of performance (top plot) and network traffic (bottom plot). To calculate the performance improvement (i.e., speedup) for each application, we divided the application execution time when using the MWSR electrical crossbar NoC with a 32B channel width (E-MWSR-32) by the execution time for each of the other NoC designs. As shown in the corresponding plot, the photonic SWMR crossbar (P-SWMR-72) achieves an average 2.60× speedup, and the photonic MWSR crossbar (P-MWSR-72) achieves an average 2.64× speedup, compared to E-MWSR-32. A similar performance improvement is observed if we compare P-MWSR-72 and P-SWMR-72 with E-SWMR-72. The magnitude of the speedups correlates directly with the network bandwidth required by the applications (see the bottom plot in Figure 5a). Applications with a higher number of memory transactions (i.e., larger offered bandwidth) exhibit larger benefits (6.3× for mtwist and 5.2× for conv). The benefits we observe are a result of the low-latency nature of photonic NoCs. As can be seen in the figure, increasing the channel bandwidth of the electrical crossbars results in very small performance improvements for memory-intensive applications (such as conv), since the transmission latency of the electrical channel (4 cycles on average) masks any benefit from the reduction in serialization latency obtained by increasing the bandwidth.

Table 6: The workloads selected from the AMD APP SDK.
BS: Finds the position of a given element in a sorted array of size 1048576.
CONV: Convolution filtering on each element of a 4096×4096 input matrix with a 5×5 blur mask.
DCT: Discrete Cosine Transform on an input matrix of size 8192×8192.
DWTHAAR: One-dimensional Haar wavelet transform on a one-dimensional array of size 8388608.
LARGSCAN (LS): Performs a scan on a large default array of 134217728 elements.
MTWIST: SIMD-oriented Fast Mersenne Twister (SFMT); generates 4194304 random numbers and uses Box-Muller to convert them to Gaussian random numbers.
RED: Performs a reduction by dividing an array of 33554432 elements into blocks, calculating the sum of each block, and then calculating the sum of the block sums.
RG: Performs a recursive Gaussian filter on an image of size 1536×1536.
SOBEL: Performs the Sobel edge detection algorithm on an image of size 1536×1536.
URNG: Generates uniform noise on an input image of size 1536×1536.
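For reference, the speedup and ED²P values reported in Figures 5 through 7 are plain ratios against the chosen baseline design. The helper below only shows the arithmetic we have in mind; all inputs come from the simulator and the power models, and nothing in it is measured data.

```python
def speedup(baseline_time_ms: float, design_time_ms: float) -> float:
    """Execution-time ratio; values above 1 mean the design beats the baseline."""
    return baseline_time_ms / design_time_ms

def normalized_ed2p(noc_power_w: float, exec_time_s: float,
                    baseline_power_w: float, baseline_time_s: float) -> float:
    """ED2P relative to the baseline design (E-MWSR-32 in Figure 5).

    Energy is power times time, so ED2P = E * D**2 reduces to P * T**3
    for each design before taking the ratio.
    """
    return (noc_power_w * exec_time_s ** 3) / (baseline_power_w * baseline_time_s ** 3)
```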
Figure 5: Evaluation of a current 32-CU GPU with electrical and photonic MWSR and SWMR crossbar NoCs: (a) speedup and offered bandwidth results for the different crossbars; (b) breakdown of the total NoC power and the energy-delay² product (ED²P). Ticks on the x-axis follow the pattern T-X-N, where T is the link technology (electrical = E; photonic = P), X is the type of crossbar (MWSR or SWMR), and N is the channel width in bytes. For the electrical NoCs, three different link widths (from left to right: 16, 32, and 72 bytes) are considered. The speedup and ED²P results are normalized to the electrical MWSR crossbar NoC with a 32-byte channel width (E-MWSR-32).
Compute-intensive applications (such as urng) and applications with a small number of thread blocks (such as bs) can store their entire dataset in the L1 data caches, generating very few memory accesses to the L2 banks. These applications therefore do not benefit from high-bandwidth NoC designs, since the NoC is barely utilized.

Figure 5b compares the E-MWSR, E-SWMR, P-MWSR, and P-SWMR crossbars using the NoC power (top plot) and energy-delay² product (ED²P, bottom plot) metrics across the different benchmarks. For all benchmarks, the static power of the electrical crossbars is dominant, and it increases from 2.1 W to 19 W as the channel width grows from 8 to 72 bytes. A 72B link requires 72 × 8 = 576 wires, and each wire is 46.71 mm long; with 46 buses in the network, 26,496 wires are required. If we assume each wire is pipelined into 4 segments and runs at 1 GHz, a wire consumes roughly 180 fJ/bit in 28 nm technology. This leads to a 19 W power dissipation for the NoC, which is prohibitively high; this is why commercial GPUs do not use electrical networks with 72B-wide links. The P-MWSR-72 and P-SWMR-72 crossbars consume 29% less power on average than E-MWSR-72 and E-SWMR-72, but 32% more power on average than E-MWSR-32 and E-SWMR-32. Note that the static power values for the P-MWSR and P-SWMR crossbars (fixed, thermal, and laser components) are very similar, because both crossbars use the same number of photonic rings and waveguides (see Table 3). To reduce this large fraction of static power, a NoC layout using concentration can be adopted: access points can be used to attach 4 vL1s and 1 sL1 to the bus through a single link. By adding access points, we can reduce the total number of rings by a factor of 4.78 (30,508 vs. 6,380) and the number of waveguides by a factor of 3.5 (63 vs. 18). As explained in Section 4.4, we apply this technique to reduce NoC power in the scaled-up 128-CU GPU with a hybrid NoC.

In any case, the bottom plot of Figure 5b shows that the photonic NoCs achieve the lowest ED²P for all the benchmarks except urng. The urng application is insensitive to the NoC channel bandwidth, so the electrical crossbar NoC with the smallest channel width reports the best ED²P, as its power is lower than that of the other designs and its performance equals that of the crossbar with the largest bus bandwidth.

Figure 6: Evaluation of a 32-CU GPU with electrical and photonic hybrid NoCs: (a) speedup and offered bandwidth of the benchmarks; (b) breakdown of the total power and ED²P of the hybrid NoCs. Labels on the x-axis follow the pattern T-N, where T is the link technology (electrical = E; photonic = P) and N is the channel width in bytes (8, 16, 32, and 72). Speedup and ED²P are normalized to E-MWSR-32 (not shown for clarity).
Note that the ED²P is significantly improved by our photonic NoCs: the average reduction across all benchmarks is 66.5% for P-SWMR and 67.7% for P-MWSR, when compared to E-MWSR-32.

6.2 Photonic vs. Electrical Design of a Hybrid Topology for 32-CU GPUs
Figure 6 presents results for the electrical and photonic implementations of our proposed hybrid design (see Section 4.3). Comparing Figures 6 and 5 (results in both figures are normalized to E-MWSR-32), we observe that the electrical implementation of the hybrid design, E-HYB-72, does not show any performance improvement over its crossbar counterparts, E-MWSR-32 and E-MWSR-72. However, it reduces the power consumption of the electrical crossbar by 68% on average (due to less hardware) while providing the same performance. Our photonic hybrid design, P-HYB-72, on the other hand, provides a higher speedup than E-MWSR-32 and E-MWSR-72 (2.7× over both). Increasing the bandwidth of the electrical crossbar does not affect the GPU's performance, due to the high electrical link latency, as shown in Section 6.1.

P-HYB-72 consumes 51% less power than both E-MWSR-72 and E-SWMR-72, and only marginally more power (2%) than E-MWSR-32 and E-SWMR-32. This makes our hybrid design a very good contender against any electrical crossbar design. P-HYB-72 also achieves higher performance than P-MWSR-72 (1.05×) and P-SWMR-72 (1.03×) on average, by slightly reducing latency (see Section 4.3 for details), and dissipates 32% and 33% less power on average, respectively, by using fewer hardware resources. As a result, P-HYB-72 achieves a 34% and 30% reduction in ED²P compared to P-MWSR-72 and P-SWMR-72, respectively.

The comparison between the photonic and electrical hybrid designs in terms of performance (top plot of Figure 6a) and network bandwidth (bottom plot of Figure 6a) reveals the clear benefits of low-latency photonic technology. P-HYB-72 exhibits, on average, a 2.5× speedup and offers 0.6 Tbit/s more bandwidth than the highest-bandwidth electrical hybrid (E-HYB-72). Figure 6b compares the electrical hybrid (E-HYB) and photonic hybrid (P-HYB) designs in terms of power (top plot) and ED²P (bottom plot). P-HYB-72 reports 51% higher power than the E-HYB-72 NoC, due to its higher static power consumption (laser, thermal tuning, and fixed power). One solution is to adopt a run-time management mechanism based on the workload on the compute units; using run-time management, we can deactivate photonic links when they are not being used [8]. We leave run-time management mechanisms as an area for future work. Nonetheless, when analyzing the ED²P metric in the bottom plot of Figure 6b, we can see that P-HYB-72 produces, on average, 82% lower ED²P than the E-HYB-72 NoC.

6.3 Electrical Mesh vs. Photonic Hybrid Designs for 128-CU GPUs
In this section, we consider future GPUs with 128 CUs to study the scalability of the electrical and photonic NoC designs presented in Section 4.
For comparison, we consider an electrical mesh design and a photonic hybrid design with channel widths of 16, 32, and 72 bytes. Here, all comparison metrics are normalized to an electrical 2D-mesh with 16-byte channels (E-MESH-16). Figure 7a compares the performance (top plot) and bandwidth (bottom plot) of the E-MESH design with our P-HYB design for the 128-CU GPU, assuming the same set of channel widths as in the previous evaluation. The performance speedup reported for P-HYB-72 is 82% better than the speedup for E-MESH-72 (with a maximum 3.43× speedup for mtwist). E-MESH-72 offers increased bandwidth of up to 2.81 Tbit/s (dct), whereas P-HYB-72 achieves up to 7.28 Tbit/s (mtwist).

Figure 7b compares E-MESH and P-HYB in terms of power (top plot) and ED²P (bottom plot). The P-HYB NoCs generally consume more power than their E-MESH counterparts, which leads to a 19% higher ED²P for P-HYB-72 compared to E-MESH-72. We can reduce the channel bandwidth to reduce the power dissipation of P-HYB: reducing the P-HYB channel width from 72 bytes to 32 and 16 bytes reduces its power dissipation by 39% and 60%, respectively. The average speedup observed across all the applications in our study is 43% for P-HYB-32 and 17% for P-HYB-16, when compared to E-MESH-72. By reducing the channel width to 32B or 16B, we reduce the power consumption, and therefore the ED²P, of our hybrid designs; the ED²P for P-HYB-32 and P-HYB-16 is reduced by 3% (for both) when compared to E-MESH-72. This means that both P-HYB-32 and P-HYB-16 are marginally better than the 2D-mesh.

One important feature of the hybrid design is its effect on memory-intensive applications. Applications that heavily exercise the memory hierarchy (such as conv, dct, dwthaar, mtwist, largescan, and sobel) favor the photonic hybrid design. For these memory-intensive applications, P-HYB-16 enjoys, on average, a 26% performance speedup and a 13% reduction in ED²P compared to E-MESH-72. Since these applications scale well with the number of GPU units, P-HYB-32 and P-HYB-72 provide significant reductions in ED²P relative to E-MESH-72 (34% and 40%, respectively). The reported speedups of P-HYB-32 and P-HYB-72 for these memory-intensive applications are on average 1.6× and 2.2×, respectively, in comparison to E-MESH-72. Our results clearly show that for future GPU systems that will execute memory-intensive workloads, a photonic hybrid design provides the best ED²P solution.

Figure 7: Evaluation of 128-CU GPUs with an electrical 2D-mesh NoC and a photonic hybrid NoC: (a) speedup and offered bandwidth of the benchmarks; (b) breakdown of the total NoC power and ED²P. Labels along the x-axis follow the pattern T-N, where T is the technology and topology (electrical 2D-mesh = E; photonic hybrid = P) and N is the channel width in bytes (16, 32, and 72). Speedup and ED²P results are normalized to E-16.

7. RELATED WORK
A large amount of work has been done in the area of NoC design for manycore architectures, with the goal of providing energy-efficient on-chip communication. The maturity of electrical NoCs for manycore systems is evidenced by the availability of commercial designs, such as the 80-tile, sub-100 W TeraFLOPS processor introduced by Vangal et al. [33] and Tilera's 64-core TILE64 chip [37]. On the photonic NoC front there are no working prototypes yet, but researchers have explored the entire spectrum of network topologies: from low-radix, high-diameter mesh/torus topologies [10, 17, 30], to medium-radix, medium-diameter butterfly/crossbar topologies [15, 16, 27], to high-radix, low-diameter bus/crossbar topologies [29, 34], to multilayer topologies [24, 32, 38].

The area of GPU NoCs has not been widely explored. Bakhoda et al. [6] exploit the many-to-few traffic patterns in manycore accelerators by alternating full routers in congested areas with half routers. In related work [5], the same authors evaluate GPU performance degradation due to NoC router latencies. Both [5] and our work corroborate the motivation for low-latency networks to mitigate their impact on GPU performance. Lee et al. [18] identify a novel trade-off in CPU-GPU heterogeneous systems concerning the NoC design: CPUs run highly latency-sensitive threads, while coexisting GPUs demand high bandwidth. They thoroughly survey the impact of the primary network design parameters on CPU-GPU system performance, including routing algorithms, cache partitioning, arbitration policies, link heterogeneity, and node placement. In [14], Goswami et al. explore a 3D-stacked GPU microarchitecture that uses an optical on-chip crossbar to connect shader cores and memory controllers in the GPU memory hierarchy. The main difference between our work and [14] is that we present our own tailored, monolithically-integrated photonic NoCs for communication between the L1 and L2 caches, and evaluate them against different electrical designs for current and future scaled-up GPUs.

8. CONCLUSIONS
In this paper, we combine our knowledge of silicon-photonic link technology and GPU architecture to present a GPU-specific photonic hybrid NoC (used for communication between L1 and L2) that is more energy efficient than an electrical NoC. Our proposed hybrid design uses MWSR buses for L1-to-L2 communication and SWMR buses for L2-to-L1 communication. Our simulation-based analysis shows that bandwidth-sensitive applications can take advantage of a photonic hybrid NoC to achieve better performance, while achieving an energy-delay² product that is lower than that of a traditional electrical NoC.
In the AMD Southern Islands GPU chip with 32 CUs, our proposed photonic hybrid NoC increases application performance by up to 6× (2.7× on average) while reducing ED²P by up to 99% (79% on average). We also evaluated the scalability of the photonic hybrid NoC on a GPU system with 128 CUs. For the 128-CU GPU system running memory-intensive applications, we can achieve up to a 3.43× (2.2× on average) performance speedup while reducing ED²P by up to 99% (82% on average), compared to an electrical mesh NoC. Moving forward, we plan to explore techniques for run-time power management of photonic NoCs to extend this improvement to all applications. Depending on the application requirements, the photonic NoC bandwidth (and hence the photonic NoC power) will be scaled up or down appropriately to achieve energy-efficient operation in the photonic NoC, as well as in the system as a whole, across the entire spectrum of applications.

9. ACKNOWLEDGMENT
This work was supported in part by DARPA Contract No. W911NF-12-1-0211 and NSF CISE grant CNS-1319501.

10. REFERENCES
[1] AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK). http://developer.amd.com/sdks/amdappsdk/.
[2] Predictive Technology Model. http://ptm.asu.edu/.
[3] AMD Graphics Cores Next (GCN) Architecture. White paper, June 2012.
[4] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, 2012. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[5] A. Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proc. of the Int'l Symposium on Performance Analysis of Systems and Software, April 2009.
[6] A. Bakhoda, J. Kim, and T. M. Aamodt. On-Chip Network Design Considerations for Compute Accelerators. In Proc. of the 19th Int'l Conference on Parallel Architectures and Compilation Techniques, Sept. 2010.
[7] C. Batten et al. Building Many-Core Processor-to-DRAM Networks with Monolithic Silicon Photonics. In Proc. of the 16th IEEE Symposium on High Performance Interconnects (HOTI), 2008.
[8] C. Chen and A. Joshi. Runtime Management of Laser Power in Silicon-Photonic Multibus NoC Architecture. IEEE Journal of Selected Topics in Quantum Electronics, 19(2):338–350, 2013.
[9] X. Chen et al. Adaptive Cache Management for Energy-Efficient GPU Computing. In Proc. of the 47th Annual IEEE/ACM Int'l Symposium on Microarchitecture, December 2014.
[10] M. J. Cianchetti, J. C. Kerekes, and D. H. Albonesi. Phastlane: A Rapid Transit Optical Routing Network. SIGARCH Computer Architecture News, 37(3), June 2009.
[11] W. Dally. Virtual-Channel Flow Control. IEEE Transactions on Parallel and Distributed Systems, 3(2), March 1992.
[12] B. R. Gaster, L. W. Howes, D. R. Kaeli, P. Mistry, and D. Schaa. Heterogeneous Computing with OpenCL, Revised OpenCL 1.2 Edition. Morgan Kaufmann, 2013.
[13] M. Georgas et al. A Monolithically-Integrated Optical Receiver in Standard 45-nm SOI. IEEE Journal of Solid-State Circuits, 47, July 2012.
[14] N. Goswami, Z. Li, R. Shankar, and T. Li. Exploring Silicon Nanophotonics in Throughput Architecture. IEEE Design & Test, 31(5):18–27, 2014.
[15] H. Gu, J. Xu, and W. Zhang. A Low-Power Fat Tree-Based Optical Network-on-Chip for Multiprocessor System-on-Chip. In Proc. of the Conference on Design, Automation and Test in Europe (DATE), pages 3–8, 2009.
[16] A. Joshi et al. Silicon-Photonic Clos Networks for Global On-Chip Communication. In Proc. of the 3rd ACM/IEEE Int'l Symposium on Networks-on-Chip, May 2009.
[17] N. Kirman and J. F. Martínez. A Power-Efficient All-Optical On-Chip Interconnect Using Wavelength-Based Oblivious Routing. In Proc. of the 15th Int'l Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2010.
[18] J. Lee et al. Design Space Exploration of On-Chip Ring Interconnection for a CPU-GPU Architecture. Journal of Parallel and Distributed Computing, 73(12), Dec. 2012.
[19] X. Liang, K. Turgay, and D. Brooks. Architectural Power Models for SRAM and CAM Structures Based on Hybrid Analytical/Empirical Techniques. In Proc. of the Int'l Conference on Computer-Aided Design, 2007.
[20] E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2), March 2008.
[21] L. Mah. The AMD GCN Architecture: A Crash Course. AMD Fusion Developer Summit, 2013.
[22] M. Mantor. AMD HD7970 Graphics Core Next (GCN) Architecture. In Hot Chips: A Symposium on High Performance Chips, 2012.
[23] M. Mantor and M. Houston. AMD Graphics Core Next: Low-Power High-Performance Graphics and Parallel Compute. AMD Fusion Developer Summit, 2011.
[24] R. Morris, A. Kodi, and A. Louri. Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance. In Proc. of the 45th Int'l Symposium on Microarchitecture, Dec. 2012.
[25] B. Moss et al. A 1.23 pJ/b 2.5 Gb/s Monolithically Integrated Optical Carrier-Injection Ring Modulator and All-Digital Driver Circuit in Commercial 45 nm SOI. In IEEE Int'l Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pages 126–127, Feb. 2013.
[26] J. S. Orcutt et al. Nanophotonic Integration in State-of-the-Art CMOS Foundries. Optics Express, 19(3):2335–2346, Jan. 2011.
[27] Y. Pan et al. Firefly: Illuminating Future Network-on-Chip with Nanophotonics. SIGARCH Computer Architecture News, 37(3), June 2009.
[28] S. Park et al. Approaching the Theoretical Limits of a Mesh NoC with a 16-Node Chip Prototype in 45 nm SOI. In Proc. of the 49th Design Automation Conference, June 2012.
[29] J. Psota et al. ATAC: Improving Performance and Programmability with On-Chip Optical Networks. In Proc. of the Int'l Symposium on Circuits and Systems, 2010.
[30] A. Shacham, K. Bergman, and L. P. Carloni. On the Design of a Photonic Network-on-Chip. In Proc. of the First Int'l Symposium on Networks-on-Chip (NOCS), pages 53–64, 2007.
[31] R. Ubal et al. Multi2Sim: A Simulation Framework for CPU-GPU Computing. In Proc. of the 21st Int'l Conference on Parallel Architectures and Compilation Techniques, Sept. 2012.
[32] A. N. Udipi et al. Combining Memory and a Controller with Photonics Through 3D-Stacking to Enable Scalable and Energy-Efficient Systems. In Proc. of the 38th Int'l Symposium on Computer Architecture, June 2011.
[33] S. R. Vangal et al. An 80-Tile Sub-100W TeraFLOPS Processor in 65 nm CMOS. IEEE Journal of Solid-State Circuits, 43(1), Jan. 2008.
[34] D. Vantrease et al. Corona: System Implications of Emerging Nanophotonic Technology. In Proc. of the 35th Int'l Symposium on Computer Architecture, June 2008.
[35] D. Vantrease et al. Light Speed Arbitration and Flow Control for Nanophotonic Interconnects. In Proc. of the 42nd Annual IEEE/ACM Int'l Symposium on Microarchitecture (MICRO-42), pages 304–315, 2009.
[36] H. Wang, L.-S. Peh, and S. Malik. Power-Driven Design of Router Microarchitectures in On-Chip Networks. In Proc. of the 36th Int'l Symposium on Microarchitecture, 2003.
[37] D. Wentzlaff et al. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27(5), Sept. 2007.
[38] X. Zhang and A. Louri. A Multilayer Nanophotonic Interconnection Network for On-Chip Many-Core Communications. In Proc. of the 47th Design Automation Conference, June 2010.