Graceful Deadlock-Free Fault-Tolerant Routing Algorithm for 3D
Transcription
Graceful Deadlock-Free Fault-Tolerant Routing Algorithm for 3D
Graceful Deadlock-Free Fault-Tolerant Routing Algorithm for 3D Network-on-Chip Architectures Akram Ben Ahmed, Abderazek Ben Abdallah The University of Aizu, Graduate School of Computer Science and Engineering, Adaptive Systems Laboratory, Fukushima-ken, Aizu-Wakamatsu-shi 965-8580, Japan E-mail: {d8141104, benab}@u-aizu.ac.jp Abstract Three-Dimensional Networks-on-Chip (3D-NoC) has been presented as an auspicious solution merging the high parallelism of Network-on-Chip (NoC) interconnect paradigm with the high-performance and lower interconnect-power of 3-dimensional integration circuits. However, 3D-NoC systems are exposed to a variety of manufacturing and design factors making them vulnerable to different faults that cause corrupted message transfer or even catastrophic system failures. Therefore, a 3D-NoC system should be fault-tolerant to transient malfunctions or permanent physical damages. In this paper, we present an efficient fault-tolerant routing algorithm, called Hybrid-LookAhead-Fault-Tolerant (HLAFT), which takes advantage of both local and look-ahead routing to boost the performance of 3D-NoC systems while ensuring fault-tolerance. A deadlockrecovery technique associated with HLAFT, named Random-Access-Buffer (RAB), is also presented. RAB takes advantage of look-ahead routing to detect and remove deadlock with no considerably additional hardware complexity. We implemented the proposed algorithm 1 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) and deadlock-recovery technique on a real 3D-NoC architecture (3D-OASIS-NoC1 ) and prototyped it on FPGA. Evaluation results show that the proposed algorithm performs better than XYZ, even when considering high fault-rates (i.e., ≥ 20%), and outperforms our previously designed Look-Ahead-Fault-Tolerant routing (LAFT) demonstrated in latency/flit reduction that can reach 12.5% and a throughput enhancement reaching 11.8% in addition to 7.2% dynamic-power saving thanks to the Power-management module integrated with HLAFT. Keywords: 3D-NoC ; Architecture ; Parallel ; Adaptive ; Look-ahead routing ; Deadlock-free; 1 Introduction During the past few decades, technology enabled the aggressive scaling and continuous shrinkage of transistors dimension on modern microchips. This made the integration of billions of transistors on a single chip achievable [1]. As an example, the Intel Xeon processor [2] includes 2.3 billion transistors. With such high integration level available, the development of many cores on a single die has become possible. For instance, the Tilera Tile64 [3] and Intel Polaris [4] contain 64 and 80 cores, respectively. As the number of cores keeps increasing, and in order to take advantage of the large number of cores, the employment of efficient and scalable interconnect fabrics has become imperative. Traditional on-chip interconnect schemes, such as Point-to-Point, sharedbus, and the crossbar, are no longer reliable to provide the necessary communication among the processor cores. Recently, Network-on-Chip (NoC) [5, 6] has been proposed as a potential solution to respond to the high interconnection demands arising in the nanometer era. Based on a simple and scalable architecture platform, NoC connects processors, memories and other custom designs together using switches to distribute packets on a hop-by-hop basis in order to increase the bandwidth and performance and solve the interconnect bottleneck in traditional 1 This project is partially supported by Competitive research funding, Ref. P1-5, Fukushima, Japan. 2 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) bus-based systems. On the other hand, as the number of cores keeps increasing in hyper-core systems, spreading hundreds of cores in a 2D plane may not satisfy the high requirements of future data-intensive applications. This is mainly caused by the increasing diameter of 2D-NoC that does not scale with such large number of cores. At the same time, three dimensional integrated circuits (3D-ICs) [7] have attracted a lot of attention as a potential solution to resolve also the interconnect bottleneck. Thanks to the reduced average interconnect length, 3D-ICs can achieve higher-performance and a lower interconnectpower consumption can be obtained [8]. Moreover, the realization of mixed technology has become possible [9]. Combining the NoC structure with the benefits of the 3D integration offers a promising 3D-NoC architecture. The combination of these two major trends provides a new horizon for NoC designs to satisfy the high requirements of future large-scale applications. As 3D-NoC architectures started to show their outperformance and energy efficiency against 2D-NoC systems, questions about their reliability to sustain their performance growth begun to arise [10]. This is mainly due to challenges inherited from both 3D-ICs and NoCs: On one side, the complex nature of 3D-IC fabrics and the continuing shrinkage of semiconductor components. Furthermore, the significant heterogeneity in 3D chips which are likely to mix logic layers with memory layers and even more complex technologies increases the fault’s probability in a system [11]. On the other side, the single-point-failure nature of NoC introduces a big concern to their reliability as they are the sole communication medium. As a result, 3D-NoC systems are becoming susceptible to a variety of faults caused by crosstalk, electromagnetic interferences, impact of radiations, oxide breakdown, and so on [12]. A simple failure in a single transistor caused by one of these factors may compromise the entire system reliability where the failure can be illustrated in corrupted message delivery, time requirements unsatisfactory, or even sometimes the entire system collapse. Faults can occur at any component of a 3D-NoC system (i.e., link, router, buffers, crossbar etc.) and their rate of occurrence depends on the design, technology, environment and 3 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) operation conditions. From a time perspective, the duration of faults is very important especially for real-time 3DNoC systems and it can be categorized into three main types [13]: 1) Transient faults: they occur and remain in the system for a particular period of time before disappearing; 2) Intermittent faults: they are transient faults which occur from time to time; 3) Permanent faults: they start at a particular time and remain in the system until they are repaired. Consequently, reliable 3D-NoC systems should be able to detect first all the fault types occurrence, then working on reconfiguring the system resources to recover from these faults and guarantee the continuous correct functionality of the system. Fault detection is very crucial phase in fault-tolerant 3D-NoC systems. It can be obtained by relying on custom testing mechanisms or other detection schemes based on codes. Codes are mostly used in NoC systems and they were proposed to detect and correct errors in specific components of the system at the presence of a specific type of fault. For instance, Crosstalk Avoidance Codes (CAC) [14] are used for transmission wires and they are considered more efficient than the already existing methods to avoid crosstalk. For errors whose presence could not be detected, Error Detection Codes (EDC) and Error Correcting Codes (ECC) [15] are used to detect and correct these errors. Detection and checking mechanisms in 3D-NoC systems are out of the scoop of this paper and we are mainly interested in avoiding the detected faults by using a light-weight adaptive routing to avoid information loss or corruption. On the other hand, deadlock is one of the problems that may come with adaptive routing. Deadlock is caused when packets in different buffers are unable to progress because they are dependent on each other forming a dependency cycle. Most of the existing 3D-NoC systems used Virtual-Channel (VC) [16] as a deadlock-avoidance technique while others employ Virtual-Output-Queue (VOQ) [17] or their routing algorithms are based on a turn-model that adds restrictions to the routing choices to avoid deadlock. These procedures insure 4 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) the deadlock freedom at the expense of the hardware and implementation complexity or latency penalty caused by the routing restrictions. Based on these facts, in this paper we propose a novel low-latency, high-throughput and fault-tolerant routing algorithm, that we called Hybrid-Look-Ahead Fault-Tolerant (HLAFT). HLAFT takes advantages of the high-throughput and low-overhead of our previously implemented routing algorithm, Look-Ahead-Fault-Tolerant (LAFT) [18, 19], and combines it with localrouting for better routing decision while simultaneously guaranteeing fault tolerance with graceful performance degradation. The proposed algorithm can detect if the route calculated for the next node leads to a blocking path and recalculate the route to save the unnecessary clock cycles needed for nonminimal routing. To ensure deadlock-freedom, we employed a low-cost technique called Random-Access-buffer (RAB). RAB detects first the deadlock occurrence then manages to drop the blocking request and looks for other ones to free some slots in the buffer and break the dependency cycle causing the deadlock. Both HLAFT and RAB were implemented on 3DOASIS-NoC [20, 21] architecture which is a natural extension of our previous robust and flexible 2D-OASIS-NoC system [6, 23]. We prototyped the proposed system on FPGA and evaluated its hardware complexity and performance under different fault-rates and traffic loads using Transpose and Uniform synthetic workloads and also Matrix-multiplication and JPEG-encoder applications. The rest of the paper is organized as follows: In Section 2, we present some of the related work to fault-tolerant routing algorithms used in most popular 2D- and 3D-NoC systems. The proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) and its main features are explained in Section 3. In Section 4, we present Random-Access-Buffer technique for deadlock-recovery. Section 5 is dedicated for the evaluation methodology and results, and finally we end the paper with the conclusion and future work in Section 6. 5 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) 2 Related Work Many works have been conducted so far to tackle fault-tolerance in NoC systems where they can be classified depending the target system, the fault’s type, or the faults’ handling mechanism (e.g., using routing algorithms or architectural solutions). The majority of the fault-tolerant solutions were proposed for 2D-NoC systems. Some of them added restrictions to the number of faults as a security requirement for their systems. For instance, a single link or single node failure tolerance was presented in [24]. For n-dimensional mesh, Duato et al. [25] presented a routing algorithm that can tolerate up to n-1 faults. Another part of the proposed solutions eliminated the number of faults restriction and, instead, they limited the location of faults to a specific part of the network. They called these restricted subnetworks fault-regions which can be disabled, if necessary, to ensure fault-tolerance of the system. These restrictions can be represented by the shape of the fault-regions [26], their locations (excluding faults located at the edges of the network [27]), or assuming the absence of faults in some specific parts of the router [28]). Some works were presented without adding any restriction to either the number of faults or location. For example, uLBDR [29] is a routing scheme for 2D mesh topology based on Virtualcut-through which incurs a high complexity despite its great performance. The work in [30] is based on stochastic approaches to tolerate transient and permanent faults. Other works [31, 32] focused on the importance of adopting minimal routing algorithms to reduce the congestion caused by the presence of faults. They proposed minimal routing schemes that are able to adaptively route packets through the shortest paths, as long as a path exists. Authors in [33], presented a comprehensive investigation about 2D/3D fault-tolerant routing algorithms’ implementation. They discussed the different turn models, fault conditions, and the fault occurrence in both links and routers. 6 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Some of the proposed fault-tolerant solutions for 3D-NoC systems primarily focused on permanent faults occurring in Through-Silicon-Vias (TSVs). For example, Loi et al. [34] addressed the TSV yield improvement by presenting a solution based on spare via insertion. Pasricha et al. [35] introduced serialization to achieve also the same goal. On-chip sequential elements protection techniques were also proposed in [36, 37] to deal with single event upsets. The other existing works for 3D-NoC systems focused on link failure in general by adopting fault-tolerant routing algorithms. For example, Rahmani et al. [38] presented a fault-tolerant routing algorithm for named AdaptiveZ targeted for Hybrid-3D-NoC architecture. Feng et al. [41] proposed a low overhead deflection routing algorithm based on routing tables. However, the deployment of routing tables is still costly in terms of hardware and suffers from poor scalability due to the area required for the tables. Nordbotten et al. [39] presented an adaptive and fault-tolerant routing for 3D meshes and tori based on storing at each node a table for routing containing information about destination and the list of intermediate nodes. A fault-tolerant routing algorithm for 3D torus was presented by Yamin et al. [40]. With up to 30% faulty-node rate, the proposed algorithm can find a valid link between two faulty nodes with a probability higher than 90%. The Planar Adaptive Routing (PAR) for large scale 3D topologies was presented by Chien et al. [42] where they used three Virtual-Channels (VCs) [16] for deadlock-recovery and used also some routing rules. In [43] a fault model based on PAR, named 3D minimum-connected component (MCC), was presented. Later, Xiang et al. [44] proposed a fault model, named planar network (PN), that needs to store much less safety information at each faulty-free node. The problem with the solutions mentioned above is that they use VCs which are costly in terms of hardware and implementation complexity. This is caused by the arbitration needed to handle the different requests coming from the multiple VCs at each input-port. Thus, they introduce an extremely expensive hardware complexity and implementation cost besides the energy overhead 7 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) making them undesirable for the 3D-NoC implementation. To avoid using VCs while insuring deadlock-freedom, Pasricha et al. [45] extended a 2D turn-model for partially adaptive routing to the third dimension. The proposed scheme combines both 4N-FIRST and 4P-FIRST schemes to propose a lightweight 4NP-FIRST. On the other hand, this turn-model introduces some routing restriction to prevent from deadlock. These restrictions cause a nonminimal routing selection where in some cases it may take too many additional hops for the packet to reach its destination. A recent work by Akbari et al. [46] focused mainly on faults links and proposed an algorithm named AFRA. This algorithm routes packets through ZXY in the absence of link faults. When faults are detected, flits are sent first to an escape node through X, and then they continue to their destination through ZXY as it is in the no-fault case. The authors presented simplicity, good performance and robustness as the main features of this algorithm. However, this solution has two major drawbacks: First, AFRA tolerates only vertical links and horizontal links are not taken into consideration; second, the network congestion status is not taken into consideration, especially when more than one possible minimal path is available. In that case, a static minimal path selection does not eventually lead to low latency (i.e. a minimal path may contain a high congestion while other alternative ones do not). Ebrahimi et al. [47] stated that AFRA requires a fault distribution mechanism to know about the fault information on all vertical links along each row. They proposed a more reliable routing algorithm named HamFA which does not require any fault distribution, additional fault information, or any virtual channels. They modified the basic form of the Hamiltonian path in order to be able to switch between high and low channel subnetworks. In fact, HamFA provides much more reliability than AFRA; however, it requires that the faulty links should belong to the same subnetwork in order to tolerate multiple vertical or horizontal one-faulty links. As a result, this limits the reliability of this approach to 95% when considering the presence of a single faulty 8 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) link. In our approach, no restrictions are added to the fault locations and they can be anywhere in the network as long as there exists a valid path between a given (source, destination) pair. 3 3.1 Hybrid-Look-Ahead-Fault-Tolerant Routing Algorithm Look-Ahead-Fault-Tolerant routing (LAFT) drawbacks To keep the benefits of look-ahead routing [22], Look-Ahead-Fault-Tolerant routing algorithm (LAFT) [18, 19] should be able to perform the routing decision for the next node taking into consideration its link status and selects the best minimal path. The fault information is read by each input-port where LAFT handles the routing computation. The first phase of this algorithm calculates the next node address depending on the next-port identifier read from the flit. For a given node wishing to send a flit to a given destination, there exist at most three possible directions through X, Y, and Z dimensions respectively. In the second phase, LAFT performs the calculation of these three directions by comparing x, y and z coordinates of both current and destination nodes concurrently. At the same time, as these directions are being computed, the fault-control module reads the next-port identifier from the flit and sends the appropriate fault information to the corresponding input-port. By the end of this second phase, LAFT has information about the next node fault status and also the three possible directions for a minimal routing. In the next phase, the routing selection is performed. For this decision, we adopted a set of prioritized conditions to ensure fault-tolerance and high performance either in the presence or absence of faults: 1) The selected direction should ensure a minimal path and it is given the highest priority in the routing selection, 2) we should select the direction with the largest next hope path diversity, and the congestion status is given the lowest priority. To understand better how LAFT works, we observe Fig. 1 (a). Assuming that the current node (labeled C) received an incoming flit where the next port identifier, calculated in the previous node, 9 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) indicates that the out-port for this flit is East (Red arrow). The next node address is calculated (labeled N). Three minimal directions are possible for routing: East, North or Up. The East direction will not be selected since the link in this direction is faulty. Therefore, either North or Up can be selected, which both are minimal and nonfaulty. In this case, the diversity priority [18] is taken into consideration. If Up is selected, where the node in this direction is on one of the network edges, the diversity value is equal to 2 (two minimal possible directions: East or North). But, if North is selected, its diversity value is equal to 3 (East, North or Up). Having the highest priority, the North out-port (Green arrow) is selected for the next node and it is embedded in the flit to be used in the downstream node to allow the routing calculation and switch allocation to be performed both in parallel. More details about the different steps of the algorithm can be found in [18]. Employing look-ahead routing reduces the router latency and improves the system overall performance. However, due to the limited information restricted to only one switch ahead, in some cases LAFT may not select the best route. Observing the example in Fig. 1 (a), we can see that the North route is leading to a blocked path where all the minimal possible directions to the destination are faulty. Therefore, nonminimal routing is required causing several additional clock cycles. When arriving to the next node (labeled N), the out-port is already decided and sent to the Switch-allocator due to the look-ahead routing restraints. This is despite the fact that this path is leading to a blocking path and also a better blocking-free route exists (Up). Therefore, optimizing LAFT and making it able to detect whether the route precalculated in the previous upstream node would lead to a blocked path or not is necessary. To enable this, the benefits of look-ahead routing should be combined with local routing for better routing selection. 10 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) 3.2 Proposed Hybrid-Look-Ahead-Fault-Tolerant routing (HLAFT) algorithm The proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) solves the earlier mentioned problems of LAFT. At every incoming flit, HLAFT makes a simple computation to judge whether the precalculated Next-port identifier will lead to a blocking path or not. In case where a possible nonminimal route might occur, HLAFT recomputes the route depending on the local and neighboring nodes fault status. Algorithm 1 represents the proposed HLAFT routing. At first, the fault-control module at each input-port reads the Next-port identifier and destination addresses from the flit. Depending on the Next-port value, the next-node can be calculated. After acquiring information about the next-node, the three possible minimal directions are computed and checked whether a valid minimal route leading to the destination exists or not. In case where at least a valid minimal link leading to the destination exits, the Next-port identifier is sent to the Switch-allocator and the look-ahead-routing-computation (LA-RC) modules at the same time. In the opposite case (i.e., all the possible three directions are faulty), a flag is issued triggering the local-routing-calculation module (Loc-RC) to recalculate the output-port by verifying the status of the current and neighboring nodes links. The same three priorities, earlier explained in the previous subsection (i.e., minimal valid, higher diversity, and less congested), are taken into consideration while selecting the new out-port. The recalculated out-port issued from Loc-RC is sent to the Switch-allocator to perform the arbitration, and also sent to the LA-RC module to compute the Next-port for the next-node to continue the look-ahead routing cycle. As illustrated in Fig. 1 (b), when the flit arrives to the node labeled C and having a current outport North precalculated in the previous node, the fault-control checks the fault status of the links of the next-node in the north direction. When performing the checking, all the possible minimal paths are found faulty; thus, Loc-RC is triggered and computes Up as the new best route (since East is faulty). Then, the newly calculated Up direction is sent to the LA-RC module which issues North as the Next-out-port. 11 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Figure 2 (a) depicts the LAFT pipeline stages which takes advantage of look-ahead routing to parallelize both Next-Port-Calculation (NPC) and Switch-Arbitration (SA) stages. For the proposed HLAFT algorithm, illustrated in Fig. 2 (b), we can observe first that a pipeline stage is dedicated for the fault-control module (FTC) to issue the flag and whether trigger local-routing or not. This simple computation is done in parallel with the Buffer Writing stage (BW) to further reduce the computation latency in the router without creating any critical paths. In case where no flag is issued, the Local-Routing-Calculation stage (Local-RC) is bypassed and then Next-PortCalculation and Switch-Arbitration stages are performed in parallel (NPC/SA). In the presence of a flag, Local-RC is performed and then the results are sent to NPC/SA stage before the final Crossbar Traversal stage (CT) handles the flit transfer to the next neighboring node. 4 Random-Access-Buffer Scheme for Deadlock-Recovery As is the case for every adaptive routing algorithm, the deadlock issue may rise. As we previously mentioned in Section 1, most of the existing routing algorithms use either Virtual-channels (VCs) or add restrictions to the routing selection to avoid deadlock. These solutions either suffer from high implementation complexity or incur an additional delay due to the nonminimal approach. In our case, we implemented a similar technique to VCs, but it is much simpler and less complex. This technique, named Random-Access-Buffer (RAB), detects first the flit being the reason of deadlock in the buffer, drops its request and then looks for another flit whose request can be granted to free some slots in the buffer and break the dependency. Figure 3 shows an example how RAB works. In each input-port, the RAB-controller (RABcntrl) manages the detection of deadlock and handles the assignment of wr-adr and rd-adr addresses. The detection mechanism is based on a timer which after a period of time, if the request being processed is not served, a flag is issued informing the presence of a deadlock (Fig. 3 (a)). This is done by reading the sw-grnt signal received from the Switch-allocator. In this case, 12 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) the BC reads the head of the next packet in the buffer and checks whether the requested out-port is different from the one previously flagged as blocked or not. When it finds a request whose channel is free (Fig. 3 (b)), it sends a request to the Switch-allocator to be served. When the request is granted, the flits of the granted packet are read from the buffer and the freed slots can be used to host another incoming packet (Fig. 3 (c)). After new flits are written in the buffer, the blocked packet is checked again (Fig. 3 (d)). The RAB-cntrl receives a grant for the direction requested (North) and the packet is read from the buffer. The timer’s value is one of the important choices that should be carefully taken. This is because the deadlock occurrence is strongly dependent on the application for any adaptive routing algorithm. Moreover, the presence of faults in the system adds a higher probability for the deadlock occurrence; therefore, before selecting the value of the timer, we profiled each one of the used applications and analyzed the task graph of each one of them. During this analysis we took into consideration the fault-rate and its impact on the congestion and deadlock occurrence. In order to avoid any flit overwriting caused by mismanaging the wr-adr and rd-adr addresses, we added a small status register (labeled SR in Figs. 3 and 4) which keeps information about the flits being flagged (i.e., not served yet). When a flit is read from the buffer, the BC checks the status register and makes sure not to write the incoming flit in a flagged slot. Observing, Figs. 3 (1) and (2), we can see that status[0] and status[1], which are keeping information about the two flits of the flagged packet are updated to 1. As shown in Fig. 3 (3), when an incoming flit arrives to the buffer, RAB-cntrl makes sure that it will not be stored in one of these two flits’ locations, but instead it is stored in an empty one. When a flit which was previously flagged, and after of period of time its request is granted, the RAB-cntrl should update the status register to free the flagged slot so it can be used by other incoming flits, as it is show in Fig. 3 (4) (status[0] is updated to 0). With this simple technique, performance and deadlock-freedom are ensured while guaranteeing a small overhead. Moreover, instead of managing many requests at the same time, as it is the 13 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) case of VCs requiring then additional complexity and delay for the arbitration, RAB mechanism handles each request at a time. Despite the delay penalty required by the timer to detect the deadlock, this technique is still faster and simpler to implement than Virtual-channels. Livelock freedom As long as the chosen route is minimal, the livelock problem does not exist either. However, it can be observed when a nonminimal direction is selected. For this reason, some restrictions are added when selecting the nonminimal route [18]. The first restriction forbids the flit to turn back to the same direction where it came from. The second one forbids selecting a path which is in the opposite direction of the faulty link (i.e., if ”East” is faulty then ”West” should not be selected). Adopting these restrictions guarantees the livelock freedom of HLAFT, and the flits will continue to advance and search for a route until it finds a valid link. Power management For the proposed system, we used a power management scheme to reduce the dynamic power in the system. This scheme is based on clock-gating to turn-off the clock in some specific components which, in some periods of time, they are in a waiting state and not performing any computation. Therefore, it is important to switch off these components when they are not necessary for the correct working of the system and they are awakened when the system requires their services. We employed the power management scheme in three main components. First, when a router detects that one of its links is faulty, it immediately switches off the clock from the entire inputport associated to this faulty link. In this case, an important amount of dynamic power can be saved since input-ports occupy the largest portion of the system power budget due to the presence of buffers which are considered as the hungriest components in the entire router in terms of power. Second, we turn off the clock in the local-routing module that we added to recompute the incoming route if necessary. Thus, the local-routing module is initially not fed by the clock. Whenever the fault-control module signals the necessity to recompute the route, the local-routing module 14 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) is awakened and performs the necessary computation. Otherwise, it is disconnected from the clock and the unnecessary dynamic power is spared. Finally, the last component is the RABmanger circuit (See Fig. 4). As we previously mentioned, RAB deadlock-recovery technique is triggered only at the presence of deadlock. When no blocking happens, the buffer is managed by the conventional First-In-First-Out (FIFO) buffer manager, as shown in Fig. 4. For dynamic-power saving, RAB-manager is put asleep at the absence of deadlock (deadlock-flag is equal to zero) and it is working only when the timer (deadlock-flag) informs it that deadlock is occurring. In this fashion, deadlock-recovery is insured at a low-cost with no unnecessary power waste. 5 5.1 Evaluation Evaluation methodology Our proposed algorithm is implemented on 3D-OASIS-NoC system [20, 21, 22] which was designed in Verilog HDL. For the hardware synthesis we used Altera Quartus II CAD tool, and Power Play Analyzer to extract both static and dynamic power of the proposed system [48]. In our evaluation, we conducted two experiments: in the first one, we evaluated the hardware complexity and performance of 3D-OASIS-NoC and compared it to that of the baseline 2D-OASIS-NoC system. In the second experiment, we evaluated the hardware complexity of HLAFT router in terms of area utilization, power consumption (static and dynamic) and speed. To evaluate the performance of the proposed algorithm, we selected Matrix-multiplication [49] and JPEG-encoder [50] as real benchmarks and also two traffic patterns: Transpose [51] and Uniform [52]. We chose Matrix-multiplication because it is one of the most fundamental problems in computer sciences and mathematics, which forms the core of many important algorithms such as engineering and image processing applications [49]. To evaluate 3D-OASIS-NoC system’s performance with Matrix-multiplication, we set the matrix size to a 6×6. We also decided to calculate from 1 to 15 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) 100 different matrices at the same time. This aims to increase the number of flits traveling the network at the same time and see the impact of congestion on the performance of the proposed system with different traffic loads. The second application used for the evaluation is the JPEG-encoder application [50] which is a well-known application frequently used in a lot of research and includes some parallel tasks. This can be suitable to evaluate the performance of NoC systems. To increase the network size and introduce more parallelism, we implemented four JPEG systems that run concurrently and whose tasks share three shared memories where the input and output images are stored. The Transpose traffic pattern is a communication method based matrix transposition. Each node sends messages to another node with the address of the reversed dimension index [51]. The Transpose workload is often used to evaluate the NoC throughput and power consumption since it creates a bottleneck due to the long communication distance exhibited between (transmitter and receiver) pairs. The Uniform traffic pattern is a standard benchmark used in on-chip and off-chip network routing studies which can be considered as the traffic model for well-balanced shared memory computations [52]. Each node sends messages to other nodes with an equal probability (i.e., destination nodes are chosen randomly using a uniform probability distribution function). In our evaluation with the two traffic patterns, we set 4x4x4 as a network size where all the nodes were assigned for both transmitter and receiver nodes. Each transmitter node injects from 102 to 105 flits into the network. While on the other side, receiver nodes verify the correctness of the received flits. Using these four benchmarks, we conducted two experiments: in the first one we evaluated the performance of 3D-OASIS-NoC when compared to the base 2D-architecture under different routing algorithms (XY [53], XYZ [55], LA-XY [54], and LA-XYZ [22]). In the second experience, we evaluated the latency per flit, the throughput, and the dynamic-power of the 16 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm under each of the aforementioned application. We observed the performance variation of the proposed system under different fault link rates: 0%, 1%, 5%, 10%, 15% and 20%. The number of links in each system can be calculated using the formula shown below [56]: #links = N1 N2 (N3 − 1) + N1 N3 (N2 − 1) + N2 N3 (N1 − 1) (1) Where N1, N2 and N3 are the respective network’s X, Y and Z dimensions. During the evaluation, we divided the faults into two categories: Half of the faults are permanent (considered during the whole simulation time) and the second half are transient (randomly start and end along the simulation time). In addition, as much as the fault-rate increases, we employed more faults in flits’ paths to cause nonminimal routing and observe the system behavior in a worst case environment. All the results obtained with HLAFT were compared with LAFT [18], LA-XYZ [22], and also Dimension-Order-Routing XYZ [53]. Table 1 represents the configuration parameters used for our evaluation. 5.2 5.2.1 3D Vs. 2D NoC architectures comparison results Hardware complexity evaluation Table 2 summarizes the hardware complexity of both 2D- and 3D-OASIS-NoC architectures based on XY, LA-XY (2D), XYZ, and LA-XYZ (3D). When we extend the 2D-architecture to the third dimension (XY to XYZ or LA-XY to LA-XYZ) we observe, an area penalty which was about 37% in average. This can be explained by the additional two input-ports used by inter-layer links to connect the two adjacent layers. This results in additional buffer resources usage in addition to a larger crossbar. We also evaluated the effect of look-ahead routing on both architectures. Adopting look-ahead incurs an additional 4.6% extra area in average. This additional area can be explained by the additional bits needed for the Next-port identifier. In terms of speed, LA-XYZ- 17 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) based system has the slowest speed which is caused by the additional hardware that we previously explained. In terms of total power consumption, 3D-architectures show the best performance illustrated in an average reduction of 5.2%. This slight reduction is a result of a tradeoff between the additional static power caused by the additional hardware when moving to third dimension, and the dynamic power reduction thanks to reduced number of hops with 3D-architectures which incurs less buffering, less arbitration, and less computation resulting in dynamic-consumption reduction. 5.2.2 Communication latency evaluation We evaluated the communication latency by calculating the latency per flit for each architecture under each of the four applications. The latency per flit results are shown in Fig. 5. Extending the system to the third dimension in Matrix application (Fig. 5 (a)) reduces the latency/flit to 33% in average thanks to the reduced number of hops. Furthermore, adopting look-ahead routing (LA-XY and LA-XYZ) offers 25% latency/flit reduction. This reduction is obtained thanks to the ability of look-ahead routing to reduce the router delay and forward faster the flits to reach its destination. As a result, when compared to XY-based system, LA-XYZ-based one reduces the latency/ flit up to 55%. As a conclusion, 3D-architectures reduce the latency/flit with 22.5%, 24.5%, and 26% with Transpose, Uniform, and JPEG applications, respectively, when compared to 2D-routingbased systems. On the other hand, employing look-ahead routing offers also 28%, 29%, and 17% latency reduction with Transpose, Uniform, and JPEG applications, respectively. 5.2.3 Throughput Evaluation For the second 2D- and 3D-architecture evaluation, we calculated each system’s throughput. Figures 6 (a), (b), (c), and (d) show the throughput results for each of the used application. Transpose application (Fig. 6 (b)) shows the best throughput enhancement, when comparing 3D- 18 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) to 2D-architectures, illustrated in 45% improvement. This is can be explained by the long-distance communication omnipresent in this kind of traffic pattern. With this kind of communication, 3Darchitectures take advantage of the reduced number of hops to forward flits to their destination faster than 2D-systems. When observing the look-ahead routing effect, we noticed that Matrixmultiplication (Fig. 6 (a)) showed the best performance represented in 32% enhancement when comparing LA-XYZ to Dimension-Order-Routing XYZ. This improvement can be explained by the large size of the network (3×6×6) presenting long distance between source and destination nodes, where the router delay plays an important role in the throughput. 5.3 3D routing algorithms comparison results The second part of the evaluation takes into consideration fault-tolerance and compares the proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) with other 3D-routings (XYZ, LA-XYZ, and LAFT). This comparison is made in terms of hardware complexity (area, speed, and static-power), communication latency, throughput, dynamic-power, and reliability. 5.3.1 Hardware complexity evaluation Table 3 represents the hardware complexity results of 3D-OASIS-NoC’s router when employing HLAFT, XYZ, LA-XYZ, and LAFT. To see the additional hardware required for the RandomAccess-Buffer (RAB) for deadlock freedom, we evaluated the proposed HLAFT’s router without (HLAFT-based) and with RAB (HLAFT+RAB-based). When compared to XYZ and LAFT, HALFT presents an increasing area evaluated to 23% and 34%, respectively. This area penalty is a natural cause to the additional Loc-routing module added to recompute Next-port if needed. Moreover, the additional control signals added to perform the flag decision and insure the correct functionality of the proposed routing algorithm. Observing the additional area required for the adopted deadlock-recovery technique, RAB (HLAFT+RAB-based) requires only less than 19 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) 7.1% of extra hardware when compared to HLAFT router with no deadlock-recovery mechanism (HLAFT+RAB-based). In terms of static-power consumption, the addition of RAB does not considerably affect the proposed algorithm’s power whose static-power consumption increased by 2.8% and 5.6% when compared to LAFT and XYZ, respectively. In terms of speed, XYZbased router seems to be the fastest among the three other routers thanks to its low area while HLAFT is 2.2% and 10.3% slower than LAFT- and XYZ-based routers, respectively. 5.3.2 Communication latency evaluation For the performance evaluation, we first calculated the latency/flit for XYZ and LA-XYZ with 0% fault-rate (since both algorithms do not support fault-tolerance), and LAFT and HLAFT under fault-rates varying between 0% and 20% and we observed the performance variation. Figures 7 (a), (b), (c), and (d) show the average latency/flit results for each of the used applications. At the absence of faults, both LAFT and HLAFT have the same latency/flit. This means that the Localrouting was not triggered and HLAFT has the same behavior as LAFT. In Uniform application (Fig. 7 (c)) the latency/flit reduction with HLAFT and LAFT can reach the 48% and 33% when compared to XYZ and LA-XYZ, respectively. This reduction is a result of the ability of both algorithms to detect the congestion caused by such traffic and makes the best routing decision for less congested channels. When we increased the fault-rate and placed the fault links in some critical locations, the previous LAFT algorithm fails and deadlock happens at 20% (Matrix and Transpose) and starting from 10% (Uniform and JPEG) fault-rates as illustrated in Figs. 7 (a), (b), (c), and (d). In these figures, we can see that the latency tends to ∞, which means the failure of the system. On the other hand, HLAFT manages to avoid deadlock and, moreover, its latency is still smaller than XYZ even under 20% fault-rate with JPEG application (Fig. 7 (d)) while observing a small overhead when compared to LA-XYZ illustrated in 1.5% in Transpose traffic. In average 20 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) and under 20% fault-rate, the latency/flit is increased with HLAFT with 13.75% and 3.2% when compared to LA-XYZ and XYZ, respectively. It’s important to remind that despite their small outperformance against the proposed HLAFT algorithm, both LA-XYZ and XYZ algorithms are static schemes that do not support fault-tolerance and they can crash even at the presence of a single faulty link. When we reobserve the latency/flit results, we can see that even when LAFT succeeds to avoid deadlock, its latency is always equal or higher than that of HLAFT. The outperformance of HLAFT against LAFT can reach the 6.2 and 12.6% latency/flit average reduction under 5% and 10% fault-rates, respectively, with Matrix application. This performance improvement is obtained thanks to the employment of both local- and look-ahead routing which gives our system the opportunity to recompute the route if it finds out that the already calculated one will lead to a blocking path costing several clock cycles for nonminimal routing. In this fashion, HLAFT optimizes the routing decision and minimizes the probability of finding a blocking path (that leads to a nonminimal route) while keeping the routing efficiently minimal as long as one minimal path exists. Thus, with the help of RAB, both fault-tolerance and deadlock-recovery are insured with no considerable performance degradation. 5.3.3 Throughput Evaluation In the second performance evaluation, we computed the throughput of each routing-based system as shown in Figs. 8 (a), (b), (c), and (d). Under 0% fault-rate, the throughput enhancement with LAFT and HLAFT can reach the 53% and 23% when compared to XYZ and LA-XYZ, respectively, with Matrix application (Fig. 8 (a)). Under 5% and 10% fault-rates, HLAFT keeps its advantage against XYZ for all applications, and even under 20% fault-rates, HLAFT still outperforms XYZ with Matrix, Transpose and JPEG application observing 12% throughput 21 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) degradation with Uniform traffic-pattern. When compared to LA-XYZ, HALFT’s throughput is the highest under 5% and 10% fault-rates for all applications except for Uniform (10%) where we observed a slight 2.3% degradation. The final throughput comparison considered the performance of HLAFT and LAFT. As we previously mentioned, LAFT fails to avoid deadlock when the faultrate exceeds the 10%. For this reason, the throughput results of LAFT are not shown on these figures and the throughput is considered equal to zero. However, HLAFT with the help of the RAB deadlock-recovery technique, manages to detect and eliminate the deadlock. Moreover, even when LAFT succeeds to avoid deadlock, its throughput is always equal or lower than that of HLAFT. The enhancement of HLAFT against LAFT can reach the 5.4% and 11.8% under 5% and 10% fault-rates, respectively, with Transpose application. 5.3.4 Dynamic-power evaluation In the third part of our evaluation, we evaluated the dynamic-power of each routing-based system. In addition, we want to see the effect of the employed Power-manager (PM) for dynamic-power reduction; therefore, we evaluated the proposed HLAFT algorithm without (HLAFT) and with (HALFT+PM) PM. The average dynamic-power results are illustrated in Table 4. XYZ-based system represents the one with the lowest dynamic-power consumption followed by LA-XYZ with a slight 2.3% difference. The low dynamic-power of these two deterministic routing-based systems is natural due to their simplicity and the absence of fault-tolerance. When evaluating LAFT, it exhibits 7.2% and 5.1% higher average dynamic-power consumption than XYZ and LA-XYZ, respectively, which increases as much as we increase the fault-rates where the computation is more important than those at low fault-rates. When comparing LAFT with HLAFT, we can observe that the proposed algorithm, without employing any power management, presents the highest dynamicpower consumption illustrated by an average of 5% which also increases also linearly as we 22 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) increase the fault-rate. This can be explained by the power overhead caused by both Local-routing (to recompute the route) and RAB (for deadlock-recovery) which are triggered more frequently at high fault-rates (10% and 20%). When adopting the Power-management (PM) for HLAFT, the dynamic-power reduction with the optimized system can reach the 21.2% when compared to the system with no PM. Furthermore, when no faults are detected in the system, HLAFT+PM has almost the same dynamic-power consumption as in LAFT since both Local-routing module and RAB are shutdown. As we employ faults, HLAFT+PM starts to take advantage against LAFT that can reach the 7.2% at 20% fault-rate. At this high fault-rate, the computation performed in HLAFT+PM is more important than that in LAFT, causing then additional dynamic-power that can be seen in HLAFT without PM. However, when integrating the PM module 20% of the input-port connected to the faulty-links are clock-gated providing an important dynamic-power saving. Despite the small area penalty, the proposed system endorses fault-tolerance, provides higher performance, and ensures deadlock-freedom. Moreover, all these benefits are obtained with a minimal dynamic-power overhead to maintain the system fault-tolerant at low cost. 5.3.5 Reliability evaluation In this subsection, we discuss the reliability of HLAFT. We define reliability as the capability of the system to deliver all the packets to their destinations. If all packets are delivered except for one, the system is considered as unreliable. We compared the results with both AFRA [46] and HamFA [47] algorithms. The reliability results of these schemes are obtained from [47] where Uniform traffic pattern was used and one, two, and three faulty links are considered. HamFA could tolerate one, two, and three faulty links by 95%, 44%, 20% reliability, respectively, while AFRA could tolerate 33%, 7%, and 3% for the same number of faulty links, respectively [47]. For the proposed HLAFT, the system could deliver all the flits to their destinations with no single dropped flit (100% reliability) when considering three faulty links. 23 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Moreover, all the packets could reach their destinations when running all the used benchmarks, even when considering a large number of faulty links that can reach sometimes 252 faulty links (case of 20% fault rate in Matrix-multiplication application). We can explain this reliability by the fact that HLAFT always finds a path form source to destination, no matter where the fault is located and how heavy the traffic is. The only possible reason that can prevent the arrival of a given packet to its destination is the presence of deadlock. But thanks to the adopted RAB mechanism, deadlock was avoided and, therefore, all packets could reach their destinations providing a full reliability under different fault-rates and with different injection-rates. 6 Conclusion In this paper, we presented a novel low-latency, high-throughput, and fault-tolerant 3D-Networkon-Chip routing algorithm named Hybrid-Look-Ahead-Fault-Tolerant (HLAFT). HLAFT is a minimal routing algorithm that combines the benefits of look-ahead routing and also local-routing for better routing choices. Furthermore, HLAFT adopts a low-cost technique, named RandomAccess-Buffer (RAB), for deadlock-recovery. We implemented the proposed algorithm on a real 3D-NoC architecture (3D-OASIS-NoC), which has shown good performance, scalability and robustness. We prototyped HLAFT on an FPGA, evaluated its performance over real and large benchmarks, and compared it with well-known 3D-NoC routing schemes. We conducted two experiments where in the first one we showed the benefits of adopting 3D-architecture over the 2D-one illustrated in 33% latency/flit reduction and 45% throughput enhancement. We also saw the outperformance of look-ahead routing against local-routing depicted in an average of 24.7% latency/flit reduction and a throughput enhancement that can reach the 33%. In the second experiment we demonstrated that the proposed HLAFT algorithm introduces 24 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) lower latency/flit reduction against Dimensional-Order-Routing (XYZ) even at 20% fault-rate. Thanks to the adopted RAB deadlock-recovery technique, deadlock-freedom was ensured in HLAFT while the previous algorithm (LAFT) fails at 10% fault-rate. HLAFT outperforms LAFT even when deadlock does not occur in this latter. This outperformance is represented in a latency/flit reduction that can reach the 12.5% and a throughput enhancement reaching 11.8%. Thanks to the adopted Power-management scheme (PM), the dynamic-power increasing could be controlled and an important saving could be obtained that reached 7.2% when compared to the earlier LAFT routing. Adopting the proposed HLAFT routing introduced an area penalty represented in 23% and 34% extra ALUTs when compared to XYZ and LAFT, respectively, while observing that the adopted RAB deadlock-recovery technique cost only 7.1% extra hardware. In terms of staticpower consumption a small 2.8% and 5.6% overhead is observed when compared to LAFT and XYZ, respectively, and also 2.2% and 10.3% decreasing speed was reported when compared to the aforementioned algorithms, respectively. As a future work, an in-depth thermal power study should be conducted to observe how the performance gain obtained with the proposed algorithm would affect this design requirement, as it is very crucial for 3D-Network-on-Chip architectures. Moreover, we want to consider the occurrence of faults in other components of the systems, such as input buffers and crossbar. These circuits consume a big portion of the system area, making them more vulnerable to transient and permanent faults. References [1] Y. Xie, J. Cong, and S. Sapatnekar. Three-Dimensional Integrated Circuits Design. Springer, 2010. 25 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) [2] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, R. Varada, M. Ratta, and S. Vora. A 45 nm 8-core enterprise Xeon processor. In Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2009, pages 9-12. [3] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 processor: A 64core SoC with mesh interconnect. In Proc. of the IEEE International Solid-State Circuits Conference. Digest of Technical Papers, Feb. 2008, pages 88-598. [4] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-w teraFLOPS processor in 65-nm CMOS. IEEE Journal on Solid-State Circuits, 43(1):29-41, Jan. 2008. [5] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2004. [6] A. Ben Abdallah and M. Sowa. Basic Network-on-Chip Interconnection for Future Gigascale MCSoCs Applications: Communication and Computation Orthogonalization. Proceedings of The TJASSST2006 Symposium on Science, December 2006. [7] G. Philip, B. Christopher, and P. Ramm. Handbook of 3D Integration: Technology and Applications of 3D Integrated Circuits. Wiley-VCH, 2008. [8] A. W. Topol, J. D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini and M. Ieong. Three-dimensional Integrated Circuits. IBM Journal of Research and Development, 50(4/5):491-506, July 2006. 26 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) [9] G. Sun, X. Dong, Y. Xie, J. Li and Y. Chen. A Novel 3D Stacked MRAM Cache Architecture for CMPs. IEEE 15th International Symposium High Performance Computer Architecture, pages 239-249, February 2009. [10] L. Benini and G. De Micheli. Networks on Chips: Technology and Tools. Morgan Kauffmann, 2006. [11] I. Loi, F. Angiolini, S. Fujita, S. Mitra, L. Benini. Characterization and Implementation of Fault-Tolerant Vertical Links for 3-D Networks-on- Chip. IEEE Trans. On CAD of Integrated Circuits and Systems, 30(1):124-134, Jan. 2011. [12] A. DeOrio, D. Fick, V. Bertacco, D. Sylvester, D. Blaauw, J. Hu, and G. Chen. A Reliable Routing Architecture and Algorithm for NoCs. IEEE Trans. on CAD of Integrated Circuits and Systems, 31(5):726-739, May 2012. [13] A. Burns and A. Wellings. Real-Time Systems and Programming Languages Ada Real-Time Java and C/Real-Time Posix. Addison Wesley, 2009. [14] P. P. Pande, H. Zhu, A. Ganguly, and C. Grecu. Energy reduction through crosstalk avoidance coding in NoC paradigm. In Proc. of the 9th EUROMICRO Conference on Digital System Design, pages 689-695, 2006. IEEE Computer Society. [15] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1983. [16] W. J. Dally. Virtual-channel flow control, IEEE Trans. on Parallel and Distributed Systems, 3(2):194-205, March 1992. [17] Y. Tar and G. L. Frazier. High-performance multiqueue buffers for VLSI communication switches. 15th Annual International Symposium on Computer Architecture, pages 343-354, May-June 1988. 27 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) [18] A. Ben Ahmed and A. Ben Abdallah. Architecture and Design of High-throughput, Lowlatency, and Fault-Tolerant Routing Algorithm for 3D-Network-on-Chip (3D-NoC). The Journal of Supercomputing, 66(3):1507-1532, December 2013. [19] A. Ben Ahmed and A. Ben Abdallah. Fault-tolerant Routing Algorithm with Deadlock Recovery Support for 3D-NoC Architectures. The 7th IEEE International Symposium on Embedded Multicore SoCs, pages 67-72, September 2013. [20] A. Ben Ahmed, A. Ben Abdallah, and K. Kuroda. Architecture and Design of Efficient 3D Network-on-Chip (3D NoC) for Custom Multicore SoC. IEEE Proceedings of the 5th International Conference on Broadband, Wireless Computing, Communication and Applications, pages 67-73, November 2010. [21] A. Ben Ahmed and A. Ben Abdallah. LA-XYZ: Low Latency, High Throughput LookAhead Routing Algorithm for 3D Network-on-Chip (3D-NoC) Architecture. The 6th IEEE International Symposium on Embedded Multicore SoCs, pages 167-174, September 2012. [22] A. Ben Ahmed and A. Ben Abdallah. Low-overhead Routing Algorithm for 3D Networkon-Chip. IEEE Proceedings of The Third International Conference on Networking and Computing, pages 23-32, December 2012. [23] K. Mori, A. Esch, A. Ben Abdallah and K. Kuroda. Advanced Design Issues for OASIS Network-on-Chip Architecture. IEEE Proceedings of the 5th International Conference on Broadband, Wireless Computing, Communication and Applications, pages 74-79, November 2010. [24] W. J. Dally, L. R. Dennison, D. Harris, K. Kan, and T. Xanthopoulos. The reliable router: A reliable and high-performance communication substrate for parallel computers. In Proc. of 28 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) the First International Workshop on Parallel Computer Routing and Communication, pages 241-255, 1994. [25] J. Duato, A theory of fault-tolerant routing in wormhole networks. IEEE Trans. Parallel Distributed Systems, 8(8):790-802, August 1997. [26] Z. Zhang, A. Greiner, and S. Taktak. A reconfigurable routing algorithm for faulttolerant 2-D-mesh network-on-chip. In Proc. of the 45th ACM/IEEE Conference on Design Automation Conference, pages 441-446, June 2008. [27] P.-H. Sui and S.-D. Wang. Fault-tolerant wormhole routing algorithms for mesh networks. IEEE Trans. on Computers, 44(7):848-864, January 2000. [28] C. Liu, L. Zhang, Y. Han, and X. Li. A resilient on-chip router design through data path salvaging. In Proc. of the 16th Asia and South Pacific Design Automation Conference, pages 437-442, January 2011. [29] S. Rodrigo, J. Flich, A. Roca, S. Medardoni, D. Bertozzi, J. Camacho, F. Silla, and J. Duato. Addressing manufacturing challenges with cost-efficient fault tolerant routing. 4th ACM/IEEE International Symposium on Networks-on-Chip, pages 25-32, May 2010. [30] P. Bogdan, T. Dumitras, and R. Marculescu. Stochastic communication: A new paradigm for fault-tolerant networks-on-chip. VLSI Design, 2007(95348):17, 2007. [31] M. Ebrahimi, M. Daneshtalab, J. Plosila, and H. Tenhunen. MAFA: Adaptive Fault-Tolerant Routing Algorithm for Networks-on-Chip. In Proc. of 15th Euromicro Conference on Digital System Design, pages 201-207, September 2012. [32] M. Ebrahimi, M. Daneshtalab, J. Plosila, and F. Mehdipour, MD: Minimal path-based FaultTolerant Routing in On-Chip Networks. In Proc. of 18th Asia and South Pacific Design Automation Conference, pages 35-40, January 2013. 29 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) [33] M. Ebrahimi. Reliable and Adaptive Routing Algorithms for 2D and 3D Networks-on-Chip. Routing Algorithms in Networks-on-Chip, Springer, 2014. [34] I. Loi, S. Mitra, T. H. Lee, Sh. Fujita, and L. Benini. A low-overhead fault tolerance scheme for TSV-based 3D network on chip links. In Proc. of the 2008 IEEE/ACM International Conference on Computer-Aided Design, pages 598-602, 2008. [35] S. Pasricha. Exploring serial vertical interconnects for 3D ICs. In Proc. of the 46th ACM/IEEE Design Automation Conference, pages 581-586, July 2009. [36] M. Nicolaidis. Design for soft error mitigation. IEEE Trans. on Device and Materials Reliability, 5(3):405-418, September 2005. [37] F. L. Kastensmidt, L. Carro, and R. Reis. Fault-Tolerance Techniques for SRAM-Based FPGAs. Frontiers in Electronic Testing, Springer-Verlag, 2006. [38] A. -M. Rahmani, K. R. Vaddina, K. Latif, P. Liljeberg, J. Plosila and H. Tenhunen. Design and Management of High-performance, Reliable and Thermal-aware 3D Networks-onChip. IET Circuits, Devices & Systems, 6(5):308-321, September 2012. [39] N.A. Nordbotten, M.E. Gmez, J. Flich, P. Lopez, A. Robles, T. Skeie, O. Lysne, and J. Duato. A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate Nodes. In Proc. of IFIP international conference on Network and Parallel Computing, pages 341-356, October 2004. [40] Y. Li, Sh. Peng, and W. Chu. Adaptive box-based efficient fault-tolerant routing in 3D torus. In Proc. of the 11th International Conference Parallel and Distributed Systems, pages 7177, July 2005. 30 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) [41] Ch. Feng, M. Zhang, J. Li, J. Jiang, Z. Lu and A. Jantsch. A Low-Overhead Fault-Aware Deflection Routing Algorithm for 3D Network-on-Chip. IEEE Computer Society Annual Symposium on VLSI, pages 19-24, July 2011. [42] A. A. Chien and J. H. Kim. Planar-adaptive Routing: Low-cost Adaptive Networks for Multiprocessors. The 19th Annual International Symposium on Computer Architecture, pages 268-277, 1992. [43] Z. Jiang, J. Wu and D. Wang. A New Fault Information Model for Fault-Tolerant Adaptive and Minimal Routing in 3-D Meshes. IEEE Transactions on Reliability, 57(1):149-162, March 2008. [44] D. Xiang, Y. Zhang and Y. Pan. Practical Deadlock-Free Fault-Tolerant Routing Based on the Planar Network Fault Model. IEEE Transactions on Computers, 58(5):620-633, May 2009. [45] S. Pasricha and Y. Zou. A Low Overhead Fault Tolerant Routing Scheme for 3D Networkson-Chip. The 12th International Symposium on Quality Electronic Design, pages 1-8, March 2011. [46] S. Akbari, A. Shafieey, M. Fathy and R. Berangi. AFRA: A Low Cost High Performance Reliable Routing for 3D Mesh NoCs. Design, Automation & Test in Europe Conference & Exhibition, pages 332-337, March 2012. [47] M. Ebrahimi, M. Daneshtalab, and J. Plosila. Fault-Tolerant Routing Algorithm for 3D NoC Using Hamiltonian Path Strategy. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1601-1604, March 2013. [48] http://www.altera.com 31 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) [49] P. Chan, K. Dai, D. Wu, J. Rao and X Zou. The Parallel Algorithm Implementation of Matrix Multiplication Based on ESCA. IEEE ASIA Pacific Conference on Circuits and Systems, pages 1091-1094, December 2010. [50] Y. L. Lee, J. W. Yang, and J. M. Jou. Design of a distributed JPEG encoder on a scalable NoC platform. IEEE International Symposium VLSI-DAT, pages 132-135, April 2008 [51] A. A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. Journal of the ACM, 42(1):91-123, January 1995. [52] R. Sivaram. Queuing delays for uniform and nonuniform traffic patterns in a MIN. ACM SIGSIM Simulation Digest, 22(1):17-27, 1990. [53] H. Sullivan and T. R. Bashkow. Large Scale, Homogeneous, Fully Distributed Parallel Machine. Annual Symposium on Computer Architecture, ACM Press, pages 105-117, March 1977. [54] A. Ben Ahmed, K. Mori, and A. Ben Abdallah. ONoC-SPL: Customized Network-on-Chip (NoC) Architecture and Prototyping for Data-intensive Computation Applications. IEEE Proc. of the 4th International Conference on Awareness Science and Technology, pages 257-262, August 2012. [55] C. H. Chao, K. Y. Jheng, H. Y. Wang, J. C. Wu and A. -Y. Wu. Traffic and Thermal-aware Run-time Thermal Management Scheme for 3D NoC Systems. In Proc. of the ACM/IEEE International Symposium on Networks-on-Chip (NoCS), pages 223-230, May 2010. [56] B. Feero and P. P. Pande. Performance Evaluation for Three-Dimensional Networks-onChip. Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 305-310, May 2007. 32 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Figure 1: Example of fault-tolerant routing: (a) Look-ahead routing (LAFT) (b) Hybrid routing (HLAFT). Figure 2: Router pipeline stages: (a) Look-Ahead-Fault-Tolerant [18] (b) Hybrid-Look-AheadFault-Tolerant. 33 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) (a) (b) (c) (d) Figure 3: Example of deadlock-recovery with Random-Access-Buffer. Figure 4: Input buffer block diagram. 34 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) (a) (b) (c) (d) Figure 5: Latency per flit comparison results between 2-D and 3-D NoC architectures with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG. 35 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Algorithm 1: Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) // Destination address Input: Xdest , Ydest , Zdest // Current node address Input: Xcur , Ycur , Zcur // Next-port identifier Input: Next-port // Link status information Input: Fault-in // New-next-port for next node Output: New-next-port // Calculate the next-node address Next← Next-node (Xcur , Ycur , Zcur , Next-port); // Read fault information for the next-node Next-fault← Next-status (Fault-in, Next-port); // Calculate the three possible directions for the next-node Next-dir← poss-dir (Xdest , Ydest , Zdest , Next x , Nexty , Nextz ); // Evaluate the flag if (|Next-dir| == 0) then flag← 1; else flag← 0;; // Evaluate the New-next-port if (flag == 1) then // Trigger local-routing out-port← local-routing (Fault-in, Xcur , Ycur , Zcur , Xdest , Ydest , Zdest ); // Execute LAFT with the new recomputed out-port New-next-port← LAFT-routing (Xcur , Ycur , Zcur , Xdest , Ydest , Zdest , out-port, Fault-in); else // Execute LAFT with the inputed Next-port identifier New-next-port ← LAFT-routing (Xcur , Ycur , Zcur , Xdest , Ydest , Zdest , Next-port, Fault-in); end 36 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Table 1: Simulation configuration. Parameters / System XYZ-based LA-XYZ-based LAFT-based HLAFT-based JPEG 3x3x3 3x3x3 3x3x3 3x3x3 Matrix 3x6x6 3x6x6 3x6x6 3x6x6 Transpose & Uniform 4x4x4 4x4x4 4x4x4 4x4x4 JPEG 27 bits 30 bits 30 bits 30 bits Matrix 31 bits 34 bits 34 bits 34 bits Trans. & Unif. 34 flit 34 flit 34 flit 34 flit JPEG 10 bits 13 bits 13 bits 13 bits Matrix 10 bits 13 bits 13 bits 13 bits Transpose 16 bits 13 bits 13 bits 13 bits JPEG 16 bits 16 bits 16 bits 16 bits Matrix 21 bits 21 bits 21 bits 21 bits Transpose 21 bits 21 bits 21 bits 21 bits Buffer Depth 4 4 4 4 Switching Wormhole-like Wormhole-like Wormhole-like Wormhole-like Flow control Stall-Go Stall-Go Stall-Go Stall-Go Scheduling Matrix-Arbiter Matrix-Arbiter Matrix-Arbiter Matrix-Arbiter Routing XYZ LA-XYZ LAFT HLAFT Target FPGA device Stratix III Stratix III Stratix III Network Size (Mesh) Flit size Header size Payload size 37 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Table 2: Router hardware complexity evaluation comparison results between 2D and 3D NoC architectures. System / Parameter Figure 6: Area Total Power Speed (ALUTs) Mw Mhz 2D (XY-based) 78201 1399.08 125.32 2D (LA-XY-based) 90576 1211.06 109.22 3D (XYZ-based) 138578 1210.77 107.34 3D (LA-XYZ-based) 145287 1121.89 100.89 (a) (b) (c) (d) Throughput comparison results between 2-D and 3-D NoC architectures with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG. 38 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) Table 3: Router hardware complexity comparison results between 3D-NoC systems. System/ Parameter Area Static Power Speed (mW) (MHz) XYZ-based 2809 1258.01 194.61 LA-XYZ-based 3134 1269.76 194.09 LAFT-based 3272 1296.92 178.51 HLAFT-based 4277 1334.5 174.68 HLAFT+RAB-based 4580 1336.4 174.42 (a) (b) (c) (d) Figure 7: Latency per flit comparison results between XYZ, LA-XYZ, LAFT, and HLAFT based 3D-NoC systems with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG. 39 Preprint version to be published in the Journal of Parallel and Distributed Computing (2014) (a) (b) (c) (d) Figure 8: Throughput comparison results between XYZ, LA-XYZ, LAFT, and HLAFT based 3D-NoC systems with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG. Table 4: Average dynamic-power evaluation under different fault-rates. System/ Fault-rate 0% 5% 10% 20% XYZ-based 165.2 165.2 165.2 165.2 LA-XYZ-based 169.9 169.9 169.9 169.9 LAFT-based 170.09 172.92 178.51 194.61 HLAFT-based 173.7 174.61 183.68 231.43 HLAFT+PM-based 170.11 170.2 174.42 180.02 40
Similar documents
Architecture and Design of High-throughput, Low
Despites the latency reduction, throughput enhancement and low overhead obtained with this algorithm, LA-XYZ does not support fault tolerance. As well as being a static scheme, LA-XYZ suffers from ...
More information