Full PDF - Multi2Sim
19th International Symposium on Computer Architecture and High Performance Computing
SBAC-PAD 2007, Brazil

Published by the IEEE Computer Society
10662 Los Vaqueros Circle, P.O. Box 3014, Los Alamitos, CA 90720-1314
IEEE Computer Society Order Number P3014
ISBN 0-7695-3014-1
ISBN 978-0-7695-3014-7
ISSN 1550-6533

Copyright © 2007 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors’ opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.

Additional copies may be ordered from:

IEEE Computer Society Customer Service Center, 10662 Los Vaqueros Circle, P.O. Box 3014, Los Alamitos, CA 90720-1314. Tel: +1 800 272 6657. Fax: +1 714 821 4641. http://computer.org/cspress, csbooks@computer.org

IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331. Tel: +1 732 981 0060. Fax: +1 732 981 9667. http://shop.ieee.org/store/, customer-service@ieee.org

IEEE Computer Society Asia/Pacific Office, Watanabe Bldg., 1-4-2 Minami-Aoyama, Minato-ku, Tokyo 107-0062, JAPAN. Tel: +81 3 3408 3118. Fax: +81 3 3408 3553. tokyo.ofc@computer.org

Individual paper REPRINTS may be ordered at: reprints@computer.org

Editorial production by Silvia Ceballos
Cover art production by Joseph Daigle/Studio Productions
Printed in the United States of America by The Printing House
IEEE Computer Society Conference Publishing Services (CPS), http://www.computer.org/cps

Message from the General Chairs
Message from the Program Committee Chairs
Conference Organizers
Program Committee
Reviewers
Brazilian Computer Society (SBC)

Session 1 Applications I

Multi-level Parallelism in the Computational Modeling of the Heart
  Carolina Xavier, Rafael Sachetto, Vinicius Vieira, Rodrigo Weber dos Santos, and Wagner Meira Jr.
Computational Characteristics of Production Seismic Migration and its Performance on Novel Processor Architectures
  Jairo Panetta, Paulo R. P. de Souza Filho, Carlos A. da Cunha Filho, Fernando M. Roxo da Motta, Silvio S. Pinheiro, Ivan Pedrosa Junior, Andre L. R. Rosa, Luiz R. Monnerat, Leandro T. Carneiro, and Carlos H. B. de Albrecht
Voice Command Recognition with Dynamic Time Warping (DTW) using Graphics Processing Units (GPU) with Compute Unified Device Architecture (CUDA)
  Gustavo Poli, Alexandre L. M. Levada, João F. Mari, and José Hiroki Saito
Exploring Novel Parallelization Technologies for 3-D Imaging Applications
  Diego Rivera, Dana Schaa, Micha Moffie, and David Kaeli

Session 2 Microarchitecture

Low-cost Techniques for Reducing Branch Context Pollution in a Soft Realtime Embedded Multithreaded Processor
  Emre Özer, Alastair Reid, and Stuart Biles
Self-Imposed Temporal Redundancy: An Efficient Technique to Enhance the Reliability of Pipelined Functional Units
  Elias Mizan, Tileli Amimeur, and Margarida F. Jacome
Predicting Loop Termination to Boost Speculative Thread-Level Parallelism in Embedded Applications
  Mafijul Md. Islam
Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors
  Rafael Ubal, Julio Sahuquillo, Salvador Petit, and Pedro López

Session 3 Applications II

Performance Improvement of the Parallel Lattice Boltzmann Method through Blocked Data Distributions
  Claudio Schepke and Nicolas Maillard
A Scalable Parallel Deduplication Algorithm
  Walter Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr., Altigran S. Da Silva, Renato Ferreira, and Dorgival Guedes
A Multigrid-Schwarz Method for the Solution of Hydrodynamics and Heat Transfer Problems in Unstructured Meshes
  Guilherme Galante, Rogério L. Rizzi, and Tiarajú A. Diverio

Session 4 Benchmarking, Performance Measurements and Analysis

Performance Evaluation of the Dual-Core Based SGI Altix 4700
  Rod Fatoohi
Impacts of Multiprocessor Configurations on Workloads in Bioinformatics
  Youfeng Wu, Mauricio Breternitz Jr., and Victor Ying

Session 5 Application-Specific Architectures

Efficient Hardware for Modular Exponentiation Using the Sliding-Window Method with Variable-Length Partitioning
  Nadia Nedjah and Luiza de Macedo Mourelle
Optimized Math Functions for a Fixed-Point DSP Architecture
  Karlo G. Lenzi and Osamu Saotome

Session 6 Grid Computing

A Component-Oriented Support for Hierarchical MPI Programming on Multi-cluster Grid Environments
  Elton Nicoletti Mathias, Vincent Cave, Francoise Baude, and Nicolas Maillard
A Selector of Grid Resources based on the Semantic Integration of Multiple Ontologies
  Alexandre P.C. Silva and Mario A.R. Dantas
A Novel Algorithm for Indirect Reputation-Based Grid Resource Management
  Javier Echaiz, Jorge R. Ardenghi, and Guillermo R. Simari

Session 7 Cache and Memory Architectures

Register File Energy Optimization for Snooping Based Clustered VLIW Architectures
  Rahul Nagpal and Y. N. Srikant
Queue Register File Optimization Algorithm for QueueCore Processor
  Arquimedes Canedo, Ben Abderazek, and Masahiro Sowa
An Intelligent Mechanism to Explore a Two-Level Cache Hierarchy Considering Energy Consumption and Time Performance
  Abel G. Silva-Filho, Carmelo J. A. Bastos-Filho, Ricardo M.F. Lima, Davi M.A. Falcão, Filipe R. Cordeiro, and Marília P. Lima
A Code Compression Method to Cope with Security Hardware Overheads
  Eduardo Wanderley Netto, Romain Vaslin, Guy Gogniat, and Jean-Philippe Diguet

Session 8 Interconnection Networks, Routing, and Communication

Architectural Breakdown of End-to-End Latency in a TCP/IP Network
  Steen Larsen, Parthasarathy Sarangam, and Ram Huggahalli
Performance Analysis and Linear Optimization Modeling of All-to-all Collective Communication Algorithms
  Hyacinthe N. Mamadou, Guilherme de Melo B. Domingues, Takeshi Nanri, and Kazuaki Murakami
Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP)
  Seung Eun Lee, Jun Ho Bahn, and Nader Bagherzadeh

Session 9 Tools for Parallel and Distributed Programming

Node Level Primitives for Parallel Exact Inference
  Yinglong Xia and Viktor Prasanna
Fault-Tolerance in Filter-Labeled-Stream Applications
  Bruno Coutinho, Dorgival Guedes, Wagner Meira Jr., and Renato A. Ferreira
High-Level Service Connectors for Component-Based High Performance Computing
  Francisco H. de Carvalho-Junior, Ricardo C. Corrêa, Gisele A. Araújo, Jefferson C. Silva, and Rafael D. Lins

Session 10 Load Balancing and Scheduling

On-Line Scheduling of MPI-2 Programs with Hierarchical Work Stealing
  Guilherme P. Pezzi, Márcia C. Cera, Elton Mathias, Nicolas Maillard, and Philippe O. A. Navaux
Exigency-Based Real-Time Scheduling Policy to Provide Absolute QoS for Web Services
  Lucas S. Casagrande, Rodrigo F. de Mello, Ricardo Bertagna, José A. Andrade Filho, and Francisco J. Monaco
DTA-C: A Decoupled Multi-threaded Architecture for CMP Systems
  Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic
Automatic Constraint Partitioning to Speed up CLP Execution
  Marluce R. Pereira, Patrícia K. Vargas, Maria Clícia S. de Castro, Felipe M. G. França, and Inês de Castro Dutra

Author Index

Message from the Program Committee Chairs
SBAC-PAD

On behalf of the Program Committee, we are pleased to welcome you to the 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2007). SBAC-PAD is an annual international conference series, the first of which was held 20 years ago, that has traditionally presented the state of the art, latest trends and new developments in computer architecture design, parallel and distributed technologies and high performance applications.
We would first like to thank the Brazilian Computer Society, the IEEE Computer Society, the Technical Committees on Computer Architecture (TCCA) and Scalable Computing (TCSC), and the International Federation for Information Processing (IFIP) for their continued support and sponsorship of SBAC-PAD 2007. Thanks also to the 66 members of the Program Committee, all recognized experts in their fields, from around the world, who volunteered to participate in the selection of papers. Their help, together with that of the Organizing Committee, in initially publicizing the symposium had a significant impact on the number and quality of paper submissions.

Authors were invited to submit manuscripts that presented original unpublished research in all areas of computer architecture and high performance computing. Work focusing on applications or emerging technologies was especially welcome. Even with the plethora of coinciding conferences, SBAC-PAD 2007 received 107 submissions from industry and academia located in 24 countries, a reflection of its current international standing. We would thus like to thank the authors for their contributions and, of course, the 142 reviewers for the time and effort they took to diligently review the submissions. After a rigorous peer-review process (most papers had five reviews, and every paper had at least three), we chose 32 high-quality papers (fewer than a third of the submissions) on research work from institutions in 10 countries (15 full papers from Brazil, 7 from the USA, 2 each from Japan and France, and one each from Argentina, India, Italy, Spain, Sweden, and the UK).

In addition to these regular papers, the scientific and technical program includes several keynote presentations from experts who have kindly agreed to share their knowledge and wisdom on a variety of state-of-the-art issues. To further stimulate discussion among attendees, six workshops have been organized, covering a range of “hot” topics.
Finally, a number of our industrial sponsors will also be presenting some of their insights on leading edge technologies. Thanks to all of the contributors and participants who together have created an excellent scientific and technical program.

A number of people have endeavored to help organize this outstanding program and to make SBAC-PAD 2007 a resounding success. We would like to offer our sincerest thanks to them all; in particular, thanks must go to Professor Philippe Navaux, this year’s General Chair, as well as the Steering Committee for their guidance and support, and to Lucas Schnorr for his substantial assistance with the development and configuration of the professional conference web pages.

Once again, welcome to Gramado and to SBAC-PAD 2007! We are sure that you will enjoy both the scientific and the social programs of the conference. We look forward to the many stimulating discussions and presentations.

SBAC-PAD 2007 Co-chairs
Jean-Luc Gaudiot
Vinod Rebello

Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors

R. Ubal, J. Sahuquillo, S. Petit and P. López
Universidad Politécnica de Valencia
Camino de Vera s/n, 46021 Valencia, Spain
raurte@gap.upv.es

19th International Symposium on Computer Architecture and High Performance Computing
1550-6533/07 $25.00 © 2007 IEEE
DOI 10.1109/SBAC-PAD.2007.17

Abstract

Current microprocessors are based on complex designs, integrating different components on a single chip, such as hardware threads, processor cores, memory hierarchy or interconnection networks. The constant need to evaluate new designs for each of these components motivates the development of tools which simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of upcoming systems, and is intended to address the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.

1 Introduction

The evolution of microprocessors, mainly enabled by technology advances, has led to complex designs that combine multiple physical processing units in a single chip. These designs give the operating system (OS) the view of having multiple processors, and thus different software processes can be scheduled at the same time. This processor model consists of three major components: the microprocessor core, the cache hierarchy, and the interconnection network. A design improvement on any of these components will result in a performance gain over the whole system. Therefore, current processor architecture trends bring many opportunities for researchers to investigate novel microarchitectural proposals. Some design issues for these components are outlined below.

Concerning processor cores, deep and wide pipelines have been designed, aimed at exploiting the high amount of instruction level parallelism (ILP) present in current workloads. On the other hand, thread level parallelism (TLP) makes it possible to exploit additional sources of independent instructions to increase the utilization of processor resources. This idea, together with the overcoming of hardware constraints, has resulted in chip multiprocessors (CMPs), which include several cores in a single chip [1].

With respect to the memory hierarchy, its design is a major concern in current and upcoming microprocessors, since long memory latencies frequently act as a performance bottleneck. Current on-chip parallel processing models provide new cache access patterns and offer the possibility of either replicating or sharing caches among processing elements. This fact raises the need to evaluate tradeoffs between the memory hierarchy configuration and the structure of processor cores and threads.

Finally, interconnection networks (or interconnects) serve as the communication medium for processor components (mainly processor cores). In an environment where caches from different processors share memory blocks, the interconnect is in charge of transmitting coherence messages generated by the cache controllers. Research in this field tries to increase network performance by focusing on new topologies, switching and flow control mechanisms, routing algorithms or fault tolerance techniques.

In this paper we present Multi2Sim, which integrates processor cores, memory hierarchy and interconnection network in a tool that enables their evaluation. The rest of this paper is organized as follows. Section 2 presents an overview of existing processor simulators. Section 3 describes the Multi2Sim structure, while Section 4 discusses the integrated features to support multithreading and multicore simulation. Examples including simulation results are shown in Section 5. Finally, Section 6 presents some concluding remarks.

2 Related Work

Multiple simulation environments aimed at evaluating computer architecture proposals have been developed. The most widely used simulator in recent years has been SimpleScalar [2], which models an out-of-order superscalar processor. Many extensions have been applied to SimpleScalar to model certain aspects of superscalar processors more accurately. For example, the HotLeakage simulator [3] quantifies leakage energy consumption. SimpleScalar is quite difficult to extend to model new parallel microarchitectures without significantly changing its structure. In spite of this fact, various SimpleScalar extensions to support multithreading have been implemented, e.g. SSMT [4], M-Sim [5], or SMTSim [6], but they have the limitation of only executing a set of sequential workloads and implementing a fixed resource sharing strategy among threads.

Multithread and multicore extensions have also been applied to the Turandot simulator [7] [8], which models a PowerPC architecture and has also been used for power measurement (PowerTimer [9]). Turandot extensions to parallel microarchitectures are often cited (e.g., [10]) but are not publicly available.

Both SimpleScalar and Turandot are application-only tools, which directly simulate the behaviour of an application. Such tools have the advantage of isolating the workload execution, so statistics are not affected by the simulation of additional software. The tool proposed in this paper can also be classified as an application-only simulator.

In contrast to the application-only simulators, a set of so-called full-system simulators is available. Such tools are able to boot an unmodified operating system, and applications run on top of it. Although this model provides higher simulation power, it involves a huge computational load and sometimes unnecessary simulation accuracy. Simics [11] is an example of a generic full-system simulator, commonly used for multiprocessor systems simulation, but unfortunately not freely available. A variety of Simics-derived tools has been implemented for specific research purposes in this area. This is the case of GEMS [12], which introduces a timing simulation module to model a complete processor pipeline and a memory hierarchy supporting cache coherence. However, GEMS provides low flexibility for modelling multithreaded designs and integrates no interconnection network model.

An important feature included in some processor simulators is the timing-first approach, provided by GEMS and adopted in Multi2Sim. In such a scheme, a timing module traces the state of the processor pipeline while instructions traverse it, possibly in a speculative state. Then, a functional module is called to actually execute the instructions, so the correct execution paths are always guaranteed by a previously developed robust simulator. The timing-first approach confers efficiency, robustness, and the possibility of performing simulations at different levels of detail. Our proposal adopts timing-first simulation with a functional support that, unlike GEMS, need not simulate a whole operating system, but is still capable of executing parallel workloads, with dynamic thread creation.

The last cited simulator is M5 [13], which provides support for out-of-order SMT-capable CPUs, multiprocessors and cache coherency, and runs in both full-system and application-only modes. The limitations lie once again in the low flexibility of multithreaded pipeline designs.

3 Basic simulator description

Multi2Sim [14] has been developed integrating some significant characteristics of popular simulators, such as separate functional and timing simulation, SMT and multiprocessor support, and cache coherence. Multi2Sim is an application-only tool intended to simulate final MIPS32 executable files. With a MIPS32 cross-compiler (or a MIPS32 machine), users can compile their own program sources and test them under Multi2Sim. This section deals with the process of starting and running an application in a cross-platform environment, and briefly describes the three implemented simulation techniques (functional, detailed and event-driven simulation).

3.1 Program Loading

Program loading is the process in which an executable file is mapped into different virtual memory regions of a new software context, and its register file and stack are initialized to start execution. In a real machine, the operating system is in charge of these actions, but an application-only tool must manage program loading itself during its initialization.

Executable File Loading. The executable files output by gcc follow the ELF (Executable and Linkable Format) specification. An ELF file is made up of a header and a set of sections. Some Linux distributions include the library libbfd, which provides types and functions to list the sections of an ELF file and track their main attributes (starting address, size, flags and content). When the flags of an ELF section indicate that it is loadable, its contents are copied into memory at the corresponding starting address.

Program Stack. The next step of the program loading process is to initialize the process stack.
The aim of the program stack is to store function local variables and parameters. During the program execution, the stack pointer ($sp register) is managed by the own program code. However, when the program starts, it expects some data in it, namely the program arguments and environment variables, which must be placed by the program loader. Register File. The last step is the register file initialization. This includes the $sp register, which has been progressively updated during the stack initialization, and the 63 tional and detailed simulation are independent, the implementation of the machine instructions behaviour can be centralized in a single file (functional simulation), increasing the simulator modularity. In this sense, function calls that activate hardware components (detailed simulation) have an interface that returns the latency required to complete their access. Nevertheless, this latency is not a deterministic value in some situations, so it cannot be calculated when the function call is performed. Instead, it must be simulated cycle by cycle. This is the case of interconnects and caches, where an access can result in a message transfer, whose delay cannot be computed a priori, justifying the need of an independent event-driven simulation engine. PC and NPC registers. The initial value of the PC register is specified in the ELF header of the executable file as the program entry point. The NPC register is not explicitly defined in the MIPS32 architecture, but it is used internally by the simulator to handle the branch delay slot. 3.2 Simulation Model Multi2Sim uses three different simulation models, embodied in different modules: a functional simulation engine, a detailed simulator and an event-driven module —the latter two perform the timing simulation. To describe them, the term context will be used hereafter to denote a software entity, defined by the status of a virtual memory image and a logical register file. 
In contrast, the term thread will refer to a processor hardware entity comprising a physical register file, a set of physical memory pages, a set of entries in the pipeline queues, etc. The three main simulation techniques are described next. 4 Support for Multithreaded and Multicore Architectures This section describes the basic simulator features that provide support for multithreaded and multicore processor modelling. They can be classified in two main groups: those that affect the functional simulation engine (enabling the execution of parallel workloads) and those which involve the detailed simulation module (enabling pipelines with various hardware threads on the one hand, and systems with several cores on the other). Functional Simulation, also called simulator kernel. It is built as an autonomous library and provides an interface to the rest of the simulator. This engine does not know of hardware threads, and owns functions to create/destroy software contexts, perform program loading, enumerate existing contexts, consult their status, execute machine instructions and handle speculative execution. The supported machine instructions follow the MIPS32 specification [15] [16]. This choice was basically motivated by a fixed instruction size and formats, which enable a simple instruction decoding. An important feature of the simulation kernel, inherited from SimpleScalar [2], is the checkpointing capability of the implemented memory module and register file, thinking of an external module that needs to implement speculative execution. In this sense, when a wrong execution path starts, both the register file and memory status are saved, reloading them on the misprediction detection. 4.1 Functional simulation: parallel workloads support The functional engine has been extended to support parallel workloads execution. 
In this context, parallel workloads can be seen as tasks that dynamically create child processes at runtime, carrying out communication and synchronization operations. The supported parallel programming model is the one specified by the widely used POSIX Threads library (pthread) shared memory model [17]. In a multithreaded environment, some studies suggest using a set of sequential workloads [18]. The reason is that multiple resources are shared among hardware threads, and processor throughput can be evaluated more accurately when no contention appears due to communication between contexts. In contrast, multicore processor pipelines are fully replicated, and an important contention point is the interconnection network. The execution of multiple sequential workloads exhibits only some interconnect activity in form of L2-L1 cache transfers, but no coherence actions can occur between processes having disjoint memory maps. Thus, in order to evaluate multicore processors, it makes sense to support and run parallel workloads with shared memory locations, whose distributed access can stress the interconnection network. Detailed Simulation. The Multi2Sim detailed simulator uses the functional engine to perform a timing-first [12] simulation: in each cycle, a sequence of calls to the kernel updates the state of existing contexts. The detailed simulator analyzes the nature of the recently executed machine instructions and accounts the operation latencies incurred by hardware structures. The main simulated hardware consists of pipeline structures (stage resources, instruction queue, load-store queue, reorder buffer...), branch predictor (modelling a combined bimodal-gshare predictor), cache memories (with variable size, associativity and replacement policy), memory management unit, and segmented functional units of configurable latency. Event-Driven Simulation. 
In a scheme where func- 64 Actual parallel workloads require special hardware support (machine instructions), as well as low level software support (system calls) that enable threads spawning, synchronization and termination. Each of these issues are described below, jointly with a brief description of the POSIX threads management: Instruction set support. When the processor hardware supports concurrent threads execution, the parallel programming requirement that directly affects its architecture is the existence of critical sections, which cannot be executed simultaneously by more than one thread. CMPs or multithreaded processors must stall the activity of a hardware thread when it tries to enter a critical section occupied by other thread. The MIPS32 approach implements the mutual exclusion mechanism by means of two machine instructions (LL and SC), defining the concept of RMW (read-modify-write) sequence [16]. An RMW sequence is a set of instructions, embraced by a pair LL-SC that run atomically on a multiprocessor system. The cited machine instructions do not enforce an RMW sequence, but the output value of SC informs of the RMW success or failure. Figure 1. Examples of pipeline organizations The fetch stage takes instructions from the L1 instruction cache and places them into an IFQ (instruction fetch queue). The decode/rename stage takes instructions from an IFQ, decodes them, renames their registers and assigns them a slot in the ROB (reorder buffer) and IQ (instruction queue). Then, the issue stage consumes instructions from the IQ and sends them to the corresponding functional unit. During the execution stage, the functional units operate and write their results back into the register file. Finally, the commit stage retires instructions from the ROB in program order. 
This architecture is analogous to the one modelled by the SimpleScalar tool set [2], but uses a ROB, an IQ (instruction queue) and a physical register file, instead of the RUU (register update unit). Operating system support. Tracing the execution of a parallel workload, the operating system support required by pthread is formed of system calls i) to spawn/destroy a thread (clone, exit group), ii) to wait for child threads (waitpid), iii) to communicate and synchronize threads with system pipes (pipe, read, write, poll) and iv) to wake up suspended threads using system signals (sigaction, sigprocmask, sigsuspend, kill). Figure 1 illustrates two possible pipeline organizations. In a) all stages are shared among threads, while in b) all stages (except execute) are replicated as many times as supported hardware threads. Multi2Sim allows to evaluate different stage sharing strategies, as well as different algorithms that schedule stage resources in each cycle. Depending on the stages sharing and thread selection policies, a multithread processor can be classified as finegrain (FGMT), coarse-grain (CGMT) or simultaneous multithread (SMT). POSIX Threads parallelism management. Applications programmed with pthread can be simulated without changes using Multi2Sim. This library introduces user code which handles parallelism by means of the described subset of machine instructions and system calls. However, the fact of having thread management code mingled with application code must be taken into account, as it constitutes a certain overhead which could affect final results. Further details on this consideration can be found in [14]. 4.2 A FGMT processor switches threads on a fixed schedule, typically on every processor cycle. In contrast, a CGMT processor is characterized by a thread switch induced by a long latency operation or a thread quantum expiration. 
Finally, an SMT processor extends the previous designs with a more aggressive instruction issue policy, which is able to issue instructions from different threads in a single cycle. The simulator parameters that specify how pipeline stages are shared among threads, and thus the kind of multithreading, are summarized in Table 1. Again, [14] gives a detailed description of all the values these parameters may take.

4.2 Detailed simulation: Multithreading support

Multi2Sim supports a set of parameters that specify how stages are organized in a multithreaded design. Stages can be shared among threads or private per thread [19] (except the execute stage, which is shared by definition of multithreading). Moreover, when a stage is shared, an algorithm must select which thread uses the stage in every cycle. The modelled pipeline is divided into five stages: fetch, decode/rename, issue, execute and commit.

Table 1. Combination of parameters for different multithread configurations

                 FGMT                         CGMT              SMT
fetch kind       timeslice                    switchonevent     timeslice/multiple
fetch priority   -                            -                 equal/icount
decode kind      shared/timeslice/replicated  shared/timeslice  shared/timeslice/replicated
issue kind       timeslice                    timeslice         shared/replicated
retire kind      timeslice                    timeslice         timeslice/replicated

4.3 Detailed simulation: Multicore support

A multicore simulation environment is basically achieved by replicating the data structures that represent a single processor core. The zone of shared resources in a multicore processor starts with the memory hierarchy. When caches are shared among cores, contention can arise when they are accessed simultaneously. In contrast, when they are private per core, a coherence protocol (e.g. MOESI [20]) is implemented to guarantee memory consistency. In its current version, Multi2Sim implements a split-transaction bus as the interconnection network, extensible to any other on-chip network topology. The number of interconnects and their location vary depending on the sharing strategy of data and instruction caches. Figure 2 shows three possible schemes for sharing L1 and L2 caches (t = private per thread, c = private per core, s = shared), and the resulting interconnects for a dual-core dual-thread processor.

Figure 2. Evaluated cache distribution designs

5 Results

This section presents some simulation experiments using Multi2Sim, illustrating the application of the simulator on the one hand, and checking its correctness on the other. These experiments i) test different multithread pipeline configurations, ii) explore different bus widths and iii) trace the network traffic when executing a parallel workload. In all cases, the simulated machine includes separate 64KB L1 instruction and data caches, a 1MB unified L2 cache shared among threads, private physical register files of 128 entries, and a fetch, decode, issue and commit width of 8 instructions per cycle.

i) Multithread Pipeline Organizations. Figure 3 shows the results for four different multithreaded implementations: FGMT, CGMT, SMT with equal thread priorities and SMT with ICOUNT (which gives priority to the threads with fewer instructions in the pipeline [21]). Figure 3a shows the average number of instructions issued per cycle, while Figure 3b represents the global IPC (i.e., the sum of the IPCs achieved by the different threads), executing benchmark 176.gcc from the SPEC2000 suite with one instance per hardware thread and varying the number of threads. Results are in accordance with those published by Tullsen et al. [18], where CGMT and FGMT processors perform slightly better as the number of threads increases up to four. In addition, an SMT processor shows not only higher performance for any number of threads, but also higher scalability, both with equal and with variable thread priorities.

Figure 3. Issue rate and IPC with different multithreaded designs. a) Issue rate (instructions issued per cycle vs. number of threads); b) IPC (throughput vs. number of threads). Series: cgmt, fgmt, smt_equal, smt_icount.

ii) Bus Width Evaluation. This experiment shows how the bus width affects processor performance through the number of contention cycles during data transfers. For this test, we assume MOESI requests of 8 bytes and cache blocks of 64 bytes, so network messages are either 8 bytes (only a MOESI request) or 72 bytes (MOESI request + block data). The executed workload is fft, which belongs to SPLASH2, a suite of parallel benchmarks. Figure 4 represents the average contention cycles per transfer. Because no message larger than 72 bytes is transferred, a bus width of at least 72 bytes is required to send any message in a single bus cycle and minimize contention. However, results show that a bus width more than three times smaller provides (for this workload) almost the same benefits.

Figure 4. Performance for different bus widths simulating fft. a) Bus contention (contention vs. L1-L2 bus width in bytes); b) Processor performance (IPC vs. L1-L2 bus width in bytes).

iii) Interconnect Traffic Evaluation. This experiment shows the activity of the interconnection network during the execution of the fft benchmark with the same processor configuration described above, for a 16-byte bus width. Figure 5a represents the fraction of total bus bandwidth used in the network connecting the L1 caches and the common L2 cache, taking intervals of 10^4 cycles. Figure 5b represents the same metric for the interconnect between L2 and main memory (MM). As one can see, the traffic distribution is quite irregular, with peaks of interconnect activity in some execution intervals.

Figure 5. Traffic distribution in L1-L2 and L2-MM interconnects. a) L1-L2 interconnect; b) L2-MM interconnect (fraction of used bandwidth vs. processor cycles in millions).

6 Conclusions

In this paper, we presented Multi2Sim, a simulation framework that integrates important features of existing simulators and extends them to provide additional functionality. Among the features adopted from other tools are the basic pipeline architecture (SimpleScalar), timing-first simulation (Simics-GEMS) and support for cache coherence protocols. Among the extensions of Multi2Sim are the simulation of sharing strategies of pipeline stages, memory hierarchy configurations, multicore-multithread combinations and an integrated interface with the on-chip interconnection network. These features make Multi2Sim suitable for the evaluation of state-of-the-art processors, covering hot topics in the computer architecture field. In this paper, we showed some guiding examples of how to use these simulator characteristics. The source code of Multi2Sim is written in C and can be downloaded at [14].

Acknowledgements

This work was supported by CICYT under Grant TIN2006-15516-C04-01, by Consolider-Ingenio 2010 under Grant CSD2006-00046 and by the Generalitat Valenciana under Grant GV06/326.

References

[1] AMD Athlon™ 64 X2 Dual-Core Processor Product Data Sheet. www.amd.com, Sept. 2006.

[2] D.C. Burger and T.M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report CS-TR-1997-1342, 1997.

[3] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects. Univ. of Virginia Dept. of Computer Science Technical Report CS-2003-05, 2003.

[4] D. Madon, E. Sanchez, and S. Monnier. A Study of a Simultaneous Multithreaded Processor Implementation. In European Conference on Parallel Processing, pages 716–726, 1999.

[5] J. Sharkey. M-Sim: A Flexible, Multithreaded Architectural Simulation Environment. Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton, 2005.

[6] D. M. Tullsen. Simulation and Modeling of a Simultaneous Multithreading Processor. 22nd Annual Computer Measurement Group Conference, December 1996.

[7] M. Moudgill, P. Bose, and J. Moreno. Validation of Turandot, a Fast Processor Model for Microarchitecture Exploration. IEEE International Performance, Computing, and Communications Conference, pages 451–457, 1999.

[8] M. Moudgill, J. Wellman, and J. Moreno. Environment for PowerPC Microarchitecture Exploration. IEEE Micro, pages 15–25, 1999.

[9] D. Brooks, P. Bose, V. Srinivasan, M. Gschwind, P. Emma, and M. Rosenfield. Microarchitecture-Level Power-Performance Analysis: The PowerTimer Approach. IBM J. Research and Development, 47(5/6), 2003.

[10] B. Lee and D. Brooks. Effects of Pipeline Complexity on SMT/CMP Power-Performance Efficiency. Workshop on Complexity Effective Design, 2005.

[11] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2), 2002.

[12] M. R. Marty, B. Beckmann, L. Yen, A. R. Alameldeen, M. Xu, and K. Moore. GEMS: Multifacet's General Execution-driven Multiprocessor Simulator. International Symposium on Computer Architecture, 2006.

[13] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-Oriented Full-System Simulation Using M5. Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), pages 36–43, Feb. 2003.

[14] www.gap.upv.es/~raurte/tools/multi2sim.html. R. Ubal Homepage – Tools – Multi2Sim.

[15] MIPS Technologies, Inc. MIPS32™ Architecture For Programmers, Volume I: Introduction to the MIPS32™ Architecture. 2001.

[16] MIPS Technologies, Inc. MIPS32™ Architecture For Programmers, Volume II: The MIPS32™ Instruction Set. 2001.

[17] D. R. Butenhof. Programming with POSIX Threads. Addison Wesley Professional, 1997.

[18] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proceedings of the 22nd International Symposium on Computer Architecture, pages 392–403, June 1995.

[19] J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. July 2004.

[20] P. Sweazey and A.J. Smith. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. 13th Int'l Symp. Computer Architecture, pages 414–423, June 1986.

[21] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In ISCA, pages 191–202, 1996.