Full PDF - Multi2Sim

19th International Symposium on Computer Architecture and High Performance Computing
Gramado, Brazil, 2007
Published by the IEEE Computer Society
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
IEEE Computer Society Order Number P3014
ISBN 0-7695-3014-1
ISSN 1550-6533
Copyright © 2007 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.
Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy
beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at
the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright
Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service
Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect
the authors’ opinions and, in the interests of timely dissemination, are published as presented and without change.
Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer
Society, or the Institute of Electrical and Electronics Engineers, Inc.
IEEE Computer Society Order Number P3014
ISBN 0-7695-3014-1
ISBN 978-0-7695-3014-7
ISSN 1550-6533
Additional copies may be ordered from:
IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: + 1 800 272 6657
Fax: + 1 714 821 4641
http://computer.org/cspress
csbooks@computer.org
IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: + 1 732 981 0060
Fax: + 1 732 981 9667
http://shop.ieee.org/store/
customer-service@ieee.org
IEEE Computer Society
Asia/Pacific Office
Watanabe Bldg., 1-4-2
Minami-Aoyama
Minato-ku, Tokyo 107-0062
JAPAN
Tel: + 81 3 3408 3118
Fax: + 81 3 3408 3553
tokyo.ofc@computer.org
Individual paper REPRINTS may be ordered at: <reprints@computer.org>
Editorial production by Silvia Ceballos
Cover art production by Joseph Daigle/ Studio Productions
Printed in the United States of America by The Printing House
IEEE Computer Society
Conference Publishing Services (CPS)
http://www.computer.org/cps
19th International Symposium on
Computer Architecture and High Performance Computing
SBAC-PAD
Message from the General Chairs
Message from the Program Committee Chairs
Conference Organizers
Program Committee
Reviewers
Brazilian Computer Society (SBC)
Session 1
Applications I
Multi-level Parallelism in the Computational Modeling of the Heart
Carolina Xavier, Rafael Sachetto, Vinicius Vieira, Rodrigo Weber dos Santos, and Wagner Meira Jr.
Computational Characteristics of Production Seismic Migration and its Performance on Novel Processor Architectures
Jairo Panetta, Paulo R. P. de Souza Filho, Carlos A. da Cunha Filho, Fernando M. Roxo da Motta, Silvio S. Pinheiro, Ivan Pedrosa Junior, Andre L. R. Rosa, Luiz R. Monnerat, Leandro T. Carneiro, and Carlos H. B. de Albrecht
Voice Command Recognition with Dynamic Time Warping (DTW) using Graphics Processing Units (GPU) with Compute Unified Device Architecture (CUDA)
Gustavo Poli, Alexandre L. M. Levada, João F. Mari, and José Hiroki Saito
Exploring Novel Parallelization Technologies for 3-D Imaging Applications
Diego Rivera, Dana Schaa, Micha Moffie, and David Kaeli
Session 2
Microarchitecture
Low-cost Techniques for Reducing Branch Context Pollution in a Soft Realtime Embedded Multithreaded Processor
Emre Özer, Alastair Reid and Stuart Biles
Self-Imposed Temporal Redundancy: An Efficient Technique to Enhance the Reliability of Pipelined Functional Units
Elias Mizan, Tileli Amimeur, and Margarida F. Jacome
Predicting Loop Termination to Boost Speculative Thread-Level Parallelism in Embedded Applications
Mafijul Md. Islam
Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors
Rafael Ubal, Julio Sahuquillo, Salvador Petit, and Pedro López
Session 3
Applications II
Performance Improvement of the Parallel Lattice Boltzmann Method through Blocked Data Distributions
Claudio Schepke and Nicolas Maillard
A Scalable Parallel Deduplication Algorithm
Walter Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr., Altigran S. Da Silva, Renato Ferreira and Dorgival Guedes
A Multigrid-Schwarz Method for the Solution of Hydrodynamics and Heat Transfer Problems in Unstructured Meshes
Guilherme Galante, Rogério L. Rizzi, and Tiarajú A. Diverio
Session 4
Benchmarking, Performance Measurements and Analysis
Performance Evaluation of the Dual-Core Based SGI Altix 4700
Rod Fatoohi
Impacts of Multiprocessor Configurations on Workloads in Bioinformatics
Youfeng Wu, Mauricio Breternitz Jr., and Victor Ying
Session 5
Application-Specific Architectures
Efficient Hardware for Modular Exponentiation Using the Sliding-Window Method with Variable-Length Partitioning
Nadia Nedjah and Luiza de Macedo Mourelle
Optimized Math Functions for a Fixed-Point DSP Architecture
Karlo G. Lenzi and Osamu Saotome
Session 6
Grid Computing
A Component-Oriented Support for Hierarchical MPI Programming on Multi-cluster Grid Environments
Elton Nicoletti Mathias, Vincent Cave, Francoise Baude, and Nicolas Maillard
A Selector of Grid Resources based on the Semantic Integration of Multiple Ontologies
Alexandre P.C Silva and Mario A.R. Dantas
A Novel Algorithm for Indirect Reputation-Based Grid Resource Management
Javier Echaiz, Jorge R. Ardenghi, and Guillermo R. Simari
Session 7
Cache and Memory Architectures
Register File Energy Optimization for Snooping Based Clustered VLIW Architectures
Rahul Nagpal and Y. N. Srikant
Queue Register File Optimization Algorithm for QueueCore Processor
Arquimedes Canedo, Ben Abderazek, and Masahiro Sowa
An Intelligent Mechanism to Explore a Two-Level Cache Hierarchy Considering Energy Consumption and Time Performance
Abel G. Silva-Filho, Carmelo J. A. Bastos-Filho, Ricardo M.F. Lima, Davi M.A. Falcão, Filipe R. Cordeiro, and Marília P. Lima
A Code Compression Method to Cope with Security Hardware Overheads
Eduardo Wanderley Netto, Romain Vaslin, Guy Gogniat, and Jean-Philippe Diguet
Session 8
Interconnection Networks, Routing, and Communication
Architectural Breakdown of End-to-End Latency in a TCP/IP Network
Steen Larsen, Parthasarathy Sarangam, and Ram Huggahalli
Performance Analysis and Linear Optimization Modeling of All-to-all Collective Communication Algorithms
Hyacinthe N. Mamadou, Guilherme de Melo B. Domingues, Takeshi Nanri, and Kazuaki Murakami
Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP)
Seung Eun Lee, Jun Ho Bahn, and Nader Bagherzadeh
Session 9
Tools for Parallel and Distributed Programming
Node Level Primitives for Parallel Exact Inference
Yinglong Xia and Viktor Prasanna
Fault-Tolerance in Filter-Labeled-Stream Applications
Bruno Coutinho, Dorgival Guedes, Wagner Meira Jr., and Renato A. Ferreira
High-Level Service Connectors for Component-Based High Performance Computing
Francisco H. de Carvalho-Junior, Ricardo C. Corrêa, Gisele A. Araújo, Jefferson C. Silva, and Rafael D. Lins
Session 10
Load Balancing and Scheduling
On-Line Scheduling of MPI-2 Programs with Hierarchical Work Stealing
Guilherme P. Pezzi, Márcia C. Cera, Elton Mathias, Nicolas Maillard, and Philippe O. A. Navaux
Exigency-Based Real-Time Scheduling Policy to Provide Absolute QoS for Web Services
Lucas S. Casagrande, Rodrigo F. de Mello, Ricardo Bertagna, José A. Andrade Filho, and Francisco J. Monaco
DTA-C: A Decoupled Multi-threaded Architecture for CMP Systems
Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic
Automatic Constraint Partitioning to Speed up CLP Execution
Marluce R. Pereira, Patrícia K. Vargas, Maria Clícia S. de Castro, Felipe M. G. França, and Inês de Castro Dutra
Author Index
Message from the Program Committee Chairs
SBAC-PAD
On behalf of the Program Committee, we are pleased to welcome you to the 19th International
Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2007). SBAC-PAD
is an annual international conference series, the first of which was held 20 years ago, that has
traditionally presented the state of the art, latest trends and new developments in computer architecture
design, parallel and distributed technologies and high performance applications.
We would first like to thank the Brazilian Computer Society, the IEEE Computer Society, the Technical
Committees on Computer Architecture (TCCA) and Scalable Computing (TCSC), and International
Federation for Information Processing (IFIP) for their continued support and sponsorship of SBAC-PAD
2007.
Thanks also to the 66 members of the Program Committee, all recognized experts in their fields, from
around the world who volunteered to participate in the selection of papers. Their help, together with that
of the Organizing Committee, in initially publicizing the symposium had a significant impact on the number
and quality of paper submissions.
Authors were invited to submit manuscripts that presented original unpublished research in all areas of
computer architecture and high performance computing. Work focusing on applications or emerging
technologies was especially welcome. Even with the plethora of coinciding conferences, SBAC-PAD
2007 received 107 submissions from industry and academia located in 24 countries, a reflection of its
current international standing. We would thus like to thank the authors for their contributions and, of
course, the 142 reviewers for the time and effort they took to diligently review the submissions. After a
rigorous peer-review process, in which most papers had five reviews and every paper had at least three, we chose
32 high quality papers (fewer than a third of the submissions) on research work from institutions in 10
countries (15 full papers from Brazil, 7 from the USA, 2 each from Japan and France, and one each from
Argentina, India, Italy, Spain, Sweden, and the UK).
In addition to these regular papers, the scientific and technical program includes several keynote
presentations from experts who have kindly agreed to share their knowledge and wisdom on a variety
of state-of-the-art issues. To further stimulate discussion among attendees, six workshops have been
organized, covering a range of “hot” topics. Finally, a number of our industrial sponsors will also be
presenting some of their insights on leading-edge technologies. Thanks to all of the contributors and
participants who together have created an excellent scientific and technical program.
A number of people have endeavored to help organize this outstanding program and to make SBAC-PAD
2007 a resounding success. We would like to offer our sincerest thanks to them all; in particular thanks
must go to Professor Philippe Navaux, this year’s General Chair, as well as the Steering Committee for
their guidance and support and to Lucas Schnorr for his substantial assistance with the development and
configuration of the professional conference web pages.
Once again, welcome to Gramado and to SBAC-PAD 2007! We are sure that you will enjoy both the
scientific and the social program of the conference. We look forward to the many stimulating
discussions and presentations.
SBAC-PAD 2007 Co-chairs
Jean-Luc Gaudiot
Vinod Rebello
Multi2Sim: A Simulation Framework to Evaluate
Multicore-Multithreaded Processors
R. Ubal, J. Sahuquillo, S. Petit and P. López
Universidad Politécnica de Valencia
Camino de Vera s/n 46021 Valencia, Spain
raurte@gap.upv.es
Abstract
Current microprocessors are based on complex designs that integrate different components on a single chip, such as hardware threads, processor cores, memory hierarchy and interconnection networks. The permanent need to evaluate new designs for each of these components motivates the development of tools that simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of upcoming systems and is intended to overcome the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.
1 Introduction
The evolution of microprocessors, mainly enabled by technology advances, has led to complex designs that combine multiple physical processing units in a single chip. These designs give the operating system (OS) the view of having multiple processors, so different software processes can be scheduled at the same time.
This processor model consists of three major components: the microprocessor core, the cache hierarchy, and the interconnection network. A design improvement in any of these components can yield a performance gain for the whole system. Therefore, current processor architecture trends bring many opportunities for researchers to investigate novel microarchitectural proposals. Some design issues concerning these components are outlined below.
Concerning processor cores, deep and wide pipelines have been designed, aimed at exploiting the high amount of instruction level parallelism (ILP) present in current workloads. On the other hand, thread level parallelism (TLP) makes it possible to exploit additional sources of independent instructions to increase the utilization of processor resources. This idea, together with the overcoming of hardware constraints, resulted in chip multiprocessors (CMPs), which include several cores on a single chip [1].
The design of the memory hierarchy is a major concern in current and upcoming microprocessors, since long memory latencies frequently act as a performance bottleneck. Current on-chip parallel processing models produce new cache access patterns and offer the possibility of either replicating or sharing caches among processing elements. This raises the need to evaluate tradeoffs between the memory hierarchy configuration and the structure of processor cores and threads.
Finally, interconnection networks (or interconnects) serve as the communication medium for processor components (mainly processor cores). In an environment where caches from different processors share memory blocks, the interconnect is in charge of transmitting the coherence messages generated by the cache controllers. Research in this field tries to increase network performance by focusing on new topologies, switching and flow control mechanisms, routing algorithms or fault tolerance techniques.
In this paper we present Multi2Sim, which integrates processor cores, memory hierarchy and interconnection network in a tool that enables their evaluation. The rest of this paper is organized as follows. Section 2 presents an overview of existing processor simulators. Section 3 describes the Multi2Sim structure, while Section 4 discusses the integrated features to support multithreading and multicore simulation. Examples including simulation results are shown in Section 5. Finally, Section 6 presents some concluding remarks.
1550-6533/07 $25.00 © 2007 IEEE
DOI 10.1109/SBAC-PAD.2007.17
2 Related Work
Multiple simulation environments aimed at evaluating computer architecture proposals have been developed. The most widely used simulator in recent years has been SimpleScalar [2], which models an out-of-order superscalar processor. Many extensions have been applied to SimpleScalar to model certain aspects of superscalar processors more accurately. For example, the HotLeakage simulator [3] quantifies leakage energy consumption.
SimpleScalar is quite difficult to extend to model new parallel microarchitectures without significantly changing its structure. In spite of this, various SimpleScalar extensions to support multithreading have been implemented, e.g. SSMT [4], M-Sim [5], or SMTSim [6], but they are limited to executing a set of sequential workloads and implement a fixed resource sharing strategy among threads.
Multithread and multicore extensions have also been applied to the Turandot simulator [7] [8], which models a PowerPC architecture and has also been used for power measurement (PowerTimer [9]). Turandot extensions to parallel microarchitectures are often cited (e.g., [10]) but are not publicly available.
Both SimpleScalar and Turandot are application-only tools, which directly simulate the behaviour of an application. Such tools have the advantage of isolating the workload execution, so statistics are not affected by the simulation of additional software. The tool proposed in this paper can also be classified as an application-only simulator.
In contrast to the application-only simulators, a set of so-called full-system simulators is available. Such tools are able to boot an unmodified operating system and run applications on top of it. Although this model provides higher simulation power, it involves a huge computational load and sometimes unnecessary simulation accuracy.
Simics [11] is an example of a generic full-system simulator, commonly used for the simulation of multiprocessor systems, but unfortunately it is not freely available. A variety of Simics-derived tools have been implemented for specific research purposes in this area. This is the case of GEMS [12], which introduces a timing simulation module to model a complete processor pipeline and a memory hierarchy supporting cache coherence. However, GEMS offers little flexibility for modelling multithreaded designs and integrates no interconnection network model.
An important feature included in some processor simulators is the timing-first approach, provided by GEMS and adopted in Multi2Sim. In such a scheme, a timing module traces the state of the processor pipeline while instructions traverse it, possibly in a speculative state. Then, a functional module is called to actually execute the instructions, so the correct execution paths are always guaranteed by a previously developed robust simulator. The timing-first approach confers efficiency, robustness, and the possibility of performing simulations at different levels of detail. Our proposal adopts timing-first simulation with a functional support that, unlike GEMS, need not simulate a whole operating system, but is still capable of executing parallel workloads with dynamic thread creation.
The last simulator cited here is M5 [13], which provides support for out-of-order SMT-capable CPUs, multiprocessors and cache coherency, and runs in both full-system and application-only modes. Its limitations lie once again in the low flexibility of multithreaded pipeline designs.
3 Basic simulator description
Multi2Sim [14] has been developed by integrating some significant characteristics of popular simulators, such as separate functional and timing simulation, SMT and multiprocessor support, and cache coherence. Multi2Sim is an application-only tool intended to simulate final MIPS32 executable files. With a MIPS32 cross-compiler (or a MIPS32 machine), users can compile their own program sources and test them under Multi2Sim. This section deals with the process of starting and running an application in a cross-platform environment, and briefly describes the three implemented simulation techniques (functional, detailed and event-driven simulation).
3.1 Program Loading
Program loading is the process in which an executable
file is mapped into different virtual memory regions of a
new software context, and its register file and stack are initialized to start execution. In a real machine, the operating
system is in charge of these actions, but an application-only
tool should manage program loading during its initialization.
Executable File Loading. The executable files output
by gcc follow the ELF (Executable and Linkable Format)
specification. An ELF file is made up of a header and a set
of sections. Some Linux distributions include the library
libbfd, which provides types and functions to list the sections of an ELF file and track their main attributes (starting
address, size, flags and content). When the flags of an ELF
section indicate that it is loadable, its contents are copied
into memory starting at the corresponding address.
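The section-copying step described above can be sketched as follows. This is only an illustration under stated assumptions, not the Multi2Sim loader: it does not parse a real ELF file or call libbfd, and the names `Section`, `SparseMemory`, `load_sections` and the flag value `SEC_LOAD` are hypothetical stand-ins for the attributes the text lists (starting address, size, flags and content).

```python
from dataclasses import dataclass

SEC_LOAD = 0x1  # hypothetical flag bit marking a section as loadable


@dataclass
class Section:
    name: str
    addr: int       # starting virtual address of the section
    flags: int
    content: bytes  # the size is implied by len(content)


class SparseMemory:
    """Byte-addressable virtual memory backed by a dict (address -> byte)."""

    def __init__(self):
        self._bytes = {}

    def write(self, addr, data):
        for offset, value in enumerate(data):
            self._bytes[addr + offset] = value

    def read(self, addr, size):
        # Locations that were never written read as zero.
        return bytes(self._bytes.get(addr + i, 0) for i in range(size))


def load_sections(sections, memory):
    """Copy every loadable section into memory; return the names loaded."""
    loaded = []
    for sec in sections:
        if sec.flags & SEC_LOAD:
            memory.write(sec.addr, sec.content)
            loaded.append(sec.name)
    return loaded
```

Sections without the loadable flag (debug information, comments and the like) are simply skipped, so they never occupy simulated memory.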
Program Stack. The next step of the program loading process is to initialize the process stack. The program stack stores function local variables and parameters. During program execution, the stack pointer ($sp register) is managed by the program's own code. However, when the program starts, it expects some data to be already on the stack, namely the program arguments and environment variables, which must be placed there by the program loader.
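A minimal sketch of such a stack initialization is given below. The text only states that arguments and environment variables must be placed on the stack; the concrete layout used here (strings first, then `argc`, the `argv` pointers, a null word, the `envp` pointers and a final null word, as little-endian 32-bit values with 8-byte alignment) is an assumption modelled on the common Linux convention, and `init_stack` is a hypothetical name.

```python
import struct


def init_stack(memory, stack_top, args, env):
    """Write args/env below stack_top; return the initial stack pointer.

    `memory` is a bytearray standing in for the simulated address space.
    """
    sp = stack_top

    def push_string(text):
        nonlocal sp
        data = text.encode() + b"\0"
        sp -= len(data)
        memory[sp:sp + len(data)] = data
        return sp  # address where the string now lives

    arg_ptrs = [push_string(a) for a in args]
    env_ptrs = [push_string(e) for e in env]

    # Assumed layout: argc, argv[0..n-1], NULL, envp[0..m-1], NULL.
    words = [len(args)] + arg_ptrs + [0] + env_ptrs + [0]
    sp -= 4 * len(words)
    sp &= ~0x7  # keep the stack pointer 8-byte aligned (assumption)
    memory[sp:sp + 4 * len(words)] = struct.pack(f"<{len(words)}I", *words)
    return sp
```

The returned value is what the loader would later copy into the $sp register, so that the program finds `argc` at the top of its stack when it starts.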
Register File. The last step is the register file initialization. This includes the $sp register, which has been progressively updated during the stack initialization, and the PC and NPC registers. The initial value of the PC register is specified in the ELF header of the executable file as the program entry point. The NPC register is not explicitly defined in the MIPS32 architecture, but it is used internally by the simulator to handle the branch delay slot.
3.2 Simulation Model
Multi2Sim uses three different simulation models, embodied in different modules: a functional simulation engine, a detailed simulator and an event-driven module (the latter two perform the timing simulation). To describe them, the term context will be used hereafter to denote a software entity, defined by the status of a virtual memory image and a logical register file. In contrast, the term thread will refer to a processor hardware entity comprising a physical register file, a set of physical memory pages, a set of entries in the pipeline queues, etc. The three main simulation techniques are described next.
Functional Simulation, also called simulator kernel. It is built as an autonomous library and provides an interface to the rest of the simulator. This engine is unaware of hardware threads and provides functions to create and destroy software contexts, perform program loading, enumerate existing contexts, query their status, execute machine instructions and handle speculative execution. The supported machine instructions follow the MIPS32 specification [15] [16]. This choice was motivated mainly by the fixed instruction size and formats, which enable simple instruction decoding.
An important feature of the simulation kernel, inherited from SimpleScalar [2], is the checkpointing capability of the implemented memory module and register file, intended for an external module that needs to implement speculative execution. When a wrong execution path starts, both the register file and the memory state are saved, and they are restored when the misprediction is detected.
Detailed Simulation. The Multi2Sim detailed simulator uses the functional engine to perform a timing-first [12] simulation: in each cycle, a sequence of calls to the kernel updates the state of existing contexts. The detailed simulator analyzes the nature of the recently executed machine instructions and accounts for the operation latencies incurred by hardware structures.
The main simulated hardware consists of pipeline structures (stage resources, instruction queue, load-store queue, reorder buffer, etc.), a branch predictor (modelling a combined bimodal-gshare predictor), cache memories (with variable size, associativity and replacement policy), a memory management unit, and pipelined functional units of configurable latency.
Event-Driven Simulation. In a scheme where functional and detailed simulation are independent, the implementation of the behaviour of machine instructions can be centralized in a single file (functional simulation), increasing the modularity of the simulator. Accordingly, function calls that activate hardware components (detailed simulation) have an interface that returns the latency required to complete their access.
Nevertheless, this latency is not a deterministic value in some situations, so it cannot be calculated when the function call is performed. Instead, it must be simulated cycle by cycle. This is the case for interconnects and caches, where an access can result in a message transfer whose delay cannot be computed a priori, which justifies the need for an independent event-driven simulation engine.
4 Support for Multithreaded and Multicore Architectures
This section describes the basic simulator features that provide support for multithreaded and multicore processor modelling. They can be classified into two main groups: those that affect the functional simulation engine (enabling the execution of parallel workloads) and those that involve the detailed simulation module (enabling pipelines with various hardware threads on the one hand, and systems with several cores on the other).
4.1 Functional simulation: parallel workloads support
The functional engine has been extended to support the execution of parallel workloads. In this context, parallel workloads can be seen as tasks that dynamically create child processes at runtime, carrying out communication and synchronization operations. The supported parallel programming model is the shared memory model specified by the widely used POSIX Threads library (pthread) [17].
In a multithreaded environment, some studies suggest using a set of sequential workloads [18]. The reason is that multiple resources are shared among hardware threads, and processor throughput can be evaluated more accurately when no contention appears due to communication between contexts. In contrast, multicore processor pipelines are fully replicated, and an important contention point is the interconnection network. The execution of multiple sequential workloads exhibits only some interconnect activity in the form of L2-L1 cache transfers, but no coherence actions can occur between processes having disjoint memory maps. Thus, in order to evaluate multicore processors, it makes sense to support and run parallel workloads with shared memory locations, whose distributed access can stress the interconnection network.
Real parallel workloads require special hardware support (machine instructions), as well as low-level software support (system calls) that enables thread spawning, synchronization and termination. Each of these issues is described below, together with a brief description of POSIX threads management:
Instruction set support. When the processor hardware supports concurrent thread execution, the parallel programming requirement that directly affects its architecture is the existence of critical sections, which cannot be executed simultaneously by more than one thread. CMPs or multithreaded processors must stall the activity of a hardware thread when it tries to enter a critical section occupied by another thread.
The MIPS32 approach implements the mutual exclusion mechanism by means of two machine instructions (LL and SC), defining the concept of an RMW (read-modify-write) sequence [16]. An RMW sequence is a set of instructions, enclosed by an LL-SC pair, that runs atomically on a multiprocessor system. These machine instructions do not enforce an RMW sequence by themselves, but the output value of SC indicates whether the RMW sequence succeeded or failed.
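The retry pattern implied by this description can be illustrated with a toy model. The sketch below is not MIPS32 code and not the Multi2Sim implementation: `LLSCMemory`, its per-thread link table and the `interfere` hook are hypothetical devices used only to show how a failed SC forces the RMW sequence to be repeated.

```python
class LLSCMemory:
    """Word memory with one link per thread, as a toy model of LL/SC."""

    def __init__(self):
        self.words = {}
        self.links = {}  # thread id -> address linked by its last LL

    def load_linked(self, tid, addr):
        self.links[tid] = addr
        return self.words.get(addr, 0)

    def store(self, addr, value):
        # Any store to an address breaks every link held on it.
        self.words[addr] = value
        for tid in [t for t, a in self.links.items() if a == addr]:
            del self.links[tid]

    def store_conditional(self, tid, addr, value):
        # SC succeeds (returns 1) only if the link set by LL is intact.
        if self.links.get(tid) != addr:
            return 0
        self.store(addr, value)
        return 1


def atomic_increment(mem, tid, addr, interfere=None):
    """Repeat the RMW sequence until SC succeeds; return the attempts."""
    attempts = 0
    while True:
        attempts += 1
        value = mem.load_linked(tid, addr)
        if interfere is not None and attempts == 1:
            interfere()  # another thread writes between LL and SC
        if mem.store_conditional(tid, addr, value + 1):
            return attempts
```

With no interference the increment completes in one attempt; when another writer touches the word between LL and SC, the first SC returns 0 and the loop re-reads the updated value before trying again.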
Figure 1. Examples of pipeline organizations
The fetch stage takes instructions from the L1 instruction cache and places them into an IFQ (instruction fetch
queue). The decode/rename stage takes instructions from
an IFQ, decodes them, renames their registers and assigns
them a slot in the ROB (reorder buffer) and IQ (instruction
queue). Then, the issue stage consumes instructions from
the IQ and sends them to the corresponding functional unit.
During the execution stage, the functional units operate and
write their results back into the register file. Finally, the
commit stage retires instructions from the ROB in program
order. This architecture is analogous to the one modelled
by the SimpleScalar tool set [2], but uses a ROB, an IQ (instruction queue) and a physical register file, instead of the
RUU (register update unit).
Operating system support. Tracing the execution of
a parallel workload, the operating system support required
by pthread is formed of system calls i) to spawn/destroy
a thread (clone, exit group), ii) to wait for child
threads (waitpid), iii) to communicate and synchronize
threads with system pipes (pipe, read, write, poll)
and iv) to wake up suspended threads using system signals
(sigaction, sigprocmask, sigsuspend, kill).
Figure 1 illustrates two possible pipeline organizations.
In a) all stages are shared among threads, while in b) all
stages (except execute) are replicated as many times as
supported hardware threads. Multi2Sim allows to evaluate different stage sharing strategies, as well as different algorithms that schedule stage resources in each cycle. Depending on the stages sharing and thread selection
policies, a multithread processor can be classified as finegrain (FGMT), coarse-grain (CGMT) or simultaneous multithread (SMT).
POSIX Threads parallelism management. Applications programmed with pthread can be simulated without
changes using Multi2Sim. This library introduces user code
which handles parallelism by means of the described subset of machine instructions and system calls. However,
because thread management code is mingled with application code, it adds
some overhead that could affect final results and must be taken into account. Further
details on this consideration can be found in [14].
An FGMT processor switches threads on a fixed schedule,
typically on every processor cycle. In contrast, a CGMT
processor switches threads when a long-latency operation occurs
or a thread quantum expires. Finally, an SMT processor enhances the previous ones with
a more aggressive instruction issue policy, which is able to
issue instructions from different threads in a single cycle.
The simulator parameters that specify the sharing strategy
of pipeline stages among threads, and thus the kind of multithreading, are summarized in Table 1. Again, [14] gives a
detailed description of all possible values these parameters
may take.
4.2 Detailed simulation: Multithreading support
Multi2Sim supports a set of parameters that specify how
stages are organized in a multithreaded design. Stages can
be shared among threads or private per thread [19] (except
the execute stage, which is shared by definition in a multithreaded processor). Moreover, when a stage is shared, an
algorithm must select which thread uses the stage in each cycle.
The modelled pipeline is divided into five stages, described
below.
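The thread selection algorithms mentioned above can be sketched with three small functions. They are hedged illustrations of the general techniques behind the timeslice, switchonevent and icount values listed in Table 1, not the simulator's code: all function names and arguments are hypothetical, and breaking ICOUNT ties by lowest thread identifier is an assumption of this sketch.

```python
def timeslice(cycle, num_threads):
    """Round-robin selection: thread i owns the stage on cycles i, i + n, i + 2n, ..."""
    return cycle % num_threads


def switch_on_event(current, long_latency_event, num_threads):
    """Coarse-grain selection: keep the current thread until a long-latency event occurs."""
    return (current + 1) % num_threads if long_latency_event else current


def icount(in_flight):
    """ICOUNT-style selection: prefer the thread with the fewest in-flight instructions.

    in_flight[t] is the number of instructions thread t currently has in the pipeline.
    Ties are broken by lowest thread id (an assumption of this sketch).
    """
    return min(range(len(in_flight)), key=lambda t: (in_flight[t], t))
```

For example, `timeslice` gives every thread the same share of cycles regardless of progress, whereas `icount` steers the stage toward threads that are currently occupying fewer pipeline resources.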
hand, and checking its correctness on the other. These experiments i) test different multithread pipeline configurations, ii) explore different bus widths and iii) trace the network traffic while executing a parallel workload. In all cases, the
simulated machine includes separate 64KB L1 instruction
and data caches, a 1MB unified L2 cache shared among threads,
private physical register files of 128 entries, and
fetch, decode, issue and commit widths of 8 instructions per
cycle.
Table 1. Combination of parameters for different multithread configurations

                 FGMT                          CGMT               SMT
fetch kind       timeslice                     switchonevent      timeslice/multiple
fetch priority   -                             -                  equal/icount
decode kind      shared/timeslice/replicated   shared/timeslice   shared/timeslice/replicated
issue kind       timeslice                     timeslice          shared/replicated
retire kind      timeslice                     timeslice          timeslice/replicated

i) Multithread Pipeline Organizations. Figure 3 shows
the results for four different multithreaded implementations: FGMT, CGMT, SMT with equal thread priorities
and SMT with ICOUNT (giving priority to those threads
with fewer instructions in the pipeline [21]). Figure 3a shows
the average number of instructions issued per cycle, while
Figure 3b represents the global IPC (i.e., the sum of the
IPCs achieved by the different threads), executing benchmark 176.gcc from the SPEC2000 suite with one instance
per hardware thread, and varying the number of threads.
Results are in accordance with the ones published by
Tullsen et al. [18], where CGMT and FGMT processors perform slightly better as the number of threads increases
up to four. Besides, an SMT processor shows not
only higher performance for any number of threads, but also
higher scalability, both with equal and variable thread priorities.
Figure 2. Evaluated cache distribution designs
ii) Bus Width Evaluation. This experiment shows how
the bus width affects processor performance by changing
the number of contention cycles during data transfers. For this test, we assume MOESI requests of 8 bytes
and cache blocks of 64 bytes, so network messages are
either 8 bytes (a MOESI request only) or 72 bytes (MOESI
request + block data). The executed workload is fft, which
belongs to the SPLASH2 suite, a set of parallel benchmarks.
Figure 4 represents the average contention cycles per
transfer. Because no message larger than 72 bytes is
transferred, a bus at least 72 bytes wide is required to send
any message in a single bus cycle and minimize contention.
However, results show that a bus more than three
times narrower provides (for this workload) almost the same
benefits.
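The message sizes assumed in this experiment make the per-message bus occupancy easy to compute. The sketch below is only a worked arithmetic example: the function names are hypothetical, and it counts the bus cycles one message occupies, not the contention measured in Figure 4, which also depends on how often transfers collide.

```python
REQUEST_BYTES = 8   # MOESI request size assumed in the text
BLOCK_BYTES = 64    # cache block size assumed in the text


def bus_cycles(message_bytes, bus_width_bytes):
    """Bus cycles needed to move one message over a bus of the given width."""
    # Integer ceiling division: a partially filled last cycle still counts.
    return (message_bytes + bus_width_bytes - 1) // bus_width_bytes


def transfer_table(widths):
    """Map each bus width to (cycles for a request, cycles for request + block)."""
    full = REQUEST_BYTES + BLOCK_BYTES
    return {w: (bus_cycles(REQUEST_BYTES, w), bus_cycles(full, w)) for w in widths}
```

Under these assumptions a 72-byte message needs one cycle on a 72-byte bus, three cycles on a 24-byte bus and five cycles on a 16-byte bus, while an 8-byte request fits in a single cycle for any width of 8 bytes or more.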
4.3 Detailed simulation: Multicore support
A multicore simulation environment is basically
achieved by replicating the data structures that represent
a single processor core. The zone of shared resources in
a multicore processor starts with the memory hierarchy.
When caches are shared among cores, some contention can
exist when they are accessed simultaneously. In contrast,
when they are private per core, a coherence protocol (e.g.
MOESI [20]) is implemented to guarantee memory consistency. In its current version, Multi2Sim implements a split-transaction bus as interconnection network, extensible to
any other topology of on-chip networks.
The number of interconnects and their location vary depending on the sharing strategy of data and instruction
caches. Figure 2 shows three possible schemes of sharing
L1 and L2 caches (t = private per thread, c = private per
core, s = shared), and the resulting interconnects for a dual-core dual-thread processor.
iii) Interconnect Traffic Evaluation. This experiment
shows the activity of the interconnection network during the
execution of the fft benchmark with the same processor configuration described above, for a 16-byte bus width. Figure
5a represents the fraction of total bus bandwidth used in
the network connecting the L1 caches and the common L2
cache, taking intervals of 10^4 cycles. Figure 5b represents
the same metric for the interconnect between L2
and main memory (MM). As one can see, traffic distribution is quite irregular, with peaks of interconnect
activity at some execution intervals.
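A per-interval bandwidth metric of this kind could be computed from a transfer trace as sketched below. The text does not give the exact formula, so this sketch assumes that the used fraction is the number of bytes transferred in an interval divided by the bytes the bus could carry in that interval (bus width times interval length). The function name and the `(cycle, message_bytes)` trace format are hypothetical.

```python
def bandwidth_fraction(transfers, bus_width, interval=10**4):
    """Fraction of bus bandwidth used in each interval of `interval` cycles.

    transfers -- iterable of (cycle, message_bytes) pairs (hypothetical trace format)
    bus_width -- bus width in bytes per cycle
    """
    used = {}
    for cycle, nbytes in transfers:
        bucket = cycle // interval  # index of the interval this transfer starts in
        used[bucket] = used.get(bucket, 0) + nbytes
    capacity = bus_width * interval  # bytes the bus could move per interval
    n = max(used) + 1 if used else 0
    # Intervals without any transfer report a fraction of 0.
    return [used.get(i, 0) / capacity for i in range(n)]
```

A busy-cycle count (cycles in which the bus is occupied, divided by the interval length) would be an equally plausible definition and gives slightly higher values when messages do not fill the last bus cycle.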
5 Results
This section presents some simulation experiments using Multi2Sim, illustrating the simulator application on one
Figure 3. Issue rate and IPC with different multithreaded designs [panels: a) Issue rate, Instructions Issued per Cycle vs. Number of Threads; b) IPC, Throughput (IPC) vs. Number of Threads; series: cgmt, fgmt, smt_equal, smt_icount]
Figure 4. Performance for different bus widths simulating fft [panels: a) Bus contention, Contention vs. L1-L2 Bus Width (bytes); b) Processor performance, IPC vs. L1-L2 Bus Width (bytes)]
6 Conclusions
In this paper, we presented Multi2Sim, a simulation
framework that integrates important features of existing
simulators and extends them to provide additional functionality. Among the features adopted from other tools, we
can cite the basic pipeline architecture (SimpleScalar), the
timing-first simulation (Simics-GEMS) and the support for
cache coherence protocols.
The extensions introduced by Multi2Sim include the simulation of sharing strategies of pipeline stages, memory hierarchy configurations, multicore-multithread combinations
and an integrated interface with the on-chip interconnection
network. These features make Multi2Sim suitable for the
evaluation of state-of-the-art processors, covering hot topics in the computer architecture field. In this paper, we
presented some examples illustrating how to use these simulator features.
The source code of Multi2Sim is written in C and can be
downloaded at [14].
Figure 5. Traffic distribution in L1-L2 and L2-MM interconnects [panels: a) L1-L2 Interconnect; b) L2-MM Interconnect; axes: Fraction of Used Bandwidth vs. Processor Cycles (Millions)]
Acknowledgements
This work was supported by CICYT under Grant
TIN2006-15516-C04-01, by Consolider-Ingenio 2010 under Grant CSD2006-00046 and by the Generalitat Valenciana under Grant GV06/326.
References
[1] AMD Athlon™ 64 X2 Dual-Core Processor Product
Data Sheet. www.amd.com, Sept. 2006.
[2] D.C. Burger and T.M. Austin. The SimpleScalar Tool
Set, Version 2.0. Technical Report CS-TR-1997-1342,
1997.
[3] Y. Zhang, D. Parikh, K. Sankaranarayanan,
K. Skadron, and M. Stan. HotLeakage: A
Temperature-Aware Model of Subthreshold and
Gate Leakage for Architects. Univ. of Virginia Dept.
of Computer Science Technical Report CS-2003-05,
2003.
[4] D. Madon, E. Sanchez, and S. Monnier. A Study of
a Simultaneous Multithreaded Processor Implementation. In European Conference on Parallel Processing,
pages 716–726, 1999.
[5] J. Sharkey. M-Sim: A Flexible, Multithreaded Architectural Simulation Environment. Technical Report
CS-TR-05-DP01, Department of Computer Science,
State University of New York at Binghamton, 2005.
[6] D. M. Tullsen. Simulation and Modeling of a Simultaneous Multithreading Processor. 22nd Annual
Computer Measurement Group Conference, December 1996.
[7] M. Moudgill, P. Bose, and J. Moreno. Validation of
Turandot, a Fast Processor Model for Microarchitecture Exploration. IEEE International Performance,
Computing, and Communications Conference, pages
451–457, 1999.
[8] M. Moudgill, J. Wellman, and J. Moreno. Environment for PowerPC Microarchitecture Exploration.
IEEE Micro, pages 15–25, 1999.
[9] D. Brooks, P. Bose, V. Srinivasan, M. Gschwind,
P. Emma, and M. Rosenfield. Microarchitecture-Level
Power-Performance Analysis: The PowerTimer Approach. IBM J. Research and Development, 47(5/6),
2003.
[10] B. Lee and D. Brooks. Effects of Pipeline Complexity
on SMT/CMP Power-Performance Efficiency. Workshop on Complexity Effective Design, 2005.
[11] P.S. Magnusson, M. Christensson, J. Eskilson,
D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson,
A. Moestedt, and B. Werner. Simics: A Full System
Simulation Platform. IEEE Computer, 35(2), 2002.
[12] M. R. Marty, B. Beckmann, L. Yen, A. R. Alameldeen,
M. Xu, and K. Moore. GEMS: Multifacet’s General
Execution-driven Multiprocessor Simulator. International Symposium on Computer Architecture, 2006.
[13] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt.
Network-Oriented Full-System Simulation Using M5.
Sixth Workshop on Computer Architecture Evaluation
using Commercial Workloads (CAECW), pages 36–43, Feb. 2003.
[14] www.gap.upv.es/~raurte/tools/multi2sim.html.
R. Ubal Homepage – Tools – Multi2Sim.
[15] MIPS Technologies, Inc. MIPS32™ Architecture
For Programmers, volume I: Introduction to the
MIPS32™ Architecture. 2001.
[16] MIPS Technologies, Inc. MIPS32™ Architecture For
Programmers, volume II: The MIPS32™ Instruction
Set. 2001.
[17] D. R. Butenhof. Programming with POSIX®
Threads. Addison Wesley Professional, 1997.
[18] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proceedings of the 22nd International Symposium on Computer Architecture, pages 392–403, June
1995.
[19] J. P. Shen and M. H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. July
2004.
[20] P. Sweazey and A.J. Smith. A Class of Compatible
Cache Consistency Protocols and Their Support by the
IEEE Futurebus. 13th Int’l Symp. Computer Architecture, pages 414–423, June 1986.
[21] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy,
J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In ISCA, pages 191–202, 1996.