Department of Computer Science
Carnegie-Mellon University

CMU-CS-86-122

Parallelism in Production Systems

Anoop Gupta

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science at Carnegie-Mellon University.

March 1986

Copyright © 1986 Anoop Gupta

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 4864, monitored by the Space and Naval Warfare Systems Command under contract N00039-85-0134. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.

Abstract

Production systems (or rule-based systems) are widely used in Artificial Intelligence for modeling intelligent behavior and building expert systems. Most production-system programs, however, are extremely computation intensive and run quite slowly. The slow speed of execution has prohibited the use of production systems in domains requiring high performance and real-time response. This thesis explores the role of parallelism in the high-speed execution of production systems.

On the surface, production-system programs appear to be capable of using large amounts of parallelism--it is possible to perform match for each production in a program in parallel. The thesis shows that in practice, however, the speed-up obtainable from parallelism is quite limited, around 10-fold as compared to initial expectations of 100-fold to 1000-fold. The main reasons for the limited speed-up are: (1) there are only a small number of productions that are affected (require significant processing) per change to working memory; (2) there is a large variation in the processing requirement of these productions; and (3) the number of changes made to working memory per recognize-act cycle is very small. Since the number of productions affected and the number of working-memory changes per recognize-act cycle are not controlled by the implementor of the production-system interpreter (they are governed mainly by the author of the program and the nature of the task), the solution to the problem of limited speed-up is to somehow decrease the variation in the processing cost of affected productions. The thesis proposes a parallel version of the Rete algorithm which exploits parallelism at a very fine grain to reduce the variation. It further suggests that to exploit the fine-grained parallelism, a shared-memory multiprocessor with 32-64 high-performance processors is desirable. For scheduling the fine-grained tasks consisting of about 50-100 instructions, a hardware task scheduler is proposed.

The thesis presents simulation results for a large set of production systems exploiting different sources of parallelism. The thesis points out the features of existing programs that limit the speed-up obtainable from parallelism and suggests solutions for some of the bottlenecks. The simulation results show that using the suggested multiprocessor architecture (with individual processors performing at 2 MIPS), it is possible to obtain execution speeds of about 12000 working-memory element changes per second. This corresponds to a speed-up of 10-fold over the best known sequential implementation using a 2 MIPS processor.
This performance is significantly higher than that obtained by other proposed parallel implementations of production systems.

Acknowledgments

I would like to thank my advisors Charles Forgy, Allen Newell, and HT Kung for their guidance, support, and encouragement. Charles Forgy helped with his deep understanding of production systems and their implementation. Many of the ideas presented in this thesis originated with him or have benefited from his comments. Allen Newell, in addition to being an invaluable source of ideas, has shown me what doing research is about, and through his own example, what it means to be a good researcher. He has been a constant source of inspiration and it has been a great pleasure to work with him. HT Kung has been an excellent sounding board for ideas. He greatly helped in keeping the thesis on solid ground by always questioning my assumptions. I would also like to thank Al Davis for serving on my thesis committee. The final quality of the thesis has benefited much from his comments.

The work reported in this thesis has been done as a part of the Production System Machine (PSM) project at Carnegie-Mellon University. I would like to thank its current and past members--Charles Forgy, Ken Hughes, Dirk Kalp, Ted Lehr, Allen Newell, Kemal Oflazer, Jim Quinlan, Leon Weaver, and Robert Wedig--for their contributions to the research. I would also like to thank Greg Hood, John Laird, Bob Sproull (my advisor for the first two years at CMU), and Hank Walker for many interesting discussions about my research.

I would like to thank all my friends in Pittsburgh who have made these past years so enjoyable. I would like to thank Greg Hood, Ravi Kannan, Gudrun and Georg Klinker, Roberto Minio, Bud Mishra, Pradeep Sindhu, Pedro Szekely, Hank Walker, Angelika Zobel, and especially Paola Giannini and Yumi Iwasaki for making life so much fun.

Finally, I would like to thank my family, my parents and my two sisters, for their immeasurable love, encouragement, and support of my educational endeavors.

Table of Contents

1. Introduction
   1.1. Preview of Results
   1.2. Organization of the Thesis
2. Background
   2.1. OPS5
      2.1.1. Working-Memory Elements
      2.1.2. The Left-Hand Side of a Production
      2.1.3. The Right-Hand Side of a Production
   2.2. Soar
   2.3. The Rete Match Algorithm
   2.4. Why Parallelize Rete?
      2.4.1. State-Saving vs. Non-State-Saving Match Algorithms
      2.4.2. Rete as a Specific Instance of State-Saving Algorithms
      2.4.3. Node Sharing in the Rete Algorithm
      2.4.4. Rete as a Parallel Algorithm
3. Measurements on Production Systems
   3.1. Production-System Programs Studied in the Thesis
   3.2. Surface Characteristics of Production Systems
      3.2.1. Condition Elements per Production
      3.2.2. Actions per Production
      3.2.3. Negative Condition Elements per Production
      3.2.4. Attributes per Condition Element
      3.2.5. Tests per Two-Input Node
      3.2.6. Variables Bound and Referenced
      3.2.7. Variables Bound but not Referenced
      3.2.8. Variable Occurrences in Left-Hand Side
      3.2.9. Variables per Condition Element
      3.2.10. Condition Element Classes
      3.2.11. Action Types
      3.2.12. Summary of Surface Measurements
   3.3. Measurements on the Rete Network
      3.3.1. Number of Nodes in the Rete Network
      3.3.2. Network Sharing
   3.4. Run-Time Characteristics of Production Systems
      3.4.1. Constant-Test Nodes
      3.4.2. Alpha-Memory Nodes
      3.4.3. Beta-Memory Nodes
      3.4.4. And Nodes
      3.4.5. Not Nodes
      3.4.6. Terminal Nodes
      3.4.7. Summary of Run-Time Characteristics
4. Parallelism in Production Systems
   4.1. The Structure of a Parallel Production-System Interpreter
   4.2. Parallelism in Match
      4.2.1. Production Parallelism
      4.2.2. Node Parallelism
      4.2.3. Intra-Node Parallelism
      4.2.4. Action Parallelism
   4.3. Parallelism in Conflict-Resolution
   4.4. Parallelism in RHS Evaluation
   4.5. Application Parallelism
   4.6. Summary
   4.7. Discussion
5. Parallel Implementation of Production Systems
   5.1. Architecture of the Production-System Machine
   5.2. The State-Update Phase Processing
      5.2.1. Hash-Table Based vs. List Based Memory Nodes
      5.2.2. Memory Nodes Need to be Lumped with Two-Input Nodes
      5.2.3. Problems with Processing Conjugate Pairs of Tokens
      5.2.4. Concurrently Processable Activations of Two-Input Nodes
      5.2.5. Locks for Memory Nodes
      5.2.6. Linear vs. Binary Rete Networks
   5.3. The Selection Phase Processing
      5.3.1. Sharing of Constant-Test Nodes
      5.3.2. Constant-Test Node Successors
      5.3.3. Alpha-Memory Node Successors
      5.3.4. Processing Multiple Changes to Working Memory in Parallel
   5.4. Summary
6. The Problem of Scheduling Node Activations
   6.1. The Hardware Task Scheduler
      6.1.1. How Fast Need the Scheduler be?
      6.1.2. The Interface to the Hardware Task Scheduler
      6.1.3. Structure of the Hardware Task Scheduler
      6.1.4. Multiple Hardware Task Schedulers
   6.2. Software Task Schedulers
7. The Simulator
   7.1. Structure of the Simulator
      7.1.1. Inputs to the Simulator
         7.1.1.1. The Input Trace
         7.1.1.2. The Computational Model
         7.1.1.3. The Cost Model
         7.1.1.4. The Memory Contention Model
      7.1.2. Outputs of the Simulator
   7.2. Limitations of the Simulation Model
   7.3. Validity of the Simulator
8. Simulation Results and Analysis
   8.1. Traces Used in the Simulations
   8.2. Simulation Results for Uniprocessors
   8.3. Production Parallelism
      8.3.1. Effects of Action Parallelism on Production Parallelism
   8.4. Node Parallelism
      8.4.1. Effects of Action Parallelism on Node Parallelism
   8.5. Intra-Node Parallelism
      8.5.1. Effects of Action Parallelism on Intra-Node Parallelism
   8.6. Linear vs. Binary Rete Networks
      8.6.1. Uniprocessor Implementations with Binary Networks
      8.6.2. Results of Parallelism with Binary Networks
   8.7. Hardware Task Scheduler vs. Software Task Queues
   8.8. Effects of Memory Contention
   8.9. Summary
9. Related Work
   9.1. Implementing Production Systems on C.mmp
   9.2. Implementing Production Systems on Illiac-IV
   9.3. The DADO Machine
   9.4. The NON-VON Machine
   9.5. Kemal Oflazer's Work on Partitioning and Parallel Processing of Production Systems
      9.5.1. The Partitioning Problem
      9.5.2. The Parallel Algorithm
      9.5.3. The Parallel Architecture
      9.5.4. Discussion
   9.6. Honeywell's Data-Flow Model
   9.7. Other Work on Speeding-up Production Systems
10. Summary and Conclusions
   10.1. Primary Results of Thesis
      10.1.1. Suitability of the Rete-Class of Algorithms
      10.1.2. Parallelism in Production Systems
      10.1.3. Software Implementation Issues
      10.1.4. Hardware Architecture
   10.2. Some General Conclusions
   10.3. Directions for Future Research
References
Appendix A. ISP of Processor Used in Parallel Implementation
Appendix B. Code and Data Structures for Parallel Implementation
   B.1. Code for Interpreter with Hardware Task Scheduler
   B.2. Code for Interpreter Using Multiple Software Task Schedulers
Appendix C. Derivation of Cost Models for the Simulator
   C.1. Cost Model for the Parallel Implementation Using HTS
   C.2. Cost Model for the Parallel Implementation Using STQs

List of Figures

Figure 2-1: A sample production.
Figure 2-2: The Rete network.
Figure 3-1: Condition elements per production.
Figure 3-2: Actions per production.
Figure 3-3: Negative condition elements per production.
Figure 3-4: Attributes per condition element.
Figure 3-5: Tests per two-input node.
Figure 3-6: Variables bound and referenced.
Figure 3-7: Variables bound but not referenced.
Figure 3-8: Occurrences of each variable.
Figure 3-9: Variables per condition element.
Figure 4-1: OPS5 interpreter cycle.
Figure 4-2: Soar interpreter cycle.
Figure 4-3: Selection and state-update phases in match.
Figure 4-4: Production parallelism.
Figure 4-5: Node parallelism.
Figure 4-6: The cross-product effect.
Figure 5-1: Architecture of the production-system machine.
Figure 5-2: A production and the associated Rete network.
Figure 5-3: Problems with memory-node sharing.
Figure 5-4: Concurrent activations of two-input nodes.
Figure 5-5: The long-chain effect.
Figure 5-6: A binary Rete network.
Figure 5-7: Scheduling activations of constant-test nodes.
Figure 5-8: Possible solution when too many alpha-memory successors.
Figure 6-1: Problem of dynamically changing set of processable node activations.
Figure 6-2: Effect of scheduler performance on maximum speed-up.
Figure 6-3: Structure of the hardware task scheduler.
Figure 6-4: Effect of multiple schedulers on speed-up.
Figure 6-5: Multiple software task queues.
Figure 7-1: A sample trace fragment.
Figure 7-2: Static node information.
Figure 7-3: Code for left activation of an and-node.
Figure 7-4: Degradation in performance due to memory contention.
Figure 8-1: Production parallelism (nominal speed-up).
Figure 8-2: Production parallelism (true speed-up).
Figure 8-3: Production parallelism (execution speed).
Figure 8-4: Production and action parallelism (nominal speed-up).
Figure 8-5: Production and action parallelism (true speed-up).
Figure 8-6: Production and action parallelism (execution speed).
Figure 8-7: Node parallelism (nominal speed-up).
Figure 8-8: Node parallelism (true speed-up).
Figure 8-9: Node parallelism (execution speed).
Figure 8-10: Node and action parallelism (nominal speed-up).
Figure 8-11: Node and action parallelism (true speed-up).
Figure 8-12: Node and action parallelism (execution speed).
Figure 8-13: Intra-node parallelism (nominal speed-up).
Figure 8-14: Intra-node parallelism (true speed-up).
Figure 8-15: Intra-node parallelism (execution speed).
Figure 8-16: Intra-node and action parallelism (nominal speed-up).
Figure 8-17: Intra-node and action parallelism (true speed-up).
Figure 8-18: Intra-node and action parallelism (execution speed).
Figure 8-19: Average nominal speed-up.
Figure 8-20: Average true speed-up.
Figure 8-21: Average execution speed.
Figure 8-22: Production parallelism (nominal speed-up).
Figure 8-23: Production parallelism (execution speed).
Figure 8-24: Production and action parallelism (nominal speed-up).
Figure 8-25: Production and action parallelism (execution speed).
Figure 8-26: Node parallelism (nominal speed-up).
Figure 8-27: Node parallelism (execution speed).
Figure 8-28: Node and action parallelism (nominal speed-up).
Figure 8-29: Node and action parallelism (execution speed).
Figure 8-30: Intra-node parallelism (nominal speed-up).
Figure 8-31: Intra-node parallelism (execution speed).
Figure 8-32: Intra-node and action parallelism (nominal speed-up).
Figure 8-33: Intra-node and action parallelism (execution speed).
Figure 8-34: Average nominal speed-up.
Figure 8-35: Average execution speed.
Figure 8-36: Effect of number of software task queues.
Figure 8-37: Production parallelism (nominal speed-up).
Figure 8-38: Production parallelism (execution speed).
Figure 8-39: Production and action parallelism (nominal speed-up).
Figure 8-40: Production and action parallelism (execution speed).
Figure 8-41: Node parallelism (nominal speed-up).
Figure 8-42: Node parallelism (execution speed).
Figure 8-43: Node and action parallelism (nominal speed-up).
Figure 8-44: Node and action parallelism (execution speed).
Figure 8-45: Intra-node parallelism (nominal speed-up).
Figure 8-46: Intra-node parallelism (execution speed).
Figure 8-47: Intra-node and action parallelism (nominal speed-up).
Figure 8-48: Intra-node and action parallelism (execution speed).
Figure 8-49: Average nominal speed-up.
Figure 8-50: Average execution speed.
Figure 8-51: Processor efficiency as a function of number of active processors.
Figure 8-52: Intra-node and action parallelism (nominal speed-up).
Figure 8-53: Intra-node and action parallelism (execution speed).
Figure 9-1: The prototype DADO architecture.
Figure 9-2: The NON-VON architecture.
Figure 9-3: Structure of the parallel processing system.
List of Tables

Table 3-1: VT: Condition Element Classes
Table 3-2: ILOG: Condition Element Classes
Table 3-3: MUD: Condition Element Classes
Table 3-4: DAA: Condition Element Classes
Table 3-5: R1-SOAR: Condition Element Classes
Table 3-6: EP-SOAR: Condition Element Classes
Table 3-7: Action Type Distribution
Table 3-8: Summary of Surface Measurements
Table 3-9: Number of Nodes
Table 3-10: Nodes per Production
Table 3-11: Nodes per Condition Element (with sharing)
Table 3-12: Nodes per Condition Element (without sharing)
Table 3-13: Network Sharing (Nodes without sharing/Nodes with sharing)
Table 3-14: Constant-Test Nodes
Table 3-15: Alpha-Memory Nodes
Table 3-16: Beta-Memory Nodes
Table 3-17: And Nodes
Table 3-18: Not Nodes
Table 3-19: Terminal Nodes
Table 3-20: Summary of Node Activations per Change
Table 3-21: Number of Affected Productions
Table 3-22: General Run-Time Data
Table 7-1: Relative Costs of Various Instruction Types
Table 8-1: Uniprocessor Execution With No Overheads: Part-A
Table 8-2: Uniprocessor Execution With No Overheads: Part-B
Table 8-3: Uniprocessor Execution With Overheads: Part-A, Intra-Node Parallelism and Node Parallelism
Table 8-4: Uniprocessor Execution With Overheads: Part-B, Intra-Node Parallelism and Node Parallelism
Table 8-5: Uniprocessor Execution With Overheads: Part-A, Production Parallelism
Table 8-6: Uniprocessor Execution With Overheads: Part-B, Production Parallelism
Table 8-7: Uniprocessor Execution With No Overheads: Part-A
Table 8-8: Uniprocessor Execution With No Overheads: Part-B
Table 8-9: Uniprocessor Execution With Overheads: Part-A, Intra-Node Parallelism and Node Parallelism
Table 8-10: Uniprocessor Execution With Overheads: Part-B, Intra-Node Parallelism and Node Parallelism
Table 8-11: Uniprocessor Execution With Overheads: Part-A, Production Parallelism
Table 8-12: Uniprocessor Execution With Overheads: Part-B, Production Parallelism
Table 8-13: Comparison of Linear and Binary Network Rete

To my parents

Chapter One
Introduction

Production systems (or rule-based systems) occupy a prominent position within the field of Artificial Intelligence. They have been used extensively to understand the nature of intelligence--in cognitive modeling, in the study of problem-solving systems, and in the study of learning systems [2, 44, 45, 62, 73, 74, 92]. They have also been used extensively to develop large expert systems spanning a variety of applications in areas including computer-aided design, medicine, configuration tasks, and oil exploration [11, 14, 39, 41, 42, 55, 56, 81]. Production-system programs, however, are computation intensive and run quite slowly. For example, OPS5 [10, 19] production-system programs using the Lisp-based or the Bliss-based interpreter execute at a speed of only 8-40 working-memory element changes per second (wme-changes/sec) on a VAX-11/780.¹ Although sufficient for many interesting applications (as proved by the current popularity of expert systems), the slow speed of execution precludes the use of production systems in many domains requiring high performance and real-time response.

¹This corresponds to an execution speed of 3-16 production firings per second. On average, 2.5 changes are made to the working memory per production firing.
For example, one study that considered implementing the Harpy algorithm as a production system [63] for real-time speech recognition required that the program be able to execute at a rate of about 200,000 wme-changes/sec. The slow speed of execution of current systems also impacts the research that is done with them, since researchers often avoid programming styles and systems that run too slowly. This thesis examines the issue of significantly speeding up the execution of production systems (several orders of magnitude over the 8-40 wme-changes/sec). A significant increase in the execution speed of production systems is expected to open up new application areas for production systems, and to be valuable to both the practitioners and the researchers in Artificial Intelligence.

There also exist deeper reasons for wanting to speed up the execution of production systems. The cognitive activity of an intelligent agent involves two types of search: (1) knowledge search, that is, search by the agent of its knowledge base to find information that is relevant to solving a given problem; and (2) problem-space search, that is, search within the problem space [62] for a goal state. Problem-space search manifests itself as a combinatorial AND/OR search [65]. Since problem-space search, when not pruned by knowledge, is combinatorially explosive, a highly intelligent agent, regardless of what it is doing, must engage in a certain amount of knowledge search after each step that it takes. This results in knowledge search being a part of the inner loop of the computation performed by an intelligent agent. Furthermore, as the intelligence of the agent increases (the size of the knowledge base increases), the resources needed to perform knowledge search also increase, and it becomes important to speed up knowledge search as much as possible.

As an example, consider the problem of determining the next move to make in a game of chess. Problem-space search corresponds to the different moves that the player tries out before making the actual move. However, the fact that he tries out only a small fraction of all possible moves requires that he use problem- and situation-specific knowledge to constrain the search. Knowledge search corresponds to the computation involved in identifying this problem- and situation-specific knowledge from the rest of the knowledge that the player may have.

Knowledge search forms an essential component of the execution of production systems. Each execution cycle of a production system involves a knowledge-search step (the match phase), where the knowledge represented in rules is matched against the global data memory. Since the ability to do efficient knowledge search is fundamental to the construction of intelligent agents, it follows that the ability to execute production systems with large rule sets at high speeds will greatly help in constructing intelligent programs. In short, the match-phase computation (knowledge search) done in production systems is not something specific to production systems; such computation has to be done, in one form or another, in any intelligent system. Thus, speeding up such computation is an essential part of the construction of highly intelligent systems.
Furthermore, since production systems offer a highly transparent model of knowledge search, the results obtained about speed-up from parallelism for production systems will also have implications for other models of intelligent computation involving knowledge search.

There are several different methods for speeding up the execution of production systems: (1) the use of faster technology; (2) the use of better algorithms; (3) the use of better architectures; and (4) the use of parallelism. This thesis focuses on the use of parallelism. It identifies the various sources of parallelism in production systems and discusses the feasibility of exploiting them. Several implementation issues and some architectural considerations are also discussed. The main reasons for considering parallelism are: (1) Given any technology base, it is always possible to use multiple processors to achieve higher execution speeds. Stated another way, as technology advances, the new technology can also be used in the construction of multiple-processor systems. Furthermore, as the rate of improvement in technology slows (as it must), parallelism becomes even more important. (2) Although significant improvements in speed have been obtained in the past through better compilation techniques and better algorithms [17, 20, 21, 22], we appear to be at a point where little more can be expected. Furthermore, any improvements in compilation technology and algorithms will probably also carry over to the parallel implementations. (3) On the surface, production systems appear to be capable of using large amounts of parallelism--it is possible to perform the match for each production in parallel. This apparent mismatch between the inherently parallel production systems and the uniprocessor implementations makes parallelism the obvious way to obtain a significant speed-up in the execution rates.

The thesis concentrates on the parallelism available in OPS5 [10] and Soar [46] production systems. OPS5 was chosen because it has become widely available and because several large, diverse, and real production-system programs have been written in it. These programs form an excellent base for measurements and analysis. Soar was chosen because it represents an interesting new approach in the use of production systems for problem solving and learning. Since only OPS5 and Soar programs are considered, the analysis of parallelism presented in this thesis is possibly biased by the characteristics of these languages. For this reason the results may not be safely generalized to production-system programs written in languages with substantially different characteristics, such as EMYCIN, EXPERT, and KAS [59, 93, 14].

Finally, the research reported in this thesis has been carried out in the context of the Production System Machine (PSM) project at Carnegie-Mellon University, which has been exploring all facets of the problem of improving the efficiency of production systems [22, 23, 30, 31, 32, 33]. This thesis extends, refines, and substantiates the preliminary work that appears in the earlier publications.

1.1. Preview of Results

The first thing that is observed on analyzing production systems is that the speed-up from parallelism is quite limited, about 10-fold as compared to initial expectations of 100-fold or 1000-fold.
The main reasons for the limited parallelism are: (1) The number of productions that require significant processing (the number of affected productions) as a result of a change to working memory is quite small, less than 30. Thus, processing each of these productions in parallel cannot result in a speed-up of more than 30. (2) The variation in the processing requirements of the affected productions is large. This results in a situation where fewer and fewer processors are busy as the execution progresses, which reduces the average number of processors that are busy over the complete execution cycle. (3) The number of changes made to working memory per recognize-act cycle is very small (around 2-3 for most OPS5 systems). As a result, the speed-up obtained from processing multiple changes to working memory in parallel is quite small.

To obtain a large fraction of the limited speed-up that is available, the thesis proposes the exploitation of parallelism at a very fine grain. It also proposes that all working-memory changes made by a production firing be processed in parallel to increase the speed-up. The thesis argues that the Rete algorithm used for performing the match step in existing uniprocessor implementations is also suitable for parallel implementations. However, there are several changes that are necessary to the serial Rete algorithm to make it suitable for parallel implementation. The thesis discusses these changes and gives the reasons behind the design decisions.

The thesis argues that a highly suitable architecture to exploit the fine-grained parallelism in production systems is a shared-memory multiprocessor, with about 32-64 high-performance processors. For scheduling the fine-grained tasks (consisting of about 50-100 instructions), two solutions are proposed. The first solution consists of a hardware task scheduler, which is to be capable of scheduling a task in one bus cycle of the multiprocessor. The second solution consists of multiple software task queues. Preliminary simulation studies indicate that the hardware task scheduler is significantly superior to the software task queues.

The thesis presents a large set of simulation results for production systems exploiting different sources of parallelism. The thesis points out the features of existing programs that limit the speed-up obtainable from parallelism and suggests solutions for some of the bottlenecks. The simulation results show that using the suggested multiprocessor architecture (with individual processors performing at 2 MIPS), it is possible to obtain execution speeds of 5000-27000 wme-changes/sec. This corresponds to a speed-up of 4-fold to 23-fold over the best known sequential implementation using a 2 MIPS processor. This performance is significantly higher than that obtained by other proposed parallel implementations of production systems.

1.2. Organization of the Thesis

Chapter 2 contains the background information necessary for the thesis. Sections 2.1 and 2.2 introduce the OPS5 and Soar production-system formalisms and describe their computation cycles. Section 2.3 presents a detailed description of the Rete algorithm, which is used to perform the match step for production systems. The Rete algorithm forms the starting point for much of the work described later in the thesis. Section 2.4 presents the reasons why it is interesting to parallelize the Rete algorithm.
Chapter 3 lists the set of production-system programs analyzed in this thesis and presents the results of static and run-time measurements made on these production-system programs. The static measurements include data on the surface characteristics of production systems (for example, the number of condition elements per production, the number of attribute-value pairs per condition element) and data on the structure of the Rete networks constructed for the programs. The run-time measurements include data on the number of node activations per change to working memory, the number of working-memory changes per production firing, etc. The run-time data can be used to get rough upper bounds on the speed-up obtainable from parallelism.

Chapter 4 focuses on the sources of parallelism in production-system implementations. For each of the sources (production parallelism, node parallelism, intra-node parallelism, action parallelism, and application parallelism) it describes some of the implementation constraints, the amount of speed-up expected, and the overheads associated with exploiting that source. Most of the chapter is devoted to the parallelism in the match phase; the parallelism in the conflict-resolution phase and the rhs-evaluation phase is discussed only briefly.

Chapter 5 discusses the various hardware and software issues associated with the parallel implementation of production systems. It first describes a multiprocessor architecture that is suitable for the parallel implementation and provides justifications for the various decisions. Subsequently it describes the changes that need to be made to the serial Rete algorithm to make it suitable for parallel implementation. Various issues related to the choice of data structures are also discussed.

Chapter 6 discusses the problem of scheduling node activations in the multiprocessor implementation. It proposes two solutions: (1) the use of a hardware task scheduler, and (2) the use of multiple software task queues. These two solutions are detailed in Sections 6.1 and 6.2 respectively. The performance results corresponding to the two solutions, however, are not discussed until Section 8.7.

Chapter 7 presents details about the simulator used to study parallelism in production systems. It presents information about the input traces, the cost models, the computational model, the outputs, and the limitations of the simulator. More details about the derivation of the cost model are presented in Appendices A, B, and C.

Chapter 8 presents the results of the simulations. Section 8.1 lists the run-time traces used in the simulations. Section 8.2 discusses the overheads of a parallel implementation over an implementation done for a uniprocessor. Sections 8.3, 8.4, and 8.5 discuss the speed-up obtained using production parallelism, node parallelism, and intra-node parallelism respectively. Section 8.6 discusses the effect of constructing binary instead of linear Rete networks for productions. Section 8.7 presents results for the case when multiple software task queues are used instead of a hardware task scheduler. Section 8.8 presents results for the case when memory contention overheads are taken into account, and finally, Section 8.9 presents a summary of all results.

Chapter 9 presents related work done by other researchers. Work done on parallel implementation of production systems on C.mmp, Illiac-IV, DADO, NON-VON, Oflazer's production-system machine, and Honeywell's data-flow machine is presented.
Finally, Chapter 10 reviews the primary results of the thesis and presents directions for future research.

Chapter Two
Background

The first two sections of this chapter describe the syntactic and semantic features of the OPS5 and Soar production-system languages--the two languages for which parallelism is explored in this thesis. The third section describes the Rete match algorithm. The Rete algorithm is used in existing uniprocessor implementations of OPS5 and Soar, and also forms the basis for the parallel match algorithm proposed in the thesis. The last section describes the different classes of algorithms that may be used for match in production systems and gives reasons why the Rete algorithm is appropriate for parallel implementation of production systems.

2.1. OPS5

An OPS5 [10, 19] production system is composed of a set of if-then rules called productions that make up the production memory, and a database of assertions called the working memory. The assertions in the working memory are called working-memory elements. Each production consists of a conjunction of condition elements corresponding to the if part of the rule (also called the left-hand side of the production), and a set of actions corresponding to the then part of the rule (also called the right-hand side of the production). The actions associated with a production can add, remove, or modify working-memory elements, or perform input-output. Figure 2-1 shows an OPS5 production named p1, which has three condition elements in its left-hand side, and one action in its right-hand side.

(p p1
    (C1 ^attr1 <x> ^attr2 12)
    (C2 ^attr1 15 ^attr2 <x>)
  - (C3 ^attr1 <x>)
-->
    (remove 2))

Figure 2-1: A sample production.

The production-system interpreter is the underlying mechanism that determines the set of satisfied productions and controls the execution of the production-system program. The interpreter executes a production-system program by performing the following recognize-act cycle:

- Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production. At any given time, the conflict set may contain zero, one, or more instantiations of a given production.

- Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.

- Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, the first phase is executed again.

The recognize-act cycle forms the basic control structure in production-system programs. During the match phase the knowledge of the program (represented by the production rules) is tested for relevance against the existing problem state (represented by the working memory). During the conflict-resolution phase the most relevant piece of knowledge is selected from all knowledge that is applicable (the conflict set) to the existing problem state. During the act phase, the relevant piece of knowledge is applied to the existing problem state, resulting in a new problem state.
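As a rough sketch, the cycle can be rendered in a few lines of Python. The representation is a deliberate simplification, not OPS5's machinery: productions are (name, condition, action) triples, the condition is an arbitrary predicate on working memory rather than an OPS5 left-hand side, and conflict resolution is reduced to taking the first instantiation.

    def recognize_act(productions, wm):
        """Minimal sketch of the recognize-act cycle; not the OPS5 interpreter."""
        while True:
            # Match: collect instantiations of all satisfied productions
            # (the conflict set).
            conflict_set = [(name, action)
                            for name, condition, action in productions
                            if condition(wm)]
            if not conflict_set:
                return wm              # no satisfied productions: halt
            # Conflict-resolution: choose one instantiation (simplified).
            _name, action = conflict_set[0]
            # Act: execute the actions, which change working memory.
            action(wm)

    # Example: a single production that fires until a counter reaches zero.
    rules = [("decrement",
              lambda wm: wm["count"] > 0,
              lambda wm: wm.update(count=wm["count"] - 1))]
    print(recognize_act(rules, {"count": 3}))   # -> {'count': 0}

Even in this toy form, the loop makes visible why the match phase dominates: it is the only step whose cost grows with both the number of productions and the size of working memory.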
2.1.1. Working-Memory Elements

A working-memory element is a parenthesized list consisting of a constant symbol called the class or type of the element and zero or more attribute-value pairs. The attributes are symbols that are preceded by the operator ^. The values are symbolic or numeric constants. For example, the following working-memory element has class C1, the value 12 for attribute attr1, and the value 15 for attribute attr2.

(C1 ^attr1 12 ^attr2 15)

2.1.2. The Left-Hand Side of a Production

The condition elements in the left-hand side of a production are parenthesized lists similar to the working-memory elements. They may optionally be preceded by the symbol -. Such condition elements are called negated condition elements. For example, the production in Figure 2-1 contains three condition elements, with the third one being negated. Condition elements are interpreted as partial descriptions of working-memory elements. When a condition element describes a working-memory element, the working-memory element is said to match the condition element. A production is said to be satisfied when:

- For every non-negated condition element in the left-hand side of the production, there exists a working-memory element that matches it.

- For every negated condition element in the left-hand side of the production, there does not exist a working-memory element that matches it.
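In code, the satisfaction test reads directly off these two rules. The sketch below treats the matcher matches(wme, ce) as a parameter (its rules are given in the next subsection), and it deliberately ignores the requirement that a variable be bound consistently across different condition elements, which a real matcher must enforce.

    def satisfied(positive_ces, negated_ces, working_memory, matches):
        """Sketch only: cross-condition variable consistency is ignored."""
        # Every non-negated condition element must be matched by some
        # working-memory element.
        if not all(any(matches(w, ce) for w in working_memory)
                   for ce in positive_ces):
            return False
        # Every negated condition element must be matched by no
        # working-memory element.
        return not any(matches(w, ce)
                       for ce in negated_ces for w in working_memory)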
Thus the working-memory element (el tattrl 12 _attr2 1B) will match the following two condition elements (Cl tattrl 12 tattr2 <x>) 10 PARALLELISM INPRODUCTIONSYSTEMS (C1 *attr2 > 0) but it will not match the condition element (CI +attrl <x> +attr2 <x>). 2.1.3. The Right-Hand Side of a Production The right-hand side of a production consists of an unconditional sequence of actions which can cause input-output, and which are responsible for changes to the working memory. Three kinds of actions are provided to effect working memory changes. Make creates a new working-memory element and adds it to working memory. Modify changes one or more values of an existing workingmemory element. Remove deletes an element from the working memory. 2.2. Soar Soar[45, 47, 48, 74, 75] is a new production-system formalism developed at Carnegie-Mellon University to perform research in problem-solving, expert systems, and learning. It is an attempt to provide expert systems with general reasoning power and the ability to learn. In Soar, every task is formulated as heuristic search in a problem space to achieve a goal. The problem space [62] consists of a set of states and a set of operators. The operators are used to transform one problem state into another. Problem solving is the process of moving from some initial state through intermediate states (generated as a result of applying the operators) until a goal state is reached. Knowledge about the task domain is used to guide the search leading to the goal state. Currently, Soar is built on top of OPS5--the operators, the domain knowledge, the goal-state recognition mechanism, are all built as OPS5 productions. As a result most of the implementation issues, including the exploitation of parallelism, are similar in OPS5 and Soar. The main difference, however, is that Soar does not follow the match---conflict-resolutionuact cycle of OPS5 exactly. The computation cycle in Soar is divided into two phases: a monotonic elaboration phase and a decision phase. During the elaboration phase, all directly available knowledge relevant to the current problem state is brought to bear. On each cycle of elaboration phase, all instantiations of satisfied productions fire concurrently. This phase goes on till quiescence, that is, till there are no more satisfied productions. During the decision phase a fixed procedure is run that translates the information obtained during the elaboration phase into a specific decision--for example, the operator to be applied next. With respect to parallelism, the relevant differences from OPS5 are: (1) there is no conflict-resolution phase; and (2) multiple productions can fire in parallel. The impact of these differences is explored later in the thesis. BACKGROUND ]1 Soar production-system programs differ from OPS5 programs in yet another way. Soar programs can improve their performance over time by adding new productions at run-time. An auxiliary process automatically creates new productions concurrently with the operation of the production system. The impact of this feature on parallelism is, however, not explored in this thesis. 2.3. The Rete Match Algorithm The most time consuming step in the execution of production systems is the match step. To get a feeling for the complexity of match, consider a production system consisting of 1000 productions and 1000 working-memory elements, where each production has three condition elements. 
In a naive implementation each production will have to be matched against all tuples of size three from the working memory, leading to over a trillion (1000x10003) match operations for each execution cycle. Of course, more complex algorithms can achieve the above computation using a much smaller number of operations, but even with specialized algorithms, match constitutes around 90% of the interpretation time. The match algorithm used by uniprocessor implementations of OPS5 and Soar is called Rete [20]. This section describes the Rete algorithm in some detail as it forms the basis for much of the work described later in the thesis. The Rete algorithm exploits (1) the fact that only a small fraction of working memory changes each cycle, by storing results of match from previous cycles and using them in subsequent cycles; and (2) the similarity between condition elements of productions, by performing common tests only once. These two features combined together make Rete a very efficient algorithm for match. The Rete algorithm uses a special kind of a data-flow network compiled from the left-hand sides of productions to perform match. To generate the network for a production, it begins with the in- dividual condition elements in the left-hand side. For each condition element it chains together test nodes that check: • If the attributes in the condition element that have a constant as their value are satisfied. • If the attributes in the condition element that are related to a constant by a predicate are satisfied. • If two occurrences of the same variable within the condition element are consistently bound. Each node in the chain performs one such test. (The three kinds of tests above are called intra-condition tests, because they correspond to individual condition elements.) Once the algorithm has finished with the individual condition dements, it adds nodes that check for consistency of 12 PARAI_,LELISM IN PRODUCTION SYSTEMS variable bindings across the multiple condition elements in the left-hand side. (These tests are called inter-condition tests, because they refer to multiple condition elements.) Finally the algorithm adds a special terminal node to represent the production corresponding to this part of the network. Figure 2-2 shows such a network for productions pl and p2 which appear in the top part of the figure. In this figure, lines have been drawn between nodes to indicate the paths along which information flows. Information flows from the top-node down along these paths. The nodes with a single predecessor (near the top of the figure) are the ones that are concerned with individual condition elements. The nodes with two predecessors are the ones that check for consistency of variable bindings between condition elements. The terminal nodes are at the bottom of the figure. Note that when two left-hand sides require identical nodes, the algorithm shares part of the network rather than building duplicate nodes. (p pl (C1 tattrl <x> tattr2 12) (C2 tattrl 15 tattr2 <x>) (C3 tattrl <x>) (p p2 (C2 tattrl 15 tattr2 <y>) (C4 tattrl <y>) ..> ..> (modify 1tattrl 12)) (remove 2)) root C1 constant-teSnodes |t_ Cless = | attr2 = 12 attrl = 15 Class = CA Class = C3 L / _, \,..,=, " -nl / _ _'alpha-mem gn_ t_lnal-node I I _and-node iLterminal-node Conflict Set -p2' I Add to Working Memory wmel: (C1 tattrl 12 tattr212) wine2:(C2 tattrl 12 eattr215) wme3:(C2 tattrl 15 tattr212) wme4:(C3 l'attrl 12) Figure 2-2: The Rete network. 
To avoid performing the same tests repeatedly, the Rete algorithm stores the result of the match BACKGROUND 13 with working memory as state within the nodes. "Ibis way, only changes made to the working memory by the most recent production firing have to be processed every cycle. Thus, the input to the Rete network consists of the changes to the working memory. network updating the state stored within the network. These changes filter through the The output of the network consists of a specification of changes to the conflict set. The objects that are passed between nodes are called tokens, which consist of a tag and an ordered list of working-memory elements. The tag can be either a +, indicating that something has been added to the working memory, or a.-, indicating that something has been removed from it. (No special tag for working-memory element modification is needed because a modify is treated as a delete followed by an add.) The list of working-memory elements associated with a token cor- responds to a sequence of those elements that the system is trying to match or has already matched against a subsequence of condition elements in the left-hand side. The data-flow network produced by the Rete algorithm consists of four different types of nodes. 2 These are: 1. Constant-test nodes: These nodes are used to test if the attributes in the condition element which have a constant value are satisfied. These nodes always appear in the top part of the network. They have only one input, and as a result, they are sometimes called one-input nodes. 2. Memory nodes: These nodes store the results of the match phase from previous cycles as state within them. The state stored in a memory node consists of a list of the tokens that match a part of the left-hand side of the associated production. For example, the rightmost memory node in Figure 2-2 stores all tokens matching the second condition-element of production p2. At a more detailed level, there are two types of memory nodes--the a-mem nodes and the t-mere nodes. The a-mem nodes store tokens that match individual condition elements. Thus all memory nodes immediately below constant-test nodes are a-mere nodes. The fl-mem nodes store tokens that match a sequence of condition elements in the left-hand side of a production. Thus all memory nodes immediately below two-input nodes are fl-mem nodes. 3. Two-input nodes: These nodes test for joint satisfaction of condition elements in the left-hand side of a production. Both inputs of a two-input node come from memory nodes. When a token arrives on the left input of a two-input node, it is compared to each token stored in the memory node connected to the right input. All token pairs that have 2Currentimplementations of theRetealgorithmcontainsomeothernodetypesthatare notmentionedhere. Nodesof thesetypesdo not performanyof the conceptually necessaryoperationsandare presentprimarilyto simplifyspecific implementations. Forthisreason,theyhavebeenomittedfromdiscussion inthethesis. 14 PARALLELISM INPROI)UCrlONSYSTEMS consistent variable bindings are sent to the successors of the two-input node. action is taken when a token arrives on the right input of a two-input node. Similar There are also two types of two-input nodes--the and-nodes and the not-nodes. While the and-nodes are responsible for the positive condition elements and behave in the way described above, the not-nodes are responsible for the negated condition elements and behave in an opposite manner. 
The not-nodes generate a successor token only if there are no matching tokens in the memory node corresponding to the negated condition element.

4. Terminal nodes: There is one such node associated with each production in the program, as can be seen at the bottom of Figure 2-2. Whenever a token flows into a terminal node, the corresponding production is either inserted into or deleted from the conflict set.

The following example provides a more detailed view of the processing that goes on inside the Rete network. The example corresponds to the two productions and the network given in Figure 2-2. It shows the match process as the four working-memory elements shown in the bottom-left corner of Figure 2-2 are sequentially added to the working memory.

When the first working-memory element is added, token t-w1,

<+, (C1 ^attr1 12 ^attr2 12)>

is constructed and sent to the root node. The root node broadcasts the token to all its successors. The associated tests fail at all successors except at one, which is checking for "Class = C1". This constant-test node passes the token down to its single successor, another constant-test node, checking if "attr2 = 12". Since this is so, the token is passed on to the memory node, which stores the token and passes a copy of the token to the and-node below it. The and-node compares the incoming token on its left input to tokens in its right memory (which at this point is empty), but no pairs can be formed. At this point, the network has stabilized--in other words no further activity occurs--so we go on to the second working-memory element.

The token for the second working-memory element, t-w2,

<+, (C2 ^attr1 12 ^attr2 15)>

is constructed and sent to the root node, which broadcasts the token to its successors. The token passes the "Class = C2" test but fails the "attr1 = 15" test, so no further processing takes place.

The token for the third working-memory element, t-w3,

<+, (C2 ^attr1 15 ^attr2 12)>

passes through the tests "Class = C2" and "attr1 = 15", and is stored in the memory node below them. The memory node passes a copy of the token to the two successor and-nodes below it. The and-node on the right finds no tokens in its right memory, so no further processing is done there. The and-node on the left checks the token for consistency against token t-w1, stored in its left memory. The consistency check is satisfied as the variable <x> is bound consistently. The and-node creates a new token, t-w1w3,

<+, ((C1 ^attr1 12 ^attr2 12), (C2 ^attr1 15 ^attr2 12))>

and passes it down to the memory node below, which stores it. The memory node now passes a copy of the token to the and-node below it. The and-node finds that its right memory is empty, so no further processing takes place.

On addition of the fourth working-memory element, token t-w4,

<+, (C3 ^attr1 12)>

is sent to the root node, which broadcasts it. The token passes the test "Class = C3" and passes on to the memory node below it. The memory node stores the token and passes a copy of the token to the and-node below it. The and-node checks for consistent bindings in the left memory and finds that the newly arrived token, t-w4, is consistent with token t-w1w3 stored in its left memory. The and-node then creates a new token, t-w1w3w4,

<+, ((C1 ^attr1 12 ^attr2 12), (C2 ^attr1 15 ^attr2 12), (C3 ^attr1 12))>

and sends it to the terminal node corresponding to p1 below. The terminal node then inserts the instantiation of p1 corresponding to t-w1w3w4 into the conflict set.
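The two-input node behavior in this walkthrough can be sketched as a small class. The names and structures below are illustrative assumptions, not the thesis's data structures: tokens are tuples of working-memory elements, the variable-binding consistency test is passed in as a function, the left and right memory nodes are folded into the and-node itself, and only + (insert) tokens are handled; deletions and not-nodes are omitted.

    class AndNode:
        """Sketch of an and-node with its two memories folded in."""

        def __init__(self, consistent, successors):
            self.left_memory = []          # tokens from the beta-memory side
            self.right_memory = []         # wmes from the alpha-memory side
            self.consistent = consistent   # variable-binding consistency test
            self.successors = successors   # downstream nodes, as callables

        def left_activation(self, token):
            self.left_memory.append(token)
            for wme in self.right_memory:
                if self.consistent(token, wme):
                    self.emit(token + (wme,))

        def right_activation(self, wme):
            self.right_memory.append(wme)
            for token in self.left_memory:
                if self.consistent(token, wme):
                    self.emit(token + (wme,))

        def emit(self, new_token):
            for succ in self.successors:
                succ(new_token)

    # The join from the walkthrough: <x> binds C1's attr1 and C2's attr2,
    # so the stored wme's attr1 must equal the arriving wme's attr2.
    out = []
    node = AndNode(lambda tok, wme: tok[-1]['attr1'] == wme['attr2'],
                   [out.append])
    node.left_activation(({'class': 'C1', 'attr1': 12, 'attr2': 12},))  # t-w1
    node.right_activation({'class': 'C2', 'attr1': 15, 'attr2': 12})    # wme3
    print(out)   # one new token, corresponding to t-w1w3

Folding the memories into the two-input node is, incidentally, a real design question and not just a shortcut here: Section 5.2.2 argues that memory nodes need to be lumped with two-input nodes in the parallel implementation.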
The performance of Rete-based interpreters has steadily improved over the years. The most widely used Rete-based interpreter is the OPS5 interpreter. The Franz Lisp implementation of this interpreter runs at around 8 wme-changes/sec (about 3 rule firings per second) on a VAX-11/780, while a Bliss-based implementation runs at around 40 wme-changes/sec. In the above two interpreters a significant loss in speed is due to the interpretation overhead of nodes. In OPS83 [21] this overhead has been eliminated by compiling the network directly into machine code. While it is possible to escape to the interpreter for complex operations during match or for setting up the initial conditions for the match, the majority of the match is done without an intervening interpretation level. This has led to a large speed-up, and the OPS83 interpreter runs at around 200 wme-changes/sec on the VAX-11/780. Some further optimizations to OPS83 have been designed which would permit it to run at around 400-800 wme-changes/sec. The aim of the parallel implementations is to take the performance still higher, into the range 2000-20000 wme-changes/sec. It is expected that this order-of-magnitude increase in speed over the best possible uniprocessor interpreters will open new application and research areas that could not be addressed before by production systems.

2.4. Why Parallelize Rete?

While being an extremely efficient algorithm for match on uniprocessors, Rete is also an effective algorithm for match on parallel processors. This section discusses some of the motivations for studying Rete and the reasons why Rete is appropriate for parallel implementations.

2.4.1. State-Saving vs. Non-State-Saving Match Algorithms

It is possible to divide the set of match algorithms for production systems into two categories: (1) the state-saving algorithms and (2) the non-state-saving algorithms. The state-saving algorithms store the results of executing match from previous recognize-act cycles, so that only the changes made to the working memory by the most recent production firing need be processed every cycle. In contrast, the non-state-saving algorithms start from scratch every time, that is, they match the complete working memory against all the productions on each cycle.

In a state-saving algorithm the work done includes two steps: (1) computing the fresh state corresponding to the newly inserted working-memory elements and storing the fresh state in appropriate data structures; (2) identifying the state corresponding to the deleted working-memory elements and deleting this state from the data structures storing it. In a non-state-saving algorithm the work done includes only one step, that of computing the state for the match between the complete working memory and all the productions. (Note that this may involve temporarily storing some of the partial state that is generated.) In both state-saving and non-state-saving algorithms, state refers to the matches between condition elements and working-memory elements that are computed as an intermediate step in the process of computing the match between the complete left-hand sides of productions and the working-memory elements.

Whether it is advantageous to store state depends on (1) the fraction of working memory that changes on each cycle, and (2) the amount of state that is stored by the state-saving algorithm. To evaluate the advantages and disadvantages more concretely, consider the following simple model.
Consider a production-system program for which the stable size of the working memory is s, the average number of inserts to working memory on each cycle is i, and the average number of deletes from working memory is d. Let the cost of step-1 (as described in the previous paragraph) for a single insert to working memory be c1, and the cost of step-2 for a single delete from working memory be c2. Further assume that the average cost of the temporary state computed and stored by the non-state-saving algorithm is c3 for each working-memory element. Then the average cost per execution cycle of the state-saving algorithm is C_state-sav = i·c1 + d·c2, and the average cost per execution cycle of the non-state-saving algorithm is C_non-state-sav = s·c3.

To evaluate the advantages of the state-saving algorithm, consider the inequality C_state-sav < C_non-state-sav. For the implementations being considered for the Rete algorithm, the cost of an insert to working memory is the same as the cost of a delete from the working memory. As a result, substituting c1 = c2 in the inequality, we get (i + d)/s < c3/c1. Estimates based on simulations done for the Rete algorithm (see Chapter 8) indicate that c1 is approximately equal to the execution of 1800 machine-code instructions, and that c3 is approximately equal to the execution of 1100 machine-code instructions. Using these estimates we get the condition that state-saving algorithms are more efficient when (i + d)/s < 0.61, that is, state-saving algorithms are better if the number of insertions plus deletions per cycle is less than 61% of the stable size of the working memory. Measurements [30] on several OPS5 programs show that the number of inserts plus deletes per cycle constitutes less than 0.5% of the stable working memory size. Thus a non-state-saving algorithm will have to recover an inefficiency factor of 120 before it breaks even with a state-saving algorithm for OPS5-like production systems.3

The following example illustrates some of the points made in the previous paragraphs. Consider a production-system program whose stable working memory size is around 1000 elements and where each production firing makes 2-3 changes to the working memory. This is a common scenario for OPS5 programs. It is quite obvious in this case that since 99.8% of the working memory is unchanged, it is unwise to use a non-state-saving algorithm which performs match with this unchanged working memory all over again. Now consider a program whose stable working memory size is again 1000, but where each production firing changes 750 of these 1000 working-memory elements. In this case a state-saving algorithm will first have to identify and delete the state corresponding to the 750 deleted elements and then recompute and store state for the new 750 elements. The only saving corresponds to the unchanged 250 working-memory elements. The state-saving algorithm in this case no longer seems attractive.

The amount of state stored by a state-saving algorithm also influences the suitability of such an algorithm compared to a non-state-saving algorithm. The amount of state stored is important because it determines the amount of work needed to compute it when working-memory elements are inserted, and the amount of work needed to delete it when working-memory elements are deleted.
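Since the model reduces to a single inequality, it can be checked mechanically. The sketch below (illustrative Python, plugging in the instruction-count estimates quoted above) computes both per-cycle costs for a given program profile:

    # Cost model of Section 2.4.1. c1 = per-insert cost and c2 = per-delete
    # cost of the state-saving algorithm; c3 = per-WME cost of the
    # non-state-saving algorithm (all in machine instructions, per the
    # simulation estimates cited above).
    C1 = C2 = 1800
    C3 = 1100

    def cycle_costs(s, i, d):
        state_saving = i * C1 + d * C2
        non_state_saving = s * C3
        return state_saving, non_state_saving

    # A typical OPS5-like profile: 1000 stable WMEs, 2-3 changes per cycle.
    print(cycle_costs(s=1000, i=2, d=1))   # (5400, 1100000)
    print(C3 / C1)                         # break-even ratio (i+d)/s ~ 0.61

With (i + d)/s far below the 0.61 break-even point, the state-saving algorithm dominates, which is exactly the situation measured for OPS5 programs.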
In terms of the model described in the previous paragraphs, the amount of state that a state-saving algorithm stores affects the values of the constants c1, c2, and c3, thus influencing the ratio (i + d)/s at which a non-state-saving algorithm becomes appropriate.

3. In case the values of c1, c2, and c3 are estimated from a different base algorithm, the above numbers will change somewhat.

In summary, state-saving algorithms are appropriate when working memory changes slowly and when it costs more to recompute the state for the unchanged part of the working memory than to undo the state for the deleted working-memory elements. For OPS5 and Soar systems, the fraction of working memory that changes on every cycle is, in fact, very small. For this reason only state-saving algorithms are considered in the thesis.

2.4.2. Rete as a Specific Instance of State-Saving Algorithms

While it is generally accepted that state-saving algorithms are suitable for OPS5 and Soar production systems, there is no consensus about the amount of state that such algorithms should store. Of course, there are many possibilities. This section discusses some of the schemes that various research groups have explored, where the Rete algorithm fits amongst these schemes, and why Rete is interesting.

One possible scheme that a state-saving algorithm may use is to store information only about matches between individual condition elements and working-memory elements. In the terminology of the Rete algorithm, this means that only the state associated with α-mem nodes is stored. For example, consider a production with three condition elements CE1, CE2, and CE3. Then the algorithm stores information about all working-memory elements that match CE1, all working-memory elements that match CE2, and all working-memory elements that match CE3. However, it does not store working-memory tuples that satisfy CE1 and CE2 together, or CE2 and CE3 together, and so on. This information is recomputed on each cycle. Such a scheme is used by the TREAT algorithm developed for the DADO machine at Columbia University [60]. This scheme stands at the low end of the spectrum of state-saving algorithms. A problem with this scheme is that much of the state has to be recomputed on each cycle, often with the effect of increasing the total time taken by the cycle.

A second possible scheme that a state-saving algorithm may use is to store information about matches between all possible combinations of condition elements that occur in the left-hand side of a production and the sequences of working-memory elements that satisfy them. For example, consider a production with three condition elements CE1, CE2, and CE3. Then, in this scheme, the algorithm stores information about all working-memory elements that match CE1, CE2, and CE3 individually, about all working-memory element tuples that match CE1 and CE2 together, CE2 and CE3 together, CE1 and CE3 together, and so on. Kemal Oflazer, in his thesis [67], has proposed a variation of this scheme to implement a highly parallel algorithm for match. This scheme stands at the high end of the spectrum of state-saving algorithms, in that it stores almost all information known about the matches between the productions and the working memory. Two possible problems with such a scheme are: (1) the state may become very large; and (2) the algorithm may spend a lot of time computing and deleting state that never really gets used, that is, state that never results in a production entering or leaving the conflict set.
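The spectrum can be made concrete by enumerating which combinations of condition elements each scheme keeps memories for. The fragment below is an explanatory sketch (not code from TREAT, from Oflazer's system, or from the thesis) for a production with condition elements CE1, CE2, and CE3:

    from itertools import combinations

    ces = ['CE1', 'CE2', 'CE3']

    # Low end (TREAT-like): memories only for individual condition elements.
    low_end = [(ce,) for ce in ces]

    # High end: a memory for every combination of condition elements.
    high_end = [c for r in range(1, len(ces) + 1)
                  for c in combinations(ces, r)]

    # Standard Rete: individual condition elements plus the fixed
    # left-to-right prefix combination chosen at compile time.
    rete = low_end + [('CE1', 'CE2')]

    print(len(low_end), len(high_end), len(rete))   # 3 7 4

The stored state grows from 3 memories at the low end to 7 at the high end; Rete's 4 sit in between, as the next paragraphs describe.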
The amount of state computed by the Rete algorithm falls in between that computed by the previous two schemes. The Rete algorithm stores information about working-memory elements that match individual condition elements, as proposed in the first scheme. In addition, it also stores information about tuples of working-memory elements that match some fixed combinations of condition elements occurring in the left-hand side of a production. This is in contrast to the second scheme, where information is stored about tuples of working-memory elements that match all combinations of condition elements. The choice of the combinations of condition elements for which match information is stored is fixed at compile time.4 For example, for a production with three condition elements CE1, CE2, and CE3, the standard Rete algorithm stores information about working-memory elements that match CE1, CE2, and CE3 individually. This information is stored in the α-mem nodes. In addition, it stores information about working-memory element tuples that match CE1 and CE2 together. This information is stored in a β-mem node, as can be seen in Figure 2-2. The Rete algorithm uses this information and combines it with the information about working-memory elements that match CE3 to generate tuples that match the complete left-hand side (CE1, CE2, and CE3 all together). The Rete algorithm does not store information about working-memory tuples that match CE1 and CE3 together or those tuples that match CE2 and CE3 together, as is done by the algorithm in the second scheme.

The Rete algorithm has been successfully used in current uniprocessor implementations of OPS5 and Soar. It avoids some of the extra work done in the first scheme, namely the work done in recomputing working-memory tuples that match combinations of condition elements. It is also less susceptible to the combinatorial explosion of state that is possible in the second scheme, because it can carefully select the combinations of condition elements for which state is stored. The combinations that are selected can significantly impact the efficiency of a parallel implementation of the algorithm. The thesis evaluates two of the many possible schemes for choosing these combinations, and discusses the factors influencing the choice in Section 5.2.6.

4. Note that by varying the combinations of condition elements for which match information is stored, a large family of different Rete algorithms can be generated.

2.4.3. Node Sharing in the Rete Algorithm

As mentioned in Section 2.3, the Rete algorithm exploits the similarity between condition elements of productions by sharing constant-test nodes, memory nodes, and two-input nodes. For example, two constant-test nodes and an α-memory node are shared between productions p1 and p2 shown in Figure 2-2 (for some statistics on sharing of nodes in the Rete network, see Section 3.3.2). The sharing of nodes in the Rete network results in considerable savings in execution time, since the nodes have to be evaluated only once instead of multiple times. The sharing of nodes also results in savings in space.
This is because sharing reduces the size of the Rete network and because sharing can collapse replicated tokens in multiple unshared memory nodes into a single token in a shared memory node. The main implication of the above discussion is that it is important for an efficient match algorithm to exploit the similarity in condition elements of productions to enhance its performance.

2.4.4. Rete as a Parallel Algorithm

The previous three subsections have discussed the general suitability of the Rete algorithm for match in OPS5 and Soar systems. They, however, do not discuss the suitability of Rete for parallel implementation. This subsection describes some features of Rete that make it attractive for parallel implementation.

The dataflow-like organization of the Rete network is the key feature that permits exploitation of parallelism at a relatively fine grain. It is possible to evaluate the activations of different nodes in the Rete network in parallel. It is also possible to evaluate multiple activations of the same node in parallel and to process multiple changes to working memory in parallel. These sources of parallelism are discussed in detail in Chapter 4. The parallel evaluation of node activations in the Rete network also corresponds to higher-level, more intuitive forms of parallelism in production systems. For example, evaluating different node activations in parallel corresponds to (1) performing match for different productions in parallel (also called production-level parallelism) and (2) performing match for different condition elements within the same production in parallel (also called condition-level parallelism) [22].

The state-saving model of Rete, where the state corresponding to only fixed combinations of condition elements in the left-hand side is stored, does impose some sequentiality on the evaluation of nodes, as compared to models where state corresponding to all possible combinations of condition elements is stored. It is, however, a plausible trade-off in order to avoid some of the problems associated with the other schemes, as discussed in the previous subsections.

Finally, although we have no way to prove that Rete is the most suitable algorithm for parallel implementation of production systems, for the reasons stated above, we are confident that it is a pretty good algorithm. Detailed simulation results for parallel implementations presented in Chapter 8 of this thesis, and comparisons to the performance numbers obtained by other researchers (see Chapter 9) [31, 37, 60, 67], further confirm this view.

Chapter Three
Measurements on Production Systems

Before proceeding with the design of a complex algorithm or architecture, it is necessary to identify the characteristics of the programs or applications for which the algorithm or the architecture is to be optimized. For example, computer architects refer to data about the usage of instructions, the depth of procedure-call invocations, and the frequency of successful branch instructions to optimize the design and implementation of new machine architectures [36, 68, 70]. Information about the target programs or applications serves two purposes: (1) it serves as an aid in the design process by identifying critical requirements; and (2) it serves as a means to evaluate the finished design.
This chapter describes the characteristics of six OPS5 and Soar production systems that have been used in the thesis for the design and evaluation of parallel implementations of production systems.5 The data about the six production systems is divided into three parts. The first part consists of measurements on the textual structure of these production systems. The second part consists of information on the compiled form of the productions, and the third part consists of run-time measurements on the production-system programs.

3.1. Production-System Programs Studied in the Thesis

The six production-system programs that have been used to evaluate the algorithms and the architectures for the parallel implementation of production systems are given below. They are listed in order of decreasing number of productions, and this order is maintained in all the graphs shown later.6

1. VT [51] (Vertical Transport) is an expert system that selects components for a traction elevator system. It is written in OPS5 and consists of 1322 rules.

2. ILOG7 [58] is an expert system that maintains inventories and production schedules for factories. It is written in OPS5 and consists of 1181 rules.

3. MUD [39] is an expert system that is used to analyze fluids used in oil-drilling operations. It is written in OPS5 and consists of 872 rules.

4. DAA [42, 43] (Design Automation Assistant) is an expert system that designs computers from high-level specifications of the systems. It is written in OPS5 and consists of 445 rules.

5. R1-SOAR [74] is an expert system that configures the UNIBUS for Digital Equipment Corporation's VAX-11 computer systems. It is written in Soar and consists of 319 rules.

6. EP-SOAR [47] is an expert system that solves the Eight Puzzle. It is written in Soar and consists of 62 rules.

5. The characteristics for another set of six production-system programs can be found in [30].

6. Note: Many of the production-system programs listed below are still undergoing development. For this reason the data associated with the programs is liable to change. The number of rules listed with each of the programs corresponds to the number of rules in the version on which data was taken.

The above production-system programs represent a variety of applications and programming styles. For example, VT is a knowledge-intensive expert system which has been especially designed with knowledge acquisition in mind. It consists of only a small number of rule types and is significantly different from the earlier systems [55, 56] developed at Carnegie-Mellon University.8 ILOG is a run-of-the-mill knowledge-intensive expert system. In contrast to the other five systems, the MUD system is a backward-chaining production system [4] and is primarily goal driven. The DAA program represents a computation-intensive task compared to the knowledge-intensive tasks performed by the VT, ILOG, and MUD systems. Both R1-SOAR and EP-SOAR represent programming styles in Soar. R1-SOAR also represents an attempt at doing knowledge-intensive programming in a general weak-method problem-solving architecture. It can make use of the available knowledge to achieve high performance, but whenever knowledge is lacking, it has mechanisms so that the program can resort to more basic and knowledge-lean problem-solving methods.

3.2. Surface Characteristics of Production Systems

Surface measurements refer to the textual features of production-system programs.
Examples of such features are: the number of condition elements in the left-hand sides of productions, the number of attributes per condition element, and the number of variables per condition element. Such features are useful in that they give information about the code and static data structures that are generated for the programs, and they also help explain some aspects of the run-time behavior of the programs.

7. Referred to as PTRANS in the cited paper.

8. Personal communication from John McDermott.

The following subsections present the data for the measured features, including a brief description of how the measurements were made. Data about the same features of different production systems are presented together, and have been normalized to permit comparison.9 Along with each data graph the average, the standard deviation, and the coefficient of variation10 for the data points are given.

3.2.1. Condition Elements per Production

Figure 3-1 shows the number of condition elements per production for the six production-system programs. The number of condition elements per production includes both positive elements and negative ones. The curves for the programs are normalized by plotting percent of productions, instead of number of productions, along the y-axis. The number of condition elements in a production reflects the specificity of the production, that is, the set of situations in which the production is applicable. The number of condition elements in a production also impacts the complexity of performing match for that production (see Section 5.2.6). Note that, on average, Soar productions have many more condition elements than OPS5 productions.

3.2.2. Actions per Production

Figure 3-2 shows the number of actions per production. The number of actions reflects the processing required to execute the right-hand side of a production. A large number of actions per production also implies a greater potential for parallelism, because then a large number of changes to the working memory can be processed in parallel, before the next conflict-resolution phase is executed.

3.2.3. Negative Condition Elements per Production

The graph in Figure 3-3 shows the number of negated condition elements in the left-hand side of a production versus the percent of productions having them. It shows that approximately 27% of productions have one or more negated condition elements. Since negated condition elements denote universal quantification over the working memory, the percentage of productions having them is an important characteristic of production-system programs. The measurements are also useful in calculating the number of not-nodes in the Rete network.

9. The limits of the axes of the graphs have been adjusted to show the main portion of the graph clearly. In doing this, however, in some cases a few extreme points could not be put on the graph. For this reason, the reader should not draw conclusions about the maximum values of the parameters from the graph.

10. Coefficient of Variation = Standard Deviation / Average.

[Figure 3-1: Condition elements per production. Avg 3.28, SD 1.66, CV 0.50 for VT; Avg 3.92, SD 2.03, CV 0.52 for ILOG; Avg 2.47, SD 1.36, CV 0.55 for MUD; Avg 3.89, SD 2.88, CV 0.74 for DAA; Avg 8.60, SD 4.45, CV 0.52 for R1-SOAR.]
[Figure 3-2: Actions per production. Avg 4.80, SD 13.88, CV 2.89 for VT; Avg 3.16, SD 4.48, CV 1.42 for ILOG; Avg 3.42, SD 5.77, CV 1.69 for MUD; Avg 2.42, SD 2.19, CV 0.91 for DAA; Avg 9.62, SD 16.68, CV 1.73 for R1-SOAR; Avg 4.29, SD 17.17, CV 4.00 for EP-SOAR.]

[Figure 3-3: Negative condition elements per production. Avg 0.27, SD 0.50, CV 1.85 for VT; Avg 0.35, SD 0.69, CV 1.96 for ILOG; Avg 0.33, SD 0.61, CV 1.86 for MUD; Avg 0.52, SD 0.83, CV 1.60 for DAA; Avg 0.24, SD 0.59, CV 2.49 for R1-SOAR.]

3.2.4. Attributes per Condition Element

Figure 3-4 shows the distribution of the number of attributes per condition element. The class of a condition element, which is an implicit attribute, is counted explicitly in the measurements. The number of attributes in a condition element reflects the number of tests that are required to detect a matching working-memory element. The striking peak at three for the R1-SOAR and EP-SOAR programs reflects the uniform encoding of data as triplets in Soar.

3.2.5. Tests per Two-Input Node

This feature is specific to the Rete match algorithm and refers to the number of variable bindings that are checked for consistency at each two-input node (and-node or not-node). A value of zero indicates that no variables are checked for consistent binding, while a large value indicates that a large number of variables are checked. For example, if the number of tests is zero, for every token that arrives at the input of an and-node, as many tokens as there are in the opposite memory are sent to its successors. This usually implies a large amount of work. Alternatively, if the number of tests is large, then the number of tokens sent to the successors is small, but doing the pairwise comparison for consistent binding now takes more time. The graph for the number of tests per two-input node is shown in Figure 3-5.

[Figure 3-4: Attributes per condition element. Avg 2.58, SD 1.64, CV 0.64 for VT; Avg 2.87, SD 1.86, CV 0.65 for ILOG; Avg 2.34, SD 1.39, CV 0.60 for MUD; Avg 2.65, SD 1.48, CV 0.56 for DAA; Avg 3.10, SD 0.61, CV 0.20 for R1-SOAR; Avg 3.12, SD 0.66, CV 0.21 for EP-SOAR.]

[Figure 3-5: Tests per two-input node. Avg 0.37, SD 0.75, CV 2.00 for VT; Avg 1.21, SD 1.01, CV 0.84 for ILOG; Avg 0.59, SD 0.71, CV 1.21 for MUD; Avg 1.27, SD 1.53, CV 1.21 for DAA; Avg 1.16, SD 0.49, CV 0.42 for R1-SOAR; Avg 1.23, SD 0.53, CV 0.43 for EP-SOAR.]

3.2.6. Variables Bound and Referenced

Figure 3-6 shows the number of distinct variables which are both bound and referenced in the left-hand side of a production. Consistency tests are necessary only for these variables. Beyond the α-mem nodes, all processing done by the two-input nodes requires access to the values of only these variables; the values of other variables or attributes are not required.
This implies that the tokens in the network may store only the values of these variables instead of storing complete copies of working-memory elements. For parallel architectures that do not have shared memory, this can lead to significant improvements in the storage requirements and in the communication costs associated with tokens.

[Figure 3-6: Variables bound and referenced. Avg 0.72, SD 1.15, CV 1.59 for VT; Avg 2.21, SD 2.02, CV 0.91 for ILOG; Avg 0.68, SD 0.96, CV 1.41 for MUD; Avg 2.29, SD 3.13, CV 1.27 for DAA; Avg 4.77, SD 2.75, CV 0.58 for R1-SOAR; Avg 5.84, SD 5.41, CV 0.93 for EP-SOAR.]

3.2.7. Variables Bound but not Referenced

Figure 3-7 shows the number of distinct variables which are bound but not referenced in the left-hand side of a production. (These bindings are usually used in the right-hand side of the production.) This indicates the number of variables for which no consistency checks have to be performed.

[Figure 3-7: Variables bound but not referenced. Avg 1.93, SD 2.66, CV 1.37 for VT; Avg 1.96, SD 5.30, CV 2.70 for ILOG; Avg 0.94, SD 1.54, CV 1.63 for MUD; Avg 2.82, SD 3.50, CV 1.24 for DAA; Avg 1.61, SD 2.00, CV 1.24 for R1-SOAR; Avg 1.65, SD 1.59, CV 0.96 for EP-SOAR.]

3.2.8. Variable Occurrences in Left-Hand Side

Figure 3-8 shows the number of times each variable occurs in the left-hand side of a production. Both positive and negative condition elements are considered in counting the variables. Our measurements also show that variables almost never occur multiple times within the same condition element (average of 1.5% over all systems). Under this assumption, the number of occurrences of a variable represents the number of condition elements within a production in which the variable occurs.

[Figure 3-8: Occurrences of each variable. Avg 1.32, SD 0.58, CV 0.44 for VT; Avg 1.85, SD 1.11, CV 0.60 for ILOG; Avg 1.54, SD 0.74, CV 0.48 for MUD; Avg 1.72, SD 1.10, CV 0.64 for DAA; Avg 2.38, SD 1.25, CV 0.52 for R1-SOAR; Avg 2.48, SD 1.25, CV 0.50 for EP-SOAR.]

3.2.9. Variables per Condition Element

Figure 3-9 shows the number of variable occurrences within a condition element (not necessarily distinct, though as per Section 3.2.8 they mostly are). If this number is significant compared to the number of attributes for some class of condition elements, then it usually implies that the selectivity of those condition elements is small, or in other words, a large number of working-memory elements will match those condition elements.

[Figure 3-9: Variables per condition element. Avg 1.07, SD 1.44, CV 1.34 for VT; Avg 1.97, SD 2.13, CV 1.09 for ILOG; Avg 1.00, SD 1.24, CV 1.24 for MUD; Avg 2.26, SD 2.49, CV 1.10 for DAA; Avg 1.77, SD 0.66, CV 0.37 for R1-SOAR; Avg 1.86, SD 0.64, CV 0.34 for EP-SOAR.]
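This selectivity argument can be quantified roughly by subtracting the average number of variable occurrences from the average number of attributes, which estimates how many attributes of a class are tested against constants. The fragment below is an illustrative sketch only; the per-class averages are those reported for VT in Table 3-1 of the next subsection:

    # Estimated constant-bound attributes per condition-element class.
    # Values are (avg attributes, avg variable occurrences) for VT,
    # taken from Table 3-1.
    classes = {
        'context': (1.59, 0.22),
        'item':    (2.97, 1.14),
        'input':   (3.06, 1.56),
    }

    for name, (attrs, variables) in classes.items():
        print(name, round(attrs - variables, 2))
    # The fewer constant-bound attributes a class has, the less selective
    # its condition elements, and the more WMEs will match them.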
3.2.10. Condition Element Classes

Tables 3-1, 3-2, 3-3, 3-4, 3-5, and 3-6 list the seven condition-element classes occurring most frequently in each of the production-system programs. The tables also list the total number of attributes, the average number of attributes and its standard deviation, and the average number of variable occurrences in condition elements of each class. The total number of attributes for a condition-element class gives an estimate of the size of the working-memory element. This information can be used to determine the communication overhead in transporting working-memory elements amongst multiple memories in a parallel architecture. It also has implications for the space requirements for storing the working-memory elements. If we subtract the average number of variables from the average number of attributes for a condition-element class, we obtain the average number of attributes which have a constant value for that class. This number in turn has implications for the selectivity of condition elements of that class.

Table 3-1: VT: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. context      1366 (31%)    4         1.59      0.75     0.22
  2. item         756 (17%)     47        2.97      0.86     1.14
  3. input        448 (10%)     19        3.06      1.24     1.56
  4. needdata     239 (5%)      27        2.62      1.53     1.56
  5. distance     228 (5%)      12        5.18      1.40     1.67
  6. sys-measure  175 (4%)      11        4.87      1.47     1.71
  7. it-stack     110 (2%)      4         1.05      0.35     1.02

  Total number of condition element classes is 48.

Table 3-2: ILOG: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. arg          1270 (27%)    4         2.99      0.14     1.91
  2. task         1004 (21%)    2         1.76      0.44     0.77
  3. datum        431 (9%)      58        4.16      2.15     2.89
  4. period       143 (3%)      13        3.81      1.18     3.41
  5. packed-with  106 (2%)      32        4.73      2.23     3.84
  6. order        101 (2%)      37        3.16      2.29     3.08
  7. capacity     91 (1%)       41        5.66      4.57     3.90

  Total number of condition element classes is 86.

Table 3-3: MUD: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. task         678 (31%)     4         2.35      0.85     0.58
  2. data         547 (25%)     24        2.35      1.15     1.11
  3. hyp          160 (7%)      9         1.99      0.72     0.60
  4. datafor      111 (5%)      20        4.14      1.93     2.55
  5. reason       74 (3%)       13        3.12      1.58     1.24
  6. change       65 (3%)       6         1.40      0.87     0.88
  7. do           65 (3%)       21        5.25      1.60     2.83

  Total number of condition element classes is 38.

Table 3-4: DAA: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. context      474 (24%)     3         2.40      0.52     2.05
  2. port         241 (13%)     6         2.35      0.72     2.08
  3. db-operator  197 (11%)     6         1.70      0.58     0.54
  4. link         173 (9%)      6         5.28      1.53     5.55
  5. module       170 (9%)      6         2.68      1.12     1.66
  6. lists        134 (7%)      3         1.75      0.44     2.06
  7. outnode      112 (6%)      11        2.37      0.87     2.14

  Total number of condition element classes is 26.

Table 3-5: R1-SOAR: Condition Element Classes

  Class Name        # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. goal-ctx-info  988 (36%)     3         2.99      0.11     1.80
  2. op-info        383 (13%)     3         2.95      0.23     1.54
  3. state-info     375 (13%)     3         2.88      0.32     1.77
  4. space-info     217 (7%)      3         3.00      0.07     1.04
  5. order-info     183 (6%)      3         2.99      0.10     1.67
  6. preference     157 (5%)      8         5.32      0.78     3.44
  7. module-info    87 (3%)       3         2.92      0.27     1.90

  Total number of condition element classes is 21.
Table 3-6: EP-SOAR: Condition Element Classes

  Class Name        # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. goal-ctx-info  278 (44%)     3         2.99      0.10     1.83
  2. binding-info   85 (13%)      3         3.00      0.00     1.71
  3. state-info     59 (9%)       3         2.90      0.30     1.92
  4. eval-info      54 (8%)       3         2.96      0.19     1.83
  5. op-info        41 (6%)       3         2.93      0.26     1.54
  6. preference     36 (5%)       8         5.47      1.12     3.22
  7. space-info     30 (4%)       3         3.00      0.00     1.13

  Total number of condition element classes is 10.

3.2.11. Action Types

Table 3-7 gives the distribution of actions in the right-hand side into the classes make, remove, modify, and other for the production-system programs. The only actions that affect the working memory are of type make, remove, or modify. While each make and remove action causes only one change to the working memory, a modify action causes two changes to the working memory. This data then gives an estimate of the percentage of right-hand side actions that change the working memory. This data can also be combined with data about the number of actions in the right-hand side of productions (given in Section 3.2.2) to determine the average number of changes made to working memory per production firing.

Table 3-7: Action Type Distribution

  Action Type  VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Make      52%   20%   48%   34%   86%      78%
  2. Modify    13%   15%   17%   18%   0%       0%
  3. Remove    5%    7%    4%    18%   0%       0%
  4. Others    27%   56%   28%   27%   12%      21%
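As a back-of-the-envelope example of the calculation just mentioned, the sketch below (illustrative only, using the VT values from Table 3-7 above and the VT action count from Figure 3-2) estimates working-memory changes per firing, counting one change for each make and remove and two for each modify:

    # Rough estimate of working-memory changes per production firing.
    actions_per_firing = 4.80                  # VT, Figure 3-2
    make, modify, remove = 0.52, 0.13, 0.05    # VT, Table 3-7

    changes_per_firing = actions_per_firing * (make + remove + 2 * modify)
    print(round(changes_per_firing, 2))        # ~3.98 for VT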
3.2.12. Summary of Surface Measurements

Table 3-8 gives a summary of the surface measurements for the production-system programs. It brings together the average values of the various features for all six programs. The features listed in the table are condition elements per production, actions per production, negated condition elements per production, attributes per condition element, variables per condition element, and tests per two-input node.

Table 3-8: Summary of Surface Measurements

  Feature         VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Productions  1322  1181  872   445   319      62
  2. CEs/Prod     3.28  3.92  2.47  3.89  8.60     9.97
  3. Actns/Prod   4.80  3.16  3.42  2.42  9.62     4.29
  4. nCEs/Prod    0.27  0.35  0.33  0.52  0.24     0.21
  5. Attr/CE      2.58  2.87  2.34  2.65  3.10     3.12
  6. Vars/CE      1.07  1.97  1.00  2.26  1.77     1.86
  7. Tests/2inp   0.37  1.21  0.59  1.27  1.16     1.23

3.3. Measurements on the Rete Network

This section presents results of measurements made on the Rete network constructed by the OPS5 compiler. The measured features include the number of nodes of each type in the network and the amount of sharing that is present in the network.

3.3.1. Number of Nodes in the Rete Network

Table 3-9 presents data on the number of nodes of each type in the network for the various production-system programs. These numbers reflect the complexity of the network that is constructed for the programs. Table 3-10 gives the normalized number of nodes, that is, the number of nodes per production. The normalized numbers are useful for comparing the average complexity of the productions for the various production-system programs.11

Table 3-9: Number of Nodes

  Node Type      VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Const-Test  2849  1884  1743  397   436      118
  2. α-mem       1748  1481  878   339   398      96
  3. β-mem       1116  1363  358   549   1252     369
  4. And         2205  2320  872   847   1542     425
  5. Not         332   400   267   144   60       13
  6. Terminal    1322  1181  872   445   319      62
  7. Total       9572  8629  4990  2721  4079     1083

Table 3-10: Nodes per Production

  Node Type      VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Const-Test  2.15  1.59  1.99  0.89  1.11     1.90
  2. α-mem       1.32  1.25  1.00  0.76  1.01     1.54
  3. β-mem       0.84  1.15  0.41  1.23  3.20     5.95
  4. And         1.66  1.96  1.00  1.89  3.94     6.85
  5. Not         0.25  0.33  0.30  0.32  0.15     0.20
  6. Terminal    1.00  1.00  1.00  1.00  1.00     1.00
  7. Total       7.22  7.28  5.70  6.09  10.41    17.44

11. All the numbers listed in Tables 3-9 and 3-10 are for the case where the network compiler is allowed to share nodes.

Table 3-11 presents the number of nodes per condition element for the production-system programs. The average number of nodes per condition element over all the systems is 1.86. This number is quite small because many nodes are shared between condition elements. In case no sharing is allowed, this number jumps up two to three fold, as is shown in Table 3-12.

Table 3-11: Nodes per Condition Element (with sharing)

  Feature        VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Total CEs   4336  4629  2153  1731  2743     618
  2. Tot. Nodes  9572  8629  4990  2721  4079     1083
  3. Nodes/CE    2.20  1.86  2.31  1.57  1.48     1.75

Table 3-12: Nodes per Condition Element (without sharing)

  Feature        VT     ILOG   MUD   DAA   R1-SOAR  EP-SOAR
  1. Total CEs   4336   4629   2153  1731  2743     618
  2. Tot. Nodes  20950  19717  9953  7006  12024    2532
  3. Nodes/CE    4.83   4.25   4.62  4.04  4.38     4.10
  4. Sharing     2.19   2.28   2.00  2.57  2.95     2.34

3.3.2. Network Sharing

The OPS5 network compiler exploits similarity in the condition elements of productions to share nodes in the Rete network. Such sharing is not possible in parallel implementations of production systems where each production is placed on a separate processor, although some sharing is possible in parallel implementations that use a shared-memory multiprocessor. To help estimate the extra computation required due to loss of sharing, Table 3-13 gives the ratios of the number of nodes in the unshared Rete network to the number of nodes in the shared Rete network. The ratios do not give the extra computational requirements exactly, because they are only a static measure; the exact numbers will depend on the dynamic flow of information (tokens) through the network. Table 3-13 also shows that the sharing is large only for constant-test and α-mem nodes, and small for all other node types.12

Table 3-13: Network Sharing (Nodes without sharing / Nodes with sharing)

  Node Type      VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Const-Test  3.86  4.57  3.21  7.38  10.34    6.90
  2. α-mem       2.35  3.04  2.05  4.57  6.85     6.40
  3. β-mem       1.35  1.44  1.17  1.44  1.63     1.31
  4. And         1.19  1.30  1.12  1.24  1.52     1.27
  5. Not         1.08  1.04  1.00  1.61  1.26     1.00

12. Note that the reported ratios correspond to the amount of sharing or similarity exploited by the OPS5 network compiler, which may not be the same as the maximum exploitable similarity available in the production-system program.

3.4. Run-Time Characteristics of Production Systems

This section presents data on the run-time behavior of production systems. The measurements are useful to identify operations frequently performed by the interpreter and provide some rough bounds on the speed-up that may be achieved by parallel implementations. Although most of the reported measurements are in terms of the Rete network, a number of general conclusions can be drawn from the measurements.

3.4.1. Constant-Test Nodes

Table 3-14 presents run-time statistics for constant-test nodes. The first line of the table, labeled "visits/change", refers to the average number of constant-test node visits (activations) per change to working memory.13 The second line of the table reports the number of constant-test activations as a fraction of the total number of node activations.
The third line of the table, labeled "success", reports the percent of constant-test node activations that have their associated test satisfied.

Table 3-14: Constant-Test Nodes

  Feature            VT      ILOG    MUD     DAA    R1-SOAR  EP-SOAR
  1. visits/change   107.00  231.20  117.79  57.02  48.79    18.93
  2. % of total      76.9%   84.6%   70.2%   52.0%  60.0%    35.0%
  3. success (%)     15.3%   3.3%    24.5%   8.0%   6.3%     14.1%
  4. hash-visits/ch  22.92   24.48   41.96   7.14   5.05     3.97

Although constant-test node activations constitute a large fraction (63% on average) of the total node activations, a relatively small fraction of the total match time is spent in processing them. This is because the processing associated with constant-test nodes is very simple compared with that for other nodes like α-mem nodes or and-nodes. In the OPS83 [21] implementation on the VAX-11 architecture, the evaluation of a constant-test node takes only 3 machine instructions. The evaluation of two-input nodes in comparison takes 50-100 instructions.

The numbers on the third line show that only a small fraction (11.9% on average) of the constant-test node activations are successful. This suggests that by using indexing techniques (for example, hashing), many constant-test node activations that do not result in satisfaction of the associated tests may be avoided. The fourth line of the table, labeled "hash-visits/ch", gives the approximate number of constant-test node activations per working-memory change when hashing is used to avoid evaluation of nodes whose tests are bound to fail. Calculations show that approximately 82% of the total constant-test node activations can be avoided by using hashing. The hashing technique is especially helpful for the constant-test nodes immediately below the root node. These nodes check for the class of the working-memory element (see Figure 2-2), and since a working-memory element has only one class, all but one of these constant-test nodes fail their test. Calculations show that by using hashing at the top level, the total number of constant-test node activations can be reduced by about 43%.

13. The run-time data presented in this chapter corresponds to traces vt.lin, ilog.lin, mudo.lin, daa.lin, rlx.lin, and eps.lin. These traces are described in Section 8.1.
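A minimal realization of the top-level hashing idea looks as follows. This is an illustrative sketch, not the thesis implementation; it assumes the index key is simply the working-memory element's class:

    # Replace the root node's broadcast over all class tests with a single
    # hash-table lookup on the element's class.
    from collections import defaultdict

    subnetworks = defaultdict(list)   # class name -> constant-test chains

    def register(class_name, tests):
        subnetworks[class_name].append(tests)

    def root_activation(wme):
        # One probe instead of one class-equality test per distinct class.
        for tests in subnetworks[wme['class']]:
            if all(test(wme) for test in tests):
                pass   # token would be passed to the memory node below

    register('C1', [lambda w: w.get('attr2') == 12])
    root_activation({'class': 'C1', 'attr1': 12, 'attr2': 12})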
3.4.2. Alpha-Memory Nodes

An α-mem node associated with a condition element stores tokens corresponding to working-memory elements that partially match the condition element, that is, tokens that satisfy all intra-condition tests for the condition element. These nodes are the first significant nodes, in terms of the processing required, that get affected when a change is made to the working memory. It is only later that changes filter through α-mem nodes down to and-nodes, not-nodes, β-mem nodes, and terminal nodes.

The first line of Table 3-15 gives the number of α-mem node activations per change to working memory. The average number of activations for the six programs is only 5.00. This is quite small because of the large amount of sharing between α-mem nodes. The second line of the table gives the number of α-mem node activations when sharing is eliminated (something that is necessary in many parallel implementations). In this case the average number of α-mem node activations goes up to 26.48, an increase by a factor of 5.30. The third line of the table gives the dynamic sharing factor (line-2/line-1), which may be contrasted to the static sharing factor given in Table 3-13. As can be seen from the data, the dynamic sharing factor is consistently larger than the observed static sharing factor.

Table 3-15: Alpha-Memory Nodes

  Feature               VT      ILOG    MUD    DAA    R1-SOAR  EP-SOAR
  1. visits/ch (sh)     5.29    6.60    10.73  3.28   2.57     1.55
  2. visits/ch (nsh)    29.67   30.06   27.59  37.94  19.17    14.50
  3. dyn. shar. factor  5.60    4.55    2.57   11.56  7.45     9.35
  4. avg. tokens        302.76  180.44  64.91  14.91  48.50    7.15
  5. max. tokens        1467    572     369    88     197      38

The fourth line of Table 3-15 reports the average number of tokens present in an α-mem node when it is activated. This number indicates the complexity of the processing performed by an α-mem node. When an α-mem node is activated by an incoming token with a - tag, the node must find a corresponding token in its stored set of tokens, and then delete that token. If a linear search is done to find the corresponding token, on average, half of the stored tokens will be looked up. Thus the complexity of deleting a token from an α-mem node is proportional to the average number of tokens. On arrival of a token with a + tag, the α-mem node simply stores the token. This involves allocating memory and linking the token, and takes a constant amount of time. In case hashing is used to locate the token to be deleted, the delete operation can also be done in constant time. However, then we have to pay the overhead associated with maintaining a hash table. Hash tables become more economical as the number of tokens stored in the α-mem increases. The numbers presented in the fourth line are useful for deciding when hash tables (or other indexing techniques) are appropriate.

The fifth line of Table 3-15 reports the maximum number of tokens found in an α-mem node for the various programs.14 These numbers are useful for estimating the maximum storage requirements for individual memory nodes. The maximum storage requirements, in turn, are useful in the design of hardware associative memories to hold the tokens.

14. It is interesting to note that the value for the maximum number of tokens is the same as the value for the maximum size of working memory (see Table 3-22) for the VT, ILOG, and MUD systems. This implies that there is at least one condition element in each of these three systems that is satisfied by all working-memory elements.

3.4.3. Beta-Memory Nodes

A β-mem node stores tokens that match a subset of condition elements in the left-hand side of a production. The data for β-mem nodes, presented in Table 3-16, can be interpreted in the same way as that for α-mem nodes. There is, however, one difference that is of relevance. The sharing between β-mem nodes is much less than that between α-mem nodes, so that in parallel implementations the cost of processing β-mem nodes does not increase so much. When no sharing is present, the average number of β-mem node activations goes up from 3.53 to 5.17, an increase by a factor of only 1.46 as compared to a factor of 5.30 for the α-mem nodes.

Table 3-16: Beta-Memory Nodes

  Feature               VT    ILOG  MUD    DAA    R1-SOAR  EP-SOAR
  1. visits/ch (sh)     0.53  1.57  2.62   4.12   3.89     8.47
  2. visits/ch (nsh)    1.29  2.36  4.14   5.44   8.03     9.81
  3. dyn. shar. factor  2.43  1.50  1.58   1.32   2.06     1.15
  4. avg. tokens        3.30  3.97  73.10  28.26  7.43     4.95
  5. max. tokens        48    50    168    360    85       18
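Given averages such as the 302.76 tokens per activation observed for VT's α-mem nodes, deletion by linear search is clearly the expensive case. A hash-indexed memory node avoids it; the class below is a simplified illustrative sketch (the key, supplied here by the caller, would in practice be derived from the values examined by the two-input node beneath):

    class MemoryNode:
        """Memory node whose tokens are bucketed by a hash key, so a
        delete need not scan the entire stored set."""
        def __init__(self):
            self.buckets = {}

        def activate(self, tag, token, key):
            bucket = self.buckets.setdefault(key, [])
            if tag == '+':
                bucket.append(token)     # insert: constant time
            else:
                bucket.remove(token)     # delete: scan one bucket only

The table-maintenance overhead this adds is the trade-off discussed above; it pays off as the average token count grows.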
3.4.4. And Nodes

The run-time data for and-nodes are given in Table 3-17. The first line gives the number of and-node activations per change to working memory. The average number of node activations for the six programs is 27.66. The second line gives the average number of and-node activations for which no tokens are found in the opposite memory nodes. For example, for the VT program, the first line in the table shows that there are 25.96 and-node activations per change. Of these 25.96 activations, 24.48 have an empty opposite memory. Since an and-node activation for which there are no tokens in the opposite memory requires very little processing, evaluating the majority of the and-node activations is very cheap. Most of the processing effort goes into evaluating the small fraction of activations which have non-empty opposite memories. This means that if all and-node activations are evaluated on different processors, then the majority of the processors will finish very early compared to the remaining few. This large variation in the processing requirements of and-nodes (see Tables 8-1 and 8-2 for some actual numbers) reduces the effective speed-up that can be obtained by evaluating each and-node activation on a different processor.

When a token arrives on the left input of an and-node, it must be compared to all tokens stored in the memory node associated with the right input of that and-node. The comparisons may involve tests to check if the values of the variables bound in the two tokens are equal, if one is greater than the other, or other similar tests. The third line of the table gives the percentage of two-input node activations where no equality tests are performed.15 These numbers indicate the fraction of node activations where hash-table based memory nodes do not help in cutting down the tokens examined in the opposite memory (also see Section 5.2.1).

Table 3-17: And Nodes

  Feature           VT     ILOG   MUD    DAA    R1-SOAR  EP-SOAR
  1. visits/change  25.96  26.59  25.95  39.41  24.48    23.56
  2. null-mem       24.48  23.42  20.26  33.53  16.86    10.81
  3. null-tests     13.2%  7.8%   12.8%  8.2%   0.3%     0.0%
  4. tokens         17.00  4.39   24.33  27.18  4.87     7.96
  5. tests          17.35  5.18   25.94  27.51  5.29     8.45
  6. pairs          1.41   0.90   1.06   0.83   0.60     0.71

The fourth line shows the average number of tokens found in the opposite memory for an and-node activation, when the opposite memory is not empty.

15. For reasons too complex to explain here, separate numbers for and-node and not-node activations were not available. That is, the numbers presented in line-3 are for the combined activations of and-nodes and not-nodes.
The numbers in the sixth line of the table show the average number of consistent token-pairs found after matching the incoming token to all tokens in the opposite memory. For example, for the DAA program, on the activation of an and-node, an average of 27.18 tokens are found in the opposite memory node. On average, however, only 0.83 tokens are found to be consistent with the incoming token. This indicates that the opposite memory contains a lot of information, of which only a very small portion is relevant to the current context. The numbers in the sixth line also give a measure of token regeneration taking place within the network. This data may be used to construct probabilistic models of information flow within the Rete network. 3.4.5. Not Nodes Not-nodes are very similar to and-nodes, and the data for them should be interpreted in exactly the same way as that for and-nodes. The data are presented in Table 3-18. Table 3-18: Not Nodes Feature 1. visits/change 2. null-mere 3. tokens 4. tests 5. pairs VT 5.01 3.90 31.39 34.95 0.25 ILOG 5.84 4.28 5.99 7.94 0.45 MUD 5.79 3.89 13.94 14.06 0.31 DAA 3.97 2.33 12.51 12.53 0.43 RI-SOAR 2.63 1.42 9.87 11.91 1.41 EP-$OAR 0.75 0.27 6.43 7.38 0.75 42 PARALLELISM IN PRODUCTION SYSTEMS 3.4.6. Terminal Nodes Activations of terminal nodes correspond to insertion of production instantiations into the conflict set and deletion of instantiations from the conflict set. The first line of Table 3-19 gives the number of changes to the conflict set for each working-memory change. The second line gives the average number of changes made to the working memory per production firing, and the third line, the product of the first two lines, gives the average number of changes made to the conflict set per production firing. The data in the third line gives the number of changes that will be transmitted to a central conflict-resolution processor, in an architecture using centralized conflict-resolution. The fourth line gives the size of the conflict-set when averaged over the complete run. Table 3-19: Terminal Nodes 1. visits/change 2. changes/cycle 3. mods./cycle 4. avg confl-set VT 1.79 3.27 5.85 35 ILOG 2.06 1.70 3.50 10 MUD 3.69 2.13 7.86 36 DA.__AA RI-$OAR 1.65 0.55 2.22 4.55 3.66 2.50 22 12 EP-SOAR 0.74 4.69 3.47 18 3.4.7. Summary of Run-Time Characteristics Table 3-20 summarizes data for the number of node activations, when a working-memory element is inserted into or deleted from the working memory. The data show that a large percentage (63% on average) of the activations are of constant-test nodes. Constant-test node activations, however, re- quire very little processing compared to other node types, and furthermore, a large number of constant-test activations can be eliminated by suitable indexing techniques (see Section 3.4.1). To eliminate the effect of this large number of relatively cheap constant-test node activations, we subtracted the number of constant-test node activations from the activations of all nodes. These numbers are shown on line-8 of Table 3-20. The first observation that can be made from the data on line-8 of Table 3-20 is that, the way production-system programs are currently written, changes to working memory do not have global effects, but affect only a very small fraction of the nodes present in the Rete network (see Table 3-9). This also means that the number of productions that are affected 16is very small, as can be seen from line-1 in Table 3-21. 
3.4.6. Terminal Nodes

Activations of terminal nodes correspond to insertion of production instantiations into the conflict set and deletion of instantiations from the conflict set. The first line of Table 3-19 gives the number of changes to the conflict set for each working-memory change. The second line gives the average number of changes made to the working memory per production firing, and the third line, the product of the first two lines, gives the average number of changes made to the conflict set per production firing. The data in the third line gives the number of changes that will be transmitted to a central conflict-resolution processor, in an architecture using centralized conflict resolution. The fourth line gives the size of the conflict set when averaged over the complete run.

Table 3-19: Terminal Nodes

  Feature           VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. visits/change  1.79  2.06  3.69  1.65  0.55     0.74
  2. changes/cycle  3.27  1.70  2.13  2.22  4.55     4.69
  3. mods./cycle    5.85  3.50  7.86  3.66  2.50     3.47
  4. avg confl-set  35    10    36    22    12       18

3.4.7. Summary of Run-Time Characteristics

Table 3-20 summarizes the data for the number of node activations when a working-memory element is inserted into or deleted from the working memory. The data show that a large percentage (63% on average) of the activations are of constant-test nodes. Constant-test node activations, however, require very little processing compared to other node types, and furthermore, a large number of constant-test activations can be eliminated by suitable indexing techniques (see Section 3.4.1). To eliminate the effect of this large number of relatively cheap constant-test node activations, we subtracted the number of constant-test node activations from the activations of all nodes. These numbers are shown on line-8 of Table 3-20.

The first observation that can be made from the data on line-8 of Table 3-20 is that, the way production-system programs are currently written, changes to working memory do not have global effects, but affect only a very small fraction of the nodes present in the Rete network (see Table 3-9). This also means that the number of productions that are affected16 is very small, as can be seen from line-1 in Table 3-21. Both the small number of affected nodes and the small number of affected productions limit the amount of speed-up that can be obtained from using parallelism, as is discussed in Chapter 4.

The second observation that can be made is that the total number of node activations (excluding constant-test node activations) per change is quite independent of the number of productions in the production-system program. This, in turn, implies that the number of productions that are affected is quite independent of the total number of productions present in the system, as can be seen from Table 3-21.

There are several implications of the above observations. First, we should not expect smaller production systems (in terms of number of productions) to run faster than larger ones. Second, it appears that allocating one processor to each node in the Rete network or allocating one processor to each production is not a good idea. Finally, there is no reason to expect that larger production systems will necessarily exhibit more speed-up from parallelism.

16. A production is said to be affected by a change to working memory if the working-memory element satisfies at least one of its condition elements.

Table 3-20: Summary of Node Activations per Change

  Node Type         VT      ILOG    MUD     DAA     R1-SOAR  EP-SOAR
  1. Const-Test     107.00  231.20  117.79  57.02   48.79    18.93
  2. α-mem          5.29    6.60    10.73   3.28    2.57     1.55
  3. β-mem          0.53    1.57    2.62    4.12    3.89     8.47
  4. And            25.96   26.59   25.95   39.41   24.48    23.56
  5. Not            5.01    5.84    5.79    3.97    2.63     0.75
  6. Terminal       1.79    2.06    3.69    1.65    0.55     0.74
  7. Total          145.58  273.92  166.57  109.45  82.91    54.00
  8. Line7 - Line1  38.58   42.72   48.78   52.43   34.12    35.07

Table 3-21: Number of Affected Productions

  Feature            VT     ILOG   MUD    DAA    R1-SOAR  EP-SOAR
  1. p-aff/change    31.22  34.19  27.01  28.54  34.57    12.07
  2. SD17 for Line1  19.55  38.53  25.39  27.77  60.16    14.69
  3. changes/cycle   3.27   1.70   2.13   2.22   4.55     4.69
  4. p-aff/firing    40.14  36.49  32.05  40.04  63.04    20.45
  5. SD for Line4    31.59  52.70  28.69  32.55  93.67    20.12

Table 3-22 gives general information about the runs of the production-system programs from which the data presented in this chapter were gathered.18 The first two lines of the table give the average and maximum sizes of the working memory. The third and fourth lines give the average and maximum values for the sizes of the conflict set. The fifth and sixth lines give the average and maximum sizes of the token memory when memory nodes may be shared. (The size of the token memory at any instant is the total number of tokens stored in all memory nodes at that instant.) The seventh and eighth lines give the average and maximum sizes of the token memory when memory nodes may not be shared. The last line in the table gives the total number of changes made to the working memory in the production-system run from which the statistics are gathered.

17. SD stands for Standard Deviation.

18. The numbers presented in this chapter and in later chapters of the thesis are based on one run per production-system program. Detailed simulation-based analysis (results presented in Chapter 8) was not done for multiple runs of programs because of the large amount of data involved and because of the large processing requirements. However, we did gather statistics, like the ones presented in this chapter, for multiple runs of programs. The variation in the numbers obtained from the multiple runs was small.
Table 3-22: General Run-Time Data

  Feature           VT     ILOG  MUD   DAA    R1-SOAR  EP-SOAR
  1. avg work-mem   1134   486   241   250    543      199
  2. max work-mem   1467   572   369   308    786      258
  3. avg confl-set  35     10    36    22     12       18
  4. max confl-set  131    38    648   88     36       31
  5. avg tokm(sh)   5485   3506  3176  1182   1515     555
  6. max tokm(sh)   7416   4204  4576  2624   2716     856
  7. avg tokm(nsh)  13366  5363  4717  18343  3892     2546
  8. max tokm(nsh)  22640  8346  7583  23213  7402     3480
  9. WM changes     1767   2191  2074  3200   2220     924

Finally, it is important to point out that the results of the measurements presented for the six systems in this chapter are very similar to the results obtained for another set of systems (R1 [57], XSEL [56], PTRANS [34], HAUNT, DAA [42], and EP-SOAR [45]) analyzed in [30]. Consequently, there is good reason to believe that the results about parallelism (presented later in the thesis) apply not only to the six systems discussed here, but also to most other systems that have been written in the OPS5 and Soar languages.

Chapter Four
Parallelism in Production Systems

On the surface, production systems appear to be capable of exploiting large amounts of parallelism. For example, it is possible to perform match for all productions in parallel. This chapter identifies some obvious and other not-so-obvious sources of parallelism in production systems, and discusses the feasibility of exploiting them. It draws upon performance results reported in Chapter 8 of the thesis to motivate the utilization of some of the sources. Note that for reasons stated in Section 2.4, most of the discussion focuses on the parallelism that may be used within the context of the Rete algorithm.

4.1. The Structure of a Parallel Production-System Interpreter

As discussed in Section 2.1, there are three steps that are repeatedly performed to execute an OPS5 production-system program: match, conflict-resolution, and act. Figure 4-1 shows the flow of information between these three stages of the interpreter. It is possible to use parallelism while performing each of these three steps. It is further possible to overlap the processing performed within the match step and the conflict-resolution step of the same recognize-act cycle, and that within the act step of one cycle and the match step of the next cycle. However, it is not possible to overlap the processing within the conflict-resolution step and the subsequent act step. This is because the conflict-resolution step must finish completely before the next production to fire can be determined and its right-hand side evaluated. Thus, in an OPS5 programming environment, the possible sources of speed-up are: (1) parallelism within the match step; (2) parallelism within the conflict-resolution step; (3) parallelism within the act step; (4) overlap between the match step and the conflict-resolution step of the same cycle; and (5) overlap between the act step of one cycle and the match step of the next cycle.

[Figure 4-1: OPS5 interpreter cycle.]

As pointed out in Section 2.2, Soar programs do not execute the standard match--conflict-resolution--act cycle executed by OPS5 programs. A simplified diagram of the information flow in the Soar cycle is shown in Figure 4-2.
As pointed out in Section 2.2, Soar programs do not execute the standard match--conflict-resolution--act cycle executed by OPS5 programs. A simplified diagram of the information flow in the Soar cycle is shown in Figure 4-2. The match step and the act step are the same as in OPS5, but the conflict-resolution step is not present. Instead, the computation is divided into an elaboration phase and a decision phase. Within each phase all productions that are satisfied may be fired concurrently, and the productions that become satisfied as a result of such firings may also be fired concurrently with the originally satisfied productions. Such concurrency increases the speed-up that may be obtained from using parallelism, as will be discussed later in this chapter. There are, however, synchronization points between the elaboration phase and the decision phase; the elaboration phase must finish completely before the processing may proceed to the decision phase, and vice versa. The serializing effect of these two synchronization points in Soar is not as bad as that of the synchronization point between the conflict-resolution and the act step in OPS5. This is because Soar systems usually go through a few loops internally within the elaboration phase and within the decision phase, with no synchronization points to produce any serialization.

[Figure 4-2: Soar interpreter cycle: the elaboration phase and the decision phase, separated by synchronization points.]

4.2. Parallelism in Match

Current production-system interpreters spend almost 90% of their time in the match step, and only around 10% of the time in the conflict-resolution and the act steps. The reason for this is the inherent complexity of the match step, as was discussed in Section 2.3. This makes it imperative that we speed up the match step as much as possible. The following discussion presents several ways in which parallelism may be used to speed up the match step.

The processing done within the match step can be divided into two parts: the selection phase and the state-update phase [66]. During the selection phase, the match algorithm determines those condition elements that are satisfied by the new change to working memory, that is, it determines those condition elements for which all the intra-condition tests are satisfied by the newly inserted working-memory element. During the state-update phase, the match algorithm updates the state (stores a token in the memory nodes) associated with the condition elements determined in the selection phase. In addition, this new state is matched against previously stored state to determine new instantiations of satisfied productions. In the context of the Rete algorithm, the processing done during the selection phase corresponds to the evaluation of the top part of the Rete network, the part consisting of constant-test nodes. The processing done during the state-update phase corresponds to the evaluation of α-mem nodes, β-mem nodes, and-nodes, not-nodes, and terminal nodes.

[Figure 4-3: Selection and state-update phases in match: changes to working memory pass through satisfied condition elements to changes in the conflict set.]

Although the beginning of the selection phase must precede the state-update phase, the processing for the two phases may overlap. As soon as the selection phase determines the first satisfied condition element, the state-update phase can begin. In case many changes to working memory are to be processed concurrently, it is also possible to overlap the processing of the selection phase for one change to working memory with the state-update phase for another change.

Comparing the selection phase and the state-update phase, about 75%-95% of the processing time is spent in performing the state-update phase. The main reason for this, as stated in Section 3.4.1, is that the activations of constant-test nodes are much cheaper than the activations of memory nodes and two-input nodes. This disparity in the computational requirements between the two phases makes it necessary to speed up the state-update phase much more than the selection phase to attain balance. Since the state-update phase is more critical to the overall performance of the match algorithm, the following subsections focus primarily on the parallelization of the state-update phase.19

19. Kemal Oflazer has developed a special algorithm, which uses the information in both the left-hand sides and right-hand sides of productions to speed up the selection phase. So far we have not felt the necessity to use this more complex selection algorithm, because even with the standard selection/discrimination network used by Rete, the state-update phase is still the bottleneck.
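A minimal sketch of the selection/state-update split, with the constant tests of the discrimination network flattened into an array: every test the new working-memory element passes yields one state-update task. The WME layout, the two condition elements, and the attribute names are all hypothetical.

    #include <stdio.h>
    #include <string.h>

    /* A hypothetical working-memory element with a class name and two fields. */
    typedef struct { const char *class; int attr1, attr2; } WME;

    /* One intra-condition (constant) test and the condition element it guards. */
    typedef struct { const char *class; int attr2_value; int ce_id; } ConstTest;

    static const ConstTest tests[] = {
        { "C1", 7,  101 },   /* condition element 101: (C1 ^attr2 7)  */
        { "C2", 12, 102 },   /* condition element 102: (C2 ^attr2 12) */
    };

    /* Selection phase: cheap constant tests over the new WME.  Each success
       enqueues a state-update task (token storage plus joins), which is
       where 75%-95% of the match time is actually spent. */
    static void select_phase(const WME *w) {
        for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
            if (strcmp(w->class, tests[i].class) == 0 &&
                w->attr2 == tests[i].attr2_value)
                printf("enqueue state-update task for condition element %d\n",
                       tests[i].ce_id);
    }

    int main(void) {
        WME w = { "C1", 3, 7 };
        select_phase(&w);    /* prints one task, for condition element 101 */
        return 0;
    }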
4.2.1. Production Parallelism

To use production parallelism, the productions in a program are divided into several partitions and the match for each of the partitions is performed in parallel. In the extreme case, the number of partitions equals the number of productions in the program, so that the match for each production in the program is performed in parallel. Figure 4-4 shows the case where a production system is split into N partitions. The main advantage of using production parallelism is that no communication is required between the processes performing match for different productions or different partitions.

[Figure 4-4: Production parallelism.]

Before going into the implementation issues related to exploiting production parallelism, it is useful to examine the approximate speed-up that may be obtained from it: do we expect 10-fold speed-up, 100-fold speed-up, or 1000-fold speed-up, provided that enough processors are present? Our studies for OPS5 and Soar programs show that the true speed-up expected from production parallelism is really quite small, only about 2-fold. Some of the reasons for this are given below:

• Simulations show that the average number of productions affected20 per change to working memory is only 26. This implies that if there is a separate processor performing match for each production in the program, only 26 processors will be performing useful work and the rest will have no work to do. Thus the maximum speed-up from production parallelism is limited to 26.21 For reasons stated below, however, the expected speed-up is even smaller.

• The speed-up obtainable from production parallelism is further reduced by the variance in the processing time required by the affected productions. The maximum speed-up that can be obtained is proportional to the ratio t_avg : t_max, where t_avg is the average time taken by an affected production to finish match and t_max is the maximum time taken by any affected production to finish match. The parallelism is inversely proportional to t_max because the next recognize-act cycle cannot begin until all productions have finished match. Simulations for OPS5 and Soar programs show that because of this variance the maximum nominal speed-up22 that is obtainable using production parallelism is 5.1-fold, a factor of 5.1 less than the average number of affected productions.23
• The third factor that influences the speed-up is the loss of sharing in the Rete network when production parallelism is used. The loss of sharing happens because operations that would have been performed only once for similar productions are now performed independently for such productions, since the productions are evaluated on different processors. Simulations show that the loss of sharing increases the average processing cost by a factor of 1.63. Thus if there are 16 processors that are active all the time, the speed-up as compared to a uniprocessor implementation (with no loss in sharing) will still be less than 10.

• The fourth factor that influences the speed-up is the overhead of mapping the decomposition of the algorithm onto a parallel hardware architecture. The overheads may take the form of memory-contention costs, synchronization costs, or task-scheduling costs. Simulations done for an implementation of the parallel Rete algorithm on a shared-memory multiprocessor show that such overheads increase the processing cost by a factor of 1.61.

The combined sharing, synchronization, and scheduling overheads account for a loss in performance by a factor of 2.62 (1.61 x 1.63). As a result of the combined losses, the true speed-up from using production parallelism is only 1.9-fold (down from the nominal speed-up of 5.1-fold).

20. Recall that a production is said to be affected by a change to working memory if the new working-memory element matches at least one of the condition elements of that production. Updating the state associated with the affected productions (the state-update phase computation) takes about 75%-95% of the total time taken by the match phase.
21. Note that in the above discussion we have only been concerned with the state-update phase computation. This is possible because we already have parallel algorithms to execute the selection phase very fast.
22. Nominal speed-up (or concurrency) refers to the average number of processors that are kept busy in the parallel implementation. Nominal speed-up is to be contrasted against true speed-up, which refers to the speed-up with respect to the highest performance uniprocessor implementation, assuming that the uniprocessor is as powerful as the individual nodes of the parallel processor. True speed-up is usually less than the nominal speed-up because some of the resources in a parallel implementation are devoted to synchronizing the parallel processes, scheduling the parallel processes, recomputing some data which is too expensive to be communicated, etc.
23. Note that the numbers given in this section and the following sections correspond to the simulation results for the production-system traces listed in Section 8.1.
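The way the individual loss factors compose can be checked with a few lines of arithmetic; the numbers below are the averages reported in this section.

    #include <stdio.h>

    int main(void) {
        double affected = 26.0;   /* avg. productions affected per WM change  */
        double variance = 5.1;    /* loss factor due to unequal match costs   */
        double sharing  = 1.63;   /* loss factor from reduced network sharing */
        double mapping  = 1.61;   /* synchronization/scheduling/contention    */

        double nominal = affected / variance;            /* ~5.1-fold */
        double true_su = nominal / (sharing * mapping);  /* ~1.9-fold */

        printf("combined overhead: %.2f\n", sharing * mapping);  /* 2.62 */
        printf("nominal speed-up:  %.1f\n", nominal);
        printf("true speed-up:     %.1f\n", true_su);
        return 0;
    }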
Some implementation issues associated with using production-level parallelism are now discussed. The first point that emerges from the previous discussion is that it is not advisable to allocate one processor per production for performing match. If this is done, most of the processors will be idle most of the time and the hardware utilization will be poor [22, 31].24 If only a small number of processors are to be used, there are two alternative strategies. The first strategy is to divide the production-system program into several partitions so that the processing required by productions in each partition is almost the same, and then allocate one processor for each partition. The second strategy is to have a task queue shared by all processors, in which entries for all productions requiring processing are placed. Whenever a processor finishes processing one production, it gets the next production that needs processing from the task queue. Some advantages and disadvantages of these two strategies are given below.

The first strategy is suitable for both shared-memory multiprocessors and non-shared-memory multicomputers. It is possible for each processor to work from its local memory, and little or no communication between processors is required. The main difficulty, however, is to find partitions of the production system that require the same amount of processing. Note that it is not sufficient to find partitions with only one affected production per partition, because the variance in the cost of processing the affected productions still destroys most of the speed-up.25 The task of partitioning is also difficult because good models are not available for estimating the processing required by productions, and also because the processing required by productions varies over time. A discussion of the various issues involved in the partitioning task is presented in [66, 67].

The second strategy is suitable only for shared-memory architectures, because it requires that each processor have access to the code and state of all productions in the program.26 Since the tasks are allocated dynamically to the processors, this strategy has the advantage that no load-distribution problems are present. Another advantage of this strategy is that it extends very well to lower granularities of parallelism. However, this strategy loses some performance due to the synchronization, scheduling, and memory-contention overheads present in a multiprocessor. (A sketch of such a shared task queue is given below.)

24. Low utilization is not justifiable, no matter how inexpensive the hardware, for it indicates that some alternative design can be found that can attain more performance at the same cost.
25. Kemal Oflazer in his thesis [67] evaluates a scheme where more than one processor is allocated to each partition to offset the effect of the variance.
26. While it is possible to replicate the code (that is, the Rete network) in the local memories of all the processors, it is not possible to do so for the dynamically changing state.
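A minimal sketch of the second strategy, assuming a shared-memory machine and POSIX threads: affected productions are placed on one mutex-protected queue, and each idle processor pulls the next entry. The queue bound, the thread count, and the work function are all illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define MAX_TASKS 128

    static int queue[MAX_TASKS];     /* ids of productions needing match */
    static int head = 0, tail = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void match_production(int id) { (void)id; /* state-update work */ }

    /* Each worker repeatedly claims the next pending production.  Dynamic
       assignment avoids the static partitioning problem, at the price of
       contention on the queue lock. */
    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            if (head == tail) { pthread_mutex_unlock(&lock); break; }
            int id = queue[head++];
            pthread_mutex_unlock(&lock);
            match_production(id);
        }
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 26; i++) queue[tail++] = i;  /* 26 affected */
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        puts("all affected productions matched");
        return 0;
    }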
In conclusion, the maximum speed-up that can be obtained from production-level parallelism is equal to the average number of productions affected per change to working memory (an average of 26 for the production systems studied). In practice, however, the nominal speed-up that is obtained is only 5.1-fold, due to the variance in the processing times required by the affected productions. The true speed-up that can be obtained is even less, only 1.9-fold, due to the loss of sharing in parallel decompositions (a factor of 1.63) and the overheads of mapping the decompositions onto hardware architectures (a factor of 1.61).

4.2.2. Node Parallelism

Unlike production parallelism, node parallelism is specific to the Rete algorithm. When node parallelism is used, activations of different two-input nodes in the Rete network are evaluated in parallel.27 Node parallelism is graphically depicted in Figure 4-5. It is important to note that node parallelism subsumes production parallelism, in that node parallelism has a finer grain than production parallelism. Thus, using node parallelism, both activations of two-input nodes belonging to different productions (corresponding to production parallelism) and activations of two-input nodes belonging to the same production (resulting in the extra parallelism) are processed in parallel.

The main reason for going to this finer granularity of parallelism is to reduce the value of t_max, the maximum time taken by any affected production to finish match. This decreased granularity of parallelism, however, leads to increased communication requirements between the processes evaluating the nodes in parallel. In node parallelism a process must communicate the results of a successful match to its successor two-input nodes. No communication is necessary if the match fails. To evaluate the usefulness of exploiting node parallelism, it is necessary to weigh the advantages of reducing t_max against the cost of increased communication and the associated limitations on feasible architectures.

[Figure 4-5: Node parallelism. Activations of the two-input nodes for the condition elements are placed on activation queues and evaluated by concurrent processes.]

Another advantage of using node parallelism is that some of the sharing lost in the Rete network when using production parallelism is recovered. If two productions need a node with the same functionality, it is possible to keep only one copy of the node and to evaluate it only once, since it is no longer necessary to have separate nodes for different productions. The gain due to the increased amount of sharing is a factor of 1.33, which is quite significant.

The extra speed-up available from node parallelism over that obtained from production parallelism is bounded by the number of two-input nodes present in a production. The reason for this is that the extra speed-up comes only from the parallel evaluation of nodes belonging to the same production. Since the average number of two-input nodes (one less than the average number of condition elements) in the production systems considered in this thesis is quite small, the maximum extra speed-up expected from node parallelism is also small. The results of simulations indicate that using node parallelism results in a nominal speed-up of 5.8-fold and a true speed-up of 2.9-fold. Thus it is possible to get about 1.50 times more true speed-up than could be obtained if production parallelism alone was used.28 The increase in speed-up is significantly lower than the number of two-input nodes per production (around 4), because most of the time all the two-input nodes associated with a production do not have to be evaluated.

The implementation considerations for node parallelism are very similar to those for production parallelism described in the previous subsection. However, since more communication is required between the parallel processes, shared-memory architectures are preferable. The size of the tasks when node parallelism is used is smaller than when production parallelism is used. Simulations indicate that the average time to process a two-input node activation is around 50-100 computer instructions. This number is significant in that it limits the amount of synchronization and scheduling overhead that can be tolerated in an implementation.

27. Note that in the context of node parallelism, the activation of a two-input node corresponds to the processing required by both the two-input node (the and-node or the not-node) and the associated memory node. Lumping the memory node together with the two-input node is necessary when using hash-table based memory nodes, and is discussed in detail later in Section 5.2.2.
28. This factor of 1.50 includes the factor of 1.33 that was gained because of reduced loss of sharing, as stated in the previous paragraph.
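With node parallelism, the unit of scheduling is a single two-input node activation. The sketch below, with hypothetical types and a hypothetical equality join test, shows what one such task does: a left activation of an and-node joins the incoming token against the node's right memory and emits one output token per consistent pair.

    #include <stdio.h>

    typedef struct { int attr1, attr2; } Token;   /* simplified one-WME token */

    typedef struct {
        Token right_mem[8];     /* tokens stored in the right memory node */
        int   right_count;
    } AndNode;

    /* Process one left activation.  The join test here is an assumed
       equality left.attr2 == right.attr1; each consistent pair would be
       sent on to the successor node. */
    static void left_activation(const AndNode *n, Token left) {
        for (int i = 0; i < n->right_count; i++)
            if (left.attr2 == n->right_mem[i].attr1)
                printf("emit token (%d,%d)+(%d,%d)\n",
                       left.attr1, left.attr2,
                       n->right_mem[i].attr1, n->right_mem[i].attr2);
    }

    int main(void) {
        AndNode n = { { {5, 9}, {9, 4} }, 2 };
        Token t = { 1, 9 };       /* attr2 = 9 matches right_mem[1].attr1 */
        left_activation(&n, t);   /* one emitted token; ~50-100 instructions */
        return 0;
    }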
4.2.3. Intra-Node Parallelism

The previous two subsections expressed the desirability of reducing the value of t_max, the maximum time taken by any affected production to finish the match phase. Looking at simulation traces of production systems using node parallelism, a major cause for the large value of t_max was found to be the cross-product effect. As shown in Figure 4-6,29 the cross-product effect refers to the case where a single token flowing into a two-input node finds a large number of tokens with consistent bindings in the opposite memory. This results in the generation of a large number of new tokens, all of which have to be processed by the successor node. Since node parallelism does not permit multiple activations of the same two-input node to be processed in parallel, they are processed sequentially and a large value of t_max results.

[Figure 4-6: The cross-product effect.]

Intra-node parallelism is designed to reduce the impact of the cross-product effect and some other problems that arise when multiple changes to working memory are processed in parallel. When intra-node parallelism is used, not only are activations of different two-input nodes evaluated in parallel (as in node parallelism), but multiple activations of the same two-input node are also evaluated in parallel.30 Because of its finer granularity, intra-node parallelism requires some extra synchronization over that required by node parallelism, but its impact is relatively insignificant. Simulations show that using intra-node parallelism results in a nominal speed-up of 7.6-fold and a true speed-up of 3.9-fold. Thus it is possible to get an extra factor of 1.30 over the speed-up that can be obtained from using node parallelism alone. This factor is larger when many changes are processed in parallel, as is discussed in the next subsection.

29. In Figure 4-6, the arrows represent the flow of tokens in the Rete network and the thick lines represent the network for the production.
30. Just as node parallelism subsumes production parallelism, intra-node parallelism subsumes node parallelism.
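Under intra-node parallelism, a cross-product join need not run as one sequential loop: each pair formed by the incoming token and a token in the opposite memory can become its own fine-grained task. A sketch using an OpenMP directive to mark the pairs as independent; the counts and the stand-in join test are illustrative, and a real implementation would enqueue the pairs on a task scheduler rather than rely on a compiler directive.

    #include <stdio.h>

    #define OPP 1000   /* a cross-product: 1000 tokens in the opposite memory */

    /* Consistency test for one (new token, stored token) pair.  Each call
       is one fine-grained task of roughly 50-100 instructions. */
    static int join_pair(int new_tok, int stored_tok) {
        return (new_tok % 7) == (stored_tok % 7);   /* stand-in join test */
    }

    int main(void) {
        int emitted = 0;
        /* With node parallelism alone this loop runs on one processor and
           dominates t_max; intra-node parallelism lets the pairs proceed
           concurrently. */
        #pragma omp parallel for reduction(+:emitted)
        for (int i = 0; i < OPP; i++)
            emitted += join_pair(3, i);
        printf("tokens emitted: %d\n", emitted);    /* 143 with these stand-ins */
        return 0;
    }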
4.2.4. Action Parallelism

Usually, when a production fires, it makes several changes to the working memory. Measurements show that the average number of changes made to the working memory per execution cycle is 7.34.31 Processing these changes concurrently, instead of sequentially, leads to increased speed-up from production, node, and intra-node parallelism.

The reasons for the increased speed-up from production parallelism when used with action parallelism are the following. In Section 4.2.1, it was observed that the speed-up available from production parallelism is proportional to the average number of affected productions. The set of productions that is affected as a result of processing many changes concurrently is the union of the sets of affected productions for the individual changes to the working memory. Since this combined set of affected productions is larger than that affected by any individual change, more speed-up can be obtained. For example, consider the case where a production firing results in two changes to working memory, such that change-1 affects productions p1, p2, and p3, and change-2 affects productions p4, p5, and p6. If change-1 and change-2 are processed sequentially, it is best to use three processors. Assuming that each affected production takes the same amount of processing time, each change takes one cycle and the total cost is two cycles. However, if change-1 and change-2 are processed concurrently, they can be processed in one cycle instead of two, using six processors. Simulations indicate that when multiple changes are processed in parallel, the average size of the affect sets goes up from 26.3 to 59.5 (a factor of 2.26), and the speed-up obtainable from production parallelism alone goes up by a factor of 1.5. Thus using both production and action parallelism results in a nominal speed-up of 7.6-fold, as compared to a nominal speed-up of 5.1-fold when only production parallelism is used. The extra speed-up is less than the average number of working-memory changes per cycle, because the sets of productions affected by the multiple changes are not distinct but have considerable overlap (see line-1 and line-4 of Table 3-21).

Analysis shows that often two successive changes to working memory affect two distinct condition elements of the same production, as a result causing two distinct two-input node activations. It is then possible, using node parallelism, to process these node activations in parallel, thus increasing the available parallelism. For example, consider the case where both change-1 and change-2 affect productions p1, p2, and p3. If the activations correspond to distinct two-input nodes, it is possible to process both changes in parallel, in one cycle instead of two. Simulations indicate that the use of action parallelism increases the speed-up obtainable from node parallelism alone by a factor of around 1.85, resulting in a nominal speed-up of 10.7-fold.

In a manner similar to node parallelism, when successive changes to working memory cause multiple activations of the same two-input node, it is possible, using intra-node parallelism, to process them in parallel. Simulations indicate that the use of action parallelism increases the speed-up obtainable from intra-node parallelism alone by a factor of around 2.54, resulting in a nominal speed-up of 19.3-fold. The average increase in performance for the OPS5 programs is a factor of 1.84, and that for the Soar programs is a factor of 3.30. The increase in speed-up is larger for Soar programs because, on average, 12.25 working-memory changes are processed in parallel for Soar programs, while only 2.44 changes are processed in parallel for OPS5 programs.

It is interesting to note that the factor by which the speed-up improves when using action parallelism increases as we go from production parallelism (factor of 1.50) to node parallelism (factor of 1.85) to intra-node parallelism (factor of 2.54). The reason for this is that node parallelism subsumes production parallelism and intra-node parallelism subsumes node parallelism. Thus node parallelism gets the extra speed-up from action parallelism that production parallelism can get. In addition, node parallelism gets extra speed-up from parallelism that production parallelism could not obtain, for example, when the multiple changes affect different two-input nodes belonging to the same production. The reasoning relating node parallelism and intra-node parallelism is similar.

31. The average number of changes that are processed in parallel for the four OPS5 traces is 2.44, and the average for the four Soar traces is 12.25. Note that the number of changes that may be processed in parallel for the Soar systems is the sum of the changes made by all the productions that fire in parallel.
4.3. Parallelism in Conflict-Resolution

The thesis does not evaluate the parallelism within the conflict-resolution phase in detail. This is partly because the conflict-resolution phase is not present in new production systems like Soar, and partly because the conflict-resolution phase is not expected to be a bottleneck in the near future. The reasons why conflict-resolution is not expected to become a bottleneck are:

• Current production-system interpreters spend only about 5% of their execution time on conflict-resolution. Thus the match step has to be speeded up considerably before conflict-resolution becomes a bottleneck.

• In the production, node, and intra-node parallelism discussed earlier, the match for the affected productions finishes at different times because of the variation in the processing required by the affected productions. Thus many changes to the conflict set are available to the conflict-resolution process while some productions are still performing match. Much of the conflict-resolution time can therefore be overlapped with the match time, reducing the chances of conflict-resolution becoming a bottleneck.

• If conflict-resolution does become a bottleneck, there are several strategies for avoiding it. For example, to begin the next execution cycle, it is not necessary to perform conflict-resolution for the current changes to completion. It is only necessary to compare each current change to the highest-priority production instantiation so far. Once the highest-priority instantiation is selected, the next execution cycle can begin. The complete sorting of the production instantiations can be overlapped with the match phase for the next cycle. Hardware priority queues provide another strategy. (A sketch of the running-maximum approach follows this list.)
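The running-maximum strategy amounts to keeping the best instantiation seen so far while match is still producing conflict-set changes; the full sort is deferred. A sketch with a hypothetical integer priority:

    #include <stdio.h>

    typedef struct { int id; int priority; } Instantiation;

    static Instantiation best = { -1, -1 };

    /* Called as each conflict-set insertion arrives from match.  Only one
       comparison is on the critical path; sorting the rest of the conflict
       set can be overlapped with the next cycle's match. */
    static void conflict_set_insert(Instantiation inst) {
        if (inst.priority > best.priority)
            best = inst;
    }

    int main(void) {
        Instantiation a = {1, 10}, b = {2, 30}, c = {3, 20};
        conflict_set_insert(a);
        conflict_set_insert(b);
        conflict_set_insert(c);
        printf("fire instantiation %d\n", best.id);   /* prints 2 */
        return 0;
    }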
4.4. Parallelism in RHS Evaluation

The RHS-evaluation step, like the conflict-resolution step, takes only about 5% of the total time for current production systems. When many productions are allowed to fire in parallel, as in Soar, it is quite straightforward to evaluate their right-hand sides in parallel. Even when the right-hand side of only a single production is to be evaluated, it is possible to overlap some of the input/output with the match for the next execution cycle. Also, when the right-hand side results in several changes to the working memory, the match phase can begin as soon as the first change to working memory is determined. For the above reasons the act step is not expected to be a bottleneck in speeding up the execution of production systems. The thesis does not evaluate the parallelism in RHS evaluation in any greater detail.

4.5. Application Parallelism

There is substantial speed-up to be gained from application parallelism, where a number of cooperating but loosely coupled production-system tasks execute in parallel [29, 87]. The cooperating tasks may arise in the context of search, where there are a number of paths to be explored, and it is possible to explore each of the paths in parallel (similar to or-parallelism in logic programs [91]). Alternatively, the cooperating tasks may arise in the context where there are a number of semi-independent tasks, all of which have to be performed, and they can be performed in parallel (similar to and-parallelism in logic programs). It is also possible to have cooperating tasks that have a producer-consumer relationship among them (similar to stream-parallelism in logic programs).

The maximum speed-up that can be obtained from application parallelism is equal to the number of cooperating tasks, which can be significant. Unfortunately, most current production systems do not exploit such parallelism, because (1) the production-system programs were expected to run on a uniprocessor, where no advantage is to be had from having several parallel tasks, and (2) current production-system languages do not provide the features to write multiple cooperating production tasks easily.

Although not currently exploited by OPS5 programs, it is possible to use a simple form of application parallelism in Soar programs. In Soar all problem-solving is done as heuristic search within a problem space, and Soar permits exploring several paths in the problem space concurrently. The use of application parallelism within the two Soar programs studied in this thesis increases the nominal speed-up obtained using intra-node and action parallelism from 17.9-fold to 30.4-fold, an extra factor of 1.7. It is interesting to note that to the implementor, the use of application parallelism in Soar appears simply as several productions firing in parallel. This results in a large number of working-memory changes that may be processed in parallel. No special mechanisms are required to make use of application parallelism, since the mechanisms developed for exploiting action parallelism suffice.

4.6. Summary

In summary, the following observations can be made about the parallelism in production systems:

• Contrary to initial expectations, the speed-up obtainable from parallelism is quite limited, of the order of a few tens rather than hundreds or thousands.

• The match step takes the most time in the recognize-act cycle, and for that reason the match needs to be speeded up most.

• The first important source of parallelism for the match step is production parallelism. Using production parallelism it is possible to get an average nominal speed-up of 5.1-fold and an average true speed-up of 1.9-fold. The speed-up is limited by the small number (approximately 26) of productions affected per change to working memory. The speed-up is further limited by the large variance in the amount of processing required by the affected productions (a factor of 5.10), by the loss of sharing in the Rete network (a factor of 1.63), and by the overheads of mapping the parallel algorithm onto a multiprocessor (a factor of 1.61).

• To reduce the variance in the processing requirements of the affected productions, it is necessary to exploit parallelism at a much finer granularity than production parallelism. The two schemes proposed for this are node parallelism and intra-node parallelism. Exploiting the parallelism at a finer granularity increases the communication requirements between the parallel processes, and restricts the class of suitable architectures to shared-memory multiprocessors.

• When using node parallelism, it is possible to process activations of distinct two-input nodes in parallel. This results in an average nominal speed-up of 5.8-fold and an average true speed-up of 2.8-fold (an extra factor of 1.50 over the speed-up that can be obtained by using production parallelism alone).

• Intra-node parallelism is even finer grain than node parallelism, and it permits the processing of multiple activations of the same two-input node in parallel.
This results in an average nominal speed-up of 7.6-fold and an average true speed-up of 3.9-fold (an extra factor of 1.30 over the speed-up that can be obtained by using node parallelism alone).

• Processing many changes to working memory in parallel (action parallelism) enhances the speed-up obtainable from production, node, and intra-node parallelism. The nominal speed-up obtainable from production parallelism increases to 7.6-fold (a factor of 1.50 over when action parallelism is not used), that from node parallelism increases to 10.7-fold (an extra factor of 1.85), and that from intra-node parallelism increases to 19.3-fold (an extra factor of 2.54).

• The conflict-resolution step and the RHS-evaluation step take only a small fraction (5% each) of the processing time required by the recognize-act cycle. Much of the processing required by the conflict-resolution step can be overlapped with the match step. These two steps are not expected to become a bottleneck in the near future.

• Significant speed-up can be obtained by letting several loosely coupled threads of computation proceed in parallel (application parallelism). Simulation results for two Soar systems show that the speed-up obtainable from intra-node parallelism increases by a factor of 1.7 when application parallelism is used.

4.7. Discussion

The results presented earlier in this chapter indicate the performance of only one model (that of OPS5-like production systems using a Rete-like match algorithm) for parallel interpretation of production systems. It is therefore essential to ask whether it is possible to change the parallel interpreter design -- or even the production systems being interpreted -- in such a way as to increase the speed-up obtainable from parallelism. Of course, one is not likely to be able to give universal answers to questions like this. It is surely the case that there are applications and associated implementation techniques that permit quite high degrees of parallelism to be used, and that there are other applications that do not permit much parallelism at all to be used. However, by examining the basic factors affecting the speed-up obtained from parallelism, one can develop fairly general evaluations about the speed-up that is obtainable, independent of the design decisions made in any particular parallel implementation.

The following paragraphs give reasons why the three main factors responsible for the limited speed-up, namely (1) the small number of productions affected per change to working memory, (2) the small number of changes made to working memory per cycle, and (3) the large variation in the processing requirements of the affected productions, are not likely to change significantly in the near future, and consequently, why it is not reasonable to expect significantly larger speed-ups from parallelism.

Let us first examine the reasons for the observations that the affect-sets (the sets of affected productions) are small and that the size of the affect-sets is independent of the total number of rules in the program (see Table 3-21). One possible way to explain these observations is to note that to perform most interesting tasks, the rule-base must contain knowledge about many different types of objects and many diverse situations. The number of rules associated with any specific object-type or any situation is expected to be small [58]. Since most working-memory elements describe aspects of only a single object or situation, clearly most working-memory elements cannot be of interest to more than a few of the rules.

Another way one might explain the small and independent size of the affect-sets is the conjecture that programmers recursively divide problems into subproblems when writing programs. The final size of the subproblems at the end of the recursive division (which is correlated with the number of productions associated with the subproblems) is independent of the size of the original problem and primarily depends on (1) the complexity of the subproblems, and (2) the complexity that the programmer can deal with at the same time (see [58] for a discussion of this hypothesis). Since, at any given time, the program execution corresponds to solving only one of these subproblems, the number of productions that are affected (relevant to the subproblem) is small and independent of the overall program size.32
Since most working-memory elements describe aspects of only a single object or situation, then clearly most working-memory elements cannot be of interest to more than a few of the rules. Another way that one might explain the small and independent size of the affect-sets is the conjecture that programmers recursively divide problems into subproblems when writing the programs. The final size of the subproblems at the end of the recursive division of problems into subproblems (which is correlated to the number of productions associated with the subproblems) is independent of the size of the original problem and primarily depends on (1) the complexity of the subproblems, and (2) the complexity that the programmer can deal with at the same time (see [58] for a discussion of this hypothesis). Since, at any given time, the program execution corresponds to solving only one of 60 PARALLEIJSM IN PRODUCrION SYSTEMS these subproblems, the number of productions that are affected (relevant to the subproblem) is small and independent of the overall program size.32 Yet another way to look at the size of the affect-sets is in terms of the organization of knowledge in programs. If the knowledge about a given situation is small (the number of associated rules are small), the affecrsets would also be small. If the amount of knowledge about the given situation is very large, it is possible that the affect-sets are large. However, whenever the amount of knowledge is large, we tend to structure it hierarchically or impose some other structure on the knowledge, so that it is easily comprehensible to us and so that it is easy to reason about [82]. For example, the structure of knowledge in classification tasks is not flat but usually hierarchical. Consequently, when clas- sifying an object we do it in several sequential steps, each with a small branching factor, rather than in one step with a very large branching factor. Thus if there was one rule associated with each branch of the decision tree, the total number of rules relevant at any node in the decision tree would be small. We now give reasons why the number of working-memory changes per recognize-act cycle is not likely to become significantly larger in future production-system programs. The reason for using production systems is to permit a style of programming in which substantial amounts of knowledge can affect each action that the program takes. If individual rules are permitted to do much more processing (which would correspond to making a large number of changes to working memory), then the advantages of this programming style begin to be lost. Knowledge is brought to bear only during the match phase of the cycle, and the less frequently match phases occur, the less chance other rules have to affect the outcome of the processing. Certainly there are many applications in which it is possible to perform substantial amounts of processing without stepping back and reevaluating the state of the system, but those are not the kinds of tasks for which one should choose the productionsystem paradigm. Alternatively, the argument may be made as follows. As stated in Chapter 1, an intelligent agent must perform knowledge search after each small step to avoid the combinatorial explosion resulting from uncontrolled problem-space search. 
Since most often, a small step in the problem space also corresponds to a small change in the state/environment of the agent, the number of changes made 32The above discussion only addresses the case where application parallelism is not exploited. In ease application parallelism is used, it is possible for a program to be working on several subproblems s_nultaneously, thus having a larger set of affected productions. Also note, it is not argued that the size of affect-sets will be the same in future systems as has been measured for existing systems. It is, of course, possible to construct systems that have more knowledge applicable to each given situation, thus increasing the number of affected productions by some small factor. It is, however, argued that the probability that the number of affected productions will increase by 50-fold, 100-fold, or more in the future is small. 61 PARALI.ELISM 1N PRODUC'IlON SYSTEMS between consecutive knowledge-search steps is expected to be small. It is possible to envision situations when there are local spurts in the number of changes made to the working memory per cycle (for example, when an intelligent agent returns from solving a subgoal, it may want to delete much of the local state associated with that subgoal [48]), but the average rate of change to working memory per cycle is expected to remain small. Before leaving this point, it should be observed that there is one way to increase the rate of working-memory turnover -- using parallelism in the production system itself. If a system has multiple threads, each one could be. performing only the usual small number of working-memory changes per cycle, but since there would be several threads, the total number of changes per cycle would be several times higher. Thus application-level parallelism will certainly help when it can be used. However, it may not be actually used in very many cases for two reasons: First, obviously, it can only be used in tasks that have parallel decompositions, and not all interesting tasks will. Second, using application-level parallelism places additional burdens on the developers of the applications. They must find the parallel decompositions and then implement them in such a way that the program is both correct and efficient. The final factor, the large variation in the processing required by the affected productions, may change somewhat because researchers are actively working on techniques to reduce this. Even here, however, it is not likely that much improvement is possible. The obvious way to handle the problem is to divide the match process into a large number of small tasks (for example, as done in going from production parallelism to node parallelism to intra-node parallelism). This is effective, but it cannot be carried too far because the amount of overhead time (for scheduling, for synchronization, etc.) goes up as we go to finer granularity and the number of processes increases. 62 PARALLELISM 1N PRODUCI'ION SYSTFdVIS PARALLEL IMPLEMENTATION OF PRODUCTIONSYSTEMS 63 Chapter Five Parallel Implementation of Production Systems In the previous chapter, we discussed the various sources of parallelism that may be exploited in the context of the Rete algorithm--production and action parallelism. parallelism, node parallelism, intra-node parallelism, We also observed that intra-node and action parallelism when combined together provided the most speed-up. 
This chapter discusses the hardware and software structures necessary for efficiently exploiting these sources of parallelism. Section 5.1 discusses the architecture of the multiprocessor proposed to execute the parallel version of the Rete algorithm. Section 5.2 presents the data representations and constraints necessary for the parallel processing of the state-update phase, that is, while processing activations of memory nodes, two-input nodes, and terminal nodes. Section 5.3 presents the data structures and constraints necessary for the parallel processing of the selection phase, that is, while processing activations of constant-test nodes. This section also discusses issues that arise at the boundary between the selection phase and the state-update phase.

5.1. Architecture of the Production-System Machine

This section describes the architecture of the production-system machine (PSM), the hardware structure suitable for executing the parallel version of the Rete algorithm. We begin with a description of the proposed machine (see Figure 5-1), and later provide justifications for each of the design decisions. The major characteristics of the machine are:

1. The production-system machine should be a shared-memory multiprocessor with about 32-64 processors.
2. The individual processors should be high-performance computers, each with a small amount of private memory and a cache.
3. The processors should be connected to the shared memory via one or more shared buses.
4. The multiprocessor should support a hardware task scheduler to help enqueue node activations that need to be evaluated on the task queue and to help assign pending node activations to idle processors.

[Figure 5-1: Architecture of the production-system machine.]

The first requirement for the proposed production-system machine (as listed above) is that it should be a shared-memory multiprocessor with 32-64 processors. The main reason for using a shared-memory architecture stems from the fact that, to achieve a high degree of speed-up from parallelism, the parallel Rete algorithm exploits parallelism at a very fine grain. For example, in the parallel Rete algorithm, multiple activations of the same node may be evaluated in parallel. This requires that multiple processors have access to the state corresponding to that node, which strongly suggests a shared-memory architecture. It is not possible to replicate the state, since keeping all copies of the state up to date is extremely expensive.

Another important reason for using a shared-memory architecture relates to the load-distribution problem. In case processors do not share memory, the processor on which the activations of a given node in the Rete network are evaluated must be decided at the time the network is loaded into the parallel machine. Since the number of node activations is much smaller than the total number of nodes in the Rete network [30], it is necessary to assign several nodes in the network to a single processor. This partitioning of nodes amongst the processors is a very difficult problem, and in its full generality is shown to be NP-complete [67]. Using a shared-memory architecture, the partitioning problem is bypassed, since all processors are capable of processing all node activations, and it is possible to assign processors to node activations at run-time.
The suggestion of 32-64 processors for the multiprocessor is derived from measurements and simulations done for many large production-system programs [39, 43, 55, 56]. Because of the small number of productions affected per change to working memory, and because of the variance in the processing required by the affected productions, simulations for most production-system programs show that using more than 32 processors does not yield any additional speed-up. There are some production systems that can use up to 64 processors, but the number of such systems is small. The fact that only about 32 processors are needed is also consonant with our use of a shared-memory architecture -- it is quite difficult to build shared-memory multiprocessors with a very large number of processors. In case it does become useful to use a larger number of processors (say 256) for some programs, the use of hierarchical multiprocessors is proposed for the parallel Rete algorithm, where the latency for accessing information in another multiprocessor cluster is longer than the latency for accessing information in the local cluster. The reasons for using eight clusters of 32-processor multiprocessors instead of a single 256-processor multicomputer are: (1) it is easier to partition a production-system program into 8 parts than it is to partition it into 256 parts; (2) within each 32-processor multiprocessor it is possible to exploit very fine-grain parallelism (for example, intra-node parallelism) to reduce the maximum time taken by any single production to finish match.

The second requirement for the proposed production-system machine is that the individual processors should be high-performance computers, each with a cache and a small amount of private memory. Since simulations show that the number of processors required for the proposed production-system machine is small, there is no reason not to use the processors with the highest performance -- processors having wide datapaths and using the fastest technology available.33 It is interesting to note that the code sequences used to execute production-system programs do not include complex instructions. The instructions used most often are simple loads, compares, and branches without any complex addressing modes (see Appendix B and [22, 69]). As a result, reduced instruction set computers (RISCs) [36, 68, 70] form good candidates for the individual processors in the production-system machine.

The reason for associating a small amount of private memory with each processor is to reduce the amount of traffic to the shared memory, since the traffic to shared memory results in both bus contention and memory contention. The data structures stored in the private memory would be those that do not change very often, or those that change at well-defined points, an example being the data structure storing the contents of the working memory. It is possible to replicate such data structures in the private memories of processors and to keep them updated, thus saving repeated references to shared memory. Some of the frequently used code to process node activations can also be replicated in the private memories of the processors.

33. The suggestion for using a small number (32-64) of high-performance processors may be contrasted with suggestions for using a very large number (1,000-100,000) of weak processors, as originally suggested for DADO and NON-VON [79, 84]. While the schemes using a small number of processors can use expensive very high performance processors, schemes using a very large number of processors cannot afford to have fast processors for each of the processing nodes. In [31] we show that the performance lost as a result of weak individual processing nodes is difficult to recover by simply using a large number of them.
The third requirement for the proposed production-system machine is that the processors should be connected to the shared memory via shared buses. The reasons for suggesting a shared-bus scheme, instead of a crossbar switch or other communication networks (such as an Omega network or a Shuffle Exchange network [9]), are: (1) it is much easier to construct multi-cache coherency solutions when shared buses are used [76], and (2) simulation results show that a single high-speed bus should be able to handle the load put on it by about 32 processors, provided reasonable cache-hit ratios are obtained (see Section 8.8 for assumptions).

The reason why the caches must be able to hold shared data objects is that a large number of memory accesses in the parallel Rete algorithm are expected to be to such shared objects. If the processor suffers a cache miss for all shared-memory references, the performance penalty will be significant. When shared objects can be stored in the cache, it is also possible to implement short synchronization structures, like spin-locks [53], very efficiently, with the hardware arranged so that the processor loops out of the cache when the synchronization structure is busy, thus causing no additional bus traffic.

The fourth and final requirement for the proposed production-system machine is that it should be able to support a hardware task scheduler, that is, a hardware mechanism to enqueue node activations into the task queue and to help assign node activations to idle processors. The hardware task scheduler is needed because it is often necessary to schedule several node activations for the same node in parallel. In order to do this, several tests must be made before scheduling each activation, to ensure that it cannot interfere with other activations that are being processed at that time. While the amount of checking that must be done is small, the scheduling must be done serially, and it is therefore of great importance to the efficiency of the production-system machine. The hardware task scheduler is expected to sit on the shared bus, and the time to schedule an activation using such a scheduler is expected to be one bus cycle. The details of the necessity and structure of such a scheduler are given in Chapter 6.
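The remark about spin-locks refers to what is now called the test-and-test-and-set idiom: while the lock is held, waiters spin on a cached read and generate no bus traffic; only the atomic exchange touches the bus. A sketch using C11 atomics (the thesis predates C11; this is a modern rendering of the same idea):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool lock_flag = false;

    /* Test-and-test-and-set: the inner loop reads the (cached) flag and
       causes no bus traffic while the lock is busy; the atomic exchange
       is attempted only once the flag has been observed clear. */
    static void spin_lock(void) {
        for (;;) {
            while (atomic_load_explicit(&lock_flag, memory_order_relaxed))
                ;   /* spin locally in the cache */
            if (!atomic_exchange_explicit(&lock_flag, true, memory_order_acquire))
                return;
        }
    }

    static void spin_unlock(void) {
        atomic_store_explicit(&lock_flag, false, memory_order_release);
    }

    int main(void) { spin_lock(); spin_unlock(); return 0; }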
5.2. The State-Update Phase Processing

This section discusses various issues regarding the parallel implementation of the state-update phase of the Rete algorithm on a multiprocessor.

5.2.1. Hash-Table Based vs. List Based Memory Nodes

Line 2 in Tables 3-15 and 3-16 (see Chapter 3) gives the average number of tokens found when a memory node is activated. Similarly, Line 3 in Tables 3-17 and 3-18 gives the number of tokens in the opposite memory when a two-input node is activated. The significance of these numbers is that they indicate the complexity of processing memory nodes and two-input nodes respectively. Existing OPS5 and Soar interpreters store the contents of the memory nodes as a linear list of tokens. Thus when a token with a "-" tag (a deletion) arrives at a memory node, a corresponding token must be found and deleted from the memory node. If a linear search is done, then on average, half of the tokens in that memory node will be looked up. Similarly, for an activation of a two-input node, all tokens in the opposite memory must be looked up to find the set of matching tokens.

It is proposed that, instead of storing the tokens of a memory node as a linear list, it is better to store them in a hash table. The hash function used for storing the tokens corresponding to a memory node is based on (1) the tests associated with the two-input node below the memory node, and (2) the distinct node-id associated with the two-input node. (Recall that memory nodes always feed into some two-input node, and that two-input nodes have tests associated with them to determine those sets of tokens that have consistent variable bindings.) For example, consider the Rete network shown in Figure 5-2. The hash function used for a token entering the left memory node is based on the value of the attr2 field of the associated working-memory element and the node-id of the and-node below. The hash function for a token entering the right memory node is based on the value of the attr1 field and the node-id of the and-node below.

[Figure 5-2: A production and the associated Rete network. The two condition elements feed the memory nodes mem-1 and mem-2, which feed the and-node and-1 with the test left:attr2 = right:attr1; and-1 feeds the terminal node term-p1.]

There are two main advantages of storing tokens in a hash table. First, the cost of deleting a token from a memory node is now a constant, instead of being proportional to half the number of tokens in that memory node. Similarly, the cost of finding matching tokens in the opposite memory is now proportional to the number of successful matches, instead of being proportional to the number of tokens in the opposite memory.34 Second, hashing cuts down the variance in the processing time required by the various memory-node and two-input node activations, which is especially important for parallel implementations. The main disadvantage of using hashing is the overhead of computing the value of the hash function for each node activation. However, because hashing reduces the variance, even if the cost with hashing is greater than the cost when linear lists are used, hash-table based memory nodes may still be advantageous for parallel implementations.

As far as the implementation of the hash-table based node memories is concerned, there are two options. The hash table may be shared between all the memory nodes in the Rete network, or there may be a separate hash table for each memory node. Since a large fraction of memory nodes do not have any tokens at all (or have very few tokens), it would be a waste of space to have a separate hash table for each node.35 Also, since there is a large variation in the number of tokens present in the various memory nodes, a hash table of a single size for each memory node would not be appropriate. In the implementation suggested in this thesis, two hash tables are used for all memory nodes in the network: one hash table for all memory nodes that form the left input of a two-input node, and another hash table for all memory nodes that form the right input of a two-input node.

34. However, in the case of two-input nodes with no equality tests, hashing does not provide any discriminating effect. The processing times in such cases are the same as when linear lists are used. Fortunately, the number of such nodes is quite small.
35. One reason why a large fraction of memory nodes have no tokens or very few tokens is that, as in programs in many other programming languages, only a small fraction of the total productions are responsible for most of the action in a run.
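A sketch of a hash-table based memory node, assuming the single equality join test of the example above: the bucket index is computed from the field tested by the two-input node below plus that node's id, so tokens that could ever match land in the same bucket. The table size and hash function are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    #define BUCKETS 256

    typedef struct TokenCell {
        int attr_value;           /* value of the field used in the join test */
        struct TokenCell *next;
    } TokenCell;

    static TokenCell *left_table[BUCKETS];   /* shared by all left memories */

    /* Hash on the join-test value and the two-input node's id, so a token
       entering the opposite memory probes exactly one bucket. */
    static unsigned bucket(int value, int node_id) {
        return ((unsigned)value * 31u + (unsigned)node_id) % BUCKETS;
    }

    static void insert_token(int value, int node_id) {
        TokenCell *c = malloc(sizeof *c);
        unsigned b = bucket(value, node_id);
        c->attr_value = value;
        c->next = left_table[b];
        left_table[b] = c;
    }

    /* Deleting is a constant-expected-time bucket walk, not a scan of the
       whole memory node. */
    static void delete_token(int value, int node_id) {
        TokenCell **p = &left_table[bucket(value, node_id)];
        for (; *p; p = &(*p)->next)
            if ((*p)->attr_value == value) {
                TokenCell *dead = *p;
                *p = dead->next;
                free(dead);
                return;
            }
    }

    int main(void) {
        insert_token(12, 40);         /* token with attr2 = 12 at node 40 */
        delete_token(12, 40);
        puts("insert and delete each touched one bucket");
        return 0;
    }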
5.2.2. Memory Nodes Need to be Lumped with Two-Input Nodes

Uniprocessor implementations of the Rete algorithm save significant processing by sharing nodes between similar productions (see Table 3-13). This section discusses reasons why, in the parallel implementation proposed in this thesis, it is not possible to share a memory node between multiple two-input nodes.36 Simulations show that the loss of such sharing increases the total cost of performing match by about 25%.

The main problem with the straightforward sharing of memory nodes is that it violates some of the assumptions made by the Rete algorithm. The Rete algorithm assumes that (1) the processing of the memory node and the associated two-input node is an atomic operation, and (2) while the two-input node is being processed, the contents of the opposite memory do not change. The problem can be explained with the help of the Rete network shown in Figure 5-3. The network shows a memory node shared between the two condition elements of a production. When working-memory element wme-1 (shown at the bottom-left of Figure 5-3) is added to the working memory, it passes the constant test "Class = C1", and is then added to the memory node. This causes both a left activation and a right activation of the and-node. When the left activation is processed, the and-node finds a matching token in the right memory node (it is the same as the left memory node), and outputs a token to the terminal node. Similarly, when the right activation of the and-node is processed, another token is output to the terminal node. This is incorrect, because if the memory nodes were not shared, only one token would have been sent to the terminal node.

[Figure 5-3: Problems with memory-node sharing. The production (p p1 (C1 ^attr1 <x> ^attr2 <y>) ... --> (Remove 1)) has both condition elements, of class C1, fed by a single memory node below the constant test Class = C1; the working-memory element wme-1: (C1 ^attr1 7 ^attr2 7) causes both a left and a right activation of the and-node.]

The problem as described above would occur in both sequential and parallel implementations. There are techniques, however, that permit the problem to be avoided for sequential implementations, but they do not work well for the parallel case. For example, in current uniprocessor implementations of Rete, a memory node keeps two lists of successor nodes -- the left successors and the right successors. The left successors correspond to those two-input nodes for which this memory node forms the left input, and the right successors correspond to those two-input nodes for which the memory node forms the right input. The uniprocessor algorithm processes all the left successors (including the activations caused by the processing of the immediate left successors) before actually adding the new token into the memory node (a pointer to the token to be added is passed as a parameter); then the token is added to the memory node, and then all the right successors are processed. Thus for the network shown in Figure 5-3, no token is sent to the terminal node for the left activation of the and-node, and one token is sent for the right activation, as desired.

36. The run-time data presented in Table 3-20 shows that the number of memory-node activations is about a third of the two-input node activations. This indicates that, on average at run-time, each memory node is shared between three two-input nodes.
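The uniprocessor fix can be stated in a few lines: the left successors run before the token is physically in the memory node, and the right successors run after. A sketch with hypothetical stubs:

    #include <stdio.h>

    typedef struct { int attr1, attr2; } Token;

    static void process_left_successors(const Token *t)  { (void)t; /* joins do not yet see t */ }
    static void add_to_memory_node(const Token *t)       { (void)t; puts("token stored"); }
    static void process_right_successors(const Token *t) { (void)t; /* joins now see t */ }

    /* Sequential Rete's ordering for a shared memory node: because the
       left successors run before the token is stored, a self-join (as in
       Figure 5-3) sees the token only once, from the right activation. */
    static void memory_node_insert(Token t) {
        process_left_successors(&t);
        add_to_memory_node(&t);
        process_right_successors(&t);
    }

    int main(void) {
        Token t = { 7, 7 };
        memory_node_insert(t);
        return 0;
    }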
The uniprocessor algorithm processes all the left successors (including the activations caused by the processing of the immediate left successors) before actually adding the new token into the memory node (a pointer to the token to be added is passed as a parameter); then the token is added to the memory node, and then all the right successors are processed. Thus for the network shown in Figure 5-3, no token is sent to the terminal node for the left activation of the and-node, and one token is sent for the right activation, as desired.

The above scheme is not suitable for parallel implementations because it is too constraining. It requires that all left successors (including all successors of the immediate left successors) be processed before the right successors are processed, and this defeats the whole purpose of the parallel implementation. Other more complex schemes, for example, associating a marker array with each newly added token that keeps track of the successor two-input nodes that have actually seen and processed the token added to the memory node, impose too much overhead to be useful. In some simple-minded parallel implementations, sharing of memory nodes can also cause deadlocks, for example, when two processes try to ensure that the shared memory node does not get modified while the two-input nodes are being processed.

For the implementation proposed in this thesis, there is one other reason why memory nodes need to be lumped with the two-input nodes: the use of hash-table based memory nodes. As suggested in the previous section, to enable finding matching tokens in the opposite memory efficiently, the hash function uses the values of the tests present in the associated two-input node. Thus if the tests associated with the successor two-input nodes are different, then it is not possible to share the memory nodes feeding into those two-input nodes.

In all subsequent sections of this thesis, it is assumed that memory nodes are not shared (that is, there are two separate memory nodes for each two-input node), and that the processing associated with a two-input node refers to the processing required by the and-node or the not-node and the associated memory nodes. However, note that it is still possible to share two-input nodes (and-nodes and not-nodes) in the Rete network, and if the two-input nodes are shared then the associated memory nodes are automatically shared too.

5.2.3. Problems with Processing Conjugate Pairs of Tokens

Sometimes when performing match for a single change or multiple changes to working memory, it is possible that the same token is first added to and then deleted from a memory node. Such a pair of tokens is called a conjugate pair of tokens [17]. For example, consider the productions and the corresponding Rete network shown in Figure 2-2 (see Chapter 2). Let the initial condition of the network correspond to the state when working-memory elements wme1 and wme2 (shown at the bottom-left of Figure 2-2) have been added to the working memory. Now consider a production firing that inserts wme3 into working memory and deletes wme1 from working memory. If wme3 is processed before wme1, the token t-w1w3 will first be added to the memory node at the bottom-left of the Rete network, and subsequently, when the deletion of wme1 is processed, the token t-w1w3 will be deleted from the memory node.
Although conjugate pairs are not generated very often, their occurrence poses problems for parallel implementations, as explained below.

Consider the case when the insertion of wme3 and the deletion of wme1 are processed concurrently. In this case, as before, requests for both the addition of t-w1w3 and the deletion of t-w1w3 to the memory node are generated. Now it is quite possible that the scheduler used in the parallel implementation assigns the processing of the deletion of t-w1w3 to a processor before it assigns the processing of the addition of t-w1w3. 37 When the delete t-w1w3 request is processed, the token to be deleted will not be found, since it has not been added so far. 38 There are two alternatives. First, the request for the deletion of the token can simply be aborted, and no more action taken. Second, the fact that the token to be deleted was not found can be recorded somewhere. This way, when the add t-w1w3 request is processed, it is possible to determine that an extra delete operation had been performed on the memory node, and appropriate action can be taken. The first alternative suggested above is not reasonable, since it would lead to incorrect results from the match. The second alternative is what this thesis proposes. In the proposed implementation, extra tokens from early deletes are stored in a special extra-deletes-list associated with each two-input node. Whenever a token is to be inserted into a memory node, the extra-deletes-list associated with the corresponding two-input node is first checked to see if the token to be inserted cancels out an early delete. Most of the time this list is expected to be empty, so the overhead of such a check should be small. (A sketch of this bookkeeping appears below.)

37 There are a number of reasons why this may happen. For example, although the request for the addition of t-w1w3 is generated before the request for the deletion, because of arbitrary delays in the communication process, the request for the deletion may reach the scheduler before the request for addition. Also, there is no simple way for the scheduler to realize that the delete request it has just received is part of a conjugate pair.

38 Note that there are no such problems in uniprocessor implementations of the Rete algorithm. The reason is that, in uniprocessor implementations, it is possible to ensure that the sequence in which the requests for insertions and deletions of tokens are processed is the same as the sequence in which these requests were generated.
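The following minimal sketch illustrates the cancellation logic just described; the names (MemoryNode, extra_deletes, and so on) are hypothetical, and the structure is a sketch of the idea rather than the thesis's actual code.

    # Hypothetical sketch of the extra-deletes-list.  An early delete that
    # finds no matching token is parked on the list; a later insert first
    # tries to cancel against it.

    class MemoryNode:
        def __init__(self):
            self.tokens = []          # stands in for the hash-table bucket(s)
            self.extra_deletes = []   # deletes that arrived before their adds

        def delete_token(self, token):
            if token in self.tokens:
                self.tokens.remove(token)
            else:
                # Conjugate pair processed out of order: remember the delete.
                self.extra_deletes.append(token)

        def insert_token(self, token):
            if token in self.extra_deletes:
                # The insert cancels a previously recorded early delete, so
                # the net effect is that the token never appears.
                self.extra_deletes.remove(token)
            else:
                self.tokens.append(token)

Because conjugate pairs are rare, the extra_deletes list is almost always empty, so the extra membership test on each insert is cheap.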
5.2.4. Concurrently Processable Activations of Two-Input Nodes

In Section 4.2.3 on intra-node parallelism, it was proposed that multiple activations of the same two-input node should be evaluated in parallel to obtain additional speed-up. This section discusses some of the restrictions that need to be put on processing multiple activations of the same node in parallel. These restrictions are necessary to reduce the amount of synchronization needed to ensure correct operation of the algorithm.

Figure 5-4 shows the various kinds of situations that may arise when multiple activations of the same two-input node occur. For example, case-1 in the figure refers to multiple insert-requests from the left side, case-4 refers to multiple delete-requests from the right side, case-5 refers to both insert and delete requests from the left side, and case-7 refers to multiple insert-requests, some of which are from the left side of the two-input node and some of which are from the right side. We propose that only the multiple activations depicted in cases 1-6 should be processed in parallel, and that multiple activations depicted in cases 7-10 (or their combinations) should not be processed in parallel. To justify these restrictions, the various cases are divided into three groups. Cases 1-4 are grouped together, and represent the case when multiple inserts or multiple deletes from the same side are to be processed. Cases 5-6 are grouped together, and represent the case when both insert and delete requests from the same side are to be processed. Cases 7-10 are grouped together, and represent the case when activations from both sides are to be processed concurrently.

[Figure 5-4: Concurrent activations of two-input nodes. Cases 1-4 show multiple inserts or multiple deletes from one side; cases 5-6 show mixed inserts and deletes from one side; cases 7-10 show activations arriving on both sides.]

The reason for not processing cases 7-10 in parallel is related to the assumption of the Rete algorithm that, while a two-input node is being processed, the opposite memory should not be modified. The effect of violating this assumption was also shown in Section 5.2.2. To illustrate the problems with processing such activations in parallel, consider case-7, where an insert request from the left and an insert request from the right are processed concurrently. Also assume that the tokens corresponding to these requests satisfy the tests associated with the two-input node. It is possible to have the following sequence of operations: (1) the token corresponding to the left insert-request is added to the left memory-node; (2) the token corresponding to the right insert-request is added to the right memory-node; (3) the left activation of the two-input node finds the newly inserted right-token and results in the generation of a successor token; (4) similarly, the right activation of the two-input node results in the generation of another successor token. This is incorrect, since only one successor token should have been generated. The cost of detection and deletion of duplicates at the successor nodes is too expensive a solution. Similarly, there is no simple way of ensuring that the relevant portion of the opposite memory does not get modified while the two-input node is being processed. 39 The reasons for not permitting cases 8-10 to be processed in parallel are along the same lines as above.

Whether cases 5-6 are permitted to be processed in parallel depends on some subtle implementation decisions. For example, when the extra-deletes-list is associated with each two-input node in the Rete network (as proposed in Section 5.2.3), it is not possible to process activations corresponding to cases 5-6 in parallel. The reasons are related to the conjugate-pair problem discussed in the previous subsection. Consider the case when both an insert request and a delete request for a token are to be processed. It is possible that the following sequence of operations takes place: (1) The delete request begins first: it locks the memory node (the lock associated with the appropriate bucket of the hash table storing the tokens), it does not find the token to be deleted, and it releases the lock on the memory node. It then attempts to get a second lock, the lock necessary to insert the extra delete into the special extra-deletes-list associated with the corresponding two-input node.
(2) Before the delete request can get hold of the second lock, the insert request gets hold of that lock to check if any extra deletes have been done. It finds no extra deletes, releases this lock, and then goes on to insert the token into the memory node. (3) The delete request gets hold of the second lock, and inserts the token into the extra-deletes-list. The result of the above sequence is obviously incorrect, since the correct result should have been no token in the memory node and no token in the extra-deletes-list.

A solution to the above problem, so that activations corresponding to cases 5-6 may be processed in parallel, is to use only a single lock both to check/change the contents of the memory node and to check/change the contents of the extra-deletes-list. This can be achieved by associating an extra-deletes-list with each bucket of the hash table and using a common lock, rather than associating an extra-deletes-list with each two-input node (as proposed in Section 5.2.3). Because of the late discovery of this solution, the simulation results presented in Chapter 8 represent the case when node activations corresponding to cases 5-6 are not processed in parallel. Some simulations done to test the new solution show an overall performance improvement of about 5%. The increase in performance is not very significant because multiple insert and delete requests for a two-input node from the same side are rare.

The reason why cases 1-4 can be processed easily in parallel is that, while the multiple activations of the two-input node are being processed, the opposite memory stays stable and does not change, unlike in cases 7-10. There is also no potential for race conditions, where the same token is being inserted into and deleted from a memory node at the same time, as in cases 5-6. The parallel processing of cases 1-4 yields the largest increase in speed-up, as it eliminates the cross-product bottleneck mentioned in Section 4.2.3. (A sketch of the resulting concurrency rule appears below.)

39 In the current implementation, the relevant portion of the opposite memory is identified by the tokens in the corresponding bucket of the opposite hash table. Thus if it is ensured that this specific opposite bucket is not being modified, the rest of the opposite memory may be modified in any way. Although this idea has not been developed any further, it should be possible to work out a solution along these lines, so that multiple activations of the same two-input node from different directions can be processed in parallel.
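The rule can be stated compactly as a predicate over pending activations. The sketch below is hypothetical (the thesis implements this check inside the scheduler, not as a standalone function) and reflects the restrictions just described: activations of different nodes never conflict, same-side activations of one node may proceed together, and opposite-side activations may not. The conservative variant also serializes mixed insert/delete requests from the same side, matching the simulations in Chapter 8.

    # Hypothetical predicate: may two pending node activations be processed
    # concurrently?  An activation is (node_id, side, op), where side is
    # "left"/"right" and op is "insert"/"delete".

    def concurrently_processable(a, b, single_lock_per_bucket=False):
        nid_a, side_a, op_a = a
        nid_b, side_b, op_b = b
        if nid_a != nid_b:
            return True      # activations of different nodes never interact
        if side_a != side_b:
            return False     # cases 7-10: opposite sides, never in parallel
        if op_a == op_b:
            return True      # cases 1-4: same side, both inserts or deletes
        # Cases 5-6: mixed insert/delete from the same side is safe only when
        # a single lock guards both the bucket and its extra-deletes-list.
        return single_lock_per_bucket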
5.2.5. Locks for Memory Nodes

As proposed in Section 5.2.1, the tokens associated with all memory nodes in the Rete network are stored in two global hash tables: one for the tokens belonging to the left memory-nodes and one for tokens belonging to the right memory-nodes. Since multiple node activations may be processed in parallel, it is necessary to control access to the individual buckets of the hash tables. It is proposed that there should be a lock associated with each bucket of the two hash tables. Furthermore, the locks should be of the multiple-reader/single-writer type; that is, the lock associated with a bucket should permit multiple readers at the same time, but it should permit only a single writer at a time, and it should exclude readers and writers from entering the bucket at the same time.

The use of the read and write locks is expected to be as follows. For the left activation of a two-input node, a write-lock is used to insert/delete the token from the chosen bucket in the left hash table. The read-lock is used to look at the contents of the corresponding bucket in the right hash table to find tokens in the right memory node that have consistent variable bindings with the left token. Thus a multiple-reader/single-writer lock permits several node activations that wish to look at a bucket in read-mode to proceed in parallel. Such a scheme is expected to help in handling the hash-table accesses generated during the processing of cross-products.

5.2.6. Linear vs. Binary Rete Networks

In Section 2.4.2 it was pointed out that there is a large variability possible in the amount of state that a match algorithm stores. For example, on the low side, the TREAT algorithm [60] stores information only about matches between individual condition elements and individual working-memory elements. On the high side, the algorithm proposed by Kemal Oflazer [67] stores information about matches between all possible combinations of condition elements that occur in the left-hand side of a production and the sequences of working-memory elements that satisfy them. The state stored by the Rete class of algorithms falls in between the above two schemes. The Rete class of algorithms, in addition to storing information about matches between individual condition elements and working-memory elements, stores information about matches between some fixed combinations of condition elements occurring in the left-hand sides of productions and working-memory element tuples. The combinations for which state is stored are decided at the time when the network is compiled. This section discusses some of the factors that influence the selection of the combinations of condition elements for which state is stored.

To study the state stored for a production by the standard Rete algorithm, 40 the algorithm used in existing OPS5 interpreters, consider the following production with k condition elements:

    C1 & C2 & ... & Ck → A1, A2, ..., Ar

40 Also called the linear Rete, because of the linear chain of two-input nodes constructed by it.

The state stored consists of working-memory elements matching the individual condition elements (stored in the α-memory nodes of the Rete network), working-memory element pairs jointly matching the condition elements C1 & C2, working-memory element triples matching C1 & C2 & C3, and so on (stored in the β-memory nodes of the Rete network), and finally working-memory element k-tuples matching C1 & C2 & ... & Ck (stored in the conflict-set). The algorithm does not store state for any other combinations of condition elements; for example, working-memory element pairs matching the combination C2 & C3 are neither computed nor stored.

To make the discussion of the advantages and disadvantages of the scheme used by the standard Rete algorithm easier, it helps to consider the state of a production in terms of relational database concepts [90]. In relational database terminology, the sets of working-memory elements matching the individual condition elements C1, ..., Ck can be considered as relations R1, ..., Rk. The relation specifying the working-memory element k-tuples matching the complete production is denoted by R1 ⊗ R2 ⊗ ... ⊗ Rk, and is computed by the join of the relations R1, ..., Rk, where the join conditions correspond to those tests that ensure the consistency of variable bindings between the various condition elements. The state stored by the standard Rete algorithm corresponds to the tuples in the relations:

• R1, ..., Rk. This is the state stored in the α-memory nodes of the Rete network.

• R1 ⊗ R2, R1 ⊗ R2 ⊗ R3, ..., R1 ⊗ R2 ⊗ ... ⊗ Rk. This is the state stored in the β-memory nodes of the Rete network and the conflict-set.
The processing required when a working-memory element is either inserted or deleted corresponds to keeping all these relations updated. In this process, the algorithm makes use of the smaller joins that have been computed earlier to compute the larger joins. For example, to compute the join R1 ⊗ R2 ⊗ R3, the algorithm uses the join R1 ⊗ R2 and the relation R3 that have been computed earlier. Similarly, the addition of a new working-memory element to the relation Rk requires that the relation R1 ⊗ R2 ⊗ ... ⊗ Rk be updated. This is done by computing the join (R1 ⊗ R2 ⊗ ... ⊗ Rk-1) ⊗ {new-wme}, where the join of R1, ..., Rk-1 already exists (a toy sketch of this bookkeeping appears at the end of this discussion).

The goodness of the state-saving strategy used by a match algorithm is determined by the amount of work needed to compute the conflict-set, that is, the relation R1 ⊗ R2 ⊗ ... ⊗ Rk for all the productions, and of course, any other intermediate relations used in the process. 41 Empirical studies show that by the above criteria, the standard Rete algorithm does quite well. The work required to keep the relations R1 ⊗ R2, R1 ⊗ R2 ⊗ R3, ..., updated is quite small (as shown by the small number of β-memory node activations per change to working memory in Section 3.4.3). The reasons for this are:

• The way productions are currently written, the initial condition element of a production is very constraining, which makes the cardinality of the relations R1, R1 ⊗ R2, ..., quite small, which in turn implies that the processing required to keep the state updated is small. For example, the first condition element of a large fraction of productions is of the type context, and since normally only one context is active at any given time, the relation R1 is empty for all productions that do not belong to the active context, and they do not require any state processing for the β-memory nodes. 42

• Relations corresponding to a large number of condition elements, that is, relations of the form R1 ⊗ ... ⊗ Rp, where p is 3 or more, are naturally restrictive (do not grow too large) because of the large number of conditions involved and the associated tests. 43

41 This criterion of goodness is appropriate primarily for uniprocessor implementations.

42 In case the first condition element of a production is not very constraining, it is often possible to reorder the condition elements so that the first condition element is restrictive. Such transformations may either be done by machine [48] or by a person.

43 Again, the ordering of condition elements can help to reduce the amount of state that has to be updated on each cycle.
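Before turning to the parallel case, the prefix-join bookkeeping described above can be made concrete with a toy sketch. The names are ours, and the join tests are abstracted into a caller-supplied predicate; this illustrates the idea, not the thesis's implementation.

    # Toy sketch of linear-Rete state as prefix joins (illustrative only).
    # Each relation is a list of tuples of wmes; `consistent` stands in for
    # the variable-binding tests of the corresponding two-input node.

    def extend_prefix(prefix_join, new_relation, consistent):
        """Compute R1 x ... x Ri from the stored R1 x ... x R(i-1) and Ri."""
        return [combo + (wme,)
                for combo in prefix_join
                for wme in new_relation
                if consistent(combo, wme)]

    def add_wme_to_last_ce(prefix_k_minus_1, wme, consistent):
        """Adding a wme matching Ck only needs the stored join of
        R1 ... R(k-1): the new k-tuples are (R1 x ... x R(k-1)) x {wme}."""
        return [combo + (wme,) for combo in prefix_k_minus_1
                if consistent(combo, wme)]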
While the scheme used by the standard Rete algorithm works fine for uniprocessor implementations, some problems are present for parallel implementations. This is because the criteria for goodness for parallel implementations are slightly different from those for uniprocessor implementations. In a uniprocessor implementation the aim is to minimize the total processing required in the state-update process, and it is not important if the state-update phase for some productions takes much longer than that for other productions. In a parallel implementation, while it is important to keep the sum of the state-update costs for the affected productions down, it is equally important to reduce the variation in the costs of the affected productions. The standard Rete algorithm keeps the total cost of the state-update phase well under control, but it is not so good at keeping the variation down. Some of the reasons are discussed below.

Simulation studies for parallel implementations using the state-saving strategy of standard Rete show that a common reason for the large value of tmax/tavg is a long chain of dependent node activations; that is, often a node activation causes an activation of a successor node, which in turn causes an activation of its successor node, and so on. This is called the long-chain effect (see Figure 5-5), and it is especially important in programs that have productions with a large number of condition elements. It is possible to reduce the length of these long sequences of dependent node activations by shortening the maximum length of chains that is possible, that is, by changing the intermediate state that is computed by the algorithm and the way it is combined to get the final changes to the conflict-set.

[Figure 5-5: The long-chain effect. In a linear network over condition elements CE1, ..., CEk, the maximum length of a chain of dependent node activations is k.]

One way to reduce the length of the long chains is to construct the Rete network as a binary network (see Figure 5-6) instead of as a linear network. This way the maximum length of a chain of dependent node activations is cut down from k to ⌈log2 k⌉ + 1. In such a scheme the stored state corresponds to tuples in the relations:

• R1, ..., Rk,
• R1 ⊗ R2, R3 ⊗ R4, ..., Rk-1 ⊗ Rk,
• R1 ⊗ R2 ⊗ R3 ⊗ R4, ..., Rk-3 ⊗ Rk-2 ⊗ Rk-1 ⊗ Rk,
• and so on.

[Figure 5-6: A binary Rete network over condition elements CE1, ..., CEk; the maximum length of a chain of dependent node activations is ⌈log2 k⌉ + 1.]

Simulations for Soar programs show that the binary network scheme reduces the variation in the processing costs of productions significantly, thus increasing the number of processors that can be effectively used in a parallel implementation. For example, for the eight-puzzle program in Soar, using the binary network increases the speed-up from 9-fold to 15-fold. The speed-up obtained for normal OPS5 programs is not as good; in fact, in most cases the speed-up is reduced.

The reasons for the reduction in speed-up in OPS5 programs are the following. First, the average number of condition elements per production in OPS5 programs is quite small, 3.4 for OPS5 programs as compared to 6.8 for the Soar programs. Since the number of condition elements is small, the difference between the length of chains in the linear and the binary networks is not significant, and thus not much improvement can be expected. Second, the total cost of the state-update phase when using the binary network scheme is often much larger than the total cost when using the linear network scheme. As an example, consider the following production:

    (p free-all-hands
        (robot ↑hand1 <x> ↑hand2 <y> ↑hand3 <z>)
        (object ↑name <x>)
        (object ↑name <y>)
        (object ↑name <z>)
      -->
        (modify 1 ↑hand1 free ↑hand2 free ↑hand3 free))

The production consists of four condition elements, and the joins stored by the linear Rete are R1 ⊗ R2, (R1 ⊗ R2) ⊗ R3, (R1 ⊗ R2 ⊗ R3) ⊗ R4. The joins stored by the binary Rete are R1 ⊗ R2, R3 ⊗ R4, (R1 ⊗ R2) ⊗ (R3 ⊗ R4). Note that the relations in parentheses indicate the way a larger relation is computed.
Thus the relation (R1 ⊗ R2) ⊗ R3 implies that it is computed as the join of R1 ⊗ R2 and R3, and not as the join of R1 and R2 ⊗ R3.

Now consider the scenario where there is one working-memory element of type robot (corresponding to a single robot in the environment) and there are 100 working-memory elements of type object (corresponding to 100 different objects in the environment). In such a case the cardinality of the relations R1, R2, R3, and R4 will be 1, 100, 100, and 100 respectively. Then in the case of the linear network all β-memory nodes will have a cardinality of 1, that is, they will contain only one token. This is because of the constraining effect of the first condition element, and the fact that the variables x, y, and z are all bound by the time the join operation moves to the second, third, and fourth condition elements. In the case of the binary Rete network, however, the β-memory node corresponding to the relation R3 ⊗ R4 will have 10,000 tokens in it, since there are no common variables between the third and the fourth condition elements. So whenever a working-memory element of type object is inserted, about 100 tokens have to be added to that memory node, in contrast to only a single addition if the linear network is used. (A small sketch at the end of this section works through these numbers.)

As the above discussion shows, the use of a binary network can often result in a large number of node activations that would not have occurred in the linear network. 44 This increase in the basic cost of the state-update phase often offsets the extra advantage given by the smaller value of tmax/tavg found in binary networks. In fact, in all systems studied, there were at least a few productions where the state grew enormously when the binary networks were used. The networks for these few productions had to be changed back to the linear form (or some mixture of binary and linear forms) before the production-system programs could be run to completion.

As a result of the various studies and simulations done for this thesis, it appears that there is no single state-storage strategy that is good for all production systems. There is no reason, however, to have the restriction that all productions in a program should use the linear network, the binary network, or any other fixed network form. We propose that the compiler should be built so that the network form to be used with each production can be specified independently, and that this network form should be individually tailored to each production. The form of the network may be provided by the programmer at the time of writing a production. This strategy has the advantage that the programmer often has semantic information that permits him to determine the sizes of the various relations; information that a machine may not have. Alternatively, the form of the network may be provided by another program that uses various heuristics and analysis techniques [83]. Such a program may also use data gathered about relation sizes from earlier runs of the program to optimize the network form. 45

44 Note that it is also possible to construct examples where the state-update phase for the linear network is much more expensive than that for the binary network. However, in practice, such cases are not encountered as often.

45 The network compiler may also provide some default network forms that may be used with the productions, as is the case currently in standard Rete.
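A small computation confirms the numbers above. The sketch is hypothetical and simplifies the join tests to shared variables; it compares the β-memory sizes implied by the two network shapes for the free-all-hands production.

    # Illustrative check of beta-memory sizes for the free-all-hands example:
    # one robot wme binds <x>, <y>, <z>; 100 object wmes, each with a
    # distinct name matched by exactly one binding.

    n_robots, n_objects = 1, 100

    # Linear network: each prefix join is filtered by the variables already
    # bound by the robot wme, so each beta memory holds a single token.
    linear_beta_sizes = [
        n_robots * 1,         # R1 x R2: <x> already bound, one object matches
        n_robots * 1 * 1,     # (R1 x R2) x R3
        n_robots * 1 * 1 * 1  # (R1 x R2 x R3) x R4
    ]

    # Binary network: R3 x R4 joins two object relations that share no
    # variables, i.e., a full cross-product.
    binary_r3_r4 = n_objects * n_objects

    print(linear_beta_sizes)   # [1, 1, 1]
    print(binary_r3_r4)        # 10000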
5.3. The Selection Phase

Given a working-memory element, the selection phase identifies those condition elements that are satisfied by that working-memory element. The processing done during the selection phase primarily involves evaluating the constant-test nodes found in the upper part of the Rete network. This section discusses some of the issues that arise when the selection phase is performed on a multiprocessor.

5.3.1. Sharing of Constant-Test Nodes

A production-system program often has different productions to deal with variants of the same situation, and as a result the condition elements of these productions are also similar. The Rete network compiler shares constant-test nodes whenever such similar condition elements are compiled, and the uniprocessor implementations of the Rete algorithm rely greatly on the sharing of constant-test nodes to save processing time in the selection phase (see Section 3.3.2). For example, the constant-test nodes immediately below the root node consist of tests that check the type of the condition element, since that is the first field of all condition elements. For the VT production system [51], the total number of condition-element types is 48; thus with sharing only 48 constant-test nodes are needed at the top level, each checking for one of the 48 types. If no sharing is present, however, a separate node would be needed to check the type of each condition element, and since the system consists of approximately 4500 condition elements, that many nodes would be required at the top level.

In a non-shared-memory multicomputer implementation of production systems, where each production is allocated to a separate processor, it is not possible to exploit the sharing mentioned above. This is because each processor must independently determine which of the condition elements of the production allocated to it are satisfied. There is no reason, however, not to use such sharing in a shared-memory multiprocessor implementation of production systems. One of the consequences of the sharing of constant-test nodes is that often a constant-test node may have a large number of successors. These successors may either be other constant-test nodes or they may be α-memory nodes. Some implementation issues related to how these successors ought to be evaluated are discussed below.

5.3.2. Constant-Test Node Successors

Consider the cost of evaluating a constant-test node. The cost consists of: (1) in case the activation is picked up from the centralized task queue, the cost of removing the activation from the task queue and the cost of setting up the registers so that the processing may begin; (2) the cost of evaluating the test associated with the constant-test node; and (3) in case the associated test succeeds, the cost of pushing the successor nodes onto a local stack (for activations to be processed on the same processor) or onto the centralized task queue (for activations to be picked up by any idle processor). In the proposed implementation using a hardware task scheduler, the cost of the first step is about 10 instructions, the cost of the second step about 5-8 instructions, and the cost of the third step about 2 instructions for a local push and about 5 instructions for a global push.
Since the cost of evaluating the test associated with a node (the cost of step-2) is small compared to the costs of pushing and popping an activation from the task queue (the costs of step-1 and step-3), it is not advisable to schedule each constant-test node activation through the global task queue. Instead, it is suggested that only bunches of constant-test node activations should be scheduled through the global task queue, and that all nodes within a bunch should be processed on the same processor (thus saving the cost of step-1). One way to achieve this is to schedule node activations close to the top level through the global task scheduler, and have the activations in the subtrees of these globally scheduled nodes processed locally (see Figure 5-7). Another alternative is for the network compiler to use heuristics to decide which nodes are to be scheduled globally and which nodes are to be scheduled locally.

[Figure 5-7: Scheduling activations of constant-test nodes. Nodes near the top of the network are scheduled through the centralized task queue; the nodes in the subtrees below them are processed locally, on the same processor as their parent.]

It was observed in Section 3.4.1 that only about 12% of the constant-test node activations have their associated test satisfied. The reason for the large number of activations with failed tests is that the standard Rete algorithm does not use any indexing techniques (for example, hashing) to filter out node activations that are bound to fail their tests. For example, consider a node whose four successor nodes test if the value of some attribute is 1, 2, 3, or 4 respectively. The standard Rete algorithm would evaluate all four of these nodes, even though three of them are bound to fail.

The choice whether or not to use indexing is, however, not always so clear. This is because using a hash function imposes the overhead of computing the hash value, and because constant-test node activations are very cheap to evaluate (the OPS83 implementation evaluates a constant-test node in three machine instructions). Since the constant-test nodes are not as cheap to evaluate in a multiprocessor implementation, it is proposed that indexing be used in most places, especially in places where the branching factor is large. (A sketch of such successor indexing follows.)
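One way to realize such indexing, sketched below under assumed names (the thesis does not commit to a particular data structure): when all successors of a node test the same field for equality against distinct constants, the successors can be stored in a dictionary keyed by the constant, so that only the successors that can succeed are activated.

    # Hypothetical sketch of indexing the successors of a constant-test node.
    # Instead of activating every successor (most of which fail), the wme's
    # value for the tested field selects the successors that can match.

    def activate_successors_indexed(successor_index, wme, field):
        """successor_index maps a constant value -> successor node(s) that
        test `field` for equality with that value."""
        return successor_index.get(wme[field], [])

    # Example: four successors testing attr = 1, 2, 3, or 4.
    index = {1: ["node-a"], 2: ["node-b"], 3: ["node-c"], 4: ["node-d"]}
    wme = {"class": "C1", "attr": 3}
    print(activate_successors_indexed(index, wme, "attr"))   # ['node-c']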
5.3.3. Alpha-Memory Node Successors

At the interface between the selection phase and the state-update phase are the α-memory nodes (recall that the constant-test nodes feed directly into the α-memory nodes). Normally, when a constant-test node succeeds, the associated processor builds tokens for the activations of its successor α-memory nodes. These tokens are then given to the task scheduler so that they may be processed by other idle processors. A problem, however, occurs when a constant-test node having a large number (10-30) of α-memory successors is satisfied. Since preparing a token for the scheduler requires the execution of several instructions by the processor, around 5 in the proposed implementation (see Appendix B for details), it implies that by the time the last successor node is scheduled, 100 instructions' worth of time has already elapsed (assuming 20 successor nodes). The time taken by 100 instructions is quite large considering the fact that processing a constant-test node activation takes only around 5-20 instructions (only 3 instructions in the uniprocessor implementation) and that processing a two-input node activation takes only 50-100 instructions. (See Tables 8-3 and 8-4 for the relative costs of processing different node activations.)

For this reason, it is proposed that in case a constant-test node has a large number of α-memory successors, it is advisable to replicate that constant-test node, such that each of the replicated nodes has a small number of successors, as shown in Figure 5-8. Thus when the multiple copies of the original constant-test node are evaluated in parallel, the successors get scheduled in parallel. The number of times that a node ought to be replicated is a function of (1) the number of successor nodes, and (2) the relative costs of processing a constant-test node activation, of processing a two-input node activation, and of preparing and scheduling a successor node. An alternative to replicating the constant-test node is to use a new node type which does not perform the test made by the constant-test node but simply helps in scheduling the α-memory successors in parallel.

[Figure 5-8: Possible solution when a constant-test node has too many alpha-memory successors: the node is transformed into several copies, each with only a few α-memory successors.]

5.3.4. Processing Multiple Changes to Working Memory in Parallel

As in the case of the state-update phase, it is possible to process the selection phase for multiple changes to working memory in parallel. Since the constant-test nodes do not store any state, there are no restrictions placed on the constant-test node activations that may be evaluated in parallel.

5.4. Summary

In this chapter various software and hardware issues dealing with the parallel implementation of production systems have been studied. The discussion may be summarized as follows:

• The hardware architecture suitable for implementing production systems is a shared-memory multiprocessor consisting of 32-64 high-performance processors. Each processor should have a small amount of private memory and a cache that is capable of storing both private and shared data. The multiprocessor should support a hardware task scheduler to help in the task of assigning pending node activations to idle processors.

• The following points list some of the problems and issues that arise when implementing the state-update phase on a multiprocessor:

o The tokens associated with the memory nodes should be stored in global hash tables instead of in linear lists, as is done in existing implementations. This helps in reducing the variance in the processing required by the insertion and deletion of tokens. It is further suggested that a multiple-reader/single-writer lock be associated with each bucket of the hash tables to enable correct access by concurrent node activations.

o Section 5.2.2 gives reasons why it is not possible to share a memory node between multiple two-input nodes, as is done in uniprocessor implementations of Rete. The reasons have to do with synchronization constraints and the use of hash tables for storing tokens.

o A solution is given to the problem of processing conjugate pairs of tokens. The solution consists of associating an extra-deletes-list with each memory node. This way, whenever a delete-token request is processed before the corresponding add-token request, it can be put on the extra-deletes-list and processed appropriately later.

o In the proposed parallel implementation it is not possible to process all activations of a two-input node in parallel.
Section 5.2.4 gives reasons why multiple activations from one side of a two-input node can be processed concurrently, but why multiple activations from both the left and the right side should not be processed concurrently.

o Within the Rete class of match algorithms many different state-saving strategies can be used. Section 5.2.6 discusses the relative merits of the linear-network strategy used by the standard Rete algorithm and the alternative binary-network strategy. While the binary-network strategy is found to be much more suitable for Soar programs (programs having productions with a large number of condition elements), the linear-network strategy is found to be more suitable for the OPS5 programs. In general, even within a single program, no single strategy is found to be suitable for all the productions, and it is suggested that tailoring the networks to individual productions can lead to significant speed-ups.

• The following points list some of the problems and issues that arise when implementing the selection phase:

o Measurements show that only 12% of constant-test node activations have their associated tests satisfied. To avoid evaluating node activations which fail their tests, it is proposed that indexing/hashing techniques should be used. This can result in significant savings.

o The cost of evaluating the test associated with a constant-test node activation is significantly less than the cost of enqueueing and dequeueing a node activation from the centralized task queue. For this reason, instead of scheduling individual constant-test node activations through the centralized task queue, it is suggested that only bunches of constant-test node activations should be scheduled through the centralized task queue.

o Often a constant-test node may have a large number of α-memory node successors. Scheduling these successors serially results in a significant delay by the time the last successor node is scheduled. A solution using replicated constant-test nodes and another solution using a new node type are proposed for scheduling the successors in parallel.

Chapter Six
The Problem of Scheduling Node Activations

The model for parallel execution of production systems proposed in the previous chapter consists of two components: (1) a number of node processors connected to a shared memory, where each node processor is capable of processing any given node activation; and (2) a centralized task scheduler, where all node activations requiring processing may be placed and subsequently extracted by idle processors. This chapter explores the implementation issues that arise in the construction of such a centralized task scheduler.

The first difficulty in implementing a centralized task-scheduling mechanism stems from the fine granularity of the node activations that are processed. For example, in the current implementation, the average processing required by a node activation is only 50-100 instructions, and processing a change to working memory results in about 50 node activations. In a centralized task scheduler, even if enqueueing and dequeueing an activation took only 10 instructions, by the time the last activation is enqueued and finally picked up for processing, 500 instructions' worth of time would have elapsed. 46 This time is significantly larger than that needed to process individual node activations, and if the scheduler is not to be a bottleneck, the processing required for enqueueing and dequeueing activations must be made much smaller.

The second difficulty associated with implementing a centralized task-scheduling mechanism for the parallel Rete algorithm is that the functionality required of it is much more than that of a simple queue. This is because all node activations present in the task queue are not processable all of the time. To be concrete, consider the example shown in Figure 6-1. The figure shows a two-input node for which four activations a1, a2, a3, and a4 are waiting to be processed. However, because of the synchronization constraints given in Section 5.2.4, left activations may not be processed concurrently with right activations. Thus, while node activations a1 and a2 may be processed concurrently, and node activations a3 and a4 may be processed concurrently, node activations a1 and a3 may not be processed concurrently.

46 We do not assume a memory structure of the form proposed for the NYU Ultracomputer [28], where enqueues and dequeues can be done in parallel. Because of the additional complexities in the enqueue and dequeue required for production systems, the standard structure proposed for the Ultracomputer would not work.
46 This time is significantly larger than that to process individual node activations, and if the scheduler is not to be a bottleneck, the processing required for enqueueing and dequeueing activations must be made much smaller. The second difficulty associated with implementing a centralized task scheduling mechanism for the parallel Rete algorithm is that the functionality required of it is much more than that of a simple queue. This is because all node activations present in the task queue are not processable all of the time. To be concrete, consider the example shown in Figure 6-1. The figure shows a two-input node for which four activations al,a2,a3, and a4 are waiting to be processed. However, because of synchronization constraints given in Section 5.2.4, left activations may not be processed concurrently with fight activations. Thus, while node activations al and a2 may be processed concurrently and 46Wedo not assumea memorystructureof the formproposedfor theNYU Ultracomputer [28],whereenqueuesand dequeuescanbe doneinparallel.Becauseoftheadditionalcomplexities in theenqueueanddequeuerequiredforproduction systems, thestandard structure proposed fortheUltracomputer wouldnotwork. 86 PARALLELISM IN PRODUCFION SYSTEMS node activations a3 and a4 may be processed concurrently, node activations al and a3 may not be processed concurrently. As a result, as soon as the node activation al is assigned to an idle processor for evaluation, the activations a3 and a4 become unprocessable for the duration that al is being processed. Similar restrictions would apply if activation a3 had been picked up first for processing. This dynamically changing set of processable node activations in the task queue makes the scheduler much more complex, and consequently much more difficult to implement. CE1 al a2_ CE2 -- /a3 a4 I two-input n_e Figure 6-1: Problem of dynamically changing set ofprocessable node activations. The following sections discuss two solutions for solving the scheduling problem. The first solution involves the construction of a hardware task scheduler (HTS) and the second solution involves the construction of multiple software task schedulers (STSs). ' 6.1. The Hardware Task Scheduler 6.1.1. How Fast Need the Scheduler be? To get a rough estimate of the performance that is required of the hardware task scheduler, consider the following simple model for parallel execution of production systems. Let the total number of independently schedulable tasks generated during a recognize-act cycle be n. Let the average cost of processing one such task be c. Let the cost ofenqueueing and dequeueing a task from the scheduler be s. Note that s corresponds only to that part of the scheduling cost during which the scheduler is locked, for example, when the task queue is being modified. Let the cost when the scheduler is not locked, for example, the time taken for preparing a token to be inserted into the task queue, be t. Now the cost of performing the recognize-act cycle on a single processor is Cuni = n.c. If there are k processors, then assuming perfect division of work between the k processors, the cost of performing the match on a multiprocessor with a centralized scheduler is Cmu! = [n/k](c+ t) + n.s, and the maximum speed-up obtainable on the multiprocessor is given by S = Cuni/Cmu 1. Figure 6-2 shows the maximum speed-up that can be obtained when n=128 tasks/cycle, c= 100 THE PROBLEM OF SCItEI)ULING 87 NODE ACTIVATIONS instructions, t= 10 instructions, and for varying values of s and k. 
As the figure shows, the effect of the duration for which the scheduler is locked is very pronounced. For large values of s, the saturation speed-up is reached with a relatively small number of processors, irrespective of the inherent parallelism in the problem. In terms of the parameters described above, the saturation speed-up as k → ∞ is given by S = n·c/(c + t + n·s), and then as n → ∞, the expression for speed-up reduces to S = c/s. Thus it is extremely important to maximize c/s, but it is not easy to increase c (the only reasonable way to increase c is to increase the granularity of the tasks, which then reduces the value of n, which has other adverse effects). Hence, the value of s must be reduced as much as possible. In fact, as seen from the graph in Figure 6-2, if 20-40 fold speed-up is to be obtained, then the time for which the scheduler is locked to process an activation must not be much longer than the time taken by a processor to execute a single instruction. It is possible to construct such a scheduler in hardware, as is discussed next.

[Figure 6-2: Effect of scheduler performance on maximum speed-up. Speed-up vs. number of processors (up to 64) for n = 128, c = 100, t = 10, and s ranging from 0.5 to 16.0 instructions.]

6.1.2. The Interface to the Hardware Task Scheduler

In the proposed production-system machine (see Section 5.1), the hardware task scheduler sits on the main bus along with the processors and the shared memory in the multiprocessor. The scheduler is mapped onto the shared address-space of the multiprocessor, and the synchronization (locking of the scheduler) is achieved simply from the serialization that occurs over the bus on which the requests for enqueueing and dequeueing activations are sent to the hardware task scheduler. The scheduler is assumed to be fast enough that it can process a request in a single bus cycle.

There are three types of messages that are exchanged between the processors and the hardware task scheduler. (1) Push-Task: the processor sends a message to the scheduler when it wants to enqueue a new node activation. (2) Pop-Task: the scheduler sends a message to the processor when it wants an idle processor to evaluate a pending node activation (the scheduler keeps track of the idle processors in the system). (3) Done-Task: the processor sends a message to the scheduler when it is done with evaluating a node activation.

The Push-Task(flag-dir-nid, tokenPtr) command sent to the scheduler by the processor takes two arguments. The first argument, flag-dir-nid, encodes three pieces of information: (1) the flag associated with the activation (insert or delete token request), which takes one bit; (2) the direction (dir) of the node activation (left or right), which takes one bit; and (3) the node-id (nid) of the activation, which is allocated the remaining 30 bits of the first word. The second argument, tokenPtr, is a pointer to the token that is causing the activation (the token is stored in shared memory), which takes all 32 bits of the second word. The combined size of all this information is 64 bits, or 8 bytes. It is assumed that the bus connecting the processors to the scheduler can deliver this information without any extra synchronization. This may be easily achieved if the bus is 64 bits wide (in this case the synchronization is achieved through the bus arbiter, which permits only one processor to use the bus at a time). This can also be achieved if the bus is only 32 bits wide, but only if it is possible for a processor to obtain its use for two consecutive cycles.
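The flag-dir-nid word can be packed and unpacked with a few shifts and masks. The sketch below uses a hypothetical encoding consistent with the layout just described (1 flag bit, 1 direction bit, 30 node-id bits); the exact bit positions are our assumption.

    # Hypothetical packing of the flag-dir-nid word (bit positions assumed:
    # flag in bit 31, dir in bit 30, node-id in bits 0-29).

    def pack_flag_dir_nid(flag_insert, dir_left, node_id):
        assert 0 <= node_id < (1 << 30)
        return (int(flag_insert) << 31) | (int(dir_left) << 30) | node_id

    def unpack_flag_dir_nid(word):
        return bool(word >> 31), bool((word >> 30) & 1), word & ((1 << 30) - 1)

    w = pack_flag_dir_nid(flag_insert=True, dir_left=False, node_id=1234)
    print(unpack_flag_dir_nid(w))   # (True, False, 1234)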
As stated earlier, the scheduler is mapped onto the shared address-space of the multiprocessor. The addresses to which the hardware task scheduler is mapped, however, are different for each of the processors (the low-order 10 bits correspond to the identity of the processor). This enables the scheduler to determine the identity of the processor making a request, even though that information is not explicitly sent with the request.

There are two distinct locations in shared memory (8 bytes) associated with each processor where the scheduler puts information about the node activations to be processed by that processor. Thus, to inform an idle processor to begin processing a pending node activation, the scheduler executes a Pop-Task(flag-dir-nid, tokenPtr) command, and transfers the flag, direction, node-id, and token-pointer information to the two locations assigned to that processor. Before the command is executed, the idle processor keeps executing a loop checking for the second location to have a non-null token pointer. The processor is expected to be looping out of its cache, so that it does not cause any load on the shared bus. When the hardware scheduler writes to the two locations, the cache of the processor gets invalidated, the processor gets the new information destined for it, and it can begin processing the new node activation.

When a processor is finished with evaluating a node activation, it first sets the value of the second of the two locations assigned to it for receiving node activations to null (this is the location on which the processor will subsequently be looping, waiting for it to be set to a non-null value by the scheduler). It then executes the Done-Task(proc-id, node-id) command and transfers the node-id to the scheduler, informing the scheduler that it is finished with processing that node activation. This information is very important, because using it the scheduler can determine: (1) the set of processors that are idle, and (2) the set of node activations in the task queue that are processable (recall that all activations in the task queue are not necessarily processable).

For all of the above commands, the hardware task scheduler is locked only for the bus cycles during which data is being written into or written by the scheduler. For example, if the bus is 64 bits wide and the bus cycle is 100 ns (corresponding to a bus bandwidth of 80 MegaBytes/sec), then the total duration for which the scheduler is locked for the Push-Task, Pop-Task, and Done-Task commands for a node activation is only 300 ns. (A sketch of the processor-side protocol follows.)
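The processor side of this protocol amounts to a spin-wait on a per-processor mailbox. The following sketch models it in software; the real mechanism is a memory-mapped hardware interaction, and all names here are invented for illustration.

    # Software model of the per-processor mailbox protocol (illustrative
    # only; in the real machine these are two shared-memory words written by
    # the hardware task scheduler and spun on out of the processor's cache).

    class Mailbox:
        def __init__(self):
            self.flag_dir_nid = 0
            self.token_ptr = None      # non-null means "task pending"

    def processor_loop(mailbox, process, done_task):
        while True:
            while mailbox.token_ptr is None:   # spin on cached location
                pass
            task = (mailbox.flag_dir_nid, mailbox.token_ptr)
            process(task)                      # evaluate the node activation
            mailbox.token_ptr = None           # re-arm mailbox, then notify
            done_task(task)                    # Done-Task(proc-id, node-id)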
6.1.3. Structure of the Hardware Task Scheduler

The hardware task scheduler consists of three main components: (1) the proc-state array, (2) the task queue, and (3) the controller, as shown in Figure 6-3. Both the proc-state array and the task queue are built out of content-addressable memory. The proc-state array keeps track of the node activations being processed by each of the k processors in the multiprocessor. The task queue keeps track of all node activations that are pending processing or are currently being processed, up to a maximum of n. 47 The enable-array associated with the task queue keeps track of all pending node activations that are processable given the contents of the proc-state array (the node activations being currently processed). The controller consists of microcode to control both the proc-state array and the task queue.

47 Simulations show that the average number of activations present in the task queue is around 90. The maximum number of activations, however, can be as high as 2000.

[Figure 6-3: Structure of the hardware task scheduler. The proc-state array holds, for each of the k processors, the flag, direction, and node-id of the activation it is processing; the task queue holds up to n entries, each with a flag, direction, node-id, token pointer, and enable bit; an encoder selects the next enabled task.]

To get some insight into the internal functioning of the hardware task queue, consider the processing required by the various commands:

• Push-Task(flag-dir-nid, tokenPtr): (1) Find an empty slot in the task queue, and insert the entry (the 64 bits of information associated with the command) there. (2) If there is no entry in the proc-state array with the same node-id as that in the command, or if the xxxx-dir-nid 48 of some entry in the proc-state array matches the xxxx-dir-nid given in the command, then set the enable bit for the new entry to true; otherwise set the enable bit to false.

• Pop-Task(flag-dir-nid, tokenPtr): (1) The encoder on the extreme right of Figure 6-3 gives the next node activation to be processed. (2) Put this entry (the information corresponding to the node activation) in the appropriate slot in the proc-state array, that is, in the slot corresponding to the processor to which this node activation is being assigned. (3) For all entries in the task queue for which the node-id matches that of the assigned entry, if the direction field also matches then set the enable bit to true; otherwise, if the node-id matches but the direction field does not match, then set the enable bit to false.

• Done-Task(proc-id, nid): (1) Clear the slot in the proc-state array corresponding to the processor-id (proc-id). Furthermore, let count be the number of activations in the proc-state array that have the same node-id as the node activation that has just finished. (2) If count > 0 then do nothing; otherwise, for all entries in the task queue for which the node-id matches, set the enable bit to true.

48 The xxxx in xxxx-dir-nid indicates that we do not care about the value of the flag field.
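A software rendering of this enable-bit maintenance may help; it is a hypothetical model (the real logic is content-addressable hardware updated by microcode), but it follows the three command descriptions above. For simplicity the model removes a popped entry from the queue and tracks it only in proc_state, whereas the real task queue also retains entries that are being processed.

    # Hypothetical software model of the HTS enable-bit logic.  Entries are
    # dicts with 'nid', 'dir', 'enabled'; proc_state[p] holds the activation
    # processor p is running (or None).

    def push_task(task_queue, proc_state, entry):
        running = [a for a in proc_state if a and a['nid'] == entry['nid']]
        # Enabled if no activation of this node is running, or one is running
        # in the same direction (the flag is ignored: the xxxx-dir-nid match).
        entry['enabled'] = (not running or
                            any(a['dir'] == entry['dir'] for a in running))
        task_queue.append(entry)

    def pop_task(task_queue, proc_state, proc_id):
        entry = next((e for e in task_queue if e['enabled']), None)  # encoder
        if entry is None:
            return None                      # no processable activation
        task_queue.remove(entry)
        proc_state[proc_id] = entry
        for e in task_queue:
            if e['nid'] == entry['nid']:
                e['enabled'] = (e['dir'] == entry['dir'])
        return entry

    def done_task(task_queue, proc_state, proc_id):
        finished = proc_state[proc_id]
        proc_state[proc_id] = None
        still = sum(1 for a in proc_state if a and a['nid'] == finished['nid'])
        if still == 0:
            for e in task_queue:
                if e['nid'] == finished['nid']:
                    e['enabled'] = True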
The multiple schedulers may appear on a single bus (where it is possible for each scheduler to observe the commands being processed by the other schedulers), or they may appear on multiple buses (where it is not possible for a given scheduler to watch the commands being processed by the other schedulers). i A basic assumption made in the design of the scheduler described in the previous section is that it is possible for the scheduler to determine from its local state (the contents of the proc-state array) the subset of node activations in the task queue that are processable (using the criteria given in Section 5.2.4). This is easily possible when there is only one scheduler, since the proc-state array keeps track of all node activations being processed at any given time, and that is enough to determine which other activations are processable. However, when there are multiple schedulers, each scheduler cannot observe the activity ofaU other schedulers, and the proc-state array of a scheduler cannot keep track of all node activations being processed at any given time. There are two solutions to the above problem. The first is to have the schedulers communicate with each other or with some other centralized resource to determine which node activations are processable. This solution does not look very attractive because the overhead of communication will probably nullify any advantage gained from the extra schedulers. The second solution, which is more reasonable, is to partition the node activations between the multiple schedulers. For example, if there are two schedulers, then one scheduler could be responsible for all node activations that have an even node-id and the other scheduler for all node activations that have an odd node-id. Since activations 92 PARALLEIJSM IN PRODUCTIONSYSTEMS which have a different node-id do not interact with each other, they may always be processed concurrently (see Section 5.2.4), the activations assigned to processors by one scheduler will not affect the processability of activations in the task queue of the other scheduler. Thus the local proc-state array of each scheduler 49 contains sufficient information for deciding the processability of nodes in its task queue. There is still one problem that remains in using multiple schedulers, especially if each scheduler is to be capable of scheduling a task on any processor. In the previous section, there was a single scheduler that knew the state of all the processors--it knew which of them were idle and could automatically schedule activations on such processors. When there are multiple schedulers, none of the schedulers has knowledge of the state of all the processors. Furthermore, even if a scheduler knows that a processor is idle in the current bus cycle, say because it executed the Done-Task command in the current bus cycle, there is no way for it to know the state of that processor in the next bus cycle, since in the next bus cycle some other scheduler may have assigned a task to that processor. 50 The suggested solution to the problem is to have the processors poll the schedulers for processable tasks, instead of the schedulers assigning tasks to idle processors of their own accord. By having each scheduler set a flag in the shared memory indicating the presence of processable tasks, it is possible to make the idle processors loop out of cache (instead of causing traffic on the bus) when no processable tasks are available in the schedulers. 6.2. 
6.2. Software Task Schedulers

While it would be nice to have hardware task schedulers for the production-system machine, there are two main problems associated with them: (1) hardware task schedulers are not flexible, in that they are not easy to change as the algorithms evolve, and (2) the resources needed to build hardware task schedulers and to interface them to the rest of the system are greater than those required by software task schedulers. However, as worked out in Section 6.1.1, if a single task scheduler is not to be a bottleneck, then it must be able to schedule a task within the period of about one instruction. While it is not feasible to achieve such performance from a single software task scheduler, it is possible to use multiple software task schedulers to achieve reasonable performance. This section discusses some of the issues involved in the design of such software task schedulers.

The main reason for going to multiple software task schedulers is to avoid the serial bottleneck caused by a single task scheduler through which all activations must be scheduled. In terms of the model described in Section 6.1.1, the multiple schedulers modify the equation for maximum speed-up as follows. Recall that n is the number of tasks that need to be scheduled per cycle, c is the average cost of processing a task, s is the average serial cost of enqueueing and dequeueing a task, t is the average non-serial cost of enqueueing and dequeueing a task, and k is the number of processors in the multiprocessor. Let l be the number of schedulers (software task queues) in the system. The cost per cycle on a uniprocessor is C_uni = n·c. The cost per cycle on the multiprocessor (assuming that the load is uniformly distributed amongst the k processors and the l task queues) is given by

    C_mul = ⌈n/k⌉·(c + t) + ⌈n/l⌉·s,

and the speed-up is given by S = C_uni / C_mul. The graph of the maximum speed-up when n = 128, c = 100, t = 10, and for varying values of k, l, and s, is shown in Figure 6-4. [Footnote 51: Although the curves in the graph are shown as continuous, the actual plot of the equation for S would have discontinuities in it because of the ceiling function used in C_mul. The curves in the graph are an approximation to the actual curves for the equations.] It shows that even if the serial cost of enqueueing and dequeueing a task is only 16 instructions, the maximum speed-up obtainable with 64 processors and 32 schedulers is only 45-fold: almost a quarter of the processing power is wasted while waiting for the schedulers.

[Figure 6-4: Effect of multiple schedulers on speed-up. Speed-up vs. number of processors for n = 128, c = 100, t = 10, with curves for s = 8.0, 16.0, and 32.0 and l = 4, 16, and 32.]
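As a check on these numbers, the model can be evaluated directly. The sketch below just encodes the formula above; the 45-fold figure for s = 16, k = 64, l = 32 comes out of it.

    /* Numerical check of the speed-up model (a sketch; ceilings as in the text). */
    #include <stdio.h>
    #include <math.h>

    double speedup(double n, double c, double t, double s, double k, double l) {
        double c_uni = n * c;
        double c_mul = ceil(n / k) * (c + t) + ceil(n / l) * s;
        return c_uni / c_mul;
    }

    int main(void) {
        /* The case cited in the text: s = 16, k = 64, l = 32. */
        printf("S = %.1f\n", speedup(128, 100, 10, 16, 64, 32));  /* prints S = 45.1 */
        return 0;
    }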
A software task scheduler may be either passive or active. A passive scheduler (better described as a task queue) corresponds to an abstract data structure where node activations may be stored or retrieved using predefined operations like push-task and pop-task. An active scheduler, on the other hand, corresponds to an independent process to which messages for pushing and popping tasks may be sent. Once a processor has issued such a request, it may proceed with what it was doing earlier; the requesting processor does not have to wait while its request is being processed.

In this thesis only passive software schedulers (software task queues) are studied. The main reason is that there are a number of overheads associated with active schedulers which are not present in passive schedulers. For example, in an active scheduler, scheduling a task involves sending a message to the active scheduler and then the processing of this request by the scheduler. It is quite possible that the cost of sending the message is more than the cost of scheduling on the passive scheduler. Similarly, it is quite possible that when a message is sent to an active scheduler, the scheduler process is not running and has to be swapped in before the message can be processed; this could cause a significant delay in the message getting processed. The main advantage of active schedulers occurs when the cost of scheduling a task is significantly larger than the cost of sending the message to the scheduler. In that case the task sending the message can continue without waiting for the processing required by the scheduler to complete.

Figure 6-5 shows an overview of the structure proposed for using multiple software task queues. To schedule tasks there exist several task queues, all of which can be accessed by each of the processors. There is a lock associated with each task queue, and this lock must be obtained by a processor before it can put tasks into, or extract tasks from, the task queue. To obtain a task, an idle processor first checks if the task queue is empty. If it is empty, the processor goes on to check the next task queue to see if it has any processable tasks. [Footnote 52: Note that the match process checks whether a task queue is empty or not without obtaining the lock. Of course, this information can be inaccurate since it is checked without obtaining the lock, but since it is only used as a hint it does not matter. If one match process misses a task that was just being enqueued when it checked the task queue, another match process at some later time will find it.] If it is not empty, the match process obtains the lock for the task queue, extracts a processable task (if any is present) from the task queue, and then releases the lock. By having a dynamic cache-coherence strategy [76], it is possible to arrange the locks so that idle processors looking for a task (when none are available) spin on their cache without causing large traffic on the shared bus.

[Figure 6-5: Multiple software task queues. N processors access several software task queues (STQs) and a shared node table.]

To determine whether a node activation in the task queue is processable, the match processes access the node table. For each node in the Rete network, the node table keeps track of information about those of its activations that are currently being processed. This information is sufficient to determine whether a node is processable or not. There is a separate lock associated with each entry of the node table, and this lock must be obtained before the related information is modified or checked.
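A sketch of the pop path under this locking discipline is given below. The structure layouts, the same-direction processability test, and the helper names (try_lock, remove_at, and so on) are all assumptions made for illustration; the unlocked emptiness check is the "hint" described in the footnote above.

    #define NQUEUES 8      /* number of software task queues (assumed) */
    #define NNODES  1024   /* entries in the node table (assumed)      */
    #define MAXT    256

    typedef struct { int node_id, dir; long token; } Task;
    typedef struct { volatile int lock; int count; Task tasks[MAXT]; } TaskQueue;
    typedef struct { volatile int lock; int active; int active_dir; } NodeEntry;

    extern TaskQueue tq[NQUEUES];
    extern NodeEntry node_table[NNODES];
    extern int  try_lock(volatile int *l);             /* assumed lock primitives */
    extern void lock(volatile int *l), unlock(volatile int *l);
    extern void remove_at(TaskQueue *q, int i);

    /* Try to extract one processable task from queue q; returns 0 on failure. */
    int try_pop(int q, Task *out) {
        if (tq[q].count == 0) return 0;          /* unlocked peek: only a hint  */
        if (!try_lock(&tq[q].lock)) return 0;    /* busy queue: caller moves on */
        for (int i = 0; i < tq[q].count; i++) {
            Task *t = &tq[q].tasks[i];
            NodeEntry *ne = &node_table[t->node_id];
            lock(&ne->lock);
            /* Processable if no activation of this node is in progress, or the
               in-progress activations have the same direction (cf. Sec. 5.2.4). */
            int ok = (ne->active == 0) || (ne->active_dir == t->dir);
            if (ok) { ne->active++; ne->active_dir = t->dir; }
            unlock(&ne->lock);
            if (ok) {
                *out = *t;
                remove_at(&tq[q], i);
                unlock(&tq[q].lock);
                return 1;
            }
        }
        unlock(&tq[q].lock);
        return 0;
    }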
Since there is a separate lock associated with each entry in the node table, multiple schedulers looking up the processability of different node activations do not clash with each other. The processing required when a new task is to be put in a task queue, or when the match process finishes processing a node activation, is also quite simple. To push a new node activation into a task queue, the match process selects a random task queue from the several that are available. If the lock associated with that task queue is busy, the match process simply goes on and tries another task queue. Otherwise it obtains the lock and enters the new node activation into the task queue. When a match process finishes processing a node activation, it modifies the corresponding entry in the node table to indicate that there is one less activation of that node being processed.

Simulations done for the above scheme using multiple software task queues show that approximately a factor of 2 is lost in performance compared to when a hardware task scheduler is used (detailed results are presented in Chapter 8). We are currently experimenting with variations of the above scheme. For example, one possible variation is that instead of putting both processable and non-processable node activations in the task queue, one may put only processable node activations in the task queue. The non-processable node activations would be attached to the associated slots in the node table, and whenever the last processable activation of a node finishes, the additional processable entries would be put into the task queues. Such a scheme would reduce the cost of extracting a node activation from a task queue, since no checks would have to be made about the processability of the node activation. However, it would also increase the time to enqueue a node activation, since it would now be necessary to first determine whether the given activation is processable or not. Another variation is to use some kind of priority scheme for ordering pending node activations; for example, node activations that can potentially result in long chains of activations should be processed before activations that cannot. We are also experimenting with the use of multiple active software task schedulers.

Chapter Seven
The Simulator

Most of the earlier studies exploring parallelism in production systems were done using simulators with very simple cost models [22, 32, 66], or were done using average-case data as presented in Chapter 3 [30, 37, 60]. These simulators did not take many of the overheads into account, and often variations in the cost of processing productions were not taken into account. This chapter presents details about a second-generation simulator that has been constructed to evaluate the potential of using parallelism to speed up the execution of production systems. The aims of the simulator are: (1) to study the speed-up obtainable from the various sources of parallelism and to determine the associated overheads, (2) to determine the bottlenecks that reduce the speed-up obtainable from parallelism and the effects of eliminating these bottlenecks, (3) to study the advantages and disadvantages of using the hardware and software task schedulers, and (4) to study the effects of different data structures and algorithms on the amount of speed-up that is obtainable.
The reasons for using a simulator instead of an actual implementation on a multiprocessor are the following:

• An implementation on a multiprocessor which incorporates all the planned optimizations would have taken a very long time to do. An implementation which does not include the optimizations leads to significantly different results from one that includes them, and for that reason it is not very useful.

• An implementation corresponds to the case where most of the design decisions are already frozen. There is not as much scope for trying out various alternatives, which is what our aim is at this moment.

• A multiprocessor consisting of 32-64 processors, that has a smart cache-coherence strategy, and that supports the other features mentioned in Section 5.1 is not currently available to us. Although such a multiprocessor may be available in the near future, until then we have to rely on a simulator to obtain results.

To test some of our ideas about using parallelism, we are currently implementing OPS5 on the VAX-11/784, a four-processor multiprocessor from Digital Equipment Corporation. However, the implementation is still in its early stages and the results are not available at this time.

7.1. Structure of the Simulator

The simulator that we have constructed is an event-driven simulator. The inputs to the simulator consist of: (1) a detailed trace of node activations in the Rete network corresponding to an actual production-system run; (2) a specification of the parallel computational model on which the production system is to be run; and (3) a cost model that can be used to determine the cost of any given node activation. The output of the simulator consists of various statistics for the overall run and for the individual cycles of the run.

7.1.1. Inputs to the Simulator

7.1.1.1. The Input Trace

Figure 7-1 shows a small fragment of a trace that is fed to the simulator. The trace is obtained by actually running a production system and recording the activations of nodes in the Rete network in a file. The trace contains information about the dependencies between the node activations. The simulator, depending on the granularity of parallelism being used, can lump several activations into one task, and knows which activations can and which cannot be processed in parallel. Information about nodes that remains fixed over the complete run, for example, the tests associated with a node and the type of a node, is presented to the simulator in a static table, as shown in Figure 7-2. The combined information available to the simulator from the input trace and the static table is sufficient to provide fairly accurate estimates of the cost of processing a given node activation.

7.1.1.2. The Computational Model

The computational model specifies the hardware and software structure of the parallel processor on which the production-system traces are to be evaluated. The computational model specifies:

• The sources of parallelism (production, node, intra-node, action parallelism, etc.) that are to be used in executing the production-system trace. For example, when only production parallelism is to be used, the simulator lumps all node activations belonging to the same production together and processes them as one task. Also, activations of nodes that are shared between several productions are replicated (once for each production), since nodes cannot be shared between different productions when using production parallelism.
• Whether a hardware task scheduler or software task queues are to be used, and, in case several software task queues are to be used, the number of such task queues.

• The hardware organization of the parallel processor, for example, the number of processors in the parallel machine, the speed of the individual processors, etc.

• Whether the effects of memory contention are to be taken into account. If they are to be taken into account, then it is possible to specify the expected cache-hit ratio, the bus bandwidth, etc. Details on how the effects of memory contention are handled are given in Section 7.1.1.4.

• The cost model to be used in evaluating the cost of the individual node activations. Different cost models permit experimentation with different algorithms, data structures, and processor architectures. However, different cost models cannot account for major changes in the algorithms or data structures.

    (pfire uwm-no-operator-retry)
    (wme-change 914)
    ((prev 914) (cur 13630) (type bus) (lev 1))
    ((prev 13630) (cur 13631) (type teqa) (lev 2))
    ((prev 13631) (cur 1022389) (node-id 541) (side right) (flag insert) (num-left 4) (num-right 20))
    ((prev 13631) (cur 13632) (type teqa) (lev 3))
    ((prev 13632) (cur 13633) (type tnea) (lev 4))
    ((prev 13633) (cur 13634) (type tnea) (lev 5))
    ((prev 13632) (cur 13635) (type teqa) (lev 4))
    ((prev 13635) (cur 1022390) (node-id 576) (side right) (flag insert) (num-left 1) (num-right 4))
    ((prev 1022390) (cur 1022391) (node-id 577) (flag insert))
    ((prev 13635) (cur 1022392) (node-id 201) (side right) (flag insert) (num-left 4) (num-right 4))
    ((prev 1022392) (cur 1022393) (node-id 202) (side left) (flag delete) (num-left 3) (num-right 36))
    (pfire eight-copy-unchanged)
    (wme-change 915)
    ((prev 915) (cur 13636) (type bus) (lev 1))
    ((prev 13636) (cur 13637) (type teqa) (lev 2))
    ((prev 13637) (cur 13638) (type teqa) (lev 3))
    ((prev 13638) (cur 1022394) (node-id 207) (side right) (flag insert) (num-left 1) (num-right 21))
    ((prev 13638) (cur 1022395) (node-id 197) (side right) (flag insert) (num-left 1) (num-right 21))
    ((prev 13638) (cur 1022396) (node-id 193) (side right) (flag insert) (num-left 1) (num-right 21))

    ;;; pfire: The name of the production that fired at this point in the trace.
    ;;; wme-change: The number of changes made to working memory so far.
    ;;; prev: The activation-number of the predecessor of this node activation.
    ;;; cur: The unique activation-number associated with a node activation.
    ;;; type: The type of a constant-test node activation.
    ;;; lev: The distance between the root node and the associated constant-test node.
    ;;; node-id: The unique id associated with each node in the Rete network.
    ;;; side: Whether a left-activation or a right-activation (only for and-nodes and not-nodes).
    ;;; flag: Whether the token is being inserted or deleted.
    ;;; num-left/num-right: The number of tokens in the left-memory/right-memory node.

Figure 7-1: A sample trace fragment.
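For concreteness, a parsed record of the two-input-node form shown in Figure 7-1 might be represented as follows; the type and field names are hypothetical, chosen to mirror the attributes in the trace legend.

    /* One way to represent a parsed trace record (hypothetical names). */
    typedef enum { INSERT, DELETE } Flag;
    typedef enum { LEFT, RIGHT } Side;

    typedef struct {
        long prev;        /* activation-number of the predecessor activation    */
        long cur;         /* unique activation-number of this activation        */
        int  node_id;     /* Rete node id (two-input and terminal nodes)        */
        Side side;        /* left- or right-activation (two-input nodes only)   */
        Flag flag;        /* token inserted or deleted                          */
        int  num_left;    /* tokens in the left memory at this point            */
        int  num_right;   /* tokens in the right memory at this point           */
    } TraceRecord;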
    ((node-id 193) (type and) (prods (p51)) (lev 10) (lces 6) (rces 1) (tests (teqb 10001 1)))
    ((node-id 197) (type and) (prods (p4)) (lev 6) (lces 10) (rces 1) (tests (teqb 30001 1)))
    ((node-id 201) (type not) (prods (p8)) (lev 2) (lces 14) (rces 1) (tests (teqb 3 1 teqb 10001 5 teqb 1 6 teqb 30001 7)))
    ((node-id 202) (type and) (prods (p4)) (lev 1) (lces 14) (rces 1) (tests (teqb 10003 1 teqb 3 2 teqb 10001 3)))
    ((node-id 207) (type and) (prods (p9)) (lev 6) (lces 4) (rces 1) (tests (teqb 3 1)))
    ((node-id 541) (type not) (prods (p45)) (lev 1) (lces 2) (rces 1) (tests (teqb 3 1)))
    ((node-id 576) (type and) (prods (p51)) (lev 1) (lces 2) (rces 1) (tests (teqb 1 5 teqb 10003 6 teqb 3 7)))
    ((node-id 577) (type term) (prods (p51)) (lev 3) (lces -) (rces -))

    ;;; node-id: The unique id associated with each node in the Rete network.
    ;;; type: Type of the node (and-node, not-node, or term-node).
    ;;; prods: The production or productions (in case the node is shared) to which the node belongs.
    ;;; lev: The number of intervening nodes to get to the terminal node.
    ;;; lces: The size of the tokens (number of wme pointers needed) in the left-memory node.
    ;;; rces: The size of the tokens (number of wme pointers needed) in the right-memory node.
    ;;; tests: The tests associated with the two-input nodes to ensure consistent variable bindings.

Figure 7-2: Static node information.

7.1.1.3. The Cost Model

The simulator relies on an accurate cost model for determining the time required to (1) process node activations (given the information in the input trace), (2) push node activations (tasks) into the task queue, (3) pop tasks from the task queue, etc. The cost model reflects the effects of:

• The algorithms and data structures used to process the node activations.

• The code used to push/pop node activations from the task schedulers.

• The instruction-set architecture of the individual processing elements (this determines the number of machine instructions required to implement the proposed algorithms).

• The time taken by different instructions in the architecture; for example, synchronization instructions may take much longer than register-register instructions.

• The structure of the multiprocessor, the presence of memory contention, etc.

The basic cost models used in the simulations have been obtained by writing parametrized assembly code for the various primitive operations required in processing node activations. For example, code sequences are written for computing the hash value corresponding to a token, for deleting a token from a memory node, and so on. These code sequences can then be combined to obtain the code for processing a complete node activation. For example, the code to process the left activation of an and-node is shown in Figure 7-3 (details of data structures and code are presented in Appendix B).

            HASH-TOKEN-LEFT              ! compute hash value for left token
            cbr_eq Delete,R-flg,L_del    ! if (flag = Delete) goto L_del
            MAKE-TOKEN                   ! Allocate storage and make token
            INSERT-LTOKEN                ! Insert token in left hash table
            br L11                       ! Goto L11
    L_del:  DELETE-LTOKEN                ! Delete token from left hash table
    L11:    ldal (R-rtokHT)R-hIndex,r5   ! Get address of relevant bucket in
                                         !   right hash table (opposite memory)
            cmp NULL,(r5)tList           ! Check if the hash bucket is empty
            br_eq L_exit                 ! If so, then goto L_exit
            LOCK-RTOKEN-HT               ! Obtain lock for the hash bucket
            ldl (r5)tList,R-state        ! Load pointer to first token in opp.
                                         !   memory into register R-state
    L_loop: NEXT-MATCHING-RTOKEN         ! Find first matching token in opp mem
            SCHEDULE-SUCCESSORS          ! Schedule activations of successor nodes
            br L_loop                    ! Goto L_loop
    L12:    RELEASE-RTOKEN-HT            ! Free lock for the hash bucket
    L_exit: RELEASE-NODE-LOCK            ! Inform scheduler that processing for
                                         !   this node activation is finished

Figure 7-3: Code for left activation of an and-node.
In the code in Figure 7-3, the operations listed in capitals refer to macros that would be expanded at a later point in time. The register allocation for the code for processing node activations is done manually, since the total amount of code is small and it is used very often. [Footnote 53: The cost models used in the simulator assume that no overheads are caused by the operating system of the production-system machine. This is a reasonable assumption to make because all synchronization code and scheduling code is a part of the production-system implementation code. The production-system machine is also expected to have enough main memory so that operating-system intervention to handle page faults is minimal.]

Given the code for the primitive operations involved in executing a node activation, the next step is to calculate the time taken to execute these pieces of code. To achieve this, the instructions in a code sequence are divided into groups, such that the instructions in the same group take the same amount of time. Thus register-register instructions are grouped together, branch instructions are grouped together, memory-register instructions are grouped together, and synchronization instructions are grouped together. [Footnote 54: Note that the instruction set of the processors (described in Appendix A) has been chosen so that the number of such groups is very small. For example, almost all instructions make either 0 or 1 memory references. There are very few instructions that make 2 memory references, and they are treated specially when computing the cost models for the simulator. Table 7-1 shows the instruction categories and their relative costs as used in the simulator. The relative costs can be changed in the simulator just by changing some constant definitions.] Once the instructions have been divided into groups, the cost of executing an operation can simply be found by adding the time of executing the instructions in each group. The cost of a node activation can then be computed by adding the cost of the primitive operations involved. Note that processing different node activations requires different combinations of the primitive operations. The set of primitive operations required is determined by the parameters associated with the node activation in the input trace, for example, the number of tokens in the opposite memory, the set of tests associated with the node, and so on.

Table 7-1: Relative Costs of Various Instruction Types

    Instruction Type                    Relative Cost
    register-register                        1.0
    memory-register                          1.0
    memory-memory-register                   2.0
    synchronization/interlocked              3.0
    compare-and-branch                       1.5
    branch                                   1.5
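The grouping scheme amounts to a weighted instruction count. A sketch using the Table 7-1 weights follows; the struct and function names are illustrative.

    /* Sketch: the cost of a primitive operation from its instruction groups,
       using the relative weights of Table 7-1. */
    typedef struct {
        int reg_reg, mem_reg, mem_mem_reg, sync, cmp_br, br;  /* instruction counts */
    } OpCounts;

    double op_cost(OpCounts c) {
        return 1.0 * c.reg_reg + 1.0 * c.mem_reg + 2.0 * c.mem_mem_reg
             + 3.0 * c.sync    + 1.5 * c.cmp_br  + 1.5 * c.br;
    }
    /* The cost of a node activation is then the sum of op_cost() over the
       primitive operations selected by the activation's trace parameters. */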
7.1.1.4. The Memory Contention Model

Modeling the overhead due to memory contention accurately is an extremely difficult problem [5, 6, 38, 52]. Since actual measurements are not possible until the multiprocessor system and the algorithms have actually been designed and implemented, one has to rely on analytical and simulation techniques. The analytical models are more difficult to devise and usually require more assumptions than the simulation models; however, once they have been built, they permit the exploration of the design space very quickly. The simulation models are easier to come up with, but getting each data point can require a lot of computing time.

To model the effects of memory contention on the parallel execution of production systems, our simulator uses a mixture of analytical and simulation techniques. As the first step, the simulator generates a table which gives the degradation in processing power due to memory contention as a function of:

• The number of active processors in the multiprocessor. At any given time many processors may be idle; it is assumed that these processors will be looping out of cache and will not cause any load on the memory system.

• The number of memory modules. This determines the contention when requests are sent to the same memory module.

• The characteristics of the bus (bandwidth, synchronous/asynchronous, etc.) interconnecting the processors to the memory modules. This information is used to determine the contention for the bus.

• The cache-hit ratio. It is assumed that the processors communicate with the external memory system through a cache which captures most of the requests.

Once such a table has been generated, the memory contention overhead is included in the simulations as follows. The simulator keeps track of the number of active processors while processing the trace. Whenever the cost for processing a node activation is required, instead of using the basic cost for the node activation (the cost computed by the method described in Section 7.1.1.3), the basic cost is multiplied by the appropriate contention factor (1/processor-efficiency) from the table and then used.

The tables for the contention factor used in the simulator are computed using an analytical model for memory contention proposed in [38]. The proposed model deals with multiple-bus multiprocessor systems and makes the following assumptions:

• When a processor accesses the shared memory, a connection is immediately established between the processor and the referenced memory module, provided the referenced memory module is not being accessed by another processor and a bus is available for the connection.

• A processor cannot make another request if its present request has not been granted.

• The duration between the completion of a request and the generation of the next request to the shared memory is an independent exponentially distributed random variable with the same mean value 1/λ for all the processors.

• The duration of an access by a processor to the common memory is an independent exponentially distributed variable with the same mean value 1/μ for all the memory modules.

• The probability of a request from a processor to a memory module is independent of the module and is equal to 1/m, where m is the number of shared-memory modules.

A problem with the above model is that it deals with multiple-bus multiprocessors with non-time-multiplexed buses, while we are interested in evaluating a multiprocessor with a single time-multiplexed bus (such a bus makes it simple to implement cache-coherence strategies). [Footnote 55: We could not find an analytical model dealing with contention for time-multiplexed buses in the literature.] However, we have found a way of approximating a single time-multiplexed bus with multiple non-time-multiplexed buses. We assume that the number of non-time-multiplexed buses is equal to the degree of time multiplexing of the single time-multiplexed bus. Thus if the time-multiplexed bus delivers a word of data 1 μs after the request (has a latency of 1 μs), but can deliver a word of data every 100ns (a throughput of 10 requests per latency period), then it is assumed that there are 10 non-time-multiplexed buses, each with a latency of 1 μs.
It can be shown that using this approximation, the inaccuracy in the results can be at most a factor of 2. In other words, if the approximate model predicts that the processing power lost due to contention is 10%, then in the accurate model with a single time-multiplexed bus, not more than 20% of the processing power would be lost due to contention. The situation where the results are off by a factor of 2 is quite pathological, occurring when all requests come to the time-multiplexed bus at exactly the same time. [Footnote 56: Let the latency of the time-multiplexed bus be l and the degree of time-multiplexing be k. Then in the approximate model we use k non-time-multiplexed buses, each with latency l. The worst case occurs when k requests are presented to the buses at the same time. On the time-multiplexed bus the last request is satisfied by time (k·l/k) + l, which simplifies to 2l. In the case of k non-time-multiplexed buses, the last request is satisfied by time l, thus resulting in a factor of 2 loss in performance. Note that if k+1 requests had arrived at the same time, then the time for the time-multiplexed bus would have been 2l + l/k, the time for the non-time-multiplexed buses would have been 2l, and the loss factor would have been only 1 + 1/(2k), which is less than 2. In case k-1 requests had arrived at the same time, then the time for the time-multiplexed bus would have been 2l - (l/k), the time for the non-time-multiplexed buses would have been l, and the loss factor would have been 2 - (1/k), which is less than 2.] In actual operation it is hoped that the error due to the approximation would be significantly less.

Figure 7-4 shows the degradation in performance due to memory contention as predicted by the analytical model given in [38]. To compute the curves, it is assumed that λ = 3.0 MACS (million accesses per second), μ = 1.0 MACS, the number of memory modules m = 32, and the number of non-time-multiplexed buses b = 1000ns/100ns = 10. The curves show the processor efficiency as a function of the number of active processors k and the cache-hit ratio c. It can be observed that the degradation in processor efficiency is significant if the cache-hit ratio is not high.

[Figure 7-4: Degradation in performance due to memory contention. Processor efficiency vs. number of active processors, for λ = 3.0, μ = 1.0, m = 32, b = 10, and cache-hit ratios ranging from c = 0.60 to c = 0.90.]

7.1.2. Outputs of the Simulator

The statistics output by the simulator consist of both per-cycle information and overall-run information. The statistics that are output for each match-execute cycle of the production system are:

• S_max-i = (n_i · t_avg-i) / t_max-i, where S_max-i is the maximum speed-up that can be achieved in the i-th cycle assuming no limit on the number of processors used, n_i is the number of tasks [Footnote 57: A task here corresponds to an independently schedulable piece of work that can be executed in parallel. Thus when using production-level parallelism, a task corresponds to all node activations belonging to a production. When using node-level parallelism a task becomes more complex, corresponding approximately to a sequence of dependent node activations, i.e., a set of node activations no two of which could have been processed in parallel.] in the i-th cycle, t_avg-i is the average cost of tasks in the i-th cycle, and t_max-i is the maximum cost of any task in the i-th cycle as determined by the simulator. Note that n_i · t_avg-i represents the cost of executing the i-th cycle on a uniprocessor.

• S_nom-i = (n_i · t_avg-i) / t_cyc-i, where S_nom-i is the nominal speed-up (or concurrency) achieved in the i-th cycle using the number of processors specified in the computational model, and t_cyc-i is the cost of the i-th cycle as computed by the simulator. Note that it follows from the definitions of t_max-i and t_cyc-i that t_cyc-i ≥ t_max-i.

• PU_i = S_nom-i / k, where PU_i is the processor utilization in the i-th cycle and k is the number of processors specified in the computational model.
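These three per-cycle statistics can be restated directly as code; the sketch below is only a restatement of the definitions above, with illustrative names.

    /* Per-cycle statistics (sketch). */
    typedef struct {
        long   n;      /* number of tasks in the cycle                  */
        double t_avg;  /* average task cost                             */
        double t_max;  /* cost of the most expensive task               */
        double t_cyc;  /* simulated cost of the cycle, t_cyc >= t_max   */
    } CycleData;

    double s_max_i(CycleData d)     { return d.n * d.t_avg / d.t_max; }  /* unbounded processors */
    double s_nom_i(CycleData d)     { return d.n * d.t_avg / d.t_cyc; }  /* k processors         */
    double pu_i(CycleData d, int k) { return s_nom_i(d) / k; }           /* utilization          */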
The same set of statistics can also be computed at the level of the complete run. The overall statistics are:

• S_max = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{i=1..N} t_max-i ], where S_max is the maximum speed-up that can be achieved over the complete program run assuming no limit on the number of processors used.

• S_nom = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{i=1..N} t_cyc-i ], where S_nom is the nominal speed-up over the complete run using the number of processors specified in the computational model.

• PU = S_nom / k, where PU is the processor utilization over the complete run and k is the number of processors specified in the computational model.

The results presented in Chapter 8 mainly refer to the overall statistics. The following equations show the relationship between the overall statistics and the per-cycle statistics:

    S_max = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{j=1..N} t_max-j ]
          = Σ_{i=1..N} [ (t_max-i / Σ_{j=1..N} t_max-j) · S_max-i ]

    S_nom = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{j=1..N} t_cyc-j ]
          = Σ_{i=1..N} [ (t_cyc-i / Σ_{j=1..N} t_cyc-j) · S_nom-i ]

The above equations state that the overall speed-up is not a simple average of the per-cycle speed-ups but a weighted average of the per-cycle speed-ups. The weight for the i-th cycle is t_max-i / Σ_j t_max-j in the first equation and t_cyc-i / Σ_j t_cyc-j in the second equation. Thus each per-cycle statistic is weighted by its fraction of the total cost in the parallel implementation (not the total cost in the uniprocessor implementation). As a result, a few long cycles with low speed-ups can destroy the overall speed-up for a run.

In addition to statistics about the obtainable speed-up, the simulator also outputs statistics about individual node activations. For example, for each type of node activation (constant-test nodes, and-nodes, not-nodes, terminal nodes) it outputs (1) the number of node activations of that type per cycle, (2) the average cost of such node activations, and (3) the variance in the cost of evaluating such activations. This information can then be used to modify the algorithms and data structures so as to reduce the cost and variance of evaluating node activations.
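The overall statistics follow the same pattern. The sketch below (CycleData is as in the previous sketch) also makes the weighting explicit: the quotient is algebraically the weighted average described above.

    /* Overall maximum speed-up over an N-cycle run (sketch). */
    double s_max_overall(const CycleData d[], int N) {
        double work = 0, chain = 0;
        for (int i = 0; i < N; i++) {
            work  += d[i].n * d[i].t_avg;   /* total uniprocessor cost    */
            chain += d[i].t_max;            /* sum of per-cycle max costs */
        }
        return work / chain;  /* = sum_i (t_max-i / sum_j t_max-j) * s_max_i */
    }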
7.2. Limitations of the Simulation Model

The simulator, although obviously not as accurate as an actual implementation, provides considerable flexibility in evaluating different algorithms and architectures for the parallel implementation of production systems. The main inaccuracies in the simulator come from the following:

• The simulator models the execution of production systems at an abstract level, in the sense that each instruction in the code is not actually executed; only the combined cost is taken into account. Thus low-level effects, such as locality of memory references and bursts of cache misses when processing for a new node activation is begun, are not taken into account.

• The simulator ignores contention for certain kinds of resources. For example, contention for the locks associated with the buckets of the token hash table is ignored. The reason for making such approximations is to keep the simulator simple and fast, especially since the traces are quite long (each trace is about 5-10 MBytes long). It was also felt that the errors due to such approximations would be small. For example, if in an actual implementation it is observed that many clashes are occurring, it is easily possible to increase the size of the hash table.

• The cost model makes certain simplifying assumptions as well. For example, if a token is to be searched for in a list, then the cost model assumes that half of that list will have to be searched before the token is found. Such an assumption masks the effects of the variations that would occur in an actual implementation. One way to take such variations into account would be to build a new version of the Rete interpreter with the same processing model as that of the proposed parallel implementation, and extract traces from that interpreter. Due to various time constraints this was not done.

• As discussed in the previous section on memory contention, a number of approximations have also been made in the memory contention model.

As mentioned in the above points, a number of approximations have been made in building the simulation model. Some of the approximations were made for reasons of efficiency of the simulator, some were made because it would have been too time-consuming to fix them, and some were made because we did not know any reasonable way of handling them. On the whole, however, we believe that most of the important sources of cost in processing node activations, and the variations therein, are taken into account by the simulator.

7.3. Validity of the Simulator

In any simulation-based study it is essential to somehow establish the validity of the simulator, that is, to establish that the simulator gives results according to the prescribed model and assumptions. One way to validate a simulator is to compare its results to an actual implementation for some limited set of sample cases, and then make only high-level checks for the other cases. This method is not possible in this thesis, since there is no actual parallel implementation corresponding to the model used in the simulator. However, for the following reasons we believe that the results of the simulator are correct:

• Before implementing the simulator, we implemented a parallel version of the Rete interpreter on the VAX-11/784, a four-processor multiprocessor. The interpreter was not optimized for performance, but it included all the synchronization code necessary to be able to process node activations in parallel. The assembly code written for the parallel Rete algorithm in the thesis (from which the cost models used in the simulator are derived) uses data structures and algorithms similar to those used in the working VAX-11/784 implementation. Thus there is good reason to believe that the algorithms and the code from which the cost models for the simulator are derived are correct (see Appendix B for the code).
• The performance predicted by the simulator for production-system execution on a uniprocessor is close to the performance predicted independently for an optimized uniprocessor implementation of OPS5. [Footnote 58: Independent estimate for OPS5 provided by Charles Forgy of Carnegie-Mellon University.] Both the simulator and the independent study predict a performance of 400-800 wme-changes/sec on a 1 MIPS uniprocessor. This again indicates that there are no major costs that are being ignored by the simulator.

• Finally, the simulation results have been hand-verified for small runs, and local checks have been made on portions of large runs to check the correct operation of the simulator.

For the reasons given above, we believe that the simulator is giving correct results (of course, keeping in mind the limitations mentioned in the previous section). Based on these simulation results, an optimized parallel implementation is currently being done on the VAX-11/784. Once this implementation is running, it will be possible to validate the simulator more completely.

Chapter Eight
Simulation Results and Analysis

This chapter presents simulation results for the parallel execution of production systems. The chapter is organized as follows. The traces used in the simulations are presented first. Next, results are presented for the execution of production systems on uniprocessor systems; this section also discusses the overheads due to loss of sharing of nodes, synchronization, and scheduling in parallel implementations. Subsequently, Sections 8.3, 8.4, and 8.5 discuss the speed-up obtainable from production parallelism, node parallelism, and intra-node parallelism when a hardware task scheduler is used. Section 8.6 discusses the effect of using binary Rete networks (instead of linear Rete networks) on the amount of speed-up obtainable. Section 8.7 discusses the execution speeds when multiple software task queues are used instead of hardware task schedulers. Finally, Section 8.8 discusses the effects of memory contention on the execution speed of production systems. The results are summarized in Section 8.9.

8.1. Traces Used in the Simulations

The simulation results presented in this chapter correspond to traces obtained from six production-system programs. The following are the six production systems and the traces associated with them (the names given to the traces below are used consistently through the rest of the chapter):

• VT: An expert system that selects components for a traction elevator system [51]. The associated traces were obtained from a run involving 1767 changes to working memory, and they are named as follows:

  o vto.lin: This trace corresponds to the case when the interpreter constructs linear networks (see Section 5.2.6) for the productions.

  o vto.bin: This trace corresponds to the case when the interpreter constructs binary networks for the productions.

  o vt.lin and vt.bin: These traces correspond to the linear and binary network versions
• ILOG: An expert system to maintain inventories and production schedules for factories [58].60 The associated traces were obtained from a run involving 2191 changes to working memory. The traces are: o ilog.lin and ilog.bin: These traces correspond to the linear network and binary network versions of the ILOG system. • MUD: An expert system to analyze the mud-lubricant used in drilling operations [39]. The associated traces were obtained from a run involving 2074 changes to working memory. The traces are: o mudo.Iin and mudo.bin: These traces correspond to the linear and binary network versions of the MUD system. o mud.lin and mud.bin: These traces correspond to the finear and binary network versions of the MUD system from which two productions have been removed. 61 The reasons for removing the two productions are the same as that for VT. • DAA: An expert system that designs computer systems from a high-level specification of the system [43]. The associated traces were obtained from a run involving 3200 workingmemory changes. The traces are: o daa,lin and daa.bin: These traces correspond to the linear and binary network versions of the DAA system. • R1-SOAR: A Soar expert system that configures the UNIBUS for Digital Equipment Corporation's VAX-11 computer systems [74]. The associated traces are: o rls.lin and rls.bin: These traces correspond to the sequential version of the Rl-Soar program, that is, a version in which the problem-space is searched sequentially. The traces were obtained from a run involving 2220 changes to working memory. o rlp.lin and rlp.bin: These traces correspond to the parallel version of the R1-Soar program, that is, a version in which some of the problem-space search is done in parallel. The traces were obtained from a run involving 3231 changes to working memory. 59The names of the productions that have been removed are (1) detail::create-item-from-tnput, (2) demil.':insert-value-frora-input, (3) detail::ladder-duet-l-car'so, (4) build-why-answer-for-itern-without-dependency, (5) build-why-answer-for-attribute, (6)save-cleanup::remove-wmes, and (7)save.current-job::remove. 60ILOGisreferredto asPTRANSin thecitedwork. 61Thenamesoftheproductionsthathavebeenremovedare(1)cleanup:: and(2)undo-analysi._':l. SIMUI:,TION RI_ULTSAND ANALYSIS I1l • EP-SOAR: A Soar expert system that solves the Eight Puzzle [47]. The associated traces are: o eps.lin and eps.bin: These traces correspond to the sequential version of the EP-Soar program, qhe traces were obtained from a run involving 924 changes to working memory. o epp.lin and epp.bin: These traces correspond to the parallel version of the EP-Soar program. The traces were obtained from a run involving 948 changes to working memory. 8.2. Simulation Results for Uniprocessors When production systems are implemented on a multiprocessor there are a number of overheads that are encountered, for example, memory contention, loss of node sharing, synchronization, and scheduling overheads. This section discusses the speed of production systems on a uniprocessor when such overheads are not present and also when such overheads are present. These results then form the basis for the speed-up numbers given in the subsequent sections. For example, consider a production system that without overheads of a parallel implementation runs at a speed of 400 wmechanges/sec on a uniprocessor, and that with the overheads runs at a speed of 200 wme-changes/sec on a uniprocessor. 
Now if a parallel implementation using eight processors runs at a speed of 1200 wme-changes/sec, then the nominal speed-up (the average number of processors that are busy or the speed-up with respect to the parallel implementation on a uniprocessor) is 6-fold, while the true speed-up (speed-up with respect to the base serial implementation with no overheads) is only 3-fold. The nominal speed-up (sometimes called concurrency) is an indicator of the average processor utilization in the parallel implementation, while the true speed-up is an indicator of the performance improvement over the best known sequential implementation. Tables 8-1 and 8-2 present the data for the uniprocessor execution of production systems when the synchronization and scheduling overheads are not present. 62 The cost models for these simulations were derived by removing all the synchronization and scheduling code from the code for the parallel implementation (given in Appendix B). Adjustments were also made to compensate for the loss of sharing of memory nodes in the parallel implementation. The costs listed for the various node activations and the cost per wine-change are in milliseconds, and correspond to a machine that executes one million register-register instructions per second (the relative costs of other instructions are shown in Table 7-1). 62Thecorresponding datafortracesusingbinaryRetenetworksarepresentedinSection8.6. 112 PARALLELISM IN PRODUCTION SYSTEMS "Fable8-1: Uniprocessor Execution With No Overheads: Part-A F_ture _ vtlin _ _ mud.lin 1. root-per-ch, avg cost, sd 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 2. etst-per-ch, avg cost, sd 2Z92, .014, .025 22.92, .014, .024 24.48, .011, .029 41.96, .011, .017 41.96, Oll, .017 3. and-per-ch, avg cost, sd 25.96, .086, .357 24.02, .046, .025 26.59, .050, .039 25.95, .076, .150 24.62, .058, .051 4. not-per-oh, avg cost, sd 5.01, .049, .039 4.65, .049, .038 5.84, .051, 033 5.79, .059, .044 5.79, .059, .044 5. term-per-ch, avg cost, sd 1.79, .028, .000 1.03, .028, .000 2.06, .028, 000 3.69, .028, .000 3.69, .028, .080 6. cost per wme-ch (ms) 1.656 1.346 1.646 2.255 2.019 7. speed (wme-eh/sec) 603.9 742.9 607.5 443.5 495.3 Table 8-2: Uniprocessor Execution With No Overheads: Part-B 1. root-per-eh, avg cost, sd 1.0, .010, .000 1.0, .0]0, .000 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 2. etst-per-ch, avg cost, sd 7.14, .034, .092 5.05, .026, .059 5.07, .026, .059 3.97, .025, .042 3.99, .025, .041 3. and-per-ch, avg cost, sd 39.41, .052, .054 24.58, .059, .048 24.68, .060, .049 23.56, .071, .054 34.56, .088, .079 4. not-per-oh, avg cost, sd 3.97, .057, .037 2.63, .067, .016 9_85,.068, .015 0.75, .062, .022 0.76, .062, .022 5. term-per-ch, avg cost, sd 1.65, .028, .000 0.55, .028, .000 0.55, .028, .000 0.74, .028, .000 0.78, .028, .000 6. cost per wrne-eh (ms) 1.911 1.420 1.458 1.616 2.985 7. speed (wme-ch/sec) 523.3 704.2 685.9 618.8 335.0 The data in Tables 8-1 and 8-2 may be interpreted as follows. Lines 1-5 give the average number of node activations of each type per change to working memory, and the mean and standard deviation of cost per node activation of that type. 63 Line 6 gives the average cost of processing a workingmemory change, and line 7 gives the execution speed of production systems on a 1 MIPS uniprocessor. 
64 Using the data the following observations may be made: (1) The average speed of production systems on a uniprocessor is 589.1 wme-changes/sec, where the average is computed over vt.lin, ilog.lin, mud.lin, daa.lin, rls.lin, rlp.lin, eps.lin, and epp.lin traces. 65 (Note all data in the following sections of this chapter is also averaged over the this set of traces). (2) The performance of vt.lin is better than that of vto.lin by 23%, which implies that the seven productions removed from vto.lin were taking almost a quarter of the total processing time. (3) Similarly, the performance of 63The standard deviation for the cost of processing root-node activations and terminal-node activations is zero because these nodes do not perform any data dependent action. For more details about the actual code executed by these node activations, see Appendices Band C. 64Note that for each of the traces, the sum of the product of the first two entries in each of lines 1-5 is greater than the number given in line-& This is not an error. The sum computed from lines 1-5 does not take into account the sharing of memory nodes that takes place in an actual uniprocessor implementation. The savings in cost due to memory-node sharing are subtracted from the sum computed from lines 1-5 to obtain the cost listed in line-6. 65The traces vto.lin and mudo.lin have been excluded, because otherwise the VT and MUD systems would have had too much weight. SIMULATION RESULTSANDANALYSIS 113 mud.lin is better than that of mudo.lin by 12%, indicating that the two productions removed from mudo.lin were taking an eighth of the total processing time. (4) However, the more important effect of removing the seven productions from vto.lin and the two productions from mudo.lin is the large reduction in the standard deviation of the cost of processing and-node activations for the two systems. The standard deviation drops down from 0.357 ms for vto.lin to 0.025 ms for vt.lin. Similarly, the standard deviation drops down from 0.150 ms for mudo.lin to 0.051 ms for mud.lin. As a result, although the improvements in the uniprocessor speed are just 23% and 12% for vt.lin and mud.lin respectively, the improvements in the multiprocessor speed are much more significant, close to 200%-300% (see Section 8.5). Other interesting numbers that may be extracted from the tables are the average costs of processing various types of node activations. The average cost for processing constant-test node activations is .021 ms (equivalent to 21 register-register instructions66); for and-node activations is .060 ms; for not-node activations is .059 ms; and for terminal node activations is .028 ms. These numbers are indicative of the kinds of scheduling and synchronization overheads that may be tolerated in processing these activations on a parallel computer. Tables 8-3 and 8-4 present the data for the uniprocessor execution of production systems when the synchronization and scheduling overheads that occur in parallel implementations are taken into account. 67 The tables give the data for overheads corresponding to the use of node parallelism and intra-node parallelism. The data when production parallelism is used are given later. Lines 1-7 are to be interpreted in the same way as that for Tables 8-1 and 8-2. Line 8 gives the overhead in parallel implementations because of loss of sharing of memory nodes (see Section 5.2.2). 
Line 9 gives the loss in performance due to the synchronization and scheduling overhead, and line 10 gives the combined loss in performance due to all the overheads. The data shows that the average execution speed including all the overheads is 296.8 wme-changes/sec, that is a factor of 1.98 less than the speed when the overheads are not present. Thus a parallel implementation would perform better than a uniprocessor implementation. must recover this factor before it Another way of looking at it is that the maximum speed-up using k processors is not going to be better than k/1.98. The factor of 1.98 is composed of (1) a factor of 1..22 from loss of sharing of nodes and (2) a factor of 1.62 from the synchronization and scheduling overheads. 66Thecostismuchhigherthanthat forconstant-test nodesin theOPS83implementation becauseofthehashingtechniques beingused. 67Thememorycontentionoverheadsarenotincludedhere,butareconsideredin Section8.8. 114 PARALLELISMIN PRODUCTION SYSTEMS With the inclusion of overheads, The average cost of processing of and-nodes constant-test of 63%), and of terminal the increases be recovered are significant using a larger number of processors. More critical the final processing F_tu_ 1. root-per-ch,avg cost, sd 2. ctst-per-ch,avg cost, sd 3. and-per-ch, avg cost sd 4. not-per-ch, avg cost sd 5. term-per-ch,avg cost, sd 6. costper wme-ch(ms) 7. speed (wme-ch/see) 8. sharingoverhead 9. (sync+ sched) overhead 10.(sync+ sched+ shar) ovrhd be shared. nodes from the root-node node down, since that is what speed. Uniprocessof Execution With Overheads: Part-A Node Parallelism and Intra-Node Parallelism Uniprocessor Execution With Overheads: Part-B Node Parallelism and Intra-Node Parallelism or intra-node parallelism that were shared between This loss in performance and intra-node parallelism. distinct is used, when production productions due to lack of sharing above the loss due to lack of sharing of memory node parallelism than the cost of individual daa.lin _ _ _ 1.0,.019, .000 1.0,.019, .000 1.0,.019,.000 1.0,.019,.000 1.0,.019,.000 7.14,.044, .097 5.05,.034, .062 5.07,.034, .062 3.97,.030,.043 3.99,.030,.042 39.41,.088,.058 24.58,.098, .055 24.68,.098, .057 23.56,.112, .064 34.56,.129, .087 3.97,.098, .043 2.63,.103, .019 2.85,.103, .019 0.75, .099, .028 0.76, .098,.028 1.65, .043,.000 0.55,.043, .000 0.55, .043, .000 0.74, .043,.000 0.78, .043,.000 4.272 2.893 2.943 2.888 4.722 234.1 345.7 339.8 346.3 211.8 1.36 1.27 L25 1.15 1.08 1.65 1.61 L61 1.56 1.46 2.24 2.04 102 1.79 L58 when node parallelism used, two-input loss can vto.lin ._dmn _ _ mud.lin 1.0..019,.000 1.0,.019,.000 1.0,.019,.000 1.0,.019,.000 1.0,.019,.000 22.92,.021, .029 22.92,.020,.028 24.48,.019,.034 41.96,.019,.026 41.96,.019,.026 25.96,.121, .358 24.02,.081,.029 26.59,.087,.045 25.95,.113,.152 24.62,.095,.058 5.01,.086, .043 4.65, .085,.043 5.84, .089,.039 5.79,.096,.049 5.79, .096,.049 1.79,.043, .000 1.03,.043,.000 2.06,.043, .000 3.69,.043,.000 3.69, .043,.000 4.158 2.887 3.404 4.490 3.895 240.5 346.4 293.8 222.7 256.7 1.73 1.26 1.24 1.28 1.16 1.45 1.70 1.67 1.56 1.66 2.51 2.14 2.07 1.99 1.93 Table 8-4: F_ture 1. root-per-ch, avgeost, sd 2. ctst-per-ch,avg cost, sd 3. and-per-ch, avg cost, sd 4. not-perch, avg cost, sd 5. term-per-oh,avg cost, sd 6. costper wme-ch(ms) 7. speed (wme-ch/see) 8. sharingoverhead 9. (sync+ sched) overhead 10.(sync+ sched+ shar) ovrhd of 38%), goes up from .059 ms to they are not too bad, since much of the performance determines Table 8-3: also goes up. 
With the inclusion of overheads, the cost of processing node activations also goes up. The average cost of processing constant-test nodes goes up from .021 ms to .029 ms (an increase of 38%), of and-nodes from .060 ms to .098 ms (an increase of 63%), of not-nodes from .059 ms to .096 ms (an increase of 63%), and of terminal nodes from .028 ms to .043 ms (an increase of 54%). Although the increases are significant, they are not too bad, since much of the performance can be recovered using a larger number of processors. More critical than the cost of individual node activations is the cost of the longest chain of activations in the Rete network from the root node down, since that is what determines the final processing speed. Unlike production parallelism, when node parallelism or intra-node parallelism is used, activations of two-input nodes that were shared between distinct productions can no longer be shared. This loss in performance due to lack of sharing of two-input nodes is over and above the loss due to lack of sharing of memory nodes, a loss that is also encountered when using production parallelism.

Tables 8-5 and 8-6 give some of the data for systems executing on a uniprocessor when the overheads for using production parallelism are included. Lines 1 and 2 give the cost of processing a working-memory change and the overall speed of execution on a 1 MIPS processor. Line 3 gives the sharing overhead factor over a uniprocessor implementation with no sharing losses. Line 4 gives the extra loss due to sharing when using production parallelism over the loss due to sharing when using node parallelism (basically, line 3 of Tables 8-5 and 8-6 divided by the corresponding entries in line 8 of Tables 8-3 and 8-4). The data show that the average extra loss due to sharing when using production parallelism is a factor of 1.33. Line 5 gives the total loss factor over a uniprocessor implementation with no overheads. The average value of this loss factor for production parallelism is 2.64, as compared to 1.98 for node and intra-node parallelism.

Table 8-5: Uniprocessor Execution With Overheads: Part-A (Production Parallelism)

    Feature                          vto.lin   vt.lin   ilog.lin   mudo.lin   mud.lin
    1. cost per wme-ch (ms)          4.51      3.23     3.74       4.58       3.99
    2. speed (wme-ch/sec)            221.7     309.6    267.4      218.3      250.6
    3. sharing overhead              1.878     1.407    1.363      1.307      1.185
    4. extra sharing ovrhd           1.086     1.117    1.099      1.021      1.025
    5. (sync + sched + shar) ovrhd   2.72      2.40     2.27       2.03       1.98

Table 8-6: Uniprocessor Execution With Overheads: Part-B (Production Parallelism)

    Feature                          daa.lin   rls.lin   rlp.lin   eps.lin   epp.lin
    1. cost per wme-ch (ms)          5.80      5.30      5.36      3.62      5.43
    2. speed (wme-ch/sec)            172.4     188.7     186.6     276.2     184.2
    3. sharing overhead              1.845     2.328     2.278     1.442     1.241
    4. extra sharing ovrhd           1.357     1.833     1.822     1.254     1.149
    5. (sync + sched + shar) ovrhd   3.04      3.73      3.68      2.24      1.82

8.3. Production Parallelism

This section presents simulation results for the speed-up obtained using production parallelism on a multiprocessor. It is assumed that the multiprocessor has a hardware task scheduler associated with it; the results for the case when multiple software task queues are used are presented in Section 8.7. Figures 8-1, 8-2, and 8-3 show the graphs for speed-up when production parallelism is used without action parallelism, that is, when each change to working memory is processed to completion before the processing for the next change is begun. Figure 8-1 shows the nominal speed-up, Figure 8-2 shows the true speed-up, and Figure 8-3 gives the actual execution speed of the parallel implementation (in working-memory changes processed per second), assuming that the individual nodes in the multiprocessor work at a speed of 2 MIPS. [Footnote 68: The terms nominal speed-up and true speed-up were defined at the beginning of Section 8.2.] (We choose 2 MIPS as the processor speed to roughly match the current operating region of technology. Within limits set by the operating regions of other components of the multiprocessor system, the results can simply be scaled for other processor speeds.) To explain the nature of the graphs it is convenient to divide the curves into two regions.
Figures 8-1, 8-2, and 8-3 show the graphs for speed-up when production parallelism is used without action parallelism, that is, when each change to working memory is processed to completion before the processing for the next change is begun. Figure 8-1 shows the nominal speed-up, Figure 8-2 shows the true speed-up, and Figure 8-3 gives the actual execution speed of the parallel implementation (in working-memory changes processed per second), assuming that the individual nodes in the multiprocessor work at a speed of 2 MIPS.68 (We choose 2 MIPS as the processor speed to roughly match the current operating region of technology. Within limits set by the operating regions of other components of the multiprocessor system, the results can simply be scaled for other processor speeds.) To explain the nature of the graphs it is convenient to divide the curves into two regions. The first region, the active region, is where the overall speed-up is increasing significantly with an increase in the number of processors (say up to the 16-processor mark). The second region, the saturation region, corresponds to the portion where the curve is almost flat (beyond the 16-processor mark).

The saturation speed-up, or the maximum speed-up, available from production parallelism is primarily affected by the following factors:

(1) It is limited by the number of productions affected per change to working memory, that is, the number of productions whose state changes as a result of a working-memory change. For the traces under consideration the average size of the affect-sets is 26.3. The two curves at the bottom of Figure 8-1, representing eps.lin and epp.lin, have affect-set sizes of 12.1 and 11.9 respectively. The curve at the top, representing vt.lin, has an affect-set size of 31.2.

(2) The saturation speed-up is proportional to the ratio tavg/tmax, where tavg is the average cost of processing an affected production and tmax is the cost of processing the production requiring the most time in that cycle (also see Section 7.1.2, and the sketch following this list). For the curves shown in Figure 8-1, the average saturation nominal speed-up is 5.1, which is much smaller than the average size of the affect-sets. A large factor of almost 5 is lost due to the variation in the cost of processing affected productions.

(3) The effect of loss of sharing in the Rete network on the saturation speed-up is as follows. In the saturation region, when as many processors as needed are available, multiple activations corresponding to a single shared node in the original network are all processed in parallel. While this makes the nominal speed-up higher than it would have been if nodes were still shared, the true speed-up remains the same. This is because the true speed-up in the saturation region is primarily dependent on the longest chain (most expensive chain) of node activations, and loss of sharing only affects the branching factor of nodes in the chain, not the length of the longest chain.69

(4) The effect of the synchronization and scheduling overheads on the saturation speed-up is very complex. The portion of these overheads that simply increases the cost of the individual node activations does not affect the saturation nominal speed-up much, but it does affect the saturation true speed-up significantly. The portion of these overheads that requires serial processing (for example, the processing required for each node activation to be scheduled through a serial scheduler) can, however, significantly affect both the saturation nominal and saturation true speed-ups.

68 The terms nominal speed-up and true speed-up were defined in the beginning of Section 8.2.

69 The change in the branching factor does have second-order effects on the saturation speed-up. For example, if the branching factor is large, then by the time the node corresponding to the longest chain gets scheduled for parallel execution, a significant amount of time may have elapsed.
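A minimal sketch of factor (2) above (hypothetical processing times, not thesis code): with one processor per affected production, the cycle cannot finish before the most expensive production does, so the nominal speed-up saturates at the total work divided by the maximum cost.

    /* Saturation nominal speed-up for one recognize-act cycle:
     * (sum of processing times) / (max processing time),
     * which equals n * (tavg / tmax). */
    #include <stdio.h>

    double saturation_speedup(const double t[], int n) {
        double sum = 0.0, max = 0.0;
        for (int i = 0; i < n; i++) {
            sum += t[i];
            if (t[i] > max) max = t[i];
        }
        return sum / max;
    }

    int main(void) {
        /* 8 affected productions; one is 5x costlier than the rest. */
        double t[8] = { 1, 1, 1, 1, 1, 1, 1, 5 };
        printf("speed-up = %.2f\n", saturation_speedup(t, 8)); /* 2.40 */
        return 0;
    }

Even with an affect-set of size 8, the variation in cost limits the saturation speed-up to 2.4-fold in this example, which is the same phenomenon behind the factor of almost 5 lost by the traces above.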
Figure 8-1: Production parallelism (nominal speed-up).
Figure 8-2: Production parallelism (true speed-up).
Figure 8-3: Production parallelism (execution speed).

The speed-up in the active region of the curves, in addition to being limited by the factors affecting the saturation speed-up, is dependent on the following factors:

(1) The speed-up is obviously bounded by the number of processors in the system.

(2) The speed-up is reduced by the variation in the size of the affect-sets. The variation results in a loss of processor utilization because, within the same run, for some cycles there are too many processors (the excess processors remaining idle) and for some cycles there are too few processors (some processors have to process more than one production activation, while other processors are waiting for these to finish).

(3) The loss of sharing of nodes in the parallel implementations also affects the active region of the true speed-up curves. Since the maximum nominal speed-up is bounded by the number of processors, if the loss due to sharing is high, then the maximum true speed-up is correspondingly reduced. Thus if the loss of sharing increases the cumulative cost of processing all the node activations by a factor of 2, then using eight processors, no better than 4-fold true speed-up can be obtained.

(4) In the case of non-shared-memory multicomputers, the speed-up is greatly dependent on the quality of the partitioning, that is, the uniformity with which the work is distributed amongst the processors. The work in this thesis, however, does not address the issues in the implementation of production systems on non-shared-memory multicomputers. Some discussion can be found in [32, 66, 67].

There are several other observations that can be made from the curves in Figures 8-1, 8-2, and 8-3:

(1) The average nominal saturation speed-up is 5.1.70

(2) The average true saturation speed-up is only 1.9. In fact, for the epp.lin trace, the true speed-up never goes over 1.0; that is, the parallel implementation cannot even overcome the losses due to the overheads in the parallel implementation.

(3) The average saturation execution speed using 2 MIPS processors is around 2350 wme-changes/sec, or about 1000 prod-firings/sec.

(4) The saturation region of the speed-up curves is reached while using 16 processors or less. In fact, for most systems, using more than 8 processors does not seem to be advisable.

(5) The average loss of speed-up due to the overheads of the parallel implementation is a factor of 2.65. This suggests that if only production parallelism is to be used, then it may be better to use the partitioning approach (divide the production system into several parts and perform match for each part in parallel) rather than the centralized task queue approach suggested in this thesis.

70 Note that all average numbers reported in this section (unless otherwise stated) are computed over the traces vt.lin, ilog.lin, mud.lin, daa.lin, rls.lin, rlp.lin, eps.lin, and epp.lin. The traces vto.lin and mudo.lin are excluded because including them would have resulted in excess weight for the vt and mud production systems.
The advantage of the partitioning approach is that the synchronization and scheduling overheads are not present, although the sharing overheads are still present.

(6) Although the production-system programs considered have very different complexity and size, the larger programs do not appear to gain more from parallelism than the smaller ones. This is a consequence of the fact that the sizes of the affect-sets are quite independent of the sizes of the production-system programs.

At the level of individual production-system programs the following observations can be made:

(1) The Soar Eight-Puzzle program traces (eps.lin and epp.lin) are not doing well at all. The reasons for the low speed-up are the small size of the affect-sets (12.1 and 11.8 respectively) and the large variance in processing times. The large variance is a result of the long-chain effect and the cross-product effect discussed in Sections 5.2.6 and 4.2.3.

(2) The difference in the speed-ups obtained by the vt.lin and vto.lin systems is quite large, about a factor of 2. Since the sizes of the affect-sets for vt.lin and vto.lin are about the same, this shows that the removal of the 7 productions from vto.lin significantly reduces the variation in the processing times. The difference in the speed-ups achieved by the mud.lin and mudo.lin systems is, however, not very large. This is because the productions that were removed did not have a very high cost compared to the processing required by the rest of the affected productions. The difference in the speed-up obtained by mud.lin and mudo.lin becomes more significant when intra-node parallelism is used (see Figure 8-13), which suggests that the two removed productions contained node activations that could not be processed in parallel using production parallelism or node parallelism but that could be processed in parallel using intra-node parallelism.

8.3.1. Effects of Action Parallelism on Production Parallelism

Figures 8-4, 8-5, and 8-6 present speed-up data for the case when both production and action parallelism are used. Figure 8-4 presents data about nominal speed-up, Figure 8-5 presents data about true speed-up, and Figure 8-6 presents data about the execution speed. Comparing the graphs for speed-up with action parallelism with those without action parallelism, the following observations can be made:

(1) Some systems (vt.lin, rls.lin, rlp.lin, and eps.lin) show a significant increase in the speed-up with the use of action parallelism, while other systems (vto.lin, ilog.lin, mud.lin, mudo.lin, daa.lin, and epp.lin) show very little extra speed-up. The main reason why some of the systems show little extra speed-up is that the affect-sets corresponding to the multiple changes have large amounts of overlap. For example, if the production taking the longest time to process is affected by each of the changes to the working memory, then no extra speed-up is obtained from processing such changes in parallel. (This is exactly what happens in the case of the epp.lin trace.) The second limitation is imposed by the number of changes made to working memory per production firing (per phase for Soar systems). This number itself is quite small for the OPS5 systems.

(2) For the systems that show significant improvement, the number of processors needed to get to the saturation point goes up from 8-16 processors to 16-32 processors.

(3) The average nominal saturation speed-up goes up from 5.1 to 7.6, an improvement by a factor of 1.50.
(4) The average saturation execution speed correspondingly goes up from 2350 to 3550 wme-changes/sec.

Figure 8-4: Production and action parallelism (nominal speed-up).
Figure 8-5: Production and action parallelism (true speed-up).
Figure 8-6: Production and action parallelism (execution speed).

8.4. Node Parallelism

Figures 8-7, 8-8, and 8-9 present the data when node parallelism is used to speed up the execution of production systems. Some of the computed statistics are:

(1) The average saturation nominal speed-up is 5.8. This is only a factor of 1.14 better than the corresponding speed-up when production parallelism is used.

(2) The average true saturation speed-up is 2.9, which is a factor of 1.50 better than the corresponding speed-up when only production parallelism is used. The improvement in the true speed-up is larger than the improvement in the nominal speed-up because the overheads when exploiting node parallelism are less than the overheads when exploiting production parallelism. The average overhead when using node parallelism is 1.98 as compared to 2.65 when using production parallelism.

(3) The average saturation execution speed using node parallelism is 3500 wme-changes/sec as compared to 2350 wme-changes/sec when using production parallelism.

(4) The number of processors needed to obtain the saturation speed-up is still around 8-16.

Studying the nominal speed-up graphs for production and node parallelism shows that systems which achieved relatively high speed-ups with production parallelism (vt.lin, ilog.lin, daa.lin, rls.lin, rlp.lin) do not benefit much from node parallelism. Systems which did poorly using production parallelism, however, show a more marked improvement. This suggests that the systems that were doing well with production parallelism did not suffer from multiple node activations corresponding to the same production when a change to working memory was made, while many of the systems that performed poorly did suffer from such problems.

8.4.1. Effects of Action Parallelism on Node Parallelism

Figures 8-10, 8-11, and 8-12 show the speed-ups for the case when both node parallelism and action parallelism are used. The average statistics are as follows:

(1) The saturation nominal speed-up is 10.7 as compared to 5.8 for node parallelism alone (a factor of 1.84).

(2) The saturation true speed-up is 5.4 as compared to 2.9 for node parallelism alone.

(3) The average saturation execution speed is 6600 wme-changes/sec as compared to 3500 wme-changes/sec for node parallelism alone.

(4) For most systems it appears that 16 processors are sufficient, though for vt.lin, rls.lin, and rlp.lin it seems appropriate to use 32 processors.
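The overlap effect described in Section 8.3.1 can be illustrated with a crude model (this simplification is my own reading of the discussion above: match work for the concurrent changes proceeds in parallel, but the activations belonging to any one production are serialized):

    /* Simplified model of action parallelism with overlapping
     * affect-sets: the cycle time is bounded by the production with
     * the largest total work summed over the changes affecting it. */
    #include <stdio.h>

    #define NPROD    3
    #define NCHANGES 4

    int main(void) {
        /* cost[p][c]: time production p spends on change c (0 = unaffected) */
        double cost[NPROD][NCHANGES] = {
            { 2, 2, 2, 2 },  /* affected by every change: the bottleneck */
            { 1, 0, 0, 0 },
            { 0, 1, 1, 0 },
        };
        double total = 0.0, critical = 0.0;
        for (int p = 0; p < NPROD; p++) {
            double per_prod = 0.0;
            for (int c = 0; c < NCHANGES; c++) {
                per_prod += cost[p][c];
                total    += cost[p][c];
            }
            if (per_prod > critical) critical = per_prod;
        }
        /* 11/8 = 1.375: overlap wipes out most of the expected gain. */
        printf("speed-up = %.3f\n", total / critical);
        return 0;
    }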
Figure 8-7: Node parallelism (nominal speed-up).
Figure 8-8: Node parallelism (true speed-up).
Figure 8-9: Node parallelism (execution speed).
Figure 8-10: Node and action parallelism (nominal speed-up).
Figure 8-11: Node and action parallelism (true speed-up).
Figure 8-12: Node and action parallelism (execution speed).

It is interesting to note that although the sizes of the affect-sets with action parallelism for rls.lin and rlp.lin are 128.9 and 145.8 respectively, the saturation nominal speed-up is only around 14-fold. This indicates that a factor of almost 10 is still getting lost due to variations in the processing cost of the affected productions. One source of such variation is long chains of dependent node activations, especially since R1-Soar has several productions with large numbers of condition elements. This problem is dealt with in Section 8.6, where binary Rete networks are considered. The other source of variation is multiple activations of the same node in the network (this may happen due to the cross-product effect as discussed in Section 4.2.3, or due to multiple changes causing activations of the same node in the Rete network). Since multiple activations of the same node cannot be processed in parallel using node parallelism, they all have to be processed sequentially. The use of intra-node parallelism, which is discussed next, addresses some of these problems.

8.5. Intra-Node Parallelism

Figures 8-13, 8-14, and 8-15 show the speed-up data when intra-node parallelism is used. The average statistics when using intra-node parallelism are:

(1) The average saturation nominal speed-up is 7.6 as compared to 5.8 when using node parallelism and 5.1 when using production parallelism.

(2) The average saturation true speed-up is 3.9 as compared to 2.9 for node parallelism and 1.9 for production parallelism.

(3) The average saturation execution speed is 4460 wme-changes/sec as compared to 3500 when using node parallelism and 2350 when using production parallelism.

It is interesting to observe that the curve for epp.lin has made a sudden upward jump, so that the saturation nominal speed-up for epp.lin is now 7.8 as compared to 3.2 when node parallelism is used and 1.8 when production parallelism is used.
This sudden increase can be explained as follows. In the epp.lin system the occurrence of cross-products is very common. While it was not possible to process the cross-products in parallel using production parallelism or node parallelism, it is possible to do so using intra-node parallelism; hence the large increase in the speed-up. On the other hand, the curve for vto.lin shows no extra speed-up at all over what could be achieved using production parallelism. The reason for the low speed-up is that a specific node in the system is affected by all the changes to the working memory, and this node activation takes a very long time to process. Since only a single node activation is involved in the bottleneck, the use of parallelism does not help, unless multiple processors are allocated to process that single node activation. There is some work going on in this direction, though it is not discussed in this thesis [23].

A general point that emerges from the discussion of the various sources of parallelism is that different production systems pose different bottlenecks to the use of parallelism. While for some programs production parallelism is sufficient (sufficient in the sense that most of the speed-up that is to be gained from using parallelism is obtained by using production parallelism alone), others need to use node parallelism or intra-node parallelism to obtain the full benefits from parallelism. Since the move to finer granularity does not impose any extra overheads, the fine-granularity scheme of intra-node parallelism seems to be the scheme of choice.

Figure 8-13: Intra-node parallelism (nominal speed-up).
Figure 8-14: Intra-node parallelism (true speed-up).
Figure 8-15: Intra-node parallelism (execution speed).

8.5.1. Effects of Action Parallelism on Intra-Node Parallelism

Figures 8-16, 8-17, and 8-18 present the speed-up data for the case when both intra-node parallelism and action parallelism are used. The average statistics for this case are as follows: (1) The average saturation nominal speed-up is 19.3 as compared to 7.6 when only intra-node parallelism is used. (2) The average saturation true speed-up is 10.2. (3) The average saturation execution speed is 11,260 wme-changes/sec. This corresponds to less than 100 μs for match per working-memory change. It is interesting to note that the highest speed-ups (with the exception of eps.lin) are obtained by the Soar programs. This is mainly because of the large number of working-memory changes (an average of 12.25 changes) that are processed in parallel for these systems.
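A minimal numerical sketch (hypothetical costs, my own illustration) of why intra-node parallelism helps with cross-products and with multiple activations of the same node:

    /* Node parallelism must serialize the m pending activations of a
     * single two-input node, so that node contributes m*t to the
     * critical path; intra-node parallelism runs them concurrently,
     * contributing only ~t (given at least m processors). */
    #include <stdio.h>

    int main(void) {
        int m = 16;       /* cross-product: 16 activations of one node */
        double t = 0.05;  /* ms per activation */
        double node_par  = m * t;  /* serialized: 0.80 ms */
        double intra_par = t;      /* in parallel: 0.05 ms */
        printf("node: %.2f ms, intra-node: %.2f ms (%.0fx better)\n",
               node_par, intra_par, node_par / intra_par);
        return 0;
    }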
Systems like ilog.lin, daa.lin, and vt.lin, which did relatively well when action parallelism was not used, fall behind due to the small number of working-memory changes (an average of 2.44 changes) processed every cycle. Finally, barring a few systems like rls.lin, rlp.lin, and epp.lin, which could use 64 processors, most systems seem to be able to use only about 32 processors effectively.

A summary of the speed-ups obtained using the various sources of parallelism is given in Figures 8-19, 8-20, and 8-21. The curves represent the average nominal speed-up, the average true speed-up, and the average execution speed, with the averages computed over the various production-system traces. As expected, the use of intra-node and action parallelism results in the most speed-up, followed by node and action parallelism, followed by intra-node parallelism alone, with the rest clustered below.

Figure 8-16: Intra-node and action parallelism (nominal speed-up).
Figure 8-17: Intra-node and action parallelism (true speed-up).
Figure 8-18: Intra-node and action parallelism (execution speed).
Figure 8-19: Average nominal speed-up.
Figure 8-20: Average true speed-up.
Figure 8-21: Average execution speed.

8.6. Linear vs. Binary Rete Networks

As discussed in Section 5.2.6, there are several advantages of using binary networks instead of linear networks for productions. The main advantage, however, is that doing so reduces the maximum length that a chain of dependent node activations may have. This section presents simulation results corresponding to production-system runs in which binary networks were used. As was done for linear networks, results are first presented for the binary-network runs on uniprocessors. These data are then used to calibrate the performance of the runs on multiprocessors.
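The chain-length argument can be made concrete (a sketch under stated assumptions, mine rather than the thesis's: a perfectly balanced binary network, whereas the runs below actually keep some productions in linear form):

    /* Depth of the chain of dependent two-input nodes for a production
     * with c condition elements: a linear network joins condition
     * elements one at a time (depth c-1); a balanced binary network
     * joins pairs, giving depth ceil(log2(c)). Compile with -lm. */
    #include <stdio.h>
    #include <math.h>

    int linear_depth(int c) { return c - 1; }
    int binary_depth(int c) { return (int)ceil(log2((double)c)); }

    int main(void) {
        /* Soar productions often have many condition elements; OPS5 few. */
        for (int c = 2; c <= 32; c *= 2)
            printf("c=%2d  linear=%2d  binary=%d\n",
                   c, linear_depth(c), binary_depth(c));
        return 0;
    }

For a production with 32 condition elements the dependent chain shrinks from 31 joins to 5, which is why the binary scheme pays off for Soar programs, as the results below show.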
8.6.1. Uniprocessor Implementations with Binary Networks

Tables 8-7 and 8-8 present information about the cost of uniprocessor runs when binary networks are used and when the overheads associated with parallel implementations are eliminated. Lines 1-5 give the number of node activations of each node type and the average cost (in milliseconds) per activation on a 1 MIPS processor. Line 6 gives the cost per wme-change (in milliseconds). Line 7 gives the execution speed in wme-changes per second when binary networks are used, and line 8 gives the same information when linear networks are used (line 8 is a copy of line 7 in Tables 8-1 and 8-2). Line 9 gives the ratio of the uniprocessor speeds when binary and linear networks are used.

An important observation that can be made from the data presented in Tables 8-7 and 8-8 is that for all systems other than epp.bin, the binary network version is slower than the linear network version. The average speed decreases by a factor of 1.36. This is because there are many more and-node activations in the binary network version, which in turn is caused by a larger number of and-nodes with no filtering tests (see Section 5.2.6). In fact, in the results reported in this section for the binary network case, there were some productions in each of the programs that were retained in their linear network form to avoid the blow-up of state caused by the binary network form. If this had not been done, the state would have grown so much that the lisp system would crash. Thus, to perform better than the parallel linear-network implementations, the parallel binary-network implementations have to recover the performance they lose due to the extra node activations.

Tables 8-9 and 8-10 present the overheads in binary networks due to the parallel implementation when node parallelism or intra-node parallelism is used. The average overhead due to the parallel implementation is a factor of 1.84, as compared to 1.98 when linear networks are used. Similarly, Tables 8-11 and 8-12 present the overheads due to the parallel implementation when production parallelism is used. The average overhead in this case is 2.52, as compared to 2.64 when linear networks are used.

Table 8-7: Uniprocessor Execution With No Overheads: Part-A

Feature                     vto.bin      vt.bin       ilog.bin     mudo.bin     mud.bin
1. root-per-ch, avg cost    1.0, .010    1.0, .010    1.0, .010    1.0, .010    1.0, .010
2. ctst-per-ch, avg cost    22.92, .014  22.92, .013  24.48, .013  41.96, .011  41.96, .011
3. and-per-ch, avg cost     27.82, .093  25.88, .057  28.90, .059  31.71, .077  30.38, .062
4. not-per-ch, avg cost     5.01, .049   4.65, .049   5.84, .051   5.79, .059   5.79, .059
5. term-per-ch, avg cost    1.79, .028   1.03, .028   2.06, .028   3.69, .028   3.69, .028
6. cost per wme-ch (ms)     1.964        1.645        2.033        2.664        2.438
7. bin-speed (wme-ch/sec)   509.2        607.8        491.9        375.4        410.2
8. lin-speed (wme-ch/sec)   603.9        742.9        607.5        443.5        495.3
9. bin-speed/lin-speed      0.84         0.82         0.81         0.85         0.83

Table 8-8: Uniprocessor Execution With No Overheads: Part-B

Feature                     daa.bin      rls.bin      rlp.bin      eps.bin      epp.bin
1. root-per-ch, avg cost    1.0, .010    1.0, .010    1.0, .010    1.0, .010    1.0, .010
2. ctst-per-ch, avg cost    7.14, .026   5.05, .021   5.07, .021   3.97, .019   3.99, .018
3. and-per-ch, avg cost     36.30, .066  33.37, .078  37.44, .080  28.98, .086  30.05, .087
4. not-per-ch, avg cost     3.97, .062   2.60, .067   2.87, .068   1.04, .069   1.24, .070
5. term-per-ch, avg cost    1.65, .028   0.55, .028   0.55, .028   0.74, .028   0.78, .028
6. cost per wme-ch (ms)     2.248        2.458        2.855        2.415        2.540
7. bin-speed (wme-ch/sec)   444.8        406.8        350.3        414.2        393.8
8. lin-speed (wme-ch/sec)   523.3        704.2        685.9        618.8        335.0
9. bin-speed/lin-speed      0.85         0.58         0.51         0.67         1.18
Table 8-9: Uniprocessor Execution With Overheads: Part-A
Node Parallelism and Intra-Node Parallelism

Feature                         vto.bin   vt.bin    ilog.bin  mudo.bin  mud.bin
1. cost per wme-ch (ms)         4.604     3.333     3.896     5.145     4.551
2. speed (wme-ch/sec)           217.2     300.1     256.7     194.4     219.7
3. (sync + sched + shar) ovrhd  2.344     2.026     1.916     1.931     1.867

Table 8-10: Uniprocessor Execution With Overheads: Part-B
Node Parallelism and Intra-Node Parallelism

Feature                         daa.bin   rls.bin   rlp.bin   eps.bin   epp.bin
1. cost per wme-ch (ms)         4.520     4.425     5.032     4.012     4.190
2. speed (wme-ch/sec)           221.2     225.9     198.8     249.3     238.7
3. (sync + sched + shar) ovrhd  2.011     1.800     1.762     1.661     1.650

Table 8-11: Uniprocessor Execution With Overheads: Part-A
Production Parallelism

Feature                         vto.bin   vt.bin    ilog.bin  mudo.bin  mud.bin
1. cost per wme-ch (ms)         5.206     3.915     4.724     5.321     4.727
2. speed (wme-ch/sec)           198.8     255.4     211.7     187.9     211.5
3. (sync + sched + shar) ovrhd  2.649     2.381     2.322     1.997     1.938

Table 8-12: Uniprocessor Execution With Overheads: Part-B
Production Parallelism

Feature                         daa.bin   rls.bin   rlp.bin   eps.bin   epp.bin
1. cost per wme-ch (ms)         7.316     7.627     8.282     5.233     5.393
2. speed (wme-ch/sec)           136.7     131.1     120.7     191.1     185.4
3. (sync + sched + shar) ovrhd  3.254     3.103     2.900     2.166     2.124

8.6.2. Results of Parallelism with Binary Networks

Results for the speed-up from parallelism for binary networks are presented in Figures 8-22 through 8-25 for production parallelism, in Figures 8-26 through 8-29 for node parallelism, and in Figures 8-30 through 8-33 for intra-node parallelism. Results about average speed-up are presented in Figures 8-34 and 8-35. These graphs are to be interpreted in exactly the same way as the graphs presented earlier for linear networks.

It is interesting to compare the average saturation speed-ups for the linear-network case and the binary-network case. The results are shown in Table 8-13; in each column pair, the data on the left is for the binary-network case and the data on the right is for the linear-network case. As can be seen from the table, most of the time the average saturation nominal speed-up obtained using binary networks is higher than that obtained using linear networks. However, most of the time the average saturation execution speed (given in wme-changes/sec) obtained using binary networks is lower than that obtained using linear networks. The answer to this apparent contradiction lies in the fact that programs using a binary network execute at less speed on a uniprocessor than when using a linear network, as discussed in Section 8.6.1. Looking separately at OPS5 systems and Soar systems, while the linear-network scheme seems to be more suitable for OPS5 systems, the binary-network scheme seems to be more suitable for Soar systems. This is because in OPS5 systems, where the average number of condition elements per production is small, long chains of dependent node activations do not arise, so binary networks do not help. In Soar systems, where the average number of condition elements per production is much higher than in OPS5, the long chains are a bottleneck in the exploitation of parallelism, and for that reason using the binary-network scheme helps significantly.
Table 8-13: Comparison of Linear and Binary Network Rete

Sources of Parallelism                  Average Saturation      Average Saturation
                                        Nominal Speed-up        Execution Speed
                                        (binary, linear)        (binary, linear)
1. Production Parallelism               5.3, 5.1                1870, 2350
2. Production and Action Parallelism    8.3, 7.6                2850, 3550
3. Node Parallelism                     5.6, 5.8                2680, 3490
4. Node and Action Parallelism          11.6, 10.7              5400, 6600
5. Intra-Node Parallelism               8.0, 7.6                3480, 4460
6. Intra-Node and Action Parallelism    25.8, 19.3              12020, 11260

Figure 8-22: Production parallelism (nominal speed-up).
Figure 8-23: Production parallelism (execution speed).
Figure 8-24: Production and action parallelism (nominal speed-up).
Figure 8-25: Production and action parallelism (execution speed).
Figure 8-26: Node parallelism (nominal speed-up).
Figure 8-27: Node parallelism (execution speed).
Figure 8-28: Node and action parallelism (nominal speed-up).
Figure 8-29: Node and action parallelism (execution speed).
Figure 8-30: Intra-node parallelism (nominal speed-up).
Figure 8-31: Intra-node parallelism (execution speed).
Figure 8-32: Intra-node and action parallelism (nominal speed-up).
Figure 8-33: Intra-node and action parallelism (execution speed).
Figure 8-34: Average nominal speed-up.
Figure 8-35: Average execution speed.

8.7. Hardware Task Scheduler vs. Software Task Queues

This section presents simulation results for the case when several software task queues are used instead of a hardware task scheduler. The software task queues are modeled in the parallel implementation as suggested in Section 6.2. The actual code from which the cost model used in the simulator is derived is given in Appendix B.2.

It is interesting to consider the overheads of the parallel implementation with respect to the uniprocessor execution with no such overheads. When software task queues are used, the average overhead when exploiting intra-node parallelism is a factor of 3.23, when exploiting node parallelism it is a factor of 2.96, and when exploiting production parallelism it is a factor of 3.70.71 The corresponding numbers when a hardware task scheduler is being used are 1.98, 1.98, and 2.64 respectively. The overheads with software task queues are significantly larger because of the much larger costs of enqueueing and dequeueing node activations from the software task queues. In the proposed implementation, enqueueing a node activation requires the task queue to be locked for a duration corresponding to about 13 register-register instructions,72 as compared to less than 1 such instruction when a hardware task scheduler is used. Similarly, dequeueing a node activation requires the task queue to be locked for a duration corresponding to about 44 register-register instructions,73 as compared to less than 1 such instruction with the hardware task scheduler. Note that the dequeueing cost is similar in magnitude to the actual cost of processing a node activation. Although the task queues have to be locked for these relatively long durations, reasonable performance can be obtained because many task queues are used.

Once the decision to use software task queues has been made, the next question that arises is "Given some number of processors, how many task queues should be used?". Figure 8-36 plots the performance when the number of processors is fixed at 32 and the number of task queues is varied between 4 and 64.
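A minimal sketch of such a software task queue (my own illustration in C with pthreads; the actual code used to derive the simulator's cost model is in Appendix B.2): the entire enqueue and dequeue bodies execute with the queue locked, corresponding to the roughly 13- and 44-instruction critical sections discussed above.

    #include <pthread.h>

    typedef struct { int node_id; void *token; } Activation;

    typedef struct {
        pthread_mutex_t lock;
        Activation items[1024];
        int count;
    } TaskQueue;

    void tq_init(TaskQueue *q) {
        pthread_mutex_init(&q->lock, NULL);
        q->count = 0;
    }

    /* Enqueue: the shorter critical section (~13 reg-reg instructions
     * in the cost model above). */
    int tq_push(TaskQueue *q, Activation a) {
        pthread_mutex_lock(&q->lock);
        int ok = q->count < 1024;
        if (ok) q->items[q->count++] = a;
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

    /* Dequeue: the longer critical section (~44 reg-reg instructions),
     * comparable to the cost of processing a node activation itself. */
    int tq_pop(TaskQueue *q, Activation *out) {
        pthread_mutex_lock(&q->lock);
        int ok = q->count > 0;
        if (ok) *out = q->items[--q->count];
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

With many such queues spread across the machine, an idle processor simply scans queues until a tq_pop succeeds; it is this spreading of lock traffic that keeps the long critical sections from serializing the whole match.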
The curves show that when the number of task queues is small (4-16), the execution speed is quite sensitive to this number and increases rapidly with an increase in the number of task queues. When the number of task queues is large (32-64), the execution speed is not sensitive to this number and it decreases slowly with an increase in the number of task queues. In between these two regions, when there are 16-32 schedulers, the execution speed reaches a maximum and then slowly drops. These observations can be explained as follows. When the number of task queues is small, the enqueueing and dequeueing of activations from the task queues is a bottleneck; thus the low execution speed. Also as a result, provided that there is enough intrinsic parallelism in the programs (unlike ilog.lin, mud.lin, and daa.lin), the performance is almost proportional to the number of task queues that are present, which explains the large slope when the number of task queues is increased. As the number of task queues is increased further, the performance peaks at some point determined by the intrinsic parallelism available in the program and the costs associated with using the task queues. Beyond this point, the effect of a still larger number of task queues is that the processor has to look up several task queues before it finds a node activation to process (as there are an excessive number of task queues, many of them are empty). Since the cost of looking up an extra task queue to see if it is empty is quite small, the slope in this region of the curves is small. All the results presented later in this section assume that the number of task queues present is half the number of processors. This number was found to be quite reasonable empirically, in that beyond this number of task queues the performance does not increase or decrease significantly.

71 The overheads were determined by dividing the cost of the parallel version running on a single processor with a single software task queue by the cost of the uniprocessor version with no overheads due to parallelism (as described in Tables 8-1 and 8-2).

72 It is actually composed of 2 synchronization instructions, 3 memory-register instructions, 1 memory-memory-register instruction, and 1 branch instruction. Using the relative costs of instructions given in Table 7-1, the above instructions are equivalent to 12.5 register-register instructions.

73 It is actually composed of 4 synchronization instructions, 9 register-register instructions, 9 memory-register instructions, 1 memory-memory-register instruction, and 8 branch instructions. Using the relative costs of instructions given in Table 7-1, the above instructions are equivalent to 44 register-register instructions.

Figures 8-37 to 8-48 show the nominal speed-up and the execution speed obtained using software task queues and different sources of parallelism. The sources of parallelism range from production parallelism to intra-node and action parallelism. Data about the nominal speed-up and the execution speed averaged over the eight traces are presented in Figures 8-49 and 8-50. As can be observed from these graphs, the average saturation execution speed when using production and action parallelism is 2080 wme-changes/sec as compared to 3550 wme-changes/sec when using a hardware task scheduler. When using node and action parallelism, the saturation speed with software task queues is 3330 wme-changes/sec as compared to 6600 wme-changes/sec when using a hardware task scheduler.
When using intra-node and action parallelism, the saturation execution speed is 4700 wme-changes/sec as compared to 11,260 wme-changes/sec when using a hardware task scheduler. Thus on average, a factor of 1.7 to 2.4 is lost in performance when software task queues are used.

Figure 8-36: Effect of number of software task queues.
Figure 8-37: Production parallelism (nominal speed-up).
Figure 8-38: Production parallelism (execution speed).
Figure 8-39: Production and action parallelism (nominal speed-up).
Figure 8-40: Production and action parallelism (execution speed).
Figure 8-41: Node parallelism (nominal speed-up).
Figure 8-42: Node parallelism (execution speed).
Figure 8-43: Node and action parallelism (nominal speed-up).
Figure 8-44: Node and action parallelism (execution speed).
Figure 8-45: Intra-node parallelism (nominal speed-up).
Figure 8-46: Intra-node parallelism (execution speed).
Figure 8-47: Intra-node and action parallelism (nominal speed-up).
Figure 8-48: Intra-node and action parallelism (execution speed).
Figure 8-49: Average nominal speed-up.
Figure 8-50: Average execution speed.

8.8. Effects of Memory Contention

The simulation results that have been presented so far do not take the effects of memory contention into account. This was done in order to keep machine-specific data away from the analysis of parallelism. To be more specific, as technology changes and computer buses can handle higher bandwidth, as memories get faster, and as new cache-coherence strategies evolve, the performance lost due to memory contention will change. Not including memory contention in the simulations provides a measure of parallelism in the programs which is independent of such changes in technology and algorithms--somewhat like an upper-bound result.74 While it is important to know the intrinsic limits of speed-up from parallelism for programs, it is, of course, also interesting to know the actual speed-up that would be obtained on a real machine, which does have some losses due to memory contention. This section presents simulation results that include the memory contention overhead.

74 Note that some portions of the architecture, like the instruction set of the machine, had to be specified in more detail, because it would not have been possible to do the simulations otherwise.

The memory contention overhead is included in the simulation results as per the model described in Section 7.1.1.4. Since one would not design multiprocessors with 8, 16, 32, 64, or 128 processors in exactly the same way (for example, using buses with the same bandwidth to connect the processors to memory), different models are used to calculate the contention for multiprocessors with different numbers of processors. The multiprocessors are divided into three groups: those having 8 or 16 processors, those having 32 or 64 processors, and those having 128 processors (the simulations were run with only these discrete values for the number of processors). For the case when there are 8 or 16 processors, it is assumed that the number of memory modules is equal to the number of processors (8 or 16), the bus has a latency of 1 μs, the bus is time-multiplexed, and it can transfer 4 bytes of data every 100 ns. For the case when there are 32 or 64 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1 μs, the bus is time-multiplexed, and it can transfer 8 bytes of data every 100 ns. For the case when there are 128 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1 μs, the bus is time-multiplexed, and it can transfer 8 bytes of data every 50 ns. It is further assumed for all the cases that each processor has a speed of 2 MIPS and that the instruction mix consists of 50% register-register instructions and 50% memory-register instructions. Thus each processor, when active, generates 3 million memory references per second: 2 million references for instructions and 1 million references for data. The cache-hit ratio is uniformly assumed to be 0.85; that is, 85% of the memory accesses are satisfied by the cache and do not go to the bus.75
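A crude saturation model of these parameters (my own simplification; the thesis uses the more detailed queueing model of Section 7.1.1.4): each active processor generates 3 million references per second, 15% of which miss the cache and go to the bus.

    /* Processor efficiency limited by the bus's transfer capacity.
     * The 10M transfers/sec figure assumes the 32-64 processor
     * configuration (one transfer per 100 ns). */
    #include <stdio.h>

    double efficiency(int active, double bus_xfers_per_sec) {
        double miss_rate = 0.15;                   /* 1 - 0.85 hit ratio  */
        double refs = active * 3.0e6 * miss_rate;  /* bus accesses needed */
        return refs <= bus_xfers_per_sec ? 1.0 : bus_xfers_per_sec / refs;
    }

    int main(void) {
        for (int n = 8; n <= 64; n *= 2)
            printf("%2d active: efficiency %.2f\n", n, efficiency(n, 10.0e6));
        return 0;
    }

Under this crude model, efficiency stays at 1.0 up to about 22 active processors and then falls off as the bus saturates; the shape, though not the exact values, mirrors the simulated curves of Figure 8-51 below.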
The multiprocessors are divided into three groups: those having 8 or 16 processors, those having 32 or 64 processors, and those having 128 processors (the simulations were run with only these discrete values for number of processors). For the case when there are 8 or 16 processors, it is assumed that the number of memory modules is equal to the number of processors (8 or 16), the bus has a latency of 1/_s,the bus is time-multiplexed, and that it can transfer 4 bytes of data every lOOns. For the case when there are 32 or 64 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1/_s, the bus is time-multiplexed, and that it can transfer 8 bytes of data every lOOns. For the case when there are 128 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1/_s,the bus is time-multiplexed, and that it can transfer 8 bytes of data every 50ns. It is further assumed for all the cases that each processor has a speed of 2 MIPS and that the instruction mix consists of 50% register-register instructions and 50% memory-register instructions. Thus each processor when active generates 3 million memory references per second, 2 million references for the instructions and i million references for the data. The cache-hit-ratio is uniformly assumed to be 0.85, that is, 85% of the memory accesses are satisfied 74Note someportions ofthe architecture, like theinstruction set ofthemachine, hadtobespecified inmoredetail, because itwould nothave been possible todothesimulations otherwise. 147 SIMULATION RESULTS AND ANALYSIS by the cache and do not go to the bus.75 Figure 8-51 shows the processor efficiency for the above cases as a function of the number of processors that are active.76 o,_ ¢j __, 1.0_ .9 U 2 Q. .a Ap=8 oP=16 [] P = 32or64 OP= 128 .7 .6 •5 I 0 8 I 16 ! I I I I 24 32 40 48 56 I I I I ,,I 64 72 80 88 96 Number of Active Processors Figure 8-51: Processor efficiency as a function of number of active processors. Figures 8-52 and 8-53 present the nominal speed-up and the execution speed for production systems as predicted by the simulator when intra-node and action parallelism are used. The results are presented only for the intra-node and action parallelism case because the combination of these two sources results in the most speed-up, and also because these sources are most affected by the memory contention overhead. The graphs present data for the individual programs (curves with solid lines) and also the average statistics (curves with dotted lines). It is interesting to compare the average statistics curves in Figures 8-52 and 8-53 with the corresponding curves in Figures 8-19 and 8-21 (when memory contention overheads are not included). 75The cache-hit ratio of 85% is based on the following rough calculations. Since caeh processor executes at 2 MIPS, there are 2 million references per sec (MRS) generated for instructions and I MRS generated for data (assuming that only 50% of the instructions require memory references). Assuming 95% hit-ratio for code and 65% hit-ratio for data, the composite hit-ratio turns out to be 85%. 76The variable P in Figure 8-51 refers to the total number of processors (both active and inactive) in the multiproeessor system. 148 PARALLEHSM 1N PRODUCTION SYSTEMS The average nominal speed-up (concurrency or average number of active processors) with 128 processors with memory contention is 20.5 and without memory contention is 19.3 (an increase Of about 6%). 
The average execution speed with 128 processors with memory contention is 10950 wme-changes/sec and without memory contention is 11261 wme-changes/sec (a decrease of about 3%). The increase in the concurrency and decrease in the execution speed may be understood as follows. Since memory contention causes everything to slow down, the execution speed will obviously be lower. However, since all processors in the multiprocessor are not busy all of the time, the multiprocessor is able to compensate for these slower processors by using some processors that would not be used if memory contention was not present, thus overcoming some of the losses. Thus, although the cost of processing the production systems goes up by 9% due to memory contention, the decrease in the speed of the parallel implementation is only 3%. The remaining 6% is recovered by an increase in the concurrency. 149 SIMU1 ATION RESULTS AND ANALYSIS _40 36 32 tx vt,lin o ilog.lin o mud,lin D daa.lin O rls.lin $ rll0.1in # eps,lin @ epp.lin • avg-stats 28 o 12 8 4 0 8 16 24 32 40 48 56 64 72 Number of Processors Figure 8-52: Intra-node and action parallelism (nominal speed-up). 150 PARALI.,ELISMIN PRODUCTION SYSTEMS 20000 zx vt.lin o ilog.lin o mud.lin [] daa.lin 0 rls.lin $ rlp.lin \ 18000 _: tU "_ 16000 Processor Speed: 2 MIPS # 0 eps,lin epp,lin • avg-stats E _=14000 "o 12000 "" 10000 _j • ....-°'' ..... _..._#_____ sooo 0 8 Figure 8-53: 16 24 32 40 48 56 64 72 Number of Processors Intra-node and action parallelism (execution speed). SIMULATION RF.SULTS AND ANALYSIS 151 8.9. Summary In summary, the following observations can be made about the simulation results presented in this chapter: • When using a hardware task scheduler: o A parallel implementation has some intrinsic overheads as compared to a uniprocessor implementation. The overheads occur because of lack of sharing of memory nodes, synchronization costs and scheduling costs. Such overheads when using node or intra-node parallelism result in cost increase by a factor of 1.98, and when using production parallelism result in a cost increase by a factor of 2.64. The overheads are larger when using production parallelism because it is not possible to share two-input nodes between different productions. o The average execution speed of production systems on a uniprocessor (without considering the overheads of a parallel implementation) that executes two million register-register instructions per second is about 1180 wme-changes/sec. ' o The average saturation nominal speed-up (concurrency) obtained using production and action parallelism is 7.6, that using node and action parallelism is 10.7, and that using intra-node and action parallelism is 19.3. Using intra-node and action parallelism, the saturation execution speed is about 11,250 wme-changes/sec assuming a multiprocessor with 2 MIPS processors. The speed-up from parallelism is significantly lower when action parallelism is not used. For example, when intra-node parallelism is used, the saturation nominal speed-up is only 7.6 as compared to 19.3 when action parallelism is also used. o As a result of the larger number of changes made to working memory per cycle, the Soar systems show much larger benefits from action parallelism than OPS5 systems. For exarnple, when using intra-node parallelism, the speed-up for OPS5 systems increases by a factor of 1.84 as a result of action parallelism, while the speed-up for Soar systems increases by a factor of 3.30. 
o The simulations show that only 32-64 processors are needed to reach the saturation speed-up for most production systems. Thus a multiprocessor system with significantly more processors is not expected to provide any additional speed benefits. • When using binary Rete networks: o The average cost of executing a production system on a uniprocessor (with no overheads due to parallelism) goes up by a factor of 1.39 as compared to when using linear networks. This increase in cost is due to an increased number of node activations per change to working memory. The increased cost sometimes results in situations where the actual execution speed of a production system is less than that of its linear network counterpart, although the nominal speed-up achieved due to parallelism is more. o The average overhead due to parallelism when exploiting node or intra-node paral- 152 PARALLELISM IN PRODUCTION SYSTEMS lelism is a factor of 1.84 and that when exploiting production parallelism is a factor of 2.52. o The average saturation nominal spced-up (concurrency) obtained using production and action parallelism is 8.8, that using node and action parallelism is 11.6, and that using intra-node and action parallelism is 25.8. Using intra-node and action parallelism, the saturation execution speed is about 12,000 wme-changes/sec assuming a multiprocessor with 2 M IPS processors. o The benefits from using binary networks are much more significant for Soar systems than for OPS5 systems. In fact, the average saturation nominal speed-up for the OPS5 systems goes down by 16% (from 14.5 to 12.3) as a result of using binary networks, while the corresponding speed-up for the Soar systems goes up by 62% (from 24.1 to 39.2). The reason for the large increase is that there are several productions in the Soar systems which have very large number of condition elements (resulting in longer chains). o The above differences between the results for OPS5 systems and Soar systems suggests that there is no single strategy (binary or linear Rete networks) that is uniformly good for all production systems, and that the strategy to use should be determined individually for each production system. • When using software task queues: o The average overhead due to parallelism for intra-node parallelism is a factor of 3.23, for node parallelism is a factor of 2.96, and for production parallelism is a factor of 3.70. These factors are much larger than when a hardware task scheduler is used because of the large cost associated with enqueueing and dequeueing node activations from the task queues. o The saturation execution speed is 2080 wme-changes/sec when using production and action parallelism (as compared to 3550 when using a hardware task scheduler), 3330 wme-changes/sec when using node and action parallelism (as compared to 6600), and 4700 wme-changes/sec when using intra-node and action parallelism (as compared to 11,260). Thus the performance loss when using software task queues is a factor between 1.7 and 2.4. o Simulations show that the performance of a scheme using software task queues is best when the number of queues is approximately equal to the number of processors. • When memory contention overheads are included: o The memory contention overheads were studied in the simulations by assuming different processor memory interconnect bandwidths for multiprocessors with different number of processors. 
Chapter Nine
Related Work

Research on exploiting parallelism to speed up the execution of production systems is not very new, but the efforts have gained significant momentum recently. This gain in momentum has been caused by several factors: (1) Larger and larger production systems are slowly emerging, and their limited execution speed is becoming more noticeable. (2) With the increase in the popularity of expert systems, there has been a movement to use expert systems in new domains, some of which require very high performance (for example, real-time control applications). (3) There is a general feeling that the necessary speed-up is not going to come from improvements in hardware technology alone. (4) Finally, production systems, at least on the surface, appear to be highly amenable to the use of large amounts of parallelism, and this has encouraged researchers to explore parallelism. This chapter briefly describes some of the early and more recent efforts to speed up the execution of production systems through parallelism.

9.1. Implementing Production Systems on C.mmp

For his thesis [54], Donald McCracken implemented a production-system version of the Hearsay-II speech understanding system [16] on the C.mmp multiprocessor [53].77 He showed that high degrees of parallelism could be obtained using a shared-memory multiprocessor; one of his simulations showed that it was possible to keep 50 processors busy 60% of the time during the match phase. Most of the results presented in his thesis, however, are not applicable to the current research. This is because:

• The characteristics of the Hearsay-II system are distinct enough from current OPS5 and Soar programs that the results cannot be carried over.

• The speed-up obtained from parallel implementations of production systems is dependent on the underlying computation model. For example, it depends on the quality of the underlying match algorithm. If the underlying match algorithm is naive, it is possible to obtain a very large amount of speed-up from parallelism. Since the basic match algorithm used by Don McCracken in his thesis is significantly different from the OPS5-Rete algorithm, it is not possible to interpret his results in the current context.

• McCracken's thesis addresses issues related to the parallel implementation of Hearsay-II on the C.mmp architecture. Because of the differences in the hardware structure of the PSM considered in this thesis and C.mmp, many issues that were evaluated for C.mmp have no relevance for the PSM.

77 The C.mmp multiprocessor consisted of 16 PDP-11 processors connected to shared memory via a crossbar switch.
For example, since processors in C.mmp did not have caches, the performance of the parallel implementation was considerably affected by how the code was distributed among the multiple memory modules. In the PSM, since processors have local memory and a cache, the distribution of code is not an issue.

9.2. Implementing Production Systems on Illiac-IV

Charles Forgy, in one of his papers [18], has studied the problem of implementing production systems on the Illiac-IV computer system [7]. Since the Illiac-IV is a SIMD (single instruction stream, multiple data streams) computer, the main concern was to develop a match algorithm in which all processors simultaneously execute the same instructions on different data to achieve higher performance. In the algorithm described in the paper, a production system is initially divided into sixty-four partitions, corresponding to the number of processors in Illiac-IV. The paper does not detail the partitioning technique, but suggests that it should be such that similar productions are placed in different partitions. This is to ensure that the work is uniformly distributed amongst the processors. The Rete network for each partition is constructed and the associated code is placed in the corresponding processor. The network interpreter then executes the code in a manner similar to the uniprocessor interpreters, but with one exception: all node evaluations of one type are executed before the node evaluations of another type are begun. For example, the interpreter will finish executing all constant-test nodes before attempting to execute any memory nodes. This ensures that for most of the time all processors are executing nodes of the same type, and since nodes of the same type require the same instructions (though they may use different data), the SIMD nature of Illiac-IV is usefully exploited. Although the paper describes the algorithms for executing production systems on Illiac-IV in detail, no estimates are given for the expected speed-up from such an implementation.
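The phased execution strategy can be made concrete with a small sketch. The following Python fragment is a hypothetical illustration (the names and data structures are ours, not from Forgy's paper): pending node activations are grouped by node type, and each type is processed as a separate SIMD phase.

    from collections import defaultdict

    # One phase per node type; within a phase all processors execute the
    # same instruction sequence on their own activations.
    PHASES = ["constant-test", "memory", "and-node", "terminal"]

    def evaluate(node, token):
        pass  # stand-in for the per-node match code

    def simd_match_cycle(pending):
        """pending: list of (node_type, node, token) activations."""
        by_type = defaultdict(list)
        for node_type, node, token in pending:
            by_type[node_type].append((node, token))
        for phase in PHASES:
            for node, token in by_type[phase]:
                # On Illiac-IV these evaluations would proceed in lock-step,
                # one per partition/processor.
                evaluate(node, token)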
9.3. The DADO Machine

DADO [84, 85] is a highly parallel tree-structured architecture designed at Columbia University by Salvatore J. Stolfo and his colleagues to execute production systems. The proposed machine consists of a very large number (on the order of tens of thousands in the envisioned full-scale version) of processing elements, interconnected to form a complete binary tree. Each processing element consists of its own processor, a small amount of random access memory, and a specialized I/O switch that is constructed using a custom VLSI chip. Figure 9-1 depicts the DADO architecture as described in [84]. The Intel 8751 processors used in the prototype DADO are rated at 0.5 MIPS.

[Figure 9-1: The prototype DADO architecture, showing the control processor and a processing element consisting of an Intel 8751 processor, 8 KByte RAM, EPROM, and an 8-bit I/O switch.]

Several different algorithms for implementing production systems on DADO have been proposed [31, 60, 86]. Of the many proposed algorithms, the two offering the highest performance for OPS5-like production systems on DADO are (1) the parallel Rete algorithm and (2) the Treat algorithm.

In the implementation of the Rete algorithm on DADO [31], the complete production system is divided into 16-32 partitions, the actual number of partitions depending on the amount of production parallelism (see Section 4.2.1) present in the program. Separate Rete networks are generated for each of the partitions, and each partition is mapped onto a processing element at the PM-level78 and its associated subtree of processing elements (also called the WM-subtree). Using the processor at the PM-level as a control processor, the processing elements in the WM-subtree are used to store the working-memory elements, to associatively locate tokens to be deleted, and to perform conflict-resolution. The performance of the parallel Rete algorithm on the prototype DADO is predicted to be around 175 wme-changes/sec (for more details of the analysis and the assumptions, see [31]).

The implementation of the Treat algorithm on DADO [60] is similar to the implementation of the parallel Rete algorithm described above, but differs in one fundamental way. While the Rete algorithm stores state corresponding to both the α-memory nodes and the β-memory nodes in the Rete network, the Treat algorithm stores state corresponding only to the α-memory nodes.79 It is argued that storing the β-memory state is not very useful for DADO (especially for production systems where a large fraction of the working memory changes on every recognize-act cycle), because the relevant portion of the β-memory state80 can be computed very efficiently on DADO. This is because:

• It is possible to dynamically change the order in which tokens matching individual condition elements are combined, so as to reduce the amount of β-memory state that is computed. Such reordering is not possible in the Rete algorithm, where the combinations of condition elements for which state is stored are frozen at compile time (see Section 5.2.6).81

78 The PM-level (the production-memory level) is determined by the number of partitions that are made. For example, if the number of partitions is 16 the PM-level would be 4, and if the number of partitions is 32 the PM-level would be 5.

79 Recall from Section 2.4.2 that the Treat algorithm falls on the low end of the spectrum of state-saving match algorithms.

80 The relevant portion of the β-memory state corresponds to those tokens that include a reference to the working-memory change being processed. This is because only tokens involving the new working-memory element can cause a change to the existing conflict-set.
These two are, of course, only average numbers, and in practice, for some production systems Treat would do better and for some production systems Rete would do better. 9.4. The NON-VON Machine NON-VON [79, 80] is a massively parallel tree-structured architecture designed for AI applications at Columbia University by David Elliot Shaw and his colleagues. The proposed machine architecture consists of a very large number (anywhere from 16K in the prototype version to one million in the envisioned full-scale supercomputer) of small processing elements (SPEs) interconnected to form a complete binary network. Each small processing element consists of 8-bit wide data paths, a small amount of random-access memory (32-256 bytes), a modest amount of processing logic, and an I/O switch that permits the machine to be dynamically reconfigured to support various forms of interprocessor communication. The leaf nodes of the SPE-tree are also interconnected to form a two- dimensional orthogonal mesh. In addition to the small processing elements, the NON-VON architecture provides for a small number of large processing elements (LPEs). Specifically, each small processing dement above a certain fixed level in the binary network is connected to its own large processing element. The large processing elements are to be built out of off-the-shelf 32-bit 81Note that since each change to working memory results in seve_ changes to the conflict-set, there are at least a few productions for which the Treat algorithm has to compute the complete relevant fl-memory state. For example, it has to compute tokens matching the first two condition elements of the production, the tokens matching the first three condition elements of the productions, and so on for the entire production. Since much of this computation is done serially on DADO, the variation in the processing times for different productions is expected to be quite large, and consequently the speed-up from parallelism is expected to be less. 160 PARAIJ,ELISM IN PRODUCTION SYSTEMS microprocessors(for example, the Motorola68020),with a significantamount of local random-access memory. A large processingelcment normally stores the programs that are to be executed by the SPE-subtreebelow it, and it can broadcast instructions at a very high speed with the assistanceof a high-speed interface called the aclive memory comroller. With the assistance of the large processing dements the NON-VON architectureis capable of functioning in multiple-SIMD (single instruction stream, multiple data stream) mode. Figure 9-2 shows a picture of the proposed NON-VON architecture. LPE Network To Host J _ Leaf Mesh Connections • Small Processing Element D Large Processing Element / O Disk Head and Intelligent Head Unit Figure9-2: The NON-VON architecture. The proposed implementation of production systems on the NON-VON machine [37]is similarto the implementation of the Rete algorithmsuggested forthe DADO machine in the previoussection. However,many changes were made to accommodate the proposed data structures into the small amount of memory available within each SPE. For example, it was often necessaryto distribute a singje memory-nodetoken acrossmultiple SPEs. This fine distribution of state amongst the processing elements permits a greaterdegreeof associativeparaUelismthan what was possible in the DADO implementation. The performancepredicted for the execution of OPS5 on NON-VON [37]is about 2000 wme-changes/sec. 
The performancenumbers correspond to the case when both the large RF.LATED WORK 161 processing elements and the small processing elements in NON-VON function at a speed of about 3 MIPS. (Note that the significantly better performance of NON-VON over I)AI)O can be partly attributed to the facts that (1) NON-VON processing elements are 6 times faster than the prototype DADO processing elements, and (2) NON-VON LPEs have 4 times wider datapaths than the prototype DADO processing elements.) At this point, it might be appropriate to contrast architectures using a small number (32-64) of high-performance processors (for example, the scheme proposed in this thesis) against architectures using a very large number (10,000-1000,000) of weak processors (for example, DADO and NONVON). Studies for uniprocessor implementations show that using a single 3 MIPS processor, it is possible to achieve a performance of about 1800 wme-changes/sec, which is only 10% slower than the performance achieved by the NON-VON machine using thirty-two LPEs of 3 MIPS each and several thousand SPEs. The performance for the DADO machine is even smaller. reasons for the low performance There are two main of these highly parallel machines: (1) The amount of intrinsic parallelism available in OPS5-1ike production systems is quite small, as shown in the previous chapter. As a result, researchers have used the large number of processors available in the massivelyparallel machines as associative memory. However, this does not buy them too much, because hashing on a single powerful processor works just as well. (2) While a scheme using a small number of processors can use expensive and very high performance processors, schemes using a very large number of processors cannot afford fast processors for each of the processing nodes. The perfor- mance lost in the highly parallel machines due to the weak individual processing nodes is difficult to recover by simply using a large number of such nodes (since the parallelism is limited). Note, however, the massively parallel machines may do better for highly non-temporal production systems (production systems where a large fraction of the working-memory changes on every cycle), or for production systems where the number of rules affected per change to working memory is very large. 82 82Notethatthe techniquesdevelopedin thisthesiswillalsoresultin largerspeed-upsforprogramsonwhichthemassively parallelmachinesare expectedto dowell. Theonlyproblemariseswhenthepossiblespeed-upsare of the orderof several hundreds.Thisis becauseitis difficultto constructsharedmemorymultiprocessors withhundredsor thousandsofprocessors. It is suggestedthatin sucha caseit wouldbe bestto use a mixtureof the partitioningapproach(for example,as usedin implementationsof Rete on DADO and NON-VON)and the fine-grainedapproach(as proposedin this thesis),and implementit ontop ofa hierarchical multiprocessor. 162 PARALLELISM iN PRODUCTION SYSTEMS 9.5. Kemal Oflazer's Work on Partitioning and Parallel Processing of Production Systems Kemal Oflazer in his thesis [67] explores a number of issues related to the parallel processing of production systems. (1) He explores the task of partitioning production systems so that the work is uniformly distributed amongst the processors. (2) He proposes a new parallel algorithm for performing match for production systems. (3) He proposes a parallel architecture to execute the proposed algorithm. 
The new algorithm is especially interesting in that it stores much more state than the Rete algorithm in an attempt to cut down the variance in the processing required by different productions. 95.1. The Partitioning Problem The partitioning problem for production systems addresses the task of assigning productions to processors in a parallel computer in a way such that the load on processors is uniformly distributed. While the problem of partitioning the production systems amongst multiple processors may be bypassed in shared-memory architectures, like the one proposed in this thesis, it is central in all architectures that do not permit such sharing (for example, partitioning is essential for the algorithms proposed for the DADO and NON-VON machines). The advantages of schemes not relying on the shared-memory architectures are that scheduling, synchronization, and memory-contention heads are not present. over- The first part of Oflazer's thesis presents a formulation of the partitioning problem as a minimization problem, which may be described as follows. The execution of a production-system run may be characterized by the sequence T=(el,e 2..... et>, where t is the number of changes made to the working memory during the run and where ei is the set of productions affected by the ith change to working memory. 83 Let H =(YI 1..... I-l/c) be some partitioning of the production system onto k processors. With such a partitioning, the cost of executing the production t p ¢ e.NII. system on a parallel processor is COStT,rI = _i=lmaxa<j< k(_ z Jc(p)), where ci is a cost function that gives the processing required by productions for the ith change to working memory. Using this terminology, the partitioning problem is simply stated as the problem of discovering a partitioning H such that COStT,n is minimized. The complexity of finding the exact solution to this minimization problem is shown by Oflazer to be NP-complete [26]. In addition to analysis of some simple partitioning methods like random, round-robin, and contextbased, Oflazer's thesis presents a more complex heuristic method for partitioning that relies on data 83RecaU that a production is said to be affected by a change to working memory, if the working-memory dement matches at least one of its condition dements, that is, if the state corresponding to that production has to be updated. REI.ATEDWORK 163 obtained from actual production-system runs. The inputs to the new partitioning method consist of the affect-sets for each of the changes to working memory and the cost associated with each affected production in these affect-sets. The algorithm is very fast and gives results that are 1.15 to 1.25 times better than the results of the simpler partitioning strategies. 9.5.2. The Parallel Algorithm The second part of Oflazer's thesis concerns itself with a highly parallel algorithm for the stateupdate phase of match in production systems. The algorithm is based on the contention that both Treat and Rete algorithms are too conservative in the amount of state they store. For example, Treat only stores tokens matching the individual condition elements of productions, and Rete only stores tokens that match the individual condition elements and some fixed combinations of the condition elements. Oflazer's parallel algorithm proposes that the tokens matching not some but all combinations of condition elements of a production should be stored. 
The main motivation for doing so is to reduce the variance in the processing requirements of the various affected productions in any given match cycle. For example, consider the Treat algorithm. Since it does not store state corresponding to any of the combinations of condition elements, a lot of state has to be computed when a change is made to the working memory. Much of the state computation that is done after the change is made to the working memory could have been done beforehand, thus reducing the interval between the time the working-memory change is made and the time the conflict-set is ready. A similar argument is used against the Rete algorithm. 84 Oflazer also proposes a new representation for storing state corresponding to the partial matches for a production. He introduces the notion of a null working-memory condition elements, satisfies all inter-condition tests for productions, element that matches all and is always present in the working memory. The state of a production is represented by a set of instance elements (IEs), where each instance element has the form <(tl,w)(t2,w2)... (to,we)>, where c is the number of condition elements in the production, ti is the tag associated with the ith slot, mad wi is the working-memory element associated with the/h slot. The working-memory element wi must satisfy the ith condition element of the production, and the c-tuple <w,w 2..... production, that is, the working-memory w> must consistently satisfy the complete elements together should satisfy all the inter-condition element tests (note that some of the slots may point to the null working-memory elemen0. The tags are used to help detect and delete some of the redundant state generated for a production. 84Oflazer's thesis, however, does not demonstrate clearly if the variation in the processing requirements of productions is actually reduced by the proposed scheme. Storing state for all combinations of condition elements can result in some productions requiring very large amount of processing, a situation which may not have occurred in the Treat or Rete algorithms (see Section 9.5.4). 164 PARALLELISM IN PRODUCI'ION SYSTEMS When a change is made to the working result of the interaction representation the state-update the same instance mation content and in parallel. or instance otherwise resources. a sequential architecture 9.5.3. The Parallel The architecture redundant instance elements to eliminate to give incorrect of redundant requires content for that production. to help detect and eliminate As a result of (multiple copies of these redundant instance results and they also use up is problematic redundant instance because element be Oflazer presents the detailed redundant of the old is a subset of the infor- instance elements that each potentially as a of the proposed to each instance element whose information they can cause the match process--it An advantage There is, however, one problem. It is necessary The elimination, is obtained algorithms instance elements. Architecture proposed structured machine, processing capabilities of the proposed elements elements). to all other instance elements and a hardware of the new change it is possible to generate of other instance scarce processing compared element because it is essentially independently processing, the new state for a production the old state and the new change. 
9.5.2. The Parallel Algorithm

The second part of Oflazer's thesis concerns itself with a highly parallel algorithm for the state-update phase of match in production systems. The algorithm is based on the contention that both the Treat and Rete algorithms are too conservative in the amount of state they store. For example, Treat stores only tokens matching the individual condition elements of productions, and Rete stores only tokens that match the individual condition elements and some fixed combinations of the condition elements. Oflazer's parallel algorithm proposes that the tokens matching not some but all combinations of condition elements of a production should be stored. The main motivation for doing so is to reduce the variance in the processing requirements of the various affected productions in any given match cycle. For example, consider the Treat algorithm. Since it does not store state corresponding to any of the combinations of condition elements, a lot of state has to be computed when a change is made to the working memory. Much of the state computation that is done after the change is made to the working memory could have been done beforehand, thus reducing the interval between the time the working-memory change is made and the time the conflict-set is ready. A similar argument is used against the Rete algorithm.84

Oflazer also proposes a new representation for storing the state corresponding to the partial matches for a production. He introduces the notion of a null working-memory element that matches all condition elements, satisfies all inter-condition tests for productions, and is always present in the working memory. The state of a production is represented by a set of instance elements (IEs), where each instance element has the form <(t1,w1) (t2,w2) ... (tc,wc)>, where c is the number of condition elements in the production, ti is the tag associated with the ith slot, and wi is the working-memory element associated with the ith slot. The working-memory element wi must satisfy the ith condition element of the production, and the c-tuple <w1, w2, ..., wc> must consistently satisfy the complete production, that is, the working-memory elements together should satisfy all the inter-condition-element tests (note that some of the slots may point to the null working-memory element). The tags are used to help detect and delete some of the redundant state generated for a production.

When a change is made to the working memory, the new state for a production is obtained as a result of the interaction between the old state and the new change. An advantage of the proposed representation for state is that the interaction of the new change with each instance element is essentially independent of its interaction with the other instance elements, so that the state-update may be computed in parallel. There is, however, one problem. As a result of the independent processing, it is possible to generate redundant instance elements: multiple copies of the same instance element, or instance elements whose information content is a subset of the information content of other instance elements. It is necessary to eliminate the redundant instance elements because otherwise they can cause the match process to give incorrect results, and they also use up scarce processing resources. The elimination is problematic because it requires that each potentially redundant instance element be compared to all other instance elements. Oflazer presents detailed algorithms and a hardware architecture to help detect and eliminate the redundant instance elements.

84 Oflazer's thesis, however, does not demonstrate clearly whether the variation in the processing requirements of productions is actually reduced by the proposed scheme. Storing state for all combinations of condition elements can result in some productions requiring a very large amount of processing, a situation which may not have occurred in the Treat or Rete algorithms (see Section 9.5.4).

9.5.3. The Parallel Architecture

The architecture proposed for the algorithm described above is a parallel reconfigurable tree-structured machine, with fast processors located at the leaf nodes and specialized switches with simple processing capabilities located at the interior nodes of the tree. Figure 9-3 shows a high-level picture of the proposed architecture.

[Figure 9-3: Structure of the parallel processing system, showing the controller at the root, the switches (S) at the interior nodes, and the instance-element processors (P) at the leaves.]

In mapping the proposed algorithm onto the suggested architecture, each production is assigned to some subset of the leaf processors, and as a result each processor is responsible for some subset of the productions in the program. The processors assigned to a production are responsible for processing the instance elements associated with that production and for keeping the state of that production updated.85 The internal switch nodes of the tree are used to send information to and from the controller located at the root. They are also used to isolate small sections of the tree during the redundant-instance-element-removal phase of the algorithm.

The thesis presents some performance figures based on simulations done for the XSEL [56] and R1 [55] production systems. The simulations assume that (1) processors take 100 ns to execute an instruction from the on-chip memory and 200 ns to execute an instruction from the external memory, (2) each stage of switch nodes takes 200 ns to compute and transfer information to the next stage, and (3) the number of processors allocated to a production is the next power of two larger than the average number of instance elements for that production. Under these assumptions the time to perform match for a single change to working memory is around 150 μs for XSEL (6666 wme-changes/sec) and around 210 μs for R1 (4750 wme-changes/sec). In all these runs an average of about 300 processors were used to update the instance elements. However, if required on any given match cycle, more processors were assumed to be available.

85 The number of processors assigned to a production depends on how large its associated state becomes during actual runs. A modified version of the partitioning algorithm given earlier in the thesis is used to partition the productions amongst the processors.

9.5.4. Discussion

The most interesting aspect of Oflazer's research is the proposed parallel state-update algorithm. It provides yet another distinct data point (at the high end of the spectrum) in the space of state-saving algorithms for match. The proposed architecture (using 256-1024 processors) also forms a distinct data point as far as the number of processors is concerned. It falls in between the multiprocessor architecture proposed in this thesis with 32-64 processors, and the DADO and NON-VON architectures with 10,000 or more processors.

A possible problem with Oflazer's parallel algorithm is that the state associated with a production may potentially become extremely large. Such a production would then require an extremely large number of processors to update its state, or it would become a bottleneck. For example, consider the following production, which locates blocks that are at height 6 from a table and prints the result.
    (p blocks-at-height-6
        (on ^block <a> ^block table)
        (on ^block <b> ^block <a>)
        (on ^block <c> ^block <b>)
        (on ^block <d> ^block <c>)
        (on ^block <e> ^block <d>)
        (on ^block <f> ^block <e>)
      -->
        (call print "block" <f> "is at height 6"))

Now suppose there are 100 blocks in the blocks world, and with each block there is the property "on ^block <x> ^block <y>", where this property is represented as a working-memory element. Thus there would be 100 working-memory elements matching CE-2 (condition element 2), and similarly 100 working-memory elements matching each of the condition elements CE-3 through CE-6. Since Oflazer's algorithm stores state corresponding to all possible combinations of condition elements, consider the number of tokens that would match CE-2, CE-4, and CE-6 together. Since there are no common variables between these condition elements, the total number of tokens matching these condition elements together would be 100 x 100 x 100 = 1,000,000. If a single new block is added to the system, the number of matching tokens would go to 101 x 101 x 101 = 1,030,301, that is, about 30,000 new tokens would have to be processed.

Oflazer suggests a solution to the above problem: splitting such productions into two or more pieces, so that the combinations of condition elements for which state is stored are controlled. For example, the above production may be split into the following two productions.

    (p blocks-at-height-6-part-1
        (on ^block <a> ^block table)
        (on ^block <b> ^block <a>)
        (on ^block <c> ^block <b>)
      -->
        (send-message blocks-at-height-6-part-2 ^vars <c>))

    (p blocks-at-height-6-part-2
        (message ^vars <c>)
        (on ^block <d> ^block <c>)
        (on ^block <e> ^block <d>)
        (on ^block <f> ^block <e>)
      -->
        (call print "block" <f> "is at height 6"))

While splitting the production into two parts reduces the number of tokens that are generated, it reintroduces some of the sequentiality in state processing, which is exactly what the algorithm was trying to avoid. The goodness of the proposed solution depends on the number of productions that have to be split, and on the performance penalty they cause. Oflazer notes that for the XSEL system 42 productions had to be split and for the R1 system 22 productions had to be split. However, no clear numbers about the performance lost due to such splitting are available at this time.
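The scale of the problem that motivates the splitting can be checked directly. The token arithmetic for the unsplit production above is (plain arithmetic; no assumptions beyond those stated in the example):

    # With no shared variables among CE-2, CE-4, and CE-6, every combination
    # of matching working-memory elements forms a stored token.
    blocks = 100
    before = blocks ** 3        # 1,000,000 tokens for CE-2 x CE-4 x CE-6
    after = (blocks + 1) ** 3   # 1,030,301 tokens after one block is added
    print(after - before)       # 30,301 new tokens to compute and store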
Another drawback of Oflazer's algorithm is that it cannot exploit action parallelism, that is, it cannot easily process multiple changes to working memory in parallel. This is because (1) the multiple changes to working memory often affect the same set of productions, which requires that the instance elements for such productions respond to the effects of several changes to their slots, and (2) the algorithm requires that multiple changes to the slots of an instance element be processed sequentially. Since action parallelism is exploited very usefully by the implementation proposed in this thesis, not being able to exploit it is a significant disadvantage.

Finally, it is interesting to compare the performance of the proposed algorithm to that of the Rete algorithm. Oflazer's algorithm using about three hundred 5-10 MIPS 16-bit processors achieves 4500-7000 wme-changes/sec. The Rete algorithm on a 5 MIPS 32-bit uniprocessor can achieve a speed of 3000 wme-changes/sec. The reasons for the small amount of speed-up after using so many more processors appear to be: (1) The intrinsic parallelism in production systems is limited, so large amounts of speed-up cannot be expected. (2) The strategy of keeping large amounts of state for productions is not working; that is, while keeping large state increases the number of processors that are required, it does not at the same time help significantly reduce the variation in the processing times of productions. (3) There is significant overhead in the proposed parallel implementation, for example, the time taken to remove redundant instance elements, which nullifies much of the potential speed-up.

9.6. Honeywell's Data-Flow Model

Researchers at Honeywell CSC have also been exploring the use of parallelism for executing production-system programs [71]. They have proposed a tagged-token data-flow computation model86 for capturing the inherent parallelism present in OPS5 production systems. The proposed model is based on the Rete algorithm, and the key idea is to translate the Rete network into a data-flow graph that explicitly shows the data dependencies. Similarly, operations performed in the Rete algorithm are encapsulated into appropriate activities (or tasks) in the data-flow model, which can then be executed on the available physical processing resources. For example, consider the case when a token arriving at an and-node of the Rete network finds n tokens in the opposite memory node. In such a case, n activities would be generated in the proposed data-flow model: one activity for testing the incoming token for consistent variable bindings against each of the n opposite memory tokens. The paper [71] presents details about the kinds of nodes that are required in the data-flow graph and the functionality associated with those nodes. However, details about the hardware structure onto which the proposed model is to be mapped, and about how the necessary synchronization and scheduling is to be performed, are not given in the paper. Since the size of the individual activities in the proposed data-flow model is very small (about 10 machine instructions or less), extremely efficient scheduling and synchronization methods will have to be developed if the approach is to be successful.

86 A tagged-token data-flow model is different from the conventional data-flow models. While there can be only one token present on an output arc in the conventional data-flow model, there can be multiple tokens present on the output arcs of the tagged-token data-flow model.

9.7. Other Work on Speeding-up Production Systems

In addition to the efforts mentioned above, which specifically address the issue of speeding up OPS production systems through parallelism, there are several other ongoing efforts [8, 49, 69, 72, 77, 89] to speed up production systems. Two of these, noted below, have been carried out within the PSM project and complement the work done in this thesis.

Jim Quinlan has done a comparative analysis of computer architectures for production-system machines [69]. He uses run-time measurements on production systems to evaluate the performance of five computer architectures (the VAX-11/780, the Berkeley RISC II computer, a custom-designed microcoded machine for production systems, a custom RISC processor for production systems, and the Pyramid computer). His main conclusions are: (1) The custom-designed microcoded machine is the best CPU architecture for production systems. Although it takes more machine cycles than the custom-designed RISC processor, it has lower processor-memory bandwidth requirements. (2) The difference in the performance of the five architectures is not very large. As a result the motivation for building a custom processor is small.

Ted Lehr presents a custom pipelined RISC architecture for production-system execution [49]. The proposed architecture has a static branch-prediction strategy, a large register file, and separate instruction and data fetch units. Since the proposed architecture is very simple, he also discusses the viability of implementing it in GaAs.

Chapter Ten
Summary and Conclusions

In this thesis we have explored the use of parallelism to speed up the execution of production-system programs. We have discussed the sources of parallelism available in OPS5 and Soar production systems, the design of a suitable parallel match algorithm, the design of a suitable parallel architecture, and the implementation of the parallel match algorithm on the parallel architecture. This chapter reiterates the main results of the thesis and discusses directions for future research.

10.1. Primary Results of Thesis

The study of parallelism in OPS5 and Soar production systems in this thesis leads us to the following conclusions:

1. The Rete class of algorithms is highly suitable for parallel implementation.

2. The amount of speed-up available from parallelism is quite limited, about 10-fold, in contrast to initial expectations of 100-fold to 1000-fold.

3. To obtain the limited speed-up that is available, it is necessary to exploit parallelism at a very fine granularity.

4. To exploit the suggested sources of parallelism, a multiprocessor architecture with 32-64 high-performance processors and special hardware support for scheduling the fine-grained tasks is desirable.

The above conclusions are expanded in the following subsections.
His main conclusions are: (1) The custom designed microcoded machine is the best CPU architecture for production systems. Although it takes more machine cycles than the custom designed RISC processor, it has lower processor-memory bandwidth requirements. (2) The difference in the performance of the six architectures is not very large. As a result the motivation for building a custom processor is small. Ted Lehr presents a custom pipelined RISC architecture for production-system execution [49]. The proposed architecture has a static branch prediction strategy, a large register file, and separate instruction and data fetch units. Since the proposed architecture is very simple, he also discusses the viability of implementing it in GaAs. 169 SUMMARY AND CONCLUSIONS Chapter Ten Summary and Conclusions In this thesis we have explored the use of parallelism to speed up the execution of productionsystem programs. We have discussed the sources of parallelism available in OPS5 and Soar produc- tion systems, the design of a suitable parallel match-algorithm, architecture, and the implementation the design of a suitable parallel of the parallel match-algorithm on the parallel architecture. This chapter reiterates the main results of the thesis and discusses directions for future research. 10.1. Primary Results of Thesis The study of parallelism in OPS5 and Soar production systems in this thesis leads us to make the following conclusions: 1. The Rete-class of algorithms is highly suitable for parallel implementation. 2. The amount of speed-up available from parallelism is quite limited, about 10-fold, in contrast to initial expectations of 100-fold to 1000-fold. 3. To obtain the limited speed-up that is available, it is necessary to exploit parallelism at a very fine granularity. 4. To exploit the suggested sources of parallelism, a multiprocessor architecture with 32-64 high-performance processors and special hardware support for scheduling the finegrained tasks is desirable. The above conclusions are expanded in the following subsections. 10.1.1.Suitabilityof the Rete-Classof Algorithms The thesis empirically shows that the Rete-class of algorithms is highly suitable for parallel implementation of production systems. While Rete-class algorithms use significantly fewer processors 170 PARALI-ELISM IN PRODUCi'ION SYSTEMS than other proposed algorithms [37, 60, 67] (and in that sense are less concurrent87), simulations show that they perform better than these other algorithms. Some of the reasons for choosing and parallelizing the Rete class of algorithms are the following (see Section 2.4 for details). In designing a parallel algorithm, the first choice is between state-saving algorithms and non-state saving algorithms. State-saving algorithms are the obvious choice since only a very small fraction (less than 1%) of the global working-memory changes on each recognize-act cycle. Within the class of state-saving algorithms itself, however, many different algorithms can be designed, each storing different amounts of state. The Rete-class of algorithms store an intermediate amount of state (between the low extreme of the Treat algorithm [60] and the high extreme of Oflazer's parallel algorithm [67]). The state stored for a production in Rete corresponds to (1) matchings between individual condition elements and working-memory elements, and (2) matchings between some fixed combinations of condition elements occurring in the left-hand side of the production and tuples of working-memory elements. 
In algorithms like Treat, where the state stored is small, the disadvantage is that much of the information about partial matches with the unchanged part of the working memory has to be recomputed. In algorithms like Oflazer's parallel algorithm, where the state stored is large, the disadvantage is that a large amount of processing resources is wasted in computing partial matches that never reach the conflict-set. We believe that the Rete class of algorithms avoids the disadvantages of both Treat and Oflazer's parallel algorithm. Note, however, that we do not wish to argue that the Rete class is the best class of parallel algorithms, only that the Rete-class algorithms fall in an interesting part of the spectrum of state-saving algorithms. The suitability of Rete as a parallel algorithm is also based on its other features, for example, the discrimination net used for the selection-phase computation and the dataflow-like nature of the overall computation graph. Finally, the claim for the suitability of the Rete class of algorithms for parallel implementation is based on the results of simulations, which show that the execution speeds obtained by parallel Rete compare favorably with other proposed algorithms.

87 Statements about the amount of parallelism available in a class of programs can often be misleading. This is because it is always possible to construct parallel algorithms that can keep a very large number of processors busy without providing any significant speed-up over the known sequential algorithms. Thus simply talking about the average number of processors that are kept busy by a parallel algorithm is not very useful, at least not in isolation from the absolute speed-up over the best known sequential algorithms.

10.1.2. Parallelism in Production Systems

One of the main results of this thesis is that the speed-up obtainable from parallelism is quite limited, of the order of a few tens rather than of the order of a few hundreds or thousands.88 The initial expectations about the speed-up from parallelism for production-system programs were very high, especially for large programs. The general idea was that if match for each production is performed in parallel, then a speed-up proportional to the number of productions in the program would be achieved [84]. This idea was quickly abandoned as results from actual measurements on production systems were obtained (see Chapter 3 and [30, 31]). The reasons for the limited speed-up were found to be: (1) The number of productions that are affected as a result of a change to working memory is very small (about 26), and since affected productions take most of the processing time, assigning a processor to each production can result in only 26-fold speed-up. (2) The speed-up is actually much less than 26-fold, because there is a large variance in the processing requirements of the affected productions. In fact, using production parallelism in a straightforward manner was found to result in less than 5.1-fold nominal speed-up.89 (3) Overheads due to loss of sharing in the Rete network and overheads due to the parallel implementation cause the real speed-up to be only 1.9-fold (a factor of 2.64 is lost).

An attempt to increase the size of the affect-sets by processing all changes resulting from a production firing in parallel (the use of action parallelism) results in a nominal speed-up of 7.6-fold, instead of the 5.1-fold achieved otherwise. The increase in speed-up is much smaller than the number of working-memory changes processed in parallel, because the affect-sets of the multiple changes overlap significantly.

Since the number of productions that are affected on each cycle is not controlled by the implementor of the production-system interpreter (it is governed mainly by the author of the program and the nature of the task), one solution to the problem of limited speed-up is to somehow decrease the variance in the processing required by the affected productions. This requires that the processing associated with an affected production be distributed amongst multiple processors by exploiting parallelism at a finer granularity. To achieve this end, the thesis proposes the use of node parallelism, that is, processing activations of distinct nodes in the Rete network in parallel. Using node parallelism and action parallelism results in a nominal speed-up of about 10.7-fold, as compared to the 7.6-fold achieved for production and action parallelism. The overheads in this case are a factor of 1.98, so that the real speed-up is 5.4-fold.

88 Note that the speed-up numbers in the following discussion are with respect to the sequential Rete algorithm, which is so far the fastest sequential algorithm for implementing production systems.

89 Nominal speed-up or concurrency refers to the average number of processors that are busy in a parallel implementation. In contrast, the real speed-up refers to the speed-up with respect to the highest-performance sequential algorithm.
The increase in speed-up is much smaller than the number of working-memory changes processed in parallel, because the affect-sets of the multiple changes overlap significantly. Since the number of productions that are affected on each cycle is not controlled by the implementor of the production-system interpreter (it is governed mainly by the author of the program and the nature of the task), one solution to the problem of limited speed-up is to somehow decrease the variance in the processing required by the affected productions. This requires that the processing associated with an affected production be distributed amongst multiple processors by exploiting parallelism at a finer granularity. To achieve this end, the thesis proposes the use of node parallelism, that is, processing activations of distinct nodes in the Rete network in parallel. Using node parallelism and action parallelism results in a nominal speed-up of about 10.7-fold, as compared to 7.6fold achieved for production and action parallelism. The overheads in this case are a factor of 1.98, so 88Note that the speed-up numbers in the following discussion are with respect to the sequential Rete algorithm, which is so farthe fastest sequential algorithm for implementing production systems. 89Nominal speed-up or concurrency refers to the average number of processors that are busy in a parallel implementation. In contrast, the real speed-up refers to the speed-up with respect to the highest performance sequential algorithm. 172 PARALLEI.ISM INI'RODUCI'IONSYSTEMS that the real speed-up is 5.4-fold. Studying the results in detail, two bottlenecks were found to be limiting the speed-up. These were (1) the cross-product effect and (2) the long-chain effect. The cross-product effect refers to the case when an incoming token finds several matching tokens in the opposite memory node, and as a result of which a large number of tokens are sent to the successor of that node. Since multiple activations of any given node are processed sequentially when using node parallelism, the cross-product effect resulted in large processing times for some of the productions, thus reducing the speed-up obtained. The long-chain effect refers to the occurrence of long chains of dependent node activations. Since these activations, as their name suggests, cannot be processed concurrently, they result in some productions taking much longer to finish than others, thus resulting in small speed-ups. The long- chain effect is especially bad for Soar systems, where the number of condition elements per production is larger than in OPS5 systems and as a result of which the networks for productions often contain long chains. As a solution to the problem of the cross-product effect, the thesis proposes the use of intra-node parallelism, where in addition to processing activations of different nodes in the Rete network in parallel, it is possible to process the relevant activations of the same node in parallel. Using intranode and action parallelism it is possible to achieve 19.3-fold nominal speed-up, as compared to node and action parallelism where only 10.6-fold speed-up is achieved. The corresponding average execution speed on a multiprocessor with 2 MIPS processors is 11,250 wme-changes/sec. The nominal speed-up for individual systems actually varies between 12-fold for some OPS5 systems and 35-fold for some Soar systems, and the execution speed varies between 7000 wme-changes/sec and 17,000 wme-changes/sec. 
As a solution to the problem of the long-chain effect, the thesis proposes using binary networks for productions rather than linear networks as used in the original Rete algorithm. 9° This way the maximum length of a chain can be reduced to the logarithm of the number of condition elements in a production. The average nominal speed-up obtained using binary networks and intra-node and action parallelism is 25.8-fold. The corresponding average execution speed achieved is 12,000 wmechanges/sec. Although the average nominal speed-up is significantly larger than the 19.3-fold ach- ieved with linear networks, the execution speed is not much higher than the 11,250 wme-changes/sec 90Thiscorresponds to varyingthefixedcombinations ofconditiondementsforwhichpartialmatchesarestoredbytheRete algorithm. 173 SUMMARY AND CONCLUSIONS achieved with linear networks. This is because, for many of the systems, using binary networks results in a larger number of node activations per change to working memory. Thus, even though the speed-up with respect to the uniprocessor implementation using binary networks is higher, the uniprocessor implementation using binary networks is slower than that using linear networks, so that not much is gained on the whole. The speed-up for the individual systems when using binary networks varies between ll-fold and 56-fold. The benefits of using the binary networks are small (or sometimes even negative) for OPS5 systems since the average number of condition elements per production is small. The benefits for Soar systems are much larger since the average number of condition elements in Soar productions is much higher. Thus the decision whether or not to use binary networks should be made depending on the characteristics of the individual production systems. In fact, there is no reason not to have a mixture in the same program, that is, have some productions with linear networks, some productions with binary networks, and some productions with a mixture of linear and binary network forms. The overall aim is to minimize the state that is computed, while at the same time avoiding long chains. This job of selecting the network form for productions may be done by a programmer who understands the underlying implementation, or by a program which uses static and run-time information, or by some combination of the two. 10.1.3. Software Implementation Issues The thesis addresses a number of issues related to the correctness and the efficiency of the parallel Rete algorithm. Some of the issues discussed are: (1) the use of hashing for constant-test nodes at the top-level of the Rete network, (2) the lumping of multiple constant-test node activations when scheduling them on a processor, (3) the lumping of memory-nodes with the associated two-input nodes, (4) the use of hash tables to store tokens in memory nodes, (5) the set of concurrently processable activations of a node, (6) the processing of conjugate pairs of tokens, (7) the locks necessary for manipulating the contents of memory nodes, and (8) the use of binary networks versus linear networks. Although the details are not relevant here, the above features have considerable impact on the overheads that are imposed by the parallel implementation, and are consequently very important if a parallel implementation is to be successful. 174 PARALLELISM IN PRODUCI'ION SYSTEMS 10.1.4. 
10.1.4. Hardware Architecture

To implement the parallel Rete class of algorithms described in the previous paragraphs, the thesis proposes the use of a shared-memory multiprocessor architecture. The basic characteristics of the proposed architecture are: (1) It is a shared-memory multiprocessor with 32-64 processors. (2) The individual processors are high-performance computers, each with a cache and a small amount of private memory. (3) The processors are connected to the shared memory via one or more shared buses. (4) The multiprocessor supports a hardware task scheduler to help enqueue node activations on the task queue and to assign pending tasks to idle processors.

The main reason for suggesting a shared-memory multiprocessor is a consequence of the fine granularity at which parallelism is exploited by the parallel Rete algorithm. The parallel Rete algorithm requires that the data structures be shared amongst multiple concurrent processes, which makes it appropriate to use a shared-memory architecture. When used in conjunction with a centralized task queue, a shared-memory multiprocessor also makes it possible to bypass the load-distribution problem. The reason for using only 32-64 processors is that simulations show that no additional speed-up is gained by using a larger number of processors. Since only a small number of processors are used, it is possible to use expensive high-performance processors. Each processor should have a cache and some private memory to enable a high speed of operation (to avoid the large latency to the shared memory), and also to reduce contention for the shared memory modules and the shared bus. The thesis recommends a bus-based architecture instead of an architecture based on a crossbar or other processor-memory interconnects, because it is easier to construct intelligent cache-coherence schemes for shared buses and because simulations show that a single bus would be able to support the load generated by 32-64 processors (provided reasonable cache-hit ratios are obtained).91

To avoid the problem of load distribution, the thesis suggests the use of a centralized task queue containing all pending node activations. The task queue required for this job is quite complex, since not all node activations in the task queue can be processed concurrently; in fact, the set of concurrently processable node activations changes dynamically (see Section 5.2.4 for details). Furthermore, since the average processing required by a node activation is only 50-100 instructions, it is necessary to have a mechanism whereby node activations can be enqueued and dequeued extremely fast, so that the task queue does not become a bottleneck. The thesis proposes two mechanisms for solving the scheduling problem: the use of a hardware scheduler and the use of multiple software task queues.

91 Many of the design recommendations made in the thesis are highly technology dependent. Future advances in processor technology and interconnection-network technology may make it necessary to reevaluate the recommendations.
Simulations show that the performance when using multiple software task queues is about half of the performance when using the hardware task scheduler. 10.2. Some General Conclusions Amongst Treat, Rete, and Oflazer's algorithm three very distinct points in the space of possible state-saving match algorithms are covered. Alongside in the implementations of the above al- gorithms, three distinct points in the architecture space have also been covered--small-scale parallel architectures like a 32-node multiprocessor, medium-scale parallel architectures like Oflazer's 512 processor parallel machine, and highly parallel architectures like DADO and NON-VON with tens of thousands of processors. In terms of performance, Treat on DADO is expected to execute at a rate of about 215 wme-changes/sec, assuming sixteen thousand 0.5 MIPS node processors [60]; Rete on NON-VON is expected to execute at about 2000 wme-changes/sec, assuming thirty-two 3 MIPS large processing elements and sixteen thousand 3 MIPS small processing elements [37]; Oflazer's algorithm is expected to execute at about 4750-7000 wme-changes/sec, assuming five-hundred-and-twelve 5-10 MIPS custom processors [67]; Rete on a single 2 MIPS processor is expected to execute at about 1200 wme-changes/sec; and Rete on the production-system machine proposed in this thesis is expected to execute at 11000 wme-changes/sec, assuming thirty-two node multiprocessor using 2 MIPS processors. 92 The first general conclusion that can be derived from the relative comparison of various algorithms and architectures discussed above is that the speed-up over the best existing sequential algorithm by any of the parallel implementations is small, and this is irrespective of the type of algorithm or architecture being used. Thus the answer to the question "Is it the sequentiality of the Rete algorithm that is blocking the speed-up from the use of parallelism?" is most probably no, because other algorithms at both ends of the state-saving spectrum have not shown any better results. Thus it 92No attempt has been made to normalize the meeds of the processors used in the architectures discussed above. For example, the speed of preduction-system execution on the DADO machine is given when processors have 8-bit wide datapaths and execute at 0.5 MIPS (as described in the original paper), and not when the processors have 32-bit datapaths and execute at 2 MIPS. This is because the overall effect on the feasibility of the different architectures is very different when individual processors are speeded up. For example, in Oflazer's machine, 5 MIPS processors are fine because they work out of their local memory. Using 5 MIPS processors for the PSM may, however, cause problems since the shared bus would become a bottleneck. The reader may, however, wish to do some such normalization (based on architectural feasibility) to gain more comparability. 176 PARALI.ELISM IN PRODUCI'ION SYS'i'F.MS appears that it is not the Retc algorithm, or the Treat algorithm, or Oflazer's algorithm that is preventing the use of parallelism, but it is the inherent nature of the computation involved in im- plementing production systems that prevents significant speed-up from parallelism. 
Another conclusion that we can draw is that while massively parallel architectures like DADO and NON-VON may be effective for the task of executing production systems, that is, they can execute production systems at reasonably high speeds, the approach of using a small number of tightlycoupled high-performance processors with the migration of critical functionality to special purpose hardware seems to be preferred. Since the results in this thesis are based on the analysis of existing programs, an interesting question to ask is: "Are existing production-system programs not able to exploit parallelism because they were not written with parallel implementations in mind, or alternatively, will future production-system programs when written with parallel implementations in mind be able to exploit more parallelism?" The answer is undoubtedly yes, but probably to a limited extent only. We believe that while additional factors of two, four, or eight are very probable, it is doubtful that additional factors of fifty, a hundred, or more will be obtained in the future. There are reasons to believe that the two main factors limiting the speed-up (the small number of affected productions per cycle and the small number of changes to working memory per cycle) will not change significantly in the near future, and maybe not even in the long run (see Section 4.7 for more details). 93 On the positive side, however, the techniques that have been developed in this thesis will still be applicable to the new class of systems, the only difference being that the speed-up will be larger. 10.3. Directions for Future Research Although many issues regarding the use of parallelism in implementing production systems have been addressed in this thesis, many more remain to be addressed. This section discusses some such issues. In the area of design of algorithms and architectures it is extremely useful to have a generally accepted set of benchmark programs that can be used by all researchers. At this point there are no such established benchmarks for production-system programs and as a result one often ends up comparing apples to oranges. The PSM group at Carnegie-Mellon University is beginning an effort 93There may be exceptions to the above claims for special classes of programs, for example, production-system programs working on low-level vision. But for a large majority of tasks in Artificial Intelligence they are expected to be true. 177 SUMMARY AND CONCLUSIONS to assemble such a set of programs. 94 When selecting such a set of programs it is necessary to ensure that there is sufficient variability in them, so that the proposed algorithms and architectures can be tested along various dimensions. For example, there should be programs that are knowledge-search intensive, those that are problem-space search intensive, those with small and those with large working memory, those with small and those with large production memory, and so on. The final success of such an effort, of course, would be established only if the selected set of benchmark programs is adopted by the rest of the research community and used to evaluate many architectures. A criticism that has often been cited for the current work being done in parallel implementations of production systems is that, existing programs were written with sequential implementations in mind, and that they do not reflect the true parallelism which is to be found in programs written with parallel implementations in mind [40, 86]. 
As stated a few paragraphs earlier, while we believe that small factors of two or four may be achieved in this way, factors of fifty or a hundred will not. Factors of two or four, however, are not small enough to be ignored, and much work needs to be done in developing production-system formalisms that permit more explicit expression of parallelism (for example, the Herbal language being developed at Columbia University), or formalisms that implicitly allow much more parallelism to be used (for example, the Soar formalism as compared to the OPS5 formalism).

An obvious direction for further work is to implement the ideas proposed in this thesis on an actual multiprocessor. Such an implementation will bring many interesting issues to light, and a running parallel implementation will certainly encourage production-system programmers to adapt their programming styles to make better use of the parallelism. Such an implementation is currently underway by the PSM group.

The thesis does not explore the issue of using multiple software task schedulers in a very comprehensive way, and additional work is needed to clarify the trade-offs involved. Another interesting task would be to build a program that uses static and run-time information to decide on the network forms (linear, binary, or some mixture) for the productions [83]. The criterion of goodness for such a program is that it should minimize the total amount of state computed on every cycle, while avoiding the occurrence of long chains of dependent node activations.

Another interesting direction for future work is to analyze the relative merits of different AI programming languages. To be more specific, with the start of the Japanese Fifth Generation Computing Project [24], the language Prolog has gained very wide usage [13, 78]. Prolog has also been put forward as a language for building expert systems, and it has been claimed that massive amounts of parallelism can be used in its implementations [12, 27, 50, 61, 88]. It would be interesting to implement a common set of tasks using Prolog, OPS5, Soar, Multilisp, Actors, and other such languages [1, 15, 25, 35, 59, 64, 93], and to see the amount of parallelism that each of the formalisms permits and the absolute performance that can be achieved by each of them.

94 The collection of programs that we have been using in our experiments represents a start, but it is still inadequate in various ways. For example, most of the programs are too large to be recoded in other related languages, and many programs are of a proprietary nature and cannot be shipped outside of CMU.

References

[1] Gul Agha and Carl Hewitt. Concurrent Programming Using Actors: Exploiting Large-Scale Parallelism. A.I. Memo 865, Massachusetts Institute of Technology, October 1985.
[2] J. R. Anderson. The Architecture of Cognition. Harvard University Press, 1983.
[3] Mario R. Barbacci. An Introduction to ISPS. In Daniel P. Siewiorek, C. Gordon Bell, and Allen Newell (editors), Computer Structures: Principles and Examples, chapter 4. McGraw-Hill, 1982.
[4] Avron Barr and Edward A. Feigenbaum. The Handbook of Artificial Intelligence, Volume 1. William Kaufmann, Inc., 1981.
[5] Forest Baskett and Alan Jay Smith. Interference in Multiprocessor Computer Systems with Interleaved Memory. Communications of the ACM 19(6), June 1976.
[6] D. P. Bhandarkar. Analysis of Memory Interference in Multiprocessors. IEEE Transactions on Computers C-24(9), September 1975.
[7] W. J. Bouknight, Stewart A. Denenberg, David E. McIntyre, J. M. Randall, Amed H. Sameh, and Daniel L. Slotnick. The Illiac IV System. Proceedings of the IEEE, April 1972.
[8] Ruven Brooks and Rosalyn Lum. Yes, An SIMD Machine Can Be Used For AI. In International Joint Conference on Artificial Intelligence, 1985.
[9] George Broomell and J. Robert Heath. Classification Categories and Historical Development of Circuit Switching Topologies. Computing Surveys 15(2):95-133, June 1983.
[10] Lee Brownston, Robert Farrell, Elaine Kant, and Nancy Martin. Programming Expert Systems in OPS5: An Introduction to Rule-Based Programming. Addison-Wesley, 1985.
[11] B. G. Buchanan and E. A. Feigenbaum. DENDRAL and Meta-DENDRAL: Their Applications Dimensions. Artificial Intelligence 11(1-2), 1978.
[12] Yaohan Chu and Kozo Itano. Organization of a Parallel Prolog Machine. In International Workshop on High-Level Computer Architecture, 1984.
[13] W. F. Clocksin and C. S. Mellish. Programming in Prolog. Springer-Verlag, 1981.
[14] R. O. Duda, J. G. Gaschnig, and P. E. Hart. Model Design in the PROSPECTOR Consultant System for Mineral Exploration. In D. Michie (editor), Expert Systems in the Micro-Electronic Age. Edinburgh University Press, Edinburgh, 1979.
[15] J. Fain, F. Hayes-Roth, S. J. Rosenschein, H. Sowizral, and D. Waterman. The ROSIE Language Reference Manual. Technical Report N-1647-ARPA, Rand Corporation, 1981.
[16] Richard D. Fennell and Victor R. Lesser. Parallelism in Artificial Intelligence Problem Solving: A Case Study of Hearsay II. IEEE Transactions on Computers C-26(2), February 1977.
[17] Charles L. Forgy. On the Efficient Implementation of Production Systems. PhD thesis, Carnegie-Mellon University, Pittsburgh, 1979.
[18] Charles L. Forgy. Note on Production Systems and ILLIAC-IV. Technical Report CMU-CS-80-130, Carnegie-Mellon University, Pittsburgh, 1980.
[19] Charles L. Forgy. OPS5 User's Manual. Technical Report CMU-CS-81-135, Carnegie-Mellon University, Pittsburgh, 1981.
[20] Charles L. Forgy. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence 19, September 1982.
[21] Charles L. Forgy. The OPS83 Report. Technical Report CMU-CS-84-133, Carnegie-Mellon University, Pittsburgh, May 1984.
[22] Charles Forgy, Anoop Gupta, Allen Newell, and Robert Wedig. Initial Assessment of Architectures for Production Systems. In National Conference on Artificial Intelligence, AAAI, 1984.
[23] Charles Forgy and Anoop Gupta. Preliminary Architecture of the CMU Production System Machine. In Hawaii International Conference on System Sciences, January 1986.
[24] Kazuhiro Fuchi. Revisiting Original Philosophy of Fifth Generation Computer Systems Project. In International Conference on Fifth Generation Computer Systems, ICOT, 1984.
[25] Richard P. Gabriel and John McCarthy. Queue-based Multi-processing Lisp. In ACM Symposium on Lisp and Functional Programming, ACM, 1984.
[26] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[27] Atsuhiro Goto, Hidehiko Tanaka, and Tohru Moto-oka. Highly Parallel Inference Engine -- Goal Rewriting Model and Machine Architecture. New Generation Computing 2:37-58, 1984.
[28] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- Designing a MIMD Shared-memory Parallel Machine. In The 9th Annual Symposium on Computer Architecture, IEEE and ACM, 1982.
[29] J. H. Griesmer, S. J. Hong, M. Karnaugh, J. K. Kastner, M. I. Schor, R. L. Ennis, D. A. Klein, K. R. Milliken, and H. M. VanWoerkom. YES/MVS: A Continuous Real Time Expert System. In National Conference on Artificial Intelligence, AAAI, 1984.
[30] Anoop Gupta and Charles L. Forgy. Measurements on Production Systems. Technical Report CMU-CS-83-167, Carnegie-Mellon University, Pittsburgh, 1983.
[31] Anoop Gupta. Implementing OPS5 Production Systems on DADO. In International Conference on Parallel Processing, IEEE, 1984.
[32] Anoop Gupta. Parallelism in Production Systems: The Sources and the Expected Speed-up. Technical Report CMU-CS-84-169, Carnegie-Mellon University, Pittsburgh, 1984. Also in Proceedings of the Fifth International Workshop on Expert Systems and Applications, Avignon, France, May 1985.
[33] Anoop Gupta, Charles Forgy, Allen Newell, and Robert Wedig. Parallel Algorithms and Architectures for Production Systems. In 13th International Symposium on Computer Architecture, June 1986. To appear.
[34] P. Haley, J. Kowalski, J. McDermott, and R. McWhorter. PTRANS: A Rule-Based Management Assistant. Technical Report, Carnegie-Mellon University, Pittsburgh, 1983.
[35] Robert H. Halstead, Jr. Multilisp: A Language for Concurrent Symbolic Computation. ACM Transactions on Programming Languages and Systems 7(4):501-538, October 1985.
[36] J. L. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, and T. Gross. The MIPS Machine. In Computer Conference, February 1982.
[37] Bruce K. Hillyer and David E. Shaw. Execution of OPS5 Production Systems on a Massively Parallel Machine. Technical Report, Columbia University, September 1984.
[38] Keki B. Irani and Ibrahim H. Onyuksel. A Closed-Form Solution for the Performance Analysis of Multiple-Bus Multiprocessor Systems. IEEE Transactions on Computers C-33(11), November 1984.
[39] Gary Kahn and John McDermott. The MUD System. In First Conference on Artificial Intelligence Applications, IEEE Computer Society and AAAI, December 1984.
[40] Dennis F. Kibler and John Conery. Parallelism in AI Programs. In International Joint Conference on Artificial Intelligence, 1985.
[41] Jin Kim, John McDermott, and Daniel Siewiorek. TALIB: A Knowledge-Based System for IC Layout Design. In National Conference on Artificial Intelligence, AAAI, 1983.
[42] Ted Kowalski and Don Thomas. The VLSI Design Automation Assistant: Prototype System. In 20th Design Automation Conference, ACM and IEEE, June 1983.
[43] Ted Kowalski. The VLSI Design Automation Assistant: A Knowledge-Based Expert System. PhD thesis, Carnegie-Mellon University, April 1984.
[44] John E. Laird. Universal Subgoaling. PhD thesis, Carnegie-Mellon University, Pittsburgh, December 1983.
[45] John E. Laird and Allen Newell. A Universal Weak Method: Summary of Results. In International Joint Conference on Artificial Intelligence, 1983.
[46] John E. Laird and Allen Newell. A Universal Weak Method. Technical Report CMU-CS-83-141, Carnegie-Mellon University, Pittsburgh, June 1983.
[47] John E. Laird, Paul S. Rosenbloom, and Allen Newell. Towards Chunking as a General Learning Mechanism. In National Conference on Artificial Intelligence, AAAI, 1984.
[48] John E. Laird. Soar User's Manual, 4th edition. Xerox PARC, 1986.
[49] Theodore F. Lehr. The Implementation of a Production System Machine. Master's thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University, 1985.
[50] G. J. Lipovski and M. V. Hermenegildo. B-LOG: A Branch and Bound Methodology for the Parallel Execution of Logic Programs. In International Conference on Parallel Processing, IEEE, 1985.
[51] Sandra Marcus, John McDermott, Robert Roche, Tim Thompson, Tianran Wang, and George Wood. Design Document for VT. Carnegie-Mellon University, 1984.
[52] M. A. Marsan. Bounds on Bus and Memory Interference in a Class of Multiple-Bus Multiprocessor Systems. In Third International Conference on Distributed Computer Systems, October 1982.
[53] Henry H. Mashburn. The C.mmp/Hydra Project: An Architectural Overview. In Daniel P. Siewiorek, C. Gordon Bell, and Allen Newell (editors), Computer Structures: Principles and Examples. McGraw-Hill, 1982.
[54] Donald McCracken. A Production System Version of the Hearsay-II Speech Understanding System. PhD thesis, Carnegie-Mellon University, Pittsburgh, 1978.
[55] John McDermott. R1: A Rule-based Configurer of Computer Systems. Technical Report CMU-CS-80-119, Carnegie-Mellon University, Pittsburgh, April 1980.
[56] John McDermott. XSEL: A Computer Salesperson's Assistant. In J. E. Hayes, D. Michie, and Y. H. Pao (editors), Machine Intelligence. Horwood, 1982.
[57] John McDermott. R1: A Rule-Based Configurer of Computer Systems. Artificial Intelligence 19(1):39-88, 1982.
[58] John McDermott. Extracting Knowledge from Expert Systems. In International Joint Conference on Artificial Intelligence, 1983.
[59] W. Van Melle, A. C. Scott, J. S. Bennett, and M. Peairs. The Emycin Manual. Technical Report STAN-CS-81-885, Stanford University, October 1981.
[60] Daniel P. Miranker. Performance Estimates for the DADO Machine: A Comparison of Treat and Rete. In Fifth Generation Computer Systems, ICOT, Tokyo, 1984.
[61] Tohru Moto-oka, Hidehiko Tanaka, Hitoshi Aida, Keiji Hirata, and Tsutomu Maruyama. The Architecture of a Parallel Inference Engine -- PIE --. In International Conference on Fifth Generation Computer Systems, ICOT, 1984.
[62] Allen Newell and Herbert A. Simon. Human Problem Solving. Prentice-Hall, 1972.
[63] Allen Newell. HARPY, Production Systems and Human Cognition. Technical Report CMU-CS-78-140, Carnegie-Mellon University, Pittsburgh, September 1978.
[64] H. P. Nii and N. Aiello. AGE (Attempt to Generalize): A Knowledge-Based Program for Building Knowledge-Based Programs. In International Joint Conference on Artificial Intelligence, 1979.
[65] Nils J. Nilsson. Problem-Solving Methods in Artificial Intelligence. McGraw-Hill, New York, 1971.
[66] Kemal Oflazer. Parallel Execution of Production Systems. In International Conference on Parallel Processing, IEEE, August 1984.
[67] Kemal Oflazer. Partitioning in Parallel Processing of Production Systems. PhD thesis, Carnegie-Mellon University, (in preparation), 1986.
[68] D. A. Patterson and C. H. Sequin. A VLSI RISC. Computer 9, 1982.
[69] James E. Quinlan. A Comparative Analysis of Computer Architectures for Production System Machines. Master's thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University, 1985.
[70] G. Radin. The 801 Minicomputer. IBM Journal of Research and Development 27, May 1983.
[71] Raja Ramnarayan. A Tagged Token Data Flow Computation Model for OPS5 Production Systems. Working Draft, Honeywell CSC, Bloomington, MN, 1984.
[72] Bruce Reed, Jr. The ASPRO Parallel Inference Engine (P.I.E.): A Real Time Production Rule System. Goodyear Aerospace, 1985.
[73] Paul S. Rosenbloom. The Chunking of Goal Hierarchies: A Model of Stimulus-Response Compatibility. PhD thesis, Carnegie-Mellon University, Pittsburgh, August 1983.
[74] Paul S. Rosenbloom, John E. Laird, John McDermott, Allen Newell, and Edmund Orciuch. R1-Soar: An Experiment in Knowledge-Intensive Programming in a Problem-Solving Architecture. In IEEE Workshop on Principles of Knowledge Based Systems, 1984.
[75] Paul S. Rosenbloom, John E. Laird, Allen Newell, Andrew Golding, and Amy Unruh. Current Research on Learning in Soar. In International Workshop on Machine Learning, 1985.
[76] Larry Rudolph and Zary Segall. Dynamic Decentralized Cache Schemes for MIMD Parallel Processors. In International Symposium on Computer Architecture, 1984.
[77] Mike Rychener, Joe Kownacki, and Zary Segall. Parallel Production Systems: OPS3. In Cm*: An Experiment in Multiprocessing. Digital Press, 1986.
[78] Ehud Y. Shapiro. A Subset of Concurrent Prolog and Its Interpreter. Technical Report, ICOT -- Institute for New Generation Computer Technology, February 1983.
[79] David Elliot Shaw. The NON-VON Supercomputer. Technical Report, Columbia University, New York, August 1982.
[80] David Elliot Shaw. On the Range of Applicability of an Artificial Intelligence Machine. Technical Report, Columbia University, January 1985.
[81] E. H. Shortliffe. Computer-Based Medical Consultations: MYCIN. North-Holland, 1976.
[82] Herbert A. Simon. The Architecture of Complexity. In The Sciences of the Artificial, chapter 7. MIT Press, 1981.
[83] David E. Smith and Michael R. Genesereth. Ordering Conjunctive Queries. Artificial Intelligence 26(2):171-215, 1985.
[84] Salvatore J. Stolfo and David E. Shaw. DADO: A Tree-Structured Machine Architecture for Production Systems. In National Conference on Artificial Intelligence, AAAI, 1982.
[85] Salvatore J. Stolfo, Daniel Miranker, and David E. Shaw. Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence. In International Joint Conference on Artificial Intelligence, 1983.
[86] Salvatore J. Stolfo. Five Parallel Algorithms for Production System Execution on the DADO Machine. In National Conference on Artificial Intelligence, AAAI, 1984.
[87] S. N. Talukdar, E. Cardozo, L. Leao, R. Banares, and R. Joobbani. A System for Distributed Problem Solving. In Workshop on Coupling Symbolic and Numerical Computing in Expert Systems, August 1985.
[88] Stephen Taylor, Christopher Maio, Salvatore J. Stolfo, and David E. Shaw. PROLOG on the DADO Machine: A Parallel System for High-Speed Logic Programming. Technical Report, Columbia University, New York, January 1983.
[89] M. F. M. Tenorio and D. I. Moldovan. Mapping Production Systems into Multiprocessors. In International Conference on Parallel Processing, IEEE, 1985.
[90] Jeffrey D. Ullman. Principles of Database Systems. Computer Science Press, 1982.
[91] Shinji Umeyama and Koichiro Tamura. A Parallel Execution Model of Logic Programs. In The 10th Annual International Symposium on Computer Architecture, IEEE and ACM, June 1983.
[92] D. A. Waterman and Frederick Hayes-Roth. Pattern-Directed Inference Systems. Academic Press, 1978.
[93] S. M. Weiss and C. A. Kulikowski. EXPERT: A System for Developing Consultation Models. In International Joint Conference on Artificial Intelligence, 1979.

Appendix A
ISP of Processor Used in Parallel Implementation

! This is an ISPS [3] based description of the instruction-set
! architecture of the individual processors. The description is designed
! for use with the simulator, so that the cost of executing production-
! system code on this machine may be computed. The instructions are
! partitioned into a small number of classes, and all instructions
! within a class are treated the same by the cost models. Instructions
! that make memory references (the number of memory references being
! either 0 or 1 for almost all instructions) are the only ones treated
! specially by the cost models, since they cost more.
**PC.State**

R[0:31]<0:31>,          ! General Purpose Registers
IR<0:31>,               ! Instruction Register
PSW<0:31>,              ! Prog. Status Word; Cond. Code Register
PREFIX<0:31>,           ! Prefix Register

PC   := R[31],          ! Program Counter
SP   := R[30],          ! System Stack Pointer
Link := R[29],          ! System Link Register
Zero := R[28];          ! Zero Register

Z<> := PSW<0>,          ! zero bit
V<> := PSW<1>,          ! overflow bit
N<> := PSW<2>,          ! negative bit
C<> := PSW<3>,          ! carry bit

**Instruction Word Fields**

opcode<0:5> := IR<26:31>,   ! opcode of instruction
type<0:1>   := IR<24:25>,   ! type (e.g., byte, word, ...) of instruction
rd<0:4>     := IR<19:23>,   ! destination register
mode<0:1>   := IR<17:18>,   ! mode used for interpretation of operands
rs<0:4>     := IR<12:16>,   ! source register
rx<0:4>     := IR<12:16>,   ! index register / base reg

r1<0:4>     := IR<19:23>,   ! first register specified by instruction
r2<0:4>     := IR<12:16>,   ! second register specified by instruction
r3<0:4>     := IR<7:11>,    ! third register specified by instruction

xd<0:4>     := IR<19:23>,   ! 5-bit signed/unsigned constant
xs<0:4>     := IR<12:16>,   ! 5-bit signed/unsigned constant

disp/const<0:6>        := IR<0:6>,    ! 7-bit displacement or constant
ldisp/lconst<0:11>     := IR<0:11>,   ! 12-bit displacement or constant
lldisp/llconst<0:16>   := IR<0:16>,   ! 17-bit displacement or constant
llldisp/lllconst<0:23> := IR<0:23>,   ! 24-bit displacement or constant

! Note: The difference between constants and displacements is that
! displacements are shifted by the type (byte, word, long, ...) of the
! instruction, while constants are not affected by the type of the
! instruction. For example, if a constant is used to specify a bit
! within a word, it will not be shifted. Constants may be signed or
! unsigned, as relevant in the context of the instruction.

!  31       26 25 24 23     19 18  17 16      12 11      7 6         0
!  +----------+-----+----------+------+----------+---------+----------+
!  |  opcode  |type | r1/rd/xd | mode | r2/rs/rx |   r3    |disp/const|
!  +----------+-----+----------+------+----------+---------+----------+
!                       INSTRUCTION REGISTER
! (The longer displacement/constant fields ldisp<0:11>, lldisp<0:16>,
! and llldisp<0:23> overlap the fields to their left.)
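For concreteness, the fields above are fixed bit slices of the 32-bit instruction register. The following C sketch (the function names are illustrative, not part of the ISPS description; the bit positions are the ones defined above) shows how a decoder would extract the common fields:

    #include <stdint.h>

    /* Extract bits hi..lo (inclusive) of a 32-bit instruction word. */
    static unsigned bits(uint32_t ir, int hi, int lo) {
        return (ir >> lo) & ((1u << (hi - lo + 1)) - 1);
    }

    static unsigned opcode(uint32_t ir) { return bits(ir, 31, 26); } /* opcode<0:5> */
    static unsigned type_f(uint32_t ir) { return bits(ir, 25, 24); } /* type<0:1>   */
    static unsigned rd_f  (uint32_t ir) { return bits(ir, 23, 19); } /* rd, r1, xd  */
    static unsigned mode_f(uint32_t ir) { return bits(ir, 18, 17); } /* mode<0:1>   */
    static unsigned rs_f  (uint32_t ir) { return bits(ir, 16, 12); } /* rs, rx, r2  */
    static unsigned r3_f  (uint32_t ir) { return bits(ir, 11,  7); } /* r3<0:4>     */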
Interpretation do not need type bits) ISP OF PROCESSOR USED IN PARALLEL add2 llconst,rd O0 rd <-- rs,rd 01 rd <-- rd + rs (rx)Idisp,rd 10 rd <-- rd + M[rx + Idisp] (rx)r3,rd 11 rd <-- rd + M[rx + r3] I Note, when I type of the I with the mode = 11, the instruction, same format contents just of like according displacement. to the Other instr. are: add with subtract carry subc2 subtract with and2 logical and or2 logical or xor2 logical xor I Instructions rd + llconst r3 are shifted any other addc2 sub2 I instr 189 IMPLEMENTATION in the three-operand Args carry form. Mode Interpretation ............................ add3 rd <-- rs + r3 Ol rd <-- rs + Iconst rs,(r3)disp,rd 10 rd <-- rs + M[r3 + disp] rd <-format are: xs + M[r3 + disp] 11 the same addc3 I add with sub3 ! subtract subc3 ! subtract and3 I logical and or3 I logical or xor3 I logical xor • * Shift Instructions I instr with carry "* Args Mode Interpretation Other rs,r3,rd DO rd <-- r3 shifted-by xs,r3,rd 01 rd r3 xs,(r3)disp,rd 10 related instructions with the shr I shift right shra I shift right ** Bit-Field Extract/Insert I The following I modes I take 10 and up the two instructions 11 are bits also that not would := IR<7:11>, siz<0:4> := IR<0:4>, I instr Args ............................ Mode <-- rd same <-M[r3 + disp] format are: rs xs shifted-by xs '' do not make used. shifted-by arithmetic Instructions pos<0:4> | carry ............................ shl I O0 rs,lconst,rd xs,(r3)disp,rd instructions with ! Other I rs,r3,rd The be needed use of the reason to specify is that type pos a useful Interpretation bits. and The siz fields memory address. 190 PARALLELISM bfx bfi rs,pos,siz,rd O0 rd<O,siz-l> <-- rs<pos,pos+siz-l> xs,pos,siz,rd O1 rd<O,siz-l> <-- xs<pos,pos+siz-l> rs,pos,siz,rd O0 rd<pos,pos+siz-l> <-- rs<O,siz-l> xs,pos,siz,rd O1 rd<pos,pos+siz-l> <-- xs<O,siz-l> *" Load/Store I Note: t these Instructions The two I quad-word type I instr ............................ Id (load) also is necessary Args st Ida ldpi I Ida: *" instructions instructions i ld to accommodate are type special, (type the hardware task rd <-- 01 rd <-- rs (rx)ldisp,rd 10 rd <-- M[rx + Idisp] (rx)r3,rd 11 rd <-- M[rx + r3] xd,(rx)Idisp O0 M[rx + Idisp] <-- xd xd,(rx)r3 Ol M[rx + r3] <-- xd r1,(rx)Idisp I0 M[rx + Idisp] <-- rl rl,(rx)r3 II M[rx + r3] <-- rl (rx)Idisp,rd I0 rd <-- rx + Idisp (rx)r3,rd 11 rd <-- rx + r3 (shifted const:IR<0:25> -- PREFIX<O:25> Args llconst <-- (shifted by type) by type) IR<0:25> instruction. instruction. The bits with with PREFIX<O:25> or constant, Instructions The scheduler. Interpretation O0 load-address in that = 11). the number being IR<0:5> to form of the following a 32 bit PREFIX<O:25>:IR<0:5>. *" Mode Interpretation ............................ push pusha pop ** st (store) rs,rd I displacement I and a quad-word llconst,rd is the I instr have Node I Idpi: is the load-prefix I instruction are combined • * Push/Pop IN PRODUCTION Subroutine llconst,rd O0 (rd)++; M[rd] <-- rs,rd 01 (rd)++; M[rd] <-- rs (rx)ldisp,rd 10 (rd)++; M[rd] <-- M[rx + ldisp] (rx)r3,rd ll (rd)++; M[rd] <-- M[rx + r3] (rx)ldisp,rd 10 (rd)++; M[rd] <-- rx + ldisp; (rx)r3,rd 11 (rd)++; M[rd] <-- rx + r3; rs,rd -- rd Linkage Instructions '' <-- M[rs]; (rs)--; llconst SYSTEMS 1SP OFPROCI_SOR ! instr ! 191 USED IN PARALLELIMPLEMENTATION Args Mode Interpretation ............................ 
** Load/Store Instructions **

! Note: The two instructions ld (load) and st (store) are special, in
! that these instructions also have quad-word variants (type = 11). The
! quad-word type is necessary to accommodate the hardware task scheduler.

! instr   Args             Mode   Interpretation
! -----------------------------------------------
  ld      llconst,rd        00    rd <-- llconst
          rs,rd             01    rd <-- rs
          (rx)ldisp,rd      10    rd <-- M[rx + ldisp]
          (rx)r3,rd         11    rd <-- M[rx + r3]
  st      xd,(rx)ldisp      00    M[rx + ldisp] <-- xd
          xd,(rx)r3         01    M[rx + r3] <-- xd
          r1,(rx)ldisp      10    M[rx + ldisp] <-- r1
          r1,(rx)r3         11    M[rx + r3] <-- r1
  lda     (rx)ldisp,rd      10    rd <-- rx + ldisp   (ldisp shifted by type)
          (rx)r3,rd         11    rd <-- rx + r3      (r3 shifted by type)
  ldpi    const:IR<0:25>    --    PREFIX<0:25> <-- IR<0:25>

! lda is the load-address instruction. ldpi is the load-prefix
! instruction. The bits IR<0:25> of the ldpi instruction are combined
! with the bits IR<0:5> of the following instruction to form a 32-bit
! displacement or constant, PREFIX<0:25>:IR<0:5>.

** Push/Pop Instructions **

! instr   Args             Mode   Interpretation
! -----------------------------------------------
  push    llconst,rd        00    (rd)++; M[rd] <-- llconst
          rs,rd             01    (rd)++; M[rd] <-- rs
          (rx)ldisp,rd      10    (rd)++; M[rd] <-- M[rx + ldisp]
          (rx)r3,rd         11    (rd)++; M[rd] <-- M[rx + r3]
  pusha   (rx)ldisp,rd      10    (rd)++; M[rd] <-- rx + ldisp
          (rx)r3,rd         11    (rd)++; M[rd] <-- rx + r3
  pop     rs,rd             --    rd <-- M[rs]; (rs)--

** Subroutine Linkage Instructions **

! instr    Args          Mode   Interpretation
! ----------------------------------------------
  brlink   r1,lldisp      00    Link <-- PC; PC <-- r1 + lldisp
           r1,r2          01    Link <-- PC; PC <-- r1 + r2
  brlinki  (r1)lldisp     00    Link <-- PC; PC <-- M[r1 + lldisp]
           (r1)r2         01    Link <-- PC; PC <-- M[r1 + r2]
  jsb      r1,lldisp      00    (SP)++; M[SP] <-- PC; PC <-- r1 + lldisp
           r1,r2          01    (SP)++; M[SP] <-- PC; PC <-- r1 + r2
  bsb      llldisp        --    (SP)++; M[SP] <-- PC; PC <-- PC + llldisp
  rsb      --             --    PC <-- M[SP]; (SP)--

! NOTE: The brlinki (branch-link-indirect) instruction is especially
! useful when code for the various nodes in the Rete network is shared.
! For example, it can be used to jump to one of several MakeToken
! routines, depending on the type of the token.

** Comparison and Control Flow Instructions **

! There are four types of instructions in this category: compare (cmp),
! compare-and-branch (cb), branch (br), and jump (jmp). The compare
! instructions set some combination of the Z, V, N, C bits in the PSW.
! The branch instructions can then branch on some combination of these
! bits in the PSW.

! instr   Args             Mode   Interpretation
! -----------------------------------------------
  cmp     r1,llconst        00    compare(r1, llconst)
          r1,r2             01    compare(r1, r2)
          r1,(rx)ldisp      10    compare(r1, M[rx + ldisp])
          r1,(rx)r3         11    compare(r1, M[rx + r3])
  br      llldisp           --    PC <-- PC + llldisp
  br_xx   llldisp           --    if xx(N,Z,V,C) then PC <-- PC + llldisp
! where xx = z, nz, ovf, neg, eq, neq, lt, gt, le, ge
  cb_xx   r1,r2,ldisp       00    if xx(r1, r2) then PC <-- PC + ldisp
          r1,(r2),ldisp     01    if xx(r1, M[r2]) then PC <-- PC + ldisp
          xd,r2,ldisp       10    if xx(xd, r2) then PC <-- PC + ldisp
          xd,(r2),ldisp     11    if xx(xd, M[r2]) then PC <-- PC + ldisp
! where xx = eq, neq, lt, gt, le, ge
  jmp     llldisp           --    PC <-- PC + llldisp
  jmpi    (r1)lldisp        00    PC <-- M[r1 + lldisp]
          (r1)r2            01    PC <-- M[r1 + r2]
  jmpx    r1,lldisp         00    PC <-- r1 + lldisp
          r1,r2             01    PC <-- r1 + r2

** Synchronization Instructions **

! tsi: test-and-set-interlocked
! tci: test-and-clear-interlocked

pos<0:4> := xs

! instr   Args            Mode   Interpretation
! ----------------------------------------------
  tsi     (r1)ldisp,pos    00    cmp(M[r1+ldisp]<pos>,0); M[r1+ldisp]<pos> <-- 1
          (r1)r2,pos       01    cmp(M[r1+r2]<pos>,0);    M[r1+r2]<pos>    <-- 1
  tci     (r1)ldisp,pos    00    cmp(M[r1+ldisp]<pos>,1); M[r1+ldisp]<pos> <-- 0
          (r1)r2,pos       01    cmp(M[r1+r2]<pos>,1);    M[r1+r2]<pos>    <-- 0

! instr    Args            Mode   Interpretation
! -----------------------------------------------
  incr_i   (rx)ldisp,rd     00    rd = (M[rx + ldisp] += 1); also set flags
           (rx)r3,rd        01    rd = (M[rx + r3]   += 1); also set flags
  decr_i   (rx)ldisp,rd     00    rd = (M[rx + ldisp] -= 1); also set flags
           (rx)r3,rd        01    rd = (M[rx + r3]   -= 1); also set flags

! The increment_interlocked and decrement_interlocked instructions are
! useful to maintain shared counters in multiprocessors. It is possible
! to do the incr/decr operations at the memory-board itself, so that
! these instructions should not be any more expensive than a read from
! memory.
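The tsi/tci pair is the primitive from which all the locks in Appendix B are built. The following C sketch shows the resulting spinlock discipline, with C11 atomics standing in for the interlocked bit operations; the relaxed load in the inner loop corresponds to the cb_neq spin on the cached lock word:

    #include <stdatomic.h>

    /* Spin until the bit can be set; mirrors the cb_neq / tsi / br_nz idiom. */
    static void lock(atomic_int *l) {
        for (;;) {
            while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                ;                               /* spin in the cache    */
            if (atomic_exchange(l, 1) == 0)     /* tsi: test-and-set    */
                return;                         /* we got the lock      */
        }
    }

    static void unlock(atomic_int *l) {
        atomic_store(l, 0);                     /* tci: test-and-clear  */
    }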
** Miscellaneous Instructions **

! instr     Args    Mode   Interpretation
! -----------------------------------------
  put_psw   .....    --    .....
  get_psw   .....    --    .....

Appendix B
Code and Data Structures for Parallel Implementation

B.1. Code for Interpreter with Hardware Task Scheduler

/*
 * Cost model for OPS5 production systems. To be used in the parallel
 * implementation using the hardware task scheduler.
 */

/* Data structure declarations */

typedef struct TokenTag {
    struct TokenTag *tNext;
    int nid;
    int refCount;              /* required for left-not tokens */
    struct TokenTag *tLeft;    /* Beta-Token: left component */
    struct TokenTag *tRight;   /* Beta-Token: right component */
    struct WmeTag *wptr[32];
} Token;

typedef struct TokPtrTag {
    Token *ltok;
    Token *rtok;
} TokPtr;

/*
 * Structure of data sent to and received from the HTS. The data
 * consists of two words. The first word is sent from and received into
 * the register R'-nid. It consists of three pieces of information:
 * R'-nid<31> := Flag, i.e., Insert/Delete; R'-nid<30> := Dir, i.e.,
 * Left/Right; and R'-nid<29:0> := Node-ID. The second word is sent from
 * and received into the register R'-tokPtr. It contains a pointer to a
 * Token or a TokPtr.
 *
 * word1:
 *   +------+-----+---------------------------------------------+
 *   | Flag | Dir |                  Node-ID                    |
 *   +------+-----+---------------------------------------------+
 *     31     30    29                                         0
 *
 * word2:
 *   +----------------------------------------------------------+
 *   |          tokPtr: pointer to Token or TokPtr              |
 *   +----------------------------------------------------------+
 *     31                                                       0
 */
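The flag/dir/node-id word packs three fields into 32 bits. A small C sketch of the packing and unpacking (field positions as in the diagram above; the function names are illustrative):

    #include <stdint.h>

    #define FLAG_INSERT 0u
    #define FLAG_DELETE 1u
    #define DIR_LEFT    0u
    #define DIR_RIGHT   1u

    /* word1 sent to the HTS: <31>=flag, <30>=dir, <29:0>=node-id */
    static uint32_t pack_task(unsigned flag, unsigned dir, uint32_t nid) {
        return (flag << 31) | (dir << 30) | (nid & 0x3FFFFFFFu);
    }

    static void unpack_task(uint32_t w, unsigned *flag, unsigned *dir,
                            uint32_t *nid) {
        *flag = (w >> 31) & 1u;    /* corresponds to bfx R-nid,#31,#1,R-flg */
        *dir  = (w >> 30) & 1u;    /* corresponds to bfx R-nid,#30,#1,R-dir */
        *nid  =  w & 0x3FFFFFFFu;  /* corresponds to bfx R-nid,#0,#30,R-nid */
    }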
#define MaxWmeFields 128    /* max fields in working-memory element */
#define WmeHTSize   4096    /* size of Wme hash table */

typedef struct WmeTag {
    struct WmeTag *wNext;
    int wTimeTag;
    int wVal[MaxWmeFields];
} Wme;

typedef struct {
    int lock;       /* lock for modifying wList */
    Wme *wList;     /* list of wmes associated with this entry */
} WmeHTEntry;

extern-shared WmeHTEntry WmeHT[WmeHTSize];

/*
 * Register R-wmeHT contains a pointer to the shared data structure
 * WmeHT. Both symbolic and numeric data within the wVal fields of the
 * Wme are properly encoded: symbolic data is encoded as (sym-val * 2),
 * while numeric-integer data is encoded as (val * 2 + 1). Standard
 * integer comparisons work between encoded values.
 */

#define TokenHTSize 4096    /* size of Token hash table */

typedef struct {
    short lock;         /* lock to modify refCount safely */
    short refCount;     /* -1: writeLock, 0: free, +k: k readers */
    Token *tList;       /* ptr to token list */
} TokHTEntry;

extern-shared TokHTEntry LTokHT[TokenHTSize];  /* hash table for left toks */
extern-shared TokHTEntry RTokHT[TokenHTSize];  /* hash table for right toks */

/*
 * Pointers to the two token hash tables are available in the registers
 * R-ltokHT and R-rtokHT.
 */

#define MaxNodes 10000
#define NodeTableSize MaxNodes

typedef struct nodeTag {
    struct nodeTag *next;
    unsigned dir.succNid;
} Node;

typedef struct {
    unsigned addr-lef-act;     /* code address for left activation */
    unsigned addr-rht-act;     /* code address for right activation */
    unsigned addr-mak-tok;     /* code address for makTok-lces-rces */
    unsigned addr-lef-hash;    /* code address for left hash fn */
    unsigned addr-rht-hash;    /* code address for right hash fn */
    unsigned addr-do-tests;    /* code for tests assoc with node */

    short lock;                /* for safe modification of other fields */
    short refCount;            /* used by implementation with STQs */
    Token *leftExtraRem;       /* extra removes due to conjugate pairs */
    Token *rightExtraRem;      /* extra removes due to conjugate pairs */
    Node  *succList;           /* set of successors of the node */
} NodeTableEntry;

extern-shared NodeTableEntry NodeTable[NodeTableSize];

/*
 * A pointer to the NodeTable is available in the register R-nodTab.
 * Note: Most of the fields in the node-table data structure are wasted
 * for the txxx nodes. If space is at a premium, an extra table can be
 * used for the txxx-nodes. Subsequently, the PopG routine can check for
 * txxx-nodes using an appropriate test just before the branch at the
 * end of the PopG code.
 */

/* Some data structures used for conflict-resolution */

typedef struct ConResComTag {
    struct ConResComTag *next;
    Token *ltok;
    Token *rtok;
    int flag;
} ConResCom;

extern-shared ConResCom *ConflResComList;  /* the Confl-Res Command List */
extern-shared int ConflResComListLock;     /* lock associated with the list */

/*
 * ** Register Definitions **
 * Note: The register allocation is done by hand for all the code.
 * The system-defined registers are: R[31] == PC, R[30] == SP,
 * R[29] == Link, and R[28] == Zero.
 */
I The code I and R'-tokPtr executed stq 1 ** to schedule already a new task contain the on the relevant registers R'-nid data. R'-nid---R'-tokPtr,(R-HTS) ReleaseNodeLock ** ] ........................ I This code is executed when node activation R-flg.#31,#1,R-nid I Insert val of flag bfi R-dir,#30,#1,R-nid ! Insert val of dir field stl NULL,(R-HTS)I I Set stl R-nid,(R-HTS) ! inform br PopG I PopG is a global is being maintained I flg-dir-nid that PendingTesksCount info is sent I value of (R-HTS)I is set I task, it will ] ................. scheduled bfi I It is assumed I "* a globally Root-Node find "* the to the HTS to NULL, value NULL again so that until value to speed when the field of tokPtr HTS PopG about end in bit in bit field 31 30 to NULL of processing label by the HTS. up processing is looping HTS modifies finishes. it. The there. to find The a 198 PARAI.LELISM ! The t cycle. root-node I receives struct tokTag I int I Token I struct L beg: L_end of of each wme-changes entry in to the be processed list is as in ** I each follows: { flag; *tokPtr; tokTag *tNext; Idl (R-globTab)wme-toks,rO Idpi prefix_dir.&bus-nid ) Tok; I get list Idl restof_dir.&busnid,R'-nid cb_eq #O,rO,L_end or2 (rO),R'-nid Idl PUSHG (rO)tokPtr,R'-tokPtr Idl (rO)tNext,rO bfi #O,#31,#1,R'-nid br L_beg Bus Node Note: t the I clear of changes the flag to process field ** Some of the successors same processor), (scheduled through while may others be processed as may be processed subtasks as (processed independent on tasks HIS). ldpi prefix_dir.succ-nid I Idl restof_dir.succ-nid,R'-nid I depending Several repetitions of this bfi PUSflG R-flg,#31,#1,R'-nid I and # of non-teqa/non-teqn on # of beta-lev ................................................................... Idpi prefix push restof_L_exit,SP L_exit Idpi prefix_nodeAddress I 0 or more, push restof_nodeAddress,SP I non-teqn if some done as non-teqa/ subtasks. ................................................................... Idl (R-tokPtr)wme,R-wmePtr and3 b'111OO',(R-wmePtr)2,rO add2 LocTabOffset,rO jmpi (PC)rO L_exit: RELEASE-NODE-LOCK LocTab: address-of-succ-nodel I (R-wmePtr)2 address-of-succ-node2 address-of-succ-node8 I ** SYSTEMS RELEASE-NODE-LOCK ! ! a list The structure IN PRODUCTION Constant-Test Nodes or Txxx Nodes == is type of wme. code succs succs. CODEAND DATA STRUCIURESI,XgR PARALLEI, 199 IMPI,EMENTATION ! ......................................... ! The code ! txxx below node I in ! cases: in which f which (1) When hashing I to no txxx I Case-I: rete successors successors 1 are corresponds the be of the can .... make as node are has hashing The on to be code the (2) but (3) is custom for each number, type, and the scheduled, some teqa/teqn useful, subtasks. is node. beneficially. be processed When hashing a toqa Depending txxx-node to to the be used processed nodes to network. there are successors for When there there Same as are are no teqa/teqn some txxx case-2, but way several when nodes which there are as subtasks. used. Idl (R-tokPtr)wme,R-wmePtr ! executed, Idpi prefix_type.val ! recall: Idl restof_type.val,rO Idl (R-wmePtr)field,rl cb_neq rO,r1,L_fail/L_exit ! if (thru iff sched num= HTS) (through 2*val, sym L exit else HTS) = 2*val+1 L_fail .................................................................. Idpi prefix_dir.succ-nid ! Several ldl restof_dir.succ-nid,R'-nid ! depending repetitions of this bfi PUSHG R-flg,#31,#1,R'-nid I and # of non-teqa/non-teqn code on # of beta-lev succs succs. 
................................................................... Idpi prefix_L_exit I Done push restof_L_exit,SP i through iff node is scheduled HTS. ................................................................... ldpi prefix_nodeAddress ! 0 or more, push restof_nodeAddress,SP ! non-teqn if some done as non-teqa/ subtasks. ................................................................... and3 b'lllOO',(R-wmePtr)fld,rO add2 LocTabOffset,rO jmpi (PC)tO I extract from of the L fail: rsb I Pop L_exit: LocTab: RELEASE-NODE-LOCK address-of-succ-nodel I used one only field to hash on subtasks if sched through HTS. address-of-succ-node2 address-of-succ-node8 I Case-II: When hashing is NOT used, but some txxx-nodes ldl (R-tokPtr)wme,R-wmePtr I executed, Idpi prefix I recall: Idl restof_type.val,rO Idl (R-wmePtr)field,rl cb neq rO,r1,L_fail/L_exit Idpi prefix_dir.succ-nid I Several Idl restof_dir.succ-nid,R'-nid I depending type.val I if (thru as iff num- HTS) subtasks. sched (through HTS) Z'val, sym = 2"val+1 L_exit else repetitions on # of L_fail of this beta-lay code succ$ 200 PARALLELISM bfi PUSHG R-flg,#31,#1,R'-nid ! and # of IN PRODUC'TION non-teqa/non-teqn SYSTEMS succs. ................................................................... ldpi prefix_L_exit ! push restof_L_exit,SP 1 through Done iff the the node is scheduled HTS. ................................................................... ldpi prefix_nodeAddress ! one push restof_nodeAddress,SP I done or more of txxx nodes as subtasks. ................................................................... L_fail: rsb 1 Pop one L_exit: RELEASE-NODE-LOCK I ! Case-III: When hashing is NOT used, and used the subtasks if sched no txxx-nodes as subtasks. iff sched num= 2*val, ldl (R-tokPtr)wme,R-wmePtr ! executed, ldpi prefix I recall: ldl restof_type.val,rO type.val of only through (through HTS. HTS) sym = 2*val+l ldl (R-wmePtr)field,rt cb_neq rO,rl,L_exit Idpi prefix_dir.succ-nid I Several Idl restof_dir.succ-nid,R'-nid ! depending bfi PUSHG R-flg,#31,#1,R'-nid I and # of non-teqa/non-teqn repetitions of this on # of beta-lev code succs succs. ................................................................... L_exit: RELEASE-NODE-LOCK/rsb I *" Left ! The Beta code And Node for a right Idl I if (thru HTS) REL-N... else rsb *" activation (R-tokPtr), can be obtained by minor I check extra-removes substitutions. R-ltokPtr Idl (R-tokPtr)1,R-rtokPtr HASH-TOKEN-LEFT cb_eq Delete,R-flg,L Idal (R-node)LefExRem,rO del cb_eq NULL,(rO),LIO CHECK-EXTRA-REMOVES LIO: I result in r5. field O=>OK, for 1->ExtraRemoves cb_eq #1,r5,L_exit MAKE-TOKEN INSERT-LTOKEN br L_del: Lll DELETE-LTOKEN cb_neq NULL,R-ltok,L11 INSERT-EXTRA-REMOVE L11: br L_exit Idal (R-rtokHT)R-hIndex,r5 I see if hash bucket node is empty CODE AND DATA STRUCTURES FOR PARAI,L[-L cmp NULL,(rS)tList br_eq L_exit 201 IMPLEMENTATION LOCK-RTOKEN-HT Idl L loop: (rS)tList,R-state NEXT-MATCHING-RTOKEN SCHEDULE-SUCCESSORS L12: br L_Ioop RELEASE-RTOKEN-HT L_exit: FREE-TOKPTR I L12 is used within NextMatchingTok RELEASE-NODE-LOCK ! Note: In the previous ! RELEASE-TOKEN-HT ! was zero. i the hash In this bucket ! overhead ! longer present. look ! was based tokens, if the in them, execute I Alpha Left the then the even the above between the above would table has not The full, code several been that are sequence of code. the no tokens skipped. 
in the (1) The is no if there are hash-bucket while is, most if nodes tokens, disadvantage that if there is skipped advantages: is skipped, have and in the opp-mem in the memory probability entry. pretty of the following code LOCK-TOKEN-HT of tokens of tokens is a large gets And Node has counts code node table section if the opp-mem there per hash ! we will ** This so that on counts the code if the number that the (2) Even of space ! tokens version, is NULL. is empty, ! 4 bytes ! that version, skipped of maintaining ! no matching ! we was where if the decision (3) of the buckets opposite It saves scheme have is some memory, ** ! .......................... HASH-TOKEN-LEFT cb_eq Delete,R-flg,L_del MAKE-TOKEN LIO: INSERT-LTOKEN br L del: LII: L11 DELETE-LTOKEN Idal (R-rtokHT)R-hIndex,r5 cmp NULL,(rS)tList I see if hash bucket is empty I L12 is used within NextMatchingToken br_eq L_exit LOCK-RTOKEN-HT Idl L_Ioop: (rS)tList,R-state NEXT-MATCHING-RTOKEN SCHEDULE-SUCCESSORS LI2: br L loop RELEASE-RTOKEN-HT L_exit: RELEASE-NODE-LOCK 1 ** Left I ......................... Beta Not Node ** 202 PARALLF.LISM Idl (R-tokPLr), Idl (R-tokPtr)l,R-rtokPtr IN PRODUCrlON SYSTEMS R-ItokPtr HASH-7OKEN-LEFT cb_eq Delete,R-flg,L Idal (R-node)LefExRem,rO del cb_eq NULL,(rO),LIO CHECK-EXTRA-REMOVES cb eq LIO: I check extra-removes I result field is returned for node in r5 #_,rS,L_exit MAKE-TOKEN INSERT-LTOKEN stl #O,(R-ltok)refC ldal (R-rtokHT)R-hIndex,rO cmp NULL_(rO)tList br_eq L12 I Initialize I skip refCount DetRefCount to 0 if hash-bucket=NULL LOCK-RTOKEN-HT DETERMINE-REFCOUNT RELEASE-RTOKEN-HT br L_del: L11 DELETE-LTOKEN cb_neq NULL,R-Itok,Lll INSERT-EXTRA-REMOVE br L_exit L11: cmp Zero,(R-Itok)refC L12: br_neq L_exit SCHEDULE-SUCCESSORS L exit: FREE-TOKPTR ] check value found by DetRefCount RELEASE-NODE-LOCK I ** Left Alpha Not Node "* I .......................... HASH-TOKEN-LEFT LIO: cb_eq Delete,R-flg,L_del MAKE-TOKEN INSERT-LTOKEN stl #O,(R-Itok)refC ldal (R-rtokHT)R-hIndex,rO cmp NULL,(rO)tList br_eq L12 I Initialize ! skip refCount DetRefCount to 0 if hash-bucket-NULL LOCK-RTOKEN-MT DETERMINE-REFCOUNT RELEASE-RTOKEN-HT br Lll L del: DELETE-LTOKEN L11: cmp L12: br_neq L_exit SCHEDULE-SUCCESSORS L_exit: RELEASE-MODE-LOCK Zero,(R-Itok)refC I check value found by DetRefCount CODE AN D DATA S'FR UCTURFS ! ** Right Beta FOR PARALI_I(L Not Node 203 IM PLEM EN'FATION ** ! .......................... Idl (R-tokPtr), R-ltokPtr Idl (R-tokPtr)1,R-rtokPtr HASH-TOKEN-RIGHT cb_eq Delete,R-flg,L ldal (R-node)RightExRem,rO del cb_eq NULL,(rO),LIO CHECK-EXTRA-REMOVES LIO: I check extra-removes field is returned in r5 ! result for node cb_eq #1,r5,L_exit MAKE-TOKEN INSERT-RTOKEN br L_del: L21 DELETE-RTOKEN cb_neq NULL,R-rtok,L11 INSERT-EXTRA-REMOVE L11: L loop: br L_exit ldal (R-node)R-hIndex,r5 cmp NULL,(rS)tList br_eq L_exit I see if hash bucket xor2 #1,R-flg LOCK-LTOKEN-HT R-flg gets Idl Note: LockTokenHT (r5)tList,R-state NEXT-MATCHING-LTOKEN-NOT L12 used value within is empty of opp-flag does not NextMatch use .... SCHEDULE-SUCCESSORS LI2: br L loop xor2 #1,R-flg restore R-flg to not(opp-flag) RELEASE-LTOKEN-HT L exit: FREE-TOKPTR RELEASE-NODE-LOCK ! '' Right Alpha Not Node *" I ........................... 
HASH-TOKEN-RIGHT cb_eq LIO: Delete,R-flg,L_del MAKE-TOKEN INSERT-RTOKEN br LII L_del: DELETE-RTOKEN Ll1: Idal (R-ItokHT)R-hIndex,r5 cmp NUlL,(rS)tList br_eq L_exit xor2 #1,R-flg LOCK-LTOKEN-HT Idl L_Ioop: I see if hash I R-flg gets bucket-us value of opp-flag (r5)tList,R-stete NEXT-MATCHING-LTOKEN-NOT I LI2 used within NextMatch SCHEDULE-SUCCESSORS L12: empty br L_Ioop xor2 #l,R-flg I restore value of R-flg .... r5 204 PARAIJ.ELISM IN I'RODUCIION SYSTEMS RELEASE-LTOKEN-HT L_exit: RELEASE-NODE-LOCK * GetNewToken This is used making this tokens list * by are for for allocated of free To obtain code useful from tokens. the code Idl MakeToken. insert shared Thus for The token command from memory, no locks right are is each returned Note, processor keeps to get simply replace I load value in although required activations, (R-globTab)TokFr,R-Itok ptr left. R-ltok, the its own this storage. R-Itok by R-rtok. of _TokenFreeList cb_neq NULL,R-Itok,LIO ALLOCATE-MORE-SPACE LIO: I *" ldl (R-Itok),rO stl rO,(R-globTab)TokFr MakeToken I value of next field of token *" I ................. ! Code ! Code for a left-alpha-token GET-NEW-TOKEN stl R-nid,(R-Itok)nid I copy Idl (R-tokPtr)wme,rO I get pointer node-id stl rO,(R-Itok)wptr ! copy wme I copy node-id in token to wme pointer in token for a left-beta-token GET-NEW-TOKEN stl R-nid,(R-Itok)nid stq R-ItokPtr---R-rtokPtr,(R-ltok)tLeft; brlinki (R-node)addr-mak-tok I Node specific code for make-token. t combination of Ices and rces that I the case ! routines where a linear would be present. Rete network is used (with (R-ltokPtr)wme,rO I get wme stl/stq rD,(R-Itok)wptr I store stmts to copy rest of wme ptrs (R-rtokPtr)wme,rD I get wme stl/stq rO,(R-Itok)wptr I store .. similar stmts Link to copy rest of wme pointer/s from for each Thus for = 32), 32 such in token R-ItokPtr to R-Itok. pointer/s wme ptrs MaxCEs tokens pointer/s wme Idl/Idq jmpx constituent There is one such routine occurs in the rete network. ldl/Idq .. similar in token I copy pointer/s from in token R-rtokPtr to R-Itok. CODE AND DATA STRUCTURESFOR t ** 205 PARALI_EI_IMPLEMENTATION ** GetNewTokPtr ! .................... I Returns the pointer in R'-tokPtr, which Idl (R-globTab)TokFr,R'-tokPtr cb_neq NULL,R'-tokPtr,LIO will then be sent I load to the HTS. value of _TokPtrFreeList ALLOCATE-MORE-SPACE LIO: ! ** ! Idl (R'-tokPtr),rO stl rO,(R-globTab)TokFr FreeTokPtr Note: The I value token to be freed is still in of tokptr R-tokPtr Idl (R-globTab)TokFr,rO ! rO <-- stl rO,(R-tokPtr) ! tokPtr->tNext <-- TokPtrFreeList stl R-tokPtr,(R-globTab)TokFr ! TokPtrFreeList <-- tokPtr Idl R-nid,rO brlinki t the hash (R-node)addr-lef/rht-hash bfx rO,#1,#12,R-hIndex I extract shl #1,R-hIndex,R-hIndex ! since value bits size ! R-hIndex Node-specific Hash TokPtrFreeList "* I and wmeHT ** field ** ! ** HashToken-Left/Right I of next Function Code is accumulated <1:12> of each for is 2 lwords. gives the hash entry in rO value in tokHT This correct way offset. ** ........................................ I Code for beta activations. Here the two components of the token to be I hashed are available in R-ltokPtr and R-rtokPtr. The final value is ! accumulated in tO. Idl (R-ItokPtr)wme,rl I move xor2 (rl)val-x, I xor xor2 (rl)val-y,rO Idl (R-rtokPtr)wme',rl I get wmes from xor2 (r])val-z, I xor ... and so on, rO wptr I multiple rO depending on the tests associated to rl value into rO value value with from same R-rtokPtr into the rO node wme 206 PARALLELISM1N jmpx ! Code for Link alpha ! in R-tokPtr. I only ) ** ! 
Pop activations. R-ItokPtr be one wme Here the ans R-rtokPtr associated with the pointer are back to the (R-tokPtr)wme,rl I move (rl)val-x, I xor xor2 (rl)val-y,rO so on, Link InsertToken rO code is available Note that wptr to rl value ! multiple depending generic SYSTEMS there can token. xor2 ... and to token not used. Idl jmpx PRODUCFION on the tests associated I Pop into rO value with back from the same wme node to generic code ** I ................... I I The following in left-token code hash corresponds table. to inserting is to I Idl (R-ItokHT)rl,rO ! get stl rO,(R-Itok)next ! Itok->tNext <-- ! tokList Itok DeleteToken tokList pointed add3 #1,R-hIndex,rl LOCK-LTOKEN-HT stl R-Itok,(R-ItokHT)rl RELEASE-LTOKEN-HT 1 ** token second tokList by R-ltok longword in bucket in rO <-- tokList ** I ................... 1 The following I is formed code jointly add3 corresponds by R-ItokPtr to the deletion of a left token, which and R-rtokPtr. #l,R-hIndex,rO LOCK-LTOKEN-HT L loop: ldl (R-ItokHT)rO,rl ldl NULL,r2 cb_eq NULL,rl,L20 cmp R-nid,(rl)nid br neq ! compare node-ids L_fail V .................................................................. I The enclosed code V corresponds to deletion Idl (R-Itok)tLeft,r3 I load cmp rl,(rl)tLeft I see br_neq L_fail Idl (R-Itok)tRight,r3 I load cmp r3,(rl)tRight I br_neq L_fail ^ .................................................................. see of ptr to left-part if the ptr if a left two have to left-part the two have beta token of tok same of tok same in r3 left-part in r3 left-part ^ CODE AND DATA STRUCTURIiSFOR 207 PARAI_LELIMPLEMENrFATION OR V .................................................................. I The enclosed V code corresponds to deletion Idl (R-Itok)wptr,r3 I load cmp r3,(rl)wptr I compare br_neq L Fail of a left ptr to the with alpha only wme the other token in r3 token ^ .................................................................. L_fail: L_succ: LIO: L20: br L_succ Idl rl,r2 Idl (rl)next,rl br L_Ioop ldl (rl)next,r3 cb eq NULL,rZ,LIO stl br r3,(r2)next L20 ^ I token to be del is at head of tList stl r3,(R-ItokHT)rO RELEASE-LTOKEN-HT Idl rl,R-Itok I deleted cbeq NULL,rI,L30 I if tok==NULL, ! token Idl (R-globTab)DelTokList,rO stl rO,(rl)next stl r1,(R-globTab)DeITokList token cannot is returned then in R-Itok return. be freed now L30: 1 ** LockTokenHT ** I .................. I The 1 can following be LI: used code "lock" lock-left-token-HT. is the first (R-ltokHT)R-hTndex,r6 cb_neqw #O,(r6),L1 I tsi_w (r6),#O t try br_nz Ll cmp Zero,(r6)refC LIO0 Note field ldal br_neq/It LIO0: implements because if of lock and hash is busy obtain ! O: free, #-I/#1,(r6)refC,r7 stw r7,(r6)refC tci_w br (r6),#O L200 tci_w (r6),#O br L1 I decr/incr I try cb_neq loop the refC out r6 already dep-on again *" has the correct address for the lock cache +k: k readers It:read-lock I ...................... I Note: of lock L200: I *" ReleeseTokenHT instr entry. -I: writer, I neq:write-lock, add3_w that table in it. write/read 208 PARALLELISM LI: cb_neqw #O,(rB),Ll br_neq LI tsi_w (r6),#O br_nz L1 add3 w #1/#-l,(r6)refC,r7 stw rT,(rB)refC tci_w (r6),#O t Note: The time inside I any contention) ! IDO instructions, ! no more than I hardware ! and ! ** the is around then a factor is being cmp_decr_i I incr/decr lock for LockTokenHT 12 instructions. for tokens the above dep-on same benefit node hash-table takes bucket If specialized from cmp_incr_i, (compare-and-if-equal-incr/decr-interlocked) NextMatchingToken (without average be achieved. 
will SYSTEMS write/read ReleaseTokenHT if the on the can code and Thus, arriving of 8 in speed-up designed refC IN PRODUCTION instructions. ** I ......................... I Code below I a pointer L_Ioop: corresponds to the next to a left token activation. that should The be tried cb_eq NULL,R-state,L12 I LI2 occurs cmp R-nid,(R-state)nid I check br_neq L_fail ldl R-state,R-Itok brlinki (R-node)addr-do-tests I result cb_eq #1,r5,L_succ I all tests Idl (R-state)next,R-state L_succ: br L_Ioop Idl (R-state)next,R-state I This is needed is not I update I value R-state calling are stores the code direction must code same so that returned to do dependent in r5 have R-state, I we return ! ** NextMatchingTokenNOT within if node-ids I tests L_fail: register for match. succeeded so that to L_loop in R-state, we and next time check new not old one. "* | ......................... I Code below ! stores I makes L loop: corresponds a pointer extensive to a right-not-activation. to the next use of the token fact that that should the tests cb_eq NULL,R-state,L12 I Lt2 cmp R-nid,(R-state)nid I check br_neq L fail Idl R-state,R-Itok brlinki (R'node)addr-do-tests cb_eq #O,r5,L_fail incr_iw/decr_iw (R-state)refC,rD are occurs register known within if node-ids I result I All The be tried returned tests must R-state for match. at compile calling are The time. code same in r5 have code succeeded CODE AND DATA SI'RUCFURESFOR 209 PARAIJ.]_LIMPLEMENTATION t incr:Insert-flag, ! Updated cb_eq #1/#O,rD,L_succ Idl (R-state)next,R-state L_succ: br L_Ioop Idl (R-state)next,R-state ! ** DetermineRefCount decr:Oel-flag returned ! #1:Insert-flag, ! #I: L_fail: value 0=>I, #0: #0:I=>0 in rO Del-flag transition. ** I ........................ I This code L_loop: is executed on the left-activation of a not-node. add3 #1,R-hIndex,rO ! offset Idal Idl (R-rtokHT)rO,rO #O.rl I get tList of opp-mem in rO I the # of match-toks is accum cb_eq NULL,rO.L_exit cmp R-nid.(rO)nid br neq L_fail brlinki (R-node)addr-do-tests cb eq #O,r5,L_fail add2 #I,ri if node-ids ! result returned I all L exit: ! ** Idl (rO)next,rO br L_Ioop stl r1,(R-Itok)refC DO Tests access ! check tests ! incr L fail: to get I store are have same succeeded of matching count in rl in r5 must num to tList opp-tokens field of Itok ! Note this update does not I be done in an interlocked in refC have to manner. ** 1 ............... ] This piece I a given of code node. The result the case I corresponds t is greater to than is node specific and of the tests when the number performs is the returned of tests in tests associated r5. The associated code with the with below node zero. ............................................................... ldl (R-Itok)wptr,rO l get wme from left-tok in rO Idl (R-rtok)wptr',rl I get wme from opp-tok in rl Idl (rO)field,r2 ! get value cmp r2,(rl)field' I compare br_xx L_fail I xx depends .. the last .. the above three instructions sequence for tests for each between to be compared values on the test test between different Idl #1,r5 ! all tests jmpx Link ! Pop back must type the same wmes have to generic succeeded code wmes 2[0 PARALI,ELISM L_fail: Idl #O,r5 jmpx Link I Case when I code can the I number be shared of tests by all Idl #I,r5 jmpx Link Pop associated nodes that back with have ! ** ScheduleSuccessors generic a node zero I Pop to IN PI_O1)UCFION SYSTEMS code is zero. The following tests. back to generic code ** I ......................... I Generic loop: code to schedule successors of Idl (R-node)succList,r3 ! 
get Idl (r3)dir.succNid,R'nid ! first bfi R-flg,#31,#1,R'-nid GET-NEW-TOKPTR I ** a node. ! ptr stl R-Itok,(R'-tokPtr) stl PUSHG R-rtok,(R'-tokPtr)l Idl (r3)next,r3 cb_neq NULL,r3,1oop CheckExtraRemoves pointer time to successor thru is returned loop, list r3 can't be NULL in R'-tokPtr ** ........................ I This routine will ! that reason need I beta-insert LI: L loop: L_fail: be executed not be operation. only highly The in optimized. result cmp_w #O,(R-node)lock br_neq LI tsi_w (R-node)lock,#O br_nz LI Idl (R-node)LefExRem,r2 Idl NULL,r2 cb_eq NULL,rI,L_notF cmp R-nid,(rl)nid br_neq L_fail cmp R-ltokPtr,(rl)tLeft br_neq L_fail cmp R-rtokPtr,(rl)tRight br neq L_fail br L_succ Idl rl,r2 is exceptional The returned circumstances, code in r5. corresponds and to a for left- CODE AND DATA STRUCI'URESFOR L_succ: PARAI,I]_I. ldl (rl)next,rl br L_Ioop Idl (rl)next,r3 cb_eq NULL,r2,LIO stl r3,(r2)next 211 IMPLEMENTATION ! Token to be del I value being is at head of list br L20 LIO: stl r3,(R-node)LefExRem L20: tci_w (R-node)lock,#O Idl (R-globTab)DeITokList,rO stl rO,(rl)next stl r1,(R-globTab)DelTokList ldl br #1,r5 L30 L notF: I case ldl if a matching token is not found returned in the in r5 Extra-Removes-List #O,r5 L30: ! ** "" InsertExtraRemove I ........................ I This routine I that reason will need be executed only not be highly MAKE-TOKEN LI: J "* ! ptr cmp #O,(R-node)lock br_neq LI tsi_w (R-node)lock,#O br_nz L1 Idl (R-node)LefExRem,rO stl rO,(R-Itok)next stl R-Itok,(R-node)LefExRem tci_w (R-node)lock,#O Prod Node in exceptional circumstances, and for optimized. to token is returned in R-Itok ** ] ................. 1 The code for prod-node corresponds to the act I for some process/es doing conflict-resolution, I command into a command-list. Idl (R-tokPtr), Idl (R-tokPtr)l,R-rtokPtr MAKE-CR-COMMAND INSERT-CR-COMMAND of preparing and a command inserting this R-ItokPtr make confl-res insert command command. (ptr in cr-list in rO) EREE-TOKPTR RELEASE-NODE-LOCK inform HTS that proc is finished. 212 PARAI,LELtSM ! ** IN PRODUCI'ION SYSTEMS ** MakeCRCommand ! .................... Idl (R-globTab)comFreeList,rO cb neq NULL,rO,LIO [ get command from free-list ALLOCATE-MORE-SPACE LIO: I ** ldl (rO)next,rl stl r1,(R-globTab)comFreeList stl R-ItokPtr,(rO)Itok stl R-rtokPtr,(rO)rtok stl R-flg,(rO)flag InsertCRCormmand ** I ...................... LI: ldal (R-globTab)crListLock,rl cb_neq #O,(rl),L1 tsi (rl) br_nz L1 Idl (R-globTab)crList,r2 stl r2,(rO)next stl rO,(R-globTab)crList tci (rl) I spin on cache I recall new-comm is ret in rO B.2. Code for Interpreter Using Multiple Software Task Schedulers 1 Extensions to the I schedulers. I the and data of the implementation using I routines l code Most that Extensions to change Data code structures written software significantly. for task for the the use of software task HTS remains the same even schedulers. There are only three Note: Only the changes are given here. structures: I ............................... #define MaxSchedulers 32 #define NumSchedulers XX typedef struct { unsigned flg-dir-nid; I flag, Token *tokPtr; I ptr } TaskStackEntry; #define TaskStackSize typedef struct 512 dir, nid to token for causing for activation activation; CODE AND DATA STRUCTURtG FOR PARAI,LEL 213 IMPLEMENTATION ( int lock; ! 
B.2. Code for Interpreter Using Multiple Software Task Schedulers

! This section gives the extensions to the code and data structures of
! the implementation for the use of software task schedulers (STS).
! Most of the code written for the HTS remains the same even for
! software task schedulers.  There are only three routines that change
! significantly.  Note: Only the changes are given here.

! Extensions to data structures:
! ----------------------------------------------------------------------
#define MaxSchedulers 32
#define NumSchedulers XX

typedef struct {
    unsigned flg-dir-nid;       ! flag, dir, nid for causing activation
    Token    *tokPtr;           ! ptr to token for activation
} TaskStackEntry;

#define TaskStackSize 512

typedef struct {
    int lock;                   ! lock for entering entry in Task Stack
    int sTop;                   ! index of top task pending in Task Stack
    TaskStackEntry *taskStk;    ! ptr to task stack of size TaskStackSize
    int dummy;                  ! filler to make size 2^2 lwords
} Sched;

extern-shared Sched Scheduler[NumSchedulers];
extern-shared int PendTaskCount; ! sum of tasks pending in all scheds
extern int SchedID;              ! global var, local to each processor;
                                 ! refers to the schedID from which the
                                 ! task being serviced was obtained
extern int Random;               ! var, local to each processor
extern-shared int NewCycleLock;  ! lock so that only one processor goes
                                 ! off to process beginNextCycle

! Note that the register R-HTS is no longer needed.
! R-sched points to the base of the Scheduler array.
#define R-stk R-hIndex           ! points to task stack of given scheduler
! Note: The lock field of the node-table-row now consists of three
! pieces of information: bit<0>:lock-bit, bit<1>:direction-bit,
! bit<2>:flag-bit.

! Extensions to the code:
! ----------------------------------------------------------------------
! THE FOLLOWING CODE EXTENSIONS ARE GIVEN FOR THE CASE WHEN INTRA-NODE
! PARALLELISM IS USED.  THE CODE EXTENSIONS FOR THE CASE WHEN NODE
! PARALLELISM OR PRODUCTION PARALLELISM IS USED ARE GIVEN LATER.

! ** ReleaseNodeLock **
! ----------------------------------------------------------------------
! This routine corresponds to the action that is taken when a processor
! is done with evaluating a node activation.  At this point it may want
! to look for a new node activation to evaluate, or it may mark the
! beginning of a new recognize-act cycle.
! ----------------------------------------------------------------------
         decr_iw (R-node)refC,r0           ! decrement refC assoc with node
         ldal    (R-globTab)PendTaskC,r0   ! load address of PendingTaskCount
         decr_i  (r0)                      ! decr interlocked
         cb_neq  #0,(r0),L1                ! check if end of cycle
         ldal    (R-globTab)NewCycLock,r7
         tsi     (r7)
         br_nz   L1                        ! if the lock is busy it implies
                                           ! that some other processor is
                                           ! scheduling the new cycle, so
                                           ! this proc need not worry about it
         cb_neq  #0,(r0),L2                ! Get value of PendingTaskCount
                                           ! again.  Even if this proc got
                                           ! in, it is possible that another
                                           ! processor got the lock, scheduled
                                           ! the confl-res and act phases for
                                           ! the next cycle, and then released
                                           ! the lock.  Must check for that
                                           ! case.
         BEGIN-NEXT-CYCLE
L2:      tci     (r7)
L1:
PopG:                                      ! PopG is a global label

! ** PushG **
! ----------------------------------------------------------------------
! This piece of code is executed whenever a node activation is to be
! scheduled globally.
! ----------------------------------------------------------------------
         ldl     (R-globTab)random,r0      ! random-seed
         xor2    #4513,r0                  ! xor with some prime number
         bfx     r0,#(31-x),#x,r8          ! #x depends on NumScheds.  We
         shl     #x,r0,r0                  ! also rotate the rand-seed by
         bfi     r8,#0,#x,r0               ! #x bits; rotation is complete
         stl     r0,(R-globTab)random      ! store back random-seed
         shl     #2,r8,r9                  ! note r8 has indx of rand-sched
         ldal    (R-sched)r9,r9            ! get base-addr of that scheduler
L1:      cb_neq  #0,(r9),L1                ! check sched-lock; spin in cache
         tsi     (r9)lock,#0               ! try and get the lock
         br_nz   L1
         add3    #2,(r9)sTop,r0            ! sTop += 2, as each entry is
         stl     r0,(r9)sTop               ! 2 lwords
         add3    r0,(r9)taskStk,r1         ! get address where to push data
         stq     R'-nid---R'-tokPtr,(r1)   ! actually push the data
         tci     (r9)lock
         ldal    (R-globTab)PendTaskC,r0
         incr_i  (r0)                      ! increment_interlocked PendTaskCount
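In C, the PushG listing above amounts to: pick a pseudo-random scheduler, lock its task stack, push the entry, and bump the global pending-task count. A minimal sketch, assuming the data structures above, the acquire()/release() helpers sketched earlier, and a hypothetical atomic_incr() with the semantics of incr_i (the xor/rotate generator is replaced here by an ordinary linear congruential step):

extern void acquire(volatile int *lock);   /* sketched earlier         */
extern void release(volatile int *lock);
extern void atomic_incr(volatile int *n);  /* hypothetical incr_i      */

typedef struct { unsigned flgDirNid; void *tokPtr; } TaskEntry;
typedef struct { volatile int lock; int sTop; TaskEntry *taskStk; } TaskQueue;

extern TaskQueue schedulers[];             /* Scheduler[NumSchedulers] */
extern volatile int pendTaskCount;         /* PendTaskCount            */

static unsigned randomSeed = 4513;

void pushG(unsigned flgDirNid, void *tokPtr, int numScheds)
{
    TaskQueue *q;

    /* cheap pseudo-random scheduler choice, like the xor/rotate above */
    randomSeed = randomSeed * 1103515245u + 12345u;
    q = &schedulers[randomSeed % (unsigned) numScheds];

    acquire(&q->lock);
    q->sTop += 1;                          /* sTop += 2 lwords above   */
    q->taskStk[q->sTop].flgDirNid = flgDirNid;
    q->taskStk[q->sTop].tokPtr    = tokPtr;
    release(&q->lock);

    atomic_incr(&pendTaskCount);           /* incr_i in the listing    */
}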
! ** PopG **
! ----------------------------------------------------------------------
! This code determines the processability of a node activation in
! addition to performing the task of the PopG routine as described in
! the HTS version of the code.
! ----------------------------------------------------------------------
PopG:    ldl     (R-globTab)schID,r8       ! get ID of last sched used
L2:      add2    #1,r8                     ! increment modulo NumScheds
         cb_le   (NumScheds-1),r8,L1
         sub2    r8,NumScheds
L1:      shl     #2,r8,r9
         ldal    (R-sched)r9,r9
         cmp     Zero,(r9)sTop             ! check -- is stack empty?
         br_z    L2                        ! if empty, try next sched
         cb_neq  #0,(r9),L2                ! if sched-lock busy, try next sched
         tsi     (r9),#0
         br_nz   L2                        ! if can't get lock, try next sched
         ldl     (r9)taskStk,R-stk         ! get base-addr of task stack
         ldl     (r9)sTop,r0               ! get sTop
L_loop:  cb_le   #0,r0,L_fail
         ldal    (R-stk)r0,r1              ! get base address of stack entry
         ldq     (r1),R-nid---R-tokPtr
         bfx     R-nid,#31,#1,R-flg
         bfx     R-nid,#30,#1,R-dir
         bfx     R-nid,#0,#30,R-nid
         shl     #3,R-nid,r2
         add2    R-nid,r2
         add2    R-nid,r2                  ! done as size of nod-tab-entry
                                           ! is 10 lwords
         ldal    (R-nodTab)r2,R-node       ! R-node gets bas addr of nodTabRow
         cmp     R-nid,#10000              ! check if txxx-node
         br_gt   L_succ                    ! if so, entry is processable
L3:      tsi_w   (R-node)lock,#0           ! lock the node-table entry
         br_nz   L3
         cb_eq   Insert,R-flg,L_Ins
         ..                                ! two similar symmetric cases;
         ..                                ! only one is considered here
L_Ins:   cb_eq   Left,R-dir,LefIns
         ..                                ! two similar symmetric cases;
         ..                                ! only one is considered here
LefIns:  cmp_w   Zero,(R-node)refC         ! check if refCount is zero
         br_neq  L93
         stw     flg-dir-lock,(R-node)lock ! flg-dir-lock is a compile-time
                                           ! const
L94:     add3    #1,(R-node)refC,r3
         stw     r3,(R-node)refC           ! update refC
         tci_w   (R-node)lock,#0           ! release lock on node-tab entry
         br      L_succ
L93:     cmp_w   flg-dir-lock,(R-node)lock ! check if flg-dir-lock is same
         br_eq   L94                       ! if same, entry is processable
         tci_w   (R-node)lock,#0           ! release lock
         sub2    #2,r0                     ! sub size of task entry
         br      L_loop                    ! see if next entry on stack is
                                           ! processable
L_fail:  tci     (r9)lock,#0               ! release lock on scheduler
         br      L2                        ! try next scheduler
L_succ:  cmp_w   r0,(r9)sTop               ! check if task was picked from
         br_eq   L95                       ! stack top; if from top, no
                                           ! compaction is necessary
         ldl     (r9)sTop,r0               ! get value of sTop.  Can destroy
                                           ! r0, as r1 contains the info in
                                           ! better form.
         ldal    (R-stk)r0,r0
         ldq     (r0),r3-r4                ! load top entry
         stq     r3-r4,(r1)                ! store into the middle slot
L95:     sub3    #2,(r9)sTop,r0
         stl     r0,(r9)sTop               ! decrement stack-top pointer
         tci     (r9)lock,#0               ! release scheduler lock
         stl     r8,(R-globTab)schID       ! store into global var schID the
                                           ! scheduler from which the task
                                           ! was obtained
         jmpi    (R-node)R-dir             ! jump to code specific to this
                                           ! activation

! ----------------------------------------------------------------------
! THE FOLLOWING CODE CORRESPONDS TO THE CASE WHEN NODE PARALLELISM IS
! USED.  THE CODE WHEN PRODUCTION PARALLELISM IS USED IS SIMILAR, AND
! THE COST-MODEL DERIVED FROM THIS CODE CAN BE USED THERE TOO.
! ----------------------------------------------------------------------

typedef struct {
    unsigned addr-lef-act;    /* code address for left activation */
    unsigned addr-rht-act;    /* code address for right activation */
    unsigned addr-mak-tok;    /* code address for making token */
    unsigned addr-lef-hash;   /* code address for left hash fn */
    unsigned addr-rht-hash;   /* code address for right hash fn */
    unsigned addr-do-tests;   /* code for tests assoc with node */
    short    lock;            /* for safe modification of other fields */
    short    refCount;        /* used by STS (not needed here) */
    Token    *leftExtraRem;   /* extra removes due to conjugate pairs */
    Token    *rightExtraRem;  /* extra removes due to conjugate pairs */
    Node     *succList;       /* set of successors of the node */
} NodeTableEntry;

! ** ReleaseNodeLock' **
! ----------------------------------------------------------------------
! Only the first line is changed with respect to the previous version.
! Overall the cost remains the same.
! ----------------------------------------------------------------------
         tci_w   (R-node)lock,#0           ! release lock associated with node
         ldal    (R-globTab)PendTaskC,r0   ! load address of PendingTaskCount
         decr_i  (r0)                      ! decr interlocked
         cb_neq  #0,(r0),L1                ! check if end of cycle
         ldal    (R-globTab)NewCycLock,r7
         tsi     (r7)
         br_nz   L1                        ! if the lock is busy it implies
                                           ! that some other processor is
                                           ! scheduling the new cycle, so
                                           ! this proc need not worry about it
         cb_neq  #0,(r0),L2                ! Get value of PendingTaskCount
                                           ! again.  Even if this proc got
                                           ! in, it is possible that another
                                           ! processor got the lock, scheduled
                                           ! the confl-res and act phases for
                                           ! the next cycle, and then released
                                           ! the lock.  Must check for that
                                           ! case.
         BEGIN-NEXT-CYCLE
L2:      tci     (r7)
L1:
PopG:                                      ! PopG is a global label
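The processability test inside the PopG listing above is the essential difference from the HTS version. Stated in C (a sketch with hypothetical names; the caller is assumed to hold the node-table-entry lock, as PopG does):

/* A node activation popped under intra-node parallelism may be run if:
     (a) it is a txxx (constant-test) node, which needs no locking, or
     (b) no activation of the node is in progress (refC == 0), or
     (c) the activations in progress have the same flag and direction
         (packed into the lock word as bit<2>:flag, bit<1>:dir), so
         they may proceed concurrently under intra-node parallelism. */
typedef struct {
    int lock;      /* bit<0>:lock, bit<1>:direction, bit<2>:flag       */
    int refC;      /* number of activations currently being processed  */
} NodeEntry;

int processable(NodeEntry *n, int flgDirLock, int isTxxxNode)
{
    if (isTxxxNode)
        return 1;                  /* nid > 10000 in the listing        */
    if (n->refC == 0) {
        n->lock = flgDirLock;      /* record this flag+direction        */
        n->refC = 1;
        return 1;
    }
    if (n->lock == flgDirLock) {   /* same flag and direction running   */
        n->refC++;
        return 1;
    }
    return 0;                      /* conflicting activation: skip task */
}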
! ** PushG' **
! ----------------------------------------------------------------------
! The routine does not change between node parallelism and intra-node
! parallelism.  So the cost also remains the same.
! ----------------------------------------------------------------------
         ldl     (R-globTab)random,r0      ! random-seed
         xor2    #4513,r0                  ! xor with some prime number
         bfx     r0,#(31-x),#x,r8          ! #x depends on NumScheds.  We
         shl     #x,r0,r0                  ! also rotate the rand-seed by
         bfi     r8,#0,#x,r0               ! #x bits; rotation is complete
         stl     r0,(R-globTab)random      ! store back random-seed
         shl     #2,r8,r9                  ! note r8 has indx of rand-sched
         ldal    (R-sched)r9,r9            ! get base-addr of that scheduler
L1:      cb_neq  #0,(r9),L1                ! check sched-lock; spin in cache
         tsi     (r9)lock,#0               ! try and get the lock
         br_nz   L1
         add3    #2,(r9)sTop,r0            ! sTop += 2, as each entry is
         stl     r0,(r9)sTop               ! 2 lwords
         add3    r0,(r9)taskStk,r1         ! get address where to push data
         stq     R'-nid---R'-tokPtr,(r1)   ! actually push the data
         tci     (r9)lock
         ldal    (R-globTab)PendTaskC,r0
         incr_i  (r0)                      ! increment_interlocked PendTaskCount

! ** PopG' **
! ----------------------------------------------------------------------
! This code determines the processability of a node activation in
! addition to performing the task of the PopG routine as described in
! the HTS version of the code.  The main place of change from the
! intra-node version is where the processability of the selected node
! is established.
! ----------------------------------------------------------------------
PopG:    ldl     (R-globTab)schID,r8       ! get ID of last sched used
L2:      add2    #1,r8                     ! increment modulo NumScheds
         cb_le   (NumScheds-1),r8,L1
         sub2    r8,NumScheds
L1:      shl     #2,r8,r9
         ldal    (R-sched)r9,r9
         cmp     Zero,(r9)sTop             ! check -- is stack empty?
         br_z    L2                        ! if empty, try next sched
         cb_neq  #0,(r9),L2                ! if sched-lock busy, try next sched
         tsi     (r9),#0
         br_nz   L2                        ! if can't get lock, try next sched
         ldl     (r9)taskStk,R-stk         ! get base-addr of task stack
         ldl     (r9)sTop,r0               ! get sTop
L_loop:  cb_le   #0,r0,L_fail
         ldal    (R-stk)r0,r1              ! get base address of stack entry
         ldq     (r1),R-nid---R-tokPtr
         bfx     R-nid,#31,#1,R-flg
         bfx     R-nid,#30,#1,R-dir
         bfx     R-nid,#0,#30,R-nid
         shl     #3,R-nid,r2
         add2    R-nid,r2
         add2    R-nid,r2                  ! done as size of nod-tab-entry
                                           ! is 10 lwords
         ldal    (R-nodTab)r2,R-node       ! R-node gets bas addr of nodTabRow
         cmp     R-nid,#10000              ! check if txxx-node
         br_gt   L_succ                    ! if so, success
         tsi_w   (R-node)lock,#0           ! lock the node-table entry
         br_z    L_succ                    ! if lock obtained, success
         sub2    #2,r0                     ! sub size of task entry
         br      L_loop                    ! see if next entry is processable
L_fail:  tci     (r9)lock,#0               ! release lock on scheduler
         br      L2                        ! try next scheduler
L_succ:  cmp_w   r0,(r9)sTop               ! check if task was picked from
         br_eq   L95                       ! stack top; if from top, no
                                           ! compaction is necessary
         ldl     (r9)sTop,r0               ! get value of sTop.  Can destroy
                                           ! r0, as r1 contains the info in
                                           ! better form.
         ldal    (R-stk)r0,r0
         ldq     (r0),r3-r4                ! load top entry
         stq     r3-r4,(r1)                ! store into the middle slot
L95:     sub3    #2,(r9)sTop,r0
         stl     r0,(r9)sTop               ! decrement stack-top pointer
         tci     (r9)lock,#0               ! release scheduler lock
         stl     r8,(R-globTab)schID       ! store into global var schID the
                                           ! scheduler from which the task
                                           ! was obtained
         jmpi    (R-node)R-dir             ! jump to code specific to this
                                           ! activation
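Both PopG variants may take a task from the middle of a scheduler's stack when the entries nearer the top are not processable. The compaction step at L95 keeps the stack dense by moving the top entry into the hole, so no other shuffling is ever needed. A self-contained C sketch of just that step (entry type and names are illustrative):

typedef struct { unsigned flgDirNid; void *tokPtr; } StkEntry;

/* Remove the entry at index pos from a stack whose top entry is at
   index top.  Returns the new top index.  Mirrors PopG above: if the
   removed entry was not the top one, the top entry is copied into the
   hole; no compaction in between is necessary. */
int popAt(StkEntry *stk, int pos, int top, StkEntry *out)
{
    *out = stk[pos];               /* the task being taken             */
    if (pos != top)
        stk[pos] = stk[top];       /* fill the hole with the top entry */
    return top - 1;                /* stack shrinks by one entry       */
}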
Appendix C
Derivation of Cost Models for the Simulator

C.1. Cost Model for the Parallel Implementation Using HTS

/*
 * This cost model was designed for the case where a single Hardware Task
 * Scheduler (HTS) is used.
 *
 * Exports:
 *    double TaskProcCost(tptr: *Task);
 *    double TaskLoopCost(tptr: *Task);
 *    double TaskFinishCost(tptr: *Task);
 *    double TaskSchedCost(tptr: *Task);
 */

/*
 * The cost model described by this file may be understood in terms of the
 * diagram given below.  The processing associated with a node activation
 * usually consists of the following basic steps: (1) fetch an activation
 * to process from the HTS; (2) if there are no tokens in the memory of
 * the node in the opposite direction, simply go to the end; (3) do some
 * processing for each of those tokens to find any matching tokens, and
 * enqueue the activations of the successor nodes onto the scheduler;
 * (4) once the processing required by the node activation is finished,
 * go to the end.  In the following diagram:
 *    T_deq  is Cost of dequeueing an activation from the HTS;
 *    T_b    is Beginning cost;
 *    T_ls   is Loop-Start cost;
 *    T_il   is Inner-Loop cost;
 *    T_enq  is Cost of enqueueing an activation onto the HTS;
 *    T_le   is Loop-End cost;
 *    T_nl   is No-Loop cost;
 *    T_e    is Ending cost.
 *
 *                       +--> T_nl >----------------------------+
 *                       |                                      |
 *   Start --T_deq--> T_b+                                      +--> T_e --> End
 *                       |                                      |
 *                       +--> T_ls --> T_il --+--> T_le >-------+
 *                                     ^      |
 *                                     +------+
 *                          (T_enq occurs inside the T_il loop)
 *
 * The exported procedures correspond to the following costs:
 *    TaskProcCost(tptr: *Task)   := T_b + (if no_succ then T_nl; else T_ls;);
 *    TaskLoopCost(tptr: *Task)   := T_il;
 *    TaskFinishCost(tptr: *Task) := T_e + (if no_succ then 0; else T_le;);
 *    TaskSchedCost(tptr: *Task)  := T_deq;    for the HTS version.
 */

#include "sim.h"
#include "stats.h"

/*
 * Some constant definitions for cost of tasks.
 */
#define RR   1.0   /* cost of reg-reg instruction */
#define MR   1.0   /* cost of mem-reg instruction */
#define M2R  2.0   /* cost of mem-mem-reg instruction */
#define SYN  3.0   /* cost of interlocked instructions */
#define CBR  1.5   /* compare&branch on register value */
#define CBM  1.5   /* compare&branch on memory value */
#define BRR  1.5   /* branch on register value */
#define BRM  1.5   /* branch requiring memory-ref */

#define C_PushG            (M2R)
#define C_PopG             (6*RR + MR + M2R + BRM)
#define C_GetNewTokPtr     (3*MR + CBR)
#define C_GetNewToken      (3*MR + CBR)
#define C_FreeTokPtr       (3*MR)
#define C_ReleaseNodeLock  (2*RR + CBR + 2*MR + BRR)
#define C_LockTokenHT      (4*MR + 2*SYN + CBM + 3*BRR)
#define C_ReleaseTokenHT   (2*MR + 2*SYN + CBM + 2*BRR)
#define C_MakeCRCommand    (8*MR + CBR)
#define C_InsertCRCommand  (4*MR + 2*SYN + CBM + BRR)
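As a concrete reading of these constants: C_LockTokenHT charges 4 memory-reference instructions, 2 interlocked instructions, one compare-and-branch on memory, and 3 branches, i.e. 4(1.0) + 2(3.0) + 1.5 + 3(1.5) = 16 cost units, where one unit is one canonical register-register instruction time (about 0.5 microseconds on the 2 MIPS processors assumed in this thesis). The fragment below is purely illustrative and just evaluates a few of the composites:

#include <stdio.h>

#define RR   1.0
#define MR   1.0
#define M2R  2.0
#define SYN  3.0
#define CBR  1.5
#define CBM  1.5
#define BRR  1.5
#define BRM  1.5

#define C_LockTokenHT     (4*MR + 2*SYN + CBM + 3*BRR)
#define C_ReleaseTokenHT  (2*MR + 2*SYN + CBM + 2*BRR)

int main()
{
    printf("C_LockTokenHT    = %.1f units\n", C_LockTokenHT);    /* 16.0 */
    printf("C_ReleaseTokenHT = %.1f units\n", C_ReleaseTokenHT); /* 12.5 */
    printf("lock + release   = %.2f us at 2 MIPS\n",
           (C_LockTokenHT + C_ReleaseTokenHT) * 0.5);            /* 14.25 */
    return 0;
}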
* toksz: number of wme * ntests: number of tests * nteq: number of equality * ntok: number of tokens * ntokOmem: * nsucc: number number " primTask: are pointers used tests if a given with made in a given of tokens in the functions a two-input at the memory in the opposite node activations task memory is scheduled of activation through direction * Ices: the number of condition elements to the * rces: the number of condition elements to the right true * NumActiveProc: number (left node. generated whether is being global activation. HTS. or deleted. contention of active by a given the or right). inserted if memory node. node. * flag: * MemConFlag: to mean: node. two-input * dir: a token below in token. associated of successor true 221 FOR 'NIE SIMULATOR left of the two-input of the is to be taken two-input into account processors. m */ double C_MakeToken(toksz) int toksz; { /* * if (tok_size * because */ the double <= I) then right side assume that of a not-node alpha-token may have is being rces listed made. "<=" as O. cost; y if (toksz <= I) cost = 3 * MR + C_GetNewToken; cost = MR + (2 * toksz) else return * MR + M2R + BRR + BRM (cost); } double C_InsertToken() { double cost cost; - RR + 3*MR + C_LockTokenHT + C_ReleaseTokenHT; return(cost); } double int C_DeleteToken(toksz, toksz, nteq, nteq, ntok) ntok; { double cost if cost; = C_LockTokenHT (toksz { <= I) /* + C_ReleaseTokenHT; delete alpha-token "/ + C_GetNewToken; node. node. 222 PARAI,LEI.ISM if (nteq > O) /* hashing works IN PROI)UCTION */ { cost += 3*RR + g'MR cost += (3*RR cost += (ntok/2) + 3"CBR + 3*BRR; } else { + 6*MR + 2*CBR * (RR + 4*MR + BRR) + (3*MR + CBR + 3*BRR); + CBR + 3"BRR); } } else { if (nteq > O) /" hashing works "/ cost +- 3*RR + It*MR + 3*CBR + 4*BRR; cost += (3*RR + 6*MR + 2*CBR + BRR) cost += (ntok/2) { } else { * (RR + 5*MR + (5*MR } } return(cost); } double C_HashToken(toksz,nteq) int toksz,nteq; { double cost; if (toksz cost else cost <= I) = 3*RR + MR + nteq*MR = 3*RR + MR + 2*nteq*MR + BRR + BRM; + BRR + BRM; return(cost); } double C_DoTests(ntests,pass) int ntests; bool pass; { double cost; if (ntests cost == O) = RR + BRR; else { if (pass) cost /* the = ntests tests * succeed (4*MR */ + BRR) + CBR + CBR + 3.5*BRR); + RR + BRR; + 4*BRR); SYSTEMS DERIVATION OF COST MODEI,SFOR TItE 223 S1MULATOR else cost * = (ntests/2) (4*MR + BRR) + RR + BRR; } return(cost); } double C_NextMatchTok(nsucc, int nsucc, nteq, ntests, nteq, ntests, ntokOmem) ntokOmem; { double cost cost, k; = 0.0; if (nteq == O) { if (nsucc == O) ( cost += ntokOmem * (RR + 2*MR cost += ntokOmem * C_DoTests(ntests,False) + 2*CBR + Z*BRR + BRM); + CBR; ) else { k = ntokOmem/nsucc; cost += (k - I) * (RR + 2*MR cost += (k - 1) * C_DoTests(ntests,False); cost += RR + 3*MR + 2*CBR + 2*CBR + 2*BRR + BRR + BRM + BRM); + C_DoTests(ntests,True); ) } else /* hashing is useful */ { if (nsucc cost == O) += CBR; else cost += RR + 3*MR + 2*CBR + BRR + BRM + C_DoTests(ntests,True); ) return(cost); ) double C_NextMatchTokNOT(nsucc, int nsucc, nteq, ntests, nteq, ntests, ntokOmem) ntokOmem; ( double cost cost, k; = 0,0; if (nteq == O) { if (nsucc =- O) { /* It is still • opposite possible node • transitions that happened. for successful It only an insert or I->0 matches implies that transitions */ k = ntokOmem/2; cost += k * (RR + Z'MR + 2*CBR + 2'BRR + BRM); with the tokens there for were of the no 0->I a delete. 
double C_NextMatchTokNOT(nsucc, nteq, ntests, ntokOmem)
int nsucc, nteq, ntests, ntokOmem;
{
    double cost, k;

    cost = 0.0;
    if (nteq == 0) {
        if (nsucc == 0) {
            /*
             * It is still possible for successful matches with the tokens
             * of the opposite node.  It only implies that there were no
             * 0->1 or 1->0 transitions for the insert or delete that
             * happened.
             */
            k = ntokOmem/2;
            cost += k * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += k * C_DoTests(ntests, False) + CBR;
            cost += k * (RR + 2*MR + SYN + 3*CBR + 2*BRR + BRM);
            cost += k * C_DoTests(ntests, True);
        } else {
            k = ntokOmem/nsucc;
            cost += (k - 1) * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += (k - 1) * C_DoTests(ntests, False);
            cost += RR + 2*MR + SYN + 3*CBR + BRR + BRM + C_DoTests(ntests, True);
        }
    } else {    /* hashing is useful */
        if (nsucc == 0)
            cost += CBR;    /* the opposite bucket is hopefully empty */
        else
            cost += RR + 2*MR + SYN + 3*CBR + BRR + BRM + C_DoTests(ntests, True);
    }
    return (cost);
}

double C_DetermineRefCount(nsucc, nteq, ntests, ntokOmem)
int nsucc, nteq, ntests, ntokOmem;
{
    double cost;

    cost = 3*RR + MR;
    if (nteq > 0) {
        if (nsucc == 0) {
            /*
             * This case implies that the # of matching tokens was greater
             * than zero.  Assuming that it maps into a hash bucket with
             * exactly one token.
             */
            cost += RR + 2*MR + 2*CBR + 2*BRR + BRM + CBR;
            cost += C_DoTests(ntests, True);
        } else {
            /* nsucc == 1 and the opposite hash bucket would be empty */
            cost += CBR;
        }
    } else {
        if (nsucc == 0) {
            cost += ntokOmem * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += ntokOmem *
                ((C_DoTests(ntests,True) + C_DoTests(ntests,False))/2.0);
        } else {
            cost += ntokOmem * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += C_DoTests(ntests, False) + CBR;
        }
    }
    return (cost);
}

double TaskProcCost(tptr)
Task *tptr;
{
    bool primTask;
    TaskPtr *tp;
    double cost, tempc;
    int nid, lces, rces, nsucc, nteq, ntests, ntok, ntokOmem, toksz;
    Direction dir;

    cost = 0.0;
    primTask = tptr->tPrimaryTask;
    nid = tptr->tNodeID;
    tp = tptr->tDepList;
    nsucc = 0;
    while (tp) {
        nsucc++;
        tp = tp->tNext;
    }

    switch (tptr->tType) {
    case rootNode:
        if (nsucc)
            cost = (2*RR + 3*MR + CBR) + C_PushG;
        else
            cost = (2*RR + MR + CBR);
        break;

    case txxxNode:
        /* the cost for the &bus and txxx nodes is made equal */
        if (primTask)
            cost = (2*RR + 2*MR + CBR) + C_PopG;
        else
            cost = (2*RR + MR + CBR);
        /*
         * The following if-stmt is not a legal stmt in the language, and
         * has been added as a simple approximation to the actual form
         * used in the simulator.
         */
        if (nsucc > 0) {
            if successor_sched_thru_HTS
                cost += 3*RR + C_PushG;
            else
                cost += RR + MR;
        }
        break;

    case andNode:
        lces = NodeTable[nid].nLces;
        rces = NodeTable[nid].nRces;
        dir = tptr->tSide;
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        if (dir == Left) {
            toksz = lces;
            ntok = tptr->tNumLeft;
            ntokOmem = tptr->tNumRight;
        } else {
            toksz = rces;
            ntok = tptr->tNumRight;
            ntokOmem = tptr->tNumLeft;
        }
        if (toksz <= 1) {    /* alpha-activation */
            cost = C_PopG + C_HashToken(toksz,nteq) + CBR + (RR + MR + BRR);
            if (tptr->tFlag == Insert)
                cost += C_MakeToken(toksz) + C_InsertToken() + BRR;
            else
                cost += C_DeleteToken(toksz, nteq, ntok);
            if ((ntokOmem != 0) && ((nteq == 0) || (nsucc != 0))) {
                cost += C_LockTokenHT + MR;
                cost += C_NextMatchTok(nsucc,nteq,ntests,ntokOmem);
            }
            if (nsucc != 0)
                cost += RR + 2*MR + C_GetNewTokPtr + 2*MR + C_PushG;
        } else {             /* beta-activation */
            cost = C_PopG + 2*MR + C_HashToken(toksz,nteq) + CBR + (RR + MR + BRR);
            if (tptr->tFlag == Insert)
                cost += RR + MR + C_MakeToken(toksz) + C_InsertToken() + BRR;
            else
                cost += C_DeleteToken(toksz, nteq, ntok) + CBR;
            if ((ntokOmem != 0) && ((nteq == 0) || (nsucc != 0))) {
                cost += C_LockTokenHT + MR;
                cost += C_NextMatchTok(nsucc,nteq,ntests,ntokOmem);
            }
            if (nsucc != 0)
                cost += RR +
                    2*MR + C_GetNewTokPtr + 2*MR + C_PushG;
        }
        break;

    case notNode:
        lces = NodeTable[nid].nLces;
        rces = NodeTable[nid].nRces;
        dir = tptr->tSide;
        if (dir == Left) toksz = lces; else toksz = rces;
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        if (dir == Left) {
            ntok = tptr->tNumLeft;
            ntokOmem = tptr->tNumRight;
            cost = C_PopG + 2*MR + C_HashToken(toksz,nteq) + CBR;
            if (tptr->tFlag == Insert) {
                cost += 2*RR + 3*MR + C_MakeToken(toksz) + C_InsertToken() + BRR;
                if ((ntokOmem != 0) && ((nteq == 0) || (nsucc == 0))) {
                    cost += C_LockTokenHT + C_ReleaseTokenHT + BRR;
                    cost += C_DetermineRefCount(nsucc, nteq, ntests, ntokOmem);
                }
            } else
                cost += C_DeleteToken(toksz, nteq, ntok) + CBR;
            if (nsucc != 0)
                cost += RR + 5*MR + BRR + C_GetNewTokPtr + C_PushG;
            if (toksz <= 1) {    /* i.e., an alpha not-node activation */
                if (tptr->tFlag == Insert)
                    cost = cost - (RR + 3*MR);
                else
                    cost = cost - (2*MR + BRR);
            }
        } else {    /* dir == Right */
            ntokOmem = tptr->tNumLeft;
            ntok = tptr->tNumRight;
            cost = C_PopG + 2*MR + C_HashToken(toksz,nteq) + CBR + RR + MR + BRR;
            if (tptr->tFlag == Insert)
                cost += RR + MR + C_MakeToken(toksz) + C_InsertToken() + BRR;
            else
                cost += C_DeleteToken(toksz, nteq, ntok) + CBR;
            if ((ntokOmem != 0) && ((nteq == 0) || (nsucc != 0))) {
                cost += RR + C_LockTokenHT + MR;
                cost += C_NextMatchTokNOT(nsucc,nteq,ntests,ntokOmem);
            }
            if (nsucc != 0)
                cost += RR + 2*MR + C_GetNewTokPtr + 2*MR + C_PushG;
            if (toksz <= 2) {    /* i.e., an alpha not-node activation */
                if (tptr->tFlag == Insert)
                    cost = cost - (RR + 3*MR);
                else
                    cost = cost - (2*MR + CBR);
            }
        }
        break;

    case pNode:
        lces = NodeTable[nid].nLces;
        rces = NodeTable[nid].nRces;
        cost = C_PopG + 2*MR;
        cost += C_MakeCRCommand + C_InsertCRCommand + C_FreeTokPtr;
        cost += C_ReleaseNodeLock;
        break;

    default:
        cost = 0.0;
        break;
    }

    if (MemConFlag)
        cost = cost/MemCon[NumActiveProc];
    return (cost/1000.0);
}

double TaskLoopCost(tptr)
Task *tptr;
{
    TaskPtr *tp;
    double cost;
    int nid, nsucc, nteq, ntests, ntok, ntokOmem, toksz;
    Direction dir;

    cost = 0.0;
    nid = tptr->tNodeID;
    tp = tptr->tDepList;
    nsucc = 0;
    while (tp) {
        nsucc++;
        tp = tp->tNext;
    }

    switch (tptr->tType) {
    case rootNode:
        cost = RR + 3*MR + CBR + BRR + C_PushG;
        break;

    case txxxNode:
        /*
         * The following if-stmt is not a legal stmt in the language, and
         * has been added as a simple approximation to the actual form
         * used in the simulator.
         */
        if successor_sched_thru_HTS
            cost = 3*RR + C_PushG;
        else
            cost = RR + MR;
        break;

    case andNode:
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        dir = tptr->tSide;
        if (dir == Left) {
            ntok = tptr->tNumLeft;
            ntokOmem = tptr->tNumRight;
        } else {
            ntok = tptr->tNumRight;
            ntokOmem = tptr->tNumLeft;
        }
        cost = MR + CBR + BRR + C_NextMatchTok(nsucc, nteq, ntests, ntokOmem);
        cost += RR + 2*MR + C_GetNewTokPtr + C_PushG + 2*MR;
        break;

    case notNode:
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        dir = tptr->tSide;
*/ if successor_sched_thru_HTS else cost cost = 3*RR + C_PushG; = RR + MR; break; case andNode: nteq = NodeTable[nid].numTeqbTests; ntests dir = NodeTable[nid].numTests; = tptr->tSide; if (dir { ntok == Left) = tptr->tNumLeft; ntokOmem ntokOmem = tptr->tNumLeft; = tptr->tNumRight; else { cost = cost += MR + CBR + BRR + C_NextMatchTok(nsucc, RR + 2"MR + C_GetNewTokPtr break; case notNode: nteq = NodeTabte[nid].numTeqbTests; ntests dir = NodeTable[nid].numTests; = tptr->tSide; if (dir -= Left) ntok } = tptr->tNumRight; + C_PushG } + 2*MR; nteq, ntests, ntokOmem); SYSTEMS DERIVATION OF COST MODEI_ FOR TIlE 229 SIMULATOR { ntok : tptr->tNumLeft; cost += 5*MR + RR ntokOmem + CBR + BRR : tptr->tNumRight; + C_GetNewTokPtr + C_PushG; } else { ntokOmem = tptr->tNumLeft; + CBR ntok cost += RR + 5*MR + BRR cost += C_NextMatchTokNOT(nsucc, = tptr->tNumRight; + C_GetNewTokPtr nteq, ntests, + C_PushG; ntokOmem); } break; case pNode: cost = 0.0; break; default: break; ) return(cost/lO00.O); } double Task TaskFinishCost(tptr) *tptr; { TaskPtr double *tp; cost; int hid, Ices, Direction dir; cost nsucc, nteq, ntests, ntok, ntokOmem, toksz; = O.O; nid= tp rces, tptr->tNodeID; = tptr->tDepList; while (tp) nsucc { nsucc++; - 0; tp = tp->tNext; switch(tptr->tType) { case rootNode: if (nsucc cost += i= O) cost += RR + MR + CBR + BRR; C_ReleaseNodeLock; break; case txxxNode: if (tptr->tPrimaryTask) else /, • • ,/ cost - cost = C_ReleaseNodeLock; BRM; The following to the actual if-stmt has been form used in the added as simulator. a simple approximation 230 PARALLELISM if (nsucc > O) cost IN PRODUCTION SYSTEMS += MR + RR + BRM; break; case andNode: Ices = NodeTable[nid].nLces; rces = Nodelable[nid].nRces; dir = tptr->tSide; nteq = NodeTable[nid].numTeqbTests; ntests = NodeTable[nid].numTests; if (dir == Left) { toksz else = Ices; { toksz if (nsucc cost cost ntok = rces; = tptr->tNumLeft; ntokOmem ntokOmem = tptr->tNumLeft; = tptr->tNumRight; ntok I= O) = MR + CBR + BRR + C_NextMatchTok(nsucc, += C_ReleaseNodeLock nteq, ntests, ntokOmem); + C_FreeTokPtr; if ((ntokOmem != O) && ((nteq cost += C_ReleaseTokenHT; == O) if (toksz <= I) /* alpha activation cost = cost - C FreeTokPtr; II (nsucc l= 0))) */ break; case notNode: Ices = NodeTab,le[nid].nLces; rces = NodeTable[nid].nRces; flit = tptr->tSide; nteq = NodeTable[nid].numTeqbTests; ntests ' = NodeTable[nid].numTests; if (dir == Left) { toksz =lces; if (nsucc cost ntok = tptr->tNumLeft; I= O) cost += C_FreeTokPtr ntokOmem = tptr->tNumRight; += MR + CBR; + C_ReleaseNodeLock; } else /* dir == Right "/ { toksz = rces; if (nsucc ntokOmem = tptr->tNumLeft; ntok - tptr->tNumRight; !- O) { cost += MR + CBR + BRR; cost += C NextMatchTokNOT(nsucc, nteq, ntests, ntokOmem); } cost if +- C_FreeTokPtr + C_ReleaseNodeLock; ((ntokOmem != O) && ((nteq cost += C_ReleaseTokenHT; == O) ) if (toksz break; case pNode: cost - 0.0; break; } = tptr->tNumRight; <= 1) cost = cost - C_FreeTokPtr; II (nsucc 1= 0))) } DERIVATION OF COST MOI)ELS 231 FOR "II-tE SIMULATOR default: cost = 0.0; break; ) if (MemConFlag) cost = cost/MemCon[NumActiveProc]; return(cost/lO00.O); ) double Task TaskSchedCost(tptr) *tptr; { /* * This is the time spent by the * to the number of bus cycles " wide would bus * be 2 bus * cycles this cycles. Assuming are 2-3. * I assume be one Assuming a 64 bit wide hardware required bus cycle, one bus 100ns bus scheduler, while cycle per and for servicing on a 32 bit wide to xmit cycle, (80 MBytes/s) corresponds a request. 
C.2. Cost Model for the Parallel Implementation Using STQs

/*
 * When software task queues (STQs) are used, the cost model remains the
 * same as when a hardware task scheduler is used.  There are, however, a
 * small number of differences for the parts where the tasks are enqueued
 * and dequeued from the task queues.  These differences are listed below.
 */

#define C_ReleaseNodeLock  (RR + 2*SYN + CBM + BRR)   /* cost for releasing node lock */
#define C_PushG            (6*RR + 2*MR + 2*SYN + CBM)

/* costs for scheduling a task through the global schedulers */
#define C_SchedTask        (3*MR + M2R + 2*SYN + CBM)  /* inside lock  */
#define C_SchedEnd         (RR + SYN)                  /* lock         */
#define C_SchedLoopStart   (MR)                        /* outside lock */
#define C_SchedLoop        (4*RR + MR + CBR + BRR)     /* outside lock */

/* cost for popping a task from the scheduler */
#define C_PopG             (MR + BRM)                  /* outside lock */

#define C_DeSchedTaskTxxx  (9*RR + 5*MR + M2R + 2*SYN + CBR + 3*BRR)

/* Costs when intra-node-level parallelism is used. */
#define C_i_DeSchedTask      (9*RR + 9*MR + M2R + 4*SYN + 3*CBR + 5*BRR)
#define C_i_DeSchedFailLoop  (10*RR + 5*MR + M2R + 2*SYN + 3*CBR + 5*BRR)

/* Costs when node-level or prod-level parallelism is used. */
#define C_np_DeSchedTask     (9*RR + 2*MR + M2R + 2*SYN + CBR + 4*BRR)
#define C_np_DeSchedFailLoop (10*RR + M2R + SYN + CBR + 3*BRR)

double TaskSchedCost(tptr)
Task *tptr;
{
    /*
     * This is the cost of enqueueing a node activation in a task queue.
     */
    double cost;

    cost = C_SchedTask/1000.0;
    if (MemConFlag)
        cost = cost / MemCon[NumActiveProc];
    return (cost);
}

double TaskDeSchedCost(tptr, numFail)
Task *tptr;
int numFail;    /* num of tasks looked at before the right task was found */
{
    double cost;

    cost = 0.0;
    if ((tptr != NULL) && (tptr->tType == txxxNode))
        cost += C_DeSchedTaskTxxx/1000.0;
    else if ((Grain == nodeLev) || (Grain == prodLev))
        cost += C_np_DeSchedTask/1000.0;
    else
        cost += C_i_DeSchedTask/1000.0;

    if ((Grain == nodeLev) || (Grain == prodLev))
        cost += (C_np_DeSchedFailLoop/1000.0) * ((double) numFail);
    else
        cost += (C_i_DeSchedFailLoop/1000.0) * ((double) numFail);

    if (MemConFlag)
        cost = cost / MemCon[NumActiveProc];
    return (cost);
}
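Under the STQ model the dequeue side is no longer the flat T_deq of the HTS: TaskDeSchedCost grows linearly with the number of unprocessable tasks a processor skips over. A usage sketch of the difference (illustrative; assumes the declarations in this section):

extern double TaskSchedCost(), TaskDeSchedCost(),
              TaskProcCost(), TaskLoopCost(), TaskFinishCost();

/* Total simulated time for one node activation under software task
   queues: the enqueue cost is paid by the producer, and the consumer
   pays a de-schedule cost that includes one fail-loop per task it
   inspected and skipped before finding this one. */
double stqTaskCost(tptr, nLoopIters, numFail)
Task *tptr;
int nLoopIters, numFail;
{
    double t;

    t  = TaskSchedCost(tptr);              /* enqueue into a queue     */
    t += TaskDeSchedCost(tptr, numFail);   /* dequeue + skipped tasks  */
    t += TaskProcCost(tptr);
    t += nLoopIters * TaskLoopCost(tptr);
    t += TaskFinishCost(tptr);
    return t;
}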