Department of Computer Science
Carnegie-Mellon University

CMU-CS-86-122

Parallelism in Production Systems

Anoop Gupta

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science at Carnegie-Mellon University.

March 1986

Copyright © 1986 Anoop Gupta

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 4864, monitored by the Space and Naval Warfare Systems Command under contract N00039-85-0134. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.

Abstract

Production systems (or rule-based systems) are widely used in Artificial Intelligence for modeling intelligent behavior and building expert systems. Most production-system programs, however, are extremely computation intensive and run quite slowly. The slow speed of execution has prohibited the use of production systems in domains requiring high performance and real-time response. This thesis explores the role of parallelism in the high-speed execution of production systems.

On the surface, production-system programs appear to be capable of using large amounts of parallelism--it is possible to perform match for each production in a program in parallel. The thesis shows that in practice, however, the speed-up obtainable from parallelism is quite limited, around 10-fold as compared to initial expectations of 100-fold to 1000-fold. The main reasons for the limited speed-up are: (1) there are only a small number of productions that are affected (require significant processing) per change to working memory; (2) there is a large variation in the processing requirement of these productions; and (3) the number of changes made to working memory per recognize-act cycle is very small. Since the number of productions affected and the number of working-memory changes per recognize-act cycle are not controlled by the implementor of the production-system interpreter (they are governed mainly by the author of the program and the nature of the task), the solution to the problem of limited speed-up is to somehow decrease the variation in the processing cost of affected productions. The thesis proposes a parallel version of the Rete algorithm which exploits parallelism at a very fine grain to reduce the variation. It further suggests that to exploit the fine-grained parallelism, a shared-memory multiprocessor with 32-64 high-performance processors is desirable. For scheduling the fine-grained tasks consisting of about 50-100 instructions, a hardware task scheduler is proposed.

The thesis presents simulation results for a large set of production systems exploiting different sources of parallelism. The thesis points out the features of existing programs that limit the speed-up obtainable from parallelism and suggests solutions for some of the bottlenecks. The simulation results show that using the suggested multiprocessor architecture (with individual processors performing at 2 MIPS), it is possible to obtain execution speeds of about 12000 working-memory element changes per second. This corresponds to a speed-up of 10-fold over the best known sequential implementation using a 2 MIPS processor.
This performance is significantly higher than that obtained by other proposed parallel implementations of production systems.

Acknowledgments

I would like to thank my advisors Charles Forgy, Allen Newell, and HT Kung for their guidance, support, and encouragement. Charles Forgy helped with his deep understanding of production systems and their implementation. Many of the ideas presented in this thesis originated with him or have benefited from his comments. Allen Newell, in addition to being an invaluable source of ideas, has shown me what doing research is about, and through his own example, what it means to be a good researcher. He has been a constant source of inspiration and it has been a great pleasure to work with him. HT Kung has been an excellent sounding board for ideas. He greatly helped in keeping the thesis on solid ground by always questioning my assumptions. I would also like to thank Al Davis for serving on my thesis committee. The final quality of the thesis has benefited much from his comments.

The work reported in this thesis has been done as a part of the Production System Machine (PSM) project at Carnegie-Mellon University. I would like to thank its current and past members--Charles Forgy, Ken Hughes, Dirk Kalp, Ted Lehr, Allen Newell, Kemal Oflazer, Jim Quinlan, Leon Weaver, and Robert Wedig--for their contributions to the research. I would also like to thank Greg Hood, John Laird, Bob Sproull (my advisor for the first two years at CMU), and Hank Walker for many interesting discussions about my research.

I would like to thank all my friends in Pittsburgh who have made these past years so enjoyable. I would like to thank Greg Hood, Ravi Kannan, Gudrun and Georg Klinker, Roberto Minio, Bud Mishra, Pradeep Sindhu, Pedro Szekely, Hank Walker, Angelika Zobel, and especially Paola Giannini and Yumi Iwasaki for making life so much fun.

Finally, I would like to thank my family, my parents and my two sisters, for their immeasurable love, encouragement, and support of my educational endeavors.

Table of Contents

1. Introduction
   1.1. Preview of Results
   1.2. Organization of the Thesis
2. Background
   2.1. OPS5
      2.1.1. Working-Memory Elements
      2.1.2. The Left-Hand Side of a Production
      2.1.3. The Right-Hand Side of a Production
   2.2. Soar
   2.3. The Rete Match Algorithm
   2.4. Why Parallelize Rete?
      2.4.1. State-Saving vs. Non-State-Saving Match Algorithms
      2.4.2. Rete as a Specific Instance of State-Saving Algorithms
      2.4.3. Node Sharing in the Rete Algorithm
      2.4.4. Rete as a Parallel Algorithm
3. Measurements on Production Systems
   3.1. Production-System Programs Studied in the Thesis
   3.2. Surface Characteristics of Production Systems
      3.2.1. Condition Elements per Production
      3.2.2. Actions per Production
      3.2.3. Negative Condition Elements per Production
      3.2.4. Attributes per Condition Element
      3.2.5. Tests per Two-Input Node
      3.2.6. Variables Bound and Referenced
      3.2.7. Variables Bound but not Referenced
      3.2.8. Variable Occurrences in Left-Hand Side
      3.2.9. Variables per Condition Element
      3.2.10. Condition Element Classes
      3.2.11. Action Types
      3.2.12. Summary of Surface Measurements
   3.3. Measurements on the Rete Network
      3.3.1. Number of Nodes in the Rete Network
      3.3.2. Network Sharing
   3.4. Run-Time Characteristics of Production Systems
      3.4.1. Constant-Test Nodes
      3.4.2. Alpha-Memory Nodes
      3.4.3. Beta-Memory Nodes
      3.4.4. And Nodes
      3.4.5. Not Nodes
      3.4.6. Terminal Nodes
      3.4.7. Summary of Run-Time Characteristics
4. Parallelism in Production Systems
   4.1. The Structure of a Parallel Production-System Interpreter
   4.2. Parallelism in Match
      4.2.1. Production Parallelism
      4.2.2. Node Parallelism
      4.2.3. Intra-Node Parallelism
      4.2.4. Action Parallelism
   4.3. Parallelism in Conflict-Resolution
   4.4. Parallelism in RHS Evaluation
   4.5. Application Parallelism
   4.6. Summary
   4.7. Discussion
5. Parallel Implementation of Production Systems
   5.1. Architecture of the Production-System Machine
   5.2. The State-Update Phase Processing
      5.2.1. Hash-Table Based vs. List Based Memory Nodes
      5.2.2. Memory Nodes Need to be Lumped with Two-Input Nodes
      5.2.3. Problems with Processing Conjugate Pairs of Tokens
      5.2.4. Concurrently Processable Activations of Two-Input Nodes
      5.2.5. Locks for Memory Nodes
      5.2.6. Linear vs. Binary Rete Networks
   5.3. The Selection Phase Processing
      5.3.1. Sharing of Constant-Test Nodes
      5.3.2. Constant-Test Node Successors
      5.3.3. Alpha-Memory Node Successors
      5.3.4. Processing Multiple Changes to Working Memory in Parallel
   5.4. Summary
6. The Problem of Scheduling Node Activations
   6.1. The Hardware Task Scheduler
      6.1.1. How Fast Need the Scheduler be?
      6.1.2. The Interface to the Hardware Task Scheduler
      6.1.3. Structure of the Hardware Task Scheduler
      6.1.4. Multiple Hardware Task Schedulers
   6.2. Software Task Schedulers
7. The Simulator
   7.1. Structure of the Simulator
      7.1.1. Inputs to the Simulator
         7.1.1.1. The Input Trace
         7.1.1.2. The Computational Model
         7.1.1.3. The Cost Model
         7.1.1.4. The Memory Contention Model
      7.1.2. Outputs of the Simulator
   7.2. Limitations of the Simulation Model
   7.3. Validity of the Simulator
8. Simulation Results and Analysis
   8.1. Traces Used in the Simulations
   8.2. Simulation Results for Uniprocessors
   8.3. Production Parallelism
      8.3.1. Effects of Action Parallelism on Production Parallelism
   8.4. Node Parallelism
      8.4.1. Effects of Action Parallelism on Node Parallelism
   8.5. Intra-Node Parallelism
      8.5.1. Effects of Action Parallelism on Intra-Node Parallelism
   8.6. Linear vs. Binary Rete Networks
      8.6.1. Uniprocessor Implementations with Binary Networks
      8.6.2. Results of Parallelism with Binary Networks
   8.7. Hardware Task Scheduler vs. Software Task Queues
   8.8. Effects of Memory Contention
   8.9. Summary
9. Related Work
   9.1. Implementing Production Systems on C.mmp
   9.2. Implementing Production Systems on Illiac-IV
   9.3. The DADO Machine
   9.4. The NON-VON Machine
   9.5. Kemal Oflazer's Work on Partitioning and Parallel Processing of Production Systems
      9.5.1. The Partitioning Problem
      9.5.2. The Parallel Algorithm
      9.5.3. The Parallel Architecture
      9.5.4. Discussion
   9.6. Honeywell's Data-Flow Model
   9.7. Other Work on Speeding-up Production Systems
10. Summary and Conclusions
   10.1. Primary Results of Thesis
      10.1.1. Suitability of the Rete-Class of Algorithms
      10.1.2. Parallelism in Production Systems
      10.1.3. Software Implementation Issues
      10.1.4. Hardware Architecture
   10.2. Some General Conclusions
   10.3. Directions for Future Research
References
Appendix A. ISP of Processor Used in Parallel Implementation
Appendix B. Code and Data Structures for Parallel Implementation
   B.1. Code for Interpreter with Hardware Task Scheduler
   B.2. Code for Interpreter Using Multiple Software Task Schedulers
Appendix C. Derivation of Cost Models for the Simulator
   C.1. Cost Model for the Parallel Implementation Using HTS
   C.2. Cost Model for the Parallel Implementation Using STQs

List of Figures

Figure 2-1: A sample production.
Figure 2-2: The Rete network.
Figure 3-1: Condition elements per production.
Figure 3-2: Actions per production.
Figure 3-3: Negative condition elements per production.
Figure 3-4: Attributes per condition element.
Figure 3-5: Tests per two-input node.
Figure 3-6: Variables bound and referenced.
Figure 3-7: Variables bound but not referenced.
Figure 3-8: Occurrences of each variable.
Figure 3-9: Variables per condition element.
Figure 4-1: OPS5 interpreter cycle.
Figure 4-2: Soar interpreter cycle.
Figure 4-3: Selection and state-update phases in match.
Figure 4-4: Production parallelism.
Figure 4-5: Node parallelism.
Figure 4-6: The cross-product effect.
Figure 5-1: Architecture of the production-system machine.
Figure 5-2: A production and the associated Rete network.
Figure 5-3: Problems with memory-node sharing.
Figure 5-4: Concurrent activations of two-input nodes.
Figure 5-5: The long-chain effect.
Figure 5-6: A binary Rete network.
Figure 5-7: Scheduling activations of constant-test nodes.
Figure 5-8: Possible solution when too many alpha-memory successors.
Figure 6-1: Problem of dynamically changing set of processable node activations.
Figure 6-2: Effect of scheduler performance on maximum speed-up.
Figure 6-3: Structure of the hardware task scheduler.
Figure 6-4: Effect of multiple schedulers on speed-up.
Figure 6-5: Multiple software task queues.
Figure 7-1: A sample trace fragment.
Figure 7-2: Static node information.
Figure 7-3: Code for left activation of an and-node.
Figure 7-4: Degradation in performance due to memory contention.
Figure 8-1: Production parallelism (nominal speed-up).
Figure 8-2: Production parallelism (true speed-up).
Figure 8-3: Production parallelism (execution speed).
Figure 8-4: Production and action parallelism (nominal speed-up).
Figure 8-5: Production and action parallelism (true speed-up).
Figure 8-6: Production and action parallelism (execution speed).
Figure 8-7: Node parallelism (nominal speed-up).
Figure 8-8: Node parallelism (true speed-up).
Figure 8-9: Node parallelism (execution speed).
Figure 8-10: Node and action parallelism (nominal speed-up).
Figure 8-11: Node and action parallelism (true speed-up).
Figure 8-12: Node and action parallelism (execution speed).
Figure 8-13: Intra-node parallelism (nominal speed-up).
Figure 8-14: Intra-node parallelism (true speed-up).
Figure 8-15: Intra-node parallelism (execution speed).
Figure 8-16: Intra-node and action parallelism (nominal speed-up).
Figure 8-17: Intra-node and action parallelism (true speed-up).
Figure 8-18: Intra-node and action parallelism (execution speed).
Figure 8-19: Average nominal speed-up.
Figure 8-20: Average true speed-up.
Figure 8-21: Average execution speed.
Figure 8-22: Production parallelism (nominal speed-up).
Figure 8-23: Production parallelism (execution speed).
Figure 8-24: Production and action parallelism (nominal speed-up).
Figure 8-25: Production and action parallelism (execution speed).
Figure 8-26: Node parallelism (nominal speed-up).
Figure 8-27: Node parallelism (execution speed).
Figure 8-28: Node and action parallelism (nominal speed-up).
Figure 8-29: Node and action parallelism (execution speed).
Figure 8-30: Intra-node parallelism (nominal speed-up).
Figure 8-31: Intra-node parallelism (execution speed).
Figure 8-32: Intra-node and action parallelism (nominal speed-up).
Figure 8-33: Intra-node and action parallelism (execution speed).
Figure 8-34: Average nominal speed-up.
Figure 8-35: Average execution speed.
Figure 8-36: Effect of number of software task queues.
Figure 8-37: Production parallelism (nominal speed-up).
Figure 8-38: Production parallelism (execution speed).
Figure 8-39: Production and action parallelism (nominal speed-up).
Figure 8-40: Production and action parallelism (execution speed).
Figure 8-41: Node parallelism (nominal speed-up).
Figure 8-42: Node parallelism (execution speed).
Figure 8-43: Node and action parallelism (nominal speed-up).
Figure 8-44: Node and action parallelism (execution speed).
Figure 8-45: Intra-node parallelism (nominal speed-up).
Figure 8-46: Intra-node parallelism (execution speed).
Figure 8-47: Intra-node and action parallelism (nominal speed-up).
Figure 8-48: Intra-node and action parallelism (execution speed).
Figure 8-49: Average nominal speed-up.
Figure 8-50: Average execution speed.
Figure 8-51: Processor efficiency as a function of number of active processors.
Figure 8-52: Intra-node and action parallelism (nominal speed-up).
Figure 8-53: Intra-node and action parallelism (execution speed).
Figure 9-1: The prototype DADO architecture.
Figure 9-2: The NON-VON architecture.
Figure 9-3: Structure of the parallel processing system.
List of Tables

Table 3-1: VT: Condition Element Classes
Table 3-2: ILOG: Condition Element Classes
Table 3-3: MUD: Condition Element Classes
Table 3-4: DAA: Condition Element Classes
Table 3-5: R1-SOAR: Condition Element Classes
Table 3-6: EP-SOAR: Condition Element Classes
Table 3-7: Action Type Distribution
Table 3-8: Summary of Surface Measurements
Table 3-9: Number of Nodes
Table 3-10: Nodes per Production
Table 3-11: Nodes per Condition Element (with sharing)
Table 3-12: Nodes per Condition Element (without sharing)
Table 3-13: Network Sharing (Nodes without sharing/Nodes with sharing)
Table 3-14: Constant-Test Nodes
Table 3-15: Alpha-Memory Nodes
Table 3-16: Beta-Memory Nodes
Table 3-17: And Nodes
Table 3-18: Not Nodes
Table 3-19: Terminal Nodes
Table 3-20: Summary of Node Activations per Change
Table 3-21: Number of Affected Productions
Table 3-22: General Run-Time Data
Table 7-1: Relative Costs of Various Instruction Types
Table 8-1: Uniprocessor Execution With No Overheads: Part-A
Table 8-2: Uniprocessor Execution With No Overheads: Part-B
Table 8-3: Uniprocessor Execution With Overheads: Part-A, Intra-Node Parallelism and Node Parallelism
Table 8-4: Uniprocessor Execution With Overheads: Part-B, Intra-Node Parallelism and Node Parallelism
Table 8-5: Uniprocessor Execution With Overheads: Part-A, Production Parallelism
Table 8-6: Uniprocessor Execution With Overheads: Part-B, Production Parallelism
Table 8-7: Uniprocessor Execution With No Overheads: Part-A
Table 8-8: Uniprocessor Execution With No Overheads: Part-B
Table 8-9: Uniprocessor Execution With Overheads: Part-A, Intra-Node Parallelism and Node Parallelism
Table 8-10: Uniprocessor Execution With Overheads: Part-B, Intra-Node Parallelism and Node Parallelism
Table 8-11: Uniprocessor Execution With Overheads: Part-A, Production Parallelism
Table 8-12: Uniprocessor Execution With Overheads: Part-B, Production Parallelism
Table 8-13: Comparison of Linear and Binary Network Rete

To my parents

Chapter One
Introduction

Production systems (or rule-based systems) occupy a prominent position within the field of Artificial Intelligence. They have been used extensively to understand the nature of intelligence--in cognitive modeling, in the study of problem-solving systems, and in the study of learning systems [2, 44, 45, 62, 73, 74, 92]. They have also been used extensively to develop large expert systems spanning a variety of applications in areas including computer-aided design, medicine, configuration tasks, and oil exploration [11, 14, 39, 41, 42, 55, 56, 81]. Production-system programs, however, are computation intensive and run quite slowly. For example, OPS5 [10, 19] production-system programs using the Lisp-based or the Bliss-based interpreter execute at a speed of only 8-40 working-memory element changes per second (wme-changes/sec) on a VAX-11/780.¹ Although sufficient for many interesting applications (as proved by the current popularity of expert systems), the slow speed of execution precludes the use of production systems in many domains requiring high performance and real-time response.

¹This corresponds to an execution speed of 3-16 production firings per second. On average, 2.5 changes are made to the working memory per production firing.
For example, one study that considered implementing the Harpy algorithm as a production system [63] for real-time speech recognition required that the program be able to execute at a rate of about 200,000 wme-changes/sec. The slow speed of execution of current systems also impacts the research that is done with them, since researchers often avoid programming styles and systems that run too slowly. This thesis examines the issue of significantly speeding up the execution of production systems (several orders of magnitude over the 8-40 wme-changes/sec). A significant increase in the execution speed of production systems is expected to open up new application areas for production systems, and to be valuable to both the practitioners and the researchers in Artificial Intelligence.

There also exist deeper reasons for wanting to speed up the execution of production systems. The cognitive activity of an intelligent agent involves two types of search: (1) knowledge search, that is, search by the agent of its knowledge base to find information that is relevant to solving a given problem; and (2) problem-space search, that is, search within the problem space [62] for a goal state. Problem-space search manifests itself as a combinatorial AND/OR search [65]. Since problem-space search, when not pruned by knowledge, is combinatorially explosive, a highly intelligent agent, regardless of what it is doing, must engage in a certain amount of knowledge search after each step that it takes. This results in knowledge search being a part of the inner loop of the computation performed by an intelligent agent. Furthermore, as the intelligence of the agent increases (the size of the knowledge base increases), the resources needed to perform knowledge search also increase, and it becomes important to speed up knowledge search as much as possible.

As an example, consider the problem of determining the next move to make in a game of chess. Problem-space search corresponds to the different moves that the player tries out before making the actual move. However, the fact that he tries out only a small fraction of all possible moves requires that he use problem- and situation-specific knowledge to constrain the search. Knowledge search corresponds to the computation involved in identifying this problem- and situation-specific knowledge from the rest of the knowledge that the player may have.

Knowledge search forms an essential component of the execution of production systems. Each execution cycle of a production system involves a knowledge-search step (the match phase), where the knowledge represented in rules is matched against the global data memory. Since the ability to do efficient knowledge search is fundamental to the construction of intelligent agents, it follows that the ability to execute production systems with large rule sets at high speeds will greatly help in constructing intelligent programs. In short, the match-phase computation (knowledge search) done in production systems is not something specific to production systems; such computation has to be done, in one form or another, in any intelligent system. Thus, speeding up such computation is an essential part of the construction of highly intelligent systems.
Furthermore, since production systems offer a highly transparent model of knowledge search, the results obtained about speed-up from parallelism for production systems will also have implications for other models of intelligent computation involving knowledge search.

There are several different methods for speeding up the execution of production systems: (1) the use of faster technology; (2) the use of better algorithms; (3) the use of better architectures; and (4) the use of parallelism. This thesis focuses on the use of parallelism. It identifies the various sources of parallelism in production systems and discusses the feasibility of exploiting them. Several implementation issues and some architectural considerations are also discussed. The main reasons for considering parallelism are: (1) Given any technology base, it is always possible to use multiple processors to achieve higher execution speeds. Stated another way, as technology advances, the new technology can also be used in the construction of multiple-processor systems. Furthermore, as the rate of improvement in technology slows (as it must), parallelism becomes even more important. (2) Although significant improvements in speed have been obtained in the past through better compilation techniques and better algorithms [17, 20, 21, 22], we appear to be at a point where little more can be expected. Furthermore, any improvements in compilation technology and algorithms will probably also carry over to the parallel implementations. (3) On the surface, production systems appear to be capable of using large amounts of parallelism--it is possible to perform the match for each production in parallel. This apparent mismatch between the inherently parallel production systems and the uniprocessor implementations makes parallelism the obvious way to obtain a significant speed-up in the execution rates.

The thesis concentrates on the parallelism available in OPS5 [10] and Soar [46] production systems. OPS5 was chosen because it has become widely available and because several large, diverse, and real production-system programs have been written in it. These programs form an excellent base for measurements and analysis. Soar was chosen because it represents an interesting new approach in the use of production systems for problem solving and learning. Since only OPS5 and Soar programs are considered, the analysis of parallelism presented in this thesis is possibly biased by the characteristics of these languages. For this reason the results may not be safely generalized to production-system programs written in languages with substantially different characteristics, such as EMYCIN, EXPERT, and KAS [59, 93, 14].

Finally, the research reported in this thesis has been carried out in the context of the Production System Machine (PSM) project at Carnegie-Mellon University, which has been exploring all facets of the problem of improving the efficiency of production systems [22, 23, 30, 31, 32, 33]. This thesis extends, refines, and substantiates the preliminary work that appears in the earlier publications.

1.1. Preview of Results

The first thing that is observed on analyzing production systems is that the speed-up from parallelism is quite limited, about 10-fold as compared to initial expectations of 100-fold or 1000-fold.
The main reasons for the limited parallelism are: (1) The number of productions that require significant processing (the number of affected productions) as a result of a change to working memory is quite small, less than 30. Thus, processing each of these productions in parallel cannot result in a speed-up of more than 30. (2) The variation in the processing requirements of the affected productions is large. This results in a situation where fewer and fewer processors are busy as the execution progresses, which reduces the average number of processors that are busy over the complete execution cycle. (3) The number of changes made to working memory per recognize-act cycle is very small (around 2-3 for most OPS5 systems). As a result, the speed-up obtained from processing multiple changes to working memory in parallel is quite small.

To obtain a large fraction of the limited speed-up that is available, the thesis proposes the exploitation of parallelism at a very fine grain. It also proposes that all working-memory changes made by a production firing be processed in parallel to increase the speed-up. The thesis argues that the Rete algorithm used for performing the match step in existing uniprocessor implementations is also suitable for parallel implementations. However, there are several changes that are necessary to the serial Rete algorithm to make it suitable for parallel implementation. The thesis discusses these changes and gives the reasons behind the design decisions.

The thesis argues that a highly suitable architecture to exploit the fine-grained parallelism in production systems is a shared-memory multiprocessor, with about 32-64 high-performance processors. For scheduling the fine-grained tasks (consisting of about 50-100 instructions), two solutions are proposed. The first solution consists of a hardware task scheduler, which is to be capable of scheduling a task in one bus cycle of the multiprocessor. The second solution consists of multiple software task queues. Preliminary simulation studies indicate that the hardware task scheduler is significantly superior to the software task queues.

The thesis presents a large set of simulation results for production systems exploiting different sources of parallelism. The thesis points out the features of existing programs that limit the speed-up obtainable from parallelism and suggests solutions for some of the bottlenecks. The simulation results show that using the suggested multiprocessor architecture (with individual processors performing at 2 MIPS), it is possible to obtain execution speeds of 5000-27000 wme-changes/sec. This corresponds to a speed-up of 4-fold to 23-fold over the best known sequential implementation using a 2 MIPS processor. This performance is significantly higher than that obtained by other proposed parallel implementations of production systems.

1.2. Organization of the Thesis

Chapter 2 contains the background information necessary for the thesis. Sections 2.1 and 2.2 introduce the OPS5 and Soar production-system formalisms and describe their computation cycles. Section 2.3 presents a detailed description of the Rete algorithm, which is used to perform the match step for production systems. The Rete algorithm forms the starting point for much of the work described later in the thesis. Section 2.4 presents the reasons why it is interesting to parallelize the Rete algorithm.
Chapter 3 lists the set of production-system programs analyzed in this thesis and presents the results of static and run-time measurements made on these production-system programs. The static measurements include data on the surface characteristics of production systems (for example, the number of condition elements per production, the number of attribute-value pairs per condition element) and data on the structure of the Rete networks constructed for the programs. The run-time measurements include data on the number of node activations per change to working memory, the number of working-memory changes per production firing, etc. The run-time data can be used to get rough upper bounds on the speed-up obtainable from parallelism.

Chapter 4 focuses on the sources of parallelism in production-system implementations. For each of the sources (production parallelism, node parallelism, intra-node parallelism, action parallelism, and application parallelism) it describes some of the implementation constraints, the amount of speed-up expected, and the overheads associated with exploiting that source. Most of the chapter is devoted to the parallelism in the match phase; the parallelism in the conflict-resolution phase and the rhs-evaluation phase is discussed only briefly.

Chapter 5 discusses the various hardware and software issues associated with the parallel implementation of production systems. It first describes a multiprocessor architecture that is suitable for the parallel implementation and provides justifications for the various decisions. Subsequently it describes the changes that need to be made to the serial Rete algorithm to make it suitable for parallel implementation. Various issues related to the choice of data structures are also discussed.

Chapter 6 discusses the problem of scheduling node activations in the multiprocessor implementation. It proposes two solutions: (1) the use of a hardware task scheduler, and (2) the use of multiple software task queues. These two solutions are detailed in Sections 6.1 and 6.2 respectively. The performance results corresponding to the two solutions, however, are not discussed until Section 8.7.

Chapter 7 presents details about the simulator used to study parallelism in production systems. It presents information about the input traces, the cost models, the computational model, the outputs, and the limitations of the simulator. More details about the derivation of the cost model are presented in Appendices A, B, and C.

Chapter 8 presents the results of the simulations. Section 8.1 lists the run-time traces used in the simulations. Section 8.2 discusses the overheads of a parallel implementation over an implementation done for a uniprocessor. Sections 8.3, 8.4, and 8.5 discuss the speed-up obtained using production parallelism, node parallelism, and intra-node parallelism respectively. Section 8.6 discusses the effect of constructing binary instead of linear Rete networks for productions. Section 8.7 presents results for the case when multiple software task queues are used instead of a hardware task scheduler. Section 8.8 presents results for the case when memory contention overheads are taken into account, and finally, Section 8.9 presents a summary of all results.

Chapter 9 presents related work done by other researchers. Work done on parallel implementation of production systems on C.mmp, Illiac-IV, DADO, NON-VON, Oflazer's production-system machine, and Honeywell's data-flow machine is presented.
Finally, Chapter 10 reviews the primary results of the thesis and presents directions for future research.

Chapter Two
Background

The first two sections of this chapter describe the syntactic and semantic features of the OPS5 and Soar production-system languages--the two languages for which parallelism is explored in this thesis. The third section describes the Rete match algorithm. The Rete algorithm is used in existing uniprocessor implementations of OPS5 and Soar, and also forms the basis for the parallel match algorithm proposed in the thesis. The last section describes the different classes of algorithms that may be used for match in production systems and gives reasons why the Rete algorithm is appropriate for parallel implementation of production systems.

2.1. OPS5

An OPS5 [10, 19] production system is composed of a set of if-then rules called productions that make up the production memory, and a database of assertions called the working memory. The assertions in the working memory are called working-memory elements. Each production consists of a conjunction of condition elements corresponding to the if part of the rule (also called the left-hand side of the production), and a set of actions corresponding to the then part of the rule (also called the right-hand side of the production). The actions associated with a production can add, remove, or modify working-memory elements, or perform input-output. Figure 2-1 shows an OPS5 production named p1, which has three condition elements in its left-hand side, and one action in its right-hand side.

(p p1
    (C1 ^attr1 <x> ^attr2 12)
    (C2 ^attr1 15 ^attr2 <x>)
  - (C3 ^attr1 <x>)
-->
    (remove 2))

Figure 2-1: A sample production.

The production-system interpreter is the underlying mechanism that determines the set of satisfied productions and controls the execution of the production-system program. The interpreter executes a production-system program by performing the following recognize-act cycle:

- Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production. At any given time, the conflict set may contain zero, one, or more instantiations of a given production.

- Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.

- Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, the first phase is executed again.

The recognize-act cycle forms the basic control structure in production-system programs. During the match phase the knowledge of the program (represented by the production rules) is tested for relevance against the existing problem state (represented by the working memory). During the conflict-resolution phase the most relevant piece of knowledge is selected from all knowledge that is applicable (the conflict set) to the existing problem state. During the act phase, the relevant piece of knowledge is applied to the existing problem state, resulting in a new problem state.
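As a rough sketch, the cycle can be rendered in a few lines of Python. The representation is a deliberate simplification, not OPS5's machinery: productions are (name, condition, action) triples, the condition is an arbitrary predicate on working memory rather than an OPS5 left-hand side, and conflict resolution is reduced to taking the first instantiation.

    def recognize_act(productions, wm):
        """Minimal sketch of the recognize-act cycle; not the OPS5 interpreter."""
        while True:
            # Match: collect instantiations of all satisfied productions
            # (the conflict set).
            conflict_set = [(name, action)
                            for name, condition, action in productions
                            if condition(wm)]
            if not conflict_set:
                return wm              # no satisfied productions: halt
            # Conflict-resolution: choose one instantiation (simplified).
            _name, action = conflict_set[0]
            # Act: execute the actions, which change working memory.
            action(wm)

    # Example: a single production that fires until a counter reaches zero.
    rules = [("decrement",
              lambda wm: wm["count"] > 0,
              lambda wm: wm.update(count=wm["count"] - 1))]
    print(recognize_act(rules, {"count": 3}))   # -> {'count': 0}

Even in this toy form, the loop makes visible why the match phase dominates: it is the only step whose cost grows with both the number of productions and the size of working memory.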
2.1.1. Working-Memory Elements

A working-memory element is a parenthesized list consisting of a constant symbol called the class or type of the element and zero or more attribute-value pairs. The attributes are symbols that are preceded by the operator ^. The values are symbolic or numeric constants. For example, the following working-memory element has class C1, the value 12 for attribute attr1, and the value 15 for attribute attr2.

(C1 ^attr1 12 ^attr2 15)

2.1.2. The Left-Hand Side of a Production

The condition elements in the left-hand side of a production are parenthesized lists similar to the working-memory elements. They may optionally be preceded by the symbol -. Such condition elements are called negated condition elements. For example, the production in Figure 2-1 contains three condition elements, with the third one being negated. Condition elements are interpreted as partial descriptions of working-memory elements. When a condition element describes a working-memory element, the working-memory element is said to match the condition element. A production is said to be satisfied when:

- For every non-negated condition element in the left-hand side of the production, there exists a working-memory element that matches it.

- For every negated condition element in the left-hand side of the production, there does not exist a working-memory element that matches it.
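In code, the satisfaction test reads directly off these two rules. The sketch below treats the matcher matches(wme, ce) as a parameter (its rules are given in the next subsection), and it deliberately ignores the requirement that a variable be bound consistently across different condition elements, which a real matcher must enforce.

    def satisfied(positive_ces, negated_ces, working_memory, matches):
        """Sketch only: cross-condition variable consistency is ignored."""
        # Every non-negated condition element must be matched by some
        # working-memory element.
        if not all(any(matches(w, ce) for w in working_memory)
                   for ce in positive_ces):
            return False
        # Every negated condition element must be matched by no
        # working-memory element.
        return not any(matches(w, ce)
                       for ce in negated_ces for w in working_memory)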
Thus the working-memory element (el tattrl 12 _attr2 1B) will match the following two condition elements (Cl tattrl 12 tattr2 <x>) 10 PARALLELISM INPRODUCTIONSYSTEMS (C1 *attr2 > 0) but it will not match the condition element (CI +attrl <x> +attr2 <x>). 2.1.3. The Right-Hand Side of a Production The right-hand side of a production consists of an unconditional sequence of actions which can cause input-output, and which are responsible for changes to the working memory. Three kinds of actions are provided to effect working memory changes. Make creates a new working-memory element and adds it to working memory. Modify changes one or more values of an existing workingmemory element. Remove deletes an element from the working memory. 2.2. Soar Soar[45, 47, 48, 74, 75] is a new production-system formalism developed at Carnegie-Mellon University to perform research in problem-solving, expert systems, and learning. It is an attempt to provide expert systems with general reasoning power and the ability to learn. In Soar, every task is formulated as heuristic search in a problem space to achieve a goal. The problem space [62] consists of a set of states and a set of operators. The operators are used to transform one problem state into another. Problem solving is the process of moving from some initial state through intermediate states (generated as a result of applying the operators) until a goal state is reached. Knowledge about the task domain is used to guide the search leading to the goal state. Currently, Soar is built on top of OPS5--the operators, the domain knowledge, the goal-state recognition mechanism, are all built as OPS5 productions. As a result most of the implementation issues, including the exploitation of parallelism, are similar in OPS5 and Soar. The main difference, however, is that Soar does not follow the match---conflict-resolutionuact cycle of OPS5 exactly. The computation cycle in Soar is divided into two phases: a monotonic elaboration phase and a decision phase. During the elaboration phase, all directly available knowledge relevant to the current problem state is brought to bear. On each cycle of elaboration phase, all instantiations of satisfied productions fire concurrently. This phase goes on till quiescence, that is, till there are no more satisfied productions. During the decision phase a fixed procedure is run that translates the information obtained during the elaboration phase into a specific decision--for example, the operator to be applied next. With respect to parallelism, the relevant differences from OPS5 are: (1) there is no conflict-resolution phase; and (2) multiple productions can fire in parallel. The impact of these differences is explored later in the thesis. BACKGROUND ]1 Soar production-system programs differ from OPS5 programs in yet another way. Soar programs can improve their performance over time by adding new productions at run-time. An auxiliary process automatically creates new productions concurrently with the operation of the production system. The impact of this feature on parallelism is, however, not explored in this thesis. 2.3. The Rete Match Algorithm The most time consuming step in the execution of production systems is the match step. To get a feeling for the complexity of match, consider a production system consisting of 1000 productions and 1000 working-memory elements, where each production has three condition elements. 
In a naive implementation each production will have to be matched against all tuples of size three from the working memory, leading to over a trillion (1000x10003) match operations for each execution cycle. Of course, more complex algorithms can achieve the above computation using a much smaller number of operations, but even with specialized algorithms, match constitutes around 90% of the interpretation time. The match algorithm used by uniprocessor implementations of OPS5 and Soar is called Rete [20]. This section describes the Rete algorithm in some detail as it forms the basis for much of the work described later in the thesis. The Rete algorithm exploits (1) the fact that only a small fraction of working memory changes each cycle, by storing results of match from previous cycles and using them in subsequent cycles; and (2) the similarity between condition elements of productions, by performing common tests only once. These two features combined together make Rete a very efficient algorithm for match. The Rete algorithm uses a special kind of a data-flow network compiled from the left-hand sides of productions to perform match. To generate the network for a production, it begins with the in- dividual condition elements in the left-hand side. For each condition element it chains together test nodes that check: • If the attributes in the condition element that have a constant as their value are satisfied. • If the attributes in the condition element that are related to a constant by a predicate are satisfied. • If two occurrences of the same variable within the condition element are consistently bound. Each node in the chain performs one such test. (The three kinds of tests above are called intra-condition tests, because they correspond to individual condition elements.) Once the algorithm has finished with the individual condition dements, it adds nodes that check for consistency of 12 PARAI_,LELISM IN PRODUCTION SYSTEMS variable bindings across the multiple condition elements in the left-hand side. (These tests are called inter-condition tests, because they refer to multiple condition elements.) Finally the algorithm adds a special terminal node to represent the production corresponding to this part of the network. Figure 2-2 shows such a network for productions pl and p2 which appear in the top part of the figure. In this figure, lines have been drawn between nodes to indicate the paths along which information flows. Information flows from the top-node down along these paths. The nodes with a single predecessor (near the top of the figure) are the ones that are concerned with individual condition elements. The nodes with two predecessors are the ones that check for consistency of variable bindings between condition elements. The terminal nodes are at the bottom of the figure. Note that when two left-hand sides require identical nodes, the algorithm shares part of the network rather than building duplicate nodes. (p pl (C1 tattrl <x> tattr2 12) (C2 tattrl 15 tattr2 <x>) (C3 tattrl <x>) (p p2 (C2 tattrl 15 tattr2 <y>) (C4 tattrl <y>) ..> ..> (modify 1tattrl 12)) (remove 2)) root C1 constant-teSnodes |t_ Cless = | attr2 = 12 attrl = 15 Class = CA Class = C3 L / _, \,..,=, " -nl / _ _'alpha-mem gn_ t_lnal-node I I _and-node iLterminal-node Conflict Set -p2' I Add to Working Memory wmel: (C1 tattrl 12 tattr212) wine2:(C2 tattrl 12 eattr215) wme3:(C2 tattrl 15 tattr212) wme4:(C3 l'attrl 12) Figure 2-2: The Rete network. 
To avoid performing the same tests repeatedly, the Rete algorithm stores the result of the match BACKGROUND 13 with working memory as state within the nodes. "Ibis way, only changes made to the working memory by the most recent production firing have to be processed every cycle. Thus, the input to the Rete network consists of the changes to the working memory. network updating the state stored within the network. These changes filter through the The output of the network consists of a specification of changes to the conflict set. The objects that are passed between nodes are called tokens, which consist of a tag and an ordered list of working-memory elements. The tag can be either a +, indicating that something has been added to the working memory, or a.-, indicating that something has been removed from it. (No special tag for working-memory element modification is needed because a modify is treated as a delete followed by an add.) The list of working-memory elements associated with a token cor- responds to a sequence of those elements that the system is trying to match or has already matched against a subsequence of condition elements in the left-hand side. The data-flow network produced by the Rete algorithm consists of four different types of nodes. 2 These are: 1. Constant-test nodes: These nodes are used to test if the attributes in the condition element which have a constant value are satisfied. These nodes always appear in the top part of the network. They have only one input, and as a result, they are sometimes called one-input nodes. 2. Memory nodes: These nodes store the results of the match phase from previous cycles as state within them. The state stored in a memory node consists of a list of the tokens that match a part of the left-hand side of the associated production. For example, the rightmost memory node in Figure 2-2 stores all tokens matching the second condition-element of production p2. At a more detailed level, there are two types of memory nodes--the a-mem nodes and the t-mere nodes. The a-mem nodes store tokens that match individual condition elements. Thus all memory nodes immediately below constant-test nodes are a-mere nodes. The fl-mem nodes store tokens that match a sequence of condition elements in the left-hand side of a production. Thus all memory nodes immediately below two-input nodes are fl-mem nodes. 3. Two-input nodes: These nodes test for joint satisfaction of condition elements in the left-hand side of a production. Both inputs of a two-input node come from memory nodes. When a token arrives on the left input of a two-input node, it is compared to each token stored in the memory node connected to the right input. All token pairs that have 2Currentimplementations of theRetealgorithmcontainsomeothernodetypesthatare notmentionedhere. Nodesof thesetypesdo not performanyof the conceptually necessaryoperationsandare presentprimarilyto simplifyspecific implementations. Forthisreason,theyhavebeenomittedfromdiscussion inthethesis. 14 PARALLELISM INPROI)UCrlONSYSTEMS consistent variable bindings are sent to the successors of the two-input node. action is taken when a token arrives on the right input of a two-input node. Similar There are also two types of two-input nodes--the and-nodes and the not-nodes. While the and-nodes are responsible for the positive condition elements and behave in the way described above, the not-nodes are responsible for the negated condition elements and behave in an opposite manner. 
The not-nodes generate a successor token only if there are no matching tokens in the memory node corresponding to the negated condition element.

4. Terminal nodes: There is one such node associated with each production in the program, as can be seen at the bottom of Figure 2-2. Whenever a token flows into a terminal node, the corresponding production is either inserted into or deleted from the conflict set.

The following example provides a more detailed view of the processing that goes on inside the Rete network. The example corresponds to the two productions and the network given in Figure 2-2. It shows the match process as the four working-memory elements shown in the bottom-left corner of Figure 2-2 are sequentially added to the working memory.

When the first working-memory element is added, token t-w1,

<+, (C1 ^attr1 12 ^attr2 12)>

is constructed and sent to the root node. The root node broadcasts the token to all its successors. The associated tests fail at all successors except at one, which is checking for "Class = C1". This constant-test node passes the token down to its single successor, another constant-test node, checking if "attr2 = 12". Since this is so, the token is passed on to the memory node, which stores the token and passes a copy of the token to the and-node below it. The and-node compares the incoming token on its left input to tokens in its right memory (which at this point is empty), but no pairs can be formed. At this point, the network has stabilized--in other words no further activity occurs--so we go on to the second working-memory element.

The token for the second working-memory element, t-w2,

<+, (C2 ^attr1 12 ^attr2 15)>

is constructed and sent to the root node, which broadcasts the token to its successors. The token passes the "Class = C2" test but fails the "attr1 = 15" test, so no further processing takes place.

The token for the third working-memory element, t-w3,

<+, (C2 ^attr1 15 ^attr2 12)>

passes through the tests "Class = C2" and "attr1 = 15", and is stored in the memory node below them. The memory node passes a copy of the token to the two successor and-nodes below it. The and-node on the right finds no tokens in its right memory, so no further processing is done there. The and-node on the left checks the token for consistency against token t-w1, stored in its left memory. The consistency check is satisfied as the variable <x> is bound consistently. The and-node creates a new token, t-w1w3,

<+, ((C1 ^attr1 12 ^attr2 12), (C2 ^attr1 15 ^attr2 12))>

and passes it down to the memory node below, which stores it. The memory node now passes a copy of the token to the and-node below it. The and-node finds that its right memory is empty, so no further processing takes place.

On addition of the fourth working-memory element, token t-w4,

<+, (C3 ^attr1 12)>

is sent to the root node, which broadcasts it. The token passes the test "Class = C3" and passes on to the memory node below it. The memory node stores the token and passes a copy of the token to the and-node below it. The and-node checks for consistent bindings in the left memory and finds that the newly arrived token, t-w4, is consistent with token t-w1w3 stored in its left memory. The and-node then creates a new token, t-w1w3w4,

<+, ((C1 ^attr1 12 ^attr2 12), (C2 ^attr1 15 ^attr2 12), (C3 ^attr1 12))>

and sends it to the terminal node corresponding to p1 below. The terminal node then inserts the instantiation of p1 corresponding to t-w1w3w4 into the conflict set.
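The two-input node behavior in this walkthrough can be sketched as a small class. The names and structures below are illustrative assumptions, not the thesis's data structures: tokens are tuples of working-memory elements, the variable-binding consistency test is passed in as a function, the left and right memory nodes are folded into the and-node itself, and only + (insert) tokens are handled; deletions and not-nodes are omitted.

    class AndNode:
        """Sketch of an and-node with its two memories folded in."""

        def __init__(self, consistent, successors):
            self.left_memory = []          # tokens from the beta-memory side
            self.right_memory = []         # wmes from the alpha-memory side
            self.consistent = consistent   # variable-binding consistency test
            self.successors = successors   # downstream nodes, as callables

        def left_activation(self, token):
            self.left_memory.append(token)
            for wme in self.right_memory:
                if self.consistent(token, wme):
                    self.emit(token + (wme,))

        def right_activation(self, wme):
            self.right_memory.append(wme)
            for token in self.left_memory:
                if self.consistent(token, wme):
                    self.emit(token + (wme,))

        def emit(self, new_token):
            for succ in self.successors:
                succ(new_token)

    # The join from the walkthrough: <x> binds C1's attr1 and C2's attr2,
    # so the stored wme's attr1 must equal the arriving wme's attr2.
    out = []
    node = AndNode(lambda tok, wme: tok[-1]['attr1'] == wme['attr2'],
                   [out.append])
    node.left_activation(({'class': 'C1', 'attr1': 12, 'attr2': 12},))  # t-w1
    node.right_activation({'class': 'C2', 'attr1': 15, 'attr2': 12})    # wme3
    print(out)   # one new token, corresponding to t-w1w3

Folding the memories into the two-input node is, incidentally, a real design question and not just a shortcut here: Section 5.2.2 argues that memory nodes need to be lumped with two-input nodes in the parallel implementation.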
The performance of Rete-based interpreters has steadily improved over the years. The most widely used Rete-based interpreter is the OPS5 interpreter. The Franz Lisp implementation of this interpreter runs at around 8 wme-changes/sec (about 3 rule firings per second) on a VAX-11/780, while a Bliss-based implementation runs at around 40 wme-changes/sec. In the above two interpreters a significant loss in speed is due to the interpretation overhead of nodes. In OPS83 [21] this overhead has been eliminated by compiling the network directly into machine code. While it is possible to escape to the interpreter for complex operations during match or for setting up the initial conditions for the match, the majority of the match is done without an intervening interpretation level. This has led to a large speed-up, and the OPS83 interpreter runs at around 200 wme-changes/sec on the VAX-11/780. Some further optimizations to OPS83 have been designed which would permit it to run at around 400-800 wme-changes/sec. The aim of the parallel implementations is to take the performance still higher, into the range 2000-20000 wme-changes/sec. It is expected that this order-of-magnitude increase in speed over the best possible uniprocessor interpreters will open new application and research areas that could not be addressed before by production systems.

2.4. Why Parallelize Rete?

While being an extremely efficient algorithm for match on uniprocessors, Rete is also an effective algorithm for match on parallel processors. This section discusses some of the motivations for studying Rete and the reasons why Rete is appropriate for parallel implementations.

2.4.1. State-Saving vs. Non-State-Saving Match Algorithms

It is possible to divide the set of match algorithms for production systems into two categories: (1) the state-saving algorithms and (2) the non-state-saving algorithms. The state-saving algorithms store the results of executing match from previous recognize-act cycles, so that only the changes made to the working memory by the most recent production firing need be processed every cycle. In contrast, the non-state-saving algorithms start from scratch every time, that is, they match the complete working memory against all the productions on each cycle.

In a state-saving algorithm the work done includes two steps: (1) computing the fresh state corresponding to the newly inserted working-memory elements and storing the fresh state in appropriate data structures; (2) identifying the state corresponding to the deleted working-memory elements and deleting this state from the data structures storing it. In a non-state-saving algorithm the work done includes only one step, that of computing the state for the match between the complete working memory and all the productions. (Note that this may involve temporarily storing some of the partial state that is generated.) In both state-saving and non-state-saving algorithms, state refers to the matches between condition elements and working-memory elements that are computed as an intermediate step in the process of computing the match between the complete left-hand sides of productions and the working-memory elements.

Whether it is advantageous to store state depends on (1) the fraction of working memory that changes on each cycle, and (2) the amount of state that is stored by the state-saving algorithm. To evaluate the advantages and disadvantages more concretely, consider the following simple model.
Consider a production-system program for which the stable size of the working memory is s, the average number of inserts to working memory on each cycle is i, and the average number of deletes from working memory is d. Let the cost of step-1 (as described in the previous paragraph) for a single insert to working memory be c1, and the cost of step-2 for a single delete from working memory be c2. Further assume that the average cost of the temporary state computed and stored by the non-state-saving algorithm is c3 for each working-memory element. Then the average cost per execution cycle of the state-saving algorithm is C_state-sav = i·c1 + d·c2, and the average cost per execution cycle of the non-state-saving algorithm is C_non-state-sav = s·c3.

To evaluate the advantages of the state-saving algorithm, consider the inequality C_state-sav < C_non-state-sav. For the implementations being considered for the Rete algorithm, the cost of an insert to working memory is the same as the cost of a delete from the working memory. As a result, substituting c1 = c2 in the inequality, we get (i + d)/s < c3/c1. Estimates based on simulations done for the Rete algorithm (see Chapter 8) indicate that c1 is approximately equal to the execution of 1800 machine-code instructions, and that c3 is approximately equal to the execution of 1100 machine-code instructions. Using these estimates we get the condition that state-saving algorithms are more efficient when (i + d)/s < 0.61, that is, state-saving algorithms are better if the number of insertions plus deletions per cycle is less than 61% of the stable size of the working memory. Measurements [30] on several OPS5 programs show that the number of inserts plus deletes per cycle constitutes less than 0.5% of the stable working memory size. Thus a non-state-saving algorithm will have to recover an inefficiency factor of 120 before it breaks even with a state-saving algorithm for OPS5-like production systems.3

The following example illustrates some of the points made in the previous paragraphs. Consider a production-system program whose stable working memory size is around 1000 elements and where each production firing makes 2-3 changes to the working memory. This is a common scenario for OPS5 programs. It is quite obvious in this case that since 99.8% of the working memory is unchanged, it is unwise to use a non-state-saving algorithm which performs match with this unchanged working memory all over again. Now consider a program whose stable working memory size is again 1000, but where each production firing changes 750 of these 1000 working-memory elements. In this case a state-saving algorithm will first have to identify and delete the state corresponding to the 750 deleted elements and then recompute and store state for the new 750 elements. The only saving corresponds to the unchanged 250 working-memory elements. The state-saving algorithm in this case no longer seems attractive.

The amount of state stored by a state-saving algorithm also influences the suitability of such an algorithm compared to a non-state-saving algorithm. The amount of state stored is important because it determines the amount of work needed to compute it when working-memory elements are inserted, and the amount of work needed to delete it when working-memory elements are deleted.
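Since the model reduces to a single inequality, it can be checked mechanically. The sketch below (illustrative Python, plugging in the instruction-count estimates quoted above) computes both per-cycle costs for a given program profile:

    # Cost model of Section 2.4.1. c1 = per-insert cost and c2 = per-delete
    # cost of the state-saving algorithm; c3 = per-WME cost of the
    # non-state-saving algorithm (all in machine instructions, per the
    # simulation estimates cited above).
    C1 = C2 = 1800
    C3 = 1100

    def cycle_costs(s, i, d):
        state_saving = i * C1 + d * C2
        non_state_saving = s * C3
        return state_saving, non_state_saving

    # A typical OPS5-like profile: 1000 stable WMEs, 2-3 changes per cycle.
    print(cycle_costs(s=1000, i=2, d=1))   # (5400, 1100000)
    print(C3 / C1)                         # break-even ratio (i+d)/s ~ 0.61

With (i + d)/s far below the 0.61 break-even point, the state-saving algorithm dominates, which is exactly the situation measured for OPS5 programs.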
In terms of the model described in the previous paragraphs, the amount of state that a state-saving algorithm stores affects the values of the constants c1, c2, and c3, thus influencing the ratio (i + d)/s at which a non-state-saving algorithm becomes appropriate.

3. In case the values of c1, c2, and c3 are estimated from a different base algorithm, the above numbers will change somewhat.

In summary, state-saving algorithms are appropriate when working memory changes slowly and when it costs more to recompute the state for the unchanged part of the working memory than to undo the state for the deleted working-memory elements. For OPS5 and Soar systems, the fraction of working memory that changes on every cycle is, in fact, very small. For this reason only state-saving algorithms are considered in the thesis.

2.4.2. Rete as a Specific Instance of State-Saving Algorithms

While it is generally accepted that state-saving algorithms are suitable for OPS5 and Soar production systems, there is no consensus about the amount of state that such algorithms should store. Of course, there are many possibilities. This section discusses some of the schemes that various research groups have explored, where the Rete algorithm fits amongst these schemes, and why Rete is interesting.

One possible scheme that a state-saving algorithm may use is to store information only about matches between individual condition elements and working-memory elements. In the terminology of the Rete algorithm, this means that only the state associated with α-mem nodes is stored. For example, consider a production with three condition elements CE1, CE2, and CE3. Then the algorithm stores information about all working-memory elements that match CE1, all working-memory elements that match CE2, and all working-memory elements that match CE3. However, it does not store working-memory tuples that satisfy CE1 and CE2 together, or CE2 and CE3 together, and so on. This information is recomputed on each cycle. Such a scheme is used by the TREAT algorithm developed for the DADO machine at Columbia University [60]. This scheme stands at the low end of the spectrum of state-saving algorithms. A problem with this scheme is that much of the state has to be recomputed on each cycle, often with the effect of increasing the total time taken by the cycle.

A second possible scheme that a state-saving algorithm may use is to store information about matches between all possible combinations of condition elements that occur in the left-hand side of a production and the sequences of working-memory elements that satisfy them. For example, consider a production with three condition elements CE1, CE2, and CE3. Then, in this scheme, the algorithm stores information about all working-memory elements that match CE1, CE2, and CE3 individually, about all working-memory element tuples that match CE1 and CE2 together, CE2 and CE3 together, CE1 and CE3 together, and so on. Kemal Oflazer, in his thesis [67], has proposed a variation of this scheme to implement a highly parallel algorithm for match. This scheme stands at the high end of the spectrum of state-saving algorithms, in that it stores almost all information known about the matches between the productions and the working memory. Two possible problems with such a scheme are: (1) the state may become very large; and (2) the algorithm may spend a lot of time computing and deleting state that never really gets used, that is, state that never results in a production entering or leaving the conflict set.
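The spectrum can be made concrete by enumerating which combinations of condition elements each scheme keeps memories for. The fragment below is an explanatory sketch (not code from TREAT, from Oflazer's system, or from the thesis) for a production with condition elements CE1, CE2, and CE3:

    from itertools import combinations

    ces = ['CE1', 'CE2', 'CE3']

    # Low end (TREAT-like): memories only for individual condition elements.
    low_end = [(ce,) for ce in ces]

    # High end: a memory for every combination of condition elements.
    high_end = [c for r in range(1, len(ces) + 1)
                  for c in combinations(ces, r)]

    # Standard Rete: individual condition elements plus the fixed
    # left-to-right prefix combination chosen at compile time.
    rete = low_end + [('CE1', 'CE2')]

    print(len(low_end), len(high_end), len(rete))   # 3 7 4

The stored state grows from 3 memories at the low end to 7 at the high end; Rete's 4 sit in between, as the next paragraphs describe.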
The amount of state computed by the Rete algorithm falls in between that computed by the previous two schemes. The Rete algorithm stores information about working-memory elements that match individual condition elements, as proposed in the first scheme. In addition, it also stores information about tuples of working-memory elements that match some fixed combinations of condition elements occurring in the left-hand side of a production. This is in contrast to the second scheme, where information is stored about tuples of working-memory elements that match all combinations of condition elements. The choice of the combinations of condition elements for which match information is stored is fixed at compile time.4 For example, for a production with three condition elements CE1, CE2, and CE3, the standard Rete algorithm stores information about working-memory elements that match CE1, CE2, and CE3 individually. This information is stored in the α-mem nodes. In addition, it stores information about working-memory element tuples that match CE1 and CE2 together. This information is stored in a β-mem node, as can be seen in Figure 2-2. The Rete algorithm uses this information and combines it with the information about working-memory elements that match CE3 to generate tuples that match the complete left-hand side (CE1, CE2, and CE3 all together). The Rete algorithm does not store information about working-memory tuples that match CE1 and CE3 together or those tuples that match CE2 and CE3 together, as is done by the algorithm in the second scheme.

The Rete algorithm has been successfully used in current uniprocessor implementations of OPS5 and Soar. It avoids some of the extra work done in the first scheme, namely the work done in recomputing working-memory tuples that match combinations of condition elements. It is also less susceptible to the combinatorial explosion of state that is possible in the second scheme, because it can carefully select the combinations of condition elements for which state is stored. The combinations that are selected can significantly impact the efficiency of a parallel implementation of the algorithm. The thesis evaluates two of the many possible schemes for choosing these combinations, and discusses the factors influencing the choice in Section 5.2.6.

4. Note that by varying the combinations of condition elements for which match information is stored, a large family of different Rete algorithms can be generated.

2.4.3. Node Sharing in the Rete Algorithm

As mentioned in Section 2.3, the Rete algorithm exploits the similarity between condition elements of productions by sharing constant-test nodes, memory nodes, and two-input nodes. For example, two constant-test nodes and an α-memory node are shared between productions p1 and p2 shown in Figure 2-2 (for some statistics on sharing of nodes in the Rete network, see Section 3.3.2). The sharing of nodes in the Rete network results in considerable savings in execution time, since the nodes have to be evaluated only once instead of multiple times. The sharing of nodes also results in savings in space.
This is because sharing reduces the size of the Rete network and because sharing can collapse replicated tokens in multiple unshared memory nodes into a single token in a shared memory node. The main implication of the above discussion is that it is important for an efficient match algorithm to exploit the similarity in condition elements of productions to enhance its performance.

2.4.4. Rete as a Parallel Algorithm

The previous three subsections have discussed the general suitability of the Rete algorithm for match in OPS5 and Soar systems. They, however, do not discuss the suitability of Rete for parallel implementation. This subsection describes some features of Rete that make it attractive for parallel implementation.

The dataflow-like organization of the Rete network is the key feature that permits exploitation of parallelism at a relatively fine grain. It is possible to evaluate the activations of different nodes in the Rete network in parallel. It is also possible to evaluate multiple activations of the same node in parallel and to process multiple changes to working memory in parallel. These sources of parallelism are discussed in detail in Chapter 4. The parallel evaluation of node activations in the Rete network also corresponds to higher-level, more intuitive forms of parallelism in production systems. For example, evaluating different node activations in parallel corresponds to (1) performing match for different productions in parallel (also called production-level parallelism) and (2) performing match for different condition elements within the same production in parallel (also called condition-level parallelism) [22].

The state-saving model of Rete, where the state corresponding to only fixed combinations of condition elements in the left-hand side is stored, does impose some sequentiality on the evaluation of nodes, as compared to models where state corresponding to all possible combinations of condition elements is stored. It is, however, a plausible trade-off in order to avoid some of the problems associated with the other schemes, as discussed in the previous subsections.

Finally, although we have no way to prove that Rete is the most suitable algorithm for parallel implementation of production systems, for the reasons stated above, we are confident that it is a pretty good algorithm. Detailed simulation results for parallel implementations presented in Chapter 8 of this thesis, and comparisons to the performance numbers obtained by other researchers (see Chapter 9) [31, 37, 60, 67], further confirm this view.

Chapter Three
Measurements on Production Systems

Before proceeding with the design of a complex algorithm or architecture, it is necessary to identify the characteristics of the programs or applications for which the algorithm or the architecture is to be optimized. For example, computer architects refer to data about the usage of instructions, the depth of procedure-call invocations, and the frequency of successful branch instructions to optimize the design and implementation of new machine architectures [36, 68, 70]. Information about the target programs or applications serves two purposes: (1) it serves as an aid in the design process by identifying critical requirements; and (2) it serves as a means to evaluate the finished design.
This chapter describes the characteristics of six OPS5 and Soar production systems that have been used in the thesis for the design and evaluation of parallel implementations of production systems.5 The data about the six production systems is divided into three parts. The first part consists of measurements on the textual structure of these production systems. The second part consists of information on the compiled form of the productions, and the third part consists of run-time measurements on the production-system programs.

3.1. Production-System Programs Studied in the Thesis

The six production-system programs that have been used to evaluate the algorithms and the architectures for the parallel implementation of production systems are given below. They are listed in order of decreasing number of productions, and this order is maintained in all the graphs shown later.6

1. VT [51] (Vertical Transport) is an expert system that selects components for a traction elevator system. It is written in OPS5 and consists of 1322 rules.

2. ILOG7 [58] is an expert system that maintains inventories and production schedules for factories. It is written in OPS5 and consists of 1181 rules.

3. MUD [39] is an expert system that is used to analyze fluids used in oil-drilling operations. It is written in OPS5 and consists of 872 rules.

4. DAA [42, 43] (Design Automation Assistant) is an expert system that designs computers from high-level specifications of the systems. It is written in OPS5 and consists of 445 rules.

5. R1-SOAR [74] is an expert system that configures the UNIBUS for Digital Equipment Corporation's VAX-11 computer systems. It is written in Soar and consists of 319 rules.

6. EP-SOAR [47] is an expert system that solves the Eight Puzzle. It is written in Soar and consists of 62 rules.

5. The characteristics for another set of six production-system programs can be found in [30].

6. Note: Many of the production-system programs listed below are still undergoing development. For this reason the data associated with the programs is liable to change. The number of rules listed with each of the programs corresponds to the number of rules in the version on which data was taken.

The above production-system programs represent a variety of applications and programming styles. For example, VT is a knowledge-intensive expert system which has been especially designed with knowledge acquisition in mind. It consists of only a small number of rule types and is significantly different from the earlier systems [55, 56] developed at Carnegie-Mellon University.8 ILOG is a run-of-the-mill knowledge-intensive expert system. In contrast to the other five systems, the MUD system is a backward-chaining production system [4] and is primarily goal driven. The DAA program represents a computation-intensive task compared to the knowledge-intensive tasks performed by the VT, ILOG, and MUD systems. Both R1-SOAR and EP-SOAR represent programming styles in Soar. R1-SOAR also represents an attempt at doing knowledge-intensive programming in a general weak-method problem-solving architecture. It can make use of the available knowledge to achieve high performance, but whenever knowledge is lacking, it has mechanisms so that the program can resort to more basic and knowledge-lean problem-solving methods.

3.2. Surface Characteristics of Production Systems

Surface measurements refer to the textual features of production-system programs.
Examples of such features are: the number of condition elements in the left-hand sides of productions, the number of attributes per condition element, and the number of variables per condition element. Such features are useful in that they give information about the code and static data structures that are generated for the programs, and they also help explain some aspects of the run-time behavior of the programs.

7. Referred to as PTRANS in the cited paper.

8. Personal communication from John McDermott.

The following subsections present the data for the measured features, including a brief description of how the measurements were made. Data about the same features of different production systems are presented together, and have been normalized to permit comparison.9 Along with each data graph the average, the standard deviation, and the coefficient of variation10 for the data points are given.

3.2.1. Condition Elements per Production

Figure 3-1 shows the number of condition elements per production for the six production-system programs. The number of condition elements per production includes both positive elements and negative ones. The curves for the programs are normalized by plotting percent of productions, instead of number of productions, along the y-axis. The number of condition elements in a production reflects the specificity of the production, that is, the set of situations in which the production is applicable. The number of condition elements in a production also impacts the complexity of performing match for that production (see Section 5.2.6). Note that, on average, Soar productions have many more condition elements than OPS5 productions.

3.2.2. Actions per Production

Figure 3-2 shows the number of actions per production. The number of actions reflects the processing required to execute the right-hand side of a production. A large number of actions per production also implies a greater potential for parallelism, because then a large number of changes to the working memory can be processed in parallel, before the next conflict-resolution phase is executed.

3.2.3. Negative Condition Elements per Production

The graph in Figure 3-3 shows the number of negated condition elements in the left-hand side of a production versus the percent of productions having them. It shows that approximately 27% of productions have one or more negated condition elements. Since negated condition elements denote universal quantification over the working memory, the percentage of productions having them is an important characteristic of production-system programs. The measurements are also useful in calculating the number of not-nodes in the Rete network.

9. The limits of the axes of the graphs have been adjusted to show the main portion of the graph clearly. In doing this, however, in some cases a few extreme points could not be put on the graph. For this reason, the reader should not draw conclusions about the maximum values of the parameters from the graph.

10. Coefficient of Variation = Standard Deviation / Average.

[Figure 3-1: Condition elements per production. Avg 3.28, SD 1.66, CV 0.50 for VT; Avg 3.92, SD 2.03, CV 0.52 for ILOG; Avg 2.47, SD 1.36, CV 0.55 for MUD; Avg 3.89, SD 2.88, CV 0.74 for DAA; Avg 8.60, SD 4.45, CV 0.52 for R1-SOAR.]
[Figure 3-2: Actions per production. Avg 4.80, SD 13.88, CV 2.89 for VT; Avg 3.16, SD 4.48, CV 1.42 for ILOG; Avg 3.42, SD 5.77, CV 1.69 for MUD; Avg 2.42, SD 2.19, CV 0.91 for DAA; Avg 9.62, SD 16.68, CV 1.73 for R1-SOAR; Avg 4.29, SD 17.17, CV 4.00 for EP-SOAR.]

[Figure 3-3: Negative condition elements per production. Avg 0.27, SD 0.50, CV 1.85 for VT; Avg 0.35, SD 0.69, CV 1.96 for ILOG; Avg 0.33, SD 0.61, CV 1.86 for MUD; Avg 0.52, SD 0.83, CV 1.60 for DAA; Avg 0.24, SD 0.59, CV 2.49 for R1-SOAR.]

3.2.4. Attributes per Condition Element

Figure 3-4 shows the distribution of the number of attributes per condition element. The class of a condition element, which is an implicit attribute, is counted explicitly in the measurements. The number of attributes in a condition element reflects the number of tests that are required to detect a matching working-memory element. The striking peak at three for the R1-SOAR and EP-SOAR programs reflects the uniform encoding of data as triplets in Soar.

3.2.5. Tests per Two-Input Node

This feature is specific to the Rete match algorithm and refers to the number of variable bindings that are checked for consistency at each two-input node (and-node or not-node). A value of zero indicates that no variables are checked for consistent binding, while a large value indicates that a large number of variables are checked. For example, if the number of tests is zero, for every token that arrives at the input of an and-node, as many tokens as there are in the opposite memory are sent to its successors. This usually implies a large amount of work. Alternatively, if the number of tests is large, then the number of tokens sent to the successors is small, but doing the pairwise comparison for consistent binding now takes more time. The graph for the number of tests per two-input node is shown in Figure 3-5.

[Figure 3-4: Attributes per condition element. Avg 2.58, SD 1.64, CV 0.64 for VT; Avg 2.87, SD 1.86, CV 0.65 for ILOG; Avg 2.34, SD 1.39, CV 0.60 for MUD; Avg 2.65, SD 1.48, CV 0.56 for DAA; Avg 3.10, SD 0.61, CV 0.20 for R1-SOAR; Avg 3.12, SD 0.66, CV 0.21 for EP-SOAR.]

[Figure 3-5: Tests per two-input node. Avg 0.37, SD 0.75, CV 2.00 for VT; Avg 1.21, SD 1.01, CV 0.84 for ILOG; Avg 0.59, SD 0.71, CV 1.21 for MUD; Avg 1.27, SD 1.53, CV 1.21 for DAA; Avg 1.16, SD 0.49, CV 0.42 for R1-SOAR; Avg 1.23, SD 0.53, CV 0.43 for EP-SOAR.]

3.2.6. Variables Bound and Referenced

Figure 3-6 shows the number of distinct variables which are both bound and referenced in the left-hand side of a production. Consistency tests are necessary only for these variables. Beyond the α-mem nodes, all processing done by the two-input nodes requires access to the values of only these variables; the values of other variables or attributes are not required.
This implies that the tokens in the network may store only the values of these variables instead of storing complete copies of working-memory elements. For parallel architectures that do not have shared memory, this can lead to significant improvements in the storage requirements and in the communication costs associated with tokens.

[Figure 3-6: Variables bound and referenced. Avg 0.72, SD 1.15, CV 1.59 for VT; Avg 2.21, SD 2.02, CV 0.91 for ILOG; Avg 0.68, SD 0.96, CV 1.41 for MUD; Avg 2.29, SD 3.13, CV 1.27 for DAA; Avg 4.77, SD 2.75, CV 0.58 for R1-SOAR; Avg 5.84, SD 5.41, CV 0.93 for EP-SOAR.]

3.2.7. Variables Bound but not Referenced

Figure 3-7 shows the number of distinct variables which are bound but not referenced in the left-hand side of a production. (These bindings are usually used in the right-hand side of the production.) This indicates the number of variables for which no consistency checks have to be performed.

[Figure 3-7: Variables bound but not referenced. Avg 1.93, SD 2.66, CV 1.37 for VT; Avg 1.96, SD 5.30, CV 2.70 for ILOG; Avg 0.94, SD 1.54, CV 1.63 for MUD; Avg 2.82, SD 3.50, CV 1.24 for DAA; Avg 1.61, SD 2.00, CV 1.24 for R1-SOAR; Avg 1.65, SD 1.59, CV 0.96 for EP-SOAR.]

3.2.8. Variable Occurrences in Left-Hand Side

Figure 3-8 shows the number of times each variable occurs in the left-hand side of a production. Both positive and negative condition elements are considered in counting the variables. Our measurements also show that variables almost never occur multiple times within the same condition element (average of 1.5% over all systems). Under this assumption, the number of occurrences of a variable represents the number of condition elements within a production in which the variable occurs.

[Figure 3-8: Occurrences of each variable. Avg 1.32, SD 0.58, CV 0.44 for VT; Avg 1.85, SD 1.11, CV 0.60 for ILOG; Avg 1.54, SD 0.74, CV 0.48 for MUD; Avg 1.72, SD 1.10, CV 0.64 for DAA; Avg 2.38, SD 1.25, CV 0.52 for R1-SOAR; Avg 2.48, SD 1.25, CV 0.50 for EP-SOAR.]

3.2.9. Variables per Condition Element

Figure 3-9 shows the number of variable occurrences within a condition element (not necessarily distinct, though as per Section 3.2.8 they mostly are). If this number is significant compared to the number of attributes for some class of condition elements, then it usually implies that the selectivity of those condition elements is small, or in other words, a large number of working-memory elements will match those condition elements.

[Figure 3-9: Variables per condition element. Avg 1.07, SD 1.44, CV 1.34 for VT; Avg 1.97, SD 2.13, CV 1.09 for ILOG; Avg 1.00, SD 1.24, CV 1.24 for MUD; Avg 2.26, SD 2.49, CV 1.10 for DAA; Avg 1.77, SD 0.66, CV 0.37 for R1-SOAR; Avg 1.86, SD 0.64, CV 0.34 for EP-SOAR.]
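This selectivity argument can be quantified roughly by subtracting the average number of variable occurrences from the average number of attributes, which estimates how many attributes of a class are tested against constants. The fragment below is an illustrative sketch only; the per-class averages are those reported for VT in Table 3-1 of the next subsection:

    # Estimated constant-bound attributes per condition-element class.
    # Values are (avg attributes, avg variable occurrences) for VT,
    # taken from Table 3-1.
    classes = {
        'context': (1.59, 0.22),
        'item':    (2.97, 1.14),
        'input':   (3.06, 1.56),
    }

    for name, (attrs, variables) in classes.items():
        print(name, round(attrs - variables, 2))
    # The fewer constant-bound attributes a class has, the less selective
    # its condition elements, and the more WMEs will match them.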
3.2.10. Condition Element Classes

Tables 3-1, 3-2, 3-3, 3-4, 3-5, and 3-6 list the seven condition-element classes occurring most frequently in each of the production-system programs. The tables also list the total number of attributes, the average number of attributes and its standard deviation, and the average number of variable occurrences in condition elements of each class. The total number of attributes for a condition-element class gives an estimate of the size of the working-memory element. This information can be used to determine the communication overhead in transporting working-memory elements amongst multiple memories in a parallel architecture. It also has implications for the space requirements for storing the working-memory elements. If we subtract the average number of variables from the average number of attributes for a condition-element class, we obtain the average number of attributes which have a constant value for that class. This number in turn has implications for the selectivity of condition elements of that class.

Table 3-1: VT: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. context      1366 (31%)    4         1.59      0.75     0.22
  2. item         756 (17%)     47        2.97      0.86     1.14
  3. input        448 (10%)     19        3.06      1.24     1.56
  4. needdata     239 (5%)      27        2.62      1.53     1.56
  5. distance     228 (5%)      12        5.18      1.40     1.67
  6. sys-measure  175 (4%)      11        4.87      1.47     1.71
  7. it-stack     110 (2%)      4         1.05      0.35     1.02

  Total number of condition element classes is 48.

Table 3-2: ILOG: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. arg          1270 (27%)    4         2.99      0.14     1.91
  2. task         1004 (21%)    2         1.76      0.44     0.77
  3. datum        431 (9%)      58        4.16      2.15     2.89
  4. period       143 (3%)      13        3.81      1.18     3.41
  5. packed-with  106 (2%)      32        4.73      2.23     3.84
  6. order        101 (2%)      37        3.16      2.29     3.08
  7. capacity     91 (1%)       41        5.66      4.57     3.90

  Total number of condition element classes is 86.

Table 3-3: MUD: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. task         678 (31%)     4         2.35      0.85     0.58
  2. data         547 (25%)     24        2.35      1.15     1.11
  3. hyp          160 (7%)      9         1.99      0.72     0.60
  4. datafor      111 (5%)      20        4.14      1.93     2.55
  5. reason       74 (3%)       13        3.12      1.58     1.24
  6. change       65 (3%)       6         1.40      0.87     0.88
  7. do           65 (3%)       21        5.25      1.60     2.83

  Total number of condition element classes is 38.

Table 3-4: DAA: Condition Element Classes

  Class Name      # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. context      474 (24%)     3         2.40      0.52     2.05
  2. port         241 (13%)     6         2.35      0.72     2.08
  3. db-operator  197 (11%)     6         1.70      0.58     0.54
  4. link         173 (9%)      6         5.28      1.53     5.55
  5. module       170 (9%)      6         2.68      1.12     1.66
  6. lists        134 (7%)      3         1.75      0.44     2.06
  7. outnode      112 (6%)      11        2.37      0.87     2.14

  Total number of condition element classes is 26.

Table 3-5: R1-SOAR: Condition Element Classes

  Class Name        # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. goal-ctx-info  988 (36%)     3         2.99      0.11     1.80
  2. op-info        383 (13%)     3         2.95      0.23     1.54
  3. state-info     375 (13%)     3         2.88      0.32     1.77
  4. space-info     217 (7%)      3         3.00      0.07     1.04
  5. order-info     183 (6%)      3         2.99      0.10     1.67
  6. preference     157 (5%)      8         5.32      0.78     3.44
  7. module-info    87 (3%)       3         2.92      0.27     1.90

  Total number of condition element classes is 21.
Table 3-6: EP-SOAR: Condition Element Classes

  Class Name        # of CEs (%)  Tot-Attr  Avg-Attr  SD-Attr  Avg-Vars
  1. goal-ctx-info  278 (44%)     3         2.99      0.10     1.83
  2. binding-info   85 (13%)      3         3.00      0.00     1.71
  3. state-info     59 (9%)       3         2.90      0.30     1.92
  4. eval-info      54 (8%)       3         2.96      0.19     1.83
  5. op-info        41 (6%)       3         2.93      0.26     1.54
  6. preference     36 (5%)       8         5.47      1.12     3.22
  7. space-info     30 (4%)       3         3.00      0.00     1.13

  Total number of condition element classes is 10.

3.2.11. Action Types

Table 3-7 gives the distribution of actions in the right-hand side into the classes make, remove, modify, and other for the production-system programs. The only actions that affect the working memory are of type make, remove, or modify. While each make and remove action causes only one change to the working memory, a modify action causes two changes to the working memory. This data then gives an estimate of the percentage of right-hand side actions that change the working memory. This data can also be combined with data about the number of actions in the right-hand side of productions (given in Section 3.2.2) to determine the average number of changes made to working memory per production firing.

Table 3-7: Action Type Distribution

  Action Type  VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Make      52%   20%   48%   34%   86%      78%
  2. Modify    13%   15%   17%   18%   0%       0%
  3. Remove    5%    7%    4%    18%   0%       0%
  4. Others    27%   56%   28%   27%   12%      21%
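As a back-of-the-envelope example of the calculation just mentioned, the sketch below (illustrative only, using the VT values from Table 3-7 above and the VT action count from Figure 3-2) estimates working-memory changes per firing, counting one change for each make and remove and two for each modify:

    # Rough estimate of working-memory changes per production firing.
    actions_per_firing = 4.80                  # VT, Figure 3-2
    make, modify, remove = 0.52, 0.13, 0.05    # VT, Table 3-7

    changes_per_firing = actions_per_firing * (make + remove + 2 * modify)
    print(round(changes_per_firing, 2))        # ~3.98 for VT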
3.2.12. Summary of Surface Measurements

Table 3-8 gives a summary of the surface measurements for the production-system programs. It brings together the average values of the various features for all six programs. The features listed in the table are condition elements per production, actions per production, negated condition elements per production, attributes per condition element, variables per condition element, and tests per two-input node.

Table 3-8: Summary of Surface Measurements

  Feature         VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Productions  1322  1181  872   445   319      62
  2. CEs/Prod     3.28  3.92  2.47  3.89  8.60     9.97
  3. Actns/Prod   4.80  3.16  3.42  2.42  9.62     4.29
  4. nCEs/Prod    0.27  0.35  0.33  0.52  0.24     0.21
  5. Attr/CE      2.58  2.87  2.34  2.65  3.10     3.12
  6. Vars/CE      1.07  1.97  1.00  2.26  1.77     1.86
  7. Tests/2inp   0.37  1.21  0.59  1.27  1.16     1.23

3.3. Measurements on the Rete Network

This section presents results of measurements made on the Rete network constructed by the OPS5 compiler. The measured features include the number of nodes of each type in the network and the amount of sharing that is present in the network.

3.3.1. Number of Nodes in the Rete Network

Table 3-9 presents data on the number of nodes of each type in the network for the various production-system programs. These numbers reflect the complexity of the network that is constructed for the programs. Table 3-10 gives the normalized number of nodes, that is, the number of nodes per production. The normalized numbers are useful for comparing the average complexity of the productions for the various production-system programs.11

Table 3-9: Number of Nodes

  Node Type      VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Const-Test  2849  1884  1743  397   436      118
  2. α-mem       1748  1481  878   339   398      96
  3. β-mem       1116  1363  358   549   1252     369
  4. And         2205  2320  872   847   1542     425
  5. Not         332   400   267   144   60       13
  6. Terminal    1322  1181  872   445   319      62
  7. Total       9572  8629  4990  2721  4079     1083

Table 3-10: Nodes per Production

  Node Type      VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Const-Test  2.15  1.59  1.99  0.89  1.11     1.90
  2. α-mem       1.32  1.25  1.00  0.76  1.01     1.54
  3. β-mem       0.84  1.15  0.41  1.23  3.20     5.95
  4. And         1.66  1.96  1.00  1.89  3.94     6.85
  5. Not         0.25  0.33  0.30  0.32  0.15     0.20
  6. Terminal    1.00  1.00  1.00  1.00  1.00     1.00
  7. Total       7.22  7.28  5.70  6.09  10.41    17.44

11. All the numbers listed in Tables 3-9 and 3-10 are for the case where the network compiler is allowed to share nodes.

Table 3-11 presents the number of nodes per condition element for the production-system programs. The average number of nodes per condition element over all the systems is 1.86. This number is quite small because many nodes are shared between condition elements. In case no sharing is allowed, this number jumps up two to three fold, as is shown in Table 3-12.

Table 3-11: Nodes per Condition Element (with sharing)

  Feature        VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Total CEs   4336  4629  2153  1731  2743     618
  2. Tot. Nodes  9572  8629  4990  2721  4079     1083
  3. Nodes/CE    2.20  1.86  2.31  1.57  1.48     1.75

Table 3-12: Nodes per Condition Element (without sharing)

  Feature        VT     ILOG   MUD   DAA   R1-SOAR  EP-SOAR
  1. Total CEs   4336   4629   2153  1731  2743     618
  2. Tot. Nodes  20950  19717  9953  7006  12024    2532
  3. Nodes/CE    4.83   4.25   4.62  4.04  4.38     4.10
  4. Sharing     2.19   2.28   2.00  2.57  2.95     2.34

3.3.2. Network Sharing

The OPS5 network compiler exploits similarity in the condition elements of productions to share nodes in the Rete network. Such sharing is not possible in parallel implementations of production systems where each production is placed on a separate processor, although some sharing is possible in parallel implementations that use a shared-memory multiprocessor. To help estimate the extra computation required due to loss of sharing, Table 3-13 gives the ratios of the number of nodes in the unshared Rete network to the number of nodes in the shared Rete network. The ratios do not give the extra computational requirements exactly, because they are only a static measure; the exact numbers will depend on the dynamic flow of information (tokens) through the network. Table 3-13 also shows that the sharing is large only for constant-test and α-mem nodes, and small for all other node types.12

Table 3-13: Network Sharing (Nodes without sharing / Nodes with sharing)

  Node Type      VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. Const-Test  3.86  4.57  3.21  7.38  10.34    6.90
  2. α-mem       2.35  3.04  2.05  4.57  6.85     6.40
  3. β-mem       1.35  1.44  1.17  1.44  1.63     1.31
  4. And         1.19  1.30  1.12  1.24  1.52     1.27
  5. Not         1.08  1.04  1.00  1.61  1.26     1.00

12. Note that the reported ratios correspond to the amount of sharing or similarity exploited by the OPS5 network compiler, which may not be the same as the maximum exploitable similarity available in the production-system program.

3.4. Run-Time Characteristics of Production Systems

This section presents data on the run-time behavior of production systems. The measurements are useful to identify operations frequently performed by the interpreter and provide some rough bounds on the speed-up that may be achieved by parallel implementations. Although most of the reported measurements are in terms of the Rete network, a number of general conclusions can be drawn from the measurements.

3.4.1. Constant-Test Nodes

Table 3-14 presents run-time statistics for constant-test nodes. The first line of the table, labeled "visits/change", refers to the average number of constant-test node visits (activations) per change to working memory.13 The second line of the table reports the number of constant-test activations as a fraction of the total number of node activations.
The third line of the table, labeled "success", reports the percent of constant-test node activations that have their associated test satisfied.

Table 3-14: Constant-Test Nodes

  Feature            VT      ILOG    MUD     DAA    R1-SOAR  EP-SOAR
  1. visits/change   107.00  231.20  117.79  57.02  48.79    18.93
  2. % of total      76.9%   84.6%   70.2%   52.0%  60.0%    35.0%
  3. success (%)     15.3%   3.3%    24.5%   8.0%   6.3%     14.1%
  4. hash-visits/ch  22.92   24.48   41.96   7.14   5.05     3.97

Although constant-test node activations constitute a large fraction (63% on average) of the total node activations, a relatively small fraction of the total match time is spent in processing them. This is because the processing associated with constant-test nodes is very simple compared with that for other nodes like α-mem nodes or and-nodes. In the OPS83 [21] implementation on the VAX-11 architecture, the evaluation of a constant-test node takes only 3 machine instructions. The evaluation of two-input nodes in comparison takes 50-100 instructions.

The numbers on the third line show that only a small fraction (11.9% on average) of the constant-test node activations are successful. This suggests that by using indexing techniques (for example, hashing), many constant-test node activations that do not result in satisfaction of the associated tests may be avoided. The fourth line of the table, labeled "hash-visits/ch", gives the approximate number of constant-test node activations per working-memory change when hashing is used to avoid evaluation of nodes whose tests are bound to fail. Calculations show that approximately 82% of the total constant-test node activations can be avoided by using hashing. The hashing technique is especially helpful for the constant-test nodes immediately below the root node. These nodes check for the class of the working-memory element (see Figure 2-2), and since a working-memory element has only one class, all but one of these constant-test nodes fail their test. Calculations show that by using hashing at the top level, the total number of constant-test node activations can be reduced by about 43%.

13. The run-time data presented in this chapter corresponds to traces vt.lin, ilog.lin, mudo.lin, daa.lin, rlx.lin, and eps.lin. These traces are described in Section 8.1.
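A minimal realization of the top-level hashing idea looks as follows. This is an illustrative sketch, not the thesis implementation; it assumes the index key is simply the working-memory element's class:

    # Replace the root node's broadcast over all class tests with a single
    # hash-table lookup on the element's class.
    from collections import defaultdict

    subnetworks = defaultdict(list)   # class name -> constant-test chains

    def register(class_name, tests):
        subnetworks[class_name].append(tests)

    def root_activation(wme):
        # One probe instead of one class-equality test per distinct class.
        for tests in subnetworks[wme['class']]:
            if all(test(wme) for test in tests):
                pass   # token would be passed to the memory node below

    register('C1', [lambda w: w.get('attr2') == 12])
    root_activation({'class': 'C1', 'attr1': 12, 'attr2': 12})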
3.4.2. Alpha-Memory Nodes

An α-mem node associated with a condition element stores tokens corresponding to working-memory elements that partially match the condition element, that is, tokens that satisfy all intra-condition tests for the condition element. These nodes are the first significant nodes, in terms of the processing required, that get affected when a change is made to the working memory. It is only later that changes filter through α-mem nodes down to and-nodes, not-nodes, β-mem nodes, and terminal nodes.

The first line of Table 3-15 gives the number of α-mem node activations per change to working memory. The average number of activations for the six programs is only 5.00. This is quite small because of the large amount of sharing between α-mem nodes. The second line of the table gives the number of α-mem node activations when sharing is eliminated (something that is necessary in many parallel implementations). In this case the average number of α-mem node activations goes up to 26.48, an increase by a factor of 5.30. The third line of the table gives the dynamic sharing factor (line-2/line-1), which may be contrasted to the static sharing factor given in Table 3-13. As can be seen from the data, the dynamic sharing factor is consistently larger than the observed static sharing factor.

Table 3-15: Alpha-Memory Nodes

  Feature               VT      ILOG    MUD    DAA    R1-SOAR  EP-SOAR
  1. visits/ch (sh)     5.29    6.60    10.73  3.28   2.57     1.55
  2. visits/ch (nsh)    29.67   30.06   27.59  37.94  19.17    14.50
  3. dyn. shar. factor  5.60    4.55    2.57   11.56  7.45     9.35
  4. avg. tokens        302.76  180.44  64.91  14.91  48.50    7.15
  5. max. tokens        1467    572     369    88     197      38

The fourth line of Table 3-15 reports the average number of tokens present in an α-mem node when it is activated. This number indicates the complexity of the processing performed by an α-mem node. When an α-mem node is activated by an incoming token with a - tag, the node must find a corresponding token in its stored set of tokens, and then delete that token. If a linear search is done to find the corresponding token, on average, half of the stored tokens will be looked up. Thus the complexity of deleting a token from an α-mem node is proportional to the average number of tokens. On arrival of a token with a + tag, the α-mem node simply stores the token. This involves allocating memory and linking the token, and takes a constant amount of time. In case hashing is used to locate the token to be deleted, the delete operation can also be done in constant time. However, then we have to pay the overhead associated with maintaining a hash table. Hash tables become more economical as the number of tokens stored in the α-mem increases. The numbers presented in the fourth line are useful for deciding when hash tables (or other indexing techniques) are appropriate.

The fifth line of Table 3-15 reports the maximum number of tokens found in an α-mem node for the various programs.14 These numbers are useful for estimating the maximum storage requirements for individual memory nodes. The maximum storage requirements, in turn, are useful in the design of hardware associative memories to hold the tokens.

14. It is interesting to note that the value for the maximum number of tokens is the same as the value for the maximum size of working memory (see Table 3-22) for the VT, ILOG, and MUD systems. This implies that there is at least one condition element in each of these three systems that is satisfied by all working-memory elements.

3.4.3. Beta-Memory Nodes

A β-mem node stores tokens that match a subset of condition elements in the left-hand side of a production. The data for β-mem nodes, presented in Table 3-16, can be interpreted in the same way as that for α-mem nodes. There is, however, one difference that is of relevance. The sharing between β-mem nodes is much less than that between α-mem nodes, so that in parallel implementations the cost of processing β-mem nodes does not increase so much. When no sharing is present, the average number of β-mem node activations goes up from 3.53 to 5.17, an increase by a factor of only 1.46 as compared to a factor of 5.30 for the α-mem nodes.

Table 3-16: Beta-Memory Nodes

  Feature               VT    ILOG  MUD    DAA    R1-SOAR  EP-SOAR
  1. visits/ch (sh)     0.53  1.57  2.62   4.12   3.89     8.47
  2. visits/ch (nsh)    1.29  2.36  4.14   5.44   8.03     9.81
  3. dyn. shar. factor  2.43  1.50  1.58   1.32   2.06     1.15
  4. avg. tokens        3.30  3.97  73.10  28.26  7.43     4.95
  5. max. tokens        48    50    168    360    85       18
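Given averages such as the 302.76 tokens per activation observed for VT's α-mem nodes, deletion by linear search is clearly the expensive case. A hash-indexed memory node avoids it; the class below is a simplified illustrative sketch (the key, supplied here by the caller, would in practice be derived from the values examined by the two-input node beneath):

    class MemoryNode:
        """Memory node whose tokens are bucketed by a hash key, so a
        delete need not scan the entire stored set."""
        def __init__(self):
            self.buckets = {}

        def activate(self, tag, token, key):
            bucket = self.buckets.setdefault(key, [])
            if tag == '+':
                bucket.append(token)     # insert: constant time
            else:
                bucket.remove(token)     # delete: scan one bucket only

The table-maintenance overhead this adds is the trade-off discussed above; it pays off as the average token count grows.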
3.4.4. And Nodes

The run-time data for and-nodes are given in Table 3-17. The first line gives the number of and-node activations per change to working memory. The average number of node activations for the six programs is 27.66. The second line gives the average number of and-node activations for which no tokens are found in the opposite memory nodes. For example, for the VT program, the first line in the table shows that there are 25.96 and-node activations per change. Of these 25.96 activations, 24.48 have an empty opposite memory. Since an and-node activation for which there are no tokens in the opposite memory requires very little processing, evaluating the majority of the and-node activations is very cheap. Most of the processing effort goes into evaluating the small fraction of activations which have non-empty opposite memories. This means that if all and-node activations are evaluated on different processors, then the majority of the processors will finish very early compared to the remaining few. This large variation in the processing requirements of and-nodes (see Tables 8-1 and 8-2 for some actual numbers) reduces the effective speed-up that can be obtained by evaluating each and-node activation on a different processor.

When a token arrives on the left input of an and-node, it must be compared to all tokens stored in the memory node associated with the right input of that and-node. The comparisons may involve tests to check if the values of the variables bound in the two tokens are equal, if one is greater than the other, or other similar tests. The third line of the table gives the percentage of two-input node activations where no equality tests are performed.15 These numbers indicate the fraction of node activations where hash-table based memory nodes do not help in cutting down the tokens examined in the opposite memory (also see Section 5.2.1).

Table 3-17: And Nodes

  Feature           VT     ILOG   MUD    DAA    R1-SOAR  EP-SOAR
  1. visits/change  25.96  26.59  25.95  39.41  24.48    23.56
  2. null-mem       24.48  23.42  20.26  33.53  16.86    10.81
  3. null-tests     13.2%  7.8%   12.8%  8.2%   0.3%     0.0%
  4. tokens         17.00  4.39   24.33  27.18  4.87     7.96
  5. tests          17.35  5.18   25.94  27.51  5.29     8.45
  6. pairs          1.41   0.90   1.06   0.83   0.60     0.71

The fourth line shows the average number of tokens found in the opposite memory for an and-node activation, when the opposite memory is not empty.

15. For reasons too complex to explain here, separate numbers for and-node and not-node activations were not available. That is, the numbers presented in line-3 are for the combined activations of and-nodes and not-nodes.
The numbers in the sixth line of the table show the average number of consistent token-pairs found after matching the incoming token to all tokens in the opposite memory. For example, for the DAA program, on the activation of an and-node, an average of 27.18 tokens are found in the opposite memory node. On average, however, only 0.83 tokens are found to be consistent with the incoming token. This indicates that the opposite memory contains a lot of information, of which only a very small portion is relevant to the current context. The numbers in the sixth line also give a measure of token regeneration taking place within the network. This data may be used to construct probabilistic models of information flow within the Rete network. 3.4.5. Not Nodes Not-nodes are very similar to and-nodes, and the data for them should be interpreted in exactly the same way as that for and-nodes. The data are presented in Table 3-18. Table 3-18: Not Nodes Feature 1. visits/change 2. null-mere 3. tokens 4. tests 5. pairs VT 5.01 3.90 31.39 34.95 0.25 ILOG 5.84 4.28 5.99 7.94 0.45 MUD 5.79 3.89 13.94 14.06 0.31 DAA 3.97 2.33 12.51 12.53 0.43 RI-SOAR 2.63 1.42 9.87 11.91 1.41 EP-$OAR 0.75 0.27 6.43 7.38 0.75 42 PARALLELISM IN PRODUCTION SYSTEMS 3.4.6. Terminal Nodes Activations of terminal nodes correspond to insertion of production instantiations into the conflict set and deletion of instantiations from the conflict set. The first line of Table 3-19 gives the number of changes to the conflict set for each working-memory change. The second line gives the average number of changes made to the working memory per production firing, and the third line, the product of the first two lines, gives the average number of changes made to the conflict set per production firing. The data in the third line gives the number of changes that will be transmitted to a central conflict-resolution processor, in an architecture using centralized conflict-resolution. The fourth line gives the size of the conflict-set when averaged over the complete run. Table 3-19: Terminal Nodes 1. visits/change 2. changes/cycle 3. mods./cycle 4. avg confl-set VT 1.79 3.27 5.85 35 ILOG 2.06 1.70 3.50 10 MUD 3.69 2.13 7.86 36 DA.__AA RI-$OAR 1.65 0.55 2.22 4.55 3.66 2.50 22 12 EP-SOAR 0.74 4.69 3.47 18 3.4.7. Summary of Run-Time Characteristics Table 3-20 summarizes data for the number of node activations, when a working-memory element is inserted into or deleted from the working memory. The data show that a large percentage (63% on average) of the activations are of constant-test nodes. Constant-test node activations, however, re- quire very little processing compared to other node types, and furthermore, a large number of constant-test activations can be eliminated by suitable indexing techniques (see Section 3.4.1). To eliminate the effect of this large number of relatively cheap constant-test node activations, we subtracted the number of constant-test node activations from the activations of all nodes. These numbers are shown on line-8 of Table 3-20. The first observation that can be made from the data on line-8 of Table 3-20 is that, the way production-system programs are currently written, changes to working memory do not have global effects, but affect only a very small fraction of the nodes present in the Rete network (see Table 3-9). This also means that the number of productions that are affected 16is very small, as can be seen from line-1 in Table 3-21. 
3.4.6. Terminal Nodes

Activations of terminal nodes correspond to insertion of production instantiations into the conflict set and deletion of instantiations from the conflict set. The first line of Table 3-19 gives the number of changes to the conflict set for each working-memory change. The second line gives the average number of changes made to the working memory per production firing, and the third line, the product of the first two lines, gives the average number of changes made to the conflict set per production firing. The data in the third line gives the number of changes that will be transmitted to a central conflict-resolution processor, in an architecture using centralized conflict resolution. The fourth line gives the size of the conflict set when averaged over the complete run.

Table 3-19: Terminal Nodes

  Feature           VT    ILOG  MUD   DAA   R1-SOAR  EP-SOAR
  1. visits/change  1.79  2.06  3.69  1.65  0.55     0.74
  2. changes/cycle  3.27  1.70  2.13  2.22  4.55     4.69
  3. mods./cycle    5.85  3.50  7.86  3.66  2.50     3.47
  4. avg confl-set  35    10    36    22    12       18

3.4.7. Summary of Run-Time Characteristics

Table 3-20 summarizes the data for the number of node activations when a working-memory element is inserted into or deleted from the working memory. The data show that a large percentage (63% on average) of the activations are of constant-test nodes. Constant-test node activations, however, require very little processing compared to other node types, and furthermore, a large number of constant-test activations can be eliminated by suitable indexing techniques (see Section 3.4.1). To eliminate the effect of this large number of relatively cheap constant-test node activations, we subtracted the number of constant-test node activations from the activations of all nodes. These numbers are shown on line-8 of Table 3-20.

The first observation that can be made from the data on line-8 of Table 3-20 is that, the way production-system programs are currently written, changes to working memory do not have global effects, but affect only a very small fraction of the nodes present in the Rete network (see Table 3-9). This also means that the number of productions that are affected16 is very small, as can be seen from line-1 in Table 3-21. Both the small number of affected nodes and the small number of affected productions limit the amount of speed-up that can be obtained from using parallelism, as is discussed in Chapter 4.

The second observation that can be made is that the total number of node activations (excluding constant-test node activations) per change is quite independent of the number of productions in the production-system program. This, in turn, implies that the number of productions that are affected is quite independent of the total number of productions present in the system, as can be seen from Table 3-21.

There are several implications of the above observations. First, we should not expect smaller production systems (in terms of number of productions) to run faster than larger ones. Second, it appears that allocating one processor to each node in the Rete network or allocating one processor to each production is not a good idea. Finally, there is no reason to expect that larger production systems will necessarily exhibit more speed-up from parallelism.

16. A production is said to be affected by a change to working memory if the working-memory element satisfies at least one of its condition elements.

Table 3-20: Summary of Node Activations per Change

  Node Type         VT      ILOG    MUD     DAA     R1-SOAR  EP-SOAR
  1. Const-Test     107.00  231.20  117.79  57.02   48.79    18.93
  2. α-mem          5.29    6.60    10.73   3.28    2.57     1.55
  3. β-mem          0.53    1.57    2.62    4.12    3.89     8.47
  4. And            25.96   26.59   25.95   39.41   24.48    23.56
  5. Not            5.01    5.84    5.79    3.97    2.63     0.75
  6. Terminal       1.79    2.06    3.69    1.65    0.55     0.74
  7. Total          145.58  273.92  166.57  109.45  82.91    54.00
  8. Line7 - Line1  38.58   42.72   48.78   52.43   34.12    35.07

Table 3-21: Number of Affected Productions

  Feature            VT     ILOG   MUD    DAA    R1-SOAR  EP-SOAR
  1. p-aff/change    31.22  34.19  27.01  28.54  34.57    12.07
  2. SD17 for Line1  19.55  38.53  25.39  27.77  60.16    14.69
  3. changes/cycle   3.27   1.70   2.13   2.22   4.55     4.69
  4. p-aff/firing    40.14  36.49  32.05  40.04  63.04    20.45
  5. SD for Line4    31.59  52.70  28.69  32.55  93.67    20.12

Table 3-22 gives general information about the runs of the production-system programs from which the data presented in this chapter were gathered.18 The first two lines of the table give the average and maximum sizes of the working memory. The third and fourth lines give the average and maximum values for the sizes of the conflict set. The fifth and sixth lines give the average and maximum sizes of the token memory when memory nodes may be shared. (The size of the token memory at any instant is the total number of tokens stored in all memory nodes at that instant.) The seventh and eighth lines give the average and maximum sizes of the token memory when memory nodes may not be shared. The last line in the table gives the total number of changes made to the working memory in the production-system run from which the statistics are gathered.

17. SD stands for Standard Deviation.

18. The numbers presented in this chapter and in later chapters of the thesis are based on one run per production-system program. Detailed simulation-based analysis (results presented in Chapter 8) was not done for multiple runs of programs because of the large amount of data involved and because of the large processing requirements. However, we did gather statistics, like the ones presented in this chapter, for multiple runs of programs. The variation in the numbers obtained from the multiple runs was small.
Table 3-22: General Run-Time Data

  Feature           VT     ILOG  MUD   DAA    R1-SOAR  EP-SOAR
  1. avg work-mem   1134   486   241   250    543      199
  2. max work-mem   1467   572   369   308    786      258
  3. avg confl-set  35     10    36    22     12       18
  4. max confl-set  131    38    648   88     36       31
  5. avg tokm(sh)   5485   3506  3176  1182   1515     555
  6. max tokm(sh)   7416   4204  4576  2624   2716     856
  7. avg tokm(nsh)  13366  5363  4717  18343  3892     2546
  8. max tokm(nsh)  22640  8346  7583  23213  7402     3480
  9. WM changes     1767   2191  2074  3200   2220     924

Finally, it is important to point out that the results of the measurements presented for the six systems in this chapter are very similar to the results obtained for another set of systems (R1 [57], XSEL [56], PTRANS [34], HAUNT, DAA [42], and EP-SOAR [45]) analyzed in [30]. Consequently, there is good reason to believe that the results about parallelism (presented later in the thesis) apply not only to the six systems discussed here, but also to most other systems that have been written in the OPS5 and Soar languages.

Chapter Four
Parallelism in Production Systems

On the surface, production systems appear to be capable of exploiting large amounts of parallelism. For example, it is possible to perform match for all productions in parallel. This chapter identifies some obvious and other not-so-obvious sources of parallelism in production systems, and discusses the feasibility of exploiting them. It draws upon performance results reported in Chapter 8 of the thesis to motivate the utilization of some of the sources. Note that for reasons stated in Section 2.4, most of the discussion focuses on the parallelism that may be used within the context of the Rete algorithm.

4.1. The Structure of a Parallel Production-System Interpreter

As discussed in Section 2.1, there are three steps that are repeatedly performed to execute an OPS5 production-system program: match, conflict-resolution, and act. Figure 4-1 shows the flow of information between these three stages of the interpreter. It is possible to use parallelism while performing each of these three steps. It is further possible to overlap the processing performed within the match step and the conflict-resolution step of the same recognize-act cycle, and that within the act step of one cycle and the match step of the next cycle. However, it is not possible to overlap the processing within the conflict-resolution step and the subsequent act step. This is because the conflict-resolution step must finish completely before the next production to fire can be determined and its right-hand side evaluated. Thus, in an OPS5 programming environment, the possible sources of speed-up are: (1) parallelism within the match step; (2) parallelism within the conflict-resolution step; (3) parallelism within the act step; (4) overlap between the match step and the conflict-resolution step of the same cycle; and (5) overlap between the act step of one cycle and the match step of the next cycle.

[Figure 4-1: OPS5 interpreter cycle.]

As pointed out in Section 2.2, Soar programs do not execute the standard match--conflict-resolution--act cycle executed by OPS5 programs. A simplified diagram of the information flow in the Soar cycle is shown in Figure 4-2.
As pointed out in Section 2.2, Soar programs do not execute the standard match--conflict-resolution--act cycle executed by OPS5 programs. A simplified diagram of the information flow in the Soar cycle is shown in Figure 4-2. The match step and the act step are the same as in OPS5, but the conflict-resolution step is not present. Instead, the computation is divided into an elaboration phase and a decision phase. Within each phase all productions that are satisfied may be fired concurrently, and the productions that become satisfied as a result of such firings may also be fired concurrently with the originally satisfied productions. Such concurrency increases the speed-up that may be obtained from using parallelism, as will be discussed later in this chapter. There are, however, synchronization points between the elaboration phase and the decision phase; the elaboration phase must finish completely before the processing may proceed to the decision phase, and vice versa. The serializing effect of these two synchronization points in Soar is not as bad as that of the synchronization point between the conflict-resolution and the act step in OPS5. This is because Soar systems usually go through a few loops internally within the elaboration phase and within the decision phase, with no synchronization points to produce any serialization.

[Figure 4-2: Soar interpreter cycle: the elaboration phase and the decision phase, separated by synchronization points.]

4.2. Parallelism in Match

Current production-system interpreters spend almost 90% of their time in the match step, and only around 10% of the time in the conflict-resolution and the act steps. The reason for this is the inherent complexity of the match step, as was discussed in Section 2.3. This makes it imperative that we speed up the match step as much as possible. The following discussion presents several ways in which parallelism may be used to speed up the match step.

The processing done within the match step can be divided into two parts: the selection phase and the state-update phase [66]. During the selection phase, the match algorithm determines those condition elements that are satisfied by the new change to working memory, that is, it determines those condition elements for which all the intra-condition tests are satisfied by the newly inserted working-memory element. During the state-update phase, the match algorithm updates the state (stores a token in the memory nodes) associated with the condition elements determined in the selection phase. In addition, this new state is matched against previously stored state to determine new instantiations of satisfied productions. In the context of the Rete algorithm, the processing done during the selection phase corresponds to the evaluation of the top part of the Rete network, the part consisting of constant-test nodes. The processing done during the state-update phase corresponds to the evaluation of α-mem nodes, β-mem nodes, and-nodes, not-nodes, and terminal nodes.

[Figure 4-3: Selection and state-update phases in match: changes to working memory pass through satisfied condition elements to changes in the conflict set.]

Although the beginning of the selection phase must precede the state-update phase, the processing for the two phases may overlap. As soon as the selection phase determines the first satisfied condition element, the state-update phase can begin. In case many changes to working memory are to be processed concurrently, it is also possible to overlap the processing of the selection phase for one change to working memory with the state-update phase for another change.

Comparing the selection phase and the state-update phase, about 75%-95% of the processing time is spent in performing the state-update phase. The main reason for this, as stated in Section 3.4.1, is that the activations of constant-test nodes are much cheaper than the activations of memory nodes and two-input nodes. This disparity in the computational requirements between the two phases makes it necessary to speed up the state-update phase much more than the selection phase to attain balance. Since the state-update phase is more critical to the overall performance of the match algorithm, the following subsections focus primarily on the parallelization of the state-update phase.19

19. Kemal Oflazer has developed a special algorithm, which uses the information in both the left-hand sides and right-hand sides of productions to speed up the selection phase. So far we have not felt the necessity to use this more complex selection algorithm, because even with the standard selection/discrimination network used by Rete, the state-update phase is still the bottleneck.
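A minimal sketch of the selection/state-update split, with the constant tests of the discrimination network flattened into an array: every test the new working-memory element passes yields one state-update task. The WME layout, the two condition elements, and the attribute names are all hypothetical.

    #include <stdio.h>
    #include <string.h>

    /* A hypothetical working-memory element with a class name and two fields. */
    typedef struct { const char *class; int attr1, attr2; } WME;

    /* One intra-condition (constant) test and the condition element it guards. */
    typedef struct { const char *class; int attr2_value; int ce_id; } ConstTest;

    static const ConstTest tests[] = {
        { "C1", 7,  101 },   /* condition element 101: (C1 ^attr2 7)  */
        { "C2", 12, 102 },   /* condition element 102: (C2 ^attr2 12) */
    };

    /* Selection phase: cheap constant tests over the new WME.  Each success
       enqueues a state-update task (token storage plus joins), which is
       where 75%-95% of the match time is actually spent. */
    static void select_phase(const WME *w) {
        for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
            if (strcmp(w->class, tests[i].class) == 0 &&
                w->attr2 == tests[i].attr2_value)
                printf("enqueue state-update task for condition element %d\n",
                       tests[i].ce_id);
    }

    int main(void) {
        WME w = { "C1", 3, 7 };
        select_phase(&w);    /* prints one task, for condition element 101 */
        return 0;
    }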
4.2.1. Production Parallelism

To use production parallelism, the productions in a program are divided into several partitions and the match for each of the partitions is performed in parallel. In the extreme case, the number of partitions equals the number of productions in the program, so that the match for each production in the program is performed in parallel. Figure 4-4 shows the case where a production system is split into N partitions. The main advantage of using production parallelism is that no communication is required between the processes performing match for different productions or different partitions.

[Figure 4-4: Production parallelism.]

Before going into the implementation issues related to exploiting production parallelism, it is useful to examine the approximate speed-up that may be obtained from it: do we expect 10-fold speed-up, 100-fold speed-up, or 1000-fold speed-up, provided that enough processors are present? Our studies for OPS5 and Soar programs show that the true speed-up expected from production parallelism is really quite small, only about 2-fold. Some of the reasons for this are given below:

• Simulations show that the average number of productions affected20 per change to working memory is only 26. This implies that if there is a separate processor performing match for each production in the program, only 26 processors will be performing useful work and the rest will have no work to do. Thus the maximum speed-up from production parallelism is limited to 26.21 For reasons stated below, however, the expected speed-up is even smaller.

• The speed-up obtainable from production parallelism is further reduced by the variance in the processing time required by the affected productions. The maximum speed-up that can be obtained is proportional to the ratio t_avg : t_max, where t_avg is the average time taken by an affected production to finish match and t_max is the maximum time taken by any affected production to finish match. The parallelism is inversely proportional to t_max because the next recognize-act cycle cannot begin until all productions have finished match. Simulations for OPS5 and Soar programs show that because of this variance the maximum nominal speed-up22 that is obtainable using production parallelism is 5.1-fold, a factor of 5.1 less than the average number of affected productions.23
• The third factor that influences the speed-up is the loss of sharing in the Rete network when production parallelism is used. The loss of sharing happens because operations that would have been performed only once for similar productions are now performed independently for such productions, since the productions are evaluated on different processors. Simulations show that the loss of sharing increases the average processing cost by a factor of 1.63. Thus if there are 16 processors that are active all the time, the speed-up as compared to a uniprocessor implementation (with no loss in sharing) will still be less than 10.

• The fourth factor that influences the speed-up is the overhead of mapping the decomposition of the algorithm onto a parallel hardware architecture. The overheads may take the form of memory-contention costs, synchronization costs, or task-scheduling costs. Simulations done for an implementation of the parallel Rete algorithm on a shared-memory multiprocessor show that such overheads increase the processing cost by a factor of 1.61.

The combined sharing, synchronization, and scheduling overheads account for a loss in performance by a factor of 2.62 (1.61 x 1.63). As a result of the combined losses, the true speed-up from using production parallelism is only 1.9-fold (down from the nominal speed-up of 5.1-fold).

20. Recall that a production is said to be affected by a change to working memory if the new working-memory element matches at least one of the condition elements of that production. Updating the state associated with the affected productions (the state-update phase computation) takes about 75%-95% of the total time taken by the match phase.
21. Note that in the above discussion we have only been concerned with the state-update phase computation. This is possible because we already have parallel algorithms to execute the selection phase very fast.
22. Nominal speed-up (or concurrency) refers to the average number of processors that are kept busy in the parallel implementation. Nominal speed-up is to be contrasted against true speed-up, which refers to the speed-up with respect to the highest performance uniprocessor implementation, assuming that the uniprocessor is as powerful as the individual nodes of the parallel processor. True speed-up is usually less than the nominal speed-up because some of the resources in a parallel implementation are devoted to synchronizing the parallel processes, scheduling the parallel processes, recomputing some data which is too expensive to be communicated, etc.
23. Note that the numbers given in this section and the following sections correspond to the simulation results for the production-system traces listed in Section 8.1.
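The way the individual loss factors compose can be checked with a few lines of arithmetic; the numbers below are the averages reported in this section.

    #include <stdio.h>

    int main(void) {
        double affected = 26.0;   /* avg. productions affected per WM change  */
        double variance = 5.1;    /* loss factor due to unequal match costs   */
        double sharing  = 1.63;   /* loss factor from reduced network sharing */
        double mapping  = 1.61;   /* synchronization/scheduling/contention    */

        double nominal = affected / variance;            /* ~5.1-fold */
        double true_su = nominal / (sharing * mapping);  /* ~1.9-fold */

        printf("combined overhead: %.2f\n", sharing * mapping);  /* 2.62 */
        printf("nominal speed-up:  %.1f\n", nominal);
        printf("true speed-up:     %.1f\n", true_su);
        return 0;
    }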
Some implementation issues associated with using production-level parallelism are now discussed. The first point that emerges from the previous discussion is that it is not advisable to allocate one processor per production for performing match. If this is done, most of the processors will be idle most of the time and the hardware utilization will be poor [22, 31].24 If only a small number of processors are to be used, there are two alternative strategies. The first strategy is to divide the production-system program into several partitions so that the processing required by productions in each partition is almost the same, and then allocate one processor for each partition. The second strategy is to have a task queue shared by all processors, in which entries for all productions requiring processing are placed. Whenever a processor finishes processing one production, it gets the next production that needs processing from the task queue. Some advantages and disadvantages of these two strategies are given below.

The first strategy is suitable for both shared-memory multiprocessors and non-shared-memory multicomputers. It is possible for each processor to work from its local memory, and little or no communication between processors is required. The main difficulty, however, is to find partitions of the production system that require the same amount of processing. Note that it is not sufficient to find partitions with only one affected production per partition, because the variance in the cost of processing the affected productions still destroys most of the speed-up.25 The task of partitioning is also difficult because good models are not available for estimating the processing required by productions, and also because the processing required by productions varies over time. A discussion of the various issues involved in the partitioning task is presented in [66, 67].

The second strategy is suitable only for shared-memory architectures, because it requires that each processor have access to the code and state of all productions in the program.26 Since the tasks are allocated dynamically to the processors, this strategy has the advantage that no load-distribution problems are present. Another advantage of this strategy is that it extends very well to lower granularities of parallelism. However, this strategy loses some performance due to the synchronization, scheduling, and memory-contention overheads present in a multiprocessor. (A sketch of such a shared task queue is given below.)

24. Low utilization is not justifiable, no matter how inexpensive the hardware, for it indicates that some alternative design can be found that can attain more performance at the same cost.
25. Kemal Oflazer in his thesis [67] evaluates a scheme where more than one processor is allocated to each partition to offset the effect of the variance.
26. While it is possible to replicate the code (that is, the Rete network) in the local memories of all the processors, it is not possible to do so for the dynamically changing state.
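A minimal sketch of the second strategy, assuming a shared-memory machine and POSIX threads: affected productions are placed on one mutex-protected queue, and each idle processor pulls the next entry. The queue bound, the thread count, and the work function are all illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define MAX_TASKS 128

    static int queue[MAX_TASKS];     /* ids of productions needing match */
    static int head = 0, tail = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void match_production(int id) { (void)id; /* state-update work */ }

    /* Each worker repeatedly claims the next pending production.  Dynamic
       assignment avoids the static partitioning problem, at the price of
       contention on the queue lock. */
    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            if (head == tail) { pthread_mutex_unlock(&lock); break; }
            int id = queue[head++];
            pthread_mutex_unlock(&lock);
            match_production(id);
        }
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 26; i++) queue[tail++] = i;  /* 26 affected */
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        puts("all affected productions matched");
        return 0;
    }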
In conclusion, the maximum speed-up that can be obtained from production-level parallelism is equal to the average number of productions affected per change to working memory (an average of 26 for the production systems studied). In practice, however, the nominal speed-up that is obtained is only 5.1-fold, due to the variance in the processing times required by the affected productions. The true speed-up that can be obtained is even less, only 1.9-fold, due to the loss of sharing in parallel decompositions (a factor of 1.63) and the overheads of mapping the decompositions onto hardware architectures (a factor of 1.61).

4.2.2. Node Parallelism

Unlike production parallelism, node parallelism is specific to the Rete algorithm. When node parallelism is used, activations of different two-input nodes in the Rete network are evaluated in parallel.27 Node parallelism is graphically depicted in Figure 4-5. It is important to note that node parallelism subsumes production parallelism, in that node parallelism has a finer grain than production parallelism. Thus, using node parallelism, both activations of two-input nodes belonging to different productions (corresponding to production parallelism) and activations of two-input nodes belonging to the same production (resulting in the extra parallelism) are processed in parallel.

The main reason for going to this finer granularity of parallelism is to reduce the value of t_max, the maximum time taken by any affected production to finish match. This decreased granularity of parallelism, however, leads to increased communication requirements between the processes evaluating the nodes in parallel. In node parallelism a process must communicate the results of a successful match to its successor two-input nodes. No communication is necessary if the match fails. To evaluate the usefulness of exploiting node parallelism, it is necessary to weigh the advantages of reducing t_max against the cost of increased communication and the associated limitations on feasible architectures.

[Figure 4-5: Node parallelism. Activations of the two-input nodes for the condition elements are placed on activation queues and evaluated by concurrent processes.]

Another advantage of using node parallelism is that some of the sharing lost in the Rete network when using production parallelism is recovered. If two productions need a node with the same functionality, it is possible to keep only one copy of the node and to evaluate it only once, since it is no longer necessary to have separate nodes for different productions. The gain due to the increased amount of sharing is a factor of 1.33, which is quite significant.

The extra speed-up available from node parallelism over that obtained from production parallelism is bounded by the number of two-input nodes present in a production. The reason for this is that the extra speed-up comes only from the parallel evaluation of nodes belonging to the same production. Since the average number of two-input nodes (one less than the average number of condition elements) in the production systems considered in this thesis is quite small, the maximum extra speed-up expected from node parallelism is also small. The results of simulations indicate that using node parallelism results in a nominal speed-up of 5.8-fold and a true speed-up of 2.9-fold. Thus it is possible to get about 1.50 times more true speed-up than could be obtained if production parallelism alone was used.28 The increase in speed-up is significantly lower than the number of two-input nodes per production (around 4), because most of the time all the two-input nodes associated with a production do not have to be evaluated.

The implementation considerations for node parallelism are very similar to those for production parallelism described in the previous subsection. However, since more communication is required between the parallel processes, shared-memory architectures are preferable. The size of the tasks when node parallelism is used is smaller than when production parallelism is used. Simulations indicate that the average time to process a two-input node activation is around 50-100 computer instructions. This number is significant in that it limits the amount of synchronization and scheduling overhead that can be tolerated in an implementation.

27. Note that in the context of node parallelism, the activation of a two-input node corresponds to the processing required by both the two-input node (the and-node or the not-node) and the associated memory node. Lumping the memory node together with the two-input node is necessary when using hash-table based memory nodes, and is discussed in detail later in Section 5.2.2.
28. This factor of 1.50 includes the factor of 1.33 that was gained because of reduced loss of sharing, as stated in the previous paragraph.
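With node parallelism, the unit of scheduling is a single two-input node activation. The sketch below, with hypothetical types and a hypothetical equality join test, shows what one such task does: a left activation of an and-node joins the incoming token against the node's right memory and emits one output token per consistent pair.

    #include <stdio.h>

    typedef struct { int attr1, attr2; } Token;   /* simplified one-WME token */

    typedef struct {
        Token right_mem[8];     /* tokens stored in the right memory node */
        int   right_count;
    } AndNode;

    /* Process one left activation.  The join test here is an assumed
       equality left.attr2 == right.attr1; each consistent pair would be
       sent on to the successor node. */
    static void left_activation(const AndNode *n, Token left) {
        for (int i = 0; i < n->right_count; i++)
            if (left.attr2 == n->right_mem[i].attr1)
                printf("emit token (%d,%d)+(%d,%d)\n",
                       left.attr1, left.attr2,
                       n->right_mem[i].attr1, n->right_mem[i].attr2);
    }

    int main(void) {
        AndNode n = { { {5, 9}, {9, 4} }, 2 };
        Token t = { 1, 9 };       /* attr2 = 9 matches right_mem[1].attr1 */
        left_activation(&n, t);   /* one emitted token; ~50-100 instructions */
        return 0;
    }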
4.2.3. Intra-Node Parallelism

The previous two subsections expressed the desirability of reducing the value of t_max, the maximum time taken by any affected production to finish the match phase. Looking at simulation traces of production systems using node parallelism, a major cause for the large value of t_max was found to be the cross-product effect. As shown in Figure 4-6,29 the cross-product effect refers to the case where a single token flowing into a two-input node finds a large number of tokens with consistent bindings in the opposite memory. This results in the generation of a large number of new tokens, all of which have to be processed by the successor node. Since node parallelism does not permit multiple activations of the same two-input node to be processed in parallel, they are processed sequentially and a large value of t_max results.

[Figure 4-6: The cross-product effect.]

Intra-node parallelism is designed to reduce the impact of the cross-product effect and some other problems that arise when multiple changes to working memory are processed in parallel. When intra-node parallelism is used, not only are activations of different two-input nodes evaluated in parallel (as in node parallelism), but multiple activations of the same two-input node are also evaluated in parallel.30 Because of its finer granularity, intra-node parallelism requires some extra synchronization over that required by node parallelism, but its impact is relatively insignificant. Simulations show that using intra-node parallelism results in a nominal speed-up of 7.6-fold and a true speed-up of 3.9-fold. Thus it is possible to get an extra factor of 1.30 over the speed-up that can be obtained from using node parallelism alone. This factor is larger when many changes are processed in parallel, as is discussed in the next subsection.

29. In Figure 4-6, the arrows represent the flow of tokens in the Rete network and the thick lines represent the network for the production.
30. Just as node parallelism subsumes production parallelism, intra-node parallelism subsumes node parallelism.
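Under intra-node parallelism, a cross-product join need not run as one sequential loop: each pair formed by the incoming token and a token in the opposite memory can become its own fine-grained task. A sketch using an OpenMP directive to mark the pairs as independent; the counts and the stand-in join test are illustrative, and a real implementation would enqueue the pairs on a task scheduler rather than rely on a compiler directive.

    #include <stdio.h>

    #define OPP 1000   /* a cross-product: 1000 tokens in the opposite memory */

    /* Consistency test for one (new token, stored token) pair.  Each call
       is one fine-grained task of roughly 50-100 instructions. */
    static int join_pair(int new_tok, int stored_tok) {
        return (new_tok % 7) == (stored_tok % 7);   /* stand-in join test */
    }

    int main(void) {
        int emitted = 0;
        /* With node parallelism alone this loop runs on one processor and
           dominates t_max; intra-node parallelism lets the pairs proceed
           concurrently. */
        #pragma omp parallel for reduction(+:emitted)
        for (int i = 0; i < OPP; i++)
            emitted += join_pair(3, i);
        printf("tokens emitted: %d\n", emitted);    /* 143 with these stand-ins */
        return 0;
    }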
4.2.4. Action Parallelism

Usually, when a production fires, it makes several changes to the working memory. Measurements show that the average number of changes made to the working memory per execution cycle is 7.34.31 Processing these changes concurrently, instead of sequentially, leads to increased speed-up from production, node, and intra-node parallelism.

The reasons for the increased speed-up from production parallelism when used with action parallelism are the following. In Section 4.2.1, it was observed that the speed-up available from production parallelism is proportional to the average number of affected productions. The set of productions that is affected as a result of processing many changes concurrently is the union of the sets of affected productions for the individual changes to the working memory. Since this combined set of affected productions is larger than that affected by any individual change, more speed-up can be obtained. For example, consider the case where a production firing results in two changes to working memory, such that change-1 affects productions p1, p2, and p3, and change-2 affects productions p4, p5, and p6. If change-1 and change-2 are processed sequentially, it is best to use three processors. Assuming that each affected production takes the same amount of processing time, each change takes one cycle and the total cost is two cycles. However, if change-1 and change-2 are processed concurrently, they can be processed in one cycle instead of two, using six processors. Simulations indicate that when multiple changes are processed in parallel, the average size of the affect sets goes up from 26.3 to 59.5 (a factor of 2.26), and the speed-up obtainable from production parallelism alone goes up by a factor of 1.5. Thus using both production and action parallelism results in a nominal speed-up of 7.6-fold, as compared to a nominal speed-up of 5.1-fold when only production parallelism is used. The extra speed-up is less than the average number of working-memory changes per cycle, because the sets of productions affected by the multiple changes are not distinct but have considerable overlap (see line-1 and line-4 of Table 3-21).

Analysis shows that often two successive changes to working memory affect two distinct condition elements of the same production, as a result causing two distinct two-input node activations. It is then possible, using node parallelism, to process these node activations in parallel, thus increasing the available parallelism. For example, consider the case where both change-1 and change-2 affect productions p1, p2, and p3. If the activations correspond to distinct two-input nodes, it is possible to process both changes in parallel, in one cycle instead of two. Simulations indicate that the use of action parallelism increases the speed-up obtainable from node parallelism alone by a factor of around 1.85, resulting in a nominal speed-up of 10.7-fold.

In a manner similar to node parallelism, when successive changes to working memory cause multiple activations of the same two-input node, it is possible, using intra-node parallelism, to process them in parallel. Simulations indicate that the use of action parallelism increases the speed-up obtainable from intra-node parallelism alone by a factor of around 2.54, resulting in a nominal speed-up of 19.3-fold. The average increase in performance for the OPS5 programs is a factor of 1.84, and that for the Soar programs is a factor of 3.30. The increase in speed-up is larger for Soar programs because, on average, 12.25 working-memory changes are processed in parallel for Soar programs, while only 2.44 changes are processed in parallel for OPS5 programs.

It is interesting to note that the factor by which the speed-up improves when using action parallelism increases as we go from production parallelism (factor of 1.50) to node parallelism (factor of 1.85) to intra-node parallelism (factor of 2.54). The reason for this is that node parallelism subsumes production parallelism and intra-node parallelism subsumes node parallelism. Thus node parallelism gets the extra speed-up from action parallelism that production parallelism can get. In addition, node parallelism gets extra speed-up from parallelism that production parallelism could not obtain, for example, when the multiple changes affect different two-input nodes belonging to the same production. The reasoning relating node parallelism and intra-node parallelism is similar.

31. The average number of changes that are processed in parallel for the four OPS5 traces is 2.44, and the average for the four Soar traces is 12.25. Note that the number of changes that may be processed in parallel for the Soar systems is the sum of the changes made by all the productions that fire in parallel.
4.3. Parallelism in Conflict-Resolution

The thesis does not evaluate the parallelism within the conflict-resolution phase in detail. This is partly because the conflict-resolution phase is not present in new production systems like Soar, and partly because the conflict-resolution phase is not expected to be a bottleneck in the near future. The reasons why conflict-resolution is not expected to become a bottleneck are:

• Current production-system interpreters spend only about 5% of their execution time on conflict-resolution. Thus the match step has to be speeded up considerably before conflict-resolution becomes a bottleneck.

• In the production, node, and intra-node parallelism discussed earlier, the match for the affected productions finishes at different times because of the variation in the processing required by the affected productions. Thus many changes to the conflict set are available to the conflict-resolution process while some productions are still performing match. Much of the conflict-resolution time can therefore be overlapped with the match time, reducing the chances of conflict-resolution becoming a bottleneck.

• If conflict-resolution does become a bottleneck, there are several strategies for avoiding it. For example, to begin the next execution cycle, it is not necessary to perform conflict-resolution for the current changes to completion. It is only necessary to compare each current change to the highest-priority production instantiation so far. Once the highest-priority instantiation is selected, the next execution cycle can begin. The complete sorting of the production instantiations can be overlapped with the match phase for the next cycle. Hardware priority queues provide another strategy. (A sketch of the running-maximum approach follows this list.)
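The running-maximum strategy amounts to keeping the best instantiation seen so far while match is still producing conflict-set changes; the full sort is deferred. A sketch with a hypothetical integer priority:

    #include <stdio.h>

    typedef struct { int id; int priority; } Instantiation;

    static Instantiation best = { -1, -1 };

    /* Called as each conflict-set insertion arrives from match.  Only one
       comparison is on the critical path; sorting the rest of the conflict
       set can be overlapped with the next cycle's match. */
    static void conflict_set_insert(Instantiation inst) {
        if (inst.priority > best.priority)
            best = inst;
    }

    int main(void) {
        Instantiation a = {1, 10}, b = {2, 30}, c = {3, 20};
        conflict_set_insert(a);
        conflict_set_insert(b);
        conflict_set_insert(c);
        printf("fire instantiation %d\n", best.id);   /* prints 2 */
        return 0;
    }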
4.4. Parallelism in RHS Evaluation

The RHS-evaluation step, like the conflict-resolution step, takes only about 5% of the total time for current production systems. When many productions are allowed to fire in parallel, as in Soar, it is quite straightforward to evaluate their right-hand sides in parallel. Even when the right-hand side of only a single production is to be evaluated, it is possible to overlap some of the input/output with the match for the next execution cycle. Also, when the right-hand side results in several changes to the working memory, the match phase can begin as soon as the first change to working memory is determined. For the above reasons the act step is not expected to be a bottleneck in speeding up the execution of production systems. The thesis does not evaluate the parallelism in RHS evaluation in any greater detail.

4.5. Application Parallelism

There is substantial speed-up to be gained from application parallelism, where a number of cooperating but loosely coupled production-system tasks execute in parallel [29, 87]. The cooperating tasks may arise in the context of search, where there are a number of paths to be explored, and it is possible to explore each of the paths in parallel (similar to or-parallelism in logic programs [91]). Alternatively, the cooperating tasks may arise in the context where there are a number of semi-independent tasks, all of which have to be performed, and they can be performed in parallel (similar to and-parallelism in logic programs). It is also possible to have cooperating tasks that have a producer-consumer relationship among them (similar to stream-parallelism in logic programs).

The maximum speed-up that can be obtained from application parallelism is equal to the number of cooperating tasks, which can be significant. Unfortunately, most current production systems do not exploit such parallelism, because (1) the production-system programs were expected to run on a uniprocessor, where no advantage is to be had from having several parallel tasks, and (2) current production-system languages do not provide the features to write multiple cooperating production tasks easily.

Although not currently exploited by OPS5 programs, it is possible to use a simple form of application parallelism in Soar programs. In Soar all problem-solving is done as heuristic search within a problem space, and Soar permits exploring several paths in the problem space concurrently. The use of application parallelism within the two Soar programs studied in this thesis increases the nominal speed-up obtained using intra-node and action parallelism from 17.9-fold to 30.4-fold, an extra factor of 1.7. It is interesting to note that to the implementor, the use of application parallelism in Soar appears simply as several productions firing in parallel. This results in a large number of working-memory changes that may be processed in parallel. No special mechanisms are required to make use of application parallelism, since the mechanisms developed for exploiting action parallelism suffice.

4.6. Summary

In summary, the following observations can be made about the parallelism in production systems:

• Contrary to initial expectations, the speed-up obtainable from parallelism is quite limited, of the order of a few tens rather than hundreds or thousands.

• The match step takes the most time in the recognize-act cycle, and for that reason the match needs to be speeded up most.

• The first important source of parallelism for the match step is production parallelism. Using production parallelism it is possible to get an average nominal speed-up of 5.1-fold and an average true speed-up of 1.9-fold. The speed-up is limited by the small number (approximately 26) of productions affected per change to working memory. The speed-up is further limited by the large variance in the amount of processing required by the affected productions (a factor of 5.10), by the loss of sharing in the Rete network (a factor of 1.63), and by the overheads of mapping the parallel algorithm onto a multiprocessor (a factor of 1.61).

• To reduce the variance in the processing requirements of the affected productions, it is necessary to exploit parallelism at a much finer granularity than production parallelism. The two schemes proposed for this are node parallelism and intra-node parallelism. Exploiting the parallelism at a finer granularity increases the communication requirements between the parallel processes, and restricts the class of suitable architectures to shared-memory multiprocessors.

• When using node parallelism, it is possible to process activations of distinct two-input nodes in parallel. This results in an average nominal speed-up of 5.8-fold and an average true speed-up of 2.8-fold (an extra factor of 1.50 over the speed-up that can be obtained by using production parallelism alone).

• Intra-node parallelism is even finer grain than node parallelism, and it permits the processing of multiple activations of the same two-input node in parallel.
This results in an average nominal speed-up of 7.6-fold and an average true speed-up of 3.9-fold (an extra factor of 1.30 over the speed-up that can be obtained by using node parallelism alone).

• Processing many changes to working memory in parallel (action parallelism) enhances the speed-up obtainable from production, node, and intra-node parallelism. The nominal speed-up obtainable from production parallelism increases to 7.6-fold (a factor of 1.50 over when action parallelism is not used), that from node parallelism increases to 10.7-fold (an extra factor of 1.85), and that from intra-node parallelism increases to 19.3-fold (an extra factor of 2.54).

• The conflict-resolution step and the RHS-evaluation step take only a small fraction (5% each) of the processing time required by the recognize-act cycle. Much of the processing required by the conflict-resolution step can be overlapped with the match step. These two steps are not expected to become a bottleneck in the near future.

• Significant speed-up can be obtained by letting several loosely coupled threads of computation proceed in parallel (application parallelism). Simulation results for two Soar systems show that the speed-up obtainable from intra-node parallelism increases by a factor of 1.7 when application parallelism is used.

4.7. Discussion

The results presented earlier in this chapter indicate the performance of only one model (that of OPS5-like production systems using a Rete-like match algorithm) for parallel interpretation of production systems. It is therefore essential to ask whether it is possible to change the parallel interpreter design -- or even the production systems being interpreted -- in such a way as to increase the speed-up obtainable from parallelism. Of course, one is not likely to be able to give universal answers to questions like this. It is surely the case that there are applications and associated implementation techniques that permit quite high degrees of parallelism to be used, and that there are other applications that do not permit much parallelism at all to be used. However, by examining the basic factors affecting the speed-up obtained from parallelism, one can develop fairly general evaluations about the speed-up that is obtainable, independent of the design decisions made in any particular parallel implementation.

The following paragraphs give reasons why the three main factors responsible for the limited speed-up, namely (1) the small number of productions affected per change to working memory, (2) the small number of changes made to working memory per cycle, and (3) the large variation in the processing requirements of the affected productions, are not likely to change significantly in the near future, and consequently, why it is not reasonable to expect significantly larger speed-ups from parallelism.

Let us first examine the reasons for the observations that the affect-sets (the sets of affected productions) are small and that the size of the affect-sets is independent of the total number of rules in the program (see Table 3-21). One possible way to explain these observations is to note that to perform most interesting tasks, the rule-base must contain knowledge about many different types of objects and many diverse situations. The number of rules associated with any specific object-type or any situation is expected to be small [58]. Since most working-memory elements describe aspects of only a single object or situation, clearly most working-memory elements cannot be of interest to more than a few of the rules.

Another way one might explain the small and independent size of the affect-sets is the conjecture that programmers recursively divide problems into subproblems when writing programs. The final size of the subproblems at the end of the recursive division (which is correlated with the number of productions associated with the subproblems) is independent of the size of the original problem and primarily depends on (1) the complexity of the subproblems, and (2) the complexity that the programmer can deal with at the same time (see [58] for a discussion of this hypothesis). Since, at any given time, the program execution corresponds to solving only one of these subproblems, the number of productions that are affected (relevant to the subproblem) is small and independent of the overall program size.32
Since most working-memory elements describe aspects of only a single object or situation, then clearly most working-memory elements cannot be of interest to more than a few of the rules. Another way that one might explain the small and independent size of the affect-sets is the conjecture that programmers recursively divide problems into subproblems when writing the programs. The final size of the subproblems at the end of the recursive division of problems into subproblems (which is correlated to the number of productions associated with the subproblems) is independent of the size of the original problem and primarily depends on (1) the complexity of the subproblems, and (2) the complexity that the programmer can deal with at the same time (see [58] for a discussion of this hypothesis). Since, at any given time, the program execution corresponds to solving only one of 60 PARALLEIJSM IN PRODUCrION SYSTEMS these subproblems, the number of productions that are affected (relevant to the subproblem) is small and independent of the overall program size.32 Yet another way to look at the size of the affect-sets is in terms of the organization of knowledge in programs. If the knowledge about a given situation is small (the number of associated rules are small), the affecrsets would also be small. If the amount of knowledge about the given situation is very large, it is possible that the affect-sets are large. However, whenever the amount of knowledge is large, we tend to structure it hierarchically or impose some other structure on the knowledge, so that it is easily comprehensible to us and so that it is easy to reason about [82]. For example, the structure of knowledge in classification tasks is not flat but usually hierarchical. Consequently, when clas- sifying an object we do it in several sequential steps, each with a small branching factor, rather than in one step with a very large branching factor. Thus if there was one rule associated with each branch of the decision tree, the total number of rules relevant at any node in the decision tree would be small. We now give reasons why the number of working-memory changes per recognize-act cycle is not likely to become significantly larger in future production-system programs. The reason for using production systems is to permit a style of programming in which substantial amounts of knowledge can affect each action that the program takes. If individual rules are permitted to do much more processing (which would correspond to making a large number of changes to working memory), then the advantages of this programming style begin to be lost. Knowledge is brought to bear only during the match phase of the cycle, and the less frequently match phases occur, the less chance other rules have to affect the outcome of the processing. Certainly there are many applications in which it is possible to perform substantial amounts of processing without stepping back and reevaluating the state of the system, but those are not the kinds of tasks for which one should choose the productionsystem paradigm. Alternatively, the argument may be made as follows. As stated in Chapter 1, an intelligent agent must perform knowledge search after each small step to avoid the combinatorial explosion resulting from uncontrolled problem-space search. 
Since most often, a small step in the problem space also corresponds to a small change in the state/environment of the agent, the number of changes made 32The above discussion only addresses the case where application parallelism is not exploited. In ease application parallelism is used, it is possible for a program to be working on several subproblems s_nultaneously, thus having a larger set of affected productions. Also note, it is not argued that the size of affect-sets will be the same in future systems as has been measured for existing systems. It is, of course, possible to construct systems that have more knowledge applicable to each given situation, thus increasing the number of affected productions by some small factor. It is, however, argued that the probability that the number of affected productions will increase by 50-fold, 100-fold, or more in the future is small. 61 PARALI.ELISM 1N PRODUC'IlON SYSTEMS between consecutive knowledge-search steps is expected to be small. It is possible to envision situations when there are local spurts in the number of changes made to the working memory per cycle (for example, when an intelligent agent returns from solving a subgoal, it may want to delete much of the local state associated with that subgoal [48]), but the average rate of change to working memory per cycle is expected to remain small. Before leaving this point, it should be observed that there is one way to increase the rate of working-memory turnover -- using parallelism in the production system itself. If a system has multiple threads, each one could be. performing only the usual small number of working-memory changes per cycle, but since there would be several threads, the total number of changes per cycle would be several times higher. Thus application-level parallelism will certainly help when it can be used. However, it may not be actually used in very many cases for two reasons: First, obviously, it can only be used in tasks that have parallel decompositions, and not all interesting tasks will. Second, using application-level parallelism places additional burdens on the developers of the applications. They must find the parallel decompositions and then implement them in such a way that the program is both correct and efficient. The final factor, the large variation in the processing required by the affected productions, may change somewhat because researchers are actively working on techniques to reduce this. Even here, however, it is not likely that much improvement is possible. The obvious way to handle the problem is to divide the match process into a large number of small tasks (for example, as done in going from production parallelism to node parallelism to intra-node parallelism). This is effective, but it cannot be carried too far because the amount of overhead time (for scheduling, for synchronization, etc.) goes up as we go to finer granularity and the number of processes increases. 62 PARALLELISM 1N PRODUCI'ION SYSTFdVIS PARALLEL IMPLEMENTATION OF PRODUCTIONSYSTEMS 63 Chapter Five Parallel Implementation of Production Systems In the previous chapter, we discussed the various sources of parallelism that may be exploited in the context of the Rete algorithm--production and action parallelism. parallelism, node parallelism, intra-node parallelism, We also observed that intra-node and action parallelism when combined together provided the most speed-up. 
This chapter discusses the hardware and software structures necessary for efficiently exploiting these sources of parallelism. Section 5.1 discusses the architecture of the multiprocessor proposed to execute the parallel version of the Rete algorithm. Section 5.2 presents the data representations and constraints necessary for the parallel processing of the state-update phase, that is, while processing activations of memory nodes, two-input nodes, and terminal nodes. Section 5.3 presents the data structures and constraints necessary for the parallel processing of the selection phase, that is, while processing activations of constant-test nodes. This section also discusses issues that arise at the boundary between the selection phase and the state-update phase.

5.1. Architecture of the Production-System Machine

This section describes the architecture of the production-system machine (PSM), the hardware structure suitable for executing the parallel version of the Rete algorithm. We begin with a description of the proposed machine (see Figure 5-1), and later provide justifications for each of the design decisions. The major characteristics of the machine are:

1. The production-system machine should be a shared-memory multiprocessor with about 32-64 processors.
2. The individual processors should be high-performance computers, each with a small amount of private memory and a cache.
3. The processors should be connected to the shared memory via one or more shared buses.
4. The multiprocessor should support a hardware task scheduler to help enqueue node activations that need to be evaluated on the task queue and to help assign pending node activations to idle processors.

[Figure 5-1: Architecture of the production-system machine.]

The first requirement for the proposed production-system machine (as listed above) is that it should be a shared-memory multiprocessor with 32-64 processors. The main reason for using a shared-memory architecture stems from the fact that, to achieve a high degree of speed-up from parallelism, the parallel Rete algorithm exploits parallelism at a very fine grain. For example, in the parallel Rete algorithm, multiple activations of the same node may be evaluated in parallel. This requires that multiple processors have access to the state corresponding to that node, which strongly suggests a shared-memory architecture. It is not possible to replicate the state, since keeping all copies of the state up to date is extremely expensive.

Another important reason for using a shared-memory architecture relates to the load-distribution problem. In case processors do not share memory, the processor on which the activations of a given node in the Rete network are evaluated must be decided at the time the network is loaded into the parallel machine. Since the number of node activations is much smaller than the total number of nodes in the Rete network [30], it is necessary to assign several nodes in the network to a single processor. This partitioning of nodes amongst the processors is a very difficult problem, and in its full generality is shown to be NP-complete [67]. Using a shared-memory architecture, the partitioning problem is bypassed, since all processors are capable of processing all node activations, and it is possible to assign processors to node activations at run-time.
The suggestion of 32-64 processors for the multiprocessor is derived from measurements and simulations done for many large production-system programs [39, 43, 55, 56]. Because of the small number of productions affected per change to working memory, and because of the variance in the processing required by the affected productions, simulations for most production-system programs show that using more than 32 processors does not yield any additional speed-up. There are some production systems that can use up to 64 processors, but the number of such systems is small. The fact that only about 32 processors are needed is also consonant with our use of a shared-memory architecture -- it is quite difficult to build shared-memory multiprocessors with a very large number of processors. In case it does become useful to use a larger number of processors (say 256) for some programs, the use of hierarchical multiprocessors is proposed for the parallel Rete algorithm, where the latency for accessing information in another multiprocessor cluster is longer than the latency for accessing information in the local cluster. The reasons for using eight clusters of 32-processor multiprocessors instead of a single 256-processor multicomputer are: (1) it is easier to partition a production-system program into 8 parts than it is to partition it into 256 parts; (2) within each 32-processor multiprocessor it is possible to exploit very fine-grain parallelism (for example, intra-node parallelism) to reduce the maximum time taken by any single production to finish match.

The second requirement for the proposed production-system machine is that the individual processors should be high-performance computers, each with a cache and a small amount of private memory. Since simulations show that the number of processors required for the proposed production-system machine is small, there is no reason not to use the processors with the highest performance -- processors having wide datapaths and using the fastest technology available.33 It is interesting to note that the code sequences used to execute production-system programs do not include complex instructions. The instructions used most often are simple loads, compares, and branches without any complex addressing modes (see Appendix B and [22, 69]). As a result, reduced instruction set computers (RISCs) [36, 68, 70] form good candidates for the individual processors in the production-system machine.

The reason for associating a small amount of private memory with each processor is to reduce the amount of traffic to the shared memory, since the traffic to shared memory results in both bus contention and memory contention. The data structures stored in the private memory would be those that do not change very often, or those that change at well-defined points, an example being the data structure storing the contents of the working memory. It is possible to replicate such data structures in the private memories of processors and to keep them updated, thus saving repeated references to shared memory. Some of the frequently used code to process node activations can also be replicated in the private memories of the processors.

33. The suggestion for using a small number (32-64) of high-performance processors may be contrasted with suggestions for using a very large number (1,000-100,000) of weak processors, as originally suggested for DADO and NON-VON [79, 84]. While the schemes using a small number of processors can use expensive very high performance processors, schemes using a very large number of processors cannot afford to have fast processors for each of the processing nodes. In [31] we show that the performance lost as a result of weak individual processing nodes is difficult to recover by simply using a large number of them.
The third requirement for the proposed production-system machine is that the processors should be connected to the shared memory via shared buses. The reasons for suggesting a shared-bus scheme, instead of a crossbar switch or other communication networks (such as an Omega network or a Shuffle Exchange network [9]), are: (1) it is much easier to construct multi-cache coherency solutions when shared buses are used [76], and (2) simulation results show that a single high-speed bus should be able to handle the load put on it by about 32 processors, provided reasonable cache-hit ratios are obtained (see Section 8.8 for assumptions).

The reason why the caches must be able to hold shared data objects is that a large number of memory accesses in the parallel Rete algorithm are expected to be to such shared objects. If the processor suffers a cache miss for all shared-memory references, the performance penalty will be significant. When shared objects can be stored in the cache, it is also possible to implement short synchronization structures, like spin-locks [53], very efficiently, with the hardware arranged so that the processor loops out of the cache when the synchronization structure is busy, thus causing no additional bus traffic.

The fourth and final requirement for the proposed production-system machine is that it should be able to support a hardware task scheduler, that is, a hardware mechanism to enqueue node activations into the task queue and to help assign node activations to idle processors. The hardware task scheduler is needed because it is often necessary to schedule several node activations for the same node in parallel. In order to do this, several tests must be made before scheduling each activation, to ensure that it cannot interfere with other activations that are being processed at that time. While the amount of checking that must be done is small, the scheduling must be done serially, and it is therefore of great importance to the efficiency of the production-system machine. The hardware task scheduler is expected to sit on the shared bus, and the time to schedule an activation using such a scheduler is expected to be one bus cycle. The details of the necessity and structure of such a scheduler are given in Chapter 6.
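The remark about spin-locks refers to what is now called the test-and-test-and-set idiom: while the lock is held, waiters spin on a cached read and generate no bus traffic; only the atomic exchange touches the bus. A sketch using C11 atomics (the thesis predates C11; this is a modern rendering of the same idea):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool lock_flag = false;

    /* Test-and-test-and-set: the inner loop reads the (cached) flag and
       causes no bus traffic while the lock is busy; the atomic exchange
       is attempted only once the flag has been observed clear. */
    static void spin_lock(void) {
        for (;;) {
            while (atomic_load_explicit(&lock_flag, memory_order_relaxed))
                ;   /* spin locally in the cache */
            if (!atomic_exchange_explicit(&lock_flag, true, memory_order_acquire))
                return;
        }
    }

    static void spin_unlock(void) {
        atomic_store_explicit(&lock_flag, false, memory_order_release);
    }

    int main(void) { spin_lock(); spin_unlock(); return 0; }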
5.2. The State-Update Phase Processing

This section discusses various issues regarding the parallel implementation of the state-update phase of the Rete algorithm on a multiprocessor.

5.2.1. Hash-Table Based vs. List Based Memory Nodes

Line 2 in Tables 3-15 and 3-16 (see Chapter 3) gives the average number of tokens found when a memory node is activated. Similarly, Line 3 in Tables 3-17 and 3-18 gives the number of tokens in the opposite memory when a two-input node is activated. The significance of these numbers is that they indicate the complexity of processing memory nodes and two-input nodes respectively. Existing OPS5 and Soar interpreters store the contents of the memory nodes as a linear list of tokens. Thus when a token with a "-" tag (a deletion) arrives at a memory node, a corresponding token must be found and deleted from the memory node. If a linear search is done, then on average, half of the tokens in that memory node will be looked up. Similarly, for an activation of a two-input node, all tokens in the opposite memory must be looked up to find the set of matching tokens.

It is proposed that, instead of storing the tokens of a memory node as a linear list, it is better to store them in a hash table. The hash function used for storing the tokens corresponding to a memory node is based on (1) the tests associated with the two-input node below the memory node, and (2) the distinct node-id associated with the two-input node. (Recall that memory nodes always feed into some two-input node, and that two-input nodes have tests associated with them to determine those sets of tokens that have consistent variable bindings.) For example, consider the Rete network shown in Figure 5-2. The hash function used for a token entering the left memory node is based on the value of the attr2 field of the associated working-memory element and the node-id of the and-node below. The hash function for a token entering the right memory node is based on the value of the attr1 field and the node-id of the and-node below.

[Figure 5-2: A production and the associated Rete network. The two condition elements feed the memory nodes mem-1 and mem-2, which feed the and-node and-1 with the test left:attr2 = right:attr1; and-1 feeds the terminal node term-p1.]

There are two main advantages of storing tokens in a hash table. First, the cost of deleting a token from a memory node is now a constant, instead of being proportional to half the number of tokens in that memory node. Similarly, the cost of finding matching tokens in the opposite memory is now proportional to the number of successful matches, instead of being proportional to the number of tokens in the opposite memory.34 Second, hashing cuts down the variance in the processing time required by the various memory-node and two-input node activations, which is especially important for parallel implementations. The main disadvantage of using hashing is the overhead of computing the value of the hash function for each node activation. However, because hashing reduces the variance, even if the cost with hashing is greater than the cost when linear lists are used, hash-table based memory nodes may still be advantageous for parallel implementations.

As far as the implementation of the hash-table based node memories is concerned, there are two options. The hash table may be shared between all the memory nodes in the Rete network, or there may be a separate hash table for each memory node. Since a large fraction of memory nodes do not have any tokens at all (or have very few tokens), it would be a waste of space to have a separate hash table for each node.35 Also, since there is a large variation in the number of tokens present in the various memory nodes, a hash table of a single size for each memory node would not be appropriate. In the implementation suggested in this thesis, two hash tables are used for all memory nodes in the network: one hash table for all memory nodes that form the left input of a two-input node, and another hash table for all memory nodes that form the right input of a two-input node.

34. However, in the case of two-input nodes with no equality tests, hashing does not provide any discriminating effect. The processing times in such cases are the same as when linear lists are used. Fortunately, the number of such nodes is quite small.
35. One reason why a large fraction of memory nodes have no tokens or very few tokens is that, as in programs in many other programming languages, only a small fraction of the total productions are responsible for most of the action in a run.
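A sketch of a hash-table based memory node, assuming the single equality join test of the example above: the bucket index is computed from the field tested by the two-input node below plus that node's id, so tokens that could ever match land in the same bucket. The table size and hash function are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    #define BUCKETS 256

    typedef struct TokenCell {
        int attr_value;           /* value of the field used in the join test */
        struct TokenCell *next;
    } TokenCell;

    static TokenCell *left_table[BUCKETS];   /* shared by all left memories */

    /* Hash on the join-test value and the two-input node's id, so a token
       entering the opposite memory probes exactly one bucket. */
    static unsigned bucket(int value, int node_id) {
        return ((unsigned)value * 31u + (unsigned)node_id) % BUCKETS;
    }

    static void insert_token(int value, int node_id) {
        TokenCell *c = malloc(sizeof *c);
        unsigned b = bucket(value, node_id);
        c->attr_value = value;
        c->next = left_table[b];
        left_table[b] = c;
    }

    /* Deleting is a constant-expected-time bucket walk, not a scan of the
       whole memory node. */
    static void delete_token(int value, int node_id) {
        TokenCell **p = &left_table[bucket(value, node_id)];
        for (; *p; p = &(*p)->next)
            if ((*p)->attr_value == value) {
                TokenCell *dead = *p;
                *p = dead->next;
                free(dead);
                return;
            }
    }

    int main(void) {
        insert_token(12, 40);         /* token with attr2 = 12 at node 40 */
        delete_token(12, 40);
        puts("insert and delete each touched one bucket");
        return 0;
    }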
5.2.2. Memory Nodes Need to be Lumped with Two-Input Nodes

Uniprocessor implementations of the Rete algorithm save significant processing by sharing nodes between similar productions (see Table 3-13). This section discusses reasons why, in the parallel implementation proposed in this thesis, it is not possible to share a memory node between multiple two-input nodes.36 Simulations show that the loss of such sharing increases the total cost of performing match by about 25%.

The main problem with the straightforward sharing of memory nodes is that it violates some of the assumptions made by the Rete algorithm. The Rete algorithm assumes that (1) the processing of the memory node and the associated two-input node is an atomic operation, and (2) while the two-input node is being processed, the contents of the opposite memory do not change. The problem can be explained with the help of the Rete network shown in Figure 5-3. The network shows a memory node shared between the two condition elements of a production. When working-memory element wme-1 (shown at the bottom-left of Figure 5-3) is added to the working memory, it passes the constant test "Class = C1", and is then added to the memory node. This causes both a left activation and a right activation of the and-node. When the left activation is processed, the and-node finds a matching token in the right memory node (it is the same as the left memory node), and outputs a token to the terminal node. Similarly, when the right activation of the and-node is processed, another token is output to the terminal node. This is incorrect, because if the memory nodes were not shared, only one token would have been sent to the terminal node.

[Figure 5-3: Problems with memory-node sharing. The production (p p1 (C1 ^attr1 <x> ^attr2 <y>) ... --> (Remove 1)) has both condition elements, of class C1, fed by a single memory node below the constant test Class = C1; the working-memory element wme-1: (C1 ^attr1 7 ^attr2 7) causes both a left and a right activation of the and-node.]

The problem as described above would occur in both sequential and parallel implementations. There are techniques, however, that permit the problem to be avoided for sequential implementations, but they do not work well for the parallel case. For example, in current uniprocessor implementations of Rete, a memory node keeps two lists of successor nodes -- the left successors and the right successors. The left successors correspond to those two-input nodes for which this memory node forms the left input, and the right successors correspond to those two-input nodes for which the memory node forms the right input. The uniprocessor algorithm processes all the left successors (including the activations caused by the processing of the immediate left successors) before actually adding the new token into the memory node (a pointer to the token to be added is passed as a parameter); then the token is added to the memory node, and then all the right successors are processed. Thus for the network shown in Figure 5-3, no token is sent to the terminal node for the left activation of the and-node, and one token is sent for the right activation, as desired.

36. The run-time data presented in Table 3-20 shows that the number of memory-node activations is about a third of the two-input node activations. This indicates that, on average at run-time, each memory node is shared between three two-input nodes.
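The uniprocessor fix can be stated in a few lines: the left successors run before the token is physically in the memory node, and the right successors run after. A sketch with hypothetical stubs:

    #include <stdio.h>

    typedef struct { int attr1, attr2; } Token;

    static void process_left_successors(const Token *t)  { (void)t; /* joins do not yet see t */ }
    static void add_to_memory_node(const Token *t)       { (void)t; puts("token stored"); }
    static void process_right_successors(const Token *t) { (void)t; /* joins now see t */ }

    /* Sequential Rete's ordering for a shared memory node: because the
       left successors run before the token is stored, a self-join (as in
       Figure 5-3) sees the token only once, from the right activation. */
    static void memory_node_insert(Token t) {
        process_left_successors(&t);
        add_to_memory_node(&t);
        process_right_successors(&t);
    }

    int main(void) {
        Token t = { 7, 7 };
        memory_node_insert(t);
        return 0;
    }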
The uniprocessor algorithm processes all the left successors (including the activations caused by the processing of the immediate left successors) before actually adding the new token into the memory node (a pointer to the token to be added is passed as a parameter); then the token is added to the memory node, and then all the right successors are processed. Thus for the network shown in Figure 5-3, no token is sent to the terminal node for the left activation of the and-node, and one token is sent for the right activation, as desired.

The above scheme is not suitable for parallel implementations because it is too constraining. It requires that all left successors (including all successors of the immediate left successors) be processed before the right successors are processed, and this defeats the whole purpose of the parallel implementation. Other more complex schemes, for example, associating a marker array with each newly added token that keeps track of the successor two-input nodes that have actually seen and processed the token added to the memory node, impose too much overhead to be useful. In some simple-minded parallel implementations, sharing of memory nodes can also cause deadlocks, for example, when two processes try to ensure that the shared memory node does not get modified while the two-input nodes are being processed.

For the implementation proposed in this thesis, there is one other reason why memory nodes need to be lumped with the two-input nodes: the use of hash-table based memory nodes. As suggested in the previous section, to enable finding matching tokens in the opposite memory efficiently, the hash function uses the values of the tests present in the associated two-input node. Thus if the tests associated with the successor two-input nodes are different, then it is not possible to share the memory nodes feeding into those two-input nodes.

In all subsequent sections of this thesis, it is assumed that memory nodes are not shared (that is, there are two separate memory nodes for each two-input node), and that the processing associated with a two-input node refers to the processing required by the and-node or the not-node and the associated memory nodes. However, note that it is still possible to share two-input nodes (and-nodes and not-nodes) in the Rete network, and if the two-input nodes are shared then the associated memory nodes are automatically shared too.

5.2.3. Problems with Processing Conjugate Pairs of Tokens

Sometimes when performing match for a single change or multiple changes to working memory, it is possible that the same token is first added to and then deleted from a memory node. Such a pair of tokens is called a conjugate pair of tokens [17]. For example, consider the productions and the corresponding Rete network shown in Figure 2-2 (see Chapter 2). Let the initial condition of the network correspond to the state when working-memory elements wme1 and wme2 (shown at the bottom-left of Figure 2-2) have been added to the working memory. Now consider a production firing that inserts wme3 into working memory and deletes wme1 from working memory. If wme3 is processed before wme1, the token t-w1w3 will first be added to the memory node at the bottom-left of the Rete network, and subsequently, when the deletion of wme1 is processed, the token t-w1w3 will be deleted from the memory node.
Although conjugate pairs are not generated very often, their occurrence poses problems for parallel implementations, as explained below.

Consider the case when the insertion of wme3 and the deletion of wme1 are processed concurrently. In this case, as before, requests for both the addition of t-w1w3 and the deletion of t-w1w3 to the memory node are generated. Now it is quite possible that the scheduler used in the parallel implementation assigns the processing of the deletion of t-w1w3 to a processor before it assigns the processing of the addition of t-w1w3. 37 When the delete t-w1w3 request is processed, the token to be deleted will not be found, since it has not been added so far. 38 There are two alternatives. First, the request for the deletion of the token can simply be aborted, and no more action taken. Second, the fact that the token to be deleted was not found can be recorded somewhere. This way, when the add t-w1w3 request is processed, it is possible to determine that an extra delete operation had been performed on the memory node, and appropriate action can be taken. The first alternative suggested above is not reasonable, since it would lead to incorrect results from the match. The second alternative is what this thesis proposes. In the proposed implementation, extra tokens from early deletes are stored in a special extra-deletes-list associated with each two-input node. Whenever a token is to be inserted into a memory node, the extra-deletes-list associated with the corresponding two-input node is first checked to see if the token to be inserted cancels out an early delete. Most of the time this list is expected to be empty, so the overhead of such a check should be small. (A sketch of this bookkeeping appears below.)

37 There are a number of reasons why this may happen. For example, although the request for the addition of t-w1w3 is generated before the request for the deletion, because of arbitrary delays in the communication process, the request for the deletion may reach the scheduler before the request for addition. Also, there is no simple way for the scheduler to realize that the delete request it has just received is part of a conjugate pair.

38 Note that there are no such problems in uniprocessor implementations of the Rete algorithm. The reason is that, in uniprocessor implementations, it is possible to ensure that the sequence in which the requests for insertions and deletions of tokens are processed is the same as the sequence in which these requests were generated.
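The following minimal sketch illustrates the cancellation logic just described; the names (MemoryNode, extra_deletes, and so on) are hypothetical, and the structure is a sketch of the idea rather than the thesis's actual code.

    # Hypothetical sketch of the extra-deletes-list.  An early delete that
    # finds no matching token is parked on the list; a later insert first
    # tries to cancel against it.

    class MemoryNode:
        def __init__(self):
            self.tokens = []          # stands in for the hash-table bucket(s)
            self.extra_deletes = []   # deletes that arrived before their adds

        def delete_token(self, token):
            if token in self.tokens:
                self.tokens.remove(token)
            else:
                # Conjugate pair processed out of order: remember the delete.
                self.extra_deletes.append(token)

        def insert_token(self, token):
            if token in self.extra_deletes:
                # The insert cancels a previously recorded early delete, so
                # the net effect is that the token never appears.
                self.extra_deletes.remove(token)
            else:
                self.tokens.append(token)

Because conjugate pairs are rare, the extra_deletes list is almost always empty, so the extra membership test on each insert is cheap.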
5.2.4. Concurrently Processable Activations of Two-Input Nodes

In Section 4.2.3 on intra-node parallelism, it was proposed that multiple activations of the same two-input node should be evaluated in parallel to obtain additional speed-up. This section discusses some of the restrictions that need to be put on processing multiple activations of the same node in parallel. These restrictions are necessary to reduce the amount of synchronization needed to ensure correct operation of the algorithm.

Figure 5-4 shows the various kinds of situations that may arise when multiple activations of the same two-input node occur. For example, case-1 in the figure refers to multiple insert-requests from the left side, case-4 refers to multiple delete-requests from the right side, case-5 refers to both insert and delete requests from the left side, and case-7 refers to multiple insert-requests, some of which are from the left side of the two-input node and some of which are from the right side. We propose that only the multiple activations depicted in cases 1-6 should be processed in parallel, and that multiple activations depicted in cases 7-10 (or their combinations) should not be processed in parallel. To justify these restrictions, the various cases are divided into three groups. Cases 1-4 are grouped together, and represent the case when multiple inserts or multiple deletes from the same side are to be processed. Cases 5-6 are grouped together, and represent the case when both insert and delete requests from the same side are to be processed. Cases 7-10 are grouped together, and represent the case when activations from both sides are to be processed concurrently.

[Figure 5-4: Concurrent activations of two-input nodes. Cases 1-4 show multiple inserts or multiple deletes from one side; cases 5-6 show mixed inserts and deletes from one side; cases 7-10 show activations arriving on both sides.]

The reason for not processing cases 7-10 in parallel is related to the assumption of the Rete algorithm that, while a two-input node is being processed, the opposite memory should not be modified. The effect of violating this assumption was also shown in Section 5.2.2. To illustrate the problems with processing such activations in parallel, consider case-7, where an insert request from the left and an insert request from the right are processed concurrently. Also assume that the tokens corresponding to these requests satisfy the tests associated with the two-input node. It is possible to have the following sequence of operations: (1) the token corresponding to the left insert-request is added to the left memory-node; (2) the token corresponding to the right insert-request is added to the right memory-node; (3) the left activation of the two-input node finds the newly inserted right-token and results in the generation of a successor token; (4) similarly, the right activation of the two-input node results in the generation of another successor token. This is incorrect, since only one successor token should have been generated. The cost of detection and deletion of duplicates at the successor nodes is too expensive a solution. Similarly, there is no simple way of ensuring that the relevant portion of the opposite memory does not get modified while the two-input node is being processed. 39 The reasons for not permitting cases 8-10 to be processed in parallel are along the same lines as above.

Whether cases 5-6 are permitted to be processed in parallel depends on some subtle implementation decisions. For example, when the extra-deletes-list is associated with each two-input node in the Rete network (as proposed in Section 5.2.3), it is not possible to process activations corresponding to cases 5-6 in parallel. The reasons are related to the conjugate-pair problem discussed in the previous subsection. Consider the case when both an insert request and a delete request for a token are to be processed. It is possible that the following sequence of operations takes place: (1) The delete request begins first: it locks the memory node (the lock associated with the appropriate bucket of the hash table storing the tokens), it does not find the token to be deleted, and it releases the lock on the memory node. It then attempts to get a second lock, the lock necessary to insert the extra delete into the special extra-deletes-list associated with the corresponding two-input node.
(2) Before the delete request can get hold of the second lock, the insert request gets hold of that lock to check if any extra deletes have been done. It finds no extra deletes, releases this lock, and then goes on to insert the token into the memory node. (3) The delete request gets hold of the second lock, and inserts the token into the extra-deletes-list. The result of the above sequence is obviously incorrect, since the correct result should have been no token in the memory node and no token in the extra-deletes-list.

A solution to the above problem, so that activations corresponding to cases 5-6 may be processed in parallel, is to use only a single lock both to check/change the contents of the memory node and to check/change the contents of the extra-deletes-list. This can be achieved by associating an extra-deletes-list with each bucket of the hash table and using a common lock, rather than associating an extra-deletes-list with each two-input node (as proposed in Section 5.2.3). Because of the late discovery of this solution, the simulation results presented in Chapter 8 represent the case when node activations corresponding to cases 5-6 are not processed in parallel. Some simulations done to test the new solution show an overall performance improvement of about 5%. The increase in performance is not very significant because multiple insert and delete requests for a two-input node from the same side are rare.

The reason why cases 1-4 can be processed easily in parallel is that, while the multiple activations of the two-input node are being processed, the opposite memory stays stable and does not change, unlike in cases 7-10. There is also no potential for race conditions, where the same token is being inserted into and deleted from a memory node at the same time, as in cases 5-6. The parallel processing of cases 1-4 yields the largest increase in speed-up, as it eliminates the cross-product bottleneck mentioned in Section 4.2.3. (A sketch of the resulting concurrency rule appears below.)

39 In the current implementation, the relevant portion of the opposite memory is identified by the tokens in the corresponding bucket of the opposite hash table. Thus if it is ensured that this specific opposite bucket is not being modified, the rest of the opposite memory may be modified in any way. Although this idea has not been developed any further, it should be possible to work out a solution along these lines, so that multiple activations of the same two-input node from different directions can be processed in parallel.
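The rule can be stated compactly as a predicate over pending activations. The sketch below is hypothetical (the thesis implements this check inside the scheduler, not as a standalone function) and reflects the restrictions just described: activations of different nodes never conflict, same-side activations of one node may proceed together, and opposite-side activations may not. The conservative variant also serializes mixed insert/delete requests from the same side, matching the simulations in Chapter 8.

    # Hypothetical predicate: may two pending node activations be processed
    # concurrently?  An activation is (node_id, side, op), where side is
    # "left"/"right" and op is "insert"/"delete".

    def concurrently_processable(a, b, single_lock_per_bucket=False):
        nid_a, side_a, op_a = a
        nid_b, side_b, op_b = b
        if nid_a != nid_b:
            return True      # activations of different nodes never interact
        if side_a != side_b:
            return False     # cases 7-10: opposite sides, never in parallel
        if op_a == op_b:
            return True      # cases 1-4: same side, both inserts or deletes
        # Cases 5-6: mixed insert/delete from the same side is safe only when
        # a single lock guards both the bucket and its extra-deletes-list.
        return single_lock_per_bucket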
5.2.5. Locks for Memory Nodes

As proposed in Section 5.2.1, the tokens associated with all memory nodes in the Rete network are stored in two global hash tables: one for the tokens belonging to the left memory-nodes and one for tokens belonging to the right memory-nodes. Since multiple node activations may be processed in parallel, it is necessary to control access to the individual buckets of the hash tables. It is proposed that there should be a lock associated with each bucket of the two hash tables. Furthermore, the locks should be of the multiple-reader/single-writer type; that is, the lock associated with a bucket should permit multiple readers at the same time, but it should permit only a single writer at a time, and it should exclude readers and writers from entering the bucket at the same time.

The use of the read and write locks is expected to be as follows. For the left activation of a two-input node, a write-lock is used to insert/delete the token from the chosen bucket in the left hash table. The read-lock is used to look at the contents of the corresponding bucket in the right hash table to find tokens in the right memory node that have consistent variable bindings with the left token. Thus a multiple-reader/single-writer lock permits several node activations that wish to look at a bucket in read-mode to proceed in parallel. Such a scheme is expected to help in handling the hash-table accesses generated during the processing of cross-products.

5.2.6. Linear vs. Binary Rete Networks

In Section 2.4.2 it was pointed out that there is a large variability possible in the amount of state that a match algorithm stores. For example, on the low side, the TREAT algorithm [60] stores information only about matches between individual condition elements and individual working-memory elements. On the high side, the algorithm proposed by Kemal Oflazer [67] stores information about matches between all possible combinations of condition elements that occur in the left-hand side of a production and the sequences of working-memory elements that satisfy them. The state stored by the Rete class of algorithms falls in between the above two schemes. The Rete class of algorithms, in addition to storing information about matches between individual condition elements and working-memory elements, stores information about matches between some fixed combinations of condition elements occurring in the left-hand sides of productions and working-memory element tuples. The combinations for which state is stored are decided at the time when the network is compiled. This section discusses some of the factors that influence the selection of the combinations of condition elements for which state is stored.

To study the state stored for a production by the standard Rete algorithm, 40 the algorithm used in existing OPS5 interpreters, consider the following production with k condition elements:

    C1 & C2 & ... & Ck → A1, A2, ..., Ar

40 Also called the linear Rete, because of the linear chain of two-input nodes constructed by it.

The state stored consists of working-memory elements matching the individual condition elements (stored in the α-memory nodes of the Rete network), working-memory element pairs jointly matching the condition elements C1 & C2, working-memory element triples matching C1 & C2 & C3, and so on (stored in the β-memory nodes of the Rete network), and finally working-memory element k-tuples matching C1 & C2 & ... & Ck (stored in the conflict-set). The algorithm does not store state for any other combinations of condition elements; for example, working-memory element pairs matching the combination C2 & C3 are neither computed nor stored.

To make the discussion of the advantages and disadvantages of the scheme used by the standard Rete algorithm easier, it helps to consider the state of a production in terms of relational database concepts [90]. In relational database terminology, the sets of working-memory elements matching the individual condition elements C1, ..., Ck can be considered as relations R1, ..., Rk. The relation specifying the working-memory element k-tuples matching the complete production is denoted by R1 ⊗ R2 ⊗ ... ⊗ Rk, and is computed by the join of the relations R1, ..., Rk, where the join conditions correspond to those tests that ensure the consistency of variable bindings between the various condition elements. The state stored by the standard Rete algorithm corresponds to the tuples in the relations:

• R1, ..., Rk. This is the state stored in the α-memory nodes of the Rete network.

• R1 ⊗ R2, R1 ⊗ R2 ⊗ R3, ..., R1 ⊗ R2 ⊗ ... ⊗ Rk. This is the state stored in the β-memory nodes of the Rete network and the conflict-set.
The processing required when a working-memory element is either inserted or deleted corresponds to keeping all these relations updated. In this process, the algorithm makes use of the smaller joins that have been computed earlier to compute the larger joins. For example, to compute the join R1 ⊗ R2 ⊗ R3, the algorithm uses the join R1 ⊗ R2 and the relation R3 that have been computed earlier. Similarly, the addition of a new working-memory element to the relation Rk requires that the relation R1 ⊗ R2 ⊗ ... ⊗ Rk be updated. This is done by computing the join (R1 ⊗ R2 ⊗ ... ⊗ Rk-1) ⊗ {new-wme}, where the join of R1, ..., Rk-1 already exists (a toy sketch of this bookkeeping appears at the end of this discussion).

The goodness of the state-saving strategy used by a match algorithm is determined by the amount of work needed to compute the conflict-set, that is, the relation R1 ⊗ R2 ⊗ ... ⊗ Rk for all the productions, and of course, any other intermediate relations used in the process. 41 Empirical studies show that by the above criteria, the standard Rete algorithm does quite well. The work required to keep the relations R1 ⊗ R2, R1 ⊗ R2 ⊗ R3, ..., updated is quite small (as shown by the small number of β-memory node activations per change to working memory in Section 3.4.3). The reasons for this are:

• The way productions are currently written, the initial condition element of a production is very constraining, which makes the cardinality of the relations R1, R1 ⊗ R2, ..., quite small, which in turn implies that the processing required to keep the state updated is small. For example, the first condition element of a large fraction of productions is of the type context, and since normally only one context is active at any given time, the relation R1 is empty for all productions that do not belong to the active context, and they do not require any state processing for the β-memory nodes. 42

• Relations corresponding to a large number of condition elements, that is, relations of the form R1 ⊗ ... ⊗ Rp, where p is 3 or more, are naturally restrictive (do not grow too large) because of the large number of conditions involved and the associated tests. 43

41 This criterion of goodness is appropriate primarily for uniprocessor implementations.

42 In case the first condition element of a production is not very constraining, it is often possible to reorder the condition elements so that the first condition element is restrictive. Such transformations may either be done by machine [48] or by a person.

43 Again, the ordering of condition elements can help to reduce the amount of state that has to be updated on each cycle.
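Before turning to the parallel case, the prefix-join bookkeeping described above can be made concrete with a toy sketch. The names are ours, and the join tests are abstracted into a caller-supplied predicate; this illustrates the idea, not the thesis's implementation.

    # Toy sketch of linear-Rete state as prefix joins (illustrative only).
    # Each relation is a list of tuples of wmes; `consistent` stands in for
    # the variable-binding tests of the corresponding two-input node.

    def extend_prefix(prefix_join, new_relation, consistent):
        """Compute R1 x ... x Ri from the stored R1 x ... x R(i-1) and Ri."""
        return [combo + (wme,)
                for combo in prefix_join
                for wme in new_relation
                if consistent(combo, wme)]

    def add_wme_to_last_ce(prefix_k_minus_1, wme, consistent):
        """Adding a wme matching Ck only needs the stored join of
        R1 ... R(k-1): the new k-tuples are (R1 x ... x R(k-1)) x {wme}."""
        return [combo + (wme,) for combo in prefix_k_minus_1
                if consistent(combo, wme)]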
While the scheme used by the standard Rete algorithm works fine for uniprocessor implementations, some problems are present for parallel implementations. This is because the criteria for goodness for parallel implementations are slightly different from those for uniprocessor implementations. In a uniprocessor implementation the aim is to minimize the total processing required in the state-update process, and it is not important if the state-update phase for some productions takes much longer than that for other productions. In a parallel implementation, while it is important to keep the sum of the state-update costs for the affected productions down, it is equally important to reduce the variation in the costs of the affected productions. The standard Rete algorithm keeps the total cost of the state-update phase well under control, but it is not so good at keeping the variation down. Some of the reasons are discussed below.

Simulation studies for parallel implementations using the state-saving strategy of standard Rete show that a common reason for the large value of tmax/tavg is a long chain of dependent node activations; that is, often a node activation causes an activation of a successor node, which in turn causes an activation of its successor node, and so on. This is called the long-chain effect (see Figure 5-5), and it is especially important in programs that have productions with a large number of condition elements. It is possible to reduce the length of these long sequences of dependent node activations by shortening the maximum length of chains that is possible, that is, by changing the intermediate state that is computed by the algorithm and the way it is combined to get the final changes to the conflict-set.

[Figure 5-5: The long-chain effect. In a linear network over condition elements CE1, ..., CEk, the maximum length of a chain of dependent node activations is k.]

One way to reduce the length of the long chains is to construct the Rete network as a binary network (see Figure 5-6) instead of as a linear network. This way the maximum length of a chain of dependent node activations is cut down from k to ⌈log2 k⌉ + 1. In such a scheme the stored state corresponds to tuples in the relations:

• R1, ..., Rk,
• R1 ⊗ R2, R3 ⊗ R4, ..., Rk-1 ⊗ Rk,
• R1 ⊗ R2 ⊗ R3 ⊗ R4, ..., Rk-3 ⊗ Rk-2 ⊗ Rk-1 ⊗ Rk,
• and so on.

[Figure 5-6: A binary Rete network over condition elements CE1, ..., CEk; the maximum length of a chain of dependent node activations is ⌈log2 k⌉ + 1.]

Simulations for Soar programs show that the binary network scheme reduces the variation in the processing costs of productions significantly, thus increasing the number of processors that can be effectively used in a parallel implementation. For example, for the eight-puzzle program in Soar, using the binary network increases the speed-up from 9-fold to 15-fold. The speed-up obtained for normal OPS5 programs is not as good; in fact, in most cases the speed-up is reduced.

The reasons for the reduction in speed-up in OPS5 programs are the following. First, the average number of condition elements per production in OPS5 programs is quite small, 3.4 for OPS5 programs as compared to 6.8 for the Soar programs. Since the number of condition elements is small, the difference between the length of chains in the linear and the binary networks is not significant, and thus not much improvement can be expected. Second, the total cost of the state-update phase when using the binary network scheme is often much larger than the total cost when using the linear network scheme. As an example, consider the following production:

    (p free-all-hands
        (robot ↑hand1 <x> ↑hand2 <y> ↑hand3 <z>)
        (object ↑name <x>)
        (object ↑name <y>)
        (object ↑name <z>)
      -->
        (modify 1 ↑hand1 free ↑hand2 free ↑hand3 free))

The production consists of four condition elements, and the joins stored by the linear Rete are R1 ⊗ R2, (R1 ⊗ R2) ⊗ R3, (R1 ⊗ R2 ⊗ R3) ⊗ R4. The joins stored by the binary Rete are R1 ⊗ R2, R3 ⊗ R4, (R1 ⊗ R2) ⊗ (R3 ⊗ R4). Note that the relations in parentheses indicate the way a larger relation is computed.
Thus the relation (R1 ⊗ R2) ⊗ R3 implies that it is computed as the join of R1 ⊗ R2 and R3, and not as the join of R1 and R2 ⊗ R3.

Now consider the scenario where there is one working-memory element of type robot (corresponding to a single robot in the environment) and there are 100 working-memory elements of type object (corresponding to 100 different objects in the environment). In such a case the cardinality of the relations R1, R2, R3, and R4 will be 1, 100, 100, and 100 respectively. Then in the case of the linear network all β-memory nodes will have a cardinality of 1, that is, they will contain only one token. This is because of the constraining effect of the first condition element, and the fact that the variables x, y, and z are all bound by the time the join operation moves to the second, third, and fourth condition elements. In the case of the binary Rete network, however, the β-memory node corresponding to the relation R3 ⊗ R4 will have 10,000 tokens in it, since there are no common variables between the third and the fourth condition elements. So whenever a working-memory element of type object is inserted, about 100 tokens have to be added to that memory node, in contrast to only a single addition if the linear network is used. (A small sketch at the end of this section works through these numbers.)

As the above discussion shows, the use of a binary network can often result in a large number of node activations that would not have occurred in the linear network. 44 This increase in the basic cost of the state-update phase often offsets the extra advantage given by the smaller value of tmax/tavg found in binary networks. In fact, in all systems studied, there were at least a few productions where the state grew enormously when the binary networks were used. The networks for these few productions had to be changed back to the linear form (or some mixture of binary and linear forms) before the production-system programs could be run to completion.

As a result of the various studies and simulations done for this thesis, it appears that there is no single state-storage strategy that is good for all production systems. There is no reason, however, to have the restriction that all productions in a program should use the linear network, the binary network, or any other fixed network form. We propose that the compiler should be built so that the network form to be used with each production can be specified independently, and that this network form should be individually tailored to each production. The form of the network may be provided by the programmer at the time of writing a production. This strategy has the advantage that the programmer often has semantic information that permits him to determine the sizes of the various relations; information that a machine may not have. Alternatively, the form of the network may be provided by another program that uses various heuristics and analysis techniques [83]. Such a program may also use data gathered about relation sizes from earlier runs of the program to optimize the network form. 45

44 Note that it is also possible to construct examples where the state-update phase for the linear network is much more expensive than that for the binary network. However, in practice, such cases are not encountered as often.

45 The network compiler may also provide some default network forms that may be used with the productions, as is the case currently in standard Rete.
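A small computation confirms the numbers above. The sketch is hypothetical and simplifies the join tests to shared variables; it compares the β-memory sizes implied by the two network shapes for the free-all-hands production.

    # Illustrative check of beta-memory sizes for the free-all-hands example:
    # one robot wme binds <x>, <y>, <z>; 100 object wmes, each with a
    # distinct name matched by exactly one binding.

    n_robots, n_objects = 1, 100

    # Linear network: each prefix join is filtered by the variables already
    # bound by the robot wme, so each beta memory holds a single token.
    linear_beta_sizes = [
        n_robots * 1,         # R1 x R2: <x> already bound, one object matches
        n_robots * 1 * 1,     # (R1 x R2) x R3
        n_robots * 1 * 1 * 1  # (R1 x R2 x R3) x R4
    ]

    # Binary network: R3 x R4 joins two object relations that share no
    # variables, i.e., a full cross-product.
    binary_r3_r4 = n_objects * n_objects

    print(linear_beta_sizes)   # [1, 1, 1]
    print(binary_r3_r4)        # 10000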
5.3. The Selection Phase

Given a working-memory element, the selection phase identifies those condition elements that are satisfied by that working-memory element. The processing done during the selection phase primarily involves evaluating the constant-test nodes found in the upper part of the Rete network. This section discusses some of the issues that arise when the selection phase is performed on a multiprocessor.

5.3.1. Sharing of Constant-Test Nodes

A production-system program often has different productions to deal with variants of the same situation, and as a result the condition elements of these productions are also similar. The Rete network compiler shares constant-test nodes whenever such similar condition elements are compiled, and the uniprocessor implementations of the Rete algorithm rely greatly on the sharing of constant-test nodes to save processing time in the selection phase (see Section 3.3.2). For example, the constant-test nodes immediately below the root node consist of tests that check the type of the condition element, since that is the first field of all condition elements. For the VT production system [51], the total number of condition-element types is 48; thus with sharing only 48 constant-test nodes are needed at the top level, each checking for one of the 48 types. If no sharing is present, however, a separate node would be needed to check the type of each condition element, and since the system consists of approximately 4500 condition elements, that many nodes would be required at the top level.

In a non-shared-memory multicomputer implementation of production systems, where each production is allocated to a separate processor, it is not possible to exploit the sharing mentioned above. This is because each processor must independently determine which of the condition elements of the production allocated to it are satisfied. There is no reason, however, not to use such sharing in a shared-memory multiprocessor implementation of production systems. One of the consequences of the sharing of constant-test nodes is that often a constant-test node may have a large number of successors. These successors may either be other constant-test nodes or they may be α-memory nodes. Some implementation issues related to how these successors ought to be evaluated are discussed below.

5.3.2. Constant-Test Node Successors

Consider the cost of evaluating a constant-test node. The cost consists of: (1) in case the activation is picked up from the centralized task queue, the cost of removing the activation from the task queue and the cost of setting up the registers so that the processing may begin; (2) the cost of evaluating the test associated with the constant-test node; and (3) in case the associated test succeeds, the cost of pushing the successor nodes onto a local stack (for activations to be processed on the same processor) or onto the centralized task queue (for activations to be picked up by any idle processor). In the proposed implementation using a hardware task scheduler, the cost of the first step is about 10 instructions, the cost of the second step about 5-8 instructions, and the cost of the third step about 2 instructions for a local push and about 5 instructions for a global push.
Since the cost of evaluating the test associated with a node (the cost of step-2) is small compared to the costs of pushing and popping an activation from the task queue (the costs of step-1 and step-3), it is not advisable to schedule each constant-test node activation through the global task queue. Instead, it is suggested that only bunches of constant-test node activations should be scheduled through the global task queue, and that all nodes within a bunch should be processed on the same processor (thus saving the cost of step-1). One way to achieve this is to schedule node activations close to the top level through the global task scheduler, and have the activations in the subtrees of these globally scheduled nodes processed locally (see Figure 5-7). Another alternative is for the network compiler to use heuristics to decide which nodes are to be scheduled globally and which nodes are to be scheduled locally.

[Figure 5-7: Scheduling activations of constant-test nodes. Nodes near the top of the network are scheduled through the centralized task queue; the nodes in the subtrees below them are processed locally, on the same processor as their parent.]

It was observed in Section 3.4.1 that only about 12% of the constant-test node activations have their associated test satisfied. The reason for the large number of activations with failed tests is that the standard Rete algorithm does not use any indexing techniques (for example, hashing) to filter out node activations that are bound to fail their tests. For example, consider a node whose four successor nodes test if the value of some attribute is 1, 2, 3, or 4 respectively. The standard Rete algorithm would evaluate all four of these nodes, even though three of them are bound to fail.

The choice whether or not to use indexing is, however, not always so clear. This is because using a hash function imposes the overhead of computing the hash value, and because constant-test node activations are very cheap to evaluate (the OPS83 implementation evaluates a constant-test node in three machine instructions). Since the constant-test nodes are not as cheap to evaluate in a multiprocessor implementation, it is proposed that indexing be used in most places, especially in places where the branching factor is large. (A sketch of such successor indexing follows.)
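One way to realize such indexing, sketched below under assumed names (the thesis does not commit to a particular data structure): when all successors of a node test the same field for equality against distinct constants, the successors can be stored in a dictionary keyed by the constant, so that only the successors that can succeed are activated.

    # Hypothetical sketch of indexing the successors of a constant-test node.
    # Instead of activating every successor (most of which fail), the wme's
    # value for the tested field selects the successors that can match.

    def activate_successors_indexed(successor_index, wme, field):
        """successor_index maps a constant value -> successor node(s) that
        test `field` for equality with that value."""
        return successor_index.get(wme[field], [])

    # Example: four successors testing attr = 1, 2, 3, or 4.
    index = {1: ["node-a"], 2: ["node-b"], 3: ["node-c"], 4: ["node-d"]}
    wme = {"class": "C1", "attr": 3}
    print(activate_successors_indexed(index, wme, "attr"))   # ['node-c']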
5.3.3. Alpha-Memory Node Successors

At the interface between the selection phase and the state-update phase are the α-memory nodes (recall that the constant-test nodes feed directly into the α-memory nodes). Normally, when a constant-test node succeeds, the associated processor builds tokens for the activations of its successor α-memory nodes. These tokens are then given to the task scheduler so that they may be processed by other idle processors. A problem, however, occurs when a constant-test node having a large number (10-30) of α-memory successors is satisfied. Since preparing a token for the scheduler requires the execution of several instructions by the processor, around 5 in the proposed implementation (see Appendix B for details), it implies that by the time the last successor node is scheduled, 100 instructions' worth of time has already elapsed (assuming 20 successor nodes). The time taken by 100 instructions is quite large considering the fact that processing a constant-test node activation takes only around 5-20 instructions (only 3 instructions in the uniprocessor implementation) and that processing a two-input node activation takes only 50-100 instructions. (See Tables 8-3 and 8-4 for the relative costs of processing different node activations.)

For this reason, it is proposed that in case a constant-test node has a large number of α-memory successors, it is advisable to replicate that constant-test node, such that each of the replicated nodes has a small number of successors, as shown in Figure 5-8. Thus when the multiple copies of the original constant-test node are evaluated in parallel, the successors get scheduled in parallel. The number of times that a node ought to be replicated is a function of (1) the number of successor nodes, and (2) the relative costs of processing a constant-test node activation, of processing a two-input node activation, and of preparing and scheduling a successor node. An alternative to replicating the constant-test node is to use a new node type which does not perform the test made by the constant-test node but simply helps in scheduling the α-memory successors in parallel.

[Figure 5-8: Possible solution when a constant-test node has too many alpha-memory successors: the node is transformed into several copies, each with only a few α-memory successors.]

5.3.4. Processing Multiple Changes to Working Memory in Parallel

As in the case of the state-update phase, it is possible to process the selection phase for multiple changes to working memory in parallel. Since the constant-test nodes do not store any state, there are no restrictions placed on the constant-test node activations that may be evaluated in parallel.

5.4. Summary

In this chapter various software and hardware issues dealing with the parallel implementation of production systems have been studied. The discussion may be summarized as follows:

• The hardware architecture suitable for implementing production systems is a shared-memory multiprocessor consisting of 32-64 high-performance processors. Each processor should have a small amount of private memory and a cache that is capable of storing both private and shared data. The multiprocessor should support a hardware task scheduler to help in the task of assigning pending node activations to idle processors.

• The following points list some of the problems and issues that arise when implementing the state-update phase on a multiprocessor:

o The tokens associated with the memory nodes should be stored in global hash tables instead of in linear lists, as is done in existing implementations. This helps in reducing the variance in the processing required by the insertion and deletion of tokens. It is further suggested that a multiple-reader/single-writer lock be associated with each bucket of the hash tables to enable correct access by concurrent node activations.

o Section 5.2.2 gives reasons why it is not possible to share a memory node between multiple two-input nodes, as is done in uniprocessor implementations of Rete. The reasons have to do with synchronization constraints and the use of hash tables for storing tokens.

o A solution is given to the problem of processing conjugate pairs of tokens. The solution consists of associating an extra-deletes-list with each memory node. This way, whenever a delete-token request is processed before the corresponding add-token request, it can be put on the extra-deletes-list and processed appropriately later.

o In the proposed parallel implementation it is not possible to process all activations of a two-input node in parallel.
Section 5.2.4 gives reasons why multiple activations from one side of a two-input node can be processed concurrently, but why multiple activations from both the left and the right side should not be processed concurrently.

o Within the Rete class of match algorithms many different state-saving strategies can be used. Section 5.2.6 discusses the relative merits of the linear-network strategy used by the standard Rete algorithm and the alternative binary-network strategy. While the binary-network strategy is found to be much more suitable for Soar programs (programs having productions with a large number of condition elements), the linear-network strategy is found to be more suitable for the OPS5 programs. In general, even within a single program, no single strategy is found to be suitable for all the productions, and it is suggested that tailoring the networks to individual productions can lead to significant speed-ups.

• The following points list some of the problems and issues that arise when implementing the selection phase:

o Measurements show that only 12% of constant-test node activations have their associated tests satisfied. To avoid evaluating node activations which fail their tests, it is proposed that indexing/hashing techniques should be used. This can result in significant savings.

o The cost of evaluating the test associated with a constant-test node activation is significantly less than the cost of enqueueing and dequeueing a node activation from the centralized task queue. For this reason, instead of scheduling individual constant-test node activations through the centralized task queue, it is suggested that only bunches of constant-test node activations should be scheduled through the centralized task queue.

o Often a constant-test node may have a large number of α-memory node successors. Scheduling these successors serially results in a significant delay by the time the last successor node is scheduled. A solution using replicated constant-test nodes and another solution using a new node type are proposed for scheduling the successors in parallel.

Chapter Six
The Problem of Scheduling Node Activations

The model for parallel execution of production systems proposed in the previous chapter consists of two components: (1) a number of node processors connected to a shared memory, where each node processor is capable of processing any given node activation; and (2) a centralized task scheduler, where all node activations requiring processing may be placed and subsequently extracted by idle processors. This chapter explores the implementation issues that arise in the construction of such a centralized task scheduler.

The first difficulty in implementing a centralized task-scheduling mechanism stems from the fine granularity of the node activations that are processed. For example, in the current implementation, the average processing required by a node activation is only 50-100 instructions, and processing a change to working memory results in about 50 node activations. In a centralized task scheduler, even if enqueueing and dequeueing an activation took only 10 instructions, by the time the last activation is enqueued and finally picked up for processing, 500 instructions' worth of time would have elapsed. 46 This time is significantly larger than that needed to process individual node activations, and if the scheduler is not to be a bottleneck, the processing required for enqueueing and dequeueing activations must be made much smaller.

The second difficulty associated with implementing a centralized task-scheduling mechanism for the parallel Rete algorithm is that the functionality required of it is much more than that of a simple queue. This is because all node activations present in the task queue are not processable all of the time. To be concrete, consider the example shown in Figure 6-1. The figure shows a two-input node for which four activations a1, a2, a3, and a4 are waiting to be processed. However, because of the synchronization constraints given in Section 5.2.4, left activations may not be processed concurrently with right activations. Thus, while node activations a1 and a2 may be processed concurrently, and node activations a3 and a4 may be processed concurrently, node activations a1 and a3 may not be processed concurrently.

46 We do not assume a memory structure of the form proposed for the NYU Ultracomputer [28], where enqueues and dequeues can be done in parallel. Because of the additional complexities in the enqueue and dequeue required for production systems, the standard structure proposed for the Ultracomputer would not work.
46 This time is significantly larger than that to process individual node activations, and if the scheduler is not to be a bottleneck, the processing required for enqueueing and dequeueing activations must be made much smaller. The second difficulty associated with implementing a centralized task scheduling mechanism for the parallel Rete algorithm is that the functionality required of it is much more than that of a simple queue. This is because all node activations present in the task queue are not processable all of the time. To be concrete, consider the example shown in Figure 6-1. The figure shows a two-input node for which four activations al,a2,a3, and a4 are waiting to be processed. However, because of synchronization constraints given in Section 5.2.4, left activations may not be processed concurrently with fight activations. Thus, while node activations al and a2 may be processed concurrently and 46Wedo not assumea memorystructureof the formproposedfor theNYU Ultracomputer [28],whereenqueuesand dequeuescanbe doneinparallel.Becauseoftheadditionalcomplexities in theenqueueanddequeuerequiredforproduction systems, thestandard structure proposed fortheUltracomputer wouldnotwork. 86 PARALLELISM IN PRODUCFION SYSTEMS node activations a3 and a4 may be processed concurrently, node activations al and a3 may not be processed concurrently. As a result, as soon as the node activation al is assigned to an idle processor for evaluation, the activations a3 and a4 become unprocessable for the duration that al is being processed. Similar restrictions would apply if activation a3 had been picked up first for processing. This dynamically changing set of processable node activations in the task queue makes the scheduler much more complex, and consequently much more difficult to implement. CE1 al a2_ CE2 -- /a3 a4 I two-input n_e Figure 6-1: Problem of dynamically changing set ofprocessable node activations. The following sections discuss two solutions for solving the scheduling problem. The first solution involves the construction of a hardware task scheduler (HTS) and the second solution involves the construction of multiple software task schedulers (STSs). ' 6.1. The Hardware Task Scheduler 6.1.1. How Fast Need the Scheduler be? To get a rough estimate of the performance that is required of the hardware task scheduler, consider the following simple model for parallel execution of production systems. Let the total number of independently schedulable tasks generated during a recognize-act cycle be n. Let the average cost of processing one such task be c. Let the cost ofenqueueing and dequeueing a task from the scheduler be s. Note that s corresponds only to that part of the scheduling cost during which the scheduler is locked, for example, when the task queue is being modified. Let the cost when the scheduler is not locked, for example, the time taken for preparing a token to be inserted into the task queue, be t. Now the cost of performing the recognize-act cycle on a single processor is Cuni = n.c. If there are k processors, then assuming perfect division of work between the k processors, the cost of performing the match on a multiprocessor with a centralized scheduler is Cmu! = [n/k](c+ t) + n.s, and the maximum speed-up obtainable on the multiprocessor is given by S = Cuni/Cmu 1. Figure 6-2 shows the maximum speed-up that can be obtained when n=128 tasks/cycle, c= 100 THE PROBLEM OF SCItEI)ULING 87 NODE ACTIVATIONS instructions, t= 10 instructions, and for varying values of s and k. 
As the figure shows, the effect of the duration for which the scheduler is locked is very pronounced. For large values of s, the saturation speed-up is reached with a relatively small number of processors, irrespective of the inherent parallelism in the problem. In terms of the parameters described above, the saturation speed-up as k → ∞ is given by S = n·c/(c + t + n·s), and then as n → ∞, the expression for speed-up reduces to S = c/s. Thus it is extremely important to maximize c/s, but it is not easy to increase c (the only reasonable way to increase c is to increase the granularity of the tasks, which then reduces the value of n, which has other adverse effects). Hence, the value of s must be reduced as much as possible. In fact, as seen from the graph in Figure 6-2, if 20-40 fold speed-up is to be obtained, then the time for which the scheduler is locked to process an activation must not be much longer than the time taken by a processor to execute a single instruction. It is possible to construct such a scheduler in hardware, as is discussed next.

[Figure 6-2: Effect of scheduler performance on maximum speed-up. Speed-up vs. number of processors (up to 64) for n = 128, c = 100, t = 10, and s ranging from 0.5 to 16.0 instructions.]

6.1.2. The Interface to the Hardware Task Scheduler

In the proposed production-system machine (see Section 5.1), the hardware task scheduler sits on the main bus along with the processors and the shared memory in the multiprocessor. The scheduler is mapped onto the shared address-space of the multiprocessor, and the synchronization (locking of the scheduler) is achieved simply from the serialization that occurs over the bus on which the requests for enqueueing and dequeueing activations are sent to the hardware task scheduler. The scheduler is assumed to be fast enough that it can process a request in a single bus cycle.

There are three types of messages that are exchanged between the processors and the hardware task scheduler. (1) Push-Task: the processor sends a message to the scheduler when it wants to enqueue a new node activation. (2) Pop-Task: the scheduler sends a message to the processor when it wants an idle processor to evaluate a pending node activation (the scheduler keeps track of the idle processors in the system). (3) Done-Task: the processor sends a message to the scheduler when it is done with evaluating a node activation.

The Push-Task(flag-dir-nid, tokenPtr) command sent to the scheduler by the processor takes two arguments. The first argument, flag-dir-nid, encodes three pieces of information: (1) the flag associated with the activation (insert or delete token request), which takes one bit; (2) the direction (dir) of the node activation (left or right), which takes one bit; and (3) the node-id (nid) of the activation, which is allocated the remaining 30 bits of the first word. The second argument, tokenPtr, is a pointer to the token that is causing the activation (the token is stored in shared memory), which takes all 32 bits of the second word. The combined size of all this information is 64 bits, or 8 bytes. It is assumed that the bus connecting the processors to the scheduler can deliver this information without any extra synchronization. This may be easily achieved if the bus is 64 bits wide (in this case the synchronization is achieved through the bus arbiter, which permits only one processor to use the bus at a time). This can also be achieved if the bus is only 32 bits wide, but only if it is possible for a processor to obtain its use for two consecutive cycles.
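The flag-dir-nid word can be packed and unpacked with a few shifts and masks. The sketch below uses a hypothetical encoding consistent with the layout just described (1 flag bit, 1 direction bit, 30 node-id bits); the exact bit positions are our assumption.

    # Hypothetical packing of the flag-dir-nid word (bit positions assumed:
    # flag in bit 31, dir in bit 30, node-id in bits 0-29).

    def pack_flag_dir_nid(flag_insert, dir_left, node_id):
        assert 0 <= node_id < (1 << 30)
        return (int(flag_insert) << 31) | (int(dir_left) << 30) | node_id

    def unpack_flag_dir_nid(word):
        return bool(word >> 31), bool((word >> 30) & 1), word & ((1 << 30) - 1)

    w = pack_flag_dir_nid(flag_insert=True, dir_left=False, node_id=1234)
    print(unpack_flag_dir_nid(w))   # (True, False, 1234)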
As stated earlier, the scheduler is mapped onto the shared address-space of the multiprocessor. The addresses to which the hardware task scheduler is mapped, however, are different for each of the processors (the low-order 10 bits correspond to the identity of the processor). This enables the scheduler to determine the identity of the processor making a request, even though that information is not explicitly sent with the request.

There are two distinct locations in shared memory (8 bytes) associated with each processor where the scheduler puts information about the node activations to be processed by that processor. Thus, to inform an idle processor to begin processing a pending node activation, the scheduler executes a Pop-Task(flag-dir-nid, tokenPtr) command, and transfers the flag, direction, node-id, and token-pointer information to the two locations assigned to that processor. Before the command is executed, the idle processor keeps executing a loop checking for the second location to have a non-null token pointer. The processor is expected to be looping out of its cache, so that it does not cause any load on the shared bus. When the hardware scheduler writes to the two locations, the cache of the processor gets invalidated, the processor gets the new information destined for it, and it can begin processing the new node activation.

When a processor is finished with evaluating a node activation, it first sets the value of the second of the two locations assigned to it for receiving node activations to null (this is the location on which the processor will subsequently be looping, waiting for it to be set to a non-null value by the scheduler). It then executes the Done-Task(proc-id, node-id) command and transfers the node-id to the scheduler, informing the scheduler that it is finished with processing that node activation. This information is very important, because using it the scheduler can determine: (1) the set of processors that are idle, and (2) the set of node activations in the task queue that are processable (recall that all activations in the task queue are not necessarily processable).

For all of the above commands, the hardware task scheduler is locked only for the bus cycles during which data is being written into or written by the scheduler. For example, if the bus is 64 bits wide and the bus cycle is 100 ns (corresponding to a bus bandwidth of 80 MegaBytes/sec), then the total duration for which the scheduler is locked for the Push-Task, Pop-Task, and Done-Task commands for a node activation is only 300 ns. (A sketch of the processor-side protocol follows.)
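The processor side of this protocol amounts to a spin-wait on a per-processor mailbox. The following sketch models it in software; the real mechanism is a memory-mapped hardware interaction, and all names here are invented for illustration.

    # Software model of the per-processor mailbox protocol (illustrative
    # only; in the real machine these are two shared-memory words written by
    # the hardware task scheduler and spun on out of the processor's cache).

    class Mailbox:
        def __init__(self):
            self.flag_dir_nid = 0
            self.token_ptr = None      # non-null means "task pending"

    def processor_loop(mailbox, process, done_task):
        while True:
            while mailbox.token_ptr is None:   # spin on cached location
                pass
            task = (mailbox.flag_dir_nid, mailbox.token_ptr)
            process(task)                      # evaluate the node activation
            mailbox.token_ptr = None           # re-arm mailbox, then notify
            done_task(task)                    # Done-Task(proc-id, node-id)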
6.1.3. Structure of the Hardware Task Scheduler

The hardware task scheduler consists of three main components: (1) the proc-state array, (2) the task queue, and (3) the controller, as shown in Figure 6-3. Both the proc-state array and the task queue are built out of content-addressable memory. The proc-state array keeps track of the node activations being processed by each of the k processors in the multiprocessor. The task queue keeps track of all node activations that are pending processing or are currently being processed, up to a maximum of n. 47 The enable-array associated with the task queue keeps track of all pending node activations that are processable given the contents of the proc-state array (the node activations being currently processed). The controller consists of microcode to control both the proc-state array and the task queue.

47 Simulations show that the average number of activations present in the task queue is around 90. The maximum number of activations, however, can be as high as 2000.

[Figure 6-3: Structure of the hardware task scheduler. The proc-state array holds, for each of the k processors, the flag, direction, and node-id of the activation it is processing; the task queue holds up to n entries, each with a flag, direction, node-id, token pointer, and enable bit; an encoder selects the next enabled task.]

To get some insight into the internal functioning of the hardware task queue, consider the processing required by the various commands:

• Push-Task(flag-dir-nid, tokenPtr): (1) Find an empty slot in the task queue, and insert the entry (the 64 bits of information associated with the command) there. (2) If there is no entry in the proc-state array with the same node-id as that in the command, or if the xxxx-dir-nid 48 of some entry in the proc-state array matches the xxxx-dir-nid given in the command, then set the enable bit for the new entry to true; otherwise set the enable bit to false.

• Pop-Task(flag-dir-nid, tokenPtr): (1) The encoder on the extreme right of Figure 6-3 gives the next node activation to be processed. (2) Put this entry (the information corresponding to the node activation) in the appropriate slot in the proc-state array, that is, in the slot corresponding to the processor to which this node activation is being assigned. (3) For all entries in the task queue for which the node-id matches that of the assigned entry, if the direction field also matches then set the enable bit to true; otherwise, if the node-id matches but the direction field does not match, then set the enable bit to false.

• Done-Task(proc-id, nid): (1) Clear the slot in the proc-state array corresponding to the processor-id (proc-id). Furthermore, let count be the number of activations in the proc-state array that have the same node-id as the node activation that has just finished. (2) If count > 0 then do nothing; otherwise, for all entries in the task queue for which the node-id matches, set the enable bit to true.

48 The xxxx in xxxx-dir-nid indicates that we do not care about the value of the flag field.
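A software rendering of this enable-bit maintenance may help; it is a hypothetical model (the real logic is content-addressable hardware updated by microcode), but it follows the three command descriptions above. For simplicity the model removes a popped entry from the queue and tracks it only in proc_state, whereas the real task queue also retains entries that are being processed.

    # Hypothetical software model of the HTS enable-bit logic.  Entries are
    # dicts with 'nid', 'dir', 'enabled'; proc_state[p] holds the activation
    # processor p is running (or None).

    def push_task(task_queue, proc_state, entry):
        running = [a for a in proc_state if a and a['nid'] == entry['nid']]
        # Enabled if no activation of this node is running, or one is running
        # in the same direction (the flag is ignored: the xxxx-dir-nid match).
        entry['enabled'] = (not running or
                            any(a['dir'] == entry['dir'] for a in running))
        task_queue.append(entry)

    def pop_task(task_queue, proc_state, proc_id):
        entry = next((e for e in task_queue if e['enabled']), None)  # encoder
        if entry is None:
            return None                      # no processable activation
        task_queue.remove(entry)
        proc_state[proc_id] = entry
        for e in task_queue:
            if e['nid'] == entry['nid']:
                e['enabled'] = (e['dir'] == entry['dir'])
        return entry

    def done_task(task_queue, proc_state, proc_id):
        finished = proc_state[proc_id]
        proc_state[proc_id] = None
        still = sum(1 for a in proc_state if a and a['nid'] == finished['nid'])
        if still == 0:
            for e in task_queue:
                if e['nid'] == finished['nid']:
                    e['enabled'] = True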
The multiple schedulers may appear on a single bus (where it is possible for each scheduler to observe the commands being processed by the other schedulers), or they may appear on multiple buses (where it is not possible for a given scheduler to watch the commands being processed by the other schedulers). i A basic assumption made in the design of the scheduler described in the previous section is that it is possible for the scheduler to determine from its local state (the contents of the proc-state array) the subset of node activations in the task queue that are processable (using the criteria given in Section 5.2.4). This is easily possible when there is only one scheduler, since the proc-state array keeps track of all node activations being processed at any given time, and that is enough to determine which other activations are processable. However, when there are multiple schedulers, each scheduler cannot observe the activity ofaU other schedulers, and the proc-state array of a scheduler cannot keep track of all node activations being processed at any given time. There are two solutions to the above problem. The first is to have the schedulers communicate with each other or with some other centralized resource to determine which node activations are processable. This solution does not look very attractive because the overhead of communication will probably nullify any advantage gained from the extra schedulers. The second solution, which is more reasonable, is to partition the node activations between the multiple schedulers. For example, if there are two schedulers, then one scheduler could be responsible for all node activations that have an even node-id and the other scheduler for all node activations that have an odd node-id. Since activations 92 PARALLEIJSM IN PRODUCTIONSYSTEMS which have a different node-id do not interact with each other, they may always be processed concurrently (see Section 5.2.4), the activations assigned to processors by one scheduler will not affect the processability of activations in the task queue of the other scheduler. Thus the local proc-state array of each scheduler 49 contains sufficient information for deciding the processability of nodes in its task queue. There is still one problem that remains in using multiple schedulers, especially if each scheduler is to be capable of scheduling a task on any processor. In the previous section, there was a single scheduler that knew the state of all the processors--it knew which of them were idle and could automatically schedule activations on such processors. When there are multiple schedulers, none of the schedulers has knowledge of the state of all the processors. Furthermore, even if a scheduler knows that a processor is idle in the current bus cycle, say because it executed the Done-Task command in the current bus cycle, there is no way for it to know the state of that processor in the next bus cycle, since in the next bus cycle some other scheduler may have assigned a task to that processor. 50 The suggested solution to the problem is to have the processors poll the schedulers for processable tasks, instead of the schedulers assigning tasks to idle processors of their own accord. By having each scheduler set a flag in the shared memory indicating the presence of processable tasks, it is possible to make the idle processors loop out of cache (instead of causing traffic on the bus) when no processable tasks are available in the schedulers. 6.2. 
6.2. Software Task Schedulers

While it would be nice to have hardware task schedulers for the production-system machine, there are two main problems associated with them: (1) hardware task schedulers are not flexible, in that they are not easy to change as the algorithms evolve, and (2) the resources needed to build hardware task schedulers and to interface them to the rest of the system are greater than those required by software task schedulers. However, as worked out in Section 6.1.1, if a single task scheduler is not to be a bottleneck, then it must be able to schedule a task within the period of about one instruction. While it is not feasible to achieve such performance from a single software task scheduler, it is possible to use multiple software task schedulers to achieve reasonable performance. This section discusses some of the issues involved in the design of such software task schedulers.

The main reason for going to multiple software task schedulers is to avoid the serial bottleneck caused by a single task scheduler through which all activations must be scheduled. In terms of the model described in Section 6.1.1, the multiple schedulers modify the equation for maximum speed-up as follows. Recall that n is the number of tasks that need to be scheduled per cycle, c is the average cost of processing a task, s is the average serial cost of enqueueing and dequeueing a task, t is the average non-serial cost of enqueueing and dequeueing a task, and k is the number of processors in the multiprocessor. Let l be the number of schedulers (software task queues) in the system. The cost per cycle on a uniprocessor is C_uni = n·c. The cost per cycle on the multiprocessor (assuming that the load is uniformly distributed amongst the k processors and the l task queues) is given by

    C_mul = ⌈n/k⌉·(c + t) + ⌈n/l⌉·s,

and the speed-up is given by S = C_uni / C_mul. The graph of the maximum speed-up when n = 128, c = 100, t = 10, and for varying values of k, l, and s, is shown in Figure 6-4. [Footnote 51: Although the curves in the graph are shown as continuous, the actual plot of the equation for S would have discontinuities in it because of the ceiling function used in C_mul. The curves in the graph are an approximation to the actual curves for the equations.] It shows that even if the serial cost of enqueueing and dequeueing a task is only 16 instructions, the maximum speed-up obtainable with 64 processors and 32 schedulers is only 45-fold: almost a quarter of the processing power is wasted while waiting for the schedulers.

[Figure 6-4: Effect of multiple schedulers on speed-up. Speed-up vs. number of processors for n = 128, c = 100, t = 10, with curves for s = 8.0, 16.0, and 32.0 and l = 4, 16, and 32.]
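As a check on these numbers, the model can be evaluated directly. The sketch below just encodes the formula above; the 45-fold figure for s = 16, k = 64, l = 32 comes out of it.

    /* Numerical check of the speed-up model (a sketch; ceilings as in the text). */
    #include <stdio.h>
    #include <math.h>

    double speedup(double n, double c, double t, double s, double k, double l) {
        double c_uni = n * c;
        double c_mul = ceil(n / k) * (c + t) + ceil(n / l) * s;
        return c_uni / c_mul;
    }

    int main(void) {
        /* The case cited in the text: s = 16, k = 64, l = 32. */
        printf("S = %.1f\n", speedup(128, 100, 10, 16, 64, 32));  /* prints S = 45.1 */
        return 0;
    }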
A software task scheduler may be either passive or active. A passive scheduler (better described as a task queue) corresponds to an abstract data structure where node activations may be stored or retrieved using predefined operations like push-task and pop-task. An active scheduler, on the other hand, corresponds to an independent process to which messages for pushing and popping tasks may be sent. Once a processor has issued such a request, it may proceed with what it was doing earlier; the requesting processor does not have to wait while its request is being processed.

In this thesis only passive software schedulers (software task queues) are studied. The main reason is that there are a number of overheads associated with active schedulers which are not present in passive schedulers. For example, in an active scheduler, scheduling a task involves sending a message to the active scheduler and then the processing of this request by the scheduler. It is quite possible that the cost of sending the message is more than the cost of scheduling on the passive scheduler. Similarly, it is quite possible that when a message is sent to an active scheduler, the scheduler process is not running and has to be swapped in before the message can be processed; this could cause a significant delay in the message getting processed. The main advantage of active schedulers occurs when the cost of scheduling a task is significantly larger than the cost of sending the message to the scheduler. In that case the task sending the message can continue without waiting for the processing required by the scheduler to complete.

Figure 6-5 shows an overview of the structure proposed for using multiple software task queues. To schedule tasks there exist several task queues, all of which can be accessed by each of the processors. There is a lock associated with each task queue, and this lock must be obtained by a processor before it can put tasks into, or extract tasks from, the task queue. To obtain a task, an idle processor first checks if the task queue is empty. If it is empty, the processor goes on to check the next task queue to see if it has any processable tasks. [Footnote 52: Note that the match process checks whether a task queue is empty or not without obtaining the lock. Of course, this information can be inaccurate since it is checked without obtaining the lock, but since it is only used as a hint it does not matter. If one match process misses a task that was just being enqueued when it checked the task queue, another match process at some later time will find it.] If it is not empty, the match process obtains the lock for the task queue, extracts a processable task (if any is present) from the task queue, and then releases the lock. By having a dynamic cache-coherence strategy [76], it is possible to arrange the locks so that idle processors looking for a task (when none are available) spin on their cache without causing large traffic on the shared bus.

[Figure 6-5: Multiple software task queues. N processors access several software task queues (STQs) and a shared node table.]

To determine whether a node activation in the task queue is processable, the match processes access the node table. For each node in the Rete network, the node table keeps track of information about those of its activations that are currently being processed. This information is sufficient to determine whether a node is processable or not. There is a separate lock associated with each entry of the node table, and this lock must be obtained before the related information is modified or checked.
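A sketch of the pop path under this locking discipline is given below. The structure layouts, the same-direction processability test, and the helper names (try_lock, remove_at, and so on) are all assumptions made for illustration; the unlocked emptiness check is the "hint" described in the footnote above.

    #define NQUEUES 8      /* number of software task queues (assumed) */
    #define NNODES  1024   /* entries in the node table (assumed)      */
    #define MAXT    256

    typedef struct { int node_id, dir; long token; } Task;
    typedef struct { volatile int lock; int count; Task tasks[MAXT]; } TaskQueue;
    typedef struct { volatile int lock; int active; int active_dir; } NodeEntry;

    extern TaskQueue tq[NQUEUES];
    extern NodeEntry node_table[NNODES];
    extern int  try_lock(volatile int *l);             /* assumed lock primitives */
    extern void lock(volatile int *l), unlock(volatile int *l);
    extern void remove_at(TaskQueue *q, int i);

    /* Try to extract one processable task from queue q; returns 0 on failure. */
    int try_pop(int q, Task *out) {
        if (tq[q].count == 0) return 0;          /* unlocked peek: only a hint  */
        if (!try_lock(&tq[q].lock)) return 0;    /* busy queue: caller moves on */
        for (int i = 0; i < tq[q].count; i++) {
            Task *t = &tq[q].tasks[i];
            NodeEntry *ne = &node_table[t->node_id];
            lock(&ne->lock);
            /* Processable if no activation of this node is in progress, or the
               in-progress activations have the same direction (cf. Sec. 5.2.4). */
            int ok = (ne->active == 0) || (ne->active_dir == t->dir);
            if (ok) { ne->active++; ne->active_dir = t->dir; }
            unlock(&ne->lock);
            if (ok) {
                *out = *t;
                remove_at(&tq[q], i);
                unlock(&tq[q].lock);
                return 1;
            }
        }
        unlock(&tq[q].lock);
        return 0;
    }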
Since there is a separate lock associated with each entry in the node table, multiple schedulers looking up the processability of different node activations do not clash with each other. The processing required when a new task is to be put in a task queue, or when the match process finishes processing a node activation, is also quite simple. To push a new node activation into a task queue, the match process selects a random task queue from the several that are available. If the lock associated with that task queue is busy, the match process simply goes on and tries another task queue. Otherwise it obtains the lock and enters the new node activation into the task queue. When a match process finishes processing a node activation, it modifies the corresponding entry in the node table to indicate that there is one less activation of that node being processed.

Simulations done for the above scheme using multiple software task queues show that approximately a factor of 2 is lost in performance compared to when a hardware task scheduler is used (detailed results are presented in Chapter 8). We are currently experimenting with variations of the above scheme. For example, one possible variation is that instead of putting both processable and non-processable node activations in the task queue, one may put only processable node activations in the task queue. The non-processable node activations would be attached to the associated slots in the node table, and whenever the last processable activation of a node finishes, the additional processable entries would be put into the task queues. Such a scheme would reduce the cost of extracting a node activation from a task queue, since no checks would have to be made about the processability of the node activation. However, it would also increase the time to enqueue a node activation, since it would now be necessary to first determine whether the given activation is processable or not. Another variation is to use some kind of priority scheme for ordering pending node activations; for example, node activations that can potentially result in long chains of activations should be processed before activations that cannot. We are also experimenting with the use of multiple active software task schedulers.

Chapter Seven
The Simulator

Most of the earlier studies exploring parallelism in production systems were done using simulators with very simple cost models [22, 32, 66], or were done using average-case data as presented in Chapter 3 [30, 37, 60]. These simulators did not take many of the overheads into account, and often variations in the cost of processing productions were not taken into account. This chapter presents details about a second-generation simulator that has been constructed to evaluate the potential of using parallelism to speed up the execution of production systems. The aims of the simulator are: (1) to study the speed-up obtainable from the various sources of parallelism and to determine the associated overheads, (2) to determine the bottlenecks that reduce the speed-up obtainable from parallelism and the effects of eliminating these bottlenecks, (3) to study the advantages and disadvantages of using the hardware and software task schedulers, and (4) to study the effects of different data structures and algorithms on the amount of speed-up that is obtainable.
The reasons for using a simulator instead of an actual implementation on a multiprocessor are the following:

• An implementation on a multiprocessor which incorporates all the planned optimizations would have taken a very long time to do. An implementation which does not include the optimizations leads to significantly different results from one that includes them, and for that reason it is not very useful.

• An implementation corresponds to the case where most of the design decisions are already frozen. There is not as much scope for trying out various alternatives, which is what our aim is at this moment.

• A multiprocessor consisting of 32-64 processors, that has a smart cache-coherence strategy, and that supports the other features mentioned in Section 5.1 is not currently available to us. Although such a multiprocessor may be available in the near future, until then we have to rely on a simulator to obtain results.

To test some of our ideas about using parallelism, we are currently implementing OPS5 on the VAX-11/784, a four-processor multiprocessor from Digital Equipment Corporation. However, the implementation is still in its early stages and the results are not available at this time.

7.1. Structure of the Simulator

The simulator that we have constructed is an event-driven simulator. The inputs to the simulator consist of: (1) a detailed trace of node activations in the Rete network corresponding to an actual production-system run; (2) a specification of the parallel computational model on which the production system is to be run; and (3) a cost model that can be used to determine the cost of any given node activation. The output of the simulator consists of various statistics for the overall run and for the individual cycles of the run.

7.1.1. Inputs to the Simulator

7.1.1.1. The Input Trace

Figure 7-1 shows a small fragment of a trace that is fed to the simulator. The trace is obtained by actually running a production system and recording the activations of nodes in the Rete network in a file. The trace contains information about the dependencies between the node activations. The simulator, depending on the granularity of parallelism being used, can lump several activations into one task, and knows which activations can and which cannot be processed in parallel. Information about nodes that remains fixed over the complete run, for example, the tests associated with a node and the type of a node, is presented to the simulator in a static table, as shown in Figure 7-2. The combined information available to the simulator from the input trace and the static table is sufficient to provide fairly accurate estimates of the cost of processing a given node activation.

7.1.1.2. The Computational Model

The computational model specifies the hardware and software structure of the parallel processor on which the production-system traces are to be evaluated. The computational model specifies:

• The sources of parallelism (production, node, intra-node, action parallelism, etc.) that are to be used in executing the production-system trace. For example, when only production parallelism is to be used, the simulator lumps all node activations belonging to the same production together and processes them as one task. Also, activations of nodes that are shared between several productions are replicated (once for each production), since nodes cannot be shared between different productions when using production parallelism.
• Whether a hardware task scheduler or software task queues are to be used, and, in case several software task queues are to be used, the number of such task queues.

• The hardware organization of the parallel processor, for example, the number of processors in the parallel machine, the speed of the individual processors, etc.

• Whether the effects of memory contention are to be taken into account. If they are to be taken into account, then it is possible to specify the expected cache-hit ratio, the bus bandwidth, etc. Details on how the effects of memory contention are handled are given in Section 7.1.1.4.

• The cost model to be used in evaluating the cost of the individual node activations. Different cost models permit experimentation with different algorithms, data structures, and processor architectures. However, different cost models cannot account for major changes in the algorithms or data structures.

    (pfire uwm-no-operator-retry)
    (wme-change 914)
    ((prev 914) (cur 13630) (type bus) (lev 1))
    ((prev 13630) (cur 13631) (type teqa) (lev 2))
    ((prev 13631) (cur 1022389) (node-id 541) (side right) (flag insert) (num-left 4) (num-right 20))
    ((prev 13631) (cur 13632) (type teqa) (lev 3))
    ((prev 13632) (cur 13633) (type tnea) (lev 4))
    ((prev 13633) (cur 13634) (type tnea) (lev 5))
    ((prev 13632) (cur 13635) (type teqa) (lev 4))
    ((prev 13635) (cur 1022390) (node-id 576) (side right) (flag insert) (num-left 1) (num-right 4))
    ((prev 1022390) (cur 1022391) (node-id 577) (flag insert))
    ((prev 13635) (cur 1022392) (node-id 201) (side right) (flag insert) (num-left 4) (num-right 4))
    ((prev 1022392) (cur 1022393) (node-id 202) (side left) (flag delete) (num-left 3) (num-right 36))
    (pfire eight-copy-unchanged)
    (wme-change 915)
    ((prev 915) (cur 13636) (type bus) (lev 1))
    ((prev 13636) (cur 13637) (type teqa) (lev 2))
    ((prev 13637) (cur 13638) (type teqa) (lev 3))
    ((prev 13638) (cur 1022394) (node-id 207) (side right) (flag insert) (num-left 1) (num-right 21))
    ((prev 13638) (cur 1022395) (node-id 197) (side right) (flag insert) (num-left 1) (num-right 21))
    ((prev 13638) (cur 1022396) (node-id 193) (side right) (flag insert) (num-left 1) (num-right 21))

    ;;; pfire: The name of the production that fired at this point in the trace.
    ;;; wme-change: The number of changes made to working memory so far.
    ;;; prev: The activation-number of the predecessor of this node activation.
    ;;; cur: The unique activation-number associated with a node activation.
    ;;; type: The type of a constant-test node activation.
    ;;; lev: The distance between the root node and the associated constant-test node.
    ;;; node-id: The unique id associated with each node in the Rete network.
    ;;; side: Whether a left-activation or a right-activation (only for and-nodes and not-nodes).
    ;;; flag: Whether the token is being inserted or deleted.
    ;;; num-left/num-right: The number of tokens in the left-memory/right-memory node.

Figure 7-1: A sample trace fragment.
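For concreteness, a parsed record of the two-input-node form shown in Figure 7-1 might be represented as follows; the type and field names are hypothetical, chosen to mirror the attributes in the trace legend.

    /* One way to represent a parsed trace record (hypothetical names). */
    typedef enum { INSERT, DELETE } Flag;
    typedef enum { LEFT, RIGHT } Side;

    typedef struct {
        long prev;        /* activation-number of the predecessor activation    */
        long cur;         /* unique activation-number of this activation        */
        int  node_id;     /* Rete node id (two-input and terminal nodes)        */
        Side side;        /* left- or right-activation (two-input nodes only)   */
        Flag flag;        /* token inserted or deleted                          */
        int  num_left;    /* tokens in the left memory at this point            */
        int  num_right;   /* tokens in the right memory at this point           */
    } TraceRecord;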
    ((node-id 193) (type and) (prods (p51)) (lev 10) (lces 6) (rces 1) (tests (teqb 10001 1)))
    ((node-id 197) (type and) (prods (p4)) (lev 6) (lces 10) (rces 1) (tests (teqb 30001 1)))
    ((node-id 201) (type not) (prods (p8)) (lev 2) (lces 14) (rces 1) (tests (teqb 3 1 teqb 10001 5 teqb 1 6 teqb 30001 7)))
    ((node-id 202) (type and) (prods (p4)) (lev 1) (lces 14) (rces 1) (tests (teqb 10003 1 teqb 3 2 teqb 10001 3)))
    ((node-id 207) (type and) (prods (p9)) (lev 6) (lces 4) (rces 1) (tests (teqb 3 1)))
    ((node-id 541) (type not) (prods (p45)) (lev 1) (lces 2) (rces 1) (tests (teqb 3 1)))
    ((node-id 576) (type and) (prods (p51)) (lev 1) (lces 2) (rces 1) (tests (teqb 1 5 teqb 10003 6 teqb 3 7)))
    ((node-id 577) (type term) (prods (p51)) (lev 3) (lces -) (rces -))

    ;;; node-id: The unique id associated with each node in the Rete network.
    ;;; type: Type of the node (and-node, not-node, or term-node).
    ;;; prods: The production or productions (in case the node is shared) to which the node belongs.
    ;;; lev: The number of intervening nodes to get to the terminal node.
    ;;; lces: The size of the tokens (number of wme pointers needed) in the left-memory node.
    ;;; rces: The size of the tokens (number of wme pointers needed) in the right-memory node.
    ;;; tests: The tests associated with the two-input nodes to ensure consistent variable bindings.

Figure 7-2: Static node information.

7.1.1.3. The Cost Model

The simulator relies on an accurate cost model for determining the time required to (1) process node activations (given the information in the input trace), (2) push node activations (tasks) into the task queue, (3) pop tasks from the task queue, etc. The cost model reflects the effects of:

• The algorithms and data structures used to process the node activations.

• The code used to push/pop node activations from the task schedulers.

• The instruction-set architecture of the individual processing elements (this determines the number of machine instructions required to implement the proposed algorithms).

• The time taken by different instructions in the architecture; for example, synchronization instructions may take much longer than register-register instructions.

• The structure of the multiprocessor, the presence of memory contention, etc.

The basic cost models used in the simulations have been obtained by writing parametrized assembly code for the various primitive operations required in processing node activations. For example, code sequences are written for computing the hash value corresponding to a token, for deleting a token from a memory node, and so on. These code sequences can then be combined to obtain the code for processing a complete node activation. For example, the code to process the left activation of an and-node is shown in Figure 7-3 (details of data structures and code are presented in Appendix B).

            HASH-TOKEN-LEFT              ! compute hash value for left token
            cbr_eq Delete,R-flg,L_del    ! if (flag = Delete) goto L_del
            MAKE-TOKEN                   ! Allocate storage and make token
            INSERT-LTOKEN                ! Insert token in left hash table
            br L11                       ! Goto L11
    L_del:  DELETE-LTOKEN                ! Delete token from left hash table
    L11:    ldal (R-rtokHT)R-hIndex,r5   ! Get address of relevant bucket in
                                         !   right hash table (opposite memory)
            cmp NULL,(r5)tList           ! Check if the hash bucket is empty
            br_eq L_exit                 ! If so, then goto L_exit
            LOCK-RTOKEN-HT               ! Obtain lock for the hash bucket
            ldl (r5)tList,R-state        ! Load pointer to first token in opp.
                                         !   memory into register R-state
    L_loop: NEXT-MATCHING-RTOKEN         ! Find first matching token in opp mem
            SCHEDULE-SUCCESSORS          ! Schedule activations of successor nodes
            br L_loop                    ! Goto L_loop
    L12:    RELEASE-RTOKEN-HT            ! Free lock for the hash bucket
    L_exit: RELEASE-NODE-LOCK            ! Inform scheduler that processing for
                                         !   this node activation is finished

Figure 7-3: Code for left activation of an and-node.
In the code in Figure 7-3, the operations listed in capitals refer to macros that would be expanded at a later point in time. The register allocation for the code for processing node activations is done manually, since the total amount of code is small and it is used very often. [Footnote 53: The cost models used in the simulator assume that no overheads are caused by the operating system of the production-system machine. This is a reasonable assumption to make because all synchronization code and scheduling code is a part of the production-system implementation code. The production-system machine is also expected to have enough main memory so that operating-system intervention to handle page faults is minimal.]

Given the code for the primitive operations involved in executing a node activation, the next step is to calculate the time taken to execute these pieces of code. To achieve this, the instructions in a code sequence are divided into groups, such that the instructions in the same group take the same amount of time. Thus register-register instructions are grouped together, branch instructions are grouped together, memory-register instructions are grouped together, and synchronization instructions are grouped together. [Footnote 54: Note that the instruction set of the processors (described in Appendix A) has been chosen so that the number of such groups is very small. For example, almost all instructions make either 0 or 1 memory references. There are very few instructions that make 2 memory references, and they are treated specially when computing the cost models for the simulator. Table 7-1 shows the instruction categories and their relative costs as used in the simulator. The relative costs can be changed in the simulator just by changing some constant definitions.] Once the instructions have been divided into groups, the cost of executing an operation can simply be found by adding the time of executing the instructions in each group. The cost of a node activation can then be computed by adding the cost of the primitive operations involved. Note that processing different node activations requires different combinations of the primitive operations. The set of primitive operations required is determined by the parameters associated with the node activation in the input trace, for example, the number of tokens in the opposite memory, the set of tests associated with the node, and so on.

Table 7-1: Relative Costs of Various Instruction Types

    Instruction Type                    Relative Cost
    register-register                        1.0
    memory-register                          1.0
    memory-memory-register                   2.0
    synchronization/interlocked              3.0
    compare-and-branch                       1.5
    branch                                   1.5
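The grouping scheme amounts to a weighted instruction count. A sketch using the Table 7-1 weights follows; the struct and function names are illustrative.

    /* Sketch: the cost of a primitive operation from its instruction groups,
       using the relative weights of Table 7-1. */
    typedef struct {
        int reg_reg, mem_reg, mem_mem_reg, sync, cmp_br, br;  /* instruction counts */
    } OpCounts;

    double op_cost(OpCounts c) {
        return 1.0 * c.reg_reg + 1.0 * c.mem_reg + 2.0 * c.mem_mem_reg
             + 3.0 * c.sync    + 1.5 * c.cmp_br  + 1.5 * c.br;
    }
    /* The cost of a node activation is then the sum of op_cost() over the
       primitive operations selected by the activation's trace parameters. */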
7.1.1.4. The Memory Contention Model

Modeling the overhead due to memory contention accurately is an extremely difficult problem [5, 6, 38, 52]. Since actual measurements are not possible until the multiprocessor system and the algorithms have actually been designed and implemented, one has to rely on analytical and simulation techniques. The analytical models are more difficult to devise and usually require more assumptions than the simulation models; however, once they have been built, they permit the exploration of the design space very quickly. The simulation models are easier to come up with, but getting each data point can require a lot of computing time.

To model the effects of memory contention on the parallel execution of production systems, our simulator uses a mixture of analytical and simulation techniques. As the first step, the simulator generates a table which gives the degradation in processing power due to memory contention as a function of:

• The number of active processors in the multiprocessor. At any given time many processors may be idle; it is assumed that these processors will be looping out of cache and will not cause any load on the memory system.

• The number of memory modules. This determines the contention when requests are sent to the same memory module.

• The characteristics of the bus (bandwidth, synchronous/asynchronous, etc.) interconnecting the processors to the memory modules. This information is used to determine the contention for the bus.

• The cache-hit ratio. It is assumed that the processors communicate with the external memory system through a cache which captures most of the requests.

Once such a table has been generated, the memory contention overhead is included in the simulations as follows. The simulator keeps track of the number of active processors while processing the trace. Whenever the cost for processing a node activation is required, instead of using the basic cost for the node activation (the cost computed by the method described in Section 7.1.1.3), the basic cost is multiplied by the appropriate contention factor (1/processor-efficiency) from the table and then used.

The tables for the contention factor used in the simulator are computed using an analytical model for memory contention proposed in [38]. The proposed model deals with multiple-bus multiprocessor systems and makes the following assumptions:

• When a processor accesses the shared memory, a connection is immediately established between the processor and the referenced memory module, provided the referenced memory module is not being accessed by another processor and a bus is available for the connection.

• A processor cannot make another request if its present request has not been granted.

• The duration between the completion of a request and the generation of the next request to the shared memory is an independent exponentially distributed random variable with the same mean value 1/λ for all the processors.

• The duration of an access by a processor to the common memory is an independent exponentially distributed variable with the same mean value 1/μ for all the memory modules.

• The probability of a request from a processor to a memory module is independent of the module and is equal to 1/m, where m is the number of shared-memory modules.

A problem with the above model is that it deals with multiple-bus multiprocessors with non-time-multiplexed buses, while we are interested in evaluating a multiprocessor with a single time-multiplexed bus (such a bus makes it simple to implement cache-coherence strategies). [Footnote 55: We could not find an analytical model dealing with contention for time-multiplexed buses in the literature.] However, we have found a way of approximating a single time-multiplexed bus with multiple non-time-multiplexed buses. We assume that the number of non-time-multiplexed buses is equal to the degree of time multiplexing of the single time-multiplexed bus. Thus if the time-multiplexed bus delivers a word of data 1 μs after the request (has a latency of 1 μs), but can deliver a word of data every 100ns (a throughput of 10 requests per latency period), then it is assumed that there are 10 non-time-multiplexed buses, each with a latency of 1 μs.
It can be shown that using this approximation, the inaccuracy in the results can be at most a factor of 2. In other words, if the approximate model predicts that the processing power lost due to contention is 10%, then in the accurate model with a single time-multiplexed bus, not more than 20% of the processing power would be lost due to contention. The situation where the results are off by a factor of 2 is quite pathological, occurring when all requests come to the time-multiplexed bus at exactly the same time. [Footnote 56: Let the latency of the time-multiplexed bus be l and the degree of time-multiplexing be k. Then in the approximate model we use k non-time-multiplexed buses, each with latency l. The worst case occurs when k requests are presented to the buses at the same time. On the time-multiplexed bus the last request is satisfied by time (k·l/k) + l, which simplifies to 2l. In the case of k non-time-multiplexed buses, the last request is satisfied by time l, thus resulting in a factor of 2 loss in performance. Note that if k+1 requests had arrived at the same time, then the time for the time-multiplexed bus would have been 2l + l/k, the time for the non-time-multiplexed buses would have been 2l, and the loss factor would have been only 1 + 1/(2k), which is less than 2. In case k-1 requests had arrived at the same time, then the time for the time-multiplexed bus would have been 2l - (l/k), the time for the non-time-multiplexed buses would have been l, and the loss factor would have been 2 - (1/k), which is less than 2.] In actual operation it is hoped that the error due to the approximation would be significantly less.

Figure 7-4 shows the degradation in performance due to memory contention as predicted by the analytical model given in [38]. To compute the curves, it is assumed that λ = 3.0 MACS (million accesses per second), μ = 1.0 MACS, the number of memory modules m = 32, and the number of non-time-multiplexed buses b = 1000ns/100ns = 10. The curves show the processor efficiency as a function of the number of active processors k and the cache-hit ratio c. It can be observed that the degradation in processor efficiency is significant if the cache-hit ratio is not high.

[Figure 7-4: Degradation in performance due to memory contention. Processor efficiency vs. number of active processors, for λ = 3.0, μ = 1.0, m = 32, b = 10, and cache-hit ratios ranging from c = 0.60 to c = 0.90.]

7.1.2. Outputs of the Simulator

The statistics output by the simulator consist of both per-cycle information and overall-run information. The statistics that are output for each match-execute cycle of the production system are:

• S_max-i = (n_i · t_avg-i) / t_max-i, where S_max-i is the maximum speed-up that can be achieved in the i-th cycle assuming no limit on the number of processors used, n_i is the number of tasks [Footnote 57: A task here corresponds to an independently schedulable piece of work that can be executed in parallel. Thus when using production-level parallelism, a task corresponds to all node activations belonging to a production. When using node-level parallelism a task becomes more complex, corresponding approximately to a sequence of dependent node activations, i.e., a set of node activations no two of which could have been processed in parallel.] in the i-th cycle, t_avg-i is the average cost of tasks in the i-th cycle, and t_max-i is the maximum cost of any task in the i-th cycle as determined by the simulator. Note that n_i · t_avg-i represents the cost of executing the i-th cycle on a uniprocessor.

• S_nom-i = (n_i · t_avg-i) / t_cyc-i, where S_nom-i is the nominal speed-up (or concurrency) achieved in the i-th cycle using the number of processors specified in the computational model, and t_cyc-i is the cost of the i-th cycle as computed by the simulator. Note that it follows from the definitions of t_max-i and t_cyc-i that t_cyc-i ≥ t_max-i.

• PU_i = S_nom-i / k, where PU_i is the processor utilization in the i-th cycle and k is the number of processors specified in the computational model.
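These three per-cycle statistics can be restated directly as code; the sketch below is only a restatement of the definitions above, with illustrative names.

    /* Per-cycle statistics (sketch). */
    typedef struct {
        long   n;      /* number of tasks in the cycle                  */
        double t_avg;  /* average task cost                             */
        double t_max;  /* cost of the most expensive task               */
        double t_cyc;  /* simulated cost of the cycle, t_cyc >= t_max   */
    } CycleData;

    double s_max_i(CycleData d)     { return d.n * d.t_avg / d.t_max; }  /* unbounded processors */
    double s_nom_i(CycleData d)     { return d.n * d.t_avg / d.t_cyc; }  /* k processors         */
    double pu_i(CycleData d, int k) { return s_nom_i(d) / k; }           /* utilization          */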
The same set of statistics can also be computed at the level of the complete run. The overall statistics are:

• S_max = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{i=1..N} t_max-i ], where S_max is the maximum speed-up that can be achieved over the complete program run assuming no limit on the number of processors used.

• S_nom = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{i=1..N} t_cyc-i ], where S_nom is the nominal speed-up over the complete run using the number of processors specified in the computational model.

• PU = S_nom / k, where PU is the processor utilization over the complete run and k is the number of processors specified in the computational model.

The results presented in Chapter 8 mainly refer to the overall statistics. The following equations show the relationship between the overall statistics and the per-cycle statistics:

    S_max = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{j=1..N} t_max-j ]
          = Σ_{i=1..N} [ (t_max-i / Σ_{j=1..N} t_max-j) · S_max-i ]

    S_nom = [ Σ_{i=1..N} n_i·t_avg-i ] / [ Σ_{j=1..N} t_cyc-j ]
          = Σ_{i=1..N} [ (t_cyc-i / Σ_{j=1..N} t_cyc-j) · S_nom-i ]

The above equations state that the overall speed-up is not a simple average of the per-cycle speed-ups but a weighted average of the per-cycle speed-ups. The weight for the i-th cycle is t_max-i / Σ_j t_max-j in the first equation and t_cyc-i / Σ_j t_cyc-j in the second equation. Thus each per-cycle statistic is weighted by its fraction of the total cost in the parallel implementation (not the total cost in the uniprocessor implementation). As a result, a few long cycles with low speed-ups can destroy the overall speed-up for a run.

In addition to statistics about the obtainable speed-up, the simulator also outputs statistics about individual node activations. For example, for each type of node activation (constant-test nodes, and-nodes, not-nodes, terminal nodes) it outputs (1) the number of node activations of that type per cycle, (2) the average cost of such node activations, and (3) the variance in the cost of evaluating such activations. This information can then be used to modify the algorithms and data structures so as to reduce the cost and variance of evaluating node activations.
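The overall statistics follow the same pattern. The sketch below (CycleData is as in the previous sketch) also makes the weighting explicit: the quotient is algebraically the weighted average described above.

    /* Overall maximum speed-up over an N-cycle run (sketch). */
    double s_max_overall(const CycleData d[], int N) {
        double work = 0, chain = 0;
        for (int i = 0; i < N; i++) {
            work  += d[i].n * d[i].t_avg;   /* total uniprocessor cost    */
            chain += d[i].t_max;            /* sum of per-cycle max costs */
        }
        return work / chain;  /* = sum_i (t_max-i / sum_j t_max-j) * s_max_i */
    }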
7.2. Limitations of the Simulation Model

The simulator, although obviously not as accurate as an actual implementation, provides considerable flexibility in evaluating different algorithms and architectures for the parallel implementation of production systems. The main inaccuracies in the simulator come from the following:

• The simulator models the execution of production systems at an abstract level, in the sense that each instruction in the code is not actually executed; only the combined cost is taken into account. Thus low-level effects, such as locality of memory references and bursts of cache misses when processing for a new node activation is begun, are not taken into account.

• The simulator ignores contention for certain kinds of resources. For example, contention for the locks associated with the buckets of the token hash table is ignored. The reason for making such approximations is to keep the simulator simple and fast, especially since the traces are quite long (each trace is about 5-10 MBytes long). It was also felt that the errors due to such approximations would be small. For example, if in an actual implementation it is observed that many clashes are occurring, it is easily possible to increase the size of the hash table.

• The cost model makes certain simplifying assumptions as well. For example, if a token is to be searched for in a list, then the cost model assumes that half of that list will have to be searched before the token is found. Such an assumption masks the effects of the variations that would occur in an actual implementation. One way to take such variations into account would be to build a new version of the Rete interpreter with the same processing model as that of the proposed parallel implementation, and extract traces from that interpreter. Due to various time constraints this was not done.

• As discussed in the previous section on memory contention, a number of approximations have also been made in the memory contention model.

As mentioned in the above points, a number of approximations have been made in building the simulation model. Some of the approximations were made for reasons of efficiency of the simulator, some were made because it would have been too time-consuming to fix them, and some were made because we did not know any reasonable way of handling them. On the whole, however, we believe that most of the important sources of cost in processing node activations, and the variations therein, are taken into account by the simulator.

7.3. Validity of the Simulator

In any simulation-based study it is essential to somehow establish the validity of the simulator, that is, to establish that the simulator gives results according to the prescribed model and assumptions. One way to validate a simulator is to compare its results to an actual implementation for some limited set of sample cases, and then make only high-level checks for the other cases. This method is not possible in this thesis, since there is no actual parallel implementation corresponding to the model used in the simulator. However, for the following reasons we believe that the results of the simulator are correct:

• Before implementing the simulator, we implemented a parallel version of the Rete interpreter on the VAX-11/784, a four-processor multiprocessor. The interpreter was not optimized for performance, but it included all the synchronization code necessary to be able to process node activations in parallel. The assembly code written for the parallel Rete algorithm in the thesis (from which the cost models used in the simulator are derived) uses data structures and algorithms similar to those used in the working VAX-11/784 implementation. Thus there is good reason to believe that the algorithms and the code from which the cost models for the simulator are derived are correct (see Appendix B for the code).
• The performance predicted by the simulator for production-system execution on a uniprocessor is close to the performance predicted independently for an optimized uniprocessor implementation of OPS5. [Footnote 58: Independent estimate for OPS5 provided by Charles Forgy of Carnegie-Mellon University.] Both the simulator and the independent study predict a performance of 400-800 wme-changes/sec on a 1 MIPS uniprocessor. This again indicates that there are no major costs that are being ignored by the simulator.

• Finally, the simulation results have been hand-verified for small runs, and local checks have been made on portions of large runs to check the correct operation of the simulator.

For the reasons given above, we believe that the simulator is giving correct results (of course, keeping in mind the limitations mentioned in the previous section). Based on these simulation results, an optimized parallel implementation is currently being done on the VAX-11/784. Once this implementation is running, it will be possible to validate the simulator more completely.

Chapter Eight
Simulation Results and Analysis

This chapter presents simulation results for the parallel execution of production systems. The chapter is organized as follows. The traces used in the simulations are presented first. Next, results are presented for the execution of production systems on uniprocessor systems; this section also discusses the overheads due to loss of sharing of nodes, synchronization, and scheduling in parallel implementations. Subsequently, Sections 8.3, 8.4, and 8.5 discuss the speed-up obtainable from production parallelism, node parallelism, and intra-node parallelism when a hardware task scheduler is used. Section 8.6 discusses the effect of using binary Rete networks (instead of linear Rete networks) on the amount of speed-up obtainable. Section 8.7 discusses the execution speeds when multiple software task queues are used instead of hardware task schedulers. Finally, Section 8.8 discusses the effects of memory contention on the execution speed of production systems. The results are summarized in Section 8.9.

8.1. Traces Used in the Simulations

The simulation results presented in this chapter correspond to traces obtained from six production-system programs. The following are the six production systems and the traces associated with them (the names given to the traces below are used consistently through the rest of the chapter):

• VT: An expert system that selects components for a traction elevator system [51]. The associated traces were obtained from a run involving 1767 changes to working memory, and they are named as follows:

  o vto.lin: This trace corresponds to the case when the interpreter constructs linear networks (see Section 5.2.6) for the productions.

  o vto.bin: This trace corresponds to the case when the interpreter constructs binary networks for the productions.

  o vt.lin and vt.bin: These traces correspond to the linear and binary network versions
• ILOG: An expert system to maintain inventories and production schedules for factories [58].60 The associated traces were obtained from a run involving 2191 changes to working memory. The traces are: o ilog.lin and ilog.bin: These traces correspond to the linear network and binary network versions of the ILOG system. • MUD: An expert system to analyze the mud-lubricant used in drilling operations [39]. The associated traces were obtained from a run involving 2074 changes to working memory. The traces are: o mudo.Iin and mudo.bin: These traces correspond to the linear and binary network versions of the MUD system. o mud.lin and mud.bin: These traces correspond to the finear and binary network versions of the MUD system from which two productions have been removed. 61 The reasons for removing the two productions are the same as that for VT. • DAA: An expert system that designs computer systems from a high-level specification of the system [43]. The associated traces were obtained from a run involving 3200 workingmemory changes. The traces are: o daa,lin and daa.bin: These traces correspond to the linear and binary network versions of the DAA system. • R1-SOAR: A Soar expert system that configures the UNIBUS for Digital Equipment Corporation's VAX-11 computer systems [74]. The associated traces are: o rls.lin and rls.bin: These traces correspond to the sequential version of the Rl-Soar program, that is, a version in which the problem-space is searched sequentially. The traces were obtained from a run involving 2220 changes to working memory. o rlp.lin and rlp.bin: These traces correspond to the parallel version of the R1-Soar program, that is, a version in which some of the problem-space search is done in parallel. The traces were obtained from a run involving 3231 changes to working memory. 59The names of the productions that have been removed are (1) detail::create-item-from-tnput, (2) demil.':insert-value-frora-input, (3) detail::ladder-duet-l-car'so, (4) build-why-answer-for-itern-without-dependency, (5) build-why-answer-for-attribute, (6)save-cleanup::remove-wmes, and (7)save.current-job::remove. 60ILOGisreferredto asPTRANSin thecitedwork. 61Thenamesoftheproductionsthathavebeenremovedare(1)cleanup:: and(2)undo-analysi._':l. SIMUI:,TION RI_ULTSAND ANALYSIS I1l • EP-SOAR: A Soar expert system that solves the Eight Puzzle [47]. The associated traces are: o eps.lin and eps.bin: These traces correspond to the sequential version of the EP-Soar program, qhe traces were obtained from a run involving 924 changes to working memory. o epp.lin and epp.bin: These traces correspond to the parallel version of the EP-Soar program. The traces were obtained from a run involving 948 changes to working memory. 8.2. Simulation Results for Uniprocessors When production systems are implemented on a multiprocessor there are a number of overheads that are encountered, for example, memory contention, loss of node sharing, synchronization, and scheduling overheads. This section discusses the speed of production systems on a uniprocessor when such overheads are not present and also when such overheads are present. These results then form the basis for the speed-up numbers given in the subsequent sections. For example, consider a production system that without overheads of a parallel implementation runs at a speed of 400 wmechanges/sec on a uniprocessor, and that with the overheads runs at a speed of 200 wme-changes/sec on a uniprocessor. 
Now if a parallel implementation using eight processors runs at a speed of 1200 wme-changes/sec, then the nominal speed-up (the average number of processors that are busy or the speed-up with respect to the parallel implementation on a uniprocessor) is 6-fold, while the true speed-up (speed-up with respect to the base serial implementation with no overheads) is only 3-fold. The nominal speed-up (sometimes called concurrency) is an indicator of the average processor utilization in the parallel implementation, while the true speed-up is an indicator of the performance improvement over the best known sequential implementation. Tables 8-1 and 8-2 present the data for the uniprocessor execution of production systems when the synchronization and scheduling overheads are not present. 62 The cost models for these simulations were derived by removing all the synchronization and scheduling code from the code for the parallel implementation (given in Appendix B). Adjustments were also made to compensate for the loss of sharing of memory nodes in the parallel implementation. The costs listed for the various node activations and the cost per wine-change are in milliseconds, and correspond to a machine that executes one million register-register instructions per second (the relative costs of other instructions are shown in Table 7-1). 62Thecorresponding datafortracesusingbinaryRetenetworksarepresentedinSection8.6. 112 PARALLELISM IN PRODUCTION SYSTEMS "Fable8-1: Uniprocessor Execution With No Overheads: Part-A F_ture _ vtlin _ _ mud.lin 1. root-per-ch, avg cost, sd 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 2. etst-per-ch, avg cost, sd 2Z92, .014, .025 22.92, .014, .024 24.48, .011, .029 41.96, .011, .017 41.96, Oll, .017 3. and-per-ch, avg cost, sd 25.96, .086, .357 24.02, .046, .025 26.59, .050, .039 25.95, .076, .150 24.62, .058, .051 4. not-per-oh, avg cost, sd 5.01, .049, .039 4.65, .049, .038 5.84, .051, 033 5.79, .059, .044 5.79, .059, .044 5. term-per-ch, avg cost, sd 1.79, .028, .000 1.03, .028, .000 2.06, .028, 000 3.69, .028, .000 3.69, .028, .080 6. cost per wme-ch (ms) 1.656 1.346 1.646 2.255 2.019 7. speed (wme-eh/sec) 603.9 742.9 607.5 443.5 495.3 Table 8-2: Uniprocessor Execution With No Overheads: Part-B 1. root-per-eh, avg cost, sd 1.0, .010, .000 1.0, .0]0, .000 1.0, .010, .000 1.0, .010, .000 1.0, .010, .000 2. etst-per-ch, avg cost, sd 7.14, .034, .092 5.05, .026, .059 5.07, .026, .059 3.97, .025, .042 3.99, .025, .041 3. and-per-ch, avg cost, sd 39.41, .052, .054 24.58, .059, .048 24.68, .060, .049 23.56, .071, .054 34.56, .088, .079 4. not-per-oh, avg cost, sd 3.97, .057, .037 2.63, .067, .016 9_85,.068, .015 0.75, .062, .022 0.76, .062, .022 5. term-per-ch, avg cost, sd 1.65, .028, .000 0.55, .028, .000 0.55, .028, .000 0.74, .028, .000 0.78, .028, .000 6. cost per wrne-eh (ms) 1.911 1.420 1.458 1.616 2.985 7. speed (wme-ch/sec) 523.3 704.2 685.9 618.8 335.0 The data in Tables 8-1 and 8-2 may be interpreted as follows. Lines 1-5 give the average number of node activations of each type per change to working memory, and the mean and standard deviation of cost per node activation of that type. 63 Line 6 gives the average cost of processing a workingmemory change, and line 7 gives the execution speed of production systems on a 1 MIPS uniprocessor. 
64 Using the data the following observations may be made: (1) The average speed of production systems on a uniprocessor is 589.1 wme-changes/sec, where the average is computed over vt.lin, ilog.lin, mud.lin, daa.lin, rls.lin, rlp.lin, eps.lin, and epp.lin traces. 65 (Note all data in the following sections of this chapter is also averaged over the this set of traces). (2) The performance of vt.lin is better than that of vto.lin by 23%, which implies that the seven productions removed from vto.lin were taking almost a quarter of the total processing time. (3) Similarly, the performance of 63The standard deviation for the cost of processing root-node activations and terminal-node activations is zero because these nodes do not perform any data dependent action. For more details about the actual code executed by these node activations, see Appendices Band C. 64Note that for each of the traces, the sum of the product of the first two entries in each of lines 1-5 is greater than the number given in line-& This is not an error. The sum computed from lines 1-5 does not take into account the sharing of memory nodes that takes place in an actual uniprocessor implementation. The savings in cost due to memory-node sharing are subtracted from the sum computed from lines 1-5 to obtain the cost listed in line-6. 65The traces vto.lin and mudo.lin have been excluded, because otherwise the VT and MUD systems would have had too much weight. SIMULATION RESULTSANDANALYSIS 113 mud.lin is better than that of mudo.lin by 12%, indicating that the two productions removed from mudo.lin were taking an eighth of the total processing time. (4) However, the more important effect of removing the seven productions from vto.lin and the two productions from mudo.lin is the large reduction in the standard deviation of the cost of processing and-node activations for the two systems. The standard deviation drops down from 0.357 ms for vto.lin to 0.025 ms for vt.lin. Similarly, the standard deviation drops down from 0.150 ms for mudo.lin to 0.051 ms for mud.lin. As a result, although the improvements in the uniprocessor speed are just 23% and 12% for vt.lin and mud.lin respectively, the improvements in the multiprocessor speed are much more significant, close to 200%-300% (see Section 8.5). Other interesting numbers that may be extracted from the tables are the average costs of processing various types of node activations. The average cost for processing constant-test node activations is .021 ms (equivalent to 21 register-register instructions66); for and-node activations is .060 ms; for not-node activations is .059 ms; and for terminal node activations is .028 ms. These numbers are indicative of the kinds of scheduling and synchronization overheads that may be tolerated in processing these activations on a parallel computer. Tables 8-3 and 8-4 present the data for the uniprocessor execution of production systems when the synchronization and scheduling overheads that occur in parallel implementations are taken into account. 67 The tables give the data for overheads corresponding to the use of node parallelism and intra-node parallelism. The data when production parallelism is used are given later. Lines 1-7 are to be interpreted in the same way as that for Tables 8-1 and 8-2. Line 8 gives the overhead in parallel implementations because of loss of sharing of memory nodes (see Section 5.2.2). 
Line 9 gives the loss in performance due to the synchronization and scheduling overhead, and line 10 gives the combined loss in performance due to all the overheads. The data shows that the average execution speed including all the overheads is 296.8 wme-changes/sec, that is a factor of 1.98 less than the speed when the overheads are not present. Thus a parallel implementation would perform better than a uniprocessor implementation. must recover this factor before it Another way of looking at it is that the maximum speed-up using k processors is not going to be better than k/1.98. The factor of 1.98 is composed of (1) a factor of 1..22 from loss of sharing of nodes and (2) a factor of 1.62 from the synchronization and scheduling overheads. 66Thecostismuchhigherthanthat forconstant-test nodesin theOPS83implementation becauseofthehashingtechniques beingused. 67Thememorycontentionoverheadsarenotincludedhere,butareconsideredin Section8.8. 114 PARALLELISMIN PRODUCTION SYSTEMS With the inclusion of overheads, The average cost of processing of and-nodes constant-test of 63%), and of terminal the increases be recovered are significant using a larger number of processors. More critical the final processing F_tu_ 1. root-per-ch,avg cost, sd 2. ctst-per-ch,avg cost, sd 3. and-per-ch, avg cost sd 4. not-per-ch, avg cost sd 5. term-per-ch,avg cost, sd 6. costper wme-ch(ms) 7. speed (wme-ch/see) 8. sharingoverhead 9. (sync+ sched) overhead 10.(sync+ sched+ shar) ovrhd be shared. nodes from the root-node node down, since that is what speed. Uniprocessof Execution With Overheads: Part-A Node Parallelism and Intra-Node Parallelism Uniprocessor Execution With Overheads: Part-B Node Parallelism and Intra-Node Parallelism or intra-node parallelism that were shared between This loss in performance and intra-node parallelism. distinct is used, when production productions due to lack of sharing above the loss due to lack of sharing of memory node parallelism than the cost of individual daa.lin _ _ _ 1.0,.019, .000 1.0,.019, .000 1.0,.019,.000 1.0,.019,.000 1.0,.019,.000 7.14,.044, .097 5.05,.034, .062 5.07,.034, .062 3.97,.030,.043 3.99,.030,.042 39.41,.088,.058 24.58,.098, .055 24.68,.098, .057 23.56,.112, .064 34.56,.129, .087 3.97,.098, .043 2.63,.103, .019 2.85,.103, .019 0.75, .099, .028 0.76, .098,.028 1.65, .043,.000 0.55,.043, .000 0.55, .043, .000 0.74, .043,.000 0.78, .043,.000 4.272 2.893 2.943 2.888 4.722 234.1 345.7 339.8 346.3 211.8 1.36 1.27 L25 1.15 1.08 1.65 1.61 L61 1.56 1.46 2.24 2.04 102 1.79 L58 when node parallelism used, two-input loss can vto.lin ._dmn _ _ mud.lin 1.0..019,.000 1.0,.019,.000 1.0,.019,.000 1.0,.019,.000 1.0,.019,.000 22.92,.021, .029 22.92,.020,.028 24.48,.019,.034 41.96,.019,.026 41.96,.019,.026 25.96,.121, .358 24.02,.081,.029 26.59,.087,.045 25.95,.113,.152 24.62,.095,.058 5.01,.086, .043 4.65, .085,.043 5.84, .089,.039 5.79,.096,.049 5.79, .096,.049 1.79,.043, .000 1.03,.043,.000 2.06,.043, .000 3.69,.043,.000 3.69, .043,.000 4.158 2.887 3.404 4.490 3.895 240.5 346.4 293.8 222.7 256.7 1.73 1.26 1.24 1.28 1.16 1.45 1.70 1.67 1.56 1.66 2.51 2.14 2.07 1.99 1.93 Table 8-4: F_ture 1. root-per-ch, avgeost, sd 2. ctst-per-ch,avg cost, sd 3. and-per-ch, avg cost, sd 4. not-perch, avg cost, sd 5. term-per-oh,avg cost, sd 6. costper wme-ch(ms) 7. speed (wme-ch/see) 8. sharingoverhead 9. (sync+ sched) overhead 10.(sync+ sched+ shar) ovrhd of 38%), goes up from .059 ms to they are not too bad, since much of the performance determines Table 8-3: also goes up. 
With the inclusion of overheads, the cost of processing node activations also goes up. The average cost of processing constant-test nodes goes up from .021 ms to .029 ms (an increase of 38%), of and-nodes from .060 ms to .098 ms (an increase of 63%), of not-nodes from .059 ms to .096 ms (an increase of 63%), and of terminal nodes from .028 ms to .043 ms (an increase of 54%). Although the increases are significant, they are not too bad, since much of the performance can be recovered using a larger number of processors. More critical than the cost of individual node activations is the cost of the longest chain of activations in the Rete network from the root node down, since that is what determines the final processing speed. Unlike production parallelism, when node parallelism or intra-node parallelism is used, activations of two-input nodes that were shared between distinct productions can no longer be shared. This loss in performance due to lack of sharing of two-input nodes is over and above the loss due to lack of sharing of memory nodes, a loss that is also encountered when using production parallelism.

Tables 8-5 and 8-6 give some of the data for systems executing on a uniprocessor when the overheads for using production parallelism are included. Lines 1 and 2 give the cost of processing a working-memory change and the overall speed of execution on a 1 MIPS processor. Line 3 gives the sharing overhead factor over a uniprocessor implementation with no sharing losses. Line 4 gives the extra loss due to sharing when using production parallelism over the loss due to sharing when using node parallelism (basically, line 3 of Tables 8-5 and 8-6 divided by the corresponding entries in line 8 of Tables 8-3 and 8-4). The data show that the average extra loss due to sharing when using production parallelism is a factor of 1.33. Line 5 gives the total loss factor over a uniprocessor implementation with no overheads. The average value of this loss factor for production parallelism is 2.64, as compared to 1.98 for node and intra-node parallelism.

Table 8-5: Uniprocessor Execution With Overheads: Part-A (Production Parallelism)

    Feature                          vto.lin   vt.lin   ilog.lin   mudo.lin   mud.lin
    1. cost per wme-ch (ms)          4.51      3.23     3.74       4.58       3.99
    2. speed (wme-ch/sec)            221.7     309.6    267.4      218.3      250.6
    3. sharing overhead              1.878     1.407    1.363      1.307      1.185
    4. extra sharing ovrhd           1.086     1.117    1.099      1.021      1.025
    5. (sync + sched + shar) ovrhd   2.72      2.40     2.27       2.03       1.98

Table 8-6: Uniprocessor Execution With Overheads: Part-B (Production Parallelism)

    Feature                          daa.lin   rls.lin   rlp.lin   eps.lin   epp.lin
    1. cost per wme-ch (ms)          5.80      5.30      5.36      3.62      5.43
    2. speed (wme-ch/sec)            172.4     188.7     186.6     276.2     184.2
    3. sharing overhead              1.845     2.328     2.278     1.442     1.241
    4. extra sharing ovrhd           1.357     1.833     1.822     1.254     1.149
    5. (sync + sched + shar) ovrhd   3.04      3.73      3.68      2.24      1.82

8.3. Production Parallelism

This section presents simulation results for the speed-up obtained using production parallelism on a multiprocessor. It is assumed that the multiprocessor has a hardware task scheduler associated with it; the results for the case when multiple software task queues are used are presented in Section 8.7. Figures 8-1, 8-2, and 8-3 show the graphs for speed-up when production parallelism is used without action parallelism, that is, when each change to working memory is processed to completion before the processing for the next change is begun. Figure 8-1 shows the nominal speed-up, Figure 8-2 shows the true speed-up, and Figure 8-3 gives the actual execution speed of the parallel implementation (in working-memory changes processed per second), assuming that the individual nodes in the multiprocessor work at a speed of 2 MIPS. [Footnote 68: The terms nominal speed-up and true speed-up were defined at the beginning of Section 8.2.] (We choose 2 MIPS as the processor speed to roughly match the current operating region of technology. Within limits set by the operating regions of other components of the multiprocessor system, the results can simply be scaled for other processor speeds.) To explain the nature of the graphs it is convenient to divide the curves into two regions.
Figures 8-1, 8-2, and 8-3 show the graphs for speed-up when production parallelism is used without action parallelism, that is, when each change to working memory is processed to completion before the processing for the next change is begun. Figure 8-1 shows the nominal speed-up, Figure 8-2 shows the true speed-up, and Figure 8-3 gives the actual execution speed of the parallel implementation (in working-memory changes processed per second), assuming that the individual nodes in the multiprocessor work at a speed of 2 MIPS.68 (We choose 2 MIPS as the processor speed to roughly match the current operating region of technology. Within limits set by the operating regions of other components of the multiprocessor system, the results can simply be scaled for other processor speeds.) To explain the nature of the graphs it is convenient to divide the curves into two regions. The first region, the active region, is where the overall speed-up is increasing significantly with an increase in the number of processors (say up to the 16-processor mark). The second region, the saturation region, corresponds to the portion where the curve is almost flat (beyond the 16-processor mark).

The saturation speed-up, or the maximum speed-up, available from production parallelism is primarily affected by the following factors:

(1) It is limited by the number of productions affected per change to working memory, that is, the number of productions whose state changes as a result of a working-memory change. For the traces under consideration the average size of the affect-sets is 26.3. The two curves at the bottom of Figure 8-1, representing eps.lin and epp.lin, have affect-set sizes of 12.1 and 11.9 respectively. The curve at the top, representing vt.lin, has an affect-set size of 31.2.

(2) The saturation speed-up is proportional to the ratio tavg/tmax, where tavg is the average cost of processing an affected production and tmax is the cost of processing the production requiring the most time in that cycle (also see Section 7.1.2, and the sketch following this list). For the curves shown in Figure 8-1, the average saturation nominal speed-up is 5.1, which is much smaller than the average size of the affect-sets. A large factor of almost 5 is lost due to the variation in the cost of processing affected productions.

(3) The effect of loss of sharing in the Rete network on the saturation speed-up is as follows. In the saturation region, when as many processors as needed are available, multiple activations corresponding to a single shared node in the original network are all processed in parallel. While this makes the nominal speed-up higher than it would have been if nodes were still shared, the true speed-up remains the same. This is because the true speed-up in the saturation region is primarily dependent on the longest chain (most expensive chain) of node activations, and loss of sharing only affects the branching factor of nodes in the chain, not the length of the longest chain.69

(4) The effect of the synchronization and scheduling overheads on the saturation speed-up is very complex. The portion of these overheads that simply increases the cost of the individual node activations does not affect the saturation nominal speed-up much, but it does affect the saturation true speed-up significantly. The portion of these overheads that requires serial processing (for example, the processing required for each node activation to be scheduled through a serial scheduler) can, however, significantly affect both the saturation nominal and saturation true speed-ups.

68 The terms nominal speed-up and true speed-up were defined in the beginning of Section 8.2.

69 The change in the branching factor does have second-order effects on the saturation speed-up. For example, if the branching factor is large, then by the time the node corresponding to the longest chain gets scheduled for parallel execution, a significant amount of time may have elapsed.
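A minimal sketch of factor (2) above (hypothetical processing times, not thesis code): with one processor per affected production, the cycle cannot finish before the most expensive production does, so the nominal speed-up saturates at the total work divided by the maximum cost.

    /* Saturation nominal speed-up for one recognize-act cycle:
     * (sum of processing times) / (max processing time),
     * which equals n * (tavg / tmax). */
    #include <stdio.h>

    double saturation_speedup(const double t[], int n) {
        double sum = 0.0, max = 0.0;
        for (int i = 0; i < n; i++) {
            sum += t[i];
            if (t[i] > max) max = t[i];
        }
        return sum / max;
    }

    int main(void) {
        /* 8 affected productions; one is 5x costlier than the rest. */
        double t[8] = { 1, 1, 1, 1, 1, 1, 1, 5 };
        printf("speed-up = %.2f\n", saturation_speedup(t, 8)); /* 2.40 */
        return 0;
    }

Even with an affect-set of size 8, the variation in cost limits the saturation speed-up to 2.4-fold in this example, which is the same phenomenon behind the factor of almost 5 lost by the traces above.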
Figure 8-1: Production parallelism (nominal speed-up).
Figure 8-2: Production parallelism (true speed-up).
Figure 8-3: Production parallelism (execution speed).

The speed-up in the active region of the curves, in addition to being limited by the factors affecting the saturation speed-up, is dependent on the following factors:

(1) The speed-up is obviously bounded by the number of processors in the system.

(2) The speed-up is reduced by the variation in the size of the affect-sets. The variation results in a loss of processor utilization because, within the same run, for some cycles there are too many processors (the excess processors remaining idle) and for some cycles there are too few processors (some processors have to process more than one production activation, while other processors are waiting for these to finish).

(3) The loss of sharing of nodes in the parallel implementations also affects the active region of the true speed-up curves. Since the maximum nominal speed-up is bounded by the number of processors, if the loss due to sharing is high, then the maximum true speed-up is correspondingly reduced. Thus if the loss of sharing increases the cumulative cost of processing all the node activations by a factor of 2, then using eight processors, no better than 4-fold true speed-up can be obtained.

(4) In the case of non-shared-memory multicomputers, the speed-up is greatly dependent on the quality of the partitioning, that is, the uniformity with which the work is distributed amongst the processors. The work in this thesis, however, does not address the issues in the implementation of production systems on non-shared-memory multicomputers. Some discussion can be found in [32, 66, 67].

There are several other observations that can be made from the curves in Figures 8-1, 8-2, and 8-3:

(1) The average nominal saturation speed-up is 5.1.70

(2) The average true saturation speed-up is only 1.9. In fact, for the epp.lin trace, the true speed-up never goes over 1.0; that is, the parallel implementation cannot even overcome the losses due to the overheads in the parallel implementation.

(3) The average saturation execution speed using 2 MIPS processors is around 2350 wme-changes/sec, or about 1000 prod-firings/sec.

(4) The saturation region of the speed-up curves is reached while using 16 processors or less. In fact, for most systems, using more than 8 processors does not seem to be advisable.

(5) The average loss of speed-up due to the overheads of the parallel implementation is a factor of 2.65. This suggests that if only production parallelism is to be used, then it may be better to use the partitioning approach (divide the production system into several parts and perform match for each part in parallel) rather than the centralized task queue approach suggested in this thesis.

70 Note that all average numbers reported in this section (unless otherwise stated) are computed over the traces vt.lin, ilog.lin, mud.lin, daa.lin, rls.lin, rlp.lin, eps.lin, and epp.lin. The traces vto.lin and mudo.lin are excluded because including them would have resulted in excess weight for the vt and mud production systems.
The advantage of the partitioning approach is that the synchronization and scheduling overheads are not present, although the sharing overheads are still present.

(6) Although the production-system programs considered have very different complexity and size, the larger programs do not appear to gain more from parallelism than the smaller ones. This is a consequence of the fact that the sizes of the affect-sets are quite independent of the sizes of the production-system programs.

At the level of individual production-system programs the following observations can be made:

(1) The Soar Eight-Puzzle program traces (eps.lin and epp.lin) are not doing well at all. The reasons for the low speed-up are the small size of the affect-sets (12.1 and 11.8 respectively) and the large variance in processing times. The large variance is a result of the long-chain effect and the cross-product effect discussed in Sections 5.2.6 and 4.2.3.

(2) The difference in the speed-ups obtained by the vt.lin and vto.lin systems is quite large, about a factor of 2. Since the sizes of the affect-sets for vt.lin and vto.lin are about the same, this shows that the removal of the 7 productions from vto.lin significantly reduces the variation in the processing times. The difference in the speed-ups achieved by the mud.lin and mudo.lin systems is, however, not very large. This is because the productions that were removed did not have a very high cost compared to the processing required by the rest of the affected productions. The difference in the speed-up obtained by mud.lin and mudo.lin becomes more significant when intra-node parallelism is used (see Figure 8-13), which suggests that the two removed productions contained node activations that could not be processed in parallel using production parallelism or node parallelism but that could be processed in parallel using intra-node parallelism.

8.3.1. Effects of Action Parallelism on Production Parallelism

Figures 8-4, 8-5, and 8-6 present speed-up data for the case when both production and action parallelism are used. Figure 8-4 presents data about nominal speed-up, Figure 8-5 presents data about true speed-up, and Figure 8-6 presents data about the execution speed. Comparing the graphs for speed-up with action parallelism with those without action parallelism, the following observations can be made:

(1) Some systems (vt.lin, rls.lin, rlp.lin, and eps.lin) show a significant increase in the speed-up with the use of action parallelism, while other systems (vto.lin, ilog.lin, mud.lin, mudo.lin, daa.lin, and epp.lin) show very little extra speed-up. The main reason why some of the systems show little extra speed-up is that the affect-sets corresponding to the multiple changes have large amounts of overlap. For example, if the production taking the longest time to process is affected by each of the changes to the working memory, then no extra speed-up is obtained from processing such changes in parallel. (This is exactly what happens in the case of the epp.lin trace.) The second limitation is imposed by the number of changes made to working memory per production firing (per phase for Soar systems). This number itself is quite small for the OPS5 systems.

(2) For the systems that show significant improvement, the number of processors needed to get to the saturation point goes up from 8-16 processors to 16-32 processors.

(3) The average nominal saturation speed-up goes up from 5.1 to 7.6, an improvement by a factor of 1.50.
(4) The average saturation execution speed correspondingly goes up from 2350 to 3550 wme-changes/sec.

Figure 8-4: Production and action parallelism (nominal speed-up).
Figure 8-5: Production and action parallelism (true speed-up).
Figure 8-6: Production and action parallelism (execution speed).

8.4. Node Parallelism

Figures 8-7, 8-8, and 8-9 present the data when node parallelism is used to speed up the execution of production systems. Some of the computed statistics are:

(1) The average saturation nominal speed-up is 5.8. This is only a factor of 1.14 better than the corresponding speed-up when production parallelism is used.

(2) The average true saturation speed-up is 2.9, which is a factor of 1.50 better than the corresponding speed-up when only production parallelism is used. The improvement in the true speed-up is larger than the improvement in the nominal speed-up because the overheads when exploiting node parallelism are less than the overheads when exploiting production parallelism. The average overhead when using node parallelism is 1.98 as compared to 2.65 when using production parallelism.

(3) The average saturation execution speed using node parallelism is 3500 wme-changes/sec as compared to 2350 wme-changes/sec when using production parallelism.

(4) The number of processors needed to obtain the saturation speed-up is still around 8-16.

Studying the nominal speed-up graphs for production and node parallelism shows that systems which achieved relatively high speed-ups with production parallelism (vt.lin, ilog.lin, daa.lin, rls.lin, rlp.lin) do not benefit much from node parallelism. Systems which did poorly using production parallelism, however, show a more marked improvement. This suggests that the systems that were doing well with production parallelism did not suffer from multiple node activations corresponding to the same production when a change to working memory was made, while many of the systems that performed poorly did suffer from such problems.

8.4.1. Effects of Action Parallelism on Node Parallelism

Figures 8-10, 8-11, and 8-12 show the speed-ups for the case when both node parallelism and action parallelism are used. The average statistics are as follows:

(1) The saturation nominal speed-up is 10.7 as compared to 5.8 for node parallelism alone (a factor of 1.84).

(2) The saturation true speed-up is 5.4 as compared to 2.9 for node parallelism alone.

(3) The average saturation execution speed is 6600 wme-changes/sec as compared to 3500 wme-changes/sec for node parallelism alone.

(4) For most systems it appears that 16 processors are sufficient, though for vt.lin, rls.lin, and rlp.lin it seems appropriate to use 32 processors.
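The overlap effect described in Section 8.3.1 can be illustrated with a crude model (this simplification is my own reading of the discussion above: match work for the concurrent changes proceeds in parallel, but the activations belonging to any one production are serialized):

    /* Simplified model of action parallelism with overlapping
     * affect-sets: the cycle time is bounded by the production with
     * the largest total work summed over the changes affecting it. */
    #include <stdio.h>

    #define NPROD    3
    #define NCHANGES 4

    int main(void) {
        /* cost[p][c]: time production p spends on change c (0 = unaffected) */
        double cost[NPROD][NCHANGES] = {
            { 2, 2, 2, 2 },  /* affected by every change: the bottleneck */
            { 1, 0, 0, 0 },
            { 0, 1, 1, 0 },
        };
        double total = 0.0, critical = 0.0;
        for (int p = 0; p < NPROD; p++) {
            double per_prod = 0.0;
            for (int c = 0; c < NCHANGES; c++) {
                per_prod += cost[p][c];
                total    += cost[p][c];
            }
            if (per_prod > critical) critical = per_prod;
        }
        /* 11/8 = 1.375: overlap wipes out most of the expected gain. */
        printf("speed-up = %.3f\n", total / critical);
        return 0;
    }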
Figure 8-7: Node parallelism (nominal speed-up).
Figure 8-8: Node parallelism (true speed-up).
Figure 8-9: Node parallelism (execution speed).
Figure 8-10: Node and action parallelism (nominal speed-up).
Figure 8-11: Node and action parallelism (true speed-up).
Figure 8-12: Node and action parallelism (execution speed).

It is interesting to note that although the sizes of the affect-sets with action parallelism for rls.lin and rlp.lin are 128.9 and 145.8 respectively, the saturation nominal speed-up is only around 14-fold. This indicates that a factor of almost 10 is still getting lost due to variations in the processing cost of the affected productions. One source of such variation is long chains of dependent node activations, especially since R1-Soar has several productions with large numbers of condition elements. This problem is dealt with in Section 8.6, where binary Rete networks are considered. The other source of variation is multiple activations of the same node in the network (this may happen due to the cross-product effect as discussed in Section 4.2.3, or due to multiple changes causing activations of the same node in the Rete network). Since multiple activations of the same node cannot be processed in parallel using node parallelism, they all have to be processed sequentially. The use of intra-node parallelism, which is discussed next, addresses some of these problems.

8.5. Intra-Node Parallelism

Figures 8-13, 8-14, and 8-15 show the speed-up data when intra-node parallelism is used. The average statistics when using intra-node parallelism are:

(1) The average saturation nominal speed-up is 7.6 as compared to 5.8 when using node parallelism and 5.1 when using production parallelism.

(2) The average saturation true speed-up is 3.9 as compared to 2.9 for node parallelism and 1.9 for production parallelism.

(3) The average saturation execution speed is 4460 wme-changes/sec as compared to 3500 when using node parallelism and 2350 when using production parallelism.

It is interesting to observe that the curve for epp.lin has made a sudden upward jump, so that the saturation nominal speed-up for epp.lin is now 7.8 as compared to 3.2 when node parallelism is used and 1.8 when production parallelism is used.
This sudden increase can be explained as follows. In the epp.lin system the occurrence of cross-products is very common. While it was not possible to process the cross-products in parallel using production parallelism or node parallelism, it is possible to do so using intra-node parallelism; hence the large increase in the speed-up. On the other hand, the curve for vto.lin shows no extra speed-up at all over what could be achieved using production parallelism. The reason for the low speed-up is that a specific node in the system is affected by all the changes to the working memory, and this node activation takes a very long time to process. Since only a single node activation is involved in the bottleneck, the use of parallelism does not help, unless multiple processors are allocated to process that single node activation. There is some work going on in this direction, though it is not discussed in this thesis [23].

A general point that emerges from the discussion of the various sources of parallelism is that different production systems pose different bottlenecks to the use of parallelism. While for some programs production parallelism is sufficient (sufficient in the sense that most of the speed-up that is to be gained from using parallelism is obtained by using production parallelism alone), others need to use node parallelism or intra-node parallelism to obtain the full benefits from parallelism. Since the move to finer granularity does not impose any extra overheads, the fine-granularity scheme of intra-node parallelism seems to be the scheme of choice.

Figure 8-13: Intra-node parallelism (nominal speed-up).
Figure 8-14: Intra-node parallelism (true speed-up).
Figure 8-15: Intra-node parallelism (execution speed).

8.5.1. Effects of Action Parallelism on Intra-Node Parallelism

Figures 8-16, 8-17, and 8-18 present the speed-up data for the case when both intra-node parallelism and action parallelism are used. The average statistics for this case are as follows: (1) The average saturation nominal speed-up is 19.3 as compared to 7.6 when only intra-node parallelism is used. (2) The average saturation true speed-up is 10.2. (3) The average saturation execution speed is 11,260 wme-changes/sec. This corresponds to less than 100 μs for match per working-memory change. It is interesting to note that the highest speed-ups (with the exception of eps.lin) are obtained by the Soar programs. This is mainly because of the large number of working-memory changes (an average of 12.25 changes) that are processed in parallel for these systems.
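A minimal numerical sketch (hypothetical costs, my own illustration) of why intra-node parallelism helps with cross-products and with multiple activations of the same node:

    /* Node parallelism must serialize the m pending activations of a
     * single two-input node, so that node contributes m*t to the
     * critical path; intra-node parallelism runs them concurrently,
     * contributing only ~t (given at least m processors). */
    #include <stdio.h>

    int main(void) {
        int m = 16;       /* cross-product: 16 activations of one node */
        double t = 0.05;  /* ms per activation */
        double node_par  = m * t;  /* serialized: 0.80 ms */
        double intra_par = t;      /* in parallel: 0.05 ms */
        printf("node: %.2f ms, intra-node: %.2f ms (%.0fx better)\n",
               node_par, intra_par, node_par / intra_par);
        return 0;
    }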
Systems like ilog.lin, daa.lin, and vt.lin, which did relatively well when action parallelism was not used, fall behind due to the small number of working-memory changes (an average of 2.44 changes) processed every cycle. Finally, barring a few systems like rls.lin, rlp.lin, and epp.lin, which could use 64 processors, most systems seem to be able to use only about 32 processors effectively.

A summary of the speed-ups obtained using the various sources of parallelism is given in Figures 8-19, 8-20, and 8-21. The curves represent the average nominal speed-up, the average true speed-up, and the average execution speed, with the averages computed over the various production-system traces. As expected, the use of intra-node and action parallelism results in the most speed-up, followed by node and action parallelism, followed by intra-node parallelism alone, with the rest clustered below.

Figure 8-16: Intra-node and action parallelism (nominal speed-up).
Figure 8-17: Intra-node and action parallelism (true speed-up).
Figure 8-18: Intra-node and action parallelism (execution speed).
Figure 8-19: Average nominal speed-up.
Figure 8-20: Average true speed-up.
Figure 8-21: Average execution speed.

8.6. Linear vs. Binary Rete Networks

As discussed in Section 5.2.6, there are several advantages of using binary networks instead of linear networks for productions. The main advantage, however, is that doing so reduces the maximum length that a chain of dependent node activations may have. This section presents simulation results corresponding to production-system runs in which binary networks were used. As was done for linear networks, results are first presented for the binary-network runs on uniprocessors. These data are then used to calibrate the performance of the runs on multiprocessors.
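The chain-length argument can be made concrete (a sketch under stated assumptions, mine rather than the thesis's: a perfectly balanced binary network, whereas the runs below actually keep some productions in linear form):

    /* Depth of the chain of dependent two-input nodes for a production
     * with c condition elements: a linear network joins condition
     * elements one at a time (depth c-1); a balanced binary network
     * joins pairs, giving depth ceil(log2(c)). Compile with -lm. */
    #include <stdio.h>
    #include <math.h>

    int linear_depth(int c) { return c - 1; }
    int binary_depth(int c) { return (int)ceil(log2((double)c)); }

    int main(void) {
        /* Soar productions often have many condition elements; OPS5 few. */
        for (int c = 2; c <= 32; c *= 2)
            printf("c=%2d  linear=%2d  binary=%d\n",
                   c, linear_depth(c), binary_depth(c));
        return 0;
    }

For a production with 32 condition elements the dependent chain shrinks from 31 joins to 5, which is why the binary scheme pays off for Soar programs, as the results below show.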
8.6.1. Uniprocessor Implementations with Binary Networks

Tables 8-7 and 8-8 present information about the cost of uniprocessor runs when binary networks are used and when the overheads associated with parallel implementations are eliminated. Lines 1-5 give the number of node activations of each node type and the average cost (in milliseconds) per activation on a 1 MIPS processor. Line 6 gives the cost per wme-change (in milliseconds). Line 7 gives the execution speed in wme-changes per second when binary networks are used, and line 8 gives the same information when linear networks are used (line 8 is a copy of line 7 in Tables 8-1 and 8-2). Line 9 gives the ratio of the uniprocessor speeds when binary and linear networks are used.

An important observation that can be made from the data presented in Tables 8-7 and 8-8 is that for all systems other than epp.bin, the binary network version is slower than the linear network version. The average speed decreases by a factor of 1.36. This is because there are many more and-node activations in the binary network version, which in turn is caused by a larger number of and-nodes with no filtering tests (see Section 5.2.6). In fact, in the results reported in this section for the binary network case, there were some productions in each of the programs that were retained in their linear network form to avoid the blow-up of state caused by the binary network form. If this had not been done, the state would have grown so much that the lisp system would crash. Thus, to perform better than the parallel linear-network implementations, the parallel binary-network implementations have to recover the performance they lose due to the extra node activations.

Tables 8-9 and 8-10 present the overheads in binary networks due to the parallel implementation when node parallelism or intra-node parallelism is used. The average overhead due to the parallel implementation is a factor of 1.84, as compared to 1.98 when linear networks are used. Similarly, Tables 8-11 and 8-12 present the overheads due to the parallel implementation when production parallelism is used. The average overhead in this case is 2.52, as compared to 2.64 when linear networks are used.

Table 8-7: Uniprocessor Execution With No Overheads: Part-A

Feature                     vto.bin      vt.bin       ilog.bin     mudo.bin     mud.bin
1. root-per-ch, avg cost    1.0, .010    1.0, .010    1.0, .010    1.0, .010    1.0, .010
2. ctst-per-ch, avg cost    22.92, .014  22.92, .013  24.48, .013  41.96, .011  41.96, .011
3. and-per-ch, avg cost     27.82, .093  25.88, .057  28.90, .059  31.71, .077  30.38, .062
4. not-per-ch, avg cost     5.01, .049   4.65, .049   5.84, .051   5.79, .059   5.79, .059
5. term-per-ch, avg cost    1.79, .028   1.03, .028   2.06, .028   3.69, .028   3.69, .028
6. cost per wme-ch (ms)     1.964        1.645        2.033        2.664        2.438
7. bin-speed (wme-ch/sec)   509.2        607.8        491.9        375.4        410.2
8. lin-speed (wme-ch/sec)   603.9        742.9        607.5        443.5        495.3
9. bin-speed/lin-speed      0.84         0.82         0.81         0.85         0.83

Table 8-8: Uniprocessor Execution With No Overheads: Part-B

Feature                     daa.bin      rls.bin      rlp.bin      eps.bin      epp.bin
1. root-per-ch, avg cost    1.0, .010    1.0, .010    1.0, .010    1.0, .010    1.0, .010
2. ctst-per-ch, avg cost    7.14, .026   5.05, .021   5.07, .021   3.97, .019   3.99, .018
3. and-per-ch, avg cost     36.30, .066  33.37, .078  37.44, .080  28.98, .086  30.05, .087
4. not-per-ch, avg cost     3.97, .062   2.60, .067   2.87, .068   1.04, .069   1.24, .070
5. term-per-ch, avg cost    1.65, .028   0.55, .028   0.55, .028   0.74, .028   0.78, .028
6. cost per wme-ch (ms)     2.248        2.458        2.855        2.415        2.540
7. bin-speed (wme-ch/sec)   444.8        406.8        350.3        414.2        393.8
8. lin-speed (wme-ch/sec)   523.3        704.2        685.9        618.8        335.0
9. bin-speed/lin-speed      0.85         0.58         0.51         0.67         1.18
Table 8-9: Uniprocessor Execution With Overheads: Part-A
Node Parallelism and Intra-Node Parallelism

Feature                         vto.bin   vt.bin    ilog.bin  mudo.bin  mud.bin
1. cost per wme-ch (ms)         4.604     3.333     3.896     5.145     4.551
2. speed (wme-ch/sec)           217.2     300.1     256.7     194.4     219.7
3. (sync + sched + shar) ovrhd  2.344     2.026     1.916     1.931     1.867

Table 8-10: Uniprocessor Execution With Overheads: Part-B
Node Parallelism and Intra-Node Parallelism

Feature                         daa.bin   rls.bin   rlp.bin   eps.bin   epp.bin
1. cost per wme-ch (ms)         4.520     4.425     5.032     4.012     4.190
2. speed (wme-ch/sec)           221.2     225.9     198.8     249.3     238.7
3. (sync + sched + shar) ovrhd  2.011     1.800     1.762     1.661     1.650

Table 8-11: Uniprocessor Execution With Overheads: Part-A
Production Parallelism

Feature                         vto.bin   vt.bin    ilog.bin  mudo.bin  mud.bin
1. cost per wme-ch (ms)         5.206     3.915     4.724     5.321     4.727
2. speed (wme-ch/sec)           198.8     255.4     211.7     187.9     211.5
3. (sync + sched + shar) ovrhd  2.649     2.381     2.322     1.997     1.938

Table 8-12: Uniprocessor Execution With Overheads: Part-B
Production Parallelism

Feature                         daa.bin   rls.bin   rlp.bin   eps.bin   epp.bin
1. cost per wme-ch (ms)         7.316     7.627     8.282     5.233     5.393
2. speed (wme-ch/sec)           136.7     131.1     120.7     191.1     185.4
3. (sync + sched + shar) ovrhd  3.254     3.103     2.900     2.166     2.124

8.6.2. Results of Parallelism with Binary Networks

Results for the speed-up from parallelism for binary networks are presented in Figures 8-22 through 8-25 for production parallelism, in Figures 8-26 through 8-29 for node parallelism, and in Figures 8-30 through 8-33 for intra-node parallelism. Results about average speed-up are presented in Figures 8-34 and 8-35. These graphs are to be interpreted in exactly the same way as the graphs presented earlier for linear networks.

It is interesting to compare the average saturation speed-ups for the linear-network case and the binary-network case. The results are shown in Table 8-13; in each column pair, the data on the left is for the binary-network case and the data on the right is for the linear-network case. As can be seen from the table, most of the time the average saturation nominal speed-up obtained using binary networks is higher than that obtained using linear networks. However, most of the time the average saturation execution speed (given in wme-changes/sec) obtained using binary networks is lower than that obtained using linear networks. The answer to this apparent contradiction lies in the fact that programs using a binary network execute at less speed on a uniprocessor than when using a linear network, as discussed in Section 8.6.1. Looking separately at OPS5 systems and Soar systems, while the linear-network scheme seems to be more suitable for OPS5 systems, the binary-network scheme seems to be more suitable for Soar systems. This is because in OPS5 systems, where the average number of condition elements per production is small, long chains of dependent node activations do not arise, so binary networks do not help. In Soar systems, where the average number of condition elements per production is much higher than in OPS5, the long chains are a bottleneck in the exploitation of parallelism, and for that reason using the binary-network scheme helps significantly.
Table 8-13: Comparison of Linear and Binary Network Rete

Sources of Parallelism                  Average Saturation      Average Saturation
                                        Nominal Speed-up        Execution Speed
                                        (binary, linear)        (binary, linear)
1. Production Parallelism               5.3, 5.1                1870, 2350
2. Production and Action Parallelism    8.3, 7.6                2850, 3550
3. Node Parallelism                     5.6, 5.8                2680, 3490
4. Node and Action Parallelism          11.6, 10.7              5400, 6600
5. Intra-Node Parallelism               8.0, 7.6                3480, 4460
6. Intra-Node and Action Parallelism    25.8, 19.3              12020, 11260

Figure 8-22: Production parallelism (nominal speed-up).
Figure 8-23: Production parallelism (execution speed).
Figure 8-24: Production and action parallelism (nominal speed-up).
Figure 8-25: Production and action parallelism (execution speed).
Figure 8-26: Node parallelism (nominal speed-up).
Figure 8-27: Node parallelism (execution speed).
Figure 8-28: Node and action parallelism (nominal speed-up).
Figure 8-29: Node and action parallelism (execution speed).
Figure 8-30: Intra-node parallelism (nominal speed-up).
Figure 8-31: Intra-node parallelism (execution speed).
Figure 8-32: Intra-node and action parallelism (nominal speed-up).
Figure 8-33: Intra-node and action parallelism (execution speed).
Figure 8-34: Average nominal speed-up.
Figure 8-35: Average execution speed.

8.7. Hardware Task Scheduler vs. Software Task Queues

This section presents simulation results for the case when several software task queues are used instead of a hardware task scheduler. The software task queues are modeled in the parallel implementation as suggested in Section 6.2. The actual code from which the cost model used in the simulator is derived is given in Appendix B.2.

It is interesting to consider the overheads of the parallel implementation with respect to the uniprocessor execution with no such overheads. When software task queues are used, the average overhead when exploiting intra-node parallelism is a factor of 3.23, when exploiting node parallelism it is a factor of 2.96, and when exploiting production parallelism it is a factor of 3.70.71 The corresponding numbers when a hardware task scheduler is being used are 1.98, 1.98, and 2.64 respectively. The overheads with software task queues are significantly larger because of the much larger costs of enqueueing and dequeueing node activations from the software task queues. In the proposed implementation, enqueueing a node activation requires the task queue to be locked for a duration corresponding to about 13 register-register instructions,72 as compared to less than 1 such instruction when a hardware task scheduler is used. Similarly, dequeueing a node activation requires the task queue to be locked for a duration corresponding to about 44 register-register instructions,73 as compared to less than 1 such instruction with the hardware task scheduler. Note that the dequeueing cost is similar in magnitude to the actual cost of processing a node activation. Although the task queues have to be locked for these relatively long durations, reasonable performance can be obtained because many task queues are used.

Once the decision to use software task queues has been made, the next question that arises is "Given some number of processors, how many task queues should be used?". Figure 8-36 plots the performance when the number of processors is fixed at 32 and the number of task queues is varied between 4 and 64.
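A minimal sketch of such a software task queue (my own illustration in C with pthreads; the actual code used to derive the simulator's cost model is in Appendix B.2): the entire enqueue and dequeue bodies execute with the queue locked, corresponding to the roughly 13- and 44-instruction critical sections discussed above.

    #include <pthread.h>

    typedef struct { int node_id; void *token; } Activation;

    typedef struct {
        pthread_mutex_t lock;
        Activation items[1024];
        int count;
    } TaskQueue;

    void tq_init(TaskQueue *q) {
        pthread_mutex_init(&q->lock, NULL);
        q->count = 0;
    }

    /* Enqueue: the shorter critical section (~13 reg-reg instructions
     * in the cost model above). */
    int tq_push(TaskQueue *q, Activation a) {
        pthread_mutex_lock(&q->lock);
        int ok = q->count < 1024;
        if (ok) q->items[q->count++] = a;
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

    /* Dequeue: the longer critical section (~44 reg-reg instructions),
     * comparable to the cost of processing a node activation itself. */
    int tq_pop(TaskQueue *q, Activation *out) {
        pthread_mutex_lock(&q->lock);
        int ok = q->count > 0;
        if (ok) *out = q->items[--q->count];
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

With many such queues spread across the machine, an idle processor simply scans queues until a tq_pop succeeds; it is this spreading of lock traffic that keeps the long critical sections from serializing the whole match.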
The curves show that when the number of task queues is small (4-16), the execution speed is quite sensitive to this number and increases rapidly with an increase in the number of task queues. When the number of task queues is large (32-64), the execution speed is not sensitive to this number and it decreases slowly with an increase in the number of task queues. In between these two regions, when there are 16-32 schedulers, the execution speed reaches a maximum and then slowly drops. These observations can be explained as follows. When the number of task queues is small, the enqueueing and dequeueing of activations from the task queues is a bottleneck; thus the low execution speed. Also as a result, provided that there is enough intrinsic parallelism in the programs (unlike ilog.lin, mud.lin, and daa.lin), the performance is almost proportional to the number of task queues that are present, which explains the large slope when the number of task queues is increased. As the number of task queues is increased further, the performance peaks at some point determined by the intrinsic parallelism available in the program and the costs associated with using the task queues. Beyond this point, the effect of a still larger number of task queues is that the processor has to look up several task queues before it finds a node activation to process (as there are an excessive number of task queues, many of them are empty). Since the cost of looking up an extra task queue to see if it is empty is quite small, the slope in this region of the curves is small. All the results presented later in this section assume that the number of task queues present is half the number of processors. This number was found to be quite reasonable empirically, in that beyond this number of task queues the performance does not increase or decrease significantly.

71 The overheads were determined by dividing the cost of the parallel version running on a single processor with a single software task queue by the cost of the uniprocessor version with no overheads due to parallelism (as described in Tables 8-1 and 8-2).

72 It is actually composed of 2 synchronization instructions, 3 memory-register instructions, 1 memory-memory-register instruction, and 1 branch instruction. Using the relative costs of instructions given in Table 7-1, the above instructions are equivalent to 12.5 register-register instructions.

73 It is actually composed of 4 synchronization instructions, 9 register-register instructions, 9 memory-register instructions, 1 memory-memory-register instruction, and 8 branch instructions. Using the relative costs of instructions given in Table 7-1, the above instructions are equivalent to 44 register-register instructions.

Figures 8-37 to 8-48 show the nominal speed-up and the execution speed obtained using software task queues and different sources of parallelism. The sources of parallelism range from production parallelism to intra-node and action parallelism. Data about the nominal speed-up and the execution speed averaged over the eight traces are presented in Figures 8-49 and 8-50. As can be observed from these graphs, the average saturation execution speed when using production and action parallelism is 2080 wme-changes/sec as compared to 3550 wme-changes/sec when using a hardware task scheduler. When using node and action parallelism, the saturation speed with software task queues is 3330 wme-changes/sec as compared to 6600 wme-changes/sec when using a hardware task scheduler.
When using intra-node and action parallelism, the saturation execution speed is 4700 wme-changes/sec as compared to 11,260 wme-changes/sec when using a hardware task scheduler. Thus on average, a factor of 1.7 to 2.4 is lost in performance when software task queues are used.

Figure 8-36: Effect of number of software task queues.
Figure 8-37: Production parallelism (nominal speed-up).
Figure 8-38: Production parallelism (execution speed).
Figure 8-39: Production and action parallelism (nominal speed-up).
Figure 8-40: Production and action parallelism (execution speed).
Figure 8-41: Node parallelism (nominal speed-up).
Figure 8-42: Node parallelism (execution speed).
Figure 8-43: Node and action parallelism (nominal speed-up).
Figure 8-44: Node and action parallelism (execution speed).
Figure 8-45: Intra-node parallelism (nominal speed-up).
Figure 8-46: Intra-node parallelism (execution speed).
Figure 8-47: Intra-node and action parallelism (nominal speed-up).
Figure 8-48: Intra-node and action parallelism (execution speed).
Figure 8-49: Average nominal speed-up.
Figure 8-50: Average execution speed.

8.8. Effects of Memory Contention

The simulation results that have been presented so far do not take the effects of memory contention into account. This was done in order to keep machine-specific data away from the analysis of parallelism. To be more specific, as technology changes and computer buses can handle higher bandwidth, as memories get faster, and as new cache-coherence strategies evolve, the performance lost due to memory contention will change. Not including memory contention in the simulations provides a measure of parallelism in the programs which is independent of such changes in technology and algorithms--somewhat like an upper-bound result.74 While it is important to know the intrinsic limits of speed-up from parallelism for programs, it is, of course, also interesting to know the actual speed-up that would be obtained on a real machine, which does have some losses due to memory contention. This section presents simulation results that include the memory contention overhead.

74 Note that some portions of the architecture, like the instruction set of the machine, had to be specified in more detail, because it would not have been possible to do the simulations otherwise.

The memory contention overhead is included in the simulation results as per the model described in Section 7.1.1.4. Since one would not design multiprocessors with 8, 16, 32, 64, or 128 processors in exactly the same way (for example, using buses with the same bandwidth to connect the processors to memory), different models are used to calculate the contention for multiprocessors with different numbers of processors. The multiprocessors are divided into three groups: those having 8 or 16 processors, those having 32 or 64 processors, and those having 128 processors (the simulations were run with only these discrete values for the number of processors). For the case when there are 8 or 16 processors, it is assumed that the number of memory modules is equal to the number of processors (8 or 16), the bus has a latency of 1 μs, the bus is time-multiplexed, and it can transfer 4 bytes of data every 100 ns. For the case when there are 32 or 64 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1 μs, the bus is time-multiplexed, and it can transfer 8 bytes of data every 100 ns. For the case when there are 128 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1 μs, the bus is time-multiplexed, and it can transfer 8 bytes of data every 50 ns. It is further assumed for all the cases that each processor has a speed of 2 MIPS and that the instruction mix consists of 50% register-register instructions and 50% memory-register instructions. Thus each processor, when active, generates 3 million memory references per second: 2 million references for instructions and 1 million references for data. The cache-hit ratio is uniformly assumed to be 0.85; that is, 85% of the memory accesses are satisfied by the cache and do not go to the bus.75
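A crude saturation model of these parameters (my own simplification; the thesis uses the more detailed queueing model of Section 7.1.1.4): each active processor generates 3 million references per second, 15% of which miss the cache and go to the bus.

    /* Processor efficiency limited by the bus's transfer capacity.
     * The 10M transfers/sec figure assumes the 32-64 processor
     * configuration (one transfer per 100 ns). */
    #include <stdio.h>

    double efficiency(int active, double bus_xfers_per_sec) {
        double miss_rate = 0.15;                   /* 1 - 0.85 hit ratio  */
        double refs = active * 3.0e6 * miss_rate;  /* bus accesses needed */
        return refs <= bus_xfers_per_sec ? 1.0 : bus_xfers_per_sec / refs;
    }

    int main(void) {
        for (int n = 8; n <= 64; n *= 2)
            printf("%2d active: efficiency %.2f\n", n, efficiency(n, 10.0e6));
        return 0;
    }

Under this crude model, efficiency stays at 1.0 up to about 22 active processors and then falls off as the bus saturates; the shape, though not the exact values, mirrors the simulated curves of Figure 8-51 below.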
The multiprocessors are divided into three groups: those having 8 or 16 processors, those having 32 or 64 processors, and those having 128 processors (the simulations were run with only these discrete values for number of processors). For the case when there are 8 or 16 processors, it is assumed that the number of memory modules is equal to the number of processors (8 or 16), the bus has a latency of 1/_s,the bus is time-multiplexed, and that it can transfer 4 bytes of data every lOOns. For the case when there are 32 or 64 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1/_s, the bus is time-multiplexed, and that it can transfer 8 bytes of data every lOOns. For the case when there are 128 processors, it is assumed that the number of memory modules is 32, the bus has a latency of 1/_s,the bus is time-multiplexed, and that it can transfer 8 bytes of data every 50ns. It is further assumed for all the cases that each processor has a speed of 2 MIPS and that the instruction mix consists of 50% register-register instructions and 50% memory-register instructions. Thus each processor when active generates 3 million memory references per second, 2 million references for the instructions and i million references for the data. The cache-hit-ratio is uniformly assumed to be 0.85, that is, 85% of the memory accesses are satisfied 74Note someportions ofthe architecture, like theinstruction set ofthemachine, hadtobespecified inmoredetail, because itwould nothave been possible todothesimulations otherwise. 147 SIMULATION RESULTS AND ANALYSIS by the cache and do not go to the bus.75 Figure 8-51 shows the processor efficiency for the above cases as a function of the number of processors that are active.76 o,_ ¢j __, 1.0_ .9 U 2 Q. .a Ap=8 oP=16 [] P = 32or64 OP= 128 .7 .6 •5 I 0 8 I 16 ! I I I I 24 32 40 48 56 I I I I ,,I 64 72 80 88 96 Number of Active Processors Figure 8-51: Processor efficiency as a function of number of active processors. Figures 8-52 and 8-53 present the nominal speed-up and the execution speed for production systems as predicted by the simulator when intra-node and action parallelism are used. The results are presented only for the intra-node and action parallelism case because the combination of these two sources results in the most speed-up, and also because these sources are most affected by the memory contention overhead. The graphs present data for the individual programs (curves with solid lines) and also the average statistics (curves with dotted lines). It is interesting to compare the average statistics curves in Figures 8-52 and 8-53 with the corresponding curves in Figures 8-19 and 8-21 (when memory contention overheads are not included). 75The cache-hit ratio of 85% is based on the following rough calculations. Since caeh processor executes at 2 MIPS, there are 2 million references per sec (MRS) generated for instructions and I MRS generated for data (assuming that only 50% of the instructions require memory references). Assuming 95% hit-ratio for code and 65% hit-ratio for data, the composite hit-ratio turns out to be 85%. 76The variable P in Figure 8-51 refers to the total number of processors (both active and inactive) in the multiproeessor system. 148 PARALLEHSM 1N PRODUCTION SYSTEMS The average nominal speed-up (concurrency or average number of active processors) with 128 processors with memory contention is 20.5 and without memory contention is 19.3 (an increase Of about 6%). 
The average execution speed with 128 processors with memory contention is 10950 wme-changes/sec and without memory contention is 11261 wme-changes/sec (a decrease of about 3%). The increase in the concurrency and decrease in the execution speed may be understood as follows. Since memory contention causes everything to slow down, the execution speed will obviously be lower. However, since all processors in the multiprocessor are not busy all of the time, the multiprocessor is able to compensate for these slower processors by using some processors that would not be used if memory contention was not present, thus overcoming some of the losses. Thus, although the cost of processing the production systems goes up by 9% due to memory contention, the decrease in the speed of the parallel implementation is only 3%. The remaining 6% is recovered by an increase in the concurrency. 149 SIMU1 ATION RESULTS AND ANALYSIS _40 36 32 tx vt,lin o ilog.lin o mud,lin D daa.lin O rls.lin $ rll0.1in # eps,lin @ epp.lin • avg-stats 28 o 12 8 4 0 8 16 24 32 40 48 56 64 72 Number of Processors Figure 8-52: Intra-node and action parallelism (nominal speed-up). 150 PARALI.,ELISMIN PRODUCTION SYSTEMS 20000 zx vt.lin o ilog.lin o mud.lin [] daa.lin 0 rls.lin $ rlp.lin \ 18000 _: tU "_ 16000 Processor Speed: 2 MIPS # 0 eps,lin epp,lin • avg-stats E _=14000 "o 12000 "" 10000 _j • ....-°'' ..... _..._#_____ sooo 0 8 Figure 8-53: 16 24 32 40 48 56 64 72 Number of Processors Intra-node and action parallelism (execution speed). SIMULATION RF.SULTS AND ANALYSIS 151 8.9. Summary In summary, the following observations can be made about the simulation results presented in this chapter: • When using a hardware task scheduler: o A parallel implementation has some intrinsic overheads as compared to a uniprocessor implementation. The overheads occur because of lack of sharing of memory nodes, synchronization costs and scheduling costs. Such overheads when using node or intra-node parallelism result in cost increase by a factor of 1.98, and when using production parallelism result in a cost increase by a factor of 2.64. The overheads are larger when using production parallelism because it is not possible to share two-input nodes between different productions. o The average execution speed of production systems on a uniprocessor (without considering the overheads of a parallel implementation) that executes two million register-register instructions per second is about 1180 wme-changes/sec. ' o The average saturation nominal speed-up (concurrency) obtained using production and action parallelism is 7.6, that using node and action parallelism is 10.7, and that using intra-node and action parallelism is 19.3. Using intra-node and action parallelism, the saturation execution speed is about 11,250 wme-changes/sec assuming a multiprocessor with 2 MIPS processors. The speed-up from parallelism is significantly lower when action parallelism is not used. For example, when intra-node parallelism is used, the saturation nominal speed-up is only 7.6 as compared to 19.3 when action parallelism is also used. o As a result of the larger number of changes made to working memory per cycle, the Soar systems show much larger benefits from action parallelism than OPS5 systems. For exarnple, when using intra-node parallelism, the speed-up for OPS5 systems increases by a factor of 1.84 as a result of action parallelism, while the speed-up for Soar systems increases by a factor of 3.30. 
o The simulations show that only 32-64 processors are needed to reach the saturation speed-up for most production systems. Thus a multiprocessor system with significantly more processors is not expected to provide any additional speed benefits. • When using binary Rete networks: o The average cost of executing a production system on a uniprocessor (with no overheads due to parallelism) goes up by a factor of 1.39 as compared to when using linear networks. This increase in cost is due to an increased number of node activations per change to working memory. The increased cost sometimes results in situations where the actual execution speed of a production system is less than that of its linear network counterpart, although the nominal speed-up achieved due to parallelism is more. o The average overhead due to parallelism when exploiting node or intra-node paral- 152 PARALLELISM IN PRODUCTION SYSTEMS lelism is a factor of 1.84 and that when exploiting production parallelism is a factor of 2.52. o The average saturation nominal spced-up (concurrency) obtained using production and action parallelism is 8.8, that using node and action parallelism is 11.6, and that using intra-node and action parallelism is 25.8. Using intra-node and action parallelism, the saturation execution speed is about 12,000 wme-changes/sec assuming a multiprocessor with 2 M IPS processors. o The benefits from using binary networks are much more significant for Soar systems than for OPS5 systems. In fact, the average saturation nominal speed-up for the OPS5 systems goes down by 16% (from 14.5 to 12.3) as a result of using binary networks, while the corresponding speed-up for the Soar systems goes up by 62% (from 24.1 to 39.2). The reason for the large increase is that there are several productions in the Soar systems which have very large number of condition elements (resulting in longer chains). o The above differences between the results for OPS5 systems and Soar systems suggests that there is no single strategy (binary or linear Rete networks) that is uniformly good for all production systems, and that the strategy to use should be determined individually for each production system. • When using software task queues: o The average overhead due to parallelism for intra-node parallelism is a factor of 3.23, for node parallelism is a factor of 2.96, and for production parallelism is a factor of 3.70. These factors are much larger than when a hardware task scheduler is used because of the large cost associated with enqueueing and dequeueing node activations from the task queues. o The saturation execution speed is 2080 wme-changes/sec when using production and action parallelism (as compared to 3550 when using a hardware task scheduler), 3330 wme-changes/sec when using node and action parallelism (as compared to 6600), and 4700 wme-changes/sec when using intra-node and action parallelism (as compared to 11,260). Thus the performance loss when using software task queues is a factor between 1.7 and 2.4. o Simulations show that the performance of a scheme using software task queues is best when the number of queues is approximately equal to the number of processors. • When memory contention overheads are included: o The memory contention overheads were studied in the simulations by assuming different processor memory interconnect bandwidths for multiprocessors with different number of processors. 
Chapter Nine
Related Work

Research on exploiting parallelism to speed up the execution of production systems is not very new, but the efforts have gained significant momentum recently. This gain in momentum has been caused by several factors: (1) Larger and larger production systems are slowly emerging, and their limited execution speed is becoming more noticeable. (2) With the increase in the popularity of expert systems, there has been a movement to use expert systems in new domains, some of which require very high performance (for example, real-time control applications). (3) There is a general feeling that the necessary speed-up is not going to come from improvements in hardware technology alone. (4) Finally, production systems, at least on the surface, appear to be highly amenable to the use of large amounts of parallelism, and this has encouraged researchers to explore parallelism. This chapter briefly describes some of the early and more recent efforts to speed up the execution of production systems through parallelism.

9.1. Implementing Production Systems on C.mmp

For his thesis [54], Donald McCracken implemented a production-system version of the Hearsay-II speech understanding system [16] on the C.mmp multiprocessor [53].77 He showed that high degrees of parallelism could be obtained using a shared-memory multiprocessor; one of his simulations showed that it was possible to keep 50 processors busy 60% of the time during the match phase. Most of the results presented in his thesis, however, are not applicable to the current research. This is because:

• The characteristics of the Hearsay-II system are distinct enough from current OPS5 and Soar programs that the results cannot be carried over.

• The speed-up obtained from parallel implementations of production systems is dependent on the underlying computation model. For example, it depends on the quality of the underlying match algorithm. If the underlying match algorithm is naive, it is possible to obtain a very large amount of speed-up from parallelism. Since the basic match algorithm used by Don McCracken in his thesis is significantly different from the OPS5-Rete algorithm, it is not possible to interpret his results in the current context.

• McCracken's thesis addresses issues related to the parallel implementation of Hearsay-II on the C.mmp architecture. Because of the differences in the hardware structure of the PSM considered in this thesis and C.mmp, many issues that were evaluated for C.mmp have no relevance for the PSM.

77 The C.mmp multiprocessor consisted of 16 PDP-11 processors connected to shared memory via a crossbar switch.
For example, since processors in C.mmp did not have caches, the performance of the parallel implementation was considerably affected by how the code was distributed among the multiple memory modules. In the PSM, since processors have local memory and a cache, the distribution of code is not an issue.

9.2. Implementing Production Systems on Illiac-IV

Charles Forgy, in one of his papers [18], has studied the problem of implementing production systems on the Illiac-IV computer system [7]. Since the Illiac-IV is a SIMD (single instruction stream, multiple data streams) computer, the main concern was to develop a match algorithm in which all processors simultaneously execute the same instructions on different data to achieve higher performance. In the algorithm described in the paper, a production system is initially divided into sixty-four partitions, corresponding to the number of processors in Illiac-IV. The paper does not detail the partitioning technique, but suggests that it should be such that similar productions are placed in different partitions. This is to ensure that the work is uniformly distributed amongst the processors. The Rete network for each partition is constructed and the associated code is placed in the corresponding processor. The network interpreter then executes the code in a manner similar to the uniprocessor interpreters, but with one exception: all node evaluations of one type are executed before the node evaluations of another type are begun. For example, the interpreter will finish executing all constant-test nodes before attempting to execute any memory nodes. This ensures that for most of the time all processors are executing nodes of the same type, and since nodes of the same type require the same instructions (though they may use different data), the SIMD nature of Illiac-IV is usefully exploited. Although the paper describes the algorithms for executing production systems on Illiac-IV in detail, no estimates are given for the expected speed-up from such an implementation.
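The phased execution strategy can be made concrete with a small sketch. The following Python fragment is a hypothetical illustration (the names and data structures are ours, not from Forgy's paper): pending node activations are grouped by node type, and each type is processed as a separate SIMD phase.

    from collections import defaultdict

    # One phase per node type; within a phase all processors execute the
    # same instruction sequence on their own activations.
    PHASES = ["constant-test", "memory", "and-node", "terminal"]

    def evaluate(node, token):
        pass  # stand-in for the per-node match code

    def simd_match_cycle(pending):
        """pending: list of (node_type, node, token) activations."""
        by_type = defaultdict(list)
        for node_type, node, token in pending:
            by_type[node_type].append((node, token))
        for phase in PHASES:
            for node, token in by_type[phase]:
                # On Illiac-IV these evaluations would proceed in lock-step,
                # one per partition/processor.
                evaluate(node, token)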
9.3. The DADO Machine

DADO [84, 85] is a highly parallel tree-structured architecture designed at Columbia University by Salvatore J. Stolfo and his colleagues to execute production systems. The proposed machine consists of a very large number (on the order of tens of thousands in the envisioned full-scale version) of processing elements, interconnected to form a complete binary tree. Each processing element consists of its own processor, a small amount of random access memory, and a specialized I/O switch that is constructed using a custom VLSI chip. Figure 9-1 depicts the DADO architecture as described in [84]. The Intel 8751 processors used in the prototype DADO are rated at 0.5 MIPS.

[Figure 9-1: The prototype DADO architecture, showing the control processor and a processing element consisting of an Intel 8751 processor, 8 KByte RAM, EPROM, and an 8-bit I/O switch.]

Several different algorithms for implementing production systems on DADO have been proposed [31, 60, 86]. Of the many proposed algorithms, the two offering the highest performance for OPS5-like production systems on DADO are (1) the parallel Rete algorithm and (2) the Treat algorithm.

In the implementation of the Rete algorithm on DADO [31], the complete production system is divided into 16-32 partitions, the actual number of partitions depending on the amount of production parallelism (see Section 4.2.1) present in the program. Separate Rete networks are generated for each of the partitions, and each partition is mapped onto a processing element at the PM-level78 and its associated subtree of processing elements (also called the WM-subtree). Using the processor at the PM-level as a control processor, the processing elements in the WM-subtree are used to store the working-memory elements, to associatively locate tokens to be deleted, and to perform conflict-resolution. The performance of the parallel Rete algorithm on the prototype DADO is predicted to be around 175 wme-changes/sec (for more details of the analysis and the assumptions, see [31]).

The implementation of the Treat algorithm on DADO [60] is similar to the implementation of the parallel Rete algorithm described above, but differs in one fundamental way. While the Rete algorithm stores state corresponding to both the α-memory nodes and the β-memory nodes in the Rete network, the Treat algorithm stores state corresponding only to the α-memory nodes.79 It is argued that storing the β-memory state is not very useful for DADO (especially for production systems where a large fraction of the working memory changes on every recognize-act cycle), because the relevant portion of the β-memory state80 can be computed very efficiently on DADO. This is because:

• It is possible to dynamically change the order in which tokens matching individual condition elements are combined, so as to reduce the amount of β-memory state that is computed. Such reordering is not possible in the Rete algorithm, where the combinations of condition elements for which state is stored are frozen at compile time (see Section 5.2.6).81

78 The PM-level (the production-memory level) is determined by the number of partitions that are made. For example, if the number of partitions is 16 the PM-level would be 4, and if the number of partitions is 32 the PM-level would be 5.

79 Recall from Section 2.4.2 that the Treat algorithm falls on the low end of the spectrum of state-saving match algorithms.

80 The relevant portion of the β-memory state corresponds to those tokens that include a reference to the working-memory change being processed. This is because only tokens involving the new working-memory element can cause a change to the existing conflict-set.
These two are, of course, only average numbers, and in practice, for some production systems Treat would do better and for some production systems Rete would do better. 9.4. The NON-VON Machine NON-VON [79, 80] is a massively parallel tree-structured architecture designed for AI applications at Columbia University by David Elliot Shaw and his colleagues. The proposed machine architecture consists of a very large number (anywhere from 16K in the prototype version to one million in the envisioned full-scale supercomputer) of small processing elements (SPEs) interconnected to form a complete binary network. Each small processing element consists of 8-bit wide data paths, a small amount of random-access memory (32-256 bytes), a modest amount of processing logic, and an I/O switch that permits the machine to be dynamically reconfigured to support various forms of interprocessor communication. The leaf nodes of the SPE-tree are also interconnected to form a two- dimensional orthogonal mesh. In addition to the small processing elements, the NON-VON architecture provides for a small number of large processing elements (LPEs). Specifically, each small processing dement above a certain fixed level in the binary network is connected to its own large processing element. The large processing elements are to be built out of off-the-shelf 32-bit 81Note that since each change to working memory results in seve_ changes to the conflict-set, there are at least a few productions for which the Treat algorithm has to compute the complete relevant fl-memory state. For example, it has to compute tokens matching the first two condition elements of the production, the tokens matching the first three condition elements of the productions, and so on for the entire production. Since much of this computation is done serially on DADO, the variation in the processing times for different productions is expected to be quite large, and consequently the speed-up from parallelism is expected to be less. 160 PARAIJ,ELISM IN PRODUCTION SYSTEMS microprocessors(for example, the Motorola68020),with a significantamount of local random-access memory. A large processingelcment normally stores the programs that are to be executed by the SPE-subtreebelow it, and it can broadcast instructions at a very high speed with the assistanceof a high-speed interface called the aclive memory comroller. With the assistance of the large processing dements the NON-VON architectureis capable of functioning in multiple-SIMD (single instruction stream, multiple data stream) mode. Figure 9-2 shows a picture of the proposed NON-VON architecture. LPE Network To Host J _ Leaf Mesh Connections • Small Processing Element D Large Processing Element / O Disk Head and Intelligent Head Unit Figure9-2: The NON-VON architecture. The proposed implementation of production systems on the NON-VON machine [37]is similarto the implementation of the Rete algorithmsuggested forthe DADO machine in the previoussection. However,many changes were made to accommodate the proposed data structures into the small amount of memory available within each SPE. For example, it was often necessaryto distribute a singje memory-nodetoken acrossmultiple SPEs. This fine distribution of state amongst the processing elements permits a greaterdegreeof associativeparaUelismthan what was possible in the DADO implementation. The performancepredicted for the execution of OPS5 on NON-VON [37]is about 2000 wme-changes/sec. 
The performancenumbers correspond to the case when both the large RF.LATED WORK 161 processing elements and the small processing elements in NON-VON function at a speed of about 3 MIPS. (Note that the significantly better performance of NON-VON over I)AI)O can be partly attributed to the facts that (1) NON-VON processing elements are 6 times faster than the prototype DADO processing elements, and (2) NON-VON LPEs have 4 times wider datapaths than the prototype DADO processing elements.) At this point, it might be appropriate to contrast architectures using a small number (32-64) of high-performance processors (for example, the scheme proposed in this thesis) against architectures using a very large number (10,000-1000,000) of weak processors (for example, DADO and NONVON). Studies for uniprocessor implementations show that using a single 3 MIPS processor, it is possible to achieve a performance of about 1800 wme-changes/sec, which is only 10% slower than the performance achieved by the NON-VON machine using thirty-two LPEs of 3 MIPS each and several thousand SPEs. The performance for the DADO machine is even smaller. reasons for the low performance There are two main of these highly parallel machines: (1) The amount of intrinsic parallelism available in OPS5-1ike production systems is quite small, as shown in the previous chapter. As a result, researchers have used the large number of processors available in the massivelyparallel machines as associative memory. However, this does not buy them too much, because hashing on a single powerful processor works just as well. (2) While a scheme using a small number of processors can use expensive and very high performance processors, schemes using a very large number of processors cannot afford fast processors for each of the processing nodes. The perfor- mance lost in the highly parallel machines due to the weak individual processing nodes is difficult to recover by simply using a large number of such nodes (since the parallelism is limited). Note, however, the massively parallel machines may do better for highly non-temporal production systems (production systems where a large fraction of the working-memory changes on every cycle), or for production systems where the number of rules affected per change to working memory is very large. 82 82Notethatthe techniquesdevelopedin thisthesiswillalsoresultin largerspeed-upsforprogramsonwhichthemassively parallelmachinesare expectedto dowell. Theonlyproblemariseswhenthepossiblespeed-upsare of the orderof several hundreds.Thisis becauseitis difficultto constructsharedmemorymultiprocessors withhundredsor thousandsofprocessors. It is suggestedthatin sucha caseit wouldbe bestto use a mixtureof the partitioningapproach(for example,as usedin implementationsof Rete on DADO and NON-VON)and the fine-grainedapproach(as proposedin this thesis),and implementit ontop ofa hierarchical multiprocessor. 162 PARALLELISM iN PRODUCTION SYSTEMS 9.5. Kemal Oflazer's Work on Partitioning and Parallel Processing of Production Systems Kemal Oflazer in his thesis [67] explores a number of issues related to the parallel processing of production systems. (1) He explores the task of partitioning production systems so that the work is uniformly distributed amongst the processors. (2) He proposes a new parallel algorithm for performing match for production systems. (3) He proposes a parallel architecture to execute the proposed algorithm. 
The new algorithm is especially interesting in that it stores much more state than the Rete algorithm in an attempt to cut down the variance in the processing required by different productions. 95.1. The Partitioning Problem The partitioning problem for production systems addresses the task of assigning productions to processors in a parallel computer in a way such that the load on processors is uniformly distributed. While the problem of partitioning the production systems amongst multiple processors may be bypassed in shared-memory architectures, like the one proposed in this thesis, it is central in all architectures that do not permit such sharing (for example, partitioning is essential for the algorithms proposed for the DADO and NON-VON machines). The advantages of schemes not relying on the shared-memory architectures are that scheduling, synchronization, and memory-contention heads are not present. over- The first part of Oflazer's thesis presents a formulation of the partitioning problem as a minimization problem, which may be described as follows. The execution of a production-system run may be characterized by the sequence T=(el,e 2..... et>, where t is the number of changes made to the working memory during the run and where ei is the set of productions affected by the ith change to working memory. 83 Let H =(YI 1..... I-l/c) be some partitioning of the production system onto k processors. With such a partitioning, the cost of executing the production t p ¢ e.NII. system on a parallel processor is COStT,rI = _i=lmaxa<j< k(_ z Jc(p)), where ci is a cost function that gives the processing required by productions for the ith change to working memory. Using this terminology, the partitioning problem is simply stated as the problem of discovering a partitioning H such that COStT,n is minimized. The complexity of finding the exact solution to this minimization problem is shown by Oflazer to be NP-complete [26]. In addition to analysis of some simple partitioning methods like random, round-robin, and contextbased, Oflazer's thesis presents a more complex heuristic method for partitioning that relies on data 83RecaU that a production is said to be affected by a change to working memory, if the working-memory dement matches at least one of its condition dements, that is, if the state corresponding to that production has to be updated. REI.ATEDWORK 163 obtained from actual production-system runs. The inputs to the new partitioning method consist of the affect-sets for each of the changes to working memory and the cost associated with each affected production in these affect-sets. The algorithm is very fast and gives results that are 1.15 to 1.25 times better than the results of the simpler partitioning strategies. 9.5.2. The Parallel Algorithm The second part of Oflazer's thesis concerns itself with a highly parallel algorithm for the stateupdate phase of match in production systems. The algorithm is based on the contention that both Treat and Rete algorithms are too conservative in the amount of state they store. For example, Treat only stores tokens matching the individual condition elements of productions, and Rete only stores tokens that match the individual condition elements and some fixed combinations of the condition elements. Oflazer's parallel algorithm proposes that the tokens matching not some but all combinations of condition elements of a production should be stored. 
The main motivation for doing so is to reduce the variance in the processing requirements of the various affected productions in any given match cycle. For example, consider the Treat algorithm. Since it does not store state corresponding to any of the combinations of condition elements, a lot of state has to be computed when a change is made to the working memory. Much of the state computation that is done after the change is made to the working memory could have been done beforehand, thus reducing the interval between the time the working-memory change is made and the time the conflict-set is ready. A similar argument is used against the Rete algorithm. 84 Oflazer also proposes a new representation for storing state corresponding to the partial matches for a production. He introduces the notion of a null working-memory condition elements, satisfies all inter-condition tests for productions, element that matches all and is always present in the working memory. The state of a production is represented by a set of instance elements (IEs), where each instance element has the form <(tl,w)(t2,w2)... (to,we)>, where c is the number of condition elements in the production, ti is the tag associated with the ith slot, mad wi is the working-memory element associated with the/h slot. The working-memory element wi must satisfy the ith condition element of the production, and the c-tuple <w,w 2..... production, that is, the working-memory w> must consistently satisfy the complete elements together should satisfy all the inter-condition element tests (note that some of the slots may point to the null working-memory elemen0. The tags are used to help detect and delete some of the redundant state generated for a production. 84Oflazer's thesis, however, does not demonstrate clearly if the variation in the processing requirements of productions is actually reduced by the proposed scheme. Storing state for all combinations of condition elements can result in some productions requiring very large amount of processing, a situation which may not have occurred in the Treat or Rete algorithms (see Section 9.5.4). 164 PARALLELISM IN PRODUCI'ION SYSTEMS When a change is made to the working result of the interaction representation the state-update the same instance mation content and in parallel. or instance otherwise resources. a sequential architecture 9.5.3. The Parallel The architecture redundant instance elements to eliminate to give incorrect of redundant requires content for that production. to help detect and eliminate As a result of (multiple copies of these redundant instance results and they also use up is problematic redundant instance because element be Oflazer presents the detailed redundant of the old is a subset of the infor- instance elements that each potentially as a of the proposed to each instance element whose information they can cause the match process--it An advantage There is, however, one problem. It is necessary The elimination, is obtained algorithms instance elements. Architecture proposed structured machine, processing capabilities of the proposed elements elements). to all other instance elements and a hardware of the new change it is possible to generate of other instance scarce processing compared element because it is essentially independently processing, the new state for a production the old state and the new change. 
9.5.2. The Parallel Algorithm

The second part of Oflazer's thesis concerns itself with a highly parallel algorithm for the state-update phase of match in production systems. The algorithm is based on the contention that both the Treat and Rete algorithms are too conservative in the amount of state they store. For example, Treat stores only tokens matching the individual condition elements of productions, and Rete stores only tokens that match the individual condition elements and some fixed combinations of the condition elements. Oflazer's parallel algorithm proposes that the tokens matching not some but all combinations of condition elements of a production should be stored. The main motivation for doing so is to reduce the variance in the processing requirements of the various affected productions in any given match cycle. For example, consider the Treat algorithm. Since it does not store state corresponding to any of the combinations of condition elements, a lot of state has to be computed when a change is made to the working memory. Much of the state computation that is done after the change is made to the working memory could have been done beforehand, thus reducing the interval between the time the working-memory change is made and the time the conflict-set is ready. A similar argument is used against the Rete algorithm.84

Oflazer also proposes a new representation for storing the state corresponding to the partial matches for a production. He introduces the notion of a null working-memory element that matches all condition elements, satisfies all inter-condition tests for productions, and is always present in the working memory. The state of a production is represented by a set of instance elements (IEs), where each instance element has the form <(t1,w1) (t2,w2) ... (tc,wc)>, where c is the number of condition elements in the production, ti is the tag associated with the ith slot, and wi is the working-memory element associated with the ith slot. The working-memory element wi must satisfy the ith condition element of the production, and the c-tuple <w1, w2, ..., wc> must consistently satisfy the complete production, that is, the working-memory elements together should satisfy all the inter-condition-element tests (note that some of the slots may point to the null working-memory element). The tags are used to help detect and delete some of the redundant state generated for a production.

When a change is made to the working memory, the new state for a production is obtained as a result of the interaction between the old state and the new change. An advantage of the proposed representation for state is that the interaction of the new change with each instance element is essentially independent of its interaction with the other instance elements, so that the state-update may be computed in parallel. There is, however, one problem. As a result of the independent processing, it is possible to generate redundant instance elements: multiple copies of the same instance element, or instance elements whose information content is a subset of the information content of other instance elements. It is necessary to eliminate the redundant instance elements because otherwise they can cause the match process to give incorrect results, and they also use up scarce processing resources. The elimination is problematic because it requires that each potentially redundant instance element be compared to all other instance elements. Oflazer presents detailed algorithms and a hardware architecture to help detect and eliminate the redundant instance elements.

84 Oflazer's thesis, however, does not demonstrate clearly whether the variation in the processing requirements of productions is actually reduced by the proposed scheme. Storing state for all combinations of condition elements can result in some productions requiring a very large amount of processing, a situation which may not have occurred in the Treat or Rete algorithms (see Section 9.5.4).

9.5.3. The Parallel Architecture

The architecture proposed for the algorithm described above is a parallel reconfigurable tree-structured machine, with fast processors located at the leaf nodes and specialized switches with simple processing capabilities located at the interior nodes of the tree. Figure 9-3 shows a high-level picture of the proposed architecture.

[Figure 9-3: Structure of the parallel processing system, showing the controller at the root, the switches (S) at the interior nodes, and the instance-element processors (P) at the leaves.]

In mapping the proposed algorithm onto the suggested architecture, each production is assigned to some subset of the leaf processors, and as a result each processor is responsible for some subset of the productions in the program. The processors assigned to a production are responsible for processing the instance elements associated with that production and for keeping the state of that production updated.85 The internal switch nodes of the tree are used to send information to and from the controller located at the root. They are also used to isolate small sections of the tree during the redundant-instance-element-removal phase of the algorithm.

The thesis presents some performance figures based on simulations done for the XSEL [56] and R1 [55] production systems. The simulations assume that (1) processors take 100 ns to execute an instruction from the on-chip memory and 200 ns to execute an instruction from the external memory, (2) each stage of switch nodes takes 200 ns to compute and transfer information to the next stage, and (3) the number of processors allocated to a production is the next power of two larger than the average number of instance elements for that production. Under these assumptions the time to perform match for a single change to working memory is around 150 μs for XSEL (6666 wme-changes/sec) and around 210 μs for R1 (4750 wme-changes/sec). In all these runs an average of about 300 processors were used to update the instance elements. However, if required on any given match cycle, more processors were assumed to be available.

85 The number of processors assigned to a production depends on how large its associated state becomes during actual runs. A modified version of the partitioning algorithm given earlier in the thesis is used to partition the productions amongst the processors.

9.5.4. Discussion

The most interesting aspect of Oflazer's research is the proposed parallel state-update algorithm. It provides yet another distinct data point (at the high end of the spectrum) in the space of state-saving algorithms for match. The proposed architecture (using 256-1024 processors) also forms a distinct data point as far as the number of processors is concerned. It falls in between the multiprocessor architecture proposed in this thesis with 32-64 processors, and the DADO and NON-VON architectures with 10,000 or more processors.

A possible problem with Oflazer's parallel algorithm is that the state associated with a production may potentially become extremely large. Such a production would then require an extremely large number of processors to update its state, or it would become a bottleneck. For example, consider the following production, which locates blocks that are at height 6 from a table and prints the result.
    (p blocks-at-height-6
        (on ^block <a> ^block table)
        (on ^block <b> ^block <a>)
        (on ^block <c> ^block <b>)
        (on ^block <d> ^block <c>)
        (on ^block <e> ^block <d>)
        (on ^block <f> ^block <e>)
      -->
        (call print "block" <f> "is at height 6"))

Now suppose there are 100 blocks in the blocks world, and with each block there is the property "on ^block <x> ^block <y>", where this property is represented as a working-memory element. Thus there would be 100 working-memory elements matching CE-2 (condition element 2), and similarly 100 working-memory elements matching each of the condition elements CE-3 through CE-6. Since Oflazer's algorithm stores state corresponding to all possible combinations of condition elements, consider the number of tokens that would match CE-2, CE-4, and CE-6 together. Since there are no common variables between these condition elements, the total number of tokens matching these condition elements together would be 100 x 100 x 100 = 1,000,000. If a single new block is added to the system, the number of matching tokens would go to 101 x 101 x 101 = 1,030,301, that is, about 30,000 new tokens would have to be processed.

Oflazer suggests a solution to the above problem: splitting such productions into two or more pieces, so that the combinations of condition elements for which state is stored are controlled. For example, the above production may be split into the following two productions.

    (p blocks-at-height-6-part-1
        (on ^block <a> ^block table)
        (on ^block <b> ^block <a>)
        (on ^block <c> ^block <b>)
      -->
        (send-message blocks-at-height-6-part-2 ^vars <c>))

    (p blocks-at-height-6-part-2
        (message ^vars <c>)
        (on ^block <d> ^block <c>)
        (on ^block <e> ^block <d>)
        (on ^block <f> ^block <e>)
      -->
        (call print "block" <f> "is at height 6"))

While splitting the production into two parts reduces the number of tokens that are generated, it reintroduces some of the sequentiality in state processing, which is exactly what the algorithm was trying to avoid. The goodness of the proposed solution depends on the number of productions that have to be split, and on the performance penalty they cause. Oflazer notes that for the XSEL system 42 productions had to be split and for the R1 system 22 productions had to be split. However, no clear numbers about the performance lost due to such splitting are available at this time.
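The scale of the problem that motivates the splitting can be checked directly. The token arithmetic for the unsplit production above is (plain arithmetic; no assumptions beyond those stated in the example):

    # With no shared variables among CE-2, CE-4, and CE-6, every combination
    # of matching working-memory elements forms a stored token.
    blocks = 100
    before = blocks ** 3        # 1,000,000 tokens for CE-2 x CE-4 x CE-6
    after = (blocks + 1) ** 3   # 1,030,301 tokens after one block is added
    print(after - before)       # 30,301 new tokens to compute and store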
Another drawback of Oflazer's algorithm is that it cannot exploit action parallelism, that is, it cannot easily process multiple changes to working memory in parallel. This is because (1) the multiple changes to working memory often affect the same set of productions, which requires that the instance elements for such productions respond to the effects of several changes to their slots, and (2) the algorithm requires that multiple changes to the slots of an instance element be processed sequentially. Since action parallelism is exploited very usefully by the implementation proposed in this thesis, not being able to exploit it is a significant disadvantage.

Finally, it is interesting to compare the performance of the proposed algorithm to that of the Rete algorithm. Oflazer's algorithm using about three hundred 5-10 MIPS 16-bit processors achieves 4500-7000 wme-changes/sec. The Rete algorithm on a 5 MIPS 32-bit uniprocessor can achieve a speed of 3000 wme-changes/sec. The reasons for the small amount of speed-up after using so many more processors appear to be: (1) The intrinsic parallelism in production systems is limited, so large amounts of speed-up cannot be expected. (2) The strategy of keeping large amounts of state for productions is not working; that is, while keeping large state increases the number of processors that are required, it does not at the same time help significantly reduce the variation in the processing times of productions. (3) There is significant overhead in the proposed parallel implementation, for example, the time taken to remove redundant instance elements, which nullifies much of the potential speed-up.

9.6. Honeywell's Data-Flow Model

Researchers at Honeywell CSC have also been exploring the use of parallelism for executing production-system programs [71]. They have proposed a tagged-token data-flow computation model86 for capturing the inherent parallelism present in OPS5 production systems. The proposed model is based on the Rete algorithm, and the key idea is to translate the Rete network into a data-flow graph that explicitly shows the data dependencies. Similarly, operations performed in the Rete algorithm are encapsulated into appropriate activities (or tasks) in the data-flow model, which can then be executed on the available physical processing resources. For example, consider the case when a token arriving at an and-node of the Rete network finds n tokens in the opposite memory node. In such a case, n activities would be generated in the proposed data-flow model: one activity for testing the incoming token for consistent variable bindings against each of the n opposite memory tokens. The paper [71] presents details about the kinds of nodes that are required in the data-flow graph and the functionality associated with those nodes. However, details about the hardware structure onto which the proposed model is to be mapped, and about how the necessary synchronization and scheduling is to be performed, are not given in the paper. Since the size of the individual activities in the proposed data-flow model is very small (about 10 machine instructions or less), extremely efficient scheduling and synchronization methods will have to be developed if the approach is to be successful.

86 A tagged-token data-flow model is different from the conventional data-flow models. While there can be only one token present on an output arc in the conventional data-flow model, there can be multiple tokens present on the output arcs of the tagged-token data-flow model.

9.7. Other Work on Speeding-up Production Systems

In addition to the efforts mentioned above, which specifically address the issue of speeding up OPS production systems through parallelism, there are several other ongoing efforts [8, 49, 69, 72, 77, 89] to speed up production systems. Two of these, noted below, have been carried out within the PSM project and complement the work done in this thesis.

Jim Quinlan has done a comparative analysis of computer architectures for production-system machines [69]. He uses run-time measurements on production systems to evaluate the performance of five computer architectures (the VAX-11/780, the Berkeley RISC II computer, a custom-designed microcoded machine for production systems, a custom RISC processor for production systems, and the Pyramid computer). His main conclusions are: (1) The custom-designed microcoded machine is the best CPU architecture for production systems. Although it takes more machine cycles than the custom-designed RISC processor, it has lower processor-memory bandwidth requirements. (2) The difference in the performance of the five architectures is not very large. As a result the motivation for building a custom processor is small.

Ted Lehr presents a custom pipelined RISC architecture for production-system execution [49]. The proposed architecture has a static branch-prediction strategy, a large register file, and separate instruction and data fetch units. Since the proposed architecture is very simple, he also discusses the viability of implementing it in GaAs.

Chapter Ten
Summary and Conclusions

In this thesis we have explored the use of parallelism to speed up the execution of production-system programs. We have discussed the sources of parallelism available in OPS5 and Soar production systems, the design of a suitable parallel match algorithm, the design of a suitable parallel architecture, and the implementation of the parallel match algorithm on the parallel architecture. This chapter reiterates the main results of the thesis and discusses directions for future research.

10.1. Primary Results of Thesis

The study of parallelism in OPS5 and Soar production systems in this thesis leads us to the following conclusions:

1. The Rete class of algorithms is highly suitable for parallel implementation.

2. The amount of speed-up available from parallelism is quite limited, about 10-fold, in contrast to initial expectations of 100-fold to 1000-fold.

3. To obtain the limited speed-up that is available, it is necessary to exploit parallelism at a very fine granularity.

4. To exploit the suggested sources of parallelism, a multiprocessor architecture with 32-64 high-performance processors and special hardware support for scheduling the fine-grained tasks is desirable.

The above conclusions are expanded in the following subsections.
His main conclusions are: (1) The custom designed microcoded machine is the best CPU architecture for production systems. Although it takes more machine cycles than the custom designed RISC processor, it has lower processor-memory bandwidth requirements. (2) The difference in the performance of the six architectures is not very large. As a result the motivation for building a custom processor is small. Ted Lehr presents a custom pipelined RISC architecture for production-system execution [49]. The proposed architecture has a static branch prediction strategy, a large register file, and separate instruction and data fetch units. Since the proposed architecture is very simple, he also discusses the viability of implementing it in GaAs. 169 SUMMARY AND CONCLUSIONS Chapter Ten Summary and Conclusions In this thesis we have explored the use of parallelism to speed up the execution of productionsystem programs. We have discussed the sources of parallelism available in OPS5 and Soar produc- tion systems, the design of a suitable parallel match-algorithm, architecture, and the implementation the design of a suitable parallel of the parallel match-algorithm on the parallel architecture. This chapter reiterates the main results of the thesis and discusses directions for future research. 10.1. Primary Results of Thesis The study of parallelism in OPS5 and Soar production systems in this thesis leads us to make the following conclusions: 1. The Rete-class of algorithms is highly suitable for parallel implementation. 2. The amount of speed-up available from parallelism is quite limited, about 10-fold, in contrast to initial expectations of 100-fold to 1000-fold. 3. To obtain the limited speed-up that is available, it is necessary to exploit parallelism at a very fine granularity. 4. To exploit the suggested sources of parallelism, a multiprocessor architecture with 32-64 high-performance processors and special hardware support for scheduling the finegrained tasks is desirable. The above conclusions are expanded in the following subsections. 10.1.1.Suitabilityof the Rete-Classof Algorithms The thesis empirically shows that the Rete-class of algorithms is highly suitable for parallel implementation of production systems. While Rete-class algorithms use significantly fewer processors 170 PARALI-ELISM IN PRODUCi'ION SYSTEMS than other proposed algorithms [37, 60, 67] (and in that sense are less concurrent87), simulations show that they perform better than these other algorithms. Some of the reasons for choosing and parallelizing the Rete class of algorithms are the following (see Section 2.4 for details). In designing a parallel algorithm, the first choice is between state-saving algorithms and non-state saving algorithms. State-saving algorithms are the obvious choice since only a very small fraction (less than 1%) of the global working-memory changes on each recognize-act cycle. Within the class of state-saving algorithms itself, however, many different algorithms can be designed, each storing different amounts of state. The Rete-class of algorithms store an intermediate amount of state (between the low extreme of the Treat algorithm [60] and the high extreme of Oflazer's parallel algorithm [67]). The state stored for a production in Rete corresponds to (1) matchings between individual condition elements and working-memory elements, and (2) matchings between some fixed combinations of condition elements occurring in the left-hand side of the production and tuples of working-memory elements. 
In algorithms like Treat, where the state stored is small, the disadvantage is that much of the information about partial matches with the unchanged part of the working memory has to be recomputed. In algorithms like Oflazer's parallel algorithm, where the state stored is large, the disadvantage is that a large amount of processing resources is wasted in computing partial matches that never reach the conflict-set. We believe that the Rete class of algorithms avoids the disadvantages of both Treat and Oflazer's parallel algorithm. Note, however, that we do not wish to argue that the Rete class is the best class of parallel algorithms, only that the Rete-class algorithms fall in an interesting part of the spectrum of state-saving algorithms. The suitability of Rete as a parallel algorithm is also based on its other features, for example, the discrimination net used for the selection-phase computation and the dataflow-like nature of the overall computation graph. Finally, the claim for the suitability of the Rete class of algorithms for parallel implementation is based on the results of simulations, which show that the execution speeds obtained by parallel Rete compare favorably with other proposed algorithms.

87 Statements about the amount of parallelism available in a class of programs can often be misleading. This is because it is always possible to construct parallel algorithms that can keep a very large number of processors busy without providing any significant speed-up over the known sequential algorithms. Thus simply talking about the average number of processors that are kept busy by a parallel algorithm is not very useful, at least not in isolation from the absolute speed-up over the best known sequential algorithms.

10.1.2. Parallelism in Production Systems

One of the main results of this thesis is that the speed-up obtainable from parallelism is quite limited, of the order of a few tens rather than of the order of a few hundreds or thousands.88 The initial expectations about the speed-up from parallelism for production-system programs were very high, especially for large programs. The general idea was that if match for each production is performed in parallel, then a speed-up proportional to the number of productions in the program would be achieved [84]. This idea was quickly abandoned as results from actual measurements on production systems were obtained (see Chapter 3 and [30, 31]). The reasons for the limited speed-up were found to be: (1) The number of productions that are affected as a result of a change to working memory is very small (about 26), and since affected productions take most of the processing time, assigning a processor to each production can result in only 26-fold speed-up. (2) The speed-up is actually much less than 26-fold, because there is a large variance in the processing requirements of the affected productions. In fact, using production parallelism in a straightforward manner was found to result in less than 5.1-fold nominal speed-up.89 (3) Overheads due to loss of sharing in the Rete network and overheads due to the parallel implementation cause the real speed-up to be only 1.9-fold (a factor of 2.64 is lost).

An attempt to increase the size of the affect-sets by processing all changes resulting from a production firing in parallel (the use of action parallelism) results in a nominal speed-up of 7.6-fold, instead of the 5.1-fold achieved otherwise. The increase in speed-up is much smaller than the number of working-memory changes processed in parallel, because the affect-sets of the multiple changes overlap significantly.

Since the number of productions that are affected on each cycle is not controlled by the implementor of the production-system interpreter (it is governed mainly by the author of the program and the nature of the task), one solution to the problem of limited speed-up is to somehow decrease the variance in the processing required by the affected productions. This requires that the processing associated with an affected production be distributed amongst multiple processors by exploiting parallelism at a finer granularity. To achieve this end, the thesis proposes the use of node parallelism, that is, processing activations of distinct nodes in the Rete network in parallel. Using node parallelism and action parallelism results in a nominal speed-up of about 10.7-fold, as compared to the 7.6-fold achieved for production and action parallelism. The overheads in this case are a factor of 1.98, so that the real speed-up is 5.4-fold.

88 Note that the speed-up numbers in the following discussion are with respect to the sequential Rete algorithm, which is so far the fastest sequential algorithm for implementing production systems.

89 Nominal speed-up or concurrency refers to the average number of processors that are busy in a parallel implementation. In contrast, the real speed-up refers to the speed-up with respect to the highest-performance sequential algorithm.
The increase in speed-up is much smaller than the number of working-memory changes processed in parallel, because the affect-sets of the multiple changes overlap significantly. Since the number of productions that are affected on each cycle is not controlled by the implementor of the production-system interpreter (it is governed mainly by the author of the program and the nature of the task), one solution to the problem of limited speed-up is to somehow decrease the variance in the processing required by the affected productions. This requires that the processing associated with an affected production be distributed amongst multiple processors by exploiting parallelism at a finer granularity. To achieve this end, the thesis proposes the use of node parallelism, that is, processing activations of distinct nodes in the Rete network in parallel. Using node parallelism and action parallelism results in a nominal speed-up of about 10.7-fold, as compared to 7.6fold achieved for production and action parallelism. The overheads in this case are a factor of 1.98, so 88Note that the speed-up numbers in the following discussion are with respect to the sequential Rete algorithm, which is so farthe fastest sequential algorithm for implementing production systems. 89Nominal speed-up or concurrency refers to the average number of processors that are busy in a parallel implementation. In contrast, the real speed-up refers to the speed-up with respect to the highest performance sequential algorithm. 172 PARALLEI.ISM INI'RODUCI'IONSYSTEMS that the real speed-up is 5.4-fold. Studying the results in detail, two bottlenecks were found to be limiting the speed-up. These were (1) the cross-product effect and (2) the long-chain effect. The cross-product effect refers to the case when an incoming token finds several matching tokens in the opposite memory node, and as a result of which a large number of tokens are sent to the successor of that node. Since multiple activations of any given node are processed sequentially when using node parallelism, the cross-product effect resulted in large processing times for some of the productions, thus reducing the speed-up obtained. The long-chain effect refers to the occurrence of long chains of dependent node activations. Since these activations, as their name suggests, cannot be processed concurrently, they result in some productions taking much longer to finish than others, thus resulting in small speed-ups. The long- chain effect is especially bad for Soar systems, where the number of condition elements per production is larger than in OPS5 systems and as a result of which the networks for productions often contain long chains. As a solution to the problem of the cross-product effect, the thesis proposes the use of intra-node parallelism, where in addition to processing activations of different nodes in the Rete network in parallel, it is possible to process the relevant activations of the same node in parallel. Using intranode and action parallelism it is possible to achieve 19.3-fold nominal speed-up, as compared to node and action parallelism where only 10.6-fold speed-up is achieved. The corresponding average execution speed on a multiprocessor with 2 MIPS processors is 11,250 wme-changes/sec. The nominal speed-up for individual systems actually varies between 12-fold for some OPS5 systems and 35-fold for some Soar systems, and the execution speed varies between 7000 wme-changes/sec and 17,000 wme-changes/sec. 
As a solution to the problem of the long-chain effect, the thesis proposes using binary networks for productions rather than linear networks as used in the original Rete algorithm. 9° This way the maximum length of a chain can be reduced to the logarithm of the number of condition elements in a production. The average nominal speed-up obtained using binary networks and intra-node and action parallelism is 25.8-fold. The corresponding average execution speed achieved is 12,000 wmechanges/sec. Although the average nominal speed-up is significantly larger than the 19.3-fold ach- ieved with linear networks, the execution speed is not much higher than the 11,250 wme-changes/sec 90Thiscorresponds to varyingthefixedcombinations ofconditiondementsforwhichpartialmatchesarestoredbytheRete algorithm. 173 SUMMARY AND CONCLUSIONS achieved with linear networks. This is because, for many of the systems, using binary networks results in a larger number of node activations per change to working memory. Thus, even though the speed-up with respect to the uniprocessor implementation using binary networks is higher, the uniprocessor implementation using binary networks is slower than that using linear networks, so that not much is gained on the whole. The speed-up for the individual systems when using binary networks varies between ll-fold and 56-fold. The benefits of using the binary networks are small (or sometimes even negative) for OPS5 systems since the average number of condition elements per production is small. The benefits for Soar systems are much larger since the average number of condition elements in Soar productions is much higher. Thus the decision whether or not to use binary networks should be made depending on the characteristics of the individual production systems. In fact, there is no reason not to have a mixture in the same program, that is, have some productions with linear networks, some productions with binary networks, and some productions with a mixture of linear and binary network forms. The overall aim is to minimize the state that is computed, while at the same time avoiding long chains. This job of selecting the network form for productions may be done by a programmer who understands the underlying implementation, or by a program which uses static and run-time information, or by some combination of the two. 10.1.3. Software Implementation Issues The thesis addresses a number of issues related to the correctness and the efficiency of the parallel Rete algorithm. Some of the issues discussed are: (1) the use of hashing for constant-test nodes at the top-level of the Rete network, (2) the lumping of multiple constant-test node activations when scheduling them on a processor, (3) the lumping of memory-nodes with the associated two-input nodes, (4) the use of hash tables to store tokens in memory nodes, (5) the set of concurrently processable activations of a node, (6) the processing of conjugate pairs of tokens, (7) the locks necessary for manipulating the contents of memory nodes, and (8) the use of binary networks versus linear networks. Although the details are not relevant here, the above features have considerable impact on the overheads that are imposed by the parallel implementation, and are consequently very important if a parallel implementation is to be successful. 174 PARALLELISM IN PRODUCI'ION SYSTEMS 10.1.4. 
10.1.4. Hardware Architecture

To implement the parallel Rete class of algorithms described in the previous paragraphs, the thesis proposes the use of a shared-memory multiprocessor architecture. The basic characteristics of the proposed architecture are: (1) It is a shared-memory multiprocessor with 32-64 processors. (2) The individual processors are high-performance computers, each with a cache and a small amount of private memory. (3) The processors are connected to the shared memory via one or more shared buses. (4) The multiprocessor supports a hardware task scheduler to help enqueue node activations on the task queue and to assign pending tasks to idle processors.

The main reason for suggesting a shared-memory multiprocessor is a consequence of the fine granularity at which parallelism is exploited by the parallel Rete algorithm. The parallel Rete algorithm requires that the data structures be shared amongst multiple concurrent processes, which makes it appropriate to use a shared-memory architecture. When used in conjunction with a centralized task queue, a shared-memory multiprocessor also makes it possible to bypass the load-distribution problem. The reason for using only 32-64 processors is that simulations show that no additional speed-up is gained by using a larger number of processors. Since only a small number of processors are used, it is possible to use expensive high-performance processors. Each processor should have a cache and some private memory to enable a high speed of operation (to avoid the large latency to the shared memory), and also to reduce contention for the shared memory modules and the shared bus. The thesis recommends a bus-based architecture instead of an architecture based on a crossbar or other processor-memory interconnects, because it is easier to construct intelligent cache-coherence schemes for shared buses and because simulations show that a single bus would be able to support the load generated by 32-64 processors (provided reasonable cache-hit ratios are obtained).91

To avoid the problem of load distribution, the thesis suggests the use of a centralized task queue containing all pending node activations. The task queue required for this job is quite complex, since not all node activations in the task queue can be processed concurrently; in fact, the set of concurrently processable node activations changes dynamically (see Section 5.2.4 for details). Furthermore, since the average processing required by a node activation is only 50-100 instructions, it is necessary to have a mechanism whereby node activations can be enqueued and dequeued extremely fast, so that the task queue does not become a bottleneck. The thesis proposes two mechanisms for solving the scheduling problem: the use of a hardware scheduler and the use of multiple software task queues.

91 Many of the design recommendations made in the thesis are highly technology dependent. Future advances in processor technology and interconnection-network technology may make it necessary to reevaluate the recommendations.
Simulations show that the performance when using multiple software task queues is about half of the performance when using the hardware task scheduler. 10.2. Some General Conclusions Amongst Treat, Rete, and Oflazer's algorithm three very distinct points in the space of possible state-saving match algorithms are covered. Alongside in the implementations of the above al- gorithms, three distinct points in the architecture space have also been covered--small-scale parallel architectures like a 32-node multiprocessor, medium-scale parallel architectures like Oflazer's 512 processor parallel machine, and highly parallel architectures like DADO and NON-VON with tens of thousands of processors. In terms of performance, Treat on DADO is expected to execute at a rate of about 215 wme-changes/sec, assuming sixteen thousand 0.5 MIPS node processors [60]; Rete on NON-VON is expected to execute at about 2000 wme-changes/sec, assuming thirty-two 3 MIPS large processing elements and sixteen thousand 3 MIPS small processing elements [37]; Oflazer's algorithm is expected to execute at about 4750-7000 wme-changes/sec, assuming five-hundred-and-twelve 5-10 MIPS custom processors [67]; Rete on a single 2 MIPS processor is expected to execute at about 1200 wme-changes/sec; and Rete on the production-system machine proposed in this thesis is expected to execute at 11000 wme-changes/sec, assuming thirty-two node multiprocessor using 2 MIPS processors. 92 The first general conclusion that can be derived from the relative comparison of various algorithms and architectures discussed above is that the speed-up over the best existing sequential algorithm by any of the parallel implementations is small, and this is irrespective of the type of algorithm or architecture being used. Thus the answer to the question "Is it the sequentiality of the Rete algorithm that is blocking the speed-up from the use of parallelism?" is most probably no, because other algorithms at both ends of the state-saving spectrum have not shown any better results. Thus it 92No attempt has been made to normalize the meeds of the processors used in the architectures discussed above. For example, the speed of preduction-system execution on the DADO machine is given when processors have 8-bit wide datapaths and execute at 0.5 MIPS (as described in the original paper), and not when the processors have 32-bit datapaths and execute at 2 MIPS. This is because the overall effect on the feasibility of the different architectures is very different when individual processors are speeded up. For example, in Oflazer's machine, 5 MIPS processors are fine because they work out of their local memory. Using 5 MIPS processors for the PSM may, however, cause problems since the shared bus would become a bottleneck. The reader may, however, wish to do some such normalization (based on architectural feasibility) to gain more comparability. 176 PARALI.ELISM IN PRODUCI'ION SYS'i'F.MS appears that it is not the Retc algorithm, or the Treat algorithm, or Oflazer's algorithm that is preventing the use of parallelism, but it is the inherent nature of the computation involved in im- plementing production systems that prevents significant speed-up from parallelism. 
Another conclusion that we can draw is that while massively parallel architectures like DADO and NON-VON may be effective for the task of executing production systems, that is, they can execute production systems at reasonably high speeds, the approach of using a small number of tightlycoupled high-performance processors with the migration of critical functionality to special purpose hardware seems to be preferred. Since the results in this thesis are based on the analysis of existing programs, an interesting question to ask is: "Are existing production-system programs not able to exploit parallelism because they were not written with parallel implementations in mind, or alternatively, will future production-system programs when written with parallel implementations in mind be able to exploit more parallelism?" The answer is undoubtedly yes, but probably to a limited extent only. We believe that while additional factors of two, four, or eight are very probable, it is doubtful that additional factors of fifty, a hundred, or more will be obtained in the future. There are reasons to believe that the two main factors limiting the speed-up (the small number of affected productions per cycle and the small number of changes to working memory per cycle) will not change significantly in the near future, and maybe not even in the long run (see Section 4.7 for more details). 93 On the positive side, however, the techniques that have been developed in this thesis will still be applicable to the new class of systems, the only difference being that the speed-up will be larger. 10.3. Directions for Future Research Although many issues regarding the use of parallelism in implementing production systems have been addressed in this thesis, many more remain to be addressed. This section discusses some such issues. In the area of design of algorithms and architectures it is extremely useful to have a generally accepted set of benchmark programs that can be used by all researchers. At this point there are no such established benchmarks for production-system programs and as a result one often ends up comparing apples to oranges. The PSM group at Carnegie-Mellon University is beginning an effort 93There may be exceptions to the above claims for special classes of programs, for example, production-system programs working on low-level vision. But for a large majority of tasks in Artificial Intelligence they are expected to be true. 177 SUMMARY AND CONCLUSIONS to assemble such a set of programs. 94 When selecting such a set of programs it is necessary to ensure that there is sufficient variability in them, so that the proposed algorithms and architectures can be tested along various dimensions. For example, there should be programs that are knowledge-search intensive, those that are problem-space search intensive, those with small and those with large working memory, those with small and those with large production memory, and so on. The final success of such an effort, of course, would be established only if the selected set of benchmark programs is adopted by the rest of the research community and used to evaluate many architectures. A criticism that has often been cited for the current work being done in parallel implementations of production systems is that, existing programs were written with sequential implementations in mind, and that they do not reflect the true parallelism which is to be found in programs written with parallel implementations in mind [40, 86]. 
As stated a few paragraphs earlier, while we believe that small factors of two or four may be achieved in this way, factors of fifty or a hundred will not. Factors of two or four, however, are not small enough to be ignored, and much work needs to be done in developing production-system formalisms that permit more explicit expression of parallelism (for example, the Herbal language being developed at Columbia University), or formalisms that implicitly allow much more parallelism to be used (for example, the Soar formalism as compared to the OPS5 formalism).

An obvious direction for further work is to implement the ideas proposed in this thesis on an actual multiprocessor. Such an implementation will bring many interesting issues to light, and a running parallel implementation will certainly encourage production-system programmers to adapt their programming styles to make better use of the parallelism. Such an implementation is currently underway by the PSM group.

The thesis does not explore the issue of using multiple software task schedulers in a very comprehensive way, and additional work is needed to clarify the trade-offs involved. Another interesting task would be to build a program that uses static and run-time information to decide on the network forms (linear, binary, or some mixture) for the productions [83]. The criterion of goodness for such a program is that it should minimize the total amount of state computed on every cycle, while avoiding the occurrence of long chains of dependent node activations.

Another interesting direction for future work is to analyze the relative merits of different AI programming languages. To be more specific, with the start of the Japanese Fifth Generation Computing Project [24], the language Prolog has gained very wide usage [13, 78]. Prolog has also been put forward as a language for building expert systems, and it has been claimed that massive amounts of parallelism can be used in its implementations [12, 27, 50, 61, 88]. It would be interesting to implement a common set of tasks using Prolog, OPS5, Soar, Multilisp, Actors, and other such languages [1, 15, 25, 35, 59, 64, 93], and to see the amount of parallelism that each of the formalisms permits and the absolute performance that can be achieved by each of them.

94 The collection of programs that we have been using in our experiments represents a start, but it is still inadequate in various ways. For example, most of the programs are too large to be recoded in other related languages, and many programs are of a proprietary nature and cannot be shipped outside of CMU.

References

[1] Gul Agha and Carl Hewitt. Concurrent Programming Using Actors: Exploiting Large-Scale Parallelism. A.I. Memo 865, Massachusetts Institute of Technology, October 1985.
[2] J. R. Anderson. The Architecture of Cognition. Harvard University Press, 1983.
[3] Mario R. Barbacci. An Introduction to ISPS. In Daniel P. Siewiorek, C. Gordon Bell, and Allen Newell (editors), Computer Structures: Principles and Examples, chapter 4. McGraw-Hill, 1982.
[4] Avron Barr and Edward A. Feigenbaum. The Handbook of Artificial Intelligence, Volume 1. William Kaufmann, Inc., 1981.
[5] Forest Baskett and Alan Jay Smith. Interference in Multiprocessor Computer Systems with Interleaved Memory. Communications of the ACM 19(6), June 1976.
[6] D. P. Bhandarkar. Analysis of Memory Interference in Multiprocessors. IEEE Transactions on Computers C-24(9), September 1975.
[7] W. J. Bouknight, Stewart A. Denenberg, David E. McIntyre, J. M. Randall, Amed H. Sameh, and Daniel L. Slotnick. The Illiac IV System. Proceedings of the IEEE, April 1972.
[8] Ruven Brooks and Rosalyn Lum. Yes, An SIMD Machine Can Be Used For AI. In International Joint Conference on Artificial Intelligence, 1985.
[9] George Broomell and J. Robert Heath. Classification Categories and Historical Development of Circuit Switching Topologies. Computing Surveys 15(2):95-133, June 1983.
[10] Lee Brownston, Robert Farrell, Elaine Kant, and Nancy Martin. Programming Expert Systems in OPS5: An Introduction to Rule-Based Programming. Addison-Wesley, 1985.
[11] B. G. Buchanan and E. A. Feigenbaum. DENDRAL and Meta-DENDRAL: Their Applications Dimensions. Artificial Intelligence 11(1-2), 1978.
[12] Yaohan Chu and Kozo Itano. Organization of a Parallel Prolog Machine. In International Workshop on High-Level Computer Architecture, 1984.
[13] W. F. Clocksin and C. S. Mellish. Programming in Prolog. Springer-Verlag, 1981.
[14] R. O. Duda, J. G. Gaschnig, and P. E. Hart. Model Design in the PROSPECTOR Consultant System for Mineral Exploration. In D. Michie (editor), Expert Systems in the Micro-Electronic Age. Edinburgh University Press, Edinburgh, 1979.
[15] J. Fain, F. Hayes-Roth, S. J. Rosenschein, H. Sowizral, and D. Waterman. The ROSIE Language Reference Manual. Technical Report N-1647-ARPA, Rand Corporation, 1981.
[16] Richard D. Fennell and Victor R. Lesser. Parallelism in Artificial Intelligence Problem Solving: A Case Study of Hearsay II. IEEE Transactions on Computers C-26(2), February 1977.
[17] Charles L. Forgy. On the Efficient Implementation of Production Systems. PhD thesis, Carnegie-Mellon University, Pittsburgh, 1979.
[18] Charles L. Forgy. Note on Production Systems and ILLIAC-IV. Technical Report CMU-CS-80-130, Carnegie-Mellon University, Pittsburgh, 1980.
[19] Charles L. Forgy. OPS5 User's Manual. Technical Report CMU-CS-81-135, Carnegie-Mellon University, Pittsburgh, 1981.
[20] Charles L. Forgy. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence 19, September 1982.
[21] Charles L. Forgy. The OPS83 Report. Technical Report CMU-CS-84-133, Carnegie-Mellon University, Pittsburgh, May 1984.
[22] Charles Forgy, Anoop Gupta, Allen Newell, and Robert Wedig. Initial Assessment of Architectures for Production Systems. In National Conference on Artificial Intelligence, AAAI, 1984.
[23] Charles Forgy and Anoop Gupta. Preliminary Architecture of the CMU Production System Machine. In Hawaii International Conference on System Sciences, January 1986.
[24] Kazuhiro Fuchi. Revisiting Original Philosophy of Fifth Generation Computer Systems Project. In International Conference on Fifth Generation Computer Systems, ICOT, 1984.
[25] Richard P. Gabriel and John McCarthy. Queue-based Multi-processing Lisp. In ACM Symposium on Lisp and Functional Programming, ACM, 1984.
[26] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[27] Atsuhiro Goto, Hidehiko Tanaka, and Tohru Moto-oka. Highly Parallel Inference Engine -- Goal Rewriting Model and Machine Architecture. New Generation Computing 2:37-58, 1984.
[28] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- Designing a MIMD Shared-memory Parallel Machine. In The 9th Annual Symposium on Computer Architecture, IEEE and ACM, 1982.
[29] J. H. Griesmer, S. J. Hong, M. Karnaugh, J. K. Kastner, M. I. Schor, R. L. Ennis, D. A. Klein, K. R. Milliken, and H. M. VanWoerkom. YES/MVS: A Continuous Real Time Expert System. In National Conference on Artificial Intelligence, AAAI, 1984.
[30] Anoop Gupta and Charles L. Forgy. Measurements on Production Systems. Technical Report CMU-CS-83-167, Carnegie-Mellon University, Pittsburgh, 1983.
[31] Anoop Gupta. Implementing OPS5 Production Systems on DADO. In International Conference on Parallel Processing, IEEE, 1984.
[32] Anoop Gupta. Parallelism in Production Systems: The Sources and the Expected Speed-up. Technical Report CMU-CS-84-169, Carnegie-Mellon University, Pittsburgh, 1984. Also in Proceedings of the Fifth International Workshop on Expert Systems and Applications, Avignon, France, May 1985.
[33] Anoop Gupta, Charles Forgy, Allen Newell, and Robert Wedig. Parallel Algorithms and Architectures for Production Systems. In 13th International Symposium on Computer Architecture, June 1986. To appear.
[34] P. Haley, J. Kowalski, J. McDermott, and R. McWhorter. PTRANS: A Rule-Based Management Assistant. Technical Report, Carnegie-Mellon University, Pittsburgh, 1983.
[35] Robert H. Halstead, Jr. Multilisp: A Language for Concurrent Symbolic Computation. ACM Transactions on Programming Languages and Systems 7(4):501-538, October 1985.
[36] J. L. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, and T. Gross. The MIPS Machine. In Computer Conference, February 1982.
[37] Bruce K. Hillyer and David E. Shaw. Execution of OPS5 Production Systems on a Massively Parallel Machine. Technical Report, Columbia University, September 1984.
[38] Keki B. Irani and Ibrahim H. Onyuksel. A Closed-Form Solution for the Performance Analysis of Multiple-Bus Multiprocessor Systems. IEEE Transactions on Computers C-33(11), November 1984.
[39] Gary Kahn and John McDermott. The MUD System. In First Conference on Artificial Intelligence Applications, IEEE Computer Society and AAAI, December 1984.
[40] Dennis F. Kibler and John Conery. Parallelism in AI Programs. In International Joint Conference on Artificial Intelligence, 1985.
[41] Jin Kim, John McDermott, and Daniel Siewiorek. TALIB: A Knowledge-Based System for IC Layout Design. In National Conference on Artificial Intelligence, AAAI, 1983.
[42] Ted Kowalski and Don Thomas. The VLSI Design Automation Assistant: Prototype System. In 20th Design Automation Conference, ACM and IEEE, June 1983.
[43] Ted Kowalski. The VLSI Design Automation Assistant: A Knowledge-Based Expert System. PhD thesis, Carnegie-Mellon University, April 1984.
[44] John E. Laird. Universal Subgoaling. PhD thesis, Carnegie-Mellon University, Pittsburgh, December 1983.
[45] John E. Laird and Allen Newell. A Universal Weak Method: Summary of Results. In International Joint Conference on Artificial Intelligence, 1983.
[46] John E. Laird and Allen Newell. A Universal Weak Method. Technical Report CMU-CS-83-141, Carnegie-Mellon University, Pittsburgh, June 1983.
[47] John E. Laird, Paul S. Rosenbloom, and Allen Newell. Towards Chunking as a General Learning Mechanism. In National Conference on Artificial Intelligence, AAAI, 1984.
[48] John E. Laird. Soar User's Manual, 4th edition. Xerox PARC, 1986.
[49] Theodore F. Lehr. The Implementation of a Production System Machine. Master's thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University, 1985.
[50] G. J. Lipovski and M. V. Hermenegildo. B-LOG: A Branch and Bound Methodology for the Parallel Execution of Logic Programs. In International Conference on Parallel Processing, IEEE, 1985.
[51] Sandra Marcus, John McDermott, Robert Roche, Tim Thompson, Tianran Wang, and George Wood. Design Document for VT. Carnegie-Mellon University, 1984.
[52] M. A. Marsan. Bounds on Bus and Memory Interference in a Class of Multiple-Bus Multiprocessor Systems. In Third International Conference on Distributed Computer Systems, October 1982.
[53] Henry H. Mashburn. The C.mmp/Hydra Project: An Architectural Overview. In Daniel P. Siewiorek, C. Gordon Bell, and Allen Newell (editors), Computer Structures: Principles and Examples. McGraw-Hill, 1982.
[54] Donald McCracken. A Production System Version of the Hearsay-II Speech Understanding System. PhD thesis, Carnegie-Mellon University, Pittsburgh, 1978.
[55] John McDermott. R1: A Rule-based Configurer of Computer Systems. Technical Report CMU-CS-80-119, Carnegie-Mellon University, Pittsburgh, April 1980.
[56] John McDermott. XSEL: A Computer Salesperson's Assistant. In J. E. Hayes, D. Michie, and Y. H. Pao (editors), Machine Intelligence. Horwood, 1982.
[57] John McDermott. R1: A Rule-Based Configurer of Computer Systems. Artificial Intelligence 19(1):39-88, 1982.
[58] John McDermott. Extracting Knowledge from Expert Systems. In International Joint Conference on Artificial Intelligence, 1983.
[59] W. Van Melle, A. C. Scott, J. S. Bennett, and M. Peairs. The Emycin Manual. Technical Report STAN-CS-81-885, Stanford University, October 1981.
[60] Daniel P. Miranker. Performance Estimates for the DADO Machine: A Comparison of Treat and Rete. In Fifth Generation Computer Systems, ICOT, Tokyo, 1984.
[61] Tohru Moto-oka, Hidehiko Tanaka, Hitoshi Aida, Keiji Hirata, and Tsutomu Maruyama. The Architecture of a Parallel Inference Engine -- PIE --. In International Conference on Fifth Generation Computer Systems, ICOT, 1984.
[62] Allen Newell and Herbert A. Simon. Human Problem Solving. Prentice-Hall, 1972.
[63] Allen Newell. HARPY, Production Systems and Human Cognition. Technical Report CMU-CS-78-140, Carnegie-Mellon University, Pittsburgh, September 1978.
[64] H. P. Nii and N. Aiello. AGE (Attempt to Generalize): A Knowledge-Based Program for Building Knowledge-Based Programs. In International Joint Conference on Artificial Intelligence, 1979.
[65] Nils J. Nilsson. Problem-Solving Methods in Artificial Intelligence. McGraw-Hill, New York, 1971.
[66] Kemal Oflazer. Parallel Execution of Production Systems. In International Conference on Parallel Processing, IEEE, August 1984.
[67] Kemal Oflazer. Partitioning in Parallel Processing of Production Systems. PhD thesis, Carnegie-Mellon University, (in preparation), 1986.
[68] D. A. Patterson and C. H. Sequin. A VLSI RISC. Computer 9, 1982.
[69] James E. Quinlan. A Comparative Analysis of Computer Architectures for Production System Machines. Master's thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University, 1985.
[70] G. Radin. The 801 Minicomputer. IBM Journal of Research and Development 27, May 1983.
[71] Raja Ramnarayan. A Tagged Token Data Flow Computation Model for OPS5 Production Systems. Working Draft, Honeywell CSC, Bloomington, MN, 1984.
[72] Bruce Reed, Jr. The ASPRO Parallel Inference Engine (P.I.E.): A Real Time Production Rule System. Goodyear Aerospace, 1985.
[73] Paul S. Rosenbloom. The Chunking of Goal Hierarchies: A Model of Stimulus-Response Compatibility. PhD thesis, Carnegie-Mellon University, Pittsburgh, August 1983.
[74] Paul S. Rosenbloom, John E. Laird, John McDermott, Allen Newell, and Edmund Orciuch. R1-Soar: An Experiment in Knowledge-Intensive Programming in a Problem-Solving Architecture. In IEEE Workshop on Principles of Knowledge Based Systems, 1984.
[75] Paul S. Rosenbloom, John E. Laird, Allen Newell, Andrew Golding, and Amy Unruh. Current Research on Learning in Soar. In International Workshop on Machine Learning, 1985.
[76] Larry Rudolph and Zary Segall. Dynamic Decentralized Cache Schemes for MIMD Parallel Processors. In International Symposium on Computer Architecture, 1984.
[77] Mike Rychener, Joe Kownacki, and Zary Segall. Parallel Production Systems: OPS3. In Cm*: An Experiment in Multiprocessing. Digital Press, 1986.
[78] Ehud Y. Shapiro. A Subset of Concurrent Prolog and Its Interpreter. Technical Report, ICOT -- Institute for New Generation Computer Technology, February 1983.
[79] David Elliot Shaw. The NON-VON Supercomputer. Technical Report, Columbia University, New York, August 1982.
[80] David Elliot Shaw. On the Range of Applicability of an Artificial Intelligence Machine. Technical Report, Columbia University, January 1985.
[81] E. H. Shortliffe. Computer-Based Medical Consultations: MYCIN. North-Holland, 1976.
[82] Herbert A. Simon. The Architecture of Complexity. In The Sciences of the Artificial, chapter 7. MIT Press, 1981.
[83] David E. Smith and Michael R. Genesereth. Ordering Conjunctive Queries. Artificial Intelligence 26(2):171-215, 1985.
[84] Salvatore J. Stolfo and David E. Shaw. DADO: A Tree-Structured Machine Architecture for Production Systems. In National Conference on Artificial Intelligence, AAAI, 1982.
[85] Salvatore J. Stolfo, Daniel Miranker, and David E. Shaw. Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence. In International Joint Conference on Artificial Intelligence, 1983.
[86] Salvatore J. Stolfo. Five Parallel Algorithms for Production System Execution on the DADO Machine. In National Conference on Artificial Intelligence, AAAI, 1984.
[87] S. N. Talukdar, E. Cardozo, L. Leao, R. Banares, and R. Joobbani. A System for Distributed Problem Solving. In Workshop on Coupling Symbolic and Numerical Computing in Expert Systems, August 1985.
[88] Stephen Taylor, Christopher Maio, Salvatore J. Stolfo, and David E. Shaw. PROLOG on the DADO Machine: A Parallel System for High-Speed Logic Programming. Technical Report, Columbia University, New York, January 1983.
[89] M. F. M. Tenorio and D. I. Moldovan. Mapping Production Systems into Multiprocessors. In International Conference on Parallel Processing, IEEE, 1985.
[90] Jeffrey D. Ullman. Principles of Database Systems. Computer Science Press, 1982.
[91] Shinji Umeyama and Koichiro Tamura. A Parallel Execution Model of Logic Programs. In The 10th Annual International Symposium on Computer Architecture, IEEE and ACM, June 1983.
[92] D. A. Waterman and Frederick Hayes-Roth. Pattern-Directed Inference Systems. Academic Press, 1978.
[93] S. M. Weiss and C. A. Kulikowski. EXPERT: A System for Developing Consultation Models. In International Joint Conference on Artificial Intelligence, 1979.

Appendix A
ISP of Processor Used in Parallel Implementation

! This is an ISPS [3] based description of the instruction-set
! architecture of the individual processors. The description is designed
! for use with the simulator, so that the cost of executing production-
! system code on this machine may be computed. The instructions are
! partitioned into a small number of classes, and all instructions
! within a class are treated the same by the cost models. Instructions
! that make memory references (the number of memory references being
! either 0 or 1 for almost all instructions) are the only ones treated
! specially by the cost models, since they cost more.
**PC.State**

R[0:31]<0:31>,          ! General Purpose Registers
IR<0:31>,               ! Instruction Register
PSW<0:31>,              ! Prog. Status Word; Cond. Code Register
PREFIX<0:31>,           ! Prefix Register

PC   := R[31],          ! Program Counter
SP   := R[30],          ! System Stack Pointer
Link := R[29],          ! System Link Register
Zero := R[28];          ! Zero Register

Z<> := PSW<0>,          ! zero bit
V<> := PSW<1>,          ! overflow bit
N<> := PSW<2>,          ! negative bit
C<> := PSW<3>,          ! carry bit

**Instruction Word Fields**

opcode<0:5> := IR<26:31>,   ! opcode of instruction
type<0:1>   := IR<24:25>,   ! type (e.g., byte, word, ...) of instruction
rd<0:4>     := IR<19:23>,   ! destination register
mode<0:1>   := IR<17:18>,   ! mode used for interpretation of operands
rs<0:4>     := IR<12:16>,   ! source register
rx<0:4>     := IR<12:16>,   ! index register / base reg

r1<0:4>     := IR<19:23>,   ! first register specified by instruction
r2<0:4>     := IR<12:16>,   ! second register specified by instruction
r3<0:4>     := IR<7:11>,    ! third register specified by instruction

xd<0:4>     := IR<19:23>,   ! 5-bit signed/unsigned constant
xs<0:4>     := IR<12:16>,   ! 5-bit signed/unsigned constant

disp/const<0:6>        := IR<0:6>,    ! 7-bit displacement or constant
ldisp/lconst<0:11>     := IR<0:11>,   ! 12-bit displacement or constant
lldisp/llconst<0:16>   := IR<0:16>,   ! 17-bit displacement or constant
llldisp/lllconst<0:23> := IR<0:23>,   ! 24-bit displacement or constant

! Note: The difference between constants and displacements is that
! displacements are shifted by the type (byte, word, long, ...) of the
! instruction, while constants are not affected by the type of the
! instruction. For example, if a constant is used to specify a bit
! within a word, it will not be shifted. Constants may be signed or
! unsigned, as relevant in the context of the instruction.

!  31       26 25 24 23     19 18  17 16      12 11      7 6         0
!  +----------+-----+----------+------+----------+---------+----------+
!  |  opcode  |type | r1/rd/xd | mode | r2/rs/rx |   r3    |disp/const|
!  +----------+-----+----------+------+----------+---------+----------+
!                       INSTRUCTION REGISTER
! (The longer displacement/constant fields ldisp<0:11>, lldisp<0:16>,
! and llldisp<0:23> overlap the fields to their left.)
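For concreteness, the fields above are fixed bit slices of the 32-bit instruction register. The following C sketch (the function names are illustrative, not part of the ISPS description; the bit positions are the ones defined above) shows how a decoder would extract the common fields:

    #include <stdint.h>

    /* Extract bits hi..lo (inclusive) of a 32-bit instruction word. */
    static unsigned bits(uint32_t ir, int hi, int lo) {
        return (ir >> lo) & ((1u << (hi - lo + 1)) - 1);
    }

    static unsigned opcode(uint32_t ir) { return bits(ir, 31, 26); } /* opcode<0:5> */
    static unsigned type_f(uint32_t ir) { return bits(ir, 25, 24); } /* type<0:1>   */
    static unsigned rd_f  (uint32_t ir) { return bits(ir, 23, 19); } /* rd, r1, xd  */
    static unsigned mode_f(uint32_t ir) { return bits(ir, 18, 17); } /* mode<0:1>   */
    static unsigned rs_f  (uint32_t ir) { return bits(ir, 16, 12); } /* rs, rx, r2  */
    static unsigned r3_f  (uint32_t ir) { return bits(ir, 11,  7); } /* r3<0:4>     */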
Interpretation do not need type bits) ISP OF PROCESSOR USED IN PARALLEL add2 llconst,rd O0 rd <-- rs,rd 01 rd <-- rd + rs (rx)Idisp,rd 10 rd <-- rd + M[rx + Idisp] (rx)r3,rd 11 rd <-- rd + M[rx + r3] I Note, when I type of the I with the mode = 11, the instruction, same format contents just of like according displacement. to the Other instr. are: add with subtract carry subc2 subtract with and2 logical and or2 logical or xor2 logical xor I Instructions rd + llconst r3 are shifted any other addc2 sub2 I instr 189 IMPLEMENTATION in the three-operand Args carry form. Mode Interpretation ............................ add3 rd <-- rs + r3 Ol rd <-- rs + Iconst rs,(r3)disp,rd 10 rd <-- rs + M[r3 + disp] rd <-format are: xs + M[r3 + disp] 11 the same addc3 I add with sub3 ! subtract subc3 ! subtract and3 I logical and or3 I logical or xor3 I logical xor • * Shift Instructions I instr with carry "* Args Mode Interpretation Other rs,r3,rd DO rd <-- r3 shifted-by xs,r3,rd 01 rd r3 xs,(r3)disp,rd 10 related instructions with the shr I shift right shra I shift right ** Bit-Field Extract/Insert I The following I modes I take 10 and up the two instructions 11 are bits also that not would := IR<7:11>, siz<0:4> := IR<0:4>, I instr Args ............................ Mode <-- rd same <-M[r3 + disp] format are: rs xs shifted-by xs '' do not make used. shifted-by arithmetic Instructions pos<0:4> | carry ............................ shl I O0 rs,lconst,rd xs,(r3)disp,rd instructions with ! Other I rs,r3,rd The be needed use of the reason to specify is that type pos a useful Interpretation bits. and The siz fields memory address. 190 PARALLELISM bfx bfi rs,pos,siz,rd O0 rd<O,siz-l> <-- rs<pos,pos+siz-l> xs,pos,siz,rd O1 rd<O,siz-l> <-- xs<pos,pos+siz-l> rs,pos,siz,rd O0 rd<pos,pos+siz-l> <-- rs<O,siz-l> xs,pos,siz,rd O1 rd<pos,pos+siz-l> <-- xs<O,siz-l> *" Load/Store I Note: t these Instructions The two I quad-word type I instr ............................ Id (load) also is necessary Args st Ida ldpi I Ida: *" instructions instructions i ld to accommodate are type special, (type the hardware task rd <-- 01 rd <-- rs (rx)ldisp,rd 10 rd <-- M[rx + Idisp] (rx)r3,rd 11 rd <-- M[rx + r3] xd,(rx)Idisp O0 M[rx + Idisp] <-- xd xd,(rx)r3 Ol M[rx + r3] <-- xd r1,(rx)Idisp I0 M[rx + Idisp] <-- rl rl,(rx)r3 II M[rx + r3] <-- rl (rx)Idisp,rd I0 rd <-- rx + Idisp (rx)r3,rd 11 rd <-- rx + r3 (shifted const:IR<0:25> -- PREFIX<O:25> Args llconst <-- (shifted by type) by type) IR<0:25> instruction. instruction. The bits with with PREFIX<O:25> or constant, Instructions The scheduler. Interpretation O0 load-address in that = 11). the number being IR<0:5> to form of the following a 32 bit PREFIX<O:25>:IR<0:5>. *" Mode Interpretation ............................ push pusha pop ** st (store) rs,rd I displacement I and a quad-word llconst,rd is the I instr have Node I Idpi: is the load-prefix I instruction are combined • * Push/Pop IN PRODUCTION Subroutine llconst,rd O0 (rd)++; M[rd] <-- rs,rd 01 (rd)++; M[rd] <-- rs (rx)ldisp,rd 10 (rd)++; M[rd] <-- M[rx + ldisp] (rx)r3,rd ll (rd)++; M[rd] <-- M[rx + r3] (rx)ldisp,rd 10 (rd)++; M[rd] <-- rx + ldisp; (rx)r3,rd 11 (rd)++; M[rd] <-- rx + r3; rs,rd -- rd Linkage Instructions '' <-- M[rs]; (rs)--; llconst SYSTEMS 1SP OFPROCI_SOR ! instr ! 191 USED IN PARALLELIMPLEMENTATION Args Mode Interpretation ............................ 
** Load/Store Instructions **

! Note: The two instructions ld (load) and st (store) are special, in
! that these instructions also have quad-word variants (type = 11). The
! quad-word type is necessary to accommodate the hardware task scheduler.

! instr   Args             Mode   Interpretation
! -----------------------------------------------
  ld      llconst,rd        00    rd <-- llconst
          rs,rd             01    rd <-- rs
          (rx)ldisp,rd      10    rd <-- M[rx + ldisp]
          (rx)r3,rd         11    rd <-- M[rx + r3]
  st      xd,(rx)ldisp      00    M[rx + ldisp] <-- xd
          xd,(rx)r3         01    M[rx + r3] <-- xd
          r1,(rx)ldisp      10    M[rx + ldisp] <-- r1
          r1,(rx)r3         11    M[rx + r3] <-- r1
  lda     (rx)ldisp,rd      10    rd <-- rx + ldisp   (ldisp shifted by type)
          (rx)r3,rd         11    rd <-- rx + r3      (r3 shifted by type)
  ldpi    const:IR<0:25>    --    PREFIX<0:25> <-- IR<0:25>

! lda is the load-address instruction. ldpi is the load-prefix
! instruction. The bits IR<0:25> of the ldpi instruction are combined
! with the bits IR<0:5> of the following instruction to form a 32-bit
! displacement or constant, PREFIX<0:25>:IR<0:5>.

** Push/Pop Instructions **

! instr   Args             Mode   Interpretation
! -----------------------------------------------
  push    llconst,rd        00    (rd)++; M[rd] <-- llconst
          rs,rd             01    (rd)++; M[rd] <-- rs
          (rx)ldisp,rd      10    (rd)++; M[rd] <-- M[rx + ldisp]
          (rx)r3,rd         11    (rd)++; M[rd] <-- M[rx + r3]
  pusha   (rx)ldisp,rd      10    (rd)++; M[rd] <-- rx + ldisp
          (rx)r3,rd         11    (rd)++; M[rd] <-- rx + r3
  pop     rs,rd             --    rd <-- M[rs]; (rs)--

** Subroutine Linkage Instructions **

! instr    Args          Mode   Interpretation
! ----------------------------------------------
  brlink   r1,lldisp      00    Link <-- PC; PC <-- r1 + lldisp
           r1,r2          01    Link <-- PC; PC <-- r1 + r2
  brlinki  (r1)lldisp     00    Link <-- PC; PC <-- M[r1 + lldisp]
           (r1)r2         01    Link <-- PC; PC <-- M[r1 + r2]
  jsb      r1,lldisp      00    (SP)++; M[SP] <-- PC; PC <-- r1 + lldisp
           r1,r2          01    (SP)++; M[SP] <-- PC; PC <-- r1 + r2
  bsb      llldisp        --    (SP)++; M[SP] <-- PC; PC <-- PC + llldisp
  rsb      --             --    PC <-- M[SP]; (SP)--

! NOTE: The brlinki (branch-link-indirect) instruction is especially
! useful when code for the various nodes in the Rete network is shared.
! For example, it can be used to jump to one of several MakeToken
! routines, depending on the type of the token.

** Comparison and Control Flow Instructions **

! There are four types of instructions in this category: compare (cmp),
! compare-and-branch (cb), branch (br), and jump (jmp). The compare
! instructions set some combination of the Z, V, N, C bits in the PSW.
! The branch instructions can then branch on some combination of these
! bits in the PSW.

! instr   Args             Mode   Interpretation
! -----------------------------------------------
  cmp     r1,llconst        00    compare(r1, llconst)
          r1,r2             01    compare(r1, r2)
          r1,(rx)ldisp      10    compare(r1, M[rx + ldisp])
          r1,(rx)r3         11    compare(r1, M[rx + r3])
  br      llldisp           --    PC <-- PC + llldisp
  br_xx   llldisp           --    if xx(N,Z,V,C) then PC <-- PC + llldisp
! where xx = z, nz, ovf, neg, eq, neq, lt, gt, le, ge
  cb_xx   r1,r2,ldisp       00    if xx(r1, r2) then PC <-- PC + ldisp
          r1,(r2),ldisp     01    if xx(r1, M[r2]) then PC <-- PC + ldisp
          xd,r2,ldisp       10    if xx(xd, r2) then PC <-- PC + ldisp
          xd,(r2),ldisp     11    if xx(xd, M[r2]) then PC <-- PC + ldisp
! where xx = eq, neq, lt, gt, le, ge
  jmp     llldisp           --    PC <-- PC + llldisp
  jmpi    (r1)lldisp        00    PC <-- M[r1 + lldisp]
          (r1)r2            01    PC <-- M[r1 + r2]
  jmpx    r1,lldisp         00    PC <-- r1 + lldisp
          r1,r2             01    PC <-- r1 + r2

** Synchronization Instructions **

! tsi: test-and-set-interlocked
! tci: test-and-clear-interlocked

pos<0:4> := xs

! instr   Args            Mode   Interpretation
! ----------------------------------------------
  tsi     (r1)ldisp,pos    00    cmp(M[r1+ldisp]<pos>,0); M[r1+ldisp]<pos> <-- 1
          (r1)r2,pos       01    cmp(M[r1+r2]<pos>,0);    M[r1+r2]<pos>    <-- 1
  tci     (r1)ldisp,pos    00    cmp(M[r1+ldisp]<pos>,1); M[r1+ldisp]<pos> <-- 0
          (r1)r2,pos       01    cmp(M[r1+r2]<pos>,1);    M[r1+r2]<pos>    <-- 0

! instr    Args            Mode   Interpretation
! -----------------------------------------------
  incr_i   (rx)ldisp,rd     00    rd = (M[rx + ldisp] += 1); also set flags
           (rx)r3,rd        01    rd = (M[rx + r3]   += 1); also set flags
  decr_i   (rx)ldisp,rd     00    rd = (M[rx + ldisp] -= 1); also set flags
           (rx)r3,rd        01    rd = (M[rx + r3]   -= 1); also set flags

! The increment_interlocked and decrement_interlocked instructions are
! useful to maintain shared counters in multiprocessors. It is possible
! to do the incr/decr operations at the memory-board itself, so that
! these instructions should not be any more expensive than a read from
! memory.
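The tsi/tci pair is the primitive from which all the locks in Appendix B are built. The following C sketch shows the resulting spinlock discipline, with C11 atomics standing in for the interlocked bit operations; the relaxed load in the inner loop corresponds to the cb_neq spin on the cached lock word:

    #include <stdatomic.h>

    /* Spin until the bit can be set; mirrors the cb_neq / tsi / br_nz idiom. */
    static void lock(atomic_int *l) {
        for (;;) {
            while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                ;                               /* spin in the cache    */
            if (atomic_exchange(l, 1) == 0)     /* tsi: test-and-set    */
                return;                         /* we got the lock      */
        }
    }

    static void unlock(atomic_int *l) {
        atomic_store(l, 0);                     /* tci: test-and-clear  */
    }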
** Miscellaneous Instructions **

! instr     Args    Mode   Interpretation
! -----------------------------------------
  put_psw   .....    --    .....
  get_psw   .....    --    .....

Appendix B
Code and Data Structures for Parallel Implementation

B.1. Code for Interpreter with Hardware Task Scheduler

/*
 * Cost model for OPS5 production systems. To be used in the parallel
 * implementation using the hardware task scheduler.
 */

/* Data structure declarations */

typedef struct TokenTag {
    struct TokenTag *tNext;
    int nid;
    int refCount;              /* required for left-not tokens */
    struct TokenTag *tLeft;    /* Beta-Token: left component */
    struct TokenTag *tRight;   /* Beta-Token: right component */
    struct WmeTag *wptr[32];
} Token;

typedef struct TokPtrTag {
    Token *ltok;
    Token *rtok;
} TokPtr;

/*
 * Structure of data sent to and received from the HTS. The data
 * consists of two words. The first word is sent from and received into
 * the register R'-nid. It consists of three pieces of information:
 * R'-nid<31> := Flag, i.e., Insert/Delete; R'-nid<30> := Dir, i.e.,
 * Left/Right; and R'-nid<29:0> := Node-ID. The second word is sent from
 * and received into the register R'-tokPtr. It contains a pointer to a
 * Token or a TokPtr.
 *
 * word1:
 *   +------+-----+---------------------------------------------+
 *   | Flag | Dir |                  Node-ID                    |
 *   +------+-----+---------------------------------------------+
 *     31     30    29                                         0
 *
 * word2:
 *   +----------------------------------------------------------+
 *   |          tokPtr: pointer to Token or TokPtr              |
 *   +----------------------------------------------------------+
 *     31                                                       0
 */
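The flag/dir/node-id word packs three fields into 32 bits. A small C sketch of the packing and unpacking (field positions as in the diagram above; the function names are illustrative):

    #include <stdint.h>

    #define FLAG_INSERT 0u
    #define FLAG_DELETE 1u
    #define DIR_LEFT    0u
    #define DIR_RIGHT   1u

    /* word1 sent to the HTS: <31>=flag, <30>=dir, <29:0>=node-id */
    static uint32_t pack_task(unsigned flag, unsigned dir, uint32_t nid) {
        return (flag << 31) | (dir << 30) | (nid & 0x3FFFFFFFu);
    }

    static void unpack_task(uint32_t w, unsigned *flag, unsigned *dir,
                            uint32_t *nid) {
        *flag = (w >> 31) & 1u;    /* corresponds to bfx R-nid,#31,#1,R-flg */
        *dir  = (w >> 30) & 1u;    /* corresponds to bfx R-nid,#30,#1,R-dir */
        *nid  =  w & 0x3FFFFFFFu;  /* corresponds to bfx R-nid,#0,#30,R-nid */
    }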
#define MaxWmeFields 128    /* max fields in working-memory element */
#define WmeHTSize   4096    /* size of Wme hash table */

typedef struct WmeTag {
    struct WmeTag *wNext;
    int wTimeTag;
    int wVal[MaxWmeFields];
} Wme;

typedef struct {
    int lock;       /* lock for modifying wList */
    Wme *wList;     /* list of wmes associated with this entry */
} WmeHTEntry;

extern-shared WmeHTEntry WmeHT[WmeHTSize];

/*
 * Register R-wmeHT contains a pointer to the shared data structure
 * WmeHT. Both symbolic and numeric data within the wVal fields of the
 * Wme are properly encoded: symbolic data is encoded as (sym-val * 2),
 * while numeric-integer data is encoded as (val * 2 + 1). Standard
 * integer comparisons work between encoded values.
 */

#define TokenHTSize 4096    /* size of Token hash table */

typedef struct {
    short lock;         /* lock to modify refCount safely */
    short refCount;     /* -1: writeLock, 0: free, +k: k readers */
    Token *tList;       /* ptr to token list */
} TokHTEntry;

extern-shared TokHTEntry LTokHT[TokenHTSize];  /* hash table for left toks */
extern-shared TokHTEntry RTokHT[TokenHTSize];  /* hash table for right toks */

/*
 * Pointers to the two token hash tables are available in the registers
 * R-ltokHT and R-rtokHT.
 */

#define MaxNodes 10000
#define NodeTableSize MaxNodes

typedef struct nodeTag {
    struct nodeTag *next;
    unsigned dir.succNid;
} Node;

typedef struct {
    unsigned addr-lef-act;     /* code address for left activation */
    unsigned addr-rht-act;     /* code address for right activation */
    unsigned addr-mak-tok;     /* code address for makTok-lces-rces */
    unsigned addr-lef-hash;    /* code address for left hash fn */
    unsigned addr-rht-hash;    /* code address for right hash fn */
    unsigned addr-do-tests;    /* code for tests assoc with node */

    short lock;                /* for safe modification of other fields */
    short refCount;            /* used by implementation with STQs */
    Token *leftExtraRem;       /* extra removes due to conjugate pairs */
    Token *rightExtraRem;      /* extra removes due to conjugate pairs */
    Node  *succList;           /* set of successors of the node */
} NodeTableEntry;

extern-shared NodeTableEntry NodeTable[NodeTableSize];

/*
 * A pointer to the NodeTable is available in the register R-nodTab.
 * Note: Most of the fields in the node-table data structure are wasted
 * for the txxx nodes. If space is at a premium, an extra table can be
 * used for the txxx-nodes. Subsequently, the PopG routine can check for
 * txxx-nodes using an appropriate test just before the branch at the
 * end of the PopG code.
 */

/* Some data structures used for conflict-resolution */

typedef struct ConResComTag {
    struct ConResComTag *next;
    Token *ltok;
    Token *rtok;
    int flag;
} ConResCom;

extern-shared ConResCom *ConflResComList;  /* the Confl-Res Command List */
extern-shared int ConflResComListLock;     /* lock associated with the list */

/*
 * ** Register Definitions **
 * Note: The register allocation is done by hand for all the code.
 * The system-defined registers are: R[31] == PC, R[30] == SP,
 * R[29] == Link, and R[28] == Zero.
 */
I The code I and R'-tokPtr executed stq 1 ** to schedule already a new task contain the on the relevant registers R'-nid data. R'-nid---R'-tokPtr,(R-HTS) ReleaseNodeLock ** ] ........................ I This code is executed when node activation R-flg.#31,#1,R-nid I Insert val of flag bfi R-dir,#30,#1,R-nid ! Insert val of dir field stl NULL,(R-HTS)I I Set stl R-nid,(R-HTS) ! inform br PopG I PopG is a global is being maintained I flg-dir-nid that PendingTesksCount info is sent I value of (R-HTS)I is set I task, it will ] ................. scheduled bfi I It is assumed I "* a globally Root-Node find "* the to the HTS to NULL, value NULL again so that until value to speed when the field of tokPtr HTS PopG about end in bit in bit field 31 30 to NULL of processing label by the HTS. up processing is looping HTS modifies finishes. it. The there. to find The a 198 PARAI.LELISM ! The t cycle. root-node I receives struct tokTag I int I Token I struct L beg: L_end of of each wme-changes entry in to the be processed list is as in ** I each follows: { flag; *tokPtr; tokTag *tNext; Idl (R-globTab)wme-toks,rO Idpi prefix_dir.&bus-nid ) Tok; I get list Idl restof_dir.&busnid,R'-nid cb_eq #O,rO,L_end or2 (rO),R'-nid Idl PUSHG (rO)tokPtr,R'-tokPtr Idl (rO)tNext,rO bfi #O,#31,#1,R'-nid br L_beg Bus Node Note: t the I clear of changes the flag to process field ** Some of the successors same processor), (scheduled through while may others be processed as may be processed subtasks as (processed independent on tasks HIS). ldpi prefix_dir.succ-nid I Idl restof_dir.succ-nid,R'-nid I depending Several repetitions of this bfi PUSflG R-flg,#31,#1,R'-nid I and # of non-teqa/non-teqn on # of beta-lev ................................................................... Idpi prefix push restof_L_exit,SP L_exit Idpi prefix_nodeAddress I 0 or more, push restof_nodeAddress,SP I non-teqn if some done as non-teqa/ subtasks. ................................................................... Idl (R-tokPtr)wme,R-wmePtr and3 b'111OO',(R-wmePtr)2,rO add2 LocTabOffset,rO jmpi (PC)rO L_exit: RELEASE-NODE-LOCK LocTab: address-of-succ-nodel I (R-wmePtr)2 address-of-succ-node2 address-of-succ-node8 I ** SYSTEMS RELEASE-NODE-LOCK ! ! a list The structure IN PRODUCTION Constant-Test Nodes or Txxx Nodes == is type of wme. code succs succs. CODEAND DATA STRUCIURESI,XgR PARALLEI, 199 IMPI,EMENTATION ! ......................................... ! The code ! txxx below node I in ! cases: in which f which (1) When hashing I to no txxx I Case-I: rete successors successors 1 are corresponds the be of the can .... make as node are has hashing The on to be code the (2) but (3) is custom for each number, type, and the scheduled, some teqa/teqn useful, subtasks. is node. beneficially. be processed When hashing a toqa Depending txxx-node to to the be used processed nodes to network. there are successors for When there there Same as are are no teqa/teqn some txxx case-2, but way several when nodes which there are as subtasks. used. Idl (R-tokPtr)wme,R-wmePtr ! executed, Idpi prefix_type.val ! recall: Idl restof_type.val,rO Idl (R-wmePtr)field,rl cb_neq rO,r1,L_fail/L_exit ! if (thru iff sched num= HTS) (through 2*val, sym L exit else HTS) = 2*val+1 L_fail .................................................................. Idpi prefix_dir.succ-nid ! Several ldl restof_dir.succ-nid,R'-nid ! depending repetitions of this bfi PUSHG R-flg,#31,#1,R'-nid I and # of non-teqa/non-teqn code on # of beta-lev succs succs. 
................................................................... Idpi prefix_L_exit I Done push restof_L_exit,SP i through iff node is scheduled HTS. ................................................................... ldpi prefix_nodeAddress ! 0 or more, push restof_nodeAddress,SP ! non-teqn if some done as non-teqa/ subtasks. ................................................................... and3 b'lllOO',(R-wmePtr)fld,rO add2 LocTabOffset,rO jmpi (PC)tO I extract from of the L fail: rsb I Pop L_exit: LocTab: RELEASE-NODE-LOCK address-of-succ-nodel I used one only field to hash on subtasks if sched through HTS. address-of-succ-node2 address-of-succ-node8 I Case-II: When hashing is NOT used, but some txxx-nodes ldl (R-tokPtr)wme,R-wmePtr I executed, Idpi prefix I recall: Idl restof_type.val,rO Idl (R-wmePtr)field,rl cb neq rO,r1,L_fail/L_exit Idpi prefix_dir.succ-nid I Several Idl restof_dir.succ-nid,R'-nid I depending type.val I if (thru as iff num- HTS) subtasks. sched (through HTS) Z'val, sym = 2"val+1 L_exit else repetitions on # of L_fail of this beta-lay code succ$ 200 PARALLELISM bfi PUSHG R-flg,#31,#1,R'-nid ! and # of IN PRODUC'TION non-teqa/non-teqn SYSTEMS succs. ................................................................... ldpi prefix_L_exit ! push restof_L_exit,SP 1 through Done iff the the node is scheduled HTS. ................................................................... ldpi prefix_nodeAddress ! one push restof_nodeAddress,SP I done or more of txxx nodes as subtasks. ................................................................... L_fail: rsb 1 Pop one L_exit: RELEASE-NODE-LOCK I ! Case-III: When hashing is NOT used, and used the subtasks if sched no txxx-nodes as subtasks. iff sched num= 2*val, ldl (R-tokPtr)wme,R-wmePtr ! executed, ldpi prefix I recall: ldl restof_type.val,rO type.val of only through (through HTS. HTS) sym = 2*val+l ldl (R-wmePtr)field,rt cb_neq rO,rl,L_exit Idpi prefix_dir.succ-nid I Several Idl restof_dir.succ-nid,R'-nid ! depending bfi PUSHG R-flg,#31,#1,R'-nid I and # of non-teqa/non-teqn repetitions of this on # of beta-lev code succs succs. ................................................................... L_exit: RELEASE-NODE-LOCK/rsb I *" Left ! The Beta code And Node for a right Idl I if (thru HTS) REL-N... else rsb *" activation (R-tokPtr), can be obtained by minor I check extra-removes substitutions. R-ltokPtr Idl (R-tokPtr)1,R-rtokPtr HASH-TOKEN-LEFT cb_eq Delete,R-flg,L Idal (R-node)LefExRem,rO del cb_eq NULL,(rO),LIO CHECK-EXTRA-REMOVES LIO: I result in r5. field O=>OK, for 1->ExtraRemoves cb_eq #1,r5,L_exit MAKE-TOKEN INSERT-LTOKEN br L_del: Lll DELETE-LTOKEN cb_neq NULL,R-ltok,L11 INSERT-EXTRA-REMOVE L11: br L_exit Idal (R-rtokHT)R-hIndex,r5 I see if hash bucket node is empty CODE AND DATA STRUCTURES FOR PARAI,L[-L cmp NULL,(rS)tList br_eq L_exit 201 IMPLEMENTATION LOCK-RTOKEN-HT Idl L loop: (rS)tList,R-state NEXT-MATCHING-RTOKEN SCHEDULE-SUCCESSORS L12: br L_Ioop RELEASE-RTOKEN-HT L_exit: FREE-TOKPTR I L12 is used within NextMatchingTok RELEASE-NODE-LOCK ! Note: In the previous ! RELEASE-TOKEN-HT ! was zero. i the hash In this bucket ! overhead ! longer present. look ! was based tokens, if the in them, execute I Alpha Left the then the even the above between the above would table has not The full, code several been that are sequence of code. the no tokens skipped. 
in the (1) The is no if there are hash-bucket while is, most if nodes tokens, disadvantage that if there is skipped advantages: is skipped, have and in the opp-mem in the memory probability entry. pretty of the following code LOCK-TOKEN-HT of tokens of tokens is a large gets And Node has counts code node table section if the opp-mem there per hash ! we will ** This so that on counts the code if the number that the (2) Even of space ! tokens version, is NULL. is empty, ! 4 bytes ! that version, skipped of maintaining ! no matching ! we was where if the decision (3) of the buckets opposite It saves scheme have is some memory, ** ! .......................... HASH-TOKEN-LEFT cb_eq Delete,R-flg,L_del MAKE-TOKEN LIO: INSERT-LTOKEN br L del: LII: L11 DELETE-LTOKEN Idal (R-rtokHT)R-hIndex,r5 cmp NULL,(rS)tList I see if hash bucket is empty I L12 is used within NextMatchingToken br_eq L_exit LOCK-RTOKEN-HT Idl L_Ioop: (rS)tList,R-state NEXT-MATCHING-RTOKEN SCHEDULE-SUCCESSORS LI2: br L loop RELEASE-RTOKEN-HT L_exit: RELEASE-NODE-LOCK 1 ** Left I ......................... Beta Not Node ** 202 PARALLF.LISM Idl (R-tokPLr), Idl (R-tokPtr)l,R-rtokPtr IN PRODUCrlON SYSTEMS R-ItokPtr HASH-7OKEN-LEFT cb_eq Delete,R-flg,L Idal (R-node)LefExRem,rO del cb_eq NULL,(rO),LIO CHECK-EXTRA-REMOVES cb eq LIO: I check extra-removes I result field is returned for node in r5 #_,rS,L_exit MAKE-TOKEN INSERT-LTOKEN stl #O,(R-ltok)refC ldal (R-rtokHT)R-hIndex,rO cmp NULL_(rO)tList br_eq L12 I Initialize I skip refCount DetRefCount to 0 if hash-bucket=NULL LOCK-RTOKEN-HT DETERMINE-REFCOUNT RELEASE-RTOKEN-HT br L_del: L11 DELETE-LTOKEN cb_neq NULL,R-Itok,Lll INSERT-EXTRA-REMOVE br L_exit L11: cmp Zero,(R-Itok)refC L12: br_neq L_exit SCHEDULE-SUCCESSORS L exit: FREE-TOKPTR ] check value found by DetRefCount RELEASE-NODE-LOCK I ** Left Alpha Not Node "* I .......................... HASH-TOKEN-LEFT LIO: cb_eq Delete,R-flg,L_del MAKE-TOKEN INSERT-LTOKEN stl #O,(R-Itok)refC ldal (R-rtokHT)R-hIndex,rO cmp NULL,(rO)tList br_eq L12 I Initialize ! skip refCount DetRefCount to 0 if hash-bucket-NULL LOCK-RTOKEN-MT DETERMINE-REFCOUNT RELEASE-RTOKEN-HT br Lll L del: DELETE-LTOKEN L11: cmp L12: br_neq L_exit SCHEDULE-SUCCESSORS L_exit: RELEASE-MODE-LOCK Zero,(R-Itok)refC I check value found by DetRefCount CODE AN D DATA S'FR UCTURFS ! ** Right Beta FOR PARALI_I(L Not Node 203 IM PLEM EN'FATION ** ! .......................... Idl (R-tokPtr), R-ltokPtr Idl (R-tokPtr)1,R-rtokPtr HASH-TOKEN-RIGHT cb_eq Delete,R-flg,L ldal (R-node)RightExRem,rO del cb_eq NULL,(rO),LIO CHECK-EXTRA-REMOVES LIO: I check extra-removes field is returned in r5 ! result for node cb_eq #1,r5,L_exit MAKE-TOKEN INSERT-RTOKEN br L_del: L21 DELETE-RTOKEN cb_neq NULL,R-rtok,L11 INSERT-EXTRA-REMOVE L11: L loop: br L_exit ldal (R-node)R-hIndex,r5 cmp NULL,(rS)tList br_eq L_exit I see if hash bucket xor2 #1,R-flg LOCK-LTOKEN-HT R-flg gets Idl Note: LockTokenHT (r5)tList,R-state NEXT-MATCHING-LTOKEN-NOT L12 used value within is empty of opp-flag does not NextMatch use .... SCHEDULE-SUCCESSORS LI2: br L loop xor2 #1,R-flg restore R-flg to not(opp-flag) RELEASE-LTOKEN-HT L exit: FREE-TOKPTR RELEASE-NODE-LOCK ! '' Right Alpha Not Node *" I ........................... 
HASH-TOKEN-RIGHT cb_eq LIO: Delete,R-flg,L_del MAKE-TOKEN INSERT-RTOKEN br LII L_del: DELETE-RTOKEN Ll1: Idal (R-ItokHT)R-hIndex,r5 cmp NUlL,(rS)tList br_eq L_exit xor2 #1,R-flg LOCK-LTOKEN-HT Idl L_Ioop: I see if hash I R-flg gets bucket-us value of opp-flag (r5)tList,R-stete NEXT-MATCHING-LTOKEN-NOT I LI2 used within NextMatch SCHEDULE-SUCCESSORS L12: empty br L_Ioop xor2 #l,R-flg I restore value of R-flg .... r5 204 PARAIJ.ELISM IN I'RODUCIION SYSTEMS RELEASE-LTOKEN-HT L_exit: RELEASE-NODE-LOCK * GetNewToken This is used making this tokens list * by are for for allocated of free To obtain code useful from tokens. the code Idl MakeToken. insert shared Thus for The token command from memory, no locks right are is each returned Note, processor keeps to get simply replace I load value in although required activations, (R-globTab)TokFr,R-Itok ptr left. R-ltok, the its own this storage. R-Itok by R-rtok. of _TokenFreeList cb_neq NULL,R-Itok,LIO ALLOCATE-MORE-SPACE LIO: I *" ldl (R-Itok),rO stl rO,(R-globTab)TokFr MakeToken I value of next field of token *" I ................. ! Code ! Code for a left-alpha-token GET-NEW-TOKEN stl R-nid,(R-Itok)nid I copy Idl (R-tokPtr)wme,rO I get pointer node-id stl rO,(R-Itok)wptr ! copy wme I copy node-id in token to wme pointer in token for a left-beta-token GET-NEW-TOKEN stl R-nid,(R-Itok)nid stq R-ItokPtr---R-rtokPtr,(R-ltok)tLeft; brlinki (R-node)addr-mak-tok I Node specific code for make-token. t combination of Ices and rces that I the case ! routines where a linear would be present. Rete network is used (with (R-ltokPtr)wme,rO I get wme stl/stq rD,(R-Itok)wptr I store stmts to copy rest of wme ptrs (R-rtokPtr)wme,rD I get wme stl/stq rO,(R-Itok)wptr I store .. similar stmts Link to copy rest of wme pointer/s from for each Thus for = 32), 32 such in token R-ItokPtr to R-Itok. pointer/s wme ptrs MaxCEs tokens pointer/s wme Idl/Idq jmpx constituent There is one such routine occurs in the rete network. ldl/Idq .. similar in token I copy pointer/s from in token R-rtokPtr to R-Itok. CODE AND DATA STRUCTURESFOR t ** 205 PARALI_EI_IMPLEMENTATION ** GetNewTokPtr ! .................... I Returns the pointer in R'-tokPtr, which Idl (R-globTab)TokFr,R'-tokPtr cb_neq NULL,R'-tokPtr,LIO will then be sent I load to the HTS. value of _TokPtrFreeList ALLOCATE-MORE-SPACE LIO: ! ** ! Idl (R'-tokPtr),rO stl rO,(R-globTab)TokFr FreeTokPtr Note: The I value token to be freed is still in of tokptr R-tokPtr Idl (R-globTab)TokFr,rO ! rO <-- stl rO,(R-tokPtr) ! tokPtr->tNext <-- TokPtrFreeList stl R-tokPtr,(R-globTab)TokFr ! TokPtrFreeList <-- tokPtr Idl R-nid,rO brlinki t the hash (R-node)addr-lef/rht-hash bfx rO,#1,#12,R-hIndex I extract shl #1,R-hIndex,R-hIndex ! since value bits size ! R-hIndex Node-specific Hash TokPtrFreeList "* I and wmeHT ** field ** ! ** HashToken-Left/Right I of next Function Code is accumulated <1:12> of each for is 2 lwords. gives the hash entry in rO value in tokHT This correct way offset. ** ........................................ I Code for beta activations. Here the two components of the token to be I hashed are available in R-ltokPtr and R-rtokPtr. The final value is ! accumulated in tO. Idl (R-ItokPtr)wme,rl I move xor2 (rl)val-x, I xor xor2 (rl)val-y,rO Idl (R-rtokPtr)wme',rl I get wmes from xor2 (r])val-z, I xor ... and so on, rO wptr I multiple rO depending on the tests associated to rl value into rO value value with from same R-rtokPtr into the rO node wme 206 PARALLELISM1N jmpx ! Code for Link alpha ! in R-tokPtr. I only ) ** ! 
Pop activations. R-ItokPtr be one wme Here the ans R-rtokPtr associated with the pointer are back to the (R-tokPtr)wme,rl I move (rl)val-x, I xor xor2 (rl)val-y,rO so on, Link InsertToken rO code is available Note that wptr to rl value ! multiple depending generic SYSTEMS there can token. xor2 ... and to token not used. Idl jmpx PRODUCFION on the tests associated I Pop into rO value with back from the same wme node to generic code ** I ................... I I The following in left-token code hash corresponds table. to inserting is to I Idl (R-ItokHT)rl,rO ! get stl rO,(R-Itok)next ! Itok->tNext <-- ! tokList Itok DeleteToken tokList pointed add3 #1,R-hIndex,rl LOCK-LTOKEN-HT stl R-Itok,(R-ItokHT)rl RELEASE-LTOKEN-HT 1 ** token second tokList by R-ltok longword in bucket in rO <-- tokList ** I ................... 1 The following I is formed code jointly add3 corresponds by R-ItokPtr to the deletion of a left token, which and R-rtokPtr. #l,R-hIndex,rO LOCK-LTOKEN-HT L loop: ldl (R-ItokHT)rO,rl ldl NULL,r2 cb_eq NULL,rl,L20 cmp R-nid,(rl)nid br neq ! compare node-ids L_fail V .................................................................. I The enclosed code V corresponds to deletion Idl (R-Itok)tLeft,r3 I load cmp rl,(rl)tLeft I see br_neq L_fail Idl (R-Itok)tRight,r3 I load cmp r3,(rl)tRight I br_neq L_fail ^ .................................................................. see of ptr to left-part if the ptr if a left two have to left-part the two have beta token of tok same of tok same in r3 left-part in r3 left-part ^ CODE AND DATA STRUCTURIiSFOR 207 PARAI_LELIMPLEMENrFATION OR V .................................................................. I The enclosed V code corresponds to deletion Idl (R-Itok)wptr,r3 I load cmp r3,(rl)wptr I compare br_neq L Fail of a left ptr to the with alpha only wme the other token in r3 token ^ .................................................................. L_fail: L_succ: LIO: L20: br L_succ Idl rl,r2 Idl (rl)next,rl br L_Ioop ldl (rl)next,r3 cb eq NULL,rZ,LIO stl br r3,(r2)next L20 ^ I token to be del is at head of tList stl r3,(R-ItokHT)rO RELEASE-LTOKEN-HT Idl rl,R-Itok I deleted cbeq NULL,rI,L30 I if tok==NULL, ! token Idl (R-globTab)DelTokList,rO stl rO,(rl)next stl r1,(R-globTab)DeITokList token cannot is returned then in R-Itok return. be freed now L30: 1 ** LockTokenHT ** I .................. I The 1 can following be LI: used code "lock" lock-left-token-HT. is the first (R-ltokHT)R-hTndex,r6 cb_neqw #O,(r6),L1 I tsi_w (r6),#O t try br_nz Ll cmp Zero,(r6)refC LIO0 Note field ldal br_neq/It LIO0: implements because if of lock and hash is busy obtain ! O: free, #-I/#1,(r6)refC,r7 stw r7,(r6)refC tci_w br (r6),#O L200 tci_w (r6),#O br L1 I decr/incr I try cb_neq loop the refC out r6 already dep-on again *" has the correct address for the lock cache +k: k readers It:read-lock I ...................... I Note: of lock L200: I *" ReleeseTokenHT instr entry. -I: writer, I neq:write-lock, add3_w that table in it. write/read 208 PARALLELISM LI: cb_neqw #O,(rB),Ll br_neq LI tsi_w (r6),#O br_nz L1 add3 w #1/#-l,(r6)refC,r7 stw rT,(rB)refC tci_w (r6),#O t Note: The time inside I any contention) ! IDO instructions, ! no more than I hardware ! and ! ** the is around then a factor is being cmp_decr_i I incr/decr lock for LockTokenHT 12 instructions. for tokens the above dep-on same benefit node hash-table takes bucket If specialized from cmp_incr_i, (compare-and-if-equal-incr/decr-interlocked) NextMatchingToken (without average be achieved. 
will SYSTEMS write/read ReleaseTokenHT if the on the can code and Thus, arriving of 8 in speed-up designed refC IN PRODUCTION instructions. ** I ......................... I Code below I a pointer L_Ioop: corresponds to the next to a left token activation. that should The be tried cb_eq NULL,R-state,L12 I LI2 occurs cmp R-nid,(R-state)nid I check br_neq L_fail ldl R-state,R-Itok brlinki (R-node)addr-do-tests I result cb_eq #1,r5,L_succ I all tests Idl (R-state)next,R-state L_succ: br L_Ioop Idl (R-state)next,R-state I This is needed is not I update I value R-state calling are stores the code direction must code same so that returned to do dependent in r5 have R-state, I we return ! ** NextMatchingTokenNOT within if node-ids I tests L_fail: register for match. succeeded so that to L_loop in R-state, we and next time check new not old one. "* | ......................... I Code below ! stores I makes L loop: corresponds a pointer extensive to a right-not-activation. to the next use of the token fact that that should the tests cb_eq NULL,R-state,L12 I Lt2 cmp R-nid,(R-state)nid I check br_neq L fail Idl R-state,R-Itok brlinki (R'node)addr-do-tests cb_eq #O,r5,L_fail incr_iw/decr_iw (R-state)refC,rD are occurs register known within if node-ids I result I All The be tried returned tests must R-state for match. at compile calling are The time. code same in r5 have code succeeded CODE AND DATA SI'RUCFURESFOR 209 PARAIJ.]_LIMPLEMENTATION t incr:Insert-flag, ! Updated cb_eq #1/#O,rD,L_succ Idl (R-state)next,R-state L_succ: br L_Ioop Idl (R-state)next,R-state ! ** DetermineRefCount decr:Oel-flag returned ! #1:Insert-flag, ! #I: L_fail: value 0=>I, #0: #0:I=>0 in rO Del-flag transition. ** I ........................ I This code L_loop: is executed on the left-activation of a not-node. add3 #1,R-hIndex,rO ! offset Idal Idl (R-rtokHT)rO,rO #O.rl I get tList of opp-mem in rO I the # of match-toks is accum cb_eq NULL,rO.L_exit cmp R-nid.(rO)nid br neq L_fail brlinki (R-node)addr-do-tests cb eq #O,r5,L_fail add2 #I,ri if node-ids ! result returned I all L exit: ! ** Idl (rO)next,rO br L_Ioop stl r1,(R-Itok)refC DO Tests access ! check tests ! incr L fail: to get I store are have same succeeded of matching count in rl in r5 must num to tList opp-tokens field of Itok ! Note this update does not I be done in an interlocked in refC have to manner. ** 1 ............... ] This piece I a given of code node. The result the case I corresponds t is greater to than is node specific and of the tests when the number performs is the returned of tests in tests associated r5. The associated code with the with below node zero. ............................................................... ldl (R-Itok)wptr,rO l get wme from left-tok in rO Idl (R-rtok)wptr',rl I get wme from opp-tok in rl Idl (rO)field,r2 ! get value cmp r2,(rl)field' I compare br_xx L_fail I xx depends .. the last .. the above three instructions sequence for tests for each between to be compared values on the test test between different Idl #1,r5 ! all tests jmpx Link ! Pop back must type the same wmes have to generic succeeded code wmes 2[0 PARALI,ELISM L_fail: Idl #O,r5 jmpx Link I Case when I code can the I number be shared of tests by all Idl #I,r5 jmpx Link Pop associated nodes that back with have ! ** ScheduleSuccessors generic a node zero I Pop to IN PI_O1)UCFION SYSTEMS code is zero. The following tests. back to generic code ** I ......................... I Generic loop: code to schedule successors of Idl (R-node)succList,r3 ! 
get Idl (r3)dir.succNid,R'nid ! first bfi R-flg,#31,#1,R'-nid GET-NEW-TOKPTR I ** a node. ! ptr stl R-Itok,(R'-tokPtr) stl PUSHG R-rtok,(R'-tokPtr)l Idl (r3)next,r3 cb_neq NULL,r3,1oop CheckExtraRemoves pointer time to successor thru is returned loop, list r3 can't be NULL in R'-tokPtr ** ........................ I This routine will ! that reason need I beta-insert LI: L loop: L_fail: be executed not be operation. only highly The in optimized. result cmp_w #O,(R-node)lock br_neq LI tsi_w (R-node)lock,#O br_nz LI Idl (R-node)LefExRem,r2 Idl NULL,r2 cb_eq NULL,rI,L_notF cmp R-nid,(rl)nid br_neq L_fail cmp R-ltokPtr,(rl)tLeft br_neq L_fail cmp R-rtokPtr,(rl)tRight br neq L_fail br L_succ Idl rl,r2 is exceptional The returned circumstances, code in r5. corresponds and to a for left- CODE AND DATA STRUCI'URESFOR L_succ: PARAI,I]_I. ldl (rl)next,rl br L_Ioop Idl (rl)next,r3 cb_eq NULL,r2,LIO stl r3,(r2)next 211 IMPLEMENTATION ! Token to be del I value being is at head of list br L20 LIO: stl r3,(R-node)LefExRem L20: tci_w (R-node)lock,#O Idl (R-globTab)DeITokList,rO stl rO,(rl)next stl r1,(R-globTab)DelTokList ldl br #1,r5 L30 L notF: I case ldl if a matching token is not found returned in the in r5 Extra-Removes-List #O,r5 L30: ! ** "" InsertExtraRemove I ........................ I This routine I that reason will need be executed only not be highly MAKE-TOKEN LI: J "* ! ptr cmp #O,(R-node)lock br_neq LI tsi_w (R-node)lock,#O br_nz L1 Idl (R-node)LefExRem,rO stl rO,(R-Itok)next stl R-Itok,(R-node)LefExRem tci_w (R-node)lock,#O Prod Node in exceptional circumstances, and for optimized. to token is returned in R-Itok ** ] ................. 1 The code for prod-node corresponds to the act I for some process/es doing conflict-resolution, I command into a command-list. Idl (R-tokPtr), Idl (R-tokPtr)l,R-rtokPtr MAKE-CR-COMMAND INSERT-CR-COMMAND of preparing and a command inserting this R-ItokPtr make confl-res insert command command. (ptr in cr-list in rO) EREE-TOKPTR RELEASE-NODE-LOCK inform HTS that proc is finished. 212 PARAI,LELtSM ! ** IN PRODUCI'ION SYSTEMS ** MakeCRCommand ! .................... Idl (R-globTab)comFreeList,rO cb neq NULL,rO,LIO [ get command from free-list ALLOCATE-MORE-SPACE LIO: I ** ldl (rO)next,rl stl r1,(R-globTab)comFreeList stl R-ItokPtr,(rO)Itok stl R-rtokPtr,(rO)rtok stl R-flg,(rO)flag InsertCRCormmand ** I ...................... LI: ldal (R-globTab)crListLock,rl cb_neq #O,(rl),L1 tsi (rl) br_nz L1 Idl (R-globTab)crList,r2 stl r2,(rO)next stl rO,(R-globTab)crList tci (rl) I spin on cache I recall new-comm is ret in rO B.2. Code for Interpreter Using Multiple Software Task Schedulers 1 Extensions to the I schedulers. I the and data of the implementation using I routines l code Most that Extensions to change Data code structures written software significantly. for task for the the use of software task HTS remains the same even schedulers. There are only three Note: Only the changes are given here. structures: I ............................... #define MaxSchedulers 32 #define NumSchedulers XX typedef struct { unsigned flg-dir-nid; I flag, Token *tokPtr; I ptr } TaskStackEntry; #define TaskStackSize typedef struct 512 dir, nid to token for causing for activation activation; CODE AND DATA STRUCTURtG FOR PARAI,LEL 213 IMPLEMENTATION ( int lock; ! 
B.2. Code for Interpreter Using Multiple Software Task Schedulers

! This section gives the extensions to the code and data structures of
! the implementation for the use of software task schedulers (STS).
! Most of the code written for the HTS remains the same even for
! software task schedulers.  There are only three routines that change
! significantly.  Note: Only the changes are given here.

! Extensions to data structures:
! ----------------------------------------------------------------------
#define MaxSchedulers 32
#define NumSchedulers XX

typedef struct {
    unsigned flg-dir-nid;       ! flag, dir, nid for causing activation
    Token    *tokPtr;           ! ptr to token for activation
} TaskStackEntry;

#define TaskStackSize 512

typedef struct {
    int lock;                   ! lock for entering entry in Task Stack
    int sTop;                   ! index of top task pending in Task Stack
    TaskStackEntry *taskStk;    ! ptr to task stack of size TaskStackSize
    int dummy;                  ! filler to make size 2^2 lwords
} Sched;

extern-shared Sched Scheduler[NumSchedulers];
extern-shared int PendTaskCount; ! sum of tasks pending in all scheds
extern int SchedID;              ! global var, local to each processor;
                                 ! refers to the schedID from which the
                                 ! task being serviced was obtained
extern int Random;               ! var, local to each processor
extern-shared int NewCycleLock;  ! lock so that only one processor goes
                                 ! off to process beginNextCycle

! Note that the register R-HTS is no longer needed.
! R-sched points to the base of the Scheduler array.
#define R-stk R-hIndex           ! points to task stack of given scheduler
! Note: The lock field of the node-table-row now consists of three
! pieces of information: bit<0>:lock-bit, bit<1>:direction-bit,
! bit<2>:flag-bit.

! Extensions to the code:
! ----------------------------------------------------------------------
! THE FOLLOWING CODE EXTENSIONS ARE GIVEN FOR THE CASE WHEN INTRA-NODE
! PARALLELISM IS USED.  THE CODE EXTENSIONS FOR THE CASE WHEN NODE
! PARALLELISM OR PRODUCTION PARALLELISM IS USED ARE GIVEN LATER.

! ** ReleaseNodeLock **
! ----------------------------------------------------------------------
! This routine corresponds to the action that is taken when a processor
! is done with evaluating a node activation.  At this point it may want
! to look for a new node activation to evaluate, or it may mark the
! beginning of a new recognize-act cycle.
! ----------------------------------------------------------------------
         decr_iw (R-node)refC,r0           ! decrement refC assoc with node
         ldal    (R-globTab)PendTaskC,r0   ! load address of PendingTaskCount
         decr_i  (r0)                      ! decr interlocked
         cb_neq  #0,(r0),L1                ! check if end of cycle
         ldal    (R-globTab)NewCycLock,r7
         tsi     (r7)
         br_nz   L1                        ! if the lock is busy it implies
                                           ! that some other processor is
                                           ! scheduling the new cycle, so
                                           ! this proc need not worry about it
         cb_neq  #0,(r0),L2                ! Get value of PendingTaskCount
                                           ! again.  Even if this proc got
                                           ! in, it is possible that another
                                           ! processor got the lock, scheduled
                                           ! the confl-res and act phases for
                                           ! the next cycle, and then released
                                           ! the lock.  Must check for that
                                           ! case.
         BEGIN-NEXT-CYCLE
L2:      tci     (r7)
L1:
PopG:                                      ! PopG is a global label

! ** PushG **
! ----------------------------------------------------------------------
! This piece of code is executed whenever a node activation is to be
! scheduled globally.
! ----------------------------------------------------------------------
         ldl     (R-globTab)random,r0      ! random-seed
         xor2    #4513,r0                  ! xor with some prime number
         bfx     r0,#(31-x),#x,r8          ! #x depends on NumScheds.  We
         shl     #x,r0,r0                  ! also rotate the rand-seed by
         bfi     r8,#0,#x,r0               ! #x bits; rotation is complete
         stl     r0,(R-globTab)random      ! store back random-seed
         shl     #2,r8,r9                  ! note r8 has indx of rand-sched
         ldal    (R-sched)r9,r9            ! get base-addr of that scheduler
L1:      cb_neq  #0,(r9),L1                ! check sched-lock; spin in cache
         tsi     (r9)lock,#0               ! try and get the lock
         br_nz   L1
         add3    #2,(r9)sTop,r0            ! sTop += 2, as each entry is
         stl     r0,(r9)sTop               ! 2 lwords
         add3    r0,(r9)taskStk,r1         ! get address where to push data
         stq     R'-nid---R'-tokPtr,(r1)   ! actually push the data
         tci     (r9)lock
         ldal    (R-globTab)PendTaskC,r0
         incr_i  (r0)                      ! increment_interlocked PendTaskCount
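In C, the PushG listing above amounts to: pick a pseudo-random scheduler, lock its task stack, push the entry, and bump the global pending-task count. A minimal sketch, assuming the data structures above, the acquire()/release() helpers sketched earlier, and a hypothetical atomic_incr() with the semantics of incr_i (the xor/rotate generator is replaced here by an ordinary linear congruential step):

extern void acquire(volatile int *lock);   /* sketched earlier         */
extern void release(volatile int *lock);
extern void atomic_incr(volatile int *n);  /* hypothetical incr_i      */

typedef struct { unsigned flgDirNid; void *tokPtr; } TaskEntry;
typedef struct { volatile int lock; int sTop; TaskEntry *taskStk; } TaskQueue;

extern TaskQueue schedulers[];             /* Scheduler[NumSchedulers] */
extern volatile int pendTaskCount;         /* PendTaskCount            */

static unsigned randomSeed = 4513;

void pushG(unsigned flgDirNid, void *tokPtr, int numScheds)
{
    TaskQueue *q;

    /* cheap pseudo-random scheduler choice, like the xor/rotate above */
    randomSeed = randomSeed * 1103515245u + 12345u;
    q = &schedulers[randomSeed % (unsigned) numScheds];

    acquire(&q->lock);
    q->sTop += 1;                          /* sTop += 2 lwords above   */
    q->taskStk[q->sTop].flgDirNid = flgDirNid;
    q->taskStk[q->sTop].tokPtr    = tokPtr;
    release(&q->lock);

    atomic_incr(&pendTaskCount);           /* incr_i in the listing    */
}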
! ** PopG **
! ----------------------------------------------------------------------
! This code determines the processability of a node activation in
! addition to performing the task of the PopG routine as described in
! the HTS version of the code.
! ----------------------------------------------------------------------
PopG:    ldl     (R-globTab)schID,r8       ! get ID of last sched used
L2:      add2    #1,r8                     ! increment modulo NumScheds
         cb_le   (NumScheds-1),r8,L1
         sub2    r8,NumScheds
L1:      shl     #2,r8,r9
         ldal    (R-sched)r9,r9
         cmp     Zero,(r9)sTop             ! check -- is stack empty?
         br_z    L2                        ! if empty, try next sched
         cb_neq  #0,(r9),L2                ! if sched-lock busy, try next sched
         tsi     (r9),#0
         br_nz   L2                        ! if can't get lock, try next sched
         ldl     (r9)taskStk,R-stk         ! get base-addr of task stack
         ldl     (r9)sTop,r0               ! get sTop
L_loop:  cb_le   #0,r0,L_fail
         ldal    (R-stk)r0,r1              ! get base address of stack entry
         ldq     (r1),R-nid---R-tokPtr
         bfx     R-nid,#31,#1,R-flg
         bfx     R-nid,#30,#1,R-dir
         bfx     R-nid,#0,#30,R-nid
         shl     #3,R-nid,r2
         add2    R-nid,r2
         add2    R-nid,r2                  ! done as size of nod-tab-entry
                                           ! is 10 lwords
         ldal    (R-nodTab)r2,R-node       ! R-node gets bas addr of nodTabRow
         cmp     R-nid,#10000              ! check if txxx-node
         br_gt   L_succ                    ! if so, entry is processable
L3:      tsi_w   (R-node)lock,#0           ! lock the node-table entry
         br_nz   L3
         cb_eq   Insert,R-flg,L_Ins
         ..                                ! two similar symmetric cases;
         ..                                ! only one is considered here
L_Ins:   cb_eq   Left,R-dir,LefIns
         ..                                ! two similar symmetric cases;
         ..                                ! only one is considered here
LefIns:  cmp_w   Zero,(R-node)refC         ! check if refCount is zero
         br_neq  L93
         stw     flg-dir-lock,(R-node)lock ! flg-dir-lock is a compile-time
                                           ! const
L94:     add3    #1,(R-node)refC,r3
         stw     r3,(R-node)refC           ! update refC
         tci_w   (R-node)lock,#0           ! release lock on node-tab entry
         br      L_succ
L93:     cmp_w   flg-dir-lock,(R-node)lock ! check if flg-dir-lock is same
         br_eq   L94                       ! if same, entry is processable
         tci_w   (R-node)lock,#0           ! release lock
         sub2    #2,r0                     ! sub size of task entry
         br      L_loop                    ! see if next entry on stack is
                                           ! processable
L_fail:  tci     (r9)lock,#0               ! release lock on scheduler
         br      L2                        ! try next scheduler
L_succ:  cmp_w   r0,(r9)sTop               ! check if task was picked from
         br_eq   L95                       ! stack top; if from top, no
                                           ! compaction is necessary
         ldl     (r9)sTop,r0               ! get value of sTop.  Can destroy
                                           ! r0, as r1 contains the info in
                                           ! better form.
         ldal    (R-stk)r0,r0
         ldq     (r0),r3-r4                ! load top entry
         stq     r3-r4,(r1)                ! store into the middle slot
L95:     sub3    #2,(r9)sTop,r0
         stl     r0,(r9)sTop               ! decrement stack-top pointer
         tci     (r9)lock,#0               ! release scheduler lock
         stl     r8,(R-globTab)schID       ! store into global var schID the
                                           ! scheduler from which the task
                                           ! was obtained
         jmpi    (R-node)R-dir             ! jump to code specific to this
                                           ! activation

! ----------------------------------------------------------------------
! THE FOLLOWING CODE CORRESPONDS TO THE CASE WHEN NODE PARALLELISM IS
! USED.  THE CODE WHEN PRODUCTION PARALLELISM IS USED IS SIMILAR, AND
! THE COST-MODEL DERIVED FROM THIS CODE CAN BE USED THERE TOO.
! ----------------------------------------------------------------------

typedef struct {
    unsigned addr-lef-act;    /* code address for left activation */
    unsigned addr-rht-act;    /* code address for right activation */
    unsigned addr-mak-tok;    /* code address for making token */
    unsigned addr-lef-hash;   /* code address for left hash fn */
    unsigned addr-rht-hash;   /* code address for right hash fn */
    unsigned addr-do-tests;   /* code for tests assoc with node */
    short    lock;            /* for safe modification of other fields */
    short    refCount;        /* used by STS (not needed here) */
    Token    *leftExtraRem;   /* extra removes due to conjugate pairs */
    Token    *rightExtraRem;  /* extra removes due to conjugate pairs */
    Node     *succList;       /* set of successors of the node */
} NodeTableEntry;

! ** ReleaseNodeLock' **
! ----------------------------------------------------------------------
! Only the first line is changed with respect to the previous version.
! Overall the cost remains the same.
! ----------------------------------------------------------------------
         tci_w   (R-node)lock,#0           ! release lock associated with node
         ldal    (R-globTab)PendTaskC,r0   ! load address of PendingTaskCount
         decr_i  (r0)                      ! decr interlocked
         cb_neq  #0,(r0),L1                ! check if end of cycle
         ldal    (R-globTab)NewCycLock,r7
         tsi     (r7)
         br_nz   L1                        ! if the lock is busy it implies
                                           ! that some other processor is
                                           ! scheduling the new cycle, so
                                           ! this proc need not worry about it
         cb_neq  #0,(r0),L2                ! Get value of PendingTaskCount
                                           ! again.  Even if this proc got
                                           ! in, it is possible that another
                                           ! processor got the lock, scheduled
                                           ! the confl-res and act phases for
                                           ! the next cycle, and then released
                                           ! the lock.  Must check for that
                                           ! case.
         BEGIN-NEXT-CYCLE
L2:      tci     (r7)
L1:
PopG:                                      ! PopG is a global label
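The processability test inside the PopG listing above is the essential difference from the HTS version. Stated in C (a sketch with hypothetical names; the caller is assumed to hold the node-table-entry lock, as PopG does):

/* A node activation popped under intra-node parallelism may be run if:
     (a) it is a txxx (constant-test) node, which needs no locking, or
     (b) no activation of the node is in progress (refC == 0), or
     (c) the activations in progress have the same flag and direction
         (packed into the lock word as bit<2>:flag, bit<1>:dir), so
         they may proceed concurrently under intra-node parallelism. */
typedef struct {
    int lock;      /* bit<0>:lock, bit<1>:direction, bit<2>:flag       */
    int refC;      /* number of activations currently being processed  */
} NodeEntry;

int processable(NodeEntry *n, int flgDirLock, int isTxxxNode)
{
    if (isTxxxNode)
        return 1;                  /* nid > 10000 in the listing        */
    if (n->refC == 0) {
        n->lock = flgDirLock;      /* record this flag+direction        */
        n->refC = 1;
        return 1;
    }
    if (n->lock == flgDirLock) {   /* same flag and direction running   */
        n->refC++;
        return 1;
    }
    return 0;                      /* conflicting activation: skip task */
}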
! ** PushG' **
! ----------------------------------------------------------------------
! The routine does not change between node parallelism and intra-node
! parallelism.  So the cost also remains the same.
! ----------------------------------------------------------------------
         ldl     (R-globTab)random,r0      ! random-seed
         xor2    #4513,r0                  ! xor with some prime number
         bfx     r0,#(31-x),#x,r8          ! #x depends on NumScheds.  We
         shl     #x,r0,r0                  ! also rotate the rand-seed by
         bfi     r8,#0,#x,r0               ! #x bits; rotation is complete
         stl     r0,(R-globTab)random      ! store back random-seed
         shl     #2,r8,r9                  ! note r8 has indx of rand-sched
         ldal    (R-sched)r9,r9            ! get base-addr of that scheduler
L1:      cb_neq  #0,(r9),L1                ! check sched-lock; spin in cache
         tsi     (r9)lock,#0               ! try and get the lock
         br_nz   L1
         add3    #2,(r9)sTop,r0            ! sTop += 2, as each entry is
         stl     r0,(r9)sTop               ! 2 lwords
         add3    r0,(r9)taskStk,r1         ! get address where to push data
         stq     R'-nid---R'-tokPtr,(r1)   ! actually push the data
         tci     (r9)lock
         ldal    (R-globTab)PendTaskC,r0
         incr_i  (r0)                      ! increment_interlocked PendTaskCount

! ** PopG' **
! ----------------------------------------------------------------------
! This code determines the processability of a node activation in
! addition to performing the task of the PopG routine as described in
! the HTS version of the code.  The main place of change from the
! intra-node version is where the processability of the selected node
! is established.
! ----------------------------------------------------------------------
PopG:    ldl     (R-globTab)schID,r8       ! get ID of last sched used
L2:      add2    #1,r8                     ! increment modulo NumScheds
         cb_le   (NumScheds-1),r8,L1
         sub2    r8,NumScheds
L1:      shl     #2,r8,r9
         ldal    (R-sched)r9,r9
         cmp     Zero,(r9)sTop             ! check -- is stack empty?
         br_z    L2                        ! if empty, try next sched
         cb_neq  #0,(r9),L2                ! if sched-lock busy, try next sched
         tsi     (r9),#0
         br_nz   L2                        ! if can't get lock, try next sched
         ldl     (r9)taskStk,R-stk         ! get base-addr of task stack
         ldl     (r9)sTop,r0               ! get sTop
L_loop:  cb_le   #0,r0,L_fail
         ldal    (R-stk)r0,r1              ! get base address of stack entry
         ldq     (r1),R-nid---R-tokPtr
         bfx     R-nid,#31,#1,R-flg
         bfx     R-nid,#30,#1,R-dir
         bfx     R-nid,#0,#30,R-nid
         shl     #3,R-nid,r2
         add2    R-nid,r2
         add2    R-nid,r2                  ! done as size of nod-tab-entry
                                           ! is 10 lwords
         ldal    (R-nodTab)r2,R-node       ! R-node gets bas addr of nodTabRow
         cmp     R-nid,#10000              ! check if txxx-node
         br_gt   L_succ                    ! if so, success
         tsi_w   (R-node)lock,#0           ! lock the node-table entry
         br_z    L_succ                    ! if lock obtained, success
         sub2    #2,r0                     ! sub size of task entry
         br      L_loop                    ! see if next entry is processable
L_fail:  tci     (r9)lock,#0               ! release lock on scheduler
         br      L2                        ! try next scheduler
L_succ:  cmp_w   r0,(r9)sTop               ! check if task was picked from
         br_eq   L95                       ! stack top; if from top, no
                                           ! compaction is necessary
         ldl     (r9)sTop,r0               ! get value of sTop.  Can destroy
                                           ! r0, as r1 contains the info in
                                           ! better form.
         ldal    (R-stk)r0,r0
         ldq     (r0),r3-r4                ! load top entry
         stq     r3-r4,(r1)                ! store into the middle slot
L95:     sub3    #2,(r9)sTop,r0
         stl     r0,(r9)sTop               ! decrement stack-top pointer
         tci     (r9)lock,#0               ! release scheduler lock
         stl     r8,(R-globTab)schID       ! store into global var schID the
                                           ! scheduler from which the task
                                           ! was obtained
         jmpi    (R-node)R-dir             ! jump to code specific to this
                                           ! activation
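Both PopG variants may take a task from the middle of a scheduler's stack when the entries nearer the top are not processable. The compaction step at L95 keeps the stack dense by moving the top entry into the hole, so no other shuffling is ever needed. A self-contained C sketch of just that step (entry type and names are illustrative):

typedef struct { unsigned flgDirNid; void *tokPtr; } StkEntry;

/* Remove the entry at index pos from a stack whose top entry is at
   index top.  Returns the new top index.  Mirrors PopG above: if the
   removed entry was not the top one, the top entry is copied into the
   hole; no compaction in between is necessary. */
int popAt(StkEntry *stk, int pos, int top, StkEntry *out)
{
    *out = stk[pos];               /* the task being taken             */
    if (pos != top)
        stk[pos] = stk[top];       /* fill the hole with the top entry */
    return top - 1;                /* stack shrinks by one entry       */
}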
Appendix C
Derivation of Cost Models for the Simulator

C.1. Cost Model for the Parallel Implementation Using HTS

/*
 * This cost model was designed for the case where a single Hardware Task
 * Scheduler (HTS) is used.
 *
 * Exports:
 *    double TaskProcCost(tptr: *Task);
 *    double TaskLoopCost(tptr: *Task);
 *    double TaskFinishCost(tptr: *Task);
 *    double TaskSchedCost(tptr: *Task);
 */

/*
 * The cost model described by this file may be understood in terms of the
 * diagram given below.  The processing associated with a node activation
 * usually consists of the following basic steps: (1) fetch an activation
 * to process from the HTS; (2) if there are no tokens in the memory of
 * the node in the opposite direction, simply go to the end; (3) do some
 * processing for each of those tokens to find any matching tokens, and
 * enqueue the activations of the successor nodes onto the scheduler;
 * (4) once the processing required by the node activation is finished,
 * go to the end.  In the following diagram:
 *    T_deq  is Cost of dequeueing an activation from the HTS;
 *    T_b    is Beginning cost;
 *    T_ls   is Loop-Start cost;
 *    T_il   is Inner-Loop cost;
 *    T_enq  is Cost of enqueueing an activation onto the HTS;
 *    T_le   is Loop-End cost;
 *    T_nl   is No-Loop cost;
 *    T_e    is Ending cost.
 *
 *                       +--> T_nl >----------------------------+
 *                       |                                      |
 *   Start --T_deq--> T_b+                                      +--> T_e --> End
 *                       |                                      |
 *                       +--> T_ls --> T_il --+--> T_le >-------+
 *                                     ^      |
 *                                     +------+
 *                          (T_enq occurs inside the T_il loop)
 *
 * The exported procedures correspond to the following costs:
 *    TaskProcCost(tptr: *Task)   := T_b + (if no_succ then T_nl; else T_ls;);
 *    TaskLoopCost(tptr: *Task)   := T_il;
 *    TaskFinishCost(tptr: *Task) := T_e + (if no_succ then 0; else T_le;);
 *    TaskSchedCost(tptr: *Task)  := T_deq;    for the HTS version.
 */

#include "sim.h"
#include "stats.h"

/*
 * Some constant definitions for cost of tasks.
 */
#define RR   1.0   /* cost of reg-reg instruction */
#define MR   1.0   /* cost of mem-reg instruction */
#define M2R  2.0   /* cost of mem-mem-reg instruction */
#define SYN  3.0   /* cost of interlocked instructions */
#define CBR  1.5   /* compare&branch on register value */
#define CBM  1.5   /* compare&branch on memory value */
#define BRR  1.5   /* branch on register value */
#define BRM  1.5   /* branch requiring memory-ref */

#define C_PushG            (M2R)
#define C_PopG             (6*RR + MR + M2R + BRM)
#define C_GetNewTokPtr     (3*MR + CBR)
#define C_GetNewToken      (3*MR + CBR)
#define C_FreeTokPtr       (3*MR)
#define C_ReleaseNodeLock  (2*RR + CBR + 2*MR + BRR)
#define C_LockTokenHT      (4*MR + 2*SYN + CBM + 3*BRR)
#define C_ReleaseTokenHT   (2*MR + 2*SYN + CBM + 2*BRR)
#define C_MakeCRCommand    (8*MR + CBR)
#define C_InsertCRCommand  (4*MR + 2*SYN + CBM + BRR)
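As a concrete reading of these constants: C_LockTokenHT charges 4 memory-reference instructions, 2 interlocked instructions, one compare-and-branch on memory, and 3 branches, i.e. 4(1.0) + 2(3.0) + 1.5 + 3(1.5) = 16 cost units, where one unit is one canonical register-register instruction time (about 0.5 microseconds on the 2 MIPS processors assumed in this thesis). The fragment below is purely illustrative and just evaluates a few of the composites:

#include <stdio.h>

#define RR   1.0
#define MR   1.0
#define M2R  2.0
#define SYN  3.0
#define CBR  1.5
#define CBM  1.5
#define BRR  1.5
#define BRM  1.5

#define C_LockTokenHT     (4*MR + 2*SYN + CBM + 3*BRR)
#define C_ReleaseTokenHT  (2*MR + 2*SYN + CBM + 2*BRR)

int main()
{
    printf("C_LockTokenHT    = %.1f units\n", C_LockTokenHT);    /* 16.0 */
    printf("C_ReleaseTokenHT = %.1f units\n", C_ReleaseTokenHT); /* 12.5 */
    printf("lock + release   = %.2f us at 2 MIPS\n",
           (C_LockTokenHT + C_ReleaseTokenHT) * 0.5);            /* 14.25 */
    return 0;
}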
* toksz: number of wme * ntests: number of tests * nteq: number of equality * ntok: number of tokens * ntokOmem: * nsucc: number number " primTask: are pointers used tests if a given with made in a given of tokens in the functions a two-input at the memory in the opposite node activations task memory is scheduled of activation through direction * Ices: the number of condition elements to the * rces: the number of condition elements to the right true * NumActiveProc: number (left node. generated whether is being global activation. HTS. or deleted. contention of active by a given the or right). inserted if memory node. node. * flag: * MemConFlag: to mean: node. two-input * dir: a token below in token. associated of successor true 221 FOR 'NIE SIMULATOR left of the two-input of the is to be taken two-input into account processors. m */ double C_MakeToken(toksz) int toksz; { /* * if (tok_size * because */ the double <= I) then right side assume that of a not-node alpha-token may have is being rces listed made. "<=" as O. cost; y if (toksz <= I) cost = 3 * MR + C_GetNewToken; cost = MR + (2 * toksz) else return * MR + M2R + BRR + BRM (cost); } double C_InsertToken() { double cost cost; - RR + 3*MR + C_LockTokenHT + C_ReleaseTokenHT; return(cost); } double int C_DeleteToken(toksz, toksz, nteq, nteq, ntok) ntok; { double cost if cost; = C_LockTokenHT (toksz { <= I) /* + C_ReleaseTokenHT; delete alpha-token "/ + C_GetNewToken; node. node. 222 PARAI,LEI.ISM if (nteq > O) /* hashing works IN PROI)UCTION */ { cost += 3*RR + g'MR cost += (3*RR cost += (ntok/2) + 3"CBR + 3*BRR; } else { + 6*MR + 2*CBR * (RR + 4*MR + BRR) + (3*MR + CBR + 3*BRR); + CBR + 3"BRR); } } else { if (nteq > O) /" hashing works "/ cost +- 3*RR + It*MR + 3*CBR + 4*BRR; cost += (3*RR + 6*MR + 2*CBR + BRR) cost += (ntok/2) { } else { * (RR + 5*MR + (5*MR } } return(cost); } double C_HashToken(toksz,nteq) int toksz,nteq; { double cost; if (toksz cost else cost <= I) = 3*RR + MR + nteq*MR = 3*RR + MR + 2*nteq*MR + BRR + BRM; + BRR + BRM; return(cost); } double C_DoTests(ntests,pass) int ntests; bool pass; { double cost; if (ntests cost == O) = RR + BRR; else { if (pass) cost /* the = ntests tests * succeed (4*MR */ + BRR) + CBR + CBR + 3.5*BRR); + RR + BRR; + 4*BRR); SYSTEMS DERIVATION OF COST MODEI,SFOR TItE 223 S1MULATOR else cost * = (ntests/2) (4*MR + BRR) + RR + BRR; } return(cost); } double C_NextMatchTok(nsucc, int nsucc, nteq, ntests, nteq, ntests, ntokOmem) ntokOmem; { double cost cost, k; = 0.0; if (nteq == O) { if (nsucc == O) ( cost += ntokOmem * (RR + 2*MR cost += ntokOmem * C_DoTests(ntests,False) + 2*CBR + Z*BRR + BRM); + CBR; ) else { k = ntokOmem/nsucc; cost += (k - I) * (RR + 2*MR cost += (k - 1) * C_DoTests(ntests,False); cost += RR + 3*MR + 2*CBR + 2*CBR + 2*BRR + BRR + BRM + BRM); + C_DoTests(ntests,True); ) } else /* hashing is useful */ { if (nsucc cost == O) += CBR; else cost += RR + 3*MR + 2*CBR + BRR + BRM + C_DoTests(ntests,True); ) return(cost); ) double C_NextMatchTokNOT(nsucc, int nsucc, nteq, ntests, nteq, ntests, ntokOmem) ntokOmem; ( double cost cost, k; = 0,0; if (nteq == O) { if (nsucc =- O) { /* It is still • opposite possible node • transitions that happened. for successful It only an insert or I->0 matches implies that transitions */ k = ntokOmem/2; cost += k * (RR + Z'MR + 2*CBR + 2'BRR + BRM); with the tokens there for were of the no 0->I a delete. 
double C_NextMatchTokNOT(nsucc, nteq, ntests, ntokOmem)
int nsucc, nteq, ntests, ntokOmem;
{
    double cost, k;

    cost = 0.0;
    if (nteq == 0) {
        if (nsucc == 0) {
            /*
             * It is still possible for successful matches with the tokens
             * of the opposite node.  It only implies that there were no
             * 0->1 or 1->0 transitions for the insert or delete that
             * happened.
             */
            k = ntokOmem/2;
            cost += k * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += k * C_DoTests(ntests, False) + CBR;
            cost += k * (RR + 2*MR + SYN + 3*CBR + 2*BRR + BRM);
            cost += k * C_DoTests(ntests, True);
        } else {
            k = ntokOmem/nsucc;
            cost += (k - 1) * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += (k - 1) * C_DoTests(ntests, False);
            cost += RR + 2*MR + SYN + 3*CBR + BRR + BRM + C_DoTests(ntests, True);
        }
    } else {    /* hashing is useful */
        if (nsucc == 0)
            cost += CBR;    /* the opposite bucket is hopefully empty */
        else
            cost += RR + 2*MR + SYN + 3*CBR + BRR + BRM + C_DoTests(ntests, True);
    }
    return (cost);
}

double C_DetermineRefCount(nsucc, nteq, ntests, ntokOmem)
int nsucc, nteq, ntests, ntokOmem;
{
    double cost;

    cost = 3*RR + MR;
    if (nteq > 0) {
        if (nsucc == 0) {
            /*
             * This case implies that the # of matching tokens was greater
             * than zero.  Assuming that it maps into a hash bucket with
             * exactly one token.
             */
            cost += RR + 2*MR + 2*CBR + 2*BRR + BRM + CBR;
            cost += C_DoTests(ntests, True);
        } else {
            /* nsucc == 1 and the opposite hash bucket would be empty */
            cost += CBR;
        }
    } else {
        if (nsucc == 0) {
            cost += ntokOmem * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += ntokOmem *
                ((C_DoTests(ntests,True) + C_DoTests(ntests,False))/2.0);
        } else {
            cost += ntokOmem * (RR + 2*MR + 2*CBR + 2*BRR + BRM);
            cost += C_DoTests(ntests, False) + CBR;
        }
    }
    return (cost);
}

double TaskProcCost(tptr)
Task *tptr;
{
    bool primTask;
    TaskPtr *tp;
    double cost, tempc;
    int nid, lces, rces, nsucc, nteq, ntests, ntok, ntokOmem, toksz;
    Direction dir;

    cost = 0.0;
    primTask = tptr->tPrimaryTask;
    nid = tptr->tNodeID;
    tp = tptr->tDepList;
    nsucc = 0;
    while (tp) {
        nsucc++;
        tp = tp->tNext;
    }

    switch (tptr->tType) {
    case rootNode:
        if (nsucc)
            cost = (2*RR + 3*MR + CBR) + C_PushG;
        else
            cost = (2*RR + MR + CBR);
        break;

    case txxxNode:
        /* the cost for the &bus and txxx nodes is made equal */
        if (primTask)
            cost = (2*RR + 2*MR + CBR) + C_PopG;
        else
            cost = (2*RR + MR + CBR);
        /*
         * The following if-stmt is not a legal stmt in the language, and
         * has been added as a simple approximation to the actual form
         * used in the simulator.
         */
        if (nsucc > 0) {
            if successor_sched_thru_HTS
                cost += 3*RR + C_PushG;
            else
                cost += RR + MR;
        }
        break;

    case andNode:
        lces = NodeTable[nid].nLces;
        rces = NodeTable[nid].nRces;
        dir = tptr->tSide;
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        if (dir == Left) {
            toksz = lces;
            ntok = tptr->tNumLeft;
            ntokOmem = tptr->tNumRight;
        } else {
            toksz = rces;
            ntok = tptr->tNumRight;
            ntokOmem = tptr->tNumLeft;
        }
        if (toksz <= 1) {    /* alpha-activation */
            cost = C_PopG + C_HashToken(toksz,nteq) + CBR + (RR + MR + BRR);
            if (tptr->tFlag == Insert)
                cost += C_MakeToken(toksz) + C_InsertToken() + BRR;
            else
                cost += C_DeleteToken(toksz, nteq, ntok);
            if ((ntokOmem != 0) && ((nteq == 0) || (nsucc != 0))) {
                cost += C_LockTokenHT + MR;
                cost += C_NextMatchTok(nsucc,nteq,ntests,ntokOmem);
            }
            if (nsucc != 0)
                cost += RR + 2*MR + C_GetNewTokPtr + 2*MR + C_PushG;
        } else {             /* beta-activation */
            cost = C_PopG + 2*MR + C_HashToken(toksz,nteq) + CBR + (RR + MR + BRR);
            if (tptr->tFlag == Insert)
                cost += RR + MR + C_MakeToken(toksz) + C_InsertToken() + BRR;
            else
                cost += C_DeleteToken(toksz, nteq, ntok) + CBR;
            if ((ntokOmem != 0) && ((nteq == 0) || (nsucc != 0))) {
                cost += C_LockTokenHT + MR;
                cost += C_NextMatchTok(nsucc,nteq,ntests,ntokOmem);
            }
            if (nsucc != 0)
                cost += RR +
                    2*MR + C_GetNewTokPtr + 2*MR + C_PushG;
        }
        break;

    case notNode:
        lces = NodeTable[nid].nLces;
        rces = NodeTable[nid].nRces;
        dir = tptr->tSide;
        if (dir == Left) toksz = lces; else toksz = rces;
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        if (dir == Left) {
            ntok = tptr->tNumLeft;
            ntokOmem = tptr->tNumRight;
            cost = C_PopG + 2*MR + C_HashToken(toksz,nteq) + CBR;
            if (tptr->tFlag == Insert) {
                cost += 2*RR + 3*MR + C_MakeToken(toksz) + C_InsertToken() + BRR;
                if ((ntokOmem != 0) && ((nteq == 0) || (nsucc == 0))) {
                    cost += C_LockTokenHT + C_ReleaseTokenHT + BRR;
                    cost += C_DetermineRefCount(nsucc, nteq, ntests, ntokOmem);
                }
            } else
                cost += C_DeleteToken(toksz, nteq, ntok) + CBR;
            if (nsucc != 0)
                cost += RR + 5*MR + BRR + C_GetNewTokPtr + C_PushG;
            if (toksz <= 1) {    /* i.e., an alpha not-node activation */
                if (tptr->tFlag == Insert)
                    cost = cost - (RR + 3*MR);
                else
                    cost = cost - (2*MR + BRR);
            }
        } else {    /* dir == Right */
            ntokOmem = tptr->tNumLeft;
            ntok = tptr->tNumRight;
            cost = C_PopG + 2*MR + C_HashToken(toksz,nteq) + CBR + RR + MR + BRR;
            if (tptr->tFlag == Insert)
                cost += RR + MR + C_MakeToken(toksz) + C_InsertToken() + BRR;
            else
                cost += C_DeleteToken(toksz, nteq, ntok) + CBR;
            if ((ntokOmem != 0) && ((nteq == 0) || (nsucc != 0))) {
                cost += RR + C_LockTokenHT + MR;
                cost += C_NextMatchTokNOT(nsucc,nteq,ntests,ntokOmem);
            }
            if (nsucc != 0)
                cost += RR + 2*MR + C_GetNewTokPtr + 2*MR + C_PushG;
            if (toksz <= 2) {    /* i.e., an alpha not-node activation */
                if (tptr->tFlag == Insert)
                    cost = cost - (RR + 3*MR);
                else
                    cost = cost - (2*MR + CBR);
            }
        }
        break;

    case pNode:
        lces = NodeTable[nid].nLces;
        rces = NodeTable[nid].nRces;
        cost = C_PopG + 2*MR;
        cost += C_MakeCRCommand + C_InsertCRCommand + C_FreeTokPtr;
        cost += C_ReleaseNodeLock;
        break;

    default:
        cost = 0.0;
        break;
    }

    if (MemConFlag)
        cost = cost/MemCon[NumActiveProc];
    return (cost/1000.0);
}

double TaskLoopCost(tptr)
Task *tptr;
{
    TaskPtr *tp;
    double cost;
    int nid, nsucc, nteq, ntests, ntok, ntokOmem, toksz;
    Direction dir;

    cost = 0.0;
    nid = tptr->tNodeID;
    tp = tptr->tDepList;
    nsucc = 0;
    while (tp) {
        nsucc++;
        tp = tp->tNext;
    }

    switch (tptr->tType) {
    case rootNode:
        cost = RR + 3*MR + CBR + BRR + C_PushG;
        break;

    case txxxNode:
        /*
         * The following if-stmt is not a legal stmt in the language, and
         * has been added as a simple approximation to the actual form
         * used in the simulator.
         */
        if successor_sched_thru_HTS
            cost = 3*RR + C_PushG;
        else
            cost = RR + MR;
        break;

    case andNode:
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        dir = tptr->tSide;
        if (dir == Left) {
            ntok = tptr->tNumLeft;
            ntokOmem = tptr->tNumRight;
        } else {
            ntok = tptr->tNumRight;
            ntokOmem = tptr->tNumLeft;
        }
        cost = MR + CBR + BRR + C_NextMatchTok(nsucc, nteq, ntests, ntokOmem);
        cost += RR + 2*MR + C_GetNewTokPtr + C_PushG + 2*MR;
        break;

    case notNode:
        nteq = NodeTable[nid].numTeqbTests;
        ntests = NodeTable[nid].numTests;
        dir = tptr->tSide;
*/ if successor_sched_thru_HTS else cost cost = 3*RR + C_PushG; = RR + MR; break; case andNode: nteq = NodeTable[nid].numTeqbTests; ntests dir = NodeTable[nid].numTests; = tptr->tSide; if (dir { ntok == Left) = tptr->tNumLeft; ntokOmem ntokOmem = tptr->tNumLeft; = tptr->tNumRight; else { cost = cost += MR + CBR + BRR + C_NextMatchTok(nsucc, RR + 2"MR + C_GetNewTokPtr break; case notNode: nteq = NodeTabte[nid].numTeqbTests; ntests dir = NodeTable[nid].numTests; = tptr->tSide; if (dir -= Left) ntok } = tptr->tNumRight; + C_PushG } + 2*MR; nteq, ntests, ntokOmem); SYSTEMS DERIVATION OF COST MODEI_ FOR TIlE 229 SIMULATOR { ntok : tptr->tNumLeft; cost += 5*MR + RR ntokOmem + CBR + BRR : tptr->tNumRight; + C_GetNewTokPtr + C_PushG; } else { ntokOmem = tptr->tNumLeft; + CBR ntok cost += RR + 5*MR + BRR cost += C_NextMatchTokNOT(nsucc, = tptr->tNumRight; + C_GetNewTokPtr nteq, ntests, + C_PushG; ntokOmem); } break; case pNode: cost = 0.0; break; default: break; ) return(cost/lO00.O); } double Task TaskFinishCost(tptr) *tptr; { TaskPtr double *tp; cost; int hid, Ices, Direction dir; cost nsucc, nteq, ntests, ntok, ntokOmem, toksz; = O.O; nid= tp rces, tptr->tNodeID; = tptr->tDepList; while (tp) nsucc { nsucc++; - 0; tp = tp->tNext; switch(tptr->tType) { case rootNode: if (nsucc cost += i= O) cost += RR + MR + CBR + BRR; C_ReleaseNodeLock; break; case txxxNode: if (tptr->tPrimaryTask) else /, • • ,/ cost - cost = C_ReleaseNodeLock; BRM; The following to the actual if-stmt has been form used in the added as simulator. a simple approximation 230 PARALLELISM if (nsucc > O) cost IN PRODUCTION SYSTEMS += MR + RR + BRM; break; case andNode: Ices = NodeTable[nid].nLces; rces = Nodelable[nid].nRces; dir = tptr->tSide; nteq = NodeTable[nid].numTeqbTests; ntests = NodeTable[nid].numTests; if (dir == Left) { toksz else = Ices; { toksz if (nsucc cost cost ntok = rces; = tptr->tNumLeft; ntokOmem ntokOmem = tptr->tNumLeft; = tptr->tNumRight; ntok I= O) = MR + CBR + BRR + C_NextMatchTok(nsucc, += C_ReleaseNodeLock nteq, ntests, ntokOmem); + C_FreeTokPtr; if ((ntokOmem != O) && ((nteq cost += C_ReleaseTokenHT; == O) if (toksz <= I) /* alpha activation cost = cost - C FreeTokPtr; II (nsucc l= 0))) */ break; case notNode: Ices = NodeTab,le[nid].nLces; rces = NodeTable[nid].nRces; flit = tptr->tSide; nteq = NodeTable[nid].numTeqbTests; ntests ' = NodeTable[nid].numTests; if (dir == Left) { toksz =lces; if (nsucc cost ntok = tptr->tNumLeft; I= O) cost += C_FreeTokPtr ntokOmem = tptr->tNumRight; += MR + CBR; + C_ReleaseNodeLock; } else /* dir == Right "/ { toksz = rces; if (nsucc ntokOmem = tptr->tNumLeft; ntok - tptr->tNumRight; !- O) { cost += MR + CBR + BRR; cost += C NextMatchTokNOT(nsucc, nteq, ntests, ntokOmem); } cost if +- C_FreeTokPtr + C_ReleaseNodeLock; ((ntokOmem != O) && ((nteq cost += C_ReleaseTokenHT; == O) ) if (toksz break; case pNode: cost - 0.0; break; } = tptr->tNumRight; <= 1) cost = cost - C_FreeTokPtr; II (nsucc 1= 0))) } DERIVATION OF COST MOI)ELS 231 FOR "II-tE SIMULATOR default: cost = 0.0; break; ) if (MemConFlag) cost = cost/MemCon[NumActiveProc]; return(cost/lO00.O); ) double Task TaskSchedCost(tptr) *tptr; { /* * This is the time spent by the * to the number of bus cycles " wide would bus * be 2 bus * cycles this cycles. Assuming are 2-3. * I assume be one Assuming a 64 bit wide hardware required bus cycle, one bus 100ns bus scheduler, while cycle per and for servicing on a 32 bit wide to xmit cycle, (80 MBytes/s) corresponds a request. 
C.2. Cost Model for the Parallel Implementation Using STQs

/*
 * When software task queues (STQs) are used, the cost model remains the
 * same as when a hardware task scheduler is used.  There are, however, a
 * small number of differences for the parts where the tasks are enqueued
 * and dequeued from the task queues.  These differences are listed below.
 */

#define C_ReleaseNodeLock  (RR + 2*SYN + CBM + BRR)   /* cost for releasing node lock */
#define C_PushG            (6*RR + 2*MR + 2*SYN + CBM)

/* costs for scheduling a task through the global schedulers */
#define C_SchedTask        (3*MR + M2R + 2*SYN + CBM)  /* inside lock  */
#define C_SchedEnd         (RR + SYN)                  /* lock         */
#define C_SchedLoopStart   (MR)                        /* outside lock */
#define C_SchedLoop        (4*RR + MR + CBR + BRR)     /* outside lock */

/* cost for popping a task from the scheduler */
#define C_PopG             (MR + BRM)                  /* outside lock */

#define C_DeSchedTaskTxxx  (9*RR + 5*MR + M2R + 2*SYN + CBR + 3*BRR)

/* Costs when intra-node-level parallelism is used. */
#define C_i_DeSchedTask      (9*RR + 9*MR + M2R + 4*SYN + 3*CBR + 5*BRR)
#define C_i_DeSchedFailLoop  (10*RR + 5*MR + M2R + 2*SYN + 3*CBR + 5*BRR)

/* Costs when node-level or prod-level parallelism is used. */
#define C_np_DeSchedTask     (9*RR + 2*MR + M2R + 2*SYN + CBR + 4*BRR)
#define C_np_DeSchedFailLoop (10*RR + M2R + SYN + CBR + 3*BRR)

double TaskSchedCost(tptr)
Task *tptr;
{
    /*
     * This is the cost of enqueueing a node activation in a task queue.
     */
    double cost;

    cost = C_SchedTask/1000.0;
    if (MemConFlag)
        cost = cost / MemCon[NumActiveProc];
    return (cost);
}

double TaskDeSchedCost(tptr, numFail)
Task *tptr;
int numFail;    /* num of tasks looked at before the right task was found */
{
    double cost;

    cost = 0.0;
    if ((tptr != NULL) && (tptr->tType == txxxNode))
        cost += C_DeSchedTaskTxxx/1000.0;
    else if ((Grain == nodeLev) || (Grain == prodLev))
        cost += C_np_DeSchedTask/1000.0;
    else
        cost += C_i_DeSchedTask/1000.0;

    if ((Grain == nodeLev) || (Grain == prodLev))
        cost += (C_np_DeSchedFailLoop/1000.0) * ((double) numFail);
    else
        cost += (C_i_DeSchedFailLoop/1000.0) * ((double) numFail);

    if (MemConFlag)
        cost = cost / MemCon[NumActiveProc];
    return (cost);
}
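Under the STQ model the dequeue side is no longer the flat T_deq of the HTS: TaskDeSchedCost grows linearly with the number of unprocessable tasks a processor skips over. A usage sketch of the difference (illustrative; assumes the declarations in this section):

extern double TaskSchedCost(), TaskDeSchedCost(),
              TaskProcCost(), TaskLoopCost(), TaskFinishCost();

/* Total simulated time for one node activation under software task
   queues: the enqueue cost is paid by the producer, and the consumer
   pays a de-schedule cost that includes one fail-loop per task it
   inspected and skipped before finding this one. */
double stqTaskCost(tptr, nLoopIters, numFail)
Task *tptr;
int nLoopIters, numFail;
{
    double t;

    t  = TaskSchedCost(tptr);              /* enqueue into a queue     */
    t += TaskDeSchedCost(tptr, numFail);   /* dequeue + skipped tasks  */
    t += TaskProcCost(tptr);
    t += nLoopIters * TaskLoopCost(tptr);
    t += TaskFinishCost(tptr);
    return t;
}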