Level Generation System for Platform Games Based on a Reinforcement Learning Approach

Atanas Laskov

MSc in Artificial Intelligence
School of Informatics
The University of Edinburgh
2009

Abstract

Automated level generation is a topic of much controversy in the video games industry, dividing opinion strongly for and against it. At present, generation techniques are in common use only in a few specific genres of video games, for reasons that can be matters of principle, practical constraints or, in some cases, entirely subjective preferences. At the same time there is a widespread tendency for game worlds to become larger and more replayable. Manual level design is becoming an expensive task and the need for automated productivity tools is growing. In this project I focus my efforts on the creation of a level generation system for a genre of video games that is very conservative in its use of these techniques. Automated generation for platform video games also presents a technological challenge because there are hard restrictions on what constitutes a valid level. The intuition for choosing reinforcement learning as a generative approach is based on the structure of platform game levels, which lends itself to a natural representation as a sequential decision-making process.

Acknowledgements

I would like to thank my parents for their care, as well as my supervisor Taku Komura for being so friendly and supportive. I also owe many thanks to the creators of my favourite game, Pandemonium, for their great work and the inspiration it gives me.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Atanas Laskov)

CONTENTS

I. INTRODUCTION
I.1. Introduction to Automated Level Generation
I.2. Introduction to Platform Games
I.3. Goals of the Project
I.3.1. Target Level Structure
I.3.2. Target Mode of Distribution
I.3.3. Learning Approach
II. EXISTING RESEARCH, LIBRARIES AND TOOLS
II.1. Research in Level Generation
II.1.1. Using Context Free Grammars and Graphs
II.1.2. Landscape Generation, Fractal and Related Methods
II.1.3. Level Generation for Platform Games
II.2. Reinforcement Learning Overview
II.2.1. The Reinforcement Learning Paradigm
II.2.2. Methods in Reinforcement Learning
II.2.2.1. Bellman Equations and Dynamic Programming
II.2.2.2. Temporal-difference Learning
II.2.2.3. Actor-critic Methods
II.2.2.4. Model-building Methods
II.2.2.5. Policy Gradient Methods
II.2.3. Reinforcement Learning Libraries
II.2.3.1. RL Toolbox
II.2.3.2. LibPG
II.2.3.3. Other Libraries
II.2.3.4. Choice of a Reinforcement Learning Library
II.3. Library for the Visualization and Testing of Levels
III. ARCHITECTURE OF THE SYSTEM
III.1. Conventions and Class Diagrams
III.1.1. Class Names
III.1.2. Member Variables and Methods
III.2. Architecture of the Level Generation System
III.3. Stages of Level Generation
IV. LEVEL GENERATION AS A REINFORCEMENT LEARNING TASK
IV.1. Task Specification
IV.1.1. State Space Representation
IV.1.3. Designing the Reward Function
IV.1.4. Actions of the Building Agent
IV.1.5. Choice of a Learning Algorithm
IV.2. Task Implementation
IV.2.1. The “Generation Algorithm” Sub-system
IV.2.2. Class Implementations
IV.2.2.1. Abstract learning algorithm (genAlgorithm)
IV.2.2.2. Reinforcement Learning algorithm (genAlgorithm_RL)
IV.2.2.3. State of the building agent (genState_RL)
IV.2.2.4. Actions of the building agent (genAction_RL)
IV.2.2.5. Transition and reward functions (genWorldModel)
IV.2.2.6. Traceability Markers
IV.2.2.6.1. Updating the Traceability Markers
IV.2.2.6.2. Jump Trajectory
IV.2.2.6.3. Representation in Memory
V. POST-PROCESSING
V.1. The “Blueprint” and “Post-processing” Sub-systems
V.2. Implementation of Post-processing
V.2.1. Wildcards
V.2.2. Context Matchers
V.2.2.1. Removing Redundancies, Smoothing and Bordering
V.3. Prepared Graphical Elements
V.4. Output File Format
V.4.1. Extending the system to other file formats
VI. GRAPHICAL USER INTERFACE
VI.1. The Level Parameters Object
VI.2. Implemented User Interface Controls
VI.3. Level Generator Dialogs
VI.3.1. Parameters Dialog (uiParametersDialog)
VI.3.1.1. Generation thread
VI.3.2. Progress Dialog (uiProgressDialog)
VI.3.3. Statistics Dialog (uiStatisitcsDialog)
VI.3.4. Completion Dialog (uiCompletedDialog)
VII. OUTPUT AND EVALUATION
VII.1. Methodology
VII.1.1. Experimental Setup
VII.1.2. Performance Measurement
VII.1.3. Error Measurement
VII.2. Optimisation of Parameter Settings
VII.2.1. Global Search in the Parameter Space
VII.2.2. Local Search in the Parameter Space
VII.2.3. Results of Parameter Optimisation
VII.3. Generator Benchmark
VII.3.1. Intrinsic Evaluation
VII.3.1.1. Benchmark: training time
VII.3.1.2. Benchmark: variant generation
VII.3.1.3. Benchmark: post-processing time
VII.3.1.4. Total level generation time
VII.3.2. User Suggestions for Future Development
CONCLUSION
BIBLIOGRAPHY

LIST OF FIGURES

Fig. I-1 Classification of platform games according to level structure
Fig. II-1 Evaluative Feedback
Fig. III-1 Notation used in class diagrams
Fig. III-2 Architecture of the level generation system
Fig. III-3 Five stages of level generation
Fig. IV-1 Level generation as a sequential process
Fig. IV-2 Global and local state of the building agent
Fig. IV-3 Is the state a fail state?
Fig. IV-4 Rewards and penalties
Fig. IV-5 Actions of the building agent
Fig. IV-6 Class diagram of the “Generation Algorithm” subsystem
Fig. IV-7 Pseudo-code of the transition function
Fig. IV-8 Sample level blueprint showing the traceability markers
Fig. IV-9 Updating the traceability markers
Fig. IV-10 Simplification of jump trajectory
Fig. V-1 The “Blueprint” and “Post-processing” subsystems
Fig. V-2 Table of common wildcards used during post-processing
Fig. V-3 Context matchers for removing redundancies
Fig. V-4 Terrain smoothing context matchers
Fig. V-5 Bordering context matchers
Fig. V-6 Contour tiles joined together
Fig. V-7 Lava trap
Fig. V-8 Shooting adversary
Fig. V-9 Sample output to a .lvl file
Fig. VI-1 Layout of the user interface
Fig. VI-2 Registering a new parameter
Fig. VI-3 Base classes of the user interface
Fig. VI-4 Dialogs implemented for the level generation system
Fig. VI-5 Parameters Dialog
Fig. VI-6 Completion Dialog
Fig. VII-1 Examples of generated levels with different branching factors
Fig. VII-2 Global search in the parameter space
Fig. VII-3 Best parameter configurations after global search
Fig. VII-4 Influence of the gamma parameter on the generation of variants
Fig. VII-5 Correlation between PC, epsilon and its attenuation
Fig. VII-6 Correlation between PC, alpha and its attenuation
Fig. VII-7 Results of parameter optimisation
Fig. VII-8 Test configuration
Fig. VII-9 Training time benchmark
Fig. VII-10 Variant generation benchmark
Fig. VII-11 Post-processing benchmark
Fig. VII-12 Scalability with regard to level length and branching
Fig. VII-13 Scalability with regard to the “chaos” parameter
Fig. VII-14 User suggestions for further improvement

EQUATIONS

Eq. II-1 Discounted reward
Eq. II-2 Bellman equation for an optimal value function V*(s)
Eq. II-3 The value function V^π(s) of a policy π
Eq. II-4 The update rule of temporal difference learning
Eq. IV-1 Penalty for reaching the fail state
Eq. IV-2 Reward for creating a branch in the level
Eq. IV-3 Reward for placing dangers and treasure
Eq. VII-1 Convergence metric PC
Eq. VII-2 Variant generation metric PG
Eq. VII-3 Standard error

CHAPTER I
INTRODUCTION

Throughout the history of the video games industry there have been many instances of level generation tools of varying sophistication, but very little work, and even less practical success, when it comes to level generation for the genre of platform games.
This deficiency can be explained by the technical obstacles that arise as a result of the structure of platform games, and also by the tradition of using handcrafted levels. In spite of this challenge, it is my belief that the development of an automated tool for platform games is a worthwhile task. With the constant increase in graphical complexity and player expectations, level design becomes a costly effort. Large game studios invest millions in level design alone, whereas small and independent developers struggle to deliver a game that is up to standard and still within a reasonable budget. This makes an automated tool to aid the design process for platform games a valuable and timely contribution.

I.1. Introduction to Automated Level Generation

Automated level generation has been successfully applied to some genres of video games, but to this date there are no commercial platform games using generation techniques. The indie game “Spelunky” [Yu08] is a significant success in this respect and, to the best of my knowledge, the only game to combine random level generation and elements of platform gameplay. However, it can be classified as a platform game only in the broadest sense, because the level structure it uses more closely resembles that of a Rogue-like game. “Rogue” [TWA80] is the first game in a long succession of dungeon exploration games, a genre that is most often associated with automated level generation. The widespread use of generated levels in Rogue derivatives is based on the practical reason that there are very few hard restrictions on the shape that a dungeon level can take. By contrast, platform games are a much more demanding genre, as even small changes in a level can easily invalidate it.

Although the technological challenges can differ for each genre of game, the benefits of using a level generation system apply universally. There are two important advantages of automated level generation over manually created levels:

· Increased replayability of games. In recent years replayability has become a very important design objective for many game development studios [Bou06]. Designers use different approaches to encourage replayability, such as making the game non-linear, introducing hidden and hard to reach (HTR) areas, or unlocking additional playable characters. In spite of these efforts, the level is very much the same the second time it is played. Integrating an automated generation system in the game can improve replayability because the player will be provided with a completely different level for every replay.

· Reduced production cost. With the increase in graphical complexity and the size of game worlds, the amount of effort that level designers must put into a single level is also increasing. It is often the case that a whole team of designers must work on a single level, making handcrafted levels very costly [Ada01]. Automated level generation can reduce the amount of manual work by providing an initial approximation of the level structure that is further refined by human designers. Critical parts of the level design, such as boss encounters or any scripted events, would receive more attention than less important parts of a level.

A common criticism of automated level generation is that it produces levels of quality inferior to that of hand-crafted levels. For example, generating a level containing logical puzzles, or one that contributes to the storyline of the game, is very difficult to achieve.
It is also not possible to give an automated system the sense of aesthetics that a human designer can apply. These arguments are valid, but they are not a reason for abandoning research in this field. Even if the technology is not fully mature, it is still possible to integrate automated and human design with good results. Many dungeon exploration games, such as “Diablo” I and II [Bli00], demonstrate the validity of this statement. With the advancement of technology it would become a practical solution to entrust the generator with more sophisticated design tasks.

I.2. Introduction to Platform Games

This section gives a brief introduction to platform games and a description of the level structure typical for this genre. In doing so I will try to highlight the specific requirements that apply to level generation for platform games and also to familiarise the reader with the area of application of this project.

Platform games have evolved significantly since the early years of the game industry, with considerable sophistication of game mechanics and a transition from 2D to 3D graphics. Nonetheless, the core gameplay is based on the same principles that apply to early titles such as “Super Mario Bros” [Nin87] and “Donkey Kong” [Nin81]. Initially, the main character in the Mario games was called “Jumpman”, a name that reveals one of the distinguishing features of this genre: jumping in order to avoid obstacles. Platform games encourage the player to perform jumps and other acrobatic activities within a level that contains a sequence of platforms, pits, monsters and traps [Bou06, SCW08]. It is the goal of the player to overcome all obstacles on his way to the finishing point.

Because of limitations in development time and resources, it is common to have a small, closed set of obstacle elements that are repeatedly used throughout the levels. For each level the human designer chooses an appropriate subset of level elements, commonly referred to as the “tileset”, and decides how to arrange them in order to produce the illusion of variety. The use of a tileset greatly facilitates the functioning of the automated generation system. It makes it possible to specify the actions of the building agent as indexes into this relatively small set of elements.

Traditionally, platform games have a mostly linear level structure, allowing for only a few alternative paths for the player [Bou06]. It is typical for a level to start with one pathway, then branch into several alternative pathways, and by the end of the level for all of them to merge again into a single pathway. There is only one start point and only one finishing point, but still some freedom for the player to choose how to approach a level. This is what will be referred to as the classical “branching structure” of levels.

The transition to 3D graphics brought not only a cosmetic change but also the opportunity for level designers to experiment with an alternative level structure. Adding a third dimension makes it possible to engage in more exploration, if this is supported by a non-linear structure of levels. Some games, as for example “Super Mario 64” [Nin96], have adopted this very different level design approach, but there are also many 3D games that continue to be faithful to the traditional design paradigm. It may seem that a 3D graphics engine lends itself naturally to a non-linear level concept, but this is not necessarily the best choice.
Non-linear levels can be problematic if not implemented with care, creating burdens for the player such as unintuitive camera control and increased difficulty in judging distances. It should also be noted that even the most “non-linear” world concept is based on implicit linear episodes. It is by definition true that in platform games the player performs a sequence of jumps in order to avoid a sequence of obstacles. In the classical branching structure of levels this order is strictly compulsory, whereas in non-linear levels it can only be suggested by the level designer.

I.3. Goals of the Project

It is the goal of this project to implement a level generation system capable of adequately modelling the level structure of platform video games. There are many automated generation tools in existence, but none of them is capable of adequately performing this task. In the following sections I present this goal more precisely by describing the target level type, the mode of distribution and the learning approach used by the level generation system.

I.3.1. Target Level Structure

There is a significant variety of platform games, so in designing a level generation system the necessary first step is to specify the type of levels that it will be able to produce. Figure I-1 below presents one possible classification of platform games by level structure.

This project has the ambition of implementing a level generation system for platform games of Class (1). This class includes most of the 2D games and also some games with 3D graphics but a traditional 2D level structure, such as “Pandemonium” [Cry97] and “Duke Nukem: Manhattan Project” [3DR02]. The system is designed with the idea of extensibility to Class (2), but implementing level generation for this class is beyond the scope of this project. Games with the level structure described by Class (3) differ from the other two classes to an extent that makes their classification as “platform games” open to debate. Generating levels of this class would require a significantly different approach, so it is excluded from the goals of this project.

Figure I-1. Classification of Platform Games by Level Structure

Class (1): Branching levels in 2D and 2.5D. The player walks along the X axis and jumps along the Y axis. Typically, the player starts at the leftmost end of the level and gradually progresses to the right. The view of the level is scrolled as the player moves. There can be some amount of non-linearity, implemented as several “branches” running in parallel to one another. Although the level structure is two-dimensional, modern games of this type are usually implemented with a 3D graphical engine. The resulting genre is sometimes referred to as 2.5D. Example: Pandemonium, DN: Manhattan Project [Cry97, 3DR02]

Class (2): Branching levels in 3D. The player walks on the (XY) plane and jumps along the Z axis. The freedom of movement on the (XY) plane is restricted to a narrow pathway and a small area around it. There can be both horizontal and vertical “branches” of the level, providing alternative choices for the player. Example: Crash Bandicoot [Son96]

Class (3): Non-linear levels. The player walks on the (XY) plane and jumps along the Z axis. Unlike levels of Class (2), the movement on the (XY) plane is not restricted and the player can choose in what order to interact with terrain elements.
Example: Super Mario 64 [Nin96]

I.3.2. Target Mode of Distribution

Another important consideration for the goals of this project is related to the way in which the level generation system will deliver its output to the user. It is possible to use the generation system either as an internal development tool or to distribute it to the player, together with a video game. In the first case the user of the system is not the player but the game developer. Under this mode of distribution it is acceptable for the level output to contain occasional errors, as long as the correction of these errors by a human designer takes significantly less time than building a level from scratch. In this case the replayability of levels is not increased, but the system can be a useful productivity tool.

The second alternative is to distribute the level generation system integrated with a game project. Under this mode of distribution the advantages of having a level generation system can be exploited more fully, but it is also more demanding with regard to the validity of the output. It is necessary to guarantee that the system generates an error-free level within a reasonable time limit. The quality of the output must also meet higher standards in terms of visual appeal, because in this case there is no human intervention to improve it.

This project was originally intended as an internal development tool, but the evaluation of the system showed that, with some additional work, it can be extended to the more demanding mode of distribution. The implementation outlined in this document makes it possible to decouple the learning and generation parts of the system, resulting in more predictable and error-free output. The game could then integrate only a library of learned policies and a lightweight variant generation algorithm.

I.3.3. Learning Approach

In this project I choose to make use of an unsupervised learning approach for solving the level generation task. Unsupervised learning operates without the need for training data, which is a significant advantage in light of the fact that most of the existing level data is proprietary. It was also my reasoning that the task lends itself naturally to a reinforcement learning interpretation, because platform game levels are discrete, sequential, and not unlike many of the Gridworld examples in the literature [SB98]. The difference is that a level generation system would have a “building agent” that changes the environment, rather than a “walking agent” responding to existing obstacles.

CHAPTER II
EXISTING RESEARCH, LIBRARIES AND TOOLS

In this chapter I present existing research in level generation in light of its relevance to the task of level generation for platform games. Methods such as context free grammars and graph grammars are examined, as well as research focusing on level generation for platform games. Research in this specific area is sparse, but it gives valuable insight, especially with regard to the method used for evaluating the validity of levels. The chapter continues with an overview of the reinforcement learning paradigm, some principal methods and learning algorithms. I also examine several libraries implementing these algorithms and discuss how the use of an existing library relates to the goals of this project.
II.1. Research in Level Generation

II.1.1. Using Context-free Grammars and Graphs

In the context-free grammar (CFG) approach to level generation [Inc99], which is typically used for generating dungeon levels, the permissible level structures are encoded in the productions of a grammar. Each level is represented as a sequence of grammar productions starting with the global non-terminal LEVEL. Given this level representation, automatic generation amounts to building random valid sentences. This is a simple procedure that can be implemented as follows: starting with the LEVEL non-terminal, the generation algorithm recursively applies a random production out of the set of productions that have the required non-terminal in their antecedent (i.e. on the left side of the production). This is repeated until all the leaves of the expansion tree become terminal symbols, as sketched below.
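The fragment below is a minimal C++ sketch of this random expansion procedure. It assumes a grammar stored as a multimap from non-terminal symbols to alternative right-hand sides; the type and function names are illustrative and are not taken from [Inc99] or from the system described later.

    #include <map>
    #include <random>
    #include <string>
    #include <vector>

    // A grammar maps each non-terminal to one or more right-hand sides.
    // Symbols that never appear as keys are treated as terminals.
    using Symbol  = std::string;
    using Grammar = std::multimap<Symbol, std::vector<Symbol>>;

    // Recursively expand `symbol`, appending terminal symbols to `out`.
    void expand(const Grammar& g, const Symbol& symbol,
                std::vector<Symbol>& out, std::mt19937& rng)
    {
        const auto range = g.equal_range(symbol);
        if (range.first == range.second) {       // no production: terminal symbol
            out.push_back(symbol);
            return;
        }
        // Pick one production for this non-terminal uniformly at random.
        const auto count = std::distance(range.first, range.second);
        std::uniform_int_distribution<long long> pick(0, count - 1);
        const auto chosen = std::next(range.first, pick(rng));
        for (const Symbol& s : chosen->second)   // expand its right-hand side
            expand(g, s, out, rng);
    }

    // Example call: expand(grammar, "LEVEL", sentence, rng) builds one random level string.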
The author of [Inc99] mentions several drawbacks of this approach, including an irregular distribution of the enemies, the existence of “pointless sections” and dead-ends in the level output. There is also the necessity to convert the tree representation into a level layout, which is not necessarily possible for every generated tree. Although these problems could be solved by modifications in the tree-generation algorithm, in my view the CFG approach has a more significant disadvantage. The necessity to specify an explicit grammatical representation of all possible level configurations could require very time-consuming and complicated work. Even if a corpus of LEVEL strings existed, which to the best of my knowledge is not the case, it would be necessary to implement some sort of automated inference in order to learn the grammar [JM08]. In light of this, I see no viable alternative to creating the grammar by hand. It is clear that any method based on CFG or tree expansion would involve a considerable amount of manual work in order to adapt it for a different game, or in case the level specification changes.

In [Ada02] the author presents a variation of the grammar approach that uses Graph Grammars (GG) for the purposes of generating dungeon levels. Under this approach the grammar encodes a series of transformations applied to a graph representation of the level. This appears to be a reasonable choice for modelling the web-like structure of dungeon levels, but it is not clear what the advantage would be in the case of platform games. The current project is concerned with levels that have no enclosed “rooms” to be represented as nodes of a graph, and no linking corridors corresponding to the arcs. The connectivity of different locations in a platform game level is the result of the simulated laws of physics and not an explicit design choice.

II.1.2. Landscape Generation, Fractal and Related Methods

The task of generating naturally-looking landscapes has attracted a lot of attention and, as a result, there are several mature approaches capable of producing satisfactory results. Fractal generation methods [Mar97, Mil89] and bitmap-based methods [Mar00] are often used in game development, as they can produce very flexible output at low computational cost. There are also some experimental landscape generation systems, as for example the one presented in [OSK*05], where the authors develop a two-stage Genetic Algorithm for “evolving” terrains. In the first stage of generation the user specifies a 2D sketch of the environment that is refined by a genetic algorithm. In the second stage another GA uses the refined sketch as a template for blending together several terrain types and introducing variations; this second stage makes use of a database of terrain samples. The authors point out that, unlike Geographic Information Systems (GIS) that are based on a physical simulation of tectonic activity and erosion processes, this method can produce not only realistic terrains but also comical or exaggerated landscapes.

In spite of their popularity and generative power, these methods do not address the specific challenges of level generation for platform games. Even if there is outdoor terrain in a given platform game, it usually serves as nothing more than a graphical backdrop.

II.1.3. Level Generation for Platform Games

There is very little research dedicated to the specific needs of level generation for platform games. In [CM06] the authors describe a level generation system for platform games that is related to the context-free grammar methods discussed in section II.1.1. This paper proposes a hierarchical model of levels, which consists of three layers: level elements, patterns and cells. Each of the components in this hierarchy is created in a different way and captures a different aspect of platform level design. Level elements are the manually specified terminal symbols of the grammar (or a “bracketing system”, as it is referred to by the authors), representing different types of platforms. Patterns capture the sequence of jumps and obstacles in the level and are generated by a hill-climbing algorithm based on random modifications of the bracketing rules. Cells capture the non-linearity (i.e. “branching”) of levels.

This research shares the shortcoming of all CFG methods in that it needs a manual specification of all pattern and cell configurations. This is referred to as a “bracketing system”, but as far as can be judged from the provided description, there is no essential difference from designing a grammar. Specifying all possible combinations of level cells is likely to be a time-consuming task that must be repeated for every new game taking advantage of the generator. Another problem with the proposed level generation method is the use of a hill-climbing algorithm, which converges on a local maximum in the solution space. For each possible subset of cells the generation algorithm uses steepest-ascent hill-climbing in order to find the optimal level pattern. Out of all these patterns of different lengths, the system selects the one which is closest to the desired difficulty value. It is the authors' claim that this is an acceptable solution because of the absence of local maxima in the search space of possible level patterns. However, this statement is not substantiated in the paper. It could be the case that the particular grammar is designed in order to ensure this, but it cannot be assumed in the general case of a bracketing system developed for any game.

The same work describes an interesting method for evaluating the difficulty and traceability of levels by running a ballistic simulation of the player. This simulation is executed on the whole level pattern, followed by a step of the hill-climbing algorithm. In the evaluation step the simulated player moves along the platforms in the level and performs jumps at every possible location. Each jump defines a “spatial window” that marks the portion of the level below the ballistic curve as accessible.
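As an illustration of this evaluation idea, the fragment below marks the grid cells lying under a parabolic jump arc as accessible. It is only a simplified sketch under assumed constants (one grid cell per time step, a fixed jump velocity and gravity) and, for brevity, it only marks cells at or above the take-off row; the names and the grid type are illustrative and are not taken from [CM06] or from the traceability-marker implementation described in Chapter IV.

    #include <cmath>
    #include <vector>

    using Grid = std::vector<std::vector<bool>>;   // accessible[row][column]

    // Mark the cells reachable by a single jump starting at (startX, startY).
    // Smaller row indices are higher up in the level.
    void markJumpWindow(Grid& accessible, int startX, int startY,
                        double jumpVelocity, double gravity, int maxReachX)
    {
        const int height = static_cast<int>(accessible.size());
        const int width  = height > 0 ? static_cast<int>(accessible[0].size()) : 0;

        for (int dx = -maxReachX; dx <= maxReachX; ++dx) {
            const int x = startX + dx;
            if (x < 0 || x >= width) continue;

            // Height of the ballistic curve above the take-off row at this offset.
            const double t   = std::abs(dx);                 // one cell per time step
            const double arc = jumpVelocity * t - 0.5 * gravity * t * t;
            if (arc < 0.0) continue;                         // curve has dropped below the start

            // Everything between the take-off row and the arc is marked accessible.
            for (int dy = 0; dy <= static_cast<int>(arc); ++dy) {
                const int y = startY - dy;
                if (y >= 0 && y < height) accessible[y][x] = true;
            }
        }
    }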
In the tradition of platform games this physical simulation is not entirely realistic, as it includes the ability to change direction in “mid-air”. Because of this ability, not only the ballistic curve but the whole area below it becomes accessible in a single jump. In the implemented model there is also the possibility for other popular exaggerations, such as double jumps for example. For the goals of this project I develop a system of traceability markers based on the same idea of level evaluation. The method described in [CM06] evaluates and re-evaluates the whole level at every step of the hill-climbing algorithm. Unlike this procedure, traceability markers need to be updated only once for each position in the level.

II.2. Reinforcement Learning Overview

As discussed in the previous section, approaches based on context-free grammars have been applied to platform games, but they require the hand-coding of generational rules. This requirement for a manual specification of rules and the resulting lack of generality are flaws of the CFG method. Supervised learning techniques could be a solution to this problem. Unfortunately, training data is difficult to come by because every game has a unique level format, which is often proprietary. An additional disadvantage of using a supervised approach would be that it tries to mimic existing data, whereas game designers want originality in their levels. It is my reasoning that a more natural solution to the generation problem would be based on unsupervised methods, such as reinforcement learning. This allows the task to be specified in terms of desirable and undesirable states of affairs, rather than by mimicking existing data.

II.2.1. The Reinforcement Learning Paradigm

Reinforcement learning is concerned with the task of finding an optimal policy π* that maximises the long-term, online performance of an agent functioning in a given environment [KLM96]. The immediate performance is measured by the reward signal r_t ∈ R that the environment generates after each action of the agent. There are several possible definitions of long-term performance. In the most commonly used definition the goal is to optimise the infinite-horizon discounted reward [SB98]:

    R = \sum_{t=0}^{\infty} \gamma^t r_t    (Eq. II-1)

where γ is the reward discount rate.

One important distinguishing feature of reinforcement learning that sets it apart from supervised learning techniques is the use of a reward signal. In a discrete interpretation, at each time step t the agent performs an action a_t ∈ A and, depending on the current state of the environment s_t ∈ S, it receives a different reward signal r_t ∈ R. The next state s_{t+1} is at least partially determined by the performed action (Figure II-1).

Figure II-1. Evaluative feedback: the agent sends an action <a_t> to the environment, which responds with a reward signal and a new state <r_{t+1}, s_{t+1}>.

The use of a reward signal is termed “evaluative feedback”, as opposed to the “instructive feedback” used in supervised learning. In most cases the reward signal is delayed (e.g. until a success or failure state is reached) and the environment could also contain deceptive reward signals that lead to a very undesirable part of the state space and a subsequent penalty. Therefore, the probability of a future event occurring must have an effect on the actions that the agent takes in the present time step. Maximising the long-term performance makes it necessary to devise some method of evaluating these probabilities and propagating the future rewards back to the current state.
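To make the interaction loop of Figure II-1 concrete, the fragment below shows the generic shape of one training episode together with the discounted reward of Eq. II-1. It is a schematic sketch only: the Environment and Agent interfaces are placeholders standing in for whichever library provides them, not classes from RL-Toolbox or from this project.

    #include <utility>

    // Placeholder interfaces for the evaluative-feedback loop of Figure II-1.
    struct Environment {
        virtual ~Environment() = default;
        virtual int  currentState() const = 0;
        virtual std::pair<double, int> step(int action) = 0;   // returns <reward, next state>
        virtual bool episodeFinished() const = 0;
    };

    struct Agent {
        virtual ~Agent() = default;
        virtual int  selectAction(int state) = 0;                      // e.g. epsilon-greedy
        virtual void observe(int s, int a, double r, int sNext) = 0;   // learning update
    };

    // One episode of interaction; returns the discounted reward of Eq. II-1.
    double runEpisode(Environment& env, Agent& agent, double gamma)
    {
        double discountedReward = 0.0;
        double discount = 1.0;                                  // gamma^t
        while (!env.episodeFinished()) {
            const int s = env.currentState();
            const int a = agent.selectAction(s);
            const std::pair<double, int> feedback = env.step(a);   // evaluative feedback <r, s'>
            agent.observe(s, a, feedback.first, feedback.second);
            discountedReward += discount * feedback.first;
            discount *= gamma;
        }
        return discountedReward;
    }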
In order to specify a given task as a reinforcement learning problem, it is necessary to satisfy several conditions [SB98]:

(1) Discrete state representation. Most reinforcement learning techniques are designed to work in a discrete state space and at discrete time steps. If the natural representation of the state space is continuous, then discretisation may increase the number of states dramatically, which in turn will affect the performance and memory requirements of the learning algorithm. As outlined in [KLM96], supervised learning methods, such as the approximation of the value function with a neural network, can help to alleviate this problem. Fortunately for the goals of this project, platform games often represent the level as a discrete grid, each cell containing one out of a finite number of values (i.e. indexes into the tileset). In some games with a more free-form appearance of the terrain, converting to a discrete grid representation may require additional processing.

(2) Markov property. The Markov property requires all decisions of the agent to be made only on the basis of the current state and not the previous history of states (it is stated formally at the end of this section). In the case of the level generation problem this requirement is not immediately satisfied, but with the use of traceability markers, introduced in Chapter III, the previous history of states can be efficiently compressed into the current state.

(3) Reward signal. Evaluative feedback defines the goals of the optimisation task. Specifying the reward signal can be a challenging task and, if not implemented with care, the learning algorithm could converge on a solution that is optimal with respect to the reward signal but hardly useful for the goals and purposes of the designer. It is good practice to specify incremental rewards leading to a desirable part of the state space, rather than a reward only in the final state of achievement. Seeding the value function with an approximate solution could be another way to guide the learning process.

The environment of the agent can be non-deterministic, allowing for a transition function that draws the next state from some underlying and unchanging probability distribution. If the underlying probability distribution changes, the environment is not only non-deterministic but also non-static. In [KLM96] the authors point out that most learning algorithms are proven to converge only under the assumption of a static environment and lack such guarantees in the case of a non-static environment. Nonetheless, for a slowly changing non-static environment reinforcement learning methods still show good results.
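For reference, the Markov property of condition (2) can be stated formally as the requirement that the transition probabilities depend only on the most recent state and action. The notation below follows [SB98] and is added here only for clarity:

    \Pr\{ s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0 \} = \Pr\{ s_{t+1} = s' \mid s_t, a_t \}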
II.2.2. Methods for Reinforcement Learning

Because of the variety of available methods, reinforcement learning is described more precisely by the learning paradigm presented in the previous section than by any specific learning algorithm. The wide variety of reinforcement learning methods is often divided into two distinct groups: methods learning a value function and policy gradient methods. The discussion of learning algorithms starts with value-based methods, which take advantage of the representation of the learning task as a Markov Decision Process (MDP). By contrast with this specific value-learning approach, policy gradient methods can use any general optimisation algorithm combined with evaluative feedback.

II.2.2.1. Bellman Equations and Dynamic Programming

Dynamic programming, as applied to solving the Bellman optimality equations, is amongst the first approaches to reinforcement learning. This method requires a model of the environment, which means that the transition probabilities and the reward function must be specified in advance [SB98]. The Bellman equation for V*(s) is a recursive definition of what constitutes an optimal value function:

    V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V^*(s') \right)    (Eq. II-2)

where P^a_{ss'} is the transition probability, R^a_{ss'} is the reward signal and γ is the reward discount rate. More generally, for any given policy π the value function V^π(s) is given as follows:

    V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V^\pi(s') \right)    (Eq. II-3)

where π(s,a) is the probability of choosing action a ∈ A in state s ∈ S under the current policy.

The dynamic programming approach can be implemented as a two-phase iterative algorithm. In each iteration it brings the current policy π one step closer to the optimal policy π* by alternately performing “policy evaluation” and “policy improvement”. Policy evaluation computes an estimate of V(s) for the current policy π; it is followed by policy improvement, which is a greedy change of π with regard to the new estimate of V(s). There is a theoretical proof of convergence for the algorithm and it is also guaranteed that at each step the policy will only improve. The most significant disadvantages of the dynamic programming approach are the requirement for a model of the environment and the computational cost of re-estimating the value function.

II.2.2.2. Temporal-Difference Learning

In [SB98], temporal-difference learning is described as an approach to estimating the value of V(s), or Q(s,a), on the basis of previously obtained estimates. It does not require a model of the environment, and learning occurs throughout the training run and not only at the end of it. The updates of the value function can be performed either in sweeps, or each state can be updated asynchronously. The update rule of the TD algorithm is as follows:

    V_{new}(s) = V(s) + \alpha \left( r + \gamma V(s') - V(s) \right)    (Eq. II-4)

where α is a learning rate parameter (gradually decreased during the run), γ is the reward discount rate and V(s') is the current estimate of the value of the next state. This update rule makes no reference to the transition or reward function of the environment, making TD a model-free approach to estimating the value function. The update rule can also be extended to a look-up over several subsequent states, resulting in the TD(λ) algorithm. This technique is referred to as an “eligibility trace”, controlled by the λ parameter. Eligibility traces can improve the accuracy of value estimates at the cost of more computation per iteration [KLM96].

There are two important learning algorithms that are based on the principles of temporal-difference learning. The SARSA algorithm is an on-policy learning algorithm that learns the value function of state-action pairs, Q(s,a), instead of V(s). SARSA can also make use of eligibility traces in order to produce more accurate estimates. Q-Learning is an off-policy temporal-difference learning algorithm. Like SARSA, it learns a value function for the state-action pairs Q(s,a). Unlike SARSA, it always uses the best action in its update rule, regardless of the choice that the agent actually makes. Both update rules are sketched below.
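The difference between the two update rules can be illustrated with a small tabular sketch. This is generic textbook code rather than the project's implementation or the RL-Toolbox API; the table layout and function names are assumptions made for the example.

    #include <algorithm>
    #include <vector>

    // Q is a table of action values: Q[state][action].
    using QTable = std::vector<std::vector<double>>;

    // SARSA (on-policy): uses the action a2 that the agent actually takes in state s2.
    void sarsaUpdate(QTable& Q, int s, int a, double r, int s2, int a2,
                     double alpha, double gamma)
    {
        Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a]);
    }

    // Q-Learning (off-policy): uses the best action in s2, whatever the agent does next.
    void qLearningUpdate(QTable& Q, int s, int a, double r, int s2,
                         double alpha, double gamma)
    {
        const double best = *std::max_element(Q[s2].begin(), Q[s2].end());
        Q[s][a] += alpha * (r + gamma * best - Q[s][a]);
    }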
There are both advantages and disadvantages to off-policy learning. It is more likely for an on-policy learning algorithm, such as SARSA, to converge on a suboptimal solution if the initial seed of the value function happens to be unlucky. Q-Learning is more resilient with regard to bad policies [SB98]. However, it is possible that an exploratory move would push the agent into an undesirable state. The value function learned by an off-policy algorithm does not account for this probability, and therefore the agent will not learn how to avoid such states under the current policy. Therefore, under some circumstances the “on-line performance” of the agent can become worse [SB98, Num05].

II.2.2.3. Actor-Critic Methods

As outlined in [KLM96], the distinguishing feature of actor-critic methods is that they store the learned value function and the policy separately. The critic part of the system receives a reward signal from the environment and uses it to compute an error for its value estimates. This is followed by an update of the estimate and a propagation of the error value to the actor part of the system. It is the responsibility of the actor to update the action selection policy based on the error signal supplied by the critic, and not directly on the environment's reward signal. The Adaptive Heuristic Critic algorithm is a modified version of policy iteration that stores the policy and the optimal value function separately. Unlike the dynamic programming version of policy iteration, it does not rely on the Bellman equations for computing the V(s) function but performs the TD update rule presented in Eq. II-4. Natural Actor Critic [PS05] is another development in the group of actor-critic methods. It is an advantage of methods in this group that the computational cost of the update rules can be very small.

II.2.2.4. Model-building Methods

Model-building methods, as presented in [KLM96, SB98], try to maximise the use of the information that can be obtained from training experience by trying to estimate the dynamics of the environment. For tasks such as robotic control, where training experience can be difficult to obtain, this could be a well-motivated approach. The original model-building algorithm, referred to as Certainty Equivalence, performs a full re-evaluation of every state of the environment at each time step. This is a procedure that can easily become intractable, especially when real-time performance is required. The Dyna algorithm suggests a more computationally tractable alternative that chooses a random sample of k states at each time step. Prioritized Sweeping [MA93] is an elaboration of the Dyna sampling method that uses ΔV(s), the amount of “surprise”, as an indicator of which parts of the state space should be sampled. Model-building methods trade the speed of the update rules for a better use of training experience, which in the case of a simulated environment (i.e. where experience is easy to obtain) is probably not a good choice.

II.2.2.5. Policy Gradient Methods

Policy Gradient Reinforcement Learning (PGRL) arises as a combination of evaluative feedback and traditional supervised learning approaches. Neural networks have been successfully used in combination with a reinforcement signal that readjusts the weights of the network, as in [WDR*93] for example, and other gradient methods such as simulated annealing can also be adapted. There are no limitations on the choice of optimisation algorithm, as long as it can accommodate the feedback generated by the environment. One particularly interesting line of development is Evolutionary Algorithms for Reinforcement Learning, or “EARL” [MSG99], and reinforcement Genetic Algorithms. The genetic algorithm uses a fitness function in order to rank a population of competing “chromosomes”, encoding candidate solutions to a problem.
Traditionally, the fitness function is calculated by testing the chromosome on a database of problem/answer pairs, where fitness is awarded in inverse proportion to the distance from the correct answer. In an alternative solution, the GA can be adapted to the reinforcement learning paradigm by creating an environment model that responds to the behaviour of a candidate solution. The cumulative reward can then be used as the fitness measure and the GA will perform selection according to its value. It should be noted that in this case the goal of the optimisation would not be the future discounted reward, as in Eq. II-1, but the cumulative reward for the run, as the sketch below illustrates.
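The following fragment sketches how a cumulative-reward fitness function of this kind might look; the Chromosome type, its decoding and the environment interface are hypothetical placeholders, not part of [MSG99] or of this project.

    #include <vector>

    // Hypothetical candidate solution: a chromosome decoded into one action per state.
    struct Chromosome {
        std::vector<int> genes;   // assumed non-empty
        int decodeAction(int state) const { return genes[state % genes.size()]; }
    };

    // Fitness under the reinforcement learning interpretation: roll the candidate out
    // in an environment model and sum the rewards it collects during the run.
    template <typename Env>
    double fitness(Env& env, const Chromosome& candidate)
    {
        env.reset();                                      // assumed: restart the episode
        double cumulativeReward = 0.0;
        while (!env.episodeFinished()) {
            const int action = candidate.decodeAction(env.currentState());
            cumulativeReward += env.step(action).first;   // accumulate undiscounted reward
        }
        return cumulativeReward;                          // used by the GA for selection
    }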
LibPG implements the SARSA, Q- 26 Learning, Policy-Gradient Reinforcement Learning, Natural Actor-Critic and Least Squares Policy Iteration (LSPI) algorithms. This project is distributed together with example programs and has good documentation, although not as extensive as that of RL-Toolbox. Unfortunately, the LibPG library is only available for Linux. In light of the fact that the current project is developed in a Microsoft Windows programming environment, it appears that LibPG would require some time consuming readjustments if it is to be used. II. 2.3.3 Other Libraries There are two Java implementations that merit at least a brief description. PIQLE [Com05] stands for “Platform for Implementing Q-Learning Experiments” and as the name suggests is an implementation of Q-Learning. This is an open-source project that also supports multi-agent reinforcement learning. The “Free Connectionist Q-learning Java Framework” [Kuz02] is another open-source Java implementation. This framework is an example for combining the reinforcement learning paradigm with supervised learning techniques, in this case a neural network. Another interesting project, which does not contain any algorithmic implementations but specifies a standard for the components of a reinforcement learning system, is RL-Glue [WLB*07]. The goal of this library is to create a standard for the interaction between the different components of an reinforcement learning experiment. Experiments, agents and environments can be implemented in a number of different programming languages, including C/C++, Java and as Matlab scripts. These objects can work together by means of interfacing with the standardising RL-Glue code. II. 2.3.4. Choice of a Reinforcement Learning Library When comparing different reinforcement learning libraries the following factors were considered: · Implemented reinforcement learning algorithms; · Compatibility of the programming language and development environment; · Availability of documentation; The programming language used by this project is C++ and the development platform is the free Express Edition of Microsoft Visual Studio. Out of the five 27 different libraries that were discussed in this chapter, only RL-Toolbox and RL-Glue are compatible with this development environment. Furthermore, RL-Glue does not contain any algorithmic implementations but only serves as a standard for performing experiments. LibPG would be a very strong alternative to RL-Toolbox in the case of platform compatibility but the lack of a Windows library distribution makes it a less desirable choice. Migrating to a Linux development environment wound is not a realistic option, as the graphical engine used for visualisation and testing of levels [Las07] is not available under that operating system. It would either be necessary to adapt LibPG to the Windows environment, or the graphical engine to Linux, neither of which is trivial and neither is an essential goal of this project. Having this in consideration, the most practical course of action would be to use the RL-Toolbox library. II.3. Library for the Visualization and Testing of Levels Evaluating the output of a level generation system can be a challenging task, because a generated level becomes meaningful output only when integrated in a video game. The performance of the level generation system can be measured independently in terms of generation time, convergence of the learning algorithm and validity of the output. 
However, it is difficult to be sure that the generated level is “playable” without actually playing it, and even more difficult to debug the system by looking at the raw output. It is therefore essential to have some means of visualising and testing generated levels. For these purposes I use a game prototype [Las07] that I developed in a previous project. The engine is mostly OpenGL-based, but it also uses DirectInput (a part of Microsoft DirectX) and some Windows API calls during initialization. It implements the loading and visualisation of levels, collision detection, and interaction with treasure and dangers. The user interface of the level generation system also uses the rendering, texture and font management facilities provided by this library.

It should be noted that the level generation system itself is not dependent on the Windows platform, nor on the OpenGL library. It is the graphical engine that would need some reworking if a Linux port were to be implemented. In the case of switching to a different graphical engine, only the user interface component of the level generation system would need to be reworked.

CHAPTER III
ARCHITECTURE OF THE SYSTEM

This chapter presents the object-oriented design of the level generation system and the principal goals that it is intended to achieve. The most fundamental design consideration is that the level generation system must not make references to any specific indexes in the tileset, because any realistic game project would use several different sets of elements and changing them should be performed with ease. Rather than working with indexes in the tileset, the level generator first produces an abstract “blueprint” of the level. In a later stage of the level generation procedure, referred to as post-processing, the level blueprint is transformed into a renderable level.

As an extension of this requirement for generality, it was also a design objective to make the level generator as independent of the game engine library as possible. This ensures that the system can be integrated into many different game projects without the requirement of using one specific engine. To that end, the functionality for producing file output is encapsulated in a class derived from the abstract blueprint class. The user interface and all other visualisation tasks are isolated in a separate sub-system of the level generator so that they can be separated out if necessary.

Although the focus of this project is on implementing a reinforcement learning generation algorithm, the system design allows for a multitude of generation approaches. The level generation algorithm is represented by an abstract façade class and other parts of the system are not aware that a reinforcement learning implementation, or any other algorithm, is currently in use. This design choice makes further extension of the system much easier. In short, the system design is centred around the following goals:

· Independence of the game engine library;
· Independence of the learning technique and library;
· Allowing for a change of the tileset and modifications in its content.

III.1. Conventions and Class Diagrams

Figure III-1 summarizes the different types of relations that can occur between classes: aggregation (1 x 1), aggregation (1 x N), reference and inheritance. The class diagrams in this chapter and in Chapter IV use this standard notation, based on [GHJ*95]. This project also follows some source code conventions, briefly outlined here.

Figure III-1. Class diagram notation
III.1.1. Class Names

Class names start with the 'gen' prefix, indicating they are part of the level generation system. For classes implementing the user interface this prefix is 'ui'. Classes corresponding to an implementation for a specific library or learning technique include an abbreviation to indicate this, after an underscore at the end of the class name. For example, genAlgorithm is the abstract interface for a level generation algorithm, whereas genAlgorithm_RL is the reinforcement learning implementation of the same interface.

III.1.2. Member Variables and Methods

All member variables include one or two lowercase symbols to indicate type, and private member variables also start with an underscore. For instance, _nLength is a private variable of type integer, whereas _pLevel is a private pointer. The only requirement for method names is to start with a lowercase character and to reflect the function that they serve.

III.2. Architecture of the Level Generation System

Figure III-2 shows the main subsystems of the level generator. The reinforcement learning implementation, consisting of a state transition function, reward function, state representation and an action set, is hidden behind an abstract facade. Interaction with the reinforcement learning library is also limited to this subsystem. This makes it easy to switch to a different reinforcement learning implementation or even a different learning approach altogether.

The 'Blueprint and Post-processing' subsystem also has an abstract facade, which is used by classes external to it. This part of the level generator is responsible for representing the level blueprint in memory, performing post-processing, and implementing output to the specific file format of the game engine. The post-processing algorithm achieves independence of the used engine and tileset by storing this specific information in files with the extension .cm. Files of this type contain context-matching and replacement rules that map the abstract blueprint to appropriate tileset indexes.

Figure III-2. Architecture of the Level Generation System. The main subsystems are the Generation Algorithm (Abstract Algorithm, RL Algorithm, RL State, RL Actions, RL Environment), the Blueprint and Post-processing subsystem (Abstract Blueprint, 'Cubic' Blueprint, Post-processing, .CM Files), the GUI (GUI Base Classes, Generator Dialogs) and the Parameters object; the external dependencies are the 'Cubic' 3D Engine, RL-Toolbox and the Level File output.

Last but not least important is the user interface subsystem. In order to use the level generator as a development tool, it is essential to provide a user interface for setting parameter values, training the system and visualising the results in a quick and convenient way. Having a robust user interface also facilitates the debugging of the system and allows the output to be pre-visualised, rather than waiting for the game engine to load all of the texture files and 3D models. This sub-system depends on the graphical engine and no attempt is made to make it platform independent. In the case of the level generation system being used as a development tool, the output can still be targeted at any game engine. If the level generator is to be distributed together with a game, the user interface can easily be separated from the rest of the system.
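As a minimal illustration of both the naming conventions of Section III.1 and the facade design described above, the following C++ sketch shows how the abstract algorithm interface and its reinforcement learning implementation might be declared. The methods setParameters(), train() and generate() appear later in Figure IV-6; the exact parameter names and member variables shown here are assumptions made for the example only.

// A minimal sketch illustrating the naming conventions and the abstract
// facade design (not the full class definitions used by the system).
class genBlueprint;            // abstract level blueprint (Chapter V)
class genParameters;           // level parameters object (Chapter VI)
class genTrainingStatistics;   // training statistics

// The 'gen' prefix marks classes belonging to the level generation system.
class genAlgorithm
{
public:
    genAlgorithm() : _pParams(0), _nLength(0) {}
    virtual ~genAlgorithm() {}

    virtual void setParameters( genParameters* pParams ) { _pParams = pParams; }
    virtual void train( genTrainingStatistics* pStatistics ) = 0;
    virtual void generate( genBlueprint* pBlueprint ) = 0;

protected:
    genParameters* _pParams;   // '_p' marks a private/protected pointer
    int            _nLength;   // '_n' marks an integer member
};

// The '_RL' suffix marks the implementation bound to a specific learning technique.
class genAlgorithm_RL : public genAlgorithm
{
public:
    virtual void train( genTrainingStatistics* pStatistics );   // SARSA training (Chapter IV)
    virtual void generate( genBlueprint* pBlueprint );          // Softmax generation phase
};

Because the rest of the system only ever holds a genAlgorithm pointer, a different generation technique can be introduced by adding another derived class without touching the calling code.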
III.3. Stages of Level Generation

The generation of levels can be regarded as a sequence of several simpler tasks, namely a training phase, a generation phase, post-processing and file output. Clearly, tasks such as determining which 3D model in the tileset would fit in a particular position are of a smaller level of granularity than determining the overall shape of the terrain (i.e. the level blueprint). The sequence of steps that leads to successful level generation is illustrated in Figure III-3.

The first stage in the process is to specify the parameters of the level, and this goal is achieved by developing the user interface system. In the second stage the generation system is trained to generate levels with the specified parameters. Once the system is trained it would be possible to generate multiple variants with similar properties. Implementing this functionality is the goal of the generation phase.

All of the tasks related to the visual representation of levels are grouped together in the post-processing phase. This phase transforms the level blueprint by matching and applying the context-replacement rules specified in external files. In the case of changing the tileset only these files would have to be modified and no changes to the source code of the level generator would be required. Context-matchers implement several different graphical tasks, such as the removal of redundant level elements, terrain smoothing and bordering. The last stage in the level generation process is output to a specific file format. In the current implementation of the system there is only one supported file format, but this can be changed easily by implementing a new derived class of the abstract level blueprint.

Figure III-3. Stages of Level Generation. Level Parameters: input parameters specified by the user. Trained System: the generation policy is implicitly specified by a learned value-function. Level Blueprint: a level with no associated graphical information. Post-Processed Blueprint: abstract terrain, danger and reward elements are replaced with indexes referring to the tileset. Level File: the level is stored in a file format supported by the game engine. Visualisation and Testing: the level file can be loaded and tested.

CHAPTER IV
LEVEL GENERATION AS A REINFORCEMENT LEARNING TASK

Up to this point the architectural design of the system was presented and level generation was subdivided into five smaller and easier tasks. In this chapter the focus is on two of them:
· Learning a generation policy that satisfies the specified parameters;
· Using the policy to generate a level blueprint.
It is assumed that the user has the means of supplying the necessary parameters and that the post-processing subsystem is capable of developing the blueprint into a level that can be loaded, rendered and played.

IV.1. Task Specification

IV.1.1. State Space Representation

Platform games have a level structure that can easily be represented with a discrete two-dimensional grid in which the cells represent terrain elements, treasure and dangers. It is also typical for platform game levels to have a much greater length than height, which is the source of the term "side-scrolling games". At any given time only a small "window" of the level is visible and this view scrolls as the player moves. The actual length-to-height ratio of the level by far exceeds that of any monitor. While the height is usually in the small range [20;100] cells, the length can be thousands of cells. These observations are relevant to the goal of representing level generation as a sequential action-taking process. Because of the much greater length of levels it is convenient to divide them into vertical slices and perform incremental generation slice by slice.
For now we will assume that a slice could be a single action of the building agent, as illustrated in Figure IV-1, although it will soon become clear why this idea is an oversimplification.

Figure IV-1. Platform Game Level Generated as a Sequence of "Slices". The level length is in the range [100; 1000] cells; slices 1 to 5 form the history of previous actions and slice 6 is the current action.

In order to specify level generation as a Markov Decision Process (MDP), it is also necessary to decide what will constitute a "state" of the environment that the building agent can perceive. One possible approach would be to use the values in the last slice as a state representation for the next one to be generated. Unfortunately, this is not a valid representation. The terrain elements, dangers and treasure located in any given slice may or may not be accessible depending on the terrain in all preceding slices. For example, a very high platform would be accessible to the player only if there is a slightly lower platform in a preceding position. The player would be able to jump on the first platform and then the second one, but not directly to the second one.

In this project I develop a way of compactly representing these dependencies in a relatively small state representation. Unlike the method of using the last slice as a state indicator, this method accurately evaluates the traceability of cells. Traceability markers are updated at each step of the level generation procedure by running an incremental simulation of the player that is advanced one slice forward at every time step. Each marker is internally stored as an integer variable but it is visible to the agent as a Boolean value.

Level generation can now be presented as an MDP and, at least from a theoretical point of view, this should be sufficient to solve it with any of the value-based methods outlined in Chapter II. In practice, the size of the state space would be too large for any of these algorithms to achieve reasonable performance. Even though traceability markers are visible to the agent as Boolean variables, rather than by their full range of values, for an average level of height 20 there would be 2^20 possible combinations of markers. The number of possible actions would also be too large, because the building agent must discriminate between several types of objects (i.e. terrain, empty space, treasure or danger) and place them in different combinations.

Figure IV-2 illustrates this problem and one possible solution that is based on the level structure of platform games. It was discussed in the introduction to this thesis that platform games often have several pathways, or branches, running in parallel to one another. Most of the meaningful interaction between the player and the level occurs within the boundaries of the current branch, with only occasional opportunities for switching to a different one. It is therefore reasonable to assume that the building agent only needs the traceability markers of the current branch in order to generate it in a meaningful way. Traceability markers are still updated on a global scale (i.e. for the whole height of the level), which provides integration between the pathways. However, the decision making of the level building agent is based only on the much smaller local state, corresponding to a single branch of the level.

Figure IV-2. Global and local state. The traceability simulation runs in the global state but only the local state is visible to the building agent at any given time step. Each branch has a separate local state, consisting of 6 Boolean traceability markers plus a Boolean "pathway" flag, giving only 2^7 = 128 local states. The global state, by contrast, has a maximal number of branches of 3 and a branch height of 6, giving 3 x 6 = 18 traceability markers, each of which can take 5 internal values indicating jump height; even when the markers are converted to Boolean values there are 2^18 global states, so it is not practical to use the global state directly for reinforcement learning.
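As a rough illustration of why the local state stays tractable, the branch-local information could be packed as follows. This is a sketch, not the actual layout of the genState_RL class described later in this chapter; the structure and method names are assumptions.

#include <bitset>

// Illustrative sketch of a branch-local state: 6 Boolean traceability
// markers plus the Boolean "pathway" flag.
struct LocalState
{
    std::bitset<6> bTraceable;   // marker i is set if cell i of the branch is reachable
    bool           bPathway;     // set if the branch has ever been accessible

    // Pack the state into an index in [0, 127]; with 7 Boolean flags there
    // are only 2^7 = 128 distinct local states, small enough for a table.
    int index() const
    {
        return (int)bTraceable.to_ulong() | ( bPathway ? (1 << 6) : 0 );
    }
};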
All level branches are updated in a sequence, followed by a transition to the next position along the length of the level. In an early version of the level generation system the agent was also allowed to perceive the index of the current branch. Interestingly, this augmentation does not result in increased quality of levels but rather in some disappointing solutions exploiting the index. For instance, the lowest branch would often be completely blocked with obstacle elements so that the player cannot fall off the bottom of the level, whereas upper branches would be a very minimalistic sequence of empty and solid blocks. By contrast, if the index of the branch cannot be perceived the building agent is forced to discover a more general solution, aiming to simultaneously keep the current branch traceable and promote access to upper and lower branches.

IV.1.2. Designing the Reward Function

Level generation can be described as a maintenance task in which the goal is to generate a level that satisfies the user-specified parameters at all times. If the level diverges significantly from the specified parameters, a failure state is reached and the agent receives a penalty. Figure IV-3 shows the conditions that can trigger the fail-state. The goal of these checks is to ensure that there is always a way forward for the player and to keep the level accessible at all times. By checking the flags in the global state it is easy to verify that the level as a whole is traceable. The local state is augmented with a "pathway" flag that serves as memory for the building agent and indicates if at any time in the past the current branch has been accessed. This mechanism prevents the creation of dead-end branches.

Figure IV-3. Is the state a fail state? The state is a fail state if: there are no set traceability markers in the global state; or the pathway flag is set in the local state but there are no set traceability markers in the local state; or X > Length / 8.0 and there are fewer pathways than the desired number of branches. Otherwise the state is not a fail state.

The size of the fail-state penalty is proportional to the distance between the current position and the length of the level. Decisions that push the level into a fail-state at the very beginning receive a big penalty, whereas close to the end of the level the penalty is smaller. This proportionality can be expressed with the following equation:

R_Fail-State = X_current / Parameters::Length − 1    (Eq. IV-1)

In the current implementation of the level generation system the user can specify the branching factor (i.e. the number of parallel pathways), the desired length of the level and a "chaos" factor that controls the amount of noise introduced in the level. Apart from the penalty for reaching a fail-state, achieving the desired level of branching is also encouraged by rewarding an increase in the number of branches. This is a smaller task that is intended to improve the convergence of the learning algorithm.
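The checks of Figure IV-3 can be summarised in code roughly as follows. This is a sketch of the logic only, with hypothetical helper and parameter names; it is not the actual genWorldModel implementation.

// Sketch of the fail-state test of Figure IV-3 (illustrative only).
bool isFailState( bool bAnyGlobalMarkerSet,   // any set traceability marker in the global state?
                  bool bAnyLocalMarkerSet,    // any set traceability marker in the local state?
                  bool bPathwayFlag,          // has the current branch ever been accessible?
                  int  nX, int nLength,       // current position and level length
                  int  nPathways, int nDesiredBranches )
{
    // 1. The level as a whole is no longer traceable.
    if( !bAnyGlobalMarkerSet )
        return true;

    // 2. The current branch was accessible in the past but has become a dead end.
    if( bPathwayFlag && !bAnyLocalMarkerSet )
        return true;

    // 3. Past the first eighth of the level there are still fewer pathways than requested.
    if( nX > nLength / 8.0 && nPathways < nDesiredBranches )
        return true;

    return false;
}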
The penalties and rewards, as implemented in the current version of the system, are summarised in the table below.

Figure IV-4. Rewards and Penalties

Situation: The failure state is reached. Return a penalty proportional to the part of the level that was not generated. For example, for a level that is almost completed only a small penalty would be imposed, whereas a level that reaches the fail-state quickly is likely to be the result of worse choices and receives a larger penalty.
Reward: R_Fail-State = X_current / Parameters::Length − 1    (Eq. IV-1)

Situation: Reward branching on every time step. This promotes branching and also discourages the construction of dead-ends.
Reward: R_Branching = (N_branches − 1) / Parameters::Length    (Eq. IV-2)

Situation: Add a small reward proportional to the amount of treasure and dangers placed in the level.
Reward: R_Treasure-Traps = 0.1 × (N_Danger + N_Treasure) / Parameters::Length    (Eq. IV-3)

Placing danger and treasure elements also results in a small reward for the agent, but currently there is no explicit parametric control over this aspect of level generation. This part of the system could be extended by keeping track of the cumulative reward generated in the level and normalising it in order to calculate a danger-per-length coefficient CD and a treasure-per-length coefficient CT. Both coefficients can be compared to a corresponding user-specified parameter and if they differ significantly this should result in a penalty for the agent.

IV.1.3. Actions of the Building Agent

The building agent chooses an action out of a small subset of all the possible combinations of cell values. Most of these combinations are meaningless and not likely to occur in any level. For instance, actions containing a lot of scattered bits of terrain or dangers and traps are very unlikely and also undesirable in levels. Figure IV-5 presents the action set currently used by the building agent. It consists of 9 basic actions that are extended to 21 with the addition of dangers and treasure.

Figure IV-5. Actions available to the building agent. The 9 basic actions can be combined with danger and treasure elements for a total of 21 different actions.

The width and height of actions are parameters of the level generation system that have a significant effect on the appearance of the level output and the speed of generation. In the initial design of the system these actions were intended to be only one cell wide. It was my reasoning that this would reduce the number of possible action types and thus facilitate the learning procedure. This implementation works, but it results in a very cluttered level, so it appears to be too fine-grained. After some experimentation with different sizes it was determined that a size of 6 x 3 is the best compromise between the number of different actions and their meaningfulness. This size also tends to produce less cluttered levels than a size of 6 x 1. The currently active action set is specified in an external file located in the "output\gen\actions\" sub-directory of the project. Using a different action set can produce a distinctly different output because it forces the building agent to discover alternative ways of solving the same problem.
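Putting the three reward terms of Figure IV-4 together, the reward on a single step might be computed along the following lines. This is a sketch under the assumption that the fail-state check of Figure IV-3 has already been evaluated, and that the non-failure terms are simply summed; the function and parameter names are illustrative, not the actual genWorldModel code.

// Illustrative combination of the reward terms of Figure IV-4.
double stepReward( bool bFailState, int nXcurrent, int nLength,
                   int nBranches, int nDanger, int nTreasure )
{
    if( bFailState )                                           // Eq. IV-1: penalty in [-1, 0],
        return (double)nXcurrent / nLength - 1.0;              // largest near the level start

    double rBranching = (double)(nBranches - 1) / nLength;     // Eq. IV-2: reward parallel pathways

    double rTreasureTraps =                                    // Eq. IV-3: small reward for placed
        0.1 * (double)(nDanger + nTreasure) / nLength;         // danger and treasure elements

    return rBranching + rTreasureTraps;
}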
IV.1.4. Choice of a Learning Algorithm

With regard to the choice of a learning algorithm, the principal alternatives are value-based methods, actor-critic learning and policy gradient search. There would be no benefit to using model-building approaches, such as Prioritized Sweeping, because the task is entirely simulated and scarcity of training experience is not an issue [KLM96]. Furthermore, using a value-based method would take advantage of the discrete and sequential representation of the task, whereas a policy gradient method would not, which motivates my decision in favour of algorithms that learn a value function.

It is also my reasoning that for the task of level generation an on-policy learning algorithm, such as SARSA, is a better motivated choice than off-policy learning. This is the case because there is a certain amount of desirable "chaos", or randomness, in levels and this makes the action-selection policy a very prominent factor in the dynamics of the environment. Level chaos is implemented by making the action selection policy less deterministic than in other reinforcement learning tasks. If the system were to use an off-policy learning algorithm the random action choices would be ignored, resulting in different dynamics of the environment. Q-learning would optimise for a chaos-free environment and this could hinder the performance of the system because it is not a realistic assumption.

Another contributing factor to the suitability of on-policy learning is that level generation is a maintenance task containing states that result in a very large penalty if not avoided. In combination with level chaos this can easily "push" the building agent into a fail-state from nearby neutral states. Because random action selection is ignored by the Q-learning algorithm, these dangerous states would remain neutral and would not be avoided by the building agent. By contrast, SARSA would make a more conservative choice and stay away from the fail-states. In [SB98] this effect is illustrated with a "Cliff Walking" example, where the goal is to learn how to make a trip between two points in a Gridworld without falling into the chasm between them. In this case Q-learning finds a solution that walks dangerously close to the edge of the cliff, whereas SARSA points to a longer but much safer route. The author of [Num05] also makes a point of this advantage of the SARSA learning algorithm for tasks that have the aforementioned properties.
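The difference between the two algorithms is visible directly in their standard update rules, as given in [SB98]: SARSA bootstraps from the action actually selected by the (noisy) policy, whereas Q-learning bootstraps from the greedy action and therefore ignores the random choices that implement level chaos.

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]    (SARSA, on-policy)
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (Q-learning, off-policy)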
IV.2. Task Implementation

IV.2.1. The "Generation Algorithm" Sub-system

The class hierarchy implementing the level generation reinforcement learning task is presented in Figure IV-6. These classes are grouped together in the "Generation Algorithm" subsystem, which is accessed by the rest of the system through the abstract interface genAlgorithm. The abstract interface provides methods for training the system and generating level variants, and also gives access to the genParameters object.

Figure IV-6. The "Generation Algorithm" Subsystem. The subsystem contains genAlgorithm (setParameters(param), train(statistics), generate(blueprint)), its derived class genAlgorithm_RL, genWorldModel (transitionFunction(), getReward()), genAction_RL (apply(blueprint)), genState_RL (build(markers)) and genTrainingStatistics; it interacts with RL-Toolbox and references genTraceabilityMarkers, genParameters and genBlueprint.

The reinforcement learning implementation of the training and generation methods is contained in the derived class genAlgorithm_RL. As already discussed, this implementation is based on the SARSA learning algorithm and the RL-Toolbox library. The implementation also involves several other classes. Local states visible to the agent are stored in the class genState_RL. Objects of this type can be created from the column of traceability markers represented by the genTraceabilityMarkers class, which is in practice the "global state" of the level. The level blueprint maintains a pointer to the genTraceabilityMarkers object, but functionally this class is more closely related to the 'Generation Algorithm' sub-system, as it plays an important role in the reinforcement learning process. Traceability markers are updated incrementally at each generative step. At the end of each training run both the traceability markers and the blueprint are cleared and another level generation attempt can commence. For classes external to the 'Generation Algorithm' sub-system this functionality is a black box, visible only through the genAlgorithm interface. The following sections provide some insight into the functionality of the classes in Figure IV-6, which collectively implement level generation as a reinforcement learning task.

IV.2.2. Class Implementations

IV.2.2.1. Abstract learning algorithm ( genAlgorithm )

This is an abstract base class that maintains a pointer to the level blueprint and the parameters object. It also defines the virtual methods genAlgorithm::train() and genAlgorithm::generate(), respectively for the training of the system and the generation of level variants.

IV.2.2.2. Reinforcement Learning algorithm ( genAlgorithm_RL )

This class is derived from genAlgorithm and provides the reinforcement learning implementation of the genAlgorithm::train() and genAlgorithm::generate() methods. Training starts with a setup of the reinforcement learning framework, consisting of the following steps:
· Create an instance of class genWorldModel, implementing the transition and reward functions;
· Create an agent object and register the action list specified by the genAction_RL class;
· Create an instance of the learning algorithm, implemented in the reinforcement learning library;
· Create an ε-greedy action selection policy and register it as the active controller of the agent.

After the initialization phase is completed the genAlgorithm_RL::train() method starts a sequence of training runs. For the duration of each run the ε parameter of the action selection strategy is adapted as a linear function of the current step. The initial value of epsilon and its attenuation rate are specified by the parameters "RL Epsilon" and "RL Epsilon-reduction" respectively. Additionally, the parameter "RL Max Runs" is used to limit the number of training runs.

During the generation phase, implemented in genAlgorithm_RL::generate(), the learned value function is frozen by deactivating the learning algorithm object. The generation phase also uses a different action selection strategy. The ε-greedy policy is replaced with a Softmax, which results in agent behaviour that chooses actions with probability proportional to their value. Optimal and near-optimal actions will be selected very often, whereas the probability of choosing a drastically undesirable action will be small.

It is important to point out that following a greedy policy during the generation phase is not a viable alternative, because some amount of randomness is desirable in the level output. Once the building agent manages to establish a certain rhythm, optimal with regard to the value function, the greedy policy does not provide an impulse for changing it. On the other hand, the player would feel bored after playing to the same rhythm for a certain amount of time.
Instead of trying to estimate player "boredom" and adapting to it, a much simpler solution is to allow for some randomness in the generation phase. Research in platform games often quotes the existence of a rhythm as a positive thing, but it is assumed that this rhythm will be alternated periodically in order to avoid boredom. In [SCW08] the authors make an analogy with music, where not only the rhythm but also the change of rhythm contributes to the enjoyment of the listener. The use of a Softmax action selection strategy provides the necessary impulse for changing the rhythm of the level without being as disruptive as an ε-greedy policy. In the RL-Toolbox library Softmax is implemented as a Gibbs probability distribution [Num05]. This distribution has a "greediness" parameter β that is inversely proportional to the "RL Level Chaos" parameter.

As a positive side effect of using a non-deterministic generation phase, once the system is trained it is capable of generating many "variants" of the same level. If the level generation system is to be used as an internal development tool, the human designer can generate several variants and select the one that best meets his subjective aesthetic criteria. In the case that the system is to be distributed together with a game, the generation of variants makes it possible to decouple the training and generation components. The player would be provided only with a library of learned value functions and the variant phase would be executed directly when the game needs a particular level. This would provide much greater certainty for the game developer as to the appearance of the level output.

IV.2.2.3. State of the building agent ( genState_RL )

This class represents the state of the level under construction that can be perceived by the building agent. It stores 6 traceability markers, converted to Boolean values, and a "pathway" flag. The flag is used as a simple memory register allowing the agent to remember if the current branch has ever been accessible to the player. If this is the case and the branch suddenly ceases to be traceable, that can result in the creation of a dead-end. The reward function accounts for dead-ends, but in order for the agent to learn how to avoid them it is necessary to have this perception of the past.

IV.2.2.4. Actions of the building agent ( genAction_RL )

The class genAction_RL stores a template of an action that can be applied at any position in the level blueprint. Templates can be loaded from an external file located in the project directory under 'output\gen\actions\' or they can be created manually. As part of the setup of the reinforcement learning framework, implemented in genAlgorithm_RL::train(), the list of actions that will be used is loaded from a file in this directory. Actions of the building agent are identified by their index in this list. When the agent invokes a particular action the method genAction_RL::apply(X, Y, blueprint) is activated with parameters referring to the current position in the blueprint. This method copies the content of the action object onto the blueprint. There are two special actions, applied at the start and end of levels. The start-of-level template is applied at the specified start position of the player, whereas the end position is determined automatically and merged with the terrain at the end of the level.
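The copying step performed by genAction_RL::apply() can be pictured roughly as follows. This sketch assumes the template is stored as a flat vector of abstract cell values and relies on the genBlueprint::set(X, Y, value) accessor listed later in Figure V-1; the actual storage format and interface may differ.

#include <vector>

class genBlueprint;   // provides set(X, Y, value), see Figure V-1

// Sketch of applying an action template at the current blueprint position.
void applyTemplate( genBlueprint& blueprint, const std::vector<int>& vTemplate,
                    int nWidth, int nHeight, int nX, int nY )
{
    for( int dy = 0; dy < nHeight; ++dy )
        for( int dx = 0; dx < nWidth; ++dx )
            // Copy one abstract cell (terrain, empty space, danger or treasure).
            blueprint.set( nX + dx, nY + dy, vTemplate[ dy * nWidth + dx ] );
}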
IV.2.2.5. Transition and reward functions ( genWorldModel )

The class genWorldModel implements both the transition and the reward function that define the dynamics of the environment. In order to achieve this it inherits the RL-Toolbox classes CTransitionFunction and CRewardFunction and implements the virtual methods genWorldModel::transitionFunction(state, action, new-state) and genWorldModel::getReward(state).

Figure IV-7 presents pseudo-code of the environment transition function. This method advances each branch of the level in parallel and keeps track of the current position in the blueprint along the X axis. The exact positioning in the level is not visible to the agent, as that would result in a huge state space and a lack of generality in the learned policy. Instead, the local state is built on the basis of the traceability markers and the agent must discover for itself the dynamics of these changes.

Figure IV-7. Pseudo-code of the Transition Function

Extract currentState
currentAction.apply( blueprint, currentX, currentBranch )
currentBranch++
If currentBranch > maxBranches then
{
    currentBranch = 0
    currentX++
    blueprint.traceabilityUpdate()
}
If currentX > maxX then
    endAction.apply( currentX, detectEndY() )
Else
{
    currentState.build( blueprint.traceabilityGet(), currentX, currentBranch )
    Commit currentState
}

The reward function is a straightforward implementation of equations Eq. IV-1, Eq. IV-2 and Eq. IV-3 discussed earlier in this chapter. The class also implements a genWorldModel::reset() method that returns the building agent to the beginning of the level and clears the blueprint and the traceability markers.

IV.2.3. Traceability Markers

Traceability markers, implemented in the class genTraceabilityMarkers, are a discrete and incremental version of the level validation method outlined in [CM06]. It can be judged whether or not certain parts of the level are accessible by executing a simulation of the player that moves along platforms and performs all possible jumps. The physics of platform games are not entirely true to the physics of the real world because they allow the player to change direction in mid-air. This results in an "accessibility window" corresponding to the part of the level below the ballistic curve, rather than only the curve itself. Most platform games use this model of altered physics in order to increase the amount of control that can be exerted on the player character.

By contrast with the player simulation described in [CM06], the method I propose does not re-evaluate the traceability of the whole level at each step in the generation procedure. Instead, as level generation progresses a column of traceability markers moves along with the building agent and gradually "scans" the whole length of the level. Figure IV-8 presents the history of traceability markers for a sample level blueprint.

Figure IV-8. Sample level blueprint showing the traceability markers

For every cell of the blueprint an integer value is stored, indicating the distance that the player can move upward from the current position. When this distance becomes 0 no further movement upward is possible, and if the value becomes negative the corresponding cell is marked as inaccessible. The algorithm can work equally well with real-valued traceability markers, which improves the accuracy of the prediction but reduces its speed. For the purposes of the current project it was determined that integer markers provide sufficient accuracy and using real values would only increase the training time.
IV.2.3.1. Updating the traceability markers

Given a column of traceability markers, the next column can be determined unambiguously by performing the transformations on the marker values illustrated in Figure IV-9. This figure shows pseudo-code of the marker update algorithm, as implemented in the constructor of the class genTraceabilityMarkers. With the exception of the first column of markers, objects of this type are constructed only by transforming the previously existing markers.

Figure IV-9. Updating the traceability markers. For every step along the X axis, the traceability markers that make up the Global State are updated. The presented pseudo-code for doTrajectoryStep(value) corresponds to a linear approximation of the jump trajectory. This function can be replaced with a more sophisticated ballistic simulation of the player.

Set all traceability markers to UNACCESSIBLE
Set X = Xcurrent
For every Y < Height
    If blueprint[X, Y] is not SOLID and old_markers[Y] is ACCESSIBLE then
    {
        If blueprint[X, Y+1] is SOLID then
            updateMarker( Y, DYmax )
        Else
        {
            updateMarker( Y, doTrajectoryStep( old_markers[Y] ) )
            propagate( Y )
        }
    }

updateMarker( Y, value )
    marker[Y] = MAX( marker[Y], value )

doTrajectoryStep( value )
    if value == 0 return UNACCESSIBLE else return value - 1

propagate( Ystart )
{
    Y = Ystart
    While blueprint[X, Y] is not SOLID
        updateMarker( Y, doTrajectoryStep( markers[Y] ) ); Y--
    Y = Ystart
    While blueprint[X, Y] is not SOLID
        updateMarker( Y, 0 ); Y++
}

The transformation starts by blocking all markers by default. Next, for every marker not blocked in the blueprint the status of the surrounding markers is evaluated. If there is solid terrain below the currently evaluated marker and the old marker is accessible, this is a possible starting point of a jump. Markers of this type are set to the maximal jump height DYmax, which is a parameter of the level generator. In the case of a marker that does not initiate a new jump, the old value of the marker is transformed by the function doTrajectoryStep() and it is propagated upwards and downwards in the column of markers. In order to avoid conflicts a value is updated only if it is greater than the current value. To illustrate this, consider the following example. Any position in the level could be accessible both as the starting point of a jump (M = DYmax) and as the landing site of a falling player (M = 0). Clearly, the greater value should take precedence, because the possibility of landing on a platform does not exclude the possibility of starting a new jump from the same platform.

IV.2.3.2. Jump trajectory

The trajectory that the simulated player follows during a jump is encoded in the function doTrajectoryStep(value), which under the current implementation of the algorithm only decreases the value of the marker by one. As illustrated in Figure IV-10 this corresponds to approximating the ballistic curve with a line. The cells included under the linear trajectory are only a subset of the cells that a ballistic curve would include. Because decisions are made on the basis of a pessimistic scenario, and not an overly optimistic one, the level could be "more traceable" than predicted but not less so. This ensures that if a level is valid with respect to the traceability markers it will also be valid in the game engine.

Figure IV-10. Simplification of Jump Trajectory. The ballistic curve (with maximal extents DXmax and DYmax) is approximated with a simpler linear trajectory. The shaded area beneath the curve represents cells of the blueprint that are flagged as traceable. Because the linear trajectory covers a smaller area, it can give only a pessimistic scenario for the traceability of the level.
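Written out in C++, the trajectory step of Figure IV-9 is a one-line transformation; only the sentinel value chosen for inaccessible cells is an assumption of this sketch.

// Linear approximation used by the current implementation (Figure IV-9):
// each horizontal step consumes one unit of the remaining upward allowance.
// A ballistic refinement could instead subtract a per-step drop derived from
// the jump velocity and gravity, at the cost of slower marker updates.
const int UNACCESSIBLE = -1;   // sentinel for inaccessible cells (exact value is an assumption)

int doTrajectoryStep( int nValue )
{
    if( nValue == 0 )
        return UNACCESSIBLE;
    return nValue - 1;
}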
IV.2.3.3. Representation in memory

Traceability markers are represented as a list structure and a pointer to the rightmost column of markers is stored in the level blueprint. The last genTraceabilityMarkers object is all that is necessary for performing reinforcement learning, but in order to visualise and debug the markers it is convenient to preserve all of them. The post-processing phase, discussed in the next chapter, also uses the full history of markers in order to remove redundancies in the level.

Each time a new column of traceability markers needs to be created, a genTraceabilityMarkers object is constructed from the markers currently stored in the level blueprint. The constructor transforms the old markers and stores an internal pointer to them. The level blueprint no longer stores this pointer, as it is replaced by the newly created genTraceabilityMarkers object. Figure IV-12 illustrates this interaction between the genBlueprint and genTraceabilityMarkers classes.

Figure IV-12. Traceability Markers and the Blueprint. The blueprint object (genBlueprint, with member _pTM and methods traceabilityUpdate() and traceabilityGet()) stores a pointer only to the rightmost column of traceability markers. Every time the traceabilityUpdate() method is invoked, it creates a new genTraceabilityMarkers object (with member _pPrevious and a constructor taking the old markers and the blueprint) from the previous one.
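A sketch of the interaction in Figure IV-12 is shown below. The class and member names follow the figure; the constructor internals, the constructor signature details and the memory management are left out and are assumptions of the sketch.

class genBlueprint;   // forward declaration

class genTraceabilityMarkers
{
public:
    // Builds the next column of markers by transforming the previous one
    // (Figure IV-9) and remembers it so the full history can be visualised.
    genTraceabilityMarkers( genTraceabilityMarkers* pOld, const genBlueprint* pBlueprint )
        : _pPrevious( pOld ) { /* transform the old markers here */ }
private:
    genTraceabilityMarkers* _pPrevious;
};

class genBlueprint
{
public:
    genBlueprint() : _pTM(0) {}
    void traceabilityUpdate()
    {
        // The blueprint only ever stores a pointer to the rightmost column.
        _pTM = new genTraceabilityMarkers( _pTM, this );
    }
    genTraceabilityMarkers* traceabilityGet() const { return _pTM; }
private:
    genTraceabilityMarkers* _pTM;
};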
CHAPTER V
POST-PROCESSING

The post-processing phase of level generation takes the level blueprint as input and transforms it in order to implement the following tasks:
· Removal of redundancies. The output of the generation phase is a valid blueprint, traceable from the start to the end. However, there is no way to avoid the occasional placement of isolated terrain elements that are not accessible in any way. These redundancies should not be present in the final output.
· Terrain smoothing. Scattered terrain elements should be merged where appropriate and any gaps in the terrain that serve no functional purpose should be filled in. The task of terrain smoothing must be implemented without changing the traceability of the level, otherwise it could be invalidated.
· Terrain bordering. The contour of the terrain must be defined by choosing appropriate tileset indexes, taking into consideration the surrounding tiles. The terrain around lava traps should be modified in order to accommodate the 3D model of this type of obstacle. Other decorative elements are also placed in this step.

After training the system and generating a level variant, the blueprint contains only an abstract specification of the level. The areas that contain terrain elements, dangers and treasure are marked but no specific indexes in the tileset are assigned. Furthermore, the aesthetic quality of the level can be improved by making some transformations of the terrain. In this project I develop a system of context-matching and transformation rules that implement all of the aforementioned tasks. The level generation system loads the context-matching rules from external files with the extension '.cm'. All of the tileset-specific information is contained within these files.

V.1. The "Blueprint" and "Post-processing" Sub-systems

The architecture of the 'Level Blueprint' subsystem is presented in Figure V-1. The purpose of the class genBlueprint, which takes a prominent position in the sub-system, is to store a representation of the blueprint in memory and to provide access to it. This class also maintains a pointer to the parameters object and a list of traceability markers updated during level generation.

Figure V-1. The "Blueprint" and "Post-processing" subsystems. The Level Blueprint subsystem contains genBlueprint (get(X,Y), set(X,Y,value), pp()) and its derived class genBlueprint_CB (save(file), draw()) targeting the 'Cubic' 3D Engine, together with references to genParameters and genTraceabilityMarkers. The Post-processing subsystem contains genContextMatcher (load(file), apply(blueprint)), genContext (match(X,Y,blueprint)) and genWildcard (match(value), symbol()).

The class genBlueprint_CB is an implementation of the blueprint class capable of producing file output to the specific file format of the game engine. If the level generation system is to be extended to work with a different game engine, it will be necessary to implement another class inheriting the genBlueprint functionality and implementing the virtual method genBlueprint::save(). Except for the genBlueprint_CB class, no other part of the level generation system makes a reference to the file format used to store levels. The rest of the 'Blueprint' subsystem is a group of classes implementing the tasks of level post-processing. This part of the class hierarchy is activated by a call to the genBlueprint::pp() method.

V.2. Implementation of Post-processing

The initialization of the post-processing subsystem is performed in the constructor of the genBlueprint class by loading a list of '.cm' files. The constructor creates two vectors of objects relevant to the task of post-processing. The first vector, stored as the member variable _vWildcards, is a list of wildcard symbols used in the context-matching rules. The use of wildcards greatly reduces the number of context-matching rules and simplifies their design. The second vector represents the context matchers and is stored in the _vContextMatchers member variable. Invoking the genBlueprint::pp() method applies all of the registered context matchers to the level blueprint.

V.2.1. Wildcards

Wildcards, as implemented by the class genWildcard, are a convenient tool that allows for a more abstract specification of context-matching rules. It is often the case that some values of the context are irrelevant, or that they matter only for as long as they belong to a certain subset of the tileset. Each wildcard is specified as a symbol followed by the tileset ranges that match this symbol. Figure V-2 lists the most important wildcards used in context-matching rules. The list of wildcards is user specified and for a different tileset it may contain a completely different set of wildcards.

Figure V-2. Wildcards used in the post-processing phase. The list of wildcards is loaded from the external file 'gen\post-processing\post-processing.pp', allowing for an easy transition to a different tileset. Only the T and N wildcards have built-in functionality.

Symbol   Interpretation   Matching ranges
*   Match any element.   [0;9999]
@   Match only elements representing 'empty space'. This category includes all elements that do not stop the player character from passing through, such as decorative elements, challenge and reward elements.   0, [24;38], [55;57], [2000;9999]
#   Match only elements representing an obstacle to the movement of the player character.   [1;23], [39;54], [58;9999]
C   Match only challenge (danger) elements.   [2000;2999]
R   Match only reward (treasure) elements.   [3000;3999]
$   Match any challenge or reward element.   [2000;3999]
T   Match any '@' element that has a set traceability flag in the traceability markers.   like '@'
N   The opposite of the 'T' wildcard.   like '@'

In order to match a wildcard both the symbol and the range must match. In addition to this requirement the two built-in wildcards 'T' and 'N' also make a reference to the traceability markers stored by the blueprint. These built-in wildcards are necessary for implementing the removal of terrain redundancies.
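The range test performed by a wildcard can be sketched as follows. The method names match(value) and symbol() come from Figure V-1, but the range storage and the addRange() helper are assumptions; the built-in behaviour of the 'T' and 'N' wildcards, which also consult the traceability markers, is not shown.

#include <utility>
#include <vector>

// Sketch of a wildcard: a symbol plus the tileset ranges it matches (Figure V-2).
class genWildcard
{
public:
    genWildcard( char cSymbol ) : _cSymbol( cSymbol ) {}

    void addRange( int nFirst, int nLast ) { _vRanges.push_back( std::make_pair( nFirst, nLast ) ); }
    char symbol() const                    { return _cSymbol; }

    // True if the given tileset index falls inside any of the wildcard's ranges.
    bool match( int nTilesetIndex ) const
    {
        for( size_t i = 0; i < _vRanges.size(); ++i )
            if( nTilesetIndex >= _vRanges[i].first && nTilesetIndex <= _vRanges[i].second )
                return true;
        return false;
    }

private:
    char _cSymbol;
    std::vector< std::pair<int,int> > _vRanges;
};

// Example: the '$' wildcard matches any challenge or reward element.
//   genWildcard dollar( '$' );
//   dollar.addRange( 2000, 3999 );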
V.2.2. Context Matchers

The class genContext implements the functionality of a single context-matching rule. Contexts are represented as a 3 x 3 matrix, consisting of the cell that is subjected to transformation and the 8 cells that surround it. Objects of this type can be constructed in two ways:
· Constructed from a transformation rule. In this case the context matrix can contain a mixture of tileset indexes and wildcard symbols, as specified in the '.cm' file. Rules of this type also include a target value to be substituted in the case of a successful match.
· Constructed from a location in the level blueprint. In this case the context does not contain any wildcards and the target value is ignored. During the post-processing phase a genContext object is constructed for each cell of the blueprint.

Each of the blueprint contexts is matched against all of the transformation rules. This comparison is implemented in the genContext::match(context) method, returning a Boolean value. In order to separate the post-processing functionality from the rest of the level blueprint code, context-matchers are encapsulated in the class genContextMatchers and not directly in the level blueprint. The functionality of this class is concentrated mainly in the method genContextMatchers::apply(blueprint). It implements a loop that builds the context of each cell in the blueprint and tries to match it against all of the registered genContext rules. In the case of a match the target value in the rule is transferred to the blueprint. The same class also has a genContextMatchers::load() method invoked from the constructor of the level blueprint.

V.2.2.1. Removing redundancies, smoothing and bordering

It is perhaps easier to illustrate the functionality of context-matchers with a specific example. Figure V-3 shows the context-matchers that implement the removal of redundant terrain elements. This task requires only two rules, applied recursively to the blueprint. The first rule matches any terrain element (i.e. the '#' wildcard) that has empty non-traceable space above it. The rest of the context is ignored, as signified by the '*' wildcard. The zero to the right of the context indicates a replacement with the nil tileset index, which corresponds to empty space. This rule "eats away" the surface of any isolated terrain elements not accessible to the player. The second rule functions in a similar way, this time matching any danger or treasure element that is not accessible to the player. These elements serve no functional purpose and look rather odd in the final output, so it is preferable that they are removed.

Figure V-3. Removing Redundancies. These two context-matching rules are used for removing redundancies in the generated level. The 'N' wildcard matches any non-traceable cells in the blueprint. The '$' wildcard matches a treasure or trap element. The '*' wildcard matches any element type.

[FLAGS]
recursive: true
recursion-depth: -1
[CONTEXT-MATCHERS]
*N*
*#*
*** : 0

*N*
*$*
*N* : 0
[END]
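The matching loop in genContextMatchers::apply() can be pictured roughly as follows. The helper signatures, the first-match-wins policy and the handling of the recursion flags are assumptions of this sketch, not the actual implementation.

#include <vector>

class genBlueprint;   // provides get/set accessors, see Figure V-1
class genContext;     // one 3 x 3 rule or blueprint context, with match() and a target value

// Sketch of the post-processing loop: build the 3 x 3 context of every cell
// and substitute the target value of a matching rule.
void applyRules( genBlueprint& blueprint, const std::vector<genContext>& vRules,
                 int nWidth, int nHeight )
{
    for( int y = 0; y < nHeight; ++y )
        for( int x = 0; x < nWidth; ++x )
        {
            genContext cell( x, y, blueprint );          // context built from the blueprint
            for( size_t r = 0; r < vRules.size(); ++r )
                if( vRules[r].match( cell ) )            // wildcard-aware comparison
                {
                    blueprint.set( x, y, vRules[r].target() );
                    break;
                }
        }
}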
Not all of the post-processing tasks can be implemented with such a small number of transformation rules. There are currently 18 rules that implement terrain smoothing and 34 rules that implement bordering (extracts are presented in Figures V-4 and V-5). The large number of bordering context matchers is due to the fact that each one of them must handle a configuration of terrain elements that corresponds to a different 3D model and a different index in the tileset. This number would be unmanageably larger if it was not for the use of wildcards. Smoothing context matchers work with only two tileset indexes and add or remove bits of the terrain that would create an unpleasantly rough appearance or the sense of a cluttered level. These rules are applied recursively, so with careful design it is possible to affect larger areas than the explicitly specified 3 x 3 context.

Figure V-4. Terrain smoothing (an extract). Occasional gaps in the terrain and unnaturally looking configurations are transformed by these "smoothing" context matchers.

Figure V-5. Bordering context matchers (an extract). These transformation rules introduce the tileset indexes that build up the contour of the terrain. Each index in the tileset corresponds to a 3D model designed for the specific context.

V.3. Prepared Graphical Elements

The graphical engine used by the level generation system had a very small tileset which was not capable of visualising all possible outputs of the level generation system. To that end it was necessary to create a new tileset implementing terrain border 3D models, as well as the models for lava traps and a single monster type. Developing this tileset was the only way to ensure that the post-processing system performs any meaningful work, and it was also a necessity for debugging the whole level generation system.

Most of the elements of the tileset correspond to 3D models that are used to build a smooth contour of the terrain when placed at appropriate locations. Figure V-6 illustrates this with a fragment of terrain that appears to be a continuous curve but internally is represented as a regular grid. The grid is populated with 3D models matched to the context of the surrounding cells. There are about 24 contour elements in the tileset but due to symmetry only half that number of 3D models were created. The tileset also contains some decorative elements and variants of the contour tiles.

Figure V-6. Contour tiles joined together

Because of the time constraints for the development of this project there is only one monster type (Figure V-8) and one type of trap (Figure V-7). The type of danger element to be inserted in the level is determined automatically depending on its position.

Figure V-7. Lava trap
Figure V-8. Shooting adversary
Lava traps are placed when a danger element specified in the blueprint comes into contact with the level terrain. Dangers surrounded by empty space are converted to the shooting monster type. Lava traps can also be stretched horizontally, and this requires some additional post-processing in order to detect the position of the trap edges.

V.4. Output File Format

File output is implemented in the method genBlueprint_CB::save(), invoked after post-processing of the level blueprint. Figure V-9 shows a sample level file generated by this method. The file format starts with a 'general' section that specifies the name of the level and the tileset file to be used. This is followed by a list of level 'segments', each one representing a 24 x 10 matrix of tileset indexes. For performance reasons it is convenient for the graphical engine to represent levels as a list of equally sized segments, and the output method must conform to this format. Following the segments is a list of objects, each one specified by its position and a textual identifier of its type. Some object types may require additional parameters. This textual representation is not a very efficient one, considering the occurrence rate of the nil element, but the file format was designed primarily with readability in mind. Having a separate list of objects, instead of specifying them directly as a part of the segment list, is also due to the requirement for readability and simple means of editing.

Figure V-9. Sample output to a .lvl file (an extract)

[GENERAL]
name: 'Generated Level'
tileset: 'world-1.ts'
segment-count: 10
[SEGMENT]
use-litter-pools: 0
(a 24 x 10 matrix of tileset indexes follows)
[SEGMENT]
...
[Object]
instance-of: 'player'
position: 0 [0, 11, 0]
[Object]
instance-of: 'sentinel'
position: 0 [5, 12, 0]
...
[END]

V.4.1. Extending the system to other file formats

In order to implement file output to another format a new class must be derived from genBlueprint, overriding the virtual method genBlueprint::save(). It would also be necessary to provide a new implementation of the bordering context matchers, as they depend completely on the contents of the tileset. It is unlikely that any changes would be required in the smoothing and redundancy removal context matchers.
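Extending the output stage therefore amounts to something like the following sketch. "XYZ" stands for a hypothetical target engine and the save() signature shown here (taking a file name) is an assumption; the real interface may differ.

#include <fstream>
#include <string>

class genBlueprint;   // abstract blueprint with a virtual save() method

// Sketch of adding a new output format: derive from the abstract blueprint and
// override the virtual save() method, in the same way genBlueprint_CB does for
// the 'Cubic' engine format.
class genBlueprint_XYZ : public genBlueprint
{
public:
    virtual void save( const std::string& sFile )
    {
        std::ofstream out( sFile.c_str() );
        // Write the 'general' section, the segments and the object list in the
        // format expected by the new engine.
    }
};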
CHAPTER VI
GRAPHICAL USER INTERFACE

In order to be a complete and useful tool the level generation system must be provided with the user interface necessary for specifying level parameters and inspecting the output of the system. Figure VI-1 presents the layout of the graphical user interface developed as part of this project.

Figure VI-1. Layout of the User Interface. At the top of the screen is located the "Training History" window, showing the cumulative reward for each run and each training session. The user is also presented with two buttons for scrolling the blueprint view at the left and right edge of the screen, as well as a progress indicator at the bottom.

The user interface is rendered by the same graphical engine that is used for testing generated levels. An alternative solution would be to use the Windows API directly or through a library such as MFC. This option was ruled out because of the resulting dependence on the Windows operating system and the unnecessary complication of the user interface subsystem. The small set of user interface controls that was implemented for the project consists of buttons, sliders, an edit box control and a "container" class that implements the functionality of a window and a loadable dialog box. This class hierarchy provides sufficient functionality for the purposes of the level generation system.

VI.1. The Level Parameters Object

Level generation parameters are stored in the genParameters class, which is derived from the STL class for an associative array, std::map. This class is the link between the different sub-systems of the level generator. The use of std::map as a base class allows the handling of parameters to be implemented in a very intuitive and flexible way. Parameters can be addressed by a string index corresponding to their name. For each registered parameter the minimal, maximal and default value can be specified. Once a pointer to the genParameters object is available, reading and writing parameters can be performed as demonstrated in Figure VI-2 below.

(*params)["Level Chaos"].dMin = 0.2;
(*params)["Level Chaos"].dMax = 0.8;
(*params)["Level Chaos"].dValue = 0.5;

Figure VI-2. Registering a new parameter. The minimum, maximum and default values are specified.

Any object can add a new parameter by referring to it for the first time, and it automatically becomes available to all interested readers of the parameter value. The classes genAlgorithm and genBlueprint, as well as their derived classes, can specify the parameters they need during the initialisation of the level generator. This is implemented as a call to the methods genAlgorithm::specify(params) and genBlueprint::specify(params). It is the task of the user interface subsystem to identify these parameters and provide the necessary controls for their modification. The user interface subsystem adds a slider control to the parameters dialog for each one of the specified parameters.
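The parameters object can therefore be pictured as a thin wrapper over std::map. The record fields dMin, dMax and dValue appear in Figure VI-2; the name of the value record itself is an assumption of this sketch.

#include <map>
#include <string>

// Sketch of the parameters object: an associative array from parameter name to
// a small record holding the minimal, maximal and current (default) value.
struct genParameterValue
{
    double dMin;
    double dMax;
    double dValue;
    genParameterValue() : dMin(0.0), dMax(1.0), dValue(0.0) {}
};

class genParameters : public std::map< std::string, genParameterValue >
{
    // operator[] inherited from std::map creates the record on first access,
    // which is why registering a parameter (Figure VI-2) needs no extra call.
};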
VI.2. Implemented User Interface Controls

The class hierarchy of user interface controls starts with the classes uiSkin and uiSkinObject (Figure VI-3). Creating an instance of the class uiSkin results in the loading of all texture files needed to draw the user interface. This class assigns a unique text identifier to every texture and colour variable presented in the skin file, so that these objects can be accessed easily by user interface controls. The rest of the user interface hierarchy is based on the class uiSkinObject, which stores a pointer to the uiSkin object.

Figure VI-3. Base Classes of the User Interface. The used graphical engine does not have its own hierarchy of user interface classes, but they are necessary for the level generation system. The hierarchy consists of uiSkin (loading the texture files through the 'Cubic' 3D Engine), uiSkinObject and its children uiPointer and uiWindow, and the uiWindow-based classes uiLabel, uiSlider, uiButton, uiProgress, uiEdit, uiContainer and uiDesktop.

The class uiPointer represents the mouse pointer. This class accesses the texture with identifier "TX_POINTER" and draws it at the current position of the cursor. The more important of the two direct children of uiSkinObject is the class uiWindow. It implements the interactive behaviour of a window and provides several virtual methods that can be overloaded by child classes. The method uiWindow::draw() draws the window at its current position and uiWindow::reply() is called when any mouse event occurs within the area of the window. In addition to these virtual functions, the class also implements several mouse event methods (e.g. onMouseMove(), onClick()) that can also be overridden.

The functionality of a dialog box is implemented by the class uiContainer, which is based on uiWindow and is capable of maintaining a list of child windows. The list is represented as an std::vector, and the uiContainer::draw() and uiContainer::reply() methods ensure all children are drawn at their appropriate positions and respond to mouse input. This class also introduces a new method, uiContainer::load(file), that automatically creates the child controls specified in a dialog template file. Dialog templates reside in the 'output\gui\dialogs\' directory of the project. The level generation system implements several dialogs that use this mechanism.

The classes uiLabel, uiEdit, uiButton, uiProgress and uiSlider implement the user interface controls that their respective names suggest. All of these classes inherit the uiWindow functionality and implement their specific function by overriding the methods uiWindow::draw(), uiWindow::reply() and the mouse event methods. The class uiDesktop is a special top-level container. Creating an instance of uiDesktop will automatically create the uiSkin and uiPointer objects. All other windows and containers should be inserted into a uiDesktop object, ensuring the existence of a single skin and mouse pointer per desktop. There is no restriction on the number of separate desktops that can be used, but the level generation system needs only one instance of this class.

VI.3. Level Generator Dialogs

The level generation system implements four main dialogs, all of them implemented as classes derived from uiContainer (Figure VI-4). Most of the user interface controls in the dialogs are created automatically from the corresponding template files, but in order to respond to user commands it is also necessary to derive a class and to override the uiContainer::onCommand() method. In the following sections I present the functionality implemented in each of these classes.

Figure VI-4. Dialogs Implemented for the Level Generation System: uiStatisticsWindow, uiParametersDialog, uiProgressDialog and uiCompletedDialog, all derived from uiContainer.

VI.3.1. Parameters Dialog ( uiParametersDialog )

This is the first dialog that is created during the initialization of the system. Although the parameters dialog is not visible when level generation starts, the object remains active in memory. This class creates all important objects of the level generation system, namely an instance of genParameters, genAlgorithm_RL and genBlueprint_CB.

Figure VI-5. Parameters Dialog. During initialization a slider control is added for each parameter contained in the genParameters object.

The constructor of the parameters dialog creates the level generation system objects and calls the methods genAlgorithm_RL::specify(parameters) and genBlueprint_CB::specify(parameters). After this step all necessary parameters of the level generation system are registered in the genParameters object. The next step is to enter a loop that enumerates the parameters and inserts a slider control for each one of them. The end result of this process is presented in Figure VI-5. Originally, the dialog template contains only the edit box controls and the push buttons.
VI.3. Level Generator Dialogs

The level generation system implements four main dialogs, all of them implemented as classes derived from uiContainer (Figure VI-4). Most of the user interface controls in the dialogs are created automatically from the corresponding template files, but in order to respond to user commands it is also necessary to derive a class and to override the uiContainer::onCommand() method. In the following sections I present the functionality implemented in each of these classes.

Figure VI-4. Dialogs Implemented for the Level Generation System (class diagram: uiParametersDialog, uiProgressDialog, uiStatisticsWindow and uiCompletedDialog, all derived from uiContainer)

VI.3.1. Parameters Dialog ( uiParametersDialog )

This is the first dialog that is created during the initialisation of the system. Although the parameters dialog is not visible while level generation runs, the object remains active in memory. This class creates all important objects of the level generation system, namely an instance of genParameters, genAlgorithm_RL and genBlueprint_CB.

During initialisation a slider control is added for each parameter contained in the genParameters object. The constructor of the parameters dialog creates the level generation system objects and calls the methods genAlgorithm_RL::specify(parameters) and genBlueprint_CB::specify(parameters). After this step all necessary parameters of the level generation system are registered in the genParameters object. The next step is to enter a loop that enumerates the parameters and inserts a slider control for each one of them. The end result of this process is presented in Figure VI-5. Originally, the dialog template contains only the edit box controls and the push buttons. After the specify methods are called, the dialog also includes sliders attached to the blueprint and generation algorithm parameters.

Figure VI-5. Parameters Dialog

VI.3.1.1. Generation Thread

When the user presses the “Generate” button, level generation starts in a separate thread. This is implemented in the class genGenerationThread, and the role of the parameters dialog is only to create an object of this type and supply it with the necessary pointers. Having two threads makes it possible to generate levels and respond to user commands simultaneously. It also helps to separate the graphics code from the core functionality of the level generation system. The two threads communicate by sharing the common genBlueprint, genAlgorithm and genParameters objects.

VI.3.2. Progress Dialog ( uiProgressDialog )

This class implements a progress tracking dialog and is also created when the level generation procedure starts. The dialog maintains a pointer to the level blueprint and an internal flag indicating the current phase of level generation (i.e. training, variant generation, post-processing, or completed). Depending on the current phase, the progress bar is updated accordingly.

VI.3.3. Training History Dialog ( uiStatisticsDialog )

This class is a statistics dialog that draws a graph of the cumulative reward for each training run. It also prints the number of successful runs and measures the training time. All previous sessions are recorded by the class, so it is easy to see the effect of different parameter settings on the performance of the reinforcement learning algorithm.

VI.3.4. Completion Dialog ( uiCompletedDialog )

This dialog is created at the end of the level generation procedure. As illustrated in Figure VI-6, it prompts the user to make a choice about the next action of the level generation system.

Figure VI-6. The User is Prompted to Test the Level

Once the level generation procedure is completed, the user can either test the level with the game engine, generate another variant (without re-training), or return to the parameters dialog and specify different settings. If the generated level is satisfactory, it can be tested by loading it into the game engine; this option is available through the “Test” button. Alternatively, if the training phase was completed in a satisfactory way but the user wants to try a different variant of the level, this can be done by pressing the button labelled “Another variant”. It clears the level blueprint and invokes the genAlgorithm::generate() method without a preceding call to genAlgorithm::train(). Generation of variants is usually a quick procedure, provided that the learned policy is good. The last option available to the user is to return to the parameters dialog. If the user chooses this option, all dialogs of the generation phase are hidden and the parameters dialog is presented again. This allows for re-training of the system and further experimentation with different parameter values.
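The sketch below summarises how the dialog buttons drive generation, tying together the worker thread of Section VI.3.1.1 and the “Another variant” behaviour of Section VI.3.4. The class and method names genGenerationThread, genAlgorithm::train(), genAlgorithm::generate() and genBlueprint follow the text; the use of std::thread and the exact signatures are assumptions made only for illustration.

#include <thread>

struct genBlueprint
{
    void clear() { /* reset the level cells before generating a new variant */ }
};

struct genAlgorithm
{
    void train()    { /* reinforcement learning training phase */ }
    bool generate() { /* build one level variant from the learned policy */ return true; }
};

// Worker thread body: training followed by variant generation, so that the
// GUI thread stays free to redraw dialogs and answer user commands.
void generationThread(genAlgorithm* algorithm, genBlueprint* blueprint)
{
    (void)blueprint;       // the blueprint is shared with the GUI thread for progress display
    algorithm->train();
    algorithm->generate();
    // post-processing and progress reporting would follow here
}

// "Generate" button: start the worker thread.
void onGenerate(genAlgorithm* algorithm, genBlueprint* blueprint)
{
    std::thread(generationThread, algorithm, blueprint).detach();
}

// "Another variant" button: reuse the learned policy, no re-training.
void onAnotherVariant(genAlgorithm* algorithm, genBlueprint* blueprint)
{
    blueprint->clear();
    algorithm->generate();
}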
CHAPTER VII. OUTPUT AND EVALUATION

This chapter presents the experiments that were performed on the level generation system in order to find an optimal parameter setup, together with a set of benchmarking tests designed to evaluate performance and scalability. Finding an optimal parameter configuration is an important prerequisite for the evaluation of the system. The optimisation starts with a “global search” in the parameter space that helps to identify areas of high performance. Due to constraints in computational power it is not possible to perform the global search with a fine-grained step, so a more accurate “local search” then focuses on the better parts of the joint parameter space. Within a small area of the parameter space it is also possible to draw meaningful conclusions on the basis of individual parameters, rather than the multidimensional parameter setup. At the end of this stage an optimal parameter configuration is identified and its convergence and generative performance metrics are reported. The evaluation of the system continues with a set of performance and scalability benchmarks measuring training time, variant generation performance, post-processing time, scalability with regard to level length and branching, as well as the effects of the “chaos” parameter.

Figure VII-1 shows sample output of the system. For each of the displayed levels, the upper image shows the level blueprint and the lower image is the post-processed level, as rendered by the graphical engine. The presented levels are of the same length l=100, but the system is capable of generating much longer levels with sizes in the range [50; 400] cells. Because the player normally sees only a “window” of the level that scrolls as he moves, presenting long levels in a printed document is difficult. Levels of greater size can be explored directly in the game engine integrated with the level generation system.

Figure VII-1. Examples of generated levels with branching factors b=1, b=2 and b=3

VII.1. Methodology

VII.1.1. Experimental Setup

All of the experiments presented in this chapter are implemented as methods of the class genGenerationThread. This includes three methods, described below.

The method genGenerationThread::autodetect() can reproduce the parameter optimisation tests. In the parameters object genParameters, the step sizes and ranges for all parameters of the reinforcement learning algorithm are specified. For a value-based learning algorithm this includes the following:

· Learning rate, α;
· Attenuation of the learning rate (specified as a percentage of α);
· Reward discount rate, γ;
· Parameter of the eligibility traces, λ;
· Random action selection probability, ε;
· Attenuation of the random action probability (specified as a percentage of ε);
· Maximal number of training runs, NRmax.

Automatic parameter detection is implemented as a loop that re-trains the system and explores all parameter combinations within the specified ranges and step sizes (a sketch of this loop is given below). Each trial is repeated N times, where the sample size N is a parameter varying between tests; sample sizes of 25, 35 and 45 were used, as specified for the particular test. It was the effort of the author to obtain as large a sample as possible, but in some of the tests larger values of N become computationally intractable. The method genGenerationThread::autodetect() prints its results in the training history window and displays an additional window showing the best parameter configuration discovered so far. For each sample the parameter vector <α, αmin, ε, εmin, γ, λ> and the pair <P, Perror> are recorded in the file “gs.txt” in the output directory of the project, where P is the performance measure of interest.
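A minimal sketch of the kind of nested loop genGenerationThread::autodetect() performs is shown here. The parameter names and the “gs.txt” output file come from the text; the ranges and step sizes are the global-search values quoted in Section VII.2.1, and the helper runTrainingSession() is a hypothetical stand-in for a complete training session, not a function of the actual project.

#include <cstdio>

// Hypothetical stand-in for one complete training session; it would run the
// reinforcement learning algorithm with the given parameters and return the
// measured performance P (for example the convergence measure PC).
double runTrainingSession(double alpha, double alphaMin, double epsilon,
                          double epsilonMin, double gamma, double lambda)
{
    return 0.0;   // placeholder
}

int main()
{
    const int N = 45;   // sample size used in the global search
    std::FILE* out = std::fopen("gs.txt", "w");

    // The small tolerance on each bound guards against floating-point drift.
    for (double alpha = 0.1; alpha <= 0.5 + 1e-9; alpha += 0.2)
    for (double alphaMin = 0.0; alphaMin <= 80.0 + 1e-9; alphaMin += 40.0)       // % of alpha
    for (double epsilon = 0.05; epsilon <= 0.5 + 1e-9; epsilon += 0.15)
    for (double epsilonMin = 0.0; epsilonMin <= 80.0 + 1e-9; epsilonMin += 40.0) // % of epsilon
    for (double gamma = 0.1; gamma <= 1.0 + 1e-9; gamma += 0.2)
    for (double lambda = 0.1; lambda <= 1.0 + 1e-9; lambda += 0.2)
    {
        // Repeat each configuration N times and average the performance.
        double sum = 0.0;
        for (int i = 0; i < N; ++i)
            sum += runTrainingSession(alpha, alphaMin, epsilon, epsilonMin,
                                      gamma, lambda);
        const double p = sum / N;

        // Record the parameter vector and the measured performance.
        std::fprintf(out, "%.2f %.0f %.2f %.0f %.2f %.2f %.4f\n",
                     alpha, alphaMin, epsilon, epsilonMin, gamma, lambda, p);
    }

    std::fclose(out);
    return 0;
}

The error term Perror recorded alongside P can be obtained from the same sample using the standard error definition given in Section VII.1.3.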
Another method, genGenerationThread::benchmark_ttrain(), implements the timing benchmarks. This type of test differs from the detection of optimal parameters in that it implements a loop over the level parameters, rather than the reinforcement learning parameters. Level parameters varied during benchmarks of this type may include any of the following:

· Level length, l ∈ [50; 500];
· Level branching factor, b ∈ {1; 2; 3};
· Level chaos, c = 1 − β, where β ∈ (0; 1) is the “greediness” parameter of the Softmax action selection policy. In the algorithmic implementation used, this corresponds to a Gibbs distribution with parameter β [Num05].

During the benchmarks the reinforcement learning parameters are set to the optimal values discovered in the optimisation phase. The benchmarking method also needs a way to measure time and its variation. I use the Windows API function GetTickCount(), which returns the number of milliseconds since the last reboot of the system. The resolution of this timer is the same as the resolution of the hardware system timer, which is adequate for the measurement of level generation times.

It was already discussed in Chapter IV that the learning algorithm used by the level generation system is SARSA. As a result of the time constraints of this project it was not possible to rigorously evaluate the performance of multiple algorithms, although an informal comparison with Q-Learning showed that SARSA generally performs better. Another interesting future development would be to compare the performance of value-based learning with a Genetic Algorithm implementation of evolutionary learning. Genetic algorithms have a naturally built-in randomness in the solution, which could be beneficial for this particular task.

VII.1.2. Performance Measurement

The tests presented here use two main performance measures. The first one is a measure of convergence and is defined as follows:

PC = NSuccessful / NLimit,   (Eq. VII-1)

where NSuccessful is the number of training runs resulting in a successful outcome (i.e. not a fail state) and NLimit is the maximal allowed number of successful runs. Although each training session has an upper limit of NRmax runs, once NLimit successful attempts are achieved the learning outcome is assumed to be successful and further training is not necessary. If the training session ends with 0 < NSuccessful < NLimit, this indicates partial success but no convergence, or late convergence. The value of the measure is always in the range PC ∈ [0; 1] and, given enough samples, it can be interpreted as a convergence probability.

The second measure corresponds to the probability of generating a correct level variant. It is specified by the following equations:

If PC > 0.2:  PG = 0.2 · PC + 0.8 · (1 − NAttempts / NGMax)
Otherwise:    PG = 0.2 · PC   (Eq. VII-2)

The generative performance PG is a compound measure of the convergence probability and the probability of generating a successful level variant on the first attempt. The probability of generating a successful variant is estimated from the number of attempts it takes to generate the level (NAttempts), divided by the upper limit for the number of attempts (NGMax). In the case of poor convergence (PC ≤ 0.2) the additional assumption is made that variant generation will not be successful for this policy, and PG is proportional only to PC. This approximation could introduce a minor error in the results (e.g. if a very bad policy succeeds in generating a valid level shortly before the value of NGMax is exceeded), but it speeds up testing considerably. Generating a level with a bad policy results in many unsuccessful attempts before the upper limit NGMax is exceeded, and adding this filtering condition removes the performance bottleneck.

If a greedy policy were used for level generation, the value of the generative performance would be PG = 1. In practice PG ∈ (0; 1), because levels are generated by following a Softmax action selection policy. Therefore, when optimising the parameters of the system the effect on PG should also be measured. For example, a parameter setup with a very low value of ε could result in acceptable convergence but still be rejected as incompatible with the generation phase.
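The two measures translate directly into code. The short sketch below mirrors Eq. VII-1 and Eq. VII-2; the function names are illustrative and not taken from the project sources.

// Convergence measure, Eq. VII-1: the fraction of training runs that reach
// a successful outcome, out of the maximal allowed number of successful runs.
double convergenceMeasure(int nSuccessful, int nLimit)
{
    return static_cast<double>(nSuccessful) / nLimit;
}

// Generative performance, Eq. VII-2: combines convergence with the estimated
// probability of producing a valid level variant on the first attempt.
double generativeMeasure(double pC, int nAttempts, int nGMax)
{
    if (pC <= 0.2)
        return 0.2 * pC;   // poor convergence: assume generation will also fail
    return 0.2 * pC + 0.8 * (1.0 - static_cast<double>(nAttempts) / nGMax);
}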
VII.1.3. Error Measurement

The values of PC, PG and the different timing measures are reported as the average value over a sample of size N ∈ {25; 35; 45}. The following definition of the standard error is applied:

SE = s / √N,   (Eq. VII-3)

where s is the sample standard deviation and N is the size of the sample. The error margins are indicated with error bars, or in some cases with a dashed line, covering a range of two standard errors around the measured value.
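A small helper corresponding to Eq. VII-3 is sketched below. It is illustrative rather than taken from the project, and it assumes the unbiased (N−1) estimate of the sample variance and at least two measurements per sample.

#include <cmath>
#include <vector>

// Standard error of a sample of measurements, SE = s / sqrt(N). The error
// bars in the figures span two standard errors around the sample mean.
double standardError(const std::vector<double>& sample)
{
    const double n = static_cast<double>(sample.size());   // assumes n >= 2

    double mean = 0.0;
    for (double x : sample) mean += x;
    mean /= n;

    double variance = 0.0;
    for (double x : sample) variance += (x - mean) * (x - mean);
    variance /= (n - 1.0);                 // unbiased sample variance

    return std::sqrt(variance) / std::sqrt(n);
}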
VII.2. Optimisation of Parameter Settings

VII.2.1. Global Search in the Parameter Space

Performing a fine-grained search in the parameter space is computationally tractable only for a small range of the parameter values, so in order to find a globally optimal solution it is necessary to implement a preliminary “global search” with large step sizes. It is assumed that the performance metrics outlined in the previous section change smoothly within a small range of the parameter values. In light of this, the global search should help to identify good areas in the joint space of parameters, and in a subsequent step these areas can be explored further.

The results of the global search are presented in Figure VII-2. In this test the sample size was set to N=45 and the maximal number of training runs to NRmax=100. At this stage only the convergence metric PC was recorded, because calculating the generation metric PG would only incur additional computational cost in a preliminary search. The test was performed for the following combinations of parameters:

· Learning rate, α ∈ [0.1; 0.5], step size 0.2;
· Attenuation of the learning rate, [0; 80]%, step size 40%;
· Reward discount rate, γ ∈ [0.1; 1], step size 0.2;
· Parameter of the eligibility traces, λ ∈ [0.1; 1], step size 0.2;
· Random action selection probability, ε ∈ [0.05; 0.5], step size 0.15;
· Attenuation of the random action selection, [0; 80]%, step size 40%.

Level length was set to l=100 and the maximal branching factor of b=3 was used. The small length is intended to avoid superfluous computation: longer levels do not present a greater challenge, but training takes longer because it is necessary to maintain the discovered behaviour over a longer period. The maximal branching factor presents the greatest difficulty for the building agent, so a value of b=3 is used in all optimisation tests.

Figure VII-2. Global Search in the Parameter Space (plots of the sampled values of α, αmin, ε, εmin, γ and λ, together with the resulting performance PC, which peaks at approximately 0.59)

In the graph for PC the dots correspond to samples of the performance measure and the lines surrounding them reflect the standard error confidence interval. It is immediately obvious that values ε > 0.2 inhibit performance regardless of the other parameter settings. Another fact revealed by the graph is that higher values of the learning rate α make the performance peak higher and concentrated towards lower values of ε. As evident from the confidence intervals, the small-scale variation is not noise but is caused by the γ and λ parameters changing with the highest frequency.

Figure VII-3 shows the six highest performing parameter combinations discovered during the global search and their corresponding γ values. There is a performance peak at γ=0.9, suggesting that this approximate value is good for the PC measure. The difference becomes even more clearly expressed when comparing the generative performance PG measured for the top six solutions (Fig. VII-4): higher values of gamma clearly outperform lower values during the generation phase.

Figure VII-3. Best parameter configurations after global search (PC of the top six configurations, plotted against their γ values of 0.25, 0.50 and 0.90)

Figure VII-4 also shows the difference in performance when using a Softmax and an ε-greedy policy to introduce level chaos. The Softmax test was performed with β=0.5, and for the ε-greedy policy the parameter was also set to ε=0.5. Although it is difficult to judge the equivalence of these parameter settings, they are both the minimal values that result in apparent level variety and acceptable output. For values γ > 0.5 the Softmax policy is less disruptive and results in a better value of the PG metric.

Figure VII-4. Influence of the gamma parameter on the generation of level variants (PG plotted against γ for the ε-greedy and Softmax policies)

VII.2.2. Local Search in the Parameter Space

During the local search the same experimental setup was used, except for smaller step sizes concentrated within a smaller range of the parameter values. This resulted in an improved parameter set with convergence measure PC=0.64. The corresponding parameter settings are <α=0.28, αmin=60%, ε=0.17, εmin=75%, γ=0.9, λ=1.0>. Within a small range it is also reasonable to assume the independence of the learning parameters and to analyse them as separate variables. Figures VII-5 and VII-6 show the correlation between performance, the ε and α parameters, and their attenuation rates.

Figure VII-5. Correlation between PC, epsilon and its attenuation (PC plotted against ε for attenuation rates of 45%, 75% and 100%)

Figure VII-6. Correlation between PC, alpha and its attenuation (PC plotted against α for attenuation rates of 40%, 60% and 80%)

In the first row of graphs it is clearly visible that performance peaks at an attenuation rate of 75% and a value of the epsilon parameter around ε=0.15. For higher and lower attenuation rates the graph lacks this distinct maximum and is more evenly distributed. The correlation between performance and the alpha parameter has a different character. Low attenuation rates appear to shift the peak towards lower values of α (PC=0.62 at <α=0.25; αmin=40%>) and higher attenuation rates towards higher values of α (PC=0.6 at <α=0.4; αmin=80%>). These two alternatives appear to be equally good, but attenuation rates around αmin=60% yield a worse maximal performance.
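The attenuation percentages above describe how far α and ε are allowed to decay during training. The exact schedule used by the project is not stated in this section, so the sketch below simply assumes a linear decay from the initial value down to the given percentage of it over the NRmax training runs; it is an illustration of the idea, not the project's implementation.

// Illustrative linear decay: the parameter falls from its initial value to
// (attenuationPercent / 100) of that value over nrMax training runs. The
// project's actual schedule may differ; this is only an assumption.
double attenuatedValue(double initial, double attenuationPercent,
                       int run, int nrMax)
{
    const double finalValue = initial * attenuationPercent / 100.0;
    const double t = static_cast<double>(run) / nrMax;   // training progress in [0; 1]
    return initial + (finalValue - initial) * t;
}

// Example: with epsilon = 0.17 attenuated to 75%,
//   attenuatedValue(0.17, 75.0, 0,   100) yields 0.17 at the first run and
//   attenuatedValue(0.17, 75.0, 100, 100) yields 0.1275 at the last run.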
VII.2.3. Results of Parameter Optimisation

The results of the parameter optimisation phase are summarised in Figure VII-7. As evident from the table, the global search value of PC (1) is improved in the phase of local search (2). The generation metric PG remains the same in both cases. Implementing a phase of local search that optimises PG instead of PC results in a slight improvement of the generation metric, but at the same time a decrease in convergence (3).

Figure VII-7. Results of parameter optimisation

(1) α = 0.3, α reduction = 60%, ε = 0.1, ε reduction = 70%, γ = 0.9, λ = 1.0:
    PC = 0.5; PG = 0.4 (Softmax, β = 0.5); PG = 0.25 (ε-greedy, ε = 0.5)
(2) α = 0.28, α reduction = 60%, ε = 0.17, ε reduction = 75%, γ = 0.90, λ = 1.0:
    PC = 0.64; PG = 0.4 (Softmax, β = 0.5); PG = 0.3 (ε-greedy, ε = 0.5)
(3) α = 0.29, α reduction = 60%, ε = 0.17, ε reduction = 75%, γ = 0.94, λ = 0.94:
    PC = 0.4; PG = 0.55 (Softmax, β = 0.5); PG = 0.25 (ε-greedy, ε = 0.5)

Because of the high amount of chaos introduced in the levels, it is difficult to achieve a value close to one for both of these measures. Nonetheless, as the next section demonstrates, these results are sufficiently good to ensure the quick generation of levels within some range of the parameters. It should also be noted that PC and PG were tested for the maximal value of the branching factor, b=3, which corresponds to the most challenging level generation task. In the case of b={1; 2} both measures are in the range [0.8; 1].

VII.3. Generator Benchmark

VII.3.1. Intrinsic Evaluation

In this section I present several performance and scalability benchmarks based on the optimal parameter configuration (Fig. VII-7, (2)). In most of the tests performance is measured as the time required for completing the training of the system, the generation of a level variant, or some other time measure. It is therefore essential to give an exact account of the hardware configuration and operating system used for performing the benchmarks. This information is available in Figure VII-8.

Figure VII-8. Test configuration
Processor: Intel Core 2 Duo, 2.2 GHz
System memory: 2 GB
Operating system: Windows XP, 64-bit edition

All of the presented benchmarks use a sample size of N=35, except the variant generation benchmark, where a smaller sample size of N=25 was used. This particular benchmark requires the generation of a large number of levels of increasing length and a re-training of the system for each generated variant; in this case using a larger sample was not possible.

VII.3.1.1. Benchmark: Training Time

The goal of this benchmark is to evaluate the time required for generating a level with a given length l ∈ [50; 500] and branching factor b ∈ {1; 2; 3}, and to determine how this performance scales. Figure VII-9 shows the results.

Figure VII-9. Training Time Benchmark (average training time in seconds against level length, for b=1, b=2 and b=3)

It can be argued that an increase in length does not increase the complexity of the reinforcement learning task but only requires the agent to maintain the discovered behaviour for a longer period of time. This is why the performance scales linearly with regard to level length. The more interesting effect is that of increasing the branching factor. For the maximal level length of 500 and a branching factor of b=3 the average training time is t=4.1 seconds, as opposed to only 0.6 seconds in the case of b=1. This can be explained by the fact that higher branching requires the building agent to solve a more complex problem: it is necessary to learn how to create a longer “ladder” out of terrain elements in order to make the upper branches accessible to the player. This difficulty is concentrated in the beginning of the level and accounts for the initial offsets of 0.05, 0.25 and 0.8 seconds for b=1, 2 and 3 respectively.
It is also necessary to simultaneously prevent all of the branches from becoming dead ends, which explains why the slope of the graph increases proportionally with branching.

VII.3.1.2. Benchmark: Variant Generation

The goal of this benchmark is to evaluate the number of attempts required to generate a valid level in the presence of chaos, as introduced by the Softmax action selection policy. Figure VII-10 shows the experimental results of this benchmark.

Figure VII-10. Variant generation benchmark (number of generation attempts against level length, for b=1, b=2 and b=3)

Unlike training time, variant generation does not scale well with the increase of level length. In the range l ∈ [50; 250] the number of attempts is less than 100 for all branching factors. However, as the length increases further, a huge difference between the branching factors appears. Levels with b=1 scale very well, whereas the number of attempts grows up to 1200 in the case of b ∈ {2; 3}. This effect can be explained by the fact that preventing all of the branches from becoming dead ends becomes an increasingly difficult task in the presence of level chaos. An analogy can be made with a house of cards, where someone continuously shuffles random cards at the base of the construction: as the structure gets taller, it becomes more likely that a random shuffle will result in its collapse. In the case of level generation, each step brings a small probability of triggering the fail state, and as the building agent moves along the length of the level this probability gradually accumulates. It is sufficient to have one impossibly long jump and a branch of the level will be invalidated.

Figure VII-11 presents an evaluation of the negative effects of level chaos on the generative performance PG. This test was performed for a level length of l=200. As evident from the figure, the probability of generating a valid level with three branches is PG=0.6 in the range c ∈ [0.2; 0.5]. As the chaos parameter increases, this probability drops to 0.3, at which point it would take many attempts to generate a valid level. Lower branching factors are not affected by this negative effect to such an extent: the generative performance decreases only slightly for a branching factor of two, with a minimum of 0.74. The graph for b=1 is not displayed in the figure because it is the horizontal line PG = 1.

Figure VII-11. Relation between PG and the Level Chaos parameter (PG plotted against the chaos parameter c for b=2 and b=3)

In light of these results it is clear that the chaos factor should be set to the minimal value that is sufficient to create the appearance of variety in the level. It is the subjective judgement of the author that a value of c=0.5 is sufficient for this purpose and that higher values should be avoided. The graphs in Figure VII-11 also make it clear that the large number of attempts it takes to generate levels with many branches is mainly caused by the introduction of level chaos. Therefore a possible way to improve the performance of the system is to find a way of creating level variety that is not entirely dependent on the Softmax action selection policy. One approach to that end would be to modify the reward function so that the agent is proactive in the creation of varied levels, instead of merely responding to a non-deterministic environment.
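The accumulation argument above can be made concrete with a small back-of-the-envelope calculation. The sketch below is purely illustrative and is not part of the project code: it assumes a constant, independent per-column failure probability pFail (the value 0.02 is invented for the example) and shows how the chance of surviving a whole level, and hence the expected number of attempts, degrades with length.

#include <cmath>
#include <cstdio>

int main()
{
    const double pFail = 0.02;   // illustrative per-column failure probability

    for (int length = 50; length <= 400; length += 50)
    {
        // Probability that no column of the level triggers the fail state.
        double pValid = std::pow(1.0 - pFail, length);
        // With independent attempts, the expected number of attempts is 1/pValid.
        std::printf("length %3d: P(valid) = %.3f, expected attempts = %.0f\n",
                    length, pValid, 1.0 / pValid);
    }
    return 0;
}

Under this simple model a level of length 50 succeeds on roughly every third attempt, while a level of length 400 needs on the order of thousands of attempts, which mirrors the qualitative behaviour observed in Figure VII-10.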
VII.3.1.3. Benchmark: Post-processing Time

The goal of this benchmark is to evaluate the time required for the post-processing of levels and how well it scales with level length. Figure VII-12 shows the results of this benchmark.

Figure VII-12. Post-processing benchmark (post-processing time in seconds against level length)

The figure reveals a linear dependence between length and post-processing time. This is not a surprising result, as the number of context matches performed by the post-processing algorithm also increases linearly with every new column added to the length of the level. Level branching also increases this measure linearly, but the effect is negligibly small.

VII.3.1.4. Total Level Generation Time

For a level blueprint with parameters {l=400; b=3}, post-processing adds only tpp=0.06s to the total generation time. In comparison, the average training time for the same length is tt=4.1s and the variant generation time (i.e. the time to follow the Softmax policy once) is measured to be tg=0.05s. Both training and variant generation can take multiple attempts in order to produce a valid level. In view of these time measurements and the PC and PG probabilities, it is easy to estimate the total generation time for different parameter setups. This estimation is presented in the following table (Figure VII-13).

Figure VII-13. Estimated Generation Times
Parameters    Training      Generation     Post-processing  Total
l=100, b=1    0.06s × 1     0.03s × 1.5    0.015s           0.2s
l=300, b=1    0.2s × 1      0.04s × 10     0.04s            0.7s
l=400, b=1    0.4s × 1      0.05s × 25     0.06s            1.7s
l=100, b=2    0.45s × 1     0.03s × 1.5    0.015s           0.5s
l=300, b=2    1.3s × 1      0.04s × 420    0.04s            18s
l=400, b=2    1.6s × 1      0.05s × 650    0.06s            34s
l=100, b=3    1.04s × 2.2   0.03s × 30     0.015s           3s
l=300, b=3    2.9s × 1.8    0.04s × 1030   0.04s            46s
l=400, b=3    4.1s × 2.4    0.05s × 1100   0.06s            1m 5s

VII.3.2. User Suggestions for Future Development

User evaluation was performed only by a small group of people with relevant interests in video games and game development. This evaluation, as well as test plays performed by the author, confirms that generated levels are traversable in the game engine and that the output is valid. Although the sample size was too small to draw any statistical conclusions about the quality of the levels, it proved a valuable source of ideas for the future improvement of the level generation system. Figure VII-14 summarises these ideas and proposes possible solutions.

Figure VII-14. User suggestions for further improvement

Problem 1: The small number of enemy types makes the levels less interesting.
Possible solution: Creating more monster and trap types. This is not part of the level generation system, but if the placement of enemies is to be an informed choice, and not completely random, the level generation system would need to differentiate between several enemy types.

Problem 2: No parametric control over the placement of dangers and treasure.
Possible solution: Introducing a difficulty parameter, controlled by the user, and a “difficulty coefficient” CD calculated automatically as the level is generated. CD could be calculated as the number of danger elements placed per unit of level length. If CD diverges significantly from the user-specified parameter, the fail state would be triggered. Parametric control over the placement of treasure can be implemented in a similar way.
Problem 3: Difficulty does not escalate near the end of levels, which is normally expected to happen.
Possible solution: Solving this problem would require an implementation of (2) and the dynamic adaptation of the CD parameter as a function of the current position in the level. It must be pointed out that by adapting the CD parameter the environment of the building agent becomes non-stationary. It must be ensured that these non-stationary changes occur gradually in order for the learning algorithm to perform well.

Problem 4: Enemies are too sparsely located.
Possible solution: This issue can be resolved by implementing (1) and (2).

Problem 5: Lower branches of the level are sometimes obstructed by upper branches, making some of the jumps difficult.
Possible solution: Some investigation revealed that this problem is caused by the post-processing rules. It was already discussed that post-processing should not change the accessibility of the levels but only provide a visual improvement. The rules that introduce changes in accessibility should be modified.

CONCLUSION

This document presented the architecture and implementation of a reinforcement learning level generation system. In the evaluation of the system it was demonstrated that it can produce valid platform game levels within reasonable time constraints. It was determined that the length of levels can be varied safely in the range of 50 to 300 cells for all of the tested branching factors, and up to 400 cells in the case of a smaller branching factor. The system successfully implements the tasks of placing enemies and treasure in the level, as evident from inspection of the output and test plays within the game engine.

Improving scalability with regard to level length and introducing parametric control over the placement of treasure and danger would be two beneficial developments of the current project. Recording the cumulative amount of treasure and danger placed since the start of the level, or within a sliding window of fixed size, could be the basis for better parametric control. This value would then be normalised in the range 0-100% and compared against a user-specified parameter. With regard to the improvement of scalability, it was already discussed that creating a more proactive building agent could be the solution to this problem. The agent would actively introduce variety, allowing a smaller value of the level chaos parameter to be used. During the analysis and evaluation of the system it was discovered that the chaos parameter is the main obstacle that harms scalability. In order to avoid this and teach the agent how to be a proactive “chaos maker”, it is necessary to devise some measure of level variety. This could be realised in a way similar to the cumulative danger and treasure measures.

Another feature of the implemented system is level post-processing. This is an important final step in level generation because it allows the produced output to be rendered in a game engine. Without post-processing the level is only an abstract specification that does not have any graphical information associated with it. As evidence for the correct functioning of this step, screenshots of levels rendered in the 3D engine were presented (Fig. VII-1). Additional level output, including that of much longer levels, is available and ready for visualisation with the executable files of this project.

BIBLIOGRAPHY

[3DR02] 3D REALMS, Duke Nukem: Manhattan Project. Video game, http://www.3drealms.com/dukemp/index.html, (2002).
[Ada01] ADAMS, E., Replayability Part 2: Game Mechanics. in Gamasutra, article available at http://www.gamasutra.com/features/20010703/adams_01.htm, (2001).
[Ada02] ADAMS, D., Automatic Generation of Dungeons for Computer Games. BSc Thesis, University of Sheffield, pp. 9-13, (2002).
[BA06] BUFFET, O., ABERDEEN, D., The Factored Policy-Gradient Planner. in Proceedings of the Fifth International Planning Competition, pp. 69-71, (2006).
[Bli00] BLIZZARD ENTERTAINMENT, Diablo II. Video game, http://www.blizzard.com/us/diablo2/, (2000).
[Bou06] BOUTROS, D., A Detailed Cross-Examination of Yesterday and Today's Best-Selling Platform Games. in Gamasutra, article available at http://www.gamasutra.com/features/20060804/boutros_24.shtml, (2006).
[CM06] COMPTON, K., MATEAS, M., Procedural Level Design for Platform Games. in Proceedings of the 2nd Artificial Intelligence and Interactive Digital Entertainment Conference (AIIDE), (2006).
[Com05] DE COMITÉ, F., A Java Platform for Reinforcement Learning Experiments. Laboratoire d'Informatique Fondamentale de Lille, (2005).
[Cry97] CRYSTAL DYNAMICS, Pandemonium II. Video game, http://www.crystald.com, (1997).
[GHJ*95] GAMMA, E., HELM, R., JOHNSON, R., VLISSIDES, J., Design Patterns: Elements of Reusable Object-Oriented Software. Pearson Education, pp. 35-53, (1995).
[Inc99] INCE, S., Automatic Dynamic Content Generation for Computer Games. Thesis, University of Sheffield, (1999).
[JM08] JURAFSKY, D., MARTIN, J., Speech and Language Processing: An Introduction. Pearson Education Press, pp. 438-442, (2008).
[KLM96] KAELBLING, L., LITTMAN, M., MOORE, A., Reinforcement Learning: A Survey. in Journal of Artificial Intelligence Research, volume 4, pp. 237-285, (1996).
[Kuz02] KUZMIN, V., Connectionist Q-Learning in Robot Control Task. in Scientific Proceedings of Riga Technical University, (2002).
[Las07] LASKOV, A., Three-dimensional platform game “Jumping Ron”. Bachelor's Thesis, Technical University of Sofia, (2007).
[MA93] MOORE, A., ATKESON, C., Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. in Machine Learning, volume 13, pp. 103-130, (1993).
[Mar00] MARTIN, K., Using Bitmaps for Automatic Generation of Large-Scale Terrain Models. in Gamasutra, article available at http://www.gamasutra.com/features/20000427/martin_pfv.htm, (2000).
[Mar97] MARTZ, P., Generating Random Fractal Terrain. in Game Programmer, article available at http://gameprogrammer.com/fractal.html, (1997).
[Mil89] MILLER, G., The Definition and Rendering of Terrain Maps. in SIGGRAPH '89 Conference Proceedings, pp. 39-48, (1989).
[MSG99] MORIARTY, D., SCHULTZ, A., GREFENSTETTE, J., Evolutionary Algorithms for Reinforcement Learning. in Journal of Artificial Intelligence Research, volume 11, pp. 241-276, (1999).
[Nin81] NINTENDO, Donkey Kong. Video game, http://nintendo.com, (1981).
[Nin87] NINTENDO, Super Mario Bros. Video game, http://nintendo.com, (1987).
[Nin96] NINTENDO, Super Mario 64. Video game, http://nintendo.com, (1996).
[Num05] NEUMANN, G., The Reinforcement Learning Toolbox, Reinforcement Learning for Optimal Control Tasks. Master's Thesis, Technical University of Graz, (2005).
[OSK*05] ONG, T.J., SAUNDERS, R., KEYSER, J., LEGGETT, J., Terrain Generation Using Genetic Algorithms. in Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pp. 1463-1470, (2005).
[PS05] PETERS, J., SCHAAL, S., Natural Actor-Critic. in Proceedings of the Sixteenth European Conference on Machine Learning, pp. 280-291, (2005).
[SB98] SUTTON, R., BARTO, A., Reinforcement Learning: An Introduction. The MIT Press, pp. 52-82, (1998).
[SCW08] SMITH, G., CHA, M., WHITEHEAD, J., A Framework for Analysis of 2D Platformer Levels. in Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games, pp. 75-80, (2008).
[Son96] SONY ENTERTAINMENT, Crash Bandicoot. Video game, (1996).
[TWA80] TOY, M., WICHMAN, G., ARNOLD, K., Rogue. Video game, (1980).
[WDR*93] WHITLEY, D., DOMINIC, S., DAS, R., ANDERSON, C., Genetic Reinforcement Learning for Neurocontrol Problems. in Machine Learning, volume 13, pp. 259-285, (1993).
[WLB*07] WHITE, A., LEE, M., BUTCHER, A., TANNER, B., HACKMAN, L., SUTTON, R., RL-Glue. Available at http://rlai.cs.ualberta.ca/RLBB/top.html, (2007).
[Yu08] YU, D., Spelunky. Video game, http://www.derekyu.com, (2008).