Virtual resources management on HPC clusters Jonathan Stoppani
VURM: Virtual resources management on HPC clusters

Author: Jonathan Stoppani
Advisors: Prof. Dorian Arnold, Prof. Patrick G. Bridges, Prof. François Kilchoer, Prof. Pierre Kuonen
Expert: Prof. Peter Kropf
Summer 2011, Bachelor project
College of Engineering and Architecture Fribourg, member of the University of Applied Sciences of Western Switzerland

Abstract

Software deployment on HPC clusters is often subject to strict limitations with regard to software and hardware customization of the underlying platform. One possible approach to circumvent these restrictions is to virtualize the different hardware and software resources by interposing a dedicated layer between the running application and the host operating system.

The goal of the VURM project is to enhance existing HPC resource management tools with virtualization-oriented capabilities such as job submission to specially created virtual machines or runtime migration of virtual nodes to account for updated job priorities or for better performance exploitation. The two main enhancements this project aims to provide over the already existing tools are, firstly, full customization of the platform running the client software and, secondly, improved dynamic resource allocation strategies exploiting virtual machine migration techniques.

The final work is based upon SLURM (Simple Linux Utility for Resource Management) as the job scheduler and uses libvirt (an abstraction layer able to communicate with different hypervisors) to manage the KVM/QEMU hypervisor.

Keywords: resource management, virtualization, HPC, SLURM, migration, libvirt, KVM

Contents

1 Introduction
1.1 High Performance Computing
1.2 Virtualization
1.3 Project goals
1.4 Technologies
1.5 Context
1.6 Structure of this report

2 Architecture
2.1 SLURM architecture
2.2 VURM architecture
2.3 Provisioning workflow
2.4 Implementing new provisioners

3 Physical cluster provisioner
3.1 Architecture
3.2 Deployment
3.3 Libvirt integration
3.4 VM lifecycle aspects
3.5 Possible improvements

4 Migration
4.1 Migration framework
4.2 Allocation strategy
4.3 Scheduling strategy
4.4 Migration techniques
4.5 Possible improvements
5 Conclusions
5.1 Achieved results
5.2 Encountered problems
5.3 Future work
5.4 Personal experience

Acronyms

References

A User manual
A.1 Installation
A.2 Configuration reference
A.3 Usage
A.4 VM image creation

B ViSaG comparison

C Project statement

D CD-ROM contents

List of Figures

2.1 Basic overview of the SLURM architecture
2.2 Initial proposal for the VURM architecture
2.3 The VURM system from the SLURM controller point of view
2.4 The VURM system updated with the VURM controller
2.5 Adopted VURM architecture
2.6 VURM architecture class diagram
2.7 Resource provisioning workflow
3.1 Overall architecture of the remotevirt provisioner
3.2 Complete VURM + remotevirt deployment diagram
3.3 Copy-On-Write image R/W streams
3.4 Complete domain lifecycle
4.1 Example of virtual cluster resizing through VMs migration
4.2 Migration framework components
4.3 Collaboration between the different components of the migration framework
4.4 Graphical representation
4.5 Edge coloring and the respective migration scheduling

List of Code Listings

2.1 Configuration excerpt
2.2 Reconfiguration command
2.3 Node naming without grouping
2.4 Node naming with grouping
2.5 SLURM configuration for a virtual cluster
2.6 Batch job execution on a virtual cluster
2.7 vurm/bin/vurmctld.py
3.1 Basic libvirt XML description file
3.2 Example arp -an output
3.3 Libvirt TCP to serial port device description
3.4 Shell script to write the Internet Protocol (IP) address to the serial port
3.5 Shell script to exchange the IP address and a public key over the serial port
4.1 CU rounding algorithm
4.2 Nodes to virtual cluster allocation
4.3 VMs migration from external nodes
4.4 Leveling node usage inside a virtual cluster
A.1 Example VURM configuration file
A.2 Synopsis of the vurmctld command
A.3 Synopsis of the vurmd-libvirt command
A.4 Synopsis of the valloc command
A.5 Synopsis of the vrelease command
A.6 Shell script to write the IP address to the serial port

Chapter 1 Introduction

High Performance Computing (HPC) is often seen as a niche area of the broader computer science domain and is frequently associated only with highly specialized research contexts. On the other hand, virtualization, once considered part of a similarly limited context, is nowadays gaining acceptance and popularity thanks to its adoption in cloud-based computing approaches.

The goal of this project is to combine HPC tools – more precisely a batch scheduling system – with virtualization techniques and to evaluate the advantages that such a solution brings over more traditional tools and techniques. The outcome of this project is a tool called Virtual Utility for Resource Management (VURM): a loosely coupled extension to the Simple Linux Utility for Resource Management (SLURM) which adds virtual resource management capabilities, allowing jobs submitted to SLURM to run on dynamically spawned virtual machines.

The rest of this chapter is dedicated to presenting the concepts of HPC and virtualization, illustrating the project goal in more detail and introducing the tools and technologies on which the project is based. The last section of the chapter provides an overview of the structure of this report.

1.1 High Performance Computing

Once limited to specialized research or academic environments, High Performance Computing (HPC) continues to gain popularity as the limits of more traditional computing approaches are reached and more powerful paradigms are needed.

HPC brings a whole new set of problems to the already complex world of software development. One example is the problem of CPU processing errors: completely negligible on more traditional computing platforms, such errors instantly assume primary importance and introduce the need for specialized error recovery strategies. Another problem, bound to the high-efficiency characteristics such systems have to expose, is the set of strict hardware and software limitations imposed by the underlying computing platforms. (Such limitations can be categorized into hard limitations, imposed by the particular hardware or software architecture, and soft limitations, normally imposed by the entity administering the systems or by the usage policies of the owning organization.)

In order to take full advantage of the resources offered by a computing platform, a single task is often split into different jobs which can then in turn be run concurrently across different nodes. A multitude of strategies are available to schedule these jobs in the most efficient manner possible.
The application of a given strategy is handled by a job scheduler (also known, for historical reasons, as a batch scheduler): a specialized software entity which runs across a given computing cluster and which often provides additional functionality, such as per-user accounting, system monitoring or other pluggable features.

1.2 Virtualization

Virtualization had been a topic of interest in the scientific and academic communities for several years before making its debut in the wider industrial and consumer market. Now adopted and exploited as a powerful tool by more and more people, it has rapidly become one of the major trends across the Information Technology (IT) market, mainly thanks to its role in the more recent cloud computing paradigm.

The broader term virtualization refers to the creation and exploitation of a virtual (rather than actual) version of a given resource. Examples of commonly virtualized resources are complete hardware platforms, operating systems, network resources or storage devices. Additionally, different virtualization types are possible: in the context of this project, the term virtualization always refers to hardware virtualization. Other types include operating-system-level virtualization, memory virtualization, storage virtualization, etc.

The term hardware virtualization, also called platform virtualization, is tightly bound to the creation of virtual machines: a virtual machine is a software environment which presents itself to its guest (that is, the software running on it) as a complete hardware platform and which isolates the guest environment from the environment of the process running the virtual machine.

Although additional hardware virtualization types exist, for the scope of this project we differentiate only between full virtualization and paravirtualization. When using full virtualization, the underlying hardware platform is almost fully virtualized and the guest software, usually an operating system, can run unmodified on it; in the normal case the guest software does not even know that it runs on a virtual platform. Paravirtualization, instead, is a technique allowing the guest software to access the hardware more directly from within its own isolated environment. This requires modifications to the guest software but enables much better performance. (Paravirtualization is fully supported in the Linux kernel starting with version 2.6.25 through the use of the virtio drivers [11].)

The main advantages of virtualization techniques are the possibility of running the guest software completely sandboxed (which greatly increases security), an increase in the exploitation ratio of a single machine (a point which does not apply to HPC systems, where resources are fully exploited in the majority of cases, but which can greatly improve the average exploitation ratio of a common server) and the possibility of offering full software customization through the use of custom-built Operating System (OS) disk images. This last advantage can in some cases be extended to hardware resources too, allowing emulated hardware resources to be attached to a running guest.

Virtualization does not come with advantages only; the obvious disadvantage of virtualized systems is the increased resource usage overhead caused by the interposition of an additional abstraction layer. Recent software and hardware improvements (mainly paravirtualization and hardware-assisted virtualization, respectively) have greatly improved the performance of hardware virtualization.
1.3 Project goals

Different problems and limitations arise when developing software to be run in HPC contexts. As seen in section 1.1, these limitations are either inherent to the hardware and software architecture or imposed by its usage policies, and section 1.2 introduced a virtualization-based approach to circumvent them. It is thus deemed possible to overcome the limitations imposed by a particular HPC execution environment by using virtualization techniques, trading off a certain amount of performance for a much more flexible platform.

The goal of this project is to add virtual resource management capabilities to an existing job scheduling utility. These enhancements would allow platform users to run much more customized software instances without the administrators having to be concerned about security or additional management issues. In such a system, user jobs would be run inside a virtual machine using a user-provided base disk image, effectively leaving the hosting environment untouched.

The adoption of a virtual machine based solution enables advanced scheduling capabilities to be put in place: the exploitation of virtual machine migration between physical nodes allows the job scheduler to adapt resource usage to the current system load and thus dynamically optimize resource allocation at execution time, and not only while scheduling jobs. This optimization can be carried out by migrating Virtual Machines (VMs) from one physical node in the system to another and executing them on the best available resource at every point in time.

To reach this objective, different partial goals have to be attained. The following breakdown illustrates each of the partial tasks in more detail:

1. Adding support to the job scheduler of choice for running regular jobs inside dynamically created and user-customizable virtual machines;

2. Adding support for controlling the state of each virtual machine (pausing/resuming, migration) to the job scheduler, so that more sophisticated resource allocation decisions can be made (e.g. migrating multiple virtual machines of a virtual cluster onto a single physical node) as new information (e.g. new jobs) becomes available;

3. Implementing simple resource allocation strategies based on the existing job scheduling techniques to demonstrate the capabilities added to the job scheduler and the Virtual Machine Monitor (VMM) in (1) and (2) above.

A more formal project statement, containing additional information such as deadlines and additional resources, can be found in Appendix C.

1.4 Technologies

Different already existing libraries and utilities were used to build the final prototype. This section provides an overview of the main external components initially chosen to build upon. The absolutely necessary components are a job scheduler to extend and a hypervisor (or VMM) with which to manage the virtual machines; in the case of this project, SLURM and KVM/QEMU were chosen, respectively.

A VMM abstraction layer called libvirt was used in order to lessen the coupling between the developed prototype and the hypervisor, making it easy to swap out KVM for a different hypervisor.
This decision is mainly due to an effort to make a future integration with Palacios – a VMM intended to be used for high performance computing – as easy as possible. The remaining part of this section introduces the utilities and libraries cited above.

SLURM

Simple Linux Utility for Resource Management (SLURM) is a batch scheduler for Linux based operating systems. Initially developed at the Lawrence Livermore National Laboratory, it was subsequently open-sourced and is now a well-known player in the job scheduling ecosystem. The authors [19] describe it with the following words:

The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Section 2.1 provides additional information about the SLURM working model and architectural internals.

KVM/QEMU

QEMU [2] (which probably stands for Quick EMUlator [10]) is an open source processor emulator and virtualizer. It can be used as either an emulator or a hosted Virtual Machine Monitor (VMM). This project only takes advantage of its virtualization capabilities, making little or no use of the emulation tools.

Kernel-based Virtual Machine (KVM) [12] is a full virtualization solution for Linux based operating systems running on x86 hardware containing virtualization extensions (specifically, Intel VT or AMD-V). To run hardware-accelerated guests, the host operating system kernel has to include support for the different KVM modules; this support is included in the mainline Linux kernel as of version 2.6.20.

The main reasons for choosing the KVM/QEMU pair over other VMMs are KVM's support for both live and offline VM migration and the fact that both are released as open source software. Recent versions of QEMU distributed by the package managers of most Linux distributions already include (either by default or optionally) the bundled KVM/QEMU build. (On Debian and Ubuntu systems only the qemu-kvm package is provided; the qemu package is a dummy transitional package for qemu-kvm. On Gentoo systems, KVM support can be enabled for the qemu package by setting the kvm USE flag.)

Palacios

Palacios is an open source VMM specially targeted at research and teaching in high performance computing, currently under development as part of the V3VEE project (http://v3vee.org/). The official description, as found on the publicly accessible website [18], is reported below:

Palacios is a type I, non-paravirtualized, OS-independent VMM that builds on the virtualization extensions in modern x86 processors, particularly AMD SVM and Intel VT. Palacios can be embedded into existing kernels, including very small kernels. Thus far, Palacios has been embedded into the Kitten lightweight kernel from Sandia National Labs and the University of Maryland's GeekOS teaching kernel.
Currently, Palacios can run on emulated PC hardware, commodity PC hardware, and Cray XT3/4 machines such as Sandia's Red Storm.

Palacios is also able to boot an unmodified Linux distribution [13] and can thus be used on a wide range of both host hardware and software platforms and to host different (more or less lightweight) guest operating systems.

Libvirt

The libvirt project aims to provide an abstraction layer to manage virtual machines over different hypervisors, both locally and remotely. It offers an Application Programming Interface (API) to create, start, stop, destroy and otherwise manage virtual machines and their respective resources such as network devices, storage devices, processor pinnings, hardware interfaces, etc. Additionally, through the libvirt daemon, it accepts incoming remote connections and allows complete exploitation of the API from remote locations while accounting for authentication and authorization related issues.

The libvirt codebase is written in C, but bindings for different languages are provided as part of the official distribution or as part of external packages. The currently supported languages are C, C#, Java, OCaml, Perl, PHP, Python and Ruby.

The decision to use libvirt as an abstraction layer instead of directly accessing the API exposed by KVM/QEMU was taken to facilitate switching to a different VMM if the need should arise and, in particular, to ease a possible Palacios integration once the necessary support is added to the libvirt API. Chapter 3 contains additional information about libvirt's working model and its integration into the VURM architecture.

1.5 Context

This project is the final diploma work of Jonathan Stoppani, and will allow him to obtain the Bachelor of Science (B.Sc.) in Computer Science at the College of Engineering and Architecture Fribourg, part of the University of Applied Sciences of Western Switzerland. The work is carried out at the Scalable Systems Lab (SSL) of the Computer Science department of the University of New Mexico (UNM), USA, during Summer 2011.

Prof. Peter Kropf, Head of the Distributed Computing Group and Dean of the Faculty of Science of the University of Neuchâtel, Switzerland, covers the role of expert. Prof. Patrick G. Bridges and Prof. Dorian Arnold, associate professors at the University of New Mexico, are supervising the project locally. Prof. Pierre Kuonen, Head of the GRID and Cloud Computing Group, and Prof. François Kilchoer, Dean of the Computer Science Department, both of the College of Engineering and Architecture of Fribourg, are supervising the project from Switzerland.

1.6 Structure of this report

This section describes the overall structure of the present report by briefly introducing the contents of each chapter and placing it in the right context. The main content is organized in the following chapters:

• This first chapter, chapter 1, introduces the project and its context, explains the problems to be solved and lists the main technological choices;

• After a general overview, chapter 2 explains the overall architectural design of the VURM tool and how it fits into the existing SLURM architecture, motivating the different choices;

• Chapter 3 introduces the remotevirt resource provisioner, one of the two provisioners shipped with the default VURM implementation.
This chapter, as well as the following one, presupposes an understanding of the overall architecture presented in chapter 2;

• Virtual machine migration between physical nodes, one of the main advantages of using a virtualization-enabled job scheduler, is described in chapter 4. The migration capabilities are implemented as part of the remotevirt provisioner but are presented in a separate chapter as a self-standing subject;

• Finally, chapter 5 concludes the report by summarizing the important positive and negative aspects, citing possible areas of improvement and providing a personal assessment of the work carried out.

In addition to the main content, several appendices are available at the end of the present report, containing information such as the user manual and additional related material not directly relevant to a specific chapter. Refer to the table of contents for a more detailed listing. The acronyms used and the cited references are also collected in the Acronyms and References sections, respectively.

Chapter 2 Architecture

The VURM virtualization capabilities have to be integrated with the existing SLURM job scheduler in order to exploit the functionalities offered by each component. At least a basic understanding of the SLURM internals is thus needed in order to design the best possible architecture for the VURM utility.

This chapter introduces the reader to both the SLURM and the VURM architectural design and explains the reasons behind the different choices that led to the implemented solution. The content presented in this chapter is structured into the following sections:

• Section 2.1 introduces the SLURM architecture and explains the details needed to understand how VURM integrates into the system;

• Section 2.2 explains the chosen VURM architecture and presents the design choices that led to the final implementation;

• Section 2.3 introduces the provisioning workflow used to allocate new resources to users requesting them and introduces the concept of virtual clusters;

• Section 2.4 explains the details of the implementation of a new resource provisioner and its integration in a VURM system.

2.1 SLURM architecture

The architecture of the SLURM batch scheduler was conceived to be easily expandable and customizable, either through appropriate configuration or by adding functionality through the use of specific plugins.

This section does not aim to provide a complete understanding of how the SLURM internals are implemented or which functionalities can be extended through the use of plugins; it is limited instead to providing a global overview of the different entities which intervene in its very basic lifecycle and which are needed to understand the VURM architecture presented afterwards.

Figure 2.1: Basic overview of the SLURM architecture

2.1.1 Components

A basic overview of the SLURM architecture is presented in Figure 2.1. Note that the actual architecture is much more complex than what is illustrated in the class diagram and involves additional entities (such as backup controllers, databases, etc.) and interactions. However, for the scope of this project, the representation covers all the important parts and can be used to place the VURM architecture in the right context later in this chapter.
The classes represented in the diagram are not directly mapped to actual objects in the SLURM codebase, but rather to the different intervening entities (a deployment diagram might be more suitable for this task, but a class diagram better represents the cardinality of the relationships between the different entities); the rest of this subsection provides some more explanations about them.

Controller

The Controller class represents a single process controlling the whole system; its main responsibility is to schedule incoming jobs and assign them to nodes for execution based on different (configurable) criteria. Additionally, it keeps track of the state of each configured daemon and partition, accepts incoming administration requests and maintains an accounting database.

A backup controller can also be defined; requests are sent to this controller instead of the main one as soon as a requesting client notices that the main controller is no longer able to correctly process incoming requests. Optionally, a Database Management System (DBMS) backend can be used to take advantage of additional functionalities such as per-user accounting or additional logging.

Daemons

Each Daemon instance represents a process running on a distinct node; its main task is to accept incoming job requests and run them on the local node. The job execution commands (see below) communicate with the daemons directly after having received an allocation token from the controller. The first message they send to the nodes to request job allocation and/or execution is signed by the controller. This decentralized messaging paradigm allows for better scalability on clusters with many nodes, as the controller does not have to deal with the job execution processing on each node itself. Each daemon is periodically pinged by the controller to keep its status updated in the central node database.

Partitions

The Daemon instances (and thus the nodes) are organized in one or more (possibly overlapping) logical partitions; each partition can be configured individually with regard to permissions, priorities, maximum job duration, etc. This makes it possible to create partitions with different settings. A possible example of useful per-partition configuration is to allow only a certain group of users to access a partition which contains more powerful nodes. Another possible example is to organize the same set of nodes in two overlapping partitions with different priorities; jobs submitted to the partition with higher priority will thus be carried out faster (i.e. their allocation will have a higher priority) than jobs submitted to the partition with lower priority.

Commands

Different Commands are made available to administrators and end users; they communicate with the Controller (and possibly directly with the Daemons) by using a custom binary protocol with TCP/IP as the transport layer. The main commands used throughout the project are the srun command, which allows a job to be submitted for execution (and offers a plethora of options and arguments to fine-tune the request), and the scontrol command, which can be used to perform administration requests, such as listing all nodes or reloading the configuration file.

2.1.2 Configuration

The SLURM configuration resides in a single file with a simple syntax. This configuration file contains the directives for all components of a SLURM-managed system (including the controller, the daemons, the databases,
etc.); an excerpt of such a file is shown in Listing 2.1.

Listing 2.1: Configuration excerpt

    # slurm.conf file generated by configurator.html.
    # Put this file on all nodes of your cluster.
    # See the slurm.conf man page for more information.
    #
    ControlMachine=controller-hostname
    # ...
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/tmp/slurmd
    SlurmUser=nelson
    # ...
    NodeName=testing-node Procs=1 State=UNKNOWN
    PartitionName=debug Nodes=ubuntu Default=YES MaxTime=INFINITE State=UP

    NodeName=nd-6ad4185-0 NodeHostname=10.0.0.101
    NodeName=nd-6ad4185-1 NodeHostname=10.0.0.100
    PartitionName=vc-6ad4185 Nodes=nd-6ad4185-[0-1] Default=NO MaxTime=INFINITE State=UP

An important feature of the SLURM controller daemon is its ability to reload this configuration file at runtime. This feature is exploited to dynamically add or remove nodes and partitions from a running system. The file can be reloaded by executing a simple administration request through the scontrol command, as illustrated in Listing 2.2.

Listing 2.2: Reconfiguration command

    $ scontrol reconfigure

A useful feature of the configuration syntax is the ability to group node definitions whose NodeName ends with numerical indexes; this may not seem that interesting when working with 10 nodes, but being able to group configuration directives for clusters of thousands of nodes is more than welcome. Listings 2.3 and 2.4 show an example of this feature.

Listing 2.3: Node naming without grouping

    NodeName=linux1 Procs=1 State=UNKNOWN
    NodeName=linux2 Procs=1 State=UNKNOWN
    # ... 29 additional definitions ...
    NodeName=linux32 Procs=1 State=UNKNOWN
    PartitionName=debug Nodes=linux1,linux2,...,linux32 Default=YES MaxTime=INFINITE State=UP

Listing 2.4: Node naming with grouping

    NodeName=linux[1-32] Procs=1 State=UNKNOWN
    PartitionName=debug Nodes=linux[1-32] Default=YES MaxTime=INFINITE State=UP

2.2 VURM architecture

The previous section introduced some of the relevant SLURM architectural concepts. This section, instead, provides an overview of the final implemented solution and the process that led to the architectural design of the VURM framework.

NOTE 1
· Blue circles represent virtual machines (blue is used for virtual resources or for software running on virtual resources).
· Green squares represent physical nodes (green is used for physical resources or for software running on physical resources).
· Virtual machines and physical nodes can be grouped into virtual clusters and partitions, respectively.
· Rounded rectangles represent services or programs.

NOTE 2
The term virtual cluster is used a couple of times here. For now, consider a virtual cluster simply as a SLURM partition with a label attached to it; section 2.3 contains more information about these special partitions.

2.2.1 A first proposal

The first approach chosen to implement the VURM architecture was to use SLURM for everything and inject the needed virtualization functionalities into it, either by adding plugins or by modifying the original source code. This approach is introduced by Figure 2.2. To help better understand the different parts, the representation splits the architecture into two different layers: a physical layer and a virtual layer.
Figure 2.2: Initial proposal for the VURM architecture

In this proposal, physical nodes (each running a SLURM daemon) are managed by the SLURM controller. When a new request is received by the controller, a new VM is started on one of these nodes and a SLURM daemon is set up and started on it. Upon startup this daemon registers itself as a new node with the SLURM controller. Once the registration has occurred, the SLURM controller can allocate resources and submit jobs to the new virtual node.

The initial idea behind this proposal was to spawn new virtual machines by running a job which takes care of this particular task through the SLURM system itself. This job would be run on a daemon placed inside the seed partition. The SLURM controller daemon cannot, in this case, differentiate between daemons running on physical or virtual nodes. It only knows that, to spawn new virtual machines, the respective job has to be submitted to the seed partition, while user jobs will be submitted to one of the created virtual cluster partitions. In this case the user has to specify in which virtual cluster (i.e. in which partition) the job should run.

Figure 2.3 illustrates how the SLURM controller sees the whole system (note that virtual and physical nodes, as well as virtual clusters and partitions, are still differentiated for illustration purposes, but aside from their names the SLURM controller is not able to tell them apart).

Figure 2.3: The VURM system from the SLURM controller point of view

2.2.2 Completely decoupled approach

By further analyzing and iterating over the previously presented approach, it is possible to deduce that the SLURM controller uses the nodes in the seed partition exclusively to spawn new virtual machines and does not effectively take advantage of these resources to allocate and run jobs. Using SLURM for this specific operation introduces unnecessary design and runtime complexity. It is also possible to observe that these physical nodes have to run some special piece of software providing VM management specific capabilities which are not offered by SLURM anyway.

Based on these considerations, it was thus decided to use SLURM only to manage user-submitted jobs, while deferring all virtualization-oriented tasks to a custom, completely decoupled software stack. This stack is responsible for mapping virtual machines to physical nodes and registering them as resources with the SLURM controller. This decision has several consequences on the presented design:

• There is no need to run the SLURM daemons on the nodes in the seed partition;

• The custom VURM daemons will be managed by a VURM-specific controller;

• The SLURM controller will manage the virtual nodes only.

Figures 2.2 and 2.3 were updated to take these considerations into account, resulting in Figures 2.5 and 2.4, respectively.

Figure 2.4: The VURM system updated with the VURM controller

Users can create new virtual clusters by executing the newly provided valloc command and then run their jobs on them by using the well-known srun command. Behind the scenes, each
of these commands talks to the responsible entity: the VURM controller (vurmctld) in the case of the valloc command or the SLURM controller (slurmctld) in the case of the srun command.

Figure 2.5: Adopted VURM architecture

2.2.3 Multiple provisioners support

The abstract nature of the nodes provided to the SLURM controller (only a hostname and the port number of a listening SLURM daemon are needed) makes it possible to run SLURM daemons on nodes coming from different sources. A particular effort was put into the architectural design to make it possible to use different resource provisioners – and thus resources coming from different origins – in a single VURM controller instance.

Figure 2.6 contains a formal Unified Modeling Language (UML) class diagram which describes the controller architecture in more detail and introduces the new entities needed to enable support for multiple provisioners. Each IProvisioner realization is a factory for objects providing the INode interface. The VurmController instance queries each registered IProvisioner instance for a given number of nodes, assembles them together and creates a new VirtualCluster instance.

Thanks to the complete abstraction of the IProvisioner and INode implementations, it is possible to use heterogeneous resource origins together as long as they implement the required interface. This makes it easy to add new provisioner types (illustrated in the diagram by the provisioners.XYZ placeholder package) to an already existing architecture. More details on the exact provisioning and virtual cluster creation process are given in the next section.

Figure 2.6: VURM architecture class diagram

2.3 Provisioning workflow

This section introduces the concept of virtual cluster and explains the provisioning workflow anticipated in the previous sections in deeper detail. Most of the explanations provided in this section refer to the sequence diagram illustrated in Figure 2.7.

2.3.1 Virtual cluster definition

A virtual cluster is a logical grouping of virtual machines spawned in response to a user request. VMs in a virtual cluster belong to the user who initially requested them and can be exploited to run batch jobs through SLURM. These nodes can originate from different providers, depending on the particular resource availability at the moment the creation request occurs. Virtual clusters are exposed to SLURM as normal partitions containing all virtual nodes assigned to the virtual cluster.
The VURM controller modifies the SLURM configuration by placing the node and partition definitions between comments which clearly identify each virtual cluster, allowing an automated process to retrieve the defined virtual clusters by simply parsing the configuration. An example of such a configuration for a virtual cluster consisting of two nodes is illustrated in Listing 2.5.

Listing 2.5: SLURM configuration for a virtual cluster

    # [vc-6ad4185]
    NodeName=nd-6ad4185-0 NodeHostname=10.0.0.101
    NodeName=nd-6ad4185-1 NodeHostname=10.0.0.100
    PartitionName=vc-6ad4185 Nodes=nd-6ad4185-[0-1] Default=NO MaxTime=INFINITE State=UP
    # [/vc-6ad4185]

Once a virtual cluster is created, the user can run jobs on it by using the --partition argument to the srun command, as illustrated in Listing 2.6.

Listing 2.6: Batch job execution on a virtual cluster

    # The -N2 argument forces SLURM to execute the job on 2 nodes
    srun --partition=vc-6ad4185 -N2 hostname

The introduction of the virtual cluster concept thus makes it possible to manage a logical group of nodes belonging to the same allocation unit, conveniently mapped to a SLURM partition, and enables users to access and exploit the resources assigned to them in the best possible way.

2.3.2 Virtual cluster creation

A new virtual cluster creation process is started as soon as the user executes the newly provided valloc command. The only required parameter is the desired size of the virtual cluster to be created. Optionally, a minimum size and a priority can also be specified. The minimum size is used as the failure threshold if the controller cannot find enough resources to satisfy the requested size; if not specified, it defaults to the requested size (message 1 in the sequence diagram in Figure 2.7). The priority (defaulting to 1) is used to allocate resources to the virtual cluster and is further described in chapter 4.

When a request for a new virtual cluster is received, the controller asks the first defined provisioner for the requested number of nodes. The provisioner can decide, based on the current system load, to allocate and return all nodes, only some of them or none at all (messages 1.1, 1.2 and 1.2.1). As long as the total number of allocated nodes has not reached the requested size, the controller continues by asking the next provisioner to allocate the remaining nodes. If the provisioner list is exhausted without reaching the user-requested number of nodes, the minimum size is checked. If the number of allocated nodes equals or exceeds the minimum cluster size, processing continues; otherwise all nodes are released and the error is propagated back to the user.

NOTE
The loop block with the alternate opt illustrated in the sequence diagram (first block) corresponds to the Python-specific for-else construct. In this construct, the else clause is executed only if the for loop completes all its iterations without executing a break statement. The Python documentation explaining the exact syntax and semantics of the for-else construct can be found at the following URL: http://docs.python.org/reference/compound_stmts.html#for
In this specific case, the else clause is executed when the controller – after having iterated over all provisioners – has not collected the requested number of nodes.
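A minimal Python sketch of this behaviour is given below. It is an illustration only and not the actual VURM controller code (which is asynchronous and Twisted-based); the getNodes() call is shown with a simplified signature and NotEnoughResourcesError is a hypothetical exception name.

    class NotEnoughResourcesError(Exception):
        """Raised when the minimum virtual cluster size cannot be satisfied."""

    def collect_nodes(provisioners, size, min_size):
        """Gather nodes from the registered provisioners, one after another."""
        nodes = []
        for provisioner in provisioners:
            # Ask the current provisioner only for the nodes still missing.
            nodes.extend(provisioner.getNodes(size - len(nodes)))
            if len(nodes) == size:
                break
        else:
            # Reached only if no break occurred, i.e. every provisioner was
            # queried without collecting the requested number of nodes.
            if len(nodes) < min_size:
                for node in nodes:
                    node.release()
                raise NotEnoughResourcesError()
        return nodes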
Once the resource allocation process successfully completes, a new virtual cluster instance is created with the collected nodes (message 1.5) and the currently running SLURM controller is reconfigured and restarted (messages 1.6 and 1.7).

At this point, if the reconfiguration operation successfully completes, a SLURM daemon instance is spawned on each node of the virtual cluster (messages 2 and 2.1). If the reconfiguration fails, instead, the virtual cluster is destroyed by releasing all its nodes (messages 3 and 3.1) and the error is propagated back to the user.

If all the operations complete successfully, the randomly generated name of the virtual cluster is returned to the user to be used in future srun or vrelease invocations.

2.4 Implementing new provisioners

The default implementation ships with two different provisioners: the multilocal implementation is intended for testing purposes and spawns multiple SLURM daemons with different names on the local node, while the remotevirt implementation runs SLURM daemons on virtual machines spawned on remote nodes (the remotevirt implementation is explained in further detail in chapter 3: Physical cluster provisioner).

New custom provisioners can easily be added to the VURM controller by implementing the IProvisioner interface and registering an instance of such a class with the VurmController instance. Listing 2.7 shows how a custom provisioner instance providing the correct interface can be registered with the VURM controller.

Listing 2.7: vurm/bin/vurmctld.py

    # Build controller
    ctld = controller.VurmController(config, [
        # Custom defined provisioner
        # Instantiate and configure the custom provisioner directly here or
        # outside the VurmController constructor invocation.
        # If the custom provisioner is intended as a fallback provisioner,
        # put it after the main one in this list.
        customprovisioner.Provisioner(customArgument, reactor, config),

        remotevirt.Provisioner(reactor, config),   # Default shipped provisioner
        multilocal.Provisioner(reactor, config),   # Testing provisioner
    ])

One of the foreseen future improvements is the adoption of a plugin architecture to implement, publish and configure provisioners; refer to the conclusions chapter for more information about this topic.
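To make the required contract more tangible, the following hypothetical sketch shows what a trivial provisioner serving a fixed pool of already running SLURM hosts could look like. The method names come from the IProvisioner and INode interfaces shown in Figure 2.6; everything else (the class names, the synchronous return values, the assumption that names is a list of node names to use) is an illustrative assumption and not part of the actual VURM API.

    class StaticNode(object):
        """Hypothetical INode realization backed by an already running host."""

        def __init__(self, nodeName, hostname, port):
            self.nodeName = nodeName
            self.hostname = hostname
            self.port = port

        def spawn(self):
            # Start a SLURM daemon on the underlying resource (no-op here).
            pass

        def release(self):
            # Free the underlying resource (no-op here).
            pass

        def getConfigEntry(self):
            # Fragment merged into the SLURM configuration for this node.
            return 'NodeName={0} NodeHostname={1} Port={2}'.format(
                self.nodeName, self.hostname, self.port)

    class StaticProvisioner(object):
        """Hypothetical IProvisioner realization serving a fixed host pool."""

        def __init__(self, hosts):
            self.hosts = list(hosts)  # e.g. [('10.0.0.50', 6818), ...]

        def getNodes(self, count, names, config):
            # Hand out at most `count` nodes; the controller will ask the
            # next registered provisioner for whatever is still missing.
            nodes = []
            while self.hosts and len(nodes) < count:
                hostname, port = self.hosts.pop()
                nodes.append(StaticNode(names.pop(0), hostname, port))
            return nodes

An instance of such a class would then be passed to the VurmController constructor in the same way as shown in Listing 2.7.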
Figure 2.7: Resource provisioning workflow

Chapter 3 Physical cluster provisioner

A resource provisioner is an entity able to provision resources, in the form of SLURM daemons, upon request by the VURM controller. The VURM architecture allows multiple, stackable resource provisioners and comes with two different implementations. One of them, the physical cluster provisioner, is further described in this chapter. For additional information about the exact role of a resource provisioner, how it fits into the VURM architecture and which tasks it has to be able to accomplish, refer to the Architecture chapter (chapter 2).

The physical cluster provisioner is conceived to manage Linux based HPC clusters. It is able to execute SLURM daemons on virtual machines spawned on the physical nodes of the cluster and to feed them to the SLURM controller. Additionally, this module is also responsible for managing virtual machine migration between physical nodes to better exploit the available computing resources.

The different aspects covered by this chapter are structured as follows:

• Section 3.1 describes the provisioner-specific architecture, including the description of both internal and external software components, networking configuration, etc. A class diagram for the overall provisioner architecture is presented as well;

• Section 3.2 describes all the entities involved in the deployment of a remotevirt-based VURM setup;

• Section 3.3 introduces some libvirt-related concepts and conventions and describes the eXtensible Markup Language (XML) domain description file used to configure the guests;

• Section 3.4 deals with different aspects of the virtual machine lifecycle, such as setup, efficient disk image cloning, different IP address retrieval techniques or public key exchange approaches.

Aspects directly bound to VM migration between physical nodes will not be discussed in this chapter. Refer to the Migration chapter (chapter 4) for additional information about this topic.

3.1 Architecture

The cluster provisioner takes advantage of the abstraction provided by libvirt to manage VMs on the different nodes of the cluster. Each node is thus required to have libvirt installed and the libvirt daemon (libvirtd) running.

Libvirt already implements support for remote management, offering good security capabilities and different client interfaces. Unfortunately, but understandably, the exposed functionalities are strictly virtualization oriented. As the VURM cluster provisioner needs to carry out additional, more complex operations on each node individually (IP address retrieval, disk image cloning, SLURM daemon management, etc.), an additional component is required on these nodes. This component was implemented in the form of a VURM daemon, which exposes the required functionalities and manages the communication with the local libvirt daemon.

An additional remotevirt controller was implemented to centralize the management of the VURM daemons distributed on the different nodes. This controller is implemented inside the physical cluster provisioner package and is a completely different entity from the VURM controller daemon used to manage resources at a higher abstraction level.

The class diagram in Figure 3.1 illustrates the whole architecture graphically. The entities on the right (enclosed by the yellow rounded rectangle) are part of the VURM daemon running on each single node, while the rest of the entities in the vurm.provisioners.remotevirt package are part of the provisioner interface and the remotevirt controller implementation.
Together with the Domain class, the PhysicalNode, VirtualCluster and System classes make it possible to model a whole physical cluster, its physical nodes, the various created virtual clusters and their respective domains (virtual nodes). Each PhysicalNode instance is bound to the remote daemon sitting on the actual computing node by a DomainManager instance. This relationship is also exploited by the Domain instances themselves for operations related to the management of existing VMs, such as VM release or (as introduced later on) VM migration.

The remotevirt daemons expose a DomainManager instance for remote operations, such as domain creation/destruction and SLURM daemon spawning. The domain management operations are carried out through a connection to the local libvirt daemon, accessed through the bindings provided by the libvirt package. Operations which have to be carried out on the VM itself, instead, are executed by exploiting a Secure Shell (SSH) connection provided by the SSHClient object.

More details about the implemented interfaces (INode and IProvisioner) can be found in subsection 2.2.3 of the Architecture chapter. The next section offers additional details about how the components of this specific provisioner fit into the global system setup and defines the relationships with other system entities for a correct deployment.

Figure 3.1: Overall architecture of the remotevirt provisioner

3.2 Deployment

The Architecture section introduced the different entities involved in the management of VMs on a cluster using the physical cluster provisioner. A complete VURM system based on this provisioner is even more complex because of the additional components needed to manage the overall virtualized resources.

This section introduces and explains the deployment diagram of a complete VURM setup which uses the physical cluster provisioner as its only resource provisioner. All the explanations provided in this section refer to the deployment diagram in Figure 3.2.

The deployment diagram illustrates the most distributed setup currently possible to achieve. All the different logical nodes can be placed on a single physical node if needed (either for testing or for development purposes).

For illustration purposes, four different styles were used to draw the different components and artifacts present in the diagram. A legend is available in the bottom left corner of the diagram and defines the coding used for the different formatting styles.
The following list provides a more detailed description of the semantics of each style:

• The first formatting style was used for externally provided entities which have to be installed by a system administrator and which the VURM library uses to carry out the different tasks. All these components are provided by either SLURM or libvirt;

• The second formatting style was used for artifacts that have to be provided either by the system administrator (mainly configuration files) or by the end user (OS disk images, key pairs, etc.). The only component in this category is the IP address sender. In the normal case, this component consists of a small shell script, as described in subsection 3.4.1. A simple but effective version of this script is provided in Listing 3.4;

• Components in the third category are provided by the VURM library. These components have to be installed by a system administrator in the same way as the SLURM and libvirt libraries;

• The fourth category contains artifacts which are generated at runtime by the different intervening entities. These artifacts do not need to be managed by any third party.

The rest of this section is dedicated to explaining the role of each represented component in deeper detail.

Commands

The srun, scontrol, valloc and vrelease components represent either SLURM or VURM commands. As illustrated in the diagram, each command talks directly to the respective entity (the SLURM controller for SLURM commands or the VURM controller for VURM commands). This difference is abstracted away from the user perspective and does not involve additional usage complexity. Additional, not illustrated, commands may be available; the SLURM library comes with a plethora of additional commands, such as sview, sinfo, squeue, scancel, etc.

VurmController and RemotevirtProvisioner

The VurmController component represents the VURM controller daemon, configured with the RemotevirtProvisioner. The provisioner component represents both the IProvisioner realization itself and the remotevirt controller introduced in the previous section.

Both entities can be configured through a single VURM configuration file. This file contains the configuration for the overall VURM system and also defines the different physical nodes available to the remotevirt provisioner. The user manual in Appendix A offers a complete reference for this configuration file.

Domain description file

The domain description file is used as a template to generate an actual domain description to pass to the VURM daemons. This file describes a libvirt domain using an XML syntax defined by libvirt itself. Section 3.3 offers an overview of the different available elements, describes the additions made by VURM and presents part of the default domain description file.

VurmdLibvirt, Libvirt and KVM/QEMU

The three components placed on the physical node are responsible for carrying out the management operations requested by the controller. Despite being represented only once in the deployment diagram, a VURM system using the remotevirt provisioner normally consists of one controller node and many physical nodes. The same applies to the virtual node: a physical node is able to host several virtual nodes.

The VurmdLibvirt component mainly offers a facade for more complex operations over the libvirt daemon and is responsible for the local housekeeping of the different assets (mainly disk image clones and state files).
The libvirt daemon and the hypervisor of choice, in this case KVM/QEMU, are used to effectively create, destroy and migrate the VMs themselves. The dedicated port represented between the VurmdLibvirt and the IP address sender components is needed to obtain the IP address from the VM once one has been assigned to it. The IP address retrieval technique is described in deeper detail in subsection 3.4.1.

AMP protocol

The SLURM commands talk to the SLURM controller using a custom SLURM binary protocol, while the VURM commands adopt the Asynchronous Messaging Protocol (AMP). The same protocol is also adopted by the remotevirt provisioner to communicate with the daemons placed on each single node.

AMP [15] is an asynchronous, key/value pair based protocol with implementations in different languages. The asynchronous nature of AMP makes it possible to send multiple request/response pairs over the same connection, without having to wait for the response to a previous request before sending a new one. The adoption of AMP as the default protocol between the VURM entities allows a higher degree of scalability to be reached, thanks to the reuse and multiplexing of different channels over a single TCP/IP connection.

Figure 3.2: Complete VURM + remotevirt deployment diagram (client node hosting the srun, scontrol, valloc and vrelease commands; controller node hosting the SlurmController, VurmController and RemotevirtProvisioner components together with the SLURM and VURM configuration files and the domain description file; physical nodes hosting the VurmdLibvirt, Libvirt and KVM/QEMU components and the VURM private key; virtual nodes hosting the Slurmd component, the IP address sender and the VURM public key; base disk image and clones on a shared NFS storage location; SLURM binary protocol, AMP over TCP/IP and the libvirtd protocol as communication channels)

3.3 Libvirt integration

The libvirt library provides an abstraction layer to manage virtual machines by using different hypervisors, both locally and remotely. It offers an API to create, start, stop, destroy and otherwise manage virtual machines and their respective resources, such as network devices, storage devices, processor pinnings, hardware interfaces, etc. Additionally, through the libvirt daemon, it accepts incoming remote connections and allows complete exploitation of the API from remote locations, while accounting for authentication and authorization related issues.

The libvirt codebase is written in C, but bindings for different languages are provided as part of the official distribution or as part of external packages. The currently supported languages are C, C#, Java, OCaml, Perl, PHP, Python and Ruby. For the scope of this project the Python bindings were used.

Remote connections can be established using raw Transmission Control Protocol (TCP), Transport Layer Security (TLS) with x509 certificates or SSH tunneling. Authentication can be provided by Kerberos and Simple Authentication and Security Layer (SASL).
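As an illustration, opening a remote connection through the Python bindings only requires passing the appropriate connection URI. The following short sketch is not taken from the VURM sources and uses placeholder hostnames:

import libvirt

# Local hypervisor, through the unix socket of the local libvirt daemon
local = libvirt.open("qemu:///system")

# Remote hypervisor, tunnelled over SSH (placeholder hostname)
remote = libvirt.open("qemu+ssh://root@node01.example.org/system")

# Remote hypervisor, over TLS with x509 certificates
secure = libvirt.open("qemu+tls://node02.example.org/system")

for connection in (local, remote, secure):
    print(connection.getHostname())
    connection.close()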
Authentication of local connections (using unix sockets) can be controlled with PolicyKit. To abstract the differences between VM management over different hypervisors, libvirt adopts a special naming convention and defines its own XML syntax to describe the different involved resources. This section aims to introduce the naming convention, and the XML description file. The XML description file subsection includes the explications of the changes introduced by the VURM virtualization model. 3.3.1 Naming conventions Libvirt adopts the following naming conventions to describe the different entities involved in the VM lifecycle: • A node is a single physical machine; • An hypervisor is a layer of software allowing to virtualize a node in a set of virtual machines with possibly different configurations than the node itself; • A domain is an instance of an operating system running on a virtualized machine provided by the hypervisor. Under this naming convention and for the scope of the VURM project, the definition of VM or of virtual node is thus the same as the definition of domain. 3.3.2 XML description file The XML description file is the base for the definition of a new domain. This file contains all needed information, such as memory allocation, processor pinning, hypervisor features to enable, disks definition, network interfaces definitions, etc. Given the wide range of configuration options to support, a complete description of the available elements would be too long to include in this section. The complete reference is available online at [8]. For the purposes of this project, the description of the elements contained in the XML description presented in Listing 3.1 is more than sufficient. 26 CHAPTER 3. PHYSICAL CLUSTER PROVISIONER Listing 3.1: Basic libvirt XML description file 1 2 <domain type="kvm"> <name>domain-name</name> 3 4 <memory>1048576</memory> 5 6 <vcpu>2</vcpu> 7 8 9 10 11 <os> <type arch="x86_64">hvm</type> <boot dev="hd"/> </os> 12 13 14 15 16 <features> <acpi/> <hap/> </features> 17 18 19 <devices> <emulator>/usr/bin/qemu-system-x86_64</emulator> 20 21 22 23 24 25 <disk type="file"> <driver name="qemu" type="qcow2"/> <source file="debian-base.qcow2"/> <target dev="sda" bus="virtio"/> </disk> 26 27 28 29 30 31 32 33 <interface type="bridge"> <source bridge="br0"/> <target dev="vnet0"/> <model type=’virtio’/> </interface> </devices> </domain> The root element for the definition of a new domain is the domain element. Its type attribute identifies the hypervisor to use. Each domain element has a name element containing a host-unique name to assign to the newly spawned VM. When using the remotevirt provisioner to spawn new virtual machines, the name element will automatically be set (and overridden if present). The memory element contains the amount of memory to allocate to this domain. The units of this value are kilobytes. In this case, 1GB of memory is allocated to the domain. The type child of the os element specifies the type of operating system to be booted in the virtual machine. hvm indicates that the OS is one designed to run on bare metal, so requires full virtualization. The optional arch attribute specifies the CPU architecture to virtualize. The features element can be used to enable additional features. In this case the acpi (useful for power management, for example, with KVM guests it is required for graceful shutdown to work) and hap (enables use of Hardware Assisted Paging if available in the hardware) features are enabled. 
The content of the emulator element specifies the fully qualified path to the device model emulator binary to be used. 3.4. VM LIFECYCLE ASPECTS 27 The disk element is the main container for describing disks. The type attribute is either file, block, dir, or network and refers to the underlying source for the disk. The optional device attribute indicates how the disk is to be exposed to the guest OS. Possible values for this attribute are floppy, disk and cdrom, defaulting to disk. The file attribute of the source element specifies the fully-qualified path to the file holding the disk. The path contained by this attribute is always resolved as a child of the path defined by the imagedir option in the VURM configuration (in the case of a VURM setup, absolute path are thus not supported). The optional driver element allows specifying further details related to the hypervisor driver used to provide the disk. QEMU/KVM only supports a name of qemu, and multiple types including raw, bochs, qcow2, and qed. The target element controls the bus/device under which the disk is exposed to the guest OS. The dev attribute indicates the logical device name. The optional bus attribute specifies the type of disk device to emulate; possible values are driver specific, with typical values being ide, scsi, virtio, xen or usb. 3.4 VM Lifecycle aspects Before being able to run the SLURM daemon, different operations have to be performed. The ultimate goal of this set of operations if to have a VM up and running. This section aims to describe the different involved steps to reach this goal and to discuss some alternative approaches for some of them. The sequence diagram in Figure 3.4 on page 33 illustrates the complete VM lifecycle. The lifecycle can be divided in three phases: the first phase (domain creation) occurs when the creation request is first received, the second phase (SLURM daemon spawning) occurs as soon as all domains of the virtual cluster were correctly created and the IP addresses received and the third phase (domain destruction) occurs once that the end-user finished to execute SLURM jobs on the virtual cluster and asks for it to be released. Two operations of the first phase, more precisely the messages 1.1 and 1.3, deserve some particular attention. The subsections 3.4.1 and 3.4.2 address these operations respectively. The last subsection is dedicated to explain some of the possible public-key exchange approaches needed to perform a successful authentication when connecting to the VM through SSH (message 3.1). Due to the nature of the adopted solution, no messages are visible on the diagram for this operation. 3.4.1 IP address retrieval The simplest way to interact with a newly started guest is through a TCP/IP connection. The VURM tools use SSH to execute commands on the guest, while the SLURM controller communicates with its daemons over a socket connection. All of these communications means require the IP address (or the hostname) of the guest to be known. At present, libvirt does not exposes any APIs to retrieve a guests IP address and a solution to overcome this shortcoming has to be implemented completely by the calling codebase. During the execution of the project, different methods were analyzed and compared; the rest of this sub section is dedicated to analyze the advantages and shortcomings of each of them. 28 CHAPTER 3. 
PHYSICAL CLUSTER PROVISIONER Inspect host-local ARP tables As soon as a guest establish a TCP connection with the outside world, it has to advertise its IP address in an Address Resolution Protocol (ARP) response. This causes the other hosts on the same subnet to pick it up and cache it in its ARP table. Running the arp -an command on the host will reveal the contents of the ARP table, as presented in Listing 3.2: Listing 3.2: Example arp -an output 1 2 3 4 5 6 7 8 9 $ ? ? ? ? ? ? ? ? arp -an (10.0.0.1) at 0:30:48:c6:35:ea on en0 ifscope [ethernet] (10.0.0.95) at 52:54:0:48:5c:1f on en0 ifscope [ethernet] (10.0.1.101) at 0:14:4f:2:59:ae on en0 ifscope [ethernet] (10.0.6.30) at 52:54:56:0:0:1 on en0 ifscope [ethernet] (10.255.255.255) at ff:ff:ff:ff:ff:ff on en0 ifscope [ethernet] (129.24.176.1) at 0:1c:e:f:7c:0 on en1 ifscope [ethernet] (129.24.183.255) at ff:ff:ff:ff:ff:ff on en1 ifscope [ethernet] (169.254.255.255) at (incomplete) on en0 [ethernet] As libvirt is able to retrieve the Media Access Control (MAC) address associated with a given interface, it is possible to parse the output and retrieve the correct IP address. This method could not be adopted because ARP packets do not pass across routed networks and the bridged network layout used by the spawned domains would thus stop them. Proprietary VM Tools Some hypervisors (as, for example, VMWare ESXi or VirtualBox), are coupled with a piece of software which can be installed on the target guest and provides additional interaction capabilities, such as commands execution, enhanced virtual drivers or, in this case, IP address retrieval. Such a tool would bind the whole system to a single hypervisor and would vanish the efforts done to support the complete range of libvirt-compatible virtual machine monitors. Additionally, the KVM/Qemu pair, chosen for their live migration support, does not provide such a tool. Static DHCP entries A second possible method is to configure a pool of statically defined IP/MAC address couples in the Dynamic Host Configuration Protocol (DHCP) server and configure the MAC address in the libvirt XML domain description file in order to know in advance which IP address will be assigned to the VM. This method could have worked and is a viable alternative; it was however preferred to not depend on external services (the DHCP server, in this case) and provide a solution which can work in every context. A configuration switch to provide a list of IP/MAC address pairs instead of the adopted solution can although be considered as a future improvement. Serial to TCP data exchange The last explored and successfully adopted method takes advantage of the serial port emulation feature provided by the different hypervisors and the capability to bind them to (or connect them with) a TCP endpoint of choice. 3.4. VM LIFECYCLE ASPECTS 29 In the actual implementation, the VURM daemon listens on a new (random) port before it spawns a new VM; the libvirt daemon will then provide a serial port to the guest and connect it to the listening server. Once booted, a little script on the guest writes the IP address to the serial port and the TCP server receives it as a normal TCP/IP data stream. Although not authenticated nor encrypted, this communication channel is deemed secure enough as no other entity can possibly obtain relevant information by eavesdropping the communication (only an IP address is exchanged). 
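On the host side, the receiving end of this channel amounts to little more than a short-lived TCP server bound to the loopback interface. The following standard-library sketch illustrates the idea only; the actual remotevirt daemon is built on an asynchronous framework and the names used here are hypothetical:

import socket

def waitForGuestAddress(startDomain, timeout=300):
    """Listen on a random loopback port, create the domain and return the
    IP address written by the guest to its emulated serial port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
    server.listen(1)
    server.settimeout(timeout)

    # The chosen port is substituted for $PORT in the domain description
    # before the domain is created (see Listing 3.3 below).
    startDomain(server.getsockname()[1])

    # The hypervisor connects the guest serial port to this server; the
    # guest then writes its IP address to /dev/ttyS0 (see Listing 3.4 below).
    connection, _ = server.accept()
    address = connection.recv(64).decode().strip()
    connection.close()
    server.close()
    return address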
Additionally, as it’s the hypervisor task to establish the TCP connection directly to the local host, there is no danger for man-in-the-middle attacks (and thus invalid information being transmitted).1 The Listing 3.3 contains the needed libvirt device configuration: Listing 3.3: Libvirt TCP to serial port device description 1 2 3 4 5 6 <serial type="tcp"> <source mode="connect" host="127.0.0.1" service="$PORT"/> <protocol type="raw"/> <target port="1"/> <alias name="serial0"/> </serial> The shell script on the guest needed to write the IP address to the serial device can be as simple as the following: Listing 3.4: Shell script to write the IP address to the serial port 1 2 #!/bin/sh ifconfig eth0 | grep -oE ’([0-9]{1,3}\.){3}[0-9]{1,3}’ | head -1 >/dev/ttyS0 3.4.2 Volumes cloning During the virtual cluster creation phase, the user has the possibility to choose which disk image all the created VMs will be booted from. As a virtual cluster is formed by multiple VMs, and as each virtual machine needs its own disk image from which read from and to which write to, multiple copies of the base image have to be created. Performing a raw copy of the base image presents obvious limitations: firstly each copy occupies the same amount of space as the original image, which rapidly leads to fill up the available space on the storage device, and secondly, given the necessity of these images to live on a shared file system (see chapter 4 for the details), the copy operation can easily fill up the available network bandwidth and take a significant amount of time. To overcome these drawbacks, a particular disk image format called Qemu Copy On Write, version 2 (QCOW2) is used. This particular format has two useful features: • A qcow2 image file, unlike a raw disk image, occupies only the equivalent of the size of the data effectively written to the disk. This means that it is possible to create a 40GB disk image which, once a base OS is installed on it, occupies only a couple of GBs2 ; • It is possible to create a disk image based on another one. Read data is retrieved from the base image until the guest modifies it. Modified data is written only to the new image, while leaving the base image intact. The Figure 3.3 represents this process graphically. 30 CHAPTER 3. PHYSICAL CLUSTER PROVISIONER guest operations read read read File1 v0 File2 v0 File1 v0 read read File1 v1 File1 v1 qcow2 image base image write File2 v0 File1 v0 File2 v0 Time Figure 3.3: Copy-On-Write image R/W streams The combination of these two features allows to create per-VM disk images in both a time- and space-efficient manner. 3.4.3 Public key exchange As anticipated in section 3.1, the VURM daemon executes commands on the guest through a SSH connection. The daemon authentication modes were deliberately restricted, for security reasons, to public key authentication only. This restriction requires that the daemon public key is correctly installed on the guest disk image. As the disk image is customizable by the user, it is necessary to define how this public key will be installed. Different techniques can be used to achieve this result; the rest of this subsection is dedicated to analyze three of them. Public key as a requirement The simplest (from the VURM point of view) solution is to set the setup of a VURM-specific user account with a given public key as a requirement to run a custom disk image on a given VURM deployment. This approach requires the end user to manually configure its custom disk image. 
The obvious drawback is the inability to run the same disk image on different VURM deployments as the key pair differs from setup to setup.3 The final chosen solution is based on this approach, given that the disk image has to be specifically customized because of other parameters anyway. Refer to the section A.4 in the User manual for more information about the VM disk image creation process. 1 The daemon only binds to the loopback interface; a malicious process should be running on the same host to possibly connect to the socket and send a wrong IP address 2 Other disk image formats support this feature too. For example qcow (version 1) or VMDK (when using the sparse option). 3 This is, however, a very minor drawback; the percentile of users running jobs on different clusters is very low, and even so, the disk image probably needs adjustments of other parameters anyway. 3.5. POSSIBLE IMPROVEMENTS 31 Pre-boot key installation Another explored technique consists of mounting the disk image on the host, copying the public key to the correct location, unmounting the image and then starting the guest. The mounting, copying and unmounting operation was made straightforward by the use of a specific image manipulation library called libguestfs. For security reasons, this library would spawn a new especially crafted VM instead of mounting the image directly on the host; this made the whole key installation operation a long task when compared with the simplicity of the problem (mounting, copying and unmounting took an average of 5 seconds during the preliminary tests). The poor performances of the operation, its different dependencies (too many for such a simple operation) and its tight coupling with the final public key path on the guest, led this approach to be discarded in the early prototyping phase. IP-Key exchange As seen in the previous section, the final solution adopted to retrieve a guest IP address, already requires to setup a communication channel between the VURM daemon and the guest. The idea of this approach is to send back the public key to the guest as soon as the IP address is received. In this case, the guest would first write the IP address to the serial port and then read the public key and store it to the appropriate location. An extension to the shell script 3.4 which saves the public key as a valid key to login as root is presented in the Listing 3.5: Listing 3.5: Shell script to exchange the IP address and a public key over the serial port #!/bin/sh ifconfig eth0 | grep -oE ’([0-9]{1,3}\.){3}[0-9]{1,3}’ | head -1 >/dev/ttyS0 mkdir -p /root/.ssh chmod 0755 /root/.ssh cat /dev/sttyS0 >>/root/.ssh/authorized_keys chmod 0644 /root/.ssh/authorized_keys 1 2 3 4 5 6 This technique was firstly adopted, but then discarded in favor of the more proper requirementbased solution as bidirectional communication using a serial port revealed to be too complicated to implement correctly in a simple shell script.4 A proper solution would have required the installation of a custom built utility; copying a simple file (the public key) was considered a less troublesome operation than installing yet another utility along with its dependencies. 3.5 3.5.1 Possible improvements Disk image transfer The current implementation takes advantage of the Network File System (NFS) support already required by libvirt5 to exchange disk images between the controller and the physical nodes. 
On clusters consisting of many nodes, the shared network location becomes a bottleneck as soon as multiple daemons try to access the same disk image to start new VMs.

4 Synchronization and timing issues which could not simply be solved by using timeouts were detected when running slow guests.
5 NFS support is required by libvirt and the qemu driver for VM migration; more details are given in chapter 4.

A trivial improvement would consist of a basic file exchange protocol to transfer the disk image between the controller and the daemons, while optimizing and throttling this process in order not to saturate the whole network. Like the NFS based solution, this trivial approach also exposes different, non-negligible drawbacks and can be further optimized. The nature of the task, distributing a single big file to a large number of entities, seems to be a perfect fit for the BitTorrent protocol, as highlighted in the official protocol specification [5]:

BitTorrent is a protocol for distributing files. [...] Its advantage over plain HTTP is that when multiple downloads of the same file happen concurrently, the downloaders upload to each other, making it possible for the file source to support very large numbers of downloaders with only a modest increase in its load.

A more advanced and elegant solution would thus implement disk image transfers using the BitTorrent protocol instead of a more traditional single-source distribution methodology.

3.5.2 Support for alternative VLAN setups

The current implementation bridges the VM network connection directly to the physical network card of the host system. VMs thus appear on the network as standalone devices and share the same subnet as their hosts.

It would be possible to create an isolated Virtual LAN (VLAN) subnet for each virtual cluster in order to take advantage of the different possibilities offered by such a setup, as for example traffic control patterns and quick reactions to inter-subnet relocations (useful in case of VM migrations). This improvement would add even more flexibility to the physical cluster provisioner and enable even more transparent SLURM setups spanning multiple clusters residing in different physical locations. It is useful to note that different private cloud managers already implement and support VLAN setups to various degrees (the example par excellence being Eucalyptus [7] [9]).

Figure 3.4: Complete domain lifecycle (sequence diagram between the remotevirt provisioner, the DomainManager, the libvirt connection, the Domain and the SSH connection; main messages: 1: createDomain(description), 1.1: clone disk image, 1.2: update description with disk image path, 1.3: start new IP address receiver server, 1.4: add serial to TCP device to description, 1.5: createLinux(description), 2: sendIPAddress(ip), 3: spawnDaemon(domain, slurmConfig), 3.1: connect(username, key), 3.2: transferFile(slurmConfig, '/etc/slurm.conf'), 3.3: executeCommand('slurmd'), 3.4: disconnect(), 4: destroyDomain(domainName), 4.2: remove image clone)

Chapter 4

Migration

Allocating resources to jobs is a difficult task to execute optimally and often depends on a plethora of different factors. Most job schedulers are conceived to schedule jobs and allocate resources to them upfront, with only limited abilities to intervene once a job has entered the running state.
Process migration, or, in this specific case, VM migration, allows to dynamically adjust resource allocation and job scheduling as new information becomes available even if a job was already started. This runtime adjustment is applied to the system by moving jobs from one resource to another as necessity arises. The Figure 4.5 provides an example of the concept of VM migration applied to the scheduling of virtual clusters on resources managed by the Physical cluster provisioner described in chapter 3. In this scenario, a virtual clusters is resized through VMs migration in order to take account for an additional job submitted to the system. In the illustration, each square represents a physical node, each circle a virtual machine (using different colors for VMs belonging to different virtual clusters) and each shaded rectangle a virtual cluster. (a) Low load operation (b) Migration (c) High load operation (d) Restoration Figure 4.1: Example of virtual cluster resizing through VMs migration. The step 4.1a illustrates the state of the different nodes in a stable situation with one virtual cluster consisting of twelve virtual machines. The step 4.1b shows how the system reacts as soon as a new virtual cluster request is submitted to the system: some VMs are migrated to already busy nodes of the same cluster (oversubscription) to free up resources for the new virtual cluster. The freed up resources are then allocated to the new virtual cluster as shown 35 36 CHAPTER 4. MIGRATION in step 4.1c. The step 4.1d shows how an optimal resource usage is restored once one of the virtual clusters is released and more resources become available again. Basic support for this scheduling and allocation enhancements is provided by the current VURM implementation. In order to be able to account for different allocation, scheduling and migration techniques and algorithms, the different involved components are completely pluggable and can easily be substituted individually. In order to further describe the different components and the implemented techniques and algorithms, the present chapter is structured in the following sections: • Section 4.1 introduces the migration framework put in place to support alternate pluggable resource allocators and migration schedulers by describing the involved data structures and the responsibilities of the different components; • Section 4.2 describes the resources allocation strategy implemented in the default resource allocator shipped with the VURM implementation; • Similarly as done for the previous section, section 4.3 describes the default migration scheduling strategy and offers insights over the optimal solution; • Section 4.4 introduces the different approaches to VM migration and describes the implemented solution as well as the issues encountered with the alternate techniques; • To conclude, section 4.5 resumes some of the possible improvements which the current VURM version could take advantage of but which weren’t implemented due to different constraints. 4.1 Migration framework Different job types, different execution contexts or different available resources are all factors which can led to chose one given resource allocation or job scheduling algorithm over another. A solution which works equally well in every case does not exists and as such, users and system administrators, want to be able to adapt the strategy to the current situation. 
In order to provide support for configurable and pluggable schedulers and allocators, a simple framework was put in place as part of the Physical cluster provisioner architecture. This section describes the different intervening entities, their responsibilities and the data structures exchanged between them.

Figure 4.2 illustrates the class diagram of the migration framework. Most of the represented classes were taken from the physical cluster provisioner class diagram (Figure 3.1 on page 21); newly added classes, attributes and operations are highlighted using a lighter background color. As shown in the class diagram, three new interfaces were added to the architecture. Each of these interfaces is responsible for a different task in the whole migration operation. The members newly added to the already existing classes, together with the Migration class, were introduced to help the three interface realizations make the right decisions and communicate them back to the resource provisioner for it to carry them out.

Figure 4.2: Migration framework components (class diagram adding the IResourcesAllocator, IMigrationScheduler and IMigrationManager interfaces, with the operations allocate(system), schedule(migrations) and migrate(domain, srcNode, dstNode) respectively, and the Migration class with migrate(), waitToComplete() and addDependency(migration), to the existing System, VirtualCluster, PhysicalNode and Domain classes)

4.1.1 Node and cluster weighting

The current version of the VURM remotevirt provisioner makes it possible to assign a fixed Computing Unit (CU) value to each physical node PN in the system. A computing unit is simply a relative value comparing a node to the other nodes in the system: if Node A has a CU value of 1 while Node B has a value of 2, then Node B is assumed to be twice as powerful as Node A. Note that these values do not express any metric about the absolute computing power of either node, but only the relationship between them.

The total power of the system CU_sys is defined as the sum of the computing units of all nodes in the system (Equation 4.1). It is possible to access these values by reading the computingUnits property of a PhysicalNode instance or by calling the getTotalComputingUnits method on the System instance.

CU_{sys} = \sum_{i=0}^{n} CU_{PN_i}    (4.1)

Similarly to what is done for physical nodes, it is possible to assign a fixed priority P to a virtual cluster VC at creation time. As for the CUs, the priority is also a value relative to the other clusters in the system: if Virtual Cluster A and Virtual Cluster B have the same priority, they will both have access to the same amount of computing power.

The weight of a virtual cluster W_VC is defined as the ratio between the priority of the cluster and the sum of the priorities of all the clusters in the system (Equation 4.2); this value indicates the fraction of computing power the virtual cluster is entitled to.
W_{VC_i} = \frac{P_{VC_i}}{\sum_{j=0}^{n} P_{VC_j}}    (4.2)

The exact amount of computing power assigned to a cluster CU_VC is easily calculated by multiplying the weight by the total computing power of the system (Equation 4.3).

CU_{VC_i} = W_{VC_i} \cdot CU_{sys}    (4.3)

Methods to calculate all these values are provided by the VURM implementation itself and do not need to be reimplemented by the resource allocators. The next subsection illustrates how these values and the provided data structures are used by the three interface realizations to effectively reassign resources to different VMs, schedule migrations in an optimal way and apply a given migration technique.

4.1.2 Allocators, schedulers and migration managers

The class diagram in Figure 4.2 adds three new interfaces to the physical cluster provisioner architecture: the IResourcesAllocator realization is in charge of deciding which VMs have to be migrated to which physical node, the IMigrationScheduler realization schedules migrations by defining dependencies between them in order to optimize the execution time, and the IMigrationManager realization is responsible for actually carrying out each single migration.

Each of these three interfaces has to be implemented by one or more objects and passed to the remotevirt provisioner at creation time. The interactions between the provisioner and these three components of the migration framework are illustrated in the sequence diagram in Figure 4.3 on page 39. Each of the intervening components is further described in the following paragraphs.

Resource allocator

The allocate method of the configured resource allocator instance is invoked when a resource reallocation is deemed necessary by the provisioner (refer to the next subsection for more information about the exact migration triggering strategy). The method takes the current System instance as its only argument and returns nothing. It can reallocate resources by migrating virtual machines from one physical node to another. To do so, the implemented allocation algorithm has to call the scheduleMigrationTo method of the Domain instance it wants to migrate. Scheduled migrations are not carried out immediately but deferred for later execution, in order to allow the migration scheduler to optimize their concurrent execution.

Migration scheduler

Once the resource allocator has decided which virtual machines need to be moved to a different node and the respective migrations have been scheduled, the schedule method of the migration scheduler instance is invoked. This method receives a list of Migration instances as its only argument and returns a similar data structure. Its task is to define dependencies between migrations in order to optimize the resource usage due to their concurrent execution.
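Rendered as Python classes, the three interfaces amount to little more than the operations already listed in Figure 4.2. The following sketch is a hypothetical transcription of those signatures using abstract base classes; the actual VURM code may declare them differently (for example through zope.interface, as is common in Twisted-based projects):

import abc


class IResourcesAllocator(abc.ABC):
    """Decides which domains should run on which physical nodes."""

    @abc.abstractmethod
    def allocate(self, system):
        """Inspect the System instance and call scheduleMigrationTo() on the
        domains which have to be moved to a different node; returns nothing."""


class IMigrationScheduler(abc.ABC):
    """Orders the scheduled migrations by adding dependencies between them."""

    @abc.abstractmethod
    def schedule(self, migrations):
        """Receive the list of scheduled Migration instances and return the
        (possibly filtered) list of migrations to be executed."""


class IMigrationManager(abc.ABC):
    """Carries out a single migration using a concrete migration technique."""

    @abc.abstractmethod
    def migrate(self, domain, srcNode, dstNode):
        """Move the given domain from srcNode to dstNode."""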
Figure 4.3: Collaboration between the different components of the migration framework (sequence diagram between the provisioner, the resource allocator, the Domain and Migration instances, the migration scheduler and the migration manager; main messages: 1: allocate(system), 1.1: scheduleMigrationTo(node), 1.1.1: create Migration, 1.1.2: register for later execution, 1.2: allocation done, 2: schedule(migrations), 2.1: addDependency(migration), 2.2: remove from list, 2.3: list of migrations, 3: migrate, 3.1: wait for required migrations to terminate, 3.2: migrate(domain, from, to), 3.3: notify dependent migrations)

To add a new dependency, the implemented algorithm has to call the addDependency method of the dependent Migration instance. All the migration instances still present in the returned list will be executed while respecting the established dependencies; it is thus possible to prevent a migration from happening simply by not inserting it in the returned list.

Migration manager

Different migration techniques (e.g. offline, iterative pre-copy, demand migration, ...) are available to move a virtual machine from one physical node to another.1 It is the responsibility of the migration manager to decide which migration technique to apply to each given migration.

Each migration contained in the list returned from the schedule invocation is triggered by the provisioner. When told to start, the Migration instances wait for all migrations they depend on to finish and subsequently defer their execution to the responsible MigrationManager instance by calling its migrate method. Lastly, once the migration is terminated, each migration instance notifies the dependent migrations that they can now start.

1 A deeper analysis of the different techniques is presented in section 4.4.

4.1.3 Migration triggering

The whole resource allocation, migration scheduling and actual migration execution is triggered by the provisioner in exactly three cases:

1. Each time a new virtual cluster is created. This trigger waits for a given (configurable) stabilization period to be over before effectively starting the process. Thanks to the stabilization interval, if two virtual cluster creation or release requests are received within a short time span, only one scheduling will take place;

2. Each time a virtual cluster is released. As for the creation trigger, the stabilization interval is applied in this case too. The stabilization interval is reset in both cases, regardless of the type of request (creation or release) which is received;

3. At regular intervals. If no clusters are created or released, a resource reallocation is triggered at a regular and configurable interval. Interval triggers are momentarily suspended while the triggering system is waiting for the stabilization period to be over.

4.2 Allocation strategy

The remotevirt provisioner ships with a default resource allocator implementation called SimpleResourceAllocator. This section describes the implemented resource allocation algorithm.

The SimpleResourceAllocator allocates resources on a static system view basis. This means that each time a new resource allocation request is made, the algorithm only cares about the current system state and does not take into account previous states.
Additionally, the algorithm bases its decisions only on the cluster priorities and the CUs assigned to the physical nodes. The implemented algorithm can be divided into four different steps; each of these steps is further described in the remaining part of this section.

4.2.1 Cluster weighing

This step takes care of assigning an integer computing unit value to each cluster while accounting for rounding errors. An integer value is needed because, in this simple allocator version, nodes are not shared between clusters. The pseudocode of the algorithm used to round the values calculated using Equation 4.3 is reported in Listing 4.1.

Listing 4.1: CU rounding algorithm

# Create a list of computing units and clusters tuples
computingUnits = [(c.getComputingUnits(), c) for c in clusters]

# Calculate the quotient and the remainder of each computing unit
# (lists are used instead of tuples so that the integer part can be adjusted later)
computingUnits = [[cu % 1, int(cu), c] for cu, c in computingUnits]

# Calculate the remainder of the truncated computing units sum
# (the fractional parts always add up to a whole number)
remainder = int(round(sum(rem for rem, _, _ in computingUnits)))

# Sort computing units in reverse order by remainder and create an iterator
adjustements = iter(sorted(computingUnits, key=lambda entry: entry[0], reverse=True))

# While there is some remainder left, increment the next cluster with
# the largest remainder
while remainder:
    next(adjustements)[1] += 1
    remainder -= 1

  i   P_VCi   W_VCi  |  CU_VCi with CU_sys = 10  |  CU_VCi with CU_sys = 15
                     |  Exact   Rounded   Algo   |  Exact   Rounded   Algo
  0      24   0.08   |   0.8       1        1    |   1.2       1        1
  1       6   0.02   |   0.2       0        0    |   0.3       0        0
  2      84   0.28   |   2.8       3        3    |   4.2       4        4
  3     108   0.36   |   3.6       4        4    |   5.4       5        6
  4      48   0.16   |   1.6       2        1    |   2.4       2        2
  5      30   0.10   |   1.0       1        1    |   1.5       2        2
 Totals 300   1.00   |   10       11       10    |   15       14       15

Table 4.1: Rounding algorithm example with two different CU_sys values

Table 4.1 provides an example of the application of the described algorithm to a system containing six virtual clusters. The exact value, the rounded value and the value produced by the algorithm are calculated for two different scenarios: once for a system with a total CU of 10 and a second time for the same system but with a CU value of 15.

4.2.2 Nodes to cluster allocation

Once each cluster has an integer computing units value attached to it, the algorithm goes on by assigning a set of nodes to each virtual cluster in order to fulfill its computing power requirements. The implemented algorithm assigns nodes to clusters starting with the cluster with the highest computing unit value and then proceeding in descending order. Firstly, the most powerful available node is allocated to the current cluster; the algorithm then iterates over all available nodes in descending computing power order and assigns them to the cluster as long as the cluster computing power is not exceeded. Listing 4.2 illustrates a simplified version of the implemented algorithm. Note that the allocated computing units value (currentPower) is also returned, as it can differ from the threshold value (assignedComputingUnits).

Listing 4.2: Nodes to virtual cluster allocation

# Initialize data structure and variables
clusterNodes = [availableNodes.pop(0)]
currentPower = clusterNodes[0].computingUnits

# If the power is not already exceeded
if currentPower < assignedComputingUnits:
    for n in list(availableNodes):  # iterate over a copy, as nodes are removed below
        # If adding this node to the cluster does not exceed the assigned CUs
        if currentPower + n.computingUnits < assignedComputingUnits:
            availableNodes.remove(n)    # Move the node...
            clusterNodes.append(n)      # ...to the cluster nodes list
            currentPower += n.computingUnits

# Return the real computing power of the virtual cluster and the assigned nodes
return currentPower, clusterNodes

4.2.3 VMs to nodes allocation

This third step of the algorithm assigns a certain number of VMs to each node of the cluster, based on the computing units of the node, the total computing units assigned to the virtual cluster and the number of VMs running inside it. The exact number of virtual machines VM_PNi assigned to a node i of the virtual cluster j is calculated as shown in Equation 4.4. Once the exact value is calculated for each physical node of the virtual cluster, the same rounding algorithm as used for the Cluster weighing step is applied.

VM_{PN_i} = VM_{VC_j} \cdot \frac{CU_{PN_i}}{CU_{VC_j}}    (4.4)

4.2.4 VMs migration

The remaining part of the resource allocation algorithm is responsible for deciding which virtual machine has to be migrated to which physical node. This part has been divided into two distinct sets of migrations.

The first migration set is responsible for migrating all virtual machines currently running on nodes reassigned to other virtual clusters to a free node of the virtual cluster to which the virtual machine belongs. This first iteration is reported in Listing 4.3.

Listing 4.3: VMs migration from external nodes

# nodesVMCount is a list of (node, number of VMs) tuples
idleNodesIter = iter(nodesVMCount)

# Initialize the variables
node, maxCount = next(idleNodesIter)

for domain in cluster.domains:
    # If the virtual machine is running on an external node
    if domain.physicalNode not in clusterNodes:
        # If the current node is already completely subscribed
        while len(node.domainsByCluster(cluster)) >= maxCount:
            # Find the next node which has a VM slot available
            node, maxCount = next(idleNodesIter)
        # Migrate the VM to the chosen node
        domain.scheduleMigrationTo(node)

The second migration set is responsible for migrating VMs between different nodes of the same virtual cluster, in order to reach the correct number of assigned virtual machines running on each node. This is necessary because, depending on the set of nodes assigned to a cluster, a given node could be running more virtual machines than the allocated number.

Listing 4.4: Leveling node usage inside a virtual cluster

# The filterByLoad function creates three separate lists from the given
# nodesVMCount variable: a list of nodes with available slots, a list of
# already fully allocated nodes, and a list of oversubscribed nodes.
idle, allocated, oversubscribed = filterByLoad(nodesVMCount, cluster)

for node, count in oversubscribed:
    # Get an iterator over each domain belonging to cluster running on the
    # oversubscribed node
    domains = iter(node.domainsByCluster(cluster))

    for i in range(count):
        # For each domain exceeding the node capacity
        domain = next(domains)

        if not idle[0][1]:
            # Discard the current idle node if in the meantime it has
            # reached its capacity
            idle.pop(0)

        # Update the number of running domains
        idle[0][1] -= 1
        # Migrate the domain to the idle node
        domain.scheduleMigrationTo(idle[0][0])

Once the second set of migrations has been scheduled as well, the resource allocator can return control to the provisioner and wait for the next trigger to take place (message 1.2 in the sequence diagram on page 39).
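Before moving on to the scheduling strategy, the following self-contained snippet recomputes the CU_sys = 15 scenario of Table 4.1 using the rounding approach of section 4.2.1. Exact fractions are used here only to keep the example deterministic; the clusters are represented by their index instead of VirtualCluster instances:

from fractions import Fraction

# Priorities of the six virtual clusters of Table 4.1
priorities = [24, 6, 84, 108, 48, 30]
totalCUs = 15

# Exact share of computing power of each cluster (Equations 4.2 and 4.3)
exact = [Fraction(p, sum(priorities)) * totalCUs for p in priorities]

# Rounding step: keep the integer parts, then hand the remaining units to the
# clusters with the largest fractional parts
shares = [[cu - int(cu), int(cu), i] for i, cu in enumerate(exact)]
remainder = int(sum(frac for frac, _, _ in shares))
adjustments = iter(sorted(shares, key=lambda share: share[0], reverse=True))
while remainder:
    next(adjustments)[1] += 1
    remainder -= 1

allocated = [units for _, units, _ in sorted(shares, key=lambda share: share[2])]
print(allocated)   # [1, 0, 4, 6, 2, 2] -- the "Algo" column for CU_sys = 15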
4.3 Scheduling strategy

VM migration is a heavy task in terms of bandwidth and, depending on the migration strategy used, in terms of CPU usage. Running a multitude of simultaneous migrations between a common set of physical nodes can thus reduce the overall system responsiveness and saturate the network bandwidth, as well as lengthen the migration time and therefore the downtime of the single VMs.

To obviate this problem, and to allow different scheduling strategies to be put in place, the remotevirt provisioner supports a special component called migration scheduler. The responsibility of this component is to schedule migrations by establishing dependency links between the single migration tasks, in order to keep resource usage under a certain threshold while optimizing the maximum number of concurrent migrations.

The default implementation shipped with the physical cluster provisioner is a simple no-op scheduler which lets all migrations execute concurrently. A more optimized solution was not implemented due to time constraints, but the problem was analyzed on a theoretical level and formulated in terms of a graph coloring problem. This section explains the theoretical basis behind the optimal solution and provides the simple greedy algorithm for a possible future implementation.

The problem statement for which a solution is sought is formulated in the following way: find the optimal scheduling to execute all migrations using the minimum number of steps, in such a way that no two migrations involving one or more common nodes are carried out during the same step.

By representing the migrations as edges connecting vertices (i.e. the physical nodes) of an undirected2 graph, the previous formulation can be applied to the more general problem of edge coloring. In graph theory, an edge coloring of a graph is an assignment of “colors” to the edges of the graph so that no two adjacent edges have the same color. Table 4.2 lists the migrations used for the examples throughout this section. Similarly, Figure 4.4 represents the same migration set using an undirected graph.

2 The migration direction is not relevant for the problem formulation as the resource usage is assumed to be the same on both ends of the migration.

  Domain   Source Node   Dest. Node
    0          A             E
    1          A             D
    2          C             B
    3          B             D
    4          D             A
    5          A             C

Table 4.2: Migrations in tabular format

Figure 4.4: Graphical representation (the same migration set drawn as an undirected graph over the nodes A to E)

Once the edges are colored, it is possible to execute migrations concurrently as long as they share the same color. Subfigure 4.5a shows one of the possible (optimal) coloring solutions and subfigure 4.5b one of the possible migration schedulings resulting from this coloring solution, represented as a Directed Acyclic Graph (DAG).

Figure 4.5: Edge coloring and the respective migration scheduling ((a) one possible edge coloring solution; (b) the chosen migration scheduling)

By Vizing’s theorem [16], the number of colors needed to edge color a simple graph is either its maximum degree ∆ or ∆ + 1. In our case, the graph can be a multigraph, because multiple migrations can occur between the same two nodes, and thus the number of colors may be as large as 3∆/2.
There are polynomial time algorithms that construct optimal colorings of bipartite graphs, and colorings of non-bipartite simple graphs that use at most ∆ + 1 colors [6]; however, the general problem of finding an optimal edge coloring is NP-complete and the fastest known algorithms for it take exponential time. A good approximation is offered by the greedy coloring algorithm applied to the graph vertex coloring problem. The graph vertex coloring problems is the same as the edge coloring problem but applied to adjacent vertices instead. As the edge chromatic number of a graph G is equal to the vertex chromatic number of its line graph L(G), it is possible to apply this algorithm to the edge coloring problem as well. Greedy coloring exploits a greedy algorithm to color the vertices of a graph that considers the vertices in sequence and assigns each vertex its first available color. Greedy colorings do not in general use the minimum number of colors possible and can use as much as 2∆−1 colors (which 4.4. MIGRATION TECHNIQUES 45 may be nearly twice as many number of colors as is necessary); however the algorithm exposes a time complexity of O(∆2 + log ∗ n) for every n-vertex graph [14] and it has the advantage that it may be used in an online algorithm setting in which the input graph is not known in advance. In this setting, its competitive ratio is two, and this is optimal for every online algorithm [1]. More complex formulations of the migration problem can also be made. It could be interesting to be able to configure a maximum number of concurrent migrations occurring on each node instead of hard limiting this value to one. This enhancement would allow to better exploit the available resources as the time ∆tn−concurrent taken for n concurrent migrations between the same nodes is less than n times the duration of a single migration ∆tsingle (Equation 4.5). ∆tn−concurrent < n · ∆tsingle (4.5) Empirical measures have shown that fully concurrent migration setups can achieve as much as 20% speedup over single-scheduled migrations by trading off for a complete network saturation and very high CPU loads. A configurable value for the maximum number of concurrent migrations would allow to adjust the speed/load tradeoff to the runtime environment. 4.4 Migration techniques It is not yet possible to migrate a virtual machine from one physical node to another without any service downtime. Different migration techniques exist thus to try to minimize the VM downtime, often while trading off for a longer migration time. Service downtime is defined as the interval between the instant at which the VM is stopped on the source host and the instant at which it is completely resumed on the destination host. Migration time is defined as the interval between the reception of the migration request and the instant at which the source host is no more needed for the migrated VM to work correctly on the destination host This section aims to introduce some of the most important categories of VM migration techniques and illustrate the method adopted by the VURM remotevirt provisioner. The most basic migration technique is called offline migration. Offline migration consists in completely suspending the VM on reception of the migration request, copying the whole state over the network to the destination host, and resuming the VM execution there. Offline migration accounts for the smallest migration time and the highest service downtime (the two intervals are approximately the same). 
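To make the mechanics more concrete, an offline migration can be expressed with the plain save/restore primitives of the libvirt Python bindings, assuming a state file location reachable from both hosts. This is only an illustrative sketch and not the code used by the remotevirt provisioner:

import libvirt

def offlineMigrate(name, sourceUri, destinationUri, stateFile):
    """Suspend a domain on the source host, save its state to a shared
    location and resume it from that state on the destination host."""
    source = libvirt.open(sourceUri)
    destination = libvirt.open(destinationUri)

    domain = source.lookupByName(name)
    domain.save(stateFile)          # stops the domain and writes its state to disk
    destination.restore(stateFile)  # recreates and resumes the domain from the saved state

    source.close()
    destination.close()

For the restored domain to work, its disk image has to be reachable from the destination host as well, which is consistent with the shared storage requirement discussed in the previous chapter.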
All other migration techniques are a subset of live migration. These migration implementations seek to minimize downtime by accepting a longer overall migration time. Usually these algorithms generalize memory transfer into three phases [3] and are implemented by exploiting one or two of the three: • Push phase The source VM continues to run while certain pages are pushed across the network to the new destination. To ensure consistency, pages modified during this process must be re-sent; • Stop-and-copy phase The source VM is stopped, pages are copied across to the destination VM, then the new VM is started; 46 CHAPTER 4. MIGRATION • Pull phase The new VM executes and, if it accesses a page that has not yet been copied, this page is faulted in across the network from the source VM. Iterative pre-copy is a live migration algorithm which applies the first two phases. The push phase is repeated by sending over pages during round each n which were modified during round n − 1 (all pages are transferred in the first round). This is one of the most diffused algorithms and provides very good performances (in [3] downtimes as low as 60ms with migration times in the order of 1 − 2minutes were reached). Iterative pre-copy is also the algorithm used by the live KVM migration implementation. Another special technique which is not based only on the three previously exposed phases exploits techniques such as checkpoint/restart and trace/replay, usually implemented in process failure management systems. The approach is similar but instead of recovering a process from an error on a local node, the VM is “recovered” on the destination host. The migration implementation adopted by the remotevirt provisioner is a simplistic form of offline migration. The virtual machine to migrate is suspended and its state saved to a shared storage location. Subsequently, the same virtual machine is recreated and resumed on the destination host by using the saved state file. It was not possible to adopt the KVM live migration implementation because different problems didn’t allow to provide a working proof of concept. A started live migration would completely freeze the migrating domain without reporting any advancement or errors and required a complete system restart for the hypervisor to resume normal operation. Thanks to libvirt’s support for this KVM functionality, an eventual addition to the remotevirt provisioner to support this kind of migration will be even simpler than the implementation provided to support the offline migration itself. Once the problem is identified and eventually solved, the addition of a new migration manager implementing this kind of migration is strongly advised. 4.5 Possible improvements Due to the time dedicated to research a solution to the live migration problems of the KVM hypervisor, different features of the migration framework were approached only on a theoretical basis. The current implementation could greatly benefit from some additions and improvements, as better described in the remaining part of this chapter. 4.5.1 Share physical nodes between virtual clusters One of the principles on which the current resource allocation algorithm is based foresees that physical nodes are entirely allocated to a single virtual cluster at a time. Such a strategy is effective to isolate resource usage between virtual clusters (a greedy VM of a virtual cluster can’t use part of the resources of a VM of another virtual cluster running on the same node) but limits the usefulness of bigger nodes. 
Systems with only a few powerful nodes are limited to run only as much clusters as there are nodes in the system, by leaving the additional virtual clusters effectively unscheduled. The two approaches can also be combined in a way that nodes with a CU value below a configurable threshold are allocated to a single cluster only while allowing to share more powerful nodes between different virtual clusters. 4.5. POSSIBLE IMPROVEMENTS 4.5.2 47 Implement live virtual machine migration The current migration manager implements a manual virtual machine migration technique consisting in suspending the virtual machine, saving its state to a shared storage location and then resuming the VM on the destination host. Applying such a migration technique involves high service downtime intervals (in the order of 15 − 30s); live migration would allow to reduce service downtime to the order of 100ms. As anticipated in the previous section, KVM already implements live virtual machine migration, and the implementation of the support to exploit this capability, once the problems related to its execution are solved, is straightforward. 4.5.3 Exploit unallocated nodes The current node allocation algorithm allows for nodes without allocated VMs. Optimize the resources allocation algorithm to exploit all available resources by assigning them to a virtual cluster and by running at least one VM on each node. Unused nodes can be allocated to virtual clusters with the greatest difference between exact assigned CUs and the rounded value provided by the currently implemented rounding algorithm. If there are resources which are effectively too limited to run even one VM, provide at least significant reporting facilities (logging) to help system administrators identify and correct such problems. 4.5.4 Load-based VMs to physical nodes allocation Different VMs can have different CPU and memory requirements. The current allocation algorithm defines only how many VMs run on a given physical node but not which virtual machine combination would allow to exploit the resources in the best way. A virtual cluster spawning two physical nodes and composed of two VMs demanding many resources and two VMs being idle could see the two idle VMs being allocated to the same physical node. The ideal algorithm would combine an heavily loaded and an idle VM on each node of the virtual cluster. 4.5.5 Dynamically adapt VCs priority to actual resource exploitation The priority assigned to a virtual cluster is defined by the user at creation time and never changed afterwards. It is possible to dynamically adapt the priority to react to actual resource exploitation of the virtual cluster. If a virtual cluster with an high priority exploits only a small percentage of the resources allocated to it, at the next iteration, the resource allocation algorithm, should lessen its priority unless the resource usage reaches a certain threshold. In a similar way, if a cluster is using all resources allocated to it and there are unused resources available on the system, the algorithm should increase its priority to augment the amount of resources which will be allocated to the virtual cluster at the next iteration. 48 4.5.6 CHAPTER 4. MIGRATION Implement the greedy coloring algorithm The section 4.3 introduced a good solution to the concurrent migrations problem on a theoretical level. The current implementation simply schedules all migration for fully concurrent execution and would thus saturate the resources an larger scale systems. 
The presented greedy coloring algorithm is simple enough to be worth implementing as an additional migration scheduler realization. To fully exploit the possibilities offered by the scheduling system, implementing the variant with a configurable concurrency limit is also advisable.

Chapter 5

Conclusions

Three whole months of planning, design, implementation and testing are over. A new tool integrating an already existing job scheduler and virtual machine monitor was created, and the time to conclude the work has finally come. This last chapter of the report aims to summarize the different achievements, illustrate the problems which were solved, hint at some possible future developments and, finally, give a personal assessment of the entire experience. The chapter is structured as follows:

• Section 5.1 summarizes the project output by recalling the initial goals and comparing them to the final results;

• Section 5.2 lists some of the most important problems encountered, the solutions eventually found for them and the impact they had on the project itself;

• Section 5.3 gives a list of hints for future work by summarizing what was already proposed in the individual chapters and adding further items highlighting other possible improvements;

• Finally, section 5.4 gives a personal overview and digest of the whole experience, covering both the project itself and its execution context.

5.1 Achieved results

The goal of the project was to add virtual resources management capabilities to a job scheduler designed for HPC clusters. SLURM was chosen as the batch scheduler, while virtualization functionality was provided using a combination of libvirt and KVM/QEMU (the latter being swappable with other hypervisors of choice). The final result consists of a tool called VURM which integrates all these different components into a single working unit. No modifications to the external tools are necessary for them to work in such a setup.

Using the developed tool, and in particular the provided remotevirt resource provisioner, it is possible to create virtual clusters of VMs running on different physical nodes (belonging to a physical cluster). Pluggable interfaces are also provided in order to add new resource provisioners to the system and to aggregate domains running on heterogeneous systems (grids, clouds, single nodes, ...) at the same time.

Particular attention was given to the remotevirt provisioner (also called physical resources provisioner) and to the functionality offered by its particular architecture. By using this provisioner, it is possible to create ad-hoc virtual machines, assemble them into virtual clusters and provide them to SLURM as partitions to which jobs can be submitted in the usual fashion. Additionally, this provisioner exposes a complete resource management framework heavily based on the possibility of migrating domains (i.e. virtual machines) from one physical node to another. Using components responsible respectively for resource allocation, migration scheduling and migration execution, it is possible to dynamically adapt the resources allocated to a virtual cluster based on different criteria such as priority definitions, node computing power, current system load, etc.

5.2 Encountered problems

Each new development project brings a new set of challenges and problems to the table.
Research projects, in which new methods are tried and new areas explored, may suffer even more from such eventualities. This project was no different, and a whole set of problems had to be approached, dealt with and eventually solved.

Two major problems can be identified over the whole project development timeframe. The first one was the retrieval of the IP address of a newly spawned virtual machine. Even though different solutions already existed (many of them also presented in the respective chapter), none of them fitted the specific context, and an alternative, more creative one had to be implemented. This final solution – transferring the IP address over the serial port to a listening TCP server – may now seem an obvious approach to a non-problem, but finding it caused a non-trivial headache when the problem first occurred.

The second important problem manifested itself towards the end of the project, when VM migration had to be implemented. As KVM claimed to support live migration out of the box, no thorough analysis and testing was performed in the initial project phase, and this led to the later discovery that this specific capability was not working on the development setup. Because of the importance given to live migration over offline migration, the source of the problem was researched for two entire weeks without success. This led to a substantial adaptation of the planning and a considerable loss of motivation. Finally, a more complex (but working) offline migration solution was implemented. As this second issue shows, a direct solution could not always be found, and a workaround had to be implemented instead.

The lessons learned from the encountered problems are essentially two: firstly, put more effort into the initial analysis phase by testing important required functionality and providing simple proofs of concept for the final solution; secondly, if a problem is encountered and a solution cannot be found in a reasonable amount of time, adopt a workaround, continue with the development of the remaining parts of the project, and come back to the issue in search of a better solution only once the critical parts of the project are finished. Adopting these two lessons should give a better initial overview of the different tasks and their complexity and, subsequently, a closer adherence to the planning.

5.3 Future work

The first section presented the different achievements of the project by summarizing the work which was done and the results which were produced. The second section reviewed the problems which were encountered and how they were eventually solved. This section aims to offer some hints for future work which could further improve the project.

Improvements tightly related to the main components of the VURM utility have already been listed in the respective chapters. Section 3.5 of the Physical cluster provisioner chapter listed the implementation of scalable disk image transfers using BitTorrent and support for VLAN based network setups as the main possible improvements to the remotevirt resource provisioner. Similarly, for the Migration chapter, section 4.5 offers an extended list of additional features and improvements to the migration framework, the resource allocation and scheduling strategies, and the migration techniques.
Possible improvements in this area are, for example, enabling physical nodes to be shared between virtual clusters or implementing live migration to minimize service downtime.

An interesting addition to the project, which would also effectively test the resources origin abstraction put in place in the early development stages, would be the implementation of a cloud computing based resource provisioner. Such a provisioner would allocate new resources on the cloud when a new virtual cluster is requested. Additionally, such a provisioner could be used as a fallback to the remotevirt provisioner: in such a scenario, resources on the cloud would be allocated only once the physical cluster provisioner has exhausted all available physical nodes.

An important part currently lacking, and potentially yielding interesting results, is a complete testing and benchmarking analysis of the VURM utility running on larger scale systems. Empirical tests and simple benchmarks were executed on a maximum of three machines only, and mainly for development purposes. A complete overview of the scalability of the system when run on a larger number of nodes would also highlight possible optimization areas which did not pose any problem in the development environment.

Adding new pluggable parts to the VURM application still requires modifications to the main codebase. A plugin architecture leveraging Twisted's component framework could be implemented in an easy and straightforward way. Additionally, by allowing custom command line parsers for each plugin, provisioner-specific options could be added to the existing commands. Such a system would allow all the components (provisioners, resource allocators, migration schedulers and migration managers) to be configured using the standard configuration file, removing the need to modify the VURM code.

The adoption of libvirt abstracted away the differences between the plethora of available hypervisors. One of the reasons to use libvirt instead of accessing the KVM/QEMU functionality directly was the possible future adoption of the Palacios hypervisor. This last improvement idea is about implementing support for the Palacios hypervisor, either by adding the needed translation layers to libvirt or by accessing it directly from the VURM tools through a new palacios resource provisioner.

5.4 Personal experience

Working for three full months on a single project can be challenging. Doing so overseas, in a completely new context and with people never met before, certainly is. These and many other pros and cons of the experience of developing this bachelor project in a completely different environment are summarized in this last section of the report.

As already explained in the Introduction, this project was carried out at the Scalable Systems Lab (SSL) of the Computer Science department of the University of New Mexico (UNM), USA during the summer of 2011 and is the final diploma work which will hopefully allow me to obtain a B.Sc. in Computer Science at the College of Engineering and Architecture Fribourg.

One of the particularities of the initial phases of the project was that I left Switzerland without a precise definition of what the project would be. The different goals and tasks were defined only once I had settled in New Mexico, in collaboration with the local advisors, and then had to be communicated back and explained to the different people supervising the project remotely from Switzerland.
Although it was a challenge, the need to explain the project, from the very first iterations, to people completely external to its definition turned out to be a big advantage, as it forced me to clearly formulate all the different aspects of the statement. The same argument can be made for the various meetings held during the entire project duration: I always had to explain the progress, issues and results in a sub-optimal environment; that, however, forced me to adopt a clearer and more understandable formulation.

Another interesting aspect is the completely new environment in which the project was executed: new people were met, a new working location had to be set up, different working hours were adopted, and different relationships with professors and colleagues had to be taken care of. All these different aspects greatly contributed to making my experience varied, interesting and enriching. Unfortunately this also brings negative aspects to the table, and keeping the motivation high for the whole duration of my stay was not always possible.

To summarize this whole experience, I would say that, besides the highs and lows, each aspect of my stay in New Mexico somehow contributed to enriching an already interesting and challenging adventure which, in any case, I would recommend to anyone.

Acronyms

AMP Asynchronous Messaging Protocol. 23
API Application Programming Interface. 5, 25, 27, 71
ARP Address Resolution Protocol. 28
B.Sc. Bachelor of Science. 5, 52
CPU Central Processing Unit. 1, 26, 43, 45, 47
CU Computing Unit. 37, 40, 46, 47
DAG Direct Acyclic Graph. 44
DBMS Database Management System. 8
DHCP Dynamic Host Configuration Protocol. 28, 63
HPC High Performance Computing. 1–3, 49
IP Internet Protocol. vii, 19, 23, 27–31, 50, 63, 65, 66
IT Information Technology. 2
KVM Kernel-based Virtual Machine. 4, 46, 47, 50
MAC Media Access Control. 28
NFS Network File System. 31, 32
OS Operating System. 2, 22, 26, 27, 29, 66
POP-C++ Parallel Object Programming C++. 65
QCOW2 Qemu Copy On Write, version 2. 29
SASL Simple Authentication and Security Layer. 25
SLURM Simple Linux Utility for Resource Management. 1, 4, 6–16, 19, 20, 22, 23, 27, 32, 49, 50, 57, 59, 60, 66
SSH Secure Shell. 20, 25, 27, 30, 65
SSL Scalable Systems Lab. 5, 52
TCP Transmission Control Protocol. 25, 28, 29, 50, 65
TLS Transport Layer Security. 25
UML Unified Modeling Language. 13
UNM University of New Mexico. 5, 52
ViSaG Virtual Safe Grid. 65, 66
VLAN Virtual LAN. 32, 51
VM Virtual Machine. 3, 4, 11, 12, 14, 19, 20, 22, 23, 25–32, 35, 36, 38, 42, 43, 45–47, 49, 50, 60, 61, 65, 66
VMM Virtual Machine Monitor. 3–5, 65
VURM Virtual Utility for Resource Management. 1, 5–8, 10–14, 16, 19, 20, 22, 23, 25, 27, 29–31, 36–38, 45, 49, 51, 57–59, 61, 62, 65, 66, 71
XML eXtensible Markup Language. 19, 23, 25, 28

References

[1] Amotz Bar-Noy, Rajeev Motwani, and Joseph Naor. “The greedy algorithm is optimal for on-line edge coloring”. In: Information Processing Letters 44.5 (Dec. 1992), pp. 251–253.
[2] Fabrice Bellard. QEMU. Aug. 2011 (accessed August 9, 2011). url: http://qemu.org/.
[3] Christopher Clark et al. “Live Migration of Virtual Machines”. 2005 (accessed August 15, 2011). url: http://www.usenix.org/event/nsdi05/tech/full_papers/clark/clark.pdf.
[4] Valentin Clément. POP-C++ Virtual-Secure (VS) – Road to ViSaG. University of Applied Sciences of Western Switzerland, Fribourg. Apr. 2011.
[5] Bram Cohen. The BitTorrent Protocol Specification. Jan.
2008 (accessed August 14, 2011). url: http://www.bittorrent.org/beps/bep_0003.html.
[6] Richard Cole, Kirstin Ost, and Stefan Schirra. “Edge-Coloring Bipartite Multigraphs in O(E log D) Time”. In: COMBINATORICA 21.1 (Sept. 1999), pp. 5–12.
[7] Johnson D et al. Eucalyptus Beginner’s Guide - UEC Edition. Dec. 2010. url: http://cssoss.files.wordpress.com/2010/12/eucabookv2-0.pdf.
[8] Domain XML format. url: http://libvirt.org/formatdomain.html.
[9] Eucalyptus Network Configuration (2.0). url: http://open.eucalyptus.com/wiki/EucalyptusNetworkConfiguration_v2.0.
[10] Michael Jang. Ubuntu Server Administration. McGraw-Hill, Aug. 2008.
[11] M. Tim Jones. Virtio: An I/O virtualization framework for Linux. IBM. Jan. 2010 (accessed August 8, 2011). url: http://www.ibm.com/developerworks/linux/library/l-virtio/.
[12] Kernel Based Virtual Machine. Feb. 2009 (accessed August 9, 2011). url: http://www.linux-kvm.org/.
[13] Jack Lange and Peter Dinda. An Introduction to the Palacios Virtual Machine Monitor. Tech. rep. Northwestern University, Electrical Engineering and Computer Science Department, Nov. 2008. url: http://v3vee.org/papers/NWU-EECS-08-11.pdf.
[14] Nathan Linial. “Locality in Distributed Graph Algorithms”. In: SIAM Journal on Computing 21.1 (Dec. 1990), pp. 193–201.
[15] Eric P. Mangold. AMP - Asynchronous Messaging Protocol. 2010. url: http://amp-protocol.net/.
[16] J. Misra and D. Gries. “A Constructive Proof of Vizing’s Theorem”. In: Information Processing Letters 41 (1992), pp. 131–133.
[17] Tuan Anh Nguyen et al. Parallel Object Programming C++ – User and Installation Manual. University of Applied Sciences of Western Switzerland, Fribourg. 2005. url: http://gridgroup.hefr.ch/popc/lib/exe/fetch.php/popc-doc-1.3.pdf.
[18] Palacios: An OS independent embeddable VMM. Feb. 2008 (accessed August 9, 2011). url: http://v3vee.org/palacios/.
[19] SLURM: A Highly Scalable Resource Manager. Lawrence Livermore National Laboratory. July 2008 (accessed August 2, 2011). url: https://computing.llnl.gov/linux/slurm/.

Appendix A

User manual

A.1 Installation

The VURM installation involves different components but is not particularly complex. The following instructions are based on an Ubuntu distribution but can be generalized to any Linux based operating system.

1. Install SLURM. Packages are provided for Debian, Ubuntu and other distributions such as Gentoo. For complete installation instructions refer to https://computing.llnl.gov/linux/slurm/quickstart_admin.html, but basically this comes down to executing the following command:

   $ sudo apt-get install slurm-llnl

2. Install libvirt. Make sure to include the Python bindings too:

   $ sudo apt-get install libvirt-bin python-libvirt

3. Install the Python dependencies on which VURM relies:

   $ sudo apt-get install python-twisted \
         python-twisted-conch \
         python-lxml \
         python-setuptools

4. Get the latest development snapshot from the GitHub repository:

   $ wget -O vurm.tar.gz https://github.com/VURM/vurm/tarball/develop

5. Extract the sources and change directory:

   $ tar xzf vurm.tar.gz && cd VURM-vurm-*

6. Install the package using setuptools:

   $ sudo python setup.py install

A.2 Configuration reference

The VURM utilities all read configuration files from a set of predefined locations. It is also possible to specify a file by using the --config option. The default locations are /etc/vurm/vurm.conf and ~/.vurm.conf. The configuration file syntax is a variant of INI with interpolation features.
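As a purely illustrative aside (this snippet is not part of the VURM sources), the interpolation feature means that a value such as key=%(basedir)s/configuration/vurm.key in the example configuration shown below is expanded with other values from the same section when read with Python's standard ConfigParser module (configparser in Python 3):

import os.path
from ConfigParser import SafeConfigParser  # Python 2, as used by the project

parser = SafeConfigParser()
parser.read(['/etc/vurm/vurm.conf', os.path.expanduser('~/.vurm.conf')])

# With basedir=/root/sources/tests and key=%(basedir)s/configuration/vurm.key
# defined in the [vurmd-libvirt] section, get() returns the expanded value:
print(parser.get('vurmd-libvirt', 'key'))
# /root/sources/tests/configuration/vurm.key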
For more information about the syntax refer to the Python ConfigParser documentation at http://docs.python.org/library/configparser.html. All the different utilities and components can be configured in the same file by using the appropriate section names. The rest of this section, after the example configuration file reported in Listing A.1, lists the different sections and their configuration directives.

Listing A.1: Example VURM configuration file

# General configuration
[vurm]
debug=yes

# Client configuration
[vurm-client]
endpoint=tcp:host=10.0.6.20:port=9000

# Controller node configuration
[vurmctld]
endpoint=tcp:9000
slurmconfig=/etc/slurm/slurm.conf
reconfigure=/usr/bin/scontrol reconfigure

# Remotevirt provisioner configuration
[libvirt]
migrationInterval=30
migrationStabilizationTimer=20
domainXML=/root/sources/tests/configuration/domain.xml
nodes=node-0,node-1

[node-0]
endpoint=tcp:host=10.0.6.20:port=9010
cu=20

[node-1]
endpoint=tcp:host=10.0.6.10:port=9010
cu=1

# Computing nodes configuration
[vurmd-libvirt]
basedir=/root/sources/tests
sharedir=/nfs/student/j/jj
username=root
key=%(basedir)s/configuration/vurm.key
sshport=22
slurmconfig=/usr/local/etc/slurm.conf
slurmd=/usr/local/sbin/slurmd -N {nodeName}
endpoint=tcp:port=9010
clonebin=/usr/bin/qemu-img create -f qcow2 -b {source} {destination}
hypervisor=qemu:///system
statedir=%(sharedir)s/states
clonedir=%(sharedir)s/clones
imagedir=%(sharedir)s/images

A.2.1 The vurm section

This section serves for general purpose configuration directives common to all components. Currently only one configuration directive is available:

debug Set this to yes to enable debugging mode (mainly more verbose logging) or to no to disable it.

A.2.2 The vurm-client section

This section contains configuration directives for the different commands interacting with a remote daemon. Currently only one configuration directive is available:

endpoint Set this to the endpoint on which the server is listening. More information about the endpoint syntax can be found online at http://twistedmatrix.com/documents/11.0.0/api/twisted.internet.endpoints.html#clientFromString. To connect to a TCP server listening on port 9000 at the host example.com, use the following string: tcp:host=example.com:port=9000

A.2.3 The vurmctld section

This section contains configuration directives for the VURM controller daemon. The available options are:

endpoint The endpoint on which the controller has to listen for incoming client connections. More information about the endpoint syntax can be found online at http://twistedmatrix.com/documents/11.0.0/api/twisted.internet.endpoints.html#serverFromString. To listen on all interfaces on the TCP port 9000, use the following string: tcp:9000

slurmconfig The path to the SLURM configuration file used by the currently running SLURM controller daemon. The VURM controller daemon needs read and write access to this file (a possible location is /etc/slurm/slurm.conf).

reconfigure The complete shell command to use to reconfigure the running SLURM controller daemon once the configuration file has been modified. The suggested value is /usr/bin/scontrol reconfigure.

A.2.4 The libvirt section

This section contains the configuration directives for the remotevirt provisioner.
The available options are:

domainXML The location of the libvirt domain XML description file to use to create new virtual machines.

migrationInterval The time (in seconds) between resource reallocation and migration triggering if no other event (virtual cluster creation or release) occurs in the meantime.

migrationStabilizationTimer The time (in seconds) to wait for the system to stabilize after a virtual cluster creation or release event before the resource reallocation and migration is triggered.

nodes A comma-separated list of section names contained in the same configuration file. Each section defines a single node or a node set on which a remotevirt daemon is running. The format of these sections is further described in the next subsection.

A.2.5 The node section

This section contains the configuration directives to manage a physical node belonging to the remotevirt provisioner. The section name can be arbitrarily chosen (as long as it does not conflict with other already defined sections). The available options are:

endpoint The endpoint on which the remotevirt daemon is listening. More information about the endpoint syntax can be found online at http://twistedmatrix.com/documents/11.0.0/api/twisted.internet.endpoints.html#clientFromString. This endpoint allows similar nodes to be grouped together by specifying an integer range in the last part of the hostname, similarly to what is possible in the SLURM configuration. It is thus possible to define a node set containing 10 similar nodes using the following value: tcp:hostname[0-9]:port=9010

cu The number of computing units associated with this node (or with each node in the set).

A.2.6 The vurmd-libvirt section

This section contains the configuration directives for the single remotevirt daemons running on the physical nodes. The available options are:

username The username to use to remotely connect to the spawned virtual machines via SSH and execute the SLURM daemon spawning command.

sshport The TCP port to use to establish the SSH connection to the virtual machine.

slurmconfig The location, on the virtual machine, where the SLURM configuration file has to be saved.

slurmd The complete shell command to use to spawn the SLURM daemon on the virtual machine. This value will be interpolated for each virtual machine using Python's string.format method. Currently the only available interpolated value is nodeName. The suggested value is /usr/local/sbin/slurmd -N {nodeName}.

key The path to the private key to use to log in to the virtual machine via SSH.

endpoint The endpoint on which the remotevirt daemon has to listen for incoming connections from the remotevirt controller. More information about the endpoint syntax can be found online at http://twistedmatrix.com/documents/11.0.0/api/twisted.internet.endpoints.html#serverFromString. To listen on all interfaces on the TCP port 9010, use the following string: tcp:9010

hypervisor The connection URI to use to connect to the hypervisor through libvirt. The possible values are described online at http://libvirt.org/uri.html.

statedir The path to the directory where the VM state file is saved to perform an offline migration. It has to reside on a shared location and be the same for all remotevirt daemon instances.

clonedir The path to the directory where the cloned disk images used to run the different VMs are saved. It has to reside on a shared location and be the same for all remotevirt daemon instances.
imagedir The path to the directory where the base disk images to clone to start a VM are stored. It has to reside on a shared location and be the same for all remotevirt daemon instances.

clonebin The complete shell command to use to clone a disk image. This value will be interpolated for each cloning operation using Python's string.format method. The available interpolated values are the source and the destination of the disk image. The suggested value is /usr/bin/qemu-img create -f qcow2 -b {source} {destination}.

A.3 Usage

The VURM project provides a collection of command line utilities to both run and interact with the system. This section describes the provided utilities and their command line usage.

A.3.1 Controller daemon

Starts a new VURM controller daemon on the local machine by loading the configuration from the default locations or from the specified path.

Listing A.2: Synopsis of the vurmctld command

usage: vurmctld [-h] [-c CONFIG]

VURM controller daemon.

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Configuration file

A.3.2 Remotevirt daemon

Starts a new remotevirt daemon on the local machine by loading the configuration from the default locations or from the specified path.

Listing A.3: Synopsis of the vurmd-libvirt command

usage: vurmd-libvirt [-h] [-c CONFIG]

VURM libvirt helper daemon.

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Configuration file

A.3.3 Virtual cluster creation

Requests the creation of a new virtual cluster of a given size from the VURM controller daemon. An optional minimum size and priority can also be defined. The configuration is loaded from the default locations or from the specified path.

Listing A.4: Synopsis of the valloc command

usage: valloc [-h] [-c CONFIG] [-p PRIORITY] [minsize] size

VURM virtual cluster allocation command

positional arguments:
  minsize               Minimum acceptable virtual cluster size
  size                  Desired virtual cluster size

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Configuration file
  -p PRIORITY, --priority PRIORITY
                        Virtual cluster priority

A.3.4 Virtual cluster release

Releases a specific virtual cluster or all virtual clusters currently defined on the system. The configuration is loaded from the default locations or from the specified path.

Listing A.5: Synopsis of the vrelease command

usage: vrelease [-h] [-c CONFIG] (--all | cluster-name)

VURM virtual cluster release command.

positional arguments:
  cluster-name          Name of the virtual cluster to release

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Configuration file
  --all                 Release all virtual clusters

A.4 VM image creation

The operating system installed on the disk image used to spawn new virtual machines has to respect a given set of constraints. This section presents the different prerequisites for such an image to work seamlessly in a VURM setup.

A.4.1 Remote login setup

Once started and online, the operating system has to allow remote login via SSH. The username, port and public key used to connect to the virtual machine have to be provided by the VURM system administrator, but it is the user's responsibility to make sure that the virtual machine allows this remote login. Most common distributions already come with an SSH server installed and correctly set up. The only remaining tasks are to create the correct user and to copy the provided public key to the correct location.
Refer to the distribution documentation to learn how users are created and how to allow a client to authenticate against a given public key (normally it is sufficient to copy the key into the ~/.ssh/authorized_keys file).

A.4.2 IP address communication

It is the responsibility of the virtual machine operating system to communicate its IP address back to the remotevirt daemon. A script to retrieve the IP address and write it to the serial port is provided in Listing A.6.

Listing A.6: Shell script to write the IP address to the serial port

#!/bin/sh
ifconfig eth0 | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1 > /dev/ttyS0

This script has to be triggered once the IP address is received by the guest operating system. A possible approach is to use the triggering capabilities offered by the default dhclient by placing the IP sending script inside the /etc/dhcp3/dhclient-exit-hooks.d/ directory. Each script contained in this directory is executed every time the DHCP client receives a new IP address lease.

Appendix B

ViSaG comparison

Parallel Object Programming C++ (POP-C++) [17] is a framework that provides a C++ language extension, a compiler and a runtime to easily deploy applications to a computing grid. One of the core aspects of the framework is that existing code can easily be converted to a parallel executing version with a minimal amount of changes. Virtual Safe Grid (ViSaG) [4] is a project that aims to add security to the execution of POP-C++ applications by adding secure communication channels and virtualization to the runtime.

Virtualization was used for different reasons in ViSaG and VURM: mainly security and sandboxing in the former, and customization and dynamic resource reallocation in the latter. Nonetheless, both projects encountered some of the same problems and solved them differently. This appendix aims to provide a short overview of the different implemented solutions and how they relate to one another.

Both ViSaG and VURM use libvirt to access the main hypervisor functions. In the case of VURM, this allows the integration of different hypervisors; the main reason to use libvirt was the provided abstraction layer, and thus the ability to swap hypervisors at will while guaranteeing an easy path to support the Palacios VMM in the future. In the case of ViSaG, the main reason for the adoption of libvirt was the relatively simple interface it offered to access the VMware ESX hypervisor, while the abstraction of different hypervisors was overshadowed by the implementation of some direct interactions with the ESX hypervisor, as seen later in this appendix.

One of the first problems encountered by both projects is the retrieval of the IP address from a newly spawned virtual machine. Subsection 3.4.1 of the Physical cluster provisioner chapter presents the different solutions analyzed for this problem. Both the solutions adopted by ViSaG and VURM were presented in the above-cited section: VURM chose the serial-to-TCP data exchange, while ViSaG chose to take advantage of the proprietary virtual machine tools provided by ESX and installed on the running domain. Each solution has its own advantages and drawbacks: the serial-to-TCP data exchange adds more complexity to the application, as an external TCP server has to be put in place, while the proprietary VM tools solution tightly couples the ViSaG implementation to a specific hypervisor.
Another key problem solved in different ways is the setup of the authentication credentials for the remote login through an SSH connection. As done for the IP address retrieval problem, different possible solutions were discussed in subsection 3.4.3 of the Physical cluster provisioner chapter. In this case, the authentication model specific to each project clearly identified the correct solution for it. When using the VURM utility, the end user specifies a disk image to execute on their virtual cluster. This allows complete customization of the environment in which the different SLURM jobs will be executed. In the ViSaG case, by contrast, the VM disk image is provided during the setup phase and used for security and sandboxing purposes. This disk image is shared among all users, and each user wants to grant access to a running domain only to their own application. The key difference here is that all virtual machines on a VURM system are accessed by the same entity (the remotevirt daemon) using a single key pair, while on a ViSaG system each virtual machine is potentially accessed by a different entity (the application which requested the creation of the VM) using a different key pair. In the case of VURM, the installation of the public key into the VM disk image is a requirement which has to be fulfilled by the user. In the case of ViSaG, the public key is copied to the virtual machine at runtime using a hypervisor-specific feature.

The last difference between the two projects is the method used to spawn new virtual machines. In the case of VURM, new virtual machines are spawned simultaneously as part of a new virtual cluster creation request; losing some time to completely boot a new domain was considered an acceptable tradeoff given the relative infrequency of the operation. In the context of a ViSaG execution, VMs have to be spawned more frequently and with lower latency. This requirement led to the adoption of a spawning technique based on a VM snapshot, in which an already booted and suspended domain is resumed. The resume operation is faster than a full OS boot, but presents other disadvantages which had to be overcome, such as triggering the negotiation of a new IP address.

Appendix C

Project statement

VURM – Project statement
Virtual resources management on HPC clusters
Jonathan Stoppani
College of Engineering and Architecture Fribourg
jonathan.stoppani@edu.hefr.ch

Abstract

Software deployment on HPC clusters is often subject to strict limitations with regard to software and hardware customization of the underlying platform. One possible approach to circumvent these restrictions is to virtualize the different hardware and software resources by interposing a dedicated layer between the running application and the host operating system. The goal of this project is to enhance existing HPC resource management tools with special virtualization-oriented capabilities such as job submission to specially created virtual machines or runtime migration of virtual nodes to account for updated job priorities or for better performance exploitation. Virtual machines can be started inside a virtual cluster and run on a physical node. In low system load situations or for high priority jobs, each physical node hosts exactly one VM (Subfigure 1a). When additional or higher prioritized jobs are submitted to the system, these VMs can be migrated to already busy nodes to free up resources for the incoming jobs (Subfigure 1b and 1c).
The same procedure can be applied when some resources are released, in order to increase performance (Subfigure 1d).

Keywords: resource management, virtualization, HPC, slurm

1. Introduction

SLURM (Simple Linux Utility for Resource Management) is, as the name says, a resource manager for Linux based clusters of all sizes. Its job is to allocate resources to the users requesting them, by arbitrating possible contentions using a queue of pending work, and to offer tools to start, execute and monitor work started on previously allocated resources. To be able to support more esoteric system configurations than what is allowed on common HPC clusters, and to provide advanced job controlling capabilities (mainly pausing/resuming and migration), SLURM can be extended to support virtual clusters and virtual machine management. A virtual cluster groups physical nodes together and can be thought of as a logical container for virtual machines. Such a cluster can be resized to account for workload changes and runtime job prioritization updates.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. © Copyright 2011, Jonathan Stoppani

Figure 1. VMs migration process to allocate resources for a higher prioritized job in its different states: (a) low load operation (before), (b) migration (during allocation), (c) high load operation (while running both jobs) and (d) restoration (after). (Squares represent physical nodes, circles represent VMs and darker rectangles are virtual clusters.)

2. Goals

The ultimate goal of this project is to add support for the virtual resources management capabilities described above to SLURM. These capabilities can either be provided as plugins or by directly modifying the source tree. To reach this objective, different partial goals have to be attained. The following list summarizes them:

1. Adding support for starting and stopping virtual clusters in SLURM, with the newly-created virtual clusters attaching to the existing SLURM instance so that regular jobs can be launched on them. This will require adding support for libvirt or related virtualization management libraries to SLURM.

2. Adding support for controlling the pausing and migration of virtual machines to SLURM, so that more sophisticated resource allocation decisions can be made (e.g. migrating multiple virtual machines of a virtual cluster onto a single physical node) as new information (e.g. new jobs) becomes available.

3. Implementing simple resource allocation strategies based on the existing job scheduling techniques to demonstrate the capabilities added to SLURM, KVM, and Palacios in (1) and (2) above.

3. Deadlines

The deadlines for the project are summarized in Table 1. For more detailed information about the content of each deliverable or milestone, refer to the planning document.

A. Context

This work is carried out at the Scalable Systems Lab (SSL) of the Computer Science department of the University of New Mexico (UNM), USA during Summer 2011. The project will be based on SLURM (Simple Linux Utility for Resource Management) as the underlying layer for resource management and on two different virtual machine monitors: KVM and Palacios.
KVM (Kernel-based Virtual Machine) is a general purpose virtual machine monitor which integrates directly into the Linux kernel, while Palacios is an HPC-oriented hypervisor designed to be embedded into a range of different host operating systems, including lightweight Linux kernel variants and thus potentially the Cray Linux Environment.

B. Experts and Supervisors

Prof. Peter Kropf, Head of the Distributed Computing Group and Dean of the Faculty of Science of the University of Neuchatel, Switzerland, covers the role of expert. Prof. Patrick G. Bridges, associate professor at the University of New Mexico, is supervising the project locally. Prof. Pierre Kuonen, Head of the GRID and Cloud Computing Group, and Prof. François Kilchoer, Dean of the Computer Science Department, both of the College of Engineering and Architecture Fribourg, are supervising the project from Switzerland.

Table 1. Deadlines for the project. (All dates refer to 2011; M = Milestone, D = Deliverable.)

Date     Type  Asset
Jun. 6   M     Project start
Jun. 20  D     Project statement
Jun. 20  D     Planning
Jul. 1   M     Dynamic partitions
Jul. 8   M     Virtual clusters
Jul. 15  M     Virtual cluster pausing/resuming
Jul. 15  D     Project summary (EN/DE)
Jul. 22  M     Virtual cluster resizing
Jul. 29  M     Migration strategy
Aug. 19  D     Final presentation
Aug. 19  D     Final report
Aug. 19  D     Project sources and documentation
Aug. 19  M     Project end
Sep. 7   M     Oral defense
Sep. 7   D     Project poster

C. Useful resources

• https://computing.llnl.gov/linux/slurm/ Official web site of the SLURM project. Sources, admin/user documentation, papers as well as a configuration tool can be found there.

• http://www.linux-kvm.org/ Web site of the KVM (Kernel-based Virtual Machine) project. A special page describing VM migration using KVM is available.

• http://www.v3vee.org/palacios/ Official web site of the Palacios VMM project, developed as part of the V3VEE project. Access to news, documentation and source code is available.

Appendix D

CD-ROM contents

The following assets can be found on the attached CD-ROM:

• api-reference: The interactive (HTML) version of the VURM API reference;

• documents: A collection of all documents related to the project management (meeting minutes, planning versions, compiled report, summaries in different languages, ...);

• report-sources: Complete checkout of the vurm-report LaTeX sources git repository;

• vurm-sources: Complete checkout of the vurm source git repository;

• website.tar.gz: Tarball of the project wiki used throughout the project (needs a PHP runtime to execute).