VURM
Virtual resources management on HPC clusters
Jonathan Stoppani
Author
Prof. Dorian Arnold
Prof. Patrick G. Bridges
Prof. François Kilchoer
Prof. Pierre Kuonen
Advisors
Prof. Peter Kropf
Expert
Summer 2011
Bachelor project
College of Engineering and Architecture Fribourg
member of the University of Applied Sciences of Western Switzerland
Abstract
Software deployment on HPC clusters is often subject to strict limitations with regard to software and hardware customization of the underlying platform. One possible approach to circumvent these restrictions is to virtualize the different hardware and software resources by
interposing a dedicated layer between the running application and the host operating system.
The goal of the VURM project is to enhance existing HPC resource management tools with
special virtualization-oriented capabilities such as job submission to specially created virtual
machines or runtime migration of virtual nodes to account for updated job priorities or for
better performance exploitation.
The two main enhancements this project aims to provide to the already existing tools are firstly,
full customization of the platform running the client software, and secondly, improvement of
dynamic resource allocation strategies by exploiting virtual machines migration techniques.
The final work is based upon SLURM (Simple Linux Utility for Resource Management) as job
scheduler and uses libvirt (an abstraction layer able to communicate with different hypervisors)
to manage the KVM/QEMU hypervisor.
Keywords: resource management, virtualization, HPC, SLURM, migration, libvirt, KVM
Contents

Abstract                                    i

1 Introduction                              1
  1.1 High Performance Computing            1
  1.2 Virtualization                        2
  1.3 Project goals                         3
  1.4 Technologies                          3
  1.5 Context                               5
  1.6 Structure of this report              6

2 Architecture                              7
  2.1 SLURM architecture                    7
  2.2 VURM architecture                     10
  2.3 Provisioning workflow                 13
  2.4 Implementing new provisioners         16

3 Physical cluster provisioner              19
  3.1 Architecture                          20
  3.2 Deployment                            22
  3.3 Libvirt integration                   25
  3.4 VM Lifecycle aspects                  27
  3.5 Possible improvements                 31

4 Migration                                 35
  4.1 Migration framework                   36
  4.2 Allocation strategy                   40
  4.3 Scheduling strategy                   43
  4.4 Migration techniques                  45
  4.5 Possible improvements                 46

5 Conclusions                               49
  5.1 Achieved results                      49
  5.2 Encountered problems                  50
  5.3 Future work                           51
  5.4 Personal experience                   51

Acronyms                                    53

References                                  55

A User manual                               57
  A.1 Installation                          57
  A.2 Configuration reference               58
  A.3 Usage                                 61
  A.4 VM image creation                     62

B ViSaG comparison                          65

C Project statement                         67

D CD-ROM contents                           71
List of Figures

2.1 Basic overview of the SLURM architecture                                     8
2.2 Initial proposal for the VURM architecture                                  11
2.3 The VURM system from the SLURM controller point of view                     12
2.4 The VURM system updated with the VURM controller                            12
2.5 Adopted VURM architecture                                                   13
2.6 VURM architecture class diagram                                             14
2.7 Resource provisioning workflow                                              17

3.1 Overall architecture of the remotevirt provisioner                          21
3.2 Complete VURM + remotevirt deployment diagram                               24
3.3 Copy-On-Write image R/W streams                                             30
3.4 Complete domain lifecycle                                                   33

4.1 Example of virtual cluster resizing through VMs migration                   35
4.2 Migration framework components                                              37
4.3 Collaboration between the different components of the migration framework   39
4.4 Graphical representation                                                    44
4.5 Edge coloring and the respective migration scheduling                       44
List of Code Listings

2.1 Configuration excerpt                                                         9
2.2 Reconfiguration command                                                      10
2.3 Node naming without grouping                                                 10
2.4 Node naming with grouping                                                    10
2.5 SLURM configuration for a virtual cluster                                    14
2.6 Batch job execution on a virtual cluster                                     15
2.7 vurm/bin/vurmctld.py                                                         16

3.1 Basic libvirt XML description file                                           26
3.2 Example arp -an output                                                       28
3.3 Libvirt TCP to serial port device description                                29
3.4 Shell script to write the Internet Protocol (IP) address to the serial port  29
3.5 Shell script to exchange the IP address and a public key over the serial port 31

4.1 CU rounding algorithm                                                        40
4.2 Nodes to virtual cluster allocation                                          41
4.3 VMs migration from external nodes                                            42
4.4 Leveling node usage inside a virtual cluster                                 42

A.1 Example VURM configuration file                                              58
A.2 Synopsis of the vurmctld command                                             61
A.3 Synopsis of the vurmd-libvirt command                                        61
A.4 Synopsis of the valloc command                                               61
A.5 Synopsis of the vrelease command                                             62
A.6 Shell script to write the IP address to the serial port                      63
Chapter 1
Introduction
High Performance Computing (HPC) is often seen as a niche area of the broader computer science domain and is frequently associated only with highly specialized research contexts. Virtualization, on the other hand, once considered part of a similarly limited context, is nowadays gaining acceptance and popularity thanks to its use in cloud-based computing approaches. The goal of this project is to combine HPC tools – more precisely a batch scheduling system – with virtualization techniques and to evaluate the advantages that such a solution brings over more traditional tools and techniques.
The outcome of this project is a tool called Virtual Utility for Resource Management (VURM). It consists of a loosely coupled extension to the Simple Linux Utility for Resource Management (SLURM) that provides virtual resource management capabilities, allowing jobs submitted to SLURM to run on dynamically spawned virtual machines.
The rest of this chapter presents the concepts of HPC and virtualization, illustrates the project goal in greater detail and introduces the tools and technologies on which the project is based. The last section provides an overview of the structure of this report.
1.1 High Performance Computing
Once limited to specialized research or academic environments, High Performance Computing (HPC) continues to gain popularity as the limits of more traditional computing approaches are reached and more powerful paradigms are needed.

HPC brings a whole new set of problems to the already complex world of software development. One example are problems bound to CPU processing errors: completely negligible on more traditional computing platforms, they instantly assume paramount importance and introduce the need for specialized error recovery strategies. Another problem, bound to the high-efficiency characteristics such systems have to expose, are the strict hardware and software limitations imposed by the underlying computing platforms.¹

In order to take full advantage of the resources offered by a computing platform, a single task is often split into different jobs which can then be run concurrently across different nodes. A multitude of strategies are available to schedule these jobs in the most efficient manner possible. The application of a given strategy is handled by a job scheduler²: a specialized software entity which runs across a given computing cluster and which often provides additional functionality, such as per-user accounting, system monitoring or other pluggable features.

¹ Such limitations can be categorized into hard limitations, imposed by the particular hardware or software architecture, and soft limitations, normally imposed by the entity administering the systems and/or by the usage policies of the owning organization.
² Also known for historical reasons as a batch scheduler.
1.2 Virtualization
Virtualization had been a topic of interest in the scientific and academic communities for several years before making its debut in the widespread industrial and consumer market. Now adopted and exploited as a powerful tool by more and more people, it has rapidly become one of the current trends across the Information Technology (IT) market, mainly thanks to its adoption as an enabler of the more recent cloud computing paradigm.

The broader virtualization term refers to the creation and exploitation of a virtual (rather than actual) version of a given resource. Examples of commonly virtualized resources are complete hardware platforms, operating systems, network resources or storage devices. Additionally, different virtualization types are possible: in the context of this project, the term virtualization always refers to the concept of hardware virtualization. Other types of virtualization are operating-system level virtualization, memory virtualization, storage virtualization, etc.

The term hardware virtualization, also called platform virtualization, is tightly bound to the creation of virtual machines: a software environment which presents itself to its guest (that is, the software running on it) as a complete hardware platform and which isolates the guest environment from the environment of the process running the virtual machine.

Although different additional hardware virtualization types exist, for the scope of this project we differentiate only between full virtualization and paravirtualization. When using full virtualization, the underlying hardware platform is almost fully virtualized and the guest software, usually an operating system, can run unmodified on it. In the normal case the guest software does not even know that it runs on a virtual platform. Paravirtualization, instead, is a technique allowing the guest software to directly access the hardware in its own isolated environment. This requires modifications to the guest software but enables much better performance.³

The main advantages of virtualization techniques are the possibility to run the guest software completely sandboxed (which greatly increases security), the increase of the exploitation ratio of a single machine⁴ and the possibility to offer full software customization through the use of custom-built Operating System (OS) disk images. This last advantage can in some cases be extended to hardware resources too, making it possible to attach emulated hardware resources to a running guest.

Virtualization does not come only with advantages; the obvious disadvantage of virtualized systems is the increased resource usage overhead caused by the interposition of an additional abstraction layer. Recent software and hardware improvements (mainly paravirtualization and hardware-assisted virtualization, respectively) have contributed to greatly optimizing the performance of hardware virtualization.

³ Paravirtualization is fully supported in the Linux kernel starting at version 2.6.25 through the use of the virtio drivers [11].
⁴ Although this last point does not apply to HPC systems, where resources are fully exploited in the majority of cases, virtualization can greatly improve the average exploitation ratio of a common server.
1.3 Project goals
Various problems and limitations arise when developing software to be run in HPC contexts. As seen in section 1.1, these limitations are either inherited from the hardware and software architecture or imposed by its usage policies. Section 1.2 then introduced a virtualization-based approach to circumvent these restrictions. It is thus deemed possible to overcome the limitations imposed by a particular HPC execution environment by using virtualization techniques, trading off a certain amount of performance for a much more flexible platform.

The goal of this project is to add virtual resource management capabilities to an existing job scheduling utility. These enhancements would allow platform users to run much more customized software instances without the administrators having to be concerned about security or additional management issues. In such a system, user jobs would be run inside a virtual machine using a user-provided base disk image, effectively leaving the hosting environment untouched.

The adoption of a virtual machine based solution enables advanced scheduling capabilities to be put in place: the exploitation of virtual machine migration between physical nodes allows the job scheduler to adapt resource usage to the current system load and thus dynamically optimize resource allocation at execution time, not only while scheduling jobs. This optimization can be carried out by migrating Virtual Machines (VMs) from one physical node in the system to another and executing them on the best available resource at every point in time.

To reach this objective, different partial goals have to be attained. The following breakdown illustrates each of the partial tasks in more detail:
1. Adding support to the job scheduler of choice for running regular jobs inside dynamically created and user-customizable virtual machines;
2. Adding support for controlling the state of each virtual machine (pausing/resuming, migration) to the job scheduler so that more sophisticated resource allocation decisions can be made (e.g. migrating multiple virtual machines of a virtual cluster onto a single physical node) as new information (e.g. new jobs) becomes available;
3. Implementing simple resource allocation strategies based on the existing job scheduling techniques to demonstrate the capabilities added to the job scheduler and the Virtual Machine Monitor (VMM) in (1) and (2) above.
A more formal project statement, containing additional information such as deadlines and
additional resources, can be found in Appendix C.
1.4 Technologies
Different, already existing libraries and utilities were used to build the final prototype. This section provides an overview of the main external components initially chosen to build upon. The absolutely necessary components are a job scheduler to extend and a hypervisor (or VMM) to manage the virtual machines; in the case of this project, SLURM and KVM/QEMU were chosen, respectively.

A VMM abstraction layer called libvirt was used in order to lessen the coupling between the developed prototype and the hypervisor, making it easy to swap out KVM for a different hypervisor. This particular decision is mainly due to an effort to make a future integration with Palacios – a VMM intended to be used for high performance computing – as easy as possible.

The remainder of this section introduces the reader to the utilities and libraries cited above.
SLURM

Simple Linux Utility for Resource Management (SLURM) is a batch scheduler for Linux based operating systems. Initially developed at the Lawrence Livermore National Laboratory, it was subsequently open sourced and is now a well-known player in the job scheduling ecosystem. The authors [19] describe it with the following words:

The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Section 2.1 provides additional information about the SLURM working model and architectural internals.
KVM/QEMU

QEMU [2] (which probably stands for Quick EMUlator [10]) is an open source processor emulator and virtualizer. It can be used as both an emulator and a hosted Virtual Machine Monitor (VMM). This project only takes advantage of its virtualization capabilities, making little or no use of the emulation tools.

Kernel-based Virtual Machine (KVM) [12] is a full virtualization solution for Linux based operating systems running on x86 hardware with virtualization extensions (specifically, Intel VT or AMD-V). To run hardware-accelerated guests, the host operating system kernel has to include support for the different KVM modules; this support is included in the mainline Linux kernel as of version 2.6.20.

The main reasons for choosing the KVM/QEMU combination over other VMMs are KVM's claimed support for both live and offline VM migration and the fact that both are released as open source software. Recent versions of QEMU distributed by the package managers of most Linux distributions already include (either by default or optionally) the bundled KVM/QEMU build.⁵

⁵ On Debian and Ubuntu systems only the qemu-kvm package is provided; the qemu package is a dummy transitional package for qemu-kvm. On Gentoo systems, KVM support can be enabled for the qemu package by setting the kvm USE flag.
Palacios

Palacios is an open source VMM specifically targeted at research and teaching in high performance computing, currently under development as part of the V3VEE project.⁶ The official description, as found on the publicly accessible website [18], is reported below:

Palacios is a type I, non-paravirtualized, OS-independent VMM that builds on the virtualization extensions in modern x86 processors, particularly AMD SVM and Intel VT. Palacios can be embedded into existing kernels, including very small kernels. Thus far, Palacios has been embedded into the Kitten lightweight kernel from Sandia National Labs and the University of Maryland's GeekOS teaching kernel. Currently, Palacios can run on emulated PC hardware, commodity PC hardware, and Cray XT3/4 machines such as Sandia's Red Storm.

Palacios is also able to boot an unmodified Linux distribution [13] and can thus be used on a wide range of both host hardware and software platforms and to host different (more or less lightweight) guest operating systems.

⁶ http://v3vee.org/
Libvirt

The libvirt project aims to provide an abstraction layer to manage virtual machines over different hypervisors, both locally and remotely. It offers an Application Programming Interface (API) to create, start, stop, destroy and otherwise manage virtual machines and their respective resources such as network devices, storage devices, processor pinnings, hardware interfaces, etc. Additionally, through the libvirt daemon, it accepts incoming remote connections and allows complete exploitation of the API from remote locations while accounting for authentication and authorization related issues.

The libvirt codebase is written in C, but bindings for different languages are provided as part of the official distribution or as external packages. The currently supported languages are C, C#, Java, OCaml, Perl, PHP, Python and Ruby.

The decision to use libvirt as an abstraction layer instead of directly accessing the API exposed by KVM/QEMU was taken to make it easier to switch to a different VMM if the need should arise and, in particular, to ease a possible future Palacios integration once the necessary support is added to the libvirt API.

Chapter 3 contains additional information about libvirt's working model and its integration into the VURM architecture.
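To give a flavor of the kind of interaction VURM has with this layer, the short Python sketch below connects to a libvirt daemon, creates a transient domain from an XML description and later looks it up again to destroy it. The XML shown here is a minimal, hypothetical example and not the domain description template actually shipped with VURM (the real one is presented in chapter 3); the URI and file paths are placeholders as well.

```python
import libvirt

# Connect to the QEMU/KVM driver; a remote daemon could be reached with e.g.
# "qemu+tcp://node-hostname/system" instead of the local URI used here.
conn = libvirt.open('qemu:///system')

# Minimal, hypothetical domain description: 1 CPU, 512 MB of RAM (the
# <memory> element is expressed in KiB) and a single virtio disk.
domain_xml = """
<domain type='kvm'>
  <name>vurm-example</name>
  <memory>524288</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/vurm/images/example.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

# Create (define and start) the transient domain...
domain = conn.createXML(domain_xml, 0)

# ...and, at release time, retrieve it by name and power it off.
domain = conn.lookupByName('vurm-example')
domain.destroy()
conn.close()
```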
1.5 Context
This project is the final diploma work of Jonathan Stoppani and will allow him to obtain the Bachelor of Science (B.Sc.) in Computer Science at the College of Engineering and Architecture Fribourg, part of the University of Applied Sciences of Western Switzerland.

The work is carried out at the Scalable Systems Lab (SSL) of the Computer Science department of the University of New Mexico (UNM), USA, during Summer 2011.

Prof. Peter Kropf, Head of the Distributed Computing Group and Dean of the Faculty of Science of the University of Neuchâtel, Switzerland, acts as the expert. Prof. Patrick G. Bridges and Prof. Dorian Arnold, associate professors at the University of New Mexico, are supervising the project locally. Prof. Pierre Kuonen, Head of the GRID and Cloud Computing Group, and Prof. François Kilchoer, Dean of the Computer Science Department, both of the College of Engineering and Architecture of Fribourg, are supervising the project from Switzerland.
1.6 Structure of this report
This section describes the overall structure of the present report by briefly introducing the contents of each chapter and placing it in the right context. The main content is organized in the following chapters:

• This first chapter, chapter 1, introduces the project and its context, explains the problems to be solved and lists the main technological choices;
• After a general overview, chapter 2 explains the overall architectural design of the VURM tool and how it fits into the existing SLURM architecture, motivating the different choices;
• Chapter 3 introduces the remotevirt resource provisioner, one of the two provisioners shipped with the default VURM implementation. This chapter, as well as the following one, presupposes an understanding of the global architecture presented in chapter 2;
• Virtual machine migration between physical nodes, one of the main advantages of using a virtualization-enabled job scheduler, is described in chapter 4. The migration capabilities are implemented as part of the remotevirt provisioner but are presented in a separate chapter as a self-standing subject;
• Finally, chapter 5 concludes the report by summarizing the important positive and negative aspects, citing possible areas of improvement and providing a personal assessment of the executed work.

In addition to the main content, different appendices are available at the end of the present report, containing information such as user manuals and additional related material not directly relevant to the presented chapters. Refer to the table of contents for a more detailed listing.

The acronyms used and the cited references are available on pages 53 and 55, respectively.
Chapter 2
Architecture
The VURM virtualization capabilities have to be integrated with the existing SLURM job scheduler in order to exploit the functionalities offered by each component. At least a basic understanding of the SLURM internals is thus needed in order to design the best possible architecture for the VURM utility. This chapter introduces the reader to both the SLURM and the VURM architectural designs and explains the reasons behind the different choices that led to the implemented solution.
The content presented in this chapter is structured into the following sections:
• Section 2.1 introduces the SLURM architecture and explains the details needed to understand how VURM integrates into the system;
• Section 2.2 explains the chosen VURM architecture and presents the design choices that
led to the final implementation;
• Section 2.3 introduces the provisioning workflow used to allocate new resources to users
requesting them and introduces the concept of virtual clusters;
• Section 2.4 explains the details of the implementation of a new resource provisioner and
its integration in a VURM system.
2.1 SLURM architecture
The architecture of the SLURM batch scheduler was conceived to be easily expandable and customizable, either through appropriate configuration or by adding functionality through the use of specific plugins.

This section does not aim to provide a complete understanding of how the SLURM internals are implemented or which functionalities can be extended through the use of plugins; it is limited instead to providing a global overview of the different entities which intervene in its very basic lifecycle and which are needed to understand the VURM architecture presented afterwards.
[Figure 2.1: Basic overview of the SLURM architecture – class diagram relating the Controller, the Daemons, the Partitions and the Command interface (srun, squeue, scontrol, ...)]
2.1.1 Components
A basic overview of the SLURM architecture is presented in Figure 2.1. Note that the actual architecture is much more complex than what is illustrated in the class diagram and involves different additional entities (such as backup controllers, databases, ...) and interactions. However, for the scope of this project, the representation covers all the important parts and can be used to place the VURM architecture in the right context later on in this chapter.

The classes represented in the diagram are not directly mapped to actual objects in the SLURM codebase, but rather to the different intervening entities¹; the rest of this subsection provides some more explanation about them.
Controller
The Controller class represents a single process controlling the whole system; its main responsibility is to schedule incoming jobs and assign them to nodes for execution based on different (configurable) criteria. Additionally, it keeps track of the state of each configured daemon and partition, accepts incoming administration requests and maintains an accounting database.

A backup controller can also be defined; requests are sent to this controller instead of the main one as soon as a requesting client notices that the main controller is no longer able to correctly process incoming requests.

Optionally, a Database Management System (DBMS) backend can be used to take advantage of additional functionalities such as per-user accounting or additional logging.
¹ A deployment diagram may be more suitable for this task, but a class diagram better represents the cardinality of the relationships between the different entities.
Daemons
Each Daemon class represents a process running on a distinct node; their main task is to accept incoming job requests and run them on the local node.

The job execution commands (see below) communicate with the daemons directly after having received an allocation token from the controller. The first message they send to the nodes to request job allocation and/or execution is signed by the controller. This decentralized messaging paradigm allows for better scalability on clusters with many nodes, as the controller does not have to deal with the job execution processing on each node itself.

Each daemon is periodically pinged by the controller to keep its status up to date in the central node database.
Partitions
The Daemon instances (and thus the nodes) are organized in one or more (possibly overlapping) logical partitions; each partition can be configured individually with regard to permissions, priorities, maximum job duration, etc. This makes it possible to create partitions with different settings.

A possible example of a useful per-partition configuration is to allow only a certain group of users to access a partition which contains more powerful nodes. Another possible example is to organize the same set of nodes in two overlapping partitions with different priorities; jobs submitted to the partition with higher priority will thus be carried out faster (i.e. their allocation will have a higher priority) than jobs submitted to the partition with lower priority.
Commands
Different Commands are made available to administrators and end users; they communicate with the Controller (and possibly directly with the Daemons) by using a custom binary protocol with TCP/IP as the transport layer.

The main commands used throughout the project are the srun command, which allows a job to be submitted for execution (and offers a plethora of options and arguments to fine-tune the request), and the scontrol command, which can be used to perform administration requests, such as listing all nodes or reloading the configuration file.
2.1.2 Configuration
The SLURM configuration resides in one single file which exposes a simple syntax. This configuration file contains the directives for all components of a SLURM-managed system (including the controller, the daemons, the databases, ...); an excerpt of such a file is shown in Listing 2.1.
Listing 2.1: Configuration excerpt
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=controller-hostname
# ...
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=nelson
# ...
NodeName=testing-node Procs=1 State=UNKNOWN
PartitionName=debug Nodes=ubuntu Default=YES MaxTime=INFINITE State=UP

NodeName=nd-6ad4185-0 NodeHostname=10.0.0.101
NodeName=nd-6ad4185-1 NodeHostname=10.0.0.100
PartitionName=vc-6ad4185 Nodes=nd-6ad4185-[0-1] Default=NO MaxTime=INFINITE State=UP
An important feature of the SLURM controller daemon is its ability to reload this configuration file at runtime. This feature is exploited to dynamically add or remove nodes and partitions from a running system. The file can be reloaded by executing a simple administration request through the scontrol command, as illustrated in Listing 2.2.
Listing 2.2: Reconfiguration command
$ scontrol reconfigure
A useful feature of the configuration syntax is the ability to group node definitions if their NodeName ends with numerical indexes; this may not seem that interesting when working with 10 nodes, but being able to group configuration directives for clusters of thousands of nodes is more than welcome. Listings 2.3 and 2.4 show an example of this feature.
Listing 2.3: Node naming without grouping
NodeName=linux1 Procs=1 State=UNKNOWN
NodeName=linux2 Procs=1 State=UNKNOWN
# ... 29 additional definitions ...
NodeName=linux32 Procs=1 State=UNKNOWN
PartitionName=debug Nodes=linux1,linux2,...,linux32 Default=YES MaxTime=INFINITE State=UP
Listing 2.4: Node naming with grouping
NodeName=linux[1-32] Procs=1 State=UNKNOWN
PartitionName=debug Nodes=linux[1-32] Default=YES MaxTime=INFINITE State=UP
2.2 VURM architecture
The previous section introduced some of the relevant SLURM architectural concepts. This
section, instead, aims to provide an overview of the final implemented solution and the process
that led to the architectural design of the VURM framework.
NOTE 1
· Blue circles represent virtual machines (blue color is used for virtual resources or software running on virtual resources).
· Green squares represent physical nodes (green color is used for physical resources or software running on physical resources).
· Virtual machines and physical nodes can be grouped into virtual clusters and partitions respectively.
· Rounded rectangles represent services or programs.

NOTE 2
The term virtual cluster is used a couple of times here. For now, consider a virtual cluster simply as a SLURM partition with a label attached to it; section 2.3 contains more information about these special partitions.
2.2.1 A first proposal
The first approach considered for the VURM architecture was to use SLURM for everything and inject the needed virtualization functionalities into it, either by adding plugins or by modifying the original source code. This approach is introduced in Figure 2.2. To help better understand the different parts, the representation splits the architecture into two different layers: a physical layer and a virtual layer.
[Figure 2.2: Initial proposal for the VURM architecture – srun and slurmctld operate on the physical layer, managing a seed partition of physical nodes; the spawned virtual nodes on the virtual layer are grouped into virtual clusters 1–3]
In this proposal, the physical nodes (each running a SLURM daemon) are managed by the SLURM controller. When a new request is received by the controller, a new VM is started on one of these nodes and a SLURM daemon is set up and started on it. Upon startup this daemon registers itself as a new node with the SLURM controller. Once the registration has occurred, the SLURM controller can allocate resources and submit jobs to the new virtual node.

The initial idea behind this proposal was to spawn new virtual machines by running a job which takes care of this particular task through the SLURM system itself. This job would be run on a daemon placed inside the seed partition.

The SLURM controller daemon can't, in this case, differentiate between daemons running on physical or virtual nodes. It only knows that, to spawn new virtual machines, the respective job has to be submitted to the seed partition, while user jobs will be submitted to one of the created virtual cluster partitions. In this case the user has to specify in which virtual cluster (i.e. in which partition) he wants to run the job.

Figure 2.3 illustrates how the SLURM controller sees the whole system (note that virtual and physical nodes as well as virtual clusters and partitions are still differentiated for illustration purposes, but aside from their name the SLURM controller isn't able to tell the difference between them).
[Figure 2.3: The VURM system from the SLURM controller point of view – slurmctld manages the seed partition and the virtual clusters 1–3 as ordinary partitions]
2.2.2 Completely decoupled approach
By further analyzing and iterating over the previously presented approach, it is possible to deduce that the SLURM controller uses the nodes in the seed partition exclusively to spawn new virtual machines and does not effectively take advantage of these resources to allocate and run jobs. Using SLURM for this specific operation introduces useless design and runtime complexity. It is also possible to observe that these physical nodes have to run some special piece of software which provides VM management specific capabilities not provided by SLURM anyway.

Based on these considerations, it was thus decided to use SLURM only to manage user-submitted jobs, while deferring all virtualization oriented tasks to a custom, completely decoupled, software stack. This stack is responsible for the mapping of virtual machines to physical nodes and their registration as resources with the SLURM controller.
This decision has several consequences on the presented design:
• There is no need to run the SLURM daemons on the nodes in the seed partition;
• The custom VURM daemons will be managed by a VURM-specific controller;
• The SLURM controller will manage the virtual nodes only.
Figures 2.2 and 2.3 were updated to take these considerations into account, resulting in Figures 2.5 and 2.4, respectively.
[Figure 2.4: The VURM system updated with the VURM controller – vurmctld manages the physical resources while slurmctld only sees the virtual clusters 1–3]
The users can create new virtual clusters by executing the newly provided valloc command and then run jobs on them by using the well-known srun command. Behind the scenes, each of these commands talks to the responsible entity: the VURM controller (vurmctld) in the case of the valloc command, or the SLURM controller (slurmctld) in the case of the srun command.

[Figure 2.5: Adopted VURM architecture – valloc and vrelease (the VURM management commands) talk to vurmctld, which controls the VURM daemons running on the physical cluster nodes; srun (a SLURM management command) talks to slurmctld, which manages the SLURM daemons running on the virtual nodes grouped into virtual clusters]
2.2.3 Multiple provisioners support
The abstract nature of the nodes provided to the SLURM controller (only a hostname and a port number of a listening SLURM daemon are needed) makes it possible to run SLURM daemons on nodes coming from different sources. A particular effort was put into the architecture design to make it possible to use different resource provisioners – and thus resources coming from different sources – in a single VURM controller instance.

Figure 2.6 contains a formal Unified Modeling Language (UML) class diagram which describes the controller architecture in more detail and introduces the new entities needed to enable support for multiple provisioners.
Each IProvisioner realization is a factory for objects providing the INode interface. The VurmController instance queries each registered IProvisioner instance for a given number of nodes, assembles them together and creates a new VirtualCluster instance.

Thanks to the complete abstraction of the IProvisioner and INode implementations, it is possible to use heterogeneous resource origins together as long as they implement the required interface. This makes it easy to add new provisioner types (illustrated in the diagram by the provisioners.XYZ placeholder package) to an already existing architecture.

More details on the exact provisioning and virtual cluster creation process are given in the next section.
2.3 Provisioning workflow
This section introduces the concept of a virtual cluster and explains the provisioning workflow anticipated in the previous sections in deeper detail. Most of the explanations provided in this section refer to the sequence diagram illustrated in Figure 2.7 on page 17.
[Figure 2.6: VURM architecture class diagram – the VurmController (createVirtualCluster(size, minSize), destroyVirtualCluster(name)) drives a SlurmController wrapper (reloadConfig()) and owns the VirtualCluster instances (spawnNodes(), release(), getConfigEntry()); it queries one or more IProvisioner realizations (getNodes(count, names, config)), each acting as a factory for INode objects (nodeName, hostname, port, spawn(), release(), getConfigEntry()); concrete provisioner packages include provisioners.Multilocal, provisioners.KVM and the provisioners.XYZ placeholder]
2.3.1 Virtual cluster definition
A virtual cluster is a logical grouping of virtual machines spawned in response to a user request. The VMs in a virtual cluster belong to the user who initially requested them and can be exploited to run batch jobs through SLURM. These nodes can originate from different providers, depending on the particular resource availability when the creation request occurs.

Virtual clusters are exposed to SLURM as normal partitions containing all virtual nodes assigned to the virtual cluster. The VURM controller modifies the SLURM configuration by placing the node and partition definitions between comments clearly identifying each virtual cluster, allowing an automated process to retrieve the defined virtual clusters by simply parsing the configuration. An example of such a configuration for a virtual cluster consisting of two nodes is illustrated in Listing 2.5.
Listing 2.5: SLURM configuration for a virtual cluster
# [vc-6ad4185]
NodeName=nd-6ad4185-0 NodeHostname=10.0.0.101
NodeName=nd-6ad4185-1 NodeHostname=10.0.0.100
PartitionName=vc-6ad4185 Nodes=nd-6ad4185-[0-1] Default=NO MaxTime=INFINITE State=UP
# [/vc-6ad4185]
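The comment markers delimiting each virtual cluster make the configuration easy to parse back. The following Python sketch (a hypothetical helper, not part of the VURM codebase, with an assumed configuration path) shows how the virtual cluster blocks could be recovered from such a file:

```python
import re

# Match every "# [vc-<name>] ... # [/vc-<name>]" block in slurm.conf and
# return a mapping from virtual cluster name to its configuration lines.
BLOCK_RE = re.compile(
    r'^# \[(vc-\w+)\]\n(.*?)^# \[/\1\]',
    re.MULTILINE | re.DOTALL,
)

def parse_virtual_clusters(config_text):
    clusters = {}
    for match in BLOCK_RE.finditer(config_text):
        name, body = match.group(1), match.group(2)
        clusters[name] = [line for line in body.splitlines() if line.strip()]
    return clusters

with open('/etc/slurm/slurm.conf') as fh:
    for name, entries in parse_virtual_clusters(fh.read()).items():
        print(name, entries)
```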
Once a virtual cluster has been created, the user can run jobs on it by passing the --partition argument to the srun command, as illustrated in Listing 2.6.
Listing 2.6: Batch job execution on a virtual cluster
# The -N2 argument forces SLURM to execute the job on 2 nodes
srun --partition=vc-6ad4185 -N2 hostname
The introduction of the virtual cluster concept thus makes it possible to manage a logical group of nodes belonging to the same allocation unit, conveniently mapped to a SLURM partition, and enables the user to access and exploit the resources assigned to him in the best possible way.
2.3.2 Virtual cluster creation
A new virtual cluster creation process is started as soon as the user executes the newly provided valloc command. The only required parameter is the desired size of the virtual cluster to be created. Optionally, a minimal size and a priority can also be specified. The minimum size is used as the failure threshold if the controller can't find enough resources to satisfy the requested size; if not specified, it defaults to the requested size (message 1 in the sequence diagram in Figure 2.7). The priority (defaulting to 1) is used to allocate resources to the virtual cluster and is further described in chapter 4, starting at page 35.

When a request for a new virtual cluster is received, the controller asks the first defined provisioner for the requested number of nodes. The provisioner can decide, based on the current system load, to allocate and return all nodes, only a part of them or none at all (messages 1.1, 1.2 and 1.2.1).

As long as the total number of allocated nodes has not reached the requested size, the controller keeps asking the next provisioner to allocate the remaining number of nodes. If the provisioners list is exhausted without reaching the user-requested number of nodes, the minimum size is checked: if the number of allocated nodes equals or exceeds the minimum cluster size, the processing goes on; otherwise all nodes are released and the error is propagated back to the user.
NOTE
The loop block with the alternate opt illustrated in the sequence diagram (first block) corresponds to the Python-specific for-else construct. In this construct, the else clause is executed only if the for loop completes all its iterations without executing a break statement.
The Python documentation explaining the exact syntax and semantics of the for-else construct can be found at the following URL: http://docs.python.org/reference/compound_stmts.html#for
In this specific case, the else clause is executed when the controller – after having iterated over all provisioners – has not collected the requested number of nodes.
Once the resource allocation process successfully completes, a new virtual cluster instance is created with the collected nodes (message 1.5) and the currently running SLURM controller is reconfigured and restarted (messages 1.6 and 1.7).

At this point, if the reconfiguration operation completes successfully, a SLURM daemon instance is spawned on each node of the virtual cluster (messages 2 and 2.1). If the reconfiguration fails, instead, the virtual cluster is destroyed by releasing all its nodes (messages 3 and 3.1) and the error is propagated back to the user.

If all the operations complete successfully, the randomly generated name of the virtual cluster is returned to the user to be used in future srun or vrelease invocations.
2.4 Implementing new provisioners
The default distribution ships with two different provisioner implementations: the multilocal implementation is intended for testing purposes and spawns multiple SLURM daemons with different names on the local node, while the remotevirt implementation runs SLURM daemons on virtual machines spawned on remote nodes (the remotevirt implementation is explained in further detail in chapter 3, Physical cluster provisioner, starting at page 19).

New custom provisioners can easily be added to the VURM controller by implementing the IProvisioner interface and registering an instance of such a class with the VurmController instance. Listing 2.7 shows how a custom provisioner instance providing the correct interface can be registered with the VURM controller.
Listing 2.7: vurm/bin/vurmctld.py
# Build controller
ctld = controller.VurmController(config, [
    # Custom defined provisioner
    # Instantiate and configure the custom provisioner directly here or
    # outside the VurmController constructor invocation.
    # If the custom provisioner is intended as a fallback provisioner,
    # put it after the main one in this list.
    customprovisioner.Provisioner(customArgument, reactor, config),

    remotevirt.Provisioner(reactor, config),    # Default shipped provisioner
    multilocal.Provisioner(reactor, config),    # Testing provisioner
])
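As a rough sketch of what such a custom provisioner could look like, the skeleton below hands out nodes from a fixed pool of already running hosts. The names and the simplified, synchronous signatures are hypothetical; the real contracts are the IProvisioner and INode interfaces shown in Figure 2.6, and the actual implementation is built on Twisted.

```python
class StaticNode(object):
    """An INode-like object backed by an already running host."""

    def __init__(self, nodeName, hostname, port):
        self.nodeName, self.hostname, self.port = nodeName, hostname, port

    def getConfigEntry(self):
        # Entry added to slurm.conf inside the virtual cluster block.
        return 'NodeName={0} NodeHostname={1} Port={2}'.format(
            self.nodeName, self.hostname, self.port)

    def spawn(self):
        pass    # start a slurmd instance on the target host

    def release(self):
        pass    # stop the slurmd instance and free the resource


class StaticPoolProvisioner(object):
    """IProvisioner-like factory handing out nodes from a fixed pool."""

    def __init__(self, hostnames, basePort=6818):
        self.available = [(hostname, basePort) for hostname in hostnames]

    def getNodes(self, count, names, config):
        # Return at most `count` nodes; the controller will ask the next
        # provisioner in the list for whatever is still missing.
        nodes = []
        while self.available and len(nodes) < count:
            hostname, port = self.available.pop()
            nodes.append(StaticNode(names.pop(0), hostname, port))
        return nodes
```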
One of the foreseen future improvements is to adopt a plugin architecture to implement, publish and configure the provisioners; refer to the conclusions chapter on page 51 for more information about this topic.
[Figure 2.7: Resource provisioning workflow – sequence diagram of the interactions between the user, the VurmController, the IProvisioner instances and SLURM: the valloc request (1), the loop asking each provisioner for nodes via getNodes() (1.1–1.3), the release of partial allocations when the minimum size is not reached (1.4), the creation of the VirtualCluster (1.5), the SLURM reconfiguration (1.6–1.7) and, depending on its outcome, spawnNodes()/spawn() on each node (2, 2.1) or release() of the whole cluster (3, 3.1)]
Chapter 3
Physical cluster provisioner
A resource provisioner is an entity able to provision resources, in the form of SLURM daemons, upon request by the VURM controller. The VURM architecture allows multiple, stackable resource provisioners and comes with two different implementations. One of them, the physical cluster provisioner, is described in this chapter.

For additional information about the exact role of a resource provisioner, how it fits into the VURM architecture and which tasks it has to be able to accomplish, refer to the Architecture chapter at page 7.

The physical cluster provisioner is conceived to manage Linux based HPC clusters. It executes SLURM daemons on virtual machines spawned on the physical nodes of the cluster and feeds them to the SLURM controller. Additionally, this module is also responsible for managing virtual machine migration between physical nodes to better exploit the available computing resources.
The different aspects covered by this chapter are structured as follows:
• Section 3.1 describes the provisioner-specific architecture, including the description of both internal and external software components, networking configuration, etc. A class diagram for the overall provisioner architecture is presented as well;
• Section 3.2 describes all the entities involved in the deployment of a remotevirt-based VURM setup;
• Section 3.3 introduces some libvirt-related concepts and conventions and describes the eXtensible Markup Language (XML) domain description file used to configure the guests;
• Section 3.4 deals with different aspects of the virtual machine lifecycle, such as setup, efficient disk image cloning, different IP address retrieval techniques or public key exchange approaches.
Aspects directly bound to VM migration between physical nodes will not be discussed in this
chapter. Refer to the Migration chapter, starting at page 35, for additional information about
this topic.
3.1 Architecture
The cluster provisioner takes advantage of the abstraction provided by libvirt to manage VMs on the different nodes of the cluster. Each node is thus required to have libvirt installed and the libvirt daemon (libvirtd) running.

Libvirt already implements support for remote management, offering good security capabilities and different client interfaces. Unfortunately, but understandably, the exposed functionalities are strictly virtualization oriented. As the VURM cluster provisioner needs to carry out additional and more complex operations on each node individually (IP address retrieval, disk image cloning, SLURM daemon management, ...), an additional component is required to be present on these nodes. This component was implemented in the form of a VURM daemon. This daemon exposes the required functionalities and manages the communication with the local libvirt daemon.

An additional remotevirt controller was implemented to centralize the management of the VURM daemons distributed on the different nodes. This controller is implemented inside the physical cluster provisioner package and is a completely different entity from the VURM controller daemon used to manage resources at a higher abstraction level.

The class diagram in Figure 3.1 on page 21 illustrates the whole architecture graphically. The entities on the right (enclosed by the yellow rounded rectangle) are part of the VURM daemon running on each single node, while the rest of the entities in the vurm.provisioners.remotevirt package are part of the provisioner interface and the remotevirt controller implementation.

Together with the Domain class, the PhysicalNode, VirtualCluster and System classes allow modeling a whole physical cluster, its physical nodes, the various created virtual clusters and their respective domains (virtual nodes).

Each PhysicalNode instance is bound to the remote daemon sitting on the actual computing node by a DomainManager instance. This relationship is also exploited by the Domain instances themselves for operations related to the management of existing VMs, such as VM release or (as introduced later on) VM migration.

The remotevirt daemons expose a DomainManager instance for remote operations, such as domain creation/destruction and SLURM daemon spawning. The domain management operations are carried out through a connection to the local libvirt daemon accessed through the bindings provided by the libvirt package. Operations which have to be carried out on the VM itself, instead, are executed by exploiting a Secure Shell (SSH) connection provided by the SSHClient object.
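The kind of in-guest operation delegated to the SSHClient object can be illustrated with a short sketch. The example below uses the third-party paramiko library rather than VURM's own Twisted-based SSHClient, and the paths, user name and command are hypothetical; it simply uploads a SLURM configuration into the guest and starts a slurmd instance there.

```python
import paramiko

def spawn_slurmd(ip_address, config_path, key_filename):
    """Upload slurm.conf to the guest and start a SLURM daemon over SSH."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip_address, username='root', key_filename=key_filename)

    # Transfer the generated configuration file to the guest...
    sftp = client.open_sftp()
    sftp.put(config_path, '/etc/slurm/slurm.conf')
    sftp.close()

    # ...and launch the SLURM daemon with that configuration.
    stdin, stdout, stderr = client.exec_command('slurmd -f /etc/slurm/slurm.conf')
    exit_status = stdout.channel.recv_exit_status()
    client.close()
    return exit_status
```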
More details about the implemented interfaces (INode and IProvisioner) can be found in subsection 2.2.3 on page 13, as part of the Architecture chapter.

The next section offers additional details about how the components of this specific provisioner fit into the global system setup and defines the relationship with other system entities for a correct deployment.
[Figure 3.1: Overall architecture of the remotevirt provisioner – class diagram of the vurm.provisioners.remotevirt package: the Provisioner (an IProvisioner realization) manages a System of PhysicalNode instances; each VirtualCluster groups Domain objects (INode realizations) hosted on those nodes; every PhysicalNode is bound to the DomainManager exposed by the remote VURM daemon, which in turn drives the local libvirt connection (open(uri), createLinux(description), lookupByName(name)) and an SSHClient (connect(), executeCommand(), transferFile(), disconnect()) to operate on the guests]
3.2 Deployment
The Architecture section introduced the different entities involved in the management of VMs on a cluster using the physical cluster provisioner. A complete VURM system based on this provisioner is even more complex because of the additional components needed to manage the overall virtualized resources.

This section introduces and explains the deployment diagram of a complete VURM setup which uses the physical cluster provisioner as its only resource provisioner. All the explanations provided in this section refer to the deployment diagram in Figure 3.2 on page 24.

The deployment diagram illustrates the most distributed setup currently achievable. All the different logical nodes can be placed on a single physical node if needed (either for testing or for development purposes).

For illustration purposes, four different styles were used to draw the components and artifacts present in the diagram. A legend is available in the bottom left corner of the diagram and defines the coding used for the different formatting styles. The following list provides a more detailed description of the semantics of each style:
• The first formatting style was used for externally provided entities which have to be installed by a system administrator and which the VURM libraries use to carry out the different tasks. All these components are provided by either SLURM or libvirt;
• The second formatting style was used for artifacts that have to be provided either by the system administrator (mainly configuration files) or by the end user (OS disk images, key pairs, ...). The only component in this category is the IP address sender. In the normal case, this component consists of a small shell script, as described in subsection 3.4.1. A simple but effective version of this script is provided in Listing 3.4;
• Components in the third category are provided by the VURM library. These components have to be installed by a system administrator in the same way as the SLURM and libvirt libraries;
• The fourth category contains artifacts which are generated at runtime by the different intervening entities. These artifacts don't need to be managed by any third party.
The rest of this section explains the role of each represented component in more detail.
Commands
The srun, scontrol, valloc and vrelease components represent either SLURM or VURM
commands. As illustrated in the diagram, each command talks directly to the respective entity
(the SLURM controller for SLURM commands or the VURM controller for VURM commands).
This difference is abstracted from the user perspective and does not involve additional usage
complexity.
Additional commands, not illustrated here, may be available. The SLURM library comes with a plethora of additional commands, such as sview, sinfo, squeue, scancel, etc.
VurmController and RemotevirtProvisioner
The VurmController component represents the VURM controller daemon, configured with the RemotevirtProvisioner. The provisioner component represents both the IProvisioner realization itself and the remotevirt controller introduced in the previous section.
Both entities can be configured through a single VURM configuration file. This file contains the configuration for the overall VURM system and also defines the different physical nodes available to the remotevirt provisioner. The user manual in Appendix A offers a complete reference for this configuration file.
Domain description file
The domain description file is used as a template to generate an actual domain description to
pass to the VURM daemons. This file describes a libvirt domain using an XML syntax defined
by libvirt itself. The section 3.3 offers an overview of the different available elements, describes
the additions made by VURM and presents part of the default domain description file.
VurmdLibvirt, Libvirt and KVM/QEMU
The three components placed on the physical node are responsible for carrying out the management operations requested by the controller. Despite being represented only once on the deployment diagram, a VURM system using the remotevirt provisioner normally consists of one controller node and many physical nodes. The same argument holds for the virtual node: a physical node is able to host several virtual nodes.
The VurmdLibvirt component mainly offers a facade for more complex operations over the
libvirt daemon and is responsible for the local housekeeping of the different assets (mainly
disk image clones and state files). The libvirt daemon and the hypervisor of choice, in this
case KVM/QEMU, are used to effectively create, destroy and migrate the VMs themselves.
The dedicated port between the VurmdLibvirt and the IP address sender components is needed to retrieve the IP address from the VM once it has been assigned. The IP address retrieval technique is described in more detail in subsection 3.4.1.
AMP protocol
The SLURM commands talk to the SLURM controller by using a custom SLURM binary protocol, while the VURM commands adopt the Asynchronous Messaging Protocol (AMP). The same protocol is also adopted by the remotevirt provisioner to communicate with the daemons placed on each single node.
AMP [15] is an asynchronous key/value pair based protocol with implementations in different languages. The asynchronous nature of AMP allows multiple request/response pairs to be sent over the same connection, without having to wait for the response to a previous request before sending a new one.
The adoption of AMP as the default protocol to communicate between the VURM entities allows a higher degree of scalability to be reached thanks to the reuse and multiplexing of different channels over a single TCP/IP connection.
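As an illustration only, the following sketch shows how a key/value pair based command could be declared with the AMP implementation shipped with the Twisted framework (one of the available Python implementations); the command name and its arguments are hypothetical and are not taken from the VURM sources.

from twisted.protocols import amp

class SpawnDaemon(amp.Command):
    """Hypothetical command asking a remote daemon to start slurmd."""
    arguments = [(b'nodeName', amp.Unicode()),
                 (b'slurmConfig', amp.String())]
    response = [(b'started', amp.Boolean())]

Several such commands can then be issued concurrently over a single connection, which is what makes the channel multiplexing described above possible.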
Figure 3.2: Complete VURM + remotevirt deployment diagram (client node with the srun, scontrol, valloc and vrelease commands; controller node with the SlurmController, VurmController and RemotevirtProvisioner components and their configuration files; physical nodes with the VurmdLibvirt, Libvirt and KVM/QEMU components and the VURM private key; virtual nodes running Slurmd and the IP address sender with the VURM public key; disk images and their clones on a shared NFS storage location; SLURM binary protocol and AMP protocol connections over TCP/IP)
3.3 Libvirt integration
The libvirt library provides an abstraction layer to manage virtual machines by using different
hypervisors, both locally and remotely. It offers an API to create, start, stop, destroy and
otherwise manage virtual machines and their respective resources such as network devices,
storage devices, processor pinnings, hardware interfaces, etc. Additionally, through the libvirt
daemon, it accepts incoming remote connections and allows complete exploitation of the API
from remote locations, while accounting for authentication and authorization related issues.
The libvirt codebase is written in C, but bindings for different languages are provided as part of
the official distribution or as part of external packages. The currently supported languages are
C, C#, Java, OCaml, Perl, PHP, Python and Ruby. For the scope of this project the Python
bindings were used.
Remote connections can be established using raw Transmission Control Protocol (TCP), Transport Layer Security (TLS) with x509 certificates or SSH tunneling. Authentication can be provided by Kerberos and Simple Authentication and Security Layer (SASL). Authentication of
local connections (using unix sockets) can be controlled with PolicyKit.
To abstract the differences between VM management over different hypervisors, libvirt adopts
a special naming convention and defines its own XML syntax to describe the different involved
resources. This section introduces the naming convention and the XML description file. The XML description file subsection includes the explanations of the changes introduced by the VURM virtualization model.
3.3.1 Naming conventions
Libvirt adopts the following naming conventions to describe the different entities involved in
the VM lifecycle:
• A node is a single physical machine;
• A hypervisor is a layer of software that virtualizes a node into a set of virtual machines, possibly with different configurations than the node itself;
• A domain is an instance of an operating system running on a virtualized machine provided
by the hypervisor.
Under this naming convention and for the scope of the VURM project, the definition of VM or
of virtual node is thus the same as the definition of domain.
3.3.2 XML description file
The XML description file is the base for the definition of a new domain. This file contains all the needed information, such as memory allocation, processor pinning, hypervisor features to enable, disk definitions, network interface definitions, etc.
Given the wide range of configuration options to support, a complete description of the available
elements would be too long to include in this section. The complete reference is available online
at [8]. For the purposes of this project, the description of the elements contained in the XML
description presented in Listing 3.1 is more than sufficient.
Listing 3.1: Basic libvirt XML description file

<domain type="kvm">
  <name>domain-name</name>

  <memory>1048576</memory>

  <vcpu>2</vcpu>

  <os>
    <type arch="x86_64">hvm</type>
    <boot dev="hd"/>
  </os>

  <features>
    <acpi/>
    <hap/>
  </features>

  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>

    <disk type="file">
      <driver name="qemu" type="qcow2"/>
      <source file="debian-base.qcow2"/>
      <target dev="sda" bus="virtio"/>
    </disk>

    <interface type="bridge">
      <source bridge="br0"/>
      <target dev="vnet0"/>
      <model type="virtio"/>
    </interface>
  </devices>
</domain>
The root element for the definition of a new domain is the domain element. Its type attribute
identifies the hypervisor to use.
Each domain element has a name element containing a host-unique name to assign to the
newly spawned VM. When using the remotevirt provisioner to spawn new virtual machines, the
name element will automatically be set (and overridden if present).
The memory element contains the amount of memory to allocate to this domain. The units of
this value are kilobytes. In this case, 1GB of memory is allocated to the domain.
The type child of the os element specifies the type of operating system to be booted in the
virtual machine. hvm indicates that the OS is one designed to run on bare metal, so requires
full virtualization. The optional arch attribute specifies the CPU architecture to virtualize.
The features element can be used to enable additional features. In this case the acpi (useful
for power management, for example, with KVM guests it is required for graceful shutdown to
work) and hap (enables use of Hardware Assisted Paging if available in the hardware) features
are enabled.
The content of the emulator element specifies the fully qualified path to the device model
emulator binary to be used.
The disk element is the main container for describing disks. The type attribute is either
file, block, dir, or network and refers to the underlying source for the disk. The optional
device attribute indicates how the disk is to be exposed to the guest OS. Possible values for
this attribute are floppy, disk and cdrom, defaulting to disk.
The file attribute of the source element specifies the fully-qualified path to the file holding
the disk. The path contained by this attribute is always resolved as a child of the path defined
by the imagedir option in the VURM configuration (in the case of a VURM setup, absolute paths are thus not supported).
The optional driver element allows specifying further details related to the hypervisor driver
used to provide the disk. QEMU/KVM only supports a name of qemu, and multiple types
including raw, bochs, qcow2, and qed.
The target element controls the bus/device under which the disk is exposed to the guest OS.
The dev attribute indicates the logical device name. The optional bus attribute specifies the
type of disk device to emulate; possible values are driver specific, with typical values being ide,
scsi, virtio, xen or usb.
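As a minimal sketch (assuming the standard libvirt Python bindings, a placeholder connection URI and file name, and not the actual VURM code), a description such as the one in Listing 3.1 can be turned into a running domain as follows:

import libvirt

# Open a remote connection to the libvirt daemon running on a physical node.
conn = libvirt.open('qemu+tcp://node01/system')

# Read the XML description and create a transient domain from it.
with open('domain.xml') as fh:
    description = fh.read()
domain = conn.createXML(description, 0)

print(domain.name())   # the host-unique name set in the <name> element
domain.destroy()       # forcefully stop and remove the transient domain
conn.close()

The LibvirtConnection wrapper shown in Figure 3.1 exposes the same kind of functionality (open, createLinux, lookupByName); createLinux is an older libvirt name for the same domain-creation call.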
3.4 VM Lifecycle aspects
Before the SLURM daemon can be run, different operations have to be performed. The ultimate goal of this set of operations is to have a VM up and running. This section describes the different steps involved in reaching this goal and discusses some alternative approaches for some of them.
The sequence diagram in Figure 3.4 illustrates the complete VM lifecycle. The lifecycle can be divided into three phases: the first phase (domain creation) occurs when the creation request is first received; the second phase (SLURM daemon spawning) occurs as soon as all domains of the virtual cluster have been correctly created and their IP addresses received; and the third phase (domain destruction) occurs once the end user has finished executing SLURM jobs on the virtual cluster and asks for it to be released.
Two operations of the first phase, more precisely the messages 1.1 and 1.3, deserve particular attention. Subsections 3.4.1 and 3.4.2 address these operations respectively.
The last subsection is dedicated to explaining some of the possible public-key exchange approaches needed to perform a successful authentication when connecting to the VM through SSH (message 3.1). Due to the nature of the adopted solution, no messages are visible on the diagram for this operation.
3.4.1 IP address retrieval
The simplest way to interact with a newly started guest is through a TCP/IP connection. The VURM tools use SSH to execute commands on the guest, while the SLURM controller communicates with its daemons over a socket connection. All of these communication means require the IP address (or the hostname) of the guest to be known.
At present, libvirt does not expose any API to retrieve a guest's IP address, and a solution to overcome this shortcoming has to be implemented completely by the calling codebase. During the execution of the project, different methods were analyzed and compared; the rest of this subsection is dedicated to analyzing the advantages and shortcomings of each of them.
Inspect host-local ARP tables
As soon as a guest establishes a TCP connection with the outside world, it has to advertise its IP address in an Address Resolution Protocol (ARP) response. This causes the other hosts on the same subnet to pick it up and cache it in their ARP tables.
Running the arp -an command on the host will reveal the contents of the ARP table, as
presented in Listing 3.2:
Listing 3.2: Example arp -an output

$ arp -an
? (10.0.0.1) at 0:30:48:c6:35:ea on en0 ifscope [ethernet]
? (10.0.0.95) at 52:54:0:48:5c:1f on en0 ifscope [ethernet]
? (10.0.1.101) at 0:14:4f:2:59:ae on en0 ifscope [ethernet]
? (10.0.6.30) at 52:54:56:0:0:1 on en0 ifscope [ethernet]
? (10.255.255.255) at ff:ff:ff:ff:ff:ff on en0 ifscope [ethernet]
? (129.24.176.1) at 0:1c:e:f:7c:0 on en1 ifscope [ethernet]
? (129.24.183.255) at ff:ff:ff:ff:ff:ff on en1 ifscope [ethernet]
? (169.254.255.255) at (incomplete) on en0 [ethernet]
As libvirt is able to retrieve the Media Access Control (MAC) address associated with a given
interface, it is possible to parse the output and retrieve the correct IP address.
This method could not be adopted because ARP packets do not pass across routed networks and would thus be stopped by the bridged network layout used by the spawned domains.
Proprietary VM Tools
Some hypervisors (for example, VMware ESXi or VirtualBox) are coupled with a piece of software which can be installed on the target guest and provides additional interaction capabilities, such as command execution, enhanced virtual drivers or, in this case, IP address retrieval.
Such a tool would bind the whole system to a single hypervisor and would nullify the efforts made to support the complete range of libvirt-compatible virtual machine monitors. Additionally, the KVM/QEMU pair, chosen for its live migration support, does not provide such a tool.
Static DHCP entries
A second possible method is to configure a pool of statically defined IP/MAC address couples
in the Dynamic Host Configuration Protocol (DHCP) server and configure the MAC address
in the libvirt XML domain description file in order to know in advance which IP address will
be assigned to the VM.
This method could have worked and is a viable alternative; it was, however, preferred not to depend on external services (the DHCP server, in this case) and to provide a solution which can work in every context. A configuration switch to provide a list of IP/MAC address pairs instead of the adopted solution can nevertheless be considered as a future improvement.
Serial to TCP data exchange
The last explored and successfully adopted method takes advantage of the serial port emulation
feature provided by the different hypervisors and the capability to bind them to (or connect
them with) a TCP endpoint of choice.
In the actual implementation, the VURM daemon listens on a new (random) port before it spawns a new VM; the libvirt daemon will then provide a serial port to the guest and connect it to the listening server. Once booted, a little script on the guest writes the IP address to the serial port and the TCP server receives it as a normal TCP/IP data stream.
Although neither authenticated nor encrypted, this communication channel is deemed secure enough, as no other entity can possibly obtain relevant information by eavesdropping on the communication (only an IP address is exchanged). Additionally, as it is the hypervisor's task to establish the TCP connection directly to the local host, there is no danger of man-in-the-middle attacks (and thus of invalid information being transmitted).¹ Listing 3.3 contains the needed libvirt device configuration:
Listing 3.3: Libvirt TCP to serial port device description

<serial type="tcp">
  <source mode="connect" host="127.0.0.1" service="$PORT"/>
  <protocol type="raw"/>
  <target port="1"/>
  <alias name="serial0"/>
</serial>
The shell script on the guest needed to write the IP address to the serial device can be as simple
as the following:
Listing 3.4: Shell script to write the IP address to the serial port

#!/bin/sh
ifconfig eth0 | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1 >/dev/ttyS0
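On the host side, the VURM daemon plays the server role of this exchange. A minimal sketch of such a receiver (using the standard socket module and hypothetical variable names, not the actual VURM implementation) could look as follows:

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))       # port 0 lets the OS pick a free port
port = server.getsockname()[1]      # value substituted for $PORT in Listing 3.3
server.listen(1)

conn, _ = server.accept()           # the emulated serial device connects here at boot
ip_address = conn.recv(64).decode().strip()
conn.close()
server.close()
print('guest reachable at', ip_address)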
3.4.2 Volumes cloning
During the virtual cluster creation phase, the user has the possibility to choose which disk
image all the created VMs will be booted from.
As a virtual cluster is formed by multiple VMs, and as each virtual machine needs its own disk image to read from and write to, multiple copies of the base image have to be created.
Performing a raw copy of the base image presents obvious limitations: firstly, each copy occupies the same amount of space as the original image, which rapidly fills up the available space on the storage device; and secondly, given that these images need to live on a shared file system (see chapter 4 for the details), the copy operation can easily saturate the available network bandwidth and take a significant amount of time.
To overcome these drawbacks, a particular disk image format called Qemu Copy On Write,
version 2 (QCOW2) is used. This particular format has two useful features:
• A qcow2 image file, unlike a raw disk image, occupies only the equivalent of the size of the data effectively written to the disk. This means that it is possible to create a 40GB disk image which, once a base OS is installed on it, occupies only a couple of GBs²;
• It is possible to create a disk image based on another one. Read data is retrieved from the base image until the guest modifies it. Modified data is written only to the new image, while leaving the base image intact. Figure 3.3 represents this process graphically.
Figure 3.3: Copy-On-Write image R/W streams (reads are served from the base image until a block is modified; modified blocks are written to, and subsequently read from, the qcow2 image only)
The combination of these two features allows per-VM disk images to be created in both a time- and space-efficient manner.
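For illustration, such a copy-on-write clone can be created with the qemu-img utility; the following sketch (with placeholder file names, not the exact command issued by VURM) shows the idea:

import subprocess

base = 'debian-base.qcow2'    # shared, read-only base image
clone = 'clone-0.qcow2'       # per-VM image, stores only the modified blocks

# Create a qcow2 image backed by the base image; the clone is created almost
# instantly and initially occupies close to no space.
subprocess.check_call(['qemu-img', 'create', '-f', 'qcow2', '-b', base, clone])

Newer qemu-img releases may additionally require the backing file format to be passed explicitly with -F qcow2.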
3.4.3 Public key exchange
As anticipated in section 3.1, the VURM daemon executes commands on the guest through an SSH connection. The daemon authentication modes were deliberately restricted, for security reasons, to public key authentication only.
This restriction requires that the daemon public key is correctly installed on the guest disk
image. As the disk image is customizable by the user, it is necessary to define how this public
key will be installed.
Different techniques can be used to achieve this result; the rest of this subsection is dedicated to analyzing three of them.
Public key as a requirement
The simplest (from the VURM point of view) solution is to set the setup of a VURM-specific
user account with a given public key as a requirement to run a custom disk image on a given
VURM deployment.
This approach requires the end user to manually configure the custom disk image. The obvious drawback is the inability to run the same disk image on different VURM deployments, as the key pair differs from setup to setup.³
The final chosen solution is based on this approach, given that the disk image has to be specifically customized because of other parameters anyway. Refer to section A.4 in the User manual for more information about the VM disk image creation process.
¹ The daemon only binds to the loopback interface; a malicious process would have to be running on the same host to be able to connect to the socket and send a wrong IP address.
² Other disk image formats support this feature too, for example qcow (version 1) or VMDK (when using the sparse option).
³ This is, however, a very minor drawback; the percentage of users running jobs on different clusters is very low, and even so, the disk image probably needs adjustments of other parameters anyway.
Pre-boot key installation
Another explored technique consists of mounting the disk image on the host, copying the public
key to the correct location, unmounting the image and then starting the guest.
The mounting, copying and unmounting operation was made straightforward by the use of a
specific image manipulation library called libguestfs.
For security reasons, this library would spawn a new especially crafted VM instead of mounting
the image directly on the host; this made the whole key installation operation a long task
when compared with the simplicity of the problem (mounting, copying and unmounting took
an average of 5 seconds during the preliminary tests).
The poor performance of the operation, its numerous dependencies (too many for such a simple operation) and its tight coupling with the final public key path on the guest led to this approach being discarded in the early prototyping phase.
IP-Key exchange
As seen in the previous section, the final solution adopted to retrieve a guest's IP address already requires a communication channel to be set up between the VURM daemon and the guest.
The idea of this approach is to send the public key back to the guest as soon as the IP address is received. In this case, the guest would first write the IP address to the serial port and then read the public key and store it in the appropriate location. An extension to the shell script in Listing 3.4 which saves the public key as a valid key to log in as root is presented in Listing 3.5:
Listing 3.5: Shell script to exchange the IP address and a public key over the serial port

#!/bin/sh
ifconfig eth0 | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1 >/dev/ttyS0
mkdir -p /root/.ssh
chmod 0755 /root/.ssh
cat /dev/ttyS0 >>/root/.ssh/authorized_keys
chmod 0644 /root/.ssh/authorized_keys
This technique was adopted at first, but then discarded in favor of the more proper requirement-based solution, as bidirectional communication over a serial port proved too complicated to implement correctly in a simple shell script.⁴
A proper solution would have required the installation of a custom-built utility; copying a simple file (the public key) was considered a less troublesome operation than installing yet another utility along with its dependencies.

⁴ Synchronization and timing issues which could not simply be solved by using timeouts were detected when running slow guests.
3.5 Possible improvements
3.5.1 Disk image transfer
The current implementation takes advantage of the Network File System (NFS) support already required by libvirt⁵ to exchange disk images between the controller and the physical nodes. On clusters consisting of many nodes, the shared network location becomes a bottleneck as soon as multiple daemons try to access the same disk image to start new VMs.

⁵ NFS support is required by libvirt and the qemu driver for VM migration; more details are given in chapter 4.
A trivial improvement would consist of a basic file exchange protocol allowing the disk image to be transferred between the controller and the daemons while optimizing and throttling this process in order not to saturate the whole network.
Like the NFS-based solution, this trivial approach also exposes different, non-negligible drawbacks and can be further optimized. The nature of the task – distributing a single big file to a large number of entities – seems to be a perfect fit for the BitTorrent protocol, as highlighted in the official protocol specification [5]:
BitTorrent is a protocol for distributing files. [. . . ] Its advantage over plain HTTP is
that when multiple downloads of the same file happen concurrently, the downloaders
upload to each other, making it possible for the file source to support very large
numbers of downloaders with only a modest increase in its load.
A more advanced and elegant solution would thus implement disk image transfers using the
BitTorrent protocol instead of a more traditional single-source distribution methodology.
3.5.2 Support for alternative VLAN setups
The current implementation bridges the VM network connection directly to the physical network card of the host system. VMs thus appear on the network as standalone devices and share the same subnet as their hosts.
It could be possible to create an isolated Virtual LAN (VLAN) subnet for each virtual cluster in order to take advantage of the different aspects offered by such a setup, such as traffic control patterns and quick reactions to inter-subnet relocations (useful in case of VM migrations).
This improvement would add even more flexibility to the physical cluster provisioner and enable even more transparent SLURM setups spanning multiple clusters residing in different physical locations.
It is useful to note that different private cloud managers already implement and support VLAN
setups to various degrees (the example par excellence being Eucalyptus [7] [9]).
Figure 3.4: Complete domain lifecycle. The sequence diagram involves the remotevirt provisioner, the DomainManager, the libvirt connection, the Domain and the SSH connection, with the following messages: 1: createDomain(description); 1.1: clone disk image; 1.2: update description with disk image path; 1.3: start new IP address receiver server; 1.4: add serial to tcp device to description; 1.5: createLinux(description); 1.5.1: create(); 2: sendIPAddress(ip); 2.1: domain created; 3: spawnDaemon(domain, slurmConfig); 3.1: connect(username, key); 3.2: transferFile(slurmConfig, '/etc/slurm.conf'); 3.3: executeCommand('slurmd'); 3.4: disconnect(); 3.5: daemon spawned; 4: destroyDomain(domainName); 4.1: destroyDomain(domainName); 4.1.1: destroy(); 4.2: remove image clone; 4.3: domain destroyed.
Chapter 4
Migration
Resource allocation to jobs is a difficult task to execute optimally and often depends on a plethora of different factors. Most job schedulers are conceived to schedule jobs and allocate resources to them upfront, with only limited abilities to intervene once a job has entered the running state.
Process migration, or, in this specific case, VM migration, allows resource allocation and job scheduling to be dynamically adjusted as new information becomes available, even if a job has already started. This runtime adjustment is applied to the system by moving jobs from one resource to another as necessity arises.
Figure 4.1 provides an example of the concept of VM migration applied to the scheduling of virtual clusters on resources managed by the Physical cluster provisioner described in chapter 3. In this scenario, a virtual cluster is resized through VM migration in order to account for an additional job submitted to the system.
In the illustration, each square represents a physical node, each circle a virtual machine (using
different colors for VMs belonging to different virtual clusters) and each shaded rectangle a
virtual cluster.
Figure 4.1: Example of virtual cluster resizing through VMs migration. (a) Low load operation, (b) Migration, (c) High load operation, (d) Restoration.
Step 4.1a illustrates the state of the different nodes in a stable situation with one virtual cluster consisting of twelve virtual machines. Step 4.1b shows how the system reacts as soon as a new virtual cluster request is submitted to the system: some VMs are migrated to already busy nodes of the same cluster (oversubscription) to free up resources for the new virtual cluster. The freed-up resources are then allocated to the new virtual cluster as shown in step 4.1c. Step 4.1d shows how an optimal resource usage is restored once one of the virtual clusters is released and more resources become available again.
Basic support for these scheduling and allocation enhancements is provided by the current VURM
implementation. In order to be able to account for different allocation, scheduling and migration
techniques and algorithms, the different involved components are completely pluggable and can
easily be substituted individually.
In order to further describe the different components and the implemented techniques and
algorithms, the present chapter is structured in the following sections:
• Section 4.1 introduces the migration framework put in place to support alternate pluggable
resource allocators and migration schedulers by describing the involved data structures
and the responsibilities of the different components;
• Section 4.2 describes the resources allocation strategy implemented in the default resource
allocator shipped with the VURM implementation;
• Similarly as done for the previous section, section 4.3 describes the default migration
scheduling strategy and offers insights over the optimal solution;
• Section 4.4 introduces the different approaches to VM migration and describes the implemented solution as well as the issues encountered with the alternate techniques;
• To conclude, section 4.5 summarizes some of the possible improvements which the current VURM version could take advantage of but which were not implemented due to different constraints.
4.1 Migration framework
Different job types, different execution contexts or different available resources are all factors which can lead to choosing one given resource allocation or job scheduling algorithm over another. A solution which works equally well in every case does not exist and, as such, users and system administrators want to be able to adapt the strategy to the current situation.
In order to provide support for configurable and pluggable schedulers and allocators, a simple
framework was put in place as part of the Physical cluster provisioner architecture. This
section describes the different intervening entities, their responsibility and the data structures
exchanged between them.
Figure 4.2 illustrates the class diagram of the migration framework. Most of the represented classes were taken from the physical cluster provisioner class diagram (Figure 3.1); newly added classes, attributes and operations are highlighted using a lighter background color.
As shown in the class diagram, three new interfaces were added to the architecture. Each of
these interfaces is responsible for a different task in the whole migration operation. The newly
added members to the already existing classes and the Migration class were added to help
the three interface realizations make the right decisions and communicate them back to the
resource provisioner for it to carry them out.
Figure 4.2: Migration framework components (the IResourcesAllocator, IMigrationScheduler and IMigrationManager interfaces, the new Migration class with its migrate(), waitToComplete() and addDependency(migration) operations, and the members added to the existing System, PhysicalNode, VirtualCluster and Domain classes, such as computingUnits, priority, getTotalPriority(), getTotalComputingUnits(), getComputingUnits(), getWeight() and scheduleMigrationTo(node))
4.1.1 Node and cluster weighting
The current version of the VURM remotevirt provisioner allows a fixed Computing Unit (CU) value to be assigned to each physical node PN in the system. A computing unit is simply a relative value comparing a node to other nodes in the system: if Node A has a CU value of 1 while Node B has a value of 2, then Node B is assumed to be twice as powerful as Node A. Note that these values do not express any metric about the absolute computing power of either node but only the relationship between them.
The total power of the system CUsys is defined as the sum of all computing units of each node
in the system itself (Equation 4.1). It is possible to access these values by reading the value
contained in the computingUnits property of a PhysicalNode instance or by calling the
getTotalComputingUnits method on the System instance.
CU_{sys} = \sum_{i=0}^{n} CU_{PN_i}    (4.1)
Similarly to what is done for physical nodes, it is possible to assign a fixed priority P to a virtual cluster VC at creation time. As for the CUs, the priority is also a value relative to the other clusters in the system: if Virtual Cluster A and Virtual Cluster B have the same priority, they will both have access to the same amount of computing power. The weight of a virtual cluster W_{VC} is defined as the ratio between the priority of the cluster and the sum of the priorities of all the clusters in the system (Equation 4.2); this value indicates the fraction of computing power the virtual cluster is entitled to.
W_{VC_i} = \frac{P_{VC_i}}{\sum_{i=0}^{n} P_{VC_i}}    (4.2)
The exact amount of computing power assigned to a cluster CU_{VC} is easily calculated by multiplying the weight by the total computing power of the system (Equation 4.3).

CU_{VC_i} = W_{VC_i} \cdot CU_{sys}    (4.3)
Methods to calculate all these values are provided by the VURM implementation itself and do not need to be implemented by the resource allocators. The next subsection illustrates how these values and the provided data structures are used by the three interface realizations to effectively reassign resources to different VMs, schedule migrations in the optimal way and apply a given migration technique.
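As a tiny numeric illustration of Equations 4.1 to 4.3 (the values below are arbitrary and not taken from this report), the weights and the per-cluster computing power could be computed as follows:

# Computing units of the physical nodes and priorities of the virtual clusters.
nodes = [1, 1, 2, 2, 4]
priorities = {'VC-A': 2, 'VC-B': 1, 'VC-C': 1}

cu_sys = sum(nodes)                                     # Equation 4.1 -> 10
total_priority = float(sum(priorities.values()))
weights = {vc: p / total_priority for vc, p in priorities.items()}   # Equation 4.2
cu_vc = {vc: w * cu_sys for vc, w in weights.items()}                # Equation 4.3

print(cu_vc)   # {'VC-A': 5.0, 'VC-B': 2.5, 'VC-C': 2.5}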
4.1.2 Allocators, schedulers and migration managers
The class diagram in Figure 4.2 adds three new interfaces to the physical cluster provisioner architecture: the IResourcesAllocator realization is in charge of deciding which VMs have to be migrated to which physical node, the IMigrationScheduler realization schedules migrations by defining dependencies between them in order to optimize the execution time, and the IMigrationManager realization is responsible for actually carrying out each single migration. Each of these three interfaces has to be implemented by one or more objects and passed to the remotevirt provisioner at creation time.
The interactions between the provisioner and these three components of the migration framework are illustrated in the sequence diagram in Figure 4.3. Each of the intervening components is further described in the following paragraphs.
Resource allocator
The allocate method of the configured resource allocator instance is invoked when a resource
reallocation is deemed necessary by the provisioner (refer to the next subsection for more
information about the exact migration triggering strategy).
The method takes the current System instance as its only argument and returns nothing. It
can reallocate resources by migrating virtual machines from one physical node to another. To
do so, the implemented allocation algorithm has to call the scheduleMigrationTo method
of the Domain instance it wants to migrate.
Scheduled migrations are not carried out immediately but deferred for later execution, in order to allow the migration scheduler to optimize their concurrent execution.
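Purely as an illustration of the interface (and not the allocator shipped with VURM), a trivial IResourcesAllocator realization could look as follows; the attribute names nodes, virtualClusters, domains, physicalNode, computingUnits and priority follow the class diagram, everything else is an assumption:

class PackLowestPriorityAllocator(object):
    """Illustrative allocator: pack the lowest-priority cluster onto the
    single most powerful physical node."""

    def allocate(self, system):
        # Pick the most powerful node and the least important cluster.
        target = max(system.nodes, key=lambda n: n.computingUnits)
        cluster = min(system.virtualClusters, key=lambda c: c.priority)

        for domain in cluster.domains:
            if domain.physicalNode is not target:
                # The migration is only scheduled here; the migration
                # scheduler and manager decide when and how it is executed.
                domain.scheduleMigrationTo(target)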
Migration scheduler
Once the resource allocator has decided which virtual machines need to be moved to a different
node and the respective migrations have been scheduled, the schedule method of the migration
scheduler instance is invoked.
This method receives a list of Migration instances as its only argument and returns a similar
data structure. Its task is to define dependencies between migrations in order to optimize the
resource usage due to their concurrent execution.
Figure 4.3: Collaboration between the different components of the migration framework. The sequence diagram involves the provisioner, the resource allocator, the Domain instances, the migration scheduler and the migration manager, with the following messages: 1: allocate(system); 1.1: scheduleMigrationTo(node) for each domain needing migration; 1.1.1: create Migration; 1.1.2: register for later execution; 1.2: allocation done; 2: schedule(migrations); 2.1: addDependency(migration) for migrations waiting on a shared resource; 2.2: remove from list (migrations considered useless); 2.3: list of migrations; 3: migrate (for each scheduled migration); 3.1: wait for required migrations to terminate; 3.2: migrate(domain, from, to); 3.3: notify dependent migrations.
To add a new dependency, the implemented algorithm has to call the addDependency method
of the dependent Migration instance. All the migration instances still present in the returned list will be executed while respecting the established dependencies; it is thus possible to prevent a migration from happening simply by not inserting it in the returned list.
Migration manager
Different migration techniques (e.g. offline, iterative pre-copy, demand migration, . . . ) are available to move a virtual machine from one physical node to another.¹ It is the responsibility of the migration manager to decide which migration technique to apply for each given migration.
Each migration contained in the list returned from the schedule invocation is triggered by the provisioner. When told to start, the Migration instances wait for all migrations they depend on to finish and subsequently defer their execution to the responsible MigrationManager instance by calling its migrate method. Lastly, once the migration is terminated, each migration instance notifies the dependent migrations that they can now start.
¹ A deeper analysis of the different techniques is given in section 4.4.
4.1.3 Migration triggering
The whole resource allocation, migration scheduling and actual migration execution is triggered by the provisioner in exactly three cases:
1. Each time a new virtual cluster is created. This triggering method waits for a given (configurable) stabilization period to be over before effectively starting the process. Thanks to the stabilization interval, if two virtual cluster creation or release requests are received in a short time interval, only one scheduling will take place (a minimal sketch of this debouncing behavior is given after the list);
2. Each time a virtual cluster is released. Similarly to the creation trigger, the stabilization interval is applied in this case too. The stabilization interval is reset in both cases, regardless of the type of request (creation or release) which is received;
3. At regular intervals. If no clusters are created or released, a resource reallocation is triggered at a regular and configurable interval. Interval triggers are momentarily suspended when the triggering system is waiting for the stabilization period to be over.
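The following is a minimal debouncing sketch of the stabilization period, using a plain threading.Timer purely for illustration (the class and method names, as well as the timer mechanism, are assumptions and not the provisioner's actual implementation): each creation or release request restarts the timer, so a burst of requests results in a single reallocation once the system has settled.

import threading

class StabilizationTrigger(object):

    def __init__(self, period, reallocate):
        self.period = period          # stabilization period, in seconds
        self.reallocate = reallocate  # callable starting the reallocation run
        self._timer = None

    def request_received(self):
        # Called on every virtual cluster creation or release request.
        if self._timer is not None:
            self._timer.cancel()      # reset the period on every new request
        self._timer = threading.Timer(self.period, self.reallocate)
        self._timer.start()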
4.2 Allocation strategy
The remotevirt provisioner ships with a default resource allocator implementation called
SimpleResourceAllocator. This section aims to describe the implemented resource allocation algorithm.
The SimpleResourceAllocator allocates resources on a static system view basis. This
means that each time a new resource allocation request is made, the algorithm only cares about
the current system state and does not take into account previous states. Additionally, the
algorithm bases its decision only on the cluster priorities and the CUs assigned to the physical
nodes.
The implemented algorithm can be divided into four different steps; each of these steps is further described in the remaining part of this section.
4.2.1 Cluster weighing
This step takes care of assigning an integer computing unit value to each cluster while accounting
for rounding errors. An integer value is needed because in this simple allocator version, nodes
are not shared between clusters. The algorithm pseudocode to round the value calculated using Equation 4.3 is reported in Listing 4.1.
Listing 4.1: CU rounding algorithm

# Create a list of computing units and clusters tuples
computingUnits = [(c.getComputingUnits(), c) for c in clusters]

# Calculate the quotient and the remainder of each computing unit
# (a list is used so that the quotient can be adjusted in place)
computingUnits = [[cu % 1, int(cu), c] for cu, c in computingUnits]

# Calculate the remainder of the truncated computing units sum
# (rounded to an integer to guard against floating point error)
remainder = int(round(sum([rem for (rem, _, _) in computingUnits])))

# Sort computing units in reverse order by remainder and create an iterator
adjustments = iter(sorted(computingUnits, key=lambda e: e[0], reverse=True))

# While there is some remainder left, increment the next cluster with
# the largest remainder
while remainder:
    next(adjustments)[1] += 1
    remainder -= 1
  i    P_VC_i   W_VC_i   CU_VC_i (CU_sys = 10)      CU_VC_i (CU_sys = 15)
                          Exact  Rounded  Algo       Exact  Rounded  Algo
  0        24     0.08      0.8        1     1         1.2        1     1
  1         6     0.02      0.2        0     0         0.3        0     0
  2        84     0.28      2.8        3     3         4.2        4     4
  3       108     0.36      3.6        4     4         5.4        5     6
  4        48     0.16      1.6        2     1         2.4        2     2
  5        30     0.10      1.0        1     1         1.5        2     2
Totals    300     1.00       10       11    10          15       14    15

Table 4.1: Rounding algorithm example with two different CU_sys values
Table 4.1 provides an example of the application of the described algorithm to a system
containing six virtual clusters. The exact value, the rounded value and the algorithm value are
calculated for two different scenarios: once for a system with a total CU of 10 and a second
time for the same system but with a CU value of 15.
4.2.2 Nodes to cluster allocation
Once each cluster has an integer computing unit value attached to it, the algorithm goes on by assigning a set of nodes to each virtual cluster in order to fulfill its computing power requirements.
The implemented algorithm assigns nodes to clusters starting with the cluster with the highest computing unit value assigned to it and then proceeding in descending order. Firstly, the most powerful available node is allocated to the current cluster; then the algorithm iterates over all available nodes in descending computing power order and assigns them to the cluster as long as the cluster computing power is not exceeded.
Listing 4.2 illustrates a simplified version of the implemented algorithm. Note that the allocated computing units (currentPower) are also returned, as they can differ from the threshold value (assignedComputingUnits).
Listing 4.2: Nodes to virtual cluster allocation

# Initialize data structure and variables
clusterNodes = [availableNodes.pop(0)]
currentPower = clusterNodes[0].computingUnits

# If the power is not already exceeded
if currentPower < assignedComputingUnits:
    # Iterate over a copy so that nodes can be removed while looping
    for n in list(availableNodes):
        # If adding this node to the cluster does not exceed the assigned CUs
        if currentPower + n.computingUnits < assignedComputingUnits:
            availableNodes.remove(n)  # Move the node...
            clusterNodes.append(n)    # ...to the cluster nodes list
            currentPower += n.computingUnits

# Return the real computing power of the virtual cluster and the assigned nodes
return currentPower, clusterNodes
4.2.3 VMs to nodes allocation
This third step of the algorithm assigns a certain number of VMs to each node in the cluster, based on the computing units of the node, the total computing units assigned to the virtual cluster and the number of VMs running inside it.
The exact number of virtual machines VM_{PN_i} assigned to a node i of the virtual cluster j is calculated as shown in Equation 4.4. Once the exact value is calculated for each physical node of the virtual cluster, the same rounding algorithm as used for the Cluster weighing step is applied.

VM_{PN_i} = VM_{VC_j} \cdot \frac{CU_{PN_i}}{CU_{VC_j}}    (4.4)

4.2.4 VMs migration
The remaining part of the resource allocation algorithm is responsible for deciding which virtual machine has to be migrated to which physical node. This part has been divided into two distinct sets of migrations.
The first migration set is responsible for migrating all virtual machines currently running on nodes reassigned to other virtual clusters to a free node of the virtual cluster to which the virtual machine belongs. This first iteration is reported in Listing 4.3.
Listing 4.3: VMs migration from external nodes

# nodesVMCount is a list of (node, number of vm) tuples
idleNodesIter = iter(nodesVMCount)

# Initialize the variables
node, maxCount = next(idleNodesIter)

for domain in cluster.domains:
    # If the virtual machine is running on an external node
    if domain.physicalNode not in clusterNodes:
        # If the current node is already completely subscribed
        while len(node.domainsByCluster(cluster)) >= maxCount:
            # Find the next node which has a VM slot available
            node, maxCount = next(idleNodesIter)
        # Migrate the VM to the chosen node
        domain.scheduleMigrationTo(node)
The second migration set is responsible for migrating VMs between different nodes of the same virtual cluster in order to reach the correct number of assigned virtual machines running on each node. This is necessary because, depending on the set of nodes assigned to a cluster, a given node could be running more virtual machines than the allocated number.
Listing 4.4: Leveling node usage inside a virtual cluster

# The filterByLoad function creates three separate lists from the given
# nodesVMCount variable. A list of nodes with available slots, a list of
# already fully allocated nodes, and a list of oversubscribed nodes.
idle, allocated, oversubscribed = filterByLoad(nodesVMCount, cluster)

for node, count in oversubscribed:
    # Get an iterator over each domain belonging to cluster running on the
    # oversubscribed node
    domains = iter(node.domainsByCluster(cluster))

    for i in range(count):
        # For each domain exceeding the node capacity
        domain = next(domains)

        if not idle[0][1]:
            # Discard the current idle node if in the meantime it has
            # reached its capacity
            idle.pop(0)

        # Update the number of running domains
        idle[0][1] -= 1
        # Migrate the domain to the idle node
        domain.scheduleMigrationTo(idle[0][0])
Once the second set of migrations has been scheduled too, the resource allocator can return control to the provisioner and wait for the next triggering to take place (message 1.2 in the sequence diagram in Figure 4.3).
4.3 Scheduling strategy
VM migration is a heavy task in terms of bandwidth and, depending on the migration strategy used, in terms of CPU usage. Running a multitude of simultaneous migrations between a common set of physical nodes can thus reduce the overall system responsiveness and saturate the network bandwidth, as well as lengthen the migration time and thus the downtime of the single VMs.
To obviate this problem, and to allow different scheduling strategies to be put in place, the remotevirt provisioner supports a special component called migration scheduler. The responsibility of this component is to schedule migrations by establishing different dependency links between the single migration tasks in order to keep resource usage under a certain threshold while optimizing the maximum number of concurrent migrations.
The default implementation shipped with the physical cluster provisioner is a simple noop
scheduler which lets all migrations execute concurrently. A more optimized solution was not
implemented due to time constraints, but the problem was analyzed on a theoretical level and
formulated in terms of a graph coloring problem. This section aims to explain the theoretical
basis behind the optimal solution and to provide the simple greedy algorithm for a possible
future implementation.
The problem statement for which a solution is sought is formulated in the following way: find the optimal scheduling to execute all migrations using the minimum number of steps, in such a way that no two migrations involving one or more common nodes are carried out during the same step.
By representing the migrations as edges connecting vertices (i.e. the physical nodes) of an undirected² graph, the previous formulation can be applied to the more general problem of edge coloring. In graph theory, an edge coloring of a graph is an assignment of “colors” to the edges of the graph so that no two adjacent edges have the same color.
Table 4.2 lists the migrations used for the examples throughout this section. Similarly, Figure 4.4 represents the same migration set using an undirected graph.

² The migration direction is not relevant for the problem formulation as the resource usage is assumed to be the same on both ends of the migration.
Domain   Source Node   Dest. Node
  0           A             E
  1           A             D
  2           C             B
  3           B             D
  4           D             A
  5           A             C

Table 4.2: Migrations in tabular format

Figure 4.4: Graphical representation
Once the edges are colored, it is possible to execute migrations concurrently as long as they share the same color. Subfigure 4.5a shows one of the possible (optimal) coloring solutions and subfigure 4.5b one of the possible migration schedulings resulting from this coloring solution, represented as a Directed Acyclic Graph (DAG).
Figure 4.5: Edge coloring and the respective migration scheduling. (a) One possible edge coloring solution, (b) Chosen migration scheduling.
By Vizing’s theorem [16], the number of colors needed to edge color a simple graph is either
its maximum degree ∆ or ∆ + 1. In our case, the graph can be a multigraph because multiple
migrations can occur between the same two nodes and thus the number of colors may be as
large as 3∆/2.
There are polynomial time algorithms that construct optimal colorings of bipartite graphs, and
colorings of non-bipartite simple graphs that use at most ∆ + 1 colors [6]; however, the general
problem of finding an optimal edge coloring is NP-complete and the fastest known algorithms
for it take exponential time.
A good approximation is offered by the greedy coloring algorithm applied to the graph vertex coloring problem. The graph vertex coloring problem is the same as the edge coloring problem but applied to adjacent vertices instead. As the edge chromatic number of a graph G is equal to the vertex chromatic number of its line graph L(G), it is possible to apply this algorithm to the edge coloring problem as well.
Greedy coloring exploits a greedy algorithm to color the vertices of a graph that considers the vertices in sequence and assigns each vertex its first available color. Greedy colorings do not in general use the minimum number of colors possible and can use as many as 2∆ − 1 colors (which may be nearly twice as many colors as necessary); however, the algorithm exposes a time complexity of O(∆² + log* n) for every n-vertex graph [14] and it has the advantage that it may be used in an online algorithm setting in which the input graph is not known in advance. In this setting, its competitive ratio is two, and this is optimal for every online algorithm [1].
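A first-fit version of this greedy approach is sketched below on top of the Migration objects introduced in section 4.1 (the srcNode, dstNode and addDependency names follow the migration framework description, the rest is an assumption): each migration is placed in the earliest step whose migrations do not touch any of its two nodes, and the steps are then chained through dependencies.

def greedy_schedule(migrations):
    """Group migrations into steps of node-disjoint migrations (greedy
    first-fit coloring) and chain the steps through dependencies."""
    steps = []   # steps[k] holds migrations that may run concurrently
    for migration in migrations:
        nodes = {migration.srcNode, migration.dstNode}
        for step in steps:
            if all(nodes.isdisjoint({m.srcNode, m.dstNode}) for m in step):
                step.append(migration)
                break
        else:
            steps.append([migration])

    # Every migration of step k waits for all migrations of step k - 1.
    for previous, current in zip(steps, steps[1:]):
        for migration in current:
            for dependency in previous:
                migration.addDependency(dependency)

    return [m for step in steps for m in step]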
More complex formulations of the migration problem can also be made. It could be interesting to be able to configure a maximum number of concurrent migrations occurring on each node instead of hard limiting this value to one. This enhancement would allow the available resources to be better exploited, as the time ∆t_{n-concurrent} taken for n concurrent migrations between the same nodes is less than n times the duration of a single migration ∆t_{single} (Equation 4.5).

\Delta t_{n\text{-}concurrent} < n \cdot \Delta t_{single}    (4.5)

Empirical measurements have shown that fully concurrent migration setups can achieve as much as a 20% speedup over singly-scheduled migrations, by trading off for a complete network saturation and very high CPU loads. A configurable value for the maximum number of concurrent migrations would allow the speed/load tradeoff to be adjusted to the runtime environment.
4.4 Migration techniques
It is not yet possible to migrate a virtual machine from one physical node to another without any service downtime. Different migration techniques thus exist to try to minimize the VM downtime, often by trading off for a longer migration time.
Service downtime is defined as the interval between the instant at which the VM is stopped on the source host and the instant at which it is completely resumed on the destination host. Migration time is defined as the interval between the reception of the migration request and the instant at which the source host is no longer needed for the migrated VM to work correctly on the destination host.
This section aims to introduce some of the most important categories of VM migration techniques and to illustrate the method adopted by the VURM remotevirt provisioner.
The most basic migration technique is called offline migration. Offline migration consists in
completely suspending the VM on reception of the migration request, copying the whole state
over the network to the destination host, and resuming the VM execution there. Offline migration accounts for the smallest migration time and the highest service downtime (the two
intervals are approximately the same).
All other migration techniques are a subset of live migration. These migration implementations
seek to minimize downtime by accepting a longer overall migration time. Usually these algorithms generalize memory transfer into three phases [3] and are implemented by exploiting one
or two of the three:
• Push phase The source VM continues to run while certain pages are pushed across the
network to the new destination. To ensure consistency, pages modified during this process
must be re-sent;
• Stop-and-copy phase The source VM is stopped, pages are copied across to the destination VM, then the new VM is started;
• Pull phase The new VM executes and, if it accesses a page that has not yet been copied,
this page is faulted in across the network from the source VM.
Iterative pre-copy is a live migration algorithm which applies the first two phases. The push phase is repeated by sending over, during each round n, the pages which were modified during round n − 1 (all pages are transferred in the first round). This is one of the most widespread algorithms and provides very good performance (in [3] downtimes as low as 60 ms with migration times in the order of 1–2 minutes were reached). Iterative pre-copy is also the algorithm used by the KVM live migration implementation.
Another special technique which is not based only on the three previously exposed phases
exploits techniques such as checkpoint/restart and trace/replay, usually implemented in process
failure management systems. The approach is similar but instead of recovering a process from
an error on a local node, the VM is “recovered” on the destination host.
The migration implementation adopted by the remotevirt provisioner is a simplistic form
of offline migration. The virtual machine to migrate is suspended and its state saved to a
shared storage location. Subsequently, the same virtual machine is recreated and resumed on
the destination host by using the saved state file.
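As a sketch of the save/restore sequence described above, using the libvirt Python bindings with placeholder URIs, domain name and state path (error handling omitted, and not the exact VURM code):

import libvirt

STATE = '/shared/nfs/domain-0.state'   # state file on the shared storage

# Suspend the domain on the source node and write its state to disk.
src = libvirt.open('qemu+tcp://node01/system')
src.lookupByName('domain-0').save(STATE)

# Recreate and resume the same domain on the destination node.
dst = libvirt.open('qemu+tcp://node02/system')
dst.restore(STATE)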
It was not possible to adopt the KVM live migration implementation because different problems prevented a working proof of concept from being provided. A started live migration would completely freeze the migrating domain without reporting any progress or errors, and required a complete system restart for the hypervisor to resume normal operation.
Thanks to libvirt's support for this KVM functionality, a future addition to the remotevirt provisioner to support this kind of migration will be even simpler than the implementation provided to support offline migration itself. Once the problem is identified and solved, the addition of a new migration manager implementing this kind of migration is strongly advised.
4.5 Possible improvements
Due to the time dedicated to researching a solution to the live migration problems of the KVM hypervisor, different features of the migration framework were approached only on a theoretical basis. The current implementation could greatly benefit from some additions and improvements, as described in the remaining part of this chapter.
4.5.1 Share physical nodes between virtual clusters
One of the principles on which the current resource allocation algorithm is based requires that physical nodes are entirely allocated to a single virtual cluster at a time. Such a strategy is effective at isolating resource usage between virtual clusters (a greedy VM of one virtual cluster cannot consume part of the resources of a VM of another virtual cluster running on the same node) but limits the usefulness of bigger nodes.
Systems with only a few powerful nodes are limited to running only as many clusters as there are nodes in the system, leaving any additional virtual clusters effectively unscheduled.
The two approaches can also be combined so that nodes with a CU value below a configurable threshold are allocated to a single cluster only, while more powerful nodes are shared between different virtual clusters.
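A minimal sketch of such a combined policy, assuming a hypothetical per-node cu attribute and a configurable threshold, could be:

SHARE_THRESHOLD_CU = 16  # hypothetical configuration value

def max_clusters_on(node, shared_limit=4):
    """Nodes below the CU threshold stay exclusive to one virtual cluster;
    more powerful nodes may be shared by up to `shared_limit` of them."""
    return 1 if node.cu < SHARE_THRESHOLD_CU else shared_limit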
4.5.2
Implement live virtual machine migration
The current migration manager implements a manual virtual machine migration technique consisting of suspending the virtual machine, saving its state to a shared storage location and then resuming the VM on the destination host.
Applying such a migration technique involves long service downtime intervals (on the order of 15–30 s); live migration would reduce service downtime to the order of 100 ms.
As anticipated in the previous section, KVM already implements live virtual machine migration; once the problems related to its execution are solved, implementing support to exploit this capability is straightforward.
4.5.3
Exploit unallocated nodes
The current node allocation algorithm allows nodes to remain without allocated VMs. The resource allocation algorithm could be optimized to exploit all available resources by assigning every node to a virtual cluster and running at least one VM on each of them. Unused nodes can be allocated to the virtual clusters with the greatest difference between their exact CU share and the rounded value provided by the currently implemented rounding algorithm (a sketch of this rule is given below).
If some resources are effectively too limited to run even one VM, the system should at least provide meaningful reporting facilities (logging) to help system administrators identify and correct such situations.
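A minimal sketch of this assignment rule, assuming each virtual cluster exposes hypothetical exact_cus and allocated_cus values and an add_node() method, could be:

def assign_unused_nodes(unused_nodes, clusters):
    """Give each leftover node to the virtual cluster whose rounded
    allocation falls shortest of its exact CU share."""
    for node in unused_nodes:
        # Rounding deficit: how many CUs the cluster lost to rounding.
        neediest = max(clusters, key=lambda c: c.exact_cus - c.allocated_cus)
        neediest.add_node(node)
        neediest.allocated_cus += node.cu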
4.5.4
Load-based VMs to physical nodes allocation
Different VMs can have different CPU and memory requirements. The current allocation algorithm defines only how many VMs run on a given physical node, but not which combination of virtual machines would exploit the resources best.
A virtual cluster spanning two physical nodes and composed of two resource-hungry VMs and two idle VMs could see the two idle VMs allocated to the same physical node. The ideal algorithm would instead place one heavily loaded and one idle VM on each node of the virtual cluster.
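A minimal sketch of such a pairing, assuming each VM exposes a hypothetical load attribute and that the cluster has twice as many VMs as nodes, could be:

def pair_by_load(vms, nodes):
    """Pair the idlest VM with the busiest one and place each pair on a node."""
    ranked = sorted(vms, key=lambda vm: vm.load)
    placement = {}
    # Walk the ranking from both ends: idlest with busiest, and so on.
    for node, pair in zip(nodes, zip(ranked, reversed(ranked))):
        placement[node] = pair
    return placement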
4.5.5
Dynamically adapt VCs priority to actual resource exploitation
The priority assigned to a virtual cluster is defined by the user at creation time and never changed afterwards. It would be possible to dynamically adapt the priority to react to the actual resource exploitation of the virtual cluster.
If a virtual cluster with a high priority exploits only a small percentage of the resources allocated to it, the resource allocation algorithm should lower its priority at the next iteration, until its resource usage again reaches a certain threshold.
In a similar way, if a cluster is using all the resources allocated to it and there are unused resources available on the system, the algorithm should increase its priority so that more resources are allocated to the virtual cluster at the next iteration.
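A minimal sketch of such an adjustment rule, run at every reallocation iteration and based on hypothetical usage_ratio and priority attributes, could be:

def adjust_priority(cluster, low_usage=0.25, high_usage=0.95, step=1):
    """Lower the priority of mostly idle clusters, raise it for saturated ones."""
    if cluster.usage_ratio < low_usage and cluster.priority > 1:
        # Mostly idle: let its unused resources flow to other clusters.
        cluster.priority -= step
    elif cluster.usage_ratio > high_usage:
        # Fully loaded: ask for a bigger share at the next iteration.
        cluster.priority += step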
4.5.6
Implement the greedy coloring algorithm
Section 4.3 introduced a good solution to the concurrent migrations problem on a theoretical level. The current implementation simply schedules all migrations for fully concurrent execution and would thus saturate the resources of larger scale systems.
The presented greedy coloring algorithm is simple enough to be worth implementing as an additional migration scheduler realization (a possible sketch is given below). To fully exploit the possibilities offered by the scheduling system, the implementation of the configurable concurrency limit variant is also advisable.
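A minimal sketch of such a greedy coloring scheduler, assuming migrations are given as (source, destination) host pairs and that the rounds are executed one after the other, could be:

from collections import defaultdict

def schedule_migrations(migrations):
    """Greedy edge coloring: migrations sharing a host never run in the same
    round; each round (color) can then be executed fully concurrently."""
    busy = defaultdict(set)      # host -> rounds in which it is already busy
    rounds = defaultdict(list)   # round -> migrations scheduled in it
    for src, dst in migrations:
        # Pick the smallest round in which neither endpoint is busy yet.
        r = 0
        while r in busy[src] or r in busy[dst]:
            r += 1
        busy[src].add(r)
        busy[dst].add(r)
        rounds[r].append((src, dst))
    return [rounds[r] for r in sorted(rounds)]

# Four migrations over four hosts end up in two rounds of two migrations each:
# schedule_migrations([("n0", "n1"), ("n2", "n3"), ("n0", "n2"), ("n1", "n3")])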
Chapter 5
Conclusions
Three whole months of planning, design, implementation and testing are over. A new tool to
integrate an already existing job scheduler and virtual machine monitor was created and the
time to conclude the work has finally come.
This last chapter of the present report aims to summarize the different achievements, illustrate the problems that were solved, hint at some possible future developments and finally give a personal assessment of the entire experience. This chapter is thus structured as follows:
• Section 5.1 summarizes the project output by highlighting the initial goals and comparing
them to the final results;
• Section 5.2 lists some of the most important problems encountered, the solutions eventually found for them and the impact they had on the project itself;
• Section 5.3 gives a list of hints for future work by summarizing what has already been suggested in the individual chapters and adding items highlighting other possible improvements;
• Finally, section 5.4 gives a personal overview of the whole experience, covering both the project itself and its execution context.
5.1
Achieved results
The goal of the project was to add virtual resource management capabilities to a job scheduler designed for HPC clusters. SLURM was chosen as the batch scheduler, while virtualization functionality was provided using a combination of libvirt and KVM/QEMU (the latter being swappable with other hypervisors of choice).
The final result consists of a tool called VURM which integrates all these different components into a single working unit. No modifications to the external tools are necessary for them to work in such a setup.
Using the developed tool, and in particular the provided remotevirt resource provisioner, it is possible to create virtual clusters of VMs running on different physical nodes (belonging to a physical cluster). Pluggable interfaces are also provided in order to be able to add new resource
provisioners to the system and to aggregate domains running on heterogeneous systems (grids, clouds, single nodes, ...) at the same time.
Particular attention was given to the remotevirt provisioner (also called physical resources provisioner) and to the functionalities offered by its particular architecture. By using this provisioner, it is possible to create ad-hoc virtual machines, assemble them into virtual clusters and provide them to SLURM as partitions to which jobs can be submitted in the usual fashion.
Additionally, this provisioner exposes a complete resource management framework heavily based on the possibility of migrating domains (i.e. virtual machines) from one physical node to another. Using components responsible respectively for resource allocation, migration scheduling, and migration execution, it is possible to dynamically adapt the resources allocated to a virtual cluster based on different criteria such as priority definitions, node computing power, current system load, etc.
5.2
Encountered problems
Each new development project brings a new set of challenges and problems to the table. Research projects, in which new methods are tried and new areas explored, may suffer even more from such eventualities. This project was no different, and a number of problems had to be approached, dealt with and eventually solved.
Two major problems can be identified in the whole project development timeframe. The first one was the retrieval of the IP address of a newly spawned virtual machine. Even though different solutions already existed (many of them presented in the respective chapter), none of them fitted the specific context and an alternative, more creative one had to be implemented. This final solution – transferring the IP address over the serial port to a listening TCP server – may now seem an obvious approach to a non-problem, but finding it caused a non-trivial headache when the problem first occurred.
The second important problem manifested itself towards the end of the project, when VM migration had to be implemented. As KVM claimed support for live migration out of the box, no thorough analysis and testing was performed in the initial project phase, which led to the later finding that this specific capability was not working on the development setup. Because of the importance given to live migration over offline migration, the source of the problem was researched for two entire weeks without success. This led to a major adaptation of the planning and a serious loss of motivation. Finally, a more complex (but working) offline migration solution was implemented.
A direct solution could not always be found, as was the case for the second issue, and a workaround had to be implemented instead. The lessons learned from the encountered problems are essentially two: firstly, put more effort into the initial analysis phase by testing important required functionalities and providing simple proofs of concept for the final solution; secondly, if a problem is encountered and a solution cannot be found in a reasonable amount of time, adopt a workaround, continue with the development of the remaining parts of the project and come back to the problem to find a better solution only once the critical parts of the project are finished.
Adopting these two lessons should give a better initial overview of the different tasks and their complexity, and subsequently a stricter adherence to the planning.
5.3
Future work
The first section presented the different achievements of the project by summarizing the work which was done and the results it produced. The second section summarized the encountered problems and how they were eventually solved. This section aims to offer some hints for future work which could be done to improve the project.
Improvements tightly related to the main components of the VURM utility have already been
listed in the respective chapters. Section 3.5 of the Physical cluster provisioner chapter listed
the implementation of scalable disk image transfers using BitTorrent and support for VLAN
based network setups as main possible improvements to the remotevirt resources provisioner.
Similarly, for the Migration chapter, section 4.5 offers an extended list of additional features
and improvements to the migration framework, resources allocation and scheduling strategies
and migration techniques. Possible improvements in this area are for example enabling physical
nodes to be shared between virtual clusters or implementing live migration to minimize service
downtime.
An interesting addition to the project, to effectively test the resource origin abstraction put in place in the early development stages, would be the implementation of a cloud computing based resource provisioner. Such a provisioner would allocate new resources on the cloud when a new virtual cluster is requested. Additionally, such a provisioner could be used as a fallback to the remotevirt provisioner. In such a scenario, resources on the cloud would be allocated only once the physical cluster provisioner has exhausted all available physical nodes.
An important part currently lacking, and potentially yielding interesting results, is a complete testing and benchmarking analysis of the VURM utility running on larger scale systems. Empirical tests and simple benchmarks were executed on a maximum of three machines and mainly for development purposes. A complete overview of the scalability of the system when run on a larger number of nodes would also highlight possible optimization areas which did not pose any problem in the development environment.
Adding new implementations for the different pluggable parts of the VURM application still requires modifying the main codebase. A plugin architecture leveraging Twisted’s component framework could be implemented in an easy and straightforward way. Additionally, by allowing custom command line parsers for each plugin, provisioner-specific options could be added to existing commands. Such a system would allow all the components (provisioners, resource allocators, migration schedulers and migration managers) to be configured using the standard configuration file, removing the need to modify the VURM code.
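As a rough sketch of what such a plugin lookup could look like with Twisted’s plugin system (the IProvisioner interface and the vurm.plugins drop-in package are hypothetical names used only for illustration):

from zope.interface import Interface, implementer
from twisted.plugin import IPlugin, getPlugins

class IProvisioner(Interface):
    """Hypothetical interface every resource provisioner plugin provides."""

@implementer(IPlugin, IProvisioner)
class CloudProvisioner(object):
    """Example plugin living in a drop-in module, discovered at runtime."""
    name = "cloud"

def find_provisioner(name):
    import vurm.plugins  # hypothetical drop-in package scanned for plugins
    for plugin in getPlugins(IProvisioner, vurm.plugins):
        if plugin.name == name:
            return plugin
    raise LookupError("No provisioner named %r" % (name,))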
The adoption of libvirt abstracted the differences between the plethora of available hypervisors. One of the reasons to use libvirt instead of accessing KVM/QEMU functionalities directly was the possible future adoption of the Palacios hypervisor. The last improvement idea is to implement support for the Palacios hypervisor, either by adding the needed translation layers to libvirt or by accessing it directly from the VURM tools through a new palacios resource provisioner.
5.4
Personal experience
Working for three full months on a single project may be challenging. Doing so overseas, in a completely new context and with people I had never met before, surely is. These and many other pros and cons of the experience of developing this bachelor project in a completely different environment are summarized in this last section of the present report.
As already explained in the Introduction, this project was carried out at the Scalable Systems Lab (SSL) of the Computer Science department of the University of New Mexico (UNM), USA, during the summer of 2011 and is the final diploma work which will hopefully allow me to obtain a B.Sc. in Computer Science at the College of Engineering and Architecture Fribourg.
One of the particularities of the initial phases of the project was that I left Switzerland without a precise definition of what the project would be. The different goals and tasks were defined only once I had settled in New Mexico, in collaboration with the local advisors, and then had to be communicated back and explained to the different people supervising the project remotely from Switzerland. Although it was a challenge, the need to explain the project, from the very first iterations, to people completely external to its definition turned out to be a big advantage, since it forced me to clearly formulate all the different aspects of the statement. The same argument can be made for the different meetings held during the entire project: I always had to explain progress, issues and results in a sub-optimal environment; that, however, forced me to adopt a clearer and more understandable formulation.
Another interesting aspect is the completely new environment in which the project was executed: new people were met, a new working location had to be set up, different working hours and different relationships with professors and colleagues had to be managed, etc. All these aspects greatly contributed to making my experience varied, interesting and enriching. Unfortunately, this also brought negative aspects to the table, and keeping motivation high for the whole duration of my stay was not always possible.
To summarize the whole experience, I would say that, besides the highs and lows, every aspect of my stay in New Mexico contributed to enriching an already interesting and challenging adventure which, in any case, I would recommend to anyone.
Acronyms
AMP Asynchronous Messaging Protocol. 23
API Application Programming Interface. 5, 25, 27, 71
ARP Address Resolution Protocol. 28
B.Sc. Bachelor of Science. 5, 52
CPU Central Processing Unit. 1, 26, 43, 45, 47
CU Computing Unit. 37, 40, 46, 47
DAG Directed Acyclic Graph. 44
DBMS Database Management System. 8
DHCP Dynamic Host Configuration Protocol. 28, 63
HPC High Performance Computing. 1–3, 49
IP Internet Protocol. vii, 19, 23, 27–31, 50, 63, 65, 66
IT Information Technology. 2
KVM Kernel-based Virtual Machine. 4, 46, 47, 50
MAC Media Access Control. 28
NFS Network File System. 31, 32
OS Operating System. 2, 22, 26, 27, 29, 66
POP-C++ Parallel Object Programming C++. 65
QCOW2 Qemu Copy On Write, version 2. 29
SASL Simple Authentication and Security Layer. 25
SLURM Simple Linux Utility for Resource Management. 1, 4, 6–16, 19, 20, 22, 23, 27, 32,
49, 50, 57, 59, 60, 66
SSH Secure Shell. 20, 25, 27, 30, 65
SSL Scalable Systems Lab. 5, 52
TCP Transmission Control Protocol. 25, 28, 29, 50, 65
TLS Transport Layer Security. 25
UML Unified Modeling Language. 13
UNM University of New Mexico. 5, 52
ViSaG Virtual Safe Grid. 65, 66
VLAN Virtual LAN. 32, 51
VM Virtual Machine. 3, 4, 11, 12, 14, 19, 20, 22, 23, 25–32, 35, 36, 38, 42, 43, 45–47, 49, 50,
60, 61, 65, 66
VMM Virtual Machine Monitor. 3–5, 65
VURM Virtual Utility for Resource Management. 1, 5–8, 10–14, 16, 19, 20, 22, 23, 25, 27,
29–31, 36–38, 45, 49, 51, 57–59, 61, 62, 65, 66, 71
XML eXtensible Markup Language. 19, 23, 25, 28
References
[1] Amotz Bar-Noy, Rajeev Motwani, and Joseph Naor. “The greedy algorithm is optimal for on-line edge coloring”. In: Information Processing Letters 44.5 (Dec. 1992), pp. 251–253.
[2] Fabrice Bellard. QEMU. IBM. Aug. 2011 (accessed August 9, 2011). url: http://qemu.org/.
[3] Christopher Clark et al. “Live Migration of Virtual Machines”. 2005 (accessed August 15, 2011). url: http://www.usenix.org/event/nsdi05/tech/full_papers/clark/clark.pdf.
[4] Valentin Clément. POP-C++ Virtual-Secure (VS) – Road to ViSaG. University of Applied Sciences of Western Switzerland, Fribourg. Apr. 2011.
[5] Bram Cohen. The BitTorrent Protocol Specification. Jan. 2008 (accessed August 14, 2011). url: http://www.bittorrent.org/beps/bep_0003.html.
[6] Richard Cole, Kirstin Ost, and Stefan Schirra. “Edge-Coloring Bipartite Multigraphs in O(E log D) Time”. In: COMBINATORICA 21.1 (Sept. 1999), pp. 5–12.
[7] Johnson D et al. Eucalyptus Beginner’s Guide – UEC Edition. Dec. 2010. url: http://cssoss.files.wordpress.com/2010/12/eucabookv2-0.pdf.
[8] Domain XML format. url: http://libvirt.org/formatdomain.html.
[9] Eucalyptus Network Configuration (2.0). url: http://open.eucalyptus.com/wiki/EucalyptusNetworkConfiguration_v2.0.
[10] Michael Jang. Ubuntu Server Administration. McGraw-Hill, Aug. 2008.
[11] M. Tim Jones. Virtio: An I/O virtualization framework for Linux. IBM. Jan. 2010 (accessed August 8, 2011). url: http://www.ibm.com/developerworks/linux/library/l-virtio/.
[12] Kernel Based Virtual Machine. Feb. 2009 (accessed August 9, 2011). url: http://www.linux-kvm.org/.
[13] Jack Lange and Peter Dinda. An Introduction to the Palacios Virtual Machine Monitor. Tech. rep. Northwestern University, Electrical Engineering and Computer Science Department, Nov. 2008. url: http://v3vee.org/papers/NWU-EECS-08-11.pdf.
[14] Nathan Linial. “Locality in Distributed Graph Algorithms”. In: SIAM Journal on Computing 21.1 (Dec. 1990), pp. 193–201.
[15] Eric P. Mangold. AMP – Asynchronous Messaging Protocol. 2010. url: http://amp-protocol.net/.
[16] J. Misra and D. Gries. “A Constructive Proof of Vizing’s Theorem”. In: Information Processing Letters 41 (1992), pp. 131–133.
[17] Tuan Anh Nguyen et al. Parallel Object Programming C++ – User and Installation Manual. University of Applied Sciences of Western Switzerland, Fribourg. 2005. url: http://gridgroup.hefr.ch/popc/lib/exe/fetch.php/popc-doc-1.3.pdf.
[18] Palacios: An OS independent embeddable VMM. Feb. 2008 (accessed August 9, 2011). url: http://v3vee.org/palacios/.
[19] SLURM: A Highly Scalable Resource Manager. Lawrence Livermore National Laboratory. July 2008 (accessed August 2, 2011). url: https://computing.llnl.gov/linux/slurm/.
Appendix A
User manual
A.1
Installation
The VURM installation involves different components but is not particularly complex. The following instructions are based on an Ubuntu distribution but can be generalized to any Linux based operating system.
1. Install SLURM. Packages are provided for Debian, Ubuntu and other distributions such as Gentoo. For complete installation instructions refer to https://computing.llnl.gov/linux/slurm/quickstart_admin.html, but basically this comes down to executing the following command:
$ sudo apt-get install slurm-llnl
2. Install libvirt. Make sure to include the Python bindings too:
$ sudo apt-get install libvirt-bin python-libvirt
3. Install the Python dependencies on which VURM relies:
$ sudo apt-get install python-twisted \
python-twisted-conch \
python-lxml \
python-setuptools
4. Get the latest development snapshot from the GitHub repository:
$ wget -O vurm.tar.gz https://github.com/VURM/vurm/tarball/develop
5. Extract the sources and change directory:
$ tar xzf vurm.tar.gz && cd VURM-vurm-*
6. Install the package using setuptools:
$ sudo python setup.py install
A.2
Configuration reference
The VURM utilities all read configuration files from a set of predefined locations. It is also possible to specify a file by using the --config option. The default locations are /etc/vurm/vurm.conf and ~/.vurm.conf.
The configuration file syntax is a variant of INI with interpolation features. For more information about the syntax refer to the Python ConfigParser documentation at http://docs.python.org/library/configparser.html.
All the different utilities and components can be configured in the same file by using the appropriate section names. The rest of this section, after the example configuration file reported in Listing A.1, lists the different sections and their configuration directives.
Listing A.1: Example VURM configuration file
# General configuration
[vurm]
debug=yes

# Client configuration
[vurm-client]
endpoint=tcp:host=10.0.6.20:port=9000

# Controller node configuration
[vurmctld]
endpoint=tcp:9000
slurmconfig=/etc/slurm/slurm.conf
reconfigure=/usr/bin/scontrol reconfigure

# Remotevirt provisioner configuration
[libvirt]
migrationInterval=30
migrationStabilizationTimer=20
domainXML=/root/sources/tests/configuration/domain.xml
nodes=node-0,node-1

[node-0]
endpoint=tcp:host=10.0.6.20:port=9010
cu=20

[node-1]
endpoint=tcp:host=10.0.6.10:port=9010
cu=1

# Computing nodes configuration
[vurmd-libvirt]
basedir=/root/sources/tests
sharedir=/nfs/student/j/jj
username=root
key=%(basedir)s/configuration/vurm.key
sshport=22
slurmconfig=/usr/local/etc/slurm.conf
slurmd=/usr/local/sbin/slurmd -N {nodeName}
endpoint=tcp:port=9010
clonebin=/usr/bin/qemu-img create -f qcow2 -b {source} {destination}
hypervisor=qemu:///system
statedir=%(sharedir)s/states
clonedir=%(sharedir)s/clones
imagedir=%(sharedir)s/images
A.2.1
The vurm section
This section contains general purpose configuration directives common to all components.
Currently only one configuration directive is available:
debug Set this to yes to enable debugging mode (mainly more verbose logging) or to no to
disable it.
A.2.2
The vurm-client section
This section contains configuration directives for the different commands interacting with a
remote daemon. Currently only one configuration directive is available:
endpoint Set this to the endpoint on which the server is listening. More information about
the endpoint syntax can be found online at: http://twistedmatrix.com/documents/
11.0.0/api/twisted.internet.endpoints.html#clientFromString. To connect
to a TCP host listening on port 9000 at the host example.com, use the following string:
tcp:host=example.com:port=9000
A.2.3
The vurmctld section
This section contains configuration directives for the VURM controller daemon. The available
options are:
endpoint The endpoint on which the controller has to listen for incoming client connections.
More information about the endpoint syntax can be found online at: http://twistedmatrix.
com/documents/11.0.0/api/twisted.internet.endpoints.html#serverFromString.
To listen on all interfaces on the TCP port 9000, use the following string: tcp:9000
slurmconfig The path to the SLURM configuration file used by the currently running
SLURM controller daemon. The VURM controller daemon needs read and write access to
this file (a possible location is: /etc/slurm/slurm.conf).
reconfigure The complete shell command to use to reconfigure the running SLURM controller daemon once the configuration file was modified. The suggested value is /usr/bin/scontrol
reconfigure.
A.2.4
The libvirt section
This section contains the configuration directives for the remotevirt provisioner. The available options are:
domainXML The location of the libvirt domain XML description file to use to create new virtual
machines.
migrationInterval The time (in seconds) between resource reallocation and migration triggering if no other event (virtual cluster creation or release) occurs in the meantime.
migrationStabilizationTimer The time (in seconds) to wait for the system to stabilize
after a virtual cluster creation or release event before the resource reallocation and migration
is triggered.
nodes A comma-separated list of section names contained in the same configuration file. Each
section defines a single node or a node set on which a remotevirt daemon is running. The
format of these sections is further described in the next subsection.
A.2.5
The node section
This section contains the configuration directives to manage a physical node belonging to the remotevirt provisioner. The section name can be arbitrarily chosen (as long as it does not conflict with other already defined sections). The available options are:
endpoint The endpoint on which the remotevirt daemon is listening. More information about the endpoint syntax can be found online at: http://twistedmatrix.com/documents/11.0.0/api/twisted.internet.endpoints.html#clientFromString. This endpoint allows similar nodes to be grouped together by specifying an integer range in the last part of the hostname, similarly to what is possible in the SLURM configuration. It is thus possible to define a node set containing 10 similar nodes using the following value: tcp:hostname[0-9]:port=9010 (a minimal range-expansion sketch is shown at the end of this section).
cu The amount of computing units associated with this node (or with each node in the set).
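The following minimal sketch, not part of the VURM distribution, shows how such a bracketed range could be expanded into individual endpoint strings:

import re

def expand_endpoint(endpoint):
    """Expand an endpoint such as 'tcp:hostname[0-2]:port=9010' into one
    endpoint string per node; endpoints without a range are returned as-is."""
    match = re.search(r"\[(\d+)-(\d+)\]", endpoint)
    if match is None:
        return [endpoint]
    start, stop = int(match.group(1)), int(match.group(2))
    return [endpoint[:match.start()] + str(i) + endpoint[match.end():]
            for i in range(start, stop + 1)]

# expand_endpoint("tcp:hostname[0-2]:port=9010")
# -> ['tcp:hostname0:port=9010', 'tcp:hostname1:port=9010', 'tcp:hostname2:port=9010']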
A.2.6
The vurmd-libvirt section
This section contains the configuration directives for the single remotevirt daemons running
on the physical nodes. The available options are:
username The username to use to remotely connect to the spawned virtual machines via SSH
and execute the slurm daemon spawning command.
sshport The TCP port to use to establish the SSH connection to the virtual machine.
slurmconfig The location, on the virtual machine, where the SLURM configuration file has
to be saved.
slurmd The complete shell command to use to spawn the SLURM daemon on the virtual machine. This value will be interpolated for each virtual machine using Python’s string.format method; currently the only available interpolated value is the nodeName (an interpolation example is given at the end of this section). The suggested value is /usr/local/sbin/slurmd -N {nodeName}.
key The path to the private key to use to login to the virtual machine via SSH.
endpoint The endpoint on which the remotevirt daemon has to listen for incoming connections from the remotevirt controller. More information about the endpoint syntax can
be found online at: http://twistedmatrix.com/documents/11.0.0/api/twisted.
internet.endpoints.html#serverFromString. To listen on all interfaces on the TCP
port 9010, use the following string: tcp:9010
hypervisor The connection URI to use to connect to the hypervisor through libvirt. The
possible available values are described online at: http://libvirt.org/uri.html
statedir The path to the directory where the VM state file is saved to perform an offline
migration. Has to reside on a shared location and be the same for all remotevirt daemon
instances.
clonedir The path to the directory where the cloned disk images to run the different VMs
are saved. Has to reside on a shared location and be the same for all remotevirt daemon
instances.
imagedir The path to the directory where the base disk images to clone to start a VM are
stored. Has to reside on a shared location and be the same for all remotevirt daemon
instances.
clonebin The complete shell command to use to clone a disk image. This value will be interpolated for each cloning operation using Python’s string.format method. The available interpolated values are the source and the destination of the disk image. The suggested value is /usr/bin/qemu-img create -f qcow2 -b {source} {destination}.
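As a quick illustration of how these two directives are interpolated with Python’s string.format (the node name and paths below are placeholder values):

# The slurmd directive receives the generated node name...
print("/usr/local/sbin/slurmd -N {nodeName}".format(nodeName="nd-0"))
# -> /usr/local/sbin/slurmd -N nd-0

# ...while the clonebin directive receives the source and destination images.
print("/usr/bin/qemu-img create -f qcow2 -b {source} {destination}".format(
    source="/nfs/images/base.qcow2", destination="/nfs/clones/nd-0.qcow2"))
# -> /usr/bin/qemu-img create -f qcow2 -b /nfs/images/base.qcow2 /nfs/clones/nd-0.qcow2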
A.3
Usage
The VURM project provides a collection of command line utilities to both run and interact
with the system. This section describes the provided utilities and their command line usage.
A.3.1
Controller daemon
Starts a new VURM controller daemon on the local machine by loading configuration from the
default locations or from the specified path.
Listing A.2: Synopsis of the vurmctld command
usage: vurmctld [-h] [-c CONFIG]
VURM controller daemon.
optional arguments:
-h, --help
show this help message and exit
-c CONFIG, --config CONFIG
Configuration file
A.3.2
Remotevirt daemon
Starts a new remotevirt daemon on the local machine by loading configuration from the
default locations or from the specified path.
Listing A.3: Synopsis of the vurmd-libvirt command
usage: vurmctld [-h] [-c CONFIG]
VURM libvirt helper daemon.
optional arguments:
-h, --help
show this help message and exit
-c CONFIG, --config CONFIG
Configuration file
A.3.3
Virtual cluster creation
Requests the creation of a new virtual cluster of a given size from the VURM controller daemon. An optional minimum size and priority can also be specified. The configuration is loaded from the default locations or from the specified path.
Listing A.4: Synopsis of the valloc command
usage: valloc [-h] [-c CONFIG] [-p PRIORITY] [minsize] size
VURM virtual cluster allocation command
positional arguments:
minsize
size
Minimum acceptable virtual cluster size
Desired virtual cluster size
optional arguments:
-h, --help
show this help message and exit
-c CONFIG, --config CONFIG
Configuration file
-p PRIORITY, --priority PRIORITY
Virtual cluster priority
A.3.4
Virtual cluster release
Releases a specific virtual cluster, or all virtual clusters currently defined on the system. The configuration is loaded from the default locations or from the specified path.
Listing A.5: Synopsis of the vrelease command
usage: vrelease [-h] [-c CONFIG] (--all | cluster-name)
VURM virtual cluster release command.
positional arguments:
cluster-name
Name of the virtual cluster to release
optional arguments:
-h, --help
show this help message and exit
-c CONFIG, --config CONFIG
Configuration file
--all
Release all virtual clusters
A.4
VM image creation
The operating system installed on the disk image used to spawn new virtual machines has to
respect a given set of constraints. This section aims to expose the different prerequisites for
such an image to work seamlessly in a VURM setup.
A.4.1
Remote login setup
Once started and online, the operating system has to allow remote login via SSH. The username, port and public key used to connect to the virtual machine have to be provided by the VURM system administrator, but it is the user’s responsibility to make sure that the virtual machine allows this user to log in remotely.
Most common distributions already come with an SSH server installed and correctly set up. The only remaining tasks are to create the correct user and copy the provided public key to the correct location. Refer to the distribution documentation to learn how users are created and how to allow a client to authenticate with a given public key (normally it is sufficient to copy the key into the ~/.ssh/authorized_keys file).
A.4.2
IP address communication
It is the responsibility of the virtual machine operating system to communicate its IP address back to the remotevirt daemon. A script to retrieve and write the IP address to the serial port is provided in Listing A.6.
Listing A.6: Shell script to write the IP address to the serial port
#!/bin/sh
ifconfig eth0 | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1 >/dev/ttyS0
This script has to be triggered once the IP address is received by the guest operating system.
A possible approach is to use the triggering capabilities offered by the default dhclient by
placing the IP sending script inside the /etc/dhcp3/dhclient-exit-hooks.d/ directory.
Each script contained in this directory will be executed each time the DHCP client receives a
new IP address lease.
Appendix B
ViSaG comparison
Parallel Object Programming C++ (POP-C++) [17] is a framework that provides a C++ language extension, compiler and runtime to easily deploy applications to a computing grid. One of the core aspects of the framework is that existing code can easily be converted to a parallel version with a minimal amount of changes.
Virtual Safe Grid (ViSaG) [4] is a project that aims to add security to the execution of POP-C++ applications by adding secure communication channels and virtualization to the runtime.
Virtualization was used for different reasons in ViSaG and VURM: mainly security and sandboxing in the former, customization and dynamic resource reallocation in the latter. Nonetheless, both projects encountered some of the same problems and solved them differently. This appendix aims to provide a short overview of the different implemented solutions and how they relate to each other.
Both ViSaG and VURM use libvirt to access the main hypervisor functions. In the case of VURM, this allows the integration with different hypervisors; the main reason to use libvirt was the provided abstraction layer, and thus the ability to swap hypervisors at will while guaranteeing an easy path to support the Palacios VMM in the future. In the case of ViSaG, the main reason for the adoption of libvirt was the relatively simple interface it offered to access the VMWare ESX hypervisor, while the abstraction of different hypervisors was overshadowed by the implementation of some direct interactions with the ESX hypervisor, as seen later in this appendix.
One of the first problems encountered by both projects is the retrieval of the IP address of a newly spawned virtual machine. Subsection 3.4.1 of the Physical cluster provisioner chapter presents the different solutions analyzed for this problem. The solutions adopted by both ViSaG and VURM were presented in the above cited section: VURM opted for the serial-to-TCP data exchange, while ViSaG chose to take advantage of the proprietary virtual machine tools provided by ESX and installed on the running domain. Each solution has its own advantages and drawbacks: the serial-to-TCP data exchange adds more complexity to the application, as an external TCP server has to be put in place, while the proprietary VM tools solution tightly couples the ViSaG implementation to a specific hypervisor.
Another key problem solved in different ways is the setup of the authentication credentials for remote login through an SSH connection. As for the IP address retrieval problem, different
possible solutions were discussed in subsection 3.4.3 of the Physical cluster provisioner chapter. In this case, the authentication model specific to each project clearly identified the correct solution for it. When using the VURM utility, the end user specifies a disk image to execute on their virtual cluster. This allows complete customization of the environment in which the different SLURM jobs will be executed. In the ViSaG case, instead, the VM disk image is provided during the setup phase and used for security and sandboxing purposes. This disk image is shared among all users, and each user wants to grant access to a running domain only to their own application. The key difference here is that all virtual machines on a VURM system are accessed by the same entity (the remotevirt daemon) using a single key pair, while on a ViSaG system each virtual machine is potentially accessed by different entities (the application which requested the creation of the VM) using different key pairs. In the case of VURM, the installation of the public key into the VM disk image is a requirement which has to be fulfilled by the user. In the case of ViSaG, the public key is copied to the virtual machine at runtime using a hypervisor-specific feature.
The last difference between the two projects is the method used to spawn new virtual machines. In the case of VURM, new virtual machines are spawned simultaneously as part of a new virtual cluster creation request; losing some time to completely boot a new domain was considered an acceptable tradeoff given the relative infrequency of the operation. In the context of a ViSaG execution, VMs have to be spawned more frequently and with lower latency. This requirement led to the adoption of a spawning technique based on a VM snapshot, in which an already booted and suspended domain is resumed. The resume operation is faster than a full OS boot, but presents other disadvantages which had to be overcome, such as triggering the negotiation of a new IP address.
Appendix C
Project statement
VURM – Project statement
Virtual resources management on HPC clusters
Jonathan Stoppani
College of Engineering and Architecture Fribourg
jonathan.stoppani@edu.hefr.ch
Abstract
Software deployment on HPC clusters is often subject to
strict limitations with regard to software and hardware customization of the underlying platform. One possible approach to circumvent these restrictions is to virtualize the
different hardware and software resources by interposing a
dedicated layer between the running application and the host
operating system.
The goal of this project is to enhance existing HPC resource management tools with special virtualization-oriented
capabilities such as job submission to specially created virtual machines or runtime migration of virtual nodes to account for updated job priorities or for better performance
exploitation.
Virtual machines can be started inside a virtual cluster and run on a physical node. In low system load situations or for high priority jobs, each physical node hosts exactly one VM (Subfigure 1a). When additional or higher-priority jobs are submitted to the system, these VMs can be migrated to already busy nodes to free up resources for the incoming jobs (Subfigures 1b and 1c). The same procedure can be applied if some resources are released, in order to increase performance (Subfigure 1d).
Keywords resource management, virtualization, HPC, slurm
1.
Introduction
SLURM (Simple Linux Utility for Resource Management)
is, as the name says, a resource manager for Linux based
clusters of all sizes. Its job is to allocate resources to users
requesting them by arbitrating possible contentions using a
queue of pending work and to offer tools to start, execute
and monitor work started on previously allocated resources.
To be able to support more esoteric system configurations than what is allowed on common HPC clusters and to provide advanced job controlling capabilities (mainly pausing/resuming and migration), SLURM can be extended to support virtual cluster and virtual machine management.
A virtual cluster groups physical nodes together and can be thought of as a logical container for virtual machines. Such a cluster can be resized to account for workload
changes and runtime job prioritization updates.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission.
Copyright © 2011, Jonathan Stoppani
Figure 1. VMs migration process to allocate resources for a higher prioritized job in its different states: (a) low load operation (before), (b) migration (during allocation), (c) high load operation (while running both jobs), (d) restoration (after). (Squares represent physical nodes, circles represent VMs and darker rectangles are virtual clusters.)
2.
Goals
The ultimate goal of this project is to add support for the
virtual resources management capabilities described above
to SLURM. These capabilities can either be provided as
plugins or by directly modifying the source tree.
To reach this objective, different partial goals have to be
attained. The following list summarizes them:
1. Adding support for starting and stopping virtual clusters
in SLURM with the newly-created virtual clusters attaching to the existing SLURM instance and on which regular
jobs can be launched.
This will require adding support for the libvirt or related
virtualization management libraries to SLURM.
2. Adding support for controlling the pausing and migration
of virtual machines to SLURM so that more sophisticated
resource allocation decisions can be made (e.g. migrating multiple virtual machines of a virtual cluster onto a
single physical node) as new information (e.g. new jobs) becomes available.
3. Implementing simple resource allocation strategies based
on the existing job scheduling techniques to demonstrate
the capabilities added to SLURM, KVM, and Palacios in
(1) and (2) above.
3.
Deadlines
The deadlines for the project are summarized in Table 1.
For more detailed information about the content of each
deliverable or milestone, refer to the planning document.
A.
Context
This work is carried out at Scalable Systems Lab (SSL) of
the Computer Science department of the University of New
Mexico, USA (UNM) during Summer 2011.
The project will be based on SLURM (Simple Linux
Utility for Resource Management) as underlying layer for
resources management and two different virtual machine
monitors: KVM and Palacios. KVM (Kernel-based Virtual Machine) is a general purpose virtual machine monitor
which integrates directly into the Linux kernel, while Palacios is an HPC-oriented hypervisor designed to be embedded into a range of different host operating systems, including lightweight Linux kernel variants and thus potentially
including the Cray Linux Environment.
B.
Experts and Supervisors
Prof. Peter Kropf, Head of the Distributed Computing Group
and Dean of the Faculty of Science of the University of
Neuchatel, Switzerland covers the role of expert.
Prof. Patrick G. Bridges, associate professor at the University of New Mexico is supervising the project locally.
Date¹       Type²   Asset
Jun. 6      M       Project start
Jun. 20     D       Project statement
Jun. 20     D       Planning
Jul. 1      M       Dynamic partitions
Jul. 8      M       Virtual clusters
Jul. 15     M       Virtual cluster pausing/resuming
Jul. 15     D       Project summary (EN/DE)
Jul. 22     M       Virtual cluster resizing
Jul. 29     M       Migration strategy
Aug. 19     D       Final presentation
Aug. 19     D       Final report
Aug. 19     D       Project sources and documentation
Aug. 19     M       Project end
Sep. 7      M       Oral defense
Sep. 7      D       Project poster

¹ All dates refer to 2011
² M=Milestone, D=Deliverable

Table 1. Deadlines for the project.
Prof. Pierre Kuonen, Head of the GRID and Cloud Computing Group, and Prof. François Kilchoer, Dean of the
Computer Science Department, both of the College of Engineering and Architecture Fribourg, are supervising the
project from Switzerland.
C.
Useful resources
• https://computing.llnl.gov/linux/slurm/
Official web site of the SLURM project. Sources, admin/user documentation, papers as well as a configuration tool can be found there.
• http://www.linux-kvm.org/
Web site of the KVM (Kernel-based Virtual Machine)
project. A special page describing VM migration using
KVM is available.
• http://www.v3vee.org/palacios/
Official web site of the Palacios VMM project, developed
as part of the V3VEE project. Access to news, documentation and source code is available.
Appendix D
CD-ROM contents
The following assets can be found on the attached CD-ROM:
• api-reference: The interactive (HTML) version of the VURM API reference;
• documents: A collection of all documents related to the project management (meeting
minutes, planning versions, compiled report, summaries in different languages...);
• report-sources: Complete checkout of the vurm-report LaTeX sources git repository;
• vurm-sources: Complete checkout of the vurm source git repository;
• website.tar.gz: Tarball of the project wiki used throughout the project (needs a
PHP runtime to execute).