Università di Pisa
Dipartimento di Informatica
Technical Report: TR-07-13

VirtuaLinux Design Principles∗

Marco Aldinucci†, Massimo Torquati†, Marco Vanneschi†
Dipartimento di Informatica, Università di Pisa, Italy

Manuel Cacitti‡, Alessandro Gervaso‡, Pierfrancesco Zuccato‡
Eurotech S.p.A., Italy

June 21, 2007

ADDRESS: largo B. Pontecorvo 3, 56127 Pisa, Italy. TEL: +39 050 2212700 FAX: +39 050 2212726

Abstract

VirtuaLinux is a Linux meta-distribution that allows the creation, deployment and administration of both physical and virtualized clusters with no single point of failure. Single points of failure are avoided by means of a combination of architectural, software and hardware strategies, including transparent support for disk-less and master-less cluster configurations. VirtuaLinux supports the creation and management of virtual clusters in a seamless way: the VirtuaLinux Virtual Cluster Manager enables the system administrator to create, save, and restore Xen-based virtual clusters, and to map and dynamically re-map them onto the nodes of the physical cluster. Master-less, disk-less and virtual clustering rely on the novel VirtuaLinux disk abstraction layer, which enables the fast (almost constant time), space-efficient, dynamic creation of virtual clusters composed of fully independent complete virtual machines. VirtuaLinux has been jointly designed and developed by the Computer Science Dept. (HPC lab.) of the University of Pisa and Eurotech HPC lab., a division of Eurotech S.p.A. VirtuaLinux is open source software under the GPL, available at http://virtualinux.sourceforge.net/.

∗ VirtuaLinux has been developed at the HPC lab. of the Computer Science Dept., University of Pisa, and Eurotech HPC, a division of Eurotech Group.
The VirtuaLinux project has been supported by the initiatives of the LITBIO Consortium, founded within the FIRB 2003 grant by MIUR, Italy.
† Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, I-56127 Pisa, Italy. WEB: http://www.di.unipi.it/groups/architetture/ E-mail: {aldinuc, torquati, vannesch}@di.unipi.it
‡ Eurotech S.p.A., Via Fratelli Solari 3/a, I-33020 Amaro (UD), Italy. WEB: http://www.eurotech.com E-mail: {m.cacitti, a.gervaso, p.zuccato}@exadron.com

Contents

1 Introduction
1.1 Common Flaws of Classical Clusters
1.2 No Single Point of Failure
2 Disk-less Cluster
2.1 Storage Area Network (SAN)
2.2 Design alternatives: Network-Attached Storage (NAS)
2.3 VirtuaLinux Storage Architecture
2.3.1 Understanding the Snapshot Technique
2.3.2 Snapshots as Independent Volumes: an Original Usage
2.4 Cluster Boot Basics
3 Master-less Cluster
4 Cluster Virtualization
4.1 Para-virtualization
4.2 Xen Architecture
5 VirtuaLinux Virtual Clustering
5.1 VC Networking
5.2 VC Disk Virtualization
5.3 VC Mapping and Deployment
5.4 VVCM: VirtuaLinux Virtual Cluster Manager
6 Experiments
7 Conclusions
A VT-x: Enabling Virtualization in the x86 ISA
B Eurotech HPC Solutions
C VirtuaLinux-1.0.6 User Manual
C.1 VirtuaLinux Pre-requisites
C.2 Physical cluster
C.2.1 Cluster Boot
C.2.2 Cluster Setup (Configuration and Installation)
C.2.3 Cluster or Node Restore
C.3 Cluster Virtualization
C.3.1 Installation
C.3.2 Tools

1 Introduction

A computer cluster is a group of tightly coupled computers that work together closely so that in many respects they can be viewed as a single computing platform. The computers comprising a cluster (nodes) are commonly connected to each other through one or more fast local area networks. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability. Nodes may range over the complete spectrum of computing solutions, from large Symmetric MultiProcessor (SMP) computers to bare uniprocessor desktop PCs. Intermediate solutions, such as dual/quad-core processors, are nowadays popular because they can fit in very compact blades, each of them implementing a complete multiprocessor, multi-core computer. As an example, Eurotech clusters are currently shipped in a 4U case hosting 8 blades, each of them equipped with two dual-core CPUs (32 cores per case). Larger clusters can be assembled by wiring more than one case [5].

A wide range of solutions is also available for node networking. High-density clusters are commonly equipped with at least a switched Ethernet network, owing to its limited cost and wide support in operating systems. Recently, the Infiniband network has also become increasingly popular due to its speed-cost trade-off [10].
As an example, Eurotech clusters are currently shipped with one switched Fast Ethernet, two switched Gigabit Ethernet, and one 10 Gbit/s Infiniband NIC per node (the Infiniband switch is not included in the 4U case).

Since the nodes of a cluster implement complete computers, each node is usually equipped with all standard devices, such as RAM, disk, USB ports, and character I/O devices (keyboard, video, mouse). High-density clusters typically include a terminal concentrator (KVM) to enable the control of all nodes from a single console. Also, they include a hard disk per blade and are possibly attached to an external Storage Area Network (SAN). A disk mounted on a blade typically holds the node Operating System (OS) and might be used, either locally or globally, to store user data. A SAN is typically used to store user data because of its shared nature. A blade for high-density clusters can typically be fitted with no more than one disk, which is usually quite small and slow (e.g. 2.5-inch SATA disk, 80-120 GBytes, 5-20 MBytes/s) due to space, power, and cooling limitations. A SAN usually exploits a Redundant Array of Independent Disks (RAID) and offers large storage space, built-in fault-tolerance, and medium-large I/O speed (e.g. 10 TBytes, 400-800 MBytes/s). SANs are usually connected to a cluster via Fibre Channel and/or other Gigabit-speed technologies such as Ethernet or Infiniband.

The foremost motivation for the popularity of clusters comes from their similarity with a network of complete computers. They are built by assembling standard components; thus, on the one hand, they are significantly cheaper than parallel machines and, on the other hand, they can benefit from already existing and tested software. In particular, the presence of a private disk space for each node enables the installation of standard (possibly SMP-enabled) OSes that can be managed by non-specialized administrators.
Once installed, the nodes of a cluster can be used in isolation to run different applications, or cooperatively to run parallel applications. Also, a number of tools for the centralized management of nodes are available in all OSes [25]. These tools typically use a client-server architecture: one of the nodes of the cluster acts as the master of the cluster, while the others depend on it. The master is usually statically determined at installation time on the basis of its hardware (e.g. larger disks) or software (e.g. services configuration).

1.1 Common Flaws of Classical Clusters

Unfortunately, disks mounted on blades, especially within high-density clusters, are statistically the main source of failures because of the strict constraints on size, power and temperature [19, 24]. Moreover, they considerably increase cluster engineering complexity and power requirements, and decrease density. This is particularly true of the master disk, which also happens to be a critical single point of failure for cluster operation, since it hosts services involving file sharing, user authentication, Internet gateway and, in the case of a disk-less cluster, also root file system access, IP management and network boot services. A hardware or software crash or malfunction on the master is simply a catastrophic event for cluster stability. Moreover, cluster management exhibits several critical issues:

• First installation and major OS upgrades are very time consuming, and during this time the cluster must be taken offline. Node hot-swapping usually requires cluster reconfiguration.
• A single configuration, or even a single OS, can rarely be adapted to satisfy all user needs. Classic solutions, such as static cluster partitioning with multiple boots, are not flexible enough to consolidate several user environments and require additional configuration effort.
• Since cluster configuration involves the configuration of distinct OS copies on different nodes, any configuration mistake, which may seriously impair cluster stability, is difficult to undo.

We present a coupled hardware-software approach based on open source software aiming at the following goals:

1. Avoiding fragility due to the presence of disks on the blades by removing disks from blades (high-density disk-less cluster) and replacing them with a set of storage volumes. These are abstract disks implemented via an external SAN that is accessed via suitable protocols.
2. Avoiding single points of failure by removing the master from the cluster. Master node features, i.e. the set of services implemented by the master node, are categorized and made redundant by either active or passive replication in such a way that they are, at each moment, cooperatively implemented by the running nodes.
3. Improving management flexibility and configuration error resilience by means of transparent node virtualization. A physical cluster may support one or more virtual clusters (i.e. clusters of virtual nodes) that can be independently managed, with no impact on the underlying physical cluster configuration and stability. Virtual clusters run a guest OS (either a flavor of Linux or Microsoft Windows) that may differ from the host OS, which governs physical cluster activities.

These goals are achieved independently, through solutions that have been designed to be coupled but can also be adopted selectively. A suite of tools, called VirtuaLinux, enables the boot, installation, configuration and maintenance of a cluster exhibiting the previously described features. VirtuaLinux is currently targeted to AMD/Intel x86_64-based nodes, and includes:

• One or more Linux distributions, currently Ubuntu Edgy 6.10 and CentOS 4.4.
• An install facility able to install and configure the included distributions according to goals 1-3.
• A recovery facility able to revamp a misconfigured node.
• A toolkit to manage virtual clusters (VVCM) and one or more pre-configured virtual cluster images (currently Ubuntu Edgy 6.10 and CentOS 4.4).

In the following sections we describe how goals 1-3 are achieved. Features a-d (the four components listed above) are described in Appendix C.

1.2 No Single Point of Failure

There is no such thing as a perfectly reliable system. Reliability engineering cannot engineer out failure modes that are not anticipated by modeling. For this reason, reliable systems are usually specified at, and designed to, some non-zero failure rate (e.g. 99.99% availability). The main engineering approaches toward reliable systems design are (in order of importance):

• eliminating single points of failure (“no single point of failure”);
• engineering any remaining single points of failure to whatever level is necessary to reach the system specification;
• adding extra system safety margins to allow for errors in modeling or implementation.

A single point of failure is any part of the system that can, if it fails, cause an interruption of the required target service. This can be as simple as a process failure or as catastrophic as a computer system crash. The present work aims to remove several significant single points of failure in cluster organization at the OS level. This means that the target service is cluster functionality at the OS level, not a particular service running on one or more nodes of the cluster. In particular, VirtuaLinux aims to guarantee that the following properties hold in a cluster:

• if some nodes crash due to a hardware or software failure, the nodes that have not crashed remain fully functional, independently of the identity of the crashed nodes (full symmetry of nodes);
• crashed nodes can be repaired and restarted with no impact on other running nodes. Nodes can be hot-removed from or hot-added to the cluster (nodes are hot-swappable).
VirtuaLinux is able to support these properties provided that it is running on suitable hardware, which exhibits a sufficient level of redundancy in cluster support devices, such as power feeds, network connections, routers, and router interconnections. The same kinds of assumptions are made about the internal SAN architecture and the connections between the SAN and the cluster. Note that, for mission-critical systems, the mere use of massive redundancy does not make a service reliable, because the whole system itself may be a single point of failure (e.g. both routers are housed in a single rack, allowing a single spilled cup of coffee to take out both routers at once). Mission-critical related problems are not addressed by VirtuaLinux and are outside the scope of the present work.

2 Disk-less Cluster

2.1 Storage Area Network (SAN)

In computing, a storage area network (SAN) is a network designed to attach computer storage devices, such as disk array controllers, to servers. A SAN consists of a communication infrastructure, which provides physical connections, and a management layer, which organizes the connections, storage elements, and computer systems so that data transfer is secure and robust. SANs are distinguished from other forms of network storage by the low-level access method that they use (block I/O rather than file access). Data traffic on the SAN fabric is very similar to that of the protocols used for internal disk drives, like ATA and SCSI. By contrast, in more traditional file storage access methods, like SMB/CIFS or NFS, a server issues a request for an abstract file as a component of a larger file system, managed by an intermediary computer. The intermediary then determines the physical location of the abstract resource, accesses it on one of its internal drives, and sends the complete file across the network.
Sharing storage usually simplifies storage administration and adds flexibility, since cables and storage devices do not have to be physically moved to move storage from one server to another. SANs tend to increase storage capacity utilization, since multiple servers can share the same growth reserve. Compared with the disks that a high-density cluster can accommodate, they also exhibit better performance and reliability, since they are realized by arranging high-speed, high-quality disks in RAID. SANs also tend to enable more effective disaster recovery processes. A SAN-attached storage array can replicate data belonging to many servers to a secondary storage array. This secondary array can be local or, more typically, remote. The goal of disaster recovery is to place copies of data outside the radius of effect of an anticipated threat, which is made possible by the long-distance transport capabilities of SAN protocols.¹

Note that, in general, SAN storage implements a one-to-one relationship. That is, each device, or Logical Unit Number (LUN), on the SAN is owned by a single computer. However, in order to achieve the disk-less cluster goal, it is essential to ensure that many computers can access the same disk abstraction over a network, since the SAN should support all nodes of the cluster, which are independent computers. In particular, the SAN should store the OS and swap space of each node, in addition to (possibly shared) application data. Note that the node OS is not of the Single-System Image (SSI) kind; thus each node needs to own a private read-write copy of several sub-trees of the root file system (e.g. /var). While several design alternatives are available to achieve this goal, probably the cleanest one consists in supplying each node with a private partition of the disk abstraction.
VirtuaLinux implements a many-to-one abstract disk supporting a flexible partition mechanism by stacking iSCSI (Internet Small Computer System Interface) and EVMS (Enterprise Volume Management System):

• iSCSI is a network protocol standard that allows the use of the SCSI protocol over TCP/IP networks. It enables many initiators (e.g. nodes) to access (read and write) a single target (e.g. SAN), but it does not ensure any coherency/consistency control when many initiators access the same partition in read-write mode [13].
• EVMS provides a single, unified system for handling storage management tasks, including the dynamic creation and destruction of volumes, which are EVMS abstractions that behave as disk partitions and enable access to volumes of the same target from different nodes [21].

iSCSI, EVMS and their role in the VirtuaLinux design are discussed in the next sections.

¹ Demand for this SAN application has increased dramatically after the September 11th attacks in the United States, and with the increased regulatory requirements associated with Sarbanes-Oxley and similar legislation.

2.2 Design alternatives: Network-Attached Storage (NAS)

The proposed approach is not the only possible one. Another approach consists in setting up a shared file system abstraction exported by the SAN to the nodes of the cluster, i.e. a Network-Attached Storage (NAS). SMB/CIFS and NFS are examples of NAS protocols. Note that the file system abstraction (NAS) is a higher-level abstraction with respect to the disk abstraction (SAN). This impacts both the external storage (server) and the cluster (client):

• The external storage should be able to implement a quite high-level protocol, and thus should be smart enough. For example, it can be realized with a computer running dedicated software (e.g. OpenFiler), which may become a single point of failure. In addition, the solution is not appropriate for operating system tools that directly access the block device.
• The nodes of the cluster need a quite deep stack of protocols to access a NAS, and thus need a considerable fraction of the OS functionality. These protocols should be set up in the early stages of node boot in order to mount the root file system; in particular, they should be run from the initial ramdisk (initrd). This has several disadvantages. First, the complexity of pivoting running protocols from the initial ramdisk to the root file system, which may involve deep modifications to the standard OS configuration. Second, the impossibility for the OS to access the disk at the block level, and thus independently of the file system, which strongly couples the methodology with the target OS. For example, the OS cannot use a swap partition, but just a swap file.

Overall, the VirtuaLinux design aims to reduce the dependency of the disk-less architecture on OS-specific high-level protocols and to expose to the OS a collection of abstract nodes that are as similar as possible to a classical cluster.

2.3 VirtuaLinux Storage Architecture

As mentioned above, EVMS provides a single, unified system for handling storage management tasks, including the dynamic creation and destruction of volumes, which are EVMS abstractions seen by the OS as disk devices [6, 21]. The external SAN should hold a distinct copy of the OS for each node. To this end, VirtuaLinux prepares, during installation, one volume per node and a single volume for data shared among nodes. As we shall see later, several other volumes are used to realize the virtual cluster abstraction. Node volumes are formatted with an OS-specific native file system (e.g. ext3), while shared volumes are formatted with a distributed file system that arbitrates concurrent reads and writes from cluster nodes, such as the Oracle Cluster File System (OCFS2) or the Global File System (GFS).

Volumes are obtained by using the EVMS snapshot facility (see Sec. 2.3.1). A snapshot represents a frozen image of an original source volume.
When a snapshot is created, it looks exactly like the original at that point in time. As changes are made to the original, the snapshot remains the same and looks exactly like the original at the time the snapshot was created. A file on a snapshot is a reference (at the level of disk blocks) to its original copy, and thus does not consume disk space while the original and its snapshot copy remain identical. A file is really stored in the snapshot, and thus consumes disk space, only when either the original or its snapshot copy is modified. As a consequence, snapshot creation is quite a fast operation.

The snapshot technique is usually used to build on-line backups of a volume: accesses to the volume are suspended just for the (short) time of snapshot creation; then the snapshot can be used as an on-line backup, which can be kept on-line either indefinitely or just for the time needed to store it on a different storage medium (e.g. tape). Multiple snapshots of the same volume can be used to keep several versions of the volume over time. As we shall see in Sec. 2.3.2, the management of a large number of snapshots requires particular care in current Linux systems.

VirtuaLinux installs an original volume with the selected OS distribution (called the default), and then creates n identical snapshots. Each node of the cluster uses a different snapshot as its root file system. Once snapshots have been made accessible (activated), the content of both the original and the snapshots can evolve along different paths, as they are independent volumes. However, in the case of cluster management, snapshots have several advantages compared to independent volumes:

• Fast creation time. Assume an n-node cluster is to be installed. Since each node of the cluster requires a private disk, n independent volumes should be created at installation time starting from the same initial system distribution (e.g. CentOS system image).
These volumes are physically stored in the same SAN due to the disk-less architecture. Creating these volumes by a standard copy loop may be extremely expensive in terms of time, since a complete Linux distribution should be installed n times. (The estimated time depends on many factors, such as the number of nodes, distribution size, DVD reader speed, and SAN throughput; however, it can easily reach several hours even for small cluster configurations due to the large number of small files that must be copied.) As we shall see in Sec. 5, a similar amount of time should be spent on the creation of each new virtual cluster as well. Snapshot usage drastically decreases volume creation time, since volume content is not copied but just referenced at the disk block level. Empirical experience shows that a snapshot of a 10 GByte volume is created in a few seconds on a Gigabit Ethernet attached SAN.
• Reduced disk space usage. In the general case, a snapshot requires at least the same amount of space as the original volume. This space is used to store original files in case they are changed in the original volume after snapshot creation time, or new data stored in the snapshot that did not exist in the original volume at snapshot creation time. However, VirtuaLinux uses snapshots in a particular way: the original volume holds the root file system of the Linux distribution, which does not change over time (when the original volume changes, the snapshots are reset). Since data in the original volume is immutable to a large degree (OS files), a considerable amount of disk space is saved with respect to a full data copy.
As an example, if a snapshot volume Y (sizeof(Y) = y) is created from an original X (sizeof(X) = x), which stores an amount z of immutable data, then Y is able to store an amount y of fresh data, for a total size available from Y of almost y + z.
• Device name independence. EVMS ensures the binding between the raw device name (e.g. /dev/sda1) and the logical volume name (e.g. /dev/evms/node1). Avoiding the use of raw device names is particularly important when using iSCSI-connected devices, since they may appear on different nodes with different names, messing up the system configuration (this typically happens when a node has an additional device with respect to other nodes, e.g. an external DVD reader).
• Centralized management. A snapshot can be reset to a modified version of the original. Data that has been changed in the snapshot is lost. This facility enables the central management of copies, for example for major system updates that involve all nodes. This facility is not strictly needed for cluster management, since all snapshots can also be changed, as they are different copies, by using classical cluster techniques such as broadcast remote data distribution [25].

[Figure 1: VirtuaLinux storage architecture. Layers shown: SAN disks, segments, containers, regions, snapshots, EVMS volumes (/dev/evms/default, /dev/evms/node1 ... /dev/evms/noden, /dev/evms/swap1 ... /dev/evms/swapn, /dev/evms/shared) and file systems (ext3, swap, OCFS2).]

The architectural view of VirtuaLinux disk management is sketched in Fig. 1. Notice that, since EVMS is a quite flexible and sophisticated management tool, the same goal can be achieved with different architectural designs, for example by using real volumes instead of snapshots with the EVMS cluster management facility. As discussed above, the VirtuaLinux design exhibits superior features with respect to alternative (and more classical) design options. The full description of EVMS functionality, which is outside the scope of this paper, can be found in [6, 21].
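To make the naming scheme and the space argument of this section concrete, the following Python sketch builds the per-node logical volume names that appear in Fig. 1 (/dev/evms/nodeK and /dev/evms/swapK) and compares the SAN space needed by n full copies with the space needed by one original plus n snapshots. It is only an illustration under stated assumptions: the function names and the sizes used in the example are hypothetical, and the estimate assumes that the OS data in the original volume stays immutable, as discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeVolumes:
    """Logical volumes used by one node (names follow Fig. 1)."""
    root: str  # snapshot of the default volume, used as root file system
    swap: str  # private swap volume


def volume_plan(n_nodes):
    """Return the logical volume names for nodes 1..n_nodes."""
    return {
        f"node{i}": NodeVolumes(root=f"/dev/evms/node{i}",
                                swap=f"/dev/evms/swap{i}")
        for i in range(1, n_nodes + 1)
    }


def space_full_copies(n_nodes, os_gb, private_gb):
    """Space needed when every node holds a full copy of the OS data."""
    return n_nodes * (os_gb + private_gb)


def space_snapshots(n_nodes, os_gb, private_gb):
    """Space needed with one original volume plus one snapshot per node.

    Assumption (hypothetical sizing): each snapshot only reserves room
    for the node's private data, because OS data is referenced from the
    original volume and not copied.
    """
    return os_gb + n_nodes * private_gb


if __name__ == "__main__":
    plan = volume_plan(3)
    for node, vols in plan.items():
        print(node, vols.root, vols.swap)
    # Hypothetical sizes: 10 GBytes of OS data, 2 GBytes private per node.
    print(space_full_copies(8, 10, 2), "GBytes with full copies")
    print(space_snapshots(8, 10, 2), "GBytes with snapshots")
```

With the hypothetical sizes above, an 8-node cluster would need 96 GBytes with full copies and 26 GBytes with snapshots; real figures depend on how much data each node actually changes.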
Note that VirtuaLinux uses the snapshot technique to provide a cluster with a number of independent volumes that can be efficiently created from a common template volume (the original), whereas snapshots are usually used as transient, short-lived on-line backups. To the best of our knowledge, no other system exhibits a similar usage of snapshots (and the consequent features). Indeed, in order to correctly exploit this different usage of snapshots, VirtuaLinux slightly extends the EVMS snapshot semantics and implementation. This extension, which is described in the following sections, is correct with respect to EVMS snapshot semantics.

2.3.1 Understanding the Snapshot Technique

There are different implementation approaches adopted by vendors to create snapshots, each with its own benefits and drawbacks. The most common are copy-on-write, redirect-on-write, and split mirror. We briefly describe copy-on-write, which is adopted by EVMS; we refer the reader to the literature for an extensive description [9].

A snapshot of a storage volume is created using the pre-designated space for the snapshot. When the snapshot is first created, only the meta-data about where the original data is stored is copied. No physical copy of the data is made at the time the snapshot is created. Therefore, the creation of the snapshot is almost instantaneous. The snapshot copy then tracks the changing blocks on the original volume as writes to the original volume are performed. Before a write to a block of the original volume is allowed, copy-on-write copies the original data block into the storage pool that is set aside for the snapshot. This keeps the snapshot data consistent with the exact time the snapshot was taken.
Read requests to the snapshot volume for unchanged data blocks are redirected to the original volume, while read requests for data blocks that have been changed are directed to the “copied” blocks in the snapshot. The snapshot contains the meta-data that describes the data blocks that have changed since the snapshot was first created. Note that an original data block is copied into the snapshot storage only once, when the first write request for it is received.

In addition to the basic functionality, EVMS snapshots can be managed as real volumes, i.e. data can be added or modified on the snapshot with no impact on the original volume, provided that enough free space has been pre-allocated for the snapshot. Also, they can be activated and deactivated like standard volumes, i.e. mapped onto and unmapped from Unix device drivers. However, despite being standard volumes, snapshots have subtle semantics with respect to activation, due to the copy-on-write behaviour. In fact, the system cannot write on an inactive snapshot since it is not mapped to any device; the snapshot may thus lose the correct alignment with its original during the deactivation period. EVMS solves the problem by logically marking a snapshot for reset at deactivation time, and resetting it to the current status of the original at activation time.

2.3.2 Snapshots as Independent Volumes: an Original Usage

As discussed above, VirtuaLinux uses EVMS snapshots to provide a cluster with a number of independent volumes that can be efficiently created from a common template volume (the original). Since snapshots cannot be deactivated without losing snapshot private data, they should all be kept always active on all nodes, even if each node accesses only one of them. Snapshots on Linux (whether created via EVMS, LVM, or other software) are managed as UNIX devices via the device mapper kernel functionality.
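The copy-on-write behaviour described in Sec. 2.3.1 can be illustrated with a small in-memory model. The sketch below is a toy written under stated assumptions, not EVMS code: the class and method names (Volume, Snapshot, preserve, reset) are hypothetical, and blocks are plain Python values rather than disk blocks. It shows the three rules discussed above: a block of the original is saved into the snapshot store only once, before its first overwrite; snapshot reads of unchanged blocks are redirected to the original; and writes to the snapshot do not affect the original.

```python
class Volume:
    """Toy original volume: a list of blocks plus its snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def write(self, index, data):
        # Copy-on-write: save the old block into every snapshot that
        # does not hold its own copy yet, then overwrite the original.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


class Snapshot:
    """Toy snapshot: stores only blocks that differ from the original."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}  # block index -> data held by the snapshot
        origin.snapshots.append(self)

    def preserve(self, index, old_data):
        # A block is copied only once, at the first write to the original;
        # blocks already held (saved or privately written) are kept.
        self.store.setdefault(index, old_data)

    def read(self, index):
        # Changed blocks come from the snapshot store, unchanged blocks
        # are redirected to the original volume.
        if index in self.store:
            return self.store[index]
        return self.origin.blocks[index]

    def write(self, index, data):
        # Snapshot-private write: no impact on the original volume.
        self.store[index] = data

    def reset(self):
        # Models a snapshot reset: private data is dropped and the
        # snapshot looks like the current original again.
        self.store.clear()


if __name__ == "__main__":
    origin = Volume(["a", "b", "c"])
    snap = Snapshot(origin)
    origin.write(0, "A")   # old block "a" is saved in the snapshot
    snap.write(1, "B")     # private change, original keeps "b"
    print([snap.read(i) for i in range(3)])
    print(origin.blocks)
```

In this model the reset method corresponds to the reset that EVMS applies when a deactivated snapshot is activated again, which is why snapshot-private data does not survive a deactivation.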
Although EVMS does not fix any limit on the number of snapshots that can be created or activated, current Linux kernels establish a hardwired limit on the number of snapshots that can be concurrently active on the same node. This limit comes from the number of pre-allocated memory buffers (in kernel space) that are required for snapshot management. Standard Linux kernels enable no more than a dozen active snapshots at the same time. This constrains the number of snapshots that can be activated at the same time and thus, indirectly, the number of nodes that VirtuaLinux can support. Raising this limit is possible, but requires a non-trivial intervention on the standard Linux kernel code. VirtuaLinux overcomes the limitation with a different approach, which does not require modifications to the kernel code. It leverages the following facts:

• Since each snapshot is used as a private disk, each snapshot is required to be accessible on the corresponding node only. In this way, each node needs to map just one snapshot onto a device.
• The status of an EVMS snapshot is kept on permanent storage. This information is also maintained in memory in terms of available snapshot objects, in a lazily consistent way. Status information is read at EVMS initialization time (evms_activate), and committed at any EVMS command (e.g. create, destroy, activate, deactivate a snapshot). While each snapshot can have just one status for all nodes on permanent storage, it may have a different status in the local memory of each node (e.g. it can be mapped onto a device on one node, while not appearing on another).
• Snapshot deactivation consists in unmapping a snapshot device from the system, then logically marking it for reset on permanent storage. VirtuaLinux extends EVMS with the option to disable the EVMS snapshot reset-on-activate feature via a special flag in the standard EVMS configuration file.
In the presence of this flag, the extended version of EVMS will proceed to unmap the snapshot without marking it for reset.

The VirtuaLinux EVMS extension preserves snapshot correctness since the original volume is accessed in read-only mode by all nodes, and thus no snapshot can lose alignment with the original. One exception exists: major system upgrades, which are performed directly on the original copy of the file system, trigger the reset of all snapshots. At the implementation level, the VirtuaLinux EVMS extension requires patching the EVMS user-space source code (actually just one line of C code).

Overall, VirtuaLinux extends EVMS semantics. The extension covers a case in which the general conditions that trigger a snapshot reset have been relaxed (reset-on-activate is avoided), provided the original volume is not written. The extension ensures snapshot correctness. The described EVMS extension enables an original usage of the general snapshot technique.

2.4 Cluster Boot Basics

The described architecture has a unique permanent storage: the iSCSI-attached SAN. Although booting from it (iBoot [8]) will be possible quite soon, iBoot is not currently supported by the majority of cluster vendors. VirtuaLinux is designed to enable the migration toward iBoot, but it currently provides cluster boot through standard Intel PXE (Preboot Execution Environment [11]). The PXE protocol is approximately a combination of DHCP (Dynamic Host Configuration Protocol) and TFTP (Trivial File Transfer Protocol), albeit with some modifications to both. DHCP is used to locate the appropriate boot server or servers, while TFTP is used to download the initial bootstrap program and additional files. VirtuaLinux uses an initial ram disk image (initrd) as bootstrap program.
It includes the kernel of the selected OS as well as all the required software (both kernel modules and applications) to bring up the connection with the SAN via iSCSI in order to mount the root file system, which is stored in an EVMS-managed volume on the SAN itself. As a result, the cluster can be booted by providing the nodes with a suitable initrd. Since PXE is a client-server protocol, the cluster should be provided with a PXE server functionality to seed initrd images during all stages of the cluster boot.

Clusters usually rely on a master-slaves organization: a distinguished node of the cluster (master) provides to other nodes most of the cluster basic services, including DHCP and TFTP. These services enable the network boot facility for the other disk-less nodes. The master node solution cannot be adopted by VirtuaLinux (goal 2) since the master is clearly a single point of failure of the cluster configuration: any hardware or software problem on the master node may catastrophically disrupt cluster stability. To this end, VirtuaLinux introduces the meta-master functionality, which is supported cooperatively by all nodes of the cluster in such a way that the boot service can be guaranteed also in the case of a crash of one or more nodes. Nevertheless, both during cluster boot and install, a single node acting as the spark plug of the cluster start-up is unavoidable. The full booting of the cluster is therefore achieved in three steps:

[Figure 2: VirtuaLinux boot procedure. Panels ➀, ➁ and ➂ show nodes 1..n connected to the SAN via Ethernet (Eth) and InfiniBand (IB).]

➀ One of the nodes of the cluster is booted via a DVD reader loaded with VirtuaLinux. The booted OS is equipped with meta-functionality (and tools for SAN preparation and installation, which are described later in this guide).
➁ All nodes booted after the first inherit from it some of the meta-master functionality; thus, the first node can be rebooted from the network (just detaching the external USB DVD reader).

➂ At the end of the process, all nodes are uniformly configured, and each of them has inherited some of the meta-master functionality and is thus able to provide at least the TFTP service (as well as other above-mentioned services).

The detailed instructions to boot and install a cluster with VirtuaLinux are reported in Sec. C. Notice that the whole procedure, and in particular the live DVD, is largely independent of the installed OS, provided it is a flavor of Linux. In order to add yet another distribution to the already existing ones, it is enough to provide a valid initrd (which should be able to run iSCSI and EVMS) and a valid tarball of the root file system.

3 Master-less Cluster

In the previous section we discussed the importance of robust services with respect to the cluster boot (e.g. DHCP and TFTP). Since in master-based clusters almost all services are implemented in the master node, the service robustness issue becomes even more relevant during the cluster steady-state operation. Those services are related to file sharing, time synchronization, user authentication, network routing, etc. VirtuaLinux avoids single points of failure in cluster operation by enforcing fault-tolerance of the master for all services (master-less cluster). Fault-tolerance is enforced by using two classical techniques: active replication and passive replication (primary-backup), targeting the categories of stateless and stateful services, respectively. Two additional categories should be added for completeness, i.e. the categories of node-oriented and self-healing services.

• Stateless services. Active replication can be used whenever the service can be configured to behave as a stateless service.
Typically clients locate this kind of service by using broadcast messages on the local network. A client request is non-deterministically served by the most reactive node.

• Stateful services. Passive replication is used when the service exists in a unique copy on the local network because of its authoritative or stateful nature. A client request usually specifies the service unique identifier (e.g. IP, MAC).

• Node-oriented services. A fail-stop of the node is catastrophic for this particular service, but has no impact on the usage of services on other nodes. No replication is needed in order to ensure cluster stability. Making these services reliable is outside the scope of VirtuaLinux, which ensures cluster availability at the OS level, not the availability of all services running on nodes.

• Self-healing services. These services adopt service-specific methods in order to guarantee fault-tolerance. These methods usually fall into the class of active or passive replication, and are implemented with service-specific protocols.

Only the services needed to ensure cluster stability are pre-configured in VirtuaLinux. Since VirtuaLinux targets the uniform configuration of cluster nodes, active replication should be considered the preferred method. Active replication is realized by carefully configuring service behavior. Passive replication is realized by using a heartbeat fault-detector.
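The passive (primary-backup) scheme with a heartbeat fault-detector can be sketched as follows. This is a minimal model under stated assumptions, not the heartbeat software shipped with VirtuaLinux: the `BackupNode` class, its method names and the timeout value are hypothetical, and time is passed in explicitly so that the logic stays deterministic.

```python
class BackupNode:
    """Backup replica that takes over when the primary stops sending heartbeats."""

    def __init__(self, timeout, start=0.0):
        self.timeout = timeout          # maximum tolerated silence, in seconds
        self.last_heartbeat = start     # time of the last heartbeat received
        self.serving = False            # True once the backup has taken over

    def on_heartbeat(self, now):
        # Called whenever a heartbeat message from the primary arrives.
        self.last_heartbeat = now

    def poll(self, now):
        # Called periodically: suspect the primary after `timeout` of silence.
        if not self.serving and now - self.last_heartbeat > self.timeout:
            # A real deployment would acquire the service IP here (IP takeover).
            self.serving = True
        return self.serving


backup = BackupNode(timeout=3.0)
backup.on_heartbeat(1.0)
still_backup = backup.poll(2.0)    # primary heard from recently
took_over = backup.poll(4.5)       # 3.5 s of silence exceeds the timeout
```

The model keeps the backup in charge once it has taken over; handing the service back to a recovered primary is a policy decision that is left out of the sketch.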
Service | Fault-tolerance | Notes
DHCP | active | Pre-defined map between IP and MAC
TFTP | active | All copies provide the same image
NTP | active | Pre-defined external NTPD, fallback via GW
IB manager | active | Stateless service
DNS | active | Cache-only DNS
LDAP | service-specific | Service-specific master redundancy
IP GW | passive | Heartbeat on 2 nodes with IP takeover (HA)
Mail | node-oriented | Local node and relays via DNS
SSH/SCP | node-oriented | Pre-defined keys
NFS | node-oriented | Pre-defined configuration
SMB/CIFS | node-oriented | Pre-defined configuration

4 Cluster Virtualization

Virtualizing the physical resources of a computing system to achieve improved degrees of sharing and utilization is a well-established concept that goes back decades [7, 20, 23]. Full virtualization of all system resources (including processors, memory and I/O devices) makes it possible to run multiple operating systems (OSes) on a single physical platform. In contrast to a non-virtualized system, in which a single OS is solely in control of all hardware platform resources, a virtualized system includes a new layer of software, called a Virtual Machine Monitor (VMM). The principal role of the VMM is to arbitrate access to the underlying physical host platform resources so that these resources can be shared among multiple OSes that are guests of the VMM. The VMM presents to each guest OS a set of virtual platform interfaces that constitute a Virtual Machine (VM).

By extension, a Virtual Cluster (VC) is a collection of VMs that are running on one or more physical nodes of a cluster and that are wired by one or more virtual private networks. By uniformity with the physical layer, all VMs are homogeneous, i.e. each VM may access a private virtual disk, and all VMs of a virtual cluster run the same OS and may access a shared disk space. Different virtual clusters may coexist on the same physical cluster, but no direct relationship exists among them, apart from their concurrent access to the same resources (see Fig. 3).
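As an illustration of this definition, a VC can be modelled as a named set of homogeneous VMs on a private subnet. The sketch below uses the three clusters shown in Fig. 3; the `VirtualCluster` class, its field names and the choice of assigning addresses from the first host of the subnet are assumptions made for the example, not part of VirtuaLinux.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from itertools import islice


@dataclass(frozen=True)
class VirtualCluster:
    """A homogeneous set of VMs wired by one private virtual network."""
    name: str
    subnet: str          # private network of the VC, in CIDR notation
    vms: int             # number of virtual nodes
    vcpus_per_vm: int    # identical for all VMs of the VC (homogeneity)

    def addresses(self):
        # Assumption: VMs take the first usable host addresses of the subnet.
        return [str(ip) for ip in islice(ip_network(self.subnet).hosts(), self.vms)]


def subnets_disjoint(clusters):
    """True if no two VCs share address space (each VC has its own network)."""
    nets = [ip_network(c.subnet) for c in clusters]
    return not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])


# The three virtual clusters depicted in Fig. 3.
green = VirtualCluster("green", "10.0.3.0/24", vms=4, vcpus_per_vm=1)
tan = VirtualCluster("tan", "10.0.1.0/24", vms=2, vcpus_per_vm=2)
pink = VirtualCluster("pink", "10.0.0.0/24", vms=4, vcpus_per_vm=4)
```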
Virtual clusters bring considerable added value to the deployment of a production cluster because they ease a number of management problems, such as:

• Physical cluster insulation. Crashes or system instability due to administration mistakes or carelessness at the virtual layer are not propagated down to the physical layer and have no security or stability impact on the physical layer. The relative usage of physical resources, such as processor, disk and memory, can be dynamically modulated and harmonized among different clusters. Two classes of system administrators are therefore introduced: the physical and the virtual cluster administrator. For example, physical cluster management may be restricted to qualified/certified administrators, whereas the management rights of a virtual cluster can be given to virtual cluster owners, who can change the configuration and install applications with no impact on the stability of the underlying physical layer.

• Cluster consolidation. Virtualization is used to deploy multiple VCs, each exploiting a collection of VMs running an OS and associated services and applications. Therefore, the VMs of different VCs may be targeted to exploit a different OS and applications to meet different user needs. Installing and deploying new VCs, as well as reconfiguring and tuning existing VCs, does not affect the availability of services and applications running on the physical cluster and other VCs.

• Cluster super scalability. VCs may enable the efficient usage of the underlying hardware. A couple of cases are worth mentioning. First, the multiplexing of physical resources may induce a natural balancing of physical resource usage. This happens, for example, if a VC runs mostly I/O-bound services whereas another runs mostly CPU-bound applications. Second, applications and OS services can be tested with a parallelism degree that is far larger than the number of nodes physically available on the cluster.

[Figure 3: A physical cluster running three Virtual Clusters. The physical cluster (4 nodes x 4 CPUs, InfiniBand 192.0.0.0/24, Ethernet 192.0.1.0/24, external SAN, Internet gateway 131.1.7.6) hosts the VCs "green" (4 VMs x 1 VCPU, 10.0.3.0/24), "tan" (2 VMs x 2 VCPUs, 10.0.1.0/24) and "pink" (4 VMs x 4 VCPUs, 10.0.0.0/24).]

The main drawback of virtualization is overhead, which usually grows with the extent of hardware and software layers that should be virtualized. Virtualization techniques are usually categorized in three main classes, in decreasing order of overhead: emulation, binary translation, and para-virtualization. At one end of the spectrum, a system can be emulated on top of a pre-packaged processor simulator (footnote 3), i.e. an interpreter running on a processor of kind A (e.g. x86) that fully simulates the same or another processor of kind B (e.g. PowerPC). A guest OS and related applications can be installed and run on top of the emulator. Emulation overhead is usually extremely large; therefore the use of the technique is limited to addressing interoperability tasks. Para-virtualization is at the other end of the spectrum [3].
Para-virtualization is something of a hack, as it makes the hosted OSes aware that they are in a virtualized environment, and modifies them so they will operate accordingly: some guest OS machine instructions are statically substituted by calls to the VMM. Virtualization mainly involves operations at the OS kernel level, such as I/O device arbitration, memory management, and process scheduling. In this regard, it is not so much a complete virtualization as a cooperative relationship between the VMM and the guest OSes. Para-virtualization is further discussed in the next section.

Footnote 3: Simulation is not, in general, a synonym of emulation. Unlike simulation, which attempts to gather a great deal of run-time information as well as reproducing a program’s behavior, emulation attempts to model to various degrees the state of the device being emulated. However, the contrast between the two terms has been blurred by the current usage, especially when referring to CPU emulation.

Binary translation is the middle ground [1]. It translates sequences of instructions from the source to the target instruction set. The translation may be performed either statically or dynamically. In static binary translation, an entire executable file is translated into an executable of the target architecture. This is very difficult to do correctly, since not all the code can be discovered by the translator. For example, some parts of the executable may be reachable only through indirect branches, whose target is known only at run-time. A dynamic binary translator is basically an interpreter equipped with a translated-code caching facility. Binary translation differs from simple emulation by eliminating the emulator’s main read-decode-execute loop (a major performance bottleneck), paying for this with a large overhead at translation time. This overhead is hopefully amortized as translated code sequences are executed multiple times, even if it cannot be fully eliminated.
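The idea of a dynamic translator as "an interpreter equipped with a translated-code cache" can be sketched on a toy guest instruction set. Everything here is hypothetical: the two guest opcodes (`add`, `mul`), the representation of basic blocks as Python lists, and the `Translator` class are chosen only to show how translation cost is paid once per block and then amortized.

```python
def translate(block):
    """Translate one guest basic block into a single host (Python) callable."""
    ops = []
    for opcode, operand in block:
        if opcode == "add":
            ops.append(lambda acc, n=operand: acc + n)
        elif opcode == "mul":
            ops.append(lambda acc, n=operand: acc * n)
        else:
            raise ValueError(f"unknown guest opcode: {opcode}")

    def host_code(acc):
        # The decoded operations run back to back, with no per-instruction decode.
        for op in ops:
            acc = op(acc)
        return acc

    return host_code


class Translator:
    """Runs guest blocks, translating each block at most once."""

    def __init__(self, program):
        self.program = program    # guest address -> basic block
        self.cache = {}           # guest address -> translated host code
        self.translations = 0     # how many times translation cost was paid

    def run(self, trace, acc=0):
        for address in trace:
            code = self.cache.get(address)
            if code is None:                      # cache miss: translate now
                code = translate(self.program[address])
                self.cache[address] = code
                self.translations += 1
            acc = code(acc)                       # cache hit: reuse translation
        return acc


guest = {0: [("add", 2)], 1: [("mul", 3)]}
translator = Translator(guest)
result = translator.run([0, 1, 0, 1])   # each block executes twice
```

Here the four block executions trigger only two translations, which is the amortization effect mentioned above.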
More advanced dynamic translators employ dynamic recompilation: the translated code is instrumented to find out which portions are executed a large number of times, and these portions are optimized aggressively [4]. As a particular case, binary code for a given architecture can be translated for the same architecture in order to dynamically detect guest OS machine instructions for I/O access, process scheduling and memory management. These instructions are dynamically translated to trigger a VMM call: this may happen completely in software or with the help of a hardware facility, such as the recently introduced Intel VT-x (a.k.a. Intel Vanderpool) [17] support or AMD SVM (a.k.a. AMD Pacifica), discussed in Appendix A. With respect to para-virtualization, binary translation enables the execution of unmodified executable code, for example a legacy guest OS such as Microsoft Windows. For this reason the majority of virtualization products currently available, such as QEMU [4], VMware [26] and Apple Rosetta [2], use the binary translation technique.

4.1 Para-virtualization

In a traditional VMM the virtual hardware exposed is functionally identical to the underlying machine. Although full virtualization has the obvious benefit of allowing unmodified operating systems to be hosted, it also has a number of drawbacks. This has been particularly true for the prevalent IA-32, or x86, architecture, at least up to the introduction of the VT extension [17]. These architectures exploit the hierarchical protection domains (or protection rings) mechanism to protect data and functionality from faults (fault tolerance) and malicious behavior. This approach is diametrically opposite to that of capability-based security [14]. A protection ring is one of two or more hierarchical levels or layers of privilege within the architecture of a computer system.
This is generally hardware-enforced by some CPU architectures, which provide different CPU modes at the firmware level. Rings are arranged in a hierarchy from most privileged (most trusted, usually numbered zero, or supervisor mode) to least privileged (least trusted, usually with the highest ring number). On most operating systems, Ring 0 is the level with the most privileges and interacts most directly with the physical hardware such as the CPU and memory. Special gates between rings are provided to allow an outer ring to access an inner ring’s resources in a predefined manner, as opposed to allowing arbitrary usage. Correctly gating access between rings can improve security by preventing programs from one ring or privilege level from misusing resources intended for programs in another.

[Figure 4: Xen architecture. Diagram labels: Dom0 (VM0) and DomU (VM1-VM3); device manager and control software; unmodified user software; para-virtualized guest OS (e.g. Linux); unmodified guest OS (e.g. WinXP) with QEMU binary translator; native, back-end and front-end device drivers; control interface, safe hardware interface, event channel, virtual CPU, virtual MMU; hypervisor (Xen Virtual Machine Monitor); hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE).]

To implement a VM system with the hierarchical protection model, certain supervisor instructions must be handled by the VMM for correct virtualization. Unfortunately, in pre-VT x86 architectures, executing these instructions with insufficient privilege fails silently rather than causing a convenient trap. Efficiently virtualizing the x86 MMU is also difficult. These problems can be solved by means of the mentioned binary translation technique, but only at the cost of increased complexity and reduced performance.
VirtuaLinux adopts Xen as VMM to provide the user with para-virtualized VMs. Xen is an open-source VMM and currently supports IA-32, x86-64, IA-64 and PowerPC architectures [3].

4.2 Xen Architecture

Xen is a free, open-source VMM for IA-32, x86-64, IA-64 and PowerPC architectures. It is para-virtualization software that runs on a host OS and allows one to run several guest OSes on top of the host on the same computer hardware at the same time. Modified versions of Linux and NetBSD can be used as hosts. Several modified Unix-like OSes may be employed as guest systems; on certain hardware (such as x86-VT, see Appendix A), unmodified versions of Microsoft Windows and other proprietary operating systems can also be used as guests (via the QEMU binary translator) [27].

In the Xen terminology, the VMM is called the hypervisor. A domain is a running VM within which a guest OS executes. Domain0 (Dom0) is the first domain, which is automatically started at boot time. Dom0 has permission to control all hardware on the system, and is used to manage the hypervisor and the other domains. All other domains are “unprivileged” (User Domains or DomU), i.e. domains with no special hardware access. The Xen 3.0 architecture is sketched in Fig. 4; we refer to the literature for an extensive description [3, 27].

5 VirtuaLinux Virtual Clustering

A Virtual Cluster (VC) is a cluster of VMs that can cooperate via a private virtual network, and that share the physical resources of the same physical cluster. Different VCs behave as if they were different physical clusters, i.e. they can cooperate with each other via standard networking mechanisms, but do not natively share any virtualized device or user authentication mechanism. VirtuaLinux natively supports the dynamic creation and management of VCs by means of the following layers:

• VM implementation, i.e. the layer implementing the single VM. VirtuaLinux currently adopts the Xen Virtual Machine Monitor ([3], see also Sec. 4.1).
The support of QEMU KVM kernel-based VMs is currently under evaluation for the next VirtuaLinux version [22].

• VM aggregation, i.e. the layer that aggregates many VMs in a VC, and dynamically creates and manages different VCs. This is realized via the VirtuaLinux Virtual Cluster Manager (VVCM), whose functionality is described in the rest of the section.

Overall, the VVCM enables the system administrator to dynamically create, destroy, suspend and resume from disk a number of VCs. The VCs are organized in a two-tier network: each node of a VC is connected to a private virtual network, and to the underlying physical network via a gateway node chosen in the VC. The gateway node of each VC is also reachable by the nodes of all VCs. The nodes of a VC are homogeneous in terms of virtualized resources (e.g. memory size, number of CPUs, private disk size, etc.) and OS. Different clusters may exploit different configurations of virtual resources and different OSes. Running VCs share the physical resources according to a creation-time mapping onto the physical cluster. VCs may be reallocated by means of the run-time migration of VMs between physical nodes (footnote 4).

Footnote 4: Migration of virtual nodes is currently an experimental feature, and it is not included in the VVCM front-end.

Each virtual node of a VC is (currently) implemented by a Xen virtual machine (VM) that is configured at VC creation time. Each virtual node includes:

• a virtual network interface with a private IP;

• a private virtual disk of configurable size;

• a private virtual swap area of configurable size;

• a VC-wide shared virtual storage.

The virtualization of devices is realized via the standard Xen virtualization mechanisms.

5.1 VC Networking

Xen supports VM networking via virtualized Ethernet interfaces. These interfaces can be connected to underlying physical network devices either via bridged or routed networking.
Bridging basically copies data traffic among bridged interfaces at the data link layer (OSI model layer 2), thus bypassing any explicit routing decisions at higher layers, such as the IP layer (whereas routing between MAC addresses can still be performed). On the contrary, routing takes place at OSI model layer 3, and thus is able to distinguish IP networks and to establish routing rules among them. On the one hand, bridging requires less setup complexity and connection tracking overhead as compared to the routing method. On the other hand, bridging impairs insulation among different networks on the same bridge, and it lacks flexibility since it can hardly be dynamically configured to reflect the dynamic creation and destruction of VC-private networks. For this reason, VirtuaLinux currently adopts routed networking.

VirtuaLinux sets up VC-private networks in a simple yet efficient manner: all nodes in the VC are assigned addresses from a private network chosen at creation time, and the VC does not share the same subnet as the physical cluster. The communications among physical and virtual clusters are therefore handled by setting up appropriate routing policies on each physical node, which acts as a router for all the VMs running on it. Routing policies are dynamically set up at the deployment time of the VM. These ensure that:

• All VMs of all VCs can be reached from all physical nodes of the cluster.

• Each VC can access the underlying physical network without any master gateway node. The Internet can be accessed through the physical cluster gateway (which is passively replicated).

• Therefore, all VMs are reachable from each other. Virtual nodes of a VC are simply VMs on the same virtual subnet. However, each virtual network is insulated from the others. In particular, from within a VC it is not possible to sniff the packets passing across the virtual networks of other VCs.
• The routing configuration is dynamic, and has a VC lifespan. The configuration is dynamically updated in the case virtual nodes are re-mapped onto the physical cluster (see also Sec. 5.4).

[Figure 5: VirtuaLinux virtualized storage architecture. Diagram labels: physical cluster nodes 1..n; VC "pink" and VC "green", each with VM1..VMn; container_Dom0, container_DomU, segment_Dom0, segment_DomU; EVMS; SAN.]

VirtuaLinux virtual networks rely on the TCP/IP protocol on top of the Ethernet and Infiniband networks. In the latter case, the IP layer is implemented via the IPoverIB kernel module. Currently no user-space drivers (Infiniband verbs) are available within VCs, whereas they are available within the privileged domain (Dom0) as “native drivers” (see Fig. 4). User-space drivers enable direct access to the Infiniband API from the user application, significantly boosting networking performance (see also the MPI performance measures in Sec. 6). To the best of our knowledge, a version of user-space Infiniband verbs working within a Xen VM is currently under development, and it will be integrated in VirtuaLinux as soon as it reaches “release candidate” quality [15].

5.2 VC Disk Virtualization

One of the most innovative features of VirtuaLinux concerns disk virtualization. Typically, VM-private disks are provided either via disk partitions or disk image files. The former method usually provides a speed edge, while the latter guarantees greater flexibility for the dynamic creation of VMs. However, neither of them is suitable for supporting dynamically created VCs. As a matter of fact, both methods require the whole root file system of the host OS to be replicated as many times as the number of nodes in the VC. This leads to very high data replication on the physical disk (for example, the creation of a 10-node VC may require at least 40 GBytes), a very long VC creation time, and significant additional pressure on the network and the external SAN.
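The replication cost can be made concrete with a back-of-the-envelope comparison between full copies and the snapshot approach of Sec. 2.3. The 4096 MB image size is inferred from the 10-node, 40 GBytes example above; the 256 MB of snapshot-private space per node is a purely hypothetical figure chosen for illustration.

```python
def full_copy_mb(nodes, image_mb):
    """Space needed when every virtual node holds a full copy of the root image."""
    return nodes * image_mb


def snapshot_mb(nodes, image_mb, private_mb):
    """Space needed with one shared image plus pre-allocated space per snapshot."""
    return image_mb + nodes * private_mb


NODES = 10
IMAGE_MB = 4096      # about 4 GBytes per root file system (40 GBytes / 10 nodes)
PRIVATE_MB = 256     # hypothetical pre-allocated space per snapshot

copies = full_copy_mb(NODES, IMAGE_MB)                  # 40960 MB
snapshots = snapshot_mb(NODES, IMAGE_MB, PRIVATE_MB)    # 6656 MB
```

Under these assumptions the snapshot layout grows by the private space only, so the gap widens as the number of virtual nodes increases.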
VirtuaLinux copes with these issues by means of the EVMS snapshotting technique described in Sec. 2.3. As sketched in Fig. 5, the VirtuaLinux physical domain storage architecture (see also Fig. 1) is dynamically replicated at VC creation time. All private disks of a VC are obtained as snapshots of a single image including the VC guest OS. As discussed in Sec. 2.3.1, this leads to a VC creation time that is independent of the number of nodes in the VC (in the range of seconds), and to all the benefits discussed in Sec. 2.3. Once created, EVMS volumes are dynamically mounted on physical nodes according to the virtual-to-physical mapping policy chosen for the given VC.

As for the physical cluster, each VC comes with its own VC-private shared storage, which relies on the OCFS2 distributed file system to arbitrate concurrent read and write accesses from virtual cluster nodes. However, since Xen does not currently enable the sharing of disks between VMs on the same physical node, the VC shared disk cannot be directly accessed from within virtual nodes. VirtuaLinux currently overcomes the problem by wrapping the shared storage with an NFS file system. At VC deployment time, each physical node involved in the deployment mounts the VC shared storage, which is in turn virtualized and made available to virtual nodes.

5.3 VC Mapping and Deployment

VirtuaLinux provides two strategies for the virtual-to-physical mapping of VMs:

• Block. The mapping aims to minimise the spread of VMs on the physical nodes. This is achieved by allocating on each physical node the maximum allowed number of VMs. For example, if the physical nodes are equipped with four cores, and the VC has been configured with a one-virtual-node-per-core constraint, a VC consisting of 4 uniprocessor virtual nodes will be mapped and deployed on a single physical node.

• Cyclic. The mapping tries to spread the cluster’s VMs across all the cluster’s physical nodes.
For example, a virtual cluster consisting of 4 uniprocessor virtual nodes will be deployed on four different physical nodes.

The two strategies discussed can be coupled with the following modifiers:

1. Strict. The deployment can be done only if there are enough free cores.

2. Free. The constraint between the number of VM processors and physical cores is not taken into account at all.

Notice that the mapping strategy of a VC can be changed after the first deployment, provided it is in the suspended state.

Label | Distribution | Configuration
Dom0_IB | Ubuntu | Dom0, Infiniband user-space verbs (MPI-gen2)
Dom0_IPoIB | Ubuntu | Dom0, Infiniband IPoverIB (MPI-TCP)
Dom0_Geth | Ubuntu | Dom0, Giga-Ethernet (MPI-TCP)
DomU_IPoIB | Ubuntu | DomU, virtual net on top of Infiniband IPoverIB (MPI-TCP)

Table 1: Intel MPI Benchmarks experiments legend.

5.4 VVCM: VirtuaLinux Virtual Cluster Manager

The VVCM consists of a collection of Python scripts to create and manage the VCs. These scripts use the following main components:

• A database used to store all information about physical and virtual nodes. The database also stores all information about the mapping between physical and virtual nodes, and the state of each virtual machine (whether it is running or stopped). The information is kept consistent across different cluster launches. With simple scripts it is possible to retrieve from the database the current cluster running status and the physical mapping.

• A command-line library for the creation, the activation and the destruction of the VCs. In the current implementation the library is composed of a set of Python scripts that implement three main commands:

1. VC Create, for the creation of the VC; it fills the database with VC-related static information.

2. VC Control, which is able to launch a previously created VC and deploy it on the physical cluster according to a mapping policy. It is also able to stop, suspend or resume a VC.

3. VC Destroy, which purges a VC from the system; it performs the clean-up of the database.
• A communication layer used for the staging and the execution of the VMs. The current implementation is built on top of the Secure Shell support (ssh).

• A virtual cluster start-time support able to dynamically configure the network topology and the routing policies on the physical nodes. The VC Control command relies on this feature for VC start-up and shutdown.

6 Experiments

The implementation of VirtuaLinux has been recently completed and tested. We present here some preliminary experiments. The experimental platform consists of a 4U-case Eurotech cluster hosting 4 high-density blades, each of them equipped with two dual-core AMD Opteron processors at 2.2 GHz and 8 GBytes of memory. Each blade has 2 Giga-Ethernet NICs and one 10 Gbits/s Infiniband NIC (Mellanox InfiniBand HCA). The blades are connected with an unmanaged Infiniband switch. Experimental data has been collected on two installations of VirtuaLinux based on two different base Linux distributions:

• Ubuntu Edgy 6.10, Xen 3.0.1 VMM, Linux kernel 2.6.16, Dom0 (Ub-Dom0) and DomU (Ub-DomU);

• CentOS 4.4, no VMM, Linux kernel 2.6.9 (CentOS).
Micro-benchmark | Unit | Ub-Dom0 | Ub-DomU | CentOS
Simple syscall | usec | 0.6305 | 0.6789 | 0.0822
Simple open/close | usec | 5.0326 | 4.9424 | 3.7018
Select on 500 tcp fd’s | usec | 37.0604 | 37.0811 | 75.5373
Signal handler overhead | usec | 2.5141 | 2.6822 | 1.1841
Protection fault | usec | 1.0880 | 1.2352 | 0.3145
Pipe latency | usec | 20.5622 | 12.5365 | 9.5663
Process fork+execve | usec | 1211.4000 | 1092.2000 | 498.6364
float mul | nsec | 1.8400 | 1.8400 | 1.8200
float div | nsec | 8.0200 | 8.0300 | 9.6100
double mul | nsec | 1.8400 | 1.8400 | 1.8300
double div | nsec | 9.8800 | 9.8800 | 11.3300
RPC/udp latency localhost | usec | 43.5155 | 29.9752 | 32.1614
RPC/tcp latency localhost | usec | 55.0066 | 38.7324 | 40.8672
TCP/IP conn. to localhost | usec | 73.7297 | 57.5417 | 55.9775
Pipe bandwidth | MB/s | 592.3300 | 1448.7300 | 956.21

Micro-benchmark | Ub-Dom0 vs CentOS | Ub-DomU vs CentOS | Ub-DomU vs Ub-Dom0
Simple syscall | +667% | +726% | +7%
Simple open/close | +36% | +34% | -2%
Select on 500 tcp fd’s | +51% | +51% | 0%
Signal handler overhead | +112% | +127% | +7%
Protection fault | +246% | +293% | +13%
Pipe latency | +115% | +31% | -40%
Process fork+execve | +143% | +119% | -10%
float mul | ∼0% | ∼0% | ∼0%
float div | ∼0% | ∼0% | ∼0%
double mul | ∼0% | ∼0% | ∼0%
double div | ∼0% | ∼0% | ∼0%
RPC/udp latency localhost | +35% | -7% | -31%
RPC/tcp latency localhost | +35% | -5% | -30%
TCP/IP conn. to localhost | +32% | +3% | -22%
Pipe bandwidth | -38% | +51% | +144%

Table 2: VirtuaLinux: evaluation of ISA and OS performances with the LMbench micro-benchmark toolkit [16], and their differences (percentage).
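The percentage columns can be related to the raw measurements with a one-line formula. The sketch below assumes that the relative difference is computed as (measured - baseline) / baseline and rounded to the nearest integer; the function name and this convention are assumptions, checked here only against a few rows of Table 2.

```python
def overhead_percent(measured, baseline):
    """Relative difference of `measured` with respect to `baseline`, in percent."""
    return round((measured - baseline) / baseline * 100)


# (Ub-Dom0, Ub-DomU, CentOS) values copied from Table 2.
SYSCALL_USEC = (0.6305, 0.6789, 0.0822)
FORK_EXECVE_USEC = (1211.4, 1092.2, 498.6364)
PIPE_BANDWIDTH_MBS = (592.33, 1448.73, 956.21)

dom0, domu, centos = SYSCALL_USEC
syscall_dom0_vs_centos = overhead_percent(dom0, centos)
syscall_domu_vs_centos = overhead_percent(domu, centos)

# For bandwidth a negative value is a loss, since higher is better.
bandwidth_dom0_vs_centos = overhead_percent(PIPE_BANDWIDTH_MBS[0], PIPE_BANDWIDTH_MBS[2])
```

With the syscall row this gives +667% and +726%, matching the first line of the second table.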
[Figure 6 plots: latency (usec) of the PingPong, PingPing and Sendrecv benchmarks on 2 nodes, and of Barrier on 2 and 4 nodes, for the Dom0_IB, Dom0_IPoIB, Dom0_GEth and DomU_IPoIB configurations.]

Figure 6: VirtuaLinux: evaluation of network communication latency with the Intel MPI Benchmarks [12].

[Figure 7 plots: bandwidth (MBytes/s) versus Sendrecv size (Bytes, from 1 to 4M) on 2 nodes and on 4 nodes, for the Dom0_IB, Dom0_IPoIB, Dom0_GEth and DomU_IPoIB configurations.]

Figure 7: VirtuaLinux: evaluation of network bandwidth with the Intel MPI Benchmarks [12].

[Figure 8 plots: average time (sec) of the Bcast, Alltoall, Allgather, Allreduce, Reduce and Reduce_scatter benchmarks on 2 nodes and on 4 nodes, for the Dom0_IB, Dom0_IPoIB, Dom0_GEth and DomU_IPoIB configurations.]

Figure 8: VirtuaLinux: evaluation of collective communication performance with the Intel MPI Benchmarks [12].

The experiments reported here focus mainly on evaluating the virtualization overhead with respect to two main aspects: guest OS primitives and networking (either virtualized or not). To this end, two sets of micro-benchmarks have been used: the LMbench benchmark suite [16], which has been used to evaluate OS performance, and the Intel MPI Benchmarks [12] with the MVAPICH MPI toolkit (mvapich2-0.9.8) [18], which has been used to evaluate networking performance. Results of the LMbench suite are reported in Table 2.
As expected, the virtualization of system calls has a non-negligible cost: within both the privileged domain (Ub-Dom0) and the user domain (Ub-DomU), a simple syscall pays a considerable overhead (∼ +700%) with respect to the non-virtualized OS (CentOS) on the same hardware, while the difference between the privileged and the user domain is negligible. Other typical OS operations, such as fork+execve, exhibit a more limited slowdown due to virtualization (∼ +120%). However, as expected in a para-virtualized system, processor instructions exhibit almost no slowdown (footnote 5). Overall, the OS virtualization overhead is likely to be amortized to a large extent in real business code. The evaluation of this extent for several classes of applications is currently ongoing.

The second class of experiments relates to networking. Fig. 6, Fig. 7, and Fig. 8 report an evaluation of the network latency, bandwidth and collective communications, respectively (the legend is reported in Table 1). The experiments highlight that the only configuration able to exploit the Infiniband potential is the one using user-space Infiniband verbs (that are native drivers, see also Fig. 4). In this case, the measured figures are in line with the state-of-the-art performance reported in the literature (and with the CentOS installation, not reported here). Since native drivers bypass the VMM, virtualization introduces no overhead. As mentioned in Sec. 5.1, these drivers cannot currently be used within the VM (DomU), nor can they be used to deploy standard Linux services, which are based on the TCP/IP protocol. To this end, VirtuaLinux provides the TCP/IP stack on top of the Infiniband network (via the IPoverIB, or IPoIB, kernel module). Experiments show that this additional layer is a major source of overhead (irrespective of the virtualization layer): the TCP/IP stack on top of the 10 Gigabit Infiniband (Dom0_IPoIB) behaves like a 2 Gigabit network.
The performance of a standard Gigabit network is given as a reference testbed (Dom0_GEth). Network performance is further slowed down by user domain driver decoupling, which requires data copies between front-end and back-end network drivers (see also Fig. 4). As a result, as shown by the DomU_IPoIB figures, VC virtual networks on top of a 10 Gigabit network exhibit Gigabit-Ethernet-like performance.

Extensive evaluation of the disk abstraction layer performance is currently ongoing. Preliminary results show that VirtuaLinux succeeds in drawing about 75% of the raw throughput from an external SAN.

Footnote 5: A subset of the benchmarks has been reported here. Excluded tests are coherent with the ones reported in the table.

7 Conclusions

We have presented a coupled hardware-software approach based on open source software aiming at the following goals:

1. Avoiding fragility due to the presence of disks on the blades by removing disks from the blades (high-density disk-less cluster) and replacing them with a set of storage volumes. These are abstract disks implemented via an external SAN, which is accessed by the nodes of the cluster through the iSCSI protocol, and is abstracted out through EVMS, which enables the flexible and dynamic partitioning of the SAN.

2. Avoiding single points of failure by removing the master from the cluster. Master node features, i.e. the set of services implemented by the master node, are categorized and made redundant by either active or passive replication in such a way that they are, at each moment, cooperatively implemented by the running nodes.

3. Improving management flexibility and configuration error resilience by means of transparent node virtualization. A physical cluster may support one or more virtual clusters (i.e. clusters of virtual nodes) that can be independently managed, with no impact on the underlying physical cluster configuration and stability.
Virtual clusters run a guest OS (either a flavor of Linux or Microsoft Windows) that may differ from the host OS, which governs physical cluster activities. Xen and QEMU are used to provide the user with both cluster para-virtualization and hardware-assisted binary translation.

These goals are achieved independently through solutions that have been designed to be coupled, and can thus be selectively adopted. A suite of tools, called VirtuaLinux, enables the boot, the installation, the configuration and the maintenance of a cluster exhibiting the previously described features. VirtuaLinux is currently targeted to AMD/Intel x86_64-based nodes, and includes:
• Several Linux distributions, currently Ubuntu Edgy 6.10 and CentOS 4.4.
• An install facility able to install and configure the included distributions according to goals 1-3, and easily extensible to other Linux distributions and versions.
• A recovery facility able to reset a misconfigured node.
• A toolkit to manage virtual clusters (VVCM) and one or more pre-configured virtual cluster images (currently Ubuntu Edgy 6.10 and CentOS 4.4).

VirtuaLinux is open source software under GPL. It is already tested and available on Eurotech HPC platforms. In this regard, the Eurotech laboratory experienced a tenfold drop in cluster installation and configuration time. To the best of our knowledge, few existing OS distributions achieve the described goals, and none achieves all of them. VirtuaLinux introduces a novel disk abstraction layer, which is the cornerstone of several VirtuaLinux features, such as the time- and space-efficient implementation of virtual clustering. Preliminary experiments show that VirtuaLinux exhibits a reasonable efficiency, which will naturally grow with virtualization technology evolution.

Acknowledgements and Credits

VirtuaLinux has been developed at the HPC laboratory of the Computer Science Department of the University of Pisa and Eurotech HPC, a division of Eurotech Group.
The VirtuaLinux project has been supported by the initiatives of the LITBIO Consortium, funded within the FIRB 2003 grant by MIUR, Italy. VirtuaLinux is open source software under GPL, available at http://virtualinux.sourceforge.net/. We are grateful to Gianmarco Spinatelli and Francesco Polzella, who contributed to VirtuaLinux development, and to Peter Kilpatrick for his help in improving the presentation.

A VT-x: Enabling Virtualization in the x86 ISA

Virtualization was once confined to specialized, proprietary, high-end server and mainframe systems. It is now becoming more broadly available and is supported in off-the-shelf systems. Unfortunately, the most popular recent processor architectures, e.g. the x86 Intel Architecture prior to the VT extension, were not designed to support virtualization [17]. In particular, the x86 processor exhibits a peculiar implementation of processor protection domains (rings). Conceptually, rings are a way to divide a system into privilege levels, so that the OS can run at a level that a user's program cannot modify. This way, if a user program goes wild, it will not crash the system, and the OS can take control, shutting down the offending program cleanly: rings enforce control over various parts of the system. There are four rings in the x86 architecture: 0, 1, 2, and 3, with the lower numbers being higher privilege. A simple way to think about it is that a program running at a given ring cannot change things running at a lower numbered ring, but something running at a low ring can interfere with a higher numbered ring (footnote 6).

In practice, only rings 0 and 3, the highest and lowest privilege levels, are commonly used. OSes typically run in ring 0 while user programs are in ring 3. One of the ways the 64-bit extensions to x86 "clean up" the Instruction Set Architecture (ISA) is by losing the middle rings, i.e. 1 and 2. Mostly no one cares that they are gone, except the virtualization people.
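The ring rule described in this appendix can be captured in a few lines. The following is a toy model only, not a description of x86 semantics: the names Ring and can_modify are hypothetical, and allowing modification within the same ring is an assumption, since the text only discusses lower and higher numbered rings.

```python
from enum import IntEnum


class Ring(IntEnum):
    """x86 protection rings: a lower number means a higher privilege."""
    RING0 = 0  # where OSes typically run
    RING1 = 1
    RING2 = 2
    RING3 = 3  # where user programs typically run


def can_modify(actor: Ring, target: Ring) -> bool:
    """Return True if code running in `actor` may change state owned by `target`.

    Toy rule: code cannot change things in a lower numbered ring.
    Same-ring modification is allowed (an assumption of this sketch).
    """
    return actor <= target


# A user program (ring 3) cannot tamper with the OS (ring 0) ...
print(can_modify(Ring.RING3, Ring.RING0))
# ... while the OS can interfere with a user program.
print(can_modify(Ring.RING0, Ring.RING3))
```

In this model, software moved to ring 1 would still control ring 3 but could no longer touch anything in ring 0, which is why the middle rings matter to virtualization.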
VMMs obviously have to run in ring 0, but if they want to maintain complete control, they need to keep the guest OS (i.e. the OS running on top of the VM) out of ring 0. If a runaway task can overwrite the VMM, it negates the reason protection rings exist. The obvious solution is to force the guest OS to run in a higher ring, like ring 1. This would be fine except that OSes are used to running in ring 0 and having complete control of the system. They are set to go from 0 to 3, not 1 to 3. In a para-virtualized environment, the OS has been changed to operate correctly in this setting. This solution is clearly not transparent with respect to guest OSes.

The problem here is that some x86 instructions will only work if they are going to or from ring 0, and others will behave oddly if not in the right ring. These odd behaviours mainly spring from an extremely weak design of the x86 ISA. We mention some of them for the sake of exemplification, and refer the reader to the literature for full details [17]. One of the most curious is that some privileged instructions do not fault when executed outside ring 0 (they silently fail). Therefore, the classical approach of trapping the call issued by the guest OS, and serving it within the VMM, simply will not work.

VT-x augments the x86 ISA with two new forms of CPU operation: VMX root operation and VMX non-root operation. VMX root operation is intended for use by a VMM, and its behavior is very similar to that of the x86 ISA without VT-x. VMX non-root operation provides an alternative x86 ISA environment controlled by a VMM and designed to support a VM. Both forms of operation support all four privilege levels, allowing guest software to run at its intended privilege level, and providing a VMM with the flexibility to use multiple privilege levels.

Footnote 6: The underlying assumption here (not fully sound nor legitimate) is that code running in a low ring, such as the OS, will behave correctly.
VT-x defines two new transitions: a transition from VMX root operation to VMX non-root operation is called a VM entry, and a transition from VMX non-root operation to VMX root operation is called a VM exit. VM entries and VM exits are managed by a new data structure called the virtual-machine control structure (VMCS). The VMCS includes a guest-state area and a host-state area, each of which contains fields corresponding to different components of processor state. VM entries load processor state from the guest-state area. VM exits save processor state to the guest-state area and then load processor state from the host-state area.

Processor operation is changed substantially in VMX non-root operation. The most important change is that many instructions and events cause VM exits. Some instructions (e.g., INVD) cause VM exits unconditionally and thus can never be executed in VMX non-root operation. Other instructions (e.g., INVLPG) and all events can be configured to do so conditionally using VM-execution control fields in the VMCS [17].

B Eurotech HPC Solutions

Highly scalable, Eurotech HPC solutions provide turn-key systems for a large number of applications. The T-Racx technology provides more than 2 TFlops per rack using the latest Intel Xeon and AMD Opteron processors. Each T-Racx model integrates Infiniband connectivity to provide, in a single infrastructure, very low latency communications and large bandwidth with direct IB storage access. T-Racx solutions integrate an elegant Italian design by Linea Guida with powerful air cooling and a power distribution system that enable even the most dense and demanding installations.

C VirtuaLinux-1.0.6 User Manual

This section focuses on the basic installation and configuration of the VirtuaLinux version 1.0.6 meta-distribution of the Linux operating system. It is targeted to system administrators familiar with cluster computing and moderately familiar with administering and managing servers, storage, and networks.
General concepts and design of VirtuaLinux are covered in the previous sections, while advanced configuration is covered in the software man pages and info files. VirtuaLinux has been extensively tested on the Eurotech clusters described in Sec. 6; however, VirtuaLinux can be installed on any homogeneous cluster equipped with an external SAN that exports an iSCSI disk. Notice that the iSCSI disk setup procedure is not part of VirtuaLinux.

C.1 VirtuaLinux Pre-requisites

The VirtuaLinux DVD enables the seamless installation, configuration and recovery of a Linux-based fault-tolerant cluster platform. Target platforms are disk-less clusters equipped with an external SAN meeting the following requirements:
• Nodes (blades, boards) of the cluster should be networked with at least one Ethernet switch (boot network). Ethernet and Infiniband are supported as additional networks.
• Nodes should be homogeneous and based on either AMD or Intel x86 architecture.
• The external SAN should provide iSCSI connectivity, on either the Ethernet or the Infiniband NIC, and expose a known static IP. The SAN is supposed to autonomously exploit fail-safe features. All disks other than the SAN (e.g. blades' onboard disks) are ignored during the installation process.
• The cluster should be equipped with an internal or external USB DVD reader.
• Eurotech clusters are the only tested hardware.

The VirtuaLinux DVD installs a homogeneous software configuration on all nodes of the cluster (master-less cluster). All installed services are made fault-tolerant via either active or passive replication techniques. Therefore, once installed, the cluster does not exhibit single points of failure due to node hardware or software crashes and malfunctions. The VirtuaLinux DVD provides the user with three facilities:
1. Cluster boot.
2. Cluster set up (configuration and installation).
3. Cluster or node restore.
Name           Meaning [default value]
storage_ip     IP address of the external SAN [192.168.0.252]              required
storage_port   port assigned to the iSCSI service on storage_ip [3260]     required
sif            network interface for the external SAN [eth0 | ib0]         required
sifip          IP address to be assigned to sif [192.168.0.254]            optional
sifnm          netmask to be assigned to sif [255.255.255.0]               optional
sifm           kernel modules needed to bring up sif [tg3 | ib_xxx]        optional
ivid           iSCSI exported disk name [ ]                                optional
root           root device [/dev/evms/defaultcontainername]                required

Table 3: Boot parameters description.

These features rely on a highly cooperative behavior of the nodes, starting from the early stages of the first boot. The cooperative behavior is meant to eliminate single points of failure during both cluster set up and ordinary work.

C.2 Physical cluster

C.2.1 Cluster Boot

The boot feature enables a node of the cluster to boot from the VirtuaLinux DVD when no other node is already running. This feature should be used to start the boot sequence, since the cluster cannot boot via iSCSI from an external SAN. Assuming the cluster is switched off, proceed as follows:
1. Connect a DVD reader to one of the nodes of the cluster and load the reader with the VirtuaLinux DVD.
2. Switch on the cluster main power supply, any external network switch, and the external SAN (which should be connected to either the Ethernet or the Infiniband switch of the cluster).
3. Switch on the node that is connected to the DVD reader. The node BIOS should be configured to boot from the DVD reader.
4. The selected node will boot from the VirtuaLinux DVD. A menu will appear on the screen: select Boot Eurotech Cluster, check that all boot parameters are properly configured, then press Enter. Boot parameters depend on how the external SAN is connected to the cluster, and they can be changed by pressing the F6 key. The meaning of the parameters is described in Table 3.
5. After a little while a debug console will appear.
Then all the other nodes can simply be switched on (in any order). As soon as at least one other node has completed the boot, the DVD reader should be detached from the cluster and the first node can be rebooted (by typing Ctrl-D on the debug console).

param_list ::= parameter=value param_list | sifm=value_list param_list | parameter=value ---
parameter  ::= storage_ip | storage_port | sif | sifip | sifnm | ivid | root
value_list ::= value,value_list | value
value      ::= ⟨ASCII string⟩

Table 4: Boot screen: command line syntax.

The command line syntax (in Extended Backus-Naur form) is shown in Table 4. If the same parameter appears twice or more, only the first occurrence is taken into account. Two use cases are given in the following, exemplifying the boot with Ethernet and Infiniband storage, respectively.

Ethernet
Let us suppose the external SAN is connected to the cluster via an Ethernet network, implemented on the cluster end through an e1000 board, while the SAN iSCSI address is 192.168.0.200:3260, and during the install procedure the name foo has been chosen as the EVMS root segment name. In this case, the parameters should be adjusted as follows:

... root=/dev/evms/defaultfoo storage_ip=192.168.0.200 sifm=e1000 ... ---

Notice that only parameters that differ from their default value are shown here. The pre-defined command line includes commonly used parameters at their default values, which can be changed by the user, but not deleted. The command line should always be terminated by three hyphens (---).

Infiniband
Let us suppose the external SAN is connected to the cluster via an Infiniband network by means of the IPoverIB protocol, while the SAN iSCSI address is 192.168.1.2:3260 with netmask 255.255.0.0, and during the install procedure the name bar has been chosen as the EVMS root segment name. In this case, the parameters should be adjusted as follows:

...
root=/dev/evms/defaultbar sif=ib0 sifnm=255.255.0.0 storage_ip=192.168.1.2 ---

C.2.2 Cluster Setup (Configuration and Installation)

The cluster set up feature enables the user to configure VirtuaLinux for a given cluster and install it on an external SAN. VirtuaLinux essentially consists of a Linux OS tailored and pre-configured for fault-tolerant cluster platforms. Currently, the supported base Linux distributions are Ubuntu 6.10 and CentOS 4.4. The user is asked to choose which base distribution to install on all nodes of the cluster. The main features of the base distributions are:

Ubuntu Edgy 6.10
The Ubuntu Edgy distribution is based on kernel 2.6.16-xen. It supports OS para-virtualization of Xen 3.0.1 virtual machines. In addition to common services (described in the next sections), the distribution is ready to launch Xen virtual machines. It also provides the system administrator with a number of tools for the management of virtual clusters, which in turn can be installed with other operating systems, such as Linux-Ubuntu and Linux-CentOS (and Microsoft Windows XP on some cluster platforms). Virtual clustering and its management are described later in this guide.

CentOS 4.4
The CentOS distribution is based on kernel 2.6.9. It does not support OS para-virtualization.

Common services
Several services are pre-installed on both flavors of VirtuaLinux, among others Gateway, NTP, DHCP, TFTP, SSH, LDAP, NSCD, iSCSI, and the OpenFabrics network manager.

Configuration procedure
Assuming the cluster is switched off, proceed as follows:
1. Connect a DVD reader to one of the nodes of the cluster and load the reader with the VirtuaLinux DVD. The chosen node will become the primary gateway node to the external network.
2. Switch on the cluster main power supply, any external network switch, and the external SAN (which should be connected to either the Ethernet or the Infiniband switch of the cluster).
3. Switch on the node that is connected to the DVD reader.
The node BIOS should be configured to boot from the DVD reader.
4. The selected node will boot from the DVD. A menu will appear on the screen: select Install Eurotech Cluster.
5. A live version of VirtuaLinux will come up on the node (it requires several minutes); a Gnome-based GUI will start automatically.
6. Launch the install tool by double-clicking on its icon on the Gnome desktop. A number of questions should be answered to proceed:
• Distribution:
  Base distribution  Either Ubuntu or CentOS.
• External SAN configuration:
  IP  It should preferably be kept in the range xxx.yyy.zzz.[250-252] in order to avoid possible conflicts with the IPs of cluster nodes.
  Netmask [255.255.255.0]
  Port [3260]
  Meta-master-IP [xxx.yyy.zzz.253]  Ensure there are no conflicts with the storage IP before changing it.
  Connection-kind  Either Ethernet or Infiniband. Asked only if both networks are plugged.
• Cluster HW configuration:
  Number-of-nodes
• Target storage. The iSCSI SAN is activated and scanned:
  iSCSI-target  All disks exported by the iSCSI SAN target are listed. One of them should be selected for the installation.
• Target storage virtualization. The iSCSI target is managed via EVMS (Enterprise Volume Management System), which makes it possible to expose to the cluster a number of independent volumes, i.e. logical partitions of one or more iSCSI-connected SANs. The parameters related to storage virtual organization are the following:
  volume-size [5120]  Size in MBytes of each node volume.
  swap-size [2048]  Size in MBytes of each node swap volume.
  shared-size [10240]  Size in MBytes of the volume shared among nodes of the cluster.
  free-space-allocation  The free space segments on the iSCSI-target disks are listed. One of them should be selected to hold a container (note that it is possible to build a container over the concatenation of several free space segments, even if this feature is not currently supported).
  container-name [ubuntu/centos]  Different installations on the same SAN should have different container names.
  shared-fs-id [sharedfs]  Identifier of the shared file system. This is an advanced configuration switch; it is safe to leave it at the default value.
  The volumes will be formatted with the ext3 file system. The shared volume will be formatted with OCFS2 (Oracle Cluster File System) in the case of Ubuntu, or with GFS (Global File System) in the case of CentOS. Both OCFS2 and GFS support concurrent accesses and distributed locks. GFS does not support memory-mapped I/O.
• Gateway configuration. One of the nodes of the cluster may act as gateway between the cluster local networks and the external network (e.g. Internet). Currently, the external network is supposed to be an Ethernet. The gateway will be configured in primary-backup fail-over mode with a backup node; thus, if cluster-name1 stops working for any reason, cluster-name2 will inherit the gateway functionality (cluster routing tables are automatically managed to achieve fail-over transparency). The mechanism is implemented through heartbeat (Linux HA), and needs a spare IP address on the local network. The parameters related to gateway configuration are the following:
  primary-gateway-local-IP [xxx.yyy.zzz.1]  IP of the primary gateway node in the local network.
  backup-gateway-local-IP [xxx.yyy.zzz.2]  IP of the backup gateway node in the local network.
  spare-local-IP [xxx.yyy.zzz.253]  Spare IP in the local network.
  public-IP [XXX.YYY.ZZZ.WWW]  IP assigned to the gateway node in the external network.
  public-netmask [255.255.255.0]  External network netmask.
  public-router [aaa.bbb.ccc.ddd]  External network router IP.
  public-DNS [AAA.BBB.CCC.DDD]  External network DNS.
  domain [eurotech.com]  Default domain name.
  At this point, an automatic procedure to recognize the interfaces for the external and boot networks is started. The procedure will ask the user to approve the detected interfaces.
  In order to detect the boot interface, the procedure will ask the user to switch on a random node of the cluster.
  confirm-ext-if [YES/no]  External network interface.
  confirm-boot-if [YES/no]  Boot network interface.
• Network configuration. If the network used for the storage is also used for the network boot (which should be an Ethernet), the following parameter is required:
  first-node-IP-netboot-netstorage [calculated address]
  Otherwise, if two different networks are used for the storage and the network boot (e.g. Infiniband and Ethernet, or two different Ethernets), the following parameters are required:
  first-node-IP-netboot [calculated address]
  netmask-netboot [calculated netmask]
  first-node-IP-netstorage [calculated address]
  The calculated default values represent a safe choice.
• Nodes configuration.
  cluster-hostname  The string will be used to assign node hostnames; nodes will be called cluster-hostnameX, where X is the node ordinal number collected during MAC address collection.
• Gathering of MAC addresses.
  MAC-gather-procedure [1 automatic | 2 manual | 3 from-file]
  1 automatic  The procedure to gather node MAC addresses will start automatically. The user is required to follow the instructions on the installation console, which consist in repeating the following procedure for each node of the cluster (except the one where the installation procedure is currently running):
    for each node
      Switch the node on;
      Wait until the node is sensed by the meta-master (20-40 secs);
      Switch the node off;
  2 manual  MAC addresses of nodes are iteratively queried from the user console.
  3 from-file  Not yet implemented.
7. The data gathering is almost completed. The installation procedure proceeds to prepare the external storage (EVMS volume creation and formatting) and to copy all needed data onto the proper volumes. This operation may require several minutes. A number of configuration files will be generated, and the configuration data will be stored in the /etc/install/INSTALL.info file.
8.
As the last step, the installation procedure asks for a password for the cluster root user, which will trigger the last configuration step of the cluster.
9. After the completion of the installation, the node used for the installation should be rebooted. Notice that the DVD reader should not be disconnected, since it is required to boot the cluster. The boot procedure is described in the Cluster Boot section of this guide. Notice that some of the data entered during the install procedure are required to boot the cluster.

C.2.3 Cluster or Node Restore

Cluster set up consists of two successive steps: cluster configuration and installation. After the cluster set up, configuration data (collected during cluster set up) is saved in a predefined file (/etc/install/INSTALL.info). The cluster restore feature makes it possible to reset either the whole cluster or a single node to the original status (set up time status) by using the configuration data. The feature helps in solving two typical problems of cluster management:
• Reset. A cluster can be fully reset to its original status (at the time of installation).
• Restore. The configuration or the root file system of some nodes of the cluster has been corrupted due to administration mistakes.

The file INSTALL.info produced during installation should be available in a known path of the file system. In the case of a full reset of the cluster (command 2), the boot from the VirtuaLinux DVD is required. The node reset can be performed by booting from the DVD or from another running node in the same cluster. Restore options are:
-f path  Path of the file INSTALL.info (compulsory argument).
-w path  Path of a temporary directory [/tmp].
-i  Toggle interactive mode.
-d n  Where n is the command to be executed:
  1  Rebuild configuration files of all nodes. Disks are not formatted; software installed on nodes after the installation will be preserved.
  2  Fully reset all nodes. Disks are reset to installation time status; software installed on nodes after the installation will be lost.
  3  Reset a single node (to be used with -n num).
-s  Toggle formatting of the shared volume.
-F  Force the reallocation of EVMS segment.
-n num  Node number selected for the restore (to be used with the option -d 3).
-v level  Verbosity level (1=low, 2=high).

Examples:

Reset all nodes:
root@node1:# cd /etc/install/scripts
root@node1:# ./restore.sh -f /etc/install -d 2

Reset node 3:
root@node1:# cd /etc/install/scripts
root@node1:# ./restore.sh -f /etc/install -d 3 -n 3

C.3 Cluster Virtualization

The DomU installation procedure is still under refinement. This documentation reports the installation steps required by Eurotech HPC version 1.0.6. These steps will be further simplified in the next Eurotech HPC version. The VC management system requires a setup step, which should be performed from the live DVD. The best way to install the cluster virtualization software is to perform these steps at cluster installation time, just after the VirtuaLinux installation (before booting the cluster).

C.3.1 Installation

Run as root user the script setup.sh in the /install_packages/VirtuaLinux/install_domU directory to set up the VC management system. Note that all the scripts located in the directory mentioned above must be executed as root user from the directory itself. The setup.sh script needs two input parameters: a node number and a port. They refer to the postgres database service that will be used to store VC information. For example, to use the second node as database server on port 5432, start the setup.sh script as follows:

root@node1:/shared/VC/domU_scripts# ./setup.sh -n 2 -p 5432

On termination, the following tasks should have been accomplished:
• initialization of the directories where VC configuration files are stored (/shared/VC).
• installation of the python objects needed to get the system working.
• initialization of the database.
At the end of the DomU installation, the /shared/VC/domU_scripts directory contains the scripts used to manage VCs, and the /shared/VC/os_images_repository directory contains the DomU images in tgz format.

C.3.2 Tools

VCs can be managed with the tools located in /shared/VC/domU_scripts. These tools are:

VC_Create  Enables the user to create a new VC and deploy it on the physical cluster. The deployment is accomplished according to the selected allocation policies. If the nodes of the cluster are equipped with both an Infiniband network and an Ethernet, the Infiniband network address should be used.
VC_Control {start | stop | save | restore}  Start/stop/save/restore a VC.
VC_Destroy  Definitively destroy the VC.

VC_Create
Create a new VC in an interactive manner.

root@node1:/shared/VC/domU_scripts# ./VC_Create.py -b 192.168.10.2 -p 5432

The script starts by querying the user about the VC characteristics. The options of the command are:
-b db_IP
-p db_port
Note that if an Infiniband network is present and configured, the Infiniband IP has to be used in the -b option. The complete list of options is available via the online help.

VC_Control
Once a VC has been created by VC_Create, use VC_Control to perform the basic actions on it. This script takes as input a list of options, a VC name and an action chosen among:
start  Start the specified VC. The VMs are created on the physical nodes according to the mapping defined during the deployment process.
stop  Stop the specified VC.
save  The state of each virtual machine of the cluster is saved onto secondary storage.
restore  The state of the cluster is restored from an image previously created by a save operation.

In the VC_Create command you have to choose the deployment strategy of the virtual nodes on the physical ones. In particular, it is possible to deploy the VC according to two different strategies:
block  Mapping aims to minimise the spread of VMs on the physical nodes.
This is achieved by allocating the maximum allowed number of VMs on each physical node. For example, if the physical nodes are equipped with four cores and the VC has been configured with the one-virtual-node-per-core constraint, a VC consisting of 4 uniprocessor virtual nodes will be mapped and deployed on a single physical node.
cyclic Mapping tries to spread the cluster’s VMs across all the physical nodes of the cluster. For example, a virtual cluster consisting of 4 uniprocessor virtual nodes will be deployed on four different physical nodes.

Each of the two strategies can behave in two slightly different ways:
strict The deployment is performed only if there are enough free cores.
free The constraint between the number of VM processors and physical cores is not taken into account at all.

Example:

root@node1:/shared/VC/domU_scripts# ./VC_Control.py -b 192.168.10.2 -p 5432 -t cyclic -m strict virtualCluster1 start

The command in the above example deploys the VC named “virtualCluster1” using the cyclic policy and the strict constraint. The option -l returns a list of the available clusters. The complete list of options is available via the on-line help.

VC_Destroy

Permanently destroys a VC, including its configuration files and disk partitions. It takes as input a list of options and a virtual cluster name. Example:

root@node1:/shared/VC/domU_scripts# ./VC_Destroy.py -b 192.168.10.2 -p 5432 virtualCluster1

The complete list of options is available via the on-line help.

References

[1] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. In ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pages 2–13, New York, NY, USA, 2006. ACM Press.
[2] Apple Inc. Universal Binary Programming Guidelines, Second Edition, Jan. 2007. http://developer.apple.com/documentation/MacOSX/Conceptual/universal_binary/universal_binary.pdf.
[3] P.
Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proc. of the 19th ACM Symposium on Operating Systems Principles (SOSP’03), pages 164–177. ACM Press, 2003.
[4] F. Bellard. QEMU, a fast and portable dynamic translator. In USENIX 2005 Annual Technical Conference, FREENIX Track, Anaheim, CA, Apr. 2005.
[5] Eurotech HPC, a division of Eurotech S.p.A. System and components for pervasive and high performance computing, 2007. http://www.exadron.com.
[6] EVMS website. Enterprise Volume Management System, 2007. http://evms.sourceforge.net/.
[7] R. P. Goldberg. Survey of virtual machine research. Computer, pages 34–45, June 1974.
[8] IBM. iBoot - Remote Boot over iSCSI. IBM, 2007. http://www.haifa.il.ibm.com/projects/storage/iboot/index.html.
[9] IBM. Understanding and exploiting snapshot technology for data protection, 2007. http://www-128.ibm.com/developerworks/tivoli/library/t-snaptsm1/index.html.
[10] The Infiniband Consortium. The Infiniband trade association: consortium for Infiniband specification, 2007. http://www.infinibandta.org/.
[11] Intel Corporation. Preboot Execution Environment (PXE) Specification, 1999. http://www.pix.net/software/pxeboot/archive/pxespec.pdf.
[12] Intel Corporation. Intel MPI Benchmarks: Users Guide and Methodology Description, ver. 3.0 edition, 2007. http://www.intel.com/cd/software/products/asmo-na/eng/cluster/clustertoolkit/219848.htm.
[13] iSCSI Specification. RFC 3720: The Internet Small Computer Systems Interface (iSCSI), 2003. http://tools.ietf.org/html/rfc3720.
[14] T. A. Linden. Operating system structures to support security and reliable software. ACM Comput. Surv., 8(4):409–445, 1976.
[15] J. Liu, W. Huang, B. Abali, and D. K. Panda. High performance VMM-Bypass I/O in virtual machines. In USENIX 2006 Annual Technical Conference, Boston, MA, June 2006. http://www.usenix.org/events/usenix06/tech/full_papers/liu/liu_html/index.html.
[16] L. McVoy and C. Staelin. LMbench: Tools for Performance Analysis, ver. 3.0 edition, Apr. 2007. http://sourceforge.net/projects/lmbench/.
[17] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig. Intel virtualization technology: Hardware support for efficient processor virtualization. Intel Technology Journal, 10(3):166–178, Aug. 2006.
[18] The Ohio State University. MVAPICH: MPI over InfiniBand and iWARP, 2007. http://mvapich.cse.ohio-state.edu/overview/mvapich2/.
[19] E. Pinheiro, W.-D. Weber, and L. A. Barroso. Failure trends in a large disk drive population. In Proc. of the 5th USENIX Conference on File and Storage Technologies (FAST’07), pages 17–28, San Jose, CA, USA, Feb. 2007.
[20] G. J. Popek and R. P. Goldberg. Formal requirements for virtualizable third generation architectures. CACM, 17(7):412–421, 1974.
[21] S. Pratt. EVMS: A common framework for volume management. In Ottawa Linux Symposium, 2002. http://evms.sourceforge.net/presentations/evms-ols-2002.pdf.
[22] Qumranet Inc. KVM: Kernel-based Virtual Machine for Linux, June 2007. http://kvm.qumranet.com/kvmwiki.
[23] M. Rosenblum. The reincarnation of virtual machines. Queue, 2(5):34–40, 2005.
[24] B. Schroeder and G. A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proc. of the 5th USENIX Conference on File and Storage Technologies (FAST’07), pages 1–16, San Jose, CA, USA, Feb. 2007.
[25] Sun Microsystems. Sun Grid Engine, 2007. http://gridengine.sunsource.net/.
[26] VMware Inc. VMware website, 2007. http://www.vmware.com/.
[27] Xen Source. Xen wiki, 2007. http://wiki.xensource.com/.