Freie Universität Berlin
Fachbereich für Mathematik und Informatik
Institut für Informatik

Masterarbeit

Storage and Parallel Processing of Image Data Gathered for the Analysis of Social Structures in a Bee Colony

7 August 2014

Submitted by: Simon Wichmann, Erich-Baron-Weg 24A, 12623 Berlin
Contact: simonwichmann@fu-berlin.de
Reviewer (Gutachter): Prof. Raúl Rojas
Supervisor (Betreuer): Dr. Tim Landgraf

Abstract

The Beesbook project's goal is to gather information of unprecedented detail about the processes in a beehive. Due to the high temporal and image resolution, a large amount of image data (up to 300 Terabyte) will be acquired during the planned experiments. This thesis firstly presents an approach for transferring the recorded image data live to a specialized data storage facility in a stable and efficient way. For that purpose, it was worked out how to use a previously established Gigabit connection to full capacity. Secondly, because the image analysis would take an estimated 700 years on a single processor, a parallelization of the analysis has been developed. The presented approach uses the newly built supercomputer of Zuse–Institut Berlin, which provides 17856 processors. In order to analyze the whole dataset in an automated way, a program has been developed that monitors the supercomputer's batch queue and submits processing jobs continuously. The resulting program is applicable to all perfectly parallel problems. In fact, it can be compared to the Map–step of the well-known MapReduce programming model, but in a supercomputer environment instead of a compute cluster.

Statutory Declaration (Eidesstattliche Erklärung)

I hereby declare in lieu of oath that I have written this thesis independently and without outside help, and that I have used no aids other than those stated. This thesis has not been submitted to any other examination authority in the same or a similar form.

Berlin, 7 August 2014
Simon Wichmann

Acknowledgements

Far more people than just the author are involved in the successful completion of a thesis like this one, and I would like to thank some of them here. A big thank-you goes to Dr. Tim Landgraf for his exceptionally creative and pragmatic way of working and helping, and for his trust in my work. I also thank the whole Beesbook and Biorobotics team for many inspiring conversations and a very pleasant collaboration. Furthermore, I would like to thank Dr. Wolfgang Baumann and Wolfgang Pyszkalski of Zuse–Institut Berlin for long and insightful discussions. Thanks are also due to the IT services of the Institut für Informatik for their help with the network equipment and for quickly setting up and maintaining a Gigabit line for the Beesbook project. For advice on the correct use of the English language and for thorough proofreading I sincerely thank Anne Becker. Special thanks go to my girlfriend, Michele Ritschel, who always had an open ear for any problems that came up during this work. Finally, I would like to thank my parents, without whom this thesis would not have been possible in this scope.

Contents

1. Introduction
   1.1. The Beesbook Project
        1.1.1. Motivation
        1.1.2. Project Structure
   1.2. The Scope of this Thesis
        1.2.1. Transfer and Storage of the Image Data
        1.2.2. Parallelized Image Processing
2. Comparison with Similar Projects
3. Transfer and Storage of the Image Data
   3.1. Requirements
   3.2. Storage Approaches
        3.2.1. Local Hard Disks
        3.2.2. Tape Storage Devices at Zuse Institut Berlin
        3.2.3. Using the Supercomputer's File System
   3.3. Live Transfer of the Image Data
        3.3.1. Overview
        3.3.2. Monitoring the Image Directory with the FileSystemWatcher
        3.3.3. Archiving the Images with 7zip
        3.3.4. Transferring the Archives with Winscp
        3.3.5. Error Handling
4. Parallelization of the Image Analysis
   4.1. The Parallelization Approach
        4.1.1. Problem Decomposition and Computing Architectures
        4.1.2. The Right Parallelization Approach for the Beesbook Problem
   4.2. The HLRN/Cray Supercomputer System
        4.2.1. Overview on the CrayXC30 Supercomputer
   4.3. Parallelization on the Cray Supercomputer System
        4.3.1. Parallelization per Job
   4.4. Organizing the Automatic Job Submission
        4.4.1. The Beesbook Observer Script
        4.4.2. The BbCtx Module
        4.4.3. The Job Queue Manager Module
        4.4.4. The Image Provider Module
        4.4.5. Recovery
   4.5. How to Configure and Use the Beesbook Observer
        4.5.1. The Configuration via the BbCtx Module
        4.5.2. (Re-)Initializing the Observer's Work Directory
5. Evaluation
   5.1. Evaluation of the Image Transfer
        5.1.1. The Best Transfer Protocol
        5.1.2. Transfer Stability
   5.2. Evaluation of the Parallelization
6. Discussion
   6.1. Transfer
   6.2. Parallelization
7. Future work
   7.1. Transfer
   7.2. Parallelization
        7.2.1. Starting the Observer
        7.2.2. Extending the Observer
Appendices
A. Calculations and Tables
   A.1. Calculation Tables
        A.1.1. Image Sizes and Bandwidth
        A.1.2. HDD Capacities
        A.1.3. Maximal Parallelization
        A.1.4. Needed NPL
   A.2. Additional Calculations
        A.2.1. FileSystemWatcher Event Buffer Size
        A.2.2. Archive Size
        A.2.3. Processing Times
        A.2.4. Number of Batch Jobs
        A.2.5. Work Directory Structure
B. Glossary
   B.1. Units

List of Figures

1.1. The top level of the Beesbook structure, showing the separate steps of the experiment's workflow
1.2. The design of the bee tags (source: [9])
3.1. The hard- and software setup during the recording stage
3.2. A diagram showing the activities of one image transfer thread
4.1. An illustration of the batch job scheduling
4.2. An illustration of the proceedings in one batch job
4.3. The top–level Beesbook Observer schema
5.1. The results of the data transfer stability tests
5.2. Determined NPL overhead during several test runs
A.1. The bandwidths and total sizes corresponding to certain JPEG quality levels
A.2. The time capacities of differently sized hard disks
A.3. The maximal number of concurrent computations
A.4. NPL need depending on runtime per image
A.5. The image used for benchmarking
A.6. The structure of the Beesbook work directory

1. Introduction

The tasks I worked on in the scope of this master thesis are part of the Beesbook project [1] led by Tim Landgraf. To give an idea of how this thesis' topic integrates into Beesbook, I will first describe the project and its structure. The second chapter introduces and compares some related work. Subsequently, chapters three and four describe the implementation of the two main tasks of this thesis. The implementations are then evaluated and discussed in chapters five and six. The last chapter gives an outlook on further applications of the solved problems.

1.1. The Beesbook Project

Pollination services provided by bee colonies are essential for a wide range of important plants, for example almond, apple and cocoa [17]. In North America, bee-pollinated crops represent about 30% of the food consumed by humans [12]. In recent years there has been a growing number of reports about whole colonies dying (colony collapse disorder) [16], while the cause remains unclear. The honeybee's agricultural importance, as well as the complexity of the processes in a beehive, makes it a frequent object of research.
1.1.1. Motivation

Understanding the individual behaviors and a bee colony's coaction is crucial in order to understand the colony's behavior as a whole. One prominent example of interactive communication between bees is the waggle dance, which is an important part of foraging. Bees following the dance are pointed to a certain spot in the environment by decoding the location's polar coordinates from the body movements. While many technical aspects of the waggle dance are well understood, its social components remain unclear. Why does a bee follow a specific dance? Do dancers target certain individuals?

To answer these questions, we have to survey factors which might have an influence on the waggle dance behavior. Beesbook aims at acquiring interaction information not only about a few individuals in the proximity of a dance but about all individuals in the hive. A novel automated tracking system is currently being built that facilitates this unprecedented breadth of observation. In order to identify patterns of interesting interaction, the observation also covers a very large time frame of 60 days.

The gathered time and location data about every single bee will make it possible to analyze the social bee network. However, the storage capacity and computing time needed to achieve this are far beyond the capabilities of current commodity hardware. The amount of image data to be recorded will sum up to about 300 Terabyte (TB), and the expected computing time to analyze all images will be up to 700 years (on a single processor). For that reason both a storage solution and a parallelization approach have to be invented.

1.1.2. Project Structure

In order to analyze all the interactions inside a beehive, they have to be observed in an automated way. For this purpose, cameras are set up for video recording. In addition, a tracking system that allows to analyze the image data is needed. It consists of two parts:

1. a unique marking for each individual, and
2. a program that can spot and identify each marking and its orientation in an image of the hive — the software decoder.

The top level of the project structure is depicted in figure 1.1. A short description of each stage of the experiment is given in the following paragraphs. Further background information, for example about biological considerations and the camera/hive setup, will be given in upcoming Beesbook papers.

Growing the Beehive

As the basis for the experiment, the bee colony has to be grown first. Moreover, a container for the beehive that allows to observe its interior must be built.

Bee Tagging

Each single bee has to be marked with a unique number tag in order to identify it unambiguously. This stage again consists of two parts:

• The tag design: This includes the tag's physical properties like material, dimensions and production facilities. Figure 1.2 shows the information encoding based on a circular alignment.

• Fixing the tags to the bees: Each bee has to be extracted individually from the hive to glue a tag to its torso. All tags are fixed with the same defined orientation so that the bee's orientation can also be observed, because it correlates with the tag orientation.

Image Capturing and Storing

During this stage the actual experiment data is recorded. The aim is to capture 4 frames per second from 4 cameras with a resolution of 12 megapixels each.
Due to the large bandwidth produced (between 20 MB/s and 66 MB/s, depending on the image quality) and the complexity of the tracking step, the data cannot be processed in real time. Since the overall size of the recorded data will amount to about 300 TB, it has to be transferred directly to a specialized data storage facility. The live transfer of the image data is one task I worked on and will be further introduced in section 1.2.1.

Figure 1.1.: The top level of the Beesbook structure, showing the separate steps of the experiment's workflow: grow beehive, tag bees, start image capturing, store images, analyze images, analyze location data.

Image Analysis

At this stage the gathered image data has to be processed in a way that allows for the later reconstruction of the social bee network. For each single image, a table containing the location (two-dimensional coordinates) and orientation (angle) of each bee is saved as an intermediate result. For this purpose a software decoder using the openCV framework is developed. Even if one image could be analyzed in only one second (actually it takes much more time), the overall core time would, due to the large number of images, add up to about 3 years. To obtain results in a reasonable time it is necessary to parallelize the image analysis. This is the second task I worked on, which will be introduced in more detail in section 1.2.2.

Figure 1.2.: The tag design: The binary sequence is represented by black and white cells lying on the outer arc. The inner semicircles define the tag's orientation and the location of the least significant bit.

Location Data Analysis

By utilizing the location and orientation data extracted in the previous stage, it is possible to reconstruct a social network which covers the interactions of all marked bees. Two important points will affect the result:

• The definition of what an interaction is, in terms of distance/orientation between two bees.

• The handling of detection errors; a filtering step could remove or even correct unreasonable bee detections by using the chronological context.

Eventually, the resulting interaction network (or other representations which can be derived from the intermediate location data) can be used to obtain new insights on the waggle dance behavior.

1.2. The Scope of this Thesis

As announced in the previous sections, I now give some additional introductory information on the tasks I worked on.

1.2.1. Transfer and Storage of the Image Data

The bandwidth of the recorded image data depends on the quality needed to obtain sufficient tracking results. Since the software decoder is still in the development process, there are no quantitative measurements of its tracking performance yet. The estimated bandwidths range from about 9.1 MB/s (Q=70) to 66 MB/s (Q=100), assuming that the black-and-white JPEG format will satisfy the software decoder's requirements. Accordingly, a hard disk with a capacity of 2 TB could merely hold 4.3 hours (Q=100), 13.9 h (Q=90) or 63 h (Q=70) of image data.

Given that real-time processing is not feasible, as described in 1.1.2, it becomes necessary to store the data for later analysis. The planned experiment length of 60 days will result in an overall data volume of 45 TB (Q=70) to 159 TB (Q=90) or even 325 TB (Q=100).
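These totals can be reproduced with a few lines of arithmetic. The following Python sketch is not part of the thesis software; it merely recomputes the estimates of appendix A.1.1 from the quoted bandwidths, assuming 60 uninterrupted recording days and the convention 1 TB = 1024 · 1024 MB.

    # Recompute the total data volume from the estimated bandwidths.
    SECONDS = 60 * 24 * 3600        # 60 recording days in seconds
    MB_PER_TB = 1024 * 1024         # binary convention used in the tables

    for quality, mb_per_s in ((70, 9.1), (100, 66.0)):
        total_tb = mb_per_s * SECONDS / MB_PER_TB
        print(f"Q={quality}: {mb_per_s} MB/s over 60 days -> {total_tb:.0f} TB")
    # Output: Q=70 -> 45 TB, Q=100 -> 326 TB (rounded to 325 TB in the text)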
The only way to handle such a volume, instead of storing it locally, is to transfer the data directly to a specialized data storage facility during the recording. For this purpose, I researched and implemented solutions for the following questions:

• Where can a data volume of more than 300 Terabyte be stored?

• How do we establish a stable connection to the target storage server?

• How can we provide a constant bandwidth of up to 66 MB/s (528 Mbit/s)?

The answers and their implementations are described in chapter 3.

1.2.2. Parallelized Image Processing

The experiment, if we use the parameters we aim for (4 cameras, 4 fps, experiment runtime of 60 d), will produce a total of 4 · 4/s · 60 d · 24 h/d · 3600 s/h = 82,944,000 images. Because the software decoder is still in development, it was initially not clear how long its runtime would be. Still, as described in section 1.1.2, even if one image could be processed in one second, the overall core time would sum up to more than 3 years. With the estimated runtime being much longer, only a parallelization of the processing stage allows results to be obtained in a reasonable time. Thus, to achieve a considerable speedup, I investigated various parallelization approaches and implemented the one that fitted best. Chapter 4 describes how the best approach was identified and how it was implemented.

2. Comparison with Similar Projects

Seeley [13] marked up to 4000 bees by combining plastic bee tags and different colors of paint. Yet, the observation of the hive could only be performed manually (sometimes augmented by video recordings for later reference). The spatial and temporal resolution of these observations is conceivably small, calling for a more powerful observation system in order to investigate larger-scale questions.

A study [10] similar to Beesbook used video cameras to track six ant colonies (containing about 900 individuals altogether) over 41 days to learn how labor is divided among worker ants. The ants' positions and orientations were extracted from the image data in real time twice a second, using an existing tag library. Only a video with reduced resolution is saved in the process.

To be able to identify far more individuals (about 2000), a new tag had to be designed for the Beesbook project. Since bees would chew the paper tags, shape and material of the tags have to be different from the ones used in [10]. Because more information is concentrated on a small marker, the locating and decoding of these new tags is too expensive to be carried out in real time. Hence, the full-resolution image data has to be stored for later analysis, which is a challenging demand on storage facilities. Furthermore, the much more expensive tag decoding also poses a challenge for the available computing power. This is why a transfer and storage solution as well as a highly parallel computing environment had to be developed for the Beesbook project.

Processing large amounts of data is a common task for some big companies like e.g. Google. A total of more than twenty petabytes of data is processed on Google's clusters every day (as of 2008 [7]). The MapReduce programming model was invented by Google (and is widely used today) to automatically parallelize computations on a large cluster once the problem has been specified in terms of a map and a reduce function. The MapReduce model is useful in cluster environments, but it is less appropriate for high performance computing systems that do not share the network issues it was designed around.
Chapter 4 explains the decision for the CrayXC30 supercomputer system and the need for a new way of work partitioning and job organization.

3. Transfer and Storage of the Image Data

The following two chapters deal with the actual implementation of the tasks I worked on. The calculation of most of the mentioned numbers is given in the appendices. This chapter deals with the transfer and storage of the recorded image data. It consists of three parts: Firstly, I give a list of the general requirements for this stage of the experiment. The subsequent section deals with the various possible storage approaches. Finally, the chapter is concluded by a description of the data transfer program I developed.

3.1. Requirements

The image capturing stage will produce up to 66 Megabyte per second of image data. Together with the stage's length of 60 days and the impossibility of real-time processing, this leads to the following requirements:

• A storage facility capable of storing up to 325 Terabyte of image data must be found.

• The recorded data must be transferred to the storage facility of choice, presumably over the Internet.

• Throughout the transfer, the local storage must be prevented from overflowing by ensuring a constant bandwidth of at least 66 MB/s.

3.2. Storage Approaches

There is a range of possible storage approaches. Yet, both the bandwidth and the total size of the data volume render some of them unfeasible. The following sections describe three of them, including their advantages and disadvantages and how they can meet the given requirements.

When speaking about data storage, the possibility of binary compression (e.g. gzip, tar) has to be considered. But in the case of image data this yields no advantage, because image formats like PNG and JPEG already employ advanced image compression algorithms. These algorithms leave little to no room for compression on the binary layer.

3.2.1. Local Hard Disks

The easiest solution would be to store the image data on local hard disks, making a (live) transfer unnecessary. This would require 82 currently available hard disks (capacity of 4 TB) to store all the data, costing more than 10000€. Moreover, they would have to be hot-swapped in the running system whenever a hard disk reaches the limits of its capacity. After the recording, the data would have to be transferred to a compute cluster. In an article assessing the potentials of cloud computing [5], the data transfer over the Internet (20 Mbit/s) is said to be slower than sending the hard disks via overnight shipping, which would yield a bandwidth of 1500 Mbit/s. However, this assumes that the whole data set is present on hard disks at one time.

This approach is not viable due to poor cost-efficiency and the need for swapping the hard disks every 44 minutes (see figure A.2). Consequently, the live transfer of the recorded data to a specialized storage facility is inevitable.

3.2.2. Tape Storage Devices at Zuse Institut Berlin

The data management department of Zuse Institut Berlin (ZIB) provides quasi-infinite disk space. This is achieved by internally copying all the files to tapes while only the metadata (e.g. the inodes) remains present on the hard disks [11, p. 13]. Each tape has a capacity of about 5 TB and is read/written using a tape drive. 24 robots swap the tapes in one of the 60 tape drives, according to current access needs. Furthermore, there is room for 18,000 tapes, resulting in an overall capacity of about 45 Petabyte.
This storage approach has some advantages over the local storage: it is cheaper (the cooperation contract with ZIB would cost 4000€) and it is more secure due to professional data storage methods, such as replication. However, the recorded data has to be transferred to the storage servers. Since the servers are located outside of the institute's local network, the transfer relies on the Internet connection of the Freie Universität (FU). Depending on the Internet bandwidth available, the frame rate which can be used for the recording could be drastically reduced by the transfer. Yet, we were able to establish a Gigabit connection to the storage gateway server, thanks to our cooperation with the institute's IT services and our institute's direct Internet connection to ZIB. The resulting Gigabit connection provides a theoretical net bandwidth of 117.5 MB/s [18], which far exceeds a standard 100 Mbit/s connection (11.75 MB/s net). Tests showed an actual application-level bandwidth of approx. 100 MB/s when using at least four connections.

3.2.3. Using the Supercomputer's File System

The supercomputer on which the image analysis will be executed is introduced in chapter 4. Its work file system has a capacity of about 1 Petabyte, which would be open for us to use for a limited time. However, there are a few drawbacks: the file system has no means of data security. Furthermore, the supercomputer has scheduled downtimes which last up to two days. Therefore we would have to buffer the data locally during these times.

Since the usage of the supercomputer's file system would be free of charge, this approach is the preferred one. Future work will have to determine whether the required buffering is practicable. In addition, it is unclear whether the Gigabit connection can be established to the supercomputer's login servers too.

3.3. Live Transfer of the Image Data

The local storage approach is not feasible, as explained in section 3.2.1. As a consequence, the data has to be transferred to a storage facility right away during the recording process. Figure 3.1 gives an overview of the hardware and software setup of the image recording stage. It also shows the data flow from the cameras to the ZIB servers. One important point is the utilization of a RAM disk instead of a hard disk to capture the images from the cameras. This ensures that concurrently writing new images and reading older ones, in order to transfer them, is not obstructed by slow hard disks.

Several points have to be considered in order to automate the continuous transfer process. For this reason I developed a transfer script in the Windows PowerShell language which takes care of the necessary steps like image archiving and connection maintenance. The following sections describe the implementation details of the script.

3.3.1. Overview

The Windows PowerShell is an alternative to cmd.exe, provided since Windows XP and based on the .NET Framework. In addition, the PowerShell Scripting Language facilitates the scripting of system commands, as known from Unix shells. It offers sophisticated tools like the execution of multiple concurrent jobs and modularization by the use of Cmdlets.

The number of concurrent jobs to use is determined by two factors. One of them is the fact that only multiple concurrent FTP connections can ensure that the available Internet bandwidth is used to capacity. This can only be achieved by using multiple asynchronous jobs.
The other factor is a limitation of the PowerShell: it is impossible to communicate with a job once it is started. Consequently, the file system monitoring and the transfer must be executed together in the same job.

Tests showed that three connections suffice to use the bandwidth to capacity. This is due to a bandwidth of about 35.15 MB/s per connection (see section 5.1.1) and the absence of a bandwidth loss when using multiple connections concurrently. However, since there are four cameras to handle, four connections are used so that each job can use its own one.

The transfer script starts four concurrent jobs (threads), one for each camera. Prior to the data transfer loop, each job initializes its own FileSystemWatcher (described in section 3.3.2) and FTP (File Transfer Protocol) session.

Figure 3.1.: The hard- and software setup during the recording stage; the arrows show the image data rate, the data flow direction and the interface speeds of SSD [15], PCI and USB 3.0 [6].

Figure 3.2 shows the data transfer loop executed by each job after the initialization. Whenever a certain number of images has been captured, they are stored in a tar archive and deleted afterwards. After the archive has been transferred via the FTP connection, it is deleted in order to release disk space. In case an error occurs, the respective job will be terminated and the error will be printed to the command line. The single steps of the data transfer loop are described in the next three sections.

Figure 3.2.: This diagram shows the activities of a transfer thread: the FileSystemWatcher watches for new images captured from one camera; once the archive size is reached, the images are stored in a tar archive, the archive is transferred over FTP, and the archive and images are deleted. The initialization of the FileSystemWatcher and the FTP session is not shown here and is performed prior to this loop. The archive size (number of images) is configured in the transfer script and depends on the available disk space and the transfer speed.

3.3.2. Monitoring the Image Directory with the FileSystemWatcher

The FileSystemWatcher (FSW) is a .NET Framework class which is used to monitor file system events. Whenever an event occurs to which the job is subscribed (in this case the creation of a file), the information is stored in an event buffer and the job gets notified. Since all cameras write their images into the same directory, an event filter is registered with each FSW. The filter is represented by a string of the form *_$id.png, matching only filenames of camera $id (* is a wildcard). Thereby each job only handles new images created by its associated camera.

Whenever an event is raised, the respective job increments its file counter and stores the file name. Then, if the archive size is reached, the loop proceeds to the next stage, the archiving (see next section). After the event has been handled, it is removed from the event queue to release its memory.

The maximum event buffer size is 64 Kilobyte [2] and therefore limits the number of events which can be stored. Conservative calculations yield an event buffer capacity of approximately 1638 events (see appendix A.2.1). As a consequence, the transfer of an archive must not take longer than 409 s (1638 / frame rate). Stated differently, each archive may have a maximal size of about 409 s · (100 MB/s / 4) = 10225 MB.
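These limits can be checked with a short calculation. The sketch below is illustrative only; the value of 40 bytes per buffered event is an assumption chosen to match the conservative estimate of appendix A.2.1, and the 100 MB/s shared by four connections comes from section 3.2.2.

    # Back-of-the-envelope check of the FileSystemWatcher limits.
    BUFFER_BYTES = 64 * 1024        # maximum FSW event buffer size
    BYTES_PER_EVENT = 40            # assumed size of one buffered event
    FRAME_RATE = 4                  # images per second and camera
    MB_PER_S_PER_JOB = 100 / 4      # bandwidth available to one transfer thread

    max_events = BUFFER_BYTES // BYTES_PER_EVENT        # ~1638 events
    max_transfer_s = max_events // FRAME_RATE           # ~409 s per archive
    max_archive_mb = max_transfer_s * MB_PER_S_PER_JOB  # ~10225 MB
    print(max_events, max_transfer_s, max_archive_mb)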
3.3.3. Archiving the Images with 7zip

In order to minimize transfer protocol (FTP) overhead, it is convenient to transfer one large file instead of many smaller ones. For this purpose, the images are stored in a tar archive using the 7zip command line program 7za.exe. Then the images are deleted to release disk space as early as possible. As explained earlier (see section 3.2), it is of no use to compress the images. This is why I chose the compression mode -mx0, which just copies the data without any compression. Even if the additional compression were of any benefit, it would most likely not be worth the additional computational work. All in all, the computational throughput must not drop below the Internet bandwidth during the whole pre-transfer processing.

A bigger archive size can reduce the transfer protocol overhead. Yet, the archive size is limited by the available amount of disk space. In our case this is a portion of the available RAM, of which I allocated 15 GB for the RAM disk. The calculations resulted in a maximal archive size (number of contained images) of 467 (see appendix A.2.2). However, tests showed (see the evaluation chapter) that the transfer throughput is already stable enough at an archive size of 190.

3.3.4. Transferring the Archives with Winscp

After a job has stored the images into an archive, it is transferred to the remote storage facility over the job's Internet connection. One essential requirement for this connection is a bandwidth of one Gigabit. The next slower level of 100 Mbit/s (approx. 10 MB/s net) would allow a JPEG quality level of 70 at most, which will presumably not be sufficient for the tracking. This is why we established a Gigabit connection to ZIB (see 3.2.2), which is the basis for the viability of the transfer step.

Tests showed that encrypted file transfer protocols achieve much lower bandwidths than unencrypted ones (see section 5.1.1). This is mostly because the encryption needs a significant amount of computing power, which reduces the computational throughput significantly. Since data security is not relevant for us, I chose FTP as the transfer protocol because it provides the necessary bandwidths.

The WinSCP program delivers a .NET library in addition to its standard executable. This library provides an interface to several file transfer protocols and can be used natively inside the PowerShell Scripting Language. Together with the FileSystemWatcher, an FTP session is configured (server information, binary mode) and initialized at the beginning of each job. Immediately after archiving a certain number of images, the archive is transferred by calling the PutFiles() function of the WinSCP session object. Then the transfer result is checked, and the transfer is repeated in case anything goes wrong. Finally, the archive is deleted to release its memory.

3.3.5. Error Handling

The fault tolerance of the transfer script is a vital property in order to ensure continuous data acquisition during the whole experiment. All errors are immediately printed to the command line, but the transfer script will not be able to recover from most of the errors because they arise from problems in the underlying system (e.g. file system or Internet problems). There are three steps which can produce errors during the runtime of the script:

• The FileSystemWatcher will produce an Error event instead of a Created event if there is a problem with the FSW.
This can happen when the event queue reaches the limits of its capacity (1638 events, see appx. A.2.1), which can only be avoided by setting conservative archive sizes. If this or another error occurs here, the script will not be able to recover itself.

• The image archiving with 7zip can fail. In that case it sets a global exit code to a value indicating the cause. One cause can be an image which is still open while the script tries to archive it. This can happen because the Created event is raised immediately after the file is created, and hence the script might try to archive the images while the camera is still writing data to the newest file. To overcome this uncertainty, the archiving is executed again until it succeeds or another error code is reported by 7zip. In case of another error code the transfer script will not be able to recover itself.

• The transfer of the archive may not succeed. Since the WinSCP interface already employs connection recovery, all other errors are beyond the transfer script's capability to recover itself.

4. Parallelization of the Image Analysis

The experiment will produce a total of 82,944,000 images. As shown in appendix A.2.3, the overall processing time sums up to more than 770 years. For this reason, to obtain results in a reasonable amount of time, it is necessary to parallelize the image processing. This chapter covers the research and implementation of an efficient way to speed up this experiment stage. I will first explain some general points regarding the parallelization approach. Then, an explanation of the parallelization approach applied to the Beesbook image analysis concludes the conceptual part of this chapter. The following three sections go into detail about working with the Cray supercomputer at the Norddeutscher Verbund für Hoch– und Höchstleistungsrechnen (HLRN) and the implementation of the parallelization of the data analysis. Finally, the last section explains how to configure and use the resulting Beesbook Observer program.

4.1. The Parallelization Approach

In this section I explain program parallelization driven by problem decomposition. Furthermore, I introduce the parallelization approach I decided to use for the Beesbook problem.

4.1.1. Problem Decomposition and Computing Architectures

A program or problem can be parallelized in different ways:

• By decomposing the functionality of the program into multiple parts with few interconnections (henceforth intra–parallelization). Usually, this is done for systems which are meant to be highly responsive while executing multiple computations in parallel, which is not required for this project. Intra–parallelization can also become useful if there are fewer data items to analyze than processors available (see next point).

• By decomposing the domain data into chunks (single images, in the case of Beesbook), which can be processed in parallel by multiple program instances (henceforth inter–parallelization). That makes it possible to process several images at a time, speeding up the whole process. If there are fewer data chunks than processors available, additional intra–parallelization could allow more processors to be utilized.

All the approaches assume the availability of multiple processing units.
Currently there are two common architectures (taxonomy: [8]) providing a large number of processing units:

Array Processors like a GPU employ a Single Instruction Multiple Data (SIMD) approach. All processing units execute the same command in the program flow on separate parts of the data. This is only efficient while the processing of all data parts is highly homogeneous. The program flow of our software decoder depends strongly on the input image, which is why the analysis of multiple images is not homogeneous.

Supercomputer Systems commonly consist of many nodes containing several independent processors. These massively parallel systems constitute a Multiple Instruction Multiple Data (MIMD) architecture, making it possible to execute multiple independent programs concurrently. The Single Program Multiple Data (SPMD) approach makes use of this architecture by executing one program in multiple instances, working on separate data (images).

4.1.2. The Right Parallelization Approach for the Beesbook Problem

First of all, the parallelization of the image analysis stage shall decrease the overall running time. Consequently, the objective is to utilize as many processors as possible. The number of processors which can be used concurrently is limited by several factors, though. These factors include the number of processors available, the problem's degree of decomposability and the data transfer bandwidth in case the data does not reside near the supercomputer.

The effectiveness of data decomposition depends on the problem's degree of decomposability and the number of parts the decomposition results in. In the case of Beesbook each image can be processed independently, which is already a sufficient decomposition in order to use a huge number of processors concurrently. This problem structure, which has little to no interdependencies between results, is sometimes called perfectly parallel because of that property. Additional intra–parallelization would only be helpful if there were many more processors available than images. Hence, the most efficient solution is to map one process to each available processor, each analyzing a chunk of the image data. In this case, implicit intra–parallelization, as offered by the openCV library and another research library [14], would even hinder the process because additional threads would compete for processors already allocated to particular processes.

In case the data does not reside in the file system of the supercomputer, a data transfer is necessary prior to processing. Depending on the bandwidth available between the supercomputer and the data storage, as well as the processing time, this transfer can be a considerable bottleneck. See appendix A.1.3 for actual calculations of how many computations can happen in parallel. This bottleneck is one important reason for us to pursue both storing the data at ZIB and processing it on the close-by Cray system.

4.2. The HLRN/Cray Supercomputer System

As pointed out in section 4.1.1, the Beesbook image analysis is not homogeneous enough to be carried out on an array processor. Accordingly, the architecture of choice is a MIMD supercomputer system. The decision for HLRN has several reasons: First of all, it shares the file system designated to store our data, which circumvents an additional data transfer. Second, the HLRN houses the new Cray XC30 supercomputer, which provides 17856 processors and thereby the computing capacity we need (about 4.3% of the system's yearly capacity of 17856 core years; compare appx. A.2.3).
In the following sections I introduce a system that is able to process all data in a directory in an efficient, automatic and configurable way. The system described is completely independent from the actual processing program and could also be applied to other perfectly parallel problems. (For further information about the Beesbook software decoder, see Laubisch, 2014 [9].)

4.2.1. Overview on the CrayXC30 Supercomputer

The CrayXC30 supercomputer system at HLRN consists of 744 compute nodes with 24 CPU cores each, and some additional auxiliary servers for login and data management. Moreover, the system has 46.5 TB of memory and 1.4 Petabyte of hard disks, organized as RAID 6. This information and further hints on how to use the Cray system can be found in the Cray user documentation [3].

The Batch System

The compute nodes are not directly accessible from the login shell. In order to execute a program, a batch script has to be submitted to the job queue. Said batch script contains information about the resources needed (number of compute nodes, runtime) and a call to the actual binary. After submitting the script with the msub command, the system scheduler decides when to execute the job. This decision is based on complex calculations on the job's resource requirements and the system load, among other things. One job may run for a maximum of twelve hours and allocate a maximum of 256 compute nodes, which requires us to partition the whole workload into smaller jobs.

The important thing to note about the batch system is that the actual job execution happens in a completely asynchronous way. Therefore it is not possible to guarantee a constant number of jobs running at a time. Hence, it is also impossible to guarantee a constant speedup during the whole analysis stage. Instead, one can only finetune the job sizes in order to fit into the small spaces left by scheduling fragmentation (see figure 4.1).

After some organizational work in the batch script, the binary is executed using the aprun command. Its parameters include the number of nodes and cores used for this execution, program arguments and the path to the binary. An example call to execute the program on 48 cores would look like this: aprun -n 48 beesbook.bin jobWorkDir.

Figure 4.1.: An illustration of the batch job scheduling. The jobs are scheduled so that unused capacities are minimized. In contrast to the Beesbook problem, most of the computations carried out on Cray cannot be decomposed to an arbitrary degree (dark gray). Thus, the scheduling usually leaves some slots unassigned because no job fits in there. The scheduler will backfill these slots with Beesbook jobs (light gray) to minimize unused capacities. The Beesbook jobs trickle into these slots like grains of sand (also called Sandkornprinzip), thereby minimizing the waiting time.

Job Accounting

The HLRN's currency is called NPL (Norddeutsche Parallelrechner–Leistungseinheit). Each active account gets 2500 NPL per quarter (a project proposal is required to get more NPL) and one compute node costs 2 NPL per hour. There is no deduction in case not all cores of a node are used, hence it is important to utilize all reserved cores as efficiently as possible. If we neglect any overhead, the overall core time of 773.57 core years would cost 773.57 core years · (2 NPL per node hour) / (24 cores per node) ≈ 564,703 NPL. This corresponds to about 4.3% of the HLRN's yearly capacity (compare appx. A.2.3).
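The cost estimate can be verified as follows. This sketch only repeats the arithmetic with the numbers given above (one year taken as 8760 hours, overhead neglected); the authoritative calculation is in appendix A.2.3.

    # NPL cost and machine share of the Beesbook analysis (no overhead).
    CORE_YEARS = 773.57             # estimated total processing time
    HOURS_PER_YEAR = 8760
    CORES_PER_NODE = 24
    NPL_PER_NODE_HOUR = 2
    TOTAL_CORES = 17856             # Cray XC30 at HLRN

    node_hours = CORE_YEARS * HOURS_PER_YEAR / CORES_PER_NODE
    npl = node_hours * NPL_PER_NODE_HOUR   # ~564,700 NPL (564,703 in the text)
    share = CORE_YEARS / TOTAL_CORES       # ~0.043, i.e. about 4.3 %
    print(round(npl), f"{share:.1%}")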
One important point to note about figure 4.1 and the Sandkornprinzip is that minimizing job waiting times becomes unnecessary if the project's NPL are granted over multiple quarters. This is because the overall runtime will then be at least n−1 quarters. In this case, only the last quarter's runtime would benefit from efficient scheduling.

4.3. Parallelization on the Cray Supercomputer System

After introducing the parallelization approach and the Cray supercomputer system, I now describe the actual implementation and organization of the automated image analysis. As was explained earlier, the processing will be divided into many jobs. Each job will reserve a certain number of compute nodes for a certain amount of time to process one part of the image data. The parallelized processing therefore consists of two parts: concurrently analyzing images on all allocated cores, and organizing the continuous job submission.

4.3.1. Parallelization per Job

As described in section 4.1.2, the partitioning of the data into single images is enough to utilize many processors concurrently. Hence, during the execution of one job the software decoder is simply started on each allocated core. Each decoder then loops over the part of the images which was assigned to this program instance. In order to avoid conflicts, the data must be partitioned so that each process works on its own part of the data. This is achieved by supplying the data in separate directories, one for every process.

The last difficulty is to actually assign the directories to the processes. Due to a limitation of the aprun command it is not possible to do this by passing the particular directory via command line arguments. This limitation arises from the fact that aprun can only spawn processes in a homogeneous way, hence all processes on a node are started with the same command line argument. (It is also impossible to call aprun once per process in order to pass different arguments, because each subsequent call to aprun targets another compute node, which would prevent us from starting more than one process per node.)

Consequently, the only way to have both one process per core and a coordinated data distribution is to use interprocess communication. For this purpose I utilize the MPI (Message Passing Interface) library, which has several implementations available on Cray. Besides initializing and finalizing the MPI environment, there is one single function used: MPI_Comm_rank( MPI_COMM_WORLD, &world_rank ). It determines the rank of the process among all other processes (ranging from zero to the number of processes minus one) started with the same aprun call. This way each process knows which directory it has to work on. A more detailed explanation of the proceedings during one batch job can be found in figure 4.2. Furthermore, the structure of the Beesbook work directory is shown in appx. A.2.5.

Figure 4.2.: An illustration of the proceedings in one batch job. System calls have a red background, MPI calls an orange background and the Beesbook program has a green background. After adding the openCV libraries to the environment variable LD_LIBRARY_PATH, the aprun command is called to start the execution of the Beesbook program on all reserved cores. Then, each process determines its ID by calling the appropriate MPI function. Subsequently, the actual software decoder is executed for every image contained in the directory of the process. At last the MPI_Finalize() function is called in every process; it behaves like a barrier, ensuring that all processes have finished their analysis before the job exits.

4.4. Organizing the Automatic Job Submission

The maximum runtime for one job is twelve hours. This means that between 471 and 23530 jobs will have to be submitted in total (compare appx. A.2.4), which necessitates an automated program for job organization and submission.
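The range of job counts follows from the twelve-hour wall-clock limit and the chosen job size. The sketch below reproduces the two endpoints; the 50-nodes-per-job figure is my own reading back from the quoted 471 jobs and is not taken from appendix A.2.4, which may use different assumptions.

    import math

    # Number of 12-hour batch jobs as a function of the job size.
    CORE_HOURS = 773.57 * 8760      # total work from section 4.2.1
    CORES_PER_NODE = 24
    WALLTIME_H = 12

    def jobs_needed(nodes_per_job):
        return math.ceil(CORE_HOURS / (nodes_per_job * CORES_PER_NODE * WALLTIME_H))

    print(jobs_needed(1))           # 23530 single-node jobs
    print(jobs_needed(50))          # 471 jobs of 50 nodes each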
Since we want to comply with the Sandkornprinzip mentioned in figure 4.1, the jobs will be even shorter and therefore even more numerous. There are two conceivable approaches to implement the automation:

• The so-called job script chaining would use some reserved time of a batch job to organize and submit the next job before calling the aprun command. The data would be provided during another asynchronous job, and the newly submitted job would have to wait for that job to complete.

• An external observer program can organize the batch jobs just like a human would do it manually. However, this requires the possibility to have a process running on the Cray system during the whole data processing stage.

Since the job script chaining requires some non-trivial synchronization between multiple jobs (only one of them must perform the organization), I chose the latter possibility. This is only possible because user processes are allowed to run for an unlimited time on a data node at Cray (login node processes are limited to 3000 seconds), given that the process does little CPU-intensive work.

4.4.1. The Beesbook Observer Script

The automation program is implemented as a Python script which utilizes the command line programs available on Cray: msub <jobScript> returns the jobID of the submitted job and checkjob <jobID> returns the status of the job. The status comprises a state (Idle/Running/Completed) and a remaining runtime if the job is already running. In order to facilitate fast and efficient image processing (and to minimize waiting time), the automation program performs the following tasks (a sketch of the resulting main loop follows the list):

• It provides a block of images on the working file system for the next job to be submitted. This may take some time, depending on where the image data is stored. The data providing must be completed before the actual job is executed in order to avoid idle running during the reserved computation time. In particular, the image providing can be done while waiting for a job to finish.

• In order to minimize the overall processing time, the waiting time between jobs has to be minimized. For that purpose, the program has to maintain an internal job queue, ensuring that at least one Beesbook job is available for scheduling at any time.

• Whenever a job is finished, its results must be saved and a new job has to be prepared for the job queue.
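The interplay of these tasks is easiest to see in code. The following Python sketch condenses the main routine of figure 4.3; the helper names follow sections 4.4.3 and 4.4.4, but queueIsFull(), queueIsEmpty() and nextFreeSlot() are hypothetical helpers introduced here for readability, and the real script differs in details such as error handling and logging.

    import jobQueueManager as jqm   # module names as described in the text
    import imageProvider as ip

    def main():
        jqm.recover()               # repair the job queue after a termination
        ip.recover()                # repair extraction/progress state

        while ip.imagesLeft():
            # stage the next image block while jobs may still be running
            ip.provideNextImageBlock(jqm.nextFreeSlot())

            if jqm.queueIsFull():
                finished = jqm.waitForJobsToFinish()
                jqm.saveResults(finished)

            jqm.submitJob()         # msub the prepared job in the free slot

        # no images left: drain the queue and collect the remaining results
        while not jqm.queueIsEmpty():
            jqm.saveResults(jqm.waitForJobsToFinish())

    if __name__ == "__main__":
        main()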
Besides the organizational tasks, the program is able to recover itself from failures in order to continue with normal operation.

Figure 4.3 is a schematic depiction of the main routine of the Beesbook Observer script. The Beesbook Observer is divided into three modules, which are described in the following sections.

Figure 4.3.: First of all, the Beesbook Observer recovers the state of the job queue and the overall progress. This is necessary because the program could have been terminated at any point of execution. Then, while there are images left to process, the Observer keeps the job queue filled in the following way: At the beginning of the loop an image block is provided in advance for the next job to submit. Then the next job is submitted if there is still room in the job queue, or after a job has finished. The submission can be executed immediately because the image block was already provided before waiting for a job to finish. Since there is one more job slot than the job queue is big, the next image block can always be provided before waiting. Once no images are left, the Observer waits for and collects all remaining jobs.

4.4.2. The BbCtx Module

The Beesbook Context module is a container for all global constants and is used in both other modules. These constants include, for example, directory paths as well as job properties (wall-clock time, number of used cores, etc.). Additionally, the persistent status file is loaded here, using Python's shelve module. The statusShelf is used to persistently store information about the job queue and the overall progress across consecutive executions of the Beesbook Observer. Moreover, certain checkpoint values are stored in the statusShelf to indicate whenever a critical section is entered. This way, the Observer can recover from unexpected terminations (e.g. after the program was terminated by the system). The statusShelf is used like a dictionary (a key-value map), so that, e.g., the job queue is accessed in the following way: BbCtx.statusShelf['jobQueue']. Before starting the actual analysis stage, some of the variables in the BbObserverConfig file have to be adapted to the system (e.g. the paths to the Beesbook binary and the image archives).

4.4.3. The Job Queue Manager Module

The Job Queue Manager encapsulates functions for the manipulation of the internal job queue. It is based on the implementation of a persistent queue which immediately writes all changes to the status file. A description of the Manager's functions follows:

SubmitJob calls the msub command to submit the job in the next job slot and parses the job's information (Cray ID and job slot) from the output. A tuple representing the job is then enqueued in the job queue. After that, the next job slot to be used is calculated as (currentSlot + 1) % (MAX_QUEUE_SIZE + 1).

WaitForQueue returns immediately if the job queue is not full. Otherwise it waits for a job to finish. This is used to fill the queue with jobs in the beginning (see figure 4.3). The size of the queue is defined by the BbCtx constant MAX_QUEUE_SIZE.
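A minimal sketch of these two operations, assuming the statusShelf layout described in section 4.4.2. The parsing of the msub output (taking the last token as the job ID) is an assumption; the actual script may parse it differently.

    import subprocess
    import BbCtx                    # context module, see section 4.4.2

    def submit_job(job_script, current_slot):
        out = subprocess.check_output(["msub", job_script]).decode().strip()
        cray_id = out.split()[-1]                  # assumed: last token is the ID
        queue = BbCtx.statusShelf['jobQueue']
        queue.append((cray_id, current_slot))
        BbCtx.statusShelf['jobQueue'] = queue      # reassign so shelve persists it
        # one spare slot lets the next image block be provided while all
        # queued jobs are still pending
        return (current_slot + 1) % (BbCtx.MAX_QUEUE_SIZE + 1)

    def wait_for_queue():
        if len(BbCtx.statusShelf['jobQueue']) < BbCtx.MAX_QUEUE_SIZE:
            return                                 # room left, nothing to wait for
        wait_for_jobs_to_finish()                  # described below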
WaitForJobsToFinish checks the status of the oldest job in the queue (by parsing the output of the checkjob command) and waits until it is finished. The job is then removed from the queue and its information is returned to the caller. The checkjob command returns the job state, among other information. There are four cases to handle:

• If the job state is Completed, the job has finished execution and can be removed from the queue.

• If the state is Idle, the job is waiting for execution. The job will not finish before at least the given wall-clock time has expired, so the Observer waits and checks again later.

• The Running state signals that the job is already running. The checkjob command also returns the remaining wall-clock time, so the Observer parses it and waits for that amount of time.

• In case the system cannot find a job with the given ID, it returns an error message. The system deletes completed jobs after some minutes, so the error message means that the job finished its execution and was deleted before the Observer polled the job's status. Thus, the job can be handled as if its status were Completed.

SaveResults cleans the given job slot by moving the results to the result directory. Then all remaining image files (which are unprocessed; the software decoder would have deleted the input after successful processing) are moved back to the image directory so that they are processed in another job. The Observer relies on the Beesbook program to delete the input image in order to signal that the analysis was completed successfully. The Observer saves a result only if the input image is not present anymore. Otherwise the result is discarded and the input file is returned to the image heap. Note that, besides some MPI commands (see section 4.3.1), this is the only requirement the Beesbook software decoder has to fulfill in order to be compatible with the Observer.

Recover uses the statusShelf to recover the job queue from previous Observer runs. It also takes action against possible inconsistencies caused by abnormal program termination. For further information on recovery, see section 4.4.5.

4.4.4. The Image Provider Module

The Image Provider encapsulates functions to provide jobs with images and to track the overall processing progress. The image providing is organized in two stages: When an image block is provided to a job slot, the images are taken from an internal image heap, which is a directory containing the extracted images. In case there are not enough images left, the next image archive is extracted into the image heap. The functions utilized in the Observer are:

ProvideNextImageBlock fills the given job slot with images to process. As described in section 4.3.1, each slot consists of n directories, with n being the number of processes (and hence cores) per job. The number of images per process is defined in the BbCtx constant CHUNKSIZE_PER_PROC. The function unpackNextArchive (which utilizes the Python tarfile module) is called to extract the next image archives if necessary. In that case, the new progress, represented as the number of archives already extracted, is also saved in the statusShelf.

ImagesLeft checks if there are images left in the image heap or if there are image archives left which have not been extracted yet.

Recover uses the statusShelf to recover the current progress. This is necessary if the Observer was terminated during the extraction of an archive or the providing of an image block. For further information on recovery, see section 4.4.5.
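The two-stage providing can be sketched as follows. The directory constants and the ARCHIVES list are placeholders; the recovery checkpoints of section 4.4.5 are omitted here and only hinted at in the comments.

    import os, shutil, tarfile
    import BbCtx

    def unpack_next_archive():
        # progress counts the archives that have already been extracted
        progress = BbCtx.statusShelf['progress']
        archive = os.path.join(BbCtx.ARCHIVE_DIR, BbCtx.ARCHIVES[progress])
        with tarfile.open(archive) as tar:
            tar.extractall(BbCtx.IMAGE_HEAP)   # overwrites silently -> idempotent
        BbCtx.statusShelf['progress'] = progress + 1
        # the real function additionally sets the 'extracting' and 'oldProgress'
        # checkpoints described in section 4.4.5

    def provide_next_image_block(slot_dir, n_procs):
        # one directory per process, CHUNKSIZE_PER_PROC images each
        for proc in range(n_procs):
            proc_dir = os.path.join(slot_dir, str(proc))
            os.makedirs(proc_dir, exist_ok=True)
            for _ in range(BbCtx.CHUNKSIZE_PER_PROC):
                heap = os.listdir(BbCtx.IMAGE_HEAP)
                if not heap:                   # heap empty: extract the next archive
                    unpack_next_archive()
                    heap = os.listdir(BbCtx.IMAGE_HEAP)
                shutil.move(os.path.join(BbCtx.IMAGE_HEAP, heap[0]), proc_dir)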
For further information on recovery, see section 4.4.5. 4.4.5. Recovery It may happen that the Beesbook Observer program gets terminated at any point during execution. This may be, for example, due to Cray maintenance shutdowns or Cray system errors. The Observer is able to recover from most inconsistencies, which can arise from incomplete operations. Note that the recovery system that is described here does not handle errors occurring in the Beesbook binary. All operations manipulating data that is located in files on the hard disk may produce inconsistencies due to incomplete execution. This includes program terminations before a new status can be persisted in the statusShelf. Yet, due to the sequential nature of the observer, there can be at most one error which has to be recovered during each program start. This insight makes it much easier to think about the whole recovery process. The recovery system consists of two parts: Checkpoint values are stored in the statusShelf (comp. section 4.4.2) to indicate that a critical section was entered. The recovery functions then check for these values to identify incomplete operations and to recover from the inconsistent state. In the following paragraphs, I give an analysis of the critical sections in the program and how they are handled by the recovery system. The first three paragraphs cover the critical sections of the Job Queue Manager module, while the subsequent paragraphs address the Image Provider’s critical sections. Chapter 4 S. Wichmann 31 4.4. ORGANIZING THE AUTOMATIC JOB SUBMISSION Job Submission The function submitJob calls msub to submit the next job and stores the job’s ID and its slot in the persistent job queue. If the program is terminated after submitting the job but before storing the information, the result is a ghost job which will lead to errors when the next job is submitted in the same slot. A placeholder job with ID = −1 is saved in the job queue before calling msub to be able to detect this situation. After the submission the placeholder is updated with the actual ID. During recovery, a job with ID = −1 is recovered by comparing the internal job queue with the output of the Cray showq command. If there is an ID that is not present in the internal job queue, that would be the ghost job’s missing ID. If there is no excess ID in the Cray queue, that would indicate that the ghost job already finished and can be saved. If the termination occurs before determining the next slot to use, the current slot would be used again for the next job. For that reason, the next slot is inferred from the youngest job in the persistent job queue during recovery. Job Waiting The only critical section is after the job is removed from the persistent job queue in waitForJobsToFinish. If the Observer is terminated before the results of the job are saved, the reuse of the slot would possibly lead to errors. To avoid this, the slot of the finished job is stored under the key saveResults in the statusShelf before removing the job from the queue. Then, at the end of the saveResults function, the entry is deleted. During recovery, if an entry for saveResults is present in the statusShelf, saveResults is executed retroactively for the corresponding job slot. If the Observer is terminated before actually removing the finished job from the job queue, it would check again for the job’s status during the next run and remove it because it is finished. Hence, this operation does not need to be guarded by recovery. 
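The guard around the finished job's slot can be expressed in a few lines. The sketch below is illustrative only: moveResultsAndLeftoverImages stands in for the actual file moving, and the queue handling is reduced to a plain list stored in the statusShelf.

    import shelve

    statusShelf = shelve.open('Beesbook.status')

    def moveResultsAndLeftoverImages(slot):
        # Placeholder for the real saveResults work: move result files to the
        # result directory and unprocessed images back to the image heap.
        pass

    def saveResults(slot):
        # Idempotent, so it may be executed again at any time until it succeeds.
        moveResultsAndLeftoverImages(slot)
        if 'saveResults' in statusShelf:
            del statusShelf['saveResults']            # checkpoint cleared: slot reusable
            statusShelf.sync()

    def handleFinishedJob(job):
        # Store the slot BEFORE removing the job from the queue, so that a
        # termination in between leaves a checkpoint for the recovery run.
        statusShelf['saveResults'] = job['slot']
        statusShelf.sync()
        queue = statusShelf.get('jobQueue', [])
        queue.remove(job)
        statusShelf['jobQueue'] = queue
        statusShelf.sync()
        saveResults(job['slot'])

    def recoverJobQueueManager():
        # Called first on every program start (see section 4.4.5).
        if 'saveResults' in statusShelf:
            saveResults(statusShelf['saveResults'])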
As explained in the next paragraph, it is also safe to execute saveResults again during this process. Saving Results After a job is finished, its slot directories (one for each process) are traversed and the results are moved into the result directory. Moreover, remaining image files are moved back into the image heap for later reprocessing. If the Observer is terminated before or during saveResults, the function can just be called again during recovery. As described in the previous paragraph, callers of saveResults can guard the completion of the operation by adding the key saveResults to the statusShelf. Since it is an idempotent function (the result stays the same if successively called multiple times), it is save to call saveResults again until the operation completed. Chapter 4 S. Wichmann 32 4.5. HOW TO CONFIGURE AND USE THE BEESBOOK OBSERVER Recovery The recovery functions of both, the Job Queue Manager and the Image Provider, are the first functions to be called during each program start. Therefore, if the program is terminated while still recovering, the respective recovery operation will be repeated since the indicating statusShelf entry was not deleted yet. The following paragraphs address the Image Provider’s critical sections. Image Block Providing To recover from partially provided image blocks, saveResults is called for the slot that will be used for the next job. This will move all images back to the image heap so that the image providing can be executed anew. Since this is done during each program start, regardless whether the image providing for the next slot has been completed successfully, an image block might be provided multiple times. To prevent this overhead, it is insufficient just to indicate a failure by adding an appropriate key to the statusShelf. For further discussion on this topic see chapter 7. Archive Extraction In case the image heap does not contain enough images for the next image block, the next archives are extracted into the image heap. To ensure the successful completion of the extraction, a flag is stored under the key extracting in the statusShelf. If this flag is set during recovery, the function unpackNextArchive is executed again to complete the operation. Since the extraction silently overwrites existing files, it is an idempotent operation and therefore it is safe to execute it again. Then, after extracting the next archive, the overall progress has to be updated. To make sure that this happens, the old progress value is stored under the key oldProgress in the statusShelf before deleting the extracting flag. By interlocking these flags, one can ensure that a termination during unpackNextArchive is definitely detected. During recovery, if the oldProgress value is set, it is compared to the current progress value. If they have the same value, which indicates that the update of the progress did not happen, the update will be repeated. Now, if the Observer is terminated right after incrementing the progress but before deleting the oldProgress value, the incrementation will not happen again, since now oldProgress and progress differ. 4.5. How to Configure and Use the Beesbook Observer In order to start the Observer, it must be configured first. Chapter 4 S. Wichmann 33 4.5. HOW TO CONFIGURE AND USE THE BEESBOOK OBSERVER 4.5.1. The Configuration via the BbCtx Module The file BbObserverConfig.py contains all the variable information for the Observer like binary path, job queue size etc. 
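As an illustration, a configuration module with the values discussed in the next section might look like the following sketch; all paths are placeholders that have to be adapted to the target system.

    # BbObserverConfig.py -- sketch of a possible configuration
    # (constant names and values as discussed in section 4.5.1)

    MAX_QUEUE_SIZE     = 8            # number of Cray jobs kept in the queue
    NUM_NODES          = 3            # nodes reserved per job
    CHUNKSIZE_PER_PROC = 10           # images per process per job
    WALLTIME           = '00:48:55'   # hh:mm:ss, matched to CHUNKSIZE_PER_PROC

    WORKING_DIRECTORY  = '/path/to/beesbook_work'
    BIN_PATH           = '/path/to/beesbook_binary'
    ARCHIVE_DIR        = '/path/to/image_archives'
    RESULT_FILE_DIR    = '/path/to/results'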
An explanation of all relevant values and how to pick the right value follows. If any of the parameters is changed during the analysis, the Observer has to be re-initialized. A description of how to do that follows below. However, only Cray parameters should be changed after starting the analysis. The values shown are chosen to comply with the Sandkornprinzip and will work for the whole analysis. If it turns out that this is unnecessary, CHUNKSIZE_PER_PROC, WALLTIME, and NUM_NODES can be increased, e.g. to (50, 04:04:50, 10).

• MAX_QUEUE_SIZE: 8
  – Number of Cray jobs kept in the queue.
  – Since most of the time no more than 4 jobs will be running (Cray soft limit), this number should stay below 10.
  – Note that there will be CHUNKSIZE_PER_PROC * NUM_PROCS images provided per queue slot. One should make sure that enough disk space (and quota) is available.

• NUM_NODES: 3
  – The number of nodes to reserve per job.
  – This affects the number of images analyzed per job. It should be relatively small to conform with the Sandkornprinzip. Jobs with up to 3 nodes are scheduled into the "smallqueue".
  – 3 nodes correspond to 0.4% of all available nodes (744).
  – For system maxima of walltime and number of nodes/cores see [3].

• CHUNKSIZE_PER_PROC: 10
  – Number of images per process per job.
  – CHUNKSIZE_PER_PROC and WALLTIME are the most important values to adjust properly in order to minimize the NPL overhead.
  – Should correspond to the walltime: depending on the variance of the Beesbook program's runtime, WALLTIME should be adjusted to CHUNKSIZE_PER_PROC * <avg runtime for one image> in order to avoid idle running.
  – To conform with the Sandkornprinzip, the walltime should not be longer than 60 min. Hence, for a runtime of 293 s per image, 10–12 is a good value.

• WALLTIME: 00:48:55
  – Specifies the time the requested resources are reserved for. Has to be of the form hh:mm:ss.
  – Should correspond to CHUNKSIZE_PER_PROC, see the explanation above: 10 · 293 s = 2930 s ≈ 48:50 min; the chosen value includes some extra buffer seconds.
  – The maximum walltime is 12 h.

• WORKING_DIRECTORY: <Path to where the work directory will be initialized>
  – The directory where the image blocks are provided and the log and status files are written.

• BIN_PATH: <Path to the Beesbook binary>

• ARCHIVE_DIR: <Path to the image archives>
  – The archives will be untarred into IMAGE_FILE_DIR during the analysis.

• RESULT_FILE_DIR: <Path to where the results will be stored>

4.5.2. (Re-)Initializing the Observer's Work Directory

Before starting the analysis, the Observer has to be executed with python BeesbookObserver.py --init. This way the work directory is initialized at the path specified in WORKING_DIRECTORY. Later, if parameters are changed in the configuration, the parameter --reInit is used to update the work directory according to the new parameters. Note that the job queue has to be empty before reinitializing the Observer, because reinitialization deletes the present directories. The program switch --collectJobs can be used to collect all active jobs without submitting new ones.

5. Evaluation

Both the evaluation chapter and the discussion chapter are divided into two individual sections. Each chapter begins with the image transfer's evaluation and discussion, followed by the parallelization's.

5.1. Evaluation of the Image Transfer

The efficiency of the transfer of the image archives was measured in two ways.
Firstly, the best transfer protocol was identified by measuring the maximal bandwidth one can achieve with it using one single connection. Secondly, longterm tests were performed with varying archive sizes and using four connections. The longterm tests should measure the stability of the transfer and identify the archive size needed in order to achieve the maximal possible bandwidth. 5.1.1. The Best Transfer Protocol During these tests an image archive of 104 MB was transfered multiple times. The following results represent the average bandwidth of all transfers per protocol. • Putty SCP: 14,87 MB/s • WinSCP: 18,55 MB/s • WinFTP: 35,15 MB/s The results show that the unencrypted File Transfer Protocol is by far the most efficient one. This is due to additional computations that are needed for encryption, which is unnecessary in case of Beesbook. 5.1.2. Transfer Stability In order to measure the longterm stability of the data transfer a script has been written that generates images inside the observed folder. Thereby the script produces about 97,92 MB/s of data. The transfer stability was then measured by observing the number of images residing in the observed directory each second, while executing the transfer script. Depending on the archive size there is a maximum for this number of images while the transfer is fast enough. If this number is exceeded that indicates that the transfer is too 36 5.1. EVALUATION OF THE IMAGE TRANSFER slow (less than an average of 97,92 MB/s) and hence one could run out of disk space soon. See figure 5.1 for the results. The expected maximum number of files in the directory is calculated as the archive size times four (four cameras) times two (while transferring one archive, new images continue to be generated) plus four (number of archives that reside in the directory): Archive Size · 4 · 2 + 4 The archives sizes that were tested are: • 16 images: 97,92 MB, expected maximum number of files: 132 • 32 images: 195,84 MB, expected maximum number of files: 260 • 64 images: 391,68 MB, expected maximum number of files: 516 • 128 images 783,36 MB, expected maximum number of files: 1028 In addition to the tests depicted in figure 5.1, test runs with an archive size of 128 were also performed for runtimes of 8 hours, 10h, 16h and 19h. The maximum and average file count did not differ significantly from the ones showed in the diagram. Those results prove that the present 1GBit connection to ZIB can be successfully used to capacity and that it is stable enough to be used for the data acquisition of the Beesbook project. However, further tests with other parameters like another data generation bandwidth or longer runtime were not possible because the account at the ZIB data storage server expired and could not be reactivated free of charge. Hence, there could still be bandwidth fluctuations during certain periods which were not be detected during the tests. Chapter 5 S. Wichmann 37 5.1. EVALUATION OF THE IMAGE TRANSFER Figure 5.1.: This chart shows the results of the transfer stability tests. The bars display the maximum (and average) number of files that resided in the image directory during the whole test (26,6 min). For that purpose a test script generated about 100 MB/s of image data inside a directory located on the RAM–disk. Values above 100% indicate that the transfer was slower than the image generation and hence the number of images increased continuously. 
During the actual experiment this must not happen because eventually there would be no more disk space available, which would lead to data loss. The results show that the transfer speed is too low for archive sizes below 64. This is due to connection maintenance and archiving overhead. Yet, archive sizes from 64 on result in a sufficient transfer bandwidth of nearly 100 MB/s.

5.2. Evaluation of the Parallelization

The success of the parallelization effort can be measured in three dimensions:

• Reliability of the Beesbook Observer regarding the recovery and the seamless organization of the data analysis.
• The overall speedup/overall runtime. The speedup is the factor by which the total runtime is divided by parallelizing the computation.
• The efficiency in terms of needed NPL. How many NPL are needed beyond the actual runtime of the software decoder? How short can one single job be so that the NPL overhead stays minimal (to comply with the Sandkornprinzip)?

The Observer's reliability was extensively tested in three ways:

1. The Observer was intentionally terminated before, during and after every critical section during several runs. For this purpose, a call to sys.exit() was executed at the corresponding spots in the code.
2. The Observer was terminated manually several times during test runs.
3. A script was written which automatically restarts the Observer repeatedly after a random amount of time. During one test run the times were chosen from the range 20 s – 120 s, during another from 2 s – 40 s.

After each run, the outcome was verified by checking whether all result files were present in the result directory. No errors were detected during these executions.

The overall speedup was tested by measuring the runtimes for one single image while using:

• one core: 22 seconds
• 24 cores (one node): 22.5 s
• 48 cores (2 nodes): 23 s
• 96 cores (4 nodes): 23 s

These results show that the overhead introduced by using all cores on a node and by communicating via MPI is minimal and grows very slowly. Hence, the overall speedup is close to the number of cores used, as expected for a problem with this structure (explained in section 4.1.2). However, the number of cores used concurrently highly depends on several Cray properties and hence is subject to fluctuations. For the most part this can be attributed to the scheduling system, which will execute an unpredictable number of Beesbook jobs concurrently, depending on the system load.

It is more promising to examine the speedup from another point of view: Computing time is measured in NPL, and the HLRN allocates them to users quarterly. Since no more NPL are allocated than can be used during the quarter, every user is guaranteed that his computations will be carried out within three months. In case the overall amount of NPL granted by the HLRN is distributed over q quarters, this implies that the entire analysis will take at least q − 1 quarters. In view of the large amount of processing time (several hundred years) needed, the critical question will not be how long the analysis will take but whether the amount of NPL needed will be granted by the HLRN (see appx. A.1.4 for the NPL calculations). The possible speedup then depends directly on the amount of NPL granted per quarter.

The efficiency of the parallelization was tested in eleven test runs with varying properties.
All tests have in common the use of 24 cores (one node) per job and a chunk size of 40 images per process (960 images per job). The number of tests one can run is limited by the amount of NPL available (2500 per quarter) and the long runtime per test. This is because, in order to measure the overhead of the NPL usage, the software decoder must actually run for a certain time during the batch job. For this reason, only qualitative measurements of the NPL usage can be performed. Because the actual software decoder was not finished at testing time, it simply slept for a given time during execution. This way, the analysis of each image took the exact same time. Yet, the actual software decoder will likely take varying amounts of time. Three properties were changed during the tests: the total number of images to process, the runtime per image (a constant value) and the wallclock time (the job runtime). Figure 5.2 shows the most important results of the measurements. The overhead was determined by calculating the NPL demand (see appx. A.1.4 for the formula) for a given number of images and runtime per image. Then the NPL costs of the test run's jobs were summed up and divided by the calculated value.

Figure 5.2.: Determined NPL overhead during several test runs. WC = wallclock time.

One can see that, with an increasing number of images and runtime per image, the NPL overhead decreases. The data point [20000 images, 21 s runtime per image] emphasizes the importance of an accurate wallclock adjustment. If the wallclock time is too short to analyze all images in a job, each process is terminated while analyzing an image. Hence, up to [number of cores] · [runtime per image] seconds of computing time are wasted. The smaller the ratio of wallclock time to runtime per image, the greater the NPL overhead. On the other hand, if the processes finished their analysis but there was still wallclock time available, the job would idle for up to 70 seconds (verified during multiple tests). This is due to the scheduling system taking some time to release the resources. The resulting overhead again depends on the wallclock time and diminishes with increasing job length. For jobs whose wallclock time is too long, the overhead drops below 1% for wallclock times above 70 s / 0.01 = 7000 s ≈ 116 min. For jobs whose wallclock time is too short, the overhead drops below 1% for wallclock times above [runtime per image] / 0.01. The wallclock time required in the latter case is potentially much longer than in the former; thus, the most efficient way is to assign a slightly longer wallclock time than the job is expected to need.

6. Discussion

6.1. Transfer

The results show that it is possible to use a Gigabit connection to capacity in an efficient and stable way. However, the risk of bandwidth fluctuations and even server downtimes cannot be ruled out entirely. A system that can buffer data during short connection problems will be needed in order to guarantee data acquisition without any data loss. Yet, as a prerequisite for the parallelized image processing on the Cray supercomputer, the possibility of a continuous data transfer was demonstrated.

6.2. Parallelization

The Beesbook Observer automatically and reliably analyzes images supplied in an archive directory. In case the Observer gets terminated, it can simply be started again and continue its work without further user interaction, thanks to the recovery system.
However, there is not yet a mechanism that notifies the user about the Observer's termination. Currently, the result files are stored unorganized in a result directory. Such an organization scheme would still have to be devised; alternatively, a database could be used to store the results. Additionally, the Beesbook Observer has to be configured correctly before the image analysis can be started. The configuration process and its values are described in section 4.5.

The overall speedup of the presented parallelization approach is close to the possible maximum (the number of cores in use per job). Yet, as I explained in section 5.2 (overall speedup paragraph), the critical condition for successfully executing the whole image analysis stage on the Cray is that the project is granted the corresponding amount of NPL. In this light, it is important to take the efficiency of the NPL usage, which was calculated during test runs, into account. Overhead values ranging from 0.5% to 11% were found. However, because of the unknown variance of the actual Beesbook software decoder's runtime, it is impossible to optimize this value any further at this point. The NPL overhead highly depends on the wallclock time chosen for a job. Generally, as explained in figure 5.2, it is best to assign to the job a longer wallclock time than the processes' estimated runtime. However, due to the fact that the decoder runtimes per image will vary, a larger overhead is possible because each process that completes its work before the last process wastes computing time (which has to be paid for). A dynamic work scheduling system might be needed to diminish this overhead. Yet, it was not possible to address this problem within the scope of this thesis (especially as its benefit cannot be quantified without knowing the actual variance of the runtime).

7. Future work

7.1. Transfer

As explained in the discussion, bandwidth fluctuations due to varying network load or server issues cannot be ruled out entirely. Hence, the most urgent improvement is to facilitate some sort of local buffering. Local buffering would not make sense with the parameters used in this thesis because such a buffer can never be emptied if the available connection is already used to capacity. However, the actual experiments will produce smaller bandwidths (likely no more than 66 MB/s), so a local buffer can be a viable option to handle unstable bandwidths.

7.2. Parallelization

First I give an overview of the most immediate future work. Then some more general development possibilities follow.

7.2.1. Starting the Observer

In order to actually start the image analysis, the following points must be prepared:

• The Observer must be configured correctly (described in section 4.5).
• The Observer must be adapted to the image archive nomenclature used during the experiment's image acquisition. The Image Provider module provides internal functions that can be changed to reflect the archive naming.
• The results of the analysis must be organized, either in archives or in a database.

Furthermore, as described in section 6.2, a scheduling system might be needed in order to reduce the NPL overhead.

7.2.2. Extending the Observer

The design of the Observer makes it possible to apply it to all perfectly parallel problems. Basically, just like the Map step of the MapReduce programming model, the Observer applies a given function (the program binary) to a given dataset (the input file archives).
Adapting the Observer to another problem involves the same steps as preparing it for the Beesbook analysis. There are only a few requirements regarding the program that will be executed for the analysis during the batch jobs:

• The program must take one command line argument that represents a job's data directory.
• It must use MPI (or another interprocess communication library) to identify the individual data directory for each process. The program then must run its analysis on each of the files in its data directory.
• Each input file must be deleted after persisting the result, to indicate a successful analysis.

See the Beesbook source code for an example of how to implement those requirements. The additional code needed to prepare a program for the Observer should be no more than 30 lines.

With a field of application beyond Beesbook in mind, there are a number of conceivable enhancements:

• In order to reduce the NPL overhead when an analysis did not finish because the job was terminated, a checkpoint system could be implemented. Intermediate results can be written to a file which is always moved together with its input file by the Observer. This way an aborted analysis need not start from the beginning.
• Currently, the Observer's state has to be checked manually to ensure that it is still running. An alerting system might be implemented by setting up an external watchdog.
• In case a program error occurs during analysis, it would be helpful to report it (e.g. via email).
• If an error occurs during the analysis, the input file concerned would be analyzed again and again. Besides error reporting, the Observer could keep track of erroneous input files and put them into a separate directory to prevent too much overhead.
• Currently, all results are assumed to be independent of each other. But in case of the Beesbook tracking problem, it could be beneficial to take the results of images from previous time frames into account. Thereby, the last position of an individual could be used to locate it faster in the new image. To make that possible, some kind of processing order would have to be enforced by the Observer. Furthermore, the respective result files would have to be supplied properly.

Appendices

A. Calculations and Tables

Here I include some of the calculation tables I generated during the research process. All calculations were made using the following parameters: number of cameras: 4; frame rate: 4 fps; experiment length: 60 days. The resulting number of images is:

4 cameras · 4 fps · 86,400 s/d · 60 d = 82,944,000 frames (A.1)

A.1. Calculation Tables

A.1.1. Image Sizes and Bandwidth

This table shows the produced bandwidth and the total size of the dataset at the end of the experiment, depending on the size per image (second column).

Figure A.1.: The bandwidths and total sizes corresponding to certain JPEG quality levels

A.1.2. HDD Capacities

Depending on the image quality (i.e. the respective image size), this table shows the time until a hard disk of the given capacity runs full during the experiment.

Figure A.2.: The time capacities of differently sized hard disks

A.1.3. Maximal Parallelization

Depending on the bandwidth available between the supercomputer and the data storage, as well as the processing time, the maximal number of concurrent computations can be limited to relatively small values. Generally, this happens when the transfer time of one image dominates its processing time.
This means that only a small number of images can be transferred while processing one image, which is then the maximal number of images processed concurrently. The table shows values for a processing time of 294 s per image (for the calculation of the processing time see A.2.3).

Figure A.3.: The maximal number of concurrent computations

A.1.4. Needed NPL

This table shows the expected amount of NPL needed for the analysis of the data depending on the runtime per image. The formula is as follows:

NPL = [runtime per image] · [number of images] / (3600 s/h) / (24 cores per node) · 2 NPL per node-hour

Figure A.4.: NPL need depending on runtime per image

A.2. Additional Calculations

A.2.1. FileSystemWatcher Event Buffer Size

Each event uses 16 Byte of memory (according to Microsoft), excluding the filename. Since the organization of the recorded data is not specified yet, it is unclear how long the image filenames (of the form *_$id.png) will exactly be. At least, the image names need to be unique inside each archive. Because the archive size will not exceed 1000 images (see section 3.3.3), the part before the underscore will consist of at least three digits. This results in a total filename length of nine characters, and hence one event can consume up to 25 Byte. Consequently, an event buffer of 64 KB can hold 65536 / 25 = 2621 events. Other sources [4] state that one event's size is 12 Byte + 2 · |file path|, which would result in an event queue capacity of approx. 1638 events, assuming the file path contains 14 characters (the shortest possible path would be something like "G:\b\").

A.2.2. Archive Size

The maximal number of images per archive depends on the image size (the test image is 4210 KB), the Internet bandwidth (approx. 100 MB/s), the bandwidth of the recorded data (65.8 MB/s) and the available disk space (15 GB RAM–disk). The point is that the system must not run out of disk space while transferring the archives. We have to assume that all jobs transfer their archives at the same time, while the cameras continue to produce data. If the Internet bandwidth and the bandwidth of the recorded data were equal, the needed memory would amount to two times the archive size: the archive itself and the images newly created while transferring the archive. The formula for the maximal archive size in images (including the factor for the four cameras) hence is:

15 GB / (2 · 4 · 4.11 MB) ≈ 467

If the recorded bandwidth is smaller than the Internet bandwidth, this value would slightly increase. Yet, tests showed (see the evaluation chapter) that the transfer throughput is already stable enough at an archive size of 190 (corresponding to 781.15 MB per archive).

A.2.3. Processing Times

Since the software decoder is still being developed, the runtime can only be estimated. Furthermore, the available test images contained only a small number of tags. Consequently, I measured the runtime with an image (see fig. A.5) containing 17 visible tags and extrapolated it to the runtime of an image containing 250 tags (the average number of tags in one image). The lack of images containing a larger number of tags is due to the difficult tagging process and a yet nonexistent synthesis of images. The runtime of the test image amounts to 20 s (measured on Cray). Hence, the runtime per image will be

20 core seconds / 17 tags · 250 tags ≈ 294 core seconds. (A.2)

The overall runtime will hence amount to

294 core seconds · 82,944,000 images ≈ 773.57 core years. (A.3)
Figure A.5.: The image used for benchmarking

This corresponds to about 4.3% of the HLRN's yearly capacity (about 17856 core years), or 17.3% of its quarterly capacity (about 4464 core years).

A.2.4. Number of Batch Jobs

The maximum runtime for one job is twelve hours. Depending on how many nodes we can use per job, the overall number of jobs will range between the following values. The batch system allows 256 nodes to be used per job, which represents an upper bound. A much more realistic number of nodes is 50 or even less (complying with the Sandkornprinzip, see figure 4.1). The runtime per job will most likely also be much less than twelve hours, which increases the number of jobs even further. With 1 core year ≈ 8760 core hours:

773.57 core years / (12 h/job · 24 cores/node · 1 node) ≈ 23530 jobs
773.57 core years / (12 h/job · 24 cores/node · 50 nodes) ≈ 471 jobs
773.57 core years / (12 h/job · 24 cores/node · 256 nodes) ≈ 92 jobs

A.2.5. Work Directory Structure

The Beesbook work directory is structured in a way that allows the use of multiple job slots. For one job slot the image data is distributed among the process directories.

Beesbook_Work Directory
  BbMPI_jobScript_Slot_0 ... BbMPI_jobScript_Slot_<n+1>
  Beesbook.log
  Beesbook.status
  image_heap
  job_outputs
  results
  job_slots
    data_slot_0 ... data_slot_<n+1>
      proc_0 ... proc_<c>

Figure A.6.: The structure of the Beesbook work directory. Files have round boxes, directories have rectangular boxes.

Explanation:

• The BbMPI_jobScript_Slot_n files are the job scripts for each job slot. They are generated by the Observer when initializing the work directory.
• The image_heap directory contains the extracted images.
• The job_outputs directory contains the command line output of the batch jobs (e.g. for error reporting).
• The results directory contains the result files of the image analysis.
• The job_slots directory contains the image data for the batch jobs. Each data_slot directory contains one proc_ directory for each process in the batch job. The image blocks are provided here for their respective job.

CD Content

There are three directories on the CD. The Beesbook Observer directory contains the implementation of the parallelization. The Data Transfer directory contains the transfer script, as well as a test script. Additionally, all executables and libraries needed for the transfer (7zip and WinSCP) are included. The Thesis directory again contains two directories. The Document directory contains the LaTeX source of this thesis and all images used in it. The Evaluation directory contains the protocols of the Observer tests.

B. Glossary

Cray The Cray XC30 is the supercomputer system housed at HLRN.

FTP File Transfer Protocol – An unencrypted protocol for file transfers over the internet.

HLRN Norddeutscher Verbund für Hoch- und Höchstleistungsrechnen (North German alliance for high- and highest-performance computing).

MPI Acronym for Message Passing Interface; a library for interprocess communication. It provides an abstraction layer which hides the work on the network layer. It also brings various convenience functions, for example, to determine a process's ID among all other processes.

RAID 6 Acronym for Redundant Array of Independent Disks; an organization approach for hard disks to introduce data redundancy and performance improvement.
RAM–Disk A portion of the main memory (RAM) is allocated and can be used like a hard disk, thereby significantly reducing data access times.

SCP Secure CoPy – An encrypted protocol for file transfers over the internet.

B.1. Units

fps Acronym for frames per second.

KB, MB, GB, TB, PB Kilobyte (equals 1024 Bytes), Megabyte, Gigabyte, Terabyte, Petabyte.

Wall–clock Amount of time a batch job reserves its resources for.

NPL Acronym for Norddeutsche Parallelrechner-Leistungseinheit; one Cray compute node costs 2 NPL per hour.

Bibliography

[1] Beesbook website. http://beesbook.mi.fu-berlin.de/wordpress.
[2] FileSystemWatcher — Internal Buffer Size. http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.internalbuffersize(v=vs.110).aspx.
[3] HLRN–III User Documentation. https://www.hlrn.de/home/view/System3/WebHome.
[4] Is it really that expensive to increase FileSystemWatcher internal buffer size? http://stackoverflow.com/a/13917670/909595.
[5] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. A View of Cloud Computing. Commun. ACM, 53(4):50–58, April 2010.
[6] Benjamin Benz. Pfeilschnell. Die dritte USB-Generation liefert Transferraten von 300 MByte/s. c't, 2008. ISSN 0724-8679.
[7] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107–113, January 2008.
[8] M. Flynn. Some Computer Organizations and Their Effectiveness. IEEE Transactions on Computers, C-21(9):948–960, September 1972.
[9] Balduin Laubisch. Automatic Decoding of Honeybee Identification Tags from Comb Images. Bachelor Thesis, Freie Universität Berlin, 2014.
[10] Danielle P. Mersch, Alessandro Crespi, and Laurent Keller. Tracking Individuals Shows Spatial Fidelity Is a Key Regulator of Ant Social Organization. Science, 340(6136):1090–1093, 2013.
[11] Wolfgang Pyszkalski. Technischer Bericht TR13-16 — Übersicht über die Datenhaltung im ZIB und die Möglichkeiten einer Nutzung durch Projekte, 2013.
[12] Roger A. Morse and Nicholas Calderone. The Value of Honey Bees as Pollinators of U.S. Crops in 2000. 2000. http://www.utahcountybeekeepers.org/Other%20Files/Information%20Articles/Value%20of%20Honey%20Bees%20as%20Pollinators%20-%202000%20Report.pdf.
[13] T.D. Seeley. The Wisdom of the Hive. Harvard University Press, 1995.
[14] F.J. Seinstra, D. Koelma, and J.M. Geusebroek. A Software Architecture for User Transparent Parallel Image Processing. Parallel Computing, 28(7–8):967–993, 2002.
[15] Anand Lal Shimpi. AS-SSD Incompressible Sequential Performance (Samsung SSD 840 Pro (256GB) Review). 2012. http://www.anandtech.com/show/6328/samsung-ssd-840-pro-256gb-review/2.
[16] Bryan Walsh. Beepocalypse Redux: Honeybees Are Still Dying – and We Still Don't Know Why. 2013. http://science.time.com/2013/05/07/beepocalypse-redux-honey-bees-are-still-dying-and-we-still-dont-know-why/.
[17] Wikipedia. List of Crop Plants Pollinated by Bees — Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=List_of_crop_plants_pollinated_by_bees&oldid=585911208, 2013. [Online; accessed 19-March-2014].
[18] Dusan Zivadinovic. Selbst ist der Spiderman. Netzausbau: Weitere Räume und Gebäude ans LAN anbinden. c't, 2008. ISSN 0724-8679.