A Cover Sheet

Please replace this page with cover sheet.

CISE Cross-Cutting Programs: FY 2010 Data Intensive Computing
Program Solicitation # 09-558
Title: DC:Small:Collaborative Research:DARE: Declarative and Scalable Recovery
PI: Joseph M. Hellerstein, Professor, University of California, Berkeley. E-Mail: hellerstein@cs.berkeley.edu
Co-PI: Andrea C. Arpaci-Dusseau, Professor, University of Wisconsin, Madison. E-Mail: dusseau@cs.wisc.edu

B Project Summary

The field of computing is changing. In the past, improving the computational performance of a single machine was the key to improving application performance. Today, with the advent of scalable computing, improving application performance can be as simple as adding more machines; systems are now capable of scaling and redistributing data and jobs automatically. But a new challenge has come: with many thousands of devices, failure is not a rarity but a commonplace occurrence. As failure becomes the norm, when data is lost or availability is reduced, we should no longer blame the failure, but rather the inability to recover from the failure. Thus, we believe a key aspect of system design that must be scrutinized more than ever before is recovery.

To address the challenges of large-scale recovery, in this proposal we describe the Berkeley-Wisconsin Declarative and Scalable Recovery (DARE) Project, as part of our vision of building systems that "DARE to fail." To reach this vision, we will proceed in three major directions. The first is offline recovery testing. As thousands of servers produce "millions of opportunities" for component failures each day [79], our focus is to inject multiple failures, including rare combinations, such that recovery is extensively exercised. The second is online recovery monitoring. "Surprising" failures take place in deployment [79], and hence more recovery problems often appear during actual deployment. Thus, systems should have the ability to monitor recovery online and alert system administrators when recovery bugs are observed. Finally, as we expect system builders to learn from failures over time and refine recovery continuously ("try again, fail again, fail better" [87]), we believe system builders need new approaches to design and robustly implement recovery. Thus, we plan to design an executable recovery specification: a declarative language for writing recovery specifications that can be translated into executable protocols.

A.1 Intellectual Merit

Intellectual merit and importance: The DARE project will advance the state of knowledge in large-scale storage systems in three fundamental ways. First, by introducing many (ideally all) possible failures, we will understand the deficiencies of today's systems and advance the state of the art of large-scale recovery. Second, we will explore the design space of a new paradigm: online failure scheduling and recovery monitoring. Finally, we will demonstrate the utility of declarative languages for specifying and implementing various aspects of recovery management.

Qualifications: We believe we are well positioned to make progress on this demanding problem, having assembled different sets of expertise. Professor Joseph Hellerstein is a leader in the application of data management concepts to system design, and this project will leverage his expertise in declarative programming [23, 54, 58, 108, 109, 111], distributed monitoring [92, 93, 136], and scalable architectures [19, 55, 72, 94, 110, 142, 157].
Professor Andrea Arpaci-Dusseau is an expert in file and storage systems. This project will leverage her expertise in storage system reliability [33, 35, 76, 77, 78, 104, 133, 145] and high-performance storage clusters [25, 27, 39, 40, 75, 156].

Organization and access to resources: From an organizational viewpoint, our goal is to perform "low-cost, high-impact" research. Hence, the bulk of the funding requested within this proposal is found in human costs; we will leverage donations from industry for much of the infrastructure.

A.2 Broader Impacts

Advancing discovery while promoting teaching, training, and learning: In general, we work to give students hands-on training with cutting-edge systems technology. We also plan to incorporate our research directly into undergraduate and graduate courses (as we have done in the past), and develop the next generation of engineers who are critical to the future of science and engineering in our country.

Enhancing infrastructure for research and disseminating results: We plan to disseminate the results of our research in three ways: through the classic medium of publication, which in the past has impacted the design and implementation of various storage systems including the EMC Centera [75], NetApp filers [104], and Yahoo cloud services [59]; through the development of numerous software artifacts, which we have shared with the open source community, parts of which have been adopted, for example, into next-generation Linux file systems [133] and MySQL [147]; and finally, through our work with various industry partners to help shape their next-generation storage systems.

Benefits to society: We believe the DARE project falls along the directions set by federal agencies; a recent HEC FSIO workshop declared "research in reliability at scale" and "scalable file system architectures" as topics that are very important and greatly need research [36]. We also believe that our project will benefit society at large; in the near future, users will store all of their data (emails, work documents, generations of family photos and videos, etc.) on the Internet. As John Sutter of CNN put it: "This is not just data. It's my life. And I would be sick if I lost it" [149]. Unfortunately, data loss still happens in reality [113, 118, 121]. Through the DARE project, we will build the next generation of large-scale storage systems that will meet the performance and reliability demands of current society.

Keywords: scalable recovery, declarative recovery, parallel file systems, cloud computing, testing, online monitoring.

C Table of Contents

(This page will be automatically generated)

D Project Description

D.1 Introduction

Three characteristics dominate today's large-scale computing systems. The first is the prevalence of large storage clusters. Storage clusters at the scale of hundreds or thousands of commodity machines are increasingly being deployed. At companies like Amazon, Google, Yahoo, and others, thousands of nodes are managed as a single system [3, 38, 73]. This first characteristic has empowered the second one: Big Data. A person on average produces a terabyte of digital data annually [112], a scientific project can capture hundreds of gigabytes of data per day [141, 152], and an Internet company can store multiple petabytes of web data [120, 130]. This second characteristic attracts the third: large jobs.
Web-content analysis has become popular [57, 130], scientists now run large numbers of complicated queries [85], and it is becoming typical to run tens of thousands of jobs on a set of files [20, 51].

Although large clusters have brought many benefits, they also bring a new challenge: a growing number and frequency of failures that must be managed [18, 42, 79, 87]. Bits, sectors, disks, machines, racks, and many other components fail. With millions of servers and hundreds of data centers, there are "millions of opportunities" for these components to fail [79]. Failing to deal with failures will directly impact the reliability and availability of data and jobs. Unfortunately, we still hear data-loss stories even recently. For example, in March 2009, Facebook lost millions of photos due to simultaneous disk failures that "should" rarely happen at the same time [118] (but it happened); in July 2009, a large bank was fined a record total of £3 million after losing data on thousands of its customers [121]; more recently, in October 2009, T-Mobile Sidekick, which uses Microsoft's cloud service, also lost its customer data [113]. These incidents show that existing large-scale storage systems are still fragile in the face of failures.

We believe a key aspect of system design that must be scrutinized more than ever before is recovery. As failure becomes the norm, when data is lost or availability is reduced, we should no longer blame the failure, but rather the inability to recover from the failure. Although recovery principles have been highlighted before [127], they were proposed for much smaller settings [42]. At a larger scale, we believe recovery has more responsibilities: it should anticipate not only all individual failures but also rare combinations of failures [79]; it must be efficient and scale to a large number of failures; it must also consider rack-awareness [43] and geographic locations [61]. In short, as James Hamilton, Vice President and Distinguished Engineer at Amazon Web Services, suggested: "a rigorously specified, scalable form [of recovery] is very much needed" [42]. This leaves many pressing challenges for large-scale system designers: What should recovery look like for scalable clusters? What are the possible combinations of failures that the system should anticipate? In what ways should recovery be formally specified? How should recovery specifications be checked?

To address the challenges of large-scale recovery, in this proposal we describe the Berkeley-Wisconsin Declarative and Scalable Recovery (DARE) Project, wherein we want to (1) seek the fundamental problems of recovery in today's scalable world of computing, (2) improve the reliability, performance, and scalability of existing large-scale recovery, and (3) explore formally grounded languages to empower rigorous specification of recovery properties and behaviors. Our vision is to build systems that "DARE to fail": systems that deliberately fail themselves, exercise recovery routinely, and enable easy and correct deployment of new recovery policies.

There are three major thrusts in this project, which together form what we call the DARE iterative lifecycle, as depicted in Figure 1. First, we begin our work with offline recovery testing of large-scale file systems. Large-scale recovery is inherently complex as it has to deal with multiple failures, including the rare combinations against which recovery is rarely tested.
Furthermore, correct recovery decisions must be made based on many metrics such as system load, priority, location, cost, and many more. Thus, by testing recovery extensively and learning from the results, we can sketch out the fundamental principles of recovery in the context of large-scale file systems.

To complement offline testing, our second thrust is online recovery monitoring. In actual deployment, recovery is faced with more scenarios that might not have been covered in offline testing. Thus, interesting problems appear when the system is deployed in a large cluster of hundreds or thousands of machines [79, 87]. In fact, we have observed that system builders learn new issues from real-world deployment [81]. Therefore, we believe there is a need for an online recovery monitoring framework with the ability to monitor recovery in action and alert system administrators when recovery bugs are observed.

Figure 1: The DARE lifecycle: offline testing, online monitoring, and executable specification, iterating around the system.

Finally, in our third thrust, we advocate executable recovery specification. We believe that system builders can benefit greatly by using declarative languages to specify recovery policies in a manner that is both formally checkable and executable in the field. We expect system builders to learn from failures over time and refine recovery continuously ("try again, fail again, fail better" [87]). Declarative recovery specifications can allow system builders to analyze the correctness and scope of their code, and easily prototype different recovery strategies by focusing on the high-level invariants maintained by their recovery protocols. With executable specification, recovery code becomes a form of explicit documentation, and this documentation is also precisely the running code that enforces the specification. Finally, a rich tradition in "shared-nothing" parallelization of declarative languages suggests that this language design will promote scalable recovery.

We plan to apply all of the above to two classes of widely used large-scale file systems: Internet service ("cloud") file systems [6, 9, 14, 15, 16, 43, 69] and parallel file systems [8, 44, 50, 155]. For this project, we will focus on one file system in each class: the Yahoo Hadoop File System (HDFS) [43] and the Sun Lustre Parallel File System [44]. We note that the existence of these open-source systems brings great opportunities for researchers to analyze the design space of large-scale recovery. In return, our contributions will also be directly useful to them.

As illustrated in Figure 1, we see the three phases of the DARE lifecycle as an iterative, pay-as-you-go approach. That is, we believe the lessons learned from each phase will benefit the other phases in parallel. Thus, our plan is to rapidly prototype and improve all of them hand-in-hand. We anticipate specific major contributions in each of our thrusts:

•Offline Recovery Testing: To emulate failures, we will automatically insert failure points using recent advances in Aspect-Oriented Programming [56]. For example, we have used AspectJ [5] to insert hundreds of fault-injection hooks around all I/O-related system calls without intruding on the HDFS base code. Since our goal is to exercise various combinations of failures, we will develop methodologies to intelligently sample the large test space. We also plan to explore the idea of declarative testing to explicitly specify which important fault risk scenarios to prioritize. We expect to uncover reliability, performance, and scalability bugs.
We will submit this aspect-oriented test harness and our findings to the HDFS and Lustre communities.

•Online Recovery Monitoring: We plan to extend existing scalable monitoring tools [59, 116, 134] to monitor detailed recovery actions. To infer the high-level states of the system being monitored, these tools depend on log messages generated by the system. Thus, to infer high-level recovery actions, one challenge that we will deal with is log provenance (i.e., we need to know which log messages are in the context of recovery, and furthermore, due to which specific failures). We will also develop declarative analyses to help system administrators easily declare what they intend to monitor and analyze.

•Executable Recovery Specification: Hellerstein's group has a great deal of experience in the design and use of declarative languages like Overlog to build full-function distributed systems [23, 107]. With this experience and the active learning done in the first two phases above, we will develop a domain-specific language for recovery specifications, which will then be directly translated into executable protocols. With executable specification, we also expect to be able to specify recovery performance and scalability goals, and to generate code that meets those goals.

In the remainder of this proposal, we present the extended motivation for DARE. We then give an overview of the DARE project, and present the details of each component of our research. We continue by describing our research plan and the educational impact of our work. We conclude by presenting related work and our prior funded efforts.

D.2 Extended Motivation

At the heart of large-scale computing are the large-scale storage systems capable of managing hundreds or thousands of machines. Millions of users fetch data from these systems, and computations are "pushed" to the data inside the systems. Thus, they must maintain high data availability; when failures occur, data recovery must be correct and efficient. In this section, we estimate how often data recovery takes place in large-scale storage settings. Then, we present our initial study of large-scale recovery from real-world deployment. Finally, we present extended motivation for the three thrusts of the DARE project.

D.2.1 How Often Does Data Recovery Take Place?

Data recovery can be triggered by whole-disk failure. A study of 100,000 disks over a period of five years by Schroeder et al. showed that the average percentage of disks failing per year is around 3% [138]; they also found that a set of disks from one manufacturer had a 13.5% failure rate. Google also released a similar rate: 8.6% [129]. All these failure rates suggest that a 1-PB cluster might recover between 30 and 135 TB of data yearly (90 to 400 GB daily).

Disks can also fail partially: a disk might be working but some blocks may not be accessible (e.g., due to latent sector errors [96, 97]), or the blocks are still accessible but have been corrupted (e.g., due to hardware and software bugs [52, 67, 106, 148, 158]). Bairavasundaram et al. found that a total of 3.45% of 1.53 million disks in production systems developed latent sector errors [32]; some "bad" disk drives had more than 1000 errors. In the same population, they also found that 400,000 blocks were corrupt [33]. If an inaccessible or corrupt block stores metadata, the big data sitting behind the metadata can become unreachable, and thus orders of magnitude more data needs to be regenerated than what was actually lost.
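To make the arithmetic behind the whole-disk estimate above explicit, the small sketch below recomputes the yearly and daily recovery volumes from the cited failure rates. It is an illustrative calculation only, under the simplifying assumption that the amount of data to re-replicate tracks the fraction of disks that fail per year.

// Illustrative only: yearly and daily recovery volume for a 1-PB cluster,
// assuming re-replicated data is proportional to the annual disk failure rate.
public class RecoveryVolumeEstimate {
    public static void main(String[] args) {
        double clusterTB = 1024.0;                                  // 1 PB expressed in TB
        double[] annualDiskFailureRates = { 0.03, 0.086, 0.135 };   // rates cited above
        for (double rate : annualDiskFailureRates) {
            double tbPerYear = clusterTB * rate;                    // data to regenerate per year
            double gbPerDay = tbPerYear * 1024.0 / 365.0;           // spread evenly over a year
            System.out.printf("rate %.1f%% -> %.0f TB/year, %.0f GB/day%n",
                    rate * 100.0, tbPerYear, gbPerDay);
        }
    }
}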
Disk failures are not the only triggers of recovery; as disks are placed underneath layers of software (e.g., OS, device drivers, device firmware), an error in one layer can make the system crash or hang, and hence data becomes unavailable although the disks are working fine [71, 139]. Amazon has seen this in practice; a firmware bug "nuked" a whole server [87]. Other hardware failures such as network and power-cord failures can also bring down data availability. As human reaction is slow [87], some machines could be offline and unattended for a while, and hence unable to send "I'm alive" heartbeat messages for a long period of time. In this case, in some architectures such as HDFS, the master server will treat these machines as dead and regenerate the data stored in these unattended machines. In summary, as failure is a commonplace occurrence, system builders should not disregard the veracity of Murphy's Law: "anything that can go wrong will go wrong."

D.2.2 What Are The Issues of Large-Scale Recovery?

To understand the fundamental problems of large-scale recovery, we have performed an initial study of real-world issues faced in the deployment of a large-scale file system, the Hadoop File System (HDFS) [43]. We picked HDFS as it has been widely deployed in over 80 medium to large organizations including Amazon, Yahoo, and Facebook, in the form of 4- to 4000-node clusters [3]. HDFS is a complex piece of code (25 KLOC) and serves as a foundation for supporting Hadoop MapReduce jobs; it keeps three replicas of data, and computations are typically pushed to the data. We have studied almost 600 issues reported by the HDFS developers [81] and found that 44 are about recovery. Below, we summarize some of our interesting findings.

•Too-aggressive recovery: HDFS developers have observed a whole-cluster crash (600 nodes) caused by simultaneous failures of only three nodes [83]. This happened because the cluster was so aggressive in regenerating the lost copies from the three dead nodes that some heartbeats from the healthy nodes were "buried" in this busy recovery. As the heartbeats from some healthy nodes could not reach the master node, these healthy nodes were also considered dead, which then caused the cluster to regenerate more data as it saw more dead nodes. From this point on, the cluster was constantly regenerating more data and losing more nodes, and it finally became dysfunctional.

•Too-slow recovery: There was a case where recovery was very slow because the replication procedure was coupled with the wrong heartbeat [84], and hence the recovery ran too slowly even though resources such as network bandwidth were free. As a result, data recovery of one failed node took three hours, although it should have taken only one hour.

•Coarse-grained recovery: A "tiny" failure can make the whole cluster unavailable if recovery is not fine-grained. As an example, HDFS refused to start even when there was only one bad file entry, making the whole cluster unusable [82].

Since we believe that there are more problems that have not been reported, we have also manually injected some hand-picked data failures into HDFS and found further interesting issues, as listed below. Again, these findings point to the fact that recovery is often under-specified and hence must be rigorously tested and monitored.

•Expensive recovery: We found that much important metadata is not replicated, such that the loss of a single piece of metadata results in an expensive recovery.
One example is the non-replicated local directory block, which potentially stores references to thousands of large files; the loss of a 4-KB directory block will result in hundreds of GB being regenerated. Another example is the non-replicated version file kept on each machine. If the version file of a node gets corrupted or becomes inaccessible, GBs or TBs of data in the node must be regenerated, although the node is perfectly healthy.

•Late recovery: We also identified a late recovery due to delayed reporting of a data failure, as illustrated in Figure 2. When a job runs on a bad copy of a file (in Node 1), HDFS forgets to directly report the bad copy to the master server. The bad copy is only reported by a background scan that runs at a very slow rate (e.g., hourly). As a result, jobs wastefully run on the bad copy (dashed lines) only to find themselves restarted on the other good copies (solid lines).

Figure 2: Four restarted jobs due to a late reporting of data failure.

D.2.3 Why Offline Recovery Testing?

Recovery should definitely be reliable. However, in the scalable world of computing, recovery must also scale in three dimensions: number of failures, size of data, and number of jobs dependent on the data. To achieve these, recovery must be fast, efficient, and conscious of the jobs running on the system. If it is not fast, recovering from a large number of failures will not scale in time. If it is not efficient (e.g., recovery generates much more data than what was lost), recovering from a big data loss will not scale in size. If it is not conscious of the scale of the jobs running on each piece of data, a large number of jobs will experience performance degradation (e.g., if the loss of a popular file is not prioritized).

Evaluating whether existing recovery strategies are robust and scalable is a hard problem, especially when recovery protocols are often under-documented. As a result, component interactions are hard to understand and system builders tend to unearth important design issues "late in the game" [42]. Thus, we plan to apply an extensive fault-injection technique that enables us to insert many possible failures, including the rare combinations, and thus exercise many recovery paths. Unlike previous work on testing, we also plan to explore the idea of declarative testing to intelligently sample important fault risk scenarios from the large test space; for example, we might wish to run a testing specification such as "Combinations=1 & Workload=STARTUP" (i.e., insert all possible single failures only in the reboot process).

Hypothesis 1: Unlike small-scale recovery, large-scale recovery must be rigorously tested against various combinations of failures, including the rare ones.

D.2.4 Why Online Recovery Monitoring?

The bug reports we examined for HDFS reflect that system builders learn new issues from real-world deployment; interesting problems appear when the system is deployed in a big cluster of hundreds or thousands of machines. This is because new solutions tend to be tested only in small settings. Therefore, we envision that failures should be deliberately scheduled during runtime, and furthermore, we believe there is a need for an online framework that monitors recovery actions live and reasons about their correctness and scalability. More specifically, we expect system builders and administrators to ask high-level analysis questions such as: "How many bytes are transferred when a large number of disks fails?
How are jobs affected when a replica is missing? How long does it take to recover a missing replica when the system is 90% busy?" Thus, we plan to build an online monitoring framework in which system builders or administrators can declare what they intend to monitor and analyze (e.g., in the form of declarative queries).

Hypothesis 2: More recovery problems will be uncovered if failures are deliberately injected during actual deployment, and if the corresponding recovery reactions are monitored and analyzed online.

D.2.5 Why Executable Recovery Specification?

Testing and analysis alone cannot ensure high-quality, robust, and scalable storage systems; the resulting insight must be turned into action. However, fixing recovery subsystems has proven hard for several reasons. First, recovery code is often written in low-level system languages (e.g., C++ or Java) in which a recovery specification is not easy to express. As a pragmatic result, recovery is often under-specified; the larger the space of failure scenarios, the harder it is to formally check whether all scenarios have been handled properly. Second, low-level system languages tend to make recovery code scattered [76, 78, 133]; a simple change must be applied in many places. Third, although many failure scenarios may be considered in an initial design, it is inevitable that "surprising" failures take place in deployment [79]; as developers unearth new issues, ad-hoc patches are applied, and over time the code becomes unnecessarily complex. Finally, the design space of large-scale recovery is vast, involving many metrics and components. Yet, it is common that early releases do not cover the entire design space.

Ultimately, we believe recovery code must evolve from low-level system languages (e.g., C++ or Java) into higher-level specifications. Declarative language approaches are attractive in this regard. With recovery specified declaratively, one can formally check its correctness, add new specifications, and more easily evaluate performance-reliability-scalability tradeoffs of different approaches. We note that the culture of large-scale recovery programming is to start with "simple and nearly stupid" strategies [79]. This is because additional optimizations tend to introduce more complexity but do not always bring orders of magnitude of improvement. In contrast, we believe declarativity will enable developers to explore a broad spectrum of recovery strategies (even complex ones). Moreover, with declarativity, we believe we can parallelize recovery tasks explicitly, and hence radically improve recovery performance and scalability. In addition to specifying recovery declaratively, our goal is also to translate the specifications into executable protocols so that system developers do not have to write the code twice. Recently, we have used a declarative data-centric programming language to rapidly build cloud services (e.g., MapReduce, HDFS) [23]. We will use this experience to design a domain-specific language for writing executable recovery specifications.

Hypothesis 3: A declarative language for recovery specifications will lead to more reliable and manageable large-scale storage systems.

D.2.6 Summary: Declarativity and Iterative Lifecycle

One central theme of DARE is the use of declarative languages. Having presented the various uses of declarativity in our DARE lifecycle, we highlight again its four important benefits that greatly fit the context of large-scale recovery.
First, declarative languages enable natural expression of system invariants without regard to how or whether they are maintained. For example, in the analysis phase, system administrators can declaratively express "healthy" and "faulty" scenarios without needing to know the detailed implementation of the system. Second, declarative expressions can be directly executable. Thus, not only can recovery be written down as a form of documentation, but the documentation also becomes precisely the running code that enforces the specification. This corresponds to a direct specification of the distributed systems notion of "Safety" (as in "Safety and Liveness"). Third, declarative languages tend to be data-centric. This fits well for storage systems, in which the bulk of the invariants concern ensuring correct, consistent representations of data via data placement, movement, replication, regeneration, and so on. Finally, work in the database community has shown that declarative languages can naturally be scaled out via data parallelism. This fits well for large-scale recovery, where recovery must be scalable.

Another central theme of DARE is the iterative, pay-as-you-go nature of its three phases. In the testing and analysis phases, we will improve the recovery of existing code bases (i.e., HDFS and Lustre) and submit our refinements to the corresponding communities. At the same time, we will start formulating fundamental principles of large-scale recovery and set them down as executable specifications. This process in effect forms a kind of N-version programming [29]. That is, the executable specification becomes an evolving, formal document of the base code's properties, and hence can be tested for compliance with the base code as both evolve. Iteratively, as the specification becomes more powerful, it can guide and improve the efficiency of the testing phase. For example, by leveraging the logical constraints defined in the specification, we can exclude impossible or uncorrelated combinations of failures during the offline testing phase. Thus, we believe the DARE lifecycle is a powerful software-engineering approach to recovery-oriented enrichment of existing systems.

D.3 The Berkeley-Wisconsin DARE Project

The main goal of the DARE project is to build robust and scalable recovery as part of the next generation of large-scale storage systems. To get there, recovery must become a first-class component of today's large-scale computing, and hence must be tested and analyzed rigorously. Furthermore, as recovery is inherently complex in large-scale settings, we believe declarative approaches must be explored in this context. We believe the collaboration we have assembled is well positioned to make progress on this demanding problem, as we draw on expertise in storage system reliability [33, 35, 76, 77, 78, 104, 133, 145], high-performance storage clusters [25, 27, 39, 40, 75, 156], distributed monitoring [92, 93, 136], declarative programming [23, 54, 58, 108, 109, 111], and scalable architectures [19, 55, 72, 94, 110, 142, 157]. By having students each work under multiple advisors with different expertise, we will build in a structural enforcement of the interdisciplinary nature of this proposal; students will thus learn techniques and knowledge from a much broader range of topics, and be able to solve the problems in question. We now discuss the challenges of each component of DARE, along with our concrete plans.
We begin with offline recovery testing, progress to online recovery monitoring, and finally conclude the research portion of this proposal with executable recovery specification.

D.4 Offline Recovery Testing

In this section we describe our approach to offline recovery testing. Our goal is to extensively test recovery against various combinations of failures such that recovery bugs can be discovered "early in the game." We begin by describing our fault model and our technique for inserting fault-injection hooks. We then discuss the challenges of covering the failure sample space. Finally, we describe our intention to explore the idea of declarative testing specifications to better drive the testing process.

D.4.1 Fault Model and Failure Points

Users store data. Jobs run on data. Reboot needs data. Thus, our initial focus is on the storage fault model. Our fault model will range from coarse-grained faults (e.g., machine failure, whole-disk failure) to fine-grained ones (e.g., latent sector errors, byte corruption). We note that recovery from coarse-grained failures has been examined rigorously in the literature and in practice. For example, a job running on a failed machine is simply restarted [62]; a lost file is simply regenerated from the surviving replicas. However, fine-grained fault models are often overlooked. For example, Amazon's S3 storage system had an outage for much of a day when a corrupted value "slipped" into its gossip-based subsystem [42]. Thus, as Hamilton suggested, we should mimic security threat modeling, which considers each possible security threat and implements adequate mitigation [79]. We believe the same approach should be applied in designing fault resiliency and recovery.

After defining the storage fault model, the first challenge is to decide how to emulate the faults in the model. We have decided to add a fault-injection framework as part of the system under test (i.e., a white-box framework). We believe this integration is necessary to enable a large number of analyses; having internal knowledge of the system will allow us to test many components of the system. We will start by placing failure points around system calls that interact with local storage (e.g., read, write, stat). To do this without making the original code look "ugly", we plan to use recent advances in aspect-oriented programming (AOP) [56]; the HDFS developers have successfully used AspectJ [5], the AOP extension for Java, as a tool to inject failures [154]. With AspectJ, we have easily identified around 500 I/O-related failure points within HDFS. We plan to do the same for the Lustre file system, using AspectC [4]. As we test the HDFS base code, we will also test our declarative version of HDFS (BOOM-FS) hand-in-hand [23]. BOOM-FS is composed of the Overlog declarative language and its Java runtime support, the Java Overlog Library (JOL). Thus, we will also use AspectJ to insert failure points in JOL. BOOM-FS and JOL are explained further in Section D.6.1.

Task 1: Add a white-box fault-injection framework to each system we plan to test (i.e., HDFS, Lustre, and BOOM-FS) with aspect-oriented technology.

D.4.2 Coverage Scheduling

Once failure points are in place, the next challenge revolves around the three dimensions of failure scheduling policies: what failures to inject, and when and where to inject them. We translate these challenges into the problems of failure, sequence, and workload coverage.
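Before turning to coverage, the following sketch illustrates the kind of decision logic a single failure point (Section D.4.1) might carry. It is a hypothetical illustration only: the class names (FailurePoint, FailurePlan) are ours, not HDFS code, and in the actual harness such hooks are woven around I/O calls with AspectJ rather than written into the base code.

import java.io.IOException;

// Hypothetical sketch of one failure point: a hook placed just before a real
// I/O call, which consults the test driver's plan to decide whether to fail.
final class FailurePoint {
    private final String id;          // e.g., "DataNode.write:metaFile" (illustrative)
    private final FailurePlan plan;   // the failure scenario selected by the test driver

    FailurePoint(String id, FailurePlan plan) {
        this.id = id;
        this.plan = plan;
    }

    // Called immediately before the wrapped I/O proceeds.
    void maybeFail() throws IOException {
        if (plan.shouldFail(id)) {
            // Emulate a failed system call; other fault types (corruption,
            // hangs, partial writes) would be emulated analogously.
            throw new IOException("injected fault at " + id);
        }
    }
}

// The plan abstracts "what to inject, when, and where"; concrete plans would
// implement the coverage strategies discussed next.
interface FailurePlan {
    boolean shouldFail(String failurePointId);
}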
Failure coverage requires us to ideally inject all possible failure scenarios, ranging from individual failures to combinations of failures. Without careful techniques, the sample space (i.e., the set of all possible failures) can be very large. For example, even in the case of a single failure, a value can be corrupted to any arbitrary value, which would take a long time to exhaust. A worse scenario is exhausting all combinations of multiple failures. To reduce the coverage space without reducing the coverage quality, we plan to adopt two methods. The first is the type-aware fault injection technique that we have used successfully in the past [31, 34, 77, 131, 133]. The second is combinatorial design from the field of bio-computation, which suggests that if we cover all combinations of two possible failures (versus all combinations of all possible failures), we might cover almost all important cases [143]. After dealing with two-failure scenarios, we will explore more techniques to cover larger numbers of simultaneous failures.

Sequence coverage captures the idea that different sequences of failures can trigger different reactions. In other words, it is not enough to say "inject X and Y"; rather, we must be able to say "inject X, then Y" and vice versa. This observation came from one of our initial experiments, where one sequence of two failures results in a data loss while the opposite sequence does not. For example, in HDFS, if a copy of a metadata file could not be written (the first failure) simply because of a transient failure, this copy is considered stale; all future updates will not be reflected in this file. If, then, the update-time of the file is corrupted (the second failure) such that it becomes more recent than those of the other good copies, HDFS will read this stale file during the next reboot, which implies that all updates after the first failure will be lost. The reverse scenario is safe because when the write fails, HDFS also updates the update-time. Thus, sequence coverage is important, but it explodes the sample space further, and hence we plan to look for solutions for this coverage too.

Finally, workload coverage requires us to exercise failures in different parts of the code; the same failure can be handled in different segments of the code. Thus, we must also cover all possible workloads (e.g., start-up, data transfer, numerous client operations, background checks, and even recovery itself). To do this, we will leverage existing programming language techniques in directed testing [70].

Task 2: Develop techniques to ensure that the injected failures achieve high-quality failure, sequence, and workload coverage.

D.4.3 Declarative Testing Specification

We note that testing is often performed in a manner that is oblivious to the source it is run against. For example, in the field of fault injection, a range of different failures are commonly inserted into a system with little or no sense as to whether a small or large portion of the failure-handling code is being exercised [102]. We therefore believe there is an opportunity to explore the feasibility of writing declarative testing specifications to better drive testing. Such "specification-aided" testing exploits domain-specific knowledge (e.g., program logic) to help developers specify high-level test objectives. For example, it is typically hard to verify how many failures a system can survive before going down.
In this case, we ideally want to run a test specification such as "Combinations=2 & Server ≠ DOWN" (i.e., insert any possible two failures and verify that the system never goes down). Or, in the case of systems with crash-only recovery [48], the reboot phase must be thoroughly tested; thus, we might wish to run a testing specification such as "Combinations=1 & Workload=STARTUP" (i.e., insert all possible single failures in the reboot process). We also note that declarative testing specifications promote experiment repeatability, something that bug fixers highly appreciate.

Task 3: Explore how program logic (available from the source code or a program specification) can be used to better drive testing.

D.5 Online Declarative Monitoring

To complement offline testing, our next goal is to analyze recovery when failures occur in actual deployment. We note that many novel analyses have been proposed to pinpoint root causes of failures in large-scale settings [37, 95, 122, 124, 153, 161]. What we plan to do is fundamentally different: in our case, failures are deliberately injected, and recovery is the target of the analysis. In other words, our focus is not on finding out what causes failures to happen, but rather on why the system cannot recover from failures. Therefore, a different kind of framework is needed than what has been proposed before. Below, we describe the three components of our online monitoring framework (as depicted in Figure 3): online failure scheduling, recovery monitoring, and declarative analysis.

Figure 3: Architecture for Online Declarative Analysis. (A failure scheduler injects failures into the system; recovery monitoring extracts pre- and post-recovery state from log messages such as "X fails" and "Regenerating file Y"; declarative analyses then issue queries over this state.)

D.5.1 Online Failure Scheduling

We note that there has recently been significant interest in industry in deliberately injecting failures during actual runs [79, 87, 126], rather than waiting for failures to happen. We call this new trend online failure scheduling. Unfortunately, to the best of our knowledge, there has been no work that lays out the possible designs in this space. Thus, we plan to explore this paradigm as part of our online monitoring. We believe this piece of the project will foster a new breed of research that will directly influence existing practices.

To begin with, we will use the same fault-injection framework that we will build for our offline testing, as described in the previous section. In addition, we will explore the fundamental differences between offline and online fault injection. We have identified three challenges that we need to deal with. First, performance is important in online testing. Thus, every time the system hits a failure point, the fault-injection decision must be made rapidly (e.g., locally, without consulting a remote scheduler). This requires us to build distributed failure schedulers that run on each machine but communicate with each other; such a feature is typically not required in offline testing [162]. Second, timing is important; offline testing can direct the program execution to any desired state, but in deployment settings such freedom might be more restricted. For example, to test the start-up protocol, we cannot obliviously reboot the system. Finally, reliability is important; we cannot inject a failure and lose customers' data, especially if there is no guarantee that recovery is reliable [135].
On the other hand, the fact that the failure is scheduled presents an opportunity for the system to prepare for the failure (e.g., by backing up the data that might be affected by the injected failure). A safe implication is that a buggy recovery can be caught without actual consequences (e.g., real data loss), and hence the system "dares to fail". One big challenge here is to design an efficient preparation stage (i.e., identifying which data should be backed up).

Task 4: Design and implement distributed failure schedulers that run on each machine, but still coordinate failure timings efficiently.

Task 5: Develop techniques for efficient preparation for failures (i.e., identify which data are potentially affected by the to-be-injected failures).

D.5.2 Recovery Monitoring

After failures are injected online, our next goal is to detect bad recovery actions. This requires a recovery monitoring framework that can gather sufficient information during recovery and reconstruct the complex recovery behaviors of the system under analysis. Administrators can then plug their analyses into this monitoring framework. Below, we describe the challenges of such specialized monitoring. We note that this framework can also be used in our offline testing phase.

•Global health: Correct recovery decisions must be made based on many metrics such as system load, priority, location, cost, and many more. We assume a large-scale system has a monitoring infrastructure that captures the general health of the system, such as CPU usage, disk usage, network usage, and number of jobs, from log messages [2, 37, 134, 153, 161]. This general information will help us understand the global condition when recovery takes place. We do not plan to reinvent the wheel; instead, we plan to extend existing infrastructure with recovery monitoring in mind.

•Log provenance: We find that existing monitoring tools are not sufficient for our purpose; they depend on general log messages that do not capture important component interactions during recovery. What is fundamentally missing is the contextual information of each message. Other researchers have raised the same concern. For example, Oliner and Stearley studied 1 billion log messages from five of the world's most powerful supercomputers, and their first insight was that current logs do not contain sufficient information, which impedes important diagnoses [123]. Without context, it is hard to correlate one event (e.g., "failure X occurred") with another event (e.g., "Z bytes were transferred"). As a result, tedious techniques such as filtering [75], timing correlation [22, 165], perturbation [30, 45, 115], and source-code analysis [161] are needed. We believe context is absent because monitoring was not a first-class entity in the past, whereas today monitoring is almost a must for large-scale systems [79]. We note that contextual information already exists in the code (e.g., in the form of states), but it has not made its way down to the logging layer. We also believe that context is important for classifying log messages and hence will enable richer and more focused analysis. Thus, we plan to develop techniques to establish concrete provenance of log messages ("lint for logs"). Fortunately, today's systems have leveraged more sophisticated logging services beyond "printf". For example, HDFS employs the Apache log4j logging services [2]. We will explore how context can be incorporated into existing logging services, and hence provide transparent log provenance. If context transparency is impossible, we still believe that adding context to log messages would be a one-time burden that will benefit future analysis.
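As one hypothetical sketch of what such incorporation could look like with log4j, recovery context could be pushed into the mapped diagnostic context (MDC) for the duration of a recovery action; a layout pattern that includes %X{recoveryOf} and %X{trigger} would then stamp every message logged underneath with its provenance. The class, method, and key names here are illustrative, not HDFS code.

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

// Sketch: attach recovery provenance to existing log4j messages via the MDC.
public class RecoveryLogging {
    private static final Logger LOG = Logger.getLogger(RecoveryLogging.class);

    void recoverBlock(String blockId, String triggeringFailure) {
        MDC.put("recoveryOf", blockId);          // which object is being recovered
        MDC.put("trigger", triggeringFailure);   // which failure caused this recovery
        try {
            LOG.info("Regenerating replica");    // now carries its recovery context
            // ... actual re-replication work would go here ...
        } finally {
            MDC.remove("recoveryOf");            // clear context once recovery ends
            MDC.remove("trigger");
        }
    }
}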
Task 6: Establish concrete provenance of log messages, and explore whether existing logging services can incorporate log provenance transparently.

D.5.3 Declarative Analysis

After having rich contextual information in log messages, we can continuously record them in a "recovery database". We plan to adopt existing strategies that use an intermediate database to transform log messages into well-formatted structures [116, 134, 161]. To catch bad recovery actions, the next step is to run a set of online analyses on the recovery database, specifically by writing them as declarative queries. This approach is motivated by our initial success in using declarative queries to find state inconsistencies in the context of local file systems. We first describe this initial success to illustrate how we will express declarative analysis.

•Initial work (Declarative Local Fsck): We have built a declarative file system checker named SQCK [77] to express the high-level intent of the local file system consistency checker (fsck) in the form of declarative queries (with SQL). This was accomplished by loading file system states into a database on which the consistency-check queries run. With such declarativity, we have rewritten the core logic of a popular fsck (150 consistency checks) from 16 thousand lines of C code into 1100 lines of SQL statements. Figure 4 shows an example of a check that we transformed from C into a declarative query; one can see that the high-level intent of the declarative check is more clearly specified. This initial experience provides a major foundation for our online declarative analysis.

C version:
  firstBlk = sb->sFirstDataBlk;
  lastBlk = firstBlk + blksPerGrp;
  for (i = 0, gd = fs->grpDesc; i < fs->grpDescCnt; i++, gd++) {
      if (i == fs->grpDescCnt - 1)
          lastBlk = sb->sBlksCnt;
      if ((gd->bgBlkBmap < firstBlk) || (gd->bgBlkBmap >= lastBlk)) {
          px.blk = gd->bgBlkBmap;
          if (fixProblem(PR0BBNOTGRP, ...))
              gd->bgBlkBmap = 0;
      }
  }

Declarative version:
  SELECT *
  FROM   GroupDescTable G
  WHERE  G.blkBitmap NOT BETWEEN G.start AND G.end

Figure 4: The C version vs. the declarative version of a consistency check.

•Declarative Analysis: With the recovery database, we give administrators the ability to extract the recovery protocols (i.e., a state-centric view of recovery as query processing and updates over system state). For example, to identify whether recovery of X is expensive or not, one can query the statistics (e.g., bytes transferred) gathered in the context of recovering X. Thus, the challenge is to design the format of the recovery database and establish a set of queries that infer various aspects of recovery such as timing, scheduling, priority, load-balancing, data placement, and many more. More importantly, our goal here is to enable system administrators to plug in their specifications for both healthy and faulty scenarios. For example, after recovery, the system should enter a stable state again. Thus, we can express a specification that will alert the administrator if the healthy scenario is not met. One could also write a specification that tracks a late recovery (as described in Section D.2.2) with a query such as "Alert if the same exact error is found more than once." To catch a chained reaction of failures, one can plug in a rule such as "Alert if the number of under-replicated files is not shrinking over some period of time." This set of expected invariants is by no means complete. In fact, our goal is to enrich them as we learn more about bad recovery behaviors from the offline testing phase. Thus, another challenge is to find out how to ease this process of going from manual analysis to useful online specifications.
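As a hypothetical illustration of how such a specification might be plugged in over the recovery database, the sketch below issues a "healthy scenario" check through JDBC. The table and column names (recovery_events, under_replicated) are invented for exposition and do not reflect a designed schema; in practice the check could equally be registered as a standing declarative query.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: an administrator-supplied check over a hypothetical recovery database.
public class HealthyScenarioCheck {
    // Returns true if, after the recovery identified by recoveryId finished,
    // no blocks were left under-replicated (i.e., the system returned to a stable state).
    boolean recoveryLeftSystemStable(Connection db, long recoveryId) throws SQLException {
        String q = "SELECT COUNT(*) FROM recovery_events "
                 + "WHERE recovery_id = ? AND phase = 'POST' AND under_replicated > 0";
        try (PreparedStatement ps = db.prepareStatement(q)) {
            ps.setLong(1, recoveryId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();                     // COUNT(*) always returns one row
                return rs.getLong(1) == 0;     // healthy iff nothing remains under-replicated
            }
        }
    }
}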
Task 7: Design a proper structure for the recovery database, establish a set of analyses that infer various aspects of recovery, and develop a set of specifications that express healthy and faulty recovery scenarios.

D.6 Executable Recovery Specification

Although the first two phases of DARE will uncover recovery problems, developers still need to fix them in the same code base. Yet, developers often introduce new bugs as they fix old bugs, especially when the system is complex and written in low-level system languages [164]. Thus, we believe recovery code must evolve from low-level system languages (e.g., C++ or Java) into declarative specifications. In this section, we describe our plan to design a declarative language for writing executable recovery specifications. To date, we know of only one work on declarative recovery, and that is in the context of sensor networks [74]. In the context of large-scale file systems, we believe this will be a challenging task, yet a doable one; we are encouraged by our initial success with rewriting HDFS declaratively [23]. We begin by describing this initial success, followed by our detailed design and evaluation plans.

D.6.1 Initial Work: Data-Centric Programming

After reviewing some of the initial datacenter infrastructure efforts in the literature [46, 62, 63, 69], it seemed to us that most of the non-trivial logic involves managing various forms of asynchronously updated state (e.g., sessions, protocols, storage). This suggested that the Overlog language used for declarative networking [109] would be well-suited to these tasks and could significantly ease development. With this observation, we have successfully used the Overlog declarative language [109] to rewrite the HDFS metadata management and communication protocols [23]. Our declarative implementation (BOOM-FS) is ten times shorter than the original HDFS implementation: less than 500 lines of Overlog with 1500 lines of additional Java code; the Java code is part of our Java Overlog Library (JOL), which functions as the Overlog runtime support. Furthermore, to show how easily we can add complex distributed functionality, we have added the Paxos algorithm [68] to BOOM-FS in 400 lines of Overlog rules; with Paxos, BOOM-FS can manage more than one master server with strong consistency.

To give a flavor of how protocol specifications are written in Overlog, we give a more detailed description of our BOOM-FS implementation. The first step of our HDFS rewrite was to represent file system metadata as a collection of relations. For example, we have an hbChunk relation which carries information about the file chunks stored by each data node in the system. The master server updates this relation as new heartbeats arrive; if the master server does not receive a heartbeat from a data node within a configurable amount of time, it assumes that the data node has failed and removes the corresponding rows from this table. After defining all relations in the system, we can write rules for each command in the HDFS metadata protocol. For example, Figure 5 shows a set of rules (run on the master server) that return the set of data nodes that hold a given chunk. These rules work on the information provided in the hbChunk relation. Due to space constraints we are not able to describe the rule semantics. However, our main point is that metadata protocols can be specified concisely with Overlog rules.

  // The set of nodes holding each chunk
  computeChunkLocs(ChunkId, set<NodeAddr>) :-
      hbChunk(NodeAddr, ChunkId, _);

  // If chunk exists => return set of nodes
  response(@Src, RequestId, true, NodeSet) :-
      request(@Master, RequestId, Src, 'ChunkLocations', ChunkId),
      computeChunkLocs(ChunkId, NodeSet);

  // Chunk does not exist => return failure
  response(@Src, RequestId, false, null) :-
      request(@Master, RequestId, Src, 'ChunkLocations', ChunkId),
      notin hbChunk(_, ChunkId, _);

Figure 5: A sample of Overlog rules.

D.6.2 Evaluating Declarativity

To design a declarative language for recovery, we plan to explore whether existing declarative languages such as Overlog can naturally express recovery. More concretely, we have defined four challenges that a proper declarative language for recovery should meet:

•Checkability: We believe recovery is often under-specified. Yet, a small error in recovery could lead to data loss or degraded performance and availability. Declarativity is a way to turn recovery into a formal system specification. Thus, the chosen declarative language must be able to support formal verification on top of it. We also note that formal verification will greatly complement testing; as testing is often considered expensive and unlikely to cover everything, our hope is to have formal checks cover what testing does not, and vice versa.

•Flexibility: Large-scale recovery management is complex; many parameters of different components play a role in making recovery decisions. As an illustration, Table 1 lists some of the metrics commonly used in making recovery decisions. Even though the list is not complete, there are already many metrics to consider. Thus, the possibilities for recovery decisions are nuanced. For example, we might want to prioritize data recovery for first-class customers [100]; on the other hand, migrating recovery over time, especially during surge load periods, is considered good provisioning [80]. Recovery must also be adaptive. For example, if a cluster at one geographic location is not reachable, the system might consider temporarily duplicating the unreachable data. Often, what we have in existing systems are static policies. Therefore, our final solution should enable system builders to investigate different schemes, exploring the large design space without being constrained by what is available today.
Table 1: Recovery metrics and policies.
  #Replicas: Prioritize recovery if the number of available replicas is small.
  Popularity: Prioritize recovery of popular files.
  Access: Prioritize recovery of files that are being used by jobs.
  #Jobs: Try not to impact the foreground workload.
  Foreground request: Try to piggyback recovery on foreground requests.
  Free space: Try to replicate files to machines with more free space.
  Heterogeneous hardware: Try to recover important files to fast machines.
  Utilization: Try to pick less utilized machines to perform recovery.
  Rack awareness: Ensure that different copies are in different racks.
  Geographic location: Ensure that one copy is available in a different geographic location.
  Financial cost: If multiple recovery options are available, choose the cheapest one.
  Network bandwidth: Throttle recovery if bandwidth is scarce.

•Scalability: We believe a declarative language suitable for large-scale systems must be able to express parallelism explicitly. We notice that although the power of elastic computing is available, some forms of recovery have not evolved and hence do not exploit the compute power available in the system. An example is the classic file system checker (fsck), whose job is to cross-check the consistency of all metadata in the system and then repair all broken invariants [12, 88, 117, 159]. A great deal of performance optimizations have been introduced to fsck [41, 89, 90, 128], but only locally. Hence, a complete check often still takes hours or days [90], and hence runs very rarely. Thus, the chosen declarative approach should enable easy parallelization of recovery tasks. As a start, in our prior work on declarative fsck [77], we made concrete suggestions on how to run consistency checks in parallel (e.g., via traditional data-parallel methods such as MapReduce and/or database-style queries).

•Integration: We note that some recent work has also introduced solutions for building flexible data management (e.g., with modularity [114]). However, such work typically omits the recovery part. Thus, previous approaches, although novel, are not a holistic approach for building a complete storage system, as sooner or later recovery code needs to be added [77]. Thus, our goal is to design a holistic approach in which the chosen declarative language covers data management as well as recovery; extending Overlog would be a great beginning, as we have used Overlog to build the former.

D.6.3 Declarative Recovery Design

We believe that coming up with a good design requires careful exploration of possible alternatives. Thus, with the four axes of evaluation in mind, there are several directions that we will explore. First, we will investigate whether Overlog is well-suited to express recovery protocols. With Overlog, metadata operations are treated as query processing and updates over system state; recovery could take the same state-centric view, and hence this direction has a high chance of success. Second, we will explore AOP for declarative languages (in our case, we will begin with AOP for Overlog). This direction comes from our observation that if recovery rules pervade the main operational rules, the latter could become unnecessarily complex even if both are written declaratively. Thus, if declarative recovery can be written in an AOP style, we can provide a formal and executable "recovery plane" for real systems.
Finally, if we find that Overlog is not suitable for our goal, we will investigate other declarative techniques, rather than being constrained by what we currently have. We also want to note that recovery is constrained by how the data and metadata are managed; a policy that recovers lost data from a surviving replica would not exist if the data is not replicated in the first place, and a failover policy for a failed metadata server would be impossible if only one metadata server exists. This implies that we also need to re-evaluate how far existing declarative languages such as Overlog enable flexibility in managing data and metadata.

Task 8: Prototype recovery specifications for HDFS in Overlog, and then evaluate the prototype based on the four axes of evaluation above. Based on this experience, develop guidelines for a more appropriate domain-specific language for recovery.
Task 9: Evaluate the reliability of our recovery specifications using the first two phases of DARE (hence, the iterative nature of the DARE lifecycle).

D.7 Putting It All Together
From our initial study of existing large-scale recovery (Section D.2.2), we learn two important lessons. First, early designs usually do not cover all possible failures; unpredictable failures occur in real deployment, and thus system builders learn from previous failures. Second, even as new failures are handled, the corresponding recovery strategies must be designed carefully; a careless recovery could easily bring down availability, overuse resources, and affect a large number of applications. Therefore, we believe the three components of DARE form a novel framework for improving recovery: the offline testing phase unearths recovery bugs early, the online monitoring phase catches more problems found in deployment, and the executable specification enables system builders to try out different recovery strategies. Because we see these three phases as an iterative, pay-as-you-go approach, our plan is to rapidly prototype the three phases and improve them hand-in-hand. As mentioned throughout previous sections, we believe the lessons learned from each phase will benefit the other phases in parallel.

D.8 Related Work
In this section, we highlight the distinctions between our project and previous major work on recovery and scalability.
•Recovery: Recovery-Oriented Computing (ROC) was proposed eight years ago [127]. This powerful paradigm has convinced the community that errors are inevitable, unavailability is costly, and thus failure management should be revisited. Since then, a long line of efficient recovery techniques has been proposed. For example, Nooks ensures that a driver failure does not cause whole-OS failure [150, 151]; Micro-reboot advocates that a system should be compartmentalized to prevent full reboots [49]; more recently, CuriOS prevents failure propagation by isolating each service [60]. Today, as systems scale, recovery should be revisited again at a larger scale [42].
•Scalable Recovery: Recently, some researchers have started to investigate the scalability of job recovery. For example, Ko et al. found that a failed MapReduce task could lead to a cascaded re-execution of all preceding tasks [101]. Arnold and Miller pointed out a similar issue in the context of tree-based overlay computation [24]: a node failure could lead to a re-execution of all computations in the sub-tree. Our project falls along the same line with a different emphasis: scalable data recovery.
•Data Recovery: Fast data recovery was first introduced in the context of RAID around 16 years ago [91]. Recently, RAID recovery has been evaluated at a larger scale [160]. However, there is a shift from using disk-level RAID to distributed file-level RAID [159] or simple replication across clusters of machines [43, 69]. We believe that recovery should be revisited again in this new context. Fast data recovery during restart (log-based recovery) has also been considered important. For example, the ARIES recovery method exploits multiprocessors by making redo/undo processing parallel [119]. More recently, Sears and Brewer extended the idea with a new technique, segment-based recovery, which can run across large-scale distributed systems [140]. In our project, in addition to log-based recovery, we will revisit the scalability of all recovery tasks of the systems we analyze.
•Pinpointing problems: There is a growing body of work attempting to pinpoint problems in large-scale settings: Narasimhan's group at CMU built black-box analysis [124], white-box analysis [153], and automated online fingerpointing [37] tools for monitoring MapReduce jobs; Oliner et al. developed unsupervised algorithms to analyze raw logs and automatically alert administrators [122]; Xu et al. employ machine learning techniques to detect anomalies in logs [161]. We will extend this existing work to capture recovery problems.
•Declarative Recovery: Keeton et al. pointed out the complexity of recovery in a multi-tier storage system [98, 99, 100]; recovery choices are myriad (e.g., failover, reprovision, failback, restore, mirror). To help administrators, they introduced novel recovery graphs that process input constraints (e.g., tolerated loss, cost) and produce a solution. However, they provide only a high-level framework and stop short of providing a new language. To date, we know of only one piece of work on declarative recovery, in the context of sensor networks. As sensor applications often deal with failures, many recovery concerns become tangled with the main program, making it hard to understand and evolve [74]. The same lesson should be applied to large-scale file systems.
•Parallel workloads: Today, all kinds of workloads are made scalable: B-Tree data structures can coordinate themselves across machines [21]; tree-learning algorithms now run on multiple machines [125]; all subcomponents of a database (query planning, query optimization, etc.) are made parallel [66, 157]; even monitoring tools are built as scalable jobs [134, 161]. This trend forces us to revisit the scalability of all file system subcomponents; we will begin with the recovery component.
•Bug finding: Many approaches have been introduced to find bugs in storage system code [47, 163, 164]. We note that the type of failure analysis we are advocating is not simply a search for "bugs"; rather, we also seek to understand the recovery policy of the system under analysis at a high level. In the past, we have shown that high-level analyses unearth problems not found by bug-finding tools [31, 34, 133].

D.9 Research and Management Plan
In this section, we first discuss the global research plan we will pursue, and then discuss the division of labor across research assistants.

D.9.1 Global Research Plan
We believe strongly in an iterative approach to systems building via analysis, design, implementation, evaluation, and dissemination.
We believe that this cycle is crucial, allowing system components to evolve as we better understand the relevant issues. We discuss these in turn.

Year 1, UC Berkeley (Postdoc1 for 1 year + RA1 for 2 years):
• Prototype DARE phase 2 (online monitoring) for HDFS: deal with log provenance, establish a list of online analyses.
• Prototype DARE phase 3 (executable specification) for HDFS: bootstrap spec-writing in Overlog into BOOM-FS.
Year 1, UW Madison (RA2 for 3 years):
• Prototype DARE phase 1 (offline testing) for HDFS: insert failure points with AOP, auto-generate combinations of failures.
• Document deficiencies of existing recovery (for Postdoc1).
• Improve test performance with distributed failure schedulers.
Year 2, UC Berkeley (RA1 will continue Postdoc1's work):
• From RA2's testing results, refine the two phases above.
• Evaluate our recovery domain-specific language (DSL) (verifiability, flexibility, performance, parallelism, robustness).
• Use this experience to develop more appropriate DSLs and design patterns for recovery.
Year 2, UW Madison:
• Apply the three phases of DARE to Lustre.
• Develop techniques to improve failure coverage and intelligent sampling of test scenarios.
• Design online failure scheduling, and integrate this into Postdoc1's monitoring framework.
Year 3, both sites:
• Revisit the whole process again, identify, and tackle new interesting research challenges.
• Extend DARE to three other areas of cloud-related research: structured stores such as memcached [13] and Hadoop HBase (BigTable) [11], job management such as Hadoop MapReduce [10], and resource management such as Eucalyptus [7].
Table 2: Preliminary Task Breakdown.

•"By hand" analysis: Detailed protocols of existing large-scale file systems are not well documented. Thus, the initial stage is to analyze the source code "by hand".
•Design: As we gain exposure to the details of existing systems, we can start sketching out recovery specifications declaratively, and also envision new testing and monitoring tools. Every piece of software we plan to build has a serious design phase.
•Implementation: One of our group mantras is "learning through doing." Thus, all of our research centers around implementation as the final goal; because this project is focused on finding real problems in real systems, we do not anticipate using simulation in any substantial manner.
•Evaluation: At all stages throughout the project, we will evaluate various components of DARE via controlled micro-benchmarks. Measurements of system behavior guide our work, informing us which of our ideas have merit, and which have unanticipated flaws.
•Dissemination: Our goal is to disseminate our findings quickly. Thus, one metric of success is to share useful findings, tools, and prototypes with the HDFS and Lustre communities at the end of each research year. We also believe that, beyond file systems, the vision of DARE should extend to other areas as well (e.g., databases, job/resource management). Hence, another metric of success is to achieve that in the final year (see Year 3 in Table 2).

D.9.2 Division of Labor
We request funding for one postdoctoral scholar at UC Berkeley for one year (Postdoc1), one graduate student researcher (GSR) at UC Berkeley for two years (RA1), and another GSR at UW Madison for three years (RA2). Table 2 presents a rough division of labor across the different components of the project. We emphasize that the bulk of the funding requested within this proposal is for human costs; we have pre-existing access to the large-scale computing infrastructure described below.
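As a concrete illustration of the Year-1 offline-testing task in Table 2 (auto-generating combinations of failures), the following is a minimal sketch, assuming hypothetical injection-point names and a placeholder schedule_test function rather than our actual framework; it simply enumerates the multi-failure scenarios that an offline tester would exercise against the system under test.

import itertools

# Hypothetical failure-injection points (placeholders, not real framework hooks).
INJECTION_POINTS = [
    "datanode.crash.during_write",
    "datanode.crash.during_replication",
    "master.crash.before_commit",
    "network.partition.rack_a",
    "disk.latent_sector_error",
]

def failure_scenarios(points, max_failures=2):
    # Yield every combination of up to max_failures simultaneous failures,
    # including the rare multi-failure cases that exercise recovery heavily.
    for k in range(1, max_failures + 1):
        for combo in itertools.combinations(points, k):
            yield combo

def schedule_test(scenario):
    # Placeholder: a real harness would inject these failures into a running
    # workload and then check recovery invariants afterwards.
    print("testing recovery under:", ", ".join(scenario))

if __name__ == "__main__":
    for scenario in failure_scenarios(INJECTION_POINTS, max_failures=2):
        schedule_test(scenario)

Even this tiny sketch shows why the Year-2 work on failure coverage and intelligent sampling matters: the number of scenarios grows combinatorially with the number of injection points and simultaneous failures.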
We believe our Berkeley-Wisconsin collaboration will generate a fruitful project; PIs Hellerstein and Arpaci-Dusseau have co-authored significant papers before [25, 26, 28]. Furthermore, Postdoc1 and RA2 have been working together on some of the first-year items, and will continue to do so as described in Table 2. Exchanges of ideas and decision making will continue via phone/video conferencing and several visits.

D.10 Facilities for Data-Intensive Computing Research
As mentioned before, one of the goals of this project is to test recovery at scales beyond those considered in prior research. We believe that as systems get larger in size, recovery must be tested at the same scale. Thus, we plan to use two kinds of facilities (medium and large scale). The first is our internal cluster of 32 Linux machines, each with a 2.2 GHz processor, 1 GB of memory, and 1 TB of disk. As we have exclusive access to this cluster at any time, we will test our prototypes on it before deploying them on a larger cluster. For a larger evaluation, we plan to use the Yahoo M45 [17] cluster and/or Amazon EC2 [1]. We have been granted access to the Yahoo M45 cluster, which has approximately 4,000 processor cores and 1.5 PB of disk. Hellerstein's group has an intensive pre-existing collaboration with Yahoo Research, and an accompanying letter of support from their management documents both that relationship and their interest in supporting our work. In addition, we also plan to use Amazon EC2, which allows users to rent computers on which to run their own applications; researchers are able to rent 200 nodes cheaply [161]. This fits well for scalability-related evaluations, and thus we allocate a modest budget ($1000/year) for using Amazon EC2. Our goal is to get the most results at the least cost. Thus, we will explore which of the large-cluster options will give us the most benefit. We also want to acknowledge that human processes should not interfere with large-scale, online evaluation (i.e., everything should be automated). This frames our approach in this proposal. For example, our fault-injection framework is integrated as part of the system under test, and hence failures can be automatically scheduled; we also plan to gather runtime information in as much detail as possible (e.g., in a recovery database) such that we can analyze our findings offline.

D.11 Broader Impact and Education Plan
In this section we first describe the impact of this project on undergraduates; we strongly believe that these students are critical to the future of science and engineering in our country, and thus we have been working to engage as many of them as possible in various ways. Then, we close with the broader impact of this project.
•Teaching impact: In the past, Arpaci-Dusseau has incorporated modern distributed file systems such as the Google File System into her undergraduate Operating Systems curriculum, and Hellerstein has incorporated MapReduce into his undergraduate Database Systems curriculum. As a result, the students were deeply excited to understand the science behind the latest technology. Furthermore, unlike in the past, today's distributed file systems are surprisingly easy to install; a first-timer can install HDFS on multiple machines in an undergraduate computer lab in less than ten minutes. This ease encourages us to use HDFS for future course projects.
•Training impact: As large-scale reliability research is fairly new, we believe we will find and fix many vulnerabilities.
We plan to select simple but fundamental fixes that can be repeated by undergraduate students. Thus, rather than building "toy" systems in all of their course projects, they can experience fixing a real system. We believe this is good practice before they join industry. Furthermore, we also realize that students typically implement a single design predefined by the instructor. Although this is a great starting point, we feel that students tend to take for granted the fundamental decisions behind the design. Thus, we also plan to give a project where students use a high-level language (probably ours) to rapidly implement more than one design and explore performance-reliability tradeoffs.
•Undergraduate independent study: Due to the significant number of large-scale file systems that we wish to analyze, the DARE project is particularly suited to allowing undergraduate students to gain exposure to true systems research. More specifically, we will pair undergraduates with graduate students in our groups such that exchanges of knowledge are possible. This gives them a chance to work on cutting-edge systems research, and hence better prepares them for industry or national labs. At both UW Madison and UC Berkeley, our efforts along this direction have been quite successful; undergraduate students have published research papers, interned at national labs, and gone on to graduate school and prominent positions in academia and industry.
•Community outreach and dissemination: One of our priorities is to generate novel research results and publish them in the leading scholarly venues. In the past, we have focused on systems and database venues such as SOSP, OSDI, FAST, USENIX, SIGMOD, VLDB, ICDE, and related conferences, and we plan to continue along those lines here; we feel this is one of the best avenues to expose our ideas to our academic and industrial peers. We are also committed to developing software that is released as open source and well documented. In the past we have worked with Linux developers to move code into the Linux source base. For example, some of our work on improving Linux file systems has been adopted into the mainline kernel, specifically in the Linux ext4 file system [133], and many bugs and design flaws we have found have already been fixed by developers [78]. We have also worked extensively in recent years with Yahoo, the main locus of development behind Hadoop, HDFS, and Pig. Our recent open-source extensions to those systems [59] have generated significant interest in the open source community, and we are working aggressively to get them incorporated into the open-source codebase. In the past, we have also made major contributions to open source systems such as PostgreSQL and PostGIS.
•Technology transfer: We place significant value on technology transfer; some of our past work has led to direct industrial impact. For example, the ideas behind the D-GRAID storage system have been adopted by EMC in their Centera product line [145]. Further, we worked with NetApp to transfer some of our earlier fault-injection technology into their product development and testing cycle [133]; this technology transfer led directly to follow-on research in which we found design flaws in existing commercial RAID systems, which has spurred some companies to fix the problems we found [104].
In recent years we have been involved in the scale-out both of core database infrastructure [157] and of the use of parallel databases for online advertising via rich statistical methods at unprecedented scales [57]. Earlier research has been transferred to practice in database products from IBM [105], Oracle [146], and Informix [86, 103].
•Benefits to society: We believe the DARE project falls in the same directions set by federal agencies; a recent HEC FSIO workshop declared "research in reliability at scale" and "scalable file system architectures" to be topics that are very important and greatly in need of research [36]. Moreover, our analysis of and potential improvements to parallel file systems will benefit national scientists who use these systems on a daily basis. We also believe that our project will benefit society at large; in the near future, users will store in the "cloud" all of their data (emails, work documents, generations of family photos and videos, etc.). Through the DARE project, we will build the next generation of large-scale Internet and parallel file systems that will meet the performance and reliability demands of current society.

D.12 Prior Results
PI Joseph M. Hellerstein has been involved in a number of NSF awards. Three are currently active, but entering their final year: MUNDO: Managing Uncertainty in Networks with Declarative Overlays (NGNI-Medium, 09/08 - 08/10, $303,872), SCAN: Statistical Collaborative Analysis of Networks (NeTS-NBD, 0722077, 01/08 - 12/10, $249,000), and Dynamic Meta-Compilation in Networked Information Systems (III-COR 0713661, 09/07 - 08/10, $450,000). The remainder are completed: Adaptive Dataflow: Eddies, SteMs and FLuX (0208588, 08/02 - 07/05, $299,998), Query Processing in Structured Peer-to-Peer Networks (0209108, 08/02 - 07/05, $179,827), Robust Large Scale Distributed System (ITR 5710001486, 10/02 - 09/07, $2,840,869), Data on the Deep Web: Queries, Trawls, Policies and Countermeasures (ITR 0205647, 10/02 - 09/07, $1,675,000), and Mining the Deep Web for Economic Data (SGER 0207603, 01/02 - 06/03, $98,954). The three current awards all share at their core the design and use of declarative languages and runtimes, in different contexts. These languages and runtimes form a foundation for the declarative programming aspects of the work we propose here, and have supported many of the recent results cited earlier [23, 53, 58, 59], as well as a variety of work that is outside the scope of this proposal. The emphasis on scalable storage recovery in this proposal is quite different from that of the prior grants, which focused on Internet security (SCAN), distributed machine learning (MUNDO), and the internals of compilers for declarative languages (III-COR 0713661). Prior awards have led to a long list of research results that have been recognized with awards. A series of students have been trained with this support, including graduate students who have gone on to distinguished careers in academia (MIT, Penn, Maryland) and industrial research (IBM, Microsoft, HP), and undergraduates who have continued on to graduate school at Berkeley, Stanford, and elsewhere.
Co-PI Andrea C. Arpaci-Dusseau has been involved with a number of NSF-related efforts. The active ones are: HaRD: The Wisconsin Hierarchically-Redundant, Decoupled Storage Project (HEC, 09/09 - 08/10, $221,381), Skeptical Systems (CSR-DMSS-SM, 08/08 - 08/11, $420,000), and WASP: Wisconsin Arrays in Software Project (CPA-CSA, 07/08 - 08/10, $217,978).
The completed ones are: PASS: Formal Failure Analysis for Storage Systems (HEC, 09/06 - 08/09, $951,044), SAFE: Semantic Failure Analysis and Management (CSR–PDOS, 09/05 - 08/08, $700,000), WISE: Wisconsin Semantic Disks (ITR-0325267, 08/03 - 08/07, $600,000), Robust Data-intensive Cluster Programming (CAREER proposal, CCR-0092840, 09/01 - 08/06, $250,000), WiND: Robust Adaptive Network-Attached Storage (CCR-0098274, 09/01 - 08/04, $310,000), and Wisconsin DOVE (NGS-0103670, 09/01 - 08/04, $596,740). The most relevant awards are the PASS, SAFE, and WiND projects. Through the PASS and SAFE projects, we have gained expertise on the topic of storage reliability [31, 33, 34, 35, 64, 65, 75, 76, 77, 78, 104, 131, 132, 133, 137, 144, 145, 147]. We began by analyzing how commodity file systems react to a range of more realistic disk failures [131, 133]. We have shown that commodity file system failure policies (such as those in Linux ext3, ReiserFS, and IBM JFS) are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures. We also have extended our analysis techniques to distributed storage systems [75], RAID systems [65], commercial file systems [33, 34], and virtual memory systems [31]. Most recently, we have employed formal methods (model checking and static analysis) to find flaws in file system code [78, 137] and RAID designs [104]. Finally, we were able to propose novel designs for building more reliable file systems; one design received a best paper award [35], and the other two appeared in top conferences [76, 77]. Although most of our recent work has focused on the analysis of reliability in storage systems, our earlier efforts in I/O focused greatly on performance [39, 40, 156]. We have on more than one occasion held the world-record in external (disk-to-disk) sorting [25]. We thus will bring to bear this older expertise on the DARE project. 15 References [1] Amazon EC2. http://aws.amazon.com/ec2. [2] Apache Logging Services Project. http://logging.apache.org/log4j/. [3] Applications and organizations using Hadoop/HDFS. http://wiki.apache.org/hadoop/PoweredBy. [4] AspectC. www.aspectc.org. [5] AspectJ. www.eclipse.org/aspectj. [6] CloudStore / KosmosFS. http://kosmosfs.sourceforge.net. [7] Eucalyptus. http://www.eucalyptus.com. [8] GFS/GFS2. http://sources.redhat.com/cluster/gfs. [9] GlusterFS. www.gluster.org. [10] Hadoop MapReduce. http://hadoop.apache.org/mapreduce. [11] HBase. http://hadoop.apache.org/hbase. [12] HDFS User Guide. http://hadoop.apache.org/common/docs/r0.20.0/hdfs user guide.html. [13] Memcached. http://memcached.org. [14] MogileFS. www.danga.com/mogilefs. [15] Tahoe-LAFS. http://allmydata.org/trac/tahoe. [16] XtreemFS. www.xtreemfs.org. [17] Yahoo M45. http://research.yahoo.com/node/1884. [18] Daniel Abadi. Data Management in the Cloud: Limitations and Opportunities. IEEE Data Engineering Bulletin, 32(1):3–12, March 2009. [19] Daniel J. Abadi, Michael J. Cafarella, Joseph M. Hellerstein, Donald Kossmann, Samuel Madden, and Philip A. Bernstein. How Best to Build Web-Scale Data Managers? A Panel Discussion. PVLDB, 2(2), 2009. [20] Parag Agrawal, Daniel Kifer, and Christopher Olston. Scheduling Shared Scans of Large Data Files. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), Auckland, New Zealand, July 2008. [21] Marcos Aguilera, Wojciech Golab, and Mehul Shah. A Practical Scalable Distributed B-Tree. 
In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), Auckland, New Zealand, July 2008. [22] Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03), Bolton Landing, New York, October 2003. [23] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell C Sears. BOOM: DataCentric Programming in the Datacenter. UC. Berkeley Technical Report No. UCB/EECS-2009-113, 2009. [24] Dorian C. Arnold and Barton P. Miller. State Compensation: A Scalable Failure Recovery Model for Tree-based Overlay Networks. UW Technical Report, 2009. [25] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and Dave Patterson. HighPerformance Sorting on Networks of Workstations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD ’97), Tucson, Arizona, May 1997. [26] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and Dave Patterson. Searching for the Sorting Record: Experiences in Tuning NOW-Sort. In The 1998 Symposium on Parallel and Distributed Tools (SPDT ’98), Welches, Oregon, August 1998. [27] Remzi H. Arpaci-Dusseau. Run-Time Adaptation in River. ACM Transactions on Computer Systems (TOCS), 21(1):36–86, February 2003. [28] Remzi H. Arpaci-Dusseau, Andrea C. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and Dave Patterson. The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs. In Proceedings of the 4th International Symposium on High Performance Computer Architecture (HPCA-4), Las Vegas, Nevada, February 1998. [29] Algirdas A. Aviˇzienis. The Methodology of N-Version Programming. In Michael R. Lyu, editor, Software Fault Tolerance, chapter 2. John Wiley & Sons Ltd., 1995. [30] Saurabh Bagchi, Gautam Kar, and Joseph L. Hellerstein. Dependency Analysis in Distributed Systems Using Fault Injection. In 12th International Workshop on Distributed Systems, Nancy, France, October 2001. 1 [31] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Dependability Analysis of Virtual Memory Systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’06), Philadelphia, Pennsylvania, June 2006. [32] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. An Analysis of Latent Sector Errors in Disk Drives. In Proceedings of the 2007 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’07), San Diego, California, June 2007. [33] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. ArpaciDusseau. An Analysis of Data Corruption in the Storage Stack. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST ’08), pages 223–238, San Jose, California, February 2008. [34] Lakshmi N. Bairavasundaram, Meenali Rungta, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Michael M. Swift. Systematically Benchmarking the Effects of Disk Pointer Corruption. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’08), Anchorage, Alaska, June 2008. [35] Lakshmi N. Bairavasundaram, Swaminathan Sundararaman, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 
Tolerating File-System Mistakes with EnvyFS. In Proceedings of the USENIX Annual Technical Conference (USENIX ’09), San Diego, California, June 2009. [36] Marti Bancroft, John Bent, Evan Felix, Gary Grider, James Nunez, Steve Poole, Rob Ross, Ellen Salmon, and Lee Ward. High End Computing Interagency Working Group (HECIWG) HEC File Systems and I/O 2008 Roadmaps. http:// institutes.lanl.gov/hec-fsio/docs/HEC-FSIO-FY08-Gaps RoadMap.pdf. [37] Keith Bare, Michael P. Kasick, Soila Kavulya, Eugene Marinelli, Xinghao Pan, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. ASDF: Automated and Online Fingerpointing for Hadoop. CMU PDL Technical Report CMU-PDL-08-104, 2008. [38] Luiz Barroso, Jeffrey Dean, and Urs Hoelzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22–28, 2003. [39] John Bent, Doug Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Explicit Control in a Batch-Aware Distributed File System. In Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI ’04), pages 365–378, San Francisco, California, March 2004. [40] John Bent, Venkateshwaran Venkataramani, Nick Leroy, Alain Roy, Joseph Stanley, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Flexibility, Manageability, and Performance in a Grid Storage Appliance. In Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 11), pages 3–12, Edinburgh, Scotland, July 2002. [41] Eric J. Bina and Perry A. Emrath. A Faster fsck for BSD Unix. In Proceedings of the USENIX Winter Technical Conference (USENIX Winter ’89), San Diego, California, January 1989. [42] Ken Birman, Gregory Chockler, and Robbert van Renesse. Towards a Cloud Computing Research Agenda. ACM SIGACT News, 40(2):68–80, June 2009. [43] Dhruba Borthakur. design.html. HDFS Architecture. http://hadoop.apache.org/common/docs/current/hdfs [44] Peter J. Braam and Michael J. Callahan. Lustre: A SAN File System for Linux. www.lustre.org/docs/luswhite.pdf, 2005. [45] A. Brown, G. Kar, and A. Keller. An Active Approach to Characterizing Dynamic Dependencies for Problem Determination in a Distributed Environment. In The 7th IFIP/IEEE International Symposium on Integrated Network Management, May 2001. [46] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems Export. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI ’06), Seattle, Washington, November 2006. [47] Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI ’08), San Diego, California, December 2008. [48] George Candea and Armando Fox. Crash-Only Software. In The Ninth Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, Hawaii, May 2003. [49] George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Microreboot – A Technique for Cheap Recovery. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 31–44, San Francisco, California, December 2004. [50] Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for linux clusters. In Atlanta Linux Showcase (ALS ’00), Atlanta, Georgia, October 2000. 2 [51] Ronnie Chaiken, Bob Jenkins, Paul Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 
SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB ’08), Auckland, New Zealand, July 2008. [52] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An Empirical Study of Operating System Errors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), pages 73–88, Banff, Canada, October 2001. [53] David Chu, Joseph M. Hellerstein, and Tsung te Lai. Optimizing declarative sensornets. In Proceedings of the 6th International Conference on Embedded Networked Sensor Systems (SenSys ’08), Raleigh, NC, November 2008. [54] David Chu, Lucian Popa, Arsalan Tavakoli, Joseph M. Hellerstein, Philip Levis, Scott Shenker, and Ion Stoica. The design and implementation of a declarative sensor network system. In Proceedings of the 5th International Conference on Embedded Networked Sensor Systems (SenSys ’07), Sydney, Australia, November 2007. [55] Brent N. Chun, Joseph M. Hellerstein, Ryan Huebsch, Shawn R. Jeffery, Boon Thau Loo, Sam Mardanbeigi, Timothy Roscoe, Sean C. Rhea, Scott Shenker, and Ion Stoica. Querying at Internet-Scale. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’04), Paris, France, June 2004. [56] Yvonne Coady, Gregor Kiczales, Mike Feeley, and Greg Smolyn. Using AspectC to Improve the Modularity of Path-Specific Customization in Operating System Code. In Proceedings of the Joint European Software Engineering Conference (ESEC) and 9th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE-9), September 2001. [57] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2):1481–1492, 2009. [58] Tyson Condie, David Chu, Joseph M. Hellerstein, and Petros Maniatis. Evita raced: metacompilation for declarative networks. PVLDB, 1(1), 2008. [59] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell C Sears. MapReduce Online. UC. Berkeley Technical Report No. UCB/EECS-2009-136, 2009. [60] Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, and Roy H. Campbell. CuriOS: Improving Reliability through Operating System Structure. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI ’08), San Diego, California, December 2008. [61] Jeff Dean. Panel: Desired Properties in a Storage System (For building large-scale, geographically-distributed services). In Workshop on Hot Topics in Storage and File Systems (HotStorage ’09), Big Sky, Montana, October 2009. [62] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 137–150, San Francisco, California, December 2004. [63] Giuseppe Decandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s Highly Available Key-Value Store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), Stevenson, Washington, October 2007. [64] Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Journal-guided Resynchronization for Software RAID. In Proceedings of the 4th USENIX Symposium on File and Storage Technologies (FAST ’05), pages 87– 100, San Francisco, California, December 2005. [65] Timothy E. 
Denehy, John Bent, Florentina I. Popovici, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Deconstructing Storage Arrays. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XI), pages 59–71, Boston, Massachusetts, October 2004. [66] David J. DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM, 35(6):85–98, 1992. [67] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), pages 57–72, Banff, Canada, October 2001. [68] Eli Gafni and Leslie Lamport. Disk Paxos. In International Symposium on Distributed Computing (DISC ’00), Toledo, Spain, October 2000. [69] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03), pages 29–43, Bolton Landing, New York, October 2003. [70] Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: Directed Automated Random Testing. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI ’05), Chicago, Illinois, June 2005. 3 [71] Jim Gray and Catharine Van Ingen. Empirical Measurements of Disk Failure Rates and Error Rates. Microsoft Research Technical Report MSR-TR-2005-96, December 2005. [72] Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler. Scalable and Distributed Data Structures for Internet Service Construction. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI ’00), San Diego, California, October 2000. [73] Robert L. Grossman and Yunhong Gu. On the Varieties of Clouds for Data Intensive Computing. IEEE Data Engineering Bulletin, 32(1):44–50, March 2009. [74] Ramakrishna Gummadi, Nupur Kothari, Todd D. Millstein, and Ramesh Govindan. Declarative failure recovery for sensor networks. In Proceedings of the 6th International Conference on Aspect-Oriented Software Development (AOSD ’07), Vancouver, Canada, March 2007. [75] Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Jiri Schindler. Deconstructing Commodity Storage Clusters. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA ’05), pages 60–73, Madison, Wisconsin, June 2005. [76] Haryadi S. Gunawi, Vijayan Prabhakaran, Swetha Krishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Improving File System Reliability with I/O Shepherding. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), pages 283–296, Stevenson, Washington, October 2007. [77] Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. SQCK: A Declarative File System Checker. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI ’08), San Diego, California, December 2008. [78] Haryadi S. Gunawi, Cindy Rubio-Gonzalez, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Ben Liblit. EIO: Error Handling is Occasionally Correct. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST ’08), pages 207–222, San Jose, California, February 2008. [79] James Hamilton. On Designing and Deploying Internet-Scale Services. 
In Proceedings of the 21st Large Installation System Administration Conference (LISA ’07), Dallas, Texas, November 2007. [80] James Hamilton. Where Does the Power Go in High-Scale Data Centers? Keynote at USENIX 2009, June 2009. [81] HDFS JIRA. http://issues.apache.org/jira/browse/HDFS. [82] HDFS JIRA. A bad entry in namenode state when starting up. https://issues.apache.org/jira/browse/ HDFS-384. [83] HDFS JIRA. Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes. http://issues. apache.org/jira/browse/HADOOP-572. [84] HDFS JIRA. Replication should be decoupled from heartbeat. http://issues.apache.org/jira/browse/ HDFS-150. [85] Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Wei Lin, Bing Su, Hongyi Wang, and Lidong Zhou. Wave Computing in the Cloud. In the 12th Workshop on Hot Topics in Operating Systems (HotOS XII), Monte Verita, Switzerland, May 2009. [86] Joseph M. Hellerstein, Ron Avnur, and Vijayshankar Raman. Informix under CONTROL: Online Query Processing. Data Min. Knowl. Discov., 4(4):281–314, 2000. [87] Alyssa Henry. Cloud Storage FUD: Failure and Uncertainty and Durability. In Proceedings of the 7th USENIX Symposium on File and Storage Technologies (FAST ’09), San Francisco, California, February 2009. [88] Val Henson. The Many Faces of fsck. http://lwn.net/Articles/248180/, September 2007. [89] Val Henson, Zach Brown, Theodore Ts’o, and Arjan van de Ven. Reducing fsck time for ext2 file systems. In Ottawa Linux Symposium (OLS ’06), Ottawa, Canada, July 2006. [90] Val Henson, Arjan van de Ven, Amit Gud, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In IEEE 2nd Workshop on Hot Topics in System Dependability (HotDep ’06), Seattle, Washington, November 2006. [91] Mark Holland, Garth A. Gibson, and Daniel P. Siewiorek. Fast, on-line failure recovery in redundant disk arrays. In Proceedings of the 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), pages 421–433, Toulouse, France, June 1993. [92] Ling Huang, Minos N. Garofalakis, Joseph M. Hellerstein, Anthony D. Joseph, and Nina Taft. Toward sophisticated detection with distributed triggers. In Proceedings of the 2nd Annual ACM Workshop on Mining Network Data (MineNet ’06), Pisa, Italy, September 2006. 4 [93] Ling Huang, XuanLong Nguyen, Minos N. Garofalakis, Joseph M. Hellerstein, Michael I. Jordan, Anthony D. Joseph, and Nina Taft. Communication-Efficient Online Detection of Network-Wide Anomalies. In The 26th IEEE International Conference on Computer Communications (INFOCOM ’07), Anchorage, Alaska, May 2007. [94] Ryan Huebsch, Brent N. Chun, Joseph M. Hellerstein, Boon Thau Loo, Petros Maniatis, Timothy Roscoe, Scott Shenker, Ion Stoica, and Aydan R. Yumerefendi. The Architecture of PIER: an Internet-Scale Query Processor. 2005. [95] Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed diagnosis in enterprise networks Export. In Proceedings of SIGCOMM ’09, Barcelona, Spain, August 2009. [96] Hannu H. Kari. Latent Sector Faults and Reliability of Disk Arrays. PhD thesis, Helsinki University of Technology, September 1997. [97] Hannu H. Kari, H. Saikkonen, and F. Lombardi. Detection of Defective Media in Disks. In The IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems, pages 49–55, Venice, Italy, October 1993. [98] Kimberly Keeton, Dirk Beyer, Ernesto Brau, Arif Merchant, Cipriano Santos, and Alex Zhang. 
On the Road to Recovery: Restoring Data after Disasters. In Proceedings of the EuroSys Conference (EuroSys ’06), Leuven, Belgium, April 2006. [99] Kimberly Keeton and Arif Merchant. A Framework for Evaluating Storage System Dependability. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’04), Florence, Italy, June 2004. [100] Kimberly Keeton, Cipriano Santos, Dirk Beyer, Jeffrey Chase, and John Wilkes. Designing for disasters. In Proceedings of the 3rd USENIX Symposium on File and Storage Technologies (FAST ’04), San Francisco, California, April 2004. [101] Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta. On Availability of Intermediate Data in Cloud Computations. In the 12th Workshop on Hot Topics in Operating Systems (HotOS XII), Monte Verita, Switzerland, May 2009. [102] Phil Koopman. What’s Wrong with Fault Injection as a Dependability Benchmark? In Workshop on Dependability Benchmarking (in conjunction with DSN-2002), Washington DC, July 2002. [103] Marcel Kornacker. High-Performance Extensible Indexing. In Proceedings of the 25th International Conference on Very Large Databases (VLDB ’99), San Francisco, CA, September 1999. [104] Andrew Krioukov, Lakshmi N. Bairavasundaram, Garth R. Goodson, Kiran Srinivasan, Randy Thelen, Andrea C. ArpaciDusseau, and Remzi H. Arpaci-Dusseau. Parity Lost and Parity Regained. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST ’08), pages 127–141, San Jose, California, February 2008. [105] T. Y. Cliff Leung, Hamid Pirahesh, Joseph M. Hellerstein, and Praveen Seshadri. Query rewrite optimization rules in IBM DB2 universal database. Readings in Database Systems (3rd ed.), pages 153–168, 1998. [106] Xin Li, Michael C. Huang, , and Kai Shen. An Empirical Study of Memory Hardware Errors in A Server Farm. In The 3rd Workshop on Hot Topics in System Dependability (HotDep ’07), Edinburgh, UK, June 2007. [107] Boon Thau Loo, Tyson Condie, Minos Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica. Declarative networking. Commun. ACM, 52(11):87–95, 2009. [108] Boon Thau Loo, Tyson Condie, Minos N. Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica. Declarative networking: language, execution and optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’06), Chicago, Illinois, June 2006. [109] Boon Thau Loo, Tyson Condie, Joseph M. Hellerstein, Petros Maniatis, Timothy Roscoe, and Ion Stoica. Implementing Declarative Overlays. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05), Brighton, United Kingdom, October 2005. [110] Boon Thau Loo, Joseph M. Hellerstein, Ryan Huebsch, Scott Shenker, and Ion Stoica. Enhancing p2p file-sharing with an internet-scale query processor. In Proceedings of the 30th International Conference on Very Large Databases (VLDB ’04), Toronto, Canada, September 2004. [111] Boon Thau Loo, Joseph M. Hellerstein, Ion Stoica, and Raghu Ramakrishnan. Declarative routing: extensible routing with declarative queries. In Proceedings of the ACM SIGCOMM 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM ’05), Philadelphia, Pennsylvania, August 2005. [112] Peter Lyman and Hal R. Varian. How Much Information? projects/how-much-info-2003. 2003. www2.sims.berkeley.edu/research/ [113] Om Malik. 
When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data. http://gigaom.com. [114] Mike Mammarella, Shant Hovsepian, and Eddie Kohler. Modular data storage with Anvil. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP ’09), Big Sky, Montana, October 2009. [115] Richard P. Martin, Amin M. Vahdat, David E. Culler, and Thomas E. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA ’97), pages 85–97, Denver, Colorado, May 1997. 5 [116] Matthew L. Massie, Brent N. Chun, and David E. Culler. The Ganglia Distributed Monitoring System: Design, Implementation, and Experience. Parallel Computing, 30(7), July 2004. [117] Marshall Kirk McKusick, Willian N. Joy, Samuel J. Leffler, and Robert S. Fabry. Fsck - The UNIX File System Check Program. Unix System Manager’s Manual - 4.3 BSD Virtual VAX-11 Version, April 1986. [118] Lucas Mearian. Facebook temporarily loses more than 10% of photos in hard drive failure. www.computerworld.com. [119] C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, and Peter M. Schwarz. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database Syst., 17(1):94–162, 1992. [120] Curt Monash. eBay’s Two Enormous Data Warehouses. DBMS2 (weblog), April 2009. http://www.dbms2.com/ 2009/04/30/ebays-two-enormous-data-warehouses. [121] John Oates. Bank fined 3 millions pound sterling for data loss, still not taking it seriously. www.theregister.co.uk/ 2009/07/22/fsa hsbc data loss. [122] Adam J. Oliner, Alex Aiken, and Jon Stearley. Alert Detection in System Logs. In Proceedings of the International Conference on Data Mining (ICDM ’08), Pisa, Italy, December 2008. [123] Adam J. Oliner and Jon Stearley. What supercomputers say: A study of five system logs. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’07), Edinburgh, UK, June 2007. [124] Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Ganesha: Black-box Fault Diagnosis for MapReduce Environments. In the 2nd Workshop on Hot Topics in Measurement and Modeling of Computer Systems (HotMetrics ’09), Seattle, Washington, June 2009. [125] Biswanath Panda, Joshua Herbach, Sugato Basu, and Roberto Bayardo. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. In Proceedings of the 35th International Conference on Very Large Data Bases (VLDB ’09), Lyon, France, August 2009. [126] Shankar Pasupathy. Personal Communication from Shankar Pasupathy of NetApp, 2009. [127] David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike Chen, James Cutler, Patricia Enriquez, Armando Fox, Emre Kiciman, Matthew Merzbacher, David Oppenheimer, Naveen Sastry, William Tetzlaff, Jonathan Traupman, and Noah Treuhaft. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Technical Report CSD-02-1175, U.C. Berkeley, March 2002. [128] J. Kent Peacock, Ashvin Kamaraju, and Sanjay Agrawal. Fast Consistency Checking for the Solaris File System. In Proceedings of the USENIX Annual Technical Conference (USENIX ’98), pages 77–89, New Orleans, Louisiana, June 1998. [129] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz Andre Barroso. Failure Trends in a Large Disk Drive Population. 
In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST ’07), pages 17–28, San Jose, California, February 2007. [130] Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard. DryadInc: Reusing Work in Large-scale Computations. In Workshop on Hot Topics in Cloud Computing (HotCloud ’09), San Diego, California, June 2009. [131] Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Model-Based Failure Analysis of Journaling File Systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’05), pages 802–811, Yokohama, Japan, June 2005. [132] Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Model-Based Failure Analysis of Journaling File Systems. To appear in IEEE Transactions on Dependable and Secure Computing (TDSC), 2006. [133] Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. IRON File Systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05), pages 206–220, Brighton, United Kingdom, October 2005. [134] Ari Rabkin, Andy Konwinski, Mac Yang, Jerome Boulon, Runping Qi, and Eric Yang. Chukwa: a large-scale monitoring system. In Cloud Computing and Its Applications (CCA ’08), Chicago, IL, October 2008. [135] American Data Recovery. Data loss statistics. http://www.californiadatarecovery.com/content/adr loss stat.html. [136] Frederick Reiss and Joseph M. Hellerstein. Declarative network monitoring with an underprovisioned query processor. In Proceedings of the 22nd International Conference on Data Engineering (ICDE ’06), Atlanta, GA, April 2006. [137] Cindy Rubio-Gonzalez, Haryadi S. Gunawi, Ben Liblit, Remzi H. Arpaci-Dusseau, and Andrea C. Arpaci-Dusseau. Error Propagation Analysis for File Systems. In Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation (PLDI ’09), Dublin, Ireland, June 2009. 6 [138] Bianca Schroeder and Garth Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST ’07), pages 1–16, San Jose, California, February 2007. [139] Thomas Schwarz, Mary Baker, Steven Bassi, Bruce Baumgart, Wayne Flagg, Catherine van Ingen, Kobus Joste, Mark Manasse, and Mehul Shah. Disk failure investigations at the internet archive. In NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST) Work in Progress Session, 2006. [140] Russell Sears and Eric A. Brewer. Segment-based recovery: Write ahead logging revisited. PVLDB, 2(1):490–501, 2009. [141] Simon CW See. Data Intensive Computing. In Sun Preservation and Archiving Special Interest Group (PASIG ’09), San Fancisco, California, October 2009. [142] Mehul A. Shah, Joseph M. Hellerstein, and Eric A. Brewer. Highly-Available, Fault-Tolerant, Parallel Dataflows. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’04), Paris, France, June 2004. [143] Dennis Shasha. Biocomputational Puzzles: Data, Algorithms, and Visualization. In Invited Talk at the 11th International Conference on Extending Database Technology (EDBT ’08), Nantes, France, March 2008. [144] Muthian Sivathanu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Somesh Jha. A Logic of File Systems. In Proceedings of the 4th USENIX Symposium on File and Storage Technologies (FAST ’05), pages 1–15, San Francisco, California, December 2005. 
[145] Muthian Sivathanu, Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Improving Storage System Availability with D-GRAID. In Proceedings of the 3rd USENIX Symposium on File and Storage Technologies (FAST ’04), pages 15–30, San Francisco, California, April 2004. [146] Michael Stonebraker and Joseph M. Hellerstein. Content integration for e-business. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD ’01), Santa Barbara, California, May 2001. [147] Sriram Subramanian, Yupu Zhang, Rajiv Vaidyanathan, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, Remzi H. ArpaciDusseau, and Jeffrey F. Naughton. Impact of Disk Corruption on Open-Source DBMS. In Proceedings of the 26th International Conference on Data Engineering (ICDE ’10), Long Beach, California, March 2010. [148] Rajesh Sundaram. The Private Lives of Disk Drives. www.netapp.com/go/techontap/matl/sample/0206tot resiliency.html, February 2006. [149] John D. Sutter. A trip into the secret, online ’cloud’. www.cnn.com/2009/TECH/11/04/cloud.computing. hunt/index.html. [150] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Reliability of Commodity Operating Systems. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03), Bolton Landing, New York, October 2003. [151] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Recovering device drivers. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 1–16, San Francisco, California, December 2004. [152] Alexander Szalay and Jim Gray. 2020 Computing: Science in an exponential world. Nature, (440):413–414, March 2006. [153] Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. SALSA: Analyzing Logs as StAte Machines. In the 1st Workshop on the Analysis of System Logs (WASL ’08), San Diego, CA, September 2008. [154] Hadoop Team. Fault Injection framework: How to use it, test using artificial faults, and develop new faults. http:// issues.apache.org. [155] PVFS2 Development Team. PVFS Developer’s Guide. www.pvfs.org. [156] Doug Thain, John Bent, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Pipeline and Batch Sharing in Grid Workloads. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC 12), pages 152–161, Seattle, Washington, June 2003. [157] Florian M. Waas and Joseph M. Hellerstein. Parallelizing Extensible Query Optimizers. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD ’09), Providence, Rhode Island, June 2009. [158] Glenn Weinberg. The Solaris Dynamic File System. http://members.visi.net/∼thedave/sun/DynFS.pdf, 2004. [159] Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian Mueller, Jason Small, Jim Zelenka, and Bin Zhou. Scalable Performance of the Panasas Parallel File System. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST ’08), San Jose, California, February 2008. [160] Qin Xin, Ethan L. Miller, and Thomas Schwarz. Evaluation of distributed recovery in large-scale storage systems. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC 13), Honolulu, Hawaii, June 2004. 7 [161] Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. Detecting Large-Scale System Problems by Mining Console Logs. 
In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP ’09), Big Sky, Montana, October 2009. [162] Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. In Proceedings of the 6th Symposium on Networked Systems Design and Implementation (NSDI ’09), Boston, Massachusetts, April 2009. [163] Junfeng Yang, Can Sar, and Dawson Engler. EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI ’06), Seattle, Washington, November 2006. [164] Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using Model Checking to Find Serious File System Errors. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), San Francisco, California, December 2004. [165] Shaula Alexander Yemini, Shmuel Kliger, Eyal Mozes, Yechiam Yemini, and David Ohsie. High Speed and Robust Event Correlation. IEEE Communications, 34(5):82–90, May 1996. 8 E Supplement: Postdoctoral Mentoring Plan This proposal includes budget in the first year to support Dr. Haryadi Gunawi, a post-doctoral researcher in Hellerstein’s group at Berkeley. Gunawi did his Ph.D. at UW-Madison under the co-direction of co-PI Andrea ArpaciDusseau and her frequent collaborator Remzi Arpaci-Dusseau. Gunawi serves as a key technical bridge between the research agendas of the two PIs, and will naturally interact with and be mentored by both PIs. Arpaci-Dusseau mentored Gunawi through his Ph.D., and the two have an established relationship which evolved during that period. Gunawi will sit at Berkeley with Hellerstein’s group during his postdoc internship. The intent of the postdoctoral mentoring in this project is to give Gunawi the experience of defining and leading a research project, while also learning to collaborate with peers, senior researchers, industrial partners, and students at the graduate and undergraduate level. As a first step in this process, Gunawi has already been given a lead role in defining and writing this proposal with his familiar mentor Arpaci-Dusseau, and his less-familiar new collaborator Hellerstein. The proposal-writing process established a tone that we intend to follow throughout the post-doc: Gunawi led the identification of research themes and drafted his ideas in text, Arpaci-Dusseau and Hellerstein commented and injected their ideas into Gunawi’s context, and Gunawi himself was responsible for synthesizing the mix of ideas into a cogent whole, both conceptually and textually. At Berkeley, Gunawi will share office space with Hellerstein’s graduate students, a few steps from Hellerstein’s office. He will attend weekly meetings of Hellerstein’s research group, which include graduate students, fellow professors from other areas in computer science (e.g. Programming Languages and Machine Learning), and collaborators from industry (e.g. a weekly visiting collaborator from Yahoo Research). Gunawi will be asked to present occasional lectures in courses on both Operating Systems and Database Systems, at the graduate and undergraduate levels. He will of course be expected to aggressively lead research and publication efforts, as he has already done during his Ph.D. years (when he won the departmental research excellence award at UW-Madison). 
The PIs believe that one of the key challenges for a very successful recent Ph.D. like Gunawi is to learn to thoughtfully extend their recently-acquired personal research prowess to enhance and leverage contributions from more junior (graduate and undergraduate students) and more senior (faculty and industrial) collaborators. To that end, the PIs intend to facilitate opportunities for Gunawi to work relatively independently with students and mature collaborators, but also to engage with him in reflection on the process of those collaborations, along with tips on mechanisms for sharing and subdividing research tasks, stepping back to let more junior researchers find their way through problems, and translating concepts across research areas. Because Gunawi has had the benefit of mentorship from Arpaci-Dusseau for many years in graduate school, the plan is to shift his main point of contact to Hellerstein and provide him with a new experience at Berkeley. Structurally, the plan is for Hellerstein, Arpaci-Dusseau and Gunawi to meet as a group bi-weekly via teleconference, and for Hellerstein and Gunawi to meet on a weekly basis. Hellerstein has a strong track record of mentoring PhD students (11 completed to date, 7 underway) into successful careers in academic research and teaching positions, and industrial research and development.