How To Make Chord Correct
Pamela Zave
AT&T Laboratories—Research
One AT&T Way, Room 4D148K
Bedminster, New Jersey 07921, USA
+1 908 901 2204
pamela@research.att.com
February 10, 2014

Abstract

The Chord distributed hash table (DHT) is well known, frequently used to implement peer-to-peer systems, and frequently used as a foundation for research on designing DHTs with additional properties. Despite claims of proven correctness, previous work has shown that the Chord ring-maintenance protocol is not correct. It has not, however, discovered whether Chord could be made correct. The principal contribution of this paper is to solve the problem by providing the first specification of a correct version of Chord, and a proof of its correctness. In addition, the paper provides a simple, necessary, and sufficient inductive invariant, as well as interesting new insights into the workings of distributed systems that use ring-shaped data structures in large identifier spaces.

This paper is a regular submission. It is not eligible for a student paper award.

1 Introduction

Peer-to-peer systems are distributed systems featuring decentralized control, self-organization of similar nodes, and scalability. A distributed hash table (DHT) is a peer-to-peer system that implements a persistent key-value store. It can be used for shared file storage, group directories, and many other purposes. The distributed hash table Chord was first presented in a 2001 SIGCOMM paper [20]. This paper has been the fourth-most-cited paper in computer science for several years (according to Citeseer), and won the 2011 SIGCOMM Test-of-Time Award.

The nodes of a Chord network have identifiers in an m-bit identifier space, and reach each other through pointers in this identifier space. Because the pointer structure is based on adjacency in the identifier space, and 2^m − 1 is adjacent to 0, the structure of a Chord network is a ring. The ring structure is disrupted when nodes join, leave, or fail.
The original Chord papers [20, 21] specify a ring-maintenance protocol whose minimum correctness property is eventual reachability: given ample time and no further disruptions, the ring-maintenance protocol can repair all disruptions in the ring structure. If the protocol is not correct in this sense, then some nodes of a Chord network can become permanently unreachable from other nodes.

The introductions of the original Chord papers say, “Three features that distinguish Chord from many other peer-to-peer lookup protocols are its simplicity, provable correctness, and provable performance.” An accompanying PODC paper [12] lists invariants of the ring-maintenance protocol.

The claims of simplicity and performance are certainly true. The Chord algorithms are far simpler and more completely specified than those of other DHTs, such as Pastry [18], Tapestry [25], CAN [17], and Kademlia [14]. There is no attempt to specify or enforce fairness on distributed nodes. There are no atomic operations involving multiple nodes. The ease of implementing Chord is probably the reason for its popularity as a component of peer-to-peer systems. Its fundamental simplicity is probably the reason for its popularity as a basis for building DHTs with stronger guarantees and additional capabilities, such as protection against malicious peers [2, 4, 16], key consistency [6], range queries [7], and atomic access to replicated data [13, 15].

Unfortunately, the claim of correctness is not true. The original specification does not have eventual reachability, and none of the claimed invariants is actually an invariant [23]. The principal contribution of this paper is to solve the problems that were revealed in [23], by providing the first specification of a correct version of Chord, and a proof of its correctness.

These new results are achieved with lightweight modeling, a technology that has advanced to maturity in recent years.
Lightweight modeling consists of building small formal models of key concepts of a system, and checking properties of them by means of exhaustive enumeration of instances over a bounded domain. This push-button analysis either yields a counterexample, or proves that the property holds in the bounded domain. In this paper lightweight models are written in the Alloy language and checked with the Alloy Analyzer [8]; the reasons for this choice are discussed in [24]. The results in this paper are significant in three major ways: (1) Many people implement Chord, use Chord as a component of their distributed systems, or base their research on Chord. They should have a correct version of Chord to use. They should also have a correct invariant for Chord, as this is a design principle for enhancing DHT security [19]. (2) Many people reason about Chord behavior for the purposes of their research. This reasoning should have a sound foundation. For example, the performance analysis in [10] makes incorrect assumptions about Chord behavior [23]. The research on augmenting and strengthening Chord, as referenced above, relies on informal descriptions of Chord and informal reasoning about its behavior. As automated proof checking increasingly becomes the norm in distributed systems, attempts to prove properties of systems based on original Chord will fail or yield unsound results. (3) As explained in Section 4, efforts to find the best version of Chord and the best invariant for a proof have led to interesting insights into how Chord works. People who build on Chord should be aware of these properties so as to preserve them. Some principles may be applicable to all systems that use ring-shaped data structures in large identifier spaces (e.g., [1, 17]). The paper begins with an overview of Chord using the revised, correct ring-maintenance operations (Section 2), and a new specification of these operations (Section 3). 
Although the specification is informal for maximum accessibility, it is a paraphrase of a formal specification in Alloy. The complete Alloy model, including specification, invariant, and all steps of the proof, can be found at http://www2.research.att.com/~pamela/chord.html.

Correct operations are necessary but not sufficient. It is also necessary to have an initialization condition and an invariant to use in constructing a proof. Original Chord is initialized with a network of one node; this is not correct, and Section 4 shows why. Chord must be initialized and maintained with a ring containing a minimum of r + 1 nodes, where r is the length of each node’s list of successors.

Section 4 shows that it is also necessary for r + 1 initial nodes to form a “stable base,” in the sense that these nodes remain members of the network throughout its lifetime. The stable base is sufficient to enforce a simple invariant and allow a straightforward proof of correctness (Section 5).

The stable base is an interesting requirement, because the stable base has few members and a Chord network can have millions of them. This means that there are arbitrarily large segments of the ring that are far from any base member, and on which the base members can have no local effect. The effect of the stable base is global, preventing complications due to “wrapping around the ring” in a sense explained in Section 4.

The stable base is also a fortunate requirement, because it is simple, eliminates the need for additional implementation mechanisms, and has a practical effect only when the ring is small. Once a Chord network reaches a reasonable size, it can be ignored as an implementation requirement.

Although other researchers have found problems with Chord implementations [5, 9, 22], they have not discovered any problems with the specification of Chord. Other work on verifiable ring maintenance operations [11] uses multi-node atomic operations, which are avoided by Chord.
2 Overview of correct Chord

Every member of a Chord network has an identifier (assumed unique) that is an m-bit hash of its IP address. Every member has a successor list of pointers to other members. The first element of this list is the successor, and is always shown as a solid arrow in the figures. Figure 1 shows two Chord networks with m = 6, one in the ideal state of a ring ordered by identifiers, and the other in the valid state of an ordered ring with appendages. In the networks of Figure 1, key-value pairs with keys from 31 through 37 are stored in member 37.

[Figure 1: Ideal (left) and valid (right) networks. Members are represented by their identifiers. Solid arrows are successor pointers.]

While running the ring-maintenance protocol, a member also acquires and updates a predecessor pointer, which is always shown as a dotted arrow in the figures.

The ring-maintenance protocol is specified in terms of three operations, each of which changes the state of at most one member. In executing an operation, the member queries another member or sequence of members, then updates its own pointers if necessary. The specification of Chord assumes that inter-node communication is bidirectional and reliable, so we are not concerned with Chord behavior when inter-node communication fails.

[Figure 2: A new node becomes part of the ring. A gray circle marks the pointer updated by an operation, if any. Dotted arrows are predecessors. Stages: 10 joins; 10 stabilizes; 19 rectifies; 7 stabilizes; 10 rectifies.]

A node becomes a member in a join operation. A member node is also referred to as live. When a node joins, it contacts an existing member and gets its own current successor from that member. (It also contacts the current successor to get a full successor list.) The first stage of Figure 2 shows successor and predecessor pointers in a section of a network where 10 has just joined.
When a member stabilizes, it learns its successor’s predecessor. It adopts the predecessor as its new successor, provided that the predecessor is closer in identifier order than its current successor. Because a member must query its successor to stabilize, this is also an opportunity for it to update its successor list with information from the successor. Members schedule their own stabilize operations, which should be periodic.

Between the first and second stages of Figure 2, 10 stabilizes. Because its successor’s predecessor is 7, which is not a better successor for 10 than its current 19, this operation does not change the successor of 10.

After stabilizing (regardless of the result), a node notifies its successor of its identity. This causes the notified member to execute a rectify operation. The rectifying member checks whether its current predecessor is still a member, and then adopts the notifying member as its new predecessor if the notifying member is closer in identifier order than its current predecessor (or if it has no live predecessor). In the third stage of Figure 2, 10 has notified 19, and 19 has adopted 10 as its new predecessor.

In the fourth stage of Figure 2, 7 stabilizes, which causes it to adopt 10 as its new successor. In the last stage 7 notifies and 10 rectifies, so the predecessor of 10 becomes 7. Now the new member 10 is completely incorporated into the ring, and all the pointers shown are correct.

The assumption of the protocol is that a member in good standing always responds to queries in a timely fashion. A node ceases to be a member in a fail event, which can represent failure of the machine, or the node’s silently leaving the network. A member that has failed is also referred to as dead. After a member fails, it no longer responds to queries from other members. With these assumptions, members can detect the failure of other members perfectly by noticing whether they respond to a query before a timeout occurs.
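The Figure 2 walk-through can be replayed with a small Python sketch. It is only an illustration under simplifying assumptions that are not part of the paper's specification: each node keeps a single successor pointer instead of a successor list, no node fails (so rectify skips its liveness check), queries are replaced by direct in-memory access, and the ring is closed with just the two members 7 and 19. The names `Node`, `join`, `stabilize`, `rectify`, and `figure2_replay` are hypothetical, and `between` is a three-argument test of circular identifier order.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def between(n1: int, n2: int, n3: int) -> bool:
    """Circular order test: is n2 strictly between n1 and n3, going clockwise?"""
    if n1 < n3:
        return n1 < n2 < n3
    return n1 < n2 or n2 < n3


@dataclass
class Node:
    ident: int
    succ: int                    # first successor only (simplifying assumption)
    pred: Optional[int] = None   # None until the node is first notified


Network = Dict[int, Node]


def join(net: Network, new_id: int, known: int) -> None:
    """The joining node asks a known member to find its proper successor."""
    cur = net[known]
    while not between(cur.ident, new_id, cur.succ):
        cur = net[cur.succ]
    net[new_id] = Node(new_id, succ=cur.succ)


def rectify(net: Network, ident: int, new_pred: int) -> None:
    """The notified node adopts the notifier if it is a closer predecessor."""
    node = net[ident]
    if node.pred is None or between(node.pred, new_pred, node.ident):
        node.pred = new_pred


def stabilize(net: Network, ident: int) -> None:
    """Adopt the successor's predecessor if it is closer, then notify."""
    node = net[ident]
    candidate = net[node.succ].pred
    if candidate is not None and between(node.ident, candidate, node.succ):
        node.succ = candidate
    rectify(net, node.succ, node.ident)


def figure2_replay() -> Network:
    # Assumed starting point: a two-member ring 7 -> 19 -> 7.
    net = {7: Node(7, succ=19, pred=19), 19: Node(19, succ=7, pred=7)}
    join(net, 10, known=7)   # stage 1: 10 joins with successor 19
    stabilize(net, 10)       # stages 2-3: 10 keeps 19; 19 adopts 10 as predecessor
    stabilize(net, 7)        # stages 4-5: 7 adopts 10; 10 adopts 7 as predecessor
    return net


if __name__ == "__main__":
    for node in figure2_replay().values():
        print(node)
```

Bundling the notification into `stabilize` mirrors the text's statement that a node notifies its successor after every stabilization, whether or not its own successor changed.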
Another assumption about failure behavior is that successor lists are long enough, and failures are infrequent enough, to ensure that a member is never left with no live successor in its list.

Failures can produce gaps in the ring, which are repaired during stabilization. As a member attempts to query its successor for stabilization, it may find that its successor is dead. In this case it attempts to query the next member in its successor list and make this its new successor, continuing through the list until it finds a live successor.

It is well known that a Chord network must preserve the following structural property. Defining a member’s best successor as its first successor pointing to a live node (member):

• there must be a ring of best successors;
• there must be no more than one ring;
• on the ring of best successors, the nodes must be in identifier order;
• from each member in an appendage, the ring must be reachable through best successors.

If any of these rules is violated, there is a disruption in the structure that the ring-maintenance protocol cannot repair. The inevitable result is that some members will be permanently unreachable from some other members.

A network is ideal when each pointer is globally correct. For example, on the right of Figure 1, the globally correct successor of 48 is 50 because it is the nearest member in identifier order. The minimum correctness criterion for the ring-maintenance protocol is simple: In any execution state, if there are no subsequent join or fail events, then eventually the network will become ideal and remain ideal. This is not a particularly stringent requirement, as it allows the protocol ample time and no further disruptions while it works to repair the ring.

The Chord papers define the lookup protocol, which is not discussed here. They also define the maintenance and use of finger tables, which improve lookup speed by providing pointers that cross the ring like chords of a circle.
Finger tables are built from successor lists and the correctness of ring maintenance does not depend on them, so they are not discussed here.

3 Specification of ring-maintenance operations

Appendix A contains pseudocode for the join, stabilize, and rectify operations, following the overview in the previous section. In format, it is much more complete and explicit than the original specification of Chord, particularly with respect to communication between nodes.

There is a type Identifier which is a string of m bits. Implicitly, whenever a member transmits the identifier of a member, it also transmits its IP address so that the recipient can reach the identified member. The pair is self-authenticating, as the identifier must be the hash of the IP address according to a chosen function.

The Boolean function between is used to check the order of identifiers, and will be referred to in Section 4. Because identifier order wraps around at zero, it is meaningless to compare two identifiers: each precedes and succeeds the other. This is why between has three arguments:

    Boolean function between (n1, n2, n3: Identifier) {
       if (n1 < n3) { n1 < n2 && n2 < n3 }
       else         { n1 < n2 || n2 < n3 }
    }

It is important to note that, for all distinct x and y, between(x,y,x) is always true, and between(x,x,y) and between(y,x,x) are always false.

The join, stabilize, and rectify operations are quite different from the original pseudocode specification of Chord. There are two major changes. First, multiple smaller operations are assembled into larger operations. This ensures that the successor lists of members are always fully populated, rather than having missing entries to be filled in by later operations. Second, before incorporating a pointer to a node into its state, a member usually checks that it is live. This prevents cases where a member replaces a pointer to a live node with a pointer to a dead one. Both changes increase the amount
Both changes increase the amount 6 62 37 48 37 48 stabilizes, 62 rectifies 48 62 37 48 fails 62 Figure 3: Why the ring cannot be initialized at size 1. Dashed arrows are second-successor pointers. Predecessor pointers are not shown in the last two stages, as they are irrelevant. of useful information available to members. In [23] there are numerous examples of how the previous operations can result in fatal errors or violation of desirable invariants. Including both major and minor changes, the new specification has three sources: • pseudocode and text selected from the three papers [12, 20, 21]; • clarifications about the orginal implementation of Chord [3]; • bug fixes found with extensive Alloy modeling and analysis, as described further in the next two sections. 4 Initialization and invariant Correct operations are necessary but not sufficient. We also need an initialization condition and an invariant to use in constructing a proof. Original Chord initializes a network with a single member that is its own successor, i.e., the initial network is a ring of size 1. This is not correct, as shown in Figure 3. In this example, the length r of the successor list is 2. The problem is that when the ring is too small, successor lists must start with some equal entries (62 and 37 start with both list entries equal). According to the operating assumptions, 48 can fail, leaving members 62 and 37 with insufficient information to find each other. Only when members have r distinct entries in their successor lists do the operating assumptions guarantee that each member will always have at least one live successor. For members to have r distinct entries in their successor lists, a Chord network must be initialized and maintained with a minimum ring size of r+1. 7 However, this is not sufficient. In addition, r + 1 of the initial ring members must serve as a “stable base” of ring members that are highly available and will continue to be members throughout the life of the network. 
The typical range for r is 3-5, so the typical stable base would require 4 to 6 members. The stable base is necessary for correctness. Briefly, we say that a member n1 skips another member n2 if n2 is not in the successor list of n1, yet between(n1, n2, n1.lastSucc), where n1.lastSucc is the last member of the successor list. Now consider a network whose ring has shrunk (through failures) to its minimum size, and each remaining ring member skips the next remaining ring member, having in its successor list the second-closest remaining ring member. If the minimum size is odd, the result is a ring whose members are not in identifier order. If the minimum size is even, the result is two disjoint rings. The stable base prevents these situations by ensuring that the remaining nodes are the nodes of the stable base, and that—because they have always been members, rather than failing and rejoining—they are not skipped in successor lists. The good news about the stable base is that it is not only necessary, but sufficient. The remainder of this section describes an invariant developed on the idea of a stable base. It also explains how the invariant works and why this is all good news. A Chord network can be initialized to any state that satisfies this invariant. The first conjuncts of the invariant, which is paraphrased completely here and expressed in Alloy in Appendix B, are merely the formalizations of the structural properties in Section 2. The conjunct StructuredSuccessorList says that each node’s successor list has r distinct entries. Also, considering node n’s extended successor list append(n, n.succList), for all contiguous sublists (x, y, z), between(x, y, z) holds. The conjunct BaseNotSkipped says that, for all contiguous sublists (x, y) of node n’s extended successor list, and for all members b of the stable base, between(x, b, y) is false. Note that x = b or y = b is allowed. 
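The definitions of skipping, StructuredSuccessorList, and BaseNotSkipped can be made concrete with a short Python sketch. It is a hedged illustration rather than the paper's Alloy model: identifiers are plain integers, a successor list is a Python list, and the function names are hypothetical. The `between` function transcribes the pseudocode of Section 3. The sample data borrows the identifiers of Figure 4 (r = 2, member 52 with successor list (3, 45)); the particular stable base {20, 31, 52} is an assumption chosen for the example.

```python
from typing import Iterable, List


def between(n1: int, n2: int, n3: int) -> bool:
    """Transcription of the three-argument ordering test from Section 3."""
    if n1 < n3:
        return n1 < n2 < n3
    return n1 < n2 or n2 < n3


def skips(n1: int, succ_list: List[int], n2: int) -> bool:
    """n1 skips n2: n2 is missing from the list yet lies before its last entry."""
    return n2 not in succ_list and between(n1, n2, succ_list[-1])


def structured_successor_list(n: int, succ_list: List[int], r: int) -> bool:
    """r distinct entries, and every contiguous triple of the extended list is ordered."""
    ext = [n] + succ_list
    distinct = len(succ_list) == r and len(set(succ_list)) == r
    ordered = all(between(x, y, z) for x, y, z in zip(ext, ext[1:], ext[2:]))
    return distinct and ordered


def base_not_skipped(n: int, succ_list: List[int], base: Iterable[int]) -> bool:
    """No base member lies strictly between adjacent entries of the extended list."""
    ext = [n] + succ_list
    base = list(base)
    return not any(between(x, b, y) for x, y in zip(ext, ext[1:]) for b in base)


if __name__ == "__main__":
    r = 2
    base = {20, 31, 52}  # assumed stable base of size r + 1
    print(structured_successor_list(52, [3, 45], r))  # list is well-formed
    print(skips(52, [3, 45], 20))                     # 52 skips member 20
    print(base_not_skipped(52, [3, 45], base))        # so the invariant is violated
```

Because `between` is strict at both ends, the cases x = b and y = b never count as skipping, which matches the note that they are allowed.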
Figure 4 illustrates a failure that cannot occur when there is a stable base, for r = 2. Although the initial phase violates the invariant (between its first successor 3 and its second successor 45, 52 skips over 20 and 31, at least one of which must be in the stable base of size 3), its individual peculiarities are not impossible. 45 is an appendage that attaches to the ring in the wrong place. The second successor of 52 points outside the ring. Examples in [23] show how both peculiarities can occur (and still can, even with the revised operations).

[Figure 4: Without the stable base, failure can leave the ring disordered. Only the relevant pointers are drawn.]

If we look at the extended successor lists (52, 3, 45) and (45, 3, 20), both are well-formed in the sense that they satisfy StructuredSuccessorList. However, (52, 3, 45) implies that 45 is close to and clockwise of 3, and (45, 3, 20) implies that 45 is close to and counterclockwise of 3. Both could only be realistic if the ring were smaller than it really is. In other words, together these two successor lists imply a tight “wrap around” of identifiers that is not correct. When 3 fails, the ring becomes disordered.

The stable base works because it provides enough permanent spacers around the ring to prevent such anomalies. Note that its beneficial effect relies only on r, not on the distribution of base nodes around the ring. All r + 1 base nodes could be adjacent, and it would work perfectly well.

The first advantage of this invariant is that it is simple. There are localized structural properties that were studied as possible parts of an invariant, but they are far more complex (as well as being neither necessary nor sufficient). The full paper will discuss these properties, as they might be useful for security (in the spirit of [19]).

The second advantage of this invariant is that it is powerful in eliminating the need for additional implementation mechanisms.
For example, some plausible localized properties can be falsified when a node rejoins after failing, if other nodes still have pointers to it from its last episode of membership. Their obsolete pointers can be misleading. To prevent this problem and enforce the plausible properties, it would be necessary to enforce a “blackout interval” during which a failed node cannot rejoin. This mechanism would be both uncertain and inconvenient, and it is not necessary with a stable base.

The third and most important advantage of this invariant is that a stable base has few nodes and a Chord network can have millions. This means that there are arbitrarily large segments of the ring that are far from any base member, and on which the base members can have no local effect. The problems prevented by the stable base can only occur when the ring is small. As a result, once a Chord network reaches a reasonable size, the effect of the stable base is negligible and it can be ignored as an implementation requirement.

5 Proof of correctness

The formal model uses shared memory communication between nodes to simulate queries. Concurrency has an interleaving semantics. The interleaved atomic steps model the local computations performed by nodes between or after queries. The following proof steps can all be found in full at the previously given URL.

(1) Check that the computation steps of the operations are consistent. The join operation has two steps such that the first establishes a precondition for the second. Since the precondition concerns the stable base, it cannot be falsified by interleaved computations.

(2) Check that each computation step preserves the invariant.

(3) Define and validate effectiveness predicates for the stabilize and rectify operations. If EffectiveOpEnabled[n,t] for some node n and state t, then the referenced operation is enabled for n in state t, and will change the state of n if it occurs.
(4) Check that in any valid state that is not ideal, some effective stabilize or rectify operation is enabled. Also check that in any ideal state, no effective operation is enabled. This shows that stabilize and rectify can always change a non-ideal state, and can never change an ideal state.

(5) Define a measure of the error in a Chord network, such that the measure of an ideal network is 0 and the measure of a non-ideal network is a positive integer. Show that every effective stabilize or rectify reduces the measure. This proves that a network with no new joins or fails will eventually become ideal.

Step 5 is manual, while all the others have been automated by the Alloy Analyzer for r = 2 and up to 8 nodes. In general, the claim that this constitutes a proof is based on the empirical “small scope hypothesis,” which says that most bugs can be illustrated by small examples [8]. In particular, there is no evidence that longer successor lists would exhibit new problems. Concerning the number of nodes, although many new behaviors were found by increasing the number of nodes from 5 to 6, no new behaviors were ever found by increasing the number from 6 to 7; this makes 8 look safe as a limit. Nevertheless, the current proof is straightforward, and open to improvement with techniques that apply to successor lists and networks of all sizes.

References

[1] B. Awerbuch and C. Scheideler. The hyperring: A low-congestion deterministic data structure for distributed environments. In Proceedings of SODA. ACM, 2004.
[2] B. Awerbuch and C. Scheideler. Towards a scalable and robust DHT. In Proceedings of the 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 318–327. ACM, 2006.
[3] H. Balakrishnan and I. Stoica, 2013. Personal communication.
[4] A. Fiat, J. Sala, and M. Young. Making Chord robust to Byzantine attacks. In Proceedings of the European Symposium on Algorithms, pages 803–814. Springer LNCS 3669, 2005.
[5] M. J. Freedman, K.
Lakshminarayanan, S. Rhea, and I. Stoica. Nontransitive connectivity and DHTs. In Proceedings of the 2nd Conference on Real, Large, Distributed Systems, pages 55–60. USENIX, 2005.
[6] L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson. Scalable consistency in Scatter. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles. ACM, October 2011.
[7] A. Gupta, D. Agrawal, and A. E. Abbadi. Approximate range selection queries in peer-to-peer systems. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR 2003), 2003.
[8] D. Jackson. Software Abstractions: Logic, Language, and Analysis. MIT Press, 2006, 2012.
[9] C. Killian, J. A. Anderson, R. Jhala, and A. Vahdat. Life, death, and the critical transition: Finding liveness bugs in systems code. In Proceedings of the 4th USENIX Symposium on Networked System Design and Implementation, pages 243–256, 2007.
[10] S. Krishnamurthy, S. El-Ansary, E. Aurell, and S. Haridi. A statistical theory of Chord under churn. In Peer-to-Peer Systems IV. Springer LNCS 3640, 2005.
[11] X. Li, J. Misra, and C. G. Plaxton. Active and concurrent topology maintenance. In Distributed Computing, pages 320–334. Springer LNCS 3274, 2004.
[12] D. Liben-Nowell, H. Balakrishnan, and D. Karger. Analysis of the evolution of peer-to-peer systems. In Proceedings of the 21st ACM Symposium on Principles of Distributed Computing, pages 233–242. ACM, 2002.
[13] N. Lynch, D. Malkhi, and D. Ratajczak. Atomic data access in distributed hash tables. In Proceedings of IPTPS, pages 295–305. Springer LNCS 2429, 2002.
[14] P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems, 2002.
[15] A. Muthitacharoen, S. Gilbert, and R. Morris. Etna: A fault-tolerant algorithm for atomic mutable DHT data. MIT CSAIL Technical Report 2005-044, http://hdl.handle.net/1721.1/30555, 2005.
[16] K.
Needels and M. Kwon. Secure routing in peer-to-peer distributed hash tables. In Proceedings of the ACM Symposium on Applied Computing. ACM, 2009.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of ACM SIGCOMM. ACM, August 2001.
[18] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proceedings of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 2001.
[19] E. Sit and R. Morris. Security considerations for peer-to-peer distributed hash tables. In Proceedings of IPTPS. Springer LNCS 2429, 2002.
[20] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM. ACM, August 2001.
[21] I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking, 11(1), February 2003.
[22] M. Yabandeh, N. Knežević, D. Kostić, and V. Kuncak. CrystalBall: Predicting and preventing inconsistencies in deployed distributed systems. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation. USENIX, April 2009.
[23] P. Zave. Using lightweight modeling to understand Chord. ACM SIGCOMM Computer Communication Review, 42(2), April 2012.
[24] P. Zave. A practical comparison of Alloy and Spin, 2013. Submitted for publication.
[25] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz. Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, 22(1):41–53, January 2004.

Appendix A: Pseudocode specification of the ring-maintenance operations

There is a type Identifier which is a string of m bits.
The parameter r is the length of successor lists. The function

    Identifier function lookupSucc (joining: Identifier) { }

takes the identifier of a joining node and returns the identifier of its proper successor. In other words, for two members n and lookupSucc(joining) that are adjacent in the ring, between(n,joining,lookupSucc(joining)) holds.

Each node has the following variables:

    myIdent: Identifier;
    known: Identifier;
    pred: Identifier;
    succList: list Identifier;   // length is r

where myIdent is the hash of its IP address, known is a member of the Chord network known to the node when it joins, and pred is the node’s predecessor. succList is its entire successor list; the head of this list is its first successor or simply its successor.

To join, a node executes the following pseudocode.

    // Join operation
    newSucc: Identifier;
    query known for lookupSucc(myIdent);
    if (query returns before timeout) {
       newSucc = lookupSucc(myIdent);
       query newSucc for newSucc.succList;
       if (query returns before timeout) {
          succList = append(newSucc, butLast(newSucc.succList));
          pred = null;
       }
       else retry Join later;
    }
    else retry Join later;

First, the node asks the known node to look up the node’s identifier and get its proper successor, storing the value in newSucc. The node then queries newSucc for its successor list. Finally the node constructs its own successor list by concatenating newSucc and newSucc’s successor list, with the last element of the list trimmed off to produce a result of length r. If either of the queries fails, the node has no choice but to retry later.

To stabilize, a node executes the following pseudocode.
    // Stabilize operation
    newSucc: Identifier;
    while (succList is not empty) {
       query head(succList) for head(succList).pred and head(succList).succList;
       if (query returns before timeout) {
          newSucc = head(succList).pred;
          succList = append(head(succList), butLast(head(succList).succList));
          if (between(myIdent,newSucc,head(succList))) {
             query newSucc for newSucc.succList;
             if (query returns before timeout)
                succList = append(newSucc, butLast(newSucc.succList));
          };
          notify head(succList) of myIdent;
          break;
       }
       else succList = tail(succList);
    };

In the outer loop of this code, the node queries its successor for its successor’s predecessor and successor list. If this query times out, then the node’s successor is presumed dead. The node promotes its second successor to first and tries again. Once it has contacted a live successor, it executes inner code ending in a break out of the loop. The loop is guaranteed to terminate before succList is empty, based on the assumption that successor lists are long enough so that each list contains at least one live node.

Once it has contacted a live successor, the node first updates its successor list with its successor’s list. It then checks to see if the new pointer it has learned, its successor’s predecessor, is an improved successor. If so, and if newSucc is live, it adopts newSucc as its new successor. Thus the stabilize operation requires one or two queries for each traversal of the outer loop.

Whether or not there is a live improved successor, the node notifies its successor of its own identity.

A node rectifies when it is notified, thereafter executing the following pseudocode:

    // Rectify operation
    newPred: Identifier;
    receive notification of newPred;
    if (pred = null) pred = newPred;
    else {
       query pred to see if live;
       if (query returns before timeout) {
          if (between(pred,newPred,myIdent)) pred = newPred;
       }
       else pred = newPred;
    };

When a node fails or leaves, it ceases to stabilize, notify, or respond to queries from other nodes.
When a node rejoins, it re-initializes its Chord variables.

Appendix B: Alloy specification of the invariant

There is an implicit linear order on instances of type Node, which acts as the identifier order. For simplicity, this is a formalization for r = 2. Successor lists of arbitrary but bounded length would not be more difficult to model in Alloy, but examples and counterexamples would be more difficult to read.

    one sig Network { base: set Node } { # base = 3 }

    sig Node {
       succ: Node lone -> Time,
       succ2: Node lone -> Time,
       prdc: Node lone -> Time,
       bestSucc: Node lone -> Time
    }
    {  -- This defines bestSucc.
       all t: Time |
          (Member[succ.t,t] && Member[succ2.t,t] => bestSucc.t = succ.t)
       && (Member[succ.t,t] && NonMember[succ2.t,t] => bestSucc.t = succ.t)
       && (NonMember[succ.t,t] && Member[succ2.t,t] => bestSucc.t = succ2.t)
       && (NonMember[succ.t,t] && NonMember[succ2.t,t] => no bestSucc.t)
       -- A node is unambiguously a member or not.
       all t: Time | Member[this,t] || NonMember[this,t]
    }

    pred Member [n: Node, t: Time] { some n.succ.t }

    pred NonMember [n: Node, t: Time] {
       no n.succ.t && no n.prdc.t && no n.succ2.t
    }

    pred Invariant [t: Time] {
       OneOrderedRing[t]
       && ConnectedAppendages[t]
       && StructuredSuccessorList[t]
       && BaseNotSkipped[t]
    }

    pred OneOrderedRing [t: Time] {
       let ringMembers = { n: Node | n in n.^(bestSucc.t) } |
          Network.base in ringMembers      -- at least one ring containing base
       && (all disj n1, n2: ringMembers |
             n1 in n2.^(bestSucc.t) )      -- at most one ring
       && (all disj n1, n2, n3: ringMembers |
             n2 = n1.bestSucc.t => !
             Between[n1,n3,n2]             -- ordered ring
          )
    }

    pred ConnectedAppendages [t: Time] {
       let members = { n: Node | Member[n,t] } |
       let ringMembers = { n: members | n in n.^(bestSucc.t) } |
          -- na is in an appendage, yet reaches ring
          all na: members - ringMembers |
             some nc: ringMembers | nc in na.^(bestSucc.t)
    }

    pred StructuredSuccessorList [t: Time] {
       all n: Node | Member[n,t] => {
          some n.succ2.t                   -- the successor list is full
          Between[n,n.succ.t,n.succ2.t]    -- successor list is ordered
             -- guarantees that n != n.succ.t and n.succ.t != n.succ2.t
          n != n.succ2.t                   -- no self-successors
       }
    }

    pred BaseNotSkipped [t: Time] {
       all n: Node | Member[n,t] => {
          no b: Network.base |
             (Between[n, b, n.succ.t] && b !in (n + n.succ.t))
          no b: Network.base |
             (Between[n.succ.t, b, n.succ2.t] && b !in (n.succ.t + n.succ2.t))
       }
    }
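For readers who do not use Alloy, the two ring-structure conjuncts can be approximated in Python. This is a rough sketch under stated assumptions, not a translation of the model: it checks one concrete state instead of enumerating all states in a bounded scope, it fixes r = 2, and it represents a state as a dictionary from each live member to its (succ, succ2) pair, with non-members simply absent. All function names are hypothetical. The example state uses the identifiers of Figure 4; the pointers not drawn in that figure and the base {20, 31, 52} are assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

State = Dict[int, Tuple[int, int]]  # live member -> (succ, succ2)


def between(n1: int, n2: int, n3: int) -> bool:
    if n1 < n3:
        return n1 < n2 < n3
    return n1 < n2 or n2 < n3


def best_succ(state: State, n: int) -> Optional[int]:
    """First entry of n's successor list that points to a live member."""
    for s in state[n]:
        if s in state:
            return s
    return None


def reachable(state: State, n: int) -> Set[int]:
    """Members reached from n in one or more best-successor steps."""
    seen: Set[int] = set()
    cur = best_succ(state, n)
    while cur is not None and cur not in seen:
        seen.add(cur)
        cur = best_succ(state, cur)
    return seen


def ring_members(state: State) -> Set[int]:
    return {n for n in state if n in reachable(state, n)}


def one_ordered_ring(state: State, base: Set[int]) -> bool:
    ring = ring_members(state)
    if not base <= ring:                       # a ring containing the base
        return False
    for n1 in ring:                            # at most one ring
        for n2 in ring:
            if n1 != n2 and n1 not in reachable(state, n2):
                return False
    for n1 in ring:                            # ring is in identifier order
        n2 = best_succ(state, n1)
        if n2 == n1:
            continue
        for n3 in ring - {n1, n2}:
            if between(n1, n3, n2):
                return False
    return True


def connected_appendages(state: State) -> bool:
    ring = ring_members(state)
    return all(bool(reachable(state, na) & ring) for na in set(state) - ring)


if __name__ == "__main__":
    before: State = {3: (20, 31), 20: (31, 52), 31: (52, 3),
                     52: (3, 45), 45: (3, 20)}
    base = {20, 31, 52}
    after = {n: s for n, s in before.items() if n != 3}  # member 3 fails
    print(one_ordered_ring(before, base), connected_appendages(before))
    print(sorted(ring_members(after)), one_ordered_ring(after, base))
```

In the assumed state, the failure of 3 makes 45 the best successor of 52, which pulls the former appendage into the ring out of identifier order; this is the kind of disruption that the BaseNotSkipped conjunct rules out in advance.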