Concurrent Effect Search in Evolutionary Systems
Keki M. Burjorjee
Pandora Media Inc., Oakland, CA
kburjorjee@pandora.com

ABSTRACT

Concurrent Effect Search (CES) is a form of efficient computational learning thought to underlie general-purpose, non-local, noise-tolerant adaptation in evolutionary algorithms. We demonstrate that CES is indeed efficient by showing that it can be used to obtain optimal bounds on the time and queries required to approximately correctly solve a subclass ($k = 7$, $\eta = 1/5$) of a familiar computational learning problem: learning parities with noisy membership queries, where $k$ is the number of relevant attributes and $\eta$ is the oracle's noise rate. We show that a simple genetic algorithm that treats the noisy membership query oracle as a fitness function can be straightforwardly used to PAC-learn the relevant variables in $O(\log(n/\delta))$ queries and $O(n \log(n/\delta))$ time, where $n$ is the total number of attributes and $\delta$ is the probability of error. To the best of our knowledge, this is the first rigorous identification of optimally efficient computation in an evolutionary algorithm on a non-trivial learning problem. Our proof technique relies on accessible symmetry arguments and the use of statistical hypothesis testing to reject a global null hypothesis at the $10^{-100}$ level of significance. The result obtained, and indeed the implicit implementation of CES by a simple genetic algorithm, depends crucially on recombination. This dependence yields a fresh explanation for the role of sex (based on the unmixibility of genes). This new explanation follows straightforwardly from the CES hypothesis, further elevating its appeal.

1. INTRODUCTION

In recent years, theoretical computer scientists have become increasingly perturbed by the problem posed by evolution.
The subject of consternation is a system thought to have computed the encoding of every biological form to have lived, that, as luck would have it, represents information the way a Turing Machine might (digitally, in strings drawn from a quaternary alphabet) and manipulates information in ways that are well understood (e.g. meiosis) or amenable to abstraction (e.g. natural selection). Contemplating what this computational system has achieved given the resources at its disposal leaves one awestruck. Yet, theoretical computer science, for all its success in other areas, has not identified anything particularly efficient or arresting about evolution.¹ We have on our hands what might be called a computational origins problem: while our physical origins have been worked out, our computational origins remain a mystery. Referring to this problem, Valiant speculates that future generations will wonder why it was not regarded with a greater sense of urgency [29].

If the computational origins problem is cause for reflection within Theoretical Computer Science, it is doubly so amongst Evolutionary Computation theorists. The promise of Evolutionary Computation was twofold: 1) that the computational efficiencies at play in natural evolution would be identified, and 2) that these efficiencies would be harnessed, via biomimicry, and used to efficiently procure solutions to human problems. One might expect these outcomes to be realized in order or simultaneously.
In a curious twist, however, the field seems to be making good on the second part of its promise, but has not delivered on the first. This twist makes evolutionary algorithms rather interesting from a theoretical standpoint. Find something computationally efficient about one of the more bioplausible ones, and a piece of the computational origins puzzle may fall into place.

We have previously suggested that an efficient form of computational learning underlies adaptation in evolutionary algorithms [4, 3]. Here, we describe the hypothesized computational efficiency, Concurrent Effect Search (CES), in detail and explain how it can drive efficient general-purpose, non-local, noise-tolerant adaptation. We establish that CES is a bona fide form of efficient computational learning by using it to derive optimal bounds on the query and time complexity of an algorithm that PAC-solves a subclass ($k = 7$, $\eta = 1/5$) of a non-trivial computational learning problem: learning parities with noisy membership queries [28, 8], where $k$ is the number of relevant attributes, and $\eta$ is the probability that the membership query (MQ) oracle misclassifies a query (i.e. returns a 1 instead of a 0, or vice versa). This result is, to the best of our knowledge, the first time that efficient, not to mention optimally efficient, non-trivial computation is shown to occur in an evolutionary algorithm. It hinges on an empirical conclusion with a p-value less than $10^{-100}$.

¹ The Computational Theories of Evolution Workshop held at the Simons Institute, Berkeley, CA from March 17 to March 21, 2014 brought together researchers from multiple disciplines to discuss the issue. Presentations available at http://simons.berkeley.edu/workshops/abstracts/326

1.1 The Integral Role of Recombination

Of all the mysteries evolution holds for a computer scientist, surely, sex is one of the greatest.
No account of the computational power of evolution can be considered complete or even close to complete as long as the part played by sex remains murky, or marginal. The role of recombination in the implicit implementation of Concurrent Effect Search is neither. Following the derivation of our result we explain why recombination is crucial to the derivation. Briefly, parity functions induce a property called unmixibility, which recombination adroitly exploits.

2. CONCURRENT EFFECT SEARCH

We begin with a primer on schemata and schema partitions [10, 19]. For any positive integer $k$, let $[k]$ denote the set $\{1, \ldots, k\}$. For any mathematical expression $a$ and boolean-valued expression $b$, let $[b]\{a\}$ denote $a$ if $b$ evaluates to true and 0 otherwise.

For any positive integer $n$, a schema (plural, schemata) of the set of binary strings $\{0,1\}^n$ is a subset of $\{0,1\}^n$ that is traditionally represented by a schema template. A schema template is a string in $\{0,1,*\}^n$. As will become clear from the formal definition below, the symbol $*$ stands for 'wildcard' at the positions at which it occurs. Given a schema template $s \in \{0,1,*\}^n$, let $⟦s⟧$ denote the schema represented by $s$. Then,

\[ ⟦s⟧ = \{x \in \{0,1\}^n \mid \forall i \in [n],\ s_i = * \lor x_i = s_i\} \]

Let $I \subseteq \{1, \ldots, n\}$ be some set of integers. Then $I$ represents a partition of $\{0,1\}^n$ into $2^{|I|}$ schemata, whose templates are given by $\{s \in \{0,1,*\}^n \mid s_i \neq * \Leftrightarrow i \in I\}$. This partition of $\{0,1\}^n$, denoted $⟦I⟧_n$, can also be represented in template form by a string $t$ of length $n$ drawn from the alphabet $\{\#, *\}$ where $t_i = \# \iff i \in I$. Here the symbol $\#$ stands for 'defined bit' and the symbol $*$ stands, as it did before, for 'wildcard'. The schema partition represented by a schema partition template $t$ is denoted $⟦t⟧$. We omit the double brackets (i.e., $⟦\cdot⟧$) when their use is clear from the context. The order of a schema partition is simply the number of $\#$ symbols in the schema partition's template. It is easily seen that schema partitions of lower order are coarser than schema partitions of higher order. More specifically, a schema partition of order $k$ comprises $2^k$ schemata.

Example 1. The index set $\{1, 2, 4\}$ induces an order three schema partition of $\{0,1\}^5$ into eight schemata as shown in Figure 1.

00*0*   00*1*   01*0*   01*1*   10*0*   10*1*   11*0*   11*1*
00000   00010   01000   01010   10000   10010   11000   11010
00001   00011   01001   01011   10001   10011   11001   11011
00100   00110   01100   01110   10100   10110   11100   11110
00101   00111   01101   01111   10101   10111   11101   11111

Figure 1: A tabular depiction of the schema partition ##*#* of order three. The table headings give the templates of the schemata comprising the partition. The elements of each schema in the partition appear in the column below its template.

Schemata in a schema partition are assumed to be in lexicographical order of their templates. Formally, for any positive integer $n$ and any index set $I \subseteq [n]$, we assume that the schemata in $⟦I⟧_n$ are ordered as follows: $⟦s_1⟧, \ldots, ⟦s_{2^{|I|}}⟧$, where $s_1, \ldots, s_{2^{|I|}}$ is a list of the templates of the schemata in $⟦I⟧_n$ ordered lexicographically.

Definition 2 (Underlying Effect). For any positive integer $n$, any non-negative integer $k$, and any index set $I \subseteq [n]$ such that $|I| = k$, let $S_i$ denote the $i$th schema of $⟦I⟧_n$. The underlying effect of the schema partition $⟦I⟧_n$ with respect to some fitness function $\phi : \{0,1\}^n \to \mathbb{R}$ is a value $\xi_I$ defined as follows:

\[ \xi_I = \frac{1}{2^k} \sum_{i=1}^{2^k} \left( F^{\phi}_{S_i} - F^{\phi}_{\{0,1\}^n} \right)^2 \]

where, for any set $S \subseteq \{0,1\}^n$, $F^{\phi}_S$ denotes the mean fitness of the elements of $S$. In words, the underlying effect of the schema partition $⟦I⟧_n$ is the variance of the mean fitness values of the schemata in $⟦I⟧_n$. Considering the size of typical search spaces, the underlying effects of non-trivial schema partitions are unknowable.
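For small $n$ the definitions above can be checked by brute force. The following is a minimal sketch, not code from the paper: the function names `schema_partition` and `underlying_effect` and the toy fitness function `xor_12` are assumptions made for illustration. It enumerates the schemata induced by an index set and computes the underlying effect as the variance of the schema means.

```python
from itertools import product

def schema_partition(n, index_set):
    """Map each schema template to its members; index_set uses 1-based positions."""
    partition = {}
    for bits in product("01", repeat=n):
        x = "".join(bits)
        # Keep the bit at defined positions, put a wildcard everywhere else.
        template = "".join(x[i] if (i + 1) in index_set else "*" for i in range(n))
        partition.setdefault(template, []).append(x)
    # Sorting the templates gives the lexicographic order assumed in the text.
    return dict(sorted(partition.items()))

def underlying_effect(n, index_set, fitness):
    """Variance of the mean fitness values of the schemata (Definition 2)."""
    partition = schema_partition(n, index_set)
    schema_means = [sum(map(fitness, members)) / len(members)
                    for members in partition.values()]
    # All schemata have equal size, so the mean of the schema means
    # equals the mean fitness over the whole of {0,1}^n.
    overall_mean = sum(schema_means) / len(schema_means)
    return sum((mu - overall_mean) ** 2 for mu in schema_means) / len(schema_means)

def xor_12(x):
    """Hypothetical fitness function: parity of the first two bits."""
    return int(x[0]) ^ int(x[1])

if __name__ == "__main__":
    print(list(schema_partition(5, {1, 2, 4})))   # the eight templates of Figure 1
    print(underlying_effect(5, {1}, xor_12))      # a single relevant bit shows no effect
    print(underlying_effect(5, {1, 2}, xor_12))   # both relevant bits show an effect
```

The enumeration visits all $2^n$ strings, so this approach is only practical for toy values of $n$.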
Given some number of independent samples drawn from the uniform distribution over $\{0,1\}^n$ and their fitness values, one can nonetheless construct an estimator like the one below.

Definition 3 (Sampling Effect). For any positive integer $n$, any non-negative integer $k$, and any index set $I \subseteq [n]$ such that $|I| = k$, let $S_i$ denote the $i$th schema of $⟦I⟧_n$. Let $X_1, \ldots, X_r$ be random variables drawn independently from the uniform distribution over $\{0,1\}^n$. For any $i \in [2^k]$, let $Y_i$ be a random variable defined as follows:

\[ Y_i = \sum_{j=1}^{r} [X_j \in ⟦s_i⟧]\{1\} \]

and let $W$ be a random variable defined as follows:

\[ W = \sum_{i=1}^{2^k} [Y_i > 0]\{1\} \]

For any $i \in [2^k]$ such that $Y_i > 0$, let the random variable $Z_{ij}$ be $X_\ell$, where $\ell$ is the index of the $j$th random variable in the list $X_1, \ldots, X_r$ that belongs to schema $S_i$. The sampling effect of $⟦I⟧_n$ with respect to some fitness function $\phi : \{0,1\}^n \to \mathbb{R}$ is given by the estimator $\hat{\xi}_I$ defined as follows:

\[ \hat{\xi}_I = \frac{1}{W} \sum_{i=1}^{2^k} [Y_i > 0]\left\{ \left( \widehat{F^{\phi}_{S_i}} - \widehat{F^{\phi}_{\{0,1\}^n}} \right)^2 \right\} \]

where $\widehat{F^{\phi}_{\{0,1\}^n}}$ is an estimator defined as follows:

\[ \widehat{F^{\phi}_{\{0,1\}^n}} = \frac{1}{r} \sum_{i=1}^{r} \phi(X_i) \]

and for any $i \in [2^k]$ such that $Y_i > 0$, $\widehat{F^{\phi}_{S_i}}$ is an estimator defined as follows:

\[ \widehat{F^{\phi}_{S_i}} = \frac{1}{Y_i} \sum_{j=1}^{Y_i} \phi(Z_{ij}) \]

As estimators are random variables, they have means and variances. Let us informally examine how the mean and variance of $\hat{\xi}_I$ change as we coarsen a schema partition. Coarsening a schema partition $⟦I⟧_n$ amounts to removing elements from $I$. For any $I' \subset I$ it is easily seen that $\xi_{I'} \leq \xi_I$. We conjecture, therefore, that $E[\hat{\xi}_{I'}] \leq E[\hat{\xi}_I]$. What about the variance of $\hat{\xi}_I$? Here we rely on the observation that coarsening $⟦I⟧_n$ reduces the variance of the estimators $\widehat{F^{\phi}_{S_i}}$ for all $i \in [2^k]$ such that $Y_i > 0$. This observation underlies our conjecture that $\mathrm{Var}[\hat{\xi}_{I'}] \leq \mathrm{Var}[\hat{\xi}_I]$. The variance of an estimator is, of course, inversely related to the statistical significance of statements about the expected value of the estimator.
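The estimator of Definition 3 can be sketched in a few lines. This is an illustrative sketch only: the function name `sampling_effect`, the sample size, the seed, and the noisy two-bit parity used as a fitness function are all assumptions, not taken from the paper. Samples are grouped by the values they take at the positions in $I$, so each non-empty group corresponds to a schema with $Y_i > 0$ and the number of groups is $W$.

```python
import random

def sampling_effect(samples, fitness_values, index_set):
    """Estimate the effect of the partition induced by index_set (1-based positions)."""
    positions = sorted(index_set)
    by_schema = {}
    for x, fx in zip(samples, fitness_values):
        # The bits at the defined positions identify the schema that x belongs to.
        key = tuple(x[i - 1] for i in positions)
        by_schema.setdefault(key, []).append(fx)
    overall_mean = sum(fitness_values) / len(fitness_values)
    w = len(by_schema)  # number of schemata that received at least one sample
    return sum((sum(v) / len(v) - overall_mean) ** 2 for v in by_schema.values()) / w

if __name__ == "__main__":
    rng = random.Random(1)          # arbitrary seed, chosen for repeatability
    n, r, eta = 10, 2000, 0.2       # hypothetical sizes and noise rate
    samples = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(r)]
    # Noisy parity of bits 1 and 2: the true label is flipped with probability eta.
    fitness_values = [(x[0] ^ x[1]) ^ (rng.random() < eta) for x in samples]
    print(sampling_effect(samples, fitness_values, {1, 2}))  # relevant partition
    print(sampling_effect(samples, fitness_values, {3, 4}))  # irrelevant partition
```

With these settings the partition over the relevant positions should show a markedly larger sampling effect than the partition over irrelevant positions, although the exact values depend on the random draws.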
Suppose one's goal is to find schema partitions with statistically significant sampling effects. Then the conjectures above suggest that coarsening schema partitions is a mixed blessing. On the positive side, coarsening decreases the variance of the sampling effect, driving up statistical significance. On the other hand it pushes the expected sampling effect to zero, adversely affecting statistical significance. With respect to the fitness functions arising in practice, we conjecture the existence of one or more "goldilocks" schema partitions, i.e. schema partitions that are coarse enough that the variance of the sampling effect is not too high, and not so coarse that the expected sampling effect is too low.

Assuming, conservatively, that goldilocks schema partitions are rare, how many coarse schema partitions must one evaluate on average before such a schema partition is found? For large values of $n$, the numbers are astronomical. For example, when $n = 10^6$, the numbers of schema partitions of order 2, 3, 4 and 5 are on the order of $10^{11}$, $10^{17}$, $10^{22}$, and $10^{28}$ respectively.

Implicit concurrent effect evaluation is a capacity possessed by simple genetic algorithms with uniform crossover for scalably (with respect to $n$) finding one or more coarse schema partitions with non-negligible effects. It amounts to a capacity for efficiently performing vast numbers of concurrent effect/no-effect multivariate analyses to identify small numbers of interacting loci with statistically significant non-zero sampling effects.

2.1 Prior Theoretical Evidence

Theoretical support for the idea that evolution can implicitly perform vast numbers of concurrent operations appears in a prior paper [5], where it is shown that for any schema partition $⟦I⟧_n$, infinite population evolution over $\{0,1\}^n$ using fitness proportional selection, homologous recombination, and standard bit mutation implicitly induces infinite population evolution over the set of schemata in $⟦I⟧_n$. More formally, let $\beta : \{0,1\}^n \to ⟦I⟧_n$ be a function that maps binary strings in $\{0,1\}^n$ to their corresponding schemata in $⟦I⟧_n$. Let $S$, $V$, and $\Xi$ be the parameterized selection, variation, and projection operators defined in the paper [4]. Then, for any probability distribution $p$ over $\{0,1\}^n$, fitness function $\phi : \{0,1\}^n \to \mathbb{R}$, and transmission function $T$ over $\{0,1\}^n$ that models homologous recombination followed by canonical bit mutation, there exist probability distributions $q$ and $r$ over $\{0,1\}^n$, probability distributions $p'$, $q'$, $r'$ over $⟦I⟧_n$, fitness function $\phi' : ⟦I⟧_n \to \mathbb{R}$, and transmission function $T'$ over $⟦I⟧_n$ such that the following diagram commutes:

\[
\begin{array}{ccccc}
p & \xrightarrow{\;S_{\phi}\;} & q & \xrightarrow{\;V_{T}\;} & r \\
\big\downarrow{\scriptstyle \Xi_\beta} & & \big\downarrow{\scriptstyle \Xi_\beta} & & \big\downarrow{\scriptstyle \Xi_\beta} \\
p' & \xrightarrow{\;S_{\phi'}\;} & q' & \xrightarrow{\;V_{T'}\;} & r'
\end{array}
\]

The proof of the above is constructive. That is, $\phi'$ and $T'$ are precisely specified. Crucially, there is no restriction on the index set $I$. In other words, the above holds true simultaneously for all schema partitions of $\{0,1\}^n$. It is this simultaneity that underlies our sense that evolution implicitly carries out vast numbers of operations concurrently.²

² The function $\beta$ in the above was mistakenly called a coarse-graining by Burjorjee and Pollack [5]. The mistake was corrected in a later paper [2], where the idea of coarse-graining was explained in detail. We mention this because the concept of coarse-graining is used later on in this paper (in Section 6). As $\phi'$ in the commutative diagram is not invariant to $p$, no coarse-graining of evolutionary dynamics was shown by Burjorjee and Pollack. What was actually shown is arguably more interesting: implicit concurrent evolution.

2.2 A Need for Science, Practiced With Rigor

Formal evidence for the CES hypothesis beyond that referenced above, specifically formal evidence pertaining to evolutionary algorithms with finite population sizes, seems difficult to obtain, not least because the analysis of evolutionary algorithms with finite populations is notoriously unwieldy. We have argued previously that resorting to the scientific method is a necessary and appropriate response to this hurdle [4]. After all, the scientific method, rigorously practiced, is the foundation of many a useful field of engineering. A hallmark of rigorous science is the ongoing making and testing of predictions. Predictions found to be true lend credence to the hypotheses that entail them. The more unexpected a prediction (in the absence of the hypothesis), the greater the credence owed the hypothesis if the prediction is borne out [23, 22].

The work that follows is meant to be received in this context. As we explain in the next section, the CES hypothesis entails that a genetic algorithm with uniform crossover (UGA) can be used in the straightforward construction of an algorithm that efficiently solves the problem of learning parities with a noisy MQ oracle for small but non-trivial values of $k$, the number of relevant attributes, and $\eta \in (0, 1/2)$, the probability that the oracle makes a classification error (returns a 1 instead of a 0, or vice versa) on a given query. Such a result is completely unexpected in the absence of the CES hypothesis.

3. PAC-LEARNING PARITIES WITH MQS

The problem of learning parities is a refinement of the learning juntas problem [20], so we approach the former problem by way of the latter. For any boolean function $f$ over $n$ variables, a variable $i$ is said to be relevant if there exist binary strings $x, y \in \{0,1\}^n$ that differ only in their $i$th coordinate such that $f(x) \neq f(y)$. Variables that are not relevant are said to be irrelevant. For any non-negative integer $n$ and any non-negative integer $k \leq n$, a $k$-junta is a function $f : \{0,1\}^n \to \{0,1\}$ such that for some non-negative integer $j \leq k$ only $j$ of the $n$ inputs to $f$ are relevant. These $j$ relevant variables are said to be the juntas of $f$. The function $f$ is completely specified by its juntas (characterizable by the set $J \subseteq [n]$ of junta indices) and by a hidden boolean function $h$ over the $j$ juntas. The output of $f$ is just the output of $h$ on the values of the $j$ relevant variables (the values of the irrelevant variables are ignored). The problem of identifying the relevant variables of any $k$-junta $f$ and the truth table of its hidden boolean function is called the learning $k$-juntas problem.

The learning $k$-parities problem is a refinement of the learning $k$-juntas problem where it is additionally specified that $j = k$, and the hidden function $h$ is the parity (i.e. xor) function over $k$ inputs. In this case, the function $f$ is completely specified by its juntas. An algorithm $A$ with access to some oracle $\phi$ is said to solve the learning $k$-parities problem if for any $k$-parity function $f : \{0,1\}^n \to \{0,1\}$ whose juntas are given by the set $J \subseteq [n]$ and any $\delta \in (0, 1/2)$, $A^{\phi}(n, \delta)$ outputs a set $S \subseteq [n]$ such that $\Pr(S \neq J) \leq \delta$. A noisy MQ oracle $\phi$ behaves as follows. For any string $x \in \{0,1\}^n$, $\phi$ returns $\neg f(x)$ with probability $\eta \in (0, 1/2)$ and $f(x)$ with probability $1 - \eta$. The parameter $\eta$ is called the noise rate. Bounds derived for the time and query complexity of $A^{\phi}$ with respect to $n$ and $\delta$ speak to the efficiency with which $A^{\phi}$ solves the problem. The algorithmic learning described here is Probably Approximately Correct learning [14] with the inaccuracy tolerance set to zero: Probably Correct (PC) learning, if you will.
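The oracle just described is simple to simulate. The sketch below is an illustration under stated assumptions, not the paper's code: the helper names `make_parity` and `make_noisy_mq_oracle`, the seed, and the choice of $n = 20$ are hypothetical. It builds a $k$-parity from a junta set $J$ and wraps it in an oracle that flips each answer independently with probability $\eta$.

```python
import random

def make_parity(juntas):
    """Return the parity function whose relevant variables are the 1-based indices in juntas."""
    def f(x):
        return sum(x[i - 1] for i in juntas) % 2
    return f

def make_noisy_mq_oracle(f, eta, rng):
    """Return an oracle that answers f(x) with probability 1 - eta and its negation otherwise."""
    def phi(x):
        y = f(x)
        return 1 - y if rng.random() < eta else y
    return phi

if __name__ == "__main__":
    rng = random.Random(0)                     # arbitrary seed
    n, eta = 20, 1 / 5                         # n is a hypothetical choice
    f = make_parity({1, 2, 3, 4, 5, 6, 7})     # a 7-parity, as in the paper's subclass
    phi = make_noisy_mq_oracle(f, eta, rng)
    queries = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(20000)]
    error_rate = sum(phi(x) != f(x) for x in queries) / len(queries)
    print(error_rate)  # should be close to the noise rate eta
```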
3.1 The Noise Model

Blum and Langley [1] give a simple binary search based method that learns a single relevant variable of any $k$-junta in $O(n \log n)$ time and $O(\log n)$ queries with noise-free membership queries. As explained in Section 3.1 of a paper by Mossel and O'Donnell [20], once a single relevant variable is found, the method can be recursively repeated $O(k 2^k)$ times to find the remaining relevant variables. In an MQ setting, the introduction of noise does not complicate the situation much if the corruption of a given answer is independent of the corruption of previous answers. In this case, noise can be eliminated to an arbitrary level simply by sampling the MQ oracle $p\left(\frac{1}{1-2\eta}\right)$ times, where $p(\cdot)$ is some polynomial, and taking the majority value.

An appropriate departure from independent noise is the random persistent classification noise model due to Goldman, Kearns, and Schapire [11], wherein on any query $x$ the oracle operates as follows: if the oracle has been queried on $x$ before, it returns its previous answer to $x$. Otherwise, it returns the correct answer with probability $1 - \eta$. Such an oracle appears deterministic from the outside, so clearly, querying it some number of times and taking the majority value does not reduce the effect of noise.

While persistent noise is an appropriate noise model for a membership query oracle, it tends to make analysis difficult. Fortunately, if it is extremely unlikely that an algorithm $A$ will query the oracle twice with the same value, then $A$ can be treated as if it is making calls to an MQ oracle with random independent classification noise [8]. The analysis in the following sections takes advantage of this loophole. As $n$ gets large it becomes extremely unlikely that the membership query oracle will be queried twice or more on the same input. We therefore treat the noise as if it were independent.
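The contrast between the two noise models can be illustrated with a short sketch. All names here (`make_independent_oracle`, `make_persistent_oracle`, `majority_query`) are hypothetical, as are the toy target function, the seed, and the repeat counts. Majority voting removes independent noise, while a persistent oracle returns the same answer however often it is asked.

```python
import random

def make_independent_oracle(f, eta, rng):
    """Each call flips the true answer independently with probability eta."""
    def phi(x):
        y = f(x)
        return 1 - y if rng.random() < eta else y
    return phi

def make_persistent_oracle(f, eta, rng):
    """The first answer to x is noisy; every later answer to x repeats it."""
    answers = {}
    def phi(x):
        if x not in answers:
            y = f(x)
            answers[x] = 1 - y if rng.random() < eta else y
        return answers[x]
    return phi

def majority_query(phi, x, repeats):
    """Query phi on x repeatedly and return the majority answer."""
    ones = sum(phi(x) for _ in range(repeats))
    return 1 if 2 * ones > repeats else 0

if __name__ == "__main__":
    rng = random.Random(0)                      # arbitrary seed
    f = lambda x: (x[0] + x[1]) % 2             # toy 2-parity used as the hidden function
    x = (1, 0, 1, 1)
    independent = make_independent_oracle(f, 0.2, rng)
    persistent = make_persistent_oracle(f, 0.2, rng)
    print(f(x), majority_query(independent, x, 101))    # majority recovers f(x) with high probability
    print({persistent(x) for _ in range(101)})          # a single value: repeated queries add nothing
```

Under the persistent model, the majority answer is simply the first answer, so its error probability stays at $\eta$.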
3.2 Information Theoretic Lower Bound

For any positive integer $k$, a simple information theoretic argument shows that it is not possible to PAC-learn the relevant variables of $k$-parity functions in $o(n \log n)$ time or $o(\log n)$ queries. The argument relies on Shannon's source coding theorem, part of which states that if $N$ i.i.d. random variables each with entropy $H(X)$ are transmitted in fewer than $N H(X)$ bits it is virtually certain that information will be lost [16]. Let us consider the minimum time and queries required to learn just one relevant variable. Observe that the oracle can transmit at most one bit per query and that the time required by $A$ to generate each query is $\Omega(n)$. Finally recall that the entropy of a random variable $X$ that can take an arbitrary value in $[n]$ is $\Omega(\log n)$. Thus, by Shannon's source coding theorem, the transmission of the index of a single relevant variable with an arbitrarily small possibility of error takes $\Omega(\log n)$ queries and $\Omega(n \log n)$ time.

3.3 Sampling Effects of k-parities

Let $f$ be some $k$-parity function whose relevant variables are given by the set $J$. It can be seen that a noisy MQ oracle induces an expected sampling effect of zero for all schema partitions with order less than or equal to $k - 1$, and a positive expected sampling effect for precisely one schema partition $⟦I⟧_n$ of order $k$: the one where $I = J$. While the variance of the sampling effect of $⟦I⟧_n$ increases with $\eta$, the expected effect remains unchanged. The CES hypothesis predicts that for low values of $k$ and $\eta$, a UGA can be used to efficiently identify $J$.

4. OUR RESULT AND APPROACH

We give an evolutionary computation based algorithm that probably correctly learns 7-parities in $O(\log(n/\delta))$ queries and $O(n \log(n/\delta))$ time given access to an MQ oracle with a noise rate of 1/5. Our argument has two parts.
In the first, we define a form of learning for the learning parities problem called Piecewise Probably Correct (PPC) learning and show that an algorithm that PPC-learns $k$-parities in $O(n)$ time and $O(1)$ queries can be used in the construction of an algorithm that PC-learns $k$-parities in $O(n \log(n/\delta))$ time and $O(\log(n/\delta))$ queries. In the second part we rely on a symmetry argument and a hypothesis testing based rejection of two null hypotheses at the $10^{-100}$ level of significance to conclude that for $\eta = 1/5$, a UGA can PPC-learn 7-parities in $O(n)$ time and $O(1)$ queries.

5. PPC TO PC LEARNING

An algorithm $A$ with access to some oracle $\phi$ is said to piecewise probably correctly (PPC) learn $k$-parities if there exists some $\delta \in (0, \frac{1}{2})$ such that for any $k$-parity $f$, whose juntas are given by $J \subseteq [n]$, $A^{\phi}(n)$ outputs a set $S$ such that for any $x \in [n]$, $\Pr(\neg(x \in S \Leftrightarrow x \in J)) \leq \delta$. That is, the probability that $A$ misclassifies $x$ is less than or equal to $\delta$. Given that it takes $O(n)$ time to formulate a query, PPC-learning in $O(n)$ time and $O(1)$ queries is clearly optimal. The following theorem shows a close relationship between PPC-learning $k$-parities and PC-learning $k$-parities.

Theorem 4. If $k$-parities is PPC-learnable in $O(n)$ time and $O(1)$ queries, then $k$-parities is PC-learnable in $O(n \log(n/\delta))$ time and $O(\log(n/\delta))$ queries, where $\delta$ is the probability of error.

Given the information theoretic lower bound on learning relevant variables of a parity function with respect to $n$, for any positive integer $k$, Theorem 4 states that the PPC-learnability of $k$-parities in $O(n)$ time and $O(1)$ queries entails that $k$-parities are PC-learnable with optimal efficiency with respect to $n$. The proof of the theorem relies on two well known bounds. The first is the additive upper Chernoff bound [14]:

Theorem 5 (Additive Upper Chernoff Bound). Let $X_1, \ldots, X_r$ be $r$ independent Bernoulli random variables, let $\hat{\mu} = \frac{1}{r}(X_1 + \ldots + X_r)$ be an estimator for the mean of these variables, and let $\mu = E[\hat{\mu}]$ be the expected mean. Then, for any $\epsilon > 0$, the following inequality holds:

\[ \Pr(\hat{\mu} > \mu + \epsilon) \leq e^{-2r\epsilon^2} \]

The second is the union bound [14], which is as follows:

Theorem 6 (Union Bound). For any probability space, and any two events $A$ and $B$ over that space,

\[ \Pr[A \cup B] \leq \Pr[A] + \Pr[B] \]

Crucially, the events $A$ and $B$ need not be independent.

Proof of Theorem 4. Let $A^{\phi}$ be an algorithm that PPC-learns $k$-parities in $O(n)$ time and $O(1)$ queries with a per attribute error probability $\delta'$. For any $k$-parity function $f$ over $n$ variables whose juntas are given by $J$, let $S_1, \ldots, S_r$ be sets output by $A^{\phi}$ on $r$ independent runs, and let $S$ be a set defined as follows:

\[ x \in S \iff \frac{1}{r} \sum_{i=1}^{r} [x \in S_i]\{1\} > \frac{1}{2} \]

That is, $x \in S$ iff $x$ appears in more than half the sets $S_1, \ldots, S_r$. We claim that $\Pr(S \neq J) \leq \delta$ if

\[ r > \frac{2}{(1 - 2\delta')^2} \log\frac{n}{\delta} \]

Considering that it takes $O(nr)$ time to compute $S$ given $S_1, \ldots, S_r$, Theorem 4 follows straightforwardly from the claim. For a proof of the claim observe that for each $x \in [n]$ we have exactly two cases: (i) $x \in J$ and (ii) $x \notin J$.

Case i) [$x \in J$]: Let $\hat{\mu}_x$ be a random variable defined as follows:

\[ \hat{\mu}_x = \frac{1}{r} \sum_{i=1}^{r} [x \notin S_i]\{1\} \]

Theorem 5 entails that for any $\epsilon > 0$,

\[ \Pr(\hat{\mu}_x > E[\hat{\mu}_x] + \epsilon) \leq e^{-2r\epsilon^2} \]

Note that $E[\hat{\mu}_x] = \delta'$. So by the premise of Theorem 4, $E[\hat{\mu}_x] < 1/2$. Setting $\epsilon = 1/2 - \delta'$ in the expression above yields

\[ \Pr\left(\hat{\mu}_x > \frac{1}{2}\right) \leq e^{-\frac{1}{2} r (1 - 2\delta')^2} \]

Thus, $\Pr(x \notin S) \leq e^{-\frac{1}{2} r (1 - 2\delta')^2}$.

Case ii) [$x \notin J$]: An argument similar to the one above with $\hat{\mu}_x$ defined as follows yields $\Pr(x \in S) \leq e^{-\frac{1}{2} r (1 - 2\delta')^2}$.

\[ \hat{\mu}_x = \frac{1}{r} \sum_{i=1}^{r} [x \in S_i]\{1\} \]

By combining the two cases we get that for all $x \in [n]$, $\Pr(\neg(x \in S \Leftrightarrow x \in J)) \leq e^{-\frac{1}{2} r (1 - 2\delta')^2}$. The application of the union bound yields

\[ \Pr(\neg(1 \in S \Leftrightarrow 1 \in J) \lor \ldots \lor \neg(n \in S \Leftrightarrow n \in J)) \leq n e^{-\frac{1}{2} r (1 - 2\delta')^2} \]

In other words, $\Pr(S \neq J) \leq n e^{-\frac{1}{2} r (1 - 2\delta')^2}$. Finally, setting $n e^{-\frac{1}{2} r (1 - 2\delta')^2} < \delta$ and taking logarithms yields the claim.

6. SYMMETRY ANALYSIS

For any positive integer $m$, let $D_m$ denote the set $\{0, \frac{1}{m}, \frac{2}{m}, \ldots, \frac{m-1}{m}, 1\}$. Let $G$ be a UGA with a population of size $m$ and binary chromosomes of length $n$. A hypothetical population is shown in Figure 2. The 1-frequency of some locus $i \in [n]$ at some time step $t$ is a value in $D_m$ that gives the frequency of the bit 1 at locus $i$ at time step $t$ (in other words, the number of ones in the population of $G$ at locus $i$ in generation $t$ divided by $m$, the size of the population).

[Figure 2: A hypothetical population of $m$ chromosomes, each $n$ bits long. The 3rd, 5th, and 9th loci of the population are shown in grey.]

Let $f$ be a $k$-junta over $\{0,1\}^n$ whose juntas are given by $J$ and let $h$ be the hidden function of $f$ such that $h$ is symmetric, i.e. for any permutation $\pi : [k] \to [k]$ and any element $x \in \{0,1\}^k$, $h(x_1, \ldots, x_k) = h(x_{\pi(1)}, \ldots, x_{\pi(k)})$. Consider a noisy MQ oracle $\phi_f$ that internally uses $f$. Let $G$ be a UGA that uses $\phi_f$ as a fitness function, and let $1^t_i$ be a random variable that gives the 1-frequency of locus $i$ of $G$ at time step $t$. Then for any time step $t$, any loci $i, j \in J$ and any loci $i', j' \in [n] \setminus J$, an appreciation of algorithmic symmetry (the absence of positional bias in uniform crossover and the fact that $h$ is symmetric) yields the following conclusions:

Conclusion 7. $\forall x \in D_m,\ \Pr(1^t_i = x) = \Pr(1^t_j = x)$

Conclusion 8.
$\forall x \in D_m,\ \Pr(1^t_{i'} = x) = \Pr(1^t_{j'} = x)$

Which is to say that for all $i, j \in J$, $1^t_i$ and $1^t_j$ are drawn from the same distribution, which we denote $p_t$, and for all $i', j' \in [n] \setminus J$, $1^t_{i'}$ and $1^t_{j'}$ are drawn from the same distribution, which we denote $q_t$. (It is not to say that $1^t_i$ and $1^t_j$ are independent, or that $1^t_{i'}$ and $1^t_{j'}$ are independent.) Appreciating that the location of the juntas of $f$ is immaterial to the 1-frequency dynamics of the relevant and irrelevant loci yields the following conclusion:

Conclusion 9. For all $t$, $p_t$ and $q_t$ are invariant to $J$.

Finally, if it is known that the per bit probability of mutation does not depend on the length of the chromosomes, then appreciating that the non-relevant loci are just "along for the ride" and can be spliced out without affecting the 1-frequency dynamics at other loci gives us the following conclusion:

Conclusion 10. For all $t$, $p_t$ and $q_t$ are invariant to $n$.

6.1 Note on our Use of Symmetry

This section is a lightly modified version of Section 3 in an earlier paper [4]. We include it here because our case for the use of symmetry arguments remains the same.

A simple genetic algorithm with a finite but non-unitary population of size $m$ (the kind of GA used in this paper) can be modeled by a Markov Chain over a state space consisting of all possible populations of size $m$ [21]. Such models tend to be unwieldy [12] and difficult to analyze for all but the most trivial fitness functions. Fortunately, it is possible to avoid modeling and analysis of this kind, and still obtain precise results for non-trivial fitness functions by exploiting some simple symmetries introduced through the use of uniform crossover and length independent mutation.

A homologous crossover operation between two chromosomes of length $n$ can be modeled by a vector of $n$ random binary variables $\langle X_1, \ldots, X_n \rangle$ representing a crossover mask. Likewise, a mutation operation can be modeled by a vector of $n$ random binary variables $\langle Y_1, \ldots, Y_n \rangle$ representing a mutation mask. Only in the case of uniform crossover are the random variables $X_1, \ldots, X_n$ independent and identically distributed. This absence of positional bias [7] in uniform crossover constitutes a symmetry. Essentially, permuting the bits of all chromosomes using some permutation $\pi$ before crossover, and permuting the bits back using $\pi^{-1}$ after crossover, has no effect on the overall dynamics of a UGA. If, in addition, the random variables $Y_1, \ldots, Y_n$ that model the mutation operator are identically distributed (which is typical), conditionally independent given the per bit mutation rate, and independent of the value of $n$, then in the event that the values of chromosomes at some locus $i$ are immaterial to the fitness evaluation, the locus $i$ can be "spliced out" without affecting allele dynamics at other loci. In other words, the dynamics of the UGA can be exactly coarse-grained [2].

These conclusions flow readily from an appreciation of the symmetries induced by uniform crossover and length independent mutation. While the use of symmetry arguments is uncommon in Theoretical Computer Science, symmetry arguments form a crucial part of the foundations of Physics and Chemistry. Indeed, according to the theoretical physicist E. T. Jaynes, "almost the only known exact results in atomic and nuclear structure are those which we can deduce by symmetry arguments, using the methods of group theory" [13, p331-332]. Note that the conclusions above hold true regardless of the selection scheme (fitness proportionate, tournament, truncation, etc.), and any fitness scaling that may occur (sigma scaling, linear scaling, etc.). "The great power of symmetry arguments lies just in the fact that they are not deterred by any amount of complication in the details", writes Jaynes [13, p331]. An appeal to symmetry, in other words, allows one to cut through complications that might hobble attempts to reason within a formal axiomatic system.
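The permutation symmetry of mask-based crossover described in this section can be checked mechanically. The sketch below is an illustration only; the helper names `uniform_crossover`, `permute`, and `inverse`, the seed, and the chromosome length are assumptions. It verifies that permuting both parents and the mask with $\pi$, crossing over, and then undoing the permutation with $\pi^{-1}$ yields the same child as crossing over directly. Because the entries of a uniform crossover mask are independent and identically distributed, the permuted mask has the same distribution as the original one.

```python
import random

def uniform_crossover(parent_a, parent_b, mask):
    """The child takes parent_a's bit where the mask is 1 and parent_b's bit where it is 0."""
    return [a if m else b for a, b, m in zip(parent_a, parent_b, mask)]

def permute(bits, perm):
    """Position i of the result holds bits[perm[i]]."""
    return [bits[p] for p in perm]

def inverse(perm):
    """Return the permutation that undoes perm."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv

if __name__ == "__main__":
    rng = random.Random(0)   # arbitrary seed
    n = 12                   # hypothetical chromosome length
    a = [rng.randint(0, 1) for _ in range(n)]
    b = [rng.randint(0, 1) for _ in range(n)]
    mask = [rng.randint(0, 1) for _ in range(n)]   # i.i.d. fair coin flips
    perm = list(range(n))
    rng.shuffle(perm)

    direct = uniform_crossover(a, b, mask)
    via_permutation = permute(
        uniform_crossover(permute(a, perm), permute(b, perm), permute(mask, perm)),
        inverse(perm),
    )
    print(direct == via_permutation)
```

The identity itself holds for any mask; what is special about uniform crossover is that the distribution of the mask is unchanged by the permutation, which is not the case for, say, one-point crossover.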
Of course, symmetry arguments are not without peril. However, when used sparingly and only in circumstances where the symmetries are readily apparent, they can yield significant insight at low cost.

7. STATISTICAL HYPOTHESIS TESTING

For any positive integer $n$, let $f$ be a 7-parity function over $\{0, 1\}^n$, and let $\phi_f$ be a noisy MQ oracle such that for any $x \in \{0, 1\}^n$, $\Pr(\phi_f(x) = \neg f(x)) = 1/5$ and $\Pr(\phi_f(x) = f(x)) = 4/5$. Let $G(n)$ be the simple genetic algorithm given in Algorithm 1 with chromosomes of length $n$, population size $m = 1500$, uniform recombination ($asex = false$), and per bit mutation probability $p_{mut} = 0.004$. Let $A_{\phi_f}(n)$ be an algorithm that runs $G(n)$ for 800 generations using $\phi_f$ as the fitness function and returns a set $S \subseteq [n]$ such that $i \in S$ if and only if the 1-frequency of locus $i$ at generation 800 exceeds 1/2.

Claim 11. $A_{\phi_f}$ PPC-solves the learning 7-parities problem in $O(n)$ time and $O(1)$ queries.

Argument. Let $D'_m$ be the set $\{x \in D_m \mid 0.05 < x < 0.95\}$. Note that the hidden function of $f$ is invariant to a reordering of its inputs and the per bit probability of mutation in $G$ is constant with respect to $n$. Thus, Conclusions 7, 8, 9, and 10 hold. Consider the following two null hypotheses:

$$H_0^p:\ \sum_{x \in D'_{1500}} p_{800}(x) \ge \frac{1}{8}$$

$$H_0^q:\ \sum_{x \notin D'_{1500}} q_{800}(x) \ge \frac{1}{8}$$

We seek to reject $H_0^p \vee H_0^q$ at the $10^{-100}$ level of significance. Assume $H_0^p$ is true. Then for any independent random variables $X_1, \ldots, X_{3000}$ drawn from the distribution $p_{800}$, and any $i \in [3000]$, $\Pr(X_i \in D'_{1500}) \ge 1/8$, which entails that $\Pr(X_i \notin D'_{1500}) \le 7/8$. The independence of the random variables entails that

$$\Pr(X_1 \notin D'_{1500} \wedge \ldots \wedge X_{3000} \notin D'_{1500}) \le \left(\frac{7}{8}\right)^{3000}$$

Let $f^*$ be the 7-parity function over $\{0, 1\}^8$ whose juntas are given by the set $\{1, \ldots, 7\}$. Figures 3a and 3b show the 1-frequency of the first and last loci, respectively, of $G(8)$ given the fitness function $\phi_{f^*}$ in 3000 independent runs, each 800 generations long.³ Thus, under $H_0^p$, the chance that the 1-frequency of the first locus of $G(8)$ is in $D_{1500} \setminus D'_{1500}$ in generation 800 of all 3000 runs, as seen in Figure 3a, is at most $(7/8)^{3000}$. As $(7/8)^{3000} < 10^{-173}$ (since $3000 \log_{10}(7/8) \approx -173.98$), we can reject hypothesis $H_0^p$ at the $10^{-173}$ level of significance.

Likewise, if $H_0^q$ is true, then for any independent random variables $X_1, \ldots, X_{3000}$ drawn from the distribution $q_{800}$, and any $i \in [3000]$, $\Pr(X_i \notin D'_{1500}) \ge 1/8$, which entails that $\Pr(X_i \in D'_{1500}) \le 7/8$ and, by independence, that

$$\Pr(X_1 \in D'_{1500} \wedge \ldots \wedge X_{3000} \in D'_{1500}) \le \left(\frac{7}{8}\right)^{3000}$$

Thus, under $H_0^q$, the chance that the 1-frequency of the last locus of $G(8)$ could be in $D'_{1500}$ in generation 800 of all 3000 runs, as seen in Figure 3b, is at most $(7/8)^{3000}$. We thus reject hypothesis $H_0^q$ at the $10^{-173}$ level of significance.

Each p-value is less than a Bonferroni adjusted critical value of $10^{-100}/2$, so we reject the global null hypothesis $H_0^p \vee H_0^q$ at the $10^{-100}$ level of significance. We are left with the following conclusions:

$$\sum_{x \in D'_{1500}} p_{800}(x) < \frac{1}{8} \qquad \text{and} \qquad \sum_{x \notin D'_{1500}} q_{800}(x) < \frac{1}{8}$$

The observation that running $G(n)$ for 800 generations takes $O(n)$ time and $O(1)$ queries completes the argument.

Figure 3: The 1-frequency dynamics over 3000 runs of the first (left figure, a) and last (right figure, b) loci of Algorithm 1 with $m = 1500$, $n = 8$, $\tau = 800$, $p_{mut} = 0.004$, and $asex = false$ using the membership query oracle $\phi_{f^*}$, described in the text, as a fitness function. The dashed lines mark the frequencies 0.05 and 0.95.

³ The experiment can be rerun and the results examined by visiting https://github.com/burjorjee/evolve-parities.git and following instructions.

8. REMARKS

Now that our prediction has been validated, some remarks and clarifications are in order.

8.1 Other values of k and η?
The result obtained above pertains only to $k = 7$ and $\eta = 1/5$. We expect that the proof technique used here can be used to derive identical bounds with respect to $n$ and $\delta$ for other values of $k$ and $\eta$ as long as these values remain small. This conjecture can only be verified on a case by case basis with the proof technique provided. The symmetry argument used in the proof precludes the derivation of bounds with respect to $k$ and $\eta$ as is typically done in the computational learning literature. Our goal, however, is not to derive such bounds for all $k$ and $\eta$, but to verify a prediction that follows from the CES hypothesis, and in doing so to give the first proof of efficient computational learning in an evolutionary algorithm on a non-trivial learning problem. That the computational learning is optimally efficient is a welcome finding.

Algorithm 1: Pseudocode for a simple genetic algorithm with uniform crossover. The population is stored in an m by n array of bits, with each row representing a single chromosome. shuffle(·) randomly shuffles the contents of an array in-place, rand() returns a number drawn uniformly at random from the interval [0,1], ones(a, b) returns an a by b array of ones, and rand(a, b) < c resolves to an a by b array of bits each of which is 1 with probability c.

    Input: m: population size
    Input: n: length of bitstrings
    Input: τ: number of generations
    Input: pmut: per bit mutation probability
    Input: asex: (flag) perform asexual evolution
    pop ← rand(m, n) < 0.5
    for t ← 1 to τ do
        fitnessVals ← evaluate-fitness(pop)
        totalFitness ← 0
        for i ← 1 to m do
            totalFitness ← totalFitness + fitnessVals[i]
        end
        cumFitnessVals[1] ← fitnessVals[1]
        for i ← 2 to m do
            cumFitnessVals[i] ← cumFitnessVals[i − 1] + fitnessVals[i]
        end
        for i ← 1 to 2m do
            k ← rand() ∗ totalFitness
            ctr ← 1
            while k > cumFitnessVals[ctr] do
                ctr ← ctr + 1
            end
            parentIndices[i] ← ctr
        end
        shuffle(parentIndices)
        if asex then
            crossOverMasks ← ones(m, n)
        else
            crossOverMasks ← rand(m, n) < 0.5
        end
        for i ← 1 to m do
            for j ← 1 to n do
                if crossOverMasks[i, j] = 1 then
                    newPop[i, j] ← pop[parentIndices[i], j]
                else
                    newPop[i, j] ← pop[parentIndices[i + m], j]
                end
            end
        end
        mutationMasks ← rand(m, n) < pmut
        for i ← 1 to m do
            for j ← 1 to n do
                newPop[i, j] ← xor(newPop[i, j], mutationMasks[i, j])
            end
        end
        pop ← newPop
    end

8.2 Use and Misuse of CES

We have previously explained how CES powers efficient general-purpose, non-local, noise-tolerant optimization in genetic algorithms with uniform crossover [4]. For the sake of completeness, we provide a sketch below.

Consider the following heuristic: Use CES to find a coarse schema partition $\llbracket I \rrbracket_n$ with a significant effect. Now limit future search to a schema $S \in \llbracket I \rrbracket_n$ with an above average sampling fitness. Limiting future search in this way amounts to permanently setting, i.e. fixing, the values at each locus $i \in I$ in the population to the binary value given by the $i$th symbol of the template of $S$, and performing search over the remaining loci. Fixing a schema thus yields a new, lower-dimensional search space. We now make use of the staggered conditional effects assumption [4]. We assume, in other words, that there exists a coarse schema partition in the new search space that may have had a statistically insignificant effect in the old search space but has a significant effect in the new space. Once again, use CES to find such a schema partition and limit future search to a schema in the partition with an above average sampling fitness. Recurse in this manner until no statistically significant conditional effects can be found.

Such a heuristic is non-local because it does not make use of neighborhood information. It is noise-tolerant because it is sensitive only to the average fitness values of coarse schemata.
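Because the discussion here concerns what a UGA does and does not implement, a concrete rendering of Algorithm 1 may be useful. The following is a minimal NumPy sketch, not the author's implementation: the function names `simple_ga` and `noisy_parity_oracle` are hypothetical, the roulette-wheel loop of Algorithm 1 is replaced by an equivalent weighted draw with replacement, the recording of per-generation 1-frequencies is an addition for inspection, and the uniform fallback used when every fitness value is zero is a guard that Algorithm 1 does not specify.

```python
import numpy as np


def noisy_parity_oracle(pop, relevant, eta, rng):
    """Noisy membership queries: parity of the relevant loci, flipped with probability eta."""
    parity = (pop[:, relevant].sum(axis=1) % 2).astype(np.int64)
    flip = rng.random(pop.shape[0]) < eta
    return np.where(flip, 1 - parity, parity)


def simple_ga(fitness, m, n, tau, p_mut, asex, rng):
    """Sketch of Algorithm 1; returns the final population and the 1-frequencies per generation."""
    pop = (rng.random((m, n)) < 0.5).astype(np.uint8)
    one_freqs = np.empty((tau, n))
    for t in range(tau):
        fit = np.asarray(fitness(pop), dtype=float)
        total = fit.sum()
        # Fitness proportionate selection. The uniform fallback is an assumption
        # added here for the degenerate case in which all fitness values are zero.
        probs = fit / total if total > 0 else np.full(m, 1.0 / m)
        # Independent weighted draws are already in random order, so the
        # explicit shuffle of Algorithm 1 is not needed.
        parents = rng.choice(m, size=2 * m, p=probs)
        if asex:
            masks = np.ones((m, n), dtype=bool)   # copy the first parent only
        else:
            masks = rng.random((m, n)) < 0.5      # uniform crossover
        new_pop = np.where(masks, pop[parents[:m]], pop[parents[m:]])
        mutation = rng.random((m, n)) < p_mut     # rate does not depend on n
        pop = np.logical_xor(new_pop, mutation).astype(np.uint8)
        one_freqs[t] = pop.mean(axis=0)
    return pop, one_freqs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    relevant = list(range(7))                     # juntas {1, ..., 7}, zero-indexed

    def oracle(p):
        return noisy_parity_oracle(p, relevant, 0.2, rng)

    # Small illustrative settings; the paper uses m = 1500 and tau = 800.
    _, freqs = simple_ga(oracle, m=200, n=8, tau=50, p_mut=0.004, asex=False, rng=rng)
    print(freqs[-1])
```

Passing `asex=True` disables recombination, which corresponds to the setting used for Figure 4. No particular output is claimed for the small settings shown.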
We consider it to be general-purpose firstly because it relies on a very weak assumption about the distribution of fitness over a search space, namely the existence of staggered conditional effects; and secondly because it is an example of a decimation heuristic, and as such is in good company. Decimation heuristics such as Survey Propagation [18, 15] and Belief Propagation [17], when used in concert with local search heuristics (e.g. WalkSat [27]), are state of the art methods for solving large instances of several NP-hard combinatorial optimization problems close to their solvability/unsolvability thresholds.

We have hypothesized that the heuristic sketched above is the abstract heuristic that UGAs implement or, as is the case more often than not, misimplement. It stands to reason that an unspecified computational efficiency might not be harnessed properly while it remains unspecified. The common practice of making the per bit mutation rate depend on the length of chromosomes, so as to make the expected number of mutations per chromosome independent of chromosomal length, is a mistake, and a non-bioplausible one at that.

8.3 CES is not Implicit Parallelism

Given its description in terms of concepts from schema theory, CES bears a resemblance to implicit parallelism, the computational efficiency hypothesized to be powering adaptation in genetic algorithms according to the beleaguered building block hypothesis [10, 24]. While the implicit parallelism hypothesis has informed the formulation of the CES hypothesis, the two hypotheses are emphatically not the same. The objects supposedly evaluated during implicit parallel search are low order schemata satisfying an adjacency constraint: the maximum number of $*$ symbols between any two non-$*$ symbols in a schema’s template must be small. The objective of the search is to find schemata whose expected sampling fitness is greater than the average sampling fitness of the population. If a schema fits the bill, then according to the implicit parallelism hypothesis, its frequency will increase multiplicatively in each generation.

On the other hand, the objects evaluated in concurrent effect search are coarse schema partitions. The objective of the search is to find schema partitions with statistically significant sampling effects. If a schema partition fits the bill across multiple generations, then according to the CES hypothesis, a single schema in the partition with above average expected sampling fitness will go to fixation.

Looking at the number of objects evaluated as $n$ gets large, a simple argument shows that the adjacency constraint required by implicit parallelism gives concurrent effect search the upper hand. For chromosomes of length $n$, the number of schema partitions of order $k$ is $\binom{n}{k} \in \Omega(n^k)$ [6]. Thus, the number of schema partitions evaluated by concurrent effect search scales polynomially with $n$. On the other hand, the adjacency constraint restricts the number of schemata supposedly evaluated by implicit parallelism to $O(n)$.

9. THE ROLE OF SEX

It is evident that higher organisms reproduce sexually; however, the nature of the advantage conferred by sex is still under debate. Sex seems contradictory given what we know about the prevalence of epistasis in biological genomes. Interaction between genes (epistasis) is the norm, not the exception [25]. What sense does it make, then, to break up genes that play well together? We build up to an answer by beginning with a simpler question: What role does recombination play in procuring the result in Figure 3? One way to approach the question is to repeat the experiment with recombination turned off.

Figure 4 shows the 1-frequencies of the first and last loci of Algorithm 1 over 75 runs using $\phi_{f^*}$ as a fitness function, with $m = 1500$ and $n = 8$ as before. The values of $\tau$ and $asex$ were changed to 10000 and $true$ respectively. In other words, we greatly reduced the number of runs, greatly increased the length of each run, and, most importantly, disabled recombination. As one can see, the first locus does not go to fixation even once during the 75 runs, despite the increase in the run length to 10000 generations.

Figure 4: The 1-frequency dynamics over 75 runs of the first (left figure) and last (right figure) loci of Algorithm 1 with $m = 1500$, $n = 8$, $\tau = 10000$, $p_{mut} = 0.004$, and $asex = true$ using the membership query oracle $\phi_{f^*}$, described in the text, as a fitness function. The dashed lines mark the frequencies 0.05 and 0.95.

Perhaps some other settings of $m$ or $p_{mut}$ might do a better job of sending the first locus to fixation while keeping the last locus unfixed. To understand why this isn’t likely, consider the 2-parity function $f'$ with juntas $\{1, 2\}$ over $\{0, 1\}^3$, and consider infinite population evolution over $\{0, 1\}^3$ using $\phi_{f'}$ as the fitness function. The frequencies of $00*$, $01*$, $10*$, and $11*$ in any population then form a point in the simplex over these values. In the absence of recombination, $\langle 0, \frac{1}{2}, \frac{1}{2}, 0 \rangle$ is the only fixed point of the system and it is stable. In the presence of uniform recombination, however, $\langle 0, \frac{1}{2}, \frac{1}{2}, 0 \rangle$ is an unstable fixed point. The instability arises from the fact that $01*$ and $10*$ are unmixable, i.e. they tend to yield chromosomes of lower value when crossed. The only two stable fixed points of the system are $\langle 0, 1, 0, 0 \rangle$ and $\langle 0, 0, 1, 0 \rangle$.

9.1 Sex, Effects, and Unmixability

It can be seen that a schema partition with unmixable schemata must have a non-zero effect. It is this property that gives recombination its power. Essentially, recombination, selection, and the unmixability of the schemata in a schema partition ensure that one of the schemata in the partition will go to fixation. A second and equally important benefit of sex is that it curbs hitchhiking [26, 9, 19], so only the loci participating in an effect go to fixation.
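The unmixability of $01*$ and $10*$ can be made concrete by enumerating the offspring of a uniform crossover exactly. The sketch below is illustrative only: the function names are hypothetical, the oracle noise is ignored (fitness is the noise-free 2-parity of the first two loci), and selection and mutation are left out. It computes the expected parity of a child when every crossover mask is equally likely.

```python
from fractions import Fraction
from itertools import product


def parity2(x):
    """Noise-free 2-parity over the first two loci (juntas {1, 2})."""
    return x[0] ^ x[1]


def offspring_distribution(a, b):
    """Exact distribution of children of a and b under uniform crossover."""
    n = len(a)
    dist = {}
    for mask in product((0, 1), repeat=n):
        child = tuple(a[i] if mask[i] else b[i] for i in range(n))
        dist[child] = dist.get(child, Fraction(0)) + Fraction(1, 2 ** n)
    return dist


def expected_parity(a, b):
    """Expected noise-free fitness of a child of a and b."""
    return sum(p * parity2(c) for c, p in offspring_distribution(a, b).items())


# Both parents have parity 1, but one belongs to 01* and the other to 10*.
mixed = expected_parity((0, 1, 0), (1, 0, 1))

# Both parents belong to 01*.
same = expected_parity((0, 1, 0), (0, 1, 1))

print(mixed, same)  # prints: 1/2 1
```

Crossing two fit parents drawn from different schemata yields a fit child only half of the time, whereas crossing parents drawn from the same schema always yields a fit child. This is the sense in which the mixed state is penalized once recombination is present.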
The explanation above can be short and sweet because the bulk of the explanatory heavy lifting has already occurred in the CES hypothesis. Within the rubric provided by this hypothesis, the computational purpose of sex becomes transparent. It is difficult, in fact, to think of another operation that can play the part played by sex (driving the fixation of genes participating in an effect while curbing hitchhiking) as effectively.

10. CONCLUSION

For the first time, optimally efficient non-trivial computation is shown to occur in an evolutionary system. This demonstration is accompanied by a hypothesis about the more general computation that evolutionary algorithms implicitly execute efficiently: Concurrent Effect Search. We explained how Concurrent Effect Search in evolutionary algorithms can power non-local, noise-tolerant, general-purpose search, the sort of thing evolutionary systems are known to be good at. Finally, we highlighted the crucial role played by sex and unmixability in the derivation of our optimality result, and, more generally, in the implicit implementation of concurrent effect search.

11. REFERENCES

[1] Avrim L. Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1):245–271, 1997.
[2] Keki M. Burjorjee. Sufficient conditions for coarse-graining evolutionary dynamics. In Foundations of Genetic Algorithms 9 (FOGA IX), 2007.
[3] Keki M. Burjorjee. Generative Fixation: A Unified Explanation for the Adaptive Capacity of Simple Recombinative Genetic Algorithms. PhD thesis, Brandeis University, 2009.
[4] Keki M. Burjorjee. Explaining optimization in genetic algorithms with uniform crossover. In Proceedings of the twelfth workshop on Foundations of genetic algorithms XII. ACM, 2013.
[5] Keki M. Burjorjee and Jordan B. Pollack. A general coarse-graining framework for studying simultaneous inter-population constraints induced by evolutionary operations. In GECCO 2006: Proceedings of the 8th annual conference on Genetic and evolutionary computation. ACM Press, 2006.
[6] T. H. Cormen, C. H. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw-Hill, 1990.
[7] L. J. Eshelman, R. A. Caruana, and J. D. Schaffer. Biases in the crossover landscape. In Proceedings of the third international conference on Genetic algorithms, pages 10–19, 1989.
[8] Vitaly Feldman. Attribute-efficient and non-adaptive learning of parities and DNF expressions. Journal of Machine Learning Research, 8:1431–1460, 2007.
[9] Stephanie Forrest and Melanie Mitchell. Relative building-block fitness and the building-block hypothesis. In L. Darrell Whitley, editor, Foundations of Genetic Algorithms 2, pages 109–126, San Mateo, CA, 1993. Morgan Kaufmann.
[10] David E. Goldberg. Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, Reading, MA, 1989.
[11] Sally A. Goldman, Michael J. Kearns, and Robert E. Schapire. Exact identification of read-once formulas using fixed points of amplification functions. SIAM Journal on Computing, 22(4):705–726, 1993.
[12] John H. Holland. Building blocks, cohort genetic algorithms, and hyperplane-defined functions. Evolutionary Computation, 8(4):373–391, 2000.
[13] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2007.
[14] Michael J. Kearns and Umesh Virkumar Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[15] Lukas Kroc, Ashish Sabharwal, and Bart Selman. Survey propagation revisited. In Ronald Parr and Linda C. van der Gaag, editors, UAI, pages 217–226. AUAI Press, 2007.
[16] David J. C. MacKay. Information Theory, Inference, and Learning Algorithms, volume 7. Cambridge University Press, 2003.
[17] Elitza Maneva, Elchanan Mossel, and Martin J. Wainwright. A new look at survey propagation and its generalizations. J. ACM, 54(4), July 2007.
[18] M. Mézard, G. Parisi, and R. Zecchina. Analytic and algorithmic solution of random satisfiability problems. Science, 297(5582):812–815, 2002.
[19] Melanie Mitchell. An Introduction to Genetic Algorithms. The MIT Press, Cambridge, MA, 1996.
[20] Elchanan Mossel, Ryan O’Donnell, and Rocco P. Servedio. Learning juntas. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 206–212. ACM, 2003.
[21] A. E. Nix and M. D. Vose. Modeling genetic algorithms with Markov chains. Annals of Mathematics and Artificial Intelligence, 5(1):79–88, 1992.
[22] Karl Popper. Conjectures and Refutations. Routledge, 2007.
[23] Karl Popper. The Logic of Scientific Discovery. Routledge, 2007.
[24] C. R. Reeves and J. E. Rowe. Genetic Algorithms: Principles and Perspectives: a Guide to GA Theory. Kluwer Academic Publishers, 2003.
[25] Sean H. Rice. The evolution of developmental interactions. Oxford University Press, 2000.
[26] J. David Schaffer, Larry J. Eshelman, and Daniel Offutt. Spurious correlations and premature convergence in genetic algorithms. In Gregory J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 102–112, San Mateo, 1991. Morgan Kaufmann.
[27] B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. Cliques, coloring, and satisfiability: Second DIMACS implementation challenge, 26:521–532, 1993.
[28] Uehara, Tsuchida, and Wegener. Identification of partial disjunction, parity, and threshold functions. TCS: Theoretical Computer Science, 230, 2000.
[29] Leslie Valiant. Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World. Basic Books, 2013.