A DEFINITION OF INFORMATION
by Sergei Levashkin, sergei@levashkin.com, www.levashkin.com
Within the course "Teoría de la Computación" of Adolfo Guzmán Arenas, CIC-IPN, April 2010

MAIN CHARACTERS
Alan Turing (1912-1954), Alonzo Church (1903-1995), Claude Shannon (1916-2001), Andrei Kolmogorov (1903-1987), Ray Solomonoff (1926-2009), and Gregory Chaitin (1947-)

CONTENTS
• Andrei Nikolaevich Kolmogorov: his life and work
• Information theory (Kolmogorov complexity = algorithmic complexity = descriptive complexity)
• Selected works
• Background and related work:
1. Probability axioms
2. "A Mathematical Theory of Communication" by Claude Elwood Shannon
3. The Turing-Church thesis
4. Kolmogorov machines
5. Kolmogorov machines versus Turing machines
6. Solomonoff's algorithmic probability and inductive inference
7. Chaitin's work
• Information: earlier definitions
• Motivation
• Three approaches to defining the amount of information (Kolmogorov's analysis)
• Definition of Kolmogorov complexity
• Properties of Kolmogorov complexity
• Definition of conditional Kolmogorov complexity
• Properties of conditional Kolmogorov complexity
• Examples and applications
• Semantic information
• Conclusions and future work

Andrei Nikolaevich Kolmogorov
April 25, 1903, Tambov, Russian Empire; October 20, 1987, Moscow, USSR

K • LIFE AND WORK • INFORMATION THEORY
K • LIFE AND WORK
K • LIFE
K Native House
K Relatives & Aunt • Classical education: French, history
K Youth, 1920s • Fourier analysis • Theory of functions • Topology (Kolmogorov and Alexandrov)
K Vitae: "From 1922 I taught mathematics at a secondary school in Moscow while studying at the university in parallel. In 1922 I began my independent research under the supervision of V.V. Stepanov and N.N. Luzin. By the time I graduated I had around fifteen journal publications on the theory of functions of a real variable..."
K Middle Age, 1930s • Foundations of the Theory of Probability • Logic • Member of the Academy at age 36
K Komarovka • A now-legendary place: sports, rest
K Reads, 1940s • Defense work: statistical theory of bombing and shooting • Second World War • Politics
K Wife • She was his high-school friend
K Writes, 1950s • Dynamical systems • Hilbert's 13th problem • Statistics • Theory of algorithms
K Moscow State University. Address: Leninskie Gory, MGU, Sector "L", Apartment 10, Moscow, U.S.S.R.
K Thinks, 1960s • Textbooks • Turbulence • Linguistic analysis of poetry and fiction • Theory of information
K Looks, 1970s • ICM Congress in Nice: discrete versus continuous mathematics • Numerous awards • Leading Soviet mathematician
K Smiles • Scientific childhood: Kolmogorov's enigma
K Rest • Skiing, swimming, sailing, walking, rowing... • Classical music (TOPAZ)
K Lecture 1 • Lecturer worldwide
K Lecture 2 • Difficult to understand
K Kolmogorov's School: As a tutor he trained several outstanding people in mathematics, probability, statistics, logic, informatics, physics, oceanography, atmospheric science, etc.
K Kolmogorov Students: from all over the Soviet Union. Kolmogorov's High School in Physics & Math No. 18, founded on December 2, 1963.
K Awards • Seven Orders of Lenin, the Order of the October Revolution, and the high title of Hero of Socialist Labor; Lenin Prizes and State Prizes.
K Awards • The Royal Netherlands Academy of Sciences (1963), the Royal Society of London (1964), the U.S. National Academy of Sciences (1967), the Paris Academy of Sciences (1968), the Polish Academy of Sciences, the Romanian Academy of Sciences (1956), the German Academy of Sciences Leopoldina (1959), the American Academy of Arts and Sciences in Boston (1959).
K "You've dreamed for a second... now, down to work!" Awards • Honorary doctorates from the universities of Paris, Berlin, Warsaw, Stockholm, etc. • Honorary member of the Moscow, London, Indian, and Calcutta Mathematical Societies, of the Royal Statistical Society of London, the International Statistical Institute, and the American Meteorological Society. • International Balzan Prize (1963).
K 80! 1980s • Health problems (Parkinson's disease)
K Elder Age • Died in Moscow on October 20, 1987
K Novodevichy Cemetery
K • WORK
K • INFORMATION THEORY
K • SELECTED WORKS
K • Selected Works of A.N. Kolmogorov: Mathematics and Mechanics, Volume 1, edited by V. M. Tikhomirov
• Selected Works of A.N. Kolmogorov: Probability Theory and Mathematical Statistics, Volume 2, edited by A. N. Shiryayev
• Selected Works of A.N. Kolmogorov: Information Theory and the Theory of Algorithms, Volume 3, edited by A. N. Shiryayev
K • BACKGROUND AND RELATED WORK 0: Probability axioms, 1930
• First axiom. The probability of an event is a non-negative real number: P(E) >= 0 for every E in F, where F is the event space and E is any event in F.
K • BACKGROUND AND RELATED WORK 1: Probability axioms, 1930
• Second axiom. This is the assumption of unit measure: the probability that some elementary event in the entire sample space will occur is 1, that is, P(Ω) = 1. More specifically, there are no elementary events outside the sample space. This is often overlooked in mistaken probability calculations; if you cannot precisely define the whole sample space, then the probability of any subset cannot be defined either.
K • BACKGROUND AND RELATED WORK 2: Probability axioms, 1930
• Third axiom. This is the assumption of σ-additivity: any countable sequence of pairwise disjoint events E1, E2, ... satisfies P(E1 ∪ E2 ∪ ...) = P(E1) + P(E2) + ...
K • BACKGROUND AND RELATED WORK 3: Claude Shannon, 1948
"A Mathematical Theory of Communication", The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.
"Claude Shannon, Father of the Information Age". Video: http://www.ucsd.tv/searchdetails.aspx?showID=6090
Kolmogorov's analysis
K • BACKGROUND AND RELATED WORK 4: Turing-Church thesis, 1930-1952
Informally, the Church-Turing thesis states that if an algorithm (a procedure that terminates) exists, then there is an equivalent Turing machine, recursively definable function, or λ-definable function for that algorithm. A simpler but still faithful way of putting it is that "everything computable is computable by a Turing machine." Though not formally proven, today the thesis has near-universal acceptance.
K • BACKGROUND AND RELATED WORK 5: Kolmogorov machine, 1953
Kolmogorov machines are similar to Turing machines except that the tape can change its topology. The tape is a finite connected graph with a distinguished (active) node. The graph is directed but symmetric: if there is an edge from u to v, then there is an edge from v to u. The edges are colored in such a way that the edges coming out of any one node have different colors. Thus, every path from the active node is characterized by a string of colors. The number of colors is bounded (for each machine).
The neighborhood of the active node of some fixed (for every machine) radius is called the active zone. (One may suppose that the radius always equals 2.) For each isomorphism type of the active zone, the program specifies a sequence of instructions of the following forms:
K • BACKGROUND AND RELATED WORK 6: Kolmogorov machine, 1953
1. add a new node together with a pair of edges of some colors between the active node and the new one,
2. remove a node and the edges incident to it,
3. add a pair of edges of some colors between two existing nodes,
4. remove the two edges between two existing nodes,
5. halt.
Executing the whole instruction sequence creates the next configuration of the tape.
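To make the graph tape concrete, here is a minimal data-structure sketch in Python. It is not part of the original slides; the names (KolmogorovTape, add_node, connect, and so on) are hypothetical, and it models only the tape and the four editing instructions above, not the selection of instruction sequences by isomorphism type of the active zone.

# Sketch only (assumed names): the Kolmogorov-machine tape as a finite symmetric
# graph with colored edges and a distinguished active node.
class KolmogorovTape:
    def __init__(self):
        self.edges = {0: {}}     # node -> {color: neighbor}; kept symmetric below
        self.active = 0          # distinguished (active) node
        self._next_id = 1

    def add_node(self, color_out, color_back, at=None):
        """Instruction 1: add a new node joined to `at` (default: the active node)."""
        at = self.active if at is None else at
        new = self._next_id
        self._next_id += 1
        self.edges[new] = {}
        self.connect(at, new, color_out, color_back)
        return new

    def connect(self, u, v, color_uv, color_vu):
        """Instruction 3: add a pair of edges between two existing nodes.
        Edges leaving any one node must all have different colors."""
        assert color_uv not in self.edges[u] and color_vu not in self.edges[v]
        self.edges[u][color_uv] = v
        self.edges[v][color_vu] = u

    def disconnect(self, u, v):
        """Instruction 4: remove the pair of edges between two existing nodes."""
        self.edges[u] = {c: w for c, w in self.edges[u].items() if w != v}
        self.edges[v] = {c: w for c, w in self.edges[v].items() if w != u}

    def remove_node(self, u):
        """Instruction 2: remove a node and all edges incident to it."""
        for v in list(self.edges[u].values()):
            self.disconnect(u, v)
        del self.edges[u]

    def follow(self, colors):
        """Every path from the active node is characterized by a string of colors."""
        node = self.active
        for c in colors:
            node = self.edges[node][c]
        return node

# Tiny usage example: grow two nodes off the active node and walk by colors.
tape = KolmogorovTape()
a = tape.add_node("red", "blue")
b = tape.add_node("green", "blue", at=a)
print(tape.follow(["red", "green"]) == b)   # True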
K • BACKGROUND AND RELATED WORK 7: Kolmogorov vs. Turing machines
K machines are more powerful than T machines: there exists a function that is real-time computable by some Kolmogorov machine but not real-time computable by any Turing machine. The following description of real-time computability is appropriate in the case of Turing or Kolmogorov machines: a real-time algorithm inputs a symbol, then outputs a symbol within a fixed number c of steps, then inputs a symbol, then outputs a symbol within c steps, and so on.
K • BACKGROUND AND RELATED WORK 8: Solomonoff's algorithmic probability and inductive inference, 1960-1964
Algorithmic probability quantifies the idea of theories and predictions with reference to short programs and their output: take a universal computer and randomly generate an input program. The program will compute some possibly infinite output. The algorithmic probability of any given finite output prefix q is the sum of the probabilities of the programs that compute something starting with q. Certain long objects with short programs have high probability.
K • BACKGROUND AND RELATED WORK 9: Solomonoff's algorithmic probability and inductive inference, 1960-1964
Algorithmic ("Solomonoff") probability (AP) assigns to objects an a priori probability that is in some sense universal. This prior distribution has theoretical applications in a number of areas, including inductive inference theory and the time-complexity analysis of algorithms. Its main drawback is that it is not computable and thus can only be approximated in practice.
Consider an unknown process producing a binary string of one hundred 1s. The probability that such a string is the result of a uniform random process, such as fair coin flips, is the same as that of any other string of the same length. Intuitively, though, we feel that there is a difference between a string that we can recognize and distinguish and the vast majority of strings that are indistinguishable to us. In fact, we feel that a far more likely explanation is that some simple deterministic algorithmic process has generated this string.
K • BACKGROUND AND RELATED WORK 10: Solomonoff's algorithmic probability and inductive inference, 1960-1964
This observation was already made by P.S. Laplace around 1800, but the techniques were lacking for that great mathematician to quantify this insight. Today we can do so thanks to the fortunate confluence of combinatorics, information theory, and the theory of algorithms. It is clear that, in a world with computable processes, patterns which result from simple processes are relatively likely, while patterns that can only be produced by very complex processes are relatively unlikely.
Formally, a computable process that produces a string x is a program p that, when executed on a universal Turing machine U, produces the string x as output. As p is itself a binary string, we can define the discrete universal a priori probability, m(x), to be the probability that the output of a universal prefix Turing machine U is x when provided with fair coin flips on the input tape. Formally, m(x) = Σ 2^(-|p|), where the sum is over all halting programs p for which U outputs the string x. The boundedness of the sum follows from realizing that the expression is indeed a probability, which hinges on the fact that a prefix machine is one with only {0,1} input symbols and no end-of-input marker of any kind.
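The intuition behind these probabilities, that a string such as one hundred 1s has a far shorter description than a typical coin-flip string, can be illustrated with an ordinary compressor. The following sketch is not from the slides; compressed length is only a rough, computable stand-in for description length, since m(x) and K(x) themselves are uncomputable.

# Illustration only: compressed length as a computable stand-in for "length of a
# short description". Simple strings compress far below their length; typical
# coin-flip strings do not.
import random
import zlib

n = 10_000                                   # scaled up from the slide's 100 bits so
                                             # that compressor overhead is negligible
simple = "1" * n                             # output of a very simple process
coinflips = "".join(random.choice("01") for _ in range(n))

for name, s in [("all ones", simple), ("coin flips", coinflips)]:
    compressed = len(zlib.compress(s.encode()))
    print(f"{name:10s}: length {len(s)}, compressed {compressed}")

# Qualitative expectation: the all-ones string shrinks to a handful of bytes, while
# the coin-flip string stays near its information content (about n/8 bytes), mirroring
# the claim that simple patterns have short programs and hence high m(x).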
K • BACKGROUND AND RELATED WORK 11: Solomonoff's algorithmic probability and inductive inference, 1960-1964
It is easy to see that this distribution is strongly related to Kolmogorov complexity, in that m(x) is at least the maximum term in the summation, which is 2^(-K(x)). The Coding Theorem of L.A. Levin (1974) states that equality also holds in the other direction, in the sense that m(x) = Θ(2^(-K(x))), that is, -log2 m(x) = K(x) + O(1).
K • BACKGROUND AND RELATED WORK 12: Solomonoff's algorithmic probability and inductive inference, 1960-1964
Algorithmic probability is the main ingredient of Ray Solomonoff's theory of inductive inference, the theory of prediction based on observations. Given a sequence of symbols, which will come next? Solomonoff's theory provides an answer that is optimal in a certain sense, although it is incomputable. Unlike, for example, Karl Popper's informal inductive inference theory, Solomonoff's is mathematically rigorous. Algorithmic probability is closely related to the concept of Kolmogorov complexity. The Kolmogorov complexity of any computable object is the length of the shortest program that computes it and then halts. The invariance theorem shows that it is not really important which computer we use. Solomonoff's enumerable measure is universal in a certain powerful sense, but it ignores computation time.
K • BACKGROUND AND RELATED WORK 13: Chaitin, circa 1966
Chaitin's early work on algorithmic information theory paralleled the earlier work of Kolmogorov. Ref. http://en.wikipedia.org/wiki/Gregory_Chaitin
K • BACKGROUND AND RELATED WORK 14: Chaitin, circa 1966
Define L(S) for any finite binary sequence S by: a Turing machine with n internal states can be programmed to compute S if and only if n >= L(S). Define L(Cn) = max L(S), where S ranges over the binary sequences of length n, and let Cn be the set of all binary sequences of length n satisfying L(S) = L(Cn). (October 19, 1965.) It is proposed that the elements of Cn may be considered patternless or random. This is applied: some properties of the L function are derived by using what may be termed the simple normality (in the sense of Borel) of these binary sequences. (January 6, 1966.)
K • SUMMARY OF PREVIOUS WORK
Kolmogorov says (1970): "Further I follow my own work, but more or less similar ideas can be found in the works of Solomonoff and Chaitin; in these works, however, they appear in a somewhat veiled form." Selected Works of A.N. Kolmogorov: Information Theory and the Theory of Algorithms, Volume 3, edited by A. N. Shiryayev, p. 212.
K • INFORMATION 0
Turing (TM, UTM), Shannon (log2 N), Kolmogorov (LA). Preliminary K-definition and reflections on it (Looks & Smiles).
K • MOTIVATION 1
2^100 messages, by Shannon's measure, contain log2(2^100) = 100 bits of information (log2 N)!?
K • MOTIVATION 2
Shannon's entropy ("amount of information"): H = -Σi pi log2(pi), summed over the values i with pi ≠ 0.
K • MOTIVATION 3
Here pi is the probability of occurrence of value i. Using this criterion, the higher the entropy, the more the randomness. For instance, the sequence 00100010 (entropy = 0.81) is less random than 01010101 (entropy = 1). The inadequacy of Shannon's measure of randomness is apparent, because intuitively the former sequence is more random than the latter.
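The two entropy values quoted above are easy to check. The following snippet is not from the slides; it simply computes the empirical symbol entropy -Σ pi log2 pi of each example string.

# Empirical symbol entropy of a string: H = -sum over observed symbols of p_i * log2(p_i).
from collections import Counter
from math import log2

def entropy(s: str) -> float:
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(round(entropy("00100010"), 2))  # 0.81  (six 0s, two 1s)
print(round(entropy("01010101"), 2))  # 1.0   (four 0s, four 1s)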
K • THREE APPROACHES TO DEFINING THE AMOUNT OF INFORMATION* (KOLMOGOROV'S ANALYSIS) 1
*Reprinted in Selected Works of A.N. Kolmogorov: Information Theory and the Theory of Algorithms, Volume 3, edited by A. N. Shiryayev, pp. 184-193.
*Original: Problemy Peredachi Informatsii, 1965, Vol. 1, No. 1, pp. 3-11.
K • THREE APPROACHES TO DEFINING THE AMOUNT OF INFORMATION (KOLMOGOROV'S ANALYSIS) 2
Combinatorial • Probabilistic • Algorithmic
K • K-DEFINITION 1
We start by describing objects with binary strings (Turing). We take the length of the shortest string description of an object as a measure of that object's complexity.
K • K-DEFINITION 2
Let us look at the most traditional application of information theory, communications (Shannon). The sender and receiver each know the specification method L. Message x can be transmitted as y such that L(y) = x. This is written "y : L(y) = x", where the ":" is read as "such that". The cost of transmission is the length of y, |y|. The least cost is min |y| : L(y) = x.
K • K-DEFINITION 3
This minimal |y| is the descriptional complexity of x under specification method L. A universal description method should have the following properties: it should be independent of L, within a constant offset, so that we can compare the complexity of any object with that of any other object; and the description should in principle be performable by either machines or humans.
K • K-DEFINITION 4
Such a method would give us a measure of absolute information content, the amount of data that needs to be transmitted in the absence of any other a priori knowledge. The description method that meets these criteria is the Kolmogorov complexity: the size of the shortest program (in bits) that, without additional data, computes the string and terminates. K(x) = min |p| : U(p) = x, where U is a universal Turing machine (believe!).
K • K-DEFINITION 5
The conditional Kolmogorov complexity of string x given string y is K(x|y) = min |p| : U(p, y) = x. The length of the shortest program that can compute both x and y, together with a way to tell them apart, is K(x, y) = min |p| : U(p) = xy.
K • INFORMATION 1
The Kolmogorov information in string x about string y is I(x|y) = K(y) - K(y|x).
K • INFORMATION 2
The Kolmogorov information of string x is I(x|x) := I(x) = K(x).
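K, K(·|·), and I are uncomputable, so in practice they are often approximated with a real compressor, extending the stand-in used earlier. The sketch below is illustrative only and not from the slides; it uses zlib compressed lengths C(·) as rough proxies, with K(y|x) approximated by C(xy) - C(x), so that I(x|y) = K(y) - K(y|x) is approximated by C(x) + C(y) - C(xy).

# Illustration only: compressor-based proxies for Kolmogorov quantities.
#   K(x)   ~ C(x)            (compressed length)
#   K(y|x) ~ C(xy) - C(x)    (extra bytes needed for y once x is known)
#   I(x|y) = K(y) - K(y|x) ~ C(x) + C(y) - C(xy)
# These are rough, compressor-dependent estimates, not the true (uncomputable) values.
import os
import zlib

def C(s: bytes) -> int:
    return len(zlib.compress(s, 9))

def info(x: bytes, y: bytes) -> int:
    return C(x) + C(y) - C(x + y)

x = b"the quick brown fox jumps over the lazy dog. " * 20
y = x.replace(b"fox", b"cat")      # nearly identical to x
z = os.urandom(len(x))             # unrelated random bytes

print(info(x, y))   # typically large: knowing x leaves little of y left to describe
print(info(x, z))   # typically close to zero: x says (almost) nothing about z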
K • K-PROPERTIES 1
Theorem 1. ∃c ∀x [K(x) ≤ |x| + c].
Theorem 2. ∃c ∀x [K(xx) ≤ K(x) + c].
Theorem 3. ∃c ∀x,y [K(xy) ≤ 2K(x) + K(y) + c].
Theorem 4. For any description language p, there is a fixed constant c, depending only on p, such that ∀x [K(x) ≤ Kp(x) + c].
K • K-PROPERTIES 2
Definition 1. Let x be a string. We say that x is c-compressible if K(x) ≤ |x| - c.
• If x is not c-compressible, we say that x is incompressible by c.
• If x is incompressible by 1, we say that x is incompressible.
Theorem 5. Incompressible strings of every length exist.
Exercise. Prove that at least 2^n - 2^(n-c+1) + 1 strings of length n are incompressible by c.
K • K-PROPERTIES 3
Theorem 6. Let f be a computable property that holds for almost all strings. Then, for any b > 0, the property f is false on only finitely many strings that are incompressible by b.
Theorem 7. For some constant b, for every string x, the minimal description d(x) of x is incompressible by b.
K • EXAMPLE (COM) 1
π is an infinite sequence of seemingly random digits, but it contains only a few bits of information: the size of the short program that can produce the consecutive bits of π forever. Informally, we say the descriptional complexity of π is a constant. Formally, we write K(π) = O(1), which means "K(π) does not grow".
K • EXAMPLE (COM) 2
A truly random string is not significantly compressible; its description length is within a constant offset of its length. Formally, we write K(x) = Θ(|x|), which means "K(x) grows as fast as the length of x".
K • EXAMPLE (INFO) 31
The amount of information contained in the variable x with respect to a related variable y can be defined as follows.
K • EXAMPLE (INFO) 32
The relationship between variables x and y, ranging over the respective sets X and Y, is that not all pairs (x, y) belonging to the Cartesian product X × Y are "possible". From the set of possible pairs U, one can determine, for any a ∈ X, the set Ya of those y for which (a, y) ∈ U. It is natural to define the conditional entropy by the relation H(y|a) = log2 N(Ya), where N(Ya) is the number of elements in the set Ya.
K • EXAMPLE (INFO) 33
The information in x with respect to y is then defined by the formula I(x|y) = H(y) - H(y|x).
K • EXAMPLE (INFO) 35
Table of possible pairs (a "+" marks a possible pair (x, y)):
    y:  1  2  3  4
x = 1:  +  +  +  +
x = 2:  +  -  +  -
x = 3:  -  +  -  -
K • EXAMPLE (INFO) 36
For example, in the case shown in the table, we have I(x = 1|y) = 0, I(x = 2|y) = 1, I(x = 3|y) = 2. It is clear that H(y|x) and I(x|y) are functions of x (while y enters them as a "bound variable").
K • SEMANTIC INFORMATION 1
http://plato.stanford.edu/entries/information-semantic/
K • SEMANTIC INFORMATION 2
http://www.inference.phy.cam.ac.uk/mackay/infotheory/course.html
http://www.scientificamerican.com/article.cfm?id=the-semantic-web
K • SEMANTIC INFORMATION 3
http://www.lecb.ncifcrf.gov/~toms/paper/primer/
Marvin Minsky, Semantic Information Processing, MIT Press, 1969
K • CONCLUSION 1
Combinatorial (C), Probabilistic (P), Algorithmic (A). K-theorem: C → A; P → A.
K • CONCLUSION 2
Algorithmic probability & induction • Prediction • Occam's razor • Distance • Superficiality & sophistication • Part & whole • Total & partial order • Computable description methods
K • REFERENCES
1. Selected Works of A.N. Kolmogorov: Information Theory and the Theory of Algorithms, Volume 3, edited by A. N. Shiryayev, Kluwer Academic Publishers, Vol. 27, 1993.
2. Murray Gell-Mann: What is complexity? Complexity, Vol. 1, No. 1, 1995.
3. Claude Shannon: A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July-October 1948.
4. B. Jack Copeland: The Church-Turing Thesis, Stanford Encyclopedia of Philosophy, 1997-2002, http://plato.stanford.edu/entries/church-turing/
5. Ray Solomonoff: A Preliminary Report on a General Theory of Inductive Inference, Report V-131, Zator Co., Cambridge, MA, Feb. 4, 1960.
6. Gregory J. Chaitin: On the length of programs for computing finite binary sequences, Journal of the ACM 13 (1966), pp. 547-569.
7. Michael Sipser: Introduction to the Theory of Computation, 2nd edition, Thomson.
K