I(x)
A DEFINITION OF INFORMATION
by
Sergei Levashkin
sergei@levashkin.com
www.levashkin.com
As part of Adolfo Guzmán Arenas's course "Teoría de la Computación" (Theory of Computation)
CIC-IPN, April 2010
MAIN CHARACTERS
Alan Turing
1912-1954
Alonzo Church
1903-1995
Claude Shannon
1916-2001
Andrei Kolmogorov
1903-1987
Ray Solomonoff (1926-2009) and Gregory Chaitin (1947-)
CONTENTS
• Andrei Nikolaevich Kolmogorov: his life and work
• Information theory (Kolmogorov complexity = algorithmic complexity = descriptive complexity)
• Selected works
• Background and related work
  1. Probability axioms
  2. "A Mathematical Theory of Communication" by Claude Elwood Shannon
  3. The Turing-Church thesis
  4. Kolmogorov machines
  5. Kolmogorov machines versus Turing machines
  6. Solomonoff's algorithmic probability and inductive inference
  7. Chaitin's work
• Information: previous definitions
• Motivation
• Three approaches to defining the amount of information (Kolmogorov's analysis)
• Definition of Kolmogorov complexity
• Properties of Kolmogorov complexity
• Definition of conditional Kolmogorov complexity
• Properties of conditional Kolmogorov complexity
• Examples and applications
• Semantic information
• Conclusions and future work
Andrei Nikolaevich Kolmogorov
April 25, 1903, Tambov, Russian Empire - October 20, 1987, Moscow, USSR
K
• LIFE AND WORK
• INFORMATION THEORY
K
• LIFE AND WORK
K
• LIFE
K
Native House
K
Relatives & Aunt
• Classical Education:
French, History
K
Youth
1920s
• Fourier Analysis
• Theory of Functions
• Topology
Kolmogorov-Alexandrov
K
Vitae
Since 1922 I taught mathematics at a Moscow secondary school and, in parallel, studied at the university. In 1922 I began my independent research under the supervision of V.V. Stepanov and N.N. Luzin. By the time I graduated I had around fifteen journal publications on the theory of functions of real variables...
K
Middle Age
1930s
• Foundations of the
Theory of Probability
• Logic
• Member of Academy
at age 36
K
Komarovka
• Now-legendary place: sports, rest
K
Reads
1940s
• Defense Works:
Statistical Theory of
Bombing & Shooting
• Second World War
• Politics
K
Wife
• She was his high-school friend
K
Writes
1950s
• Dynamic Systems
• 13th Hilbert Problem
• Statistics
• Theory of Algorithms
K
Moscow State University
Address:
Leninskie Gory, MGU
Sector "L"
Apartment 10
Moscow
USSR
K
Thinks
1960s
• Textbooks
• Turbulence
• Linguistic Analysis of
Poetry & Fiction
• Theory of Information
K
Looks
1970s
• ICM Congress in Nice: Discrete versus Continuous Math
• Numerous Awards
• Main Soviet
Mathematician
K
Smiles
• Scientific Childhood:
Kolmogorov’s Enigma
K
Rest
• Skiing, Swimming, Sailing, Walking, Rowing...
• Classical Music
(TOPAZ)
K
Lecture 1
• Lecturer World-Wide
K
Lecture 2
• Difficult to understand
K
Kolmogorov's School
Tutor
Raised several outstanding scientists in Math, Probability, Statistics, Logic, Informatics, Physics, Oceanography, Atmospheric Science, etc.
K
Kolmogorov Students
From all over the Soviet Union
Kolmogorov's High School in Physics & Math No. 18
Founded on December 2, 1963
K
Awards
• Seven Orders of
Lenin, the Order of the
October Revolution,
and also the high title
of Hero of Socialist
Labor; Lenin Prizes
and State Prizes.
K
Awards
• The Royal Netherlands Academy of Sciences (1963), the Royal Society of London (1964), the US National Academy of Sciences (1967), the Paris Academy of Sciences (1968), the Polish Academy of Sciences, the Romanian Academy of Sciences (1956), the German Academy of Sciences Leopoldina (1959), and the American Academy of Arts and Sciences in Boston (1959).
K
You have dreamed for a second...
now, down to work!
Awards
• Honorary doctorates from the universities of Paris, Berlin, Warsaw, Stockholm, etc.
• Honorary member of the Moscow, London, Indian, and Calcutta Mathematical Societies, of the London Royal Statistical Society, the International Statistical Institute, and the American Meteorological Society.
• International Balzan Prize (1963).
K
80!
1980s
• Health Problems (Parkinson's disease)
K
Elder Age
Died in Moscow on
October 20, 1987†
K
Novodevichy Cemetery
K
• WORK
K
• INFORMATION THEORY
K
• SELECTED WORKS
K
• Selected Works of A.N. Kolmogorov: Mathematics and Mechanics, Volume 1, edited by V. M. Tikhomirov
• Selected Works of A.N. Kolmogorov: Probability Theory and Mathematical Statistics, Volume 2, edited by A. N. Shiryayev
• Selected Works of A.N. Kolmogorov: Information Theory and the Theory of Algorithms, Volume 3, edited by A. N. Shiryayev
K
• BACKGROUND AND RELATED WORK 0
Probability axioms, 1930
• First axiom
The probability of an event is a non-negative real number:
P(E) ∈ R, P(E) ≥ 0 for all E ∈ F,
where F is the event space and E is any event in F.
K
• BACKGROUND AND RELATED WORK 1
Probability axioms, 1930
• Second axiom
This is the assumption of unit measure: the probability that some elementary event in the entire sample space will occur is 1. More specifically, there are no elementary events outside the sample space:
P(Ω) = 1.
This is often overlooked in some mistaken probability calculations; if you cannot precisely define the whole sample space, then the probability of any subset cannot be defined either.
K
• BACKGROUND AND RELATED WORK 2
Probability axioms, 1930
• Third axiom
This is the assumption of σ-additivity: any countable sequence of pairwise disjoint events E1, E2, ... satisfies
P(E1 ∪ E2 ∪ ...) = Σ_i P(Ei).
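As a small aside (not on the slides): for a finite sample space the three axioms can be checked mechanically. A minimal Python sketch, using a made-up fair-die example:

from itertools import combinations
from fractions import Fraction

# Toy finite probability space: a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}
p = {w: Fraction(1, 6) for w in omega}          # elementary probabilities

def P(event):
    """Probability measure induced by the elementary probabilities."""
    return sum(p[w] for w in event)

# Axiom 1: P(E) >= 0 for every event E in the event space 2^omega.
events = [set(c) for r in range(len(omega) + 1) for c in combinations(omega, r)]
assert all(P(E) >= 0 for E in events)

# Axiom 2: P(omega) = 1.
assert P(omega) == 1

# Axiom 3 (finite additivity, the finite case of sigma-additivity):
# for disjoint events, P(E1 ∪ E2) = P(E1) + P(E2).
E1, E2 = {1, 2}, {5}
assert P(E1 | E2) == P(E1) + P(E2)

print("All three Kolmogorov axioms hold for this finite space.")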
K
• BACKGROUND AND RELATED WORK 3
Claude Shannon, 1948
“A Mathematical Theory of
Communication”, The Bell
System Technical Journal, Vol.
27, pp. 379–423, 623–656, July,
October, 1948.
“Claude Shannon - Father of
the Information Age”. Video:
http://www.ucsd.tv/searchdetails.aspx?showID=6090
Kolmogorov's analysis
K
• BACKGROUND AND RELATED WORK 4
Turing-Church Thesis, 1930-1952
Informally, the Church–Turing thesis states that if an algorithm (a procedure that terminates) exists, then there is an equivalent Turing machine, general recursive function, or λ-definable function for that algorithm.
A simpler but still informal statement is that "everything computable is computable by a Turing machine." Though not formally proven, today the thesis has near-universal acceptance.
K
• BACKGROUND AND RELATED WORK 5
Kolmogorov machine, 1953
Kolmogorov machines are similar to Turing machines except that the tape can change its topology.
The tape is a finite connected graph with a distinguished (active) node. The graph is directed but symmetric: if there is an edge from u to v then there is an edge from v to u. The edges are colored in such a way that the edges coming out of any one node have different colors. Thus, every path from the active node is characterized by a string of colors. The number of colors is bounded (for each machine).
The neighborhood of the active node of some fixed (for every machine) radius is called the active zone. (One may suppose that the radius always equals 2.) For each isomorphism type of the active zone, the program specifies a sequence of instructions of the following forms:
K
• BACKGROUND AND RELATED WORK 6
Kolmogorov machine, 1953
1. add a new node together with a pair of edges of some colors between the active node and the new one,
2. remove a node and the edges incident to it,
3. add a pair of edges of some colors between two existing nodes,
4. remove the two edges between two existing nodes,
5. halt.
Executing the whole instruction sequence creates the next configuration of the tape.
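To make this concrete, here is a minimal Python sketch (my own illustration, not from the slides) of the tape as a data structure: a symmetric colored graph with an active node and the four editing operations above. All class and method names are invented for illustration.

class KolmogorovTape:
    """Finite symmetric graph with colored edges and a distinguished active node."""

    def __init__(self, active):
        self.active = active
        self.edges = {}          # node -> {color: neighbor}

    def _link(self, u, v, color_uv, color_vu):
        # Edges come in symmetric pairs; colors leaving one node must differ.
        assert color_uv not in self.edges.setdefault(u, {})
        assert color_vu not in self.edges.setdefault(v, {})
        self.edges[u][color_uv] = v
        self.edges[v][color_vu] = u

    def add_node(self, new, color_out, color_back):
        """Instruction 1: add a new node joined to the active node by a pair of colored edges."""
        self._link(self.active, new, color_out, color_back)

    def remove_node(self, node):
        """Instruction 2: remove a node and the edges incident to it."""
        for nbr in list(self.edges.pop(node, {}).values()):
            self.edges[nbr] = {c: n for c, n in self.edges[nbr].items() if n != node}

    def add_edges(self, u, v, color_uv, color_vu):
        """Instruction 3: add a pair of edges of some colors between two existing nodes."""
        self._link(u, v, color_uv, color_vu)

    def remove_edges(self, u, v):
        """Instruction 4: remove the pair of edges between u and v."""
        self.edges[u] = {c: n for c, n in self.edges[u].items() if n != v}
        self.edges[v] = {c: n for c, n in self.edges[v].items() if n != u}

# A path from the active node is read off as a string of colors:
tape = KolmogorovTape(active="a")
tape.add_node("b", color_out="red", color_back="blue")
tape.add_node("c", color_out="green", color_back="blue")
print(tape.edges["a"])   # {'red': 'b', 'green': 'c'}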
K
• BACKGROUND AND RELATED WORK 7
Kolmogorov vs. Turing machines
K machines are more powerful than T machines: there exists a function real-time computable by some Kolmogorov machine but not real-time computable by any Turing machine.
The following description of real-time computability is appropriate for both Turing and Kolmogorov machines: a real-time algorithm inputs a symbol, then outputs a symbol within a fixed number c of steps, then inputs a symbol, then outputs a symbol within c steps, and so on.
K
• BACKGROUND AND RELATED WORK 8
Solomonoff’s algorithmic probability and
inductive inference, 1960-1964
Algorithmic probability quantifies the
idea of theories and predictions with
reference to short programs and their
output: Take a universal computer and
randomly generate an input program.
The program will compute some
possibly infinite output.
The algorithmic probability of any
given finite output prefix q is the sum
of the probabilities of the programs that
compute something starting with q.
Certain long objects with short
programs have high probability.
K
• BACKGROUND AND RELATED WORK 9
Solomonoff’s algorithmic probability and
inductive inference, 1960-1964
Algorithmic "Solomonoff" Probability
(AP) assigns to objects an a priori
probability that is in some sense
universal. This prior distribution has
theoretical applications in a number
of areas, including inductive
inference theory and the time
complexity analysis of algorithms. Its
main drawback is that it is not
computable and thus can only be
approximated in practice.
Consider an unknown process producing a binary string of one hundred 1s. The probability that such a string is the result of a uniform random process, such as fair coin flips, is just 2^-100, like that of any other string of the same length. Intuitively, we feel that there is a difference between a string that we can recognize and distinguish and the vast majority of strings that are indistinguishable to us. In fact, we feel that a far more likely explanation is that some simple deterministic algorithmic process has generated this string.
K
• BACKGROUND AND RELATED WORK 10
Solomonoff’s algorithmic probability and
inductive inference, 1960-1964
This observation was already made by P.S. Laplace in about 1800, but the
techniques were lacking for that great
mathematician to quantify this insight.
Presently we do this using the lucky
confluence of combinatorics, information
theory and theory of algorithms. It is clear
that, in a world with computable processes,
patterns which result from simple
processes are relatively likely, while
patterns that can only be produced by very
complex processes are relatively unlikely.
Formally, a computable process that
produces a string x is a program p that
when executed on a universal Turing
machine U produces the string x as output.
As p is itself a binary string, we can define the discrete universal a priori probability, m(x), to be the probability that the output of a universal prefix Turing machine U is x when provided with fair coin flips on the input tape. Formally,
m(x) = Σ_{p : U(p) = x} 2^(-|p|),
where the sum is over all halting programs p for which U outputs the string x. That the sum is bounded (so that m is indeed a probability) hinges on the fact that a prefix machine is one with only {0,1} input symbols and no end-of-input marker of any kind, so no halting program is a prefix of another.
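A toy illustration (entirely mine, not from the slides): take a deliberately simple prefix machine U whose programs are 1^k 0 and whose outputs are repetitions of "01", and approximate m(x) by summing 2^(-|p|) over programs up to a length bound. A real universal prefix machine is assumed by the theory; this toy only shows how the sum is formed.

def U(program):
    """Toy prefix machine: reads 1s until the first 0, then halts.
    The program 1^k 0 outputs '01' repeated k times; anything else diverges."""
    k = 0
    for bit in program:
        if bit == "1":
            k += 1
        elif bit == "0":
            return "01" * k
    return None  # never saw the terminating 0: treated as non-halting

def m_approx(x, max_len=40):
    """Truncated sum of 2^-|p| over halting programs whose output starts with x."""
    total = 0.0
    for k in range(max_len):
        p = "1" * k + "0"
        out = U(p)
        if out is not None and out.startswith(x):
            total += 2.0 ** -len(p)
    return total

# The regular string '0101' gets noticeable a priori weight,
# while '0110' is not produced by this toy machine at all.
print(m_approx("0101"))   # ≈ 0.25  (programs 110, 1110, 11110, ...)
print(m_approx("0110"))   # 0.0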
K
• BACKGROUND AND RELATED WORK 11
Solomonoff’s algorithmic probability and
inductive inference, 1960-1964
It is easy to see that this distribution is strongly related to Kolmogorov complexity, in that m(x) is at least the maximum term in the summation, which is 2^(-K(x)).
The Coding Theorem of L.A. Levin (1974) states that equality also holds in the other direction, in the sense that m(x) = Θ(2^(-K(x))), i.e., K(x) = -log2 m(x) + O(1).
K
• BACKGROUND AND RELATED WORK 12
Solomonoff’s algorithmic probability and inductive
inference, 1960-1964
Algorithmic probability is the main ingredient of Ray
Solomonoff's theory of inductive inference, the
theory of prediction based on observations. Given a
sequence of symbols, which will come next?
Solomonoff's theory provides an answer that is
optimal in a certain sense, although it is
incomputable. Unlike, for example, Karl Popper's
informal inductive inference theory, however,
Solomonoff's is mathematically rigorous.
Algorithmic probability is closely related to the
concept of Kolmogorov complexity.
The Kolmogorov complexity of any computable
object is the length of the shortest program that
computes it and then halts. The invariance
theorem shows that it is not really important which
computer we use.
Solomonoff's enumerable measure is universal in a
certain powerful sense, but it ignores computation
time.
K
• BACKGROUND AND RELATED WORK 13
Chaitin, circa 1966
Chaitin's early work on
algorithmic information
theory paralleled the
earlier work
of Kolmogorov.
Ref. http://en.wikipedia.org/wiki/Gregory_Chaitin
K
• BACKGROUND AND RELATED WORK 14
Chaitin, circa 1966
Define L(S) for any finite binary sequence S by: a Turing machine with n internal states can be programmed to compute S if and only if n ≥ L(S). Define L(Cn) by L(Cn) = max L(S), where S ranges over the binary sequences of length n, and let Cn be the set of all binary sequences of length n satisfying L(S) = L(Cn). (October 19, 1965)
There it is proposed that elements of Cn may be considered patternless or random. This is applied; some properties of the L function are derived by using what may be termed the simple normality (in Borel's sense) of these binary sequences. (January 6, 1966)
K
• SUMMARY OF PREVIOUS WORK
Kolmogorov says (1970):
Further I follow my own work, but
more or less similar ideas can be
found in the works of Solomonoff
and Chaitin; in these works,
however, they appear in a
somewhat veiled form.
Selected Works of
A.N.Kolmogorov : Information
Theory and the Theory of
Algorithms, Volume 3 Edited by
A. N. Shiryayev p. 212
K
• INFORMATION 0
Turing (TM, UTM)
Shannon (log2N)
Kolmogorov (LA)
Preliminary K-definition &
Reflections on it (Looks &
Smiles)
K
• MOTIVATION 1
2^100 messages, according to Shannon, contain
log2 2^100 = 100 bits
of info (log2 N)!?
K
• MOTIVATION 2
Shannon's entropy ("amount of information")
H = − Σ_{i : pi ≠ 0} pi log2(pi)
K
• MOTIVATION 3
where pi is the probability of occurrence of value i. By this criterion, the higher the entropy, the greater the randomness. For instance, the sequence
00100010
(entropy ≈ 0.81) is less random than
01010101
(entropy = 1). The inadequacy of Shannon's measure of randomness is apparent, because intuitively the former sequence is more random than the latter.
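A quick check of the two entropy values above (my own illustration): compute the per-symbol empirical entropy of each sequence in Python.

from collections import Counter
from math import log2

def empirical_entropy(s):
    """Per-symbol Shannon entropy -sum p_i log2 p_i of the symbol frequencies in s."""
    counts = Counter(s)
    return -sum((c / len(s)) * log2(c / len(s)) for c in counts.values())

print(round(empirical_entropy("00100010"), 2))  # 0.81
print(round(empirical_entropy("01010101"), 2))  # 1.0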
K
• THREE APPROACHES TO DEFINING THE AMOUNT OF INFORMATION* (KOLMOGOROV'S ANALYSIS) 1
*Reprint in Selected Works of A.N. Kolmogorov: Information Theory and the Theory of Algorithms, Volume 3, edited by A. N. Shiryayev, pp. 184-193
*Original: Problemy Peredachi Informatsii, 1965, Vol. 1, No. 1, pages 3-11
K
• THREE APPROACHES TO DEFINING THE AMOUNT OF INFORMATION (KOLMOGOROV'S ANALYSIS) 2
Combinatorial
Probabilistic
Algorithmic
K
• K-DEFINITION 1
We start by describing
objects with binary
strings (Turing). We
take the length of the
shortest string
description of an object
as a measure of that
object's complexity.
K
• K-DEFINITION 2
Let's look at the most traditional
application of information theory,
communications (Shannon). The
sender and receiver each know
specification method L. Message x
can be transmitted as y, such that
L(y)=x. This is written "y : L(y)=x"
and the ":" is read as "such that".
The cost of transmission is the length of y, |y|. The least cost is min{|y| : L(y) = x}.
K
• K-DEFINITION 3
This minimal |y| is the descriptional complexity of x under specification method L. A universal description method should have the following properties:
• it should be independent of L, within a constant offset, so that we can compare the complexity of any object with any other object's complexity;
• the description should in principle be performable by either machines or humans.
K
• K-DEFINITION 4
Such a method would give us a measure of absolute information content, the amount of data which needs to be transmitted in the absence of any other a priori knowledge. The description method that meets these criteria is the Kolmogorov complexity: the size of the shortest program (in bits) that, without additional data, computes the string and terminates.
K(x) = min{|p| : U(p) = x}
where U is a universal Turing machine (believe!).
K
• K-DEFINITION 5
The conditional Kolmogorov complexity of string x given string y is
K(x|y) = min{|p| : U(p, y) = x}.
The length of the shortest program that can compute both x and y and a way to tell them apart is
K(x,y) = min{|p| : U(p) = xy}.
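Since K and K(·|·) are not computable, one common computable stand-in (my own illustration, in the spirit of compression-based approximations, not something the slides propose) replaces the shortest program by a compressor's output and estimates the conditional quantity as C(yx) − C(y):

import os
import zlib

def C(data: bytes) -> int:
    """Compressed length in bits: a crude, computable upper-bound stand-in for K."""
    return 8 * len(zlib.compress(data, 9))

def cond_C(x: bytes, y: bytes) -> int:
    """Heuristic estimate of the conditional quantity: extra bits for x once y is known."""
    return max(C(y + x) - C(y), 0)

x = os.urandom(500)                  # a random-looking string
print(cond_C(x, x))                  # small: knowing y = x makes x almost free
print(cond_C(x, os.urandom(500)))    # large (≈ C(x)): an unrelated y does not help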
K
• INFORMATION 1
The Kolmogorov information in string x with respect to string y is
I(x|y) = K(y) − K(y|x)
K
• INFORMATION 2
The Kolmogorov information of string x is
I(x|x) := I(x) = K(x)
K
• K-PROPERTIES 1
Theorem 1. ∃c ∀x [K(x) ≤ |x| + c].
Theorem 2. ∃c ∀x [K(xx) ≤ K(x) + c].
Theorem 3. ∃c ∀x,y [K(xy) ≤ 2K(x) + K(y) + c].
Theorem 4. For any description language p, there exists a fixed constant c, depending only on p, such that ∀x [K(x) ≤ Kp(x) + c].
K
• K-PROPERTIES 2
Definition 1. Let x be a string. Say that x is c-compressible if K(x) ≤ |x| − c.
• If x is not c-compressible, we say that x is incompressible by c.
• If x is incompressible by 1, we say that x is incompressible.
Theorem 5. Incompressible strings of every length exist.
Exercise. Prove that at least 2^n − 2^(n−c+1) + 1 strings of length n are incompressible by c. (See the sketch below.)
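A sketch of the standard counting argument behind the exercise (not spelled out on the slides): every string of length n that is c-compressible has a description of length at most n − c. The number of binary descriptions of length at most n − c is
2^0 + 2^1 + ... + 2^(n−c) = 2^(n−c+1) − 1,
and each description describes at most one string. Hence at most 2^(n−c+1) − 1 of the 2^n strings of length n are c-compressible, so at least
2^n − (2^(n−c+1) − 1) = 2^n − 2^(n−c+1) + 1
of them are incompressible by c.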
K
• K-PROPERTIES 3
Theorem 6. Let f be a computable property that holds for almost all strings. Then, for any b > 0, the property f is FALSE on only finitely many strings that are incompressible by b.
Theorem 7. For some constant b, for every string x, the minimal description d(x) of x is incompressible by b.
K
• EXAMPLE (COM)1
π is an infinite sequence of seemingly random digits, but it contains only a few bits of information: the size of the short program that can produce the consecutive bits of π forever. Informally we say the descriptional complexity of π is a constant. Formally we say K(π) = O(1), which means "K(π) does not grow".
K
• EXAMPLE (COM)2
A truly random string is not significantly compressible; its description length is within a constant offset of its length. Formally we say K(x) = Θ(|x|), which means "K(x) grows as fast as the length of x".
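These two examples can be mimicked, roughly, with an off-the-shelf compressor: the compressed size is only an upper bound on K(x) (up to the fixed size of the decompressor), but it already separates highly regular strings from random-looking ones. A small Python sketch (my own illustration, not from the slides):

import os
import zlib

def k_upper_bound(data: bytes) -> int:
    """Length in bits of a zlib description of `data`: a crude upper bound on K(data)."""
    return 8 * len(zlib.compress(data, 9))

regular = b"01" * 500                 # highly regular: "print '01' 500 times" is a short program
random_like = os.urandom(1000)        # incompressible with overwhelming probability

print(k_upper_bound(regular), "bits for the regular string of", 8 * len(regular), "bits")
print(k_upper_bound(random_like), "bits for the random string of", 8 * len(random_like), "bits")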
K
• EXAMPLE (INFO)31
The amount of information
contained in the variable x with
respect to a related variable y
can be defined as follows.
K
• EXAMPLE (INFO)32
The relationship between variables x and y ranging over the respective sets X and Y is that not all pairs (x,y) belonging to the Cartesian product X × Y are "possible". From the set of possible pairs U, one can determine, for any a ∈ X, the set Ya of those y for which (a,y) ∈ U.
It is natural to define the conditional entropy by the relation
H(y|a) = log2 N(Ya),
where N(Ya) is the number of elements in the set Ya.
K
• EXAMPLE (INFO)33
It is natural to define the conditional entropy by the relation
H(y|a) = log2 N(Ya),
where N(Ya) is the number of elements in the set Ya, and the information in x with respect to y by the formula
I(x|y) = H(y) − H(y|x).
K
• EXAMPLE (INFO)35
x \ y   1   2   3   4
  1     +   +   +   +
  2     +   -   +   -
  3     -   +   -   -
K
• EXAMPLE (INFO)36
For example, in the case shown in the table, we have
I(x=1|y) = 0, I(x=2|y) = 1, I(x=3|y) = 2.
It is clear that H(y|x) and I(x|y) are functions of x (while y enters them as a "bound variable").
K
• SEMANTIC INFORMATION 1
http://plato.stanford.edu/entries/information-semantic/
K
• SEMANTIC INFORMATION 2
http://www.inference.phy.cam.ac.uk/mackay/infotheory/course.html
http://www.scientificamerican.com/article.cfm?id=the-semantic-web
K
• SEMANTIC INFORMATION 3
http://www.lecb.ncifcrf.gov/~toms/paper/primer/
Marvin Minsky, Semantic Information Processing, MIT Press, 1969
K
• CONCLUSION 1
Combinatorial (C)
Probabilistic (P)
Algorithmic (A)
K-theorem: C → A; P → A
K
• CONCLUSION 2
Algorithmic Probability &
Induction
Prediction
Occam’s Razor
Distance
Superficiality & Sophistication
Part & Whole
Total & Partial Order
Computable description methods
K
• REFERENCES
1. Selected Works of A.N. Kolmogorov: Information Theory and the Theory of Algorithms, Volume 3, edited by A. N. Shiryayev, Kluwer Academic Publishers, Vol. 27, 1993.
2. Murray Gell-Mann: What is complexity? Complexity, Vol. 1, No. 1, 1995.
3. Claude Shannon: A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July-October, 1948.
4. B. Jack Copeland: The Church-Turing Thesis, Stanford Encyclopedia of Philosophy, 1997-2002, http://plato.stanford.edu/entries/church-turing/
5. Ray Solomonoff: A Preliminary Report on a General Theory of Inductive Inference, Report V-131, Zator Co., Cambridge, Ma., Feb 4, 1960.
6. Gregory J. Chaitin: On the length of programs for computing finite binary sequences, Journal of the ACM 13 (1966), pp. 547-569.
7. Michael Sipser: Introduction to the Theory of Computation, 2nd edition, Thomson.
K