Notes 06 - Tehnički fakultet Rijeka

INFORMATION THEORY
Željko Jeričević, dr. sc.
Zavod za računarstvo, Tehnički fakultet &
Zavod za biologiju i medicinsku genetiku, Medicinski fakultet
51000 Rijeka, Croatia
Phone: (+385) 51-651 594
E-mail: zeljko.jericevic@riteh.hr
http://www.riteh.uniri.hr/~zeljkoj/Zeljko_Jericevic.html
Information theory
From the material covered so far we know that information has to be prepared before it is sent through a channel. This is done by transforming the information into a form whose entropy is close to the maximum, which brings the efficiency of the transmission close to its maximum. It can be achieved by lossless compression, e.g. by arithmetic coding.
The second transformation concerns the reliability of the transmission: the information is translated into a form in which, for a certain type of errors, automatic correction is possible (e.g. by Hamming coding).
Compression (sažimanje)
Entropy coding: the Kraft inequality (used in Huffman & Shannon-Fano coding)
1.4.1 The Kraft inequality
We shall prove the existence of efficient source codes by actually constructing some
codes that are important in applications. However, getting to these results requires some
intermediate steps.
A binary variable-length source code is described as a mapping from the source
alphabet A to a set of finite strings, C from the binary code alphabet, which we always
denote {0, 1}. Since we allow the strings in the code to have different lengths, it is
important that we can carry out the reverse mapping in a unique way. A simple way of
ensuring this property is to use a prefix code, a set of strings chosen in such a way that
no string is also the beginning (prefix) of another string. Thus, when the current string
belongs to C, we know that we have reached the end, and we can start processing the
following symbols as a new code string. In Example 1.5 an example of a simple prefix
code is given.
If $c_i$ is a string in C and $l(c_i)$ its length in binary symbols, the expected length of the source code per source symbol is

$$L(C) = \sum_{i=1}^{N} P(c_i)\, l(c_i).$$
If the set of lengths of the code is $\{l(c_i)\}$, any prefix code must satisfy the following important condition, known as the Kraft inequality:

$$\sum_i 2^{-l(c_i)} \le 1. \qquad (1.10)$$
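A quick numerical check of (1.10) in Python (a minimal sketch; the function name is mine, and the first set of lengths is that of the code from Example 1.5 below):

```python
# Kraft sum for a set of codeword lengths: a value <= 1 is necessary for any
# prefix code (and, more generally, for any uniquely decodable code).
def kraft_sum(lengths):
    return sum(2 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> lengths of {0, 10, 110, 111} are feasible
print(kraft_sum([1, 2, 2, 2]))   # 1.25 -> no prefix code with these lengths exists
```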
The code can be described as a binary search tree: starting from the root, two branches
are labelled 0 and 1, and each node is either a leaf that corresponds to the end of a string,
or a node that can be assumed to have two continuing branches. Let $l_m$ be the maximal length of a string. If a string has length $l(c)$, it follows from the prefix condition that none of the $2^{l_m - l(c)}$ extensions of this string are in the code. Also, two extensions of different code strings are never equal, since this would violate the prefix condition. Thus by summing over all codewords we get

$$\sum_i 2^{l_m - l(c_i)} \le 2^{l_m}$$
and the inequality follows. It may further be proven that any uniquely decodable code
must satisfy (1.10) and that if this is the case there exists a prefix code with the
same set of code lengths. Thus restriction to prefix codes imposes no loss in coding
performance.
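The converse statement above (lengths satisfying (1.10) can always be realised by a prefix code) can be made concrete with one standard method, the canonical construction, which is not spelled out in the text; a minimal Python sketch with a helper name of my own choosing, reproducing the code of Example 1.5:

```python
# Build a prefix code from codeword lengths that satisfy the Kraft inequality:
# assign codewords in order of increasing length, counting upwards and shifting
# left whenever the length grows (canonical construction).
def prefix_code_from_lengths(lengths):
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    code, prev, words = 0, None, []
    for l in sorted(lengths):
        if prev is not None:
            code <<= l - prev               # extend the codeword to the new length
        words.append(format(code, f"0{l}b"))
        code, prev = code + 1, l
    return words

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```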
Example 1.5 (A simple code). The code {0, 10, 110, 111} is a prefix code for an alphabet of four symbols. If the probability distribution of the source is (1/2, 1/4, 1/8, 1/8), the average length of the code strings is 1 × 1/2 + 2 × 1/4 + 3 × 1/8 + 3 × 1/8 = 7/4, which is also the entropy of the source.
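The equality of average length and entropy claimed in the example is easy to verify numerically (nothing assumed beyond the numbers above):

```python
from math import log2

probs = [1/2, 1/4, 1/8, 1/8]    # source distribution of Example 1.5
lengths = [1, 2, 3, 3]          # lengths of the code {0, 10, 110, 111}

H = -sum(p * log2(p) for p in probs)              # source entropy
L = sum(p * l for p, l in zip(probs, lengths))    # average code length
print(H, L)                                       # 1.75 1.75
```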
If all the numbers $-\log P(c_i)$ were integers, we could choose these as the lengths $l(c_i)$. In this way the Kraft inequality would be satisfied with equality, and furthermore

$$L = \sum_i P(c_i)\, l(c_i) = -\sum_i P(c_i)\log P(c_i) = H(X)$$
and thus the expected code length would equal the entropy. Such a case is shown in
Example 1.5. However, in general we have to select code strings that only approximate
the optimal values. If we round $-\log P(c_i)$ to the nearest larger integer, $\lceil -\log P(c_i) \rceil$, the lengths satisfy the Kraft inequality, and by summing we get an upper bound on the code lengths

$$l(c_i) = \lceil -\log P(c_i) \rceil \le -\log P(c_i) + 1. \qquad (1.11)$$
The difference between the entropy and the average code length may be evaluated
from
$$H(X) - L = \sum_i P(c_i)\bigl(-\log P(c_i) - l_i\bigr) = \sum_i P(c_i)\log\frac{2^{-l_i}}{P(c_i)} \le \log\sum_i 2^{-l_i} \le 0,$$

where the inequalities are those established by Jensen and Kraft, respectively. This gives

$$H(X) \le L \le H(X) + 1, \qquad (1.12)$$
where the right-hand side is given by taking the average of (1.11).
The loss due to the integer rounding may give a disappointing result when the coding is
done on single source symbols. However, if we apply the result to strings of N symbols,
we find an expected code length of at most NH + 1, and the result per source symbol
becomes at most H + 1/N. Thus, for sources with independent symbols, we can get an
expected code length close to the entropy by encoding sufficiently long strings of source
symbols.
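A small Python sketch of this rounding argument: choosing $l_i = \lceil -\log_2 P(c_i) \rceil$ keeps the Kraft sum at or below 1 and puts the expected length between H(X) and H(X) + 1. The example distribution is the one used later for the arithmetic-coding example; the function name is mine.

```python
from math import ceil, log2

def shannon_lengths(probs):
    # l_i = ceil(-log2 p_i); these lengths always satisfy the Kraft inequality.
    return [ceil(-log2(p)) for p in probs]

probs = [0.6, 0.2, 0.1, 0.1]
lengths = shannon_lengths(probs)

kraft = sum(2 ** -l for l in lengths)
H = -sum(p * log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))

print(lengths, kraft)   # [1, 3, 4, 4] 0.75        (Kraft inequality holds)
print(H, L, H + 1)      # about 1.571, 2.0, 2.571  (H <= L <= H + 1)
```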
Arithmetic coding
• Suppose we want to send a message composed of 3 letters: A, B & C, each occurring with roughly equal probability
• Using 2 bits per symbol is inefficient: one of the bit combinations would never be used.
• A better idea is to use real numbers between 0 & 1 written in base 3, where each digit represents one symbol.
• For example, the sequence ABBCAB becomes 0.011201 (with A=0, B=1, C=2)
• Converting the real number 0.011201 from base 3 to binary gives 0.001011001
• Using 2 bits per symbol requires 12 bits for the sequence ABBCAB, while the binary representation of 0.011201 (base 3) requires 9 bits, a saving of 25%.
• The method relies on efficient "in place" algorithms for converting from one base to another (a small sketch follows below)
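A minimal Python sketch of the idea (the digit assignment A=0, B=1, C=2 is taken from the slide above; the conversion is a plain fractional base change, not a full arithmetic coder, and the helper name is mine):

```python
def fraction_to_bits(num, den, nbits):
    """Binary expansion of the fraction num/den (0 <= num/den < 1), nbits bits."""
    bits = ""
    for _ in range(nbits):
        num *= 2
        if num >= den:
            bits += "1"
            num -= den
        else:
            bits += "0"
    return bits

# ABBCAB with A=0, B=1, C=2 gives the base-3 fraction 0.011201
digits = [0, 1, 1, 2, 0, 1]
num, den = 0, 1
for d in digits:                        # value = sum_k d_k * 3^(-k)
    num, den = num * 3 + d, den * 3

print(fraction_to_bits(num, den, 9))    # 001011001 -> 0.001011001 in binary
```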
Fast conversion from one base to another
• The Linux/Unix bc program
• Examples (command and output):
  echo "ibase=2; 0.1" | bc
  .5
  echo "ibase=3; 0.1000000" | bc
  .3333333
  echo "ibase=3; obase=2; 0.011201" | bc
  .00101100100110010001
  echo "ibase=2; obase=3; .001011001" | bc
  .0112002011101011210, which rounds back to .011201
Arithmetic decoding
• With arithmetic coding we can achieve a result close to the optimum (the optimum being –log2 p bits for each symbol of probability p).
• An example with four symbols, the arithmetic code 0.538 and the following probability distribution (D marks the end of the message):

  Symbol:        A     B     C     D
  Probability:   0.6   0.2   0.1   0.1
The arithmetic code of the sequence is 0.538 (ACD)
• Step one: divide the initial interval [0, 1) into sub-intervals proportional to the probabilities:

  Symbol:     A          B            C            D
  Interval:   [0, 0.6)   [0.6, 0.8)   [0.8, 0.9)   [0.9, 1)

• 0.538 falls in the first interval (symbol A)
• Step two: divide the interval [0, 0.6) chosen in step one into sub-intervals proportional to the probabilities:

  Symbol:     A           B              C              D
  Interval:   [0, 0.36)   [0.36, 0.48)   [0.48, 0.54)   [0.54, 0.6)

• 0.538 falls in the third sub-interval (symbol C)
• Step three: divide the interval [0.48, 0.54) chosen in step two into sub-intervals proportional to the probabilities:

  Symbol:     A               B                C                D
  Interval:   [0.48, 0.516)   [0.516, 0.528)   [0.528, 0.534)   [0.534, 0.54)

• 0.538 falls in the fourth sub-interval (symbol D, which is also the end-of-message symbol); a decoder sketch follows below
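The three steps can be summarised as a small decoder. This is only a sketch that mirrors the slides' procedure with floating-point intervals (a practical arithmetic coder works with rescaled integer intervals); the function name and the data layout are mine.

```python
# Decode an arithmetic code by repeatedly subdividing the current interval,
# exactly as in the three steps above (D is the end-of-message symbol).
model = [("A", 0.6), ("B", 0.2), ("C", 0.1), ("D", 0.1)]

def decode(value, model, eom="D"):
    low, high, message = 0.0, 1.0, ""
    while True:
        width = high - low
        cum = low
        for sym, p in model:
            if cum <= value < cum + p * width:    # value lies in this sub-interval
                message += sym
                low, high = cum, cum + p * width
                break
            cum += p * width
        if message.endswith(eom):
            return message

print(decode(0.538, model))   # ACD
```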
[Figure: graphical illustration of the arithmetic decoding of 0.538 (ACD)]
• (Non-)uniqueness: the same sequence could equally well have been represented as 0.534, 0.535, 0.536, 0.537 or 0.539. Using decimal instead of binary digits introduces inefficiency.
• The information content of three decimal digits is about 9.966 bits (why?)
• The same message can be encoded in binary as 0.10001010, which corresponds to 0.5390625 in decimal and requires 8 bits.
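Both of these statements are easy to check (a two-line verification, nothing assumed beyond the numbers above):

```python
# 0.10001010 in binary: its decimal value, and whether it falls inside the
# final interval [0.534, 0.54) reached in the decoding example above.
bits = "10001010"
value = sum(int(b) * 2 ** -(k + 1) for k, b in enumerate(bits))
print(value)                    # 0.5390625
print(0.534 <= value < 0.54)    # True
```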
Eight bits is more than the actual entropy of the message (1.58 bits), because the message is short and the assumed distribution is wrong. If the actual distribution of the symbols in the message is taken into account, the message can be encoded using the following intervals: [0, 1/3); [1/9, 2/9); [5/27, 6/27); and the binary interval [1011110, 1110001).
The result of the encoding is the message 111, i.e. 3 bits.
Correct message statistics are crucial for coding efficiency!
[Figure: arithmetic coding, iterative decoding of a message]
[Figure: arithmetic coding, iterative encoding of a message]
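A minimal sketch of such iterative encoding, using the same four-symbol model as the decoding example above (the function name and interval table are mine); any number inside the returned interval, e.g. 0.538 for ACD, identifies the message:

```python
# Iteratively narrow the interval while encoding; the mirror image of the
# decoding steps shown earlier. For "ACD" the final interval is [0.534, 0.54).
model = {"A": (0.0, 0.6), "B": (0.6, 0.8), "C": (0.8, 0.9), "D": (0.9, 1.0)}

def encode(message, model):
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        s_low, s_high = model[sym]
        low, high = low + s_low * width, low + s_high * width
    return low, high

print(encode("ACD", model))   # approximately (0.534, 0.54)
```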
[Figure: arithmetic coding, two symbols with occurrence probabilities px = 2/3 & py = 1/3]
[Figure: arithmetic coding, three symbols with occurrence probabilities px = 2/3 & py = 1/3]