Shannon-Fano Coding, Huffman Coding and Adaptive Huffman Coding
Homework #1 Report
Multimedia Data Compression
EE669 2013Spring
Name: Shu Xiong
ID: 3432757160
Email: shuxiong@usc.edu
Contents
Problem 1: Writing Questions
    Huffman Coding
    Lempel-Ziv Coding
    Arithmetic Coding
Problem 2: Entropy Coding
    Abstract and Motivation
    Approach and Procedures
    Results
    Discussion
Problem 3: Run-Length Coding
    Abstract and Motivation
    Approach and Procedures
    Results
    Discussion
Problem 1: Writing Questions
Huffman Coding
Lempel-Ziv Coding
Arithmetic Coding
Problem 2: Entropy Coding
Abstract and Motivation
Entropy coding is an important step in multimedia data compression. It is a lossless compression scheme that transforms symbols into binary codes. It assigns each symbol a unique prefix-free code whose length depends on the probability of the corresponding symbol: a symbol with a larger probability receives a shorter code. [1] In this section, I implement three well-known entropy coding methods, Shannon-Fano coding, Huffman coding and adaptive Huffman coding, and discuss their performance.
Approach and Procedures
Shannon-Fano Coding:
Shannon-Fano coding assigns a code to each symbol based on its probability. It produces variable-length codes with unique prefixes. The algorithm is as follows:
1. Compute the global statistics of the input data, including the weight of each symbol, the total number of occurrences and the probability of each symbol.
2. Sort the symbols by probability: symbols with larger probabilities are placed on the left and those with smaller probabilities on the right.
3. Divide the symbols into two parts so that the total weight of one part is as close as possible to that of the other.
4. Append digit 0 to the codes of the symbols on the left and digit 1 to the codes of the symbols on the right.
5. Recursively apply steps 3 and 4 until every symbol becomes a leaf of the binary tree.
To calculate the entropy, we use the equation
$H = -\sum_{k=1}^{L} P(S_k) \log_2 P(S_k)$   (1)
where $P(S_k)$ is the probability of symbol $S_k$.
To calculate the bit average of the encoding result, we use the equation
$B_{avg} = \sum_{k=1}^{L} P(S_k) L(S_k)$   (2)
where $L(S_k)$ is the code length of symbol $S_k$.
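As a quick check of equations (1) and (2), the following minimal Python sketch computes both quantities from a byte string and a table of code lengths; the function name entropy_and_bit_average and its interface are illustrative choices of mine, not part of the graded implementation.

import math
from collections import Counter

def entropy_and_bit_average(data, code_lengths):
    # data: a bytes object read from the input file.
    # code_lengths: dict mapping each symbol (byte value) to its code length in bits.
    counts = Counter(data)
    total = len(data)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())  # equation (1)
    bit_avg = sum((c / total) * code_lengths[s] for s, c in counts.items())      # equation (2)
    return entropy, bit_avg

The compression redundancy reported in the result tables below is then simply the bit average minus the entropy.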
The key step in the implementation is dividing the symbols into two groups of nearly equal total weight. I use a function divideTree() that recursively finds the split position and processes each half until it reaches the leaves of the tree.
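For illustration, the sketch below shows one way such a recursive split can be written in Python. It assumes the symbols are already sorted by weight in decreasing order; the names best_split_index and shannon_fano are mine and only approximate the role of divideTree().

def best_split_index(weights):
    # Find the split point that makes the two halves' total weights as equal as possible.
    total = sum(weights)
    running, best_i, best_diff = 0, 1, float("inf")
    for i in range(1, len(weights)):
        running += weights[i - 1]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_i, best_diff = i, diff
    return best_i

def shannon_fano(symbols, weights, prefix="", codes=None):
    # symbols and weights are parallel lists sorted by weight in decreasing order.
    if codes is None:
        codes = {}
    if len(symbols) == 1:
        codes[symbols[0]] = prefix or "0"
        return codes
    i = best_split_index(weights)
    shannon_fano(symbols[:i], weights[:i], prefix + "0", codes)   # left part gets digit 0
    shannon_fano(symbols[i:], weights[i:], prefix + "1", codes)   # right part gets digit 1
    return codes

For example, shannon_fano(['a', 'b', 'c', 'd'], [15, 7, 6, 5]) returns {'a': '0', 'b': '10', 'c': '110', 'd': '111'}.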
Huffman Coding:
Huffman coding is developed from Shannon-Fano coding, and it is also a lossless compression scheme based on the frequency of each symbol. The algorithm is as follows:
1. Compute the global statistics of the input data, including the weight of each symbol, the total number of occurrences and the probability of each symbol. Create a leaf node for every symbol.
2. Find the two nodes with the smallest probabilities and add a new node that represents the combination of the two.
3. Set the new node as the parent of these two nodes and drop the two nodes from further processing.
4. Repeat steps 2 and 3 until we reach the root of the binary tree.
5. Assign codes from the root to the leaves of the binary tree: a node that is the left child of its parent receives digit 0 and a right child receives digit 1. When we reach the leaves, the code for each symbol is complete.
In my implementation, I use a Huffman tree structure to store the information about each symbol, including its weight, left child, right child, parent, code length, probability and entropy.
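For comparison, the following compact Python sketch builds the same kind of tree bottom-up with a min-heap and returns only the resulting code lengths, which is all that equation (2) needs; it is an illustrative rewrite, not the tree structure described above, and the name huffman_code_lengths is my own.

import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(data):
    # Build a Huffman tree bottom-up and return the code length of each symbol.
    freq = Counter(data)
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    # Each heap entry is (weight, tiebreak, list of symbols in this subtree).
    heap = [(w, next(tiebreak), [s]) for s, w in freq.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freq}
    if len(heap) == 1:                      # degenerate case: only one distinct symbol
        lengths[next(iter(freq))] = 1
        return lengths
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)     # the two nodes with the smallest weights (step 2)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                   # every symbol under the new parent gets one more bit
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))  # new parent node (step 3)
    return lengths

Combining these lengths with the symbol probabilities in equation (2) reproduces the bit averages reported in the tables below.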
Adaptive Huffman Coding:
Adaptive Huffman coding is developed from Huffman coding. It is a one-pass algorithm that does not need the global statistics of the input data and can generate the Huffman codes on the fly. The adaptive Huffman coding algorithm that I have implemented was proposed by Vitter. [2] Several key concepts underlie this algorithm.
Implicit numbering: the nodes of the tree are divided into levels by their weights. Nodes with larger weights sit at higher levels than nodes with smaller weights, and nodes in the same level are numbered from left to right in increasing order.
Blocks: nodes that have the same weight and are either both leaves or both internal nodes belong to the same block. The highest-numbered node in a block is its leader.
NYT (Not Yet Transmitted): a special node used to introduce incoming new symbols; its weight is always zero.
One important operation in this algorithm is updating the tree according to the incoming symbol. The pseudocode for this operation is as follows:
Procedure Update;
Begin
    If (the symbol is new) then
        Add two nodes as children of the current NYT node: the right child is the
        new symbol's leaf and the left child is the new NYT node.
        Traverse the tree from the top level down and check whether any node has
        the same weight as the current symbol's node.
        If (such a node is found && it is not the parent of the current node) then
            Swap the positions of the two nodes;
        End
        Update the weights of the nodes on the path from the current symbol to the root;
        Encode the symbol using the current tree;
    End
    If (the symbol has been seen before) then
        Traverse the tree from the top level down and check whether any node has
        the same weight as the current symbol's node.
        If (such a node is found && it is not the parent of the current node) then
            Swap the positions of the two nodes;
        End
        Update the weights of the nodes on the path from the current symbol to the root;
        Encode the symbol using the current tree;
    End
End
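To make the "new symbol" branch concrete, here is a small Python sketch of just the NYT split and the weight propagation along the path to the root. It is a partial illustration using my own names (Node, split_nyt); the swap with the block leader that maintains the sibling property (the "node found" branch above) is deliberately omitted.

class Node:
    # One node of the adaptive Huffman tree.
    def __init__(self, symbol=None, weight=0, parent=None):
        self.symbol = symbol        # None for internal nodes and for the NYT node
        self.weight = weight
        self.parent = parent
        self.left = None
        self.right = None

def split_nyt(nyt, symbol):
    # First occurrence of `symbol`: the NYT node spawns a new NYT node (left child)
    # and a leaf for the symbol (right child), then weights are propagated upward.
    new_nyt = Node(parent=nyt)
    leaf = Node(symbol=symbol, weight=1, parent=nyt)
    nyt.left, nyt.right = new_nyt, leaf
    node = nyt
    while node is not None:         # increase the weight of every node on the path to the root
        node.weight += 1
        node = node.parent
    return new_nyt, leaf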
Results
The equation I use for the compression ratio is
$\text{ratio} = \frac{\text{original file size} - \text{compressed file size}}{\text{original file size}} \times 100\%$   (3)
Shannon-Fano Coding:
Audio.dat:
Table 1 Audio.dat Shannon-Fano Result
Entropy: 6.45594 bits/symbol
Bit Average: 6.5132 bits/symbol
Compression Redundancy: 0.05726 bits/symbol
Original File Size: 65536 bytes
Compressed File Size: 53357 bytes
Compression Ratio: 18.85%
Binary.dat:
Table 2 Binary.dat Shannon-Fano Result
Entropy: 0.183244 bits/symbol
Bit Average: 1 bit/symbol
Compression Redundancy: 0.816756 bits/symbol
Original File Size: 65536 bytes
Compressed File Size: 8192 bytes
Compression Ratio: 87.5%
Text.dat:
Table 3 Text.dat Shannon-Fano Result
Entropy: 4.3962 bits/symbol
Bit Average: 4.4263 bits/symbol
Compression Redundancy: 0.0301 bits/symbol
Original File Size: 8358 bytes
Compressed File Size: 4625 bytes
Compression Ratio: 44.67%
Image.dat:
Table 4 Image.dat Shannon-Fano Result
Entropy: 7.59311 bits/symbol
Bit Average: 7.64249 bits/symbol
Compression Redundancy: 0.04938 bits/symbol
Original File Size: 65536 bytes
Compressed File Size: 62608 bytes
Compression Ratio: 4.67%
Huffman Coding
Audio.dat:
Table 5 Audio.dat Huffman Coding Result
Entropy: 6.45594 bits/symbol
Bit Average: 6.48994 bits/symbol
Compression Redundancy: 0.034 bits/symbol
Original File Size: 65536 bytes
Compressed File Size: 53166 bytes
Compression Ratio: 18.88%
Binary.dat:
Table 6 Binary.dat Huffman Coding Result
Entropy: 0.183244 bits/symbol
Bit Average: 1 bit/symbol
Compression Redundancy: 0.816756 bits/symbol
Original File Size: 65536 bytes
Compressed File Size: 8192 bytes
Compression Ratio: 87.5%
Text.dat:
Table 7 Text.dat Huffman Coding Result
Entropy: 4.3962 bits/symbol
Bit Average: 4.42067 bits/symbol
Compression Redundancy: 0.02447 bits/symbol
Original File Size: 8358 bytes
Compressed File Size: 4619 bytes
Compression Ratio: 44.74%
Image.dat:
Table 8 Image.dat Huffman Coding Result
Entropy: 7.59311 bits/symbol
Bit Average: 7.62111 bits/symbol
Compression Redundancy: 0.028 bits/symbol
Original File Size: 65536 bytes
Compressed File Size: 62433 bytes
Compression Ratio: 4.73%
Adaptive Huffman Coding
Audio.dat:
Table 9 Audio.dat Adaptive Huffman Coding Result
Original File Size: 65536 bytes
Compressed File Size: 53159 bytes
Compression Ratio: 18.89%
Binary.dat:
Table 10 Binary.dat Adaptive Huffman Coding Result
Original File Size: 65536 bytes
Compressed File Size: 8420 bytes
Compression Ratio: 87.15%
Text.dat:
Table 11 Text.dat Adaptive Huffman Coding Result
Original File Size: 8358 bytes
Compressed File Size: 4620 bytes
Compression Ratio: 44.74%
Image.dat:
Table 12 Image.dat Adaptive Huffman Coding Result
Original File Size: 65536 bytes
Compressed File Size: 62434 bytes
Compression Ratio: 4.73%
Discussion
For Shanno-Fano coding, from above tables we can see that the compression of the
four kinds of files does not reach their theoretical bounds. All the bit averages are
larger than the entropies. In compression ratio, we can see that it achieves best
compression ratio in binary.dat file, 87.5%. However, it doesn’t compress the
binary.dat actually: the symbol 0 is encoded as 0 and the symbol 1 is encoded as 1.
The reduce of file size is because I use 1 bit to store one symbol 0 or 1 while the
original file uses 1 byte. So the compressed file size is 1/8 of the original file size. The
smallest compression ratio is in image.dat file, 4.67%. The compression ratio
depends on the input data. The compression ratio is also related to the entropy,
which is the theoretical bound for compression. If the entropy is large, the
compression ratio would be small. And if the entropy is small, the compression ratio
would be large.
For Huffman coding, the tables above show that the compression of the four files again does not reach the theoretical bound, but the compression redundancy is smaller than with Shannon-Fano coding. The best compression ratio is on binary.dat, 87.5%, although the file is not really compressed, for the same reason stated for Shannon-Fano coding. The smallest compression ratio is on image.dat, 4.73%. Compared with the Shannon-Fano results, Huffman coding achieves better compression on audio.dat, image.dat and text.dat, and the same result on binary.dat. Although Huffman coding achieves better results, the improvement is not large.
Adaptive Huffman coding achieves the same compression ratio as Huffman coding on image.dat and text.dat, a slightly better result on audio.dat and a slightly worse result on binary.dat. The loss on binary.dat occurs because adaptive Huffman coding may use two bits to represent a symbol while static Huffman coding uses only one bit when compressing the binary file. Whether it does slightly better or slightly worse depends on the input data; overall, it achieves compression results similar to Huffman coding.
Huffman coding is an improvement over Shannon-Fano coding. Shannon-Fano coding builds the code from the root toward the leaves, while Huffman coding builds it from the leaves toward the root. This difference makes Huffman coding slightly more efficient than Shannon-Fano coding. In the best case Shannon-Fano coding may produce the same result as Huffman coding, but it can never outperform Huffman coding.
The biggest advantage of adaptive Huffman coding over Huffman coding is that it does not need the global statistics of the input data, so it can be applied to real-time compression. The Huffman coding algorithm is a two-pass procedure that needs the global statistics to build the Huffman tree, whereas adaptive Huffman coding is a one-pass algorithm that builds the Huffman tree on the fly.
Problem 3: Run-Length Coding
Abstract and Motivation
Run-length coding (RLC) is a common encoding scheme whose basic idea is to encode a sequence of repeated symbols as the number of repeats followed by the symbol. Besides the basic scheme, there are modified variants such as the MSB-flag scheme, and some coding methods, such as Golomb coding, build on the same idea. [3] In this section, I implement these three RLC methods and discuss their performance.
Approach and Procedures
Basic Scheme:
The output follows the pattern count + symbol. In my implementation, I use an array fileData[] to store the bytes read from the input file and an array codeArray[] to store the encoding result. I count the number of consecutive repeats of a symbol in fileData and store the count and the symbol in codeArray. For example, 03 03 04 03 is encoded as 02 03 01 04 01 03. The same pattern is used to decode the compressed file. One special case is a run of more than 255 repeated symbols, which is divided into several parts for encoding. For example, a run of 513 zeros is encoded as FF 00 FF 00 03 00.
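A minimal Python sketch of this basic scheme is shown below; the function name rle_encode_basic is mine, and the byte handling differs from my actual codeArray[]-based implementation.

def rle_encode_basic(data):
    # Basic run-length scheme: each run is emitted as (count, symbol),
    # splitting runs longer than 255 into several (count, symbol) pairs.
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out.append(run)
        out.append(data[i])
        i += run
    return bytes(out)

For example, rle_encode_basic(bytes([0x03, 0x03, 0x04, 0x03])) returns the bytes 02 03 01 04 01 03, and a run of 513 zeros produces FF 00 FF 00 03 00, matching the examples above.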
Modified Scheme:
The modified scheme extends the basic scheme with an MSB flag that indicates whether a code byte represents a repeat count or a symbol. It uses the most significant bit as the indicator: if the encoded byte represents a repeat count, its MSB is set to 1. For example, if the number of repeats is 6, the count byte is 1000 0110, which is 86 in hexadecimal, or 134 in decimal. If a symbol occurs only once, it is encoded directly without a count byte. One special case is a run of more than 127 repeated symbols, which is divided into several parts for encoding; for example, a run of 134 zeros is encoded as FF 00 87 00. The other special case is a single symbol whose own MSB is set to 1: a count byte must be added before it to avoid confusion between counts and symbols. For example, the single symbol 86 is encoded as 81 86.
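The encoder side of this modified scheme can be sketched in Python as follows, under the same caveats (illustrative name rle_encode_msb, simplified byte handling):

def rle_encode_msb(data):
    # Modified scheme: a byte with its MSB set (0x80 | count) is a repeat count,
    # any other byte is a literal symbol. Runs are capped at 127; a single
    # occurrence of a symbol whose own MSB is set is escaped as 0x81 followed by the symbol.
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 127:
            run += 1
        if run == 1 and data[i] < 0x80:
            out.append(data[i])            # plain literal, no count byte needed
        else:
            out.append(0x80 | run)         # count byte with the MSB flag set
            out.append(data[i])
        i += run
    return bytes(out)

With this sketch, a run of 134 zeros yields FF 00 87 00 and the single symbol 86 yields 81 86, as in the examples above.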
Golomb Coding:
This method is designed to encode binary input files. We choose a tunable base M for Golomb coding, where M is a power of two, $M = 2^m$. Suppose the run length of a symbol is N; it can be written as
$N = qM + r$, where $q = \lfloor N/M \rfloor$ and $r = N \bmod M$   (4)
The number N is then encoded as q 0's, followed by a single 1, followed by r written in m bits.   (5)
For example, if N = 66 and M = 32, then q = 2 and r = 2, and N is encoded as 00100010.
To determine M, we use the equation
$M = \left\lceil \frac{-1}{\log_2(1-p)} \right\rceil$   (6)
where p is the probability of symbol 1 in the input data. [4]
In my implementation, since the given file contains long runs of 0s, I encode the lengths of the runs of zeros that are delimited by 1s. I choose M = 64, that is, m = 6.
The same scheme applies to the decoding process: the decoder recovers each run length according to equation (4) and inserts a single 1 after each run of zeros.
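As an illustration of equations (4) and (5), the sketch below Golomb-encodes the zero-run lengths of a 0/1 sequence as a bit string. The name golomb_rice_encode_runs is mine, the output is left as a string of '0'/'1' characters rather than packed bytes, and the handling of a trailing run with no terminating 1 is a simplifying assumption.

def golomb_rice_encode_runs(bits, m=6):
    # Encode the lengths of the runs of zeros (delimited by 1s) with M = 2**m:
    # each run length N is written as q zeros, a single 1, and r in m bits (eqs. 4-5).
    M = 1 << m
    out = []
    run = 0
    for b in bits:
        if b == 0:
            run += 1
        else:
            q, r = run // M, run % M
            out.append("0" * q + "1" + format(r, "0{}b".format(m)))
            run = 0
    if run:                                # trailing run of zeros with no terminating 1
        q, r = run // M, run % M
        out.append("0" * q + "1" + format(r, "0{}b".format(m)))
    return "".join(out)

For a run of 66 zeros followed by a 1 and m = 5 (M = 32), the function returns 00100010, matching the worked example above.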
I implemented both the encoding and decoding processes for all three methods, and decoding each compressed file reproduces the original data, which verifies the correctness of the implementation.
Results
Basic Scheme
Audio.dat:
Table 13 Audio.dat Basic Scheme RLC Compression Result
Original File Size: 65536 bytes
Compressed File Size: 108534 bytes
Compression Ratio: -65.61%
Decompressed File Size: 65536 bytes
Binary.dat:
Table 14 Binary.dat Basic Scheme RLC Compression Result
Original File Size: 65536 bytes
Compressed File Size: 4780 bytes
Compression Ratio: 92.71%
Decompressed File Size: 65536 bytes
Text.dat:
Table 15 Text.dat Basic Scheme RLC Compression Result
Original File Size: 8358 bytes
Compressed File Size: 16400 bytes
Compression Ratio: -96.22%
Decompressed File Size: 8358 bytes
Image.dat:
Table 16 Image.dat Basic Scheme RLC Compression Result
Original File Size: 65536 bytes
Compressed File Size: 124320 bytes
Compression Ratio: -89.70%
Decompressed File Size: 65536 bytes
Modified Scheme
Audio.dat:
Table 17 Audio.dat Modified Scheme RLC Compression Result
Original File Size: 65536 bytes
Compressed File Size: 85205 bytes
Compression Ratio: -30.01%
Decompressed File Size: 65536 bytes
Binary.dat:
Table 18 Binary.dat Modified Scheme RLC Compression Result
Original File Size: 65536 bytes
Compressed File Size: 4446 bytes
Compression Ratio: 93.22%
Decompressed File Size: 65536 bytes
Text.dat:
Table 19 Text.dat Modified Scheme RLC Compression Result
Original File Size: 8358 bytes
Compressed File Size: 8343 bytes
Compression Ratio: 0.18%
Decompressed File Size: 8358 bytes
Image.dat:
Table 20 Image.dat Modified Scheme RLC Compression Result
Original File Size: 65536 bytes
Compressed File Size: 82766 bytes
Compression Ratio: -26.29%
Decompressed File Size: 65536 bytes
Golomb Coding:
Table 21 Golomb.dat Golomb Compression Result
Original File Size: 1250 bytes
Compressed File Size: 109 bytes
Compression Ratio: 91.28%
Decompressed File Size: 1250 bytes
Discussion
For basic scheme RLC, we can see that only binary.dat has a good compression result:
the compression ratio is 92.71%. It expands the sizes of other files. This is because
basic scheme RLC is only good for data with sequences of repeat symbols. In
binary.dat file, there are many sequences of 0s and 1s so that basic scheme RLC can
encode the sequences into number+symbol pattern to save more bits. However, in
the other three files, there are few sequences of repeated symbols. The single
symbol is encoded as 1+symbol, adding more bits than the original data. For example,
data 02 03 04 05 is encoded as 01 02 01 03 01 04 01 05. This is why the other three
files got expanded sizes after the compression.
For the modified scheme RLC, binary.dat again has the best compression ratio, 93.22%. Text.dat is compressed slightly below its original size, with a compression ratio of 0.18%, while image.dat and audio.dat are still expanded. Compared with the basic scheme, the modified scheme compresses better: the compression ratio of binary.dat is higher, text.dat goes from being expanded to being slightly compressed, and image.dat and audio.dat are expanded less. The reason is that a single symbol whose MSB is not set does not need a count byte, which saves bits relative to the basic scheme. However, the modified scheme is still poorly suited to files with few runs of repeated symbols, since any symbol whose MSB is set must be encoded with an extra count byte to avoid confusion between counts and symbols.
Golomb coding achieves a good compression ratio, 91.28%. It is well suited to binary files in which one symbol occurs far more often than the other. It is based on a probability model and is a useful development of RLC. However, Golomb coding is not good at encoding binary files in which both symbols are equally probable; for example, applying it to a stream like 010101010101010101 would expand the file instead of compressing it. When applying Golomb coding, choosing appropriate parameters M and m is also important, since a poor choice reduces the compression. Using formula (6) ensures that the tunable parameter M is suitable for the input data.
References
[1] Ihara, Shunsuke (1993). Information Theory for Continuous Systems. World Scientific, p. 2. ISBN 978-981-02-0985-8.
[2] Vitter, Jeffrey S. (1987). Design and Analysis of Dynamic Huffman Codes. Journal of the Association for Computing Machinery, 34(4), 825-845.
[3] Nelson, Mark R. (1991). The Data Compression Book. M&T Books, Redwood City, CA.
[4] Golomb, S. W. (1966). Run-length encodings. IEEE Transactions on Information Theory, IT-12(3), 399-401.