Shannon-Fano coding, Huffman and Adaptive Huffman coding
Homework #1 Report
Multimedia Data Compression, EE669, Spring 2013
Name: Shu Xiong    ID: 3432757160    Email: shuxiong@usc.edu

Content
Problem 1: Writing Questions
    Huffman Coding
    Lempel-Ziv Coding
    Arithmetic Coding
Problem 2: Entropy Coding
    Abstract and Motivation
    Approach and Procedures
    Results
    Discussion
Problem 3: Run-Length Coding
    Abstract and Motivation
    Approach and Procedures
    Results
    Discussion

Problem 1: Writing Questions

Huffman Coding

Lempel-Ziv Coding

Arithmetic Coding

Problem 2: Entropy Coding

Abstract and Motivation

Entropy coding is an important step in multimedia data compression. It is a lossless compression scheme which transforms symbols into binary codes. It assigns each symbol a unique prefix-free code whose length depends on the probability of the corresponding symbol: a symbol with a larger probability is assigned a shorter code. [1] In this section, I implement three well-known entropy coding methods, Shannon-Fano coding, Huffman coding and adaptive Huffman coding, and discuss their performance.

Approach and Procedures

Shannon-Fano Coding:
Shannon-Fano coding assigns a code to each symbol based on its probability. It produces variable-length codes with unique prefixes. The algorithm is as follows (a code sketch of the recursive division step is given after the list):
1. Calculate the global statistics of the input data, including the weight of each symbol, the total number of occurrences and the probability of each symbol.
2. Sort the symbols by their probabilities: symbols with larger probabilities are placed on the left and symbols with smaller probabilities on the right.
3. Divide the symbols into two parts so that the sum of the weights in one part is as close as possible to the sum in the other part.
4. Append digit 0 to the codes of the symbols in the left part and digit 1 to the codes of the symbols in the right part.
5. Recursively perform steps 3 and 4 until each symbol becomes a leaf of the binary tree.
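The recursive division step can be sketched as follows. This is a minimal sketch, not the report's divideTree() function; the SymbolEntry layout, the function name shannonFanoDivide() and the exact split criterion are illustrative assumptions.

#include <string>
#include <vector>

struct SymbolEntry {
    unsigned char symbol;
    long weight;        // number of occurrences of the symbol
    std::string code;   // Shannon-Fano code being built for the symbol
};

// Recursively split entries[lo..hi] (sorted by descending weight) into two
// parts of nearly equal total weight, appending '0' to the codes of the left
// part and '1' to the codes of the right part.
void shannonFanoDivide(std::vector<SymbolEntry>& entries, int lo, int hi) {
    if (lo >= hi) return;                      // a single symbol is a leaf

    long total = 0;
    for (int i = lo; i <= hi; ++i) total += entries[i].weight;

    // Find the split position that makes the two halves as equal as possible.
    long leftSum = 0, bestDiff = total;
    int split = lo;
    for (int i = lo; i < hi; ++i) {
        leftSum += entries[i].weight;
        long right = total - leftSum;
        long diff = (leftSum > right) ? (leftSum - right) : (right - leftSum);
        if (diff < bestDiff) { bestDiff = diff; split = i; }
    }

    for (int i = lo; i <= split; ++i) entries[i].code += '0';
    for (int i = split + 1; i <= hi; ++i) entries[i].code += '1';

    shannonFanoDivide(entries, lo, split);      // left part
    shannonFanoDivide(entries, split + 1, hi);  // right part
}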
To calculate the entropy, we use the equation

    H = -Σ_{k=1}^{L} P(S_k) · log2 P(S_k)        (1)

where P(S_k) is the probability of symbol S_k and L is the number of distinct symbols. To calculate the bit average of the encoding result, we use the equation

    B_avg = Σ_{k=1}^{L} P(S_k) · L(S_k)        (2)

where L(S_k) is the code length of symbol S_k.

The key step in the implementation is how to divide the symbols into groups of nearly equal probability. I use a function divideTree() that recursively finds the position that divides the symbols most evenly, until the leaves of the tree are reached.

Huffman Coding:
Huffman coding was developed from Shannon-Fano coding. It is also a lossless compression scheme based on the frequency of each symbol. The algorithm is as follows (a code sketch of the tree construction follows this subsection):
1. Calculate the global statistics of the input data, including the weight of each symbol, the total number of occurrences and the probability of each symbol. Create one tree node per symbol.
2. Find the two nodes with the smallest probabilities and add a new node that represents the combination of the two.
3. Set the new node as the parent of these two nodes and drop the two nodes from further processing.
4. Recursively perform steps 2 and 3 until we reach the root of the binary tree.
5. Encode from the root to the leaves of the binary tree: assign digit 0 to a node that is the left child of its parent and digit 1 to a node that is the right child. When we reach a leaf, the code for that symbol is complete.

In my implementation, I use a Huffman tree structure to store the information about each symbol, including its weight, left child, right child, parent, code length, probability and entropy contribution.
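The tree construction of steps 2-4 and the calculations of equations (1) and (2) can be sketched as follows. This is a minimal sketch assuming a pointer-based node and a priority queue; the node layout and the name buildHuffmanCodes() are illustrative assumptions, not the report's Huffman tree structure.

#include <cmath>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node {
    long weight;
    int symbol;     // -1 for internal nodes
    Node* left;
    Node* right;
};

struct ByWeight {
    bool operator()(const Node* a, const Node* b) const { return a->weight > b->weight; }
};

// Walk the finished tree and record one prefix-free code per leaf
// ('0' for a left branch, '1' for a right branch).
void collectCodes(const Node* n, const std::string& prefix,
                  std::map<int, std::string>& codes) {
    if (!n) return;
    if (!n->left && !n->right) { codes[n->symbol] = prefix; return; }
    collectCodes(n->left, prefix + '0', codes);
    collectCodes(n->right, prefix + '1', codes);
}

// weights[s] = number of occurrences of symbol s in the input file.
std::map<int, std::string> buildHuffmanCodes(const std::map<int, long>& weights,
                                             double& entropy, double& bitAverage) {
    std::priority_queue<Node*, std::vector<Node*>, ByWeight> pq;
    long total = 0;
    for (const auto& kv : weights) {
        pq.push(new Node{kv.second, kv.first, nullptr, nullptr});
        total += kv.second;
    }
    // Repeatedly merge the two lowest-weight nodes until one root remains.
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->weight + b->weight, -1, a, b});
    }

    std::map<int, std::string> codes;
    collectCodes(pq.top(), "", codes);

    // Equations (1) and (2): entropy and bit average of the resulting codes.
    entropy = 0.0;
    bitAverage = 0.0;
    for (const auto& kv : weights) {
        double p = static_cast<double>(kv.second) / total;
        entropy -= p * std::log2(p);
        bitAverage += p * codes[kv.first].size();
    }
    return codes;  // tree nodes are intentionally leaked in this sketch
}

A priority queue is used here only for clarity; as described above, the report's implementation keeps the per-symbol fields (weight, children, parent, code length) in its Huffman tree structure instead.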
Adaptive Huffman Coding:
Adaptive Huffman coding is developed from Huffman coding. It is a one-pass algorithm that does not need the global statistics of the input data; it generates the Huffman codes on the fly. The adaptive Huffman coding algorithm I have implemented was proposed by Vitter. [2] There are several key concepts in this algorithm.

Implicit numbering: the nodes of the tree are divided into levels according to their weights. Nodes with larger weights are at higher levels than nodes with smaller weights, and nodes within the same level are numbered from left to right in increasing order.

Blocks: nodes that have the same weight and are either both leaves or both internal nodes belong to the same block. The highest-numbered node in a block is called its leader.

NYT (Not Yet Transmitted): a special node used to represent incoming new symbols; its weight is always zero.

One important operation in this algorithm is updating the tree according to the incoming symbol. The pseudocode for this operation is as follows (a code sketch of the block-leader swap is given after the pseudocode):

Procedure Update;
Begin
    If (it is a new symbol) then
        Add two nodes as children of the current NYT node: the right child holds the new symbol, and the left child becomes the new NYT;
        Traverse the tree from the top level down to find a node with the same weight as the current symbol's node;
        If (such a node is found && it is not the parent of the current symbol's node) then
            Swap the positions of the two nodes;
        End
        Update the weights of the nodes on the path from the current symbol's node to the root;
        Encode the symbol using the current tree;
    End
    If (it is an old symbol) then
        Traverse the tree from the top level down to find a node with the same weight as the current symbol's node;
        If (such a node is found && it is not the parent of the current symbol's node) then
            Swap the positions of the two nodes;
        End
        Update the weights of the nodes on the path from the current symbol's node to the root;
        Encode the symbol using the current tree;
    End
End
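The block-leader search and node swap used in the update procedure can be sketched as follows. The AhNode layout, the flat list of nodes and the function names are illustrative assumptions rather than the report's implementation.

#include <utility>
#include <vector>

struct AhNode {
    long weight;       // current weight of the node
    int order;         // implicit node number; larger means higher in the ordering
    AhNode* parent;
    AhNode* left;
    AhNode* right;
};

// Return the block leader: the node with the same weight as n that has the
// largest implicit number (n itself is a candidate).
AhNode* blockLeader(const std::vector<AhNode*>& allNodes, const AhNode* n) {
    AhNode* leader = nullptr;
    for (AhNode* cand : allNodes)
        if (cand->weight == n->weight && (leader == nullptr || cand->order > leader->order))
            leader = cand;
    return leader;
}

// Swap two nodes together with their subtrees by exchanging the links from
// their parents and exchanging their implicit numbers. The caller is expected
// to skip the swap when the leader is the node itself or its parent.
void swapNodes(AhNode* a, AhNode* b) {
    if (a == b || a->parent == nullptr || b->parent == nullptr) return;
    AhNode*& slotA = (a->parent->left == a) ? a->parent->left : a->parent->right;
    AhNode*& slotB = (b->parent->left == b) ? b->parent->left : b->parent->right;
    std::swap(slotA, slotB);
    std::swap(a->parent, b->parent);
    std::swap(a->order, b->order);
}

Consistent with the pseudocode above, each node on the path from the current symbol's leaf to the root would be compared with its block leader, swapped when the leader is neither the node itself nor its parent, and then have its weight incremented.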
Results

The equation I use to calculate the compression ratio is

    ratio = (original file size - compressed file size) / original file size × 100%        (3)

Shannon-Fano Coding:

Table 1: Audio.dat Shannon-Fano result
    Entropy                   6.45594 bits/symbol
    Bit Average               6.5132 bits/symbol
    Compression Redundancy    0.05726 bits/symbol
    Original File Size        65536 bytes
    Compressed File Size      53357 bytes
    Compression Ratio         18.85%

Table 2: Binary.dat Shannon-Fano result
    Entropy                   0.183244 bits/symbol
    Bit Average               1 bit/symbol
    Compression Redundancy    0.816756 bits/symbol
    Original File Size        65536 bytes
    Compressed File Size      8192 bytes
    Compression Ratio         87.5%

Table 3: Text.dat Shannon-Fano result
    Entropy                   4.3962 bits/symbol
    Bit Average               4.4263 bits/symbol
    Compression Redundancy    0.0301 bits/symbol
    Original File Size        8358 bytes
    Compressed File Size      4625 bytes
    Compression Ratio         44.67%

Table 4: Image.dat Shannon-Fano result
    Entropy                   7.59311 bits/symbol
    Bit Average               7.64249 bits/symbol
    Compression Redundancy    0.04938 bits/symbol
    Original File Size        65536 bytes
    Compressed File Size      62608 bytes
    Compression Ratio         4.67%

Huffman Coding:

Table 5: Audio.dat Huffman coding result
    Entropy                   6.45594 bits/symbol
    Bit Average               6.48994 bits/symbol
    Compression Redundancy    0.034 bits/symbol
    Original File Size        65536 bytes
    Compressed File Size      53166 bytes
    Compression Ratio         18.88%

Table 6: Binary.dat Huffman coding result
    Entropy                   0.183244 bits/symbol
    Bit Average               1 bit/symbol
    Compression Redundancy    0.816756 bits/symbol
    Original File Size        65536 bytes
    Compressed File Size      8192 bytes
    Compression Ratio         87.5%

Table 7: Text.dat Huffman coding result
    Entropy                   4.3962 bits/symbol
    Bit Average               4.42067 bits/symbol
    Compression Redundancy    0.02447 bits/symbol
    Original File Size        8358 bytes
    Compressed File Size      4619 bytes
    Compression Ratio         44.74%

Table 8: Image.dat Huffman coding result
    Entropy                   7.59311 bits/symbol
    Bit Average               7.62111 bits/symbol
    Compression Redundancy    0.028 bits/symbol
    Original File Size        65536 bytes
    Compressed File Size      62433 bytes
    Compression Ratio         4.73%

Adaptive Huffman Coding:

Table 9: Audio.dat adaptive Huffman coding result
    Original File Size        65536 bytes
    Compressed File Size      53159 bytes
    Compression Ratio         18.89%

Table 10: Binary.dat adaptive Huffman coding result
    Original File Size        65536 bytes
    Compressed File Size      8420 bytes
    Compression Ratio         87.15%

Table 11: Text.dat adaptive Huffman coding result
    Original File Size        8358 bytes
    Compressed File Size      4620 bytes
    Compression Ratio         44.74%

Table 12: Image.dat adaptive Huffman coding result
    Original File Size        65536 bytes
    Compressed File Size      62434 bytes
    Compression Ratio         4.73%

Discussion

For Shannon-Fano coding, the tables above show that the compression of the four files does not reach the theoretical bound: all the bit averages are larger than the entropies. The best compression ratio is achieved on binary.dat, 87.5%. However, binary.dat is not actually compressed: the symbol 0 is encoded as 0 and the symbol 1 is encoded as 1. The reduction in file size comes from storing each symbol in 1 bit while the original file uses 1 byte per symbol, so the compressed file is 1/8 of the original size.

The smallest compression ratio is on image.dat, 4.67%. The compression ratio depends on the input data and is related to the entropy, which is the theoretical bound for compression: if the entropy is large, the compression ratio is small, and if the entropy is small, the compression ratio is large.

For Huffman coding, the tables above show that the compression of the four files again does not reach the theoretical bound, but the compression redundancy is smaller than with Shannon-Fano coding. The best compression ratio is on binary.dat, 87.5%, although the file is not actually compressed, for the same reason stated for Shannon-Fano coding. The smallest compression ratio is on image.dat, 4.73%. Compared with Shannon-Fano coding, Huffman coding achieves better compression on audio.dat, image.dat and text.dat, and the same result on binary.dat. Although Huffman coding achieves better results, the improvement is not significantly large.

Adaptive Huffman coding achieves the same compression ratio as Huffman coding on image.dat and text.dat, a slightly better result on audio.dat and a slightly worse result on binary.dat. The worse result on binary.dat arises because adaptive Huffman coding may use two bits to represent one symbol while Huffman coding uses only 1 bit when compressing the binary file. Whether it is slightly better or slightly worse depends on the input data; overall it achieves compression results similar to Huffman coding.

Huffman coding is an improvement of Shannon-Fano coding. Shannon-Fano coding builds its codes from the root to the leaves, while Huffman coding builds the tree from the leaves to the root. This difference makes Huffman coding slightly more efficient: in the best case Shannon-Fano coding may produce the same result as Huffman coding, but it can never outperform it.

The biggest advantage of adaptive Huffman coding over Huffman coding is that it does not need the global statistics of the input data, so it can be applied to real-time compression. Huffman coding is a two-pass procedure that needs the global statistics to generate the Huffman tree, while adaptive Huffman coding is a one-pass algorithm that builds the Huffman tree on the fly.

Problem 3: Run-Length Coding

Abstract and Motivation

Run-length coding (RLC) is a common encoding scheme whose basic idea is to encode a sequence of repeated symbols as the number of repeats followed by the symbol. Besides the basic scheme, there are modified methods such as the MSB-modified scheme, and some coding methods, such as Golomb coding, also build on the basic idea of RLC. [3] In this section, I implement these three RLC methods and discuss their performance.

Approach and Procedures

Basic Scheme:
The basic scheme follows the pattern number + symbol. In my implementation, I use an array fileData[] to store the bytes read from the input file and an array codeArray[] to store the encoding result. I count the number of consecutive repeats of a symbol in the fileData array and store the count and the symbol in codeArray. For example, 03 03 04 03 is encoded as 02 03 01 04 01 03. The same pattern is used to decode the compressed file. One special case is that when a symbol is repeated more than 255 times, the run is divided into several parts for encoding: for example, a run of 513 0s is encoded as FF 00 FF 00 03 00. A code sketch of this encoder is given below.
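The following is a minimal sketch of the basic-scheme encoder described above. The fileData/codeArray naming follows the report's description, while the function name rlcEncodeBasic() and its use of std::vector are illustrative assumptions.

#include <cstddef>
#include <vector>

std::vector<unsigned char> rlcEncodeBasic(const std::vector<unsigned char>& fileData) {
    std::vector<unsigned char> codeArray;
    std::size_t i = 0;
    while (i < fileData.size()) {
        unsigned char symbol = fileData[i];
        std::size_t run = 1;
        while (i + run < fileData.size() && fileData[i + run] == symbol) ++run;

        // Emit the run as (count, symbol) pairs, splitting counts above 255.
        std::size_t remaining = run;
        while (remaining > 0) {
            unsigned char count = static_cast<unsigned char>(remaining > 255 ? 255 : remaining);
            codeArray.push_back(count);
            codeArray.push_back(symbol);
            remaining -= count;
        }
        i += run;
    }
    return codeArray;
}

For the examples above, the input 03 03 04 03 produces 02 03 01 04 01 03, and a run of 513 zeros produces FF 00 FF 00 03 00.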
Modified Scheme:
Based on the basic scheme, the modified scheme uses the most significant bit (MSB) of each byte to indicate whether the byte represents a repeat count or a symbol: if the byte represents a repeat count, its MSB is set to 1. For example, if the number of repeats is 6, the count is encoded as 1000 0110, which is 86 in hexadecimal (134 in decimal). If a symbol occurs only once, the symbol is encoded directly without a repeat count. One special case is that when a symbol is repeated more than 127 times, the run is divided into several parts for encoding: for example, a run of 134 0s is encoded as FF 00 87 00. The other special case is that when the symbol itself has its MSB set to 1, a count must be placed before it to avoid confusion between counts and symbols: for example, the single symbol 86 is encoded as 81 86.

Golomb Coding:
This method is designed to encode binary input files. We choose a tunable parameter M as the base for Golomb coding, where M is a power of 2, M = 2^m. Suppose the length of a run of a symbol is N; it can be written as

    N = qM + r,  where q = ⌊N / M⌋ and r = N mod M        (4)

The number N is then encoded as q 0's followed by a 1, followed by r written in m bits.        (5)

For example, if N = 66 and M = 32, then q = 2 and r = 2, and N is encoded as 00100010. To determine M, we use

    M ≈ -1 / log2(1 - p)        (6)

where p is the probability of symbol 1 in the input data. [4] In my implementation, since the given file contains mostly 0s, I encode the lengths of the runs of 0s that are separated by 1s. I choose M = 64, so m = 6. The same scheme applies to the decoding process: I decode the length of each run of zeros according to equation (4) and insert a single 1 after each run. A code sketch of the encoder follows this subsection.

I implemented the encoding and decoding processes for all three methods, and the decoded files match the originals, which verifies the correctness of my implementation.
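A minimal sketch of the Golomb encoder for zero runs with M = 2^m, following equations (4) and (5), is shown below. The bit-string output and the function names golombEncodeRun()/golombEncode() are illustrative assumptions; a real implementation would pack the bits into bytes.

#include <string>
#include <vector>

// Encode one run length N as q zeros, a terminating 1, and r in m bits,
// where N = q*M + r and M = 2^m (equations (4) and (5)).
std::string golombEncodeRun(unsigned long runLength, unsigned int m) {
    const unsigned long M = 1UL << m;
    unsigned long q = runLength / M;
    unsigned long r = runLength % M;

    std::string bits(q, '0');                            // q zeros (unary part)
    bits += '1';                                         // terminator
    for (int i = static_cast<int>(m) - 1; i >= 0; --i)   // r in m bits, MSB first
        bits += ((r >> i) & 1UL) ? '1' : '0';
    return bits;
}

// Encode a binary sequence as the Golomb codes of its zero-run lengths.
// Each run is terminated by a 1; a trailing run with no terminating 1 is
// also emitted.
std::string golombEncode(const std::vector<int>& bitsIn, unsigned int m) {
    std::string out;
    unsigned long run = 0;
    for (int b : bitsIn) {
        if (b == 0) {
            ++run;
        } else {
            out += golombEncodeRun(run, m);
            run = 0;
        }
    }
    if (run > 0) out += golombEncodeRun(run, m);
    return out;
}

For N = 66 and m = 5 (M = 32), golombEncodeRun() produces 00100010, matching the example above.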
Results

Basic Scheme:

Table 13: Audio.dat basic scheme RLC compression result
    Original File Size        65536 bytes
    Compressed File Size      108534 bytes
    Compression Ratio         -65.61%
    Decompressed File Size    65536 bytes

Table 14: Binary.dat basic scheme RLC compression result
    Original File Size        65536 bytes
    Compressed File Size      4780 bytes
    Compression Ratio         92.71%
    Decompressed File Size    65536 bytes

Table 15: Text.dat basic scheme RLC compression result
    Original File Size        8358 bytes
    Compressed File Size      16400 bytes
    Compression Ratio         -96.22%
    Decompressed File Size    8358 bytes

Table 16: Image.dat basic scheme RLC compression result
    Original File Size        65536 bytes
    Compressed File Size      124320 bytes
    Compression Ratio         -89.70%
    Decompressed File Size    65536 bytes

Modified Scheme:

Table 17: Audio.dat modified scheme RLC compression result
    Original File Size        65536 bytes
    Compressed File Size      85205 bytes
    Compression Ratio         -30.01%
    Decompressed File Size    65536 bytes

Table 18: Binary.dat modified scheme RLC compression result
    Original File Size        65536 bytes
    Compressed File Size      4446 bytes
    Compression Ratio         93.22%
    Decompressed File Size    65536 bytes

Table 19: Text.dat modified scheme RLC compression result
    Original File Size        8358 bytes
    Compressed File Size      8343 bytes
    Compression Ratio         0.18%
    Decompressed File Size    8358 bytes

Table 20: Image.dat modified scheme RLC compression result
    Original File Size        65536 bytes
    Compressed File Size      82766 bytes
    Compression Ratio         -26.29%
    Decompressed File Size    65536 bytes

Golomb Coding:

Table 21: Golomb.dat Golomb compression result
    Original File Size        1250 bytes
    Compressed File Size      109 bytes
    Compression Ratio         91.28%
    Decompressed File Size    1250 bytes

Discussion

For the basic scheme RLC, only binary.dat compresses well: its compression ratio is 92.71%. The other files are expanded. This is because basic RLC is only effective for data with long runs of repeated symbols. Binary.dat contains many runs of 0s and 1s, so basic RLC can encode each run in the number + symbol pattern and save bits. The other three files contain few runs of repeated symbols, so each single symbol is encoded as 1 + symbol, which adds bits compared with the original data; for example, the data 02 03 04 05 is encoded as 01 02 01 03 01 04 01 05. This is why the other three files grow after compression.

For the modified scheme RLC, binary.dat again has the best compression ratio, 93.22%. Text.dat is compressed slightly, with a compression ratio of 0.18%, while image.dat and audio.dat are expanded. Compared with the basic scheme, the modified scheme compresses better: the compression ratio of binary.dat is higher, text.dat goes from expansion to slight compression, and image.dat and audio.dat expand less. This is because single symbols whose MSB is not set do not need a preceding count byte, which saves bits relative to the basic scheme. However, the modified scheme is still not suitable for data with few runs of repeated symbols, since a symbol whose MSB is set must be encoded with an extra byte to avoid confusion between counts and symbols.

Golomb coding achieves a good compression ratio, 91.28%. Golomb coding is well suited to binary files in which one symbol is much more frequent than the other.
It is based on a probability model and is a useful development of RLC. However, Golomb coding is not good at encoding binary files in which both symbols are equally probable: if we use Golomb coding to encode a stream such as 010101010101010101, it expands the file instead of compressing it. When applying Golomb coding, choosing appropriate parameters M and m is also important, since an inappropriate parameter reduces the compression. By using formula (6) we can ensure that the tunable parameter M is suitable for the input data.

References
[1] S. Ihara, Information Theory for Continuous Systems. World Scientific, 1993, p. 2. ISBN 978-981-02-0985-8.
[2] J. S. Vitter, "Design and Analysis of Dynamic Huffman Codes," Journal of the Association for Computing Machinery, vol. 34, no. 4, pp. 825-845, October 1987.
[3] M. R. Nelson, The Data Compression Book. M&T Books, Redwood City, CA, 1991.
[4] S. W. Golomb, "Run-length encodings," IEEE Transactions on Information Theory, vol. IT-12, no. 3, pp. 399-401, 1966.