Gradient Algorithms for Designing Predictive Vector Quantizers
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-34, NO. 4, AUGUST 1986

PAO-CHI CHANG AND ROBERT M. GRAY, FELLOW, IEEE

Abstract-A predictive vector quantizer (PVQ) is a vector extension of a predictive quantizer. It consists of two parts: a conventional memoryless vector quantizer (VQ) and a vector predictor. Two gradient algorithms for designing a PVQ are developed in this paper: the steepest descent (SD) algorithm and the stochastic gradient (SG) algorithm. Both have the property of improving the quantizer and the predictor in the sense of minimizing the distortion as measured by the average mean-squared error. The differences between the two design approaches are the period and the step size used in each iteration to update the codebook and predictor. The SG algorithm updates once for each input training vector and uses a small step size, while the SD algorithm updates only once for a long period, possibly one pass over the entire training sequence, and uses a relatively large step size. Code designs and tests are simulated both for Gauss-Markov sources and for sampled speech waveforms, and the results are compared to codes designed using techniques that attempt to optimize only the quantizer for the predictor and not vice versa.

Manuscript received December 29, 1984; revised August 12, 1985. This work was supported in part by the Joint Services Electronics Program at Stanford University and by the National Science Foundation under Grant ECS83-17981. The authors are with the Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA 94305. IEEE Log Number 8608122.

I. INTRODUCTION

A vector quantizer is a system for mapping a sequence of continuous or high rate discrete k-dimensional vectors into a digital sequence suitable for communication over, or storage in, a digital channel. While Shannon theory states that memoryless vector quantization is sufficient to achieve nearly optimal performance, such performance is guaranteed only for large vector dimension. Unfortunately, however, for a fixed rate in bits per sample the codebook size grows exponentially with the vector dimension and, hence, the complexity of the minimum distortion searches required by the encoder also grows exponentially. For this reason, recent research has focused on design techniques for vector quantizers that have structures which yield a slower growth of encoder complexity with rate or dimension. Two vector quantizer structures that have proved promising are feedback vector quantizers with full searches of small codebooks, and memoryless vector quantizers with large codebooks and suboptimal, but efficient, search algorithms. A general survey of vector quantization, including many examples of both structures, may be found in [1]. We here develop new design algorithms for predictive vector quantizers, a special case of the class of feedback quantizers.

A predictive vector quantizer (PVQ) or vector predictive quantizer is a vector extension of a predictive quantizer or DPCM system. In the encoding process, an error vector, formed as the difference between the input vector and the prediction of this vector, is coded by a memoryless vector quantizer. The vector quantizer chooses the minimum distortion codeword from a stored codebook and transmits the index of this codeword to the receiver. A PVQ is a feedback VQ because the encoder output is fed back to the predictor for use in approximating the next input vector.
The general structure of PVQ was introduced by Cuperman and Gersho [2], [3], who developed a PVQ design algorithm for waveform coding with two main steps. First, a set of linear vector predictive coefficients is computed from the input training sequence by generalized LPC techniques. Second, a vector quantizer codebook is designed either for the innovations sequence, formed by subtracting from each input vector a linear predicted value based on the actual past inputs ("open-loop" design), or for the actual prediction error, formed as the difference between the input vector and the linear predicted value based on the past quantized outputs ("closed-loop" design). The generalized Lloyd algorithm was used for the vector codebook design (see, e.g., [4]). As with traditional design techniques for scalar predictive quantization or DPCM, the predictor is designed under the assumption that the prediction is based on past input vectors rather than on their quantized values; that is, it is effectively assumed that the quantized reproduction is nearly perfect. This approximation may be quite poor if the quantizer has a low rate. This causes a potential problem in the system design: the predictor which is optimum given past true values will not be so for past quantized values. A second potential problem arises when the open-loop design technique is used: the vector quantizer designed to be good for the ideal innovations sequence may not be as good when applied to the actual prediction error sequence. The closed-loop design resolves this problem and was found to provide 1-2 dB improvement over the open-loop design for sampled speech at rates ranging from 1 to 2 bits/sample with vector dimensions ranging from 1 to 5.

This paper presents an approach to designing predictive vector quantizers by applying standard techniques of adaptive filtering to the design of both the vector quantizer and the vector predictor in a PVQ. The quantizers and predictors are iteratively optimized for each other using gradient search techniques. This is accomplished by simultaneously adjusting both the linear predictive coefficients and the vector quantizer codebook after each fixed number of training vectors, where the number of training vectors used for each update can range from one vector to the entire training sequence. The adjustments attempt to minimize the cumulative sample squared error between the input vectors and the reconstructed signals. No assumption of perfect reproduction is needed in this algorithm and, hence, it may yield better codes. The algorithm is used only in the design of the system, not as an on-line adaptation mechanism as in the adaptive gradient algorithms of, e.g., Gibson et al. [5] and Dunn [6]. Some preliminary work on a scalar version of this algorithm was developed in unpublished work of Y. Linde. Preliminary results of the research described here were reported in [1]. On the positive side, simulations on Gauss-Markov sources and sampled speech indicate that the SD and SG design techniques yield good codes when the parameters are chosen intelligently. On the negative side, the resulting codes yield overall performance quite close to that of codes designed using the Cuperman-Gersho technique.
These results are of interest not only because they show that popular adaptive filtering algorithms can be modified to design good predictive vector quantizers, but because they show that optimizing the quantizer for the vector predictor yields good overall performance even when the predictor is not optimized for the quantizer. In other words, predictive vector quantizers are robust against inaccuracies in the predictor provided the quantizer is matched to the predictor. As a final observation, the algorithms developed here have also been extended to the design of good predictive trellis encoding systems [7] and joint source and channel trellis encoding systems [8].

The basic structure of a PVQ system is presented in the second section. The principles of the gradient design algorithms are discussed in the third section. Simulation results for two different sources follow. Finally, comments and suggestions for future research are mentioned.

II. PREDICTIVE VECTOR QUANTIZER

A PVQ system is sketched in Fig. 1. Let {x_n} be a vector-valued random process or source with alphabet B, e.g., k-dimensional Euclidean space R^k. A PVQ consists of three functions: an encoder γ, which assigns to each error vector e_n = x_n - x̃_n a channel symbol γ(e_n) in some channel symbol set M; a decoder β, assigning to each channel symbol u_n in M a value in a reproduction alphabet B̂; and a prediction function or next state function f, which approximates the next input vector x_{n+1} as x̃_{n+1} = f(x̂_n, x̂_{n-1}, ...). Given a sequence of input vectors and an initial prediction x̃_0, the channel symbol sequence u_n, the reproduction sequence x̂_n, and the prediction sequence x̃_n are defined recursively for n = 0, 1, 2, ... as

    u_n = γ(e_n) = γ(x_n - x̃_n),
    x̂_n = x̃_n + β(u_n),                                                   (1)
    x̃_{n+1} = f(x̂_n, x̂_{n-1}, ...).

[Fig. 1. Block diagram of PVQ.]

Since the prediction depends only on previous predictions and encoder outputs, the decoder can duplicate the prediction given the initial prediction and the channel sequence. In fact, the receiver is a subsystem of the transmitter: both contain the same predictor driven by the same input x̂, the reproduction vector.

A linear vector predictor is used in this system for its simple structure and well-known behavior. We consider, however, a particular form of linear vector predictor. Following Cuperman and Gersho [3] with some minor modifications, we consider vector predictors that operate internally as ordinary scalar predictors. To be specific, the linear prediction function is

    x̃_n = Σ_{i=1}^{p} a_i x̂_{l-i+1},    n = 1, 2, ...,  l = k(n - 1),

where k is the vector dimension, p is the predictor order, and the a_i are the predictive coefficient vectors. The predictor can be expressed more compactly as

    x̃_n = A x̂^r_{n-1},    n = 1, 2, ...,                                   (2)

where A = [a_1 a_2 ... a_p] is the k x p prediction coefficient matrix and x̂^r_{n-1} = [x̂_l x̂_{l-1} ... x̂_{l-p+1}]^T is a p-dimensional vector of the most recent reproduced scalar samples; the superscript r indicates that the samples in this vector are ordered in reverse time. In other words, the predictor generates a vector x̃ by appropriately weighting previous reproduced samples x̂.

A distortion measure d is an assignment of a nonnegative cost d(x, x̂) to reproducing a given input vector x as a reproduction vector x̂. In this paper we consider weighted squared error distortion measures,

    d(x, x̂) = (x - x̂)^T W (x - x̂).

Note that d(x, x̂) is a difference distortion measure in the sense that d(x, x̂) = d(x - x̂, 0).
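To make the recursion concrete, here is a minimal Python/NumPy sketch of (1) with the linear predictor (2) under simple squared error. It is an illustration added to this transcription, not code from the paper; the function name and the caller-supplied codebook and coefficient matrix A are assumptions of the example.

    import numpy as np

    def pvq_encode_decode(x, codebook, A):
        """Run the PVQ recursion: e_n = x_n - x~_n, nearest-neighbor encode,
        x^_n = x~_n + codeword, x~_{n+1} = A @ (last p reproduced samples, newest first).
        x: (N, k) input vectors, codebook: (2^R, k) error codewords, A: (k, p)."""
        N, k = x.shape
        p = A.shape[1]
        hist = np.zeros(p)                   # last p reproduced scalar samples, oldest..newest
        indices = np.empty(N, dtype=int)
        xhat = np.empty_like(x, dtype=float)
        for n in range(N):
            xtilde = A @ hist[::-1]          # prediction x~_n = A x^r_{n-1}, eq. (2)
            e = x[n] - xtilde                # error vector e_n
            u = int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))   # nearest neighbor rule
            indices[n] = u
            xhat[n] = xtilde + codebook[u]   # reproduction x^_n = x~_n + beta(u_n), eq. (1)
            hist = np.concatenate([hist, xhat[n]])[-p:]   # keep the p most recent samples
        return indices, xhat

The receiver runs the same predictor and codebook on the received indices, so it can form the same reproductions x̂_n given the initial prediction.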
The vector quantizer (which includes the encoder and decoder) operates according to a minimum distortion or nearest neighbor rule, i.e.,

    γ(e) = min^{-1}_{u∈M} d(x, x̂),

where the inverse minimum notation means that γ(e) is the index u for which the reproduction codeword x̂ yields the minimum possible distortion over all possible reproduction codewords. Substituting (1) into the above equation, for any difference distortion measure the minimum distortion rule becomes

    γ(e) = min^{-1}_{u∈M} d(e + x̃, β(u) + x̃) = min^{-1}_{u∈M} d(e, β(u)).

Therefore,

    d(e, β(γ(e))) = min_{u∈M} d(e, β(u)) = min_{u∈M} d(x, x̂),

which means that minimizing the distortion of the overall system is exactly equivalent to minimizing the distortion of the quantizer alone. This property simplifies the design of a PVQ system.

The decoder is simply a table lookup. It can be implemented by a ROM storing the reproduction error vectors. The final reproduction vector is obtained by adding the outputs of the decoder and the predictor.

The performance of a PVQ is given by its long term average distortion

    A_∞ = D(x, x̂) = lim_{N→∞} (1/N) Σ_{n=1}^{N} d(x_n, x̂_n),

if the limit exists. In practice, we design a system by minimizing the sample average

    A_L = D_L(x, x̂) = (1/L) Σ_{n=1}^{L} d(x_n, x̂_n)

for large L.

A PVQ is a sequential machine, or a state machine with an infinite number of states. Unlike a finite-state VQ [9], [10], which designs a different codebook for each state, a PVQ stores only one codebook for all states. This implies that the storage requirement and search complexity of a PVQ are almost the same as those of a memoryless VQ with the same rate and dimension. As in the scalar case, however, a PVQ outperforms a memoryless VQ since the correlation between vectors is used more effectively.

A major problem of many feedback systems is the channel error propagation effect. Like scalar predictive quantization, a PVQ has this problem. As with the scalar system, the effects of occasional errors should die out with time if the predictor is stable (e.g., the predictor gain is strictly less than 1 and, hence, the system is not a delta modulator in the scalar case).
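The nearest neighbor rule and the sample average A_L that the design minimizes translate directly into code. The following sketch is illustrative only (hypothetical helper names), assuming a caller-supplied positive-definite weight matrix W.

    import numpy as np

    def encode_error(e, codebook, W):
        """Minimum distortion rule for d(x, y) = (x - y)^T W (x - y).  Because d is a
        difference distortion, searching the error codebook for e gives the same index
        as searching the full reproductions x~ + beta(u) for x."""
        diffs = e - codebook                                  # (2^R, k)
        dists = np.einsum('ij,jk,ik->i', diffs, W, diffs)
        return int(np.argmin(dists))

    def sample_average_distortion(x, xhat, W):
        """A_L = (1/L) sum_n d(x_n, x^_n), the quantity the design minimizes."""
        diffs = np.asarray(x) - np.asarray(xhat)
        return float(np.mean(np.einsum('ij,jk,ik->i', diffs, W, diffs)))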
III. GRADIENT ALGORITHMS FOR DESIGNING PVQ

A PVQ system consists of three functions: an encoder, a decoder, and a predictor. Although the minimum distortion encoding rule is not necessarily optimum for a feedback system, it gives satisfying performance and is easy to implement for a PVQ system. Hence, the encoder still obeys the minimum distortion rule, and only a decoder and a predictor have to be designed and stored for use.

For simplicity, we first isolate the decoder and the predictor from the feedback loop to derive their update formulas. In other words, we derive formulas to update the quantizer codebook for a given error signal and encoder, and formulas to update the predictor for a given predictor input sequence, separately. Then we combine these formulas to design the whole PVQ system.

In this section, we present two gradient algorithms for designing both the quantizer codebook and the predictor coefficients of a PVQ system. The fundamentals and applications of adaptive signal processing can be found in [11], from which we borrow some notation and nomenclature.

A. Steepest Descent Algorithm

The method of steepest descent is one of the oldest and most widely known methods for minimizing a function of several variables. The formulas to update the quantizer and the predictor are given next.

Update the Quantizer Codebook: The goal of designing a PVQ system is to minimize the average distortion. Assume the average distortion D(x, x̂) is differentiable; then the basic steepest descent formula to update the quantizer is

    ê_{i,m+1} = ê_{i,m} - μ_q ∇_{ê_i} D(x, x̂),    i = 1, ..., 2^R,                      (3)

where ê_i is the reproduction codeword with index i, μ_q is the step size (which affects the rate of convergence and stability; its choice is discussed later), ∇_{ê_i} is the gradient with respect to ê_i, and m is the step number. In words, new codewords are formed by searching along the direction of the negative gradient of the average distortion from the old codewords. By the definition of D(x, x̂) and (1), we obtain

    ∇_{ê_i} D(x, x̂) = (1/L_i) Σ_{j: γ(e_j)=u_i} ∇_{ê_i} d(e_j, ê_{i,m}),    i = 1, ..., 2^R,         (4)

where L_i is the number of training vectors which are mapped into codeword i. We consider weighted squared error distortion measures of the form d(x, x̂) = (x - x̂)^T W (x - x̂), where W is some positive-definite matrix. Then

    ∇_{ê_i} d(e_j, ê_{i,m}) = ∇_{ê_i} (e_j - ê_{i,m})^T W (e_j - ê_{i,m}) = -(W + W^T)(e_j - ê_{i,m}).

The gradient is proportional to the difference between the quantizer input and the mapped codeword. Therefore, the steepest descent formula for the quantizer is

    ê_{i,m+1} = ê_{i,m} + μ_q (W + W^T) (1/L_i) Σ_{j: γ(e_j)=u_i} (e_j - ê_{i,m}),    i = 1, ..., 2^R.   (5)

If the simple squared distortion is considered, (5) becomes

    ê_{i,m+1} = ê_{i,m} + 2μ_q [ (1/L_i) Σ_{j: γ(e_j)=u_i} e_j - ê_{i,m} ],    i = 1, ..., 2^R.          (6)

In practice, a long training sequence is used to design the quantization system, and the limiting averages are approximated by sums over that training sequence. Note that (1/L_i) Σ_{j: γ(e_j)=u_i} e_j is just the centroid, or center of gravity, of all source vectors encoded into channel symbol u_i. For a given encoder γ, the optimum decoder is the one whose codewords are the centroids of all training vectors mapped into each channel symbol [1].

The choice of 2μ_q significantly affects the performance of this algorithm. To make the analysis easy, we only consider the quantizer itself, i.e., {e_n} is assumed fixed for every update step. Equation (6) can also be expressed as

    ê_{i,m+1} = (1 - 2μ_q) ê_{i,m} + 2μ_q (1/L_i) Σ_{j: γ(e_j)=u_i} e_j,    i = 1, ..., 2^R.

The update is stable if and only if |1 - 2μ_q| < 1, i.e., 0 < 2μ_q < 2. Observe that when 2μ_q is less than 1, the rate of convergence increases as 2μ_q increases, reaching the maximum rate at 2μ_q = 1. At this maximum rate the optimal solution, that is, the replacement of the old codewords by the centroids, is reached in a single step. For 0 < 2μ_q < 1 there is no oscillation in the codeword updating and the process is said to be overdamped. For 1 < 2μ_q < 2 the updating process is underdamped and converges in a decaying oscillation. When 2μ_q = 1 the process is critically damped, and (6) becomes the replacement of each old codeword by its centroid, which is exactly the generalized Lloyd algorithm [4], [12]. This also shows that the Lloyd algorithm, which achieves the optimal solution in one step, has the fastest convergence rate in the family of steepest descent algorithms for a given encoder and training sequence. An algorithm with a slower convergence rate, however, may avoid bad local optima by giving a smoother search [13], [14].
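A minimal sketch of this codebook update for the simple squared error case, assuming the error vectors have already been encoded; the array and function names are illustrative, not the paper's.

    import numpy as np

    def sd_codebook_update(codebook, errors, labels, two_mu_q=1.0):
        """One steepest descent codebook update under simple squared error, eq. (6):
        each codeword moves toward the centroid of the error vectors mapped to it.
        0 < two_mu_q < 1 is overdamped, 1 < two_mu_q < 2 underdamped."""
        new_codebook = codebook.copy()
        for i in range(len(codebook)):
            cell = errors[labels == i]
            if len(cell) == 0:
                continue                      # empty cell: leave the codeword alone
            centroid = cell.mean(axis=0)
            new_codebook[i] = codebook[i] + two_mu_q * (centroid - codebook[i])
        return new_codebook

With two_mu_q = 1 each codeword is simply replaced by the centroid of its cell, i.e., one generalized Lloyd step.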
Update the Predictor: Assume the average distortion D(x, x̂) is differentiable; the general steepest descent formula for updating the predictor is

    A_{m+1} = A_m - μ_p ∇_{A_m} D(x, x̂),                                               (7)

where A is the predictor coefficient matrix, m is the step number, and μ_p is the step size for updating the predictor. For a PVQ system with a linear predictor as in (2), x̃_n = A x̂^r_{n-1}, and the weighted squared distortion measure d(x, x̂) = (x - x̂)^T W (x - x̂), the gradient of the average distortion is

    ∇_{A_m} D(x, x̂) = ∇_{A_m} lim_{L→∞} (1/L) Σ_{n=1}^{L} (x_n - A_m x̂^r_{n-1} - β(u_n))^T W (x_n - A_m x̂^r_{n-1} - β(u_n))
                    = -(W + W^T) lim_{L→∞} (1/L) Σ_{n=1}^{L} (x_n - x̂_n)(x̂^r_{n-1})^T.               (8)

Define the k x p cross-correlation matrix P and the p x p correlation matrix R as

    P = lim_{L→∞} (1/L) Σ_{n=1}^{L} (x_n - β(u_n)) (x̂^r_{n-1})^T,
    R = lim_{L→∞} (1/L) Σ_{n=1}^{L} x̂^r_{n-1} (x̂^r_{n-1})^T.

Then

    ∇_{A_m} D(x, x̂) = -(W + W^T)(P - A_m R).                                            (9)

Let A* be a value of A yielding a zero gradient above and, hence, satisfying a necessary condition for optimality in the sense of yielding the minimum average distortion, ∇_{A_m} D(x, x̂) = 0. A solution is obviously

    A* = P R^{-1},                                                                      (10)

if R is invertible. This is the Wiener-Hopf equation in matrix form for the vector predictor. From (9), the steepest descent formula for the predictor is

    A_{m+1} = A_m + μ_p (W + W^T)(P - A_m R).                                           (11)

In practice, the correlation matrices are more difficult to obtain than the simple differences x - x̂; from (8) we have

    A_{m+1} = A_m + μ_p (W + W^T) lim_{L→∞} (1/L) Σ_{n=1}^{L} (x_n - x̂_n)(x̂^r_{n-1})^T.              (12)

Again, if the simple squared distortion measure is chosen, then

    A_{m+1} = A_m + 2μ_p (P - A_m R),                                                   (13)

and, dropping the limit when a long training sequence is used in practice,

    A_{m+1} = A_m + 2μ_p (1/L) Σ_{n=1}^{L} (x_n - x̂_n)(x̂^r_{n-1})^T.                                 (14)

The stability condition on 2μ_p is more difficult to analyze than that on 2μ_q since R is generally not diagonal. However, by the translating and rotating operations

    A_m = Y_m Q + A*,

where Y_m is the new prediction matrix and Q is a p x p eigenvector matrix, the algorithm can, with some work, be rewritten as

    Y_{m+1} = Y_m (I - 2μ_p Λ),                                                         (15)

where Λ = Q R Q^{-1} is the eigenvalue matrix, with the eigenvalues of R on the diagonal and zeros elsewhere. Equation (15) is stable and convergent if and only if

    |1 - 2μ_p λ_max| < 1,  i.e.,  0 < 2μ_p < 2/λ_max,

where λ_max is the largest eigenvalue of R.

To find the optimal step size, for which A* is reached in the minimum number of steps, we substitute (10) into (13) and get

    A_{m+1} = A_m (I - 2μ_p R) + 2μ_p A* R.

Since the equation 2μ_p R = I does not hold for a scalar 2μ_p and a general p x p matrix R, a scalar 2μ_p may not be capable of being the optimal step size. Thus, we generalize the step size to be a p x p matrix 2μ_p, of the same size as R. Starting from the basic formula and using similar derivations, we get

    A_{m+1} = A_m + (1/L) Σ_{n=1}^{L} (x_n - x̂_n)(x̂^r_{n-1})^T 2μ_p.                                 (16)

The optimal value of 2μ_p is then 2μ_p = R^{-1}, if R is invertible, in which case A* is reached in a single step. This choice is just Newton's method, which has the fastest rate of convergence but a slightly more complicated calculation.

Observe that Newton's method is indeed solving the Wiener-Hopf equation to get the optimal solution A* for the predictor with a fixed encoder and decoder. This solution, however, may not be optimal for the whole system when the decoder and the predictor are updated simultaneously. In our simulations, this solution occasionally even resulted in an unstable system. The steepest descent algorithm provides the possibility of obtaining the optimal solution for the whole system by properly choosing the step sizes.
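The practical form of this update is easy to sketch: accumulate G = (1/L) Σ (x_n - x̂_n)(x̂^r_{n-1})^T, which equals P - A R, together with R from a given reproduced sequence, and then take either a scalar gradient step or the Newton step G R^{-1}. The code below is an illustrative Python/NumPy sketch under those assumptions, not the paper's implementation.

    import numpy as np

    def predictor_step(x, xhat, A, p, two_mu_p=None):
        """One steepest descent predictor update under simple squared error.
        x, xhat: (N, k) input and reproduced vectors; A: (k, p) current predictor.
        A scalar two_mu_p gives A <- A + two_mu_p * G (eq. (14)); two_mu_p=None uses
        the matrix step R^{-1} (Newton's method), landing on the Wiener-Hopf solution."""
        N, k = x.shape
        samples = xhat.reshape(-1)               # reproduced scalar samples in time order
        G = np.zeros((k, p))
        R = np.zeros((p, p))
        cnt = 0
        for n in range(1, N):
            last = n * k                         # samples of vectors 0..n-1 are available
            if last < p:
                continue
            xr = samples[last - p:last][::-1]    # last p reproduced samples, newest first
            G += np.outer(x[n] - xhat[n], xr)
            R += np.outer(xr, xr)
            cnt += 1
        G /= cnt
        R /= cnt
        if two_mu_p is None:
            return A + G @ np.linalg.inv(R)      # Newton step: reaches A* = P R^{-1}
        return A + two_mu_p * G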
Design Procedures: The formulas for optimizing the quantizer and the predictor have been derived separately. We now combine these formulas to design the complete PVQ system.

There are two approaches to optimizing the decoder and the predictor of a PVQ system:
1) Optimize the decoder for a fixed predictor and optimize the predictor for a fixed decoder separately, and iterate these procedures until convergence.
2) Optimize the decoder and the predictor simultaneously and iterate until convergence.

In general, the performance improvement of each iteration in optimizing either the decoder or the predictor tends to decrease rapidly if the procedure converges. Thus, approach 2) may have a faster rate of convergence, and hence we choose this approach. The steepest descent algorithm with the simple mean squared distortion measure is described as follows.

Step 0-Initialization: Given: a training sequence {x_n}_{n=1}^{L}; vector dimension k; predictor order p; rate R bits per vector (r bits per sample); reproduction codebook size 2^R; an initial quantizer codebook C_0 = {β(u), u ∈ M}; an initial predictor A_0; and a convergence threshold δ ≥ 0. Set m = 0 and D_{-1} = ∞.

Step 1-Minimum Distortion Encoding: Obtain the error sequence {e_n = x_n - A_m x̂^r_{n-1}, n = 1, ..., L} and encode u_n = min^{-1}_{u∈M} d(e_n, β(u)). Compute the average distortion

    D_m = (1/L) Σ_{n=1}^{L} d(e_n, β(u_n)).

If (D_{m-1} - D_m)/D_m < δ, halt with the final codebook and predictor C_m, A_m. Otherwise continue.

Step 2-Quantizer Update:

    ê_{i,m+1} = ê_{i,m} + 2μ_q [ (1/L_i) Σ_{n: γ(e_n)=u_i} e_n - ê_{i,m} ],    i = 1, ..., 2^R.

Step 3-Predictor Update:

    A_{m+1} = A_m + (1/L) Σ_{n=1}^{L} (x_n - x̂_n)(x̂^r_{n-1})^T 2μ_p.

Set m ← m + 1 and go to Step 1.

This algorithm is an iterative improvement algorithm and requires an initial codebook and predictor to start the process. The "splitting" technique is applied to generate large initial codebooks from small ones, since it keeps each original codeword as a member of the new codebook so that the new average distortion will not increase. The initial predictor of the whole process is simply set to 0, since this disconnects the feedback loop and ensures the stability of the system. The choice of step sizes is the major factor affecting the performance of this algorithm. The optimal step sizes for updating the quantizer and the predictor (2μ_q = 1 and 2μ_p = R^{-1}, respectively) are chosen in all simulations unless stated otherwise.
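A compact, self-contained sketch of Steps 0-3 under the simple squared error measure is given below. It is illustrative only: the function name and defaults are invented, and the initial codebook is seeded from random training vectors rather than grown by the splitting technique the paper uses, purely to keep the sketch short.

    import numpy as np

    def design_pvq_sd(x, p, bits_per_vector, n_iter=20, delta=1e-4, seed=0):
        """Steepest descent PVQ design under simple squared error: closed-loop
        minimum distortion encoding, centroid codebook update (2*mu_q = 1), and a
        Newton predictor step (matrix step size 2*mu_p = R^-1)."""
        N, k = x.shape
        ncw = 2 ** bits_per_vector
        A = np.zeros((k, p))                                   # initial predictor: 0
        codebook = x[np.random.default_rng(seed).choice(N, ncw, replace=False)].astype(float)
        D_prev = np.inf
        for _ in range(n_iter):
            # Step 1: closed-loop minimum distortion encoding
            hist = np.zeros(p)
            errors = np.empty_like(x, dtype=float)
            labels = np.empty(N, dtype=int)
            xhat = np.empty_like(x, dtype=float)
            for n in range(N):
                xtilde = A @ hist[::-1]
                e = x[n] - xtilde
                u = int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))
                errors[n], labels[n] = e, u
                xhat[n] = xtilde + codebook[u]
                hist = np.concatenate([hist, xhat[n]])[-p:]
            D = np.mean(np.sum((x - xhat) ** 2, axis=1))
            if (D_prev - D) / D < delta:
                break
            D_prev = D
            # Step 2: quantizer update with 2*mu_q = 1 (replace codewords by centroids)
            for i in range(ncw):
                if np.any(labels == i):
                    codebook[i] = errors[labels == i].mean(axis=0)
            # Step 3: predictor update with matrix step 2*mu_p = R^-1 (Newton step)
            samples = xhat.reshape(-1)
            G, R, cnt = np.zeros((k, p)), np.zeros((p, p)), 0
            for n in range(1, N):
                last = n * k
                if last < p:
                    continue
                xr = samples[last - p:last][::-1]
                G += np.outer(x[n] - xhat[n], xr)
                R += np.outer(xr, xr)
                cnt += 1
            A = A + (G / cnt) @ np.linalg.inv(R / cnt)
        return codebook, A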
B. Stochastic Gradient Algorithm

In the preceding section, the quantizer and the predictor are updated once for the whole training sequence. Another algorithm widely used in adaptive systems is the so-called least-mean-square or LMS algorithm, which updates these parameters for each incoming vector. In this section, we present an algorithm that is similar to LMS but differs in that its step sizes are not fixed; they decrease with time, or with the input signals. It is called the stochastic gradient (SG) algorithm.

Update the Quantizer Codebook: The quantizer is updated with each incoming vector in the SG algorithm. Hence, the gradient of the average distortion is replaced by the gradient of the current distortion, and the basic formula becomes

    ê_{i,m+1} = ê_{i,m} - μ_{q,n} ∇_{ê_i} d(x_n, x̂_n),    i = 1, ..., 2^R.

Since d(x_n, x̂_n) = d(e_n, β(u_n)), where β(u_n) is the codeword chosen to reconstruct e_n by the encoding rule, substituting into the basic formula gives the SG formula for updating the quantizer: for i = 1, ..., 2^R,

    ê_{i,m+1} = ê_{i,m} + μ_{q,n} (W + W^T)(e_n - ê_{i,m})   if γ(e_n) = u_i,
    ê_{i,m+1} = ê_{i,m}                                       otherwise.                 (17)

No matter how large the codebook is, only one codeword needs to be updated with each incoming vector, since only one codeword represents e_n. If the simple squared distortion is considered, (17) simplifies to

    ê_{i,m+1} = ê_{i,m} + 2μ_{q,n} (e_n - ê_{i,m})   if γ(e_n) = u_i,                     (18)

with all other codewords left unchanged. This is a very simple formula. To analyze the stability condition of the step size, we follow an analysis similar to that of the steepest descent algorithm and obtain the sufficient condition for stability

    0 ≤ 2μ_{q,n} < 2,    n = 1, ..., L.

The final codebook is obtained from the last update of the whole training sequence. Since the local statistical behavior of speech is time varying, the step size should be chosen very small to protect the final codebook from undue influence of the local behavior of individual samples near the end of the training sequence. A small step size may result in inefficient adaptation, however, and lead to a nonoptimal solution. To solve this problem, a decreasing step size sequence is chosen so that the algorithm first reaches the neighborhood of the optimal values rapidly with large step sizes and then minimizes the error from the optimal values with small step sizes.

Eweda and Macchi [15] gave a formula for the step size and proved that an adaptive linear estimator with this step size sequence is almost-sure (a.s.) and quadratic-mean convergent. The step size μ_n of their algorithm is a decreasing sequence of positive numbers satisfying the conditions given in [15]. This sequence is applied to all SG simulations in this paper. To reduce the complexity, μ_n is not updated for every vector but once per block, with blocks ranging from hundreds to thousands of training vectors. Although a PVQ system is slightly different from their system, this formula worked well in all simulations.

Update the Predictor: The SG algorithm to update the predictor is

    A_{m+1} = A_m - μ_{p,n} ∇_{A_m} d(x_n, x̂_n).

Following an analysis similar to that of the steepest descent algorithm, the SG algorithm for updating the predictor is derived as

    A_{m+1} = A_m + μ_{p,n} (W + W^T)(x_n - x̂_n)(x̂^r_{n-1})^T                             (19)

for the general weighted squared distortion, and

    A_{m+1} = A_m + 2μ_{p,n} (x_n - x̂_n)(x̂^r_{n-1})^T                                     (20)

for the simple squared distortion. A decreasing sequence satisfying the conditions of [15] is again chosen as the step size to achieve both rapid convergence and small final error; for simplicity, the 2μ_{p,n} sequence is normalized by the variance of x̂. The matrix form of 2μ_p is not under consideration here: since the SG algorithm needs an update for each training vector, the computation of R^{-1} would substantially increase the complexity of the design procedure.

Design Procedures: The SG design algorithm with the simple mean squared distortion measure is summarized below.

Step 0-Initialization: Given: a training sequence {x_n}_{n=1}^{L}; vector dimension k; predictor order p; rate R bits per vector (r bits per sample); reproduction codebook size 2^R; an initial quantizer codebook C_0 = {β(u), u ∈ M}; and an initial predictor A_0. Set n = 0.

Step 1-Minimum Distortion Encoding: Set n ← n + 1. Obtain the error vector e_n = x_n - A_{n-1} x̂^r_{n-1} and encode u_n = min^{-1}_{u∈M} d(e_n, β(u)).

Step 2-Quantizer Update:

    ê_{i,n+1} = ê_{i,n} + 2μ_{q,n} (e_n - ê_{i,n})   if β(γ(e_n)) = ê_{i,n},
    ê_{i,n+1} = ê_{i,n}                               otherwise.

Step 3-Predictor Update:

    A_{n+1} = A_n + 2μ_{p,n} (x_n - x̂_n)(x̂^r_{n-1})^T.

Here the step number m equals the vector number n.

Step 4: Go to Step 1 until the training sequence is exhausted.

The "splitting" technique is again applied to generate the initial codebooks. The training sequence is assumed to be sufficiently long that the system will have converged when the training sequence is exhausted. If the training sequence is not long enough for convergence in one pass, however, it is repeated several times until the system converges. Convergence can be determined either by testing whether the changes of each codeword and of the predictor coefficients are less than a threshold, or by testing whether the change of the average distortion is small enough for a period of time. We choose the latter method in our design algorithm because it is easier to implement.
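A per-vector sketch of the SG design pass, using the simple squared error updates (18) and (20). It is an illustration under stated assumptions: a single decreasing step size sequence of the form c/(1 + n/1000) stands in for the block-updated sequence of [15], and it is shared by the codebook and predictor updates, whereas the paper allows separate sequences.

    import numpy as np

    def design_pvq_sg(x, p, codebook_init, c=0.05):
        """Stochastic gradient PVQ design under simple squared error: one pass over
        the training vectors, updating only the selected codeword and the predictor
        after each vector."""
        N, k = x.shape
        codebook = codebook_init.astype(float).copy()
        A = np.zeros((k, p))                       # initial predictor: 0
        hist = np.zeros(p)                         # last p reproduced samples, oldest..newest
        for n in range(N):
            two_mu = c / (1.0 + n / 1000.0)        # decreasing step size (illustrative)
            xtilde = A @ hist[::-1]
            e = x[n] - xtilde
            u = int(np.argmin(np.sum((codebook - e) ** 2, axis=1)))
            xhat = xtilde + codebook[u]
            codebook[u] += two_mu * (e - codebook[u])            # only the chosen codeword moves
            A += two_mu * np.outer(x[n] - xhat, hist[::-1])      # per-vector predictor update
            hist = np.concatenate([hist, xhat])[-p:]
        return codebook, A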
C. Cuperman-Gersho Design Algorithm

In this section we discuss the basic differences between the design algorithm developed by Cuperman and Gersho [2], [3] and the gradient algorithms developed here. For easy comparison, we consider nonadaptive PVQ systems only.

The principal difference between these algorithms is the design of the predictor. The Cuperman-Gersho algorithm designs the predictor based on the assumption of perfect reproduction. For a PVQ system, the optimal predictor for a fixed quantizer and training sequence was found in (10) to be A* = P R^{-1}. The Cuperman-Gersho algorithm uses A' to approximate A*, where A' is obtained as

    A' = P' R'^{-1},                                                                    (21)

where P' is a k x p autocorrelation matrix and R' is a p x p autocorrelation matrix computed from the input signal itself,

    P' = E[x_n (x^r_{n-1})^T],
    R' = E[x^r_{n-1} (x^r_{n-1})^T].

Note that A' is determined only by the statistical properties of the input signals. If the rate (or quantization SNR) is sufficiently large, so that the quantization error is negligible, the correlation matrices P and R computed from the quantized sequence are approximately equal to P' and R' computed from the input sequence, and hence A' ≅ A*. In other words, the predictor is nearly optimal under this assumption, even though it does not use the true predictor inputs in the design.
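For comparison with the gradient designs, the open-loop predictor (21) can be estimated from the input sequence alone, with sample averages standing in for the expectations. The sketch below is illustrative, not the authors' code.

    import numpy as np

    def cuperman_gersho_predictor(x, p):
        """Open-loop predictor design A' = P' R'^{-1}: the correlation matrices are
        estimated from the unquantized input samples alone, i.e., under the
        perfect-reproduction assumption."""
        N, k = x.shape
        samples = x.reshape(-1)                    # true input samples in time order
        P = np.zeros((k, p))
        R = np.zeros((p, p))
        cnt = 0
        for n in range(1, N):
            last = n * k
            if last < p:
                continue
            xr = samples[last - p:last][::-1]      # last p input samples, newest first
            P += np.outer(x[n], xr)
            R += np.outer(xr, xr)
            cnt += 1
        return (P / cnt) @ np.linalg.inv(R / cnt)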
IV. SIMULATIONS

Gauss-Markov sources and sampled speech sequences were used to design and test PVQ systems. The simple squared error distortion was chosen as the distortion measure for both sources. The performance of the systems is given by the signal-to-quantization-noise ratio, or SNR, defined as the inverse of the normalized average distortion on a logarithmic scale.

Gauss-Markov Sources: A Gauss-Markov source, or first-order Gauss autoregressive source, {X_n} is defined by the difference equation

    X_{n+1} = a X_n + W_n,

where a is the autoregression constant and {W_n} is a zero mean, unit variance, independent, identically distributed Gaussian source. This source is of interest since it is a popular guinea pig for data compression systems and since its optimal performance, given by the rate-distortion function, is known. We here consider only the highly correlated case of a = 0.9 with transmission rate r = 1 bit/sample. The maximum achievable SNR given by Shannon's distortion-rate function for this source and rate is 13.2 dB [16].

Both the steepest descent algorithm and the stochastic gradient algorithm were used to design first-order PVQ systems from a training sequence of 60 000 samples. All codes were tested on a separate sequence of 60 000 samples.
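For reference, here is a sketch of the source and of the performance measure used in these experiments; the per-sample variance normalization in snr_db is the usual convention and an assumption of this illustration.

    import numpy as np

    def gauss_markov(n_samples, a=0.9, seed=0):
        """Generate a first-order Gauss autoregressive source X_{n+1} = a X_n + W_n,
        with {W_n} zero mean, unit variance, i.i.d. Gaussian."""
        rng = np.random.default_rng(seed)
        x = np.empty(n_samples)
        x[0] = rng.standard_normal()
        for n in range(n_samples - 1):
            x[n + 1] = a * x[n] + rng.standard_normal()
        return x

    def snr_db(x, xhat):
        """SNR in dB: source variance over the average squared error."""
        x, xhat = np.asarray(x), np.asarray(xhat)
        return 10.0 * np.log10(np.var(x) / np.mean((x - xhat) ** 2))

    # e.g. a 60 000-sample training sequence blocked into 4-dimensional vectors:
    # train = gauss_markov(60_000).reshape(-1, 4)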
Table I shows the results for various dimensions.

[TABLE I. VQ versus PVQ for a Gauss-Markov source with correlation coefficient 0.9 at rate 1 bit/sample: signal-to-noise ratio (SNR) of full search memoryless VQ, and signal-to-noise ratios inside the training sequence (SNRin) and outside the training sequence (SNRout) together with the first prediction coefficient a_1 for PVQ designed by the steepest descent algorithm (PVQ(SD)) and by the stochastic gradient algorithm (PVQ(SG)), for vector dimensions k = 1 to 6.]

As expected, the PVQ systems designed by both algorithms outperform the memoryless VQ in all cases. (Note that a memoryless VQ can be considered as a PVQ with a predictor that always produces an all-zero vector; hence, memoryless VQ can be viewed as a special case of PVQ.) The difference in performance starts large at dimension 1 and decreases as the dimension increases. Observe that a simple scalar predictive quantizer achieves the performance of the memoryless VQ with dimension 4, and its performance is the same as that of the analytically optimized predictive quantization system of Arnstein [17] run on the same data. (Arnstein optimized the quantizer for the predictor, but not vice versa.) The test SNR's (SNRout) are within 0.1 dB of the design SNR's (SNRin) in all cases. The good performance of PVQ systems for this source is probably due to the strong similarity between the Gauss-Markov source model and the PVQ structure.

The SD and SG algorithms yield almost the same performance for this source. The designed codebooks and predictors are also close for low dimensions. For large dimensions, however, the codebooks are quite different. As shown in the table, both sets of prediction coefficients are very close to the autoregressive constant of the source. This implies that the autoregressive constant of this source is a good estimate of the prediction coefficient of a PVQ system. Only first-order PVQ systems were considered since the source is a first-order Markov source.

To compare the convergence rates of the SD and SG algorithms, their learning curves of SNR and of the prediction coefficient a_1 for the design of a 2-dimensional, 4-codeword PVQ are shown in Fig. 2. The SNR curves for the SG algorithm represent the signal-to-noise ratios for the partial training sequence from the beginning to the current sample. The learning curves of the SD algorithm tend to jump from one iteration to the next, while the SG algorithm moves more smoothly. These different convergence paths are possibly the reason for convergence to different local optima and, hence, for the generation of different codes. Both algorithms converged very fast, usually in fewer than 10 iterations or 10 passes.

[Fig. 2. Learning curves for a Gauss-Markov source. (a) Steepest descent algorithm. (b) Stochastic gradient algorithm.]

Sampled Speech: A training sequence of 640 000 samples of speech from five male speakers, sampled at 6.5 kHz, was used to design PVQ systems. The designed systems were then tested on a sequence of 76 800 samples from a sixth male speaker. The training and test sequences are the same as those used in [1] and [18] for easy comparison. Table II shows the results of first-order and second-order PVQ systems designed by the SD and SG algorithms.

[TABLE II. VQ versus PVQ for sampled speech at rate 1 bit/sample: signal-to-noise ratios inside the training sequence (SNRin) of 640 000 speech samples and outside the training sequence (SNRout) of 76 800 speech samples, for full search memoryless VQ (VQ) and for first-order and second-order PVQ designed by the Cuperman-Gersho algorithm (CG), the steepest descent algorithm (SD), and the stochastic gradient algorithm (SG), for vector dimensions k = 1 to 8.]

Both algorithms yield similar results. The differences between the design distortion and the test distortion increase as the dimension increases. Only PVQ systems with dimension up to 6 were considered; since the autocorrelation function decreases with the lag, the accuracy of vector predictors at higher dimensions is reduced. From the table, first-order PVQ's outperform VQ by from 0.5 dB to more than 1 dB on the test sequence. To further improve the performance, second-order PVQ's were designed. The improvement over first-order PVQ's is 0.1 dB to 0.2 dB in general. As with scalar predictive quantizers, the performance of PVQ's tends to saturate when the order exceeds 2, and hence higher order PVQ's were not considered.

In addition to the performance, the computational complexity and the storage requirement are also important properties of a coding system. The complexity of a PVQ can be measured by the number of multiplications per sample. The quantizer complexity, as given in [1], is equal to the number of codewords searched, e.g., 2^{kr} for a full search. The prediction complexity is the number of multiplications needed to generate the predictor output, which equals the order p. Therefore, the overall complexity of a PVQ is m = 2^{kr} + p multiplications per sample. Considering the storage requirement, both the quantizer codebook and the predictor matrix of a PVQ have to be stored. The quantizer storage is k 2^{kr} real values, as shown in [1]. The predictor storage is kp real values, since the predictor matrix is a k x p matrix. Thus, the overall storage requirement is k 2^{kr} + kp real values. Observe that both the complexity and the storage requirement of the quantizer grow exponentially with the dimension, while those of the predictor grow only linearly with the predictor order.
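These two formulas give the entries of Table III directly; a tiny illustrative helper (hypothetical name) makes the comparison explicit:

    def pvq_cost(k, r=1, p=1):
        """Multiplications per sample and stored real values for a full search PVQ:
        complexity = 2^{kr} + p, storage = k 2^{kr} + k p (p = 0 gives memoryless VQ)."""
        n_codewords = 2 ** (k * r)        # rate R = k*r bits per vector
        return n_codewords + p, k * n_codewords + k * p

    # Table III row for k = 6 at 1 bit/sample:
    #   memoryless VQ:    pvq_cost(6, p=0) -> (64, 384)
    #   first-order PVQ:  pvq_cost(6, p=1) -> (65, 390)
    #   second-order PVQ: pvq_cost(6, p=2) -> (66, 396)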
Table III shows the complexity and storage requirements of VQ and PVQ.

TABLE III
Complexity and storage requirement of VQ versus PVQ: number of multiplications per sample (Complexity) and number of stored real values (Storage) for full search memoryless VQ (VQ), first-order PVQ (PVQ1), and second-order PVQ (PVQ2). Rate = 1 bit/sample; k = vector dimension.

    k    Complexity              Storage
         VQ    PVQ1   PVQ2       VQ     PVQ1   PVQ2
    1      2      3      4         2      3      4
    2      4      5      6         8     10     12
    3      8      9     10        24     27     30
    4     16     17     18        64     68     72
    5     32     33     34       160    165    170
    6     64     65     66       384    390    396
    7    128    129    130       896    903    910
    8    256    257    258      2048   2056   2064

Comparing at the same rate and dimension, a PVQ requires only a small increase in complexity and storage over a memoryless VQ. If we compare at the same performance, a PVQ is even more attractive. For example, the performance of a 6-dimensional PVQ is approximately equal to that of an 8-dimensional memoryless VQ, yet the complexity and the storage of the former are much less than those of the latter. Informal listening tests show consistent results: the quality of the test sequence coded by a 6-dimensional PVQ is superior to that of a 6-dimensional memoryless VQ, and the difference between a 6-dimensional PVQ and an 8-dimensional memoryless VQ is inaudible.

The learning curves of the two algorithms for the design of a first-order, 2-dimensional, 4-codeword PVQ are shown in Fig. 3. Both algorithms yield very close SNR and a_1, but their learning curves are quite different. Note the overshoot of a_1 in the SD algorithm. This implies that the step sizes that are optimal for designing the predictor by itself may be too large for designing the whole PVQ system for sampled speech; a smaller step size may be more suitable to keep the design process stable. In fact, we chose smaller step sizes to design the second-order PVQ systems in our simulations.

[Fig. 3. Learning curves for sampled speech. (a) Steepest descent algorithm. (b) Stochastic gradient algorithm.]

To examine the effect of the assumption that the quantization error is small, we simulated the Cuperman-Gersho algorithm with the same training and test speech sequences. The simulation results are also shown in Table II. As expected, for the smaller SNR's this algorithm yields performance inferior to that of the SD and SG algorithms, but the differences are small and become negligible as the dimension increases. Comparing the prediction coefficients obtained by the different algorithms, they are different but have similar shapes. For example, the predictor matrices of a 6-dimensional first-order PVQ are as follows.

    Cuperman-Gersho algorithm:     A^T = [0.78, 0.42, 0.09, -0.17, -0.36, -0.46]
    Steepest descent algorithm:    A^T = [0.70, 0.34, -0.01, -0.28, -0.50, -0.61]
    Stochastic gradient algorithm: A^T = [0.73, 0.39, 0.08, -0.20, -0.42, -0.54]

Although the prediction coefficients are different, they all yield very close performance.

Finally, we designed PVQ systems for fixed predictors in order to examine the sensitivity of the performance to the prediction coefficients. In these simulations, vector quantizers were designed by the generalized Lloyd algorithm for fixed predictors whose prediction coefficients were randomly chosen from a range around the optimal values. Surprisingly, all results show very good performance (within 0.2 dB of the best performance) even when the prediction coefficients are as much as 0.2 away from the optimal values. This shows that the performance of the system is not sensitive to the prediction coefficients; thus, the design of the predictor is not very critical. Even a poorly designed predictor can be compensated for by a well-designed quantizer, provided that the codebook size is large enough.
V. COMMENTS

We have introduced SD and SG algorithms for designing PVQ systems. Experimentally, the SG algorithm yields slightly better performance, but it is not consistently better. The step sizes of the SD algorithm are very easy to choose, since their optimal values for updating the quantizer and the predictor have been derived separately. For the SG algorithm, the choice of step sizes is still a problem: in the simulations, different step size sequences result in different codes and performance, and no formula consistently yields the best performance in all cases. Hence, the SD algorithm is at the moment the easiest to use in practice.

The simulation results show that PVQ provides improvements in performance over memoryless VQ for a given rate, complexity, and storage. For Gauss-Markov sources, the improvement ranges from 6 dB for scalar predictive quantizers to about 1 dB for dimension 6 predictive quantizers. For sampled speech waveforms, the improvement was 2.3 dB for scalar predictive quantizers and ranged from 0.8 dB to 1.4 dB for higher dimensions. Alternatively, PVQ provided approximately the same performance as VQ with only a fraction of the complexity and storage. Although PVQ suffers from the channel error propagation problem, it is not as severe as with other feedback quantizers. Thus, expanding a VQ into a PVQ system is an inexpensive way to improve the performance of VQ.

For low-dimensional PVQ, the gradient algorithms perform better than the other existing algorithms. For higher dimensional PVQ, all algorithms give similar performance, since the optimization of the quantizer adapts it well to a wide range of predictors; nevertheless, the gradient algorithms still yield slightly better performance.

Only nonadaptive PVQ was considered here. An adaptive VQ using one VQ, the model classifier, to adapt a PVQ by selecting one of a collection of predictors should provide better performance and better track locally stationary behavior. Preliminary results of this kind may be found in [3] and [1].

Although the stability conditions and optimal values of the step sizes for updating the quantizer and the predictor separately have been derived, no optimality or convergence properties of the jointly designed algorithms have yet been found.

REFERENCES

[1] R. M. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, pp. 4-29, Apr. 1984.
[2] V. Cuperman and A. Gersho, "Adaptive differential vector coding of speech," in Conf. Rec. GlobeCom 82, Dec. 1982, pp. 1092-1096.
[3] -, "Vector predictive coding of speech at 16 kb/s," IEEE Trans. Commun., vol. COM-33, pp. 685-696, July 1985.
[4] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, Jan. 1980.
[5] J. D. Gibson, S. K. Jones, and J. L. Melsa, "Sequentially adaptive prediction and coding of speech signals," IEEE Trans. Commun., vol. COM-22, pp. 1789-1797, Nov. 1974.
[6] J. G. Dunn, "An experimental 9600-bit/s voice digitizer employing adaptive prediction," IEEE Trans. Commun., vol. COM-19, pp. 1021-1032, Dec. 1971.
[7] E. Ayanoglu and R. M. Gray, "The design of predictive trellis waveform coders using the generalized Lloyd algorithm," 1986, submitted for publication.
[8] -, "The design of joint source and channel trellis waveform coders," 1986, submitted for publication.
[9] J. Foster, R. M. Gray, and M. O. Dunham, "Finite-state vector quantization for waveform coding," IEEE Trans. Inform. Theory, vol. IT-31, pp. 348-359, May 1985.
[10] M. O. Dunham and R. M. Gray, "An algorithm for the design of labeled-transition finite-state vector quantizers," IEEE Trans. Commun., vol. COM-33, pp. 83-89, Jan. 1985.
[11] B. Widrow and S. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[12] R. M. Gray and Y. Linde, "Vector quantizers and predictive quantizers for Gauss-Markov sources," IEEE Trans. Commun., vol. COM-30, pp. 381-389, Feb. 1982.
[13] G. H. Freeman, "The design of time-invariant trellis source codes," in Abstracts 1983 IEEE Int. Symp. Inform. Theory, St. Jovite, P.Q., Canada, Sept. 1983, pp. 42-43.
[14] -, "Design and analysis of trellis source codes," Ph.D. dissertation, Univ. Waterloo, Ont., Canada, 1984.
[15] E. Eweda and O. Macchi, "Convergence of an adaptive linear estimation algorithm," IEEE Trans. Automat. Contr., vol. AC-29, pp. 119-127, Feb. 1984.
[16] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[17] D. S. Arnstein, "Quantization error in predictive coders," IEEE Trans. Commun., vol. COM-23, pp. 423-429, Apr. 1975.
[18] H. Abut, R. M. Gray, and G. Rebolledo, "Vector quantization of speech and speech-like waveforms," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, pp. 423-435, June 1982.

Pao-Chi Chang was born in Taipei, Taiwan, on January 9, 1955. He received the B.S.E.E. and M.S.E.E. degrees from National Chiao Tung University, Taiwan, in 1977 and 1979, respectively.
From 1979 to 1981 he worked at the Chung Shan Institute of Science and Technology, Taiwan, as an Assistant Scientist. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, Stanford University, Stanford, CA. His main research interests are speech coding and data compression.

Robert M. Gray (S'68-M'69-SM'77-F'80) was born in San Diego, CA, on November 1, 1943. He received the B.S. and M.S. degrees from the Massachusetts Institute of Technology, Cambridge, in 1966 and the Ph.D. degree from the University of Southern California, Los Angeles, in 1969, all in electrical engineering.
Since 1969 he has been with Stanford University, Stanford, CA, where he is currently a Professor of Electrical Engineering and Director of the Information Systems Laboratory. His research interests are the theory and design of data compression and classification systems, speech coding and recognition, and ergodic and information theory. He is a coauthor, with L. D. Davisson, of Random Processes (Englewood Cliffs, NJ: Prentice-Hall, 1986).
Dr. Gray is a member of the Board of Governors of the IEEE Information Theory Group and served on that board from 1974 to 1980. He was an Associate Editor of the IEEE Transactions on Information Theory from September 1977 through October 1980, and was the Editor of that Transactions from October 1980 through September 1983. He has been on the program committee of several IEEE International Symposia on Information Theory, and was an IEEE delegate to the Joint IEEE/USSR Workshop on Information Theory in Moscow in 1975. He was Co-recipient with L. D. Davisson of the 1976 IEEE Information Theory Group Paper Award and Co-recipient with A. Buzo, A. H. Gray, Jr., and J. D. Markel of the 1983 IEEE ASSP Senior Award. He was a Fellow of the Japan Society for the Promotion of Science (1981) and of the John Simon Guggenheim Memorial Foundation (1981-1982). In 1984 he was awarded an IEEE Centennial Medal. He is a member of Sigma Xi, Eta Kappa Nu, SIAM, IMS, AAAS, and the Société des Ingénieurs et Scientifiques de France. He holds an Advanced Class Amateur Radio License (KB6XQ).