High Speed FIR-Filter Architectures with Scalable Sample Rates
Martin Vaupel, Heinrich Meyr

Abstract

FIR (finite impulse response) filters are widely used in digital signal processing. In this paper, new architectures for high speed FIR filters with programmable coefficients are presented. Special efforts are undertaken to develop a structure that is well suited to different data rates and may therefore be used within a tool (filter generator) that generates demand-driven, dedicated filter structures. The presented structure leads to highly efficient designs that are usable within different environments. The basic design structure is introduced and implementation considerations are discussed. Results of synthesis runs are presented.

1 Introduction

FIR (finite impulse response) filters are widely used in digital signal processing. Their applications often demand high speed computation. To satisfy these requirements, dedicated and efficient filter architectures for each target domain are needed. In order to free the system designer from designing a filter architecture with unique properties (number of taps, length of input words and coefficients) for each application, considerable efforts have been undertaken to develop filter generators that are able to deal with a variety of these implementation parameters and generate an efficiently scalable architecture [1,2,3,4,5]. Only a small part of this work is concerned with filters with programmable coefficients, which are used e.g. within equalizers or adaptive filters. However, each of these generators delivers an architecture with one fixed sample rate only. If this does not meet the system designer's actual specifications, the designer has to accept an efficiency loss or design a new architecture satisfying the requirements, giving up the advantages of a generic design approach.

An interesting approach to realize programmable coefficients is reported by Khoo et al. [6]. Their design enables filtering at different sample rates.
However, this effect was not originally intended: it is a side effect of the way the coefficients are encoded and cannot be controlled independently. Another drawback is that the coefficient values cannot be updated during filtering in such a way that only values of the same set of coefficients contribute to the output value at each time instance (synchronous update). Other designs of high speed programmable FIR filters with a fixed sample rate [7,8] suffer from the same drawback.

(Footnote: The authors are with Aachen University of Technology, ISS 611810, Templergraben 55, 52056 Aachen, Germany, Tel.: +49-241807880, Fax: +49-241-8888195, email: vaupel@ert.rwthaachen.de. This work was partially supported by DFG under contract no. Me651/12-1.)

Noll et al. [9] have provided a full custom architecture of a programmable FIR filter, which is best suited to high speed applications. It is based on a semi-systolic array of full adders with carry-save arithmetic. With this architecture, synchronous coefficient update is possible.

Our goal was to develop a fully synchronous design based on standard cells. This implies a lower importance of regularity compared to a full custom approach. Therefore some restrictions on the selection between different architectural alternatives are relaxed, and the designer can use, e.g., special irregularities in order to decrease the overall area. Based on Noll's architecture we developed an approach to deliver efficient high speed programmable FIR filters suitable for use within a relatively large range of sample rates. We changed the known architecture to meet the special requirements of a scalable solution and to obtain an additional efficiency gain of approximately 30% compared to an equivalent implementation of [9] with standard cells and a single phase clock.

The paper is organized as follows: In Section 2, different approaches for the implementation of programmable high speed filters are considered and the new strategy is developed.
Section 3 is concerned with implementation aspects that reduce the chip area, and the novel architecture is explained in detail. In Section 4, general aspects and results of synthesis runs are discussed. Final remarks and an outlook on further work conclude this paper.

2 Algorithm

An FIR filter with $L$ taps is described by its transfer function:

$$
G(z) = \frac{Y(z)}{X(z)} = \sum_{i=0}^{L-1} c_i z^{-i} \tag{1}
$$

Let us assume fixed and positive coefficients first. Each coefficient $c_i$ can be split into $m$ single bits $c_i^j$. Then the transfer function is of the form

$$
G(z) = \sum_{i=0}^{L-1} z^{-i} \sum_{j=0}^{m-1} c_i^j 2^j
\quad\text{with}\quad
c_i = \sum_{j=0}^{m-1} c_i^j 2^j \tag{2}
$$

For a filter with three coefficients of four bits each, this can be written as:

$$
\begin{aligned}
G(z) = {} & \bigl(c_0^0 + 2(c_0^1 + 2(c_0^2 + 2 c_0^3))\bigr) \\
 & + z^{-1}\bigl(c_1^0 + 2(c_1^1 + 2(c_1^2 + 2 c_1^3))\bigr) \\
 & + z^{-2}\bigl(c_2^0 + 2(c_2^1 + 2(c_2^2 + 2 c_2^3))\bigr)
\end{aligned} \tag{3}
$$

Figure 1: Flow chart of an accumulation free filter (legend: shift right 1 bit, shift left 3 bit, possible pipeline slice)

Figure 2: Structure of a fully pipelined filter with bitplanes

An implementation of this form can be realized with a structure like Fig. 1 (transposed direct form). Each adder is implemented by a row of 1-bit full adder cells, whose number equals the actual word length of the intermediate result. The multipliers (triangles in Fig. 1) compute the partial products and are in fact simple AND gates. Between rows of adder cells a hard-wired shift of the result is performed (shift and add). The output of one tap is fed into free adder inputs of the following tap. Therefore no further explicit adders are needed.
This arrangement is known as an accumulation free filter [10]. In order to speed up the architecture, pipeline slices can be implemented by inserting registers behind each adder and correspondingly into the input line (dotted squares in Fig. 1).

A more efficient solution in terms of area is to reorder the additions: 1) adding all partial products of the lowest weight first, 2) performing a shift of the result, and 3) adding the partial products of the next weight. This results in so-called bitplanes [11]. Because the lowest-weight partial products are added first, the upper bound on the value of the intermediate results grows more slowly from left to right than in the structure of Fig. 1. This leads to a design that has the minimum possible number of adder cells, due to the slower increase of the word length required in each row. Another advantage is that after each bitplane the lowest bit of the result is computed completely and may be truncated, if desired, without side effects on the upper bits. The corresponding form of the transfer function is

$$
\begin{aligned}
G(z) = \bigl( & c_0^3 2^3 z^{-9} + z^{-1}( c_1^3 2^3 z^{-9} + z^{-1}( c_2^3 2^3 z^{-9} \\
 & + z^{-1}( c_0^2 2^2 z^{-6} + z^{-1}( c_1^2 2^2 z^{-6} + z^{-1}( c_2^2 2^2 z^{-6} \\
 & + z^{-1}( c_0^1 2^1 z^{-3} + z^{-1}( c_1^1 2^1 z^{-3} + z^{-1}( c_2^1 2^1 z^{-3} \\
 & + z^{-1}( c_0^0 2^0 z^{-0} + z^{-1}( c_1^0 2^0 z^{-0} + z^{-1}( c_2^0 2^0 z^{-0} ) \cdots \bigr)
\end{aligned} \tag{4}
$$

The introduced pipeline steps result in an increased latency of the filter. The corresponding flow graph is outlined in Fig. 2.

To reduce chip area at the cost of decreased throughput, Noll has introduced modified bitplanes, which are a mixture of the two approaches considered so far. Instead of pipelining each adder, pipeline registers are inserted after every second adder in Fig. 1 only, and the structure is rearranged according to Fig. 3. The number of adder cells between registers will be called the pipeline depth in the following. The drawback of this solution is that the minimum number of cells required cannot be reached.
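The bitplane ordering can be checked numerically. The following Python sketch is a behavioural model only, not the authors' implementation: it ignores pipeline registers and latency, assumes non-negative coefficients of m bits as in Eq. (2), and the names fir_direct and fir_bitplanes are hypothetical. It adds the partial products of the lowest weight first, shifts the result, and records each bit that is shifted out as already final.

```python
def fir_direct(x, coeffs):
    """Reference output following Eq. (1): y[n] = sum_i c_i * x[n - i]."""
    y = []
    for n in range(len(x)):
        y.append(sum(c * x[n - i] for i, c in enumerate(coeffs) if n - i >= 0))
    return y


def fir_bitplanes(x, coeffs, m):
    """Same filter, evaluated bitplane by bitplane, lowest weight first.

    Assumption: every coefficient is a non-negative integer below 2**m.
    """
    y = []
    for n in range(len(x)):
        acc = 0        # running sum, shifted right once per bitplane
        low_bits = 0   # output bits that are already complete
        for j in range(m):
            # Partial products of weight 2**j: x[n - i] AND coefficient bit j.
            plane = sum(x[n - i] for i, c in enumerate(coeffs)
                        if n - i >= 0 and (c >> j) & 1)
            acc += plane
            low_bits += (acc & 1) << j   # bit j of the result is now final
            acc >>= 1                    # shift between bitplanes
        y.append((acc << m) + low_bits)  # recombine upper and lower bits
    return y
```

For the three-tap, four-bit case used in the text, `fir_bitplanes([1, 2, 3, 4, 7], [5, 3, 6], 4)` returns the same list as `fir_direct([1, 2, 3, 4, 7], [5, 3, 6])`, which illustrates that reordering the additions does not change the filter output.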
To avoid this disadvantage of the modified bitplanes, our new approach retains the underlying structure of Fig. 2 as it is but alters the scheduling of the input words to the inputs of the multipliers accordingly. This leads to considerable area savings, especially at larger pipeline depths, due to the lowest possible increase of the word length. For instance, an implementation of a filter with a typical parameter set (eight coefficients, input and coefficient word lengths of eight and four bits, respectively) and pipeline depth four needs 33% more adder cells if implemented according to Fig. 1, and 11% more adder cells if implemented according to Fig. 3, compared to the newly introduced structure. The transfer function now has the following form (shown for pipeline depth 2):

$$
\begin{aligned}
G(z) = \bigl( & c_0^3 2^3 z^{-4} + c_1^3 2^3 z^{-5} \\
 & + z^{-1}( c_2^3 2^3 z^{-5} + c_0^2 2^2 z^{-3} \\
 & + z^{-1}( c_1^2 2^2 z^{-3} + c_2^2 2^2 z^{-4} \\
 & + z^{-1}( c_0^1 2^1 z^{-1} + c_1^1 2^1 z^{-2} \\
 & + z^{-1}( c_2^1 2^1 z^{-2} + c_0^0 2^0 z^{-0} \\
 & + z^{-1}( c_1^0 2^0 z^{-0} + c_2^0 2^0 z^{-1} ) \cdots \bigr)
\end{aligned} \tag{5}
$$

The corresponding structure is shown in Fig. 4.

Figure 3: Modified bitplanes at pipeline depth 2

Figure 4: The structure with minimum number of cells at pipeline depth 2

3 Architecture

In order to decrease the area without increasing the minimum clock period, we have implemented (modified) Booth encoding of the coefficients. This reduces the length of the array by a factor of two and increases the array width by one bit. Using the newly proposed architecture, a synchronous update of the coefficients is still possible. A group of three successive coefficient bits is encoded into two magnitude bits and one sign bit.
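As an illustration of the encoding step, the sketch below shows a standard radix-4 modified Booth recoding in Python. It is an assumption that the paper's encoder follows this textbook scheme exactly. The function names, the (sign, one, two) bit assignment, and the choice of sign = 0 for a zero digit are hypothetical and used only for illustration. Each digit in {-2, -1, 0, 1, 2} is formed from three successive coefficient bits with groups overlapping by one bit, so an m-bit coefficient yields m/2 digits.

```python
def booth_digits(c, m):
    """Radix-4 modified Booth digits of an m-bit two's complement integer c.

    Assumption: m is even and -2**(m-1) <= c < 2**(m-1).
    Returns m // 2 digits, least significant first, each in {-2, ..., 2}.
    """
    bits = [(c >> k) & 1 for k in range(m)]  # two's complement bits of c
    digits = []
    prev = 0                                 # implicit bit below the LSB
    for k in range(0, m, 2):
        digits.append(-2 * bits[k + 1] + bits[k] + prev)
        prev = bits[k + 1]
    return digits


def encode(digit):
    """Hypothetical encoding: (sign, one, two) = one sign and two magnitude bits."""
    return int(digit < 0), int(abs(digit) == 1), int(abs(digit) == 2)


def partial_product(x, digit):
    """Model of a modified partial product gate: select 0, x or 2x, then invert."""
    sign, one, two = encode(digit)
    magnitude = x * one + (x << 1) * two       # shift selects 2x
    pp = ~magnitude if sign else magnitude     # one's complement if negative
    return pp, sign                            # sign is added later in the row


def booth_multiply(x, c, m):
    """Sum of partial products; adding the sign bit completes the two's complement."""
    total = 0
    for k, digit in enumerate(booth_digits(c, m)):
        pp, sign = partial_product(x, digit)
        total += (pp + sign) << (2 * k)
    return total
```

For six-bit coefficients this yields three digits per coefficient instead of six bits. In this model `booth_multiply(x, c, 6)` equals `x * c` for every coefficient in the six-bit two's complement range.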
The encoded bits are fed into modified partial product gates, which are able to perform a shift and a one's complement (inversion) of the input bits depending on the values of the encoded bits. The resulting structure of a filter with three taps, six-bit coefficients and an input word length of three bits is outlined in Fig. 5 for a pipeline depth of two. This architecture is able to deal with two's complement numbers as inputs and coefficients. It consists mainly of full adder cells and registers. The input bits are fed into the array in parallel. Note that the input bits are drawn for the first bitplane only.

In rows where the word length of the intermediate result need not be increased, so-called carry-overflow-correction (COC) cells [9,12] are implemented. These are necessary to correct an overflow of the carry word, which is possible even though the sum of carry and sum word fits into the given word length. This leads to considerable savings in terms of the required word length. Another advantage of using COC cells together with the proposed bitplane structure is that at most one bit of sign extension is necessary in each row. Therefore no additional buffers are needed to drive large sign extension lines (cf. [2]). Within the rightmost full adder cell of each row, the sign bit of the preceding Booth encoding cell is added in order to complete the computation of the two's complement. Postponing this operation decreases the word length as well.

The upper bits of the final sum and carry words are added using a vector merging adder (VMA). It likewise consists of a pipelined array of full and half adder cells. The internal structure of the VMA is changed for different pipeline depths to reach the minimum possible area.

In order to shorten the critical path, additional registers are inserted after the modified partial product gates of rows 1, 3, 5, ... This has to be taken into account when delaying the input words correctly.
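The role of the sum word, the carry word and the vector merging adder can be illustrated with a small Python model. This is a simplified sketch under stated assumptions: it works on unbounded non-negative integers, so it does not model fixed word lengths, COC cells or pipelining, and the function names are hypothetical. Each call to carry_save_add corresponds to one row of full adders with no carry propagation along the row. Only the final vector_merge performs a carry-propagating addition.

```python
def carry_save_add(sum_word, carry_word, operand):
    """One row of full adders: three input words in, sum and carry words out."""
    new_sum = sum_word ^ carry_word ^ operand
    # The majority function gives the carry bits, which have the next higher weight.
    new_carry = ((sum_word & carry_word)
                 | (sum_word & operand)
                 | (carry_word & operand)) << 1
    return new_sum, new_carry


def carry_save_accumulate(operands):
    """Accumulate all operands while keeping the result in carry-save form."""
    sum_word, carry_word = 0, 0
    for operand in operands:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, operand)
    return sum_word, carry_word


def vector_merge(sum_word, carry_word):
    """Final carry-propagating addition of sum and carry words (role of the VMA)."""
    return sum_word + carry_word
```

In this model the delay of each row is that of a single full adder regardless of word length, which is why the critical path can be counted in full adder cells.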
As a result of these additional registers, for each pipeline depth pd the critical path consists of pd full adder cells only.

Figure 5: Detailed structure of the new filter architecture (number of taps: 3, word length of coefficients: 6, word length of inputs: 3, pipeline depth: 2; cell types shown: modified partial product gate, modified Booth encoder, full adder, half adder, full adder with carry overflow correction (COC), register, VMA)

4 Implementation results

As the array is semi-systolic, the achievable sample rate becomes independent of the desired implementation parameters, because communication between cells is mainly local and broadcast of data is avoided. This frees the system designer from considering an additional parameter during system optimization.

The structural description of the filter was written in VHDL. It is fully parameterizable in the number of taps and the word lengths of inputs and coefficients. Furthermore, due to the generic properties of VHDL, it was possible to describe the different structures (depending on the value of the pipeline depth) within a single 'architecture'. For each desired sample rate, a C program computes the scheduling of the delayed input words and writes the results into a VHDL package. Within the VHDL architecture, the underlying global structure and the insertion of additional registers after partial product gates are described. Information from package and entity/architecture is fed into a synthesis tool, which, steered by a synthesis script, produces a netlist.

Figure 6: Area over minimum clock period at different pipeline depths (axes: area / mm^2 over clock period / ns; number of taps: 8, word length of input: 6, word length of coefficients: 6; curves: implementation according to [9] and new architecture, with points labeled by pipeline depth pd; reference curve A*T = const.)

In Fig. 6, accumulated cell areas and sample rates for different pipeline depths are shown (8 taps, 6-bit coefficients and input words). The library used was the ES2 1 µm CMOS standard cell library. Synthesis was undertaken with SYNOPSYS, with operating conditions set to worst case. It can easily be seen that our approach is well suited to deliver highly efficient designs for different requirements. Note the almost constant area-time product between 10 ns and 25 ns clock period. For applications with sample rates below 40 MHz the efficiency decreases slowly; in this case, resource sharing should therefore be considered. The proposed architecture leads to considerable savings in area compared to a direct approach following [9]. (Comparing the points for pd = 3 and pd = 4 of the upper curve, the effect of the discussed suboptimal number of adder cells is visible, because it is not covered by savings in terms of registers of pipeline slices.)

For different sets of parameters the curves are shifted vertically only, which means that the sample rate is a function of the pipeline depth exclusively. Functional parameters and the implementation parameters that affect throughput are independent. Additionally, due to the mainly regular array structure, the area can be estimated well without the need for synthesis runs. These are advantages especially with regard to using the proposed structure within a system design environment.

5 Conclusion

In this paper a novel architecture of a programmable high-speed digital FIR filter was proposed, which results in efficient designs. The advantages of Noll's architecture (high data rates, synchronous updating of the coefficients, and mainly local communication within the array, and therefore independence of the maximum sample rate from the actually chosen functional parameters) are preserved. Beyond this, the newly proposed structure is well suited for throughput scaling and additionally saves about 30 percent of area compared to a straightforward semi-custom realization of [9] without decreasing the possible throughput. Splitting the flip-flops into edge triggered latches, together with a two phase clock, is possible. Further investigations have shown that additional area savings are possible for applications with linear phase.

References

[1] R. Jain, P. Yang, and T. Yoshino, "FIRGEN: A Computer-Aided Design System for High Performance FIR Filter Integrated Circuits," IEEE Transactions on Signal Processing, vol. 39, pp. 1655-1668, July 1991.
[2] R. Hawley, T. Lin, and H. Samueli, "A silicon compiler for high-speed CMOS multirate FIR digital filters," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 3, pp. 1348-1351, May 1992.
[3] R. Hartley, P. Corbett, P. Jacob, and S. Karr, "A High Speed FIR Filter Designed by Compiler," in Proceedings of the Custom Integrated Circuits Conference, (San Diego), pp. 20.2.1-20.2.4, 1989.
[4] P. Cappello and C. Wu, "Computer-aided Design of VLSI FIR Filters," Proceedings of the IEEE, vol. 75, pp. 1260-1271, September 1987.
[5] F. F. Yassa, J. R. Jasica, R. I. Hartley, and S. R. Noujaim, "A Silicon Compiler for Digital Signal Processing: Methodology, Implementation, and Applications," Proceedings of the IEEE, vol. 75, pp. 1272-1282, September 1987.
[6] K.-Y. Khoo, A. Kwenuts, and A. N. Willson, "An Efficient 175MHz Programmable FIR Digital Filter," in Proceedings of International Symposium on Circuits and Systems, pp. 72-75, 1993.
[7] C. Joanblanq et al., "A 54 MHz CMOS Programmable Video Signal Processor for HDTV Applications," IEEE Journal on Solid State Circuits, pp. 730-734, 1990.
[8] M. Hatamian and S. K. Rao, "A 100 MHz 40-Tap Programmable FIR Filter Chip," in Proceedings of International Symposium on Circuits and Systems, (New Orleans), pp. 3053-3056, 1990.
[9] T. G. Noll, "Semi-Systolic Maximum Rate Transversal Filters with Programmable Coefficients," in Systolic Arrays (W. M. et al., ed.), pp. 103-112, Bristol: Adam Hilger, 1987.
[10] P. R. Cappello and K. Steiglitz, "A Note on 'free accumulation' in VLSI Filter Architectures," IEEE Trans. on Circuits and Systems, vol. CAS-32, pp. 291-296, March 1985.
[11] P. B. Denyer and D. Myers, "Carry-Save Arrays for VLSI Signal Processing," in Proc. of first Int. Conf. on VLSI, (Edinburgh), pp. 151-160, Aug. 1981.
[12] T. G. Noll, "Carry-Save Architectures for High-Speed Digital Signal Processing," J. VLSI Signal Processing, no. 1-2, pp. 121-140, 1991.