RECURSIVE FILTERING ON SIMD ARCHITECTURES

Rainer Schaffer, Michael Hosemann, Renate Merker, and Gerhard Fettweis
Department of Electrical Engineering and Information Technology
Dresden University of Technology, Germany
<schaffer, merker>@ias.et.tu-dresden.de, <hosemann, fettweis>@ifn.et.tu-dresden.de

ABSTRACT

Recursive filters are used frequently in digital signal processing. They can be implemented in dedicated hardware or in software on a digital signal processor (DSP). Software solutions are often preferable for their speed of implementation and flexibility. However, contemporary DSPs are mostly not fast enough to perform filtering at high data rates or for large filters. A method to increase the computational power of a DSP without sacrificing efficiency is to use multiple processor elements controlled by the single-instruction multiple-data (SIMD) paradigm. The parallelization of recursive algorithms is difficult because of their data dependencies. We use design methods for parallel processor arrays to derive implementations that can run on a parallel DSP. Furthermore, we focus on the partitioning of the algorithm so that the realization can be used for different architectures. Consequences for the architecture are considered as well.

1. INTRODUCTION

Recursive filters are used frequently in digital signal processing. They are particularly useful if steep filter responses have to be implemented with a low number of filter taps. Recursive structures are also found in adaptive filters and decision-feedback equalizers. Filters can be implemented in dedicated hardware or in software on a digital signal processor (DSP). Software solutions are often preferred for their speed of implementation and flexibility. However, contemporary DSPs are often not fast enough to perform filtering at high data rates or for large filters. In order to increase the computational power, either the clock rate can be raised or multiple processor elements (data paths) can be used. A popular method to increase the computational power of a DSP is to use multiple processor elements controlled by the single-instruction multiple-data (SIMD) paradigm, as in our M3-DSP [1]. However, regular SIMD schemes are unable to cope with the data flows required by recursive filters. Hence, we analyze these filters using design methods for parallel processor arrays. This means that we describe the algorithms by affine recurrence equations (AREs) [2], which can be transformed into uniform recurrence equations (UREs) using known localization tools [3, 4, 5]. A focus of the design is on partitioning the algorithm such that the realization can be used for different architectures and parameters. Based on the results, we outline control structures which extend the SIMD control scheme to cope with recursive filters without requiring excessive overhead as in multiple-instruction multiple-data (MIMD) schemes. These control structures will be implemented in the M5-DSP currently being designed at our institution.

(This research has been funded by the Deutsche Forschungsgemeinschaft, projects A1/SFB 358 and A6/SFB 358.)

2. UNDERLYING DSP ARCHITECTURE

[Fig. 1. Overall architecture of the platform-based DSP: a control part (program control, address generation, DMA) and a scalable data-manipulation part consisting of slices, each with data memory, register file, interconnectivity, and data path.]

We are designing DSP architectures following the concepts presented in [6]. The architecture is derived from a platform by scaling the number of slices and tailoring the functionality of these slices. Additionally, the communication network between these slices has to be considered, since it can require a large amount of chip area and introduce long delays. Figure 1 shows the overall architecture of our platform-based DSP. The data-manipulation part consists of a scalable number of slices, each containing data memory, a register file, a part of the interconnectivity unit (ICU), and a data path. The ICU and data path are tailored to the functionality required by the target algorithms. Such functionality could be an FFT-butterfly-tailored network in the ICU or special arithmetic such as Galois-field operations in the ALU. The data paths are capable of performing the required multiply-accumulate (MAC) operations. The control part performs program control, address generation, and direct memory access (DMA). All slices are controlled by just one program control unit in SIMD fashion. This means that the control overhead remains constant while the number of slices is adjusted to fulfill the computational requirements. However, it also implies limitations on the parallelism that can be exploited in the target application.
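To make the SIMD control scheme concrete, the following minimal Python sketch models one program control unit issuing a single instruction stream to a scalable number of slices. All names (Slice, SimdDsp, the mac operation) are ours and purely illustrative; this is a behavioral sketch of the control principle, not a model of the actual M5 instruction set.

```python
# Behavioral sketch of the SIMD control principle: one control unit,
# one instruction per cycle, executed by every slice on its local data.

class Slice:
    """One data-manipulation slice: local registers and a MAC data path."""
    def __init__(self):
        self.reg = {"acc": 0.0, "a": 0.0, "y": 0.0}

    def execute(self, op):
        if op == "mac":                 # acc += a * y (multiply-accumulate)
            self.reg["acc"] += self.reg["a"] * self.reg["y"]
        elif op == "clear":
            self.reg["acc"] = 0.0

class SimdDsp:
    """All slices receive the same instruction stream (SIMD)."""
    def __init__(self, num_slices=16):  # scalable number of slices
        self.slices = [Slice() for _ in range(num_slices)]

    def issue(self, op):                # control overhead is independent
        for s in self.slices:           # of the number of slices
            s.execute(op)
```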
3. INFINITE IMPULSE RESPONSE FILTERING

The infinite impulse response (IIR) filter is the most familiar recursive filter. It shall be used in the following for the description of the design process. The IIR filter is given by

  y_k = y_k^b + y_k^a = \sum_{l=0}^{L-1} b_l x_{k-l} + \sum_{j=1}^{J-1} a_j y_{k-j},    (1)

where j, k, l ∈ Z and 0 ≤ k < K. The upper bounds of the indices j and l satisfy 3 ≤ L, J ≤ 20, and the upper bound of k satisfies K ≫ 100. The algorithm is split into two parts, y_k^b and y_k^a, which can be executed sequentially. In the remainder of this paper only the recursive component y_k^a shall be discussed; for the FIR component y_k^b, solutions are available. The recursive component of the IIR filter makes a parallel implementation difficult, since each value of y_k^a depends on its predecessors. Hence, consecutive filter outputs cannot be calculated at the same time, as is possible for FIR filters.
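The loop-carried dependence is visible in a direct evaluation of Eq. (1). The following sketch uses our own conventions (a[0] is unused, and x must provide at least K samples); it shows that y[k] cannot be computed before y[k-1], ..., y[k-J+1] are available:

```python
# Direct evaluation of Eq. (1): FIR part y^b plus recursive part y^a.
# The dependence of y[k] on earlier outputs is what prevents computing
# consecutive outputs in parallel.

def iir_direct(x, b, a, K):
    """y_k = sum_l b[l] x[k-l] + sum_j a[j] y[k-j]; a[0] is unused."""
    y = [0.0] * K
    for k in range(K):
        yb = sum(b[l] * x[k - l] for l in range(len(b)) if k - l >= 0)
        ya = sum(a[j] * y[k - j] for j in range(1, len(a)) if k - j >= 0)
        y[k] = yb + ya           # needs y[k-1], ..., y[k-J+1]
    return y
```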
4. MAPPING ON THE ARCHITECTURE

For the calculation of y_k^a, a multiplication of the filter weight a_j with a previously determined y_{k-j} is needed, and these products have to be added up. This MAC operation has to be performed in each index point i = (k, j)^T of the index space I = {i | 0 ≤ k < K ∧ 1 ≤ j < J}.

4.1. Data Dependencies

At the beginning, the data dependencies for the y variable have to be determined. These data dependencies are described by a dependence vector d. The dependence vector, via i_2 = i_1 + d, gives the distance between the index point i_1, where the variable is produced (source), and the index point i_2, where the variable is used (destination). For the y variable, three purviews have to be distinguished:

Calculation of y = y^c: The calculation of the y values can be realized with increasing or decreasing index j. Thus two data dependencies d_yc ∈ {(0, 1)^T, (0, -1)^T} are possible. These are illustrated with the dark blue arrows in Figure 2.

Propagation of the y-values: We obtain two data dependencies d_yp ∈ {(1, 1)^T, (-1, -1)^T} for the value y = y^p with index k - j that has to be propagated. A further analysis shows that only the data dependency d_yp = (1, 1)^T can be applied: if d_yp = (-1, -1)^T were used, y_{k+1} would have to be calculated before y_k, which contradicts the calculation of y_{k+1}, where y_k is needed. In Figure 2 these data dependencies are drawn as dark red arrows.

Transfer of the y-values: The calculation of the y_k^c values is finished in the index point i = (k, 1)^T or i = (k, J-1)^T, depending on d_yc. These results have to be transferred to the starting point of the propagation, i = (k+1, 1)^T. We obtain two data dependencies d_yt ∈ {(1, 0)^T, (1, -J+2)^T}, depending on d_yc. In Figure 2 both possible data dependencies are shown for y_1 with the thick cyan arrows.

[Fig. 2. Data dependencies for the IIR filter: the index space (k, j) with the calculation (y^c), propagation (y^p), and transfer dependencies.]

In the following, the data dependency d_yt = (1, 0)^T is used, from which it follows that d_yc = (0, -1)^T is needed. Additionally, we have the data dependency d_a of the independent variable a, which is set to d_a = (1, 0)^T. Thus we obtain the following description of the IIR filter as UREs [7]:

  y^c(i) = y^c(i - d_yc) + a(i) · y^p(i),   i ∈ I      (2)
  y^p(i) = y^c(i - d_yt),                   i ∈ I^t    (3)
  y^p(i) = y^p(i - d_yp),                   i ∈ I^p    (4)
  a(i)   = a(i - d_a),                      i ∈ I      (5)

with I^p, I^t ⊂ I, I^p = {i | 0 ≤ k < K ∧ 2 ≤ j < J}, and I^t = {i | 0 ≤ k < K ∧ j = 1}.
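The following sketch is one illustrative reading of the UREs (2)-(5), not the authors' reference implementation: it evaluates y^c and y^p over the index space with zero boundary values, injecting the precomputed FIR part u_k = y_k^b at the transfer point (the UREs themselves describe only the recursive component). Function names and the helper sequence u are ours; a cross-check against the direct recursion follows.

```python
# Illustrative evaluation of the UREs (2)-(5) over the index space
# I = {(k, j) | 0 <= k < K, 1 <= j < J}, with d_yc = (0,-1), d_yt = (1,0),
# d_yp = (1,1). Values outside I are taken as zero, and the precomputed
# FIR part u[k] = y_k^b is added where a finished output leaves the array.

def iir_ure(a, u):
    K, J = len(u), len(a)
    yc, yp = {}, {}                        # y^c(k, j) and y^p(k, j)
    for k in range(K):
        for j in range(J - 1, 0, -1):      # d_yc = (0,-1): j runs J-1 .. 1
            if j == 1:                     # Eq. (3): transfer along d_yt
                yp[k, j] = (u[k - 1] + yc[k - 1, 1]) if k > 0 else 0.0
            else:                          # Eq. (4): propagate along d_yp
                yp[k, j] = yp.get((k - 1, j - 1), 0.0)
            # Eq. (2): accumulate the MAC results along d_yc
            yc[k, j] = yc.get((k, j + 1), 0.0) + a[j] * yp[k, j]
    return [u[k] + yc[k, 1] for k in range(K)]   # y_k = y_k^b + y_k^a

# Cross-check against the direct recursion y_k = u_k + sum_j a_j y_{k-j}:
a = [0.0, 0.5, -0.25, 0.125]               # a[0] unused, J = 4
u = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = []
for k in range(len(u)):
    y.append(u[k] + sum(a[j] * y[k - j] for j in range(1, len(a)) if k >= j))
assert all(abs(p - q) < 1e-12 for p, q in zip(iir_ure(a, u), y))
```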
4.2. Space-Time Transformation

When (date) and where (processor element) the calculation at an index point i is executed is determined by the space-time (ST) transformation. Generally, the ST transformation [8, 9] for an n-dimensional index space is described by

  i = R r = S t + L p,   R = (S, L),   r = (t^T, p^T)^T.    (6)

The matrix R = (S, L) describes a coordinate transformation with S ∈ Z^{n×(n-m)} and L ∈ Z^{n×m}. The new coordinates consist of the processor p ∈ Z^m and the time t ∈ Z^{n-m} for the calculation of each instance of the UREs. For the IIR filter algorithm the time is t ∈ Z and the processor is p ∈ Z.

In the processor array design the constraint ∀d ∈ D: t_d > 0, with d = S t_d + L p_d (D being the set of dependence vectors d of the UREs), ensures causality. This means that all data needed to evaluate an equation of the UREs are available at the evaluation time. This constraint is needed for the data dependencies d_yc and d_yt. If a variable is independent, which means its data are only propagated through the index space, the causality constraint can be relaxed to ∀d_i ∈ D_i ⊂ D: t_{d_i} ≥ 0, d_i = S t_{d_i} + L p_{d_i}. The variables a and y^p are independent, thus D_i = {d_a, d_yp} can be applied.

Various solutions can be found for the ST transformation, but only the two solutions for which the execution time t_max is minimal shall be discussed in the following. The most important parameters are listed in Table 1.

              ST-1          ST-2
  R = (S, L)  (1 1; 0 1)    (0 1; -1 1)
  t_max       J + K - 2     J + K - 2
  p_max       J - 1         K
  r_d,yc      (1, -1)^T     (1, 0)^T
  r_d,yt      (1, 0)^T      (1, 1)^T
  r_d,yp      (0, 1)^T      (0, 1)^T
  r_d,a       (1, 0)^T      (1, 1)^T

  Table 1. Space-time transformations for the IIR filter.

Solution ST-1 has two advantages compared with solution ST-2: the processor array is smaller (p_max = J - 1), and each processor element (PE) is used frequently (K times); only at the beginning and at the end are some PEs idle. The control of the data transfer caused by the data dependency d_yt is needed only for PE p = 1. In Figure 3 this ST transformation (ST-1) is illustrated graphically. Because of the data channels running in opposite directions for r_d,yc and r_d,yp, the ST-transformed index space cannot be partitioned later. In Section 4.3.1, solution ST-1 is explored further regarding its applicability to the M5 architecture.

The potential for partitioning is the main advantage of solution ST-2 (see Table 1), because the realization of the algorithm is not efficient directly after the ST transformation: the number of PEs is high (p_max = K), but only J - 1 PEs are active at any time. In Section 4.3.2, possible improvements through partitioning are discussed. The processor array and the execution order are illustrated in Figure 3. In contrast to ST-1, where PE p = 1 has to realize the data transfer, each PE (p = k) has to perform this task once. In Figure 3 the transfer control is realized with the multiplexers between two PEs.

4.3. Adaptation to the M5 Architecture

Both solutions of the ST transformation shall be adapted to a SIMD architecture. To compare the solutions, the degree of parallelism DOP(t) [10] is used as a measure of parallelism. DOP(t) specifies the number of PEs used in cycle t. From it, the average parallelism AP = (\sum_{t=0}^{t_max-1} DOP(t)) / t_max and the maximum parallelism MP = max_{0 ≤ t < t_max} DOP(t) can be determined, which are measures for the entire algorithm. The sum of all DOP(t) equals the number of index points i of the index space I.

4.3.1. Limitation of Solution ST-1

The ST-transformed index space obtained with ST-1 (see Table 1) cannot be partitioned; the reason is the data paths running in opposite directions. In Figure 3 (R-1) the ST-transformed index space is depicted as a parallelogram. The coordinate axes are the date t and the processor element p. For two y values the data transfer in the index space is illustrated, with the three purviews calculation, transfer, and propagation distinguished. The processor array is drawn in simplified form. From the representation of the ST-transformed index space, MP = J - 1 can be determined. On an architecture with more than J - 1 PEs, some PEs are not needed for the execution of the algorithm. The average parallelism AP = (J-1)·K / (K+J-2) is lower than MP; the reason is the start-up and finishing phases, in which some PEs are idle. If K ≫ J - 1 can be assumed, then AP = (J-1)·K / (K+J-2) = (J-1) / (1 + (J-2)/K) ≈ J - 1 is nearly the MP. For our M5 architecture with 16 slices this realization is most effective for J = 17. If the filter has more than J - 1 = 16 weighting factors a_j, the algorithm cannot be executed on the M5 architecture with the R-1 realization.

[Fig. 3. Design flow for the IIR filter: the ST-transformed index spaces for ST-1 and ST-2 with the resulting processor arrays and realizations R-1 and R-2 (for R-2, partitioned with ϑ^P = 4 into ⌈K/4⌉ partitions).]

4.3.2. Partitioning for Solution ST-2

The solution ST-2 with a processor array of K PEs cannot be used directly on the M5 architecture; a partitioning of the processor array is needed. For this purpose, the locally parallel, globally sequential (LPGS) partitioning [11, 12] shall be used, which preserves the data locality of the full-size processor array. The LPGS partitioning for an architecture with 16 PEs (slices) can be described as follows:

  i = S t + L p = S t + L (κ^P + Θ^P κ̂^P)
    = (0, -1)^T t + (1, 1)^T (κ^P + 16 · κ̂^P)    (7)

with Θ^P = ϑ^P and ϑ^P = 16. For the variables κ^P and κ̂^P we have 0 ≤ κ^P < ϑ^P, 0 ≤ κ̂^P < ⌈K/ϑ^P⌉, and κ^P, κ̂^P ∈ Z. In Figure 3 (R-2) the partitioning is illustrated, whereby the parameter was reduced to ϑ^P = 4 for readability.
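As a small illustration of Eq. (7), the following sketch (function name and the parameter theta are ours) maps an index point i = (k, j) under ST-2 (p = k, t = k - j) to its time step, local slice index κ^P, and partition index κ̂^P:

```python
# Sketch of the LPGS partitioning of Eq. (7) for solution ST-2 (p = k,
# t = k - j): the processor coordinate p of the full-size array is split
# into a local slice index kappa (kept parallel) and a partition index
# kappa_hat (processed serially). theta is the number of physical slices.

def lpgs_map(k, j, theta=16):
    """Map index point i = (k, j) to (time, slice, partition)."""
    p, t = k, k - j              # ST-2 coordinates
    kappa = p % theta            # local slice index, 0 <= kappa < theta
    kappa_hat = p // theta       # partition index, executed serially
    return t, kappa, kappa_hat

# Example with theta = 4: output y_9 (p = 9) is computed on slice 1 of
# partition 2, matching p = kappa + theta * kappa_hat = 1 + 4 * 2.
print(lpgs_map(9, 1, theta=4))   # -> (8, 1, 2)
```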
The partitions κ̂^P have to be processed serially. If we wait for the last execution of a partition before the calculation of the next partition is started (see Figure 4, R-2a), the resulting

  AP = (J-1)·K / ((J+14) · ⌈K/ϑ^P⌉) ≈ ϑ^P · (J-1)/(J+14)

is quite low. At maximum, we obtain an AP of 8.94 for J = 20 and ϑ^P = 16, which is around half of the value that can be achieved. If we use the slices that are idle at the end of a partition κ̂^P for the first calculations of the next partition κ̂^P + 1, the utilization can be improved. This is possible only if the execution time t^p(i) for the calculation of the index point i is greater than or equal to the execution time t(i) in the full-size processor array (t^p(i) ≥ t(i)).

[Fig. 4. Execution of partitions: R-2a (sequential), R-2b (overlapped, ϑ^P ≥ J - 1), R-2c (overlapped, ϑ^P < J - 1).]

For the IIR filter, the number of weighting factors a_j limits the degree of overlapping of the partitions. If J - 1 < ϑ^P, the PEs have to be idle for ϑ^P - J + 1 calculations between the execution of partitions (see Figure 4, R-2b). The AP for such IIR filters is the same as for realization R-1 (AP ≈ J - 1). All PEs are active if J - 1 ≥ ϑ^P; Figure 4, picture R-2c, illustrates this processing. For the AP we obtain

  AP = (J-1)·K / ((J-1) · ⌈K/ϑ^P⌉ + (K-1) mod ϑ^P) ≈ ϑ^P   for J - 1 ≪ K,

where the execution time t_max is determined by the number of partitions ⌈K/ϑ^P⌉, the filter size J - 1, and the non-overlapping last piece (K-1) mod ϑ^P.

5. IMPLEMENTATION ISSUES

The M5-DSP features an architecture in which the data are processed in data vectors x_16 of 16 elements. For realization R-1 the weighting factors a remain the same in a slice while the filtering is performed, i.e., the weight data vector a_16 does not change during the calculation. The y^p value is the same for all slices in a cycle t. The value y_{k-1} must be written to all elements of the vector y^p_16 with a broadcast()¹ instruction; this value, which was calculated in the first slice p = 1 one cycle before, is also needed for the multiplication in that same slice. The element y_{k-1} = y^c_{0,16} has to be selected from vector y^c_16 (instruction select(0)) and stored in memory. For the next calculation the elements of vector y^c_16 have to be shifted one position to the right, with element y^c_{15,16} set to zero (shift1WRight()). This instruction is also known as the Zurich Zip.

¹ In this paper the instructions are set in typewriter letters; an instruction name describes only its functionality.
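The R-1 scheme just described can be paraphrased with these vector instructions. The following sketch is our interpretation, modeled on plain Python lists with one element per slice; the sequence u stands for the precomputed FIR part y^b, and slice e holds the fixed weight a_{e+1}. Each iteration performs select(0), broadcast(), shift1WRight(), and one SIMD MAC:

```python
# Sketch of realization R-1 with the vector instructions named above.
# This mirrors the described data flow, not cycle-accurate M5 code.

def iir_r1(a, u, num_slices=16):
    J = len(a)                              # requires J - 1 <= num_slices
    a_vec = [a[e + 1] if e + 1 < J else 0.0 for e in range(num_slices)]
    yc = [0.0] * num_slices                 # partial sums y^c
    out = []
    for t in range(len(u)):
        y_t = u[t] + yc[0]                  # select(0): finished output ...
        out.append(y_t)                     # ... stored to memory
        yp = [y_t] * num_slices             # broadcast(): same y^p everywhere
        yc = yc[1:] + [0.0]                 # shift1WRight(), the "Zurich Zip"
        yc = [c + w * p for c, w, p in zip(yc, a_vec, yp)]   # SIMD MAC
    return out
```

With the test values from the earlier URE sketch (a = [0.0, 0.5, -0.25, 0.125], u = [1.0, 0.0, ...]), iir_r1 reproduces the direct recursion of Section 3.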
For all three realizations based on R-2, the elements of the data vector a_16 have to be shifted one position higher (shift1WLeft()), and a new value has to be written at position a_{0,16}. In the following, the instructions for the other data vector manipulations are considered.

In the sequential realization (R-2a), the calculation of a y value is done entirely in one PE. The data vector y^c_16 remains unchanged. At the beginning of the calculation of a new partition, all elements are y^c_{i,16} = 0. To realize the triangular end of a partition, we input 0 into a_16 for the last calculations of the partition. Hence, we eventually obtain a resulting vector y^c_16 containing all results of the partition. This vector can be written completely to memory (selectall()), as already provided for by the architecture. With the broadcast() instruction, the y^p value is written into vector y^p_16. If the value is not needed in a PE, the factor a_i = 0 ensures that the multiplication yields zero.

The overlapping realization R-2b again requires the broadcast() of the y^p value into vector y^p_16. The calculation of a y value is likewise realized in a single PE, but the result y_k has to be read out directly (at t = k + J - 1) from position k mod ϑ^P of the data vector y^c_16 (select(k%16)). At this position, the starting value y^c = 0 for the next calculation y_{k+16} has to be written (input(0,k%16)).

In realization R-2c, the result again has to be read out directly from the data vector y^c_16 (select(k%16)), and at this position the starting value y^c = 0 for y_{k+16} must be written (input(0,k%16)). Differently from R-2b, the simple broadcast() instruction cannot be used for the y^p broadcast: with ϑ^P < J - 1, two different y values are needed in the PEs, depending on the partition being processed, as Figure 4, picture R-2c, illustrates. We call the instruction that realizes this partial broadcast broadcPart(i); it writes y^1 into y^p_{0,16} = ... = y^p_{i-1,16} and y^2 into y^p_{i,16} = ... = y^p_{15,16}.

         J             AP                   a_16            y^p_16          y^c_16            y_k
  R-1    2, ..., 17    ≈ J - 1              -               broadcast()     shift1WRight()    select(0)
  R-2a   2, ..., +∞    ≈ 16·(J-1)/(J+14)    shift1WLeft()   broadcast()     -                 selectall()
  R-2b   2, ..., 17    ≈ J - 1              shift1WLeft()   broadcast()     input(0,k%16)     select(k%16)
  R-2c   18, ..., +∞   ≈ 16                 shift1WLeft()   broadcPart(i)   input(0,k%16)     select(k%16)

  Table 2. Parameters of the realizations (data vector manipulations and result extraction), ϑ^P = 16.
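A minimal sketch of the partial-broadcast semantics summarized in Table 2 (the vector is modeled as a plain Python list; the function name simply mirrors broadcPart(i)):

```python
# broadcPart(i): slices 0 .. i-1 receive y1 and slices i .. 15 receive y2,
# serving the two partitions that are active at once during the overlap.

def broadc_part(i, y1, y2, num_slices=16):
    return [y1] * i + [y2] * (num_slices - i)

# Example: broadc_part(3, 7.0, 9.0, 8) -> [7.0, 7.0, 7.0, 9.0, 9.0, 9.0, 9.0, 9.0]
```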
6. CONCLUSION

Four different realizations of an IIR filter on a parallel DSP were presented in this paper. Their main parameters are summarized in Table 2. For IIR filters with no more weighting factors (filter taps) than PEs (slices), the realizations R-1 and R-2b yield the highest performance; of those two, R-1 requires less functionality for data transfers. For more weighting factors than slices (J - 1 > ϑ^P), solution R-2c is faster than R-2a with its sequential execution of the partitions. However, solution R-2c requires complex data transfer instructions in which subsets of slices are controlled independently, hence increasing the control effort. Depending on the application, a trade-off has to be found between performance and efficiency.

7. REFERENCES

[1] T. Richter, W. Drescher, F. Engel, S. Kobayashi, V. Nikolajevic, and G. Fettweis, "A platform-based highly parallel digital signal processor," in Proceedings of CICC, 2001.

[2] J. Teich, A Compiler for Application Specific Processor Arrays, Ph.D. thesis, Verlag Shaker, Aachen, Germany, 1993.

[3] V. Roychowdhury, L. Thiele, S.K. Rao, and T. Kailath, "On the localisation of algorithms for VLSI processor arrays," in VLSI Signal Processing III, pp. 459-470, 1989.

[4] U. Eckhardt and R. Merker, "Hierarchical algorithm partitioning at system level for an improved utilization of memory structures," IEEE Transactions on CAD, vol. 18, no. 1, pp. 14-24, Jan. 2000.

[5] J. Rosseel, F. Catthoor, and H. De Man, "An optimisation methodology for array mapping of affine recurrence equations in video and image processing applications," in Proc. Conf. on Appl.-Spec. Array Proc., Aug. 1994.

[6] M. Weiss, F. Engel, and G.P. Fettweis, "A new scalable DSP architecture for system on chip (SOC) domain," in Proceedings of ICASSP'99, Phoenix, AZ, Apr. 1999, vol. 4, pp. 1945-1948.

[7] R.M. Karp, R.E. Miller, and S. Winograd, "The organization of computations for uniform recurrence equations," J. ACM, July 1967.

[8] S.K. Rao, Regular Iterative Algorithms and their Implementations on Processor Arrays, Ph.D. thesis, Stanford University, 1985.

[9] U. Eckhardt, Algorithmus-Architektur-Codesign für den Entwurf digitaler Systeme mit eingebettetem Prozessorarray und Speicherhierarchie, Ph.D. thesis, Dresden University of Technology, Germany, June 2001.

[10] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, New York, 1993.

[11] J. Teich and L. Thiele, "Partitioning of processor arrays: A piecewise regular approach," INTEGRATION: The VLSI Journal, vol. 14, no. 3, pp. 297-332, 1993.

[12] U. Eckhardt and R. Merker, "Co-partitioning: A method for hardware/software codesign for scalable systolic arrays," in Reconfigurable Architectures, R. Hartenstein and V. Prasanna, Eds., pp. 131-138, IT Press, Chicago, 1997.