An Efficient Softcore Multiplier Architecture for Xilinx FPGAs

22nd IEEE Symposium on Computer Arithmetic
Martin Kumm, Shahid Abbas and Peter Zipf
University of Kassel, Germany
CONTENTS
1. State-of-the-art
2. Proposed multiplier
3. Results
WHY FPGA SOFTCORE MULTIPLIERS?
The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks
FPGA softcore multipliers are still required for:
Small word sizes (poor mapping onto embedded multipliers)
Large word sizes ("fill gaps")
Replacing embedded multipliers on small/low-cost FPGAs
WHY THEY ARE DIFFERENT?
Research on efficient multipliers has been ongoing for more than 50 years
Multipliers that are efficient in terms of gates may not be efficient on FPGAs
FPGA-optimized structures are relatively rare
WHY THEY ARE DIFFERENT?
[Figure: Xilinx slice, 6/7 series]
PREVIOUS WORK
A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]
Two partial products are generated and added using the carry chain
A compression tree is still necessary for the already reduced partial products
[Figure: four LUTs connected to the carry logic of a slice; one stage is marked as a full adder]
PREVIOUS WORK
Another idea was discussed in [Brunie 2013]:
Decompose multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4
Use a compression tree to add partial results
\[ p = M_1 + 2^3 M_2 + 2^6 M_3 + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9 \]
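The decomposition above can be checked numerically. The following is a minimal sketch, assuming an unsigned 9x9-bit multiplication split into nine 3x3 tiles; the function name `tiled_multiply` and the loop order are illustrative and are not taken from [Brunie 2013].

```python
def tiled_multiply(a: int, b: int, width: int = 9, tile: int = 3) -> int:
    """Multiply two unsigned `width`-bit integers from tile x tile sub-products."""
    mask = (1 << tile) - 1
    total = 0
    for i in range(0, width, tile):        # bit offset of the chunk taken from a
        for j in range(0, width, tile):    # bit offset of the chunk taken from b
            a_chunk = (a >> i) & mask
            b_chunk = (b >> j) & mask
            # Small product M_k, weighted by 2^(i+j) as in the sum above.
            total += (a_chunk * b_chunk) << (i + j)
    return total
```

In hardware, the additions performed by the loop would be carried out by the compression tree; the sketch only confirms that the weighted tile products add up to the full product.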
BOOTH RECODING
\[ a \cdot b = \sum_{m=0,\ m\ \text{even}}^{M} a \cdot BE_m \, 2^m \]

b_{m+1}  b_m  b_{m-1} | BE_m | z_m  c_m  s_m
   0      0      0    |   0  |  1    0    0
   0      0      1    |   1  |  0    0    0
   0      1      0    |   1  |  0    0    0
   0      1      1    |   2  |  0    0    1
   1      0      0    |  -2  |  0    1    1
   1      0      1    |  -1  |  0    1    0
   1      1      0    |  -1  |  0    1    0
   1      1      1    |   0  |  1    0    0
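The recoding table can be reproduced in a few lines. This is a minimal sketch, assuming a two's-complement operand b with b_{-1} = 0 and sign extension above the most significant bit; the names `booth_digits` and `booth_multiply` are hypothetical, and reading z_m, c_m and s_m as "zero", "complement" and "shift" flags is an interpretation of the table rather than something the slides state.

```python
def booth_digits(b: int, bits: int):
    """Yield (m, BE_m, z_m, c_m, s_m) for a two's-complement `bits`-bit value b."""
    pattern = b & ((1 << bits) - 1)        # two's-complement bit pattern of b

    def bit(k: int) -> int:
        if k < 0:
            return 0                       # assumed: b_{-1} = 0
        if k >= bits:
            k = bits - 1                   # assumed: sign extension
        return (pattern >> k) & 1

    for m in range(0, bits, 2):
        be = -2 * bit(m + 1) + bit(m) + bit(m - 1)
        z = int(be == 0)                   # rows 000 and 111 of the table
        c = int(bit(m + 1) == 1 and be != 0)  # rows 100, 101 and 110
        s = int(abs(be) == 2)              # rows 011 and 100
        yield m, be, z, c, s


def booth_multiply(a: int, b: int, bits: int) -> int:
    """Sum the recoded partial products a * BE_m * 2^m."""
    return sum(a * be * (1 << m) for m, be, _, _, _ in booth_digits(b, bits))
```

For example, `list(booth_digits(-3, 4))` gives the digits BE_0 = 1 and BE_2 = -1, and `booth_multiply(a, b, 4)` matches `a * b` for every 4-bit two's-complement b.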
BOOTH MULTIPLIER
[Figure: partial-product array of the Booth multiplier, LSB to MSB, with the bits c0, c2, c4 and c6 replicated across each row]
[Figure: the same array with the replicated bits replaced by constant 1s and a single c bit per row]
PROPOSED ARCHITECTURE
[Figure: slice mapping of the proposed multiplier, with four LUTs connected to the carry logic; one stage is marked as a full adder]
RESULTS
The number of slices can be precisely predicted:
\[ \#\text{slices}(M, N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}} \]
The design was implemented as generic VHDL
A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost
Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]
Xilinx Coregen was used as a commercial reference
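The slice-count formula is easy to evaluate for a given operand size. The following is a minimal sketch; the function name `predicted_slices` is hypothetical, and integer arithmetic is used in place of the ceiling and floor operators.

```python
def predicted_slices(m_bits: int, n_bits: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = -(-n_bits // 4) + 1   # ceil(N/4) + 1, equal to ceil(N/4 + 1)
    rows = m_bits // 2 + 1                 # floor(M/2) + 1, equal to floor(M/2 + 1)
    return slices_per_row * rows


for size in (8, 16, 32, 64):
    print(size, predicted_slices(size, size))
```

For M = N = 16 the formula gives 5 slices per row and 9 rows, i.e. 45 slices.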
RESULTS VIRTEX 6
COMBINATORIAL, SLICES
[Plot: #Slices (0 to 2,000) versus input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]
RESULTS VIRTEX 6
COMBINATORIAL, SLICE RED.
[Plot: slice reduction (%) versus input word size N (8 to 64), relative to the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area) and Coregen (speed)]
RESULTS VIRTEX 6
COMBINATORIAL, FREQ.
[Plot: frequency [MHz] (0 to 700) versus input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]
RESULTS VIRTEX 6
PIPELINED, SLICES
[Plot: #Slices (0 to 2,000) versus input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]
RESULTS VIRTEX 6
PIPELINED, SLICE RED.
[Plot: slice reduction (%) versus input word size N (8 to 64), relative to the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area) and Coregen (speed)]
RESULTS VIRTEX 6
PIPELINED, FREQ.
[Plot: frequency [MHz] (0 to 700) versus input word size N (8 to 64) for the 1x4, 3x2 and 3x3 LUT multipliers, the Parandeh-Afshar multiplier, Coregen (area), Coregen (speed) and the proposed multiplier]
UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS
[Figure: Altera ALM]
MAYBE POSSIBLE NEXT?
CONCLUSION
Compared to the best known design, up to
50% of the slices can be saved for the combinatorial multiplier
30% of the slices can be saved for the pipelined multiplier
Portable to FPGAs providing a 5-input LUT at one full adder input
"Free addition" supports multiply-accumulate (MAC) operation
THANK YOU!
LITERATURE
[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011
[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, Arithmetic Core Generation Using Bit Heaps, FPL 2013
[de Dinechin 2012]: de Dinechin & Pasca, Designing Custom Arithmetic Data Paths with FloPoCo, IEEE Design & Test of Computers, 2012
BOOTH RECODING
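\begin{align*}
b &= -b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + b_1 2^1 + b_0 \\
  &= -b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + 2 b_1 2^1 \underbrace{{} - b_1 2^1 + b_0}_{BE_0 = -2 b_1 + b_0} \\
  &= -b_{M-1} 2^{M-1} + \ldots + 2 b_3 2^3 \underbrace{{} - b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{BE_2 2^2 = (-2 b_3 + b_2 + b_1) 2^2} + BE_0 \\
  &= \sum_{m=0,\ m\ \text{even}}^{M} BE_m 2^m
  \quad \text{with} \quad BE_m = -2 b_{m+1} + b_m + b_{m-1}
\end{align*}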
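A short worked example may help to follow the recoding. It assumes a 4-bit two's-complement operand (M = 4) with b_{-1} = 0; the value b = -3 is chosen purely for illustration.

```latex
\begin{align*}
b &= -3 = (1101)_2, \qquad b_3 = 1,\ b_2 = 1,\ b_1 = 0,\ b_0 = 1,\ b_{-1} = 0 \\
BE_0 &= -2 b_1 + b_0 + b_{-1} = -2 \cdot 0 + 1 + 0 = 1 \\
BE_2 &= -2 b_3 + b_2 + b_1 = -2 \cdot 1 + 1 + 0 = -1 \\
BE_0 \, 2^0 + BE_2 \, 2^2 &= 1 - 4 = -3 = b
\end{align*}
```

If the operand is sign-extended (b_5 = b_4 = b_3 = 1), the next digit is BE_4 = -2 + 1 + 1 = 0, so it does not change the sum.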
WHY THEY ARE DIFFERENT?
[Figure: Altera ALM]
WHY THEY ARE DIFFERENT?
[Figure: slice detail with LUT inputs D6:1 and FF/LAT storage elements (signals D, CE, CK, SR, Q; attributes INIT0, INIT1, SRHI, SRLO)]
