HC19.20.320.Teraflops Prototype Processor with 80 Cores.v2
Transcription
Teraflops Prototype Processor with 80 Cores
Yatin Hoskote, Sriram Vangal, Saurabh Dighe, Nitin Borkar, Shekhar Borkar
Microprocessor Technology Labs, Intel Corp.

Agenda
– Goals
– Architecture overview
– Processing engine
– Interconnect fabric
– Power management
– Application mapping
– Experimental results

Goals
– Build Intel's first Network-on-Chip (NoC) research prototype
– Show teraflops performance under 100W
– Develop NoC design methodologies
  – Tiled design
  – Mesochronous (phase-tolerant) clocking
  – Fine grain power management
– Prototype key NoC technologies
  – On-die 2D mesh fabric
  – 3D stacked memory

Processor Architecture
[Figure: tile block diagram. Each tile pairs a processing engine (PE) with a 5-port crossbar router; the PE contains two FPMACs, a 3KB instruction memory (IMEM), a 2KB data memory (DMEM) and a 6-read, 4-write 32-entry register file. Also shown: 39-bit links with mesochronous interfaces (MSINT), 40GB/s I/O, PLL and JTAG.]
– Each tile: 5 GHz, 20 GFLOPS @ 1.2V
– 160 single-precision FP engines in an 8x10 2D mesh

Instruction Set
– 96-bit instruction word, up to 8 operations/cycle, with slots for two FPU operations, LOAD/STORE, SND/RCV, program flow and SLEEP
– Instruction latencies:

  Instruction type       Latency (cycles)
  FPU                    9
  LOAD/STORE             2
  SEND/RECEIVE           2
  JUMP/BRANCH            1
  STALL/WAIT FOR DATA    N/A
  NAP/WAKE               1

SP FP Processing Engine
– Fast, single-cycle multiply-accumulate algorithm (Ref: S. Vangal, et al., IEEE JSSC, Oct. 2006)
– Key optimizations
  – The multiplier output is kept in carry-save format and summed with 4-2 carry-save adders, removing expensive carry-propagate adders from the critical path
  – Accumulation is performed in base 32, converting expensive variable shifters in the accumulate loop to constant shifters
  – The costly normalization step is moved outside the accumulate loop
– Accumulation in 15 fanout-of-4 stages
– Sustained 2 FLOPS per cycle
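The key idea above is that the accumulate loop never rounds or normalizes: alignment uses constant shifts, and a single normalization happens after the loop. The C sketch below is a rough software analogy of that behavior, not the chip's carry-save, base-32 datapath; it accumulates products into a wide fixed-point register with a fixed shift and normalizes once at the end. FRAC_BITS, the function name and the input ranges are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 24  /* illustrative fixed-point scale for the wide accumulator */

/* Accumulate a[i]*b[i] with no per-iteration rounding or normalization:
 * each product is aligned by a constant shift and added into a wide
 * integer accumulator; the single "normalize" step happens at the end.
 * (The FP multiply here merely stands in for the hardware multiplier;
 * the point is that the accumulate itself is a plain add.) */
static float fpmac_dot(const float *a, const float *b, int n)
{
    int64_t acc = 0;  /* wide accumulator, never normalized inside the loop */
    for (int i = 0; i < n; i++) {
        double p = (double)a[i] * (double)b[i];          /* multiply */
        acc += (int64_t)(p * (double)(1 << FRAC_BITS));  /* constant shift + add */
    }
    return (float)((double)acc / (double)(1 << FRAC_BITS)); /* one normalization */
}

int main(void)
{
    float a[] = {1.5f, 2.0f, -0.25f, 3.0f};
    float b[] = {2.0f, 0.5f, 4.0f, 1.0f};
    printf("dot = %f\n", fpmac_dot(a, b, 4)); /* 3 + 1 - 1 + 3 = 6.0 */
    return 0;
}
```

In the silicon, the role of the wide register is played by the carry-save accumulator, and the constant shift comes from accumulating in base 32; together these are what keep the multiply-accumulate loop to a single cycle.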
2D Mesh Interconnect
– 4-byte-wide data links with 6 bits of overhead
– Source-directed, wormhole routing
– Two virtual lanes
– 5-port, fully non-blocking router
– 5GHz operation @ 1.2V
– 320GB/s bisection bandwidth
– 5-cycle fall-through latency
– On/off flow control
– High bandwidth, low latency fabric
[Figure: one tile, consisting of a compute element plus a router]

Router Architecture
– Five router pipe stages covering buffer write (stage 1), buffer read, route compute, port/lane switch arbitration, crossbar traversal (stage 4) and link traversal (stage 5)
– Input buffered, with two lanes per port
– Shared crossbar switch
– Distributed dual-phase arbitration
– Ref: S. Vangal, et al., VLSI 2007
[Figure: router block diagram with 39-bit ports to and from the PE and the N/E/S/W neighbors, MSINT on the inter-tile inputs, and lane 0/lane 1 input queues feeding the crossbar]

Fabric Key Features
– Double-pumped crossbar switch
  – Dual-edge-triggered flip-flops on alternate data bits
  – Crossbar channel width reduced by 50%, leading to 0.34mm2 router area
  – Ref: S. Vangal, et al., VLSI 2005
– Mesochronous clocking
  – Phase-tolerant, FIFO-based synchronizers at the interfaces between tiles
  – Enables low power, scalable global clock distribution
  – 2W global clock distribution power @ 1.2V, 5GHz
  – Synchronizer overhead: 6% of router power and 1-2 clock cycles of latency

Power Management Hooks
– Extensive use of sleep transistors
– Sleep transistors are activated dynamically or statically
– Dynamic control through special instructions or packets, based on workload
  – NAP/WAKE: a PE can put its own FPMAC engines to sleep or wake them up
  – PESLEEP/PEWAKE: a PE can put another PE to sleep, or wake it up for processing tasks, by sending sleep or wake packets
– Static control through scan: entire PE or individual router ports

Tile Sleep Regions
– 21 sleep regions with independent control, covering 74% of transistor device width
– Dynamic sleep: individual FPMACs, cores or tiles
– Static sleep control: scan chain
– Clock gating works together with sleep
– Fine grain power management
[Figure: % of device width on sleep per block: router/crossbar 10%, 2KB DMEM 57%, RIB 14%, 32-entry register file 72%, 3KB IMEM 56%, each FPMAC 90%]

Leakage Savings
– Total measured idle power: 13W @ 1.2V
– Regulated sleep for memory arrays provides state retention: in standby, the array's virtual ground is regulated to VREF = Vcc - VMIN, clamping the minimum retention voltage VMIN across the array
– 2X-5X leakage power reduction
[Figure: estimated per-block leakage breakdown @ 1.2V, 110°C, without vs. with sleep: router/crossbar 82mW vs. 70mW, DMEM 42mW vs. 8mW, register file 15mW vs. 7.5mW, IMEM 63mW vs. 21mW, each FPMAC 100mW vs. 20mW]

Pipelined Wakeup from Sleep
– Single-cycle wakeup: the multiplier, accumulator and normalization blocks each have their own sleep devices and nap/fastwake controls
– Waking all blocks in the same cycle causes a large current spike; a pipelined wakeup sequence staggers them over successive cycles
– 4X reduction in peak wakeup current
[Figure: FPMAC wakeup current vs. time (0-8ns) for simultaneous vs. pipelined wakeup]

Router Power Management
– Activity-based power management with individual port enables
– A port's input queues are put on sleep and its clock is gated when the port is idle
– 7X power reduction for idle routers
[Figure: router power vs. number of active ports at 1.2V, 5.1GHz, 80°C: 924mW with all ports active down to 126mW when idle, a 7.3X reduction]
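As a concept sketch of the activity-based policy on the router power management slide, the small C model below gates a port's clock as soon as the port is idle and puts its input queue to sleep after a few quiet cycles. The port structure, the IDLE_SLEEP_THRESHOLD hysteresis and the wake condition are illustrative assumptions; the slide itself states only that idle ports are clock gated and their queues put on sleep.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PORTS 5              /* N, E, S, W and the local PE port */
#define IDLE_SLEEP_THRESHOLD 8   /* assumed hysteresis, not a chip parameter */

/* Simplified per-port state: queue occupancy, incoming traffic, and the
 * resulting clock-gate and sleep decisions. Field names are illustrative. */
typedef struct {
    int      queue_occupancy;   /* flits buffered in lanes 0/1 */
    bool     flit_arriving;     /* upstream neighbor is sending this cycle */
    bool     clock_enabled;     /* gated_clk driven to this port */
    bool     sleep_asserted;    /* queue array sleep transistor turned on */
    uint32_t idle_cycles;       /* consecutive cycles with no activity */
} port_state_t;

/* One cycle of the policy: an idle port (empty queue, no incoming flit)
 * has its clock gated immediately and its queue put to sleep after
 * IDLE_SLEEP_THRESHOLD quiet cycles; any activity wakes it back up. */
static void update_port_power(port_state_t *p)
{
    bool idle = (p->queue_occupancy == 0) && !p->flit_arriving;
    if (idle) {
        p->idle_cycles++;
        p->clock_enabled  = false;
        p->sleep_asserted = (p->idle_cycles >= IDLE_SLEEP_THRESHOLD);
    } else {
        p->idle_cycles    = 0;
        p->clock_enabled  = true;
        p->sleep_asserted = false;
    }
}

int main(void)
{
    port_state_t ports[NUM_PORTS] = {0};
    ports[0].flit_arriving = true;  /* only the north port sees traffic */
    for (int cycle = 0; cycle < 16; cycle++)
        for (int i = 0; i < NUM_PORTS; i++)
            update_port_power(&ports[i]);
    for (int i = 0; i < NUM_PORTS; i++)
        printf("port %d: clk %s, sleep %s\n", i,
               ports[i].clock_enabled ? "on" : "gated",
               ports[i].sleep_asserted ? "asserted" : "off");
    return 0;
}
```

Running this leaves the busy port clocked and awake while the four idle ports end up clock gated with their queues asleep, which is the behavior the measured 7X idle-router power reduction relies on.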
Estimated Power Breakdown
– Conditions: 4GHz, 1.2V, 110°C
– Tile power profile: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port register file 4%
– Communication power profile: clocking 33%, queues + datapath 22%, links 17%, crossbar 15%, arbiters + control 7%, MSINT 6%

Application Kernels
– Four kernels mapped onto the architecture
  – Stencil: 2D heat diffusion equation (a C sketch of this kernel appears after the communication-patterns slide)
  – SGEMM for 100x100 matrices
  – Spreadsheet computing weighted sums
  – 64-point 2D FFT (using 64 tiles)
– Kernels were hand coded in assembly and manually optimized
– Sized to fit in the on-chip local data memory; instruction memory is not a limiter

Communication Patterns
[Figure: tile-to-tile communication patterns for the Stencil, SGEMM, Spreadsheet and 2D FFT kernels]
– Communication is overlapped with computation
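For reference, here is a minimal single-tile C sketch of one time step of the 2D heat diffusion (5-point) stencil named on the application kernels slide. The grid size, diffusion coefficient and fixed-zero boundary are illustrative assumptions; the real kernel is hand-coded assembly in which each tile owns a sub-block of the grid and exchanges edge rows and columns with its mesh neighbors via SEND/RECEIVE, overlapping that traffic with computation.

```c
#include <stdio.h>

#define NX 8
#define NY 8

/* One Jacobi-style time step of a 2D heat-diffusion (5-point) stencil.
 * alpha is an illustrative diffusion coefficient; boundary cells are
 * held fixed at zero.  On the 80-core chip, each tile would update its
 * own sub-block and trade boundary rows/columns with its neighbors. */
static void stencil_step(float in[NY][NX], float out[NY][NX], float alpha)
{
    for (int y = 1; y < NY - 1; y++)
        for (int x = 1; x < NX - 1; x++)
            out[y][x] = in[y][x] + alpha *
                (in[y - 1][x] + in[y + 1][x] +
                 in[y][x - 1] + in[y][x + 1] - 4.0f * in[y][x]);
}

int main(void)
{
    static float grid[2][NY][NX];        /* zero-initialized ping-pong buffers */
    grid[0][NY / 2][NX / 2] = 100.0f;    /* a hot spot in the middle */
    for (int t = 0; t < 10; t++)         /* alternate input/output buffers */
        stencil_step(grid[t & 1], grid[(t + 1) & 1], 0.1f);
    printf("center after 10 steps: %f\n", grid[0][NY / 2][NX / 2]);
    return 0;
}
```

The per-point work is two multiplies and a handful of adds on neighboring values, which is why this kernel maps well onto the dual-FPMAC tiles and reaches the highest fraction of peak in the results that follow.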
Application Performance
– Conditions: 1.07V, 4.27GHz operation, 80°C

  Application kernel              FLOP count   TFLOPS @ 4.27GHz   % of peak TFLOPS   Tiles used
  Stencil                         358K         1.00               73.3%              80
  SGEMM (matrix multiplication)   2.63M        0.51               37.5%              80
  Spreadsheet                     62.4K        0.45               33.2%              80
  2D FFT                          196K         0.02               2.73%              64

Frequency vs Power
[Figure: frequency, active power and leakage power vs. supply voltage (0.75V to 1.39V) at 25°C, N=80; labeled points include 1.2GHz at the low end, 3.8GHz at 78W and 5.34GHz at 162W, with about 17W of leakage; peak 2TFLOPS @ 300W]
– Stencil app efficiency: 6 GFLOPS/W to 22 GFLOPS/W

Measured Power Savings
[Figure: Stencil app power (W) and % power savings vs. Vcc (0.7V to 1.4V) with router power management on]

Leakage Savings
[Figure: leakage as a % of total power vs. Vcc (0.67V to 1.35V), sleep disabled vs. sleep enabled, 80°C, N=80; measured maximum leakage power reduction of 2X]

Design Tradeoffs
– Core count: determined by die size constraints (300mm2) and performance-per-watt requirements (10 GFLOPS/W)
– Router link width: a 4-byte data path enables single-flit transfer of a single-precision word; crossbar area and power scale quadratically with link width
– Mesochronous implementation: a single clock source with scalable, low-power clock distribution
– Backend design and validation benefits: the tiled methodology enabled rapid design by a small team (less than 400 person-months) and functional first silicon in 2 hours

Die Photo and Chip Details
[Die photo: 12.64mm x 21.72mm die with I/O areas, global clock spine plus clock buffers, PLL and TAP; each 2.0mm x 1.5mm tile shows the DMEM, IMEM, RF, RIB, MSINT, CLK, FPMAC0, FPMAC1 and router blocks]
– Technology: 65nm CMOS process
– Interconnect: 1 poly, 8 metal (Cu)
– Transistors: 100 million
– Die area: 275mm2
– Tile area: 3mm2
– Package: 1248-pin LGA, 14 layers, 343 signal pins

Summary
– An 80-core NoC architecture in 65nm CMOS
  – 160 high-performance FPMAC engines
  – Fast, single-cycle accumulator
  – Low latency, compact router design
  – High bandwidth 2D mesh interconnect
– TFLOPS-level performance at high power efficiency
  – Fine grain power management
  – Chip dissipates 97W at 1 TFLOPS (Stencil app)
  – Measured energy efficiency of 6 to 22 GFLOPS/W
– Demonstrated peak performance up to 2 TFLOPS
– Building blocks for future peta-scale computing

Acknowledgements
– Implementation: Circuit Research Lab Advanced Prototyping team (Hillsboro, OR and Bangalore, India)
– Application kernels: Software Solutions Group (Santa Clara, CA) and Application Research Labs (DuPont, WA)
– PLL design: Logic Technology Development (Hillsboro, OR)
– Package design: Assembly Technology Development (Chandler, AZ)