Micro-architecture of Godson-3 Multi-Core Processor
Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen
Institute of Computing Technology (ICT), Chinese Academy of Sciences
hww@ict.ac.cn
Hot Chips 2008, San Francisco, August 26

Slide 2: Contents
- A brief introduction to Godson processors
- The architecture of Godson-3 multi-core processor
- Physical implementation
- PetaFLOPS and TeraFLOPS
Godson is the academic name of Loongson(TM).

Slide 3: National Project
- High-performance CPUs are of national strategic importance
- The Chinese ICT industry is growing to a significant scale
  - 2007 demand: ICT market = 5.6 trillion RMB
  - 2007 supply: only 22% by domestic companies, 3.75% profits
- The Godson CPU is supported by
  - National 863 project
  - National 973 project
  - National Science Foundation of China
  - National key project
  - Key project of the Chinese Academy of Sciences

Slide 4: Godson CPU Briefs
- ICT started Godson CPU design in 2001
- The 32-bit Godson-1 CPU (2002) is the first general-purpose CPU in China
- The 64-bit Godson-2B in 2003.10
- The 64-bit Godson-2C in 2004.12
- The 64-bit Godson-2E in 2006.03
- Each tripled the performance of its predecessor

Slide 5: Godson Development
[Figure: SPEC CPU2000 rate (log scale, 10 to 10000) versus year, 1999 to 2006; two series: Intel/AMD/HP/IBM/SGI/Sparc and Godson]

Slide 6: Godson-2E SPEC CPU2000 Rate

SPEC int2000 (overall: 503)
Program        Ref time  Run time  Ratio
164.gzip           1400       403    347
175.vpr            1400       273    512
176.gcc            1100       221    497
181.mcf            1800       307    586
186.crafty         1000       167    598
197.parser         1800       472    382
252.eon            1300       188    690
253.perlbmk        1800       354    508
254.gap            1100       240    458
255.vortex         1900       263    722
256.bzip2          1500       365    411
300.twolf          3000       645    465

SPEC fp2000 (overall: 503)
Program        Ref time  Run time  Ratio
168.wupwise        1600       238    672
171.swim           3100       660    469
172.mgrid          1800       579    311
173.applu          2100       549    382
177.mesa           1400       221    634
178.galgel         2900       412    704
179.art            2600       416    624
183.equake         1300       208    624
187.facerec        1900       300    632
188.ammp           2200       432    509
189.lucas          2000       396    506
191.fma3d          2100       531    395
200.sixtrack       1100       345    319
301.apsi           2600       528    493

Slide 7: Godson-2E and Godson-2F
Godson-2E
- 1.0GHz@90nm CMOS, 5-7W
- 47M transistors, area 36mm^2
- Godson-2 CPU core
  - 64-bit MIPS III compatible
  - Four-issue, out-of-order (OOO)
  - 64KB+64KB L1 (four-way)
- 512KB L2 (four-way)
- On-chip DDR controller
- SysAD front-end bus
Godson-2F
- 1.0GHz@90nm CMOS, 3-5W
- 51M transistors, area 43mm^2
- Godson-2 CPU core
  - 64-bit MIPS III compatible
  - Four-issue, out-of-order (OOO)
  - 64KB+64KB L1 (four-way)
- 512KB L2 (four-way)
- On-chip DDR2 controller
- PCI/PCIX, local IO, GPIO, etc.
- Volume production

Slide 8: Low-end roadmap: From CPU to SOC
[Diagram: integration steps 2E (2006), 2F (2007), 2G (2008), 2H (2009). Blocks shown: CPU, North Bridge, Graphic, South Bridge; CPU+NB; GPU+South Bridge; CPU+GPU+NB+SB. Links labeled PCI and PCI/HT.]

Slide 9: Contents
- A brief introduction to Godson processors
- The architecture of Godson-3 multi-core processor
- Physical implementation
- PetaFLOPS and TeraFLOPS

Slide 10: Godson-3 Briefs
- Scalable architecture
- Reconfigurable CPU core and L2
- X86 binary translation optimization
- Low power consumption
- >1.0GHz@65nm

Slide 11: Scalable Architecture Design
- Scalable interconnection network
  - Crossbar + mesh
  - A single crossbar connects cores, L2s, and the four directions
- Directory-based cache coherence protocol
  - Distributed L2 caches are globally addressed
  - Each cache block has a directory entry
  - Both data cache and instruction cache are recorded in the directory
[Diagram: cores P0-P3 and four L2 banks attached to an 8x8 Xbar with E, S, W, N ports]

Slide 12: Reconfigurable Architecture
- General purpose core: 64-bit, 4-issue, OOO, AXI interface
- Multiple purpose core: LINPACK, biology, signal processing, AXI interface
- 8 configurable address windows on each master port allow page migration across L2 and memory
- DMA engine supports prefetch and matrix transpose
- Shared L2 can be configured as internal RAM; DMA to internal RAM directly
[Diagram: cores P0-P3 and two HT DMA controllers on master ports m0-m5 of a 6*6 AXI switch; slave ports s0-s5 lead to L2 banks S0-S3 and HT; a 5*4 AXI switch connects to memory controllers MC0, MC1 and PCI...]

Slide 13: GS464 general purpose core and GStera multiple purpose core
GS464 architecture
- MIPS64, 200+ more instructions for X86 binary translation and media acceleration
- Four-issue superscalar OOO pipeline
- Two fixed-point, two floating-point, and one memory unit
- Each of the two FP units supports fully pipelined double/paired-single MAC operations
- 48-bit VA and PA, 128-bit memory access
- 64KB icache and 64KB dcache, 4-way
- 64-entry fully associative TLB, 16-entry ITLB, variable page size
- Non-blocking accesses, load speculation
- Directory-based cache coherence for CMP
- Parity check for icache, ECC for dcache
- EJTAG for debugging
- Standard 128-bit AXI interface
[Diagram: GS464 pipeline. Blocks: PC, PC+16, BHT, BTB, ITLB, ICache, Predecoder, Decoder, Register Mapper, Fix Queue, Float Queue, Integer Register File, Floating Point Register File, ALU1, ALU2, AGU, FPU1, FPU2, DTLB, DCACHE, Tag Compare, CP0 Queue, Reorder Queue, ROQ, BRQ. Buses: Decode, Map, Write back, Branch, Commit, Refill. Interfaces: imemread, dmemread, duncache, dmemwrite, missqueue, wtbkqueue, ucqueue, Processor Interface, AXI Interface, EJTAG TAP Controller, Test Controller, Test Interface, JTAG Interface, clock, reset, int, ...]
GStera multiple purpose core
- Targets LINPACK, biology computation, signal processing
- 8-16 MACs per node
- Big multi-port register file
- Reconfigurable based on applications
- Standard 128-bit AXI interface
[Diagram: GStera. XBAR, MicroController, eight 1024*64 register banks (each 4W4R), Micro-Code Store, DMA+AXI Controller, AXI interface]

Slide 14: Matrix Transposing Performance
- 15+ times faster
[Figure: bar chart, y-axis 0 to 250000000, series "dsp" and "cpu", matrix sizes 256x256, 512x512, 1024x1024]

Slide 15: Hardware Support for X86 Binary Translation
- Define new instructions
  - X86 ISA function and MIPS ISA format
  - Binary translation mechanism support
- >200 instructions are defined with 5% additional silicon cost
- Speeds up X86-to-MIPS binary translation 10 times compared to software-only QEMU
[Figure 2. The architecture of Godson-2 virtual machine. Layers: MS Windows and Linux apps on X86; Linux apps on MIPS; system-level X86 VM; process-level X86 VM; Linux on MIPS; enhanced MIPS decode; enhanced Godson internal operations]
[Figure: bar chart, y-axis 0 to 16000, benchmarks IDCT, FFT(FX), FFT(FP), GP; series for EFLAG handling: No HO, No SO; HO Only; Both; Ideal]

Slide 16: Contents
- A brief introduction to Godson processors
- The architecture of Godson-3 multi-core processor
- Physical implementation
- PetaFLOPS and TeraFLOPS

Slide 17: Physical implementation
- 65nm CMOS LP/GP technology
- Cell-based design methodology
  - DC + ICC
  - Manual P&R for critical cells
- 2008: 4-core (4GP + 0MP) + 4MB L2, 10W@1GHz
- 2009: 8-core (4GP + 4MP) + 4MB L2, 20W@1GHz
- GP: general purpose core; MP: multiple purpose core

Slide 18: 4-core
- 2008
- 4-core (general purpose core)
- 65nm, 1GHz, 10W
[Diagram: cores P0-P3 and two DMA controllers (HT, PCIE) on ports m0-m5 of a 6*6 X1 switch; ports s0-s5 to L2 banks S0-S3; 5*4 X2 switch to MC0, MC1 and PCI...]

Slide 19: 8-core
- 2009
- 8-core (4GP + 4MP)
- 65nm, 1GHz, 20W
[Diagram: two nodes, each with cores P0-P3 on an X1 switch (m0-m5, s0-s5), L2 banks S0-S3, and an X2 switch; DMA controllers (HT, PCIE), PCI, LPC, MC0, MC1]

Slide 20: Full Custom Register File and CAM
Block                   Size            Power        Delay
Physical register file  321um x 262um   50mW@1GHz    470ps
TLB CAM                 224um x 235um   55mW@1GHz    550ps

Slide 21: HyperTransport PHY
- HT1.0 driver & receiver
- Flip-chip compatible
- 2-row design
- 800mW @ 1.6Gbps
- Size: 250um x 300um
- Power: < 10mW
- Freq: 1.6GHz
- Jitter: 20ps

Slide 22: Test Chip for Custom Blocks
- Test chip: ST 65nm, 1206um x 1206um
- Functions:
  - CAM1W1R - BIST
  - CAM1W1R - Scan
  - RAM4W4R - BIST
  - RAM4W4R - Scan
  - ICT_PLL - Freq. test
  - HT1.0 - BIST
  - HT1.0 - Error rate test

Slide 23: Cell-based High Performance Physical Design
- Full hierarchical design methodology
- Manual placement & routing for critical paths
- Manual placement of all FFs and clock buffers, manual clock gating
- Architecture optimization with physical feedback
[Figure: floorplan with labeled blocks and buses, including ictreg0, ictreg1, regfile0, alu1, alu2, alu_buf_vsrc, Mul, Div, fxqueue, cam0, rissbus, rissuebus2, qissbus, mapbus0/1/2/3, resbus0/1/2/3, fwdbus2/3, br_pc_value, jr_target]

Slide 24: Clock Tree
- H-tree + mesh
- Manual placement of FFs
- Manual clock gating

Slide 25: Layout of 4-core Godson-3
[Figure: chip layout with four GS464 cores, four L2 banks, two Xbars, two DMA blocks, two HT blocks, and two DDR2/3 controllers]

Slide 26: Contents
- A brief introduction to Godson processors
- The architecture of Godson-3 multi-core processor
- Physical implementation
- PetaFLOPS and TeraFLOPS

Slide 27: PetaFLOPS and TeraFLOPS
- PetaFLOPS for large-scale applications
  - To build a PetaFLOPS HPC system with Godson-3 in 2010
- TeraFLOPS for personal HPC
  - Putting desktop to pockets
  - Putting TeraFLOPS to desktop
  - High-performance computing for the masses

Slide 28: Scaling down TeraFLOPS
- TeraFLOPS in 1997
- $100K/2007: "refrigerator"
- $50K/2008: "washing machine"
- $10K/2009: "microwave oven"

Slide 29: References
The Godson-2 superscalar architecture is described in:
- Weiwu Hu, Fuxin Zhang, Zusong Li, "Microarchitecture of the Godson-2 Processor", Journal of Computer Science and Technology, 20(2):243-249, Mar. 2005
- Weiwu Hu, Jiye Zhao, Shiqiang Zhong, Xu Yang, Elio Guidetti, Chris Wu, "Implementing a 1GHz Four-issue Out-of-Order Execution Microprocessor in a Standard Cell ASIC Methodology", Journal of Computer Science and Technology, 22(1):1-14, Jan. 2007
Lessons learned from Godson processor design are described in:
- Weiwu Hu, Jian Wang, "Making Effective Decisions in Computer Architects' Real-World: Lessons and Experiences with Godson-2 Processor Designs", Journal of Computer Science and Technology, 23(4), July 2008
The architecture of the Godson-3 multi-core processor is described in:
- Hot Chips 2008

Slide 30: Concluding Remarks
- CPU R&D is of national strategic importance
- Godson-3 has a low-power, scalable architecture
- Godson-3 will be used to build
  - Client-side systems
  - Petaflops machines
  - Teraflops systems for the masses
- The Godson team at ICT: open cooperation

Slide 31: Thanks
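
Worked check of the Godson-2E SPEC CPU2000 slide: SPEC CPU2000 defines each benchmark's ratio as 100 times the reference time divided by the measured run time, and the overall figure as the geometric mean of the per-benchmark ratios. The sketch below applies that to the integer results from the slide. The names `SPEC_INT2000`, `spec_ratio`, and `geometric_mean` are illustrative and do not come from the deck. The run times on the slide appear to be rounded to whole seconds, so recomputed ratios can differ from the printed ones by a point or two.

```python
import math

# (benchmark, reference time [s], run time [s], ratio as printed on the slide)
SPEC_INT2000 = [
    ("164.gzip", 1400, 403, 347),
    ("175.vpr", 1400, 273, 512),
    ("176.gcc", 1100, 221, 497),
    ("181.mcf", 1800, 307, 586),
    ("186.crafty", 1000, 167, 598),
    ("197.parser", 1800, 472, 382),
    ("252.eon", 1300, 188, 690),
    ("253.perlbmk", 1800, 354, 508),
    ("254.gap", 1100, 240, 458),
    ("255.vortex", 1900, 263, 722),
    ("256.bzip2", 1500, 365, 411),
    ("300.twolf", 3000, 645, 465),
]


def spec_ratio(ref_time, run_time):
    """SPEC CPU2000 ratio: 100 * reference time / measured run time."""
    return 100.0 * ref_time / run_time


def geometric_mean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Recompute each ratio from the (rounded) times shown on the slide.
recomputed = {name: spec_ratio(ref, run) for name, ref, run, _ in SPEC_INT2000}

# Overall score from the ratios exactly as printed on the slide.
overall = geometric_mean([row[3] for row in SPEC_INT2000])

for name, ref, run, printed in SPEC_INT2000:
    print(f"{name:12s} printed={printed:4d} recomputed={recomputed[name]:6.1f}")
print(f"geometric mean of printed ratios: {overall:.1f}")
```

The geometric mean of the printed integer ratios comes out close to the overall 503 shown on the slide; the same check can be run on the floating-point table.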