Hardware – Software Coverification with Acceleration systems
June 2003
Jörg Kayser
eServer Verification
IBM Corp, Böblingen
PDF created with FinePrint pdfFactory Pro trial version www.pdffactory.com

Content
• Motivation
• Where acceleration fits
• Scheduling basics
• Special features
• History
• Acceleration products
• Outlook

System verification objective
• Reduce time to market
• Optimize development cost
• Cover complexity challenges
• by reduction of
  – hardware fails prior to tape out
  – number of ECs
  – code fails prior to power on
  – bring up hardware

Microcode bugs
Finding Bugs After Coding is Costly
[Figure: percentage of defects introduced and percentage of defects found in each phase (Coding, Unit Test, Funct Test, Field Test, Post Release), plotted against the cost to repair a defect in that phase. Values shown: 85%, $25, $130, $250, $1000, $14,000; Apar $15-40,000.]
Source: Applied Software Measurement, Capers Jones, 1996

Hierarchical Design
[Figure: design hierarchy: System, Chip, ..., Unit, Macro]
Allows the design team to break the system down into logical and comprehensible components. Also allows for repeatable components.
© Bruce Wile

Current Practices for Verifying a System
• Designer Level Sim
  – Verification of a macro (or a few small macros)
• Unit Level Sim
  – Verification of a group of macros
• Element Level Sim
  – Verification of an entire logical function such as a processor, storage controller or I/O control
  – Currently synonymous with a chip
• System Level Sim
  – Multiple chip verification
  – Often utilizes a mini operating system
© Bruce Wile

Shift left from bringup into simulation
[Figure: time axis showing the shift left from bring-up and integration into simulation. Labels: Early System Integration; Unit Simulation; CEC Chips; Intermediate Level Hardware Verified RIT Level Hardware; HW Subsystem; I/O Chips; System Simulation; CECSIM; Service Element Office Mode; Bring-up and integration; CECSIM Bringup Vehicle; Virtual Power-On; Real Power-On.]
© Stefan Körner

How to verify a big iron?
© Stefan Körner

Acceleration versus Emulation
• Acceleration
  – Can increase simulation speed by a factor of 10..1000 vs. software simulator
• Hyper-Acceleration
  – Can increase simulation speed by a factor of 1k..100k vs.
    software simulator
• Emulation
  – Not only increases sim speed, but also allows direct physical interconnect to a target system (real hardware)
  – Example: emulated processor connected to real system motherboard

Model build
Principle: compile into bool operators, then schedule communication
[Figure: model build flow. HDL, e.g. A = (B & C) | (D & E); -> Synthesis into 4-input/1-output gates (inputs B, C, D, E feed one AND-OR gate with output A) -> Partitioner/Scheduler -> Model]

Principle of Operation
[Figure: logic between two clocked registers (D/Q). 10 ns clock step; 3-5 steps per logic level.]
• Compiler transforms combinational logic into boolean operations
• Compiler schedules interprocessor communications using a fast broadcast technique
• Emulation performance dictated by
  – number of processors
  – number of levels in the design

Simulation -> Acceleration
[Figure: a small circuit with gates A, B and C. Software simulation evaluates the gates in 3 sequential steps. Hardware acceleration needs 2 steps on parallel processors (step 1: A and C; step 2: B).]

Another example
[Figure: circuit with registers and twelve gates a to l, scheduled onto four emulation processors EP1 to EP4.]
Step 1: b, a, d, c
Step 2: g, f
Step 3: i, h
Step 4: k
Step 5: l, e, j
12 steps serial, 5 steps parallel

FPGA versus Processor based
FPGA based systems
• ~50c/gate
• Lower one-time cost
Processor based systems
• ~10c/gate
• Limited interconnections
• Higher gate utilization
• Faster compile
• Faster runtime
• Higher capacity

Simulation methods
• Acceleration:
  – Self contained mode: model and test vectors are kept inside the Emulator.
    Highest speed, no workstation overhead
  – HDL Co-sim mode: part of the design runs in Emulator, other part in software simulator, speed dependent on testbench, design and simulator speed
  – C, C++ testbench: C-program is connected to Emulator, faster than HDL Co-Sim mode, no simulator overhead. Interactions can be packaged (transaction based interface)
• Emulation:
  – In-Circuit Emulation: Emulator model runs together with real hardware. Smaller model than self contained mode

Multivalue Simulation
• The Acceleration hardware supports 2 values only
• To simulate "X" or "Hi-Z" you have to make changes to the model

3-value Signal | 2-value Signal | Signal_is_X
0              | 0              | 0
1              | 1              | 0
X              | 1              | 1

4-value Bus | 2-value Bus | Bus_is_X | 3-value Recv
0           | 0           | 0        | 0
1           | 1           | 0        | 1
H           | 0           | 1        | 0
X           | 1           | 1        | X

Multivalue Simulation (continued)
[Figure: 2-value equivalents of the OR, AND, Latch and Inverter gates. Each signal (A, B, C, D, IN, CLK, OUT) is paired with an _is_X signal (A_is_X, B_is_X, C_is_X, D_is_X, IN_IS_X, CLK_IS_X, OUT_is_X), and additional AND/OR gates derive OUT_is_X.]

Multivalue Simulation (continued)
[Figure: 3-state bus. Driver: Data, En, Data_is_X, En_is_X and Other_En are combined through AND/OR gates onto Bus (0/1) and Bus_is_X. Receiver: Bus and Bus_is_X (X, H) are combined through AND/OR gates into Data and Data_is_X.]
=> Multivalue acceleration increases model size by ~4x

Other features
• event / signal tracing
  – thousands of signals
  – dynamic probing
  – infinite cycles
  – compression
  – post processing
• fast communication to workstation
  – 100 Mb/s Ethernet
  – 100 MB/s FIBER
  – direct attach, pin multiplexed
• worldwide remote control
  – multi user support
  – multiple models at the same time
  – 24h usage
• cross platform checkpoint/restart with other simulators
• Multiple clock domains
  – Design partitioned into domains with
    different clocks (clock separation)
  – Can also be done with clock oversampling and 1 domain

Acceleration history
• 1960: NASA funded Boeing for Apollo verification
  – Advanced LSI chips of that time contained < 500 gates, 5 usec cycle
  – Vision was to have space missions lasting 8-12 years
  – Goal:
    • Architectural Study on navigation processor, Evaluation of faults
    • Built a hardware emulation engine to do the verification job
  – Result:
    • 4 processors Boeing Computer Simulator (McKay)
    • Study showed more than 8 processors was "not adding value"
    • Event based communications issues with architecture
    • 48k gate model max, 48 bit instructions, 650 nsec
    • Model slowdown 800x (vs. ET4x4 1000M/3M)
    • Patent lawyers and management stopped future work in early 70s, too complex/expensive
• Asked IBM to build a product
© IBM Corporation

Yorktown Simulation Engine
• EVE prototype
• LS-TTL chips
• Multiwire boards 22x24 sq.in
© IBM Corporation

1982: EVE 1/1.5
• 100 ns per cycle
• 32k gates, 4 cards 7x9 sq.in per proc.
• >1000 sq.ft
• 25 cents per gate
© IBM Corporation

1987: EVE2
• 80 ns per cycle
• 1 chip, 512 gates per proc
• 100 sq.ft
• 67 cents/gate
• Too expensive, => crushed
© IBM Corporation

Acceleration history (continued)
[Figure: timeline from 1980 to 2000 showing, in order: LSM, YSE, EVE1, EVE1.5, EVE2, Evette, Corvette, ET, Awan.]
© IBM Corporation

Cadence acceleration
• CoBALT (ET3)
  – CoBALT = Concurrent Broadcast Array Logic Technology
  – 64 processors per chip, 65 chips (1M gates) per board
  – Fast compile time 2M gates/hour
  – Fast download
  – Leader in capacity (1997)
  – 6 modes of Emulation
  – Multi user capable (only system at that time)
  – 3 levels of memory, automatic selection

Quickturn Advertisement
[Figure: project schedule chart, months starting 3/98 to 4/99 (M A M J J A S O N D J F M A), comparing "Quickturn Simulation/Emulation used" with "No Quickturn Simulation/Emulation used". Tasks: High Level Architecture; Behavioral Development; Develop Module Level Designs; System level Regression/Accel; System Level/Netlist Completion; Tape-Out & Fabrication; System/Software Integration; First Customer Ship (Alpha); Silicon Re-Spin #1; Silicon Re-Spin #2.]
3 - 4 Month Schedule Improvement
ND_Rev 980419

Cadence acceleration (continued)
• CoBALT Plus (ET3.5)
  – 64 processors per chip, 65 chips (2.5M gates) per board
  – 2GB embedded memory
  – Big designs run at 65k cps
  – Direct Attach Stimulus card (DAS), PCI based host adapter with sustained 50MB/s data rate
  – Used at IBM since 1998 with 16 boards (40M gates, 65k processors)
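The CoBALT Plus totals quoted above follow from the per-board figures. The sketch below recomputes them under the assumption that capacity scales linearly with the number of boards; the function and constant names are illustrative and not taken from any Cadence or IBM tool.

```python
# Per-board figures for CoBALT Plus (ET3.5) as quoted on the slide.
PROCESSORS_PER_CHIP = 64
CHIPS_PER_BOARD = 65
GATES_PER_BOARD = 2_500_000
BOARDS = 16  # size of the IBM installation quoted on the slide


def system_totals(processors_per_chip, chips_per_board, gates_per_board, boards):
    """Scale per-board figures to a multi-board system.

    Assumption: linear scaling, with no capacity lost to inter-board wiring.
    """
    processors = processors_per_chip * chips_per_board * boards
    gates = gates_per_board * boards
    return processors, gates


processors, gates = system_totals(
    PROCESSORS_PER_CHIP, CHIPS_PER_BOARD, GATES_PER_BOARD, BOARDS
)
print(f"processors: {processors:,}")
print(f"gates: {gates / 1e6:.0f}M")
```

This gives 66,560 processors and 40M gates. The processor count matches the slide's "65k processors" if k is read as 1024, which is an interpretation rather than something the slide states.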

Cadence acceleration (continued)
• Palladium (ET4)
  – 256 processors per chip, 65 chips (10M gates) per board
  – 4.1GB embedded memory
  – Fast compile time 5M gates/hour
  – IBM uses 16 boards since 2001 (160M gates, 266k proc, 65GB mem)
  – designs run at 300k cps

ET4 module layout

Parallelism on ET4
• 256 processors per module connected to each other sharing 1MB SRAM and 64MB DRAM
• 65 modules per board connected to each other
• 16 boards connected with high speed cables
=> up to 266k parallel processors
=> up to 64M 4way gates in 1 design
=> up to 65GB memory
=> up to 1M cps
1 gate evaluation in 7.5 ns (1 step)
~ 35 x 10^12 evaluations per second
10M evaluations in < 2 us

IBM inhouse systems
• AWAN
  – 300..3000 cps
  – 29M 4w gates (80M 2w gates)
  – 256 LP, 16 AP
  – Sparse array mapping
  – Modbld faster than ET4
• AWAN 4X
  – 115M 4w gates
• AWAN NG

Awan Logic Board
[Figure: Awan logic board with 16 logic processor (LP) blocks, an Array Processor, a Switch Daughter Card and the Backplane.]
© Harrell Hoffman

AWAN Processor board
[Figure: AWAN processor board. Labels:]
• Logic (Gate) Processor: Toshiba Gate Array, Cmos-4 (.4 micron)
• Instruction (LP/SW) Memory: Mosys Dram, DDR + Common addr/data bus + 32 word burst
• Backplane Connectors
• Daughter Board: Handles inputs from backplane
• Switch (Interconnect) Chip: Toshiba Gate Array
• Sram Proc + Memory: Chip Express Gate Array (8MBytes memory)
• Dram Proc + Memory: Chip Express Gate Array (128MBytes memory). Also handles sparse arrays using associative mapping
© Harrell Hoffman
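The throughput figures on the "Parallelism on ET4" slide earlier in this part can be cross-checked with a few lines of arithmetic. The sketch below treats the numbers as a peak bound, assuming every processor evaluates exactly one gate in every 7.5 ns step; the function and variable names are illustrative only.

```python
PROCESSORS_PER_MODULE = 256
MODULES_PER_BOARD = 65
BOARDS = 16
STEP_SECONDS = 7.5e-9  # one gate evaluation per processor per step (slide figure)


def peak_evaluations_per_second(processors, step_seconds):
    """Upper bound, assuming no processor is ever idle in a step."""
    return processors / step_seconds


processors = PROCESSORS_PER_MODULE * MODULES_PER_BOARD * BOARDS
rate = peak_evaluations_per_second(processors, STEP_SECONDS)
time_for_10m = 10e6 / rate  # seconds needed for 10 million gate evaluations

print(f"{processors:,} processors")
print(f"{rate:.2e} evaluations per second")
print(f"{time_for_10m * 1e6:.2f} us for 10M evaluations")
```

Under these assumptions the result is 266,240 processors and roughly 3.5 x 10^13 evaluations per second, in line with the slide's "266k" and "~ 35 x 10^12". Ten million evaluations then take well under the 2 us the slide gives as an upper limit.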

AWAN
• High capacity:
  – 10 Million Gates (2 boards)
  – 256/512MB memory
• High speed:
  – Compiler performance: 12M gates/hour
  – Simulation performance: 500..5000 Hz
• Highly flexible:
  – Processor based acceleration
  – SW simulator model
• Low Cost:
  – Pennies per gate, 1.1M$ (DAC 2000)
© Harrell Hoffman

Other Acceleration vendors
• Mentor Graphics (including Ikos)
  – Celaro Pro, VStation, both 30M gates, cascadable
• Aptix
  – MP4, 5-10MHz, 1.8M gates, FPGA, no trace memory, 185k$
• Axis Systems
  – Xtreme-II, up to 100M gates
• Tharas Systems
  – Hammer 32M, Processor based, 32M gates, cascadable
  – 10M gates/hour compile time, incremental compile

Main Drawbacks to Emulation
Drawback                        | Percent of Teams
"Time to emulation"             | 73%
Effort required to use          | 65%
Emulation at-speed              | 62%
Cost per gate                   | 62%
Connection to other EDA tools   | 50%
Source: 1997 Collett International, Inc
Key barriers: Time-to-emulation, ease-of-use and cost

Outlook
• Acceleration systems are still expensive
• Hardware-software coverification pays off after a project or two
• The tool handling and integration into other simulators gets easier
• DAC 2003 trend: FV and Acceleration are rising