Rechnerarchitektur 1 Content - Computer Architecture Group
Vorlesung Rechnerarchitektur Seite 1
Rechnerarchitektur 1 Content
1) Introduction, Technology, von Neumann Architecture
2) Pipeline Principle
   a) Dependencies
   b) Predication
3) Memory Hierarchy
   a) Registers
   b) Caches
   c) Main Memory
   d) Memory Management Unit (MMU)
4) Processor Architectures
   a) Complex Instruction Set Computer (CISC), Microprogramming
   b) Reduced Instruction Set Computer (RISC)
   c) Instruction Set Architecture x86
5) Software Hardware Interface
   a) Instruction Level Parallelism
   b) Processes, Threads
   c) Synchronization
6) Simple Interconnections (Bus)
7) Modern Processor Architectures
   a) Superscalar Processors
   b) Very Long Instruction Word Machines (VLIW, EPIC)
8) I/O Systems
   a) Processor, System & Peripheral Busses
   b) I/O Components
   c) Device Structure
   d) DMA Controller
Vorlesung Rechnerarchitektur Seite 2
Rechnerarchitektur 1 Literature
Sima, Dezsö; Fountain, Terence; Kacsuk, Péter: Advanced Computer Architectures - A Design Space Approach. Addison-Wesley, 1997.
Hennessy, John L.; Patterson, David A. (standard reference): Computer Architecture: A Quantitative Approach. 2nd ed., Morgan Kaufmann, 1995. ISBN 1-55860-329-8.
Giloi, Wolfgang K. (recommended overview, unfortunately currently out of print): Rechnerarchitektur. 2nd, completely revised ed., Springer-Verlag, 1993. ISBN 3-540-56355-5.
Hwang, Kai (advanced, much material on parallel architectures): Advanced Computer Architecture: Parallelism, Scalability and Programmability. McGraw-Hill, 1993. ISBN 0-07-113342-9.
Patterson, David A.; Hennessy, John L. (easy introduction): Computer Organization and Design: The Hardware/Software Interface. 2nd ed., Morgan Kaufmann, 1994. ISBN 1-55860-281-X.
Tanenbaum, Andrew S. (optional, standard reference for operating systems): Modern Operating Systems. Prentice Hall, 1992. ISBN 0-13-595752-4.
Vorlesung Rechnerarchitektur Seite 3
History of Computer Generations
First (1945-54): Technology & architecture: vacuum tubes and relay memories, CPU driven by PC and accumulator, fixed-point arithmetic. Software & applications: machine/assembly languages, single user, no subroutine linkage, programmed I/O using CPU. Systems: ENIAC, Princeton IAS, IBM 701.
Second (1955-64): Technology & architecture: discrete transistors and core memories, floating-point arithmetic, I/O processors, multiplexed memory access. Software & applications: HLLs used with compilers, subroutine libraries, batch processing monitor. Systems: IBM 7090, CDC 1604, Univac LARC.
Third (1965-74): Technology & architecture: integrated circuits (S/MSI), microprogramming, pipelining, cache, lookahead processors. Software & applications: multiprogramming and timesharing OS, multiuser applications. Systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8.
Fourth (1975-90): Technology & architecture: LSI/VLSI and semiconductor memory, multiprocessors, vector supercomputers, multicomputers. Software & applications: multiprocessor OS, languages, compilers and environments for parallel processing. Systems: VAX 9000, Cray X-MP, IBM 3090.
Fifth (1991-96): Technology & architecture: ULSI/VHSIC processors, memory and switches, high-density packaging. Software & applications: massively parallel processing, grand-challenge applications. Systems: Cray MPP, CM-5, Intel Paragon.
Sixth (present): Technology & architecture: scalable off-the-shelf architectures, workstation clusters, high-speed interconnection networks. Software & applications: heterogeneous processing, fine-grain message transfer. Systems: Gigabit Ethernet, Myrinet.
[Hwang, Advanced Computer Architecture]
Vorlesung Rechnerarchitektur Seite 4
Technology Introduction
First integrated microprocessor: Intel 4004, 1971, 2700 transistors.
Vorlesung Rechnerarchitektur Seite 5
Technology Introduction
Year 1998 microprocessor: PowerPC 620, 7.5 million transistors.
Vorlesung Rechnerarchitektur Seite 6
Technology Introduction
Year 2003 microprocessor: AMD Opteron, approx. 106 million transistors. [AMD]
Vorlesung Rechnerarchitektur Seite 7
Technology - Technology Forecast
The 19th April 1965 issue of Electronics magazine contained an article with the title "Cramming more components onto integrated circuits". Its author, Gordon E. Moore, Director of Research and Development at Fairchild Semiconductor, had been asked to predict what would happen over the next ten years in the semiconductor component industry. His article speculated that by 1975 it would be possible to cram as many as 65 000 components onto a single silicon chip of about 6 mm2. [Spectrum, June '97]
Moore based his forecast on a log-linear plot of device complexity over time: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain constant for at least 10 years."
Moore's astonishing prediction was based on empirical data from just three Fairchild data points:
- the first planar transistor in 1959
- a few ICs in 1960, with ICs in production with 32 components by 1964
- an IC with 64 components in 1965
In 1975 he revised the slope of his function to a doubling of the transistor count every 18 months, and this has stayed true until today.
[Figure: number of components per integrated function (log2) versus year, 1959-1985, showing the log-linear growth]
Vorlesung Rechnerarchitektur Seite 8
Technology
[Photograph: Andrew (Andy) Grove, Robert Noyce, Gordon Moore]
Vorlesung Rechnerarchitektur Seite 9
Technology - Cost reduction due to mass fabrication
[Figure: cost per on-chip transistor (log10) versus year, 1959-2002; red curve for logic chips (no memory chips), values from Fairchild (Gordon Moore); by 2002 roughly 6.6 * 10^-6 $US per transistor]
Year 2002: 4.5 million transistors in 0.18 um CMOS technology on a 5x5 mm die with BGA package cost about 30 $US, i.e. roughly 6.6 x 10^-6 $US per transistor; standard cell design.
Vorlesung Rechnerarchitektur Seite 10
Technology - Modern Chip Technology
Using copper for the 6-layer metal interconnect structure of a CMOS chip delivers lower resistance of the wires and thus increases the performance. Gate structures can be found in the lower right part of the picture (arrow).
Vorlesung Rechnerarchitektur Seite 11
ATOLL - ATOmic Low Latency
- approx. 5 million transistors (1.4 million logic gates)
- die area: 5.78 mm x 5.78 mm
- 0.18 um CMOS (UMC, Taiwan)
- 385 staggered I/O pads
- 3 years of development
- year 2002
Vorlesung Rechnerarchitektur Seite 12
Speed Trends - Processor / Memory / I/O Speed Trends
[Figure: speed (log MHz) versus year, 1990-2002: internal CPU clock (MC68030, i860, DEC Alpha, Pentium III, first 1 GHz research CPU, 2.2 GHz P4), external bus clock (PC133, PC266), DRAM access rates (SDRAM, 1/tacc: 25 ns in page, 40 ns, 60 ns random) and I/O bus clocks (PCI, PCI-X 133 MHz), illustrating the growing processor-memory gap and processor-I/O gap]
Die ständige Steigerung der Rechenleistung moderner Prozessoren führt zu einer immer größer werdenden Lücke zwischen der Verarbeitungsgeschwindigkeit des Prozessors und der Zugriffsgeschwindigkeit des Hauptspeichers.
Die von Generation zu Generation der DRAMs steigende Speicherkapazität (x 4) führt auf Grund der 4-fachen Anzahl an Speicherzellen trotz der Verkleinerung der VLSI-Strukturen zu nur geringen Geschwindigkeitssteigerungen. (not included: new DRAM technology RAMBUS) Vorlesung Rechnerarchitektur Seite 13 Speed Trends Processor Memory I-O -Speed Trends Results a high Processor-Memory Performance Gap (increasing) a very high Processor-I/O Performance Gap (increasing) Single Processor Performance will increase Number of processors in SMPs will increase moderately STOP Bus Systems will become the main bottleneck Solution Transition to “switched” Systems on all system levels - Memory - I/O - Network Switched System Interconnect Vorlesung Rechnerarchitektur Seite 14 Was ist Rechnerarchitektur ? Rechnerarchitektur <- ’ computer architecture ’ The types of architecture are established not by architects but by society, according to the needs of the different institutions. Society sets the goals and assigns to the architect the job of finding the means of achieving them. Entsprechend ihrem Zweck unterscheidet man in der Baukunst Typen von Architekturen ..... Arbeit des Computer-Architekten: - finden eines Types von Architektur, die den vorgegebenen Zweck erfüllt. - Architekt muß gewisse Forderungen erfüllen. - Leistung - Kosten - Ausfalltoleranz - Erweiterbarkeit Materialien des Gebäudearchitekten : Holz, Stein, Beton, Glas ... Materialien des Rechnerarchitekten : integrierte Halbleiterbausteine ... Folgende Komponenten stellen die wesentlichen HW-Betriebsmittel des Computers dar: Prozessoren, Speicher, Verbindungseinrichtungen Vorschrift für das Zusammenfügen von Komponenten Operationsprinzip Struktur der Anordnung von Komponenten Strukturanordnung + Rechnerarchitektur Vorlesung Rechnerarchitektur Seite 15 Bestandteile der Rechnerarchitektur Eine Rechnerarchitektur ist bestimmt durch ein Operationsprinzip für die Hardware und die Struktur ihres Aufbaus aus den einzelnen Hardware-Betriebsmitteln. [Giloi 93] Das Operationsprinzip Das Operationsprinzip definiert das funktionelle Verhalten der Architektur durch Festlegung einer Informationsstruktur und einer Kontrollstruktur. Die (Hardware) - Struktur Die Struktur einer Rechnerarchitektur ist gegeben durch Art und Anzahl der Hardware-Betriebsmittel sowie die sie verbindenden Kommunikationseinrichtungen. - Kontrollstruktur : Die Kontrollstruktur einer Rechnerarchitektur wird durch Spezifikation der Algorithmen für die Interpretation und Transformation der Informationskomponenten der Maschine bestimmt. - Informationsstruktur : Die Informationsstruktur einer Rechnerarchitektur wird durch die Typen der Informationskomponenten in der Maschine bestimmt, der Repräsentation dieser Komponenten und der auf sie anwendbaren Operationen. Die Informationsstruktur läßt sich als Menge von ’abstrakten’ Datentypen spezifizieren. - Hardware-Betriebsmittel : Hardware-Betriebsmittel einer Rechnerarchitektur sind Prozessoren, Speicher, Verbindungseinrichtungen und Peripherie. - Kommunikationseinrichtungen : Kommunikationseinrichtungen sind die Verbindungseinrichtungen, wie z.B.: Busse, Kanäle, VNs ; und die Protokolle, die die Kommunikationsregeln zwischen den Hardware-Betriebsmittel festlegen. 
Vorlesung Rechnerarchitektur Seite 16 von Neumann Architektur Struktur: Beschreibt eine abstrakte Maschine des minimalen Hardwareaufwandes, bestehend aus: • • • • einer zentralen Recheneinheit (CPU), die sich in Daten- und Befehlsprozessor aufteilt einem Speicher für Daten und Befehle einem Ein/Ausgabe-Prozessor zur Anbindung periphärer Geräte einer Verbindungseinrichtung (Bus) zwischen diesen Komponenten Commands & Status Information CPU InstructionProcessor Instructions DataProcessor I/OProcessor Data Data Bus Data&Instruction Memory Vorlesung Rechnerarchitektur Seite 17 von Neumann Architektur Verarbeitung von Instruktionen: Programmanfang ersten Befehl aus Speicher holen Befehl in das Befehlsregister bringen Ausführung eventueller Adreßänderungen und ggf. Auswertung weiterer Angaben im Befehl evtl. Operanden aus dem Speicher holen nächsten Befehl aus dem Speicher holen Umsetzen des Operationscodes in Steueranweisungen Operation ausführen, Befehlszähler um 1 erhöhen oder Sprungadresse einsetzen Programmende? Ja Ende Nein Vorlesung Rechnerarchitektur Seite 18 von Neumann - Rechner (Burks, Goldstine, von Neumann) Architektur des minimalen Hardwareaufwands ! Die hohen Kosten der Rechnerhardware in der Anfangszeit der Rechnerentwicklung erzwang den möglichst kleinen Hardwareaufwand. - Befehle holen Operand holen (2x) Befehle interpretieren Befehl ausführen Indexoperation, Adressrechnung arithmetische Op. (gr. Aufwand an Zeit + HW) Ist die arithmetische Operation der wesentliche Aufwand, so können die anderen Teilschritte sequentiell ausgeführt werden, ohne die Verarbeitungszeit stark zu verlängern. Zeitsequentielle Ausführung der Befehle auf minimalen Hardware-Resourcen Veränderung der Randbedingungen : 1. Reduzierung der Hardwarekosten Hochintegration, Massenproduktion 2. Aufwand in den einzelnen Op. verschob sich. Gesucht : Ideen, mit zusätzlicher Hardware bisher nicht mögliche Leistungssteigerungen zu erreichen. Ideen = Operationsprinzipien Vorlesung Rechnerarchitektur Seite 19 Operationsprinzip - Pipeline Unter dem Operationsprinzip versteht man das funktionelle Verhalten der Architektur, welches auf der zugrunde liegenden Informations- und Kontrollstruktur basiert. Verarbeitung von mehreren Datenelementen mit nur einer Instruktion Pipeline - Prinzip Vektorrechner Feldrechner (’array of processing elements’) Pipeline - Prinzip Beispiel : Automobilfertigung Bestandteile eines ’sehr’ einfachen Autos : - Karosserie Lack Fahrgestell Motor Räder parallele V.E. (P.E‘s) Vorlesung Rechnerarchitektur Seite 20 Operationsprinzip - Pipeline Beispiel : Automobilfertigung Welche Möglichkeiten des Zusammenfügens (Assembly) gibt es ? Fließband (PIPELINE) Workgroup Worauf muß man achten ? Abhängigkeiten ! 
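Before following the assembly-line analogy further, the instruction-processing flow of the von Neumann machine shown above (page 17: fetch the instruction, bring it into the instruction register, fetch operands if needed, translate the operation code into control actions, execute, then increment the program counter or insert a jump address) can be written down directly as a loop. The following is a minimal sketch of such a fetch-decode-execute cycle for a hypothetical accumulator machine; the opcodes and the tiny instruction set are invented for illustration and do not correspond to a real ISA.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical 16-bit instruction word: 4-bit opcode, 12-bit address. */
    enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3, OP_JUMP = 4 };

    int main(void) {
        uint16_t mem[4096] = {0};      /* unified memory for data and instructions */
        uint16_t pc = 0;               /* program counter                          */
        int32_t  acc = 0;              /* accumulator                              */

        /* tiny demo program: acc = mem[100] + mem[101]; mem[102] = acc; halt */
        mem[0] = (OP_LOAD  << 12) | 100;
        mem[1] = (OP_ADD   << 12) | 101;
        mem[2] = (OP_STORE << 12) | 102;
        mem[3] = (OP_HALT  << 12);
        mem[100] = 7; mem[101] = 35;

        for (;;) {
            uint16_t ir   = mem[pc];          /* fetch instruction into instruction register */
            uint16_t op   = ir >> 12;         /* decode the operation code                   */
            uint16_t addr = ir & 0x0FFF;      /* possible address part of the instruction    */
            pc = pc + 1;                      /* default: increment the program counter      */

            switch (op) {                     /* execute */
            case OP_LOAD:  acc = mem[addr];           break;  /* fetch operand from memory   */
            case OP_ADD:   acc = acc + mem[addr];     break;
            case OP_STORE: mem[addr] = (uint16_t)acc; break;
            case OP_JUMP:  pc = addr;                 break;  /* insert jump address instead */
            case OP_HALT:  printf("result = %u\n", (unsigned)mem[102]); return 0;
            }
        }
    }

Note that instructions and data live in the same memory array, exactly as required by the von Neumann structure above.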
Fließband Blech Karosserie assembly Lack Fahrgestell Lackiererei 41 Fahrgestelleinbau Motor Motoreinbau Räder Räder montieren Produktion von verschiedenen Modellen : 3 Farben R(ot), G(rün) B(lau) Auftrag 2 Motoren N(ormal), F R I(njection) M N 2 Karosserien L(imousine), K K K(ombi) Vorlesung Rechnerarchitektur Seite 21 Operationsprinzip - Pipeline Pipeline des Fertigungsvorgangs 20 min K La 10 min F M R 10 min 10 min 10 min Lb 20 min Optimierung der Stufe : Lackierung L1 L2 10 min 10 min Stufen - Zeit - Diagramm der Pipeline stage Auftrag 41 K L1 L2 F M R 1:5 41 time 2:4 42 41 43 42 41 43 42 41 43 42 41 43 42 41 3:3 43 42 2:4 43 1:5 3:3 3:3 3:3 ppt Vorlesung Rechnerarchitektur Seite 22 Pipeline - Register Unter einem Register verstehen wir eine Hardwarestruktur die ein- oder mehrere Bits speichern kann. Register sind (normalerweise) D-Flip Flops (siehe PI2). Register D-FF D Q 32 32 clk Wichtige Kenndaten eines Registers: Clock-to-output Zeit (tco): Zeit zwischen Taktflanke und dem Zeitpunkt an dem eine Änderung des Eingangs am Ausgang sichtbar ist. Setup Zeit (tsu): Zeit vor der Taktflanke in der der Eingangswert schon stabil sein muss (sich nicht mehr ändern darf; Grund-> Digitale Schaltungstechnik, Stichwort: metastabile Zustände) Hold Zeit (th): Zeit nach der Taktflanke an der sich der Eingangswert noch nicht ändern darf (Grund wieder wie tsu). cycle time tcyc Clock D Q th Daten sollten am Eingang stabil sein tsu tco Ausgang gueltig mit neuen Daten Vorlesung Rechnerarchitektur Seite 23 Pipelining The performance gain achieved by pipelining is accomplished by partitioning an operation F into multiple suboperations f1 to fk and overlapping the execution of the suboperations of multiple operations F1 to Fn in successive stages of the pipeline [Ram77] Assumptions for Pipelining 1 the operation F can be partitioned 2 all suboperations fi require approximately the same amount of time 3 there are several operations F the execution of which can be overlapped Technology requirement for Pipelining 4 the execution time of the suboperations is long compared with the register delay time Linear Pipeline with k Stages F stage result(s) instr. & operands f1 f2 f3 fk [RAM77] Ramamoorthy, C.V., Pipeline Architecture, in: Computing Surveys, Vol.9, No. 1, 1977, pp. 61102. Vorlesung Rechnerarchitektur Seite 24 Pipelined Operation Time tp ( n, k) = k + (n-1) time to fill the pipeline time for this example: tp (10,5) = 5 + (10 - 1) = 14 time to process n-1 operations stages pipeline phases 1 2 3 4 = n k + (n-1) 5 start-up or fill processing drain Durchsatz Throughput TP ( n, k) = number of operations tp (n,k) operations time unit Gewinn Gain scalar execution time = pipelined execution time S ( n, k) = n k k + (n-1) initiation rate, latency Effizienz Efficiency E ( n, k) = 1 k lim S→ k n→∞ S ( n, k) = k n k = ( k + (n-1)) n k + (n-1) Pipeline Interrupts data dependencies control-flow dependencies resource dependencies Vorlesung Rechnerarchitektur Seite 25 Assumptions for Pipelining the operation F can be partitioned 1 F time tf F time tf / 2 1 1’ time tf 2 f1 f2 time tf / 2 2 2’ F time tf / 2 time tf all suboperations fi require approximately the same amount of time Version 1 f2 f1 f2 a f2 b f3 f2 c f1 << f2 time t2 /3 f3 << f2 f1 f2 f3 f2 f1 f2 Version 2 f2 f3 Vorlesung Rechnerarchitektur Seite 26 Assumptions for Pipelining 3 there are several operations F the execution of which can be overlapped If there is a discontinuous stream of operations F owing to a conflict, bubbles are inserted into the pipeline. 
This reduces the performance gain significantly. A typical example of this is the control dependency of the instruction pipeline of a processor. Here, each conditional branch instruction may disrupt the instruction stream and cause (k-1) bubbles (no-operations) in the pipeline, if the control flow is directed to the nonpredicted path. 4 the execution time of the suboperations is long compared with the register delay time Item 4 is a technological requirement for the utilization of a pipeline. Assuming a partitioning of the operation F into three suboperations f1, f2, f3, and also no pipelining, the operation F can be executed in the time: t (F) = tf1 + tf2 + tf3 Introduction of registers stage 1 stage 2 stage 3 Clock Di tf1 tsu tco Di tf2 tsu tco input t (F) = ( max (tfi) + tco + tsu ) tcyc tcyc = max (tfi) + tco + tsu Do tf3 tsu tco output 3=3 max (tfi) + 3 k stages ( tco + tsu ) register delay time fcyc = 1 / tcyc The registers are introduced behind each function of the suboperation and this creates the pipeline stages. Placing the register at the output (not at the input!!!) makes suboperation stages compatible with the definition of state machines, which are used to control the pipeline. Vorlesung Rechnerarchitektur Seite 27 Data Flow Dependencies Three different kinds of data flow dependency hazards can occur between two instructions. These data dependencies can be distinguished by the definition-use relation. - flow dependency - anti-dependency - output dependency read after write RAW write after read WAR write after write WAW Basically, the only real dependencies in a program are the data flow dependencies, which define the logical sequence of operations transforming a set of input values to a set of output values (see also Data Flow Architectures). Changing the order of instructions must consider these data flow dependencies to keep the semantic of a program. The Flow Dependency read after write (RAW) This data dependency hazard can occur e.g. in a pipelined processor where values for a second instruction i+1 are needed in the earlier stages and have not yet been computed by the later stages of instruction i. The hazardous condition arises if the dependend instructions are closer together than the number of stages among which the conflicting condition arises. definition relation use destination ( i ) = source ( i + 1 ) X <- A + B instruction i Y <- X + C instruction i+1 stage issue i issue i+1 RF read read 2 read ports ALU execute RF write read execute write read A,B i time i+1 read X,C i X:= A op B i+1 Y:= X op C i write X i+1 write Y write 1 write port To avoid the hazard, the two instructions must be separated in time within the pipeline. This can be accomplished by inserting bubbles (NOPs) into the pipeline (simply speaken: by waiting for i to complete) or building a special hardware (hardware interlock), which inserts NOPs automatically. ppt Vorlesung Rechnerarchitektur Seite 28 Data Flow Dependencies The Anti-Dependency write after read (WAR) This dependency can only arise in the case of out-of-order instruction issue or out-of-order completion. The write back phase of instruction i+1 may be earlier than the read phase of instruction i. Typical high-performance processors do not use out-of-order issue. This case, then, is of less importance. If the compiler reorders instructions, this reordering must preserve the semantics of the program and take care of such data dependencies. 
source ( i ) = destination ( i + 1 ) X <- Y + B instruction i Y <- A + C instruction i+1 The Output Dependency write after write (WAW) The result y of the first instruction i will be written back to the register file later than y of instruction i+1 because of a longer pipeline for the division. This reverses the sequence of writes i and i+1 and is called out-of-order completion. destination ( i ) = destination ( i + 1 ) Y <- A / B instruction i Y <- C + D instruction i+1 FU2 i+1 C op D stage issue i issue i+1 FU2 read A,B FU1 i i+1 read C,D i 1. A op B i 2. A op B i 3. A op B i+1 write Y i FU1 write Y RF write RF read read time read execute write execute write Vorlesung Rechnerarchitektur Seite 29 Data Flow Dependencies Inserting Bubbles We must avoid the RAW-hazard in the pipeline, because reading X before it is written back from the previous instruction reads an old value of X and thus destroys the semantic of the two sequential instructions. The sequential execution model (von Neumann architecture) assumes that the previous instruction is completed before the execution advances to the next instruction. 2 read ports 1 write port RF read RF write ALU pipeline distance stage read issue i i execute write read A,B i time X:= A op B i issue i issue NOP issue NOP issue i issue NOP issue NOP issue i+1 write X read A,B i i X:= A op B w1 NOP w2 w1 NOP i write X NOP w2 NOP w1 NOP w2 NOP read A,B i i X:= A op B w1 NOP w2 w1 NOP i write X NOP NOP w1 NOP i+1 read X,C w2 w2 NOP i+1 Y:= X op C i+1 write Y The conflicting stages are RF read and the RF write, which have a distance of 3 pipeline stages. Reading the correct value for X from the register file requires the insertion of 2 NOPs in between the two conflicting instructions. This delays the instruction i+1 by 2 clocks which then removes the RAW-hazard. The compiler (or programmer) is responsible for detecting the hazard and and inserting NOP into the instruction stream. The 2 bubbles in the pipeline can be utilized by other usefull instructions independent from i and i+1 (see instruction scheduling). Vorlesung Rechnerarchitektur Seite 30 Data Flow Dependencies Hardware Interlock A hardware mechanism for detecting and avoiding the RAW-Hazard is the interlock hardware. The RAW-Hazard condition is detected by comparing the source fields of instruction i+1 (and i+2) with the destination field of instruction i, until i is written back to the register file and thus the instruction is completely executed. stage issue i issue i+1 RF read ALU read execute RF write instruction fetch i i+1 i+2 i+2 i+2 i+3 execute write read A,B i time i+1 read X,C i X:= A op B i+1 Y:= X op C i write X i+1 write Y write stage time read decode i i+1 i+1 i+1 i+2 issue point read issue check for i+1 instruction execute write i read A,B A i X:= A op B i write X i+1 read X,C i+1 Y:= X op C i+1 write Y B delay point for conflict bubbles in the pipeline The hardware interlock detects the RAW-Hazard and delays the issue of instruction i+1 until the write back of instruction i is completed. The hardware mechanism doesn’t need additional NOPs in the instruction stream. The bubbles are inserted by the hardware. Nevertheless the produced bubbles can be avoided by scheduling useful and independent instructions in between i and i+1 (see also instruction scheduling). Vorlesung Rechnerarchitektur Seite 31 Data Forwarding Data forwarding Data forwarding is a hardware structure, which helps to reduce the number of bubbles in the pipeline. 
As can be seen in the stage time diagram, the result of a calulation in the execute stage is ready just behind the ALU register. Using this value for the next source operand instead of waiting after the write back stage and then reading it again from the RF, can save the two bubbles. forwarding control 2 data forwarding path (S1) RF read (S2) ALU (R) RF write load data path forwarding data mux data from cache/memory stage instruction time fetch i i+1 decode i i+1 issue point read i read A,B i+1 read X,C instruction execute i X:= A op B i+1 Y:= X op C write i write X i+1 write Y data forwarding A issue check for i+1 no bubble in the pipeline The forwarding data path takes the result from the output of the execute stage and send it directly to the input of the ALU. There a data forwarding multiplexer is switch to the forwarding path in order to replace the invalid source operand read from the register file. The forwarding control logic detects that a specific register (eg. R7) is under new definition and at the same time is used in the following instruction as a source operand. In this case, the corresponding mux is switch to replace S1 or S2, which is the old value from register R7. The scoreboard logic implements this checking in the stage before the issue point. Vorlesung Rechnerarchitektur Seite 32 Control Flow Dependencies The use of pipelining causes difficulties with respect to control-flow dependencies. Every change of control flow potentially interrupts the smooth execution of a pipeline and may produce bubbles in it. One of the following instructions may be responsible for them: - conditional branch instruction jump instruction jump to subroutine return from subroutine The bubbles (no operations) in the pipeline reduce the gain of the pipeline thus reducing performance significantly. There are two causes of bubbles. The first one is a data dependency in the pipeline itself, the branch condition being evaluated a pipeline stage later than needed by the instruction fetch stage to determine the correct instruction path. The second one is the latency for accessing the instruction and the new destination for the instruction stream. Pipeline Utilization U = n n+m m=b (1-p) Nb + b q No = 1 1+ m n n number of useful instructions m number of no operations (bubbles) b number of branches p probability of correct guesses Nb penalty for wrong guess q frequency of other causes (jump) No penalty for other causes no ops caused by branches no ops caused by the latency of branch target fetches U = 1 b 1+ n (1-p) b Nb + n q No Vorlesung Rechnerarchitektur Seite 33 Control Flow Dependencies Reduction of Branch Penalty The ’no’ operations (NOPs) in the pipeline can be reduced by the following techniques: reduction of branch penalty Nb - forwarding of condition code - fast compare - use of delay slots - delayed branch - delayed branch with squashing increase of p - static branch prediction - dynamic branch prediction - profiled branch prediction reduction of instruction fetch penalty No - instruction cache improvement of instruction prefetch branch target buffer alternate path following avoid branches - predication of instructions Branch architectures The design of control flow and especially conditional branches in ISAs is very complex. The design space is large and advantages and disadvantages of various branch architectures are difficult to analyze. The following discussion of branches shows only a small area of the design space. 
For further reading please refer to: Sima et al., Advanced Computer Architectures, A Design Space Approach Vorlesung Rechnerarchitektur Seite 34 Control Flow Dependencies Effect of Branches Before we start to analyse the effects of branching we should introduce the basic terms and have a closer look to the internals of the instruction fetch unit. Definition : Branch successor: The next sequential instruction after a (conditional) branch is named the branch successor or fall-through instruction. In this case, it is assumed the branch is not taken. The operation of fetching the next instruction in the instruction stream requires to add the distance from the actual instruction to the next instruction to the value of the actual program counter (PC). Definition : Branch target: The instruction to be executed after a branch is taken is called a branch target. The operation required to take the branch is to add the branch offset from the instruction word to the PC and then fetch the the branch target instruction from the instruction memory. The number of pipeline cycles wasted between a branch taken and the resumption of instruction execution of the branch target is called a delay slot. Depending on the pipeline the number of delay slots b may vary between 0 and k-1. In the following stage-time diagram we assume that the instruction fetch can immediately continue when the branch direction is defined by the selection of the condition code, which is forwarded to the instruction fetch stage. stage fetch time i-1 cmp branch ’cc’ i delay slot 1 delay slot 2 delay slot 3 correct next instructions branch successor or branch target forwarding of cmpare result (condition codes) forwarding of CC selection decode read execute write i-1 i i-1 i i-1 i i-1 i Vorlesung Rechnerarchitektur Seite 35 Control Flow Dependencies Effect of Branches In the following we use the term condition code CC as the result of a compare instruction, which might be stored in a special CC-register or stored in a general purpose integer register Rx. Using Rx as destination allows multiple cmp-instructions to be overlapped. The corresponding bcc has to read its register and decode the actual cc from the bits of Rx. Feeding the result of this bit selection to the instruction fetch stage through the forwarding path of cc selection instructs the IF stage to switch the control flow to the new direction. cmp R1,R2 -> CC calculate all reasonable bits and store to condition code register bcc CC, offset change the control flow depending on the selection of CC bits calculate next PC cmp R1,R2 -> R4 cmp R6,R7 -> R5 bcc R4, offset bcc R5, offset forwarding path of cc selection cc IF RF read DEC ALU RF write forwarding data path for CC branch predecessor i-1 branch i delay slot branch successor i+2 i+3 i+1 branch target t t+1 numbering is used to identify the instructions in the flow of processor code The calculation of the branch PC might be performed in the ALU, but is typically shifted to earlier stages with an additional adder. The PC is held in the IF-stage and not stored in the RF! 
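The cost of such control-flow breaks can be quantified with the formulas given earlier: the pipeline speedup S(n,k) = nk / (k + (n-1)) from page 24 and the utilization U = n / (n+m) with m = b(1-p)·Nb + b·q·No from page 32. The sketch below simply evaluates both for concrete numbers; the branch frequency, prediction accuracy and penalties chosen in main() are illustrative assumptions, not measured data.

    #include <stdio.h>

    /* Speedup of a k-stage pipeline processing n operations (page 24). */
    static double speedup(double n, double k) {
        return (n * k) / (k + (n - 1.0));
    }

    /* Pipeline utilization with bubbles caused by branches (page 32):
       n  = number of useful instructions
       b  = number of branches, p = probability of correct guesses
       Nb = penalty (bubbles) per mispredicted branch
       q  = frequency of other causes, No = penalty for other causes   */
    static double utilization(double n, double b, double p,
                              double Nb, double q, double No) {
        double m = b * (1.0 - p) * Nb + b * q * No;   /* total number of bubbles */
        return n / (n + m);
    }

    int main(void) {
        /* the example from page 24: k = 5 stages, n = 10 operations -> 50/14 */
        printf("S(10,5) = %.2f\n", speedup(10.0, 5.0));

        /* assumed workload: 100 instructions, 20 branches, 85 percent of them
           predicted correctly, 3 bubbles per misprediction, no other penalties */
        printf("U       = %.2f\n", utilization(100.0, 20.0, 0.85, 3.0, 0.0, 0.0));
        return 0;
    }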
Vorlesung Rechnerarchitektur Seite 36 Forwarding of the Condition Code forwarding path of cc selection cc IF DEC RF read ALU RF write forwarding data path for CC stage fetch time i-1 cmp branch delay slot correct next instructions i i+1 i+2 i+3 decode read forwarding of cmp data (CC) forwarding of cc execute write i-1 i i+1 i+2 i+3 i-1 i i+1 i+2 i+3 i-1 i i+1 i+2 i+3 i-1 i i+1 i+2 i+3 Vorlesung Rechnerarchitektur Seite 37 Control Flow Dependencies Fast Compare simple tests for equal, unequal, <0, <=0, >0, >=0, =0 fast compare logic 0 = CC IF DEC RF read ALU RF write forwarding of fast cmp data stage fetch time i-1 cmp branch delay slot delay slot correct next instructions i i+1 i+2 decode read execute write i-1 i i+1 i+2 i-1 i i+1 i+2 i-1 i i+1 i+2 i-1 i i+1 i+2 Vorlesung Rechnerarchitektur Seite 38 Delayed Branch Idea: Reducing the branch penalty by allowing <d> useful instructions to be executed before the control transfer takes place. 67% of all branches of a program are taken (loops!). Therefore, it is wise to use the prediction "branch taken". cc=true ready stage time branch instruction delay slot instr. target instruction target +1 instr. fetch i i+1 t t+1 decode i i+1 t t+1 read i i+1 t t+1 execute write i i+1 t t+1 i i+1 t t+1 control transfer delay slot in the pipeline (a) branch taken CC=false ready stage time branch instruction delay slot instr. target instruction target +1 instr. successor instr. fetch i i+1 t t+1 i+2 decode i i+1 t t+1 i+2 read i i+1 t t+1 i+2 execute write forwarding of CC i i+1 t t+1 i+2 i i+1 t t+1 i+2 (b) branch not taken Goal: zero branch penalty Technique: moving independent instructions into the delay slots. Probability of using delay slots: 1. slot ~ 0.6; 2. slot ~ 0.2; 3. slot ~ 0.1; control transfer bubbles in the pipeline Vorlesung Rechnerarchitektur Seite 39 Delayed Branch with Squashing 31 0 bcc static branch prediction bit branch offset anullment bit forwarding of CC to control flow 4-stages fetch instructions time i branch i+1 delay slot t target t+1 target + 1 target + 2 t+2 decode & read branch taken execute write i i+1 t t+1 t+2 i i i+1 t t+1 i+1 t t+2 t+1 t+2 forwarding of CC to control flow 4-stages fetch instructions time i branch i+1 delay slot t target i+2 successor+ 2 successor+ 3 i+3 decode & read branch NOT taken execute write i i+1 t i+2 i+3 i i+1 t i+2 i+3 i i+1 t i+2 i+3 annulled delay and target instructions (a=1) If the anullment bit is asserted, the delay slot instruction will be anulled, when the branch decision was predicted falsely. This enables the compiler to schedule instructions from before and after the branch instruction into the delay slot. ppt Vorlesung Rechnerarchitektur Seite 40 Branch Prediction Two different approaches can be distinguished here: - static-branch prediction - dynamic-branch prediction The static-branch information associated with each branch does not change as the program executes. The static-branch prediction utilizes compile time knowledge about the behaviour of a branch. The compiler can try to optimize the ordering of instructions for the correct path. 
Four prediction schemes can be distinguished: - branch not taken branch taken backward branch taken branch direction bit built-in static prediction strategies branch delay slot =1 branch successor branch not taken Static branch =0 prediction bit branch target branch taken branch backward branch with static prediction Vorlesung Rechnerarchitektur Seite 41 Static Branch Prediction The hardware resources required for static branch prediction are a forwarding path for the prediction bit, and a logic for changing the instruction fetch strategy in the IF stage. This prediction bit supplied by the output of the instruction decode stage decides which path is followed. The prediction bit allows execution of instructions from the branch target or from the successor path directly after the delay slot instruction. IF DEC RF read ALU RF write Static branch prediction bit branch delay slot =1 =0 branch successor branch target The static branch prediction is controlled by the programmer/compiler (e.g. prediction bit within the branch instruction). For example, the GCC will try to use static branch prediction if available from the architecture and optimizations are turned on. If the programmer wants to retain control, GCC provides a builtin function for this purpose ( __builtin_expect(condition, c) ): if (__builtin_expect (a==b, 0)) f(); This means, we expect a==b to evaluate to false ( =0 in C) and therefore to not execute the function f(). Vorlesung Rechnerarchitektur Seite 42 Profiled Branch Prediction The profiled branch prediction is a special case of static prediction. The guess of the branch direction is based on a test run of the program. This profile information on branch behaviour is used to insert a static branch prediction bit into the branch instruction that has a higher proportion of correct guesses. from the GCC 3.x manual ‘-fprofile-arcs’ Instrument "arcs" during compilation to generate coverage data or for profile-directed block ordering. During execution the program records how many times each branch is executed and how many times it is taken. When the compiled program exits it saves this data to a file called ‘AUXNAME.da’ for each source file. AUXNAME is generated from the name of the output file, if explicitly specified and it is not the final executable, otherwise it is the basename of the source file. In both cases any suffix is removed (e.g. ‘foo.da’ for input file ‘dir/foo.c’, or ‘dir/foo.da’ for output file specified as ‘-o dir/foo.o’). For profile-directed block ordering, compile the program with ‘-fprofile-arcs’ plus optimization and code generation options, generate the arc profile information by running the program on a selected workload, and then compile the program again with the same optimization and code generation options plus ‘-fbranch-probabilities’ (*note Options that Control Optimization: Optimize Options.). The other use of ‘-fprofile-arcs’ is for use with ‘gcov’, when it is used with the ‘-ftest-coverage’ option. With ‘-fprofile-arcs’, for each function of your program GCC creates a program flow graph, then finds a spanning tree for the graph. Only arcs that are not on the spanning tree have to be instrumented: the compiler adds code to count the number of times that these arcs are executed. When an arc is the only exit or only entrance to a block, the instrumentation code can be added to the block; otherwise, a new basic block must be created to hold the instrumentation code. 
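A common C idiom builds on the __builtin_expect hint shown above: it is wrapped in likely()/unlikely() macros so that the expected branch direction is visible at the call site, and GCC can lay out the frequent path as straight-line (fall-through) code or set a static prediction bit where the target ISA has one. The macro names are only a convention (popularized by the Linux kernel), not part of the C standard; this is a usage sketch, not a recommendation to annotate every branch.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hint to the compiler which way the branch usually goes; GCC can then
       lay out the likely path as the fall-through (straight-line) code.   */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int process(const int *buf, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            if (unlikely(buf[i] < 0))     /* error path: expected to be rare    */
                return -1;
            sum += buf[i];                /* hot path: laid out as fall-through */
        }
        printf("sum = %ld\n", sum);
        return 0;
    }

    int main(void) {
        int data[4] = {1, 2, 3, 4};
        return process(data, 4) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
    }

Such hand-written hints are easy to get wrong; profile-directed feedback with ‘-fprofile-arcs’ and ‘-fbranch-probabilities’, as described above, usually produces better guesses.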
Vorlesung Rechnerarchitektur Seite 43 Dynamic Branch Prediction Dynamic branch prediction uses information on branch behaviour as the program executes. No initial dynamic information exists before the program starts executing. The dynamic information is normally associated with a specific branch. The association can be realized by using the instruction address of the branch as the unique identifier of the branch. The dynamic prediction depends on the past behaviour of this branch and is stored in a table addressed by the branch instruction address. A unique addressing would need the whole address as an index, usually 32 bit in length. This length prohibits direct use of this address as an index to the state table. simple branch predictor using only one history bit X’_Y’ : = f (a, X_Y) a current branch behaviour X prediction bit Y history bits old state input new state low order of branch address a XY /a n 00 predictor state memory /a a 01 /a 10 1 11 a n history bits prediction bit to/from instruction sequencer a current branch behaviour a X Y predictor state machine X’ Y’ /a a current branch behaviour taken a not taken /a X prediction bit take 1 take not 0 Y history bit last branch taken 1 last branch not taken 0 Vorlesung Rechnerarchitektur Seite 44 Dynamic Branch Prediction Integration of Dynamic Branch predictor into the pipeline branch address dynamic branch predictor a X CC a X IF DEC RF read ALU RF write Vorlesung Rechnerarchitektur Seite 45 Reduction of Instruction Fetch Penalty Instruction Cache Instruction Buffer Idea: Holding a number of instruction (e.g. 32) in fast buffer organized as an "instruction window". prefetch PC fetch PC prefetch PC top buffer length fetch PC memory fetch bottom DEC stage instruction buffer hit condition <= bottom <= fetch PC <= top >= top bottom & hit Branch Target Buffer Optimize instruction fetch in case of a correctly predicted branch low order bits of branch address m high order bits of branch address n predictor state memory = hit branch address tag field branch PC memory branch target instructions t 1 prediction bit to/from instruction sequencer current branch behavior n history bits predictor state machine branch PC to IF stage hiding Nb to DEC stage hiding No t+3 t+2 t+1 Vorlesung Rechnerarchitektur Seite 46 Predication Minimization of branches • "guarded" or conditional instructions - the instruction executes, if a given condition is true. If the condition is not true the instruction does not execute - example: CMOVcc instructions (conditional move) from the IA-32 architecture (available since P6) • the test is embedded in the instruction itself - no branch involved • predicated instructions - predicates are precomputed values stored in registers - instruction contains a field to select a predicate from the predicate RF - example: IA-64 architecture ’guarded’ or conditional instruction p1 = "true" p2 = p1 = "false" <p1> a = a + 1 <p2> b = b + 1 predicate cmp p1 cmp p1 yes a=a+1 no yes p2 no b=b+1 <p1> a+1 p1 <p2> b+1 p2 Vorlesung Rechnerarchitektur Seite 47 Predication How Predication works Instruction 1 1. The branch has two possible outcomes. 2. The compiler assigns a predicate register to each following instruction, according to its path. Instruction 2 3. All instructions along this path point to predicate register P1. Instruction 3 (branch) 4. All instructions along this path point to predicate register P2. Instruction 4 (P1) 5. CPU begins executing instructions from both paths. Instruction 7 (P2) Instruction 5 (P1) 6. 
CPU can execute instructions from different paths in parallel because they have no mutual dependencies. Instruction 8 (P2) Instruction 6 (P1) 7. When CPU knows the compare outcome, it discards results from invalid path. Instruction 9 (P2) The compiler might rearrange instructions in this order, pairing instructions 4 and 7, 5 and 8, and 6 and 9 for parallel execution. 128-bit long instruction words Instruction 1 Instruction 2 Instruction 3 (branch) Instruction 4 (P1) Instruction 7 (P2) Instruction 5 (P1) Instruction 8 (P2) Instruction 6 (P1) Instruction 9 (P2) Predication replaces branch prediction by allowing the CPU to execute all possible branch paths in parallel. Vorlesung Rechnerarchitektur Seite 48 Memory Access Load/Store Data Path Using data directly from the register file or forwarding result data from the output of the ALU of the EX stage is straight forward. But in which way the data is placed initially in the register file? There are two different approaches: • Instructions like ADD which can access a memory location by using an effective address generation (memory-to-register Architecture like MC68020, see also 1- 2- or 3-address machines) e.g. ADD D3, (A4) • Instructions like LOAD (LD) or STORE (ST) which are specialized for memory accesses, e.g. LD R3, (R4) In the following we will focus on LD/ST architectures, because the decoupling of LD and data processing instructions like ADD is a very advantageous feature of high performance processors. The latency of the memory (Processor Memory Gap!) can be handled independently of the ADD instruction. Scheduling the LD instruction early enough before the use of the data value can hide the memory latency (see ILP). forwarding control 4 data forwarding path forwarding data mux EX-stage (S1) RF read (S2) (R) ALU RF write load data path LD functional unit ST is omitted for simplicity latency >10 clock ticks memory access stage issue point instruction time LD fetch i i+1 decode i i+1 instruction read execute write i LD R3,(R4) A issue check for i+1 i LD R3,(R4) stalled data forwarding 10 clock ticks i+1 use R3 i+1 use R3 i write R3 This simple LD/ST architecture does not include address calculation in the instructions. Other processors of the LD/ST type add another stage in front of LD/ST stage to perform address calculations. Vorlesung Rechnerarchitektur Seite 49 Memory System Architecture The two basic principles that are used to increase the performance of a processor are: pipelining and parallel execution the optimization of the memory hierarchy. Applying a high-performance memory system to modern microprocessors is another step towards improving performance. Whenever off-chip accesses have to be performed, throughput is reduced because of the delay involved and the latency of external storage devices. The memory system typically consists of a hierarchy of memories. Definition : A memory hierarchy is the result of an optimization process with respect to technological and economic constrains. The implementation of the memory hierarchy consists of multiple levels, which are differing in size and speed. It is used for storing the working set of the ‘process in execution’ as near as possible to the execution unit. 
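Looking back at predication (pages 46-47): on a smaller scale, ordinary compilers apply the same idea when they turn a short if/else into a conditional move (for example CMOVcc on IA-32, as mentioned on page 46), so that no branch and therefore no misprediction is possible. The sketch below contrasts a branchy and a branch-free formulation of the same selection; whether a compiler really emits a conditional move depends on the target and the optimization level, so this only illustrates the principle.

    #include <stdio.h>

    /* Branchy version: a conditional branch the predictor has to guess. */
    static int max_branch(int a, int b) {
        if (a > b)
            return a;
        else
            return b;
    }

    /* Branch-free version: both "paths" are evaluated and the condition only
       selects the result - the same idea as guarded/predicated execution.   */
    static int max_select(int a, int b) {
        int cond = (a > b);               /* predicate: 1 or 0                 */
        return cond * a + (1 - cond) * b; /* select without a control transfer */
    }

    int main(void) {
        printf("%d %d\n", max_branch(3, 7), max_select(3, 7));
        return 0;
    }

The ternary form r = (a > b) ? a : b expresses the same selection and is the form a compiler is most likely to turn into a conditional move.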
The memory hierarchy consists of the following levels: - registers primary caches local memory secondary caches main memory speed cost more expensive size faster on chip denser larger off chip The mechanisms for the data movement between levels may be explicit (for registers, by means of load instructions) or implicit (for caches, by means of memory addressing). CPU Chip CPU Register Files 1st-Llevel Cache External Processor Interface 2nd-Level Cache Main Memory Disk Storage Vorlesung Rechnerarchitektur Seite 50 Registers Registers are the fastest storage elements within a processor. Hence, they are used to keep values locally on-chip, and to supply operands to the execution unit(s) and store the results for further processing. The read-write cycle time of registers must be equal to the cycle time of the execution unit and the rest of the processor in order to allow pipelining of these units. Data is moved from and into the registers by explicit software control. Registers can be grouped in a wide variety of ways: - accumulator evaluation stack register file multiple register windows register rotation register banks Evaluation Stack A very simple register structure is the evaluation stack. The addressing of values is done implicitly by most of the operations. Push and pop operations are used to enter new values on the stack, or to store the top of the stack back to the memory. Dyadic operations, like add, sub, mul, consume the top two registers as source operands, leaving the result on the top of the stack. Implicit addressing of the stack reduces the need for instructions to specify the location of their operands, which in turn reduces the size of instructions and results in a very compact code. A B C 10 5 3 A B C 10 5 3 A B C 10 5 3 A B C 7 10 5 3 lost push ( 7) A B C 5 3 x pop (10) A B C 15 3 x add before operation A:= A + const (7) A:=A+B A B C 10 5 3 A B C 17 5 3 add + const ( 7) after operation Vorlesung Rechnerarchitektur Seite 51 Register File The collection of multiple registers into a register file, directly accessible from the instruction word, provides a faster access to data, that will have a high ‘hit rate’ within a short period of time. The compiler must identify those variables and allocate them into registers. This task is named register allocation. The reuse of these variables held in registers reduces main memory accesses and therefore speeds up execution. The register file (RF) is the commonly used grouping of registers in a CPU. The number of registers depends on many constrains, as there are: - the length of the instruction word containing the addresses of the registers combined in an operation - the speed of the register file which is inverse proportional to the number of register cells and the number of ports. - the number of registers needed for software optimization, parameter passing for function call and return (32 is more than enough) - the penalty for saving and restoring the RF in case of a process switch - the number of ports of the RF for parallel operations The addressing of the registers in the RF is performed by the register address of an instruction. The number of register address fields in the instruction field divide the processors into classes called two, 2 1/2, or three address machines. Because dyadic operations are very frequent, most modern processors are three address machines and have two source address fields and one destination address field. The following figure presents a typical 32-bit instruction format of a RISC processor. 
31 26 25 instruction class 6 15 14 opcode 11 10 9 source 1 5 5 4 source 2 5 0 destination 5 5bit register address The limitation of the instruction word width restricts the addressable number of registers to 32 caused by the 5-bit register address fields. The typical VLSI realization of a RF is a static random access memory with a 6-transistor cell, 4 for the storage cell and two transistors for every read/write access port, which means two simultaneous reads (two read-ports) or one write (one write-port). Vorlesung Rechnerarchitektur Seite 52 Register File 1 x write port result destination source2 source1 Line Driver result result BIT LINE A BIT LINE B data line WORD LINE B select line word width WORD LINE A Vref 2 x read port + - + - operand 1 Sense Amplifier operand 2 BIT LINE A BIT LINE B WORD LINE B additional WORD LINE C additional pass transistor figure b WORD LINE A additional BIT LINE C The register file features two read ports and one write port, which shares the two bit lines. The read and the write cycle are time-multiplexed. The read operation activates the two source address decoders and they drive the corresponding word line A and B. Operand 1 and operand 2 can be read on the two bit lines. Special sense amplifiers and charge circuits control the data read-out. The write cycle activates the destination decoder which must drive both word lines. The result data is forced to the storage cell by applying data and negated data on both bit lines to one cell. The geometry of the transistors is chosen so that the flip-flop is overruled by the external bit line information. An additional read port (3read/1write port) requires one extra word line (from additional source 3 decoder) a pass transistor and an extra bit line as shown in figure b. Every extra port at the register file increases the area consumed by the RF and thus reduces the speed of the RF. Vorlesung Rechnerarchitektur Seite 53 Register Windows The desire to optimize the frequent function calls and returns led to the structure of overlapping register windows. The save and restore of local variables to and from the procedure stack is avoided, if the calling procedure can get an empty set of registers at the procedure entry point. The passing of parameter to the procedure can be performed by overlapping of the actual with the new window. For global variables a number of global registers can be reserved and accessed from all procedure levels. This structure optimize the procedure call mechanism of high level languages, and optimizing compiler can allocate many variables and parameters directly into registers. register file 135 parameter 128 127 120 119 local variables parameter 112 procedure A R31A R24A R23A R16A R15A overlapping registers between A and B parameter local variables procedure B overlapping registers between B and C‘ R31B parameter R8A local variables R24B R23B R16B R15B R8B parameter local variables procedure C R31C parameter R24C R23C R16C R15C 31 R8C parameter 24 23 16 15 parameter local variables 24 window registers parameter local variables parameter 8 7 0 global variables R7A R0A global variables R7B R0B global variables R7C R0C global variables 8 global registers Due to the fixed size of the instruction word the register select fields have the same width (5 bits) as in the register file structure. The large RF requires more address bits and therefore the addressing is done relatively to the window pointer. 
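The window-relative register addressing described above can be modeled in software. The sketch below assumes a register file with 8 globals and 8 windows of 16 new registers each, with an 8-register overlap between neighbouring windows as in the figure; the real SPARC implementation folds the addition into the register-file decoder and additionally handles window overflow/underflow traps, which are omitted here.

    #include <stdio.h>

    #define NGLOBALS  8
    #define NWINDOWS  8                     /* assumed number of windows          */
    #define WIN_REGS  16                    /* new registers per window           */

    /* Map an architectural register number (0..31) plus the current window
       pointer (cwp) to an index into the large physical register file:
       r0..r7  : global registers, shared by all windows
       r8..r15 : parameters, shared with the window of the called procedure
       r16..r23: local variables
       r24..r31: parameters, shared with the window of the calling procedure  */
    static int phys_reg(int r, int cwp) {
        if (r < NGLOBALS)
            return r;                                        /* global registers   */
        int offset = cwp * WIN_REGS + (r - NGLOBALS);
        return NGLOBALS + (offset % (NWINDOWS * WIN_REGS));  /* wrap around the RF */
    }

    int main(void) {
        int caller_cwp = 3;
        int callee_cwp = caller_cwp - 1;     /* 'call' decrements the cwp          */
        /* the caller's parameter register r8 and the callee's parameter register
           r24 map to the same physical cell - this is the window overlap:         */
        printf("caller r8  -> phys %d\n", phys_reg(8,  caller_cwp));
        printf("callee r24 -> phys %d\n", phys_reg(24, callee_cwp));
        return 0;
    }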
The window pointer is kept in a special register called current window pointer cwp and provides the base address of the actual window. An addition of the cwp with the content of the register select field provides the physical address to the large register file. The following figure presents the logical structure of the addressing scheme. The addition and the global register multiplexer can be incorporated in the register file decoder by using the addressing truth table as the basis for the decoder implementation. Nevertheless this address translation slows down the access of the RF. The cwp is controlled directly by the instruction ‘call’, which decrements the pointer and by the ‘return’, which increments the cwp. This scheme has been implemented by the SPARC-family of processors. Note that the register window allocated by a call instruction has a fixed size regardless of the size necessary to accomodate the called function. The Itanium family by Intel (IA64) uses and optimized version enabling variable size windows. Vorlesung Rechnerarchitektur Seite 54 Register Rotation Consider a typical loop for a numerical calculation: X=V+U where V,U,X are vectors. loop: ld U[lc], R32 ld V[lc], R33 add R32,R33,R34 st R34, X[lc] dec lc cmp lc, #0 bne loop ;lc = loop counter ;R32,R33,R34 = register used by loop Problem: Dependency between ld/add/st operations even through loop iterations although individual loops are logically independent! Dependency because of usage of same register. One can use loop unrolling together with usage of more registers to solve this problem (see ILP, loop unrolling). Different solution: Register Rotation register file 135 128 127 registers independent between iteration #0 and #1 loop register #i registers independent between iteration #1 and #2 120 119 loop iteration 2 112 loop iteration 1 R342 loop iteration 0 loop register #1 loop register #0 31 loop R322 register #2 R341 loop R321 register #1 R340 loop R320 register #0 R310 R311 R312 24 23 16 15 global registers global registers global registers global registers 8 7 0 R00 R01 R02 Special loop counter register used as base adress for register selection (address calculation required). The same register address (from the programmers point of view) addresses a different (physical) register in each loop iteration. It is possible to fully pipeline the loop. The Itanium processor features register rotation to facilitate loop pipelining. Vorlesung Rechnerarchitektur Seite 55 Caches Caches are the next level of the memory hierarchy. They are small high speed memories employed to hold small blocks of main memory that are currently in use. The principle of locality, which justifies the use of cache memories, has two aspects: - locality in space or spatial locality - locality in time or temporal locality Most programs exhibit this locality in space in cases where subsequent instructions or data objects are referenced from near the current reference. Programs also exhibit locality in time where objects located in a small region will be referenced again within a short periode. Instructions are stored in sequence, and data objects normally being stored in the order of their use. The following figure is an idealized space/time diagram of address references, representing the actual working set w in the time interval Δτ. Address Space Data Δτ w ( T, T + Data Δτ) Instruction T T+ Δτ time Caches are transparent to the software. Usually, no explicit control of cache entries is possible. 
Data is allocated automatically by cache control in the cache, when a load instruction references the main memory. Some processors feature explicit control over the caching of data. Four types of user mode instructions can improve hit rate significantly (cache bypassing on stores, cache preloading, forced dirty-line flush, line allocation without line fill). Vorlesung Rechnerarchitektur Seite 56 Caches Cache memory design aims to make the slow, large main memory appear to the processor as a fast memory by optimizing the following aspects: - maximizing the hit ratio minimizing the access time minimizing the miss penalty minimizing the overhead for maintaining cache consistency The performance gain of a cache can be described by the following formula: Gcache Tm = ( 1 - H ) Tm + H miss ratio Tc (1-H) 1- H +H Tc Tm hit ratio Tm = tacc of main memory 1 = 1 = (1- Tc Tm Tc = tacc of cache memory ) H = hit ratio [0, ...1] G 5 example for 4 3 Gcache (H=1) = 2 1 0 0.5 0.9 1 Tm Tc = 5 1 H The hit ratio of the cache (in %) is the ratio of accesses that find a valid entry (hit) to accesses that failed to find a valid entry (miss). The miss ratio is 100% minus hit ratio. The access time to a cache should be significantly shorter than that to the main memory. Onchip caches (L1) normally need one clock tick to fetch an entry. Access to off-chip caches (L3) is dependent on the chip-to-chip delay, the control signal protocol, and the access time of the external memory chips used. Vorlesung Rechnerarchitektur Seite 57 Cache Terms Cache block size: Cache block size is the size of the addressable unit within a cache, which is at the same time the size of the data chunk to be fetched from main memory. Cache line: A cache line is the data of a cache block, typically a power of 2 number of words. The term is mainly used for the unit of transport between main memory and cache. Cache entry: A cache entry is the cache line together with all required management information, e.g. the tag field and the control field. Cache frame: The cache frame refers to the addressable unit within a cache which can be populated with data. It defines the empty cache memory space, where a cache line can be placed. H [%] fully associative 1 0.9 16 - 32 KB L1 Cache H > 0.9 0.8 0.5 directly mapped 0.2 0 1 2 4 8 16 32 64 log2 of Cache size [KB] hit ratio versus cache size Set: A set is the number of cache blocks which can be reached by the same index value. A set is only these pairs (2-way) or quadruples (4-way) of cache blocks, not the whole part of one way of the set-associative cache memory. Bedauerlicherweise gibt es keinen Begriff für den jeweiligen Teil des set-associative cache Speichers. Somit wird ’set’ auch häufig als die Bezeichnung für diesen Teil des Speichers verwendet. Vorlesung Rechnerarchitektur Seite 58 Cache Organizations Five important features of the cache, with their various possible implementations are: - mapping organization addressing management placement direct, set associative, fully associative cache block size, entry format, split cache, unified cache logically indexed, physically indexed, logically indexed/physically tagged consistency protocol, control bits, cache update policy random, filo, least recently used One of the most important features is the mapping principle. Three different strategies can be distinguished, but the range of caches - from directly mapped to set associative to fully associative - can be viewed as a continuum of levels of set associativity. 
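The gain formula above is easily evaluated for concrete values. A small sketch, reproducing the example from the figure (Tm/Tc = 5); the access times are placeholders, only their ratio matters.

    #include <stdio.h>

    /* G = Tm / ((1-H)*Tm + H*Tc) = 1 / ((1-H) + H*(Tc/Tm))   (page 56) */
    static double cache_gain(double H, double Tc, double Tm) {
        return Tm / ((1.0 - H) * Tm + H * Tc);
    }

    int main(void) {
        const double Tc = 1.0, Tm = 5.0;       /* example ratio Tm/Tc = 5 */
        const double H[] = {0.5, 0.9, 0.95, 0.99, 1.0};

        /* H = 1.0 reaches the maximum gain Tm/Tc = 5 marked in the figure;
           already at H = 0.9 the gain is about 3.6                         */
        for (int i = 0; i < 5; i++)
            printf("H = %.2f  ->  G = %.2f\n", H[i], cache_gain(H[i], Tc, Tm));
        return 0;
    }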
Cache Mapping m+n+x+z-1 m Bits Tag i 2n page size m th 2 blocks of cache size Main Memory ith x Bits Index Word Byte select select Address index i tag index m mem block 0 z Bits n Bits cache block 2n entries ith cache size Cache Directly Mapped Cache n Tag Mem = hit Hardware Structure (Address Path Only) Vorlesung Rechnerarchitektur Seite 59 Cache Mapping m+n+x+z-1 m+1 Bits Tag n-1 ith 2 page size index i 2m+1 blocks of cache size ith Address tag set 0 m+1 tag index n-1 Tag Mem 1 cache block Tag Mem 0 cache size set 1 0 z Bits Word Byte select select Index 2n-1 entries mem block x Bits n-1 Bits ith = = set 1 cache block Main Memory set 0 or hit Hardware Structure (Address Path Only) Cache 2-way Set Associative Cache m+n+x+z-1 x Bits m+n Bits Tag 0 z Bits Word Byte select select Address m+n 2n entries 2m+n main memory blocks Tag 2n-1 = Tag 3 Tag = 2 Tag = 1 Tag = 0 = mem block cache size oror hit Main Memory Cache Hardware Structure (Address Path Only) Fully Associative Cache Vorlesung Rechnerarchitektur Seite 60 Cache Mapping index from tag compare index n-1 n set 0/1 set 1 set 0 word select word select x set mux word mux from tag compare set mux word mux directly mapped word select word mux set associative fully associative Data Paths of Differently Mapped Caches Cache Organization The basic elements of a cache organization are: - the entry format the cache block size the kind of objects stored in cache special control bits MESI Entry format modified exclusive shared invalid placement pid physical tag field word 0 logical control field word 1 word 2 data field word 3 Vorlesung Rechnerarchitektur Seite 61 Cache Line Fetch Order Always fetch required word first. This keeps the memory access latency to a minimum. access sequence: interleaved mode (INTEL mode ) 0. Adr. 1. Adr. 2. Adr. 3. Adr. 0 8 10 18 8 0 18 10 10 18 0 8 18 10 8 0 mod ++ mod - mod ++ mod - byte offsets for 64bit words Mem_Adr 0 10 1. read of Mem 8 18 0,8 8,0 2. read of Mem 10,18 18,10 EN_L EN_H fast multiplexing with TS driver or multiplexer CPU BUS see also: DRAM Burst mode for further explanations - interleaved mode - sequential mode - programmable burst length Vorlesung Rechnerarchitektur Seite 62 Cache Consistency The use of caches in shared-memory multiprocessor systems gives rise to the problem of cache consistency. Inconsistent states may occur, when two processors keep a copy of the same memory cell in their caches, and one processor modifies the cache contents or the main memory by a write. Two memory-update strategies can be distinguished: - the write back (WB), sometimes also known as copy back, - and the write through (WT). The WT strategy is the simplest one. Whenever a processor starts a write cycle, the cache is updated and the same value is written to the main memory. The cache is said to be writtenthrough. Nevertheless, this write must inform all other caches of the new value at this address. While the active bus master (CPU or DMA) is placing its write address on to the address bus, all other caches in the other CPUs must check this address against their cache entries so as to invalidate or update the cache line. The WB strategy is more efficient, because the main memory is not updated for each store instruction. The modified data is stored in the cache data field only, the line being marked as modified in the cache control field. The write to the main memory is performed only on request, and then whole cache lines are written back (WB). 
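The two update strategies can be contrasted in a small controller sketch: on a store, write through sends every store to the bus, while write back only marks the line dirty and copies it back on a later eviction. This is a simplified single-level model for illustration; the data structures and sizes are assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { LINE_WORDS = 4 };                 /* assumed line size: 4 words */

static uint32_t main_memory[1024];       /* toy main memory, word addressed */

typedef struct {
    bool     valid, dirty;               /* dirty is only used by write back */
    uint32_t base;                       /* word address of the cached line  */
    uint32_t data[LINE_WORDS];
} line_t;

/* Write through: the cache and the main memory are updated on every store. */
static void store_wt(line_t *l, uint32_t word_addr, uint32_t value)
{
    l->data[word_addr % LINE_WORDS] = value;
    main_memory[word_addr] = value;      /* bus write; other caches snoop it */
}

/* Write back: only the cache is updated and the line is marked modified;
 * main memory is written when the dirty line is later evicted. */
static void store_wb(line_t *l, uint32_t word_addr, uint32_t value)
{
    l->data[word_addr % LINE_WORDS] = value;
    l->dirty = true;
}

static void evict(line_t *l)
{
    if (l->valid && l->dirty)            /* copy back the whole cache line */
        for (int i = 0; i < LINE_WORDS; i++)
            main_memory[l->base + i] = l->data[i];
    l->valid = l->dirty = false;
}

int main(void)
{
    line_t l = { .valid = true, .base = 16 };
    store_wt(&l, 17, 0xAAu);             /* memory already holds the new value */
    store_wb(&l, 18, 0xBBu);             /* memory is stale until the eviction */
    evict(&l);
    printf("%x %x\n", main_memory[17], main_memory[18]);
    return 0;
}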
This memory update strategy is called write back or copy back and allows the cache to hold newer values than the main memory. Information must be available in the cache line, which keeps track of the state of a cache entry. The MESI states and MESI consistency protocol are widely used and are therefore given here as an example of cache consistency protocols. Four possible states of a cache line are used by the MESI protocol: - Modified: one or more data items of the cache line are written by a store operation, the modified or dirty bit being set - Exclusive unmodified: the cache line belongs to this CPU only, the contents is not modified - Shared unmodified: the cache line is stored in more than one cache and can be read by all CPUs. A store to this address must invalidate all other copies and update the main memory - Invalid: the cache entry is invalid; this is the initial state of a cache line that does not contain any valid data. Vorlesung Rechnerarchitektur Seite 63 Cache Consistency The states of the cache consistency control bits and the transitions between them are illustrated in two figures, the first one showing all transitions of the cache of a bus master CPU, and the second state diagram showing the transitions of a slave CPU cache (snooping mode). Definition : A processor of a shared-memory multiprocessor system (bus-interconnected) is called bus master if it has gained bus mastership from the arbitration logic and is in the process of performing active bus transactions. Processors of a shared-memory multiprocessor system are called bus slaves if these processors can not currently access the bus for active bus transactions. They have to listen passively (snooping) for the active transactions of the bus master. Cache Consistency State Transitions for Bus Master CPU Shared Read Miss Invalid Write Hit [2] Write Miss[1] Shared Read Miss [3] Read Hit, Write Hit M E Exclusive Read Miss [3] Modified Exclusive Read Hit / Shared Read miss Shared Read Miss [3] Exclusive Read Miss Write Miss S Shared Unmodified Exclusive Read Miss [3] I Exclusive Unmodified Write Hit Read Hit / Exclusive Read miss [1] = Read with intent to modify [2] = Invalidation Bus Transaction [3] = Address tag miss = Snoop response State Transitions for Snooping CPU (Slave) I Snoop Hit on Write or on Read w.i.t.m. or on Invalidation Snoop Hit on Read Snoop Hit on Read Snoop Hit on Write [4] or on Read w.i.t.m. [4] Invalid S Shared Unmodified Snoop Hit on Read [4} M Snoop Hit on Write or on Read w.i.t.m. E Exclusive Unmodified Modified Exclusive [4] = Copy back of modified data Vorlesung Rechnerarchitektur Seite 64 Cache Addressing Modes The logically addressed cache is indexed by the logical (or virtual) address of the process currently running on the CPU. The address translation need not be performed for the cache access, and the cache can therefore be accessed very fast, normally within one clock cycle. Only the valid bit must be checked to see whether or not the data item is in the cache line. The difficulty with the logically addressed cache is that no snooping of external physical addresses is possible. address valid MMU ATC physical address logical address data valid address valid data valid logical address data CACHE MMU ATC CACHE TLB logically addressed data physical address physically addressed The physically addressed cache is able to snoop and need not be invalidated on a process switch, because it represents a storage with a one-to-one mapping of addresses to main memory cells. 
The address translation from logical to physical address must be performed in advance. This normally slows down the cache access to two clock cycles: one for the address translation and one for the cache access.

[Figure: logically addressed/physically tagged cache; the MMU/ATC translates the logical address, and the resulting physical address is compared with the physical tag stored in the cache line, in parallel with the cache access.]

The logically indexed/physically tagged cache scheme avoids both disadvantages by storing the high-order part of the physical address in the cache line as additional tag information. If the size of the index part of the logical address is chosen to match the page size, then the indexing can be performed without MMU translation.

Vorlesung Rechnerarchitektur Seite 65 Cache Consistency

[Animation slide: MESI state transitions for the bus master CPU and for the snooping (slave) CPU, illustrated with two processors whose L1 caches share the line at address $A004 while executing LD R1 <- ($A004); ADD R1, #1, R1; ST R1 -> ($A004). The snoop-response signals Hit and HitM indicate a snoop hit on a shared/exclusive line or on a modified line, respectively.]

Vorlesung Rechnerarchitektur Seite 66 Cache Placement

In a directly mapped cache there is only a single entry per index and therefore no placement choice. For set-associative and fully associative caches a replacement policy selects the entry to be overwritten:
- random replacement: a random entry of the set is chosen
- FIFO replacement: first in - first out, implemented as a circular counter (pointer) per index, modulo the number of sets
- least recently used (LRU): the entry whose last (read) access lies furthest in the past is overwritten; an aging algorithm is required (see the LRU-Verfahren pages in the Memory Management chapter).

Vorlesung Rechnerarchitektur Seite 67 More Cache Variants

There are a couple more cache-related terms one encounters with today's processors:
• split cache versus unified cache
• inclusive/exclusive caches
• trace caches

Split Cache / Unified Cache
Unified caches serve both as instruction and data cache; in a split cache architecture, a dedicated instruction cache and a dedicated data cache exist. Split caches are very often found as L1 caches of processors (internal Harvard architecture). A processor having a dedicated instruction memory and a dedicated data memory is called a Harvard architecture.

[Figure: processor with split L1 I-cache and D-cache backed by a unified L2 cache connected to memory.]

Inclusive/Exclusive Caches
Inclusive: data is held in L1 and L2 on a fetch; if evicted from L1, it remains in L2, which gives fast access later on, but part of the effective cache size is lost because L2 duplicates L1 contents.
Exclusive: data is only fetched into L1; if evicted from L1, it is written to L2. The effective cache size is L1 + L2, but the copy from L1 to L2 costs time; this can be optimized using a victim buffer.

Trace Cache
A trace cache is a special case of an instruction cache. A trace cache does not hold instructions exactly as they are found in main memory but rather in a decoded form, and a sequential entry of the trace cache may correspond to an instruction stream across a branch.
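The inclusive/exclusive distinction above can be summarized by what happens on a fetch and on an L1 eviction. The following sketch is a simplified model for illustration; the structure and function names are assumptions, and real controllers also have to handle coherence states and write buffers.

#include <stdbool.h>

/* Simplified model: 'in_l1' / 'in_l2' record where a given line currently lives. */
typedef struct { bool in_l1, in_l2; } line_state_t;

/* Inclusive hierarchy: a line fetched into L1 is also allocated in L2.
 * An L1 eviction leaves the copy in L2 (fast re-access later), but L2
 * capacity is partly spent on duplicates of L1 contents.                */
void fetch_inclusive(line_state_t *s)    { s->in_l1 = true;  s->in_l2 = true;  }
void evict_l1_inclusive(line_state_t *s) { s->in_l1 = false; /* stays in L2 */ }

/* Exclusive hierarchy: a line is fetched into L1 only; it reaches L2 only
 * when it is evicted from L1 (victim path). The effective capacity is
 * L1 + L2, at the cost of the copy on eviction (mitigated by a victim buffer). */
void fetch_exclusive(line_state_t *s)    { s->in_l1 = true;  s->in_l2 = false; }
void evict_l1_exclusive(line_state_t *s) { s->in_l1 = false; s->in_l2 = true;  }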
Vorlesung Rechnerarchitektur Seite 68 Memory Technology Development Cost per Transistor [log10] on-chip l values from Fairchild (Gordon Moore) 10 -1 l 10-2 10-3 l 10-4 l 10-5 l 4.5Mio.T = 30$US 10-6 ’60 ’59 ’65 ’70 ’75 ’80 ’85 ’90 ’95 ’00 Year ’02 ~ 6.6 * 10-6 $US year 2002: 4.5Mio. transistors in 0.18μm CMOS technology on a 5x5mm die with BGA pakkage ~ 30 $US = 6.6 x 10-6 $US per transistor; standard cell design Vorlesung Rechnerarchitektur Seite 69 Memory Technology Cost development of DRAMS Stone, Harold S., High-Performance Computer Architecture: "Memory chips have been quadrupling capacity every two to three years. The manufacturing cost per chip is usually constant per chip, regardless of the memory capacity per chip. When a new memory chip that has four times the capacity of its predecessor is introduced, a typical strategy is to sell it at four or five times the price of its predecessor. Although the price per bit is about equal for new and old technologies, the newer technology leads to less expensive systems because of having only one fourth the number of memory chips." Cost per MByte of Memory [log10 $US] DRAM 106 1Mio $US (1MByte) z 105 104 z2500 $US 3 10 102 12 $US z 1 10 z z 3,5 $US (512MB) 1 10 -1 10 -2 10 -3 10 -4 ’60 ’65 ’70 ’60 some data 1982 16kbit SRAM 1982 64kbit DRAM 1995 1Mbit SRAM 1995 4Mbit DRAM ’75 ’80 ’85 ~ 5.4 * ~80 $US ~20 $US ~50 $US ~6 $US ’90 10-8 ’95 ’00 Year ’02 $US/bit chip mit 16Kbit 16Kx1 chip mit 64Kbit 64Kx1 chip mit 1Mbit 1Mbx1 oder 256Kx4 chips mit 4Mbit 4Mbx1 oder 1Mx4 Vorlesung Rechnerarchitektur Seite 70 Main Memory The main memory is the lowest level in the semiconductor memory hierarchy. Normally all data objects must be present in this memory for processing by the CPU. In the case of a demand-paging memory, they are fetched from the disk storage to the memory pages on demand before processing is started. Its organization and performance are extremely important for the system’s execution speed. The advantages of DRAMs in respect of cost and integration density are much more important than the faster speed provided by SRAMs. SRAM DRAM trac 60 - 100 ns tcac 20 - 30 ns tcyc 110 - 180 ns access time 1-10 ns power consumption similar @ high MHz few when IDLE 2000 mW / 256Mbit memory capacity 10 MByte (cache) 16 Mbit per chip 4-32 GByte (MM) 4 Gbit per chip price $ 1/Mbit $ 10/Gbit trac row-address access time tcac column-address access time tcyc cycle time Important parameters of the DRAM are: - RAS access time - tRAC; typ. 80ns the time from row address strobe (RAS) to the delivery of data at the outputs on a read cycle - CAS access time - tCAC; typ. 20ns the time from column address strobe (CAS) to the delivery of data at the outputs on a read cycle - RAS recovery time - tREC; typ. 80ns after the application of a RAS pulse, the DRAM needs some time to restore the value of the memory cell (destructive read) and to charge up the sense lines again, until the next RAS pulse can be applied, this time allows the chip to ‘recover’ from an access - RAS cycle time - tCYC; typ. 160ns is the sum of the RAS access time and the recovery time and defines the minimum time for a single data access cycle - CAS cycle time - tPCYC; typ. 45ns the time from CAS being activated to the point at which CAS can be reactivated to start a new access within a page Vorlesung Rechnerarchitektur Seite 71 Main Memory The address lines of DRAMs are multiplexed to save pins and keep the package of the chip small. 
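A small sketch of this multiplexing from the memory controller's point of view; a 1M x 4 device with 10 row and 10 column address bits on the shared pins A9-A0 is assumed, matching the timing diagrams on the following pages.

#include <stdint.h>
#include <stdio.h>

#define COL_BITS 10u   /* assumed: 1024 columns, pins A9-A0     */
#define ROW_BITS 10u   /* assumed: 1024 rows, same pins reused  */

/* Split a linear cell address into the row presented with RAS#
 * and the column presented with CAS#. */
static void split_ras_cas(uint32_t cell, uint32_t *row, uint32_t *col)
{
    *col = cell & ((1u << COL_BITS) - 1u);
    *row = (cell >> COL_BITS) & ((1u << ROW_BITS) - 1u);
}

int main(void)
{
    uint32_t row, col;
    split_ras_cas(0x5A3C7u, &row, &col);
    /* The controller drives 'row' on A9-A0 and asserts RAS#, then switches
     * the pins to 'col' and asserts CAS#. Accesses that stay within the
     * same row can reuse the open page (page mode). */
    printf("row=%u col=%u\n", row, col);
    return 0;
}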
The partitioning of the address into the Row Address RA and the Column Address CA necessitates a sequential protocol for the addressing phase. The names ‘row’ and ‘column’ stem from the arrangement of the memory cells on the chip. The RA is sampled into the DRAM at the falling edge of RAS*, then the address is switched to the CA and sampled at the falling edge of CAS* (CAS*↓). tCYC DRAM 1 Mbit x 4 tREC tCAC RAS* CAS* A9-0 row address col. address WE* OE* tRAC I/O4-1 valid data-out single read cycle tRAC tCAC RAS* tPCYC CAS* A9-0 row address col. address 1 col. address 2 col. address 3 WE* OE* I/O4-1 page mode cycle data-out 1 data-out 2 data-out 3 Vorlesung Rechnerarchitektur Seite 72 Word-Wide Memories The simplest form of memory organization is the word-wide memory, matching the bus width of the external processor interface. Request Ackn CPU Addr 32 Data MEM DEC RA MUX 64 CA Addr Mux = addr cmp RAS# CAS# WE# OE# Memory Control Logic clock cycle 20 ns Memory tCYC tREC RAS* MUX* CAS* A9-0 row address col. address WE* OE* tRAC I/O4-1 valid data-out DTACK* AS* DS* WRITE* IDLE RW1 READ1 READ2 READ3 memory cycle start READ4 READ5 IDLE data sampling by processor address stable period bus transfer time data transfer time recovery time Vorlesung Rechnerarchitektur Seite 73 State Machine of Simple DRAM default IDLE /REFREQ* /AS* & mem sel & REFREQ* RW 1 /RAS* REF 1 /CAS* WRITE* /WRITE* /DS* WRITE 1 /RAS* /MUX*, /WE* READ 1 /RAS* /MUX* REF 2 /RAS*, /CAS* WRITE 2 /RAS* /MUX*, /WE* /CAS* READ 2 /RAS* /MUX* /CAS*, /OE* REF 3 /RAS* WRITE 3 /RAS* /CAS* READ 3 /RAS* /CAS*, /OE* REF 4 /RAS* WRITE 4 /CAS* /DTACK* READ 4 /CAS*, /OE* /DTACK* REF 5 /RAS* READ 5 /CAS*, /OE* /DTACK* REF 6 /DS* WRITE 5 /CAS* /DTACK* DS* to IDLE DS* to IDLE REF 7 to IDLE Vorlesung Rechnerarchitektur Seite 74 Word-Wide Registered Memories High-performance processors need more memory bandwidth than a simple one-word memory can provide. The access and cycle times of highly integrated dynamic RAMs are not keeping up with the clock speed of the CPUs. Therefore, special architectures and organizations must be used to speed-up the main memory system. The design goal is to transport one data word per clock via the external bus interface of the CPU to or from the main memory. This sort of performance cannot be obtained by a one-word memory. The two basic principles for enhancing speed - pipelining and parallel processing - must also be applied to the main memory. Pipelining of a one-word memory involves attempting to divide the memory access operation into suboperations and to overlap their execution in the different stages of the pipeline. The subdivision of the memory cycle into two suboperations - the addressing phase and the data phase - allows pipelining of the bus interface and the memory system. If there are separate handshake signals for address and data, several transfers can be active at different phases of execution. Request NA* data phase DACK* CPU ADDR 32 DATA 64 OEout PAGE COMP = ADR REG DATA REGout DATA REGin CLKin CLKout page hit CLKin 4 OEin CLKout OEout Memory Control Logic RAS# CAS# WE# OE# Memory MUX tCYC 20ns clock address register ADR address 0 request NA# RAS# MUX CAS# Memdata DATA stay in page DACK# OEin MUX address 2 address 1 RA0 CA0 CA0 CA1 data 0 data 1 data 0 clock data register data 1 Vorlesung Rechnerarchitektur Seite 75 New DRAM Architectures New generations of special DRAM devices have been designed to support fast block-oriented data transfers and page-mode accesses. 
Four different types of devices are listed in the table below from various manufacturers. Only the synchronous DRAM has survived. Type of dynamic RAM Enhanced Cache Synchronous X1,X4 X4 X4,X8,X9 X8,X9 67 50-100 50-100 500 First-access latency Cache/bank hit, ns Cache/bank miss, ns 15-20 35-45 10-20 70-80 30-40 60-80 36 112 Cache-fill bandwidth, MB/s 7314 114 8533 (a) 9143 (a) Cache/bank size, bits 2048 8192 4096 (a) 8192 (a) 5 7 5-10 10-20 CMOS/TTL CMOS/TTL CMOS/TTL, GTL/CTT 600mV swing, terminated Asynchronous DRAM-like Synchronous proprietary Synchronous pulsed/RAS Synchronous proprietary Yes Yes Undecided No 28/SOJ 44/TSOP 44/TSOP 32/VSMP 4M 4M 16M 4M I/O width, bits Data rate, single hit, MHz Area penalty, percent (b) Output level Access method Access during refresh Pin count/package Density, bits Rambus SOJ= small-outline J-lead package; TSOP= thin small-outline package; VSMP= nonstandard vertically mounted package (a) Synchronous and Rambus DRAMs store data in sense amplifier latches, not in separate synchronous RAM caches. (b) Area penalty is relative to the manufacturer’s standard die size, so that the figures are not directly comparable. The SDRAM has been developed over some generations: SDRAM, DDR SDRAM, DDR2 SDRAM to DDR3 SDRAM. All are incremental improvements of the previous generation, optimizing the data transfer rate and the termination principle of the signaling interface. Complex initialization sequences and data strobe trimming is included. A complementary DRAM interface was designed by Intel. It has a much smaller signaling interface, more message oriented, but requires a special buffer chip onevery DIMM. Search for more info under the keyword fully buffered DIMM and AHB. Vorlesung Rechnerarchitektur Seite 76 Synchronous DRAM The SDRAM device latches information in and out under the control of the system clock. The information required for a memory cycle is stored in registers; the SDRAM can perform the request without leaving the CPU idle at the interface. The device responds after a programmed number of clock cycles by supplying the data. With registers on all input and output signals, the chip allows pipelining of the accesses. This shortens its average access time and is well suited to the pipelined interfaces of modern high-performance processors like the i860XP. The interface signals are common CMOS levels, which appear to restrict the data rate to 100Mbit/s (1bit devices). A JEDEC approval procedure is currently in progress. Two internal memory banks support interleaving (see Section 4.3.2) and allow the precharge time of one bank to be hidden in an access of the other bank. In the same way, the refresh can be directed to the second bank while accessing the first one. The built-in timer for the refresh period and the refresh controller can hide the refresh inside the chip. The 512 x 4 sense amplifier can hold the data of one page like a cache, and all accesses to the data within the page are very fast. clock signals CKE# CLK CS# DQM Data In Buffer DQ0-DQ3 4 Data Out Buffer Synchronous Control control signals Logic RA bank 0 Row Decoder WE# CAS# RAS# Row Address Buffers 11 . . Bank 0 Memory Array . 2048 . . Sense Amplifier I/O-Gating 11 Clock Generator #1 . . 512 . . 4 Latch Latch . . 512 x 4 RA 11 RA bank 1 VCC GND Column Addr Buffer . . Row Address Latch Column Decoder Sense Amplifier I/O-Gating Clock Generator #1 11 Row Address Buffers Row Decoder A10 - A0 9 11 Row Multiplexer Latch Latch . . 512 . . 
Refresh Counter CA Burst Burst Counter Counter Self Refresh Oscillator & Timer Refresh Controller Column Addr Latch 9 4 . . . . A10 512 x 4 A11 . . 2048 . . Bank 1 Memory Array . Functional Block Diagram of 16 Mbit Synchronous DRAM The 16Mbit SDRAM contains 4 banks which are not explicitly marked !!! Vorlesung Rechnerarchitektur Seite 77 Burst Mode Memory Bei einem Zugriff auf den Hauptspeicher werden mehr Daten geholt, als die Wortbreite des Speichers liefert. Bei diesem "Burst Transfer" werden nacheinander mehrere (2n ; typically n=2 or 3) Werte gelesen oder geschrieben. Vorteil: Durch die Ankündigung (burst mode control signal) eines solchen Transfers ist es möglich, die weiteren Daten vorausschauend und im Pipeline-Modus zu holen und damit eine wesentlich höhere Datentransferleistung zu erbringen. Hierbei wird der "page mode" des Speichers ausgenutzt.Die Bezeichnung des Transfers erfolgt häufig nach folgender Syntax: (L:B:B:B) - (5:1:1:1). Die Notation bedeutet eine Startlatenz von 5 Takten gefolgt von weiteren Daten jeden weiteren Takt. Es gibt unterschiedliche Festlegungen für die Adreßsequenz innerhalb eines Burst-Zugriffs. - linearer Burst ABCD - modulo Burst - upcounting ABCD | BCDA | CDAB | DABC - interleaved ABCD | BADC | CDAB | DCBA Probleme entstehen, wenn die Startadresse des Burst Cycles nicht auf einer Startadresse liegt, die ’aligned’ ist, oder der Burst die page der Speichers (oder der MMU) überschreitet. Aus diesem Grund werden oft Einschränkungen bei der Adreßsequenz vorgenommen. Memory Size [Bytes] = 2m+2+2 31 4092 s Bits byte address 140 3 2 1 0 MemAdr MemSelect 3 2 1 0 2 Bits 2 Bits Word Byte select select D C memory frame 136 B 132 memory_frame_base_00_00 m Bits A 128 3 2 1 block start address 0 cache line word 3 2 1 0 0 32 bit word Die Startadresse sollte bei einem Cache Line Fill möglichst das Wort sein, welches als erstes vom Prozessor benötigt wird. Dadurch enstehen Burst cycles mit ’missaligned’ Startadressen. Dafür verwendet man dann meist einen modulo Burst Zugriff. Vorlesung Rechnerarchitektur Seite 78 Interleaved Memories The next step is to apply parallel processing to the main memory. This solution has long been employed in high-performance vector supercomputers such as the CRAY series in the form of memory interleaving. The parallel processing made use of interleaving requires partitioning of the memory system into parallel memory banks, which are controlled by a local bank controller. The global interleave controller checks and controls the interaction between the CPU and memory banks. The number of memory banks defines the order of interleaving (the CRAY-1 memory system for example, contains 32 banks and is therefore described as 32-way-interleaved). Usually the number of banks for interleaving is to the power of 2. The one-word memory with data and address registers forms the basic hardware structure of each bank. Two basic forms of interleaving can be distinguished: - low-order interleaving - high-order interleaving Low-order interleaving assumes that the least significant bits of the address A are used to distinguish the banks. The selection of a bank B is performed by the modulo function B = A mod n, where n is the number of banks. The performance gain achieved by a low-order interleaved memory depends on the address pattern applied to the memory and on the number of banks. A linear sequence of the addresses, selecting one bank only for every nth access, increases the available bandwidth by n, compared with a word-wide memory. 
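A minimal sketch of the two bank-selection functions, B = A mod n for low-order interleaving and the most significant address bits for high-order interleaving. The concrete bit positions (four banks, 64-bit words, a 64 KB address space) are assumptions matching the four-bank figures on the following pages.

#include <stdint.h>
#include <stdio.h>

#define BYTE_BITS 3u   /* A2-A0: byte select within a 64-bit word (assumed) */
#define BANK_BITS 2u   /* four banks                                        */
#define ADDR_BITS 16u  /* 64 KB of memory in this toy example               */

/* Low-order interleaving: B = A mod n, i.e. the bank number comes from the
 * least significant word-address bits (A4,A3 in the four-bank figure). */
static uint32_t bank_low_order(uint32_t addr)
{
    return (addr >> BYTE_BITS) & ((1u << BANK_BITS) - 1u);
}

/* High-order interleaving: the bank number comes from the most significant
 * address bits (A15,A14 in the four-bank figure). */
static uint32_t bank_high_order(uint32_t addr)
{
    return (addr >> (ADDR_BITS - BANK_BITS)) & ((1u << BANK_BITS) - 1u);
}

int main(void)
{
    /* Four consecutive 64-bit words: low-order interleaving spreads them
     * over banks 0,1,2,3, high-order interleaving keeps them in bank 0. */
    for (uint32_t a = 0; a < 4u * 8u; a += 8u)
        printf("addr %2u -> low-order bank %u, high-order bank %u\n",
               a, bank_low_order(a), bank_high_order(a));
    return 0;
}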
However, if the access function repeatedly references the same bank, the bandwidth is equal to that of the word-wide memory. Depending on the access function, the performance gain lies between these two extremes. The burst-mode access, fetching four consecutive (or specially sequenced) data values, also fits in well with low-order interleaving. The fetch can be performed as one access to all banks in parallel, and the sequential data transport from the registers to the external bus interface is controlled by the data-path controller of the memory system. The memory can execute a new request in the addressing phase while the data phase is still active. This requires that the microprocessor overlaps or pipelines the address and data phases on the bus.

High-order interleaving uses the most significant bits of the memory address to select the banks. For this structure, the next memory address must be 'very far away' from the previous one and should have distinct high-order bits, so that the access can be scheduled to a different bank. An address pattern of this sort is highly application-dependent, which makes the utilization of high-order interleaving rather difficult. NUMA architectures can profit from high-order interleaving by placing process context and data in the address space in a way that keeps them close to the processor operating on this data.

Vorlesung Rechnerarchitektur Seite 79 Low-Order Interleaved Memories

[Figure: low-order interleaved memory, addressing path only. The CPU address (k bits) is split into an index address (k-n-j bits), a bank select field (n bits) and a byte select field (j bits); the bank select bits choose one of the 2^n banks. In the four-bank example the interleave controller takes the bank select from A4,A3, the index from A15-A5 and the byte select from A2-A0, so consecutive words map to banks 0, 1, 2, 3 in turn.]

Vorlesung Rechnerarchitektur Seite 80 High-Order Interleaved Memories

[Figure: high-order interleaved memory, addressing path only. Here the bank select field (n bits) is taken from the most significant address bits; in the four-bank example the interleave controller takes the bank select from A15,A14, the index from A13-A3 and the byte select from A2-A0, so each bank covers one contiguous quarter of the address space.]

Vorlesung Rechnerarchitektur Seite 81 Four-Bank Low-Order Interleaved Memories

This figure presents the address and data paths of a four-way low-order interleaved memory.

[Figure: a CPU with a 32-bit address path and a 64-bit data path, connected via address-phase and data-phase control to a memory bank control unit and four 64-bit memory banks.]

Modern microprocessors include memory controllers with two or more banks, which are organized in low-order or high-order interleaving schemes.

Vorlesung Rechnerarchitektur Seite 82 DRAMs Speed Trends: DRAMs & Processors

Die ständige Steigerung der Rechenleistung moderner Prozessoren führt zu einer immer größer werdenden Lücke zwischen der Verarbeitungsgeschwindigkeit des Prozessors und der Zugriffsgeschwindigkeit des Hauptspeichers.
Die von Generation zu Generation der DRAMs steigende Speicherkapazität (x 4) führt auf Grund der 4-fachen Anzahl an Speicherzellen trotz der Verkleinerung der VLSI-Strukturen zu nur geringen Geschwindigkeitssteigerungen. 10000 9000 8000 7000 6000 5000 Speed [log MHz] 4000 3000 2GHz 2000 first 1GHz CPU Research 1000 900 800 700 600 500 l internal CPU Clock l DEC Alpha 400 l external BUS Clock 300 200 SDRAM 100 90 80 70 60 50 40 1/tacc 25ns in page 30 20 1/tacc 60ns random 10 Year ’90 ’92 ’94 ’96 ’98 2000 ’02 Vorlesung Rechnerarchitektur Seite 83 EDO-DRAM DRAM: Beispiel EDO~ Extended Data-Out (EDO) DRAMs feature a fast Page Mode access cycle. The EDODRAM separates three-state control of the data bus from the column address strobe, so address sequencing-in is independent from data output enable. This Page Access allows faster data operation within a row-address-defined page boundary. The PAGE cycle is always initiated with a row-address strobed-in by the folling edge of RAS_ followed by a column-address strobed-in by the folling edge of CAS_. CAS_ may be toggled by holding RAS_ low and ’strobing-in’ different column-addresses, thus executing faster memory cycles. no cas control of data out Data In Buffer WE# DQ0-DQ3 4 Data Out Buffer CAS# Clock Generator #2 OE# CA 10 Column Decoder 4 . . . . 10 1024 11 Column Addr Buffer 4 Column Address Latch bit planes Refresh Controller Sense Amplifier Refresh Counter Latch . . 1024 x 4 . . I/O-Gating CASE 1 RAS# RA 11 . . 2048 . . . . 2048 . . 2 x 2048 x 1024 x 4 Memory Array 24/26-Pin SOJ Row Row Transfer Transfer 11 Row Address Latch Row Decoder 11 A10 - A0 Complement Complement Complement Select Complement Select Select Select bit planes Clock Generator #1 VCC GND top view EDO Operation: Improvement in page mode cycle time was achieved by converting the normal sequential fast page mode operation into a two-stage pipeline. A page address is presented to the EDO-Dram, and the data at that selected address is amplified and latched at the data output drivers. While the output buffers are driving the data off-chip, the address decode and data path circuitry is reset and able to initiate access to the next page address. Refer to Datasheet for EDO DRAM at: ext_infos Vorlesung Rechnerarchitektur Seite 84 EDO-DRAM Timing EDO-DRAM The primary advantage of EDO is the availability of data-out even after /CAS_ high. EDO allows CAS-Precharge time to occur without the output data going invalid. This elemination of CAS Output control allows "pipelining" of READs. [Micron, Data Sheet MT4LC4ME8(L), 1995. The term pipelining is misused in this context, it is simply an overlapping of addressing and data-transfer. tRAC tCAC tCAC tPCYC RAS* CAS* A9-0 row address CA 2 CA 1 CA 3 CA 4 data-out 2 data-out 3 WE* tAA(CAS) tAA OE* I/O4-1 data-out 1 tOEA tCOH data-out 4 tOEZ hyperpage mode cycle EDO Im Gegensatz zum ’fast page mode’-DRAM muß CAS_ nicht low sein, um den Datenausgangs-Buffer zu treiben. Die nächste CA kann somit unabhängig von der Data-Output Phase früher in das CA-Latch übernommen werden. Das Treiben der Daten wird ausschließlich vom Signal OE_ gesteuert (tOEA und tOEZ). Die Daten am Ausgang schalten erst auf die neuen Daten von der nächsten CA um, wenn die neue CA im DRAM wirksam wird ( nach tCOH ). 
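The practical effect of page-mode (and EDO/hyperpage) operation can be estimated from the cycle times quoted earlier: roughly tCYC = 160 ns for a random access and tPCYC = 45 ns for a further access within the open page. The following sketch compares the resulting transfer rates for an assumed 64-bit wide memory; the numbers are the typical values from the text, not measurements.

#include <stdio.h>

int main(void)
{
    const double t_cyc_ns   = 160.0;  /* full RAS cycle, random access   */
    const double t_pcyc_ns  =  45.0;  /* CAS cycle within the open page  */
    const double width_byte =   8.0;  /* assumed 64-bit wide memory      */

    double bw_random = width_byte / (t_cyc_ns  * 1e-9) / 1e6;  /* MB/s */
    double bw_page   = width_byte / (t_pcyc_ns * 1e-9) / 1e6;  /* MB/s */

    printf("random access : %.0f MB/s\n", bw_random);  /* about  50 MB/s */
    printf("page mode     : %.0f MB/s\n", bw_page);    /* about 178 MB/s */
    return 0;
}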
Vorlesung Rechnerarchitektur Seite 85 Memory Interface Signaling High speed interfaces VDD(Q) (VCC) VOH Noise Margin High VIH VREF switching region VIL Noise Margin Low VOL VSS (GND) Driver Receiver Signal Line + - single ended transmission UREF = LVTTL = CTT UREF BTL GTL HSTL SSTL VOH 2,4 1,9 2,1 1,2 1,1 Vtt + 0,4 VIH 2 1,7 0,85 0,85 1,85 Vtt + 0,2 VREF 1,5 1,5 1,55 0,8 0,75 1,5 = Vtt VIL 0,8 1,3 1,47 0,75 0,65 Vtt - 0,2 VOL 0,4 1,1 1,1 0,4 0,4 Vtt - 0,4 VDD (Q) 3,3 3,3 5 1,2 1,5 3,3 NM(H) 0,4 0,2 0,48 0,35 0,25 0,2 NM(L) 0,4 0,2 0,37 0,35 0,25 0,2 SWING(O) 2 0,8 1 0,8 0,7 0,8 Receiver Driver positive Signal Line UDIFF inverted Signal Line differential transmission UOFFSET = + - Vorlesung Rechnerarchitektur Seite 86 Memory Management Ziele des Memory Management • Schutzfunktion => Zugriffsrechte • Speicherverwaltung für Prozesse • Erweiterung des begrenzten physikalischen Hauptspeichers Eine sehr einfache Möglichkeit den Speicher zu organisieren ist die Aufteilung in einen Festwertspeicher (Read-Only-Memory) für das Programm und in einen Schreib-Lese-Speicher für die Daten. - R/W Memory - ROM - kann nicht überschrieben werden Eine solche feste Aufteilung ist nur für ’single tasking’-Anwendungen sinnvoll, wie sie z.B. in ’eingebetteten Systemen’ mit Mikrocontrollern verwendet werden. CPU Instr./Data ROM Read/Write Literature: Andrew S. Tanenbaum, "Structured Computer Organization", 4th edition, Prentice-Hall, p. 403ff. RAM R/W Address Space Beim Multi-processing oder Multi-tasking Systemen reicht die oben genannte Möglichkeit, den Speicher zu organisieren, nicht aus. Es existieren viele Prozesse, die quasi gleichzeitig bearbeitet werden müssen. Probleme: - Verlagern von Objekten im Speicher - relocation - Schutz von Objekten im Speicher - protection Lösung: Einführung eines logischen Adreßraumes pro Prozeß und einer Abbildung der logischen Adreßräume auf den physikalischen Adreßraum (Hauptspeicher), die Adreßumsetzung ’address translation’ => viele Prozesse konkurrieren um den physikalischen Speicher! => um ausgeführt zu werden, müssen alle Segmente (.text / .data / .bss) in den Speicher geladen werden Segmentierung ist die Aufteilung des logischen Adreßraumes in kontinuierliche Bereiche unterschiedlicher Größe zur Speicherung von z.B. Instruktionen (.text) oder Daten (.data) oder Stack-Daten (.bss), etc. Jedes Segment kann jetzt mit Zugriffsrechten geschützt werden. Vorlesung Rechnerarchitektur Seite 87 Memory Management Memory Segmentation Virtuelle Adresse MMU Physikalische Adresse zur Laufzeit pid = 1 phys. Mem. from processor pid = 1 upper bound 4000 > .data 1 .data 1 .text 1 11000 .text 1 .text 2 10000 9600 pid = 2 3600 address access trap physical addr to memory = .data 2 3000 ...... .text 2 ...... 1000 .data 2 400 0 base of segment base of segment .data .text. + R/W I/D S/U access rights 1000 0 virtual address protection violation trap Memory fragmentation 0 Um einen Prozeß ausführen zu können, müssen alle Segmente des Prozesses im Hauptspeicher sein. - Der Platz ist bereits durch andere Prozesse belegt => Es müssen so viele Segmente anderer (ruhender) Prozesse aus dem Speicher auf die Platte ausgelagert werden, wie der neue auszuführende Prozess Platz benötigt. Randbedingung durch die Segmentierung: es muß ein fortlaufender Speicherbereich sein! • Swapping: Aus- und Einlagerung ganzer Prozeßsegmente (Cache flush!) - Es ist zwar noch Platz vorhanden, aber die Lücken reichen nicht für die benötigten neuen Segmente aus. 
Wiederholtes Aus- und Einlagern führt zu einer Fragmentierung des Speichers -’Memory fragmentation’ => Es müssen Segmente verschoben werden, d.h. im Speicher kopiert werden, um Lükken zu schließen (Cache flush!). In den gängigen Architekturen wird mit n=32 oder 64 Bit adressiert. Daraus folgt die Größe des virtuellen Adressraums mit 2n Byte (232= 4 GByte; 264= 16 ExaByte= 16.77 Millionen TBytes) - Der Hauptspeicherplatz reicht von seiner Größe nicht für den neuen Prozeß aus => der Hauptspeicher wird durch das Konzept des Virtuellen Speichers durch die Einbeziehung des Sekundärspeichers (Platte) erweitert. Vorlesung Rechnerarchitektur Seite 88 Memory Management Literatur: Hwang, Kai; Advanced Computer Architecture, Mc Graw Hill, 1993. Stone, Harold S.; High Performance Computer Architecture, Addison Wesley, 1993 Giloi: Rechnerarchitektur Tannenbaum, Andrew S.: Modern Operating Systems, Prentice-Hall, 1992. Intel: i860XP Microprocessor Programmers Reference Manual Motorola: PowerPC 601 RISC Microprocessor User’s Manual z Grundlagen Die Memory Managment Unit (MMU) ist eine Hardwareeinheit, die die Betriebssoftware eines Rechners bei der Verwaltung des Hauptspeichers für die verschiedenen auszuführenden Prozesse unterstützt. - address space protection - virtual memory - demand paging - segmentation Each word/byte in the physical memory (PM) is identified by a unique physical address. All memory words in the main memory forms a physical address space (PAS). All program-generated (or by a software process generated) addresses are called virtual addresses (VA) and form the virtual address space (VAS). When address translation is enabled, the MMU maps instructions and data virtual addresses into physical addresses before referencing memory. Address translation maps a set of vitual addresses V uniquely to a set of physical addresses M. Virtual memory systems attempt to make optimum use of main memory, while using an auxiliary memory (disk) for backup. VM tries to keep active items in the main memory and as items became inactive, migrate them back to the lower speed disk. If the management algorithms are successful, the performance will tend to be close to the performance of the higher-speed main memory and the cost of the system tend to be close to the cost per bit of the lower-speed memory (optimization of the memory hierarchy between main memory and disk). Most virtual memory systems use a technique called paging. Here, the physical address space is divided up into equallly sized units called page frames. A page is a collection a data that occupies a page frame, when that data is present in memory. The pages and the page frames are always of the same fixed size (e.g. 4K Bytes). There might be larger (e.g. 8KB) or smaller page sizes defined in computer systems depending on the compromise between access control and number of translations. Vorlesung Rechnerarchitektur Seite 89 Memory Management z Virtueller Speicher / Paging Logischer und physikalischer Adressraum werden in Seiten fester Größe unterteilt, meist 4 oder 8KByte. Logische Pages werden in einer Pagetable auf physikalische Pageframes abgebildet, dabei ist der logische Adressraum im allgemeinen wesentlich größer als der physikalisch vorhandene Speicher. Nur ein Teil der Pages ist tatsächlich im Hauptspeicher, alle anderen sind auf einen Sekundärspeicher (Platte) ausgelagert. 
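With a fixed page size, the translation only replaces the virtual page number by a physical frame number; the offset inside the page is passed through unchanged. A minimal sketch with an assumed 4 KB page size (12 offset bits) and a placeholder mapping function standing in for the page table:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u                       /* assumed 4 KB pages          */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Placeholder lookup: virtual page number -> physical frame number.
 * (Identity mapping here; a real MMU walks the page table or hits the TLB.) */
static uint32_t frame_of(uint32_t vpn) { return vpn; }

static uint32_t virt_to_phys(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_SHIFT;      /* translated part             */
    uint32_t offset = va &  PAGE_MASK;       /* passed through untranslated */
    return (frame_of(vpn) << PAGE_SHIFT) | offset;
}

int main(void) { printf("%x\n", virt_to_phys(0x1234ABCu)); return 0; }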
- Programme können größer als der Hauptspeicher sein - Programme können an beliebige physikalischen Adressen geladen werden, unabhängig von der Aufteilung des physikalischen Speichers - einfache Verwaltung in Hardware durch feste Größe der Seiten - für jede Seite können Zugriffsrechte (read/write, User/Supervisor) festgelegt und bei Zugriffen überprüft werden - durch den virtuellen Speicher wird ein kostengünstiger großer und hinreichend schneller Hauptspeicher vorgetäuscht (ähnlich Cache) Die Pagetable enthält für jeden Eintrag einen Vermerk, ob die Seite im Hauptspeicher vorhanden ist (P-Bit / present). Ausgelagerte Pages müssen bei einer Referenz in den Hauptspeicher geladen werden, ggf. wird dabei eine andere Page verdrängt. Modifizierte Seiten (M-Bit / modify) müssen dabei auf den Sekundärspeicher zurückgeschrieben werden. Dazu wird ein weiteres Bit eingeführt, das bei einem Zugriff auf die Seite gesetzt wird (R-Bit / referenced) Die Abbildung des virtuellen Adressraums auf den physikalischen erfolgt beim paging durch die Festlegung von Zuordnungspaaren (VA-PA). Hierbei werden die n (n=12 fuer 4K oder 13 fuer 8KB) niederwertigen bits der Adresse von der VA zur PA durchgereicht und nicht uebersetzt Vorlesung Rechnerarchitektur Virtueller Speicher / Paging Replacement-Strategien : - not recently used - NRU mithilfe der Bits R und M werden vier Klassen von Pages gebildet 0: not referenced, not modified 1: not referenced, modified (empty class in single processor systems!) 2: referenced, not modified 3: referenced, modified es wird eine beliebige Seite aus der niedrigsten nichtleeren Klasse entfernt - FIFO die älteste Seite wird entfernt (möglicherweise die am häufigsten benutzte) - Second-Chance / Clock wie FIFO, wurde der älteste Eintrag benutzt, wird zuerst das R-Bit gelöscht und die nächste Seite untersucht, erst wenn alle Seiten erfolglos getestet wurden, wird der älteste Eintrag tatsächlich entfernt - least recently used - LRU die am längsten nicht genutzte Seite wird entfernt, erfordert Alterungsmechnismus Seite 90 Vorlesung Rechnerarchitektur Seite 91 Memory Management Verfahren zur Adreßtransformation Address Translation Page Address Translation Segment Address Translation Block Address Translation .............. direct mapping one level multi level inverted mapping associative PT wie PAT inverted PT one level mapping base-bound checking PAT: Es wird eine Abbildung von VA nach PA vorgenommen, wobei eine feste Page Size vorausgesetzt wird. Der Offset innerhalb einer Page ist damit eine feste Anzahl von bits (last significant bit (LSB)) aus der VA, die direkt in die PA übernommen werden. Der Offset wird also nicht verändert ! Die Abbildung der höherwertigen Adreßbits erfolgt nach den oben genannten Mapping-Verfahren. BAT: Provides a way to map ranges of VA larger than a single page into contiguous area of physical memory (typically no paging) • Used for memory mapped display buffers or large arrays of MMU data. • base-bound mapping scheme • block sizes 128 KB (217) to 256 MB (228) • fully associative BAT registers on chip small number of BAT entries (4 ... 8) + BAT entries have priority over PATs Vorlesung Rechnerarchitektur Seite 92 Memory Management Direct Page Table For all used VA-Frames exist one PT-Entry which maps VA to PA. There are much more entries required as physical memory is available. direkt PT VAS PAS page frame VA Page Table Index virtual page used page frame virtual page virtual page PA page frame Inverted Page Table For all PA exist one PT-Entry. 
Indexing with the VA doesn’t work any more!!! To find the right mapping entry a search with the VA as key must be performed by linear search or by associative match. Used for large virtual address space VAS, e.g. 264 and to keep the PT small. VAS inv PT VA high PAS page frame = virtual page ...... ...... virtual page virtual page page frame search match VAhigh PA page frame Vorlesung Rechnerarchitektur Seite 93 Memory Management Einstufiges Paging 0 31 dir_base 000000000000 12 20 31 VA VA high PT 12 0 VA Offset 220 Entries PA high Cntrl 20 Page Table Index 0 31 PA PA high 12 Die virtuelle Adresse VA wird in einem Schritt in eine physikalische Adresse PA umgesetzt. Dazu wird ein höherwertiger Teil der VA zur Indizierung in die Page Table PT verwendet. In der PT findet man dann unter jedem Index genau einen Eintrag mit dem höherwertigen Teil der PA. Der niederwertige Teil der Adresse mit z.B. 12 bit wird als Seiten-Offste direkt zur physikalischen Adresse durchgereicht. Diese einstufige Abbildung kann nur für kleine Teile der VA high verwendet werden (Tabellengröße 4 MB bei 32 bit entries) Vorlesung Rechnerarchitektur Seite 94 Memory Management Mehrstufiges Paging Linear Address 31 22 12 0 Directory Table Offset 486TM CPU 31 0 CR0 CR2 CR2 CR3 dir_base Control registers 10 31 10 31 {} 20 0 {} Physical memory 12 0 base 30 base 20 {} Address 20 Page table Page directory Bei 32 Bit Prozessoren und einer Seitengröße von z.B. 4 KByte wird die Pagetable sehr groß, z.B. 4 MByte bei 32 Bit Pagetable Einträgen. Da meist nicht alle Einträge einer Tabelle wirklich genutzt werden, wird eine mehrstufige Umsetzung eingeführt. Zum Beispiel referenzieren die obersten Adressbits eine Tabelle, während die mittleren Bits den Eintrag in dieser Tabelle selektieren. - es ist möglich, in der page dir Einträge als nicht benutzt zu kennzeichnen - damit werden in der zweiten Stufe weniger Tabellen benötig - die Tabellen der zweiten Ebene können selbst ausgelagert werden Pagetables können aufgrund ihrer Größe nur im Hauptspeicher gehalten werden. Pagetables sind prinzipiell cachable, allerdings werden die Einträge wegen ihrer relativ seltenen Benutzung (im Vergleich zu normalen Daten) schnell aus dem allgemeinen Cache verdrängt. TLB (32 entries) Linear Address 31 Physical Memory 0 Page directory Page table Zum Beschleunigen der Adressumsetzung, insbesondere bei mehrstufigen Tabellen, wird ein Cache verwendet. Dieser Translation Lookaside Buffer (TLB) oder auch Address Translation Cache (ATC) enthält die zuletzt erfolgten Adressumsetzungen. Er ist meist vollassoziativ ausgeführt und enthält z.B. 64 Einträge. Neuerdings wird auch noch ein setassoziativer L2-Cache davorgeschaltet. Vorlesung Rechnerarchitektur Seite 95 Memory Management z i860XP A virtual address refers indirectly to a physical address by specifying a page table through a directory page, a page within that table, and an offset within that page. Format of a Virtual Address 22 21 31 0 12 11 PAGE DIR OFFSET 10 10 + 2 10 2 bit byte sel Page table Page directory 1023 Page frame 4kBytes 1023 1023 Page table entry physical address word DIR entry dirbase register 0 0 0 two-level Page Address Translation A page table is simply an array of 32-bit page specifiers. A page table is itself a page (1024 entries with 4 bytes = 4kBytes). Two levels of tables are used to address a page frame in main memory. Page tables can occupy a significant part of memory space (210 x 210 words = 222 bytes; 4MBytes). 
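The two-level translation just described (10-bit directory index, 10-bit page table index, 12-bit offset) can be written down directly from the virtual address format. The sketch below is a software model of the walk with an explicit directory base address passed in; on the real processor the walk is performed by hardware and the result is cached in the TLB. The memory layout and values are assumptions for illustration.

#include <stdint.h>
#include <stdio.h>

#define PRESENT 0x1u                 /* present bit (bit 0) of an entry      */

/* Toy physical memory seen as 32-bit words; the tables live inside it.
 * Bits 31-12 of an entry hold a physical base address, bit 0 = present.    */
static uint32_t phys_mem[1 << 16];

/* dirbase: physical address of the page directory (one page, 1024 entries).
 * VA layout: bits 31-22 directory index, 21-12 table index, 11-0 offset.   */
static uint32_t translate(uint32_t dirbase, uint32_t va, int *fault)
{
    uint32_t dir_e = phys_mem[(dirbase >> 2) + ((va >> 22) & 0x3FFu)];
    if (!(dir_e & PRESENT)) { *fault = 1; return 0; }     /* trap to the OS  */

    uint32_t tab_base = dir_e & ~0xFFFu;
    uint32_t pt_e = phys_mem[(tab_base >> 2) + ((va >> 12) & 0x3FFu)];
    if (!(pt_e & PRESENT)) { *fault = 1; return 0; }      /* demand paging   */

    *fault = 0;
    return (pt_e & ~0xFFFu) | (va & 0xFFFu);              /* frame + offset  */
}

int main(void)
{
    /* Build one tiny mapping: VA 0x00401xxx -> PA 0x00007xxx (assumed values). */
    uint32_t dirbase = 0x1000u;                           /* directory page   */
    phys_mem[(dirbase >> 2) + 1] = 0x2000u | PRESENT;     /* directory index 1 */
    phys_mem[(0x2000u >> 2) + 1] = 0x7000u | PRESENT;     /* table index 1     */

    int fault;
    printf("%x\n", translate(dirbase, 0x00401ABCu, &fault));  /* prints 7abc */
    return 0;
}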
The physical address of the current page directory is stored in the DTB (Directory table base) field of the dirbase register. Vorlesung Rechnerarchitektur Seite 96 Memory Management z i860XP A page table entry contains the page frame address and a number of management bits. The present bit can be used to implement demand paging. If P=0, than that page is not present in main memory. An access to this page generates a trap to the operating system, which has to fetch the page from disk, set the P bit to 1 and restart the instruction. Format of a Page Table Entry (i860XP) 12 11 9 31 PAGE Frame Address 7 5 3 0 C W AVAIL XX D A D T U W P Available for system programmers use (Reserved) Dirty Accessed Cache Disable Write-through User Writable Present Definition : Page The virtual address space is divided up into equal sized units called pages. [Tanenbaum] Eine Page (im Kontext der MMU) entsteht durch die Unterteilung des virtuellen und physikalischen Speichers in gleich große Teile. Definition : Page frame Ist der Speicherbereich im Hauptspeicher, in den genau eine Page hineinpaßt. Vorlesung Rechnerarchitektur Seite 97 Memory Management Hashing Literatur: Sedgewick, Robert, Algorithmen, Addison-Wesley, 1991, pp.273-287 Hashing ist ein Verfahren zum Suchen von Datensätzen in Tabellen • Kompromiß zwischen Zeit- und Platzbedarf Hashing erfolgt in zwei Schritten: 1. Berechnung der Hash-Funktion Transformiert den Suchschlüssel (key, hier die VA) in einen Tabellenindex. Dabei wird der Index wesentlich kürzer als der Suchschlüssel und damit die erforderliche Tabelle kleiner. Im Idealfall sollten sich die keys möglichst gleichmäßig über die Indices verteilen. - Problem: Mehrdeutigkeit der Abbildung 2. Auflösung der Kollisionen, die durch die Mehrdeutigkeit entstehen. a) durch anschließendes lineares Suchen b) durch erneutes Hashing VAhigh 7 6 5 4 VAhigh 3 2 1 0 VA high n simple Hash Function 2n 3 HashFkt. z.B. n/2 2 1 Hash-Index 0 linear Search 2n/2 0 0 Index inverted PT Entry 2n/2-1 Entry VA PA Cntrl Vorlesung Rechnerarchitektur Seite 98 Memory Management LRU-Verfahren Als Beispiel für einen Algorithmus, der das LRU-Verfahren realisiert, sei ein Verfahren betrachtet, das in einigen Modellen der IBM/370-Familie angewandt wurde. Sei CAM (Content Addressable Memory) ein Assoziativspeicher mit k Zellen, z.B. der Cache. Zusätzlich dazu wird eine (k x k)-Matrix AM mit boolschen Speicherelementen angelegt. Jeder der ’entries’ des CAM ist genau eine Zeile und eine Spalte dieser Matrix zugeordnet. Wird nun ein entry vom CAM aufgesucht, so wird zuerst in die zugehörige Zeile der boolschen Matrix der Einsvektor e(k) und danach in die zugehörige Spalte der Nullvektor n(k) eingeschrieben (e(k) ist ein Vektor mit k Einsen; n(k) ist ein Vektor mit k Nullen). Dies wird bei jedem neuen Zugriff auf das CAM wiederholt. Sind nacheinander in beliebiger Reihenfolge alle k Zellen angesprochen worden, so enthält die Zeile, die zu der am längsten nicht angesprochenen Zelle von CAM gehört, als einzige den Nullvektor n(k). AM0 AMi AM0 0 cache entry AMi 1 1 Encoder 1 0 0 0 k 0... 1 0 1 i ... k cache entries AM k-1 CAM k-1 0 0 ......... i AMk-1 1 k-1 i k x k-Matix OR Sei i der Index der als erste angesprochenen Zelle von CAM, und sei AM die (k x k)-Alterungsmatrix. Dann ist nach dem Ansprechen der Zelle i AMi = n(k), während alle Elemente von AMi eins sind bis auf das Element (AMi)i, das null ist (da zunächst e(k) in die Zeile AMi und dann in die Spalte AMi eingeschrieben wird). 
Dabei bezeichnen wir die Zeilen einer Speichermatrix durch einen hochgestellten Index und die Spalten durch einen tiefgestellten Index. Bei jeder Referenz einer anderen Zelle von CAM wird durch das Einschreiben von e(k) in die entsprechende Zeile und nachfolgend durch das Einschreiben von n(k) in die entsprechende Spalte von AM eine der Einsen in AMi durch Null ersetzt und eine andere Zeile mit Einsen angefüllt (bis auf das Element auf der Hauptdiagonale, das null bleibt). Damit werden nach und nach alle Elemente von AMi durch Null ersetzt, falls die Zelle i zwischenzeitlich nicht mehr angesprochen wird. Da aber nach k Schritten nur in einer der k Zellen alle Einsen durch Nullen überschrieben sein können, müssen alle anderen Zeilen von AM noch mindestens eine Eins enthalten. Damit indiziert die Zeile von AM, die nur Nullen enthält, die LRU-Zelle (entry) des Assoziativspeichers. [ Giloi, Rechnerarchitektur, Springer, 1993, pp. 130] Vorlesung Rechnerarchitektur Seite 99 Memory Management LRU-Verfahren: Beispiel Es wird ein CAM mit 8 entries angenommen. Das kann ein vollassoziativer Cache mit 8 entries sein oder auch ein Set eines Caches mit 8 ’ways’ sein. Die Initialisierung der Alterungsmatrix erfolgt mit ’0’. Als Beispiel für die Veränderung der Werte in AM soll die folgende Zugriffsreihenfolge betrachtet werden. Zugriffsreihenfolge: 0, 1, 3, 4, 7, 6, 2, 5, 3, 2, 0, 4 0 1 2 3 4 5 6 7 0 0 1 0 0 0 0 0 0 0 0 7 0 1 2 3 4 5 6 7 3 0 1 2 3 4 5 6 7 1 1 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 4 1 0 0 0 0 0 0 0 0 5 1 0 0 0 0 0 0 0 0 6 1 0 0 0 0 0 0 0 0 7 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 x 0 1 0 1 0 x 0 x 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 x 0 1 1 0 0 1 1 0 10 1 2 3 0 0 1 1 1 1 0 1 0 1 1 0 x 0 x 0 0 0 0 x 0 0 0 0 x 0 0 0 0 x 0 0 0 0 x 0 0 0 0 x 0 0 0 0 4 1 1 0 0 0 0 0 0 0 5 1 1 0 0 0 0 0 0 0 6 1 1 0 0 0 0 0 0 0 7 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 x 0 0 0 x 1 0 0 30 0 1 x 0 1 x 0 1 0 0 0 1 0 2 1 1 0 1 0 3 0 1 0 1 0 1 0 x x x x 0 0 0 0 0 0 0 0 6 0 0 0 0 4 1 1 0 1 0 0 0 0 0 5 1 1 0 1 0 0 0 0 0 6 1 1 0 1 0 0 0 0 0 7 1 1 0 10 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 1 1 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 10 0 0 0 x 1 0 1 0 0 10 10 10 10 10 10 10 1 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 x 1 1 0 1 1 0 0 0 0 0 40 0 1 x 0 1 x 01 1 0 0 0 1 1 0 x x x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 1 1 0 1 1 0 0 0 0 0 1 1 1 1 01 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 1 1 0 1 0 0 x 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 0 x 1 1 0 1 1 0 2 0 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 1 2 0 0 0 x 0 0 0 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 1 0 0 0 3 0 0 0 0 1 0 4 0 1 0 1 0 0 1 1 0 6 1 1 0 1 1 0 0 0 0 7 1 1 0 1 1 0 0 0 0 x x x x 5 0 0 0 0 0 0 1 0 0 0 0 1 1 11 11 01 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 0 0 1 2 1 1 0 1 1 0 4 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 1 0 0 Der Start erfolgt mit allen Einträgen als gleichalt markiert. Der erste Eintrag in Zeile und Spalte 0 läßt den Eintrag 0 altern. 
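The aging-matrix rule used in this example can also be written down compactly in code: on every access, the row of the touched entry is filled with ones and its column is cleared, and the LRU entry is the one whose row contains only zeros. A minimal sketch for k = 8 entries (names are illustrative):

#include <stdbool.h>
#include <stdio.h>

#define K 8                          /* number of CAM/cache entries */

static bool am[K][K];                /* aging matrix, initialised to 0 */

/* Record an access to entry i: set row i to all ones, then column i to zeros. */
static void touch(int i)
{
    for (int c = 0; c < K; c++) am[i][c] = true;
    for (int r = 0; r < K; r++) am[r][i] = false;
}

/* The LRU entry is the one whose row contains only zeros. */
static int lru(void)
{
    for (int r = 0; r < K; r++) {
        bool all_zero = true;
        for (int c = 0; c < K; c++) all_zero &= !am[r][c];
        if (all_zero) return r;
    }
    return -1;                       /* not reached once every entry was touched */
}

int main(void)
{
    int seq[] = { 0, 1, 3, 4, 7, 6, 2, 5, 3, 2, 0, 4 };   /* access sequence above */
    for (unsigned i = 0; i < sizeof seq / sizeof seq[0]; i++)
        touch(seq[i]);
    printf("LRU entry: %d\n", lru());   /* entry 1 was accessed longest ago */
    return 0;
}

For the access sequence of the example the sketch reports entry 1 as least recently used.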
Es sind danach nur noch die mit x markierten Einträge zum Ersetzen zu verwenden. Sind alle Einträge einmal referenziert, so bleibt in jedem weiteren Zugriffsschritt immer nur ein Eintrag als der Älteste markiert stehen. [Tannenbaum, Modern Operating Systems, Prentice Hall, 1992, pp.111-112] ppt Vorlesung Rechnerarchitektur Seite 100 Main Memory Some definitions ... .... mode : page mode : Betriebsart in der auf alle bits in dieser Page (RA) wesentlich schneller zugegriffen werden kann. (12-25ns tCAC) Zum Zugriff wird eine neue CA und CAS benötigt. 1 Mbit x 1 bit 10 CA 10 C_Dec 10 page RA R_Dec 1024 bit 32 bit Word => 4 MByte 22 21 31 Mem base burst mode : 21 0 12 11 RA CA byte sel Holt 2n Werte aus dem Speicher ohne eine neue CA. CAS muß aber geschaltet werden, um den internen Zähler zu inkrementieren. memory modell : latency in clocks L -B- B - B 6 - 2 - 2 - 2 tRAC tCAC Vorlesung Rechnerarchitektur Seite 101 Main Memory Some definitions ... Hardware view: asynchronous : Without a relation to a global clock signal. Signals propagating a net of gates behave asynchronous, because of the delay within each gate level. synchronous: All signals and signal changes are referenced to a global clock signal. All locations are supplied by this clock signal, it is assumed that the edge of the clock signal defines the same time at all locations Caution : simplified view - clock jitter - clock skew - clock network delay bandwidth : Term derived from ET. HFT, means the available frequency space Δf for a transmission channel f1 Δf f2 Δf=f2-f1 unit [MHz] used in TI as: possible number of data, which can be carried through an interface, eg. bus, memory. unit [MBytes/s] transfer rate : Number of Data items, which are moved in one second. unit [MBytes/s] Vorlesung Rechnerarchitektur Seite 102 Development Trend- RISC versus CISC CISC PDP11 PDP11 CISC CISC 8086 Digital Equipm. MC 68000 Intel Motorola CISC CISC ’186 ’010 Intel Motorola CISC VAX DEC CISC CISC Cray/Alliant numeric Coprocessor ’286 ’020 Intel CISC RISC+superscalar ’386 i860XL Intel CDC6600 Motorola VLIW Intel CISC MIPS ??? RISC+superscalar ’030 88100 Motorola Motorola VLIW MIPS VLIW CISC RISC+superscalar CISC RISC+superscalar ’486 i860XP ’040 88110 Intel Intel Motorola Motorola 64 bit RISC+superscalar + superpipelined 150MHz 21064 ALPHA DEC CISC + superscalar RIP CISC + superscalar RIP Pentium ’060 Intel Motorola RISC+superscalar + superpipelined + PCI-Bus 300MHz 21164 ALPHA DEC CISC + superscalar + MMX RIP Pentium II Intel RISC+superscalar + superpipelined + PCI-Bus 800MHz 21264 ALPHA DEC internal RISC + superscalar Pentium III 900 MHz VLIW Intel RISC+superscalar + superpipelined + Memory Control + SystemIO + Switch 21364 ALPHA ISA Emulation DEC internal RISC + superscalar Pentium 4 Itanium Intel Intel + HP VLIW + predicated Instr. VLIW + predicated Instr. Itanium 2 Intel + HP 1GHz ? Vorlesung Rechnerarchitektur Seite 103 Genealogy of RISC Processors influenced by mainframe research projects CDC 6600 Cray-1 Control Data products products under development RISC Bulldog Compiler IBM 801 μCode compaction IBM research VLIW RISC RISC ELI RISC I MIPS NY State Univ. UC Berkley Stanford VLIW RISC RISC RISC RISC-superscalar PA PC/RT. TRACE RISC II IBM RS6000 Spectrum MIPS-X Multiflow Inc. UC Berkley IBM Hewlett-Packard Stanford AM 29000 AMD RISC RISC RISC R2000 R3000 Hewlett-Packard mips Corp. PA MC88110 UC Berkley RISC 7100 SOAR SPARC SUN Microsyst. 
[Genealogy figure, continued: the RISC lines evolve into superscalar and superpipelined families - Intel i486, MIPS R4000/R4400/R10000/R12000, PowerPC 601/604/620/8500 (Motorola + IBM, later with AltiVec), SuperSPARC/UltraSPARC/SPARC III (SUN & TI, with VIS multimedia ops), HP PA 7100/8500, and AMD K5/K6/K7 Athlon/K8 Hammer (Athlon 64, Opteron) with internal RISC cores and 64-bit extensions.]

Vorlesung Rechnerarchitektur Seite 104 CISC - Prozessoren

Computer Performance Equation

P [MIPS] = (fc [MHz] x Ci) / (Ni x Nm)

with
fc: clock frequency
Ci: instruction count per clock cycle (< 1: RISC, > 1: superscalar, VLIW)
Ni: average number of clock cycles per (micro)instruction
Nm: memory access factor

CISC - Complex Instruction Set Computer
Ci = 1: one instruction per clock tick = one operation executed by one execution unit
Ni = 5-7: 5-7 clock ticks are required for the execution of one instruction by a sequence of microinstructions (microprogram architecture)
Nm = 2-5: dependent on the memory system and on the memory hierarchy

Goal: closing the gap between high-level language and processor instructions (semantic gap)
- manifold addressing modes
- microprogrammed complex instructions require a variable instruction length => variable execution time
- orthogonal instruction set: every operation matches every addressing mode

Vorlesung Rechnerarchitektur Seite 105 CISC - Prozessoren

CISC architectures
Disadvantages
- pipelining is difficult
- variable instruction format -> instruction fetch is complex
- instruction execution of varying length
- memory hierarchy - the memory access is part of the operation
- sequential instruction processing - no prefetch
- compilation -> the instruction set is not fully exploited!
Advantages
- compact code - smaller instruction cache, lower instruction-fetch transfer rate, about 1/2 the code size of RISC (with its fixed 32-bit format)
- assembler -> closer to a high-level language

Vorlesung Rechnerarchitektur Seite 106 CISC - Prozessoren

MC 68020 - a typical CISC member
16-bit instruction format with a variable length of n 16-bit units, 1 ≤ n ≤ 6
8 data registers, e.g. for the addition of a data value: ADD
8 address registers, e.g. for address computation: ADDA
The instruction set is extendable by coprocessor instructions (encoding Fxxx).

Instruction encoding of ADD: bits 15-12 opcode, bits 11-9 register Dn, bits 8-6 operation mode, bits 5-0 effective address (3-bit mode field, 3-bit register field).
Operation modes: Byte, Word (16 bit), Long (32 bit), Double (64 bit); <ea> + <Dn> -> <Dn> or <Dn> + <ea> -> <ea> (ea = effective address).

Effective Address Encoding Summary
- Data Register Direct
- Address Register Direct
- Address Register Indirect
- Address Register Indirect with Postincrement
- Address Register Indirect with Predecrement
- Address Register Indirect with Displacement
- Address Register and Memory Indirect with Index
- Absolute Short (16 bit)
- Absolute Long (32 bit)
- Program Counter Indirect with Displacement
- Program Counter and Memory Indirect with Index
- ... each register-based mode applies to all 8 registers

Vorlesung Rechnerarchitektur Seite 107 CISC - Prozessoren

MC 68020 - example of an addressing mode: Address Register Indirect with Index (base displacement), EA = (An) + (Xn) + d8.
[Figure: the 32-bit address register An, the sign-extended 8-bit displacement and the sign-extended index register Xn, multiplied by the scale value, are added to form the memory address of the operand.]
Value Scale + * 31 0 Memory Address : Operand Example of a ’complex instruction’ 15 11 10 9 3 2 6 5 0 Effective Address MOVEM dr SZ mode Register 0 15 A7 A6 A5 A4 A3 A2 A1 A0 D7 D6 D5 D4 D3 D2 D1 D0 Register Mask List instruction for saving the register contest - uses postincrement addressing mode for save - uses predecrement addressing mode for restore Vorlesung Rechnerarchitektur Seite 108 CISC - Prozessoren DBcc instructions (-> DBcc Instruction im CPU32 User Manual) Don’t branch on condition !!! REPEAT (body of loop) UNTIL (condition is true) Assembler syntax: DBcc Dn, <label> The DBcc instruction can cause a loop to be terminated when either the specified condition CC is true or when the count held in Dn reaches -1. Each time the instruction is executed, the value in Dn is decremented by 1. This instruction is a looping primitive of three parameters: a condition, a counter (data register), and a displacemt. The instruction first tests the condition to determine if the termination condition for the loop has been met, and if so, no operation is performed. If the termination condition is not true, the low order 16 bits of the counter data register are decremented by one. If the result is - 1 the counter is exhausted and execution continues with the next instruction. If the result is not equal to - 1, execution continues at the location indicated by the current value of the PC plus the sign-extended 16-bit displacement. The value in the PC is the current instruction location plus two. IF (CC == true) THEN PC++ ELSE { Dn-IF (Dn == -1) THEN PC++ ELSE PC <- PC + d } Label: branch back Loop pair of instructions Dn - 1 true SUB generation of condition code DBcc decrement and branch on condition code Dn = -1 ? CC false (1) and Dn != -1 CC true (0) or Dn == -1 PC++ false CC ? DBcc true Branch Successor PC + d PC + d PC++ Instruction format: 15 0 false 14 1 13 0 12 1 11 10 9 condition 6 7 1 1 displacement 8 5 0 4 0 3 1 2 1 0 register Vorlesung Rechnerarchitektur Seite 109 Mikroprogrammierung μ-Programmierung Definition : Mikroprogrammierung ist eine Technik für den Entwurf und die Implemtierung der Ablaufsteuerung eines Rechners unter Verwendung einer Folge von Steuersignalen zur Interpretation fester und dynamisch änderbarer Datenverarbeitungsfunktionen. Diese Steuersignale, welche auf Wortbasis organisiert sind (Mikrobefehle) und in einem festen oder dynamisch änderbaren Speicher (Mikroprogrammspeicher, Control Memory oder auch Writable Control Store) gehalten werden, stellen die Zustände derjenigen Signale dar, die den Informationsfluss zwischen den ausführenden Elementen (Hardware) steuern und für getaktete Übergänge zwischen diesen Signalzuständen sorgen. [Oberschelp / Vossen] Das Konzept der Mikroprogrammierung wurde von M.V. Wilkes 1951 vorgeschlagen. Man unterscheidet 2 Arten der Mikroprogrammierung: • horizontale: die einzelnen Bits eines Mikroprogrammwortes (eine Zeile des Control Stores) entsprechen bestimmten Mikrooperationen. Diese Operationen können dann parallel angestoßen werden. • vertikale: die Zuordnung zwischen den Bits eines Mikroprogrammwortes und den assoziierten Operationen wird durch einen sogenannten Mikrooperationscode (verschlüsselt) bestimmt. μ-Programmierung ist der Entwurf einer Steuerungssequenz für das μ-Programmsteuerwerk. Das μ-Programmsteuerwerk kann allgemein als endlicher Automat angesehen werden, der sich formal wie folgt beschreiben lässt: A = (Q, Σ, Δ, q 0,F, δ) Dessen allgem. 
Übergangsfunktion lautet: δ: Q × (Σ ∪ {ε}) → Q × (Δ ∪ {ε}) und nimmt für das Steuerwerk die spezielle Form an: δ2: B s × ( B i ∪ { ε } ) → B s × ( B o ∪ { ε } ) Bs state vector Bi input vector Bo output vector Die Berechnung des Folgezustands eines aktuellen Zustands erfolgt dabei durch die so genannte Next-State-Logik, die Berechnung des Outputs durch die Output-Logik. Der abstrakte Begriff des endlichen Automaten ist ein entscheidendes Hilfsmittel für das Verstehen der Arbeitsweise und der Modellierung von Hardwarestrukturen. Seine Beschreibung erfolgt häufig als Graph (FSM, ’bubble diagram’), um die Zustände und die Transitionen zu veranschaulichen. Vorlesung Rechnerarchitektur Seite 110 Mikroprogrammierung μ-Programmierung Das folgende μ-Programmsteuerwerk zeigt den prinzipiellen Aufbau. Der Control Vector stellt die Steuerungsfunktion der Execution Unit dar. μ Instr. μ-Programmsteuerwerk Load MUX μPC next address field CS ROM WCS RAM Bs horizontales μ Programmwort Bo Control Vector nextPC next PC Logic RE WE Bi End CC ALU +- Logical Unit & or not MUL * RF VN Execution Unit Die Mikroprogrammierung bietet folgende Vorteile: • Implementierung eines umfangreichen Befehlssatzes (CISC) mit wenigen Mikrobefehlen bei wesentlich geringeren Kosten • Mikroinstruktionen und damit der Maschinenbefehlssatz kann verändert werden durch Austausch des Inhaltes des Control Memory (Writable Control Store WCS) • vereinfachte Entwicklung und Wartung eines Rechners durch deutlich geringeren Hardware-Aufwand eines Rechners Sie hat folgenden Nachteil: • Ausführung einer mikroprogrammierten Operation dauert länger, denn das Mikroprogramm muss schrittweise aus dem Control Memory gelesen werden. Für jeden Befehl sind meist mehrere ROM-Zugriffe erforderlich. Vorlesung Rechnerarchitektur Seite 111 Mikroprogrammierung μ-Programmierung Wenn das Steuerwerk schnell (im darauffolgenden Takt) auf die CC-Signale reagieren soll, wird die Next-PC Logik nicht wie sonst üblich durch einen Addierer ausgeführt, sondern durch eine einfache Oder-Logik, die die LSBs der Folgeadresse modifizieren. Verwendet man als Folgeadresse eine Adresse mit den vier LSBs auf "0", so können die Condition Codes 0..3 diese bits zu "1" werden lassen, womit man einen 16-way branch realisieren kann. Dadurch sind die Adressen der μ-Instruktionen nicht mehr linear im Speicher angeordnet sondern können beliebig verteilt werden, da jede μ-Instruktionen ihre Folgeadresse im WCS mitführt. Benutzt man eine Folgeadresse mit z.B. 0010 als LSBs, so kann man damit den CC1 ausmaskieren, da er durch die Oder-Funktion nicht mehr wirksam werden kann. MSB LSB 0 0 0 0 next PC Logic 0 0 0 0 0 1 CC3 0 0 CC2 0 0 CC1 0 1 CC0 0 0 0 0 Die μ-Programmierung solcher Steuerwerke auf der bit-Ebene ist natürlich viel zu komplex, als dass man sie von Hand durchführen könnte. Verwendet wird üblicherweise eine symbolische Notation, ähnlich einem Assembler. Diese Notation wird dann durch einen Mikro-Assembler in die Control Words des WCS umgesetzt. Die explizite Sichtbarkeit des Control Wortes der Execution-Hardware bildet die Grundlage der VLIW-Architekturen. Der Test und die Verifikation der μ-Programme ist sehr aufwendig, da sie direkte Steuerungsfunktionen auf der HW auslösen, deren Beobachtung meist nur durch den Anschluss von LSAs (Logic State Analyzer) möglich ist. Die μ-Programmierung wurde wegen der höheren Geschwindigkeit bei der Einführung der RISC-Rechner durch festverdrahtete Schaltwerke (PLAs, Programmable Logic Arrays) abgelöst. 
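The 16-way branch of the next-PC logic described above can be illustrated with a small piece of code. The following C sketch is only an illustration (the function name next_upc and the example addresses are not taken from the lecture): it ORs the four condition codes CC3..CC0 into the four LSBs of the next-address field of the current micro-instruction, so a '1' bit in the field masks the corresponding condition code.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: OR-based next-PC logic of the micro-program sequencer.
     * next_addr_field comes from the current micro-instruction (WCS),
     * cc holds the four condition codes CC3..CC0. */
    static uint16_t next_upc(uint16_t next_addr_field, unsigned cc)
    {
        /* a '0' in an LSB of the field lets the corresponding CC through,
         * a '1' masks it, because 1 OR x == 1 (no adder is needed) */
        return (uint16_t)(next_addr_field | (cc & 0xFu));
    }

    int main(void)
    {
        printf("%03x\n", next_upc(0x0A0, 0x5)); /* 16-way branch: LSBs 0000 -> 0x0a5 */
        printf("%03x\n", next_upc(0x0A2, 0x2)); /* LSB pattern 0010 masks CC1 -> 0x0a2 */
        return 0;
    }

With the four LSBs of the next-address field set to '0', the condition codes select one of 16 possible successor micro-instructions; setting individual LSBs to '1' removes the corresponding condition code from the branch decision.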
weitere Informationen in: W.Oberschelp, G.Vossen, Rechneraufbau und Rechnerstrukturen, 7.Aufl., R.Oldenbourg Verlag, 1998 Vorlesung Rechnerarchitektur Seite 112 RISC Computer Performance Equation P[MIPS] = <1 RISC >1 Superscalar VLIW f [MHz] x C i c N xN m i fc clock frequency Ci N i instruction count per clock cycle N m memory access factor average number of clock cycles per instruction The RISC design philosophy can be summarized as follows: - pipelining of instructions to achieve an effective single cycle execution (small Ni) - simple fixed format instructions (typically 32 bits) with only a few addressing modes - hardwired control of instruction decoding and execution to decrease the cycle time - load-store architectures keep the memory access factor Nm small - Migration of functions to software (Compiler) The goal of the reduced instruction set is to maximize the speed of the processor by getting the software to perform infrequent functions and by including only those functions that yield a net performance gain. Such an instruction set provides the building blocks from which high-level functions can be synthesized by the compiler without the overhead of general but complex instructions [fujitsu] The two basic principles that are used to increase the performance of a RISC processor are: pipelining and the optimization of the memory hierarchy. Vorlesung Rechnerarchitektur Seite 113 ILP Definition : Instruction level parallelism It is the parallelism among instructions from ’small’ code areas which are independent of one another. Exploitation of ILP one Form overlapping of instructions in a pipeline ⇒ RISC - Processor with nearly one instruction per clock another Form parallel execution in multiple functional units ⇒ superscalar Processor very long instruction word Processor (VLIW) at runtime : at compile time : dynamic instruction scheduling static instruction scheduling Vorlesung Rechnerarchitektur Seite 114 ILP Instruction level parallelism Availability of ILP ’small’ code area 1. basic block : a straight-line code sequence with no branches in except to the entry and no branches out except the exit ILP small ! ~ 3,4; <6 2. multiple basic blocks a) loop iterations Loop level Parallelism stems from ’Structured Data Type’ Parallelism Vector Processing b) speculative execution at compile time trace scheduling at run time dynamic branch prediction + dynamic speculative execution Techniques to improve C i (instruction count per clock cycle) • • • • • • • • • • • • Loop unrolling pipeline scheduling dynamic instruction scheduling; scoreboarding register renaming dynamic branch prediction multiple instruction issue dependence analysis; compiler instruction reordering software pipelining trace scheduling speculative execution memory disambiguation; dynamic - static Vorlesung Rechnerarchitektur Seite 115 ILP Example of Instruction Scheduling and Loop Unrolling Calculate the sum of the elements of a vector with length 100 99 sum = ∑ ai i=0 C-Program int main() { double a[100]; double sum = 0.0; int i; } for (i=0;i<100;i++) sum += a[i]; f(sum); // using sum in f avoids heavy optimization // which results in "do nothing" Suppose a basic 4-staged pipeline similar to Exercise 2. The pipeline is interlocked by hardware to avoid flow hazards. Forwarding pathes for data and branch condition are included. The two stages for cache access are only active for LD/ST instructions. The Cache is nonblocking. Misses queue up at the bus interface unit which is not shown in the block diagram. 
The cache holds four doubles in one cache entry. The cache entry is filled from main memory using a burst cycle with 5:1:1:1. Integer instructions execute in one clock, fp operations ADD and MUL requires 3 clocks. The bus-clock is half of the processor clock. single instruction issue IF MA1 MA2 int int int RFR EX WB fp fp EX1 EX3 EX2 fp fp fp Block-Diagram of a simple pipelined RISC Vorlesung Rechnerarchitektur Seite 116 ILP Example of Instruction Scheduling - Assembler Code // Initialization R0 R5 R4 R3 R2 = = = = = 0; 99; 0; 1000; 800; F4 = 0.0;; F2 = 0.0; // // // // // // always zero endcount 100-1 loop index base address of A address of sum fkt parameters are normally stored on stack !! // Accumulator for sum // Register for loading of Ai // Assembler Code Lloop: Lend: CMP R4, R5 -> R6; BEQ R6, Lend; LD (R3) -> F2; ADDF F2,F4 -> F4; ADDI R4, #1 -> R4; ADDI R3, #8 -> R3; JMP Lloop; ST F4 -> (R2); // // // // // // generate CC and store in R6 branch pred. "do not branch" load Ai accumulation of sum loop index increment addr computation for next element It should be mentioned here that local variables of procedures are normally stored in the activation frame on the stack. The addressing of local variables can be found in the assembler code of the compiled C-routines. For this example, a simplified indirect memory addressing is used. Vorlesung Rechnerarchitektur Seite 117 ILP Example of Instruction Scheduling without Loop Unrolling Instruction slot IF RFR CMP 2 BEQ CMP 3 LD BEQ CMP 4 ADD i LD BEQ 5 ADD a ADD i LD 6 ADDF ADD a ADD i 7 stall ADD a 8 stall 9 stall stall 11 stall 12 stall 13 stall 14 stall 15 stall 16 CMP 19 BEQ CMP LD 22 Stage LD LD ADD i ADD a Cache miss Memory 5:1:1:1 ADDF f4,f2 ADDF ADDF BEQ CMP ADD i LD BEQ ADD a ADD i LD 23 ADDF ADD a ADD i 24 JMP ADDF f4,f2 ADD a 25 CMP 26 BEQ CMP 28 LD BEQ CMP 29 ADD i LD BEQ 30 ADD a ADD i LD 31 ADDF 32 33 LD f2 DATA 18 Instruction Loop JMP 21 WB stall 17 20 EX3 MA2 EX 1 11 EX2 MA1 ADDR EX1 ADDF ADDF f4 first Data available LD LD ADDF ADD i Cache hit ADD a LD f2 ADDF ADD a ADD i ADDF f4,f2 ADD a ADDF ADDF f4 LD LD ADDF 34 ADD i ADD a LD f2 ADDF 35 ADDF 36 ADDF f4 37 time The successor of a Bcc instruction is statically predicted and speculatively executed and can be annulled if a wrong prediction path is taken. The execution time of 100 element vector accumulation can be calculated to 100 / 4 x (17 + 3 x 7) = 950 CPU clocks Vorlesung Rechnerarchitektur Seite 118 ILP Example of Instruction Scheduling with Loop Unrolling EX1 Instruction slot IF RFR EX BEQ CMP 3 LD f2,#0 BEQ CMP 4 ADDF LD BEQ 5 stall LD 6 stall 7 stall 8 stall 9 stall 11 stall Iteration 0 2 12 stall 15 stall 22 23 Iteration 1 LD f2,#8 ADDF f4,f2 28 ADDR Cache miss Memory 5:1:1:1 ADDF f4,f2 LD ADDF stall LD stall ADDF f4,f2 ADDF f4,f2 LD ADDF stall LD ADDF f4,f2 LD ADDF stall LD ADDF Cache hit ADD a, #32 31 ADD i,#4 ADD a ADDF 32 JMP ADD i,#4 ADD a ADDF f4 LD f2,#16 ADDF LD 30 ADDF f4,f2 ADDF LD ADD i,#4 ADDF f4 LD f2,#24 ADDF ADDF ADD a ADDF f4 ADD i,#4 35 36 37 time Load instructions are not scheduled and interations do not overlapp. 
The execution time of 100 element vector accumulation can be calculated to: 100 / 4 x 32 = 800 CPU clocks first Data available ADDF LD stall 34 ADDF f4 LD f2,#8 LD ADDF f4,f2 33 ADDF LD LD f2,#24 29 LD f2 ADDF LD stall Iteration 3 26 LD LD f2,#16 24 25 LD stall Iteration 2 21 Instruction Loop 4 times unrolled 14 20 - stall stall 19 Stage stall 13 18 WB DATA 11 17 EX3 MA2 CMP 1 16 EX2 MA1 Vorlesung Rechnerarchitektur Seite 119 ILP Software - Pipelining Software-Pipelining überlappt die Ausführung von aufeinanderfolgenden Iterationen in ähnlicher Form wie eine Hardware-Pipeline. Instr. I1 Instr. I2 Instr. I3 Durch das Software-Pipelining wird die Parallelität zwischen unabhängigen Schleifeniterationen ausgenützt. Instruktion in Schleife Inm Index der Schleifeniteration Anordnung der Instruktionen. I 11 I1 I 12 I2 I 13 I3 I 21 I 22 I 23 I 31 I 32 I 33 Vorlesung Rechnerarchitektur Seite 120 ILP Example of Instruction Scheduling with Loop Unrolling and Software Pipelining EX1 Instruction slot IF RFR EX BEQ CMP 3 LD f2,#0 BEQ CMP LD f6, #32 LD BEQ 7 8 9 11 11 WB Stage LD f6, #32 - LD LD f6, #32 LD LD LD Cache miss LD ADDR ADDR 6 start phase for iteration 0 and 1 2 5 EX3 MA2 CMP 1 4 EX2 MA1 in page 12 13 Memory 14 5:1:1:1 16 LD f4 load Iteration 0 and 1 19 20 21 22 23 26 28 29 30 31 32 calculate Iteration 0 and 1 24 LD f5,#24 prefetch load LD f3 LD f4 LD f5 to f2, #64 LD f5 ADDF f10,f2 ADDF f11,f6 LD f7 LD f5 ADDF f11,f6 LD f7 ADDF f10,f2 LD f8 ADDF f11,f6 LD f7 LD f9,#56 ADDF f10,f3 LD f8 ADDF f11,f7 LD f9 ADDF f10,f3 ADDF f11,f7 LD f9 ADDF f11,f7 ADDF f10,f5 36 ADDF f11,f9 38 ADD a 39 ADD i 40 JMP ADDF f11,f6 LD f7 ADDF f10,f2 LD f8 ADDF f11,f6 LD f7 ADDF f10,f3 LD f8 ADDF f11,f6 LD f9 ADDF f10,f3 LD f8 ADDF f11,f7 LD f9 ADDF f10,f3 ADDF f11,f7 ADDF f11,f8 LD f9 ADDF f11,f7 ADDF f10,f4 ADDF f11,f8 ADDF f10,f5 ADDF f11,f9 ADDF f10,f4 ADDF f11,f8 ADDF f11,f7 ADDF f10,f4 ADDF f10,f4 ADDF f11,f9 ADDF f10,f4 ADDF f11,f9 ADDF f10,f11 time LD f6 ADDF f10,f4 ADDF f10,f5 37 ADDF f10,f2 ADDF f10,f4 ADDF f11,f8 41 LD f5 LD f8,#48 34 LD f4 ADDF f10,f2 ADDF f10,f4 35 LD f4 ADDF f10,f3 ADDF f11,f8 Cache hit LD f3 LD f5 LD f7,#40 first Data available LD f3 LD f4 ADDF f10,f2 ADDF f10,f4 33 LD f2 DATA 18 LD f3 DATA LD f3 LD f4,#16 17 25 + page mode LD f3,#8 15 ADDF f11,f9 global sum The loop is four times unrolled and instructions from different iterations are scheduled to avoid memory and cache accesss latencies. Scheduling of load instructions requires a nonblocking cache and a BUI which can handle multiple outstanding memory requests. The latency of the first start-up phase can only be avoided by scheduling the load into code before the loop start (difficult). 
Execution time: ~ (100/4 / 2 * 29) +(14 + 7) + 6 = 277 Vorlesung Rechnerarchitektur Seite 121 ILP Assembler without optimization compiled by .file "unroll.c" gcc2_compiled.: .section ".rodata" .align 8 .LLC0: .long 0x0 .long 0x0 .section ".text" .align 4 .global main .type main,#function .proc 04 main: !#PROLOGUE# 0 save %sp,-928,%sp !#PROLOGUE# 1 sethi %hi(.LLC0),%o0 ldd [%o0+%lo(.LLC0)],%o2 std %o2,[%fp-824] st %g0,[%fp-828] .LL2: ld [%fp-828],%o0 cmp %o0,99 ble .LL5 nop b .LL3 nop .LL5: ld [%fp-828],%o0 mov %o0,%o1 sll %o1,3,%o0 add %fp,-16,%o1 add %o0,%o1,%o0 ldd [%fp-824],%f2 ldd [%o0-800],%f4 faddd %f2,%f4,%f2 std %f2,[%fp-824] .LL4: ld [%fp-828],%o1 add %o1,1,%o0 mov %o0,%o1 st %o1,[%fp-828] b .LL2 nop .LL3: .LL1: ret restore .LLfe1: .size main,.LLfe1-main .ident "GCC: (GNU) 2.7.2.2" gcc -S <fn.c> Vorlesung Rechnerarchitektur Seite 122 ILP Assembler with optimization compiled by gcc -S -o3 -loop_unroll <fn.c> .file "unroll.c" gcc2_compiled.: .section ".rodata" .align 8 .LLC1: .long 0x0 .long 0x0 .section ".text" .align 4 .global main .type main,#function .proc 04 main: !#PROLOGUE# 0 save %sp,-912,%sp !#PROLOGUE# 1 sethi %hi(.LLC1),%o2 ldd [%o2+%lo(.LLC1)],%f4 mov 0,%o1 add %fp,-16,%o0 .LL11: ldd [%o0-800],%f2 faddd %f4,%f2,%f4 ldd [%o0-792],%f2 faddd %f4,%f2,%f4 ldd [%o0-784],%f2 faddd %f4,%f2,%f4 ldd [%o0-776],%f2 faddd %f4,%f2,%f4 ldd [%o0-768],%f2 faddd %f4,%f2,%f4 ldd [%o0-760],%f2 faddd %f4,%f2,%f4 ldd [%o0-752],%f2 faddd %f4,%f2,%f4 ldd [%o0-744],%f2 faddd %f4,%f2,%f4 ldd [%o0-736],%f2 faddd %f4,%f2,%f4 ldd [%o0-728],%f2 faddd %f4,%f2,%f4 add %o1,10,%o1 cmp %o1,99 ble .LL11 add %o0,80,%o0 std %f4,[%fp-16] call f,0 ldd [%fp-16],%o0 ret restore .LLfe1: .size main,.LLfe1-main .ident "GCC: (GNU) 2.7.2.2" Vorlesung Rechnerarchitektur Seite 123 VLIW Concepts • long (multiple) instruction issue • static instruction scheduling (at compile time) - highly optimizing compiler (use of multiple basic blocks) • simple instruction format => simple Hardware structure - synchronous operation (global clock) - pipelined Funktion-Units (fixed pipeline length !) 
- resolve of resource hazards and data hazards at compile time • control flow dependencies => limit of performance • unpredictable latency operation - memory references ----> fixed response - dma, interrut (external) • numerical processors front door M0 M1 M2 back door Advantages • max n times performance using n Funktional -Units • simple Hardware-Architecture Disadvantages • • • • • Mix of Functional-Units is Application dependent multiple Read/Write Ports code explosion static order stop of latency operations variable response (worst case) M3 Vorlesung Rechnerarchitektur Seite 124 VLIW Concepts VLIW : Trace Scheduling Speculative Execution 1 2 3 4 basic block 0 branch 1 1 2 3 4 < < < < speculativ < < < < speculativ 1 cc cc 1 2 store operations 3 branch 2 cc cc 3 4 branch 1 + 2 cc1 + cc2 resource independantdata indipendant instruction VLIW FU 1 int 3 2 ld/st RF 128 register int ld/st fp 4 fp 5 logical/branch Vorlesung Rechnerarchitektur Seite 125 Superscalar Processors Introduction • more than one instruction per clock cycle ( c i ≥ 1 ) • Basis is provided by pipelining (just as in the classic RISC case) • additional usage of ILP Basic Tasks of superscalar Processing: • • • • • Parallel decoding Superscalar Instruction Issue Parallel instruction execution Preserving sequential consistency of execution Preserving the sequential consistency of exception processing [ACA] Use several Functional Units (FUs) to execute instructions in parallel: • several FUs of the same type possible • different FUs for different classes of instructions (ALU, load/store, fp,...) • necessary to find enough instructions for these FUs Prerequisites: • • • • fetch, dispatch & issue of enough instructions per cycle hardware resource for dynamic instruction scheduling (at run time !) dependency analysis necessary completion of several instructions Problems: • Instruction Fetch • Find independent instruction - Analysis prior to Instruction Issue, dynamic, speculative scheduling • Remove "false" dependencies - "single assignement", re-order buffer, renamed registers, etc. • out-of-order problematic - necessary to complete instructions, maintain architectural statem, RAW I in-order out-of order Issue/Dispatch RISC Tomasulu, Reservation Stations Execution RISC "single Assignement" Completion RISC ROB etc. Vorlesung Rechnerarchitektur Seite 126 Scoreboard - CDC6600 With the multiplicity of functional units, and of operand registers and with a simple and highly efficient addressing scheme, a generalized queue and reservation scheme is practical. This is called a scoreboard. The scoreboard maintains a running file of each central register, of each functional unit, and of each of the three operand trunks (source and destination busses) to and from each unit. [Th64] Hazard detection and resolution • construction of data dependencies • check of data dependencies • check of resource dependencies Hazard avoidance by stalling the later Instructions Vorlesung Rechnerarchitektur Seite 127 Superscalar Processors Example: Dynamic Scheduling with a Scoreboard Pipelined Processor with multiple FUs out of order execution out of order read execute FU0 in order write in order issue IFetch RF read I Issue RF write FU1 issue check write back FU2 FU3 read operands Scoreboard 1. 
I Issue: If a functional unit for the instruction is free and no other active instruction has the same destination register, the scoreboard issues the instruction to the FU and updates its internal data structure. By ensuring that no other active FU wants to write its result into the destination register, WAW-Hazards are avoided. The instruction issue is stalled in the case of a WAW-Hazard or a busy FU. 2. Read operands: The scoreboard monitors the availability of the source operands. A source operand is available if no earlier issued active instruction is going to write this register. If the operands are available, the instruction can proceed. This scheme resolves all RAW-Hazards dynamically. It allows instructions to execute 'out of order'. 3. Execution: The required FU starts the execution of the instruction. When the result is ready, it notifies the scoreboard that it has completed execution. If another instruction is waiting on this result, it can be forwarded to the stalled FU. 4. Write back: If a WAR-Hazard exists, the write back of the instruction is stalled until the source operand has been read by the dependent instruction (the preceding instruction in the order of issue). [Hennessy, Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed. 1996]
Stage | Wait until (checks) | Bookkeeping (operations)
I Issue (activate instruction) | FU ? available, Fd ? not busy (no more than one result assigned to Rdest) | (FU) <-- busy, (Fd) <-- busy, Fd(FU) <-- 'D', Fs1(FU) <-- 'S1', Fs2(FU) <-- 'S2', Op(FU) <-- 'opc'
Read operands | Fs1(FU) ? valid, Fs2(FU) ? valid | (Fs1) <-- read, (Fs2) <-- read, s1(FU) <-- (Fs1), s2(FU) <-- (Fs2)
Execute | - | R <-- s1(FU) op s2(FU)
Write back (deactivate instruction) | result ready and no WAR-Hazard (Fd ? read) | (FU) <-- free, (Fd) <-- valid, Fd <-- R
Vorlesung Rechnerarchitektur Seite 128 Superscalar Processors Instruction Fetch, Issue & Dispatch To provide the processor with enough instruction bandwidth, techniques such as • I-Cache • I-Buffer • Trace Cache are used. It is necessary to provide the processor with several instructions per cycle! Two possibilities: • direct issue (see Scoreboard); resource conflicts and blocking are probable • indirect issue, i.e. issue & dispatch using reservation stations [Figure: indirect issue - IFetch and I Dispatch issue in order into the reservation stations (issue check), the reservation stations dispatch out of order to the FUs] Instruction issue Instruction issue is normally in order. If instructions entered a FU directly, head-of-queue blocking could occur -> reservation stations are used to avoid this. Instruction dispatch Instructions are dispatched from the reservation stations out of order. Any instruction within a reservation station that is "ready" may be dispatched. Vorlesung Rechnerarchitektur Seite 129 Superscalar Processors Dynamic Scheduling with Tomasulo (out of order execution) The main idea behind the Tomasulo algorithm is the introduction of reservation stations. They are filled with instructions from the decode/issue unit without a check for source operand availability. This check is performed in the reservation station logic itself. This allows issuing instructions whose operands are not yet computed. Issuing such an instruction (which would normally block the issue stage) to a reservation station moves it aside and gives the issue unit the possibility to issue the following instruction to another free reservation station. A reservation station is a register and a control logic in front of a FU.
The register contains the register file source operand addresses, the register file destination address, the instruction for this FU and the source operands. If the source operands are not available, the instruction waits in the reservation station for them. The control logic compares the destination tags of all result busses from all FUs. If a destination tag matches a missing operand address, the data value is taken from the result bus into the operand data field of the reservation station. A reservation station can forward its instruction to the execution FU when all its operands are available (data flow synchronization). A structural view of the data paths can be found in the PowerPC 620 part. [Figure: reservation station entry - opc (instruction for the FU), FU tag, d (5-bit destination operand address), s1/D1 (source 1 operand data, 5-bit source 1 operand address), s2/D2 (source 2 operand data, 5-bit source 2 operand address)] • Reservation stations decouple instruction issue from dependency checking. • An actual reservation station is several entries deep. • Any of the instructions may be dispatched to the FU if all dependencies have been met (i.e. fully associative). • Data forwarding occurs outside of the FUs. • The result busses from the FUs lead back to the inputs of the reservation stations. Vorlesung Rechnerarchitektur Seite 130 Superscalar Processors Removing "false dependencies" - Register Renaming Register renaming is used to enforce single assignment. Thus, false dependencies are removed. Types of rename buffers: • merged architectural and rename register file: architectural and rename registers are allocated from a single register file; usage of a mapping table; reclaiming is complex • separate rename and architectural register files: deallocation with retirement or reuse • holding renamed values in the ROB (re-order buffer) Maintaining sequential consistency concerns • the order in which instructions are completed • the order in which memory is accessed Usage of a ROB for sequential consistency: • Instructions move from the FUs to the ROB. • Instructions retire from the ROB to the architectural state if and only if they are finished and all previous instructions have been retired. • The ROB is implemented as a ring buffer. • The ROB can also be elegantly used for register renaming (see above). [Figure: reorder buffer as a ring buffer - Tail: allocate for issued instructions, Head: free for retired instructions; entry states: i = issued, x = in execution, f = finished] Vorlesung Rechnerarchitektur Seite 131 Superscalar Processors FU Synchronization Definition: A functional unit is a processing element (PE) that computes some output value based on its input [Ell86]. It is a part of the CPU that must carry out the instructions (operations) assigned to it by the instruction unit. Examples of FUs are adders, multipliers, ALUs, register files, memory units, load/store units, and also very complex units like the communication unit, the vector unit or the instruction unit. FUs with internal state information are carriers of data objects. In general, we can distinguish five different types of FUs: 1. FU with a single clock tick of execution time 2. FU with n clock ticks of execution time, nonpipelined 3. FU with n clock ticks of execution time, pipelined 4. FU with variable execution time, nonoverlapped 5. FU with variable execution time, overlapped The example of two FUs given in the following figure serves to illustrate the difference between the terms pipelined and overlapping. synchr. input X instr. f1 FU f2 f3 instr. result type 3 pipelined execution time: 3 (3 stages) instr.
rate: 1 overlapping factor: none FU input is synchronized by the availability of X f1 f2 type 5 with overlapping f3 result execution time: variable instr. rate: 1 overlapping factor: 3 (3 input register) In the FU type 3, instructions are processed in a pipelined fashion. The unit guarantees a result rate of one result per clock tick. This empties the pipeline as fast as it can be filled with instructions. None of the processing stages can be stopped for synchronization; only the whole FU can be frozen by disabling the clock signal of the FU. The FU type 5 has a synchronization input in stage f2 of the processing stages. Assuming the synchronization input is not ready, processing by f2 must stop immediately. This does not necessarily stop the instruction input to the instruction input queue. A load-store unit featuring reservation stations is good example of such FU. Only a full queue stops instruction issue, and this event halts the whole CPU. FU type 5 can be used to model a load-store unit that synchronizes to external memory control and halts the CPU only in the case where the number of instructions exceeds the overlapping factor. An example for FU of type 4 is the iterative floating point division unit. Vorlesung Rechnerarchitektur Seite 132 Superscalar Processors FU Synchronization: Example of FDIV-Unit When an instruction is issued to an iterative execution unit, this unit requires a number of clocks to perform the function (e.g. 27 clocks for fdiv). During this time, the FU is marked as "busy" and no further instruction can be issued to this FU. When the result becomes available, the result register must be clocked and the FU becomes "ready" again. instruction register and source operands iterative execution unit operands instruction internal_I I_issue FDIV result register result result_ready busy set_ busy BUSY FF set_ready Eq. for busy check: I_issue = I_for _div_FU & ( /busy + busy & resul_ ready ) CLK Instr div 1 div 2 I_issue div 1 internal_I busy result_ready div 1 result phases execution start load instruction execution 1 execution 2 execution 26 result to reg execution 27 execution start execution 1 load instruction A new instruction can be issued to the unit at the same clock edge when the internal result is transferred to the result register (advancement of data in the pipeline stages!). The instruction issue logic checks not only the busy signal but also the result_ready signal and can thus determine the just going ready ("not busy") FU. Vorlesung Rechnerarchitektur Seite 133 Superscalar Processors - Literature • Tomasulu algorithm (IBM 360/91) [Tomasulu, R.M., An Efficient Alogrithm for Exploiting Multiple Arithmetic Units, IBM Journal, Vol. 11, 1967] • Scoreboard (CDC 6600) [Th64] Thornton, James.E., Parallel Operation in the Control Data 6600, Proc. AFIPS, 1964 • Advanced Computer Architecture - A Design Space Approach; Sima D., Fountain T. & Kacsuk P.; 1997 Vorlesung Rechnerarchitektur Seite 134 PowerPC 620 Overview of the 620 Processor The 620 Processor is the first 64-bit Implementation of the PowerPC-Architecture. - 200 (300) MHz clock frequency 7 million transistor count, power dissipation < 17-23 W @ 150MHz, 3.3V 0.5 μm CMOS technology, 625 pin BGA max. 
400 MFLOPS floating-point performance (dp) 800 MIPS peak instruction execution four instructions issued per clock, instruction dispatch in order out-of-order execution, in-order completion multiple independent functional units, int, ld/st, branch, fp reservation stations, register renaming with rename buffers five stages of master instruction pipeline full instruction hazard detection separate register files for integer (GPR) and floating point (FPR) data types branch prediction with speculative execution static and dynamic branch prediction and branch target instruction buffer 4 pending predicted branches 32k data and 32k instruction cache, 8 set associative, physically addressed non-blocking load access, 64-bytes cache block size separate address ports for snooping and CPU acesses level-2 cache interface to fast SSRAM (1MB - 128MB) MMU, 64-bit effective address, 40-bit physical address 64-entry fully associative ATC multiprocessor support, bus snooping for cache coherency (MESI) pipelined snoop response dynamic reordering of loads and stores explicit address and data bus tagging, split-read protocol 128-bit data bus crossbar compatible, cpu id tagged bus clock = processor clock/n (n = 2,3,4) Vorlesung Rechnerarchitektur Seite 135 PowerPC 620 Pipeline Structure The 620 master instruction pipeline has five stages. Each instruction executed by the processor will flow through at least these stages. Some instructions flow through additional pipeline stages. The five basic stages are: - Fetch Dispatch Execute Complete "in order completion" Writeback Integer Instructions Fetch Dispatch/ Decode Execute Load Instructions Fetch Dispatch/ DCache Align Complete Writeback Decode EA Store Instructions Fetch Dispatch/ Decode EA DCache Lockup CompleteStore Branch Instructions Fetch Predict/ Resolve FP- Instructions Fetch Dispatch/ FPR Decode Accesss FP Mul Resolve CompleteWriteback Complete FP Add FP Norm Complete Writeback Vorlesung Rechnerarchitektur Seite 136 PowerPC 620 Branch correction Reorder buffer information Dispatch unit with 8 - entry instruction queue Instruction dispatch buses Fetch unit Instruction cache Completion unit with reorder buffer Register number Register number GPR Register number FPR Register number int fp Rename register GP operand buses FP operand buses Reservation station XSU0 XSU1 MCFSU FPU LSU GP result buses FP result buses Result status buses Data cache BPU Vorlesung Rechnerarchitektur Seite 137 PowerPC 620 Data Cache Management Instructions A very interesting feature of the PowerPC 620 is the availability of five-user-mode instructions which allow the user some explicit control over the caching of data. Similar mechanisms and a compiler implementation are proposed in [Lam, M.; Software Pipelining: An Effictive Scheduling Technique for VLIW Machines; in: Proc. SIGPLAN ’88, Conf. on Programming Language Design and Implementation, Jun. 1988, pp. 318-328].To improve the cache hit rate by compiler-generated instructions, five instructions are implemented in the CPU, which control data allocation and data write-back in the cache. DCBT - Data Cache Block Touch DCBTST - Data Cache Block Touch for Store DCBZ - Data Cache Block Zero DCBST - Data Cache Block Store DCBF - Data Cache Block Flush EIEIO - Enforce in-Order Execution of I/O A DCBT is treated like a load without moving data to a register and is a programming hint intended to bring the addressed block into the data cache for loading. When the block is loaded, it is marked shared or exclusive in all cache levels. 
It allows the load of a cache line to be scheduled ahead of its actual use and can hide the latency of the memory system behind useful computation. This reduces the stalls due to long-latency external accesses. A DCBTST is treated like a load without moving data to a register and is a programming hint intended to bring the addressed block into the data cache for storing. When the block is loaded, it is marked modified in all cache levels. This can be helpful if it is known in advance that the whole cache line will be overwritten, as it usually is in stack accesses. If the storage mode is write-back cache-enabled, then the addressed block is established in the data cache by the DCBZ without fetching the block from main storage, and all bytes of the block are set to zero. The block is marked modified in all cache levels. This instruction supports, for example, an OS creating new pages initialized to zero. The DCBST, or data cache block clean, writes modified cache data to memory and leaves the final cache state marked shared or exclusive. It is treated like a load with respect to address translation and protection. The DCBF, or data cache block flush, instruction is defined to write modified cache data to memory and leave the final cache state marked invalid. Vorlesung Rechnerarchitektur Seite 138 PowerPC 620 Storage Synchronization Instructions The 620 processor supports atomic operations (read/modify/write) with a pair of instructions. The Load and Reserve (LWARX) and the Store Conditional (STCX) instructions form a conditional sequence and provide the effect of an atomic RMW cycle, but not with a single atomic instruction. The conditional sequence begins with a Load and Reserve instruction; may be followed by memory accesses and/or computation that include neither a Load and Reserve nor a Store Conditional instruction; and ends with a Store Conditional instruction with the same target address as the initial Load and Reserve. These instructions can be used to emulate various synchronization primitives and to provide more complex forms of synchronization. The reservation (only one exists per processor!) is set by the lwarx instruction for the address EA. The conditionality of the Store Conditional instruction's store is based only on whether a reservation exists, not on a match between the address associated with the reservation and the address computed from the EA of the Store Conditional instruction. If the store operation was successful, the CR field is set and can be tested by a conditional branch. A reservation is cleared if any of the following events occurs: - The processor holding the reservation executes another Load and Reserve instruction; this clears the first reservation and establishes a new one. - The processor holding the reservation executes a Store Conditional instruction to any address. - Another processor or bus master device executes any Store instruction to the address associated with the reservation. The reservation granule for the 620 is 1 aligned cache block (64 bytes). Due to the ability to reorder load and store instructions in the bus interface unit (BIU), the sync instruction must be used to ensure that the results of all stores into a data structure, performed in a critical section of a program, are seen by other processors before the data structure is seen as unlocked.
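As a usage-level illustration of such a conditional sequence, the following C sketch builds a binary semaphore (spin lock) from compiler-provided atomic primitives. It is only a sketch under the assumption that GCC/Clang-style __atomic builtins are available; on PowerPC such builtins are normally compiled into the kind of lwarx/stwcx. retry loop described here. The names lock_acquire and lock_release are illustrative, not PowerPC instructions.

    #include <stdint.h>

    /* Sketch only: semaphore value 0 = free, 1 = set. */
    typedef uint32_t semaphore_t;

    static void lock_acquire(semaphore_t *s)
    {
        uint32_t expected;
        do {
            /* busy-wait test loop: serviced from the cache, no external bus traffic */
            while (__atomic_load_n(s, __ATOMIC_RELAXED) != 0)
                ;
            expected = 0;
            /* try to change 0 -> 1 atomically (the lwarx ... stwcx. pattern) */
        } while (!__atomic_compare_exchange_n(s, &expected, 1, 0,
                                              __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
    }

    static void lock_release(semaphore_t *s)
    {
        /* release: all stores of the critical region become visible first,
         * comparable to the sync instruction before the releasing store */
        __atomic_store_n(s, 0, __ATOMIC_RELEASE);
    }

The assembler sequence on the following page shows the same pattern written out directly with lwarx and stwcx.; the branch back after a failed store conditional corresponds to the retry of the loop above.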
Vorlesung Rechnerarchitektur Seite 139 PowerPC 620 Storage Synchronization Instructions This implementation is a clever solution, because the external bus system is not blocked for the test-and-set sequence. All other bus masters can continue using the bus and the memory system for other memory accesses. The probability that another processor is accessing the same address in between lwarx and stwcx is very low, because of the small number of instructions used for the modify phase (number depends on the synchronization primitive). If the lwarx instruction finds a semaphore which was already set by another processor or process, the next access of the semaphore within the reservation loop is serviced from the cache, so that no external bus activity is performed by the busy waiting test loop. The release of this semaphore shows up in the cache by the coherency snoop logic, which usually invalidates (or updates) the cache line. The next lwarx gets the released semaphore and tries to set it with the stwcx instruction. The execution of instructions from the critical region can be started after testing the monitor_flag by the bne instruction. Each critical region must be locked by such a binary semaphore or by a higher-level construct. The semaphore is released by storing a "0" to the semaphore address. The coherence protocol guarantees that this store is presented on the bus, although the cache may be in copy-back mode. The following instruction sequence implements a binary semaphore sequence (fetch-and-store). It is assumed that the address of the semaphore is in Ra, the value to set the semaphore in R2. link reservation L1 : lwarx stwcx. bne R1 <- (Ra) R2 -> (Ra) L1 broken reservation loop set not successful lwarx read semaphore set reservation stwcx if there was no external cycle to this address store new semaphore value from R2 to memory bne branch on SC false set successful cmpi bne R1,$0 L1 semaphore set (1) cmpi bne test semaphore value branch on semaphore value semaphore unset (0) enter critical region critical region critical region isync completes all outstanding instructions leave critical region isync st R1 -> (Ra) st reset to old semophore value Vorlesung Rechnerarchitektur Seite 140 PowerPC 620 Storage Synchronization Instructions Hardware resources for the supervision of the reservation. CPU 0 EQ_BMR or stwcx monitor flag cleared disturbedlwarx-stwcx sequence (don’t execute stwcx) physical address stwcx lwarx address path only BM REG LD_BMR lwarx load word and reserve = ADR REG 1 EQ_BMR monitor fllag link monitor fsm monitor flag set (execute stwcx) address check path ADDR external address bus 32 These functional components are mapped to the on-chip cache functions. This causes the reservation granule to be one cache line. The snooping of the cache provides the address comparison and the address tag entry contains the address of the reservation. The following instruction sequence implements an optimized version of the binary semaphore instruction sequence, which increases the propability to run an undisturbed lwarxstwcx sequence. L1 : ld R1 <- (Ra) atomic by reservation read semaphore using normal load into R1 cmp R1 , R2 compare semaphore value to R2 beq L1 loop back if semaphore not free (1) lwarx R1 <- (Ra) set the reservation to start the atomic operation (1) when there is a high propabiltity to succeed stwcx. 
R2 -> (Ra) bne L1 store conditional semaphore from R2 if there was no external cycle to this address branch back if reservation failed (likely not taken) cmp R1 , $0 test semaphore value again bne L1 branch back on ’semaphore set’ due to intermediate access from another processor critical region st R0 = 0 -> (Ra) reset semaphore value (0) It is assumed that the address of the semaphore is in Ra, the value to compare with in R2 Vorlesung Rechnerarchitektur Seite 141 Synchronization Definition : Synchronization is the enforcement of a defined logical order between events. This establishes a defined time-relation between distinct places, thus defining their behaviour in time. Definition : A process is a unit of activity which executes programs or parts thereof in a strictly sequential manner, requiring exactly one processor [Gil93]. It consists of the program code, the associated data, an address space, and the actual internal state. Definition : A thread or lightweight process is a strictly sequential thread of control (program code) like a little mini-process. It consists of a part of the program code, featuring only one entry and one exit point, the associated local data and the actual internal state. In contrast to a process, a thread shares its address space with other threads [Tan92]. Es gibt zwei unterschiedliche Situationen, die eine Synchronisation zwischen Prozessen erfordern. - - die Benutzung von shared resources und shared data structures. • Die Prozesse müssen in der Lage sein die gemeinsamen Resourcen zu beanspruchen, und ohne Beeinflussung voneinander benutzen zu können. • mutual exclusion - - die Zusammenarbeit von Prozessen bei der Abarbeitung von einer Task • Die Prozesse müssen bei der Abarbeitung der Teilaufgaben in eine zeitlich korrekte Reihenfolge gebracht werden. Die Datenabhängigkeit zwischen Prozessen die Daten erzeugen (producer) und denen die Daten verbrauchen (consumer) muß gelöst werden • process synchronisation -> RA-2 Mutual exclusion Ein bestimmtes Objekt kann zu jedem Zeitpunkt höchstens von einem Prozeß okkupiert sein. Zur Veranschaulichung der Problematik soll folgendes Beispiel dienen. Mehrere Prozesse benutzen einen Ausgabe-Kanal. Die Anzahl der ausgegebenen Daten soll in der Variable count gezählt werden. Dazu enthält jeder Prozeß die Anweisung count := count + 1 nach seiner Datenausgabe. Vorlesung Rechnerarchitektur Seite 142 Synchronization Diese Anweisung wird in folgende Maschinenanweisungen übersetzt . ld count , R1 add R1 , #1 store R1 , count Diese Sequenz befindet sich in jedem der Prozesse, welche parallel ihre Instruktionen abarbeiten. Dadurch könnte folgende Sequenz von Instruktionen entstehen. P1: P2: P2: P2: P1: P1: ld count , R1 ld count , R1 add R1 , #1 store R1 , count add R1 , 1 store R1 , count Damit ist count nur um 1 erhöht worden, obwohl die Instruktionssequenz zweimal durchlaufen wurde und count um 2 erhöht sein sollte. Um dieses Problem zu vermeiden führt man den Begriff des kritischen Bereichs (critical region) ein. Definition : Eine "critical region" ist eine Reihenfolge von Anweisungen, die unmittelbar, d.h. ohne Unterbrechung, von einem Prozeß ausgeführt wird. Der Eintritt in einen solchen kritischen Bereich wird nur einem Prozeß gestattet. Andere Prozesse, die diesen Bereich ebenfalls benutzen wollen, müssen warten bis der belegende Prozeß den Bereich wieder verläßt und damit zur Benutzung freigibt. Möglichkeiten zur Realisierung der critical region: - Software-Lösung (zu komplex !) 
Interrupt-Abschaltung (single processor) binary semaphores einfache kritische Bereiche Monitors Vorlesung Rechnerarchitektur Seite 143 Synchronization Interrupt - Abschaltung Ist in einem System nur ein Prozessor vorhanden, der die Prozesse ausführt, so kann man die mutual exclusion durch das Abschalten der Interrupts erreichen. Sie sind die einzige Quelle, durch die eine Instruktionssequenz unterbrochen werden kann. Praktische Ausführung : der kritische Bereich wird als Interrupt-Handler geschrieben. Eintritt durch Auslösen eines Traps. Interrupts sperren. Kritischen Bereich bearbeiten. Interrupt freigeben. Rücksprung zum Trap. Unteilbare Operationen (ATOMIC OPERATIONS) Der Basismechanismus zur Synchronisation ist die unteilbare Operation (atomic operation) Die unteilbare Operation zur Manipulation einer gemeinsamen Variablen bildet die Grundlage für die korrekte Implementierung der mutual exclusion Goal: generating a logical sequence of operations of different • threads • processes • processors Implementation: Atomic operations • • • • • • • • read-modify-write test-and-set (machine instruction) load-incr-store load-decr-store loadwith-store conditional reservation lock/unlock fetch-and-add Vorlesung Rechnerarchitektur Seite 144 Synchronization Atomic operations Sequential consistency assumes that all globally observable architectural state changes appear in the same order when observed by all other processors. The sequential consistency requires a mechanism for implementation. synchronization instructions read - modify - write load (read) modify "Bus" memory semaphore write Atomic Zugriff verhindern Bus reserviert gesamter Speicher blockiert Cacheline snooping Zugriff auf diese Adresse verh. (?) Vorlesung Rechnerarchitektur Seite 145 Synchronization Atomic operations (Example 1) Multiprocessor Systems using a bus as Processor-Memory interconnect network can use a very simple mechanism to guarantee atomic (undividable) operations on semaphore variables residing in shared memory. The bus as the only path to the memory can be blocked for the time of a read-modify-write (RMW) sequence against intervening transactions from other processors. The CAS2 Instruction of the MC68020 is an example of such a simple mechanism to implement the test-andset sequence as an non-interruptable machine instruction. Sequence of operations of CAS2 instruction 1. read of semaphore value from Mem(Ra) into register Rtemp 2. compare value in Rtemp to register Rb 3. conditional store of new sempahore to Mem(Ra) from register Rc Atomic operations (Example 2) Complex instruction with memory reference and lock prefix (seit i486) lock ; incl %0; sete %1 Atomically increments variable by 1 and returns true if the result is zero, or false for all other cases. Vorlesung Rechnerarchitektur Seite 146 Synchronization Atomic operations (Example 3) lock begin interlocked sequence Causes the next data load or store that appears on the bus to assert the LOCK # signal, directing the external system to lock that location by preventing locked reads, locked writes, and unlocked writes to it from other processors. External interrupts are disabled from the first instruction after the lock until the location is unlocked. unlock end interlocked sequence The next load or store (regardless of whether it hits in the cache) deasserts the LOCK # signal, directing the external system to unlock the location, interrupts are enabled when the load or store executes. 
These instructions allow programs running either user or supervisor mode to perform atomic read-modify-write sequences in multiprocessor and multithread systems. The lock protocol requires the following sequence of activities: • lock • Any load instruction that happens on the bus starts the atomic cycle. This load does not have to miss the cache; it is forced to the bus. • unlock • Any store instruction terminates the atomic cycle. There may be other instructions between any of these steps. The bus is locked after step 2 and remains locked until step 4. Step 4 must follow step 1 by 30 instructions or less: otherwise an instruction trap occurs. The sequence must be restartable from the lock instruction in case a trap occurs. Simple read-modify-write sequences are automatically restartable. For sequences with more than one store, the software must ensure that no traps occur after the first non-reexecutable store. Vorlesung Rechnerarchitektur Seite 147 Synchronization Atomic operations (Example 4) Starting with the memory read operation the RMW_ signal from the CPU tells the external arbitration logic not to rearbitrate the bus to any other CPU. Completing the RMW-instruction by the write to memory releases the RMW_ signal permitting general access to the bus (and to the memory). - test-and-set (two paired machine instructions) These two instructions are paired together to form the atomic operation in the sense, that if the second instruction (the store conditional) was successful, the sequence was atomic. The processor bus is not blocked between the two instructions. The address of the semaphore is supervised by the "reservation", set by the first instruction (load-and-reserve). [see also PowerPC620 Storage Synchronization] Vorlesung Rechnerarchitektur Seite 148 Synchronization binary Semaphores (Dijkstra 1965) Definition : Die binäre Semaphore ist eine Variable, die nur die zwei Werte ’0’ oder ’1’ annehmen kann. Es gibt nur die zwei Operationen P und V auf dieser Variablen. ’0’ entspricht hier einer gesetzten Semaphore, die den Eintritt in den kritischen Bereich verbietet 1. die P-Operation - Proberen te verlangen (wait) [P] der Wert der Variablen wird getestet - ist er ’1’, so wird er auf ’0’ gesetzt und die kritische Region kann betreten werden. - ist er ’0’, so wird der Prozess in eine Queue eingereiht, die alle Prozesse enthält, die auf dieses Ereignis (binäre Semaphore = ’1’) warten. 2. die V-Operation - verhogen (post) to increase a semaphore [V] die Queue wird getestet - ist ein Prozess in der Queue, wird er gestartet. Es wird bei mehreren Prozessen genau einer ausgewählt (z.B. der erste) - ist die Queue leer, wird die Variable auf ’1’ gesetzt P acts as a gate to limit access to tasks. einfache kritische Bereiche Um die korrekte Benutzung von Semaphoren zu erleichtern, wurde von Hoare der einfache kritische Bereich eingeführt: with Resource do S Die kritischen Statements S werden durch P und V Operationen geklammert und garantieren damit die mutual exclusion auf den in Resource zusammengefaßten Daten (shared Data). Damit wird die Kontrolle über die Eintritts- und Austrittsstellen der mutual exclususion vom Compiler übernommen. Durch die Deklaration der shared variables in der Resource ist es dem Compiler auch möglich, Zugriffsverletzungen zu erkennen und damit die gemeinsamen Daten zu schützen. Dadurch reduzieren sich die Fehler, die der Programmierer durch die alleinige Benutzung der P und V Operationen in Programme einbauen könnte, erheblich. 
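As a usage example for the P and V operations, the following C sketch protects the shared variable count from the lost-update problem of the output-channel example above. It is a sketch under the assumption of POSIX semaphores (sem_wait acting as P, sem_post as V); the thread function producer is only illustrative.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    /* Sketch only: binary semaphore around the critical region count = count + 1 */
    static sem_t mutex;            /* initial value 1 = free */
    static int   count = 0;        /* shared variable of the output-channel example */

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            sem_wait(&mutex);      /* P: enter the critical region                    */
            count = count + 1;     /* ld / add / store can no longer interleave       */
            sem_post(&mutex);      /* V: leave the critical region                    */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        sem_init(&mutex, 0, 1);    /* value 1: at most one process inside             */
        pthread_create(&t1, NULL, producer, NULL);
        pthread_create(&t2, NULL, producer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("count = %d\n", count);  /* 200000; without P/V a smaller value is possible */
        return 0;
    }

Without the P and V calls the two threads can interleave the ld/add/store sequence exactly as shown in the instruction trace above, and increments are lost.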
Vorlesung Rechnerarchitektur Seite 149 Synchronization Monitore Die Operationen auf shared data sind i.a. über das gesamte Programm verteilt. Diese Verteilung macht die Benutzung der gemeinsamen Datenstrukturen unübersichtlich und fehleranfällig. Faßt man die gemeinsamen Variablen und die auf ihnen möglichen Operationen in ein Konstrukt zusammen und stellt die mutual exclusion bei der Ausführung des Konstrukts sicher, so erhält man einen monitor oder auch secretary genannt. The basic concepts of monitors are developed by C.A.R. Hoare. Definition : Ein "monitor" ist eine Zusammenfassung von critical regions in einer Prozedur. Der Aufruf der Prozedur ist immer nur einem Prozess möglich. Somit ist der Eintritt in den Monitor mit der P-Operation äquivalent und der Austritt mit der V-Operation. Beispiel: monitor MONITORNAME; /* declaration of local data */ procedure PROCNAME(parameter list); begin /* procedure body */ end begin /* init of local data */ end Java provides a few simple structures for synchronizing the activities of threads. They are all based on the concepts of monitors. A monitor is essentially a lock. The lock is attached to a resource that many threads may need to access, but that should be accessed by only one thread at a time. The synchronized keyword marks places (variables) where a thread must acquire the lock before proceeding. [Patrick Niemeyer, Joshua Peck, Exploring JAVA, O’Reilly, 1996] class Spreadsheet { int cellA1, cellA2, cellA3; synchronized int sumRow() { // synchronized method return cellA1 + cellA2 + cellA3; } synchronized void setRow( int a1, int a2, int a3 ) { cellA1 = a1; cellA2 = a2; cellA3 = a3; } ... } In this example, synchronized methods are used to avoid race conditions on variables cellAx. Vorlesung Rechnerarchitektur Seite 150 Interconnection Networks (Bus) Bus Systems A special case of a dynamic (switched) interconnection network is a bus. It consists of a bundle of signal lines used for transmission of information between different places in a computer system. Signals are typically bundled concerning their functionality: e.g.: - Address Bus, Data Bus, Synchronization Bus, Interrupt Bus ... or by the hardware unit connected to the bus: - Processor Bus, Memory Bus, I/O Bus; Peripheral Bus ... e.g. Processor VCC Transceiver (bidirectional Port) pull-up EN0 Master EN1 EN2 Master/Slave TS-Driver e.g. Address Bus Bus Signal Lines n Slave Receiver Key means three state switched connection means fixed input from bus connection Memory Memory Three-state driver can be used for the dynamic switch. As the name three-state suggests, a TS-Driver has 3 output states: - drive high "1"; - drive low "0"; - no drive - high Z output. If all drivers of a bus signal line are disabled (high Z), the signal line is floating (should be avoided by a pull-up resistor). More about the technology can be learned in lecture: "Digitale Schaltungstechnik" At a time, only one master can be active on the bus. Only one three state (tri-state) driver is allowed to drive the bus signal lines. Enabling more than one driver may damage the system and must be avoided under all conditions. This is called ’bus contention’. Though an access mechanism for a bus with multiple master is required, called ’arbitration’. Vorlesung Rechnerarchitektur Seite 151 Interconnection Networks (Bus) Bus Arbitation When a processor wants to access the bus, it sets its BREQ (bus request signal) and waits for the arbiter to grant access to the bus, signaled by BG (bus grant). 
The hardware unit (arbiter) samples all BREQx signals from all clients and then generates a single token which is signaled to one bus client. The ownership of the token defines a client to be bus master. This token permits the master to enable the TS-Driver and become the active master of the bus. At this time, all other units are slaves on the bus and can receive the information driven by the master. Because all slaves can take in the actual bus data by their receivers, a broadcast communication can be performed in every bus cycle (the most important advantage of a bus, beside its simplicity). Processor 1 Processor 0 ADDR_out BREQ1 ADDR_out BREQ0 snoop_ADDR_in snoop_ADDR_in 32 x EN1 Master 0 EN0 Master 1 BG1 BG0 Address Bus Signal Lines 32 Slaves Arbiter Memory Memory The arbiter gets the bus request signals from all masters and decides which master is granted access to the bus. Simultaneous requests will be served one after the other. A synchronous (clocked) arbiter can be realized by a simple finite state machine (FSM). default Idle BREQ0 & ~BREQ1 no Grant BREQ1 ~BREQ0 & ~BREQ1 default default Grant0 BG0 ~BREQ0 & BREQ1 Grant1 BG1 BREQ0 & ~BREQ1 Metastable behavior of the arbiter FFs can (and should !!) be avoided by deriving the request signals in a synchronous manner (using the same clock). Vorlesung Rechnerarchitektur Seite 152 Bus Basic Functions Bussysteme können für die verschiedensten Aufgaben in einem Rechnersystem bestimmt sein: - Adressbus Datenbus I/O-Bus - Memory-Bus - Prozessor-Bus Interrupt-Bus Synchronisationsbus Fehleranalyse/-behandlung + 5V Ain Bin 500 Ω Beispiel: Synchronisationsbus Aout O.D. O.D. Bout Die Ausführung von Bussystemen ist abhängig von ihrer Anwendung, da sie normalerweise für ihre Anwendung optimiert werden. Wichtige Kenndaten eines Bussystems sind: - Wortbreite - Datenübertragungsrate - Übertragungsverfahren + Protokoll - synchrone - asynchrone - Hierarchiestufen - cachebus - operand --- result --- Processorbus L2 (bus) interface - Memory - I/O Bus - Peripherie USB - LAN Kommunikationsbusse Vorlesung Rechnerarchitektur Seite 153 Bus Protocol - Framing - Command, Address, burst, length - Type - Transaction-based - Split-phase transactions - Packet-based - Flowcontrol - asynchron: handshake - synchron: valid/stop, wait/disconnect, credit-based - Data integrity and reliability - Detection, Correction, Hamming, parity - Cyclic Redundancy Check (CRC), re-transmission - Advanced Features - Embedded clock (8b/10b) - DC-free (8b/10b) - Virtual channels - Quality of service (QoS) X DAV_ DACK_ tpd Handshake protocoll Vorlesung Rechnerarchitektur Seite 154 Bus Bus zur Verbindung von mehreren Platinen untereinander Backplane-Verdrahtung, global (VME, Futurebus+, XD-BUS) Verbindung von Bausteinen innerhalb einer Platine Peripheriebus, Prozessorbus, lokal (PCI-Bus, S-Bus, M-Bus) Verbindung von Systemen untereinander Workstationvernetzung (SCI-Interface, Ethernet ...) Geschwindigkeit eines Bussystems wird durch mehrere Faktoren begrenzt: • • • • • Laufzeit der Signale auf den Leitungen Eingangskapazität der Ports Verzögerungszeit der Ports Buszykluszeit Busclock Overhead (Protokoll) BTL +pd slot B.P.L. Bussysteme • Backplane B.S. - passiven Backplane - aktiven Backplane BTL VME TTL cmos 2 x 96 pins 2 x 52 I/O 1 Gbit/s Chip Interconnects Peripheral Chip Interconnect 2.1. PCI-Bus (33 MHz) 66 MHz (132 MHz)? (128 MHz)? 
Vorlesung Rechnerarchitektur Seite 155 Bus

Bus systems can be pipelined in order to increase the data transfer rate (i860XP, PowerPC 620, XD-Bus, ...). A possible split into phases is:
- arbitration
- addressing
- data transport
- status response

Since these phases can be executed in parallel on different groups of lines, the data transfer rate can increase by up to a factor of 4.

Vorlesung Rechnerarchitektur Seite 156 PCI - Peripheral Component Interconnect

The most important properties of the Peripheral Component Interconnect (PCI) bus:
• 32-bit data and address bus.
• Low cost through ASIC implementation.
• Transparent extension from a 32-bit data path (132 MB/s peak) to a 64-bit data path (264 MB/s peak).
• Variable burst length.
• Synchronous bus operation up to 33 MHz.
• Overlapped arbitration by a central arbiter.
• Multiplexed data and address bus to reduce the number of pins.
• Self-configuration of PCI components through predefined configuration registers.
• Plug-and-play capable.
• Processor independent; supports future processor families (through a host bridge or direct implementation).
• Supports 64-bit addressing.
• Specified for 5 V and 3.3 V signaling.
• Multi-master capable; allows peer-to-peer accesses from any PCI master to any PCI master/slave.
• Hierarchical structure of several PCI bus levels.
• Parity for addresses and data.
• PCI components are compatible with existing driver and application software.

Following revision 2.0 of the PCI specification, introduced in April 1993, an extension to revision 2.1 is already in preparation. Its most important change is the increase of the maximum clock frequency from 33 MHz to 66 MHz, which once more doubles the maximum transfer rate, to 528 MB/s with a 64-bit data path.

Vorlesung Rechnerarchitektur Seite 157 PCI - Peripheral Component Interconnect

Required pins:
- Address and data: A/D[31:0], C/BE[3:0]#, PAR
- Interface control: FRAME#, TRDY#, IRDY#, STOP#, DEVSEL#, IDSEL
- Error reporting: PERR#, SERR#
- Arbitration (masters only): REQ#, GNT#
- System: CLK, RST#

Optional pins:
- 64-bit extension: A/D[63:32], C/BE[7:4]#, PAR64, REQ64#, ACK64#
- Interface control: LOCK
- Interrupts: INTA#, INTB#, INTC#, INTD#
- Cache support: SBO#, SDONE
- JTAG (IEEE 1149.1): TDI, TDO, TCK, TMS, TRST#

[Figure: typical system with CPUs and memory on the host bus, a host bridge containing the PCI arbiter, and LAN, graphics, SCSI and an I/O subsystem attached to the PCI bus.]

Vorlesung Rechnerarchitektur Seite 158 PCI - Peripheral Component Interconnect Addressing

The physical address space of the PCI bus consists of three address regions: memory address space, I/O address space, and configuration address space. Each region spans the full 4 GB range A/D[31:0] = 00000000h-FFFFFFFFh and is selected by the command on C/BE[3:0]#:
- Memory address space: commands 0110, 0111, 1100, 1110, 1111
- I/O address space: commands 0010, 0011
- Configuration address space: commands 1010, 1011

Vorlesung Rechnerarchitektur Seite 159 PCI - Peripheral Component Interconnect Bus Commands

The bus commands indicate to the target the kind of access the master requests and determine the address space into which the address falls. They are encoded on the C/BE[3:0]# lines during the address phase and apply to the entire subsequent transaction. The codes of the individual bus commands are listed in Table 1.
Table 1: Definition of the bus commands

C/BE[3:0]#   Command Type
0000         Interrupt Acknowledge
0001         Special Cycle
0010         I/O Read
0011         I/O Write
0100         Reserved
0101         Reserved
0110         Memory Read
0111         Memory Write
1000         Reserved
1001         Reserved
1010         Configuration Read
1011         Configuration Write
1100         Memory Read Multiple
1101         Dual Address Cycle
1110         Memory Read Line
1111         Memory Write and Invalidate

Vorlesung Rechnerarchitektur Seite 160 PCI - Peripheral Component Interconnect Bus Cycles

[Figure: two PCI bus cycle timing diagrams (CLK, FRAME#, IRDY#, TRDY#, DEVSEL#, A/D, C/BE#): a burst transaction with one address phase followed by the data phases data1-data4, and a second example in which wait states are inserted by master and target.]

Vorlesung Rechnerarchitektur Seite 161 PCI - Peripheral Component Interconnect

Unlike accesses in the memory or I/O address space, which are addressed unambiguously by the address on A/D[31:0] and the bus commands, the device in a configuration cycle is addressed by an additional signal: IDSEL, which acts as a chip select. Every device has its own IDSEL, which is sampled only during the address phase when the bus commands on C/BE[3:0]# signal a configuration read or write. For all other accesses the level of IDSEL is irrelevant. The configuration registers are addressed as doublewords; one doubleword is 32 bits, so the address bits A/D[1:0] are not needed for address decoding. The bytes within the addressed doubleword are selected with the byte enables C/BE[3:0]#.

[Figure: timing of a configuration cycle; IDSEL is asserted during the address phase while C/BE# carries the configuration command (101x).]

The PCI bus distinguishes two types of configuration cycles, identified by the bit combination on A/D[1:0]:
• Type 0 configuration cycles (A/D[1:0] = '00') are all cycles with which a bridge addresses the devices located on the bus assigned to that bridge. A bridge is a device that connects different bus levels (or bus systems).
• Type 1 (A/D[1:0] = '01') applies to configuration cycles that concern devices located in subordinate PCI bus hierarchies.

Vorlesung Rechnerarchitektur Seite 162 PCI - Peripheral Component Interconnect

The information contained in the doubleword address A/D[31:2] depends on the type of the configuration cycle.

Address formats of configuration cycles:
- Type 0: A/D[31:11] reserved, A/D[10:8] function number, A/D[7:2] register number, A/D[1:0] = 00
- Type 1: A/D[31:24] reserved, A/D[23:16] bus number, A/D[15:11] device number, A/D[10:8] function number, A/D[7:2] register number, A/D[1:0] = 01

The bit combinations in A/D[31:11] have no meaning for Type 0, and those in A/D[31:24] have no meaning for Type 1.

Bus number: specifies the number of the bus on which the device to be configured resides. Through a hierarchically staggered arrangement of PCI buses, PCI allows up to 256 bus levels.
Device number: selects one of the 32 possible target devices on each bus level for which the configuration cycle is intended.
Function number: addresses one of the at most 8 different functions of a multi-function device.
Register number: doubleword address within the configuration register space of 64 doublewords.
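For illustration, the two address formats can be packed into the doubleword that is driven on A/D[31:0] during the address phase of a configuration cycle. The helper functions below are invented for this sketch:

#include <stdint.h>

/* Type 0: devices on the bridge's own bus.
 * A/D[10:8] = function number, A/D[7:2] = register number, A/D[1:0] = 00.
 */
static uint32_t cfg_addr_type0(uint32_t function, uint32_t reg)
{
    return ((function & 0x7u)  << 8) |   /* 3-bit function number   */
           ((reg      & 0x3Fu) << 2);    /* 6-bit register number,  */
}                                        /* A/D[1:0] = 00 (Type 0)  */

/* Type 1: devices behind subordinate PCI-to-PCI bridges.
 * A/D[23:16] = bus, A/D[15:11] = device, A/D[10:8] = function,
 * A/D[7:2] = register, A/D[1:0] = 01.
 */
static uint32_t cfg_addr_type1(uint32_t bus, uint32_t dev,
                               uint32_t function, uint32_t reg)
{
    return ((bus      & 0xFFu) << 16) |
           ((dev      & 0x1Fu) << 11) |
           ((function & 0x7u)  << 8)  |
           ((reg      & 0x3Fu) << 2)  |
           0x1u;                         /* A/D[1:0] = 01 (Type 1) */
}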
Vorlesung Rechnerarchitektur Seite 163 PCI - Peripheral Component Interconnect

[Figure: hierarchical PCI bus structure: CPU and memory on the host/system bus, a host-to-PCI bridge leading to PCI bus number x, and a PCI-to-PCI bridge leading to PCI bus number y; on each bus level the devices are selected by device number, function number and register number.]

Vorlesung Rechnerarchitektur Seite 164 PCI - Peripheral Component Interconnect

[Figure: two ways of generating IDSEL: (a) separate IDSEL lines, one per slot, driven by a decoder in the PCI bridge; (b) IDSEL derived from the upper address bits (e.g. A/D[13:11]), i.e. during a Type 0 cycle the bridge maps the device number onto one of the upper A/D lines (1-of-21 decoding of A/D[31:11]). Caption: mapping of IDSEL onto the upper address bits.]

Vorlesung Rechnerarchitektur Seite 165 PCI - Peripheral Component Interconnect Configuration

Configuration space header as defined by the PCI specification (revision 2.0); offsets 40h-FFh hold vendor-defined configuration registers:
00h: Device ID / Vendor ID
04h: Status / Command
08h: Class Code / Revision ID
0Ch: BIST / Header Type / Latency Timer / Cache Line Size
10h-24h: Base Address Registers
28h: Reserved
2Ch: Reserved
30h: Expansion ROM Base Address
34h: Reserved
38h: Reserved
3Ch: Max_Lat / Min_Gnt / Interrupt Pin / Interrupt Line

Vorlesung Rechnerarchitektur Seite 166 PCI - Peripheral Component Interconnect

Base classes:
Base Class   Meaning
00h          Devices built before the base class codes were defined
01h          Mass storage controller
02h          Network controller
03h          Display controller
04h          Multimedia controller
05h          Memory controller
06h          Bridge device
07h-FEh      Reserved
FFh          Devices that fit into none of the base classes above

Base class 06h and its sub classes:
Sub Class   Prog. If.   Meaning
00h         00h         Host bridge
01h         00h         ISA bridge
02h         00h         EISA bridge
03h         00h         MC bridge
04h         00h         PCI-to-PCI bridge
05h         00h         PCMCIA bridge
80h         00h         Other bridge devices

Vorlesung Rechnerarchitektur Seite 167 PCI - Peripheral Component Interconnect

Layout of the base address registers (BARs): for a memory BAR, bit 0 = 0 (memory space indicator), bits [2:1] = type, bit 3 = prefetchable, bits [31/63:4] = base address; for an I/O BAR, bit 0 = 1 (I/O space indicator), bit 1 reserved, bits [31:2] = base address.

[Figure: register model of a base address register. The size of the address range (here 1 MB) is encoded by hard-wiring the lower address bits to 0 (1111 1111 1111 0000 0000 0000 0000 0000); the implemented bits are set on a write of FFFFFFFFh and the register value is read back to determine the size.]

Interrupt Pin (read-only): this register indicates which interrupt pin the device uses. The decimal value 1 means INTA#, 2 INTB#, 3 INTC# and 4 INTD#. With the value 0 the device indicates that it does not use interrupts.
Interrupt Line (read/write): the value of this register indicates which system interrupt pin the device interrupt pin is connected to. The configuration software can use this value, for example, to assign priorities. The values of this register are system dependent.
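The "write FFFFFFFFh, read back" mechanism of the base address registers can be illustrated with a small, self-contained simulation. The 1 MB BAR below and all names are assumptions made for this sketch only:

#include <stdint.h>
#include <stdio.h>

/* Simulated 1 MB memory BAR: address bits [31:20] implemented, [19:4] wired to 0. */
static uint32_t bar_reg = 0;
static const uint32_t BAR_IMPLEMENTED_MASK = 0xFFF00000u;

static uint32_t cfg_read32(void)        { return bar_reg; }
static void     cfg_write32(uint32_t v) { bar_reg = v & BAR_IMPLEMENTED_MASK; }

/* Determine the size of the address range requested by the BAR. */
static uint32_t bar_size(void)
{
    uint32_t original = cfg_read32();
    cfg_write32(0xFFFFFFFFu);            /* "set on write FFFFFFFFh"        */
    uint32_t readback = cfg_read32();    /* hard-wired low bits read as 0   */
    cfg_write32(original);               /* restore the register            */
    readback &= 0xFFFFFFF0u;             /* strip the memory-BAR flag bits  */
    return ~readback + 1u;               /* FFF00000h -> 00100000h = 1 MB   */
}

int main(void)
{
    printf("BAR requests %u bytes\n", bar_size());   /* prints 1048576 */
    return 0;
}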
Vorlesung Rechnerarchitektur Seite 168 PCI - Peripheral Component Interconnect Arbitration

[Figure: overlapped (hidden) arbitration on the PCI bus: while master 1 performs its address and data phases, the arbiter already removes GNT#1 and asserts GNT#2, so that master 2 can start its access immediately afterwards. The central PCI arbiter receives REQ0#-REQ2# from the masters and returns GNT0#-GNT2#; a grant timer limits how long a granted master may wait before asserting FRAME#.]

Vorlesung Rechnerarchitektur Seite 169 Modern Peripheral Interfaces

What are the available interfaces for peripheral devices (standard vs. proprietary)?
• PCI-X, PCIe, HT (cHT)
• system bus, integrated solutions

Features (PCI-X | PCI Express (PCIe) | HyperTransport (HT)):
- Data width: 32/64 bit | link width 2, 4, 8, 16, 32 bit | 2, 4, 8, 16, 32 bit
- Usage: I/O bus, peripheral extension | I/O bus, peripheral extension | I/O bus**, peripheral extension
- Operation mode: fully synchronous, clocked | source synchronous, 8B/10B-coded data | source synchronous, 1 clock per byte
- Number of signal lines: 64 + 39 = 101 | 4, 8, 16, 32, 64 | 26, 36, 57, 105, 199
- Multiplexed operation: yes, address/data | yes | yes, message oriented
- Data transmission (clock frequency): 0-33/66 MHz, 100/133 (266*) MHz | 2.5 GHz | 200-800 MHz (1-1.6 GHz)
- Signal transmission: CMOS level, reflective wave signalling, clock synchronous | CML level, NRZ, serial, differential, embedded clock | 600 mV differential, DDR (double data rate), packetized
- Termination: none | 100-110 Ohm | 100 Ohm on chip, overdamped
- Burst transfers: yes, many modes, 4x burst, arbitrary length | yes, message transfers | yes, command + message transfers
- Split transactions: yes | yes | yes
- Max. bandwidth: 533 MB/s @ 66 MHz/64 bit, 1 GB/s @ 133 MHz/64 bit | 2 x 2.5 Gbit/s @ x2, 10 GB/s @ x32 | 0.2 GB/s @ 200 MHz/2 bit, 12.8 GB/s @ 1600 MHz/32 bit
- Max. number of devices: bridge + 4/2/1 devices, 1 I/O device @ 133 MHz | point to point | point to point, bidirectional
- Max. length of signal lines: approx. 10 cm | approx. 3-10 cm on FR4*** | approx. 3-10 cm on FR4
- Standard: industry (Intel) + IEEE | industry (Intel) + consortium | industry (AMD) + consortium
- Specification size: approx. 220 pages | approx. 420 pages | approx. 330 pages
- Web info: www.pcisig.org | www.intel.com | www.hypertransport.org

*) DDR double data rate transfer, 2.5 GHz
**) extended version (cHT) for CPU interconnect with a cache coherency protocol
***) FR4: PCB material

Vorlesung Rechnerarchitektur Seite 170 PCI-X Peripheral Component Interconnect

Features: available in many node computers; servers use switched architectures. Synchronous interface controlled by a single-ended clock. In the 133 MHz mode, only one I/O device is allowed on the "bus" (bridge-to-device). In the future it will be replaced by PCIe because of the reduced pin count and higher bandwidth.

The PCI-X bus cycle shows the overhead associated with a burst transfer without target wait states: 2 clocks arbitration + 2 clocks address/attribute + 2 clocks target response and turnaround + n data phases of 8 B each. For n = 4 this gives 6 overhead clocks for 4 data clocks at a data size of 32 B (at 133 MHz one clock is 7.5 ns). n1/2, the burst length at which half of the peak bandwidth is reached, corresponds to 6 data transfer cycles of 8 B each, i.e. n1/2 = 48 B. The peak bandwidth is 1 GB/s; the 'real' bandwidth is around 900 MB/s for long bursts.

Vorlesung Rechnerarchitektur Seite 171 PCI-Express (PCIe)

SERDES-based peripheral interconnect. Performance:
• Low-overhead, low-latency communication to maximize application payload bandwidth and link efficiency
• High bandwidth per pin to minimize pin count per device and connector interface
• Scalable performance via aggregated lanes and signaling frequency: x1, x2, x4, x8 and x16; Gen1 2.5 Gb/s, Gen2 5 Gb/s, Gen3 8 Gb/s (introduced in 2010)

The fundamental PCI Express link consists of two low-voltage, differentially driven signal pairs: a transmit pair and a receive pair.
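The PCI-X overhead numbers above can be turned into a small worked calculation (the constants are taken from the slide; the program itself is only illustrative):

#include <stdio.h>

int main(void)
{
    const double clk_ns    = 7.5;   /* 133 MHz clock period          */
    const int    overhead  = 6;     /* 2 + 2 + 2 clocks per burst    */
    const int    bytes_clk = 8;     /* 64-bit data path, 8 B / clock */

    for (int n = 1; n <= 64; n *= 2) {
        double time_ns = (overhead + n) * clk_ns;
        double mb_s    = (n * bytes_clk) / time_ns * 1000.0;   /* MB/s */
        printf("burst %4d B: %6.0f MB/s\n", n * bytes_clk, mb_s);
    }
    /* At n = 6 data phases (48 B) the burst reaches 50% efficiency,
     * i.e. half of the ~1 GB/s peak bandwidth: n1/2 = 48 B.          */
    return 0;
}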
Combining several lanes (each a single differential pair per direction) provides high bandwidth; e.g. x16 is a bidirectional link with 2.5 Gb/s per lane, delivering a raw data rate of 2 x 40 Gb/s = 10 GB/s. PCIe shows a lower latency than PCI-X because it typically comes directly from the 'root complex' (north bridge). Serializer latency is very implementation dependent. [PCIeSpec]

A switch is defined as a logical assembly of multiple virtual PCI-to-PCI bridge devices. Advanced Switching Interconnect (ASI) is based on the physical layer of PCIe and is aimed at interconnects 'in the rack/cabinet'. ASI did not gain any real market share. :-(

Vorlesung Rechnerarchitektur Seite 172 Hypertransport (HT) (1)

AMD's I/O-HT is an open standard; cHT (coherent HT) is available only to large customers. HyperTransport is intended to support 'in-the-box' connectivity (motherboard). The architecture of the HyperTransport I/O link can be mapped onto five layers [HT-WP]:

The physical layer defines the physical and electrical characteristics of the protocol. This layer interfaces to the physical world and includes data, control, and clock lines. The data link layer includes the initialization and configuration sequence, periodic cyclic redundancy check (CRC), the disconnect/reconnect sequence, information packets for flow control and error management, and doubleword (32-bit) framing for other packets (data packet sizes from 4 to 64 bytes). The protocol layer includes the commands, the virtual channels in which they run, and the ordering rules that govern their flow. The transaction layer uses the elements provided by the protocol layer to perform actions such as reads and writes. The session layer includes rules for negotiating power management state changes, as well as interrupt and system management activities.

[Figure: HT-based system: processors and memory at the north bridge (memory controller, AGP), an HT I/O chain running downstream/upstream through tunnel devices (bridge to a system area network NIC, LAN, disk) and ending in a cave device (super I/O).]

HT supports several methods of data transfer between devices: PIO, DMA, and peer-to-peer (the latter involves the bridge device connecting the upstream with the downstream link). Interrupts are signalled by sending interrupt messages [HT-MS].

Vorlesung Rechnerarchitektur Seite 173 Hypertransport (2)

An HT bus uses coupon-based flow control to avoid receiver overrun. Coupons (credits) flow back with NOP control packets (idle link signaling) of 4 bytes. Control and data packets are distinguished by the CTL signal line. Control packets are separated into request, response and information packets. Read requests carry a 40-bit address and are executed as split transactions. A sized read request is an 8 B packet which is answered with a read response packet of 4 B and a data packet of 4-64 B (an overhead of 12 B for 4-64 B of data; best case 12 B per 64 B).

Packet types: control packets (information: 4 B; request: 4 or 8 B; response: 4 B) and data packets (4-64 B).

The physical layer uses a modified LVDS differential signaling. [Figure: LVDS signal transmission and termination.]

Control packets may be inserted into data packets at any 4 B boundary. Only one data packet is active at a time. The CTL signal distinguishes between control and data packets. [Figure: control and data packets interleaved on a bidirectional HT bus.]

Vorlesung Rechnerarchitektur 1 Seite 174 I/O-devices I/O-devices: Basics

An I/O device is a resource of a computer system which implements the function of a specific I/O interface. An I/O interface is used to establish communication between a computer system and the outside world (which may be another computer system or a peripheral device such as a printer).
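Returning to the coupon-based flow control described for HyperTransport above, a minimal sketch of the sender side may help; the credit counter model and all names are assumptions of this example, not part of the HT specification:

#include <stdbool.h>

typedef struct {
    int credits;   /* free receiver buffer slots announced so far */
} link_tx_t;

/* Called when a NOP/information packet returns n freed buffer slots. */
void credits_returned(link_tx_t *tx, int n)
{
    tx->credits += n;
}

/* Try to send one packet; returns false if no credit is available,
 * which prevents receiver buffer overrun by construction.
 */
bool try_send(link_tx_t *tx)
{
    if (tx->credits == 0)
        return false;        /* must wait until credits flow back        */
    tx->credits--;           /* consume one buffer slot at the receiver  */
    /* ... drive the packet onto the link here ...                       */
    return true;
}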
[Figure: generic I/O device block diagram: CPU with cache on the system bus (address/data), an I/O bridge to the I/O bus (typically no cache coherency), and the I/O device with its command register (CR), status register (SR), data register read (DRR) and data register write (DRW) behind the I/O interface.]

The minimal set of registers is:
• control register CR
• status register SR
• data register read DRR
• data register write DRW

The CR is used to bring the device into a specific operating mode. The content of this register is very device specific; an enable bit for activating the output and input registers is normally included. The status register signals the internal state of the device, e.g. whether the transmitter register is empty (TX_EMPTY) or the receiver register is full (RX_FULL).

Vorlesung Rechnerarchitektur 1 Seite 175 I/O-devices

The data registers are used to transfer data into and out of the device, typically using programmed I/O (PIO). In order to access the device, it must be placed in the address space of the system:
• memory mapped device
• I/O space access

Using special I/O instructions for device access directly selects a predefined address space that is not accessible to other instruction types, which restricts its use. As the name suggests, a memory mapped device is placed in the memory address space of the processor. All instructions can be used to access the device (typically load/store). Special care must be taken if the processor can reorder load/store instructions, and caching of this address space should be turned off.

[Figure: address space partitioning for memory mapped devices (example): the device registers of device A and device B occupy a region at the top of the 32-bit memory address space (0xF000_0000-0xFFFF_FFFF), the rest is free memory space; alternatively, a separate I/O space holds the device registers.]

Vorlesung Rechnerarchitektur 2 Seite 176 Direct Memory Access (DMA) DMA Basics

Definition: A direct memory access (DMA) is an operation in which data is copied (transported) from one resource to another resource in a computer system without the involvement of the CPU.

The task of a DMA controller (DMAC) is to execute this copy operation from one resource location to another. The copy can be performed from:
- I/O device to memory
- memory to I/O device
- memory to memory
- I/O device to I/O device

A DMAC is a resource of the computer system, independent of the CPU, added for the concurrent execution of DMA operations. The first two operation modes are 'read from' and 'write to' transfers between an I/O device and main memory, which are the common operations of a DMA controller. The other two operations are slightly more difficult to implement, and most DMA controllers do not implement device-to-device transfers.

[Figure: simplified logical structure of a system with DMA: CPU, DMA controller, arbiter, memory and I/O device on a common address/data bus; the I/O device signals REQ to the DMAC and receives ACK.]

The DMAC replaces the CPU for the task of transferring data from the I/O device to main memory (or vice versa), which otherwise would have to be executed by the CPU using programmed input/output (PIO). PIO is realized by a small instruction sequence executed by the processor to copy data; the 'memcpy' function supplied by the system is such a PIO operation. The DMAC is a master/slave resource on the system bus, because it must supply the addresses for the resources involved in a DMA transfer. It requests the bus whenever a data value is available for transport, which the device signals via the REQ line. The functional unit DMAC may be integrated into other functional units of a computer system, e.g. the memory controller, the south bridge, or directly into an I/O device.
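A short PIO sketch for the generic memory mapped device described above; the base address 0xF000_0000, the register order and the status bit positions are assumptions made only for this example:

#include <stdint.h>

typedef struct {
    volatile uint32_t cr;    /* control register                  */
    volatile uint32_t sr;    /* status register                   */
    volatile uint32_t drr;   /* data register read  (from device) */
    volatile uint32_t drw;   /* data register write (to device)   */
} io_device_t;

#define DEV          ((io_device_t *)0xF0000000u)  /* memory mapped, uncached */
#define SR_TX_EMPTY  (1u << 0)                     /* transmitter register empty */
#define SR_RX_FULL   (1u << 1)                     /* receiver register full     */

/* PIO output: copy a buffer to the device, one word per status poll. */
void pio_write(const uint32_t *buf, int n)
{
    for (int i = 0; i < n; i++) {
        while (!(DEV->sr & SR_TX_EMPTY))   /* wait until the device can accept data */
            ;
        DEV->drw = buf[i];                 /* store to the write data register      */
    }
}

/* PIO input: read n words from the device into a buffer. */
void pio_read(uint32_t *buf, int n)
{
    for (int i = 0; i < n; i++) {
        while (!(DEV->sr & SR_RX_FULL))    /* wait until a value is available       */
            ;
        buf[i] = DEV->drr;                 /* load from the read data register      */
    }
}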
Vorlesung Rechnerarchitektur 2 Seite 177 Direct Memory Access (DMA) DMA Operations

Many different operating modes exist for DMACs. The simplest one is the single block transfer, copying a block of data from a device to memory. For the more complex operations please refer to the literature [Mot81]. Here only a short list of operating modes is given:
- single block transfer
- chained block transfers
- linked block transfers
- fly-by transfers

All these operations normally access the block of data in a linear sequence. Nevertheless, more useful access functions are possible, such as constant stride, constant stride with offset, incremental stride, ...

[Figure: execution of a DMA operation (single block transfer): CPU, DMAC (DMA command register, device base register, block length register, memory base register, temporary data register), memory (descriptor, command area, memory block) and I/O device (device data register); the numbered steps 1-6 correspond to the description below.]

The CPU prepares the DMA operation by constructing a descriptor (1) containing all the information the DMAC needs to perform the DMA operation independently (an offload engine for data transfer). It initiates the operation by writing a command to a register in the DMAC (2a) or to a specially assigned memory area (command area) where the DMAC can poll for the command and/or the descriptor (2b). The DMAC then addresses the device data register (3) and reads the data into a temporary data register (4). In another bus transfer cycle it addresses the memory block (5) and writes the data from the temporary data register to the memory block (6).

Vorlesung Rechnerarchitektur 2 Seite 178 Direct Memory Access (DMA) DMA Operations

The DMAC increments the memory block address and continues this loop until the block length is reached. The completion of the DMA operation is signaled to the processor by an IRQ or by setting a memory semaphore variable, which can be tested by the CPU.

Further aspects: multiple channels; physical addressing and address translation; snooping for cache coherency.

The DMA control signals (REQ, ACK) are used to signal the availability of values in the I/O device for transport. The DMAC uses bus bandwidth, which may slow down processor execution through bus conflicts (solution for high performance systems: use a crossbar as interconnect!).

[Figure: DMA descriptor in memory with command, source memory base pointer, destination memory base pointer and block length, pointing to the (padded) source and destination memory blocks.]

[Flik] Flik, Thomas: Mikroprozessortechnik - CISC, RISC, Systemaufbau, Assembler und C. Springer-Verlag, 6th ed., 2001.

Vorlesung Rechnerarchitektur 2 Seite 179 Completion signaling of an operation Completion Signaling

For all communication functions it is important to know when an operation has completed. Signaling this event to the process interested in this information is surprisingly difficult. The most common way is to raise an interrupt, which stops normal processing of the CPU and activates the interrupt handler. Although interrupt processing has sped up significantly over the last years, it still requires saving the CPU state of the currently running process. This produces overhead; even worse, in the newest processors the register file is larger than ever.
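The descriptor-based single block transfer can be sketched as follows. The field names mirror the slide (command, source/destination base pointers, block length), while the command encoding, the DMAC register address and the completion semaphore are assumptions of this sketch:

#include <stdint.h>

typedef struct {
    uint32_t command;        /* e.g. single block transfer, direction     */
    uint32_t src_base;       /* source base pointer (memory or device)    */
    uint32_t dst_base;       /* destination memory base pointer           */
    uint32_t block_length;   /* number of bytes to copy                   */
    volatile uint32_t done;  /* completion semaphore written by the DMAC  */
} dma_descriptor_t;

#define DMA_CMD_SINGLE_BLOCK  0x1u                        /* assumed encoding */
#define DMAC_CMD_REG  ((volatile uint32_t *)0xF0001000u)  /* assumed address  */

/* Step 1: build the descriptor; step 2a: start the DMAC by writing the
 * descriptor address to its command register; then wait for completion.
 */
void dma_copy(dma_descriptor_t *d, uint32_t src, uint32_t dst, uint32_t len)
{
    d->command      = DMA_CMD_SINGLE_BLOCK;
    d->src_base     = src;
    d->dst_base     = dst;
    d->block_length = len;
    d->done         = 0;

    *DMAC_CMD_REG = (uint32_t)(uintptr_t)d;   /* hand the descriptor to the DMAC  */

    while (d->done == 0)                      /* poll the memory semaphore        */
        ;                                     /* (alternatively: wait for an IRQ) */
}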
Design decisions:
• IRQ | polling at a device register | replication/mirroring in main memory | notification queue | thread scheduling
• application processor / communication processor model, active messages

NICs such as InfiniBand adapters use the concept of a notification queue: for every communication instruction a corresponding entry is written into the completion notification queue when the operation has finished. This entry can be tested by the user process owning the queue.

Vorlesung Rechnerarchitektur 2 Seite 180 Direct Memory Access (DMA)
(left intentionally blank)