Sample Problems: Branch Prediction and Speculation
1. Consider the following two designs for alleviating the effect of branches.

a. The first design defines a branch with two delay slots and does not use branch prediction. Instead, it relies on compile-time scheduling to fill the delay slots with useful instructions where possible. Suppose that for 30% of the branch instructions the compiler can fill both delay slots, and for 60% of the branch instructions the compiler can fill only one delay slot.

b. The second design employs branch prediction and does not use delay slots. The branch always costs one cycle, and a misprediction costs an additional three cycles.

What prediction accuracy is required in the second design to achieve the same performance as the first design?

From a. we know that 10% of the branch instructions result in two pipeline bubbles, while 60% result in a one-cycle bubble. We can compute the increase in CPI for each design. Both expressions carry the same factor prob_instr_is_a_branch, so it cancels when the two are equated.

Increase in CPI due to a. = prob_instr_is_a_branch * (0.6 * 1 + 0.1 * 2) = prob_instr_is_a_branch * 0.8
Increase in CPI due to b. = prob_instr_is_a_branch * (p * 3) = prob_instr_is_a_branch * 3p

where p is the probability of misprediction. Equating the two gives 3p = 0.8, so the critical value is p = 0.8 / 3 ≈ 0.267. The required prediction accuracy is (1 - p) ≈ 0.733, or about 73%.

2. Consider the following code sequence. Assume that each instruction is encoded in one 32-bit word. We have a k-entry branch prediction buffer.

Address       Label   Instruction
0x40000000            DSUBUI  R3, R1, #2
0x40000004            BNEZ    R3, L1
0x40000008            DADD    R1, R0, R0
0x4000000c    L1:     DSUBUI  R3, R3, #2
0x40000010            BNEZ    R3, L2
0x40000014            DADD    R2, R0, R0
0x40000018    L2:     DSUBU   R3, R1, R2
0x4000001c            BEQZ    R3, L3
..
0x80008004    L3:     …

a. What is the minimum value of k to maximize prediction accuracy and why?
In general it is 32. In this example, we need 5 bits to address the prediction buffer without aliasing the branch addresses, since the addresses of the branch instructions differ in the least significant 5 bits. (In this example even 4 bits will suffice. Can you tell why?)

Alternatively, consider the use of a branch target buffer (Figure 3.19). Show the possible contents of the following 4-element branch target buffer after one execution of the prior code (assuming it is initially empty).

Branch address   Target address
0x40000004       0x4000000c
0x40000010       0x40000018
0x4000001c       0x80008004
(unused)         (unused)

3. Consider the use of a branch prediction buffer using n-bit saturating counters for the code sequence shown below. The memory addresses for the instructions are shown in hexadecimal notation. Assume the following loop code has been executed 12 times. The branch at location 0x0044 has been taken 50% of the time and the branch at location 0x0050 has been taken 50% of the time. Consider the point in time of the start of the execution of the 13th iteration.

Address   Label   Instruction
0x0038            ..
0x003C            ..
0x0040    L3:     DSUBUI  R3, R1, #2
0x0044            BNEZ    R3, L1
0x0048            DADD    R1, R0, R0
0x004C    L1:     DSUBUI  R3, R2, #2
0x0050            BNEZ    R3, L2
0x0054            DADD    R2, R0, R0
0x0058    L2:     DSUBUI  R3, R1, R2
0x005C            BEQZ    R3, L3

a. Considering only the preceding code, how many entries should the branch prediction buffer have to avoid the possibility of aliasing of branch addresses?

The minimum number of least significant bits that ensures no aliasing for these addresses is 4, hence the branch prediction buffer would need 2^4 = 16 entries.

b. If all prediction buffer entries were initialized to 0, what can be the value of the counters in the prediction buffer corresponding to these two branch instructions?

Address   Instruction     Counter value
0x0044    BNEZ R3, L1     0 ≤ value ≤ 6
0x0050    BNEZ R3, L2     0 ≤ value ≤ 6

c. Now consider the case where we use a global branch predictor with a 3-bit global history. Execution of the 13th iteration is about to start.
Provide an example of i) the value of a feasible 3-bit global branch history, and ii) the value of an infeasible global branch history. Ensure you clearly identify the entries in the branch history with the branch instructions in the code sequence.

The first two branches compare two numbers, N1 (in R1) and N2 (in R2), with the number 2. Each is taken when its number is not equal to 2; when it is not taken, the following DADD clears that number to 0. The last branch is taken if N1 = N2. If neither of the first two branches is taken (N1 = 2 and N2 = 2), both numbers are cleared to 0, so N1 = N2 and the last branch must be taken. Hence an infeasible history is 000. A feasible global history is 111 (N1 ≠ 2, N2 ≠ 2, and N1 = N2). In both cases the first branch in the program corresponds to the most significant bit.

4. Consider the case where the execution pipeline has a single-cycle branch delay slot. Static scheduling can fill 30% of the delay slots. We can fill 60% of the remaining slots if we use cancelling branches: instructions are cancelled if the branch is mispredicted. These slots are filled with instructions assuming the branch is not taken. If 18% of all instructions are branches and they are taken 62% of the time, what is the net increase in CPI?

Number of stall cycles per instruction = 0.18 * 0.7 * (0.4 + 0.6 * 0.62) ≈ 0.097

70% of the branch delay slots cannot be filled by static scheduling. Of these, 40% are left empty and each contributes a cycle. The remaining 60% are filled using cancelling branches, and each contributes a cycle only when the instruction is cancelled, which happens when the branch is taken (62% of the time).

5. Consider the 5-stage integer pipeline with forwarding. Assume the branch penalty is 1 cycle (branch condition computed in ID). Now assume we have pipelined the memory system to three stages (rather than 1 stage) for both instruction fetch and data fetch. Branches are resolved at the end of the EX stage. We use a static branch-not-taken prediction strategy, i.e., if branches are taken we incur the branch penalty. Assume conditional branches occur with a frequency of 14%.

a.
If branches are taken 62% of the time, what is the increase in CPI due to this prediction strategy?

The branch penalty is 3 cycles and is incurred only when the branch is taken.
Increase in CPI = 0.62 * 0.14 * 3 = 0.2604

b. Alternatively, if we modify the pipeline and implement a delayed branch with a single delay slot, and we are able to successfully fill 65% of the slots, what is the increase in CPI?

35% of the time we are unable to fill these slots, with a penalty of 1 cycle. Hence the increase in CPI is 0.35 * 0.14 = 0.049.

c. Now consider the occurrence of load delay slots, where loads occur with a probability of 24%, and 40% of these fetch data used by the immediately following instruction. If we perform no instruction scheduling to fill delay slots, what is the increase in CPI compared to the original pipeline (i.e., without pipelining the memory system)?

The load stalls are now three cycles rather than 1. With no instruction scheduling, the load stalls contribute 0.24 * 0.4 * 3 = 0.288 cycles per instruction.

6. Consider the dynamically scheduled execution of the following code sequence where a ROB is used. Assume register F10 is initialized with the value 1.0 and memory locations 0(R1) and 0(R2) are initialized with 6 and 7 respectively. All other registers are initialized to 0. Consider the first iteration through the loop.

LOOP:  1.  L.D     F2, 0(R1)
       2.  L.D     F4, 0(R2)
       3.  MUL.D   F6, F2, F4
       4.  SUB.D   F4, F6, F10
       5.  DIV.D   F6, F12, F2
       6.  S.D     F6, 0(R1)
       7.  ADDD    F8, F8, F4
       8.  DADDIU  R1, R1, #-8
       9.  DADDIU  R2, R2, #-8
       10. BNE     R1, R4, LOOP

a. Show a valid state of a 4-entry ROB when instruction 7 is issued. Identify the head and tail of the ROB.

        Destination   Value      Status
        F6            NO VALUE   PENDING
        0(R1)         NO VALUE   PENDING
TAIL    F8            NO VALUE   PENDING
HEAD    F4            41         COMPLETED

b. Register re-mapping is employed, where architecture registers are remapped to physical registers (PR). F6 in instruction 3 is remapped on issue to PR 9. When the DIV instruction reaches the head of the ROB, can PR 9 be freed?
Justify your answer.

Yes. When the DIV.D is at the head of the ROB, all instructions prior to the DIV.D have committed and all instructions that used the mapped register for F6 have completed. Therefore, PR 9 can be freed.

7. Consider the following code sequence for a dynamically scheduled machine. Assume register F10 is initialized with the value 1.0 and memory locations 0(R1) and 0(R2) are initialized with 6 and 7 respectively. All other FP registers are initialized to 0. Consider the first iteration through the loop.

1.  L.D     F2, 0(R1)
2.  L.D     F4, 0(R2)
3.  DIV.D   F6, F12, F2
4.  MUL.D   F6, F2, F4
5.  SUB.D   F4, F6, F10
6.  S.D     F6, 0(R1)

a. Assume an exception occurs on instruction 3 and instructions 4 and 5 have completed execution. What are the contents of F4 and F6 in the following cases?

                                            F4    F6
Using a History Buffer only:                41    42
Using a ROB only:                           7     0
Using a Future Register File with an ROB:   7     0

b. What is a precise exception?

An exception is precise if its occurrence is as if it occurred after instruction i and before instruction i+1. All instructions up to and including i have completed execution, and all instructions following i (that is, i+1 and later) have not executed.

8. Consider the dynamically scheduled execution of the following code sequence. The first time through the loop an exception occurs on the DIV instruction. Distinguish between how a precise exception will be handled using a ROB and a history buffer. How does register renaming affect or not affect the handling of exceptions?

LOOP:  1.  L.D     F2, 0(R1)
       2.  L.D     F4, 0(R2)
       3.  MUL.D   F6, F2, F4
       4.  SUB.D   F4, F6, F10
       5.  DIV.D   F6, F12, F2
       6.  S.D     F6, 0(R1)
       7.  ADDD    F8, F8, F4
       8.  DADDIU  R1, R1, #-8
       9.  DADDIU  R2, R2, #-8
       10. BNE     R1, R4, LOOP

With an ROB: Exceptions for an executing instruction are flagged in its ROB entry, but not raised.
The processor raises an exception associated with an instruction only when that instruction reaches the head of the ROB. Since instructions are allotted ROB entries in program order, they are committed in program order. Instructions fetched speculatively on a mispredicted branch are never committed. Therefore, all exceptions are precise.

With an ROB, register renaming does not affect the handling of precise exceptions. This is because register renaming does not affect how the instructions commit (always in program order), and exceptions can be raised only at commit time.

With a history buffer: Instructions are allocated history buffer entries that hold the old value (history) of the register being written. If an exception occurs, the corresponding history buffer entry is marked. When the excepting instruction reaches the head of the history buffer, the buffer is scanned from the tail back to the head and each old value it holds is restored to the register file. This is needed because instructions write directly to the register file.

9. We have a machine capable of retiring up to 4 instructions per cycle from the ROB. Explain the conditions under which more than one instruction can indeed be retired in a single cycle.

Multiple consecutive instructions starting at the current head of the ROB must have completed execution and writeback of results (to their respective entries in the ROB). All these instructions can be committed in a single cycle. However, structural limitations (such as the number of write ports to the register file and the bandwidth to memory for committing stores) put hard limits on how many of the instructions at the head of the ROB may commit simultaneously.
Note that if multiple consecutive instructions write to the same destination register or memory location, the commit hardware can still commit them in the same cycle by ensuring that the value written to that register or memory location comes from the last committed instruction that wrote it. Finally, all instructions are logically committed in program order.
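
The commit conditions in the answer to Problem 9 can be sketched as a small Python model. This is a minimal sketch under stated assumptions, not a description of any real machine: the RobEntry fields, the commit_cycle function, and the reg_ports and store_ports limits are hypothetical names introduced here for illustration, the values in the example are made up, and the model charges one port per committed instruction even when two instructions write the same destination.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    dest: str               # destination register or memory location (hypothetical field)
    value: object           # result, or None while still pending
    done: bool              # True once execution and writeback to the ROB have finished
    is_store: bool = False  # stores compete for memory bandwidth instead of register ports


def commit_cycle(rob, state, width=4, reg_ports=4, store_ports=1):
    """Retire up to `width` completed entries from the head of `rob` in one cycle.

    `state` is a dict standing in for the architectural registers and memory.
    Returns the number of instructions retired this cycle.
    """
    retired = 0
    regs_used = 0
    stores_used = 0
    pending = {}  # last value per destination among this cycle's commits
    while rob and retired < width:
        head = rob[0]
        if not head.done:
            break  # in-order commit: an unfinished head blocks everything behind it
        if head.is_store:
            if stores_used == store_ports:
                break  # structural limit: no store bandwidth left this cycle
            stores_used += 1
        else:
            if regs_used == reg_ports:
                break  # structural limit: no register write port left this cycle
            regs_used += 1
        # A later writer to the same destination overwrites the earlier one,
        # so only the value of the last committed writer reaches `state`.
        pending[head.dest] = head.value
        rob.popleft()
        retired += 1
    state.update(pending)
    return retired


# Example with made-up values: two writers of F4, one store, then a pending entry.
demo_rob = deque([
    RobEntry("F4", 41.0, True),
    RobEntry("F4", 5.0, True),
    RobEntry("0(R1)", 3.0, True, is_store=True),
    RobEntry("F6", None, False),   # still executing, so it blocks the entry behind it
    RobEntry("F8", 1.0, True),
])
demo_state = {}
demo_retired = commit_cycle(demo_rob, demo_state)
print(demo_retired, demo_state)
```

In this example three instructions retire in the first cycle, F4 ends up holding the value of the second writer, and the completed F8 entry has to wait behind the pending F6 entry, which is the in-order constraint described above.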