Computer Architecture Examination Paper with Sample Solutions and Marking Scheme CS3 1991–92
Transcription
Computer Architecture Examination Paper with Sample Solutions and Marking Scheme CS3 1991–92
Nigel Topham (April 3, 1995)

Question 1

a. Describe what is meant by instruction pipelining, and explain how it can be used to improve CPU performance. [5]

b. Explain the causes of the following types of pipeline hazard and outline briefly how their effects on performance can be minimised.
(i) structural hazards
(ii) data hazards
(iii) control hazards [9]

c. A certain pipelined processor has the following characteristics:
• fully-pipelined integer ALU
• non-pipelined integer multiplier, with 3-cycle latency
• branches with a single delay slot
• loads with a single delay slot
Determine the average number of cycles per instruction (CPI) for the following code fragment, stating any assumptions you make. [4]

    L1: load  r1, -8(r15)   /* memory[r15-8] => r1    */
        sub   r4, r5, r1    /* r5 - r1 => r4          */
        mul   r1, r8, r9    /* (int) r8 * r9 => r1    */
        store r7, -4(r15)   /* r7 => memory[r15-4]    */
        bne   r4, r1, L2    /* if (r4 != r1) goto L2  */
        nop

Identify all data dependencies within this code fragment, specifying whether they are flow-dependencies, output-dependencies, or anti-dependencies. [3]

d. If the pipeline is enhanced to permit up to two instructions (of any type) to be issued in parallel, what is the new CPI value? (Explain your calculations.) [4]

Question 2

a. Describe what is meant by the following terms, and outline briefly how they can be exploited in high performance memory systems.
(i) program locality
(ii) temporal data locality
(iii) spatial data locality [9]

b. Two different implementations of the same RISC architecture have the following characteristics:

    Instruction Class   Class Frequency   Timing (cycles)
                                          Machine A    Machine B
    loads               20%               1 + tA       1 + tB
    stores              13%               1            1
    branches            24%               1            1
    ALU ops.            43%               1            1

The values tA and tB represent the effective memory access times of machines A and B respectively. Machine A has a small on-chip cache, whereas machine B has a large off-chip cache. The cache and memory system parameters are:

    Parameter          Machine A        Machine B
    cache hit time     1 cycle          2 cycles
    cache miss ratio   5%               0.5%
    block size (b)     16 bytes         64 bytes
    refill time        4 + b/4 cycles   4 + b/4 cycles
    copy back time     —                4 + b/4 cycles

Cache A uses a write-through policy, whereas cache B uses a write-back policy. For cache B, 40% of all misses are to “dirty” lines, and copying back cache lines to memory cannot be overlapped with other activities.
(i) What are the effective memory access times of machine A and machine B? [4]
(ii) What is the mean number of cycles per instruction (CPI) for each machine? [4]
(iii) If implementation A does not support pipelined memory writes, and the time for store operations rises to 4 cycles, which of the two machines is then fastest? [3]

c. It is suggested that cache A, which is a direct-mapped cache, might benefit from being 2-way or 4-way set-associative, since it is believed that a small cache suffers a high number of collision misses. Discuss the validity, or otherwise, of this suggestion. [5]

Question 3

a. Explain what is meant by the term quantitative design, in the context of computer architecture. [5]

b. What is Amdahl’s Law? [5]

c. A certain manufacturer decides to offer a high-performance version of its popular RISC architecture, aimed at the scientific market. It believes that a vector processing facility is the best way to achieve its performance and cost goals.
Instructions that are able to exploit the vector facility of the new machine execute in 1/10th of the cycles needed by the same instructions on the original (scalar) machine. However, due to increased logic delays, the clock frequency of the vector machine turns out to be only 75% of the clock frequency of the original. In addition, studies indicate that, on average, 5/6ths of all operations can exploit the vector facility. What is the mean relative performance of the vector machine compared with the original? [9]

d. Discuss the ways in which the memory system of the vector machine is likely to differ from that of the original (scalar) machine. [6]

Marking Scheme and Outline Solutions

Each question carries 25 marks. Students answer two out of the three questions. This marking scheme and set of outline solutions illustrates the breakdown of marks to each sub-question (or part thereof), and describes the type of answer required to gain the stated marks.

Question 1

a. Describe what is meant by instruction pipelining, and explain how it can be used to improve CPU performance. [5 marks]

This is a relatively easy bookwork question. Typically, answers should describe what an instruction pipeline is, broadly what its structure is, and explain how the average CPI value can be reduced towards an asymptotic value of 1.0 by pipelining.

b. Explain the causes of the following types of pipeline hazard and outline briefly how their effects on performance can be minimised.
(i) structural hazards [3 marks]
(ii) data hazards [3 marks]
(iii) control hazards [3 marks]

This is a slightly more difficult follow-on from the first part, but it should not pose problems for many students; for details see Appendix A.

c. A certain pipelined processor has the following characteristics:
• fully-pipelined integer ALU
• non-pipelined integer multiplier, with 3-cycle latency
• branches with a single delay slot
• loads with a single delay slot
Determine the average number of cycles per instruction (CPI) for the following code fragment, stating any assumptions you make. [4 marks]

    L1: load  r1, -8(r15)   /* memory[r15-8] => r1    */
        sub   r4, r5, r1    /* r5 - r1 => r4          */
        mul   r1, r8, r9    /* (int) r8 * r9 => r1    */
        store r7, -4(r15)   /* r7 => memory[r15-4]    */
        bne   r4, r1, L2    /* if (r4 != r1) goto L2  */
        nop

Identify all data dependencies within this code fragment, specifying whether they are flow-dependencies, output-dependencies, or anti-dependencies. [3 marks]

They ought to find 3 flow dependencies, 1 output dependency and 1 anti-dependency, as listed below.

    flow:   load -> sub  (r1)
    flow:   mul  -> bne  (r1)
    flow:   sub  -> bne  (r4)
    output: mul  -> load (r1)
    anti:   sub  -> mul  (r1)

The CPI part of the question requires students to work out a numerical answer. If they assume that the processor stops issuing instructions when a non-pipelined instruction is issued, the answer is CPI = 2.0; otherwise the answer is CPI = 1.6. Most students ought to be able to get one of these results; I’ll accept either one, but full marks are only obtained if they mention that they have to make an assumption about the behaviour of the non-pipelined multiply, and about the “perfect” nature of the memory (i.e., no extra load stalls).

d. If the pipeline is enhanced to permit up to two instructions (of any type) to be issued in parallel, what is the new CPI value? (Explain your calculations.) [4 marks]

In this part they have to work out all dependencies between instructions and schedule the code accordingly. Identifying the dependencies in the previous part will help them to generate the schedule. The best schedule takes 7 cycles, leading to a CPI of 1.4, and is shown below.

    INSTRUCTION 1        INSTRUCTION 2        CYCLES
    load  r1, -8(r15)    store r7, -4(r15)    1
    sub   r4, r5, r1     nop                  1
    mul   r1, r8, r9     nop                  1
    -stall-              nop                  1
    -stall-              nop                  1
    bne   r4, r1, L2                          1
    nop                                       1
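As a cross-check on the arithmetic (not part of the original marking scheme): the stated CPI values are consistent with counting only the five non-nop instructions, and on that assumption they imply 10 or 8 issue cycles for the single-issue cases and the 7-cycle schedule above for the dual-issue case. A minimal sketch, assuming that convention:

    # Hedged cross-check of the CPI arithmetic for Question 1(c)/(d).
    # Assumption (not stated explicitly in the paper): only the five non-nop
    # instructions (load, sub, mul, store, bne) are counted.
    INSTRUCTIONS = 5

    def cpi(cycles, instructions=INSTRUCTIONS):
        """Cycles per instruction = total cycles / instructions counted."""
        return cycles / instructions

    print(cpi(10))  # 2.0 -- single issue, if issue blocks while the multiplier is busy
    print(cpi(8))   # 1.6 -- single issue, if issue continues past the multiply
    print(cpi(7))   # 1.4 -- the 7-cycle dual-issue schedule shown above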
Question 2

a. Describe what is meant by the following terms, and outline briefly how they can be exploited in high performance memory systems.
(i) program locality [3 marks]
(ii) temporal data locality [3 marks]
(iii) spatial data locality [3 marks]

This part of the question is essentially basic knowledge, which all students ought to know. Further details are in Appendix B.

b. Two different implementations of the same RISC architecture have the following characteristics:

    Instruction Class   Class Frequency   Timing (cycles)
                                          Machine A    Machine B
    loads               20%               1 + tA       1 + tB
    stores              13%               1            1
    branches            24%               1            1
    ALU ops.            43%               1            1

The values tA and tB represent the effective memory access times of machines A and B respectively. Machine A has a small on-chip cache, whereas machine B has a large off-chip cache. The cache and memory system parameters are:

    Parameter          Machine A        Machine B
    cache hit time     1 cycle          2 cycles
    cache miss ratio   5%               0.5%
    block size (b)     16 bytes         64 bytes
    refill time        4 + b/4 cycles   4 + b/4 cycles
    copy back time     —                4 + b/4 cycles

Cache A uses a write-through policy, whereas cache B uses a write-back policy. For cache B, 40% of all misses are to “dirty” lines, and copying back cache lines to memory cannot be overlapped with other activities.
(i) What are the effective memory access times of machine A and machine B? [4 marks]
(ii) What is the mean number of cycles per instruction (CPI) for each machine? [4 marks]
(iii) If implementation A does not support pipelined memory writes, and the time for store operations rises to 4 cycles, which of the two machines is then fastest? [3 marks]

This is the “quantitative bit” of the question, where candidates have to apply their knowledge of memory hierarchy behaviour to compare the performance of two systems. The answers can be computed quite straightforwardly, if the candidate has a grasp of how cache memories work. They need to compute the effective memory latency of each cache system, and then use the latency figures to compute the effective CPI of a load instruction on each machine. These figures are then combined with the CPI values for the other instruction types in proportion to their execution frequency. The numbers have been chosen so that the arithmetic can be computed without resort to a calculator.
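For parts (i)–(iii), one plausible worked reading (not part of the original marking scheme) applies the usual formula, effective access time = hit time + miss ratio × miss penalty, charging cache B’s copy-back time on the 40% of misses that hit dirty lines. The sketch below, with hypothetical helper names refill and cpi, shows the resulting figures.

    # Hedged worked figures for Question 2(b); the exact cost model intended by
    # the paper is not stated, so treat these numbers as one plausible reading.

    def refill(block_bytes):
        # Refill (and copy-back) time from the parameter table: 4 + b/4 cycles.
        return 4 + block_bytes / 4

    # Machine A: 1-cycle hit, 5% miss ratio, 16-byte blocks, write-through.
    tA = 1 + 0.05 * refill(16)                          # 1.4 cycles
    # Machine B: 2-cycle hit, 0.5% miss ratio, 64-byte blocks, write-back;
    # 40% of misses also pay the copy-back time for a dirty line.
    tB = 2 + 0.005 * (refill(64) + 0.4 * refill(64))    # 2.14 cycles

    def cpi(t_mem, store_cycles=1):
        # Class frequencies: loads 20%, stores 13%, branches 24%, ALU ops 43%.
        return 0.20 * (1 + t_mem) + 0.13 * store_cycles + 0.24 + 0.43

    print(tA, tB)                   # (i)   1.4 and 2.14 cycles
    print(cpi(tA), cpi(tB))         # (ii)  1.28 and 1.428
    print(cpi(tA, store_cycles=4))  # (iii) 1.67, so B would then have the lower
                                    #       CPI (assuming equal clock rates)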
c. It is suggested that cache A, which is a direct-mapped cache, might benefit from being 2-way or 4-way set-associative, since it is believed that a small cache suffers a high number of collision misses. Discuss the validity, or otherwise, of this suggestion. [5 marks]

This part requires the candidates to discuss the pros and cons of using set-associative caches. I would expect them at least to mention that the hit time for set-associative caches is typically higher, but that even very small degrees of associativity lead to much better hit rates for small caches. I do not expect any answer to come down unequivocally on one side of the argument.

Question 3

a. Explain what is meant by the term quantitative design, in the context of computer architecture. [5 marks]

Again, this is bookwork. I’m looking for a definition of the term, and what it means for the design process. Some lecture notes on this subject are contained in Appendix C.

b. What is Amdahl’s Law? [5 marks]

Here I’ve given five marks to a simple question, but I am expecting a definition of the law in algebraic terms, since that is the most effective way to say “what is” for this particular concept. Again, there are some photocopied lecture notes covering this question in Appendix D.

c. A certain manufacturer decides to offer a high-performance version of its popular RISC architecture, aimed at the scientific market. It believes that a vector processing facility is the best way to achieve its performance and cost goals. Instructions that are able to exploit the vector facility of the new machine execute in 1/10th of the cycles needed by the same instructions on the original (scalar) machine. However, due to increased logic delays, the clock frequency of the vector machine turns out to be only 75% of the clock frequency of the original. In addition, studies indicate that, on average, 5/6ths of all operations can exploit the vector facility. What is the mean relative performance of the vector machine compared with the original? [9 marks]

For this question I’m looking for an answer along the following lines. First, derive (or state) that if a fraction v of all operations is vectorised, the relative performance (speedup) T in terms of cycle counts is

    T = [ v/R + (1 - v) ]^(-1)                                  (1)

where R is the vector:scalar computation rate. In addition, the clock period of the vector machine is 4/3 times that of the scalar machine, so the overall relative performance is

    T = [ (4/3) * (5/(6*10) + 1 - 5/6) ]^(-1)
      = [ (4/3) * (1/12 + 1/6) ]^(-1)
      = [ (4/3) * (1/4) ]^(-1)
      = 3                                                       (2)

i.e., the vector machine is 3 times faster than the scalar machine.

d. Discuss the ways in which the memory system of the vector machine is likely to differ from that of the original (scalar) machine. [6 marks]

These two memory systems will differ in many detailed ways, but the aim of this question is to get the candidates to discuss the broad differences in the memory requirements of each type of system. Some of these are listed below.

Bandwidth: the vector machine needs of the order of 3 words per cycle of memory bandwidth per pair of floating-point pipes (a floating-point bandwidth to memory bandwidth ratio of 2/3). In comparison, the scalar machine needs only 1 word per cycle.

Unit of access: the scalar machine will access the memory in units of a single word, though this may change to units of a single cache line in systems where all accesses are cached. In the vector machine the dominant accesses are vector loads and stores. These typically consist of VL words read from, or written to, consecutive locations (where VL is the vector length of the machine). However, the vector machine must also be capable of accessing vectors with non-unit strides, in order to perform efficiently on multi-dimensional structures. The memory therefore also needs to support scatter and gather operations. Implementing a wide memory interface will provide the vector bandwidth for unit-stride accesses, but will not give adequate performance for non-unit strides.

Capacity: a rather simple point, but important: the capacity of the vector machine’s memory will generally need to be significantly larger than that needed on a scalar RISC system.
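To make the unit-of-access point concrete, here is a small illustrative sketch (not from the original paper; the flat list named memory and the ROWS/COLS sizes are hypothetical stand-ins) of the three access patterns mentioned above: a unit-stride vector, a constant non-unit stride, and an indexed gather.

    # Illustrative sketch only: the address patterns a vector memory system
    # must support. A row of a row-major matrix is a unit-stride access, a
    # column is a constant non-unit stride, and an index-vector access is a
    # gather. A wide memory interface helps only the unit-stride case.
    ROWS, COLS = 4, 8
    memory = list(range(ROWS * COLS))        # flat row-major storage

    row    = [memory[2 * COLS + j] for j in range(COLS)]   # stride 1
    column = [memory[i * COLS + 3] for i in range(ROWS)]   # stride COLS
    gather = [memory[k] for k in (5, 17, 2, 30)]           # arbitrary index vector

    print(row)     # [16, 17, ..., 23]
    print(column)  # [3, 11, 19, 27]
    print(gather)  # [5, 17, 2, 30]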