ENCM 369 Winter 2015 Lab 11 for the Week of April 6

Steve Norman
Department of Electrical & Computer Engineering
University of Calgary
April 2015

Lab instructions and other documents for ENCM 369 can be found at http://people.ucalgary.ca/~norman/encm369winter2015/

1 This Lab is important, but will not be marked

If the usual pattern for ENCM 369 labs were followed, this lab would be due Friday, April 10. That would leave very little time for your TAs to get the lab marked before the last lab periods on April 13. So this lab will not be marked, and solutions will be posted sometime during the week of April 6. Please make a serious effort to solve the Exercises yourself before checking solutions!

2 Exercise A: Virtual and physical addresses of array elements

2.1 Read This First

To understand how virtual memory works, it’s useful to think about the common operation of sequentially accessing all of the elements in an array. For example, consider this simple C function:

    int sum_array(const int *arr, int n)
    {
        int sum = 0, i;
        for (i = 0; i < n; i++)
            sum += arr[i];
        return sum;
    }

arr should point to element 0 of an array, and n should indicate the number of elements belonging to the array. In a system with virtual memory, the pointer argument arr will hold a virtual address.

Let’s suppose sum_array is called with arr pointing to element 0 of an array of 1300 elements, each of which is a 4-byte int, and our virtual memory system uses 32-bit virtual addresses, 32-bit physical addresses, and 4 KB pages. Further, let’s suppose the virtual address of element 0 of the array is 0x1001_0320. If that’s true, the layout of the array in virtual memory must look like the left side of Figure 1.

Figure 1: An array of 1300 4-byte ints. Addresses must be sequential in virtual address space, but in physical address space, chunks of the array can be within pages that are far away from each other.

    Array chunk             Virtual page (address range)    Physical page (address range)
    elements 0 to 823       0x10010000 to 0x10011000,       0x33714000 to 0x33715000
                            starting at 0x10010320
    elements 824 to 1299    0x10011000 to 0x10012000        0x25588000 to 0x25589000

However, there are many, many possible arrangements of the array in physical address space; the right side of Figure 1 shows just one possible arrangement.

If the arguments arr and n and local variables sum and i are all in registers, the only data memory accesses the function will make will be reads of array elements, using virtual addresses 0x1001_0320, 0x1001_0324, and so on, in sequence. Because parts of the array are in two different pages, two different VPN-to-PPN translations will be used by the load instructions that read the array element values. The first translation is from VPN 0x10010 to PPN 0x33714 and the second translation is from VPN 0x10011 to PPN 0x25588.

It’s possible that both translations are in the D-TLB when the loop starts, in which case there will be no D-TLB misses at all while the loop runs. At most there will be two D-TLB misses: one on the access to arr[0], and another on the access to arr[824]. (Remember, one of the effects of a TLB miss is to copy a translation from a page table to the TLB.)

2.2 What to do, Part I

Consider the scenario outlined in “Read This First”, but with the address of arr[0] changed to 0x1001_0fb0.

• Make a diagram similar to Figure 1, showing which array elements sit in which pages. Be as detailed as you can. There is essentially only one correct answer for the layout in virtual memory. But there’s a huge number of possible correct answers for the organization in physical memory; come up with just one, making sure that it is consistent with the given information.
• What is the largest possible number of D-TLB misses as the loop runs? Which array element accesses could possibly cause D-TLB misses?
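The page-boundary reasoning in “Read This First” comes down to a little address arithmetic. The following is a minimal sketch of that arithmetic in C, assuming 32-bit virtual addresses and 4 KB pages as in the text; the function names (vpn_of, pages_spanned, first_index_in_next_page) are hypothetical helpers invented for illustration and are not part of the lab materials.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u   /* 4 KB pages: the low 12 bits are the page offset */

/* Virtual page number: bits 31-12 of a 32-bit virtual address. */
uint32_t vpn_of(uint32_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Number of pages partly or completely occupied by n elements (n >= 1)
 * of elem_size bytes each, starting at virtual address base. */
uint32_t pages_spanned(uint32_t base, uint32_t n, uint32_t elem_size)
{
    uint32_t last_byte = base + n * elem_size - 1u;
    return vpn_of(last_byte) - vpn_of(base) + 1u;
}

/* Index of the first element located in the page that follows the page
 * holding element 0. Assumes base is a multiple of elem_size. */
uint32_t first_index_in_next_page(uint32_t base, uint32_t elem_size)
{
    uint32_t next_page_base = (vpn_of(base) + 1u) << PAGE_SHIFT;
    return (next_page_base - base) / elem_size;
}
```

For the array of Figure 1, pages_spanned(0x10010320, 1300, 4) gives 2 and first_index_in_next_page(0x10010320, 4) gives 824, matching the split shown in the figure.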
2.3 What to do, Part II

Consider the same loop, but with a much larger array. Suppose that the address of arr[0] is 0x1001_2f00, and that the value of n is 50,000.

• Determine the total number of pages partly or completely occupied by elements of the array.
• Suppose the D-TLB has a capacity of 32 VPN-to-PPN translations. Find the smallest and largest possible numbers of D-TLB misses as the loop runs from beginning to end.

3 Exercise B: Integration of VM and caches

3.1 Read This First

This problem asks you to trace some instruction fetches in a computer system that has both virtual memory and caches.

The computer runs the MIPS instruction set, so all instructions are 32 bits in size. Both virtual and physical addresses are 32 bits wide. The page size is 4 KB.

There are two TLBs: one for instruction address translations, and one for data address translations. The instruction TLB has room for 4 translations.

There are separate instruction and data caches. The instruction cache is direct-mapped with 4-word blocks, and has a capacity of 256 bytes.

(The instruction TLB and instruction cache are both unrealistically small in order to make this a viable pencil-and-paper exercise.)

3.2 What to Do

Suppose a process attempts to fetch three instructions using the following virtual addresses, in the order given below:

    0x0040_0ff8
    0x0040_0ffc
    0x0040_1010

Suppose that none of the instructions are loads or stores.
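The cache geometry given in 3.1 (256 bytes, direct-mapped, 4-word blocks) fixes how an address splits into tag, set, and block-offset fields: 16 blocks of 16 bytes each, so 4 set bits and 4 offset bits. Here is a minimal sketch of that split in C. It assumes the cache is indexed and tagged with physical addresses, which is an assumption on my part and not something the exercise states outright; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of an address as seen by a 256-byte direct-mapped cache with
 * 16-byte (4-word) blocks: 16 sets, so 4 set bits. */
struct icache_fields {
    uint32_t tag;    /* bits 31-8 */
    uint32_t set;    /* bits 7-4  */
    uint32_t word;   /* bits 3-2: which word within the block */
};

struct icache_fields split_icache_address(uint32_t paddr)
{
    struct icache_fields f;
    f.word = (paddr >> 2) & 0x3u;
    f.set  = (paddr >> 4) & 0xFu;
    f.tag  = paddr >> 8;
    return f;
}

/* With 4 KB pages, a physical address is the 20-bit PPN followed by the
 * unchanged 12-bit page offset of the virtual address. */
uint32_t physical_address(uint32_t vaddr, uint32_t ppn)
{
    return (ppn << 12) | (vaddr & 0xFFFu);
}
```

As a made-up example unrelated to the exercise data, the address 0x12345678 splits into tag 0x123456, set 0x7, and word 2.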
The page table for the process contains the following information:

    virtual page number    valid bit    physical page number
    0x00400                1            0x900c4
    0x00401                1            0x900ce
    0x00402                1            0x91076
    0x10001                1            0x91023
    0x7ffff                1            0x90fab

When the sequence of instruction fetches starts, the following information is in the instruction TLB:

    virtual page number    valid bit    physical page number
    0x00400                0            0x91111
    0x00400                1            0x900c4
    0x00402                1            0x91076
    0x80000                0            0x80000

When the sequence of instruction fetches starts, the following information is in the instruction cache (along with 256 bytes of instructions):

    set bits    valid bit    tag
    0x0         1            0x900c4f
    0x1         1            0x900ce0
    0x2         1            0x900ce0
    0x3         1            0x900c4f
    0x4         1            0x900c4f
    0x5         1            0x900c4f
    0x6         0            0x000000
    0x7         0            0x000000
    0x8         1            0x800001
    0x9         1            0x800001
    0xa         1            0x800001
    0xb         1            0x800001
    0xc         1            0x900c4f
    0xd         1            0x900c4f
    0xe         1            0x900c4f
    0xf         1            0x910764

For each of the three instruction fetches, answer the following questions:

• Is it a TLB hit or a TLB miss? Or is it not possible to tell with the given information?
• If it is a TLB miss, does the miss cause a page fault?
• Is it a cache hit or a cache miss? Or is it not possible to tell with the given information?

Also, answer this:

• Why is it helpful to know that none of the instructions being fetched are loads or stores?

4 Exercise C: Why TLB miss handling must be fast

4.1 Read This First

In programs with good spatial locality of reference, TLB misses will be relatively rare, and handling these misses will not add significantly to running time of the program. However, certain important algorithms necessarily have bad spatial locality. An example is binary search, which can very quickly find a number in a very large sorted array of numbers. (You do not have to know what binary search is or how it works to do this exercise.)

This exercise is designed to help you understand the importance of speed in a TLB miss handler in a case where there are frequent TLB misses.
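Whether a sequence of data accesses causes many TLB misses depends on how many different pages it touches. The sketch below counts misses for a list of virtual addresses under two simplifying assumptions that are mine, not the lab’s: the TLB starts out with none of the needed translations, and it is large enough that no translation is evicted during the sequence. The function name count_tlb_misses is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Count TLB misses for a sequence of n virtual addresses, assuming 4 KB
 * pages, an initially empty TLB, and no evictions. Under those
 * assumptions an access misses exactly when its VPN (bits 31-12) has
 * not appeared earlier in the sequence. */
size_t count_tlb_misses(const uint32_t *vaddrs, size_t n)
{
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t vpn = vaddrs[i] >> 12;
        int seen_before = 0;
        for (size_t j = 0; j < i; j++) {
            if ((vaddrs[j] >> 12) == vpn) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before)
            misses++;
    }
    return misses;
}
```

The quadratic scan is deliberate: for pencil-and-paper sized sequences it is simpler than modelling a real TLB, and it mirrors how one would check the list by hand.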
4.2 What to do

Here is MIPS assembly language for a key loop within a procedure that performs a version of binary search:

    L1: srl   $t1, $t0, 1
        and   $t9, $t1, $t8
        lw    $t3, ($t9)
        slt   $t4, $a0, $t3
        movn  $a2, $t9, $t4
        movz  $a1, $t9, $t4
        addiu $t5, $a1, 4
        sltu  $t6, $t5, $a2
        bne   $t6, $zero, L1
        addu  $t0, $a1, $a2

Note that it is coded for a true MIPS system with branch delays, so when the branch is taken, the sequence of instructions is bne, then addu, and then srl.

Question 1. Assume that the code is run on a MIPS processor that uses a simple pipeline that will normally start one instruction per clock cycle. In the given loop, if there are no cache misses, one-cycle stalls will be needed only so that the slt instruction can use the lw result, and the bne instruction can use the sltu result. If there are no cache misses or TLB misses, how many clock cycles will it take to run through the loop 20 times?

Question 2. To search a particular array of 1 million ints, the loop does in fact run 20 times, and the sequence of virtual data addresses used by the lw instruction is:

    0x10208480, 0x102fc6c0, 0x102825a0, 0x10245510, 0x10226cc8,
    0x102360ec, 0x1023dafc, 0x10239df4, 0x10237f70, 0x1023702c,
    0x1023688c, 0x10236c5c, 0x10236e44, 0x10236f38, 0x10236fb0,
    0x10236fec, 0x10236fcc, 0x10236fbc, 0x10236fc4, 0x10236fc8.

The VPN for this system is bits 31–12 of a virtual address. Assume that when the loop runs, none of the needed VPN-to-PPN translations for data are in the data TLB, but all of the needed data pages are in physical memory. Assume also (very unrealistically) that there are no D-cache misses as a result of any of the loads. Finally, assume there are no misses in the instruction TLB or instruction cache.

• If it takes 10 clock cycles to handle a miss in the data TLB, how many clock cycles in total does TLB miss handling add to the answer to Question 1?
  (This will require some analysis of the sequence of addresses used by the load instruction.)
• Revise your answer, assuming now that it takes 100 cycles to handle a miss in the data TLB.
• Would you say it doesn’t matter much whether the TLB miss handler takes 10 or 100 cycles, or would you say that it has a significant impact on the running time of the loop?

4.3 Extra note

This problem asks you to think about TLB misses in code that makes data memory accesses with poor spatial locality. However, it is important to know that in most real-world cases of bad spatial locality, cache misses will tend to cause more performance degradation than TLB misses.

5 Exercise D: Page table organization

5.1 Read This First

This exercise is designed to help you understand what a page table is, and what some of the design considerations are in choosing how to organize a page table.

The details about page table entries (PTEs) and page table organization are a blend of details taken from Section 8.4 of your textbook, address space and page table management in 32-bit x86 Linux, and 32-bit MIPS TLB miss handling. So, put together, the details don’t match any real computer system, past or current, but are realistic enough to give some insight about how virtual memory works on real computers.

We are considering a system with 32-bit virtual addresses, 32-bit physical addresses, and 4 KB pages, so, as seen in textbook and lecture examples, VPNs (virtual page numbers) and PPNs (physical page numbers) will both be 20 bits wide. Our system will use one 32-bit memory word for each PTE:

    bits 31–24    bit 23    bit 22    bit 21    bit 20    bits 19–0
    (unused)      Alloc     Valid     Dirty     Ref       20-bit PPN field

The Valid, Ref, and Dirty bits play exactly the same roles as the V-bit described in textbook section 8.4.2, and the U-bit and D-bit described in textbook section 8.4.5.

The purpose of the “Alloc” bit is to indicate whether or not a virtual page exists at all for a given VPN.
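The PTE format above can be made concrete with a few masks. This is a minimal sketch in C, not code from the lab: the text confirms that Alloc is bit 23 and that the PPN occupies bits 19–0, but placing Valid at bit 22 is an assumption read off the flattened layout diagram. All names (PTE_ALLOC, make_pte, and so on) are hypothetical.

```c
#include <stdint.h>

#define PTE_ALLOC    (1u << 23)    /* stated in the text: bit 23 is Alloc */
#define PTE_VALID    (1u << 22)    /* ASSUMPTION: Valid sits at bit 22 */
#define PTE_PPN_MASK 0x000FFFFFu   /* bits 19-0: 20-bit PPN field */

/* Nonzero if a virtual page exists at all for this PTE's VPN. */
int pte_alloc(uint32_t pte) { return (pte & PTE_ALLOC) != 0; }

/* Nonzero if the page is marked as present in physical memory. */
int pte_valid(uint32_t pte) { return (pte & PTE_VALID) != 0; }

/* Physical page number stored in the PTE. */
uint32_t pte_ppn(uint32_t pte) { return pte & PTE_PPN_MASK; }

/* Build a PTE from a 12-bit status field (PTE bits 31-20) and a PPN. */
uint32_t make_pte(uint32_t status12, uint32_t ppn)
{
    return ((status12 & 0xFFFu) << 20) | (ppn & PTE_PPN_MASK);
}
```

For example, with the made-up values make_pte(0x00c, 0x12345), the result is 0x00c12345, which has Alloc=1, Valid=1, and PPN 0x12345 under the assumed bit positions.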
If a VPN is used to look up a PTE, and that PTE has Alloc=1 and Valid=1, that means the corresponding virtual page is in physical memory with the PPN given within the PTE. If that PTE has Alloc=1 and Valid=0, the corresponding virtual page is on disk. But if the PTE has Alloc=0, that means there is no virtual page for the given VPN.

Bits 31–24 in our PTE format could be used for more status bits, such as write-access (does the process have permission to write to the page, or only to read?) or execute-permission (which would indicate whether a process is allowed to fetch instructions from the page).

The simplest way to organize a page table is to simply make it a “flat” table, in other words, just a big array of PTEs, with one PTE for each possible VPN. The VPN would be used simply as an array index to find a PTE. Let’s assume that the range of possible VPNs for a user process runs from 0x00000 to 0xbffff, which happens to be the range of allowable VPNs for user processes in current 32-bit x86 Linux systems.

Figure 2 shows an example of such a flat page table for a process with three pages of instructions starting at virtual address 0x0040_0000, two pages of static data starting at virtual address 0x1001_0000, and three pages of stack starting at virtual address 0xbfff_d000. Notice that there are lots of PTEs filled with 0 bits; in each of these PTEs, bit 23, the Alloc bit, is zero. So most of the page table is filled with information that indicates the nonexistence of many, many virtual pages!

Let’s suppose our computer uses the MIPS instruction set. Then in assembly language the TLB miss handler would look something like the code shown in Figure 3. The “quick” decision made in the second group of “special memory management instructions” is:

• If Alloc=1 and Valid=1, quickly update a TLB with information from $k0 and $k1, and restart a user process on whatever instruction caused the TLB miss.
• If Alloc=1 but Valid=0, jump to kernel code to handle a page fault (in other words, get the kernel started on the disk operation needed to get the desired virtual page into physical memory).
• If Alloc=0, jump to kernel code to deal with an attempted illegal memory access by a user process.

Note that the MIPS register use conventions prohibit user processes from using GPRs $k0 and $k1. This rule is in place to help optimize TLB miss handlers for speed: a miss handler can use those two registers without first saving them to memory to preserve data belonging to a user process.

Figure 2: Flat page table. PTEs are 32-bit words; in the diagram each nonzero PTE is shown split into a 12-bit status-bit field and a 20-bit PPN field.

    Index      Status    PPN
    0xbffff    0x00e     0x81f23
    0xbfffe    0x00f     0x99012
    0xbfffd    0x00f     0x87005
    0xbfffc    0x0
    ...        ...
    0x10012    0x0
    0x10011    0x008     0x44552
    0x10010    0x00f     0x95232
    0x1000f    0x0
    ...        ...
    0x00403    0x0
    0x00402    0x00d     0x86aba
    0x00401    0x00d     0x9a9a0
    0x00400    0x00d     0x92777
    0x003ff    0x0
    ...        ...
    0x00000    0x0

Figure 3: Sketch of TLB miss handler code for the page table organization of Figure 2.

    [A few special memory management instructions to copy the VPN into $k0
     and the base address of the page table into $k1.]

        sll   $k0, $k0, 2      # $k0 = VPN * 4
        addu  $k1, $k1, $k0    # $k1 = address of PTE
        lw    $k1, ($k1)       # $k1 = PTE
        srl   $k0, $k0, 2      # restore VPN in $k0

    [A few more special memory management instructions to inspect the page
     status bits within the PTE in $k1 and make a quick decision.]

Here is an example of how one of the all-zero PTEs in Figure 2 could be used. Suppose our computer runs a Linux-like operating system. A novice programmer writes a program for it in MIPS assembly language. When the program runs as a user process, suppose it happens to be given virtual memory exactly corresponding to the page table shown in Figure 2. Suppose the program contains the instruction lw $t1, ($t0).
Unfortunately the program is defective, and when the load instruction is reached, the address in $t0 is 0x1001_2468, which isn’t in the set of virtual addresses the program is allowed to use. What happens, in detail, is as follows:

• The VPN of 0x10012 is generated from 0x1001_2468.
• There is a miss in the data TLB. (Make sure you understand why there cannot possibly be a hit!)
• The TLB miss handler software uses the VPN as an index in the page table, which produces an all-zero PTE in $k1.
• Because the Alloc bit in the PTE is zero, the TLB miss handler decides that the process has tried to make an illegal memory access, and shuts down the process.

Our novice programmer is left to figure out why the program died with a “segmentation fault” error message.

5.2 What to Do, Part 1

Review the “Read This First” information, then answer the following questions:

• For the process whose page table is shown in Figure 2, is the data word at virtual address 0x1001_1400 on disk or in physical memory? If it is in physical memory, what is the physical address for the word?
• Repeat the previous question, but use 0xbfff_dffc as the virtual address.
• What is the combined size, in KB, of all the instruction, static data, and stack pages in use by the process of Figure 2?
• What is the size, in KB, of the page table shown in Figure 2?

5.3 Read This Second

I hope you concluded from the last two answers you found in What to Do, Part 1, that the page table was unreasonably huge compared to how much memory was needed for the actual instructions and data of the process. This example illustrates the fact that a flat page table is not a practical solution in the case of 32-bit addresses and 4 KB pages.

In Linux on x86 systems, addresses really are 32 bits wide and pages really are 4 KB in size. Page table organization is close to what is shown in Figure 4.
(I am omitting some complicated details, so I can’t say the organization is exactly as shown in the figure.) Instead of one huge page table for each process, there are a number of medium-size page tables. Part of the VPN is used to find a pointer within an array of pointers; this array of pointers is called a page directory. If a valid (non-null) pointer is found in the page directory, that pointer is assumed to point at the base of the page table where the needed PTE is located. Details about which bits of the VPN are used for what purposes are given in the caption to Figure 4.

5.4 What to Do, Part 2

Answer the following questions:

• For the process with the page table information shown in Figure 4, describe how a TLB miss handler would determine that 0x10012 is not a legal VPN.
• For the process whose page table is shown in Figure 4, is the instruction word at virtual address 0x0040_2820 on disk or in physical memory? If it is in physical memory, what is the physical address for the word?
• Repeat the previous question, but consider the data word at virtual address 0x1001_1454.
• What is the combined size, in KB, of all the instruction, static data, and stack pages in use by the process of Figure 4?
• What is the combined size, in KB, of the page directory and the page tables shown in Figure 4? Does this seem reasonable compared to the size of the page table in Figure 2?
• Consider the TLB miss assembly language code of Figure 3. Replace the sll, addu, lw, and srl instructions with a sequence suitable for searching the page table organization of Figure 4. Pretend that in addition to $k0 and $k1, one more GPR called $k2 is available to hold intermediate results. (Hint: Your sequence should be about 8–10 instructions in length, and should include two lw instructions.)

5.5 More notes about page table organization

• Real 32-bit MIPS systems use an efficient but somewhat hard-to-explain system for page tables that is not a two-level organization.
• 32-bit x86 systems have a two-level page table organization much like what is presented in this exercise, but routine TLB miss handling is not done by kernel software; instead, special hardware is dedicated to fast searches for PTEs within page tables. (However, if there is a miss in the page table after a miss in a TLB, then kernel software will have to manage the problem.)

Figure 4: Two-level page table structure. PTEs have exactly the same format as in Figure 2. Bits 31–22 of a virtual address (so, bits 19–10 of the VPN for that virtual address) are used as an index into the page directory array. If a non-null pointer is found in the page directory, that pointer is used as the base address of a 1024-word page table, and bits 21–12 of the virtual address (so, bits 9–0 of the VPN for that virtual address) are used as an index into that page table. (Note: Translations for the process of this figure are not the same as translations for the process with the page table of Figure 2.)

    Page directory
    Index    Entry
    0x2ff    pointer to a page table
    0x2fe    0x0
    ...      ...
    0x041    0x0
    0x040    pointer to a page table
    0x03f    0x0
    ...      ...
    0x002    0x0
    0x001    pointer to a page table
    0x000    0x0

    Page table reached from page directory index 0x2ff
    Index    Status    PPN
    0x3ff    0x00e     0x89772
    0x3fe    0x00e     0x84dfe
    0x3fd    0x00f     0x91f01
    ...      0x0
    0x000    0x0

    Page table reached from page directory index 0x040
    Index    PPN
    0x3ff    0x0
    ...      ...
    0x012    0x0
    0x011    0x88a62
    0x010    0x9000e
    0x00f    0x0
    ...      ...
    0x000    0x0

    Page table reached from page directory index 0x001
    Index    PPN
    0x3ff    0x0
    ...      ...
    0x003    0x0
    0x002    0x81fff
    0x001    0x94321
    0x000    0x8a8a5

    [Status fields for the nonzero entries of the last two tables appear in the diagram as 0x00d, 0x00f, 0x00d, 0x008, 0x00d; which entry each value belongs to is not preserved in this transcription.]

6 Exercise E: Followup to Exercise D

6.1 Read This First

This is an optional exercise, which you should do if you want to learn about the effect of address size on page table organization.

6.2 What to do, Part 1: Back to the 1970’s

Consider a machine with 16-bit words, 16-bit virtual addresses, and 20-bit physical addresses. If pages are 4 KB in size, what would the size of a “flat” page table be, assuming that a single PTE could fit in a 16-bit word?
Would there be any reason to use a two-level table instead? (This problem is based loosely on documentation of Digital Equipment Corporation PDP-11 computers, which I am just barely young enough not to have used as an undergraduate!)

6.3 What to do, Part 2: Actual computers of 2015

In the x86-64 architecture, pointers are 64 bits wide. However, on Linux, using current processor chips, the range of addresses available to a user process runs from 0 to 0x7fff_ffff_ffff, so bits 46–0 of the address can be a mix of 0’s and 1’s, but bits 63–47 have to be 0. (I think this is also true for 64-bit Windows and 64-bit Mac OS X, but I haven’t checked.) That means that if pages are 4 KB in size, the meaningful part of a VPN is 47 − 12 = 35 bits wide.

Consider using a two-level page table structure for such a system. Suppose that even the smallest process needs a page directory and at least two page tables that would be found using pointers within the page directory. What would be the minimum combined size of a page directory and two page tables? (Keep in mind that pointers are 64 bits wide, and assume that a PTE is 64 bits in size, because 32 bits is now probably not enough room to hold a PPN and page status bits.) What can you conclude about whether a two-level page table organization is reasonably space-efficient for x86-64?

7 Exercise F: Loads and stores in a writeback cache

7.1 Read This First

This is an optional exercise, which you should do if you want to get some insight into the detailed operation of a write-back set-associative data cache.

7.2 Read This Second

This exercise is about an example writeback data cache design that was not presented in lectures this year. However, you can learn about the design by following this link:

    http://people.ucalgary.ca/~norman/encm369winter2015/section01/

and reading slides 50–65 of Set 9, and the related notes. A diagram of this cache circuit is given in Figure 5.
Suppose that this cache is used in a computer that does not have virtual memory. Suppose also that the computer has only one level of caches; you do not have to worry about whether this is an L1 or L2 cache.

Figure 5: 4 KB 2-way set-associative cache, with 2-word blocks and LRU replacement. This diagram shows details of hit-detection and data-read logic, but does not show all the hardware related to writes.

    [Diagram: the main memory address is split into a 21-bit search tag, 8 set bits that drive a decoder selecting one of sets 0 to 255, and a block offset. Each set holds a U bit plus, for each of way 1 and way 0, a D bit, a V bit, a 21-bit tag, word 1, and word 0. Tag comparators produce Hit1 and Hit0, which are combined into a signal that is 1 for hit, 0 for miss, and multiplexers select the 32-bit data made available to the core.]

Figure 6: State of the cache after a program has been running for a while, just before it starts on the sequence of instructions listed in “What to Do” of Exercise F. Tags, data words and set numbers are given in hex; 0x is left out to save space.

    way 1                                way 0
    D  V  tag     word 1    word 0       D  V  tag     word 1    word 0      U   set
    0  0  000000  00000000  00000000     1  1  0ffffb  00000063  00000058    1   ff
    ...                                  ...                                 ... ...
    0  1  020021  00000062  fffffffe     0  1  0ffffc  0000002a  00000037    1   03
    1  1  020020  fffffff1  fffffff5     1  1  0ffffc  004000b4  00000013    0   02
    1  1  020021  00000065  64636261     1  1  020023  fffffff4  ffffffd8    1   01
    0  1  020022  00000007  00000009     0  1  020021  00000171  00000161    1   00

Figure 7: State of some main memory words after a program has been running for a while, just before it starts on the sequence of instructions listed in “What to Do” of Exercise F.

    address        data
    0x1001_0000    0xffff_ffff
    0x1001_0004    0xffff_fff8
    0x1001_0008    0xffff_fffc
    0x1001_000c    0xffff_fff4
    0x1001_0010    0xffff_fffa
    0x1001_0014    0xffff_fff1
    0x1001_0018    0xffff_ffe0
    0x1001_001c    0xffff_ffe7

7.3 What to Do

Suppose a program has been running for a while, and is about to start on the following sequence of instructions:

    lui   $t0, 0x1001
    lw    $t1, ($t0)
    addiu $t2, $t1, 1
    sw    $t1, ($t0)
    lw    $t3, 16($t0)
    lw    $t4, 28($t0)
    ori   $t5, $zero, 0x345
    sw    $t5, 12($t0)

At that moment, the cache is in the state shown in Figure 6, and some relevant main memory contents are as shown in Figure 7.

Make a table similar to Figure 6 to show the state of the cache just after all the listed instructions have finished. Also, make a list of the addresses and data involved in writing dirty blocks back to main memory via a write buffer.

Hint: There is a useful program in encm369w14lab11/exF