COMP104 - 2015 - Second CA Assignment Compiler Structures
Symbol Table Management by Hashing

Assessment Information

Assignment Number: 2 (of 2)
Weighting: 10%
Assignment Circulated: Tuesday 24th March 2015
Deadline: Wednesday 13th May 2015; 15.00
Submission Mode: Electronic
Learning outcome assessed: 4, viz “Construct programs which demonstrate in a simple form the operation of examples of systems programs, including ... simple compilers.”
Purpose of assessment: Provide practical experience of issues in compiler design within Java.
Marking criteria: Scheme provided at end of document.
Submission necessary in order to satisfy Module requirements? No

Introductory Background

This assignment concerns the organisation and maintenance of Symbol Tables using a technique called hashing. Recall that symbol tables are used within (for example, Java) compilers as a means of recording information about variable, method, and class names. Details about so-called reserved or key words are also often stored using symbol tables, eg in order to avoid problems that might arise were such names unintentionally to be introduced as variable names.

A potential problem in organising symbol tables is their size and the fact that they may have to be accessed very often: this structure is used at most stages of a compiler’s operation. So it is important that the entries referring to properties of frequently used identifiers can be found quickly. Although techniques such as “binary search” provide one approach that is often used, this has the drawback that the symbol table entries (or keys) must be sorted using some ordering convention and so, if the data held is continually changing, there is an overhead in maintaining such ordering as keys are added (and deleted).

Hashing (if implemented carefully) can avoid such problems, allowing fast lookup of keys (often superior to binary search methods) without the overheads involved in maintaining a specific order of keys.
In this approach we have the following structures:

1. The symbol table itself. For the purposes of this assignment this is just an array of Strings.

2. A maximum number of distinct keys that the symbol table is allowed to hold. In total the structure has the form

      String[] SymTab = new String[max_number_keys];

3. A hash function, Hash, which maps Strings to integer values in the range [0,max_number_keys-1]. There are many, many possible ways in which such a function can be defined and no specific approach is prescribed for the purposes of this assignment.

One frequently used method for defining Hash(< key value >) is that of summing the integer (unicode) values of each individual character in the String key value and then (in order to ensure the outcome falls within the range [0,max_number_keys-1]) computing the remainder after division by max_number_keys. For example, taking the character values of G, C and D to be 16, 12 and 13 (illustrative values; the actual Unicode code points are 71, 67 and 68), the String GCD maps to 16 + 12 + 13 = 41, so that (if the symbol table allows more than 41 keys to be held) SymTab[41]=GCD; if the maximum number of keys is, say, 30, then SymTab[11]=GCD.

Of course an obvious problem arises: whatever hash function is used, there is the possibility of two (or more) distinct keys being hashed to the same value. For example, in the case defined above, all of the keys {GCD,GDC,DCG,DGC,CDG,CGD} will return 41 as their hash code. Thus, a mechanism is needed in order to store keys and recover their indices within the symbol table should two key values have the same hash code reported.

In the approach known as closed hashing this problem is dealt with by specifying an interval, s (with 0 < s < N, where N is the size of the table), describing successive locations of the symbol table to consider. That is, suppose we have a table of size 30 and fix s = 7.
Then for the sequence of keys GCD,GDC,DCG,DGC,CDG,CGD all having Hash(< key >) = 41 we get,

   GCD is stored in location 11   ie 41%30
   GDC is stored in location 18   ie 11 + 7
   DCG is stored in location 25   ie 11 + 7 + 7
   DGC is stored in location 2    ie 11 + 7 + 7 + 7 = 32%30 = 2
   CDG is stored in location 9    ie 11 + 7 + 7 + 7 + 7 = 2 + 7 mod 30
   CGD is stored in location 16   ie 11 + 7 + 7 + 7 + 7 + 7 = 2 + 7 + 7 mod 30

Thus every key is stored directly in the Symbol Table itself, so obviating the need to use linked lists of keys that hash to the same location (the method called open hashing).

In summary, the storage of keys in a symbol table using closed hashing involves implementing a class

   class SymbolTable

   Fields:
      private int N;          // Number of keys this table can store.
      private int s;          // Increment used to probe locations when clashes arise.
      private int size_of=0;  // The number of keys held in the table so far
      private String[] Keys;  // The Symbol Table contents.

   Constructor:
      public SymbolTable(int table_capacity, int skip)
      // Defines an instance with N=table_capacity; s=skip; and
      // Keys=new String[N]
      // NB For each 0<=i<N Keys[i] should be initialised as new String("**")
      // where "**" is used to encode the fact that this location has not been used yet.

   Methods:
      private int HashKey(String IDENT)
      // Returns the hash code of IDENT.

Detailed Requirements

For this final practical assignment you are asked to write a Java program to implement the SymbolTable class described at the end of the preceding section. Your program will have to make provision for the following:

P1. Realisation of the SymbolTable class and the methods defined. NB: Some form of hashing function must be implemented (ie the method HashKey(String IDENT)).
   You are free to explore different methods for doing so or simply to implement the technique described above; however, you SHOULD NOT use any of the predefined Java Library “hash code” methods (if in doubt about whether a particular approach is permissible please contact either the demonstrator or myself).

P2. In order to provide an environment within which your implementation of the SymbolTable class can be tested you should implement the following method in this class:

      public void populate_from(Scanner input)

   This will be called from the main() method, so that having created an instance SymTab of SymbolTable with parameters (N, s), the call

      SymTab.populate_from(input);

   should result in a sequence of String Keys being read and inserted into the table. You may use the String "##" to indicate there are no further keys to be read. For example, (the start of) an input file might look like

      50
      3
      ABCD
      integer
      constant
      IDENT
      KEY_VALUE
      ##

P3. When completed, you should then arrange for the contents of the Symbol Table to be output in the form:

      Maximum Size:
      Number of locations actually used:
      Table Contents
      0 # IDENT # HashKey(IDENT)%Size
      1 # IDENT # HashKey(IDENT)%Size
      ...
      k # IDENT # HashKey(IDENT)%Size
      ...
      N # IDENT # HashKey(IDENT)%Size

   If a location is not being used, ie Keys[k]="**" (location never used), then this should be indicated in the output (in such cases there is no need to give the hash code for the contents).

Some additional points

A1. Your implementation should be able to handle the following errors robustly:

   E1. Attempts to add an IDENT when the symbol table is full, ie every Keys[] location is in use.
   E2. Attempts to add an IDENT which is already stored in the table.

A2. It is extremely important that the maximum size (N) of the table and the value s used when resolving clashes are relatively prime, ie have greatest common divisor equal to 1. This will always be the case when N is itself a prime (and, of course, s < N) or in the (rather insipid) choice s = 1.
   Unless (N, s) satisfy GCD(N, s) = 1 it is possible for the table to appear full (unable to insert a key into) when, in fact, it is not so. The best way of testing if (N, s) are suitable is to check if the Greatest Common Divisor (gcd) of N and s is equal to 1. Should gcd(N, s) ≠ 1 then the values are unsuitable. A method to compute gcd(x, y) is given below:

      private int GCD(int x, int y)
      {
        // Subtraction form of Euclid's algorithm; assumes x > 0 and y > 0.
        int temp;
        int tx=x; int ty=y;
        while (!(tx==ty))
        {
          if (tx<ty)
          {
            // Swap so that tx holds the larger value.
            temp=tx; tx=ty; ty=temp;
          }
          tx = tx-ty;
        }
        return tx;
      }

Submission Instructions

Firstly, check that you have adhered to the following list:

1. All of your code is within a single file. Do NOT use more than one file.

2. Both your name AND User ID are clearly indicated at the start of your code, eg by

      // Name: My Name ; ID u?????

3. The file’s name MUST be SymbolTableOperation.java. This means that the main class name must also be SymbolTableOperation. Submit only the Java source: design documentation, compiled .class files, sample outputs, extraneous commentary and similar ephemera are neither required nor desired.

4. Your program is written in Java, not some other language.

5. The file is a text file: not compressed or encoded or otherwise mangled.

6. Your program compiles and runs on the Departmental Windows system. If you have developed your code elsewhere (eg your home PC), port it to our system and perform a compile/check test before submission. It is your responsibility to check that you can log onto the departmental system well in advance of the submission deadline.

7. Your program does not bear undue resemblance to anybody else’s. Electronic checks for code similarity will be performed on all submissions and instances of plagiarism will be dealt with in accordance with the procedures and sanctions prescribed by the relevant University Code of Practice.
   The rules on plagiarism and collusion are explicit: do not copy anything from anyone else’s code, do not let anyone else copy from your code and do not hand in “jointly developed” solutions.

Your solution must be SUBMITTED ELECTRONICALLY

Electronic submission: Your code must be submitted to the departmental electronic submission system at:

   http://intranet.csc.liv.ac.uk/cgi-bin/submit.pl

You need to log in to the above system and select COMP104-2: Compiler Table Managment from the drop-down menu. You then locate the file containing your program that you wish to submit, check the box stating that you have read and understood the University Code of Practice on Plagiarism and Collusion, then click the Upload File button.

MARKING SCHEME

Below is the breakdown of the mark scheme for this assignment. Each category will be judged on the correctness, efficiency and modularity of the code, as well as whether or not it compiles and produces the desired output.

• Adherence to specification (ie information requested, correct naming etc.) = 15 marks
• Implementation of SymbolTable class and methods = 35 marks
• Handling of errors (Symbol Table full, multiple identifiers) = 25 marks
• Output form = 15 marks
• Comments and layout = 10 marks

This assignment contributes 10% to your overall mark for COMP104.

Finally, please remember that it is always better to hand in an incomplete piece of work, which will result in some marks being awarded, as opposed to handing in nothing, which will guarantee a mark of 0 being awarded. Demonstrators will be on hand during the COMP104 practical sessions to provide assistance, should you need it.
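
Illustrative sketch of the probe arithmetic

As a check on the arithmetic in the Introductory Background and in point A2, the following minimal Java sketch lists the locations probed from a given hash code and counts how many distinct locations a pair (N, s) can reach. The class ProbeSketch and its methods probeSequence and reachable are hypothetical names introduced only for this illustration; they are not part of the required SymbolTable class and this is not a model solution.

```java
import java.util.ArrayList;
import java.util.List;

class ProbeSketch {
    // Locations visited when probing from hash code 'hash' in steps of s,
    // modulo the table size N. Returns the first 'count' locations.
    static List<Integer> probeSequence(int hash, int s, int N, int count) {
        List<Integer> locations = new ArrayList<>();
        int loc = hash % N;
        for (int i = 0; i < count; i++) {
            locations.add(loc);
            loc = (loc + s) % N;
        }
        return locations;
    }

    // Number of distinct locations visited before the probe sequence
    // returns to its starting location.
    static int reachable(int hash, int s, int N) {
        int start = hash % N;
        int loc = start;
        int seen = 0;
        do {
            seen++;
            loc = (loc + s) % N;
        } while (loc != start);
        return seen;
    }

    public static void main(String[] args) {
        // The six clashing keys from the handout: hash code 41, N = 30, s = 7.
        System.out.println(probeSequence(41, 7, 30, 6)); // [11, 18, 25, 2, 9, 16]
        // gcd(30, 7) = 1: every location can be reached.
        System.out.println(reachable(41, 7, 30)); // 30
        // gcd(30, 6) = 6: only 30/6 = 5 locations can be reached.
        System.out.println(reachable(41, 6, 30)); // 5
    }
}
```

With s = 6 the probe sequence from location 11 cycles through 11, 17, 23, 29, 5 and back to 11, so once those five locations are occupied a clashing key cannot be placed even though 25 locations remain free. This is the situation that the condition gcd(N, s) = 1 rules out.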