Coogle - A Code Google Eclipse Plug
Diploma Thesis
31st January 2006

Coogle
A Code Google Eclipse Plug-in for Detecting Similar Java Classes

Tobias Sager of Zürich, Switzerland (s0070115)

supervised by Prof. Abraham Bernstein, Ph.D.; Prof. Dr. Harald Gall
Beat Fluri; Christoph Kiefer; Martin Pinzger

Author: Tobias Sager, tsager@gmx.ch
Project period: June 7, 2005 - December 7, 2005
Software Evolution & Architecture Lab
Department of Informatics, University of Zurich

Acknowledgements
I would like to thank my supervising assistants, Beat Fluri, Christoph Kiefer and Martin Pinzger, for their valuable input, the extensive proofreading and the freedom I had while writing this thesis. Further, I thank Prof. Abraham Bernstein and Prof. Harald Gall for giving me the opportunity to write this thesis. The layout of this document is based on the superb LaTeX style written by Beat Fluri. I thank Sabine, Vreni and Ernst Sager for proofreading the thesis and for the moral support they provided. My apologies to Christine for all those hours spent in front of the computer. Kudos to all the nameless open-source software developers. The following great tools were used for creating this thesis: Java, Eclipse, Subversion, Subclipse, TeXlipse, LaTeX, OpenOffice.org, Inkscape, Mozilla Firefox and Gentoo Linux.

Abstract
This thesis introduces Coogle, an Eclipse plug-in that measures similarity between Java classes. Coogle calculates similarity by using different tree algorithms on syntax tree representations of source code. To create these tree representations, we convert the abstract syntax tree as defined by Eclipse into an intermediary model called FAMIX. This FAMIX model is then transformed into a general tree structure and used for calculating the similarity.
We derive tree similarity from a bottom-up maximum common subtree isomorphism, a top-down maximum common subtree isomorphism, and the edit distance of two given trees. These similarity measures are then analysed for their efficiency in detecting modified code and structural similarity with constructed test cases and a real-world Java project. The best results are achieved with the tree edit distance algorithm, which reliably indicates similarity of classes after refactorings and also finds structurally similar classes in Eclipse's compare project. Finding similarity with a top-down maximum common subtree algorithm is efficient for detecting structural similarity, but has shortcomings in detecting similarity of modifications that affect the ordering of the nodes in the tree representation. Using a bottom-up maximum common subtree isomorphism for detecting modifications is inefficient due to the limited hierarchy of the FAMIX tree representation. Based on these findings, we point out different ways to improve our similarity analysis tool.

Summary
This diploma thesis presents Coogle, a plug-in for Eclipse that measures similarity between Java classes. Coogle computes similarity from the syntax trees of source code. To build these tree structures, we convert the abstract syntax tree defined by Eclipse into a model called FAMIX. This FAMIX representation is then transformed into a general tree structure and used as the basis for calculating similarity. We calculate the similarity of the trees with three different algorithms: bottom-up maximum common subtree, top-down maximum common subtree and the tree edit distance of two trees. These similarity measures are evaluated with test cases and a real software project for their efficiency in finding modified source code and determining structural similarity.
The best result is achieved by the tree edit distance algorithm, which very reliably indicates the similarity of classes even after refactorings. This algorithm also finds structurally similar classes in Eclipse's compare project. The top-down maximum common subtree algorithm is useful for determining structural similarity. However, this algorithm has deficits in detecting modifications that change the order of the nodes in the tree representation. A bottom-up maximum common subtree algorithm is not efficient for discovering source code changes, because the hierarchy of the FAMIX tree has too little depth. Based on these results, we make several suggestions for improving our analysis software.

Contents
1 Introduction
  1.1 Motivation
  1.2 Structure of Thesis
2 FAMIX - the FAMOOS Information Exchange Model
  2.1 Purpose
  2.2 Description
    2.2.1 Overview
    2.2.2 Core Model
    2.2.3 FAMIX Extensions for Java
  2.3 FAMIX as Intermediary Model
3 Similarity Analysis
  3.1 Overview
  3.2 Related Work
  3.3 Definition of Similarity
    3.3.1 Structural Similarity
    3.3.2 Functional Similarity
  3.4 Evaluated Tree Algorithms
    3.4.1 Definitions
    3.4.2 Using Ordered or Unordered Trees?
    3.4.3 Tree Isomorphism
    3.4.4 Bottom-up Maximum Common Subtree Isomorphism
    3.4.5 Top-down Maximum Common Subtree Isomorphism
    3.4.6 Tree Edit Distance
4 Implementation
  4.1 Eclipse Architecture
    4.1.1 Eclipse Platform
    4.1.2 Abstract Syntax Tree Representation in Eclipse
    4.1.3 AST to FAMIX Mapping
  4.2 Coogle Architecture
    4.2.1 Project Package Structure
    4.2.2 Plug-in Features
  4.3 Coogle Design
    4.3.1 Overview
    4.3.2 FAMIX Extensions
    4.3.3 Tree Generation
    4.3.4 Node Comparison
    4.3.5 Input Trees for Measures
  4.4 Coogle Workflow
    4.4.1 Invocation
    4.4.2 Similarity Search Process
  4.5 Coogle Implementation
    4.5.1 Tree Generation
    4.5.2 Node Comparison
    4.5.3 Implemented Similarity Measures
  4.6 Discussion and Problems
    4.6.1 FAMIX
    4.6.2 Similarity Measures
5 Evaluation
  5.1 Approach
  5.2 Analysis Objects
    5.2.1 Constructed Test Cases
    5.2.2 Real World Example: org.eclipse.compare
  5.3 Results
    5.3.1 Ranking the Matches
    5.3.2 Results with Constructed Test Cases
    5.3.3 Results with org.eclipse.compare
  5.4 Discussion
    5.4.1 Comparison of Implemented Measures
    5.4.2 Shortcomings
    5.4.3 Possible Improvements
6 Conclusion and Future Work
  6.1 Results
  6.2 Future Work
A Coogle Step by Step
B How to Extend Coogle
  B.1 Add a New Similarity Measure
  B.2 Extend the Information in the Tree
  B.3 Define a New Comparator
C Contents of CD-ROM
  C.1 Directory Layout
  C.2 Eclipse Workspace: Test Cases
D Test Cases Source Listings
  D.1 AzureusCoreImpl
  D.2 RateControlledEntity

List of Figures
2.1 FAMIX concept overview (source: [Tichelaar et al., 1999])
2.2 Abstract basic elements of the FAMIX model. Figure from [Tichelaar et al., 1999].
2.3 Subclasses of the FAMIX element BehaviouralEntity.
2.4 Subclasses of StructuralEntity and their relationship to other elements (in grey). Figure from [Tichelaar et al., 1999].
2.5 Core elements of FAMIX and their relationship. Figure from [Tichelaar et al., 1999].
3.1 Left: A directed graph with five vertices and eight arcs. Right: A rooted tree with six nodes.
3.2 A complete bipartite graph between the nodes v1, v2 of T1 and the nodes w1, w2 and w3 of tree T2.
3.3 Isomorphic ordered trees. Nodes numbered according to a preorder traversal. Figure taken from [Valiente, 2002].
3.4 Bottom-up maximum common subtree isomorphism equivalence classes for two ordered trees. Nodes are numbered according to the equivalence class to which they belong. Figure taken from [Valiente, 2002].
3.5 Bottom-up maximum common subtree of two ordered trees (highlighted in grey). Figure taken from [Valiente, 2002].
3.6 Bottom-up maximum common subtree of two unordered trees (highlighted in grey). The dashed arrows depict the mapping of corresponding nodes. Figure taken from [Valiente, 2002].
3.7 Top-down maximum common subtree of two ordered trees (highlighted in grey). Figure taken from [Valiente, 2002].
3.8 Top-down maximum common subtree of two unordered trees (highlighted in grey). Figure taken from [Valiente, 2002].
3.9 Transformation between two ordered trees. Figure taken from [Valiente, 2002].
3.10 Shortest path in the edit graph of two ordered trees. Figure from [Valiente, 2002].
4.1 Eclipse platform architecture with its main components and plug-ins.
4.2 The steps of processing the source code of a Java class into a tree that can be used as input for the similarity measure.
4.3 Java class diagram for the top level elements of the FAMIX model. Our extensions to the original FAMIX model are shaded in grey.
4.4 Java class diagram for StructuralEntity with its subclasses.
4.5 Java class diagram for BehaviouralEntity and Context with their respective subclasses. Our extensions to the original FAMIX model are shaded in grey.
4.6 The Coogle process: transformation of a Java source code file into a general tree structure via a FAMIX representation of the abstract syntax tree. Note the loss of ordering after parsing the tree into a FAMIX model.
4.7 A bottom-up maximum common subtree isomorphism for two ordered trees (highlighted in grey). The dashed line represents the node mapping. The mapping is defined as M = {(v4, w2), (v5, w3), (v6, w4), (v7, w5), (v8, w6)}.
4.8 A top-down maximum common subtree isomorphism for two ordered trees (highlighted in grey). The dashed line represents the node mapping. The mapping is defined as M = {(v1, w1), (v3, w2), (v4, w3), (v5, w4), (v6, w5), (v7, w6), (v8, w7)}.
4.9 The complete workflow of a Coogle similarity search.
5.1 Test Case A: Resulting tree after changing the class. Added tree elements in italic.
5.2 Test Case B: Resulting tree after changing the class. Added tree elements in italic.
5.3 Test Case D: Resulting tree after changing the class. Added tree elements in italic.
5.4 Distribution of average similarity (bottom-up, top-down and tree edit distance measures) for class CompareViewerPane in org.eclipse.compare project. Denoted on the x-axis are all classes of the project with descending similarity.
5.5 Distribution of tree edit distance similarity for class CompareViewerPane in the compare project of Eclipse. Denoted on the x-axis are all classes of the project with descending similarity.
A.1 Context menu when right-clicking on a Java project in the Eclipse workspace.
A.2 Step 1: Welcome screen and choice of similarity.
A.3 Step 2: Selection of project containing the desired search object.
A.4 Step 3: Search object selection.
A.5 Step 4: Final summary page before calculation is started.
A.6 Result dialog of a tree edit distance calculation on the Eclipse compare project with the class org.eclipse.compare.Splitter as search object.

List of Tables
4.1 FAMIX elements with their corresponding AST element.
5.1 Results case A (add constructor to class): Bottom-up maximum common subtree.
5.2 Results case A (add constructor to class): Top-down maximum common subtree.
5.3 Results case A (add constructor to class): Tree edit distance.
5.4 Results case B (add attribute to class): Bottom-up maximum common subtree.
5.5 Results case B (add attribute to class): Top-down maximum common subtree.
5.6 Results case B (add attribute to class): Tree edit distance.
5.7 Results case C (add invocation to method): Bottom-up maximum common subtree.
5.8 Results case C (add invocation to method): Top-down maximum common subtree.
5.9 Results case C (add invocation to method): Tree edit distance.
5.10 Results case D (method extraction): Bottom-up maximum common subtree.
5.11 Results case D (method extraction): Top-down maximum common subtree.
5.12 Results case D (method extraction): Tree edit distance.
5.13 Results case E (implement interface): Bottom-up maximum common subtree.
5.14 Results case E (implement interface): Top-down maximum common subtree.
5.15 Results case E (implement interface): Tree edit distance.
5.16 Selected top results for comparison on org.eclipse.compare.CompareViewerPane.
5.17 Selected top results for comparison on org.eclipse.compare.NavigationAction.

List of Listings
3.1 Formal Java class declaration rules [Gosling et al., 1996].
3.2 A sample class definition following the rules from Listing 3.1.
4.1 Java CompilationUnit AST node type. This is the type of the root of an AST.
4.2 TypeDeclaration AST node type. A type declaration is the union of a class declaration and an interface declaration.
4.3 Creates a new tree of a FAMIXInstance by using the visitor pattern. This method is defined in ch.toe.tree.TreeUtil.
4.4 accept() method from ch.toe.famix.model.Class, demonstrating the visitor pattern.
4.5 Sample visitor implementation used for building a tree of all relevant FAMIX elements. This is the implementation as used by TreeBuildVisitor.
4.6 Most detailed constructor signature of CalculateBottomUpMaximumSubtree.
4.7 Most detailed constructor signature of CalculateTreeEditDistance.
5.1 Test case A: Code of the added constructor
5.2 Test case B: Code for an added attribute
5.3 Test case D: Extract the code of a method into a new method.
B.1 Sample class for defining a new similarity measure.
B.2 Sample constructor for a new similarity measure with a single tree as parameter.
B.3 Sample implementation of a new measure operation.
B.4 Implementation of a new result dialog.
B.5 In class CoogleModel: Model extension for a new measure.
B.6 In class CoogleWizardPageWelcome: Additions to the welcome page of the wizard for a new tree measure.
B.7 In class CoogleWizard: Add the new operation to the finish action of the wizard.
B.8 A new comparator implementation.
D.1 AzureusCoreImpl.java (version 2.3.0.3) from the Azureus project.
D.2 RateControlledEntity.java (version 2.3.0.3) from the Azureus project.

Chapter 1
Introduction
This chapter describes the motivation of this thesis. Further, it discusses existing work in the field of source code similarity and concludes with an overview of the structure of the thesis.

1.1 Motivation
The field of similarity analysis in source code has many different applications. For example, similarity analysis is used to detect code duplicates (i.e., code clones). Removing such code clones improves the maintainability of a software system. The quality of a system can therefore be analysed by identifying duplicated code. However, clone analysis is not only used to assess quality: duplicated code is also an indicator of software plagiarism. The algorithms used for this task vary from very simple source code line comparison to complex hashing algorithms that are not susceptible to changes in the naming of fields or methods and even detect similarity in obfuscated code.
Further, similarity analysis of source code is helpful during development, for instance to provide better support for code reuse. Consider, for example, a development environment that analyses freshly written code and suggests similar code examples or existing implementations from a source repository.
This helps to reuse existing code and lessens the development effort by creating collaborative knowledge of code fragments.
In this thesis, we aim to detect similar Java classes based upon the syntax tree of source code. A syntax tree is the representation of source code in the form of a tree. Eclipse provides us with a detailed tree representation of Java source code that includes all statements and operations. This syntax tree is converted into an intermediary model called the FAMOOS Information Exchange Model (FAMIX). FAMIX is a model for representing object-oriented source code, independent of specific programming language constructs. This language-independent representation of source code is then analysed for similarity with different tree algorithms. In our implementation we use three different measures: bottom-up maximum common subtree isomorphism, top-down maximum common subtree isomorphism and the tree edit distance algorithm. These measures detect the similarity of two given Java classes by analysing their tree representations for similar parts.
The following contributions are made by this thesis:
• A Java representation of the FAMIX model is created and used as intermediary model for converting the abstract syntax tree of source code into general trees.
• An implementation of three tree similarity measures, integrated into SimPack, a generic Java library of similarity measures for use in ontologies.
• Coogle, an Eclipse plug-in for searching similar classes in Java projects. The similarity analysis is based on the previously mentioned similarity algorithms.
• An evaluation of the implemented measures with test cases of refactoring patterns and a real-world Java project, Eclipse's compare plug-in.

1.2 Structure of Thesis
This thesis starts with an overview of the FAMOOS Information Exchange Model and details its purposes and structure in Chapter 2.
In the chapter thereafter, we describe existing similarity analysers and the implemented similarity measures. The fourth chapter documents the implementation details and discusses problems encountered during development. We evaluated Coogle with test cases and a real-world Java project as described in the fifth chapter. The concluding chapter details the results of our work and advises on possible future work.

Chapter 2
FAMIX - the FAMOOS Information Exchange Model
This chapter gives a brief introduction to the FAMOOS Information Exchange Model (FAMIX) as defined in [Tichelaar et al., 1999], emphasising the parts we use in our Java implementation.

2.1 Purpose
The FAMOOS Information Exchange Model (FAMIX) was developed as an information exchange model for the FAMOOS project (project site: http://iamwww.unibe.ch/~famoos/). FAMOOS is an acronym for "Framework-based Approach for Mastering Object-Oriented Software Evolution", a re-engineering framework for supporting the design, analysis and maintainability of software systems. Tool prototypes for experimenting in various areas of this project have been implemented in different languages (C++, Ada, Java and Smalltalk). To avoid incorporating parsing technology for all those languages into each of the tool prototypes, FAMIX was defined as a common information exchange model. The model is applied to different languages by using specific language extensions. Figure 2.1 gives a graphical view of this concept. The model was published in 1999 as FAMIX 2.0 [Tichelaar et al., 1999]. The Java-specific extension can be found in [Tichelaar, 1999].

Figure 2.1: FAMIX concept overview (source: [Tichelaar et al., 1999])

2.2 Description
2.2.1 Overview
As FAMIX is a model for representing different object-oriented languages, the model uses the highest common factor of all those languages.
The main elements of the object-oriented model can therefore be modelled with FAMIX. We describe the core model of FAMIX in Section 2.2.2.
When interchanging data between different tools, it is necessary to have a tool-independent, transferable representation in the form of files. FAMIX adopted CDIF [CDIF, 1994] for this purpose. CDIF is a standard for formally representing models in human-readable text. Because our plug-in does not need to export the FAMIX representation into text form, we do not describe CDIF here.

2.2.2 Core Model
Figure 2.2 illustrates the abstract core model of FAMIX. All elements in the model are children of the type Object. Objects in an object-oriented sense (such as methods and variables) are of type Entity, that is, a BehaviouralEntity or a StructuralEntity. We describe these two types later in this chapter. A Property is a piece of tool-specific information that can be assigned to any Object. We do not define such Properties, as we have no need to store information beyond what is already present in FAMIX. There are three different types of Associations:
• InheritanceDefinition, for superclass-subclass relations;
• Invocation, for invocations of a BehaviouralEntity;
• Access, for modelling accesses to a StructuralEntity.
Argument is used for passing arguments to an invocation of a BehaviouralEntity. In Java, for example, the statement System.out.println("Hello world!"); is represented by passing an ExpressionArgument (the "Hello world!" string) and an AccessArgument (representing System.out) to the Invocation of println().
A BehaviouralEntity has two subclasses. Function models a global behaviour, whereas Method represents the definition of a behaviour of a class. The concept of Functions is not known in every object-oriented language; Java, for example, does not use this type of behaviour. BehaviouralEntity and its subclasses are shown in Figure 2.3.
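The println() example can be made concrete with a small Java sketch. It is an illustration under stated assumptions rather than the classes of the thesis implementation: the class names follow the FAMIX element names used in the text, while FamixObject (standing in for the FAMIX type Object), all fields, the addArgument() and describe() helpers and the helloWorld() factory are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only. Class names mirror the FAMIX elements named in the
// text; all fields, constructors and helper methods are assumptions.
final class FamixInvocationSketch {

    /** Root type of all model elements (stands in for the FAMIX type "Object"). */
    abstract static class FamixObject { }

    /** An argument passed to an invocation of a BehaviouralEntity. */
    abstract static class Argument extends FamixObject {
        abstract String describe();
    }

    /** Argument given as an expression, for example a string literal. */
    static final class ExpressionArgument extends Argument {
        private final String expression;
        ExpressionArgument(String expression) { this.expression = expression; }
        @Override String describe() { return "ExpressionArgument(" + expression + ")"; }
    }

    /** Argument that is an access to a StructuralEntity, for example System.out. */
    static final class AccessArgument extends Argument {
        private final String accessedEntity;
        AccessArgument(String accessedEntity) { this.accessedEntity = accessedEntity; }
        @Override String describe() { return "AccessArgument(" + accessedEntity + ")"; }
    }

    /** Association that models the invocation of a BehaviouralEntity. */
    static final class Invocation extends FamixObject {
        private final String invokedBehaviour;
        private final List<Argument> arguments = new ArrayList<>();
        Invocation(String invokedBehaviour) { this.invokedBehaviour = invokedBehaviour; }

        Invocation addArgument(Argument argument) {
            arguments.add(argument);
            return this;
        }

        String describe() {
            List<String> parts = new ArrayList<>();
            for (Argument argument : arguments) {
                parts.add(argument.describe());
            }
            return "Invocation(" + invokedBehaviour + ")" + parts;
        }
    }

    /** Models System.out.println("Hello world!"); as described in the text. */
    static Invocation helloWorld() {
        return new Invocation("println")
                .addArgument(new ExpressionArgument("\"Hello world!\""))
                .addArgument(new AccessArgument("System.out"));
    }

    public static void main(String[] args) {
        System.out.println(helloWorld().describe());
    }
}
```

The sketch stores names as plain strings to stay short; a fuller model would presumably reference the invoked Method and the accessed StructuralEntity as objects.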
Each StructuralEntity has an attribute declaredClass which declares the type of the entity. In Java this might be a primitive type such as int or a class type like String. Figure 2.4 illustrates the subclasses of StructuralEntity. A GlobalVariable represents a globally accessible variable whose lifetime equals that of the system. This concept is not known in Java. An Attribute is a field defined in a Class. An ImplicitVariable represents context variables such as this or super. A FormalParameter is a child of a BehaviouralEntity and represents a parameter of a method. Locally defined variables are of type LocalVariable.
The main entities of an object-oriented model are classes. Figure 2.5 shows the core model of FAMIX. A Class has relations defined through InheritanceDefinition, either as superclass or as subclass. A BehaviouralEntity (represented by Method in the figure) or a StructuralEntity (Attribute in the figure) belongs to a Class. These relations also highlight a major problem with the FAMIX model when used for our purpose: FAMIX defines relations from children to their parent. However, to build a tree effectively, we need relations that can be traced from parents to their children. See Section 4.3.2 for the implementation consequences of this.

Figure 2.2: Abstract basic elements of the FAMIX model. Figure from [Tichelaar et al., 1999].
Figure 2.3: Subclasses of the FAMIX element BehaviouralEntity. Figure from [Tichelaar et al., 1999].

FAMIX has different levels of extraction that denote how much information is extracted. Level 1 is the minimum a parser must be able to extract and includes four different object types: Class, InheritanceDefinition, BehaviouralEntity and Package. Level 4, the most detailed level and the level we extract with Coogle, contains all objects defined in FAMIX.

2.2.3 FAMIX Extensions for Java
FAMIX extension documents exist for various object-oriented languages.
For Java-specific features, [Tichelaar, 1999] defines the needed extensions. We describe the most notable extensions in this section; refer to the Java specification [Gosling et al., 1996] for further information.
Class. Addition of methods for representing the possible states of a Class: isInterface(), isPublic(), isFinal() and isAbstract().
Method. Corresponding to the allowed method modifiers, the methods isFinal(), isSynchronized() and isNative() are added. The latter is used for methods that are implemented in an external language (Assembler, for instance). The signature of a Method is defined to have a format like methodname(paramType1,..,paramTypeN).

Figure 2.4: Subclasses of StructuralEntity and their relationship to other elements (in grey). Figure from [Tichelaar et al., 1999].

Attribute. Corresponding to the possible modifiers of an Attribute, the methods isFinal(), isTransient() (if the attribute does not need to be serialised) and isVolatile() (used for indicating that the attribute should not be optimised by the compiler) are added.
LocalVariable and FormalParameter. The method isFinal() is added for representing this possible state of these two types.
TypeCast. This is a new object type added specifically for the Java extension. A TypeCast is an Association with a fromType and a toType for representing type casts between two Java types. In FAMIX, a TypeCast is a member of a BehaviouralEntity.
accessControlQualifier. The accessControlQualifier of an object can have one of three possible states: public, protected and private. Default package visibility is represented by an empty accessControlQualifier.
Function and GlobalVariable. These elements will never appear in a FAMIX representation of Java code as there is no such concept in Java.
Inner classes. There is no specification for nested classes, inner classes and anonymous classes in [Tichelaar, 1999].
We represent these class types in our FAMIX implementation, but do not include them when parsing the source code (see Section 4.6 for more information).
Implicit methods. Java has implicit methods such as this(..), super(..) or default constructors. These are represented neither in FAMIX nor in our implementation.
Static and instance initialisers. A static initialiser can be used for variables and classes in Java source code. We do not represent these code fragments as there is no FAMIX representation available.

Figure 2.5: Core elements of FAMIX and their relationship. Figure from [Tichelaar et al., 1999].

Names of objects can be queried in two different formats: a simple name, for example "CooglePlugin", and a unique name. Unique names contain the full package path to the object and all information necessary for uniquely identifying an object. A Class then has a unique name of "ch.toe.coogle.plugin.CooglePlugin". A method in the same class is uniquely named "ch.toe.coogle.plugin.CooglePlugin.getDefault()", which is the unique name of the belongsTo()-Class, followed by the name of the method including its signature. Please note that this unique name format differs from the original FAMIX definition. Originally, the unique name of a Method uses two colons instead of the periods: "ch::toe::coogle::plugin::CooglePlugin.getDefault()". We used the simpler format with periods as this corresponds to the usual way of representing names in Java.

2.3 FAMIX as Intermediary Model
Why did we choose FAMIX as intermediary model for representing Java source code? These are the advantages of using such a model:
• By using a fixed standard representation of object-oriented code, we can maintain interoperability with other tools using FAMIX, for example CodeCrawler [Lanza, 2003].
• Similarity analysis for other languages (C++, for example) will yield comparable results when based on FAMIX.
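As a side note on the naming scheme from Section 2.2.3, the two unique name formats and the method signature format can be illustrated with a few lines of Java. This is a minimal sketch: the class UniqueNameSketch and its three helper methods are hypothetical and not part of the thesis implementation; only the resulting string formats are taken from the text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: the helper names below are assumptions; only the
// produced string formats come from the text.
final class UniqueNameSketch {

    /** Signature format of the Java extension: methodname(paramType1,..,paramTypeN). */
    static String signature(String methodName, List<String> paramTypes) {
        return methodName + "(" + String.join(",", paramTypes) + ")";
    }

    /** Unique method name in the period-separated format used in this thesis. */
    static String uniqueMethodName(String classUniqueName, String signature) {
        return classUniqueName + "." + signature;
    }

    /** The same name in the original FAMIX format, with "::" inside the class path. */
    static String originalFamixMethodName(String classUniqueName, String signature) {
        return classUniqueName.replace(".", "::") + "." + signature;
    }

    public static void main(String[] args) {
        String owner = "ch.toe.coogle.plugin.CooglePlugin";
        String getDefault = signature("getDefault", Collections.<String>emptyList());

        System.out.println(uniqueMethodName(owner, getDefault));
        System.out.println(originalFamixMethodName(owner, getDefault));
        // A signature with parameters; the method name "compare" is made up.
        System.out.println(signature("compare", Arrays.asList("int", "String")));
    }
}
```

Keeping the signature separate from the owning class name mirrors the description in the text, where the unique name of a method is the unique name of its class followed by the method name including its signature.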
The use of FAMIX also has disadvantages. Most notably, information is lost when converting Java code to its FAMIX representation. This happens mainly because of elements not represented in FAMIX. See Section 4.1.3 for details on the missing information. Further, FAMIX is an additional layer in-between the detailed Java abstract syntax tree (see Section 4.1.2) and the final tree representation (see Section 4.5.3).

Chapter 3 Similarity Analysis

This chapter gives an overview of similarity in general, similarity analysis on source code and existing work in this field. We describe the similarity measures we used from an algorithmic point of view and outline their features and shortcomings.

3.1 Overview

The goal of this thesis is to find similar Java classes by analysing the FAMIX representation of the abstract syntax tree of source code. We search for similarity in existing Java projects using three different algorithms (outlined in Section 3.4) and analyse how changes in source code affect similarity.

3.2 Related Work

Different approaches exist for detecting similarity in trees and source code. [Baxter et al., 1998] describes a tool that analyses systems for duplicated code. The algorithm used is based on abstract syntax trees and employs hashing on code fragments for detecting exact and near-miss clones. [Myles and Collberg, 2005] takes a similar approach by using a birthmarking technique, deducing unique characteristics from the instructions of a program, for detecting software theft. A minimal edit script algorithm for transforming one tree into another tree is defined in [Chawathe et al., 1996] and detects changes in general, hierarchically structured information. [Shasha et al., 2004] describes an application doing cousin search in phylogenetic trees (which represent evolutionary history). The same first author has also published work on general tree and graph searching, using exact and approximate search algorithms [Shasha et al., 2002].
[Wang et al., 2003] presents a tool called TreeRank, which does a nearest neighbour search for detecting similar patterns in a given phylogenetic tree. [Mishne and de Rijke, 2004] and [Neamtiu et al., 2005] each define a conceptual model for source code representation, which in both cases partially resembles the abstract syntax tree defined by Eclipse. [Mishne and de Rijke, 2004] uses code similarity for retrieving similar code fragments from an existing repository of code documents. [Neamtiu et al., 2005] extracts similarity by mapping corresponding AST elements and describes code evolution with this information.

A different approach is described in [Kontogiannis, 1993], where a Program Description Tree (PDT) is generated from code fragments. These fragments are treated as behavioural entities, i.e., as independent components, interacting with resources and other entities of the system. The PDT then not only represents structural information like an AST does, but also contains functional information in the form of interactions and accesses. Similar fragments are detected by searching for entities whose PDTs have similar characteristics.

Analysing large software systems for similarity, [Yamamoto et al., 2002] proposes an algorithm based on correspondence of source code lines. Also not based on syntax trees is the method implemented by [Michail and Notkin, 1999], where similar functions in different libraries are detected. The matching algorithm uses the names of functions, the names of their members and surrounding comments as similarity indicators. Finally, [Baker and Manber, 1998] lists various approaches for reconstructing changes and similarity information from Java bytecode. Such approaches include the analysis of fingerprint samplings or almost-matching of disassembled code by ignoring textual information (such as field names).

3.3 Definition of Similarity

Similarity defines the proximity of two objects.
In our case, we analyse two Java classes for matching parts and conclude the nearness of the two classes from the size of the matched parts. We define two different notions of similarity when analysing Java source code similarity or similarity of object-oriented source code in general: structural similarity and functional similarity. These two types of similarity are explained in the next two sections of this chapter. The current implementation of our similarity analysis tool is able to detect structural similarity. Possible extensions to detect functional similarity are outlined in Sections 5.4.3 and 6.2.

3.3.1 Structural Similarity

Source code of any programming language is structurally defined through a limited set of instructions, a given grammar usually consisting of words and symbols. The structure of a piece of code is fixed by this grammar. A Java class for example needs to follow the structure defined in Listing 3.1. A sample class declaration obeying these rules is shown in Listing 3.2.

    NormalClassDeclaration:
        ClassModifiers class Identifier TypeParameters Super Interfaces { ClassBodyDeclarations }
    ClassBodyDeclarations:
        ClassBodyDeclaration
        ClassBodyDeclarations ClassBodyDeclaration
    ClassBodyDeclaration:
        ClassMemberDeclaration
        InstanceInitializer
        StaticInitializer
        ConstructorDeclaration
    ClassMemberDeclaration:
        FieldDeclaration
        MethodDeclaration
        ClassDeclaration
        InterfaceDeclaration
        ;

Listing 3.1: Formal Java class declaration rules [Gosling et al., 1996].

    public class Example extends Parent {
    }

Listing 3.2: A sample class definition following the rules from Listing 3.1.

Because a grammar is constructed like a tree, we can generate a tree representation of the code, for example an abstract syntax tree (AST). See Section 4.1.2 for a description of the syntax tree used in Coogle.
We define structural similarity as similarity in the structure of the source code, in this case the structure of the abstract syntax trees of two Java objects. The structure of a class contains very little information about the functionality the class provides. Structural similarity is very successful when used for code duplication detection, as the instruction structure of copied code remains the same and does not change, for example, when variable names are replaced. However, such structural similarity can become almost undetectable even for simple changes in the instruction sequence if the algorithm is only able to compare ordered information, for example ordered syntax trees. We will discuss this later in Section 3.4.2.

3.3.2 Functional Similarity

Functional similarity defines the similarity of two objects, in our case Java classes, in the function they perform. [Kontogiannis, 1993] for example defines code fragments as behavioural entities which interact with the rest of the system. Using these interactions as characteristics of a studied code fragment, it is possible to search for entities that have similar interaction characteristics and therefore perform similar functions. Another example of a project using functional similarity is the Strathcona tool described in [Holmes and Murphy, 2005], which measures inheritance, invocations and accesses of a type and then recommends similar code samples from a repository. Such functional similarity is notably different from structural similarity: a similar code structure can perform fundamentally different functions and, vice versa, functionally similar classes can be represented in various structurally different ways.

3.4 Evaluated Tree Algorithms

The input for our similarity measures are trees generated from a Java abstract syntax tree as represented in Eclipse. The measures are generic, i.e., they operate on general tree structures, independent of any context information such as FAMIX attributes or AST elements.
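To make the notion of a general tree structure more tangible, the following minimal sketch builds a small labelled tree and counts its nodes. It uses javax.swing.tree.DefaultMutableTreeNode, the node class that Section 4.3.1 names as the basis of the final tree representation. The class GenericTreeSketch, its methods and the sample labels are hypothetical and chosen for illustration only; they are not taken from the Coogle sources.

```java
import javax.swing.tree.DefaultMutableTreeNode;

// Hypothetical helper class; names and labels are illustrative, not Coogle code.
final class GenericTreeSketch {

    /** Creates a node carrying an arbitrary label and appends the children in the given order. */
    static DefaultMutableTreeNode node(Object label, DefaultMutableTreeNode... children) {
        DefaultMutableTreeNode n = new DefaultMutableTreeNode(label);
        for (DefaultMutableTreeNode child : children) {
            n.add(child);
        }
        return n;
    }

    /** Counts the nodes of the subtree rooted at n, including n itself. */
    static int size(DefaultMutableTreeNode n) {
        int count = 1;
        for (int i = 0; i < n.getChildCount(); i++) {
            count += size((DefaultMutableTreeNode) n.getChildAt(i));
        }
        return count;
    }

    /** A small tree for a class with one attribute and one method (labels are assumed). */
    static DefaultMutableTreeNode sampleClassTree() {
        return node("Class",
                node("Attribute"),
                node("Method", node("FormalParameter"), node("LocalVariable")));
    }
}
```

A similarity measure that only relies on getChildCount(), getChildAt() and, optionally, the label stays independent of where the tree came from, which is the sense in which the measures are generic.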
3.4.1 Definitions

This section defines the most important vocabulary used in the following sections:

Graphs. A graph consists of vertices and arcs. Each arc connects two vertices. Arcs can be directed or undirected. A sample directed graph is illustrated in Figure 3.1.

Trees. A tree is a particular case of a directed graph in which there exists a single vertex, called the root of the tree, such that there is a unique walk from the root to any vertex of the tree. Vertices of a tree are called nodes, arcs are called edges. See Figure 3.1 for an example of a tree. Although there exist undirected trees, we only use trees based on directed graphs.

Ordered trees. A node of a tree can have multiple children. An ordered tree is a tree in which the relative order of the children is fixed for each node.

Labelled trees. Each node of a tree can have a so-called label. The label of a node consists of additional attributes, for example a name.

Tree isomorphism. Tree isomorphism is the problem of determining whether a tree T1 is isomorphic to another tree T2, i.e., whether there exists a mapping of the nodes of T1 to the nodes of T2 that preserves the structure of the tree: the root of T1 is mapped to the root of T2 and their children are mapped equivalently, corresponding to their order.

Equivalence classes. Elements of an equivalence class are equivalent to all other elements in the same equivalence class. In the case of tree nodes, we define nodes to be equivalent if they have the same subtree rooted at them. Partitioning a tree into its equivalence classes therefore means sorting each node into a subset of nodes (the equivalence class) with the same subtree rooted at them.

Bipartite graph. A bipartite graph is an undirected graph in which the vertices can be partitioned into two subsets in such a way that every edge of the graph joins a vertex of one subset with a vertex of the other subset. See Figure 3.2 for an example of such a graph.
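The definition of isomorphism for ordered trees can be illustrated with a short recursive check. This is a minimal sketch under our own assumptions: trees are given as DefaultMutableTreeNode structures, labels are compared with equals(), and the names OrderedIsomorphismSketch, node and isomorphic are hypothetical. The flag compareLabels switches between a labelled comparison and a purely structural one.

```java
import java.util.Objects;
import javax.swing.tree.DefaultMutableTreeNode;

// Illustrative sketch; class and method names are hypothetical, not part of Coogle.
final class OrderedIsomorphismSketch {

    /** Creates a labelled node and appends the children in the given order. */
    static DefaultMutableTreeNode node(Object label, DefaultMutableTreeNode... children) {
        DefaultMutableTreeNode n = new DefaultMutableTreeNode(label);
        for (DefaultMutableTreeNode child : children) {
            n.add(child);
        }
        return n;
    }

    /**
     * Returns true if the ordered trees rooted at a and b are isomorphic:
     * the roots correspond and their children correspond pairwise, in order.
     * Labels are only compared when compareLabels is set.
     */
    static boolean isomorphic(DefaultMutableTreeNode a, DefaultMutableTreeNode b,
                              boolean compareLabels) {
        if (compareLabels && !Objects.equals(a.getUserObject(), b.getUserObject())) {
            return false;
        }
        if (a.getChildCount() != b.getChildCount()) {
            return false;
        }
        for (int i = 0; i < a.getChildCount(); i++) {
            DefaultMutableTreeNode ca = (DefaultMutableTreeNode) a.getChildAt(i);
            DefaultMutableTreeNode cb = (DefaultMutableTreeNode) b.getChildAt(i);
            if (!isomorphic(ca, cb, compareLabels)) {
                return false;
            }
        }
        return true;
    }
}
```

Two trees of the same shape but with different labels are isomorphic only in the unlabelled sense, and swapping two children with different subtrees breaks the ordered isomorphism altogether.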
Figure 3.1: Left: A directed graph with five vertices and eight arcs. Right: A rooted tree with six nodes.

Figure 3.2: A complete bipartite graph between the nodes v1, v2 of T1 and the nodes w1, w2 and w3 of tree T2.

3.4.2 Using Ordered or Unordered Trees?

One important question arises when representing source code as trees: is ordering important? A Java compiler does not necessarily consider the order as important. The ordering of class body entities such as methods and field declarations is not relevant, whereas instructions in the bodies of these entities depend on the order of appearance in the source code. Any algorithm doing an ordered match is therefore a valid approach for matching abstract syntax trees, because we then simply assume the order of all instructions to be static. However, this fails to detect similarity for classes with changes in the order of entities. Using an algorithm for unordered trees fixes this limitation for top-level entities, but ignores the ordered structure of instructions in bodies of entities.

As FAMIX does not represent all instructions that can occur in the body of an entity, we prefer using algorithms that do an unordered tree match. However, efficient algorithms for unordered trees do not exist for all the chosen similarity measures. For example, computing the tree edit distance for unordered trees is MAX SNP-hard (described in [Zhang et al., 1992] and [Zhang and Jiang, 1994]), i.e., no algorithm is known that solves the problem in time polynomial in the size of the input trees. Also see Section 4.5.3 for other shortcomings of our implementation concerning unordered tree matching. For all these reasons, we implement unordered tree matching for bottom-up maximum common subtree only (see Sections 4.2.2 and 5.4.2).

3.4.3 Tree Isomorphism

The input for our similarity measures are Java classes which are represented by abstract syntax trees.
For analysing similarity, we search for isomorphism in those syntax trees and derive class similarity from the size of the matched trees, i.e., the number of nodes in the matched subtree. Tree isomorphism answers the question of whether one tree is isomorphic to another tree, checking two trees for equality. Equality of two nodes is determined by either comparing node labels (for isomorphism with labelled trees) or not comparing node labels (unlabelled tree isomorphism, which matches on structure only). See Figure 3.3 for an example of isomorphic trees. Subtree and maximum subtree isomorphism are more general cases of such tree isomorphism problems. We consider tree isomorphism for abstract syntax trees to be in the field of structural similarity analysis as the abstract syntax tree does not hold information on functionality.

Figure 3.3: Isomorphic ordered trees. Nodes numbered according to a preorder traversal. Figure taken from [Valiente, 2002].

Three different tree similarity measures are integrated into Coogle, namely bottom-up maximum common subtree, top-down maximum common subtree and the tree edit distance. We describe these algorithms in the following sections.

3.4.4 Bottom-up Maximum Common Subtree Isomorphism

General

This bottom-up maximum common subtree isomorphism algorithm is defined by [Valiente, 2002]. The goal of this algorithm is to find the largest isomorphic subtree common to two given trees. The algorithm described is applicable to both ordered and unordered trees with minor changes.

The problem of finding a bottom-up maximum common subtree of an ordered or unordered tree T1 = (V1, E1) and another ordered or unordered tree T2 = (V2, E2) can be reduced to the problem of partitioning the vertices V1 ∪ V2 of the trees into equivalence classes of bottom-up subtree isomorphism. Two nodes (in the same or different trees) are equivalent if the bottom-up subtrees rooted at them are isomorphic.
Then, the bottom-up subtree of T1 rooted at node v ∈ V1 is isomorphic to the bottom-up subtree of T2 rooted at node w ∈ V2 if and only if nodes v and w belong to the same equivalence class of bottom-up subtree isomorphism.¹ The equivalence classes of two trees are illustrated by Figure 3.4.

We determine the equivalence class of a given node by recursively building an isomorphism string consisting of the isomorphism codes of all children of the node. We then compare that isomorphism string to a collection of existing isomorphism strings. If the string is already in the collection, the current node's equivalence class is read from the collection. If the isomorphism string is not contained in the collection, we add it to the collection and assign a new equivalence class to the string.

Figure 3.4: Bottom-up maximum common subtree isomorphism equivalence classes for two ordered trees. Nodes are numbered according to the equivalence class to which they belong. Figure taken from [Valiente, 2002].

After collecting the equivalence classes of both trees, the algorithm searches for the biggest common equivalence class by using a queue with the size of the nodes (i.e., of the subtrees rooted at them) as priority. The first element in the queue is the node with the biggest size. This ensures that the matched subtree is indeed a maximum subtree. Figure 3.5 illustrates the bottom-up maximum common subtree for the sample trees in Figure 3.4. The maximum common subtree of the trees is highlighted in grey. Note that multiple nodes can have the same size and equivalence class. It is therefore possible to find multiple instances of a maximum common subtree in both trees.

The last step of this algorithm is to generate a mapping M ⊆ V1 × V2 of the nodes in the maximum common subtree of T1 and T2. See Figure 3.6 for such a mapping (for unordered trees). The procedure for generating this mapping is different for ordered and unordered trees and is outlined in the following sections.
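The partitioning into equivalence classes described above can be sketched in a few lines. This is a simplified illustration for ordered, unlabelled trees and not the implementation used in Coogle: the class BottomUpSketch and its members are hypothetical, the isomorphism string is assumed to be the comma-separated list of the children's class codes, and a plain scan over the shared classes replaces the priority queue. Only the size of the maximum common bottom-up subtree is returned; the node mapping is omitted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.swing.tree.DefaultMutableTreeNode;

// Illustrative sketch for ordered, unlabelled trees; names are hypothetical.
final class BottomUpSketch {

    /** Collection of isomorphism strings seen so far, with their equivalence class. */
    private final Map<String, Integer> classOfString = new HashMap<>();
    /** Size (number of nodes) of the subtrees belonging to each equivalence class. */
    private final Map<Integer, Integer> sizeOfClass = new HashMap<>();

    /** Creates an unlabelled node and appends the children in the given order. */
    static DefaultMutableTreeNode node(DefaultMutableTreeNode... children) {
        DefaultMutableTreeNode n = new DefaultMutableTreeNode();
        for (DefaultMutableTreeNode child : children) {
            n.add(child);
        }
        return n;
    }

    /** Assigns equivalence classes bottom-up and records every class found in 'seen'. */
    private int classify(DefaultMutableTreeNode n, Set<Integer> seen) {
        StringBuilder key = new StringBuilder();
        int size = 1;
        for (int i = 0; i < n.getChildCount(); i++) {
            int childClass = classify((DefaultMutableTreeNode) n.getChildAt(i), seen);
            key.append(childClass).append(',');
            size += sizeOfClass.get(childClass);
        }
        String isoString = key.toString();          // leaves yield the empty string
        Integer cls = classOfString.get(isoString);
        if (cls == null) {                          // unknown string: open a new class
            cls = classOfString.size() + 1;
            classOfString.put(isoString, cls);
            sizeOfClass.put(cls, size);
        }
        seen.add(cls);
        return cls;
    }

    /** Size of the largest bottom-up subtree that occurs in both trees. */
    static int maxCommonSubtreeSize(DefaultMutableTreeNode t1, DefaultMutableTreeNode t2) {
        BottomUpSketch sketch = new BottomUpSketch();
        Set<Integer> classesInT1 = new HashSet<>();
        Set<Integer> classesInT2 = new HashSet<>();
        sketch.classify(t1, classesInT1);
        sketch.classify(t2, classesInT2);
        int best = 0;
        for (int cls : classesInT1) {
            if (classesInT2.contains(cls)) {
                best = Math.max(best, sketch.sizeOfClass.get(cls));
            }
        }
        return best;
    }
}
```

Because both trees share one collection of isomorphism strings, two nodes from different trees receive the same class code exactly when the ordered subtrees rooted at them have the same shape.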
[Valiente, 2002] describes this algorithm for unlabelled trees only. We extended the algorithm to use labelled trees by assigning an integer value to each node type, i.e., -1 for the FAMIX element Access, -2 for AccessArgument and so on. The equivalence classes are then matched based on this value and the already defined equivalence class code. This solution is also suggested in [Valiente, 2000].

¹ Source: [Valiente, 2002], Section 4.3.3

Figure 3.5: Bottom-up maximum common subtree of two ordered trees (highlighted in grey). Figure taken from [Valiente, 2002].

Ordered Trees

Ordered trees are processed as outlined in the "General" section. The mapping of the matching subtree nodes of trees T1 and T2 is created with a recursive pass through all the nodes, starting at the roots of the maximum subtrees. Every procedure invocation compares the two given nodes for equality and, if they are equal, continues with processing their children in order. Matching nodes are added to a mapping M ⊆ V1 × V2, which then contains the resulting bottom-up maximum subtree of the ordered trees T1 and T2.

For two trees T1 and T2 with n1 and n2 nodes and n1 ≤ n2, the algorithm for ordered trees runs in O(n2 log n2) time using O(n1 + n2) additional space (see Theorem 4.56 in [Valiente, 2002]).

Unordered Trees

Two things need to be changed for applying the algorithm to unordered trees. First, during the collection of the equivalence classes of the trees, we now sort the child isomorphism codes of a node before searching for already existing code sequences in the equivalence class collection. This ensures that nodes whose children differ only in order are treated the same, i.e., as unordered. The second change happens during the mapping phase. The nodes of T1 are processed in preorder traversal with a non-recursive loop and the children of a node are mapped to the nodes from T2 with the same equivalence code, ignoring the ordering.
Figure 3.6 illustrates such a mapping. This bottom-up maximum common subtree algorithm for unordered trees T1 and T2 with n1 and n2 nodes runs in O((n1 + n2)^2) time using O(n1 + n2) additional space (see Theorem 4.60 in [Valiente, 2002]).

Figure 3.6: Bottom-up maximum common subtree of two unordered trees (highlighted in grey). The dashed arrows depict the mapping of corresponding nodes. Figure taken from [Valiente, 2002].

3.4.5 Top-down Maximum Common Subtree Isomorphism

General

[Valiente, 2002] defines a top-down maximum common subtree isomorphism for ordered and unordered trees. The goal of this algorithm is to find the largest common subtree of two given trees under the prerequisite that the subtree is rooted at the root nodes of the trees. The differences between the algorithm for ordered trees and the algorithm for unordered trees are fundamental. Both algorithms are described separately in the following two sections.

Ordered Trees

Starting from the root nodes of T1 and T2, the algorithm recursively processes all children in preorder and compares each pair of nodes for equality. If two nodes match, they are added to a mapping M ⊆ V1 × V2 which contains the complete subtree after the recursion finishes. See Figure 3.7 for an illustration of a top-down maximum common subtree of two ordered trees.

The comparison of the nodes during the recursive processing allows for an extension of the algorithm to labelled trees as well, returning a successful match only when the labels match. See Section 4.3.4 for a description of the comparator pattern used.

This algorithm is very efficient, with a running time of O(n1) and O(n1) additional space for two ordered trees T1 and T2 with n1 and n2 nodes, where n1 ≤ n2 (see Lemma 4.52 in [Valiente, 2002]).

Figure 3.7: Top-down maximum common subtree of two ordered trees (highlighted in grey). Figure taken from [Valiente, 2002].
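The recursive processing for ordered trees can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Coogle implementation: the class TopDownOrderedSketch and its methods are hypothetical, node equality is reduced to label equality via equals(), and the i-th child of one node is assumed to be paired with the i-th child of the other node, up to the smaller child count.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import javax.swing.tree.DefaultMutableTreeNode;

// Illustrative sketch for ordered, labelled trees; names are hypothetical.
final class TopDownOrderedSketch {

    /** Creates a labelled node and appends the children in the given order. */
    static DefaultMutableTreeNode node(Object label, DefaultMutableTreeNode... children) {
        DefaultMutableTreeNode n = new DefaultMutableTreeNode(label);
        for (DefaultMutableTreeNode child : children) {
            n.add(child);
        }
        return n;
    }

    /**
     * Matches the subtrees rooted at a and b top-down and returns the number of
     * matched nodes. Matched pairs are recorded in the mapping m. A pair that
     * does not match contributes nothing, and its descendants are not visited.
     */
    static int match(DefaultMutableTreeNode a, DefaultMutableTreeNode b,
                     Map<DefaultMutableTreeNode, DefaultMutableTreeNode> m) {
        if (!Objects.equals(a.getUserObject(), b.getUserObject())) {
            return 0;
        }
        m.put(a, b);
        int matched = 1;
        // Assumption: children are paired position by position, in their order.
        int common = Math.min(a.getChildCount(), b.getChildCount());
        for (int i = 0; i < common; i++) {
            matched += match((DefaultMutableTreeNode) a.getChildAt(i),
                             (DefaultMutableTreeNode) b.getChildAt(i), m);
        }
        return matched;
    }

    /** Convenience entry point: size of the top-down common subtree of t1 and t2. */
    static int commonSubtreeSize(DefaultMutableTreeNode t1, DefaultMutableTreeNode t2) {
        return match(t1, t2, new LinkedHashMap<>());
    }
}
```

Every node of the smaller tree is visited at most once, which is in line with the linear running time stated above.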
Unordered Trees

Figure 3.8 illustrates a top-down maximum common subtree of two unordered trees. This algorithm is fundamentally different from the top-down maximum common subtree isomorphism algorithm for ordered trees. The search is performed by recursively solving weighted bipartite matching problems (see Figure 3.2 for an example of such a graph) for all children of two given nodes. Starting with the children of both roots and recursively calculating the size of the subtree, matching graphs are built for each corresponding node level. The weighting of the arcs is derived from each node's subtree size. The maximum path along the weighted edges in the bipartite graphs then runs through the nodes which are part of the maximum common subtree. For extending the algorithm to labelled trees, we assign a weight of 0 to non-matching nodes in the matching graphs, which ensures that such a path is not considered.

For two trees T1 and T2 with n1 and n2 nodes respectively (and n1 ≤ n2), the algorithm runs in O((n1 + n2)(n1 n2 + (n1 + n2) log(n1 + n2))) time using O(n1 n2) additional space (see Lemma 4.44 in [Valiente, 2002]). Note that there exists a faster algorithm for unordered top-down subtree isomorphism ([Shamir and Tsur, 1997]).

Figure 3.8: Top-down maximum common subtree of two unordered trees (highlighted in grey). Figure taken from [Valiente, 2002].

3.4.6 Tree Edit Distance

General

Calculating the tree edit distance is a completely different approach to tree analysis than the maximum common subtree isomorphism algorithms. The tree edit distance algorithm answers the question of how many steps it takes to transform one tree into another tree by applying a set of edit operations to the trees (adding, deleting and replacing nodes). This algorithm as described in [Valiente, 2002] is applicable to rooted ordered trees only. The problems with unordered trees are outlined in the last subsection of this section.
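As a rough preview of the ordered case described next, the following sketch computes a least-cost transformation on a grid-shaped edit graph. It rests on several assumptions of ours that should be checked against [Valiente, 2002]: each tree is given as the array of its node depths in preorder, a substitution is only allowed between nodes of equal depth, a deletion is only allowed while the next node of T1 is at least as deep as the next node of T2 (and the reverse for insertions), deletions and insertions cost 1 and substitutions cost 0. Since the edit graph is acyclic, a simple dynamic programme is used here in place of a general shortest path algorithm. The class EditGraphSketch is hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch; the arc conditions on node depth are assumptions (see lead-in).
final class EditGraphSketch {

    /**
     * d1[i] and d2[j] hold the depths of the nodes of T1 and T2 in preorder.
     * dist[i][j] is the least cost of having processed the first i nodes of T1
     * and the first j nodes of T2.
     */
    static int editDistance(int[] d1, int[] d2) {
        int n1 = d1.length;
        int n2 = d2.length;
        final int INF = Integer.MAX_VALUE / 2;
        int[][] dist = new int[n1 + 1][n2 + 1];
        for (int[] row : dist) {
            Arrays.fill(row, INF);
        }
        dist[0][0] = 0;
        for (int i = 0; i <= n1; i++) {
            for (int j = 0; j <= n2; j++) {
                int cur = dist[i][j];
                if (cur >= INF) {
                    continue;                       // vertex not reachable
                }
                // Vertical arc: delete the next node of T1 (cost 1).
                if (i < n1 && (j == n2 || d1[i] >= d2[j])) {
                    dist[i + 1][j] = Math.min(dist[i + 1][j], cur + 1);
                }
                // Diagonal arc: substitute nodes of equal depth (cost 0).
                if (i < n1 && j < n2 && d1[i] == d2[j]) {
                    dist[i + 1][j + 1] = Math.min(dist[i + 1][j + 1], cur);
                }
                // Horizontal arc: insert the next node of T2 (cost 1).
                if (j < n2 && (i == n1 || d1[i] <= d2[j])) {
                    dist[i][j + 1] = Math.min(dist[i][j + 1], cur + 1);
                }
            }
        }
        return dist[n1][n2];
    }
}
```

For example, a root with one child is described by the depths {0, 1}, and transforming it into a single root node {0} requires one deletion.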
Ordered Trees

The tree edit distance algorithm as defined in [Valiente, 2002] has three different elementary edit operations. For the ordered trees T1 = (V1, E1) and T2 = (V2, E2) we denote a deletion of a leaf node v ∈ V1 by v ↦ λ or (v, λ). The substitution of a node w ∈ V2 for a node v ∈ V1 is denoted by v ↦ w or (v, w), and an insertion into T2 of a node w ∈ V2 as a new leaf is denoted by λ ↦ w or (λ, w). Deletion and insertion operations are made on leaves only. The deletion of a non-leaf node requires first the deletion of the whole subtree rooted at the node. The same applies to the insertion of non-leaves.²

A tree is transformed into another tree by using a sequence of elementary edit operations as illustrated in Figure 3.9. Note that in this figure, substitution of corresponding nodes is not indicated. The complete transformation script is: [(v1, w1), (v2, w2), (v3, λ), (v4, λ), (v5, w3), (λ, w4), (λ, w5), (λ, w6), (λ, w7)].

Figure 3.9: Transformation between two ordered trees. Figure taken from [Valiente, 2002].

Not every sequence of edit operations denotes a valid transformation between two trees. Deletions and insertions must appear in bottom-up order to ensure that these operations are only made on leaves. A postorder traversal, for example, ensures this condition. Further, substitutions must preserve parent and sibling order. This means that the parent of a substituted non-root node must be substituted by the parent of the non-root node the substitution was made for. Also, the substitution of sibling nodes in T1 must preserve the order among the siblings by substituting the nodes with sibling nodes from T2. These conditions are ensured by defining that depth[v] = depth[w] for all (v, w) ∈ M, where M is a mapping between the two trees.

Costs are assigned to all elementary edit operations. The standard implementation uses a cost of γ(v, w) = 1 if v = λ or w = λ and γ(v, w) = 0 otherwise.
With such weights, substitutions are free, whereas each deletion or insertion of a node costs 1. The edit distance is then the cost of a least-cost transformation of T1 into T2.

Valiente's approach to calculating the edit distance is to build a graph with the nodes of both trees. The edges in the graph denote different operations with their assigned weights. Figure 3.10 illustrates such an edit graph with the shortest path in bold. Finding the least-cost transformation is then reduced to the problem of finding the shortest path from the upper left corner down to the lower right corner. Vertical arcs of the form (v_i w_j, v_{i+1} w_j) represent the deletion of node v_{i+1} from T1; diagonal arcs (v_i w_j, v_{i+1} w_{j+1}) represent the substitution of node w_{j+1} of T2 for node v_{i+1} of T1. Finally, a horizontal arc like (v_i w_j, v_i w_{j+1}) represents the insertion of node w_{j+1} into T2. Dijkstra's shortest path algorithm [Dijkstra, 1959] is used for calculating the shortest path and determining the edit operations needed for the transformation of the two trees.

Figure 3.10: Shortest path in the edit graph of two ordered trees. Figure from [Valiente, 2002].

Finding the least-cost transformation of an ordered tree T1 to an ordered tree T2 by determining shortest paths in an edit graph runs in O(n1 n2) time using O(n1 n2) additional space (see Lemma 2.20 in [Valiente, 2002]).

² Source: [Valiente, 2002], Section 2.1

Unordered Trees

Tree edit distance calculation for unordered trees is MAX SNP-hard, as [Zhang et al., 1992] and [Zhang and Jiang, 1994] showed. An efficient implementation is therefore not available. Solutions for constrained trees with a fixed maximum number of children exist, for example in [Zhang, 1996], but are not applicable to our needs because our trees have an unbounded number of children.

Chapter 4 Implementation

This chapter describes Coogle, our Eclipse plug-in implementing various similarity measuring algorithms to find similarities between Java classes.
We describe the architecture of Eclipse and how our plug-in integrates into it. In addition, we shed light on problems encountered during plug-in development.

4.1 Eclipse Architecture

As Coogle is an Eclipse plug-in, we first describe the Eclipse platform in general and then how plug-ins for Eclipse integrate into the Eclipse architecture.

4.1.1 Eclipse Platform

The functionality of Eclipse is based on the concept of extensions, so-called plug-ins. The core of the Eclipse product, the "Eclipse platform", provides the framework and services for all these extensions. Thus, the platform is the runtime environment for dynamically loading, integrating and executing plug-ins (see the Eclipse project: http://www.eclipse.org). Figure 4.1 gives an overview of the Eclipse architecture. These are the most important components:

Workbench. This is part of the platform UI component and provides the main user interface of Eclipse. It coordinates and presents all tools integrated into the platform.

Standard Widget Toolkit (SWT). SWT is an operating-system-independent widget toolkit providing an API for the native user-interface facilities.

JFace. This is part of the platform UI component and provides classes for many common UI programming tasks. It is designed to be window-system-independent and uses SWT widgets for its common UI tasks.

The Java Development Tools (JDT) and the Plug-in Development Environment (PDE) are plugged into this basic platform. Both tools add a number of views, wizards and editors to Eclipse. Without these plug-ins, Eclipse does not know about Java and plug-in development. The basic Eclipse platform plus JDT and PDE together build the Eclipse Software Development Kit (Eclipse SDK).

There are more plug-ins for other programming tasks, for example the C/C++ Development Tools (CDT) or the Graphical Editing Framework (GEF). Both CDT and GEF plug into Eclipse using the same interface as the standard SDK components.
Developing a plug-in for Eclipse means extending the platform in the same way as the standard Eclipse components, like the JDT or the CDT, do. All the basic tasks, such as loading and unloading the plug-in, displaying dialogues and interacting with the user, are already built into the platform and are simply extended or invoked by plug-ins when needed.

Figure 4.1: Eclipse platform architecture with its main components and plug-ins.

4.1.2 Abstract Syntax Tree Representation in Eclipse

The abstract syntax tree (AST) used in Eclipse is represented by the classes defined in the package org.eclipse.jdt.core.dom.² We outline the most important parts of this set of classes that model the source code of a Java program as a structured document, i.e., as a tree.

Listing 4.1 shows the root element of an Eclipse AST: the CompilationUnit. This AST node type represents a Java source file including package and import declarations. The actual body of a class is represented by the AST node TypeDeclaration. A TypeDeclaration can either be a ClassDeclaration or an InterfaceDeclaration as shown in Listing 4.2. The class TypeDeclaration defines various methods for enumerating the fields or the methods (for example getFields(), which returns a FieldDeclaration[], or getMethods(), which returns an array of MethodDeclarations). Each MethodDeclaration contains multiple Statements and Expressions whose subtypes represent the basic Java syntax. Subtypes of Statement are for example:

• IfStatement: Represents the structure of an if construct.
• VariableDeclarationStatement: Contains information on variable declarations including the name of the variable, the type and possible initialisation statements.
• ReturnStatement: Represents the return instruction including a possible return value as Expression.

Examples of basic Expression subtypes are the following:

• NumberLiteral: Represents numeric literals, for example values of Java's primitive integer types.
• FieldAccess: This is used for all accesses to fields.
• MethodInvocation: Represents an invocation of a method.

² Eclipse API: http://help.eclipse.org/help31/index.jsp

FAMIX cannot represent all AST node types. The types that are represented in the FAMIX model are depicted in Figures 4.3, 4.4 and 4.5. Consult the Eclipse API for an exhaustive list of AST elements.

    CompilationUnit:
        [ PackageDeclaration ]
            { ImportDeclaration }
            { TypeDeclaration | EnumDeclaration | AnnotationTypeDeclaration | ; }

Listing 4.1: Java CompilationUnit AST node type. This is the type of the root of an AST.

    TypeDeclaration:
        ClassDeclaration
        InterfaceDeclaration
    ClassDeclaration:
        [ Javadoc ] { ExtendedModifier } class Identifier
            [ < TypeParameter { , TypeParameter } > ]
            [ extends Type ]
            [ implements Type { , Type } ]
            { { ClassBodyDeclaration | ; } }
    InterfaceDeclaration:
        [ Javadoc ] { ExtendedModifier } interface Identifier
            [ < TypeParameter { , TypeParameter } > ]
            [ extends Type { , Type } ]
            { { InterfaceBodyDeclaration | ; } }

Listing 4.2: TypeDeclaration AST node type. A type declaration is the union of a class declaration and an interface declaration.

4.1.3 AST to FAMIX Mapping

Table 4.1 shows which AST node types are considered by Coogle and details the mapping of FAMIX elements to these AST nodes.

FAMIX Element FAMIXInstance AST node - Model - Package Class InheritanceDefinition Attribute Method FormalParameter LocalVariable PackageDeclaration TypeDeclaration FieldDeclaration MethodDeclaration SingleVariableDeclaration SingleVariableDeclaration ConstructorInvocation, SuperConstructorInvocation, ClassInstanceCreation, MethodInvocation, SuperMethodInvocation Invocation Access FieldAccess, SuperFieldAccess, SimpleName, QualifiedName Remarks Represents the top element of every FAMIX instance. Abstract construct containing metadata. Assigned to a Package. Assigned to a Class. Assigned to a Class. Assigned to a Method. Assigned to a Method. Assigned to a BehaviouralEntity. Assigned to a BehaviouralEntity. A SimpleName is any identifier other than a keyword, boolean expression or null literal. QualifiedName is in the format like "Name.SimpleName".

Table 4.1: FAMIX elements with their corresponding AST element.

4.2 Coogle Architecture

4.2.1 Project Package Structure

Figure 4.9 shows the main parts of the plug-in grouped by packages and describes how they interact. The following list details the most important packages from the Coogle plug-in:

ch.toe.coogle Classes in this package are the main classes for the user interface.

ch.toe.coogle.wizard These classes define the wizard pages and the wizard dialog. The package also contains the class used for collecting all Java classes in a selected Eclipse project (namely TypeExtractor).

ch.toe.coogle.model The model class used for passing information between the wizard pages and the project parser is defined in this package.

ch.toe.coogle.operation.generic Package containing the classes that are called for performing an operation (such as calculating the bottom-up maximum subtree isomorphism or the tree edit distance). These are not to be extended, only implemented by classes in the next package.

ch.toe.coogle.operation.classes One class per operation is defined in this package, implementing the relevant operation class from package ch.toe.coogle.operation.generic.

ch.toe.coogle.operation.dialog Classes used for displaying the results after the calculation has finished.

ch.toe.famix This package contains FAMIXInstance, the root of every Java representation of a FAMIX model tree. It also contains the visitor classes.
See Section 4.5.1 for more information about the visitor pattern.

ch.toe.famix.model
    This contains all the classes needed for representing the FAMIX model.

ch.toe.tree
    The class TreeUtil in this package defines useful methods for manipulating trees and searching elements in trees. Other utility classes are placed in here as well.

ch.toe.tree.calc
    All classes implementing tree similarity algorithms are packaged herein.

ch.toe.tree.comparator
    These are the default comparators used for evaluating node equality.

4.2.2 Plug-in Features

Coogle runs with Eclipse 3.1 and later and is written in Java 1.5. It is activated through the context menu of a Java project in the Eclipse workspace. For a detailed walkthrough on the usage of Coogle, see Appendix A. In its current state, the Coogle plug-in supports the following operations:

• Bottom-up maximum common subtree isomorphism (described in Sections 3.4.4 and 4.5.3)
    – for ordered, labelled and unlabelled trees.
    – for unordered, labelled and unlabelled trees (with the restrictions described in Section 4.6.2).
• Top-down maximum common subtree isomorphism (described in Sections 3.4.5 and 4.5.3)
    – for ordered, labelled trees.
• Tree edit distance (described in Sections 3.4.6 and 4.5.3)
    – for ordered, labelled trees.

4.3 Coogle Design

This section describes the design of the Coogle plug-in. First, we give an overview of the different components, then discuss our extensions to FAMIX, and conclude the section with the description of two design patterns that were used in the implementation.

4.3.1 Overview

Coogle has multiple components, as Figure 4.2 illustrates. The source code of a Java class is transformed three times before it is used for calculating the similarity measure. Coogle's main components are:

ASTParser of Eclipse. This processes Java source code into an abstract syntax tree as defined by the Eclipse API.

PatViz.
Parser that traverses an abstract syntax tree and builds a FAMIX representation from the nodes of the tree.

Tree visitor. This visitor visits each FAMIX node and creates a tree representation consisting of DefaultMutableTreeNodes.

Similarity measure. Different similarity measures are implemented by Coogle. All use a tree built of DefaultMutableTreeNodes as calculation basis. The output of the measures is then used for calculating the similarity of two given objects.

Figure 4.2: The steps of processing the source code of a Java class into a tree that can be used as input for the similarity measure.

4.3.2 FAMIX Extensions

In this section we describe our Java implementation of the FAMIX model and highlight the differences between our implementation and the original FAMIX definition, which is described in Chapter 2. A note on the figures in this section: functions and objects shaded in grey are additions that are not documented in the official FAMIX definition ([Tichelaar et al., 1999] and [Tichelaar, 1999]), but are extensions we made. Also, only the most important methods are included in each class.

FAMIX Element Object

The top parent of every element of FAMIX is the Object class, as shown in Figure 4.3. For illustration purposes, all children of Entity are left out of this figure, but are depicted in Figures 4.4 and 4.5, which are described later on. The only change made to the classes in Figure 4.3 is in InheritanceDefinition, to which we added status information about the represented relation type (for example, implements via interface pattern or extends by subclassing). Otherwise, all classes correspond to the original FAMIX model. The purpose of the accept() method is described in the section on the visitor pattern.

FAMIX Element StructuralEntity

Figure 4.4 shows the subclasses of StructuralEntity, which itself is a child of Entity.
The addition of methods such as isFinal() to the classes Attribute, LocalVariable and FormalParameter is documented in the FAMIX Java extension document [Tichelaar, 1999]. The class GlobalVariable is never used, as there is no concept of global variables in Java.

FAMIX Element BehaviouralEntity

The class BehaviouralEntity and its children Method and Function (the latter is never used in the FAMIX Java representation) were extended with the methods isFinal(), isSynchronized() and isNative() to represent the possible modifiers of a Java method, as described in [Tichelaar, 1999]. These classes are shown in Figure 4.5.

Figure 4.3: Java class diagram for the top level elements of the FAMIX model. Our extensions to the original FAMIX model are shaded in grey.

FAMIX Elements Package and Class

Figure 4.5 depicts the class diagram for the FAMIX Package and Class representation. We added an artificial, non-original FAMIX class called Context in between these classes and the Entity object. This allows us to avoid duplicated code and eases the use of both Package and Class while parsing the abstract syntax tree. Further, we extended Class with the methods isInterface(), isPublic(), isFinal() and isAbstract() to cover the allowed modifiers of a Java class (described in [Tichelaar, 1999]).

4.3.3 Tree Generation

To build the trees from the extracted FAMIX model, we use a visitor pattern. The visitor pattern is a design pattern used in object-oriented software development [Gamma et al., 1994]. It needs two different types of objects: a visitor and a visitable object. Each visitable object defines a method called accept() that recursively traverses all visitable children by calling their accept() method. The visitor is informed of each visit and builds the tree from this information. See Section 4.5.1 for the implementation details of this pattern.
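The interplay of visitor and visitable object described above can be illustrated with a minimal, self-contained sketch. It is not Coogle's code: the names Element, ElementVisitor and TreeBuilder are hypothetical stand-ins for the FAMIX model classes and the tree-building visitor, and the sketch assumes a single root element. Unlike the actual implementation, it attaches a node to its parent when the node is first visited rather than when the visit ends.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;

// Small demo: build a tree for a class with one attribute and one method.
class VisitorSketch {
    public static void main(String[] args) {
        Element cls = new Element("SampleClass")
                .add(new Element("attribute"))
                .add(new Element("method").add(new Element("invocation")));

        TreeBuilder builder = new TreeBuilder();
        cls.accept(builder);

        DefaultMutableTreeNode root = builder.getRoot();
        // prints "2 children, depth 2"
        System.out.println(root.getChildCount() + " children, depth " + root.getDepth());
    }
}

// Hypothetical visitor role: informed before and after an element's children are traversed.
interface ElementVisitor {
    void visit(Element e);
    void endVisit(Element e);
}

// Hypothetical visitable element; stands in for a FAMIX model class.
class Element {
    private final String name;
    private final List<Element> children = new ArrayList<>();

    Element(String name) {
        this.name = name;
    }

    // Adds a child and returns this element to allow chaining.
    Element add(Element child) {
        children.add(child);
        return this;
    }

    // Informs the visitor, then recursively passes it on to all children.
    void accept(ElementVisitor v) {
        v.visit(this);
        for (Element child : children) {
            child.accept(v);
        }
        v.endVisit(this);
    }

    @Override
    public String toString() {
        return name;
    }
}

// Visitor that mirrors the visited structure as a tree of DefaultMutableTreeNodes.
class TreeBuilder implements ElementVisitor {
    // Nodes whose visit has started but not yet ended; the innermost one is on top.
    private final Deque<DefaultMutableTreeNode> open = new ArrayDeque<>();
    private DefaultMutableTreeNode root;

    @Override
    public void visit(Element e) {
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(e);
        if (open.isEmpty()) {
            root = node;            // first visited element becomes the root
        } else {
            open.peek().add(node);  // attach to the element currently being traversed
        }
        open.push(node);
    }

    @Override
    public void endVisit(Element e) {
        open.pop();
    }

    DefaultMutableTreeNode getRoot() {
        return root;
    }
}
```

Keeping the traversal logic in accept() and the tree construction in the visitor means that further visitors can be added without touching the model classes.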
4.3.4 Node Comparison

We use the comparator pattern [Gamma et al., 1994] to extend our measures to labelled trees. This pattern is often applied to let implementing classes use their own way of comparing objects and establishing equality. An interface defines the comparator method that is to be implemented (this is usually compare(Object left, Object right)). The return value of the compare method is either a boolean or an integer expressing the relation between the two given objects. Our implementation is described in Section 4.5.2; see Appendix B.3 for a description of how to add new comparators.

Figure 4.4: Java class diagram for StructuralEntity with its subclasses.

4.3.5 Input Trees for Measures

Not all FAMIX elements are represented in the general trees we use as input for the similarity measures. The class building the tree, TreeBuildVisitor, does not include the following elements in the generated DefaultMutableTreeNode object, for the following reasons:

• Argument, i.e., AccessArgument and ExpressionArgument. These are not added because the Invocation they belong to is already included in the tree.
• Function. This element never appears in a FAMIX Java representation.
• InheritanceDefinition. Every Class has an InheritanceDefinition. Including it would therefore only produce an additional node in every generated tree, without improving the similarity measure.

4.4 Coogle Workflow

Figure 4.9 illustrates the workflow of a Coogle tree edit distance similarity search. The next two sections describe the process in general.

4.4.1 Invocation

Coogle is integrated into the context menu of Java projects in the Eclipse package explorer. When selecting the similarity search entry in the Coogle submenu, Eclipse loads the main plug-in class from CooglePlugin and launches the action defined in the class CoogleMainAction in the package ch.toe.coogle.action.
Which class is called for which action is defined in the plug-in's configuration file, plugin.xml, which also defines dependencies and integration points. CoogleMainAction creates the wizard with its pages and executes it. The wizard then collects all needed information from the user, such as the desired similarity measure and the object to be searched. Afterwards, it invokes the desired operation defined in package ch.toe.coogle.operation.classes. The operation class performs the similarity search and presents the results by using a dialog defined in ch.toe.coogle.operation.dialog. See Section 4.2.2 for a detailed description of the wizard and its functions.

Figure 4.5: Java class diagram for BehaviouralEntity and Context with their respective subclasses. Our extensions to the original FAMIX model are shaded in grey.

4.4.2 Similarity Search Process

Figure 4.6 illustrates the process that Coogle performs after invoking a similarity measure operation (as defined in ch.toe.coogle.operation). The source code representation of the selected class is converted into an abstract syntax tree by running ASTParser (defined by Eclipse) on this resource. Using the PatViz parser, all the relevant AST nodes are then transformed into a FAMIX representation. In this step, the so far correctly ordered abstract syntax tree is converted into a FAMIX tree whose order no longer corresponds to the order in which the statements appear in the source file. This leads to the problems described later in Section 4.6.2. After the creation of the FAMIX representation, a visitor is used for generating a general tree structure. See Section 4.5.1 for details on this visitor pattern. Finally, the similarity search is performed with the created general tree as input for the measure.

4.5 Coogle Implementation

4.5.1 Tree Generation

As described in Section 4.3.3, we need objects implementing ch.toe.famix.Visitor and objects implementing ch.toe.famix.Visitable.
For example, the FAMIX element Object in Figure 4.3 or Method in Figure 4.5 implement the Visitable interface and therefore define a method called accept(). Listing 4.3 shows a sample creation and invocation of a visitor building a tree. The accept() method in the Visitable class then iterates over the children of the class and recursively passes the visitor along to each child by invoking the corresponding accept() method. The visit() and endVisit() methods of the Visitor are invoked before and after the visitor is passed to the children, respectively. The code fragment in Listing 4.4 shows this process for the FAMIX object Class. A sample visitor implementation is illustrated in Listing 4.5. The methods visit() and endVisit() are called from the accept() method as described before. This specific implementation of the visitor pattern is used for building a tree from selected objects (namely all objects for which isTreeRelevantElement() is true).

    public static DefaultMutableTreeNode generateTree(FAMIXInstance instance) {
        TreeBuildVisitor v = new TreeBuildVisitor();
        instance.accept(v);
        return v.getRoot();
    }

Listing 4.3: Creates a new tree of a FAMIXInstance by using the visitor pattern. This method is defined in ch.toe.tree.TreeUtil.

Figure 4.6: The Coogle process: transformation of a Java source code file into a general tree structure via a FAMIX representation of the abstract syntax tree. Note the loss of ordering after parsing the tree into a FAMIX model.

4.5.2 Node Comparison

The class ch.toe.tree.comparator.ITreeComparator defines our comparator interface. Implementing classes need to define a method called compare(), receiving two parameters of type DefaultMutableTreeNode. The method compares these nodes for equality and returns true if node1 is equal to node2. The implementing comparator decides which characteristics of the nodes are used for equality comparison.
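The contract described above (two DefaultMutableTreeNode parameters, a boolean result) can be sketched as follows. This is only an illustration under assumptions: NodeComparator, SameTypeComparator and SameNameComparator are hypothetical names and not the classes shipped with Coogle, and the handling of missing user objects is a choice made for this sketch only.

```java
import javax.swing.tree.DefaultMutableTreeNode;

// Small demo comparing nodes that carry different kinds of user objects.
class ComparatorSketch {
    public static void main(String[] args) {
        DefaultMutableTreeNode a = new DefaultMutableTreeNode("getName");
        DefaultMutableTreeNode b = new DefaultMutableTreeNode("setName");
        DefaultMutableTreeNode c = new DefaultMutableTreeNode(Integer.valueOf(42));

        NodeComparator byType = new SameTypeComparator();
        NodeComparator byName = new SameNameComparator();

        System.out.println(byType.compare(a, b)); // both user objects are Strings
        System.out.println(byName.compare(a, b)); // the names differ
        System.out.println(byType.compare(a, c)); // String versus Integer
    }
}

// Hypothetical interface mirroring the described contract: two nodes in, boolean out.
interface NodeComparator {
    boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2);
}

// Treats two nodes as equal when their user objects have the same runtime type.
class SameTypeComparator implements NodeComparator {
    @Override
    public boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2) {
        Object left = node1.getUserObject();
        Object right = node2.getUserObject();
        if (left == null || right == null) {
            // Assumption of this sketch: nodes without user objects only equal each other.
            return left == right;
        }
        return left.getClass() == right.getClass();
    }
}

// Treats two nodes as equal when the string forms of their user objects match.
class SameNameComparator implements NodeComparator {
    @Override
    public boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2) {
        String left = String.valueOf(node1.getUserObject());
        String right = String.valueOf(node2.getUserObject());
        return left.equals(right);
    }
}
```

Because a similarity measure only depends on the interface, the notion of node equality can be exchanged without changing the measure itself.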
We provide three comparator implementations:

AlwaysTrueComparator
    This comparator returns true, i.e., equality, regardless of the characteristics of the nodes passed.

NameTreeComparator
    Compares the names of the given nodes and returns true if the names are equal. This can, for example, be extended to apply a Levenshtein similarity measure to the names of the nodes, returning equality when a certain similarity level is reached.

TypeTreeComparator
    The user objects of the given nodes, if they exist, are compared, and true is returned if both objects are of the same type.

4.5.3 Implemented Similarity Measures

This section describes the implementation of the selected similarity measures. See also Section 3.4 for the algorithmic description of the measures.

General

A prerequisite of the implementation is that the tree similarity measures operate on general tree structures. It is irrelevant for the measures whether the compared trees represent a FAMIX model or complete ASTs, as long as they are valid tree structures. For this reason, the context-neutral tree model class DefaultMutableTreeNode in package javax.swing.tree is used for representing trees. This allows creating trees whose nodes can contain specific user objects and an unlimited number of children. The user objects in our case are FAMIX elements. This tree model class also contains an implementation of preorder and postorder enumeration of the elements of the tree. A DefaultMutableTreeNode object is a root node or a child element, depending on its position in the tree.

Bottom-up Maximum Common Subtree Isomorphism

Valiente's bottom-up maximum common subtree algorithm as described in [Valiente, 2002] is implemented in ch.toe.tree.calc.CalculateBottomUpMaximumSubtree. This algorithm is applicable to ordered and unordered rooted trees. One difference in the implementation between ordered and unordered trees is the way the trees are mapped at the end of the calculation, when corresponding nodes are put into a map.
This is described in Section 3.4.4. The methods mapOrderedTrees(..) and mapUnorderedTrees(..) realise this functionality. mapTrees(..) automatically invokes the correct method depending on whether the input trees for CalculateBottomUpMaximumSubtree are ordered or unordered.

    public void accept(Visitor v) {
        v.visit(this);
        if (inheritance != null)
            inheritance.accept(v);
        [..]
        Iterator<Attribute> iterAttribute = this.getAttributes().iterator();
        while (iterAttribute.hasNext())
            (iterAttribute.next()).accept(v);
        Iterator<ImplicitVariable> iterImplicitVar = this.getImplicitVariables().iterator();
        while (iterImplicitVar.hasNext())
            (iterImplicitVar.next()).accept(v);
        Iterator<Method> iterMethod = this.getMethods().iterator();
        while (iterMethod.hasNext())
            (iterMethod.next()).accept(v);
        v.endVisit(this);
    }

Listing 4.4: accept() method from ch.toe.famix.model.Class, demonstrating the visitor pattern.

We extend the algorithm to labelled trees by assigning an integer value to each FAMIX node type (as also proposed in [Valiente, 2000]). The node type is then prepended to the list of isomorphism equivalence codes during the calculation of the equivalence classes for a node. See the method calculateEquivalenceClass(..) in CalculateEquivalenceClass for the relevant code.

The calculation is performed in the class CalculateBottomUpMaximumSubtree, which contains multiple constructors, the most detailed of which is displayed in Listing 4.6. The boolean parameters ordered and labeled are used to specify the nature of the given trees tree1 and tree2. A comparator object implementing ITreeComparator is used to compare nodes for equality. If no comparator is given, the default AlwaysTrueTreeComparator is used, effectively resulting in not comparing any node types. For an explanation of the comparator pattern, see Section 4.3.4.
    public CalculateBottomUpMaximumSubtree(DefaultMutableTreeNode tree1,
            DefaultMutableTreeNode tree2,
            ITreeComparator comparator,
            boolean ordered,
            boolean labeled)
        throws NullPointerException, TreeNodeTypeException

Listing 4.6: Most detailed constructor signature of CalculateBottomUpMaximumSubtree.

The calculation takes place during instantiation of the calculation class. Afterwards, the object can be queried for success by calling the method isCalculated(). The ArrayLists of the matched subtrees of tree1 and tree2 can be read with the methods getSubtreeRootNodesTree1() and getSubtreeRootNodesTree2(). The returned lists contain the root nodes of the matched maximum bottom-up subtree isomorphisms. There can be multiple roots, as the algorithm matches all the occurrences of the matched tree pattern. See Section 3.4.4 for the algorithmic description of this behaviour.

Finally, the method mapTrees(tree1, tree2) maps two given matched subtree root nodes either ordered or unordered, according to the ordered status of the calculation object. The resulting map is a one-to-one node mapping of the bottom-up maximum common subtree. See Section 3.4.4 for a more in-depth explanation. Figure 4.7 illustrates a one-to-one mapping of two ordered trees.

Figure 4.7: A bottom-up maximum common subtree isomorphism for two ordered trees (highlighted in grey). The dashed line represents the node mapping. The mapping is defined as M = {(v4, w2), (v5, w3), (v6, w4), (v7, w5), (v8, w6)}.

Top-down Maximum Common Subtree Isomorphism

Valiente's top-down maximum common subtree algorithm as described in [Valiente, 2002] is implemented in ch.toe.tree.calc.CalculateTopDownOrderedMaximumSubtree. This algorithm is applicable to rooted, ordered trees only. To make the implementation available for labelled trees, we again use the comparator construct (see Section 4.3.4).
The constructors in ch.toe.tree.calc.CalculateTopDownOrderedMaximumSubtree expect two trees and an optional ITreeComparator object. If no comparator is passed during instantiation, TypeTreeComparator is used, which returns equality for nodes with the same user object type. The method isCalculated() is used for querying the success of the calculation. If the calculation was successful, getMatchedTree1() and getMatchedTree2() return the resulting trees. The method getMappedTrees() returns a one-to-one mapping of the matched trees. This measure returns only a single subtree, as its root always has to correspond to the roots of the input trees. Figure 4.8 illustrates such a mapping.

Figure 4.8: A top-down maximum common subtree isomorphism for two ordered trees (highlighted in grey). The dashed line represents the node mapping. The mapping is defined as M = {(v1, w1), (v3, w2), (v4, w3), (v5, w4), (v6, w5), (v7, w6), (v8, w7)}.

Tree Edit Distance

We implement the tree edit distance algorithm for ordered, rooted trees detailed in [Valiente, 2002]. To extend the algorithm to labelled trees, the comparator pattern is applied again. We use an ITreeComparator to assign different costs to substitution paths between equal nodes and substitution paths between non-equal nodes. The JGraphT library3 provides an implementation of the standard shortest path algorithm by Dijkstra [Dijkstra, 1959]. We used the method DijkstraShortestPath(..) from JGraphT version 0.6.0 for calculating the tree edit distance (see Section 3.4.6).

3 JGraphT project site: http://jgrapht.sourceforge.net/

Listing 4.7 shows the available parameters for a tree edit distance calculation. The parameters tree1 and tree2 denote the trees for which the edit distance is calculated. The parameter pathLengthLimit is used for limiting the path length to a maximum (the default is unlimited path length).
By using weightInsert, weightDelete and weightSubstitute, different weights for the main operations insert, delete and substitute can be specified. The fourth weight parameter, weightSubstituteEqual, is used for substitution paths between nodes for which the comparator returns equality. After the calculation has finished successfully (queried using the method isCalculated()), the tree edit distance is returned as a double value by getTreeEditDistance().

The biggest edit distance (also known as worst case edit distance) can be calculated in different ways. The simplest method is summing the number of nodes in tree1 and in tree2. This represents deleting all nodes of tree1 and inserting all nodes of tree2 as new nodes. We need this worst case edit distance for ranking the results as described in Section 5.3.1. Other, more complicated approaches for calculating a worst case edit distance exist; please refer to the relevant methods in CalculateTreeEditDistance for more information.

4.6 Discussion and Problems

4.6.1 FAMIX

One-way Parent-Child Relation

In the original FAMIX model, only children know about their parent. Consider for instance the following example: in a top-down Java representation, classes contain methods. In FAMIX, however, only the child Method has a belongsTo() method; a parent such as a Class does not know about its children. For our implementation we need links that can be followed from top to bottom, i.e., from the parents (root) to their children (leaves), for example when traversing the syntax tree with a visitor as described in the previous section. To circumvent this limitation, we add collections (java.util.Set) to all objects having child objects. This extension allows parents to enumerate all their children and therefore allows an efficient top-down traversal.

No Representation of Low-level Elements

FAMIX does not model low-level elements of abstract syntax trees.
For example, constructs such as an IfStatement or an Assignment, and mathematical operations in general, have no representation. We discuss the consequences of this information loss in Section 5.4.2.

Various Representation Problems

As described in [Tichelaar, 1999] and the previous section, the FAMIX model does not have representations for all Java objects. In our implementation we additionally left out or reinterpreted the following types of the Eclipse syntax tree:

AnonymousClassDeclaration
    The Eclipse type AnonymousClassDeclaration is used for an anonymous class embedded in code. It can occur either in the body of a class or a method. FAMIX, however, has the limitation of only allowing classes and not methods as the parent of a Class. We circumvent this limitation by adding anonymous classes to the parent Class.

TypeDeclarationStatement and EnumDeclaration
    A TypeDeclarationStatement is a local type declaration which can occur inside any Statement. An EnumDeclaration is a new type introduced with Java 1.5. We do not add these two Eclipse elements to our trees, as FAMIX lacks support for them.

Moreover, the following FAMIX elements are not parsed and represented:

TypeCast
    There is no need for a representation of this element from a code similarity point of view. Our FAMIX implementation nevertheless contains the object TypeCast, which defines the needed behaviour.

SourceAnchor
    Every element in FAMIX can have a source code reference. As we do not need this assignment for calculating our measures, we left out this functionality in our parser. The class SourceAnchor is defined in our FAMIX implementation nevertheless.

4.6.2 Similarity Measures

Bottom-up Maximum Common Subtree Isomorphism

Although we implement a matching for unordered trees, a search using this algorithm does not yield different results than a search with the bottom-up maximum common subtree isomorphism for ordered trees.
The reason for this lies in the way the FAMIX representation of the code is generated. We reused the code of the PatViz plug-in4 for this task. PatViz is a project that parses the abstract syntax tree representation in Eclipse and generates an RSF (Rigi Standard Format) representation of it. We took the parser and refactored the code to produce a FAMIX tree instead. However, the PatViz plug-in generates ordered trees from the syntax trees, i.e., the elements are ordered by type (first all class attributes, then the constructors and finally all methods) and not by their actual positions in the source. Therefore, the generated FAMIX tree is always an ordered tree. See Figure 4.6 for an example of such a generated tree and note the order of the elements in the tree representation, which does not match the order of the abstract syntax tree.

4 Software project written by Wolfgang Schuh at the Vienna University of Technology.

Top-down Maximum Common Subtree Isomorphism

Although there exists an algorithm for unordered top-down maximum common subtree matching (described in [Valiente, 2002]), it is currently not implemented in Coogle, for the same reasons detailed in the previous section for the bottom-up maximum subtree isomorphism algorithm.

Tree Edit Distance

The algorithmic problems of unordered tree edit distance calculation are detailed in Section 3.4.6. Therefore, we have no implementation for an unordered edit distance measure.

    [..]
    public void visit(Object o) {
        if (isTreeRelevantElement(o))
            preVisit(o);
    }

    public void endVisit(Object o) {
        if (isTreeRelevantElement(o))
            postVisit();
    }

    private void preVisit(Object o) {
        // create new node for the currently visited object
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(o);
        if (root == null)
            root = node;
        stack.push(node);
    }

    private void postVisit() {
        // child is the currently visited object
        DefaultMutableTreeNode child = stack.pop();
        if (!stack.isEmpty()) {
            // add child to parent node already on stack
            DefaultMutableTreeNode node = stack.pop();
            node.add(child);
            stack.push(node);
        } else
            stack.push(child);
    }
    [..]

Listing 4.5: Sample visitor implementation used for building a tree of all relevant FAMIX elements. This is the implementation as used by TreeBuildVisitor.

    public CalculateTreeEditDistance(DefaultMutableTreeNode tree1,
            DefaultMutableTreeNode tree2,
            ITreeComparator comparator,
            Double pathLengthLimit,
            Double weigthInsert,
            Double weigthDelete,
            Double weigthSubstitute,
            Double weigthSubstituteEqual)
        throws NullPointerException, TreeNodeTypeException

Listing 4.7: Most detailed constructor signature of CalculateTreeEditDistance.

Figure 4.9: The complete workflow of a Coogle similarity search. The workflow starts in the class CoogleMainAction and finishes when the results are displayed in the EditDistanceResultDialog. These are the steps performed after the invocation of the tree edit distance operation:
(a) Eclipse's ASTParser is invoked for creating the abstract syntax tree of the objects we are comparing.
(b) The PatViz parser is invoked and
(c) extracts the FAMIX representation from the AST.
(d) TreeBuildVisitor is used to create the general tree consisting of DefaultMutableTreeNodes.
(e) Calculation of the tree edit distance using the generated tree from step (d).
(f) Display results in EditDistanceResultDialog.

Chapter 5 Evaluation

This chapter describes the evaluation of the implemented similarity measures.
After a short overview of the chosen approach, we detail the results of the two-part analysis. In the first part we analyse constructed Java classes, in the second part a real world Java project, the compare plug-in of Eclipse. In each part, special cases, important findings and shortcomings are highlighted. We close the chapter by discussing and comparing the efficiency of the different similarity measures based upon the results.

5.1 Approach

The analysis was done in two parts. First, we built test cases for often recurring refactoring patterns and analysed similarity detection for these constructed sets of changes. This allows us to analyse how specific changes affect structural similarity. In the second part we took a sample Java project as basis for our similarity measures. We used the compare plug-in (org.eclipse.compare)1 as sample project and measured the internal similarity of the classes. The results shed light on the efficiency of the measures for detecting structural similarities in a project.

1 Eclipse compare project site: http://dev.eclipse.org/viewcvs/index.cgi/%7Echeckout%7E/platform-compare-home/main.html

5.2 Analysis Objects

5.2.1 Constructed Test Cases

Overview

We take often recurring refactoring patterns as basis for the construction of our test cases. The complete code for these test cases can be found on the accompanying CD-ROM; the most important snippets are included in the following sections. The class AzureusCoreImpl from the Azureus project2 was used as base class for the tests (except in test case E). See Appendix D.1 for a complete listing of this class. The requirements for the base class were the use of both "normal" and static attributes and methods. Additionally, it needed to define getter and setter methods for its attributes.

2 Azureus project site: http://azureus.sourceforge.net/

In every test case, we define an empty class as control construct. This control class provides us with information about the relative
similarity of the results and is an indicator for dissimilarity. Please see Appendix C.2 for detailed information about the structure of the test case projects.

Note on the tree edit distance algorithm: the weights associated with the edit operations were one cost unit for node insertions and deletions and zero cost for substituting nodes.

Test Case A: Add Constructor to a Class

Adding a new constructor amounts to adding a method with parameters and invocations in its body. Our test case adds a new constructor with a single this() invocation as body. The following constructor code has been added to the sample class:

    protected AzureusCoreImpl(String str) {
        this();
    }

Listing 5.1: Test case A: Code of the added constructor

Test Case B: Add Attribute to a Class

When using this refactoring pattern, getter and setter methods for the new attribute are added, too. We do this in our change set as well. The rest of the class is left untouched. This is the added code:

    private String test;
    [..]
    public String getTest() {
        return test;
    }

    public void setTest(String test) {
        this.test = test;
    }

Listing 5.2: Test case B: Code for an added attribute

Test Case C: Add Invocation to a Method

We insert an invocation into an existing method. For verifiability, two separate test classes are created with the added code in two different methods, but with the same invocation target. Which method is invoked does not matter, as the measures do not take the target of the invocation into consideration. We invoke getLocaleUtil(), defined in the test class itself.

Test Case D: Method Extraction

During a method extraction, code is moved from an existing method into a new method and an invocation of the extracted method is added to the original method. This is often used to remove duplicated code or when pulling up code into parents. We implement the change by replacing the code of the constructor with an invocation of a new private method constructorCall().
This listing shows the code that was added (lines prefixed with "+"):

      protected AzureusCoreImpl() {
    +     constructorCall();
    + }
    +
    + private void constructorCall() {
          COConfigurationManager.initialise();
          LGLogger.initialise();
          AEDiagnostics.startup();

Listing 5.3: Test case D: Extract the code of a method into a new method.

Test Case E: Implement Interface

The interface programming pattern is one of the most important design patterns in object oriented programming [Gamma et al., 1994]. This test case measures the similarity between classes implementing the same interface. As test object we use the interface RateControlledEntity of the Azureus project. See Appendix D.2 for a complete listing of the class. The implementors on which we search for similarity are defined in the same package as RateControlledEntity.

5.2.2 Real World Example: org.eclipse.compare

The compare plug-in of Eclipse is used as sample Java project for a real world similarity measure test. The analysis demonstrates the ability to use the implemented similarity measures in a non-laboratory environment and critically highlights shortcomings. We use version 3.1.0 of the project. We choose two classes from the project and analyse the similarity for those objects. This is because Coogle in its current state is not able to analyse the similarity of every class to every other class in the project in a single step. An outer loop, serially processing all classes of a project, needs to be added for realising this type of analysis.

5.3 Results

5.3.1 Ranking the Matches

Displaying result data raises one important question: how do we rank the matches? Put differently, how similar is the matched object to the search pattern? The following two sections explain the ranking algorithm used for the two subtree matching measures and the tree edit distance calculation.
Subtree Matching

We have these numbers available as parameters for our algorithms: $s_s$ denotes the size (number of nodes) of $\mathit{tree}_s$, which is our search tree, and $s_x$ is the size of $\mathit{tree}_x$, the tree of the class currently being matched. Further, $\mathit{tree}_m$ stands for the matched subtree with its size $s_m$. An effective ranking algorithm needs to follow these rules:

• The more of $\mathit{tree}_s$ is matched, the better the ranking. This is expressed with $s_m / s_s$.
• Small elements, consisting of a few nodes only, that are completely matched receive a lower ranking. This is taken into account by weighting with $s_s$ and $s_x$.

We experimented with different possibilities for the ranking algorithm. Finally, we decided to use a solution also described in [Baxter et al., 1998], where the following formula is defined:

$$\mathit{similarity} = \frac{2S}{2S + L + R}$$

with $S$ = number of shared nodes, $L$ = number of different nodes in $\mathit{tree}_s$ and $R$ = number of different nodes in $\mathit{tree}_x$. With $S = s_m$, $L = s_s - s_m$ and $R = s_x - s_m$, the denominator becomes $s_s + s_x$, so the measure can be simplified and expressed with the corresponding variables of our input data:

$$r_{\mathit{subtree}} = \frac{2 s_m}{s_s + s_x}, \quad (0 < r_{\mathit{subtree}} \le 1)$$

All the results in Sections 5.3.2 and 5.3.3 are ranked using this formula for similarity.

Tree Edit Distance

As with subtree matching, the input for the ranking consists of the two trees, $\mathit{tree}_s$ and $\mathit{tree}_x$, which likewise represent the search tree and the tree we calculate the distance to. For this ranking we assume a worst-case edit distance of deleting all the nodes from $\mathit{tree}_s$ and inserting all nodes from $\mathit{tree}_x$ as new nodes. The worst-case length of an edit distance between the two trees is then defined as the sum of the sizes of both trees: $w_{s \to x} = s_s + s_x$. We convert this dissimilarity to a similarity measure $r_{\mathit{editdistance}}$, where $0 \le r_{\mathit{editdistance}} \le 1$, by using the following ranking formula for tree edit distance results:

$$r_{\mathit{editdistance}} = \frac{w_{s \to x} - d_x}{w_{s \to x}}$$

$d_x$ denotes the calculated edit distance of $\mathit{tree}_x$ to $\mathit{tree}_s$.
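The two ranking formulas above can be written down as a small Java sketch. The class and method names (Ranking, subtree, editDistance) are hypothetical and not taken from Coogle's source. The example sizes are an assumption as well: the search tree has 145 nodes as reported in the result tables, and the modified class of test case A is taken to have 148 nodes (145 plus the three inserted nodes), a value inferred from the reported percentages rather than stated in the text.

```java
import java.util.Locale;

/** Illustrative helpers for the ranking formulas of Section 5.3.1 (names are hypothetical). */
final class Ranking {
    private Ranking() {}

    /** r_subtree = 2 * s_m / (s_s + s_x); all sizes are node counts. */
    static double subtree(int matchedSize, int searchSize, int candidateSize) {
        if (searchSize <= 0 || candidateSize <= 0) {
            throw new IllegalArgumentException("tree sizes must be positive");
        }
        return 2.0 * matchedSize / (searchSize + candidateSize);
    }

    /** r_editdistance = (w - d) / w with the worst case w = s_s + s_x. */
    static double editDistance(int distance, int searchSize, int candidateSize) {
        int worstCase = searchSize + candidateSize;
        if (worstCase <= 0) {
            throw new IllegalArgumentException("tree sizes must be positive");
        }
        return (double) (worstCase - distance) / worstCase;
    }

    public static void main(String[] args) {
        // Assumed sizes for test case A: 145 nodes before, 148 nodes after the change.
        System.out.println(String.format(Locale.ROOT, "bottom-up: %.2f%%", 100 * subtree(21, 145, 148)));
        System.out.println(String.format(Locale.ROOT, "top-down:  %.2f%%", 100 * subtree(78, 145, 148)));
        System.out.println(String.format(Locale.ROOT, "edit dist: %.2f%%", 100 * editDistance(3, 145, 148)));
    }
}
```

Under these assumed sizes the three values come out at roughly 14.33%, 53.24% and 98.98%, which agrees with the figures reported for test case A.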
5.3.2 Results with Constructed Test Cases

This section contains the relevant result data for the similarity measures with the constructed test cases. Each section has a table with the raw result data and an analysis of these results. We discuss the results in the following order of the measures: bottom-up maximum common subtree, top-down maximum common subtree and finally tree edit distance. We conclude this section with an overall assessment of the performance of the measures on such constructed test cases.

Test Case A: Add Constructor to Class

The addition of a new constructor is, in a tree based view, the addition of a new node at depth level two. Our example constructor has an invocation in its body, so the complete addition to the tree is a node with a single child, as shown by Figure 5.1.

Figure 5.1: Test Case A: Resulting tree after changing the class. Added tree elements in italic.

Table 5.1 contains the results of a bottom-up match on the trees before and after the modification. The matching tree for this modification has a size of 21 nodes. As the new method is added in between the existing methods, a bottom-up subtree match is not very effective.

The results are better when performing a top-down subtree match. Here the matched subtree has a size of 78 nodes and the changed class a similarity of 53.24%, as Table 5.2 illustrates. The top-down match leads to better results for this test case because parts which do not match will not stop the matching process, but are simply ignored (including their child nodes).

The best results for test case A are achieved by using the tree edit distance measure, see Table 5.3. The tree representing the modified class needs 3 edit steps, which results in a similarity of almost 99%.
The edit steps needed are the addition of the new method (first step) with its parameter (second step) and the invocation (third step) in the method body. For all three similarity measures we receive 1.37% similarity for our control class. This demonstrates the ranking measure's ability to discount small elements.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImpl     21                  14.33%
control.ControlClass             1                   1.37%

Table 5.1: Results case A (add constructor to class): Bottom-up maximum common subtree.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImpl     78                  53.24%
control.ControlClass             1                   1.37%

Table 5.2: Results case A (add constructor to class): Top-down maximum common subtree.

Class name                       Tree edit distance  Similarity
beforeChanges.AzureusCoreImpl    0                   100.00%
afterChanges.AzureusCoreImpl     3                   98.98%
control.ControlClass             144                 1.37%

Table 5.3: Results case A (add constructor to class): Tree edit distance.

Test Case B: Add Attribute to Class

The addition of a new variable including getter and setter methods results in a tree modification as shown in Figure 5.2. A new node for the attribute is inserted after the already defined variables and two method references are appended after the existing method definitions.

The result table for the bottom-up match, Table 5.4, contains almost the same results as the bottom-up matching for case A. The matched tree has a size of 21 and the similarity differs by less than a tenth of a percent. The difference is due to the slightly bigger size of the tree representation of afterChanges.AzureusCoreImpl. In both cases A and B, the same tree is matched as maximum subtree, namely start(), the biggest method in the class. The algorithm misses the
surrounding methods as a match, because the root nodes of the trees (which have all the methods as children) have different equivalence classes. This also occurs in several other test cases.

Figure 5.2: Test Case B: Resulting tree after changing the class. Added tree elements in italic.

When using a top-down match for this case, the results get even worse. The similarity of the changed class drops to 6.12% with a matched tree size of 9 nodes (see Table 5.5). A top-down subtree match is not effective, because the inserted variables prevent a better match by stopping the matching process as soon as such a variable node is encountered. This issue could be circumvented by using an unordered tree match. However, as described in Section 4.6.2, this is currently not possible.

Tree edit distance calculation yields the best results for this test case, as Table 5.6 shows. The four edit operations are for the variable declaration, the two added methods and the parameter of the setTest() method. Our control class again performed poorly with all three measures.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImpl     21                  14.29%
control.ControlClass             1                   1.37%

Table 5.4: Results case B (add attribute to class): Bottom-up maximum common subtree.

Test Case C: Add Invocation to Method

Using the bottom-up common subtree algorithm generates the same results as for cases A and B. Table 5.7 shows the same tree size of 21 for both afterChanges.AzureusCoreImpl and afterChanges.AzureusCoreImplA. The reason why the same tree is matched is the one given for case B.

The top-down subtree matching results in a similarity score of 99.66%.
The matching of beforeChanges.AzureusCoreImpl is complete; only the size increase due to the added invocations prevents the objects from getting a complete match.

The same score of 99.66% similarity is reached when using the tree edit distance measure. Table 5.9 displays a single edit operation needed to completely cover the objects. Evidently, this operation is the insertion of the invocation.

As afterChanges.AzureusCoreImpl and afterChanges.AzureusCoreImplA have coinciding matching size and similarity performance in all three measures, we conclude that the choice of method to which an invocation is added does not influence similarity.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImpl     9                   6.12%
control.ControlClass             1                   1.37%

Table 5.5: Results case B (add attribute to class): Top-down maximum common subtree.

Class name                       Tree edit distance  Similarity
beforeChanges.AzureusCoreImpl    0                   100.00%
afterChanges.AzureusCoreImpl     4                   98.64%
control.ControlClass             144                 1.37%

Table 5.6: Results case B (add attribute to class): Tree edit distance.

Test Case D: Method Extraction

For this test case we add the method constructorCall() to two separate classes, namely AzureusCoreImpl and AzureusCoreImplA, at different places. In AzureusCoreImpl the method immediately follows the constructor, whereas in AzureusCoreImplA the method is appended after the last method of the original class. Figure 5.3 shows the move of the invocations of the constructor to a separate method as done in AzureusCoreImpl.

Bottom-up subtree matching results in the same data we have seen in the previous cases. The method start() with tree size 21 is matched and returned as maximum subtree (see Table 5.10).

Using a top-down match shows a big difference between the classes AzureusCoreImpl and AzureusCoreImplA. Table 5.11 shows a similarity of 47.95% and 95.21% respectively. The reason for this is that we compare ordered trees when using top-down measuring.
By placing the added method constructorCall() in between the existing code, all the methods and invocations are matched with different, shifted methods of the same class (for example, the method addLifecycleListener(..) is matched with addListener(..)). This results in a smaller tree, as not all invocations of the methods can be matched with possibly smaller shifted "corresponding" methods. When adding the new method to the end of the class, as done in the test case class AzureusCoreImplA, we eliminate the shifting and therefore receive a much higher similarity match.

Figure 5.3: Test Case D: Resulting tree after changing the class. Added tree elements in italic.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImpl     21                  14.43%
afterChanges.AzureusCoreImplA    21                  14.43%
control.ControlClass             1                   1.37%

Table 5.7: Results case C (add invocation to method): Bottom-up maximum common subtree.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImpl     145                 99.66%
afterChanges.AzureusCoreImplA    145                 99.66%
control.ControlClass             1                   1.37%

Table 5.8: Results case C (add invocation to method): Top-down maximum common subtree.

The results for the tree edit distance measure can be found in Table 5.12. We see that fewer steps are required for the class in which constructorCall() was not moved to the end of the class, as the invocations only need to be relabeled. This can be done in 8 operations. For the class AzureusCoreImplA 14 operations are needed when deleting and re-adding all 7 invocations at
the end of the tree.

Class name                       Tree edit distance  Similarity
beforeChanges.AzureusCoreImpl    0                   100.00%
afterChanges.AzureusCoreImpl     1                   99.66%
afterChanges.AzureusCoreImplA    1                   99.66%
control.ControlClass             144                 1.37%

Table 5.9: Results case C (add invocation to method): Tree edit distance.

Test Case E: Implement Interface

Using our similarity measures for detecting classes that implement the same interface is not very successful, as Tables 5.13, 5.14 and 5.15 show. With both bottom-up and top-down common subtree isomorphism only minimal matching trees of a few nodes are found, indicating low similarity. Also, the tree edit distance for all classes (except for the control class, see below) is over 30 operations, which leads to a similarity below 30%.

The conclusion from these results is that matching with our measures does not make sense for such a test case. A possible explanation is that interfaces, because they do not implement methods, do not contain enough objects that are represented in FAMIX. The similarity measures therefore do not have enough information for effectively matching these small trees. This is the first test case where the control class receives a higher similarity score than other classes.

Performance of Similarity Measures

This section outlines the main problems and details the performance of the implemented algorithms. We do not consider test case E, as the results indicate that no algorithm is able to detect similarity in such cases.

Bottom-up Maximum Common Subtree

The results in the previous sections show clearly that a bottom-up subtree isomorphism measure is not the best way of detecting similar Java classes. The similarity score remains static at about 14%. As explained in the different cases, the reason for this is that this measure uses equivalence classes for checking equality of the different nodes. For a better match, the measure has to include the surrounding, unchanged methods as well.
However, for this to happen, the equivalence class of the root would have to remain the same. This is not the case in our tests, as a node insertion or deletion at tree depth level 1 changes the equivalence class of the root and the measure matches the biggest subtree from level 2, which usually is the biggest method. See Section 5.4.3 for possible improvements to the algorithm that address this problem.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImpl     21                  14.38%
afterChanges.AzureusCoreImplA    21                  14.38%
control.ControlClass             1                   1.37%

Table 5.10: Results case D (method extraction): Bottom-up maximum common subtree.

Class name                       Matched tree size   Similarity
beforeChanges.AzureusCoreImpl    145                 100.00%
afterChanges.AzureusCoreImplA    139                 95.21%
afterChanges.AzureusCoreImpl     70                  47.95%
control.ControlClass             1                   1.37%

Table 5.11: Results case D (method extraction): Top-down maximum common subtree.

Top-down Maximum Common Subtree

We receive mixed results with this measure. Cases C, D and partly case A show good scores for detecting similarity with the top-down algorithm. In case B similarity is not detected, because the insertion of a variable stops the matching process early. This algorithm is a good measure when looking for structural similarity within classes and giving smaller changes in methods a lesser weight. It fails, however, already for small changes near the root of the tree. This constraint can possibly be lessened by using an approach similar to the one proposed in the previous section for the bottom-up maximum subtree match.

Tree Edit Distance

The tree edit distance algorithm performed best for our test cases. The similarity scores in each test are over 95% (see Tables 5.3, 5.6, 5.9 and 5.12), which is sufficient for establishing a similarity relation between two classes with a high accuracy.
The big advantage of this algorithm in comparison to the maximum common subtree algorithms is that it is not as susceptible to node insertions and deletions as the other two measures.

Class name                       Tree edit distance  Similarity
beforeChanges.AzureusCoreImpl    0                   100.00%
afterChanges.AzureusCoreImpl     8                   97.26%
afterChanges.AzureusCoreImplA    14                  95.21%
control.ControlClass             144                 1.37%

Table 5.12: Results case D (method extraction): Tree edit distance.

Class name                           Matched tree size   Similarity
beforeChanges.RateControlledEntity   7                   100.00%
afterChanges.SinglePeerDownloader    1                   4.17%
afterChanges.SinglePeerUploader      1                   3.77%
afterChanges.MultiPeerDownloader     1                   2.74%
afterChanges.MultiPeerUploader       1                   1.08%
control.ControlClass                 1                   25.00%

Table 5.13: Results case E (implement interface): Bottom-up maximum common subtree.

Class name                           Matched tree size   Similarity
beforeChanges.RateControlledEntity   7                   100.00%
afterChanges.SinglePeerUploader      7                   26.42%
afterChanges.SinglePeerDownloader    3                   12.50%
afterChanges.MultiPeerDownloader     3                   8.22%
afterChanges.MultiPeerUploader       3                   3.23%
control.ControlClass                 1                   25.00%

Table 5.14: Results case E (implement interface): Top-down maximum common subtree.

5.3.3 Results with org.eclipse.compare

This section contains the results for the similarity matching on the compare project of Eclipse. We outline the most significant detections for selected classes and look at the overall performance of each similarity algorithm.

General Observations

Only a single class can be analysed per run of Coogle with the current implementation. For analysing a complete project, and calculating the similarity of every class to every other class in the project, the implementation needs to be changed. We therefore selected single classes and analysed their similarity.

Using Coogle on the compare project gives insights into the effectiveness of the different measures. However, the results themselves are not very interesting as no real similarity (for example created
by code duplication) could be detected. The average similarity match lies below 50%, as the results in the following section show. We found that the algorithm correctly detects structural similarity in the sample classes, but we could not find functional similarity in the project, neither by using the similarity search nor by manually studying the source of the project.

Class name                           Tree edit distance  Similarity
beforeChanges.RateControlledEntity   0                   100.00%
afterChanges.SinglePeerDownloader    34                  29.17%
afterChanges.SinglePeerUploader      39                  26.42%
afterChanges.MultiPeerDownloader     59                  19.18%
afterChanges.MultiPeerUploader       172                 7.53%
control.ControlClass                 6                   25.00%

Table 5.15: Results case E (implement interface): Tree edit distance.

Exemplary Detections

We discuss and include results for two example similarity searches on org.eclipse.compare. The two sample classes are CompareViewerPane and NavigationAction.

org.eclipse.compare.CompareViewerPane

We ran our analysis on CompareViewerPane. Table 5.16 contains the results. The average similarity, the unweighted mean of all three measures,

$$\frac{r_{\mathit{bottom}} + r_{\mathit{top}} + r_{\mathit{edit}}}{3},$$

is below 50%. This indicates a low similarity between CompareViewerPane and any other class of the project. Figure 5.4 shows an almost uniformly continuous distribution curve for the average similarity.

Class name                     Bottom-up   Top-down   Tree edit distance   Average
CompareViewerPane              100.00%     100.00%    100.00%              100.00%
internal.merge.LineComparator  11.43%      40.00%     80.00%               43.81%
internal.TokenComparator       28.89%      4.44%      77.78%               37.04%
internal.ListContentProvider   14.81%      37.04%     51.85%               34.57%

Table 5.16: Selected top results for comparison on org.eclipse.compare.CompareViewerPane.

Figure 5.4: Distribution of average similarity (bottom-up, top-down and tree edit distance measures) for class CompareViewerPane in org.eclipse.compare project. Denoted on the x-axis are all classes of the project with descending similarity.
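As a small illustration of how the average column can be computed, the following sketch implements a mean over the three measure scores. The class and method names are hypothetical. The weighted variant is included only because Section 5.4.3 proposes combining the measures with weights; with equal weights it reduces to the unweighted mean used here.

```java
import java.util.Arrays;

/** Illustrative combination of similarity scores (names are hypothetical, not Coogle's API). */
final class CombinedSimilarity {
    private CombinedSimilarity() {}

    /** Weighted mean: sum(w_i * r_i) / sum(w_i). */
    static double weightedMean(double[] scores, double[] weights) {
        if (scores.length == 0 || scores.length != weights.length) {
            throw new IllegalArgumentException("need one weight per score");
        }
        double sum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            sum += weights[i] * scores[i];
            weightSum += weights[i];
        }
        if (weightSum <= 0.0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        return sum / weightSum;
    }

    /** Unweighted mean, i.e. all weights equal to one. */
    static double mean(double... scores) {
        double[] ones = new double[scores.length];
        Arrays.fill(ones, 1.0);
        return weightedMean(scores, ones);
    }
}
```

For example, the scores of LineComparator, CombinedSimilarity.mean(11.43, 40.00, 80.00), give about 43.81, the value listed in the average column.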
The specific results for each measure on its own are not surprising given these average similarity results. Bottom-up maximum common subtree isomorphism only has two classes with a similarity over 25%. One is TokenComparator with a similarity of 28.89% and the other is NavigationAction (see later). The size of the matched subtree is the same in both cases, namely 13 nodes. This is also the overall maximum subtree size for bottom-up matching on CompareViewerPane. All the trees with size 13 are matches of code in the constructor of CompareViewerPane.

Top-down maximum common subtree matches an overall maximum subtree of 14 nodes with the class LineComparator. Because that class is rather small, its similarity percentage gets quite high. However, the match of 14 nodes is reasonable and indicates the similarity of the classes: both have a single variable declaration at the beginning, followed by a single constructor and then five (CompareViewerPane) and four (LineComparator) method declarations. Note that, unlike before, TokenComparator has a very low similarity with just two matched nodes. This is caused by the additional variable declarations of TokenComparator at the beginning of the class, which prevent a better match.

The distribution of the similarity calculated with tree edit distance is depicted in Figure 5.5. The curve is almost linear, with LineComparator and TokenComparator being the top matches with a tree edit distance of 14 and 20 nodes respectively.

Class name                 Bottom-up   Top-down   Tree edit distance   Average
NavigationAction           100.00%     100.00%    100.00%              100.00%
internal.SimpleTextViewer  16.00%      40.00%     84.00%               46.67%
internal.BufferedCanvas    6.25%       37.50%     81.25%               41.67%
CompareViewerPane          39.39%      6.06%      57.58%               34.34%

Table 5.17: Selected top results for comparison on org.eclipse.compare.NavigationAction.

Figure 5.5: Distribution of tree edit distance similarity for class CompareViewerPane in the compare project of Eclipse.
Denoted on the x-axis are all classes of the project with descending similarity.

Studying the source code of the top results by overall average confirms that the matches are correct from an algorithmic point of view. Although none of the classes has a strong functional similarity with CompareViewerPane, all the matches show a structural resemblance which justifies the similarity match percentage.

org.eclipse.compare.NavigationAction

The distribution curve for NavigationAction is similar to the one shown in Figure 5.4 for CompareViewerPane. The similarity results for each measure, including the average similarity, are detailed in Table 5.17.

The data is very similar to the results of the previous analysis on CompareViewerPane. As before, there is a class, here SimpleTextViewer, that performs poorly on bottom-up search, mediocre on top-down search and best on tree edit distance. The results resemble those previously received for LineComparator. Further, the similarity of CompareViewerPane to the class NavigationAction is comparable with the results of TokenComparator in the prior analysis: a high percentage in bottom-up, very low with top-down and a bit higher in tree edit distance measuring.

Performance of Similarity Measures

This section outlines the main problems and the overall performance of the implemented algorithms when testing with the org.eclipse.compare project.

Bottom-up Maximum Common Subtree

The tests with the Eclipse compare project show one good application for the bottom-up maximum common subtree algorithm: the detection of similar methods in otherwise varying classes. However, there are limits to such detections, as the measure only detects the biggest method in the class and needs to be rerun for more matches. Further, the measure will always report a match if the method is sufficiently big. Matches like this are not of any real value from a class similarity point of view.
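The behaviour described for the bottom-up measure, where a change directly below the root leaves only the biggest unchanged method as match, can be reproduced with a minimal sketch of equivalence-class based matching on ordered, labelled trees. This is an illustration under stated assumptions, not Coogle's implementation: the Node type, the string-based class keys and the node labels are hypothetical, and labels are assumed not to contain parentheses or commas.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative bottom-up matching via equivalence classes (not Coogle's code). */
final class BottomUpSketch {

    /** Minimal ordered, labelled tree node (hypothetical stand-in). */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    static Node node(String label, Node... kids) {
        return new Node(label, kids);
    }

    /** Canonical subtree description -> equivalence class id, shared by both trees. */
    private final Map<String, Integer> classIds = new HashMap<>();

    /**
     * Assigns classes bottom-up: a node's class is determined by its label and
     * the ordered classes of its children. Returns {classId, subtreeSize}.
     */
    private int[] classify(Node n, Map<Integer, Integer> sizeByClass) {
        StringBuilder key = new StringBuilder(n.label).append('(');
        int size = 1;
        for (Node child : n.children) {
            int[] c = classify(child, sizeByClass);
            key.append(c[0]).append(',');
            size += c[1];
        }
        String k = key.append(')').toString();
        Integer id = classIds.get(k);
        if (id == null) {
            id = classIds.size();
            classIds.put(k, id);
        }
        sizeByClass.put(id, size);
        return new int[] {id, size};
    }

    /** Size of the largest subtree whose equivalence class occurs in both trees. */
    static int maxCommonSubtreeSize(Node a, Node b) {
        BottomUpSketch sketch = new BottomUpSketch();
        Map<Integer, Integer> sizesA = new HashMap<>();
        Map<Integer, Integer> sizesB = new HashMap<>();
        sketch.classify(a, sizesA);
        sketch.classify(b, sizesB);
        int best = 0;
        for (Map.Entry<Integer, Integer> e : sizesA.entrySet()) {
            if (sizesB.containsKey(e.getKey())) {
                best = Math.max(best, e.getValue());
            }
        }
        return best;
    }
}
```

In this sketch, inserting one extra method node below the root changes the root's class key, so the best shared class is that of the largest unchanged method subtree, which mirrors the constant matched tree size seen in the constructed test cases.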
Top-down Maximum Common Subtree

We did not find any clear similarity match with a top-down maximum common subtree search. This comes from the fact that the project does not have any duplicated classes (or duplicated parts of classes). At least, we were not able to detect or manually find any duplications. The similarity matches in the project are primarily of a structural nature, which means that the algorithm is able to detect classes that have a similar structure, but are not connected through functional similarity. Such structural similarity can be a lead to functional similarity, but in projects where the classes usually are of the same size and follow the same ordering pattern of fields, constructor, getter/setter methods, a functional similarity match is hard to detect without further matching, for example on class coupling (also see Section 5.4.3).

Tree Edit Distance

The same conclusions as for the top-down maximum subtree search apply to the tree edit distance measure. This measure produces good results not only with the test cases; the analysis with org.eclipse.compare also shows that the algorithm is able to identify similarity between classes, at least structural similarity.

5.4 Discussion

In this section we discuss the results of our evaluation, highlight the shortcomings of the measures and indicate possible ways for improvement.

5.4.1 Comparison of Implemented Measures

We tested the similarity measures on two different types of projects: constructed test cases and the org.eclipse.compare project.

The results show that a bottom-up maximum common subtree isomorphism match is not a good measure for similarity. It is too susceptible to subtle code modifications in methods, which usually cause changes at the bottom level of the tree. However, big result trees with this measure often indicate the existence of similarly sized and structured methods in both the search and matching classes.

The top-down maximum common subtree algorithm shows promising results.
The measure is a good indicator for similarity as it is able to detect classes with similar structure. A negative characteristic of the algorithm is that simple changes at the top of the tree, like adding a new attribute or inserting an attribute between existing methods, reduce the reliability of the measure. A top-down search is, however, not as sensitive to changes at the bottom of the tree as the bottom-up isomorphism.

The best overall similarity measure is the tree edit distance. It detected the small refactorings in the test cases and proved itself a good indicator for structural similarity when used with the compare project.

5.4.2 Shortcomings

Measures for ordered trees only. All the tree measures are limited to ordered trees as described in Section 3.4. An unordered tree gives better results when performing a tree edit distance calculation, for example.

Parsing syntax tree creates artificial ordering. Due to the implementation of the abstract syntax tree parsing, the created trees are already ordered. This is described in Section 4.5.3. It is not certain, however, that removing this shortcoming leads to better similarity results, as all the trees are generated using the same parsing process.

FAMIX limits on tree hierarchy. By parsing the abstract syntax tree into a FAMIX representation, we lose hierarchy information, as the FAMIX model is a rather flat hierarchy, usually only a few levels deep. This can be circumvented by leaving out the conversion into FAMIX and directly generating the input tree from the abstract syntax tree.

FAMIX limits on content. FAMIX represents certain basic instructions (invocations, declarations, attributes, etcetera), but does not include assignments, mathematical operations and such. This is good for an overall similarity measuring, but will limit the detection of small changes to these basic instructions.
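The order sensitivity noted under "Measures for ordered trees only" can be illustrated with a deliberately simplified top-down match in which the i-th child of one node is only ever compared with the i-th child of the other. This position-wise pairing is an assumption made for the sketch; the section does not show how Coogle pairs children, and the Node type and labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, position-wise top-down match on ordered trees (illustrative only). */
final class TopDownSketch {

    /** Minimal ordered, labelled tree node (hypothetical stand-in). */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    static Node node(String label, Node... kids) {
        return new Node(label, kids);
    }

    /** Number of matched nodes when children are paired strictly by position. */
    static int match(Node a, Node b) {
        if (!a.label.equals(b.label)) {
            return 0; // mismatching pair: this node and everything below it is skipped
        }
        int matched = 1;
        int pairs = Math.min(a.children.size(), b.children.size());
        for (int i = 0; i < pairs; i++) {
            matched += match(a.children.get(i), b.children.get(i));
        }
        return matched;
    }
}
```

With a class tree of one attribute and two methods, inserting an additional attribute before the methods shifts every following child by one position and reduces the match, whereas appending the same node at the end leaves the original match intact. This is the same effect as the one observed for the two variants of test case D.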
5.4.3 Possible Improvements

We propose the following improvements to overcome the described limitations:

Use complete abstract syntax tree. By using the complete abstract syntax tree for measuring similarity, we can increase the level of detail down to single instructions and build hierarchically more structured trees as well. This might help to detect "real", functionally similar classes and reduce the detection of classes that are merely structurally similar.

Class/method coupling analysis. The detection of functional similarity between classes can be improved by analysing the coupling of methods or classes. This is done by measuring and comparing invocations and references to other classes. With this information we can find classes designed for performing similar tasks.

Field or method name matching. Additional similarity information can be gained from field or method names. Methods and fields used for similar tasks have similar names if the code was written following reasonable naming conventions. This helps in detecting cloned parts of classes. A measure such as Levenshtein's string distance can be used for calculating this similarity between names.

Surrounding string matching. This is like field or method name matching, but matches text surrounding the class or methods, for example comments or Javadoc.

Bottom-up subtree search improvement. A proposal for the bottom-up maximum common subtree matching is to have the algorithm automatically remove the non-matching nodes, recalculate the tree's equivalence classes and restart the bottom-up maximum subtree search for this new tree. However, we can neither estimate the efficiency nor predict whether such an algorithm performs better than the current implementation.

Coupling of multiple matching algorithms. Our implemented algorithms as well as the proposed algorithms can be combined, with their individual similarity scores weighted as desired.
With our current implementation, a combined similarity can be defined as:

$$\mathit{similarity} = \frac{w_a \cdot r_{\mathit{bottomup}} + w_b \cdot r_{\mathit{topdown}} + w_c \cdot r_{\mathit{editdistance}}}{w_a + w_b + w_c}$$

A possible weighting, based on each measure's effectiveness in our tests, would then be: $w_a = 1$, $w_b = 2$, $w_c = 3$.

Chapter 6

Conclusion and Future Work

In this thesis we described the implementation of a similarity analysis tool called Coogle. This tool calculates similarity by comparing tree representations of the source code of two given Java classes. The requirement was to use an intermediary tree representation model called FAMIX. Our goal was to analyse the detection of similarity when using a tree representation of source code with three different similarity measures.

The following contributions are made by this thesis:

• We created a Java implementation of the FAMIX model. This implementation was then used to represent the source code of Java classes.
• An abstract syntax tree parser was refactored and extended to create a FAMIX representation of Eclipse's abstract syntax tree.
• Three different similarity measures for the comparison of general trees were implemented: bottom-up maximum common subtree isomorphism, top-down maximum common subtree isomorphism, and tree edit distance.
• An Eclipse plug-in called Coogle was built. Coogle is a wizard-based tool for analysing similarity between selected Java classes.
• Test cases and the org.eclipse.compare project were used for analysing the effectiveness of the similarity measures.

6.1 Results

Measuring abstract syntax tree similarity is a valid approach for detecting similar Java classes. Of the three tree similarity measures, the tree edit distance produces the best results, followed by the top-down maximum common subtree isomorphism. Especially when measuring the effects of refactorings, the tree edit distance measure proved to be very reliable.
This measure is for ordered trees only and therefore has shortcomings when analysing changes that affect the ordering of the tree, such as relocating field definitions or methods.

Bottom-up maximum common subtree is not as effective as the other two measures, often failing for small structural changes. This is due to the shallow hierarchy of the FAMIX model, which was used for representing the abstract syntax trees of the source code.

Functional similarity is not detected by any of the measures. We ran the similarity analysis on a major project, Eclipse's compare plug-in. Although we found structural similarities, none indicate cloned code fragments or duplicated classes.

6.2 Future Work

We propose multiple extensions that can improve Coogle's ability to detect similarity:

Extend Coogle to parse a complete Java project in one step. By adding a loop around the similarity search procedure of Coogle, a complete Java project can be analysed in one step. This makes it possible to detect similarity not only for single, selected classes, but also to find the most similar classes of a project. An interesting application of this lies in the area of developer assistance: during development the similarity measure can suggest code samples from a repository, providing the developer with sample code or already existing implementations of the desired piece of code.

Use the abstract syntax tree as input. We propose to calculate the similarity of classes based on the abstract syntax tree directly, without losing information by parsing the AST into a FAMIX intermediate representation first. This deepens the hierarchy of the trees and enables the measures to match finer grained subtrees.

Add other measures for tree matching. An interesting candidate for an additional algorithm for ordered tree matching is described in [Chawathe et al., 1996]. Further, a top-down maximum common subtree for unordered trees will be added and evaluated.
Functional similarity detection. Statements are surrounded by text in source code (i.e., comments, field names or neighbouring statements). Analysing the similarity of the text surrounding a statement and including this textual similarity in the measures will improve the similarity detection. Additionally, similar comments or field names are a good indicator of functional similarity. Considering surrounding text in the measures therefore improves the ability to detect functional similarity.

Appendix A Coogle Step by Step

This is a detailed walk through a search with the Coogle plug-in. We search for classes similar to the class org.eclipse.compare.Splitter in the Eclipse compare project.

Figure A.1 shows the context menu after right-clicking on a Java project in the Package Explorer. The project we right-click on is the project in which we want to search for similarity, the Eclipse compare plug-in. After selecting Start similarity search... in the Coogle submenu, the main Coogle wizard starts up.

Figure A.1: Context menu when right-clicking on a Java project in the Eclipse workspace.

We are presented with a selection of similarity measures. An additional setting for ordered or unordered search with labelled or unlabelled trees can be made when choosing to run a bottom-up maximum common subtree isomorphism. The screenshot in Figure A.2 illustrates this step.

The next wizard page lists all projects in the current Eclipse workspace (depicted in Figure A.3). After selecting a project and pressing Next, all classes in the selected project are collected and presented on the next wizard page, which is shown in the screenshot in Figure A.4. Here we select the class to use as similarity search object. The similarity to this class is calculated for all classes in the project selected through the context menu.

The last step before the calculation starts is depicted in Figure A.5. The page shows an overview of the similarity search that will be performed.
Upon pressing Finish, a dialog with a progress bar shows the current status of the search. Figure A.6 illustrates the result dialog that is shown after the calculation has finished. The result table has three columns: the Name of the class that was matched with the search object; the calculated Edit distance, i.e., the number of edit operations needed for transforming the trees; and the Similarity in %, as described in Section 5.3.1.

Figure A.2: Step 1: Welcome screen and choice of similarity.

Figure A.3: Step 2: Selection of project containing the desired search object.

Figure A.4: Step 3: Search object selection.

Figure A.5: Step 4: Final summary page before calculation is started.

Figure A.6: Result dialog of a tree edit distance calculation on the Eclipse compare project with the class org.eclipse.compare.Splitter as search object.

Appendix B How to Extend Coogle

This chapter contains useful code snippets and descriptions on how to reuse the existing code when adding new functionality to Coogle.

B.1 Add a New Similarity Measure

The currently implemented similarity measures reside in package ch.toe.tree.calc. Classes calculating similarity extend the abstract class Calculator. Listing B.1 shows a sample class definition. The boolean calculated represents the status of the calculation. The calculation itself is done by the method calculate(), which has to be invoked by the constructor right after initialisation. See Listing B.2 for a basic constructor implementation.

To integrate the measure into the Coogle wizard, a subclass of SimilarityOperation (in package ch.toe.coogle.operation.generic) is added. We create this class in package ch.toe.coogle.operation.classes and override the method execute(), which is called by the wizard upon pressing Finish. Listing B.3 illustrates the added class for the new operation.
ch.toe.coogle.operation.dialog.ResultDialog is the base class for a new result dialog. Here, the method createContents(..) is called as soon as the dialog is displayed. We therefore create the desired result controls in a new implementation of this method as illustrated by Listing B.4.

    package ch.toe.tree.calc;

    public class CalculateTreeSize extends Calculator {
        [..]
    }

Listing B.1: Sample class for defining a new similarity measure.

    [..]
    public CalculateTreeSize(DefaultMutableTreeNode tree)
            throws NullPointerException, TreeNodeTypeException {
        super();
        if (tree == null)
            throw new NullPointerException("Empty tree passed!");
        this.tree = tree;
        // run the calculation right after initialisation
        if (calculate())
            setCalculated(true);
    }
    [..]

Listing B.2: Sample constructor for a new similarity measure with a single tree as parameter.

After creating these operational classes, we need to add the operation to the wizard. This requires the following changes (all classes are in package ch.toe.coogle.wizard):

• Add the operation to the model class CoogleModel. Listing B.5 shows the needed additions: a boolean field to represent the currently chosen algorithm and a String identifying the added operation.
• Create the controls for selecting the new measure in the method createPageContent(..) in CoogleWizardPageWelcome. See Listing B.6 for the addition of the sample measure.
• Add the operation to performFinish() in the main wizard class CoogleWizard. This is shown in Listing B.7.

B.2 Extend the Information in the Tree

As the measures operate on general tree representations, the new tree representation only needs to be of type DefaultMutableTreeNode. Then it can be passed to the constructor of any similarity measure. When defining the tree with special objects as user objects, a new comparator for evaluating node equality probably needs to be defined.

B.3 Define a New Comparator

A comparator is used by the calculation classes for comparing two nodes for equality.
The interface ITreeComparator needs to be implemented by new comparators. Its method compare() receives two DefaultMutableTreeNodes and returns true or false to indicate equality. Listing B.8 shows a sample implementation of a new comparator. The new comparator can be used with any similarity measure by passing the instantiated comparator object to the constructor of the measure. See Listing 4.7 for a constructor receiving a comparator.

    package ch.toe.coogle.operation.classes;

    public class TreeSizeOperationClassImpl extends SimilarityOperation {
        protected ArrayList[] treeSize;

        public TreeSizeOperationClassImpl(CoogleModel model) {
            super(model);
        }

        public boolean execute()
                throws InvocationTargetException, InterruptedException {
            // process all trees in the project
            for (int i = 0; i < objectTrees.length; i++) {
                if (objectTrees[i] == null)
                    continue;
                CalculateTreeSize calc = null;
                try {
                    calc = new CalculateTreeSize(new DefaultMutableTreeNode());
                } catch (NullPointerException e) {
                    // calc stays null, so this tree is skipped by the check below
                } catch (TreeNodeTypeException e) {
                    continue;
                }
                if (calc == null || !calc.isCalculated())
                    continue;
                // add tree size results to result array
                [..]
                // make object available for garbage collector
                objectTrees[i] = null;
            }
            return true;
            [..]
        }

        public void displayResults(Shell shell) {
            TreeSizeDialog dialog = new TreeSizeDialog(shell);
            dialog.setDuration(getOperationDurationString());
            dialog.setData(treeSize);
            dialog.open();
        }
        [..]

Listing B.3: Sample implementation of a new measure operation.

    package ch.toe.coogle.operation.dialog;

    public class TreeSizeDialog extends ResultDialog {
        public TreeSizeDialog(Shell parent) {
            super(parent);
        }

        public TreeSizeDialog(Shell parent, int style) {
            super(parent, style);
        }

        protected void createContents(Shell shell) {
            // window title
            shell.setText("Tree edit distance result table");
            GridLayout layout = new GridLayout();
            layout.numColumns = 1;
            shell.setLayout(layout);
            addDurationLabel(shell);
            // add controls
            [..]
            addCloseButton(shell);
        }
    }

Listing B.4: Implementation of a new result dialog.

    public static final String treeSize = "Tree size";
    public boolean doTreeSize = false;

Listing B.5: In class CoogleModel: Model extension for a new measure.

    [..]
    Button radioTreeSize = new Button(g1, SWT.RADIO);
    radioTreeSize.setText(CoogleModel.treeSize);
    radioTreeSize.addSelectionListener(new SelectionListener() {
        public void widgetDefaultSelected(SelectionEvent e) {
            widgetSelected(e);
        }

        public void widgetSelected(SelectionEvent e) {
            // select the new measure and deselect all others
            model.doTreeSize = true;
            model.doEditDistance = false;
            model.doBottomUp = false;
            model.doTopDown = false;
            // the ordered/labelled options do not apply to this measure
            treesOrdered.setEnabled(false);
            treesOrdered.setVisible(false);
            treesOrdered.setSelection(false);
            treesLabeled.setEnabled(false);
            treesLabeled.setVisible(false);
            treesLabeled.setSelection(false);
            setPageComplete(true);
        }
    });
    [..]

Listing B.6: In class CoogleWizardPageWelcome: Additions to the welcome page of the wizard for a new tree measure.

    [..]
    public boolean performFinish() {
        [..]
        if (model.doClassSearch) {
            [..]
            if (model.doTreeSize) {
                operation = new TreeSizeOperationClassImpl(model);
            }
        }
        [..]

Listing B.7: In class CoogleWizard: Add the new operation to the finish action of the wizard.

    package ch.toe.tree.comparator;

    public class FirstLetterTreeComparator implements ITreeComparator {
        public FirstLetterTreeComparator() {
            super();
        }

        public boolean compare(DefaultMutableTreeNode node1,
                DefaultMutableTreeNode node2) {
            if (node1 == null || node2 == null)
                return false;
            if (node1.getUserObject() == null || node2.getUserObject() == null)
                return false;
            // nodes are considered equal if their labels start with the same character
            if (node1.getUserObject().toString().charAt(0)
                    == node2.getUserObject().toString().charAt(0))
                return true;
            return false;
        }
    }

Listing B.8: A new comparator implementation.
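To illustrate how a comparator plugs into a calculation class, the following sketch passes a comparator to a deliberately simple measure. Everything in it is an assumption made for illustration: NodeComparatorSketch is a stand-in for Coogle's ITreeComparator reduced to the compare() method described in Section B.3, IgnoreCaseComparatorSketch is a hypothetical comparator, and PositionalMatchSketch is not one of Coogle's measures. It merely counts nodes that match top-down when children are paired by position, which is simpler than the maximum common subtree isomorphism used in the thesis.

```java
import javax.swing.tree.DefaultMutableTreeNode;

/** Stand-in for Coogle's ITreeComparator, reduced to the method described in Section B.3. */
interface NodeComparatorSketch {
    boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2);
}

/** Hypothetical comparator: two nodes are equal if their labels match ignoring case. */
final class IgnoreCaseComparatorSketch implements NodeComparatorSketch {
    public boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2) {
        if (node1 == null || node2 == null) {
            return false;
        }
        Object label1 = node1.getUserObject();
        Object label2 = node2.getUserObject();
        if (label1 == null || label2 == null) {
            return false;
        }
        return label1.toString().equalsIgnoreCase(label2.toString());
    }
}

/** Hypothetical measure: counts nodes matched top-down, pairing children by position. */
final class PositionalMatchSketch {
    static int countMatches(DefaultMutableTreeNode tree1, DefaultMutableTreeNode tree2,
                            NodeComparatorSketch comparator) {
        // a mismatch at the root of a subtree ends the descent into that subtree
        if (!comparator.compare(tree1, tree2)) {
            return 0;
        }
        int matched = 1;
        int shared = Math.min(tree1.getChildCount(), tree2.getChildCount());
        for (int i = 0; i < shared; i++) {
            matched += countMatches(
                    (DefaultMutableTreeNode) tree1.getChildAt(i),
                    (DefaultMutableTreeNode) tree2.getChildAt(i),
                    comparator);
        }
        return matched;
    }
}
```

Because the measure only calls compare(), a different notion of node equality can be swapped in without touching the traversal, which is the design idea behind passing the comparator to the constructor of a measure.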
Appendix C Contents of CD-ROM

C.1 Directory Layout

Directory name          Description
CoogleSource.zip        This archive contains the complete source of the Coogle plug-in.
CoogleSource.zip.asc    PGP signature of the Coogle source.
org.eclipse.compare/    The source of the compare plug-in project as used in our evaluation.
TestCases/              Sources of the test cases.
TestWorkspace/          Eclipse workspace containing the projects used for evaluation.
Coogle.pdf              This document in Adobe Portable Document Format.
Abstract.pdf            Abstract of the thesis in English.
Zusfsg.pdf              Abstract of the thesis in German.

C.2 Eclipse Workspace: Test Cases

Each test case is in its own subdirectory. For example, Case A resides in the subdirectory "CaseA". The subdirectory structure of each case follows the default Java practice, putting each package in a subdirectory. Each test case is structured in the following three packages:

ch.toe.Coogle.TestCases.CaseX.beforeChanges. This contains the original test class without any modifications. The class in this directory is used as similarity search object.

ch.toe.Coogle.TestCases.CaseX.afterChanges. The class from the package beforeChanges is modified according to the test case and then put in this package. We determine the similarity of this class to the unmodified class.

ch.toe.Coogle.TestCases.CaseX.control. One single class resides in this directory, the control class ControlClass. This class defines an empty Java class and is used as dissimilarity measure (see Section 5.2.1 for more information).

Appendix D Test Cases Source Listings

This appendix contains the unmodified source code of the classes used as test objects in Section 5.2.1. The source for these classes can also be found on the Azureus project site¹ or on the enclosed CD-ROM.

D.1 AzureusCoreImpl

This is the complete listing of AzureusCoreImpl, used as base class for the constructed test cases.

    /*
     * Created on 13-Jul-2004
     * Created by Paul Gardner
     * Copyright (C) 2004 Aelitis, All Rights Reserved.
     *
     * This program is free software; you can redistribute it and/or
     * modify it under the terms of the GNU General Public License
     * as published by the Free Software Foundation; either version 2
     * of the License, or (at your option) any later version.
     * This program is distributed in the hope that it will be useful,
     * but WITHOUT ANY WARRANTY; without even the implied warranty of
     * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
     * GNU General Public License for more details.
     * You should have received a copy of the GNU General Public License
     * along with this program; if not, write to the Free Software
     * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
     *
     * AELITIS, SARL au capital de 30,000 euros
     * 8 Allee Lenotre, La Grille Royale, 78600 Le Mesnil le Roi, France.
     *
     */

¹ http://azureus.sourceforge.net/

    package com.aelitis.azureus.core.impl;

    //~--- non-JDK imports --------------------------------------------------
    import com.aelitis.azureus.core.*;
    import com.aelitis.azureus.core.networkmanager.NetworkManager;
    import com.aelitis.azureus.core.peermanager.PeerManager;
    import com.aelitis.azureus.core.update.AzureusRestarterFactory;

    import org.gudy.azureus2.core3.config.COConfigurationManager;
    import org.gudy.azureus2.core3.global.GlobalManager;
    import org.gudy.azureus2.core3.global.GlobalManagerFactory;
    import org.gudy.azureus2.core3.internat.*;
    import org.gudy.azureus2.core3.ipfilter.*;
    import org.gudy.azureus2.core3.ipfilter.IpFilterManager;
    import org.gudy.azureus2.core3.logging.LGLogger;
    import org.gudy.azureus2.core3.tracker.host.*;
    import org.gudy.azureus2.core3.util.*;
    import org.gudy.azureus2.plugins.*;
    import org.gudy.azureus2.pluginsimpl.local.PluginInitializer;

    //~--- JDK imports ------------------------------------------------------
    import java.util.*;

    //~--- classes ----------------------------------------------------------
    /**
     * @author parg
     *
     */
    public class AzureusCoreImpl
            implements AzureusCore, AzureusCoreListener {
        protected static AEMonitor class_mon = new AEMonitor("AzureusCore:class");
        protected static AzureusCore singleton;

        //~--- fields -------------------------------------------------------
        private List listeners = new ArrayList();
        private List lifecycle_listeners = new ArrayList();
        private AEMonitor this_mon = new AEMonitor("AzureusCore");
        private GlobalManager global_manager;
        private PluginInitializer pi;
        private boolean running;

        //~--- constructors -------------------------------------------------
        protected AzureusCoreImpl() {
            COConfigurationManager.initialise();
            LGLogger.initialise();
            AEDiagnostics.startup();
            AETemporaryFileHandler.startup();

            // ensure early initialization
            NetworkManager.getSingleton();
            PeerManager.getSingleton();
            pi = PluginInitializer.getSingleton(this, this);
        }

        //~--- methods ------------------------------------------------------
        public void addLifecycleListener(AzureusCoreLifecycleListener l) {
            lifecycle_listeners.add(l);
        }

        public void addListener(AzureusCoreListener l) {
            listeners.add(l);
        }

        public void checkRestartSupported() throws AzureusCoreException {
            if (getPluginManager().getPluginInterfaceByClass(
                    "org.gudy.azureus2.update.UpdaterPatcher") == null) {
                LGLogger.logRepeatableAlert(
                    LGLogger.AT_ERROR,
                    "Can't restart without the 'azupdater' plugin installed");
                throw(new AzureusCoreException(
                    "Can't restart without the 'azupdater' plugin installed"));
            }
        }

        public static AzureusCore create() throws AzureusCoreException {
            try {
                class_mon.enter();
                if (singleton != null) {
                    throw(new AzureusCoreException(
                        "Azureus core already instantiated"));
                }
                singleton = new AzureusCoreImpl();
                return (singleton);
            } finally {
                class_mon.exit();
            }
        }

        public void removeLifecycleListener(AzureusCoreLifecycleListener l) {
            lifecycle_listeners.remove(l);
        }

        public void removeListener(AzureusCoreListener l) {
            listeners.remove(l);
        }

        public void reportCurrentTask(String currentTask) {
            for (int i = 0; i < listeners.size(); i++) {
                try {
                    ((AzureusCoreListener) listeners.get(i)).reportCurrentTask(
                        currentTask);
                } catch (Throwable e) {
                    Debug.printStackTrace(e);
                }
            }
        }

        public void reportPercent(int percent) {
            for (int i = 0; i < listeners.size(); i++) {
                try {
                    ((AzureusCoreListener) listeners.get(i)).reportPercent(
                        percent);
                } catch (Throwable e) {
                    Debug.printStackTrace(e);
                }
            }
        }

        public void requestRestart() throws AzureusCoreException {
            runNonDaemon(new AERunnable() {
                public void runSupport() {
                    checkRestartSupported();
                    for (int i = 0; i < lifecycle_listeners.size(); i++) {
                        if (!((AzureusCoreLifecycleListener) lifecycle_listeners
                                .get(i)).restartRequested(AzureusCoreImpl.this)) {
                            LGLogger.log(
                                "Core: Request to restart the core has been denied");
                            return;
                        }
                    }
                    restart();
                }
            });
        }

        public void requestStop() throws AzureusCoreException {
            runNonDaemon(new AERunnable() {
                public void runSupport() {
                    for (int i = 0; i < lifecycle_listeners.size(); i++) {
                        if (!((AzureusCoreLifecycleListener) lifecycle_listeners
                                .get(i)).stopRequested(AzureusCoreImpl.this)) {
                            LGLogger.log(
                                "Core: Request to stop the core has been denied");
                            return;
                        }
                    }
                    stop();
                }
            });
        }

        public void restart() throws AzureusCoreException {
            runNonDaemon(new AERunnable() {
                public void runSupport() {
                    LGLogger.log("Core: Restart operation starts");
                    checkRestartSupported();
                    stopSupport(false);
                    LGLogger.log(
                        "Core: Restart operation: stop complete, restart initiated");
                    AzureusRestarterFactory.create(AzureusCoreImpl.this).restart(
                        false);
                }
            });
        }

        private void runNonDaemon(final Runnable r) throws AzureusCoreException {
            if (!Thread.currentThread().isDaemon()) {
                r.run();
            } else {
                final AESemaphore sem = new AESemaphore("AzureusCore:runNonDaemon");
                final Throwable[] error = { null };
                new AEThread("AzureusCore:runNonDaemon") {
                    public void runSupport() {
                        try {
                            r.run();
                        } catch (Throwable e) {
                            error[0] = e;
                        } finally {
                            sem.release();
                        }
                    }
                }.start();
                sem.reserve();
                if (error[0] != null) {
                    if (error[0] instanceof AzureusCoreException) {
                        throw((AzureusCoreException) error[0]);
                    } else {
                        throw(new AzureusCoreException("Operation failed", error[0]));
                    }
                }
            }
        }

        private void shutdownCore() {
            if (running) {
                try {
                    LGLogger.log(
                        "Core: Caught VM shutdown event; auto-stopping Azureus");
                    AzureusCoreImpl.this.stop();
                } catch (Throwable e) {
                    Debug.printStackTrace(e);
                }
            }
        }

        public void start() throws AzureusCoreException {
            try {
                this_mon.enter();
                if (running) {
                    throw(new AzureusCoreException("Core: already running"));
                }
                running = true;
            } finally {
                this_mon.exit();
            }
            LGLogger.log("Core: Loading of Plugins starts");
            pi.loadPlugins(this);
            LGLogger.log("Core: Loading of Plugins complete");
            global_manager = GlobalManagerFactory.create(this);
            for (int i = 0; i < lifecycle_listeners.size(); i++) {
                ((AzureusCoreLifecycleListener) lifecycle_listeners.get(
                    i)).componentCreated(this, global_manager);
            }
            pi.initialisePlugins();
            LGLogger.log("Core: Initializing Plugins complete");
            new AEThread("Plugin Init Complete") {
                public void runSupport() {
                    pi.initialisationComplete();
                    for (int i = 0; i < lifecycle_listeners.size(); i++) {
                        ((AzureusCoreLifecycleListener) lifecycle_listeners.get(
                            i)).started(AzureusCoreImpl.this);
                    }
                }
            }.start();

            // Catch non-user-initiated VM shutdown
            ShutdownHook.install(new ShutdownHook.Handler() {
                public void shutdown(String signal_name) {
                    LGLogger.log("Core: Caught signal " + signal_name);
                    shutdownCore();
                }
            });
            Runtime.getRuntime().addShutdownHook(new AEThread("Shutdown Hook") {
                public void runSupport() {
                    shutdownCore();
                }
            });
        }

        public void stop() throws AzureusCoreException {
            runNonDaemon(new AERunnable() {
                public void runSupport() {
                    LGLogger.log("Core: Stop operation starts");
                    stopSupport(true);
                }
            });
        }

        private void stopSupport(boolean apply_updates)
                throws AzureusCoreException {
            try {
                this_mon.enter();
                if (!running) {
                    throw(new AzureusCoreException("Core not running"));
                }
                running = false;
            } finally {
                this_mon.exit();
            }
            global_manager.stopAll();
            for (int i = 0; i < lifecycle_listeners.size(); i++) {
                ((AzureusCoreLifecycleListener) lifecycle_listeners.get(
                    i)).stopped(this);
            }
            NonDaemonTaskRunner.waitUntilIdle();
            AEDiagnostics.shutdown();
            LGLogger.log("Core: Stop operation completes");

            // if any installers exist then we need to closedown via the updater
            if (apply_updates
                    && (getPluginManager().getDefaultPluginInterface()
                        .getUpdateManager().getInstallers().length > 0)) {
                AzureusRestarterFactory.create(this).restart(true);
            }
        }

        //~--- get methods --------------------------------------------------
        public GlobalManager getGlobalManager() throws AzureusCoreException {
            if (global_manager == null) {
                throw(new AzureusCoreException("Core not running"));
            }
            return (global_manager);
        }

        public IpFilterManager getIpFilterManager() throws AzureusCoreException {
            return (IpFilterManagerFactory.getSingleton());
        }

        public LocaleUtil getLocaleUtil() {
            return (LocaleUtil.getSingleton());
        }

        public PluginManager getPluginManager() throws AzureusCoreException {
            // don't test for runnign here, the restart process calls this after
            // terminating the core...
            return (PluginInitializer.getDefaultInterface().getPluginManager());
        }

        public PluginManagerDefaults getPluginManagerDefaults()
                throws AzureusCoreException {
            return (PluginManager.getDefaults());
        }

        public static AzureusCore getSingleton() throws AzureusCoreException {
            if (singleton == null) {
                throw(new AzureusCoreException("core not instantiated"));
            }
            return (singleton);
        }

        public TRHost getTrackerHost() throws AzureusCoreException {
            return (TRHostFactory.getSingleton());
        }

        public static boolean isCoreAvailable() {
            return (singleton != null);
        }
    }

Listing D.1: AzureusCoreImpl.java (version 2.3.0.3) from the Azureus project.

D.2 RateControlledEntity

This is the listing of the interface RateControlledEntity which is used in test case E.

    /*
     * Created on Sep 27, 2004
     * Created by Alon Rohter
     * Copyright (C) 2004 Aelitis, All Rights Reserved.
     *
     * This program is free software; you can redistribute it and/or
     * modify it under the terms of the GNU General Public License
     * as published by the Free Software Foundation; either version 2
     * of the License, or (at your option) any later version.
     * This program is distributed in the hope that it will be useful,
     * but WITHOUT ANY WARRANTY; without even the implied warranty of
     * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
     * GNU General Public License for more details.
     * You should have received a copy of the GNU General Public License
     * along with this program; if not, write to the Free Software
     * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
     *
     * AELITIS, SARL au capital de 30,000 euros
     * 8 Allee Lenotre, La Grille Royale, 78600 Le Mesnil le Roi, France.
     *
     */

    package com.aelitis.azureus.core.networkmanager.impl;

    /**
     * Interface designation for rate-limited entities controlled by a handler.
     */
    public interface RateControlledEntity {
        /**
         * Uses fair round-robin scheduling of processing ops.
         */
        public static final int PRIORITY_NORMAL = 0;

        /**
         * Guaranteed scheduling of processing ops, with preference over
         * normal-priority entities.
         */
        public static final int PRIORITY_HIGH = 1;

        //~--- methods ------------------------------------------------------
        /**
         * Is ready for a processing op.
         * @return true if it can process >0 bytes, false if not ready
         */
        public boolean canProcess();

        /**
         * Attempt to do a processing operation.
         * @return true if >0 bytes were processed (success), false if 0 bytes
         * were processed (failure)
         */
        public boolean doProcessing();

        //~--- get methods --------------------------------------------------
        /**
         * Get this entity's priority level.
         * @return priority
         */
        public int getPriority();
    }

Listing D.2: RateControlledEntity.java (version 2.3.0.3) from the Azureus project.

Bibliography

[Baker and Manber, 1998] Baker, B. S. and Manber, U. (1998). Deducing similarities in Java sources from bytecodes. In Proceedings of Usenix Annual Technical Conference, pages 179–190. [Baxter et al., 1998] Baxter, I. D., Yahin, A., Moura, L., Anna, M. S., and Bier, L. (1998). Clone detection using abstract syntax trees. In ICSM ’98: Proceedings of the International Conference on Software Maintenance, pages 368–377. IEEE Computer Society, Washington, DC, USA. [Bernstein et al., 2005] Bernstein, A., Kiefer, C., and Kaufmann, E. (2005). Simpack: A generic Java library for similarity measures in ontologies. [CDIF, 1994] CDIF (1994). CDIF framework for modeling and extensibility. Technical Report EIA/IS107, Electronic Industries Association. [Chawathe et al., 1996] Chawathe, S. S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1996). Change detection in hierarchically structured information. In SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 493–504, New York, NY, USA. ACM Press. [Dijkstra, 1959] Dijkstra, E. W. (1959).
A note on two problems in connection with graphs. Numerical Mathematics, 1(5):269–271. [Gamma and Beck, 2003] Gamma, E. and Beck, K. (2003). Contributing to Eclipse. Addison Wesley. [Gamma et al., 1994] Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (1994). Design patterns: Elements of reusable object-oriented software. Addison Wesley, Massachusetts. [Gosling et al., 1996] Gosling, J., Joy, B., and Steele, G. L. (1996). The Java language specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. [Holmes and Murphy, 2005] Holmes, R. and Murphy, G. C. (2005). Using structural context to recommend source code examples. In ICSE ’05: Proceedings of the 27th International Conference on Software Engineering, pages 117–125, New York, NY, USA. ACM Press. [Kontogiannis, 1993] Kontogiannis, K. (1993). Program representation and behavioural matching for localizing similar code fragments. In CASCON ’93: Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research, pages 194–205. IBM Press. [Lanza, 2003] Lanza, M. (2003). CodeCrawler - a lightweight software visualization tool. In VISSOFT ’03: Proceedings of the 2nd International Workshop on Visualizing Software for Understanding and Analysis, pages 51–52. [Michail and Notkin, 1999] Michail, A. and Notkin, D. (1999). Assessing software libraries by browsing similar classes, functions and relationships. In ICSE ’99: Proceedings of the 21st International Conference on Software Engineering, pages 463–472, Los Alamitos, CA, USA. IEEE Computer Society Press. [Mishne and de Rijke, 2004] Mishne, G. and de Rijke, M. (2004). Source code retrieval using conceptual similarity. [Myles and Collberg, 2005] Myles, G. and Collberg, C. (2005). K-gram based software birthmarks. In SAC ’05: Proceedings of the 2005 ACM Symposium on Applied Computing, pages 314–318, New York, NY, USA. ACM Press. [Neamtiu et al., 2005] Neamtiu, I., Foster, J. S., and Hicks, M. (2005).
Understanding source code evolution using abstract syntax tree matching. In MSR ’05: Proceedings of the 2005 International Workshop on Mining Software Repositories, pages 1–5, New York, NY, USA. ACM Press. [Shamir and Tsur, 1997] Shamir, R. and Tsur, D. (1997). Faster subtree isomorphism. In ISTCS ’97: Proceedings of the 5th Israel Symposium on the Theory of Computing Systems (ISTCS ’97), page 126, Washington, DC, USA. IEEE Computer Society. [Shasha et al., 2002] Shasha, D., Wang, J. T. L., and Giugno, R. (2002). Algorithmics and applications of tree and graph searching. In PODS ’02: Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 39–52, New York, NY, USA. ACM Press. [Shasha et al., 2004] Shasha, D., Wang, J. T. L., and Zhang, S. (2004). Unordered tree mining with applications to phylogeny. In ICDE ’04: Proceedings of the 20th International Conference on Data Engineering, page 708, Washington, DC, USA. IEEE Computer Society. [Tichelaar, 1999] Tichelaar, S. (1999). FAMIX Java language plug-in 1.0. [Tichelaar et al., 1999] Tichelaar, S., Steyaert, P., and Demeyer, S. (1999). FAMIX 2.0: The FAMOOS information exchange model. [Valiente, 2000] Valiente, G. (2000). Simple and efficient tree pattern matching. Technical Report LSI-00-72-R, Technical University of Catalonia. [Valiente, 2002] Valiente, G. (2002). Algorithms on trees and graphs. Springer-Verlag, Berlin. [Wang et al., 2003] Wang, J. T.-L., Shan, H., Shasha, D., and Piel, W. H. (2003). TreeRank: A similarity measure for nearest neighbor searching in phylogenetic databases. In SSDBM ’03: Proceedings of the 15th International Conference on Scientific and Statistical Database Management, pages 171–180. [Yamamoto et al., 2002] Yamamoto, T., Matsushita, M., Kamiya, T., and Inoue, K. (2002). Measuring similarity of large software systems based on source code correspondence. [Zhang, 1996] Zhang, K. (1996). A constrained edit distance between unordered labeled trees.
Algorithmica, 15(3):205–222. [Zhang and Jiang, 1994] Zhang, K. and Jiang, T. (1994). Some MAX SNP-hard results concerning unordered labeled trees. Information Processing Letters, 49(5):249–254. [Zhang et al., 1992] Zhang, K., Statman, R., and Shasha, D. (1992). On the editing distance between unordered labeled trees. Information Processing Letters, 42(3):133–139.