Coogle - A Code Google Eclipse Plug

Diploma Thesis
31st January 2006
Coogle
A Code Google Eclipse Plug-in for Detecting
Similar Java Classes
Tobias Sager
of Zürich, Switzerland (s0070115)
supervised by
Prof. Abraham Bernstein, Ph.D.; Prof. Dr. Harald Gall
Beat Fluri; Christoph Kiefer; Martin Pinzger
Department of Informatics
software evolution & architecture lab
Author:
Tobias Sager, tsager@gmx.ch
Project period:
June 7, 2005 - December 7, 2005
Software Evolution & Architecture Lab
Department of Informatics, University of Zurich
Acknowledgements
I would like to thank my supervising assistants, Beat Fluri, Christoph Kiefer and Martin Pinzger,
for their valuable input, the extensive proofreading and the freedom I had while writing this thesis. Further, I thank Prof. Abraham Bernstein and Prof. Harald Gall for giving me the opportunity
of writing this thesis. The layout of this document is based on the superb LaTeX style written by
Beat Fluri.
I thank Sabine, Vreni and Ernst Sager for proofreading the thesis and the moral support they
provided. My apologies to Christine for all those hours spent in front of the computer.
Kudos to all the nameless open-source software developers. The following great tools were
used for creating this thesis: Java, Eclipse, Subversion, Subclipse, TeXlipse, LaTeX, OpenOffice.org,
Inkscape, Mozilla Firefox and Gentoo Linux.
Abstract
This thesis introduces Coogle, an Eclipse plug-in that measures similarity between Java classes.
Coogle calculates similarity by using different tree algorithms on syntax tree representations of
source code. For creating these tree representations, we convert the abstract syntax tree as defined
by Eclipse into an intermediary model called FAMIX. This FAMIX model then is transformed into
a general tree structure and used for calculating the similarity.
We derive tree similarity from a bottom-up maximum common subtree isomorphism, a top-down maximum common subtree isomorphism, and the edit distance of two given trees. These
similarity measures are then analysed for their efficiency in detecting modified code and structural similarity with constructed test cases and a real-world Java project. The best results are
achieved with the tree edit distance algorithm, which reliably indicates similarity of classes after
refactorings and also finds structurally similar classes in Eclipse’s compare project. Finding similarity with a top-down maximum common subtree algorithm is efficient for detecting structural
similarity, but has shortcomings in detecting similarity of modifications that affect the ordering of
the nodes in the tree representation. Using a bottom-up maximum common subtree isomorphism
for detecting modifications is inefficient due to the limited hierarchy of the FAMIX tree representation. Based on these findings, we point out different ways to improve our similarity analysis
tool.
Summary
This diploma thesis presents Coogle, a plug-in for Eclipse that measures similarity between
Java classes. Coogle calculates the similarity on the basis of the syntax trees of source code. To build these tree structures, we convert the abstract syntax tree defined by Eclipse
into a model called FAMIX. This FAMIX representation is then transformed into a general tree structure and used as the basis for calculating the similarity.
We calculate the similarity of the trees with three different algorithms: bottom-up
maximum common subtree, top-down maximum common subtree, and the tree edit distance of two trees. These similarity measures are tested with test cases and
a real software project for their efficiency in finding modified source code and
determining structural similarity. The best result is achieved by the tree edit distance algorithm, which is able to indicate the similarity of classes with great reliability even
after refactorings. This algorithm also finds structurally similar classes
in the compare project of Eclipse. The top-down maximum common subtree algorithm
is useful for determining structural similarity. However, this algorithm has deficits
in detecting modifications that change the order of the nodes in the tree representation. A bottom-up maximum common subtree algorithm is not efficient
for discovering source code changes because the hierarchy of the FAMIX tree has too little
depth. Based on these results, we make various suggestions for improving our analysis software.
Contents
1 Introduction
1.1 Motivation
1.2 Structure of Thesis
2 FAMIX - the FAMOOS Information Exchange Model
2.1 Purpose
2.2 Description
2.2.1 Overview
2.2.2 Core Model
2.2.3 FAMIX Extensions for Java
2.3 FAMIX as Intermediary Model
3 Similarity Analysis
3.1 Overview
3.2 Related Work
3.3 Definition of Similarity
3.3.1 Structural Similarity
3.3.2 Functional Similarity
3.4 Evaluated Tree Algorithms
3.4.1 Definitions
3.4.2 Using Ordered or Unordered Trees?
3.4.3 Tree Isomorphism
3.4.4 Bottom-up Maximum Common Subtree Isomorphism
3.4.5 Top-down Maximum Common Subtree Isomorphism
3.4.6 Tree Edit Distance
4 Implementation
4.1 Eclipse Architecture
4.1.1 Eclipse Platform
4.1.2 Abstract Syntax Tree Representation in Eclipse
4.1.3 AST to FAMIX Mapping
4.2 Coogle Architecture
4.2.1 Project Package Structure
4.2.2 Plug-in Features
4.3 Coogle Design
4.3.1 Overview
4.3.2 FAMIX Extensions
4.3.3 Tree Generation
4.3.4 Node Comparison
4.3.5 Input Trees for Measures
4.4 Coogle Workflow
4.4.1 Invocation
4.4.2 Similarity Search Process
4.5 Coogle Implementation
4.5.1 Tree Generation
4.5.2 Node Comparison
4.5.3 Implemented Similarity Measures
4.6 Discussion and Problems
4.6.1 FAMIX
4.6.2 Similarity Measures
5 Evaluation
5.1 Approach
5.2 Analysis Objects
5.2.1 Constructed Test Cases
5.2.2 Real World Example: org.eclipse.compare
5.3 Results
5.3.1 Ranking the Matches
5.3.2 Results with Constructed Test Cases
5.3.3 Results with org.eclipse.compare
5.4 Discussion
5.4.1 Comparison of Implemented Measures
5.4.2 Shortcomings
5.4.3 Possible Improvements
6 Conclusion and Future Work
6.1 Results
6.2 Future Work
A Coogle Step by Step
B How to Extend Coogle
B.1 Add a New Similarity Measure
B.2 Extend the Information in the Tree
B.3 Define a New Comparator
C Contents of CD-ROM
C.1 Directory Layout
C.2 Eclipse Workspace: Test Cases
D Test Cases Source Listings
D.1 AzureusCoreImpl
D.2 RateControlledEntity
List of Figures
2.1 FAMIX concept overview (source: [Tichelaar et al., 1999])
2.2 Abstract basic elements of the FAMIX model. Figure from [Tichelaar et al., 1999].
2.3 Subclasses of the FAMIX element BehaviouralEntity.
2.4 Subclasses of StructuralEntity and their relationship to other elements (in grey). Figure from [Tichelaar et al., 1999].
2.5 Core elements of FAMIX and their relationship. Figure from [Tichelaar et al., 1999].
3.1 Left: A directed graph with five vertices and eight arcs. Right: A rooted tree with six nodes.
3.2 A complete bipartite graph between the nodes v1, v2 of T1 and the nodes w1, w2 and w3 of tree T2.
3.3 Isomorphic ordered trees. Nodes numbered according to a preorder traversal. Figure taken from [Valiente, 2002].
3.4 Bottom-up maximum common subtree isomorphism equivalence classes for two ordered trees. Nodes are numbered according to the equivalence class to which they belong. Figure taken from [Valiente, 2002].
3.5 Bottom-up maximum common subtree of two ordered trees (highlighted in grey). Figure taken from [Valiente, 2002].
3.6 Bottom-up maximum common subtree of two unordered trees (highlighted in grey). The dashed arrows depict the mapping of corresponding nodes. Figure taken from [Valiente, 2002].
3.7 Top-down maximum common subtree of two ordered trees (highlighted in grey). Figure taken from [Valiente, 2002].
3.8 Top-down maximum common subtree of two unordered trees (highlighted in grey). Figure taken from [Valiente, 2002].
3.9 Transformation between two ordered trees. Figure taken from [Valiente, 2002].
3.10 Shortest path in the edit graph of two ordered trees. Figure from [Valiente, 2002].
4.1 Eclipse platform architecture with its main components and plug-ins.
4.2 The steps of processing the source code of a Java class into a tree that can be used as input for the similarity measure.
4.3 Java class diagram for the top level elements of the FAMIX model. Our extensions to the original FAMIX model are shaded in grey.
4.4 Java class diagram for StructuralEntity with its subclasses.
4.5 Java class diagram for BehaviouralEntity and Context with their respective subclasses. Our extensions to the original FAMIX model are shaded in grey.
4.6 The Coogle process: transformation of a Java source code file into a general tree structure via a FAMIX representation of the abstract syntax tree. Note the loss of ordering after parsing the tree into a FAMIX model.
4.7 A bottom-up maximum common subtree isomorphism for two ordered trees (highlighted in grey). The dashed line represents the node mapping. The mapping is defined as M = {(v4, w2), (v5, w3), (v6, w4), (v7, w5), (v8, w6)}.
4.8 A top-down maximum common subtree isomorphism for two ordered trees (highlighted in grey). The dashed line represents the node mapping. The mapping is defined as M = {(v1, w1), (v3, w2), (v4, w3), (v5, w4), (v6, w5), (v7, w6), (v8, w7)}.
4.9 The complete workflow of a Coogle similarity search.
5.1 Test Case A: Resulting tree after changing the class. Added tree elements in italic.
5.2 Test Case B: Resulting tree after changing the class. Added tree elements in italic.
5.3 Test Case D: Resulting tree after changing the class. Added tree elements in italic.
5.4 Distribution of average similarity (bottom-up, top-down and tree edit distance measures) for class CompareViewerPane in org.eclipse.compare project. Denoted on the x-axis are all classes of the project with descending similarity.
5.5 Distribution of tree edit distance similarity for class CompareViewerPane in the compare project of Eclipse. Denoted on the x-axis are all classes of the project with descending similarity.
A.1 Context menu when right-clicking on a Java project in the Eclipse workspace.
A.2 Step 1: Welcome screen and choice of similarity.
A.3 Step 2: Selection of project containing the desired search object.
A.4 Step 3: Search object selection.
A.5 Step 4: Final summary page before calculation is started.
A.6 Result dialog of a tree edit distance calculation on the Eclipse compare project with the class org.eclipse.compare.Splitter as search object.
List of Tables
4.1 FAMIX elements with their corresponding AST element.
5.1 Results case A (add constructor to class): Bottom-up maximum common subtree.
5.2 Results case A (add constructor to class): Top-down maximum common subtree.
5.3 Results case A (add constructor to class): Tree edit distance.
5.4 Results case B (add attribute to class): Bottom-up maximum common subtree.
5.5 Results case B (add attribute to class): Top-down maximum common subtree.
5.6 Results case B (add attribute to class): Tree edit distance.
5.7 Results case C (add invocation to method): Bottom-up maximum common subtree.
5.8 Results case C (add invocation to method): Top-down maximum common subtree.
5.9 Results case C (add invocation to method): Tree edit distance.
5.10 Results case D (method extraction): Bottom-up maximum common subtree.
5.11 Results case D (method extraction): Top-down maximum common subtree.
5.12 Results case D (method extraction): Tree edit distance.
5.13 Results case E (implement interface): Bottom-up maximum common subtree.
5.14 Results case E (implement interface): Top-down maximum common subtree.
5.15 Results case E (implement interface): Tree edit distance.
5.16 Selected top results for comparison on org.eclipse.compare.CompareViewerPane.
5.17 Selected top results for comparison on org.eclipse.compare.NavigationAction.
List of Listings
3.1 Formal Java class declaration rules [Gosling et al., 1996].
3.2 A sample class definition following the rules from Listing 3.1.
4.1 Java CompilationUnit AST node type. This is the type of the root of an AST.
4.2 TypeDeclaration AST node type. A type declaration is the union of a class declaration and an interface declaration.
4.3 Creates a new tree of a FAMIXInstance by using the visitor pattern. This method is defined in ch.toe.tree.TreeUtil.
4.4 accept() method from ch.toe.famix.model.Class, demonstrating the visitor pattern.
4.5 Sample visitor implementation used for building a tree of all relevant FAMIX elements. This is the implementation as used by TreeBuildVisitor.
4.6 Most detailed constructor signature of CalculateBottomUpMaximumSubtree.
4.7 Most detailed constructor signature of CalculateTreeEditDistance.
5.1 Test case A: Code of the added constructor
5.2 Test case B: Code for an added attribute
5.3 Test case D: Extract the code of a method into a new method.
B.1 Sample class for defining a new similarity measure.
B.2 Sample constructor for a new similarity measure with a single tree as parameter.
B.3 Sample implementation of a new measure operation.
B.4 Implementation of a new result dialog.
B.5 In class CoogleModel: Model extension for a new measure.
B.6 In class CoogleWizardPageWelcome: Additions to the welcome page of the wizard for a new tree measure.
B.7 In class CoogleWizard: Add the new operation to the finish action of the wizard.
B.8 A new comparator implementation.
D.1 AzureusCoreImpl.java (version 2.3.0.3) from the Azureus project.
D.2 RateControlledEntity.java (version 2.3.0.3) from the Azureus project.
Chapter 1
Introduction
This chapter describes the motivation of this thesis. Further, it discusses existing work in the field
of source code similarity and concludes with an overview of the structure of the thesis.
1.1 Motivation
The field of similarity analysis in source code has many different applications. For example, similarity analysis is used to detect code duplicates (i.e., code clones). Removing such code clones
improves the maintainability of a software system. The quality of a system therefore can be analysed through identifying duplicated code. However, not only quality is assessed with clone analysis: duplicated code is an indicator for software plagiarism. The algorithms used for this task
vary from very simple source code line comparison to complex hashing algorithms that are not
susceptible to changes in naming of fields or methods and even detect similarity in obfuscated
code.
Further, similarity analysis of source code is helpful during development, for instance to provide better support for code reuse. Consider, for example, a development environment that analyses the just written code and suggests similar code examples or existing implementations from
a source repository. This helps reusing existing code and lessens the developing effort needed by
creating a collaborative knowledge of code fragments.
In this thesis, we aim to detect similar Java classes based upon the syntax tree of source code. A
syntax tree is the representation of source code in the form of a tree. Eclipse provides us with a detailed tree representation of Java source code that includes all statements and operations. This syntax tree is converted into an intermediary model, called FAMOOS Information Exchange Model
(FAMIX). FAMIX is a model for representing object-oriented source code, independent of specific
programming language constructs. This language-independent representation of source code
then is analysed for similarity with different tree algorithms. In our implementation we use three
different measures: bottom-up maximum common subtree isomorphism, top-down maximum
common subtree isomorphism and the tree edit distance algorithm. These measures detect the
similarity of two given Java classes by analysing their tree representations for similar parts.
The following contributions are made by this thesis:
• A Java representation of the FAMIX model is created and used as intermediary model for
converting the abstract syntax tree of source code into general trees.
• An implementation of three tree similarity measures, integrated into SimPack, a generic
Java library of similarity measures for the use in ontologies.
• Coogle, an Eclipse plug-in, for searching similar classes in Java projects. The similarity
analysis is based on the previously mentioned similarity algorithms.
• An evaluation of the implemented measures with test cases of refactoring patterns and a
real-world Java project, Eclipse’s compare plug-in.
1.2 Structure of Thesis
This thesis starts with an overview of the FAMOOS Information Exchange Model and details
its purposes and structure in Chapter 2. In the chapter thereafter, we describe existing similarity analysers and the implemented similarity measures. The fourth chapter documents the
implementation details and discusses problems encountered during development. We evaluated
Coogle with test cases and a real-world Java project as described in the fifth chapter. The concluding chapter details the results of our work and advises on possible future work.
Chapter 2
FAMIX - the FAMOOS
Information Exchange Model
This chapter gives a brief introduction to the FAMOOS Information Exchange Model (FAMIX) as
it is defined in [Tichelaar et al., 1999], emphasising the parts we are using in our Java implementation.
2.1 Purpose
The FAMOOS Information Exchange Model (FAMIX) was developed as an information exchange
model for the FAMOOS project¹. FAMOOS is an acronym for "Framework-based Approach for
Mastering Object-Oriented Software Evolution", a re-engineering framework for supporting the
design, analysis and maintainability of software systems. Tool prototypes for experimenting in
various areas of this project have been implemented in different languages (C++, Ada, Java and
Smalltalk). To avoid incorporating parsing technology for all those languages into each of the tool
prototypes, FAMIX was defined as a common information exchange model. The model is applied
to different languages by using specific language extensions. Figure 2.1 gives a graphical view of
this concept.
The model was published in 1999 as FAMIX 2.0 [Tichelaar et al., 1999]. The Java-specific extension can be found in [Tichelaar, 1999].
Figure 2.1: FAMIX concept overview (source: [Tichelaar et al., 1999])
¹ FAMOOS project site: http://iamwww.unibe.ch/~famoos/
2.2 Description
2.2.1 Overview
As FAMIX is a model for representing different object-oriented languages, the model uses the
highest common factor of all those languages. The main elements of the object-oriented model
can therefore be modeled with FAMIX. We describe the core model of FAMIX in Section 2.2.2.
When interchanging data between different tools, it is necessary to have a tool-independent,
transferable representation in form of files. FAMIX adopted CDIF [CDIF, 1994] for this purpose.
CDIF is a standard used for formally representing models in human readable text. Because our
plug-in does not need to export the FAMIX representation into text form, we do not describe CDIF
here.
2.2.2 Core Model
Figure 2.2 illustrates the abstract core model of FAMIX. All elements in the model are children
of the type Object. Objects in an object-oriented sense (like methods, variables and such) are
of type Entity, that is, a BehaviouralEntity or a StructuralEntity. We describe these
two types later in this chapter. A Property is tool-specific information that can be attached
to any Object. We do not define such Properties as we have no need to store information beyond what is already present in FAMIX. There are three different types of
Associations:
• InheritanceDefinition, for superclass-subclass relations;
• Invocation, invocations of a BehaviouralEntity;
• Access, used for modeling accesses to a StructuralEntity.
Argument is used for passing arguments to an invocation of a BehaviouralEntity. In
Java, for example, the statement System.out.println("Hello world!"); is represented
by passing an ExpressionArgument (the "Hello world!" string) and an AccessArgument
(representing System.out) to the Invocation of println().
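The println() example can be sketched in Java as follows. This is a minimal illustration and not the FAMIX implementation of this thesis: the class names mirror the FAMIX element names, but the fields, the constructors, the describe() method, the signature string and the helper helloWorld() are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for FAMIX elements (not the thesis classes).
class FamixInvocationSketch {

    // Base type for everything that can be passed to an invocation.
    static abstract class Argument {
        abstract String describe();
    }

    // An argument that is an arbitrary expression, e.g. a string literal.
    static final class ExpressionArgument extends Argument {
        final String source;
        ExpressionArgument(String source) { this.source = source; }
        @Override String describe() { return "expression " + source; }
    }

    // An argument that is an access to a StructuralEntity, e.g. a field.
    static final class AccessArgument extends Argument {
        final String accessedEntity;
        AccessArgument(String accessedEntity) { this.accessedEntity = accessedEntity; }
        @Override String describe() { return "access " + accessedEntity; }
    }

    // An invocation of a BehaviouralEntity together with its arguments.
    static final class Invocation {
        final String invokedSignature;
        final List<Argument> arguments = new ArrayList<>();
        Invocation(String invokedSignature) { this.invokedSignature = invokedSignature; }
    }

    // Builds the model of: System.out.println("Hello world!");
    static Invocation helloWorld() {
        Invocation println = new Invocation("println(String)"); // assumed signature text
        println.arguments.add(new AccessArgument("System.out"));
        println.arguments.add(new ExpressionArgument("\"Hello world!\""));
        return println;
    }

    public static void main(String[] args) {
        Invocation inv = helloWorld();
        System.out.println(inv.invokedSignature);
        for (Argument a : inv.arguments) {
            System.out.println("  " + a.describe());
        }
    }
}
```

The sketch keeps the two argument kinds as separate subclasses so that an access to a StructuralEntity stays distinguishable from a plain expression, which is the distinction the text draws.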
A BehaviouralEntity has two subclasses. Function models a global behaviour whereas
Method represents the definition of a behaviour of a class. The concept of Functions is not
known in every object-oriented language, for example Java does not use this type of behaviour.
BehaviouralEntity and its subclasses are shown in Figure 2.3.
Each StructuralEntity has an attribute declaredClass which declares the type of the
entity. In Java this might be a primitive type such as int or a class type like String. Figure 2.4
illustrates the subclasses of StructuralEntity. A GlobalVariable represents a globally
accessible variable that lives as long as the system itself. This concept is not known in Java. An
Attribute is a field defined in a Class. An ImplicitVariable represents context variables
such as this or super. A FormalParameter is a child of BehaviouralEntity and represents
a parameter of a method. Locally defined variables are of type LocalVariable.
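To make the categories concrete, the following small Java class is annotated with the StructuralEntity subclass that would describe each declaration. The class and its members are invented for this illustration; only the FAMIX element names in the comments come from the model.

```java
// Each comment names the FAMIX StructuralEntity that would represent the declaration.
class StructuralEntitySketch {

    private int counter;                 // Attribute, declaredClass = int

    int addTo(String label, int step) {  // label and step: FormalParameter
        int previous = this.counter;     // previous: LocalVariable; this: ImplicitVariable
        this.counter = previous + step;
        return label.length() + this.counter;
    }

    public static void main(String[] args) {
        StructuralEntitySketch sketch = new StructuralEntitySketch();
        System.out.println(sketch.addTo("abc", 2)); // prints 5
    }
}
```

A GlobalVariable has no counterpart here, in line with the remark that Java does not know this concept.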
The main entities of an object-oriented model are classes. Figure 2.5 shows the core model
of FAMIX. A Class has relations defined through InheritanceDefinition, either as superclass or as subclass. A BehaviouralEntity (represented by Method in the figure) or a
StructuralEntity (Attribute in the figure) belongs to a Class. These relations also highlight a major problem with the FAMIX model when used for our purpose: FAMIX defines relations from children to their parent. However, for effectively building a tree, we need relations
that can be traced from parents to their children. See Section 4.3.2 for the implementation consequences of this.
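The direction problem can be illustrated with a small sketch. It assumes a hypothetical Entity type that stores only the name of the element it belongs to, and inverts these child-to-parent links into a parent-to-children map from which a tree could be built. The names Entity, belongsTo and childrenByParent, as well as the attribute plugin, are assumptions for this illustration and do not describe how the implementation solves the problem.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParentChildInversionSketch {

    // A FAMIX-like entity that only knows the name of the element it belongs to.
    static final class Entity {
        final String name;
        final String belongsTo; // null for a top-level element
        Entity(String name, String belongsTo) {
            this.name = name;
            this.belongsTo = belongsTo;
        }
    }

    // Inverts the child-to-parent links into a parent-to-children map.
    static Map<String, List<String>> childrenByParent(List<Entity> entities) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (Entity e : entities) {
            if (e.belongsTo != null) {
                children.computeIfAbsent(e.belongsTo, k -> new ArrayList<>()).add(e.name);
            }
        }
        return children;
    }

    public static void main(String[] args) {
        List<Entity> entities = List.of(
            new Entity("CooglePlugin", null),
            new Entity("getDefault()", "CooglePlugin"),
            new Entity("plugin", "CooglePlugin")); // hypothetical attribute
        System.out.println(childrenByParent(entities));
    }
}
```

Building the map once costs a single pass over all entities; afterwards the children of any element can be looked up directly, which is what a top-down tree construction needs.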
Figure 2.2: Abstract basic elements of the FAMIX model. Figure from [Tichelaar et al., 1999].
Figure 2.3: Subclasses of the FAMIX element BehaviouralEntity. Figure from [Tichelaar et al., 1999].
FAMIX has different levels of extraction that denote how much information is extracted. Level
1 is the minimum a parser must be able to extract and includes four different object types: Class,
InheritanceDefinition, BehaviouralEntity and Package. Level 4, the most detailed
level and the level we extract with Coogle, contains all objects defined in FAMIX.
2.2.3 FAMIX Extensions for Java
FAMIX extension documents exist for various object-oriented languages. For Java-specific features, [Tichelaar, 1999] defines the needed extensions. We describe the most notable extensions in
this section; refer to the Java specification [Gosling et al., 1996] for further information.
Class. Addition of methods for representing the possible states of a Class: isInterface(),
isPublic(), isFinal() and isAbstract().
Method. Corresponding to the allowed method modifiers, the methods isFinal(),
isSynchronized() and isNative() are added. The latter is used for methods that are
implemented in an external language (Assembler for instance). The signature of a Method
is defined to have a format like methodname(paramType1,..,paramTypeN).
Figure 2.4: Subclasses of StructuralEntity and their relationship to other elements (in grey). Figure from
[Tichelaar et al., 1999].
Attribute. Corresponding to the possible modifiers of an Attribute, the methods isFinal(),
isTransient() (if the attribute does not need to be serialised) and isVolatile() (used
for indicating that the attribute should not be optimised by the compiler) are added.
LocalVariable and FormalParameter. The method isFinal() is added for representing this possible state of the field types.
TypeCast. This is a new object type added specifically for the Java extension. A TypeCast is an
Association with a fromType and a toType for representing type casts between two
Java types. In FAMIX, a TypeCast is a member of a BehaviouralEntity.
accessControlQualifier. The accessControlQualifier of an object can take one of three
values: public, protected and private. Default package visibility is represented by an empty accessControlQualifier.
Function and GlobalVariable. These elements will never appear in a FAMIX representation of
Java code as there is no such concept in Java.
Inner classes. There is no specification for nested classes, inner classes and anonymous classes in
[Tichelaar, 1999]. We represent these class types in our FAMIX implementation, but do not
include them when parsing the source code (see Section 4.6 for more information).
Implicit methods. Java has implicit methods such as this(..), super(..) or default constructors. Neither FAMIX nor our implementation represents these.
Static and instance initialisers. A static initialiser can be used for variables and classes in Java
source code. We do not represent these code fragments as there is no FAMIX representation
available.
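The accessControlQualifier rule from the list above can be illustrated with a minimal Java sketch. It is not taken from the Coogle sources: the class and method names are hypothetical, and we assume that the modifier bits are given in the encoding of java.lang.reflect.Modifier.

```java
import java.lang.reflect.Modifier;

// Hypothetical helper, for illustration only.
final class AccessControlSketch {

    // Maps Java modifier bits to the qualifier string used in the model.
    // Default (package) visibility is represented by the empty string.
    static String accessControlQualifier(int modifiers) {
        if (Modifier.isPublic(modifiers)) {
            return "public";
        }
        if (Modifier.isProtected(modifiers)) {
            return "protected";
        }
        if (Modifier.isPrivate(modifiers)) {
            return "private";
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(accessControlQualifier(Modifier.PUBLIC | Modifier.STATIC));
        System.out.println(accessControlQualifier(Modifier.FINAL).isEmpty());
    }
}
```

Other modifiers such as static or final do not influence the qualifier; they are covered by the separate query methods described above.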
Figure 2.5: Core elements of FAMIX and their relationship. Figure from [Tichelaar et al., 1999].
Names of objects can be queried in two different formats: a simple name representation, for
example ”CooglePlugin”, and a unique name. Unique names contain the full package path to
the object and all information necessary for uniquely identifying an object. A Class then has a
unique name of ”ch.toe.coogle.plugin.CooglePlugin”. A method in the same class is
then uniquely named as ”ch.toe.coogle.plugin.CooglePlugin.getDefault()”, which
is the unique name of the belongsTo()-Class, followed by the name of the method including
signature. Please note that this unique name format differs from the original FAMIX definition.
Originally, the unique name of a Method is in a format with two colons instead of the periods:
”ch::toe::coogle::plugin::CooglePlugin.getDefault()”. We used the simpler format with periods as this corresponds to the usual way of representing names in Java.
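To illustrate the unique name format described above, the following minimal Java sketch composes the unique name of a method from the unique name of its class and the method signature. The names UniqueNameSketch, signature and uniqueMethodName are hypothetical and not part of Coogle. Whether parameter types are written as simple or as qualified names is not specified above, so the sketch uses the strings it is given.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only.
final class UniqueNameSketch {

    // Builds a signature of the form methodname(paramType1,..,paramTypeN).
    static String signature(String methodName, List<String> paramTypes) {
        return methodName + "(" + String.join(",", paramTypes) + ")";
    }

    // Unique name of the owning class, a period, then the method signature.
    static String uniqueMethodName(String classUniqueName, String methodName,
                                   List<String> paramTypes) {
        return classUniqueName + "." + signature(methodName, paramTypes);
    }

    public static void main(String[] args) {
        System.out.println(uniqueMethodName("ch.toe.coogle.plugin.CooglePlugin",
                "getDefault", Collections.<String>emptyList()));
        System.out.println(signature("compare", Arrays.asList("Tree", "Tree")));
    }
}
```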
2.3 FAMIX as Intermediary Model
Why did we choose FAMIX as intermediary model for representing Java source code? These are
the advantages when using such a model:
• By using a fixed standard representation of object-oriented code, we can maintain interoperability with other tools using FAMIX, for example CodeCrawler [Lanza, 2003].
• Similarity analysis for other languages (like C++ for example) will yield comparable results
when based on FAMIX.
The use of FAMIX also has disadvantages. Most notably, we have an information loss when
converting Java code to its FAMIX representation. This happens mainly because of elements not
represented in FAMIX. See Section 4.1.3 for details on the information that is lost.
Further, FAMIX is an additional layer in-between the detailed Java abstract syntax tree (see Section 4.1.2) and the final tree representation (see Section 4.5.3).
Chapter 3
Similarity Analysis
This chapter gives an overview of similarity in general, similarity analysis on source code and
existing work in this field. We describe the similarity measures we used from an algorithmic point
of view and outline their features and shortcomings.
3.1 Overview
The goal of this thesis is to find similar Java classes by analysing the FAMIX representation of
the abstract syntax tree of source code. We search for similarity in existing Java projects using
three different algorithms (outlined in Section 3.4) and analyse how changes in source code affect
similarity.
3.2 Related Work
Different approaches exist for detecting similarity in trees and source code. [Baxter et al., 1998]
describes a tool that analyses systems for duplicated code. The algorithm used is based on
abstract syntax trees and hashes code fragments for detecting exact and near-miss clones.
[Myles and Collberg, 2005] takes a similar approach by using a birthmarking technique, deducing
unique characteristics from the instructions of a program, for detecting software
theft. A minimal edit script algorithm for transforming one tree into another tree is defined in
[Chawathe et al., 1996] and detects changes in general, hierarchically structured information.
[Shasha et al., 2004] describes an application doing cousin search in phylogenetic trees (which
represent evolutionary history). The same authors have also published work on general tree and graph
searching, using exact and approximate search algorithms [Shasha et al., 2002]. [Wang et al., 2003]
presents a tool called TreeRank, which does a nearest neighbour search for detecting similar patterns in a given phylogenetic tree.
[Mishne and de Rijke, 2004] and [Neamtiu et al., 2005] define a conceptual model for source
code representation, which in both cases partially resembles the abstract syntax tree defined by
Eclipse. [Mishne and de Rijke, 2004] uses code similarity for retrieving similar code fragments
from an existing repository of code documents. [Neamtiu et al., 2005] extracts similarity by mapping corresponding AST elements and describing code evolution with this information.
A different approach is described in [Kontogiannis, 1993], where a Program Description Tree
(PDT) is generated from code fragments. These fragments are treated as behavioural entities, i.e.,
as independent components, interacting with resources and other entities of the system. The PDT
then not only represents structural information like an AST does, but also contains functional
information in the form of interactions and accesses. Similar fragments are detected by searching
for entities with similar characteristics of these PDTs.
Analysing large software systems for similarity, [Yamamoto et al., 2002] proposes an algorithm based on correspondence of source code lines. Also not based on syntax trees is the method
implemented by [Michail and Notkin, 1999], where similar functions in different libraries are detected. The matching algorithm uses the name of functions, the name of their members and
surrounding comments as similarity indicator.
Finally, [Baker and Manber, 1998] lists various approaches for reconstructing changes and similarity information from Java bytecode. Such approaches include the analysis of fingerprint samplings or almost-matching of disassembled code by ignoring textual information (such as field
names).
3.3 Definition of Similarity
Similarity defines the proximity of two objects. In our case, we analyse two Java classes for matching parts and conclude the nearness of the two classes from the size of the matched parts.
We define two different notions of similarity when analysing Java source code similarity or
similarity of object-oriented source code in general: structural similarity and functional similarity.
These two types of similarity are explained in the next two sections of this chapter.
The current implementation of our similarity analysis tool is able to detect structural similarity.
Possible extensions to detect functional similarity are outlined in Sections 5.4.3 and 6.2.
3.3.1 Structural Similarity
Source code of any programming language is structurally defined through a limited set of instructions, a given grammar usually consisting of words and symbols. The structure of a piece of
code is fixed by this grammar. A Java class for example needs to follow the structure defined in
Listing 3.1. A sample class declaration obeying these rules is shown in Listing 3.2.
NormalClassDeclaration:
    ClassModifiers class Identifier TypeParameters Super Interfaces
        { ClassBodyDeclarations }
ClassBodyDeclarations:
    ClassBodyDeclaration
    ClassBodyDeclarations ClassBodyDeclaration
ClassBodyDeclaration:
    ClassMemberDeclaration
    InstanceInitializer
    StaticInitializer
    ConstructorDeclaration
ClassMemberDeclaration:
    FieldDeclaration
    MethodDeclaration
    ClassDeclaration
    InterfaceDeclaration
    ;
Listing 3.1: Formal Java class declaration rules [Gosling et al., 1996].
public class Example extends Parent {
}
Listing 3.2: A sample class definition following the rules from Listing 3.1.
Because a grammar is constructed like a tree, we can generate a tree representation of the code,
for example an abstract syntax tree (AST). See Section 4.1.2 for a description of the syntax tree used
in Coogle.
We define structural similarity as similarity in structure of the source code, in this case the
structure of the abstract syntax tree of two Java objects. The structure of a class contains very little
information about the functionality the class provides.
Structural similarity is very successful when used for code duplication detection, as the instruction structure of copied code remains the same and does not change, for example, when
variable names are replaced. However, such structural similarity can become almost undetectable
even after simple changes to the instruction sequence if the algorithm is only able to compare ordered
information, for example ordered syntax trees. We discuss this in Section 3.4.2.
3.3.2 Functional Similarity
Functional similarity defines the similarity of two objects, in our case Java classes, in the function
they perform. [Kontogiannis, 1993] for example defines code fragments as behavioural entities
which interact with the rest of the system. Using these interactions as characteristics of a studied
code fragment, it is possible to search for entities that have similar interaction characteristics and
therefore perform similar functions.
Another example of a project using functional similarity is the Strathcona tool described in
[Holmes and Murphy, 2005], which measures inheritance, invocations and accesses of a type and
then recommends similar code samples from a repository.
Such functional similarity is notably different from structural similarity, as a similar code structure can perform fundamentally different functions and, vice versa, functionally similar classes can
be represented in various structurally different ways.
3.4 Evaluated Tree Algorithms
The input for our similarity measures are trees generated from a Java abstract syntax tree as represented in Eclipse. The measures are generic, i.e., they operate on general tree structures, independent of any context information such as FAMIX attributes or AST elements.
3.4.1 Definitions
This section defines the most important vocabulary used in the following sections:
Graphs. A graph consists of vertices and arcs. Each arc connects two vertices. Arcs can be directed or undirected. A sample directed graph is illustrated in Figure 3.1.
Trees. A tree is a particular case of a directed graph in which exists a single vertex, called the root
of the tree, such that there is a unique walk from the root to any vertex of the tree. Vertices
of a tree are called nodes, arcs are called edges. See Figure 3.1 for an example of a tree.
Although there exist undirected trees, we only use trees based on directed graphs.
Ordered trees. A tree can have multiple nodes as children. An ordered tree is a tree in which the
relative order of the children is fixed for each node.
Labelled trees. Each node of a tree can have a so called label. The label of a node consists of
additional attributes, for example a name.
Tree isomorphism. Tree isomorphism is the problem of determining whether a tree T1 is isomorphic
to another tree T2, i.e., whether there exists a mapping of the nodes of T1 to the nodes of T2 that preserves
the structure of the tree: the root of T1 is mapped to the root of T2 and their children are
mapped equivalently, corresponding to their order.
Equivalence classes. Elements of an equivalence class are equivalent to all other elements in the
same equivalence class. In the case of tree nodes, we define nodes to be equivalent if they
have the same subtree rooted at them. The partitioning of a tree in its equivalence classes
therefore is the sorting of each node into a subset of nodes (the equivalence class) with the
same subtree rooted at them.
Bipartite graph. A bipartite graph is an undirected graph in which the vertices can be partitioned
in two subsets in such a way that every edge of the graph joins a vertex of one subset with
a vertex of the other subset. See Figure 3.2 for an example of such a graph.
Figure 3.1: Left: A directed graph with five vertices and eight arcs. Right: A rooted tree with six nodes.
Figure 3.2: A complete bipartite graph between the nodes v1, v2 of tree T1 and the nodes w1, w2 and w3 of tree T2.
3.4.2 Using Ordered or Unordered Trees?
One important question arises when representing source code as trees: is ordering important? A
Java compiler does not necessarily consider the order as important. The ordering of class body
entities such as methods and field declarations is not relevant whereas instructions in the bodies
of these entities depend on the order of appearance in the source code. Any algorithm doing an
ordered match is therefore a correct approach for matching abstract syntax trees, because we then
simply assume the order of all instructions to be static. However, this fails to detect similarity
for classes with changes in the order of entities. Using an algorithm for unordered trees fixes
this limitation for top-level entities, but ignores the ordered structure of instructions in bodies of
entities.
As FAMIX does not represent all instructions that can occur in the body of an entity, we
prefer using algorithms that do an unordered tree match. However, efficient algorithms for unordered trees do not exist for all of the chosen similarity measures. For example, computing the tree edit distance for unordered trees is MAX SNP-hard (described in
[Zhang et al., 1992] and [Zhang and Jiang, 1994]), i.e., no algorithm is known that solves it in time polynomial in the input size of the trees, and none exists unless P = NP. Also see Section 4.5.3 for other shortcomings of our implementation concerning unordered tree matching. For all these reasons, we implement unordered
tree matching for bottom-up maximum common subtree only (see Sections 4.2.2 and 5.4.2).
3.4.3 Tree Isomorphism
The input for our similarity measures are Java classes which are represented by abstract syntax
trees. For analysing similarity, we search for isomorphism in those syntax trees and derive class
similarity from the size of the matched trees, i.e., the number of nodes in the subtree.
Tree isomorphism answers the question of one tree being isomorphic to another tree, checking two trees for equality. Equality of two nodes is determined by either comparing node labels
(for isomorphism with labelled trees) or not comparing node labels (unlabelled tree isomorphism,
matches on structure only). See Figure 3.3 for an example of isomorphic trees. Subtree and maximum subtree isomorphism are more general cases of such tree isomorphism problems. We consider tree isomorphism for abstract syntax trees to be in the field of structural similarity analysis
as the abstract syntax tree does not hold information on functionality.
Figure 3.3: Isomorphic ordered trees. Nodes numbered according to a preorder traversal. Figure taken from
[Valiente, 2002].
Three different tree similarity measures are integrated into Coogle, namely bottom-up maximum common subtree, top-down maximum common subtree and the tree edit distance. We
describe these algorithms in the following sections.
3.4.4 Bottom-up Maximum Common Subtree Isomorphism
General
This bottom-up maximum common subtree isomorphism algorithm is defined by [Valiente, 2002].
The goal of this algorithm is to find the largest isomorphic subtree, common to two given trees.
The algorithm described is applicable for both ordered and unordered trees with minor changes.
The problem of finding a bottom-up maximum common subtree of an ordered or unordered
tree T1 = (V1, E1) to another ordered or unordered tree T2 = (V2, E2) can be reduced to the problem of partitioning the vertices of the trees V1 ∪ V2 into equivalence classes of bottom-up subtree
isomorphism. Two nodes (in the same or different trees) are equivalent if the bottom-up subtrees
rooted at them are isomorphic. Then, the bottom-up subtree of T1 rooted at node v ∈ V1 is isomorphic to the bottom-up subtree of T2 rooted at node w ∈ V2 if and only if nodes v and w belong
to the same equivalence class of bottom-up subtree isomorphism.¹ The equivalence classes of two
trees are illustrated by Figure 3.4. We determine the equivalence class of a given node by recursively
building an isomorphism string consisting of the isomorphism codes of all children of the node.
We then compare that isomorphism string to a collection of existing isomorphism strings. If the
string is already in the collection, the current node’s equivalence class is read from the collection.
If the isomorphism string is not contained in the collection, we add it to the collection and assign
it a new equivalence class.
Figure 3.4: Bottom-up maximum common subtree isomorphism equivalence classes for two ordered trees. Nodes are
numbered according to the equivalence class to which they belong. Figure taken from [Valiente, 2002].
After collecting the equivalence classes of both trees, the algorithm searches for the biggest
equivalence class by using a queue with the size of the nodes as priority. The first element in
the queue is the node with the biggest size. This ensures that the matched subtree is indeed a
maximum subtree. Figure 3.5 illustrates the bottom-up maximum common subtree for the sample
trees in Figure 3.4. The maximum common subtree of the trees is highlighted in grey. Note
that multiple nodes can have the same size and equivalence class. It is therefore possible to find
multiple instances of a maximum common subtree in both trees.
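The two steps described above, assigning equivalence classes and then searching for the largest class shared by both trees, can be sketched as follows for labelled ordered trees. This is a simplified illustration and not the Coogle implementation: the Node class, the string format of the isomorphism code and the integer labels are assumptions, and the sketch only returns the size of a largest common bottom-up subtree instead of a node mapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch, for illustration only.
final class BottomUpSketch {

    static final class Node {
        final int label;                          // integer code of the node type (assumed)
        final List<Node> children = new ArrayList<>();
        int eqClass;                              // filled in by classify()
        int size;                                 // number of nodes in the subtree

        Node(int label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Postorder pass: a node's code is its label followed by the equivalence
    // classes of its children in order. Equal codes mean isomorphic subtrees.
    static void classify(Node n, Map<String, Integer> codes, List<Node> visited) {
        StringBuilder code = new StringBuilder().append(n.label);
        n.size = 1;
        for (Node child : n.children) {
            classify(child, codes, visited);
            code.append(',').append(child.eqClass);
            n.size += child.size;
        }
        Integer known = codes.get(code.toString());
        if (known == null) {                      // unseen code: open a new class
            known = codes.size() + 1;
            codes.put(code.toString(), known);
        }
        n.eqClass = known;
        visited.add(n);
    }

    // Size of a largest bottom-up subtree common to both trees (0 if none).
    static int maxCommonSubtreeSize(Node t1, Node t2) {
        Map<String, Integer> codes = new HashMap<>();   // shared by both trees
        List<Node> nodes1 = new ArrayList<>();
        List<Node> nodes2 = new ArrayList<>();
        classify(t1, codes, nodes1);
        classify(t2, codes, nodes2);

        Set<Integer> classesInT2 = new HashSet<>();
        for (Node w : nodes2) {
            classesInT2.add(w.eqClass);
        }
        // Queue ordered by subtree size, biggest first.
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(b.size, a.size));
        queue.addAll(nodes1);
        while (!queue.isEmpty()) {
            Node v = queue.poll();
            if (classesInT2.contains(v.eqClass)) {
                return v.size;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Node t1 = new Node(1, new Node(2, new Node(4), new Node(5)), new Node(3));
        Node t2 = new Node(6, new Node(2, new Node(4), new Node(5)));
        System.out.println(maxCommonSubtreeSize(t1, t2));
    }
}
```

Because the first node taken from the queue whose class also occurs in the second tree has the largest size among all candidates, the search can stop at that node.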
The last step of this algorithm is to generate a mapping M ⊆ V1 × V2 of the nodes in the
maximum common subtree of T1 and T2 . See Figure 3.6 for such a mapping (for unordered trees).
The procedure for generating this map is different for ordered and unordered trees and outlined
in the following sections.
[Valiente, 2002] describes this algorithm for unlabelled trees only. We extended the algorithm
to use labelled trees by assigning an integer value to each node type, for example -1 for the FAMIX element
Access, -2 for AccessArgument and so on. The equivalence classes are then matched based on
this value and the already defined equivalence class code. This solution is also suggested in
[Valiente, 2000].
¹ Source: [Valiente, 2002], Section 4.3.3
Figure 3.5: Bottom-up maximum common subtree of two ordered trees (highlighted in grey). Figure taken from
[Valiente, 2002].
Ordered trees
Ordered trees are processed as outlined in the ”General” section. The mapping of the matching
subtree nodes of tree T1 and T2 is created with a recursive pass through all the nodes, starting at
the roots of the maximum subtrees. Every procedure invocation compares the two given nodes for
equality and continues with processing the children in their order if the roots are equal. Matching nodes are added to a mapping M ⊆ V1 × V2 , which then contains the resulting bottom-up
maximum subtree of the ordered trees T1 and T2 .
With two trees T1 and T2 with n1 and n2 nodes and n1 ≤ n2, the algorithm for ordered trees
runs in O(n2 log n2) time using O(n1 + n2) additional space (see Theorem 4.56 in [Valiente, 2002]).
Unordered Trees
Two things need to be changed for applying the algorithm to unordered trees. First, during the
collection of the equivalence classes of the trees, we now sort the child isomorphism codes of a
node before searching for already existing code sequences in the equivalence class collection. This
ensures that all children of a node only differing in order are treated the same, thus unordered.
The second change happens during the mapping phase. The nodes of T1 are processed in preorder
traversal with a non-recursive loop and the children of a node are mapped to the node from T2
with the same equivalence code ignoring the ordering. Figure 3.6 illustrates such a mapping.
This bottom-up maximum common subtree algorithm for unordered trees T1 and T2 with
n1 and n2 nodes runs in O((n1 + n2)^2) time using O(n1 + n2) additional space (see
Theorem 4.60 in [Valiente, 2002]).
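The first of the two changes, sorting the child codes before the lookup, can be illustrated in isolation. The following Java sketch is an assumption-laden illustration rather than the Coogle code: the Node class and the use of a list of integers as lookup key are hypothetical choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, for illustration only.
final class UnorderedCodesSketch {

    static final class Node {
        final int label;
        final List<Node> children;

        Node(int label, Node... kids) {
            this.label = label;
            this.children = Arrays.asList(kids);
        }
    }

    // Returns the equivalence class of n. With unordered == true the child
    // codes are sorted first, so siblings that differ only in their order
    // produce the same key and therefore the same class.
    static int eqClass(Node n, Map<List<Integer>, Integer> codes, boolean unordered) {
        List<Integer> childCodes = new ArrayList<>();
        for (Node child : n.children) {
            childCodes.add(eqClass(child, codes, unordered));
        }
        if (unordered) {
            Collections.sort(childCodes);
        }
        List<Integer> key = new ArrayList<>();
        key.add(n.label);
        key.addAll(childCodes);
        Integer known = codes.get(key);
        if (known == null) {
            known = codes.size() + 1;
            codes.put(key, known);
        }
        return known;
    }

    public static void main(String[] args) {
        Node a = new Node(1, new Node(2), new Node(3));
        Node b = new Node(1, new Node(3), new Node(2));
        Map<List<Integer>, Integer> codes = new HashMap<>();
        System.out.println(eqClass(a, codes, true) == eqClass(b, codes, true));
    }
}
```

With sorting switched off, the same two trees fall into different classes, which corresponds to the ordered case.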
3.4.5 Top-down Maximum Common Subtree Isomorphism
General
[Valiente, 2002] defines a top-down maximum common subtree isomorphism for ordered and
unordered trees. The goal of this algorithm is to find the largest common subtree of two given
trees under the prerequisite that the subtree is rooted at the root nodes of the trees. The differences
between the algorithm for ordered trees and the algorithm for unordered trees are fundamental.
Both algorithms are described separately in the following two sections.
Figure 3.6: Bottom-up maximum common subtree of two unordered trees (highlighted in grey). The dashed arrows
depict the mapping of corresponding nodes. Figure taken from [Valiente, 2002].
Ordered Trees
Starting from the root nodes of T1 and T2 , the algorithm recursively processes all children in
preorder and compares each pair of nodes for equality. If two nodes match, they are added to
a mapping M ⊆ V1 × V2 which contains the complete subtree after the recursion finishes. See
Figure 3.7 for an illustration of a top-down maximum common subtree of two ordered trees.
The comparison of the nodes during the recursive processing allows for an extension of the
algorithm to labelled trees as well, returning a successful match only when the labels match. See
Section 4.3.4 for a description of the comparator pattern used.
This algorithm is very efficient with a running time of O(n1) and O(n1) additional space for
two ordered trees T1 and T2 with n1 and n2 nodes, where n1 ≤ n2 (see Lemma 4.52 in
[Valiente, 2002]).
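The recursive procedure for ordered trees described above can be sketched in Java as follows. The sketch is an illustration under assumptions and not the Coogle implementation: the Node class and the comparison of labels via String.equals are hypothetical stand-ins for the comparator mentioned above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, for illustration only.
final class TopDownOrderedSketch {

    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... kids) {
            this.label = label;
            this.children = Arrays.asList(kids);
        }
    }

    // Compares v and w; on a match the pair is recorded and the children are
    // processed pairwise in their given order. Returns the number of matched
    // node pairs below and including (v, w).
    static int match(Node v, Node w, Map<Node, Node> mapping) {
        if (!v.label.equals(w.label)) {
            return 0;
        }
        mapping.put(v, w);
        int matched = 1;
        int common = Math.min(v.children.size(), w.children.size());
        for (int i = 0; i < common; i++) {
            matched += match(v.children.get(i), w.children.get(i), mapping);
        }
        return matched;
    }

    public static void main(String[] args) {
        Node t1 = new Node("a", new Node("b", new Node("c")), new Node("d"));
        Node t2 = new Node("a", new Node("b"), new Node("e"), new Node("f"));
        Map<Node, Node> mapping = new LinkedHashMap<>();
        System.out.println(match(t1, t2, mapping));
    }
}
```

Each node of the smaller tree is visited at most once, which is consistent with the linear running time stated above.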
Figure 3.7: Top-down maximum common subtree of two ordered trees (highlighted in grey). Figure taken from
[Valiente, 2002].
Unordered Trees
Figure 3.8 illustrates a top-down maximum common subtree of two unordered trees. This algorithm is fundamentally different from the top-down maximum common subtree isomorphism
algorithm for ordered trees. The search is performed by recursively solving weighted bipartite
matching problems (see Figure 3.2 for an example of such a graph) for all children of two given
nodes. Starting with the children of both roots and recursively calculating the size of the subtree, matching graphs are built for each corresponding node level. The weighting of the arcs is
derived from each node’s subtree size. The maximum weight matching in the bipartite graphs then connects the nodes which are part of the maximum common subtree. For
extending the algorithm to labelled trees, we assign a weight of 0 to pairs of non-matching nodes in the
matching graphs, which ensures that such pairs are not considered.
For two trees T1 and T2 with respectively n1 and n2 nodes (and n1 ≤ n2), the algorithm runs in
O((n1 + n2)(n1 n2 + (n1 + n2) log(n1 + n2))) time using O(n1 n2) additional space (see Lemma 4.44
in [Valiente, 2002]). Note that there exists a faster algorithm for unordered top-down subtree
isomorphism ([Shamir and Tsur, 1997]).
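The recursive structure of the unordered case can be sketched in Java as follows. Note the simplification: instead of an efficient weighted bipartite matching algorithm, the sketch searches all assignments of children by brute force, which is exponential and only suitable for illustrating small examples. It therefore does not achieve the running time stated above. The Node class and all method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, for illustration only.
final class TopDownUnorderedSketch {

    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... kids) {
            this.label = label;
            this.children = Arrays.asList(kids);
        }
    }

    // Size of the largest top-down common subtree rooted at (v, w).
    // Non-matching labels yield weight 0, so the pair is never selected.
    static int match(Node v, Node w) {
        if (!v.label.equals(w.label)) {
            return 0;
        }
        int p = v.children.size();
        int q = w.children.size();     // assumed to be below 31 (bitmask)
        int[][] weight = new int[p][q];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < q; j++) {
                weight[i][j] = match(v.children.get(i), w.children.get(j));
            }
        }
        return 1 + bestAssignment(weight, 0, 0);
    }

    // Maximum total weight when children i, i+1, ... of v are each matched
    // to at most one child of w that is not yet marked in 'used'.
    static int bestAssignment(int[][] weight, int i, int used) {
        if (i == weight.length) {
            return 0;
        }
        int best = bestAssignment(weight, i + 1, used);   // leave child i unmatched
        for (int j = 0; j < weight[i].length; j++) {
            if ((used & (1 << j)) == 0 && weight[i][j] > 0) {
                int total = weight[i][j] + bestAssignment(weight, i + 1, used | (1 << j));
                best = Math.max(best, total);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Node t1 = new Node("a", new Node("b", new Node("c")), new Node("d"));
        Node t2 = new Node("a", new Node("d"), new Node("b", new Node("c"), new Node("e")));
        System.out.println(match(t1, t2));
    }
}
```

In the example, the children of the two roots appear in different order, yet all four nodes of the first tree are matched.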
Figure 3.8: Top-down maximum common subtree of two unordered trees (highlighted in grey). Figure taken from
[Valiente, 2002].
3.4.6 Tree Edit Distance
General
Calculating the tree edit distance is a completely different approach for tree analysis than the
maximum common subtree isomorphism algorithms. The tree edit distance algorithm answers
the question how many steps it takes to transform one tree into another tree by applying a set of
edit operations to the trees (adding, deleting and replacing nodes).
This algorithm as described in [Valiente, 2002] is applicable for rooted ordered trees only. The
problems with unordered trees are outlined in the last subsection of this section.
Ordered Trees
The tree edit distance algorithm as defined in [Valiente, 2002] has three different elementary edit
operations. For the ordered trees T1 = (V1, E1) and T2 = (V2, E2) we denote a deletion of a leaf
node v ∈ V1 by v ↦ λ or (v, λ). The substitution of a node w ∈ V2 for a node v ∈ V1 is denoted
by v ↦ w or (v, w) and an insertion into T2 of a node w ∈ V2 as a new leaf is denoted by λ ↦ w
or (λ, w). Deletion and insertion operations are made on leaves only. The deletion of a non-leaf
node requires first the deletion of the whole subtree rooted at the node. The same applies to the
insertion of non-leaves.²
A tree is transformed into another tree by using a sequence of elementary edit operations as
illustrated in Figure 3.9. Note that in this figure, substitution of corresponding nodes is not indicated. The complete transformation script is: [(v1, w1), (v2, w2), (v3, λ), (v4, λ), (v5, w3), (λ, w4),
(λ, w5), (λ, w6), (λ, w7)].
Figure 3.9: Transformation between two ordered trees. Figure taken from [Valiente, 2002].
Not every sequence of edit operations denotes a valid transformation between two trees. Deletions and insertions must appear in bottom-up order to ensure that these operations are only
made on leaves. A postorder traversal for example ensures this condition. Further, substitutions
must preserve parent and sibling order. This means that the parent of a substituted non-root node
must be substituted by the parent of the non-root node the substitution was made for. Also, the
substitution of sibling nodes in T1 must preserve the order among the siblings by substituting
the nodes with sibling nodes from T2. These conditions are ensured by defining that depth[v] =
depth[w] for all (v, w) ∈ M, where M is a mapping between the two trees.
Costs are assigned to all elementary edit operations. The standard implementation uses a
cost of γ(v, w) = 1 if v = λ or w = λ and γ(v, w) = 0 otherwise. With such weights, substitute
operations cost less than the deletion or insertion of a node. The edit distance then is the cost of the least-cost
transformation of the two trees.
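As a worked example of this cost function, the following Java sketch sums γ over the transformation script given for Figure 3.9. The classes Op and EditScriptCostSketch are hypothetical; null stands for λ, and node names are plain strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, for illustration only.
final class EditScriptCostSketch {

    // An elementary edit operation (v, w); null stands for lambda.
    static final class Op {
        final String v;
        final String w;

        Op(String v, String w) {
            this.v = v;
            this.w = w;
        }
    }

    // Cost 1 for a deletion or an insertion, cost 0 for a substitution.
    static int gamma(Op op) {
        return (op.v == null || op.w == null) ? 1 : 0;
    }

    // Total cost of an edit script.
    static int cost(List<Op> script) {
        int total = 0;
        for (Op op : script) {
            total += gamma(op);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Op> script = Arrays.asList(
                new Op("v1", "w1"), new Op("v2", "w2"),
                new Op("v3", null), new Op("v4", null),
                new Op("v5", "w3"),
                new Op(null, "w4"), new Op(null, "w5"),
                new Op(null, "w6"), new Op(null, "w7"));
        System.out.println(cost(script)); // 2 deletions + 4 insertions = 6
    }
}
```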
Valiente’s approach to calculate the edit distance is to build a graph with the nodes of both
trees. The edges in the graph denote different operations with their assigned weights. Figure 3.10
illustrates such an edit graph with the shortest path in bold. Finding the least-cost transformation
then is reduced to the problem of finding the shortest path from the upper left corner down to
the lower right corner. Vertical arcs of the form (v_i w_j, v_{i+1} w_j) represent the deletion of node v_{i+1}
from T1, diagonal arcs (v_i w_j, v_{i+1} w_{j+1}) represent the substitution of node w_{j+1} of T2 for node v_{i+1}
of T1. And finally, a horizontal arc like (v_i w_j, v_i w_{j+1}) represents the insertion of node w_{j+1} into T2.
Dijkstra’s shortest path algorithm [Dijkstra, 1959] is used for calculating the shortest path and determining
the edit operations needed for the transformation of the two trees.
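The shortest path search in such an edit graph can be sketched in Java as follows. Several points are assumptions and not taken from [Valiente, 2002] or from Coogle: each tree is given only as the list of node depths in preorder, labels are ignored (every substitution costs 0), and the rule deciding which arcs exist is derived from the depth-preserving mapping condition described above. As all arcs point to the right or downwards, the graph is acyclic, and the sketch replaces Dijkstra's algorithm with a simple sweep over the grid.

```java
import java.util.Arrays;

// Hypothetical sketch, for illustration only.
final class EditGraphSketch {

    // d1[i] and d2[j] are the depths of the (i+1)-th and (j+1)-th node of
    // T1 and T2 in preorder (the root has depth 0).
    static int editDistance(int[] d1, int[] d2) {
        int n1 = d1.length;
        int n2 = d2.length;
        final int inf = Integer.MAX_VALUE / 2;
        int[][] dist = new int[n1 + 1][n2 + 1];
        for (int[] row : dist) {
            Arrays.fill(row, inf);
        }
        dist[0][0] = 0;                         // upper left corner

        for (int i = 0; i <= n1; i++) {
            for (int j = 0; j <= n2; j++) {
                int d = dist[i][j];
                if (d >= inf) {
                    continue;                   // vertex not reachable
                }
                boolean moreV = i < n1;
                boolean moreW = j < n2;
                // Vertical arc: delete the next node of T1 (cost 1).
                if (moreV && (!moreW || d1[i] >= d2[j])) {
                    dist[i + 1][j] = Math.min(dist[i + 1][j], d + 1);
                }
                // Horizontal arc: insert the next node of T2 (cost 1).
                if (moreW && (!moreV || d1[i] <= d2[j])) {
                    dist[i][j + 1] = Math.min(dist[i][j + 1], d + 1);
                }
                // Diagonal arc: substitute nodes of equal depth (cost 0).
                if (moreV && moreW && d1[i] == d2[j]) {
                    dist[i + 1][j + 1] = Math.min(dist[i + 1][j + 1], d);
                }
            }
        }
        return dist[n1][n2];                    // lower right corner
    }

    public static void main(String[] args) {
        // T1 is a chain r - a - b, T2 is a root r with two children a and b.
        System.out.println(editDistance(new int[] {0, 1, 2}, new int[] {0, 1, 1}));
    }
}
```

In the example, the leaf b of T1 is deleted and a new leaf is inserted below the root, so two operations with cost 1 are needed.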
Figure 3.10: Shortest path in the edit graph of two ordered trees. Figure from [Valiente, 2002].
Finding the least-cost transformation of an ordered tree T1 to an ordered tree T2 by determining shortest paths in an edit graph runs in O(n1 n2) time using O(n1 n2) additional space (see
Lemma 2.20 in [Valiente, 2002]).
² Source: [Valiente, 2002], Section 2.1
Unordered Trees
Tree edit distance calculation for unordered trees is MAX SNP-hard as [Zhang et al., 1992] and
[Zhang and Jiang, 1994] showed. An efficient implementation is therefore not to be expected. Solutions for constrained trees with a fixed maximum number of children exist, for example in [Zhang, 1996], but
are not applicable for our needs because our trees have an unbounded number of children.
Chapter 4
Implementation
This chapter describes Coogle, our Eclipse plug-in implementing various similarity measuring
algorithms to find similarities between Java classes. We describe the architecture of Eclipse and
how our plug-in integrates into it. In addition, we shed light on problems encountered during plug-in development.
4.1 Eclipse Architecture
As Coogle is an Eclipse plug-in we first describe the Eclipse platform in general and then how
plug-ins for Eclipse integrate into the Eclipse architecture.
4.1.1 Eclipse Platform
The functionality of Eclipse is based on the concept of extensions, so-called plug-ins. The core
of the Eclipse product, the ”Eclipse platform”, provides the framework and services for all these
extensions. Thus, the platform is the runtime environment for dynamically loading, integrating
and executing plug-ins¹. Figure 4.1 gives an overview of the Eclipse architecture. These are the
most important components:
Workbench. This is part of the platform UI component and provides the main user interface of
Eclipse. It coordinates and presents all tools integrated into the platform.
Standard Widget Toolkit (SWT). SWT is an operating system independent widget toolkit providing an API for the native user-interface facilities.
JFace. This is part of the platform UI component and provides classes for many common UI programming tasks. It is designed to be window system independent and uses SWT widgets
for its common UI tasks.
The Java Development Tools (JDT) and the Plug-in Development Environment (PDE) are
plugged into this basic platform. Both tools add a number of views, wizards and editors to
Eclipse. Without these plug-ins, Eclipse does not know about Java and plug-in development.
The basic Eclipse platform plus JDT and PDE together build the Eclipse Software Development
Kit (Eclipse SDK).
¹ The Eclipse project: http://www.eclipse.org
There are more plug-ins for other programming tasks, for example the C/C++ Development
Tools (CDT) or the Graphical Editor Framework (GEF). Both CDT and GEF plug into Eclipse using
the same interface as the standard SDK components.
Developing plug-ins for Eclipse means extending the platform in the same way as the standard
Eclipse components, like the JDT or the CDT, do. All the basic tasks such as loading and unloading the
plug-in, displaying dialogues and interacting with the user are already built
into the platform and are simply extended or invoked by plug-ins when needed.
Figure 4.1: Eclipse platform architecture with its main components and plug-ins.
4.1.2 Abstract Syntax Tree Representation in Eclipse
The abstract syntax tree (AST) used in Eclipse is represented by the classes defined in the package
org.eclipse.jdt.core.dom². We outline the most important parts of this set of classes that
model the source code of a Java program as a structured document, i.e., as a tree.
Listing 4.1 shows the root element of an Eclipse AST: the CompilationUnit. This AST node
type represents a Java source file including package and import declarations. The actual body of
a class is represented by the AST node TypeDeclaration. A TypeDeclaration can either
be a ClassDeclaration or an InterfaceDeclaration as shown in Listing 4.2. The class
TypeDeclaration defines various methods for enumerating the fields or the methods (for example getFields() returning a FieldDeclaration[] or getMethods() that returns an array of MethodDeclarations). Each MethodDeclaration contains multiple Statements and
Expressions whose subtypes represent the basic Java syntax. Subtypes of Statement are
for example:
• IfStatement: Represents the structure of an if construct.
• VariableDeclarationStatement: Contains information of variable declarations including the name of the variable, the type and possible initialisation statements.
• ReturnStatement: Represents the return instruction including a possible return value
as Expression.
Examples of the basic Expression statements are the following:
• IntegerLiteral: Stands for Java’s primitive integer types.
• FieldAccess: This is used for all accesses to fields.
• MethodInvocation: Represents an invocation of a method.
² Eclipse API: http://help.eclipse.org/help31/index.jsp
FAMIX cannot represent all AST node types. The types that are represented in the FAMIX
model are depicted in Figures 4.3, 4.4 and 4.5. Consult the Eclipse API for an exhaustive list of
AST elements.
CompilationUnit:
[ PackageDeclaration ]
{ ImportDeclaration }
{ TypeDeclaration | EnumDeclaration |
AnnotationTypeDeclaration | ; }
Listing 4.1: Java CompilationUnit AST node type. This is the type of the root of an AST.
TypeDeclaration:
ClassDeclaration
InterfaceDeclaration
ClassDeclaration:
[ Javadoc ] { ExtendedModifier } class Identifier
[ < TypeParameter { , TypeParameter } > ]
[ extends Type ]
[ implements Type { , Type } ]
{ { ClassBodyDeclaration | ; } }
InterfaceDeclaration:
[ Javadoc ] { ExtendedModifier } interface Identifier
[ < TypeParameter { , TypeParameter } > ]
[ extends Type { , Type } ]
{ { InterfaceBodyDeclaration | ; } }
Listing 4.2: TypeDeclaration AST node type. A type declaration is the union of a class declaration and an interface
declaration.
4.1.3 AST to FAMIX Mapping
Table 4.1 shows which AST node types are considered by Coogle and details the mapping of
FAMIX elements to these AST nodes.
4.2 Coogle Architecture
4.2.1 Project Package Structure
Figure 4.9 shows the main parts of the plug-in grouped by packages and describes how they
interact. The following list details the most important packages from the Coogle plug-in:
ch.toe.coogle Classes in this package are the main classes for the user interface.
FAMIX Element          AST node                      Remarks
FAMIXInstance          -                             Represents the top element of every FAMIX instance.
Model                  -                             Abstract construct containing metadata.
Package                PackageDeclaration
Class                  TypeDeclaration               Assigned to a Package.
InheritanceDefinition
Attribute              FieldDeclaration              Assigned to a Class.
Method                 MethodDeclaration             Assigned to a Class.
FormalParameter        SingleVariableDeclaration     Assigned to a Method.
LocalVariable          SingleVariableDeclaration     Assigned to a Method.
Invocation             ConstructorInvocation,        Assigned to a BehaviouralEntity.
                       SuperConstructorInvocation,
                       ClassInstanceCreation,
                       MethodInvocation,
                       SuperMethodInvocation
Access                 FieldAccess,                  Assigned to a BehaviouralEntity. A SimpleName is any
                       SuperFieldAccess,             identifier other than a keyword, boolean expression
                       SimpleName,                   or null literal. A QualifiedName has a format like
                       QualifiedName                 "Name.SimpleName".
Table 4.1: FAMIX elements with their corresponding AST element.
ch.toe.coogle.wizard These classes define the wizard pages and the wizard dialog. The package
also contains the class used for collecting all Java classes in a selected Eclipse project (namely
TypeExtractor).
ch.toe.coogle.model The model class used for passing information between the wizard pages
and the project parser is defined in this package.
ch.toe.coogle.operation.generic Package containing the classes that are called for performing an
operation (such as calculating the bottom-up maximum subtree isomorphism or the tree edit
distance). These are not to be extended, only implemented by classes in the next package.
ch.toe.coogle.operation.classes One class per operation is defined in this package, implementing
the relevant operation class from package ch.toe.coogle.operation.generic.
ch.toe.coogle.operation.dialog Classes used for displaying the results after the calculation finished.
ch.toe.famix In this package lies FAMIXInstance, the root of every Java representation of a
FAMIX model tree. This also contains the visitor classes. See Section 4.5.1 for more information about the visitor pattern.
ch.toe.famix.model This contains all the classes needed for representing the FAMIX model.
ch.toe.tree The class TreeUtil in this package defines useful methods for manipulating trees
and searching elements in trees. Other utility classes are placed in here as well.
ch.toe.tree.calc All classes implementing tree similarity algorithms are packaged herein.
ch.toe.tree.comparator These are the default comparators used for evaluating node equality.
4.2.2 Plug-in Features
Coogle runs with Eclipse 3.1 and later and is written in Java 1.5. It is activated through the
context menu of a Java project in the Eclipse workspace. For a detailed walkthrough on the
usage of Coogle, see Appendix A. The current state of the Coogle plug-in supports the following
operations:
• Bottom-up maximum common subtree isomorphism (described in Sections 3.4.4 and 4.5.3).
  – for ordered, labelled and unlabelled trees.
  – for unordered, labelled and unlabelled trees (with the restrictions described in Section 4.6.2).
• Top-down maximum common subtree isomorphism (described in Sections 3.4.5 and 4.5.3).
  – for ordered, labelled trees.
• Tree edit distance (described in Sections 3.4.6 and 4.5.3).
  – for ordered, labelled trees.
4.3 Coogle Design
This section describes the design of the Coogle plug-in. First, we give an overview of the different components, then discuss our extensions to FAMIX and conclude the section with a description of two design patterns that were used in the implementation.
4.3.1 Overview
Coogle has multiple components as Figure 4.2 illustrates. The source code of a Java class is transformed three times before it is used for calculating the similarity measure. Coogle’s main components are:
ASTParser of Eclipse. This processes Java source code into an abstract syntax tree as defined by
the Eclipse API.
PatViz. Parser that traverses an abstract syntax tree and builds a FAMIX representation from the
nodes of the tree.
Tree visitor. This visitor visits each FAMIX node and creates a tree representation consisting of
DefaultMutableTreeNodes.
Similarity measure. Different similarity measures are implemented by Coogle. All use a tree
built of DefaultMutableTreeNodes as calculation basis. The output of the measures is
then used for calculating the similarity of two given objects.
Figure 4.2: The steps of processing the source code of a Java class into a tree that can be used as input for the
similarity measure.
4.3.2 FAMIX Extensions
In this section we describe our Java implementation of the FAMIX model and highlight the differences between our implementation and the original FAMIX definition, which is described in Chapter 2.
A note on the figures in this section: functions and objects shaded in grey are additions that
are not documented in the official FAMIX definition ([Tichelaar et al., 1999] and [Tichelaar, 1999]),
but are extensions we made. Also, only the most important methods are included in each class.
FAMIX Element Object
The top parent of every element of FAMIX is the Object class as shown in Figure 4.3. For illustration purposes, all children of Entity are left out of this figure; they are depicted in Figures 4.4 and 4.5, which are described later on.
The only change made to the classes in Figure 4.3 is in InheritanceDefinition, to which we added status information about the represented relation type (for example implements, via the interface pattern, or extends, by subclassing). Otherwise, all classes correspond to the original FAMIX model. The purpose of the accept() method is described in the section on the visitor pattern.
FAMIX Element StructuralEntity
Figure 4.4 shows the subclasses of StructuralEntity, which itself is a child of Entity. The addition of methods such as isFinal() to the classes Attribute, LocalVariable and FormalParameter is documented in the FAMIX Java extension document [Tichelaar, 1999]. The class GlobalVariable is never used as there is no concept of global variables in Java.
FAMIX Element BehaviouralEntity
The class BehaviouralEntity and its children Method and Function (the latter is never used in the FAMIX Java representation) were extended with the methods isFinal(), isSynchronized() and isNative() to represent the possible modifiers of a Java method as described in [Tichelaar, 1999]. These classes are shown in Figure 4.5.
Figure 4.3: Java class diagram for the top level elements of the FAMIX model. Our extensions to the original FAMIX
model are shaded in grey.
FAMIX Elements Package and Class
Figure 4.5 depicts the class diagram for the FAMIX Package and Class representation. We
added an artificial, non-original FAMIX class called Context in-between these classes and the
Entity object. This allows us to avoid duplicated code and eases the use of both Package and
Class while parsing the abstract syntax tree. Further, we extended Class with the methods
isInterface(), isPublic(), isFinal() and isAbstract() to comply with these allowed
modifiers of a Java class (described in [Tichelaar, 1999]).
4.3.3 Tree Generation
To build the trees from the extracted FAMIX model, we use a visitor pattern. The visitor pattern is a design pattern used in object-oriented software development [Gamma et al., 1994]. It
needs two different types of objects: a visitor and a visitable object. Each visitable object defines a method called accept() that recursively traverses all visitable children by calling their
accept() method. The visitor is informed of each visit and builds the tree from this information.
See Section 4.5.1 for the implementation details of this pattern.
4.3.4 Node Comparison
We use the comparator pattern [Gamma et al., 1994] to extend our measures to labelled trees. This pattern is often applied to let implementing classes use their own way of comparing objects and establishing equality. An interface defines the comparator method that is to be implemented (this is usually compare(Object left, Object right)). The return value of the compare method is either a boolean or an integer indicating how the two given objects relate to each other. Our implementation is described in Section 4.5.2; see Appendix B.3 for a description of how to add new comparators.
Figure 4.4: Java class diagram for StructuralEntity with its subclasses.
4.3.5 Input Trees for Measures
Not all FAMIX elements are represented in the general trees we use as input for the similarity measures. The class building the tree, TreeBuildVisitor, omits the following elements from the generated DefaultMutableTreeNode tree, for the reasons given:
• Argument, i.e., AccessArgument and ExpressionArgument. These are not added because the Invocation they belong to is already included in the tree.
• Function. It never appears in a FAMIX Java representation.
• InheritanceDefinition. Every Class has an InheritanceDefinition. Including it would therefore only produce an additional node in every generated tree, without improving the similarity measure.
4.4 Coogle Workflow
Figure 4.9 illustrates the workflow of a Coogle tree edit distance similarity search. The next two
sections describe the process in general.
4.4.1 Invocation
Coogle is integrated into the context menu of Java projects in the Eclipse package explorer. When
selecting the similarity search entry in the Coogle submenu, Eclipse loads the main plug-in class
from CooglePlugin and launches the action defined in the class CoogleMainAction in the
package ch.toe.coogle.action. Which class to call upon which action is defined in the
configuration file of the plugin, plugin.xml, which also defines dependencies and integration points. CoogleMainAction creates the wizard with its pages and executes it. The wizard then collects all needed information from the user such as the desired similarity measure
and the object to be searched. Afterwards, it invokes the desired operation defined in package
ch.toe.coogle.operation.classes. The operation class performs the similarity search
and presents the results by using a dialog defined in ch.toe.coogle.operation.dialog.
See Section 4.2.2 for a detailed description of the wizard and its functions.
Figure 4.5: Java class diagram for BehaviouralEntity and Context with their respective subclasses. Our
extensions to the original FAMIX model are shaded in grey.
4.4.2 Similarity Search Process
Figure 4.6 illustrates the process that Coogle performs after a similarity measure operation (as defined in ch.toe.coogle.operation) is invoked. The source code of the selected class is converted into an abstract syntax tree by running ASTParser (defined by Eclipse) on this resource. Using the PatViz parser, all the relevant AST nodes are then transformed into a FAMIX representation. In this step, the abstract syntax tree, which so far has been correctly ordered, is converted into a FAMIX tree whose order no longer corresponds to the order in which the statements appear in the source file. This leads to the problems described later in Section 4.6.2. After the creation of the FAMIX representation, a visitor is used for generating a general tree structure. See Section 4.5.1 for details on this visitor pattern. Finally, the similarity search is performed with the created general tree as input for the measure.
4.5 Coogle Implementation
4.5.1 Tree Generation
As described in Section 4.3.3 we need objects implementing ch.toe.famix.Visitor and objects implementing ch.toe.famix.Visitable. For example, the FAMIX element Object in Figure 4.3 or Method in Figure 4.5 implement the Visitable interface and therefore define a method called accept(). Listing 4.3 shows a sample creation and invocation of a visitor building a tree. The accept() method in the Visitable class then iterates over the children of the class and recursively passes the visitor along to each child by invoking the corresponding accept() method. The visit() and endVisit() methods of the Visitor are invoked before and after the visitor is passed to the children, respectively. The code fragment in Listing 4.4 shows this process for the FAMIX object Class.
Figure 4.6: The Coogle process: transformation of a Java source code file into a general tree structure via a FAMIX representation of the abstract syntax tree. Note the loss of ordering after parsing the tree into a FAMIX model.
A sample visitor implementation is illustrated in Listing 4.5. The methods visit() and
endVisit() are called from the accept() method as described before. This specific implementation of the visitor pattern is used for building a tree from selected objects (namely all objects for
which isTreeRelevantElement() is true).
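The interplay of accept(), visit() and endVisit() can be tried out in isolation with the following minimal sketch. It is not Coogle code: SketchVisitor, SketchElement and SketchTreeBuilder are hypothetical stand-ins for the Visitor and Visitable types and for TreeBuildVisitor, and every element is treated as tree relevant.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;

// Hypothetical stand-in for a visitor interface (name is an assumption).
interface SketchVisitor {
    void visit(SketchElement e);
    void endVisit(SketchElement e);
}

// Hypothetical visitable element holding an ordered list of children.
class SketchElement {
    final String name;
    final List<SketchElement> children = new ArrayList<>();

    SketchElement(String name) { this.name = name; }

    SketchElement add(SketchElement child) { children.add(child); return this; }

    // visit() is called before the children are traversed, endVisit() afterwards.
    void accept(SketchVisitor v) {
        v.visit(this);
        for (SketchElement child : children) child.accept(v);
        v.endVisit(this);
    }

    @Override public String toString() { return name; }
}

// Builds a DefaultMutableTreeNode tree with a stack of currently open nodes.
class SketchTreeBuilder implements SketchVisitor {
    private final Deque<DefaultMutableTreeNode> stack = new ArrayDeque<>();
    private DefaultMutableTreeNode root;

    public void visit(SketchElement e) {
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(e);
        if (root == null) root = node;   // the first visited element is the root
        stack.push(node);
    }

    public void endVisit(SketchElement e) {
        DefaultMutableTreeNode child = stack.pop();
        if (!stack.isEmpty()) stack.peek().add(child);  // attach to the open parent
    }

    DefaultMutableTreeNode getRoot() { return root; }

    // Small example: a class with one attribute and one method containing an invocation.
    static DefaultMutableTreeNode demo() {
        SketchElement cls = new SketchElement("Class Foo")
            .add(new SketchElement("Attribute x"))
            .add(new SketchElement("Method bar").add(new SketchElement("Invocation baz")));
        SketchTreeBuilder builder = new SketchTreeBuilder();
        cls.accept(builder);
        return builder.getRoot();
    }
}
```

Keeping the open nodes on a stack means the visitor never needs to know the parent of an element; the parent is simply the node below the current one on the stack.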
public static DefaultMutableTreeNode
        generateTree(FAMIXInstance instance) {
    TreeBuildVisitor v = new TreeBuildVisitor();
    instance.accept(v);
    return v.getRoot();
}
Listing 4.3: Creates a new tree of a FAMIXInstance by using the visitor pattern. This method is defined in ch.toe.tree.TreeUtil.
4.5.2 Node Comparison
The class ch.toe.tree.comparator.ITreeComparator defines our comparator interface. Implementing classes need to define a method called compare(), receiving two parameters of type DefaultMutableTreeNode. The method compares these nodes for equality and returns true if node1 is equal to node2. The implementing comparator decides which characteristics of the nodes are used for equality comparison. We provide three comparator implementations:
AlwaysTrueTreeComparator This comparator returns true, i.e., equality, regardless of the characteristics of the nodes passed.
NameTreeComparator Compares the names of the given nodes and returns true if the names are equal. This can, for example, be extended to apply a Levenshtein similarity measure to the names of the nodes, returning equality when a certain similarity level is reached.
TypeTreeComparator The user objects, if existing, of the given nodes are compared and true is
returned if both objects are of the same type.
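The three comparators can be approximated by the following sketch. The interface shape is an assumption derived from the description above, not the actual ITreeComparator source; all class names are hypothetical, the "name" of a node is taken to be the string form of its user object, and two nodes without user objects are treated as equal in type.

```java
import javax.swing.tree.DefaultMutableTreeNode;

// Assumed shape of the comparator interface (hypothetical name).
interface SketchTreeComparator {
    boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2);
}

// Reports equality for every pair of nodes, i.e. labels are ignored.
class SketchAlwaysTrueComparator implements SketchTreeComparator {
    public boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2) {
        return true;
    }
}

// Assumption: the node name is the string form of the user object.
class SketchNameComparator implements SketchTreeComparator {
    public boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2) {
        String name1 = String.valueOf(node1.getUserObject());
        String name2 = String.valueOf(node2.getUserObject());
        return name1.equals(name2);
    }
}

// Reports equality if both user objects have the same runtime type.
class SketchTypeComparator implements SketchTreeComparator {
    public boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2) {
        Object a = node1.getUserObject();
        Object b = node2.getUserObject();
        if (a == null || b == null) return a == b;  // assumption: equal only if both are missing
        return a.getClass() == b.getClass();
    }
}
```

Because the measures only see the interface, a stricter or more lenient notion of node equality can be swapped in without touching the algorithms.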
4.5.3 Implemented Similarity Measures
This section describes the implementation of the selected similarity measures. See also Section 3.4
for the algorithmic description of the measures.
General
A prerequisite of the implementation is that the tree similarity measures operate on general tree
structures. It is irrelevant for the measures if the compared trees represent a FAMIX model or
complete AST trees, as long as they are valid tree structures. For this reason the context neutral
tree model class DefaultMutableTreeNode in package javax.swing.tree is used for representing trees. This allows creating trees whose nodes can contain specific user objects and an
unlimited number of children. The user objects in our case are FAMIX elements. This tree model
class also contains an implementation of pre- and postorder enumerating the elements of the tree.
A DefaultMutableTreeNode object is a root node or child element depending on its position
in the tree.
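As an illustration of this tree model, the following sketch builds a small tree and lists its nodes in pre- and postorder. Plain strings stand in for the FAMIX user objects, and the class name TraversalSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;

class TraversalSketch {
    // Builds Class -> (Attribute, Method -> Invocation); strings stand in for FAMIX elements.
    static DefaultMutableTreeNode sampleTree() {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Class");
        DefaultMutableTreeNode method = new DefaultMutableTreeNode("Method");
        method.add(new DefaultMutableTreeNode("Invocation"));
        root.add(new DefaultMutableTreeNode("Attribute"));
        root.add(method);
        return root;
    }

    // Collects the string form of every enumerated node.
    static List<String> labels(Enumeration<?> nodes) {
        List<String> result = new ArrayList<>();
        while (nodes.hasMoreElements()) {
            result.add(String.valueOf(nodes.nextElement()));
        }
        return result;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode root = sampleTree();
        System.out.println(labels(root.preorderEnumeration()));
        System.out.println(labels(root.postorderEnumeration()));
    }
}
```

For the sample tree, the preorder sequence is Class, Attribute, Method, Invocation and the postorder sequence is Attribute, Invocation, Method, Class.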
Bottom-up Maximum Common Subtree Isomorphism
Valiente’s bottom-up maximum common subtree algorithm as described in [Valiente, 2002] is implemented in ch.toe.tree.calc.CalculateBottomUpMaximumSubtree. This algorithm is applicable to ordered and unordered rooted trees. The implementation differs between ordered and unordered trees in how the trees are mapped at the end of the calculation, when corresponding nodes are put into a map. This is described in Section 3.4.4.
The methods mapOrderedTrees(..) and mapUnorderedTrees(..) realise this functionality. mapTrees(..) automatically invokes the correct method depending on whether the input trees for CalculateBottomUpMaximumSubtree are ordered or unordered.
public void accept(Visitor v) {
    v.visit(this);
    if (inheritance != null)
        inheritance.accept(v);
    [..]
    Iterator<Attribute> iterAttribute =
        this.getAttributes().iterator();
    while (iterAttribute.hasNext())
        (iterAttribute.next()).accept(v);
    Iterator<ImplicitVariable> iterImplicitVar =
        this.getImplicitVariables().iterator();
    while (iterImplicitVar.hasNext())
        (iterImplicitVar.next()).accept(v);
    Iterator<Method> iterMethod =
        this.getMethods().iterator();
    while (iterMethod.hasNext())
        (iterMethod.next()).accept(v);
    v.endVisit(this);
}
Listing 4.4: accept() method from ch.toe.famix.model.Class, demonstrating the visitor pattern.
We extend the algorithm to labelled trees by assigning an integer value to each FAMIX node
type (as also proposed in [Valiente, 2000]). The node type then is prepended to the list of isomorphism equivalence codes during the calculation of the equivalence classes for a node. See
the method calculateEquivalenceClass(..) in CalculateEquivalenceClass for the
relevant code.
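The idea of prepending a node type code to the list of child equivalence codes can be sketched as follows. This is an illustrative reconstruction, not the code of CalculateEquivalenceClass: the class EquivalenceClassSketch and its methods are hypothetical, user objects serve as node types, and sorting the child codes is assumed as the way to handle unordered trees.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.swing.tree.DefaultMutableTreeNode;

class EquivalenceClassSketch {
    // Shared by all classified trees so that isomorphic subtrees receive the same code.
    private final Map<List<Integer>, Integer> classCodes = new HashMap<>();
    // Hypothetical mapping from node type (here: the user object) to an integer.
    private final Map<Object, Integer> labelCodes = new HashMap<>();
    private final boolean ordered;
    private final boolean labelled;

    EquivalenceClassSketch(boolean ordered, boolean labelled) {
        this.ordered = ordered;
        this.labelled = labelled;
    }

    // Returns the equivalence class code of the subtree rooted at node (bottom-up).
    int classify(DefaultMutableTreeNode node) {
        List<Integer> childCodes = new ArrayList<>();
        for (int i = 0; i < node.getChildCount(); i++) {
            childCodes.add(classify((DefaultMutableTreeNode) node.getChildAt(i)));
        }
        if (!ordered) Collections.sort(childCodes);   // child order is irrelevant
        List<Integer> key = new ArrayList<>();
        if (labelled) key.add(labelCode(node.getUserObject()));  // node type comes first
        key.addAll(childCodes);
        Integer code = classCodes.get(key);
        if (code == null) {
            code = classCodes.size() + 1;
            classCodes.put(key, code);
        }
        return code;
    }

    private int labelCode(Object label) {
        Integer code = labelCodes.get(label);
        if (code == null) {
            code = labelCodes.size() + 1;
            labelCodes.put(label, code);
        }
        return code;
    }

    // Convenience helper for building small test trees.
    static DefaultMutableTreeNode node(Object label, DefaultMutableTreeNode... children) {
        DefaultMutableTreeNode n = new DefaultMutableTreeNode(label);
        for (DefaultMutableTreeNode c : children) n.add(c);
        return n;
    }
}
```

Two subtrees are isomorphic under the chosen settings exactly when classify() returns the same code for their roots, provided both trees are classified with the same EquivalenceClassSketch instance.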
The calculation is performed in the class CalculateBottomUpMaximumSubtree, which contains multiple constructors; the most detailed one is shown in Listing 4.6. The boolean parameters ordered and labeled are used to specify the nature of the given trees tree1 and tree2. A comparator object implementing ITreeComparator is used to compare nodes for equality. If no comparator is given, the default AlwaysTrueTreeComparator is used, which effectively means that node types are not compared at all. For an explanation of the comparator pattern, see Section 4.3.4.
public CalculateBottomUpMaximumSubtree(DefaultMutableTreeNode tree1,
DefaultMutableTreeNode tree2, ITreeComparator comparator,
boolean ordered, boolean labeled)
throws NullPointerException, TreeNodeTypeException
Listing 4.6: Most detailed constructor signature of CalculateBottomUpMaximumSubtree.
The calculation takes place during instantiation of the calculation class; afterwards, its success can be queried by calling the method isCalculated(). The ArrayLists of the matched subtrees of tree1 and tree2 can be read with the methods getSubtreeRootNodesTree1() and getSubtreeRootNodesTree2(). The returned lists contain the root nodes of the matched maximum bottom-up subtree isomorphisms. There can be multiple roots as the algorithm matches all the occurrences of the matched tree pattern. See Section 3.4.4 for the algorithmic description of this behaviour.
Finally, the method mapTrees(tree1, tree2) maps two given matched subtree root nodes
either ordered or unordered, according to the ordered status of the calculation object. The resulting map is a one-to-one node mapping of the bottom-up maximum common subtree. See
Section 3.4.4 for a more in-depth explanation. Figure 4.7 illustrates a one-to-one mapping of two
ordered trees.
Figure 4.7: A bottom-up maximum common subtree isomorphism for two ordered trees (highlighted in grey). The
dashed line represents the node mapping. The mapping is defined as M = {(v4, w2), (v5, w3), (v6, w4),
(v7, w5), (v8, w6)}.
Top-down Maximum Common Subtree Isomorphism
Valiente’s top-down maximum common subtree algorithm as described in [Valiente, 2002] is implemented in ch.toe.tree.calc.CalculateTopDownOrderedMaximumSubtree. This algorithm is applicable for rooted, ordered trees only. To make the implementation available for
labelled trees, we again use the comparator construct (see Section 4.3.4).
The constructors in ch.toe.tree.calc.CalculateTopDownOrderedMaximumSubtree
expect two trees and an optional ITreeComparator object. If no comparator is passed during
instantiation, TypeTreeComparator is used, which returns equality for nodes with the same
user object type.
The method isCalculated() is used for querying the success of the calculation. If the calculation was successful, getMatchedTree1() and getMatchedTree2() are called to receive the
resulting trees. The method getMappedTrees() returns a one-to-one mapping of the matched
trees. This measure only returns a single subtree as its root always has to correspond to the roots
of the input trees. Figure 4.8 illustrates such a mapping.
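The top-down matching can be illustrated with a simplified sketch for ordered trees. It is written in the spirit of the algorithm rather than taken from CalculateTopDownOrderedMaximumSubtree: the class TopDownSketch is hypothetical, user object equality replaces the ITreeComparator, and the pairing of children stops at the first pair that does not match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import javax.swing.tree.DefaultMutableTreeNode;

class TopDownSketch {
    // Returns a one-to-one mapping from nodes of tree1 to nodes of tree2, starting at the roots.
    static Map<DefaultMutableTreeNode, DefaultMutableTreeNode> map(
            DefaultMutableTreeNode root1, DefaultMutableTreeNode root2) {
        Map<DefaultMutableTreeNode, DefaultMutableTreeNode> mapping = new LinkedHashMap<>();
        mapSubtree(root1, root2, mapping);
        return mapping;
    }

    private static boolean mapSubtree(DefaultMutableTreeNode v, DefaultMutableTreeNode w,
            Map<DefaultMutableTreeNode, DefaultMutableTreeNode> mapping) {
        // Assumption: nodes match if their user objects are equal.
        if (!Objects.equals(v.getUserObject(), w.getUserObject())) return false;
        mapping.put(v, w);
        // Pair the i-th child of v with the i-th child of w, left to right.
        int n = Math.min(v.getChildCount(), w.getChildCount());
        for (int i = 0; i < n; i++) {
            DefaultMutableTreeNode cv = (DefaultMutableTreeNode) v.getChildAt(i);
            DefaultMutableTreeNode cw = (DefaultMutableTreeNode) w.getChildAt(i);
            if (!mapSubtree(cv, cw, mapping)) break;  // stop at the first mismatch
        }
        return true;
    }

    // Convenience helper for building small test trees.
    static DefaultMutableTreeNode node(Object label, DefaultMutableTreeNode... children) {
        DefaultMutableTreeNode n = new DefaultMutableTreeNode(label);
        for (DefaultMutableTreeNode c : children) n.add(c);
        return n;
    }
}
```

Since the mapping always starts at the two roots, an empty result means that the roots themselves do not match, which reflects why this measure yields at most one common subtree.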
Tree Edit Distance
We implement the tree edit distance algorithm for ordered, rooted trees detailed in [Valiente, 2002].
To extend the algorithm to labelled trees the comparator pattern is applied again. We use an
ITreeComparator to assign a different cost to substitute paths between equal and paths between
non-equal nodes.
The JGraphT library3 provides an implementation of the standard shortest path algorithm by Dijkstra [Dijkstra, 1959]. We used DijkstraShortestPath from JGraphT version 0.6.0 for calculating the tree edit distance (see Section 3.4.6).
Listing 4.7 shows the available parameters for a tree edit distance calculation. The parameters tree1 and tree2 denote the trees for which the edit distance is calculated. The parameter pathLengthLimit is used for limiting the path length to a maximum (default is unlimited path length).
3 JGraphT project site: http://jgrapht.sourceforge.net/
Figure 4.8: A top-down maximum common subtree isomorphism for two ordered trees (highlighted in grey). The dashed line represents the node mapping. The mapping is defined as M = {(v1, w1), (v3, w2), (v4, w3), (v5, w4), (v6, w5), (v7, w6), (v8, w7)}.
By using weightInsert, weightDelete and weightSubstitute, different weights for the main operations insert, delete and substitute can be specified. The fourth weight parameter weightSubstituteEqual is used for substitute paths between nodes for which comparator returns equality.
After the calculation is successfully finished (queried using the method isCalculated()),
the tree edit distance is returned as double value by getTreeEditDistance().
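How the four weights enter the calculation can be sketched with a shortest path over a grid-shaped edit graph built on the preorder sequences of both trees. The sketch rests on several assumptions that are not taken from the Coogle sources: the arc conditions compare the depths of the next nodes (delete if the next node of tree1 is at least as deep, insert if it is at most as deep, substitute if both depths are equal), a dynamic programme over the acyclic grid replaces the Dijkstra implementation of JGraphT, user object equality replaces the ITreeComparator, and pathLengthLimit is omitted. The class name EditGraphSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.List;
import java.util.Objects;
import javax.swing.tree.DefaultMutableTreeNode;

class EditGraphSketch {
    static List<DefaultMutableTreeNode> preorder(DefaultMutableTreeNode root) {
        List<DefaultMutableTreeNode> nodes = new ArrayList<>();
        Enumeration<?> e = root.preorderEnumeration();
        while (e.hasMoreElements()) nodes.add((DefaultMutableTreeNode) e.nextElement());
        return nodes;
    }

    // Shortest path from (0,0) to (n,m); returns infinity if no path exists.
    static double editDistance(DefaultMutableTreeNode tree1, DefaultMutableTreeNode tree2,
            double wInsert, double wDelete, double wSubstitute, double wSubstituteEqual) {
        List<DefaultMutableTreeNode> a = preorder(tree1);
        List<DefaultMutableTreeNode> b = preorder(tree2);
        int n = a.size(), m = b.size();
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0.0;
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                if (Double.isInfinite(d[i][j])) continue;       // vertex not reachable
                int da = i < n ? a.get(i).getLevel() : -1;      // depth of next node in tree1
                int db = j < m ? b.get(j).getLevel() : -1;      // depth of next node in tree2
                if (i < n && (j == m || da >= db))              // delete next node of tree1
                    d[i + 1][j] = Math.min(d[i + 1][j], d[i][j] + wDelete);
                if (j < m && (i == n || da <= db))              // insert next node of tree2
                    d[i][j + 1] = Math.min(d[i][j + 1], d[i][j] + wInsert);
                if (i < n && j < m && da == db) {               // substitute at equal depth
                    boolean equal = Objects.equals(a.get(i).getUserObject(), b.get(j).getUserObject());
                    double w = equal ? wSubstituteEqual : wSubstitute;
                    d[i + 1][j + 1] = Math.min(d[i + 1][j + 1], d[i][j] + w);
                }
            }
        }
        return d[n][m];
    }

    // Convenience helper for building small test trees.
    static DefaultMutableTreeNode node(Object label, DefaultMutableTreeNode... children) {
        DefaultMutableTreeNode n = new DefaultMutableTreeNode(label);
        for (DefaultMutableTreeNode c : children) n.add(c);
        return n;
    }
}
```

With unit weights for insertion and deletion and zero cost for substitution, two trees of identical shape have distance 0 regardless of their labels; raising wSubstitute above wSubstituteEqual makes the labels count.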
The largest possible edit distance (also known as worst case edit distance) can be calculated in different ways. The simplest method is summing the number of nodes in tree1 and in tree2. This represents deleting all nodes of tree1 and inserting all nodes of tree2 as new nodes. We need this worst case edit distance for ranking the results as described in Section 5.3.1. Other, more complicated approaches for calculating a worst case edit distance exist; please refer to the relevant methods in CalculateTreeEditDistance for more information.
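The simplest worst case estimate amounts to counting the nodes of both trees, as in the following sketch. It assumes unit weights for insertion and deletion; the class and method names are hypothetical and do not correspond to the methods of CalculateTreeEditDistance.

```java
import javax.swing.tree.DefaultMutableTreeNode;

class WorstCaseSketch {
    // Counts the nodes of the subtree rooted at node, including node itself.
    static int countNodes(DefaultMutableTreeNode node) {
        int count = 1;
        for (int i = 0; i < node.getChildCount(); i++) {
            count += countNodes((DefaultMutableTreeNode) node.getChildAt(i));
        }
        return count;
    }

    // Delete every node of tree1, then insert every node of tree2 (unit weights assumed).
    static int worstCaseEditDistance(DefaultMutableTreeNode tree1, DefaultMutableTreeNode tree2) {
        return countNodes(tree1) + countNodes(tree2);
    }
}
```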
4.6 Discussion and Problems
4.6.1 FAMIX
One-way Parent-Child Relation
In the original FAMIX model, only children know about their parent. Consider the following example: in a top-down Java representation, classes contain methods. In FAMIX, however, only the child Method has a belongsTo() method; a parent such as a Class does not know about its children. For our implementation we need links that can be followed from top to bottom, i.e., from the parents (root) to their children (leaves), for example when traversing the syntax tree with a visitor as described in the previous section. To circumvent this limitation, we add collections (java.util.Set) to all objects having child objects. This extension allows parents to enumerate all their children and therefore allows an efficient top-down traversal.
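The extension can be pictured with the following sketch of a two-way link between a class and its methods. ParentClassSketch and ChildMethodSketch are hypothetical simplifications of the FAMIX classes Class and Method; only belongsTo() and the use of java.util.Set are taken from the description above, and the insertion-ordered LinkedHashSet is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical simplification of the FAMIX element Class.
class ParentClassSketch {
    final String name;
    // Added link: the parent now knows its children.
    private final Set<ChildMethodSketch> methods = new LinkedHashSet<>();

    ParentClassSketch(String name) { this.name = name; }

    // Keeps both directions of the relation consistent.
    void addMethod(ChildMethodSketch method) {
        methods.add(method);
        method.setBelongsTo(this);
    }

    Set<ChildMethodSketch> getMethods() {
        return Collections.unmodifiableSet(methods);
    }
}

// Hypothetical simplification of the FAMIX element Method.
class ChildMethodSketch {
    final String name;
    private ParentClassSketch parent;

    ChildMethodSketch(String name) { this.name = name; }

    // Original FAMIX direction: the child knows its parent.
    ParentClassSketch belongsTo() { return parent; }

    void setBelongsTo(ParentClassSketch parent) { this.parent = parent; }
}
```

Updating both directions in a single method avoids states in which a method claims to belong to a class that does not list it.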
No Representation of Low-level Elements
FAMIX does not model low-level elements of abstract syntax trees. For example, there is no representation of control statements, mathematical operations or assignments in general, such as an IfStatement or an Assignment. We discuss the consequences of this information loss in Section 5.4.2.
Various Representation Problems
As described in [Tichelaar, 1999] and the previous section, the FAMIX model does not have representations for all Java objects. In our implementation we additionally left out or interpreted the
following types of the Eclipse syntax tree:
AnonymousClassDeclaration The Eclipse type AnonymousClassDeclaration is used for an
anonymous class embedded in code. It can occur either in the body of a class or a method.
FAMIX however has the limitation of only allowing classes and not methods as the parent of
a Class. We circumvent this limitation by adding anonymous classes to the parent Class.
TypeDeclarationStatement and EnumDeclaration A TypeDeclarationStatement is a local
type declaration which can occur inside any Statement. An EnumDeclaration is a new
type introduced with Java 1.5. We do not add these two Eclipse elements to our trees as
FAMIX lacks support for them.
Moreover, the following FAMIX elements are not parsed and represented:
TypeCast There is no need for a representation of this element from a code similarity point of view. Our FAMIX implementation nevertheless contains the object TypeCast, which defines the needed behaviour.
SourceAnchor Every element in FAMIX can have a source code reference. As we do not need
this assignment for calculating our measures, we left out this functionality in our parser.
The class SourceAnchor is defined in our FAMIX implementation nevertheless.
4.6.2 Similarity Measures
Bottom-up Maximum Common Subtree Isomorphism
Although we implement a matching for unordered trees, a search using this algorithm does not yield different results than a search with the bottom-up maximum common subtree isomorphism for ordered trees. The reason for this lies in the way the FAMIX representation of the code is generated. We reused the code of the PatViz plug-in4 for this task. PatViz is a project that parses the abstract syntax tree representation in Eclipse and generates an RSF (Rigi Standard Format) representation of it. We took the parser and refactored the code to produce a FAMIX tree instead. However, the PatViz plug-in generates ordered trees from the syntax trees, i.e., the elements are ordered by type (first all class attributes, then the constructors and finally all methods) and not by their actual positions in the source. Therefore, the generated FAMIX tree is always an ordered tree. See Figure 4.6 for an example of such a generated tree and note the order of the elements in the tree representation, which does not match the order of the abstract syntax tree.
4 Software project written by Wolfgang Schuh at the Vienna University of Technology.
Top-down Maximum Common Subtree Isomorphism
Although there exists an algorithm for unordered top-down maximum common subtree matching (described in [Valiente, 2002]), it is currently not implemented in Coogle, for the same reasons detailed in the previous section for the bottom-up maximum subtree isomorphism algorithm.
Tree Edit Distance
The algorithmic problems of unordered tree edit distance calculation are detailed in Section 3.4.6.
Therefore, we have no implementation for an unordered edit distance measure.
[..]
public void visit(Object o) {
    if (isTreeRelevantElement(o))
        preVisit(o);
}
public void endVisit(Object o) {
    if (isTreeRelevantElement(o))
        postVisit();
}
private void preVisit(Object o) {
    // create new node for the currently visited object
    DefaultMutableTreeNode node = new DefaultMutableTreeNode(o);
    if (root == null)
        root = node;
    stack.push(node);
}
private void postVisit() {
    // child is the currently visited object
    DefaultMutableTreeNode child = stack.pop();
    if (!stack.isEmpty()) {
        // add child to parent node already on stack
        DefaultMutableTreeNode node = stack.pop();
        node.add(child);
        stack.push(node);
    }
    else
        stack.push(child);
}
[..]
Listing 4.5: Sample visitor implementation used for building a tree of all relevant FAMIX elements. This is the implementation as used by TreeBuildVisitor.
public CalculateTreeEditDistance(DefaultMutableTreeNode tree1,
        DefaultMutableTreeNode tree2, ITreeComparator comparator,
        Double pathLengthLimit,
        Double weigthInsert, Double weigthDelete,
        Double weigthSubstitute, Double weigthSubstituteEqual)
        throws NullPointerException, TreeNodeTypeException
Listing 4.7: Most detailed constructor signature of CalculateTreeEditDistance.
Figure 4.9: The complete workflow of a Coogle similarity search. Starting in the class CoogleMainAction and
finished when displaying the results with the EditDistanceResultDialog. These are the steps performed
after the invocation of the tree edit distance operation:
(a) Eclipse’s ASTParser is invoked for creating the abstract syntax tree of the objects we are comparing.
(b) The PatViz parser is invoked and (c) extracts the FAMIX representation from the AST.
(d) TreeBuildVisitor is used to create the general tree consisting of DefaultMutableTreeNodes.
(e) Calculation of the tree edit distance using the generated tree from step (d).
(f) Display results in EditDistanceResultDialog.
Chapter 5
Evaluation
This chapter describes the evaluation of the implemented similarity measures. After a short overview of the chosen approach, we detail the results of the two-part analysis. In the first part we analyse constructed Java classes, in the second part a real-world Java project, the compare plug-in of Eclipse. In each part, special cases, important findings and shortcomings are highlighted. We close the chapter by discussing and comparing the efficiency of the different similarity measures based upon the results.
5.1 Approach
The analysis was done in two parts. First, we built test cases for frequently recurring refactoring patterns and analysed similarity detection for these constructed sets of changes. This allows us to analyse how specific changes affect structural similarity. In the second part we took a sample Java project as the basis for our similarity measures. We used the compare plug-in (org.eclipse.compare)1 as sample project and measured the internal similarity of the classes. The results shed light on the efficiency of the measures for detecting structural similarities in a project.
5.2 Analysis Objects
5.2.1 Constructed Test Cases
Overview
We take often recurring refactoring patterns as basis for the construction of our test cases. The
complete code for these test cases can be found on the accompanying CD-ROM, the most important snippets are included in the following sections.
The class AzureusCoreImpl from the Azureus project2 was used as base class for the tests (except in test case E). See Appendix D.1 for a complete listing of this class. The requirements for the base class were the use of both “normal” and static attributes and methods. Additionally, it needed to define getter and setter methods for its attributes. In every test case, we define an empty class as control construct. This control class provides us with information about the relative similarity of the results and is an indicator for dissimilarity. Please see Appendix C.2 for detailed information about the structure of the test case projects.
1 Eclipse compare project site: http://dev.eclipse.org/viewcvs/index.cgi/%7Echeckout%7E/platform-compare-home/main.html
2 Azureus project site: http://azureus.sourceforge.net/
Note on the tree edit distance algorithm: the weights associated with the edit operations were
one cost unit for node insertions/deletions and zero cost for substituting nodes.
Test Case A: Add Constructor to a Class
Adding a new constructor amounts to adding a method with parameters and invocations in its body. Our test case adds a new constructor with a single this() invocation as body. The following constructor code has been added to the sample class:
protected AzureusCoreImpl(String str) {
    this();
}
Listing 5.1: Test case A: Code of the added constructor
Test Case B: Add Attribute to a Class
When using this refactoring pattern, getter and setter methods for the new attribute are added,
too. We do this in our change set as well. The rest of the class is left untouched. This is the added
code:
private String test;
[..]
public String getTest() {
    return test;
}
public void setTest(String test) {
    this.test = test;
}
Listing 5.2: Test case B: Code for an added attribute
Test Case C: Add Invocation to a Method
We insert an invocation into an existing method. For verifiability, two separate test classes are
created with the added code in two different methods, but to the same invocation target. Which
method is invoked does not matter as the measures do not take the target of the invocation into
consideration. We invoke getLocaleUtil(), defined in the test class itself.
Test Case D: Method Extraction
During a method extraction, code is moved from an existing method into a new method and an
invocation to the extracted method is added to the original method. This is often used to remove
duplicated code or when pulling-up code into parents. We implement the change by replacing
the code of the constructor with an invocation to a new private method constructorCall().
This listing shows the code that was added (lines prefixed with "+"):
  protected AzureusCoreImpl() {
+     constructorCall();
+ }
+
+ private void constructorCall() {
      COConfigurationManager.initialise();
      LGLogger.initialise();
      AEDiagnostics.startup();
Listing 5.3: Test case D: Extract the code of a method into a new method.
Test Case E: Implement Interface
The interface programming pattern is one of the most important design patterns in object-oriented
programming [Gamma et al., 1994]. This test case measures the similarity between classes
implementing the same interface. As test object we use the interface RateControlledEntity
of the Azureus project. See Appendix D.2 for a complete listing of the interface. The implementors
on which we search for similarity are defined in the same package as RateControlledEntity.
5.2.2 Real World Example: org.eclipse.compare
The compare plug-in of Eclipse is used as sample Java project for a real world similarity measure
test. The analysis demonstrates the ability of using the implemented similarity measures in a
non-laboratory environment and critically highlights shortcomings. We use version 3.1.0 of the
project.
We choose two classes from the project and analyse the similarity for each of them, because
Coogle in its current state cannot compute the similarity of every class to every other class
of the project in a single step. Realising this type of analysis requires an additional outer loop
that serially processes all classes of a project.
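The outer loop described above could look roughly like the following sketch. It is not part of Coogle: the class AllPairsSketch, the method analyseAll and the similarity callback (an assumed stand-in for one complete single-class Coogle run) are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical all-pairs driver; not part of the Coogle implementation. */
final class AllPairsSketch {

    /**
     * Uses every class once as search pattern and compares it with all other classes.
     * The similarity callback is an assumption standing in for one Coogle run.
     */
    static Map<String, Map<String, Double>> analyseAll(
            List<String> classNames,
            ToDoubleBiFunction<String, String> similarity) {
        Map<String, Map<String, Double>> scores = new LinkedHashMap<>();
        for (String search : classNames) {              // outer loop over search classes
            Map<String, Double> row = new LinkedHashMap<>();
            for (String candidate : classNames) {       // inner loop over candidate classes
                if (!candidate.equals(search)) {
                    row.put(candidate, similarity.applyAsDouble(search, candidate));
                }
            }
            scores.put(search, row);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Dummy measure for demonstration only: 1.0 if both names have equal length, else 0.0.
        Map<String, Map<String, Double>> scores = analyseAll(
                Arrays.asList("CompareViewerPane", "NavigationAction", "LineComparator"),
                (a, b) -> a.length() == b.length() ? 1.0 : 0.0);
        System.out.println(scores);
    }
}
```

For n classes this performs n(n-1) comparisons, so the cost of a single comparison dominates the running time of a project-wide analysis.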
5.3 Results
5.3.1 Ranking the Matches
Displaying result data raises one important question: how do we rank the matches?
Or, put differently, how similar is the matched object to the search pattern? The following
two sections explain the ranking algorithm used for the two subtree matching measures and the
tree edit distance calculation.
Subtree Matching
We have these numbers available as parameters for our algorithms: $s_s$ denotes the size (number of
nodes) of $tree_s$, which is our search tree, and $s_x$ denotes the size of $tree_x$, the tree of the class
currently being matched against. Further, $tree_m$ stands for the matched subtree with its size $s_m$.
An efficient ranking algorithm needs to follow these rules:
• the more of $tree_s$ is matched, the better the ranking. This is expressed with the ratio $s_m / s_s$.
• give small elements, consisting of a few nodes only, that are completely matched a lower
  ranking. This is considered by weighting $s_s$ and $s_x$.
We experimented with different possibilities for the ranking algorithm. Finally, we decided to
use a solution also described in [Baxter et al., 1998] where the following formula is defined:
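\[ similarity = \frac{2S}{2S + L + R} \]
with $S$ = number of shared nodes, $L$ = number of different nodes in $tree_s$ and $R$ = number of
different nodes in $tree_x$. With $S = s_m$, $L = s_s - s_m$ and $R = s_x - s_m$, the denominator becomes
$s_s + s_x$, so the measure can be simplified and expressed with the corresponding variables of our
input data:
\[ r_{subtree} = \frac{2 s_m}{s_s + s_x}, \quad 0 < r_{subtree} \le 1 \]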
All the results in Sections 5.3.2 and 5.3.3 are ranked using this formula for similarity.
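As a worked illustration of this ranking, the following minimal sketch evaluates the formula for the bottom-up results of test case A (matched subtree of 21 nodes, search tree of 145 nodes). The class and method names are hypothetical, and the candidate tree size of 148 nodes is an inference from the reported similarity of 14.33%, not a figure stated in the text.

```java
import java.util.Locale;

/** Illustrative sketch of the subtree ranking; names are not taken from Coogle. */
final class SubtreeRankingSketch {

    /**
     * Computes 2S / (2S + L + R) with S = sm, L = ss - sm and R = sx - sm,
     * which is the same value as 2 * sm / (ss + sx).
     */
    static double rank(int matchedSize, int searchTreeSize, int candidateTreeSize) {
        if (matchedSize < 1 || matchedSize > Math.min(searchTreeSize, candidateTreeSize)) {
            throw new IllegalArgumentException("matched subtree must fit into both trees");
        }
        int shared = matchedSize;
        int onlyInSearchTree = searchTreeSize - matchedSize;
        int onlyInCandidate = candidateTreeSize - matchedSize;
        return (2.0 * shared) / (2.0 * shared + onlyInSearchTree + onlyInCandidate);
    }

    public static void main(String[] args) {
        // Changed class of test case A; candidate size 148 is an assumed value (see lead-in).
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * rank(21, 145, 148))); // 14.33%
        // Empty control class: a single matched node against a tree of one node.
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * rank(1, 145, 1)));    // 1.37%
    }
}
```

The control class shows the second ranking rule at work: its tree is matched completely, yet the score stays low because the search tree is much larger.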
Tree Edit Distance
As with subtree matching, the input for the ranking consists of the two trees, $tree_s$ and $tree_x$,
which likewise represent the search tree and the tree we calculate the distance to. For this ranking
a worst-case edit distance of deleting all the nodes from $tree_s$ and inserting all nodes from $tree_x$
as new nodes is assumed. The worst-case length of an edit distance between the two trees then
is defined as sum of the size of both trees: $w_{s \to x} = s_s + s_x$. We convert this dissimilarity to
a similarity measure $r_{editdistance}$, where $0 \le r_{editdistance} \le 1$, by using the following ranking
formula for tree edit distance results:
\[ r_{editdistance} = \frac{w_{s \to x} - d_x}{w_{s \to x}} \]
$d_x$ denotes the calculated edit distance of $tree_x$ to $tree_s$.
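The conversion from edit distance to similarity can be sketched as follows. The names are hypothetical, and the candidate tree size of 148 nodes used for test case A is again an inference from the reported percentages rather than a stated figure.

```java
import java.util.Locale;

/** Illustrative sketch of the tree edit distance ranking; names are not taken from Coogle. */
final class EditDistanceRankingSketch {

    /** Computes (w - d) / w with the worst case w = ss + sx (delete all nodes, insert all nodes). */
    static double rank(int editDistance, int searchTreeSize, int candidateTreeSize) {
        int worstCase = searchTreeSize + candidateTreeSize;
        if (editDistance < 0 || editDistance > worstCase) {
            throw new IllegalArgumentException("edit distance must lie between 0 and ss + sx");
        }
        return (worstCase - editDistance) / (double) worstCase;
    }

    public static void main(String[] args) {
        // Test case A: 3 edit operations; candidate size 148 is an assumed value (see lead-in).
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * rank(3, 145, 148))); // 98.98%
        // Control class: 144 operations between a tree of 145 nodes and a tree of 1 node.
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * rank(144, 145, 1))); // 1.37%
    }
}
```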
5.3.2 Results with Constructed Test Cases
This section contains the relevant result data for the similarity measures with the constructed
test cases. Each section has a table with the raw result data and an analysis of these results. We
discuss the results in the following order of the measures: bottom-up maximum common subtree,
top-down maximum common subtree and tree edit distance.
We conclude this section with an overall assessment of the performance of the measures on
these constructed test cases.
Test Case A: Add Constructor to Class
The addition of a new constructor is, in a tree based view, an addition of a new node at depth
level two. Our example constructor has an invocation in its body, so the complete addition to the
tree is a node with a single child as shown by Figure 5.1.
Table 5.1 contains the results of a bottom-up match on the trees before and after the modification. The matching tree for this modification has a size of 21 nodes. As the new method is added
in between the existing methods, a bottom-up subtree match is not very efficient.
[Tree diagram: root AzureusCoreImpl with the children AzureusCoreImpl(), [other methods] and the
added constructor AzureusCoreImpl(String), which has the single child [invocation] this().]
Figure 5.1: Test Case A: Resulting tree after changing the class. Added tree elements in italic.
The results are better when performing a top-down subtree match. Here the matched subtree
has a size of 78 nodes and the changed class a similarity of 53.24% as Table 5.2 illustrates. The
top-down match leads to better results for this test case because parts which do not match will
not stop the matching process, but are simply ignored (including their child nodes).
The best results for test case A are achieved by using the tree edit distance measure, see Table 5.3.
The tree representing the modified class needs 3 edit steps, which results in a similarity of
almost 99%. The edit steps needed are the addition of the new method (first step) with its parameter
(second step) and the invocation (third step) in the method body.
For all three similarity measures we receive 1.37% similarity for our control class. This demonstrates
that the ranking measure is able to discount small elements.
Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImpl    21                  14.33%
control.ControlClass            1                   1.37%
Table 5.1: Results case A (add constructor to class): Bottom-up maximum common subtree.

Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImpl    78                  53.24%
control.ControlClass            1                   1.37%
Table 5.2: Results case A (add constructor to class): Top-down maximum common subtree.
Class name                      Tree edit distance   Similarity
beforeChanges.AzureusCoreImpl   0                    100.00%
afterChanges.AzureusCoreImpl    3                    98.98%
control.ControlClass            144                  1.37%
Table 5.3: Results case A (add constructor to class): Tree edit distance.

Test Case B: Add Attribute to Class
The addition of a new variable including getter and setter methods for it results in a tree
modification as shown in Figure 5.2. A new node for the attribute is inserted after the already defined
variables and two method references are appended after the existing method definitions.
[Tree diagram: root AzureusCoreImpl with the children [other variables], String test, [other methods],
getTest() and setTest(String). The added elements are String test, getTest() and setTest(String).]
Figure 5.2: Test Case B: Resulting tree after changing the class. Added tree elements in italic.
The result table for the bottom-up match, Table 5.4, contains almost the same results as the
bottom-up match for case A. The matched tree has a size of 21 and the similarity differs by less
than a tenth of a percent. The difference is due to the slightly bigger size of the tree representation
of afterChanges.AzureusCoreImpl. In both cases A and B, the same tree is matched as
maximum subtree, namely start(), the biggest method in the class. The algorithm misses the
surrounding methods as match, because the root nodes (which have all the methods as children)
of the trees have different equivalence classes. This also occurs in several other test cases.
When using a top-down match for this case, the results even get worse. The similarity of
the changed class falls down to 6.12% with a matched tree size of 9 nodes (see Table 5.5). A
top-down subtree match is not efficient, because the inserted variables prevent a better match by
stopping the matching process as soon as such a variable node is encountered. This issue could
be circumvented by using an unordered tree match. However, as described in Section 4.6.2 this is
currently not possible.
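The effect of an inserted node on an ordered top-down match can be illustrated with a small sketch. It is not Coogle's implementation: we assume that nodes are compared by type only, that siblings are paired by position, and that the scan over a sibling list ends at the first pair that does not match. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative ordered top-down match under the assumptions stated in the lead-in. */
final class TopDownSketch {

    static final class Node {
        final String type;
        final List<Node> children;

        Node(String type, Node... children) {
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns the number of matched nodes, or 0 if the two roots have different types. */
    static int match(Node a, Node b) {
        if (!a.type.equals(b.type)) {
            return 0;
        }
        int size = 1;
        int pairs = Math.min(a.children.size(), b.children.size());
        for (int i = 0; i < pairs; i++) {
            int sub = match(a.children.get(i), b.children.get(i));
            if (sub == 0) {
                break; // assumed behaviour: the first mismatching sibling pair ends the scan
            }
            size += sub;
        }
        return size;
    }

    public static void main(String[] args) {
        Node before = new Node("class",
                new Node("attribute"), new Node("attribute"),
                new Node("method", new Node("invocation"), new Node("invocation")),
                new Node("method", new Node("invocation")));
        // One attribute inserted after the existing attributes: every method is shifted by one.
        Node withAttribute = new Node("class",
                new Node("attribute"), new Node("attribute"), new Node("attribute"),
                new Node("method", new Node("invocation"), new Node("invocation")),
                new Node("method", new Node("invocation")));
        System.out.println(match(before, withAttribute)); // 3 of 8 nodes: root and two attributes
    }
}
```

In this toy tree the inserted attribute is paired with the first method, so all methods behind it are lost for the match although they are unchanged.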
Tree edit distance calculation yields the best results for this test case as Table 5.6 shows. The
four edit operations are for the variable declaration, the two added methods and the parameter
of the setTest() method.
Our control class again scored poorly with all three measures.
Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImpl    21                  14.29%
control.ControlClass            1                   1.37%
Table 5.4: Results case B (add attribute to class): Bottom-up maximum common subtree.
Test Case C: Add Invocation to Method
Using the bottom-up common subtree algorithm generates the same results as for cases A and
B. Table 5.7 shows the same tree size of 21 for both afterChanges.AzureusCoreImpl and
afterChanges.AzureusCoreImplA. The reason for the matching of the same tree corresponds
to the explanation given in case B.
The top-down subtree matching results in a similarity score of 99.66% (see Table 5.8). The matching of
beforeChanges.AzureusCoreImpl is complete; only the size increase due to the added invocation
prevents the objects from getting a complete match.
The same score of 99.66% similarity is reached when using the tree edit distance measure.
Table 5.9 shows that a single edit operation is needed to transform one tree into the other. Evidently,
this operation is the insertion of the invocation.
As afterChanges.AzureusCoreImpl and afterChanges.AzureusCoreImplA have coinciding
matching size and similarity performance in all three measures, we conclude that which
method an invocation is added to does not influence similarity.
Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImpl    9                   6.12%
control.ControlClass            1                   1.37%
Table 5.5: Results case B (add attribute to class): Top-down maximum common subtree.

Class name                      Tree edit distance   Similarity
beforeChanges.AzureusCoreImpl   0                    100.00%
afterChanges.AzureusCoreImpl    4                    98.64%
control.ControlClass            144                  1.37%
Table 5.6: Results case B (add attribute to class): Tree edit distance.
Test Case D: Method Extraction
For this test case we add the method constructorCall() to two separate classes, namely
AzureusCoreImpl and AzureusCoreImplA, at different places. In AzureusCoreImpl the
method immediately follows the constructor whereas in AzureusCoreImplA the method is appended after the last method of the original class. Figure 5.3 shows the move of the invocations
of the constructor to a separate method as done in AzureusCoreImpl.
Bottom-up subtree matching results in the same data we have seen in the previous cases. The
method start() with tree size 21 is matched and returned as maximum subtree (see Table 5.10).
Using a top-down match shows a big difference between the class AzureusCoreImpl and
AzureusCoreImplA. Table 5.11 shows a similarity of 47.95% and 95.21% respectively. The reason for this is that we compare ordered trees when using top-down measuring. By placing the
added method constructorCall() in between the existing code, all the methods and invocations are matched with different, shifted methods of the same class (for example the method
addLifecycleListener(..) is matched with addListener(..)). This results in a smaller
tree as not all invocations of the methods can be matched with possibly smaller shifted "corresponding"
methods. When adding the new method to the end of the class as done in the test
case class AzureusCoreImplA, we eliminate the shifting and therefore receive a much higher
similarity match.
The results for the tree edit distance measure can be found in Table 5.12. We see that fewer
steps are required for the class in which constructorCall() was not moved to the end of the
class, as the invocations only need to be relabeled. This can be done in 8 operations. For the class
AzureusCoreImplA 14 operations are needed when deleting and re-adding all 7 invocations at
the end of the tree.
[Tree diagram; recoverable node labels: AzureusCoreImpl (root), AzureusCoreImpl(), [other methods],
[invocations], constructorCall(), [invocations].]
Figure 5.3: Test Case D: Resulting tree after changing the class. Added tree elements in italic.

Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImpl    21                  14.43%
afterChanges.AzureusCoreImplA   21                  14.43%
control.ControlClass            1                   1.37%
Table 5.7: Results case C (add invocation to method): Bottom-up maximum common subtree.

Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImpl    145                 99.66%
afterChanges.AzureusCoreImplA   145                 99.66%
control.ControlClass            1                   1.37%
Table 5.8: Results case C (add invocation to method): Top-down maximum common subtree.

Class name                      Tree edit distance   Similarity
beforeChanges.AzureusCoreImpl   0                    100.00%
afterChanges.AzureusCoreImpl    1                    99.66%
afterChanges.AzureusCoreImplA   1                    99.66%
control.ControlClass            144                  1.37%
Table 5.9: Results case C (add invocation to method): Tree edit distance.
Test Case E: Implement Interface
Using our similarity measures for detecting classes that implement the same interface is not very
successful as Tables 5.13, 5.14 and 5.15 show. With both bottom-up and top-down common subtree
isomorphism only minimal matching trees of a few nodes are received, indicating low similarity.
Also, the tree edit distance for all classes (except for the control class, see below) is over 30
operations, which leads to a similarity below 30%. The conclusion from these results is that
matching with our measures does not make sense for such a test case. A possible explanation for
this is that interfaces, because they do not implement methods, do not contain enough objects that
are represented in FAMIX. The similarity measures therefore do not have enough information for
effectively matching these small trees. This is the first test case where the control class receives a
higher similarity score than other classes.
Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImpl    21                  14.38%
afterChanges.AzureusCoreImplA   21                  14.38%
control.ControlClass            1                   1.37%
Table 5.10: Results case D (method extraction): Bottom-up maximum common subtree.

Class name                      Matched tree size   Similarity
beforeChanges.AzureusCoreImpl   145                 100.00%
afterChanges.AzureusCoreImplA   139                 95.21%
afterChanges.AzureusCoreImpl    70                  47.95%
control.ControlClass            1                   1.37%
Table 5.11: Results case D (method extraction): Top-down maximum common subtree.

Performance of Similarity Measures
This section outlines the main problems and details the performance of the implemented algorithms.
We do not consider test case E as the results indicate that no algorithm is able to detect
similarity in such cases.
Bottom-up Maximum Common Subtree The results in the previous sections show clearly that
a bottom-up subtree isomorphism measure is not the best way for detecting similar Java classes.
The similarity score remains static at about 14%. As explained in the different cases, the reason for
this is that this measure uses equivalence classes for checking equality of the different nodes. For a
better match, the measure has to include the surrounding, unchanged methods as well. However,
for this to happen, the equivalence class of the root would have to remain the same. This is not the
case in our tests as a node insertion/deletion at tree depth level 1 changes the equivalence class
of the root and the measure matches the biggest subtree from level 2 which usually is the biggest
method. See Section 5.4.3 for possible improvements on the algorithm to solve this problem.
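Why a single insertion below the root limits the bottom-up match to one method can be seen in a small sketch. It is a simplification and not Coogle's implementation: instead of integer equivalence class identifiers we assume a canonical string per subtree (node type plus the keys of its children in order), so that two subtrees fall into the same class exactly when their keys are equal. All names are hypothetical, and node types are assumed to contain no parentheses or commas.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative bottom-up match via canonical subtree keys (assumed stand-in for equivalence classes). */
final class BottomUpSketch {

    static final class Node {
        final String type;
        final List<Node> children;

        Node(String type, Node... children) {
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    /** Computes the key of every subtree bottom-up and records the subtree size per key. */
    static String collect(Node node, Map<String, Integer> sizeByKey) {
        StringBuilder key = new StringBuilder(node.type).append('(');
        int size = 1;
        for (Node child : node.children) {
            String childKey = collect(child, sizeByKey);
            key.append(childKey).append(',');
            size += sizeByKey.get(childKey);
        }
        String result = key.append(')').toString();
        sizeByKey.put(result, size);
        return result;
    }

    /** Size of the largest subtree whose key occurs in both trees. */
    static int largestCommonSubtree(Node search, Node candidate) {
        Map<String, Integer> searchKeys = new HashMap<>();
        Map<String, Integer> candidateKeys = new HashMap<>();
        collect(search, searchKeys);
        collect(candidate, candidateKeys);
        int best = 0;
        for (Map.Entry<String, Integer> entry : searchKeys.entrySet()) {
            if (candidateKeys.containsKey(entry.getKey())) {
                best = Math.max(best, entry.getValue());
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Node inv = new Node("invocation");
        Node before = new Node("class", new Node("attribute"),
                new Node("method", inv, inv, inv), new Node("method", inv));
        // One additional method below the root changes the key of the root.
        Node after = new Node("class", new Node("attribute"), new Node("method", inv),
                new Node("method", inv, inv, inv), new Node("method", inv));
        System.out.println(largestCommonSubtree(before, after)); // 4: only the biggest method matches
    }
}
```

Although seven of the eight nodes of the original toy tree reappear unchanged, the match stops at the biggest method because the roots no longer share a key.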
Top-down Maximum Common Subtree We receive mixed results with this measure. Cases C and D,
and partly case A, show good scores for detecting similarity with the top-down algorithm. In case
B similarity is not detected, because the insertion of a variable stops the matching process early.
This algorithm is a good measure when looking for structural similarity within classes and giving
smaller changes in methods a lesser weight. However, it already fails for small changes near the
root of the tree. This constraint can possibly be lessened by using an approach similar to the one
proposed in the previous paragraph for the bottom-up maximum subtree match.
Tree Edit Distance The tree edit distance algorithm performed best for our test cases. The similarity
scores in each test are over 95% (the lowest being 95.21% for AzureusCoreImplA in case D, see
Table 5.12), which is sufficient for establishing a similarity relation between two classes with a
high accuracy. The big advantage of this algorithm is that it is not as susceptible to node
insertions/deletions as the two maximum common subtree algorithms.
Class name                      Tree edit distance   Similarity
beforeChanges.AzureusCoreImpl   0                    100.00%
afterChanges.AzureusCoreImpl    8                    97.26%
afterChanges.AzureusCoreImplA   14                   95.21%
control.ControlClass            144                  1.37%
Table 5.12: Results case D (method extraction): Tree edit distance.

Class name                           Matched tree size   Similarity
beforeChanges.RateControlledEntity   7                   100.00%
afterChanges.SinglePeerDownloader    1                   4.17%
afterChanges.SinglePeerUploader      1                   3.77%
afterChanges.MultiPeerDownloader     1                   2.74%
afterChanges.MultiPeerUploader       1                   1.08%
control.ControlClass                 1                   25.00%
Table 5.13: Results case E (implement interface): Bottom-up maximum common subtree.

Class name                           Matched tree size   Similarity
beforeChanges.RateControlledEntity   7                   100.00%
afterChanges.SinglePeerUploader      7                   26.42%
afterChanges.SinglePeerDownloader    3                   12.50%
afterChanges.MultiPeerDownloader     3                   8.22%
afterChanges.MultiPeerUploader       3                   3.23%
control.ControlClass                 1                   25.00%
Table 5.14: Results case E (implement interface): Top-down maximum common subtree.

Class name                           Tree edit distance   Similarity
beforeChanges.RateControlledEntity   0                    100.00%
afterChanges.SinglePeerDownloader    34                   29.17%
afterChanges.SinglePeerUploader      39                   26.42%
afterChanges.MultiPeerDownloader     59                   19.18%
afterChanges.MultiPeerUploader       172                  7.53%
control.ControlClass                 6                    25.00%
Table 5.15: Results case E (implement interface): Tree edit distance.

5.3.3 Results with org.eclipse.compare
This section contains the results for the similarity matching on the compare project of Eclipse. We
outline the most significant detections for selected classes and have a look at the overall
performance of each similarity algorithm.
General Observations
One single class can be analysed per run of Coogle with the current implementation. For analysing
a complete project, and calculating the similarity of every class to every other class in the project,
the implementation needs to be changed. We therefore selected single classes and analysed their
similarity.
Using Coogle on the compare project gives insights on the efficiency of the different measures.
However, the results themselves are not very interesting as no real similarity (for example created
by code duplication) could be detected. The average similarity lies below 50% as
the results in the following section show. We found that the algorithm correctly detects structural
similarity in the sample classes, but we could not find functional similarity in the project, neither
by using the similarity search nor by manually studying the source of the project.
Exemplary Detections
We discuss and include results for two example similarity searches on org.eclipse.compare.
The two sample classes are CompareViewerPane and NavigationAction.
org.eclipse.compare.CompareViewerPane We ran our analysis on CompareViewerPane. Table 5.16
contains the results. The average similarity, the unweighted mean of all three measures,
\[ \frac{r_{bottom} + r_{top} + r_{edit}}{3}, \]
is below 50%. This indicates a low similarity between CompareViewerPane
and any other class of the project. Figure 5.4 shows an almost uniformly continuous distribution
curve for the average similarity.

Class name                     Bottom-up   Top-down   Tree edit distance   Average
CompareViewerPane              100.00%     100.00%    100.00%              100.00%
internal.merge.LineComparator  11.43%      40.00%     80.00%               43.81%
internal.TokenComparator       28.89%      4.44%      77.78%               37.04%
internal.ListContentProvider   14.81%      37.04%     51.85%               34.57%
Table 5.16: Selected top results for comparison on org.eclipse.compare.CompareViewerPane.
Figure 5.4: Distribution of average similarity (bottom-up, top-down and tree edit distance measures) for class
CompareViewerPane in org.eclipse.compare project. Denoted on the x-axis are all classes of the project with
descending similarity.
The specific results for each measure on its own are not surprising given these average similarity
results. Bottom-up maximum common subtree isomorphism only has two classes with
a similarity over 25%. One is TokenComparator with a similarity of 28.89% and the other
is NavigationAction (see later). The size of the matched subtree is the same in both cases,
namely 13 nodes. This is also the overall maximum subtree size for bottom-up matching on
CompareViewerPane. All the trees with size 13 are matches of code in the constructor of
CompareViewerPane.
Top-down maximum common subtree matches an overall maximum subtree of 14 nodes with
the class LineComparator. Because that class is rather small, its similarity percentage gets quite
high. However, the match of 14 nodes is reasonable and indicates the similarity of the classes:
both have a single variable declaration at the beginning, followed by a single constructor and
then five (CompareViewerPane) and four (LineComparator) method declarations. Note that,
unlike before, TokenComparator has a very low similarity with just two matched nodes. This
is caused by the additional variable declarations of TokenComparator at the beginning of the
class which prevent a better match.
The distribution of the similarity calculated with tree edit distance is depicted in Figure 5.5.
The curve is almost linear with LineComparator and TokenComparator being the top matches
with a tree edit distance of 14 and 20 nodes respectively.

Class name                 Bottom-up   Top-down   Tree edit distance   Average
NavigationAction           100.00%     100.00%    100.00%              100.00%
internal.SimpleTextViewer  16.00%      40.00%     84.00%               46.67%
internal.BufferedCanvas    6.25%       37.50%     81.25%               41.67%
CompareViewerPane          39.39%      6.06%      57.58%               34.34%
Table 5.17: Selected top results for comparison on org.eclipse.compare.NavigationAction.
Figure 5.5: Distribution of tree edit distance similarity for class CompareViewerPane in the compare project of
Eclipse. Denoted on the x-axis are all classes of the project with descending similarity.
Studying the source code of the top results by average similarity confirms that the matches are
correct from an algorithmic point of view. Although none of the classes have a strong
functional similarity with CompareViewerPane, all the matches show a structural resemblance
which justifies the similarity match percentage.
org.eclipse.compare.NavigationAction The distribution curve for NavigationAction is similar to the one shown in Figure 5.4 for CompareViewerPane. The similarity results for each
measure including the average similarity are detailed in Table 5.17. The data is very similar to the
results of the previous analysis on CompareViewerPane.
As before, a class exists, here it is SimpleTextViewer, that performs poorly on bottom-up
search, mediocre on top-down search and best on tree edit distance. The results resemble
those previously received for LineComparator. Further, the similarity of CompareViewerPane
to the class NavigationAction is comparable with the results of TokenComparator in the
prior analysis: high percentage in bottom-up, very low with top-down and a bit higher in tree
edit distance measuring.
Performance of Similarity Measures
This section outlines the main problems and the overall performance of the implemented algorithms when testing with the org.eclipse.compare project.
Bottom-up Maximum Common Subtree The tests with the Eclipse compare project show one
good application for the bottom-up maximum common subtree algorithm: the detection of similar methods in otherwise varying classes. However, there are limits for such detections as the
measure only detects the biggest method in the class and needs to be rerun for more matches.
Further, the detection will always detect a match if the method is sufficiently big. Matches like
this are not of any real value from a class similarity point of view.
Top-down Maximum Common Subtree We did not find any clear similarity match with a top-down
maximum common subtree search. This comes from the fact that the project does not have
any duplicated classes (or duplicated parts of classes). At least, we were not able to detect or
manually find any duplications. The similarity matches in the project are primarily of structural
nature which means that the algorithm is able to detect classes that have a similar structure,
but are not connected through functional similarity. Such structural similarity can be a lead to
functional similarity, but in projects where the classes usually are of same size and follow the
same ordering pattern of fields, constructor, getter/setter methods, a functional similarity match
is hard to detect without further matching on class coupling for example (also see Section 5.4.3).
Tree Edit Distance The same conclusions as for the top-down maximum subtree search apply
to the tree edit distance measure. This measure not only produces good results with the test
cases; the analysis with org.eclipse.compare also shows that the algorithm is able to identify
similarity between classes, at least structural similarity.
5.4 Discussion
In this section we discuss the results of our evaluation, highlight the shortcomings of the measures
and indicate possible ways for improvement.
5.4.1 Comparison of Implemented Measures
We tested the similarity measures on two different types of projects: constructed test cases and
the org.eclipse.compare project. The results show that a bottom-up maximum common subtree
isomorphism match is not a good measure for similarity. It is too susceptible to subtle code modifications in methods which usually cause changes at the bottom level of the tree. However, big
result trees with this measure often indicate the existence of similarly sized and structured methods in both the search and matching classes.
The top-down maximum common subtree algorithm shows promising results. The measure
is a good indicator for similarity as it is able to detect classes with similar structure. A negative
characteristic of the algorithm is that simple changes at the top of the tree, like adding a new
attribute or inserting an attribute between existing methods, reduce the reliability of the measure.
A top-down search is however not as sensitive to changes at the bottom of the tree as the
bottom-up isomorphism.
The best overall similarity measure is the tree edit distance. It detected the small refactorings
in the test cases and proved itself a good indicator for structural similarity with the
compare project.
5.4.2 Shortcomings
Measures for ordered trees only. All the tree measures are limited to ordered trees as described
in Section 3.4. An unordered tree gives better results when performing a tree edit distance
calculation for example.
Parsing syntax tree creates artificial ordering. Due to the implementation of the abstract syntax
tree parsing, the created trees are already ordered. This is described in Section 4.5.3. It is not
certain however, that removing this shortcoming leads to better similarity results as all the
trees are generated using the same parsing process.
FAMIX limits on tree hierarchy. By parsing the abstract syntax tree into a FAMIX representation,
we lose hierarchy information as the FAMIX model is a rather flat hierarchy, usually only
a few levels deep. This can be circumvented by leaving out the conversion into FAMIX and
directly generating the input tree from the abstract syntax tree.
FAMIX limits on content. FAMIX represents certain basic instructions (invocations, declarations,
attributes, etc.), but does not include assignments, mathematical operations and such.
This is good for an overall similarity measuring, but will limit the detections for small
changes on these basic instructions.
5.4.3 Possible Improvements
We propose the following improvements to overcome the described limitations:
Use complete abstract syntax tree. By using the complete abstract syntax tree for measuring similarity,
we can increase the level of detail down to single instructions and build hierarchically
more structured trees as well. This might help to detect "real", functionally similar classes
and diminish the detection of classes just structurally similar.
Class/method coupling analysis. An improvement for detecting functional similarity between
classes can be made by analysing the coupling of methods or classes. This is done by measuring and comparing invocations and references to other classes. With this information we
can find classes designed for performing similar tasks.
Field or method name matching. Additional similarity information can be gained from field or
method names. Methods and fields used for similar tasks are given similar names
if the code was created abiding by reasonable naming standards. This helps detecting cloned
parts of classes. A measure such as Levenshtein's string distance can be used for calculating
this similarity between names.
Surrounding string matching. This is like field or method name matching, but matches text surrounding the class/methods, for example comments or Javadoc.
Bottom-up subtree search improvement. A proposition for the bottom-up maximum common
subtree matching is to have the algorithm automatically remove the non-matching nodes,
recalculate the tree's equivalence classes and restart the bottom-up maximum subtree search
for this new tree. However, we can neither estimate the efficiency nor predict whether such an
algorithm would perform better than the current implementation.
Coupling of multiple matching algorithms. Our implemented algorithms as well as proposed
algorithms can be combined with their individual similarity score and weighted as desired.
With our current implementation, a combined similarity can be defined as:
\[ similarity = \frac{w_a \cdot r_{bottomup} + w_b \cdot r_{topdown} + w_c \cdot r_{editdistance}}{w_a + w_b + w_c} \]
A possible weighting, based on each measure's effectiveness in our tests, would then be:
$w_a = 1$, $w_b = 2$, $w_c = 3$.
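The combined similarity can be computed with a few lines of code. The following sketch uses hypothetical names and applies the proposed weights to the scores reported for internal.merge.LineComparator in Table 5.16 (11.43%, 40.00% and 80.00%); the weighted result is our own calculation and not a figure from the evaluation.

```java
import java.util.Locale;

/** Illustrative weighted combination of similarity scores; names are not taken from Coogle. */
final class CombinedSimilaritySketch {

    /** Weighted mean of the scores; scores[i] belongs to weights[i]. */
    static double combine(double[] scores, double[] weights) {
        if (scores.length == 0 || scores.length != weights.length) {
            throw new IllegalArgumentException("need one weight per score");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += weights[i] * scores[i];
            totalWeight += weights[i];
        }
        if (totalWeight <= 0.0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Order: bottom-up, top-down, tree edit distance (LineComparator, Table 5.16).
        double[] scores = {0.1143, 0.4000, 0.8000};
        double unweighted = combine(scores, new double[] {1, 1, 1});
        double weighted = combine(scores, new double[] {1, 2, 3});
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * unweighted)); // 43.81%
        System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * weighted));   // 55.24%
    }
}
```

With equal weights the function reproduces the average column of Table 5.16; the proposed weighting shifts the result towards the tree edit distance score.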
Chapter 6
Conclusion and Future Work
In this thesis we described the implementation of a similarity analysis tool called Coogle. This
tool calculates similarity by comparing tree representations of the source code of two given Java
classes. The requirement was to use an intermediary tree representation model called FAMIX.
Our goal was to analyse the detection of similarity when using a tree representation of source
code with three different similarity measures.
The following contributions are made by this thesis:
• We created a Java implementation of the FAMIX model. This implementation was then used
to represent the source code of Java classes.
• An abstract syntax tree parser was refactored and extended to create a FAMIX representation of Eclipse’s abstract syntax tree.
• Three different similarity measures for the comparison of general trees were implemented:
bottom-up maximum common subtree isomorphism, top-down maximum common subtree
isomorphism, and tree edit distance.
• An Eclipse plug-in called Coogle was built. Coogle is a wizard-based tool for analysing
similarity between selected Java classes.
• Test cases and the org.eclipse.compare project were used for analysing the efficiency
of the similarity measures.
6.1 Results
Measuring abstract syntax tree similarity is a valid approach for detecting similar Java classes.
Of the three tree similarity measures, the tree edit distance produces the best results, followed by
the top-down maximum common subtree isomorphism. Especially when measuring the effects of
refactorings, the tree edit distance measure proved to be very reliable. This measure is for ordered
trees only and therefore has shortcomings when analysing changes that affect the ordering of the
tree, such as relocating field definitions or methods. Bottom-up maximum common subtree is not
as efficient as the other two measures, often failing for small structural changes. This is due to the
shallow hierarchy of the FAMIX model, which was used for representing the abstract syntax trees
of the source code. Functional similarity is not detected by any of the measures.
We ran the similarity analysis on a major project, Eclipse’s compare plug-in. Although we
found structural similarities, none indicate cloned code fragments or duplicated classes.
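The sensitivity of ordered comparison to relocated members, noted in the results above, can be illustrated with a small sketch. It is not one of Coogle’s measures: the class OrderSensitivityDemo and its methods are hypothetical, and the unordered check simply sorts a canonical string per subtree, which is sufficient only as long as node labels do not contain brackets or commas.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;

/** Hypothetical illustration: ordered versus unordered equality of labelled trees. */
final class OrderSensitivityDemo {

    private OrderSensitivityDemo() {}

    static String label(DefaultMutableTreeNode n) {
        return String.valueOf(n.getUserObject());
    }

    /** Ordered equality: labels must match and children must match position by position. */
    static boolean equalOrdered(DefaultMutableTreeNode a, DefaultMutableTreeNode b) {
        if (!label(a).equals(label(b)) || a.getChildCount() != b.getChildCount()) {
            return false;
        }
        for (int i = 0; i < a.getChildCount(); i++) {
            if (!equalOrdered((DefaultMutableTreeNode) a.getChildAt(i),
                              (DefaultMutableTreeNode) b.getChildAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Canonical form in which the order of siblings is normalised by sorting. */
    static String canonical(DefaultMutableTreeNode n) {
        List<String> children = new ArrayList<String>();
        for (int i = 0; i < n.getChildCount(); i++) {
            children.add(canonical((DefaultMutableTreeNode) n.getChildAt(i)));
        }
        Collections.sort(children);
        return label(n) + children;
    }

    /** Unordered equality: sibling order is ignored. */
    static boolean equalUnordered(DefaultMutableTreeNode a, DefaultMutableTreeNode b) {
        return canonical(a).equals(canonical(b));
    }

    /** Builds a flat "Class" tree with one child per member label. */
    static DefaultMutableTreeNode classTree(String... members) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Class");
        for (String member : members) {
            root.add(new DefaultMutableTreeNode(member));
        }
        return root;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode before = classTree("Attribute", "Method");
        DefaultMutableTreeNode after = classTree("Method", "Attribute"); // member relocated
        System.out.println("ordered:   " + equalOrdered(before, after));
        System.out.println("unordered: " + equalUnordered(before, after));
    }
}
```

Relocating a single member is enough to make the ordered comparison fail, while the unordered comparison still treats both trees as equal. An ordered edit distance has to pay for the same relocation with delete and insert operations.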
6.2 Future Work
We propose multiple extensions that can improve Coogle’s ability to detect similarity:
Extend Coogle to parse a complete Java project in one step. By adding a loop around the similarity search procedure of Coogle, a complete Java project can be analysed in one step. This
makes it possible to detect similarity not only for single, selected classes, but also to find the
most similar classes of a whole project. An interesting application of this lies in the area of developer assistance: during development the similarity measure can suggest code samples from
a repository, providing the developer with sample code or already existing implementations
of the desired piece of code.
Use the abstract syntax tree as input. We propose to calculate the similarity of classes based on
the abstract syntax tree directly, without losing information by first converting the AST into a
FAMIX intermediate representation. This deepens the hierarchy of the trees and enables the measures to match finer-grained subtrees.
Add other measures for tree matching. An interesting candidate for an additional algorithm for
ordered tree matching is described in [Chawathe et al., 1996]. Further, a top-down maximum common subtree measure for unordered trees should be added and evaluated.
Functional similarity detection. Statements are surrounded by text in source code (e.g., comments, field names or neighbouring statements). Analysing the similarity of the text surrounding a statement and including this textual similarity in the measures may improve the detection of similarity. Additionally, similar comments or field names are a good indicator for functional
similarity. Considering surrounding text in the measures should therefore improve the ability to
detect functional similarity.
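As a starting point for such textual measures, the similarity of two field or method names could be derived from the Levenshtein string distance mentioned in Section 5.4. The following sketch is not part of Coogle. The class NameSimilarity is hypothetical, and normalising the distance by the length of the longer name is our own assumption.

```java
/** Hypothetical helper (not part of Coogle): name similarity based on Levenshtein distance. */
final class NameSimilarity {

    private NameSimilarity() {}

    /** Minimum number of single-character insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                current[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Similarity in [0, 1]; 1 means identical names (assumed normalisation). */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty names are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("getName", "setName"));
        System.out.println(similarity("getName", "setName"));
    }
}
```

A score of this kind could enter the weighted combination of Section 5.4 as an additional term, next to the three tree-based scores.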
Appendix A
Coogle Step by Step
This is a detailed walk through a search with the Coogle plug-in. We search for classes similar to
the class org.eclipse.compare.Splitter in the Eclipse compare project.
Figure A.1 shows the context menu after right-clicking on a Java project in the Package Explorer.
The project we right-click on is the project in which we want to search for similarity, the Eclipse
compare plug-in. After selecting Start similarity search... in the Coogle submenu, the main Coogle
wizard starts up.
Figure A.1: Context menu when right-clicking on a Java project in the Eclipse workspace.
We are presented with a selection of similarity measures. An additional setting for ordered or
unordered search with labelled or unlabelled trees can be made when choosing to run a bottom-up maximum common subtree isomorphism. The screenshot in Figure A.2 illustrates this step.
The next wizard page lists all projects in the current Eclipse workspace (depicted in Figure A.3). After selecting a project and pressing Next, all classes in the selected project are collected
and presented on the next wizard page, which is shown in the screenshot in Figure A.4. Here we
select the class to use as similarity search object. The similarity of every class in the project selected
through the context menu to this class will be calculated.
The last step before the calculation starts is depicted in Figure A.5. The page shows an
overview of the similarity search that will be performed. Upon pressing Finish, a dialog with
a progress bar shows the current status of the search.
Figure A.6 illustrates the result dialog that is shown after the calculation has finished. The result
table has three columns: the Name of the class that was matched with the search object, the
calculated Edit distance, i.e., the number of edit operations needed for transforming the trees, and
the Similarity in % as described in Section 5.3.1.
Figure A.2: Step 1: Welcome screen and choice of similarity.
Figure A.3: Step 2: Selection of project containing the desired search object.
Figure A.4: Step 3: Search object selection.
Figure A.5: Step 4: Final summary page before calculation is started.
Figure A.6: Result dialog of a tree edit distance calculation on the Eclipse compare project with the class
org.eclipse.compare.Splitter as search object.
Appendix B
How to Extend Coogle
This chapter contains useful code snippets and descriptions on how to reuse the existing code
when adding new functionality to Coogle.
B.1 Add a New Similarity Measure
The currently implemented similarity measures reside in package ch.toe.tree.calc. Classes
calculating similarity extend the abstract class Calculator. Listing B.1 shows a sample class
definition.
The boolean calculated represents the status of the calculation. The calculation itself is done by the method calculate(), which has to be
invoked by the constructor right after initialisation. See Listing B.2 for a basic constructor implementation.
To integrate the measure into the Coogle wizard, a subclass of SimilarityOperation
(in package ch.toe.coogle.operation.generic) is added. We create this class in package ch.toe.coogle.operation.classes and override the method execute(), which is
called by the wizard upon pressing Finish. Listing B.3 illustrates the added class for the new
operation.
ch.toe.coogle.operation.dialog.ResultDialog is the base class for a new result
dialog. Here, the method createContents(..) is called as soon as the dialog is displayed. We
therefore create the desired result controls in a new implementation of this method as illustrated
by Listing B.4.
package ch.toe.tree.calc;

public class CalculateTreeSize extends Calculator {
    [..]
}
Listing B.1: Sample class for defining a new similarity measure.
[..]
public CalculateTreeSize(DefaultMutableTreeNode tree)
        throws NullPointerException, TreeNodeTypeException {
    super();
    if (tree == null)
        throw new NullPointerException("Empty tree passed!");
    this.tree = tree;
    // run the main calculation and record whether it succeeded
    if (calculate())
        setCalculated(true);
}
[..]
Listing B.2: Sample constructor for a new similarity measure with a single tree as parameter.
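Listings B.1 and B.2 leave the body of calculate() elided. The following self-contained sketch shows what such a method could look like for a measure that simply counts the nodes of a tree. The class SketchCalculator is a stand-in for Coogle’s Calculator, whose members are only known here as far as the listings show them. TreeSizeSketch, getSize() and sampleTree() are hypothetical names.

```java
import java.util.Enumeration;
import javax.swing.tree.DefaultMutableTreeNode;

/** Stand-in for Coogle's Calculator; only the members used in Listing B.2 are modelled. */
abstract class SketchCalculator {
    private boolean calculated = false;

    /** Main calculation method; returns true if the calculation succeeded. */
    protected abstract boolean calculate();

    protected void setCalculated(boolean calculated) {
        this.calculated = calculated;
    }

    public boolean isCalculated() {
        return calculated;
    }
}

/** Hypothetical measure: the number of nodes in a tree. */
final class TreeSizeSketch extends SketchCalculator {
    private final DefaultMutableTreeNode tree;
    private int size;

    TreeSizeSketch(DefaultMutableTreeNode tree) {
        if (tree == null) {
            throw new NullPointerException("Empty tree passed!");
        }
        this.tree = tree;
        // as in Listing B.2: calculate right after initialisation
        if (calculate()) {
            setCalculated(true);
        }
    }

    protected boolean calculate() {
        size = 0;
        // visit every node of the tree exactly once, including the root
        Enumeration<?> nodes = tree.depthFirstEnumeration();
        while (nodes.hasMoreElements()) {
            nodes.nextElement();
            size++;
        }
        return true;
    }

    int getSize() {
        return size;
    }

    /** Builds a sample tree: a root with two children, one of which has one child. */
    static DefaultMutableTreeNode sampleTree() {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Class");
        DefaultMutableTreeNode method = new DefaultMutableTreeNode("Method");
        method.add(new DefaultMutableTreeNode("FormalParameter"));
        root.add(method);
        root.add(new DefaultMutableTreeNode("Attribute"));
        return root;
    }

    public static void main(String[] args) {
        TreeSizeSketch calc = new TreeSizeSketch(sampleTree());
        System.out.println(calc.isCalculated() + " " + calc.getSize());
    }
}
```

Because calculate() is invoked from the constructor, every field it reads must be assigned before the call, which is why the tree is stored first.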
After creating these operational classes, we need to add the operation to the wizard. This
requires the following changes (all classes are in package ch.toe.coogle.wizard):
• Add the operation to the model class CoogleModel. Listing B.5 shows the needed additions: a boolean field to represent the currently chosen algorithm and a String identifying
the added operation.
• Create the controls for selecting the new measure in the method createPageContent(..)
in CoogleWizardPageWelcome. See Listing B.6 for the addition of the sample measure.
• Add the operation to performFinish() in the main wizard class CoogleWizard. This is
shown in Listing B.7.
B.2 Extend the Information in the Tree
As the measures operate on general tree representations, the new tree representation only needs
to be of type DefaultMutableTreeNode. Then it can be passed to the constructor of any similarity measure. When defining the tree with special objects as user objects, a new comparator for
evaluating node equality probably needs to be defined.
B.3 Define a New Comparator
A comparator is used by the calculation classes for comparing two nodes for equality. The interface ITreeComparator needs to be implemented by new comparator implementations. compare()
receives two DefaultMutableTreeNodes and returns true or false for indicating equality.
Listing B.8 shows a sample implementation of a new comparator.
The new comparator is used on any similarity measure by passing the instantiated comparator
object to the constructor of the measure. See Listing 4.7 for a constructor receiving a comparator.
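The interplay of special user objects (Section B.2) and a matching comparator can be sketched as follows. All names are hypothetical: SketchTreeComparator stands in for ITreeComparator and is reduced to the compare() method described above, and MethodInfo is an invented user object, not a FAMIX class.

```java
import javax.swing.tree.DefaultMutableTreeNode;

/** Stand-in for ITreeComparator, reduced to the method described in the text. */
interface SketchTreeComparator {
    boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2);
}

/** Hypothetical user object: a method with a name and a number of parameters. */
final class MethodInfo {
    final String name;
    final int parameterCount;

    MethodInfo(String name, int parameterCount) {
        this.name = name;
        this.parameterCount = parameterCount;
    }
}

/** Treats two nodes as equal if both hold a MethodInfo with the same parameter count. */
final class ParameterCountComparator implements SketchTreeComparator {
    public boolean compare(DefaultMutableTreeNode node1, DefaultMutableTreeNode node2) {
        if (node1 == null || node2 == null) {
            return false;
        }
        Object first = node1.getUserObject();
        Object second = node2.getUserObject();
        // nodes with missing or foreign user objects never match
        if (!(first instanceof MethodInfo) || !(second instanceof MethodInfo)) {
            return false;
        }
        return ((MethodInfo) first).parameterCount == ((MethodInfo) second).parameterCount;
    }
}

final class ComparatorDemo {

    private ComparatorDemo() {}

    /** Wraps two MethodInfo objects in tree nodes and compares them. */
    static boolean sameArity(String name1, int params1, String name2, int params2) {
        DefaultMutableTreeNode node1 = new DefaultMutableTreeNode(new MethodInfo(name1, params1));
        DefaultMutableTreeNode node2 = new DefaultMutableTreeNode(new MethodInfo(name2, params2));
        return new ParameterCountComparator().compare(node1, node2);
    }

    public static void main(String[] args) {
        System.out.println(sameArity("start", 0, "stop", 0));
        System.out.println(sameArity("start", 0, "reportPercent", 1));
    }
}
```

The comparator decides what equality means for a measure. Here method names are deliberately ignored, so renamed methods still match, at the price of matching unrelated methods that happen to have the same number of parameters.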
package ch.toe.coogle.operation.classes;

public class TreeSizeOperationClassImpl extends SimilarityOperation {
    protected ArrayList[] treeSize;

    public TreeSizeOperationClassImpl(CoogleModel model) {
        super(model);
    }

    public boolean execute() throws InvocationTargetException,
            InterruptedException {
        // process all trees in the project
        for (int i=0; i<objectTrees.length; i++) {
            if (objectTrees[i] == null)
                continue;
            CalculateTreeSize calc = null;
            try {
                calc = new CalculateTreeSize(
                    new DefaultMutableTreeNode()
                );
            } catch (NullPointerException e) {
                // calc stays null and is skipped by the check below
            } catch (TreeNodeTypeException e) {
                continue;
            }
            if (calc == null || !calc.isCalculated())
                continue;
            // add tree size results to result array
            [..]
            // make object available for garbage collector
            objectTrees[i] = null;
        }
        return true;
        [..]
    }

    public void displayResults(Shell shell) {
        TreeSizeDialog dialog = new TreeSizeDialog(shell);
        dialog.setDuration(getOperationDurationString());
        dialog.setData(treeSize);
        dialog.open();
    }
    [..]
Listing B.3: Sample implementation of a new measure operation.
package ch.toe.coogle.operation.dialog;

public class TreeSizeDialog extends ResultDialog {
    public TreeSizeDialog(Shell parent) {
        super(parent);
    }

    public TreeSizeDialog(Shell parent, int style) {
        super(parent, style);
    }

    protected void createContents(Shell shell) {
        // window title
        shell.setText("Tree edit distance result table");
        GridLayout layout = new GridLayout();
        layout.numColumns = 1;
        shell.setLayout(layout);
        addDurationLabel(shell);
        // add controls
        [..]
        addCloseButton(shell);
    }
}
Listing B.4: Implementation of a new result dialog.
public static final String treeSize = "Tree size";
public boolean doTreeSize = false;
Listing B.5: In class CoogleModel: Model extension for a new measure.
[..]
Button radioTreeSize = new Button(g1, SWT.RADIO);
radioTreeSize.setText(CoogleModel.treeSize);
radioTreeSize.addSelectionListener(new SelectionListener() {
    public void widgetDefaultSelected(SelectionEvent e) {
        widgetSelected(e);
    }

    public void widgetSelected(SelectionEvent e) {
        // select the new measure and deselect all others
        model.doTreeSize = true;
        model.doEditDistance = false;
        model.doBottomUp = false;
        model.doTopDown = false;
        // the ordered/labelled options do not apply to this measure
        treesOrdered.setEnabled(false);
        treesOrdered.setVisible(false);
        treesOrdered.setSelection(false);
        treesLabeled.setEnabled(false);
        treesLabeled.setVisible(false);
        treesLabeled.setSelection(false);
        setPageComplete(true);
    }});
[..]
Listing B.6: In class CoogleWizardPageWelcome: Additions to the welcome page of the wizard for a new
tree measure.
[..]
public boolean performFinish() {
    [..]
    if (model.doClassSearch) {
        [..]
        if (model.doTreeSize) {
            operation =
                new TreeSizeOperationClassImpl(model);
        }
    }
    [..]
Listing B.7: In class CoogleWizard: Add the new operation to the finish action of the wizard.
package ch.toe.tree.comparator;

public class FirstLetterTreeComparator implements ITreeComparator {
    public FirstLetterTreeComparator() {
        super();
    }

    public boolean compare(DefaultMutableTreeNode node1,
            DefaultMutableTreeNode node2) {
        if (node1 == null || node2 == null)
            return false;
        if (node1.getUserObject() == null ||
                node2.getUserObject() == null)
            return false;
        // nodes are equal if their labels start with the same character
        if (node1.getUserObject().toString().charAt(0) ==
                node2.getUserObject().toString().charAt(0))
            return true;
        return false;
    }
}
Listing B.8: A new comparator implementation.
Appendix C
Contents of CD-ROM
C.1 Directory Layout
Directory name          Description
CoogleSource.zip        This archive contains the complete source of the Coogle plug-in.
CoogleSource.zip.asc    PGP signature of the Coogle source.
org.eclipse.compare/    The source of the compare plug-in project as used in our evaluation.
TestCases/              Sources of the test cases.
TestWorkspace/          Eclipse workspace containing the projects used for evaluation.
Coogle.pdf              This document in Adobe Portable Document Format.
Abstract.pdf            Abstract of the thesis in English.
Zusfsg.pdf              Abstract of the thesis in German.
C.2 Eclipse Workspace: Test Cases
Each test case is in its own subdirectory. For example, Case A resides in the subdirectory "CaseA".
The subdirectory structure of each case follows the default Java practice, putting each package in
a subdirectory. Each test case is structured in the following three packages:
ch.toe.Coogle.TestCases.CaseX.beforeChanges. This contains the original test class without any
modifications. The class in this directory is used as similarity search object.
ch.toe.Coogle.TestCases.CaseX.afterChanges. The class from the package beforeChanges is
modified according to the test case and then put in this package. We determine the similarity
of this class to the unmodified class.
ch.toe.Coogle.TestCases.CaseX.control. One single class resides in this directory, the control class
ControlClass. This class defines an empty Java class and is used as dissimilarity measure
(see Section 5.2.1 for more information).
Appendix D
Test Cases Source Listings
This appendix contains the unmodified source code of the classes used as test objects in Section 5.2.1.
The source for these classes can also be found on the Azureus project site (see footnote 1) or on the enclosed
CD-ROM.
D.1 AzureusCoreImpl
This is the complete listing of AzureusCoreImpl, used as base class for the constructed test
cases.
/*
* Created on 13-Jul-2004
* Created by Paul Gardner
* Copyright (C) 2004 Aelitis, All Rights Reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
* AELITIS, SARL au capital de 30,000 euros
* 8 Allee Lenotre, La Grille Royale, 78600 Le Mesnil le Roi, France.
*
*/
1 http://azureus.sourceforge.net/
package com.aelitis.azureus.core.impl;
//~--- non-JDK imports --------------------------------------------------

import com.aelitis.azureus.core.*;
import com.aelitis.azureus.core.networkmanager.NetworkManager;
import com.aelitis.azureus.core.peermanager.PeerManager;
import com.aelitis.azureus.core.update.AzureusRestarterFactory;

import org.gudy.azureus2.core3.config.COConfigurationManager;
import org.gudy.azureus2.core3.global.GlobalManager;
import org.gudy.azureus2.core3.global.GlobalManagerFactory;
import org.gudy.azureus2.core3.internat.*;
import org.gudy.azureus2.core3.ipfilter.*;
import org.gudy.azureus2.core3.ipfilter.IpFilterManager;
import org.gudy.azureus2.core3.logging.LGLogger;
import org.gudy.azureus2.core3.tracker.host.*;
import org.gudy.azureus2.core3.util.*;
import org.gudy.azureus2.plugins.*;
import org.gudy.azureus2.pluginsimpl.local.PluginInitializer;

//~--- JDK imports -----------------------------------------------------------

import java.util.*;

//~--- classes ----------------------------------------------------------

/**
* @author parg
*
*/
public class AzureusCoreImpl implements AzureusCore, AzureusCoreListener {
protected static AEMonitor class_mon =
new AEMonitor("AzureusCore:class");
protected static AzureusCore singleton;
//~--- fields -------------------------------------------------------

private List listeners = new ArrayList();
private List lifecycle_listeners = new ArrayList();
private AEMonitor this_mon =
new AEMonitor("AzureusCore");
private GlobalManager global_manager;
private PluginInitializer pi;
private boolean running;
//~--- constructors -------------------------------------------------

protected AzureusCoreImpl() {
COConfigurationManager.initialise();
LGLogger.initialise();
AEDiagnostics.startup();
AETemporaryFileHandler.startup();
// ensure early initialization
NetworkManager.getSingleton();
PeerManager.getSingleton();
pi = PluginInitializer.getSingleton(this, this);
}
//~--- methods ------------------------------------------------------

public void addLifecycleListener(AzureusCoreLifecycleListener l) {
lifecycle_listeners.add(l);
}
public void addListener(AzureusCoreListener l) {
listeners.add(l);
}
public void checkRestartSupported() throws AzureusCoreException {
if (getPluginManager().getPluginInterfaceByClass(
"org.gudy.azureus2.update.UpdaterPatcher") == null) {
LGLogger.logRepeatableAlert(
LGLogger.AT_ERROR,
"Can't restart without the 'azupdater' plugin installed");
throw(new AzureusCoreException(
"Can't restart without the 'azupdater' plugin installed"));
}
}
public static AzureusCore create() throws AzureusCoreException {
try {
class_mon.enter();
if (singleton != null) {
throw(new AzureusCoreException(
"Azureus core already instantiated"));
}
singleton = new AzureusCoreImpl();
return (singleton);
} finally {
class_mon.exit();
}
}
public void removeLifecycleListener(AzureusCoreLifecycleListener l) {
lifecycle_listeners.remove(l);
}
public void removeListener(AzureusCoreListener l) {
listeners.remove(l);
}
public void reportCurrentTask(String currentTask) {
for (int i = 0; i < listeners.size(); i++) {
try {
((AzureusCoreListener) listeners.get(i)).reportCurrentTask(
currentTask);
} catch (Throwable e) {
Debug.printStackTrace(e);
}
}
}
public void reportPercent(int percent) {
for (int i = 0; i < listeners.size(); i++) {
try {
((AzureusCoreListener) listeners.get(i)).reportPercent(
percent);
} catch (Throwable e) {
Debug.printStackTrace(e);
}
}
}
public void requestRestart() throws AzureusCoreException {
runNonDaemon(new AERunnable() {
public void runSupport() {
checkRestartSupported();
for (int i = 0; i < lifecycle_listeners.size(); i++) {
if (!((AzureusCoreLifecycleListener) lifecycle_listeners
.get(i)).restartRequested(AzureusCoreImpl.this)) {
LGLogger.log(
"Core: Request to restart the core has been denied");
return;
}
}
restart();
}
});
}
public void requestStop() throws AzureusCoreException {
runNonDaemon(new AERunnable() {
public void runSupport() {
for (int i = 0; i < lifecycle_listeners.size(); i++) {
if (!((AzureusCoreLifecycleListener) lifecycle_listeners
.get(i)).stopRequested(AzureusCoreImpl.this)) {
LGLogger.log(
"Core: Request to stop the core has been denied");
return;
}
}
stop();
}
});
}
public void restart() throws AzureusCoreException {
runNonDaemon(new AERunnable() {
public void runSupport() {
LGLogger.log("Core: Restart operation starts");
checkRestartSupported();
stopSupport(false);
LGLogger.log(
"Core: Restart operation: stop complete, restart initiated");
AzureusRestarterFactory.create(AzureusCoreImpl.this).restart(
false);
}
});
}
private void runNonDaemon(final Runnable r) throws AzureusCoreException {
if (!Thread.currentThread().isDaemon()) {
r.run();
} else {
final AESemaphore sem =
new AESemaphore("AzureusCore:runNonDaemon");
final Throwable[] error = { null };
new AEThread("AzureusCore:runNonDaemon") {
public void runSupport() {
try {
r.run();
} catch (Throwable e) {
error[0] = e;
} finally {
sem.release();
}
}
}.start();
sem.reserve();
if (error[0] != null) {
if (error[0] instanceof AzureusCoreException) {
throw((AzureusCoreException) error[0]);
} else {
throw(new AzureusCoreException("Operation failed",
error[0]));
}
}
}
}
private void shutdownCore() {
if (running) {
try {
LGLogger.log(
"Core: Caught VM shutdown event; auto-stopping Azureus");
AzureusCoreImpl.this.stop();
} catch (Throwable e) {
Debug.printStackTrace(e);
}
}
}
public void start() throws AzureusCoreException {
try {
this_mon.enter();
if (running) {
throw(new AzureusCoreException("Core: already running"));
}
running = true;
} finally {
this_mon.exit();
}
LGLogger.log("Core: Loading of Plugins starts");
pi.loadPlugins(this);
LGLogger.log("Core: Loading of Plugins complete");
global_manager = GlobalManagerFactory.create(this);
for (int i = 0; i < lifecycle_listeners.size(); i++) {
((AzureusCoreLifecycleListener) lifecycle_listeners.get(
i)).componentCreated(this, global_manager);
}
pi.initialisePlugins();
LGLogger.log("Core: Initializing Plugins complete");
new AEThread("Plugin Init Complete") {
public void runSupport() {
pi.initialisationComplete();
for (int i = 0; i < lifecycle_listeners.size(); i++) {
((AzureusCoreLifecycleListener) lifecycle_listeners.get(
i)).started(AzureusCoreImpl.this);
}
}
}.start();
// Catch non-user-initiated VM shutdown
ShutdownHook.install(new ShutdownHook.Handler() {
public void shutdown(String signal_name) {
LGLogger.log("Core: Caught signal " + signal_name);
shutdownCore();
}
});
Runtime.getRuntime().addShutdownHook(new AEThread("Shutdown Hook") {
public void runSupport() {
shutdownCore();
}
});
}
public void stop() throws AzureusCoreException {
runNonDaemon(new AERunnable() {
public void runSupport() {
LGLogger.log("Core: Stop operation starts");
stopSupport(true);
}
});
}
private void stopSupport(boolean apply_updates)
throws AzureusCoreException {
try {
this_mon.enter();
if (!running) {
throw(new AzureusCoreException("Core not running"));
}
running = false;
} finally {
this_mon.exit();
}
global_manager.stopAll();
for (int i = 0; i < lifecycle_listeners.size(); i++) {
((AzureusCoreLifecycleListener) lifecycle_listeners.get(
i)).stopped(this);
}
NonDaemonTaskRunner.waitUntilIdle();
AEDiagnostics.shutdown();
LGLogger.log("Core: Stop operation completes");
// if any installers exist then we need to closedown via the updater
if (apply_updates
&& (getPluginManager().getDefaultPluginInterface()
.getUpdateManager().getInstallers().length > 0)) {
AzureusRestarterFactory.create(this).restart(true);
}
}
//~--- get methods --------------------------------------------------

public GlobalManager getGlobalManager() throws AzureusCoreException {
if (global_manager == null) {
throw(new AzureusCoreException("Core not running"));
}
return (global_manager);
}
public IpFilterManager getIpFilterManager() throws AzureusCoreException {
return (IpFilterManagerFactory.getSingleton());
}
public LocaleUtil getLocaleUtil() {
return (LocaleUtil.getSingleton());
}
public PluginManager getPluginManager() throws AzureusCoreException {
// don't test for runnign here, the restart process calls this after
// terminating the core...
return (PluginInitializer.getDefaultInterface().getPluginManager());
}
public PluginManagerDefaults getPluginManagerDefaults()
throws AzureusCoreException {
return (PluginManager.getDefaults());
}
public static AzureusCore getSingleton() throws AzureusCoreException {
if (singleton == null) {
throw(new AzureusCoreException("core not instantiated"));
}
return (singleton);
}
public TRHost getTrackerHost() throws AzureusCoreException {
return (TRHostFactory.getSingleton());
}
public static boolean isCoreAvailable() {
return (singleton != null);
}
}
Listing D.1: AzureusCoreImpl.java (version 2.3.0.3) from the Azureus project.
D.2 RateControlledEntity
This is the listing of the interface RateControlledEntity which is used in test case E.
/*
* Created on Sep 27, 2004
* Created by Alon Rohter
* Copyright (C) 2004 Aelitis, All Rights Reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
* AELITIS, SARL au capital de 30,000 euros
* 8 Allee Lenotre, La Grille Royale, 78600 Le Mesnil le Roi, France.
*
*/
package com.aelitis.azureus.core.networkmanager.impl;
/**
* Interface designation for rate-limited entities controlled by a handler.
*/
public interface RateControlledEntity {
/**
* Uses fair round-robin scheduling of processing ops.
*/
public static final int PRIORITY_NORMAL = 0;
/**
* Guaranteed scheduling of processing ops, with preference over
* normal-priority entities.
*/
public static final int PRIORITY_HIGH = 1;
//~--- methods ------------------------------------------------------

/**
* Is ready for a processing op.
* @return true if it can process >0 bytes, false if not ready
*/
public boolean canProcess();
/**
* Attempt to do a processing operation.
* @return true if >0 bytes were processed (success), false if 0 bytes
* were processed (failure)
*/
public boolean doProcessing();
//~--- get methods --------------------------------------------------

/**
* Get this entity’s priority level.
* @return priority
*/
public int getPriority();
}
Listing D.2: RateControlledEntity.java (version 2.3.0.3) from the Azureus project.
Bibliography
[Baker and Manber, 1998] Baker, B. S. and Manber, U. (1998). Deducing similarities in Java
sources from bytecodes. In Proceedings of Usenix Annual Technical Conference, pages 179–190.
[Baxter et al., 1998] Baxter, I. D., Yahin, A., Moura, L., Anna, M. S., and Bier, L. (1998). Clone
detection using abstract syntax trees. In ICSM ’98: Proceedings of the International Conference on
Software Maintenance, pages 368–377. IEEE Computer Society, Washington, DC, USA.
[Bernstein et al., 2005] Bernstein, A., Kiefer, C., and Kaufmann, E. (2005). Simpack: A generic Java
library for similarity measures in ontologies.
[CDIF, 1994] CDIF (1994). CDIF framework for modeling and extensibility. Technical Report EIA/IS-107, Electronic Industries Association.
[Chawathe et al., 1996] Chawathe, S. S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1996).
Change detection in hierarchically structured information. In SIGMOD ’96: Proceedings of the
1996 ACM SIGMOD International Conference on Management of Data, pages 493–504, New York,
NY, USA. ACM Press.
[Dijkstra, 1959] Dijkstra, E. W. (1959). A note on two problems in connection with graphs. Numerische
Mathematik, 1(5):269–271.
[Gamma and Beck, 2003] Gamma, E. and Beck, K. (2003). Contributing to Eclipse. Addison Wesley.
[Gamma et al., 1994] Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (1994). Design patterns:
Elements of reusable object-oriented software. Addison Wesley, Massachusetts.
[Gosling et al., 1996] Gosling, J., Joy, B., and Steele, G. L. (1996). The Java language specification.
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[Holmes and Murphy, 2005] Holmes, R. and Murphy, G. C. (2005). Using structural context to
recommend source code examples. In ICSE ’05: Proceedings of the 27th International Conference
on Software Engineering, pages 117–125, New York, NY, USA. ACM Press.
[Kontogiannis, 1993] Kontogiannis, K. (1993). Program representation and behavioural matching
for localizing similar code fragments. In CASCON ’93: Proceedings of the 1993 Conference of the
Centre for Advanced Studies on Collaborative Research, pages 194–205. IBM Press.
[Lanza, 2003] Lanza, M. (2003). CodeCrawler - a lightweight software visualization tool. In VISSOFT ’03: Proceedings of the 2nd International Workshop on Visualizing Software for Understanding
and Analysis, pages 51–52.
[Michail and Notkin, 1999] Michail, A. and Notkin, D. (1999). Assessing software libraries by
browsing similar classes, functions and relationships. In ICSE ’99: Proceedings of the 21st International Conference on Software Engineering, pages 463–472, Los Alamitos, CA, USA. IEEE
Computer Society Press.
[Mishne and de Rijke, 2004] Mishne, G. and de Rijke, M. (2004). Source code retrieval using conceptual similarity.
[Myles and Collberg, 2005] Myles, G. and Collberg, C. (2005). K-gram based software birthmarks.
In SAC ’05: Proceedings of the 2005 ACM Symposium on Applied Computing, pages 314–318, New
York, NY, USA. ACM Press.
[Neamtiu et al., 2005] Neamtiu, I., Foster, J. S., and Hicks, M. (2005). Understanding source code
evolution using abstract syntax tree matching. In MSR ’05: Proceedings of the 2005 International
Workshop on Mining Software Repositories, pages 1–5, New York, NY, USA. ACM Press.
[Shamir and Tsur, 1997] Shamir, R. and Tsur, D. (1997). Faster subtree isomorphism. In ISTCS ’97:
Proceedings of the 5th Israel Symposium on the Theory of Computing Systems (ISTCS ’97), page 126,
Washington, DC, USA. IEEE Computer Society.
[Shasha et al., 2002] Shasha, D., Wang, J. T. L., and Giugno, R. (2002). Algorithmics and applications of tree and graph searching. In PODS ’02: Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 39–52, New York, NY, USA. ACM
Press.
[Shasha et al., 2004] Shasha, D., Wang, J. T. L., and Zhang, S. (2004). Unordered tree mining with
applications to phylogeny. In ICDE ’04: Proceedings of the 20th International Conference on Data
Engineering, page 708, Washington, DC, USA. IEEE Computer Society.
[Tichelaar, 1999] Tichelaar, S. (1999). FAMIX Java language plug-in 1.0.
[Tichelaar et al., 1999] Tichelaar, S., Steyaert, P., and Demeyer, S. (1999). FAMIX 2.0: The FAMOOS
information exchange model.
[Valiente, 2000] Valiente, G. (2000). Simple and efficient tree pattern matching. Technical Report
LSI-00-72-R, Technical University of Catalonia.
[Valiente, 2002] Valiente, G. (2002). Algorithms on trees and graphs. Springer-Verlag, Berlin.
[Wang et al., 2003] Wang, J. T.-L., Shan, H., Shasha, D., and Piel, W. H. (2003). TreeRank: A similarity measure for nearest neighbor searching in phylogenetic databases. In SSDBM ’03: Proceedings of the 15th International Conference on Scientific and Statistical Database Management, pages
171–180.
[Yamamoto et al., 2002] Yamamoto, T., Matsushita, M., Kamiya, T., and Inoue, K. (2002). Measuring
similarity of large software systems based on source code correspondence.
[Zhang, 1996] Zhang, K. (1996). A constrained edit distance between unordered labeled trees. Algorithmica, 15(3):205–222.
[Zhang and Jiang, 1994] Zhang, K. and Jiang, T. (1994). Some MAX SNP-hard results concerning
unordered labeled trees. Information Processing Letters, 49(5):249–254.
[Zhang et al., 1992] Zhang, K., Statman, R., and Shasha, D. (1992). On the editing distance between
unordered labeled trees. Information Processing Letters, 42(3):133–139.