An Improved Method for Code Cloning in Web Mining
International Journal of Latest Trends in Engineering and Technology (IJLTET), Vol. 5 Issue 2 March 2015, ISSN: 2278-621X

Ramandeep Kaur
Student, CEC Landran, Punjab

Gurdeep Kaur
Asst Prof, CEC Landran, Punjab
cecm.infotech.gurdeepkaur@gmail.com

Parminder Singh
Asst Prof, CEC Landran, Punjab

Abstract- Code cloning in web mining has been an active area for many years. Clone detection is the process of finding duplications in source code. Many techniques have been proposed to find duplicate, unwanted code, also known as software clones. In this paper, we present an improved method for detecting code clones using the k means clustering algorithm. We have applied our algorithm using a clone detection tool called Deckard and analysed it on large code bases written in Java. Our experimental results show that the tool is effective in accuracy and efficient in speed. Recognizing clones helps in designing the system for better maintenance. Cloned code is harmful for several reasons: unnecessary duplicates increase the size of the source code and its maintenance cost, and inconsistent changes to cloned code can create defects that lead to incorrect program behaviour. Existing approaches either do not scale to large code bases or are not robust against slight code modifications.

Keywords – Cloning, K-means, Cluster

I. INTRODUCTION

Code cloning is a phenomenon that usually arises in large systems. Code clones occur for several reasons, most commonly when a code fragment is copied. Because copying creates code clones, it is regarded as bad practice. During maintenance, this unwarranted code gives rise to various problems:

1. To repair an error in a cloned fragment, every clone of that fragment must be checked for the same error.
2. Compile time increases if the code clones increase the size of the code.

Methods and tools for finding code clones are therefore highly desired, especially in the software maintenance community. Researchers have proposed a great number of approaches with suitable results. Nevertheless, code clones still arise in large software systems, where they are one of the main factors that reduce maintainability. Various detection methods have been proposed to detect code duplication automatically in large-scale software. However, it is still difficult to use detected duplication to enhance maintainability, because many code duplications should persist.

A code clone is a portion of code in a source file that is similar or identical to another. Clones are a major issue in software development for many reasons, among them that the source code becomes larger and more difficult to understand. Cloning can seem a useful approach to development because it is associated with reuse and faster implementation. However, its implications for the code can be very negative [2].

Types of code clones

Type I: Identical code fragments except for variations in whitespace and comments, called Exact clones.
Type II: Syntactically or structurally identical fragments except for variations in literals, identifiers, layout, types and comments, called Renamed clones.
Type III: Copied fragments with further changes. Statements can be added, changed or removed, in addition to variations in literals, identifiers, layout, types and comments. These are called Gapped clones.
Type IV: Code fragments that perform the same functionality but are implemented by different syntactic variants, called Semantic clones.

II. TECHNIQUES OF CLONING

A. String based technique

String based techniques use basic string transformations and comparison algorithms.
This makes them independent of the programming language. Comparing calculated signatures per line is one alternative for identifying matching substrings. Line matching, which comes in two variants, is selected as the representative of this category because it uses general string manipulations.

Simple Line Matching

This is the first variant of line matching. Both detection phases are straightforward. Only small changes are applied to the source, using string manipulation operations that require little or no knowledge of possible language constructs. Typical transformations are the removal of whitespace and empty lines. During comparison, all lines are compared with each other using a string matching algorithm. This results in a large search space, which is usually reduced by using hashing buckets: before comparison, every line is hashed into one of n possible buckets, and afterwards only pairs in the same bucket are compared.

Parameterized Line Matching

This is the second variant of line matching. It detects both identical and similar code fragments. The idea is that literals and identifier names are the most likely to change when a code fragment is cloned, so they are treated as changeable parameters. Fragments that differ only in the naming of these parameters are therefore accepted as matches. To enable this parameterization, the set of transformations is extended with one that substitutes all literals and identifiers with one common identifier symbol such as "$". Because of this replacement the comparison no longer depends on the parameters, so no changes are needed to the comparison algorithm itself.

B. Token based techniques

This technique uses a more sophisticated transformation algorithm. It needs a lexer because it constructs a token stream from the source code. The availability of such tokens makes it possible to use enhanced comparison algorithms.
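
As an illustration of the ideas so far, the following Python sketch applies the two line matching variants of Section II.A and uses a very small regular expression lexer of the kind a token based technique relies on. It is a minimal sketch under stated assumptions, not the implementation of any tool discussed in this paper: the function names, the keyword subset, the token pattern and the bucket count are all hypothetical choices made for the example.

```python
import re
import zlib
from collections import defaultdict

# Hypothetical, deliberately tiny lexer: string literals, identifiers,
# numbers, or any other single non-space character.
TOKEN = re.compile(r'"[^"\n]*"|[A-Za-z_]\w*|\d+(?:\.\d+)?|\S')
KEYWORDS = {"int", "return", "if", "else", "for", "while"}  # assumed subset


def normalize(line):
    """Simple line matching transformation: remove all whitespace."""
    return "".join(line.split())


def parameterize(line):
    """Parameterized transformation: identifiers and literals become '$'."""
    out = []
    for tok in TOKEN.findall(line):
        is_word = tok[0].isalpha() or tok[0] == "_"
        is_literal = tok[0] == '"' or tok[0].isdigit()
        if (is_word and tok not in KEYWORDS) or is_literal:
            out.append("$")
        else:
            out.append(tok)
    return " ".join(out)


def matching_line_pairs(lines, transform, n_buckets=64):
    """Hash transformed lines into buckets, then compare within each bucket."""
    buckets = defaultdict(list)
    for idx, line in enumerate(lines):
        key = transform(line)
        if key:  # empty lines are dropped
            bucket = zlib.crc32(key.encode("utf-8")) % n_buckets
            buckets[bucket].append((idx, key))
    pairs = []
    for entries in buckets.values():
        for i in range(len(entries)):
            for j in range(i + 1, len(entries)):
                if entries[i][1] == entries[j][1]:
                    pairs.append((entries[i][0], entries[j][0]))
    return sorted(pairs)


if __name__ == "__main__":
    sample = [
        "int total = a + b;",
        "int  total=a+b ;",
        "int sum = x + y;",
        "return total;",
    ]
    print(matching_line_pairs(sample, normalize))
    print(matching_line_pairs(sample, parameterize))
```

With the sample lines, simple matching pairs only the two lines that differ in whitespace, while parameterized matching also pairs the line whose identifiers were renamed.
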
Parameterized matching with suffix trees acts as the representative of this category, because it transforms the source code into a token structure that is matched later on. Other approaches in this category try to eliminate much more detail by summarizing uninteresting code fragments.

Parameterized Matching With Suffix Trees

This technique consists of three consecutive steps that use a suffix tree as the internal representation. In the first step, a lexical analyser passes over the source text and transforms literals and identifiers into parameter symbols, while the typographical structure of each line is encoded in a non-parameter symbol. One symbol always refers to the same literal, identifier or structure. The first step results in a parameterized string, or p-string.

Once the p-string is obtained, a criterion is needed to decide whether two sequences in this p-string are a parameterized match. Two strings are a parameterized match if one can be changed into the other by a one-to-one mapping that renames the parameter symbols. To verify this criterion, an additional encoding prev(S) of the parameter symbols is used. In this encoding, every first occurrence of a parameter symbol is substituted by 0, and every later occurrence is substituted by the distance since the previous occurrence of the same symbol. Thus, when two sequences have the same encoding, they are the same except for a systematic renaming of the parameter symbols.

In the second step, a data structure called a parameterized suffix tree (p-suffix tree) is built for the p-string after the lexical analysis. A p-suffix tree is a generalisation of the suffix tree data structure that contains the prev() encoding of every suffix of a p-string. The use of a suffix tree allows more efficient detection of maximal parameterized matches.
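
The prev() encoding described above can be sketched in a few lines of Python. This is an illustrative sketch only: the p-string is represented as a list of symbols, and the rule that alphabetic symbols are parameters is an assumption made for the example, as are the function names.

```python
def is_parameter(symbol):
    """Assumption for this example: alphabetic symbols are parameters."""
    return symbol.isalpha()


def prev_encode(p_string, is_param=is_parameter):
    """Replace each parameter by 0 on first use, else by the distance
    to its previous occurrence; non-parameter symbols are kept as-is."""
    last_seen = {}
    encoded = []
    for pos, sym in enumerate(p_string):
        if is_param(sym):
            encoded.append(pos - last_seen[sym] if sym in last_seen else 0)
            last_seen[sym] = pos
        else:
            encoded.append(sym)
    return encoded


def is_p_match(s, t, is_param=is_parameter):
    """Two p-strings are a parameterized match if their encodings agree."""
    return len(s) == len(t) and prev_encode(s, is_param) == prev_encode(t, is_param)


if __name__ == "__main__":
    s = ["x", "=", "y", "+", "x"]
    t = ["a", "=", "b", "+", "a"]  # systematic renaming of s
    u = ["a", "=", "b", "+", "b"]  # not a consistent renaming of s
    print(prev_encode(s))    # [0, '=', 0, '+', 4]
    print(is_p_match(s, t))  # True
    print(is_p_match(s, u))  # False
```

Because distances are stored as integers and non-parameter symbols stay strings, the two kinds of entries cannot be confused when the encodings are compared.
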
The last step finds maximal paths in the p-suffix tree that are longer than a predefined character length.

C. PDG (Program dependency graph) based techniques

In this approach, the control and data flow dependencies of a function are represented by a program dependency graph, and clones are recognized as isomorphic subgraphs. The detection accuracy is very high, because the approach can detect code clones that other methods miss, such as reordered clones and semantic clones. However, because it requires complex computations, it is very difficult to apply to large software.

D. Metric based techniques

In this technique, the source code is first divided into functional units, and metrics are then computed for each unit. Units with similar metric values are reported as code clones. Metric based techniques collect a number of metrics for code fragments and then compare metric vectors rather than the code or abstract syntax tree (AST) directly. In most cases, the source code is first parsed into a control flow graph (CFG) or abstract syntax tree, on which the metrics are then calculated. Metric based approaches have also been applied to detect duplicate web pages and clones in web documents.

E. Tree based techniques

Tree based methods first transform the program into an abstract syntax tree (AST) or parse tree using a parser for the target language. Tree matching techniques are then applied to find similar subtrees, and the corresponding code segments are returned as clone pairs or clone classes. Literal values, variable names and other leaves (tokens) in the source may be abstracted in the tree representation, allowing for more advanced detection of clones.

III. LITERATURE SURVEY

Gayathri Devi et al. [2] describe a method for finding code clones using fragment distance with clustering. The source code is first tokenized. Similarity is then detected by distance and clustering until all clusters are merged.
The code fragments are then analysed and identified using the distance clusters.

Deepak Sethi et al. [1] argue that code clones, or duplicated code, are one of the major factors that deteriorate the structure and design of software. Their method can be implemented using standard parsing technology. It detects clones in arbitrary languages, then constructs and counts the clones without modifying the operation of the program. The Solid SDD tool provides a way of visualizing clone detection results that is observably different from the popular visualization using scatter plots.

Girija Gupta et al. [4] design and implement a code clone detector tool. The novel aspect of the work is its use of a metric based approach on Java source code, with the metrics calculated from Java byte code. The source code is then refactored in order to reduce code clones. Byte code gives the source code a uniform representation. It is given as input to the tool for calculating metric values, so to some extent the tool is able to find semantic clones. Moreover, byte code is platform independent, which makes this tool more effective than previously existing tools. The abstract syntax tree based and program dependence graph based approaches have some disadvantages: they take a lot of time and are complex. The proposed tool reduces the work by detecting potential clones with more ease.

IV. PROPOSED METHODOLOGY

K MEANS CLUSTERING

K means clustering is a partitioning based cluster analysis technique. According to this algorithm, we first select k data values as starting cluster centers and then calculate the distance between each data value and each cluster center.
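
As a minimal sketch of the K means procedure this section describes (choose k starting centers, assign each value to the nearest center, recompute the means, repeat), the following Python function clusters small numeric vectors. The paper does not specify the feature vectors, the distance measure, the initialisation or the stopping rule, so the choices here are assumptions for illustration only: Euclidean distance, the first k points as starting centers, and stopping when the assignments no longer change. The function name and the toy 2-D points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain K means on a list of equal-length numeric tuples."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumed initialisation: the first k points are the starting centers.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assumed stopping rule: no change
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


if __name__ == "__main__":
    # Toy vectors standing in for per-fragment features (hypothetical).
    data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1)]
    labels, centers = kmeans(data, k=2)
    print(labels)
    print(centers)
```

In a clone detection setting, points that end up in the same cluster would be treated as candidate clones; how fragments are turned into vectors is left open here because the source does not state it.
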
Each data value is then assigned to the nearest cluster, the mean of every cluster is updated, and the process is repeated until the stopping criterion is met [14]. K means clustering aims to divide the data into k clusters in which each data value belongs to the cluster with the nearest mean.

Basic K-means algorithm:

1. Choose the number of clusters, K.
2. Initialize the centers of the K clusters.
3. Assign each data point to the nearest cluster.
4. Update the position of each cluster to the mean of all data points that belong to it.
5. Repeat steps 3 and 4 until all objects are allocated to their clusters.

Figure 1. Flowchart of k means clustering (specify k as the number of clusters; select the centers of the k clusters; assign the closest cluster to each data point; update the position of each cluster; repeat until all objects are allocated)

V. EXPERIMENT AND RESULTS

For the analysis of our proposed algorithm we used a laptop with 2 GB of RAM and a 2 GHz dual core processor, running Ubuntu 12.04. We evaluated lines of code and execution time. Our results indicate that the proposed method achieves better efficiency.

Figure 2. Lines of code for different projects: (a) Liquibase, (b) Page Turner, (c) Record Breaker

Figure 2 shows the total number of cloned lines for the different projects. For Deckard, we used a variety of configuration options: minT (the minimum number of tokens required for clones) was set to 30 or 50, stride (the distance between two code segments) was set to 2, 4, 8 or 16, and similarity (how similar two points should be) was set to 0.9, 0.95 or 1.0. Figures 2(a), (b) and (c) show the cloned lines detected by Deckard.
The number of detected cloned lines, that is, the total number of lines of code reported, increases as the similarity threshold decreases.

Figure 3. Execution time (s) for different projects

The results show that our method is more efficient by using the k means clustering algorithm. Execution time is longer for Liquibase and shorter for Page Turner.

VI. CONCLUSION

In this paper, we have presented a new technique for detecting code clones. By using k means clustering we are able to find the position of clusters. Detecting code clones helps improve the quality of code. We have evaluated our tool on large code bases written in Java. The results show that the Deckard tool can find more code clones, and that we can achieve faster execution time and higher accuracy. In this paper we have focused on clone detection types and techniques. The k means clustering algorithm is simple to implement and takes less memory. We believe that our technique is useful and scalable. The tool finds a significant amount of code clones. Identification and subsequent unification of simple clones is useful in software maintenance. Our main goal is to identify the total number of lines of code and the execution time with the help of k means clustering.

REFERENCES

[1] Deepak Sethi, Manisha Sehrawat, Bharat Bhushan Naib, “Detection of Code Clone using Datasets”, International Journal of Advanced Research in Computer Science and Software Engineering, pp. 263-268, Volume 2, Issue 7, July 2012.
[2] D. Gayathri Devi, Dr. M. Punithavalli, “An Effective Software Clone Detection Using Distance Clustering”, International Journal of Engineering and Technology (IJET), pp. 232-238, Vol. 5, No. 1, Feb-Mar 2013.
[3] Marius Muja and David G. Lowe, “Scalable Nearest Neighbor Algorithms for High Dimensional Data”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2227-2240, Vol. 36, No. 11, November 2014.
[4] Girija Gupta, Indu Singh, “A Novel Approach Towards Code Clone Detection and Redesigning”, International Journal of Advanced Research in Computer Science and Software Engineering, pp. 331-338, Volume 3, Issue 9, September 2013.
[5] Chanchal K. Roy, James R. Cordy, Rainer Koschke, “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach”, Science of Computer Programming, Elsevier, pp. 470-495, 2009.
[6] Prajila Prem, “A Review on Code Clone Analysis and Code Clone Detection”, International Journal of Engineering and Innovative Technology (IJEIT), pp. 43-46, Volume 2, Issue 12, June 2013.
[7] Mohammed Abdul Bari, Dr. Shahanawaj Ahamad, “Code Cloning: The Analysis, Detection and Removal”, International Journal of Computer Applications, pp. 34-38, Volume 20, No. 7, April 2011.
[8] Doaa M. Shawky, Ahmed F. Ali, “An Approach for Assessing Similarity Metrics Used in Metric-based Clone Detection Techniques”, 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), Volume 1, pp. 580-584, 2010.
[9] G. Anil Kumar, Dr. C.R.K. Reddy, Dr. A. Govardhan, Gousiya Begum, “Code Clone Detection with Refactoring Support Through Textual Analysis”, International Journal of Computer Trends and Technology, pp. 147-150, Volume 2, Issue 2, 2011.
[10] C.K. Roy, J.R. Cordy, “Near-miss function clones in open source software: an empirical study”, Journal of Software Maintenance and Evolution: Research and Practice, 2009.
[11] Swarupa S. Bongale, Prof. K. B. Manwade, Prof. G. A. Patil, “An Efficient Data Mining Approach for Complex Clone Detection in Software”, International Journal of Advanced Research in Computer Science and Software Engineering, pp. 714-721, Volume 3, Issue 5, May 2013.
[12] S. Mythili and Dr. S. Sarala, “Detection of Recurring Clones Using Weighted Frequent Itemset Mining”, International Journal of Software Engineering and Its Applications, pp. 159-176, Vol. 8, No. 7, 2014.
[13] Chanchal K. Roy and James R. Cordy, Program Comprehension, 2008 (ICPC 2008), The 16th IEEE International Conference on, June 2008.
[14] Er. Nikhil Chaturvedi and Er. Anand Rajavat, “An Improvement in K-mean Clustering Algorithm Using Better Time and Accuracy”, International Journal of Programming Languages and Applications (IJPLA), pp. 13-19, Vol. 3, No. 4, October 2013.