Eduard Kuric, prof. Mária Bieliková
Transcription
Eduard Kuric, prof. Mária Bieliková
Search in Source Code Based on Identifying Popular Fragments Eduard Kuric, prof. Mária Bieliková 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 > Motivation When programmers write new code, they are often interested in nding de nitions of functions, existing, working fragments, with the same or similar functionality and reusing as much of that code as possible. They want easily understand how the functions are used and see the sequence of function invocations in order to understand how concepts are implemented. > Goals enable programmers to nd relevant functions to query terms and their usages; and see associations between concepts in the functions and query 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 > Searching for relevant functions given a query > Processing source code repository A1 2 A2 Source code repository Subdocuments Indexes Index creator Relevant documents Indexes 1 B1 Function names 3 7 5 3 B2 C Programmer Function graph creator Functional dependencies PageRank Cosine similarity 5 4 Functional dependencies 6 8 sc(Fn) = sim(dj, q) + sim(djk, q) + pr(Fn) 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 The function graph creator (B1) creates a directed graph of functional dependencies (B2). - nodes represent functions (names of functions) - a directed edge between the function G and the function H is created if the function H is invoked in the function G Retrieving the relevant documents: 1. for a query (1), a list of relevant documents (code les) is retrieved (2); 2. similarity (3) between the query q and a relevant document dj is calculated (4) using the cosine distance; 3. each retrieved relevant document is divided into subdocuments, where each one contains only one de nition of a function with surrounded comments (if any); 4. from each subdocument, terms are extracted from comments, function name and identi ers; for the terms, TF/IDF weights are calculated; and the cosine similarity (5) is calculated between each subdocument djk and the query q (6). The PageRank process (C) is run on the directed graph of functional dependencies, and it calculates a rank vector, in which every element is a score for each function in the graph. - the PageRank of a function is de ned recursively and depends on how many functions call (invoke) it - the rank value indicates an importance of a particular function Ranking the relevant functions: 1. names of de ned functions are extracted from the relevant documents; 2. for each function name Fn, a nal score sc(Fn) is calculated as a sum of: a. a similarity between the query q and the document dj, in which is the function Fn de ned (4); a similarity between the query q and the subdocument djk, in which is the function Fn de ned (6), b. a PageRank score pr(Fn) for the Fn (7)(8). The index creator (A1) creates document and term indexes (A2). - based on the vector space model - each document is modelled as a vector of terms, which occur in that document - terms are extracted from comments, names of functions and identi ers -- NLP techniques are applied 72 73 74 75 76 77 78 79 80 eMail: web: > Method evaluation - project Vilcacora for interactive browsing of multimedia content - the precision metric - the fraction of the top 10 ranked functions relevant to a query; for each query a level of con dence was assigned (completely/mostly irrelevant, mostly/highly relevant) kuric@fiit.stuba.sk www.fiit.stuba.sk/~kuric bielik@fiit.stuba.sk www.fiit.stuba.sk/~bielik Examples of the results for 3 queries (0.74 mean precision for 15 queries) query precision load, image, create, texture, map object 0.7 image, transform, ip, aspect, ratio 0.4 mask, texture, grayscale 0.8