Analyzing code similarity is essential for detecting clones, duplicated code, plagiarism, and software copyright violations. Over the years, researchers have proposed various approaches to detecting code similarity, for example, token-based, graph-based, and metrics-based techniques. There have been some attempts to compare the many code similarity analyzers available; however, those comparisons were carried out on different datasets, which makes their results hard to compare.
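To make the token-based family concrete, here is a minimal sketch (not any of the paper's actual tools): a crude lexer splits two fragments into tokens, and a Jaccard-style coefficient over the token multisets serves as the similarity score. The regular expression and the example fragments are assumptions of this sketch.

```python
import re
from collections import Counter

def tokens(code: str) -> Counter:
    """Crude lexer: split code into identifier, number, and operator tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def similarity(a: str, b: str) -> float:
    """Jaccard coefficient over token multisets (1.0 = identical token streams)."""
    ta, tb = tokens(a), tokens(b)
    inter = sum((ta & tb).values())   # multiset intersection size
    union = sum((ta | tb).values())   # multiset union size
    return inter / union if union else 1.0

original = "int sum = 0; for (int i = 0; i < n; i++) sum += i;"
renamed  = "int acc = 0; for (int j = 0; j < n; j++) acc += j;"
print(similarity(original, original))  # 1.0
print(similarity(original, renamed))   # high, but below 1.0: renaming changes tokens
```

A real token-based detector would also normalize identifiers and literals before comparison, which is precisely why the modifications studied in the paper matter.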
The authors of this paper propose a framework to compare 30 code similarity analyzers on common datasets of Java source code, and carry out five major experimental scenarios. Three of the scenarios evaluate the analyzers' performance under (1) pervasive modifications (global transformations throughout the file) [1], (2) boilerplate code modifications (local transformations within a function or block) [2], and (3) combined pervasive and boilerplate modifications. In the fourth scenario, the authors measure performance on normalized representations of the code, created using compilation or decompilation. As an example of the normalizing effect of compilation, all source-level control structures (while, for) are converted into the same bytecode constructs (if, goto). In the fifth scenario, the authors compare the analyzers using the weighted mean of each analyzer's precision and recall.
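The normalizing effect of compilation can be illustrated with a Python analogue (an assumption of this sketch; the paper works on Java bytecode, not CPython bytecode): the compiler erases layout, comments, and even local variable names, so two superficially different sources of the same routine yield identical bytecode.

```python
# Two versions of the same routine: different layout, comments, and local names.
src_a = (
    "def f(n):\n"
    "    total = 0\n"
    "    for i in range(n):\n"
    "        total += i\n"
    "    return total\n"
)
src_b = (
    "def f(n):  # boilerplate-style edits\n"
    "    acc=0\n"
    "    for k in range(n):  acc+=k\n"
    "    return acc\n"
)

def compiled_body(src: str) -> bytes:
    """Compile the module and return the raw bytecode of the function body."""
    module = compile(src, "<snippet>", "exec")
    # The function's code object is stored among the module's constants.
    return next(c for c in module.co_consts if hasattr(c, "co_code")).co_code

print(compiled_body(src_a) == compiled_body(src_b))  # True: compilation normalizes
```

Local variable names do not appear in the bytecode itself (they live in a separate name table), so a pervasive renaming is erased by this normalization, just as the paper argues for Java.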
The authors propose a five-step framework for comparing the code similarity analyzers. First, Java source code is collected to build the datasets. Second, pervasive and boilerplate modifications are applied using source-level and bytecode-level obfuscation. Third, both the original and the modified source code are normalized using pretty printing [3] and decompilation. Fourth, the similarity analyzers are run over the normalized code. Finally, the true positives, true negatives, false positives, and false negatives are analyzed for each analyzer and dataset, and a similarity score is computed.
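The scoring in the final step can be sketched as follows. The F-measure below is the standard harmonic mean of precision and recall, one common "weighted mean" of the two; the paper's exact weighting and the confusion counts used here are assumptions of this sketch.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Standard retrieval metrics from a detector's confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts for one analyzer on one dataset.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# precision=0.800 recall=0.889 f1=0.842
```

Ranking analyzers by such a combined score, rather than by precision or recall alone, avoids rewarding detectors that trivially flag everything (perfect recall) or almost nothing (perfect precision).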
The authors set up the experimental framework with 259 pieces of Java source code, apply 100 modifications, and report results for every experimental scenario. The similarity detectors CCFinderX [4] and jplag-text [5] perform best on pervasively modified code and on boilerplate modifications, respectively. Based on these experimental results, the authors confirm that normalization through compilation/decompilation can improve the similarity detection process.