Improving Syntactical Clone Detection Methods through the Use of an Intermediate Representation
Abstract
Detection of type-3 and type-4 clones remains a difficult task. Current methods are complex, both conceptually and computationally, and using them requires substantial implementation effort. Instead of creating yet another method, it might be more productive to combine the simplicity of syntactic approaches with the abstractions granted by intermediate representations (IRs). To this end, we devised a C-like IR based on LLVM and ran NiCad on it (LLNiCad). To establish whether the clone detection capabilities of syntactic approaches can be improved through an IR, we compared NiCad and LLNiCad on three open source projects taken from Krutz's benchmark and a subset of Google Code Jam (GCJ) solutions. In our results, LLNiCad consistently outperforms NiCad in F1-score. Indeed, for all clone types in Krutz's benchmark, LLNiCad has an F1-score that is 37% higher than NiCad's, with both better precision and recall. For type-4 clones in our GCJ benchmark, the F1-score of LLNiCad also exceeds that of CCCD (a semantic clone detector) by 44%. These findings suggest that IRs are beneficial for improving clone detection and that they have a larger impact on type-3 and type-4 clones.
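The comparison above rests on precision, recall, and F1-score. As a minimal sketch of how these could be computed for a clone detector, assuming its output and the benchmark's ground truth are both available as lists of unordered clone pairs (the function name `prf1` and the fragment identifiers below are hypothetical and not taken from the paper):

```python
def prf1(detected, reference):
    """Return (precision, recall, f1) for two collections of clone pairs."""
    # Treat pairs as unordered, so (a, b) and (b, a) are the same clone pair.
    det = {frozenset(pair) for pair in detected}
    ref = {frozenset(pair) for pair in reference}

    true_positives = len(det & ref)
    precision = true_positives / len(det) if det else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    # F1 is the harmonic mean of precision and recall (0 if nothing matches).
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data for illustration only; these are not results from the paper.
reference = [("f1", "f2"), ("f3", "f4"), ("f5", "f6"), ("f7", "f8")]
detected = [("f2", "f1"), ("f3", "f4"), ("f9", "f10")]

p, r, f = prf1(detected, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because F1 is a harmonic mean, a detector can only raise it substantially by improving precision and recall together, which is consistent with the abstract reporting gains in both.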
Permanent link
https://doi.org/10.3929/ethz-b-000396812
Publication status
published
Book title
2020 IEEE 14th International Workshop on Software Clones (IWSC)
Publisher
IEEE
Organisational unit
09590 - Kapur, Manu / Kapur, Manu
Notes
Conference lecture on February 18, 2020.