TY - JOUR
T1 - A scalable learning algorithm for data-driven program analysis
AU - Cha, Sooyoung
AU - Jeong, Sehun
AU - Oh, Hakjoo
N1 - Funding Information:
This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1701-09. This research was also supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2016R1C1B2014062).
Publisher Copyright:
© 2018 Elsevier B.V.
PY - 2018/12
Y1 - 2018/12
N2 - Context: Recently data-driven program analysis has emerged as a promising approach for building cost-effective static analyzers. The ideal static analyzer should apply accurate but costly techniques only when they benefit. However, designing such a strategy for real-world programs is highly nontrivial and requires labor-intensive work. The goal of data-driven program analysis is to automate this process by learning the strategy from data through a learning algorithm. Objective: Current learning algorithms for data-driven program analysis are not scalable enough to be used with large codebases. The objective of this paper is to overcome this shortcoming and present a new algorithm that is able to efficiently learn a strategy from large codebases. Method: The key idea is to use an oracle and transform the existing blackbox learning problem into a whitebox one that is much easier to solve. The oracle quantifies the relative importance of each part of the program with respect to the analysis precision. The oracle can be obtained by running the most and least precise analyses only once over the codebase. Results: Our learning algorithm is much faster than the existing algorithms while producing high quality strategies. The evaluation is done with 140 open-source C programs, comprising of 2.1 MLoC in total. Learning at this large scale was previously impractical. Conclusion: Our work advances the state-of-the-art of data-driven program analysis by addressing the scalability issue of the existing learning algorithm. Our technique will make the data-driven approach more practical in the real-world.
AB - Context: Recently data-driven program analysis has emerged as a promising approach for building cost-effective static analyzers. The ideal static analyzer should apply accurate but costly techniques only when they benefit. However, designing such a strategy for real-world programs is highly nontrivial and requires labor-intensive work. The goal of data-driven program analysis is to automate this process by learning the strategy from data through a learning algorithm. Objective: Current learning algorithms for data-driven program analysis are not scalable enough to be used with large codebases. The objective of this paper is to overcome this shortcoming and present a new algorithm that is able to efficiently learn a strategy from large codebases. Method: The key idea is to use an oracle and transform the existing blackbox learning problem into a whitebox one that is much easier to solve. The oracle quantifies the relative importance of each part of the program with respect to the analysis precision. The oracle can be obtained by running the most and least precise analyses only once over the codebase. Results: Our learning algorithm is much faster than the existing algorithms while producing high quality strategies. The evaluation is done with 140 open-source C programs, comprising of 2.1 MLoC in total. Learning at this large scale was previously impractical. Conclusion: Our work advances the state-of-the-art of data-driven program analysis by addressing the scalability issue of the existing learning algorithm. Our technique will make the data-driven approach more practical in the real-world.
KW - Data-driven program analysis
KW - Learning algorithm
UR - http://www.scopus.com/inward/record.url?scp=85050690366&partnerID=8YFLogxK
U2 - 10.1016/j.infsof.2018.07.002
DO - 10.1016/j.infsof.2018.07.002
M3 - Article
AN - SCOPUS:85050690366
SN - 0950-5849
VL - 104
SP - 1
EP - 13
JO - Information and Software Technology
JF - Information and Software Technology
ER -