There is a rising demand for effective software tools that can help developers build reliable and maintainable software systems. There has been abundant research to help developers track bugs and verify program properties and refactor code. Recently, widely used open-source projects have been made available to the public, with not only the source code but also additional important metadata like commit logs, bug fix summaries, authorship details, and process documents. This whole collection (popularly referred to as “big code”) has spearheaded a new research direction to aid software development and maintenance, based on a data-driven approach to analyze programs and uncover common software characteristics.
The authors study the available literature on probabilistic machine learning and natural language processing (NLP) models for the code and associated metadata (big code), mostly in three areas:
- (1) Code generating models focus on modeling how a code is written, to subsequently learn a distribution and generate code to be used in various applications like code migration, pseudocode generation, code synthesis, and code completion. For this, researchers have developed language models, machine translation models, and multi-modal models using the structure of a programming language along with its correlation to metadata, for example, comments, commits, and design documents.
- (2) Representational models learn intermediate characterizations of code constructs and their relation and properties, mostly based on a distributed representation of the same in a vector space, coupled with structured predictions using sequence models. This representation helps in program analysis, feature location, code search, and data and control traceability.
- (3) Pattern mining models are used to mine resolvable patterns from source code and mostly help with code summarization, documentation generation, and bug fixing.
The authors review around 200 papers that aim to develop probabilistic models of code and use it effectively in constructing software. The major applications of these models are to enable code auto completion and migration, infer coding conventions, mine code defects, and facilitate code translation and copying.