Computing Reviews
A survey of machine learning for big code and naturalness
Allamanis M., Barr E., Devanbu P., Sutton C. ACM Computing Surveys 51(4): 1-37, 2018. Type: Article
Date Reviewed: Nov 4 2021

There is a rising demand for effective software tools that can help developers build reliable and maintainable software systems. Abundant research already helps developers track bugs, verify program properties, and refactor code. Recently, widely used open-source projects have become publicly available with not only the source code but also important metadata such as commit logs, bug fix summaries, authorship details, and process documents. This whole collection (popularly referred to as "big code") has spearheaded a new research direction to aid software development and maintenance, based on a data-driven approach to analyzing programs and uncovering common software characteristics.

The authors study the available literature on probabilistic machine learning and natural language processing (NLP) models for the code and associated metadata (big code), mostly in three areas:

(1) Code generating models focus on modeling how code is written, in order to learn a distribution over programs and generate code for applications such as code migration, pseudocode generation, code synthesis, and code completion. For this, researchers have developed language models, machine translation models, and multimodal models that exploit the structure of a programming language along with its correlation to metadata, for example, comments, commits, and design documents.
(2) Representational models learn intermediate characterizations of code constructs and their relations and properties, mostly via a distributed representation in a vector space, coupled with structured prediction using sequence models. Such representations help with program analysis, feature location, code search, and data and control traceability.
(3) Pattern mining models are used to mine resolvable patterns from source code and mostly help with code summarization, documentation generation, and bug fixing.
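To make the code-generating idea above concrete, here is a minimal sketch (not from the survey) of a bigram language model over code tokens, the simplest instance of the n-gram models the survey covers. The tiny token corpus is entirely hypothetical; real systems train on millions of files.

```python
from collections import Counter, defaultdict

# Toy corpus of tokenized code snippets (hypothetical, for illustration only).
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "x", "in", "items", ":"],
    ["if", "x", "in", "seen", ":"],
]

# Count bigrams: how often token b directly follows token a.
bigrams = defaultdict(Counter)
for tokens in corpus:
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a][b] += 1

def complete(prev_token):
    """Suggest the most frequent next token after prev_token, or None."""
    counts = bigrams.get(prev_token)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(complete("("))  # the only token observed after "(" is "n"
```

A code-completion engine built this way simply ranks candidate next tokens by conditional frequency; the surveyed work extends this with smoothing, caches, and syntax-aware models.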

The authors review around 200 papers that aim to develop probabilistic models of code and use them effectively in constructing software. The major applications of these models are to enable code autocompletion and migration, infer coding conventions, mine code defects, and facilitate code translation and copying.
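As a toy illustration of the representational-model idea the review mentions, the sketch below embeds snippets as bag-of-tokens frequency vectors and ranks them by cosine similarity for code search. All snippet names and contents are made-up assumptions; surveyed systems use learned, distributed embeddings rather than raw token counts.

```python
import math
from collections import Counter

# Hypothetical snippet index; identifiers are invented for the example.
snippets = {
    "sum_list": "total = 0\nfor x in xs:\n    total += x",
    "open_file": "with open(path) as f:\n    data = f.read()",
    "max_list": "best = xs[0]\nfor x in xs:\n    if x > best:\n        best = x",
}

def vectorize(code):
    """Represent code as a bag-of-tokens frequency vector."""
    return Counter(code.replace("\n", " ").split())

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def search(query):
    """Return the name of the indexed snippet most similar to the query."""
    q = vectorize(query)
    return max(snippets, key=lambda name: cosine(q, vectorize(snippets[name])))

print(search("with open(path)"))  # matches the file-reading snippet
```

Even this crude vector-space view shows why such representations support code search and feature location: similar code lands near similar queries.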

Reviewer:  Partha Pratim Das Review #: CR147381
Learning (I.2.6 )
Document Management (I.7.1 ... )
Artificial Intelligence (I.2 )
Software Engineering (D.2 )
Other reviews under "Learning": Date
Learning in parallel networks: simulating learning in a probabilistic system
Hinton G. (ed) BYTE 10(4): 265-273, 1985. Type: Article
Nov 1 1985
Macro-operators: a weak method for learning
Korf R. Artificial Intelligence 26(1): 35-77, 1985. Type: Article
Feb 1 1986
Inferring (mal) rules from pupils’ protocols
Sleeman D. Progress in artificial intelligence (Orsay, France), 1985. Type: Proceedings
Dec 1 1985

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2023 ThinkLoud®