Existing generative models for multilabel learning require that the number of topics be fixed in advance. This paper proposes a Bayesian nonparametric model that removes this requirement by allowing the number of topics to be inferred from the data.
To simplify the exposition, the authors use the author-document-word setting as a running example. In this application, the authors are the labels, the documents are the instances, and the words of the documents are the features. The model uses a three-layer hierarchy in which the author level is modeled by a gamma process.
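The paper's exact construction is not reproduced in this review, but to give a flavor of the gamma-process prior at the author level, the following is a minimal sketch of the standard finite truncation of a gamma process; the parameter names `gamma0` (mass) and `c` (scale), the truncation level `K`, and the 10-word vocabulary are illustrative assumptions, not the authors' notation.

```python
import numpy as np

def gamma_process_truncation(gamma0=1.0, c=1.0, K=50, vocab_size=10, rng=None):
    """Finite approximation to a draw G ~ GaP(gamma0, c).

    Each of the K atoms gets a weight r_k ~ Gamma(gamma0/K, scale=1/c);
    as K grows, the total mass converges in distribution to
    Gamma(gamma0, scale=1/c), matching the gamma process.
    Atoms here are topic distributions over a toy vocabulary.
    """
    rng = np.random.default_rng(rng)
    weights = rng.gamma(gamma0 / K, 1.0 / c, size=K)       # atom weights
    atoms = rng.dirichlet(np.ones(vocab_size), size=K)     # topic-word distributions
    return weights, atoms
```

Because almost all of the weights in such a draw are negligibly small, only a data-driven number of topics ends up with appreciable mass, which is how this family of priors avoids fixing the topic count in advance.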
The details are nicely described with the aid of several diagrams that show the connections in the model. The algorithm, a Gibbs sampler that infers the model's parameters, is presented in detail, and the authors give a theorem that justifies its use. Experimental results are given for three applications: the author-topic model; clinical free-text modeling, where the labels are the codes used for billing purposes; and protein classification, where the model could be used to predict the categories to which a protein belongs.
In each case, the authors’ model (MGNBP) is compared with other state-of-the-art models. The metrics used for comparison are One-error, Coverage, Ranking loss, and Average precision.
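For readers unfamiliar with these four multilabel ranking metrics, the following sketch shows their standard definitions, computed from a matrix of per-label scores and a binary relevance matrix; the function names and the 0-based Coverage convention are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def one_error(scores, labels):
    """Fraction of instances whose single top-ranked label is not relevant."""
    top = np.argmax(scores, axis=1)
    return float(np.mean([labels[i, top[i]] == 0 for i in range(len(scores))]))

def coverage(scores, labels):
    """Average depth (0-based rank) needed to cover every relevant label."""
    order = np.argsort(-scores, axis=1)            # best-scored label first
    depths = []
    for i in range(len(scores)):
        ranks = np.where(labels[i, order[i]] == 1)[0]
        depths.append(ranks.max())
    return float(np.mean(depths))

def ranking_loss(scores, labels):
    """Fraction of (relevant, irrelevant) label pairs that are mis-ordered."""
    losses = []
    for s, y in zip(scores, labels):
        rel, irr = s[y == 1], s[y == 0]
        bad = np.sum(rel[:, None] <= irr[None, :])  # relevant not ranked above
        losses.append(bad / (len(rel) * len(irr)))
    return float(np.mean(losses))

def average_precision(scores, labels):
    """Mean, over relevant labels, of precision at that label's rank."""
    aps = []
    for s, y in zip(scores, labels):
        y_sorted = y[np.argsort(-s)]
        rel_ranks = np.where(y_sorted == 1)[0] + 1  # 1-based ranks of hits
        prec = np.arange(1, len(rel_ranks) + 1) / rel_ranks
        aps.append(prec.mean())
    return float(np.mean(aps))
```

Lower values are better for the first three metrics, while higher Average precision is better, which is worth keeping in mind when reading the paper's comparison tables.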
In each case, MGNBP is generally comparable to the other systems and superior on some of the metrics. For example, in the clinical application, MGNBP outperforms the others on One-error and Ranking loss.