Data cleaning is an essential first step in the knowledge discovery in databases (KDD) process. Apart from the removal of noise, another critical preprocessing task is the removal of duplicate records from the databases in question. The application of interest to the authors is drug safety, although the techniques they describe have wider applicability.
Norén and coauthors use Copas and Hilton’s hit-miss model [1] for statistical record linkage within the World Health Organization’s (WHO’s) drug safety database. They note in passing that most of the parameters needed for this model are determined from the entire data set, which reduces the risk of overfitting. Moreover, they found that two extensions improved the performance of the standard hit-miss model: modeling errors in numerical record fields, and a computationally efficient treatment of correlated record fields.
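To make the idea concrete, the following is a minimal sketch of hit-miss-style scoring, not the authors’ implementation: under the model, a field in a duplicate record either copies the true value (a “hit”) or is re-recorded at random from the field’s marginal distribution (a “miss”), so each field contributes a log likelihood ratio of match versus chance agreement. The field names, miss probabilities, and value frequencies below are illustrative assumptions.

```python
import math

# Illustrative per-field probability that a value is a "miss"
# (re-recorded at random rather than copied from the true case).
MISS_PROB = {"sex": 0.05, "country": 0.02, "drug": 0.10}

def field_log_ratio(field, a, b, value_freq, miss_prob):
    """Log likelihood ratio for one field under a simplified hit-miss model.

    Conditioning on record a's value: if the two records describe the
    same case, record b shows the same value unless a miss occurred
    (in which case it is drawn from the field's marginal distribution);
    if the records are unrelated, b's value is simply a marginal draw.
    """
    f_b = value_freq[field].get(b, 1e-6)        # marginal frequency of b's value
    p_miss = miss_prob[field]
    same = 1.0 if a == b else 0.0
    p_if_match = (1 - p_miss) * same + p_miss * f_b
    p_if_random = f_b
    return math.log(p_if_match / p_if_random)

def match_score(rec_a, rec_b, value_freq, miss_prob=MISS_PROB):
    """Total log likelihood ratio; higher scores favor 'duplicate'."""
    return sum(
        field_log_ratio(k, rec_a[k], rec_b[k], value_freq, miss_prob)
        for k in rec_a
    )

# Value frequencies would, as the review notes, be estimated from the
# entire database; here they are made up for illustration.
value_freq = {
    "sex": {"F": 0.6, "M": 0.4},
    "country": {"SE": 0.1, "US": 0.3},
    "drug": {"X": 0.01, "Y": 0.2},
}

a = {"sex": "F", "country": "SE", "drug": "X"}
b = dict(a)                                    # identical record
c = {"sex": "M", "country": "US", "drug": "Y"} # unrelated record
```

Note how agreement on a rare value (the drug field, frequency 0.01) contributes far more evidence than agreement on a common one, which is the behavior that makes likelihood-ratio linkage effective.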
A total of 38 groups of duplicate records had previously been identified manually in the WHO drug safety database. The authors applied their modified hit-miss model retrospectively to this database. This led, first, to identifying the most likely duplicates for a given record (with 94.7 percent accuracy) and, second, to discriminating duplicates from random matches (with 63 percent recall and 71 percent precision). In short, they claim to detect a “significant proportion of duplicates without generating many false leads.” The authors plan a future prospective study, using their modified hit-miss model to flag suspected duplicates in an unlabeled data subset and following up the results with a manual review.
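For readers less familiar with the reported metrics, a brief hedged example of how recall and precision are computed over candidate duplicate pairs (the pair sets here are invented, not the authors’ data):

```python
def precision_recall(predicted, actual):
    """Precision and recall for a set of predicted duplicate pairs.

    precision = fraction of predicted pairs that are true duplicates
    recall    = fraction of true duplicate pairs that were predicted
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return precision, recall

# Hypothetical record-ID pairs flagged by a linkage model versus
# the manually identified ground truth.
predicted_pairs = {(1, 2), (3, 4), (5, 6), (7, 8)}
actual_pairs = {(1, 2), (3, 4), (5, 6), (9, 10), (11, 12)}

p, r = precision_recall(predicted_pairs, actual_pairs)
```

Here three of the four flagged pairs are genuine (precision 0.75), but only three of the five true duplicate groups were found (recall 0.6), illustrating the same trade-off the authors report between catching duplicates and avoiding false leads.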
This paper will appeal to researchers with an interest in KDD, especially in preprocessing in general, and in duplicate record elimination in particular.