Data cleaning is an essential first step in the knowledge discovery in databases (KDD) process. Apart from the removal of noise, another critical preprocessing task is the removal of duplicate records from the databases in question. The application of interest to the authors is drug safety, although the techniques they describe have wider applicability.
Norén and coauthors use Copas and Hilton’s hit-miss model [1] for statistical record linkage within the World Health Organization’s (WHO’s) drug safety database. They note in passing that most of the parameters needed for this model are determined from the entire data set, which reduces the risk of overfitting. Moreover, they found that two extensions improved the performance of the standard hit-miss model: modeling errors in numerical record fields, and a computationally efficient treatment of correlated record fields.
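To make the idea concrete, the following is a minimal sketch of hit-miss-style scoring, not the authors’ implementation: under the model, a field in a duplicate record either copies the true value (a “hit”) or is re-recorded at random from the field’s marginal distribution (a “miss”), so each field contributes a log likelihood ratio of match versus chance agreement. The field names, miss probabilities, and value frequencies below are illustrative assumptions.

```python
import math

# Illustrative per-field probability that a value is a "miss"
# (re-recorded at random rather than copied from the true case).
MISS_PROB = {"sex": 0.05, "country": 0.02, "drug": 0.10}

def field_log_ratio(field, a, b, value_freq, miss_prob):
    """Log likelihood ratio for one field under a simplified hit-miss model.

    Conditioning on record a's value: if the two records describe the
    same case, record b shows the same value unless a miss occurred
    (in which case it is drawn from the field's marginal distribution);
    if the records are unrelated, b's value is simply a marginal draw.
    """
    f_b = value_freq[field].get(b, 1e-6)        # marginal frequency of b's value
    p_miss = miss_prob[field]
    same = 1.0 if a == b else 0.0
    p_if_match = (1 - p_miss) * same + p_miss * f_b
    p_if_random = f_b
    return math.log(p_if_match / p_if_random)

def match_score(rec_a, rec_b, value_freq, miss_prob=MISS_PROB):
    """Total log likelihood ratio; higher scores favor 'duplicate'."""
    return sum(
        field_log_ratio(k, rec_a[k], rec_b[k], value_freq, miss_prob)
        for k in rec_a
    )

# Value frequencies would, as the review notes, be estimated from the
# entire database; here they are made up for illustration.
value_freq = {
    "sex": {"F": 0.6, "M": 0.4},
    "country": {"SE": 0.1, "US": 0.3},
    "drug": {"X": 0.01, "Y": 0.2},
}

a = {"sex": "F", "country": "SE", "drug": "X"}
b = dict(a)                                    # identical record
c = {"sex": "M", "country": "US", "drug": "Y"} # unrelated record
```

Note how agreement on a rare value (the drug field, frequency 0.01) contributes far more evidence than agreement on a common one, which is the behavior that makes likelihood-ratio linkage effective.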
A total of 38 groups of duplicate records had previously been identified manually in the WHO drug safety database. The authors applied their modified hit-miss model retrospectively to this database. This led, first, to identifying the most likely duplicates for a given record (with 94.7 percent accuracy) and, second, to discriminating duplicates from random matches (with 63 percent recall and 71 percent precision). In short, they claim to detect a “significant proportion of duplicates without generating many false leads.” The authors plan a future prospective study, using their modified hit-miss model to flag suspected duplicates in an unlabeled data subset and following up the results with a manual review.
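For readers less familiar with the reported metrics, a brief hedged example of how recall and precision are computed over candidate duplicate pairs (the pair sets here are invented, not the authors’ data):

```python
def precision_recall(predicted, actual):
    """Precision and recall for a set of predicted duplicate pairs.

    precision = fraction of predicted pairs that are true duplicates
    recall    = fraction of true duplicate pairs that were predicted
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return precision, recall

# Hypothetical record-ID pairs flagged by a linkage model versus
# the manually identified ground truth.
predicted_pairs = {(1, 2), (3, 4), (5, 6), (7, 8)}
actual_pairs = {(1, 2), (3, 4), (5, 6), (9, 10), (11, 12)}

p, r = precision_recall(predicted_pairs, actual_pairs)
```

Here three of the four flagged pairs are genuine (precision 0.75), but only three of the five true duplicate groups were found (recall 0.6), illustrating the same trade-off the authors report between catching duplicates and avoiding false leads.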
This paper will appeal to researchers with an interest in KDD, especially in preprocessing in general, and in duplicate record elimination in particular.