Computing Reviews, the leading online review service for computing literature.

Search

Fuzzy joins using MapReduce
Afrati F., Sarma A., Menestrina D., Parameswaran A., Ullman J. ICDE 2012 (Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Washington, DC, Apr 1-5, 2012)498-509.2012.Type:Proceedings

Date Reviewed: Apr 23 2013

Similarity-based joins are used in many database-oriented applications in clustering, filtering, or data matching (entity resolution). This paper uses the MapReduce computing model to find all similar (threshold-based) pairs of elements from an input set. The first three sections establish the framework to be used in sections 4 and 5, which are dedicated to the description and evaluation of some fuzzy join algorithms using Hamming or edit distances. A short discussion on Jaccard similarity appears in section 6, while an analysis of sorted-string algorithms is given in section 7. The problem to be solved is very well stated in section 3, where fuzzy join is defined, a cost model for MapReduce algorithms is introduced, and map-reducible algorithms are characterized. Five distance-based algorithms are considered: ball-hashing (two variants, one each for Hamming and edit distances), splitting (Hamming and edit versions), Hamming code-based processing, anchor point-based procedures (Hamming and edit versions), and a subsequence algorithm (for the edit distance). The authors then consider the case of datasets with strings of varying length for subsequence and splitting algorithms based on edit distance. A naïve algorithm that “works for any type of data and any similarity function” is considered and used during comparisons. The following cost components are used during complexity analysis: map cost per element, number of keys, “communication cost of passing data from the mappers to the reducers[, and] total computation cost of all reducers” (processing cost). A complete complexity analysis is given for every algorithm, and summary tables are provided. For algorithms based on the Jaccard similarity measure, the reader is invited to follow up with a technical report available online. The paper provides theoretical details on developments, including an appendix on a lower bound for a naïve algorithm, perfect codes, and the identification of good sets of anchor points. The authors have included examples along with the analysis, as well as a relevant bibliography. For the inspired discussions on the subject, I strongly recommend this paper to readers interested in MapReduce-based algorithms.

Reviewer: G. Albeanu	Review #: CR141163 (1308-0742)

Fuzzy Set (I.5.1 ... )

Algorithm Design And Analysis (G.4 ... )

Algorithms (I.5.3 ... )

Approximation (G.1.2 )

Would you recommend this review?

yes

Other reviews under "Fuzzy Set":	Date

A supervised learning algorithm for hierarchical classification of fuzzy patterns Biswas P., Majumdar A. Information Sciences 31(2): 91-106, 1983. Type: Article	Feb 1 1985

Estimation of fuzzy memberships from histograms Devi B., Sarma V. Information Sciences 35(1): 43-59, 1985. Type: Article	Nov 1 1985

Fuzzy mathematical approach to pattern recognition Pal S. (ed), Dutta-Majumder D., Halsted Press, New York, NY, 1986. Type: Book (9789780470274637)	Jun 1 1987

more...

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud^®
Terms of Use | Privacy Policy