Computing Reviews
Today's Issue Hot Topics Search Browse Recommended My Account Log In
Review Help
Search
Fuzzy joins using MapReduce
Afrati F., Sarma A., Menestrina D., Parameswaran A., Ullman J.  ICDE 2012 (Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Washington, DC, Apr 1-5, 2012)498-509.2012.Type:Proceedings
Date Reviewed: Apr 23 2013

Similarity-based joins are used in many database-oriented applications in clustering, filtering, or data matching (entity resolution). This paper uses the MapReduce computing model to find all similar (threshold-based) pairs of elements from an input set. The first three sections establish the framework to be used in sections 4 and 5, which are dedicated to the description and evaluation of some fuzzy join algorithms using Hamming or edit distances. A short discussion on Jaccard similarity appears in section 6, while an analysis of sorted-string algorithms is given in section 7.

The problem to be solved is very well stated in section 3, where fuzzy join is defined, a cost model for MapReduce algorithms is introduced, and map-reducible algorithms are characterized. Five distance-based algorithms are considered: ball-hashing (two variants, one each for Hamming and edit distances), splitting (Hamming and edit versions), Hamming code-based processing, anchor point-based procedures (Hamming and edit versions), and a subsequence algorithm (for the edit distance). The authors then consider the case of datasets with strings of varying length for subsequence and splitting algorithms based on edit distance. A naïve algorithm that “works for any type of data and any similarity function” is considered and used during comparisons.

The following cost components are used during complexity analysis: map cost per element, number of keys, “communication cost of passing data from the mappers to the reducers[, and] total computation cost of all reducers” (processing cost). A complete complexity analysis is given for every algorithm, and summary tables are provided. For algorithms based on the Jaccard similarity measure, the reader is invited to follow up with a technical report available online.

The paper provides theoretical details on developments, including an appendix on a lower bound for a naïve algorithm, perfect codes, and the identification of good sets of anchor points. The authors have included examples along with the analysis, as well as a relevant bibliography. For the inspired discussions on the subject, I strongly recommend this paper to readers interested in MapReduce-based algorithms.

Reviewer:  G. Albeanu Review #: CR141163 (1308-0742)
Bookmark and Share
  Featured Reviewer  
 
Fuzzy Set (I.5.1 ... )
 
 
Algorithm Design And Analysis (G.4 ... )
 
 
Algorithms (I.5.3 ... )
 
 
Approximation (G.1.2 )
 
Would you recommend this review?
yes
no
Other reviews under "Fuzzy Set": Date
A supervised learning algorithm for hierarchical classification of fuzzy patterns
Biswas P., Majumdar A. Information Sciences 31(2): 91-106, 1983. Type: Article
Feb 1 1985
Estimation of fuzzy memberships from histograms
Devi B., Sarma V. Information Sciences 35(1): 43-59, 1985. Type: Article
Nov 1 1985
Fuzzy mathematical approach to pattern recognition
Pal S. (ed), Dutta-Majumder D., Halsted Press, New York, NY, 1986. Type: Book (9789780470274637)
Jun 1 1987
more...

E-Mail This Printer-Friendly
Send Your Comments
Contact Us
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®
Terms of Use
| Privacy Policy