Similarity-based joins are used in many database-oriented applications in clustering, filtering, or data matching (entity resolution). This paper uses the MapReduce computing model to find all similar (threshold-based) pairs of elements from an input set. The first three sections establish the framework to be used in sections 4 and 5, which are dedicated to the description and evaluation of some fuzzy join algorithms using Hamming or edit distances. A short discussion on Jaccard similarity appears in section 6, while an analysis of sorted-string algorithms is given in section 7.
The problem to be solved is very well stated in section 3, where fuzzy join is defined, a cost model for MapReduce algorithms is introduced, and map-reducible algorithms are characterized. Five distance-based algorithms are considered: ball-hashing (two variants, one each for Hamming and edit distances), splitting (Hamming and edit versions), Hamming code-based processing, anchor point-based procedures (Hamming and edit versions), and a subsequence algorithm (for the edit distance). The authors then consider the case of datasets with strings of varying length for subsequence and splitting algorithms based on edit distance. A naïve algorithm that “works for any type of data and any similarity function” is considered and used during comparisons.
The following cost components are used during complexity analysis: map cost per element, number of keys, “communication cost of passing data from the mappers to the reducers[, and] total computation cost of all reducers” (processing cost). A complete complexity analysis is given for every algorithm, and summary tables are provided. For algorithms based on the Jaccard similarity measure, the reader is invited to follow up with a technical report available online.
The paper provides theoretical details on developments, including an appendix on a lower bound for a naïve algorithm, perfect codes, and the identification of good sets of anchor points. The authors have included examples along with the analysis, as well as a relevant bibliography. For the inspired discussions on the subject, I strongly recommend this paper to readers interested in MapReduce-based algorithms.