Computing Reviews

Recursive hashing functions for n-grams
Cohen J. ACM Transactions on Information Systems15(3):291-320,1997.Type:Article
Date Reviewed: 11/01/97

An n-gram is a sequence of n consecutive symbols taken from a stream of symbols, such as a document. In several contexts, such as indexing, retrieval, and comparison, it is necessary to hash-code every such n-gram. A good hash function should map the n-grams fairly evenly onto the address space. Such functions are computation-intensive.

The computation is eased if the hash value of one n-gram can be used to find the hash value of the next; after all, two consecutive n-grams overlap in n - 1 characters. Cohen collects such recursive hashing functions, generalizes the framework, improves some of the methods, proposes new ones, and compares them with respect to an even spread and the computation time. This includes extensive performance evaluations with texts of half a million characters, n between 3 and 10, and address space sizes of 8K, 32K, and 128K.

The paper is clearly written and provides 54 references to hashing and list applications.

Reviewer:  F. Gebhardt Review #: CR121106 (9711-0931)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™
Terms of Use
| Privacy Policy