Computing Reviews

Efficient update of indexes for dynamically changing Web documents
Lim L., Wang M., Padmanabhan S., Vitter J., Agarwal R. World Wide Web 10(1): 37-69, 2007. Type: Article
Date Reviewed: 08/13/07

The World Wide Web has transformed the way we do information retrieval. From its initial focus on relatively stable and well-structured textual collections, the discipline has grown to encompass much broader issues and to deal with much more diverse and dynamic data repositories. In addition to this broadening of perspective, the Web has forced a reexamination of some fundamental issues. One such issue is the increased demand on index maintenance, which search engines need in order to keep track of a collection of Web documents in constant flux. This paper tackles the issue comprehensively, from the perspective of incremental updates to inverted indices.

The paper presents an experimental analysis of the nature of changes in Web documents, proposes a novel index update method, and shows the advantages of the proposed method through analytical as well as empirical evaluation. The method is simple: it interposes a layer of indexed partitions (landmarks) between the documents and the inverted index, and performs localized updates based on the edit transcripts (diffs) between old and new versions of modified documents. This landmark-diff approach is motivated by the findings of the analysis of document updates, which show that most indexed documents do not change between updates, and that the changes that do occur tend to be small and localized (namely, clustered around specific areas of the documents). Evaluation shows that the landmark-diff method results in significant performance improvements compared to complete index rebuild and forward index update.
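The landmark idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class name `LandmarkIndex`, the fixed landmark size `BLOCK`, the rule for assigning edits to landmarks, and the use of `difflib.SequenceMatcher` as the diff tool are all assumptions made for illustration. Postings store a position as (document, landmark, offset within landmark), so an edit only forces re-indexing of the landmarks it touches. Later landmarks keep their postings, and their absolute positions are recovered from the landmark layer.

```python
from collections import defaultdict
from difflib import SequenceMatcher

BLOCK = 4  # hypothetical landmark size, in tokens


class LandmarkIndex:
    """Toy inverted index with a landmark layer (illustrative only)."""

    def __init__(self):
        # term -> {(doc_id, landmark_id, offset_within_landmark)}
        self.postings = defaultdict(set)
        # doc_id -> list of token lists, one per landmark
        self.blocks = {}

    def _index_block(self, doc_id, lm, tokens):
        for off, term in enumerate(tokens):
            self.postings[term].add((doc_id, lm, off))

    def _unindex_block(self, doc_id, lm, tokens):
        for off, term in enumerate(tokens):
            self.postings[term].discard((doc_id, lm, off))
            if not self.postings[term]:
                del self.postings[term]

    def add(self, doc_id, tokens):
        # Split the document into fixed-size landmarks and index each one.
        blocks = [tokens[i:i + BLOCK] for i in range(0, len(tokens), BLOCK)] or [[]]
        self.blocks[doc_id] = blocks
        for lm, blk in enumerate(blocks):
            self._index_block(doc_id, lm, blk)

    def update(self, doc_id, new_tokens):
        """Apply a diff; return how many landmarks had to be re-indexed."""
        old_blocks = self.blocks[doc_id]
        old_tokens = [t for blk in old_blocks for t in blk]
        # owner[i] is the landmark that holds old token position i.
        owner = [lm for lm, blk in enumerate(old_blocks) for _ in blk]
        new_blocks = [[] for _ in old_blocks]
        last = len(old_blocks) - 1
        matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                for i in range(i1, i2):
                    new_blocks[owner[i]].append(old_tokens[i])
            else:
                # Assumption: edited text goes to the landmark where the edit starts.
                lm = owner[i1] if i1 < len(owner) else last
                new_blocks[lm].extend(new_tokens[j1:j2])
        touched = 0
        for lm, (old, new) in enumerate(zip(old_blocks, new_blocks)):
            if old != new:  # only changed landmarks touch the postings
                self._unindex_block(doc_id, lm, old)
                self._index_block(doc_id, lm, new)
                touched += 1
        self.blocks[doc_id] = new_blocks
        return touched

    def positions(self, term, doc_id):
        # Resolve (landmark, offset) pairs to absolute positions via landmark starts.
        starts, pos = [], 0
        for blk in self.blocks[doc_id]:
            starts.append(pos)
            pos += len(blk)
        return sorted(starts[lm] + off
                      for d, lm, off in self.postings.get(term, ())
                      if d == doc_id)


idx = LandmarkIndex()
idx.add("d1", "a b c d e f g h i j k l".split())
touched = idx.update("d1", "a b c d e X f g h i j k l".split())
print(touched, idx.positions("i", "d1"))  # prints: 1 [9]
```

In this example a single inserted token causes one of three landmarks to be re-indexed. The term `i` shifts from position 8 to 9 without any change to its posting, because only the start of its landmark moves. In this sketch landmarks can grow without bound under repeated insertions, so a realistic design would need some policy for splitting or rebalancing them.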

Those with an interest in implementation issues in information retrieval and the management of large and dynamic collections of documents will find this paper well worth reading.

Reviewer:  Saturnino Luz Review #: CR134638 (0807-0703)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™