This paper presents a remarkable improvement on the use of singular value decomposition (SVD) for the latent semantic indexing of a large corpus of documents. SVD is already almost too good to be true, and perhaps even an example of Arthur C. Clarke’s Third Law: “Any sufficiently advanced technology is indistinguishable from magic.” Importing some magical Fourier technology from differential equations, in the form of Fourier domain scoring (FDS), makes it yet more mysterious, and more effective as well.
The authors propose applying their new technology to search and retrieval on the World Wide Web (WWW), which now contains more than a billion documents. It has been said that the problem with the WWW is that too many Unix gurus were involved in its development, and not enough librarians. Today’s best search engines can scarcely find one-third of these documents, even if you have a pretty good idea of what you are searching for. Latent semantic indexing, plus this novel FDS document ranking system, will be of great assistance if you know what the document is about, even if you do not know its title, author, or provenance. It is now possible to meet the challenge the White Knight gave Alice: “you didn’t ask for the song, you asked for the name of the song.”
The problem the authors set out to solve with vector space models is that, once documents are converted into document vectors, the position of the terms, which represents the flow of the document, is lost, and thus spatial information is no longer available to the searcher. FDS is able to retain document spatial information, and use it to rank documents. The difference between FDS and other vector space similarity measures (for example, cosine of the angle between two document vectors) is that, rather than storing only the frequency count of a term per document, FDS stores a term signal, which tells the searcher how the term is spread throughout the document. This information is provided to the searcher by computing and comparing the magnitude and phase of the spectrum across the term signals for different documents.
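To make the idea of a term signal concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the choice of eight bins, the plain discrete Fourier transform, and the simple phase agreement measure are all assumptions made for illustration. It shows two toy documents that have identical term counts, and so look the same to a frequency-only model, but whose term signals differ in phase.

```python
import cmath
import math


def term_signal(tokens, term, bins=8):
    """Count occurrences of `term` in each of `bins` equal-width slices of the document."""
    signal = [0] * bins
    for position, token in enumerate(tokens):
        if token == term:
            signal[position * bins // len(tokens)] += 1
    return signal


def spectrum(signal):
    """Discrete Fourier transform of a term signal, as a list of complex components."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def phase_agreement(spectra, k):
    """Hypothetical measure: length of the mean unit phasor at component k.

    Close to 1.0 when the query terms share a phase (they occur in similar
    places in the document), close to 0.0 when their phases cancel.
    """
    phasors = [s[k] / abs(s[k]) for s in spectra if abs(s[k]) > 1e-12]
    return abs(sum(phasors)) / len(spectra) if phasors else 0.0


# Two toy documents with the same term counts but different term positions.
doc_near = "fourier scoring a b c d e f".split()
doc_far = "fourier a b c scoring d e f".split()
query = ["fourier", "scoring"]

spectra_near = [spectrum(term_signal(doc_near, term)) for term in query]
spectra_far = [spectrum(term_signal(doc_far, term)) for term in query]

# Magnitude and phase of the first non-constant component for each query term.
for name, spectra in (("near", spectra_near), ("far", spectra_far)):
    parts = [(round(abs(s[1]), 3), round(cmath.phase(s[1]), 3)) for s in spectra]
    print(name, parts, round(phase_agreement(spectra, 1), 3))
```

In this toy case the magnitudes are the same for both documents, so only the phase distinguishes the document in which the query terms sit side by side from the one in which they are far apart. How the paper combines magnitude and phase into a final ranking score is not reproduced here.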
The paper is well written, gives the mathematical basis for the method in sufficient detail, and presents the results of two experiments on large document databases. This is a very nice piece of work.