“You shall know a word by the company it keeps” is perhaps the most famous quotation attributed to J. R. Firth. In the search for ways to automate natural language understanding (NLU), statistical natural language processing prevailed in the field for many decades. It was founded on the frequentist or empiricist traditions of British (corpus) linguistics, led by Firth, Michael A. K. Halliday, and John Sinclair. Contemporary computational linguistics represents natural language as calculated frequencies of co-occurring terms and collocations within a metric space.
It was not long before linguist Zellig Harris introduced the distributional hypothesis; converging with the frequentist tradition attributed to Firth and his contemporaries, it has since dominated computational linguistics. Harris believed that linguistic analysis should be understood in terms of the statistical distribution of words (that is, of components in a corpus), with language conceived as a system of many levels in which items at each level are combined according to local constraints. This does not necessarily exclude semantics.
In this context, the paper is an excellent contribution to the world of statistical natural language processing (NLP), including its goal to create meaningful summaries of text documents such as those found in news coverage and analysis. The paper is very well written. It presents a new text summarization algorithm, ELSA, that combines latent semantic analysis (LSA) and frequent itemset mining in databases.
From a computational linguistics point of view, the main idea is to consider sets of terms that co-occur within a sentence instead of single terms. The sentences that contain the most significant concepts from a ranked list are then selected as document summarizers. Given that ELSA works with existing sentences within a document, it appears to be transferable to any natural language that has a sentence-based structure.
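To make the sentence-level idea concrete, here is a minimal sketch in Python. It is not the ELSA algorithm from the paper: the LSA step is omitted entirely and replaced by a plain support-weighted score, and the function names (`tokenize`, `frequent_term_sets`, `summarize`), the tiny stopword list, and the default thresholds are all assumptions made for illustration. It only shows how term sets that co-occur within sentences can be mined and then used to rank existing sentences for extraction.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "for"}


def tokenize(sentence):
    """Return the set of lowercase, non-stopword terms in a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}


def frequent_term_sets(sentences, size=2, min_support=2):
    """Mine term sets of the given size that co-occur within a sentence.

    Each sentence is treated as one transaction; a term set is kept when it
    appears in at least `min_support` sentences.
    """
    counts = Counter()
    for sentence in sentences:
        terms = sorted(tokenize(sentence))  # sorted so each set has one key
        counts.update(combinations(terms, size))
    return {ts: c for ts, c in counts.items() if c >= min_support}


def summarize(sentences, k=1, size=2, min_support=2):
    """Pick the k sentences covering the most frequent term sets.

    A sentence's score is the summed support of the frequent term sets it
    contains (a stand-in for the paper's LSA-based ranking). Selected
    sentences are returned in their original document order.
    """
    frequent = frequent_term_sets(sentences, size, min_support)
    scored = []
    for idx, sentence in enumerate(sentences):
        terms = tokenize(sentence)
        score = sum(c for ts, c in frequent.items() if set(ts) <= terms)
        scored.append((score, idx))
    # Highest score first; ties are broken by position in the document.
    top = sorted(scored, key=lambda p: (-p[0], p[1]))[:k]
    return [sentences[i] for _, i in sorted(top, key=lambda p: p[1])]


if __name__ == "__main__":
    doc = [
        "The storm closed the airport on Monday.",
        "Flights resumed after the storm closed the airport.",
        "Local schools stayed open.",
    ]
    print(summarize(doc, k=1))
```

Because the sketch only counts and selects sentences that already exist in the document, nothing in it depends on English beyond the tokenizer and the stopword list, which is consistent with the observation above about transferability to other sentence-based languages.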
Apart from its clear theoretical and practical merits, the paper also benefits from excellent writing; for example, it includes a discussion of the algorithm’s complexity, an aspect too often neglected nowadays. The experimental design and results are robust and refer to existing multilingual collections of documents and to the competitions that take place in this context.
The paper is therefore strongly recommended for researchers looking at document summarization. It is also recommended to readers who aspire to write high-quality research papers of their own.