Eljinini sets out to provide a methodology and software implementation to harness the exponential growth of Web-based information, particularly in the medical field. Rather than suggesting a major revamp of existing Web sites to make them optimally available to search agents, he proposes first analyzing a Web site based on its purpose, and then restructuring its information so that it can become part of a future semantic Web. His contention that “determining the purpose of a website makes extraction more efficient” is central to how he processes the dynamic information presented on a Web site (what is new and frequently changing), as opposed to static information and meta-level information such as the Internet protocol (IP) address and the date of the last update.
His methodology starts with defining a list of purposes for a given Web site (for example, does the Web site offer goods or services? How are these codified?). The list of purposes is then divided into static and dynamic content. The author takes 100 diabetes-related Web sites, extracts and classifies their dynamic content, and produces a processor that looks for the salient purpose(s) a Web site serves, leaving the static information unprocessed.
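To make the pipeline concrete, the following is a minimal sketch of purpose detection over dynamic content only. It is not the author's processor: the purpose names, the keyword lists, the `min_hits` threshold, and the assumption that each content block already carries a static/dynamic flag are all hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical purpose vocabulary; the reviewed work's actual list is not reproduced here.
PURPOSE_KEYWORDS = {
    "sells_goods": {"buy", "price", "order", "cart"},
    "offers_services": {"appointment", "clinic", "consultation"},
    "provides_news": {"news", "study", "announced", "research"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def salient_purposes(blocks, min_hits=2):
    """Return purposes whose keywords occur at least min_hits times.

    blocks is a list of (text, is_dynamic) pairs; the is_dynamic flag is
    assumed to come from an earlier step. Static blocks are skipped.
    """
    hits = Counter()
    for text, is_dynamic in blocks:
        if not is_dynamic:
            continue  # static information is left unprocessed
        tokens = set(tokenize(text))
        for purpose, keywords in PURPOSE_KEYWORDS.items():
            hits[purpose] += len(tokens & keywords)
    # most_common() orders purposes from most to fewest keyword hits
    return [purpose for purpose, n in hits.most_common() if n >= min_hits]


site = [
    ("Order test strips today: price reduced, add to cart", True),
    ("Contact our clinic for an appointment or consultation", False),
    ("New study announced on glucose monitoring", True),
]
purposes = salient_purposes(site)
```

In this sketch the static block mentioning a clinic is ignored, so only the purposes supported by dynamic content are reported. Note that the matching is purely keyword based.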
Having explicit meta-level information for a Web site has three advantages: it speeds up processing of relevant information; it allows existing Web sites to eventually become part of a more formalized semantic Web; and it takes the Web site’s content and restructures it into a machine-readable format.
While this is a seemingly intriguing approach, it is reminiscent of the bag-of-words idea that many natural language processing (NLP) systems favor, which works only up to a point, and only in well-defined scenarios that require no more than minimal keyword retrieval. The basic criticism of such an approach harks back to the linguist Zellig Harris: “Language is not merely a bag of words, but a tool with particular properties.” Interpreting these properties appropriately requires a lot more than surface-level lists.
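The limitation can be illustrated with a small sketch, assuming a simple word-count representation; the function name and the two example sentences are invented for illustration. Two sentences with opposite meanings reduce to the same bag of words, because word order is discarded.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens, ignoring their order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


first = "Insulin lowers blood sugar"
second = "Blood sugar lowers insulin"

# The sentences differ, yet their bags of words are identical.
same_bag = bag_of_words(first) == bag_of_words(second)
```

Here `same_bag` is true even though the two sentences make different claims, which is the kind of property a surface-level word list cannot capture.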