Lee, Choy, and Cho describe a system for recovering document structure from document presentation. Given an optically scanned journal paper, or a similar document, their system builds a structured Standard Generalized Markup Language/Extensible Markup Language (SGML/XML) form of the text.
The structure is derived by first building a “functional structure tree” from the scanned representation, then repeatedly traversing this tree in order to refine and merge components until a “logical tree structure” is obtained. This final tree is then converted to an XML tree representation. Only one document type definition (DTD) is used for the final representation.
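To make the refine-and-merge idea concrete, here is a minimal sketch of repeated tree traversal followed by XML conversion. It is not the authors' algorithm: the `Node` class, the labels, and the merge rule (joining adjacent leaf siblings that share a label) are all hypothetical, chosen only to illustrate the general shape of such a pipeline.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a label, optional text, and child nodes."""
    label: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def merge_pass(node: Node) -> bool:
    """One traversal: merge adjacent leaf siblings with the same label.

    Returns True if anything changed. The merge rule is an assumption
    made for illustration, not the rule used in the reviewed paper.
    """
    changed = False
    merged: list[Node] = []
    for child in node.children:
        if merge_pass(child):
            changed = True
        prev = merged[-1] if merged else None
        if (prev is not None and prev.label == child.label
                and not prev.children and not child.children):
            prev.text = f"{prev.text} {child.text}".strip()
            changed = True
        else:
            merged.append(child)
    node.children = merged
    return changed


def refine(root: Node) -> Node:
    """Repeat traversals until a pass makes no further change."""
    while merge_pass(root):
        pass
    return root


def to_xml(node: Node) -> ET.Element:
    """Convert the refined tree to an XML element tree."""
    elem = ET.Element(node.label)
    if node.text:
        elem.text = node.text
    for child in node.children:
        elem.append(to_xml(child))
    return elem


# Two scanned text lines, both labeled "paragraph", become one element.
doc = Node("article", children=[
    Node("title", "Recovering structure"),
    Node("paragraph", "First line"),
    Node("paragraph", "second line"),
])
xml_text = ET.tostring(to_xml(refine(doc)), encoding="unicode")
print(xml_text)
```

A real system would presumably apply several different refinement rules, each keyed to layout and content features; the loop-until-stable structure is the only part this sketch is meant to convey.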
The paper reports on early experiments with the system, involving the scanning of some 26 technical journal papers. An accuracy rate of nearly 99 percent is claimed, which compares favorably with other, similar reported work.
The process is restricted to a limited range of material, and document components like tables and figures are ignored. The remaining one percent of errors arises largely from misclassified document components, such as figure captions and equations. The authors point to future work that will involve a larger range of material, and it will be interesting to see how the method extends to more complex and varied document structures.
I did experience one frustration in reading the paper: nearly all of the figures appear several pages after they are referenced in the text.